Hermes
A Beowulf Cluster running Red Hat Linux
This page contains information about a Beowulf cluster being constructed at
the Regional Weather Information Center (RWIC) at
the University of North Dakota.
History
The Scientific Computing Center, part of the
UND Aerospace complex, has been home to
three 'big-iron' supercomputer systems since its inception. Cray Research
X-MP, Y-MP, and J-90 systems have all called the SCC machine room home at one time
or another. However, like all other computer systems, these became obsolete and
too costly to maintain. They have all since been decommissioned. This has left
a sizable hole in the high-performance computing capabilities on the UND campus,
and with no funding to foot the multi-million dollar bill for a new Cray system,
an alternative was needed.
UND is not alone in a search for a newer, cheaper alternative to 'big-iron'
supercomputing systems. Luckily, advances in microcomputer and networking
technology have made it possible (with some clever programming) to simulate
the inner workings of a supercomputer with a group of smaller computers.
These clusters of computers are hooked to their own private network system,
come reasonably close to the performance of commercial supercomputers, and
cost orders of magnitude less to purchase. The main drawbacks are that they take
longer to assemble and may require more time and effort to
administer.
In June of 2002, RWIC personnel constructed a 4-node cluster
from obsolete PC and network hardware as a learning exercise. In May of 2003, a
second, 2-node cluster was also built to gain experience
with Red Hat Linux.
Hermes
The RWIC Beowulf cluster will be known as 'hermes'. Hermes' primary mission will
be the running of atmospheric models in support of the UND Atmospheric
Sciences department. The cluster will follow the basic Beowulf design,
with one master node, a number of slave nodes, and a fast network switch
connecting them all together.
The master node will be based on a dual AMD Athlon MP motherboard, and
incorporate a RAID disk controller and approximately 1TB of hard disk storage.
A 100bTX ethernet NIC will connect the master to the outside network,
and a Myrinet NIC will handle cluster communications.
Each slave node will also have a dual MP CPU motherboard, a smaller, single
hard disk, and a Myrinet NIC.
All nodes will be mounted in rack-mount server cases and placed in a standard
19" 41U mounting rack. Inter-node communication will be handled by a Myrinet
2Gb/s fiber-optic network switch, also mounted in the rack. Finally, an
uninterruptible power supply will provide clean, filtered power with
backup capability to all the nodes and the switch. A keyboard, monitor,
and mouse will be attached to the master node for monitoring and
maintenance.
Initial configuration
Hermes will be built in several steps, mostly due to budgetary constraints.
The core system will allow for later expansion, while providing sufficient
computational power to run the models.
Hermes Core: 8 Athlon MP CPUs, 1 TB RAID, Myrinet 2Gbps backbone
- 1 Myrinet switch, 5-slot rack-mount
chassis with one 8-port card and one monitor card.
- 1 Master node with dual Athlon MP CPUs, 1GB of RAM, 3Ware IDE RAID controller, 6 180GB hard drives,
Myrinet NIC, 100bTX NIC, 8U RAID rack-mount chassis.
- 3 Slave nodes with dual Athlon MP CPUs, 1GB of RAM, 60GB hard drive, Myrinet NIC, 4U rack-mount chassis.
- Monitor, keyboard, mouse for Master node.
- Red Hat Linux OS.
- Portland Group Cluster Development Kit.
- Rack-mount UPS power supply.
The system will be assembled in one 41U 19" mounting rack in the RWIC machine room.
Status: May 13, 2003
The core system hardware has been assembled and installed in the rack. Installation of
the operating system is waiting on cables for attaching the console to the master node.
Status: May 22, 2003
Hermes is operational. Details of the Myrinet and Linux configuration will be added at
a later time. The cluster has the PGI compilers and Myricom's custom MPICH installed on
it, and runs parallel programs. The cluster has not yet been tuned to
maximize performance.
I have run the Pallas benchmark on the cluster, with interesting results. For comparison,
here are the results from a Cray T3E running the same number of processes.
Status: June 24, 2003
After getting the PGI compilers working again (when the demo license ran out, all objects
compiled with them stopped working, including the MPICH libraries on Hermes), I ran my
primitive benchmark program on Hermes, using 1 to 17 processes.
Here are the results. They confirm that the cluster
is most efficient when running with the number of processes equal to the number of
CPUs. Not surprising, but comforting to see things working like we think they should.
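A toy model helps show why the peak falls where it does. This is only a sketch, not the benchmark program itself: it assumes a fixed amount of work split evenly across the processes, no communication cost, and that processes beyond the CPU count have to wait their turn in rounds. The function name and the figure of 8 CPUs (four dual-CPU nodes in the core system) are illustrative assumptions.

```python
import math


def ideal_speedup(processes: int, cpus: int) -> float:
    """Speedup over a single process for a fixed workload split evenly.

    Each process does 1/processes of the work. With more processes than
    CPUs, the processes run in ceil(processes / cpus) rounds.
    """
    rounds = math.ceil(processes / cpus)
    return processes / rounds


if __name__ == "__main__":
    # 1 to 17 processes, as in the benchmark run; 8 CPUs is an assumption.
    for n in range(1, 18):
        print(f"{n:2d} processes: speedup {ideal_speedup(n, 8):.2f}")
```

In this model the speedup first peaks when the process count equals the CPU count and drops sharply just past it. Real runs also pay for context switches and communication, which the model ignores.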
Status: July 30, 2003
One of Hermes' compute nodes went belly-up. The operating system had to be reinstalled from
scratch. It's thought a disk problem is responsible, but it hasn't been traced. For the
moment, the node seems to be working properly again. The crash allowed some fine-tuning
of the build and maintenance document for Hermes, and provided experience in taking one
node out of the cluster and adjusting the cluster to run without it.
It was also discovered that the Myrinet NIC in the master node was only running at a bus
speed of 33MHz. Since the 64-bit PCI bus should run at 66MHz, this was a puzzle. We
finally discovered that the 3Ware Escalade 7500 IDE RAID controller was only a 33MHz card,
despite being of the 64-bit form factor. Some investigation revealed that the card was
pulling the 64-bit PCI bus down to 33MHz, including the Myrinet NIC. The 3Ware board
was moved to a 32-bit slot, where it could run at the slower speed without bothering
any other peripherals. The Myrinet NIC is now the only card on the 64-bit PCI bus, and
is running at 66MHz. Rerunning the Pallas benchmark revealed a noticeable increase in
bandwidth. See results here.
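The arithmetic behind this is worth spelling out. The sketch below computes the theoretical peak of a parallel PCI bus as bus width times clock rate, using the nominal 33.33MHz and 66.66MHz PCI clocks, and compares it with the 2Gb/s Myrinet link rate. These are upper bounds only; real throughput is lower because of bus overhead, and the function name is just for illustration.

```python
def pci_peak_mb_per_s(bus_bits: int, clock_mhz: float) -> float:
    """Theoretical peak of a parallel PCI bus: one transfer per clock."""
    return bus_bits / 8 * clock_mhz


# The Myrinet link runs at 2 Gb/s, which is 250 MB/s.
MYRINET_LINK_MB_PER_S = 2000 / 8

if __name__ == "__main__":
    for bits, mhz in [(32, 33.33), (64, 33.33), (64, 66.66)]:
        peak = pci_peak_mb_per_s(bits, mhz)
        print(f"{bits}-bit PCI at {mhz} MHz: {peak:.0f} MB/s peak, "
              f"Myrinet link {MYRINET_LINK_MB_PER_S:.0f} MB/s")
```

At 33MHz the 64-bit bus peaks at roughly 267 MB/s, barely above the link rate, so the bus itself becomes the limit once overhead is counted. At 66MHz the peak is about twice the link rate.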
Status: March 23, 2004
Hermes continues to evolve. Plans to expand the cluster ran into problems with finding
funding. To trim costs, the Myrinet LAN was swapped out in favor of gigabit
ethernet. The network hardware to set up a 4 node cluster based on gigabit ethernet was about
$200, while a 4 node Myrinet system was about $10,000. Initial testing has shown no
appreciable slow-down in most applications.
Other work has included adjusting the number of rsh jobs that can spawn in quick succession,
and an update to the PGI CDK toolkit. A new run of the Pallas benchmark
shows the decrease in bandwidth, while the Pi calculator shows no decrease in MIPS. This
bears out what was expected: the impact on cluster speed of switching to the slower
interconnect varies greatly with how much the application depends on bandwidth.
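A rough way to reason about this is to treat run time as compute time plus data moved divided by link bandwidth. The sketch below does that with the nominal link rates (250 MB/s for 2Gb/s Myrinet, 125 MB/s for gigabit ethernet). It ignores latency and overlap of communication with computation, and the workload figures in the example are invented purely for illustration, not measurements from Hermes.

```python
def run_time_s(compute_s: float, megabytes_sent: float,
               link_mb_per_s: float) -> float:
    """Compute time plus transfer time, with no overlap between them."""
    return compute_s + megabytes_sent / link_mb_per_s


def slowdown(compute_s: float, megabytes_sent: float,
             fast_mb_per_s: float = 250.0,
             slow_mb_per_s: float = 125.0) -> float:
    """Ratio of run time on the slower link to run time on the faster one."""
    slow = run_time_s(compute_s, megabytes_sent, slow_mb_per_s)
    fast = run_time_s(compute_s, megabytes_sent, fast_mb_per_s)
    return slow / fast


if __name__ == "__main__":
    # Hypothetical workloads: (compute seconds, megabytes exchanged).
    for compute_s, mb in [(100.0, 0.0), (100.0, 25000.0), (0.0, 250.0)]:
        print(f"compute {compute_s:5.0f} s, {mb:7.0f} MB: "
              f"slowdown x{slowdown(compute_s, mb):.2f}")
```

A compute-bound job like the Pi calculator sits near a slowdown of 1, while a job that does nothing but move data approaches the full ratio of the two link speeds.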
Status: April 16, 2004, "Regress for Success"
After some marathon debugging sessions and a lot of searching the web, some stack and
heap size issues with MM5 were addressed. Adjustments were made to the /etc/csh.cshrc
files on all nodes to bump these to unlimited. This at least partially solved a
stack-smashing problem with the model. Cutting down the vertical resolution of
MM5 also helped, but caused a different failure. The program would hang on some
of the slave nodes and time out, causing broken pipes. After a lot of heartache and
mucking around, NFS error messages were found in the log files on these nodes, buried
among the authorization messages each MPI run generates. Further investigation revealed
that while normal pings between nodes worked fine, increasing the ping packet size to
8k or so would lead to a 30% packet loss rate between the slaves and master. This was
traced to the fact that the Netgear GA311 gigabit ethernet NICs weren't supported
by Fedora, but instead were running under a different driver for a similar chipset.
This 8k packet size problem was borne out by runs of the Pallas benchmark, and complicated
by the fact that NFS uses this packet size to increase performance on the cluster.
To test the theory, the GA311's were replaced with cheap fast ethernet (100bTX) NICs.
While the overall bandwidth of the cluster dropped, the huge latency spikes around
8k packet sizes vanished in a subsequent run of Pallas.
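The large-packet ping test described above is easy to script so it can be repeated on every node. The sketch below is one possible version, not the procedure actually used on Hermes. It assumes the standard Linux ping command with its -c (count) and -s (payload size) options, and the helper names are made up for this example.

```python
import re
import subprocess


def parse_loss_percent(ping_output: str) -> float:
    """Extract the packet-loss percentage from ping's summary line."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def ping_loss(host: str, payload_bytes: int, count: int = 20) -> float:
    """Ping a host with the given payload size and return the loss in percent."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-s", str(payload_bytes), host],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero when packets are lost
    )
    return parse_loss_percent(result.stdout)
```

For example, ping_loss("master", 8192) would report the loss for 8k payloads, where "master" stands in for a real node name. Comparing that figure with the result for the default payload size shows whether the problem is tied to packet size.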
More testing will be done with
this configuration until either proper drivers or alternate gigabit NICs are secured
and installed.
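The stack and heap limit change made at the start of this entry can be verified on each node. The exact lines added to /etc/csh.cshrc are not recorded here. In csh the usual mechanism is the limit builtin, for instance "limit stacksize unlimited" and "limit datasize unlimited", but treat those as an assumption. The sketch below only reads the limits a process actually inherits, using Python's standard resource module; the helper names are illustrative.

```python
import resource


def describe(limit: int) -> str:
    """Render a limit in bytes as kB, or as 'unlimited'."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} kB"


def stack_and_data_limits() -> dict:
    """Return the soft limits for stack and data segment sizes."""
    stack_soft, _ = resource.getrlimit(resource.RLIMIT_STACK)
    data_soft, _ = resource.getrlimit(resource.RLIMIT_DATA)
    return {"stacksize": describe(stack_soft), "datasize": describe(data_soft)}


if __name__ == "__main__":
    for name, value in stack_and_data_limits().items():
        print(f"{name}: {value}")
```

Running it through the same remote shell the MPI jobs use on each node shows whether those processes really inherit the raised limits.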
Status: April 22, 2004
Yet another set of gigabit NICs was installed today in Hermes. The D-Link DGE-550T adapters
are actually supported by Linux, solving a major problem.
Yet another run of Pallas was done, showing better numbers
for the most part. One possibly bad LAN cable caused a node to drop off the network
during one run, and may still be causing problems for that node. Current plans are to
replace the cable and retest.
Status: April 26, 2004
The DGE-550T adapters were reporting transmit errors on the slaves. Once a NIC started doing
this, it usually lost contact with the rest of the cluster, causing more heartache. A call
to D-Link tech support revealed that they don't do support for Linux, just provide the latest
driver source on their website. This driver would build but not load. Very helpful.
It was then discovered that only the slave nodes were having this problem, not the master.
The only difference between the two was that the master NIC was running at 33MHz (the
3Ware IDE RAID pulls the 64-bit PCI bus down to that speed), while the slaves ran at their
default 66MHz. Moving the 550Ts to 32-bit slots cured the transmit errors, albeit at a
bandwidth penalty. Another run of Pallas revealed
this.
Status: December 3, 2004
Hermes has been running operationally for several months now. The problem with self-destructing
filesystems on the slave nodes was traced to a problem with the on-board IDE controller and
the Linux kernel. Switching to Fedora Core 1 solved this. It was also found that the 3Ware
IDE RAID controller had a problem with the old version of Linux as well. It too seems happy
with Fedora on the master node, so the entire cluster is running under that OS. We also
discovered that the D-Link 1005G gigabit ethernet switch that forms the backbone of the
cluster LAN cannot support jumbo frames, while the D-Link DGE-550T NICs can. Jumbo frames
can help speed NFS operations up considerably, but we decided not to upgrade the network
switch at this time, since it is not believed that NFS is the bottleneck limiting
performance. Several users of the cluster have also developed an "It's running now,
please don't break it by trying to make it better" frame of mind.
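To see why jumbo frames matter for NFS, it helps to count how many IP fragments one 8k NFS transfer turns into. The sketch below assumes NFS over UDP on IPv4 with an 8192-byte payload, ignores the RPC and NFS headers, and uses 9000 bytes as the jumbo MTU. That value is a common choice but an assumption here, not a setting taken from Hermes.

```python
import math

IP_HEADER_BYTES = 20   # IPv4 header without options
UDP_HEADER_BYTES = 8


def ip_fragments(payload_bytes: int, mtu: int) -> int:
    """Number of IPv4 fragments needed to carry one UDP datagram."""
    datagram = payload_bytes + UDP_HEADER_BYTES
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER_BYTES) // 8 * 8
    return math.ceil(datagram / per_fragment)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {ip_fragments(8192, mtu)} fragment(s) per 8k transfer")
```

With a standard 1500-byte MTU each 8k transfer is split into six fragments, and losing any one of them loses the whole datagram. With a 9000-byte MTU it fits in a single frame.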
The original plan was to expand Hermes as we located funding sources to do so. However, the
amount of time and money spent developing the cluster was greater than we bargained for,
and it's been decided that the best way to improve upon Hermes is to buy a new cluster
from a systems integrator who not only builds clusters, but also runs the models we
use as burn-in programs on them. This new system will be based on AMD Opteron CPUs,
adding the ability to run 64-bit programs as opposed to Hermes' 32-bit only limitation.
It will also give us one vendor to pester when problems arise, rather than chasing after
a whole bunch of different hardware and software companies as had to be done with Hermes.
The Myrinet hardware from Hermes will be used to build the new cluster. Hermes will remain
as it's configured with gigabit ethernet for the foreseeable future. Hermes will continue
to be used for some of the less-intense parallel processing duties, while the heavy-lifting
model runs will be shifted to the new cluster when it's installed.
While it might seem like cheating (or wimping out) to buy a new cluster, one has to weigh
the losses in time spent debugging the system in both man-hours and lost productivity
against the cost of paying a company that sells a system that's already designed, debugged,
and tuned. At the moment anyway, the commercial Beowulf provider seems less of a gamble.
Status: September 18, 2006
Hermes is once again operational. The RAID array lost one drive about six months ago, but
continued to run in degraded mode. Down-time to fix the array didn't arrive until this
past summer, but by then a second drive had started to develop glitches, and the controller
would not allow an array rebuild with two bad drives. It was decided to take the system
down and build a new disk array, as well as upgrade the OS to Fedora Core 3. The new array
consists of 4 320GB drives in a RAID 5 array for data storage, and two 250GB drives in a
mirrored array for system and home directories. Luckily the old array was still functional
enough to preserve the /etc, /usr/local, /usr/cluster_share, and /home directories. After
installing the new OS these directories were recovered, a few configuration files
tweaked and the system returned to operation. All slave nodes were also upgraded to
Core 3, and a faulty power supply in slave #3 was replaced.
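For reference, the usable capacity of the new arrays follows directly from the RAID levels: RAID 5 gives up one drive's worth of space to parity, and a two-drive mirror stores one drive's worth of data. The sketch below just does that arithmetic. The function names are illustrative, and the figures are raw drive sizes before filesystem overhead.

```python
def raid5_usable_gb(drives: int, drive_gb: int) -> int:
    """RAID 5 spends one drive's worth of capacity on parity."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drives - 1) * drive_gb


def raid1_usable_gb(drive_gb: int) -> int:
    """A two-drive mirror holds one drive's worth of data."""
    return drive_gb


if __name__ == "__main__":
    print(f"Data array:   {raid5_usable_gb(4, 320)} GB usable")
    print(f"System array: {raid1_usable_gb(250)} GB usable")
```

That works out to 960GB for data and 250GB for the system and home directories.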