You have heard all about this incredible thing called "Beowulf supercomputing" and want to build your own DebianBeowulf cluster. Where do you start?
First, consider your end-goal. What is the purpose of this cluster? What kinds of applications do you plan to run on it? Here are some typical applications of Beowulf clusters:
- Transport phenomena, including fluid dynamics, heat and mass transfer, multi-phase flows, aerodynamics, etc.
- Multi-million-atom molecular dynamics, and protein folding
- Cellular automata to model phenomena from epidemiology to options trading
- Graphics: distributed raytracing and rendering
- Hard NP problems such as DNA sequence alignment (bioinformatics)
Note that some applications do not scale to large numbers of processors as well as others. Also, some large problems benefit enormously from shared memory, for example (non-multipole) boundary element and electron density functional theory calculations really need all processors to have access to all the data. For these applications, linear (or near-linear) scaling in processors really requires a big shared memory machine such as the SGI Altix line, which is the largest Linux-native shared-memory architecture in the Intel world. IBM offers similar machines in their PowerPC architectures as well as SUN with Sparc.
Parallelizing computation can be done in many ways; it comes down to which one your application uses, or which technology you want to build around. As of August 2010, there are four major approaches to parallelization:
1. Some use symmetric multiprocessing (SMP).
From the relevant Wikipedia article:
In computing, symmetric multiprocessing or SMP involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance. Most common multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors. Processors may be interconnected using buses, crossbar switches or on-chip mesh networks.
The parallelization comes from writing your applications to use threads or processes, which perform multiple tasks concurrently.
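For example, on a single SMP box Python's standard library can spread independent tasks across cores with a process pool (a minimal sketch; heavy_task is just a stand-in for real work):

```python
from multiprocessing import Pool

def heavy_task(n):
    # Stand-in for a real computation: sum of squares below n.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Pool() starts one worker process per core by default;
    # map() hands each input to whichever worker is free.
    with Pool() as pool:
        results = pool.map(heavy_task, [10_000, 20_000, 30_000, 40_000])
    print(results)
```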
2. Some use MessagePassingInterface (MPI).
From the relevant Wikipedia article:
MPI is a language-independent communications protocol used to program parallel computers. Both point-to-point and collective communication are supported. MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation". MPI's goals are high performance, scalability, and portability.
In short, your applications use the MPI libraries, which in turn pass commands and data among different nodes in your cluster and perform your computations from a single filesystem, common among all the nodes of the cluster.
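Real MPI programs are usually written in C, C++ or Fortran (or in Python via the mpi4py bindings). As a standard-library-only illustration of the point-to-point, scatter/gather model (this sketch is not actual MPI; real MPI would use MPI_Send/MPI_Recv or MPI_Scatter/MPI_Reduce):

```python
from multiprocessing import Process, Pipe

def worker(conn, rank):
    # Each "rank" receives its chunk of data, computes, and sends the result back.
    chunk = conn.recv()
    conn.send((rank, sum(chunk)))
    conn.close()

if __name__ == "__main__":
    data = list(range(100))
    conns, procs = [], []
    for rank in range(4):                   # 4 workers, like "mpirun -np 4"
        parent, child = Pipe()
        p = Process(target=worker, args=(child, rank))
        p.start()
        parent.send(data[rank::4])          # scatter: every 4th element
        conns.append(parent)
        procs.append(p)
    # Gather the partial sums and reduce them on the "master".
    total = sum(conn.recv()[1] for conn in conns)
    for p in procs:
        p.join()
    print(total)
```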
3. Another way is ParallelVirtualMachine (PVM):
From the relevant Wikipedia article:
PVM is a software system that enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource, or a "parallel virtual machine". The individual computers may be shared- or local-memory multiprocessors, vector supercomputers, specialized graphics engines, or scalar workstations and PCs, that may be interconnected by a variety of networks, such as Ethernet or FDDI.
There is also a Debian wiki entry on PVM: ParallelVirtualMachine. In short, PVM is an alternative to MPI.
4. One last way is to parallelize jobs using a job queueing system, such as OpenPBS (homepage), LoadLeveler (homepage), the Sun/Oracle Grid Engine (homepage) or Platform LSF (homepage). On today's multi-core, multi-threaded systems, queueing multiple single-threaded programs to perform your tasks can substantially decrease the time needed to complete the overall job, since independent jobs run concurrently on free cores.
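Scaled down to a single machine, the queueing idea looks like this: a fixed pool of workers pulls independent jobs off a queue until it is empty. A toy stand-in for a real batch scheduler (the scheduler function here is illustrative, not part of any of the systems above):

```python
import queue
import threading

def scheduler(jobs, workers=4):
    """Dispatch independent (job_id, fn, arg) jobs to a fixed worker pool,
    like a tiny in-process batch queue."""
    q = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job_id, fn, arg = q.get_nowait()
            except queue.Empty:
                return                      # queue drained: worker exits
            out = fn(arg)
            with lock:
                results[job_id] = out

    for job in jobs:                        # enqueue everything up front
        q.put(job)
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    jobs = [(i, lambda n: n * n, i) for i in range(8)]
    print(scheduler(jobs))                  # maps each job id to its square
```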
You can install or create applications using any of these four approaches, and they can happily run side-by-side on your Beowulf cluster.
There are two ways to do SMP. In both, the operating system presents all the hardware as a single system image.
One way is hardware SMP. Examples are multi-core systems, such as the Sparc architecture or the Intel i7 architecture, multiple CPU systems, such as the SGI Origin2, or multi-CPU, multi-core systems, such as the IBM Power6 architecture.
The other way is via software SMP emulation. Taking advantage of 100Mbit/1Gbit/10Gbit networks, software SMP automatically routes all of the inter-process communication either locally if processes are on the same machine, or across the network if not. It also allows dynamic load balancing by migrating processes from machine to machine.
Major examples of software SMP include Mosix and Kerrighed.
Mosix is a kernel patch and set of utilities which make a cluster of machines appear as a single multiprocessor machine with shared memory (though large applications requiring shared memory will run slowly because it's really distributed). Unfortunately, Mosix is a bit harder to set up than the others, as you have to apply a special patch to the kernel sources and then build it. The Mosix license changed in 2002 and a fork called OpenMosix started.
Up to Debian Sarge, there used to be an OpenMosix kernel patch package which you could install. The package was removed in early 2006, and the OpenMosix project was closed down on March 1, 2008. As of August 2010, the source is still available on SourceForge. Combined with the Debian kernel building system, you can (relatively) easily patch the kernel sources and build your own OpenMosix kernel. Should you wish to do so, install the kernel-package package and read the relevant documentation.
If you want to try OpenMosix from a live CD, i.e. without changing your existing system, look at clusterknoppix or quantian. Those live CDs also include a terminal server, so you can get an OpenMosix cluster using only one CD-ROM drive and zero hard disks, and the diskless clients can be used with KDE or kept without XFree86. ClusterKnoppix has not been updated since August 28, 2004; the Quantian project has not been updated since February 2, 2006.
Another example is Kerrighed.
Kerrighed (homepage) is an actively maintained project which provides the same functionality as OpenMosix. It is licensed under the GPLv2. The downside is that the code is not portable across kernel versions, as it has to be compiled against a specific version of the kernel. As of August 2010, you can download the code and compile it against the Linux 2.6.20 kernel (which dates from around February 2007). There is no Debian package yet (TODO).
For applications involving the solution of partial differential equations (e.g. fluid flow, heat and mass transfer, combustion, etc.), the PETSc suite from Argonne National Labs provides a great start. Prepackaged libraries of the toolkit are available in Debian.
Next comes cluster design, for which you need to choose the hardware and the networking. If you choose multiple-processor SMP machines, the CPU and RAM costs will generally stay the same, assuming the SMP board has room for the RAM you want. CPU power increases according to Moore's Law, and for a constant disk/CPU ratio that cost will fall accordingly. For DIY builds, the motherboard may cost less or more; e.g. the first dual-Athlon boards cost as much as an entire single-Athlon system. SMP boxes also need less space and fewer network connections per CPU, and therefore fewer switches, and are easier/cheaper to install and administer. So if the percentage slowdown due to sharing memory between the processors is less than the percentage decrease in cost, it makes sense to use SMP.
On networking, the first consideration is what limits the performance of your application. If you're doing a lot of dense linear algebra, then the CPU can chug through it extremely fast, so memory bandwidth tends to limit performance on a single-CPU system, and networking on a cluster. If you need bulk transfers of data or streaming, networking might also be your limiting factor. But most sparse matrix-based applications, such as fluid dynamics and molecular dynamics, tend to be less communication-intensive, and some schemes require very little communication at all.
If you will be network-limited, this will probably drive you toward bigger individual boxes (dual or quad processor systems), and might justify the additional expense of fast networking such as gigabit Ethernet or Myrinet, or even a NUMA system such as the Altix mentioned above. It's worth noting that increasing processor speed and memory bandwidth at ever-lower cost is driving more applications toward being network-limited on 100 Mbit Ethernet. However, with consumer-grade gigabit Ethernet ever more affordable in 2010, this factor is mildly to altogether mitigated, depending on the quality of your 1 Gbit switch. With the advent of 10 Gbit and 100 Gbit networking, this bottleneck will be mitigated further.
For administrative reasons, it is often easiest to make your cluster run diskless, that is, with the root filesystem and applications all on one machine. The advantage is that you only need to install and configure software in one place.
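On the head node, the netboot side of a diskless setup boils down to an NFS export plus a DHCP/PXE fragment. A minimal sketch; the subnet, paths and filenames below are assumptions you will need to adapt:

```
# /etc/exports -- export the shared root read-only to the nodes
/srv/nfsroot    192.168.1.0/24(ro,no_root_squash,no_subtree_check)
/home           192.168.1.0/24(rw,no_root_squash,no_subtree_check)

# /etc/dhcp/dhcpd.conf fragment -- point PXE clients at the TFTP server
subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.100 192.168.1.200;
    next-server 192.168.1.1;        # head node running the TFTP daemon
    filename "pxelinux.0";          # PXE bootloader served over TFTP
}
```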
Software Generally Installed on a Beowulf Cluster
- NFS/CIFS/AFS/other network/distributed/shared filesystem, to share the data/tools you are going to be working on, among the cluster.
- DHCPd to assign IP addresses to your cluster nodes
- TFTPd to netboot your cluster (installing 100 machines by hand gets old by the 3rd installation)
- NTP: very important! Forget it and you *will* have issues with NFS
- OpenSSH, for remote logins: handy if you have multiple users, or if the fan noise annoys you
Helpful Tools To Maintain Your Sanity
Here are a few tools/tips which may make your life as a Beowulf administrator easier:
Update Cluster, provided by the update-cluster package, maintains a centralized list of host names, IP addresses and MAC addresses with tools for using these to maintain machine lists for other software.
Dancer's Shell, or dsh, runs a single process on every machine in a cluster. Just put the names of the machines in your cluster in /etc/dsh/machines.list (which can be managed by update-cluster), then e.g. "dsh -a w" gives the output of w for all of the machines in that list.
If you are not running diskless, package synchronization is made a lot easier by "dpkg --get-selections > file" to build a list on one machine, then "dpkg --set-selections < file && apt-get upgrade" to install (and remove) accordingly on the others. Check the dpkg man page for details.
FAI (http://wiki.debian.org/FAI) is a tool for fully automatic installation of Debian, and has been used for setting up many clusters.
- atftpd: tftpd-hpa is good enough if you have a couple of machines to boot, or if you boot them one at a time. If you try to boot 10 machines at once, though, tftpd-hpa blocks while serving requests in FIFO order; the clients' BOOTP requests time out and you are left frustrated.
- alien, for the pesky "redhat only!" software.
- avahi-daemon, if you do not intend to keep a full-time dns server.
- fail2ban, if your cluster is going to be full-time facing the internets.
- dnsmasq, if you will need to NAT your cluster.
- exim4, exim-daemon-light, if you need your nodes to send mail notifications
- sshfs, which can save you a keystroke or two.
- smbfs can also be useful, in case nfs is not fast enough.
Recommended Additional Packages for Bioinformatics
- libboost1.40-dev, a metapackage which pulls in version 1.40 of the C++ Boost libraries.
- ia32-libs, for those people who distribute 32-bit, precompiled code only.
- lftp, because, yes, in 2010, people still use FTPS *sigh*
Recommended Perl Packages for Bioinformatics
Useful for InterProScan installations, especially over Platform LSF.
Many applications are helped considerably by optimizing them for your particular processor and memory/cache hierarchy. In particular, basic linear algebra subroutines (BLAS) can run several times faster when optimized using assembly language and tuned for your cache sizes. For this reason, the ATLAS (automatically-tuned linear algebra subroutines) project has collected high-performance subroutines for a variety of processors, and its build process automatically determines cache sizes and sets loop parameters accordingly. The documentation in its Debian package shows how to build it from source for maximum performance.
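The kind of loop tuning ATLAS automates can be illustrated with cache blocking (tiling): compute the matrix product over sub-blocks small enough to stay resident in cache, instead of streaming whole rows. A toy pure-Python sketch of the idea (real implementations do this in C and assembly; the function name and default block size are illustrative, and Python itself will not show the cache effect):

```python
def matmul_blocked(a, b, block=32):
    """C = A * B computed tile-by-tile so each tile stays cache-resident."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # Outer loops walk over tiles; inner loops work within one tile,
    # which is the part ATLAS tunes to the measured cache sizes.
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c
```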
Note that the PETSc optimization flags are hidden in the file bmake/<arch>/base_variables where <arch> is the PETSc architecture determined by the petscarch script (linux, linux_alpha, linux_ppc, hurd, etc.).
Debian Wiki Links
PelicanHPC, an almost-instant, ramfs-based Beowulf cluster solution based on Debian Lenny/5.0 Live
KestrelHPC, an almost-instant, disk-based, permanent-installation Beowulf cluster solution, based on the scripts by Pelican-HPC
Wikipedia article about Beowulf computing
Article on Debian Admin about how to set up Kerrighed using Debian Lenny
Comparison of TFTP servers at syslinux wiki
Share and enjoy!