You have heard all about this incredible thing called "Beowulf supercomputing" and want to build your own DebianBeowulf cluster. Where do you start?

First, consider your end-goal. What is the purpose of this cluster? What kinds of applications do you plan to run on it? Here are some typical applications of Beowulf clusters:

Note that some applications do not scale to large numbers of processors as well as others. Also, some large problems benefit enormously from shared memory, for example (non-multipole) boundary element and electron density functional theory calculations really need all processors to have access to all the data. For these applications, linear (or near-linear) scaling in processors really requires a big shared memory machine such as the SGI Altix line, which is the largest Linux-native shared-memory architecture in the Intel world. IBM offers similar machines in their PowerPC architectures as well as SUN with Sparc.

Since you're using Debian for your Beowulf cluster, you probably want to subscribe to the debian-beowulf email list, and maybe read some of its archives.

Parallelization Tools

Parallelizing computation can be done in many ways. It all comes down to which one does your application use, or which technology do you think you want to build around. As of August 2010, there are four major ways of creating parallelization:

1. Symmetric Multiprocessing (SMP) and multiple threads:

From the relevant Wikipedia article:

In computing, symmetric multiprocessing or SMP involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance. Most common multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors. Processors may be interconnected using buses, crossbar switches or on-chip mesh networks.

The parallelization comes from writing your applications to utilize threads or processes, which can perform multiple and concurrent tasks.

2. Some use MessagePassingInterface (MPI).

From the relevant Wikipedia article:

MPI is a language-independent communications protocol used to program parallel computers. Both point-to-point and collective communication are supported. MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation. MPI's goals are high performance, scalability, and portability.

In short, your applications use the MPI libraries, which in turn pass commands and data among different nodes in your cluster and perform your computations from a single filesystem, common among all the nodes of the cluster.

3. Another way is ParallelVirtualMachine (PVM):

From the relevant Wikipedia article:

PVM is a software system that enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource, or a "parallel virtual machine".

The individual computers may be shared- or local-memory multiprocessors, vector supercomputers, specialized graphics engines, or scalar workstations and PCs, that may be interconnected by a variety of networks, such as Ethernet or FDDI.

There is also a Debian wiki entry on PVM: ParallelVirtualMachine. In short, PVM is an alternative to MPI.

4. One last way is to parallelize jobs using a job queueing system, such as OpenPBS (homepage), LoadLeveler (homepage), the Sun/Oracle Grid Engine (homepage) or Platform LSF (homepage). In today's multi-core, multi-threading systems, queuing multiple single-threading programs to perform your tasks, can decrease your time needed to complete the job, in an inverse logarithmically manner.

You can install or create applications which can use all four of these transports and can happily run side-by-side on your Beowulf cluster.

Symmetric MultiProcessing

There are two ways to do SMP. In both ways, the operating system realizes all hardware as a single system image.

One way is hardware SMP. Examples are multi-core systems, such as the Sparc architecture or the Intel i7 architecture, multiple CPU systems, such as the SGI Origin2, or multi-CPU, multi-core systems, such as the IBM Power6 architecture.

The other way is via software SMP emulation. Taking advantage of 100Mbit/1Gbit/10Gbit networks, software SMP automatically routes all of the inter-process communication either locally if processes are on the same machine, or across the network if not. It also allows dynamic load balancing by migrating processes from machine to machine.

Major examples of sofware SMP include Mosix and Kerrighed.

Mosix is a kernel patch and set of utilities which make a cluster of machines appear as a single multiprocessor machine with shared memory (though large applications requiring shared memory will run slowly because it's really distributed). Unfortunately, Mosix is a bit harder to set up than the others, as you have to apply a special patch to the kernel sources and then build it. The Mosix license changed in 2002 and a fork called OpenMosix started.

Up to Debian Sarge, there used to be an OpenMosix kernel patch package which you could install. The package was removed in early 2006 and the OpenMosix project was closed down in March 1, 2008. As of August 2010, the source is still available on SourceForge. Combined with the Debian kernel building system you can (relatively) easy patch the required kernel and build your own version of the kernel running OpenMosix. Should you wish to do so, you will have to install the kernel-package package and read the relevant documentation.

If you want to try OpenMosix using a live cd, thus without changing your existing system, you can look at clusterknoppix or quantian. Those livecds also include a terminal server, so you can get an OpenMosix cluster using only 1 cdrom drive and zero hard disks, and the diskless clients can be used with kde or can be kept without xfree. ClusterKnoppix has not been updated ever since August 28, 2004. The Quantian Project has not been updated ever since February 2, 2006.

Another example is Kerrighed.

Kerrighed (homepage) is an actively maintained project, which provides the same functionality as OpenMosix. It is licensed under GPLv2. The downlside is that the code is not completely portable to all kernel version as it has to be compiled against a specific version of the kernel. As of August 2010, you can download the code and compile it against the Linux 2.6.20 kernel. There is no appropriate debian package *cough*TODO TODO TODO*cough* . 2.6.20 is dated from around February, 2007.

Applications involving solving partial differential equations (e.g. fluid flow, heat and mass transfer, combustion, etc.), the PETSc suite from Argonne National Labs provides a great start. There are prepackaged libraries of the toolkit on Debian Packages.

Cluster Design

Next comes cluster design, for which you need to choose the hardware and the networking. If you choose multiple-processor SMP machines, the CPU and RAM costs will generally stay the same, assuming the SMP board has room for the RAM you want. CPU power increases according to Mooring's Law, and for constant disk/CPU ratio, that cost will fall too, accordingly. For DIY, the motherboard may cost less or more, e.g. the first dual-athlon boards cost as much as an entire single-athlon system. SMP boxes also need less space and fewer network connections per CPU, and therefore fewer switches, and are easier/cheaper to install and administrate. So if the percentage of the slowdown due to sharing memory between the processors is less than the percentage decrease in cost, it makes sense to SMP.

On networking, the first consideration is what limits the performance of your application. If you're doing a lot of dense linear algebra, then the CPU can chug through it extremely fast, so memory bandwidth tends to limit performance on a single-CPU system, and networking on a cluster. If you need bulk transfers of data or streaming, networking might also be your limiting factor. But most sparse matrix-based applications, such as fluid dynamics and molecular dynamics, tend to be less communication-intensive, and some schemes require very little communication at all.

If you will be network-limited, this will probably drive you toward bigger individual boxes (dual or quad processor systems), and might justify the additional expense of fast networking such as gigabit ethernet or myrinet, or even a NUMA system such as the Altix as mentioned above. It's worth noting that increasing processor speed and memory bandwidth at ever-lower cost is driving more applications toward being network-limited with 100 Mb ethernet. However, even with consumer-grade gbit ethernet being ever more affordable in 2010, this factor is mildly-to-all-together mitigated, depending on the grade of your 1Gbit switch. With the advent of 10Gbit and 100Gbit networking, this bottleneck will be mitigated further.

Diskless

For administrative reasons, it is often easiest to make your cluster run diskless, that is, with the root filesystem and applications all on one machine. The advantage is that you only need to install and configure software in one place.

Software Generally installed on a Beowulf Cluster

Helpful Tools To Maintain Your Sanity

Here are a few tools/tips which may make your life as a Beowulf administrator easier:

useful for interpro scan installations, especially over Platform LSF

Optimization

Many applications are helped considerably by optimizing them for your particular processor and memory/cache hierarchy. In particular, basic linear algebra subroutines (BLAS) can run several times faster when optimized using assembly language and tuned for your cache sizes. For this reason, the ATLAS (automatically-tuned linear algebra subroutines) project has collected high-performance subroutines for a variety of processors, and its build process automatically determines cache sizes and sets loop parameters accordingly. The documentation in its Debian package shows how to build it from source for maximum performance.

Note that the ?PETSc optimization flags are hidden in the file bmake/<arch>/base_variables where <arch> is the ?PETSc architecture determined by the petscarch script (linux, linux_alpha, linux_ppc, hurd, etc.).

Share and enjoy!