You have heard all about this incredible thing called "Beowulf supercomputing" and want to build your own DebianBeowulf cluster. Where do you start?

First, consider your end goal. What is the purpose of this cluster, and what kinds of applications do you plan to run on it?

Note that some applications do not scale to large numbers of processors as well as others. Also, some large problems benefit enormously from shared memory; for example, (non-multipole) boundary element and electron density functional theory calculations really need all processors to have access to all the data. For these applications, linear (or near-linear) scaling in processors really requires a big shared-memory machine such as the SGI Altix line, the largest Linux-native shared-memory architecture in the Intel world. IBM offers similar machines based on its PowerPC architecture, as does Sun with SPARC.

Since you're using Debian for your Beowulf cluster, you will probably want to subscribe to the debian-beowulf mailing list, and perhaps read some of its archives.

Parallelization Tools

Parallelizing computation can be done in many ways. Which one you choose comes down to which your application already uses, or which technology you want to build around. As of August 2010, there are four major approaches to parallelization:

1. Symmetric Multiprocessing (SMP) and multiple threads:

From the relevant Wikipedia article:

In computing, symmetric multiprocessing or SMP involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance. Most common multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors. Processors may be interconnected using buses, crossbar switches or on-chip mesh networks.

The parallelization comes from writing your application to use threads, which can carry out multiple tasks concurrently.
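
To make this concrete, here is a minimal sketch of the threaded style in C with POSIX threads; the thread count and the toy summation are purely illustrative:

{{{
/* Minimal POSIX threads sketch: each thread sums part of a shared
   array.  Compile with: gcc -O2 -pthread sum.c -o sum */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4          /* illustrative; match your core count */
#define N        1000000

static double data[N];
static double partial[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    double sum = 0.0;
    /* Each thread walks an interleaved slice of the shared array. */
    for (long i = id; i < N; i += NTHREADS)
        sum += data[i];
    partial[id] = sum;
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    double total = 0.0;

    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(threads[t], NULL);
        total += partial[t];
    }
    printf("total = %f\n", total);
    return 0;
}
}}}

Since all threads share one address space, this style only exploits the processors within a single box; by itself it does not spread work across the cluster.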

2. Message passing with MessagePassingInterface (MPI):

From the relevant Wikipedia article:

MPI is a language-independent communications protocol used to program parallel computers. Both point-to-point and collective communication are supported. MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation." MPI's goals are high performance, scalability, and portability.

In short, your applications use the MPI libraries, which pass commands and data among the nodes of your cluster; typically the computation runs from a single filesystem common to all the nodes.
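
As a minimal sketch, here is the classic MPI "hello" pattern in C, with one point-to-point message from every rank to rank 0; compile with mpicc and launch with mpirun (the process count is illustrative):

{{{
/* Minimal MPI sketch.
   Compile: mpicc hello.c -o hello
   Run:     mpirun -np 4 ./hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */

    if (rank == 0) {
        /* Rank 0 collects one integer from every other rank. */
        printf("rank 0 of %d: waiting for the others\n", size);
        for (int src = 1; src < size; src++) {
            int peer;
            MPI_Recv(&peer, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("heard from rank %d\n", peer);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
}}}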

3. Another way is ParallelVirtualMachine (PVM):

From the relevant Wikipedia article:

PVM is a software system that enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource, or a "parallel virtual machine".

The individual computers may be shared- or local-memory multiprocessors, vector supercomputers, specialized graphics engines, or scalar workstations and PCs, that may be interconnected by a variety of networks, such as Ethernet or FDDI.

There is also a Debian wiki entry on PVM: ParallelVirtualMachine. In short, PVM is an alternative to MPI.
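
For comparison, here is a minimal PVM sketch in C; the spawned program name hello_pvm and the task count are illustrative assumptions, and you link with -lpvm3:

{{{
/* Minimal PVM sketch: the master spawns workers and collects one
   integer from each.  Link with -lpvm3. */
#include <pvm3.h>
#include <stdio.h>

#define NTASKS 4   /* illustrative task count */
#define MSGTAG 1

int main(void)
{
    int mytid  = pvm_mytid();       /* enroll in the virtual machine */
    int parent = pvm_parent();

    if (parent == PvmNoParent) {    /* we are the master */
        int tids[NTASKS];
        pvm_spawn("hello_pvm", NULL, PvmTaskDefault, "", NTASKS, tids);
        for (int i = 0; i < NTASKS; i++) {
            int peer;
            pvm_recv(-1, MSGTAG);   /* receive from any task */
            pvm_upkint(&peer, 1, 1);
            printf("heard from task t%x\n", peer);
        }
    } else {                        /* we are a spawned worker */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&mytid, 1, 1);
        pvm_send(parent, MSGTAG);
    }
    pvm_exit();
    return 0;
}
}}}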

4. One last way is to parallelize jobs using a job queueing system, such as OpenPBS (homepage), LoadLeveler (homepage), the Sun/Oracle Grid Engine (homepage) or Platform LSF (homepage). On today's multi-core, multi-threaded systems, queueing many single-threaded programs can cut the total time to finish a batch of jobs roughly in proportion to the number of cores available, though with diminishing returns as the core count grows. A minimal job script is sketched below.
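
As an illustrative sketch, assuming OpenPBS/Torque and hypothetical program and file names, the job script for one such single-threaded run might look like this (submit it with qsub, once per input):

{{{
#!/bin/sh
# Hypothetical PBS job script: one node, one single-threaded program.
#PBS -N myjob
#PBS -l nodes=1
cd $PBS_O_WORKDIR
./my_serial_program < input.dat > output.dat
}}}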

You can install or create applications which use any of these approaches, and they can happily run side-by-side on your Beowulf cluster.

If your application falls into this last category (multiple independent processes), you will likely want to use Mosix to run it on your Beowulf cluster. Mosix is a kernel patch and set of utilities which make a cluster of machines appear as a single multiprocessor machine with shared memory (though large applications requiring shared memory will run slowly, because the memory is really distributed). It automatically routes all inter-process communication, either locally if the processes are on the same machine or across the network if not. It also does dynamic load balancing by migrating processes from machine to machine. Unfortunately, Mosix is a bit harder to set up than the others, as you have to apply a special patch to the kernel sources and then build them. There is, however, a Mosix kernel patch package which you can install, and the Debian kernel building system (install the kernel-package package and read its documentation) makes it relatively easy to patch and build your own kernel; a sketch of the procedure follows.
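
A rough sketch of that build procedure, with illustrative package names and version numbers (check what your release actually ships, and see the kernel-package documentation for the exact options):

{{{
# Illustrative only: package names and versions will vary.
apt-get install kernel-package kernel-patch-mosix kernel-source-2.4.27
cd /usr/src && tar xjf kernel-source-2.4.27.tar.bz2
cd kernel-source-2.4.27
make-kpkg --added-patches mosix --revision=mosix.1.0 kernel_image
dpkg -i ../kernel-image-2.4.27_mosix.1.0_i386.deb
}}}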

(Note: the Mosix license changed around 2002, and a fork called OpenMosix was started; there is a kernel patch for OpenMosix as well. If you want to try OpenMosix from a live CD, without changing your existing system, look at ClusterKnoppix or Quantian. These live CDs also include a terminal server, so you can bring up an OpenMosix cluster with a single CD-ROM drive and no hard disks; the diskless clients can run KDE or be left without X. Pretty cool.)

For applications that involve solving partial differential equations (e.g. fluid flow, heat and mass transfer, combustion, etc.), the PETSc suite (http://www-unix.mcs.anl.gov/petsc/) from Argonne National Labs provides a great start, and it has a Debian package.

Cluster Design

Next comes cluster design, for which you need to choose the hardware and the networking. If you choose multiple-processor SMP machines, the CPU and RAM costs will generally stay the same (assuming the SMP board has room for the RAM you want), but case costs will fall, and for a constant disk/CPU ratio, disk cost will fall too. The motherboard may cost less or more; e.g. the first dual-Athlon boards cost as much as entire single-Athlon systems! SMP boxes also need less space and fewer network connections per CPU, and therefore fewer switches, and are easier and cheaper to install and administer (because there are fewer of them). So if the percentage slowdown from sharing memory between the processors is less than the percentage decrease in cost, SMP makes sense.
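
For a concrete (hypothetical) example: suppose two single-CPU boxes cost $2000 in total, while one dual-CPU box with the same CPUs and RAM costs $1700, a 15% saving. If memory contention slows each processor by only 5%, the dual box delivers 1.9 CPUs' worth of work for 85% of the price and wins; if contention costs 20%, you get only 1.6 CPUs' worth and the two single boxes win.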

On networking, the first consideration is what limits the performance of your application. If you're doing a lot of dense linear algebra, the CPU can chug through it extremely fast, so memory bandwidth tends to limit performance on a single-CPU system, and networking on a cluster. If you're raytracing, networking may also be your limiting factor. But most sparse matrix-based applications, such as fluid dynamics and molecular dynamics, tend to be less communication-intensive, and some schemes require very little communication at all.

If you will be network-limited, this will probably drive you toward bigger individual boxes (dual- or quad-processor systems), and might justify the additional expense of fast networking such as gigabit ethernet or Myrinet, or even a NUMA system such as the Altix mentioned above. It's worth noting that increasing processor speed and memory bandwidth at ever-lower cost is driving more applications toward being network-limited with 100 Mb ethernet. However, with consumer-grade gigabit ethernet ever more affordable in 2010, this factor is mildly to altogether mitigated, depending on the grade of your gigabit switch.

Diskless

For administrative reasons, it is often easiest to make your cluster run diskless, that is, with the root filesystem and applications for every node served from one machine. The advantage is that you only need to install and configure software in one place.
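
As an illustrative sketch (the paths and addresses are hypothetical), the master might export a read-only root tree and a writable /home to the nodes over NFS in /etc/exports:

{{{
# /etc/exports on the master (hypothetical paths and addresses)
/nfsroot   192.168.1.0/255.255.255.0(ro,no_root_squash)
/home      192.168.1.0/255.255.255.0(rw)
}}}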

Tools Which Generally Characterize a Beowulf Cluster

* OpenMPI/LAM
* NFS/CIFS/AFS or another network/distributed/shared filesystem, to share the data and tools you will be working on among the cluster
* DHCPd to assign IP addresses to your cluster nodes
* tFTP to netboot your cluster (installing 100 machines by hand gets old by the 3rd installation, and the same goes for assigning their IPs)
* NTP: VERY IMPORTANT! Forget it and you *will* have issues with NFS
* OpenSSH, so you can log in remotely: handy if you have multiple users, or if the fan noise annoys you
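
To tie a few of these together, here is a hypothetical /etc/dhcpd.conf fragment that gives one node a fixed address and points it at a PXE boot image served over tFTP (all names and addresses are illustrative):

{{{
# Hypothetical dhcpd.conf fragment for netbooting one node
subnet 192.168.1.0 netmask 255.255.255.0 {
    option routers 192.168.1.1;
    next-server 192.168.1.1;        # the tFTP server
    filename "pxelinux.0";          # the PXE boot image
    host node01 {
        hardware ethernet 00:11:22:33:44:55;
        fixed-address 192.168.1.101;
    }
}
}}}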

Helpful Tools To Maintain Your Sanity

Here are a few tools/tips which may make your life as a Beowulf administrator easier:

useful for interpro scan installations, especially over Platform LSF

Optimization

Many applications are helped considerably by optimizing them for your particular processor and memory/cache hierarchy. In particular, the basic linear algebra subroutines (BLAS) can run several times faster when optimized using assembly language and tuned for your cache sizes. For this reason, the ATLAS (Automatically Tuned Linear Algebra Software) project has collected high-performance subroutines for a variety of processors, and its build process automatically determines cache sizes and sets loop parameters accordingly. The documentation in its Debian package shows how to build it from source for maximum performance.
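
Your application picks up the tuned routines simply by calling the standard BLAS/CBLAS interface and linking against the ATLAS libraries. A minimal sketch in C (the exact link line varies by release; -lcblas -latlas is typical):

{{{
/* Minimal CBLAS sketch: C = alpha*A*B + beta*C via dgemm.
   Link against ATLAS, e.g.: gcc gemm.c -lcblas -latlas */
#include <cblas.h>
#include <stdio.h>

int main(void)
{
    /* 2x2 matrices in row-major order */
    double A[] = {1, 2,
                  3, 4};
    double B[] = {5, 6,
                  7, 8};
    double C[] = {0, 0,
                  0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,      /* M, N, K */
                1.0, A, 2,    /* alpha, A, lda */
                B, 2,         /* B, ldb */
                0.0, C, 2);   /* beta, C, ldc */

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
}}}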

Note that the PETSc optimization flags are hidden in the file bmake/<arch>/base_variables, where <arch> is the PETSc architecture determined by the petscarch script (linux, linux_alpha, linux_ppc, hurd, etc.).

Share and enjoy!