You have heard all about this incredible thing called "Beowulf supercomputing" and want to build your own DebianBeowulf cluster. Where do you start?

First, consider your end-goal. What is the purpose of this cluster? What kinds of applications do you plan to run on it? Typical Beowulf workloads include partial differential equation solvers (fluid flow, heat and mass transfer, combustion), molecular dynamics, electronic structure calculations, ray tracing, and large linear algebra problems.

Note that some applications do not scale to large numbers of processors as well as others. Also, some large problems benefit enormously from shared memory: (non-multipole) boundary element and electron density functional theory calculations, for example, really need every processor to have access to all the data. For these applications, linear (or near-linear) scaling with processor count really requires a big shared-memory machine such as the SGI Altix line, the largest Linux-native shared-memory architecture known to the author.

Since you're using Debian for your Beowulf cluster, you probably want to subscribe to the debian-beowulf email list, and maybe read some of its archives.

Parallelization Tools

Next, which parallelization methodology does your application use, or which do you want to build around? Some applications use the MessagePassingInterface (MPI), others the ParallelVirtualMachine (PVM), and still others are written to run as multiple threads or processes on a single machine. Applications using all three of these transports can happily run side-by-side on your Beowulf cluster.
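As a quick illustration (the package names, hostfile, and program name are assumptions, and Open MPI works similarly), launching an MPI job across the cluster with LAM looks something like this:

    # install a LAM/MPI runtime and development package
    apt-get install lam-runtime lam4-dev

    # "hostfile" lists one cluster node per line: node01, node02, ...
    lamboot hostfile              # start the LAM daemons on every node
    mpirun -np 8 ./my_mpi_app     # run 8 processes spread across the cluster
    lamhalt                       # shut the daemons down when finished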

If your application falls into this last category (multiple threads or processes), you will likely want to use Mosix to run it on your Beowulf cluster. Mosix is a kernel patch plus a set of utilities which make a cluster of machines appear as a single multiprocessor machine with shared memory (though large applications that genuinely require shared memory will run slowly, because the memory is really distributed). It automatically routes all inter-process communication, locally if the processes are on the same machine or across the network if not, and it does dynamic load balancing by migrating processes from machine to machine. Unfortunately, Mosix is a bit harder to set up than the others, since you have to apply a special patch to the kernel sources and then build the kernel. There is, however, a Mosix kernel-patch package which you can install, and the Debian kernel-building system (install the kernel-package package and read its documentation) makes it relatively easy to patch and build your own kernel.
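As a rough sketch (the kernel version, patch name, and revision string are assumptions; the kernel-package and kernel-patch documentation have the authoritative details), building a Mosix-patched kernel the Debian way looks something like:

    apt-get install kernel-package kernel-patch-mosix kernel-source-2.4.27
    cd /usr/src && tar xjf kernel-source-2.4.27.tar.bz2
    cd kernel-source-2.4.27
    make menuconfig                      # configure the kernel as usual
    # --added-patches applies the named kernel patch before building
    make-kpkg --added-patches mosix --revision mosix.1.0 kernel_image
    dpkg -i ../kernel-image-2.4.27*.deb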

(Note: the Mosix license changed around 2002, and a fork called openMosix started; there is a kernel patch package for openMosix as well. If you want to try openMosix from a live CD, i.e. without changing your existing system, have a look at clusterknoppix or quantian. Those live CDs also include a terminal server, so you can get an openMosix cluster going with a single CD-ROM drive and zero hard disks, and the diskless clients can run KDE or be left without X entirely. Pretty cool.)

For applications involving the solution of partial differential equations (e.g. fluid flow, heat and mass transfer, combustion), the PETSc suite (http://www-unix.mcs.anl.gov/petsc/) from Argonne National Laboratory provides a great start, and it has a Debian package.
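Finding and installing the packaged version is straightforward (the exact package names vary between Debian releases, so treat this as a sketch):

    apt-cache search petsc        # see which PETSc packages your release ships
    apt-get install petsc-dev     # development package; the name is an assumption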

Cluster Design

Next comes cluster design, for which you need to choose the hardware and the networking. If you choose multiple-processor SMP machines, the CPU and RAM costs will generally stay the same (assuming the SMP board has room for the RAM you want), but case costs fall, and for a constant disk-to-CPU ratio the disk cost per CPU falls too. The motherboard may cost less or more; the first dual-Athlon boards, for example, cost as much as entire single-Athlon systems! SMP boxes also need less space and fewer network connections per CPU, and therefore fewer switch ports, and they are easier and cheaper to install and administer because there are fewer of them. So if the percentage slowdown due to sharing memory between the processors is less than the percentage decrease in cost per CPU, it makes sense to go SMP.
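For example (illustrative numbers only): if two single-CPU nodes would cost $2400 in total while one dual-CPU node with the same CPUs, RAM, and disk costs $2000, the SMP route is about 17% cheaper per CPU, so it wins as long as memory contention costs each processor less than roughly 17% of its performance.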

On networking, the first consideration is what limits the performance of your application. If you're doing a lot of dense linear algebra, the CPU can chug through it extremely fast, so memory bandwidth tends to limit performance on a single-CPU system, and networking does on a cluster. If you're ray tracing, networking might also be your limiting factor. But most sparse-matrix-based applications, such as fluid dynamics and molecular dynamics, tend to be less communication-intensive, and some schemes require very little communication at all.

If you will be network-limited, this will probably drive you toward bigger individual boxes (dual- or quad-processor systems), and might justify the additional expense of fast networking such as Gigabit Ethernet or Myrinet, or even a NUMA system such as the Altix mentioned above. It's worth noting that increasing processor speed and memory bandwidth at ever-lower cost is pushing more applications toward being network-limited on 100 Mb Ethernet. However, with consumer-grade gigabit Ethernet ever more affordable as of 2010, this factor is mitigated anywhere from mildly to altogether, depending on the grade of your gigabit switch.

Diskless

For administrative reasons, it is often easiest to run your cluster diskless, that is, with the root filesystem and all the applications served from one machine. The advantage is that you only need to install and configure software in one place.
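A minimal sketch of the server side (the path, network address, and export options are assumptions; adapt them to your own network):

    # export a root filesystem for the nodes over NFS
    apt-get install nfs-kernel-server
    mkdir -p /srv/nfsroot

    # nodes on 192.168.1.0/24 mount this read-only as their root filesystem
    echo '/srv/nfsroot 192.168.1.0/255.255.255.0(ro,no_root_squash,sync)' >> /etc/exports
    exportfs -ra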

Tools Which Generally Characterize a Beowulf Cluster

* OpenMPI/LAM
* NFS/CIFS/AFS or some other network/distributed/shared filesystem, to share the data and tools you will be working on among the cluster.
* DHCPd to assign IP addresses to your cluster nodes (installing 100 machines by hand gets old by the 3rd installation).
* TFTPd to netboot your cluster (see above about 100 machines); a sketch of the DHCP/TFTP setup follows this list.
* NTP: very important! Forget it and you *will* have issues with NFS.
* OpenSSH, whether because you have multiple users or because the fan noise annoys you.
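A rough sketch of the DHCP/TFTP side of netbooting (the addresses, MAC, and boot loader file name are assumptions, and the DHCP package name and configuration path vary between Debian releases):

    apt-get install dhcp tftpd-hpa      # on newer releases the server package is dhcp3-server

    # then, in /etc/dhcpd.conf (or /etc/dhcp3/dhcpd.conf):
    subnet 192.168.1.0 netmask 255.255.255.0 {
        option routers 192.168.1.1;
        next-server 192.168.1.1;        # the TFTP server
        filename "pxelinux.0";          # boot loader image served over TFTP
        host node01 {
            hardware ethernet 00:11:22:33:44:55;
            fixed-address 192.168.1.101;
        }
    }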

Helpful Tools To Maintain Your Sanity

Here are a few tools/tips which may make your life as a Beowulf administrator easier:
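One simple trick (the node names and the command are only examples) is to run the same command on every node over ssh, for instance to check load or push out a small change:

    # run "uptime" on node01 through node16 in turn
    for n in $(seq -w 1 16); do
        ssh node$n uptime
    done

The dsh package ("dancer's shell", a distributed shell) automates this sort of thing.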

Optimization

Many applications are helped considerably by optimizing them for your particular processor and memory/cache hierarchy. In particular, the basic linear algebra subroutines (BLAS) can run several times faster when hand-optimized in assembly language and tuned for your cache sizes. For this reason, the ATLAS (Automatically Tuned Linear Algebra Software) project has collected high-performance subroutines for a variety of processors, and its build process automatically determines cache sizes and sets loop parameters accordingly. The documentation in its Debian package shows how to build it from source for maximum performance.
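A sketch of rebuilding the Debian package locally (the source package name and the build target are assumptions; the package's README.Debian has the authoritative procedure):

    apt-get build-dep atlas3-base        # package names vary between releases
    apt-get source atlas3
    cd atlas3-*
    # hypothetical target that builds packages tuned to the build host
    fakeroot debian/rules custom
    dpkg -i ../atlas3-*.deb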

Note that the PETSc optimization flags are hidden in the file bmake/<arch>/base_variables, where <arch> is the PETSc architecture determined by the petscarch script (linux, linux_alpha, linux_ppc, hurd, etc.).
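To see which flags are currently in effect for your build (the variable names differ between PETSc versions, so treat this as a sketch):

    # show the compiler optimization settings for the plain "linux" architecture
    grep -i optflags bmake/linux/base_variables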

Share and enjoy!