You've heard all about this incredible thing called "Beowulf supercomputing" and want to build your own DebianBeowulf cluster. Where do you start?

First, consider your end-goal. What is the purpose of this cluster? What kinds of applications do you plan to run on it? Here are some typical applications of Beowulf clusters:

  • Transport phenomena, including fluid dynamics, heat and mass transfer, multi-phase flows, aerodynamics, etc.
  • Multi-million-atom molecular dynamics, and protein folding
  • Cellular automata to model phenomena from epidemiology to options trading
  • Graphics: distributed raytracing and rendering

Note that some applications do not scale to large numbers of processors as well as others. Also, some large problems benefit enormously from shared memory; for example, (non-multipole) boundary element and electron density functional theory calculations really need all processors to have access to all the data. For these applications, linear (or near-linear) scaling in processors really requires a big shared-memory machine such as the SGI Altix line, which is the largest Linux-native shared-memory architecture (known to the author).

Since you're using Debian for your Beowulf cluster, you probably want to subscribe to the debian-beowulf email list, and maybe read some of its archives.

Parallelization Tools

Next, which parallelization methodology does your application use, or which do you want to build around? Some use MessagePassingInterface (MPI), others ParallelVirtualMachine (PVM), still others are written to run in multiple threads or processes on a single machine. Applications using all three of these transports can happily run side-by-side on your Beowulf cluster.
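
To get a concrete feel for MPI, here is a minimal hello-world in C. This is a sketch: the compiler wrapper and launcher names (mpicc, mpirun) depend on which MPI implementation you install.

{{{
/* hello_mpi.c -- minimal MPI example.  Typical build/run (names vary
 * by MPI implementation):
 *   mpicc -o hello_mpi hello_mpi.c
 *   mpirun -np 4 ./hello_mpi
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                        /* shut down cleanly */
    return 0;
}
}}}

Each copy of the program runs as its own process, possibly on a different node; the rank is how processes address messages to one another.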

If your application falls into this last category (multiple threads or processes), you will likely want to use Mosix to run it on your Beowulf cluster. Mosix is a kernel patch and set of utilities which make a cluster of machines appear as a single multiprocessor machine with shared memory (though large applications requiring shared memory will run slowly because it's really distributed). It automatically routes all of the inter-process communication either locally if processes are on the same machine, or across the network if not. It also does dynamic load balancing by migrating processes from machine to machine. Unfortunately, Mosix is a bit harder to set up than the others, as you have to apply a special patch to the kernel sources and then build it, but there is a Mosix kernel patch package which you can install, and the Debian kernel building system (install the kernel-package package and read its documentation) makes it (relatively) easy to patch and build your own kernel.

(Note: the Mosix license changed in 2002 (?), and a fork called OpenMosix started. There is a kernel patch for OpenMosix as well. If you want to try OpenMosix from a live CD, without changing your existing system, have a look at clusterknoppix or Quantian. Those live CDs also include a terminal server, so you can run an OpenMosix cluster with a single CD-ROM drive and zero hard disks, and the diskless clients can run KDE or be kept without XFree86. Pretty cool.)
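
To make the multiple-processes case concrete, here is a sketch of the kind of ordinary fork()-based program Mosix can spread across a cluster. Nothing in it is Mosix-specific, since migration is transparent; the per-worker loop is just placeholder CPU-bound work.

{{{
/* workers.c -- plain multi-process program; under Mosix (or OpenMosix)
 * the forked children can be migrated to other nodes automatically.
 * Illustrative sketch only: the "work" in each child is a placeholder.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NWORKERS 8

int main(void)
{
    int i;

    for (i = 0; i < NWORKERS; i++) {
        if (fork() == 0) {          /* child: CPU-bound placeholder work */
            double x = 0.0;
            long j;
            for (j = 0; j < 100000000L; j++)
                x += 1.0 / (double)(j + 1);
            printf("worker %d done (sum=%f)\n", i, x);
            exit(0);
        }
    }
    for (i = 0; i < NWORKERS; i++)
        wait(NULL);                 /* parent: reap all the workers */
    return 0;
}
}}}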

For applications involving the solution of partial differential equations (e.g. fluid flow, heat and mass transfer, combustion), the PETSc suite (http://www-unix.mcs.anl.gov/petsc/) from Argonne National Labs provides a great start, and it has a Debian package.
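
A minimal PETSc program looks something like the sketch below; header names and build details vary between PETSc versions and their Debian packaging, so treat it as illustrative only.

{{{
/* petsc_hello.c -- smallest possible PETSc program (illustrative;
 * the header name and build flags depend on your PETSc version).
 */
#include <petsc.h>

int main(int argc, char **argv)
{
    PetscInitialize(&argc, &argv, NULL, NULL);  /* also initializes MPI */
    PetscPrintf(PETSC_COMM_WORLD, "PETSc is alive on this cluster\n");
    PetscFinalize();
    return 0;
}
}}}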

Cluster Design

Next comes cluster design, for which you need to choose the hardware and the networking. If you choose multiple-processor SMP machines, the CPU and RAM costs will generally stay the same (assuming the SMP board has room for the RAM you want), but case costs will fall, and for a constant disk/CPU ratio, disk cost will fall too. The motherboard may cost less or more; e.g. the first dual-Athlon boards cost as much as entire single-Athlon systems! SMP boxes also need less space and fewer network connections per CPU, and therefore fewer switches, and are easier/cheaper to install and administer (because there are fewer of them). So if the percentage slowdown due to sharing memory between the processors is less than the percentage decrease in cost per CPU, it makes sense to go SMP. For example, if a dual-CPU node costs 20% less per CPU than two single-CPU nodes, SMP wins as long as memory contention costs less than 20% of each CPU's performance.

On networking, the first consideration is what limits the performance of your application. If you're doing a lot of dense linear algebra, the CPU can chug through it extremely fast, so memory bandwidth tends to limit performance on a single-CPU system, and networking on a cluster. If you're raytracing, networking might also be your limiting factor. But most sparse matrix-based applications, such as fluid dynamics and molecular dynamics, tend to be less communication-intensive, and some schemes require very little communication at all.

If you will be network-limited, this will probably drive you toward bigger individual boxes (dual or quad processor systems), and might justify the additional expense of fast networking such as gigabit Ethernet or Myrinet, or even a NUMA system such as the Altix mentioned above. It's worth noting that ever-cheaper processor speed and memory bandwidth are driving more applications toward being network-limited on 100 Mb Ethernet...

Diskless

For administrative reasons, it is often easiest to make your cluster run DiskLess, that is, with the root filesystem and applications all on one machine. The advantage is that you only need to install and configure software in one place. The disadvantage is that you need to learn to set up diskless booting, which can take some time. Reading the DiskLess page on this wiki can help you make that decision.

Helpful Tools

Here are a few tools/tips which may make your life as a Beowulf administrator easier:

  • UpdateCluster, provided by the update-cluster package, maintains a centralized list of host names, IP addresses and MAC addresses, with tools that use this list to generate machine lists for other software.

  • Dancer's Shell, or dsh, runs a single process on every machine in a cluster. Just put the names of the machines in your cluster in /etc/dsh/machines.list (which can be managed by update-cluster), then e.g. "dsh -a w" gives the output of w for all of the machines in that list.

  • The jmon package lets you easily monitor the CPU, memory and swap usage of multiple machines. The list of machines is read from ~/.jmonrc or /etc/jmonrc (the latter of which can be managed by update-cluster), and can be overridden with the -f option.

  • If you are not running diskless, package synchronization is made a lot easier by "dpkg --get-selections > file" to build a list of installed packages on one machine, then "dpkg --set-selections < file && apt-get dselect-upgrade" to install (and remove) packages accordingly on the others. Check the dpkg man page for details.

  • FAI (http://wiki.debian.org/FAI) is a tool for fully automatic installation of Debian, and has been used to set up many clusters.

Optimization

Many applications are helped considerably by optimizing them for your particular processor and memory/cache hierarchy. In particular, basic linear algebra subroutines (BLAS) can run several times faster when optimized using assembly language and tuned for your cache sizes. For this reason, the ATLAS (automatically-tuned linear algebra subroutines) project has collected high-performance subroutines for a variety of processors, and its build process automatically determines cache sizes and sets loop parameters accordingly. The documentation in its Debian package shows how to build it from source for maximum performance.
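
To see a tuned BLAS in action, here is a small C program that multiplies two matrices through the cblas interface shipped with ATLAS. This is a sketch; the exact linker flags (something like -lcblas -latlas) depend on how the packages are built.

{{{
/* dgemm_demo.c -- computes C = alpha*A*B + beta*C with the BLAS
 * routine DGEMM via the C interface.  Link against ATLAS, e.g.:
 *   gcc -o dgemm_demo dgemm_demo.c -lcblas -latlas   (flags may vary)
 */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double A[4] = {1, 2, 3, 4};   /* 2x2 matrices, row-major storage */
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,          /* m, n, k */
                1.0, A, 2,        /* alpha, A, lda */
                B, 2,             /* B, ldb */
                0.0, C, 2);       /* beta, C, ldc */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
}}}

The same source links against the reference BLAS or a tuned ATLAS build unchanged; only the library you link against determines the speed.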

Note that the PETSc optimization flags are hidden in the file bmake/<arch>/base_variables, where <arch> is the PETSc architecture determined by the petscarch script (linux, linux_alpha, linux_ppc, hurd, etc.).

Share and enjoy!