
The Message Passing Interface, or MPI, is a powerful set of communication tools on which to build parallel programs. Many existing packages use it, and it is not very difficult to set up on Debian.

MPICH and LAM

The first thing you should know is that there are (at least) two implementations of the MPI standard, called MPICH and LAM, which are source compatible but not binary compatible. This means that a program built against mpich will not run with lam, and vice versa, although the same source code can be rebuilt against either one. Both can be installed at the same time, and since Debian packages specify which one they depend on, you usually need not worry about the difference unless you plan to build or rebuild software yourself.

Opinions vary on this, but the consensus seems to be that lam has slightly better performance, can operate with a heterogeneous network of machines, and can do neat tricks like adding and removing machines on the fly. However, you need to do a couple of extra things to use lam, as discussed below (step 9).

HOWTO

So, you have an application which depends on mpich or lam. Now what?

  1. Set up NIS, OpenLDAP or another distributed authentication mechanism (Samba?), and NFS or another network file system, on your cluster machines, so that your users can log in everywhere with the same home directory.
  2. Install your application on all of the cluster machines. If the application is in a Debian package, this will automatically pull in the mpich or lam MPI implementation it depends on, along with rsh-client and rsh-server (but mind the security note below!).

  3. Edit /etc/hosts.equiv and add the names of all of the machines in the cluster (or let update-cluster-hosts do that when bug 194460 is fixed).

  4. Run "update-alternatives --display rsh" to make sure rsh, and not ssh, is linked to /usr/bin/rsh, and if necessary, "update-alternatives --config rsh" to set that link properly.

  5. Test rsh using "rsh <machine> ls" or "rsh <machine> w" etc. where <machine> is the name of another machine in the cluster. Now you should be able to execute arbitrary commands on any of the machines from any other.

  6. Install either the mpich or lam3-dev package. If you install both, run "update-alternatives --display mpi" (and the same for mpirun) to check which one is active, and use --config to change it as necessary.

  7. Build the MPI-using application.
  8. Edit /etc/mpich/machines.LINUX, or /etc/lam/bhost.def and bhost.lam, adding the names of all of the cluster machines. If a machine has multiple processors, you can enter its name multiple times. Note: UpdateCluster can manage /etc/mpich/machines.LINUX automatically. Just build the cluster.xml file (using debconf if you like) and run "update-cluster-regenerate mpich".

  9. For lam, each user must run "lamboot -v" to initialize the cluster, and you may want to test it using recon. You should probably read those man pages.

  10. Run the application (as an ordinary user) using "mpirun -np # <appname> <args>" where # is the number of processors, <appname> the program name, and <args> that program's arguments. It will automatically run on all of your cluster machines, communicating via the network. Isn't that cool?
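
On a larger cluster, the machine list from step 8 can be generated rather than typed by hand. The following is a minimal sketch, not part of mpich or UpdateCluster: the function name expand_machines, the "host:cpus" argument format and the node names are all assumptions made for illustration. It prints each hostname once per processor, which is the repeated-entry layout step 8 describes for /etc/mpich/machines.LINUX.

```shell
#!/bin/sh
# expand_machines: print each host name once per processor.
# Arguments are hypothetical "host:cpus" pairs; a bare "host" counts as 1.
expand_machines() {
    for entry in "$@"; do
        host=${entry%%:*}
        case $entry in
            *:*) cpus=${entry##*:} ;;
            *)   cpus=1 ;;
        esac
        i=0
        while [ "$i" -lt "$cpus" ]; do
            printf '%s\n' "$host"
            i=$((i + 1))
        done
    done
}

# Example (hypothetical node names): two dual-processor nodes, one single.
# expand_machines node1:2 node2:2 node3 > machines.LINUX
```

Review the generated file before copying it into place as root. This sketch targets the mpich file only; the lam files in /etc/lam are not covered here.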

SECURITY NOTE: An rsh server is a VERY dangerous thing to install on machines with internet access. If someone knows that your machine runs it, and knows the name of any user on the system, they can IP-spoof an rlogin/rsh connection to make it look like it is coming from a friendly machine, and log in as that user. See for example this old page on IP spoofing (the 1.3.91 kernel and Netscape 2.0 described there date from around 1995); with Microsoft including IP spoofing capability in Windows XP, this kind of activity is likely to increase. Be sure to run this only on a private network behind a firewall, or on a single internet-connected machine with only localhost in /etc/hosts.equiv. You've been warned!
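
For the single-machine case, it may help to confirm that /etc/hosts.equiv really lists nothing but localhost. The following is a rough sketch under stated assumptions: the function names list_trusted_hosts and only_localhost are made up for illustration, and the parsing assumes one entry per line with "#" starting a comment, which may not match every rsh server's handling of the file.

```shell
#!/bin/sh
# list_trusted_hosts: print the entries of a hosts.equiv-style file,
# dropping comments (assumed to start with "#"), padding and blank lines.
list_trusted_hosts() {
    sed -e 's/#.*//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' "$1"
}

# only_localhost: succeed if no entry other than "localhost" is present.
# An empty file also passes, since it trusts no remote host.
only_localhost() {
    ! list_trusted_hosts "$1" | grep -qvx 'localhost'
}

# Example:
# only_localhost /etc/hosts.equiv || echo "remote hosts are trusted"
```

The check is deliberately strict: any line that is not exactly "localhost", including a "host user" pair, is treated as a remote entry.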

See also: