Running Torque inside of Eucalyptus
We describe how to set up a TORQUE cluster within a Eucalyptus cloud. Installing Torque is not difficult, especially since Debian provides packages for it.
This page goes several steps further. It automates the installation of Torque across many machines, lets all of those machines run clones of the same hard drive image, and finally replaces the physical machines with networked virtual images: the cloud.
The process described here is applicable to any cloud image. There are several examples of how to prepare your own images. The Debian Wiki has one showing how to extend an image from within an already running instance, and another showing how to create one from scratch. Both are instructive in their own right.
Preparation
$ source ~/.euca/eucarc
Specify a Squeeze image
$ EMI=emi-1AF00C98
Start two instances of our Squeeze image
$ euca-run-instances $EMI -k mykey -t c1.medium -n2
RESERVATION r-4488080C myuser myuser-default
INSTANCE i-57E309BE emi-1AF00C98 0.0.0.0 0.0.0.0 pending mykey 2010-09-13T02:31:51.172Z eki-D224100C eri-059910F2
INSTANCE i-4C1F0986 emi-1AF00C98 0.0.0.0 0.0.0.0 pending mykey 2010-09-13T02:31:51.173Z eki-D224100C eri-059910F2
After a few seconds both instances will be running
$ euca-describe-instances
RESERVATION r-4488080C myuser default
INSTANCE i-4C1F0986 emi-1AF00C98 192.168.0.14 192.168.0.14 running mykey 1 c1.medium 2010-09-13T02:31:51.173Z mycloud eki-D224100C eri-059910F2
INSTANCE i-57E309BE emi-1AF00C98 192.168.0.15 192.168.0.15 running mykey 0 c1.medium 2010-09-13T02:31:51.172Z mycloud eki-D224100C eri-059910F2
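The addresses of the running instances can be picked out of this listing without copying them by hand. The following is a minimal sketch and not part of the original setup: the function name running_ips is hypothetical, and it assumes the column layout shown above, where the fourth field of an INSTANCE line is the public address and the sixth field is the state.

```shell
#!/bin/bash
# Hypothetical helper: print the public address of every running instance.
# Reads `euca-describe-instances` output on stdin and assumes field 4 is
# the public IP and field 6 the state, as in the listing above.
running_ips() {
    awk '$1 == "INSTANCE" && $6 == "running" { print $4 }'
}

# Usage (needs a configured euca2ools environment):
#   NODES=$(euca-describe-instances | running_ips | paste -sd, -)
#   echo "$NODES"
```

The comma-joined list has the form that the -n option of the setup script expects.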
Let's say you want to start a Torque server on 192.168.0.14 and two Torque workers on 192.168.0.14 and 192.168.0.15, with MPI enabled
$ bash start_torque.sh -s="192.168.0.14" -n="192.168.0.14,192.168.0.15" -k="~/.euca/mykey.priv" -m=1
This will install all necessary torque packages in the instances. It might take a few minutes, depending on the internet connection and processor speed of the instances.
Connect to an instance as root with your key
ssh -X -i ~/.euca/mykey.priv root@192.168.0.14
Inside the instance, switch to the guest user
su - guest
Check if nodes are up
pbsnodes
Perform some simple tests
echo "sleep 10" | qsub
echo "sleep 5" | qsub
echo "hostname" | qsub
echo "sleep 15" | qsub
echo "hostname" | qsub
echo "sleep 3" | qsub
Look at the queue
qstat
Keep two worker nodes busy with a sleep job
echo "sleep 10" | qsub -l nodes=2
Check if both nodes are in state 'job-exclusive'
pbsnodes
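Reading the state of every node by eye becomes tedious with more than a handful of workers. As a rough sketch, the check can be scripted. The function name count_nodes_in_state is hypothetical, and the code assumes that each node block printed by pbsnodes contains a line of the form "state = <value>". It matches the state exactly, so a combined state such as "down,offline" is not counted as "down".

```shell
#!/bin/bash
# Hypothetical helper: count the nodes whose state equals the first argument.
# Reads `pbsnodes` output on stdin and assumes a "state = <value>" line
# per node.
count_nodes_in_state() {
    awk -v want="$1" '$1 == "state" && $2 == "=" && $3 == want { n++ }
                      END { print n + 0 }'
}

# Usage on the Torque server:
#   pbsnodes -a | count_nodes_in_state job-exclusive
```

With the two-node job from above still running, the expected count would be 2.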
During the installation phase we compiled a simple MPI Hello World program.
Start it without Torque
$ mpiexec -n 4 /tmp/hello.out
Hello MPI from the server process!
Hello MPI! mesg from 1 of 4 on ip-192-168-0-14
Hello MPI! mesg from 2 of 4 on ip-192-168-0-14
Hello MPI! mesg from 3 of 4 on ip-192-168-0-14
Start it with Torque (without -tm support)
cat <<'EOF' > mpi-test_2_1_mpirun
#PBS -N helloworld
#PBS -l nodes=2:ppn=1
cd $PBS_O_WORKDIR
/usr/bin/mpirun -np 2 --hostfile /etc/torque/hostfile -v -v -v /tmp/hello.out
EOF
qsub mpi-test_2_1_mpirun
Check the output files
cat helloworld.o*
Hello MPI from the server process!
Hello MPI! mesg from 1 of 2 on ip-192-168-0-15
cat helloworld.e*
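To repeat the test with a different number of nodes, the job file can be generated instead of typed. This is only a sketch under assumptions: the function name make_mpi_job is hypothetical, and the job uses one process per node and the same hostfile and test binary as above. The dollar sign in front of PBS_O_WORKDIR is escaped so that the variable is expanded when the job runs, not when the file is written.

```shell
#!/bin/bash
# Hypothetical helper: print a Torque job script that runs the MPI test
# program on the requested number of nodes, one process per node.
make_mpi_job() {
    local nodes="$1"
    cat <<EOF
#PBS -N helloworld
#PBS -l nodes=${nodes}:ppn=1
cd \$PBS_O_WORKDIR
/usr/bin/mpirun -np ${nodes} --hostfile /etc/torque/hostfile /tmp/hello.out
EOF
}

# Usage on the Torque server, as the guest user:
#   make_mpi_job 2 > mpi-test_2_1_mpirun && qsub mpi-test_2_1_mpirun
```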
Start it with Torque (with -tm support)
The required package is ready but is not in Squeeze yet.
Example script for setting up Torque (start_torque.sh):
set -ex
#NOTES:
# in whatever mode you are running your Eucalyptus system, you will have two interfaces, one is called public(you can reach the instances from outside),
# one is called inside(you can reach the instances usually only from inside, or maybe from your front end)
# you have to specify a TORQUE SERVER and one or more TORQUE NODES, because I am accessing those instances you have to specify working IP addresses.
# example: SYSTEM mode, both interfaces are using the same address, so there is no
# example: MANAGED modes, public and private interface are different, TORQUE has to be setup for the PRIVATE mode (intra-cloud communication need firewall settings)
# usage: bash start_torque.sh --verbose -s="192.168.0.13" -n="192.168.0.14,192.168.0.17,192.168.0.45" -k="~/.euca/mykey.priv"
# default
VERBOSE=0
IN_INSTANCE=0
MPI=0 # without this default, the MPI test below fails when -m is not given
echo `hostname` : `/sbin/ifconfig eth0 | grep "inet addr" | awk '{print $2}' | sed 's/addr\://'`
for i in $*
do
    case $i in
        -s=*|--torque-server=*)
            # remove option from string
            PUBLIC_TORQUE_SERVER_IP=`echo $i | sed 's/[-a-zA-Z0-9]*=//'`
            echo $PUBLIC_TORQUE_SERVER_IP
            ;;
        -n=*|--torque-nodes=*)
            # remove option from string
            NIPS=`echo $i | sed 's/[-a-zA-Z0-9]*=//'`
            PUBLIC_NODES_IP=`echo $NIPS | sed 's/\,/ /g'`
            echo $NIPS
            echo $PUBLIC_NODES_IP
            ;;
        -k=*|--key=*)
            # remove option from string
            KEY=`echo $i | sed 's/[-a-zA-Z0-9]*=//'`
            echo $KEY
            ;;
        --verbose)
            VERBOSE=1
            ;;
        -i|--in-instance)
            IN_INSTANCE=1
            ;;
        -m=*|--with-mpi=*)
            MPI=`echo $i | sed 's/[-a-zA-Z0-9]*=//'`
            #TODO: only 0 or 1 are feasible values
            ;;
        *)
            echo "unknown option"
            ;;
    esac
done
cat > keygen_in_instance.sh << EOF
#!/bin/bash
su guest -c 'ssh-keygen -t rsa -N "" -f /home/guest/.ssh/id_rsa'
EOF
chmod 755 keygen_in_instance.sh
# BEGIN execution on master #################################################
if [ $IN_INSTANCE -eq 0 ] ; then
    # join server and nodes
    if [[ $PUBLIC_NODES_IP == *$PUBLIC_TORQUE_SERVER_IP* ]]
    then
        ALL_INSTANCES="$PUBLIC_NODES_IP"
    else
        ALL_INSTANCES="$PUBLIC_TORQUE_SERVER_IP $PUBLIC_NODES_IP"
    fi
    echo $ALL_INSTANCES
    # copy setup-torque-script to eucalyptus instances
    for NODE_IP in `echo $ALL_INSTANCES`
    do
        echo $NODE_IP
        # make this host known to ~/.ssh/known_hosts
        eval "ssh -i $KEY -o StrictHostKeychecking=no root@$NODE_IP echo ''"
        eval "/usr/bin/scp -p -i $KEY start_torque.sh root@$NODE_IP:/root/start_torque.sh"
        # MPI example
        eval "scp -p -i $KEY compileMPI.sh helloworld.c root@$NODE_IP:/root/"
        # start script in instance
        eval "ssh -X -i $KEY root@$NODE_IP \"/root/start_torque.sh -s=\"$PUBLIC_TORQUE_SERVER_IP\" -n=\"$NIPS\" -i -m=$MPI \""
    done
    # generate keys in instances - for user guest
    for NODE_IP in `echo $ALL_INSTANCES`
    do
        echo $NODE_IP
        eval "/usr/bin/scp -p -i $KEY keygen_in_instance.sh root@$NODE_IP:/root/keygen_in_instance.sh"
        eval "ssh -X -i $KEY root@$NODE_IP \"/root/keygen_in_instance.sh\""
    done
    # distribute keys
    for NODE_IP in `echo $ALL_INSTANCES`
    do
        # distribute this key to all other nodes
        for NODE_IP2 in `echo $ALL_INSTANCES`
        do
            echo $NODE_IP2
            eval "/usr/bin/scp -p -i $KEY root@$NODE_IP:/home/guest/.ssh/id_rsa.pub /tmp/id_rsa.pub"
            eval "/usr/bin/scp -p -i $KEY /tmp/id_rsa.pub root@$NODE_IP2:/tmp/id_rsa.pub"
            eval "ssh -X -i $KEY root@$NODE_IP2 \"cat /tmp/id_rsa.pub >> /home/guest/.ssh/authorized_keys\""
            #TODO, I need an entry in known_hosts, for now the following happens from within the instances
            #eval "ssh -X -i $KEY root@$NODE_IP \"ssh -o StrictHostKeychecking=no guest@$NODE_IP2 & echo '' & wait\""
        done
        eval "ssh -X -i $KEY root@$NODE_IP /root/hosts.sh"
    done
    exit # on master don't execute commands for instances
fi
# END execution on master #################################################
# BEGIN execution in instance ###############################################
usage()
{
cat << EOF
usage: $0 options
This script starts the torque environment.
OPTIONS:
   -h    Show this message
   -n    nodes e.g. "192.168.0.14,192.168.0.14"
   -s    torque server ip e.g. "192.168.0.13"
   -k    key file
   -v    Verbose
   -m    With MPI support
example: start_torque.sh --verbose -s="192.168.0.13" -n="192.168.0.14,192.168.0.17,192.168.0.45" -k="~/.euca/mykey.priv"
EOF
}
#SERVER_IP NIPS KEY
#if [[ -z $PUBLIC_NODES_IP ]] || [[ -z $PUBLIC_TORQUE_SERVER_IP ]]
#then
#    usage
#    exit 1
#fi
function install_package {
    PACKAGE=$1
    if [ "`dpkg-query -W -f='${Status}\n' $PACKAGE`" != "install ok installed" ] ; then
        apt-get -o Dpkg::Options::="--force-confnew" --force-yes -y install $PACKAGE
        #aptitude -y install $PACKAGE
        if [ $? -ne 0 ] ; then
            echo "apt-get install $PACKAGE failed"
        fi
    else
        echo "package $PACKAGE is already installed"
    fi
}
export DEBIAN_FRONTEND="noninteractive"
export APT_LISTCHANGES_FRONTEND="none"
API_VERSION="2008-02-01"
METADATA_URL="http://169.254.169.254/$API_VERSION/meta-data"
CURL="/usr/bin/curl"
# those variables are needed for the locales package
export LANGUAGE=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
# for dialog frontend
export PATH=$PATH:/sbin:/usr/sbin:/usr/local/sbin
export TERM=linux
PRIVATE_TORQUE_SERVER_IP="172.16.1.2"
PRIVATE_NODES="172.16.1.2 172.16.1.3 172.16.1.4 172.16.1.5 172.16.1.6 172.16.1.7 172.16.1.8"
MODE="public"
#MODE="private"
#MODE="system"
PUBLIC_TORQUE_SERVER_HOSTNAME=ip-`echo $PUBLIC_TORQUE_SERVER_IP | sed 's/\./-/g'`
echo $PUBLIC_TORQUE_SERVER_IP $PUBLIC_TORQUE_SERVER_HOSTNAME
PRIVATE_TORQUE_SERVER_HOSTNAME=ip-`echo $PRIVATE_TORQUE_SERVER_IP | sed 's/\./-/g'`
echo $PRIVATE_TORQUE_SERVER_IP $PRIVATE_TORQUE_SERVER_HOSTNAME
#GET INSTANCE IPs, create hostnames
if [ $MODE == "public" ] ; then
    PUBLIC_INSTANCE_IP=`/sbin/ifconfig eth0 | grep "inet addr" | awk '{print $2}' | sed 's/addr\://'`
else
    #PUBLIC_INSTANCE_IP=192.168.0.115
    PUBLIC_INSTANCE_IP=`curl -s $METADATA_URL/public-ipv4`
fi
#PUBLIC_INSTANCE_HOSTNAME=`curl -s $METADATA_URL/public-hostname`
PUBLIC_INSTANCE_HOSTNAME=ip-`echo $PUBLIC_INSTANCE_IP | sed 's/\./-/g'`
echo $PUBLIC_INSTANCE_IP $PUBLIC_INSTANCE_HOSTNAME
PRIVATE_INSTANCE_IP=`/sbin/ifconfig eth0 | grep "inet addr" | awk '{print $2}' | sed 's/addr\://'`
PRIVATE_INSTANCE_HOSTNAME=ip-`echo $PRIVATE_INSTANCE_IP | sed 's/\./-/g'`
echo $PRIVATE_INSTANCE_IP $PRIVATE_INSTANCE_HOSTNAME
#using PUBLIC or PRIVATE interface
if [ $MODE == "public" ] ; then
    INSTANCE_HOSTNAME=$PUBLIC_INSTANCE_HOSTNAME
    NODES=$PUBLIC_NODES_IP
    INSTANCE_IP=$PUBLIC_INSTANCE_IP
    TORQUE_SERVER_IP=$PUBLIC_TORQUE_SERVER_IP
    TORQUE_SERVER_HOSTNAME=$PUBLIC_TORQUE_SERVER_HOSTNAME
else
    if [ $MODE == "private" ] ; then
        INSTANCE_HOSTNAME=$PRIVATE_INSTANCE_HOSTNAME
        NODES=$PRIVATE_NODES
        INSTANCE_IP=$PRIVATE_INSTANCE_IP
        TORQUE_SERVER_IP=$PRIVATE_TORQUE_SERVER_IP
        TORQUE_SERVER_HOSTNAME=$PRIVATE_TORQUE_SERVER_HOSTNAME
    else
        echo "please specify private or public interface"
    fi
fi
# using Google's nameserver
echo "nameserver 8.8.8.8" >> /etc/resolv.conf
# update aptitude first
#echo "deb http://ftp.us.debian.org/debian squeeze main" > /etc/apt/sources.list
#echo "deb http://security.debian.org/ squeeze/updates main" >> /etc/apt/sources.list
#aptitude update
apt-get -o Dpkg::Options::="--force-confnew" --force-yes -y update
if [ $? -ne 0 ] ; then
    echo "apt-get update failed"
fi
# get rid of some error messages because of missing locales package
install_package locales
echo "en_US.UTF-8 UTF-8" > /etc/locale.gen
locale-gen
# install portmap for NFS
install_package portmap
#TODO mount here
# install nmap
install_package nmap
nmap localhost -p 1-20000
# install lsb-release
install_package lsb-release
# Print some Information about the Operating System
DISTRIBUTOR=`lsb_release -i | awk '{print $3}'`
CODENAME=`lsb_release -c | awk '{print $2}'`
echo $DISTRIBUTOR $CODENAME
# install ntpdate
install_package ntpdate
###ntpdate pool.ntp.org
ntpdate ntp.ubuntu.com
# install OpenMPI packages
if [ $MPI -eq 1 ] ; then
    install_package "libopenmpi-dev"
    install_package "openmpi-bin"
    #compile MPI test program
    bash compileMPI.sh
fi
# make hostnames known to all the TORQUE nodes and server/scheduler
if [ $MODE == "private" ] ; then
    for NODE_IP in `echo $PRIVATE_NODES`
    do
        NODE_HOSTNAME=ip-`echo $NODE_IP | sed 's/\./-/g'`
        echo "$NODE_IP $NODE_HOSTNAME" >> /etc/hosts
        #MPI support
        mkdir -p /etc/torque
        echo "$NODE_HOSTNAME slots=1" >> /etc/torque/hostfile
    done
fi
if [ $MODE == "public" ] ; then
    for NODE_IP in `echo $PUBLIC_NODES_IP`
    do
        NODE_HOSTNAME=ip-`echo $NODE_IP | sed 's/\./-/g'`
        if [ $INSTANCE_IP != $TORQUE_SERVER_IP ] || [ $NODE_IP != $TORQUE_SERVER_IP ]; then
            if ! egrep -q "$NODE_IP|$NODE_HOSTNAME" /etc/hosts ; then
                echo "$NODE_IP $NODE_HOSTNAME" >> /etc/hosts
            fi
        fi
        #MPI support
        mkdir -p /etc/torque
        if ! egrep -q "$NODE_HOSTNAME" /etc/torque/hostfile ; then
            echo "$NODE_HOSTNAME slots=1" >> /etc/torque/hostfile
            echo "(su - guest -c \"ssh -t -t -o StrictHostKeychecking=no guest@$NODE_HOSTNAME echo ''\")& wait" >> /root/hosts.sh # for key distribution
        fi
    done
    if ! egrep -q "$PUBLIC_TORQUE_SERVER_HOSTNAME" /etc/torque/hostfile ; then
        echo "(su - guest -c \"ssh -t -t -o StrictHostKeychecking=no guest@$PUBLIC_TORQUE_SERVER_HOSTNAME echo ''\")& wait" >> /root/hosts.sh # for key distribution
    fi
fi
chmod 755 /root/hosts.sh
## on TORQUE server
if [ $INSTANCE_IP == $TORQUE_SERVER_IP ]; then
    #this one is for the scheduler, if using the public interface
    if ! egrep -q "127.0.1.1|$PUBLIC_INSTANCE_HOSTNAME" /etc/hosts ; then
        echo "127.0.1.1 $PUBLIC_INSTANCE_HOSTNAME" >> /etc/hosts
    fi
    # echo "$PRIVATE_INSTANCE_IP $PRIVATE_INSTANCE_HOSTNAME" >> /etc/hosts
else
    if ! egrep -q "$TORQUE_SERVER_IP|$TORQUE_SERVER_HOSTNAME" /etc/hosts ; then
        echo "$TORQUE_SERVER_IP $TORQUE_SERVER_HOSTNAME" >> /etc/hosts
    fi
fi
# need to set a hostname before installing torque packages
echo $INSTANCE_HOSTNAME > /etc/hostname # preserve hostname if rebooting is necessary
hostname $INSTANCE_HOSTNAME # immediately change
#getent hosts `hostname`
#PUBLIC_INSTANCE_HOSTNAME=`curl -s $METADATA_URL/public-hostname`
#echo "deb http://ftp.us.debian.org/debian sid main" > /etc/apt/sources.list
apt-get -o Dpkg::Options::="--force-confnew" --force-yes -y update
if [ $INSTANCE_IP == $TORQUE_SERVER_IP ]; then
    apt-get -o Dpkg::Options::="--force-confnew" --force-yes -y install torque-server torque-scheduler torque-client
    #aptitude -y install torque-server torque-scheduler torque-client
fi
if [[ $PUBLIC_NODES_IP == *$INSTANCE_IP* ]]; then
    apt-get -o Dpkg::Options::="--force-confnew" --force-yes -y install torque-mom
    #aptitude -y install torque-mom
fi
## fix /tmp directory in debian eucalyptus image
chmod 777 /tmp
## add user to all nodes
USER=auser
if id $USER > /dev/null 2>&1
then
    echo "user exist!"
else
    adduser $USER --disabled-password --gecos ""
fi
#echo $PUBLIC_TORQUE_SERVER_HOSTNAME > /etc/torque/server_name
#echo $PUBLIC_INSTANCE_HOSTNAME > /etc/hostname # preserve hostname if rebooting is necessary
#hostname $PUBLIC_INSTANCE_HOSTNAME # immediately change
DATE=`date '+%Y%m%d'`
## for TORQUE mom
if [[ $PUBLIC_NODES_IP == *$INSTANCE_IP* ]]; then
    echo $TORQUE_SERVER_HOSTNAME > /etc/torque/server_name
    echo "\$timeout 120" > /var/spool/torque/mom_priv/config # more options possible (NFS...)
    echo "\$loglevel 5" >> /var/spool/torque/mom_priv/config # more options possible (NFS...)
    /etc/init.d/torque-mom restart
    cat /var/spool/torque/mom_logs/$DATE
fi
## for TORQUE server
if [ $INSTANCE_IP == $TORQUE_SERVER_IP ]; then
    echo $TORQUE_SERVER_HOSTNAME > /etc/torque/server_name
    rm -f /var/spool/torque/server_priv/nodes
    touch /var/spool/torque/server_priv/nodes
    for NODE_IP in `echo $NODES`
    do
        NODE_HOSTNAME=ip-`echo $NODE_IP | sed 's/\./-/g'`
        echo -ne "$NODE_HOSTNAME np=1\n" >> /var/spool/torque/server_priv/nodes
    done
    /etc/init.d/torque-server restart
    /etc/init.d/torque-scheduler restart
    qmgr -c "s s scheduling=true"
    qmgr -c "c q batch queue_type=execution"
    qmgr -c "s q batch started=true"
    qmgr -c "s q batch enabled=true"
    qmgr -c "s q batch resources_default.nodes=1"
    qmgr -c "s q batch resources_default.walltime=3600"
    # had to set this for MPI, TODO: double check
    qmgr -c "s q batch resources_min.nodes=1"
    qmgr -c "s s default_queue=batch"
    # let all nodes submit jobs, not only the server
    qmgr -c "s s allow_node_submit=true"
    #qmgr -c 'set server submit_hosts += $TORQUE_SERVER_IP'
    #qmgr -c 'set server submit_hosts += $INSTANCE_IP'
    # adding extra nodes
    #qmgr -c "create node $INSTANCE_HOSTNAME"
    #debug
    cat /var/spool/torque/server_logs/$DATE
    qstat -q
    pbsnodes -a
    cat /etc/torque/server_name
fi
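Two small idioms carry much of the script: stripping the option name from arguments such as -s=192.168.0.14, and deriving a hostname of the form ip-192-168-0-14 from an address. The sketch below shows both in isolation, using bash parameter expansion in place of the sed calls. The function names ip_to_hostname and option_value are hypothetical and do not appear in the script itself.

```shell
#!/bin/bash
# Hypothetical helpers illustrating two idioms used by the setup script.

# "192.168.0.14" -> "ip-192-168-0-14": replace every dot with a dash
ip_to_hostname() {
    printf 'ip-%s\n' "${1//./-}"
}

# "-s=192.168.0.14" -> "192.168.0.14": drop everything up to the first "="
option_value() {
    printf '%s\n' "${1#*=}"
}

ip_to_hostname 192.168.0.14                             # prints ip-192-168-0-14
option_value --torque-nodes=192.168.0.14,192.168.0.15   # prints 192.168.0.14,192.168.0.15
```

Parameter expansion avoids starting a sed process for every argument, at the cost of requiring bash rather than a plain POSIX shell for the ${1//./-} form.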
See also
http://web.mit.edu/star/cluster StarCluster installs Sun Grid Engine with Amazon Spot Instances
This is all LGPLed work on GitHub. The developers provide Ubuntu images, though it seems some help is needed to integrate it properly with either distribution.