Differences between revisions 9 and 10
Revision 9 as of 2010-09-13 03:23:16
Size: 10423
Editor: DominiqueBelhachemi
Comment:
Revision 10 as of 2010-09-13 03:31:53
Size: 10704
Editor: DominiqueBelhachemi
Comment:

Running Torque inside of Eucalyptus

We describe how to set up a Torque cluster inside a Eucalyptus cloud.

TORQUE

Debian TORQUE package

 $ source ~/.euca/eucarc

Specify a Squeeze image

 $ EMI=emi-1AF00C98

Start two instances of our Squeeze image

 $ euca-run-instances $EMI -k mykey -t c1.medium -n2
 RESERVATION    r-4488080C      myuser  myuser-default
 INSTANCE       i-57E309BE      emi-1AF00C98    0.0.0.0 0.0.0.0 pending mykey   2010-09-13T02:31:51.172Z        eki-D224100C    eri-059910F2
 INSTANCE       i-4C1F0986      emi-1AF00C98    0.0.0.0 0.0.0.0 pending mykey   2010-09-13T02:31:51.173Z        eki-D224100C    eri-059910F2

After a few seconds both instances should be running

 $ euca-describe-instances 
 RESERVATION    r-4488080C      myuser default
 INSTANCE       i-4C1F0986      emi-1AF00C98    192.168.0.14    192.168.0.14    running         mykey   1       c1.medium       2010-09-13T02:31:51.173Z   mycloud    eki-D224100C    eri-059910F2
 INSTANCE       i-57E309BE      emi-1AF00C98    192.168.0.15    192.168.0.15    running         mykey   0       c1.medium       2010-09-13T02:31:51.172Z   mycloud    eki-D224100C    eri-059910F2
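
The IP addresses needed in the next step can be pulled out of this output instead of being copied by hand. The following is a minimal sketch, assuming the column layout shown above (record type in field 1, public IP in field 4, state in field 6). The function name `running_ips` is made up for illustration.

```shell
# Print the public IP of every running instance, one per line.
# Reads euca-describe-instances output on stdin.
# Assumes the column layout shown above: $1 = record type, $4 = public IP, $6 = state.
running_ips() {
    awk '$1 == "INSTANCE" && $6 == "running" { print $4 }'
}

# Usage (not executed here):
#   euca-describe-instances | running_ips | paste -sd, -
```

Joining the lines with `paste -sd, -` gives a comma-separated list in the form used for the node list below.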

Suppose you want to start a Torque server on 192.168.0.14 and two MPI-enabled Torque worker nodes on 192.168.0.14 and 192.168.0.15:

 $ bash start_torque.sh -s="192.168.0.14" -n="192.168.0.14,192.168.0.15" -k="~/.euca/mykey.priv" -m=1

This installs all necessary Torque packages on the instances. It may take a few minutes, depending on the Internet connection and the processor speed of the instances.
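
The source of start_torque.sh is not shown on this page. As a purely hypothetical sketch, a script that accepts options in the `-x=value` form used above could parse them as follows. The function and variable names are assumptions. Note that the shell does not expand a `~` inside double quotes, so the sketch expands a leading `~/` in the key path itself.

```shell
# Hypothetical sketch, NOT the actual start_torque.sh:
# parse options of the form -s=..., -n=..., -k=..., -m=...
parse_args() {
    SERVER_IP=""
    NODE_IPS=""
    KEY_FILE=""
    MPI=0
    for arg in "$@"; do
        case "$arg" in
            -s=*) SERVER_IP=${arg#-s=} ;;   # Torque server IP
            -n=*) NODE_IPS=${arg#-n=} ;;    # comma-separated worker IPs
            -k=*) KEY_FILE=${arg#-k=} ;;    # private key for ssh
            -m=*) MPI=${arg#-m=} ;;         # 1 = enable MPI
            *) echo "unknown option: $arg" >&2; return 1 ;;
        esac
    done
    # A quoted "~/..." reaches the script unexpanded, so expand it here.
    case "$KEY_FILE" in
        "~/"*) rest=${KEY_FILE#'~/'}; KEY_FILE="$HOME/$rest" ;;
    esac
}
```

After `parse_args "$@"`, the values are available in `SERVER_IP`, `NODE_IPS`, `KEY_FILE` and `MPI`.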

Connect to an instance as root with your key

 ssh -X -i ~/.euca/mykey.priv root@192.168.0.14

On the virtual machine, switch to the guest user

 su - guest

Check if nodes are up

 pbsnodes

Perform some simple tests

 echo "sleep 10" | qsub
 echo "sleep 5" | qsub
 echo "hostname" | qsub
 echo "sleep 15" | qsub
 echo "hostname" | qsub
 echo "sleep 3" | qsub

Look at the queue

 qstat
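
To follow the queue from a script rather than by eye, the job states can be counted. This is a minimal sketch, assuming the default `qstat` layout of two header lines followed by one row per job with the state letter (for example R for running, Q for queued) in the fifth column. The function name `count_jobs` is made up for illustration.

```shell
# Count jobs in a given state from default qstat output on stdin.
# Assumes two header lines, then rows of:
#   Job id, Name, User, Time Use, S (state), Queue
count_jobs() {
    awk -v s="$1" 'NR > 2 && $5 == s { n++ } END { print n + 0 }'
}

# Usage (not executed here):
#   qstat | count_jobs Q
```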

Run a sleep job across 2 worker nodes

 echo "sleep 10" | qsub -l nodes=2

Check if both nodes are in state 'job-exclusive'

 pbsnodes
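
The state check can also be scripted. The sketch below assumes the usual `pbsnodes` layout, where each node name starts in the first column and is followed by indented `key = value` lines. The function name `nodes_in_state` is made up for illustration.

```shell
# Print the names of all nodes whose state matches $1.
# Reads pbsnodes output on stdin. Assumes node names are unindented
# and attributes follow as indented "key = value" lines.
nodes_in_state() {
    awk -v want="$1" '
        /^[^ \t]/ { node = $1 }    # unindented line: remember the node name
        $1 == "state" && $2 == "=" && $3 == want { print node }
    '
}

# Usage (not executed here):
#   pbsnodes | nodes_in_state job-exclusive
```

While the two-node sleep job is running, both node names should be printed.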

During the installation phase we compiled a simple MPI "HelloWorld" program.

Start it without Torque

$ mpiexec -n 4 /tmp/hello.out
Hello MPI from the server process!
Hello MPI!
 mesg from 1 of 4 on ip-192-168-0-14
Hello MPI!
 mesg from 2 of 4 on ip-192-168-0-14
Hello MPI!
 mesg from 3 of 4 on ip-192-168-0-14

Start it with Torque (without -tm support)

# quote EOF so that $PBS_O_WORKDIR is expanded when the job runs, not when this file is written
cat <<'EOF' > mpi-test_1_2_mpirun
#PBS -N petsc
#PBS -l nodes=1:ppn=2
cd $PBS_O_WORKDIR

/usr/bin/mpirun -np 2 --hostfile /etc/torque/hostfile -v -v -v /tmp/hello.out
EOF

qsub mpi-test_1_2_mpirun
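
To try other node and process counts, the job script can be generated from parameters. This is a minimal sketch. The function name `mpi_job_script` is an assumption, and the process count is simply taken as nodes times ppn. Because the sketch interpolates values, its heredoc delimiter is left unquoted, so `$PBS_O_WORKDIR` is escaped to reach the job unexpanded.

```shell
# Write a Torque job script for the MPI test program to stdout.
# $1 = number of nodes, $2 = processes per node (ppn).
mpi_job_script() {
    nodes=$1
    ppn=$2
    np=$((nodes * ppn))    # total MPI processes
    cat <<EOF
#PBS -N petsc
#PBS -l nodes=${nodes}:ppn=${ppn}
cd \$PBS_O_WORKDIR

/usr/bin/mpirun -np ${np} --hostfile /etc/torque/hostfile -v -v -v /tmp/hello.out
EOF
}

# Usage (not executed here):
#   mpi_job_script 1 2 > mpi-test_1_2_mpirun
#   qsub mpi-test_1_2_mpirun
```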

Start it with Torque (with -tm support)

TODO: the package is ready but not in squeeze yet

Example script for setting up Torque:

set -ex

function install_package {
 PACKAGE=$1

 # dpkg-query complains on stderr about unknown packages, so silence it
 if [ "`dpkg-query -W -f='${Status}\n' $PACKAGE 2>/dev/null`" != "install ok installed" ] ; then
  # test apt-get directly: with "set -e" a later "$?" check would never be reached
  if ! apt-get -o Dpkg::Options::="--force-confnew" --force-yes -y install $PACKAGE ; then
   echo "apt-get install $PACKAGE failed"
  fi
  #aptitude -y install $PACKAGE
 else
  echo "package $PACKAGE is already installed"
 fi
}


export DEBIAN_FRONTEND="noninteractive"
export APT_LISTCHANGES_FRONTEND="none"
API_VERSION="2008-02-01"
METADATA_URL="http://169.254.169.254/$API_VERSION/meta-data"
CURL="/usr/bin/curl"


# those variables are needed for the locales package
export LANGUAGE=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

# for dialog frontend
export PATH=$PATH:/sbin:/usr/sbin:/usr/local/sbin
export TERM=linux


#SET values
#conf1
PUBLIC_TORQUE_SERVER_IP="192.168.0.2"
PUBLIC_NODES="192.168.0.3    192.168.0.4   192.168.0.5   192.168.0.6   192.168.0.7   192.168.0.8"

PRIVATE_TORQUE_SERVER_IP="172.16.1.2"
PRIVATE_NODES="172.16.1.2   172.16.1.3   172.16.1.4   172.16.1.5   172.16.1.6   172.16.1.7   172.16.1.8"



MODE="public"
#MODE="private"


PUBLIC_TORQUE_SERVER_HOSTNAME=ip-`echo $PUBLIC_TORQUE_SERVER_IP | sed 's/\./-/g'`
echo $PUBLIC_TORQUE_SERVER_IP $PUBLIC_TORQUE_SERVER_HOSTNAME

PRIVATE_TORQUE_SERVER_HOSTNAME=ip-`echo $PRIVATE_TORQUE_SERVER_IP | sed 's/\./-/g'`
echo $PRIVATE_TORQUE_SERVER_IP $PRIVATE_TORQUE_SERVER_HOSTNAME


#GET INSTANCE IPs, create hostnames
PUBLIC_INSTANCE_IP=`curl -s $METADATA_URL/public-ipv4`
#PUBLIC_INSTANCE_IP=192.168.0.115
#PUBLIC_INSTANCE_HOSTNAME=`curl -s $METADATA_URL/public-hostname`
PUBLIC_INSTANCE_HOSTNAME=ip-`echo $PUBLIC_INSTANCE_IP | sed 's/\./-/g'`
echo $PUBLIC_INSTANCE_IP $PUBLIC_INSTANCE_HOSTNAME

PRIVATE_INSTANCE_IP=`/sbin/ifconfig eth0 | grep "inet addr" | awk '{print $2}' | sed 's/addr\://'`
PRIVATE_INSTANCE_HOSTNAME=ip-`echo $PRIVATE_INSTANCE_IP | sed 's/\./-/g'`
echo $PRIVATE_INSTANCE_IP $PRIVATE_INSTANCE_HOSTNAME



#using PUBLIC or PRIVATE interface
if [ "$MODE" == "public" ] ; then
   INSTANCE_HOSTNAME=$PUBLIC_INSTANCE_HOSTNAME
   NODES=$PUBLIC_NODES
   INSTANCE_IP=$PUBLIC_INSTANCE_IP
   TORQUE_SERVER_IP=$PUBLIC_TORQUE_SERVER_IP
   TORQUE_SERVER_HOSTNAME=$PUBLIC_TORQUE_SERVER_HOSTNAME
elif [ "$MODE" == "private" ] ; then
   INSTANCE_HOSTNAME=$PRIVATE_INSTANCE_HOSTNAME
   NODES=$PRIVATE_NODES
   INSTANCE_IP=$PRIVATE_INSTANCE_IP
   TORQUE_SERVER_IP=$PRIVATE_TORQUE_SERVER_IP
   TORQUE_SERVER_HOSTNAME=$PRIVATE_TORQUE_SERVER_HOSTNAME
else
   echo "please specify private or public interface"
   # the variables above are required by everything that follows
   exit 1
fi


# using Google's nameserver
echo "nameserver 8.8.8.8" >> /etc/resolv.conf


# update aptitude first
#echo "deb http://ftp.us.debian.org/debian squeeze main" > /etc/apt/sources.list
#echo "deb http://security.debian.org/ squeeze/updates main" >> /etc/apt/sources.list
#aptitude update
# test apt-get directly: with "set -e" a later "$?" check would never be reached
if ! apt-get -o Dpkg::Options::="--force-confnew" --force-yes -y update ; then
echo "apt-get update failed"
fi



# get rid of some error messages because of missing locales package
install_package locales
echo "en_US.UTF-8 UTF-8" > /etc/locale.gen
locale-gen

# install portmap for NFS
install_package portmap
#TODO mount here


# install nmap
install_package nmap
nmap localhost -p 1-20000

# install lsb-release
install_package lsb-release

# Print some Information about the Operating System
DISTRIBUTOR=`lsb_release -i | awk '{print $3}'`
CODENAME=`lsb_release -c | awk '{print $2}'`
echo $DISTRIBUTOR $CODENAME


# install ntpdate
install_package ntpdate
###ntpdate pool.ntp.org
ntpdate ntp.ubuntu.com

# install libopenmpi-dev
install_package "libopenmpi-dev"

# install openmpi-bin
install_package "openmpi-bin"

# make hostnames known to all the TORQUE nodes and server/scheduler

# NODES was set above according to MODE, so one loop covers both cases
# /etc/torque may not exist yet, because the torque packages are installed further below
mkdir -p /etc/torque
for NODE_IP in $NODES
do
   NODE_HOSTNAME=ip-`echo $NODE_IP | sed 's/\./-/g'`
   echo "$NODE_IP   $NODE_HOSTNAME" >> /etc/hosts
   #MPI support
   echo "$NODE_IP   $NODE_HOSTNAME" >> /etc/torque/hostfile
done




## on TORQUE server
if [ $INSTANCE_IP == $TORQUE_SERVER_IP ]; then
   #this one is for the scheduler, if using the public interface
   echo "127.0.1.1 $PUBLIC_INSTANCE_HOSTNAME" >> /etc/hosts

   echo "$PRIVATE_INSTANCE_IP $PRIVATE_INSTANCE_HOSTNAME" >> /etc/hosts
else
   echo "$TORQUE_SERVER_IP $TORQUE_SERVER_HOSTNAME" >> /etc/hosts
fi


# need to set a hostname before installing torque packages
echo $INSTANCE_HOSTNAME > /etc/hostname # preserve hostname if rebooting is necessary
hostname $INSTANCE_HOSTNAME # immediately change
#getent hosts `hostname`
#PUBLIC_INSTANCE_HOSTNAME=`curl -s $METADATA_URL/public-hostname`


#echo "deb http://ftp.us.debian.org/debian sid main" > /etc/apt/sources.list
apt-get -o Dpkg::Options::="--force-confnew" --force-yes -y update
if [ $INSTANCE_IP == $TORQUE_SERVER_IP ]; then
   apt-get -o Dpkg::Options::="--force-confnew" --force-yes -y install torque-mom torque-server torque-scheduler torque-client
   #aptitude -y install torque-mom torque-server torque-scheduler torque-client
else
   apt-get -o Dpkg::Options::="--force-confnew" --force-yes -y install torque-mom
   #aptitude -y install torque-mom
fi


## fix /tmp directory in debian eucalyptus image
# 1777 = world-writable with the sticky bit, the usual mode for /tmp
chmod 1777 /tmp
## add user to all nodes
USER=userA

if id $USER > /dev/null 2>&1
then
   echo "user $USER already exists"
else
   adduser $USER --disabled-password --gecos ""
fi



#echo $PUBLIC_TORQUE_SERVER_HOSTNAME > /etc/torque/server_name
#echo $PUBLIC_INSTANCE_HOSTNAME > /etc/hostname # preserve hostname  if rebooting is necessary
#hostname $PUBLIC_INSTANCE_HOSTNAME # immediately change


DATE=`date '+%Y%m%d'`

## on TORQUE mom
echo $TORQUE_SERVER_HOSTNAME > /etc/torque/server_name
echo "\$timeout 120" > /var/spool/torque/mom_priv/config # more options possible (NFS...)
echo "\$loglevel 5" >> /var/spool/torque/mom_priv/config # more options possible (NFS...)

/etc/init.d/torque-mom restart
cat /var/spool/torque/mom_logs/$DATE


## on TORQUE server
if [ $INSTANCE_IP == $TORQUE_SERVER_IP ]; then
   echo $TORQUE_SERVER_HOSTNAME > /etc/torque/server_name
   rm -f /var/spool/torque/server_priv/nodes
   touch /var/spool/torque/server_priv/nodes
   for NODE_IP in `echo $NODES`
   do
      NODE_HOSTNAME=ip-`echo $NODE_IP | sed 's/\./-/g'`
      echo -ne "$NODE_HOSTNAME np=1\n" >> /var/spool/torque/server_priv/nodes
   done
   /etc/init.d/torque-server restart
   /etc/init.d/torque-scheduler restart
   qmgr -c "s s scheduling=true"
   qmgr -c "c q batch queue_type=execution"
   qmgr -c "s q batch started=true"
   qmgr -c "s q batch enabled=true"
   qmgr -c "s q batch resources_default.nodes=1"
   qmgr -c "s q batch resources_default.walltime=3600"
   # had to set this for MPI, TODO: double check
   qmgr -c "s q batch resources_min.nodes=1"
   qmgr -c "s s default_queue=batch"
   # let all nodes submit jobs, not only the server
   qmgr -c "s s allow_node_submit=true"
   #qmgr -c 'set server submit_hosts += $TORQUE_SERVER_IP'
   #qmgr -c 'set server submit_hosts += $INSTANCE_IP'

   # adding extra nodes
   #qmgr -c "create node $INSTANCE_HOSTNAME"


   cat /var/spool/torque/server_logs/$DATE
   qstat -q
   pbsnodes -a
fi
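
The script derives the `ip-a-b-c-d` hostname from an IP address in several places. As a sketch of how that step could be factored out, the helpers below do the same mapping. The names `ip_to_hostname` and `hosts_entries` are made up for illustration and do not appear in the script above.

```shell
# Map a dotted IPv4 address to the ip-a-b-c-d hostname form used in the script.
ip_to_hostname() {
    printf 'ip-%s\n' "$1" | tr '.' '-'
}

# Print one "IP   hostname" line per address given as an argument,
# in the format the script appends to /etc/hosts.
hosts_entries() {
    for ip in "$@"; do
        echo "$ip   $(ip_to_hostname "$ip")"
    done
}

# Usage (not executed here):
#   hosts_entries $NODES >> /etc/hosts
```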