MPI Job Submission

1.  MPI job on the Grid


This page demonstrates how to run a simple MPI (Message Passing Interface) job on the grid. The scripts, procedures and examples are based on the EGEE guide for MPI submission and on the INFN Grid guide for MPI usage.

MPI support in the Grid environment is still under development and not well maintained. Its purpose is to hide the complexity of each computing site (different MPI implementations, different batch systems, shared or non-shared user home directories, etc.) so that users can submit MPI jobs from the User Interface by specifying only the MPI flavour they want and a few other variables.
The mpi-start package developed by EGEE is meant to achieve this goal by introducing an additional layer between the Resource Broker and MPI. This layer gives the Resource Broker an interface for specifying MPI jobs.

2.  A first simple test


You can do a basic test by logging in to a Worker Node (WN) as a pool user and running the following example. If you have no access to a WN, you can put the same command in a script and submit it through a JDL file.

[gridbox02@ce-2wn2 ~]$ env | grep -i mpi_
MPI_SSH_HOST_BASED_AUTH=yes
MPI_SHARED_HOME=no
MPI_MPICH_VERSION=1.2.7p1
I2G_MPI_START=/opt/i2g/bin/mpi-start
MPI_MPICH_MPIEXEC=/opt/mpiexec-0.82/bin/mpiexec
MPI_MPICH_PATH=/opt/mpich-1.2.7p1/

You should see all the MPI variables that were set in site-info.def when the site was configured.
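
The check above can also be scripted, which is convenient when the command is submitted as a job instead of being typed on the WN. The following is a minimal sketch, not part of mpi-start: the function name check_mpi_env is hypothetical, and the list of variable names is simply copied from the listing above, so adjust it to the flavours your site publishes.

```shell
#!/bin/sh
# Report which of the expected MPI variables are set in the environment.
# check_mpi_env is a hypothetical helper, not part of mpi-start.
check_mpi_env () {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
    eval "value=\${$name-}"
    if [ -n "$value" ]; then
      echo "OK      $name=$value"
    else
      echo "MISSING $name"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when every variable was found.
  [ "$missing" -eq 0 ]
}

# Names taken from the listing above; a site with another flavour will differ.
check_mpi_env I2G_MPI_START MPI_MPICH_PATH MPI_MPICH_VERSION \
  MPI_MPICH_MPIEXEC MPI_SHARED_HOME MPI_SSH_HOST_BASED_AUTH \
  || echo "Some MPI variables are missing on this node."
```

Used as the Executable of a one-CPU test job, the script leaves the report in the job's standard output.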

3.  MPI and mpi-start


The mpi-start package provides an interface to execute MPI jobs on a grid site by introducing a "dummy" mpirun that replaces other implementations of mpirun. This lets you specify a wrapper script (mpi-start-wrapper.sh) as the executable in the JDL file, in place of the standard MPI executable.

3.1  Hello MPI World

To test the MPI support of the grid site, let's create a simple Fortran MPI code, hello.f.

c     Hello MPI World Fortran example  
      program hello
      include 'mpif.h'
      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      print*, 'node', rank, ': Hello world'
      call MPI_FINALIZE(ierror)
      end

3.2  Pre and post processing

The mpi-hooks.sh file provides some pre and post processing operations to perform before and after the code execution. The script can be used to compile the code, to download data, or to analyze results and to save them on a storage element in the grid.
The following example (named "mpi-hooks.sh") compiles the executable with mpif77 before running it; the post-hook only writes a message to the standard output.

#!/bin/sh

# This function will be called before the execution of MPI executable.
# You can, for example, compile the executable itself.
#
pre_run_hook () {

  # Compile the program.
  echo "Compiling ${I2G_MPI_APPLICATION}"

  # Actually compile the program.
  cmd="mpif77 ${MPI_MPICC_OPTS} -o ${I2G_MPI_APPLICATION} ${I2G_MPI_APPLICATION}.f"
  echo $cmd
  $cmd
  if [ ! $? -eq 0 ]; then
    echo "Error compiling program.  Exiting..."
    exit 1
  fi

  # Everything's OK.
  echo "Successfully compiled ${I2G_MPI_APPLICATION}"

  return 0
}

# This function will be called after  the execution of MPI executable.
# A typical case for this is to upload the results to a storage element.
post_run_hook () {
  echo "Executing post hook."
  echo "Finished the post hook."

  return 0
}

3.3  mpi-start-wrapper

The mpi-start-wrapper.sh script exports some mandatory variables to prepare the environment for the "dummy" mpirun of the mpi-start package. It points mpi-start at the pre- and post-processing functions defined in the mpi-hooks.sh script and then calls mpi-start.

#!/bin/bash
#
# Pull in the arguments.
MY_EXECUTABLE=`pwd`/$1
MPI_FLAVOR=$2

# Convert flavor to lowercase in order to pass it to mpi-start.
MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`

# Pull out the correct paths for the requested flavor.
eval MPI_PATH=`printenv MPI_${MPI_FLAVOR}_PATH`

# Ensure the prefix is correctly set.  Don't rely on the defaults.
eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
export I2G_${MPI_FLAVOR}_PREFIX

# Touch the executable.  It must exist for the shared file system check.
# If it does not, then mpi-start may try to distribute the executable
# (while it shouldn't do that).
touch $MY_EXECUTABLE

# Setup for mpi-start.
export I2G_MPI_APPLICATION=$MY_EXECUTABLE
export I2G_MPI_APPLICATION_ARGS=
export I2G_MPI_TYPE=$MPI_FLAVOR_LOWER
export I2G_MPI_PRE_RUN_HOOK=mpi-hooks.sh
export I2G_MPI_POST_RUN_HOOK=mpi-hooks.sh

# If these are set then you will get more debugging information.
export I2G_MPI_START_VERBOSE=1
#export I2G_MPI_START_DEBUG=1

# Invoke mpi-start.
$I2G_MPI_START
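
The two lookups the wrapper performs on the flavour (lower-casing it for I2G_MPI_TYPE and reading MPI_<FLAVOR>_PATH) can be tried in isolation. The sketch below is an illustration only: resolve_flavor is a hypothetical function name, and the path is the one from the environment listing in section 2. Unlike the wrapper, it stops with an error when the flavour has no matching variable.

```shell
#!/bin/sh
# Resolve the lower-case type and the install prefix for an MPI flavour,
# mirroring the lookups in mpi-start-wrapper.sh.
# resolve_flavor is a hypothetical helper, not part of mpi-start.
resolve_flavor () {
  flavor=$1
  # mpi-start expects the flavour in lower case (I2G_MPI_TYPE).
  flavor_lower=$(echo "$flavor" | tr '[:upper:]' '[:lower:]')
  # Indirect lookup of MPI_<FLAVOR>_PATH, for example MPI_MPICH_PATH.
  eval "prefix=\${MPI_${flavor}_PATH-}"
  if [ -z "$prefix" ]; then
    echo "MPI_${flavor}_PATH is not set; is $flavor installed here?" >&2
    return 1
  fi
  echo "type=$flavor_lower prefix=$prefix"
}

# Value copied from the environment listing in section 2.
MPI_MPICH_PATH=/opt/mpich-1.2.7p1/
resolve_flavor MPICH
# prints: type=mpich prefix=/opt/mpich-1.2.7p1/
```

Failing early on an unknown flavour is a design choice of this sketch; the wrapper itself passes an empty prefix on to mpi-start.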

3.4  The MPI JDL

In the JDL you have to set JobType to "Normal" and give in NodeNumber the number of CPUs on which you want to execute your code.

[
JobType = "Normal";
NodeNumber = 4;
Executable = "mpi-start-wrapper.sh";
Arguments = "hello MPICH";
StdOutput = "mpi-test.out";
StdError = "mpi-test.err";
InputSandbox = {"mpi-start-wrapper.sh","mpi-hooks.sh","hello.f"};
OutputSandbox = {"mpi-test.err","mpi-test.out","hello"};
]

Submitting this JDL with those files in the InputSandbox should perform the compilation and execution of hello.f on 4 CPUs, in a site supporting MPI and mpi-start. Running the MPI job is no different from any other grid job. Use the commands glite-wms-job-submit, glite-wms-job-status, and glite-wms-job-output to submit, check the status, and recover the output of a job.
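
As a rough illustration of that submit, status, output cycle, the sketch below wraps the three commands in a dry-run helper that only prints them unless DRY_RUN=0 is set. The file names (mpi-test.jdl, jobid.txt, mpi-test-output) are hypothetical, and the options shown (-a, -o, -i, --dir) are the ones I understand the gLite WMS command-line tools to accept; check them against the help output on your User Interface.

```shell
#!/bin/sh
# Dry-run helper: prints the gLite WMS commands for one job, or runs them
# when DRY_RUN=0 (which needs a User Interface and a valid proxy).
DRY_RUN=${DRY_RUN:-1}

run () {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Submit the JDL in $1 and store the job identifier in the file $2.
submit_job () { run glite-wms-job-submit -a -o "$2" "$1"; }
# Query the status of the job whose identifier is stored in $1.
job_status () { run glite-wms-job-status -i "$1"; }
# Fetch the OutputSandbox of the job in $1 into the directory $2.
job_output () { run glite-wms-job-output --dir "$2" -i "$1"; }

# Hypothetical file names, for illustration only.
submit_job mpi-test.jdl jobid.txt
job_status jobid.txt
job_output jobid.txt ./mpi-test-output
```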

If the job ran correctly, the standard output should contain something like the following:

Modified mpirun: Executing command: mpi-start-wrapper.sh hello MPICH
************************************************************************
UID     =  gridbox09
HOST    =  ce-2wn2.grid.box
DATE    =  Wed Oct 22 11:59:37 CEST 2008
VERSION =  0.0.52
************************************************************************
mpi-start [INFO   ]: search for scheduler
mpi-start [INFO   ]: activate support for pbs
mpi-start [INFO   ]: activate support for mpich
mpi-start [INFO   ]: call backend MPI implementation
mpi-start [INFO   ]: start program with mpirun
-<START PRE-RUN HOOK>---------------------------------------------------
Compiling /home/gridbox09/globus-tmp.ce-2wn2.4472.0/.mpi/https_3a_2f_2fwms-4.grid.box_3a    ...
mpif77 -o /home/gridbox09/globus-tmp.ce-2wn2.4472.0/.mpi/https_3a_2f_2fwms-4.grid.box_3a900 ...
Successfully compiled /home/gridbox09/globus-tmp.ce-2wn2.4472.0/.mpi/https_3a_2f_2fwms-4.gr ...
-<STOP PRE-RUN HOOK>----------------------------------------------------
=[START]================================================================
 node 2: Hello world
 node 3: Hello world
 node 0: Hello world
 node 1: Hello world
=[FINISHED]=============================================================
-<START POST-RUN HOOK>---------------------------------------------------
Executing post hook.
Finished the post hook.
-<STOP POST-RUN HOOK>----------------------------------------------------
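
A quick sanity check on the retrieved output is to count the "Hello world" lines and compare the count with the NodeNumber requested in the JDL. The following sketch assumes the output format shown above; check_hello_output is a hypothetical helper name.

```shell
#!/bin/sh
# Compare the number of "Hello world" lines in a job's standard output
# with the number of processes requested in the JDL.
# check_hello_output is a hypothetical helper, not part of mpi-start.
check_hello_output () {
  file=$1
  expected=$2
  if [ ! -r "$file" ]; then
    echo "FAIL: cannot read $file"
    return 1
  fi
  # grep -c exits non-zero when nothing matches, hence the "|| true".
  found=$(grep -c 'Hello world' "$file" || true)
  if [ "$found" -eq "$expected" ]; then
    echo "OK: $found of $expected ranks reported"
  else
    echo "FAIL: $found of $expected ranks reported"
    return 1
  fi
}
```

For the job above, the call would be check_hello_output mpi-test.out 4. The order of the lines is not checked, since the ranks print in whatever order they happen to run.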

4.  The Intel MPI Benchmark on the Grid


The Intel MPI Benchmark (IMB, formerly the Pallas MPI Benchmark) is an open-source set of MPI benchmarks that measures the performance of computing platforms and MPI implementations. It is possible to run IMB on a set of Worker Nodes with MPI using the tools described in the previous section.

4.1  Pre and post processing

The pre-run hook in mpi-hooks.sh will:

  • download the IMB package from the storage element to the worker node
  • extract the package
  • compile it with make
  • copy the resulting executable, as IMB, to the directory the hook started in

#!/bin/sh

# This function will be called before the execution of MPI executable.
# You can, for example, compile the executable itself.
#
pre_run_hook () {

  # Compile the program.
  echo "Compiling ${I2G_MPI_APPLICATION}"
  # Download the IMB package from the storage element to the worker node
  lcg-cp --vo gridbox lfn:/grid/gridbox/IMB.tar.gz file://$(pwd)/IMB.tar.gz
  # Extract the archive
  tar -zxvf IMB.tar.gz
  cd IMB_3.0/src
  # Compile it
  make -f make
  if [ ! $? -eq 0 ]; then
    echo "Error compiling program.  Exiting..."
    exit 1
  fi
  # Copy the executable to the starting directory (two levels up) as IMB
  cp IMB-MPI1 ../../IMB
  chmod 775 ../../IMB
  # Everything's OK.
  echo "Successfully compiled ${I2G_MPI_APPLICATION}"
  return 0
}

# This function will be called after  the execution of MPI executable.
# A typical case for this is to upload the results to a storage element.
post_run_hook () {
  echo "Executing post hook."
  echo "Finished the post hook."
  return 0
}

4.2  The mpi-start wrapper for IMB

The mpi-start-wrapper.sh is the same as in the previous section, except for an additional MY_ARGUMENTS variable. It is read from the third argument and exported as I2G_MPI_APPLICATION_ARGS, so that you can select a specific MPI benchmark.

#!/bin/bash
#
# Pull in the arguments.
MY_EXECUTABLE=`pwd`/$1
MPI_FLAVOR=$2
MY_ARGUMENTS=$3

# Convert flavor to lowercase in order to pass it to mpi-start.
MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`

# Pull out the correct paths for the requested flavor.
eval MPI_PATH=`printenv MPI_${MPI_FLAVOR}_PATH`

# Ensure the prefix is correctly set.  Don't rely on the defaults.
eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
export I2G_${MPI_FLAVOR}_PREFIX

# Touch the executable.  It must exist for the shared file system check.
# If it does not, then mpi-start may try to distribute the executable
# (while it shouldn't do that).
touch $MY_EXECUTABLE

# Setup for mpi-start.
export I2G_MPI_APPLICATION=$MY_EXECUTABLE
export I2G_MPI_APPLICATION_ARGS=$MY_ARGUMENTS
export I2G_MPI_TYPE=$MPI_FLAVOR_LOWER
export I2G_MPI_PRE_RUN_HOOK=mpi-hooks.sh
export I2G_MPI_POST_RUN_HOOK=mpi-hooks.sh

# If these are set then you will get more debugging information.
export I2G_MPI_START_VERBOSE=1
#export I2G_MPI_START_DEBUG=1

# Invoke mpi-start.
$I2G_MPI_START
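
Because the wrapper keeps only $3, a second benchmark name in the Arguments attribute would be dropped. One possible variation, sketched below, collects everything after the flavour into MY_ARGUMENTS. The function name parse_wrapper_args is hypothetical, and the sketch assumes that mpi-start splits I2G_MPI_APPLICATION_ARGS on whitespace when it builds the command line, which should be verified on your installation.

```shell
#!/bin/sh
# Split the wrapper arguments into executable, flavour and "everything else".
# Hypothetical variation on the wrapper above, which keeps only $3.
parse_wrapper_args () {
  if [ "$#" -lt 2 ]; then
    echo "usage: mpi-start-wrapper.sh EXECUTABLE FLAVOR [ARGS...]" >&2
    return 1
  fi
  MY_EXECUTABLE=$(pwd)/$1
  MPI_FLAVOR=$2
  shift 2
  # Join the remaining words with single spaces (empty if there are none).
  MY_ARGUMENTS="$*"
}

# Example: two benchmark names after the flavour.
parse_wrapper_args IMB MPICH pingpong sendrecv
echo "application: $MY_EXECUTABLE"
echo "flavor:      $MPI_FLAVOR"
echo "arguments:   $MY_ARGUMENTS"
```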

4.3  The MPI JDL for IMB

In the JDL you can specify which MPI benchmark IMB will run (in this case pingpong, the third word in the Arguments attribute). You can omit it, but IMB will then run all the benchmarks, which takes around 30 minutes on two gridseed worker nodes.

[
JobType = "MPICH";
CpuNumber = 4;
Executable = "mpi-start-wrapper.sh";
Arguments = "IMB MPICH pingpong";
StdOutput = "mpi-test.out";
StdError = "mpi-test.err";
InputSandbox = {"mpi-start-wrapper.sh","mpi-hooks.sh"};
OutputSandbox = {"mpi-test.err","mpi-test.out"};
]

4.4  The IMB pingpong output

This is the output you should get in mpi-test.out.

=[START]================================================================
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.0, MPI-1 part    
#---------------------------------------------------
# Date                  : Thu Nov 13 10:31:05 2008
# Machine               : i686
# System                : Linux
# Release               : 2.6.9-78.0.1.EL.cernsmp
# Version               : #1 SMP Tue Aug 5 11:10:20 CEST 2008
# MPI Version           : 1.2
# MPI Thread Environment: MPI_THREAD_FUNNELED

#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong 
# #processes = 2 
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        40.46         0.00
            1         1000        38.03         0.03
            2         1000        33.92         0.06
            4         1000        36.54         0.10
            8         1000        36.05         0.21
           16         1000        36.12         0.42
           32         1000        37.16         0.82
           64         1000        35.29         1.73
          128         1000        63.11         1.93
          256         1000        37.26         6.55
          512         1000        35.26        13.85
         1024         1000        37.56        26.00
         2048         1000        35.52        54.99
         4096         1000        91.71        42.59
         8192         1000        47.52       164.41
        16384         1000        87.96       177.64
        32768         1000        69.58       449.15
        65536          640       178.73       349.68
       131072          320       356.79       350.34
       262144          160       623.29       401.10
       524288           80      1274.07       392.44
      1048576           40      3165.84       315.87
      2097152           20     15574.10       128.42
      4194304           10     24705.30       161.91
=[FINISHED]=============================================================
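
To pull a single figure out of a table like this one, the data rows can be filtered with awk. The sketch below assumes the four-column PingPong layout shown above and reports the message size with the highest bandwidth; imb_peak_bandwidth is a hypothetical helper name.

```shell
#!/bin/sh
# Print the message size with the highest bandwidth in an IMB PingPong table.
# Data rows are recognised as lines with exactly four fields, the first an
# integer and the last a decimal number; everything else is skipped.
imb_peak_bandwidth () {
  awk '
    NF == 4 && $1 ~ /^[0-9]+$/ && $4 ~ /^[0-9.]+$/ {
      if ($4 + 0 > best) { best = $4 + 0; size = $1 }
    }
    END {
      # No row with a non-zero bandwidth was found.
      if (size == "") exit 1
      printf "peak %.2f Mbytes/sec at %d bytes\n", best, size
    }
  ' "$1"
}
```

Run as imb_peak_bandwidth mpi-test.out on the output above, it prints "peak 449.15 Mbytes/sec at 32768 bytes".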