Parallel Computing

Many data processing operations are data-parallel: different traces, shot gathers, frequency slices, etc. can be processed independently. Madagascar provides several mechanisms for handling such embarrassingly parallel applications on computers with multiple processors.

OpenMP (internal)

OpenMP is a standard framework for parallel applications on shared-memory systems. It is supported by GCC (since version 4.2) and by most other major compilers.

To use OpenMP in your program, you do not need to add anything to your SConstruct. Just make sure the OpenMP libraries are installed on your system before you configure Madagascar (or install them afterwards and rerun the configuration command). You also need to use the appropriate pragmas in your code. To find Madagascar programs that use OpenMP and can serve as models, run the following command:

<bash>
grep "pragma omp" $RSFSRC/user/*/*.c |\
awk -F ':' '{ print $1 }' |\
uniq |\
awk -F '/' '{ print $NF }' |\
grep M
</bash>

On the last check (2011-08-10), 84 standalone programs (approximately 10% of Madagascar programs) were using OpenMP. A similar search in the directory $RSFSRC/api/c finds a few functions parallelized with OpenMP (among them a Fourier transform).

OpenMP (external)

To run a data-parallel process that does not contain OpenMP calls on a multi-core shared-memory machine, use sfomp. Thus, a call like
<bash>
sfradon np=100 p0=0 dp=0.01 < inp.rsf > out.rsf
</bash>
becomes
<bash>
sfomp sfradon np=100 p0=0 dp=0.01 < inp.rsf > out.rsf
</bash>
sfomp splits the input along the slowest axis (presumed to be data-parallel) and runs it through parallel threads. The number of threads is set by the OMP_NUM_THREADS environment variable or, by default, by the number of available CPUs.
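The rule for choosing the number of threads can be written out as a short Python sketch. It only mirrors the rule stated above and is not taken from the sfomp source; the function name thread_count is hypothetical.

```python
import os


def thread_count(environ=os.environ):
    """Mirror the stated rule: OMP_NUM_THREADS if set, else the CPU count.

    Hypothetical helper for illustration; not part of Madagascar.
    """
    value = environ.get("OMP_NUM_THREADS")
    if value:
        return int(value)
    # os.cpu_count() can return None on unusual platforms; fall back to 1.
    return os.cpu_count() or 1


explicit = thread_count({"OMP_NUM_THREADS": "4"})
default = thread_count({})
print(explicit, default)
```

Setting OMP_NUM_THREADS below the CPU count is a simple way to leave some cores free for other work.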

MPI (internal)

MPI (Message-Passing Interface) is the dominant standard framework for parallel processing on different computer architectures including distributed-memory systems. Several MPI implementations (such as Open MPI and MPICH2) are available.

An example of compiling a program with mpicc and running it under mpirun can be found in $RSFSRC/book/rsf/bash/mpi/SConstruct

MPI (external)

To parallelize a task using MPI without including MPI calls in your source code, try sfmpi, as follows:
<bash>
mpirun -np 8 sfmpi sfradon np=100 p0=0 dp=0.01 input=inp.rsf output=out.rsf
</bash>
where the argument after -np specifies the number of processes to launch. sfmpi uses this number to split the input along the slowest axis (presumed to be data-parallel) and to run it through parallel processes. Note that the keywords input and output are specific to sfmpi; they specify the standard input and output streams of your program.

Some MPI implementations do not support system calls implemented in sfmpi and therefore will not support this feature.

MPI + OpenMP (both external)

It is possible to combine the advantages of shared-memory and distributed-memory architectures by using OpenMP and MPI together.
<bash>
mpirun -np 32 sfmpi sfomp sfradon np=100 p0=0 dp=0.01 input=inp.rsf output=out.rsf
</bash>
will distribute the job across 32 nodes and split it again on each node using shared-memory threads.


pscons

To have SCons cut your inputs into slices, process them in parallel on one multi-CPU workstation or on multiple cluster nodes, and then collect the results, use pscons, a wrapper for scons. Unlike the OpenMP and MPI utilities, pscons is fault-tolerant: if a node fails, restarting the job will allow it to complete.

Simply running pscons with no special environment variable set is equivalent to running scons -j nproc, where nproc is the auto-detected number of threads on your system. To fully use the potential of pscons for running on a distributed-memory computer, you need to set the environment variables RSF_CLUSTER and RSF_THREADS, and to use split and reduce arguments in your SConstruct Flow statements where appropriate.

Setting the environment variables

The RSF_CLUSTER variable holds, for each node, the name or IP address of that node (in a format that can be used by ssh), followed by the number of threads on the node. For example, to create 26 threads and distribute them over 4 nodes, using 6 CPUs on the first node, 4 CPUs on the second, and 8 CPUs on each of the last two nodes:
<bash>
export RSF_CLUSTER=' 6 4 8 8'
</bash>

The RSF_THREADS variable holds the sum of the numbers of threads on all nodes, i.e.:
<bash>
export RSF_THREADS=26
</bash>
If RSF_CLUSTER is not defined, RSF_THREADS can be used to override the auto-detected number of threads used on the local host. This can be useful for processes that use a large amount of memory.
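The format of the two variables can be illustrated with a short Python sketch. It rests on stated assumptions: the host names node1 to node4 are hypothetical placeholders, and the helper cluster_settings is not part of Madagascar. It merely assembles the "host count host count ..." string described above and sums the thread counts.

```python
def cluster_settings(nodes):
    """Build RSF_CLUSTER and RSF_THREADS values from (host, threads) pairs.

    Hypothetical helper for illustration; not part of Madagascar.
    """
    # Each node contributes "<host> <threads>", separated by spaces.
    cluster = " ".join(f"{host} {threads}" for host, threads in nodes)
    # RSF_THREADS is the sum of the per-node thread counts.
    total = sum(threads for _, threads in nodes)
    return cluster, total


# Placeholder host names; substitute names or IP addresses reachable by ssh.
nodes = [("node1", 6), ("node2", 4), ("node3", 8), ("node4", 8)]
cluster, total = cluster_settings(nodes)

print(f"export RSF_CLUSTER='{cluster}'")
print(f"export RSF_THREADS={total}")
```

Generating both values from one list keeps RSF_THREADS consistent with the per-node counts when the cluster layout changes.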

In Beowulf-type clusters, in which a processor communicates with its local disk much faster than with the shared network storage, it is important to set two variables in the shell resource file: the temporary file location (TMPDATAPATH) should point to a local disk, and DATAPATH should point to a network-visible location for global collection of results, for example:

<bash>
export DATAPATH=/disk1/data/myname/
export TMPDATAPATH=/tmp/
</bash>

The split and reduce options in Flow()

The split option specifies the number of the axis to be split and the size of that axis. For example, to split axis 3, of length 1000, of the standard input file and collect the results by concatenation:
<python>
Flow('radon','spike','radon adj=y p0=-4 np=200 dp=0.04',split=[3,1000],reduce='cat')
</python>
Concatenation is the default reduction method. The other valid option is reduce='add'.
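The difference between the two reduction methods can be pictured with plain Python lists standing in for the partial outputs. This is a conceptual sketch only, not how pscons is implemented: the functions reduce_cat and reduce_add are hypothetical names, and the real reductions operate on RSF files. Concatenation suits flows in which each slice produces its own part of the output; addition suits flows in which every slice contributes to the same output.

```python
def reduce_cat(parts):
    """Join the partial outputs end to end, in slice order."""
    return [value for part in parts for value in part]


def reduce_add(parts):
    """Sum the partial outputs element by element (equal lengths assumed)."""
    return [sum(values) for values in zip(*parts)]


# Three partial outputs, one per slice (made-up numbers for illustration).
partial_outputs = [[1, 2], [3, 4], [5, 6]]

cat_result = reduce_cat(partial_outputs)
add_result = reduce_add(partial_outputs)

print(cat_result)  # [1, 2, 3, 4, 5, 6]
print(add_result)  # [9, 12]
```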

In flows that are run by pscons but contain both serial and parallel targets, care must be taken not to create bottlenecks, in which tasks are distributed to multiple nodes but the nodes sit idle while waiting for other nodes to finish computing dependencies. Tasks that are not explicitly parallelized will still be sped up by pscons if they are independent of each other. For example, building Madagascar with pscons instead of scons results in a visible speedup on a multithreaded machine.

Computing on the local node only by using the option local=1

By default, with pscons, SCons attempts to run all the commands of the SConstruct file in parallel. The option local=1 forces SCons to run a command on the local node. This is very useful for preventing serial parts of your Python script from being run inefficiently in parallel, especially when that would overload the I/O system with identical operations.
<python>
Flow('spike',None,'spike n1=100 n2=300 n3=1000',local=1)
</python>

What to expect at runtime

SCons will create intermediate input and output slices in the current directory. For example, for
<python>
Flow('out','inp','radon np=100 p0=0 dp=0.01',split=[3,256])
</python>
and
<bash>
RSF_THREADS=8
RSF_CLUSTER='localhost 4 4'
</bash>
the SCons output will look like:
<bash>
< inp.rsf /RSFROOT/bin/sfwindow n3=42 f3=0 squeeze=n > inp__0.rsf
< inp.rsf /RSFROOT/bin/sfwindow n3=42 f3=42 squeeze=n > inp__1.rsf
/usr/bin/ssh "cd /home/test ; /bin/env < inp.rsf /RSFROOT/bin/sfwindow n3=42 f3=84 squeeze=n > inp__2.rsf "
< inp.rsf /RSFROOT/bin/sfwindow n3=42 f3=126 squeeze=n > inp__3.rsf
< inp.rsf /RSFROOT/bin/sfwindow n3=42 f3=168 squeeze=n > inp__4.rsf
/usr/bin/ssh "cd /home/test ; /bin/env < inp.rsf /RSFROOT/bin/sfwindow f3=210 squeeze=n > inp__5.rsf "
< inp__0.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__0.rsf
/usr/bin/ssh "cd /home/test ; /bin/env < inp__1.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__1.rsf "
< inp__3.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__3.rsf
/usr/bin/ssh "cd /home/test ; < inp__4.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__4.rsf "
< inp__2.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__2.rsf
< inp__5.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__5.rsf
< out__0.rsf /RSFROOT/bin/sfcat axis=3 out__1.rsf out__2.rsf out__3.rsf out__4.rsf out__5.rsf > out.rsf
</bash>

Note that operations were sent for execution in parallel, but the display is necessarily serial.
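The window parameters in the log above follow a simple pattern: the axis of length 256 is cut into 6 slices of 42 samples, and the last sfwindow call omits n3, so it takes whatever remains. The sketch below reproduces that arithmetic. It is inferred from the log only, not taken from the pscons source, and the function name slice_windows is hypothetical; how the number of slices (6 here) is chosen is not covered.

```python
def slice_windows(axis_length, nslices):
    """Return (first_sample, nsamples) for each slice of an axis.

    Hypothetical helper that mimics the f3/n3 values seen in the log.
    """
    step = axis_length // nslices
    windows = []
    for i in range(nslices):
        first = i * step
        # The last slice absorbs the remainder of the integer division.
        count = step if i < nslices - 1 else axis_length - first
        windows.append((first, count))
    return windows


windows = slice_windows(256, 6)
for first, count in windows:
    print(f"f3={first} n3={count}")
```

For the example above, the first five slices hold 42 samples each and the last one holds the remaining 46.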

Runtime job monitoring can be achieved with sftop. To kill a distributed job, use sfkill.