It is often said that there are two views of MPI. One view is that MPI
is a lightweight interface with only 6 commands. The other view is that it
is an in-depth interface with hundreds of specialized commands.
This document is for the 6 command people.
The 6 Commands
MPI_Init
MPI_Comm_size
MPI_Comm_rank
MPI_Send
MPI_Recv
MPI_Finalize
In short, set up an MPI program, get the number of processes
participating in the program, determine which of those processes
corresponds to the one calling the command, send messages, receive
messages, and stop participating in a parallel program.
MPI_Init(int *argc, char ***argv)
Initializes the MPI environment. It takes the command line arguments to a
program, checks for any MPI-specific options, and passes the remaining
command line arguments on to the main program.
MPI_Comm_size( MPI_Comm comm, int *size )
Determines the number of processes in a given MPI communicator, returned
in size. A communicator is a set of processes that work together. For
typical programs this is the default MPI_COMM_WORLD, the communicator
containing all processes available to an MPI program.
MPI_Comm_rank( MPI_Comm comm, int *rank )
Determines the rank of the calling process within a communicator, returned
in rank. Typically, if an MPI program is being run on N processes, the
communicator would be MPI_COMM_WORLD, and the rank would be an integer from
0 to N-1.
MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm )
Sends the contents of buf, which holds count elements of type datatype,
to the process of rank dest in the communicator comm, flagged with the
message tag. Typically, the communicator is MPI_COMM_WORLD.
MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Status *status )
Reads into buf count values of type datatype from process source in
communicator comm, matching a message sent with the flag tag.
Information about the transfer is also returned in status.
MPI_Finalize()
Handles anything the MPI implementation needs to do before the program
exits. Typically it should be the final or near-final line of a program;
no further MPI communication calls may be made after it.
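As a quick illustration, here is a minimal sketch that uses all six commands:
every process other than rank 0 sends its rank number to rank 0, which prints
what it receives. (The message contents here are purely for illustration.)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int size, rank;

    MPI_Init(&argc, &argv);                  /* start up MPI            */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes?     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which one is this?      */

    if (rank == 0) {
        /* rank 0 collects one integer from every other process */
        int i, value;
        MPI_Status status;
        for (i = 1; i < size; i++) {
            MPI_Recv(&value, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
            printf("Rank 0 received %d from rank %d\n", value, i);
        }
    } else {
        /* every other process sends its rank number to rank 0 */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();                          /* shut down MPI           */
    return 0;
}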
Deadlock - what happens if you don't match your sends and receives?
Suppose you have an MPI code in which two processes are going to "trade"
some information. Each process wants to send a message to the other, which
will receive it. If each process sends and then receives, it may happen
that the messages fill the system's buffers before either send can complete.
Without a receive posted to start clearing those buffers,
the program simply stops running. It typically will not crash, but just hang.
If each process instead receives first and then sends, the program will hang
immediately, with each process waiting for a message that the other has not yet sent.
This assumes you are using what are called "blocking" sends and receives,
that is, sends and receives that will not let the program proceed until
the message has been sent or received. (Non-blocking communications, which
relax this restriction, are described elsewhere.)
The phenomenon of a "stopped" program waiting for a send or receive that
will never happen is known as "deadlock".
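One common way to avoid deadlock with blocking calls is to order the operations
by rank, so that one side sends while the other receives. The following is a
minimal sketch of a safe two-process trade; the variable names are illustrative.

#include <stdio.h>
#include <mpi.h>

/* Two processes trade one integer without deadlocking:
   the even rank sends first, the odd rank receives first. */
int main(int argc, char **argv) {
    int rank, partner, mine, theirs;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = 1 - rank;          /* assumes exactly two processes */
    mine = 100 + rank;           /* some data to trade            */

    if (rank % 2 == 0) {
        MPI_Send(&mine,   1, MPI_INT, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(&theirs, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
    } else {
        MPI_Recv(&theirs, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&mine,   1, MPI_INT, partner, 0, MPI_COMM_WORLD);
    }
    printf("Rank %d received %d from rank %d\n", rank, theirs, partner);

    MPI_Finalize();
    return 0;
}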
As an example of how to use MPI in a simple program, consider
the following. You have a code to simulate the spread
of a fire in a large forest. The simulation is a Monte Carlo
simulation, and every run is different. You want to compile
average results for a large number of runs per input parameter,
for a large number of input parameters, and you want to parallelize
this across several machines. (For an example of a serial code with
visualization, try Interactivate's Fire Applet.)
The Model
Using the serial code fire.c, you can run
a single instance of a forest fire simulation in which a forest
is modeled as an NxN grid of trees. One tree starts to smolder, and
at each iteration nearby trees have some chance of catching fire.
The model follows these rules:
Burning trees burn down.
Smoldering trees catch fire.
Unburnt trees next to (N, S, E, W) a burning tree catch
fire with some constant probability.
Repeat until fire burns out.
The main input parameter for the model is the chance of the
fire spreading from a burning tree to a nearby unburnt tree.
The main output for the model is the percentage of additional
trees burned beyond the first tree.
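To make these rules concrete, one iteration of the model might look roughly
like the sketch below. The state names, the two-grid (read old, write next)
bookkeeping, the helper fire_spreads(), and the routine name forest_burns_step
are all illustrative; the actual fire.c may organize this differently (see the
main() routine later in this document).

#include <stdlib.h>

/* Illustrative tree states; the actual fire.c may use different ones. */
enum { UNBURNT, SMOLDERING, BURNING, BURNT };

/* Illustrative helper: returns 1 with probability prob_spread. */
int fire_spreads(double prob_spread) {
    return ((double)rand() / RAND_MAX) < prob_spread;
}

/* One iteration of the model: read the old grid, write the next one,
   so that every tree is updated from the same snapshot in time.       */
void forest_burns_step(int N, int **old, int **next, double prob_spread) {
    int i, j;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            if (old[i][j] == BURNING) {
                next[i][j] = BURNT;              /* burning trees burn down     */
            } else if (old[i][j] == SMOLDERING) {
                next[i][j] = BURNING;            /* smoldering trees catch fire */
            } else if (old[i][j] == UNBURNT) {
                /* unburnt trees next to (N, S, E, W) a burning tree
                   catch fire with probability prob_spread             */
                next[i][j] = UNBURNT;
                if ((i > 0     && old[i-1][j] == BURNING && fire_spreads(prob_spread)) ||
                    (i < N - 1 && old[i+1][j] == BURNING && fire_spreads(prob_spread)) ||
                    (j > 0     && old[i][j-1] == BURNING && fire_spreads(prob_spread)) ||
                    (j < N - 1 && old[i][j+1] == BURNING && fire_spreads(prob_spread)))
                    next[i][j] = BURNING;
            } else {
                next[i][j] = BURNT;              /* burnt trees stay burnt      */
            }
        }
    }
}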
The desired outcome of the parallel version is to produce
a plot of average percent burned as a function of the probability
of spreading, as quickly and as accurately as possible. This
should take into account that the probability of the fire
spreading will affect not only how long it takes for the
fire to burn out but also the number of iterations required
to reach an accurate representation of the average.
One Approach
DISCLAIMER: This is not presented as the optimum
solution to this problem, but as an example. It is left to the
reader to find more efficient ways of parallelizing
the problem.
The most straightforward, though not the most efficient,
approach would be to assume that Niter iterations
are run for each of Nprob probabilities, and that
each process performs the calculation for an equal number
of probabilities.
Each process would run the exact same program, and each
process would determine which subset of the range of
probabilities to use as input based on its rank and the
size of MPI_COMM_WORLD. The process with rank 0 would
compile all results and output them to the screen.
The pseudocode for the program would be
Start program
Choose subset of probabilities based on rank
For prob = min to max do
    sum = 0
    For trial = 1 to n_trials do
        burn forest
        sum += percent burned
    average(prob) = sum / n_trials
if (rank == 0) then
    receive averages from other processes
    print output
else
    send averages to rank 0
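The final send/receive step maps directly onto MPI_Send and MPI_Recv. Below is
a rough sketch of how rank 0 might gather the per-probability averages; the
routine name, the assumption that every rank handles the same number of
probabilities, and the block layout (rank r's results start at index
r * n_local) are illustrative, not taken from the actual fire_mpi.c.

#include <mpi.h>

/* Gather n_local per-probability averages from every rank onto rank 0. */
void collect_averages(int rank, int size, int n_local,
                      double *my_averages, double *all_averages) {
    if (rank == 0) {
        int p, r;
        MPI_Status status;
        for (p = 0; p < n_local; p++)        /* rank 0's own results      */
            all_averages[p] = my_averages[p];
        for (r = 1; r < size; r++)           /* everyone else's results   */
            MPI_Recv(&all_averages[r * n_local], n_local, MPI_DOUBLE,
                     r, 0, MPI_COMM_WORLD, &status);
    } else {
        MPI_Send(my_averages, n_local, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
}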
The main() routine
int main() {
    // initial conditions and variable definitions
    int forest_size=20;
    double prob_spread=0.5;
    int **forest;
    int i;
    // setup problem
    forest=allocate_forest(forest_size);
    initialize_forest(forest_size,forest);
    light_tree(forest_size,forest,10,10);
    // burn until fire is gone
    while(forest_is_burning(forest_size,forest)) {
        forest_burns(forest_size,forest,prob_spread);
    }
    // print output and clean up
    print_forest(forest_size,forest);
    delete_forest(forest_size,forest);
}
Looking at the main routine, what will need to be changed to
parallelize this? Notice that each process will need to allocate
and free memory. No process will need to print a picture of the
forest being burned. Each process will have to initialize the array,
light a tree on fire, and run through the while loop multiple times.
Each process will need to compute an average of those runs. Each
process will need to do this for multiple probabilities.
One natural step might be to put the process of initializing
data and calculating a single run into a subroutine
(fire_2.c).
The next step might be to put a loop for a number of trials around that
subroutine (fire_3.c).
The third step might be to introduce a loop over probability
(fire_4.c).
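After those three steps, the serial driver might be organized roughly as in the
sketch below. The routine name burn_until_out, its arguments, and the exact
choice of probabilities are illustrative; the actual fire_4.c may differ.

#include <stdio.h>

/* hypothetical helper: run one complete simulation (allocate, initialize,
   light a tree, burn until out) and return the percent of trees burned   */
double burn_until_out(int forest_size, double prob_spread);

int main() {
    int forest_size = 20, n_trials = 600, n_probs = 100;
    int i_prob, trial;
    double prob, sum;

    /* loop over probabilities (fire_4.c) */
    for (i_prob = 0; i_prob < n_probs; i_prob++) {
        prob = (double)(i_prob + 1) / n_probs;
        sum = 0.0;
        /* loop over trials (fire_3.c) */
        for (trial = 0; trial < n_trials; trial++)
            sum += burn_until_out(forest_size, prob);   /* one run (fire_2.c) */
        printf("%f %f\n", prob, sum / n_trials);
    }
    return 0;
}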
All of the previous steps have been setting up the code to
make it ready to be parallelized. The code has been structured
so that the main loop consists of independent pieces of work that
can be run concurrently. What remains is to put in our standard start
and finish routines, to determine the starting and finishing
probability for each process, and to have the rank 0 process
collect and output the data (fire_mpi.c).
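Determining the starting and finishing probability for each process can be done
with a simple block decomposition based on the rank and the size of
MPI_COMM_WORLD; a sketch, with illustrative variable names:

/* Block decomposition of n_probs probabilities across size processes.
   This sketch assumes n_probs divides evenly by size; each rank handles
   probability indices my_start through my_end - 1.                      */
int n_local  = n_probs / size;
int my_start = rank * n_local;
int my_end   = my_start + n_local;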
On a test cluster using 600 trials and 100 probabilities, on
3 2GHz Pentium 4 Dells running off of the
Bootable Cluster
CD, a speedup of 1.8 was seen for 3 machines, giving an
efficiency of 62%.
This is not the optimum solution.
The low efficiency of this solution lies in the difference in running
times between models that burn out quickly and models that burn slowly.
With a block decomposition, one process gets almost all of the probabilities
that burn out immediately, one process gets almost all of the probabilities
that burn for a long period of time, and one process gets almost all of the
probabilities that burn for an intermediate period of time.
What could you suggest to improve the efficiency
of this code?
Suggestion from Kay Kussman, McNeese State University,
Math, Computer Science, and Statistics at NCSI PVAMU
Workshop, July 2003: One could
keep the outer
loop over probability, but have processes take each in a "round-robin".
For three processes, process 0 would perform probability 0, 3, 6, ...,
process 1 would perform probability 1, 4, 7, ..., and process
2 would compute probability 2, 5, 8, ..., and so on. This was implemented
on the same 3 node cluster as above, with the same parameters. Speedup
for this method was 2.8, and efficiency 92%
(fire_mpi_2.c).
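In code, the round-robin assignment is a one-line change to the probability
loop; a sketch with the same illustrative variable names as above:

/* Round-robin assignment: rank r handles probability indices
   r, r + size, r + 2*size, ...                                */
for (i_prob = rank; i_prob < n_probs; i_prob += size) {
    /* run n_trials simulations at this probability and average them */
}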
Solution discussed at the NCSI workshop at Oklahoma University,
August 2004:
One could also split the job up so that each node does some
of the trials, but for every probability. An implementation of
that can be found at fire_mpi_3.c.
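A sketch of that decomposition, again using only the six commands and the
illustrative names from the earlier sketches (the actual fire_mpi_3.c may be
organized differently):

/* Each rank runs n_trials / size trials at every probability and sends
   its partial sum to rank 0, which forms and prints the overall average.
   Assumes n_trials divides evenly by size.                               */
for (i_prob = 0; i_prob < n_probs; i_prob++) {
    double prob = (double)(i_prob + 1) / n_probs;
    double sum = 0.0;
    int trial, r;
    for (trial = 0; trial < n_trials / size; trial++)
        sum += burn_until_out(forest_size, prob);

    if (rank == 0) {
        double partial;
        MPI_Status status;
        for (r = 1; r < size; r++) {
            MPI_Recv(&partial, 1, MPI_DOUBLE, r, 0, MPI_COMM_WORLD, &status);
            sum += partial;
        }
        printf("%f %f\n", prob, sum / n_trials);
    } else {
        MPI_Send(&sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
}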
A FORTRAN version can be found at fire_mpi.f and
fire.inc.