The following assumes that you are using MPICH on a cluster of machines running some variant of UNIX, and that you have access to some or all of the nodes via the mpirun command. (Modifications for LAM and for using schedulers are to be written later.) It is also assumed that you are executing the commands to compile, copy, and run the code from a command line in a terminal window.
Typically, running an MPI program will consist of three steps: compile, copy, and run.

Step 1: Compile

Assuming you have code to compile (if you have binary executables only, proceed to step 2, copy), you need to create an executable. This involves compiling your code with the appropriate compiler, linked against the MPI libraries. It is possible to pass all of the necessary options through a standard cc or f77 command, but MPICH provides "wrapper" compilers (mpicc for cc/gcc, mpiCC for c++/g++ on UNIX/Linux, mpicxx for c++/g++ on BSD/Mac OS X, and mpif77 for f77) that link against the MPI libraries and set the appropriate include and library paths for you.
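If you are curious what a wrapper actually does, MPICH's wrapper compilers typically accept a -show option, which prints the underlying compiler command (with its include and library paths) without running it:

mpicc -show hello.c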
Example: hello.c (use your preferred text editor to create the file
hello.c)
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    char name[MPI_MAX_PROCESSOR_NAME]; // buffer large enough for any name
    int length;

    MPI_Init(&argc, &argv); // note that argc and argv are passed by address
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);
    printf("Hello MPI: processor %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
After saving the above example file, you can compile the program using
the mpicc command.
mpicc -o hello hello.c
The "-o" option provides an output file name, otherwise your executable
would be saved as "a.out". Be careful to make sure you provide an
executable name if you use the "-o" option. Many programmers have
deleted part of their source code by accidentally giving their source
code as their output file name.
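For example, with the file names reversed, a command like the following tells the compiler to write its output over your source file. Depending on your compiler, hello.c may be truncated or overwritten even if the command itself fails:

mpicc -o hello.c hello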
If you have typed the file correctly, you should successfully compile the code, and an "ls" command should show that you have created the file "hello":
[chilledmonkey:~/Desktop/HELLO_MPI] djoiner% ls
hello.c
[chilledmonkey:~/Desktop/HELLO_MPI] djoiner% mpicc -o hello hello.c
[chilledmonkey:~/Desktop/HELLO_MPI] djoiner% ls
hello    hello.c
[chilledmonkey:~/Desktop/HELLO_MPI] djoiner%
Step 2: Copy

In order for your program to run on each node, the executable must exist on each node. There are as many ways to make sure that your executable exists on all of the nodes as there are ways to put the cluster together in the first place. I will cover one method here, appropriate for use with the BCCD (Bootable Cluster CD).
This method assumes that an account (bccd) exists on all machines with the same home directory (/home/bccd), that authentication is being done via ssh, and that public keys have been shared for the account to allow login and remote execution without a password.

(If you are using the BCCD and have not set up your nodes for remote access, make sure you have logged in to each machine as bccd, started the heartbeat program, and included the appropriate public keys in each machine's authorized_keys file. If there are other BCCD nodes on your network, check on each node that you have all of the public keys from the other nodes [ls /tmp/bccd] and add those public keys to each node's authorized_keys file [cat /tmp/bccd/*/id_rsa.pub >> /home/bccd/.ssh/authorized_keys].)
One command that can be used to copy files between machines is "rsync". rsync is a UNIX command that synchronizes files between remote machines, and in its simplest use acts as a secure remote copy. It takes arguments similar to those of the UNIX "cp" command. If you have saved your example in a directory HELLO (i.e., the file is saved as /home/bccd/HELLO/hello.c), the following command will copy the entire HELLO directory to a remote node:
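rsync -arv /home/bccd/HELLO bccd@node_hostname:/home/bccd/

(Replace node_hostname with the name of the remote node.)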
The format for rsync is "rsync arguments source destination". Here, the arguments are -arv (archive, recursive, verbose), the source is /home/bccd/HELLO, and the destination is in username@host:location format, which in this case is bccd@node_hostname:/home/bccd/. This will need to be done for each host.
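If your node hostnames are listed one per line in a file, a short shell loop can perform the copy to every host. This is a sketch that assumes such a file exists at /home/bccd/machines (the machines file described in step 3 below):

for host in $(cat /home/bccd/machines); do
    rsync -arv /home/bccd/HELLO bccd@$host:/home/bccd/
done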
If you are not sure whether your copy worked properly, ssh into each host and check that the files are there using the "ls" command.
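For example, a command such as the following (with node_hostname replaced by each node's name) will list the copied directory without an interactive login:

ssh bccd@node_hostname ls /home/bccd/HELLO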
Step 3: Run

Once you have compiled the code and copied it to all of the nodes, you can run the code using the mpirun command. Two of the more common arguments to mpirun are "-np", which lets you specify how many processes to use, and "-machinefile", which lets you specify exactly which nodes are available for use.
On many machines, the -machinefile argument is optional and defaults to the file mpich_install_dir/util/machines/machines.ARCH or something similar, where ARCH stands for that machine's architecture (e.g. FreeBSD, LINUX, etc.). For BCCD users, since the mpich directory is on the CD, this file cannot be modified, and you should provide your own machines file. (A simple way of creating a machines file from available BCCD nodes is to check which nodes have broadcast their public keys [ls /tmp/bccd > /home/bccd/machines].)
For this example, we will assume that a user bccd with a home directory
/home/bccd has a machines file /home/bccd/machines.
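A machines file is simply a plain text file listing one hostname per line. For example, a machines file for four nodes (with hypothetical hostnames) might look like:

node000
node001
node002
node003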
Change to the directory where your executable is located, and run your hello program using 4 processes:
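Assuming the executable hello is in /home/bccd/HELLO and the machines file is /home/bccd/machines, the commands would look like:

cd /home/bccd/HELLO
mpirun -np 4 -machinefile /home/bccd/machines hello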