The speedup of a parallel code is how much faster it runs
in parallel. If the time it takes to run a code on one
processor is T1, and the time it takes to run
the same code on N processors is TN, then the speedup
is given by
S = T1 / TN.
The speedup can depend on many things, but depends primarily on the
ratio of the time your code spends communicating to the
time it spends computing.
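
As a rough sketch of how T1 and TN might be measured in practice
(assuming an MPI code written in C; the do_work routine below is a
hypothetical stand-in for the real computation), the same region of the
program can be timed with MPI_Wtime and compared across runs with
different processor counts:

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the real computation whose speedup
       is being measured. */
    static void do_work(void)
    {
        volatile double x = 0.0;
        long i;
        for (i = 0; i < 100000000L; i++)
            x += 1.0;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        double start, elapsed;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Barrier(MPI_COMM_WORLD);      /* start all processes together */
        start = MPI_Wtime();

        do_work();                        /* the region being timed       */

        MPI_Barrier(MPI_COMM_WORLD);      /* wait for the slowest process */
        elapsed = MPI_Wtime() - start;

        if (rank == 0)
            printf("time on %d processors: %f seconds\n", size, elapsed);

        MPI_Finalize();
        return 0;
    }

Running this once with a single process and once with N processes gives
the T1 and TN needed for the speedup formula.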
Definition of Efficiency
Efficiency is a measure of how much of your available processing
power is being used. The simplest way to think of it is as the
speedup per processor. This is equivalent to defining efficiency
as the ratio of the time to run one model on one processor to the
time to run N models, one after another, each on N processors.
E = S/N = T1 / (N TN)
This gives a more accurate measure of the true efficiency of
a parallel program than CPU usage, as it takes into account redundant
calculations as well as idle time.
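
As a purely illustrative example, suppose a code takes T1 = 100 seconds
on one processor and TN = 8 seconds on N = 16 processors. Then
S = 100 / 8 = 12.5
E = 12.5 / 16 = 0.78 (approximately),
so only about 78% of the available processing power is doing useful work,
even though all 16 processors may report near 100% CPU usage.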
Factors that affect speedup
The primary factor limiting speedup is the communication-to-computation
ratio. To get a higher speedup, you can:
Communicate less
Compute more
Make connections faster
Communicate faster
The time one computer requires to make a connection to
another computer is referred to as latency, and the rate
at which data can then be transferred is the bandwidth. Both can
have an impact on the speedup of a parallel code.
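
A common back-of-the-envelope model, assumed here rather than taken from
the discussion above, is that sending a message of s bytes takes roughly
latency + s / bandwidth seconds. The short C sketch below uses made-up
values for latency and bandwidth to show why sending one large message is
usually cheaper than sending the same data as many small messages, which
is one way to "communicate less":

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative numbers only; real values depend on the network. */
        double latency     = 50e-6;   /* 50 microseconds per message     */
        double bandwidth   = 100e6;   /* 100 MB/s                        */
        double total_bytes = 1e6;     /* 1 MB of data to send in total   */

        /* Cost model: time = latency + bytes / bandwidth per message. */
        double one_msg  = latency + total_bytes / bandwidth;
        double thousand = 1000 * (latency + (total_bytes / 1000) / bandwidth);

        printf("one 1 MB message:      %f seconds\n", one_msg);
        printf("1000 messages of 1 kB: %f seconds\n", thousand);
        return 0;
    }

With these illustrative numbers, the thousand small messages spend far
more time paying the latency cost than actually moving data.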
Collective communication can also help speed up your code.
As an example, imagine you are trying to tell a number of people about
a party. One method would be to tell each person individually; another
would be to tell a few people and ask them to "spread the word". Collective
communication improves communication speed in the same way: any node that
already has the information being sent participates in passing it on to the
remaining nodes.
Not all communication protocols allow for collective communication, and even
protocols which do may not require a vendor's implementation to use a
genuinely collective algorithm. An example is the broadcast routine in MPI.
Many vendor-specific versions of MPI provide broadcast routines which use
a "tree" method of communication. The more common implementations found
on most clusters, LAM-MPI and MPICH, simply have the sending machine contact
each receiving machine in turn.
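
As a rough sketch (C with MPI; the buffer contents and size are made up),
the two approaches might look like the following. The explicit loop of
MPI_Send calls mirrors the "contact each receiving machine in turn"
behaviour, while MPI_Bcast leaves the choice of algorithm, tree-based or
otherwise, to the MPI implementation:

    #include <mpi.h>

    #define COUNT 1024   /* size of the illustrative data buffer */

    int main(int argc, char **argv)
    {
        int rank, size, i;
        double data[COUNT];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            for (i = 0; i < COUNT; i++)
                data[i] = (double) i;   /* made-up data on the root */

        /* Approach 1: the root contacts each receiver in turn. */
        if (rank == 0) {
            for (i = 1; i < size; i++)
                MPI_Send(data, COUNT, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(data, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        /* Approach 2: a collective broadcast; the implementation decides
           whether to use a tree or simple point-to-point sends. */
        MPI_Bcast(data, COUNT, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

In practice the broadcast call is normally preferred, since the
implementation is free to use a faster algorithm when one is available.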