More MPI performance optimisation
4 May 2016

Previously, I compared the following forms commonly used for MPI programs; (A):

across all nodes:
    do the same task

or (B):

do a task on node 0
broadcast the result to all nodes

Under the assumptions that CPUs are independent and identical, for form A, your total execution time is T, the time taken to run one instance of the task. In form B, it’s T+B, where B is the amount of time taken to broadcast the result. Since B cannot be less than zero, T < T+B, so form A is better.

Right? Wrong! Maybe.

T is not constant. Even given a node full of identical unloaded CPUs, T will vary for no particular reason. In my previous post I covered some of the reasons why CPUs are not independent and thus T is not going to be the same for all CPUs.

Form A cannot complete until all CPUs have finished running the task. If T is different for different CPUs, the final execution time is the worst-case execution time for all CPUs. Form A has execution time max(T(all CPUs)); form B has execution time T0+B.

Note that any sensible OS will assign your form B (single-threaded) task to the CPU with the lowest load and hence indirectly attain the lowest T0.

This is particularly relevant for cloud environments which are constantly oversold and so you are always sharing CPUs with someone else. If you use all of your assigned CPUs in a form A program, you’re going to experience a lot of variability in the execution time, and your final execution time is going to suffer. Using less CPUs than you’ve paid for may actually reduce overall execution time.

Form A is better if time B is relatively high – you have a lot of data to move around or your nodes are not sharing a memory bus.

So we should use form B always? No! Test, test, test!


comments powered by Disqus