Turbo Boost and MPI
4 May 2016

There’s this attitude when optimising that if you’re not maxing out all of your processing resources, you’re wasting them.

Utilisation is a good guideline, but chasing it misses the wood for the trees. What you actually want is for your task to run faster, and using more resources doesn’t guarantee that. If you have idle resources, you will usually gain by using them, but not always.

MPI programs are often written in the form (A):

across all nodes:
    do the same task

rather than (B):

do a task on node 0
broadcast the result to all nodes

Form A is ridiculously wasteful of hardware resources (it uses N times the CPU cycles of form B), but it is often slightly faster. Why? In form B, the total time is <time to run your task> plus <time to broadcast>. In form A, under the assumption that CPU cores are independent, the time is <time to run your task>. You’ve saved <time to broadcast>, which matters if it’s a measurable percentage of <time to run your task>.
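The trade-off can be written down as a tiny cost model. This is only a sketch under the same assumption as the paragraph above (cores are fully independent); the function names and the 10 ms / 1 ms / 16-node figures are made up for illustration, not measurements.

```python
def form_a(task_s: float, nodes: int) -> tuple[float, float]:
    """Every node runs the task itself: no broadcast, N times the CPU time."""
    wall = task_s
    cpu = task_s * nodes
    return wall, cpu


def form_b(task_s: float, bcast_s: float, nodes: int) -> tuple[float, float]:
    """Node 0 runs the task, then broadcasts; the other nodes just wait."""
    wall = task_s + bcast_s
    cpu = task_s  # 'nodes' does not change the CPU time spent on the task
    return wall, cpu


if __name__ == "__main__":
    # Hypothetical numbers: a 10 ms task, a 1 ms broadcast, 16 nodes.
    results = {
        "A": form_a(0.010, 16),
        "B": form_b(0.010, 0.001, 16),
    }
    for name, (wall, cpu) in results.items():
        print(f"form {name}: wall {wall * 1e3:.1f} ms, cpu {cpu * 1e3:.1f} ms")
```

With these numbers form A wins on wall-clock time by exactly the broadcast time, and pays for it with 16 times the CPU time.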

Are CPU cores independent? They’re less independent now than they used to be. Traditionally, the memory bus was the primary shared resource, and form A gets pretty good cache utilisation. All cores are doing the same job and will have the same memory access patterns, so caches mostly cover up the increased memory traffic.

Intel introduced Turbo Boost with the Nehalem generation in late 2008, and Turbo Boost 2.0 followed with Sandy Bridge in 2011. This feature lets a core run at higher than the nominal clock rate for a short period of time, and the boost is largest when only one core is busy. This is motivated largely by thermal considerations. Obviously, the silicon is capable of running at the higher clock rate, otherwise turbo wouldn’t work at all, ever. The nominal clock speed is a self-imposed restriction that reflects that the heat cannot be removed from such a small area (maybe 1x1mm?) at a high enough rate to keep the core at a safe temperature. For a multicore package running at a lower clock rate, there’s more total heat but it’s spread across the package better, so the individual cores do not reach a dangerous temperature.

So now, you can choose between multiple cores doing the same task at a lower clock rate versus single-core task+broadcast at a higher clock rate. Does this matter? It depends a lot on your hardware environment:
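To put rough numbers on that choice, here is a sketch that assumes the task is purely clock-bound (it needs a fixed number of cycles) and that the turbo clock can be sustained for the whole task. The 3.0 GHz all-core clock, 3.6 GHz single-core turbo clock and 36 million cycle task are hypothetical values, not figures for any particular CPU.

```python
def form_a_time(cycles: float, all_core_hz: float) -> float:
    """Every core runs the task, so all cores are busy and nobody gets turbo."""
    return cycles / all_core_hz


def form_b_time(cycles: float, turbo_hz: float, bcast_s: float) -> float:
    """One core runs the task at the turbo clock, then broadcasts the result."""
    return cycles / turbo_hz + bcast_s


def break_even_bcast(cycles: float, all_core_hz: float, turbo_hz: float) -> float:
    """Broadcast time at which both forms take equally long."""
    return cycles / all_core_hz - cycles / turbo_hz


if __name__ == "__main__":
    cycles = 3.6e7      # hypothetical task size
    all_core = 3.0e9    # hypothetical clock with every core busy
    turbo = 3.6e9       # hypothetical single-core turbo clock
    limit = break_even_bcast(cycles, all_core, turbo)
    print(f"form A: {form_a_time(cycles, all_core) * 1e3:.2f} ms")
    print(f"form B wins if the broadcast takes under {limit * 1e3:.2f} ms")
```

In this model form B is faster whenever the broadcast costs less than the time turbo saves on the task itself. A memory-bound task would gain less from the higher clock, which is one more reason to measure rather than trust the model.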

If you’re in a cloud environment (EC2, DigitalOcean, etc.) then there’s almost no chance that you can turbo a core, as other people will be running on the same machine. Because you don’t have exclusive access to the CPUs, your cores might complete the same job in different amounts of time. A synchronised task only completes when its slowest core does, so your final time might be better if you reduce the number of cores you use.
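The worst-case effect is easy to see in a toy simulation. The noise model here is an assumption chosen only for illustration: each core takes the base time plus an independent, exponentially distributed delay averaging 10% of it. Real noisy-neighbour behaviour will look different.

```python
import random


def barrier_time(core_times: list[float]) -> float:
    """A synchronised step finishes only when the slowest core finishes."""
    return max(core_times)


def noisy_core_times(cores: int, base_s: float, rng: random.Random) -> list[float]:
    """Hypothetical noise: base time plus an exponential delay (mean 10% of base)."""
    mean_delay = 0.1 * base_s
    return [base_s + rng.expovariate(1.0 / mean_delay) for _ in range(cores)]


def mean_barrier_time(cores: int, base_s: float = 1.0,
                      trials: int = 2000, seed: int = 0) -> float:
    """Average time to reach the barrier over many simulated runs."""
    rng = random.Random(seed)
    total = sum(barrier_time(noisy_core_times(cores, base_s, rng))
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for cores in (1, 4, 16, 64):
        print(f"{cores:3d} cores: mean time to barrier "
              f"{mean_barrier_time(cores):.3f} s")
```

Every core does the same amount of work in this model, yet the time to reach the barrier creeps up as cores are added, because the chance that at least one of them gets a bad draw keeps growing.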

On something like CUDA, there are good reasons to run the same task on many cores even if many of them are idle. The hardware architecture rewards you for orderly memory access and you usually have a scarcity of memory bandwidth, not cores. If you can do exactly the same task across many cores (where ‘exactly’ means ‘the same instructions in lockstep’) then you can use all of the cores. There’s no way to fit spare tasks or other users into the spare cores like in a CPU-based environment, so if you don’t use them, they get wasted. Even different branches break the ‘exactly’ requirement, so you’re usually better off wasting cycles on some cores than having the effective size of your thread group drop from 32 (one warp) to 1.
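The cost of breaking lockstep can be sketched with a very simplified model of one warp (32 threads on NVIDIA hardware): the warp runs each distinct branch path one after another, and lanes that did not take the current path sit masked off. The function names and the cost of 100 per path are hypothetical, and real hardware has many more effects than this.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA hardware


def warp_time(lane_paths: list[int], path_cost: dict[int, int]) -> int:
    """Divergent paths are serialised: pay for every distinct path taken."""
    return sum(path_cost[path] for path in set(lane_paths))


def lane_utilisation(lane_paths: list[int], path_cost: dict[int, int]) -> float:
    """Fraction of lane-time spent doing work rather than sitting masked off."""
    busy = sum(path_cost[path] for path in lane_paths)
    return busy / (len(lane_paths) * warp_time(lane_paths, path_cost))


if __name__ == "__main__":
    # All lanes follow the same path: full lockstep.
    uniform = [0] * WARP_SIZE
    # Every lane takes its own path: the worst case for divergence.
    divergent = list(range(WARP_SIZE))
    cost = {path: 100 for path in range(WARP_SIZE)}  # hypothetical cost per path

    for name, lanes in (("uniform", uniform), ("divergent", divergent)):
        print(f"{name}: time {warp_time(lanes, cost)}, "
              f"utilisation {lane_utilisation(lanes, cost):.3f}")
```

In the divergent case each lane does the same amount of work as before, but the warp takes 32 times as long, which is the drop from a full thread group to an effective width of 1.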

What’s the take-home lesson? Test, test, test. Don’t assume. CPU cores with Turbo Boost are less independent than they used to be, and the assumption behind the common practice of running identical tasks across many cores often doesn’t hold any more. Test it again.
