Fitting models to independent datasets is an embarrassingly parallel (Wikipedia, 2015) problem. The datasets have no dependence on each other and can be fitted separately.

This implies that, in an ideal world:

If we have enough datasets, we can expect the performance figures quoted here to extrapolate linearly (i.e. if fitting 1000 datasets takes 10 seconds, we can expect that fitting 2000 datasets will take around 20 seconds)
If we have more hardware to parallelise across (e.g. more CPU cores, more computers with or without GPUs) we can expect them to reduce the computation time proportionally to the amount of resources added
Assuming that EC2 has an unlimited supply of hardware for us to rent, we can perform very large model fits in an arbitrarily small amount of time with the same total cost. Renting twice as much hardware will halve our model fit time, so the total expenditure remains the same.

Of course, in practice things are not so ideal:

EC2 bills per-hour, so we cannot reduce execution time for extremely large jobs below an hour without incurring additional costs
EC2 instances take some time to boot, so there is a cost to using large numbers of instances
Large clusters of machines incur some overhead for communication
Very small tasks have fixed overheads. We saw this in the GPU results, where there was almost no time difference between a 100-sample dataset and a 1000-sample dataset.

Using R, the foreach package (Analytics et al., 2014) makes it easy to parallelise code across cores within a computer. The snow package (Tierney et al., 2013) is suitable for use across networked clusters of computers.

Improving GPU performance

Kernel launch time is still the limiting factor, so further reducing the number of kernel launches is the natural way to improve performance. Ideally, the entire EM algorithm (across multiple iterations) could be moved to the GPU using similar techniques to lp_sum to handle datasets larger than the thread block size.

On the author’s hardware, compute occupancy is around 60%, so we could expect, at most, another \( \frac{1}{0.6}= \) 67% performance gain.

At that stage, GPU performance would likely be the limiting factor. The current implementation of the lp_sum summation is not very efficient, but it would be prudent to check for hotspots with a profiler before investing further development time.

Finally, while running the GPU software, the CPUs have significant idle time. With additional software support, one could perform additional model fits using that idle CPU capacity, improving performance-per-dollar further.

Improving CPU performance

The obvious way to improve performance of the CPU implementation is to take advantage of additional CPU cores. The easiest way to achieve this is with the foreach R package, which can run arbitrary R code across any number of CPU threads. Running the R code under a profiler ought to reveal hotspots which can guide optimisation of the R code, Finally, rewriting the R implementation in C might provide further improvement.

New functionality

Modifying bulkem to fit Normal mixture models would be fairly straightforward and very useful; Normal models are far more common than inverse Gaussian.

Conclusion

This report describes bulkem, an R package which fits inverse Gaussian mixture models using the EM algorithm. It has demonstrated that GPUs can provide significant performance and cost advantages over CPUs in this application. Unlike most GPU computing packages, bulkem offers significant performance improvement even on small datasets. Directions for further improvement of the bulkem algorithms have been identified which might provide further improvement.

References

Revolution Analytics and S Weston. foreach: foreach looping construct for R, Apr 2014. URL http://cran.r-project.org/web/packages/foreach/index.html.

Wikipedia. Embarrassingly parallel, Mar 2015b. URL http://en.wikipedia.org/wiki/Embarrassingly_parallel.

L Tierney, AJ Rossini, N Li, and H Sevcikova. snow: simple network of workstations, Sep 2013. URL http://cran.r-project.org/web/packages/snow/index.html.

Ian Howson

bulkem: Discussion and conclusion

Next:
Results

bulkem: Discussion and conclusion
18 Jun 2015

The linear speedup assumption

Future work

Improving GPU performance

Improving CPU performance

New functionality

Conclusion

References

Ian Howson

bulkem: Discussion and conclusion

Next:Results

bulkem: Discussion and conclusion18 Jun 2015

The linear speedup assumption

Future work

Improving GPU performance

Improving CPU performance

New functionality

Conclusion

References

Next:
Results

bulkem: Discussion and conclusion
18 Jun 2015