Fitting models to independent datasets is an embarrassingly parallel (Wikipedia, 2015) problem. The datasets have no dependence on each other and can be fitted separately.
This implies that, in an ideal world:
Of course, in practice things are not so ideal:
Using R, the foreach
package (Analytics et al., 2014) makes it easy to parallelise code across cores within a computer. The snow
package (Tierney et al., 2013) is suitable for use across networked clusters of computers.
Kernel launch time is still the limiting factor, so further reducing the number of kernel launches is the natural way to improve performance. Ideally, the entire EM algorithm (across multiple iterations) could be moved to the GPU using similar techniques to lp_sum
to handle datasets larger than the thread block size.
On the author’s hardware, compute occupancy is around 60%, so we could expect, at most, another \( \frac{1}{0.6}= \) 67% performance gain.
At that stage, GPU performance would likely be the limiting factor. The current implementation of the lp_sum
summation is not very efficient, but it would be prudent to check for hotspots with a profiler before investing further development time.
Finally, while running the GPU software, the CPUs have significant idle time. With additional software support, one could perform additional model fits using that idle CPU capacity, improving performance-per-dollar further.
The obvious way to improve performance of the CPU implementation is to take advantage of additional CPU cores. The easiest way to achieve this is with the foreach
R package, which can run arbitrary R code across any number of CPU threads. Running the R code under a profiler ought to reveal hotspots which can guide optimisation of the R code, Finally, rewriting the R implementation in C might provide further improvement.
Modifying bulkem
to fit Normal mixture models would be fairly straightforward and very useful; Normal models are far more common than inverse Gaussian.
This report describes bulkem
, an R package which fits inverse Gaussian mixture models using the EM algorithm. It has demonstrated that GPUs can provide significant performance and cost advantages over CPUs in this application. Unlike most GPU computing packages, bulkem
offers significant performance improvement even on small datasets. Directions for further improvement of the bulkem
algorithms have been identified which might provide further improvement.
Revolution Analytics and S Weston. foreach: foreach looping construct for R, Apr 2014. URL http://cran.r-project.org/web/packages/foreach/index.html.
Wikipedia. Embarrassingly parallel, Mar 2015b. URL http://en.wikipedia.org/wiki/Embarrassingly_parallel.
L Tierney, AJ Rossini, N Li, and H Sevcikova. snow: simple network of workstations, Sep 2013. URL http://cran.r-project.org/web/packages/snow/index.html.