bulkem: Introduction

We wish to fit mixture models to a large number of datasets. We assume that an appropriate model is a two-component mixture of inverse Gaussian distributions. The components of the data are not well separated. The use case is roughly 40,000 datasets of 2,000 observations each, or about 80 million observations in total.
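
For reference, one common way to write such a model is given below. The symbols \(\alpha_m\), \(\mu_m\) and \(\lambda_m\) (mixing weight, mean and shape of component \(m\)) are notation assumed here for illustration and are not taken from the bulkem source. With \(M = 2\) components, the density of an observation \(x > 0\) is

\[
p(x) = \sum_{m=1}^{M} \alpha_m \, f(x; \mu_m, \lambda_m),
\qquad
f(x; \mu, \lambda) = \sqrt{\frac{\lambda}{2 \pi x^{3}}} \, \exp\!\left( -\frac{\lambda (x - \mu)^{2}}{2 \mu^{2} x} \right),
\]

where \(\alpha_m \ge 0\) and \(\sum_{m=1}^{M} \alpha_m = 1\). Fitting the model means estimating \(\alpha_m\), \(\mu_m\) and \(\lambda_m\) for each component.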

A natural way to estimate the mixture parameters is to use the Expectation Maximisation algorithm. However, two problems need to be overcome:

Because of the large amount of data, most software implementations of the EM algorithm take a long time to execute.
Because the EM algorithm is sensitive to the selection of initial parameters, many attempts must be made to fit any given dataset. This makes the fitting process even slower.
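
To make the two problems concrete, the following is a minimal, CPU-only sketch of EM for an inverse Gaussian mixture with random restarts. It is not bulkem's implementation: the function names (`ig_pdf`, `em_fit`, `fit_with_restarts`), the ranges used to draw starting values, and the stopping rule are all assumptions made for illustration, and the code has no guards against degenerate components or numerical underflow.

```python
import math
import random


def ig_pdf(x, mu, lam):
    """Inverse Gaussian density with mean mu and shape lam, for x > 0."""
    return math.sqrt(lam / (2.0 * math.pi * x ** 3)) * math.exp(
        -lam * (x - mu) ** 2 / (2.0 * mu ** 2 * x))


def em_fit(xs, params, max_iter=200, tol=1e-8):
    """Run EM from a single starting point.

    params is a list of (alpha, mu, lam) tuples, one per component.
    Returns (params, trace), where trace[i] is the log-likelihood of the
    parameters that entered iteration i.
    """
    trace = []
    for _ in range(max_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        loglik = 0.0
        for x in xs:
            dens = [a * ig_pdf(x, mu, lam) for a, mu, lam in params]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        trace.append(loglik)
        if len(trace) > 1 and abs(trace[-1] - trace[-2]) < tol:
            break
        # M-step: weighted maximum-likelihood update for each component.
        new_params = []
        for m in range(len(params)):
            w = [r[m] for r in resp]
            sw = sum(w)
            mu = sum(wi * x for wi, x in zip(w, xs)) / sw
            inv_lam = sum(wi * (1.0 / x - 1.0 / mu)
                          for wi, x in zip(w, xs)) / sw
            new_params.append((sw / len(xs), mu, 1.0 / inv_lam))
        params = new_params
    return params, trace


def fit_with_restarts(xs, n_starts=10, seed=0):
    """Run EM from several random starts and keep the best fit.

    The starting ranges below are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    lo, hi = min(xs), max(xs)
    best = None
    for _ in range(n_starts):
        start = [(0.5, rng.uniform(lo, hi), rng.uniform(0.5, 5.0))
                 for _ in range(2)]
        params, trace = em_fit(xs, start)
        if best is None or trace[-1] > best[1][-1]:
            best = (params, trace)
    return best
```

Each EM iteration visits every observation once per component, and `fit_with_restarts` repeats the whole fit `n_starts` times. Across tens of thousands of datasets this cost multiplies quickly, which is the motivation for moving the work to parallel hardware.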

One way to reduce the time needed to fit the models is to use CUDA hardware. CUDA is NVIDIA's platform for performing general-purpose computation on the graphics processing units (GPUs) found in many computers. The software used to perform model fitting must be written specifically to suit CUDA.

This report describes the design and development of bulkem¹, an R package which fits mixture models using CUDA hardware. Using CUDA hardware, bulkem can fit a large number of small datasets around 30 times faster than on a conventional CPU. It can fit very large datasets around 36 times faster than on a conventional CPU.

The following variables are used throughout this report:

\(N\) : the number of elements in an array or the number of observations in a dataset
\(M\) : the number of components in the mixture model being fit
\(T\) : the number of threads being launched
\(D\) : the number of datasets

A number of performance measurements are quoted. Unless otherwise specified, those measurements were performed on a machine with an Intel i5-4460 CPU (quad-core, 3.2 GHz) and an NVIDIA GeForce GTX 660 GPU. The machine was running OS X Yosemite, CUDA Toolkit 6.5 and R 3.1.2.

1. The bulkem source code is available at https://github.com/ihowson/bulkem

Ian Howson