bulkem: Introduction
18 Jun 2015

We wish to fit mixture models to a large number of datasets. We assume that an appropriate model is a two-component mixture of inverse Gaussian distributions. The components of the data are not well separated. The use case is roughly 40,000 datasets of 2,000 observations each.

A natural way to estimate the mixture parameters is to use the Expectation Maximisation algorithm. However, two problems need to be overcome:

One way to reduce the time needed to generate the models is to use CUDA hardware. CUDA uses graphical processing units (GPUs) found in many computers to perform general-purpose computation. The software used to perform model fitting must be customised to suit CUDA.

This report describes the design and development of bulkem1, an R package which fits mixture models using CUDA hardware. Using CUDA hardware, bulkem can fit a large number of small datasets around thirty times faster than a conventional CPU. It can fit very large datasets around 36 times faster than a conventional CPU.

Conventions

The following variables are used throughout this report:

A number of performance measurements are quoted. Unless otherwise specified, those measurements were performed on an Intel i5-4460 (quad-core 3.2GHz) CPU with an NVIDIA GeForce GTX 660. The machine is running OS X Yosemite, CUDA Toolkit 6.5 and R 3.1.2.

Footnotes

1. The bulkem source code is available at https://github.com/ihowson/bulkem


comments powered by Disqus