Graphics processing units (GPUs) are specialized processors with dedicated memory that conventionally perform the floating point operations required for rendering graphics. In response to commercial demand for real-time graphics rendering, the current generation of GPUs has evolved into many-core processors that are specifically designed to perform data-parallel computation. The main difference between GPUs and central processing units (CPUs) is that GPUs devote proportionally more transistors to arithmetic logic units and fewer to caches and flow control. GPUs also typically have higher memory bandwidth than CPUs.

A recent trend in computer architecture is the move from traditional, single-core processors to multi-core processors and further to many-core or massively multi-core processors. This is primarily due to the difficulty of making individual processors faster, while it remains possible to provide more processing power by putting more cores onto a single die. The result of this trend is that computational problems which can take advantage of multiple threads can see speedup that scales nearly linearly with the number of cores available.

Since GPUs have independently been developed to perform data-parallel computation using multiple cores, the scientific computing community is keen to take advantage of this technology by performing some computation on GPUs that has traditionally been done on CPUs. This is referred to as general-purpose computation on GPUs (GPGPU).

A data-parallel computation is a computation that has been parallelized by distributing the data amongst computing nodes. It can be contrasted with a task-parallel computation, in which the distribution of computing tasks is emphasized as opposed to the data. One framework that is used to accomplish data-parallelism is "single instruction, multiple data" (SIMD), in which multiple processors execute the same instructions on different pieces of data. This is the architecture used in GPUs, since it allows flow control computation to be shared amongst processors and thus allows more of the hardware to be devoted to instruction execution.
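The distinction above can be made concrete with a small sketch, not drawn from the text itself: in the data-parallel (SIMD) pattern, a single instruction stream is applied uniformly to every element of the data. The example below runs sequentially on a CPU, but its structure, one operation, many data elements, is exactly what a SIMD processor executes in lockstep.

```python
# Illustrative sketch of the data-parallel pattern: one instruction
# (here, squaring) applied uniformly to every element of the data.
# On a SIMD device each element would be handled by a separate lane
# executing the same instruction at the same time.

def data_parallel_square(data):
    """Apply a single operation (x * x) uniformly to all elements."""
    return [x * x for x in data]

result = data_parallel_square([1.0, 2.0, 3.0, 4.0])
```

A task-parallel computation would instead run *different* functions on the nodes, which is precisely what the shared flow control of a SIMD architecture cannot accommodate efficiently.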

This does not mean that all computation on a GPU must be completely parallelizable. Indeed, although typically every thread will run identical functions, the functions themselves can condition on thread identifiers and data so that different instructions are executed in some threads. However, in SIMD architectures this incurs a performance penalty, since computation only occurs in parallel when the same instructions are being performed.
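The cost of such divergence can be illustrated with a toy model, which is an assumption for exposition rather than real GPU code: when threads in a lockstep group take different branches, the group effectively pays for both paths, because each path executes while the threads on the other path are masked off.

```python
# Toy cost model (not real GPU code) of SIMD branch divergence.
# branch_taken is one boolean per thread in a lockstep group;
# cost_a and cost_b are the costs of the two branch bodies.

def divergent_cost(branch_taken, cost_a, cost_b):
    """Cost paid by the whole group under lockstep execution."""
    if all(branch_taken):
        return cost_a            # uniform branch: only one path runs
    if not any(branch_taken):
        return cost_b
    return cost_a + cost_b       # divergence: both paths serialized

# A group that branches uniformly pays for one path; a group that
# diverges pays for both, even if only one thread took the rare path.
uniform = divergent_cost([True] * 32, 10, 4)
diverged = divergent_cost([True] * 31 + [False], 10, 4)
```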

In general, if a computing task is well-suited to SIMD parallelization then it will be well-suited to computation on a GPU. In particular, data-parallel computations with high arithmetic intensity (computations where the ratio of arithmetic operations to memory operations is high) are able to attain maximum performance from a GPU. This is because the volume of very fast arithmetic instructions can 'hide' the relatively slow memory accesses.
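As a hedged illustration of arithmetic intensity (the specific operation counts below are textbook approximations, not figures from this text): a vector update such as SAXPY performs a fixed two operations per element no matter the problem size, while a dense matrix multiply performs on the order of n times more arithmetic than memory traffic, so its intensity grows with n.

```python
# Rough arithmetic-intensity estimates (flops per byte moved),
# using idealized traffic counts for 4-byte floats.

def intensity_saxpy(n):
    """y = a*x + y: 2 flops per element, 3 accesses per element."""
    flops = 2 * n
    traffic = 3 * n * 4          # read x, read y, write y
    return flops / traffic

def intensity_matmul(n):
    """n x n matrix multiply: 2*n^3 flops, ideally 3*n^2 words moved."""
    flops = 2 * n ** 3
    traffic = 3 * n ** 2 * 4     # each matrix touched once (idealized)
    return flops / traffic
```

Under this model SAXPY's intensity is constant in n, so it is memory-bound, whereas the matrix multiply's intensity grows linearly with n, making it the kind of task that can hide memory latency behind arithmetic.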

From a statistical simulation perspective, integration via classical Monte Carlo or importance sampling is an ideal computational task in a SIMD framework. This is because each computing node can produce and weight a sample in parallel, assuming that the sampling procedure and the weighting procedure have no conditional branches. If these methods do have branches, speedup can be compromised by many computing nodes running idle while others finish their tasks. For example, this can occur if the sampling procedure uses rejection sampling.
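A minimal sketch of this structure (the target, proposal, and test function below are illustrative choices, not taken from the text): each call to the sample-and-weight routine is independent and branch-free, so on a SIMD device every node could execute it in lockstep. Here self-normalized importance sampling estimates the second moment of a standard normal target using an N(0, 2^2) proposal.

```python
import math
import random

def sample_and_weight(rng):
    """One 'node': draw x from the proposal N(0, 2^2) and weight it
    against the target N(0, 1). Branch-free, so in a SIMD setting
    nodes executing this in lockstep would never diverge."""
    x = rng.gauss(0.0, 2.0)
    log_p = -0.5 * x * x                       # unnormalized N(0,1)
    log_q = -0.5 * (x / 2.0) ** 2 - math.log(2.0)  # unnormalized N(0,4)
    w = math.exp(log_p - log_q)                # importance weight
    return x * x * w, w                        # weighted f(x) = x^2

rng = random.Random(1)
pairs = [sample_and_weight(rng) for _ in range(100000)]
estimate = sum(fw for fw, _ in pairs) / sum(w for _, w in pairs)
# estimate should be close to 1, the second moment of N(0, 1)
```

Note that the densities are only known up to a constant; the self-normalized estimator cancels the constants, which is why the unnormalized log-densities suffice.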

In general, if a computing task is not well-suited to SIMD parallelization then it will not be well-suited to computation on a GPU. In particular, task-parallel computations where one executes different instructions on the same or different data cannot utilize the shared flow control hardware on a GPU and often end up running sequentially. Of course, there are also many computational tasks in statistical computing that are just difficult to parallelize. For example, standard Metropolis-Hastings MCMC with a single chain is difficult to parallelize in the general case because it is a naturally sequential algorithm. Parallelization of this type of method usually centres around the parallelization of the target density evaluation, the sampling from or evaluation of the proposal density or computation of multiple possible execution paths.
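The sequential nature of single-chain Metropolis-Hastings is visible directly in code. The sketch below (a generic random-walk Metropolis sampler with an illustrative standard normal target, not a method specified in the text) has a loop in which each state depends on the previous one, so the iterations cannot be distributed across computing nodes.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step, rng):
    """Random-walk Metropolis sampler. The loop-carried dependence
    (x at step t depends on x at step t-1) is what makes the
    algorithm naturally sequential."""
    x, chain = x0, []
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step)                  # propose
        if math.log(rng.random()) < log_target(y) - log_target(x):
            x = y                                     # accept
        chain.append(x)                               # else: keep x
    return chain

rng = random.Random(0)
# Illustrative target: standard normal, log density -x^2/2 (unnormalized).
chain = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 20000, 1.0, rng)
```

Only the work *inside* one iteration, here the target evaluation, is available for parallelization, which matches the parallelization strategies listed above.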

Any computation that can be done in parallel can also be done by distributing work amongst multiple computing nodes that are connected only via a network. In addition, this type of work distribution can be appealing for task parallel computations since there are no restrictions on the types of operations each computing node can perform and work can even be distributed amongst specialized nodes. However, the cost of this type of distribution is in the time taken and energy consumed for data to be transferred between nodes and in the overhead associated with having each node doing its own flow control when it might have been shared.

Communication costs in a distributed system are high when the latency of the data transfer mechanism is high and its bandwidth is low. However, these costs also depend on the frequency and volume of the communication. As such, methods such as importance sampling are not particularly affected by distributing the work in terms of total computation time, since almost no communication is required. In contrast, methods such as sequential Monte Carlo are affected, since communication occurs frequently. In such cases, when the procedure can be parallelized effectively there is significant speedup to be gained by performing the computation on a many-core processor like a GPU.