Data parallelism is a style of programming where essentially the same operation is applied to a large collection of values. This style became popular during the 80s and early 90s as a convenient way of programming large vector processors. Data parallelism has remained popular, especially in light of the rise of GPGPU programming. Often, data parallel programming is used for fine-grained parallelism, but it works at larger granularity too. For example, MapReduce is a restricted example of data parallelism.
Systems that support data parallelism typically provide a handful of data structures that can be manipulated with a set of parallel operators. These data structures include vectors, segmented vectors, and arrays. In this post, we’ll take a look at these different structures and in a later post we’ll discuss some of the parallel operators that manipulate them.
Data parallel systems typically provide some combination of vectors, segmented vectors, and arrays. In the abstract, we'll consider a vector to be an ordered, finite sequence of elements of the same type. Most often, these are scalar values, but some systems allow vectors of vectors. A segmented vector is a vector that has been divided into segments (surprise!). Below is an example of a vector (X) and a segmented vector (Xs):

X  = [1 2 3 4 5 6 7 8]
Xs = [[1 2 3] [4 5] [6 7 8]]
The first vector just includes the numbers one through eight, while the second is made up of three segments. Notice how the segments don’t have to be the same length. You might also notice that the segmented vector looks a lot like a vector of vectors of numbers. Often, languages that support nested vectors implement them as segmented vectors. Representing nested vectors as segmented vectors is essentially the idea behind flattening, which is used heavily in the implementations of NESL and Data Parallel Haskell, for example.
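The flattening idea can be sketched in a few lines of Python; the helper names `flatten` and `unflatten` are illustrative here, not taken from NESL or Data Parallel Haskell.

```python
# Sketch: a nested vector is stored as a flat data vector plus
# per-segment bookkeeping (here, the length of each segment).

def flatten(nested):
    """Turn a vector of vectors into (data, lengths)."""
    data = [x for seg in nested for x in seg]
    lengths = [len(seg) for seg in nested]
    return data, lengths

def unflatten(data, lengths):
    """Reconstruct the nested vector from (data, lengths)."""
    nested, i = [], 0
    for n in lengths:
        nested.append(data[i:i + n])
        i += n
    return nested

Xs = [[1, 2, 3], [4, 5], [6, 7, 8]]
data, lengths = flatten(Xs)
# data    -> [1, 2, 3, 4, 5, 6, 7, 8]
# lengths -> [3, 2, 3]
```

The point is that the element data ends up contiguous, while the nesting structure is carried separately.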
Hardware rarely supports segmented vectors directly, so they are instead represented as a pair of a data vector and a segment descriptor. There are several possibilities for segment descriptors, and different representations have different performance characteristics depending on the hardware. One representation for the segment descriptor is the flag vector, which we see below.
Xdata = [1 2 3 4 5 6 7 8]
Xseg  = [1 0 0 1 0 1 0 0]
This example is the same as Xs from above. The segment descriptor, Xseg, includes a 1 in each element that starts a new segment and a 0 otherwise. This representation can simplify the implementation of some operations, such as scan, but it cannot describe zero-length segments.
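As a sketch of why the flag representation helps with scans, here is a sequential Python version of an inclusive segmented sum-scan; a real data parallel system would express this with parallel primitives rather than a loop.

```python
# Sketch: an inclusive segmented prefix sum. The running sum restarts
# whenever the flag vector marks the start of a new segment.

def segmented_scan(data, flags):
    out, running = [], 0
    for x, f in zip(data, flags):
        running = x if f == 1 else running + x
        out.append(running)
    return out

Xdata = [1, 2, 3, 4, 5, 6, 7, 8]
Xseg  = [1, 0, 0, 1, 0, 1, 0, 0]
# segmented_scan(Xdata, Xseg) -> [1, 3, 6, 4, 9, 6, 13, 21]
```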
Another option is to store the length of each segment in the segment descriptor:
Xdata = [1 2 3 4 5 6 7 8]
Xseg  = [3 2 3]
This representation makes it easy to tell how many segments there are, and how long each one is, but it makes accessing individual elements more difficult because it’s not obvious where a segment starts. Instead, we could have the segment descriptor store the indexes of the beginning of each segment:
Xdata = [1 2 3 4 5 6 7 8]
Xseg  = [0 3 5]
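A small accessor shows why start indexes make random access straightforward: element j of segment i lives at position Xseg[i] + j of the data vector. The helper `seg_get` is hypothetical, just for illustration.

```python
# Sketch: random access into a segmented vector stored as
# (data, start indexes). Element j of segment i is at starts[i] + j.

def seg_get(data, starts, i, j):
    return data[starts[i] + j]

Xdata = [1, 2, 3, 4, 5, 6, 7, 8]
Xseg  = [0, 3, 5]
# seg_get(Xdata, Xseg, 1, 1) -> 5  (second element of the second segment)
```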
As one final option (I’m sure there are many more), we’ll consider a case where each element in the segment descriptor indicates the segment that the data lies in:
Xdata = [1 2 3 4 5 6 7 8]
Xseg  = [0 0 0 1 1 2 2 2]
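One thing the segment-id representation makes easy is per-segment reductions: each element is accumulated into the slot named by its segment id. The sketch below is sequential; on a GPU this pattern would typically be a scatter-add.

```python
# Sketch: per-segment sums using the segment-id representation.
# Each element is added into the bucket named by its segment id.

def segment_sums(data, ids):
    sums = [0] * (max(ids) + 1)
    for x, s in zip(data, ids):
        sums[s] += x
    return sums

Xdata = [1, 2, 3, 4, 5, 6, 7, 8]
Xseg  = [0, 0, 0, 1, 1, 2, 2, 2]
# segment_sums(Xdata, Xseg) -> [6, 9, 21]
```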
Each of these representations has different tradeoffs. Some take more space, while others enable more efficient implementations of certain operations, so which operations matter most will inform the choice of representation. Fortunately, many of these representations can be converted to one another with a constant number of operations. You might notice some similarities between these representations and various sparse matrix formats.
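As an example of such conversions, here are sequential Python sketches of two of them: lengths to start indexes (an exclusive prefix sum) and start indexes to flags (a scatter of ones). Each is a constant number of vector operations when expressed with parallel primitives.

```python
import itertools

def lengths_to_starts(lengths):
    # Exclusive prefix sum of the segment lengths gives the start indexes.
    return list(itertools.accumulate([0] + lengths[:-1]))

def starts_to_flags(starts, n):
    # Scatter a 1 into each position that begins a segment.
    flags = [0] * n
    for s in starts:
        flags[s] = 1
    return flags

# lengths_to_starts([3, 2, 3])   -> [0, 3, 5]
# starts_to_flags([0, 3, 5], 8)  -> [1, 0, 0, 1, 0, 1, 0, 0]
```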
Finally, we consider the array type. This is similar to a segmented vector in which each segment has the same size. Below is an example.
    [1 2 3]
A = [4 5 6]
    [7 8 9]
We can certainly represent this exactly as we would a segmented vector, but we can do many operations more efficiently if we instead store the array as a data vector and a list of dimensions. For example:
Adata = [1 2 3 4 5 6 7 8 9]
Adim  = [3 3]
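With a flat data vector plus a dimension list, element (i, j) of a row-major array can be found by arithmetic alone, with no segment descriptor in sight; `array_get` is an illustrative helper, not part of any particular system.

```python
# Sketch: indexing a row-major 2D array stored as (data, dims).
# Element (i, j) lives at flat index i * ncols + j.

def array_get(data, dims, i, j):
    nrows, ncols = dims
    return data[i * ncols + j]

Adata = [1, 2, 3, 4, 5, 6, 7, 8, 9]
Adim  = [3, 3]
# array_get(Adata, Adim, 2, 1) -> 8
```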
These structures provide a solid basis for data parallel computing. Though there are many possible representations, one commonality is that the actual data elements are stored in a contiguous block of memory. This feature means that many data parallel operators, such as those we’ll discuss in an upcoming post, can be implemented in a way that maps efficiently onto data parallel processors like GPUs.