
GPU implementation

Each procedural step of our 3D FDTD algorithm is implemented as a GPU kernel function that is called at each time step by a CPU-based control thread. The computationally intensive steps of our algorithm are thus offloaded to the GPU (see Figure 2), where computation is greatly accelerated by the concurrent execution of thousands of threads across the many hundreds of cores of the device.
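A simplified CUDA sketch of this control structure is shown below; the kernel names, argument lists, variable names (e.g., d_v2dt2) and launch configurations are illustrative placeholders rather than our actual interface, and the kernel bodies are elaborated in the remainder of this section.

#include <cuda_runtime.h>

// Placeholder kernels standing in for the numbered steps; bodies are omitted here.
__global__ void fdStencilKernel(float *lap, const float *p, int nx, int ny, int nz) { /* Steps 1 and 5 */ }
__global__ void updateKernel(float *p, const float *lap, const float *v2dt2, int n)  { /* Steps 2-4, 6-7, 9 */ }
__global__ void boundaryKernel(float *p, int nx, int ny, int nz)                     { /* Step 8 */ }

// CPU-based control thread: one pass through the kernel sequence per time step.
void timeStepLoop(float *d_p, float *d_lap, float *d_v2dt2,
                  int nx, int ny, int nz, int nt)
{
    int  n = nx * ny * nz;
    dim3 block2d(32, 16);                                 // illustrative thread-block shape
    dim3 grid2d((nx + 31) / 32, (nz + 15) / 16);
    int  block1d = 256, grid1d = (n + block1d - 1) / block1d;

    for (int it = 0; it < nt; ++it) {
        fdStencilKernel<<<grid2d, block2d>>>(d_lap, d_p, nx, ny, nz);   // Steps 1 and 5
        updateKernel   <<<grid1d, block1d>>>(d_p, d_lap, d_v2dt2, n);   // element-wise steps
        boundaryKernel <<<grid2d, block2d>>>(d_p, nx, ny, nz);          // Step 8
    }
    cudaDeviceSynchronize();                              // block until the final step completes
}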

Steps 1 and 5, the most computationally expensive steps of our algorithm, apply the FD stencil shown in Figure 1 at every point in the computational grid. The 8th-order stencil requires evaluating 25 data values per grid point, which potentially leads to a large number of transactions to high-latency global memory. Minimizing the total number of these transactions is thus essential for maintaining high performance.
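To illustrate the source of this redundancy, the following naive kernel sketch evaluates the 25-point stencil with every thread fetching all 25 values directly from global memory, so neighboring threads repeatedly re-read the same data. The kernel name and launch layout are illustrative, and the coefficient values shown are the standard 8th-order central-difference weights for the second derivative, stated here as an assumption rather than taken from the implementation.

// Standard 8th-order central FD coefficients for the second derivative (assumed values).
__constant__ float c8[5] = { -205.f/72.f, 8.f/5.f, -1.f/5.f, 8.f/315.f, -1.f/560.f };

__global__ void laplacianNaive(float *out, const float *in,
                               int nx, int ny, int nz, float inv_h2)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iz = blockIdx.y * blockDim.y + threadIdx.y;
    int iy = blockIdx.z;                                  // one layer of blocks per y-slice

    if (ix < 4 || ix >= nx - 4 || iz < 4 || iz >= nz - 4 ||
        iy < 4 || iy >= ny - 4) return;                   // skip the stencil halo

    long sx = 1, sz = nx, sy = (long)nx * nz;             // x fastest, then z, then y
    long idx = iy * sy + iz * sz + ix * sx;

    float lap = 3.f * c8[0] * in[idx];                    // centre value shared by all 3 axes
    for (int r = 1; r <= 4; ++r)                          // 24 off-centre reads: 25 in total
        lap += c8[r] * (in[idx + r*sx] + in[idx - r*sx]
                      + in[idx + r*sz] + in[idx - r*sz]
                      + in[idx + r*sy] + in[idx - r*sy]);
    out[idx] = lap * inv_h2;                              // assumes equal grid spacing h
}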

We adopt the method proposed by Micikevicius (2009) that minimizes global-memory read redundancy and thereby mitigates the negative performance impact of high-latency memory. This approach optimizes the sharing of data between threads - and thereby reduces the number of global memory transactions - by retrieving and storing a 2D plane of data in shared memory, which is smaller but roughly two orders of magnitude faster, and evaluating neighboring stencils concurrently. The 2D plane of threads is oriented perpendicular to the slowest-varying $y$-axis, which optimizes the retrieval of values from global memory by reading from more contiguous data blocks. This 2D algorithm then repeats slice-by-slice as the computation progresses through the 3D volume.
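The kernel sketch below illustrates this tiling scheme, reusing the coefficient array c8 from the previous listing: each thread block stages one tile of the current x-z slice (plus halos) in shared memory, keeps its y-direction neighbors in registers, and marches slice-by-slice along the y-axis. The tile dimensions, variable names, and the assumption that the interior grid extents divide evenly into tiles are illustrative choices, not necessarily those of the actual implementation.

#define RADIUS 4
#define TILE_X 32
#define TILE_Z 16

// Launch with block = (TILE_X, TILE_Z) and
// grid = ((nx - 2*RADIUS)/TILE_X, (nz - 2*RADIUS)/TILE_Z).
__global__ void laplacianTiled(float *out, const float *in,
                               int nx, int ny, int nz, float inv_h2)
{
    __shared__ float tile[TILE_Z + 2*RADIUS][TILE_X + 2*RADIUS];

    // Interior point handled by this thread; for brevity the interior extents
    // nx - 2*RADIUS and nz - 2*RADIUS are assumed to be multiples of the tile size.
    int ix = blockIdx.x * TILE_X + threadIdx.x + RADIUS;
    int iz = blockIdx.y * TILE_Z + threadIdx.y + RADIUS;
    int tx = threadIdx.x + RADIUS;                        // position inside the shared tile
    int tz = threadIdx.y + RADIUS;

    long sxy = (long)nx * nz;                             // stride between y-slices
    long idx = (long)RADIUS * sxy + (long)iz * nx + ix;   // start at slice y = RADIUS

    // The y-direction neighbours are carried in a per-thread register queue.
    float behind[RADIUS], infront[RADIUS], current;
    for (int r = 0; r < RADIUS; ++r) {
        behind[r]  = in[idx - (long)(RADIUS - r) * sxy];
        infront[r] = in[idx + (long)(r + 1) * sxy];
    }
    current = in[idx];

    // March slice-by-slice along the slowest-varying y-axis.
    for (int iy = RADIUS; iy < ny - RADIUS; ++iy, idx += sxy) {
        __syncthreads();                                  // protect the previous slice's tile
        tile[tz][tx] = current;                           // every thread stages its own value
        if (threadIdx.x < RADIUS) {                       // edge threads also stage x halos
            tile[tz][threadIdx.x] = in[idx - RADIUS];
            tile[tz][tx + TILE_X] = in[idx + TILE_X];
        }
        if (threadIdx.y < RADIUS) {                       // ... and z halos
            tile[threadIdx.y][tx] = in[idx - (long)RADIUS * nx];
            tile[tz + TILE_Z][tx] = in[idx + (long)TILE_Z * nx];
        }
        __syncthreads();

        // x and z contributions come from shared memory, y contributions from registers.
        float lap = 3.f * c8[0] * current;
        for (int r = 1; r <= RADIUS; ++r)
            lap += c8[r] * (tile[tz][tx - r] + tile[tz][tx + r]
                          + tile[tz - r][tx] + tile[tz + r][tx]
                          + behind[RADIUS - r] + infront[r - 1]);
        out[idx] = lap * inv_h2;

        // Shift the register queue forward to the next y-slice.
        for (int r = 0; r < RADIUS - 1; ++r) behind[r] = behind[r + 1];
        behind[RADIUS - 1] = current;
        current = infront[0];
        for (int r = 0; r < RADIUS - 1; ++r) infront[r] = infront[r + 1];
        if (iy + RADIUS + 1 < ny)                         // guard the final prefetch
            infront[RADIUS - 1] = in[idx + (long)(RADIUS + 1) * sxy];
    }
}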

The boundary condition kernel in Step 8, which applies a cascade of operators to treat the domain edge values, also exhibits performance limitations due to its memory access pattern. Wavefield values are stored as a 1D array that is contiguous along the $x$-axis, strided by the length of one grid line along the $z$-axis, and strided by the size of a full 2D plane along the $y$-axis. Accordingly, applying boundary conditions in the $z$-$y$ plane necessarily requires retrieving data from non-contiguous memory locations, leading to sub-optimal kernel performance. It remains to be determined whether this limitation can be circumvented by storing grid data in an alternative structure or in another form of GPU memory better optimized for 3D spatial locality.
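The short sketch below illustrates the layout issue for a hypothetical one-sided update applied on a constant-$x$ face (i.e., in the $z$-$y$ plane); the indexing macro, kernel name and the placeholder update are illustrative only and do not reproduce the actual boundary operators.

// With x fastest in memory, neighbouring threads on a z-y face touch addresses
// that are nx (or nx*nz) floats apart, so their reads cannot coalesce.
#define INDEX(ix, iz, iy, nx, nz) ((long)(iy) * (long)(nx) * (nz) + (long)(iz) * (nx) + (ix))

__global__ void boundaryFaceX0(float *p, int nx, int ny, int nz, float w)
{
    int iz = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (iz >= nz || iy >= ny) return;

    // Adjacent threads (consecutive iz) access elements nx floats apart:
    // each 4-byte read lands in a different memory segment.
    long i0 = INDEX(0, iz, iy, nx, nz);
    long i1 = INDEX(1, iz, iy, nx, nz);
    p[i0] = w * p[i1];                  // placeholder one-sided boundary update
}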

The GPU kernels for Steps 2-4, 6-7, and 9 represent vector-vector operations that are highly efficient on the GPU. We adopt a straightforward data-parallel scheme in which one GPU thread per grid point calculates the required values in a massively concurrent fashion. The high latency associated with global memory is hidden by the high degree of concurrency, which allows computation to continue in some threads while others wait for memory transactions to complete. The use of shared memory for these kernels affords no acceleration because there is no redundancy in their memory transactions. Thread-block dimensions for each kernel invocation were selected to maximize occupancy (as determined by NVIDIA's CUDA Occupancy Calculator) and then tuned experimentally with the NVIDIA Visual Profiler to identify the optimal configuration.
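A representative element-wise kernel and its 1D launch configuration are sketched below; the kernel name and the particular operation are hypothetical, since the actual computations performed in each of these steps differ.

// One thread per grid point, no shared memory: a simple fused multiply-accumulate
// standing in for the element-wise operations of Steps 2-4, 6-7 and 9.
__global__ void vecMulAdd(float *a, const float *b, const float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += b[i] * c[i];
}

// Launch with a 1D configuration; the block size (here 256) is the kind of
// parameter first chosen for occupancy and then tuned with the profiler.
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// vecMulAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);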


