Implementation improvements

One important thing to avoid when programming parallel algorithms for GPUs is thread divergence, which occurs when concurrently executing threads follow different execution paths. This behaviour cannot be avoided in the conditional structures (if) of Algorithms 3 and 4. The iteration structures (forall) of Algorithms 1 and 3 can also suffer from thread divergence, because threads in charge of vertices near the border of the velocity model have fewer neighbor vertices to process. One possible solution is to complete the neighborhoods of these vertices by extending the velocity model with dummy vertices. The weights between these dummy vertices and the normal ones are given sufficiently large values to ensure that they never form part of any shortest ray path. This is shown in Figure 5, where the white vertices have been added to the model and the dotted edges have large weights. The kernels in Algorithms 1 and 3 are still launched only for the normal model vertices.

Figure 5. Model extension with dummy vertices to diminish thread divergence.
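
The padding can be done on the host before the model is uploaded to the device. The following sketch is only illustrative, not the paper's actual code: it assumes a 2D model stored in row-major order and edge weights computed from slowness (the reciprocal of velocity), so that assigning dummy vertices a near-zero velocity makes every edge that touches them effectively infinite. The names padModel and DUMMY_V are hypothetical.

#include <stdlib.h>

/* Near-zero velocity: assuming edge weights proportional to slowness
   (1/velocity), any edge touching a dummy vertex gets an effectively
   infinite weight and can never lie on a shortest ray path. */
#define DUMMY_V 1e-20f

/* Extend an nx-by-nz velocity model with a border of dummy vertices of
   width r (the neighborhood radius), so that every real vertex has a
   complete neighborhood and the forall loops need no boundary checks. */
float *padModel(const float *v, int nx, int nz, int r)
{
    int pnx = nx + 2 * r;               /* padded dimensions */
    int pnz = nz + 2 * r;
    float *pv = (float *) malloc(sizeof(float) * pnx * pnz);

    for (int iz = 0; iz < pnz; iz++) {
        for (int ix = 0; ix < pnx; ix++) {
            int jx = ix - r, jz = iz - r;
            if (jx >= 0 && jx < nx && jz >= 0 && jz < nz)
                pv[iz * pnx + ix] = v[jz * nx + jx];  /* real vertex  */
            else
                pv[iz * pnx + ix] = DUMMY_V;          /* dummy vertex */
        }
    }
    return pv;
}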

Another improvement is to place data that is read multiple times in the device shared memory, which is faster than global memory. Shared memory is shared only within groups of threads called blocks. To take advantage of it, the model is divided into rectangular vertex areas, each containing enough information to be processed completely and independently by one block. Because each vertex depends on its neighbor vertices, these areas must include some vertices from adjacent areas, i.e., the rectangular areas overlap. Figure 6 shows a velocity model and a group of overlapping areas. The striped zones are the vertices processed by each block. The non-striped border zones are vertices from adjacent areas that are needed by the current block; this border has a thickness equal to the neighborhood radius $r$. Grey areas correspond to the dummy vertices added earlier to reduce thread divergence. This approach is used by Micikevicius (2009) to compute 3D finite differences, but with a different overlapping shape.

In Algorithm 1 the velocity values in array $v$ are read multiple times, so that array is a good candidate to be loaded into shared memory. Before the iteration structure of this kernel function, each thread reads the velocity of its vertex from the global array $v$ once and stores it in a shared array (this is the striped zone in Figure 6). Some threads are also assigned to copy the overlapping zones of $v$, i.e., the non-striped zones in Figure 6. The same technique can be applied in Algorithm 3, but with the traveltime array $tt$. There is no need to copy the weight array $w$ to shared memory in this kernel function because each of its elements is read only once.
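
A minimal sketch of how this cooperative load could look is given below. It is not the paper's actual kernel: it assumes a 2D padded model of row width pnx, a block of BX-by-BZ threads, and a neighborhood radius R fixed at compile time; the kernel name forwardStar and all constants are illustrative. Each thread copies its own vertex into the shared tile (the striped zone of Figure 6), and threads near the tile border also copy the overlapping halo (the non-striped zone). Note that the dummy-vertex padding introduced above guarantees that these halo reads never fall outside the model.

#define BX 16   /* threads per block in x */
#define BZ 16   /* threads per block in z */
#define R   2   /* neighborhood radius; illustrative value */

__global__ void forwardStar(const float *v, int pnx)
{
    __shared__ float sv[BZ + 2*R][BX + 2*R];

    /* global indices skip the dummy border of width R */
    int gx = blockIdx.x * BX + threadIdx.x + R;
    int gz = blockIdx.y * BZ + threadIdx.y + R;
    /* local indices inside the shared tile */
    int lx = threadIdx.x + R;
    int lz = threadIdx.y + R;

    /* each thread loads its own vertex (striped zone) */
    sv[lz][lx] = v[gz * pnx + gx];

    /* threads near the tile border also load the halo (non-striped zone) */
    if (threadIdx.x < R) {
        sv[lz][lx - R]  = v[gz * pnx + gx - R];   /* left halo  */
        sv[lz][lx + BX] = v[gz * pnx + gx + BX];  /* right halo */
    }
    if (threadIdx.y < R) {
        sv[lz - R][lx]  = v[(gz - R) * pnx + gx];   /* top halo    */
        sv[lz + BZ][lx] = v[(gz + BZ) * pnx + gx];  /* bottom halo */
    }
    if (threadIdx.x < R && threadIdx.y < R) {       /* four corners */
        sv[lz - R][lx - R]   = v[(gz - R) * pnx + gx - R];
        sv[lz - R][lx + BX]  = v[(gz - R) * pnx + gx + BX];
        sv[lz + BZ][lx - R]  = v[(gz + BZ) * pnx + gx - R];
        sv[lz + BZ][lx + BX] = v[(gz + BZ) * pnx + gx + BX];
    }
    __syncthreads();

    /* ... the neighborhood iteration of Algorithm 1 would follow here,
       reading velocities from sv[][] instead of the global array v ... */
}

Assuming the model dimensions nx and nz are multiples of the block size, one thread would be launched per real vertex, e.g. forwardStar<<<dim3(nx/BX, nz/BZ), dim3(BX, BZ)>>>(d_v, nx + 2*R).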

Figure 6. Model partitioning in overlapping areas.

