
Wavefield reconstruction via boundary saving

One key problem of GPU-based implementations of FWI is that the computation is always much faster than the data transfer between host and device. Rather than storing the modeling time history on disk, many researchers therefore reconstruct the source wavefield on the fly, saving only the boundaries (Dussaud et al., 2008; Yang et al., 2014). For a $ 2N$ -th order finite difference, the regular-grid scheme needs to save $ N$ points on each side (Dussaud et al., 2008), while the staggered-grid scheme requires at least $ 2N-1$ points on each side (Yang et al., 2014). In our implementation, we use a 2nd-order regular-grid finite-difference scheme, because FWI begins with a rough model and the velocity refinement is mainly carried out during the optimization. Furthermore, high-order finite-difference and staggered-grid schemes do not necessarily make FWI converge to a more accurate solution, while they require more computational resources. A key observation for wavefield reconstruction is that one can reuse the same template by exchanging the roles of $ p^{k+1}$ and $ p^{k-1}$ . In other words, for forward modeling we use

$\displaystyle p^{k+1}=2p^{k}-p^{k-1}+v^2\Delta t^2 \nabla^2 p^{k}.$ (13)

while for backward reconstruction we use

$\displaystyle p^{k-1}=2p^{k}-p^{k+1}+v^2\Delta t^2 \nabla^2 p^{k}.$ (14)

The wavefield extrapolation can be stepped efficiently via a pointer swap, i.e.,

\begin{displaymath}\begin{split} &\mathrm{for}\;ix,iz:\quad p_0(:)=2p_1(:)-p_0(:)+v^2(:)\Delta t^2 \nabla^2 p_1(:)\\ &ptr=p_0;\;p_0=p_1;\;p_1=ptr;\quad //\;\mathrm{swap\; pointer} \end{split}\end{displaymath} (15)

where $ (:)$ denotes the grid index $ [ix,iz]$ ; $ p_0$ stores $ p^{k+1}$ (or $ p^{k-1}$ during reconstruction) and $ p_1$ stores $ p^k$ .
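To make the reuse of the same template concrete, the following is a minimal CUDA sketch of equation (15). It is an illustration under assumptions rather than the authors' exact code: the array names (d_p0, d_p1, d_vv), the column-major storage, and the parameters dtz and dtx (standing for $ \Delta t/\Delta z$ and $ \Delta t/\Delta x$ ) are all hypothetical. The same kernel serves both equation (13) and equation (14); only the roles played by the two wavefield arrays change between the forward and backward sweeps.

// Second-order update: p0 <- 2*p1 - p0 + v^2*dt^2*Laplacian(p1).
// Depending on which array currently holds p^{k-1} or p^{k+1}, this is
// either equation (13) (forward) or equation (14) (reconstruction).
__global__ void step(float *p0, const float *p1, const float *vv,
                     int nz, int nx, float dtz, float dtx)
{
    int iz = blockIdx.x*blockDim.x + threadIdx.x;   /* depth index   */
    int ix = blockIdx.y*blockDim.y + threadIdx.y;   /* lateral index */
    if (iz < 1 || iz >= nz-1 || ix < 1 || ix >= nx-1) return;
    int id = iz + ix*nz;                            /* column-major  */
    float v2  = vv[id]*vv[id];
    float lap = dtz*dtz*(p1[id-1]  - 2.f*p1[id] + p1[id+1])
              + dtx*dtx*(p1[id-nz] - 2.f*p1[id] + p1[id+nz]);
    p0[id] = 2.f*p1[id] - p0[id] + v2*lap;
}

/* host-side time loop: advance one step, then swap pointers, as in eq. (15) */
dim3 block(16, 16), grid((nz+15)/16, (nx+15)/16);
for (int it = 0; it < nt; it++) {
    step<<<grid, block>>>(d_p0, d_p1, d_vv, nz, nx, dt/dz, dt/dx);
    float *ptr = d_p0; d_p0 = d_p1; d_p1 = ptr;     /* swap pointers */
}

For backward reconstruction the same kernel and pointer swap are reused with the time loop run in reverse, after the stored boundary layer has been written back into the wavefield at each step (see the boundary-saving sketch below).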

Note that all the computation is done in GPU thread blocks. In our code, the block size is set to $ 16\times 16$ . We replicate the rightmost columns and bottommost rows enough times to pad the total model size to a multiple of the block size. As shown in Figure 2, the whole computation area is divided into $ 16\times 16$ blocks. For each block, we use an $ 18\times 18$ shared-memory array to cover all the grid points in the block; in other words, we add one redundant (halo) point on each side, which stores the value owned by a neighboring block, as marked by the window in Figure 2 (a kernel sketch follows the figure). For the blocks along the model edges, where the interior stencil cannot be applied, special care must be taken in choosing the absorbing boundary condition (ABC) when designing the FWI code. To allow for an efficient GPU implementation, we use the $ 45^{\circ}$ Clayton-Engquist ABC proposed by Clayton and Engquist (1977) and Engquist and Majda (1977). For the left boundary, it is

$\displaystyle \frac{\partial^2 p}{\partial x\partial t}-\frac{1}{v}\frac{\partial^2 p}{\partial t^2}=\frac{v}{2}\frac{\partial^2 p}{\partial z^2}$ (16)

which requires only one layer to be saved on each side for wavefield reconstruction. The equations for the right and bottom boundaries can be written in a similar way. To simulate a free-surface boundary condition, no ABC is applied to the top boundary. The same technique has been adopted by Liu et al. (2013) for reverse time migration. We believe its application to FWI is equally valuable and straightforward.
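Because only one layer per side is needed, the storage per time step is modest: on the order of $ 2n_z+n_x$ values for the left and right columns plus the bottom row (in this sketch the free-surface top row is assumed not to need saving). The following is a small illustrative CUDA kernel, not the authors' code; the name rw_boundary, the array bndr, and the column-major indexing are assumptions. It records the outermost layer during forward modeling and writes it back during backward reconstruction.

// One thread per saved point: id in [0, 2*nz+nx). write==true records the
// outermost layer into bndr; write==false restores it (hypothetical helper).
__global__ void rw_boundary(float *bndr, float *p, int nz, int nx, bool write)
{
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    if (id < nz) {                        /* left column,  ix = 0     */
        if (write) bndr[id] = p[id];            else p[id] = bndr[id];
    } else if (id < 2*nz) {               /* right column, ix = nx-1  */
        int k = (id-nz) + (nx-1)*nz;
        if (write) bndr[id] = p[k];             else p[k] = bndr[id];
    } else if (id < 2*nz+nx) {            /* bottom row,   iz = nz-1  */
        int k = (nz-1) + (id-2*nz)*nz;
        if (write) bndr[id] = p[k];             else p[k] = bndr[id];
    }
}

In the forward loop this would be launched right after each call to the update kernel, e.g. with (2*nz+nx+127)/128 blocks of 128 threads, writing into the slice &d_bndr[it*(2*nz+nx)]; in the backward loop the same kernel with write=false restores the layer before each reconstruction step.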

Figure 2.
2D blocks in GPU memory. The marked window indicates that the shared memory in every block needs to be extended on each side with halo ghost points storing the grid values from other blocks.
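To illustrate the tiling of Figure 2, here is a sketch of the same update as in equation (15) in which $ p_1$ is read through an $ 18\times 18$ shared-memory tile with one halo point per side. It is again an assumption-laden illustration rather than the exact kernel of this paper: nz and nx are assumed to be already padded to multiples of 16, the storage is column-major, and the launch uses block(16,16) and grid(nz/16, nx/16).

#define BDIM 16                               /* block size used in the text */

__global__ void step_shared(float *p0, const float *p1, const float *vv,
                            int nz, int nx, float dtz, float dtx)
{
    __shared__ float s[BDIM+2][BDIM+2];       /* 18x18 tile: block + halo    */
    int iz = blockIdx.x*BDIM + threadIdx.x;   /* global indices              */
    int ix = blockIdx.y*BDIM + threadIdx.y;
    int tz = threadIdx.x + 1, tx = threadIdx.y + 1;  /* position inside tile */
    int id = iz + ix*nz;

    s[tz][tx] = p1[id];                       /* point owned by this block   */
    /* halo points: values owned by the neighbouring blocks (zero outside)   */
    if (threadIdx.x == 0)      s[tz-1][tx] = (iz > 0)    ? p1[id-1]  : 0.f;
    if (threadIdx.x == BDIM-1) s[tz+1][tx] = (iz < nz-1) ? p1[id+1]  : 0.f;
    if (threadIdx.y == 0)      s[tz][tx-1] = (ix > 0)    ? p1[id-nz] : 0.f;
    if (threadIdx.y == BDIM-1) s[tz][tx+1] = (ix < nx-1) ? p1[id+nz] : 0.f;
    __syncthreads();

    /* edge points are updated separately by the ABC / free-surface code */
    if (iz < 1 || iz >= nz-1 || ix < 1 || ix >= nx-1) return;
    float v2  = vv[id]*vv[id];
    float lap = dtz*dtz*(s[tz-1][tx] - 2.f*s[tz][tx] + s[tz+1][tx])
              + dtx*dtx*(s[tz][tx-1] - 2.f*s[tz][tx] + s[tz][tx+1]);
    p0[id] = 2.f*s[tz][tx] - p0[id] + v2*lap;
}

Only the threads on the four edges of each block fetch the extra halo values, so the redundant global-memory traffic is limited to one row or column per side of the tile.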

