Seismic wave absorption and dispersion, resulting from the presence of intrinsic anelasticity in subsurface media, has been considered one of the most important factors degrading the quality of seismogram and decreasing the resolution of migrated images, which affects the reliability of seismic interpretation (Wang and Guo, 2004; Carcione, 2007; Wang et al., 2018b). Many models have been proposed to characterize this frequency-dependent attenuation, which can be roughly classified into two categories: mechanical models and mathematical models. The former includes standard spring-pot models such as the Maxwell body, Kelvin-Voigt model, standard linear solid (SLS) model (Carcione, 2007; Mainardi, 2010), their generalizations such as the generalized Maxwell body (GMB) and generalized Zener body (GZB) (Cao and Yin, 2014; Moczo and Kristek, 2005), and their fractional extensions such as the fractional Kelvin model (FKM), fractional Zener model (FZM) (Rossikhin and Shitikova, 2010; Näsholm and Holm, 2013). Generally, the attenuation of seismic waves appears to be adequately modeled by a power law (Strick, 1967; Szabo, 1995,1994) or linear dependence on frequency over a finite bandwidth (a special case of power-law attenuation with a power of 1) (Kjartansson, 1979; McDonal et al., 1958; Futterman, 1962). Therefore, another category of models is established on the assumption of power-law attenuation, which includes the Kolsky-Futterman model (Kolsky, 1956; Futterman, 1962), the power-law model and Kjartansson's constant-$Q$ model (Kjartansson, 1979).

The attenuation models mentioned above are designed for mathematically characterizing frequency-dependent attenuation effects of subsurface media and further paving the way to mitigate these effects during seismic wave propagation. Early attempts to compensate for the $Q$ effect (attenuation and dispersion effects) are mostly conducted in the framework of one-way wave-equation migration (OWWEM) (Mittet et al., 1995; Dai and West, 1994; Mittet, 2007; Zhang et al., 2012; Wang and Guo, 2004). In recent years, $Q$-RTM has received increasing attention from the geophysical community (Wang et al., 2018c; Causse and Ursin, 2000; Sun et al., 2016; Zhang et al., 2010; Li et al., 2016; Guo et al., 2016; Zhu et al., 2014), which generalizes acoustic RTM by considering viscoacoustic propagation and compensating amplitude loss and phase distortion during source and receiver wavefields extrapolation. Zhu and Harris (2014) proposed a novel viscoacoustic wave equation with decoupled fractional Laplacians (DFLs) which separately dominate amplitude attenuation and phase dispersion, and they further applied this viscoacoustic wave equation in RTM so as to improve the resolution and quality of the image (Zhu et al., 2014). This decoupled viscoacoustic wave equation is attractive for $Q$-RTM due to its flexibility for separate amplitude compensation and phase correction, which can be achieved by simply reversing the absorption proportionality coefficient in sign while leaving the equivalent dispersion parameter unchanged (Treeby et al., 2010; Zhu et al., 2014).

Although the basic paradigm of $Q$-RTM has been well-established in recent years, there are still some problems and limitations in the process of the implementation, i.e., intensive computation, huge storage requirements and frequent disk I/O, and the difficult issue of stability. The emerging graphics processing unit (GPU) computing technology, built around a scalable array of multithreaded Streaming Multiprocessors (SMs), presents an opportunity for accelerating $Q$-RTM much further by appropriately exploiting the GPU's architectural characteristics (Farquhar et al., 2016; Tan et al., 2016). As a booming technology, the GPU computing technology has been widely applied into seismic modeling (Micikevicius, 2009; Zhang and Gao, 2014), imaging (Liu et al., 2012; Foltinek et al., 2009; Liu et al., 2013; Zhang et al., 2009; Yang et al., 2014), and inversion (Shin et al., 2014; Yang et al., 2015). In this chapter, we present a CUDA-based code package named cu$Q$-RTM, which aims to tackle these problems so as to achieve an efficient, storage-saving and stable $Q$-RTM. Next, we will briefly introduce how cu$Q$-RTM is designed to deal with these challenges and then outline the architecture of the cu$Q$-RTM code package.

Specifically, in order to avoid intensive computation, we implement $Q$-RTM in a multi-level parallel (MLP) fashion, either synchronously or asynchronously, to take advantage of all the CPUs and GPUs available. In the framework of cu$Q$-RTM, the basic forward and backward modeling modules, based on viscoacoustic wave equations with DFLs, are efficiently simulated using the Fourier pseudospectral method (PSM) (Chen et al., 2016; Zhu and Harris, 2014; Carcione, 2010). Discrete Fourier transforms (DFTs) of complex wavefields are the most time-consuming parts of these modules. Fortunately, DFTs can be efficiently computed by calling the CUFFT library API (Guide, 2013), which provides a simple interface for computing parallel FFTs on the GPU, and a simple configuration mechanism called a plan that completely specifies the optimal plan of execution. The use of a CUFFT standard library brings two obvious benefits: the configuration mechanism allows us to create the plans once and execute the plans multiple times (at every time step of the iteration) without recalculation of the configuration. Every CUFFT plan can be associated with a specified CUDA stream. Streaming the CUFFT execution allows for potential overlap between transforms and memory copies and provides a balanced calculation load on each card of the GPUs. CUFFT library functions can only be executed on the device and called from the host, so we have to split my customized kernel functions into k-space components and x-space components. From the brief codes in Appendix A, one can clearly figure out how these components are interconnected.

Apart from the issue of intensive computation, extensive data storage and burdensome disk I/O are another two bottlenecks for conventional RTM, especially for CUDA-based RTM which demands frequent memory copying between host and device (Yang et al., 2014; Liu et al., 2013). In the past three decades, several wavefield reconstruction strategies have been developed to reach a reasonable compromise between the computer memory requirement and computational complexity, for example, reverse propagation coupled with effective boundary saving (Yang et al., 2014), the optimal checkpointing scheme (Griewank and Walther, 2000; Symes, 2007), and their combinations such as the time-reversal checkpointing method (Anderson et al., 2012) and the checkpointing-assisted reverse-forward simulation (CARFS) method (Yang et al., 2016). Yang et al. (2016) proposed a novel viscoacoustic wavefield reconstruction algorithm referred as CARFS, which is implemented by monitoring the energy errors of the reconstruction, and taking it as a criterion to decide whether forward simulation or reverse simulation will be performed at the next time step. Wang et al. (2017b) proposed a robust viscoacoustic wavefield reconstruction scheme using time-reversal checkpointing (TRC) and k-space filtering (KSF). In this hybrid scheme, TRC serves as a time-domain regularization to eliminate accumulating errors by replacing the reconstructed wavefield with the stored wavefield at checkpoints, whereas KSF further suppresses high-wavenumber artifacts introduced during time-reversal reconstruction. In the cu$Q$-RTM package, we adopt the checkpointing-assisted time-reversal reconstruction (CATRC) scheme to reconstruct source wavefields, which combines the efficiency of reverse propagation and the stability of checkpointing. Unlike CARFS, the proposed CATRC scheme keeps the reconstruction errors within an acceptable range by imposing low-pass filtering on the time-reversal reconstructed wavefield so as to maintain a fixed recomputation ratio of two.

Finally, amplitude compensation in $Q$-RTM suffers from numerical instability because of it boosts high-frequency noise arising from high-frequency noise in seismic data and numerical errors from the finite machine precision (Yang et al., 2016; Wang, 2009; Zhao et al., 2017; Zhu et al., 2014). Therefore stabilization needs to be introduced either in the frequency or wavenumber domain (Ammari et al., 2013; Kalimeris and Scherzer, 2012). Since the forward and backward modeling modules are simulated by PSM in cu$Q$-RTM, it is more natural to conduct stabilization in the wavenumber domain. In some literature concerning $Q$ compensation, high-frequency noises are suppressed by utilizing a low-pass Tukey filter with its cutoff frequency identified by the noise level of measured data (Li et al., 2016; Zhu et al., 2014; Treeby et al., 2010). However, conventional time-invariant filtering fails to adapt with $Q$ distribution and compensation depth (travel time). Wang et al. (2018c) developed an adaptive stabilization for Q-RTM by analytically deriving k-space Green's functions for the constant-$Q$ wave equation with DFLs and its compensated equation, where the stabilization factor can be explicitly identified by the specified gain limit according to an empirical formula. In the provided package, we utilize the proposed adaptive stabilization method to deal with numerical instability in $Q$-RTM, which exhibits superior properties of time-variance and $Q$-dependence over conventional low-pass filtering.

In this paper, we present an open-source code package cu$Q$-RTM, which overcomes several problems commonly existing in conventional $Q$-RTM such as intensive computation, data storage and numerical stability, by adopting stable and efficient strategies like streamed CUFFT, CATRC and adaptive stabilization. The general architecture of the cu$Q$-RTM code package consists of memory manipulation, modules, kernels, and multi-level parallelism. Each component plays an indispensable role in GPU-CPU cooperative computing. We further demonstrate the validity and efficiency of cu$Q$-RTM with both synthetic and field examples.