In this chapter, we have presented cu$Q$-RTM, an open-source code package equipped with a set of state-of-the-art strategies, including streamed CUFFT, the CATRC scheme, and adaptive stabilization, to achieve efficient and robust $Q$-RTM. The architecture of the cu$Q$-RTM package comprises four components: memory manipulation, kernels, modules, and multi-level parallelism (MLP). Task-oriented kernels are consolidated into several fully functional modules, which are in turn integrated into the complete $Q$-RTM workflow. The package is implemented in an MLP manner to take full advantage of all available CPUs and GPUs while maintaining good stability and flexibility. We have demonstrated the effectiveness and applicability of the package by performing $Q$-RTM on both synthetic and field data. In both cases, the migrated images with $Q$ compensation exhibit sharper reflections and more balanced amplitudes. Furthermore, speedup tests based on viscoacoustic modeling of layered models indicate that, with only a single GPU card, cu$Q$-RTM can be 50-80 times faster than a conventional CPU-based implementation. The strong scaling analysis of $Q$-RTM across multiple GPUs demonstrates the excellent scalability of the package.