A parallel sweeping preconditioner for heterogeneous 3D Helmholtz equations


## Global vector distributions

The goal of this subsection is to determine an appropriate scheme for distributing global vectors, i.e., ones representing a function over the entire domain (as opposed to only over a panel). While the factorizations themselves may each have taken place on a subteam of processes, in order to make use of all available processes for every subdomain solve, we assume from this point on that each auxiliary problem's frontal tree has been selectively inverted and is distributed over the entire set of processes using a subtree-to-subteam mapping (recall Fig. 4).

Since a subtree-to-subteam mapping will assign each supernode of an auxiliary problem to a team of processes, and each panel of the original 3D domain is by construction a subset of the domain of an auxiliary problem, it is straightforward to extend the supernodal subteam assignments to the full domain. We should then be able to distribute global vectors so that no communication is required to ready panel subvectors for subdomain solves (via extension by zero for interior panels, and trivially for the first panel). Since our nested dissection process does not partition in the shallow dimension of the quasi-2D subdomains (see Fig. 3), extending a vector defined over a panel of the original domain onto the PML-padded auxiliary domain simply requires individually extending each supernodal subvector by zero in that shallow dimension.

Consider an element-wise two-dimensional cyclic distribution (30) of a frontal matrix F over q processes using an r x c process grid, where r and c are O(sqrt(q)). Then the (i, j) entry of F will be stored by the process in the (i mod r, j mod c) position in the process grid. Using the notation from (30), this distributed front would be denoted as F[MC,MR], while its top-left quadrant would be referred to as F_TL[MC,MR] (see Fig. 5 for a depiction of an [MC,MR] matrix distribution).
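As a concrete sketch of the ownership rule just described, the following Python fragment (the function name is ours, not from (30)) maps a global entry (i, j) to its owner's position in an r x c process grid under the element-wise two-dimensional cyclic distribution:

```python
# Illustrative sketch: owner of entry (i, j) of a matrix distributed
# element-wise cyclically ([MC,MR]-style) over an r x c process grid.
def mc_mr_owner(i, j, r, c):
    """Return the grid position (row, col) of the process storing entry (i, j)."""
    return (i % r, j % c)

# Each process ends up owning a regular lattice of the global matrix;
# here we tabulate the owners of a small 4 x 6 matrix on a 2 x 3 grid.
r, c = 2, 3
owners = [[mc_mr_owner(i, j, r, c) for j in range(6)] for i in range(4)]
```

Note that consecutive rows (columns) of the matrix land in consecutive rows (columns) of the grid, which is what balances the storage of the fronts.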

Loosely speaking, [X,Y] means that each column of the matrix is distributed using the scheme denoted by X, and each row is distributed using the scheme denoted by Y. For the element-wise two-dimensional distribution used for F, [MC,MR], we have that the columns of F are distributed like Matrix Columns (MC), and the rows of F are distributed like Matrix Rows (MR). While this notation may seem vapid when only working with a single distributed matrix, it is useful when working with products of distributed matrices. For instance, if a '*' is used to represent rows/columns being redundantly stored (i.e., not distributed), then the result of every process multiplying its local submatrix of A[MC,*] with its local submatrix of B[*,MR] forms a distributed matrix (AB)[MC,MR] = A[MC,*] B[*,MR], where the last expression refers to the local multiplication process.

We can now decide on a distribution for each supernodal subvector, say x, based on the criteria that it should be fast to form F_TL x and F_TL^T x in the same distribution as x, given that F_TL is distributed as F_TL[MC,MR]. Suppose that we define a Column-major Vector distribution (VC) of x, say x[VC,*], to mean that entry x(i) is owned by process i mod q, which corresponds to position (i mod r, floor(i/r) mod c) in the process grid (if the grid is constructed with a column-major ordering of the process ranks; see the left side of Fig. 6). Then a call to MPI_Allgather (10) within each row of the process grid would allow for each process to collect all of the data necessary to form x[MC,*], as for any process row index s in {0, 1, ..., r-1},

{ i : i mod r = s } = ∪_{t=0}^{c-1} { i : i mod q = s + t r }.  (7)

See the left side of Fig. 7 for an example of an [MC,*] distribution of a matrix.
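The set identity behind this allgather, Eq. (7), can be spot-checked numerically: the entries that process row s needs under the [MC,*] distribution are exactly the union of the entries held by the c processes in that row under [VC,*]. A minimal check over a finite index range (grid sizes chosen arbitrarily):

```python
# Numeric spot-check of Eq. (7) for a small 2 x 3 process grid (q = 6).
# We restrict the nonnegative integers to a finite range [0, N).
r, c = 2, 3
q = r * c
N = 60
for s in range(r):
    # Left-hand side: entries assigned to process row s under [MC,*].
    lhs = {i for i in range(N) if i % r == s}
    # Right-hand side: union over the row's c processes (column-major
    # ranks s, s + r, ..., s + (c-1)r) of their [VC,*] entries.
    rhs = set().union(*({i for i in range(N) if i % q == s + t * r}
                        for t in range(c)))
    assert lhs == rhs
```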

Similarly, if x was distributed with a Row-major Vector distribution (VR), as shown on the right side of Fig. 6, say x[VR,*], so that entry x(i) is owned by the process in position (floor(i/c) mod r, i mod c) of the process grid, then a call to MPI_Allgather within each column of the process grid would provide each process with the data necessary to form x[MR,*]. Under reasonable assumptions, both of these redistributions can be shown to have per-process communication volume lower bounds of Ω(n/q) (if n is Ω(q)) and latency lower bounds of Ω(log2 q) (9). We also note that translating between x[VC,*] and x[VR,*] simply requires permuting which process owns each local subvector, so the communication volume would be n/q, while the latency cost is O(1).
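The claim that translating between the two vector distributions is a pure permutation can be illustrated directly: under [VC,*] and [VR,*] each process owns a single residue class of entries, and every entry in a given [VC,*] residue class has the same owner under [VR,*]. A sketch (function names are ours):

```python
# Owners of vector entry i under the column-major ([VC,*]) and
# row-major ([VR,*]) vector distributions on an r x c grid, q = r*c.
def vc_owner(i, r, c):
    # entry i -> rank i mod q; a column-major grid places that rank
    # at grid position (i mod r, (i // r) mod c)
    return (i % r, (i // r) % c)

def vr_owner(i, r, c):
    # entry i -> grid position ((i // c) mod r, i mod c)
    return ((i // c) % r, i % c)

# [VC,*] -> [VR,*] moves each process's entire local subvector to one
# other process: all entries of a [VC,*] residue class share a [VR,*] owner.
r, c = 2, 3
q = r * c
for rank in range(q):
    targets = {vr_owner(i, r, c) for i in range(60) if i % q == rank}
    assert len(targets) == 1
```

This is why the redistribution costs only n/q in volume and O(1) in latency: each process sends its whole local subvector in a single message.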

We have thus described efficient techniques for redistributing x to both the [MR,*] and [MC,*] distributions, which are the first steps for our parallel algorithms for forming F_TL x and F_TL^T x, respectively: Denoting the distributed result of each process in process column t multiplying its local submatrix of F_TL by its local subvector of x[MR,*] as z_t[MC,*], it holds that F_TL x = sum_{t=0}^{c-1} z_t. Since Eq. (7) also implies that each process's local data from a [VC,*] distribution is a subset of its local data from an [MC,*] distribution, a simultaneous summation and scattering of {z_t} within process rows, perhaps via MPI_Reduce_scatter or MPI_Reduce_scatter_block, yields the desired result, (F_TL x)[VC,*]. An analogous process with F_TL^T and x[MC,*] yields (F_TL^T x)[VR,*], which can then be cheaply permuted to form (F_TL^T x)[VC,*]. Both calls to MPI_Reduce_scatter_block can be shown to have the same communication lower bounds as the previously discussed MPI_Allgather calls (9).
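The algebra of this distributed matrix-vector product can be simulated serially: assign matrix columns to "process columns" cyclically, form one partial product per process column from only its local data, and sum the partials (the role played by MPI_Reduce_scatter). A hedged sketch using NumPy on a dense stand-in for F_TL:

```python
import numpy as np

# Serial simulation of the distributed y = F_TL x algorithm sketched above.
# Process column t owns the matrix columns j with j % c == t, and the
# matching [MR,*]-aligned subvector entries x[t::c]; its partial product
# z_t is a full-length vector. Summing the partials (in MPI, a
# reduce-scatter within process rows) recovers F_TL @ x.
rng = np.random.default_rng(0)
n, c = 8, 3                       # small arbitrary sizes for illustration
F = rng.standard_normal((n, n))   # dense stand-in for a front's F_TL
x = rng.standard_normal(n)

z = [F[:, t::c] @ x[t::c] for t in range(c)]  # one partial per process column
y = sum(z)
assert np.allclose(y, F @ x)
```

In the actual algorithm the summation is fused with the scatter back to the [VC,*] distribution, so no process ever holds the full result vector.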

As discussed at the beginning of this section, defining the distribution of each supernodal subvector specifies a distribution for a global vector. While the distribution shown in the left half of Fig. 6 simply assigns entry x(i) of a supernodal subvector, distributed over q processes, to process i mod q, we can instead choose an alignment parameter, σ, where 0 ≤ σ < q, and assign entry x(i) to process (i + σ) mod q. If we simply set σ = 0 for every supernode, as the discussion at the beginning of this subsection implied, then only as many processes as a single root separator has degrees of freedom would store root separator data for a global vector, and each root separator has relatively few degrees of freedom by construction. However, there is one root separator per auxiliary problem, so we can easily allow many more processes to share the storage of a global vector if the alignments are carefully chosen. It is important to notice that the top-left quadrant of each root separator's frontal matrix is distributed over a number of processes quadratic in the separator size, so far more processes can actively participate in the corresponding triangular matrix-vector multiplications than can store any one root separator subvector.
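The effect of the alignment parameter is easy to see in isolation: two equally small supernodes with different alignments occupy disjoint subsets of the process set, rather than both crowding onto the lowest-ranked processes. A minimal illustration (names are ours):

```python
# Illustrative alignment rule: with alignment sigma, entry i of a
# supernodal subvector is assigned to process (i + sigma) mod q
# instead of the default i mod q (i.e., sigma = 0).
def aligned_owner(i, sigma, q):
    return (i + sigma) % q

q = 8
# Two hypothetical 3-entry supernodes (e.g., small root separators):
owners_a = {aligned_owner(i, 0, q) for i in range(3)}  # sigma = 0
owners_b = {aligned_owner(i, 3, q) for i in range(3)}  # sigma = 3

# With sigma = 0 for both, they would share processes {0, 1, 2};
# staggered alignments spread the storage across the machine.
assert owners_a.isdisjoint(owners_b)
```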


2014-08-20