

Linear-time algorithm for a lowrank matrix approximation

In this appendix, we outline the lowrank matrix approximation algorithm in more detail.

Let $ N_x$ be the number of samples in both space and wavenumber. Let us denote the samples in the spatial domain by $ \mathbf{x}=\{x_1,\ldots,x_{N_x}\}$ and the ones in the Fourier domain by $ \mathbf{k}=\{k_1,\ldots,k_{N_x}\}$ . The elements of the interaction matrix $ \mathbf{W}$ from equation (11) are then defined as

$\displaystyle W_{ij} = e^{\mathrm{i} [\phi(x_i,k_j,\Delta t) - x_i \cdot k_j]},\quad 1\le i,j\le N_x.$ (17)
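For concreteness, a minimal numerical sketch of equation (17) is given below, assuming a hypothetical one-dimensional grid and a placeholder constant-velocity phase function; the sizes and the phase $ \phi$ used here are illustrative assumptions, not the paper's actual symbol.

import numpy as np

# Minimal sketch of equation (17), assuming a hypothetical constant-velocity
# phase phi(x,k,dt) = x*k + v*|k|*dt; in the paper phi comes from the
# wave-extrapolation symbol.
Nx = 64                                        # samples in space and wavenumber
dt = 0.001                                     # time step (s), illustrative
v = 2.0                                        # hypothetical velocity (km/s)
x = np.linspace(0.0, 1.0, Nx)                  # spatial samples x_1,...,x_{Nx}
k = 2*np.pi*np.fft.fftfreq(Nx, d=x[1]-x[0])    # wavenumber samples k_1,...,k_{Nx}

def phi(xi, kj, dt):
    # placeholder phase function (an assumption, not the paper's symbol)
    return xi*kj + v*np.abs(kj)*dt

X, K = np.meshgrid(x, k, indexing='ij')
W = np.exp(1j*(phi(X, K, dt) - X*K))           # W_ij = exp(i[phi(x_i,k_j,dt) - x_i k_j])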

Here we describe an algorithm by Engquist and Ying (2009) that generates, in time linear in $ N_x$ , an approximate rank-$ r$ factorization of $ \mathbf{W}$ of the following form

$\displaystyle \mathbf{W} \approx \mathbf{U M V^*}\;,$ (18)

where $ \mathbf{U}$ consists of $ r$ selected columns from $ \mathbf{W}$ , $ \mathbf{M}$ is a matrix of size $ r\times r$ and $ \mathbf{V^*}$ consists of $ r$ selected rows from $ \mathbf{W}$ .

The first question is: which columns of $ \mathbf{W}$ shall one pick for the matrix $ \mathbf{U}$ ? It has been shown by Goreinov et al. (1997) and Gu and Eisenstat (1996) that the $ r$ -dimensional volume spanned by these columns should be the maximum or close to the maximum among all possible choices of $ r$ columns from $ \mathbf{W}$ . More precisely, suppose $ \mathbf{W} = [w_1,\ldots,w_{N_x}]$ is a column partitioning of $ \mathbf{W}$ . Then one aims to find $ \{j_1,\ldots,j_r\}$ such that

$\displaystyle \{j_1,\ldots,j_r\} =$   argmax$\displaystyle _{\{j_1,\ldots,j_r\}} \mathrm{vol}_r(w_{j_1},\ldots,w_{j_r}).$ (19)

However, finding a set of $ r$ columns with almost the maximum $ r$ -dimensional volume is a computationally difficult problem for two reasons. First, the length of the vectors, $ N_x$ , is typically very large for three-dimensional problems, hence manipulating these vectors can be costly. Second, the number of vectors, also $ N_x$ , is large. An exhaustive search over all possible choices of $ r$ vectors to find the one with the maximum volume is prohibitively expensive, so one needs a more practical approach.

In order to overcome the problem associated with long vectors, the first idea is to project them to a lower-dimensional space and search for the set of vectors with maximum volume among the projected vectors. However, one needs to ensure that the volume is roughly preserved after the projection, so that the set of vectors with the maximum projected volume also has a near-maximum volume in the original space. One of the most celebrated theorems in high-dimensional geometry and probability is the following Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984).

Theorem 1   Let $ v_1,\ldots,v_N$ be a set of $ N$ vectors in $ R^d$ . Let $ T$ be a randomly generated subspace of dimension $ t=O(\log N /\varepsilon^2)$ and use $ P_T$ to denote the orthogonal projection onto $ T$ . Then with high probability,

$\displaystyle (1-\varepsilon) \lVert v_i-v_j \rVert \le
\sqrt{\frac{d}{t}} \lVert P_T v_i-P_T v_j \rVert \le
(1+\varepsilon) \lVert v_i-v_j \rVert
$

for $ 1\le i,j \le N$ .
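As a quick numerical illustration of this statement (not part of the original argument), one can project a set of random vectors onto a random $ t$ -dimensional subspace and compare the rescaled pairwise distances; all sizes below are arbitrary choices.

import numpy as np

# Sketch: pairwise distances survive a random projection, up to the
# sqrt(d/t) rescaling in Theorem 1.  All sizes are illustrative assumptions.
rng = np.random.default_rng(0)
d, N, t = 2000, 50, 500
V = rng.standard_normal((d, N))                    # the vectors v_1,...,v_N

Q, _ = np.linalg.qr(rng.standard_normal((d, t)))   # basis of a random subspace T
PV = Q.T @ V                                       # coordinates of P_T v_i in T

ratios = []
for i in range(N):
    for j in range(i+1, N):
        orig = np.linalg.norm(V[:, i] - V[:, j])
        proj = np.sqrt(d/t)*np.linalg.norm(PV[:, i] - PV[:, j])
        ratios.append(proj/orig)
print(min(ratios), max(ratios))                    # both close to 1 (tighter as t grows)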

This theorem essentially says that projecting to a subspace of dimension $ O(\log N)$ preserves the pairwise distance between $ N$ arbitrary vectors. There is an immediate generalization of this theorem due to Magen (2002), formulated slightly differently for our purpose.

Theorem 2   Let $ v_1,\ldots,v_N$ be a set of $ N$ vectors in $ R^d$ . Let $ T$ be a randomly generated subspace of dimension $ t=O(r^3 \log N /\varepsilon^2)$ and use $ P_T$ to denote the orthogonal projection onto $ T$ . Then with high probability,

$\displaystyle (1-\varepsilon) \cdot \mathrm{vol}_r(v_{i_1},\ldots,v_{i_r}) \le
\left(\frac{d}{t}\right)^{r/2} \mathrm{vol}_r(P_T v_{i_1},\ldots,P_T v_{i_r}) \le
(1+\varepsilon) \cdot \mathrm{vol}_r(v_{i_1},\ldots,v_{i_r})
$

for any $ \{i_1,\ldots,i_r\} \subset \{1,\ldots,N\}$ .
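A similar check can be sketched for $ r$ -dimensional volumes, computed here through the Gram determinant $ \sqrt{\det(V^*V)}$ ; the sizes, and the $ (d/t)^{r/2}$ rescaling mirroring the display above, are assumptions made for illustration.

import numpy as np

# Sketch: r-dimensional volume of r vectors before and after a random
# projection (rescaled by (d/t)^(r/2)), via the Gram determinant.
rng = np.random.default_rng(1)
d, r, t = 2000, 4, 1000

def vol_r(V):
    # volume of the parallelepiped spanned by the columns of V
    return np.sqrt(np.linalg.det(V.T @ V))

V = rng.standard_normal((d, r))
Q, _ = np.linalg.qr(rng.standard_normal((d, t)))   # random t-dimensional subspace
PV = Q.T @ V                                       # projected vectors

print(vol_r(V), (d/t)**(r/2)*vol_r(PV))            # ratio approaches 1 as t grows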

The main step of the proof is to bound the singular values of a random matrix between $ (1-\varepsilon)^{1/r}$ and $ (1+\varepsilon)^{1/r}$ (after a uniform scaling), which ensures that the $ r$ -dimensional volume is preserved to within factors of $ (1-\varepsilon)$ and $ (1+\varepsilon)$ . In order to obtain this bound on the singular values, we need $ t$ to be $ O(r^3 \log N)$ . However, bounding the singular values is only one way to bound the volume, hence it is possible to improve the dependence of $ t$ on $ r$ . In practice, we observe that $ t$ only needs to scale like $ O(r\log N)$ .

Given a generic subspace $ T$ of dimension $ t$ , computing the projections $ P_T w_1,\ldots, P_T w_{N_x}$ takes $ O(t N_x^2)$ steps. Recall that our goal is to find an algorithm with linear complexity, hence this is still too costly. In order to reduce the cost of the random projection, the second idea of our approach is to randomly choose $ t$ coordinates and then project (or restrict) each vector to these coordinates only. This projection involves much less randomness but is much more efficient to apply. Computationally, it is equivalent to restricting $ \mathbf{W}$ to $ t$ randomly selected rows. We do not yet have a theorem regarding the volume for this projection. However, it preserves the $ r$ -dimensional volume very well for the matrix $ \mathbf{W}$ , which is in fact due to the oscillatory nature of the columns of $ \mathbf{W}$ . We denote the resulting vectors by $ \{\mathbf{\widetilde{w}}_1,\ldots,\mathbf{\widetilde{w}}_{N_x}\}$ .
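In code, this restriction is just a row slice of $ \mathbf{W}$ , as in the sketch below; the matrix used here is a random oscillatory stand-in and the sizes are illustrative assumptions.

import numpy as np

# Sketch of the coordinate restriction: keep t randomly chosen rows of W.
# W below is a stand-in oscillatory matrix; Nx and t are illustrative.
rng = np.random.default_rng(2)
Nx, t = 64, 16
W = np.exp(1j*rng.uniform(0, 2*np.pi, (Nx, Nx)))   # stand-in for the true W

rows = rng.choice(Nx, size=t, replace=False)       # t random coordinates
W_tilde = W[rows, :]                               # W_tilde[:, j] is w~_j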

The next task is to find a set of columns $ \{j_1,\ldots,j_r\}$ so that the volume $ \mathrm{vol}_r(\mathbf{\widetilde{w}}_{j_1},\ldots,\mathbf{\widetilde{w}}_{j_r})$ is nearly maximum. As we mentioned earlier, exhaustive search is too costly. To overcome this, the third idea is to use the following pivoted QR algorithm (or pivoted Gram-Schmidt process) to find the $ r$ columns.
For $ s=1,\ldots,r$ :

1. Find $ j_s$ among $ \{1,\ldots,N_x\}\setminus\{j_1,\ldots,j_{s-1}\}$ such that the (updated) vector $ \mathbf{\widetilde{w}}_{j_s}$ has the largest norm.
2. Orthogonalize the remaining vectors $ \mathbf{\widetilde{w}}_j$ against $ \mathbf{\widetilde{w}}_{j_s}$ and update them.

After the loop, $ \{j_1,\ldots,j_r\}$ is the column set required.
Once the column set is found, we set $ \mathbf{U} = \left[ \mathbf{w}_{j_1},\ldots,\mathbf{w}_{j_r}\right]$ .
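A sketch of this column-selection step, using an explicit pivoted Gram-Schmidt loop on the restricted columns, might look as follows; the stand-in matrix and all sizes are assumptions made for illustration.

import numpy as np

# Sketch of the pivoted Gram-Schmidt column selection described above.
# W is a stand-in matrix; Nx, t, r are illustrative assumptions.
rng = np.random.default_rng(3)
Nx, t, r = 64, 16, 4
W = np.exp(1j*rng.uniform(0, 2*np.pi, (Nx, Nx)))
rows = rng.choice(Nx, size=t, replace=False)
R = W[rows, :].astype(complex)              # projected columns w~_1,...,w~_Nx

cols = []
for s in range(r):
    norms = np.linalg.norm(R, axis=0)
    norms[cols] = -1.0                      # exclude already-selected columns
    j_s = int(np.argmax(norms))             # pivot: largest remaining norm
    cols.append(j_s)
    q = R[:, j_s]/np.linalg.norm(R[:, j_s])
    R = R - np.outer(q, q.conj() @ R)       # orthogonalize remaining columns

U = W[:, cols]                              # U = [w_{j_1},...,w_{j_r}]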

In order to identify $ \mathbf{V^*}$ , one needs to find a set of $ r$ rows of $ \mathbf{W}$ with an almost maximum volume. To do that, we repeat the same steps, now applied to $ \mathbf{W^*}$ . More precisely, let

$\displaystyle \mathbf{W} = \begin{bmatrix}\mathbf{m}_1 \\ \vdots \\ \mathbf{m}_{N_x} \end{bmatrix}$ (20)

be the row partitioning of the matrix $ \mathbf{W}$ . The algorithm takes the following steps:
1. Select $ t$ columns of $ \mathbf{W}$ uniformly at random and restrict each row $ \mathbf{m}_i$ to these columns. Denote the restricted rows by $ \{\mathbf{\widetilde{m}}_1,\ldots,\mathbf{\widetilde{m}}_{N_x}\}$ .
2. Apply the pivoted QR algorithm to $ \{\mathbf{\widetilde{m}}_1,\ldots,\mathbf{\widetilde{m}}_{N_x}\}$ to find a set of $ r$ rows $ \{i_1,\ldots,i_r\}$ with a nearly maximum volume.
3. Set

$\displaystyle \mathbf{V^*} = \begin{bmatrix}\mathbf{m}_{i_1} \\ \vdots \\ \mathbf{m}_{i_r} \end{bmatrix}\;.$
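The row selection admits the same kind of sketch; here a library column-pivoted QR (SciPy's qr with pivoting) stands in for the Gram-Schmidt loop above, and the matrix and sizes are again assumptions.

import numpy as np
from scipy.linalg import qr

# Sketch of the row selection: restrict the rows of W to t random columns and
# run a column-pivoted QR on the restricted W* to pick r rows for V*.
# W, Nx, t, r are stand-in assumptions.
rng = np.random.default_rng(4)
Nx, t, r = 64, 16, 4
W = np.exp(1j*rng.uniform(0, 2*np.pi, (Nx, Nx)))

cols_sample = rng.choice(Nx, size=t, replace=False)   # t random columns
M_tilde = W[:, cols_sample].conj().T                  # restricted rows m~_i as columns

_, _, piv = qr(M_tilde, pivoting=True)                # column-pivoted QR
rows_sel = piv[:r]                                    # selected rows i_1,...,i_r
V_star = W[rows_sel, :]                               # V* = [m_{i_1}; ...; m_{i_r}]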

Once both $ \mathbf{U}$ and $ \mathbf{V^*}$ are identified, the last task is to compute the $ r\times r$ matrix $ \mathbf{M}$ for $ \mathbf{W} \approx \mathbf{U M V^*}$ . Minimizing

$\displaystyle \min_{\mathbf{M}} \lVert \mathbf{W - U M V^*} \rVert_F$ (21)

yields $ \mathbf{M} = \mathbf{U}^{\dagger} \mathbf{W} (\mathbf{V^*})^{\dagger}$ , where $ \dagger$ stands for the pseudo-inverse. However, this formula requires taking matrix products with $ \mathbf{W}$ , which takes $ O(t N_x^2)$ steps. In order to achieve linear scaling, the fourth idea of our approach is to randomly select a set of $ t$ rows $ A$ and a set of $ t$ columns $ B$ and minimize

$\displaystyle \min_{\mathbf{M}} \lVert \mathbf{W}(A,B) - \mathbf{U}(A,:)   \mathbf{M}   \mathbf{V}(B,:)^* \rVert_F\;.$ (22)

The solution for this problem is

$\displaystyle \mathbf{M} = (\mathbf{U}(A,:))^{\dagger}  \mathbf{W}(A,B)   \left(\mathbf{V}(B,:)^*\right)^{\dagger}\;.$ (23)
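In code, equation (23) amounts to two small pseudo-inverses applied to a sampled block of $ \mathbf{W}$ ; in the sketch below, $ \mathbf{W}$ , $ \mathbf{U}$ , and $ \mathbf{V^*}$ are stand-ins with the shapes used in the text, and all sizes are illustrative assumptions.

import numpy as np

# Sketch of equation (23): M = pinv(U(A,:)) W(A,B) pinv(V(B,:)^*).
# W, U, V_star and the sizes are stand-ins for illustration only.
rng = np.random.default_rng(5)
Nx, t, r = 64, 16, 4
W = np.exp(1j*rng.uniform(0, 2*np.pi, (Nx, Nx)))
U = W[:, rng.choice(Nx, size=r, replace=False)]        # r selected columns
V_star = W[rng.choice(Nx, size=r, replace=False), :]   # r selected rows

A = rng.choice(Nx, size=t, replace=False)              # t sampled rows
B = rng.choice(Nx, size=t, replace=False)              # t sampled columns

M = np.linalg.pinv(U[A, :]) @ W[np.ix_(A, B)] @ np.linalg.pinv(V_star[:, B])
err = np.linalg.norm(W - U @ M @ V_star)/np.linalg.norm(W)
# err is small only when W is numerically low rank (not the case for this random stand-in)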

Let us now discuss the overall cost of this algorithm. Random sampling of $ t$ rows and $ t$ columns of the matrix $ \mathbf{W}$ clearly takes $ O(t N_x)$ steps. Pivoted QR factorization on the projected columns $ \{\mathbf{\widetilde{w}}_1,\ldots,\mathbf{\widetilde{w}}_{N_x}\}$ takes $ O(t^2 N_x)$ steps, and the same holds for the pivoted QR factorization on the projected rows. Finally, computing the pseudo-inverses takes $ O(t^3)$ steps. Therefore, the overall cost of the algorithm is $ O(t N_x) + O(t^2 N_x) + O(t^3) = O(t^2 N_x)$ . As we mentioned earlier, in practice $ t=O(r \log N_x)$ . Hence, the overall cost is linear in $ N_x$ .
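Putting the pieces together, an end-to-end sketch of the procedure is given below; the smooth velocity model, time step, and all sizes are hypothetical choices made so that the stand-in $ \mathbf{W}$ is numerically low rank, and every step touches only $ O(t N_x)$ entries of $ \mathbf{W}$ , in line with the cost count above.

import numpy as np
from scipy.linalg import qr

# End-to-end sketch: sample, select columns and rows, solve for M.
# The velocity model, dt, Nx, t, r are illustrative assumptions.
rng = np.random.default_rng(6)
Nx, t, r = 256, 24, 6
dt = 0.001                                             # time step (s)
x = np.linspace(0.0, 1.0, Nx)                          # positions (km)
k = 2*np.pi*np.fft.fftfreq(Nx, d=x[1]-x[0])            # wavenumbers (rad/km)
v = 2.0*(1.0 + 0.1*np.sin(2*np.pi*x))                  # hypothetical velocity (km/s)
W = np.exp(1j*np.outer(v, np.abs(k))*dt)               # W_ij = exp(i v(x_i)|k_j| dt)

rows_s = rng.choice(Nx, size=t, replace=False)         # random row sample
cols_s = rng.choice(Nx, size=t, replace=False)         # random column sample

_, _, pc = qr(W[rows_s, :], pivoting=True)             # pick r columns
_, _, pr = qr(W[:, cols_s].conj().T, pivoting=True)    # pick r rows
U, V_star = W[:, pc[:r]], W[pr[:r], :]

M = np.linalg.pinv(U[rows_s, :]) @ W[np.ix_(rows_s, cols_s)] @ np.linalg.pinv(V_star[:, cols_s])
print(np.linalg.norm(W - U @ M @ V_star)/np.linalg.norm(W))
# relative error; small when r captures the numerical rank of W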

