Fast dictionary learning by sequential generalized K-means

Although K-SVD achieves very good performance in a number of sparse-representation-based approaches, it is computationally expensive because each iteration involves many SVD operations. The computational cost becomes intolerable when the method is applied to multidimensional seismic data processing (e.g., 3D or 5D processing). The sequential generalized K-means (SGK) algorithm was proposed to increase the computational efficiency (Sahoo and Makur, 2013). In the sparse coding stage, SGK solves an iterative optimization problem that differs slightly from equation 3:

$\displaystyle \forall_i \quad \mathbf{m}_i^n = \arg \min_{\mathbf{m}_i} \parallel \mathbf{D} - \mathbf{F}^{n-1}\mathbf{M} \parallel_F^2, \quad s.t. \; \forall_i \; \mathbf{m}_i = \mathbf{e}_t \;\textrm{ for some } t.$ (8)

Here, $\mathbf{e}_t$ indicates that $\mathbf{m}_i$ contains all 0s except for a 1 in the $t$th position. The dictionary update in the SGK algorithm is also different. In SGK, equation 6 still holds, but instead of using the computationally expensive SVD to minimize the objective function, SGK solves the minimization problem with a least-squares method. Taking the derivative of $J=\parallel\mathbf{E}^R_k- \mathbf{f}_k\mathbf{m}_R^k \parallel_F^2$ with respect to $\mathbf{f}_k$ and setting the result to 0 gives the following equation:

$\displaystyle \frac{\partial J}{\partial \mathbf{f}_k} = -2(\mathbf{E}^{R}_k-\mathbf{f}_k\mathbf{m}_R^k)(\mathbf{m}_R^k)^T = 0$ (9)

Solving equation 9 leads to

$\displaystyle \mathbf{f}_k = \mathbf{E}^R_k (\mathbf{m}_R^k)^T \left( \mathbf{m}_R^k (\mathbf{m}_R^k)^T \right)^{-1}.$ (10)
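As a quick numerical illustration of equation 10 before continuing the derivation, the following minimal NumPy sketch (with hypothetical sizes and variable names) evaluates the closed-form least-squares atom update and cross-checks it against a generic least-squares solver.

\begin{verbatim}
import numpy as np

# Hypothetical sizes: 16-sample patches, 5 training signals that use atom k.
rng = np.random.default_rng(0)
E_R = rng.standard_normal((16, 5))   # restricted residual matrix E_k^R
m_R = np.ones((1, 5))                # restricted coefficient row m_R^k
                                     # (entries are 1 by the equation-8 constraint)

# Closed-form least-squares solution of equation 10.
f_k = E_R @ m_R.T @ np.linalg.inv(m_R @ m_R.T)

# Cross-check with a generic solver for min_f || E_R - f m_R ||_F^2.
f_ls, *_ = np.linalg.lstsq(m_R.T, E_R.T, rcond=None)
assert np.allclose(f_k.ravel(), f_ls.ravel())
\end{verbatim}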

It can be derived further that

\begin{displaymath}\begin{split}
\mathbf{E}^R_k (\mathbf{m}_R^k)^T &= \left(\mathbf{D}_R - \sum_{j\ne k}\mathbf{f}_j\mathbf{m}_R^j\right)(\mathbf{m}_R^k)^T \\
&= \mathbf{D}_R (\mathbf{m}_R^k)^T - \sum_{j\ne k}\mathbf{f}_j\mathbf{m}_R^j (\mathbf{m}_R^k)^T.
\end{split}\end{displaymath} (11)

Here, $\mathbf{D}_R$ has the same meaning as $\mathbf{D}$ in equation 5 but a smaller size: it keeps only the columns indexed by the selection set $r_k$, which contains the indices of the non-zero entries in $\mathbf{m}_T^k$.
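To make the restriction concrete, the sketch below (hypothetical function and variable names) forms the selection set $r_k$ and the restricted matrices from the coefficient row $\mathbf{m}_T^k$; the residual $\mathbf{E}_k$ would be restricted to $\mathbf{E}^R_k$ in the same way.

\begin{verbatim}
import numpy as np

def restrict(D, M, k):
    """Restrict D and the k-th coefficient row to the signals that use atom k.
    D: (n, L) training signals; M: (K, L) coefficient matrix. Hypothetical sketch."""
    m_Tk = M[k, :]                    # the k-th coefficient row, m_T^k
    r_k = np.flatnonzero(m_Tk)        # selection set r_k: indices of non-zeros
    D_R = D[:, r_k]                   # restricted training signals, D_R
    m_Rk = m_Tk[r_k][np.newaxis, :]   # restricted row, m_R^k (all 1s in SGK)
    return r_k, D_R, m_Rk
\end{verbatim}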

Since $\forall_i, \parallel \mathbf{m}_i \parallel_0=1$, as constrained in equation 8, it follows that

$\displaystyle \forall_{j\ne k} \mathbf{m}_R^j (\mathbf{m}_R^k )^T = 0.$ (12)

Since $\mathbf{m}_R^k$ is a restricted version of the row vector $\mathbf{m}_T^k$ and all its entries are equal to 1, $\mathbf{D}_R (\mathbf{m}_R^k )^T$ is simply a summation over all the column vectors in $\mathbf{D}_R$. Considering that $\mathbf{D}_R=\{\mathbf{d}_i: i\in r_k\}$,

$\displaystyle \mathbf{D}_R (\mathbf{m}_R^k )^T=\sum_{i\in r_k}\mathbf{d}_i.$ (13)

Substituting equations 12 and 13 into equation 11 gives

$\displaystyle \mathbf{E}^R_k (\mathbf{m}_R^k)^T = \sum_{i\in r_k}\mathbf{d}_i.$ (14)

It is straightforward to show that $\mathbf{m}_R^k(\mathbf{m}_R^k)^T=N_r^k$, where $N_r^k$ denotes the number of elements in the set $r_k$, i.e., the number of training signals associated with the atom $\mathbf{f}_k$. Substituting equation 14 and this result into equation 10, the $k$th atom in $\mathbf{F}$ becomes

$\displaystyle \mathbf{f}_k =\frac{\sum_{i\in r_k}\mathbf{d}_i}{N_r^k}.$ (15)

Thus, SGK avoids the SVD entirely: each trained dictionary atom is simply the average of the training signals associated with it. In this way, SGK achieves significantly higher efficiency than K-SVD. In the next section, I will use several examples to show that the overall denoising performance does not degrade even though the implementation is much faster.
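To summarize the procedure, the sketch below is a hypothetical minimal implementation of one SGK iteration (not the author's code): the sparse-coding stage of equation 8 assigns each training signal to its closest atom, and the dictionary-update stage of equation 15 replaces each atom by the average of the signals assigned to it.

\begin{verbatim}
import numpy as np

def sgk_iteration(D, F):
    """One SGK iteration. D: (n, L) training signals; F: (n, K) dictionary.
    Minimal sketch; practical codes normalize atoms, handle empty clusters,
    and typically allow more than one non-zero coefficient per signal."""
    K = F.shape[1]

    # Sparse coding (equation 8): m_i = e_t, so pick the atom t that
    # minimizes ||d_i - f_t||_2 for every training signal d_i.
    dists = np.sum((D[:, None, :] - F[:, :, None]) ** 2, axis=0)  # (K, L)
    assign = np.argmin(dists, axis=0)                             # t for each i

    # Dictionary update (equation 15): each atom becomes the average of the
    # training signals assigned to it.
    F_new = F.copy()
    for k in range(K):
        r_k = np.flatnonzero(assign == k)         # selection set r_k
        if r_k.size > 0:
            F_new[:, k] = D[:, r_k].mean(axis=1)  # f_k = (1/N_r^k) sum d_i
    return F_new, assign
\end{verbatim}

The update is the same centroid averaging used in K-means clustering, which is consistent with the name sequential generalized K-means and explains why the cost per iteration is much lower than that of the SVD-based K-SVD update.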

