Matching Pursuit Before Computer Science

Recently, I came across some interesting references to techniques designed around 1938 that, in my opinion, could be considered as (a variant of) Matching Pursuit. Perhaps this is well known to researchers in the right scientific field, but here is what I recently discovered.

From what I knew until yesterday, when S. Mallat and Z. Zhang [1] defined their greedy, iterative algorithm named “Matching Pursuit” (MP) to decompose a signal \(s\in \mathbb{R}^N\) into a linear combination of atoms taken from a dictionary \(\mathcal{D}=\{g_k\in\mathbb{R}^N: 1\leq k\leq d,\ g_k^Tg_k = 1\}\) of \(d\) elements, the previous work they were referring to was the “Projection Pursuit” of J. Friedman and J. W. Tukey [2] in the field of statistical regression methods.

In short, MP is very simple. It reads (using matrix notation)

\begin{align}
&R^0 = s,\quad A^0 = 0, \tag{initialization}\\
&R^{n+1} \;=\; R^n \;-\; (g_{*}^T R^n)\, g_{*}, \tag{residual update}\\
&A^{n+1} \;=\; A^n \;+\; (g_{*}^T R^n)\, g_{*}, \tag{reconstruction}
\end{align}

with, at each step \(n\geq 0\), \[ g_{*} \ = \ {\rm arg\,max}_{g\in\mathcal{D}}\ |g^T R^n|. \tag{sensing}\] The quantities \(R^n\) and \(A^n\) are, respectively, the residual and the approximation of \(s\) at the \(n\)th MP step (so \(A^n\) is also an \(n\)-term approximation of \(s\)). A quite direct modification of MP is Orthogonal Matching Pursuit (OMP) [8], where only the index (or parameters) of the best atom (i.e., the one maximizing its correlation with \(R^n\)) is recorded at each iteration, and the approximation \(A^n\) is computed by a least-squares minimization on the set of the \(n\) previously selected atoms.
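For concreteness, here is a minimal NumPy sketch of the MP and OMP iterations described above (variable names and the fixed number of iterations are my own choices; atoms are stored as the columns of `D`):

```python
import numpy as np

def matching_pursuit(s, D, n_iter=50):
    """Plain MP. D is an (N, d) array whose columns are unit-norm atoms; s has shape (N,)."""
    R = s.astype(float).copy()        # residual R^0 = s
    A = np.zeros_like(R)              # approximation A^0 = 0
    coeffs = np.zeros(D.shape[1])     # accumulated coefficient per atom
    for _ in range(n_iter):
        c = D.T @ R                   # correlations g^T R^n  (sensing)
        k = int(np.argmax(np.abs(c))) # best atom g_*
        R = R - c[k] * D[:, k]        # residual update
        A = A + c[k] * D[:, k]        # reconstruction
        coeffs[k] += c[k]
    return A, coeffs

def orthogonal_matching_pursuit(s, D, n_iter=10):
    """OMP: record only the selected indices, refit the approximation by least squares."""
    R = s.astype(float).copy()
    support, x = [], np.zeros(0)
    for _ in range(n_iter):
        k = int(np.argmax(np.abs(D.T @ R)))
        if k not in support:
            support.append(k)
        x, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)  # least-squares refit
        R = s - D[:, support] @ x
    return D[:, support] @ x, support, x
```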

It is proved in [1] that MP always converges to … something, since the energy of the residual \(\|R^n\|\) decreases steadily towards 0 with \(n\). Under certain extra assumptions on the dictionary \(\mathcal{D}\) (e.g., a small coherence, or cumulative coherence, which roughly measures its closeness to an orthogonal basis), it is also proved that, if \(s\) is described by a linear combination of few elements of the dictionary (i.e., a sparse or compressible signal), that is, \(s = \mathcal{D}\alpha_*\) for some \(\alpha_*\in\mathbb{R}^d\) having few non-zero (or large) components, then OMP recovers \(\alpha_*\) through the coefficients \((g_{*}^T R^n)\) computed at each iteration [9]. For instance, in the trivial case of an orthonormal basis (i.e., with vanishing coherence), (O)MP iteratively finds \(\alpha = \mathcal{D}^{-1} s = \alpha_*\).

Dozens (or hundreds?) of variations of these initial greedy methods have been introduced since their first formulation in the signal processing community. These variations improve, for instance, the rate of convergence of the initial MP through the iterations, or the ability to solve the sparse approximation problem (i.e., finding \(\alpha_*\) as expressed above), or adapt MP to some specific problem like the emerging Compressed Sensing. Let’s quote for instance Gradient Pursuit, Stagewise OMP, CoSaMP, Regularized OMP, Subspace Pursuit, … (see here and here for more information on these algorithms).

Another variation of (O)MP, explained by K. Schnass and P. Vandergheynst in [3], splits the sensing part from the reconstruction part in the initial MP algorithm above (also adding the possibility to select more than one atom per iteration). Indeed, the selection of the best atom \(g_*\) is performed there by a sensing dictionary \(\mathcal{S}\), while the reconstruction stage building the residuals and approximations is still assigned to \(\mathcal{D}\). In short, this variation is also proved to solve the sparse problem if the two dictionaries satisfy a small cross (cumulative) coherence criterion, which is easier to fulfill than asking for a small (cumulative) coherence of a single dictionary as in the initial (O)MP.
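Here is a hedged sketch of what this sensing/reconstruction split could look like in code, in the spirit of [3] but simplified to one atom per iteration (names and the choice of coefficient are mine; I take the sensing correlation as coefficient, which is also what the Southwell scheme below does):

```python
import numpy as np

def sensing_mp(s, D, S, n_iter=50):
    """MP with a separate sensing dictionary: the atom is selected with S,
    while the residual and approximation are updated with D (columns = atoms)."""
    R = s.astype(float).copy()
    A = np.zeros_like(R)
    for _ in range(n_iter):
        c = S.T @ R                    # sensing with S
        k = int(np.argmax(np.abs(c)))  # best atom index
        R = R - c[k] * D[:, k]         # reconstruction with D
        A = A + c[k] * D[:, k]
    return A
```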

I described this last (O)MP variation in more detail because it is under this form that I discovered it in the separate works of R. V. Southwell [4] and G. Temple [5] (the latter being more readable in my opinion), published in 1935 and 1938 (!!), i.e., before the building of the first (electronic) computers in the 1940s.

The context of the method in Southwell’s initial work was the determination of stresses in structures. Let’s summarize his problem: the structure was modeled by \(N\) connected springs. If \(\alpha_*\in\mathbb{R}^N\) represents any motion vector of the spring extremities, then, at the new equilibrium state reached when some external forces \(F_{\rm external}\in\mathbb{R}^N\) are applied to the structure, the internal forces provided by the springs follow, of course, Hooke’s law, i.e., \(F_{\rm internal}=D \alpha_*\) for a certain symmetric matrix \(D\in\mathbb{R}^{N\times N}\) containing the spring constants, and finally Newton’s first law implies: \[ F_{\rm internal} + F_{\rm external} = D \alpha_* + F_{\rm external} = 0.\]

Southwell’s global problem was thus: given a linear system of equations \(D\alpha_* = s\), with \(s= -F_{\rm external}\) and \(D\) positive definite, how can you recover \(\alpha_*\) in practice from \(s\) and \(D\)? As explained by Temple, a solution to this problem is obviously also “applicable to any problem which is reducible to the solution of a system of non-homogeneous, linear, simultaneous algebraic equations in a finite number of unknown variables”.

Nowadays, the numerical solution seems trivial: take the inverse of \(D\) and apply it to \(s\); if \(D\) is really big (or even small, since I’m lazy), compute \(D^{-1}\) for instance with Matlab and run >> inv(D)*s (or, better, do something clever with Matlab’s “\” operator, i.e., D\s).
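For today’s readers, a NumPy equivalent (the example matrix and names are mine; the point is to solve the system directly rather than form the inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((13, 13))
D = M @ M.T + 13 * np.eye(13)     # a positive-definite 13x13 matrix
s = rng.standard_normal(13)

# Solve D @ alpha = s directly; numerically preferable to np.linalg.inv(D) @ s.
alpha = np.linalg.solve(D, s)
```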

However, imagine the same problem in the 1930s! And assume you have to invert a ridiculously small matrix of size 13×13. Doing it analytically can take a really long time, and it is wasteful since you are only interested in finding \(\alpha_*\). That’s why a few people at that time were interested in computing \(D^{-1}s\), or an approximation of it, without having to perform this painful inverse computation.

The technique found by R. V. Southwell and generalized later by Temple [4,5] was dubbed “Successive Relaxation”, within a more general context named “Successive Approximations”. Mathematically, rewriting that work with the notations of Matching Pursuit, the Successive Relaxation algorithm reads:

\begin{align}
&R^0 = s,\quad \alpha^0 = 0, \tag{initialization}\\
&R^{n+1} \;=\; R^n \;-\; \beta\, (R^{n})_{k^*}\, D_{k^*}, \tag{residual update}\\
&\alpha^{n+1} \;=\; \alpha^n \;+\; \beta\, (R^{n})_{k^*}\, e_{k^*}, \tag{reconstruction}
\end{align}

where \(\beta\in (0,2)\), \(D_j\) is the \(j\)th column of \(D\), \((R^n)_{m}\) is the \(m\)th component of \(R^n\), \(e_j\) is the vector such that \((e_j)_k=\delta_{jk}\) (canonical basis vector), and with, at each step \(n\geq 0\), the selection (sensing) \[ k^{*} = {\rm argmax}_{k} |(R^n)_k|, \tag{sensing}\] i.e., the component of the \(n\)th residual with the highest amplitude.
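In NumPy terms, a minimal sketch of this Successive Relaxation iteration could read as follows (names are mine; note that the step \(\beta (R^n)_{k^*}\) as written above implicitly assumes \(D\) is normalized to unit diagonal, otherwise the classical relaxation step divides by \(D_{kk}\)):

```python
import numpy as np

def successive_relaxation(D, s, beta=1.0, n_iter=200):
    """Southwell-style successive relaxation for D @ alpha = s, D symmetric positive definite.
    At each step, relax the component of the residual with the largest amplitude."""
    R = s.astype(float).copy()         # residual R^0 = s (= s - D @ alpha^0)
    alpha = np.zeros(D.shape[0])       # alpha^0 = 0
    for _ in range(n_iter):
        k = int(np.argmax(np.abs(R)))  # sensing: largest residual component
        c = R[k]                       # (R^n)_{k*}; use R[k] / D[k, k] if the diagonal is not 1
        alpha[k] += beta * c           # reconstruction
        R = R - beta * c * D[:, k]     # residual update
    return alpha
```

This is exactly the sensing-MP sketch given earlier with the sensing dictionary set to the identity and \(D\) as the reconstruction dictionary.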

In this framework, since \(D\) is positive definite and thus non-singular, it is proved in [5] that the vectors \(\alpha^n\) tend to the true answer \(\alpha_*=D^{-1}s\). The parameter \(\beta\) controls how much you remove from the residual and add to the approximation at each step. One can easily show that the decrease of the residual energy is maximal when \(\beta=1\). In other words, in the terms introduced in [3], they designed a Matching Pursuit where the reconstruction dictionary is the (generally non-orthonormal) basis given by the columns of \(D\) and the sensing dictionary is the identity (the canonical basis) of \(\mathbb{R}^N\). Amazing, no?
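For completeness, here is one quick way to see both the range \(\beta\in(0,2)\) and the optimality of \(\beta=1\) (a sketch, under the extra assumption that \(D\) is normalized so that \(D_{kk}=1\), and measuring the error \(e^n = \alpha_* - \alpha^n\) in the \(D\)-norm, i.e., \(\|e\|_D^2 = e^T D e = R^T D^{-1} R\)):

\begin{align}
\|e^{n+1}\|_D^2 \;&=\; \|e^n - \beta\, (R^n)_{k^*}\, e_{k^*}\|_D^2\\
&=\; \|e^n\|_D^2 \;-\; 2\beta\, (R^n)_{k^*}\, e_{k^*}^T D\, e^n \;+\; \beta^2 (R^n)_{k^*}^2\, D_{k^*k^*}\\
&=\; \|e^n\|_D^2 \;-\; (2\beta - \beta^2)\, (R^n)_{k^*}^2,
\end{align}

since \(D e^n = R^n\) and \(D_{k^*k^*}=1\). The decrease \((2\beta-\beta^2)(R^n)_{k^*}^2\) is positive exactly when \(\beta\in(0,2)\) and maximal at \(\beta=1\).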

An interesting lesson from the algorithm above is the presence of the factor \(\beta\). In the more general context of (modern) MP with a non-orthonormal dictionary, such a factor could be useful to reduce the “decoherence” effect observed experimentally in the decomposition of a signal when it is not exactly fitted by elements of the dictionary (e.g., in image processing, arbitrarily oriented edges to be described by horizontal and vertical atoms).

G. Temple in [5] also extended the method to infinite-dimensional Hilbert spaces, with a different update step for \(\alpha^n\). This is nothing but the foundation of the (continuous) MP studied by authors like R. DeVore and V. Temlyakov some years ago (on that topic you can also read [6], a paper I wrote with C. De Vleeschouwer on a geometric description of this continuous formalism).

By googling a bit on “matching pursuit” and Southwell, I found this presentation by Peter Buhlmann, who makes a more general connection between Southwell’s work, Matching Pursuit, and greedy algorithms (around slide 15) in the context of “Iterated Regularization for High-Dimensional Data”. In conclusion, who was it who said that we do nothing but reinvent the wheel? ;-)

If you want to complete this kind of “archeology of Matching Pursuit”, please feel free to add some comments below (or to drop me an email). I’ll be happy to read them and improve my general knowledge of the topic.

Laurent

References:

Image credit: EDSAC was one of the first computers to implement the stored-program (von Neumann) architecture. Source: Wikipedia.

Laurent Jacques
FNRS Senior Research Associate and Professor