Matching Pursuit Before Computer Science

Recently, I found some interesting references to techniques designed around 1938 that, in my opinion, could qualify as (variants of) Matching Pursuit. Perhaps this is well known to researchers in the right scientific field, but here is what I recently discovered.

From what I knew until yesterday, when S. Mallat and Z. Zhang [1] defined their greedy, iterative algorithm named "Matching Pursuit" (MP) to decompose a signal $s\in\mathbb{R}^N$ into a linear combination of atoms taken from a dictionary $\mathcal{D}=\{g_k\in\mathbb{R}^N:\ 1\leq k\leq d,\ g_k^Tg_k = 1\}$ of $d$ elements, the previous work to which they were referring was the "Projection Pursuit" of J. Friedman and W. Stuetzle [2] in the field of statistical regression methods. In short, MP is very simple. It reads (using matrix notations)

$$R^0 = s,\quad A^0 = 0 \qquad \text{(initialization)}$$
$$R^{n+1} = R^n - (g_*^T R^n)\, g_* \qquad \text{(reconstruction)}$$
$$A^{n+1} = A^n + (g_*^T R^n)\, g_* \qquad \text{(reconstruction)}$$

with, at each step $n\geq 0$,

$$g_* = \arg\max_{g\in\mathcal{D}}\ |g^T R^n| \qquad \text{(sensing)}.$$

The quantities $R^n$ and $A^n$ are respectively the residual and the approximation of $s$ at the $n$th MP step (so $A^n$ is also an $n$-term approximation of $s$). A quite direct modification of MP is the Orthogonal Matching Pursuit (OMP) [8], where only the index (or parameters) of the best atom (i.e. the one maximizing its correlation with $R^n$) is recorded at each iteration, and the approximation $A^n$ is computed by a least squares minimization over the set of the $n$ previously selected atoms.

It is proved in [1] that MP always converges to … something, since the energy of the residual $\|R^n\|$ steadily decreases towards 0 with $n$. Under certain extra assumptions on the dictionary $\mathcal{D}$ (e.g. a small coherence, or cumulative coherence, which roughly measures its closeness to an orthogonal basis), it is also proved that, if $s$ is described by a linear combination of few elements of the dictionary (a sparse or compressible signal), i.e. $s = \mathcal{D}\alpha_*$ for some $\alpha_*\in\mathbb{R}^d$ having few non-zero (or large) components, then OMP recovers $\alpha_*$ in the set of coefficients $(g_*^T R^n)$ computed at each iteration [9]. For instance, in the trivial case of an orthonormal basis (i.e. with vanishing coherence), (O)MP iteratively finds $\alpha = \mathcal{D}^{-1}s = \mathcal{D}^T s = \alpha_*$.
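For readers who, like me, sometimes parse loops faster than equations, here is a minimal NumPy sketch of plain MP and of the OMP variant described above. The function names, the fixed iteration counts and the toy dictionary at the end are mine, chosen for illustration only; this is not the reference implementation of [1] or [8].

```python
import numpy as np

def matching_pursuit(s, D, n_iter=50):
    """Plain MP: at each step, sense the atom most correlated with the
    residual, then move that contribution from the residual to the
    approximation. D is an (N, d) matrix with unit-norm columns (atoms)."""
    R, A = s.copy(), np.zeros_like(s)          # R^0 = s, A^0 = 0
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        c = D.T @ R                            # correlations g^T R^n
        k = int(np.argmax(np.abs(c)))          # sensing: best atom g_*
        R = R - c[k] * D[:, k]                 # reconstruction: residual update
        A = A + c[k] * D[:, k]                 # reconstruction: approximation update
        alpha[k] += c[k]
    return A, alpha

def orthogonal_matching_pursuit(s, D, n_iter=10):
    """OMP: only the selected indices are recorded; the approximation is
    recomputed by least squares on the selected atoms at every step."""
    R, support = s.copy(), []
    for _ in range(n_iter):
        k = int(np.argmax(np.abs(D.T @ R)))
        if k not in support:
            support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)
        R = s - D[:, support] @ coef           # residual orthogonal to selected atoms
    alpha = np.zeros(D.shape[1])
    alpha[support] = coef
    return D @ alpha, alpha

# Toy check: a signal made of two atoms of a random unit-norm dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128)); D /= np.linalg.norm(D, axis=0)
s = 2.0 * D[:, 5] - 1.5 * D[:, 42]
A, alpha = orthogonal_matching_pursuit(s, D, n_iter=2)
# alpha should be close to 2.0 at index 5 and -1.5 at index 42.
```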
Dozens (or hundreds?) of variations of these initial greedy methods have been introduced since their first formulation in the signal processing community. These variations improve, for instance, the rate of convergence of the initial MP through the iterations, or the ability to solve the sparse approximation problem (i.e. finding $\alpha_*$ as expressed above), or adapt MP to some specific problem like the emerging Compressed Sensing. Let's quote for instance Gradient Pursuit, stagewise OMP, CoSaMP, regularized OMP, Subspace Pursuit, … (see here and here for more information on these).

Another variation of (O)MP, explained by K. Schnass and P. Vandergheynst in [3], splits the sensing part from the reconstruction part of the initial MP algorithm above (adding also the possibility of selecting more than one atom per iteration). Indeed, the selection of the best atom $g_*$ is there performed by a sensing dictionary $\mathcal{S}$, while the reconstruction stage building the residuals and approximations is still assigned to $\mathcal{D}$. In short, this variation is also proved to solve the sparse problem if the two dictionaries satisfy a small cross (cumulative) coherence criterion, which is easier to fulfill than asking for a small (cumulative) coherence of a single dictionary as in the initial (O)MP.

I introduce this last (O)MP variation in some detail because it is under this form that I discovered it in the separate works of R. V. Southwell [4] and G. Temple [5] (the latter being more readable in my opinion), in 1935 and in 1938 (!!), i.e. before the building of the first (electronic) computers in the 1940s. The context of Southwell's initial work was the determination of stresses in structures. Let us summarize his problem: the structure is modelled by $N$ connected springs. If $\alpha_*\in\mathbb{R}^N$ represents any motion vector of the spring extremities, then, at the new equilibrium state reached when some external forces $F_{\rm external}\in\mathbb{R}^N$ are applied to the structure, the internal forces provided by the springs follow of course Hooke's law, i.e. $F_{\rm internal}=D\alpha_*$ for a certain symmetric matrix $D\in\mathbb{R}^{N\times N}$ containing the spring constants, and Newton's first law finally implies $F_{\rm internal} + F_{\rm external} = D\alpha_* + F_{\rm external} = 0$. Southwell's global problem was thus: given a linear system of equations $D\alpha_* = s$, with $s = -F_{\rm external}$ and $D$ positive definite, how can you recover $\alpha_*$ from $s$ and $D$ in practice? As explained by Temple, a solution to this problem is of course also "applicable to any problem which is reducible to the solution of a system of non-homogeneous, linear, simultaneous algebraic equations in a finite number of unknown variables".

Nowadays, the numerical solution seems trivial: take the inverse of $D$ and apply it to $s$, and if $D$ is really big (or even small, since I'm lazy), compute $D^{-1}$ with Matlab, for instance, and run "inv(D)*s" (or do something cleverer with the Matlab "\" operator). However, imagine the same problem in the 1930s! And assume you have to invert a ridiculously small matrix of size 13x13. Solving it analytically can take very long, and is pointless since you are only interested in finding $\alpha_*$. That is why, at that time, some people were interested in computing $D^{-1}s$, or an approximation of it, without having to carry out this painful inverse computation.

The technique found by R. V. Southwell and later generalized by Temple [4,5] was dubbed "Successive Relaxation", inside a more general context named "Successive Approximations". Mathematically, rewriting that work with notations similar to those of modern Matching Pursuit methods, the Successive Relaxation algorithm reads

$$R^0 = s,\quad \alpha^0 = 0 \qquad \text{(initialization)}$$
$$R^{n+1} = R^n - \beta\,(R^n)_{k^*}\, D_{k^*} \qquad \text{(reconstruction)}$$
$$\alpha^{n+1} = \alpha^n + \beta\,(R^n)_{k^*}\, e_{k^*} \qquad \text{(reconstruction)}$$

where $\beta\in(0,2)$, $D_j$ is the $j$th column of $D$, $(R^n)_m$ is the $m$th component of $R^n$, and $e_j$ is the canonical basis vector such that $(e_j)_k=\delta_{jk}$, with, at each step $n\geq 0$, the selection

$$k^* = \arg\max_{k}\ |(R^n)_k| \qquad \text{(sensing)},$$

i.e. the component of the $n$th residual with the highest amplitude.
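In the same illustrative spirit, here is a small NumPy sketch of this relaxation loop. It makes the split of [3] explicit: sensing is done with the canonical basis (pick the residual entry of largest amplitude), reconstruction with a column of $D$. One assumption is mine: I rescale the toy matrix to a unit diagonal so that $\beta = 1$ corresponds to the classical "exact" relaxation step (equivalently, one could divide the step by $D_{kk}$); this normalization is not spelled out in the equations above.

```python
import numpy as np

def successive_relaxation(D, s, beta=1.0, n_iter=2000):
    """Southwell/Temple-style successive relaxation for D @ alpha = s,
    written with the MP-like notations of the post: sensing with the
    canonical basis, reconstruction with the columns of D."""
    alpha = np.zeros(D.shape[1])
    R = s.copy()                        # R^0 = s (residual of alpha^0 = 0)
    for _ in range(n_iter):
        k = int(np.argmax(np.abs(R)))   # sensing: largest residual component
        c = beta * R[k]
        alpha[k] += c                   # alpha^{n+1} = alpha^n + beta (R^n)_k e_k
        R = R - c * D[:, k]             # R^{n+1} = R^n - beta (R^n)_k D_k
    return alpha

# Toy 13x13 positive-definite system, rescaled to a unit diagonal.
rng = np.random.default_rng(1)
M = rng.standard_normal((13, 13))
D = M @ M.T + 13 * np.eye(13)                     # symmetric positive definite
d = np.sqrt(np.diag(D)); D = D / np.outer(d, d)   # unit diagonal (congruence keeps D positive definite)
s = rng.standard_normal(13)
alpha = successive_relaxation(D, s)
print(np.linalg.norm(D @ alpha - s))              # residual norm, shrinking towards 0
# alpha approaches np.linalg.solve(D, s), without ever forming inv(D).
```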
In this framework, since $D$ is positive definite and thus non-singular, it is proved in [5] that the vectors $\alpha^n$ tend to the true answer $\alpha_*=D^{-1}s$. The parameter $\beta$ controls how much is removed from the residual and added to the approximation at each step. You can easily show that the decrease of the residual energy is of course maximal for $\beta=1$.

In other words, in the concepts introduced in [3], they designed a Matching Pursuit where the reconstruction dictionary is made of the columns of $D$ and the sensing dictionary is the identity (the canonical basis) of $\mathbb{R}^N$. Amazing, no? An interesting lesson from the algorithm above is the presence of the factor $\beta$. In the more general context of (modern) MP with a non-orthonormal dictionary, such a factor could be useful to reduce the "decoherence" effect observed experimentally in the decomposition of a signal when the signal is not exactly fitted by elements of the dictionary (e.g., in image processing, arbitrarily oriented edges that have to be described by horizontal and vertical atoms).

In [5], G. Temple also extended the method to infinite dimensional Hilbert spaces, with a different updating step for $\alpha^n$. This is nothing else but the foundation of the (continuous) MP studied by authors like R. DeVore and V. Temlyakov some years ago [7] (on that topic, you can also read [6], a paper I wrote with C. De Vleeschouwer giving a geometric description of this continuous formalism).

By googling a bit on "matching pursuit" and Southwell, I found this presentation by Peter Buhlmann, who makes a more general connection between Southwell's work, Matching Pursuit, and greedy algorithms (around slide 15) in the context of "Iterated Regularization for High-Dimensional Data".

In conclusion of all that, who was it again who said that we do nothing but reinvent the wheel? ;-) If you want to complete this kind of "archeology of Matching Pursuit", please feel free to add some comments below. I'll be happy to read them and improve my general knowledge of the topic.

Laurent

References :

[1] : S. G. Mallat and Z. Zhang, "Matching Pursuits with Time-Frequency Dictionaries", IEEE Transactions on Signal Processing, pp. 3397-3415, December 1993.
[2] : J. H. Friedman and J. W. Tukey, "A Projection Pursuit Algorithm for Exploratory Data Analysis", IEEE Transactions on Computers, C-23(9), pp. 881-890, Sept. 1974. doi:10.1109/T-C.1974.224051. ISSN 0018-9340.
[3] : K. Schnass and P. Vandergheynst, "Dictionary preconditioning for greedy algorithms", IEEE Transactions on Signal Processing, Vol. 56, No. 5, pp. 1994-2002, 2008.
[4] : R. V. Southwell, "Stress-Calculation in Frameworks by the Method of 'Systematic Relaxation of Constraints'. I and II", Proc. Roy. Soc. Series A, Mathematical and Physical Sciences, Vol. 151, No. 872, pp. 56-95, Aug. 1935.
[5] : G. Temple, "The General Theory of Relaxation Methods Applied to Linear Systems", Proc. Roy. Soc. Series A, Mathematical and Physical Sciences, Vol. 169, No. 939, pp. 476-500, Mar. 1939.
[6] : L. Jacques and C. De Vleeschouwer, "A Geometrical Study of Matching Pursuit Parametrization", IEEE Transactions on Signal Processing, Vol. 56, No. 7, pp. 2835-2848, 2008.
[7] : R. A. DeVore and V. N. Temlyakov, "Some remarks on greedy algorithms", Adv. Comput. Math., 5:173-187, 1996.
[8] : Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition", Proceedings of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, Vol. 1, pp. 40-44, Pacific Grove, CA, Nov. 1993.
[9] : J. Tropp, "Greed is good: Algorithmic results for sparse approximation", IEEE Transactions on Information Theory, Vol. 50, pp. 2231-2242, 2004.

Image credit :

EDSAC, one of the first computers to implement the stored-program (von Neumann) architecture. Source: Wikipedia.

Laurent Jacques
FNRS Senior Research Associate and Professor