A Note on Spectral Properties of Some Gradient Methods

Abstract. Starting from the work by Barzilai and Borwein, interest in gradient methods has grown considerably, and efficient low-cost schemes are available nowadays. The acceleration strategies used by these methods are based on the definition of effective steplength updating rules, which capture spectral properties of the Hessian of the objective function. The methods arising from this idea are effective computational tools, extremely appealing for a variety of large-scale optimization problems arising in applications. In this work we discuss the spectral properties of some recently proposed gradient methods, with the aim of providing insight into their computational effectiveness. Numerical experiments supporting and illustrating the theoretical analysis are provided.


INTRODUCTION
Several strategies for accelerating gradient methods have been devised in recent years, stimulated by the seminal work by Barzilai and Borwein [1]. These strategies share the idea of defining steplengths that capture spectral properties of the Hessian of the objective function; based on them, new first-order methods for continuous nonlinear optimization have been designed, which have proved effective in several practical contexts [2,3,4,5,6]. However, the available convergence results do not explain the great improvement with respect to the classical Cauchy Steepest Descent (SD) method, and we still do not have a deep understanding of the behaviour of the new methods.
In this work we discuss the spectral properties of some recently proposed steplength rules, with the aim of providing insight into their computational effectiveness. To this purpose, we consider a very simple unconstrained quadratic programming problem, suitable for analyzing the role of the eigenvalues of the Hessian in the behaviour of gradient methods:

    min_{x ∈ R^n} f(x) = (1/2) x^T A x − b^T x,    (1)

where A ∈ R^{n×n} is symmetric positive definite and b ∈ R^n. The generic gradient method for (1) is defined by the iteration

    x_{k+1} = x_k − α_k g_k,    (2)

where g_k = ∇f(x_k) = A x_k − b, and the steplength α_k > 0 is chosen through some predefined rule. For instance, the classical SD and Minimum Residual (MR) methods take the following steplengths, which guarantee monotonicity of the sequences {f(x_k)} and {‖∇f(x_k)‖}, respectively:

    α_k^{SD} = (g_k^T g_k)/(g_k^T A g_k),    α_k^{MR} = (g_k^T A g_k)/(g_k^T A^2 g_k).    (3)

Let λ_1 ≥ λ_2 ≥ · · · ≥ λ_{n−1} ≥ λ_n be the eigenvalues of A, with associated orthonormal eigenvectors d_1, d_2, . . ., d_n. Without loss of generality, henceforth we assume that λ_1 > λ_n and that the gradient has nonzero components along d_1 and d_n. Writing g_k = Σ_{i=1}^n μ_i^k d_i, the iteration (2) yields the recurrence

    μ_i^{k+1} = (1 − α_k λ_i) μ_i^k,    i = 1, . . ., n.    (4)
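For concreteness, the generic scheme can be sketched as follows. This is a minimal NumPy sketch assuming the classical formulas α_k^{SD} = g_k^T g_k/(g_k^T A g_k) and α_k^{MR} = g_k^T A g_k/(g_k^T A^2 g_k); the function names and tolerances are our own choices, not taken from the papers cited.

```python
import numpy as np

def gradient_method(A, b, x0, steplength, tol=1e-6, max_iter=10000):
    """Generic gradient iteration x_{k+1} = x_k - alpha_k g_k for
    f(x) = 0.5 x^T A x - b^T x, with a pluggable steplength rule."""
    x = x0.astype(float).copy()
    g = A @ x - b
    g0_norm = np.linalg.norm(g)
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol * g0_norm:
            return x, k
        x = x - steplength(A, g) * g
        g = A @ x - b
    return x, max_iter

def sd_step(A, g):
    # Cauchy steplength: exact minimizer of f along -g
    Ag = A @ g
    return (g @ g) / (g @ Ag)

def mr_step(A, g):
    # Minimum Residual steplength: minimizes ||grad f|| along -g
    Ag = A @ g
    return (g @ Ag) / (Ag @ Ag)
```

Both rules are monotone (in f and in ‖∇f‖, respectively), so either can be plugged into the loop above on any symmetric positive definite test matrix.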

STEPLENGTHS AND HESSIAN SPECTRUM
Starting from recurrence (4), the following properties can be deduced:
• the SD and MR methods have finite termination if and only if at some iteration the gradient is an eigenvector of A;
• the i-th eigencomponent of the gradient is reduced in modulus at iteration k if and only if |1 − α_k λ_i| < 1, i.e., α_k < 2/λ_i.
Thus, small steplengths α_k (say, close to 1/λ_1) tend to decrease a large number of eigencomponents, with negligible reduction of those corresponding to small eigenvalues. The latter can be significantly reduced by using large values of α_k, but this may end up increasing the eigencomponents corresponding to the dominating eigenvalues, as well as fostering non-monotonic behaviour. Therefore, some balance between large and small steplengths seems to be a key issue in devising effective gradient methods, and this basic idea has given rise to novel steplength selection rules, some of which are described in the sequel.
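The trade-off just described can be checked directly on the recurrence μ_i^{k+1} = (1 − α_k λ_i) μ_i^k. In the sketch below the spectrum and the unit eigencomponents are chosen arbitrarily for illustration:

```python
import numpy as np

# Arbitrary spectrum (lambda_1 = 1000, lambda_n = 1) and
# unit eigencomponents of the current gradient
lam = np.array([1000.0, 100.0, 10.0, 1.0])
mu = np.ones_like(lam)

# One step with a small steplength, alpha = 1/lambda_1:
# damps every component, but barely touches those of small eigenvalues
mu_small = (1.0 - (1.0 / lam[0]) * lam) * mu

# One step with a large steplength, alpha = 1/lambda_n:
# removes the last component, but strongly amplifies the dominant ones
mu_large = (1.0 - (1.0 / lam[-1]) * lam) * mu
```

With these numbers, the small step annihilates the component of λ_1 while leaving the component of λ_n at 0.999; the large step annihilates the component of λ_n while amplifying the component of λ_1 to magnitude 999, which is exactly the non-monotone behaviour discussed above.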
The spectral properties of the SD method have been investigated in depth [7,8,9,10]. An interesting theoretical result concerning the asymptotic behaviour of this method is reported next [8].

Theorem 1
Let {x_k} be a sequence generated by the SD method. Then the normalized gradients g_k/‖g_k‖ asymptotically alternate between two directions lying in the space spanned by d_1 and d_n.

The main consequence of Theorem 1 is that the SD method eventually performs its search in the 2D space spanned by d_1 and d_n, thus showing the well-known zigzagging behaviour. This is in contrast with the possibility for the sequence {1/α_k} to travel in the spectrum of the Hessian, which, according to the previous observations, seems to be a desirable feature for gradient methods. Furthermore, for the Cauchy choice of the steplength it is well known that the method has Q-linear rate of convergence, which depends on ρ = (λ_1 − λ_n)/(λ_1 + λ_n). The Barzilai-Borwein (BB) steplength rules are given by:

    α_k^{BB1} = (s_{k−1}^T s_{k−1})/(s_{k−1}^T y_{k−1}),    α_k^{BB2} = (s_{k−1}^T y_{k−1})/(y_{k−1}^T y_{k−1}),

where s_{k−1} = x_k − x_{k−1} and y_{k−1} = g_k − g_{k−1}; they were obtained by including some second order information through a secant condition, and can be regarded as quasi-Newton methods with the Hessian approximated by (1/α_k) I. An interesting property of these rules is that, for problem (1), α_k^{BB1} = α_{k−1}^{SD} and α_k^{BB2} = α_{k−1}^{MR}, so that 1/α_k always lies in the interval [λ_n, λ_1]; furthermore, with these rules, both the sequences {f(x_k)} and {‖∇f(x_k)‖} are non-monotonic. For strictly convex quadratic problems the BB methods have R-linear convergence, which does not explain why they are in practice much faster than the SD method. An explanation of this behaviour is their ability to generate sequences {1/α_k} sweeping the spectrum of A [11].
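A minimal sketch of a BB gradient method for problem (1), using the BB1 rule α_k = s_{k−1}^T s_{k−1}/(s_{k−1}^T y_{k−1}); the choice of a Cauchy step to bootstrap the first iteration is our own:

```python
import numpy as np

def bb1_method(A, b, x0, tol=1e-6, max_iter=10000):
    """Barzilai-Borwein method (BB1 rule) for f(x) = 0.5 x^T A x - b^T x.
    Non-monotone in both f and the gradient norm."""
    x = x0.astype(float).copy()
    g = A @ x - b
    g0_norm = np.linalg.norm(g)
    alpha = (g @ g) / (g @ (A @ g))      # bootstrap with a Cauchy step
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol * g0_norm:
            return x, k
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g
        alpha = (s @ s) / (s @ y)        # BB1 steplength from the secant pair
        x, g = x_new, g_new
    return x, max_iter
```

On a strictly convex quadratic s^T y = s^T A s > 0 whenever g_k ≠ 0, so the BB1 steplength is always well defined and positive.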
Starting from [1], many other gradient methods have been proposed. Several methods, either based on the alternation of Cauchy and BB steplengths or on their cyclic use (see, e.g., [12,13,14]), fit into the framework of Gradient Methods with Retards (GMR) [15]. The convergence rate of these methods is R-linear, but their practical convergence behaviour is superior to that of SD. The approaches based on a prefixed alternation of steplength rules seem to be outperformed by the selection rules ABB and ABB_min, proposed in [16] and [17], which use an adaptive switching criterion for alternating the BB1 and BB2 steplengths:

    α_k = min{ α_j^{BB2} : j = max{1, k−m}, . . ., k }   if α_k^{BB2}/α_k^{BB1} < τ,
    α_k = α_k^{BB1}   otherwise,

where m is a nonnegative integer and τ ∈ (0, 1). Following the original Adaptive Barzilai-Borwein (ABB) in [16], the ABB_min strategy aims at generating a sequence of small steplengths with the BB2 rule, so that the next value computed by the BB1 rule becomes a suitable approximation of the inverse of some small eigenvalue. The switching criterion is based on the value α_k^{BB2}/α_k^{BB1} = cos² θ_{k−1}, where θ_{k−1} is the angle between g_{k−1} and A g_{k−1}, and allows the selection of α_k^{BB1} when g_{k−1} is a sufficiently good approximation of an eigenvector of A [17].
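The adaptive switching can be sketched as follows. This is a simplified reading of the ABB_min rule, with the BB formulas written out explicitly; the handling of the memory window as a plain Python list is our own simplification:

```python
import numpy as np

def abbmin_steplength(s, y, bb2_history, tau=0.8, m=6):
    """One ABB_min steplength from the secant pair (s, y).
    bb2/bb1 equals cos^2 of the angle between the previous gradient
    and its image under A, so the ratio measures how close that
    gradient is to an eigenvector."""
    bb1 = (s @ s) / (s @ y)
    bb2 = (s @ y) / (y @ y)
    bb2_history.append(bb2)
    if bb2 / bb1 < tau:                      # far from an eigendirection:
        return min(bb2_history[-(m + 1):])   # smallest recent BB2 (short step)
    return bb1                               # near an eigendirection: long BB1 step
```

Usage: the caller keeps one `bb2_history` list alive across iterations and feeds in the current secant pair; with m = 0 and the minimum replaced by the current BB2 value, the rule reduces to the original ABB switch.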
A different approach underlies some recently proposed gradient methods, which alternate SD steplengths with a sequence of constant steplengths computed by some specific rule that exploits previous SD steplengths, with the aim of escaping from the two-dimensional space in which the SD method tends to eventually confine its search. The SDA and SDC methods [9,18] compute their constant steplengths by exploiting the last two SD steplengths, through the formula

    α_k^A = ( 1/α_{k−1}^{SD} + 1/α_k^{SD} )^{−1}

and through the steplength α_k^Y of Yuan [19], respectively. We note that the steplength α_k^Y, used in [19] within a different algorithmic framework, was determined by imposing finite termination for two-dimensional convex quadratic problems. In [9,18] the authors prove that the steplengths α_k^A and α_k^Y are related and share similar asymptotic properties, shown by the following theorem.

Theorem 2
Let {α_k^{SD}} be the sequence of steplengths generated by the SD method. Then the sequences {α_k^A} and {α_k^Y} satisfy

    lim_{k→∞} α_k^A = 1/(λ_1 + λ_n),    lim_{k→∞} α_k^Y = 1/λ_1.

The steplengths of the SDA and the SDC methods, α_k^{SDA} and α_k^{SDC}, are defined by the following rule:

    α_k = α_k^{SD}   if mod(k, h+m) < h,
    α_k = α_s        otherwise,    (5)

where s is the last of the preceding h consecutive SD iterations, α_s = α_s^A for SDA and α_s = α_s^Y for SDC, and h and m are nonnegative integers with h ≥ 2. In SDC, the use of a finite sequence of Cauchy steps has a twofold goal: forcing the search into the two-dimensional space spanned by the eigenvectors d_1 and d_n, and getting a suitable approximation of the reciprocal of λ_1 through α_k^Y, in order to drive μ_1^k toward zero. If the component of the gradient along the eigenvector d_1 were completely removed, a sequence of Cauchy steps followed by constant steps computed with the Yuan rule would drive toward zero the component along the eigenvector d_2, and so on. Thus, the cyclic alternation of steplengths defined by (5) attempts to eliminate the components of the gradient according to the decreasing order of the eigenvalues of A. The SDA method has similar properties; in this case, the selected constant steplength attempts to exploit the tendency of the gradient method with steplength 1/(λ_1 + λ_n) to align the search direction with d_n, i.e., to eliminate the remaining eigencomponents. We also observe that if the Hessian matrix is ill conditioned, then 1/(λ_1 + λ_n) ≈ 1/λ_1, and SDA and SDC are expected to have very close behaviours. Like the GMR methods, SDA and SDC have R-linear convergence, but in practice they are competitive with the fastest gradient methods currently available. Furthermore, although the two methods are non-monotonic, a suitable choice of h and m leads to monotonicity in practice.
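The first limit of Theorem 2 can be checked numerically. The sketch below assumes the SDA steplength combines the two latest SD steplengths as α_k^A = (1/α_{k−1}^{SD} + 1/α_k^{SD})^{−1}; the test spectrum and iteration count are arbitrary choices of ours:

```python
import numpy as np

def sd_steplengths(A, b, x0, n_iter):
    """Run plain SD on f(x) = 0.5 x^T A x - b^T x and
    record the Cauchy steplengths."""
    x = x0.astype(float).copy()
    alphas = []
    for _ in range(n_iter):
        g = A @ x - b
        alpha = (g @ g) / (g @ (A @ g))
        alphas.append(alpha)
        x = x - alpha * g
    return np.array(alphas)

# Hypothetical test spectrum: lambda_1 = 20, lambda_n = 1
A = np.diag([1.0, 3.0, 5.0, 8.0, 20.0])
b = np.zeros(5)
rng = np.random.default_rng(0)
a = sd_steplengths(A, b, rng.standard_normal(5), 200)

# Combine the two latest SD steplengths; once the SD iterates have
# settled into the 2D zigzag of Theorem 1, this quantity stabilizes
# near 1/(lambda_1 + lambda_n) = 1/21
alpha_A = 1.0 / (1.0 / a[-2] + 1.0 / a[-1])
```

The stabilization reflects the fact that, in the asymptotic 2D regime, the reciprocals of two consecutive Cauchy steplengths sum exactly to λ_1 + λ_n.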
The alternation of Cauchy steplengths and constant steplengths also characterizes the Cauchy-short-steps methods proposed in [10]. The idea is to break the SD cycle by applying either very short or very long steps, approximating the inverses of suitable Hessian eigenvalues. Note that this strategy is also shared by the SDA and SDC methods, although they were designed from a different point of view.
Finally, a different approach aimed at capturing the spectrum of the Hessian is exploited by the limited memory steepest descent method proposed in [20]. The basic idea is to divide the sequence of gradient iterations into groups of m ≥ 1 iterations, referred to as sweeps, and to compute the steplengths for each sweep as the inverses of the Ritz values of the Hessian matrix, by exploiting the gradients obtained during the previous sweep.
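The sweep computation can be sketched as follows. For clarity this sketch forms the projection Q^T A Q with A explicitly, whereas the method in [20] recovers the Ritz values from the stored gradients without extra Hessian products:

```python
import numpy as np

def sweep_steplengths(A, G):
    """Inverse Ritz values of A on span(G), where the columns of G are
    the m gradients collected during the previous sweep."""
    Q, _ = np.linalg.qr(G)            # orthonormal basis of the sweep space
    T = Q.T @ A @ Q                   # m x m projection of the Hessian
    ritz = np.linalg.eigvalsh(T)      # Ritz values, all in [lambda_n, lambda_1]
    return 1.0 / ritz[::-1]           # shortest steps first
```

Since the Ritz values always lie in [λ_n, λ_1], every steplength produced this way lies in [1/λ_1, 1/λ_n], i.e., the sweep steplengths travel in the (inverse) spectrum of A, as discussed above.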

NUMERICAL ILLUSTRATION
In order to illustrate our analysis, we compare some gradient methods on a very simple problem [17] of the form (1), with a symmetric positive definite Hessian matrix with λ_1 = 1000 and λ_n = 1, and random b with entries in [−10, 10]. For the sake of space, we only consider ABB (τ = 0.15), ABB_min (τ = 0.8, m = 6) and SDC (h = 3, m = 3), which are representative of most of the strategies described in the previous section. The starting guess x_0 has been randomly generated too, with entries in [−1, 1]. As stopping condition we take ‖g_k‖ < ε ‖g_0‖, with ε = 10^{−6}. We focus on the distribution of the steplengths α_k in the interval [1/λ_1, 1/λ_n] = [0.001, 1] and on its impact on the convergence behaviour.
We first compare the ABB and ABB_min methods (see Fig. 1, top and middle). ABB_min tends to use the BB2 rule many more times, thus taking steplengths that are on average smaller than those of ABB. ABB_min produces very few large steps; this happens twice, at iterations 20 and 44, with a quite remarkable effect in reducing the gradient component along the eigenvector d_n, and more generally along d_i for large i. The long steps appear to produce some fluctuation in the objective function, followed by a strong decrease. The general behaviour of ABB is similar, but the non-monotonicity is slightly more noticeable, and this seems to deteriorate the performance of the method. We verified that this behaviour becomes more evident as the accuracy requirement increases. For instance, when ε = 10^{−8}, ABB takes almost twice the number of iterations taken by ABB_min.
Figure 1 (bottom) shows that the SDC method has a convergence history close to that of the ABB_min method. However, as observed in the previous section, SDC has a monotonic behaviour, fostered by Yuan steps that are very short, in agreement with Theorem 2. A careful examination shows that the first 18 iterations significantly reduce the gradient components along d_i for small i (this can be deduced from ‖g_k‖ ≈ |d_n^T g_k|), thus allowing the method to adopt a long step (almost equal to 1/λ_n = 1) at iteration 19, which produces a large decrease in the objective function and a strong reduction of the gradient component along d_n. As for ABB_min, the use of a few selected long steps produces remarkable effects on the overall SDC behaviour.
In conclusion, the methods we considered, although based on different strategies, share the ability to use large steplengths in a selective way, overcoming the chaotic behaviour of BB and thus improving monotonicity and computational efficiency. This ability can also be successfully exploited in more general contexts of unconstrained and constrained optimization [4,6,11,14,20], where the large scale of the applications makes effective gradient approaches an unavoidable choice.