Steplength selection in gradient projection methods for box-constrained quadratic programs

The role of the steplength selection strategies in gradient methods has been widely investigated in the last decades. Starting from the work of Barzilai and Borwein (1988), many eﬃcient steplength rules have been designed, that contributed to make the gradient approaches an eﬀective tool for the large-scale optimization problems arising in important real-world applications. Most of these steplength rules have been thought in unconstrained optimization, with the aim of exploiting some second-order information for achieving a fast annihilation of the gradient of the objective function. However, these rules are successfully used also within gradient projection methods for constrained optimization, though, to our knowledge, a detailed analysis of the eﬀects of the constraints on the steplength selections is still not available. In this work we investigate how the presence of the box constraints aﬀects the spectral properties of the Barzilai-Borwein rules in quadratic programming problems. The proposed analysis suggests the introduction of new steplength selection strategies speciﬁcally designed for taking account of the active constraints at each iteration. The re-sults of a set of numerical experiments show the eﬀectiveness of the new rules with respect to other state of the art steplength selections and their potential usefulness also in case of box-constrained non-quadratic optimization problems.


Introduction
Gradient methods are widely used for solving nonlinear optimization problems and, in many large-scale applications, their simplicity and low memory requirements make them the most convenient choice. In the last years, very efficient gradient-based approaches have been designed for unconstrained and constrained problems arising in different areas, such as image processing, compressive sensing and machine learning. The success of the gradient schemes in these applications is mainly due to the introduction of special strategies for accelerating their convergence rate, such as the adaptive techniques for the steplength selection, the use of extrapolation steps or the employment of scaled gradient directions.
This work deals with steplength selection rules, by investigating how popular techniques designed in the context of unconstrained optimization can be adapted to gradient projection methods for box-constrained quadratic programming problems. We focus on state of the art steplength strategies designed by exploiting the well-known Barzilai-Borwein (BB) rules [1]. These strategies have been originally developed in the unconstrained quadratic case, by taking into account the importance of using steplengths approximating the inverse of the eigenvalues of the Hessian matrix for achieving fast reductions of the eigencomponents of the gradient. In [2] the spectral properties of both the standard BB rule and other improved steplength techniques [3,4,5,6] have been investigated. It has been shown that the steplength rules specifically designed for an efficient approximation of the spectrum of the Hessian generally outperform the standard BB rule and are preferable also when applied for providing the tentative steplength of a line search procedure within gradient methods for general non-quadratic unconstrained problems [7,8,9,10,11].
For what concerns the case of the constrained optimization problems, the role of the steplength rules is not so deeply investigated. Many well-known gradient projection methods for constrained optimization simply exploit the same steplength selections designed for the unconstrained case in combination with some kind of line search strategy [12,13,14,15,16,17,18,19,20,21,22]. However, most of the steplength selections for unconstrained problems are designed for achieving a fast annihilation of the gradient of the objective function, that is not the main goal in constrained optimization. Thus, a better understanding of the role of the steplength in gradient projection methods seems necessary and useful for improving first-order approaches in constrained optimization.
In this paper, for the case of box-constrained quadratic programming problems, we analyze how the two popular BB steplength rules behave within gradient projection methods and how their spectral properties are affected by the presence of the constraints. In particular, on the basis of this spectral analysis, we design a new steplength rule that, by taking into account the active constraints at each iteration, is able to achieve fast reductions of the gradient components corresponding to the free variables at the solution. Numerical experiments on different sets of randomly generated test problems are carried out for evaluating the practical effectiveness of the gradient projection methods exploiting steplength selections based on the proposed rule. Preliminary numerical results on box-constrained non-quadratic problems are also discussed for showing the potential usefulness of the new steplength rule in this more general framework.
The outline of the paper is the following. In Section 2, after briefly recalling the gradient projection methods and the steplength rules we are interested to, we provide a spectral analysis of the standard BB rules in the case of box-constrained quadratic programs and introduce the new steplength selection; futhermore, alternation steplength strategies based on the new rule are discussed. In Section 3 we report the results of the numerical experiments and, finally, some conclusions are drawn in Section 4.

Algorithm 1 GP method for box-constrained quadratic programs
(gradient projection step) (steplength updating rule) end

Steplength rules in GP methods for box-constrained quadratic programs
We consider the following box-constrained Quadratic Programming (QP) problem where A is an n × n symmetric positive definite matrix, b, , u are vectors of R n , with ≤ u, and c is a scalar. Problem (1) has a unique solution x * , satisfying the following first-order optimality conditions where g(x) denotes the gradient of the objective function at x: In this work we are interested in solving problem (1) by means of the well-known Gradient Projection (GP) method combined with a line search strategy along the feasible direction. The main steps of the GP method are described in Algorithm 1. At each iteration, the operator P ≤x≤u (·) is exploited for computing the Euclidean projection of the vector x (k) − α k g(x (k) ) onto the feasible region given by the box constraints, in order to obtain the direction d (k) , which is a feasible descent direction for the objective function at x (k) . The new iterate is defined as where 0 < ν k ≤ 1 is determined by a backtracking procedure implementing a standard nonmonotone line search [23]. Finally, a steplength updating rule sets the parameter α k+1 ∈ [α min , α max ] for the next iteration. Our analysis focuses just on the steplength updating strategies and their relationship with the special constraints of problem (1). Before deepening this topic, we recall in Proposition 1 the basic convergence results for the sequence {x (k) } generated by Algorithm 1.
Proposition 1. Assume that the matrix A in problem (1) is symmetric and positive definite, then the following properties hold for the sequence {x (k) } generated by Algorithm 1: (i) the sequence {x (k) } converges R-linearly to the unique solution x * of problem (1); (ii) the sequence {f (x (k) )} has O 1 k convergence rate.
The proof of Proposition 1 can be derived from the analysis developed in [24] for the case in which the GP method is applied to the minimization of general strongly convex functions. For basic properties of the method we refer, for instance, to [12,25] and for a convergence analysis of more general scaled GP methods to [13,26,27]. It is worth stressing that the convergence properties in Proposition 1 hold independently of the choice of the steplength α k in the closed interval [α min , α max ]. This is very important from a practical point of view, since it allows one to make the updating rule of α k oriented at optimizing the performance. Many evidences of the remarkable convergence rate improvements achievable by means of suitable steplength updating rules are available in literature; in particular, promising results have been obtained, both on library test problems and real-world applications, by exploiting the Barzilai-Borwein steplength rules [12,20,26,18,14,15,17]. These rules, originally introduced for unconstrained minimization problems [1], aim at capturing some second order information by forcing the matrix (α k I n ) −1 to approximate the Hessian of the objective function through one of the following secant conditions: or where s (k−1) = x (k) − x (k−1) and y (k−1) = g (k) − g (k−1) , where the notation g (i) = g(x (i) ) is exploited. From (4) and (5) the steplength updating rules and are obtained, respectively. In the last years a lot of effort has been put into understanding the reasons of the effectiveness of the BB rules and great attention has been devoted to their ability in approximating the inverse of the eigenvalues of the objective Hessian. In fact, in the case of the unconstrained minimization of the quadratic function in (1), a gradient method is simply described by the iteration and for the gradient g (k+1) we have By denoting with {λ 1 , λ 2 , . . . , λ n } the eigenvalues of A and with {v 1 , v 2 , . . . , v n } a set of associated orthonormal eigenvectors, the gradient g (k+1) can be expressed as where µ (k+1) i ∈ R is called the i-th eigencomponent of g (k+1) and satisfies the following recurrence formula: The previous recurrence highlights the crucial role of a steplength α k approximating an inverse of an eigenvalue λ i : it forces a remarkable reduction of |µ i |, even if it can amplify the absolute value of some other eigencomponents. Thus, the behaviour of a steplength selection rule can be fruitfully studied in terms of its ability in reducing all the gradient eigencomponents by means of an effective sweeping of the spectrum of A −1 . When the gradient iteration (8) is applied in unconstrained quadratic optimization, the BB rules (6) and (7) can be written as These steplength rules are suitable for providing approximations of the eigenvalues of A −1 , since they satisfy 1 where λ min (A) and λ max (A) denote the minimum and the maximum eigenvalues of A, respectively. In [2], a spectral analysis of popular steplength selections shows that the BB rule (12) produces steplengths within the spectrum of A −1 in an almost random way, while other techniques, which are aimed at exploiting groups of small steplengths followed by some large steplengths [3,4,5,28,29], seem to better adapt to the spectrum of A −1 , thanks to their skill in catching the eigenvalues in a suitable order. In particular, very promising convergence rate improvements with respect to the standard BB rules have been provided by the Adaptive Barzilai-Borwein (ABB) strategy [3] and its modification ABB min [4], that are defined by and respectively, where τ ∈ (0, 1) and m a is a nonnegative integer. The BB rules (6), (7) and their adaptive alternations (15) and (16) can be exploited also in gradient methods for general nonlinear optimization problems; in these cases, they show a behaviour very similar to that observed for quadratic problems, although the recurrence relation (9) does not hold. In particular, in [2] it is shown that the adaptive alternation (16) appears more efficient than the rule (6) in fast approximating the spectrum of the Hessian matrices of the objective function and, consequently, shows a better convergence rate.
Even if all these BB-based steplength rules have been designed in the framework of unconstrained optimization, they are widely used in gradient projection methods for constrained minimization problems, still providing interesting performances [12,26,14,16,19,15]. Here, we investigate how the spectral analysis proposed in [2] can be generalized to the GP method for problem (1), with the aim to provide some explanations of the BB effectiveness in this special constrained optimization setting. We observe that, in this case, neither the recurrence formula (9) nor the equation (11) hold, and the optimality conditions (2) require that only some gradient components annihilate at the solution. These remarkable differences with respect to the unconstrained case are at the basis of the study developed in the next subsection.

BB-like rules for box-constrained quadratic programs
In order to understand the spectral properties of the BB rules (6) and (7) when they are exploited within the GP method for solving problem (1), we need to investigate how the gradient projection step influences their definitions. We first study the BB1 rule and then, by proceeding in a similar way, we develop the analysis for the BB2 selection. If, at the (k − 1)-th iteration, we suppose that the j-th component of , and the corresponding gradient component is non-negative (or non-positive), . As a consequence, the corresponding entry of the vector s (k−1) used in the steplength definition (6) is equal to zero: s (k−1) j = 0. By using the notation and This means that when the updating rule (6) is used in Algorithm 1 and α BB1 k ∈ [α min , α max ], the steplength α k satisfies the secant condition that is, it is defined by forcing the matrix (α k I) −1 to approximate the submatrix A I k−1 ,I k−1 of A defined by the intersection of the rows and the columns with indices in I k−1 . We call this submatrix the reduced Hessian matrix at the (k − 1)-th iteration.
In particular, in the following theorem we prove that the inverse of α BB1 k belongs to the spectrum of A I k−1 ,I k−1 .
where λ min (A I k−1 ,I k−1 ) and λ max (A I k−1 ,I k−1 ) are the minimum and the maximum eigenvalues of A I k−1 ,I k−1 , respectively.
Proof: We assume that at the iteration (k − 1) the rows/columns of A and the entries of any vector are reordered so that I k−1 is related to the first indices and J k−1 contains the last indices; by dropping for simplicity the iteration counter k − 1 from I k−1 and J k−1 , we can write In view of the GP iteration, we have that the entries of the iterate x (k) are where p can be partitioned into two sub-vectors given by Any entry g (k) i , i = 1, . . . , n, of the gradient g (k) has the following expression: Consequently, from (23), we can write From the definition (18), the value of α BB1 k can be written as or, in other words, α BB1 k is the reciprocal of a Rayleigh quotient of the submatrix A I,I .
Theorem 1 highlights that the use of the BB1 rule in Algorithm 1 leads to capture spectral properties of the matrix A I k−1 ,I k−1 instead of the whole matrix A, as done in the unconstrained quadratic case. Thus, we may argue that now the BB1 rule plays a role in reducing the gradient components with indices belonging to I k−1 . This can be observed in the special case in which I k = I k−1 and p i ), i ∈ I k . In fact, under these assumptions we have By denoting with γ 1 , . . . , γ r and u 1 , . . . , u r the eigenvalues and the associated orthonormal eigenvectors of A I k ,I k , respectively, where r = I k , and by writing g i u i , we obtain the following recurrence formula for the eigencomponents: This means that if the selection rule (25) provides a good approximation of 1 γi , a useful reduction of |μ (k+1) i | can be achieved. With regard to the second BB rule, using the above notation for the (k − 1)-th iteration of GP, we obtain that the steplength (7) can be written as In view of (24), we have and, consequently, 1/α BB2 k might be outside of the spectrum of the current reduced Hessian at x (k−1) . A simple way to correct (26) is to redefine this value as so that the following result holds. (1) is a symmetric positive definite matrix, we have The proof of Theorem 2 follows immediately by using the definition (24) of y (k−1) I k−1 . From the comparison between (27) and (28) we have that α BB2 k ≤ α BoxBB2 k ; furthermore, in analogy with the inequalities (14), for the rules α BB1 k and α BoxBB2 k we have: .
In view of these inequalities, the modified BB2 rule (28) can be exploited within the adaptive strategies (15) and (16) that use the BB steplengths in an alternate way, for an effective combination of short and long steps also in the framework of gradient methods for constrained optimization. In the following, we denote by BoxABB min the modified ABB min selection in which the BB2 rule is substituted by the BoxBB2 defined in (28): where τ ∈ (0, 1) and m a is a nonnegative integer; furthermore, we denote by BoxVABB min the BoxABB min variant in which a variable setting of the parameter τ is exploited, as suggested in [26]: with ϑ > 1. A typical value for ϑ is 1.1. It is worthwhile observing that this variable setting makes the efficiency of the steplength strategy less dependent on the value of τ provided by the user and in several applications it allowed remarkable performance improvements with respect to the standard ABB min strategy [26,19,30,31].
In the next section, we will evaluate on box-constrained test problems the behaviour of the proposed modified steplengths within the GP method. In particular, we will focus on the ability of these steplength rules to capture the spectral properties of the reduced Hessian, recovering in this way the benefits usually exhibited by the standard BB rules in case of unconstrained problems.

Numerical Experiments
In order to highlight the effectiveness of the modified versions of the BB rules for boxconstrained QP problems, we perform two numerical investigations: in the former we analyze the spectral behaviour of the different steplength rules on some toy problems, in the latter we evaluate the rules on a set of large-scale benchmark test problems. Finally, in order to have some hints about the behaviour of the modified BB rules for box-constrained non-quadratic minimization problems, we report the numerical results obtained by the GP method combined with these rules on some well-known general test problems. All the numerical experiments are carried out on a workstation equipped with an Intel Xeon QuadCore E5620 processor at 2,40 GHz and 18 Gb of RAM, by implementing the GP methods in the Matlab R2016a environment.

Numerical results on some special box-constrained QP problems
In this subsection we study the distribution of the steplengths generated by the considered rules with respect to the eigenvalues of the reduced Hessian matrices obtained during the GP iterative process. To this end, we randomly generated QP problems subject to lower bounds in which the solution, the number of active constraints at the solution and the distribution of the eigenvalues of the dense symmetric positive definite Hessian of the objective function are prefixed. The main features of the three test problems used for this analysis are described in Table 1, where na is the number of active lower constraints at the solution and I * denotes the set of the indices of the inactive constraints at the solution, so that A I * ,I * is the reduced Hessian matrix at the solution. The GP variants exploiting the different steplength rules are distinguished by means of the rule's name (BB1, BB2, BoxBB2, BoxABB min , BoxVABB min ) and share the same stopping criterion: where ϕ(x (k) ) is the vector with entries ϕ (k) In the methods the following parameter setting is used: M = 9, σ = 10 −4 , δ = 0.5, α min = 10 −10 , α max = 10 6 , α 0 = (g (0) T g (0) )/(g (0) T Ag (0) ); furthermore, τ = τ 1 = 0.5 and m α = 2 are set in BoxABB min and BoxVABB min .  Figures 1-3 confirm the ability of the BB1 and BoxBB2 rules to produce steplengths α k whose inverse belong to the spectrum of the reduced Hessian A I k−1 I k−1 . Furthermore, in the last iterations the eigenvalues of the reduced Hessian matrices tend to stabilize and to well approximate the eigenvalues of A I * I * . This means that the two rules play a role in reducing the gradient components with indices belonging to I * , that are just the gradient components annihilated at the solution. On the other hand, the inverse of the BB2 steplengths seem unable to effectively sweep the spectrum of the reduced Hessian matrices with dangerous effects on the performance. The behaviour of the BoxABB min rule suggests that the benefits exhibited by the special adaptive alternation of small and large steplengths introduced by ABB min for the unconstrained case, can be a useful strategy also in the box-constrained case, provided that the BoxBB2 selection is exploited in place of the standard BB2 rule.

Numerical results on large-scale box-constrained QP problems
In order to verify the above conclusions on a more exhaustive set of large-scale problems, we used the software package downloadable at http://www.dimat.unina2.it/diserafino/dds sw.htm for randomly generating box-constrained quadratic problems. Following the procedure proposed in [32,33], the software allows to generate test problems with different size, spectral properties and number of active constraints at the solution. We generated a dataset of 108 strictly convex QP problems with nondegenerate solutions, splitted into three groups of increasing size: n = 15000, 20000, 25000; for each group, 36 problems are generated by considering three values for the number na of active constraints at the solution, na ≈ 0.1 · n, 0.5 · n, 0.9 · n, three values for the condition number κ(A) of the Hessian matrix A, κ(A) = 10 4 , 10 5 , 10 6 , and four levels of near-degeneracy, obtained by setting the positive Lagrangian multipliers β * i , i = 1, . . . , na, equal to 10 −ηindeg , where η i ∈ (0, 1) is a random number and ndeg = 1, 4, 7, 10. The solution x * of each problem is randomly chosen from a uniform distribution in (−1, 1) and the Hessian matrix is defined as A = GDG T where G is the product of three Householder matrices associated to randomly generated unit vectors and D is the diagonal matrix of the eigenvalues, which are log-spaced between 1 and κ(A). Hence, the matrix A is not stored in memory but only the operator for computing the matrix-vector products involving A is available. The behaviour of the considered steplength rules on these test problems is evaluated by using the performance profiles proposed in [34]. By denoting with "solver" the GP method equipped with a generic steplength rule, we assume as performance measure of interest the execution time required by each solver to satisfy the stopping criterion where ϕ(x (k) ) is the vector with the entries defined in (33). As in [34], given a set P of n p problems, for each problem p we consider the ratio r p,s of the computing time of a solver s versus the best time of all the solvers (performance ratio) and, for each solver s and for θ ≥ 1 define P (r p,s , θ) = 1 if r p,s ≤ θ, 0 otherwise.
The perfomance profile of a solver s is given by ρ s (θ) = p∈P P (r p,s , θ) n p ; hence, ρ s (θ) is the probability for solver s that r p,s ≤ θ. If the stopping criterion is not fulfilled by a solver within the maximum number of 40000 iterations, the corresponding performance ratio is set equal to a fixed value r M larger than any ratio r p,s . For the test problems described above, the performance profiles of the different GP versions are reported in Figure 4. The BoxBB2 steplength rule is compared with the standard BB2 and BB1 rules in the top-left and top-right panels, respectively, while a comparison between the BoxBB2 and the BoxABB min is provided in the bottom-left panel; a summarizing analysis including also the BoxVABB min is given in the bottom-right panel. Figure 4 shows that, on the considered set of test problems, the proposed BoxBB2 steplength rule is generally preferable to the standard BB2 and BB1 selections. Furthermore, the adaptive alternation of the BB1 and BoxBB2 selections exploited in the BoxABB min is convenient with respect to the use of a single steplength rule. In particular, the version BoxVABB min is able to further improve the performance provided by the BoxABB min thanks to the updating formula (32) for the switching parameter τ .
All the above experiments highlight that the rationale behind the BoxBB2 design, aiming at making the rule able to generate values sweeping the spectrum of the inverse of the reduced Hessian matrices, has given rise to a steplength rule providing a promising practical behaviour, both in comparison with the widely used BB1 rule and, in particular, within efficient adaptive alternation strategies, like BoxABB min and BoxVABB min .

Beyond the quadratic case
In case of unconstrained optimization problems, the numerical behaviour exhibited by the BB-like rules in the quadratic case is also observed in case of non-quadratic problems [2]: the steplength selection strategies designed for efficiently adapting to the spectrum of the Hessian matrices of the objective function generally provide the best performance in terms of convergence rate of the gradient scheme. Then, it is natural to ask if in the case of box-constrained non-quadratic problems the proposed BoxBB2 rule preserves the benefits observed in the quadratic case and the adaptive alternations of the BB1 and BoxBB2 selections still outperform the other rules. In the following, we describe preliminary results obtained on some box-constrained non-quadratic test problems, even if the general nonquadratic case deserves a specific analysis and it is beyond the scope of this work.
The test problems are generated as follows. Starting from an unconstrained minimization problem with a twice continuously differentiable objective function g(x), for which a local minimum point x * is known, the technique proposed in [35] is used to generate a box-constrained problem of the form where are twice continuously differentiable non-decreasing functions. Due to the special definition of f (x), the point x * is a solution of (35). For our tests we selected some well-known non-quadratic functions g(x), described below.
(ii) Chained Rosenbrock function [37]: where n = 500, the values ϕ i , i = 1, . . . , 50, are defined as in [37, Table 1   (iii) Laplace2 function [10]: where A is a square matrix of order n = N 3 , N = 100, arising from the discretization of a 3D Laplacian on the unit box by a standard seven-point finite difference formula, h = 1 N +1 and b is chosen so that where the index i is associated with the mesh point (kh, rh, sh), k, r, s = 1, . . . , N . Two different settings for the parameters d, d 1 , d 2 and d 3 are considered: For each function g(x), we built the corresponding constrained versions with the following choices for the functions h i (x), as suggested in [35]: where α i are random numbers in (0.001, 0.011) and β i = 10 −ηindeg , where η i is a random number in (0, 1) and ndeg = 1, 4, 10; we recall that β i are the Lagrangian multipliers associated to the active constraints and then the value ndeg allows to control the degeneracy of the problem at x * . The vectors and u are set in such a way that the number of active constraints at the solution is equal to a prefixed value na; the same number of lower and upper active constraints is assumed and different problems are generated by considering na ≈ 0.1 · n, 0.5 · n, 0.9 · n. The procedure just described led us to construct a total amount of 108 box-constrained non-quadratic test problems. Similarly to the previous experiments, Figure 5 shows the performance profiles obtained by solving the above set of non-quadratic problems by the GP method with different steplength rules. All the parameters involved by the GP scheme and by the steplength rules are set as in the quadratic case, except for the initial steplength, that now is α 0 = 1, and the maximum number of iteration, that is equal to 4000. From Figure 5 we observe that the BoxBB2 rule confirms, also on these non-quadratic test problems, its better behaviour in comparison to the standard BB2 selection and well compares with the BB1 rule. Furthermore, the steplength strategies based on adaptive alternation of BB1 and BoxBB2 still provide the best performance. Then, by recalling that the proposed selections are designed essentially for achieving an effective approximation of the spectrum of the reduced Hessian, we may conclude that the spectral properties exhibited by the steplength rules play a crucial role for improving the gradient methods also in case of general box-constrained non-quadratic optimization problems.  Figure 5: Runtime performance profiles obtained by the GP method equipped with different steplength rules on a set of box-constrained non-quadratic test problems.

Conclusions
In this work we developed a spectral analysis of the Barzilai-Borwein steplength rules in gradient projection methods for box-constrained QP problems and, as a consequence, we proposed a steplength strategy suitable to exploit information on the active constraints at each iteration. In particular, the new rule provides steplengths able to capture spectral properties of the reduced Hessian matrices, i.e., the submatrices of the Hessian of the objective function given by the intersection of the rows and the columns with indices corresponding to the inactive constraints at the iterations of a gradient projection scheme. This steplength selection seems very useful in achieving a fast annihilation of the gradient components that must be zeroed at the solution and, consequently, it can be efficiently exploited in combination with the standard Barzilai-Borwein rule within state of the art adaptive alternation steplength strategies. Preliminary numerical results suggest that the proposed rule shows benefits also in gradient projection methods for general box-constrained non-quadratic optimization problems. Future work will concern the generalization of this analysis to the cases of more general constraints or more complex schemes like the scaled gradient projection methods [26,38,22].