Ritz-like values in steplength selections for stochastic gradient methods

The steplength selection is a crucial issue for the effectiveness of the stochastic gradient methods for large-scale optimization problems arising in machine learning. In a recent paper, Bollapragada et al. (SIAM J Optim 28(4):3312–3343, 2018) propose to include an adaptive subsampling strategy into a stochastic gradient scheme, with the aim to assure the descent feature in expectation of the stochastic gradient directions. In this approach, theoretical convergence properties are preserved under the assumption that the positive steplength satisfies at any iteration a suitable bound depending on the inverse of the Lipschitz constant of the objective function gradient. In this paper, we propose to tailor for the stochastic gradient scheme the steplength selection adopted in the full-gradient method knows as limited memory steepest descent method. This strategy, based on the Ritz-like values of a suitable matrix, enables to give a local estimate of the inverse of the local Lipschitz parameter, without introducing line search techniques, while the possible increase in the size of the subsample used to compute the stochastic gradient enables to control the variance of this direction. An extensive numerical experimentation highlights that the new rule makes the tuning of the parameters less expensive than the trial procedure for the efficient selection of a constant step in standard and mini-batch stochastic gradient methods.


Introduction
The problem we consider is the unconstrained minimization of the form where ξ is a multi-value random variable, f represents a cost function, and the mathematical expectation E is defined with Communicated by Yaroslav D. Sergeyev. respect to ξ in the probability space (Ξ , F, P). It is assumed that the function f : R d × Ξ → R is known analytically or it is provided by a black box oracle within a prefixed accuracy. In practice, since the probability distribution of ξ is unknown, we seek the solution of a problem that involves an estimate of the objective function F(x). The most common approximation is the Sample Average Approximation, defined as where the objective function is the empirical risk F n (x, ξ (n) based on a random sample ξ (n) = {ξ (n) 1 , . . . , ξ (n) n } of size n of the variable ξ . In the machine learning framework, each f i (x) ≡ f (x, ξ (n) i ) denotes the loss function related to the instance ξ (n) i of the training set. In the big data frame-work, since n can be a very large number, it is prohibitively expensive to deal with the objective function F n (x, ξ (n) ), its gradient or its Hessian matrix. A common approach to address the problem (1) or its approximation (2)-(3) is the Stochastic Gradient (SG) method and its variants, requiring only the gradient of one or few terms of F n (x) at each iteration, so that the cost of the overall optimization procedure is limited.
Starting from a vector x (0) ∈ R d , the basic iteration of the SG method can be written as where ξ (n k ) denotes a set of n k realizations of the random variable ξ , randomly chosen from the sample data ξ (n) , g(x (k) , ξ (n k ) ) is the stochastic gradient vector at the current iterate x (k) and α k is a positive steplength, known also as learning rate. The main strategies for the choices of ξ (n k ) give rise to the standard SG method, when n k = 1 for all k, and its mini-batch version, for n k > 1. In particular, given a randomly chosen subset S k ⊂ {1, . . . , n} of |S k | = n k indices, n k ≥ 1, and a subsample of the training set ξ (n k ) = {ξ (n k ) i } i∈S k , the stochastic gradient is defined as As concerns the convergence results of the standard SG method (4) and its variant with fixed subsample size n k , a very deep survey is given in Bottou et al. (2018). The results provided in Bottou et al. (2018) hold in the case of the solution of both problems (1) and (2). Under the crucial assumption that the gradient of the objective function is L-Lipschitz continuous and some additional conditions on the first and second moments of the stochastic gradient, when the positive steplength α k is bounded from above by a constant α max , the expected optimality gap for strongly convex objective functions, or the expected sum of gradients for general objective functions, asymptotically converge to values proportional to α max . In practice, if the steplength is sufficiently small and k → ∞, the method generates iterates in the neighborhood of the optimal or stationary value. Nevertheless, since the constants related to the assumptions, such as the Lipschitz parameter or the parameters involved in the bounds of the moments of the stochastic directions, are unknown and not easy to approximate, there is no guidance on the specific choice of the steplength. A selection of a too small value of a steplength without an accurate tuning, can give rise to a very slow learning process. In the literature, there exist a number of different proposals to overcome this drawback without resorting to second-order methods or introduce line search techniques.
In particular, we refer to Tan et al. (2016) and Yang et al. (2018), where the updating rule of the steplength is borrowed from the Barzilai's rules, well known in the deterministic context. In the case of strongly convex objective functions, in order to obtain a linear convergence in expectation to zero for the optimality gap (Yang et al. 2018) or to a solution for the sequence of the iterates (Tan et al. 2016), the updating rules are inserted in variance reducing schemes as SVRG (Johnson and Zhang 2013) or SAGA (Defazio et al. 2014). These methods require to periodically compute the full gradient or to store the last computed term of each gradient in the sum (3).
Another way to obtain the linear convergence for strongly convex objective functions consists in increasing n k at a geometric rate (Byrd et al. 2012) (see also Friedlander and Schmidt 2012). Despite this very strong condition, from the practical point of view, a procedure based on the so-called norm test, enables to control the sample size n k so that for some ζ > 0 (Hashemi et al. 2014). In the practical implementation, the left side of the last inequality can be approximated with the sample variance and the gradient ∇ F(x (k) ) on the right side with a sample gradient (Byrd et al. 2012;Bottou et al. 2018). Similar techniques are developed in Cartis and Scheinberg (2015), relaxing the norm test by the use of a line search technique based on the true value of the objective function.
A recent proposal suggested in Bollapragada et al. (2018) is to increase the sample on the basis of an inner product test, combined with an orthogonality test. These conditions guarantee that the negatives of the stochastic gradients based on subsamples of suitable size are descent directions in expectation. Numerical evidence highlights that the mechanism gives rise to an increase in n k slower than the one induced by the norm test; on the other hand, linear rate of convergence for objective functions satisfying the Polyak-Lojasiewicz (P-L) condition is preserved and other theoretical convergence features hold for general problems. These results strongly depend on the knowledge of the Lipschitz parameter L or on its suitable (local) estimate. Consequently, motivated by the numerical experiences shown in Franchini et al. (2020), in this paper we propose to tailor the steplength selection rule adopted in the Limited Memory Steepest Descent (LMSD) method (Fletcher 2012) to give a local estimate of the inverse of L in the SG framework, combining this strategy with the technique for increasing the subsample size detailed in Bollapragada et al. (2018) to adaptively control the variance of the stochastic directions.
The paper is organized as follows. In Sect. 2, we briefly recall the inner product test, the orthogonality test and the related theoretical convergence results. Section 3 is devoted to address the Ritz-like values in the context of the LMSD deterministic method and to tailor this technique to the stochastic framework; more precisely, the SG iteration based on the steplength defined by a Ritz-like value is combined with the adaptive subsampling technique proposed in Bollapragada et al. (2018). In Sect. 4 we describe the results of an extensive numerical experimentation. The conclusions are drawn in Sect. 5.

Theoretical results on SG method equipped with inner product and orthogonality tests
The convergence results on SG iteration (4) require that the stochastic gradient g (n k ) k based on the subsample ξ (n k ) is a descent direction sufficiently often, i.e., assuming that S k is chosen uniformly at random from {1, . . . , n} and g for all k ≥ 0. The variance of the term on the left-hand side of (6) can be controlled by determining the sample size n k at the k-th iteration so that the stochastic gradient is guaranteed to be a suitable estimate of the corresponding gradient. In particular, the following condition can be imposed on the sample size n k of ξ (n k ) : for some θ > 0. Furthermore, the inner product test can be combined with the orthogonality test, guaranteeing that the step direction is bounded away from orthogonality with ∇ F(x (k) ): for some ν > 0. The combination of the two tests (7) and (8) is also known as augmented inner product test.
We remark that assumption B holds when F is c-strongly convex, but it is also satisfied for other functions that are not convex (see Karimi and Nutini 2016). In addition we observe that assumptions A and B do not guarantee the existence of a stationary point for F; nevertheless, under the two assumptions, any stationary point for F is a global minimizer.
Furthermore, in view of the assumption C, the iteration (4) can be equipped by a variable steplength, as long as it belongs to the interval (α min , α max ], where α max is proportional to the inverse of L. Following the arguments of Bollapragada et al. (2018), the following theorems can be stated.
Theorem 1 Suppose the assumptions A and B hold. Let {x (k) } be the sequence generated by (4), where the size n k of any subsample is chosen so that the conditions (7) and (8) are satisfied and α k satisfies the assumption C. Then, we have that where ρ = 1−c α min . In particular, for a constant steplength The proof follows as in Theorem 3.2 of Bollapragada et al. (2018), by using the (P-L) condition instead of the strongly convexity of F and the inequality α k ≤ α max ≤ 1 (1+θ 2 +ν 2 )L . In the case of a convex function F such that the P-L condition does not hold, we can state the following theorem, whose proof runs as the one of Theorem 3.3 of Bollapragada et al. (2018), taking account of α k ≤ α max and of the additional strict bound on α max .
Finally, along the lines of Theorem 3.4 in Bollapragada et al. (2018), taking into account that α k > α min , we state the following proposition for the case of a general non-convex objective function F. In this case, {∇ F(x (k) )} converges to zero in expectation, with a sub-linear rate of convergence of the smallest gradients arising after K iterations.
Theorem 3 Suppose the assumption A holds and F is bounded below from F * . Let {x (k) } be the sequence generated by (4), where the size n k of any subsample is chosen so that the conditions (7) and (8) are satisfied and α k satisfies the assumption C. Assume that X * = argmin x F(x) = ∅. Then, we have that Furthermore, for any K > 0, we have To make a robust implementation of the iteration (4), Bollapragada et al. propose to determine the current steplength by a backtracking line search, aimed at providing a (local) estimate of Lipschitz parameter.
Exploiting the assumption of α k belonging to a suitable bounded interval, we propose an updating rule for the definition of the current α k , based on a stochastic version of the LMSD computation of Ritz-like values. In the next section we recall the deterministic procedure and we describe the stochastic version in detail.

Steplength selection via Ritz and harmonic Ritz values
Among the state-of-the-art steplength selection strategies for deterministic gradient methods, the limited memory rule proposed in Fletcher (2012) is one of the most effective ideas for capturing second-order information on the objective function. In order to describe our strategy for extending this approach to the stochastic gradient methods, in the following we recall the basic details on the rule (Fletcher 2012).

The deterministic framework
The limited memory rule (Fletcher 2012) provides the steplengths for performing groups of m ≥ 1 iterations, where m is a small number (generally not larger than 7). After each group of m iterations, called sweep, a symmetric tridiagonal m × m matrix is defined by exploiting the gradients computed within the sweep. The m eigenvalues of the tridiagonal matrix are interpreted as approximations of the eigenvalues of the Hessian of the objective function at the current iteration and their inverses define the m steplengths for the new sweep. The crucial point of this approach consists in building the tridiagonal matrix in an inexpensive way, starting from the information acquired in the last sweep.
To this end, in Fletcher (2012) the following strategy is proposed: suppose that the iterate x ( j) and m steplengths α j+k , k = 0, . . . , m − 1, are available for performing a new sweep and store the gradients and the steplengths used within the sweep in the following way: From the d × m matrix G j , an upper triangular m × m matrix R j such that G T j G j = R T j R j can be obtained, for example by means of the Cholesky factorization of G T j G j ; the matrix R j is non-singular if G j is full rank. By using G j , J j and R j define the matrix where r j is the solution of the linear system R T j r j = G T j g j+m . In the case of quadratic strictly convex objective function, T j is the symmetric tridiagonal matrix provided by m steps of the Lanczos process applied to the Hessian matrix of the objective function, with starting vector g j / g j ; this means that its eigenvalues, called Ritz values, are special approximations of the Hessian eigenvalues. In the general non-quadratic case, T j is upper Hessenberg and a symmetric tridiagonal matrix T j can be obtained as where the Matlab notation is used for denoting the lower triangular of T j . The limited memory steplength rule (Fletcher 2012) proposes to use the eigenvalues of T j , λ i , i = 1, . . . , m, as approximations of the eigenvalues of the Hessian of the objective function at the iteration ( j + m), and to exploit the inverses of these approximations as steplengths for the next sweep: Following the terminology used in the quadratic case, we call Ritz-like values the eigenvalues of T j . In Fletcher (2012) another idea is also introduced for defining the steplengths for the sweeps, based on a similar strategy. In the strictly convex quadratic case, this idea consists in obtaining the steplengths as eigenvalues of the matrix P −1 j T j , where The reciprocals of the eigenvalues of P −1 j T j are called harmonic Ritz. Replacing T j in (19) by the non-singular tridiagonal matrix T j , a pentadiagonal matrix P j is obtained. The matrices T j and T −1 P j can have nonpositive eigenvalues. This phenomenon is due firstly to the non-quadratic features of the objective function and secondly to the presence of negative curvature. The first situation can arise also for convex objective functions, whereas the second one concerns the minimization of general functions. To overcome these drawbacks, there are different strategies. In Fletcher (2012), di Serafino et al. (2018), the authors suggest to simply discard these values, hence providing fewer than m steplengths for the next sweep; if no positive eigenvalues are available, any tentative steplength can be adopted for a sweep of length 1. In addition it can be convenient to discard also the oldest back gradients. Another strategy, aimed at handling non-positive curvature, is to adopt a local cubic model, that reduces to a standard quadratic model when only positive eigenvalues are computed (Curtis and Guo April 2016).

The Stochastic framework
The strategy suggested by the LMSD method for an adaptive update of the steplength in the full gradient method performs as well as an L-BFGS method also for the minimization of non-quadratic and non-convex objective functions (Fletcher 2012 anddi Serafino et al. 2018). In a stochastic framework, where the computation of the Hessian matrix is very expansive, even when it is based on a subsampling, the LMSD approach can inspire a strategy for defining a selecting rule of the steplength at the current iteration of SG. The main difference with respect to the deterministic case is in the construction of the matrix G j , where we have to replace the full gradients computed at the m most recent iterations (m ≥ 1) with the corresponding stochastic gradients at the iterates x ( j+i) , obtained by using different samples of data {ξ (n j+i ) }, i = 0, .., m − 1: Following the procedure developed in the deterministic case combined with the approximation (17), the matrices T j and P j can be computed, by replacing G j in (16) with G j .
When the collected stochastic gradients are suitable approximations of the full gradients, i.e., they are in expectation suitable descent directions at the current iterate with a reduced variance, it is quite likely that from the inverses of the eigenvalues of the matrices T j and T −1 j P j , that are the Ritz-like and harmonic Ritz-like values, useful approximations of the inverse of the local Lipschitz constant of ∇ F can be obtained for the new sweep of iterations. For simplicity, we refer in the following to the Ritz-like values, but, similarly, the same considerations hold for harmonic Ritz-like values. We observe that, in addition to the drawbacks highlighted in the deterministic context, in this case G j is only an approximation of G j and, as a consequence, non-positive Ritz-like values can arise. As in the deterministic case, these values can be discarded, by removing also the oldest back stochastic gradients from G j . As a consequence, fewer than Furthermore, in order to avoid line search techniques, it is convenient to consider only the values λ i belonging to a prefixed range and we eliminate the values λ i = 1/α min , reducing again m R and discarding all the stochastic gradients giving rise to these values. If m R = 0, a tentative steplength α ∈ (α min , α max ] can be adopted for a sweep of length 1. This reference value is also used at the first iterate. If m R > 0, the steplengths in the next sweep are defined as A similar procedure involving the harmonic Ritz-like values enables us to define alternatively the steplengths in the next sweep as We remark that, in view of Theorem 3.3 in Curtis and Guo (April 2016), the positive harmonic Ritz-like values are greater or equal than the corresponding Ritz-like values; as a consequence, the rule (23) generates shorter steplengths with respect to the ones defined by (22). The alternate use of different rules to generate long and short stepsizes in the full gradient methods has been deeply investigated (see, for example, Dai and Yuan 2003;Zhou et al. 2006;Frassoldati et al. 2008), showing a large increase in their practical performance. Also in the stochastic framework, we can explore an alternate use of the Ritz-like and harmonic Ritz-like values. A first approach can be to simply toggle the use of the Ritz-like values to the one of the harmonic Ritz-like values at each sweep (Alternate Ritz-like values or A-R).
A second strategy may be to link the choice between Ritzlike and harmonic Ritz-like values to the selection of the size of the current subsample. We discuss in detail this selection. A crucial point is how to check when the stochastic gradients assembling the matrix G j can be considered acceptable estimates of the corresponding gradients. Inspired by the adaptive sampling technique in Bollapragada et al. (2018), the variance can be monitored by a suitable increase in the sample size n k . More precisely, the inner test condition (7) can be imposed on the sample size n k . Since the left-handside term of (7) is bounded from above by the true variance of individual gradient, the condition (7) holds when the following exact variance inner product test is satisfied: In order to implement condition (24), the variance can be approximate with the sample variance and the gradient ∇ F(x (k) ) on the right side with a sample gradient, so that the approximate inner product test is given by the following condition When this condition is not satisfied by the current sample size, the sample size is increased so that (25) is satisfied. With regard to the orthogonality test, a sufficient condition for (8) is the following exact variance orthogonality test: As for the previous test (24), a practical variant, named approximate variance orthogonality test, based on the sample approximation can be formulated as follows In order to choose a new sample size n k when the conditions (25) and (27) are not satisfied, we can compute and set n k = min( max(Z 1 , Z 2 ) , n). We observe that, when at the iteration k the size of the sample increases, the stochastic gradients previously stored are related to subsamples of lower size; then, we propose to discard the available Ritz-like values and to exploit the current stored stochastic gradients to determine a set of harmonic Ritz-like values. Indeed this strategy, named Adaptive Alternation of Ritz-like values (AA-R), leads to shorter steplengths in this transition phase.
The approximated version (25)-(27) of the augmented inner test is a way to choose the size of the subsample at the current iteration with the aim to control the goodness of the estimate g  (25) and (27) since numerical experience highlights that they are not too restrictive, slowly increasing the sequence {n k }.

Numerical experiments
In order to evaluate the effectiveness of the proposed steplength rule for SG methods, we consider the optimization problems arising in training binary and multi-labels classifiers for the following well-known data sets:  We consider two kinds of problems, the first relating to convex objective functions and the second involving a non-convex objective function. In the case of convex minimization problems, a binary classifier is searched for the data sets MNIST and w8a. For the MNIST data set, the two classes are the even and odd digits. In the non-convex case, the objective function arises from the design of a multi-class classifier for the MNIST data set. In both kinds of problems, a regularization term was added to the loss function to avoid overfitting.

Convex problems
We built linear classifiers corresponding to three different convex loss functions. Thus the minimization problem has the form where δ > 0 is the regularization parameter. By denoting as a i ∈ R d and b i ∈ {1, −1} the feature vector and the class label of the i-th sample, respectively, the loss function F n (x) assumes one of the following forms: -logistic regression (LR) loss: -square loss (SL): -smooth hinge loss (SH): We compare the effectiveness of the following schemes: -SG with a fixed mini-batch size in the version with fixed steplength, denoted by SG mini; -methods using Ritz-like values to adaptively select a suitable steplength; in particular, we consider: Fig. 2 Behavior of the optimality gap in 10 epochs for SG mini, A-R and AA-R methods in the case of the MNIST data set Fig. 3 Behavior of the optimality gap in 10 epochs for SG mini, A-R and AA-R methods in the case of the w8a data set -Alternate Ritz-like values in the scheme denoted by A-R, which toggles the use of the Ritz-like values to the one of the harmonic Ritz-like values at each sweep; -Adaptive Alternation of Ritz-like values in the scheme denoted by AA-R; in this method, when at the iteration k the size of the sample increases, we discard the available Ritz-like values and we exploit the current stored stochastic gradients to determine a set of harmonic Ritz-like values.
For both methods, an adaptive strategy is used for increasing the mini-batch size, as detailed in Sect. 3.2. In all the numerical simulations, we set θ = 0.7 in (25) and ν = 7 in (27).
In all the numerical experiments, carried out in Matlab on 1.6 GHz Intel Core i5 processor, we use the following setting: -the regularization parameter δ is equal to 10 −8 ; -in SG mini the size of the mini-batch is set as |S k | = |S| = 50 for all k ≥ 0; -in A-R and AA-R methods, the size of the initial minibatch is |n 0 | = 3; furthermore the maximum length of the sweep is set as m = 3; -each method is stopped after 10 epochs, i.e., after a time interval equivalent to 10 evaluations of a full gradient of F n or 10 visits of the whole data set; in this way we compare the behavior of the methods in a time equivalent to 10 iterations of a full gradient method applied to F n (x).
In the following, we report the results obtained by the considered methods on the MNIST and w8a, by using the three loss functions (logistic regression, square and smooth hinge functions). For any numerical simulation we perform 10 runs with the same parameters, but leaving the possibility to the random number generator to vary. Indeed, due to the stochastic nature of the methods, the average values in different simulations provide more reliable outcomes. In particular, for any numerical test, we report the following results: -the average value of the optimality gap F n (x)− F * , where x is the iterate obtained at the end of the 10 epochs and F * is an estimate of the optimal objective value; this value is obtained by a full gradient method with a huge number of iterations; -the related average accuracy A(x) at the end of the 10 epochs with respect to the testing set, i.e., the percentage of well-classified examples.
First of all, we determine by a trial procedure the best steplength α O PT for the standard SG method, i.e., the  α 10 −2 10 −3 10 −2 1 1 0 −2 5 10 −1 α min 10 −5 10 −7 10 −6 10 −3 10 −6 5 10 −4 α max 1 5 1 0 −2 5 10 −1 100 1 25 Table 3 Values of the setting providing the best results for AA-R method MNIST w8a α 10 −2 10 −3 10 −2 1 1 0 −2 5 10 −1 α min 10 −5 10 −6 10 −5 10 −3 10 −6 5 10 −4 α max 1 1 0 −1 1 100 5 10 −1 50 Table 4 Numerical results of the considered methods with F n (x) given by the logistic regression after 10 epochs  Table 5 Numerical results of the considered methods with F n (x) given by the square loss after 10 epochs  Table 6 Numerical results of the considered methods with F n (x) given by the smooth hinge loss after 10 epochs  Table 1, we report the value of the steplength α O PT corresponding to the best performance of standard SG in 20 epochs. Furthermore, in order to highlight the trouble to define a suitable learning rate, in Fig. 1 we show the trend of the optimality gap for five values of the steplengths in the case of MNIST data set with logic regression and square loss functions. The instability of the standard SG method behavior with respect to the selection of the steplength motivates the expensive trial process that produces Table 1. In the following, we report the numerical results of the comparison between SG mini and A-R and AA-R methods. In particular, in A-R and AA-R methods, different settings of the bounds α max and α min are used: The tentative value of the steplength α is set as 10α O PT . Figures 3 and 4 show the behavior of the optimality gap with respect to the first 10 epochs for MNIST and w8a, respectively, in the case of the three loss functions. In particular, the dashed black line refers to SG-mini, whereas the red and the blues lines are related to A-R and AA-R methods, respectively, in the above specified four settings. We observe that the results obtained with the A-R and AA-R methods are comparable with the ones obtained with the SG mini equipped with the best tuned steplength. Indeed, the adaptive steplength rules in A-R and AA-R methods seem to be slightly dependent on the values of α max and α min , making the choice of a suitable learning rate a less difficult task with respect to the selection of a good constant value in standard SG and SG mini methods. In Tables 2 and 3, we summarize the setting that provides the best results for A-R and AA-R methods. In Tables 4, 5 and 6, we show the final optimality gap (with respect to the training set) and accuracy (with respect to the testing set) obtained at the end of 10 epochs for the logistic regression, square and smooth hinge loss functions, respectively, for the best setting. The final accuracy of the three methods differs at most to the third decimal digit. This observation can also be extended to the simulations obtained for A-R and AA-R methods with the other settings. In Figs. 4 and 5, we show the increase in the subsample size in A-R and AA-R methods in the case of all the convex loss functions. Starting with n 0 = 3, the size of current subsample is at least 120 in the case of MNIST data set and 900 in the case of w8a data set at the end of the 10 epochs, much smaller than the number of sample n of the training set.
Finally, in Figs. 6 and 7 , we compare the behavior of SG mini and A-R and AA-R methods when the parameter α SGmini is not the best-tuned value, as in the previous experiments. In particular, SG mini method in Fig. 6 is carried out with α O PT replaced by α = 10 −5 for logistic regression function and α = 10 −5 for smooth hinge loss function, that is α SGmini = α|S|. In Fig. 7, SG mini is equipped with α = 1 for logistic regression function and α = 10 −5 for square loss function. A-R and AA-R methods are executed using the four previously specified settings, with α O PT set as above. Figures 6 and 7 highlight that a too small fixed steplength in SG mini produces a slow descent of the optimality gap; on the other hand, a steplength value larger than the besttuned one can cause oscillating behavior of the optimality gap and, sometimes, it does not guarantee the convergence of SG mini method. As regards A-R and AA-R methods, these (a) (b) Fig. 6 Comparison between SG-mini with respect to A-R and AA-R in 10 epochs on the MNIST data set approaches appear less dependent on an optimal setting of the parameters and they enable us to obtain smaller optimality gap values after the same number of epochs exploited by SG mini.
Furthermore, we observe that in A-R method, the behavior of the optimality gap is more stable than in AA-R method. Nevertheless, AA-R method can produce a smaller optimality gap at the end of 10 epochs.

Some further experiments and remarks on convex problems
In the previous experiments, A-R and AA-R methods are equipped with the approximated version of the augmented inner test, based on sample statistics. For small samples, the conditions (25)-(27) may not be reliable enough in providing a sample size able to control the errors in the gradient estimates; indeed, in presence of noise, the norm of the current stochastic gradient g (n k ) k can be greater than ∇ F(x (k) , so that the conditions (25)-(27) are verified for many iterations before producing an increase in the sample size. To prevent this drawback, in Bollapragada et al. (2018), when the sample size does not change for at least r consecutive iterations, an average vector of the last r sample gradients is computed: When g avg < γ g (n k ) k , for a prefixed γ ∈ (0, 1), the augmented inner product test is performed by replacing the current stochastic vector with the average vector g avg ; the possible consequence is an increase in the sample size. Typical values for r and γ are 10 and 0.38, respectively. For more details, also on this special setting, see Bollapragada et al. (2018); here, this practical procedure is viewed as a recovery strategy to improve the stability of SG method equipped with a line search rule for providing a suitable steplength. On the other hand, after some epochs, the effectiveness of the method can degrade for faster increase in the sequence {n k }, although the adoption of the recovery procedure makes smaller the total number of backtracking steps.
In order to highlight this remark, in Fig. 8 we show the results obtained for MNIST when the problem (29) with logistic regression function is addressed by SG method equipped with a simple line search. In particular, we report the optimality gap with respect to 10 epochs when the augmented inner product test is coupled with the recovery procedure (magenta line) and without this recovery procedure (green line). In the latter case, the final sample size is 48 with a large number of backtracking steps (2700), while in the former one the sample size increases until 3300 with very few backtracking steps (110). As a consequence, the recovery procedure appears crucial for the control of the effectiveness of the line search and the sequence {n k }.
The numerical results of the previous section show that A-R and AA-R methods are less dependent on the lack of reliability of the augmented inner product test for small values of n k . Nevertheless, we can introduce the recovery procedure when the computation of Ritz-like values gives rise to m R = 0 and the steplength at the next iteration is set to a tentative value α. More precisely, when this situation occurs, if the sample size has not changed in the last r iterations, the novel sample size is determined by using the approximated replaced by the average vector (30). In Figs. 9 and 10 , we show the behavior of the optimality gap obtained by using the modified versions of A-R and AA-R methods and SG method equipped with the line search rule for MNIST and w8a, respectively, in the case of the three loss functions in the objective. The comparison with Figs. 2 and 3 allows to observe that the recovery procedure improves the stability of A-R and AA-R methods with respect to the setting of α min and α max , preserving the effectiveness of the approach. Indeed the accuracy of the two versions at the end of 10 epochs differs at most to the third decimal digit. The final value of the sample size is at most 10 times the one obtained without the use of the recovery procedure. As already observed in the previous section, AA-R method allows to obtain better results with respect to A-R in most experiments.
Furthermore, we observe that the performance of our approach appears generally better with respect to SG with a line search procedure. The comparison is carried out by considering only the number of scalar products performed in all methods, that is n scalar products for each epoch. Indeed, for the considered loss functions, the computational cost of the evaluation of the stochastic gradient g (n k ) k is essentially given by the n k scalar products a T i x (k) , i ∈ S k . As regards SG method with a line search rule, we observe that, although the evaluation of an estimate of the objective function 1 n k i∈S k f i (x (k) ) at x (k) does not require additional scalar products and it is negligible, the computation of the same estimate at x (k) − αg (n k ) k requires at least additional n k scalar products. Thus, each iteration of SG with line search has a computational cost at least equal to two evaluations of the stochastic gradient on the same sample. Any backtrack- Fig. 8 Behavior of the optimality gap in 10 epochs for SG method equipped with a line search rule; magenta line is related to the version of the method combined with the recovery procedure while the green line is used for the version without this procedure. The parameters are chosen as in Bollapragada et al. (2018). In the experiment, logistic regression is the loss function and MNIST is the data set ing step increases the count of total scalar products. This preliminary analysis appears to favor schemes that avoid a line search rule for the determination of the steplength, also in the case of a few epochs when the sample size remains low. This topic may be the subject of future investigations.

A non-convex problem: a convolutional neural network
In the non-convex case, we consider as loss function an artificial neural network. In particular, dealing with image classification, we consider a convolutional neural network (CNN). The network is composed of an input layer, two sequences of convolutional and max-pooling layers, a fully connected layer and an output layer. We make use of rectified linear unit (ReLU) activations combined by a softmax function for the output layer and of a cross entropy as loss function (see Fig. 11). We consider the optimization problem arising in training a multi-class classifier for the MNIST data set.
We compare the effectiveness of the same methods considered in the previous section, i.e., SG mini, A-R and AA-R methods. In all the numerical experiments we use the following setting: -regularization parameter δ = 10 −4 ; -the first convolutive layer is composed by 64 filters, each filter has 5 × 5 dimension; after we apply a max-pooling of size 2 × 2; -the second convolutive layer is composed by 32 filters, each filter has 5 × 5 dimension; after we apply a maxpooling of size 2 × 2; -in SG mini, the size of the mini-batch, is set as |S| = 50; -in A-R and AA-R methods, the length of any sweep is at most m = 3; furthermore, θ = 0.7 in (25) and ν = 7 in (27) for all the numerical simulations; The numerical experiments were carried out in Matlab on Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz with 8 CPUs.
In Fig. 12, we can observe the different accuracies (with respect to the testing set) provided by the CNN trained with SG mini in 5 epochs; the fixed steplength is set to values between 0.001 and 0.9. As we can see, the method provides effective results for α = 0.5; a similar accuracy is obtained for α = 0.1. In cases with smaller steplengths, the accuracy in 5 epochs is unsatisfactory, while a higher steplength can lead to the divergence of the method. Hence, in a more marked way  Accuracy obtained by training the CNN with A-R and AA-R methods than the convex case, for non-convex problems, finding an effective steplength requires a very expensive trial procedure. Conversely, using a random steplength without a prior trial phase can lead to inaccurate results due to slow convergence or divergence of the method.
In Fig. 13, we report the results obtained by training the CNN with the A-R and AA-R methods. In particular, we show the behavior of the accuracy with respect to the testing set in the first 5 epochs with the following settings: -for A-R method, α min = 10 −3 , α max = 1, n 0 = 10; -for AA-R method, α min = 10 −2 , α max = 1, n 0 = 3.
The parameter α is set as 0.1 in all cases. We observe that A-R appears more robust with respect to the amplitude of the interval where α k can belong. Furthermore, we notice that the subsample size increases up to a maximum of 204 and 182 in A-R and AA-R methods, respectively.

Conclusions
In this paper we proposed to tailor the steplength selection rule based on the Ritz-like values, used successfully in the deterministic gradient schemes, to a stochastic scheme, recently proposed by Bollapragada et al. (2018). This SG method includes an adaptive subsampling strategy, aimed to control the variance of the stochastic directions. We observed that the theoretical properties of this approach hold under the assumption that the steplength selection rule obeys to the assumption α k ∈ (α min , α max ], where α max is proportional to the inverse of the Lipschitz parameter of the objective function gradient. Consequently, we reformulate the procedure for obtaining the Ritz-like values in the stochastic framework, by using the stochastic gradients instead of the standard gradients. It is required that these stochastic directions, although based on different subsamples, satisfy two conditions (the inner product test and the orthogonality test), ensuring the descent property in expectation. In particular, we proposed two different ways to select the current steplength, by simply toggling the Ritz-like values with the harmonic Ritz-like values (A-R method) or using the harmonic Ritz-like values only when the size of the subsample is increased (AA-R method). The numerical experimentation highlighted that the proposed methods enable to obtain an accuracy similar to the one obtained with SG mini-batch with fixed best-tuned steplength. Although also in this case it is necessary to carefully select a thresholding range for the steplengths, the proposed approach appears slightly dependent on the bounds imposed on the steplengths, making the parameters setting less expensive with respect to the SG framework. In conclusion, the proposed technique provides a guidance on the learning rate selection and it allows to perform similarly to the SG approach equipped with the best-tuned steplength.