Neural Network Modeling for Small Datasets

Neural network modeling for small datasets can be justified from a theoretical point of view according to some of Bartlett's results showing that the generalization performance of a multilayer perceptron (MLP) depends more on the L1 norm ‖c‖1 of the weights between the hidden layer and the output layer rather than on the total number of weights. In this article we investigate some geometrical properties of MLPs and drawing on linear projection theory, we propose an equivalent number of degrees of freedom to be used in neural model selection criteria like the Akaike information criterion and the Bayes information criterion and in the unbiased estimation of the error variance. This measure proves to be much smaller than the total number of parameters of the network usually adopted, and it does not depend on the number of input variables. Moreover, this concept is compatible with Bartlett's results and with similar ideas long associated with projection-based models and kernel models. Some numerical studies involving both real and simulated datasets are presented and discussed.


INTRODUCTION
An important issue in statistical modeling is related to so called indirect measures or virtual sensors. This involves the prediction of variables that are quite expensive to measure (e.g., the viscosity or the concentration of certain chemical species, some mechanical features) using other variables that are more easily measured, like, for example, the temperature or the pressure (see De Veaux and Ungar 1996). Such problems often involve some difficulties: (1) The available datasets are small (e.g., typical chemical datasets have only 30-100 observations); (2) the input-output relation to be estimated is nonlinear; and (3) there are many predictor variables, but, because linearity cannot be assumed, it is quite difficult to reduce the dimensionality of the problem by choosing a good subset of predictors or suitable underlying features.
Assume that we are provided with a dataset of N pairs (x n , y n ) of m-dimensional input vectors x n and scalar target values y n . In many applications the scientific law underlying the relationship between the response variable and the predictors is (at least partially) known, and the experimental data can be used to test the model assumptions and to estimate the adaptive parameters; such models are called first-principle models. Here we focus on the opposite case, that is, on applications in which the underlying first principle is unknown or the system is too complex to be mathematically described, so that the data are used to extract knowledge and afterward to derive some practical and useful models (see Cherkassky and Mulier 1998). In this context, neural networks may be attractive.
It may happen that the size N of the available sample is small compared with the number of weights of the neural network [e.g., a single hidden-layer multilayer perception (MLP) with p nodes in the hidden layer has W = p(m + 2) + 1 weights], so that the resulting neural model is considered overparameterized. In the literature, this problem has been approached in different ways. One strategy is based on the regularization technique called jittering, which consists of adding artificial noise to the input during training; training with jitter is a form of smoothing related to kernel regression and other regularization methods, such as ridge regression (see, e.g., Azencott, Doutriaux, and Younes 1993). Another possibility is to reduce the dimension m of the input space using suitable preprocessing techniques; this strategy is very useful when the dimension m of the input space is quite large (see, e.g., Yuan and Fine 1998).
Here we consider a third approach, based on overparameterized networks, which has been pursued in the literature by some authors. For instance, De Veaux, Schumi, Schweinsberg, and Ungar (1998) worked on a one-hidden layer perceptron with more than 200 weights trained with 61 data units concerning measurements taken from a polymer pilot plant; Lawrence, Giles, andTsoi (1996, 1997) provided many simulations with overparameterized MLPs, concluding that oversized networks can result in low training and generalization errors. These neural models can be justified from a theoretical point of view according to some results published by Bartlett (1998) showing that the generalization performance of an MLP depends more on the L 1 norm c 1 of the weights between the hidden layer and the output layer rather than on the total number of weights. These results suggest caution when transferring some basic statistical paradigms into the neural framework. Indeed, they seem to contradict the usual assumption in statistical modeling according to which the number of parameters should be (much) smaller than the number of observations used for their estimation. This contradiction is particularly apparent when the neural model is selected according to some goodness-of-fit statistic, like, for example, the Akaike information criterion (AIC), the Bayesian information criterion (BIC/SBC), or the final prediction error (FPE) (see Sec. 2 for details). Indeed, even if the underlying theories do not hold for neural models, they are quite simple to compute, and they are often considered crude estimates of the generalization error. Actually, in these model selection criteria the number K of degrees of freedom is set equal 298 SALVATORE INGRASSIA AND ISABELLA MORLINI to the number W of weights, so they are useless when W > N; in this case, for example, both the FPE and the unbiased estimate of variance (UEV) assume negative values, and this does not make sense.
In contrast, in the smoothing literature, notions of degrees of freedom different from the number of adaptive parameters in the final model, have a long history, and are well established. For example, for a linear smoothing operator S, where y = Sy is the smooth of y, Hastie and Tibshirani (1990, sec. 3.5) gave three definitions of degrees of freedom useful for different purposes. In the generalized cross validation (GCV) criterion the model complexity is measured by the trace of the smoothing matrix S; Friedman (1991) and many of the discussants of his article proposed various definitions of degrees of freedom for multivariate adaptive regression splines. They motivated their definitions in a parametric framework. Recently, Ye (1998) and Hodges and Sargent (2001) extended the necessity of finding new notions of degrees of freedom, different from the number of adaptive parameters, to all richly parameterized models and to complex hierarchical procedures. The generalized degrees of freedom proposed by Ye (1998) is completely general in that it is applicable to all complex modeling procedures. However, it is costly to compute, because it requires Monte Carlo simulations. Hodges and Sargent's degrees of freedom are defined for models that can be reexpressed as linear models, in the formal sense that the left side is known and the right side consists of known linear combinations of unknown parameters. Both of these notions arise from a parametric framework.
In this article we focus on nonlinear projection methods of the form where a 1 , . . . , a p ∈ R m , b 1 , . . . , b p , c 0 , c 1 , . . . , c p ∈ R, and τ is a sigmoidal function. However, the obtained results can be extended to other kinds of neural models, like radial basis function and projection pursuit regression models (see Hastie, Tibshirani, and Friedman 2001), where the function τ has a different nonlinear form. These are additive models, but in the derived features τ (a k x + b k ) rather than in the inputs themselves. We move from a nonparametric point of view, exclusively drawing on geometrical considerations. We extend some geometrical properties of the MLP reported by Ingrassia (1999) and generalize preliminary results given by Ingrassia andMorlini (2002, 2004), showing that the input-to-hidden weights and the hidden-to-output weights in an MLP play different roles. Drawing on the projection theory of linear models, here we introduce the equivalent number of degrees of freedom (edf ), K, which is equal to the trace of a suitable projection matrix. For model selection purposes, according to indices like AIC and BIC, this quantity can be set equal to the number of neurons p + 1 in the hidden layer [in eq. (1) the constant c 0 can be considered a weight between the output and a hidden neuron with value set to 1]-that is, the dimension of the projection space intrinsically found by the mapping function-whereas to estimate the error variance, sometimes better approximations of K (which also depend on the adopted regularization technique) should be considered.
Finally, using both real and simulated datasets, we present some empirical studies linking the number of neurons in the hidden layer to the L 1 norm c 1 . However, as shown in our examples, it is unstable across different simulations for large networks as well. Besides, as outlined by Bartlett (1998), for the size of the weights between the hidden layer and the output layer to be an appropriate measure of the mapping function complexity, the gradient descent algorithm must reach a suitable minimum of the error function. This is not always verified in practice, especially with small training datasets. The simulation examples show that the estimates of the error variance using such an edf K behaves quite well, whereas the error variances estimates based on the total number of weights W are far from the true value or may assume even negative values.
The article is organized as follows. In the next section we review the generalization performance of a neural model in the context of Bartlett's results and state the purpose for which we want to define the degrees of freedom of an MLP and similar projection methods. In Section 3 we prove some geometric properties of the sigmoidal functions that motivate our notion of degrees of freedom. In Section 4 we introduce our definition of equivalent degrees of freedom for multilayer perceptrons and other projection methods; we also provide relationships with other approaches. In Section 5 we present simulations relating the constant c 1 to the number of hidden units and compare the error variance estimates computed with different measures of degrees of freedom. Finally, in Section 6 we give some concluding remarks.

EVALUATION OF THE PERFORMANCE OF THE
NETWORK WITH SMALL DATASETS

The Problem
Let (X, Y) be a pair of a random vector X and a random variable Y with joint probability distribution p(x, y), where X is the m-dimensional input vector with values in some space X ⊆ R m and Y is a response variable with values in Y ⊆ R. Throughout this article we assume that the input-output relation can be written as Y = φ(x) + ε, where ε is a random variable with mean 0 and finite variance. The unknown functional dependency φ(x) = E[Y|x] is estimated by means of the function f p (x) realized by an MLP with m inputs, p neurons in the hidden layer, and one neuron in the output, where τ (·) is a sigmoidal function-like τ (z) = tanh(z) or τ (z) = (1 + e −z ) −1 -and a 1 , . . . , a p ∈ R m , b 1 , . . . , b p , c 0 , c 1 , . . . , c p ∈ R. We denote by A the p × m matrix with rows a 1 , . . . , a p , and we set b = (b 1 , . . . , b p ) and c = (c 1 , . . . , c p ). Because such quantities are called weights, we denote them by w, so that w ∈ R p(m+2)+1 , and sometimes we write f (x, w). Let F p be the set of all functions of type (2) for a fixed p, for 1 ≤ p ≤ N. (We discuss the upper bound N on p in Sec. 3.) In what follows we suppress the index p in f p and F p for simplicity of notation. The problem is to find the function f (0) = f (w (0) ) in the set F such that the generalization error (also called the expected risk), where the integral is over X × Y, attains its minimum, that is, and w (0) denotes the weights of f (0) . In practice, the distribution p(x, y) is unknown, but we have a sample, D = {(x 1 , y 1 ), . . . , (x N , y N )}, of N iid realizations of (X, Y), so that we can compute the empirical error or the empirical risk, Let f where the w (0) are the weights of f D . The problem is to explore the linkage between the empirical error (5) and the true statistical performance of the network, namely the generalization error (3); in other words, to investigate the conditions that may ensure that f (0) D is a reasonable estimate of f (0) . Usually the sample D is partitioned in three independent datasets: a learning set (or training set) L, a validation set V, and a test set T . The error (5) is referred to as the learning error (or training error) E( f , L), the validation error E( f , V), or the test error E( f , T ), the first computed using the learning set L, the second computed with the validation set V, and the last computed using the test set T . The learning error L is a set of examples used to fit the parameters of the model. The validation set and the test set play different roles; the set V is a set of examples used to tune the parameters of a model (e.g., to choose the number of hidden units in a neural network), and the set T is a set of examples used only to assess the performance of a fully specified model (see, e.g., Ripley 1996;Hastie et al. 2001). However, very often the sample D is split into only two different groups, the learning set L and the test set T .
Both the learning set and the test set are used to estimate the parameters w (0) , because E( f , L) is the function to be minimized, whereas E( f , T ) is the function used to control overfitting. As the learning (or training) process proceeds, at the beginning both the learning and the test errors generally decrease, but there comes a time when the test error starts to increase even though the learning error is still decreasing. It is then judged that overfitting is occurring. The training is then halted, and the current estimate of the weights is chosen to be w (0) . This technique is called early stopping. If the test error never decreases during the learning process, then the network is judged to be underparameterized, and a larger one is recommended. Otherwise, a local minimum is reached by the optimization algorithm, and initialization parameters should be changed.
In virtual sensors, the total amount of data is often quite limited, and one would like to use most of the data for training purposes to achieve a higher likelihood of selecting a good network. For example, De Veaux et al. (1998) worked on a set D of 61 observations, and they first trained the MLP with a learning set L of 50 samples to select the best architecture (the one minimizing the test error T of the remaining 11 observations). They subsequently trained the selected model on the basis of the whole set D of available observations. When the data cannot be split into three sets and the validation error cannot be computed for estimating the generalization error, model selection criteria are used to compare different networks. These criteria are based on the test error but also take into account a model complexity measure. (See Kadane and Lazar 2004 for a complete review of these indices and their theoretical basis both in the Bayesian and frequentist framework.) Traditionally, for general modeling problems, practitioners tend to measure model complexity with the number of parameters in the final model, because of the coincidence of this quantity and the degrees of freedom in linear models. Still, for linear models, the number of parameters can also be interpreted as the cost of the estimation process and thus can be used for obtaining an unbiased estimate of the error variance. The seminal work on model selection is based on the parametric statistics literature and is quite vast, but it must be noted that although model selection techniques for parametric models have been widely used in the past 30 years, surprisingly little work has been done on the application of these techniques in a semiparametric or nonparametric context.
Let f K be a statistical model based on K degrees of freedom; in the rest of the article, N denotes the size of the learning set. In general, these model selection criteria, denoted here by , are an extension of the maximum likelihood and have the form where the term E( f K ) = E( f K , L) is the empirical error of the model f K based on the training set and C K is a complexity term representing a penalty that grows as the number K of degrees of freedom in the model increases. If the model f K is too simple, it will give a large value for the criterion because the empirical error is large; whereas a model f K that is too complex will have a large value for the criterion because the complexity term is large. Typical indices include the AIC (Akaike 1974), where N is the size of the learning set. Another criterion used in linear regression models with least squares estimates of the parameters is the Schwarz (1978) BIC (or SBC). The BIC/SBC results in the selection of the model for which the following expression is the minimum: Besides AIC and BIC, other selection methods for which the algebraic expression requires normality and least squares estimates of the parameters are the FPE (Akaike 1970), the GCV error, and the well-established UEV, For these indices in neural network modeling, some properties have been investigated under key assumptions, but statistical optimality has not been made clear. These statistics are often used as crude estimates of the generalization error in nonlinear models, because correcting the statistics for nonlinearity requires much computation and the fulfillment of regularity conditions that are often violated by these models (Moody 1992). As outlined earlier, for smoothing operators, the notion of degrees of freedom has a long history, and in the GCV criterion model complexity is measured by the trace of the smoothing matrix. In contrast, little work has been done on neural networks and models of the form (1), and practitioners still measure the dimensionality of the model complexity by K = W, that is, the total number W of the weights in the model (1). Both theoretical and empirical studies (see, e.g., Bartlett 1998;De Veaux et al. 1998;Lawrence et al. 1996Lawrence et al. , 1997) support neural network modeling in which the number of sample data points N is (even much) smaller than the number of the weights W in the selected model. In this case, as W > N, both the FPE and the UEV assume negative values, and the unbiased estimates of the error variance also may assume negative values, and this does not make sense. Then the problem is to define K when model selection criteria are applied in neural network modeling.

The Size of the Weights Is More Important Than the Size of the Network
Some recent results for large networks (see Bartlett 1998) prove that the generalization performance of an MLP depends more on the size of the weights than on the size of the network (namely, on the number of adaptive parameters). Here an important role is played by the quantity c 1 = p k=0 |c p |, that is, by the sum of the values of the absolute weights between the hidden layer and the output.
Let us introduce the concept of misclassification error with margin γ . Let X 1 and X 2 be two populations in R m and set The misclassification error probability is given by where sgn(u) = 1 for u > 0 and sgn(u) = −1 for u < 0 [if u = 0, then we assume that sgn(u) = 0]. Given (x, y), the function f correctly classifies the point x if and only if y · f (x) > 0; more generally, the function f correctly classifies the point where y n = 1 if x n comes from X 1 and y n = −1 if x n comes from X 2 , with n = 1, . . . , N, we introduce the misclassification error with margin γ , where #{·} denotes the number of elements in the set {·}, which is the proportion of the number of cases that are not correctly classified with margin γ by f . If F p is the set of functions like (1) and for given constant C ≥ 1 we consider only those c for which then we have the following results.
Theorem 1 (Bartlett 1998). Let P be a probability distribution on X × {−1, +1}, 0 < γ ≤ 1 and 0 < η ≤ 1/2. Let F p be the set of functions f (x) like (1) such that k |c k | ≤ C with C ≥ 1. If the training set L is a sample of size N and has {−1, +1}-valued targets (i.e., the true values of the response), then with probability at least 1 − η, for each f ∈ F p , where for some α, a universal constant, The quantity (γ, N, η) is called the confidence interval. Equation (12) shows that the error is bounded by the sum of the empirical error with margin γ and by a quantity depending on c 1 through C but not on the number of weights. The proof of this theorem is based on a quantity called the fat-shattering dimension, which was introduced by Kearns and Schapire (1990) (see also Fine 1999).
An important consequence is that the quantity c 1 characterizes the property of generalization of a neural network. If f p 1 (x) and f p 2 (x) are two different models, with p 1 < p 2 , then they will have the same confidence interval (γ, N, η) in (12) if c (1) 1 = c (2) 1 , where c (1) and c (2) denote the weights between the hidden layer and the output layer of f p 1 and f p 2 . Thus a network f 1 with a norm c 1 and a large number of nodes (with small absolute values of the output layer weights) can be well approximated by a network f 2 with the same norm c 1 but fewer nodes when the difference between E γ ( f 1 , L) and E γ ( f 2 , L) is negligible. This approximation has negligible consequences on classification by a positive margin γ . From a practical point of view, the foregoing results justify the implementation of ad hoc strategies in the training algorithm, called regularization methods, which try to shrink the size of the weights. Indeed, large weights cause the sigmoids to saturate, and this leads to quite irregular error surfaces. An important regularization technique is the weight decay, which adds a penalty term on the error function (5), so that the quantity is minimized during the learning process, where the sum on the right side is over all weights of the network (see, e.g., Bishop 1995). Even if Bartlett's results reported in this section suggest a penalty that depends on the absolute values of the weights (a kind of penalty that is at the heart of Tibshirani's LASSO; see Tibshirani 1996), the weight decay regularization technique seems the most used practically for both computational reasons (it is implemented in many specialized packages) and theoretical justifications (from a Bayesian perspective, the minimization of a cost function with a penalty that depends on the squares of the weights correspond to the minimization of a likelihood with a multinormal prior distribution of the weights, whereas a cost function depending on the absolute value of the weight corresponds to the maximization of a likelihood with a Laplacian prior distribution of the weights). The parameter λ is the decay constant (or smoothing or shrinkage parameter), and suitable values are usually obtained by trial and error, by cross-validation, or by using GCV. Another simple type of regularization is early stopping, discussed previously. If the starting values are small in magnitude and the weights increase as the learning process is carried out, then stopping the training before convergence constrains the weights to remain small.

Discussion
One of the main consequences of Theorem 1 is that we should not apply to the neural framework the paradigm peculiar to linear models according to which the number of parameters should be less (or even substantially less) than the number of sample values. This consequence implies a deeper look at the role of the weights in the network. We remark that the confidence interval (13) is based only on quantities involving the weights between the hidden layer and the output layer. (In contrast, the weights between the input layer and the hidden layer do not appear.) This suggests that the input-to-hidden set of weights and the hidden-to output weights play different roles. We explore this issue further in Section 3. Besides, because the constant c 1 is easily computable, it is interesting to analyze its relationship to the learning error E( f , L) and the test error E( f , T ). We explore this issue further in Section 5.

GEOMETRIC PROPERTIES OF THE SIGMOIDAL FUNCTIONS
In this section we investigate some geometrical properties of the MLP. First, we recall some ideas given by Ingrassia (1999) for discrimination problems; then we prove analogous results for regression. For simplicity, throughout this section and in the next one, without loss of generality, we assume that b 1 = · · · = b p = 0 and c 0 = 0, so that the function f (x) is given by In addition, we assume that the sigmoidal function τ (·) is analytic; that is, it can be represented by a power series on some interval (−r, r), where r may be +∞. The hyperbolic tangent τ (z) = tanh(z) or the logistic function τ (z) = (1 + e −z ) −1 are examples of analytic sigmoidal functions. We point out that the function f is a combination of certain transformations of the input data: (a) a nonlinear projection from R m to R p given by the sigmoidal function τ , that is, x → τ (a 1 x), . . . , τ (a p x), and (b) a linear transformation from R p to R according to c 1 , . . . , c p . The results in this section are based on the following theorem (see, e.g., Rudin 1966).
Theorem 2. Let g be analytic and not identically 0 in the interval (−r, r), with r > 0. Then the set of the 0's of g in (−r, r) is at most countable.
evidently, these points are linearly dependent as p > m. Let A = (a ij ) be a p × m matrix with values in some hypercube [−u, u] mp for some u > 0, where we use the notation A ∈ [−u, u] mp to mean that all of the entries of A are in [−u, u]; thus the points Ax 1 , . . . , Ax p are linearly dependent, because they are obtained by a linear transformation acting on x 1 , . . . , x p . However, for u = 1/m the points are linearly independent for almost all matrices A ∈ [−u, u] mp , according to the following theorem.
Theorem 3 (Ingrassia 1999). Let This result proves that, given N > m points x 1 , . . . , x N ∈ R m , the transformed points τ (Ax 1 ), . . . , τ (Ax N ) generate an overspace of dimension p > m if the matrix A satisfies suitable conditions. In particular, the largest overspace is attained when p = N, that is, when the hidden layer has as many units as the number of points in the learning set. Moreover, it shows why neural networks have been shown to work well in presence of multicollinearity. On this topic De Veaux and Ungar (1994) presented a case study in which the temperature of a flow is measured by six different devices at various places in a production process. Even though the inputs are highly correlated, a better prediction of the response is gained using a weighted combination of all six predictors rather than choosing the single best measurement having the highest correlation with the response.
Theorem 3 introduces the concept of F p -linearization. We say that a learning set L is F p -linearizable or linearized by F p if there exists A ∈ [−1/m, 1/m] mp such that the empirical error E( f , L) is 0, in other words, if there exists a hyperplane in R p that correctly separates the points of the learning set (discrimination) or that perfectly interpolates them (regression). It is obvious that if L is F p -linearizable, then it is also F p+1 -linearizable; the counterpart is not true in general, as a simple analysis of the system (16) shows. Thus the family of learning sets that can be linearized by F p is a strict subset of the learning sets, which can be linearized by F p+1 . The smallest value p c such that L is F p -linearizable is here called the critical dimension of linearization. The next result, which generalizes theorem 8 of Ingrassia (1999), gives an upper bound on p c . Proof. Theorem 3 implies that the points τ (Ax 1 ), . . . , τ (Ax N ) are linearly independent for almost all matrices A ∈ [−1/m, 1/m] for p ≥ N. In particular, if p = N, then these points generate R N , and thus the system has a unique solution.
The upper bound on p c given earlier looks too large, but it refers to the worst-case situation. In neural modeling, given a learning set L of N sample data, the correct question seems to not be "what is the largest network we can train by L (if any)," but rather "what is the suitable size-namely, the dimension p of the space R p -necessary for fitting the inputoutput unknown dependence φ = E [Y|X]." This dimension p essentially depends more on the geometry of the data, and this explains why neural models may be successfully applied as virtual sensors when the predictors exhibit a high degree of multicollinearity. As a matter of fact, the hidden units break the multicollinearity and exploit the contribution of each single predictor. This is the reason why in modeling virtual sensors, the optimal size p of the hidden layer often may be greater than the number m of predictors.
Finally, we remark that Theorem 1 and Theorems 3 and 4 provide insight into the penalization term λ w 2 i in (14). Indeed, we can split the quantity i w 2 i into two components linking the weights of the input-to-hidden layer and of the hidden-to-output layer, Small values of the a ij 's constrain the quantities a n x n (n = 1, . . . , N) in the nonlinear range of the sigmoidal function τ (·), whereas small values of the c i 's help prevent overfitting. This also provides insight into two different choices of the decay constant λ for each term in (17) proposed by some authors on the basis of empirical studies (see, e.g., Bishop 1995).

EQUIVALENT NUMBER OF DEGREES OF FREEDOM IN MLP
Many authors have remarked that the number of degrees of freedom is often much smaller than the number of parameters involved in the model. For example, in presence of regularization, what Moody (1992) called the effective number of parameters and MacKay (1992) called the number of good parameters measurements does not equal the number of weights in the model. It is less than W and depends on the size of the regularization term. Recent studies regarding richly parameterized models have proposed various definitions of degrees of freedom that depend not on the number of parameters in the models but rather on, for example, the sum of the sensitivity of each fitted value to perturbation in the corresponding observed value (Ye 1998) or on the properties of the space in which the fitted values lie (Hodges and Sargent 2001). In the present study we focus on neural networks and models of the form (1) and propose some easy corrections to the model selection criteria that are automatically computed by most common data-mining software.
We recall that in the usual linear model, where y is N × 1, X is N × m, β is m × 1, and ε is N × 1, with cov(ε) = σ 2 I N , I N being the N-dimensional identity matrix. The number of degrees of freedom of the model is the dimension of the space in which the fitted valuesŷ = X(X X) −1 X y lie and is given by Here we assume that m ≤ N and that X is of full rank; however, even if X is not of full rank, then the definition of K is fine provided that one uses the generalized inverses. When the constant term is included in the model, β is (m + 1) × 1, and the matrix X has m + 1 columns, the first being an N-dimensional unit vector; in this case K = m + 1. As noted by Hodges and Sargent (2001), defining degrees of freedom in this way avoids quandaries created by counting parameters. We can also consider the principal component (PC) regression, where now for simplicity y and X are measured about their mean. Let A be the m × m matrix whose kth column is the kth eigenvector ofX X , wherē X is the centered version of X, so that the values of the PCs for each observation are given by Z = XA. Because A is orthogonal, Xβ can be rewritten as XAA β = Zγ , where Z = XA and γ = A β, so that (18) can therefore be written as which simply replaces the predictor variables by their PCs in the regression model. PC regression can be defined as the use of the model (20) or the reduced model, where γ p is a vector of p elements that are a subset of γ , Z p is an N × p matrix whose columns are the corresponding subset of columns of Z, and ε p is the appropriate error term (see, Jolliffe 1986, sec. 8.1). The degrees of freedom in model (20) that results from defining the projection of the centered matrixX in the space spanned by the m PCs and then K = p. Let us switch to nonlinear models of the form (15). For a given p × m matrix A, let T be the N × p matrix with rows τ (Ax 1 ) , . . . , τ (Ax N ) , with p ≤ N. According to Theorems 3 and 4, the matrix T has rank p (and then it is nonsingular) for almost all matrices A ∈ [−1/m, 1/m] mp . The empirical error E( f , L) can be written as is a projection matrix becauseŷ = Hy and H is symmetric, and positive semidefinite, idempotent, and it results in so thatŷ lies in the space R p and thus to the model f (x) = p k=1 c k τ (a k , x) should be given p equivalent number of degrees of freedom (edf ), according to Hastie and Tibshirani (1990, sec. 2.8).
We remark that the stopped training minimization procedure imposes some constraints on the estimates of the weights, but they are not explicit, so that we can set K = tr(H) = p; in contrast, the other common procedure is based on the weight decay strategy, and it imposes explicit constraints on the minimization of the error function. In this case the error function (14) is which shows that p is decreased by the quantity λ tr{(T T + λI p ) −1 }. Because T T is positive semidefinite, the p eigenvalues of T T, say l 1 , . . . , l p , are nonnegative. Thus (T T + λI p ) has eigenvalues (l 1 + λ), . . . , (l p + λ), and then the eigenvalues of (T T + λI p ) −1 are (l 1 + λ) −1 , . . . , (l p + λ) −1 . Hence, we have and finally, from (24) and (25), we get The matrix H λ is no longer a projection matrix, and thus the edf should be given by tr(H λ ) rather than p. However, in practice λ always assumes small values, so that p is slightly larger than tr(H λ ), and for practioners the simple quantity p can be used as a measure of edf in model selection criteria even when the decay strategy is implemented. In contrast, more accurate values could be required in the error variance estimation (10).
We present some numerical studies in Section 5.

Relationship With Other Approaches
Recently, for richly parameterized models, Hodges and Sargent (2001) have introduced a new measure of degrees of freedom, say ρ, based on a property of the vector space into which the outcomes y i are projected to give the fitted valuesŷ i . For the cases where both our definition of equivalent degrees of freedom and ρ are defined, they are equal if the model parameters are estimated by the minimization of the sum of squares error function. For example, this is true for RBFNs, where the second-layer parameters are estimated by least squares and the basis functions parameters are chosen by forward selection of the input vector or k-means cluster analysis. In (27) the nonlinear transfer function φ(·) is a Gaussian radial basis. Without loss of generality, if we suppress the constant term in the linear combination, then (27) can be expressed in matrix form as c =ŷ (28) where is an N × p matrix with generic element φ nk = φ(x n , a k ), n = 1, . . . , N, k = 1, . . . , p, and it satisfies the assumptions of the Hodges and Sargent approach. The parameters c are given by c = ( ) −1 y. The matrix is a real projection matrix, becauseŷ = Hy and H is symmetric, positive semidefinite, and idempotent and has rank p (or p + 1 if we include the constant term in the model). Because tr(H) = p (or p + 1 if we include the constant term), our notion of degrees of freedom is equal to ρ. If the Y variable is normally distributed with known variance σ 2 , our notion of edf is equal to the generalized degrees of freedom (GDF) introduced by Ye (1998). Indeed, Hodges and Sargent (2001) proved that the two quantities ρ and GDF are equal when both measures are defined. When the variance σ 2 is not known, as usually happens in practice, Ye's GDF must be computed by simulations, whereas our definition of edf is still explicit. With a weight-decay cost function (and a penalty term λ chosen by trial and error), we have where the l k 's are the eigenvalues of the matrix , which is a result formally equivalent to (26). Then ρ is smaller than p, and its value depends on λ. If the parameter λ is estimated by cross-validation, then the model cannot be reexpressed in a linear form, and Hodges and Sargent's degrees of freedom are not defined, whereas our measure is a good explicit approximation.
In PPR, the nonlinear function is a smoother, and the vectors a k are constrained to be unit vectors. The model f (x) = p k=1 c k τ k (a k , x) + c 0 results in a linear combination of ridge functions. It is built in a forward-stagewise manner, adding a pair (τ k , a k ) at each stage and then estimating the coefficient c k . Given the (N × p) basis function matrix T with generic (n, k) element τ nk = τ k (x n , a k ), the matrix P = T(T T) −1 T is a real projection matrix, and thus our definition of degrees of freedom is equal to tr(P).
As far as smoothers are concerned, for models of the form Sy =ŷ, where S is an n × n smoothing matrix (with rank n and with S S < S), our definition does not apply, because it is defined for projection models. For smoothers that can be expressed as Py =ŷ, where P is a linear operator, with rank p (or p + 1 if the constant term is included in the model) and with P P = P, our definition of degrees of freedom is equal to tr(P). So it is equal to Hastie and Tibshirani's first definition (they give three definitions of degrees of freedom, depending on the purpose for which they are used) and to Hodges and Sargent's definition.

NUMERICAL RESULTS
In this section we present some numerical results to investigate the behavior of Bartlett's constant c 1 and its relation to the learning error E( f p , L), the test error E( f p , T ), and number of edf K = p + 1 in model selection criteria. Moreover, we analyze the unbiased estimate of the variance (10) considering both K = p + 1 and K = tr(H λ ) as the edf. For this purpose, we consider three cases involving both real and simulated data.
The first dataset is the polymer dataset modeled by De Veaux et al. (1998) by means of an MLP with 18 neurons in the hidden layer. This dataset contains 61 units with 10 predictor variables concerning measurements of controlled variables in a polymer process plant and 4 responses concerning the outputs of the plant. We choose the 11th variable (i.e., the first of the four response variables) as the response variable y. These data, which lie in the interval [.1, .9], are particularly good for testing the robustness of nonlinear methods for irregularly spaced data. De Veaux, Psichogios, and Ungar (1993) showed that in this case neural networks are superior to multivariate adaptive splines regression and both are superior to linear regression. The data exhibit quite a large degree of multicollinearity, as can be seen by an analysis of the variance inflation factor VIF j = (1 − R 2 j ) −1 , where R 2 j is the coefficient of determination when the jth variable X j is regressed on the remaining p − 1 variables. If X j is nearly orthogonal to the remaining predictor variables, then R 2 j is small and then VIF j is close to unity, whereas if X j is nearly linearly dependent on some subset of the remaining variables, then R 2 j is near unity and VIF j is large. Table 1 lists the VIF values of the 10 predictors. In general, variance inflation factors larger than 10 imply serious problems with multicollinearity; here this is true for variables X 3 , X 4 , X 5 , X 6 , and X 7 . Following De Veaux et al. (1998), we fix 50 units for the learning set and 11 units for the test set. We then consider 100 different samples with different units for the training set and the test set (these sets all have 50 and 11 units). We consider neural networks with increasing numbers of hidden units from p = 2 to p = 25. For each p, we train the network 1,000 times, varying the samples and the initial weights; we adopt either the weight decay or the early-stopping regularization technique. The distribution of the learning and test errors and the distribution of c 1 versus the number p of neurons in the hidden layer are plotted in Figure 1 using boxplots, for both values obtained with stopped training and values obtained with weight decay. Figures 1(a)-1(d) show that for the errors to be rather stable across simulations, the network must be large enough; in particular, Figures 1(a) and 1(c) refer to training and test error with weight decay, and Figures 1(b) and 1(d) refer to training and test errors with stopped training. For small values of p, results are very unstable, and the algorithms are influenced mostly by local minima. The fit, as measured by the median value of the learning errors, increases as the complexity rises, and stabilizes (i.e., the spread of the boxplots becomes small) after a certain threshold (given by about 10 or 11 hidden units). The different behavior of boxplots in Figures 1(c) and 1(d) is due to the fact that test observations are used for controlling overfitting in the stopped training criterion, whereas they are independent of training with the weight decay and can be used for measuring the generalization performance of the network only in this second case. Figures 1(e) and 1(f ) show that the median values of c 1 are in practice linearly related to our measure p; however, the quantity c 1 is too unstable across simulations to be a good complexity measure. We remark that such median values of the training errors and of c 1 versus p have a similar trend for both weight decay and stopped training. Table 2 gives for each p the mean values of the training error E( f p ; L), the test error E( f p ; T ), the L 1 norm c 1 , and the model selection criteria AIC, BIC/SBC, GCV, and FPE, computed with K = p + 1 because here we implemented the stopped training. We remark that analogous results have been obtained using the weight decay strategy, where in model selection criteria K = tr(H λ ) can be approximated by K = p + 1 because here we are interested in the rank rather than in the size of the indices AIC, BIC, and so on.
The numerical results show that there is no unique p that minimizes the different model selection criteria, even if the model selected by the BIC agrees with the smallest value of the test error while the other ones suggest larger models; however, we note that many numerical experiments have shown that the BIC often works well for neural networks, whereas AIC and FPE tend to overfit with neural networks (see Sarle 1999;Kadane and Lazar 2004). For the sake of completeness, Table 3 gives the values for model selection criteria when the number of degrees of freedom is selected equal to W. In this case some indices are useless because they assume negative values, and some others have a high variability and anomalous peaks. FPE is useless because it assumes negative values for more than four hidden units; GCV suggests large models and shows an anomalous peak for the model with four hidden units (which holds a number W near N); this peak and the high variability are not supported by values summarized in the boxplots of Figure 1. Furthermore, the BIC and AIC give strictly increasing values as p grows, so we should select the model with p = 2. The second dataset concerns a vibration severity chart (see Cavarra, Crupi, Guglielmino, and Ingrassia 2002;Ingrassia and Morlini 2002) for centrifugal pumps in an ethylene system. There are 51 recordings of the overall level of vibration in the horizontal, vertical, and axial directions. Besides the overall vibration, the ratios v t+1 −v t t+1−t , where v t+1 is the vibration level at time t + 1 and v t is the vibration level at time t for each di- rection, are also considered input variables. Thus input patterns are vectors in R 6 . The response variable is a measure of the status of the centrifugal pump after a period of time, which takes value 0 for the 32 pumps working perfectly, value .5 for the 12 pumps with something wrong, and value 1 for the 11 pumps that are about to be broken down. In other words, the value 0 is associated with the target class low risk, the value .5 with the target class medium risk, and the value 1 with the class high risk. A total of 1,000 different partitions in a training set with 39 recordings and a test set with 12 recordings are randomly generated so that each test set contains 6 observations belonging to class 1 ( y = 0), 3 observations belonging to class 2 ( y = .5), and 3 observations belonging to class 3 ( y = 1). As far as the misclassification error rate is concerned, ifŷ ≤ .3, then the observation is assigned to class 1; if .3 <ŷ ≤ .7, then the observation is assigned to class 2; and ifŷ > .7, then the observation is assigned to class 3. Figure 2 reports boxplots of the training and the test misclassification errors and the values of c 1 obtained with stopped training and weight decay according to the same outline of Figure 1. As in Figure 1, values of c 1 are very unstable across simulations. In this example, values of c 1 for each number p of hidden units obtained with stopped training have a higher variability around their median than do values obtained with weight decay. However, median values obtained with stopped training and weight decay for each number p are similar. Table 4 gives for each p the mean values of model selection criteria obtained with K = p + 1 and the mean value of the validation errors obtained with weight decay. According to the validation error obtained with weight decay, all model selection criteria, with the exception of the BIC, suggest models with a number of hidden units ranging from 8 to 12 (with more than 12 hidden units, values obtained for these criteria start increasing), whereas the BIC selects models with a number p ranging from 5 to 8. However, we remarked that for neural networks, the BIC is preferable over the AIC and the FPE. The presence of models with different complexity but similar generalization errors is supported by values obtained in the different simulations and plotted in boxplots of Figures 2(a), 2(b), 2(c), and 2(d).
Also in this case, for the sake of completeness, Table 5 lists the values for model selection criteria when the number of degrees of freedom is selected equal to W. As in the previous example, some indices are useless because they assume negative values, and some others have a high variability and anomalous peaks. FPE and UEV are useless, because these indices assume negative values for more than five hidden units. GCV suggests large models; however, values obtained for this criterion for a number p of hidden units ranging from 2 to 12 have an extremely high variability and an anomalous peak for the model with 5 hidden units (which holds a number W not far from N). This peak and the high variability are not supported by val- ues summarized in the boxplots of Figure 2. Furthermore, the BIC and AIC give strictly increasing values as p grows, so we should select the model with p = 2. To summarize, we note that with W degrees of freedom, model selection criteria give contrasting results, whereas with K = p + 1, results are much more satisfactory.
The third analysis concerns the UEV introduced in (10), that is,σ for the two most popular regularization techniques (early stopping and weight decay) and for different choices of p and K. First, we generated three datasets according to the following models: M 2 : x = 0 ∈ R 10 , y = ; and M 3 : x ∈ (0, 130) ⊂ R, where follows a normal distribution N(0, σ 2 ), and θ 1 = 105, θ 2 = 2.6, θ 3 = .5, and θ 4 = −.02. Data coming from the models M 1 and M 2 have been randomly generated with σ 2 = .5; two datasets coming from the model M 3 have been randomly generated choosing σ 2 = 1 and σ 2 = 4. Data from M 1 and M 2 have been fitted using an MLP with p = 5 and p = 10 neurons in the input layer, whereas in the other case we considered an MLP with one neuron in the input layer. All simulations were performed in MATLAB. For each model, we generated 100 samples constituting 50 units, N = 39 units in the training set and 11 units in the test set for all cases; the input values in model M 3 have been sampled uniformly in the interval (0, 130). The values ofσ 2 obtained will be compared with the sample variance s 2 = i e i / (N − 1), where e 1 , . . . , e N are the errors (i.e., the realizations of the sample 1 , . . . , N ). Data coming from M 1 and M 2 exhibit a sample variance equal to s 2 = .2381; as far as the two datasets coming from the model M 3 are concerned, the sample variances were equal to s 2 = .7551 and s 2 = 2.4490.
First, we considered the early-stopping technique. Table 6 summarizes the obtained results; such results are the average values over 100 samples with different partitions of observations in the training and in the test sets. In both models M 1 and M 2 , the response does not depend on inputs, and thus the error variance estimate for these models must be independent on the number of inputs. This is verified when K = p + 1 is selected so that the values ofσ 2 are slightly greater than s 2 . In contrast, in many cases the error variance estimates with K = W degrees of freedom cannot be computed for MLPs, because the values obtained are negative; moreover, positive estimates are considerably larger than s 2 .
We performed another set of simulations implementing the weight decay regularization technique to compare K = p + 1 and K = tr(H λ ) given in (26) for different values of the smoothing parameter λ; here we denote byσ 2 p+1 andσ 2 λ the UEV (29) based on K = p + 1 and K = tr(H λ ). In this case the eigenvalues l i (i = 1, . . . , p) of T depend on x, so that we considered their means over the learning set. Tables 7 and 8 list the values that we obtained. The estimates agree quite well with s 2 (even if it is obvious thatσ 2 p+1 is a little larger thanσ 2 λ ). As far as the model M 3 is concerned, the results that we obtained are listed in Tables 9 and 10 for s 2 = .7551 and s 2 = 2.4490. Also in this case the obtained estimates agree quite well with s 2 .
Finally, we remark that in many statistical software packages for neural networks, the smoothing parameter λ can be selected automatically according to cross-validation procedures, but this value is not available, so that it is impossible to compute the unbiased estimate of the variance usingσ 2 λ . Table 11 lists the estimatesσ 2 for data coming from the model M 3 for both s 2 = .7551 and s 2 = 2.4490 adopting both early stopping and weight decay (with automatic λ selection).      We note that in all simulations, the unbiased estimates of the variance using both K = p + 1 and K = tr(H λ ) agree with the true values. Bartlett's (1998) theorem 1 gives the theoretical basis for using neural models with a total number of weights larger than the number of sample data points used to estimate the weights. Based on this result, we have investigated the role of the weights of a neural network with a mapping function of the form f (x) = p k=1 c k τ (a k x + b k ) + c 0 . We have shown that the two levels of weights play quite different roles; the input-tohidden weights concern just a (nonlinear) projection from R m to R p , whereas the hidden-to-output weights fit the projected data and perform the regression or the classification, according to the problem at hand. Both the projection and the fit are optimized according to the target values. From this point of view, the complexity of the network depends more on the number of hidden units than on the whole set of weights, and we have shown that the greatest complexity is reached by a network with a number of hidden neurons equal to the number of sample data points.

CONCLUDING REMARKS
According to results for richly parameterized models, the concept of equivalent number of degrees of freedom K has been introduced for one-hidden-layer networks like MLP and RBF as the trace of the projection matrix, and it proves to be K ≤ p + 1, where p + 1 is the number of the hidden to output weights. The value of K also depends on the adopted regularization technique. If early stopping is implemented, then we set K = tr(H) = p + 1, whereas when weight decay is implemented, it is K = tr(H λ ) < p + 1, and it also depends on the value of the smoothing parameter λ. However, in both cases, for model selection according to the BIC and AIC we can adopt K = p + 1, because we are interested mainly in the number of hidden units p that minimizes such indices. In contrast, estimation of the error variance (29) requires more attention; if early stopping is adopted, then we set K = p + 1, whereas in the other case (i.e., the weight decay), using K = p + 1 leads to slightly larger estimates than those obtained using the trace K = tr(H λ ). However, our numerical simulations showed that K = p+1 also could be adopted in general for this aim.
In contrast to Ye's GDF and according to the definition of ρ and to Hastie and Tibshirani's definitions, our notion of degrees of freedom does not depend on the underlying true function. For a certain MLP, it is equal for all three models, and fitting an MLP to a pure noise dataset does not require more degrees of freedom than fitting an MLP to a dataset with a structure between the inputs and the target.
We have also presented some numerical studies of the behavior of the constant c , given by the sum of the absolute values of the hidden to output weights, and of the training and test error of an MLP across the simulation. These studies show that the values of c have a similar increasing trend, with respect to the number p of hidden units, for both stopped training and weight decay. They also show that this quantity is particularly unstable across simulations, whereas our measure of degrees of freedom does not depend on local minima or on the convergence of the learning algorithm. The simulation studies show that besides model selection, the proposed measure also may be used to achieve a good estimate of the error variance, especially for small networks, whereas the total number of weights is useless for this task.
Finally, we remark that for deeper networks with more than one hidden layer, analogous considerations apply. The equivalent number of degrees of freedom relates to the units in the highest hidden layer (i.e., on the hidden layer immediately before of the output layer), with the other layer performing only geometric transformations of data; this is also congruent with theorem 28 of Bartlett (1998).