Sensitivity of profitability in cointegration-based pairs trading

Cointegration-based pairs trading crucially depends on two key parameters: the length of the formation period and the divergence signal (or opening trigger), which are generally determined arbitrarily or statistically in the literature. In this article, we perform a sensitivity analysis of pairs trading profitability to its parametrization, employing the daily closing prices of the S&P 500 constituent stocks. We find that not only the measures of performance (i.e. average excess returns, Sharpe ratios and percentage of positive excess returns), but also the strategy characteristics and trade features (i.e. average trade duration and number of trades) are highly sensitive to the choice of the parameters.


1. Introduction
Over the past 20 years, pairs trading has attracted the attention of numerous researchers investigating pricing anomalies, and has been employed to verify the efficiency of financial markets in the relative pricing of highly correlated risks. Pairs trading exploits a portfolio of pairs composed of commoving securities, going long on the underpriced asset and short on the overpriced asset whenever their prices diverge from their historical "equilibrium". If a trader is able, systematically, to profit from closing the positions on the convergence of the relative mispricing measure (or Spread) to its mean, the market is considered to be inefficient.
The literature on pairs trading is heterogeneous in terms of both the approaches used and their empirical applications. This heterogeneity mainly stems from the way that the Spread time series is modelled, which strongly affects the profitability of this statistical arbitrage strategy. Among the quantitative methods proposed in the literature, the most common are the distance, cointegration and time series (or stochastic spread) methodologies (Krauss, 2017), which we briefly review in the following section.

Relevant literature
The distance approach, particularly Gatev et al. (2006), is widely used by practitioners and identifies comovement between assets based on their distance, defined as the Sum of Squared Deviations (SSD) between the normalized prices during a one-year formation period. For the succeeding six-month trading period, Gatev et al. apply a simple self-financing trading strategy on the 20 pairs with minimum SSD, opening a trade whenever the Spread diverges by more than twice its historical standard deviation and closing the position when it reverts to zero or at the end of the trading period.
The few authors who propose a different structure include Nath (2003), who implements a pairs trading strategy in a high-frequency setting and applies it to all the highly liquid securities in the secondary US government debt market between 1994 and 2000. Pairs selection is performed using the SSD between standardized prices over a 40-day period. In a trading period of the same length, trades are executed using the 15th percentile of the SSD empirical distribution as the opening trigger and reversion to the median as the closing condition, employing the 5th percentile as the stop-loss rule. Chen et al. (2017) study univariate and quasi-multivariate settings, and identify pairs using the correlations among returns, computed over a five-year formation period. Trades are triggered in the succeeding month whenever a given stock's return deviates from the return of the paired asset or of the portfolio of the 50 most correlated assets.
In general, as Do et al. (2006) point out, the ease of application of the distance approach and its model-free property are offset by its poor forecasting ability. Also, its lower profitability with respect to other methods is assessed by Huck and Afawubo (2015) in a performance evaluation of SSD, cointegration test and price ratio stationarity selection metrics, and by Rad et al. (2016).
A theoretical description of the cointegration methodology is provided by Vidyamurthy (2004) in his book Pairs Trading: Quantitative Methods and Analysis. In this branch, pairs selection is always based on cointegration testing, but empirical applications differ in terms of the relative mispricing measure, the length of the formation and trading periods and the opening and closing triggers. Rad et al. (2016) conduct a comparative analysis on the mean-reversion of the Spread computed via the distance, cointegration and copula methods, using the same timing scheme and trading signals as in Gatev et al. (2006). In Rad et al.'s framework, the Spread is computed as the difference between the observed and predicted values of the first asset, that is, \(P_{1,t} - (\hat{\alpha} + \hat{\beta} P_{2,t})\), where \(\hat{\alpha}\) and \(\hat{\beta}\) are estimated during the formation period, and the opening trigger is set equal to two times the residuals' standard deviation. In Hong and Susmel's (2003) investigation of the Asian ADR market, the Spread is simply the price distance and the holding period is set at 3, 6 and 12 months. Broumandi and Reuber (2012) evaluate the effect of exchange rates on pairs trading profitability through an analysis of South American ADRs, employing price ratios to measure relative mispricing and using an opening trigger equal to ±3. Gutierrez and Tse (2011) combine cointegration with Granger causality and show how, for the three water utility stocks in their sample, the main source of profitability is the Granger-followers. They derive results for three different opening trigger values (0.25, 0.5, and 0.75 times the standard deviation of the residuals) and use a three-year estimation period and nine years of trading.
In all the above cases, positions are closed when the Spread reverts to zero, but there are also some examples of asymmetric opening/closing triggers. For instance, Caldeira and Moura (2013) first employ the in-sample Sharpe ratio to rank pairs after cointegration testing and then select the 20 most profitable pairs to assess the performance of their beta-neutral strategy. They open a position any time the Spread goes above (or below) the threshold of +2 (or -2) times the historical standard deviation and close it when the Spread becomes less than 0.75 (or greater than -0.5) times the same quantity. Efforts have also been made to determine the boundary values that guarantee the trader a minimum profitability. As an example, Lin et al. (2006) proposed loss protection for pairs trading, based on a minimum profit per trade condition, and, subsequently, Puspaningrum et al. (2010) developed a numerical algorithm to estimate the average trade duration and the number of trades in order to identify the optimal pre-set boundaries while imposing a minimum total profit condition. However, there are no large empirical applications aimed at assessing the advantages of this optimization.
… framework that allows forecasting and decision-making. Elliott et al. define the Spread as the price difference and model it as a mean-reverting Gaussian Markov chain, observed in Gaussian noise, which is estimated through Kalman filtering. Specifically, the Spread is represented as a state space process, that is:
Hidden state equation: \[ x_{k+1} = A + B x_k + C \varepsilon_{k+1} \qquad (1) \]
Observation equation: \[ y_k = x_k + D \omega_k \]
where \(\varepsilon_k\) and \(\omega_k\) are iid \(N(0,1)\). In continuous time, the state equation can be expressed as an OU process:
\[ dx_t = \rho(\mu - x_t)\,dt + \sigma\,dW_t \]
where \(W_t\) is a standard Brownian motion, \(\mu\) is the mean and \(\rho\) is the speed of convergence to the mean. In this setting, a trade is triggered whenever the Spread condition \(\mu - c\,(\sigma/\sqrt{2\rho}) < y_k < \mu + c\,(\sigma/\sqrt{2\rho})\) is violated, and is closed at the first-hit-time \(T\), that is, whenever the Spread reverts to its mean. Following this idea, Do et al. (2006) model mispricing at the returns level using a "stochastic residual spread" method, to avoid the restrictive requirement of return parity between stocks. The authors highlight that the main advantages of the stochastic spread approach are the ability to capture the Spread's mean-reversion, its forecasting properties and its complete tractability. In contrast, Bertram (2010) describes the Spread between log-prices as a symmetric OU process and derives the optimal entry and exit thresholds by maximizing the expected return or the Sharpe ratio per unit of time, whose closed-form solutions are obtained using an analytic expression for the expected trade length and its variance.
As shown by Blázquez et al. (2018), whenever the Spread long-term mean is constant, the high computational intensity of the stochastic spread methodology is compensated for by its superior forecasting ability with respect to the previous techniques, which, however, does not guarantee higher returns. These authors analyse the selection process and the Spread series stationarity for five different methodologies: correlation, distance, stochastic spread, stochastic differential residual and cointegration. Through an application on the S&P 500 bank subgroup over the period 2008-2013, they conclude that cointegration selection is more accurate and more complete with respect to distance, and highlight that the assumption that the OU process parameters are constant is rarely verified in practice.

Contribution to the literature
In general, the pairs trading literature is usually aimed at demonstrating that this strategy produces significantly positive returns. The works in this stream of literature use many alternative methodologies and datasets that differ in terms of the type of securities traded and the financial markets considered. However, what emerges clearly is the high level of heterogeneity and arbitrariness related to the choice of formation and trading period length and to the definition of trade opening and closing signals. To our knowledge, the only attempt to investigate the sensitivity of the returns from pairs trading to different parametrizations is Huck (2013), whose analysis adopts the distance approach. Huck considers a six-month trading period as reasonable both to ensure that the information used to select the pairs is recent and to allow complete round-trip trades. He analyses the profitability of Gatev et al.'s (2006) methodology for a formation period of 6, 12, 18 and 24 months and an opening trigger equal to 2, 3 and 4 times the historical standard deviation of the Spread. He finds that the excess returns are highly sensitive to the length of the formation period, but are affected only marginally by the opening threshold level. In subsequent research, Huck investigates the superiority of the classical cointegration approach with respect to Gatev et al.'s (2006) SSD distance and stationarity of the price ratio selection criteria (Huck & Afawubo, 2015) and, also, the effect of market sentiment, measured by the VIX index, on pairs trading performance (Huck, 2015). In these contributions, all the results are derived for 1- and 2-year formation periods and opening triggers equal to 2 and 3, which gives the reader an idea of the influence of parametrization on the profitability of the strategy.
In this article, we evaluate the sensitivity of the returns from cointegration-based pairs trading to the parameters analysed by Huck (2013), using a more extensive, but tighter grid for the opening trigger. The aim is not to find the optimal parametrization in terms of profitability, but to evaluate the way that the returns vary according to different combinations of formation period length and opening trigger value. We conduct an empirical analysis of the S&P 500 index constituents, applying the classical cointegration approach with pre-selection of pairs based on SSD and Correlation between log-prices, as described in Essay 1. This analysis of the strategy performance contributes to the literature by providing new data points, based on a considerably large dataset, which is novel and is highly demanding computationally. Our findings should encourage further research on optimal parametrization of pairs trading.
The remainder of the paper is organized as follows: Section 2 reprises the methodology, described in detail in Essay 1; Section 3 presents the main results of the empirical application, paying particular attention to profitability and the trading features; Section 4 presents the main conclusions.

Methodology
The methodology used is the classical cointegration-based pairs trading approach, including a preliminary pre-selection step, introduced to enhance the computational efficiency of the strategy as explained in Essay 1.
The method includes three main steps:
1. Pairs selection: during the formation period, pairs of stocks are ordered according to two alternative metrics and tested for cointegration based on this ranking, until we obtain 20 cointegrated pairs (subsection 2.1);
2. Pairs trading: over a six-month trading period, we implement a trading strategy based on the parameter estimates and the stationarity of the cointegration relationship derived in step 1. Profitability is evaluated based on the monthly excess returns and the Sharpe ratio and takes account of transaction costs (subsection 2.2);
3. Steps 1 and 2 are repeated in a rolling window setting that shifts the sample one month ahead and generates six overlapping portfolios each month.
Figure 1 provides a graphical representation.
Profitability is evaluated considering monthly average excess returns, Sharpe ratios and fractions of positive excess returns, and we investigate the following trade features: number of cointegration tests, transaction life, number of trades and consequent pairs classification.
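As a rough illustration of the rolling-window scheme in step 3, the following Python sketch generates formation and trading windows as day indices. It assumes 22 trading days per month (the convention stated in the Data section); the function name `rolling_windows` and the index layout are hypothetical and are not taken from the original implementation.

```python
TRADING_DAYS_PER_MONTH = 22  # assumption: the convention stated in the Data section


def rolling_windows(n_days, formation_months, trading_months=6):
    """Yield (formation_start, formation_end, trading_end) day indices.

    Formation covers [formation_start, formation_end) and trading covers
    [formation_end, trading_end). Each window is shifted one month ahead.
    """
    step = TRADING_DAYS_PER_MONTH
    formation_len = formation_months * step
    trading_len = trading_months * step
    start = 0
    while start + formation_len + trading_len <= n_days:
        formation_end = start + formation_len
        yield start, formation_end, formation_end + trading_len
        start += step


# Hypothetical sample: 40 months of data, 12-month formation period.
windows = list(rolling_windows(n_days=40 * 22, formation_months=12))

# Count the portfolios whose trading period covers a given day.
day = 12 * 22 + 5 * 22
active = sum(1 for _, f_end, t_end in windows if f_end <= day < t_end)
```

Once five months of trading have elapsed, every day falls inside six staggered trading periods, which is why the monthly returns are later averaged across six overlapping portfolios.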

Pairs selection
Identification of the 20 pairs eligible for trading is structured in two phases. First, all possible pairs are ordered according to a specific metric. We employ the two alternative methods proposed in Essay 1, which identify the best trade-off between profitability and variability. These are the Sum of Squared Deviations between the normalized log-prices (SSD) and the Pearson Correlation (CORR) between log-prices. Both metrics are computed monthly on data from the previous formation period, with a duration of alternatively 6, 12, 18 and 24 months, and stock pairs are sorted by minimum SSD and maximum CORR.
Specifically, SSD is a distance measure computed as:
\[ SSD = \sum_{t=1}^{T} \left( \tilde{p}_{1,t} - \tilde{p}_{2,t} \right)^2 \]
where \(\tilde{p}_{1,t}\) and \(\tilde{p}_{2,t}\) are the respective normalized log-prices of stocks 1 and 2 on day \(t\) and \(T\) is the number of trading days in the formation period. Normalization is performed to rescale the two stock log-price time series to start at $1, so that the normalized log-prices are defined as \(\tilde{p}_{1,t} = \ln(P_{1,t}) / \ln(P_{1,t=1})\) and \(\tilde{p}_{2,t} = \ln(P_{2,t}) / \ln(P_{2,t=1})\). This selection method was proposed first by Gatev et al. (2006) and employed to order the pairs before cointegration testing by Rad et al. (2016).
CORR is the absolute value of the Pearson correlation coefficient between the log-price time series, that is:
\[ CORR = \left| \frac{\sum_{t=1}^{T} (p_{1,t} - \bar{p}_1)(p_{2,t} - \bar{p}_2)}{\sqrt{\sum_{t=1}^{T} (p_{1,t} - \bar{p}_1)^2 \, \sum_{t=1}^{T} (p_{2,t} - \bar{p}_2)^2}} \right| \]
where \(p_{1,t}\) and \(p_{2,t}\) are the respective log-prices of stocks 1 and 2 on day \(t\), \(\bar{p}_1\) and \(\bar{p}_2\) are their corresponding sample means over the formation period, and \(T\) is the number of trading days in the formation period. This approach was employed by Miao (2014), who pre-selected only pairs with at least 0.9 correlation for cointegration testing.
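The two pre-selection metrics can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the toy price series are hypothetical, and the code assumes strictly positive prices whose first value differs from 1 (otherwise the first log-price is zero and the normalization is undefined).

```python
import math
from itertools import combinations


def log_prices(prices):
    return [math.log(p) for p in prices]


def ssd(prices_1, prices_2):
    """Sum of squared deviations between normalized log-prices."""
    p1, p2 = log_prices(prices_1), log_prices(prices_2)
    n1 = [x / p1[0] for x in p1]  # each normalized series starts at 1
    n2 = [x / p2[0] for x in p2]
    return sum((a - b) ** 2 for a, b in zip(n1, n2))


def corr(prices_1, prices_2):
    """Absolute Pearson correlation between log-prices."""
    p1, p2 = log_prices(prices_1), log_prices(prices_2)
    m1, m2 = sum(p1) / len(p1), sum(p2) / len(p2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(p1, p2))
    var1 = sum((a - m1) ** 2 for a in p1)
    var2 = sum((b - m2) ** 2 for b in p2)
    return abs(cov / math.sqrt(var1 * var2))


# Toy formation-period prices (made up for illustration only).
prices = {
    "A": [10.0, 10.5, 11.0, 10.8],
    "B": [20.0, 21.1, 21.9, 21.7],
    "C": [15.0, 14.0, 15.5, 13.0],
}
pairs = list(combinations(prices, 2))
by_ssd = sorted(pairs, key=lambda pr: ssd(prices[pr[0]], prices[pr[1]]))
by_corr = sorted(pairs, key=lambda pr: corr(prices[pr[0]], prices[pr[1]]), reverse=True)
```

Pairs are then passed to the cointegration test in the order given by `by_ssd` (minimum SSD first) or `by_corr` (maximum CORR first).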
Second, pairs eligible for trading are identified through cointegration-tests, run following the order determined using the above-mentioned metrics, until 20 pairs of stocks with cointegrated prices are selected. The choice of the top 20 pairs is common in the literature and helps to reduce the computational time required by the methodology. Notice that, as we showed in Essay 1, considering a larger sample would slightly reduce the profitability of the strategy for both the employed pre-selection metrics, but the choice depends mainly on the maximum loss that can be borne by the arbitrageur.
The cointegration approach (Vidyamurthy, 2004) is aimed at identifying a long-term relationship between assets, so that any significant deviation from this equilibrium is interpreted as relative mispricing. In general, two non-stationary \(I(1)\) time series are cointegrated if there exists a linear combination of them that is stationary, that is, \(I(0)\).
The long-run relationship between the log-price time series is defined by a linear regression model, that is:
\[ p_{1,t} = \alpha + \beta p_{2,t} + \varepsilon_t \]
If a parameter \(\beta\) exists such that the regression residuals are stationary, the two series are said to be cointegrated. According to the Engle and Granger (1987) approach, first the regression parameters \(\hat{\alpha}\) and \(\hat{\beta}\) are estimated through ordinary least squares and then the residuals are computed as \(\hat{\varepsilon}_t = p_{1,t} - \hat{\beta} p_{2,t} - \hat{\alpha}\). Following this, the stationarity of \(\hat{\varepsilon}_t\) is verified using the Augmented Dickey-Fuller (ADF) test (Dickey & Fuller, 1979) and any deviation of \(\hat{\varepsilon}_t\) from the cointegration relation represents a departure from the long-run equilibrium.
In our procedure, the cointegration test is run regressing the first stock of the pair on the second (\(p_{1,t} = \alpha_1 + \beta_1 p_{2,t} + \varepsilon_{1,t}\)) and vice versa (\(p_{2,t} = \alpha_2 + \beta_2 p_{1,t} + \varepsilon_{2,t}\)). The pair is selected if the stocks are cointegrated in both cases and the strategy is implemented based only on the first regression to avoid any double-counting issue.
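The two-way Engle-Granger check can be sketched as follows. This is a simplified stand-in, not the procedure used in the paper: it replaces the ADF test with a plain Dickey-Fuller regression without lag augmentation, and the critical value of -3.34 is an assumed approximate 5% asymptotic value for two series with a constant. All function names and the simulated data are hypothetical.

```python
import math
import random


def ols(y, x):
    """OLS of y on a constant and x; returns (alpha, beta, residuals)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    alpha = my - beta * mx
    resid = [b - alpha - beta * a for a, b in zip(x, y)]
    return alpha, beta, resid


def df_tstat(resid):
    """t-statistic of gamma in: e_t - e_{t-1} = gamma * e_{t-1} + u_t."""
    lagged = resid[:-1]
    diff = [b - a for a, b in zip(resid[:-1], resid[1:])]
    sxx = sum(e * e for e in lagged)
    gamma = sum(e * d for e, d in zip(lagged, diff)) / sxx
    sse = sum((d - gamma * e) ** 2 for e, d in zip(lagged, diff))
    se = math.sqrt(sse / (len(diff) - 1) / sxx)
    return gamma / se


CRIT_5PCT = -3.34  # assumption: approximate 5% Engle-Granger critical value


def cointegrated_both_ways(p1, p2, crit=CRIT_5PCT):
    """Select the pair only if the residuals look stationary in both regressions."""
    _, _, e1 = ols(p1, p2)  # first stock regressed on the second
    _, _, e2 = ols(p2, p1)  # and vice versa
    return df_tstat(e1) < crit and df_tstat(e2) < crit


# Simulated log-prices: p2 is a random walk, p1 tracks it with stationary noise.
rng = random.Random(0)
p2 = [4.0]
for _ in range(263):
    p2.append(p2[-1] + rng.gauss(0.0, 0.01))
p1 = [0.5 + 0.8 * x + rng.gauss(0.0, 0.005) for x in p2]

alpha_hat, beta_hat, residuals = ols(p1, p2)
selected = cointegrated_both_ways(p1, p2)
```

In practice a library ADF routine with automatic lag selection would replace `df_tstat`; the sketch only shows how the two regressions and the joint acceptance rule fit together, with the estimates from the first regression (`alpha_hat`, `beta_hat`) retained for trading.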

Pairs trading strategy
Once the top 20 cointegrated pairs have been selected, we implement a self-financing trading strategy for the following six-month period. Exploiting the stationarity and mean-reverting properties of the long-term equilibrium relationship between cointegrated stock prices, the short-term Spread between log-prices is defined as an out-of-sample residual, that is:
\[ Spread_t = p_{1,t} - \hat{\beta} p_{2,t} - \hat{\alpha} \]
where \(\hat{\alpha}\) and \(\hat{\beta}\) are the estimates obtained in the first step during the formation period.
According to the trading rule we employ, a long-short pairs portfolio is formed whenever the following relationship is violated:
\[ -\delta \hat{\sigma} \le Spread_t \le \delta \hat{\sigma} \]
where \(\hat{\sigma}\) is the in-sample standard deviation of the residuals and δ is the opening trigger parameter. The position is closed when the Spread reverts to its equilibrium, that is, when it returns to within the estimated boundaries (hereafter RB) or, as is common in this field, when it crosses the zero level (hereafter RZ). If the Spread does not converge to its mean, the position is forcibly closed at the end of the trading period. As in Caldeira and Moura (2013), we employ an additional cut rule, according to which a position is closed whenever a 7% loss is realized and, also, after 50 trading days. This should prevent extreme results and loss of time value.
In detail, whenever \(Spread_t > \delta \hat{\sigma}\), the pairs portfolio is above its equilibrium value and should be sold short, while if \(Spread_t < -\delta \hat{\sigma}\) the pairs portfolio is undervalued and is bought. In both cases, $1 worth of the relatively overpriced stock is sold and $1 worth of the underpriced stock is bought, so that the strategy is self-financing and the payoffs can be interpreted as excess returns.
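The opening and closing rules described above can be sketched as a small state machine. This is an illustrative simplification: the function name and its arguments are hypothetical, the 7% loss cut rule is omitted because it requires the pair's mark-to-market return, and a position is not re-opened on the same day it is closed.

```python
def positions(spread, sigma, delta, close_rule="RZ", max_days=50):
    """Return end-of-day positions: -1 short the pair, +1 long the pair, 0 flat.

    spread: out-of-sample residuals over the trading period
    sigma: in-sample standard deviation of the residuals
    delta: opening trigger (multiple of sigma)
    close_rule: "RB" closes when the spread is back within the boundaries,
                "RZ" closes when the spread crosses zero
    max_days: time-based cut rule (50 trading days in the text)
    """
    band = delta * sigma
    pos, held, out = 0, 0, []
    for s in spread:
        if pos == 0:
            if s > band:        # pair above equilibrium: sell it short
                pos, held = -1, 0
            elif s < -band:     # pair below equilibrium: buy it
                pos, held = 1, 0
        else:
            held += 1
            if close_rule == "RB":
                reverted = abs(s) <= band
            else:
                reverted = s <= 0 if pos == -1 else s >= 0
            if reverted or held >= max_days:
                pos = 0
        out.append(pos)
    if out:
        out[-1] = 0  # any open position is forcibly closed at the end of the period
    return out


# Toy spread path with sigma = 1 and an opening trigger of 2.
spread = [0.0, 1.2, 2.5, 1.5, 0.5, -0.1, -2.2, -1.0, 0.3]
rz = positions(spread, sigma=1.0, delta=2.0, close_rule="RZ")
rb = positions(spread, sigma=1.0, delta=2.0, close_rule="RB")
```

On this toy path the RB rule exits each trade after one day, whereas the RZ rule holds the position until the spread changes sign, which illustrates why RB tends to produce more frequent and shorter round-trip trades.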
Given that a trading opportunity arises whenever the Spread departs from its zero mean, the major issue is to define what should be considered a significant deviation, that is, to determine the opening trigger. In our empirical application, we consider values of δ ranging from 0.5 to 6 times the in-sample standard deviation of the residuals, in steps of 0.5. Our aim is to analyse how the pairs trading strategy is affected by variations in both the formation period length and the opening threshold.
Each trading period, the 20 selected pairs form a portfolio with daily excess returns equal to:
\[ r_{P,t} = \frac{\sum_{i \in P} w_{i,t} \, r_{i,t}}{\sum_{i \in P} w_{i,t}} \]
where:
- \(w_{i,t}\) is the weight associated with each pair \(i\), which is equal to 1 whenever a new position is opened on the pair and, for each subsequent period, is computed as \(w_{i,t} = w_{i,t-1}(1 + r_{i,t-1}) = (1 + r_{i,1}) \cdots (1 + r_{i,t-1})\);
- \(r_{i,t}\) is the daily mark-to-market excess return of pair \(i\), computed as:
\[ r_{i,t} = \sum_{j=1}^{2} I_{j,t} \, w_{j,t} \, r_{j,t} \]
with:
- \(I_{j,t}\) is a variable equal to 1 if on day \(t\) a long position on stock \(j\) is open, -1 if a short position on stock \(j\) is open and 0 otherwise;
- \(r_{j,t}\) is the daily mark-to-market return of stock \(j\) on day \(t\);
- \(w_{j,t}\) is the weight associated with stock \(j\), equal to \(w_{j,t} = w_{j,t-1}(1 + r_{j,t-1}) = (1 + r_{j,1}) \cdots (1 + r_{j,t-1})\).
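The compounding weights can be illustrated with a short Python sketch. It assumes that the portfolio return is the weight-normalized average of the pair returns, in the spirit of Gatev et al. (2006); this normalization is an assumption of the sketch, as is the simplification that every pair is open from the first day (so that all weights start at 1). The function name is hypothetical.

```python
def portfolio_daily_returns(pair_returns):
    """Weight-normalized average of pair returns with compounding weights.

    pair_returns[i][t] is the daily mark-to-market excess return of pair i
    on day t. Each weight starts at 1 and is multiplied by (1 + r) of the
    pair's own previous-day return, so past gains raise a pair's weight.
    """
    n_days = len(pair_returns[0])
    weights = [1.0] * len(pair_returns)
    out = []
    for t in range(n_days):
        total = sum(weights)
        out.append(sum(w * r[t] for w, r in zip(weights, pair_returns)) / total)
        # Update each weight with that pair's return on day t.
        weights = [w * (1.0 + r[t]) for w, r in zip(weights, pair_returns)]
    return out


# Two toy pairs over two days (made-up numbers).
toy = [[0.10, 0.00],
       [0.00, 0.10]]
daily = portfolio_daily_returns(toy)
```

On the second day the first pair carries a weight of 1.1 after its 10% gain, so the second pair's 10% return contributes 0.10 / 2.1 rather than 0.10 / 2 to the portfolio.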

The monthly portfolio returns are obtained by compounding the daily portfolio excess returns, producing six return time series, staggered by one month, which, in turn, are averaged across the six overlapping portfolios using equal weights. Final average monthly excess returns are tested for being significantly positive based on both Newey-West heteroskedasticity and autocorrelation robust standard errors (Newey & West, 1987) with six lags, and the test for Superior Predictive Ability (Hansen, 2005), which accounts for data-snooping by considering the dependence between the statistics obtained by applying different parametrizations to the same dataset.
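A minimal Python sketch of the aggregation and of a Newey-West t-statistic for the mean is given below. It assumes 22 trading days per month and a Bartlett kernel for the long-run variance; the function names are hypothetical, and the Superior Predictive Ability test is not covered.

```python
import math


def compound_monthly(daily, days_per_month=22):
    """Compound daily returns into monthly returns over full months only."""
    months = []
    for start in range(0, len(daily) - days_per_month + 1, days_per_month):
        growth = 1.0
        for r in daily[start:start + days_per_month]:
            growth *= 1.0 + r
        months.append(growth - 1.0)
    return months


def average_overlapping(monthly_by_portfolio):
    """Equal-weight average across portfolios, aligned by calendar month."""
    return [sum(col) / len(col) for col in zip(*monthly_by_portfolio)]


def newey_west_tstat(x, lags=6):
    """t-statistic of the mean of x using a Bartlett-kernel HAC variance."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    lrv = sum(v * v for v in d) / n  # lag-0 autocovariance
    for lag in range(1, lags + 1):
        gamma = sum(d[t] * d[t - lag] for t in range(lag, n)) / n
        lrv += 2.0 * (1.0 - lag / (lags + 1)) * gamma
    return mean / math.sqrt(lrv / n)


# Toy check: a constant 1% daily return compounds to 1.01**22 - 1 per month.
monthly = compound_monthly([0.01] * 44)
averaged = average_overlapping([[0.01, 0.03], [0.03, 0.01]])
t_no_lags = newey_west_tstat([0.01, 0.03, 0.02, 0.04], lags=0)
```

With `lags=0` the statistic reduces to the mean divided by the usual (population-variance) standard error, which gives a quick sanity check before switching to the six lags used in the text.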
The transaction costs accounted for in the performance evaluation include commissions and market impact. As in Do and Faff (2012), we set the market impact equal to 20 bps and estimate that commissions, extracted from the Investment Technology Group (ITG) reports [11], decrease over time, from 10 bps in 1998 to 3 bps in 2018 [12]. In the case of commissions, the value in basis points is deducted from (added to) the $1 amount bought (sold) at trade initiation, so that the total initial cash flow remains zero and the strategy continues to be self-financing. At closure, commissions are considered in the computation of daily excess returns, with the commission amount expressed as a percentage. Market impact, instead, is included at the end of the analysis, computed as a percentage of the traded quantities and subtracted at both the beginning and end of each trade. Average monthly returns, net of market impact, are obtained from the net daily returns, following the same procedure used for excess returns. Since the empirical application is conducted on stocks with a high dollar value, market capitalization and liquidity, neither short-selling fees nor bid-ask spreads are considered.

Data
The empirical application is conducted on the daily closing prices of the S&P 500 stocks [13] (inclusive of dividends). No stock is excluded from the sample and the time series are taken from Thomson Reuters DataStream and run from 1st January 1998 to 30th October 2018.
In our setting, one month includes 22 trading days and the pairs trading strategy is implemented over a total number of months ranging from 241 for the 6-month formation period to 223 for the 24-month formation period. The first and last five months are excluded from the computation of the average monthly excess returns because fewer than six overlapping portfolios are available; therefore, the number of monthly return observations varies between 231 and 213, depending on the length of the formation period. The main results are presented in Sections 3.2 and 3.3; the complete set of results is provided in the Appendix.
[11] Do and Faff (2012) report commission estimates up to 2009. For the years between 2010 and 2018, we extracted quarterly data from the ITG reports and computed average annual values.
[12] Essay 1, Table 1 presents the commission estimates.
[13] The dataset is composed of the S&P 500 constituents at 30th October 2018. Since the sample composition varies over time as new assets are included, the total number of stocks varies between 373 and 505.

Profitability results
Performance analysis of the pairs trading strategy is based on: the average monthly excess returns, computed as explained in Section 2.2; the Sharpe ratio, which is equal to the average monthly excess return divided by its standard deviation, and measures the profitability per unit of risk; and the percentage of positive monthly returns, which provides relevant information on the frequency of negative outcomes. Finally, we assess the market impact effect. The aim of the application is to evaluate the sensitivity to changes in the length of the formation period and the opening trigger in relation to the profitability and the characteristics of the strategy.
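The three performance measures can be computed from a series of monthly excess returns as in the following sketch. The function names and the toy returns are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not specify it.

```python
import statistics


def average_excess_return(monthly_excess):
    return statistics.mean(monthly_excess)


def sharpe_ratio(monthly_excess):
    """Average monthly excess return divided by its standard deviation."""
    # Assumption: sample standard deviation (n - 1 denominator).
    return statistics.mean(monthly_excess) / statistics.stdev(monthly_excess)


def pct_positive(monthly_excess):
    """Percentage of months with a strictly positive excess return."""
    return 100.0 * sum(r > 0 for r in monthly_excess) / len(monthly_excess)


# Made-up monthly excess returns for illustration.
toy_returns = [0.02, -0.01, 0.03, 0.00]
summary = {
    "mean": average_excess_return(toy_returns),
    "sharpe": sharpe_ratio(toy_returns),
    "pct_positive": pct_positive(toy_returns),
}
```

Note that the Sharpe ratio here is a monthly figure, as in the text, and is not annualized.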
Tables 1 and 2 present the average monthly excess returns using respectively the SSD and CORR pre-selection metrics. Significantly positive returns, based on their p-values computed using the Newey-West standard errors at the 0.05 significance level, are in bold. In both tables, the upper panel refers to the results obtained when positions are closed as the Spread reverts to within the estimated boundaries, and the lower panel refers to the results obtained when trades are closed as the Spread reaches its zero mean. A graphical representation of the average monthly excess returns before and after the inclusion of commissions and cut rules is provided in Figures 2 and 3.
Using the SSD pre-selection method, the excess returns are always statistically significant for the 12- and 18-month formation periods, both before and after the inclusion of commissions and cut rules. The 24-month formation period leads to a slightly lower profitability for central values of δ and to excess returns that, in only rare cases, are not significantly positive. Finally, for the 6-month case, excess returns are considerably lower and often non-significant, especially if commissions are considered.
The maximum values of the excess returns show different behaviours depending on the type of reversion: in the RB case, they are associated with a small range of δ values (between 3.5 and 5 before inclusion of commissions and cut rules, and between 5 and 6 after commissions and cut rules) for all formation period lengths considered; in the RZ setting the maxima are scattered across the whole opening trigger range and are associated with different values of δ before and after inclusion of commissions and cut rules. For any given formation period length, the excess returns are not always bell-shaped as δ increases. However, in the reversion to within boundaries case, the returns are generally increasing in δ, up to a local maximum, which, nevertheless, does not always coincide with the observed global maximum.
Overall, the 12-month formation period provides the highest profitability, before and after the inclusion of commissions, with maxima of respectively 1% and 0.74% for the RB case and 0.51% and 0.46% in the RZ case. If cut rules are included, the maximum excess returns are achieved for the 24-month formation period in the RB case and the 18-month formation period in the RZ case, with respective values equal to 0.81% and 0.53%.
Table 1 notes: average monthly excess returns using SSD pre-selection for any formation period and opening trigger combination, before and after inclusion of commissions and cut rules, computed as explained in Section 2.2; positions are closed when the Spread reverts to within the estimated boundaries (upper panel) and reaches its zero mean (lower panel).
For the CORR pre-selection case, the 12-, 18- and 24-month formation periods continue to be preferable in terms of profitability. In the RB case, the first two formation period lengths always lead to statistically significant excess returns, with the 18-month period presenting the highest global maximum of 1.5% before inclusion of commissions and cut rules, 1.23% if commissions are included and 1.74% if cut rules are considered. In the RZ setting, all three applications produce similar results before the inclusion of cut rules, with maximum average excess returns around 0.73% before and 0.65% after inclusion of commissions. The 18-month formation period produces better performance compared to the 12- and 24-month periods only after inclusion of cut rules, with respective maxima of 1.59%, 1.19% and 0.98%. For all formation period lengths, excess returns are usually increasing for small values of δ and decreasing for large values of δ, presenting two local maxima for both RB and RZ. The global maxima occur for δ strictly greater than 2 in RB and strictly smaller than 4.5 in RZ. After inclusion of commissions and cut rules, global maxima correspond to a δ value that is always in the interval [2.5, 4].
Figure 2 - Average monthly excess returns for the RB application
Upper panels: average monthly excess returns using SSD pre-selection. Lower panels: average monthly excess returns using CORR pre-selection. The results correspond to any formation period and opening trigger combination before (left) and after (right) inclusion of commissions and cut rules. Average monthly excess returns are computed as explained in Section 2.2, positions are closed when the Spread reverts to within the estimated boundaries and the parameter values considered are: 6, 12, 18 and 24 months for formation period length and 0.5 to 6 with 0.5 jumps, for the opening trigger.

Figure 3 - Average monthly excess returns for the RZ application
Upper panels: average monthly excess returns using SSD pre-selection. Lower panels: average monthly excess returns using CORR pre-selection. The results correspond to any formation period and opening trigger combination before (left) and after (right) inclusion of commissions and cut rules. Average monthly excess returns are computed as explained in Section 2.2, positions are closed when the Spread reaches its zero mean and the parameter values considered are: 6, 12, 18 and 24 months for formation period length and 0.5 to 6 with 0.5 jumps, for the opening trigger.
For both pre-selection techniques, the effect on average excess returns of the inclusion of commissions is always decreasing in δ in the RZ case (reductions between -0.17% and -0.03%), and is increasing for low δ values and decreasing for high δ values in the RB setting (reductions between -0.15% and -0.31%). The greater impact of commissions in the RB case is due to the higher frequency of round-trip trades.
The effect of cut rules differs across pre-selection techniques and position-closing criteria. In the case of SSD pre-selection, the 6-month formation period outcomes mostly improve, except for δ equal to 0.5, reaching a maximum increment of 0.15%, but often remaining non-significant. In the 12-month case, the results are generally slightly worse than before inclusion of cut rules, while the reverse applies to the 18-month case, with a higher increase in the RZ setting (up to 0.15%). In the case of the 24-month period, the effects are mixed. In almost all the applications that employ CORR pre-selection, inclusion of cut rules affects pairs trading profitability positively for central values of δ, and negatively at the extremes. The exceptions are the 24-month case, where, if δ is between 3.5 and 4.5, negative effects are found for RB, and the 12- and 18-month cases, where outcomes improve for all values of δ in RZ. Overall, deterioration is usually limited while improvements are consistent, sometimes leading to almost double average monthly excess returns and reaching a maximum gain of 1.16% for the RZ application with 18-month formation period and opening trigger equal to 4.
In general, the CORR-based approach tends to be associated with superior profitability with respect to SSD pre-selection, but, as expected, these higher excess returns are balanced by almost double the level of volatility (see Tables 7 to 22 in the Appendix). Moreover, the application where positions are closed when the Spread returns to within the boundaries appears to be more profitable than the more common closing when the Spread reverts to zero.
Since excess returns variability plays a central role in the evaluation of pairs trading profitability, we consider the Sharpe ratio as it provides a measure of performance per unit of risk, which can be interpreted as a risk-adjusted average excess return. The results are presented in Tables 3 and 4. Excess returns volatility is generally increasing in δ for SSD pre-selection, presenting values between 0.01 and 0.03 that are stable across formation periods and do not vary much after inclusion of commissions and cut rules. In the CORR pre-selection, the volatility also remains fairly stable across formation periods, but with values generally higher than SSD, between 0.02 and 0.07, and affected by inclusion of cut rules in the RZ setting [16]. These results have important implications for the profitability analysis, in terms of Sharpe ratios.
When SSD pre-selection is employed, the risk-adjusted profitability is similar for the different formation period lengths, except for the 6-month period, which continues to perform worse than the others. Overall, the 12- and 18-month applications yield slightly higher Sharpe ratios, respectively before and after inclusion of commissions and cut rules, confirming the previous conclusions. However, since the volatility is increasing in δ, the maximum Sharpe ratios are usually associated with opening trigger values lower than in the excess returns analysis. In detail, maxima are found for δ between 3 and 6 in the RB case (after inclusion of commissions) and between 0.5 and 3 for RZ (both before and after inclusion of commissions).
Including commissions reduces the profitability of the pairs trading strategy unevenly, for all parameter combinations, and can lead to the maxima being associated with different opening trigger values than previously. In contrast, inclusion of cut rules tends to promote mostly homogeneous improvements in the pairs trading strategy Sharpe ratios, except for the 12-month formation period, similar to what was observed for excess returns.

¹⁶ See Tables 7 to 22 in the Appendix, which present the results for the monthly excess returns distributions: mean, standard deviation, minimum and maximum values, median, skewness and kurtosis. It should be noted that, for almost all parameter combinations, monthly excess returns are right-skewed and leptokurtic. For SSD pre-selection, the skewness and kurtosis values are fairly homogeneous across applications (respectively close to 0.7 and 5.5), with extreme values usually associated with the lowest and highest values of δ. When CORR is employed, both skewness and kurtosis present considerable heterogeneity across applications and, generally, are greater than for SSD, with some extremely high values, respectively greater than 10 and 100.

Notes: Sharpe ratios using SSD pre-selection for any formation period and opening trigger combination, before and after inclusion of commissions and cut rules, computed as average monthly excess return divided by their standard deviation; positions are closed when the Spread reverts within the estimated boundaries (upper panel) and reaches its zero mean (lower panel).

Upper two panels: Sharpe ratios using SSD pre-selection. Bottom two panels: Sharpe ratios using CORR pre-selection. The results correspond to any formation period and opening trigger combination before (left) and after (right) inclusion of commissions and cut rules. Sharpe ratios are computed as average monthly excess returns divided by their standard deviation, positions are closed when the Spread reverts to within the estimated boundaries (Figure 4) or reaches its zero mean (Figure 5) and the parameter values considered are: 6, 12, 18 and 24 months for the formation period length and from 0.5 to 6 with 0.5 jumps, for the opening trigger.
Similarly, when CORR pre-selection is employed, the 12- and 18-month formation periods generally display higher Sharpe ratios, but the differences among applications with distinct formation period lengths are often trifling. Notice that, unlike in the excess returns analysis, the lower volatility of the 6-month pre-selection period provides better aligned results. In general, maximum Sharpe ratios are associated with opening trigger values in the interval [2, 4] in RB and [0.5, 3] in RZ, often coinciding with those for excess returns. Moreover, as in SSD, the risk-adjusted profitability before inclusion of commissions and cut rules in the RZ case is declining in δ. In both the RB and RZ cases, Sharpe ratios decrease with the inclusion of commissions and increase with the inclusion of cut rules, with similar variations across values of δ.
In contrast to the evaluation of excess returns, SSD pre-selection generally leads to higher profitability per unit of risk, with respect to CORR, due to its lower volatility, especially before inclusion of commissions and cut rules. Comparing the results of the applications based on different closing criteria, the reversion to within the boundaries application presents higher Sharpe ratios with respect to the case of closure at Spread convergence to zero, confirming the evidence from the excess returns analysis.
To evaluate the pairs trading performance, we also look at the frequency of positive outcomes, that is, the percentage of monthly excess returns that are strictly greater than zero. Since pairs trading belongs to the statistical arbitrage strategies category, negative results are allowed if the average outcome is guaranteed to be positive, and the frequency of positive excess returns allows us to investigate the efficacy of the strategy.¹⁷ Figures 6 and 7 depict the results before and after inclusion of commissions and cut rules. Once again, for most parameter combinations, the RB setting performs better than the RZ setting and, in the first case, SSD pre-selection exhibits higher percentages. Also, these proportions appear to be sensitive to changes in both the formation period length and the opening trigger.
In the RB application, both the SSD and CORR implementations show frequencies always greater than 50% before inclusion of commissions and cut rules, with maxima exceeding 60%. When commissions and cut rules are considered, almost all the results suffer reductions. Some frequencies lower than 50% are present in SSD for the 6-month formation period and in CORR for all formation period lengths except 18 months. In both cases, maximum values are around 55%. In the RZ setting, before inclusion of commissions and cut rules, the 12-, 18- and 24-month formation periods for SSD pre-selection lead to frequencies that are always greater than 50%, with a maximum of 54%. In some cases, the inclusion of commissions and cut rules has a positive effect on the results. However, in the SSD application, the 12-month formation period is the only one exhibiting percentages greater than 50% for all values of δ, and in the CORR application frequencies are lower than 50% for values of δ strictly smaller than 1.5 and greater than 4.5 for almost all formation period lengths, with a minimum of around 46%.
Upper two panels: percentage of positive monthly excess returns using SSD pre-selection. Bottom two panels: percentage of positive monthly excess returns using CORR pre-selection. The results correspond to any formation period and opening trigger combination before (left) and after (right) inclusion of commissions and cut rules. Positions are closed when the Spread reverts to within the estimated boundaries (Figure 6) or reaches its zero mean (Figure 7) and the parameter values considered are: 6, 12, 18 and 24 months for the formation period length and 0.5 to 6 with 0.5 jumps, for the opening trigger. The grey line corresponds to the 50% level.
Note that the previous results for average excess returns are affected considerably by the findings from the analysis of positive excess returns. Indeed, in most cases, the lower the frequency of positive monthly excess returns, the lower the average monthly excess return, with non-significant outcomes often associated with frequencies of less than 50%.
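The frequency measure discussed above can be sketched in a few lines of Python. The function name and the input series are hypothetical; the only element taken from the text is the definition, namely the share of monthly excess returns strictly greater than zero.

```python
import numpy as np

def pct_positive(monthly_excess_returns):
    """Percentage of monthly excess returns strictly greater than zero."""
    r = np.asarray(monthly_excess_returns, dtype=float)
    return 100.0 * float(np.mean(r > 0.0))

# Hypothetical series: a zero month does not count as positive
example = [0.01, 0.0, -0.02, 0.03]
print(pct_positive(example), np.mean(example))
```

In this toy series the frequency is exactly 50% while the mean is still positive, which illustrates why the two measures are reported side by side.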
As a final step in the performance evaluation, we consider the effect on the pairs trading strategy profitability of including market impact (see average monthly returns in Table 5). As discussed in Section 2.2, market impact is estimated as a variable cost that depends on the amounts traded. Final average monthly returns are based on daily excess returns from the pairs after inclusion of commissions and cut rules and are obtained by subtracting the cost of the market impact at the opening and closing of each trade, for both stocks in each of the selected pairs, which provides the net daily returns of the pairs. The procedure used to compute the monthly portfolio net returns from net daily returns is the same as that used to calculate excess returns. The averages are simply the monthly return means across the six overlapping portfolios. Since the initial investment required for each trade is equal to the cost associated with the market impact (40 bps), in this case, the results cannot be interpreted as excess returns.
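A minimal Python sketch of this netting step is given below. It assumes that each trade is described by the indices of its opening and closing days and that the impact is a fixed fraction charged on each of the two stocks at both dates; the function name, the argument names and the rate used in the example are placeholders rather than the calibration described in Section 2.2.

```python
import numpy as np

def net_of_market_impact(daily_returns, trades, impact_per_leg):
    """Subtract market impact on the opening and closing day of each trade.

    daily_returns  : daily excess returns of one pair
    trades         : list of (open_day, close_day) index pairs
    impact_per_leg : cost per stock and per transaction (hypothetical input)
    """
    net = np.array(daily_returns, dtype=float)  # copy, the input is left untouched
    for open_day, close_day in trades:
        net[open_day] -= 2 * impact_per_leg    # both stocks at the opening
        net[close_day] -= 2 * impact_per_leg   # both stocks at the closing
    return net

# Hypothetical pair: five flat days, one trade opened on day 1 and closed on day 3
net = net_of_market_impact([0.0] * 5, [(1, 3)], impact_per_leg=0.001)
print(net)
```

Under these assumptions the total charge grows linearly with the number of round-trip trades, so parameter combinations that trade more often lose more to market impact.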
Comparing average monthly excess returns before the inclusion of market impact (Tables 1 and 2) with the profitability analysis after its inclusion (Table 5), we find that the effect of market impact is generally declining in δ and in the length of the formation period, in all the applications, in terms of both pre-selection and Spread reversion. We also find that the market impact is usually higher in the RB case than in the RZ case, and higher for pre-selection based on SSD than for CORR.¹⁸ The way that profitability is affected by market impact appears to be related to the average number of trades per month,¹⁹ which is depicted in Figure 8.
Including the market impact leads to some negative average monthly returns, especially for the 6-month formation period and low values of δ. SSD pre-selection again provides lower profitability compared to CORR and the highest performance tends to be associated with the 18-month formation period, with maxima corresponding to opening trigger values strictly greater than 5. The picture for CORR pre-selection is slightly different. The best performance is associated, again, with the 18-month formation period, but in this case, the maximum average monthly returns always correspond to central values of δ, in the interval [2.5, 4]. Overall, the choice of a 12-month formation period and an opening trigger equal to 2, which are the most frequent in the literature, would seem to be suboptimal for pairs trading profitability. For traders who want to optimize their profits, use of fixed and pre-established parameters may not provide the best results.

Notes: Average monthly returns net of market impact using SSD and CORR pre-selections for any formation period and opening trigger combination; positions are closed when the Spread reverts to within the estimated boundaries (upper panel) and reaches its zero mean (lower panel). Market impact is computed as a percentage of the traded amounts and is subtracted at the beginning and at the end of each trade. This provides average monthly returns net of market impact as explained in Section 2.2.

Figure 8 - Pairs' average number of round-trip trades per month
Upper panels: average number of pair transactions per month using SSD pre-selection (normalized log-prices SSD). Bottom panels: average number of pair transactions per month using CORR pre-selection (log-prices correlation). The results correspond to any formation period and opening trigger combination before (left) and after (right) inclusion of commissions and cut rules and are computed as the number of round-trip trades per month, averaged across a 6-month trading period for each pair, and then across all pairs. Positions are closed when the Spread reverts to within the estimated boundaries (left) or reaches its zero mean (right) and the parameter values considered are: 6, 12, 18 and 24 months for the formation period length and 0.5 to 6 with 0.5 jumps, for the opening trigger.

Arbitrage strategy characteristics
Pairs trading strategy parametrization affects not only the strategy performance, as discussed above in detail, but also its characteristics and the features of the pair trades.
First, the average number of tests²⁰ required to identify the first 20 cointegrated pairs decreases as the length of the formation period increases, and is lower for the CORR pre-selection than for the SSD pre-selection (Table 6). This implies that, regardless of parameter choices, CORR is better able to capture comovement between assets.

Notes: number of total pairs and tested pairs, averaged across overlapping samples for pre-selection based on log-price correlation (CORR) and SSD between normalized log-prices, and considering formation periods of lengths 6, 12, 18 and 24 months.

The average number of complete trades per month (Figure 8) is computed as the number of round-trip trades per month averaged, first, across a 6-month trading period for each pair and, then, across all pairs. The number appears to be lower if positions are closed at the Spread convergence to zero (between 0.94 and 0.03) than for Spread reversion to within the boundaries (between 1.29 and 0.09). Moreover, the number of complete trades is related negatively to both the opening trigger and the length of the formation period.

The results correspond to any formation period and opening trigger combination, before (left) and after (right) inclusion of commissions and cut rules, and are computed as averages across all trades using both SSD (normalized log-prices SSD) and CORR (log-prices correlation) pre-selection. The parameter values considered are: 6, 12, 18 and 24 months for the formation period length and 0.5 to 6 with 0.5 jumps, for the opening trigger.

Figure 9 shows that the average number of days that a position remains open (average life) is affected in different ways by changes to the parameters. Variations in the opening trigger lead to consistent changes in average life, using both SSD and CORR pre-selection and, especially, before inclusion of commissions and cut rules. Specifically, in the RB setting, average life is declining in δ and, in the RZ case, first increasing and then decreasing. Formation period length mainly affects the differences among average lives for different values of δ. That is, the range spanned by average life as δ varies becomes wider as the length of the formation period increases, with the widest ranges being [2.5, 41] and [11, 69] for the RB and RZ applications, respectively (before inclusion of commissions and cut rules).
In all cases, as expected, inclusion of cut rules leads to a reduction in the average number of days a position is kept open,²¹ but does not alter the relation with the parameters and results in maximum ranges of [2, 10] and [6, 36] for the RB and RZ applications, respectively.
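To make the two closing criteria and the derived statistics concrete, the following Python sketch extracts round-trip trades from a single Spread series and computes their number per month and their average life. It rests on assumptions that are not spelled out in this section: the opening boundaries are taken to be ±δ times a Spread standard deviation sigma, only one position per pair is open at a time, and a position still open on the last day is closed there. All names and numbers are illustrative.

```python
import numpy as np

def round_trips(spread, sigma, delta, close_rule="RZ"):
    """List (open_day, close_day) index pairs for one pair's Spread series."""
    if close_rule not in ("RB", "RZ"):
        raise ValueError("close_rule must be 'RB' or 'RZ'")
    bound = delta * sigma                      # assumed boundary: +/- delta * sigma
    trades, open_day, side = [], None, 0
    for t, s in enumerate(spread):
        if open_day is None:
            if abs(s) > bound:                 # divergence: open the trade
                open_day, side = t, (1 if s > 0 else -1)
        elif (close_rule == "RB" and abs(s) <= bound) or \
             (close_rule == "RZ" and side * s <= 0):
            trades.append((open_day, t))       # reversion: close the trade
            open_day = None
    if open_day is not None:                   # forced close on the last day
        trades.append((open_day, len(spread) - 1))
    return trades

def average_life(trades):
    """Mean number of days between opening and closing; NaN without trades."""
    return float(np.mean([c - o for o, c in trades])) if trades else float("nan")

def trades_per_month(trades, n_months=6):
    """Round-trip trades per month over a trading period of n_months."""
    return len(trades) / n_months

# Hypothetical Spread path with sigma = 1 and opening trigger delta = 2
spread = [0.0, 1.2, 2.5, 1.8, 0.9, -0.1, 0.3, 2.2, 1.5, 0.0]
for rule in ("RB", "RZ"):
    trades = round_trips(spread, sigma=1.0, delta=2.0, close_rule=rule)
    print(rule, trades, average_life(trades), trades_per_month(trades))
```

On this toy path both rules open the same two trades, but reversion to within the boundaries closes them sooner than convergence to zero, which is the mechanism behind the shorter average lives in the RB setting.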
The pairs selected via cointegration testing can be classified according to the number of round-trip trades completed during the 6-month trading period: i) non-traded pairs are not involved in any trading; ii) single-round trip pairs are involved in only one trade; iii) multiple-openings pairs are involved in more than one trade. Figure 10 depicts the percentages of pairs (averaged across overlapping portfolios) in each of these categories.²² Pair classification seems to be strongly affected by the choice of parameters. Specifically, the average fractions of non-traded pairs and multiple-openings pairs are, respectively, increasing and declining, both in δ and in the length of the formation period, regardless of the pre-selection metric and Spread reversion type. However, the magnitude of the decline in the percentage of multiple-openings pairs is higher in the RZ setting than in the RB setting. The average proportion of single-round trip pairs is only slightly affected by changes in the parameters when positions are closed with Spread reversion to within the estimated boundaries. In the case of Spread convergence to zero, it first increases and then decreases in both δ and the length of the formation period. Overall, the pairs classification appears to be similar across pre-selection methodologies.²³

²¹ The average number of trades per month and the average life clearly contribute to determining the number of open pairs per day (presented in Tables 7 to 22 in the Appendix), which is negatively related to δ in both the RB and RZ settings; the range of values it spans as δ varies becomes wider as the length of the formation period increases.

²² Notice that pair classification is unaffected by the inclusion of commissions and cut rules.
²³ Tables 7 to 22 in the Appendix present the average percentages (across overlapping portfolios) of non-convergent pairs, both single-round trip and multiple-openings, before and after the inclusion of commissions and cut rules. The non-convergent single-round trip pairs are those pairs that are "truly" non-convergent, meaning that the Spread divergence from its zero mean may be a consequence of a change in the long-run equilibrium. The results show that, in the RZ setting, the average fraction of non-convergent single-round trip pairs is, in general, first increasing and then decreasing in δ (with the exception of the 6-month formation period, where it is always increasing). In the RB application, the average fraction of non-convergent single-round trip pairs appears to be generally declining in δ before the inclusion of commissions and cut rules (with extremely high values when the opening trigger equals 0.5), and decreasing after initially increasing when they are included. The average portion of non-convergent multiple-openings pairs appears to be always decreasing in the opening trigger, regardless of the Spread's type of reversion, with higher fractions associated with the RB setting.
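The three-way classification above depends only on the number of round-trip trades per pair, as the following Python sketch shows. The function names and the input counts are hypothetical, and the averaging across overlapping portfolios is left out for brevity.

```python
from collections import Counter

LABELS = ("non-traded", "single-round trip", "multiple-openings")

def classify_pair(n_round_trips):
    """Assign a pair to one of the three categories from its trade count."""
    if n_round_trips == 0:
        return "non-traded"
    if n_round_trips == 1:
        return "single-round trip"
    return "multiple-openings"

def class_percentages(trade_counts):
    """Percentage of pairs in each category for one portfolio."""
    if not trade_counts:
        raise ValueError("at least one pair is required")
    counts = Counter(classify_pair(n) for n in trade_counts)
    return {label: 100.0 * counts[label] / len(trade_counts) for label in LABELS}

# Hypothetical portfolio of four pairs with 0, 1, 1 and 3 round-trip trades
print(class_percentages([0, 1, 1, 3]))
```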

Figure 10 - Pairs classified by number of round-trip trades (average percentages)
Pairs classified by number of round-trip trades during the trading period in percentages across overlapping portfolios: i) non-traded pairs are not involved in any trades; ii) single-round trip pairs are involved in one trade; iii) multiple-openings pairs are involved in more than one trade. The results correspond to any formation period and opening trigger combination, and are computed using SSD (upper two blocks) and CORR (lower two blocks) pre-selection, both if positions are closed when the Spread reverts to within the estimated boundaries (RB) and if the Spread reaches its zero mean (RZ). The parameter values considered are: 6, 12, 18 and 24 months for the formation period length (y axis) and 0.5 to 6 with 0.5 jumps, for the opening trigger (x axis).

Conclusions
This paper investigates the sensitivity of pairs trading profitability to the strategy parametrization, in terms of the opening trigger and the formation period length. To this end, we performed a computationally intensive sensitivity analysis of different parametrizations on a wide set of profitability measures as well as on the trades' characteristics. We are thus able to provide a complete investigation of the effects of parameter choice on the statistical arbitrage strategy.
We found that the profitability of the strategy is highly sensitive to both the length of the formation period and the opening threshold level. The effects vary according to the pre-selection metrics employed and the Spread reversion used to determine closure of a position. As we showed in Essay 1, if pre-selection is based on SSD or CORR the traded pairs rarely coincide, which leads to different risk-return profiles. In addition, the strictness of the interpretation of 'the convergence to long-term equilibrium' leads to trades that differ in terms of both number and length (in days), which affects the entire implementation of the strategy. Overall, despite these differences, all the performance indicators (excess returns, Sharpe ratios and percentages of positive excess returns) suggest that the choice of parameters is crucial for determining the profitability of the strategy. In addition, in Section 3.3, we showed that the characteristics of the strategy and of the trades are also highly sensitive to variations in the parameters and affect the strategy performance through the frequency and duration of round-trip transactions.
Our analysis differs from the work of Huck (2013) whose application is based on Gatev et al.'s (2006) distance approach; however, both analyses are aimed at investigating the sensitivity of pairs trading to the strategy parametrization. In line with our findings, Huck finds that excess returns are highly sensitive to the length of the formation period but, differently from our analysis, are affected only marginally by the opening trigger. This difference may be due to the small set of opening trigger values (2, 3 and 4 times the Spread's historical standard deviation) considered in Huck (2013).
In the light of our results, more research is needed to investigate the parametrization of the pairs trading strategy in order to increase its efficiency in terms of profitability. In more detail, a possible direction for further investigations would be to develop a new framework that includes in-sample optimization of one or both of the parameters, based on a profitability measure, such as average excess return or Sharpe ratio. It would be interesting both to evaluate the effect of this optimization on pairs trading performance and to investigate whether the 'best' in-sample parameters, identified during the formation period, lead to the highest out-of-sample profitability, that is, in the following 6-month trading period.

Notes: All results are presented before and after inclusion of commissions and cut rules. Monthly excess returns and monthly returns net of market impact series are computed as average values across the 6 overlapping portfolios for each month and are tested to be significantly positive using Newey-West heteroskedasticity and autocorrelation robust standard errors (Newey & West, 1987). Data-snooping is controlled for by the SPA test Consistent pValues (Hansen, 2005). The Sharpe ratio is computed as the ratio between average monthly excess returns and the standard deviation of monthly excess returns. The t-Test pValue in the bottom panel is the pValue of the difference-in-means test between monthly excess returns of crisis and non-crisis periods.
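The significance test mentioned in the notes can be sketched as a one-sided test on the mean of the monthly series with a Newey-West (Bartlett kernel) long-run variance. The sketch below is a plain NumPy illustration, not the code used to produce the tables: the function name, the lag choice and the normal approximation for the p-value are assumptions, since the notes do not report them.

```python
import math
import numpy as np

def newey_west_mean_test(x, lags):
    """t-statistic and one-sided p-value for H0: mean <= 0 with HAC errors.

    lags is the truncation lag of the Bartlett kernel (an assumed input).
    The series is assumed not to be constant.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    e = x - x.mean()
    lrv = (e @ e) / n                              # lag-0 autocovariance
    for k in range(1, lags + 1):
        weight = 1.0 - k / (lags + 1.0)            # Bartlett weight
        lrv += 2.0 * weight * (e[k:] @ e[:-k]) / n
    t_stat = x.mean() / math.sqrt(lrv / n)
    p_value = 0.5 * math.erfc(t_stat / math.sqrt(2.0))  # P(Z > t), normal approximation
    return t_stat, p_value

# Hypothetical monthly excess returns
print(newey_west_mean_test([0.01, 0.03, 0.02, 0.04], lags=1))
```

With lags set to zero the statistic reduces to the usual ratio of the mean to its standard error (computed with the population variance), so the lag choice only matters when the monthly series is autocorrelated, as is likely with overlapping portfolios.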