Is that backtest really reproducible?
1. What is “Backtesting”?
Backtesting is an algorithmic historical simulation of an investment strategy: an algorithm calculates the gains and losses that would have been incurred if the strategy you have drafted had been implemented over some past period. Performance is then summarized with common statistics such as the Sharpe ratio and the information ratio. Investors typically examine these backtest statistics when allocating assets to the best-performing investment (management) strategies, so asset managers run trial-and-error backtests an enormous number of times in search of good performance and present the results to clients in their materials.
From an investor’s perspective, it is important to distinguish between the in-sample (IS) and out-of-sample (OOS) performance of a back-tested investment strategy. IS performance is simulated on the sample used to design the strategy (referred to in the machine learning literature as the “training period” or “training set”). OOS performance, on the other hand, is simulated on a sample (the “test set”) that was not used to design the strategy. Since a backtest predicts the effectiveness of a strategy from its simulated performance, it can be considered reproducible and realistic if the IS performance matches the OOS performance. However, when you receive a backtest it is difficult to judge whether it is reliable, because the true out-of-sample results lie in the future. And a sample cannot be called a pure out-of-sample as long as its results can be fed back to improve the strategy.
So, when you receive a backtest with good results from a fund manager, it is very important to assess how realistic the simulation is. It is equally important for fund managers to understand the uncertainty of their own backtest results. In this post, we will look at how to evaluate the realism of a backtesting simulation and what to watch out for in order to obtain a reproducible backtest.
2. Backtesting overfits.
Bailey, Borwein, López de Prado and Zhu (2015) argue that it is (relatively) easy to overfit a backtesting simulation for any financial time series. Here, overfitting is the machine learning concept describing a situation in which a model fits a particular set of observed data (the IS data) rather than the general structure underlying it.
Bailey et al. (2015) illustrate this claim with the example of an equity strategy whose initial backtest results are poor. Since backtesting, as the name implies, uses historical data, it is possible to identify the specific stocks that produced the losses and to improve performance by adding parameters to the trading system that remove recommendations for those stocks (a technique known as “data snooping”). With a few repetitions of this exercise, you can derive “optimal parameters” that profit from characteristics present in the particular sample but possibly rare in the population.
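As a toy illustration of data snooping (the universe, sample length and filtering rule below are all made up, and this is not an example from the paper), the following R snippet “improves” an equal-weight portfolio by dropping the worst-performing stocks identified in the very sample being backtested. The in-sample Sharpe ratio jumps even though every return series is pure noise.
set.seed(1)
# 50 "stocks", 500 days of i.i.d. returns with no true edge at all
returns = matrix(rnorm(500 * 50, mean = 0, sd = 0.01), nrow = 500, ncol = 50)

sharpe = function(r) mean(r) / sd(r) * sqrt(365)  # annualized, using q = 365 as in this post

sharpe(rowMeans(returns))            # equal weight over all stocks: close to zero

# "Data snooping": keep only the 25 best performers chosen on the same sample
keep = order(colMeans(returns), decreasing = TRUE)[1:25]
sharpe(rowMeans(returns[, keep]))    # much higher, but entirely spurious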
There is a vast body of research in the machine learning literature addressing the problem of overfitting. But Bailey et al. (2015) argue that the methods proposed in the machine learning context are generally not applicable to investment problems, for roughly the following four reasons.
- Machine learning methods for preventing overfitting require explicit point estimates and confidence intervals, defined on the domain of the event being predicted, in order to assess the explanatory power and quality of a prediction; few investment strategies make such explicit predictions.
  - For example, a recommendation rarely says “the E-mini S&P 500 is expected to be around 1,600 as of Friday’s close, with a standard deviation of 5 points”; what is usually provided is a qualitative call such as “buy” or “strong buy”. Moreover, these forecasts do not specify an expiration date and are subject to change when some unexpected event occurs.
- Even if a particular investment strategy relies on a prediction formula, other components of the strategy may still be overfitted.
  - In other words, there are many ways to overfit an investment strategy other than adjusting the prediction equation.
- Methods that address regression overfitting are parametric and rely on many assumptions about the data that are unobservable in finance.
- Some methods do not control for the number of trials.
Bailey et al. (2015) show that only a comparatively small number of trials is needed to find an investment strategy with an excellent backtest even when all of the candidate strategies are, in truth, worthless. Think of the number of trials here as the number of rounds of trial and error. They also calculate the minimum backtest length (MinBTL), the backtest length required for a given number of trials. Throughout the paper the Sharpe ratio is used to evaluate performance, but the argument can be applied to other performance measures as well. Let’s take a look at how it works.
3. What is the distribution that the Sharpe ratio follows?
To derive the MinBTL, we first derive the (asymptotic) distribution of the Sharpe ratio. To begin with, the design of an investment strategy usually starts with the prior knowledge or belief that a particular pattern may help to predict the future value of a financial variable. For example, if you are aware of the lead-lag effect between bonds of different maturities, you can design a strategy that bets on a return to the equilibrium value if the yield curve rises. This model could take the form of a cointegration equation, a vector error correction model, or a system of stochastic differential equations.
The number of such model configurations (or trials) is vast, and fund managers naturally want to choose the one that maximizes the performance of their strategy; to do so they run historical simulations (backtesting), as described above. The backtest also evaluates the optimal sample size, signal update frequency, risk sizing, stop loss, maximum holding period, and so on, in combination with other variables.
The Sharpe ratio, used as the performance measure in this paper, evaluates a strategy from a sample of historical returns and is defined as the average excess return over the benchmark divided by its standard deviation (risk). It is usually interpreted as “return per one standard deviation of risk”, and, depending on the asset class, a value above 1 can be considered very good. In what follows, we assume that the excess return \(r_t\) of a strategy is an i.i.d. random variable following a normal distribution; that is, the distribution of \(r_t\) is independent of \(r_s\) \((t\neq s)\). It is not a very realistic assumption, though.
\[ r_t \sim \mathcal{N}(\mu,\sigma^2) \] where \(\mathcal{N}(\mu,\sigma^2)\) denotes the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Next, define the cumulative excess return \(r_{t}(q)\) over the \(q\) periods from \(t-q+1\) through \(t\) (ignoring compounding):
\[ r_{t}(q) \equiv r_{t} + r_{t-1} + ... + r_{t-q+1} \] Then, the annualized Sharpe ratio is represented
\[ \begin{eqnarray} SR(q) &=& \frac{E[r_{t}(q)]}{\sqrt{Var(r_{t}(q))}}\\ &=& \frac{q\mu}{\sqrt{q}\sigma}\\ &=& \frac{\mu}{\sigma}\sqrt{q} \end{eqnarray} \]
where \(q\) is the number (frequency) of returns per year. For example, \(q=365\) for daily returns (ignoring leap years).
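As a quick numerical illustration (the figures are made up), a strategy with a mean daily excess return of 0.05% and daily volatility of 1% has an annualized Sharpe ratio of roughly 0.96 under this convention:
mu_daily = 0.0005      # hypothetical mean daily excess return (0.05%)
sigma_daily = 0.01     # hypothetical daily volatility (1%)
q = 365                # returns per year, as used in this post
mu_daily / sigma_daily * sqrt(q)   # annualized Sharpe ratio, about 0.96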
The true value of \(SR\) cannot be known because \(\mu\) and \(\sigma\) are generally unknown. So, letting \(R_t\) be the sample return and \(R^f\) the (constant) risk-free rate, we estimate the Sharpe ratio using the sample mean \(\hat{\mu}=\frac{1}{T}\sum_{t=1}^T (R_{t}-R^f)\) and the sample standard deviation \(\hat{\sigma}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}(R_{t}-R^f-\hat{\mu})^2}\), where \(T\) is the sample size of the backtest.
\[ \hat{SR}(q) = \frac{\hat{\mu}}{\hat{\sigma}}\sqrt{q} \] As an inevitable consequence, the estimate of \(SR\) comes with considerable estimation error. Now let us derive the main result of this section, the asymptotic distribution of \(\hat{SR}\). First, since the returns are i.i.d. with finite mean and variance, the central limit theorem applied to \(\hat{\mu}\) and \(\hat{\sigma}^2\) gives
\[ \sqrt{T}(\hat{\mu}-\mu)\sim^{a}\mathcal{N}(0,\sigma^2), \\ \sqrt{T}(\hat{\sigma}^2-\sigma^2)\sim^a\mathcal{N}(0,2\sigma^4) \] Since the Sharpe ratio is a random variable computed from \(\hat{\mu}\) and \(\hat{\sigma}^2\), write it as a function \(g(\hat{{\boldsymbol \theta}})\) of \(\hat{{\boldsymbol \theta}}=(\hat{\mu},\hat{\sigma}^2)\). Under the i.i.d. normality assumption, \(\hat{\mu}\) and \(\hat{\sigma}^2\) are (asymptotically) independent of each other, so from the above discussion the asymptotic joint distribution is
\[ \sqrt{T}(\hat{{\boldsymbol \theta}}-{\boldsymbol \theta}) \sim^a \mathcal{N}(\mathbf{0},{\boldsymbol V_{\boldsymbol \theta}}) \] where \({\boldsymbol V_{\boldsymbol \theta}}\) is
\[ {\boldsymbol V_{\boldsymbol \theta}} = \left( \begin{array}{cc} \sigma^2 & 0\\ 0 & 2\sigma^4\\ \end{array} \right) \] Since the Sharpe ratio estimate is simply a (differentiable) function \(g(\hat{{\boldsymbol \theta}})\) of \(\hat{{\boldsymbol \theta}}\), the delta method gives the following:
\[ \hat{SR} = g(\hat{{\boldsymbol \theta}}) \sim^a \mathcal{N}\left(g({\boldsymbol \theta}),\frac{\boldsymbol V_g}{T}\right) \] That is, it asymptotically follows a normal distribution, where \(\boldsymbol V_g\) is
\[ \boldsymbol V_g=\frac{\partial g}{\partial{\boldsymbol \theta}}{\boldsymbol V_{\boldsymbol \theta}}\frac{\partial g}{\partial{\boldsymbol \theta}'} \] Since \(g({\boldsymbol \theta})=\mu/\sigma\), then
\[ \frac{\partial g}{\partial{\boldsymbol \theta}'} = \left[ \begin{array}{cccc} \frac{\partial g}{\partial \mu}\\ \frac{\partial g}{\partial \sigma^2}\\ \end{array} \right] = \left[ \begin{array}{cccc} \frac{1}{\sigma}\\ -\frac{\mu}{2\sigma^3}\\ \end{array} \right] \] Therefore, you can derive
\[ \begin{eqnarray} \boldsymbol V_g &=& \left( \begin{array}{cc} \frac{\partial g}{\partial \mu}, \frac{\partial g}{\partial \sigma^2}\\ \end{array} \right) \left( \begin{array}{cc} \sigma^2 & 0\\ 0 & 2\sigma^4\\ \end{array} \right) \left( \begin{array}{c} \frac{\partial g}{\partial \mu}\\ \frac{\partial g}{\partial \sigma^2}\\ \end{array} \right) \\ &=& \left( \begin{array}{cc} \frac{\partial g}{\partial \mu}\sigma^2, \frac{\partial g}{\partial \sigma^2}2\sigma^4\\ \end{array} \right) \left( \begin{array}{c} \frac{\partial g}{\partial \mu}\\ \frac{\partial g}{\partial \sigma^2}\\ \end{array} \right) \\ &=& \left(\frac{\partial g}{\partial \mu}\right)^2\sigma^2 + 2\left(\frac{\partial g}{\partial \sigma^2}\right)^2\sigma^4 \\ &=& 1 + \frac{\mu^2}{2\sigma^2} \\ &=& 1 + \frac{1}{2}SR^2 \end{eqnarray} \] Since the variance grows with the square of the Sharpe ratio, you may need to be cautious when you see unusually good performance. The distribution that the annualized Sharpe ratio estimate \(\hat{SR}(q)\) follows is then
\[ \hat{SR}(q)\sim^a \mathcal{N}(\sqrt{q}SR,\frac{V(q)}{T}) \\ V(q) = q{\boldsymbol V}_g = q(1 + \frac{1}{2}SR^2) \] Now, if \(y\) is the number of years of backtesting, we can write \(T=yq\) and use this to rewrite the above equation as follows (for a 3-year measurement with daily returns, the sample size \(T\) is \(T=3×365=1095\)).
\[ \hat{SR}(q)\sim^a \mathcal{N}(\sqrt{q}SR,\frac{1+\frac{1}{2}SR^2}{y}) \tag{1} \] The frequency \(q\) affects the mean of the annualized Sharpe ratio, but not its variance. We now have an asymptotic distribution for the Sharpe ratio estimate. So, what did we want to do with this? We were thinking about the reliability of a backtest: in other words, what is the probability that, among the \(N\) strategy ideas a fund manager tries out while developing a new product, the backtest produces a very high (good) value even though the true Sharpe ratio of every one of them is zero? Bailey et al. (2015) pose the question as follows.
How high is the expected maximum Sharpe ratio IS among a set of strategy configurations where the true Sharpe ratio is zero?
We also want to know how long the backtest should be in order to keep this expected maximum Sharpe ratio down.
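As a quick sanity check of equation (1) (this little simulation is my own, not from the paper), we can generate many i.i.d. normal return series with made-up parameters, compute the annualized Sharpe ratio of each backtest, and compare the spread of the estimates with \(\sqrt{(1+\frac{1}{2}SR^2)/y}\).
set.seed(123)
q = 365; y = 3; T1 = q * y           # three years of daily returns
mu = 0.0005; sigma = 0.01            # hypothetical daily mean and volatility of excess returns
SR = mu / sigma                      # true (non-annualized) Sharpe ratio

# 10,000 simulated backtests of an i.i.d. normal strategy
sr_hat = replicate(10000, {
  r = rnorm(T1, mean = mu, sd = sigma)
  mean(r) / sd(r) * sqrt(q)          # annualized Sharpe ratio estimate
})

sd(sr_hat)                           # empirical standard deviation of the estimates
sqrt((1 + 0.5 * SR^2) / y)           # asymptotic standard deviation from equation (1), about 0.58
The two numbers should be close, which is some reassurance before we build on equation (1) in the next section.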
4. Trying to derive the minimum backtest length
The situation we now consider is the following: for simplicity, let \(\mu=0\) and \(y=1\) year, so that by equation (1) \(\hat{SR}(q)\) follows the standard normal distribution \(\mathcal{N}(0,1)\). We want the expected value of the maximum \(\max[\hat{SR}]_N\) of \(\hat{SR}_n\ (n=1,2,...,N)\), and, as those with good instincts will have noticed, the discussion now enters the territory of extreme value statistics. Since the \(\hat{SR}_n\sim\mathcal{N}(0,1)\) are i.i.d., the limiting distribution of their maximum is the Gumbel distribution by the Fisher–Tippett–Gnedenko theorem (sorry, I haven’t been able to follow the proof).
\[ \lim_{N\rightarrow\infty}prob[\frac{\max[\hat{SR}]_N-\alpha}{\beta}\leq x] = G(x) = e^{-e^{-x}} \] where \(\alpha=Z^{-1}[1-\frac{1}{N}]\), \(\beta=Z^{-1}[1-\frac{1}{N}e^{-1}]-\alpha\), and \(Z\) denotes the cumulative distribution function of the standard normal distribution. The moment generating function of the Gumbel distribution, \(M_x(t)\), is written as
\[ \begin{eqnarray} M_x(t) &=& E[e^{tx}] = \int_{-\infty}^\infty e^{tx}e^{-x}e^{-e^{-x}}dx \\ \end{eqnarray} \] Converting variables with \(x=-\log(y)\) gives \(dx/dy=-1/y=-(e^{-x})^{-1}\), so
\[ \begin{eqnarray} M_x(t) &=& \int_{\infty}^0-e^{-t\log(y)}e^{-y}dy \\ &=& \int_{0}^\infty y^{-t}e^{-y}dy \\ &=& \Gamma(1-t) \end{eqnarray} \] where \(\Gamma(x)\) is a gamma function. From here, the expected value (mean) of the standardized maximum statistic is
\[ \begin{eqnarray} \lim_{N\rightarrow\infty} E[\frac{\max[\hat{SR}]_N-\alpha}{\beta}] &=& M_x'(t)|_{t=0} \\ &=& (-1)\Gamma'(1) \\ &=& (-1)(-\gamma) = \gamma \end{eqnarray} \] Here, \(\gamma\approx 0.5772\) is the Euler–Mascheroni constant. Thus, when \(N\) is large, the expected value of the maximum of \(N\) i.i.d. standard normal variables is approximated by (\(N>1\))
\[ E[\max[\hat{SR}]] \approx \alpha + \gamma\beta = (1-\gamma)Z^{-1}[1-\frac{1}{N}]+\gamma Z^{-1}[1-\frac{1}{N}e^{-1}] \tag{2} \] This is Proposition 1 of Bailey et al. (2015). Below we plot \(E[\max[\hat{SR}]]\) as a function of the number of strategies (the number of trials and errors) \(N\).
library(ggplot2)
# Equation (2): expected maximum Sharpe ratio among N strategies whose true Sharpe ratio is zero
ExMaxSR = function(N){
  gamma_ct = -digamma(1)  # Euler-Mascheroni constant
  return((1-gamma_ct)*qnorm(1-1/N) + gamma_ct*qnorm(1-1/(N*exp(1))))
}
N = 2:100                 # N = 1 would give qnorm(0) = -Inf
result = purrr::map_dbl(N, ExMaxSR)
ggplot2::ggplot(data.frame(ExpMaxSR = result, N = N), aes(x=N, y=ExpMaxSR)) +
  geom_line(size=1) + ylim(0,3)
You can see that the expected value of \(\max[\hat{SR}]\) rises rapidly for small \(N\). At \(N=10\) we already expect to find at least one apparently quite good strategy, even though the true Sharpe ratio of every strategy is zero. In finance, hold-out backtesting is commonly used, but it does not take the number of trials into account, which is why it stops returning reliable results when \(N\) is large. Don’t you think it is very risky to run a lot of simulations just to improve the backtest results? In the end only the best-performing of the \(N\) strategies shows up in the presentation material, so even with just 10 strategies, as in this example, the winner’s Sharpe ratio will be distributed around 1.57 (the value of equation (2) at \(N=10\)). Of course, the number of trials and errors is not included in the materials, so the figure is very misleading. When evaluating such materials, you may want to suspect a false positive first.
So, what can be done? Bailey et al. (2015) calculate the Minimum Backtest Length (MinBTL). In short, they caution that as the number of trials (and errors) \(N\) increases, the number of years of backtesting \(y\) should increase as well. Let’s show the relationship between \(N\) and the minimum backtest length. As before we assume \(\mu=0\), but now consider \(y\neq 1\). From equation (2), the expected value of the maximum annualized Sharpe ratio statistic is
\[ E[\max[\hat{SR}(q)]_N] \approx y^{-1/2}((1-\gamma)Z^{-1}[1-\frac{1}{N}]+\gamma Z^{-1}[1-\frac{1}{N}e^{-1}]) \] By solving this for \(y\), you can get MinBTL.
\[ MinBTL \approx \left(\frac{(1-\gamma)Z^{-1}[1-\frac{1}{N}]+\gamma Z^{-1}[1-\frac{1}{N}e^{-1}]}{\overline{E[\max[\hat{SR}(q)]_N]}}\right)^2 \] Here, \(\overline{E[\max[\hat{SR}(q)]_N]}\) is an upper bound placed on \(E[\max[\hat{SR}(q)]_N]\): it caps the value that the maximum Sharpe ratio statistic may take across the \(N\) strategies whose true Sharpe ratio is zero. The number of backtesting years \(y\) required to respect that cap is the MinBTL. The plot of MinBTL as a function of \(N\), with \(\overline{E[\max[\hat{SR}(q)]_N]}=1\), is shown below.
# Minimum backtest length (in years) so that the expected maximum Sharpe ratio
# across N zero-skill strategies stays below MaxSR
MinBTL <- function(N, MaxSR){
  return((ExMaxSR(N)/MaxSR)^2)
}
N = 2:100
result = purrr::map_dbl(N, MinBTL, MaxSR = 1)
ggplot2::ggplot(data.frame(MinBTL = result, N = N), aes(x=N, y=MinBTL)) +
  geom_line(size=1) + ylim(0,7)
# One simulated backtest of a zero-skill strategy: the (non-annualized) Sharpe
# ratio of T1 i.i.d. standard normal returns
simSR <- function(T1){
  r = rnorm(T1)
  return(mean(r)/sd(r))
}
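The `simSR` helper above is not used elsewhere in the post; one natural use for it (my addition, not part of the original analysis) is a Monte Carlo check of Proposition 1: simulate \(N\) zero-skill strategies many times, take the maximum annualized Sharpe ratio of each batch, and compare the average with `ExMaxSR(N)`.
set.seed(42)
q = 365; N_strat = 10   # one year of daily data, 10 zero-skill strategies

# Average, over 2,000 experiments, of the best annualized Sharpe ratio among N_strat strategies
mc = replicate(2000, max(replicate(N_strat, simSR(q)) * sqrt(q)))
mean(mc)            # Monte Carlo estimate (around 1.5)
ExMaxSR(N_strat)    # Gumbel approximation from equation (2), about 1.57
The two values should be in rough agreement; the Gumbel result is asymptotic in \(N\), so a small gap at \(N=10\) is expected.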
Reading off the plot: with the cap set at 1, roughly two years of backtest data allows no more than about 7 trials (and errors), and even five years allows only about 45 (see the calls below). It is important to note that staying within the MinBTL does not make overfitting impossible; MinBTL is a necessary condition, not a sufficient one.
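For concreteness, here is what the `MinBTL` function above returns for a few trial counts when the cap is set to 1 (rough guides only, since they rest on the Gumbel approximation):
MinBTL(7,  MaxSR = 1)   # about 1.9 years
MinBTL(14, MaxSR = 1)   # about 3.0 years
MinBTL(45, MaxSR = 1)   # about 5.0 years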
5. In closing
López de Prado (2018) lists the following generic means of preventing overfitting:

1. Develop models for entire asset classes or investment universes, rather than for specific securities. Investors diversify, hence they do not make mistake X only on security Y. If you find mistake X only on security Y, no matter how apparently profitable, it is likely a false discovery.
2. Apply bagging as a means to both prevent overfitting and reduce the variance of the forecasting error. If bagging deteriorates the performance of a strategy, it was likely overfit to a small number of observations or outliers.
3. Do not backtest until all your research is complete.
4. Record every backtest conducted on a dataset so that the probability of backtest overfitting may be estimated on the final selected result (see Bailey, Borwein, López de Prado and Zhu (2017)), and the Sharpe ratio may be properly deflated by the number of trials carried out (Bailey and López de Prado (2014b)); a minimal sketch of this deflation is given after the list.
5. Simulate scenarios rather than history. A standard backtest is a historical simulation, which can be easily overfit. History is just the random path that was realized, and it could have been entirely different. Your strategy should be profitable under a wide range of scenarios, not just the anecdotal historical path. It is harder to overfit the outcome of thousands of “what if” scenarios.
6. Do not research under the influence of a backtest. If the backtest fails to identify a profitable strategy, start from scratch. Resist the temptation of reusing those results.
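Item 4 above mentions deflating the Sharpe ratio by the number of trials. Below is a minimal sketch of the deflated Sharpe ratio of Bailey and López de Prado (2014b) as I understand it; the inputs `r` (the selected strategy's per-period returns) and `sr_trials` (the estimated, non-annualized Sharpe ratios of all N trials) are hypothetical, so please check the original paper before relying on anything like this in practice.
# Deflated Sharpe ratio: probability that the selected strategy's Sharpe ratio
# exceeds the maximum expected from N unskilled trials
deflated_SR = function(r, sr_trials){
  N = length(sr_trials)
  T1 = length(r)
  sr = mean(r) / sd(r)                 # non-annualized Sharpe ratio of the selected strategy
  gamma_ct = -digamma(1)               # Euler-Mascheroni constant
  # Benchmark Sharpe ratio: expected maximum under the null of zero skill,
  # scaled by the cross-trial standard deviation of the estimated Sharpe ratios
  sr0 = sd(sr_trials) *
    ((1 - gamma_ct) * qnorm(1 - 1/N) + gamma_ct * qnorm(1 - 1/(N * exp(1))))
  skew = mean((r - mean(r))^3) / sd(r)^3   # sample skewness
  kurt = mean((r - mean(r))^4) / sd(r)^4   # sample kurtosis
  # Probabilistic Sharpe ratio evaluated at the benchmark sr0
  pnorm((sr - sr0) * sqrt(T1 - 1) / sqrt(1 - skew * sr + (kurt - 1)/4 * sr^2))
}
A value close to 1 suggests the selected strategy's performance is unlikely to be explained by selection among many unskilled trials; a value near 0.5 or below suggests a probable false positive.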
I think items 3 and 6 are the ones most relevant to today’s post. There is plenty of other research in this area, so if you are going to do backtesting on the job it is worth studying the techniques, but I recommend first learning how to run backtests properly.
In this post, I tackled a slightly different topic from usual. I often see backtest results in my own work, and I often run hold-out backtests on this blog myself. I will keep following the research on this topic so that I can understand and evaluate the uncertainty of the results I obtain.