Is My Backtest Overfit? We Ran the Gauntlet on Our Own Strategies

STS Research | Published June 13, 2026 | Data through 2026-06-11

A backtest is overfit when it describes the past instead of an edge. The clean way to find out is to run a few standard tests, and the honest way is to run them on your own system and publish what they say. We did that on our six NQ strategies. Two of the six came back statistically fragile: their Harvey-Liu t-stat fell below 3.0, the rough line for "this looks real." We kept both anyway. One of them is the book's crash hedge, and every one of the six still made money on data the backtest never saw.

Below are the tests that matter, the t-stat for each strategy, and why we kept the two that failed. The book as a whole scores a t-stat of 4.63 against a hurdle of 3.0, so the warts are at the strategy level, not the portfolio level.

Book t-stat (hurdle is 3.0): 4.63

Strategies flagged fragile (t under 3): 2 of 6

Profitable on unseen 2026 data: 6 of 6

Probability of backtest overfitting (under 20% passes): 19.8%

Whose trades are these (read this before using our numbers)

Everything below is measured on our own book: six systematic NQ strategies that run together, one position at a time (we call the combined six "the book"). The numbers come from TradingView backtests covering June 2011 to June 2026, trading one to three contracts scaled by volatility, with commissions and slippage included, for $1,000,906 net. The entries are momentum and trend-continuation, intraday plus one overnight model. They are not mean reversion and not scalping.

That style shapes the test results. A momentum book lives or dies on a handful of big trending moves, which makes its statistics noisier and its overfit tests harder to pass than they would be for a high-win-rate system. Our exact t-stats describe our system. The method, running these tests on your own strategy before you trust it, transfers to any trader, including a discretionary one who wants to know whether a rule is real or just a story the chart told. The numbers themselves do not transfer.

Overfitting, in one plain idea

A strategy has thousands of dials: entry time, stop width, which indicator, what threshold. Turn enough dials and you can make almost any rule look great on past data, because you are fitting the noise, not the signal. That is overfitting, also called curve fitting. The tell is simple. An overfit strategy looks brilliant in the backtest and falls apart the moment it meets new data.

So the question "is my backtest overfit?" really means: how much of this result is edge, and how much is me having tried a lot of things until one looked good? You cannot answer that by staring at the equity curve. A curve-fit strategy and a real one can have the same beautiful curve. You answer it with tests built to punish the trying.

The four tests that actually matter

There are dozens of overfit checks. Four carry most of the weight, and each attacks the problem from a different side.

The Harvey-Liu t-stat. This is the headline test. The t-stat measures how far your average result sits from zero, scaled by how noisy your results are and how many of them you have: it is the mean divided by its standard error. A high t-stat means the profit is unlikely to be luck. Campbell Harvey and Yan Liu, in their 2014 paper on the flood of published trading "factors," argued that the usual bar of 2.0 is far too soft once you account for how many strategies people test before publishing one. Their tougher line is a t-stat above 3.0. We use 3.0 as the pass mark, same as they recommend.
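As a minimal sketch of the arithmetic behind that test, the t-stat is the mean result divided by its standard error. The function name `t_stat` and the sample numbers are illustrative assumptions, and so is the choice to feed it per-trade results; the same formula works on daily P&L.

```python
import math
import statistics

def t_stat(pnl):
    """t-statistic of the mean against zero: mean / (stdev / sqrt(n))."""
    n = len(pnl)
    if n < 2:
        raise ValueError("need at least two results")
    mean = statistics.fmean(pnl)
    stdev = statistics.stdev(pnl)  # sample standard deviation (n - 1 divisor)
    if stdev == 0:
        raise ValueError("results have zero variance")
    return mean / (stdev / math.sqrt(n))

# Hypothetical per-trade results in dollars, for illustration only.
example_pnl = [120.0, -80.0, 200.0, -50.0, 90.0, -30.0, 150.0, -60.0]
t = t_stat(example_pnl)
print(f"t-stat = {t:.2f}, clears the 3.0 hurdle: {t > 3.0}")
```

Eight trades with a positive average still land far below 3.0 here, which is the point of the test: a profitable sample is not the same as a statistically convincing one.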

The Deflated Sharpe Ratio (DSR). A Sharpe ratio (return divided by how bumpy that return is) rewards smooth returns, but it is easy to inflate by testing many variations and reporting the best one. The DSR, from Marcos Lopez de Prado, takes your Sharpe and deflates it by how many configurations you tried and how skewed and fat-tailed your returns are. It answers: given all the trying, what is the chance this Sharpe is genuinely above zero? Higher is better. A whole diversified book can push it toward 100%, but a single optimization-heavy strategy rarely gets there.
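The deflation can be sketched with the closed form published by Bailey and Lopez de Prado: estimate the Sharpe a skill-less search would be expected to reach after N trials, then ask how likely the observed Sharpe is to sit above that bar given the sample length, skew, and kurtosis. This is a sketch under assumptions, not our production script: `n_trials` and `sharpe_std` (the spread of Sharpe ratios across the configurations tried) are inputs you have to estimate honestly, the Sharpe is per-period and not annualized, and the sample returns are synthetic.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()

def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate best Sharpe expected from n_trials configurations with no edge."""
    if n_trials < 2:
        return 0.0
    a = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(returns, n_trials, sharpe_std):
    """Probability that the true Sharpe exceeds the luck-only benchmark."""
    n = len(returns)
    mean = fmean(returns)
    sd = pstdev(returns)
    if n < 3 or sd == 0:
        raise ValueError("need at least three returns with non-zero variance")
    z = [(r - mean) / sd for r in returns]
    skew = fmean([v ** 3 for v in z])
    kurt = fmean([v ** 4 for v in z])  # 3.0 for a normal distribution
    sharpe = mean / sd                 # per-period, not annualized
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    denom = math.sqrt(1.0 - skew * sharpe + (kurt - 1.0) / 4.0 * sharpe ** 2)
    return STD_NORMAL.cdf((sharpe - benchmark) * math.sqrt(n - 1) / denom)

# Synthetic daily returns, for illustration only.
rng = random.Random(7)
daily = [rng.gauss(0.1, 1.0) for _ in range(1000)]
dsr_few = deflated_sharpe(daily, n_trials=10, sharpe_std=0.05)
dsr_many = deflated_sharpe(daily, n_trials=1000, sharpe_std=0.05)
print(f"DSR at 10 trials: {dsr_few:.1%}, at 1,000 trials: {dsr_many:.1%}")
```

Running the same returns at two trial counts shows the sensitivity directly: the more configurations you admit to having tried, the lower the deflated figure.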

Walk-forward, also called out-of-sample testing. Split history into a part the strategy was built on (in-sample) and a part it never touched (out-of-sample). If the edge only shows up in-sample, it was fit to that period. A real edge keeps working out-of-sample. This is the most intuitive test and the hardest to fake. If you only ever run one of these, run this one.
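A minimal sketch of that split, assuming trades arrive as (exit date, profit or loss) pairs. The cutoff date, the trade list, and the helper names are hypothetical. The comparison metric is profit factor: gross wins divided by gross losses, where anything above 1.0 means the period made money.

```python
from datetime import date

def profit_factor(pnl):
    """Gross wins divided by gross losses; above 1.0 means net profitable."""
    gross_win = sum(p for p in pnl if p > 0)
    gross_loss = -sum(p for p in pnl if p < 0)
    if gross_loss == 0:
        return float("inf") if gross_win > 0 else float("nan")
    return gross_win / gross_loss

def split_by_date(trades, cutoff):
    """Trades before the cutoff are in-sample; the rest are out-of-sample."""
    in_sample = [pnl for day, pnl in trades if day < cutoff]
    out_of_sample = [pnl for day, pnl in trades if day >= cutoff]
    return in_sample, out_of_sample

# Hypothetical trades, for illustration only.
trades = [
    (date(2025, 10, 3), 400.0),
    (date(2025, 11, 14), -250.0),
    (date(2025, 12, 9), 300.0),
    (date(2026, 1, 8), -100.0),
    (date(2026, 2, 20), 260.0),
    (date(2026, 3, 5), -100.0),
]
in_sample, out_of_sample = split_by_date(trades, date(2026, 1, 1))
print(f"in-sample PF: {profit_factor(in_sample):.2f}")
print(f"out-of-sample PF: {profit_factor(out_of_sample):.2f}")
```

The split only means something if the cutoff is fixed before any tuning and the out-of-sample trades are never used to pick settings.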

Probability of Backtest Overfitting (PBO). From the same Lopez de Prado line of work, this one is clever. It chops your history into many blocks, and across thousands of combinations it asks how often the configuration that looked best in-sample landed in the bottom half of the pack out-of-sample. If your "best" settings are really just luck, they will drop below the median a lot. PBO is the share of times they do. Lower is better; under 20% is the usual pass.
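The block-and-recombine procedure can be sketched as follows. This is a simplified illustration, not our production script: it scores each configuration by mean return where a Sharpe ratio is more usual, `returns_by_config` is a hypothetical input holding one equal-length return series per configuration tried, and the noise data is synthetic.

```python
import random
from itertools import combinations
from statistics import fmean

def pbo(returns_by_config, n_blocks=8):
    """Share of splits where the in-sample winner ranks below the
    out-of-sample median. Lower is better."""
    n_cfg = len(returns_by_config)
    n_obs = len(returns_by_config[0])
    if n_cfg < 2 or n_blocks < 2 or n_blocks % 2 or n_obs < n_blocks:
        raise ValueError("need >= 2 configs and an even block count <= observations")
    if any(len(r) != n_obs for r in returns_by_config):
        raise ValueError("all configurations need the same number of observations")
    size = n_obs // n_blocks  # any trailing remainder is dropped
    blocks = [list(range(b * size, (b + 1) * size)) for b in range(n_blocks)]

    flops = 0
    splits = 0
    # Every way of choosing half the blocks as in-sample; the rest are out-of-sample.
    for chosen in combinations(range(n_blocks), n_blocks // 2):
        in_idx = [i for b in chosen for i in blocks[b]]
        out_idx = [i for b in range(n_blocks) if b not in chosen for i in blocks[b]]
        in_score = [fmean([r[i] for i in in_idx]) for r in returns_by_config]
        out_score = [fmean([r[i] for i in out_idx]) for r in returns_by_config]
        best = max(range(n_cfg), key=in_score.__getitem__)
        # Rank 1 is the worst out-of-sample result; divide by n_cfg + 1
        # so the relative rank stays strictly between 0 and 1.
        rank = 1 + sum(s < out_score[best] for s in out_score)
        if rank / (n_cfg + 1) < 0.5:
            flops += 1
        splits += 1
    return flops / splits

# Ten configurations of pure noise, for illustration only.
rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(240)] for _ in range(10)]
print(f"PBO on pure noise: {pbo(noise):.1%}")
```

With 8 blocks there are 70 splits; 16 blocks gives 12,870, which is where the "thousands of combinations" come from. On pure noise the in-sample winner has no real advantage, so expect a reading well above the 20% pass line.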

Passing all four does not promise a strategy will make money. It only lowers the odds you are fooling yourself.

The takeaway

No single test proves an edge. The t-stat asks if profit beats luck, the Deflated Sharpe punishes you for trying many versions, walk-forward demands it work on unseen data, and PBO checks whether your best settings were a fluke. We ran all four on our own strategies and the book passed every one, even though, as the next section shows, two of the six strategies did not pass on their own.

Two of our six strategies failed the t-stat hurdle

Here is the part most signal sellers would bury. When we run the gauntlet on each strategy by itself, two of the six fall short of the t-stat-above-3 line. The last column shows the profit factor (gross wins divided by gross losses; above 1.0 means the strategy made money) on data the backtest never saw. The two ORB rows are opening-range breakouts.

Strategy	Direction	Verdict	t-stat	Deflated Sharpe	Held out-of-sample?
Trend	Long	Real	3.25	85%	Yes (PF 1.28)
Long ORB	Long	Borderline	3.46	91%	Yes (PF 1.31)
Simple Short	Short	Borderline	3.23	93%	Yes (PF 4.07)
Short ORB	Short	Fragile	2.71	70%	Yes (PF 1.46)
Overnight	Both	Borderline	3.82	90%	Yes (PF 1.39)
Universal	Both	Fragile	2.80	67%	Yes (PF 1.36)

[Figure: bar chart of the Harvey-Liu t-stat for each of our six NQ strategies, standalone backtests 2011 to 2026, against the real-edge hurdle of 3.0 shown as a dashed line. Trend 3.25, Long ORB 3.46, Simple Short 3.23, and Overnight 3.82 clear the hurdle; Short ORB 2.71 and Universal 2.80 fall below it and are labeled fragile.]

Two of the six strategies, Short ORB and Universal, sit below the 3.0 t-stat line. By the strict test, they are not standalone edges.

None of the six clears a strict Deflated Sharpe above 95%, and at a high assumed trial count the deflated numbers fall further. Read plainly: these strategies are optimization-sensitive. They are not fabricated, but no one of them is a bulletproof, bet-the-house edge on its own.

Why a fragile strategy can still belong in the book

If Short ORB fails the t-stat test, why is it still running? Because the test it fails is the standalone test, and we do not trade these strategies standalone. They run together, one position at a time, and the value of a strategy inside a portfolio is not the same as its value alone.

Short ORB is a short. It loses money when the market grinds up, which is most of the time, and that drag is exactly what pulled its standalone t-stat below 3. But look at where it makes its money. In crash and high-volatility-down months, Short ORB made +$147,686 while our long strategies were bleeding. It is the book's downside insurance. We labeled it a trim candidate three separate times on its weak solo stats, and reversed that every time on the same fact: when the long side is on fire, this is the strategy holding the bucket.

That is the portfolio lesson under all of this. A strategy built to be patient and then violent will always look statistically weak by itself, because the test rewards steady profit. Cutting it would raise the book's average t-stat and lower its survival odds. We keep it.

On data the backtest never saw, all six held up

The single most convincing overfit test is also the simplest: does it work on data you did not build it on? We carved out January 2026 onward as a true out-of-sample window, market data that did not exist when these strategies were designed. It is the only stretch of history none of these strategies could have been fit to. We measured each one cold.

All six made money. Including the two fragile ones.

Strategy	Out-of-sample profit factor	Made money?
Trend	1.28	Yes
Long ORB	1.31	Yes
Simple Short	4.07	Yes
Short ORB	1.46	Yes
Overnight	1.39	Yes
Universal	1.36	Yes

[Figure: bar chart of out-of-sample profit factor for each of the six NQ strategies, January 2026 to June 2026, with a dashed break-even line at 1.0. All six bars sit above break-even: Trend 1.28, Long ORB 1.31, Simple Short 4.07, Short ORB 1.46, Overnight 1.39, Universal 1.36. The two strategies flagged fragile in-sample are labeled and still profitable.]

Profit factor above 1.0 means a strategy made money in the period. On data none of them had seen, all six cleared the line, fragile ones included.

These are small samples, 8 to 54 trades each in this short window, so read them as a check, not a verdict. The high 4.07 sits on the fewest trades, so lean on it least. The fragile strategies are weak, not broken. They held their edge on genuinely unseen data, which is the test an overfit strategy fails.

The book passes every test the strategies struggle with

Put the six together and the picture flips. Because the strategies are nearly uncorrelated (their daily results barely move together, average pairwise correlation about 0.13) and because the longs and shorts fail in opposite conditions, the combined book is far steadier than any single piece.
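A rough sketch of how an average pairwise correlation can be measured, plus an idealized illustration of why low correlation lifts a combined t-stat. The function names and toy series are assumptions. The multiplier assumes k strategies with identical mean and volatility whose daily results are simply summed, which a one-position-at-a-time book with unequal strategies does not satisfy, so treat it as intuition and not as a reconstruction of the book's 4.63.

```python
import math
from itertools import combinations
from statistics import fmean

def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_pairwise_correlation(series_by_name):
    """Mean correlation over every distinct pair of strategies."""
    pairs = combinations(series_by_name.values(), 2)
    return fmean([pearson(x, y) for x, y in pairs])

def t_stat_multiplier(k, rho):
    """Idealized t-stat lift from summing k identical strategies
    with pairwise correlation rho: sqrt(k / (1 + (k - 1) * rho))."""
    return math.sqrt(k / (1.0 + (k - 1) * rho))

# Toy daily P&L series, for illustration only.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
print(f"average pairwise correlation: {average_pairwise_correlation(toy):.2f}")
print(f"idealized lift, k=6, rho=0.13: {t_stat_multiplier(6, 0.13):.2f}x")
```

At a correlation of 1.0 the multiplier is exactly 1: six copies of the same strategy add no statistical strength, which is why the correlation figure matters as much as the strategy count.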

The portfolio t-stat is 4.63, comfortably past the 3.0 hurdle. The Deflated Sharpe sits near 100%. Walk-forward held in 7 of 8 windows. And the PBO, the probability our best-looking settings are luck, comes in at 19.8%, just under the 20% pass line. A pass, but not a comfortable one, so we report it plainly. That book-level PBO is the check on the blend itself. If we had simply fit the mix of six to the past, this is the test that would have caught it. None of those numbers is true of the weak strategies alone. The diversification is doing real work, and these tests are how we measure it.

This is also why we do not sell individual strategies. A single sleeve is an ingredient. The product is the diversified book, because that is the thing the statistics actually support.

Two proof points we keep on the shelf

Two stories from our own logs make the point harder than any test does.

We once ran roughly 1,100 backtests trying to improve our short strategy: different stops, exit times, entry triggers, volatility filters, sizing schemes. Nothing beat the production version inside the book. Several variations looked healthier on their own and every one of them hurt the portfolio. That is overfitting caught in the act. The "improvements" were fitting the past, and the discipline of testing them inside the full book, not alone, is what exposed them. All that trying is also why we deflate the Sharpe by a high trial count. Every one of those attempts is a reason to trust a single pretty result less.

The second is a bug. In June 2026 we found our live book was firing two trades on a single bar when it should hold one position at a time. We fixed it. The fix cost about $110,000 of backtested profit, because honoring one position at a time means turning down trades the buggy version took. We could have quietly kept the bigger number. We changed the number instead. A backtest you are willing to make worse for the sake of accuracy is a backtest you can trust a little more.

What to do with this

To check your own strategy, do not trust the equity curve and do not trust a single test. Run the four together. Hold the t-stat to 3.0, not 2.0. Deflate your Sharpe by the number of versions you actually tried, and be honest about that count. Carve off the most recent stretch of history, never let the strategy see it, and check that the edge survives. And if you ran a parameter sweep, run PBO to see how often your best settings would have flopped out-of-sample.

Then judge each piece at the level it lives at. A strategy can fail the solo test and still earn its seat in a portfolio, the way our crash hedge does, but only if you can show what job it does that the others cannot. If you cannot name that job, the weak strategy is probably just weak, and the honest move is to simplify it or set it aside. See how the six fit together in our six NQ strategies, or read the stop-width study for another case where the obvious backtest answer was the overfit one.

How we measured this

Instrument: CME Nasdaq-100 E-mini (NQ), $100,000 initial capital, no compounding, one to three contracts scaled by volatility. Data: TradingView list-of-trades exports from our live six-strategy book, June 2011 through June 2026, 5,424 combined trades. The per-strategy gauntlet runs on each strategy's standalone export; the t-stats, Deflated Sharpe figures, walk-forward profit factors, and out-of-sample buckets are produced by our own scripts and reproduce on any TradingView export of the same form.

The tests have honest limits. The Deflated Sharpe and PBO both depend on an assumed number of configurations tested; we report the t-stat (which does not) as the primary verdict and show the Deflated Sharpe at two trial counts so the sensitivity is visible. The out-of-sample window (January 2026 onward) is short by design; a five-month window is suggestive, not proof, and we will extend it as time passes. Strategy backtests before about 2011 are small-sample and we do not lean on them. These are hypothetical backtest results, not live fills. The underlying exports are our proprietary trade history, so we cannot publish the raw files, but the methods reproduce on any export, and our book-level numbers reconcile to the full tear sheet and the strategy page. Plans are on the pricing page.

We trade this book live and sell access to the signals, so judge the data accordingly. This article is educational and is not investment advice. Futures trading involves substantial risk of loss and is not suitable for every investor.

Hypothetical performance disclaimer (CFTC Rule 4.41): hypothetical or simulated performance results have certain limitations. Unlike an actual performance record, simulated results do not represent actual trading. Also, since the trades have not been executed, the results may have under- or over-compensated for the impact, if any, of certain market factors, such as lack of liquidity. Simulated trading programs in general are also subject to the fact that they are designed with the benefit of hindsight. No representation is being made that any account will or is likely to achieve profit or losses similar to those shown. Past performance does not indicate future results.