Strategy Analytics · Concept Primer · Jun 12, 2026 · 7 min read

Backtested and Forward-Tested Results in Futures Trading

Backtested results show how a strategy would have done on past data. Forward tests show what it actually does on live broker fills. The distinction matters.

By Imperial Analytics

A trader who tests a strategy on historical data and a trader who tests a strategy on live fills are answering two different questions. The first asks what the rules would have produced inside a record that already exists. The second asks what the rules actually produce inside a record that does not yet exist. The numbers from those two tests can differ by a wide margin, and the difference is itself a strategy signal. This post defines backtested and forward-tested results, explains why they tend to diverge, and walks through the conventions for reading them together.

What backtested and forward-tested results actually are

A backtested result is the per-trade record a strategy would have produced if it had been executed mechanically across a window of historical price data. A forward-tested result is the per-trade record the same strategy actually produces on live fills, in real time, on a small share of the trader's normal size. Both records contain a win rate, an average winner, an average loser, and an expectancy. The backtest computes those numbers against the past. The forward test reports those numbers as they happen.
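As a minimal sketch of those four numbers, the function below summarizes a list of per-trade dollar results. The function name, the sample values, and the choice to count breakeven trades with the losers are all assumptions made for illustration; the same function would apply to a backtest record or a forward-test record.

```python
from statistics import mean


def per_trade_stats(pnls):
    """Summarize per-trade P&L values (dollars) into the four headline numbers."""
    if not pnls:
        raise ValueError("need at least one trade")
    winners = [p for p in pnls if p > 0]
    # Assumption: breakeven trades are grouped with losers, not winners.
    losers = [p for p in pnls if p <= 0]
    win_rate = len(winners) / len(pnls)
    avg_winner = mean(winners) if winners else 0.0
    # Average loser is reported as a positive dollar amount.
    avg_loser = -mean(losers) if losers else 0.0
    # Expectancy: average dollars gained or lost per trade.
    expectancy = win_rate * avg_winner - (1 - win_rate) * avg_loser
    return {
        "win_rate": win_rate,
        "avg_winner": avg_winner,
        "avg_loser": avg_loser,
        "expectancy": expectancy,
    }


# Illustrative values only, not real trading results.
sample = [200.0, -100.0, 150.0, -50.0]
print(per_trade_stats(sample))
```

Computed this way, expectancy equals the plain average of the per-trade results, which is a quick consistency check on any record.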

Three details are worth holding from the start. The first is that a backtest is mechanical: every trade is taken if and only if the rules say so, on every bar of the historical sample, with no discretion and no override. A forward test is also mechanical when run correctly, but it runs on the live tape with the live latency, the live spread, and the live slippage, which the backtest cannot reproduce. The backtest reports a clean record on prices that already settled; the forward test reports a record on fills the broker actually returns.

The two records also live on different distributions. A backtest is a single sample drawn from a fixed past. Re-running the same backtest on the same data produces the same numbers. A forward test is a sample drawn from a market that is still evolving. Re-running a forward test under different live conditions produces different numbers, because the market itself is different. The forward test's variance is real; the backtest's variance is an artifact of how the historical window was sliced.

The third detail is the most easily missed. A backtest's expectancy is an answer to "what would have happened." A forward test's expectancy is an answer to "what is happening." A trader who acts on the first answer as if it were the second is treating a counterfactual as a measurement.

Why backtested results overstate the realized edge

Backtested results overstate the realized edge for three reasons that compound. The rules were chosen on the same data that scores them, so any noise that happened to favor the rules is counted twice. The fills are perfect because the historical record does not preserve real slippage. The trader's execution discipline is assumed, which understates the cost of running the strategy live. Each of these biases tilts the backtest upward, often by a meaningful share of the reported expectancy.

The first bias is the in-sample fit problem. A strategy is rarely written down with no data in view. The trader chose the lookback, the stop distance, the target multiple, and the filter thresholds in part because they performed well on the data the trader had. The data that informed the rules and the data that scores the rules are the same data, so the backtest cannot distinguish between an edge the rules captured and a pattern the rules accidentally fit. The longer the parameter list, the more room there was to fit noise. The cleaner the apparent backtest, the more reasonable it is to suspect the cleanness came partly from selection.

The second bias is the fill problem. A backtest typically assumes a fill at the bar's open, the close, or the high or low of the trigger bar. The actual broker fill on a live order will land at some price the historical bar does not record: a few ticks of slippage on a market order, a partial fill on a limit order in fast markets, or no fill at all on a limit order whose price the bar touched while the order was still too far back in the queue to trade. Each of those costs is small per trade and accumulates across a sample. A trader who runs the same strategy on the same prices in a backtest and on live fills will tend to see the backtest's expectancy sit higher than the live expectancy, even when both records agree on every entry and exit signal.

The third bias is the execution-discipline assumption. A backtest assumes the trader takes every signal, holds to every stop, and exits at every target without hesitation. The forward test reveals what the trader actually does. A real trader misses a few signals during a coffee refill, widens a few stops in the moment, exits a few trades early on a feeling, and the journal records all of it. The backtest does not. The difference between mechanical execution in the backtest and discretionary execution in the live test is itself a strategy cost, and it shows up only in the forward test.

What forward-tested results add that backtests cannot

A forward test produces the only record that includes live slippage, live spread, the trader's actual execution discipline, and the current market regime. None of those four variables sit inside a backtest. The forward test's expectancy is therefore the only honest estimate of what the strategy will do at full size in the period ahead, and the gap between the backtest's expectancy and the forward test's expectancy is the strategy's true cost of going live.

The slippage and spread cost is the most easily measured. A trader who logs every fill, the bar's price at the moment the order was sent, and the actual price the broker filled at, can sum the difference across every trade. That sum is the dollar cost of execution against the backtest's perfect-fill assumption. It is not noise; it is a structural cost of running the strategy on a real venue. The forward test reports it; the backtest cannot.
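The summation described above can be sketched as follows. The `Fill` record and its field names are hypothetical, and the point value of 5.0 dollars per index point is the figure assumed here for MES; check it against the contract specification for whatever instrument is actually traded.

```python
from dataclasses import dataclass


@dataclass
class Fill:
    side: str           # "buy" or "sell"
    reference: float    # bar price at the moment the order was sent
    filled: float       # price the broker actually returned
    contracts: int
    point_value: float  # dollars per one point of price, per contract


def slippage_dollars(fill):
    """Dollar cost of one fill versus its reference price (positive = worse)."""
    if fill.side not in ("buy", "sell"):
        raise ValueError(f"unknown side: {fill.side!r}")
    # A buy filled higher, or a sell filled lower, is adverse slippage.
    sign = 1 if fill.side == "buy" else -1
    return sign * (fill.filled - fill.reference) * fill.contracts * fill.point_value


def total_slippage(fills):
    """Sum of slippage across every logged fill."""
    return sum(slippage_dollars(f) for f in fills)


# Illustrative fills only.
fills = [
    Fill("buy", reference=5000.00, filled=5000.25, contracts=1, point_value=5.0),
    Fill("sell", reference=5010.00, filled=5009.50, contracts=1, point_value=5.0),
]
print(total_slippage(fills))  # 3.75
```

A negative total would mean the fills were, on balance, better than the reference prices, which can happen with resting limit orders.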

The execution-discipline cost is the next layer. The forward test captures every trade the trader actually took, including the trades the rules did not call for and the rules-called trades the trader skipped. The backtest assumes neither category exists. The gap between the rules' theoretical trade list and the trader's actual trade list is a behavioral cost, and the trader can size it from the forward-test record by comparing rule-called trades to actually-taken trades over the same window.
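One way to size that gap is a set comparison between the signals the rules called and the trades actually taken. The sketch below assumes each signal and each trade carries a shared identifier (the IDs shown are hypothetical); how those identifiers are assigned is left to the trader's journal.

```python
def execution_gap(rule_called, taken):
    """Compare the rules' trade list with the trader's actual trade list."""
    rule_called, taken = set(rule_called), set(taken)
    missed = rule_called - taken        # rules said trade, trader did not
    unscheduled = taken - rule_called   # trader traded, rules did not call it
    followed = rule_called & taken
    # Share of rule-called trades that were actually taken.
    adherence = len(followed) / len(rule_called) if rule_called else None
    return {
        "missed": sorted(missed),
        "unscheduled": sorted(unscheduled),
        "adherence": adherence,
    }


# Hypothetical signal IDs over one test window.
report = execution_gap(
    rule_called=["s1", "s2", "s3", "s4"],
    taken=["s1", "s3", "x9"],
)
print(report)
```

Attaching a dollar result to each missed and unscheduled trade turns the two lists into a per-trade execution-gap cost.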

The current-regime variable is the largest and the hardest to measure. A backtest scored across the last five years includes regimes the current market is no longer in. The forward test scores only the regime the market is in now. If realized volatility has doubled or halved since the backtest's window, the rules will hit different triggers, the stops will fill differently, and the per-trade math will move. A two-year backtest with a six-week forward test is comparing a long historical mixture against a short live segment. The mixture's expectancy is informative; the live segment's expectancy is binding.

Note

A backtest answers what the rules would have produced on data that already exists. A forward test answers what the rules actually produce on data that does not yet exist. The trader's account depends on the second answer.

How to design a forward test that produces a decision-quality answer

A forward test produces a decision-quality answer when it runs for long enough to clear a sample-size minimum, runs at a size small enough to survive the worst plausible drawdown, runs on the same instrument and the same rules as the intended production strategy, and is recorded with the same per-trade fields the live system will track. Skipping any one of those four conditions leaves the forward-test result either too noisy to read or too unrepresentative to act on.

The sample-size condition is the load-bearing one. A forward test on ten trades is mostly noise. Twenty trades in the matching condition is the minimum the AI Operating Charter uses for any behavioral pattern claim, and the same floor is reasonable for a strategy claim where the trader needs a per-trade expectancy estimate that means something. Fifty or one hundred trades narrows the band further. A trader who calls a forward test conclusive after eight trades is treating variance as signal.
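To see how quickly the band narrows, the sketch below computes the binomial standard error of the win rate at several sample sizes. It assumes trades are independent, which real trade sequences only approximate, and it borrows the 51 percent win rate from this post's illustrative example.

```python
from math import sqrt


def win_rate_standard_error(win_rate, n_trades):
    """Binomial standard error of an observed win rate (independent trades assumed)."""
    if n_trades <= 0:
        raise ValueError("n_trades must be positive")
    return sqrt(win_rate * (1 - win_rate) / n_trades)


for n in (10, 20, 50, 100):
    se = win_rate_standard_error(0.51, n)
    print(f"{n:>3} trades: 51% win rate, one standard error = {se * 100:.1f} points")
```

Under these assumptions one standard error is about 16 percentage points at ten trades, 11 at twenty, 7 at fifty, and 5 at one hundred.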

The size condition is what makes the test survivable. The forward test should run at a fraction of the trader's normal size, sized so that a string of losses inside the test window does not threaten the account. The math is straightforward: pick the worst plausible losing streak from the backtest, multiply by the per-trade stop in dollars at the forward-test size, and confirm that loss is one the trader can absorb without ending the test early. A test that ends early because the drawdown frightened the trader produces no answer at all.
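That sizing check can be written out in a few lines. The streak length, stop size, and absorbable loss below are hypothetical inputs chosen only to show the arithmetic.

```python
def streak_loss(worst_streak, stop_dollars_per_contract, contracts):
    """Dollar loss if the worst plausible losing streak hits at this size."""
    return worst_streak * stop_dollars_per_contract * contracts


def max_test_contracts(worst_streak, stop_dollars_per_contract, absorbable_loss):
    """Largest whole-contract size whose streak loss stays within the limit."""
    per_contract = worst_streak * stop_dollars_per_contract
    if per_contract <= 0:
        raise ValueError("streak and stop must be positive")
    return int(absorbable_loss // per_contract)


# Hypothetical inputs: an 8-trade losing streak from the backtest,
# a $50 stop per contract, and $1,000 the trader can absorb.
size = max_test_contracts(8, 50.0, 1000.0)
print(size, streak_loss(8, 50.0, size))  # 2 800.0
```

If the result is zero contracts, the test cannot be run survivably on that instrument at that stop distance, and a smaller contract or a tighter stop is the thing to revisit.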

Data note

Numerical examples in this post are illustrative. A forward test on a trader's own data only earns a strategy-level conclusion when the sample meets the minimums named in the AI Operating Charter: twenty trades in the matching condition for the per-trade expectancy claim, with tighter floors for any time-of-day or day-of-week sub-claim inside the forward-test window.

The same-rules condition matters because a forward test on adjusted rules is not a forward test of the original strategy. A trader who softened the filter mid-test, widened the stop after a tough week, or moved to a different instrument has rebuilt the strategy, and the test window now spans two strategies stitched together. The honest version of the test runs the production rules unchanged. If the rules need adjustment, the right move is to end the test, change the rules, and start a new test against the new rules.

The same-fields condition is the discipline that lets the forward test be compared to the backtest meaningfully. Both records should carry the entry price, the exit price, the position size, the stop distance, the actual fill prices, the time of entry and exit, the setup tag, and the instrument. With those fields on both sides, the backtest and the forward test can be lined up trade-for-trade, and the gap between them can be decomposed: slippage cost, execution gap, regime drift, sample noise.
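A shared record shape makes the trade-for-trade comparison mechanical. The sketch below is one possible layout, not a prescribed schema: the class name, the `pnl` field, and the choice to match trades on instrument, setup tag, and entry time are assumptions, and the sample values are invented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TradeRecord:
    instrument: str
    setup_tag: str
    entry_time: datetime
    exit_time: datetime
    entry_price: float   # price the rules called for
    exit_price: float
    entry_fill: float    # actual fill (equals entry_price in a backtest)
    exit_fill: float
    position_size: int
    stop_distance: float
    pnl: float           # assumption: realized dollars, stored on both sides


def match_key(trade):
    # Assumption: both records stamp a trade with the same signal-bar time.
    return (trade.instrument, trade.setup_tag, trade.entry_time)


def line_up(backtest, forward):
    """Pair each backtest trade with the forward-test trade sharing its key."""
    fwd = {match_key(t): t for t in forward}
    return [(b, fwd[match_key(b)]) for b in backtest if match_key(b) in fwd]


def mean_pnl_gap(pairs):
    """Average backtest-minus-forward P&L across matched trades."""
    if not pairs:
        return None
    return sum(b.pnl - f.pnl for b, f in pairs) / len(pairs)


t0, t1 = datetime(2026, 6, 1, 9, 35), datetime(2026, 6, 1, 10, 5)
bt = TradeRecord("MES", "orb", t0, t1, 5000.0, 5010.0, 5000.0, 5010.0, 1, 4.0, 50.0)
fw = TradeRecord("MES", "orb", t0, t1, 5000.0, 5010.0, 5000.25, 5009.75, 1, 4.0, 47.5)
print(mean_pnl_gap(line_up([bt], [fw])))  # 2.5
```

Backtest trades with no forward-test match are the missed trades, so the unmatched remainder feeds the execution-gap estimate directly.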

How to read backtested and forward-tested results together

The backtest sets the historical expectation; the forward test reports the current realization. The honest read is the forward test's per-trade math, adjusted for sample size, with the backtest's metrics as a sanity benchmark. If the forward test's expectancy is within the backtest's variance band after accounting for slippage and execution gap, the strategy is performing as designed. If the forward test's expectancy sits below that band, the strategy is decaying, the live cost is higher than the backtest assumed, or both. The conclusion is data-driven, not narrative.

The arithmetic of comparison is straightforward. Take the backtest's per-trade expectancy, subtract a slippage estimate calibrated from the trader's broker history, and subtract an execution-gap estimate from the forward-test record's missed and unscheduled trades. That gives a forward-test target: the per-trade expectancy the strategy should produce live if the backtest's edge survived the move to real fills. Compare the forward test's actual per-trade expectancy against that target. The gap, with its standard-error band, is the answer to whether the strategy is still working.

A worked example illustrates the read. Suppose the backtest produced a per-trade expectancy of $48 across 400 trades, with an average winner of $185 and an average loser of $95 at a 51 percent win rate. The trader estimates live slippage at $6 per trade and an execution-gap cost of another $4 per trade. The forward-test target is roughly $38 per trade ($48 less $6 less $4). The forward test runs for 30 trades and reports a per-trade expectancy of $41. The standard error on the win rate at 30 trades is about nine percentage points, which translates to roughly $26 of per-trade expectancy if every winner and loser landed at its average, so $41 and $38 are statistically indistinguishable and the read is that the strategy is performing in line with the backtest after accounting for live costs. If the forward-test expectancy had come in at $5 per trade, the $33 gap would be a little more than one standard error at that sample size: a warning worth acting on by extending the test, not yet proof that the live edge is materially smaller than the backtest implied. The same $5 reading sustained across 100 trades, where the standard error falls to roughly $14, would clear a two-standard-error band and support that conclusion.
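The example's arithmetic can be checked with a short sketch. The standard-error function uses a two-outcome simplification, in which every winner pays the average winner and every loser costs the average loser, and it assumes independent trades; real trade results are more dispersed, so the true band is likely somewhat wider. The function names are illustrative.

```python
from math import sqrt


def expectancy(win_rate, avg_winner, avg_loser):
    """Per-trade expectancy in dollars."""
    return win_rate * avg_winner - (1 - win_rate) * avg_loser


def forward_test_target(backtest_expectancy, slippage, execution_gap):
    """Backtest expectancy less the estimated live costs per trade."""
    return backtest_expectancy - slippage - execution_gap


def expectancy_standard_error(win_rate, avg_winner, avg_loser, n_trades):
    """Standard error of expectancy under the two-outcome simplification."""
    per_trade_sd = (avg_winner + avg_loser) * sqrt(win_rate * (1 - win_rate))
    return per_trade_sd / sqrt(n_trades)


backtest = expectancy(0.51, 185, 95)            # about 47.80, the post's "$48"
target = forward_test_target(48, 6, 4)          # 38
se = expectancy_standard_error(0.51, 185, 95, 30)
gap = 41 - target                               # forward result minus target

print(f"backtest expectancy: {backtest:.2f}")
print(f"forward-test target: {target}")
print(f"standard error at 30 trades: {se:.2f}")
print(f"gap in standard errors: {gap / se:.2f}")
```

With these inputs the standard error is about $26 per trade, so the $3 gap between $41 and $38 is roughly a tenth of one standard error.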

The structural protection against reading the comparison wrong is to treat the backtest as a prior and the forward test as the update. The backtest's numbers are the strongest information the trader has before live data exists. The forward test is the live data that updates the prior. As more forward-test trades accumulate, the forward test's per-trade math should carry more weight than the backtest's, until the forward test is itself the production record and the backtest is a footnote about how the trader chose the rules. The trader who treats the backtest as the binding answer after a forward test has run is anchored to the past at the cost of the present.
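The post does not prescribe a weighting formula, so the sketch below is only one simple possibility: treat the cost-adjusted backtest figure as if it were worth a fixed number of live trades, then take a trade-weighted average. The `prior_weight_trades` value of 40 is a hypothetical choice, not a recommendation, and the dollar figures come from the illustrative example.

```python
def blended_expectancy(prior_expectancy, prior_weight_trades,
                       forward_expectancy, forward_trades):
    """Trade-weighted blend of a backtest-derived prior and the live record."""
    if prior_weight_trades < 0 or forward_trades < 0:
        raise ValueError("weights must be non-negative")
    total = prior_weight_trades + forward_trades
    if total == 0:
        raise ValueError("need some weight on the prior or the forward test")
    return (prior_weight_trades * prior_expectancy
            + forward_trades * forward_expectancy) / total


# Prior: the $38 cost-adjusted target, weighted as 40 pseudo-trades (assumption).
for n in (0, 30, 100, 400):
    blend = blended_expectancy(38.0, 40, 41.0, n)
    forward_share = n / (40 + n)
    print(f"{n:>3} live trades: blend = {blend:.2f}, forward weight = {forward_share:.0%}")
```

With no live trades the blend is the prior itself, and as the live count grows the blend converges on the forward-test figure, which is the behavior the paragraph above describes.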

Frequently asked questions

Q: How many trades does a forward test need before the per-trade math means something?
A: Twenty trades in the matching condition is a reasonable floor for a per-trade expectancy claim, and even that carries a wide standard-error band on the win rate. Fifty to one hundred trades is the range at which the band narrows enough to support a conclusion about whether the live edge matches the backtest's edge inside reasonable sampling tolerance.

Q: Is a forward test the same as a paper trade?
A: A paper trade is a forward test without real broker fills, and it inherits the backtest's fill problem. A real forward test runs at small size with the broker actually filling the orders, which is the only way to capture live slippage, partial fills, and the trader's own execution discipline. A paper trade is useful for system mechanics and for catching coding errors; it is not a substitute for a live forward test when the question is whether the edge survives real fills.

Q: Should the forward test use the same instrument as the production strategy?
A: Yes. A forward test on a different instrument tests a different strategy. Slippage, spread, liquidity, session structure, and tick value all change across instruments, and any of those changes can flip the per-trade math. If the production strategy will run on MES, the forward test should run on MES.

Q: What does it mean if the backtest is strong and the forward test is weak?
A: Three explanations are plausible and not mutually exclusive. The backtest may have fit noise that does not exist in live data, which is an overfitting problem. The strategy's edge may have decayed since the backtest's window, which is a regime problem. The trader's execution may be giving back the edge in slippage or discipline cost, which is a live-cost problem. The forward-test record itself, especially the trade-by-trade comparison to the rule-called trade list, usually says which explanation is dominant.

forward testing, backtesting, strategy analytics, expectancy, futures