Opening note
A shorter, more readable version of the original archive entry, focused on the parts that remained technically useful.
How a Real Backtest Works
A condensed playbook for removing look-ahead, ghost quotes, binary payout bugs, and fantasy fills before you trust the numbers.
This note is the methods layer behind the Polymarket work. It is short on purpose, but it now reflects the real issues that showed up in the actual repo rather than only the usual textbook warnings.
Three ways a backtest lies
Most inflated results I have seen came from one of these mistakes.
1. Look-ahead inside the bar
If the simulator lets you enter and then stop out using the same future information, you are time traveling.
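The fix is simple and boring: within each bar, process Stop -> Take-profit -> Entry in that order, so a position opened in a bar can never be stopped out or paid out using that same bar's data.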
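The ordering can be sketched as a single per-bar step function. This is a minimal illustration, not the repo's simulator: the `Bar` and `Position` types, the long-only assumption, the entry at the bar close, and the rule that the stop wins when both levels fall inside the same bar are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float


@dataclass
class Position:
    entry: float
    stop: float
    target: float


def step(bar: Bar, pos: Optional[Position], want_entry: bool,
         stop_dist: float, target_dist: float) -> Tuple[Optional[Position], float]:
    """Process one bar for a long-only position. Returns (position, realized_pnl)."""
    pnl = 0.0
    if pos is not None:
        # 1. Stop first. If stop and target are both inside the bar,
        #    assume the worse outcome (hypothetical conservative rule).
        if bar.low <= pos.stop:
            pnl = pos.stop - pos.entry
            pos = None
        # 2. Take-profit second.
        elif bar.high >= pos.target:
            pnl = pos.target - pos.entry
            pos = None
    # 3. Entry last, at the bar close. Exits were already evaluated,
    #    so the new position is never tested against this bar's high/low.
    if pos is None and want_entry:
        pos = Position(entry=bar.close,
                       stop=bar.close - stop_dist,
                       target=bar.close + target_dist)
    return pos, pnl
```

Because entries happen after the exit checks, a bar whose low is far below the new stop still leaves the position open; the stop can only trigger from the next bar onward.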
2. Ghost quotes
If the strategy assumes it trades the quote it sees at time t, it is usually too optimistic.
Real systems need reaction time, network time, and market time. That is why reaction_latency_ms belongs in version one of the backtest, not in the apology after live trading disappoints.
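One way to apply reaction_latency_ms is to price the order against the quote that prevails when the order would arrive, not the quote that triggered the signal. The sketch below assumes quotes are stored as two parallel lists sorted by timestamp; the function name and that layout are illustrative, not taken from the repo.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def quote_at_arrival(quote_ts_ms: Sequence[int], quote_prices: Sequence[float],
                     signal_ts_ms: int, reaction_latency_ms: int
                     ) -> Optional[Tuple[int, float]]:
    """Return (ts, price) of the last quote at or before signal time + latency."""
    arrival_ms = signal_ts_ms + reaction_latency_ms
    # Index of the last quote whose timestamp is <= arrival time.
    i = bisect_right(quote_ts_ms, arrival_ms) - 1
    if i < 0:
        return None  # no quote known yet at arrival time
    return quote_ts_ms[i], quote_prices[i]


ts = [0, 100, 400, 900]
px = [0.50, 0.51, 0.55, 0.53]
print(quote_at_arrival(ts, px, signal_ts_ms=100, reaction_latency_ms=0))    # (100, 0.51)
print(quote_at_arrival(ts, px, signal_ts_ms=100, reaction_latency_ms=300))  # (400, 0.55)
```

In this toy series the signal fires on the 0.51 quote, but with 300 ms of latency the order meets 0.55. That gap is the optimism the zero-latency version hides.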
3. Fantasy liquidity
A visible price is not the same thing as an available fill.
Once you add slippage, min_ask_size, queue effects, or maker fill uncertainty, a strategy that looked clean can collapse very quickly.
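A hedged sketch of a taker buy that respects those constraints follows. The function name, the proportional slippage model, and the rule that the order must fit entirely inside the displayed ask size are assumptions for illustration; queue effects and maker fill uncertainty are not modeled here.

```python
from typing import Optional


def taker_buy_fill(ask_price: float, ask_size: float, order_size: float,
                   slippage_pct: float = 0.01,
                   min_ask_size: float = 0.0) -> Optional[float]:
    """Return the assumed fill price, or None when the fill is not credible."""
    # Skip books that are too thin to trust at all.
    if ask_size < min_ask_size:
        return None
    # Skip orders larger than the size actually displayed at the ask.
    if ask_size < order_size:
        return None
    # Pay a proportional slippage penalty on top of the visible price.
    return ask_price * (1.0 + slippage_pct)
```

Returning None instead of a price forces the caller to count the trade as not taken, which is usually the honest outcome when the book is thin.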
The bugs that make bad backtests look brilliant
The repository itself ended up surfacing a few very concrete failure modes.
- A taker backtest once used a binary label as a fallback payout when a future exit price was missing. That can fake a clean $1 / $0 outcome and inflate PnL immediately.
- A maker simulator can count the same adverse event multiple times if it reevaluates every tick without a queue penalty or cancel cooldown.
- The final minute of a 15-minute market is often trivial or structurally unrealistic, so including it can create fake signal.
These are not abstract concerns. They are exactly the kinds of things that create "too good to be true" metrics while still looking technically respectable.
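The three failure modes above each have a small guard. The helpers below are a sketch with hypothetical names: they drop trades with no exit price instead of substituting the label, count an adverse event at most once per cooldown window, and mask the final seconds of a market.

```python
from typing import Iterable, Optional


def taker_pnl(entry_price: float, exit_price: Optional[float]) -> Optional[float]:
    """PnL per share, or None when the exit price is missing."""
    if exit_price is None:
        # Drop the trade; never fall back to the 0/1 outcome label.
        return None
    return exit_price - entry_price


def count_adverse_events(adverse_ts_ms: Iterable[int], cooldown_ms: int) -> int:
    """Count adverse ticks, at most one per cooldown window (sorted input)."""
    count = 0
    last_counted = None
    for ts in adverse_ts_ms:
        if last_counted is None or ts - last_counted >= cooldown_ms:
            count += 1
            last_counted = ts
    return count


def in_tradable_window(ts_ms: int, market_end_ms: int,
                       exclude_last_ms: int = 60_000) -> bool:
    """True when the timestamp falls before the excluded final segment."""
    return ts_ms < market_end_ms - exclude_last_ms
```

For example, `count_adverse_events([0, 10, 20, 1000, 1005], cooldown_ms=500)` returns 2 where a naive per-tick count would report 5.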
The minimum honest setup
For short-horizon systems, I now treat these as baseline conditions:
- event ordering that matches real execution,
- temporal train and test splits by slug or interval,
- reaction latency, for example 300-1000 ms,
- slippage assumptions, for example 1-2%,
- liquidity constraints instead of assuming the book always wants you,
- explicit handling of missing future prices instead of magical fallbacks,
- exclusion of the final 60 seconds when that segment is structurally misleading.
Anything less can still help with exploration, but not with confidence.
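Of these conditions, the temporal split by slug is the easiest to get subtly wrong, because a row-level random split leaks ticks from the same market into both sets. A minimal sketch, assuming each row is a dict with hypothetical `slug` and `ts_ms` keys and that markets do not overlap in time:

```python
from typing import Dict, List, Tuple


def temporal_split_by_slug(rows: List[dict], train_frac: float = 0.7
                           ) -> Tuple[List[dict], List[dict]]:
    """Split rows so whole markets go to train or test, earliest markets first."""
    # Earliest timestamp seen for each market slug.
    first_seen: Dict[str, int] = {}
    for r in rows:
        s = r["slug"]
        first_seen[s] = min(first_seen.get(s, r["ts_ms"]), r["ts_ms"])
    # Order markets by start time and cut once; no market straddles the cut.
    ordered = sorted(first_seen, key=first_seen.get)
    cut = int(len(ordered) * train_frac)
    train_slugs = set(ordered[:cut])
    train = [r for r in rows if r["slug"] in train_slugs]
    test = [r for r in rows if r["slug"] not in train_slugs]
    return train, test
```

If markets can overlap in time, a gap between the last training market and the first test market would also be needed; that is left out of this sketch.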
Alignment and leakage matter as much as fills
A second class of problems comes from data alignment.
- In asynchronous systems, two feeds almost never update in the same millisecond.
- "Last known value" with local receipt time is often the honest solution.
- Feature windows must be based on what your machine had actually received, not on a future-friendly exchange timestamp.
That is why the HFT pipeline ended up caring so much about local_receipt_ts_ms, freshness columns, and even explicit leakage validation. A model can be mathematically correct and still be temporally dishonest.
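The "last known value" rule with a freshness check can be sketched as an as-of lookup keyed on local receipt time. The function name, the parallel-list layout, and the `max_age_ms` cutoff are assumptions for the example; only local_receipt_ts_ms and the idea of freshness columns come from the pipeline described here.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def asof_value(local_receipt_ts_ms: Sequence[int], values: Sequence[float],
               decision_ts_ms: int, max_age_ms: Optional[int] = None
               ) -> Tuple[Optional[float], Optional[int]]:
    """Return (value, age_ms) of the last value received at or before decision time."""
    # Only rows the machine had already received are eligible.
    i = bisect_right(local_receipt_ts_ms, decision_ts_ms) - 1
    if i < 0:
        return None, None  # nothing had arrived yet
    age_ms = decision_ts_ms - local_receipt_ts_ms[i]
    # Treat a stale value as missing and keep the age for a freshness column.
    if max_age_ms is not None and age_ms > max_age_ms:
        return None, age_ms
    return values[i], age_ms


ts = [100, 250, 900]
vals = [1.0, 1.1, 1.3]
print(asof_value(ts, vals, decision_ts_ms=300))                  # (1.1, 50)
print(asof_value(ts, vals, decision_ts_ms=800, max_age_ms=500))  # (None, 550)
```

A simple leakage check follows from the same structure: every feature row should satisfy receipt time <= decision time, and any row that violates it means the join looked into the future.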
A sanity table worth keeping
| Setup | Hit rate shape | Typical interpretation |
|---|---|---|
| No latency | Very high | Flattering, not actionable |
| 300 ms reaction | Drops hard | A more honest baseline |
| 300 ms + slippage + liquidity filter | Often weak or negative | Where many fake edges disappear |
If the edge only lives in the first row, it is probably not an edge.
What even an honest simulator still misses
A better backtest is still not the market itself.
- Queue position is often simplified.
- Round-trip timing is never perfect.
- Adverse selection is still partly approximated.
- Maker survival is not the same thing as a real fill unless you have tape or explicit fill evidence.
So the simulator is a filter. It is not a guarantee.
The practical rule
Use the backtest to disqualify ideas aggressively, not to fall in love with them.
That rule is what kept the 15-minute system grounded and what made the HFT work much harder to fake. If the numbers survive realistic pain and realistic timestamps, then the strategy has earned the right to more attention. If not, the simulator already did its job.