Opening note
A shorter, more readable version of the original archive entry, focused on the parts that remained technically useful.
HFT on Polymarket: Model, Rust, and the 98% Lie
Phase 3: second-scale features, recorder-first infrastructure, and the point where even strong classifier metrics were not enough.
After long-horizon prediction hit data bottlenecks and the 15-minute system showed little durable edge, the next logical question was narrower: could second-scale signals still contain something useful?
That pushed the work toward HFT-style infrastructure. The problem was no longer just "can I train a model?" It was "can I observe, align, and execute honestly enough that the model means anything at all?"
HFT stack map
The HFT phase became a recorder-first stack: capture fast, prepare honestly, model carefully, and keep blocking live actions away from the hot path.
Recorder
The ingest layer that had to be trusted before anything else.
Prepared data
The data ladder between raw capture and ML.
Live architecture
The production-minded split between fast thinking and slow action.
This is the phase where architecture mattered most: good inference metrics without trustworthy timestamps and execution discipline were not enough.
Why HFT was the next step
The thesis here was much more specific than before.
- Maybe Polymarket lagged the reference market.
- Maybe order-book imbalance and short lag features contained signal.
- Maybe the real edge was not prediction alone, but data freshness and execution discipline.
That meant the project had to become infrastructure-first.
The architecture that made it possible
I split the stack by responsibility.
Rust on the hot path
- WebSockets for Binance and Polymarket.
- Event-driven waiting on the next Binance tick instead of sleepy polling.
- A single-source feature path for the HFT vector, including spread, OBI, and lag features.
- Recorder and latency primitives exposed through `poly_rust_core`.
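The event-driven waiting described above can be sketched in Python with `asyncio` (the real hot path is Rust, and the tick shape and function names here are invented): the feature loop blocks on the next tick instead of sleep-polling.

```python
import asyncio

async def tick_producer(queue: asyncio.Queue) -> None:
    # Stand-in for the Binance WebSocket: pushes ticks as they arrive.
    for price in (101.2, 101.3, 101.1):
        await queue.put({"price": price})
        await asyncio.sleep(0.01)
    await queue.put(None)  # end-of-stream sentinel

async def feature_loop(queue: asyncio.Queue) -> list:
    # Event-driven: block on the next tick, wake only when data exists.
    seen = []
    while (tick := await queue.get()) is not None:
        seen.append(tick["price"])
    return seen

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    producer = asyncio.create_task(tick_producer(queue))
    prices = await feature_loop(queue)
    await producer
    return prices

prices = asyncio.run(main())
print(prices)  # every tick processed, none missed by a polling interval
```

The point of the pattern is that latency between tick arrival and feature computation is bounded by scheduling, not by an arbitrary polling interval.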
Python around it
- Recorder orchestration and token refresh.
- JSONL to Parquet conversion, including incremental ETL.
- `prepared_l1` and `prepared_l2` dataset generation for train, val, and test.
- LightGBM training, experiment sweeps, visualization, and paper/live wrappers.
This division worked well because it kept the timing-sensitive path in Rust while leaving iteration flexible in Python.
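The incremental side of the JSONL-to-Parquet ETL reduces to a byte-offset checkpoint: each run converts only records appended since the last one. A stdlib toy of that idea (the function name is hypothetical, and the real pipeline writes Parquet rather than returning dicts):

```python
import io
import json

def incremental_read(stream, checkpoint: int):
    """Read only JSONL records appended since `checkpoint` (a byte offset),
    returning the new rows and the next checkpoint."""
    stream.seek(checkpoint)
    rows = [json.loads(line) for line in stream if line.strip()]
    return rows, stream.tell()

# In-memory stand-in for a recorder JSONL file.
buf = io.StringIO('{"ts": 1}\n{"ts": 2}\n')
rows, ckpt = incremental_read(buf, 0)      # first run: everything
buf.write('{"ts": 3}\n')                   # recorder appends a new tick
new_rows, ckpt = incremental_read(buf, ckpt)  # second run: only the delta
print([r["ts"] for r in rows], [r["ts"] for r in new_rows])
```

Persisting `ckpt` between runs is what makes the ETL restartable without reprocessing the whole capture.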
Data quality mattered more than model complexity
The real improvement in this phase was the recorder and the ETL around it.
- Each tick carried `local_receipt_ts_ms`.
- Alignment followed a "last known value" philosophy instead of pretending two feeds share exact timestamps.
- Freshness columns such as `binance_age_ms` and `poly_age_ms` turned misalignment into explicit information.
- Training rows could be filtered by latency thresholds such as 200-500 ms.
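The "last known value" alignment plus freshness filtering can be sketched in a few lines of pure Python (field names and the sample ticks are invented; only the philosophy matches the text): each Polymarket tick gets the most recent Binance state, an explicit age column, and stale rows are dropped.

```python
def align_last_known(poly_ticks, binance_ticks, max_age_ms=500):
    """Attach the last known Binance tick to each Polymarket tick,
    record freshness explicitly, and filter rows older than max_age_ms.
    Both inputs must be sorted by local receipt timestamp."""
    rows, j = [], -1
    for p in poly_ticks:
        # Advance to the latest Binance tick at or before this poly tick.
        while j + 1 < len(binance_ticks) and binance_ticks[j + 1]["ts"] <= p["ts"]:
            j += 1
        if j < 0:
            continue  # no Binance state known yet
        age = p["ts"] - binance_ticks[j]["ts"]
        if age <= max_age_ms:  # latency-threshold filter
            rows.append({**p, "binance_px": binance_ticks[j]["px"],
                         "binance_age_ms": age})
    return rows

poly = [{"ts": 100, "px": 0.52}, {"ts": 900, "px": 0.53}]
binance = [{"ts": 50, "px": 101.0}, {"ts": 120, "px": 101.5}]
aligned = align_last_known(poly, binance)
print(aligned)  # the ts=900 row is dropped: its Binance state is 780 ms old
```

Keeping the age as a column rather than silently joining is what turns misalignment into information the model can see.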
The project also includes explicit leakage checks and dataset preparation stages, which made the pipeline feel much closer to production research than to a simple backtest.
To make that stage visible instead of abstract, I embedded a real `source_recorded_l1_sol` feature explorer inside the site. It lets you inspect a recorded slug, compare raw versus smoothed series, and see the sort of feature-debugging surface that sat behind `prepared_l1`.
In this phase, the recorder mattered more than the model headline.
Interactive appendix
Recorded L1 feature explorer
One real `source_recorded_l1_sol` export from the HFT repo, hosted inside the site so the feature layer can be inspected instead of just summarized.
This is the actual HTML-style explorer used to inspect recorded HFT series and feature behavior. Embedding it here makes the dataset and feature-engineering story much easier to validate.
The models got better, but not magically tradable
This phase is where the repo became most honest.
The HFT module supports multiple target families:
- taker-style horizons such as `hft_1s`, `hft_5s`, and `hft_10s`,
- maker protection models such as `maker_1s`,
- adverse-selection labels such as `adverse_bid_1s`,
- fair-value and expiry-oriented variants such as `fair_up` and `expiry`.
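A taker-style horizon label such as `hft_1s` boils down to "did the mid move up within N ms of this row". This toy labeller is an assumption about the scheme, not the repo's exact definition; rows with no future tick yet get `None` and would be dropped before training.

```python
def taker_label(ts_ms, mids, horizon_ms=1000, eps=0.0):
    """1 if the mid exceeds mid[i] + eps once horizon_ms has elapsed,
    0 otherwise; None where the horizon extends past recorded data."""
    labels = []
    for i, (t, m) in enumerate(zip(ts_ms, mids)):
        future = next((mids[j] for j in range(i + 1, len(mids))
                       if ts_ms[j] >= t + horizon_ms), None)
        labels.append(None if future is None else int(future > m + eps))
    return labels

labels = taker_label([0, 400, 1000, 1500], [0.50, 0.51, 0.49, 0.52])
print(labels)  # trailing rows cannot be labelled yet
```

Looking up the future strictly by timestamp, not by row count, is what keeps this kind of label honest on irregularly spaced ticks.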
One of the most telling outcomes is the `phase1_adverse_bid_1s_lightgbm` evaluation:
- test accuracy around 0.921,
- Brier score around 0.068,
- log loss around 0.244,
- but still negative maker-style PnL in evaluation.
That is exactly the kind of result I trust more than a flashy headline. It proves the pipeline was capable of producing a decent classifier while still refusing to confuse classification quality with executable edge.
The same evaluation also measured inference latency in the tens of microseconds per batch on the model side, which tells a very different story from the earlier simplistic "will this strategy make money?" framing.
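Both headline metrics are cheap to recompute from stored predictions, which makes them easy to sanity-check against whatever the training library reports. A stdlib sketch with made-up probabilities and labels:

```python
import math

def brier(probs, labels):
    # Mean squared error between predicted probability and the 0/1 outcome.
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def log_loss(probs, labels, eps=1e-12):
    # Negative mean log-likelihood; eps guards against log(0).
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for p, y in zip(probs, labels)) / len(probs)

probs, labels = [0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]
b, ll = brier(probs, labels), log_loss(probs, labels)
print(b, ll)
```

Neither number says anything about fills, fees, or adverse selection, which is exactly why they can look strong while maker-style PnL stays negative.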
Where the 98% lie came from
Short-horizon systems are especially vulnerable to stale quotes.
If you assume you buy at the visible ask at time t and react instantly, the backtest can look absurdly strong. That is where the fake 98%-style hit rate comes from.
Once I forced the simulation to respect reaction time and execution pain, the picture changed:
- `reaction_latency_ms >= 300` cut hit rate sharply,
- slippage and queue realism made the result even more fragile,
- the quant diagnostics exposed several ways to accidentally fake PnL if the simulator was careless.
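Forcing the simulator to respect reaction time amounts to filling at the quote visible *after* the latency, not the quote that triggered the signal. A minimal sketch (the quote data and slippage constant are invented):

```python
def fill_price(quotes, signal_ts, reaction_latency_ms=300, slippage=0.01):
    """Price actually paid: the last ask visible at signal_ts + latency,
    plus a slippage penalty. `quotes` is a timestamp-sorted list of
    (ts_ms, ask) pairs; returns None if no quote exists yet."""
    exec_ts = signal_ts + reaction_latency_ms
    ask = None
    for ts, a in quotes:
        if ts <= exec_ts:
            ask = a  # last known quote at execution time
        else:
            break
    return None if ask is None else ask + slippage

quotes = [(0, 0.50), (200, 0.55), (400, 0.60)]
paid = fill_price(quotes, signal_ts=0)
print(paid)  # the naive backtest would have assumed a fill at 0.50
```

On fast-moving books that gap between the signal-time quote and the execution-time quote is where the fake hit rate evaporates.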
This is exactly why *How a Real Backtest Works* became a companion note to the whole project.
The execution architecture got more serious too
The HFT branch did not stop at paper labeling and offline models.
- `real_trading_hft` unified paper and live modes around the same architecture.
- The hot loop stayed free of direct API calls.
- A blocking `action_queue` handed place and cancel tasks to a secondary API worker.
- Pre-flight ping checks and a latency monitor acted like a circuit breaker.
- The live logic refused price extremes outside the safer band, roughly 0.10-0.90, and demanded extra probability when spread cost was too large.
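The blocking `action_queue` split is a standard producer/consumer pattern: the hot loop only enqueues, and a secondary worker is the sole place that touches the slow exchange API. A Python sketch of the shape (the actual implementation sits in Rust, and the task fields here are assumptions):

```python
import queue
import threading

def api_worker(actions: "queue.Queue", log: list) -> None:
    # Secondary worker: the only code path allowed to make (slow) API calls.
    while True:
        task = actions.get()
        if task is None:  # shutdown sentinel
            break
        # Stand-in for a place/cancel API call.
        log.append(f"{task['op']}:{task['order_id']}")

actions: "queue.Queue" = queue.Queue()
log: list = []
worker = threading.Thread(target=api_worker, args=(actions, log))
worker.start()

# Hot loop: enqueue and move on, never blocking on the network.
actions.put({"op": "place", "order_id": 1})
actions.put({"op": "cancel", "order_id": 1})
actions.put(None)
worker.join()
print(log)
```

The queue preserves order (cancels never overtake the places they refer to) while keeping network jitter entirely off the timing-sensitive path.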
That made the system much more like a small execution engine than like a notebook wrapped in a CLI.
Takeaway
The HFT phase was the technically strongest part of the Polymarket journey because it forced everything to become more precise: timestamps, ETL, feature contracts, model evaluation, and execution architecture.
It also made the main lesson impossible to ignore: if your data path or simulator is weak, the model is just decorating a timing artifact.