Opening note
A shorter, more readable version of the original archive entry, focused on the parts that remained technically useful.
HFT on Polymarket: Model, Rust, and the 98% Lie
Phase 3: second-scale features, recorder-first infrastructure, and the point where even strong classifier metrics were not enough.
After long-horizon prediction hit data bottlenecks and the 15-minute system showed little durable edge, the next logical question was narrower: could second-scale signals still contain something useful?
That pushed the work toward HFT-style infrastructure. The problem was no longer just "can I train a model?" It was "can I observe, align, and execute honestly enough that the model means anything at all?"
HFT stack map
The HFT phase became a recorder-first stack: capture fast, prepare honestly, model carefully, and keep blocking live actions away from the hot path.
Recorder
The ingest layer that had to be trusted before anything else.
Prepared data
The data ladder between raw capture and ML.
Live architecture
The production-minded split between fast thinking and slow action.
This is the phase where architecture mattered most: good inference metrics without trustworthy timestamps and execution discipline were not enough.
Why HFT was the next step
The thesis here was much more specific than before.
- Maybe Polymarket lagged the reference market.
- Maybe order-book imbalance and short lag features contained signal.
- Maybe the real edge was not prediction alone, but data freshness and execution discipline.
That meant the project had to become infrastructure-first.
The architecture that made it possible
I split the stack by responsibility.
Rust on the hot path
- WebSockets for Binance and Polymarket.
- Event-driven waiting on the next Binance tick instead of sleepy polling.
- A single-source feature path for the HFT vector, including spread, OBI, and lag features.
- Recorder and latency primitives exposed through `poly_rust_core`.
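The event-driven waiting described above can be sketched in Python with `asyncio` (the real hot path is Rust, and the tick shape and function names here are invented): the feature loop blocks on the next tick instead of sleep-polling.

```python
import asyncio

async def tick_producer(queue: asyncio.Queue) -> None:
    # Stand-in for the Binance WebSocket: pushes ticks as they arrive.
    for price in (101.2, 101.3, 101.1):
        await queue.put({"price": price})
        await asyncio.sleep(0.01)
    await queue.put(None)  # end-of-stream sentinel

async def feature_loop(queue: asyncio.Queue) -> list:
    # Event-driven: block on the next tick, wake only when data exists.
    seen = []
    while (tick := await queue.get()) is not None:
        seen.append(tick["price"])
    return seen

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    producer = asyncio.create_task(tick_producer(queue))
    prices = await feature_loop(queue)
    await producer
    return prices

prices = asyncio.run(main())
print(prices)  # every tick processed, none missed by a polling interval
```

The point of the pattern is that latency between tick arrival and feature computation is bounded by scheduling, not by an arbitrary polling interval.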
Python around it
- Recorder orchestration and token refresh.
- JSONL to Parquet conversion, including incremental ETL.
- `prepared_l1` and `prepared_l2` dataset generation for train, val, and test.
- LightGBM training, experiment sweeps, visualization, and paper/live wrappers.
This division worked well because it kept the timing-sensitive path in Rust while leaving iteration flexible in Python.
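The incremental side of the JSONL-to-Parquet ETL reduces to a byte-offset checkpoint: each run converts only records appended since the last one. A stdlib toy of that idea (the function name is hypothetical, and the real pipeline writes Parquet rather than returning dicts):

```python
import io
import json

def incremental_read(stream, checkpoint: int):
    """Read only JSONL records appended since `checkpoint` (a byte offset),
    returning the new rows and the next checkpoint."""
    stream.seek(checkpoint)
    rows = [json.loads(line) for line in stream if line.strip()]
    return rows, stream.tell()

# In-memory stand-in for a recorder JSONL file.
buf = io.StringIO('{"ts": 1}\n{"ts": 2}\n')
rows, ckpt = incremental_read(buf, 0)      # first run: everything
buf.write('{"ts": 3}\n')                   # recorder appends a new tick
new_rows, ckpt = incremental_read(buf, ckpt)  # second run: only the delta
print([r["ts"] for r in rows], [r["ts"] for r in new_rows])
```

Persisting `ckpt` between runs is what makes the ETL restartable without reprocessing the whole capture.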
Data quality mattered more than model complexity
The real improvement in this phase was the recorder and the ETL around it.
- Each tick carried `local_receipt_ts_ms`.
- Alignment followed a "last known value" philosophy instead of pretending two feeds share exact timestamps.
- Freshness columns such as `binance_age_ms` and `poly_age_ms` turned misalignment into explicit information.
- Training rows could be filtered by latency thresholds such as 200-500 ms.
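The "last known value" alignment plus freshness filtering can be sketched in a few lines of pure Python (field names and the sample ticks are invented; only the philosophy matches the text): each Polymarket tick gets the most recent Binance state, an explicit age column, and stale rows are dropped.

```python
def align_last_known(poly_ticks, binance_ticks, max_age_ms=500):
    """Attach the last known Binance tick to each Polymarket tick,
    record freshness explicitly, and filter rows older than max_age_ms.
    Both inputs must be sorted by local receipt timestamp."""
    rows, j = [], -1
    for p in poly_ticks:
        # Advance to the latest Binance tick at or before this poly tick.
        while j + 1 < len(binance_ticks) and binance_ticks[j + 1]["ts"] <= p["ts"]:
            j += 1
        if j < 0:
            continue  # no Binance state known yet
        age = p["ts"] - binance_ticks[j]["ts"]
        if age <= max_age_ms:  # latency-threshold filter
            rows.append({**p, "binance_px": binance_ticks[j]["px"],
                         "binance_age_ms": age})
    return rows

poly = [{"ts": 100, "px": 0.52}, {"ts": 900, "px": 0.53}]
binance = [{"ts": 50, "px": 101.0}, {"ts": 120, "px": 101.5}]
aligned = align_last_known(poly, binance)
print(aligned)  # the ts=900 row is dropped: its Binance state is 780 ms old
```

Keeping the age as a column rather than silently joining is what turns misalignment into information the model can see.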
The project also includes explicit leakage checks and dataset preparation stages, which made the pipeline feel much closer to production research than to a simple backtest.
To make that stage visible instead of abstract, I embedded a real `source_recorded_l1_sol` feature explorer inside the site. It lets you inspect a recorded slug, compare raw versus smoothed series, and see the sort of feature-debugging surface that sat behind `prepared_l1`.
In this phase, the recorder mattered more than the model headline.
Interactive appendix
Recorded L1 feature explorer
One real `source_recorded_l1_sol` export from the HFT repo, hosted inside the site so the feature layer can be inspected instead of just summarized.
This is the actual HTML-style explorer used to inspect recorded HFT series and feature behavior. Embedding it here makes the dataset and feature-engineering story much easier to validate.
The models got better, but not magically tradable
This phase is where the repo became most honest.
The HFT module supports multiple target families:
- taker-style horizons such as `hft_1s`, `hft_5s`, and `hft_10s`,
- maker protection models such as `maker_1s`,
- adverse-selection labels such as `adverse_bid_1s`,
- fair-value and expiry-oriented variants such as `fair_up` and `expiry`.
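A taker-style horizon label such as `hft_1s` boils down to "did the mid move up within N ms of this row". This toy labeller is an assumption about the scheme, not the repo's exact definition; rows with no future tick yet get `None` and would be dropped before training.

```python
def taker_label(ts_ms, mids, horizon_ms=1000, eps=0.0):
    """1 if the mid exceeds mid[i] + eps once horizon_ms has elapsed,
    0 otherwise; None where the horizon extends past recorded data."""
    labels = []
    for i, (t, m) in enumerate(zip(ts_ms, mids)):
        future = next((mids[j] for j in range(i + 1, len(mids))
                       if ts_ms[j] >= t + horizon_ms), None)
        labels.append(None if future is None else int(future > m + eps))
    return labels

labels = taker_label([0, 400, 1000, 1500], [0.50, 0.51, 0.49, 0.52])
print(labels)  # trailing rows cannot be labelled yet
```

Looking up the future strictly by timestamp, not by row count, is what keeps this kind of label honest on irregularly spaced ticks.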
One of the most telling outcomes is the `phase1_adverse_bid_1s_lightgbm` evaluation:
- test accuracy around 0.921,
- Brier score around 0.068,
- log loss around 0.244,
- but still negative maker-style PnL in evaluation.
That is exactly the kind of result I trust more than a flashy headline. It proves the pipeline was capable of producing a decent classifier while still refusing to confuse classification quality with executable edge.
The same evaluation also measured inference latency in the tens of microseconds per batch on the model side, which tells a very different story from the earlier simplistic "will this strategy make money?" framing.
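Both headline metrics are cheap to recompute from stored predictions, which makes them easy to sanity-check against whatever the training library reports. A stdlib sketch with made-up probabilities and labels:

```python
import math

def brier(probs, labels):
    # Mean squared error between predicted probability and the 0/1 outcome.
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def log_loss(probs, labels, eps=1e-12):
    # Negative mean log-likelihood; eps guards against log(0).
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for p, y in zip(probs, labels)) / len(probs)

probs, labels = [0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]
b, ll = brier(probs, labels), log_loss(probs, labels)
print(b, ll)
```

Neither number says anything about fills, fees, or adverse selection, which is exactly why they can look strong while maker-style PnL stays negative.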
Where the 98% lie came from
Short-horizon systems are especially vulnerable to stale quotes.
If you assume you buy at the visible ask at time t and react instantly, the backtest can look absurdly strong. That is where the fake 98%-style hit rate comes from.
Once I forced the simulation to respect reaction time and execution pain, the picture changed:
- `reaction_latency_ms >= 300` cut hit rate sharply,
- slippage and queue realism made the result even more fragile,
- the quant diagnostics exposed several ways to accidentally fake PnL if the simulator was careless.
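Forcing the simulator to respect reaction time amounts to filling at the quote visible *after* the latency, not the quote that triggered the signal. A minimal sketch (the quote data and slippage constant are invented):

```python
def fill_price(quotes, signal_ts, reaction_latency_ms=300, slippage=0.01):
    """Price actually paid: the last ask visible at signal_ts + latency,
    plus a slippage penalty. `quotes` is a timestamp-sorted list of
    (ts_ms, ask) pairs; returns None if no quote exists yet."""
    exec_ts = signal_ts + reaction_latency_ms
    ask = None
    for ts, a in quotes:
        if ts <= exec_ts:
            ask = a  # last known quote at execution time
        else:
            break
    return None if ask is None else ask + slippage

quotes = [(0, 0.50), (200, 0.55), (400, 0.60)]
paid = fill_price(quotes, signal_ts=0)
print(paid)  # the naive backtest would have assumed a fill at 0.50
```

On fast-moving books that gap between the signal-time quote and the execution-time quote is where the fake hit rate evaporates.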
This is exactly why *How a Real Backtest Works* became a companion note to the whole project.
The execution architecture got more serious too
The HFT branch did not stop at paper labeling and offline models.
- `real_trading_hft` unified paper and live modes around the same architecture.
- The hot loop stayed free of direct API calls.
- A blocking `action_queue` handed place and cancel tasks to a secondary API worker.
- Pre-flight ping checks and a latency monitor acted like a circuit breaker.
- The live logic refused price extremes outside the safer band, roughly 0.10-0.90, and demanded extra probability when spread cost was too large.
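The blocking `action_queue` split is a standard producer/consumer pattern: the hot loop only enqueues, and a secondary worker is the sole place that touches the slow exchange API. A Python sketch of the shape (the actual implementation sits in Rust, and the task fields here are assumptions):

```python
import queue
import threading

def api_worker(actions: "queue.Queue", log: list) -> None:
    # Secondary worker: the only code path allowed to make (slow) API calls.
    while True:
        task = actions.get()
        if task is None:  # shutdown sentinel
            break
        # Stand-in for a place/cancel API call.
        log.append(f"{task['op']}:{task['order_id']}")

actions: "queue.Queue" = queue.Queue()
log: list = []
worker = threading.Thread(target=api_worker, args=(actions, log))
worker.start()

# Hot loop: enqueue and move on, never blocking on the network.
actions.put({"op": "place", "order_id": 1})
actions.put({"op": "cancel", "order_id": 1})
actions.put(None)
worker.join()
print(log)
```

The queue preserves order (cancels never overtake the places they refer to) while keeping network jitter entirely off the timing-sensitive path.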
That made the system much more like a small execution engine than like a notebook wrapped in a CLI.
Takeaway
The HFT phase was the technically strongest part of the Polymarket journey because it forced everything to become more precise: timestamps, ETL, feature contracts, model evaluation, and execution architecture.
It also made the main lesson impossible to ignore: if your data path or simulator is weak, the model is just decorating a timing artifact.