Opening note
A shorter, more readable version of the original archive entry, focused on the parts that remained technically useful.
What I Learned Trying to Predict Polymarket Bets
Phase 1 of the Polymarket work: build the dataset, enrich it properly, and discover that collection was harder than modeling.
I started with the most intuitive idea: predict whether a binary Polymarket market was mispriced before resolution. The real surprise was not the modeling challenge. It was how much engineering was needed before the dataset even existed.
This phase came before the 15-minute bot and before the HFT recorder. It set the rule that shaped everything afterwards: own the data path first, then decide if the research deserves more time.
If the dataset does not exist, data collection becomes the real project.
Pipeline map
A high-level view of how the first Polymarket dataset moved from public endpoints to reusable research artifacts.
- Sources: the raw market universe and activity feeds.
- Enrichment: what made the scraper more than a plain ETL.
- Feature layer: signals distilled for training and scoring.
- Outputs: artifacts that survived beyond phase 1.
This map stays high level on purpose: the real story in phase 1 is that collection, normalization, and reuse were already a serious system.
The hypothesis
The thesis was broader than simple price prediction. I wanted to combine:
- trade flow and price buckets from the Data API,
- CLOB structure such as spread, midpoint, and depth,
- market metadata such as duration, category, and volume,
- holder-level signals such as whale concentration, trader skill, and side lean.
That last part matters. The scraper was not only gathering trades. It also had a full optional layer for top holders, trader metrics, and market lean weighted by whale score.
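To make that concrete, here is a rough sketch of what one training row could look like once those four signal families are merged. The field names are illustrative placeholders under my own assumptions, not the pipeline's exact columns.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MarketFeatureRow:
    """One row per market. Names are illustrative, not the project's actual schema."""
    # Trade flow and price buckets (Data API)
    trade_count: int
    net_flow_usd: float
    price_bucket: float
    # CLOB structure
    spread: float
    midpoint: float
    top_of_book_depth_usd: float
    # Market metadata (Gamma)
    duration_days: float
    category: str
    volume_usd: float
    # Holder-level signals (optional layer)
    whale_score: float
    lean_yes_pct: float
    lean_no_pct: float
    # Label: only known once the market resolves
    resolved_yes: Optional[bool] = None
```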
The pipeline I actually built
The root project is much more than a one-off script. It has a proper package structure, central config, bundle generators, and several outputs designed for later analysis.
The core scrape flow looked like this, with a rough per-market sketch after the list:
- Use Gamma API to scan events.
- Pull trades from the Data API.
- Optionally pull CLOB book and midpoint features.
- Optionally run top holders analysis with trader-metrics caching.
- Merge everything into market-level outputs for training or live-style scoring.
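A minimal Python sketch of that loop is below. The endpoint hosts, paths, query parameters, and field names are written from memory of the public Polymarket APIs and should be treated as assumptions, not as a copy of the project's code.

```python
import requests

GAMMA = "https://gamma-api.polymarket.com"    # hosts, paths, and params here
DATA_API = "https://data-api.polymarket.com"  # are assumptions; check against
CLOB = "https://clob.polymarket.com"          # the current API docs

def scan_events(limit: int = 100, offset: int = 0) -> list[dict]:
    """Step 1: page through Gamma events."""
    resp = requests.get(f"{GAMMA}/events",
                        params={"limit": limit, "offset": offset, "closed": "true"},
                        timeout=30)
    resp.raise_for_status()
    return resp.json()

def collect_market(condition_id: str, token_id: str,
                   with_clob: bool = True, with_holders: bool = False) -> dict:
    """Steps 2-4: trades, optional CLOB features, optional top holders."""
    out = {"trades": requests.get(f"{DATA_API}/trades",
                                  params={"market": condition_id, "limit": 1000},
                                  timeout=30).json()}
    if with_clob:
        out["book"] = requests.get(f"{CLOB}/book",
                                   params={"token_id": token_id}, timeout=30).json()
        out["midpoint"] = requests.get(f"{CLOB}/midpoint",
                                       params={"token_id": token_id}, timeout=30).json()
    if with_holders:
        out["holders"] = requests.get(f"{DATA_API}/holders",
                                      params={"market": condition_id, "limit": 20},
                                      timeout=30).json()
    return out
```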
The main outputs were not just one CSV. The project writes features.csv, status.csv, raw_trades.csv, top_holders.csv, and trader_metrics.csv, which makes the work feel much closer to a research pipeline than to a notebook.
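The writer behind those artifacts can be imagined as something like the following; only the file names come from the project, the function itself is a sketch.

```python
from pathlib import Path
import pandas as pd

def write_bundle(out_dir: Path, frames: dict[str, pd.DataFrame]) -> None:
    """Persist the phase-1 artifacts named above, one CSV per table."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name in ("features", "status", "raw_trades", "top_holders", "trader_metrics"):
        if name in frames:
            frames[name].to_csv(out_dir / f"{name}.csv", index=False)
```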
What made the scraper richer than a plain ETL
This first phase already contained signals that later justified the whole portfolio story.
- `execution_mode` supported `TRAINING` and `LIVE`.
- `min_volume_usd` and `initial_offset` controlled scope from the config layer instead of being hardcoded.
- `top_holders_limit` went up to the API cap of 20 holders.
- Trader metrics were cached for 5 days so the pipeline did not waste calls recomputing wallet histories.
- Whale influence was summarized into `whale_score`, then reused to compute `lean_yes_pct` and `lean_no_pct`.
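A compact way to picture that config layer and the whale-weighted lean is sketched below. The field names come from the list above; the defaults, record shapes, and the exact weighting formula are my assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class ExecutionMode(str, Enum):
    TRAINING = "TRAINING"
    LIVE = "LIVE"

@dataclass
class ScraperConfig:
    execution_mode: ExecutionMode = ExecutionMode.TRAINING
    min_volume_usd: float = 10_000.0     # scope filter; default value illustrative
    initial_offset: int = 0              # where event pagination starts
    top_holders_limit: int = 20          # API cap mentioned above
    trader_metrics_cache_days: int = 5   # TTL for cached wallet histories

def side_lean(holders: list[dict]) -> tuple[float, float]:
    """Whale-weighted market lean. Assumes each holder record carries a
    'whale_score' and a 'side' of 'YES'/'NO'; the weighting is a plausible
    reconstruction, not the original formula."""
    total = sum(h["whale_score"] for h in holders) or 1.0
    yes = sum(h["whale_score"] for h in holders if h["side"] == "YES")
    lean_yes_pct = 100.0 * yes / total
    return lean_yes_pct, 100.0 - lean_yes_pct
```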
That means the project was not just "download trades and train a model." It was already trying to express market structure, participant quality, and operational robustness.
Where the real cost appeared
The blocker was not feature engineering. It was the cost of collecting enough history to trust the research, broken down in the table below (a retry-and-backoff sketch follows it).
| What looked easy on paper | What it meant in practice |
|---|---|
| "Get historical markets" | No public bulk dump, so each event meant several API calls across Gamma, Data API, and CLOB |
| "Scale the dataset" | min_volume_usd, pagination offsets, and wide feature tables limited practical throughput |
| "Keep the scraper stable" | 429 responses, retries, backoff, and resume logic shaped runtime as much as the model did |
| "Use holder analytics too" | Wallet histories, top-holder reports, and cached trader metrics made the data richer but even more expensive to build |
A few thousand events could already mean hours of work. Tens of thousands of clean examples meant many long runs, more storage, and much more operational babysitting than I wanted for a first research front.
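For a rough sense of scale, a back-of-envelope calculation with openly invented but plausible numbers already lands in the hours range before any retries or reruns:

```python
# Back-of-envelope runtime; every number here is illustrative, not measured.
events = 3_000
calls_per_event = 5        # Gamma lookup + trades + book + midpoint + holders
requests_per_second = 2.0  # effective rate after rate limits and backoff

total_calls = events * calls_per_event
hours = total_calls / requests_per_second / 3600
print(f"{total_calls:,} calls ≈ {hours:.1f} hours")  # 15,000 calls ≈ 2.1 hours
```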
Why I paused this front
The pause was not a verdict on prediction markets as a topic. It was a scope decision.
- Long-horizon markets resolve slowly, so the learning loop is slow.
- Coverage had to be broad enough to avoid training on a thin or biased slice.
- The richer I made the dataset, the more collection became the bottleneck.
Technically, the phase was successful. Strategically, it was too expensive relative to the feedback it was giving me.
What carried forward
This phase still paid off because it clarified what the next phases should look like.
- It pushed me toward 15-Minute Trading on Polymarket, where the loop was much faster.
- It made the case for HFT on Polymarket: Model, Rust, and the 98% Lie, where recording my own data became practical.
- It reinforced the realism mindset that later became How a Real Backtest Works.
It also left behind bundle tooling and a clean module layout, which mattered later when the project had to become portable and server-friendly.
Takeaway
Prediction on Polymarket was technically viable, but the first serious obstacle was not the model. It was building a trustworthy dataset rich enough to deserve one.
That lesson shaped the rest of the journey: shorter horizons, tighter loops, and a much clearer respect for the difference between a good idea and a sustainable data problem.