
Prediction markets

What I Learned Trying to Predict Polymarket Bets

Phase 1: Gamma/Data API collection, top-holder features, trader-metrics caching, and the moment dataset engineering became the real project.

Phase 01 · 6 min · 7 sections
Python · Pandas · Gamma API · Data API · CLOB · Top Holders · Trader Metrics Cache · Whale Score · Resume Mode · Dataset Design


Phase 01 of the broader case study.


The model was never the hard part. Making the market dataset exist was.

Opening note

A shorter, more readable version of the original archive entry, focused on the parts that remained technically useful.


Phase 1 of the Polymarket work: build the dataset, enrich it properly, and discover that collection was harder than modeling.

I started with the most intuitive idea: predict whether a binary Polymarket market was mispriced before resolution. The real surprise was not the modeling challenge. It was how much engineering was needed before the dataset even existed.

This phase came before the 15-minute bot and before the HFT recorder. It set the rule that shaped everything afterwards: own the data path first, then decide if the research deserves more time.

If the dataset does not exist, data collection becomes the real project.

Pipeline map

A high-level view of how the first Polymarket dataset moved from public endpoints to reusable research artifacts.

Dataset-first engineering → Top-holder signals → Reusable CSV artifacts

This map stays high level on purpose: the real story in phase 1 is that collection, normalization, and reuse were already a serious system.

The hypothesis

The thesis was broader than simple price prediction. I wanted to combine:

  • trade flow and price buckets from the Data API,
  • CLOB structure such as spread, midpoint, and depth,
  • market metadata such as duration, category, and volume,
  • holder-level signals such as whale concentration, trader skill, and side lean.

That last part matters. The scraper was not only gathering trades. It also had a full optional layer for top holders, trader metrics, and market lean weighted by whale score.
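To make the combination concrete, here is a minimal sketch of how the four sources reduce to one row per market. The column names (market_id, buy_volume_usd, spread, and so on) are invented for illustration; the real schema is the project's own.

```python
import pandas as pd

# Illustrative stand-ins for the four feature sources; one row per market.
trades = pd.DataFrame({"market_id": ["m1", "m2"], "buy_volume_usd": [1200.0, 300.0]})
clob = pd.DataFrame({"market_id": ["m1", "m2"], "spread": [0.02, 0.05], "midpoint": [0.61, 0.40]})
meta = pd.DataFrame({"market_id": ["m1", "m2"], "category": ["politics", "sports"]})
holders = pd.DataFrame({"market_id": ["m1", "m2"], "whale_score": [0.8, 0.2]})

features = trades
for extra in (clob, meta, holders):
    # Left-join so a market missing an optional layer still keeps its row.
    features = features.merge(extra, on="market_id", how="left")

print(features.columns.tolist())
```

The left-joins matter: CLOB and holder layers were optional, so a market without them should still produce a training row rather than disappear.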

The pipeline I actually built

The root project is much more than a one-off script. It has a proper package structure, central config, bundle generators, and several outputs designed for later analysis.

The core scrape flow looked like this:

  1. Use Gamma API to scan events.
  2. Pull trades from the Data API.
  3. Optionally pull CLOB book and midpoint features.
  4. Optionally run top holders analysis with trader-metrics caching.
  5. Merge everything into market-level outputs for training or live-style scoring.
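The five steps above can be sketched as one function. The fetcher callables stand in for the real Gamma/Data API/CLOB clients; their names and return shapes are assumptions, not the project's actual interfaces.

```python
def scrape_market(market_id, fetch_trades, fetch_clob=None, fetch_holders=None):
    # Step 2: trades are mandatory for every market.
    row = {"market_id": market_id}
    row.update(fetch_trades(market_id))
    # Step 3: optional CLOB book/midpoint features.
    if fetch_clob is not None:
        row.update(fetch_clob(market_id))
    # Step 4: optional top-holders layer.
    if fetch_holders is not None:
        row.update(fetch_holders(market_id))
    # Step 5: one merged market-level row, ready for training or scoring.
    return row

# Stubbed usage, no network calls:
row = scrape_market(
    "m1",
    fetch_trades=lambda m: {"trade_count": 42},
    fetch_clob=lambda m: {"midpoint": 0.61},
)
print(row)
```

Keeping the optional layers as plug-in callables mirrors the config-driven design: TRAINING runs could enable everything, while cheaper runs skipped the expensive holder analysis.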

The main outputs were not just one CSV. The project writes features.csv, status.csv, raw_trades.csv, top_holders.csv, and trader_metrics.csv, which makes the work feel much closer to a research pipeline than to a notebook.

What made the scraper richer than a plain ETL

This first phase already contained signals that later justified the whole portfolio story.

  • execution_mode supported TRAINING and LIVE.
  • min_volume_usd and initial_offset controlled scope from the config layer instead of being hardcoded.
  • top_holders_limit went up to the API cap of 20 holders.
  • Trader metrics were cached for 5 days so the pipeline did not waste calls recomputing wallet histories.
  • Whale influence was summarized into whale_score, then reused to compute lean_yes_pct and lean_no_pct.
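As one concrete example, lean_yes_pct and lean_no_pct can be derived from per-holder whale scores roughly like this. The weighting scheme shown is an assumption; the article only names the outputs.

```python
def market_lean(holders):
    """Compute (lean_yes_pct, lean_no_pct) from a list of holder dicts.

    Each holder dict carries a 'side' ('YES' or 'NO') and a 'whale_score'.
    Lean is each side's share of total whale-weighted influence, in percent.
    """
    yes = sum(h["whale_score"] for h in holders if h["side"] == "YES")
    no = sum(h["whale_score"] for h in holders if h["side"] == "NO")
    total = yes + no
    if total == 0:
        return 0.0, 0.0
    return 100 * yes / total, 100 * no / total

print(market_lean([
    {"side": "YES", "whale_score": 0.8},
    {"side": "NO", "whale_score": 0.2},
]))  # a heavily YES-leaning market
```

Because whale scores came from cached trader metrics, this summary could be recomputed cheaply without re-crawling wallet histories.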

That means the project was not just "download trades and train a model." It was already trying to express market structure, participant quality, and operational robustness.

Where the real cost appeared

The blocker was not feature engineering. It was the cost of collecting enough history to trust the research.

| What looked easy on paper | What it meant in practice |
| --- | --- |
| "Get historical markets" | No public bulk dump, so each event meant several API calls across Gamma, Data API, and CLOB |
| "Scale the dataset" | min_volume_usd, pagination offsets, and wide feature tables limited practical throughput |
| "Keep the scraper stable" | 429 responses, retries, backoff, and resume logic shaped runtime as much as the model did |
| "Use holder analytics too" | Wallet histories, top-holder reports, and cached trader metrics made the data richer but even more expensive to build |

A few thousand events could already mean hours of work. Tens of thousands of clean examples meant many long runs, more storage, and much more operational babysitting than I wanted for a first research front.
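The "keep the scraper stable" row amounts to a loop like the following. The retry counts and delays are illustrative, not the project's real settings.

```python
import random
import time

def fetch_with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a rate-limited API call with exponential backoff and jitter.

    `call` returns (status, payload); a 429 status triggers a backoff sleep.
    """
    for attempt in range(max_retries):
        status, payload = call()
        if status == 429:
            # Exponential backoff plus jitter so concurrent workers desynchronize.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
            continue
        return payload
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```

At a few thousand events per run, even a handful of 429s per market turns into minutes of forced sleeping, which is exactly how rate limits, rather than modeling, came to dominate runtime.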

Why I paused this front

The pause was not a verdict on prediction markets as a topic. It was a scope decision.

  • Long-horizon markets resolve slowly, so the learning loop is slow.
  • Coverage had to be broad enough to avoid training on a thin or biased slice.
  • The richer I made the dataset, the more collection became the bottleneck.

Technically, the phase was successful. Strategically, it was too expensive relative to the feedback it was giving me.

What carried forward

This phase still paid off because it clarified what the next phases should look like.

It also left behind bundle tooling and a clean module layout, which mattered later when the project had to become portable and server-friendly.

Takeaway

Prediction on Polymarket was technically viable, but the first serious obstacle was not the model. It was building a trustworthy dataset rich enough to deserve one.

That lesson shaped the rest of the journey: shorter horizons, tighter loops, and a much clearer respect for the difference between a good idea and a sustainable data problem.