Level 2 · Lesson 14 of 43 · 14 min read

Building Your Data Pipeline

How to collect, store, and process market data

You know what data matters. You know where to get it. Now you need a system to actually collect it, store it, and make it usable for analysis. This is your data pipeline.

A well-designed pipeline runs reliably in the background, collecting data around the clock so it's always ready when you need it for backtesting or live signals.

Pipeline Architecture Overview

Every data pipeline has three core stages: ingestion (getting the data), storage (keeping the data), and transformation (making the data useful).

For a crypto signals engine, this typically means: API calls to data providers, files or databases to store results, and processing scripts to clean and normalize everything.

Let's walk through each stage with practical considerations.

Stage 1: Data Ingestion

Ingestion is about pulling data from your sources reliably and efficiently.

Scheduled collection is the foundation. You'll run scripts on a cron schedule (or similar) that pull data at regular intervals. For 1-minute candles, you might run every minute. For funding rates that update every 8 hours, you run three times daily. Match your collection frequency to the data's update frequency.
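
A rough sketch of what this can look like: the schedule lives in cron and a small script does the pulling. The crontab times, paths, and collector names below are illustrative, and the exact funding schedule depends on the exchange.

# Illustrative crontab entries: match each job's schedule to the data's update cadence.
#   * * * * *       python3 /opt/pipeline/collect.py candles   # 1-minute candles, every minute
#   5 0,8,16 * * *  python3 /opt/pipeline/collect.py funding   # funding rates, three times daily
import sys

COLLECTORS = {
    "candles": lambda: print("pulling latest 1m candles..."),   # replace with a real API call
    "funding": lambda: print("pulling latest funding rates..."),
}

if __name__ == "__main__":
    COLLECTORS[sys.argv[1]]()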

API handling requires attention to rate limits and failures. When you're pulling data for 50 coins across multiple endpoints, you will hit rate limits. Add delays between requests, respect provider guidelines, and build in retries for failed requests.
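
A minimal request helper covers the basics; the retry count, delays, and the fetch_json name are assumptions rather than any provider's documented requirements.

import time
import requests

def fetch_json(url, params=None, max_retries=3, pause=0.25):
    """Fetch JSON politely: space out requests and retry failures with backoff."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, params=params, timeout=10)
            if resp.status_code == 429:            # rate limited: back off and try again
                time.sleep(2 ** attempt)
                continue
            resp.raise_for_status()
            time.sleep(pause)                      # small gap between successive calls
            return resp.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise                              # surface the error after the final attempt
            time.sleep(2 ** attempt)
    raise RuntimeError("rate limited on every attempt")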

Incremental vs full pulls: Once you have historical data, you only need to pull new data. This is incremental collection. Track what you've already collected and only request what's missing. This dramatically reduces API calls and storage growth.
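
One way to sketch incremental collection against Parquet files, assuming a timestamp column and a hypothetical fetch_since wrapper around your provider's API:

import os
import pandas as pd

def last_stored_timestamp(path):
    """Newest timestamp already on disk, or None when starting from scratch."""
    if not os.path.exists(path):
        return None
    return pd.read_parquet(path, columns=["timestamp"])["timestamp"].max()

def incremental_update(path, fetch_since):
    """Request only rows newer than what is stored, then merge them in."""
    new_rows = fetch_since(last_stored_timestamp(path))   # fetch_since: hypothetical API wrapper
    if new_rows.empty:
        return
    if os.path.exists(path):
        new_rows = pd.concat([pd.read_parquet(path), new_rows])
    new_rows.drop_duplicates("timestamp").sort_values("timestamp").to_parquet(path, index=False)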

Error recovery matters: APIs fail, networks drop, servers go down. Your pipeline needs to handle gaps gracefully. Log failures, alert on extended outages, and have processes to backfill missing data.
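
A simple gap check makes backfilling concrete. This sketch assumes 1-minute candles in a pandas DataFrame with a timestamp column; the missing timestamps it returns are what a backfill job would request.

import pandas as pd

def find_gaps(df, freq="1min"):
    """Return the expected timestamps that are missing from a candle DataFrame."""
    ts = pd.DatetimeIndex(pd.to_datetime(df["timestamp"], utc=True))
    expected = pd.date_range(ts.min(), ts.max(), freq=freq)
    return expected.difference(ts)

# missing = find_gaps(candles)   # log these and hand them to a backfill job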

Stage 2: Data Storage

You have two main options for storage: files or databases. Both work; the right choice depends on your use case.

File-based storage (CSV, Parquet) is simple and flexible. Each coin and timeframe gets its own file, and updates append new rows. Parquet is particularly good because it compresses well and supports fast column-based queries, which makes it a natural fit for time series data.

The file approach is easier to set up, easier to debug, and easier to back up. For many traders, it's the right choice.
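
For a sense of what this looks like in practice, here is a minimal sketch assuming pandas with a Parquet engine such as pyarrow installed; the file name and columns are placeholders.

import pandas as pd

# A toy frame standing in for freshly collected data.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=3, freq="1min", tz="UTC"),
    "close": [42000.0, 42010.0, 42005.0],
    "volume": [12.3, 9.8, 15.1],
})

# One file per coin and timeframe.
df.to_parquet("BTC_1m.parquet", index=False)

# Reading back only the columns a query needs is where the columnar layout pays off.
closes = pd.read_parquet("BTC_1m.parquet", columns=["timestamp", "close"])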

Database storage (PostgreSQL, MongoDB, InfluxDB) offers more powerful querying and better handling of concurrent access. If you're building a system where multiple processes need to read and write simultaneously, or where you need complex queries across your data, a database makes sense.

MongoDB works well for JSON-like alternative data. PostgreSQL handles structured time series well. InfluxDB is purpose-built for time series if that's your primary use case.

Our recommendation for starting out: use Parquet files. They're simple, fast, and you can always migrate to a database later if needed.

Stage 3: Data Transformation

Raw data from APIs rarely matches what you need for analysis. Transformation bridges that gap.

Cleaning handles missing values, removes duplicates, and fixes obvious errors. A single corrupted data point can throw off an entire backtest.
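
A minimal cleaning pass, assuming OHLCV data in a pandas DataFrame with timestamp, close, and volume columns, might look like this:

import pandas as pd

def clean_candles(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate bars, sort by time, and discard obviously corrupted rows."""
    df = df.drop_duplicates("timestamp").sort_values("timestamp")
    df = df[(df["close"] > 0) & (df["volume"] >= 0)]   # a single bad print can poison a backtest
    return df.dropna(subset=["close"])                 # or interpolate, depending on the metric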

Normalization puts everything in consistent units and formats. Different exchanges report open interest in contracts or in USD, and timestamps may arrive in different timezones. Normalize everything to a standard format.
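
The same idea in code, as a sketch that assumes a timestamp column and, optionally, open interest reported in contracts with a known USD contract size:

import pandas as pd

def normalize(df: pd.DataFrame, contract_size_usd=None) -> pd.DataFrame:
    """Standardize timestamps to UTC and, if needed, open interest to USD."""
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    if contract_size_usd is not None:                  # source reported OI in contracts
        df["open_interest_usd"] = df["open_interest"] * contract_size_usd
    return df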

Derived metrics are where the magic happens. From raw OHLCV and order book data, you calculate z-scores, moving averages, divergences, and the custom indicators that form your edge. This is where your pipeline becomes your competitive advantage.
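
As an illustration, a moving average and a rolling z-score take only a few lines; the window length and column names here are arbitrary choices, not a recommended configuration.

import pandas as pd

def add_basic_features(df: pd.DataFrame, window: int = 96) -> pd.DataFrame:
    """Add a moving average of price and a rolling z-score of funding."""
    df = df.copy()
    df["close_ma"] = df["close"].rolling(window).mean()
    funding_mean = df["funding_rate"].rolling(window).mean()
    funding_std = df["funding_rate"].rolling(window).std()
    df["funding_z"] = (df["funding_rate"] - funding_mean) / funding_std
    return df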

Multi-timeframe alignment is tricky but essential. If you're combining 1-minute CVD with 4-hour funding rates, you need to align them properly. Each 4-hour funding rate value applies to 240 one-minute bars. Get this wrong and you introduce look-ahead bias.
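
One way to do this safely in pandas is merge_asof with direction="backward", which gives each bar only the most recent value already published at that bar's timestamp; the column names are assumptions.

import pandas as pd

def align_funding_to_minutes(minute_bars: pd.DataFrame, funding: pd.DataFrame) -> pd.DataFrame:
    """Attach to each 1-minute bar the latest funding value known at that time."""
    minute_bars = minute_bars.sort_values("timestamp")
    funding = funding.sort_values("timestamp")
    return pd.merge_asof(minute_bars, funding, on="timestamp",
                         direction="backward")   # never pull a value from the future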

Example Pipeline Structure

Here's a simple but effective architecture:

A collection script runs every minute, pulling new data from your provider's API. It saves raw data to files organized by coin and endpoint. A separate processing script runs hourly, reading raw files, cleaning the data, calculating derived metrics, and outputting analysis-ready Parquet files.

The folder structure might look like this:

data/raw/BTC/funding_rates/
data/raw/BTC/open_interest/
data/raw/SOL/liquidations/
data/processed/BTC_features.parquet
data/processed/SOL_features.parquet

Keep raw and processed data separate. You'll inevitably want to reprocess as you improve your transformation logic.

Practical Considerations

Storage space adds up. One-minute data for 50 coins with dozens of indicators over multiple years can run to hundreds of gigabytes. Plan for it.

Processing time matters when you're running live signals. If processing takes 30 seconds but you need signals every minute, you have a problem. Optimize hot paths.

Monitoring saves headaches. Log your pipeline's health. Alert when collection fails. Track data freshness. The worst bug is a silent one that corrupts months of data before you notice.
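
A freshness check can be as small as comparing a file's modification time against its expected update interval; the print calls stand in for whatever alerting channel you actually use.

import os
import time

def check_freshness(path, max_age_seconds=120):
    """Flag a data file that has stopped updating."""
    if not os.path.exists(path):
        print(f"ALERT: {path} is missing")
    elif time.time() - os.path.getmtime(path) > max_age_seconds:
        print(f"ALERT: {path} has not been updated in over {max_age_seconds}s")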

Version your transformation logic. When you change how an indicator is calculated, you may need to reprocess historical data. Keep track of which version of your code produced which data.
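
One possible convention, sketched here with a hypothetical version string and file name, is to write a small manifest next to each processed dataset recording the code version that produced it.

import json

PIPELINE_VERSION = "2024.06.1"   # bump whenever indicator logic changes

def write_manifest(manifest_path, version=PIPELINE_VERSION):
    """Record which version of the transformation code produced a dataset."""
    with open(manifest_path, "w") as f:
        json.dump({"pipeline_version": version}, f)

# write_manifest("BTC_features.manifest.json")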

What We Run at TargetHit

Our pipeline collects data from HyBlock for 54 coins across 58 different endpoints at 5 timeframes (1m, 5m, 15m, 1h, 4h). That's a lot of data.

We store raw data in Parquet files organized by coin, endpoint, and timeframe. A batch runner processes new data regularly, maintaining analysis-ready datasets.

The processed data feeds our discovery engine, which tests millions of indicator combinations looking for statistical edges. This is only possible because the pipeline provides clean, complete, properly aligned data.

Getting Started

Don't try to build the perfect pipeline on day one. Start simple:

Pick one coin and one data source. Write a script that pulls data and saves to a CSV. Run it manually a few times to verify it works. Set up a cron job to run it automatically. Monitor it for a week to catch edge cases.
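
A starter script in that spirit might look like the following; the URL and response shape are placeholders, not a real provider's API.

import csv
import requests
from datetime import datetime, timezone

URL = "https://api.example.com/funding"            # placeholder, not a real endpoint

def run():
    resp = requests.get(URL, params={"symbol": "BTC"}, timeout=10)
    resp.raise_for_status()
    rate = resp.json().get("funding_rate")          # assumed response shape
    with open("BTC_funding.csv", "a", newline="") as f:
        csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), rate])

if __name__ == "__main__":
    run()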

Once that works reliably, expand: more coins, more endpoints, more sophisticated storage. Add transformation. Build toward your full architecture incrementally.

The worst mistake is building an elaborate system that you don't understand when it breaks. Start simple, add complexity only as needed.

Key Takeaways

A data pipeline has three stages: ingestion, storage, and transformation.
Match collection frequency to your data's update frequency.
File storage with Parquet is simple and effective for most use cases.
Transformation is where you calculate the derived metrics that create edge.
Start simple with one coin and one source, then expand incrementally.
Monitor your pipeline, because silent failures corrupt your data.

This completes the Data level. You now understand what data matters beyond price and volume, where to get it, and how to build infrastructure to collect and process it. Next, we move to Validation, where you'll learn how to test whether your ideas actually work.