The Quant Fund Architecture: Designing an Institutional-Grade Data Warehouse with ClickHouse and Python for Tick-Level Analytics

The Structural Failure of Row-Based Storage at Tick Resolution

The custom PostgreSQL pipeline constructed earlier in this series was engineered to replace Myfxbook’s opaque verification layer and stream MT5 plus CCXT metrics through a secured FastAPI backend. For trade-level tracking of a curated fleet, that architecture holds. Its load-bearing capacity is more than adequate when the write frequency stays in the range of hundreds to low thousands of rows per minute. The moment you push beyond aggregated trade logs and begin ingesting the raw substance of the market itself, tick-by-tick trades and full-depth order book deltas, the foundation cracks.

This is not a marketing-grade opinion. It is a structural inevitability rooted in how relational OLTP engines are built. PostgreSQL uses a row-oriented heap with B-tree indexes, MVCC versioning, and write-ahead logging designed for transactional consistency. Each insert triggers index updates, tuple versioning, vacuum overhead, and WAL amplification. When a single Binance BTCUSDT perpetual feed alone can emit ten to thirty thousand order book updates per second during a volatile event, the relational engine enters a regime it was never structurally designed to absorb. Disk IOPS saturate. Autovacuum falls behind. Query planner statistics become stale within minutes. Composite indexes on (symbol, timestamp) balloon until they dwarf the data itself.

Anyone who has attempted to backfill twelve months of Level 2 order book data into PostgreSQL understands the outcome. Storage consumption runs into multiple terabytes for a single symbol. Analytical queries that should complete in milliseconds take minutes. The database becomes the slowest link in the alpha generation chain, and that is a structural failure no amount of connection pooling or partition pruning can repair. When the instrument you are handling is raw market microstructure, the correct answer is not to optimize a row-oriented engine. The correct answer is to migrate to a columnar time-series warehouse engineered specifically for this workload.

Why ClickHouse: The Columnar Load-Bearing Architecture

ClickHouse is the database that quantitative funds, high-frequency market makers, and Tier-1 exchanges quietly run under their analytics stacks. It is a columnar, massively parallel, vectorized execution engine originally built at Yandex to process petabytes of clickstream data. For tick-level market data, its design properties are not merely a convenience. They are a structural match.

In a columnar store, data for a single column is stored contiguously on disk. When you query a moving VWAP across six months of BTCUSDT trades, ClickHouse only reads the price and quantity columns from disk, skipping every other byte. The SIMD-vectorized query pipeline then processes that data in batches of thousands of values per CPU instruction. Compare this to a row store that must hydrate entire tuples from disk even when the query touches only two fields. The difference in throughput is not incremental. It is one to two orders of magnitude.

MergeTree as the Foundational Substrate

Every serious table in ClickHouse is built on the MergeTree engine family. MergeTree is a log-structured merge tree variant. Incoming data is written to sorted parts on disk and merged asynchronously in the background. This structure achieves the write throughput of an append-only log while preserving the read efficiency of a sorted index. For tick ingestion, this is precisely the behavior required. There is no write amplification from index maintenance. There is no tuple versioning overhead. The sparse primary index, typically one index entry per 8,192 rows, keeps the index cache footprint negligible even across hundreds of billions of rows.

The variants that matter for market microstructure work are ReplacingMergeTree, which deduplicates by sorting key during background merges and is useful for reconciling duplicate WebSocket messages during reconnection storms, and AggregatingMergeTree, which stores intermediate aggregation states and is the structural basis for real-time materialized views computing rolling OHLCV, VWAP, or order book imbalance metrics at ingest time.

Compression Codecs as Structural Reinforcement

The compression pipeline is where ClickHouse decisively separates itself from general-purpose databases. Instead of applying a single block-level compression scheme, each column can be assigned a specialized codec chain tailored to its statistical structure. For monotonic timestamps, the DoubleDelta codec stores the second derivative of the time series, typically yielding compression ratios of 20 to 50x against raw Int64 values. For floating-point prices that change in small, correlated increments, the Gorilla codec, derived from Facebook’s original time-series paper, exploits the XOR relationship between consecutive values to compress to roughly 1.5 bytes per float on average. ZSTD then applies entropy coding on top. The practical result is that one year of full tick data for a liquid pair compresses from hundreds of gigabytes of raw data down to ten or twenty gigabytes on disk, while remaining instantly queryable.

Schema Design for Tick-Level Market Data

Schema design in ClickHouse is not the cosmetic exercise it is in PostgreSQL. The ordering key, the partition key, and the codec selection together determine every read and write performance characteristic of the table. A schema that looks correct but orders by the wrong key will force full partition scans on every analytical query. Get this wrong and you have constructed a beautifully compressed but operationally useless data swamp.

The Trades Table

The canonical trades table captures every executed trade from the exchange’s public feed. Symbol is stored as a LowCardinality string, which applies dictionary encoding because the set of traded symbols is small and bounded. The ordering key is chosen to match the dominant query pattern: filter by symbol, then by time.

CREATE TABLE market.trades
(
    exchange        LowCardinality(String),
    symbol          LowCardinality(String),
    trade_id        UInt64,
    event_time      DateTime64(3, 'UTC') CODEC(DoubleDelta, ZSTD(3)),
    price           Float64              CODEC(Gorilla, ZSTD(3)),
    quantity        Float64              CODEC(Gorilla, ZSTD(3)),
    side            Enum8('buy' = 1, 'sell' = 2),
    is_maker        UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (exchange, symbol, event_time, trade_id)
TTL event_time + INTERVAL 24 MONTH
SETTINGS index_granularity = 8192;

Partitioning by day is deliberate. It gives the background merge process small, bounded units of work and allows surgical ALTER TABLE ... DROP PARTITION operations for retention management. The TTL clause enforces automatic deletion after 24 months, which matters when you are running this on a VPS with finite storage. The ordering key places exchange and symbol first because every meaningful query filters on them. Event time follows because time-range scans are the second most common access pattern.

The Order Book Snapshot Table

Order book data is heavier and more structurally demanding than trade data. Each snapshot or delta carries up to N levels of bids and asks. The pragmatic encoding stores bids and asks as nested arrays rather than separate rows per level, because this dramatically reduces write volume and aligns with the columnar layout.

CREATE TABLE market.orderbook_l2
(
    exchange        LowCardinality(String),
    symbol          LowCardinality(String),
    event_time      DateTime64(3, 'UTC') CODEC(DoubleDelta, ZSTD(3)),
    update_id       UInt64,
    bid_prices      Array(Float64)  CODEC(Gorilla, ZSTD(3)),
    bid_quantities  Array(Float64)  CODEC(Gorilla, ZSTD(3)),
    ask_prices      Array(Float64)  CODEC(Gorilla, ZSTD(3)),
    ask_quantities  Array(Float64)  CODEC(Gorilla, ZSTD(3))
)
ENGINE = ReplacingMergeTree(update_id)
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (exchange, symbol, event_time, update_id)
SETTINGS index_granularity = 8192;

ReplacingMergeTree was chosen here because exchange WebSocket streams will occasionally replay or duplicate messages during reconnection handshakes, particularly when failing over across geographic endpoints. The engine collapses rows with identical sorting keys on merge, keeping the one with the largest update_id. This eliminates duplication without requiring application-layer deduplication state that would otherwise need to live in Redis.

Engineering the Python Ingestion Pipeline

Inserting into ClickHouse one row at a time is the single most common architectural mistake. The engine is built for batch ingestion. A single-row insert creates a new part on disk that must be merged later. Thousands of single-row inserts per second will generate part-count saturation and trigger the infamous Too many parts error, at which point ingestion halts entirely. The production-grade pattern is to buffer incoming events in memory and flush in batches of thousands or tens of thousands of rows per insert, typically every 100 to 500 milliseconds.

Asynchronous Batched Inserts from CCXT WebSockets

The implementation below is a minimal but production-posture ingestion worker. It consumes trades from a CCXT pro WebSocket stream, buffers them in an asyncio queue, and flushes to ClickHouse through the official clickhouse-connect async client. Error handling, typed function signatures, structured logging, and exponential backoff on transient failures are integrated from the start, not bolted on later.

import asyncio
import logging
from dataclasses import dataclass
from typing import List, Optional

import ccxt.pro as ccxtpro
import clickhouse_connect
from clickhouse_connect.driver.asyncclient import AsyncClient

logger = logging.getLogger("nql.ingest.trades")
logger.setLevel(logging.INFO)

BATCH_SIZE: int = 5_000
FLUSH_INTERVAL_SEC: float = 0.5
MAX_RETRIES: int = 5

@dataclass(slots=True)
class TradeRow:
    exchange: str
    symbol: str
    trade_id: int
    event_time_ms: int
    price: float
    quantity: float
    side: str
    is_maker: int

    def as_tuple(self) -> tuple:
        return (
            self.exchange,
            self.symbol,
            self.trade_id,
            self.event_time_ms,
            self.price,
            self.quantity,
            self.side,
            self.is_maker,
        )

COLUMN_NAMES: List[str] = [
    "exchange", "symbol", "trade_id", "event_time",
    "price", "quantity", "side", "is_maker",
]

async def clickhouse_client() -> AsyncClient:
    return await clickhouse_connect.get_async_client(
        host="127.0.0.1",
        port=8123,
        username="nql_writer",
        password_env_var="CH_WRITER_PASS",
        compress="lz4",
        database="market",
    )

async def flush_batch(client: AsyncClient, buffer: List[TradeRow]) -> None:
    if not buffer:
        return
    payload = [row.as_tuple() for row in buffer]
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            await client.insert(
                table="trades",
                data=payload,
                column_names=COLUMN_NAMES,
            )
            logger.info("flushed batch size=%d", len(payload))
            return
        except Exception as exc:
            backoff = min(2 ** attempt, 30)
            logger.warning(
                "insert failed attempt=%d err=%s backoff=%ds",
                attempt, exc, backoff,
            )
            await asyncio.sleep(backoff)
    logger.error("insert permanently failed size=%d", len(payload))

async def consumer(queue: asyncio.Queue, client: AsyncClient) -> None:
    buffer: List[TradeRow] = []
    while True:
        try:
            row = await asyncio.wait_for(queue.get(), timeout=FLUSH_INTERVAL_SEC)
            buffer.append(row)
        except asyncio.TimeoutError:
            pass
        if len(buffer) >= BATCH_SIZE or (buffer and queue.empty()):
            await flush_batch(client, buffer)
            buffer = []

async def producer(exchange_id: str, symbols: List[str], queue: asyncio.Queue) -> None:
    exchange_cls = getattr(ccxtpro, exchange_id)
    exchange = exchange_cls({"enableRateLimit": True})
    try:
        while True:
            try:
                trades = await exchange.watch_trades_for_symbols(symbols)
                for t in trades:
                    await queue.put(TradeRow(
                        exchange=exchange_id,
                        symbol=t["symbol"],
                        trade_id=int(t["id"]),
                        event_time_ms=int(t["timestamp"]),
                        price=float(t["price"]),
                        quantity=float(t["amount"]),
                        side=t["side"],
                        is_maker=1 if t.get("takerOrMaker") == "maker" else 0,
                    ))
            except Exception as exc:
                logger.warning("producer error=%s reconnecting", exc)
                await asyncio.sleep(2.0)
    finally:
        await exchange.close()

async def main() -> None:
    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        level=logging.INFO,
    )
    queue: asyncio.Queue = asyncio.Queue(maxsize=100_000)
    client = await clickhouse_client()
    await asyncio.gather(
        producer("binance", ["BTC/USDT:USDT", "ETH/USDT:USDT"], queue),
        consumer(queue, client),
    )

if __name__ == "__main__":
    asyncio.run(main())

Several design choices warrant explicit note. The queue is bounded at 100,000 rows to apply backpressure when the consumer falls behind; silently unbounded queues are a classic source of memory exhaustion in long-running ingestion services. The flush happens either when the batch size threshold is hit or when the queue drains, which keeps latency low during slow periods without sacrificing throughput during volatile ones. The retry loop uses exponential backoff capped at thirty seconds, because hammering a saturated ClickHouse instance with immediate retries makes the problem worse, not better.

Materialized Views: Pre-Computed Alpha at Write Time

One of the architectural advantages a columnar warehouse provides over any application-layer solution is the materialized view computed at insert time. A materialized view in ClickHouse is effectively a trigger that transforms each incoming block of data and writes the result into a second table. You pay for the computation once, on ingest, and every subsequent read is a simple scan of pre-aggregated rows.

CREATE MATERIALIZED VIEW market.trades_ohlcv_1m
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(bucket)
ORDER BY (exchange, symbol, bucket)
AS
SELECT
    exchange,
    symbol,
    toStartOfMinute(event_time)            AS bucket,
    argMinState(price, event_time)         AS open_state,
    maxState(price)                        AS high_state,
    minState(price)                        AS low_state,
    argMaxState(price, event_time)         AS close_state,
    sumState(quantity)                     AS volume_state,
    sumState(price * quantity)             AS notional_state
FROM market.trades
GROUP BY exchange, symbol, bucket;

This view continuously produces one-minute OHLCV bars as trades arrive. VWAP is simply notional / volume, reconstructed at query time from the aggregation states via sumMerge. The same pattern extends to real-time order book imbalance, cumulative delta, trade intensity, and any other microstructure feature that can be expressed as an incremental aggregation. For the reinforcement learning and LightGBM pipelines previously constructed, this means features are available at query time in milliseconds, rather than computed on demand by rolling Pandas operations that choke under load.

Query Patterns for Institutional Backtesting

The payoff of this architecture is visible only when the research workload hits it. A query that asks for the one-minute VWAP of BTCUSDT across the last six months, filtered by exchange and time window, would run for minutes against PostgreSQL with correctly indexed data. Against the ClickHouse materialized view defined above, it returns in sub-second time even on modest hardware. The reason is purely structural. The engine reads one partition per day, scans only the four columns it needs, and merges pre-computed aggregation states rather than recomputing from raw trades.

For event-driven backtesters, the ability to issue queries such as “fetch every order book snapshot within 500 milliseconds of every trade above 10 BTC in notional size over the last 90 days” is transformative. This kind of conditional microstructure sampling is what separates a retail backtest from an institutional simulation. The query becomes a natural expression of intent, not a performance-engineering problem.

Operational Realities and Deployment Posture

Running ClickHouse in production is not free of friction. Memory management is the primary operational constraint. Queries that spill outside the per-query memory limit will fail rather than silently paging to disk. This is a deliberate design decision; an HFT analytics engine that paged to disk would produce unpredictable latencies. The operator is expected to configure max_memory_usage, max_bytes_before_external_group_by, and similar parameters according to the hardware envelope. For a dedicated VPS running alongside trading bots, a sensible starting point is a capped four to eight gigabytes for ClickHouse, with explicit query limits enforced for ad-hoc analytical work.

Backup strategy deserves equal structural attention. Native BACKUP and RESTORE commands, introduced in recent versions, write consistent snapshots to S3-compatible object storage. For a quant stack running on a Japan-hosted VPS, pointing these at a Backblaze B2 bucket provides off-site durability at trivial cost. Retention policies enforced via table TTL handle the deletion side automatically. Combined, the two give deterministic recoverability in the event of a disk failure or accidental DROP TABLE.

Network posture matters as well. The ClickHouse HTTP interface should never be exposed to the public internet. Authentication should route through the same FastAPI backend already deployed for the dashboard, with the database bound to localhost or a private interface. This keeps the attack surface identical to the existing PostgreSQL footprint and requires no additional infrastructure hardening beyond what the earlier architecture already enforces.

The Verdict

The market does not reward architectural sentimentality. PostgreSQL is an extraordinary piece of engineering, and for trade-ledger tracking, account equity snapshots, and bot configuration state, it remains the correct tool. When the workload shifts to raw tick data and full-depth order book flow, the correct tool changes. Row-oriented OLTP engines are not a suitable foundation for the terabytes of microstructure data that serious alpha research requires. Columnar time-series warehouses are. This is not a matter of preference. It is a matter of which engine’s internal structure matches the shape of the workload, and the shape of tick-level market data matches ClickHouse almost perfectly.

The quant funds that win the latency war, the compression war, and the backtest-fidelity war are not winning because they have better indicators. They are winning because their infrastructure can answer questions the retail stack cannot even articulate, at speeds that make research iteration cycles measured in minutes rather than hours. Migrating from PostgreSQL to ClickHouse for tick-level analytics is not a performance optimization. It is the structural precondition for every advanced feature engineering, machine learning, and reinforcement learning pipeline that follows. Without this foundation, the rest of the stack eventually collapses under its own data weight. With it, the ceiling on what can be researched and deployed is raised by orders of magnitude.