Skip to content

Columnar Databases

The Problem: Row Stores Waste I/O on Analytics

Consider a stock_ticks table:

symbol price name exchange timestamp
AAPL 142.50 Apple Inc NYSE 1772845000
GOOG 89.20 Alphabet NASDAQ 1772845001
MSFT 315.00 Microsoft NYSE 1772845002

And a typical analytics query:

SELECT AVG(price) FROM stock_ticks
WHERE exchange = 'NYSE' AND timestamp > now() - interval '24 hours'

We only need 3 columns (price, exchange, timestamp) out of 5. But in a row store, we read all 5 — because of how data lives on disk.


How Data is Stored on Disk

Row-Oriented (OLTP) — Postgres, MySQL

Data is stored row by row. Each disk page (~8KB) packs full rows together:

┌─────────────────── Disk Page 1 (8KB) ──────────────────────┐
│                                                             │
│  Row 1: [AAPL | 142.50 | Apple Inc  | NYSE   | 1772845000] │
│  Row 2: [GOOG |  89.20 | Alphabet   | NASDAQ | 1772845001] │
│  Row 3: [MSFT | 315.00 | Microsoft  | NYSE   | 1772845002] │
│  Row 4: [TSLA | 201.30 | Tesla Inc  | NASDAQ | 1772845003] │
│  ...                                                        │
└─────────────────────────────────────────────────────────────┘

┌─────────────────── Disk Page 2 (8KB) ──────────────────────┐
│                                                             │
│  Row 51: [AMZN | 178.90 | Amazon    | NASDAQ | 1772845050] │
│  Row 52: [META |  62.40 | Meta      | NASDAQ | 1772845051] │
│  ...                                                        │
└─────────────────────────────────────────────────────────────┘

Reading AVG(price) WHERE exchange = 'NYSE':

  • Must load every page into memory (full rows)
  • Parse through symbol, name — columns we don't need
  • With 500M rows × 5 columns → 3/5ths of disk I/O is wasted
  • With 80 columns and a 3-column query → 96% wasted

✅ Great for: SELECT * FROM users WHERE id = 42 (one read, full row)

❌ Bad for: Scanning millions of rows for aggregates

Column-Oriented (OLAP) — Redshift, BigQuery, ClickHouse

Data is stored column by column. Each column is its own file/segment:

┌──── symbol.col ─────┐  ┌──── price.col ─────┐  ┌──── name.col ──────────┐
│ AAPL                 │  │ 142.50              │  │ Apple Inc               │
│ GOOG                 │  │  89.20              │  │ Alphabet                │
│ MSFT                 │  │ 315.00              │  │ Microsoft               │
│ TSLA                 │  │ 201.30              │  │ Tesla Inc               │
│ AMZN                 │  │ 178.90              │  │ Amazon                  │
│ ...500M rows         │  │ ...500M rows        │  │ ...500M rows            │
└──────────────────────┘  └─────────────────────┘  └─────────────────────────┘

┌──── exchange.col ───┐  ┌──── timestamp.col ──┐
│ NYSE                │  │ 1772845000          │
│ NASDAQ              │  │ 1772845001          │
│ NYSE                │  │ 1772845002          │
│ NASDAQ              │  │ 1772845003          │
│ NASDAQ              │  │ 1772845050          │
│ ...500M rows        │  │ ...500M rows        │
└─────────────────────┘  └─────────────────────┘

Reading AVG(price) WHERE exchange = 'NYSE':

  • Read ONLY price.col, exchange.col, timestamp.col
  • Never touch symbol.col or name.col — zero wasted I/O
  • Only 3/5ths of total data read (exactly what we need)

✅ Great for: Aggregating millions of rows across a few columns

❌ Bad for: SELECT * FROM stock_ticks WHERE symbol = 'AAPL' LIMIT 1 (must read from 5 separate column files to reconstruct one row)


Visualizing the Difference

graph LR
    subgraph "Row Store — Read Full Rows"
        direction TB
        P1["Page 1<br/>R1: all cols<br/>R2: all cols<br/>R3: all cols"]
        P2["Page 2<br/>R4: all cols<br/>R5: all cols<br/>R6: all cols"]
    end

    subgraph "Column Store — Read Only Needed Columns"
        direction TB
        C1["price.col<br/>142.50<br/>89.20<br/>315.00<br/>..."]
        C2["exchange.col<br/>NYSE<br/>NASDAQ<br/>NYSE<br/>..."]
        C3["timestamp.col<br/>177284...<br/>177284...<br/>177284...<br/>..."]
        C4["symbol.col<br/>❌ NOT READ"]
        C5["name.col<br/>❌ NOT READ"]
    end

    style P1 fill:#4a1a1a,stroke:#ff4444
    style P2 fill:#4a1a1a,stroke:#ff4444
    style C1 fill:#1a4a1a,stroke:#44ff44
    style C2 fill:#1a4a1a,stroke:#44ff44
    style C3 fill:#1a4a1a,stroke:#44ff44
    style C4 fill:#3a3a3a,stroke:#888888
    style C5 fill:#3a3a3a,stroke:#888888

Bonus: Compression

Columnar storage gets massive compression that row stores can't match.

Why? Same-type, repetitive data compresses extremely well:

exchange.col (raw):      [NYSE, NASDAQ, NYSE, NYSE, BSE, NASDAQ, NYSE, NYSE, ...]
                          ↓ Dictionary Encoding
exchange.col (encoded):  Dictionary: {0: NYSE, 1: NASDAQ, 2: BSE}
                         Data: [0, 1, 0, 0, 2, 1, 0, 0, ...]
                          ↓ Run-Length Encoding (if sorted)
                         [0×3, 1×1, 2×1, ...]

Row store page (mixed types, low repetition):

[AAPL, 142.50, "Apple Inc", NYSE, 1772845000, GOOG, 89.20, ...]
→ Barely compresses. Different types, no patterns.

Column store achieves 5-10x compression routinely because:

  • Same data type in every value → predictable byte widths
  • High repetition in categorical columns → dictionary encoding
  • Sorted/sequential data → delta encoding (store differences)
  • Numeric columns → bit-packing (use only the bits you need)

Compression means:

  • Less disk I/O (read fewer bytes for same data)
  • More data fits in memory/cache
  • Redshift/BigQuery can process compressed data directly (no decompress step)

OLTP vs OLAP Summary

Row Store (OLTP) Column Store (OLAP)
Storage Row by row, all columns together Column by column, separate files
Great at Single-row lookups, inserts, updates Aggregations across millions of rows
Bad at Full-table scans on few columns Fetching a single complete row
Compression Poor (mixed types per page) Excellent (same type, repetitive)
Write speed Fast (append one row) Slower (must write to N column files)
Use case App database (users, orders, sessions) Analytics, dashboards, reporting
Examples PostgreSQL, MySQL, Oracle Redshift, BigQuery, ClickHouse, Snowflake
Query style WHERE id = ? (point lookup) GROUP BY, AVG(), COUNT() (scan)

Where Columnar Fits in Architecture

You almost never use a columnar DB as your primary database. It sits alongside your OLTP store:

graph LR
    App["Your App"] -->|"reads/writes"| PG["PostgreSQL<br/>(OLTP — row store)"]
    PG -->|"ETL / CDC<br/>pipeline"| RS["Redshift / BigQuery<br/>(OLAP — column store)"]
    RS -->|"dashboards<br/>& reports"| Analyst["Analytics<br/>Team"]

    style PG fill:#1a3a4a,stroke:#44aaff
    style RS fill:#4a3a1a,stroke:#ffaa44
    style App fill:#1a4a1a,stroke:#44ff44
    style Analyst fill:#3a1a4a,stroke:#aa44ff
  • App writes → Postgres (fast point writes, ACID transactions)
  • ETL pipeline → copies data periodically (or streams via CDC) to the columnar store
  • Analysts query → Redshift/BigQuery (heavy aggregations don't touch production DB)

Real-World Column Stores

Database Notes
Amazon Redshift Based on ParAccel (derived from C-Store paper). AWS managed.
Google BigQuery Serverless, uses Dremel engine. Capacitor columnar format.
ClickHouse Open source, originated at Yandex. Extremely fast inserts.
Apache Parquet Not a DB — a columnar file format. Used by Spark, Hive, etc.
Snowflake Cloud-native, separates storage and compute.
  • C-Store Paper"C-Store: A Column-oriented DBMS" (Stonebraker et al.) The academic paper that started it all. Vertica is the commercial version.