Alex Merced's Data, Dev and AI Blog

Row vs. Column: How Storage Layout Shapes Everything

Apr 29, 2026 | 8 min read | query-engine, database


This is Part 2 of a 10-part series on query engine design. Part 1 (Overview) introduced the nine decisions every engine must make. This article covers the first and most fundamental: how bytes are arranged on disk.

How Row Storage Works

A row store keeps all fields of a record physically together on a disk page. A page is typically 4KB to 16KB. Each page holds multiple complete "tuples" (records). When you read one page, you get every field for every record on that page.

How the same 5 records look in row-oriented layout versus column-oriented layout on disk

This layout is optimized for transactional workloads. Looking up a customer by ID? One page read gives you every field: name, email, address, balance, status. Inserting a new order? One write puts the entire record in one place. Updating a single field? The engine finds the tuple and modifies it in place.
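The page layout described above can be sketched in a few lines. This is a toy model, not how PostgreSQL or InnoDB actually lay out pages (real engines add tuple headers, slot arrays, and variable-width fields); it assumes fixed-width records with three hypothetical fields.

```python
import struct

# A minimal sketch of a row-store page: each record packs an id, a
# balance, and a 16-byte status field contiguously, and a 4 KB page
# holds as many whole records as fit.
RECORD = struct.Struct("<q d 16s")          # id, balance, status
PAGE_SIZE = 4096
RECORDS_PER_PAGE = PAGE_SIZE // RECORD.size  # 128 records per page

def build_page(records):
    """Pack complete records side by side into one page buffer."""
    buf = bytearray(PAGE_SIZE)
    for slot, (rid, balance, status) in enumerate(records):
        RECORD.pack_into(buf, slot * RECORD.size,
                         rid, balance, status.ljust(16).encode())
    return bytes(buf)

def read_record(page, slot):
    """One page read yields every field of the record in that slot."""
    rid, balance, status = RECORD.unpack_from(page, slot * RECORD.size)
    return rid, balance, status.decode().rstrip()

page = build_page([(1, 99.5, "shipped"), (2, 10.0, "pending")])
print(read_record(page, 1))   # (2, 10.0, 'pending')
```

One `read_record` call touches a single page and returns every field of the record, which is exactly why point lookups are cheap in this layout.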

PostgreSQL stores rows as heap tuples with a header containing transaction visibility info and a null bitmap. MySQL/InnoDB organizes rows in a clustered B-tree indexed by primary key. Oracle and SQL Server both default to row-based storage.

The weakness shows up with analytical queries. If your table has 50 columns and your query needs 3 of them, a row store still reads all 50 for every row. The other 47 columns come along for the ride anyway, wasting I/O bandwidth and polluting your CPU cache.

How Column Storage Works

A column store flips the layout. Instead of keeping all fields of a record together, it keeps all values for a single field together. Every price value is stored contiguously. Every status value is stored contiguously. And so on.

The data is typically organized in "row groups" (Parquet calls them this, ORC calls them "stripes"), each containing 100K to 1M rows. Within each row group, each column is stored as a separate "column chunk" with its own compression and encoding. Values at the same position across column chunks belong to the same logical record.
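The row-group organization above can be sketched with plain Python, using a dict-of-lists as the in-memory table and a toy row-group size of 3 (real formats use 100K to 1M rows per group):

```python
# A minimal sketch of columnar row groups: each column is split into
# per-row-group chunks, and values at the same position across chunks
# still belong to the same logical record.
ROW_GROUP_SIZE = 3   # toy size for illustration

def to_row_groups(table, size=ROW_GROUP_SIZE):
    """Split each column of a dict-of-lists table into row-group chunks."""
    n_rows = len(next(iter(table.values())))
    return [
        {col: vals[start:start + size] for col, vals in table.items()}
        for start in range(0, n_rows, size)
    ]

table = {
    "price":  [10.0, 99.0, 30.0, 5.0, 20.0],
    "status": ["shipped", "pending", "shipped", "returned", "shipped"],
}
groups = to_row_groups(table)
print(len(groups))          # 2 row groups
print(groups[1]["price"])   # [5.0, 20.0]
```

Each entry in `groups` plays the role of a row group, and each list inside it plays the role of a column chunk that could be compressed and encoded independently.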

This layout is optimized for analytical workloads. When a query computes AVG(price) WHERE status = 'shipped', the engine reads only the price and status columns. The other 48 columns are never touched.

Systems like DuckDB, ClickHouse, Snowflake, Dremio, Redshift, and BigQuery all use columnar storage as their primary layout. Apache Parquet and ORC are open columnar file formats used across the data ecosystem.

The I/O Math

The savings from columnar storage scale with table width. Consider a concrete example:

  • Table: 50 columns, 1 billion rows, 100 bytes per row average
  • Total data: 100 GB
  • Query: SELECT AVG(price) FROM orders WHERE status = 'shipped'
  • Columns needed: 2 (price + status), approximately 4 GB

I/O comparison showing row store reading 100 GB versus column store reading only 4 GB for the same analytical query

| Storage Layout | Data Read | Percentage of Total |
| --- | --- | --- |
| Row store | 100 GB | 100% |
| Column store | 4 GB | 4% |

That is a 25x reduction in I/O. For a table with 200 columns (common in analytics), the ratio gets even more dramatic.
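The arithmetic above can be checked in a few lines, assuming (as the example does) that all 50 columns have roughly equal average width:

```python
# Verify the I/O math: 50 columns, 1 billion rows, 100 bytes per row.
rows = 1_000_000_000
total_cols = 50
bytes_per_row = 100

total_bytes = rows * bytes_per_row          # full table: what a row store scans
bytes_per_col = total_bytes // total_cols   # assume even column widths
scan_bytes = 2 * bytes_per_col              # price + status: the columnar scan

print(total_bytes / 1e9)           # 100.0 (GB)
print(scan_bytes / 1e9)            # 4.0 (GB)
print(total_bytes // scan_bytes)   # 25 (x reduction)
```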

The tradeoff goes the other direction for point lookups. Fetching one complete record from a column store requires reading from every column file: 50 separate reads for a 50-column table. A row store does it in one.

Why Columnar Compression Is So Much Better

Uniform data within a column enables specialized encoding that mixed-type rows cannot use:

| Encoding | Best For | How It Works |
| --- | --- | --- |
| Run-Length (RLE) | Sorted columns with repeated values | Store (value, count) pairs. A column of 1M "USA" values becomes one entry. |
| Dictionary | Low-cardinality strings | Map each unique string to an integer ID. Store the small integers instead. |
| Delta | Sorted integers/timestamps | Store differences between consecutive values. Monotonic sequences shrink to near-zero. |
| Bit-packing | Small integers | Use the minimum number of bits per value instead of a full 32 or 64 bits. |

These techniques routinely achieve 5-10x compression on analytical datasets. Row stores cannot match this because adjacent bytes in a tuple belong to different data types, defeating any type-specific encoding.
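Two of the encodings from the table, run-length and dictionary, fit in a few lines each. This is a bare sketch of the idea; real formats like Parquet and ORC layer bit-packing on top and make these choices per column chunk:

```python
def rle_encode(values):
    """Run-length encoding: collapse runs of repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [tuple(r) for r in runs]

def dict_encode(values):
    """Dictionary encoding: map each unique string to a small integer ID."""
    mapping = {}
    ids = []
    for v in values:
        ids.append(mapping.setdefault(v, len(mapping)))
    return mapping, ids

status = ["shipped"] * 4 + ["pending"] * 2 + ["shipped"]
print(rle_encode(status))   # [('shipped', 4), ('pending', 2), ('shipped', 1)]
print(dict_encode(status))  # ({'shipped': 0, 'pending': 1}, [0, 0, 0, 0, 1, 1, 0])
```

Note that both encodings only pay off because every value in the input has the same type and, often, low cardinality, which is exactly the situation a column chunk guarantees and a mixed-type tuple does not.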

Late Materialization

Column stores gain additional performance by deferring tuple reconstruction until the very end. This technique is called late materialization:

  1. Scan the status column. Produce a selection vector (a bitmap of matching row positions).
  2. Use that selection vector to read only the matching positions from the price column.
  3. Compute AVG(price) on the filtered values.

At no point did the engine reconstruct a full row. It worked entirely with columnar arrays and position-based selection. This avoids copying irrelevant data and keeps computation in tight, cache-friendly loops that exploit CPU SIMD instructions.
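The three steps above can be sketched over plain Python lists standing in for column chunks (column names follow the running example; a real engine would use packed arrays and SIMD, not Python lists):

```python
status = ["shipped", "pending", "shipped", "returned", "shipped"]
price  = [10.0, 99.0, 30.0, 5.0, 20.0]

# 1. Scan the status column and produce a selection vector of
#    matching row positions.
selection = [i for i, s in enumerate(status) if s == "shipped"]

# 2. Use the selection vector to gather only the matching positions
#    from the price column.
matched = [price[i] for i in selection]

# 3. Aggregate the filtered values; no full row was ever rebuilt.
avg_price = sum(matched) / len(matched)
print(selection, avg_price)   # [0, 2, 4] 20.0
```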

Dremio uses Apache Arrow as its native in-memory columnar format, which is specifically designed for this kind of vectorized, late-materialized processing.

Hybrid Approaches

Not every system picks one side and stays there.

SQL Server lets you add nonclustered columnstore indexes to row-based tables. The query optimizer decides which format to use for each query. Oracle offers an In-Memory Column Store (IMCS) that keeps hot data in both row and column format simultaneously in memory.

Wide-column stores like Cassandra and HBase take a different path. They group related columns into "column families." Within a family, data is stored together (row-like). Across families, storage is separate (column-like). This optimizes for workloads where certain columns are always accessed together.

Parquet and ORC use a hybrid layout at the file level: data is divided into row groups (row-like partitioning), and within each row group, each column is stored separately (column-like). This balances the benefits of columnar scanning with practical record reconstruction when needed.

Where Real Systems Land

Storage format choices across real systems from row-oriented PostgreSQL to column-oriented DuckDB, ClickHouse, Snowflake, and Dremio

| System | Storage Format | Primary Workload | Notes |
| --- | --- | --- | --- |
| PostgreSQL | Row | OLTP | Heap tuples, TOAST for large values |
| MySQL/InnoDB | Row | OLTP | Clustered B-tree by primary key |
| SQL Server | Row + optional column | Mixed | Columnstore indexes for analytics |
| Oracle | Row + optional column | Mixed | In-Memory Column Store (IMCS) |
| DuckDB | Column | OLAP (embedded) | Morsel-driven parallelism |
| ClickHouse | Column | OLAP (real-time) | MergeTree engine, sparse indexes |
| Snowflake | Column | Cloud OLAP | Micro-partitions |
| Dremio | Column | OLAP (lakehouse) | Arrow in-memory, reads Parquet/Iceberg |
| Redshift | Column | Cloud OLAP | MPP, zone maps |
| Cassandra | Wide-column | Write-heavy distributed | LSM-based, column families |

When to Choose Which

The choice is driven by your dominant access pattern:

  • Mostly point lookups and transactional writes (user profiles, order processing, session management): use a row store. PostgreSQL and MySQL are battle-tested here.
  • Mostly analytical scans and aggregations (dashboards, reports, data science): use a column store. DuckDB for embedded, ClickHouse or Dremio for distributed, Snowflake or BigQuery for fully managed cloud.
  • Both workloads on the same data: use separate systems for each (the most common production pattern) or a hybrid like SQL Server with columnstore indexes.

Trying to force a row store into heavy analytics or a column store into high-frequency transactions will produce consistently poor results. The storage layout is the first domino, and it falls in one direction.

