wtf is columnar data

3 minute read Published: 2025-01-06

What is Columnar Storage?

Columnar storage is a way of organizing data in a database where values of a single column are stored together, rather than storing all the data for a single row together. This storage method is optimized for read-heavy, analytical workloads where queries often operate on a subset of columns over many rows.

How It Works:

In row-oriented databases, like MySQL or PostgreSQL, data is stored row by row:

Row 1: [ID1, Name1, Age1, Salary1]
Row 2: [ID2, Name2, Age2, Salary2]

In column-oriented databases, data is stored column by column:

ID:   [ID1, ID2, ID3, ...]
Name: [Name1, Name2, Name3, ...]
Age:  [Age1, Age2, Age3, ...]
Salary: [Salary1, Salary2, Salary3, ...]

Key Differences Between Columnar and Row-Oriented Storage

Aspect	Row-Oriented Storage	Columnar Storage
Data Layout	Stores data row by row.	Stores data column by column.
Best Use Case	Transactional workloads (OLTP) like updating user accounts, processing orders, etc.	Analytical workloads (OLAP) like aggregations, filtering, and reporting.
Query Performance	Good for queries involving entire rows.	Fast for queries involving a few columns over many rows.
Data Compression	Less effective; different data types in rows make compression harder.	Highly compressible; similar data types in columns lead to better compression.
Write Performance	Faster for frequent row-level updates.	Slower for updates, as entire columns may need rewriting.
Examples	MySQL, PostgreSQL, SQLite.	DuckDB, Apache Parquet, ClickHouse.

Why Columnar Storage Matters

1. Efficient Analytical Queries

Columnar storage is optimized for queries that:

Retrieve specific columns (e.g., SELECT Age FROM table).
Perform aggregates over large datasets (e.g., SUM(Salary) or AVG(Age)).

By only reading the columns needed, columnar storage avoids wasting time scanning irrelevant data.

2. Better Compression

Because all the data in a column is of the same type (e.g., integers, floats, strings), columnar storage compresses data better. For example:

The "Age" column (integers) can be compressed using run-length encoding.
The "Name" column (strings) can be compressed using dictionary encoding.

This reduces the size of data on disk and speeds up queries.

3. Vectorized Processing

Columnar databases often use vectorized execution, which processes chunks of columns in batches. This is faster than row-by-row processing.

When to Use Columnar vs. Row-Oriented Databases

Use Columnar Storage for:

Analytics: Dashboards, reporting systems, or BI tools.
Big Data: Handling datasets with billions of rows where you often run aggregations.
Data Warehousing: Storing historical data for periodic analysis.

Use Row-Oriented Storage for:

Transaction Processing: Web apps, e-commerce systems, or CRM tools where data is frequently updated.
APIs: Use cases that involve frequent read/write operations with complete records.

Example Comparison

Scenario: Salary Data for Employees

If you want to find the average salary of employees:

Row-Oriented:
- Reads all rows (including ID, Name, and Age) even though only the Salary column is needed.
Columnar:
- Reads just the Salary column, skipping ID, Name, and Age.

This makes columnar storage faster and more resource-efficient for this type of query.

Columnar Storage in Action

Modern tools and formats like DuckDB, Apache Parquet, and Amazon Redshift leverage columnar storage to handle analytics efficiently.

Would you like help setting up a columnar database or converting your data into columnar formats like Parquet?