What DuckDB is:
Think of DuckDB as a tool that helps you analyze data quickly without setting up a big database server. It's great when you have a pile of data in files (like CSVs or Parquet) and you want to ask questions about it using SQL, but you don't want to deal with a complicated database system.
Why you might use it:
As a web developer or programmer, you might encounter these situations where DuckDB shines:
Analyzing Data Locally:
- You have a large log file, some user data, or analytics data stored in CSVs or JSON, and you want to summarize, filter, or manipulate it quickly using SQL.
- Example: "How many unique users logged in each day last month?"
Building Tools or Features with Data Analysis:
- You’re building a dashboard, a reporting system, or a feature that needs fast data summarization (e.g., showing trends, counts, averages).
- Example: A quick data report generator for a web app.
Lightweight Analytics:
- You need to do some heavy analysis on a dataset but don't want to set up a big database server like Postgres or MySQL.
- Example: Quickly querying through 1GB of CSV data for patterns without loading it into a traditional database.
Simplifying Data Workflows:
- You use Python, JavaScript, or another language and don’t want to leave your environment. DuckDB runs inside your app, just like a library.
- Example: You’re working with Python, have a pandas DataFrame, and want to use SQL to manipulate it.
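For that last case, here's a minimal sketch of querying a pandas DataFrame with DuckDB; the DataFrame contents are made up for illustration. DuckDB can pick up a local DataFrame by referring to its variable name in SQL:
import duckdb
import pandas as pd

# A small made-up DataFrame
df = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "endpoint": ["/home", "/about", "/home", "/contact"],
})

# DuckDB resolves the local variable `df` as a table and can return a DataFrame back
result = duckdb.sql(
    "SELECT endpoint, COUNT(*) AS hits FROM df GROUP BY endpoint ORDER BY hits DESC"
).df()
print(result)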
What you don’t need DuckDB for:
- Day-to-day CRUD operations in your app (you have databases like Postgres or MySQL for that).
- Serving data to users in real-time through APIs.
DuckDB is like SQLite for analyzing data: lightweight, easy to use, and good for one-off or embedded analytics tasks. Would you like to see a real-world example with data you’re familiar with?
Here’s a real-world example:
Scenario: Analyzing Web Server Logs
Imagine you’ve got a web server log file in CSV format, like this:
timestamp, user_id, endpoint, response_time
2024-12-01T10:15:30Z, 12345, /home, 200
2024-12-01T10:16:00Z, 12346, /about, 350
2024-12-01T10:17:15Z, 12345, /contact, 180
2024-12-01T10:18:20Z, 12347, /home, 220
2024-12-01T10:19:00Z, 12345, /about, 300
You want to answer questions like:
- Which endpoint is the slowest on average?
- How many users accessed each endpoint?
Using DuckDB to Answer These Questions
Step 1: Install DuckDB
You can install it in your environment. For Python:
pip install duckdb
Step 2: Write a Quick Script
Here’s how you could load the CSV and analyze it:
import duckdb
# Load the log file into DuckDB
con = duckdb.connect() # Start DuckDB
con.execute("CREATE TABLE logs AS SELECT * FROM read_csv_auto('server_logs.csv')")
# Query 1: Find the slowest endpoint on average
slowest_endpoint = con.execute("""
SELECT endpoint, AVG(response_time) AS avg_response_time
FROM logs
GROUP BY endpoint
ORDER BY avg_response_time DESC
LIMIT 1
""").fetchall()
print("Slowest Endpoint:", slowest_endpoint)
# Query 2: Count accesses per endpoint
endpoint_counts = con.execute("""
SELECT endpoint, COUNT(*) AS num_accesses
FROM logs
GROUP BY endpoint
ORDER BY num_accesses DESC
""").fetchall()
print("Endpoint Access Counts:")
for row in endpoint_counts:
    print(row)
# Clean up
con.close()
Step 3: Example Output
Let’s say the script gives you:
Slowest Endpoint: [('/about', 325.0)]
Endpoint Access Counts:
('/home', 2)
('/about', 2)
('/contact', 1)
What Happened?
- Load the Data: DuckDB read the CSV and treated it like a mini-database table.
- Query with SQL: You wrote SQL queries to find averages, counts, and ordered results.
- Got Answers Fast: Even with a large dataset, DuckDB would handle this in seconds.
Why DuckDB Over Alternatives?
- No need to load the data into a heavy database like Postgres or MySQL.
- Works directly on CSVs and other file formats (see the sketch below).
- Simple setup and blazing-fast performance for analytical tasks.
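As an illustration of that second point, you don't even need the CREATE TABLE step; DuckDB can query the CSV file in place. A quick sketch reusing server_logs.csv from above:
import duckdb

# Query the file directly in the FROM clause; no table creation required
print(duckdb.sql(
    "SELECT endpoint, AVG(response_time) AS avg_response_time FROM 'server_logs.csv' GROUP BY endpoint"
).fetchall())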
Would you like help setting this up with sample data, or does this example align with your needs?
Why DuckDB Works Well in Lambda
In-Process Database:
- DuckDB runs in-process and, by default, entirely in memory (unless you use its persistent, file-backed storage; see the short sketch after this list).
- No external database connection or server is required, which aligns well with Lambda’s stateless nature.
Lightweight:
- DuckDB is compact and can be included in your Lambda deployment package without much overhead.
Temporary Data Processing:
- Perfect for processing on-the-fly data, like analyzing files (e.g., CSVs or Parquet) stored in S3.
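To make the in-memory vs. persistent distinction concrete, here is a minimal sketch; the file path is just an example, and inside Lambda a database file would have to live under /tmp:
import duckdb

# Default: a purely in-memory database; everything disappears when the connection closes
mem_con = duckdb.connect()

# File-backed: tables persist in the database file between connections
# (inside Lambda, /tmp is the only writable location)
disk_con = duckdb.connect('/tmp/analytics.duckdb')
disk_con.execute("CREATE TABLE IF NOT EXISTS hits (page VARCHAR, visits INTEGER)")
disk_con.close()
mem_con.close()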
Example: Analyzing a CSV in S3 with Lambda and DuckDB
Problem:
You have a CSV file in an S3 bucket containing website analytics data. You want a Lambda function to:
- Load the file.
- Count the number of visits per page.
Step 1: Lambda Setup
Install DuckDB and package it with your function code.
- Create a virtual environment and install the dependencies:
python -m venv venv
source venv/bin/activate
pip install duckdb boto3
- Package your code together with the dependencies (the packages must sit at the root of the zip for Lambda to import them):
cd venv/lib/python3.*/site-packages
zip -r ../../../../lambda_function.zip .
cd -
zip -g lambda_function.zip lambda_function.py
Step 2: Lambda Function Code
Here’s the code for your Lambda:
import duckdb
import boto3
import os
s3 = boto3.client('s3')
def lambda_handler(event, context):
    # Input from the event: S3 bucket and key
    bucket_name = event['bucket']
    file_key = event['key']

    # Download the file from S3 to Lambda's writable /tmp directory
    local_file = '/tmp/data.csv'
    s3.download_file(bucket_name, file_key, local_file)

    # Initialize DuckDB and load the CSV
    con = duckdb.connect()
    con.execute("CREATE TABLE data AS SELECT * FROM read_csv_auto(?)", [local_file])

    # Query: count visits per page
    results = con.execute("""
        SELECT page, COUNT(*) AS visits
        FROM data
        GROUP BY page
        ORDER BY visits DESC
    """).fetchall()

    # Clean up
    os.remove(local_file)
    con.close()

    return {"page_visits": results}
Step 3: Deploy to AWS
Upload the zipped function to Lambda and configure the required IAM role to allow S3 access.
Example Event Trigger:
To invoke the Lambda function, you could use an event like this:
{
  "bucket": "your-s3-bucket-name",
  "key": "path/to/your-data.csv"
}
Example Output:
If your CSV has a page column, the output might look like:
{
  "page_visits": [
    ["/home", 1234],
    ["/about", 567],
    ["/contact", 234]
  ]
}
Key Considerations:
Temporary Storage:
- Lambda has a 512MB /tmp directory where you can store temporary files like the downloaded CSV.
Memory and Timeout:
- Ensure your Lambda has enough memory and a long enough timeout to process large datasets.
- DuckDB is very efficient, but Lambda functions have a maximum timeout of 15 minutes.
Cold Start:
- Including DuckDB in your Lambda package increases its size slightly, which may affect cold start times. Use AWS Lambda Layers to manage dependencies separately if needed.
File Formats:
- DuckDB supports Parquet, which is better for large-scale data analysis than CSV. Consider converting your data to Parquet for better performance.
Here's a step-by-step guide to set up a workflow for using DuckDB in AWS Lambda to process data files from S3.
Step 1: Install Dependencies Locally
First, install DuckDB and boto3 (for AWS S3 interaction) in a Python virtual environment.
Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
Install the required libraries:
pip install duckdb boto3
Package the dependencies for a Lambda Layer:
mkdir python
pip install --target=python duckdb boto3
zip -r lambda_dependencies.zip python
This creates a Lambda Layer zip file containing the dependencies. A Lambda Layer is a way to include shared libraries without bundling them with each function; for Python runtimes, Lambda expects layer libraries under a top-level python/ directory, which is why the packages are installed into a folder with that exact name.
Step 2: Write the Lambda Function Code
Here’s the complete function code for analyzing a CSV stored in S3:
import duckdb
import boto3
import os
s3 = boto3.client('s3')
def lambda_handler(event, context):
    # Input from the event: S3 bucket and key
    bucket_name = event['bucket']
    file_key = event['key']

    # Download the file to Lambda's /tmp directory
    local_file = '/tmp/data.csv'
    s3.download_file(bucket_name, file_key, local_file)

    con = duckdb.connect()
    try:
        # Load the CSV into DuckDB
        con.execute("CREATE TABLE data AS SELECT * FROM read_csv_auto(?)", [local_file])

        # Query: count visits per page
        results = con.execute("""
            SELECT page, COUNT(*) AS visits
            FROM data
            GROUP BY page
            ORDER BY visits DESC
        """).fetchall()

        # Prepare the results for return
        output = [{"page": row[0], "visits": row[1]} for row in results]
    finally:
        # Clean up the connection and the temporary file
        con.close()
        if os.path.exists(local_file):
            os.remove(local_file)

    return {"page_visits": output}
Save this as lambda_function.py.
Step 3: Package the Lambda Function
- Zip your lambda_function.py:
zip lambda_function.zip lambda_function.py
Step 4: Deploy to AWS
Upload the Layer
- Go to the AWS Lambda Console → Layers → Create Layer.
- Upload the lambda_dependencies.zip file created earlier.
- Choose a runtime compatible with your function (e.g., Python 3.9).
Create the Lambda Function
- Go to the Lambda Console and create a new function:
- Runtime: Python 3.x
- Upload the lambda_function.zip file.
- Add the DuckDB Layer to the function:
- Go to the Lambda function’s Layers section and add the previously created layer.
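If you'd rather script these console steps than click through them, here's a rough boto3 sketch; the layer name, function name, and role ARN below are placeholders you'd replace with your own:
import boto3

client = boto3.client('lambda')

# Publish the dependency layer built in Step 1
layer = client.publish_layer_version(
    LayerName='duckdb-deps',  # placeholder name
    Content={'ZipFile': open('lambda_dependencies.zip', 'rb').read()},
    CompatibleRuntimes=['python3.9'],
)

# Create the function, attach the layer, and give it some memory and timeout headroom
client.create_function(
    FunctionName='duckdb-s3-analytics',  # placeholder name
    Runtime='python3.9',
    Role='arn:aws:iam::123456789012:role/your-lambda-role',  # placeholder ARN
    Handler='lambda_function.lambda_handler',
    Code={'ZipFile': open('lambda_function.zip', 'rb').read()},
    Layers=[layer['LayerVersionArn']],
    Timeout=30,
    MemorySize=512,
)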
Step 5: Configure IAM Permissions
Ensure the Lambda function has the necessary permissions to access S3. Attach an IAM role with the following policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}
Replace your-bucket-name with your actual S3 bucket name.
Step 6: Test the Function
Trigger the function with the following test event:
{
  "bucket": "your-s3-bucket-name",
  "key": "path/to/your-data.csv"
}
Example output:
{
  "page_visits": [
    {"page": "/home", "visits": 1234},
    {"page": "/about", "visits": 567},
    {"page": "/contact", "visits": 234}
  ]
}
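To invoke the deployed function from code rather than the console's test tab, here's a minimal boto3 sketch (the function name is a placeholder):
import json
import boto3

client = boto3.client('lambda')
response = client.invoke(
    FunctionName='duckdb-s3-analytics',  # placeholder name
    InvocationType='RequestResponse',
    Payload=json.dumps({"bucket": "your-s3-bucket-name", "key": "path/to/your-data.csv"}).encode(),
)
print(json.loads(response['Payload'].read()))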
Step 7: Optimize for Performance
- File Formats: If the data grows large, use Parquet instead of CSV. DuckDB reads Parquet natively and queries it much faster than CSV:
con.execute("CREATE TABLE data AS SELECT * FROM read_parquet(?)", [local_file])
- Memory Allocation: Increase the Lambda function's memory (e.g., 512MB or higher) for larger datasets.
- Timeouts: Ensure the function’s timeout is long enough to process larger files (e.g., 30 seconds).
Creating Parquet files is straightforward and can be done using several tools and programming libraries. Here's how you can create Parquet files depending on your preferred workflow:
1. Using Python (Pandas)
If you already have a CSV or DataFrame, you can convert it to Parquet with the pandas library.
Install Required Libraries:
pip install pandas pyarrow
Example Code:
import pandas as pd
# Create a sample DataFrame
data = {
    "timestamp": ["2024-12-01T10:15:30Z", "2024-12-01T10:16:00Z"],
    "user_id": [12345, 12346],
    "endpoint": ["/home", "/about"],
    "response_time": [200, 350],
}
df = pd.DataFrame(data)
# Save as Parquet file
df.to_parquet("data.parquet", engine="pyarrow", index=False)
This will create a data.parquet file.
2. Using Python (DuckDB)
DuckDB itself can create Parquet files directly.
Install DuckDB:
pip install duckdb
Example Code:
import duckdb
# Sample CSV file
csv_file = "data.csv"
# Convert CSV to Parquet using DuckDB
con = duckdb.connect()
con.execute("COPY (SELECT * FROM read_csv_auto(?)) TO 'data.parquet' (FORMAT PARQUET)", [csv_file])
con.close()
This reads the CSV file and saves it as a Parquet file.
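To sanity-check the result, you can query the new Parquet file directly (a quick sketch):
import duckdb

# Row count and a peek at the first few rows of the converted file
print(duckdb.sql("SELECT COUNT(*) FROM 'data.parquet'").fetchone())
print(duckdb.sql("SELECT * FROM 'data.parquet' LIMIT 5").fetchall())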
3. Using Apache Spark
If you're working with large datasets, Apache Spark is a powerful tool for handling Parquet files.
Example in PySpark:
- Install PySpark:
pip install pyspark
- Example Code:
from pyspark.sql import SparkSession

# Initialize Spark
spark = SparkSession.builder.appName("CSV to Parquet").getOrCreate()

# Read CSV file
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Write to Parquet
df.write.parquet("data.parquet")
4. Using Command-Line Tools
If you don’t want to write a script, the DuckDB command-line client can do the conversion in one line; standalone utilities such as csv2parquet also exist. (Note that parquet-tools is mainly for inspecting existing Parquet files, not creating them.)
Install the DuckDB CLI:
brew install duckdb  # Mac
# or download the binary from duckdb.org
Convert CSV to Parquet:
echo "COPY (SELECT * FROM read_csv_auto('data.csv')) TO 'data.parquet' (FORMAT PARQUET);" | duckdb
5. Using AWS Services
If your data is in AWS, you can convert it using AWS Glue or Athena:
- AWS Glue: Create an ETL job to transform CSVs to Parquet.
- AWS Athena: Run a CREATE TABLE AS SELECT query on your S3-stored CSV to produce a Parquet file.
Why Use Parquet?
- Compact: Parquet is a compressed, columnar storage format, so files are typically much smaller than the equivalent CSV.
- Efficient: Optimized for analytical queries, making it much faster than CSV for DuckDB or other analytical tools (see the timing sketch below).
- Schema Support: Preserves data types and structures.
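If you want to see the speed difference on your own data, here's a small timing sketch. It assumes the data.csv and data.parquet files from the examples above (with endpoint and response_time columns); the numbers depend entirely on your machine and dataset:
import time
import duckdb

def timed(sql):
    # Run a query and return how long it took in seconds
    start = time.perf_counter()
    duckdb.sql(sql).fetchall()
    return time.perf_counter() - start

csv_time = timed("SELECT endpoint, AVG(response_time) FROM 'data.csv' GROUP BY endpoint")
parquet_time = timed("SELECT endpoint, AVG(response_time) FROM 'data.parquet' GROUP BY endpoint")
print(f"CSV: {csv_time:.3f}s  Parquet: {parquet_time:.3f}s")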