
wtf is duckDB

7 minute read Published: 2025-01-06

What DuckDB is:

Think of DuckDB as a tool that helps you analyze data quickly without setting up a big database server. It's great when you have a pile of data in files (like CSVs or Parquet) and you want to ask questions about it using SQL, but you don't want to deal with a complicated database system.
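
To make that concrete, here's the smallest possible taste (a sketch; users.csv is just a hypothetical file sitting in your working directory):

import duckdb

# Query a CSV file in place: no server, no schema setup, no import step
duckdb.sql("SELECT COUNT(*) AS total_rows FROM 'users.csv'").show()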

Why you might use it:

As a web developer or programmer, you might encounter these situations where DuckDB shines:

  1. Analyzing Data Locally:

    • You have a large log file, some user data, or analytics data stored in CSVs or JSON, and you want to summarize, filter, or manipulate it quickly using SQL.
    • Example: "How many unique users logged in each day last month?"
  2. Building Tools or Features with Data Analysis:

    • You’re building a dashboard, a reporting system, or a feature that needs fast data summarization (e.g., showing trends, counts, averages).
    • Example: A quick data report generator for a web app.
  3. Lightweight Analytics:

    • You need to do some heavy analysis on a dataset but don't want to set up a big database server like Postgres or MySQL.
    • Example: Quickly querying through 1GB of CSV data for patterns without loading it into a traditional database.
  4. Simplifying Data Workflows:

    • You use Python, JavaScript, or another language and don’t want to leave your environment. DuckDB runs inside your app, just like a library.
    • Example: You’re working with Python, have a pandas DataFrame, and want to use SQL to manipulate it.
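
For that last case, here's a minimal sketch (the DataFrame is made up; DuckDB can query local pandas DataFrames by variable name):

import duckdb
import pandas as pd

df = pd.DataFrame({
    "user_id": [12345, 12346, 12345],
    "endpoint": ["/home", "/about", "/home"],
})

# Run SQL directly against the DataFrame and get a DataFrame back
top_pages = duckdb.sql("""
    SELECT endpoint, COUNT(*) AS hits
    FROM df
    GROUP BY endpoint
    ORDER BY hits DESC
""").df()
print(top_pages)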

What you don’t need DuckDB for:

DuckDB isn't a replacement for a transactional, multi-user database like Postgres or MySQL. If your app needs lots of concurrent writes or a long-running database server that many clients connect to, stick with a traditional database. Think of DuckDB as SQLite for analytics: lightweight, easy to embed, and great for one-off or in-app analytical work.

Here’s a real-world example:

Scenario: Analyzing Web Server Logs

Imagine you’ve got a web server log file in CSV format, like this:

timestamp, user_id, endpoint, response_time
2024-12-01T10:15:30Z, 12345, /home, 200
2024-12-01T10:16:00Z, 12346, /about, 350
2024-12-01T10:17:15Z, 12345, /contact, 180
2024-12-01T10:18:20Z, 12347, /home, 220
2024-12-01T10:19:00Z, 12345, /about, 300

You want to answer questions like:

  • Which endpoint is slowest on average?
  • How many times was each endpoint accessed?

Using DuckDB to Answer These Questions

Step 1: Install DuckDB

You can install it in your environment. For Python:

pip install duckdb

Step 2: Write a Quick Script

Here’s how you could load the CSV and analyze it:

import duckdb

# Load the log file into DuckDB
con = duckdb.connect()  # Start DuckDB
con.execute("CREATE TABLE logs AS SELECT * FROM read_csv_auto('server_logs.csv')")

# Query 1: Find the slowest endpoint on average
slowest_endpoint = con.execute("""
    SELECT endpoint, AVG(response_time) AS avg_response_time
    FROM logs
    GROUP BY endpoint
    ORDER BY avg_response_time DESC
    LIMIT 1
""").fetchall()
print("Slowest Endpoint:", slowest_endpoint)

# Query 2: Count accesses per endpoint
endpoint_counts = con.execute("""
    SELECT endpoint, COUNT(*) AS num_accesses
    FROM logs
    GROUP BY endpoint
    ORDER BY num_accesses DESC
""").fetchall()
print("Endpoint Access Counts:")
for row in endpoint_counts:
    print(row)

# Clean up
con.close()

Step 3: Example Output

Let’s say the script gives you:

Slowest Endpoint: [('/about', 325.0)]
Endpoint Access Counts:
('/home', 2)
('/about', 2)
('/contact', 1)

What Happened?

  1. Load the Data: DuckDB read the CSV and treated it like a mini-database table.
  2. Query with SQL: You wrote SQL queries to find averages, counts, and ordered results.
  3. Got Answers Fast: Even with a large dataset, DuckDB would handle this in seconds.

Why DuckDB Over Alternatives?

  • No server to set up: unlike Postgres or MySQL, there's nothing to install, configure, or keep running.
  • Built for analytics: its columnar, vectorized engine makes aggregations over large files much faster than looping over rows yourself.
  • Queries files directly: CSV, JSON, and Parquet can be read in place, with no import step.

That embedded, zero-setup design also makes DuckDB a good fit for serverless environments like AWS Lambda.

Why DuckDB Works Well in Lambda

  1. In-Process Database:

    • DuckDB runs inside your process and keeps data in memory by default (unless you open a persistent database file).
    • No external database connection or server is required, which aligns well with Lambda’s stateless nature.
  2. Lightweight:

    • DuckDB is compact and can be included in your Lambda deployment package without much overhead.
  3. Temporary Data Processing:

    • Perfect for processing on-the-fly data, like analyzing files (e.g., CSVs or Parquet) stored in S3.

Example: Analyzing a CSV in S3 with Lambda and DuckDB

Problem:

You have a CSV file in an S3 bucket containing website analytics data. You want a Lambda function to:

  1. Load the file.
  2. Count the number of visits per page.

Step 1: Lambda Setup

Install DuckDB and package it with your function code.

  1. Create a virtual environment:
    python -m venv venv
    source venv/bin/activate
    pip install duckdb boto3
    
  2. Package your code and dependencies so they sit at the root of the zip (Lambda imports from the top level, not from a nested venv/ path):
    cd venv/lib/python3.*/site-packages
    zip -r ../../../../lambda_function.zip .
    cd -
    zip -g lambda_function.zip lambda_function.py
    

Step 2: Lambda Function Code

Here’s the code for your Lambda:

import duckdb
import boto3
import os

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Input from event: S3 bucket and key
    bucket_name = event['bucket']
    file_key = event['key']
    
    # Download the file from S3
    local_file = '/tmp/data.csv'
    s3.download_file(bucket_name, file_key, local_file)

    # Initialize DuckDB and process the CSV
    con = duckdb.connect()
    con.execute("CREATE TABLE data AS SELECT * FROM read_csv_auto(?)", [local_file])

    # Query: Count visits per page
    results = con.execute("""
        SELECT page, COUNT(*) AS visits
        FROM data
        GROUP BY page
        ORDER BY visits DESC
    """).fetchall()
    
    # Clean up
    os.remove(local_file)
    con.close()

    return {"page_visits": results}

Step 3: Deploy to AWS

Upload the zipped function to Lambda and configure the required IAM role to allow S3 access.
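
If you'd rather script this than click through the console, boto3 can create the function from the zip. A sketch with placeholder names and a placeholder role ARN (swap in your own), assuming the zip stays under Lambda's direct-upload size limit; larger packages have to go via S3:

import boto3

lam = boto3.client("lambda")

with open("lambda_function.zip", "rb") as f:
    lam.create_function(
        FunctionName="duckdb-log-analyzer",                          # hypothetical name
        Runtime="python3.12",
        Role="arn:aws:iam::123456789012:role/your-lambda-s3-role",   # placeholder ARN
        Handler="lambda_function.lambda_handler",
        Code={"ZipFile": f.read()},
        Timeout=30,
        MemorySize=512,
    )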


Example Event Trigger:

To invoke the Lambda function, you could use an event like this:

{
    "bucket": "your-s3-bucket-name",
    "key": "path/to/your-data.csv"
}

Example Output:

If your CSV has a page column, the output might look like:

{
    "page_visits": [
        ["/home", 1234],
        ["/about", 567],
        ["/contact", 234]
    ]
}

Key Considerations:

  1. Temporary Storage:

    • Lambda's /tmp directory (512MB by default, configurable up to 10GB) is where you can store temporary files like the CSV.
  2. Memory and Timeout:

    • Ensure your Lambda has enough memory and timeout to process large datasets.
    • DuckDB is very efficient, but Lambda functions have a max timeout of 15 minutes.
  3. Cold Start:

    • Including DuckDB in your Lambda package increases its size slightly, which may affect cold start times. Use AWS Lambda Layers to manage dependencies separately if needed.
  4. File Formats:

    • DuckDB supports Parquet, which is better for large-scale data analysis than CSV. Consider converting your data to Parquet for better performance.
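
Building on points 1 and 4: instead of downloading the file to /tmp, DuckDB's httpfs extension can read Parquet (or CSV) straight out of S3. This is a sketch rather than part of the walkthrough above; the bucket path is hypothetical, and it assumes the function can fetch the httpfs extension at runtime (which needs outbound internet access and a writable directory, since Lambda's filesystem is read-only outside /tmp):

import os
import duckdb

con = duckdb.connect()
con.execute("SET home_directory='/tmp'")     # extensions must land somewhere writable
con.execute("INSTALL httpfs; LOAD httpfs;")

# Lambda exposes its execution-role credentials as environment variables
con.execute(f"SET s3_region='{os.environ['AWS_REGION']}'")
con.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}'")
con.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}'")
con.execute(f"SET s3_session_token='{os.environ['AWS_SESSION_TOKEN']}'")

# Query the object in place, no download step
rows = con.execute("""
    SELECT page, COUNT(*) AS visits
    FROM read_parquet('s3://your-s3-bucket-name/path/to/your-data.parquet')
    GROUP BY page
    ORDER BY visits DESC
""").fetchall()
con.close()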

Here's a fuller, step-by-step version of the same workflow, using DuckDB in AWS Lambda to process data files from S3, this time packaging the dependencies as a Lambda Layer.


Step 1: Install Dependencies Locally

First, install DuckDB and boto3 (for AWS S3 interaction) in a Python virtual environment.

  1. Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate
    
  2. Install the required libraries:

    pip install duckdb boto3
    
  3. Package the dependencies:

    mkdir python
    pip install --target=python duckdb boto3
    zip -r lambda_dependencies.zip python
    

This creates a Lambda Layer zip file containing the dependencies. A Lambda Layer is a way to include shared libraries without bundling them with each function.


Step 2: Write the Lambda Function Code

Here’s the complete function code for analyzing a CSV stored in S3:

import duckdb
import boto3
import os

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Input from the event: S3 bucket and key
    bucket_name = event['bucket']
    file_key = event['key']
    
    # Download the file to Lambda's /tmp directory
    local_file = '/tmp/data.csv'
    s3.download_file(bucket_name, file_key, local_file)

    try:
        # Initialize DuckDB and process the CSV
        con = duckdb.connect()
        con.execute("CREATE TABLE data AS SELECT * FROM read_csv_auto(?)", [local_file])

        # Query: Count visits per page
        results = con.execute("""
            SELECT page, COUNT(*) AS visits
            FROM data
            GROUP BY page
            ORDER BY visits DESC
        """).fetchall()

        # Prepare the results for return
        output = [{"page": row[0], "visits": row[1]} for row in results]
        con.close()
    finally:
        # Clean up the temporary file
        if os.path.exists(local_file):
            os.remove(local_file)

    return {"page_visits": output}

Save this as lambda_function.py.


Step 3: Package the Lambda Function

  1. Zip your lambda_function.py:
    zip lambda_function.zip lambda_function.py
    

Step 4: Deploy to AWS

Upload the Layer

  1. Go to the AWS Lambda Console → Layers → Create Layer.
  2. Upload the lambda_dependencies.zip file created earlier.
  3. Choose a runtime compatible with your function (e.g., Python 3.9).

Create the Lambda Function

  1. Go to the Lambda Console and create a new function:
    • Runtime: Python 3.x
  2. Upload the lambda_function.zip file.
  3. Add the DuckDB Layer to the function:
    • Go to the Lambda function’s Layers section and add the previously created layer.

Step 5: Configure IAM Permissions

Ensure the Lambda function has the necessary permissions to access S3. Attach an IAM role with the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::your-bucket-name/*"
        }
    ]
}

Replace your-bucket-name with your actual S3 bucket name.


Step 6: Test the Function

Trigger the function with the following test event:

{
    "bucket": "your-s3-bucket-name",
    "key": "path/to/your-data.csv"
}

Example output:

{
    "page_visits": [
        {"page": "/home", "visits": 1234},
        {"page": "/about", "visits": 567},
        {"page": "/contact", "visits": 234}
    ]
}
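
You can also invoke the function programmatically instead of using the console test event. A sketch with a hypothetical function name:

import json
import boto3

lam = boto3.client("lambda")

response = lam.invoke(
    FunctionName="duckdb-csv-analyzer",    # hypothetical name
    Payload=json.dumps({
        "bucket": "your-s3-bucket-name",
        "key": "path/to/your-data.csv",
    }),
)
print(json.loads(response["Payload"].read()))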

Step 7: Optimize for Performance

  1. File Formats: If the data grows large, use Parquet files instead of CSV. DuckDB reads Parquet natively and scans it much faster than CSV.

    con.execute("CREATE TABLE data AS SELECT * FROM read_parquet(?)", [local_file])
    
  2. Memory Allocation: Increase the Lambda function's memory (e.g., 512MB or higher) for larger datasets.

  3. Timeouts: Ensure the function’s timeout is long enough to process larger files (e.g., 30 seconds).

Creating Parquet files is straightforward and can be done using several tools and programming libraries. Here's how you can create Parquet files depending on your preferred workflow:


1. Using Python (Pandas)

If you already have a CSV or DataFrame, you can convert it to Parquet with the pandas library.

Install Required Libraries:

pip install pandas pyarrow

Example Code:

import pandas as pd

# Create a sample DataFrame
data = {
    "timestamp": ["2024-12-01T10:15:30Z", "2024-12-01T10:16:00Z"],
    "user_id": [12345, 12346],
    "endpoint": ["/home", "/about"],
    "response_time": [200, 350],
}
df = pd.DataFrame(data)

# Save as Parquet file
df.to_parquet("data.parquet", engine="pyarrow", index=False)

This will create a data.parquet file.
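
To sanity-check the file, you can read it straight back with DuckDB:

import duckdb

# DuckDB can query a Parquet file by filename, just like a table
duckdb.sql("SELECT * FROM 'data.parquet'").show()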


2. Using Python (DuckDB)

DuckDB itself can create Parquet files directly.

Install DuckDB:

pip install duckdb

Example Code:

import duckdb

# Sample CSV file
csv_file = "data.csv"

# Convert CSV to Parquet using DuckDB
con = duckdb.connect()
con.execute("COPY (SELECT * FROM read_csv_auto(?)) TO 'data.parquet' (FORMAT PARQUET)", [csv_file])
con.close()

This reads the CSV file and saves it as a Parquet file.


3. Using Apache Spark

If you're working with large datasets, Apache Spark is a powerful tool for handling Parquet files.

Example in PySpark:

  1. Install PySpark:
    pip install pyspark
    
  2. Example Code:
    from pyspark.sql import SparkSession
    
    # Initialize Spark
    spark = SparkSession.builder.appName("CSV to Parquet").getOrCreate()
    
    # Read CSV file
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    
    # Write to Parquet
    df.write.parquet("data.parquet")
    

4. Using Command-Line Tools

If you don’t want to write a script, the DuckDB CLI can do the conversion in one line. (Standalone converters like csv2parquet exist too; note that parquet-tools is mainly for inspecting Parquet files, not creating them.)

Install the DuckDB CLI:

brew install duckdb  # Mac
# or use another package manager

Convert CSV to Parquet:

duckdb -c "COPY (SELECT * FROM read_csv_auto('data.csv')) TO 'data.parquet' (FORMAT PARQUET)"

5. Using AWS Services

If your data is already in AWS, you can convert it without running anything locally: an Athena CTAS query can write the results of a SELECT as Parquet, and an AWS Glue ETL job can read CSVs from S3 and write Parquet output.


Why Use Parquet?

  • It's columnar, so queries that touch only a few columns read far less data than they would from a CSV.
  • It's compressed and binary, so files are much smaller than the equivalent CSV.
  • It stores a schema, so types like timestamps and integers don't have to be re-inferred on every read.

For the kind of file-based analytics DuckDB is built for, converting large CSVs to Parquet is usually the easiest performance win.