What DuckDB is:
Think of DuckDB as a tool that helps you analyze data quickly without setting up a big database server. It's great when you have a pile of data in files (like CSVs or Parquet) and you want to ask questions about it using SQL, but you don't want to deal with a complicated database system.
Why you might use it:
As a web developer or programmer, you might encounter these situations where DuckDB shines:
Analyzing Data Locally:
- You have a large log file, some user data, or analytics data stored in CSVs or JSON, and you want to summarize, filter, or manipulate it quickly using SQL.
- Example: "How many unique users logged in each day last month?"
Building Tools or Features with Data Analysis:
- You’re building a dashboard, a reporting system, or a feature that needs fast data summarization (e.g., showing trends, counts, averages).
- Example: A quick data report generator for a web app.
Lightweight Analytics:
- You need to do some heavy analysis on a dataset but don't want to set up a big database server like Postgres or MySQL.
- Example: Quickly querying through 1GB of CSV data for patterns without loading it into a traditional database.
Simplifying Data Workflows:
- You use Python, JavaScript, or another language and don’t want to leave your environment. DuckDB runs inside your app, just like a library.
- Example: You’re working with Python, have a pandas DataFrame, and want to use SQL to manipulate it.
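For that last case, here's a minimal sketch of querying a pandas DataFrame with DuckDB; the DataFrame contents are made up for illustration. DuckDB can pick up a local DataFrame by referring to its variable name in SQL:
import duckdb
import pandas as pd

# A small made-up DataFrame
df = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "endpoint": ["/home", "/about", "/home", "/contact"],
})

# DuckDB resolves the local variable `df` as a table and can return a DataFrame back
result = duckdb.sql(
    "SELECT endpoint, COUNT(*) AS hits FROM df GROUP BY endpoint ORDER BY hits DESC"
).df()
print(result)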
What you don’t need DuckDB for:
- Day-to-day CRUD operations in your app (you have databases like Postgres or MySQL for that).
- Serving data to users in real-time through APIs.
DuckDB is like SQLite for analyzing data: lightweight, easy to use, and good for one-off or embedded analytics tasks. Would you like to see a real-world example with data you’re familiar with?
Here’s a real-world example:
Scenario: Analyzing Web Server Logs
Imagine you’ve got a web server log file in CSV format, like this:
timestamp, user_id, endpoint, response_time
2024-12-01T10:15:30Z, 12345, /home, 200
2024-12-01T10:16:00Z, 12346, /about, 350
2024-12-01T10:17:15Z, 12345, /contact, 180
2024-12-01T10:18:20Z, 12347, /home, 220
2024-12-01T10:19:00Z, 12345, /about, 300
You want to answer questions like:
- Which endpoint is the slowest on average?
- How many users accessed each endpoint?
Using DuckDB to Answer These Questions
Step 1: Install DuckDB
You can install it in your environment. For Python:
pip install duckdb
Step 2: Write a Quick Script
Here’s how you could load the CSV and analyze it:
import duckdb
# Load the log file into DuckDB
con = duckdb.connect() # Start DuckDB
con.execute("CREATE TABLE logs AS SELECT * FROM read_csv_auto('server_logs.csv')")
# Query 1: Find the slowest endpoint on average
slowest_endpoint = con.execute("""
SELECT endpoint, AVG(response_time) AS avg_response_time
FROM logs
GROUP BY endpoint
ORDER BY avg_response_time DESC
LIMIT 1
""").fetchall()
print("Slowest Endpoint:", slowest_endpoint)
# Query 2: Count accesses per endpoint
endpoint_counts = con.execute("""
SELECT endpoint, COUNT(*) AS num_accesses
FROM logs
GROUP BY endpoint
ORDER BY num_accesses DESC
""").fetchall()
print("Endpoint Access Counts:")
for row in endpoint_counts:
    print(row)
# Clean up
con.close()
Step 3: Example Output
Let’s say the script gives you:
Slowest Endpoint: [('/about', 325.0)]
Endpoint Access Counts:
('/home', 2)
('/about', 2)
('/contact', 1)
What Happened?
- Load the Data: DuckDB read the CSV and treated it like a mini-database table.
- Query with SQL: You wrote SQL queries to find averages, counts, and ordered results.
- Got Answers Fast: Even with a large dataset, DuckDB would handle this in seconds.
Why DuckDB Over Alternatives?
- No need to load the data into a heavy database like Postgres or MySQL.
- Works directly on CSVs and other file formats (see the sketch below).
- Simple setup and blazing-fast performance for analytical tasks.
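As an illustration of that second point, you don't even need the CREATE TABLE step; DuckDB can query the CSV file in place. A quick sketch reusing server_logs.csv from above:
import duckdb

# Query the file directly in the FROM clause; no table creation required
print(duckdb.sql(
    "SELECT endpoint, AVG(response_time) AS avg_response_time FROM 'server_logs.csv' GROUP BY endpoint"
).fetchall())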
Would you like help setting this up with sample data, or does this example align with your needs?
Why DuckDB Works Well in Lambda
In-Process Database:
- DuckDB runs in-process and, by default, entirely in memory (unless you use its persistent, file-backed storage; see the short sketch after this list).
- No external database connection or server is required, which aligns well with Lambda’s stateless nature.
Lightweight:
- DuckDB is compact and can be included in your Lambda deployment package without much overhead.
Temporary Data Processing:
- Perfect for processing on-the-fly data, like analyzing files (e.g., CSVs or Parquet) stored in S3.
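To make the in-memory vs. persistent distinction concrete, here is a minimal sketch; the file path is just an example, and inside Lambda a database file would have to live under /tmp:
import duckdb

# Default: a purely in-memory database; everything disappears when the connection closes
mem_con = duckdb.connect()

# File-backed: tables persist in the database file between connections
# (inside Lambda, /tmp is the only writable location)
disk_con = duckdb.connect('/tmp/analytics.duckdb')
disk_con.execute("CREATE TABLE IF NOT EXISTS hits (page VARCHAR, visits INTEGER)")
disk_con.close()
mem_con.close()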
Example: Analyzing a CSV in S3 with Lambda and DuckDB
Problem:
You have a CSV file in an S3 bucket containing website analytics data. You want a Lambda function to:
- Load the file.
- Count the number of visits per page.
Step 1: Lambda Setup
Install DuckDB and package it with your function code.
- Create a virtual environment and install the dependencies:
python -m venv venv
source venv/bin/activate
pip install duckdb boto3
- Package your code together with the dependencies (the packages must sit at the root of the zip for Lambda to import them):
cd venv/lib/python3.*/site-packages
zip -r ../../../../lambda_function.zip .
cd -
zip -g lambda_function.zip lambda_function.py
Step 2: Lambda Function Code
Here’s the code for your Lambda:
import duckdb
import boto3
import os
s3 = boto3.client('s3')
def lambda_handler(event, context):
    # Input from the event: S3 bucket and key
    bucket_name = event['bucket']
    file_key = event['key']

    # Download the file from S3 to Lambda's writable /tmp directory
    local_file = '/tmp/data.csv'
    s3.download_file(bucket_name, file_key, local_file)

    # Initialize DuckDB and load the CSV
    con = duckdb.connect()
    con.execute("CREATE TABLE data AS SELECT * FROM read_csv_auto(?)", [local_file])

    # Query: count visits per page
    results = con.execute("""
        SELECT page, COUNT(*) AS visits
        FROM data
        GROUP BY page
        ORDER BY visits DESC
    """).fetchall()

    # Clean up
    os.remove(local_file)
    con.close()

    return {"page_visits": results}
Step 3: Deploy to AWS
Upload the zipped function to Lambda and configure the required IAM role to allow S3 access.
Example Event Trigger:
To invoke the Lambda function, you could use an event like this:
{
  "bucket": "your-s3-bucket-name",
  "key": "path/to/your-data.csv"
}
Example Output:
If your CSV has a page column, the output might look like:
{
  "page_visits": [
    ["/home", 1234],
    ["/about", 567],
    ["/contact", 234]
  ]
}
Key Considerations:
Temporary Storage:
- Lambda has a 512MB /tmp directory where you can store temporary files like the downloaded CSV.
Memory and Timeout:
- Ensure your Lambda has enough memory and a long enough timeout to process large datasets.
- DuckDB is very efficient, but Lambda functions have a maximum timeout of 15 minutes.
Cold Start:
- Including DuckDB in your Lambda package increases its size slightly, which may affect cold start times. Use AWS Lambda Layers to manage dependencies separately if needed.
File Formats:
- DuckDB supports Parquet, which is better for large-scale data analysis than CSV. Consider converting your data to Parquet for better performance.
Here's a step-by-step guide to set up a workflow for using DuckDB in AWS Lambda to process data files from S3.
Step 1: Install Dependencies Locally
First, install DuckDB and boto3 (for AWS S3 interaction) in a Python virtual environment.
Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
Install the required libraries:
pip install duckdb boto3
Package the dependencies for a Lambda Layer:
mkdir python
pip install --target=python duckdb boto3
zip -r lambda_dependencies.zip python
This creates a Lambda Layer zip file containing the dependencies. A Lambda Layer is a way to include shared libraries without bundling them with each function; for Python runtimes, Lambda expects layer libraries under a top-level python/ directory, which is why the packages are installed into a folder with that exact name.
Step 2: Write the Lambda Function Code
Here’s the complete function code for analyzing a CSV stored in S3:
import duckdb
import boto3
import os
s3 = boto3.client('s3')
def lambda_handler(event, context):
    # Input from the event: S3 bucket and key
    bucket_name = event['bucket']
    file_key = event['key']

    # Download the file to Lambda's /tmp directory
    local_file = '/tmp/data.csv'
    s3.download_file(bucket_name, file_key, local_file)

    con = duckdb.connect()
    try:
        # Load the CSV into DuckDB
        con.execute("CREATE TABLE data AS SELECT * FROM read_csv_auto(?)", [local_file])

        # Query: count visits per page
        results = con.execute("""
            SELECT page, COUNT(*) AS visits
            FROM data
            GROUP BY page
            ORDER BY visits DESC
        """).fetchall()

        # Prepare the results for return
        output = [{"page": row[0], "visits": row[1]} for row in results]
    finally:
        # Clean up the connection and the temporary file
        con.close()
        if os.path.exists(local_file):
            os.remove(local_file)

    return {"page_visits": output}
Save this as lambda_function.py.
Step 3: Package the Lambda Function
- Zip your lambda_function.py:
zip lambda_function.zip lambda_function.py
Step 4: Deploy to AWS
Upload the Layer
- Go to the AWS Lambda Console → Layers → Create Layer.
- Upload the lambda_dependencies.zip file created earlier.
- Choose a runtime compatible with your function (e.g., Python 3.9).
Create the Lambda Function
- Go to the Lambda Console and create a new function:
- Runtime: Python 3.x
- Upload the lambda_function.zip file.
- Add the DuckDB Layer to the function:
- Go to the Lambda function’s Layers section and add the previously created layer.
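If you'd rather script these console steps than click through them, here's a rough boto3 sketch; the layer name, function name, and role ARN below are placeholders you'd replace with your own:
import boto3

client = boto3.client('lambda')

# Publish the dependency layer built in Step 1
layer = client.publish_layer_version(
    LayerName='duckdb-deps',  # placeholder name
    Content={'ZipFile': open('lambda_dependencies.zip', 'rb').read()},
    CompatibleRuntimes=['python3.9'],
)

# Create the function, attach the layer, and give it some memory and timeout headroom
client.create_function(
    FunctionName='duckdb-s3-analytics',  # placeholder name
    Runtime='python3.9',
    Role='arn:aws:iam::123456789012:role/your-lambda-role',  # placeholder ARN
    Handler='lambda_function.lambda_handler',
    Code={'ZipFile': open('lambda_function.zip', 'rb').read()},
    Layers=[layer['LayerVersionArn']],
    Timeout=30,
    MemorySize=512,
)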
Step 5: Configure IAM Permissions
Ensure the Lambda function has the necessary permissions to access S3. Attach an IAM role with the following policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}
Replace your-bucket-name with your actual S3 bucket name.
Step 6: Test the Function
Trigger the function with the following test event:
{
  "bucket": "your-s3-bucket-name",
  "key": "path/to/your-data.csv"
}
Example output:
{
  "page_visits": [
    {"page": "/home", "visits": 1234},
    {"page": "/about", "visits": 567},
    {"page": "/contact", "visits": 234}
  ]
}
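To invoke the deployed function from code rather than the console's test tab, here's a minimal boto3 sketch (the function name is a placeholder):
import json
import boto3

client = boto3.client('lambda')
response = client.invoke(
    FunctionName='duckdb-s3-analytics',  # placeholder name
    InvocationType='RequestResponse',
    Payload=json.dumps({"bucket": "your-s3-bucket-name", "key": "path/to/your-data.csv"}).encode(),
)
print(json.loads(response['Payload'].read()))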
Step 7: Optimize for Performance
- File Formats: If the data grows large, use Parquet instead of CSV. DuckDB reads Parquet natively and queries it much faster than CSV:
con.execute("CREATE TABLE data AS SELECT * FROM read_parquet(?)", [local_file])
- Memory Allocation: Increase the Lambda function's memory (e.g., 512MB or higher) for larger datasets.
- Timeouts: Ensure the function’s timeout is long enough to process larger files (e.g., 30 seconds).
Creating Parquet files is straightforward and can be done using several tools and programming libraries. Here's how you can create Parquet files depending on your preferred workflow:
1. Using Python (Pandas)
If you already have a CSV or DataFrame, you can convert it to Parquet with the pandas library.
Install Required Libraries:
pip install pandas pyarrow
Example Code:
import pandas as pd
# Create a sample DataFrame
data = {
    "timestamp": ["2024-12-01T10:15:30Z", "2024-12-01T10:16:00Z"],
    "user_id": [12345, 12346],
    "endpoint": ["/home", "/about"],
    "response_time": [200, 350],
}
df = pd.DataFrame(data)
# Save as Parquet file
df.to_parquet("data.parquet", engine="pyarrow", index=False)
This will create a data.parquet file.
2. Using Python (DuckDB)
DuckDB itself can create Parquet files directly.
Install DuckDB:
pip install duckdb
Example Code:
import duckdb
# Sample CSV file
csv_file = "data.csv"
# Convert CSV to Parquet using DuckDB
con = duckdb.connect()
con.execute("COPY (SELECT * FROM read_csv_auto(?)) TO 'data.parquet' (FORMAT PARQUET)", [csv_file])
con.close()
This reads the CSV file and saves it as a Parquet file.
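To sanity-check the result, you can query the new Parquet file directly (a quick sketch):
import duckdb

# Row count and a peek at the first few rows of the converted file
print(duckdb.sql("SELECT COUNT(*) FROM 'data.parquet'").fetchone())
print(duckdb.sql("SELECT * FROM 'data.parquet' LIMIT 5").fetchall())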
3. Using Apache Spark
If you're working with large datasets, Apache Spark is a powerful tool for handling Parquet files.
Example in PySpark:
- Install PySpark:
pip install pyspark
- Example Code:
from pyspark.sql import SparkSession

# Initialize Spark
spark = SparkSession.builder.appName("CSV to Parquet").getOrCreate()

# Read CSV file
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Write to Parquet
df.write.parquet("data.parquet")
4. Using Command-Line Tools
If you don’t want to write a script, the DuckDB command-line client can do the conversion in one line; standalone utilities such as csv2parquet also exist. (Note that parquet-tools is mainly for inspecting existing Parquet files, not creating them.)
Install the DuckDB CLI:
brew install duckdb  # Mac
# or download the binary from duckdb.org
Convert CSV to Parquet:
echo "COPY (SELECT * FROM read_csv_auto('data.csv')) TO 'data.parquet' (FORMAT PARQUET);" | duckdb
5. Using AWS Services
If your data is in AWS, you can convert it using AWS Glue or Athena:
- AWS Glue: Create an ETL job to transform CSVs to Parquet.
- AWS Athena: Run a CREATE TABLE AS SELECT query on your S3-stored CSV to produce a Parquet file.
Why Use Parquet?
- Compact: Parquet is a compressed, columnar storage format, so files are typically much smaller than the equivalent CSV.
- Efficient: Optimized for analytical queries, making it much faster than CSV for DuckDB or other analytical tools (see the timing sketch below).
- Schema Support: Preserves data types and structures.
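If you want to see the speed difference on your own data, here's a small timing sketch. It assumes the data.csv and data.parquet files from the examples above (with endpoint and response_time columns); the numbers depend entirely on your machine and dataset:
import time
import duckdb

def timed(sql):
    # Run a query and return how long it took in seconds
    start = time.perf_counter()
    duckdb.sql(sql).fetchall()
    return time.perf_counter() - start

csv_time = timed("SELECT endpoint, AVG(response_time) FROM 'data.csv' GROUP BY endpoint")
parquet_time = timed("SELECT endpoint, AVG(response_time) FROM 'data.parquet' GROUP BY endpoint")
print(f"CSV: {csv_time:.3f}s  Parquet: {parquet_time:.3f}s")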