
Commit 1856908

DuckDB/analytical query blog post
1 parent 14cd93d commit 1856908

5 files changed, +192 −2 lines changed

content/posts/2021-hundred-requests-per-second-with-django.md

+1 −1

@@ -1,7 +1,7 @@
 Title: Handling 100 Requests Per Second with Python & Django
 Date: July 27, 2021
 description: Rather than discussing advertising and privacy, in this post we're going to talk about how we scaled our ad server using some of the same tools and techniques we use on Read the Docs.
-tags: django, python, performance, scalability
+tags: django, python, performance, scalability, engineering
 authors: David Fischer
 image: /images/posts/2021-django-100-requests-second.jpg
 image_credit: <span>Photo by <a href="https://unsplash.com/@rockthechaos?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Kolleen Gladden</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

content/posts/2024-niche-ad-targeting.md

+1 −1

@@ -1,7 +1,7 @@
 Title: Using embeddings in production with Postgres & Django for niche ad targeting
 Date: Apr 23, 2024
 description: How we built better contextual ad targeting using PostgreSQL, Django, and embeddings which are a way to encode the "relatedness" of text and pages for use in machine learning.
-tags: content-targeting, engineering
+tags: content-targeting, engineering, postgresql
 authors: Eric Holscher
 image: /images/posts/niche-targeting.png

@@ -0,0 +1,190 @@
Title: DuckDB and PostgreSQL Make a Great Pair for Analytical Processing
Date: February 25, 2025
description: How we pair DuckDB and parquet files with PostgreSQL, our production database, to speed up the expensive analytical queries behind our advertiser and publisher reporting.
tags: postgresql, duckdb, performance, engineering
authors: David Fischer
image: /images/posts/2025-duckdb-postgres.jpg
image_credit: <span>Image created with <a href="https://chatgpt.com/" title="A duck and an elephant working on a computer to build an analytics platform. The image should have a lighthearted feel, be colorful, but be a bit funny and strange.">ChatGPT</a></span>

Typically, our blog lives at the intersection of privacy, advertising,
and building our business in this niche.
This time, however, we're going to delve into some database scaling issues
we've been hitting for a while and how we're solving them with DuckDB
on top of our regular daily driver, PostgreSQL.

## Postgres can do just about everything

At both EthicalAds and our parent company Read the Docs, we use PostgreSQL heavily.
We use Postgres to store basically all our production data and to be our "source of truth"
for all advertising stats and expenditures.
Postgres handles everything and is among the most dependable pieces of our infrastructure.
Postgres can handle [ML embeddings]({filename}../posts/2024-niche-ad-targeting.md)
with [pgvector](https://github.com/pgvector/pgvector) for better contextual ad targeting,
and, combined with a read replica, we scaled our setup to
[hundreds of writes per second]({filename}../posts/2021-hundred-requests-per-second-with-django.md).
At EthicalAds, we generally field ~4-5M ad requests per weekday and Postgres
handles this kind of transaction processing effortlessly.

## What about analytical processing?

However, one area where we've struggled a bit with Postgres is analytical queries.
When a new advertiser prospect comes to us and asks how much ad inventory we have on content
related to automating compute infrastructure with a tool like Terraform, we want to be able to answer that easily.
By aggregating data ahead of time, we can query the data much faster when an advertiser pulls up a report
or needs an estimate of how long their campaign will take.

<div class="postimage text-center">
  <img class="w-100 shadow-lg" src="{static}../images/posts/2025-duckdb-postgres-usage.png" alt="Expensive analytical queries can do a number on your database.">
  <p>Expensive analytical queries can do a number on your database.</p>
</div>

We run these expensive aggregations across our database nightly during lower traffic times
and store the results in a series of tables of daily aggregated data by publisher, by country, by advertiser, and so on.
Altogether, these queries take around 45 minutes to an hour and add significant load to our read replica.
Occasionally (or not so occasionally if they happen to coincide with the [autovacuum](https://www.postgresql.org/docs/current/runtime-config-autovacuum.html)), they will time out and have to be re-run.
Despite how much we love Postgres at EthicalAds, this specifically has felt like one of the most brittle pieces of our setup.

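To give a flavor of what these rollups look like, here is a simplified sketch of one such nightly aggregation.
The table and column names are illustrative rather than our exact schema,
and it uses Django's raw cursor since our ad server is a Django app:

```python
from django.db import connection

# Simplified, hypothetical nightly rollup: aggregate raw ad offers into a
# daily per-publisher report table. Names are illustrative, not our schema.
DAILY_PUBLISHER_ROLLUP = """
    INSERT INTO adserver_publisher_daily_report (date, publisher_id, views, clicks)
    SELECT date_trunc('day', offers.date)::date AS day,
           offers.publisher_id,
           COUNT(*) FILTER (WHERE offers.viewed) AS views,
           COUNT(*) FILTER (WHERE offers.clicked) AS clicks
    FROM adserver_offer offers
    WHERE offers.date >= %s AND offers.date < %s
    GROUP BY day, offers.publisher_id;
"""

def rollup_publisher_stats(day_start, day_end):
    with connection.cursor() as cursor:
        cursor.execute(DAILY_PUBLISHER_ROLLUP, [day_start, day_end])
```
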
## Column-wise storage & DuckDB

These kinds of expensive aggregation queries are historically a better fit for columnar databases, data warehouses, and [OLAP databases](https://en.wikipedia.org/wiki/Online_analytical_processing) generally.
We considered building out a data warehouse or another kind of column-oriented database
but never found something we really liked, and we were always hesitant to add a second production system
that could get out of sync with Postgres.
Postgres additionally has extensions to add some of these capabilities
(e.g. [citus](https://github.com/citusdata/citus?tab=readme-ov-file#creating-tables-with-columnar))
but these solutions all either didn't work for our use case or
weren't supported on Azure's Managed Postgres, where we are hosted.
This is where DuckDB came to our rescue.

[DuckDB](https://duckdb.org/) is an in-process analytical database.
It's sort of like SQLite, but for analytical workloads and for querying data anywhere in a variety of formats.
Like SQLite, you either run it in your app's process (Python for us)
or via its own standalone CLI.
It can read from CSV or Parquet files stored in blob storage
or directly from a SQL database like Postgres.

Because most of our aggregations are for hourly or daily data and that data virtually never changes
once it's written, it's a great match for a write-once system.
As a result, we began writing parquet files on a daily basis to cloud storage.
[Apache Parquet files](https://parquet.apache.org/) are files optimized for column-wise data
and are widely supported by various tools, including Python and DuckDB.

<blockquote class="blockquote mb-2">
  <p class="mb-2">
  “Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.”
  </p>
  <p class="small">- <a href="https://parquet.apache.org/">parquet.apache.org</a></p>
</blockquote>

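For context, the daily export itself doesn't need much machinery.
Here's a simplified sketch of what dumping a day of data to parquet and uploading it can look like;
the table name, `datalake` container, and environment variables are illustrative rather than our exact setup:

```python
import os

import duckdb
from fsspec import filesystem

# Attach the Postgres read replica so DuckDB can read directly from it
duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")
pg_conn_string = os.getenv("REPLICA_DATABASE_URL")
duckdb.sql(f"ATTACH IF NOT EXISTS '{pg_conn_string}' AS adspg (TYPE POSTGRES)")

# Dump one day of raw data to a local parquet file
# (adserver_offer is a stand-in for whatever table holds the raw records)
duckdb.sql("""
    COPY (
        SELECT * FROM adspg.adserver_offer
        WHERE date >= '2025-02-01' AND date < '2025-02-02'
    ) TO '2025-02-01.parquet' (FORMAT parquet);
""")

# Upload the file to Azure blob storage with fsspec/adlfs
fs = filesystem("abfs", connection_string=os.getenv("AZURE_STORAGE_CONNECTION_STRING"))
fs.put("2025-02-01.parquet", "datalake/2025-02-01.parquet")
```
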
Using Python and DuckDB, it's easy to query a day's or a month's worth of data,
whether from a developer laptop or directly from a server,
without adding any load to your production database.
Queries that took 5-10 minutes against Postgres frequently take just a few seconds
against parquet files optimized for the purpose.

```python
import os

import duckdb
from fsspec import filesystem

# Support for S3 is even more straightforward
duckdb.sql("INSTALL azure")
duckdb.sql("LOAD azure")

azure_account_name = os.getenv("AZURE_ACCOUNT_NAME")
azure_account_key = os.getenv("AZURE_ACCOUNT_KEY")
azure_conn_string = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key}"

# Register an fsspec filesystem so DuckDB can read abfs:// URLs from blob storage
duckdb.register_filesystem(filesystem("abfs", connection_string=azure_conn_string))

# Query the daily parquet files
# Want an ORM? There's a SQLAlchemy driver for DuckDB
duckdb.sql("""SELECT COUNT(*) FROM read_parquet('abfs://datalake/2025-02-01.parquet');""")

# Query a month of data with a glob pattern
duckdb.sql("""SELECT COUNT(*) FROM read_parquet('abfs://datalake/2025-01-*.parquet');""")
```

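If you'd rather work with the results in Python than purely in SQL,
DuckDB relations convert straight into familiar structures.
For example, assuming pandas is installed:

```python
# Pull a query result into a pandas DataFrame for further analysis
df = duckdb.sql(
    "SELECT * FROM read_parquet('abfs://datalake/2025-02-01.parquet') LIMIT 1000"
).df()
```
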
## Joining column-wise data with Postgres

Creating our data lake of parquet files is cool,
but the real power comes from "joining" this columnar-optimized data back up against Postgres.

```python
import os

import duckdb

duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

# Attach our Postgres read replica alongside the parquet data
# (assumes the abfs filesystem from the earlier snippet is already registered)
pg_conn_string = os.getenv("REPLICA_DATABASE_URL")
duckdb.sql(
    f"ATTACH IF NOT EXISTS '{pg_conn_string}' AS adspg (TYPE POSTGRES)"
)

# "Join" a daily aggregation back up with our Postgres source of truth
# This query is nearly instant but would take tens of seconds normally
duckdb.sql("""
    SELECT
        adv.name AS advertiser_name,
        COUNT(*) AS impression_cnt
    FROM read_parquet('abfs://datalake/2025-02-01.parquet') pq
    INNER JOIN adspg.adserver_advertiser adv
        ON pq.advertiser_id = adv.id
    GROUP BY adv.name
    ORDER BY impression_cnt DESC
    LIMIT 10;
""")
```

This approach provides a number of advantages, including:

* Unlike some data warehouses that need **lots** of de-normalized data,
  our parquet files only need the data required to make whatever aggregations we need.
  This keeps them small and fast.
  Joining the aggregated data back to Postgres for additional data is fast and easy.
* It's possible to overload your production database with analytical queries,
  but you can't easily overload cloud storage by reading files.
* It's even possible with the `COPY` command to run aggregations against parquet files
  and copy the aggregated results back to Postgres directly from DuckDB (there's a sketch of this after the list).
  This sounds counter-intuitive, but if you're running many different kinds of aggregations
  against the same data, this can be faster than querying Postgres directly.

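Here's a rough sketch of that last point, using a plain `INSERT INTO` against the attached database for simplicity.
It assumes Postgres is attached as `adspg` and the Azure filesystem is registered as in the earlier snippets,
and the rollup table and columns are illustrative rather than our exact schema:

```python
import duckdb

# Illustrative only: aggregate a day of parquet data and write the results
# into a (hypothetical) rollup table on the attached Postgres database.
duckdb.sql("""
    INSERT INTO adspg.adserver_daily_geo_report (date, country, impression_cnt)
    SELECT DATE '2025-02-01', pq.country, COUNT(*)
    FROM read_parquet('abfs://datalake/2025-02-01.parquet') pq
    GROUP BY pq.country;
""")
```
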
## A couple gotchas and other challenges

"But David, won't it be slow to run a SQL query against a remote file?"
Firstly, these queries are strictly analytical queries, nothing transactional.
Remember that on any of the major clouds, these blob storage files are going to be in or near
the data center where the rest of your servers are running.
Querying them is a lot faster than I originally expected it to be.

While DuckDB is pretty smart about [cross database queries](https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html),
I put "joins" in scare quotes for a reason.
These are not traditional database joins, and at some level DuckDB is querying records
from one database into memory and using them to query the other database.
In some situations, including some we saw in our workloads,
performance can degrade pretty quickly and spike memory and CPU usage on both Postgres and the DuckDB process.
Expensive cross-database queries require a bit of extra testing and scrutiny.

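One way to keep that under control is to do the "records into memory" step explicitly yourself:
materialize just the columns you need from Postgres into a local DuckDB table,
so the join happens entirely inside DuckDB.
This is only a sketch, under the same assumptions as the earlier snippets (Postgres attached as `adspg`, Azure filesystem registered):

```python
import duckdb

# Copy only the small lookup data we need out of Postgres once,
# then join against it locally instead of repeatedly reaching across databases.
duckdb.sql("""
    CREATE OR REPLACE TABLE advertisers AS
    SELECT id, name FROM adspg.adserver_advertiser;
""")

duckdb.sql("""
    SELECT adv.name, COUNT(*) AS impression_cnt
    FROM read_parquet('abfs://datalake/2025-02-01.parquet') pq
    INNER JOIN advertisers adv ON pq.advertiser_id = adv.id
    GROUP BY adv.name;
""")
```
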
Lastly, if anybody from the Azure team happens to be reading this,
we'd love it if you'd add [pg_parquet](https://github.com/CrunchyData/pg_parquet/) to Azure Managed Postgres
now that it [supports Azure storage](https://www.crunchydata.com/blog/pg_parquet-an-extension-to-connect-postgres-and-parquet).
Dumping parquet files directly from Postgres would be more efficient than what we're currently doing.

## Wrapup

Hopefully it was helpful to see some concrete examples of using DuckDB alongside PostgreSQL
for analytical workloads. We believe there's a lot of potential to use DuckDB with parquet files
for these kinds of queries in more places across EthicalAds and Read the Docs, our parent company, as well.

Thanks for tuning in to one of our infrequent posts on the technical challenges
that come along with building an ad network without invasive tracking.
Please [let us know]({filename}../pages/contact.md#inbound-form) if you have any ideas or feedback on our product or service.
We always love to hear from you.
