Title: DuckDB and PostgreSQL Make a Great Pair for Analytical Processing
Date: February 25, 2025
description: How we pair DuckDB and Parquet files with PostgreSQL to speed up the analytical queries and reporting behind our ad network.
tags: postgresql, duckdb, performance, engineering
authors: David Fischer
image: /images/posts/2025-duckdb-postgres.jpg
image_credit: <span>Image created with <a href="https://chatgpt.com/" title="A duck and an elephant working on a computer to build an analytics platform. The image should have a lighthearted feel, be colorful, but be a bit funny and strange.">ChatGPT</a></span>


Typically, our blog lives at the intersection of privacy, advertising,
and building our business in this niche.
This time, however, we're going to delve into some database scaling issues
we have been hitting for some time and how we're solving them with DuckDB
on top of our regular daily driver of PostgreSQL.


## Postgres can do just about everything

At both EthicalAds and our parent company Read the Docs, we use PostgreSQL heavily.
We use Postgres to store basically all our production data and to be our "source of truth"
for all advertising stats and expenditures.
Postgres handles everything and is among the most dependable pieces of our infrastructure.
Postgres can handle [ML embeddings]({filename}../posts/2024-niche-ad-targeting.md)
with [pgvector](https://github.com/pgvector/pgvector) for better contextual ad targeting,
and combined with a read replica, we scaled our setup to
[hundreds of writes per second]({filename}../posts/2021-hundred-requests-per-second-with-django.md).
At EthicalAds, we generally field ~4-5M ad requests per weekday, and Postgres
handles this kind of transaction processing effortlessly.


## What about analytical processing?

However, one area where we've struggled a bit with Postgres is analytical queries.
When a new advertiser prospect comes to us and asks how much ad inventory we have on content
related to automating compute infrastructure with a tool like Terraform, we want to be able to answer that easily.
By aggregating data ahead of time, we can query it much faster when an advertiser pulls up a report
or needs an estimate of how long their campaign will take.

<div class="postimage text-center">
  <img class="w-100 shadow-lg" src="{static}../images/posts/2025-duckdb-postgres-usage.png" alt="Expensive analytical queries can do a number on your database.">
  <p>Expensive analytical queries can do a number on your database.</p>
</div>

We run these expensive aggregations across our database nightly during lower traffic times
and store the results in a series of tables of daily aggregated data by publisher, by country, by advertiser, and so on.
Altogether, the queries take around 45 minutes to an hour and add significant load to our read replica.
Occasionally (or not so occasionally, if they happen to coincide with the [autovacuum](https://www.postgresql.org/docs/current/runtime-config-autovacuum.html)), they will time out and have to be re-run.
Despite how much we love Postgres at EthicalAds, this specifically has felt like one of the most brittle pieces of our setup.
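
To make the shape of these rollups concrete, here is a purely hypothetical sketch of one nightly aggregation.
The table and column names (`impressions`, `daily_publisher_index`, `viewed`, `clicked`) are invented for illustration
and the real schema differs, but the idea is the same: collapse raw impression rows into one row per day, publisher, and country.

```python
import os

import psycopg2  # assumption: any Postgres client works here; a Django app already has one configured

# Hypothetical rollup: one row per (day, publisher, country) with view/click counts
ROLLUP_SQL = """
    INSERT INTO daily_publisher_index (day, publisher_id, country, views, clicks)
    SELECT
        date_trunc('day', i.created)::date AS day,
        i.publisher_id,
        i.country,
        COUNT(*) FILTER (WHERE i.viewed) AS views,
        COUNT(*) FILTER (WHERE i.clicked) AS clicks
    FROM impressions i
    WHERE i.created >= %s AND i.created < %s
    GROUP BY 1, 2, 3;
"""

with psycopg2.connect(os.getenv("DATABASE_URL")) as conn:
    with conn.cursor() as cur:
        cur.execute(ROLLUP_SQL, ("2025-02-01", "2025-02-02"))
```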


## Column-wise storage & DuckDB

These kinds of expensive aggregation queries are historically a better fit for columnar databases, data warehouses, and [OLAP databases](https://en.wikipedia.org/wiki/Online_analytical_processing) generally.
We considered building out a data warehouse or another kind of column-oriented database,
but we never found something we really liked, and we were always hesitant to add a second production system
that could get out of sync with Postgres.
Postgres additionally has extensions that add some of these capabilities
(e.g. [citus](https://github.com/citusdata/citus?tab=readme-ov-file#creating-tables-with-columnar)),
but these solutions either didn't work for our use case or
weren't supported on Azure's Managed Postgres, where we are hosted.
This is where DuckDB came to our rescue.

[DuckDB](https://duckdb.org/) is an in-process, analytical database.
It's sort of like SQLite, but for analytical workloads and for querying data anywhere in a variety of formats.
Like SQLite, you can either run it in your app's process (Python for us)
or use its standalone CLI.
It can read from CSV or Parquet files stored in blob storage
or directly from a SQL database like Postgres.

Because most of our aggregations are for hourly or daily data, and that data virtually never changes
once it's written, it's a great match for a write-once system.
As a result, we began writing parquet files on a daily basis to cloud storage.
[Apache Parquet files](https://parquet.apache.org/) are column-oriented data files
that are widely supported by various tools including Python and DuckDB.


<blockquote class="blockquote mb-2">
  <p class="mb-2">
    “Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.”
  </p>
  <p class="small">- <a href="https://parquet.apache.org/">parquet.apache.org</a></p>
</blockquote>
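
As a rough illustration of how those daily files can be produced, here is a minimal sketch using DuckDB's `postgres` extension
and `COPY ... TO`. The source table name and date column are hypothetical and this isn't our exact export pipeline,
but the flow is the point: select one day's rows out of Postgres and write them as a single parquet file.

```python
import os

import duckdb

duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

pg_conn_string = os.getenv("REPLICA_DATABASE_URL")
duckdb.sql(f"ATTACH IF NOT EXISTS '{pg_conn_string}' AS adspg (TYPE POSTGRES)")

# Pull one day of (hypothetical) impression rows out of Postgres and write a
# column-oriented parquet file. In practice, the target path would be the
# blob storage location the rest of this post queries.
duckdb.sql("""
    COPY (
        SELECT *
        FROM adspg.adserver_impression   -- hypothetical table name
        WHERE date = DATE '2025-02-01'
    ) TO '2025-02-01.parquet' (FORMAT PARQUET);
""")
```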

Using Python and DuckDB, it's easy to query a day's or a month's worth of data,
whether from a developer laptop or directly from a server,
without adding any load to your production database.
Queries that took 5-10 minutes against Postgres frequently take just a few seconds
against parquet files optimized for the purpose.

```python
import os

import duckdb
from fsspec import filesystem

# Support for S3 is even more straight-forward
duckdb.sql("INSTALL azure")
duckdb.sql("LOAD azure")

azure_account_name = os.getenv("AZURE_ACCOUNT_NAME")
azure_account_key = os.getenv("AZURE_ACCOUNT_KEY")
azure_conn_string = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key}"

# Register an fsspec filesystem so DuckDB can resolve abfs:// paths
duckdb.register_filesystem(filesystem("abfs", connection_string=azure_conn_string))

# Query the daily parquet files
# Want an ORM? There's a SqlAlchemy driver for DuckDB
duckdb.sql("""SELECT COUNT(*) FROM read_parquet('abfs://datalake/2025-02-01.parquet');""")

# Query a month of data
duckdb.sql("""SELECT COUNT(*) FROM read_parquet('abfs://datalake/2025-01-*.parquet');""")
```


## Joining column-wise data with Postgres

Creating our data lake of parquet files is cool,
but the real power comes from "joining" this column-optimized data back up against Postgres.


```python
import os

import duckdb

duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

pg_conn_string = os.getenv("REPLICA_DATABASE_URL")
duckdb.sql(
    f"ATTACH IF NOT EXISTS '{pg_conn_string}' AS adspg (TYPE POSTGRES)"
)

# "Join" a daily aggregation back up with our Postgres source of truth
# This query is nearly instant but would take tens of seconds normally
# (assumes the abfs filesystem is registered as in the previous snippet)
duckdb.sql("""
    SELECT
        adv.name AS advertiser_name,
        COUNT(*) AS impression_cnt
    FROM read_parquet('abfs://datalake/2025-02-01.parquet') pq
    INNER JOIN adspg.adserver_advertiser adv
        ON pq.advertiser_id = adv.id
    GROUP BY adv.name
    ORDER BY impression_cnt DESC
    LIMIT 10;
""")
```

This provides a number of advantages, including:

* Unlike some data warehouses that need **lots** of de-normalized data,
  our parquet files only need the data required to make whatever aggregations we need.
  This keeps them small and fast.
  Joining the aggregated data back to Postgres for additional data is fast and easy.
* It's possible to overload your production database with analytical queries,
  but you can't easily overload cloud storage by reading files.
* It's even possible to run aggregations against parquet files with the `COPY` command
  and copy the aggregated results back to Postgres directly from DuckDB (see the sketch after this list).
  This sounds counter-intuitive, but if you're running many different kinds of aggregations
  against the same data, this can be faster than querying Postgres directly.
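
Here is a hedged sketch of that last point. The destination table and its columns are hypothetical,
and it uses a plain `INSERT INTO` the attached Postgres database rather than `COPY`,
but it shows the flow: aggregate against the parquet file in DuckDB, then push only the much smaller result set back into Postgres.

```python
import os

import duckdb

duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

pg_conn_string = os.getenv("DATABASE_URL")
duckdb.sql(f"ATTACH IF NOT EXISTS '{pg_conn_string}' AS adspg (TYPE POSTGRES)")

# Aggregate a day of parquet data and write the rolled-up rows straight into a
# (hypothetical) Postgres index table. Assumes the abfs filesystem is
# registered as in the earlier snippet.
duckdb.sql("""
    INSERT INTO adspg.adserver_advertiser_dailyindex (day, advertiser_id, impressions)
    SELECT
        DATE '2025-02-01' AS day,
        advertiser_id,
        COUNT(*) AS impressions
    FROM read_parquet('abfs://datalake/2025-02-01.parquet')
    GROUP BY advertiser_id;
""")
```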


## A couple gotchas and other challenges

| 161 | +"But David. Won't it be slow to run a SQL query against a remote file?" |
| 162 | +Firstly, these queries are strictly analytical queries, nothing transactional. |
| 163 | +Remember that any of the major clouds these blob storage files are going to be in or near |
| 164 | +the data center where the rest of your servers are running. |
| 165 | +Querying them is a lot faster than I originally expected it to be. |
| 166 | + |
While DuckDB is pretty smart about [cross-database queries](https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html),
I put "joins" in scare quotes for a reason.
These are not traditional database joins: at some level, DuckDB is reading records
from one database into memory and using them to query the other database.
In some situations, including some we saw in our workloads,
performance can degrade pretty quickly and spike memory and CPU usage on both Postgres and the DuckDB process.
Expensive, cross-database queries require a bit of extra testing and scrutiny.
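
One general mitigation worth sketching (this is a technique to evaluate, not something we prescribe above)
is to materialize the small Postgres table into DuckDB once and join locally,
so DuckDB isn't reaching back across to Postgres while it scans the parquet files.
It assumes the attachment and filesystem setup from the earlier snippets.

```python
import duckdb

# Copy the small dimension table out of the attached Postgres database once...
duckdb.sql("""
    CREATE TEMP TABLE advertisers AS
    SELECT id, name FROM adspg.adserver_advertiser;
""")

# ...then join against the local copy while scanning the parquet files.
duckdb.sql("""
    SELECT adv.name AS advertiser_name, COUNT(*) AS impression_cnt
    FROM read_parquet('abfs://datalake/2025-02-01.parquet') pq
    INNER JOIN advertisers adv ON pq.advertiser_id = adv.id
    GROUP BY adv.name
    ORDER BY impression_cnt DESC;
""")
```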

Lastly, if anybody from the Azure team happens to be reading this,
we'd love it if you'd add [pg_parquet](https://github.com/CrunchyData/pg_parquet/) to Azure Managed Postgres
now that it [supports Azure storage](https://www.crunchydata.com/blog/pg_parquet-an-extension-to-connect-postgres-and-parquet).
Dumping parquet files directly from Postgres would be more efficient than what we're currently doing.


## Wrapup

Hopefully it was helpful to see some concrete examples of using DuckDB in addition to PostgreSQL
for analytical workloads. We believe there's a lot of potential to use DuckDB with parquet files
for these kinds of queries in more places at EthicalAds and at Read the Docs, our parent company, as well.

Thanks for tuning in to one of our infrequent posts on technical challenges
that come along with building an ad network without invasive tracking.
Please [let us know]({filename}../pages/contact.md#inbound-form) if you have any ideas or feedback on our product or service.
We always love to hear from you.