We publish the PUDL pipeline outputs in several ways to serve different users and use cases. We're always trying to increase the accessibility of the PUDL data, so if you have a suggestion, please open a GitHub issue. If you have a question, you can create a GitHub discussion.
We recommend working with tables with the out_
prefix, as these tables contain the
most complete and easiest to work with data. For more information about the different
types of tables, read through :ref:`PUDL's naming conventions <asset-naming>`.
The :doc:`PUDL data dictionary </data_dictionaries/pudl_db>` provides direct links to :ref:`access-datasette` for each table if it is included in our SQLite outputs, and to the Parquet outputs if the table is available in that format.
:ref:`Platform <access-platform>` | :ref:`Format <access-format>` | :ref:`Version <access-version>` | User Types | Use Cases |
---|---|---|---|---|
:ref:`access-datasette` | SQLite, CSV | nightly |
Data Explorer, Spreadsheet Analyst | Explore PUDL SQLite databases interactively in a web browser. Select data to download as CSVs for local analysis in spreadsheets. Create sharable links to a particular selection of data. |
:ref:`access-kaggle` | SQLite, Parquet | nightly |
Data Scientist, Data Analyst, Jupyter Notebook User | Work with PUDL data products in Jupyter Notebooks via the web with minimal setup. Explore curated and contributed analyses and visualizations using PUDL data. notebooks. Create and share your own interactive notebooks using PUDL data. |
:ref:`access-cloud` | SQLite, Parquet | nightly , stable |
Data Scientist, Analytics Engineer, Data Engineer, Cloud Developer | Performant remote queries of clearly versioned PUDL Parquet outputs from cloud computing platforms or GitHub Actions. Fast bulk download of SQLite or Parquet outputs for local use. Parquet based data warehouse for large-scale data analysis in the cloud. Integrates well with Pandas, DuckDB, and other dataframe libraries. |
:ref:`access-zenodo` | SQLite, Parquet | stable |
Researcher, Publisher, Archivist | Access a specific, immutable version of the PUDL data by DOI for citation in academic publications or other applications where long-term reproducibility is needed. Web-based bulk download of data for local analysis. |
In order to serve a wider variety of users, we provide several ways to access PUDL data. When choosing an access method you'll want to consider:
- What tool or platform do you want to use to access the data?
- What data format are you most comfortable with?
- Which historical version of the data do you want?
PUDL data is distributed on a number of different platforms to acommodate a variety of different use cases. These include :ref:`access-datasette`, :ref:`access-kaggle`, :ref:`access-cloud`, and :ref:`access-zenodo`.
PUDL data is distributed in two main file formats
- SQLite: a self-contained relational database that holds many tables in a single file, supported by many programming languages and tools.
- Apache Parquet: a compressed, columnar storage format in which each file stores a single table. Parquet supports rich data types and metadata, and is highly performant.
Any data that is published using SQLite is available through :ref:`access-datasette`, and can be downloaded as a CSV through that platform. See below.
- Parquet Only: The hourly data tables are distributed only as Parquet files.
These tables have
hourly
in their names. - SQLite Only: The :ref:`minimally processed FERC data <access-raw-ferc>` which we have converted from XBRL and DBF into SQLite are only available in SQLite.
We assign a version number to our quarterly data releases so they can be easily
identified. These versions are based on the date of publication. For example,
v2024.11.0
would be the first release of the data that happened in November 2024.
These are referred to as stable
releases, and are archived for long-term access and
citation.
We also provide access to a nightly
development build of the data, which is updated
most weekday mornings. These builds are useful for beta testing new outputs, but are
ephemeral and may not be as well validated as the stable
releases.
We provide web-based access to the PUDL data via a Datasette deployment at:
https://data.catalyst.coop
Datasette is an open source tool developed by Simon Willison that wraps SQLite databases in an interactive front-end. It allows users to the PUDL database and metadata, filter the data them using dropdown menus or SQL, and download the selected data to CSVs. All the query parameters are stored in the URL so you can also share links to the data you've selected.
Note
The only SQLite database containing cleaned and integrated data is the core PUDL database. There are also several :ref:`FERC SQLite databases <access-raw-ferc>` derived from the old Visual FoxPro and new XBRL data formats, which we publish as SQLite to improve accessibility of the raw inputs, but they should generally not be used directly if the data you need has been integrated into the PUDL database.
Note
Only PUDL database tables that are available in SQLite are accessible via Datasette. Due to their size, we currently do not load any of the hourly tables into SQLite, and distribute them only as Parquet files.
Are you comfortable with Jupyter Notebooks? Want to explore a fresh version of all available PUDL data without needing to do any environment setup? Our nightly build outputs automatically update the PUDL Project Dataset on Kaggle once a week. There are several notebooks associated with the dataset, both curated by Catalyst and contributed by other Kaggle users.
All PUDL data products are freely available in the
AWS Open Data Registry
including both stable
and nightly
outputs and multiple years of past stable
releases. These include data in both SQLite and Parquet formats. The AWS S3 bucket is:
s3://pudl.catalyst.coop
The same outputs are available in a similarly named "requester pays" Google Cloud Storage bucket. However, you will need to authenticate your GCP account. The GCS bucket is:
gs://pudl.catalyst.coop
SQLite databases must be downloaded for local use, but Parquet files can be queried remotely using a number of different tools. Some examples below:
Using Pandas read_parquet()
Note
You will need to install pandas with the extra cloud dependencies.
import pandas as pd
# Outputs from the most recent nightly build:
nightly_df = pd.read_parquet("s3://pudl.catalyst.coop/nightly/core_eia__codes_energy_sources.parquet")
# Outputs from the most recent stable data release:
stable_df = pd.read_parquet("s3://pudl.catalyst.coop/stable/core_eia__codes_energy_sources.parquet")
# A specific stable version of the data:
versioned_df = pd.read_parquet("s3://pudl.catalyst.coop/v2024.11.0/core_eia__codes_energy_sources.parquet")
Using DuckDB and the httpfs extension
-- Install the httpfs extension once and it will be available in subsequent sessions
INSTALL httpfs;
SELECT * FROM read_parquet('s3://pudl.catalyst.coop/nightly/core_eia__codes_energy_sources.parquet');
Similar functionality exists for the dplyr library in R, the polars library in Rust, and many other programmatic data analysis tools.
You can also use the AWS CLI to see what data is available and download it locally. For example, to list the contents of the AWS S3 bucket to see what historic versions are available:
aws s3 ls --no-sign-request s3://pudl.catalyst.coop/
To list the contents of a particular version:
aws s3 ls --no-sign-request s3://pudl.catalyst.coop/v2024.8.0/
And then download the full PUDL SQLite database from the nightly build outputs:
aws s3 cp --no-sign-request s3://pudl.catalyst.coop/nightly/pudl.sqlite.zip .
The links below allow bulk download the most recent nightly
builds of all the SQLite
databases produced by PUDL, as well as their associated metadata in JSON.
- FERC Form 1:
- FERC Form 2:
- FERC Form 6:
- FERC Form 60:
- FERC Form 714:
If you want a specific, immutable version of our data for any reason, you can find them all here on Zenodo. Zenodo assigns long-lived DOIs to each archive, suitable for citation in academic journals and other publications. The most recent versioned PUDL data release can always be found using this Concept DOI: https://doi.org/10.5281/zenodo.3653158
From Zenodo you can download individual SQLite databases and a zipfile containing all the Parquet files bundled together.
The documentation for the latest such stable build is here. You can access the documentation for a specific version by hovering over the version selector at the bottom left of the page.
Sometimes you want to see the raw data that is published by the government, but it's hard to find or difficult to download, or you want to see what an older version of the published data looked like prior to being revised or deleted.
We use Zenodo to archive and version our raw data inputs. You can find all of our archives in the Catalyst Cooperative Community.
These have been minimally processed - in some cases, we've compressed them or grouped them into ZIP archives to fit the Zenodo repository requirements. In all cases we've added some metadata to help identify the resources you're looking for. But, apart from that, these datasets are unmodified.
If you want to run the PUDL data processing pipeline yourself from scratch, run the software tests, or make changes to the source code, you'll need to set up our development environment. This is a bit involved, so it has its :doc:`own separate documentation <dev/dev_setup>`.
Most users shouldn't need to do this, and will probably find working with the pre-processed data via one of the other access modes easier. But if you want to :doc:`contribute to the project <CONTRIBUTING>`, please give it a shot!