Make Open Data compatible with the Modern Data Ecosystem.
Open Data is a public good. As a result, individual [[incentives]] are not aligned with collective ones.
As an organization or research group, spending time curating and maintaining datasets for other people to use doesn't make economic sense unless you can profit from it. When scientists publish a paper, they care about the paper itself; that's what they're incentivized to do. The data is usually an afterthought.
Combining data from different sources requires the user to reconcile the differences in schemas, formats, assumptions, and more. This data wrangling is time consuming, tedious and needs to be repeated by every user of the data.
The Open Data landscape has a few problems:
- Non Interoperability. Data is isolated in multiple places and across different formats.
- Hard to Use. Good datasets are hard to use: indexing is difficult and many standards compete. None of the indexers specify how the data should be formatted or enforce any standardization, so users must still perform traditional forms of data merging, cleaning, and standardization.
- No Collaboration. No incentives exist for people to work on improving or curating datasets.
- No Versioning. Datasets disappear or change without notice. It's hard to know what changed and when. Losing data doesn't just inconvenience a few researchers. It actively hinders scientific progress.
Open Data can help organizations, scientists, and governments make better decisions. Data is one of the best ways to learn about the world and [[Coordination|coordinate]] people. Imagine if, every time you used a library, you had to find the original developer and hope they had a copy. It would be absurd. Yet that's essentially what we're asking scientists to do.
There are three big areas where people work on open data: at the government level covering thousands of datasets (CKAN, Socrata, …), at the scientific level (university level), and at the individual level where folks who are passionate about a topic publish a few datasets about it. This results in lots of datasets that are disconnected and still require you to scrape, clean, and join data from all the heterogeneous sources to answer interesting questions. One of the big ways that data becomes useful is when it is tied to other data. Data is only as useful as the questions it can help answer. Joining, linking, and graphing datasets together allows one to ask more and different kinds of questions.
Open protocols create open systems. Open code creates tools. Open data creates open knowledge. We need better tools, protocols, and mechanisms to improve the Open Data ecosystem. It should be easy to find, download, process, publish, and collaborate on open datasets.
Iterative improvements over public datasets yield large amounts of value (check how Dune did it with blockchain data)¹. Access to data gives people the opportunity to create new business and make better decisions. Data is vital to understanding the world and improving public welfare. Metcalfe’s Law applies to data too. The more connected a dataset is to other data elements, the more valuable it is.
Open Source code has made a huge impact in the world. Let's make Open Data do the same! Let's make it possible for anyone to fork and re-publish fixed, cleaned, reformatted datasets as easily as we do the same things with code.
This document is a collection of ideas and principles to make Open Data more accessible, maintainable, and useful. It also recognizes that a lot of people are already working on this, that there are some amazing datasets, tools, and organizations out there, and that Open Data is 80% a people problem. This document is biased towards the technical side of things, as I think that's where I can contribute the most.
We have better and cheaper infrastructure. That includes things like faster storage, better compute, and larger amounts of data. We need to improve our data workflows now. What does a world where people collaborate on datasets look like? The data is there. We just need to use it.
The best thing to do with your data will be thought of by someone else.
During the last few years, a large number of new data and open source tools have emerged. There are new query engines (e.g: DuckDB, DataFusion, ...), execution frameworks (WASM), data standards (Arrow, Parquet, ...), and a growing set of open data marketplaces (Datahub, HuggingFace Datasets, Kaggle Datasets). "Small data" deserves more tooling and people working on it.
These trends are already making their way into movements like DeSci and smaller projects like Py-Code Datasets. But we still need more tooling around data to improve interoperability as much as possible. Lots of companies have figured out how to make the most of their datasets. We should use similar tooling and approaches to manage the open datasets that surround us. A sort of Data Operating System.
One of the biggest problems in open data today is that organizations treat data portals as graveyards where data goes to die. Keeping these datasets up to date is a core concern (data has marginal temporal value), alongside using the data for operational purposes and showcasing it to the public.
Open data is hard to work with because of the overwhelming variety of formats and the engineering cost of integrating them. Data wrangling is a perpetual maintenance commitment, taking a lot of ongoing attention and resources. Better and modern data tooling can reduce these costs.
Organizations like Our World in Data or 538 provide useful analysis but have to deal with dataset management, spending most of their time building custom tools around their workflows. That works, but it limits the potential of these datasets. Sadly, there is no `data get OWID/daily-covid-cases` or `data query "select * from 538/polls"` that could act as a quick and easy entry point to explore datasets.
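Something close to that entry point is already possible today by pointing a query engine at a hosted file. Here is a minimal sketch with DuckDB, assuming the dataset is published as a plain CSV at an illustrative URL (the `data` CLI above is still hypothetical):

```python
# Minimal sketch: query a remote open dataset directly with DuckDB.
# The URL is illustrative, not a real endpoint.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # enables reading files over HTTP(S)
con.execute("LOAD httpfs")

df = con.execute("""
    SELECT country, date, new_cases
    FROM read_csv_auto('https://example.org/owid/daily-covid-cases.csv')
    WHERE country = 'Spain'
    ORDER BY date DESC
    LIMIT 10
""").df()
print(df)
```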
We could have a better data ecosystem if we collaborated on open standards! So, let's move towards more composable, maintainable, and reproducible open data.
¹ Blockchain data might be a great place to start building on these ideas as the data there is open, immutable, and useful.
- Easy. Create, curate and share datasets without friction.
- Frictionless: Data is useful only when used! Right now, we're not using most of humanity's datasets. That's not because they're not available but because they're hard to get. They're isolated in different places and multiple formats.
- Pragmatism: a published dataset is better than an almost-published one that's missing something. Publishing datasets to the web is too hard right now, and there are few purpose-built tools that help.
- Familiar Workflow: people won't change their workflow to use a new tool. They will use something if it fits into their existing workflow.
- Versioned and Modular. Data and metadata (e.g: `relation`) should be updated, forked and discussed as code in version controlled repositories.
- Prime composability (e.g: Arrow ecosystem) so tools/services can be swapped.
- Metadata as a first-class citizen. Even if minimal and automated.
- Git-based collaboration. Adopt and integrate with `git` and GitHub to reduce surface area. Build tooling to adapt revisions, tags, branches, issues, and PRs to datasets.
- Portals are GitHub repositories with scripts to collect data from various sources, clean it, join it, and publish useful datasets and artifacts for that community. Ideally, they are also simple to get started with and expose the best practices in data engineering for curating and transforming data.
- Provide a declarative way of defining the dataset's schema and other meta-properties like relations or tests/checks.
- Support for integrating non-dataset files. A dataset could be linked to code, visualizations, pipelines, models, reports, ...
- Reproducible and Verifiable. People should be able to trust the final datasets without having to recompute everything from scratch. In "reality", events are immutable, data should be too. Make datasets the center of the tooling.
- With immutability and content addressing, you can move backwards in time and run transformations or queries on how the dataset was at a certain point in time.
- Datasets are books, not houses!
- Permissionless. Anyone should be able to add/update/fix datasets or their metadata. GitHub style collaboration, curation, and composability. On data.
- Aligned Incentives. Curators should have incentives to improve datasets. Data is messy after all, but a good set of incentives could make great datasets surface and reward contributors accordingly (e.g: number of contributors to Dune).
- Bounties could be created to reward people who add useful but missing datasets.
- Surfacing and creating great datasets could be rewarded (retroactively or with bounties).
- Curating the data provides compounding benefits for the entire community!
- Rewarding dataset creators according to usefulness. E.g: CommonCrawl built an amazing repository that OpenAI has used for its GPT LLMs. Not sure how well CommonCrawl was compensated.
- Governments need to be forced to use their own open data. This should create a feedback loop that pushes them to keep up the quality and freshness of the data.
- Open Source and Decentralized. Datasets should be stored in multiple places.
- Don't create yet another standard. Provide a way for people to integrate current indexers. Work on adapters for different datasets sources. Similar to:
- Foreign Data Wrappers in PostgreSQL
- Trustfall.
- Open source data integration projects like Airbyte. They can be used to build open data connectors, making it possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination.
- Adapters are created by the community so data becomes connected (see the adapter sketch after this list).
- Having better data will help create better and more accessible AI models (people are working on this).
- Integrate with the modern data stack to avoid reinventing the wheel and increase surface of the required skill sets.
- Decentralize the computation (where data lives) and then cache immutable and static copies of the results (or aggregations) in CDNs (IPFS, R2, Torrent). Most end user queries require only reading a small amount of data!
- Other Principles from the Indie Web like have fun!
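As a rough illustration of the adapter idea above: if every community-written adapter emitted Arrow record batches, downstream tools could stay source-agnostic. The interface and class names below are made up for the sketch, not an existing library:

```python
# Sketch (not an existing library): adapters read from any source and yield
# Arrow record batches, so downstream tooling never cares about the source.
import io
import urllib.request
from typing import Iterator, Protocol

import pyarrow as pa
import pyarrow.csv as pacsv


class Adapter(Protocol):
    """Hypothetical adapter interface: one method, Arrow out."""

    def read(self) -> Iterator[pa.RecordBatch]: ...


class CsvUrlAdapter:
    """Example adapter that reads a CSV from a URL and exposes it as Arrow batches."""

    def __init__(self, url: str):
        self.url = url

    def read(self) -> Iterator[pa.RecordBatch]:
        with urllib.request.urlopen(self.url) as resp:
            table = pacsv.read_csv(io.BytesIO(resp.read()))
        yield from table.to_batches()


def materialize(adapter: Adapter) -> pa.Table:
    """Any adapter can be turned into a single Arrow table (and from there,
    written as Parquet, loaded into DuckDB, pushed to another hub, ...)."""
    return pa.Table.from_batches(list(adapter.read()))
```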
Package managers have been hailed as among the most important innovations Linux brought to the computing industry. The activities of both publishers and users of datasets resemble those of authors and users of software packages.
- Distribution. Decentralized. No central authority. Can work in closed and private networks. Cache/CDN friendly.
- A data package is a URI (like in Deno). You can import from a URL (`data add example.com/dataset.yml` or `data add example.com/hub_curated_datasets.yml`).
- As Rufus Pollock puts it, keep it as simple as possible. Store the table location and schema and get me the data on the hard disk (or my browser) fast.
- Bootstrap a package registry. E.g: a GitHub repository with lots of known `datapackages` that acts as a fallback and a quick way to get started with the tool (`data list` returns a bunch of known open datasets and integrates with platforms like Huggingface).
- Each package has a persistent identifier.
- Indexing. Should be easy to list datasets matching a certain pattern or reading from a certain source.
- Datasets are linked to their metadata.
- One Git repository should match one portal/catalog/hub where related datasets are linked (not islands). Could also be a dataset. The main thing is for code and data to live together. Each Data Portal should be comparable to a website, and may have a specific topical focus (unify on a central theme).
- To avoid yet another open dataset portal, build adapters to integrate with other indexes.
- For example, integrate all Hugging Face datasets by making a scheduled job that builds a Frictionless Catalog (a bunch of `datapackage.yml`s pointing to their Parquet files).
- Expose JSON-LD so Google Dataset Search can index it.
- FAIR.
- Finding the right dataset to answer a question is difficult. Good metadata search is essential.
- Formatting. Datasets are saved and exposed in multiple formats (CSV, Parquet, ...). Could be done in the backend, or in the client when pulling data (WASM). The package manager should be format and storage agnostic. Give me the dataset with id `xyz` as a CSV in this folder.
- Social. Allow users, organizations, stars, citations, attaching default visualizations (d3, Vega, Vegafusion, and others), ...
- Importing datasets. Make it possible to `data fork user/data`, improve something, and publish the resulting dataset back (via something like a PR).
- Have issues and discussions close to the dataset.
- Linking data to other data makes all the data more valuable.
- Extensible. Users could extend the package resource (e.g: Time Series Tabular Package inherits from Tabular Package) and add better support for more specific kinds of data (geographical).
- Build integrations to ingest and publish data in other hubs (e.g: CKAN, HuggingFace, ...).
- Permanence. Each version should be permanent and accessible (look at `git`, `IPFS`, `dolt`, ...).
- Versioning. Should be able to manage diffs and incremental changes in a smart way. E.g: only storing the newly added rows or updated columns.
- Should allow automated harvesting of new data with sensors (external functions) or scheduled jobs.
- Each version is referenced by a hash. Git style.
- Each version is linked to the code that produced it.
- Smart. Use appropriate protocols for storing the data. E.g: rows/columns shouldn't be duplicated if they don't change.
- Think at the dataset level and not the file level.
- Tabular data could be partitioned to make it easier for future retrieval.
- Immutability. Never remove historical data. Data should be append only.
- Many public data sources issue restatements or revisions. The protocol should be able to handle this.
- Higher resolution is more valuable than lower resolution. Publish immutable data and then compute the lower resolution data from it.
- Similar to how `git` deals with it. You could force the deletion of something in case that's needed, but that's not the default behavior.
- Flexible. Allow arbitrary backends. Both centralized (S3, GCS, ...) and decentralized (IPFS, Hypercore, Torrent, ...) layers.
- As agnostic as possible, supporting many types of data; tables, geospatial, images, ...
- Can all datasets be represented as tabular datasets? That would enable running SQL (`select`, `group by`, `join`) on top of them, which might be the easiest way to start collaborating.
- A dataset could have different formats derived from a common one. Build converters between formats relying on the Apache Arrow in-memory standard format. This is similar to how Pandoc and LLVM work! The protocol could do the transformation (e.g: CSV to Parquet, JSON to Arrow, ...) automagically and run some checks at the data level to verify they contain the same information (see the conversion sketch after this list).
- Datasets could be tagged from a library of types (e.g: `ip-address`) and conversion functions (`ip-to-country`). Given that the representation is common (Arrow), the transformations could be written in multiple languages.
- Deterministic. Packaged lambda style transformations (WASM/Docker).
- For tabular data, starting with just SQL might be great.
- Pyodide + DuckDB for transformations could cover a large area.
- Datasets could be derived by importing other datasets and applying deterministic transformations in the `Datafile`. Similar to Docker containers and Splitfiles. That file will carry metadata, lineage, and even some defaults (visualizations, code, ...).
- Standard join keys are the most valuable ways to link data together.
- Declarative. Transformations should be defined as code and be idempotent. Similar to how Pachyderm/Kamu/Holium work.
- E.g: The transformation tool ends up orchestrating containers/functions that read/write from the storage layer, Pachyderm style.
- Environment agnostic. Can be run locally and remotely. One machine or a cluster. Streaming or batch.
- Templated. Having a repository/market of open transformations could empower a bunch of use cases ready to plug in to datasets:
- Detect outliers automatically on tabular data.
- Resize images.
- Normalize sound files.
- Detect suspicious records, like a categorical variable value that appears only once while other values appear many times.
- Enrich data smartly (Match and Augment pattern). If a matcher detects a date, the augmenter can add the day of the week. If it's something like a latitude and longitude, the augmenter adds country/city. Some tools do this with closed source data.
- Templated validations to make sure datasets conform to certain standards.
- Accessible. Datasets are files. Datasets are static assets living somewhere. Don't get in the middle with libraries, gated databases or obscure licenses.
- Documentation. Surface derived work (e.g: reports, other datasets, ...) and not only the raw data with minimal metadata.
- Embedded Visualizations. Know what's in there before downloading it.
- Sane Defaults. Suggest basic charts (bars, lines, time series, clustering). Multiple views.
- Exploratory. Allow drill downs and customization. Offer a simple way for people to query/explore the data.
- Dynamic. Use only the data you need. No need to pull 150GB.
- Default APIs. For some datasets, allowing REST API / GraphQL endpoints might be useful. Same with providing an SQL interface.
- Users should be able to clone public datasets with a single CLI command.
- Installing datasets could mean mounting them in a virtual filesystem (FUSE) and supporting random access (e.g: HTTP Range requests).
- Don't break history. If a dataset is updated, the old versions should still be accessible.
- Make sure the datasets are there for the long run. This might take different forms (using a domain name, IPFS, ...).
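To make the Arrow-based conversion point above concrete, here is a minimal sketch of a CSV-to-Parquet conversion with a data-level check that both copies carry the same information. File names are placeholders and the check is deliberately simple:

```python
# Minimal sketch: convert CSV to Parquet through Arrow and verify equality
# at the data level (schema-aware, not byte-for-byte). Paths are placeholders.
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

table = pacsv.read_csv("dataset.csv")          # CSV -> Arrow
pq.write_table(table, "dataset.parquet")       # Arrow -> Parquet

roundtrip = pq.read_table("dataset.parquet")   # read it back
assert table.equals(roundtrip), "CSV and Parquet copies diverged"
print(f"{table.num_rows} rows, {table.num_columns} columns verified")
```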
Please reach out if you want to chat about these ideas or ask more questions.
I'd say chain-related data. It's open and people are eager to get their hands on it. I'm working in that area, so I might be biased.
I wonder if there are ways to use novel mechanisms (e.g: DAOs) to incentivize people. Also, companies like Golden and index.as are doing interesting work on monetizing data curation.
LLMs could infer schemas and types and generate some metadata for us. [[Large Language Models|LLMs can parse unstructured data (CSV) and also generate structure from any data source (scraping websites)]], making it easy to create datasets from random sources.
They're definitely blurring the line between structured and unstructured data too. Imagine pointing an LLM at a GitHub repository with some CSVs and getting an auto-generated `datapackage.json`.
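A sketch of what that could look like, with `complete()` standing in for whichever LLM client you use; the prompt and the expected output shape are illustrative, not a finished tool:

```python
# Sketch of LLM-assisted metadata generation. `complete()` is a placeholder
# for an actual LLM call; the prompt and expected JSON shape are illustrative.
import csv
import json


def complete(prompt: str) -> str:
    """Placeholder for an LLM call (hosted API, local model, ...)."""
    raise NotImplementedError


def infer_datapackage(csv_path: str, sample_rows: int = 20) -> dict:
    # Grab the header plus a handful of sample rows as context for the model.
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        sample = [row for _, row in zip(range(sample_rows + 1), reader)]

    prompt = (
        "Given this CSV header and sample rows, return a Frictionless-style "
        "datapackage JSON with a name, description, and a schema listing each "
        "field's name, type, and a one-line description.\n\n"
        + "\n".join(",".join(row) for row in sample)
    )
    return json.loads(complete(prompt))
```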
5. How can we stream/update new data reliably? E.g: some datasets like Ethereum blocks could be updated every few minutes.
I don't have a great answer. Perhaps just push the new data into partitioned datasets?
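A sketch of that partitioned approach with PyArrow, assuming the new rows arrive as an Arrow table with a `day` column (the values are made up):

```python
# One possible shape for "push the new data into partitioned datasets":
# append each batch of new rows into a Parquet dataset partitioned by day.
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative values; in practice this table comes from the ingestion job.
new_blocks = pa.table({
    "number": [19000000, 19000001],
    "timestamp": ["2024-01-01T00:00:11Z", "2024-01-01T00:00:23Z"],
    "day": ["2024-01-01", "2024-01-01"],
})

# Existing partitions are left untouched; only new files are added.
pq.write_to_dataset(new_blocks, root_path="ethereum/blocks", partition_cols=["day"])
```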
7. Is it possible to mount large amounts of data (FUSE) from a remote source and get it dynamically as needed?
It should be possible. I wonder if we could mount all datasets locally and explore them as if they were in your laptop.
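Even without a FUSE mount, engines that speak HTTP range requests already get part of the way there. A small sketch with DuckDB against a placeholder URL:

```python
# Sketch: query a remote Parquet file without downloading it in full.
# Only the footer and the row groups/columns the query needs are fetched
# via HTTP range requests. The URL is a placeholder.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

result = con.execute("""
    SELECT count(*)
    FROM read_parquet('https://example.org/datasets/weather.parquet')
""").fetchone()
print(result)
```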
Parquet could be a great fit if we figure out how to deterministically serialize it and integrate with IPLD. This will reduce their size as unchanged columns could be encoded in the same CID.
Later on, I think it could be interesting to explore running `delta-rs` on top of IPFS.
Not sure. Homomorphic encryption?
10. How could something like Ver work?
If you can envision the table you would like to have in front of you, i.e., you can write down the attributes you would like the table to contain, then the system will find it for you. This probably needs a [[Knowledge Graphs|Knowledge Graph]]!
11. How can a [[Knowledge Graphs|Knowledge Graph]] help with the data catalog?
It could help users connect datasets. With good enough core datasets, it could be used as an LLM backend.
An easy tool for creating, maintaining, and publishing databases, with the ability to restrict parts or all of it behind a paywall. Pair it with the ability to send email updates to your audience about changes and additions.
13. Curated and small data (e.g: at the community level) is not reachable by Google. How can we help there?
Indeed! With LLMs on the rise, community curated datasets become more important as they don't appear in the big data dumps.
I use it as a generic term to refer to data and content that can be freely used, modified, and shared by anyone for any purpose. Generally aligned with the Open Definition and the Open Data Commons.
- Qri. An evolution of the classical open portals that added [[Decentralized Protocols]] (IPFS) and computing on top of the data. Sadly, it came to an end early in 2022.
- Datalad. Extended to IPFS
- Is a great tool and uses Git Annex (distributed binary object tracking layer on top of git).
- Complicated to wrap your head around: lots of different commands and concepts. On the other hand, Git Annex is complex but very powerful and flexible.
- Huggingface Datasets
- Quilt
- Forces both Python and S3
- Oxen
- Data is not accessible from other tools
- Docs are sparse
- Definitely more in the Git for Data space than Dataset Package Manager
- Frictionless Data
- Datopian Data CLI. Successor of DPM
- LakeFS. More like Git for Data
- Datasette
- Algovera Metahub
- DVC
- XVC
- ArtiVC
- Xetdata
- Dud
- Splitgraph
- Deep Lake
- Dim
- Hard to grok how to use it from the docs
- Quite small surface area. You can basically install datasets from URLs, create new ones, or apply some kind of GPT3 transformation on top of them
- Juan Benet's data
- Colah's data
- Dolt is another interesting project in the space with some awesome data structures. They also do data bounties!
- Wikipedia
- Github
- HackerNews
- Blockchain
- Our World In Data
- Fivethirtyeight
- BuzzFeed News
- ProPublica
- World Bank
- Ecosyste.ms
- Deps.dev
- Twitter Community Notes
- Open Meteo. Open Data on AWS.
- Datahub
- Frictionless
- Open Data Services
- Catalyst Cooperative
- Carbon Plan
- Data is Plural
- Data Liberation Project
- Opendatasoft
- Open Source Observer
- Source.coop
- Our World in Data
- Google Dataset Search
- Data Commons
- BigQuery Public Data
- Kaggle Datasets
- Datahub
- HuggingFace Datasets
- Data World
- Eurostat
- Statista
- Enigma
- DoltHub
- Socrata
- Nasdaq
- Zenodo
- Splitgraph
- Awesome Public Datasets
- Data Packaged Core Datasets
- Internet Archive Dataset Collection
- AWS Open Data Registry
- Datamarket
- Open Data Stack Exchange
- IPFS Datasets
- Datasets Subreddit. Open Data Subreddit
- Academic Torrents Datasets
- Open Data Inception
- Victoriano's Data Sources
- Data is Plural
- Open Sustainable Technology
- Public APIs
- Real Time Datasets
- Environmental Data Initiative
- Data One
- The Linked Open Data Cloud
- Organisation for Economic Co-operation and Development
- Safemap
- Is it hot in Learmonth right now? (Australia) and Hoy Extremo (Spain)
- Differential Privacy, which allows releasing statistical information about datasets while protecting the privacy of individual data subjects.
- Homomorphic encryption.
- New deidentification techniques.
- Data watermarking, fingerprinting, and provenance tracking with blockchains.
- Better CPUs, compression algorithms, and storage technologies.
After playing with Rill Developer, DuckDB, Vega, WASM, Rath, and other modern Data IDEs, I think we have all the pieces for an awesome web based BI/Data exploration tool. Some of the features it could have:
- Let me add local and remote datasets. Not just one as I'd like to join them later.
- Let me plot it using Vega-Lite. Guide me through alternatives like Vega's Voyager2 does.
- Might be as simple as surfacing Observable Plot with DuckDB WASM...
- Use LLMs to improve the datasets and offer next steps:
- Get suggested transformations for certain columns. If it detects a date, extract the day of the week. If it detects a string, `lower()` it...
- Get suggested plots. Given that it'll know both the column names and the types, it should be possible to create a prompt that returns some plot ideas and another that takes those and writes the Vega-Lite code to make them work (a toy heuristic sketch follows this section).
- Make it easy to query the data via Natural Language.
- Let me transform them with SQL (DuckDB) and Python (JupyterLite). Similar to Neptyne but in the browser (WASM).
- Let me save the plots in a separate space and give me a shareable URL encoded link.
- Local datasets could be shared using something like Magic Wormhole or a temporal storage service.
- Let me grab the state of the app (YAML/JSON), version control it, and generate static (to publish in GitHub Pages) and dynamic (hosted somewhere) dashboards from it.
- Similar to evidence.dev or portal.js.
- It could also have "smart" data checks. Similar to deepchecks alerting about anomalies, outliers, noisy variables, ...
- Given a large amount of [[Open Data]], it could offer a way for people to upload their datasets and get them augmented.
- E.g: Upload a CSV with year and country and the tool could suggest GDP per Capita or population.
Could be an awesome front-end to explore [[Open Data]].
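As a toy version of the "suggested plots" idea above (without the LLM), column types alone already hint at sensible defaults. A small Python heuristic sketch:

```python
# Toy heuristic for plot suggestions: column types alone already hint at
# sensible defaults. A real tool (or an LLM prompt) would go much further.
import pandas as pd


def suggest_plots(df: pd.DataFrame) -> list[str]:
    suggestions = []
    numeric = df.select_dtypes("number").columns.tolist()
    temporal = df.select_dtypes("datetime").columns.tolist()
    categorical = [c for c in df.columns
                   if df[c].dtype == object and df[c].nunique() < 30]

    if temporal and numeric:
        suggestions.append(f"line chart of {numeric[0]} over {temporal[0]}")
    if categorical and numeric:
        suggestions.append(f"bar chart of {numeric[0]} by {categorical[0]}")
    if len(numeric) >= 2:
        suggestions.append(f"scatter plot of {numeric[0]} vs {numeric[1]}")
    return suggestions
```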
Inspired by ODF, Frictionless and Croissant.
```yaml
name: "My Dataset"
owner: "My Org"
kind: "dataset"
version: 1
description: "Some description"
license: "MIT"
documentation:
  url: "somewhere.com"
source:
  - name: "prod"
    db: "psql:/...."
pipeline:
  - name: "Extract X"
    type: image
    image: docker/image:latest
    cmd: "do something"
materializations:
  - format: "Parquet"
    location: "s3://....."
    partition: "year"
schema:
  fields:
    - name: "name"
      type: "string"
      description: "The name of the user"
    - name: "year"
      description: "...."
  primary_key: "country_name"
metadata: "..."
```
- A package spec file describing a package.
- A hierarchical owner/name folder structure for installed packages.
- Spec file locator with fallback to the package registry.
- Versioning and latest versions.
- Asset checksums.
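A minimal sketch of the owner/name layout and checksum verification described above; the root directory and helper names are assumptions, not a spec:

```python
# Sketch of the hierarchical owner/name install layout and asset checksum
# check. The directory layout and function names are assumptions.
import hashlib
from pathlib import Path

PACKAGES_ROOT = Path.home() / ".datapkg" / "packages"


def install_path(owner: str, name: str, version: str) -> Path:
    # e.g: ~/.datapkg/packages/OWID/daily-covid-cases/1.2.0/
    return PACKAGES_ROOT / owner / name / version


def verify_asset(path: Path, expected_sha256: str) -> bool:
    # Compare the downloaded asset against the checksum declared in the spec file.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == expected_sha256
```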
- The goal is to create a single, unified schema across datasets. This schema aims to strike a balance between flexibility to accommodate arbitrarily shaped data and consistency in the core tables.
- Datasets are built around two concepts: entities and timeseries.
- Entities are concrete things or objects (a geography, a company, a mortgage application).
- Timeseries are abstract measures (ie. statistics) related to an entity and a date.
- The core tables are:
  - `entities`: Contains the entities that are being tracked. For example, Spain, Madrid, etc.
    - Should be something like `province_index` or `weather_station_index` to be able to join with the timeseries.
    - This table contains permanent characteristics describing an entity. E.g: for Provinces, the name, the region.
    - Each row represents a distinct entity. The table is wide, in that immutable characteristics are expressed in their own fields.
  - `attributes`: Attributes are descriptors of a timeseries. An attribute is the equivalent of a characteristic, except for the abstract timeseries rather than the concrete entity.
    - Columns:
      - `variable_id`: Unique identifier for the attribute
      - `name`: Name of the attribute
      - `description`: Description of the attribute
      - `unit`: Unit of the attribute
      - `source`: Source of the attribute
      - `frequency`: Frequency of the attribute (daily, monthly, etc.)
      - `measurement_type`: Type of measurement (e.g. nominal, ordinal, interval, ratio, percentage)
    - Metadata columns:
      - `category`: Category of the attribute
      - `namespace`: Namespace of the attribute
      - `tags`: JSON with tags of the attribute?
      - `aggregation_function`: Aggregation function to use when aggregating the attribute
  - `timeseries`: Temporal statistics or measures (ie. metrics) centered around an entity and a date. For example, GDP of Spain, population of Madrid, etc. Timeseries are abstract concepts (ie. a measure) rather than a concrete thing.
    - Could be something like `weather_timeseries` to be able to join with the entities.
    - Columns:
      - `variable_id`: Unique identifier for the attribute
      - `geography_id`: Unique identifier for the geography
      - `date`: Date of the metric
      - `value`: Value of the metric
  - `relationships`: Contains the relationships between entities. For example, Spain is composed of provinces, Madrid is a province, etc.
    - Relationships can also be temporal – valid for an interval defined by specific start and end dates.
  - `characteristics`: Descriptors of an entity that are temporal. They have a start date and end date.
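As a small illustration of how these core tables fit together, here is a DuckDB sketch of the joins. Table and column names follow the description above; the exact entity key (`geography_id`) and the extra entity/attribute columns are assumptions:

```python
# Sketch of the core-table joins described above. No rows are inserted, so the
# result is empty; the point is the shape of the schema and the join keys.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE entities (geography_id TEXT, name TEXT, region TEXT)")
con.execute("CREATE TABLE attributes (variable_id TEXT, name TEXT, unit TEXT, frequency TEXT)")
con.execute("CREATE TABLE timeseries (variable_id TEXT, geography_id TEXT, date DATE, value DOUBLE)")

# A timeseries row only becomes meaningful when joined to its entity and its
# attribute metadata.
df = con.execute("""
    SELECT e.name AS entity, a.name AS metric, a.unit, t.date, t.value
    FROM timeseries t
    JOIN entities e USING (geography_id)
    JOIN attributes a USING (variable_id)
    ORDER BY t.date
""").df()
print(df)
```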