This developer guide is for data engineers, data scientists, and developers in the OS-Climate community who want to leverage the OS-Climate Data Commons to build data ingestion and processing pipelines, as well as AI/ML pipelines. It shows step by step how to configure your development environment, structure projects, and manage data and code in a way that complies with our Architecture Blueprint.
Need Help?
- Outage / system failure: File a Linux Foundation (LF) outage ticket (note: select OS-Climate from the project list)
- New infrastructure request (e.g. software upgrade): File an LF ticket (note: select OS-Climate from the project list)
- General infrastructure support: Get help on the OS-Climate Slack Data Commons channel
- Data Commons developer support: Get help on the OS-Climate Slack Developers channel
OS-Climate's Cluster Information
- Cluster 1 (CL1): used for development and initial upgrades of applications
- Cluster 2 (CL2): stable cluster; the sandbox UI and released versions of tools are available from cluster 2
- Cluster 3 (CL3): administrative cluster, managed by Red Hat and the Linux Foundation IT organization
- Cluster 4 (CL4): latest implementation of Red Hat's Data Mesh pattern, under construction; follows the Open Data Hub Data Mesh pattern
Pipeline development leverages a number of tools provided by Data Commons. The table below provides an overview of the key technologies involved, with links to their development instances:
| Technology | Description | Link |
|---|---|---|
| GitHub | Version control tool used to maintain the pipelines as code | OS-Climate GitHub |
| GitHub Projects | Project tracking tool that integrates issues and pull requests | Data Commons Project Board |
| JupyterHub | Self-service environment for Jupyter notebooks used to develop pipelines | JupyterHub Development Instance |
| Kubeflow Pipelines | MLOps tool to support model development, training, serving, and automated machine learning | |
| Trino | Distributed SQL query engine for big data, used for data ingestion and distributed queries | Trino Console |
| CloudBeaver | Web-based database GUI tool providing a rich web interface to Trino | CloudBeaver Development Instance |
| Pachyderm | Data-driven pipeline management tool for machine learning, providing version control for data | |
| dbt | SQL-based data transformation tool providing Git-enabled version control of data transformation pipelines | |
| Great Expectations | Data quality tool providing Git-enabled data quality pipeline management | |
| OpenMetadata | Centralized metadata store providing data discovery, data collaboration, metadata versioning, and data lineage | OpenMetadata Development Instance |
| Airflow | Workflow management platform for data engineering pipelines | Airflow Development Instance |
| Apache Superset | Data exploration and visualization platform | Superset Development Instance |
| Grafana | Analytics and interactive visualization platform | Grafana Development Instance |
| INCEpTION | Text annotation environment, primarily used by OS-Climate for machine learning-based data extraction | INCEpTION Development Instance |
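For example, a common first step when developing a pipeline in JupyterHub is to query the data lake through Trino. The snippet below is a minimal sketch, assuming the `trino` Python client is installed and that the endpoint, user, and JWT token are supplied through environment variables; the variable names shown here are placeholders, not Data Commons settings.

```python
import os

import trino

# Connection details come from environment variables so that no
# credentials live in the notebook itself (names below are placeholders).
conn = trino.dbapi.connect(
    host=os.environ["TRINO_HOST"],
    port=int(os.environ.get("TRINO_PORT", "443")),
    user=os.environ["TRINO_USER"],
    http_scheme="https",
    auth=trino.auth.JWTAuthentication(os.environ["TRINO_PASSWD"]),
)

# Simple smoke test: list the catalogs visible to this user.
cur = conn.cursor()
cur.execute("SHOW CATALOGS")
print(cur.fetchall())
```

CloudBeaver exposes the same queries through a web UI, which is convenient for exploration; keeping production queries in notebooks or dbt models keeps them versioned in GitHub.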
Nowadays, developers (including data scientists) use Git and GitOps practices to store and share code on development platforms such as GitHub. GitOps best practices allow for reproducibility and traceability in projects. For this reason, we have adopted a GitOps approach to managing the platform, data pipeline code, data, and related artifacts.
One of the most important requirements for ensuring data quality through reproducibility is dependency management. Having dependencies clearly managed in audited configuration artifacts makes notebooks portable, so they can be shared safely with others and reused in other projects.
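As a lightweight illustration (the template repositories described below define the actual dependency management tooling), a notebook can verify at start-up that the runtime environment matches its pinned dependencies, so drift is caught immediately instead of surfacing later as an irreproducible result. The package names and versions below are hypothetical examples, not required pins.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical pins for illustration only; in practice these come from the
# project's audited dependency manifest rather than hand-written constants.
REQUIRED = {
    "pandas": "2.2.2",
    "trino": "0.329.0",
}

def check_pins(required: dict) -> None:
    """Fail fast if installed package versions drift from the expected pins."""
    problems = []
    for package, pinned in required.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            problems.append(f"{package}: not installed (expected {pinned})")
            continue
        if installed != pinned:
            problems.append(f"{package}: found {installed}, expected {pinned}")
    if problems:
        raise RuntimeError("Environment drift detected:\n" + "\n".join(problems))

check_pins(REQUIRED)
```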
We use two project templates as starting points for new repositories:
- A project template for data pipelines, specific to OS-Climate Data Commons, can be found here: Data Pipelines Template
- A project template specifically for AI/ML pipelines can be found here: Data Science Template.
Together, these templates tie data scientists' needs (e.g. notebooks, models) to data engineers' needs (e.g. data and metadata pipelines). Having structure in a project ensures all the pieces required for the Data and MLOps lifecycles are present and easily discoverable.