Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs: Add docs for xdb-diff #3137

Merged
merged 2 commits into from
Sep 18, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
139 changes: 139 additions & 0 deletions docs/guides/tablediff.md
Original file line number Diff line number Diff line change
Expand Up @@ -153,3 +153,142 @@ SQLMESH_EXAMPLE.INCREMENTAL_MODEL ONLY sample rows:
```

The output matches, with the exception of the column labels in the `COMMON ROWS sample data differences`. The underlying table for each column is indicated by `s__` for "source" table (first table in the command's colon operator `:`) and `t__` for "target" table (second table in the command's colon operator `:`).

## Diffing tables or views across gateways
sungchun12 marked this conversation as resolved.
Show resolved Hide resolved

!!! info "Tobiko Cloud Feature"

Cross-database table diffing is available in [Tobiko Cloud](./observer.md#installation).

SQLMesh executes a project's models with a single database system, specified as a [gateway](../guides/connections.md#overview) in the project configuration.

The within-database table diff tool described above compares tables or environments within such a system. Sometimes, however, you might want to compare tables that reside in two different data systems.

For example, you might migrate your data transformations from an on-premises SQL engine to a cloud SQL engine while setting up your SQLMesh project. To demonstrate equivalence between the systems you could run the transformations in both and compare the new tables to the old tables.

The [within-database table diff](#diffing-models-across-environments) tool cannot make those comparisons, for two reasons:

1. It must join the two tables being diffed, but with two systems no single database engine can access both tables.
2. It assumes that data values can be compared across tables without modification. If the systems use different SQL engines, however, the diff must account for differences in the engines' data types (e.g., whether timestamps should include time zone information).

SQLMesh's cross-database table diff tool is built for just this scenario. Its comparison algorithm efficiently diffs tables without moving them from one system to the other and automatically addresses differences in data types.

### Configuration and syntax

To diff tables across systems, first configure [Gateways](../reference/configuration#Gateways) for each database system in your SQLMesh configuration file.

This example configures `bigquery` and `snowflake` gateways:

```yaml linenums="1"
gateways:
bigquery:
connection:
type: bigquery
[other connection parameters]

snowflake:
connection:
type: snowflake
[other connection parameters]
```

Then, specify each table's gateway in the `table_diff` command with this syntax: `[source_gateway]|[source table]:[target_gateway]|[target table]`.

For example, we could diff the `landing.table` table across `bigquery` and `snowflake` gateways like this:

```sh
$ sqlmesh table_diff 'bigquery|landing.table:snowflake|lake.table'
```

This syntax tells SQLMesh to use the cross-database diffing algorithm instead of the normal within-database diffing algorithm.

After adding gateways to the table names, use `table_diff` as described above - the same options apply for specifying the join keys, decimal precision, etc. See `sqlmesh table_diff --help` for a [full list of options](../reference/cli.md#table_diff).

!!! warning

Cross-database diff works for data objects (tables / views).

Diffing _models_ is not supported because we do not assume that both the source and target databases are managed by SQLMesh.

### Example output

A cross-database diff is broken up into two stages.

The first stage is a schema diff. This example shows that differences in column name case across the two tables are identified as schema differences:

```bash
$ sqlmesh table_diff 'bigquery|sqlmesh_example.full_model:snowflake|sqlmesh_example.full_model' --on item_id --show-sample

Schema Diff Between 'BIGQUERY|SQLMESH_EXAMPLE.FULL_MODEL' and 'SNOWFLAKE|SQLMESH_EXAMPLE.FULL_MODEL':
├── Added Columns:
│ ├── ITEM_ID (DECIMAL(38, 0))
│ └── NUM_ORDERS (DECIMAL(38, 0))
└── Removed Columns:
├── item_id (BIGINT)
└── num_orders (BIGINT)
Schema has differences; continue comparing rows? [y/n]:
```

SQLMesh prompts you before comparing data values across table rows. The prompt provides an opportunity to discontinue the comparison if the schemas are vastly different (potentially indicating a mistake) or you need to exclude columns from the diff because you know they won't match.

The second stage of the diff is comparing data values across tables. Within each system, SQLMesh divides the data into chunks, evaluates each chunk, and compares the outputs across systems. If a difference is found, it performs a row-level diff on that chunk by reading a sample of mismatched rows from each system.

This example shows that 2 rows were present in each system but had different values, one row was in Bigquery only, and one row was in Snowflake only:

```bash
Dividing source dataset into 10 chunks (based on 10947709 total records)
Checking chunks against target dataset
Chunk 1 hash mismatch!
Starting row-level comparison for the range (1 -> 3)
Identifying individual record hashes that don't match
Comparing

Row Counts:
├── PARTIAL MATCH: 2 rows (66.67%)
├── BIGQUERY ONLY: 1 rows (16.67%)
└── SNOWFLAKE ONLY: 1 rows (16.67%)

COMMON ROWS column comparison stats:
pct_match
num_orders 0.0


COMMON ROWS sample data differences:
Column: num_orders
┏━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ item_id ┃ BIGQUERY ┃ SNOWFLAKE ┃
┡━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ 1 │ 5 │ 7 │
│ 2 │ 1 │ 2 │
└─────────┴──────────┴───────────┘

BIGQUERY ONLY sample rows:
item_id num_orders
7 4


SNOWFLAKE ONLY sample rows:
item_id num_orders
4 6
```

If there are no differences found between chunks, the source and target datasets can be considered equal:

```bash
Chunk 1 (1094771 rows) matches!
Chunk 2 (1094771 rows) matches!
...
Chunk 10 (1094770 rows) matches!

All 10947709 records match between 'bigquery|sqlmesh_example.full_model' and 'snowflake|TEST.SQLMESH_EXAMPLE.FULL_MODEL'
```

!!! info

Don't forget to specify the `--show-sample` option if you'd like to see a sample of the actual mismatched data!

Otherwise, only high level statistics for the mismatched rows will be printed.

### Supported engines

Cross-database diffing is supported on all execution engines that [SQLMesh supports](../integrations/overview.md#execution-engines).
Loading