Docs: Add docs for xdb-diff (#3137)
Co-authored-by: Trey Spiller <[email protected]>
erindru and treysp authored Sep 18, 2024
1 parent 537cc67 commit 2f12ad6
Showing 1 changed file with 139 additions and 0 deletions.
139 changes: 139 additions & 0 deletions docs/guides/tablediff.md
@@ -153,3 +153,142 @@ SQLMESH_EXAMPLE.INCREMENTAL_MODEL ONLY sample rows:
```

The output matches, with the exception of the column labels in the `COMMON ROWS sample data differences`. The underlying table for each column is indicated by `s__` for "source" table (first table in the command's colon operator `:`) and `t__` for "target" table (second table in the command's colon operator `:`).

## Diffing tables or views across gateways

!!! info "Tobiko Cloud Feature"

    Cross-database table diffing is available in [Tobiko Cloud](./observer.md#installation).

SQLMesh executes a project's models with a single database system, specified as a [gateway](../guides/connections.md#overview) in the project configuration.

The within-database table diff tool described above compares tables or environments within such a system. Sometimes, however, you might want to compare tables that reside in two different data systems.

For example, you might migrate your data transformations from an on-premises SQL engine to a cloud SQL engine while setting up your SQLMesh project. To demonstrate equivalence between the systems you could run the transformations in both and compare the new tables to the old tables.

The [within-database table diff](#diffing-models-across-environments) tool cannot make those comparisons, for two reasons:

1. It must join the two tables being diffed, but with two systems no single database engine can access both tables.
2. It assumes that data values can be compared across tables without modification. If the systems use different SQL engines, however, the diff must account for differences in the engines' data types (e.g., whether timestamps should include time zone information).

SQLMesh's cross-database table diff tool is built for just this scenario. Its comparison algorithm efficiently diffs tables without moving them from one system to the other and automatically addresses differences in data types.
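To illustrate why data type differences matter, here is a minimal Python sketch of value normalization. It is not SQLMesh's implementation: the `normalize` function and its rules (for example, treating naive timestamps as UTC) are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal


def normalize(value):
    """Map a value to a canonical string so equal data compares equal across engines.

    Hypothetical helper for illustration only; not part of SQLMesh.
    """
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, Decimal)):
        # A BIGINT 5 and a DECIMAL(38, 0) 5 should be treated as the same number.
        return format(Decimal(value).normalize(), "f")
    if isinstance(value, datetime):
        # Assumption for this sketch: naive timestamps are interpreted as UTC.
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)
        return value.astimezone(timezone.utc).isoformat()
    return str(value)


# An integer from one engine and a decimal from the other normalize identically.
same_number = normalize(5) == normalize(Decimal("5.00"))

# A naive timestamp and a zone-aware timestamp for the same instant also agree.
naive = datetime(2024, 9, 18, 12, 0)
aware = datetime(2024, 9, 18, 14, 0, tzinfo=timezone(timedelta(hours=2)))
same_instant = normalize(naive) == normalize(aware)
```

Without a step like this, values that represent the same data could be reported as mismatches simply because the two engines store them with different types.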

### Configuration and syntax

To diff tables across systems, first configure a [gateway](../reference/configuration#Gateways) for each database system in your SQLMesh configuration file.

This example configures `bigquery` and `snowflake` gateways:

```yaml linenums="1"
gateways:
  bigquery:
    connection:
      type: bigquery
      [other connection parameters]

  snowflake:
    connection:
      type: snowflake
      [other connection parameters]
```

Then, specify each table's gateway in the `table_diff` command with this syntax: `[source_gateway]|[source table]:[target_gateway]|[target table]`.

For example, we could diff the `landing.table` table in the `bigquery` gateway against the `lake.table` table in the `snowflake` gateway like this:

```sh
$ sqlmesh table_diff 'bigquery|landing.table:snowflake|lake.table'
```

This syntax tells SQLMesh to use the cross-database diffing algorithm instead of the normal within-database diffing algorithm.
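To make the syntax concrete, here is a minimal Python sketch of how such an argument could be split into its parts. It is not SQLMesh's parser: `TableRef` and `split_diff_spec` are hypothetical names, and the sketch assumes that neither gateway names nor table names contain `:` or `|`.

```python
from typing import NamedTuple, Optional, Tuple


class TableRef(NamedTuple):
    gateway: Optional[str]  # None when the side has no "gateway|" prefix
    table: str


def _parse_side(side: str) -> TableRef:
    """Split one side of the spec on the first '|' if present."""
    gateway, sep, table = side.partition("|")
    if not sep:
        return TableRef(None, side)
    return TableRef(gateway, table)


def split_diff_spec(spec: str) -> Tuple[TableRef, TableRef]:
    """Split '[gateway]|[table]:[gateway]|[table]' into source and target."""
    source, sep, target = spec.partition(":")
    if not sep or not source or not target:
        raise ValueError(f"expected 'source:target', got {spec!r}")
    return _parse_side(source), _parse_side(target)


source, target = split_diff_spec("bigquery|landing.table:snowflake|lake.table")
print(source)  # TableRef(gateway='bigquery', table='landing.table')
print(target)  # TableRef(gateway='snowflake', table='lake.table')
```

The part before the colon is the source and the part after it is the target. In this sketch, the presence of a `gateway|` prefix is what distinguishes a cross-database argument from a within-database one.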

After adding gateways to the table names, use `table_diff` as described above. The same options apply for specifying the join keys, decimal precision, and so on. See `sqlmesh table_diff --help` for a [full list of options](../reference/cli.md#table_diff).

!!! warning

    Cross-database diff works for data objects (tables / views).

    Diffing _models_ is not supported because we do not assume that both the source and target databases are managed by SQLMesh.

### Example output

A cross-database diff is broken up into two stages.

The first stage is a schema diff. This example shows that differences in column name case across the two tables are identified as schema differences:

```bash
$ sqlmesh table_diff 'bigquery|sqlmesh_example.full_model:snowflake|sqlmesh_example.full_model' --on item_id --show-sample

Schema Diff Between 'BIGQUERY|SQLMESH_EXAMPLE.FULL_MODEL' and 'SNOWFLAKE|SQLMESH_EXAMPLE.FULL_MODEL':
├── Added Columns:
│   ├── ITEM_ID (DECIMAL(38, 0))
│   └── NUM_ORDERS (DECIMAL(38, 0))
└── Removed Columns:
    ├── item_id (BIGINT)
    └── num_orders (BIGINT)
Schema has differences; continue comparing rows? [y/n]:
```

SQLMesh prompts you before comparing data values across table rows. The prompt provides an opportunity to discontinue the comparison if the schemas are vastly different (potentially indicating a mistake) or you need to exclude columns from the diff because you know they won't match.

The second stage of the diff is comparing data values across tables. Within each system, SQLMesh divides the data into chunks, evaluates each chunk, and compares the outputs across systems. If a difference is found, it performs a row-level diff on that chunk by reading a sample of mismatched rows from each system.
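The chunk-then-drill-down idea can be sketched in Python as follows. This is an illustration under simplifying assumptions, not SQLMesh's algorithm: both tables are small in-memory dictionaries keyed by the join key, the helper names are hypothetical, and SHA-256 over `repr` stands in for whatever hashing each engine would actually perform on its own data.

```python
import hashlib


def row_hash(key, row) -> str:
    """Hash a single record (hypothetical stand-in for an in-engine hash)."""
    return hashlib.sha256(repr((key, row)).encode()).hexdigest()


def chunk_hash(chunk: dict) -> str:
    """Hash a whole chunk by combining its record hashes in key order."""
    digest = hashlib.sha256()
    for key in sorted(chunk):
        digest.update(row_hash(key, chunk[key]).encode())
    return digest.hexdigest()


def select_range(table: dict, lo, hi) -> dict:
    """Rows with lo <= key < hi; None means the range is open on that side."""
    return {
        k: v
        for k, v in table.items()
        if (lo is None or k >= lo) and (hi is None or k < hi)
    }


def diff_tables(source: dict, target: dict, num_chunks: int = 4) -> dict:
    keys = sorted(source)
    size = max(1, (len(keys) + num_chunks - 1) // num_chunks)
    # Boundary keys between chunks, derived from the source table.
    cuts = [keys[i] for i in range(size, len(keys), size)]
    result = {"partial_match": [], "source_only": [], "target_only": []}
    for i in range(len(cuts) + 1):
        lo = cuts[i - 1] if i > 0 else None
        hi = cuts[i] if i < len(cuts) else None
        src, tgt = select_range(source, lo, hi), select_range(target, lo, hi)
        if chunk_hash(src) == chunk_hash(tgt):
            continue  # chunk matches, so no row-level work is needed
        # Chunk hash mismatch: compare individual record hashes in this range.
        for k in sorted(src.keys() | tgt.keys()):
            if k not in tgt:
                result["source_only"].append(k)
            elif k not in src:
                result["target_only"].append(k)
            elif row_hash(k, src[k]) != row_hash(k, tgt[k]):
                result["partial_match"].append(k)
    return result


# item_id -> num_orders in each system
bigquery = {1: 5, 2: 1, 3: 9, 5: 2, 6: 8, 7: 4, 8: 3, 9: 1}
snowflake = {1: 7, 2: 2, 3: 9, 4: 6, 5: 2, 6: 8, 8: 3, 9: 1}
result = diff_tables(bigquery, snowflake)
print(result)
# {'partial_match': [1, 2], 'source_only': [7], 'target_only': [4]}
```

The point of this design is that only chunks whose hashes disagree need a row-level comparison, so matching data never has to be examined record by record.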

The following output shows that two rows were present in both systems but had different values, one row was in BigQuery only, and one row was in Snowflake only:

```bash
Dividing source dataset into 10 chunks (based on 10947709 total records)
Checking chunks against target dataset
Chunk 1 hash mismatch!
Starting row-level comparison for the range (1 -> 3)
Identifying individual record hashes that don't match
Comparing
Row Counts:
├── PARTIAL MATCH: 2 rows (66.67%)
├── BIGQUERY ONLY: 1 rows (16.67%)
└── SNOWFLAKE ONLY: 1 rows (16.67%)
COMMON ROWS column comparison stats:
pct_match
num_orders 0.0
COMMON ROWS sample data differences:
Column: num_orders
┏━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ item_id ┃ BIGQUERY ┃ SNOWFLAKE ┃
┡━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ 1 │ 5 │ 7 │
│ 2 │ 1 │ 2 │
└─────────┴──────────┴───────────┘
BIGQUERY ONLY sample rows:
item_id num_orders
7 4
SNOWFLAKE ONLY sample rows:
item_id num_orders
4 6
```

If there are no differences found between chunks, the source and target datasets can be considered equal:

```bash
Chunk 1 (1094771 rows) matches!
Chunk 2 (1094771 rows) matches!
...
Chunk 10 (1094770 rows) matches!
All 10947709 records match between 'bigquery|sqlmesh_example.full_model' and 'snowflake|TEST.SQLMESH_EXAMPLE.FULL_MODEL'
```

!!! info

    Don't forget to specify the `--show-sample` option if you'd like to see a sample of the actual mismatched data!
    Otherwise, only high level statistics for the mismatched rows will be printed.

### Supported engines

Cross-database diffing is supported on all execution engines that [SQLMesh supports](../integrations/overview.md#execution-engines).
