Merge pull request #2266 from firebase/next

Release: firestore-bigquery-export 0.1.57

cabljac authored Jan 31, 2025
2 parents a64de2a + 78de81c commit 94fcf11

Showing 58 changed files with 9,482 additions and 12,450 deletions.
12 changes: 12 additions & 0 deletions firestore-bigquery-export/CHANGELOG.md
## Version 0.1.57

feat - add basic materialized views support, both incremental and non-incremental.

fix - do not add or update clustering if an invalid clustering field is present.

docs - improve cross-project IAM documentation.

fix - emit correct events to the extension (backward compatible).

docs - add documentation on workarounds to mitigate data loss during extension updates.

## Version 0.1.56

feat - improve sync strategy by immediately writing to BQ, and using cloud tasks only as a last resort
61 changes: 49 additions & 12 deletions firestore-bigquery-export/POSTINSTALL.md
You can test out this extension right away!

1. Go to your [Cloud Firestore dashboard](https://console.firebase.google.com/project/${param:BIGQUERY_PROJECT_ID}/firestore/data) in the Firebase console.

2. If it doesn't already exist, create the collection you specified during installation: `${param:COLLECTION_PATH}`

3. Create a document in the collection called `bigquery-mirror-test` that contains any fields with any values that you'd like.

4. Go to the [BigQuery web UI](https://console.cloud.google.com/bigquery?project=${param:BIGQUERY_PROJECT_ID}&p=${param:BIGQUERY_PROJECT_ID}&d=${param:DATASET_ID}) in the Google Cloud Platform console.

5. Query your **raw changelog table**, which should contain a single log of creating the `bigquery-mirror-test` document.

```sql
SELECT *
FROM `${param:BIGQUERY_PROJECT_ID}.${param:DATASET_ID}.${param:TABLE_ID}_raw_changelog`
```
6. Query your **latest view**, which should return the latest change event for the only document present -- `bigquery-mirror-test`.
```sql
SELECT *
FROM `${param:BIGQUERY_PROJECT_ID}.${param:DATASET_ID}.${param:TABLE_ID}_raw_latest`
```
7. Delete the `bigquery-mirror-test` document from [Cloud Firestore](https://console.firebase.google.com/project/${param:BIGQUERY_PROJECT_ID}/firestore/data).
The `bigquery-mirror-test` document will disappear from the **latest view** and a `DELETE` event will be added to the **raw changelog table**.
8. You can check the changelogs of a single document with this query:
```sql
SELECT *
FROM `${param:BIGQUERY_PROJECT_ID}.${param:DATASET_ID}.${param:TABLE_ID}_raw_changelog`
WHERE document_name = "bigquery-mirror-test"
```

Enabling wildcard references will provide an additional STRING-based column.
`Clustering` does not require creating or modifying a table when adding clustering options; the table is updated automatically.
### Configuring Cross-Platform BigQuery Setup
#### Cross-project Streaming
By default, the extension exports data to BigQuery in the same project as your Firebase project. However, you can configure it to export to a BigQuery instance in a different Google Cloud project. To do this:
1. During installation, set the `BIGQUERY_PROJECT_ID` parameter to your target BigQuery project ID.
2. Identify the service account on the source project associated with the extension. By default, it is constructed as `ext-<extension-instance-id>@project-id.iam.gserviceaccount.com`. However, if the extension instance ID is too long, it may be truncated, with 4 random characters appended, to stay within service account length limits.
3. To find the exact service account, navigate to IAM & Admin -> IAM in the Google Cloud Platform Console. Look for the service account listed with "Name" as "Firebase Extensions <your extension instance ID> service account". The value in the "Principal" column will be the service account that needs permissions granted in the target project.
4. Grant the extension's service account the necessary BigQuery permissions on the target project. You can use our provided scripts:
**For Linux/Mac (Bash):**
```bash
curl -O https://raw.githubusercontent.com/firebase/extensions/master/firestore-bigquery-export/scripts/grant-crossproject-access.sh
chmod +x grant-crossproject-access.sh
./grant-crossproject-access.sh -f SOURCE_FIREBASE_PROJECT -b TARGET_BIGQUERY_PROJECT [-i EXTENSION_INSTANCE_ID] [-s SERVICE_ACCOUNT]
```

**For Windows (PowerShell):**
```powershell
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/firebase/extensions/master/firestore-bigquery-export/scripts/grant-crossproject-access.ps1" -OutFile "grant-crossproject-access.ps1"
.\grant-crossproject-access.ps1 -FirebaseProject SOURCE_FIREBASE_PROJECT -BigQueryProject TARGET_BIGQUERY_PROJECT [-ExtensionInstanceId EXTENSION_INSTANCE_ID] [-ServiceAccount SERVICE_ACCOUNT]
```

**Parameters:**
For Bash script:
- `-f`: Your Firebase (source) project ID
- `-b`: Your target BigQuery project ID
- `-i`: (Optional) Extension instance ID if different from default "firestore-bigquery-export"
- `-s`: (Optional) Service account email. If not provided, it will be constructed using the extension instance ID

For PowerShell script:
- `-FirebaseProject`: Your Firebase (source) project ID
- `-BigQueryProject`: Your target BigQuery project ID
- `-ExtensionInstanceId`: (Optional) Extension instance ID if different from default "firestore-bigquery-export"
- `-ServiceAccount`: (Optional) Service account email. If not provided, it will be constructed using the extension instance ID

**Prerequisites:**
- You must have the [gcloud CLI](https://cloud.google.com/sdk/docs/install) installed and configured
- You must have permission to grant IAM roles on the target BigQuery project
- The extension must be installed before running the script

**Note:** If the extension initially fails to create a dataset on the target project due to missing permissions, don't worry. The extension will automatically retry once you've granted the necessary permissions using these scripts.

### _(Optional)_ Import existing documents

133 changes: 133 additions & 0 deletions firestore-bigquery-export/PREINSTALL.md
Prior to sending the document change to BigQuery, you have an opportunity to transform the data.

The response should be identical in structure.

#### Materialized Views

This extension supports both regular views and materialized views in BigQuery. While regular views compute their results each time they're queried, materialized views store their query results, providing faster access at the cost of additional storage.

There are two types of materialized views available:

1. **Non-incremental Materialized Views**: These views support more complex queries including filtering on aggregated fields, but require complete recomputation during refresh.

2. **Incremental Materialized Views**: These views update more efficiently by processing only new or changed records, but come with query restrictions. Most notably, they don't allow filtering or partitioning on aggregated fields in their defining SQL, among other limitations.

**Important Considerations:**
- Neither type of materialized view in this extension currently supports partitioning or clustering
- Both types allow you to configure refresh intervals and maximum staleness settings during extension installation or configuration
- Once created, a materialized view's SQL definition cannot be modified. If you reconfigure the extension to change either the view type (incremental vs non-incremental) or the SQL query, the extension will drop the existing materialized view and recreate it
- Carefully consider your use case before choosing materialized views:
- They incur additional storage costs as they cache query results
- Non-incremental views may have higher processing costs during refresh
- Incremental views have more query restrictions but are more efficient to update

Example of a non-incremental materialized view SQL definition generated by the extension:
```sql
CREATE MATERIALIZED VIEW `my_project.my_dataset.my_table_raw_changelog`
OPTIONS (
allow_non_incremental_definition = true,
enable_refresh = true,
refresh_interval_minutes = 60,
max_staleness = INTERVAL "4:0:0" HOUR TO SECOND
)
AS (
WITH latests AS (
SELECT
document_name,
MAX_BY(document_id, timestamp) AS document_id,
MAX(timestamp) AS timestamp,
MAX_BY(event_id, timestamp) AS event_id,
MAX_BY(operation, timestamp) AS operation,
MAX_BY(data, timestamp) AS data,
MAX_BY(old_data, timestamp) AS old_data,
MAX_BY(extra_field, timestamp) AS extra_field
FROM `my_project.my_dataset.my_table_raw_changelog`
GROUP BY document_name
)
SELECT *
FROM latests
WHERE operation != "DELETE"
)
```

Example of an incremental materialized view SQL definition generated by the extension:
```sql
CREATE MATERIALIZED VIEW `my_project.my_dataset.my_table_raw_changelog`
OPTIONS (
enable_refresh = true,
refresh_interval_minutes = 60,
max_staleness = INTERVAL "4:0:0" HOUR TO SECOND
)
AS (
SELECT
document_name,
MAX_BY(document_id, timestamp) AS document_id,
MAX(timestamp) AS timestamp,
MAX_BY(event_id, timestamp) AS event_id,
MAX_BY(operation, timestamp) AS operation,
MAX_BY(data, timestamp) AS data,
MAX_BY(old_data, timestamp) AS old_data,
MAX_BY(extra_field, timestamp) AS extra_field
FROM
`my_project.my_dataset.my_table_raw_changelog`
GROUP BY
document_name
)
```
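If you need fresher results than the configured refresh interval provides, BigQuery also supports refreshing a materialized view on demand (the project, dataset, and view names below are illustrative):

```sql
-- Manually trigger a refresh of a materialized view.
CALL BQ.REFRESH_MATERIALIZED_VIEW('my_project.my_dataset.my_table_raw_changelog')
```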

Please review [BigQuery's documentation on materialized views](https://cloud.google.com/bigquery/docs/materialized-views-intro) to fully understand the implications for your use case.

#### Using Customer Managed Encryption Keys

By default, BigQuery encrypts your content stored at rest. BigQuery handles and manages this default encryption for you without any additional actions on your part.
If you follow these steps, your changelog table should be created using your customer-managed encryption key.

After your data is in BigQuery, you can run the [schema-views script](https://github.com/firebase/extensions/blob/master/firestore-bigquery-export/guides/GENERATE_SCHEMA_VIEWS.md) (provided by this extension) to create views that make it easier to query relevant data. You only need to provide a JSON schema file that describes your data structure, and the schema-views script will create the views.

#### Cross-project Streaming

By default, the extension exports data to BigQuery in the same project as your Firebase project. However, you can configure it to export to a BigQuery instance in a different Google Cloud project. To do this:

1. During installation, set the `BIGQUERY_PROJECT_ID` parameter to your target BigQuery project ID.

2. After installation, you'll need to grant the extension's service account the necessary BigQuery permissions on the target project. You can use our provided scripts:

**For Linux/Mac (Bash):**
```bash
curl -O https://raw.githubusercontent.com/firebase/extensions/master/firestore-bigquery-export/scripts/grant-crossproject-access.sh
chmod +x grant-crossproject-access.sh
./grant-crossproject-access.sh -f SOURCE_FIREBASE_PROJECT -b TARGET_BIGQUERY_PROJECT [-i EXTENSION_INSTANCE_ID]
```

**For Windows (PowerShell):**
```powershell
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/firebase/extensions/master/firestore-bigquery-export/scripts/grant-crossproject-access.ps1" -OutFile "grant-crossproject-access.ps1"
.\grant-crossproject-access.ps1 -FirebaseProject SOURCE_FIREBASE_PROJECT -BigQueryProject TARGET_BIGQUERY_PROJECT [-ExtensionInstanceId EXTENSION_INSTANCE_ID]
```

**Parameters:**
For Bash script:
- `-f`: Your Firebase (source) project ID
- `-b`: Your target BigQuery project ID
- `-i`: (Optional) Extension instance ID if different from default "firestore-bigquery-export"

For PowerShell script:
- `-FirebaseProject`: Your Firebase (source) project ID
- `-BigQueryProject`: Your target BigQuery project ID
- `-ExtensionInstanceId`: (Optional) Extension instance ID if different from default "firestore-bigquery-export"

**Prerequisites:**
- You must have the [gcloud CLI](https://cloud.google.com/sdk/docs/install) installed and configured
- You must have permission to grant IAM roles on the target BigQuery project
- The extension must be installed before running the script

**Note:** If the extension initially fails to create a dataset on the target project due to missing permissions, don't worry. The extension will automatically retry once you've granted the necessary permissions using these scripts.

#### Mitigating Data Loss During Extension Updates

When updating or reconfiguring this extension, there may be a brief period where data streaming from Firestore to BigQuery is interrupted. This limitation exists within the Extensions platform, but we provide two strategies to mitigate potential data loss.

##### Strategy 1: Post-Update Import
After reconfiguring the extension, run the import script on your collection to ensure all data is captured. Refer to the "Import Existing Documents" section above for detailed steps.
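Because the import may re-write events that were also streamed, the changelog can end up with duplicate rows. A deduplication sketch, assuming the default changelog schema and using `document_name` plus `timestamp` as the duplicate key (table names are illustrative; adjust the key to your schema before running):

```sql
-- Keep one row per (document_name, timestamp) pair.
-- Note: CREATE OR REPLACE does not preserve partitioning or clustering settings.
CREATE OR REPLACE TABLE `my_project.my_dataset.my_table_raw_changelog` AS
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY document_name, timestamp
      ORDER BY event_id
    ) AS row_num
  FROM `my_project.my_dataset.my_table_raw_changelog`
)
WHERE row_num = 1
```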

##### Strategy 2: Parallel Instance Method
1. Install a second instance of the extension that streams to a new BigQuery table.
2. Reconfigure the original extension.
3. Once the original extension is properly configured and streaming events, uninstall the second instance.
4. Run a BigQuery merge job to combine the data from both tables.
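The merge in the final step can be sketched with a BigQuery `MERGE` statement, assuming the second instance wrote to a hypothetical `my_table_second_raw_changelog` table with the same schema:

```sql
-- Copy rows from the parallel instance's table that the original table
-- doesn't already contain (keyed on event_id; table and column names are
-- illustrative and assume the default changelog schema).
MERGE `my_project.my_dataset.my_table_raw_changelog` AS target
USING `my_project.my_dataset.my_table_second_raw_changelog` AS source
ON target.event_id = source.event_id
WHEN NOT MATCHED THEN
  INSERT (document_name, document_id, timestamp, event_id, operation, data, old_data)
  VALUES (source.document_name, source.document_id, source.timestamp,
          source.event_id, source.operation, source.data, source.old_data)
```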

##### Considerations
- Strategy 1 is simpler but may result in duplicate records that need to be deduplicated
- Strategy 2 requires more setup but provides better data continuity
- Choose the strategy that best aligns with your data consistency requirements and operational constraints

#### Billing
To install an extension, your project must be on the [Blaze (pay as you go) plan](https://firebase.google.com/pricing).
