Commit cf96e01

authored Jan 18, 2022
Groomed documentation (#88)
1 parent ac5068b commit cf96e01

8 files changed (+44 -15 lines)

‎Configuration.md

-7
@@ -268,10 +268,3 @@ to discover the kinds of requests that can be made.
 
 > **NOTE**: MARS data is stored on tape drives. It takes longer for multiple workers to request data than a single
 > worker. Thus, it's recommended _not_ to set a partition key when writing MARS data configurations.
-
-## Writing Efficient Data Requests
-
-TODO([#26](https://github.com/googlestaging/weather-tools/issues/26)). In the mean-time, please consult this ECMWF
-documentation:
-* [Web API Retrieval Efficiency](https://confluence.ecmwf.int/display/WEBAPI/Retrieval+efficiency)
-* [Era 5 daily data retrieval efficiency](https://confluence.ecmwf.int/display/WEBAPI/ERA5+daily+retrieval+efficiency)

‎Efficient-Requests.md

+6
@@ -0,0 +1,6 @@
+# Writing Efficient Data Requests
+
+TODO([#26](https://github.com/googlestaging/weather-tools/issues/26)). In the mean-time, please consult this ECMWF
+documentation:
+* [Web API Retrieval Efficiency](https://confluence.ecmwf.int/display/WEBAPI/Retrieval+efficiency)
+* [Era 5 daily data retrieval efficiency](https://confluence.ecmwf.int/display/WEBAPI/ERA5+daily+retrieval+efficiency)

‎README.md

+4-1
@@ -13,7 +13,7 @@ across Alphabet.
 The first tool created was the weather downloader (`weather-dl`). This makes it easier to ingest data from the European
 Center for Medium Range Forecasts (ECMWF). `weather-dl` enables users to describe very specifically what data they'd
 like to ingest from ECMWF's catalogs. It also offers them control over how to parallelize requests, empowering users to
-[retrieve data efficiently](Configuration.html#writing-efficient-data-requests). Downloads are driven from a
+[retrieve data efficiently](Efficient-Requests.md). Downloads are driven from a
 [configuration file](Configuration.md), which can be reviewed (and version-controlled) independently of pipeline or
 analysis code.
 
@@ -92,6 +92,7 @@ _Steps_:
 ```shell
 weather-mv --uris "./local_run/**.nc" \ # or --uris "./split_data/**.nc" \
 --output_table "$PROJECT.$DATASET_ID.$TABLE_ID" \
+--temp_location "gs://$BUCKET/tmp" \ # Needed for batch writes to BigQuery
 --direct_num_workers 2
 ```
 
@@ -111,6 +112,8 @@ our [guide](CONTRIBUTING.md) to get started.
 
 ## License
 
+This is not an official Google product.
+
 ```
 Copyright 2021 Google LLC
 

‎docs/Efficient-Requests.md

+1
@@ -0,0 +1 @@
+../Efficient-Requests.md

‎docs/_static/custom.css

+9-1
@@ -1,3 +1,11 @@
 p {
 text-align: justify;
-}
+}
+
+body {
+min-width: 250px;
+}
+
+div.body {
+min-width: 250px;
+}

‎docs/index.md

+1
@@ -16,6 +16,7 @@
 weather_sp/README
 Runners
 Configuration
+Efficient-Requests
 CONTRIBUTING
 modules
 ```

‎weather_dl/README.md

+15-2
@@ -35,11 +35,13 @@ _Common options_:
 
 * `-f, --force-download`: Force redownload of partitions that were previously downloaded.
 * `-d, --dry-run`: Run pipeline steps without _actually_ downloading or writing to cloud storage.
-* `-m, --manifest-location MANIFEST_LOCATION`: Location of the manifest. Either a Firestore collection URI
-('fs://<my-collection>?projectId=<my-project-id>'), a GCS bucket URI, or 'noop://<name>' for an in-memory location.
 * `-l, --local-run`: Run locally and download to local hard drive. The data and manifest directory is set by default
 to '<$CWD>/local_run'. The runner will be set to `DirectRunner`. The only other relevant option is the config
 and `--direct_num_workers`
+* `-m, --manifest-location MANIFEST_LOCATION`: Location of the manifest. Either a Firestore collection URI
+('fs://<my-collection>?projectId=<my-project-id>'), a GCS bucket URI, or 'noop://<name>' for an in-memory location.
+* `-n, --num-requests-per-key`: Number of concurrent requests to make per API key. Default: make an educated guess per
+client & config. Please see the client documentation for more details.
 
 Invoke with `-h` or `--help` to see the full range of options.
 
@@ -67,6 +69,17 @@ weather-dl configs/mars_example_config.cfg \
 --job_name $JOB_NAME
 ```
 
+Using the DataflowRunner and specifying 3 requests per license
+
+```bash
+weather-dl configs/mars_example_config.cfg \
+-n 3 \
+--runner DataflowRunner \
+--project $PROJECT \
+--temp_location gs://$BUCKET/tmp \
+--job_name $JOB_NAME
+```
+
 For a full list of how to configure the Dataflow pipeline, please review
 [this table](https://cloud.google.com/dataflow/docs/reference/pipeline-options).
 

‎weather_mv/README.md

+8-4
@@ -11,9 +11,6 @@ Weather Mover loads weather data from cloud storage into [Google BigQuery](https
 * **Parallel Upload**: Each file will be processed in parallel. With Dataflow autoscaling, even large datasets can be
 processed in a reasonable amount of time.
 
-> Note: Data is written into BigQuery using streaming inserts. It may take [up to 90 minutes](https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability)
-> for buffers to persist into storage. However, weather data will be available for querying immediately.
-
 ## Usage
 
 ```
@@ -48,6 +45,7 @@ _Usage Examples_:
 ```bash
 weather-mv --uris "gs://your-bucket/*.nc" \
 --output_table $PROJECT.$DATASET_ID.$TABLE_ID \
+--temp_location "gs://$BUCKET/tmp" \ # Needed for batch writes to BigQuery
 --direct_num_workers 2
 ```
 
@@ -57,6 +55,7 @@ Upload only a subset of variables:
 weather-mv --uris "gs://your-bucket/*.nc" \
 --output_table $PROJECT.$DATASET_ID.$TABLE_ID \
 --variables u10 v10 t
+--temp_location "gs://$BUCKET/tmp" \
 --direct_num_workers 2
 ```
 
@@ -66,6 +65,7 @@ Upload all variables, but for a specific geographic region (for example, the con
 weather-mv --uris "gs://your-bucket/*.nc" \
 --output_table $PROJECT.$DATASET_ID.$TABLE_ID \
 --area 49.34 -124.68 24.74 -66.95 \
+--temp_location "gs://$BUCKET/tmp" \
 --direct_num_workers 2
 ```
 
@@ -77,7 +77,7 @@ weather-mv --uris "gs://your-bucket/*.nc" \
 --runner DataflowRunner \
 --project $PROJECT \
 --region $REGION \
---temp_location gs://$BUCKET/tmp \
+--temp_location "gs://$BUCKET/tmp" \
 --job_name $JOB_NAME
 ```
 
@@ -93,6 +93,10 @@ streaming ingestion, use the `--topic` flag (see above). Objects that don't matc
 ingestion. It's worth noting: when setting up PubSub, **make sure to create a topic for GCS `OBJECT_FINALIZE` events
 only.**
 
+Data is written into BigQuery using streaming inserts. It may
+take [up to 90 minutes](https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability)
+for buffers to persist into storage. However, weather data will be available for querying immediately.
+
 > Note: It's recommended that you specify variables to ingest (`-v, --variables`) instead of inferring the schema for
 > streaming pipelines. Not all variables will be distributed with every file, especially when they are in Grib format.
 
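
A note on the shell examples in this diff: a backslash continues a command only when it is the last character on the line. In the `weather-mv` examples that end a line with `\ # Needed for batch writes to BigQuery`, the backslash escapes the following space rather than the newline, so the shell would most likely treat `--direct_num_workers 2` as a separate command. The `--variables u10 v10 t` line also has no trailing backslash before the added `--temp_location` line. One way to keep per-flag comments is a bash array, sketched below. The flag names come from the diff; the variable values are hypothetical placeholders, and `echo` stands in for the real invocation so the sketch runs without `weather-mv` installed.

```bash
#!/usr/bin/env bash
# Hypothetical placeholder values (not from the commit); substitute your own.
PROJECT="my-project"
DATASET_ID="my_dataset"
TABLE_ID="my_table"
BUCKET="my-bucket"

# Collect the arguments in an array so each one can carry its own comment
# without relying on backslash line continuations.
args=(
  --uris "gs://your-bucket/*.nc"
  --output_table "$PROJECT.$DATASET_ID.$TABLE_ID"
  --temp_location "gs://$BUCKET/tmp"  # Needed for batch writes to BigQuery
  --direct_num_workers 2
)

# Print the command that would run; drop the leading `echo` to execute it.
echo weather-mv "${args[@]}"
```

Quoting `"${args[@]}"` passes each element as exactly one argument, so the glob in `--uris` reaches the tool unexpanded, as it does in the quoted originals.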
