The purpose of this walkthrough is to create custom Dataflow templates.
The value of custom Dataflow templates is that they allow us to execute Dataflow jobs without installing any code. This is useful for running Dataflow jobs from an automated process, or for letting people without technical expertise run jobs through a guided user interface.
It is recommended to go through this walkthrough using a new temporary Google Cloud project, unrelated to any of your existing Google Cloud projects.
Select or create a project to begin.
gcloud config set project <walkthrough-project-id/>
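As a quick sanity check, the active project can be read back with `gcloud config get-value project`, the same command a later step in this walkthrough uses. The sketch below wraps it in a hypothetical helper, `require_project`, which is not part of this repository; it only assumes that an unset project yields empty output or the literal `(unset)`.

```shell
# Hypothetical helper (not part of this repository): fail early when no
# project is configured in the active gcloud configuration.
require_project() {
  # Read the configured project ID; stderr is discarded on purpose.
  project="$(gcloud config get-value project 2>/dev/null)"
  # Assumption: an unset project shows up as empty output or "(unset)".
  if [ -z "$project" ] || [ "$project" = "(unset)" ]; then
    echo "No project set; run: gcloud config set project PROJECT_ID" >&2
    return 1
  fi
  echo "$project"
}
```

Calling `require_project` before the Terraform steps prints the project ID on success and returns a non-zero status otherwise.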
Best practice recommends that a Dataflow job:
- Use a worker service account to access the pipeline's files and resources
- Grant the worker service account only the minimally necessary IAM permissions
- Enable only the minimally required Google Cloud services
Therefore, this step will:
- Create service accounts
- Provision IAM credentials
- Enable required Google Cloud services
Run the Terraform workflow in the infrastructure/01.setup directory.
Terraform will ask your permission before provisioning resources.
If you agree with Terraform provisioning resources, type yes to proceed.
DIR=infrastructure/01.setup
terraform -chdir=$DIR init
terraform -chdir=$DIR apply -var='project=<walkthrough-project-id/>'
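After the apply finishes, it can be useful to confirm that the expected services were enabled. The following is a minimal sketch using `gcloud services list --enabled`; the helper name `service_enabled` is hypothetical, and `dataflow.googleapis.com` in the usage note is only an assumed example, since the exact set of services is defined by the Terraform code in this step.

```shell
# Hypothetical helper: succeed when the given API is enabled in the project.
service_enabled() {
  # $1 = project ID
  # $2 = service name, for example dataflow.googleapis.com (an assumption)
  # -x matches the whole line and -F treats the name as a fixed string,
  # so the dots in the service name are not regex wildcards.
  gcloud services list --enabled --project="$1" --format='value(config.name)' \
    | grep -qxF "$2"
}
```

For example, `service_enabled "$(gcloud config get-value project)" dataflow.googleapis.com && echo enabled`. The service accounts created by this step can be listed in a similar way with `gcloud iam service-accounts list --format='value(email)'`.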
Best practice recommends that a Dataflow job:
- Use a custom network and subnetwork
- Allow only the minimally necessary network firewall rules
- Use private IPs for its workers; as a consequence, building Python custom templates additionally requires a Cloud NAT
Therefore, this step will:
- Provision a custom network and subnetwork
- Provision firewall rules
- Provision a Cloud NAT and its dependent Cloud Router
Run the Terraform workflow in the infrastructure/02.network directory.
Terraform will ask your permission before provisioning resources.
If you agree with Terraform provisioning resources, type yes to proceed.
DIR=infrastructure/02.network
terraform -chdir=$DIR init
terraform -chdir=$DIR apply -var='project=<walkthrough-project-id/>'
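To inspect what this step provisioned, the resources can be listed from the command line. This is a sketch only: the helper name `list_network_resources` is hypothetical, and because the resource names are defined by the Terraform code, nothing is filtered by name here.

```shell
# Hypothetical helper: list the kinds of network resources this step is
# expected to create (networks, subnetworks, firewall rules, routers).
list_network_resources() {
  # $1 = project ID
  for kind in networks "networks subnets" firewall-rules routers; do
    echo "== $kind =="
    # $kind is intentionally unquoted so "networks subnets" splits into
    # the two words that gcloud expects.
    gcloud compute $kind list --project="$1" --format='value(name)'
  done
}
```

Running `list_network_resources "$(gcloud config get-value project)"` prints one section per resource kind. The Cloud NAT itself is attached to the Cloud Router, so it does not appear as a separate top-level entry in this listing.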
The Apache Beam example that our Dataflow template executes is derived from the word count example, for both Java and Python.
The word count example requires a source Google Cloud Storage bucket.
To make the example interesting, we copy all the files from gs://apache-beam-samples/shakespeare/* to a custom bucket in our project.
Therefore, this step will:
- Provision a Google Cloud Storage bucket
- Create Google Cloud Storage objects to read from in the pipeline
Run the Terraform workflow in the infrastructure/03.io directory.
Terraform will ask your permission before provisioning resources.
If you agree with Terraform provisioning resources, type yes to proceed.
DIR=infrastructure/03.io
terraform -chdir=$DIR init
terraform -chdir=$DIR apply -var='project=<walkthrough-project-id/>'
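One way to check that the copy succeeded is to compare object counts between the public sample prefix and the new bucket. The sketch below assumes `gcloud storage ls` prints one object URL per line; the helper name `count_objects` is hypothetical, and the destination bucket name is not stated in this walkthrough, so it must be looked up in the Terraform output or the console.

```shell
# Hypothetical helper: count the entries listed under a Cloud Storage prefix.
count_objects() {
  # $1 = prefix, for example gs://apache-beam-samples/shakespeare/
  # wc -l counts the listed lines; tr strips padding some wc builds add.
  gcloud storage ls "$1" | wc -l | tr -d ' '
}
```

For example, `count_objects gs://apache-beam-samples/shakespeare/` can be compared with the same call against the prefix in your own bucket.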
We will use Cloud Build to build the custom Dataflow template. Compared with running the necessary commands on our local machine, Cloud Build has the advantage that it connects to our version control, GitHub in this example, so that any change made to a specific branch automatically triggers a new build of our Dataflow template.
Therefore, this step will:
- Provision a Cloud Build trigger that will:
  - Run the language-specific build process, e.g. gradle shadowJar or go build
  - Execute the gcloud dataflow flex-template command with relevant arguments
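For orientation, a manual equivalent of the trigger's final step for Java might look like the sketch below. The flags belong to `gcloud dataflow flex-template build`, but the helper name and every value passed in (template path, image path, jar, main class) are placeholder assumptions, as is the `JAVA11` base image; the actual arguments are defined by this example's Cloud Build configuration.

```shell
# Sketch only: the helper name and all argument values are assumptions,
# not values taken from this repository.
build_java_flex_template() {
  # $1 = template spec path in Cloud Storage (gs://.../template.json)
  # $2 = container image path for the template launcher image
  # $3 = path to the shaded jar produced by the build
  # $4 = fully qualified main class of the pipeline
  gcloud dataflow flex-template build "$1" \
    --image-gcr-path="$2" \
    --sdk-language=JAVA \
    --flex-template-base-image=JAVA11 \
    --jar="$3" \
    --env=FLEX_TEMPLATE_JAVA_MAIN_CLASS="$4"
}
```

The command writes a JSON template spec to the given Cloud Storage path, which is the file selected later when creating a job from the template.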
To use Cloud Build, we must own the repository it connects to; it will not work with just any repository, even a public one.
Therefore, complete these steps before proceeding:
First, set your GitHub organization or username:
GITHUB_REPO_OWNER=<change me>
Next, set expected defaults. (Note: Normally it makes sense to default terraform variables instead of doing this.)
GITHUB_REPO_NAME=professional-services
WORKING_DIR_PREFIX=examples/dataflow-custom-templates
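Because the next Terraform command depends on all three variables, a small guard can catch an empty value before anything is provisioned. This is a sketch; `require_vars` is a hypothetical helper and not part of this repository.

```shell
# Hypothetical guard: return non-zero when any named variable is empty or unset.
require_vars() {
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      return 1
    fi
  done
}
```

For example, `require_vars GITHUB_REPO_OWNER GITHUB_REPO_NAME WORKING_DIR_PREFIX` succeeds silently when all three are set.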
Run the Terraform workflow in the infrastructure/04.template directory.
Terraform will ask your permission before provisioning resources.
If you agree with Terraform provisioning resources, type yes to proceed.
DIR=infrastructure/04.template
terraform -chdir=$DIR init
terraform -chdir=$DIR apply \
  -var="project=$(gcloud config get-value project)" \
  -var="github_repository_owner=$GITHUB_REPO_OWNER" \
  -var="github_repository_name=$GITHUB_REPO_NAME" \
  -var="working_dir_prefix=$WORKING_DIR_PREFIX"
Navigate to cloud-build/triggers.
You should see a Cloud Build trigger listed for each language of this example.
Click the RUN button next to the Cloud Build trigger for your language of choice to execute it manually.
See Create Manual Triggers for more information.
This step will take several minutes to complete.
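As an alternative to the console, a trigger can also be started from the command line with `gcloud builds triggers run`. The sketch below is hedged: the helper name `run_trigger` is hypothetical, the trigger names are defined by the Terraform code, and the `main` branch default is an assumption.

```shell
# Sketch: start a Cloud Build trigger by name instead of clicking RUN.
run_trigger() {
  # $1 = trigger name (discover names with: gcloud builds triggers list)
  # $2 = branch to build; "main" is only an assumed default
  gcloud builds triggers run "$1" --branch="${2:-main}"
}
```

For example, `run_trigger TRIGGER_NAME` after replacing `TRIGGER_NAME` with one of the names printed by `gcloud builds triggers list --format='value(name)'`.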
There are multiple ways to run a Dataflow job from a custom template. We will use the Google Cloud console.
To start the process, navigate to dataflow/createjob.
Select Custom Template from the Dataflow template drop-down menu. Then, click the BROWSE button and navigate to the bucket whose name starts with dataflow-templates-. Within this bucket, select the JSON file object that represents the template details. You should see a JSON file for each of the Cloud Build triggers you ran to create the custom template.
The Google Cloud console will further prompt for required fields such as Job name and any required fields for the custom Dataflow template.
When you are satisfied with the values provided to the custom Dataflow template, click the RUN button.
Navigate to dataflow/jobs to locate the job you just created. Clicking on the job will let you navigate to the job monitoring screen.
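For completeness, the same job can be launched without the console by using `gcloud dataflow flex-template run`. This is a minimal sketch under stated assumptions: the helper name `run_template_job` is hypothetical, and the job name, region, template path, and any template parameters depend on your project and on the template's metadata.

```shell
# Sketch: launch a job from a template spec file stored in Cloud Storage.
run_template_job() {
  # $1 = job name
  # $2 = region, for example us-central1 (an assumption)
  # $3 = template spec path (gs://.../*.json) inside the templates bucket
  # Remaining arguments are passed through unchanged, for example
  # --parameters=key=value or --service-account-email=...
  job_name="$1"; region="$2"; spec="$3"
  shift 3
  gcloud dataflow flex-template run "$job_name" \
    --region="$region" \
    --template-file-gcs-location="$spec" \
    "$@"
}
```

Once a job is submitted, `gcloud dataflow jobs list --region=REGION --status=active` shows it from the command line, mirroring the job list in the console.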