Using Dask and AWS Fargate with Amazon SageMaker Jupyter Notebooks

This example uses AWS CloudFormation to create an Amazon SageMaker Jupyter Notebook and an AWS Fargate cluster so that you can use Dask for distributed computation over large data volumes.

Several Jupyter notebooks show how to work with data directly from S3: how to pull a 2D or 3D variable from a dataset and visualize it, and how to extract a time series of a variable at a given location.

Getting started

cloudformation-launch-stack

  1. Launch the stack. By default it is created in the us-east-1 region (where most of the weather and climate data is located), but you can change it to any region you prefer.

  2. On the Parameters page, enter your DaskWorkerGitToken, which is a GitHub OAuth token. See the note below for how to get one if you don't have it. You can leave all the other parameters at their defaults for now. Hit the next button.

If you don't have a GitHub OAuth token, you can generate one at https://github.com/settings/tokens. The AWS services need the token to build the Docker container image for the Dask worker and scheduler nodes. The public_repo scope is sufficient for the token.
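If you prefer to launch the stack from a script rather than the console, the token has to be passed as a stack parameter. The sketch below only builds the parameter list in the `ParameterKey`/`ParameterValue` shape that the CloudFormation API accepts. The helper name `build_stack_parameters` is hypothetical and not part of this example's code, and no AWS call is made here.

```python
def build_stack_parameters(git_token: str) -> list:
    """Build the CloudFormation parameter list for this stack.

    Only DaskWorkerGitToken is set; every other parameter is left at
    the template default, as the console walkthrough suggests.
    """
    token = (git_token or "").strip()
    if not token:
        raise ValueError("DaskWorkerGitToken must be a non-empty GitHub token")
    return [{"ParameterKey": "DaskWorkerGitToken", "ParameterValue": token}]


# Assumed usage with boto3 (not executed here):
#   cloudformation.create_stack(..., Parameters=build_stack_parameters(token))
```

Keeping the token out of source files (for example, reading it from an environment variable) avoids committing it by accident.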

  3. Hit next on this page, as no input or changes are necessary.

  4. Check the box acknowledging that this will create IAM resources. Hit the next button to start stack creation.

  5. Wait for the stack to finish creating. When it has finished successfully, the last item in the events will be the name of your stack with the status CREATE_COMPLETE. This can take tens of minutes. Then navigate to the Outputs tab for the link to your Jupyter Notebook.
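The same check can be done programmatically. As a minimal sketch, the function below takes a response in the shape returned by the CloudFormation `describe_stacks` call and returns the stack outputs once the status is CREATE_COMPLETE. The function name is hypothetical, and the actual output key for the notebook link depends on the template, so none is assumed here.

```python
def stack_outputs(describe_stacks_response: dict) -> dict:
    """Return {OutputKey: OutputValue} for a successfully created stack.

    Raises RuntimeError if the first stack in the response has not
    reached CREATE_COMPLETE (still in progress, failed, or rolled back).
    """
    stack = describe_stacks_response["Stacks"][0]
    status = stack["StackStatus"]
    if status != "CREATE_COMPLETE":
        raise RuntimeError(f"stack {stack['StackName']} is in state {status}")
    # Outputs may be absent entirely when a stack defines none.
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# Assumed usage with boto3 (not executed here):
#   response = cloudformation.describe_stacks(StackName="my-dask-stack")
#   print(stack_outputs(response))
```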

Jupyter Notebook

The Jupyter notebook environment is set up with a kernel called conda_daskpy3, which contains software matching what is installed on the dask workers.
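Matching versions matter because the notebook and the workers exchange serialized objects. As an illustration only, the sketch below compares package versions in the current kernel against a dictionary of worker versions. Both helper names are hypothetical, and how you obtain the worker versions is left open; `distributed.Client.get_versions(check=True)` performs a comparable check against a live cluster.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(local, remote):
    """Return {package: (local_version, remote_version)} where the two differ."""
    return {
        name: (local.get(name), remote.get(name))
        for name in sorted(set(local) | set(remote))
        if local.get(name) != remote.get(name)
    }
```

An empty result from `find_mismatches` means the two environments agree on every package that was compared.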

Architecture

architecture

The diagram above shows the architecture at a high level. The CloudFormation template deploys two nested stacks: the first deploys a pipeline to build the container image, and the second creates the dask environment and associated resources.

The environment includes:

  1. A Virtual Private Cloud (VPC) with security groups to restrict traffic
  2. A public subnet with NAT Gateway for the scheduler and notebook, and a single private subnet for the dask workers
  3. An S3 Gateway endpoint to enable dask workers to access S3 without traversing the NAT Gateway
  4. An Elastic Container Service (ECS) cluster
  5. ECS service definitions for the dask scheduler and dask workers
  6. A SageMaker Notebook instance
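Because the dask workers run as an ECS service, one way to change the number of workers is to adjust the desired task count of that service. The following is a sketch under that assumption, not this example's own tooling: `scale_dask_workers` is a hypothetical helper, and the cluster and service names must be taken from your deployed stack. It relies only on the ECS `update_service` call with its `cluster`, `service`, and `desiredCount` arguments.

```python
def scale_dask_workers(ecs_client, cluster: str, service: str, count: int) -> dict:
    """Ask ECS to run `count` tasks for the dask worker service.

    `ecs_client` is expected to behave like boto3.client("ecs");
    passing it in keeps the function easy to exercise with a stub.
    """
    if count < 0:
        raise ValueError("worker count must be zero or greater")
    return ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=count,
    )


# Assumed usage with boto3 (names below are placeholders, not executed here):
#   import boto3
#   scale_dask_workers(boto3.client("ecs"), "<cluster-name>", "<worker-service>", 10)
```

Setting the count back to zero when you are done stops the worker tasks.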

When first deployed, the pipeline will create the container image which is deployed into the dask environment. Future updates to the specified GitHub repository will trigger an automatic rebuild of the container image.