S3 Integration for Datasets

This tutorial demonstrates Dotscience S3 integration with read-only datasets.

S3 is a simple and popular store for large-scale datasets, and Dotscience integrates with it directly. In this tutorial we demonstrate how to set up an S3 dataset, train models, and view the provenance linking those models back to the dataset. Dotscience will do an initial sync of all the data from S3, record a version of it in Dotscience, and then allow you to attach the dataset to multiple projects. On each run of your project, it will sync again to make sure the data is up to date before it is used.

Setting up a bucket

In this section, we will:

  1. Set up a new bucket with versioning (required to get the best out of Dotscience)
  2. Set up an account
  3. Create an IAM role and group and attach those to the account

First, let’s set up a bucket. Go to Services -> S3 in the AWS console.

Click “create bucket” and a creation form should appear. Let’s name the bucket “dotscience-hello-s3”, and put it in whichever region is most appropriate for your use case.

Click “next” and tick the box under “Versioning” - this enables us to distinguish new and old artifacts in Dotscience and have proper provenance. Click “next” again.

Leave permissions as they are (block all public access) unless you intend to use this bucket for other purposes; we’ll set up proper IAM permissions later. Click “next” again and then “ok” on the Review page.

Next, let’s set up a new “robot” account to use with Dotscience. You could instead create an access key pair for your own account, but that key pair would carry far more permissions than we need.

Go to Services -> Security and Compliance -> IAM (or just search for IAM) and click Users -> Create User.

Let’s name the account “dotscience-s3” and check “programmatic access”, which gives us the key pair we need for Dotscience. Click “next”.

Select “add new user to group”, then create a group and name it “dotscience-s3” again.

Click “next” until you reach the success screen, then click “download csv” so that we have a copy of the key pair for later use. We still need to create an IAM policy.

Now let’s create the policy. Go back to Services -> Security and Compliance -> IAM and select Policies, then click “Create policy”.

Name it “dotscience-s3” again and paste the contents of this gist into the JSON tab.
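The gist itself isn’t reproduced here, but a minimal read-only policy for this bucket would look something like the following. (The exact actions the gist grants are an assumption; this sketch covers listing the bucket, reading objects, and reading object versions, which versioned-bucket provenance relies on.)

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "s3:GetBucketVersioning"
      ],
      "Resource": "arn:aws:s3:::dotscience-hello-s3"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::dotscience-hello-s3/*"
    }
  ]
}
```

Note the two resource ARNs: bucket-level actions (listing) apply to the bucket itself, while object-level actions apply to `dotscience-hello-s3/*`.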

Click “review policy”, then “save changes”, and go back to IAM -> Groups and click or search for “dotscience-s3”.

Click “attach policy”, search for “dotscience-s3”, tick the box, then click “attach policy”.

Add some data

Now, we have an empty bucket with the appropriate policy and IAM role attached to an account, but we need some data in there before we can see Dotscience pull it in.

Head back to Services -> S3 and search for or click on your bucket (dotscience-hello-s3).

Click “Upload” and then drag and drop something you want to use from your computer.

Bring it into Dotscience

Finally, let’s create the dataset as an entity in Dotscience. Open up Dotscience and click “Datasets” in the top right bar:

Click “add new” and fill in the details we collected earlier (the access key and secret key are in the CSV we downloaded).

For convenience, let’s also rename the dataset to match the bucket name (this is optional). To do this, click on the pencil next to “New Dataset”.

Click “Create” and you should be brought back to the datasets list, with a sync status showing the progress of the initial download. When it shows “ready”, the transfer has completed and you can use the dataset with a project:

Using the dataset in Jupyter

Next, let’s link the dataset to a project. I’m going to create a brand new one and name it “hello, s3!”, but you may use whichever project suits you.

Click Settings from within your project, scroll down to “Datasets”, and select the dataset from the drop-down.

The relative path is the folder name by which the dataset will appear, relative to the home directory your script or notebook is started from. Let’s call it “dotscience-hello-s3” for now, but you might want to call it “s3” or “dataset” for short.

Head back up to the top of your project and click the button next to “Jupyter is available”. Your project should launch on a runner - this may take a while longer than you’re used to as the dataset needs to get pulled onto the runner, but subsequent launches and runs on that runner should be much faster. You will also get faster results if you use a runner in a cloud VM rather than running locally.

Once Jupyter loads, you should see your dataset in the folder viewer:

Let’s create a notebook and consume some of that data. Click “Notebook -> Python 3” in the launcher window.

What you want to do with this data depends on what you put in there - in this case there’s a basic ‘hello world’ file we can play with, and a much larger file to show the syncing progress.

This very simple notebook reads in the hello-world file, reports the line count as a summary statistic, and declares the hello-world file as an input to the notebook. If we run it and then click the “Dotscience” tab, we can see that Dotscience picked up the input and found its S3 version id, which is now stored for us in the commit metadata.
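The notebook’s core logic can be sketched in a few lines. The snippet below creates a stand-in file so it runs anywhere; inside a real project the file already exists under the dataset’s relative path. The `dotscience` library calls shown in comments are assumptions about its API, not taken from this tutorial.

```python
from pathlib import Path

# Path at which the S3 dataset appears inside the project (the relative
# path chosen when attaching the dataset). For this sketch we create the
# folder and a sample file so the snippet is self-contained; the file
# name "hello-world.txt" is an assumption.
data_dir = Path("dotscience-hello-s3")
data_dir.mkdir(exist_ok=True)
hello = data_dir / "hello-world.txt"
hello.write_text("hello\nworld\n")

# In the real notebook you would also annotate the run so Dotscience
# records the file (and its S3 version id) as an input -- assumed API:
#   import dotscience as ds
#   ds.interactive()
#   ds.start()
#   ds.input(str(hello))

# The summary statistic: number of lines in the hello-world file.
line_count = len(hello.read_text().splitlines())

# ...then record the statistic and publish the run -- assumed API:
#   ds.summary("line_count", line_count)
#   ds.publish("counted lines in the hello-world file")
print(line_count)  # → 2
```

Once the run is published, the input declaration is what lets Dotscience attach the S3 version id of the file to the run’s commit metadata.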

If we click on the “Runs” arrow we can see Dotscience has received these runs:

And then if we click on the top run and click “Run Details”, we can see a provenance graph showing the input data from the dataset was used to produce this run:

Using the dataset in scripts

If Jupyter isn’t your thing, you can also connect datasets to projects entirely using the CLI and python scripts.

In this section, we’re going to do three things:

  1. Create a dataset using the credentials we created in AWS earlier
  2. Link it to a project
  3. Launch a python task

Creating the dataset

You will need to have installed the ds CLI; see the documentation here.

Run ds dataset create s3 --bucket dotscience-hello-s3 --access-key-id=<access-key-id> --secret-access-key=<secret-access-key>.

This should create the dataset in Dotscience and kick off a sync with S3 to download all of the files.

If you run ds dataset ls, you should be able to track the status of your dataset download. When the status says “ready”, that means all of the contents have downloaded.

Now let’s take our dataset and link it to a project. To do this, run ds project link <project_id> <dataset_id> dotscience-hello-s3.

This will link the dataset on the path “dotscience-hello-s3”, so when we come to run the script, the data will be accessible from ./dotscience-hello-s3.

If we run ds project ls now we should see that information displayed next to the project we selected:

Finally, we can launch the project and use the dataset with a ds run command such as ds run -p "hello, s3!" -u ./src -- python ./src/script.py.
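For reference, a `./src/script.py` along these lines would exercise the dataset. This is a sketch, not the tutorial’s actual script: it simply walks the synced dataset folder and prints what the sync pulled down, and the commented `dotscience` calls are assumptions about that library’s API.

```python
import os

# The dataset is attached at the relative path chosen with
# `ds project link`, so its files appear under ./dotscience-hello-s3.
data_dir = "dotscience-hello-s3"
# No-op inside a real project (the sync created it); keeps this sketch
# runnable anywhere.
os.makedirs(data_dir, exist_ok=True)

# List what the sync pulled down from the bucket; os.walk also handles
# nested "folders" (S3 key prefixes).
found = []
for root, _dirs, files in os.walk(data_dir):
    for name in files:
        path = os.path.join(root, name)
        found.append(path)
        print(path, os.path.getsize(path), "bytes")

# As in the notebook, you would wrap this with dotscience annotations
# so each file is recorded as an input to the run -- assumed API:
#   import dotscience as ds
#   ds.script()
#   ds.start()
#   for path in found:
#       ds.input(path)
#   ds.publish("listed dataset contents")
```

Because the `-u ./src` flag uploads the script directory, only `script.py` needs to exist locally; the dataset itself is already present on the runner.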