
Provenance

Learn how to trace a model back to its training data, and from the training data back to the raw data.

With Dotscience you can automatically trace where files came from and how they were used. Provenance is captured in one of two ways:

  1. Your notebooks and Python scripts are instrumented, or
  2. Dotscience detects that files were created during a run.
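To make the second mechanism concrete, here is a minimal sketch of one way file creation could be detected: take a snapshot of the workspace before and after a run and compare the two. This is only an illustration under that assumption, not Dotscience's actual implementation; the helper names `snapshot`, `created_files` and `demo`, and the file `raw.csv`, are hypothetical.

```python
import os
import tempfile


def snapshot(root):
    """Map each file under root (as a relative path) to its (size, mtime_ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            stat = os.stat(full)
            state[os.path.relpath(full, root)] = (stat.st_size, stat.st_mtime_ns)
    return state


def created_files(before, after):
    """Return the sorted paths present after the run but absent before it."""
    return sorted(set(after) - set(before))


def demo():
    """Simulate a run that writes one new file into a temporary workspace."""
    with tempfile.TemporaryDirectory() as workspace:
        # A file that already exists before the run starts (hypothetical name).
        with open(os.path.join(workspace, "raw.csv"), "w") as f:
            f.write("id,label\n1,stop\n")
        before = snapshot(workspace)

        # The "run" itself: it writes a new output file.
        with open(os.path.join(workspace, "classes.json"), "w") as f:
            f.write('{"1": "stop"}')
        after = snapshot(workspace)

        return created_files(before, after)


print(demo())
```

Only the newly written file is reported, so it can be attributed to the run as an output without the code declaring it explicitly.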

Let’s start by opening a Jupyter session on a new project:

I’m going to upload a few files, which you can get from https://github.com/dotmesh-io/dotscience-roadsigns

The first notebook, get-data.ipynb, fetches some data from S3 and does some conversion. Each ds.publish("...") call results in a run being recorded. For more information on how to record runs, see the References section. If we run all cells in that notebook and head back to the “Runs” page, we can inspect those runs and see where their inputs and outputs came from:
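As a rough mental model of what "a run being recorded" means, the sketch below uses a toy recorder that collects declared inputs and outputs and bundles them into a run record on each publish. `ToyRecorder` and its methods are hypothetical stand-ins written for this illustration; they are not the dotscience library, whose real calls are described in the References section. The run description is likewise made up.

```python
class ToyRecorder:
    """Hypothetical stand-in for run instrumentation (not the dotscience library)."""

    def __init__(self):
        self.runs = []       # every published run, in order
        self._inputs = []    # inputs declared since the last publish
        self._outputs = []   # outputs declared since the last publish

    def input(self, path):
        """Declare a file the current run reads; returns the path unchanged."""
        self._inputs.append(path)
        return path

    def output(self, path):
        """Declare a file the current run writes; returns the path unchanged."""
        self._outputs.append(path)
        return path

    def publish(self, description):
        """Close the current run and store it with its inputs and outputs."""
        run = {
            "description": description,
            "inputs": list(self._inputs),
            "outputs": list(self._outputs),
        }
        self.runs.append(run)
        self._inputs.clear()
        self._outputs.clear()
        return run


recorder = ToyRecorder()
recorder.input("Signnames.csv")
recorder.output("classes.json")
run = recorder.publish("converted sign names")  # illustrative description
print(run)
```

Each publish yields one self-contained record, which is the kind of information a runs page needs in order to show where inputs and outputs came from.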

If we click on one of those runs:

…and then click “Run details”…

The first thing we see is a provenance graph, which shows what Dotscience inferred from those runs. In this case, since get-data.ipynb and Signnames.csv were uploaded in the same commit, Dotscience has inferred that Signnames.csv was an output of that notebook. That file was then used as an input to the current run, which produced classes.json as an output.

Looking at a subsequent run, you’ll notice that three files were downloaded by the run and are now tracked within the project.

Now that the data prep is done, open roadsigns.ipynb and run all of its cells. This notebook creates a training set, tunes a model and builds a model with Dotscience. The run is published with ds.publish("trained tensorflow model"). For more information on how the model is generated, refer to the tutorial on Notebook based development.

Looking closely at the provenance graph

The provenance graph shows the complete trace of the data flow, i.e. from the raw data that forms the training set right up to the current run, which created a model and its metadata file. The exact versions of the files are also displayed.
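The idea of tracing from a model back to raw data can be sketched as a backward walk over run records. The example below is a simplified illustration, not how Dotscience stores or queries provenance: the run ids and the file names `raw.zip`, `train.p` and `model.pb` are placeholders, and the graph is deliberately smaller than the one in this tutorial.

```python
from collections import deque

# Hypothetical run records: each run lists the files it read and wrote.
RUNS = [
    {"id": "get-data", "inputs": ["raw.zip"], "outputs": ["train.p"]},
    {"id": "convert", "inputs": ["Signnames.csv"], "outputs": ["classes.json"]},
    {"id": "train", "inputs": ["train.p", "classes.json"], "outputs": ["model.pb"]},
]


def trace_back(target, runs):
    """Walk breadth-first from a file to everything it was derived from.

    Returns (path, producing_run_id) pairs; the run id is None for files
    that no recorded run produced, i.e. the raw data.
    """
    producer = {out: run for run in runs for out in run["outputs"]}
    lineage = []
    seen = {target}
    queue = deque([target])
    while queue:
        path = queue.popleft()
        run = producer.get(path)
        lineage.append((path, run["id"] if run else None))
        if run:
            for parent in run["inputs"]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
    return lineage


lineage = trace_back("model.pb", RUNS)
raw_files = [path for path, run_id in lineage if run_id is None]
for path, run_id in lineage:
    print(path, "<-", run_id or "raw data")
```

Files with no producing run are where the walk stops, which corresponds to the raw-data end of the graph.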

When you have many models, it’s easy to find where each one originated. Navigate to https://cloud.dotscience.com/models/builds to see the recently built models. Clicking on a model name takes you back to the exact run that generated the model.