Dotscience for Data Engineering

We explore how data engineering scripts can be instrumented with functions from the Dotscience Python library to enable run tracking and data provenance.

How to instrument a data engineering script

An existing script, written outside Dotscience, that does data engineering prior to model training can be augmented with Dotscience commands so that it is tracked within the system.

An un-augmented script might look like this:
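For illustration, here is a minimal stand-in for such a script (the filenames, columns, and transformations are invented for this sketch): it reads a raw CSV, drops rows with missing values, normalises a numeric column, and writes the cleaned result out.

```python
# Hypothetical data engineering script: clean a raw CSV by dropping
# rows with a missing "value" and normalising the column to [0, 1].
import csv

# Create a small sample input so the sketch is self-contained.
with open("raw_data.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["id", "value"])
    w.writerows([["1", "10"], ["2", ""], ["3", "30"]])

# Load, dropping rows with a missing value.
with open("raw_data.csv", newline="") as f:
    rows = [r for r in csv.DictReader(f) if r["value"]]

# Normalise the "value" column by its maximum.
max_value = max(float(r["value"]) for r in rows)
for r in rows:
    r["value"] = str(float(r["value"]) / max_value)

# Write the cleaned dataset out.
with open("cleaned_data.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["id", "value"])
    w.writeheader()
    w.writerows(rows)
```

Nothing here is Dotscience-specific yet: the script simply reads one file and writes another, which is exactly the shape of script the augmentation below applies to.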

To augment the script, add the ds.start(), ds.input(), ds.output(), and ds.publish() commands at the appropriate points:
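The following sketch shows the shape of such an augmented script. All filenames and transformations are invented; the try/except stub stands in for the real library only so that the sketch also runs outside a Dotscience environment, where the plain `import dotscience as ds` alone would be used.

```python
import csv

try:
    import dotscience as ds
except ImportError:
    # Minimal stand-in so the sketch runs without Dotscience installed;
    # it is not the real API surface.
    class _StubDS:
        def start(self): pass
        def publish(self): pass
        def input(self, path): return path
        def output(self, path): return path
    ds = _StubDS()

# Sample input so the sketch is self-contained.
with open("raw_data.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["id", "value"])
    w.writerows([["1", "10"], ["2", ""], ["3", "30"]])

ds.start()                                             # begin the tracked run

with open(ds.input("raw_data.csv"), newline="") as f:  # tracked, versioned input
    rows = [r for r in csv.DictReader(f) if r["value"]]

max_value = max(float(r["value"]) for r in rows)
for r in rows:
    r["value"] = str(float(r["value"]) / max_value)

with open(ds.output("cleaned_data.csv"), "w", newline="") as f:  # tracked output
    w = csv.DictWriter(f, fieldnames=["id", "value"])
    w.writeheader()
    w.writerows(rows)

ds.publish()                                           # end the tracked run
```

Note that ds.input() and ds.output() simply wrap the file paths the script already uses, so the augmentation does not change the script's behaviour, only what Dotscience records about it.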

We see that the following are added:

  • import dotscience as ds: Import the Dotscience Python library
  • ds.start(): Indicates the start of a data run. Everything between this and ds.publish() is in the run. Each run is tracked and versioned.
  • ds.input(): The path to the file to be loaded is wrapped with ds.input() to indicate that this is a dataset to be tracked and versioned.
  • ds.output(): Similarly, indicates that this is a dataset that has been output.
  • ds.publish(): Tells Dotscience that this is the end of the run begun by ds.start().

To run a .py script, use the ds run client, which has the form:

ds run -p PROJECT [-d] [-I IMAGE] [OTHER OPTIONS...] COMMAND...

where -p specifies the project workspace, -d runs the command asynchronously (detached) instead of streaming its output back to the terminal, -I specifies the Docker image if not using the default Dotscience one, and COMMAND is the name of the script to run. Various other options are also available.
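For example, assuming a project named my-project and a script saved as clean_data.py (both names hypothetical), a detached run might be launched with:

```
ds run -p my-project -d python clean_data.py
```

Omitting -d would instead stream the script's output back to the terminal as it runs.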

When the script is run and the resulting project state is accessed, we see the tracked items in the project, including the run version, the datasets, and the provenance graph; here we show the graph for the first of the two runs in the script:

How to instrument a model training script

This is similar to the data engineering example above, but here we show a Jupyter notebook. Note that we do not show the entire notebook, but focus on the sections that are changed by the addition of Dotscience functions. Also, we show only the augmented notebook, with the changes highlighted.

We see that the following are added:

  • import dotscience as ds ; ds.start() ; ds.input() ; ds.output() ; ds.publish() — same as data runs example above
  • ds.parameter(): This is how to track parameters in your run, for example, machine learning model hyperparameters. The parameter is specified in the usual way for the model training invocation, and is then wrapped by this function.
  • ds.summary(): Tracks summary statistics or metrics that describe how the trained model performs, for example, accuracy on the validation dataset.
  • ds.label(): Allows the addition of other arbitrary labels to the run.
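The calls above can be sketched in a toy training script. Everything here is invented for illustration (the "model" is a trivial threshold classifier, and the parameter, metric, and label names are hypothetical); as before, a stub stands in for the library so the sketch runs outside Dotscience.

```python
try:
    import dotscience as ds
except ImportError:
    # Minimal stand-in so the sketch runs without Dotscience installed.
    class _StubDS:
        def start(self): pass
        def publish(self): pass
        def parameter(self, name, value): return value
        def summary(self, name, value): return value
        def label(self, name, value): return value
    ds = _StubDS()

ds.start()

# Hyperparameter, wrapped inline so Dotscience records it.
threshold = ds.parameter("threshold", 0.5)

# Toy "training data" and a trivial threshold classifier.
X = [0.1, 0.4, 0.6, 0.9]
y = [0, 0, 1, 1]
predictions = [1 if x > threshold else 0 for x in X]

# Summary metric computed after the model is made.
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
ds.summary("accuracy", accuracy)

# An arbitrary label attached to the run.
ds.label("model.type", "threshold-classifier")

ds.publish()
```

Because ds.parameter() and ds.summary() return the wrapped value, they can be dropped into an existing training invocation without restructuring it.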

Further ds functions are available, such as ds.model(kind, name, *args, **kwargs), which allows a type and name to be assigned to a model. The first argument is the module itself (here, TensorFlow; the library extracts its version), the second is the model name (any string), the third is the path to the model directory, and finally there is an optional keyword argument (kwarg) classes="classes.json", where classes.json is a map from string class IDs, e.g. "0", "1", etc., to human-readable class names.
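An illustrative call might look as follows (the model name and paths are hypothetical, and this fragment requires TensorFlow and Dotscience, so it is not standalone):

```python
import tensorflow as tf
import dotscience as ds

# Register the model: module (version is extracted), model name,
# model directory, and an optional class-ID-to-name map.
ds.model(tf, "my-model", "model/", classes="classes.json")
```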

Further details

For more details of the Dotscience Python library, see the library's reference documentation.