References

Run Metadata Format

Understand how Dotscience records run metadata

Dotscience run metadata spec

When a group of Dotscience runs completes, a commit is performed on all modified datasets to capture the generated data, and on the workspace dot itself to capture the code that was run and the fact that it was run.

The workspace dot has special Dotmesh commit metadata to mark it as a Dotscience run, and this document specifies the format of that metadata.

Conventions

Any filename or pathname recorded in this metadata is relative to the root of a mounted dot, not the root of the filesystem in the container where the workload was run. Paths use / separators, and do not start with a / as they are all relative paths. They may not contain . or .. as any component of the path.
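These conventions can be checked mechanically. The sketch below is illustrative and not part of the spec; it additionally rejects empty path components (e.g. from a//b), which is an assumption beyond the rules above:

```python
def is_valid_metadata_path(path: str) -> bool:
    # Relative, "/"-separated, and no "." or ".." components. Empty
    # components (e.g. from "a//b") are also rejected; that last rule
    # is an assumption beyond the conventions stated above.
    if not path or path.startswith("/"):
        return False
    return all(part not in ("", ".", "..") for part in path.split("/"))
```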

Version 1

Core data

type = dotscience.run.v1

This marks it as a Dotscience run commit.

author = ID

The ID of the user that submitted the run.

success = true or false

Whether the execution succeeded or not. If not specified, assume that it did.

message = STRING

This is a required field (i.e. it cannot be left blank). For a failed execution, this should contain the error message. During normal execution, this should contain something indicating that Dotscience made this commit.

Workload specification

Jupyter runs

workload.type = jupyter

Marks this as a Jupyter workload.

workload.image = JUPYTER IMAGE

The name of the Docker image used to run Jupyter.

workload.image.hash = IMAGE HASH

The hash of the Docker image used to run Jupyter.

Command runs

workload.type = command

Marks this as a command workload.

workload.image = DOCKER IMAGE NAME

This is the Docker image the workload was executed inside.

workload.image.hash = DOCKER IMAGE HASH

The hash of the Docker image the workload was executed inside.

workload.command = JSON LIST OF STRINGS

The command executed inside the Docker image.

workload.environment = JSON OBJECT MAPPING STRINGS TO STRINGS

The shell environment used when executing the command.

Runner details

runner.name = STRING

The name of the actual runner instance. For instance, a hostname.

runner.version = STRING

The name of the runner software, including a version number. E.g., Dotscience kubernetes runner v1.2.

runner.platform = linux

The host platform the runner ran on. Currently, only Linux is supported.

runner.platform_version = STRING

On Linux runners, the output of the uname -a command.

runner.cpu = JSON LIST OF STRINGS

The CPUs the runner used. On a Linux runner, this should be computed by this command, or an equivalent:

grep 'model name' /proc/cpuinfo | sed 's/^[^:]*: //'
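An equivalent of that pipeline in Python, applied to the text of /proc/cpuinfo (an illustrative sketch, not part of the spec):

```python
def cpu_models(cpuinfo: str) -> list[str]:
    # One entry per "model name" line, with everything up to and including
    # the first ": " stripped, mirroring the sed expression.
    return [line.split(": ", 1)[1]
            for line in cpuinfo.splitlines()
            if line.startswith("model name")]
```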

runner.gpu = JSON LIST OF STRINGS

The GPUs the runner used.

FIXME: Explain how to generate this (Luke knows of a container image that dumps out useful information)

runner.ram = INTEGER

The number of bytes of physical RAM the runner had.

runner.ram.ecc = true or false

If set to true, then the runner used error-correcting RAM. If set to false, it did not. If not set at all, we don’t know.

Execution details

exec.logs = JSON LIST OF FILENAMES

The logs of the workload execution are stored in a subdot of the workspace called dotscience_logs. Their names, relative to the root of the subdot, are stored in this JSON list.

The final part of the filename (after the last /) determines the type of log:

  • workload-stdout.log stores the standard output of the workload
  • workload-stderr.log stores the standard error of the workload
  • Others are logs from parts of the infrastructure.

Recommendation: the runner creates a directory named after the run ID to store the logs, with files named as above stored within it.
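A consumer might classify log entries like this sketch (the string return values are illustrative labels, not part of the spec):

```python
def classify_log(path: str) -> str:
    # Only the final path component (after the last "/") determines the
    # log type; anything unrecognised is an infrastructure log.
    name = path.rsplit("/", 1)[-1]
    if name == "workload-stdout.log":
        return "stdout"
    if name == "workload-stderr.log":
        return "stderr"
    return "infrastructure"
```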

exec.start = YYYYMMDDTHHMMSS.SSS…

The time that execution of the workload started, in UTC.

exec.end = YYYYMMDDTHHMMSS.SSS…

The time that execution of the workload ended, in UTC.

exec.cpu-seconds = FLOAT

The number of CPU-seconds consumed by the workload.

exec.ram = INTEGER

The peak RAM usage of the workload, in bytes.

Datasets

input-dataset.REF = ID@COMMIT

The dataset with the given ID, at version COMMIT, was mounted at the path REF under the current working directory when the workload was executed.

output-dataset.REF = ID@COMMIT

The dataset with the given ID was mounted at the path REF under the current working directory when the workload was executed, and the resulting state of the dataset was committed, resulting in version COMMIT.

Run details

runs = JSON LIST OF STRINGS

A list of the run IDs that were recorded in this commit, in the order in which they happened. Run IDs are arbitrary but globally unique strings; their means of generation is unspecified, but a UUID would be appropriate. Metadata for each run is stored in the following properties:

run.RUN ID.authority = workload, derived, or correction

The authority by which this run metadata is known. If declared directly by the workload itself, it’s set to workload. If the workload did not provide run metadata and the execution engine derived it automatically (for example, by recording access to the filesystem), then it’s set to derived. If the workload declared one or more runs’ metadata, but at the time of the commit being made, the execution engine detected access to files beyond what was declared in workload-authority runs, then a correction run is automatically added to document the undeclared file accesses; the presence of this run in a commit inherently calls the workload-authority runs in that commit into question.

A commit may have either no runs, a single derived run (because the workload emitted no metadata), or one or more workload runs that the workload declared; in the latter case, there may also be, optionally, a single correction run. It is illegal to have more than one correction run, more than one derived run, a mixture of derived and workload runs, or a correction run without any workload runs.
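These legality rules can be expressed as a small validation function (an illustrative sketch, not part of the spec):

```python
from collections import Counter

def authorities_legal(authorities: list[str]) -> bool:
    # Legal combinations per the rules above: no runs at all; exactly one
    # "derived" run on its own; or one or more "workload" runs plus at
    # most one "correction" run.
    c = Counter(authorities)
    if set(c) - {"workload", "derived", "correction"}:
        return False
    if c["derived"]:
        return c["derived"] == 1 and not c["workload"] and not c["correction"]
    if c["correction"]:
        return c["correction"] == 1 and c["workload"] >= 1
    return True
```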

run.RUN ID.description = STRING

An optional description of what happened in this run.

run.RUN ID.workload-file = FILE

The name of the source file inside the workspace dot that executed this run.

run.RUN ID.error = STRING

If this property is not present, the run was deemed successful. If it is present, it indicates that the run failed in some way, and the STRING is an error message explaining how.

run.RUN ID.input-files = JSON LIST

A list of which files in the workspace dot were read in this run. Each element in the JSON list is of the form FILENAME@COMMIT; the FILENAME is the full path from the root of the workspace dot, and COMMIT is the commit of the workspace dot where the file was last written to. As filenames could contain ‘@’ symbols but commit IDs cannot, the string after the final ‘@’ symbol should be considered the commit ID.

Only data files should be listed - source code files or other reference data files that are implicitly read as “part of the workload”, as opposed to explicit input files, need not be listed. The exact distinction between the two in a workspace dot is not necessarily clear; judgement must be applied.
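Since only the final '@' separates the filename from the commit ID, parsing an entry is a reverse partition. A minimal sketch:

```python
def split_file_ref(entry: str) -> tuple[str, str]:
    # Filenames may contain "@" but commit IDs cannot, so split at the
    # final "@" only.
    filename, _, commit = entry.rpartition("@")
    return filename, commit
```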

run.RUN ID.output-files = JSON LIST

A list of which files in the workspace dot were written in this run. Each element in the JSON list is a filename relative to the root of the workspace dot. Unlike run.RUN ID.input-files, these do not include commit IDs, because the changed files are being written to the workspace dot that this metadata is being committed to (so the commit ID is not known at the time the metadata is written).

run.RUN ID.dataset-input-files.REF = JSON LIST

A list of which files in the dataset mounted at REF were read in this run. The format is as for run.RUN ID.input-files: a list of elements of the form FILENAME@COMMIT, where FILENAME is relative to the root of the dataset and COMMIT is the commit of that dataset where the file was last written to.

run.RUN ID.dataset-output-files.REF = JSON LIST

A list of which files in the dataset mounted at REF were written in this run. Each element in the JSON list is a filename relative to the root of the dataset. As with run.RUN ID.output-files, we do not record commit IDs for the dataset output files - they are all recorded in the commit identified by output-dataset.REF = ID@COMMIT; we could duplicate that commit ID into every entry in this list, but it would be redundant and inconsistent with run.RUN ID.output-files.

run.RUN ID.label.KEY = VALUE

Arbitrary key=value labels for this run.

Build artefacts: run.RUN ID.label.artefact:NAME = JSON OBJECT

If some of the outputs of this run are independently packageable built “artefacts” that could be deployed into some environment, they can be labelled as such in order to enable deployment automation. For example, if your run produces a machine learning model, you can label the model file(s) as an artefact so it could be deployed into production.

The JSON object should have the following fields:

type = tensorflow-model

Currently, tensorflow models are the only type understood by the system, but others will be added in future.

files = JSON OBJECT

This field lists all the files that comprise the built artefact. The keys of the JSON object depend on the type of model, and the values are paths relative to the workspace root, which can refer either to files or to entire subdirectories.

  • The files may be in the workspace, or in a dataset; the list of dataset mount prefixes in the run metadata should be consulted to locate them.
  • All files referenced in a build artefact that were created by this run should also be listed as outputs using the appropriate run.RUN ID.output-files or run.RUN ID.dataset-output-files.REF fields. However, not every file listed in an artefact needs to appear as an output: some may already be present in the relevant dots and were not generated by this run, yet are still part of the model - for instance, static configuration files, or files generated by previous runs that this run simply did not change.

For tensorflow models, the keys in the JSON object are:

  • model, referring to the main model files.
  • classes, referring to the classes.csv file.

Other metadata, depending on the artefact type

The artefact type may declare arbitrary other keys.

For tensorflow models, a version field is expected, whose contents are a string containing the Tensorflow version.

Example

A label declaring a tensorflow model called “roadsigns” might look like this:

run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.label.artefact:roadsigns={"type":"tensorflow-model","files":{"model":"output","classes":"classes.csv"},"version":"1.14.0"}
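Parsing such a label back out of commit metadata might look like this sketch (the variable names are illustrative; the key and value are the example above):

```python
import json

# The example label above, as it would appear in commit metadata.
key = "run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.label.artefact:roadsigns"
value = '{"type":"tensorflow-model","files":{"model":"output","classes":"classes.csv"},"version":"1.14.0"}'

# The artefact name is everything after the "label.artefact:" marker.
name = key.split("label.artefact:", 1)[1]
artefact = json.loads(value)
```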

run.RUN ID.summary.KEY = VALUE

Summary statistics, recording the quantitative success of the run, used to drive the leaderboard UI.

run.RUN ID.parameters.KEY = VALUE

Records an input parameter to the run.

run.RUN ID.start = YYYYMMDDTHHMMSS.SSS…

The time that execution of the run started, in UTC.

run.RUN ID.end = YYYYMMDDTHHMMSS.SSS…

The time that execution of the run ended, in UTC.

Dataset commits

Any datasets modified by runs in the commit must also be committed, before the workspace dot is committed; the resulting dataset commit IDs are recorded in the output-dataset.REF metadata keys.

The following metadata must be attached to the dataset commits referenced by a version 1 run metadata record:

type = dotscience.run-output.v1

This marks that the commit is the result of a Dotscience run that was recorded using version 1 of the metadata format.

workspace = ID OF WORKSPACE DOT

run.RUN ID.dataset-output-files = JSON LIST OF FILES

This records that the files in the list were modified as part of the run with the given ID. Only files in this dataset are listed. The entries in the JSON list are paths to the files relative to the root of the dataset. The RUN ID must match a RUN ID specified in the metadata of the workspace dot commit that refers to this dataset commit.

Appendix 1: Workload run metadata format (Version 1)

The workload may output metadata which is incorporated into the metadata commit. This appendix defines the format of a run metadata record output by the workload.

Basic structure

Each run to be included in the commit is recorded by the workload as a single JSON document, encoded in UTF-8, which must be output to its standard output stream (for a command workload) or into the notebook (for a Jupyter workload). It uses the following structure, which must come immediately after a newline (defined as a Unix-style line feed character, or a Windows-style carriage return then line feed pair):

PREFIX[[DOTSCIENCE-RUN:RUN ID]]JSON[[/DOTSCIENCE-RUN:RUN ID]]

Or, when necessary, the following structure:

PREFIX[[DOTSCIENCE-RUN-BASE64:RUN ID]]JSON ENCODED IN BASE64[[/DOTSCIENCE-RUN-BASE64:RUN ID]]

If newlines (be they Unix or Windows-style) occur within the JSON or JSON ENCODED IN BASE64 sections, and the string PREFIX follows that newline, then the entire “newline+PREFIX” group is considered a single newline. This allows the correct embedding of run metadata in output streams that automatically prefix every line, or the use of line comment prefixes (such as # or // in many languages) to escape them from other processing.

No extra whitespace is allowed in the [[...]] headers and footers, as they are matched exactly on a byte-for-byte basis.

Implementations are encouraged to use newlines and other non-significant whitespace, as allowed within JSON, to make the JSON in the first form easy for humans to read.

Implementations using the first form are responsible for choosing a RUN ID such that the string [[/DOTSCIENCE-RUN:RUN ID]] does not occur inside the JSON.
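A minimal emitter for the first form might look like this sketch (the function name and the choice of indentation are illustrative, not part of the spec):

```python
import json
import uuid

def emit_run_metadata(metadata: dict, prefix: str = "") -> str:
    # A fresh UUID as the run ID makes a collision between the JSON body
    # and the closing [[/DOTSCIENCE-RUN:...]] marker practically impossible.
    run_id = str(uuid.uuid4())
    body = json.dumps(metadata, indent=1)
    text = f"[[DOTSCIENCE-RUN:{run_id}]]\n{body}\n[[/DOTSCIENCE-RUN:{run_id}]]"
    # Prefix every line, so consumers can treat "newline+PREFIX" as a
    # newline; the leading newline satisfies the requirement that the
    # header come immediately after a newline.
    return "\n" + "\n".join(prefix + line for line in text.split("\n"))
```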

Content of the metadata JSON

The metadata JSON must be a JSON object with the following keys:

version = 1

This declares this metadata to use version 1 of the metadata specification.

error = STRING

If this property is not present, the run was deemed successful. If it is present, it indicates that the run failed in some way, and the STRING is an error message explaining how.

description = STRING

An optional description of the run.

workload-file = STRING

An optional declaration of the source file that executed this run, relative to the workspace dot. The system will attempt to deduce it if missing.

input = JSON LIST

A list of filenames that were read by this run, relative to the workspace dot; however, if the first component of the path is the REF of a dataset, then the file comes from within that dataset rather than the workspace dot.

output = JSON LIST

A list of filenames that were written by this run, relative to the workspace dot; however, if the first component of the path is the REF of a dataset, then the file was written within that dataset rather than the workspace dot.

labels = JSON OBJECT

An object mapping string label names to label value strings, storing arbitrary key=value labels for this run.

summary = JSON OBJECT

An object mapping string summary-statistic names to value strings, recording the quantitative success of the run.

parameters = JSON OBJECT

An object mapping string parameter names to value strings, recording input parameters to the run.

start = YYYYMMDDTHHMMSS.SSS…

The time that execution of the run started, in UTC.

end = YYYYMMDDTHHMMSS.SSS…

The time that execution of the run ended, in UTC.

Appendix 2: Example (Version 1)

This example is non-normative, meaning that if there’s a discrepancy between it and the specifications above, then the example is wrong and the specification is correct.

Command run

The workspace dot is called A. The user requests to run a command that reads from dataset B (with a REF of b), modifies (reads and writes back to) dataset C (with a REF of c), and writes to a dataset D (with a REF of d), as well as interacting with some data files in the workspace dot.

This results in commits to the workspace dot A, as well as datasets C and D; there is no commit on dataset B as it was only read from.

Metadata output by the workload

Note that this metadata was written in the non-base64 style, with a prefix of #. Two runs occurred, which read and wrote the same files; they appear to have run the same code (producing the same description), but with a different input parameter, resulting in different summary statistics.

 # [[DOTSCIENCE-RUN:02ecdc67-c49e-4d76-abe8-1ee13f2884b7]]
 # {
 #  "version": "1",
 #  "description": "Curve fit",
 #  "input": ["foo.csv", "b/input.csv", "c/cache.sqlite"],
 #  "output": ["log.txt", "c/cache.sqlite", "d/output.csv"],
 #  "labels": {},
 #  "parameters": {"smoothing": "1.0"},
 #  "summary": {"rms_error": "0.057"},
 #  "start": "20181004T130607.225",
 #  "end": "20181004T130608.225"
 # }
 # [[/DOTSCIENCE-RUN:02ecdc67-c49e-4d76-abe8-1ee13f2884b7]]
 # [[DOTSCIENCE-RUN:cd351be8-3ba9-4c5e-ad26-429d6d6033de]]
 # {
 #  "version": "1",
 #  "description": "Curve fit",
 #  "input": ["foo.csv", "b/input.csv", "c/cache.sqlite"],
 #  "output": ["log.txt", "c/cache.sqlite", "d/output.csv"],
 #  "labels": {},
 #  "parameters": {"smoothing": "2.0"},
 #  "summary": {"rms_error": "0.123"},
 #  "start": "20181004T130608.579",
 #  "end": "20181004T130609.579"
 # }
 # [[/DOTSCIENCE-RUN:cd351be8-3ba9-4c5e-ad26-429d6d6033de]]

Commit created on A (workspace dot)

type = dotscience.run.v1
author = 452342
date = 1538658370073482093
workload.type = command
workload.image = busybox
workload.image.hash = sha256:2a03a6059f21e150ae84b0973863609494aad70f0a80eaeb64bddd8d92465812
workload.command = ["sh","-c","curl http://localhost/testjob.sh | /bin/sh"]
workload.environment = {"DEBUG_MODE": "YES"}
runner.version = Runner=Dotscience Docker Executor rev. 63db3d0 Agent=Dotscience Agent rev. b1acc85
runner.name = bob
runner.platform = linux
runner.platform_version = Linux a1bc10a2fb6e 4.14.60 #1-NixOS SMP Fri Aug 3 05:50:45 UTC 2018 x86_64 GNU/Linux
runner.ram = 16579702784
runner.cpu = ["Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz", "Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz", "Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz", "Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz"]
exec.start = 20181004T130607.101
exec.end = 20181004T130610.223
exec.logs = ["16204868-ae5a-4574-907b-8d4774aad497/agent-stdout.log","16204868-ae5a-4574-907b-8d4774aad497/pull-workload-stdout.log","16204868-ae5a-4574-907b-8d4774aad497/workload-stdout.log"]
input-dataset.b = <ID of dot B>@<commit ID of dot B before the run>
input-dataset.c = <ID of dot C>@<commit ID of dot C before the run>
output-dataset.c = <ID of dot C>@<commit ID of dot C created by this run>
output-dataset.d = <ID of dot D>@<commit ID of dot D created by this run>
runs = ["02ecdc67-c49e-4d76-abe8-1ee13f2884b7", "cd351be8-3ba9-4c5e-ad26-429d6d6033de", "31df506d-c715-4159-99fd-60bb845d4dec"]
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.authority = workload
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.input-files = ["foo.csv@<some earlier commit ID of workspace dot>"]
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.dataset-input-files.b = ["input.csv@<some earlier commit ID of b>"]
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.dataset-input-files.c = ["cache.sqlite@<some earlier commit ID of c>"]
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.output-files = ["log.txt"]
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.dataset-output-files.c = ["cache.sqlite"]
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.dataset-output-files.d = ["output.csv"]
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.summary.rms_error = 0.057
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.parameters.smoothing = 1.0
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.start = 20181004T130607.225
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.end = 20181004T130608.225
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.authority = workload
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.input-files = ["foo.csv@<some earlier commit ID of workspace dot>"]
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.dataset-input-files.b = ["input.csv@<some earlier commit ID of b>"]
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.dataset-input-files.c = ["cache.sqlite@<some earlier commit ID of c>"]
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.output-files = ["log.txt"]
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.dataset-output-files.c = ["cache.sqlite"]
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.dataset-output-files.d = ["output.csv"]
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.summary.rms_error = 0.123
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.parameters.smoothing = 2.0
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.start = 20181004T130608.579
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.end = 20181004T130609.579
run.31df506d-c715-4159-99fd-60bb845d4dec.authority = correction
run.31df506d-c715-4159-99fd-60bb845d4dec.description = File changes were detected that the run metadata did not explain
run.31df506d-c715-4159-99fd-60bb845d4dec.output-files = ["mylibrary.pyc"]

No commit is created on B

As B was only used as an input, nothing in it was changed, so there is no commit. However, the version of B that was used is still recorded in the workspace dot commit above.

Commit created on C

The ID of this commit is recorded in output-dataset.c in the workspace dot commit.

type = dotscience.run-output.v1
workspace = <ID of dot A>
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.dataset-output-files = ["cache.sqlite"]
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.dataset-output-files = ["cache.sqlite"]

Commit created on D

The ID of this commit is recorded in output-dataset.d in the workspace dot commit.

type = dotscience.run-output.v1
workspace = <ID of dot A>
run.02ecdc67-c49e-4d76-abe8-1ee13f2884b7.dataset-output-files = ["output.csv"]
run.cd351be8-3ba9-4c5e-ad26-429d6d6033de.dataset-output-files = ["output.csv"]

Appendix 3: Implementation notes

This is a non-normative appendix to the specification.

Merging run metadata from the workload with automatically recorded run metadata

The workload may output metadata declaring runs and what files were read/written, and the execution engine may also monitor the runtime environment to observe the workload’s behaviour. The two may or may not tally.

There are three interesting cases:

  • A run happens and outputs no metadata because it’s not been annotated; we need to look at the observed behaviour and synthesise a single run from what we can observe, with the run’s authority property set to derived.
  • A fully instrumented run happens and clearly records everything it did in its metadata prints and we’re happy because it corresponds exactly to what we observed. We record the run metadata provided by the workload, with the run’s authority properties set to workload.
  • As per the previous case, except it missed a few things and we noticed some extra accesses to files. We record the run metadata provided by the workload, with the run’s authority properties set to workload, then create an extra run (with a new UUID), added to the end of the run list, listing all the otherwise unaccounted-for accesses, with authority set to correction.

Tracing the provenance of a file

We want to know how a file in a dot came to be. The dot may be a dataset, or a workspace dot. In either case, we can read back through the commits on that dot to find the most recent commit (not including commits AFTER the commit containing the version of the file we are tracing the provenance of) containing metadata recording a write to that file. This will give us the ID of the run that created that file.

We must now find the workspace dot commit containing that run. If the file was in a workspace dot, we already have it; if it was in a dataset, we need to read the workspace property of the commit to find the ID of the workspace dot, and walk its commit history to find the run (this should be cached in an index somewhere!).

Given the workspace dot commit and the run ID, we can extract the full metadata of the run - including the commit IDs of all dots that went into it, and the lists of files read from them.

These can then be recursively examined using this algorithm to find their provenance, until the trail runs dry; at that point, we have extracted the entire provenance tree of that file.
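The first step of this algorithm, walking a dot's history to find the run that last wrote a file, can be sketched against an in-memory stand-in for the commit metadata (the data layout here is illustrative, not the real Dotmesh API):

```python
def find_writing_run(history, path, before_commit):
    # history: list of (commit_id, {run_id: [files written]}) pairs,
    # newest first; an in-memory stand-in for reading commit metadata
    # off a dot.
    started = False
    for commit_id, writes in history:
        if commit_id == before_commit:
            started = True
        if not started:
            continue  # skip commits AFTER the version being traced
        for run_id, files in writes.items():
            if path in files:
                return commit_id, run_id
    return None
```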