References

The ds command-line tool

Use Dotscience from the command line

ds command-line tool user manual.

Hi there! Welcome to the user manual for the ds command-line tool, which lets you drive Dotscience from your shell or automate it using scripting.

Installing ds

To install the latest ds cli, run the following:

sudo curl -sSL -o /usr/local/bin/ds \
  https://get.dotmesh.io/$(uname -s)/ds

# Make the client binary executable
sudo chmod +x /usr/local/bin/ds

Introduction and Concepts.

The Dotscience Service is the central hub that controls Dotscience activities in your organisation. It acts as a manager and a store for the canonical versions of your code and data, and stores a history of your activities.

Dotscience Runners are the computers that actually run your workloads. Your administrator sets them up, and they connect to the Dotscience Service to ask for work to do and the data to do it on; and they submit results back.

The ds command-line tool is what you run on the computer you’re sitting in front of, to control the Dotscience Service (and, indirectly, the Runners).

As such, the ds tool needs to know your Dotscience login details.

Authentication: ds login USERNAME.

Before running any of the interesting ds commands, you need to authenticate by running ds login with your username. It’ll prompt you for your password:

$ ds login alice
Password: Type your password here, it won't be echoed

Authenticated successfully.

Running a job: ds run -p PROJECT [-d] [-I IMAGE] [OTHER OPTIONS...] COMMAND....

This command runs a job. It runs the given COMMAND in the Docker image called IMAGE, with the workspace of PROJECT mounted as the current working directory. If IMAGE is not given, then quay.io/dotmesh/dotscience-python3:latest, a copy of python:3 with the Dotscience Python library pre-installed, is used.

Unless you specify the -d option, the ds run command will stream the output of your job back to your screen as it runs. If you hit Ctrl+C, it will terminate the job. At the end of the job, the resulting job metadata will be displayed on your screen. The use of -d will be explained below in the Asynchronous mode section.
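
For example, a minimal synchronous run might look like this (my-project and train.py are purely illustrative names; train.py would need to exist in the project workspace, for instance after uploading it with -u as described below):

$ ds run -p my-project python train.py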

There are several other options not listed above, which we will explore below:

Uploading local files to the project workspace: -u DIRECTORY.

If you are editing your scripts locally, you can ask ds run to upload them into your project workspace for you, so you don’t have to upload them manually through the Dotscience web interface. This is accomplished with the -u DIRECTORY flag, which causes all files in the directory named by DIRECTORY (or its subdirectories) to be uploaded into the workspace before execution of the job. -u . uploads the current directory, which is often useful (and easy to type!)

By default, the entire contents of the directory DIRECTORY (including subdirectories) are synchronised into the workspace, so that it can be sent to the runner for execution and then committed, recording the exact code that was executed in the audit trail. The exception is files whose names start with a . character, which are skipped because by convention they are “hidden files”.

However, this can be overridden by using .dotscience-ignore files. If one of those is found in DIRECTORY or any of its subdirectories, then each line of the file (except blank lines or lines that start with #) is interpreted as a pattern. Any files matching that pattern in that directory or those beneath it will be ignored.

The patterns are Unix-style “globs”, which means that the following symbols in them have special meaning:

  • * matches any number of arbitrary characters
  • ? matches any single character
  • {A,B,...} matches any of the sub-patterns A, B, etc.
  • \X matches the character X exactly (even if it’s a special character such as *)
  • [a-zA-Z0-9\-_] matches any of the characters from a to z, or from A to Z, or from 0 to 9, or a hyphen (note the use of \- to mean an actual hyphen rather than indicating a range of characters), or _.
  • [!a-z] matches any character OTHER THAN those from a to z.

In addition, it’s possible to cancel an earlier ignore pattern by putting it in a .dotscience-ignore file, but prefixed with a - symbol. For instance, to enable uploading files whose names start with a ., add the line -.*
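
As a purely illustrative sketch (the file names and patterns are examples, not recommendations), a .dotscience-ignore placed at the top of the uploaded directory might look like this:

# skip compiled Python files and temporary files
*.pyc
*.tmp
# skip local model checkpoints, whatever their suffix
model-checkpoint-*
# re-enable uploading of hidden files such as .gitignore
-.*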

Specifying environment variables: -e NAME and -e NAME=VALUE.

You can specify shell environment variables to be in effect when running the command.

  • -e NAME=VALUE sets the environment variable NAME to the given VALUE.
  • -e NAME sets the environment variable NAME to whatever value the environment variable NAME has on the computer where you’re running ds, in effect passing it through.
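
For example (the variable, project, and script names here are hypothetical), you could set one variable explicitly and pass another through from your local environment:

$ ds run -p my-project -e LEARNING_RATE=0.01 -e AWS_REGION python train.py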

Git integration: -R REPO and -b REF.

If the code or data you wish to work with originates in a Git repo, you can ask Dotscience to fetch it into your workspace before running the job. Pass the git URL (any URL you could pass to git clone) in as the REPO and it will be checked out into a subdirectory of the workspace, either named after the repo or called code if the system can’t deduce that from the repo URL.

By default, it will check out master; specifying a REF (which can be a branch name, a tag name, or an arbitrary commit hash) will cause it to check out that ref instead.

The first time you use this in a workspace, it will git clone the repo into your workspace. The checkout is saved into the workspace, so subsequent runs will find it there; but it won’t be updated with later changes in the Git repo unless you specify -R (and -b, if appropriate) again.

If you have an SSH key associated with this project, then that SSH key will be available for Git to use to access your repo over SSH (git@HOST:PATH style Git URLs). See Managing Secrets below.
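
Putting those flags together, a sketch of a run might look like the following (the repository URL, branch name, and script path are hypothetical; the SSH-style URL assumes you have a matching SSH key secret, as described under Managing Secrets below, and the repo checks out into a subdirectory named after it):

$ ds run -p my-project -R git@github.com:example/my-models.git -b experiment-1 \
  python my-models/train.py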

Specifying runner labels: -r NAME and -r NAME=VALUE.

By default, the job will be run by the first free runner that picks it up. However, by passing runner labels through, you can make sure that only runners satisfying some requirement will try to run your job, or provide configuration for the runner that picks it up. The meanings of the NAMEs and VALUEs depend on the runners your Dotscience account is configured with - consult your administrator for details.

  • -r NAME=VALUE sets the runner label NAME to the given VALUE.
  • -r NAME sets the runner label NAME to whatever value the environment variable NAME has on the computer where you’re running ds.
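
For instance, if your administrator has set up runners with a label such as gpu (an illustrative name; the labels available depend entirely on your runner configuration), you could target them like this:

$ ds run -p my-project -r gpu=true python train.py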

Asynchronous mode: -d.

If you pass the -d flag, the ds run command will output a task ID once execution has been scheduled, then return you to your prompt rather than waiting for the job to actually run.

This task ID can then be used with the commands listed in the next section, Task Management Commands.
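
Combined with the log-following command described under Getting task logs below, a typical asynchronous workflow might look like this (the project and script names are hypothetical, and TASK_ID stands in for whatever ID ds run prints):

$ ds run -d -p my-project python train.py
$ ds task ls
$ ds logs -f TASK_ID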

Task Management Commands.

Jobs started with ds run - be they asynchronous (-d) or not - as well as JupyterLab instances started through the Dotscience web interface all count as “tasks”, and can be managed with ds commands.

Listing tasks: ds task ls.

The ds task ls command lists all the tasks known to your account. The columns in the output are ID (the task ID), STATUS (whether it’s running or in some other state), WORKLOAD (jupyter for JupyterLab or command for ds run), and AGE (when the task was created).

Note that terminated tasks remain in the list for several days before being removed, unless manually removed with ds task rm.

Examining task details: ds task inspect ID.

ds task inspect ID returns all the details of a task, as JSON.

Terminating a task: ds task stop ID.

ds task stop ID requests that a running task be stopped. It might take a while for the task to finish cleanly - any pending changes need to be uploaded back to the Hub, so watch the task’s status to find out when it’s completed!

Deleting a task: ds task rm ID.

ds task rm ID removes a terminated task from the list. It’s only really useful if you want to clean up the list, when you’re sure you won’t want any further information about the task.

Getting task logs: ds logs [-v] [-f] ID.

ds logs ID prints out all the stored logs for the task.

If -f is given, then once it has printed out all stored logs, if the task is still running, it will continue to print out logs as they happen, until the task terminates.

If -v is given, then all log entries will be printed; otherwise, only log output from your workload will be printed. -v mode is mainly useful for diagnosing infrastructure issues. In -v mode, every log entry is prefixed with a header consisting of the timestamp, the log entry type (omitted for workload standard output), a colon, then the line; without -v, workload standard output lines are emitted as-is, while workload standard error lines have ERROR: prepended to them.

Managing projects.

Getting a list of projects: ds project ls [-q].

ds project ls lists all the projects in your account, including their ID and name. It’s quite possible to have multiple projects with the same name - if a project is in your account because it was shared with you and then you made a fork of it, for instance, you will have exactly that situation! Therefore, the IDs are often necessary to precisely specify which project you are referring to in commands that take a project name or ID.

This view will also show the datasets associated with each project and the path at which they are mounted.

If you pass the -q flag, then rather than a table listing the project ID, NAME, workspace DOTS, COLLABORATORS and AGE columns, just the project names are listed (with no table heading).

Creating a project: ds project create.

This command creates a new project, and prints its ID out to the console.

Describe a project: ds project inspect PROJECT.

ds project inspect prints out the full details of a project, in JSON format.

Deleting a project: ds project rm PROJECT.

This command deletes a project. Don’t run it by accident, as there’s no undelete!

List files in a project: ds file ls [-q] PROJECT.

To list the files in a project’s workspace, use ds file ls PROJECT. The result is a table with the file’s NAME, SIZE, and LAST MODIFIED timestamp; but if you specify -q it just lists the names.

This command links a dataset to a project at PATH, so that when tasks are launched, the dataset is mounted at ./<PATH>. For example, if you used the path s3, the contents of the dataset would be accessible from within the task under ./s3/.

This command unlinks a dataset from a project so that it no longer shows up as a mount when tasks are launched.

Managing datasets

Getting a list of datasets: ds dataset ls [-q].

ds dataset ls lists all the datasets in your account, including their ID and name. It will also show the sync status of each dataset, meaning - if it is an AWS S3 dataset - how much of the S3 bucket has been downloaded to Dotscience.

If you pass the -q flag, then rather than a table listing the dataset ID, NAME, etc columns, just the dataset names are listed (with no table heading).

Creating a dataset: ds dataset s3 create --name NAME --access-key-id AWS_ACCESS_KEY --secret-access-key AWS_SECRET_ACCESS_KEY --bucket BUCKET_NAME.

This command creates a new S3 dataset and prints its ID to the console.
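
For example (the dataset name, bucket, and credentials below are placeholders for your own values):

$ ds dataset s3 create --name sales-data \
  --access-key-id YOUR_AWS_ACCESS_KEY_ID \
  --secret-access-key YOUR_AWS_SECRET_ACCESS_KEY \
  --bucket my-sales-bucket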

Describe a dataset: ds dataset inspect DATASET.

ds dataset inspect prints out the full details of a dataset, in JSON format.

Deleting a dataset: ds dataset rm DATASET.

This command deletes a dataset. Don’t run it by accident, as there’s no undelete!

Managing runs.

Run listings: ds runs ls --summary-type=SUMMARY PROJECT.

You can get a list of runs that have happened in a project with ds runs ls PROJECT (the project name or ID both work, and you’ll need to use the ID if you have multiple projects with the same name). Only runs with a summary statistic of type SUMMARY are listed, and the value of that statistic is shown for every run in the SCORE column; if you omit --summary-type the system will pick one arbitrarily, which is just what you want if you only have one summary statistic in your project.
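
For example, to list the runs in a project scored by a summary statistic called accuracy (a hypothetical statistic name; use whatever your runs actually record):

$ ds runs ls --summary-type=accuracy my-project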

Details of a run: ds runs inspect --project PROJECT RUN.

You can view the full details of a run in a given project with ds runs inspect. The output is JSON, in the Run Metadata Format.

Managing runners.

List runners: ds runner ls [-q].

This command gets a list of all the runners attached to your account. The output is a table listing the ID, NAME, how many RUNNING TASKS it has, its STATUS, its TYPE (CPU or GPU), and its AGE; but if you specify -q then just the names are listed, with no table heading.

Details of a runner: ds runner inspect RUNNER.

ds runner inspect RUNNER returns the full details of the runner in JSON format, including a list of tasks the runner has run or is still running.

Create a runner: ds runner create [-q] [-d DESCRIPTION] [-s SIZE] [-t cpu|gpu-nvidia-runtime] [NAME].

This command creates a new runner.

All the fields are optional: if no NAME is given, one will be picked. If no -t type is specified, cpu will be assumed. If no SIZE is given for the runner’s storage size in gigabytes, 10 will be assumed.

The output will look something like this:

Runner ID:        c871506a-93bb-49df-9ba7-6dca75a476bf
Runner name:      791d4f4a-untitled
Runner type:      cpu
Runner API token: PLBMTXVP3RLLTUFOEN3FOBCYDSPHL7BSLDZXVGXX3AAGZTZBBLJQ====

To start a runner, run:

docker pull quay.io/dotmesh/dotscience-runner:0.5.0 && \
docker run --name dotscience-runner -d -e TOKEN=PLBMTXVP3RLLTUFOEN3FOBCYDSPHL7BSLDZXVGXX3AAGZTZBBLJQ==== \
--restart always -v /var/run/docker.sock:/var/run/docker.sock \
-v dotscience-task-spool:/spool \
quay.io/dotmesh/dotscience-runner:0.5.0 ds-runner run --addr stage.dotscience.net:8800

The latter part is the command that will bring the runner up on any Docker installation. If you use the -q flag, then that’s all that is output; the metadata and instructional text are suppressed.

Remove a runner: ds runner rm RUNNER.

This command gets rid of a runner.

Managing secrets.

List secrets: ds secret ls [-q].

This will list all the secrets known to your account. It outputs a table listing the ID, NAME, TYPE, PROJECT and AGE columns, unless you specify -q, in which case you just get the names.

List details of a secret: ds secret inspect SECRET.

This prints out the details of a secret, in JSON form. The data field, for an SSH key, will be base64-encoded JSON containing public_key and private_key fields; they can be extracted using jq like so to get values that can be used with a normal ssh client:

$ ds secret inspect SECRET | jq -r .[0].data | base64 -d | jq -r .public_key | base64 -d
$ ds secret inspect SECRET | jq -r .[0].data | base64 -d | jq -r .private_key | base64 -d

Generate a secret: ds secret generate --project PROJECT --name NAME.

This will generate an SSH keypair associated with PROJECT, called NAME, and will output the public part of the keypair.
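
For example (the project and key names are illustrative):

$ ds secret generate --project my-project --name github-deploy-key

The public key it prints can then be registered with your Git hosting provider, so that runs using -R can clone private repositories over SSH (see Git integration above).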

Removing a secret: ds secret rm SECRET.

You can delete a secret from your account with ds secret rm SECRET.

Other Commands.

These aren’t commands you’ll need very often, but here they are for completeness!

Online help: ds help.

You can obtain online help on a command by typing ds help COMMAND or ds COMMAND --help:

$ ds help version
Display the version of the Dotscience command-line tool

Usage:
  ds version [flags]

Flags:
  -h, --help   help for version

Checking the version of the ds tool: ds version.

$ ds version
Dotscience command-line tool version: 0.1

Connecting to different Dotscience hubs: ds set server-url.

By default, ds connects to the Dotscience SaaS server, but if you have a dedicated instance of your own, use ds set server-url to tell ds to connect to it. Set it to the scheme and domain part of the base URL you use for logging into Dotscience, such as https://cloud.dotscience.com; do not include anything after the domain name!
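
As a sketch, assuming the URL is passed as the final argument (and with a made-up dedicated-instance URL; check the output of ds help for the exact syntax on your installation), that might look like:

$ ds set server-url https://dotscience.example.com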