Dotscience Security Features & Characteristics

Overview

Dotscience is secure by design. This section gives an overview of the security features and characteristics of the product.

For context, see the Architecture section.

The following facts relate to both the SaaS and AWS deployment modes.

Data in transit

Encryption of data in transit depends on the deployment mode; see Deployment modes below for details.

Data at rest

User passwords are salted and hashed using the Golang scrypt library before being stored in the database.

Secrets, including S3 credentials for S3 datasets, are encrypted with 256-bit AES-GCM using Golang’s crypto/aes package.
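As an illustration of what 256-bit AES-GCM encryption of a secret can look like with Go's standard library, here is a minimal sketch. It is not the Dotscience implementation: the function names encryptSecret and decryptSecret, the choice to prepend the nonce to the ciphertext, and the randomly generated demo key are all assumptions made for this example.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 32-byte (256-bit) key.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("key must be 32 bytes for AES-256")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encryptSecret (hypothetical name) seals plaintext with AES-256-GCM.
// A fresh random nonce is generated per call and prepended to the result.
func encryptSecret(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends ciphertext and authentication tag to the nonce.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptSecret (hypothetical name) reverses encryptSecret and fails if
// the ciphertext or its authentication tag has been modified.
func decryptSecret(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	// Demo key only; a real deployment would load its key from secure storage.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := encryptSecret(key, []byte("example-s3-secret"))
	if err != nil {
		panic(err)
	}
	plain, err := decryptSecret(key, sealed)
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", string(plain) == "example-s3-secret")
}
```

Because GCM is an authenticated mode, decryption returns an error rather than corrupted plaintext when the stored value has been tampered with.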

Disk encryption depends on the deployment mode; see Deployment modes below for details.

Authentication and access control

Access to the Dotscience Hub web UI (and the underlying REST API) is controlled via user accounts, which are authenticated using salted, hashed passwords stored in the Hub database. Once authenticated, the user is granted a time-limited session token (JWT) that is used to make API requests. The token is valid for 24 hours and must be renewed to keep making requests.
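A client that holds such a session token may want to know when it is about to expire. The following sketch reads the expiry time from a JWT payload and decides whether to renew. It assumes the token carries the standard "exp" claim, which this page does not confirm, and it does not verify the signature (that remains the Hub's job). The names tokenExpiry, needsRenewal and sampleToken are hypothetical.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// tokenExpiry reads the "exp" claim from a JWT payload.
// It does NOT verify the signature; it only inspects the claims.
func tokenExpiry(token string) (time.Time, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return time.Time{}, errors.New("not a three-part JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return time.Time{}, err
	}
	var claims struct {
		Exp *float64 `json:"exp"` // seconds since the Unix epoch
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return time.Time{}, err
	}
	if claims.Exp == nil {
		return time.Time{}, errors.New("token has no exp claim")
	}
	return time.Unix(int64(*claims.Exp), 0), nil
}

// needsRenewal reports whether the token expires within margin of now.
func needsRenewal(token string, now time.Time, margin time.Duration) (bool, error) {
	exp, err := tokenExpiry(token)
	if err != nil {
		return false, err
	}
	return !now.Add(margin).Before(exp), nil
}

// sampleToken builds an unsigned, JWT-shaped string for demonstration only.
func sampleToken(expUnix int64) string {
	enc := base64.RawURLEncoding
	header := enc.EncodeToString([]byte(`{"alg":"none","typ":"JWT"}`))
	payload := enc.EncodeToString([]byte(fmt.Sprintf(`{"exp":%d}`, expUnix)))
	return header + "." + payload + ".signature"
}

func main() {
	now := time.Now()
	// Simulate a token issued just now with the 24-hour validity described above.
	token := sampleToken(now.Add(24 * time.Hour).Unix())
	renew, err := needsRenewal(token, now, 5*time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println("renew now:", renew)
}
```

Renewing a little before the expiry time, rather than waiting for a rejected request, avoids failures on long-running operations.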

The Runner Docker container that you start when connecting a runner is given a 256-bit cryptographic token. The Hub issues this token when the runner metadata is created, and requires it whenever the Runner authenticates itself to the Hub.
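To make the idea of a 256-bit cryptographic token concrete, the sketch below generates 32 random bytes from the operating system's secure random source and compares a presented token against a stored one in constant time. The hex encoding and the names newRunnerToken and tokensMatch are assumptions for illustration; this page does not describe how the Hub actually encodes or checks its tokens.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newRunnerToken (hypothetical name) returns 32 random bytes (256 bits),
// hex-encoded as a 64-character string.
func newRunnerToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// tokensMatch (hypothetical name) compares two tokens in constant time,
// so the comparison does not leak how many leading characters matched.
func tokensMatch(presented, stored string) bool {
	return subtle.ConstantTimeCompare([]byte(presented), []byte(stored)) == 1
}

func main() {
	token, err := newRunnerToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("token length in hex characters:", len(token))
	fmt.Println("matches itself:", tokensMatch(token, token))
}
```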

The Dotmesh protocol used by the Runner to transfer bulk data to and from the Hub is authenticated using a user-specific API key granted to the Runner when the workload is started.

When JupyterLab is started and a tunnel is used to allow the user to access the JupyterLab web interface running on the Runner in their browser, a session key is generated for that JupyterLab instance and communicated back to the user’s browser via the Hub REST API, so only that user can access that JupyterLab session.

Audit trails

There is a table in the Hub’s PostgreSQL database which records actions performed by each user.

Backups

Hub backups depend on the deployment mode; see Deployment modes below for details.

Data science work done on runners is backed up automatically to the Hub every time the user runs ds.publish().

Runner isolation

Tasks run on the runners (e.g. JupyterLab or ds run tasks) are effectively given root on the runner. Specifically, the user inside the task container is root, and there is a /host bind-mount from the container to / on the host, which is useful for importing data from the runner’s host filesystem and for debugging.

The security model is to use VM or server isolation as the security boundary between users – mutually untrusted users should be given different runners.

Deployment modes

SaaS

When Dotscience is deployed as SaaS, the Dotscience Hub is run by Dotscience. Runners can either be provided by us, or you can attach your own.

In this configuration:

The Hub provides TLS-encrypted endpoints for the web interface (https://cloud.dotscience.com) that your browser and the ds CLI connect to, for the gRPC endpoints (cloud.dotscience.com:8800) used by runners to communicate with the Hub, and for the Dotmesh protocol used to transfer bulk data between the Hub and the Runners. The TLS certificates are issued by Let’s Encrypt.

Logs from both JupyterLab and ds run tasks are shipped back to the Hub via the TLS-encrypted gRPC connection between the runner and the hub.

The storage for workspaces and datasets in the hub is encrypted at rest using Google Cloud Platform disk encryption.

The storage for workspaces and datasets synchronized to runners we provide is encrypted at rest using Google Cloud Platform disk encryption.

The storage for workspaces and datasets synchronized to runners you provide is NOT encrypted at rest by Dotscience. If you require this, please use the disk encryption provided by your infrastructure.

We create hourly backups of our Hub.

AWS

When Dotscience is deployed as a private installation on AWS, the CloudFormation template deploys a private Hub and a single private Runner.

SSH access to both EC2 instances for support purposes is exposed to a configurable IP range.

HTTP access to the Hub’s web interface and API, and gRPC/Dotmesh protocol access for additional Runners to connect to the Hub, are NOT encrypted. The user is responsible for placing TLS termination in front of them using their own SSL certificate, or for connecting to them securely by other means (e.g., attaching their own instances and clients directly to the VPC).
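One way to put TLS termination in front of the Hub's web interface and API is a small reverse proxy that accepts HTTPS and forwards plain HTTP to the Hub inside the VPC. The sketch below is an illustration under stated assumptions, not a supported configuration: the environment variable names (HUB_UPSTREAM, TLS_CERT_FILE, TLS_KEY_FILE), the function name newHubProxy, and the listening port are all hypothetical, and the proxy covers only HTTP traffic, not the gRPC or Dotmesh protocols. A managed load balancer with your certificate attached serves the same purpose.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

// newHubProxy (hypothetical name) returns a reverse proxy that forwards
// every request to the given plain-HTTP upstream address.
func newHubProxy(upstream string) (*httputil.ReverseProxy, error) {
	target, err := url.Parse(upstream)
	if err != nil {
		return nil, err
	}
	if target.Scheme != "http" || target.Host == "" {
		return nil, fmt.Errorf("upstream must be an http:// URL with a host, got %q", upstream)
	}
	return httputil.NewSingleHostReverseProxy(target), nil
}

func main() {
	// All three variable names are assumptions made for this example.
	upstream := os.Getenv("HUB_UPSTREAM")  // private address of the Hub
	certFile := os.Getenv("TLS_CERT_FILE") // your own certificate
	keyFile := os.Getenv("TLS_KEY_FILE")   // its private key
	if upstream == "" || certFile == "" || keyFile == "" {
		fmt.Println("set HUB_UPSTREAM, TLS_CERT_FILE and TLS_KEY_FILE to start the proxy")
		return
	}
	proxy, err := newHubProxy(upstream)
	if err != nil {
		log.Fatal(err)
	}
	// Terminate TLS here and forward decrypted requests to the Hub.
	log.Fatal(http.ListenAndServeTLS(":443", certFile, keyFile, proxy))
}
```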

The storage for workspaces and datasets in the hub is encrypted at rest using default AWS disk encryption.

The storage for workspaces and datasets synchronized to runners we provide is encrypted at rest using default AWS disk encryption.

It’s up to you to configure backups of your AWS hub. We recommend using EBS disk snapshots.