Frequently Asked Questions.

This is the main Dotscience FAQ. If you have a question that is not answered here, please contact us.

This is a work in progress; more questions and answers will be added soon.

Questions are divided into

  • High level
  • Technical

These categories do not have strict boundaries.

High Level Questions

Why use this tool? The cloud has DevOps, or my company has in-house DevOps

There are many reasons to consider using Dotscience, but a quick list is:

  • We make model deployment easy
  • Not all companies are willing or able to use cloud tools or host their data outside their own premises
  • DevOps for machine learning (a.k.a. MLOps) has a whole different set of requirements from regular DevOps, which we provide
  • It is possible to build DevOps for machine learning in-house, but most of the requirements are generic, so for most companies the time is better spent buying rather than rebuilding the same tooling from scratch
  • We are platform-agnostic, and thus not tied to any particular cloud, and can be run on-cloud, multi-cloud, on-prem, or hybrid

What are the differentiators of your tool from other similar tools and from the cloud?

The vast majority of competitors do not record data provenance, which is crucial for a business’s use of AI to have any reproducibility or accountability (for example for explanatory or auditing purposes). Competitors that do record provenance do not generally combine this with both easy deployment and full DevOps for ML.

The tool provides a mechanism to enable a user to ensure regulatory compliance. But does the user have to manually ensure such compliance in their code?

Dotscience will automatically track your work using filesystem snapshots, versioning all of your data, models, parameters, code (including notebooks), and runs. Tracking of the filesystem, data and code works even if you are not annotating code with the Dotscience Python library.

Thus the necessary information to create an audit trail is recorded. However, if the compliance comes from certain operations within the analysis code written by the user, for example, not using certain variables in your models, then this remains up to the user to implement correctly. For any given model, you can always see the code and data that generated it.

Is the product aimed at technical users only?

The product is aimed at data scientists, machine learning engineers, data engineers, and similar roles who need to take their analyses from experimental to production. It is assumed that the user is writing their own programming code to carry out their analyses, but does not wish to spend the time or resources to set up an entire production system from scratch.

Does the user have to recode for production and ensure correctness manually? A model receiving input and giving predictions is a different workflow from a model being trained.

Recoding models for production and ensuring correctness is a labor-intensive and error-prone task that has stymied many companies in their attempts to deploy machine learning into production.

Currently, Dotscience supports TensorFlow models for production deployment. The trained model is exposed through a container with an Amazon-S3-compatible API, a container image is built around it, and, using TensorFlow Serving and the model proxy, it is deployed to production on Kubernetes. None of this requires detailed setup by the user, so no extra code is needed to deploy a model that was created during training.
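Once deployed, predictions are served over TensorFlow Serving's standard REST API. A minimal sketch of the request a client would send, with a hypothetical model name, host, and feature values:

```python
import json

# TensorFlow Serving's REST predict endpoint takes a JSON body of the form
# {"instances": [...]}, POSTed to /v1/models/<model-name>:predict.
# The feature values below are hypothetical.
instances = [[5.1, 3.5, 1.4, 0.2]]
body = json.dumps({"instances": instances})

# e.g. POST http://<host>:8501/v1/models/my-model:predict with `body`
print(body)
```

The response is a JSON object whose "predictions" field holds one prediction per instance sent.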

Future support will include other model frameworks, keeping this an easier path to production than manually recoding trained models.

One part of deployment that is not yet supported is the preprocessing steps outside of a model that are usually necessary between the raw data and the data fed to the model. This will be improved in future releases.
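In practice this means any transform applied at training time must, for now, be re-applied client-side before calling the deployed model. A minimal sketch, assuming hypothetical min/max values taken from a training set:

```python
# Preprocessing is not yet bundled with the deployed model, so the client
# must reproduce the training-time transform (here, min-max scaling).
# The min/max values are hypothetical training-set statistics.
def min_max_scale(row, mins, maxs):
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]

mins, maxs = [0.0, 0.0], [10.0, 100.0]
scaled = min_max_scale([5.0, 25.0], mins, maxs)
print(scaled)  # [0.5, 0.25]
```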

Does it meet company IT/security standards since arbitrary code can be executed?

The architecture of Dotscience ensures that, while arbitrary code can be executed, it can only be executed on the compute environment chosen by the user. The machine or VM that the user is using, the runner, acts as a security boundary. The user may choose to use the Dotscience hub for metadata management, but even the hub can be brought on premise if needed. Other security details include: each user gets their own runner, no ports are exposed on the runner, and if the Dotscience hub is used, the only traffic is outbound, consisting of the ZFS filesystem and metadata streams.

Could this tool be used in academia? The ability to produce genuinely reproducible analyses would be valuable to many areas of science

Yes, it could be used there. While production deployment is less critical than in the enterprise, the end-to-end nature of the system, especially the “version everything” ethos, combined with the flexibility to execute arbitrary code, means that it has a high potential value as a reproducibility framework, even if machine learning is not used.

How many customers do you have? Are they using your tool in production? Are they willing to provide references?

Dotscience is an innovation group within its parent company DataDirect Networks (DDN). DDN is a 20-year-old company with $300M+ annual revenue and is the largest privately held storage company in the world.

While exact customer numbers and names would require signing an NDA, Dotscience has paying customers, including customers with production deployments.

An interview with S&P Global’s Ganesh Nagarathnam is on our website at https://dotscience.com/blog/2019-10-30-unblock-ai-in-enterprise .

What are your different products and their pricing?

Our pricing plans are free, paid, private, and enterprise. See https://dotscience.com/pricing for more details.

Technical Questions

Can I use my existing models in Dotscience?

Yes, any Python script, including IPython notebooks, can be tracked, versioned and run with Dotscience. Simply mark up the parameters and metrics you want to track with our Python library. Support for R models is on the roadmap.
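The markup pattern amounts to declaring, around your existing training code, which values are parameters and which are metrics. The sketch below uses a hypothetical stand-in tracker class, not the real Dotscience library calls (consult the Dotscience Python library documentation for those), purely to show the shape of the annotation:

```python
# Hypothetical stand-in for a run tracker -- NOT the real Dotscience API;
# it only illustrates the "annotate parameters and metrics" pattern.
class RunTracker:
    def __init__(self):
        self.parameters = {}
        self.metrics = {}

    def parameter(self, name, value):
        self.parameters[name] = value
        return value

    def metric(self, name, value):
        self.metrics[name] = value
        return value

ds = RunTracker()

# Existing training code, annotated: wrap the values you want recorded.
learning_rate = ds.parameter("learning_rate", 0.01)
epochs = ds.parameter("epochs", 5)
accuracy = ds.metric("accuracy", 0.93)  # would come from a real model

print(ds.parameters, ds.metrics)
```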

How do I get my files into Dotscience?

The Jupyter UI can be used to add files, via the standard JupyterLab file upload section of its GUI.

How do I get my large files or datasets into Dotscience?

You can use the terminal in Jupyter to scp local files into your runner. Alternatively, if possible, consider using wget to get files from the web.
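The commands themselves are the standard ones; a dry-run sketch (host, user, and paths below are hypothetical, so substitute your runner's details):

```shell
# Dry run: print the transfer commands rather than executing them.
# Replace RUNNER and the paths with your own before running for real.
RUNNER="user@runner.example.com"
echo "scp ./data/train.csv $RUNNER:/home/user/project/data/"
echo "wget -P ./data https://example.com/datasets/train.csv"
```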

Does Dotscience do hyperparameter tuning?

No. Dotscience keeps track of hyperparameter values that you submit as well as the metrics associated with each parameter combination, but it doesn’t decide which values to use: that is up to you. However, Dotscience will still track parameter values when those values are produced by a hyperparameter tuning method, such as Scikit-Learn’s GridSearchCV.
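The division of labor is: the user (or a tuning library) chooses the grid of values, and the tracker records one (parameters, metric) entry per run. A self-contained sketch with a toy stand-in for the training step (the grid values and error function are hypothetical):

```python
import itertools

# Toy stand-in for a real training run: pretend validation error depends
# on two hyperparameters, minimized at lr=0.01, batch=32.
def validation_error(lr, batch):
    return (lr - 0.01) ** 2 + (batch - 32) ** 2 / 1e4

# The user chooses the grid; the tracker only records what was tried.
grid = {"lr": [0.001, 0.01, 0.1], "batch": [16, 32, 64]}

runs = []  # what a tracker would record: one entry per run
for lr, batch in itertools.product(grid["lr"], grid["batch"]):
    err = validation_error(lr, batch)
    runs.append({"params": {"lr": lr, "batch": batch}, "error": err})

# Choosing the winner is also up to the user, not the tracker.
best = min(runs, key=lambda r: r["error"])
print(best["params"])  # {'lr': 0.01, 'batch': 32}
```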

What are the additional challenges in DevOps for machine learning (MLOps) versus regular software DevOps?

See our blog entry at https://dotscience.com/blog/2019-10-21-devops-for-ml .

Can runs be deleted? Most runs will be mistaken or not final, and so not applicable even in an audit. There is a risk of unwanted proliferation of objects in the provenance graph if everything has to be kept.

The ability to delete (or archive) runs is on our product roadmap.

A typical project produces over 100 ML models. How is this represented in the provenance graph without it becoming unreadable?

Displaying large numbers of models is a roadmap item; one proposed method is to collapse siblings in the provenance graph into an ellipsis that can be expanded again. Large projects are already partly addressed: if a model or dataset is selected, the provenance for that item is shown rather than the full graph for the project.

How are processes that are iterative between dataset versions and model runs, such as forward selection or backward elimination, represented in the provenance graph? Are they data runs or model runs?

Explicitly representing such processes would require potentially complex generalizations of the provenance graph to cover the many possibilities. In one sense, however, they are already encompassed within the system: the basis of the provenance is the run, which contains arbitrary code, data, and models. Thus an iterative process can be encompassed within a single run, which has the usual versioning, inputs, and outputs.

What are referred to as “data runs” and “model runs” in some parts of the website are just an arbitrary distinction between runs that do and do not contain a model. Architecturally, there is only one kind of run, so the user does not have to choose between calling something a data run or a model run.

Can a company run 100% onsite or does it have to connect to the Dotscience hub?

A company can have its own instance of the Dotscience hub instead of connecting to the main one at cloud.dotscience.com. Thus the product can be used 100% on premise if desired.

Is there a plan to supply pre-built containers for users? E.g., basic data science with installs

We offer pre-packaged solutions for Amazon Web Services, Microsoft Azure, and Google Kubernetes Engine (GKE).

Since the solutions are based on Docker containers, which can contain any libraries, they can be extended to a wide range of solutions. Such solutions might be based upon, for example, various machine learning algorithms or frameworks, industry-specific functions (e.g., finance), or general functions common to many industries (e.g., human resources).
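For example, extending a base image with additional libraries is a small change to its Dockerfile. A minimal sketch, where the base image name is a hypothetical placeholder for whichever image your runner or deployment actually uses:

```dockerfile
# Hypothetical base image -- substitute the image you actually use.
FROM python:3.8-slim

# Extend with domain-specific libraries, e.g. for basic data science work.
RUN pip install --no-cache-dir numpy pandas scikit-learn
```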

What types of collaboration are supported? Users, groups, projects, permissions?

At present, projects are single-user. This will be extended to groups as a roadmap item; such groups will in turn have fine-grained permissions on who can view or edit parts of a project.

Some distributed runs are not exactly reproducible due to e.g. non-guaranteed line ordering. What about statistical reproducibility?

This would be up to the user to monitor: for example, does a rerun produce a result that is not significantly different from the original, with respect to error bars derived from a sensitivity analysis? Dotscience helps track such experiments by recording which data and which models were run, and in what combinations; the resulting audit trail can thus prove or disprove reproducibility.
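One way a user might check this: rerun the experiment under a different seed and test whether the summary metric agrees within the expected run-to-run error bars. A minimal sketch, where the metric, its spread, and the 3-sigma tolerance are all hypothetical choices:

```python
import random
import statistics

# Toy "experiment": the metric of a run is the mean of n noisy samples.
# In a real workflow this would be a model's accuracy or loss.
def run_experiment(seed, n=1000):
    rng = random.Random(seed)
    return statistics.mean(rng.gauss(0.8, 0.05) for _ in range(n))

a = run_experiment(seed=1)
b = run_experiment(seed=2)

# Error bar on the difference of two means: sigma * sqrt(2 / n).
stderr_diff = 0.05 * (2 / 1000) ** 0.5
reproducible = abs(a - b) < 3 * stderr_diff  # 3-sigma tolerance

print(a, b, reproducible)
```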

Is the documentation for Dotmesh relevant to Dotscience users?

The Dotmesh documentation is not needed to use Dotscience; also, it is no longer being actively maintained.

Does Dotscience use Python 3?

Yes, the Dotscience Python library uses Python 3.

Does Dotscience support Python 3 only and not Python 2.x?

Python 3 is tested within the system, whereas Python 2.x is not.