What is Dotscience?
In this section we describe, at a high level, some of the capabilities that Dotscience provides to the user.
Besides each individual concept, it is also useful to consider what they provide in combination. For example, combining data provenance with end-to-end tracking means that the reproducibility of a model in production is traceable all the way back to the initial raw data that entered the system. This simplifies the management of real projects by requiring fewer components to be manually stitched together into a pipeline.
Dotscience versions and saves everything you need to rerun any version of your model:
- Environment: The compute environment via Docker, including in production.
- Dependencies: All software libraries and versions used.
- Code: The code used to run the analysis, e.g., Python.
- Datasets: Each dataset and each version of it used, including the ability to track large datasets (gigabytes or more) and many files (thousands or more) via ZFS and Dotmesh. Datasets can be imported from locations such as Amazon S3 storage.
- Models: Code, parameters, hyperparameters, metrics, and performance.
- Runs: Each execution of all or part of the code.
- Provenance: Relations between all datasets and models in the analysis, with provenance graph.
- Notebooks (when used): The contents of the notebook when run, including the ability to merge and diff notebooks with collaborators.
- End-to-end: Because arbitrary code can be executed, the full end-to-end data science dataflow of data preparation, feature engineering, model building, and deployment to production can be tracked.
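The provenance relations above form a directed graph from raw data, through intermediate datasets and code versions, to models. As a minimal sketch (the artifact names and the graph representation are illustrative, not Dotscience internals), tracing a model back to the raw inputs that produced it might look like:

```python
# Illustrative sketch of a provenance graph: nodes are dataset, code,
# and model versions; edges point from each artifact to the inputs it
# was derived from. A stand-in for Dotscience's provenance tracking,
# not its actual implementation.
provenance = {
    "model-v3": ["features-v2", "train.py@abc123"],
    "features-v2": ["raw-sales-v1", "clean.py@def456"],
    "raw-sales-v1": [],          # raw data: no upstream inputs
    "train.py@abc123": [],
    "clean.py@def456": [],
}

def trace_to_roots(artifact, graph):
    """Walk the provenance graph back to artifacts with no inputs."""
    parents = graph.get(artifact, [])
    if not parents:
        return {artifact}
    roots = set()
    for parent in parents:
        roots |= trace_to_roots(parent, graph)
    return roots

print(sorted(trace_to_roots("model-v3", provenance)))
# → ['clean.py@def456', 'raw-sales-v1', 'train.py@abc123']
```

Because every run records its inputs and outputs, this traversal is what makes a production model accountable to the exact raw data and code versions behind it.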
Tracking is done by instrumenting one’s code, for example with the Dotscience Python library’s ds() functions, to record runs, models, parameters, and input and output datasets.
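As an illustrative sketch of the kind of metadata such instrumentation captures (parameters, metrics, and input and output datasets), the following stand-in records a run as JSON. The function name and fields here are hypothetical, not the Dotscience library's actual API:

```python
import json
import time

# Hypothetical stand-in for the kind of run metadata that code
# instrumentation records; record_run() and its fields are
# illustrative, not the real Dotscience ds() API.
def record_run(parameters, metrics, inputs, outputs):
    run = {
        "timestamp": time.time(),
        "parameters": parameters,   # e.g. hyperparameters used
        "metrics": metrics,         # e.g. accuracy on a holdout set
        "inputs": inputs,           # dataset versions read
        "outputs": outputs,         # artifacts written
    }
    return json.dumps(run)

run_json = record_run(
    parameters={"learning_rate": 0.01, "epochs": 10},
    metrics={"accuracy": 0.94},
    inputs=["s3://bucket/train-v2.csv"],
    outputs=["model.h5"],
)
```

Recording this per run is what lets every model version be tied back to the exact parameters and data that produced it.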
The result is that each project and model within it is automatically reproducible, accountable, and auditable, including in production.
Ad hoc individual data science work on single machines with manual recording of results does not work well beyond the research stage. Dotscience’s ability to fork and share projects, merge and diff Jupyter notebooks, and track and version everything including large datasets and runs, means that real collaboration is possible. This includes asynchronous coordination between groups and across time zones, and robustness to critical personnel being reassigned or leaving: features such as the provenance graph keep completed work understandable.
Modern software is often delivered as containerized microservices via continuous integration and delivery (CI/CD) tools, enabling updates to be shipped in minutes rather than months. Machine learning in production likewise needs models to be updated at the same cadence. Dotscience therefore integrates with CI/CD tools such as CircleCI, and also provides its own lightweight built-in equivalents, so that the power of this approach is available to all users without requiring engineering and infrastructure setup.
Similar to CI/CD, our deployment allows models to be placed on Kubernetes via Docker, and monitored with Grafana dashboard visualizations of output from the Prometheus time series database, which in turn reads the prediction outputs of a deployed model. At its simplest, in the SaaS GUI and using the Dotscience Hub, this constitutes a true one-click enterprise deploy. Alternatively, the system can be set up in other locations and with other tools.
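To give a concrete sense of how a deployed model's predictions can reach Prometheus, here is a minimal sketch that counts predictions per class and renders them in the Prometheus text exposition format, ready to be scraped. The metric and label names are assumptions for illustration:

```python
from collections import Counter

# Sketch: tally model predictions per class and render them in the
# Prometheus text exposition format. The metric name
# "model_predictions_total" and the "class" label are illustrative.
predictions = Counter()

def observe(prediction_class):
    """Record one prediction emitted by the deployed model."""
    predictions[prediction_class] += 1

def render_metrics():
    """Render the counters as Prometheus-scrapable text."""
    lines = ["# TYPE model_predictions_total counter"]
    for cls, count in sorted(predictions.items()):
        lines.append(f'model_predictions_total{{class="{cls}"}} {count}')
    return "\n".join(lines)

for p in ["cat", "dog", "cat", "cat"]:
    observe(p)

print(render_metrics())
```

Grafana dashboards then query these series from Prometheus, rather than reading the model's output directly.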
Once a model is deployed, it must be monitored. Machine learning models are complex nonlinear mappings, and changes to the input data can result in unexpected output behavior that directly impacts business value. Besides the usual metrics of throughput, latency, uptime, and so on, statistical monitoring of inputs and outputs is necessary. Which statistics matter is problem-dependent, so in general the ability to perform arbitrary queries on model outputs is needed. The PromQL language available in Prometheus allows this, enabling, for example, monitoring the distribution of classification outputs, anomaly detection, thresholding, alerting, and detection of statistical/model drift. Other time series databases or monitoring visualizations can be integrated similarly.
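As one example of the kind of statistical monitoring such queries support, a drift check can compare the distribution of live predictions against the training-time distribution. The sketch below uses the population stability index (PSI), a common drift statistic; the distributions and the 0.2 threshold are conventional illustrations, not Dotscience defaults:

```python
import math

# Sketch of statistical drift detection: compare a model's output
# class distribution in production against the distribution seen at
# training time, using the population stability index (PSI).
def psi(expected, actual):
    """PSI over matching class proportions; higher means more drift."""
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected, actual)
    )

train_dist = [0.70, 0.30]   # class proportions at training time
live_dist = [0.55, 0.45]    # proportions observed in production

score = psi(train_dist, live_dist)
print(f"PSI = {score:.3f}")  # > 0.2 is often treated as major drift
```

A comparable live distribution could be obtained directly in PromQL with a query along the lines of `sum by (class) (rate(model_predictions_total[5m]))`, where the metric name is again an illustrative assumption.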
Dotscience’s web interface includes our metric explorer, with visualizations of model hyperparameter tuning, and the project provenance graph, helping users to select the best model(s) from training to promote to production. This is done within the framework of projects and runs, with everything tracked.
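Model selection itself reduces to ranking tracked runs by a chosen metric. A minimal sketch, with hypothetical run records rather than Dotscience's actual data model:

```python
# Sketch: given tracked runs with their parameters and metrics (as a
# metric explorer would display them), pick the best run to promote.
# Field names here are illustrative.
runs = [
    {"run_id": "a1", "params": {"lr": 0.1},   "metrics": {"accuracy": 0.89}},
    {"run_id": "b2", "params": {"lr": 0.01},  "metrics": {"accuracy": 0.94}},
    {"run_id": "c3", "params": {"lr": 0.001}, "metrics": {"accuracy": 0.91}},
]

def best_run(runs, metric):
    """Return the run that maximizes the chosen metric."""
    return max(runs, key=lambda r: r["metrics"][metric])

print(best_run(runs, "accuracy")["run_id"])  # → b2
```

Because every run's parameters and metrics are tracked automatically, this ranking needs no manual spreadsheet of results.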
Run anywhere (on-premise, on-cloud, or hybrid/multicloud):
Dotscience stores snapshots of your model code and associated data. It also allows you to deploy any snapshot, and thus run that model version, on any compute infrastructure you choose. You could simply run on your local machine: your laptop, or a local server. But you might also want to point your model at a cloud instance on Amazon Web Services, Google Cloud Platform, Microsoft Azure, or anywhere else on the web. This lets you take advantage of a variety of processing options without needing to send files around or keep track of copies of your model code and training data. You tell Dotscience to run on your chosen machine by executing a single command on that machine. Then you can develop and run your model via an IDE in the Dotscience web interface (for example, JupyterLab) or via your local development environment. The model executes remotely on the specified runner, sending snapshots of code changes, as well as the values of tracked objects such as parameters and summary statistics, back to Dotscience’s web interface for storage and visualization.