This page provides brief definitions of terms and concepts likely to be encountered when using Dotscience. The motivation is that our functionality covers a combination of DevOps, data science, and machine learning, so our users are likely to include data scientists who are less familiar with DevOps terminology, and DevOps engineers who are less familiar with data science terminology.
We are not attempting to be comprehensive across both fields, but any feedback on terms missing that should be included is welcome.
Entries are arranged in alphabetical order: we have not attempted to divide them into various subfields.
For a more conversational introduction to the DevOps subset of these terms, see our blog entry “Dotscience MLOps terminology explained”. Common data science and machine learning terms are discussed extensively online, e.g., Wikipedia.
This is the comparison of one machine learning model to another to see which is performing better. Often the comparison is indirect, such as which version of a website with its own underlying ML model generated more revenue. We support this in the general sense that our deployed model outputs can be directed by the user towards some application undergoing this test, but we are also adding support for the more direct case of comparing two models being monitored on the same endpoint. Another example is in the experimental stage where two models can be compared on their performance metric versus training set ground truth.
This usually refers to the use of machine learning models to improve the software lifecycle, for example a model that automatically classifies bug reports, or recommends which tests to run more frequently. This is distinguished from MLOps, which is about applying DevOps practices to the end-to-end data science workflow (see below).
Commonly used data storage functionality on Amazon Web Services. Dotscience can load data from any location that the user has access to, but we provide integration with S3 as a convenience.
Amazon Web Services (AWS)
One of the 3 major cloud providers that Dotscience has integrations with, along with Google Cloud Platform and Microsoft Azure. Since Dotscience supports hybrid or multicloud setups, one or more (or none) of these may feature in a user’s workflow. An example can be seen in the deep dive demo video on the product page, where data is brought in from Amazon S3 and the model is later deployed to the Google Cloud Platform.
The ability to trace all steps of a dataflow, so that what happened can be seen and reproduced. This includes seeing what data went into training a model, what the resulting outputs were, and on what data a model’s decision was based. In Dotscience, a table in the Hub’s PostgreSQL database records the actions performed by each user.
AutoML is machine learning where a process outside the core algorithm enables a search of the model hyperparameter space to find the best model without manual tuning required by the user. This can both give superior model performance and save user time. Dotscience does not currently implement AutoML itself but supports tools, such as H2O, that do. Our relation to other examples of automation in data science, such as auto data preparation or auto feature engineering, is similar: you can use your favorite tool as part of your dataflow within our system.
In a production deployment, a canary model is deployed alongside the live production model to provide a baseline to which the production model can be compared. This allows easier monitoring for changes in model performance such as model drift.
Continuous integration (CI) is the process of bringing together the data and files needed to make a model ready for deployment, and continuous delivery (CD) is the process of deploying that model, in the modern setup typically as a microservice in a container. Dotscience provides both CI and CD via Docker and Kubernetes, although other tools can be used.
As well as the SaaS and JupyterLab interfaces, Dotscience provides a command line interface to allow usage from a general terminal environment, either within the SaaS or from your own machine.
In Dotscience, collaboration refers to the ability of multiple users to work on the same project, and for them to work asynchronously on versioned analyses, including with notebooks in JupyterLab. Work can be forked, pull requests can be made, and changes merged. This helps teams get better results than the individual team members might achieve alone, and helps reduce duplication of work.
In software development, commit refers to publishing a version of the software back to the repository. In Dotscience, the idea is similar, but the user publishes a version of their analysis back to the project. This is part of our setup for ensuring reproducibility, auditability, etc.
In a deployed machine learning model, concept drift is when the underlying ground truth of what the model is trying to predict is changing. Although the model may remain unaffected, a degradation in performance is much more likely, because the situation has probably become one that the model was not trained for.
A typical production system using the modern container + microservices approach to deploying applications will have many containers running at once. Their arrangement and use of resources therefore need to be organized, which is container orchestration. The most commonly used container orchestrator with Dotscience is Kubernetes, but others can be used if desired.
When a model is deployed on an endpoint and the user wants to send new data to it via the command line (most likely as part of a script), one option is to use the model’s REST API and send the data using curl.
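As a sketch of the idea, the same request can be built programmatically rather than with curl. The endpoint URL and payload shape below are hypothetical, since the exact request format depends on how the model was deployed:

```python
import json

# Hypothetical endpoint and payload for a deployed model's REST API.
url = "https://example.com/v1/models/mymodel:predict"
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}
body = json.dumps(payload)

# The equivalent curl invocation would look something like:
#   curl -X POST <url> -H "Content-Type: application/json" -d '<body>'
print(body)
```

In a script, the JSON body would then be POSTed to the endpoint with an HTTP client of your choice.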
This refers to the steps of the analysis in a data science workflow between the raw data coming in and producing the data that is passed to the model (and possibly post-processing of model outputs too). Dotscience doesn’t have explicit support for particular data engineering tools, but you can use whatever library you like within the system. What we do have, however, is our versioning of datasets (including large ones, thanks to ZFS and Dotmesh), and our recording of data provenance. This keeps reproducibility and auditability intact during the data engineering phase as well as the model building and deployment phases of a project (phases that are often iterative too).
Data drift in a deployed model is similar to concept drift but refers to a change in the underlying characteristics of any of the incoming data, not just the ground truth. As with concept drift, data drift is more likely to degrade the performance of your model than help it, so incoming data can be monitored via our model monitoring with Prometheus and Grafana (or other tools).
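As an illustration of the idea (not Dotscience's own monitoring implementation), a minimal data drift check might compare a feature's mean in recent inference traffic against its training baseline; the 3-standard-error threshold here is an illustrative choice:

```python
import statistics

# Flag drift when the recent mean of a feature moves more than
# z_threshold standard errors away from the training mean.
def drifted(training_values, recent_values, z_threshold=3.0):
    mu = statistics.mean(training_values)
    sigma = statistics.stdev(training_values)
    recent_mu = statistics.mean(recent_values)
    stderr = sigma / (len(recent_values) ** 0.5)
    return abs(recent_mu - mu) > z_threshold * stderr

training = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 10.4]
assert not drifted(training, [10.1, 10.0, 9.9, 10.3])  # similar traffic
assert drifted(training, [14.8, 15.2, 15.0, 14.9])     # shifted traffic
```

Real monitoring setups would run this kind of query continuously over windows of production traffic, typically via Prometheus rather than in application code.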
DataOps is similar to MLOps but refers to the use of DevOps practices during the data engineering phase rather than the model building phase. So data ingestion from sources such as databases, data formatting, and preparation are covered. As mentioned above in the data engineering entry, Dotscience is not specialized to this, but you gain the benefits of our data versioning and provenance, which are important components of successful DataOps. The user can connect from us to any accessible data source, such as their own machine, Amazon S3, or a database.
In a production deployment, the user may want to compare one model against another to see how they are performing. Often one model is “live”, but the same data is routed through another model on the same endpoint, the decoy, whose outputs are not used for business decisions but are available for comparison to the live model. This can be used, for example, in A/B testing to justify replacement of one model by another.
Deep learning is a subfield of machine learning using artificial neural networks that has expanded rapidly in recent years due to its power to solve many previously unsolved or poorly solved use cases. In Dotscience, our most commonly used machine learning tool is TensorFlow, which has a focus on (but not requirement of) deep learning.
DevOps for ML
See MLOps.
The UNIX diff command shows the differences between two files. In Dotscience, we support this for files, and also for Jupyter notebooks. This allows the user to compare two versions of an analysis, and if desired update or merge them using our collaboration tools.
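A minimal sketch of the underlying idea, using Python's standard difflib to produce a unified diff between two versions of a script:

```python
import difflib

# Two versions of the same (hypothetical) script, as lists of lines.
old = ["learning_rate = 0.01\n", "epochs = 10\n"]
new = ["learning_rate = 0.001\n", "epochs = 10\n", "batch_size = 32\n"]

# unified_diff yields diff-style lines: "-" for removals, "+" for additions.
diff = list(difflib.unified_diff(old, new, fromfile="v1.py", tofile="v2.py"))
print("".join(diff))
```

Notebook diffing works the same way in spirit, but operates on notebook cells rather than raw text lines.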
Dotscience makes use of Docker containers, for example as the medium in which to deploy models. They are important because they allow the environment in which a data science analysis was done to be recorded, with libraries, versions, etc. This in turn helps ensure our key requirements of reproducibility and auditability.
A Docker image is the format that a Docker container is stored in when it is not running.
Docker images are stored in a Docker registry. Since each image is versioned, in Dotscience this enables images containing models to form the basis of a model registry.
Dotmesh is the open source tool upon which Dotscience is built. In turn built upon the ZFS file system, Dotmesh is a tool for snapshotting states of applications, and then storing them versioned in a repository in a manner similar to GitHub. This means that when using Dotscience, one’s analyses are versioned and stored in a similar way. Thus analyses can be shared and collaborated upon by other users.
See remote mode
Also known as the Dotscience Service, this is the central place in which machines doing the computation (runners) are coordinated, data (or pointers to data) are stored, and which users can log in to when they want to use our SaaS interface. No user code is executed on the hub, and customers can also have their own Dotscience hub on the cloud, or on-premise.
Dotscience model proxy
The Dotscience model proxy handles taking outputs from deployed models and passing them to some application that aids model monitoring, for example, the Prometheus time series database.
Dotscience Python library
This is how user Python code is annotated so that the system knows what information to record, such as what lines of code constitute a run, input and output datasets, model parameters, and metrics. Because the system allows arbitrary code to be executed, this is preferable to trying to auto-detect information that could be coming from a plethora of constantly changing data science libraries.
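To illustrate the call pattern, here is a sketch of an annotated run. The class below is only a stand-in so the example runs anywhere; the real library is imported as the `ds` module and its exact signatures may differ:

```python
# Minimal stand-in for the Dotscience Python library ("ds"), mimicking the
# call pattern only; the real library records this metadata on the hub.
class _FakeDS:
    def __init__(self):
        self.runs = []
        self._current = None
    def start(self):
        self._current = {"params": {}, "metrics": {}}
    def parameter(self, name, value):
        self._current["params"][name] = value
        return value
    def metric(self, name, value):
        self._current["metrics"][name] = value
        return value
    def publish(self, message=""):
        self._current["message"] = message
        self.runs.append(self._current)

ds = _FakeDS()

ds.start()                                # everything until publish() is one run
lr = ds.parameter("learning_rate", 0.01)  # record a hyperparameter
# ... train the model here ...
ds.metric("accuracy", 0.93)               # record the resulting metric
ds.publish("baseline model")              # snapshot and version the run
```

Annotating code this way is what tells the system which lines constitute a run and which values are the parameters and metrics to record.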
An endpoint is a location and point of communication between one application and the outside world. In Dotscience, a deployed ML model is at an endpoint and so any other software that can access it can be used. In production model deployment, more than one model may be put on an endpoint, with split data traffic, for example going to a live model and a decoy model.
A compute environment is the setup of software (and perhaps hardware) that allows an analysis to arrive at a given reproducible result. So it includes things like which software libraries were used and what versions. Containers are widely used because they are a lightweight way of making an environment portable between different underlying machines. In Dotscience, this makes your analyses portable between different machines (runners) and locations (on-premise, on-cloud, or hybrid).
ETL, or extract-transform-load, is the process of taking in data from the needed sources and turning it into data ready to be prepared for machine learning. It is thus essentially the same as data engineering, but often with more of a database focus. Dotscience is not database focused, but can of course connect to databases when needed, since arbitrary code can be executed as part of a data science analysis.
Experiments generally refers to the stage of building machine learning models and tuning their hyperparameters to find which model performs best according to some metric, such as the deviation between a model’s output and the known ground truth. This may be combined with, or done iteratively with, feature engineering and other data preparation. Dotscience is not specialist experiment-tracking software, unlike some competitors, but it has a basic metric explorer plus visualization where you can see which model in a project is doing best. Our versioning, recording of parameters, and data provenance capabilities also guarantee that the experiments carried out are correctly recorded and repeatable. We also allow tools such as TensorBoard to be used alongside TensorFlow, giving more extensive visualizations of experiments.
This is the conversion of an input dataset containing a certain set of features (columns, if the data is tabular, or simply information if it is unstructured) into a set of features able to produce better model performance. A classic example is a financial use case where the ratio of credit used to credit available can give better predictive power than either value by itself. As with data engineering, Dotscience is not specialized to feature engineering, but the same considerations apply with respect to our tool’s generality, reproducibility, etc., and it is thus fully supported within our system.
A fork is the cloning of a repository of files (code, datasets, etc.), with subsequent work on these files causing it to diverge from its original state. Forks may be merged back to the original later, or continue separately. Because working in Dotscience is similar to versioned software development with Git, if one user’s project exists and a different user wants to work on it, they typically fork the project first. The most easily accessible example is our SaaS demo, where you fork the project before using it yourself.
A Git repository (or “repo”) is the information about a versioned set of files forming a software project (or a branch of one) being worked on. The changes to the files are tracked over time. Dotscience treats the published states of a user’s project in the same way, enabling all of the user’s work and code runs to be tracked and versioned.
GitLab is a platform that provides a web application for the DevOps lifecycle. Dotscience has a GitLab integration that enables, for example, more general CI/CD specifications that advanced users may want compared to the default ones in our built-in model -> Docker -> Kubernetes deploy.
Google Cloud Platform (GCP)
One of the 3 major cloud providers that Dotscience has integrations with, along with Amazon Web Services and Microsoft Azure. The same considerations w.r.t. hybrid and multicloud apply as in the AWS entry above.
Grafana is a well-known dashboarding tool that enables users to see a visual representation of their applications in production, via monitoring. In Dotscience, output data from deployed models, such as predictions, is passed to the Dotscience model proxy and then to the Prometheus time series database. The results of PromQL queries on this database, which can include ones needed for monitoring ML models such as the distribution of output classifications, can then be seen in Grafana. As with Prometheus for the database, usage of Grafana as the monitoring tool is not required; it has just been the most common so far.
In machine learning models, the ground truth is the label or target data used when a supervised learning model is trained to make predictions. The ground truth represents what is really true, for example what kind of road sign the data is an image of. Ground truth is typically not available, or not available until later, when a model is deployed (otherwise we wouldn’t need the model).
H2O is an open source machine learning library and platform that has found widespread usage due to its sophisticated and fast implementations of several machine learning algorithms, such as gradient-boosted decision trees (GBTs). It also has an AutoML component that incorporates other popular libraries such as XGBoost. While deep learning has become increasingly popular, it is not suitable for all business use cases, and where, for example, a powerful GBT model is appropriate, H2O may be a good choice.
Hybrid cloud refers to combining on-premise infrastructure with one or more cloud platforms (AWS, GCP, MS Azure, etc.) in one’s work, and is often a preference of businesses who want to avoid vendor lock-in to one particular platform. Dotscience’s architecture naturally supports hybrid cloud due to its bring-your-own-compute and containerized setup.
This refers to the tuning of parameters that are adjustable for a model’s algorithm before training is run, such as the number of nodes in a layer of a neural network. The final trained model has certain values of these hyperparameters as part of the information about how it was trained. Thus they need to be recorded to make such a model reproducible and auditable. Dotscience does this via user invocation of the ds.parameter() function in the Dotscience Python library.
Inference is when new data is passed to a model and it makes predictions on that data. Typically the model has been deployed in production and ground truth data is not available, although it is also good practice to evaluate a model’s performance on unseen testing data after the training process is complete. This is in effect inference in which the ground truth just happens to be available. In Dotscience, model monitoring is typically done on a deployed model with inference data and the model’s predictions on it.
Dotscience interactive mode refers to telling the system that you will be using an interactive method of running your analysis, such as a Jupyter notebook within our integrated JupyterLab. This mode is as opposed to script mode, where you are, for example, running a Python analysis from a .py script. You can tell Dotscience to use interactive mode via ds.interactive() in the Dotscience Python library.
JupyterLab is an environment in which Jupyter notebooks can be run, providing in effect an interactive IDE for data science where the user can perform arbitrary analyses. Python + Jupyter is one of the most common ways in which data scientists work, particularly on new projects where the focus is on analyzing a new problem and iteratively running experiments, before moving into production at scale.
Kubernetes is a container orchestration system, originally developed by Google, that has become popular in recent years. It is important for ML because deploying models as microservices in containers via an orchestrator such as this is a common way of putting ML models into production. Dotscience does not require the user to know or configure Kubernetes in order to use it: typically you either use the one already set up on our SaaS, or it is part of the installation for Dotscience customers.
Markdown is a way to format text files, for example the readme.md file in a Git repo, or Markdown cells in a Jupyter notebook. In both of these places, it is good practice to include some Markdown content to make the work easier for others to understand. Dotscience allows both of these uses of Markdown, but does not require them.
Meltano is an open source convention-over-configuration product from GitLab for the whole data life cycle, all the way from loading data to analyzing it. You can create Meltano projects inside Dotscience workspaces and ensure data versioning. It adds functionality for seeing your dataflow diagrammatically, and for creating dashboards for reports about it. Meltano is primarily aimed at non-technical users who need an overview of a whole system or project, and so it complements other, more technical Dotscience users who are doing more of the actual analysis or data science.
A model metric is a quantitative way to measure how well a model is performing. Typically this is for supervised learning and the model’s predictions are measured against the ground truth during training. Common metrics are things like accuracy or Gini score for classification, and mean squared error for regression. In Dotscience, the metric used and its value can be recorded via ds.metric() in the Dotscience Python library.
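For illustration, here are two common metrics computed by hand; values calculated like this are what would then be recorded via ds.metric():

```python
# Fraction of predictions that match the ground truth (classification).
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Average squared deviation from the ground truth (regression).
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])        # 3 of 4 correct
mse = mean_squared_error([2.0, 3.0], [2.5, 2.0])  # (0.25 + 1.0) / 2
```

In practice these come from your ML library of choice; the point is only that a metric reduces model performance to a recordable number.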
A common modern way of deploying applications is through a collection of parts each of which provides some functionality. Such pieces are microservices, and in MLOps they are useful because each model can be deployed as a microservice, in a container. It is then possible to swap out individual models without having to take down the rest of the service.
One of the 3 major cloud providers that Dotscience has integrations with, along with Amazon Web Services and Google Cloud Platform. The same considerations w.r.t. hybrid and multicloud apply as in the AWS and GCP entries above.
MLOps is the application to machine learning of the concepts, and hence benefits, of DevOps practices from modern software development: things like code versioning, reproducibility of the compute environment, CI/CD, and monitoring. MLOps is distinguished as its own subfield, however, because ML requires extra concepts not usually needed in software DevOps, such as data provenance, model hyperparameters, model metrics, user analyses (workflows), and the likely degradation of model performance in production.
Note that this does not refer to the similar sounding field of ML for DevOps, which is using machine learning models to improve DevOps operations for software, whether or not that software is being used for ML or data science.
ModelOps is similar to MLOps, specifically referring to the organizing, deploying, and running in production of machine learning models, rather than the broader application of DevOps for ML principles throughout the data science process.
Model build in Dotscience is the step between a model run being published, which makes the model available in the model list, and the deployment step that follows the build. Build refers to creating a Docker container image for the model, as in the CI stage of CI/CD, which can then be deployed (the CD stage), for example to Kubernetes.
When an ML model is deployed and, over time, the input data or ground truth (or both) evolve away from what was in the training set, the model’s performance is likely to degrade. The changes in input are concept drift and data drift, and the resulting change in output is model drift. It is vital to monitor deployed ML models for model drift if their outputs are important. The most common Dotscience setup for doing this is via Prometheus and Grafana.
The list of models in a Dotscience project constitutes a de facto model registry, because the full information needed to reproduce any of the models is available. Some further model registry capability is to be added, such as sharing models between projects.
Multicloud is related to hybrid cloud above, and refers to the ability to use more than one cloud platform in an analysis. As mentioned under hybrid cloud, Dotscience’s architecture naturally supports this.
Node-RED is an open source tool for wiring together devices or creating dataflows, supported by Dotscience in prototype form for creating pipelines. It stores the flows as JSON.
Pip is the Python package installer, often used to install a library the user wants from the Python package index, PyPI. It can be invoked directly from a Jupyter notebook and is convenient to use. It is not ideal, however, because “pip install A” for package “A” may install the latest version, which may have changed since the last run. So it is up to the user to specify a given version to preserve reproducibility (just as it is to avoid other trivial ways of breaking reproducibility, such as using a dataset that is changing). Better is for the base Docker image, such as our Python 3 + TensorFlow one, to already contain the libraries needed, or for a requirements.txt file with pinned library versions to be used.
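For example, a pinned requirements.txt might look like the following (the package versions here are purely illustrative):

```text
# requirements.txt -- pin exact versions so every run installs the same code
tensorflow==2.3.0
pandas==1.1.2
scikit-learn==0.23.2
```

Installing with `pip install -r requirements.txt` then gives the same environment every time, rather than whatever the latest versions happen to be.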
A Dotscience project is a set of files (notebooks, scripts, datasets, models, other files) created by a user or forked from another project. It is the main unit of organization, corresponding to a given data science aim such as performing related model experiments, solving all or part of a particular business problem, or collaborating. A generalization from projects is for users and projects to be parts of groups or organizations of users, which is an RBAC component.
Prometheus is a time series database commonly used for monitoring deployed applications. In Dotscience, it is supported as a method of monitoring the outputs from deployed ML models and passing them to Grafana for visualization. Its key feature is that arbitrary queries can be performed on the data being monitored, via its query language PromQL. This enables the needed quantities for a given business problem and model deployment to be monitored. Often the query is also written in Grafana, since it has an interface to do this.
Provenance in Dotscience refers to where a dataset came from, and its exact versioning. This is crucial for reproducibility and auditability of any analysis since if you don’t know where the data came from you cannot guarantee a result can ever be reproduced. Dotscience records provenance via dataset and run versioning, and when runs are published to the system produces a guaranteed-correct provenance graph that enables users to visualize their analysis.
In the Dotscience Python library, ds.publish() indicates that the state of the system is to be recorded at this point. Typically this is at the end of a run, where the state is recorded, hence making the run reproducible. It is similar to making a commit in Git. The underlying Dotmesh functionality of snapshotting application states is what allows runs to be published in this way.
A pull request in Git and similar systems is the mechanism someone uses to notify collaborators that they have completed some work on their own branch that can be merged back into the main project. In Dotscience, this is how, for example, some improvement from a collaborator can be incorporated: they have forked the project onto their own branch and are now asking to merge it back into the “master” project. Dotscience allows you to approve or reject such requests in the usual way. It is called a “pull request” after the Git pull command that is used to do the merge.
RBAC is role-based access control, a requirement in many businesses to make sure that only authorized users can see or work on a given analysis or dataset. Dotscience has some basic RBAC (as well as security!), and more is being added.
Also known as Dotscience anywhere. The simplest way to use Dotscience is to log on to the SaaS and use the GUI + JupyterLab. However, for most real customer projects users will want something like the model deployment process to also be scriptable. A way to do this is Dotscience remote mode, accessing the hub via the Dotscience Python library function ds.connect() from whatever machine or script you have.
A REST API (application programming interface that uses representational state transfer) is what allows web services and other parts of a computer system to communicate with each other. In Dotscience, a typical example would be a deployed model, which is on an endpoint, and data can be sent to it via its REST API.
In Dotscience, a run is any arbitrary code between the lines that the user declares to be the start and end of a run. This is done via the Dotscience Python library functions ds.start() and ds.publish(), or some variations on this. The ability of our underlying Dotmesh tooling to capture application states means that the users’ runs of code can be captured as states that become entries in the hub, in other words like a Git repo for code runs. This means that each run that a user does is versioned, and so they are reproducible.
Some of our material distinguishes between a data run and a model run, but the underlying object is the same: a model run is just a data run that has a model in it. The distinction can be useful in differentiating between a pure data preparation project phase, and model building.
Runner is a DevOps term that refers to the machine on which your computation is being executed. In Dotscience, we use “bring your own compute”, which means bring your own runner. The runner can in fact be the managed one on the Dotscience hub when being used for demo or evaluation purposes, but typically customers will have their own machines. This setup has the advantage that basically any machine can be a runner, including your own, another machine on-premise, a virtual machine, or a cloud instance. Necessary hardware, such as GPUs or TPUs for deep learning with large training sets, can therefore be easily brought in as needed.
Software as a service describes a piece of software, generally on the cloud, that can be used entirely online without the user having to set anything up on their own machine. This is how the Dotscience hub works: you can sign up for and use our GUI. Dotscience also supplies Python SDK and terminal command line interfaces for those who want things more scriptable (which is most actual customer projects).
Scikit-learn is a popular machine learning library for data scientists using Python. While tools like TensorFlow scale better for large data, Scikit-learn is often used for experiments, prototyping, demos, or teaching, and so it makes sense for our platform to support it.
Dotscience script mode refers to telling the system that you will be using a scripting method of running your analysis, such as a Python analysis from a .py script. This is as opposed to interactive mode, where you are, for example, running a Jupyter notebook within our integrated JupyterLab. You can tell Dotscience to use script mode via ds.script() in the Dotscience Python library.
A secret is a non-public piece of authentication information, used in Dotscience, as in many other pieces of software, to regulate access. Examples include credentials for accessing data via our integration with Amazon S3, and API keys for runners to access the Dotscience hub.
Supervised learning refers to the common method of training machine learning models by comparing their predictions against a known ground truth. The model’s performance can then be seen by some metric that measures this such as accuracy or mean squared error. Dotscience does not require your model to be supervised, but that has been the most common use case so far.
TensorBoard is an extension of TensorFlow that allows visualization of experiments with ML models such as hyperparameter tuning and seeing which model is best. Enabling this within Dotscience currently gives more visualizations than the basic metric explorer that is supplied with our SaaS GUI.
TensorFlow is Google’s library for machine learning, primarily focused on deep learning, although it has other algorithms too such as decision trees. TensorFlow has been the library used most often in Dotscience so far, and our base Docker image comes with it already available. We support TensorFlow 2.x, which makes for a generally easier user experience than the 1.x versions.
TensorFlow Serving is the part of the TensorFlow software that is designed for serving machine learning models in production. Dotscience’s TensorFlow model deployment is based upon this, but the user does not have to use it directly.
Terraform is a tool to enable you to set up a stack of software, via infrastructure-as-code, for example on the cloud. Dotscience is adding Terraform support as a way to set us up on your own system (on-premise or cloud).
Unsupervised learning refers to building a machine learning model when there is no ground truth to train on. A common example is most forms of clustering, where for example a distance metric might be minimized to assign cluster memberships to data points, but there are only data points and no labels. In Dotscience, unsupervised learning is supported because you can execute arbitrary code within any run. Like all models, unsupervised models can be viewed as a step in a dataflow that transforms a dataset, so because deployment also supports arbitrary pre- and post-processing, they can be deployed too. The emphasis, however, of our currently supported frameworks (TensorFlow and Scikit-learn) is more on supervised learning.
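As a minimal illustration of unsupervised learning, here is a toy one-dimensional k-means clustering: cluster centers are found from the data points alone, with no labels involved at any point:

```python
import random

# Toy 1-D k-means: assign each point to the nearest of k centroids,
# recompute the centroids as cluster means, and repeat.
def kmeans_1d(points, k=2, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]  # update step
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
centers = kmeans_1d(data, k=2)                 # recovers the two clear groups
```

Real clustering work would use a library implementation (e.g. in Scikit-learn), but the sketch shows the defining property: the algorithm is driven only by distances between points, never by ground truth.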
vSphere is the cloud computing virtualization platform from VMware that allows managing virtual machine infrastructure at large scale. Dotscience will be adding support for this, along with Terraform, as another option among our ways to install the software.
ZFS is the file storage system that Dotscience is built upon, via Dotmesh. Unlike most filesystems, ZFS is aware of the storage system at both the filesystem level and the physical block storage level. Because ZFS knows which blocks changed on disk, it can keep files synchronized across any two Linux systems by tracking only their changes, rather than recopying the data each time, so dataflows with large datasets can be handled efficiently.