Hyperparameter Optimization

We look at how you can use Dotscience to explore the relationships between hyperparameters and metrics.

Create a new project in Dotscience and open JupyterLab. See this tutorial on creating projects and working with the Dotscience hosted JupyterLab for more information.

This tutorial is in 2 parts:

  1. Tune hyperparameters to optimise the precision and recall on a scikit-learn dataset of digits
  2. Use H2O AutoML to automatically tune the hyperparameters of a model on product backorders

Hyperparameter optimization with scikit-learn

The notebook for this tutorial can be found at https://github.com/dotmesh-io/demos/blob/master/sklearn-gridsearch/grid-search.ipynb.

Download our demos Git repository with

git clone https://github.com/dotmesh-io/demos.git

Navigate to your project on Dotscience, open a JupyterLab session and upload the notebook file grid-search.ipynb from the Git repository above. It can be found at demos/sklearn-gridsearch/grid-search.ipynb.

At the start of the notebook, we import the dotscience Python library and instrument our training with it. If you look closely at the notebook, you will notice that as we iterate through a collection of scores to optimise them, we record the summary statistics with ds.add_summary("param", value) for each parameter involved.

# `score` is set by an enclosing loop over the scoring metrics (precision
# and recall); `means` and `stds` are taken from clf.cv_results_.
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    ds.add_summary("%s-stdev" % (score,), std)   # record spread of CV scores
    ds.add_summary("%s-mean" % (score,), mean)   # record mean CV score
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
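For context, a condensed sketch of the grid search that produces the `means`, `stds` and `clf.cv_results_` used in the loop above is shown below. It follows the standard scikit-learn digits grid-search pattern the notebook is based on; the exact parameter grid is an assumption, and the ds.add_summary calls are shown only as a comment so the snippet runs without the Dotscience library installed.

```python
# Sketch of a precision/recall grid search on the scikit-learn digits
# dataset, assuming an SVC model and a small illustrative parameter grid.
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))  # flatten the 8x8 images
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Illustrative grid - the notebook's actual grid may differ.
param_grid = [{"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10]}]

for score in ["precision", "recall"]:
    clf = GridSearchCV(SVC(), param_grid, scoring="%s_macro" % score)
    clf.fit(X_train, y_train)
    means = clf.cv_results_["mean_test_score"]
    stds = clf.cv_results_["std_test_score"]
    for mean, std, params in zip(means, stds, clf.cv_results_["params"]):
        # In the instrumented notebook this is where ds.add_summary records
        # "%s-mean" % score and "%s-stdev" % score for each combination.
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
```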

Run the notebook by clicking Run → Run All Cells.

When the run completes, navigate to the Runs tab to see a summary of all the runs. Clicking on a run shows its provenance.

Now go to the Explore tab, where you can see a graphical representation of the optimisation we ran earlier.

The screen capture above shows the behaviour of the summary statistic precision-mean. From this we can draw conclusions about how each change to the hyperparameters affected the summary statistic. Clicking on an individual data point takes us to the run associated with that change.

You can also switch between multiple optimisations by selecting one from the Summary statistic field.

We have now demonstrated hyperparameter tuning on a simple machine learning model using a scikit-learn grid search. You can visualise the effect of tuning the parameters on the graph, and zoom in on runs where the summary statistics stand out.

Note that the problem is, in a sense, too easy: the hyperparameter plateau is too flat, and the output model is the same for precision and recall, with ties in quality.

Nevertheless, it demonstrates the principle of using Python code for hyperparameter optimization, augmented by the ds functions of the Dotscience Python library to automatically record versioning, provenance, parameters, and metrics within the system.

AutoML optimization with H2O

The setup for this is the same as in part 1, except that the notebook is demos/h2o/automl_binary_classification_product_backorders.ipynb.

Running the notebook shows how H2O can be used within Dotscience, with the model performances in the AutoML process tracked using the ds functions of the Dotscience Python library.

Note that the AutoML step takes a few minutes to complete.

In this case, we see that the stacked ensemble model combining the top individual models (mostly XGBoost and H2O’s GBM gradient-boosted decision trees) outperforms the individual models.
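To illustrate the stacking idea itself (this is not the H2O implementation, and the dataset here is a synthetic stand-in for the product-backorders data), a minimal sketch using scikit-learn's StackingClassifier shows how several base learners are combined by a meta-learner:

```python
# Minimal stacked-ensemble sketch: two base learners whose predictions are
# combined by a logistic-regression meta-learner, on synthetic binary data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a binary classification problem like backorders.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner over base predictions
)
stack.fit(X_train, y_train)
print("stacked ensemble accuracy: %.3f" % stack.score(X_test, y_test))
```

In H2O AutoML the analogous step happens automatically: the leaderboard's top models become the base learners of the stacked ensemble.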

Dotscience hyperparameter optimization in the future

Viewing the outputs of the tutorials in the Explore tab of your project will show that Dotscience’s current visualizations and integrations of scikit-learn and H2O are quite basic. These integrations and visualizations will be improved in the future. Nevertheless, the tracking and versioning of everything from the run is available, as with our other tutorials.