Using Data Science notebooks for machine learning model training and deployment
Other posts in this series:
1. Machine Learning at Tail Refinaria;
2. Easy access to training data for Machine Learning with Tail Refinaria;
3. Making feature engineering easier with Tail Refinaria;
4. Training and deploying a Machine Learning model with Tail Refinaria;
5. Using Data Science notebooks for machine learning model training and deployment;
6. We need to talk about Model Lineage.
In the previous article, we showed how machine learning models can be trained and deployed using Tail Refinaria.
Data Science notebooks, such as Jupyter and Zeppelin, have become some of the most widely used tools in Data Science. They let practitioners experiment with methods and, at the same time, document the process used to obtain the results.
Often, however, these tools do not provide a production environment for training and deploying models at scale. Either they are not integrated with the company's Data Lake, so Data Scientists cannot easily find and access data, or they are not seamlessly connected to a cluster that can process large datasets.
A common scenario is to run experiments in a Data Science notebook tool and then reprogram the experiment with other tools to deploy it to production. More recently, strategies such as executing Jupyter notebooks with Papermill have emerged to try to solve this problem.
We believe that when creating Data pipelines, the same code produced during the experimentation phase should be used in production, eliminating overhead and decreasing the number of errors. The same approach should be used for training and deploying machine learning models.
At Tail Refinaria we provide a Data Science notebook tool for creating experimental pipelines. If an experiment is successful, we can schedule the same pipeline to run in production.
The code produced by our notebooks is Apache Spark SQL code, so the pipelines can run on an Apache Spark cluster and therefore support large datasets.
As you saw in the previous articles, while training and experimenting with our model we created cells as we would in a typical notebook and executed them interactively to simulate the results. Once we were happy with the results, we scheduled the same pipeline to run in production. This eliminates the overhead of reprogramming pipelines for production.
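To make the idea concrete, here is a minimal sketch in plain Python of one pipeline definition serving both phases. It is not Tail Refinaria code: `read_datastore`, `run_pipeline`, the two steps and the in-memory `DATASTORE` are hypothetical names made up for illustration, and plain lists stand in for Spark datasets. The only thing that changes between the experiment and the production run is whether a sample or the full data is read.

```python
import random
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def read_datastore(rows: List[Row], sample_fraction: Optional[float] = None,
                   seed: int = 0) -> List[Row]:
    """Hypothetical read step: return every row, or a reproducible random sample."""
    if sample_fraction is None:
        return list(rows)
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < sample_fraction]


def keep_active(rows: List[Row]) -> List[Row]:
    """Keep only rows flagged as active."""
    return [row for row in rows if row["active"]]


def add_spend_per_visit(rows: List[Row]) -> List[Row]:
    """Derive a new column from two existing ones."""
    return [{**row, "spend_per_visit": row["spend"] / row["visits"]} for row in rows]


# The pipeline is defined once and reused unchanged in both phases.
PIPELINE: List[Step] = [keep_active, add_spend_per_visit]


def run_pipeline(rows: List[Row], steps: List[Step],
                 sample_fraction: Optional[float] = None) -> List[Row]:
    """Read the data (sampled or full) and apply each step in order."""
    data = read_datastore(rows, sample_fraction)
    for step in steps:
        data = step(data)
    return data


# Made-up in-memory data standing in for a datastore in the Data Lake.
DATASTORE: List[Row] = [
    {"user_id": i, "active": i % 2 == 0, "visits": 1 + i % 5, "spend": float(i)}
    for i in range(1000)
]

experiment = run_pipeline(DATASTORE, PIPELINE, sample_fraction=0.1)  # fast, sampled
production = run_pipeline(DATASTORE, PIPELINE)                       # same steps, full data
print(len(experiment), len(production))
```

Because the steps are shared, anything validated on the sample is exactly what runs on the full dataset.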
Tail Refinaria notebooks support a variety of cell types, instead of just text and code cells like most notebook tools. With more specialized cells, we can:
1- Use Read cells to access our Data Catalog and select datastores to read from the Data Lake, which makes finding training data a lot easier. When reading a datastore, the Data Scientist can also choose between reading a sample and reading the full dataset, which allows faster experimentation.
2- Use Spark SQL cells to apply advanced transformations to the data.
3- Use a Visualization Cell to plot graphs.
4- Use a Write Cell to write the results to another datastore. Write Cells take care of locking, versioning and all the burden of writing and updating data in the Data Lake.
5- Use an Export Cell to export the data to an external system, in different formats.
6- Use Text Cells to explain what was done in the notebook.
7- Use Projection, Transformation, Aggregation and Merge cells, which offer a wizard-like interface for producing Spark SQL code.
8- Use a Machine Learning cell to train and execute a model.
When editing a notebook in Tail Refinaria, one can connect to a Spark cluster with the click of a button and simulate the execution of the notebook cell by cell, testing the hypothesis before deployment.
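As a rough illustration of typed cells executed one at a time, the sketch below models a notebook as a list of Read, SQL and Write cells. Everything in it is an assumption made for the example: the cell classes, `run_cell`, `CATALOG` and `LAKE` are hypothetical names, and Python's built-in `sqlite3` stands in for Spark SQL so the example runs without a cluster. It does not reflect how Tail Refinaria implements its cells.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical catalog: datastore name -> column names and rows.
CATALOG = {
    "page_visits": {
        "columns": ["user_id", "pages"],
        "rows": [("ana", 3), ("bob", 5), ("ana", 4)],
    }
}
LAKE = {}  # stand-in for datastores written back to the Data Lake


@dataclass
class ReadCell:
    datastore: str  # catalog entry to load
    table: str      # table name that later cells can query


@dataclass
class SqlCell:
    table: str      # table name for this cell's result
    query: str      # SQL run against tables from earlier cells


@dataclass
class WriteCell:
    table: str      # table to persist
    datastore: str  # target datastore name


def run_cell(conn, cell):
    """Execute a single cell and return the rows of the table it refers to."""
    if isinstance(cell, ReadCell):
        entry = CATALOG[cell.datastore]
        columns = ", ".join(entry["columns"])
        marks = ", ".join("?" for _ in entry["columns"])
        conn.execute(f"CREATE TABLE {cell.table} ({columns})")
        conn.executemany(f"INSERT INTO {cell.table} VALUES ({marks})", entry["rows"])
    elif isinstance(cell, SqlCell):
        conn.execute(f"CREATE TABLE {cell.table} AS {cell.query}")
    elif isinstance(cell, WriteCell):
        LAKE[cell.datastore] = conn.execute(f"SELECT * FROM {cell.table}").fetchall()
    else:
        raise TypeError(f"unsupported cell type: {type(cell).__name__}")
    return conn.execute(f"SELECT * FROM {cell.table}").fetchall()


NOTEBOOK = [
    ReadCell(datastore="page_visits", table="visits"),
    SqlCell(
        table="totals",
        query="SELECT user_id, SUM(pages) AS total_pages "
              "FROM visits GROUP BY user_id ORDER BY user_id",
    ),
    WriteCell(table="totals", datastore="pages_per_user"),
]

# Run the notebook cell by cell, inspecting each intermediate result.
conn = sqlite3.connect(":memory:")
outputs = []
for cell in NOTEBOOK:
    rows = run_cell(conn, cell)
    outputs.append(rows)
    print(type(cell).__name__, rows)
```

Keeping each cell's result in a named table is what lets a later cell, or a person stepping through the notebook, inspect any intermediate state before the Write cell persists anything.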
Having a platform for building complex pipelines that serve both to run experiments and to produce production code has proved valuable to our customers, decreasing the time it takes to create new models and deploy them.
In the next article, we will cover another important part of model construction: Model Lineage.