In the previous article, we showed how to add several stages of feature engineering to a Machine Learning pipeline cell to transform the data that will be fed to the machine learning algorithm. As a starting point, we used the data and example provided in Google's “Creating a k-means clustering model” article. The example uses the k-means algorithm to predict which cluster a London bike station belongs to.
When the data is cleaned and transformed, it is ready to be used to train a new model. After training, evaluating and approving a model, we still need to deploy it to be used in production by the company.
Model deployment is a common problem for many Data Science teams. Tasks like versioning models, keeping a model catalog so that models can be found, and running the model code efficiently in production are not easy to get right. Usually it is up to the Data Science and DevOps teams to come up with a strategy to accomplish these tasks.
With Tail Refinaria, we decided to provide methods that make all these tasks as easy and intuitive as possible.
Here is how you can do model training with Tail Refinaria:
1- After adding the feature engineering stages to the machine learning cell (see the previous article), we can choose a machine learning algorithm to execute. In this example, we will choose k-means. Note that the machine learning cell is in training mode.
In this example, we chose the “feature” column as the input and “prediction” as the column that will have the values predicted by the k-means algorithm. Note that “feature” is the name of the column we created in the feature engineering stage described in the previous article.
2- When we add the machine learning stage, we can tune the algorithm parameters. Here we set the k parameter (the number of clusters) to 4.
3- We can then execute the cell and inspect the model metrics. Note that when we execute the cell, all the stages are executed. This means that the feature engineering stage is executed before the model training.
4- If we are satisfied with the model, we can check the option “Save model” and give it a name. This way, when we execute this pipeline, a new model will be trained and persisted to be used in production later.
5- If we want to test the model, we can add a new machine learning cell to the pipeline, this time in prediction mode. We then select a dataset to test the model against and run the cell.
6- If we are happy with the results of our experimentation, we can save this pipeline and schedule it to train the model with a larger dataset. Note that the training will take place in an Apache Spark cluster, so you can use large datasets and leverage the distributed processing power of Apache Spark.
We can, of course, add other cells to the training pipeline. We could, for example, add a cell to write the test results to another datastore for inspection, or add a visualization cell to plot graphs with the test results.
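To make the training steps above more concrete, here is a minimal sketch of what a k-means training stage does with the “feature” and “prediction” columns and k set to 4. It is plain Python for illustration only, not the Tail Refinaria implementation (which runs on Apache Spark), and the station rows, the function names and the cost metric are all assumptions made for this example.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def train_kmeans(features, k=4, max_iter=20, seed=42):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(features, k)]
    for _ in range(max_iter):
        # Assignment step: group every point under its closest centroid.
        clusters = [[] for _ in range(k)]
        for point in features:
            clusters[nearest(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def cost(features, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(math.dist(p, centroids[nearest(p, centroids)]) ** 2 for p in features)


# Made-up rows standing in for the output of the feature engineering stage.
stations = [
    {"station": "A", "feature": [1.0, 1.0]},
    {"station": "B", "feature": [1.5, 0.5]},
    {"station": "C", "feature": [9.0, 1.0]},
    {"station": "D", "feature": [9.5, 1.5]},
    {"station": "E", "feature": [1.0, 9.0]},
    {"station": "F", "feature": [0.5, 9.5]},
    {"station": "G", "feature": [9.0, 9.0]},
    {"station": "H", "feature": [9.5, 8.5]},
]

features = [row["feature"] for row in stations]
centroids = train_kmeans(features, k=4)  # k = 4, as in step 2
predicted = [{**row, "prediction": nearest(row["feature"], centroids)} for row in stations]
model_cost = cost(features, centroids)  # one metric we could inspect after training
```

Because the initial centroids are sampled at random, different seeds can lead to different cluster assignments, which is one reason to inspect the metrics before saving a model.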
When we save and execute the pipeline, Tail Refinaria will add the new model to the Model catalog, so Data Scientists can access the models available and understand how they were created.
Here is how the Model Catalog works at Tail Refinaria:
1- In the model catalog we can see all the models created by the company in one place:
2- Inspecting the model, we can see the metrics, parameters and when it was trained:
The metrics presented depend on the algorithm used. For Logistic Regression, for example, it would show a confusion matrix and other metrics.
The Model Catalog holds more information than this, but we will cover it in a future article about Model Lineage.
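As a rough illustration of the kind of record a model catalog keeps, the sketch below stores the name, version, algorithm, parameters, metrics and training time of each model. The class names, field names and metric value are hypothetical and chosen for this example; they do not describe Tail Refinaria's internal data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelRecord:
    """One trained model version as it might appear in a catalog."""
    name: str
    version: int
    algorithm: str
    params: dict
    metrics: dict
    trained_at: datetime


class ModelCatalog:
    """In-memory catalog: every registration of a name creates a new version."""

    def __init__(self):
        self._records = {}  # model name -> list of ModelRecord, oldest first

    def register(self, name, algorithm, params, metrics):
        versions = self._records.setdefault(name, [])
        record = ModelRecord(
            name=name,
            version=len(versions) + 1,
            algorithm=algorithm,
            params=dict(params),
            metrics=dict(metrics),
            trained_at=datetime.now(timezone.utc),
        )
        versions.append(record)
        return record

    def latest(self, name):
        return self._records[name][-1]

    def list_models(self):
        return sorted(self._records)


catalog = ModelCatalog()
# The metric values below are placeholders, not real training results.
first = catalog.register("London Bike kmeans", "k-means", {"k": 4}, {"cost": 12.5})
second = catalog.register("London Bike kmeans", "k-means", {"k": 4}, {"cost": 11.0})
```

Keeping every version rather than overwriting the previous one is what lets a prediction pipeline pick up a retrained model while older results remain traceable.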
Once we have selected a model to use, it is time to put it into action. The model we just trained predicts the cluster of a bike station. So, let’s create a new pipeline, read some data and apply the trained model to it.
1- The first step is to add a read cell that selects the datastore from our Data Lake to which the model will be applied. Note that since this is an interactive pipeline (a.k.a. a notebook), we can simulate the result as we add the cells.
2- Next, we add a Machine Learning cell in the prediction mode. We select an external model, i.e., a model that was trained and persisted by another pipeline.
3- We selected the London Bike – kmeans model and the cell shows the inputs we should provide and the outputs the model will give us. We can now run the cell to test if the result is what we want.
4- Now we can decide what to do with the output. We could, for example, add a Write cell and save the output to another datastore, or add an Export cell and export the result to an external system, or even use the output as the input of a Spark SQL cell to query it and find new insights. Let’s create a write cell and save the result to another datastore.
This Write cell will save the predictions to a new datastore called London Stations Predicted. This new datastore will be added to the Data Lake and later when we inspect its Data Lineage, we will see that the London Bike – kmeans model was used to generate its data.
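The prediction steps above can be sketched as follows: read some rows, apply a persisted k-means model to the “feature” column, and write the result together with a lineage note naming the model. This is an illustrative plain-Python sketch under stated assumptions; the centroid values, the input rows and the `predict` and `write_datastore` helpers are hypothetical and are not part of Tail Refinaria.

```python
import math

# Hypothetical persisted model: for k-means, the cluster centers are all
# that is needed to assign new rows to clusters.
saved_model = {
    "name": "London Bike – kmeans",
    "version": 1,
    "centroids": [[0.0, 0.0], [0.0, 10.0], [10.0, 0.0], [10.0, 10.0]],
}


def predict(rows, model, input_col="feature", output_col="prediction"):
    """Add the index of the nearest centroid to each row."""
    centroids = model["centroids"]
    return [
        {
            **row,
            output_col: min(
                range(len(centroids)),
                key=lambda i: math.dist(row[input_col], centroids[i]),
            ),
        }
        for row in rows
    ]


def write_datastore(name, rows, model):
    """Bundle the rows with a lineage note recording which model produced them."""
    return {
        "name": name,
        "rows": rows,
        "lineage": {"model": model["name"], "model_version": model["version"]},
    }


# Made-up input rows standing in for the output of the read cell.
new_stations = [
    {"station": "X", "feature": [1.0, 2.0]},
    {"station": "Y", "feature": [9.0, 8.0]},
    {"station": "Z", "feature": [1.0, 9.0]},
]

predicted = predict(new_stations, saved_model)
datastore = write_datastore("London Stations Predicted", predicted, saved_model)
```

Recording the model name and version next to the written rows is the piece of information a lineage view needs to show which model generated a datastore.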
We can now save this pipeline and schedule it for execution. As you can see, running a model in production is as simple as adding it to a pipeline. As always, the pipeline will run in an Apache Spark cluster, so large datasets can be handled. Since models are versioned, if you train a new version of the same model, you can re-run this pipeline to produce updated data. In fact, using the Tail Refinaria scheduling module, you can even schedule the prediction pipeline to always run after the training pipeline.
So far, we have used interactive pipelines to train and run our models, and we have shown a little of how you can run experiments in Tail Refinaria. In the next article we will discuss why seamless integration with Data Science notebooks is important for increasing the productivity of Data Science teams.
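The idea of always running the prediction pipeline after the training pipeline can be pictured as a small dependency graph. The sketch below uses Python's standard `graphlib` module to order two stand-in pipelines; the pipeline names and the `state` dictionary are assumptions made for this example and say nothing about how the Tail Refinaria scheduling module is actually built.

```python
from graphlib import TopologicalSorter

# Shared state standing in for the model catalog and the output datastore.
state = {"model_version": 0, "predictions_built_with": None}


def training_pipeline():
    """Each training run produces a new version of the model."""
    state["model_version"] += 1


def prediction_pipeline():
    """Predictions are rebuilt with whatever version is the latest."""
    state["predictions_built_with"] = state["model_version"]


PIPELINES = {
    "train-kmeans": training_pipeline,
    "predict-stations": prediction_pipeline,
}

# "predict-stations" must run after "train-kmeans".
DEPENDS_ON = {"predict-stations": {"train-kmeans"}}


def run_schedule():
    """Run every pipeline once, respecting the declared dependencies."""
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        PIPELINES[name]()
    return order


first_order = run_schedule()  # trains version 1, then predicts with it
run_schedule()                # trains version 2, then refreshes the predictions
```

Declaring the dependency once means that every retraining is followed by a fresh prediction run, so the output data never lags behind the latest model version.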