Easy access to training data for Machine Learning with Tail Refinaria

Other posts in this series:
1. Machine Learning at Tail Refinaria;
2. Easy access to training data for Machine Learning with Tail Refinaria;
3. Making feature engineering easier with Tail Refinaria;
4. Training and deploying a Machine Learning model with Tail Refinaria;
5. Using Data Science notebooks for machine learning model training and deployment;
6. We need to talk about Model Lineage.
As we discussed in our previous article, obtaining, cleaning, sampling and preparing training data is one of the most time-consuming tasks when creating a machine learning model in a Data Science project.
Tail Refinaria is our Data Engineering environment. It provides a set of features for building Data Lakes and Data Science pipelines, including a Data Catalog, Data Lineage, Spark SQL-compatible Data Science notebooks, batch execution of Apache Spark pipelines, pipeline scheduling, dataset license tracking, data visualization, and more.
When we decided to add Machine Learning support to Tail Refinaria, one of our main goals was to leverage our Data Catalog to make accessing training data easier while still enforcing all the access control policies we already have in place for our Data Lake.
Here is how it works:
1- The first step is to create an interactive pipeline, a.k.a. a notebook, to read the data in the Data Lake and prepare it for training. This pipeline can later be scheduled to retrain the model when new training data is available.
Note that when the interactive pipeline is connected, Tail Refinaria attaches it to an Apache Spark cluster, which allows live experimentation while we write the cells. We will later schedule this pipeline for batch execution, deploying to production the same code we used in our experiments.
2- The next step is to select which datastores from the Data Lake should be read to train the model. In this example, we will use the data from the excellent k-means tutorial by Google.
3- Once we select the datastore, we can read it to start experimenting. Tail Refinaria automatically creates samples of all datastores in the Data Lake, so we can choose to run our experiments on a sample to test the algorithms faster, and later schedule the model training on the full dataset.
4- Next we can inspect the content of the selected datastore:
5- Once we have read the datastores we need, we can add a Spark SQL cell to the pipeline and write code to transform and select the data for training.
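As a rough illustration of the kind of code such a cell might hold, the sketch below joins two small tables and aggregates them into per-station features. It is not Tail Refinaria code: it uses Python's built-in sqlite3 module as a stand-in for a Spark SQL cell so that it runs anywhere, and the table names (`hires`, `stations`), the columns and the rows are all made up for the example.

```python
import sqlite3

# Stand-in for two datastores that were already read into the pipeline.
# Table names, columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stations (id INTEGER PRIMARY KEY, name TEXT, docks INTEGER);
    CREATE TABLE hires (start_station_id INTEGER, duration_sec INTEGER);
    INSERT INTO stations VALUES (1, 'Central', 20), (2, 'Riverside', 12);
    INSERT INTO hires VALUES (1, 600), (1, 1200), (2, 300);
""")

# The transformation itself: join the two tables, aggregate per station,
# and keep only the columns that will be used as training features.
TRAINING_QUERY = """
    SELECT s.name              AS station_name,
           COUNT(*)            AS num_trips,
           AVG(h.duration_sec) AS avg_duration_sec,
           s.docks             AS docks
    FROM hires h
    JOIN stations s ON s.id = h.start_station_id
    GROUP BY s.id, s.name, s.docks
    ORDER BY s.name
"""

training_rows = conn.execute(TRAINING_QUERY).fetchall()
for row in training_rows:
    print(row)
```

The query sticks to plain SQL (a join, `GROUP BY`, `COUNT` and `AVG`), so the same statement should also be acceptable to Spark SQL once the two datastores are exposed as tables.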
So far, we have accessed our Data Catalog, selected two datastores that were already available in the company Data Lake, read samples of them in our experimentation Spark cluster, and written and executed Spark SQL code to transform the data.
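The idea of experimenting on a sample and later running the same code on the full dataset can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: Tail Refinaria creates its samples automatically and the article does not describe how, so `take_sample`, `prepare` and the generated rows below are hypothetical.

```python
import random


def take_sample(rows, fraction, seed=42):
    """Return a reproducible random subset with roughly `fraction` of the rows."""
    rng = random.Random(seed)  # fixed seed: the same sample on every run
    return [row for row in rows if rng.random() < fraction]


def prepare(rows):
    """Hypothetical preparation step: keep only rows with a positive duration."""
    return [row for row in rows if row["duration_sec"] > 0]


# Made-up data standing in for a datastore in the Data Lake.
full_dataset = [
    {"trip_id": i, "duration_sec": (i % 7) * 100} for i in range(10_000)
]
sample = take_sample(full_dataset, fraction=0.01)

# Experiment quickly on the sample...
print(len(prepare(sample)), "rows prepared from the sample")
# ...then run the very same function on everything in the scheduled job.
print(len(prepare(full_dataset)), "rows prepared from the full dataset")
```

Because `prepare` never needs to know whether it received the sample or the full dataset, the code tested interactively is the code that runs in batch.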
If we needed third-party data sources to train our model (census data, zip code databases, etc.), we could visit the Tail Refinaria Marketplace and acquire new datasets to use in our model.
Of course, with Tail Refinaria, Data Scientists only have access to the data they have permission to read. In addition, data can be anonymized automatically by the import pipeline when it is added to the Data Lake, which adds another layer of privacy.
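To make the anonymization step more concrete, here is a minimal sketch of one common technique, replacing direct identifiers with a keyed hash at import time. The article does not say which method Tail Refinaria applies, so this is an assumption for illustration only; the key, the column names and the record are hypothetical, and strictly speaking keyed hashing is pseudonymization rather than full anonymization.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets manager.
SECRET_KEY = b"example-key-not-for-production"
# Hypothetical set of columns that hold direct identifiers.
SENSITIVE_COLUMNS = {"email"}


def pseudonymize(value: str) -> str:
    """Replace a value with its keyed SHA-256 digest (HMAC), as hex."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict) -> dict:
    """Return a copy of the record with the sensitive columns pseudonymized."""
    return {
        column: pseudonymize(value) if column in SENSITIVE_COLUMNS else value
        for column, value in record.items()
    }


raw = {"email": "ana@example.com", "city": "São Paulo", "purchases": 3}
clean = anonymize_record(raw)
print(clean)
```

The same input always maps to the same digest, so records can still be joined on the pseudonymized column without exposing the original value.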
Accessing and transforming data for training became a lot easier with Tail Refinaria. In the next article, we’ll show how to add feature engineering algorithms to our pipeline.
This post was written by Fabiane Bizinella Nardon (@fabianenardon), Chief Scientist at Tail. Fabiane has an MSc in Computer Science and a PhD in Electronic Engineering. She is an expert in Data Engineering and Machine Learning Engineering. She is the program committee leader for the Machine Learning Engineering track of QCon São Paulo, a frequent speaker on the subject and the author of several articles. She was also named a Java Champion by Sun Microsystems in recognition of her contribution to the Java ecosystem.