Since 80% of a Data Science project is data engineering (some say 90%), we at Tail spent the last couple of years creating a product to solve this problem, so we could focus on the 20% where the fun is.
Tail Refinaria solves many common data engineering problems. It provides a data catalog and automatically detects data types to add semantics to that catalog. It provides a Data Lineage tool to track changes in datastores, and it tracks data licensing to make it easier to comply with regulations like GDPR and LGPD (the Brazilian privacy law). It provides a Spark SQL compatible Data Science notebook for experimentation, and it lets you create and schedule pipelines for batch processing using the same code created during interactive experimentation. It also automatically creates samples of the data for Data Scientists to use during experimentation, connects to Google Data Studio to allow for complex visualizations, and more.
With it, we were able to give our customers a tool that makes building and managing a Data Lake, and building and executing Data Pipelines, a lot easier. We believe we solved most of the Data Engineering problems very well with Tail Refinaria. We've been using the tool ourselves to create and run our pipelines and to manage our own Data Lake.
However, we noticed that creating, tracking and deploying machine learning models was still a hard and cumbersome task for most Data Science teams, including our own and our customers'. So we decided to solve this problem too.
Here are the main difficulties when creating and maintaining machine learning models that we wanted to solve:
Easy access to training data
Getting access to training data is perhaps one of the most time-consuming tasks when training a machine learning model. Usually, there are lots of ETLs, authorizations, sampling, cleaning, negotiations, etc. involved. And if you need external data sources, like census data, there is more work involved to gather the data and make it available in an environment for training. If the datasets are large, the Data Scientist will have to access some form of cloud storage or even a cluster capable of distributed processing.
Make feature engineering easier
Feature engineering is an essential step in machine learning. There are many well-known feature engineering algorithms, but choosing the right one, experimenting with them and, more importantly, applying the same feature engineering algorithms to the production dataset for prediction is still a task that is usually done by hand and is error-prone.
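To make the "same feature engineering in production" point concrete, here is a minimal sketch in plain Python. It is not how Tail Refinaria does it; the `MinMaxScaler` class, the column of ages and all values are hypothetical and only illustrate the idea of learning the transformation parameters once, on the training data, and reusing them unchanged at prediction time.

```python
from dataclasses import dataclass


@dataclass
class MinMaxScaler:
    """Hypothetical scaler whose parameters come from training data only."""
    minimum: float = 0.0
    maximum: float = 1.0

    def fit(self, values):
        # Learn the parameters from the training dataset.
        self.minimum = min(values)
        self.maximum = max(values)
        return self

    def transform(self, values):
        # Apply the learned parameters; never re-fit on production data.
        span = self.maximum - self.minimum
        if span == 0:
            return [0.0 for _ in values]
        return [(v - self.minimum) / span for v in values]


# Made-up example data.
training_ages = [18, 30, 42, 66]
production_ages = [24, 54]

scaler = MinMaxScaler().fit(training_ages)
scaled_training = scaler.transform(training_ages)
scaled_production = scaler.transform(production_ages)
```

The error-prone part in practice is the last line: if the production pipeline re-implements the scaling by hand, or fits it again on production data, the model receives features on a different scale from the one it was trained on.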
Deploy the trained model in production
When a model is trained and ready, how do you make it available for use in production? It is important to control the different model versions and keep the metrics produced by each version, so you can make sure a version is performing well and compare it with the others. It should then be easy to use a trained model in Data Science pipelines to produce predictions and insights. A model catalog should also be available, so the company can find which models exist, how they were trained and how good they are by inspecting the metrics produced. Sadly, most teams have none of this.
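As a rough illustration of what versioning plus metrics could look like, the sketch below keeps a tiny in-memory catalog. The `ModelCatalog` class, the model name and the metric values are all invented for this example; it shows the idea only and does not describe the Tail Refinaria implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCatalog:
    """Hypothetical catalog: model name -> {version number -> metrics}."""
    entries: dict = field(default_factory=dict)

    def register(self, name, metrics):
        # Each registration creates the next version number for that model.
        versions = self.entries.setdefault(name, {})
        version = len(versions) + 1
        versions[version] = dict(metrics)
        return version

    def best_version(self, name, metric):
        # Compare versions by one metric (higher is assumed to be better).
        versions = self.entries[name]
        return max(versions, key=lambda v: versions[v][metric])


# Made-up metrics for two versions of a hypothetical "churn" model.
catalog = ModelCatalog()
catalog.register("churn", {"accuracy": 0.81, "auc": 0.86})
catalog.register("churn", {"accuracy": 0.84, "auc": 0.85})

best_by_accuracy = catalog.best_version("churn", "accuracy")
best_by_auc = catalog.best_version("churn", "auc")
```

Even this toy version shows why the metrics must be stored per version: the two versions rank differently depending on which metric you inspect.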
Seamless integration with Data Science notebooks
Data Science notebooks are an excellent tool for experimentation. However, not all notebook environments have machine learning algorithms embedded and ready to use. Also, when training on large datasets, the notebook should have access to a cluster of machines capable of processing large amounts of data, but this is often not easily available to Data Scientists in the company.
Another aspect of machine learning model administration that we hardly ever see in company environments is model lineage. Just as a Data Lineage tool keeps the history of a datastore (which pipelines changed it, where the data came from, etc.), a model lineage tool should keep the history of a model. This would improve traceability and make retraining of models easier. For each model version, a model lineage tool should store which data was used for training, which transformations, feature engineering methods, algorithms and parameters were used, what the legal license of each training dataset was, and when the version was created.
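The fields listed above map naturally onto a small record per model version. The following is a sketch under our own assumptions: the class names (`TrainingDataset`, `ModelVersion`, `ModelLineage`), the dataset names, licenses and parameters are hypothetical and chosen only to mirror the list, not taken from any real tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TrainingDataset:
    name: str
    license: str  # legal license of the dataset used in training


@dataclass
class ModelVersion:
    version: int
    datasets: tuple              # which data was used for training
    transformations: tuple       # transformations applied to the data
    feature_engineering: tuple   # feature engineering methods
    algorithm: str
    parameters: dict
    created_at: datetime         # when this version was created


@dataclass
class ModelLineage:
    """Hypothetical history of one model, one entry per version."""
    model_name: str
    versions: list = field(default_factory=list)

    def record(self, **details):
        version = ModelVersion(
            version=len(self.versions) + 1,
            created_at=datetime.now(timezone.utc),
            **details,
        )
        self.versions.append(version)
        return version

    def licenses(self, version):
        # All distinct licenses involved in training a given version.
        datasets = self.versions[version - 1].datasets
        return sorted({d.license for d in datasets})


# Made-up example entry.
lineage = ModelLineage("churn")
v1 = lineage.record(
    datasets=(
        TrainingDataset("customers", "internal"),
        TrainingDataset("census", "public-domain"),
    ),
    transformations=("drop_null_rows",),
    feature_engineering=("min_max_scaling",),
    algorithm="logistic_regression",
    parameters={"max_iter": 100},
)
```

With a record like this, questions such as "which licenses does version 1 depend on?" become a lookup (`lineage.licenses(1)`), and retraining means replaying the stored datasets, transformations and parameters.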
Also, data generated by a model should be part of the datastore history. Thus, the Data Lineage tool should track data generated by that model, showing the algorithms, parameters and data involved in the model generation, making it possible to recreate the model and reproduce the data.
As you can see, these problems are pure machine learning engineering tasks. In the next articles, we'll show how we solved them in Tail Refinaria.
This post was written by Fabiane Bizinella Nardon (@fabianenardon), Chief Scientist at Tail. Fabiane has an MSc in Computer Science and a PhD in Electronic Engineering. She is an expert in Data Engineering and Machine Learning Engineering. She is the program committee leader for the Machine Learning Engineering track of QCon São Paulo, a frequent speaker on the subject and the author of several articles. She was also named a Java Champion by Sun Microsystems in recognition of her contribution to the Java ecosystem.