We need to talk about Model Lineage

Other posts in this series:

1. Machine Learning at Tail Refinaria;

2. Easy access to training data for Machine Learning with Tail Refinaria;

3. Making feature engineering easier with Tail Refinaria;

4. Training and deploying a Machine Learning model with Tail Refinaria;

5. Using Data Science notebooks for machine learning model training and deployment;

6. We need to talk about Model Lineage.


In the previous article, we showed how Tail Refinaria provides a Data Science notebook that can be used to experiment with and deploy machine learning models. In this article, we will discuss how we can track the history of a model.

Data Lineage is a well-known topic in Data Science, although few tools provide an easy way to implement it. Data Lineage holds the history of the data: where it came from, how it was transformed, and the license under which it can be used.

Tail Refinaria provides Data Lineage for all the datastores in the data lake, automatically tracking when a pipeline changes the data and, more importantly, keeping track of the original license of each piece of data. This is important for complying with privacy legislation such as GDPR or LGPD (the Brazilian data privacy legislation). In the screenshot below, you can see the Data Lineage automatically generated for a datastore in Tail Refinaria.

As more and more machine learning models are used to decide aspects of our lives, knowing how a model was created, which data was used to train it, and so on becomes very important, both for model maintainability and for transparency about how decisions were made.

Model Lineage keeps the history of a model: when it was trained, and with which data, algorithms and parameters. This history should be recorded automatically each time a new version of a model is trained.
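To make the idea concrete, here is a minimal sketch of what such a history could look like as a data structure. It is not Tail Refinaria's implementation or API: the class names, the fields, the "London Stations" datastore name and the `k` values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelLineageEntry:
    """One entry per trained model version (hypothetical schema)."""
    model_name: str
    version: int
    trained_at: datetime
    algorithm: str
    parameters: dict
    training_datastores: tuple


@dataclass
class ModelLineage:
    """Append-only history of a model (illustrative only)."""
    model_name: str
    entries: list = field(default_factory=list)

    def record_training(self, algorithm, parameters, training_datastores):
        # Meant to be called by the training step itself, so an entry is
        # created for every new version instead of being written by hand.
        entry = ModelLineageEntry(
            model_name=self.model_name,
            version=len(self.entries) + 1,
            trained_at=datetime.now(timezone.utc),
            algorithm=algorithm,
            parameters=dict(parameters),
            training_datastores=tuple(training_datastores),
        )
        self.entries.append(entry)
        return entry


# Hypothetical usage: two training runs with different parameters.
lineage = ModelLineage("London Bike – kmeans")
lineage.record_training("kmeans", {"k": 4}, ["London Stations"])
lineage.record_training("kmeans", {"k": 6}, ["London Stations"])
print([(e.version, e.parameters) for e in lineage.entries])
```

The point of the sketch is that the entry is written by the training step, not by the person running it, which is what makes the lineage automatic.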

At Tail Refinaria, we provide Model Lineage, tracking automatically the training pipelines and the data used in the training. The screenshot below shows the Model Lineage of the London Bike – kmeans model we created in the previous articles.

Note that for each model version, Tail Refinaria keeps track of which pipeline was used for training, the data that was involved in the training and the data license of each datastore used. If we inspect the pipeline used for training, we can access all the parameters, transformations, feature engineering techniques, etc. used to produce that model.
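As a rough illustration of why it is useful to keep the license next to each datastore, the sketch below answers the question "which licenses apply to this model version?". The types, the pipeline name, the datastore names and the license labels are hypothetical placeholders, not values taken from Tail Refinaria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datastore:
    name: str
    license: str  # placeholder label, not a real license name


@dataclass(frozen=True)
class ModelVersion:
    version: int
    pipeline: str       # name of the training pipeline (hypothetical)
    datastores: tuple   # datastores read during training


def licenses_for(model_version):
    """Map each datastore used in training to its data license."""
    return {ds.name: ds.license for ds in model_version.datastores}


v1 = ModelVersion(
    version=1,
    pipeline="train-london-bike-kmeans",
    datastores=(
        Datastore("London Stations", "license-A"),
        Datastore("London Trips", "license-B"),
    ),
)
print(licenses_for(v1))
```

With the license stored per datastore and the datastores stored per version, a compliance check becomes a simple lookup rather than an investigation.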

Now, as machine learning models are used to produce new data, the Data Lineage of a datastore should also track the models involved in producing that data. The screenshot below shows the Data Lineage of the “London Stations Predicted” datastore. Note that the Data Lineage timeline shows that part of the data was generated by the “London Bike – kmeans” model.
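One simple way to picture this link is a lineage timeline whose events can optionally name the model that produced the data. The sketch below is an assumption about how such a timeline could be represented; the event descriptions and pipeline names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageEvent:
    description: str
    pipeline: str
    model: Optional[str] = None  # set only when a model generated the data


def models_involved(timeline):
    """Return the distinct models that produced data, in timeline order."""
    seen = []
    for event in timeline:
        if event.model is not None and event.model not in seen:
            seen.append(event.model)
    return seen


# Hypothetical timeline for the "London Stations Predicted" datastore.
timeline = [
    LineageEvent("station data loaded", "ingest-stations"),
    LineageEvent("cluster predictions written", "predict-clusters",
                 model="London Bike – kmeans"),
]
print(models_involved(timeline))
```

From the model name in the event, a reader can follow the trail back to the Model Lineage of that model and see how it was trained.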

Model Lineage is important for transparency and traceability, but it is still not available in many corporate environments. We believe that Tail Refinaria provides an easy way of doing it, and we hope more companies will acknowledge the importance of knowing how machine learning models were created.


This post was written by Fabiane Bizinella Nardon (@fabianenardon), Chief Scientist at Tail. Fabiane has an MSc in Computer Science and a PhD in Electronic Engineering. She is an expert in Data Engineering and Machine Learning Engineering. She is the program committee leader for the Machine Learning Engineering track of QCon São Paulo, a frequent speaker on the subject and the author of several articles. She was also named a Java Champion by Sun Microsystems in recognition of her contribution to the Java ecosystem.