Since 80% of a Data Science project is data engineering (some say 90%), we at Tail spent the last year and a half working on a project to help solve this problem, so we could focus on the 20% where all the fun is.
Here’s what we did with Tail Refinaria:
1- We wanted to make it easier to create data catalogs, so the data is easy to find when needed. You give the tool a sample of the real data you'll upload to the Data Lake, and we'll automatically analyze it and detect the data types.
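To make the idea concrete, here is a minimal sketch of detecting semantic types from a data sample. The patterns, type names, and function names are illustrative assumptions, not Tail Refinaria's actual API:

```python
import re

# Hypothetical semantic type detection from a sampled dataset.
# Pattern set and type names are assumptions for illustration only.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "integer": re.compile(r"^-?\d+$"),
}

def detect_semantic_type(values):
    """Return the first semantic type that matches every sampled value."""
    for type_name, pattern in PATTERNS.items():
        if all(pattern.match(v) for v in values):
            return type_name
    return "string"  # fallback when nothing more specific matches

# A small sample, as a user might hand to the catalog tool.
sample = {
    "signup": ["2019-01-05", "2019-02-17"],
    "contact": ["ana@example.com", "bob@example.com"],
    "age": ["34", "27"],
}
catalog = {col: detect_semantic_type(vals) for col, vals in sample.items()}
```

A real catalog would of course use many more detectors (CPF, phone numbers, currencies, and so on), but the one-pass-over-a-sample idea is the same.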
2- Once we have semantic data types, we can automatically validate them when the real data comes in, and even detect PII so we can suggest automatic anonymization when the data is uploaded to the Data Lake.
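One common way to implement this step is to validate each value against its cataloged semantic type and pseudonymize columns whose type is flagged as PII, for example via salted hashing. The names below (`PII_TYPES`, `anonymize`, the salt) are assumptions for illustration, not the tool's real implementation:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PII_TYPES = {"email", "cpf", "phone"}  # semantic types treated as PII (illustrative)

def validate(value, semantic_type):
    """Check an incoming value against its cataloged semantic type."""
    if semantic_type == "email":
        return bool(EMAIL_RE.match(value))
    return True  # other types would have their own validators

def anonymize(value, salt="datalake-salt"):
    """One-way pseudonymization: joins still work, but the raw PII is gone."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

# Each column carries (value, semantic_type) from the catalog.
row = {"contact": ("ana@example.com", "email"), "age": ("27", "integer")}
clean = {
    col: anonymize(val) if sem in PII_TYPES and validate(val, sem) else val
    for col, (val, sem) in row.items()
}
```

Hashing (rather than dropping the column) keeps the field usable as a join key while removing the personal data itself.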
3- We always ask which legal basis the company has for using that dataset. This is important so data scientists will know whether they have legal permission to use the data for a particular task (see the GDPR or the Brazilian LGPD).
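A catalog entry for this could be as simple as a metadata record per datastore. The field names below are hypothetical, sketching what such a record might track:

```python
from dataclasses import dataclass

@dataclass
class DatastoreMetadata:
    """Illustrative catalog record tying a dataset to its legal basis."""
    name: str
    legal_basis: str          # e.g. "consent", "legitimate interest"
    allowed_purposes: tuple   # tasks the legal basis covers

    def permits(self, purpose: str) -> bool:
        return purpose in self.allowed_purposes

meta = DatastoreMetadata("customers", "consent", ("churn-prediction",))
```

With this in place, a pipeline can check `meta.permits(...)` before a data scientist uses the dataset for a given task.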
4- When creating the datastore, we offer possible Data Augmentations, so the data import pipeline will automatically augment the data when new data is uploaded to the Data Lake.
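Conceptually, an augmentation is a function that derives new columns from a row at import time. A minimal sketch, with made-up augmentation names:

```python
from datetime import date

# Each augmentation derives new columns from a row; names are illustrative.
def add_age_bucket(row):
    row["age_bucket"] = "under_30" if row["age"] < 30 else "30_plus"
    return row

def add_signup_year(row):
    row["signup_year"] = date.fromisoformat(row["signup"]).year
    return row

# Augmentations chosen when the datastore was configured.
AUGMENTATIONS = [add_age_bucket, add_signup_year]

def augment(row):
    for step in AUGMENTATIONS:
        row = step(row)
    return row

augmented = augment({"age": 27, "signup": "2019-02-17"})
```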
5- When the datastore configuration is done, we'll automatically create an Apache Spark SQL pipeline that fires when new data is uploaded to a data receptor folder. The pipeline will do all the cleaning, anonymization, and augmentation requested.
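The generated import pipeline can be pictured as a fixed chain of stages applied to every batch that lands in the receptor folder. The stage names and trigger mechanism below are assumptions, sketched in plain Python rather than Spark:

```python
# Illustrative stage chain for the generated import pipeline.
def clean(rows):
    """Drop rows that fail basic validation (here: missing age)."""
    return [r for r in rows if r.get("age") is not None]

def anonymize_stage(rows):
    """Mask PII columns; a placeholder for the real pseudonymization."""
    return [{**r, "contact": "***"} for r in rows]

def augment_stage(rows):
    return [{**r, "adult": r["age"] >= 18} for r in rows]

PIPELINE = [clean, anonymize_stage, augment_stage]

def run_pipeline(rows):
    """Run every configured stage in order, as if triggered by a new file."""
    for stage in PIPELINE:
        rows = stage(rows)
    return rows

out = run_pipeline([{"age": 20, "contact": "x@y.com"}, {"age": None}])
```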
6- When we process the data, our Apache Spark SQL import pipeline will even automatically create a sample of the data, making it easier to run experiments.
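One common technique for drawing a uniform fixed-size sample in a single pass while data streams through an import pipeline is reservoir sampling. This is a general-purpose sketch, not necessarily the tool's actual sampling strategy:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniformly sample k items from a stream of unknown length in one pass."""
    rng = rng or random.Random(0)  # seeded here only for reproducibility
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(10_000), 100)
```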
7- And we generate Data Lineage for all datastore changes, representing it as a timeline to make the data changes easier to understand and track. The Data Lineage is searchable by date and by the pipeline executed.
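A lineage timeline like this boils down to an append-only event log with a search over it. The class and field names here are hypothetical:

```python
from datetime import datetime

class LineageLog:
    """Illustrative append-only lineage timeline for a datastore."""
    def __init__(self):
        self.events = []

    def record(self, when, pipeline, change):
        self.events.append({"when": when, "pipeline": pipeline, "change": change})

    def search(self, pipeline=None, since=None):
        """Filter events by pipeline name and/or start date."""
        return [
            e for e in self.events
            if (pipeline is None or e["pipeline"] == pipeline)
            and (since is None or e["when"] >= since)
        ]

log = LineageLog()
log.record(datetime(2019, 1, 5), "import", "loaded 10k rows")
log.record(datetime(2019, 2, 1), "augment", "added age_bucket column")
hits = log.search(pipeline="augment")
```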
8- With the Data Lake created, you can execute interactive (notebook-like) pipelines. In fact, we create a Scala Spark notebook with Spark SQL commands to represent the pipeline.
9- When you create a pipeline, we'll generate Apache Spark SQL code behind the scenes, and this will be your pipeline, which you can schedule and execute. This way, the same pipeline from the experimentation phase goes to production without extra work.
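One way to picture this generation step: each interactive transformation becomes a CTE in a single SQL statement, so the experiment and the production job share the same code. The step representation below is an assumption for illustration:

```python
def to_sql(source_table, steps):
    """Turn an ordered list of SELECT expressions into one chained SQL query."""
    ctes = []
    prev = source_table
    for i, select in enumerate(steps):
        name = f"step_{i}"
        ctes.append(f"{name} AS (SELECT {select} FROM {prev})")
        prev = name  # each step reads from the previous one
    return "WITH " + ",\n".join(ctes) + f"\nSELECT * FROM {prev}"

# Two hypothetical notebook steps: derive a flag, then project columns.
sql = to_sql("raw_customers", ["*, age >= 18 AS adult", "name, adult"])
```

The generated string can then be handed to a scheduler unchanged, which is what makes the experimentation-to-production handoff free.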
10- You can even send the visualizations created in the pipeline to a custom dashboard, which will be automatically updated each time the pipeline is executed.
11- With security, Data Lineage, a data catalog, legal basis tracking, and easy-to-find datasets, we hope to provide a real Data Lake and not a data swamp, as Alex Gorelik calls it in his book The Enterprise Big Data Lake.
12- There are many more features in Tail Refinaria: connectors for custom data types, a plugin architecture for adding transformations, transaction queues, logging, data export, integration with Data Studio, support for custom Spark SQL code, etc. If you are interested in knowing more, feel free to contact us.
This post was written by Fabiane Bizinella Nardon (@fabianenardon), Chief Scientist at Tail. Fabiane has an MSc in Computer Science and a PhD in Electronic Engineering. She is an expert in Data Engineering and Machine Learning Engineering. She is the program committee leader for the Machine Learning Engineering track of QCon São Paulo, a frequent speaker on the subject, and the author of several articles. She was also chosen as a Java Champion by Sun Microsystems in recognition of her contribution to the Java ecosystem.