In the previous article, we showed how to create an interactive pipeline, a.k.a. a Data Science notebook, to read data from the Data Lake and transform it using Spark SQL. The pipeline is composed of cells that can be executed in an Apache Spark cluster during the experimentation phase, providing instant feedback on how our experiments are behaving.
Applying feature engineering algorithms to the data prior to submitting them to the machine learning algorithm for training is an important step in any Data Science project.
Tail Refinaria leverages the feature engineering algorithms available in the Apache Spark MLlib library to make it easier to chain several feature engineering transformations as part of model construction. These same transformations will be applied to the real data later, when the model is executed to extract predictions.
Here is how it works:
1- We start by adding a Machine Learning cell to the pipeline we created in the previous article.
The Machine Learning cell can operate in the Training or Prediction mode. Training mode is for training models. Prediction mode is for executing a trained model over a dataset.
2- Next, we choose a dataset on which to test the trained model. In this example, we will choose the training dataset itself, but we could use another one.
3- We then select a feature engineering algorithm to apply. In this example, we will choose Vector Assembler, which combines several columns into a single vector column to be used by the machine learning algorithm later.
4- We then configure the feature engineering method, choosing the input columns to combine and the output column to generate.
5- Now that we have added the feature engineering stage, we can execute it to check whether the result is what we need.
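To make steps 3 to 5 more concrete, here is a minimal, dependency-free Python sketch of what a vector assembler does: it takes a list of input columns and writes their values, in order, into one output column. This is only an illustration of the idea, not Tail Refinaria's implementation. In Spark MLlib the corresponding transformer is `VectorAssembler`, configured with `inputCols` and `outputCol`, and it works on distributed DataFrames rather than Python lists. The function name `assemble_vector`, the column names and the sample rows are all hypothetical.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def assemble_vector(rows: List[Row], input_cols: Sequence[str], output_col: str) -> List[Row]:
    """Return copies of the rows with input_cols combined into one list under output_col."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows are left untouched
        # Keep the order of input_cols: it defines the position of each feature in the vector
        new_row[output_col] = [float(row[col]) for col in input_cols]
        assembled.append(new_row)
    return assembled


# Hypothetical sample data (the column names are made up for this illustration)
rows = [
    {"age": 34, "visits": 12, "avg_session_minutes": 3.5},
    {"age": 51, "visits": 3, "avg_session_minutes": 8.0},
]

# Step 4: choose the input columns to combine and the output column to generate
result = assemble_vector(rows, ["age", "visits", "avg_session_minutes"], "features")

# Step 5: execute the stage and inspect the result
for row in result:
    print(row["features"])
```

Each output row keeps its original columns and gains a `features` column holding the combined values, which is the shape a machine learning algorithm typically expects as input.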
At this point we could add several other stages, combining different feature engineering techniques. These stages will later be executed whenever the generated model is executed.
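The idea of chaining stages and re-executing them with the model can be sketched as follows, again in plain Python and under our own assumptions. Each stage has a `fit` step that learns whatever it needs from the training data and a `transform` step that applies it. The chain is fitted once during training and then reused, unchanged, on new data at prediction time. The class names below are hypothetical, and the scaling stage is a simplified stand-in for a second feature engineering technique. Spark MLlib offers its own distributed `Pipeline` and `MinMaxScaler`, whose behavior differs in the details.

```python
from typing import Dict, List

Row = Dict[str, object]


class VectorAssemblerStage:
    """Stateless stage: combines several columns into one vector column."""

    def __init__(self, input_cols: List[str], output_col: str):
        self.input_cols = input_cols
        self.output_col = output_col

    def fit(self, rows: List[Row]) -> "VectorAssemblerStage":
        return self  # nothing to learn from the data

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**row, self.output_col: [float(row[c]) for c in self.input_cols]}
            for row in rows
        ]


class MinMaxScalerStage:
    """Stateful stage: learns per-feature min and max during fit, rescales in transform."""

    def __init__(self, input_col: str, output_col: str):
        self.input_col = input_col
        self.output_col = output_col
        self.mins: List[float] = []
        self.maxs: List[float] = []

    def fit(self, rows: List[Row]) -> "MinMaxScalerStage":
        vectors = [row[self.input_col] for row in rows]
        self.mins = [min(dim) for dim in zip(*vectors)]
        self.maxs = [max(dim) for dim in zip(*vectors)]
        return self

    def transform(self, rows: List[Row]) -> List[Row]:
        out = []
        for row in rows:
            scaled = [
                (v - lo) / (hi - lo) if hi > lo else 0.0  # simplification for constant features
                for v, lo, hi in zip(row[self.input_col], self.mins, self.maxs)
            ]
            out.append({**row, self.output_col: scaled})
        return out


class FeaturePipeline:
    """Runs the stages in order; fit happens at training time, transform at prediction time."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows: List[Row]) -> "FeaturePipeline":
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)  # each stage sees the previous stage's output
        return self

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Training time: fit the chain of stages on (hypothetical) training data
training = [{"age": 20, "visits": 0}, {"age": 60, "visits": 10}]
pipeline = FeaturePipeline([
    VectorAssemblerStage(["age", "visits"], "features"),
    MinMaxScalerStage("features", "scaled_features"),
]).fit(training)

# Prediction time: the same fitted stages are applied to new data
new_data = [{"age": 40, "visits": 5}]
print(pipeline.transform(new_data)[0]["scaled_features"])  # → [0.5, 0.5]
```

The important design point is that the scaler's minimum and maximum come from the training data only. New data is transformed with those stored values, so the model sees features on the same scale it was trained on.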
In the next article, we will show how to actually train the machine learning model and deploy it in production.
This post was written by Fabiane Bizinella Nardon (@fabianenardon), Chief Scientist at Tail. Fabiane has an MSc in Computer Science and a PhD in Electronic Engineering. She is an expert in Data Engineering and Machine Learning Engineering. She is the program committee leader for the Machine Learning Engineering track of QCon São Paulo, a frequent speaker on the subject and the author of several articles. She was also named a Java Champion by Sun Microsystems in recognition of her contribution to the Java ecosystem.