Authors: Daksh Trehan, Sumit Chahal
ML Models in Dataiku:
Dataiku provides several built-in machine learning models that we can train and use for prediction. It also lets you code your own model and use it in the Dataiku flow, including deep learning models written in TensorFlow.
Here we will train several models to predict the price of used cars and compare their scores. The dataset used for training has attributes such as manufacturer, fuel type, gear type, and horsepower.
Since the value to predict is a price, this is a regression problem, for which algorithms such as KNN, Ridge, Lasso, and Random Forest are suitable.
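To make "comparing scores" concrete, here is a minimal sketch in plain Python of how a regressor can be scored. The article does not say which metric is used, so R2 and RMSE are assumptions here, and the tiny KNN function, the feature pairs (horsepower, age in years), and all prices are hypothetical toy values, not the article's dataset or Dataiku's implementation.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a price as the mean target of the k nearest training rows."""
    # Euclidean distance from the query to every training row
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    nearest = [y for _, y in dists[:k]]
    return sum(nearest) / len(nearest)

def r2_score(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the price."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical toy data: (horsepower, age in years) -> price
train_X = [(60, 10), (75, 8), (100, 5), (150, 3), (200, 1)]
train_y = [3000, 4500, 8000, 15000, 25000]
test_X = [(80, 7), (160, 2)]
test_y = [5000, 17000]

# Model 1: KNN with k=2; model 2: a naive baseline that always predicts the mean
knn_preds = [knn_predict(train_X, train_y, q, k=2) for q in test_X]
baseline_preds = [sum(train_y) / len(train_y)] * len(test_y)

for name, preds in [("KNN", knn_preds), ("Mean baseline", baseline_preds)]:
    print(name, "R2:", round(r2_score(test_y, preds), 3), "RMSE:", round(rmse(test_y, preds), 1))
```

A higher R2 and a lower RMSE on the same held-out records indicate the better model, which is the kind of comparison made between the trained models later in this post.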
Creating ML Model in Dataiku:
We will create an ML model in Dataiku using training data stored in Snowflake. First we transform the data as required, then we train the models, and finally we predict with the trained models.
1. We have uploaded a dataset for used car prices in Snowflake.
2. Connect Dataiku to Snowflake and access the USED_CARS database from Dataiku. (For this part, please refer to the blog post "Dataiku with Snowflake".)
3. Before training, we transform the data. First, a data preparation recipe removes every record that has a null value for any feature.
4. Next, another data preparation recipe removes the ‘model’ column.
5. The flow diagram will look like this after two recipes.
6. Now we analyze the attributes and remove noise from them. First, two child manufacturers that are listed separately are merged into one parent value. Second, outliers are removed: for categorical attributes, we remove records whose class has a very low count, and for continuous attributes, we remove records whose values lie very far from the mean.
7. The flow diagram will look like this after these steps.
8. Split the dataset into train and test datasets using an 80:20 ratio. The models are trained on the 80% of records in the training dataset and then evaluated on the remaining 20% of records in the test dataset.
9. Train three models – KNN, Ridge, and Random Forest on the training dataset.
10. All three models are then used to score the test dataset.
11. Finally, the models are used to predict the ‘Price’ for the whole dataset.
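In Dataiku, the preparation and split in steps 3 to 8 are done with visual recipes. Purely as an illustration of the logic, the same steps can be sketched in plain Python. Everything specific below is an assumption: the column names, the brand mapping in `PARENT_BRAND`, the thresholds `min_count` and `max_sigma`, and the toy rows are hypothetical and do not come from the USED_CARS data.

```python
import random
import statistics
from collections import Counter

# Hypothetical mapping of a child brand to its parent manufacturer (step 6)
PARENT_BRAND = {"mercedes-amg": "mercedes-benz"}

def drop_null_rows(rows):
    """Step 3: keep only rows where no feature is missing."""
    return [r for r in rows if all(v is not None and v != "" for v in r.values())]

def drop_column(rows, name):
    """Step 4: remove one column from every row."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]

def merge_brands(rows, column="manufacturer"):
    """Step 6a: replace child manufacturers with their parent value."""
    return [{**r, column: PARENT_BRAND.get(r[column], r[column])} for r in rows]

def drop_rare_classes(rows, column, min_count):
    """Step 6b: remove rows whose category occurs fewer than min_count times."""
    counts = Counter(r[column] for r in rows)
    return [r for r in rows if counts[r[column]] >= min_count]

def drop_far_from_mean(rows, column, max_sigma):
    """Step 6c: remove rows more than max_sigma standard deviations from the mean."""
    values = [r[column] for r in rows]
    mean, sd = statistics.mean(values), statistics.pstdev(values)
    if sd == 0:
        return list(rows)
    return [r for r in rows if abs(r[column] - mean) <= max_sigma * sd]

def train_test_split(rows, train_ratio=0.8, seed=42):
    """Step 8: shuffle reproducibly, then cut into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy records standing in for the used-car table
rows = [
    {"manufacturer": "audi", "model": "a4", "horsepower": 100 + i, "price": 8000 + 150 * i}
    for i in range(20)
]
rows += [
    {"manufacturer": "mercedes-amg", "model": "c63", "horsepower": 110, "price": 9000},
    {"manufacturer": "mercedes-benz", "model": "c200", "horsepower": 115, "price": 9500},
    {"manufacturer": "audi", "model": "a3", "horsepower": None, "price": 7000},   # null feature
    {"manufacturer": "lada", "model": "niva", "horsepower": 105, "price": 3000},  # rare class
    {"manufacturer": "audi", "model": "r8", "horsepower": 900, "price": 90000},   # outlier
]

clean = drop_null_rows(rows)
clean = drop_column(clean, "model")
clean = merge_brands(clean)
clean = drop_rare_classes(clean, "manufacturer", min_count=2)
clean = drop_far_from_mean(clean, "horsepower", max_sigma=3)
train, test = train_test_split(clean, train_ratio=0.8)
print(len(rows), "raw rows,", len(clean), "clean rows,", len(train), "train,", len(test), "test")
```

Fixing the shuffle seed keeps the split reproducible, so every model is trained and evaluated on exactly the same records and their scores remain comparable.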
The results are as follows: