Author: Sirisha Kodukula
Dataiku is a data science tool used for a wide range of data-related tasks, including cleaning data, building recipes, creating ML models, and deploying them. It has a built-in Jupyter notebook for writing the ML part in Python.
Setup (Implementation Demo)
What is a Time Series?
Connect Snowflake to Dataiku using Partner Connect.
Create a 14-day free trial account on Dataiku.
The interface would look like this:
When we connect Snowflake to Dataiku, a database, warehouse, user, and roles are automatically created in Snowflake.
Create a new project in Dataiku DSS and import the table from Snowflake.
We can use a Jupyter notebook inside Dataiku, where we can code and export the resultant datasets.
We can prepare our datasets before training on Time Series using the visual recipes present.
Now we have to train and evaluate the models. For this, we need to enable the time series plugin.
A time series is data that is indexed by time. Time-ordered data points let us observe how a variable changes. Once we train ML models on historical data with the Train and evaluate forecasting models recipe, we can use them to predict future values. Time series fall into three categories based on the relationship between the variables and time:
Univariate time series - the single variable that depends on time.
Multivariate time series - 2 or more interrelated variables that depend on time.
Multiple time series - multiple time series which are not related to each other.
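The three categories above can be pictured as table layouts. The following is a minimal sketch in pandas, not something Dataiku requires; the `PRICE` and `PRODUCT` columns and all values are hypothetical and only serve to show the shapes.

```python
import pandas as pd

years = pd.to_datetime(["2018", "2019", "2020"], format="%Y")

# Univariate: a single variable observed over time.
univariate = pd.DataFrame({"YEAR": years, "UNIT_SOLD": [140, 160, 180]})

# Multivariate: several interrelated variables on one shared time axis.
# PRICE is a hypothetical second variable.
multivariate = univariate.assign(PRICE=[9.5, 9.0, 8.5])

# Multiple time series: unrelated series stacked in long format and
# told apart by an identifier column (PRODUCT is hypothetical).
multiple = pd.concat(
    [
        univariate.assign(PRODUCT="A"),
        univariate.assign(PRODUCT="B", UNIT_SOLD=[30, 25, 40]),
    ],
    ignore_index=True,
)

print(multiple)
```

In the long format, each value of the identifier column marks one independent series.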
We need to prepare our dataset before setting it up for time series forecasting. In our case, for example, we need to parse the year column and fill in the missing timestamps. Dataiku DSS has various plugins, and one of them is a time series preparation plugin whose visual recipes help us perform operations on time-related data.
Train and evaluate forecasting models.
We need to have historical data to perform training on the time series data.
The time column, in our case, has the parsed years with no missing values:
The PREPARE recipe is used for parsing the dates.
The Time Series Preparation RESAMPLING recipe is used to fill in missing values.
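For readers who prefer the notebook, the two preparation steps above can be approximated in pandas. This is a rough sketch, not what the visual recipes run internally: the sample values are hypothetical, and linear interpolation is an assumed way of filling the gap.

```python
import pandas as pd

# Hypothetical yearly sales with a gap: 2019 is missing.
raw = pd.DataFrame({
    "YEAR": ["2016", "2017", "2018", "2020"],
    "UNIT_SOLD": [100.0, 120.0, 140.0, 180.0],
})

# Step 1: parse the year strings into real timestamps.
raw["YEAR"] = pd.to_datetime(raw["YEAR"], format="%Y")

# Step 2: resample onto a regular yearly grid, then fill the gap.
resampled = (
    raw.set_index("YEAR")
       .resample("YS")                 # "YS" = year-start frequency
       .asfreq()                       # missing years appear as NaN rows
       .interpolate(method="linear")   # assumed fill method
       .reset_index()
)

print(resampled)
```

After this step every year between the first and last observation has exactly one row, which is what a forecasting model expects.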
In our case, the time column is "YEAR".
The frequency can range from minute to year.
Our frequency is also set to “YEAR”.
The target column is the column whose values we want to forecast.
Our target column is “UNIT_SOLD”.
Most recent: uses only the last records of each time series during training.
Whole data: uses all records.
We chose “NO SAMPLING”, which uses the whole data.
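The difference between the two sampling options can be sketched in pandas. The helper `most_recent` below is hypothetical and is not part of Dataiku; it only illustrates keeping the last records of each series, and the data is made up.

```python
import pandas as pd


def most_recent(df, time_col, n, id_col=None):
    """Keep the last n time-ordered records (per series if id_col is given)."""
    ordered = df.sort_values(time_col)
    if id_col is None:
        return ordered.tail(n)
    return ordered.groupby(id_col, sort=False).tail(n)


# Hypothetical yearly data for a single series.
df = pd.DataFrame({
    "YEAR": [2015, 2016, 2017, 2018, 2019, 2020],
    "UNIT_SOLD": [90, 100, 120, 140, 160, 180],
})

recent = most_recent(df, "YEAR", n=3)  # "most recent" style sampling
whole = df                             # "no sampling": keep every record

print(recent)
```

With a short yearly history like ours, dropping records would leave very little to learn from, which is one reason to keep the whole data.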
There are two strategies for splitting the data: a time-based split and time series cross-validation. In our case, we used the time-based split.
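Both splitting strategies keep the records in time order, so the model is always evaluated on data that comes after its training data. The sketch below shows the idea in plain Python. The function names and the expanding-window fold layout are assumptions for illustration; Dataiku's exact fold construction may differ.

```python
def time_based_split(records, n_test):
    """Hold out the last n_test time-ordered records as the test set."""
    if not 0 < n_test < len(records):
        raise ValueError("n_test must be between 1 and len(records) - 1")
    return records[:-n_test], records[-n_test:]


def expanding_window_folds(records, n_folds, n_test):
    """Yield (train, test) pairs; each fold trains on a longer history."""
    for k in range(n_folds, 0, -1):
        end = len(records) - (k - 1) * n_test
        start = end - n_test
        if start <= 0:
            raise ValueError("not enough records for this many folds")
        yield records[:start], records[start:end]


years = list(range(2010, 2020))  # hypothetical: ten yearly observations

train, test = time_based_split(years, n_test=2)
folds = list(expanding_window_folds(years, n_folds=3, n_test=2))

for fold_train, fold_test in folds:
    print(fold_train[-1], "->", fold_test)
```

A single time-based split is cheaper to run, while cross-validation averages the error over several forecast origins and is less sensitive to one unusual test period.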
After training the model, we get a metrics file where we can see the model and its error rate for each metric.
From here, we can see the least error rate and choose the best algorithm to perform forecasting.
In our case, out of the three algorithms (FEEDFORWARD, SEASONAL NAIVE, and TRIVIAL IDENTITY), FEEDFORWARD scored the lowest error, 5.7 %, on the MASE metric.
Dataiku therefore selected it as the best algorithm for forecasting.
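To make the metric concrete: MASE (mean absolute scaled error) divides the forecast's mean absolute error by the mean absolute error of a naive forecast on the training data, so it is a ratio, and values below 1 mean the model beats the naive baseline. The sketch below is a minimal textbook version in plain Python. All numbers are hypothetical, and the plugin's own implementation may differ in details.

```python
def mase(train, actual, forecast, season=1):
    """Forecast MAE divided by the in-sample MAE of a (seasonal) naive forecast."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(train) <= season:
        raise ValueError("training history is too short for this season")
    naive_errors = [
        abs(train[i] - train[i - season]) for i in range(season, len(train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive in-sample error is zero; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


train = [100.0, 120.0, 140.0, 160.0]   # hypothetical UNIT_SOLD history
actual = [180.0, 200.0]                # hypothetical held-out years

naive_forecast = [train[-1]] * len(actual)  # baseline: repeat the last value
model_forecast = [178.0, 203.0]             # hypothetical model output

print("naive:", mase(train, actual, naive_forecast))
print("model:", mase(train, actual, model_forecast))
```

Comparing each model's score against the naive baseline's score is what makes the ranking of algorithms meaningful.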
We have successfully implemented a machine learning model that forecasts values for future years from previous years using Dataiku's time series tools. Taking the 5.7 % error of the selected model, its accuracy is about 94.3 % (100 % minus 5.7 %).