
Time Series Forecasting With Dataiku

Author: Sirisha Kodukula


Image source: https://www.actuia.com/

Introduction

Dataiku is a data science tool used for a wide range of data-related tasks, including cleaning data, building recipes, creating ML models, and deploying them. It also has a built-in Jupyter notebook where we can write Python code for the ML part.


Content
  1. Setup (Implementation Demo)

  2. What is a Time Series?

  3. Parameters

  4. Conclusion

  5. References


1. Setup

Connect Snowflake to Dataiku through Snowflake Partner Connect.



Create a 14-day free trial account on Dataiku.



The interface looks like this:



When we connect Snowflake to Dataiku, a database, warehouse, user, and roles are created automatically in Snowflake.



Create a new project in Dataiku DSS and import the table from Snowflake.




We can use a Jupyter notebook inside Dataiku, where we can code and export the resultant datasets.


We can prepare our datasets with the available visual recipes before training a time series model.



Next, we train and evaluate the models. For this, we need to enable the time series plugin.



2. Time-Series

A time series is a sequence of data points ordered by time, which lets us observe how a value changes over time. With ML models, we can predict future values, but only after training on the data with the Train and evaluate forecasting models recipe. Time series data falls into three categories, based on how the variables relate to time and to each other:

  • Univariate time series - a single variable that depends on time.

  • Multivariate time series - two or more interrelated variables that depend on time.

  • Multiple time series - several time series that are not related to each other.
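
To make the three categories concrete, here is a minimal sketch in plain Python. The years, values, the PRICE variable, and the store names are all made up for illustration; only UNIT_SOLD matches a column used later in this post.

```python
# Hypothetical yearly data; every number and name here is illustrative.

# Univariate: a single variable observed over time.
univariate = {2019: 120, 2020: 135, 2021: 150}

# Multivariate: two or more interrelated variables observed at the same times.
multivariate = {
    2019: {"UNIT_SOLD": 120, "PRICE": 9.5},
    2020: {"UNIT_SOLD": 135, "PRICE": 9.0},
    2021: {"UNIT_SOLD": 150, "PRICE": 8.5},
}

# Multiple time series: separate, unrelated series, keyed here by a made-up identifier.
multiple = {
    "store_A": {2019: 120, 2020: 135, 2021: 150},
    "store_B": {2019: 40, 2020: 38, 2021: 45},
}


def is_time_ordered(series):
    """Return True if the time keys of a series are in ascending order."""
    years = list(series)
    return years == sorted(years)


print(is_time_ordered(univariate))
print(sorted(multivariate[2019]))
print(len(multiple))
```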


Time-Series Preparation

We need to prepare our dataset before setting it up for time series forecasting. In our case, for example, we need to parse the year column and fill in the missing timestamps. Dataiku DSS offers several plugins, and one of them is a time series preparation plugin whose visual recipes let us perform these operations on time-related data.
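
The two preparation steps named above, parsing the year and filling missing timestamps, can be sketched in plain Python as follows. This is not what the Dataiku recipes run internally; the function names, the sample rows, and the choice of linear interpolation for the gaps are assumptions made for illustration.

```python
from datetime import date


def parse_year(value):
    """Turn a year given as text or a number (e.g. "2019") into January 1 of that year."""
    return date(int(str(value).strip()), 1, 1)


def fill_missing_years(rows):
    """Return one (date, value) pair per year, linearly interpolating any missing years.

    rows: list of (year, value) pairs, possibly with gaps between years.
    """
    known = {parse_year(year).year: float(value) for year, value in rows}
    years = sorted(known)
    filled = []
    for prev, nxt in zip(years, years[1:]):
        span = nxt - prev
        # Emit the known year, then any interpolated years before the next known one.
        for step in range(span):
            value = known[prev] + (known[nxt] - known[prev]) * step / span
            filled.append((date(prev + step, 1, 1), value))
    filled.append((date(years[-1], 1, 1), known[years[-1]]))
    return filled


# Hypothetical raw rows: the year arrives as text and 2020 is missing.
raw = [("2018", 100), ("2019", 120), ("2021", 160)]
prepared = fill_missing_years(raw)
for day, units in prepared:
    print(day.isoformat(), units)
```

After this step every year between the first and last observation has exactly one row, which is what a yearly forecasting model needs.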



Train and evaluate forecasting models


Training a forecasting model requires historical time series data.

  • Input Data

  • Output Data

  • Settings

  • Related pages



3. Parameters

Time column


The time column must contain parsed dates with no missing values. In our case, it holds the parsed years:

  • The PREPARE recipe is used to parse the dates.

  • The RESAMPLING recipe from Time Series Preparation is used to fill in missing values.

  • In our case, the time column is "YEAR".



Frequency


The frequency can range from minute to year:

  • Our frequency is yearly, matching the “YEAR” time column.


Target column


The target column is the column whose values we want to forecast.

  • Our target column is “UNIT_SOLD”.


Sampling method

  • Most recent: uses only the last records of each time series during training.

  • Whole data: uses all records.

  • We chose “NO SAMPLING”, which uses the whole data.
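
The difference between the two sampling options can be sketched in a few lines of plain Python. The function names, the `n_records` parameter, and the sample values are hypothetical and do not mirror Dataiku's own settings or code.

```python
def sample_most_recent(series_by_id, n_records):
    """Keep only the last `n_records` points of each time series."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    return {name: values[-n_records:] for name, values in series_by_id.items()}


def no_sampling(series_by_id):
    """Keep every record of each time series (the whole data)."""
    return {name: list(values) for name, values in series_by_id.items()}


# Hypothetical UNIT_SOLD values per year for two made-up series.
history = {
    "store_A": [100, 110, 125, 140, 150],
    "store_B": [80, 85, 95],
}

recent = sample_most_recent(history, n_records=3)
full = no_sampling(history)
print(recent)
print(full)
```

With short yearly histories like ours, keeping the whole data is usually the natural choice, since dropping older records leaves very little to train on.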


Splitting strategy


There are two strategies for splitting the data: a time-based split and time series cross-validation. In our case, we used the time-based split.
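
As a rough illustration of the two strategies, the sketch below holds out the most recent points for a time-based split and builds expanding-window folds for cross-validation. The exact fold layout Dataiku uses is not described here, so the `horizon` and `n_folds` parameters and the expanding-window scheme are assumptions.

```python
def time_based_split(values, horizon):
    """Train on everything except the last `horizon` points, which become the test set."""
    if not 0 < horizon < len(values):
        raise ValueError("horizon must be between 1 and len(values) - 1")
    return values[:-horizon], values[-horizon:]


def time_series_cv_folds(values, horizon, n_folds):
    """Build expanding-window folds, oldest first.

    Each fold trains on an initial segment and tests on the `horizon` points
    that follow it, so the model never sees data from after its test window.
    """
    folds = []
    for k in range(n_folds, 0, -1):
        test_end = len(values) - (k - 1) * horizon
        test_start = test_end - horizon
        if test_start <= 0:
            raise ValueError("not enough data for the requested folds")
        folds.append((values[:test_start], values[test_start:test_end]))
    return folds


series = list(range(10))  # stand-in for ten yearly observations
train, test = time_based_split(series, horizon=2)
folds = time_series_cv_folds(series, horizon=2, n_folds=3)
print(train, test)
for fold_train, fold_test in folds:
    print(len(fold_train), fold_test)
```

Unlike a random split, both approaches keep the test data strictly later than the training data, which matches how the model will be used for forecasting.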

  • After training, we get a metrics dataset that shows each model and its error for each metric.



From these metrics, we can find the lowest error rate and choose the best algorithm for forecasting.



In our case, out of the three algorithms (FEEDFORWARD, SEASONAL NAIVE, and TRIVIAL IDENTITY), feedforward scored the lowest error, 5.7 %, on the MASE (mean absolute scaled error) metric.


Dataiku therefore selects the algorithm with the lowest error rate for forecasting.
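
For readers unfamiliar with MASE, here is a minimal sketch of how the metric is commonly computed and how the lowest score picks the winner. The data, the candidate names `model_a` and `model_b`, and their forecasts are invented for illustration and are not the results from our project.

```python
def mase(train, actual, forecast, season=1):
    """Mean absolute scaled error.

    The forecast's mean absolute error is divided by the in-sample mean absolute
    error of a naive forecast that repeats the value from `season` steps earlier.
    """
    mae_forecast = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    mae_naive = sum(naive_errors) / len(naive_errors)
    return mae_forecast / mae_naive


# Hypothetical training history and held-out actual values.
train = [100, 110, 120, 130]
actual = [140, 150]

# Hypothetical forecasts from two candidate models.
candidates = {
    "model_a": [138, 151],
    "model_b": [130, 130],  # simply repeats the last training value
}

scores = {name: mase(train, actual, forecast) for name, forecast in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print(best)
```

A MASE below 1 means the model beats the naive baseline on average, and a value above 1 means it does worse, so lower is better when comparing algorithms.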




4. Conclusion

We have successfully built a machine learning model that forecasts values for future years from previous years' data using Dataiku's time series tools. Given the 5.7 % error reported above, the accuracy of our model is 94.3 %.


5. References

https://doc.dataiku.com/dss/latest/time-series/index.html

