Author(s): Daksh Trehan, Sumit Chahal
What is Dataiku?
Dataiku is an end-to-end data science platform that supports a wide range of data-related tasks. Its capabilities include cleaning data, building recipes, developing pipelines, visualizing ML models, and deploying them publicly. The tool is designed for analysts, BI engineers, data engineers, and data scientists alike, and so offers a common platform for collaboration among all the roles involved in the data cycle.
Dataiku developed Data Science Studio (DSS) to shorten the clean-train-test-deploy cycle, making it faster and easier to build predictive applications. DSS combines no-code, low-code, and full-code approaches, which makes the tool accessible to everyone and lets it meet a wide variety of needs.
DSS is flexible about infrastructure and can work with either on-premise or cloud computation and storage. DSS supports all three major cloud platforms: AWS, GCP, and Azure. It uses a pushdown architecture that allows organizations to take full advantage of elastic storage as well as elastic computation. Furthermore, Dataiku offers 100+ third-party plugins to increase efficiency.
DSS provides visual flows that let both coders and non-coders build data pipelines easily, using recipes to join and transform datasets and to build predictive models.
The flow makes data cleansing tasks such as joining, cleaning, normalizing, enriching, and deduplicating records possible within a few clicks.
The cleaned data can then be transformed and manipulated in DSS within a few clicks, covering tasks such as filtering, splitting, concatenation, binning, and currency conversion. Users can also write their own custom formulas.
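To make these cleansing and transformation steps concrete, here is a minimal sketch in plain Python of normalizing, deduplicating, and binning records. It only illustrates the idea behind the operations; it is not how DSS implements its visual recipes, and the sample data, field names, and helper functions are all hypothetical.

```python
def normalize(record):
    """Trim whitespace and lowercase string fields so near-duplicates compare equal."""
    return {
        key: value.strip().lower() if isinstance(value, str) else value
        for key, value in record.items()
    }


def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key_fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    Values above the last edge fall into the final label.
    """
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]


# Hypothetical sample records with inconsistent casing and spacing.
raw = [
    {"customer": " Alice ", "city": "Paris", "order_total": 40.0},
    {"customer": "alice", "city": "paris", "order_total": 40.0},
    {"customer": "Bob", "city": "Lyon", "order_total": 250.0},
]

# Normalize first, then deduplicate on the fields that identify a customer.
cleaned = deduplicate([normalize(r) for r in raw], key_fields=("customer", "city"))

# Bin the order total into three hypothetical size categories.
for rec in cleaned:
    rec["order_size"] = bin_value(
        rec["order_total"], edges=[50, 200], labels=["small", "medium", "large"]
    )
```

Normalizing before deduplicating matters here: without it, " Alice " and "alice" would be treated as two different customers.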
From statistical analysis charts to custom dashboards, DSS has you covered.
DSS saves time by summarizing each column statistically within a few clicks, making it easier to spot and understand patterns in the data.
It includes a wide variety of charts, statistical methods, and geospatial analytics.
From feature learning, feature selection, and time series visualization to deep learning with Keras, DSS supports a wide range of ML algorithms, either with the assistance of AutoML or by writing your own code.
With AutoML, DSS helps you tweak hyperparameters to get the desired results. Dataiku doesn't restrict you to AutoML, however; users can also write their own custom ML code in either Python or Scala.
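To give a feel for what hyperparameter tweaking involves, here is a minimal grid search sketch in plain Python. It is a generic illustration, not Dataiku's AutoML: the model (a one-feature ridge regression without an intercept), the data, and the candidate values are all assumptions chosen for the example.

```python
def fit_ridge(train, lam):
    """Fit y = w * x with L2 penalty lam; closed form w = sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)


def mean_squared_error(weight, data):
    """Average squared difference between predictions and targets."""
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)


def grid_search(train, valid, candidates):
    """Try each candidate penalty and return (best_lam, best_error, all_errors)."""
    errors = {}
    for lam in candidates:
        weight = fit_ridge(train, lam)
        errors[lam] = mean_squared_error(weight, valid)
    best_lam = min(errors, key=errors.get)
    return best_lam, errors[best_lam], errors


# Hypothetical data that roughly follows y = 2x.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
valid = [(5, 10.1), (6, 11.9)]

# Hypothetical grid of regularization strengths to compare.
candidates = [0.0, 0.1, 1.0, 10.0]

best_lam, best_error, errors = grid_search(train, valid, candidates)
print(f"best lambda: {best_lam}, validation MSE: {best_error:.4f}")
```

The important design choice is that each candidate is scored on held-out validation data rather than on the training data, so the selected value reflects how well the model generalizes.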
To train ML models, Dataiku has its own computing system. For large datasets that might not fit into memory, DSS supports Spark MLlib or H2O Sparkling Water.
Once the model is created, trained, and deployed, Dataiku allows users to create project dashboards and share them with clients. The shareability of the dashboards also allows stakeholders to see the output and track KPIs and values.
Steps to connect with Snowflake:
1. Go to the Partner Connect tab in Snowflake and click on the Dataiku icon.
2. A default user, role, warehouse, and database are created for Dataiku.
3. After launch, log in to Dataiku, or sign up if you do not have an account yet.
4. After accepting and filling in basic details, we will reach the homepage of Dataiku.
5. Go to the Features tab and then click ‘ADD A FEATURE’ to connect with Snowflake.
6. Select Snowflake and fill in the account URL and the account-level object details.
7. Now we can see Snowflake in the list of features.
8. Go home and click on ‘OPEN DATAIKU DSS.’
9. Select Tutorial and Data Engineer Quick Start.
10. The sample project is imported.
11. Select Flow and then select all the datasets.
12. Select Change connection to store the datasets in Snowflake.
We can clearly see that the dataset icons have changed to Snowflake icons.
13. Changes can be seen in the Snowflake account.
We can see the tables in the ‘PUBLIC’ schema. These are the datasets from Dataiku, which were moved to the Snowflake connection in step 12.
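Two details from the steps above can be sketched in a few lines of Python: the shape of the account URL entered in step 6, and a statement for listing the tables checked in step 13. This is a sketch under stated assumptions: the account identifier `myorg-myaccount` is a placeholder, and the database name `PC_DATAIKU_DB` is an assumed name for the database created by Partner Connect, so check the actual name in your own Snowflake account. The helper functions are hypothetical and only build strings; they do not connect to Snowflake.

```python
def account_url(account_identifier: str) -> str:
    """Build a Snowflake account URL from an account identifier."""
    return f"https://{account_identifier}.snowflakecomputing.com"


def show_tables_sql(database: str, schema: str = "PUBLIC") -> str:
    """Build a SHOW TABLES statement for one schema of one database."""
    return f"SHOW TABLES IN SCHEMA {database}.{schema};"


# Placeholder account identifier; replace with your own.
url = account_url("myorg-myaccount")

# Assumed database name; verify it in your Snowflake account.
sql = show_tables_sql("PC_DATAIKU_DB")

print(url)
print(sql)
```

Running the generated statement in a Snowflake worksheet should list the tables written from the Dataiku flow, provided the database and schema names match your setup.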