Sentiment Analysis With Dataiku Using Regression

Author: Vipul Tripathi


Sentiment analysis is the process of analyzing opinions or feedback to determine the tone they carry: positive, negative, or neutral. It categorizes texts by polarity so that conclusions can be drawn from them, mainly by using natural language processing to extract emotional information. In business settings, sentiment analysis is widely used to analyze customer feedback, detect spam mails, etc.

Why Dataiku?

Dataiku serves well for applying machine learning algorithms to data, but there should be valid reasons for selecting it. The best part of Dataiku is that you can do impactful work with little or no coding. It supports smart data ingestion, cleaning complex text fields, combining or joining datasets, working with geographical data, and much more. With little or no coding, it delivers results quickly.

Point to Remember

A Dataiku account should be created with the same email that is used in Snowflake for Partner Connect. If not, account integration and instance creation need to be done. All the column names used in this blog refer to the sample dataset.

Steps Involved in Sentiment Analysis

There are four major steps to be followed in sentiment analysis:

1. Data Collection

This is one of the most essential steps in the sentiment analysis process. Everything that follows depends on the quality of the data that has been gathered and on how it has been annotated or labeled.

The sample dataset, which contains student feedback on several factors, can be found here. If you have your own dataset, it will be very interesting to start with that.

2. Data Preprocessing

The dataset is first stored in a Snowflake database and is then fetched from there into Dataiku for preprocessing.

Data preprocessing involves simplifying the text down to its grammatical root words. With simplified words, the machine learning algorithm performs better and produces more useful results.
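To make the idea concrete, here is a minimal Python sketch of the same three operations: normalizing, clearing stop words, and stemming. It is not how Dataiku implements them. The stop-word list, the suffix list, and the function names naive_stem and simplify_text are assumptions made up for this illustration, and the stemmer is deliberately crude.

```python
import re

# Tiny illustrative stop-word list (an assumption); real tools ship much larger ones.
STOP_WORDS = {"a", "an", "and", "are", "is", "the", "to", "was", "were", "of", "in", "it", "this"}

# Suffixes tried by this naive stemmer, in order.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def simplify_text(text):
    """Lowercase, drop punctuation, remove stop words, then stem each word."""
    normalized = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    words = [w for w in normalized.split() if w not in STOP_WORDS]
    return " ".join(naive_stem(w) for w in words)


print(simplify_text("The lectures were explained clearly!"))  # lectur explain clear
```

The stems are not always dictionary words ("lectur"), which is normal for stemming: the goal is only that related word forms collapse to the same token.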

In the image below, you can see that the Normalize text option is already selected, so you only have to select Stem words and Clear stop words. Repeat these two selections for the next 6 text columns (this refers to the sample dataset; adapt the steps to your own dataset).

Now remove the other 10 columns by deleting them manually, and then run the script to save the two remaining columns. Repeat the same process for the next columns, keeping just one pair of columns in each dataset: the feedback text and its polarity.
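If you prefer to think of the column-wise split in code, the following sketch shows the idea on rows held as Python dictionaries. The function name split_into_pairs and the column names feedback_1, polarity_1, and so on are hypothetical placeholders, not the names used in the sample dataset.

```python
def split_into_pairs(rows, column_pairs):
    """Return one two-column dataset per (text_column, polarity_column) pair."""
    datasets = {}
    for text_col, polarity_col in column_pairs:
        # Keep only the feedback text and its polarity for this dataset.
        datasets[text_col] = [
            {text_col: row[text_col], polarity_col: row[polarity_col]}
            for row in rows
        ]
    return datasets


# Hypothetical rows with two feedback/polarity pairs.
rows = [
    {"feedback_1": "good teach", "polarity_1": "positive",
     "feedback_2": "slow grade", "polarity_2": "negative"},
]
pairs = [("feedback_1", "polarity_1"), ("feedback_2", "polarity_2")]
datasets = split_into_pairs(rows, pairs)
print(datasets["feedback_2"])
```

Each entry of the result corresponds to one of the two-column datasets produced by the manual steps above.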

Each dataset will have only 2 columns as shown below.

3. Data Analysis

The subtasks in the data analysis process include model training, handling multilingual data, topic classification, and sentiment analysis. For model training, the dataset must be preprocessed and manually labeled.

Once the dataset is split column-wise, select one of the files, go to the Lab, and apply the AutoML prediction model.

Select the column that contains the polarity (in the sample dataset, ccp is the polarity column) and click Create to proceed. This is the target column: the value the model learns to predict from the feedback text.

Data analysis involves the following steps:

Design the Model:

Before training, we need to design the model to get efficient results. Keep the train-to-test ratio at 4:1, that is, 80% of the rows for training and 20% for testing. Then go to feature handling, turn on the toggle next to the text column, and select the other column as a feature.
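A 4:1 ratio simply means that four out of every five rows are used for training. As a rough sketch of what such a split does (Dataiku handles this for you in the design settings), assuming a hypothetical helper named train_test_split and a fixed random seed for repeatability:

```python
import random


def train_test_split(rows, train_ratio=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for 10 labeled feedback rows
train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2
```

Shuffling before cutting avoids a test set that contains only the last rows of the file, which could be biased if the data is ordered.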

Algorithms Selection:

Although Dataiku itself selects 2 or 3 related algorithms, you can select others if you want, check the performance of each algorithm individually, and deploy the one that is most efficient.

Apply the same procedure to all the datasets.

Model Evaluation:

Evaluate the prediction models on the datasets. The evaluated model generates the final output and the accuracy matrix of the model. This helps in analyzing the performance of the model across different datasets, which may differ in language, pattern, purpose, etc.
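To show what such an evaluation measures, here is a small sketch that computes accuracy and counts each (actual, predicted) pair, which is the raw material of a confusion matrix. The label strings and the function names accuracy and confusion_counts are assumptions for illustration; your dataset may encode polarity differently, and Dataiku computes its own metrics.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of rows where the predicted polarity equals the actual one."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


actual = ["positive", "negative", "neutral", "positive"]
predicted = ["positive", "neutral", "neutral", "positive"]

print(accuracy(actual, predicted))  # 0.75
print(confusion_counts(actual, predicted)[("negative", "neutral")])  # 1
```

Looking at the pair counts rather than accuracy alone shows which polarities the model confuses, for example negative feedback predicted as neutral.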


Take the output. It adds one column to the dataset that holds the predicted polarity. Un-check the output probabilities box; otherwise it will also add columns for the probabilities of all three polarities. The predicted column can then be seen in the dataset. Repeat the same for all the evaluated datasets.

Select the final dataset again to prepare it. This time, don’t forget to create the dataset on Snowflake; otherwise you will need to push it there explicitly. Now, delete the first column from all the datasets.

Do the same for all the datasets. This is what the final flow looks like.

4. Data Visualization

The final step of the sentiment analysis is data visualization. Once the three steps above are done, you have the final output for visualizing what your customers, reviewers, and peers said about your product or service, or what feedback you received from them. All the feedback can be analyzed by building visualizations in Power BI or Tableau.
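Before charting, it often helps to summarize the predicted column into the share of each polarity. The sketch below shows one way to do that. The function name polarity_shares and the label strings are assumptions, and a BI tool can compute the same aggregate directly.

```python
from collections import Counter


def polarity_shares(predictions):
    """Return each polarity's share of the predictions, as a percentage."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


predictions = ["positive", "positive", "negative", "neutral"]
print(polarity_shares(predictions))
```

A summary like this maps naturally onto a bar or pie chart with one segment per polarity.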


Sentiment mining plays a very important role in every field for understanding opinions and improving services. Reading all the reviews to come to a conclusion is time-consuming. With sentiment analysis, the polarity of the opinions can be predicted instead. As the sample dataset contains a polarity for each piece of feedback, regression analysis becomes effective in analyzing future feedback.


