Author: Mahiman Dave
Sentiment analysis is a natural language processing (NLP) technique that determines whether a piece of text is positive, negative, or neutral. It is usually performed on textual data to help companies track sentiment about their brand, products, and services and to understand the needs of their customers.
Why is Sentiment Analysis important?
Sentiment analysis is a potent marketing tool that enables product creators to understand how customers feel about their marketing campaigns. It gives companies insight into brand and product recognition, customer loyalty, customer satisfaction, the success of advertising and promotions, and product acceptance.
Automating the analysis of customer feedback, such as survey responses and reviews posted across social media platforms, allows brands to understand what makes a customer delighted or disappointed so that they can modify products and services to meet their customers’ requirements.
For instance, using sentiment analysis to automatically analyze tons of feedback given in customer satisfaction surveys could help you discover why customers are satisfied or dissatisfied at each stage of the customer journey.
How does Sentiment Analysis work?
There are three approaches used:
Rule-based approach: This approach uses a lexicon of words labeled as positive or negative. The text is tokenized, and the positive and negative words in it are counted. If the negative words outnumber the positive words, the sentiment is negative; if the positive words outnumber the negative words, it is positive.
Automatic approach: This approach uses machine learning. First, a model is trained on an existing labeled dataset. Next, features (words) are extracted from new text, and the trained model predicts its sentiment. The prediction step can be implemented using different machine learning techniques such as Naive Bayes, Linear Regression, and Support Vector Machines.
Hybrid approach: As the name suggests, this approach combines the rule-based and automatic approaches. Its benefit is that it is often more accurate than either of the other two approaches on its own.
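To make the rule-based approach concrete, here is a minimal sketch of the word-counting idea. The two word lists are tiny, made-up lexicons used only for illustration, and returning "neutral" on a tie is an assumption rather than part of the description above.

```python
import re

# Hypothetical toy lexicons; a real lexicon would hold thousands of entries.
POSITIVE_WORDS = {"good", "great", "comfortable", "friendly", "excellent"}
NEGATIVE_WORDS = {"bad", "delayed", "rude", "poor", "terrible"}


def rule_based_sentiment(text: str) -> str:
    """Label text by comparing the counts of positive and negative words."""
    # Tokenize: lowercase the text and keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # assumption: equal counts are treated as neutral


print(rule_based_sentiment("The crew was friendly and the seat was comfortable."))
print(rule_based_sentiment("Delayed flight, rude staff, but good food."))
```

A counter like this ignores negation ("not good") and intensity, which is one reason lexicon tools add extra rules on top of plain counting.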
Sentiment Analysis in Aerolytics
For our project, we extracted reviews of Qantas Airlines from the Skytrax website. We chose Skytrax as our data source because it is one of the most reliable sources in the aviation industry. Along with the reviews, we also gathered other attributes such as Seat Type, Type of Traveler, Route, and Date of Review. Our analysis covers reviews from 2015 onwards.
Web Scraping from Skytrax website
With a wide variety of libraries and tools available for web scraping, we were unsure which one to use in our project. First, we went through some of the freely available tools like ParseHub, Scrapy & Webhose.io. The major problem here was understanding how to implement each tool. There was also a restriction on the amount of data that could be scraped.
Therefore, we decided to scrape the data using Python libraries. We went through the documentation of several libraries, such as Selenium, AutoScraper, MechanicalSoup & BeautifulSoup. Finally, we decided to go ahead with the BeautifulSoup library due to its accuracy and ease of use. We used the HTML and lxml parsers for different attributes.
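As a rough illustration of what BeautifulSoup parsing looks like, here is a minimal sketch. The markup in SAMPLE_HTML, the tag and class names, and both function names are hypothetical; they are not the actual structure of the Skytrax pages or the code used in the project.

```python
# Hypothetical markup standing in for one scraped review page.
SAMPLE_HTML = """
<article class="review">
  <time datetime="2021-01-05">5th January 2021</time>
  <div class="review-text">  Friendly crew,
     comfortable seat.  </div>
</article>
"""


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace so scraped text sits on one clean line."""
    return " ".join(text.split())


def parse_reviews(html: str) -> list[dict]:
    """Extract the date and text of each review from HTML shaped like SAMPLE_HTML."""
    # Third-party dependency (pip install beautifulsoup4), imported here so the
    # helper above can be used without it.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for article in soup.find_all("article", class_="review"):
        date_tag = article.find("time")
        body_tag = article.find("div", class_="review-text")
        reviews.append({
            "date": date_tag.get("datetime") if date_tag else None,
            "review": normalize_whitespace(body_tag.get_text()) if body_tag else "",
        })
    return reviews

# Usage: parse_reviews(SAMPLE_HTML) returns one dict per <article class="review">.
```

Passing "lxml" instead of "html.parser" as the second argument to BeautifulSoup switches to the lxml parser, provided the lxml package is installed.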
We automated this process by deploying the Python code as an AWS Lambda function and setting up a trigger using Amazon EventBridge. We scheduled the trigger monthly because we wanted enough new reviews to accumulate between runs to produce noticeable changes in the sentiment analysis. A monthly schedule is also more cost-effective than a weekly or daily one. Each time the Lambda function runs, it writes the reviews to a CSV file, which is then stored in an Amazon S3 bucket. Finally, the data is loaded into the Snowflake staging area via Snowpipe.
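The Lambda step might look roughly like the sketch below. The column names, the bucket name, the object key, and the scrape_reviews stub are all assumptions made for illustration; only the general flow (scrape, write CSV, store in S3) comes from the description above.

```python
import csv
import io

# Assumed column names; the real CSV layout is not shown in this post.
FIELDNAMES = ["date", "seat_type", "traveller_type", "route", "review"]


def reviews_to_csv(reviews: list[dict]) -> str:
    """Serialize review dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(reviews)
    return buffer.getvalue()


def scrape_reviews() -> list[dict]:
    """Hypothetical stub standing in for the scraping code."""
    return []


def lambda_handler(event, context):
    """Entry point invoked by the scheduled EventBridge trigger."""
    # boto3 ships with the Lambda Python runtime; it is imported here so the
    # CSV helper above can be run locally without it.
    import boto3

    reviews = scrape_reviews()
    body = reviews_to_csv(reviews).encode("utf-8")
    boto3.client("s3").put_object(
        Bucket="example-reviews-bucket",  # hypothetical bucket name
        Key="reviews/latest.csv",         # hypothetical object key
        Body=body,
    )
    return {"rows": len(reviews)}
```

On the scheduling side, an EventBridge schedule expression such as cron(0 0 1 * ? *) would fire at midnight UTC on the first day of each month; the exact schedule used in the project is not stated here.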
Since most of the data extracted from the web was unstructured, we began with data wrangling. First, we converted the reviews to lowercase and tokenized them into words. Next, we removed special characters using regular expressions and removed stopwords using the list imported from nltk.corpus. Finally, we performed lemmatization to reduce each word to its base form.
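The cleaning steps above can be sketched as follows. To keep the example self-contained, a tiny hand-written stopword set stands in for the nltk.corpus stopword list, and lemmatization is left out; with NLTK installed, a lemmatizer such as WordNetLemmatizer could be applied to the returned tokens.

```python
import re

# Tiny stand-in for the NLTK English stopword list (assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "to", "of", "in", "but"}


def preprocess(review: str) -> list[str]:
    """Lowercase, tokenize, strip special characters, and drop stopwords."""
    lowered = review.lower()
    tokens = lowered.split()  # simple whitespace tokenization
    # Remove everything except letters from each token.
    cleaned = [re.sub(r"[^a-z]", "", token) for token in tokens]
    # Drop tokens that became empty and tokens that are stopwords.
    return [token for token in cleaned if token and token not in STOPWORDS]


print(preprocess("The crew was GREAT, but the food was cold!!"))
# ['crew', 'great', 'food', 'cold']
```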
Next, we classified each review as positive, negative, or neutral using the VADER library in Python. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It uses a combination of lexical features, such as words, that are generally labeled according to their semantic orientation as either positive or negative. VADER reports not only the polarity of a text but also how positive or negative the sentiment is.
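Here is a minimal sketch of how a review could be labeled with VADER. It assumes the vaderSentiment package (NLTK also bundles a VADER implementation) and the commonly used compound-score cut-offs of 0.05 and -0.05; the thresholds actually used in this project are not stated above.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map VADER's compound score (between -1 and 1) to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def classify_review(text: str) -> str:
    """Score a review with VADER and return its sentiment label."""
    # Third-party dependency (pip install vaderSentiment), imported here so
    # label_from_compound can be used on its own.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    scores = SentimentIntensityAnalyzer().polarity_scores(text)
    # scores holds 'neg', 'neu', 'pos', and 'compound' entries.
    return label_from_compound(scores["compound"])

# Usage: classify_review("The crew was wonderful and the seat was comfortable.")
```

The compound score is what captures how positive or negative a review is, while the label only records the direction.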
Classification of Reviews:
Lastly, to take our analysis to the next level and derive meaningful insights from it, we felt we needed to group the reviews into airline service categories. This took a lot of brainstorming from the entire team. To begin with, we came up with some naive ideas, like finding the most frequent word in each review and then grouping similar words into categories.
Then, after extensive research, we found a library that gives a probability score between 0 and 1 for each category that we pass to its pipeline. Using the Transformers library, we went through the zero-shot classification models and opted for the facebook/bart-large-mnli model. With this model, we categorized each review into one of six categories, namely:
Food & Beverages
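A minimal sketch of zero-shot categorization with the Transformers pipeline is shown below. The label list is deliberately incomplete: it contains only the category listed above plus "Seat Comfort" and "Booking", which are mentioned later in this post. The function names are illustrative, not the project's actual code.

```python
# Partial label list (assumption): only categories named in this post.
CANDIDATE_LABELS = ["Food & Beverages", "Seat Comfort", "Booking"]


def top_category(result: dict) -> tuple[str, float]:
    """Pick the highest-scoring label from a zero-shot pipeline result."""
    return max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])


def categorize(review: str) -> tuple[str, float]:
    """Assign a review to the most likely category."""
    # Third-party dependency (pip install transformers torch); the model is
    # downloaded on first use and is large.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    result = classifier(review, candidate_labels=CANDIDATE_LABELS)
    return top_category(result)

# Usage: categorize("The meal was cold and the drinks ran out.")
```

In practice the pipeline would be created once and reused for every review, since loading the model on each call is slow.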
Based on the above classifications, we found out the number of positive, negative, and neutral reviews for each category on a yearly basis.
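One simple way to produce such yearly counts is sketched below using only the standard library. The row layout (year, category, and sentiment keys) and the sample rows are made up for illustration; the same aggregation could equally be done in SQL or in a BI tool.

```python
from collections import Counter


def yearly_sentiment_counts(rows: list[dict]) -> Counter:
    """Count reviews per (year, category, sentiment) combination."""
    counts = Counter()
    for row in rows:
        counts[(row["year"], row["category"], row["sentiment"])] += 1
    return counts


# Made-up sample rows, not real review data.
sample_rows = [
    {"year": 2019, "category": "Booking", "sentiment": "positive"},
    {"year": 2019, "category": "Booking", "sentiment": "positive"},
    {"year": 2019, "category": "Booking", "sentiment": "negative"},
    {"year": 2020, "category": "Seat Comfort", "sentiment": "neutral"},
]

counts = yearly_sentiment_counts(sample_rows)
print(counts[(2019, "Booking", "positive")])  # 2
```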
In addition, we derived correlations between the type of traveler and seat type, as shown below:
Next, we created a trend line chart of how each category performed over the years. This helped us understand how a particular category was performing relative to previous years and whether the service was improving. For instance, Seat Comfort has consistently been the top-performing category over the years, whereas Booking, which was the worst-performing category in 2015, has shown significant improvement and has been one of the best categories since 2019.
Finally, we generated a word cloud based on the review text. The size and density of a word reflect how frequently it appears in the reviews: the larger and denser the word, the more often it has appeared.
Challenges with sentiment analysis
Challenges associated with sentiment analysis typically revolve around inaccuracies in the trained models. Comments or reviews that contain sarcasm, or whose meaning depends on context such as receiving the wrong product, pose a problem for sentiment analysis models and are often classified incorrectly. For example, if a customer received a product in the wrong color and wrote the comment "The shirt was green.", it would be classified as a neutral response when, in fact, it should be negative.
For the entire process of extracting Qantas Airlines reviews from the Skytrax website and classifying each review by its sentiment, I made use of:
BeautifulSoup Library - Python - Web Scraping
AWS Lambda Function & EventBridge - Automation of the entire process
NLTK Toolkit - Data Wrangling
VADER Library - Python - Classification of reviews into Positive/Negative/Neutral
Transformers Library - Python - Classification of reviews into one of 6 categories