Sentiment Analysis For Airline Reviews

Author: Mahiman Dave


Introduction

Sentiment analysis is a natural language processing (NLP) technique that determines whether a piece of text is positive, negative or neutral. It is usually performed on textual data to help companies track sentiment about their brand, products and services and understand the needs of customers.



Why is Sentiment Analysis important?

Sentiment analysis is a potent marketing tool that enables product creators to understand how customers feel about their marketing campaigns. It gives companies insight into brand and product recognition, customer loyalty, customer satisfaction, the success of advertising and promotions, and product acceptance.


Automating the analysis of customer feedback, such as survey answers and reviews posted across social media platforms, allows brands to understand what makes a customer delighted or disappointed, so that they can modify products and services to meet their customers' requirements.


For instance, using sentiment analysis to automatically analyze tons of feedback given in customer satisfaction surveys could help you discover why customers are satisfied or dissatisfied at each stage of the customer journey.


How does Sentiment Analysis work?

There are three approaches used:

  1. Rule-based approach: This approach uses a lexicon of positive and negative words. The text is tokenized and the positive and negative words in it are counted. If there are more negative words than positive words, the sentiment is negative; if there are more positive words, it is positive.

  2. Automatic approach: This approach uses machine learning. First, a model is trained on an existing labeled dataset. Next, features are extracted from the words of a new text and the model predicts its sentiment. The classifier can be built with different machine learning techniques such as Naive Bayes, Logistic Regression and Support Vector Machines.

  3. Hybrid approach: As the name suggests, this combines the rule-based and automatic approaches. Its benefit is that it is often more accurate than either approach on its own.
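
To make the rule-based approach concrete, here is a minimal sketch in Python. The two word lists are tiny, hypothetical lexicons made up for illustration; a real system would rely on a much larger curated lexicon, and treating a tie as neutral is an assumption of this sketch.

```python
import re

# Hypothetical mini-lexicons for illustration only.
POSITIVE_WORDS = {"good", "great", "comfortable", "friendly", "excellent"}
NEGATIVE_WORDS = {"bad", "delayed", "rude", "poor", "terrible"}


def rule_based_sentiment(text: str) -> str:
    """Label text by comparing counts of positive and negative lexicon words."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # tie, or no lexicon words found (assumption)
```

Calling rule_based_sentiment("Flight delayed, rude staff, but good food") counts two negative words against one positive word, so the review is labeled negative.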


Sentiment Analysis in Aerolytics

For our project, we extracted reviews of Qantas Airlines from the Skytrax website. We chose Skytrax as our data source because it is one of the most reliable sources in the aviation industry. Along with the reviews, we gathered other attributes such as Seat Type, Type of Traveler, Route and Date of Review. Our analysis covers reviews from 2015 onwards.



Web Scraping from Skytrax website

With so many libraries and tools available for web scraping, deciding which one to use in our project was not easy. First, we went through some of the freely available tools such as ParseHub, Scrapy and Webhose.io. The major problem here was understanding how to implement each tool. There was also a restriction on the amount of data that could be scraped.


Therefore, we decided to scrape the data using Python libraries. We went through the documentation of libraries such as Selenium, AutoScraper, MechanicalSoup and BeautifulSoup, and finally chose BeautifulSoup for its accuracy and ease of use. We used the HTML and lxml parsers for different attributes.


We automated this process by deploying the Python code as an AWS Lambda function and setting up a trigger with Amazon EventBridge. We scheduled the trigger monthly for two reasons: each run collects enough new reviews to produce a noticeable change in the sentiment analysis, and a monthly schedule is more cost effective than a weekly or daily one. Each execution of the Lambda function writes the reviews to a CSV file, which is then stored in an Amazon S3 bucket. Finally, the data is loaded into the Snowflake staging area via Snowpipe.
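
The Lambda step could look roughly like the sketch below. It is not the project's actual code: the bucket name, the object key layout, the column names and the scrape and upload parameters are all hypothetical, introduced so that the handler can be exercised without AWS access. When no uploader is passed in, the sketch falls back to boto3's put_object call.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical names for illustration; the post does not state the real ones.
BUCKET = "example-airline-reviews"
FIELDS = ["review_date", "seat_type", "traveller_type", "route", "review"]


def reviews_to_csv(rows):
    """Serialise a list of review dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def lambda_handler(event, context, scrape=None, upload=None):
    """Entry point that the scheduled EventBridge rule would invoke."""
    # `scrape` stands in for the BeautifulSoup scraping step (not shown here).
    rows = scrape() if scrape else []
    # One object per monthly run, e.g. reviews/2021-03.csv (assumed layout).
    key = "reviews/" + datetime.now(timezone.utc).strftime("%Y-%m") + ".csv"
    body = reviews_to_csv(rows)
    if upload is None:
        import boto3  # bundled with the AWS Lambda Python runtime

        def upload(bucket, object_key, data):
            boto3.client("s3").put_object(
                Bucket=bucket, Key=object_key, Body=data.encode("utf-8")
            )

    upload(BUCKET, key, body)
    return {"rows": len(rows), "key": key}
```

Lambda calls the handler with only event and context, so the two extra parameters keep their defaults in production and exist purely to make local testing possible.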



Since most of the data extracted from the web was unstructured, we began with data wrangling. First, we converted the reviews to lower case and tokenized them into words. Next, we removed special characters using regular expressions and removed stopwords using the list imported from nltk.corpus. Finally, we performed lemmatization to reduce each word to its base form.
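
The wrangling steps can be sketched as a single function. To keep the example dependency-free, the stopword set below is a small hypothetical stand-in for the NLTK list, and lemmatization is passed in as a function that defaults to leaving words unchanged; neither reflects the project's exact code.

```python
import re

# Small stand-in stopword list (assumption). The English list in
# nltk.corpus.stopwords is far longer.
STOPWORDS = {"the", "a", "an", "was", "were", "is", "and", "to", "of", "in", "on"}


def clean_review(text, lemmatize=lambda word: word):
    """Lowercase, tokenize, strip special characters, drop stopwords, lemmatize."""
    # Lowercase, then tokenize on whitespace.
    tokens = text.lower().split()
    # Remove every character that is not a lowercase letter.
    tokens = [re.sub(r"[^a-z]", "", token) for token in tokens]
    # Drop tokens that became empty and tokens that are stopwords.
    tokens = [token for token in tokens if token and token not in STOPWORDS]
    # Reduce each remaining word to its base form.
    return [lemmatize(token) for token in tokens]
```

With NLTK installed and its corpora downloaded, STOPWORDS could be replaced by set(stopwords.words("english")) and lemmatize by WordNetLemmatizer().lemmatize.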


Next, we classified each review as positive, negative or neutral using the VADER library in Python. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It relies on a lexicon of words and other lexical features that are labeled according to their semantic orientation as either positive or negative. VADER reports not only the polarity of a text but also how strongly positive or negative the sentiment is.
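
As a rough illustration of the labeling step, the sketch below maps VADER's compound score (a value between -1 and 1) to one of the three labels. The 0.05 cut-off is the threshold suggested in VADER's documentation; whether the project used the same value is an assumption.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# With the vaderSentiment package installed, the score would come from:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   compound = SentimentIntensityAnalyzer().polarity_scores(review)["compound"]
```

The same analyzer is also available as nltk.sentiment.vader.SentimentIntensityAnalyzer for projects that already depend on NLTK.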



Classification of Reviews:

Lastly, to take our analysis to the next level and derive meaningful insights, we decided to sort the reviews into categories that matter to an airline. This took a lot of brainstorming from the entire team. We began with some naive ideas, such as finding the most frequent word in each review and then grouping similar words into categories.


After extensive research, we found a library that returns a probabilistic score between 0 and 1 for each category we pass to its pipeline. Using the Transformers library, we reviewed the zero-shot classification models and opted for the facebook/bart-large-mnli model. With it, we assigned each review to one of the following six categories:


  • Airline Punctuality

  • Seat Comfort

  • Food & Beverages

  • Cancellations

  • Staff Service

  • Booking
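
A minimal sketch of the category assignment is shown below. The helper works on the dictionary that the Transformers zero-shot pipeline returns (with "labels" and "scores" keys), and the commented lines show how that dictionary would be produced. Assigning each review to its single highest-scoring category is an assumption about how the scores were used.

```python
CATEGORIES = [
    "Airline Punctuality",
    "Seat Comfort",
    "Food & Beverages",
    "Cancellations",
    "Staff Service",
    "Booking",
]


def top_category(result):
    """Return the highest-scoring (label, score) pair from a zero-shot result."""
    # Pair each label with its score and take the maximum by score.
    return max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])


# With the transformers package installed, `result` would be produced by:
#   from transformers import pipeline
#   classifier = pipeline("zero-shot-classification",
#                         model="facebook/bart-large-mnli")
#   result = classifier(review_text, candidate_labels=CATEGORIES)
```

By default the pipeline treats the labels as mutually exclusive, so the scores for one review sum to 1 across the six categories.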



Based on the above classification, we counted the positive, negative and neutral reviews in each category for each year.
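
The yearly breakdown amounts to a grouped count. The sketch below assumes each review has already been reduced to a record with hypothetical "year", "category" and "sentiment" keys; the field names are illustrative, not the project's schema.

```python
from collections import Counter


def sentiment_counts(records):
    """Count reviews per (year, category, sentiment) combination."""
    return Counter(
        (record["year"], record["category"], record["sentiment"])
        for record in records
    )
```

Looking up a key such as (2019, "Booking", "positive") in the returned Counter gives the number of matching reviews, and a missing combination simply yields 0.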




In addition, we derived correlations between the type of traveler and the seat type, as shown below:



Next, we created a trend line chart showing how each category performed over the years. This helped us understand how a category was performing relative to previous years and whether the service was improving. For instance, Seat Comfort has consistently been the top performing category, whereas Booking, which was the worst performing category in 2015, has improved significantly and has been one of the best categories since 2019.



Finally, we generated a word cloud based on the review text. The size and density of a word reflect how frequently it appears in the reviews: the larger and denser the word, the more often it occurs.
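
Behind a word cloud sits a plain frequency table. The sketch below builds one from already cleaned token lists; the function name and the top_n parameter are illustrative choices, and the post does not say which plotting library drew the cloud.

```python
from collections import Counter


def word_frequencies(token_lists, top_n=50):
    """Return the top_n most frequent words across all reviews with their counts."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(top_n)
```

If the wordcloud package were used, a table like this could be passed to WordCloud().generate_from_frequencies(dict(frequencies)), which sizes each word according to its count.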



Challenges with sentiment analysis

Challenges associated with sentiment analysis typically stem from imprecision in the trained models. Comments or reviews that involve sarcasm, or that describe receiving the wrong product, tend to pose a problem for sentiment analysis models and are often classified incorrectly. For example, if a customer who received a product in the wrong colour wrote "The shirt was green.", the comment would be classified as neutral when it is in fact negative.


Conclusion

For this entire process of extracting Qantas Airline reviews from Skytrax Website and classifying each review as per its sentiment, I made use of:

  1. BeautifulSoup Library - Python - Web Scraping

  2. AWS Lambda & Amazon EventBridge - Automation of entire process

  3. NLTK Toolkit - Data Wrangling

  4. Vader Library - Python - Classification of reviews into Positive/Negative/Neutral

  5. Transformers Library - Python - Classification of reviews into one of 6 categories

