Author: Divyansh Kumar
Sentiment

Sentiment is defined as “an attitude or opinion that is often caused or influenced by emotion” by Oxford Languages. Sentiment can be positive, negative, or neutral.
For the purposes of this project, when we say sentiment, we mean the sentiment that tweets from the potential/target audience of an upcoming movie express towards that movie and its actors.
Sentiment Analysis

Sentiment Analysis is a natural language processing (NLP) technique used to determine whether a piece of text is positive, negative, or neutral.
It is often performed on textual data to help businesses analyze the sentiment around their brand and products in customer feedback/reviews.
In this project, we use Twitter as the source of the data on which we conduct sentiment analysis. For the sentiment analysis itself, we use TextBlob, a Python library for processing textual data built on top of the Natural Language Toolkit (NLTK).
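As a quick taste of what TextBlob provides (a minimal sketch; the exact scores come from TextBlob's bundled lexicon):

{
from textblob import TextBlob

blob = TextBlob("The trailer looks amazing")
print(blob.sentiment)  # e.g. Sentiment(polarity=0.6, subjectivity=0.9)
}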
Why Twitter?

Twitter is one of the largest social networking sites, where people share their opinions and news on anything and everything. It is widely used because it lets people express their opinions like no other platform, which makes it a great platform for sentiment analysis.
Twitter data is the easiest to fetch out of all the major social media sites. Now, let’s see how we scrape data from Twitter.
Available Twitter scrapers:
Twitter API & Tweepy
Even though the Twitter API is the official way to access Twitter's vast data, it has many limitations, such as pricing and API rate limits.
You can only fetch a fixed number of tweets: in my experience, roughly 2,500 every 15 minutes, and there is a monthly cap as well (we hit the API limit after ~30k tweets).
Beyond that, the endpoints that can return a lot of tweets at once are part of the paid tier of the Twitter API, which is very costly.
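For context, this is roughly what fetching tweets through Tweepy looks like (an illustrative sketch, not code from our pipeline; the bearer token and query are placeholders):

{
import tweepy

# placeholder bearer token; wait_on_rate_limit makes Tweepy sleep out the 15-minute window
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN", wait_on_rate_limit=True)

# recent search returns at most 100 tweets per request and counts against a monthly cap
response = client.search_recent_tweets(query="JugJugg Jeeyo -is:reply", max_results=100)
}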
Such limitations of the Twitter API and the hindrance it caused to our work made us look for alternatives to the Twitter API.
We found snscrape and Twint to be the best alternatives to the Twitter API and Tweepy combination for retrieving tweets, and we went with snscrape as the best match for our task.
Snscrape
snscrape is an amazing Python library that lets you fetch any number of tweets for any date range, and it's totally free!

How did we do it?
Keyword: the keyword added to the query (a movie/actor/actress name).
Query: the search string passed into the tweet-scraping function. It contains the keyword together with yesterday's and today's dates, so the code can be automated to fetch fresh data every day. The query we pass into the snscrape function has the form:
keyword until:<date> since:<date> -filter:replies
E.g.: { query = JugJugg Jeeyo until:2022-06-20 since:2022-05-25 -filter:replies }
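Building that query string in Python might look like this (a small sketch; today, yesterday, and query match the variable names used in the scraping loop below):

{
from datetime import date, timedelta

today = date.today()
yesterday = today - timedelta(days=1)

# str() renders the dates as YYYY-MM-DD, which is what Twitter's search syntax expects
query = "JugJugg Jeeyo until:" + str(today) + " since:" + str(yesterday) + " -filter:replies"
}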
First, we get every tweet for our given keyword and date using the command below
{ sntwitter.TwitterSearchScraper(query).get_items() }.
Then we use a for loop to iterate through every tweet to fetch the tweet content, username, likes, retweets, and the tweet date.

{
# assumes sntwitter (snscrape.modules.twitter), pandas as pd, clean_text(),
# sentiment_score(), and the variables keyword (list of names), today,
# yesterday, limit, df1 and df2 (empty DataFrames) are defined earlier
for key in keyword:
    movie = str(key) + " until:" + str(today) + " since:" + str(yesterday) + " -filter:replies"
    new_list = []        # one row per tweet for this keyword
    other_new_list = []  # one daily summary row for this keyword
    word_cloud = []      # words for the word cloud, collected during cleaning
    neutral = 0
    positive = 0
    negative = 0
    for tweet in sntwitter.TwitterSearchScraper(movie).get_items():
        if len(new_list) == limit:
            break
        try:
            clean_tweet = clean_text(tweet.content)
            sentiment_result = sentiment_score(clean_tweet)
            if sentiment_result == 0:
                neutral += 1
            elif sentiment_result == 1:
                positive += 1
            else:
                negative += 1
            new_list.append([key, tweet.date, tweet.user.username, tweet.content,
                             tweet.likeCount, tweet.retweetCount, sentiment_result])
        except Exception:
            continue
    df = pd.DataFrame(new_list, columns=['Movie Name', 'Date', 'User', 'Tweet',
                                         'Likes', 'Retweets', 'Sentiment Score'])
    df1 = pd.concat([df1, df])
    total_tweets = len(df.index)
    word_cloud = " ".join(word_cloud)
    other_new_list.append([key, today, total_tweets, positive, negative, neutral, word_cloud])
    df_1 = pd.DataFrame(other_new_list, columns=['Movie Name', 'Date', 'Total Tweets',
                                                 'Positive Tweets', 'Negative Tweets',
                                                 'Neutral Tweets', 'Word Cloud'])
    df2 = pd.concat([df2, df_1])
}
We then send the tweet content to a function we call clean_text(), where it is cleaned: retweet markers, mentions, URLs, and other special characters are stripped out with regular expressions. The cleaned text is then sent for sentiment analysis.
{
def clean_text(text):
    # strip the "RT @user: " retweet prefix
    remove_rt = lambda x: re.sub(r'RT @\w+: ', "", x)
    # strip @mentions, URLs, and any remaining non-alphanumeric characters
    rt = lambda x: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", x)
    clean_text = rt(remove_rt(text))
    clean_text = clean_text.lower()
    return clean_text
}
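For instance, here is roughly what clean_text() does to a made-up tweet (an illustrative input, not one from our dataset):

{
raw = "RT @someuser: Loved the #JugJuggJeeyo trailer!! https://t.co/abc123"
print(clean_text(raw))  # -> "loved the jugjuggjeeyo trailer " (modulo leftover whitespace)
}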

We also want to create a word cloud of all the tweets we save, to see whether any particular word dominated the tweets for that day. (If you don't want a word cloud, you can skip this step entirely.)
So, after removing the special characters, we clean the tweet content further by also removing stopwords such as "I", "he", "him", "don't", and "can't", among many others.
Stopwords are not removed before the sentiment analysis because they are needed to determine the polarity of a tweet, but for a word cloud we don't need them.

We then split the newly cleaned tweet into individual words and add them to the list we created to store all the words for that particular keyword, as shown below.

{
# stopwords comes from nltk.corpus; run nltk.download('stopwords') once beforehand
words = clean_text.split()
stopword_list = stopwords.words('english')
words = [word for word in words if word not in stopword_list]
word_cloud_text = ' '.join(words)
}
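The rendering step isn't shown in our pipeline code, but turning word_cloud_text into an actual image could look like this (a sketch assuming the wordcloud and matplotlib packages):

{
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(width=800, height=400, background_color='white').generate(word_cloud_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
}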
After the tweet content is cleaned, we send it to a function we call sentiment_score() which uses the TextBlob library.
TextBlob is a Python library for processing textual data that provides a simple API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Inside the sentiment_score() function, we first pass the cleaned tweet content to the TextBlob() constructor, which breaks the text down and tags each word with its part of speech.
We then use the sentiment property to get the polarity and subjectivity of the tweet.
Finally, we check whether the polarity is greater than, equal to, or less than 0 and return 1, 0, or -1 respectively. This result is stored in the sentiment_score variable, which is added to the dataframe row for every tweet along with the tweet content, username, likes, retweets, and tweet date that we stored the moment we fetched the tweet.

{
def sentiment_score(text):
    analysis = TextBlob(text)
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1
}
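A couple of quick calls show the mapping (illustrative inputs; the exact polarity values come from TextBlob's lexicon):

{
print(sentiment_score("what a fantastic trailer"))  # 1, polarity > 0
print(sentiment_score("the movie was terrible"))    # -1, polarity < 0
}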
We do this up to the maximum limit we set for each keyword (movie/actor/actress name). We also store the total number of tweets, likes, retweets, and the word cloud list for each keyword for each day, so we can get a daily/weekly view of what the Twitter data looked like from a general perspective.

Challenges:
One of the first challenges we faced was getting a LOT of tweets that did not contain the keywords we were searching for. On digging into the tweets via the links snscrape returns, we found that we were fetching not only the tweets containing our keyword but also the replies to those tweets. So we added '-filter:replies' to the initial query statement, which solved the replies issue.
Another issue we faced was that we were also getting tweets in languages other than English, such as Hindi, Latin, Greek, and Spanish, because these tweets contained keywords that were in English and matched the movie names. To solve this, we used the langdetect library and an if statement to detect the language and only add a tweet to the database if it was in English, Hindi, Kannada, Tamil, Telugu, or Malayalam.
Even though we accept tweets in the major regional Indian languages, we can't run sentiment analysis on them directly, so we translate the regional-language tweets to English for the purpose of sentiment analysis before loading the original tweets into the database.
{
from langdetect import detect

# if the language is already English, we don't need to translate it
lang = detect(tweet.content)
if lang == 'en':
    tweet_text = tweet.content
elif lang in ('hi', 'kn', 'ta', 'te', 'ml'):
    tweet_text = TextBlob(tweet.content).translate(from_lang=lang, to='en')
}
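One caveat worth flagging: TextBlob's translate() proxied Google Translate and has been removed from newer TextBlob releases, so anyone reproducing this today may need a substitute. A sketch using the deep-translator package (our assumption, not part of the original pipeline):

{
# hypothetical substitute for TextBlob.translate(); not used in the original project
from deep_translator import GoogleTranslator

def translate_to_english(text, lang):
    if lang == 'en':
        return text
    return GoogleTranslator(source=lang, target='en').translate(text)
}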
