Authors: Pradeep Gandham & Prasad Mistary
What is EDA?
Exploratory Data Analysis, commonly referred to as EDA, is an important process for understanding, interpreting, and summarizing data. It helps to identify missing values and outliers. EDA is generally used to understand the characteristics of a variable and its relationship with other variables in the dataset. As companies move towards a more data-oriented approach, Exploratory Data Analysis gives them a better understanding of their data.
We are going to use the 2019 Airbnb data from New York City. Airbnb is a service that provides accommodations to customers by listing properties from different hosts. The dataset mainly comprises the listings and their types, host details, pricing, reviews, and availability by location.
EDA is mainly used to:
Improve understanding of the data
Understand the importance of variables
Identify outliers and missing values
Help prepare the data for modeling
Tableau for EDA
Tableau is a data visualization tool primarily used in the Business Intelligence field to support decision-making. Tableau helps us analyze the data and create dashboards to derive insights.
EDA is the first step in any analysis, because it familiarizes us with the data. Tableau stands out as an intuitive and easy-to-use visualization tool. Whether we want to understand the data quickly and simply, or to explore it by connecting multiple data sources or using custom SQL queries, Tableau provides a solution.
The following are some of the steps in EDA that we will be exploring
The following is the 2019 Airbnb dataset for New York City, viewed in Tableau. There are 16 fields and 47895 records in the dataset. The “Name” field in the following picture shows the listings available in the dataset.
When we sort the data by “Name,” we find some anomalies in the field: it contains listings whose names are written in languages other than English. For our analysis, we consider only those records whose names use English characters. To remove the rest, we use data source filters. The interactive UI makes it easy to select these records and exclude them from our data. The following screenshots show how to remove such anomalies using the interactive filters.
Now, the data has been cleaned, and 982 records have been removed from the data set.
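For readers who want to reproduce this filter outside Tableau, here is a minimal sketch in plain Python. It is not the Tableau filter itself: it assumes that "English characters" can be approximated by "ASCII only", and the sample rows are made up for illustration.

```python
# Made-up sample rows; the real dataset has one "name" value per listing.
listings = [
    {"id": 1, "name": "Cozy loft near Central Park"},
    {"id": 2, "name": "曼哈顿中城公寓"},
    {"id": 3, "name": "Habitación privada en Brooklyn"},
]


def has_only_ascii(name):
    """Return True when the name is a string made only of ASCII characters."""
    return isinstance(name, str) and name.isascii()


# Keep the listings whose name passes the check; count the rest as removed.
cleaned = [row for row in listings if has_only_ascii(row["name"])]
removed = len(listings) - len(cleaned)
print(f"kept {len(cleaned)} rows, removed {removed}")
```

The ASCII check is only a rough proxy. It also drops English names that contain emoji or accented letters, and it keeps names written in other languages with plain Latin letters, so the number of removed records may differ from the 982 reported above.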
Univariate analysis
As the name suggests, ‘uni’ means ‘one,’ i.e., univariate analysis is performed on a single variable of the dataset. Its purpose is to describe the data using measures such as the mean, the mode, and the frequency distribution.
The following are some of the ways of performing univariate analysis:
Bar chart: A bar chart is a pictorial representation of data in the form of rectangular bars, where the length of each bar is proportional to the value it represents.
The following bar chart (Fig 1) represents the number of listings in each neighborhood group in New York City. Brooklyn and Manhattan have the largest number of listings. Staten Island has the fewest listings at 359, which is less than 2% of the listings in Manhattan.
Fig 1: Bar chart showing the number of listings per neighborhood in New York
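The counts behind a bar chart like Fig 1 are a simple frequency table. The sketch below shows the idea in plain Python. The values are made up and stand in for the neighborhood group column, so the printed numbers are not those of the real dataset.

```python
from collections import Counter

# Made-up values standing in for the neighborhood group of each listing.
groups = [
    "Manhattan", "Brooklyn", "Manhattan", "Queens",
    "Brooklyn", "Manhattan", "Staten Island",
]

# Count listings per group, the same aggregation the bar chart displays.
counts = Counter(groups)

# Print groups from most to fewest listings, with their share of Manhattan's count.
for group, n in counts.most_common():
    share = n / counts["Manhattan"]
    print(f"{group:<15}{n:>3}  {share:.0%} of Manhattan")
```

Expressing each count as a share of the largest group makes comparisons such as the Staten Island one above easy to check.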
Histogram: A histogram shows a frequency distribution. Numerical observations are grouped into contiguous classes (bins), and each class is drawn as a rectangle whose height is the number of observations in it.
In the chart below (Fig 2), the vertical axis represents the number of listings, and the horizontal axis represents the number of days in the year on which a listing is available.
The histogram shows that 18000 listings are not available at all during the year, i.e., they are available for 0 days. They might be down for maintenance or temporarily closed for that year. Close to 2000 listings are available for all 365 days.
Fig 2: Histogram showing the number of listings based on their availability
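To make the binning step concrete, here is a small sketch that groups availability values into fixed-width bins and prints a text histogram. The 30-day bin width and the sample values are assumptions chosen for illustration. Tableau picks its own bin size unless told otherwise.

```python
from collections import Counter

# Made-up availability values (days per year) for ten listings.
availability = [0, 0, 0, 12, 45, 89, 90, 180, 364, 365]
BIN_WIDTH = 30  # assumed bin size in days


def bin_start(days, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains `days`."""
    return (days // width) * width


# Count how many listings fall into each bin.
histogram = Counter(bin_start(d) for d in availability)

# Print one row per non-empty bin, with one '#' per listing.
for start in sorted(histogram):
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * histogram[start]}")
```

With this scheme, listings with 0 days share a bin with those available for 1 to 29 days. If the 0-day listings matter on their own, as they do in Fig 2, they can be counted separately before binning.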
Price: The line graph (Fig 3) below shows that some accommodations are listed at a price of $0. There are close to 1000 listings priced above $2000 per night. The distribution is also highly right-skewed.
Fig 3: Line graph showing the number of listings based on price
To reduce the skewness, we can apply a logarithmic transformation to the price. The following graph shows the distribution of price after the transformation, and it resembles a bell curve.
Fig 4: Histogram showing listings based on log of price
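The effect of the transformation can be checked numerically. The sketch below compares the skewness of made-up prices before and after taking logs. Because the logarithm of 0 is undefined and the data contains $0 listings, it uses log(1 + price). That is an assumption made here, not necessarily what was done in Tableau, where such rows could instead be filtered out.

```python
import math
import statistics

# Made-up nightly prices, including a $0 listing and one extreme value.
prices = [0, 40, 55, 60, 75, 90, 120, 150, 400, 2500]


def skewness(values):
    """Population skewness: the mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# log1p(p) computes log(1 + p), so a price of 0 maps to 0 instead of failing.
log_prices = [math.log1p(p) for p in prices]

print(f"skewness before: {skewness(prices):.2f}")
print(f"skewness after:  {skewness(log_prices):.2f}")
```

A skewness close to 0 indicates a roughly symmetric distribution. On this sample, the $0 listing becomes a low outlier after the transformation, which is one reason to examine such rows before modeling.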
In this way, univariate analysis helps us understand the individual fields of the data by plotting charts.
Bivariate analysis
Bivariate analysis is a statistical analysis that examines the relationship, or association, between two variables.
The following are some of the ways of performing bivariate analysis:
Dot plot: A dot plot, also known as a strip plot, is a simple chart in which data points are represented by small circles. It is like a bar graph in which each circle marks the top of a bar. A dot plot is used to plot data that is divided into bins or categories.
The following dot plot (Fig 5) shows the number of listings available in each neighborhood by the type of room offered by the host.
Fig 5: Number of listings by room type by neighborhood
Manhattan has the highest number of listings for entire homes/apartments and shared rooms. Brooklyn has the highest number of private room listings, with Staten Island being the last in all three types of listings.
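The numbers behind Fig 5 form a cross-tabulation of two categorical variables. Here is a minimal sketch that counts listings per (neighborhood, room type) pair and then finds the leading neighborhood for each room type. The rows are invented and are not real counts.

```python
from collections import Counter

# Made-up (neighborhood, room type) pairs, one per listing.
rows = [
    ("Manhattan", "Entire home/apt"), ("Manhattan", "Entire home/apt"),
    ("Manhattan", "Private room"), ("Manhattan", "Shared room"),
    ("Brooklyn", "Entire home/apt"), ("Brooklyn", "Private room"),
    ("Brooklyn", "Private room"), ("Staten Island", "Private room"),
]

# Cross-tabulation: number of listings for each (neighborhood, room type) pair.
crosstab = Counter(rows)

# For each room type, keep the neighborhood with the highest count.
leaders = {}
for (group, room_type), n in crosstab.items():
    best = leaders.get(room_type)
    if best is None or n > crosstab[(best, room_type)]:
        leaders[room_type] = group

for room_type, group in sorted(leaders.items()):
    print(f"{room_type}: {group} ({crosstab[(group, room_type)]} listings)")
```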
Stacked bar chart: A stacked bar chart is a bar chart in which each bar is divided into segments, one per sub-category. This type of chart is used to:
compare the sub-categories within a category, and
compare the categories themselves
The following stacked bar chart (Fig 6) shows the average minimum number of nights for accommodations, broken down by neighborhood and room type.
Manhattan stands out as the place where people are interested in booking an entire apartment, whereas shared rooms are more likely to be booked for longer stays in Brooklyn. Brooklyn and Manhattan have almost the same average number of nights booked for a private room.
Fig 6: Average Minimum number of nights for room types by neighborhood
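Each segment in a chart like Fig 6 is a grouped average. The following sketch computes the average minimum nights for every (neighborhood, room type) pair. The records are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up (neighborhood, room type, minimum nights) records.
records = [
    ("Manhattan", "Entire home/apt", 3),
    ("Manhattan", "Entire home/apt", 7),
    ("Manhattan", "Private room", 2),
    ("Brooklyn", "Private room", 4),
    ("Brooklyn", "Shared room", 10),
    ("Brooklyn", "Shared room", 2),
]

# Collect the minimum-nights values of each (neighborhood, room type) pair.
nights_by_pair = defaultdict(list)
for group, room_type, nights in records:
    nights_by_pair[(group, room_type)].append(nights)

# Average each group; these values would become the bar segments.
averages = {pair: mean(values) for pair, values in nights_by_pair.items()}

for (group, room_type), avg in sorted(averages.items()):
    print(f"{group:<10}{room_type:<17}{avg:.1f}")
```

Averages are sensitive to extreme values, so a few listings with very long minimum stays can dominate a segment. Comparing the result with the median is a quick sanity check.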
Dual-axis graph: A dual-axis graph is used to show the relationship between two different variables. When two measures have different types or scales, such as price and the number of listings in this case, the second data series can be plotted against a secondary axis, either horizontal or vertical.
The following dual-axis graph (Fig 7) shows the average price of the accommodations in each neighborhood on one y-axis and the number of listings in those neighborhoods on the other. The solid circles represent the number of listings, and the bars compare the average price of the listings in each neighborhood.
Manhattan has both the highest number of listings and the highest average price. Brooklyn has a similar number of listings to Manhattan, but its average price is only 63% of Manhattan’s.
Fig 7: Number of listings and average price by neighborhood
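The two measures plotted in Fig 7 come from a single grouping: the count of listings and the mean price per neighborhood. The sketch below computes both, along with the ratio of Brooklyn's average price to Manhattan's. The prices are invented, so the ratio will not match the 63% quoted above.

```python
from collections import defaultdict
from statistics import mean

# Made-up (neighborhood, nightly price) records.
records = [
    ("Manhattan", 200), ("Manhattan", 180), ("Manhattan", 220),
    ("Brooklyn", 120), ("Brooklyn", 130),
    ("Queens", 90),
]

# Group the prices by neighborhood.
prices_by_group = defaultdict(list)
for group, price in records:
    prices_by_group[group].append(price)

# One row per neighborhood with both measures of the dual-axis chart.
summary = {
    group: {"listings": len(prices), "avg_price": mean(prices)}
    for group, prices in prices_by_group.items()
}

ratio = summary["Brooklyn"]["avg_price"] / summary["Manhattan"]["avg_price"]
print(summary)
print(f"Brooklyn average price is {ratio:.1%} of Manhattan's")
```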
Treemap: A treemap is a visualization that shows hierarchical data and its structure. The data is arranged as rectangles, with the highest data point drawn as the largest rectangle, and color density is used to show the flow of the hierarchy.
The following treemap (Fig 8) shows the listings with the most reviews in the dataset. We can include any number ‘n’ of data points in the treemap, as long as the chart remains informative.
Fig 8: Top 10 hotels with the highest number of reviews
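Selecting the top ‘n’ listings by review count, as Fig 8 does with n = 10, can be sketched as follows. The listing names and review counts are hypothetical, and n is set to 3 to keep the example short.

```python
import heapq

# Hypothetical listings with their number of reviews.
listings = [
    {"name": "Room near JFK", "reviews": 629},
    {"name": "Great bedroom in Manhattan", "reviews": 607},
    {"name": "Cozy studio", "reviews": 15},
    {"name": "Beautiful bedroom", "reviews": 597},
    {"name": "Sunny loft", "reviews": 240},
]

TOP_N = 3  # number of listings to keep; the chart above uses 10

# nlargest returns the TOP_N rows with the most reviews, in descending order.
top_listings = heapq.nlargest(TOP_N, listings, key=lambda row: row["reviews"])

for row in top_listings:
    print(f"{row['name']:<28}{row['reviews']:>5}")
```

In a treemap, each of these rows would become a rectangle whose area is proportional to its review count.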
In this way, EDA will help us understand the data variables and better prepare our data for data modeling, KPI calculations, and dashboards.