Understanding Plots/Charts and Their Use Cases
Over the last 15 years, an enormous amount of data has been generated every minute across the globe, and companies spend a lot of time and money trying to make sense of it.
Now suppose a company hires you as a Data Scientist or a Data Analyst. Your work will primarily deal with data. After getting and validating any data, the first step is to explore and analyze it and, in some cases, to present the analysis to the decision makers of the company. How would you present the data so that even a non-technical person from management understands it? The obvious answer is to make charts out of the data. For structured data, charts can serve either univariate or bivariate analysis. Now, let’s dig deeper into the various charts and their use cases.
Here, we will use the publicly available Sample Superstore data to interpret the various charts.
1) Bar Chart
The bar chart is one of the most commonly used charts for representing data in an effective and easy manner. It is used when the data has both categorical and numerical variables and we want to see an aggregate measure of the numerical variable for each category.
In the figure, we can see which sub-categories of products give high sales and which give low sales. A bar chart gives a visual comparison between discrete categories, and it is most effective when the number of categories is small. Bar charts can be vertical or horizontal. In Python, matplotlib provides the bar() function for vertical bar charts and barh() for horizontal ones. A bar chart of category counts alone is univariate analysis; one that aggregates a numerical variable per category, as here, involves two variables.
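The aggregation step described above can be sketched as follows. This is only an illustration: the rows are made up and the column names sub_category and sales are assumptions, not the actual Superstore schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up order lines; column names are assumptions for illustration.
orders = pd.DataFrame({
    "sub_category": ["Chairs", "Phones", "Binders", "Chairs", "Phones", "Binders"],
    "sales": [250.0, 400.0, 30.0, 310.0, 150.0, 45.0],
})

# Aggregate the numerical variable (sales) per category, largest first.
sales_by_sub_category = (
    orders.groupby("sub_category")["sales"].sum().sort_values(ascending=False)
)

fig, ax = plt.subplots()
bars = ax.bar(sales_by_sub_category.index, sales_by_sub_category.values)
ax.set_xlabel("Sub-category")
ax.set_ylabel("Total sales")
ax.set_title("Sales by sub-category")
```

Call plt.show() or fig.savefig("bar.png") afterwards to view the chart. Swapping ax.bar for ax.barh gives the horizontal version.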
2) Histogram
A histogram is a chart we commonly use to see the distribution of numerical data across specified ranges, called bins. It is similar to a bar chart but condenses the data into logical ranges, which makes it more interpretable. Because each bar shows only a count per bin, it stays readable even when there are many data points.
Here, we see that about 4,800 orders contain 2 or 3 items and about 2,400 orders contain 4 or 5 items. In Python, the matplotlib library offers a hist() function for making histograms. All in all, we use histograms when the data is large and we want to condense it into groups to see the count in each group.
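A minimal sketch of binning with hist(), assuming randomly generated item counts rather than the real order data. The bin edges are chosen by hand so that each bin covers two item counts.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up number of items per order, between 1 and 14.
quantities = rng.integers(1, 15, size=1000)

fig, ax = plt.subplots()
# Bin edges 1, 3, 5, ..., 15 group the values into ranges of two item counts.
counts, edges, patches = ax.hist(quantities, bins=range(1, 17, 2), edgecolor="black")
ax.set_xlabel("Items per order")
ax.set_ylabel("Number of orders")
```

hist() returns the count in each bin, so the same call can be used to read off the numbers as well as to draw them.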
3) Scatter Plot
The scatter plot is one of the most common charts used by Data Scientists for exploratory data analysis. It is used to find relationships between numerical variables and to detect outlier data points. A scatter plot shows the extent of correlation between two variables, and the pattern generally falls into one of three types: positive correlation, negative correlation, or no correlation. To make a scatter plot, we need two numerical variables, one on the x-axis and one on the y-axis.
As we can see from the plot, there is an outlier data point at a 50% discount that gives very high sales. We can also see that sales are very good even at the base price without discounts. Scatter plots become less effective with a large number of data values, because overlapping points hide the pattern.
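To put a number on the correlation a scatter plot shows, one option is the Pearson coefficient. The sketch below uses made-up data that is deliberately built to be positively correlated.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Made-up data: y rises with x, plus random noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(0, 2, size=200)

# Pearson correlation: near +1 is positive, near -1 negative, near 0 none.
r = np.corrcoef(x, y)[0, 1]

fig, ax = plt.subplots()
ax.scatter(x, y, s=15, alpha=0.6)  # transparency helps when points overlap
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title(f"Scatter plot (r = {r:.2f})")
```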
4) Box Plot
A box plot, or box-and-whisker plot, is a chart very commonly created in data analysis to display a summary of the data values. A box plot divides the data into sections that each contain approximately 25% of the data, and gives a five-number summary: minimum, 1st quartile, median, 3rd quartile, and maximum. It is also used for detecting outliers in numerical variables. Only after detecting the outliers and seeing how many there are can we decide how to treat them.
In the figure above, discounts of more than 45% appear as outliers in the discount distribution.
We can also see the summary of the data as box plots split by a categorical variable.
Here, we can see outliers in discounts across different regions. Discounts of 80% were given for certain products only in the Central region. In Python, box plots can be made using the seaborn and matplotlib libraries. Seaborn provides a boxplot() function to make box plots directly.
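The five-number summary and the outlier check can also be computed by hand. The sketch below uses a few made-up discount percentages and assumes the common convention that points more than 1.5 times the interquartile range beyond the quartiles are outliers; matplotlib's boxplot() uses the same convention by default.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up discount percentages, used only for illustration.
discounts = np.array([0, 0, 0, 10, 20, 20, 20, 30, 80])

q1, median, q3 = np.percentile(discounts, [25, 50, 75])
iqr = q3 - q1
# Points beyond these fences are treated as outliers.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = discounts[(discounts < lower_fence) | (discounts > upper_fence)]

fig, ax = plt.subplots()
box = ax.boxplot(discounts)  # default whiskers follow the same 1.5 * IQR rule
ax.set_ylabel("Discount (%)")
```

In this toy data only the 80% discount falls beyond the upper fence, so it is the single point drawn as an outlier.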
5) Word Cloud
The charts we have seen so far apply only when the data is numerical and in a tabular format. How do we represent data that is present as text, such as video transcripts or reviews? For that, we use word cloud charts to showcase the most frequent terms in a corpus of text data. Word clouds can be built from unigrams (1 word), bigrams (2 words), trigrams (3 words) and so on.
This is a unigram word cloud I made for the video transcripts of the class 6 content of an e-learning portal. Here, the bigger words appear more often and the smaller words appear less often. Word clouds can also support sentiment analysis. As an example, let’s take reviews of the web series Money Heist and look at their sentiment.
In Python, word clouds can be made using the wordcloud library, which is open source.
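A word cloud is driven by term frequencies, so the counting step can be sketched with the standard library alone. The reviews and the stop-word list below are made up for illustration. The resulting counts are the kind of input that the wordcloud library's WordCloud.generate_from_frequencies() accepts.

```python
from collections import Counter
import re

# Made-up review text, used only for illustration.
reviews = [
    "great plot and great acting",
    "the plot was slow but the acting was great",
    "slow start great ending",
]

# A tiny, hypothetical stop-word list; real projects use a fuller one.
STOP_WORDS = {"the", "and", "was", "but"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

unigrams = Counter()
bigrams = Counter()
for review in reviews:
    tokens = tokenize(review)
    unigrams.update(tokens)
    # Pair each token with the one that follows it to form bigrams.
    bigrams.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))

top_term = unigrams.most_common(1)[0]
```

The most frequent term would be drawn largest in the cloud.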
6) Violin Chart
A violin chart shows the distribution of numeric data along with its probability density. It can be considered a combination of a box plot and a density plot. Violin plots are helpful when we need to compare different groups of data.
As we can see, a violin plot gives the shape of the data density distribution and also a summary of the data. It is commonly used for comparison analysis as well, as we can see from the figure below.
Both the matplotlib and seaborn libraries in Python offer a violinplot() function.
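A minimal sketch of comparing groups with matplotlib's violinplot(), assuming made-up sales figures for three hypothetical regions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Made-up sales figures for three hypothetical regions.
groups = {
    "East": rng.normal(200, 30, size=150),
    "West": rng.normal(260, 60, size=150),
    "South": rng.normal(180, 20, size=150),
}

fig, ax = plt.subplots()
# One violin per group; showmedians adds a median marker inside each shape.
parts = ax.violinplot(list(groups.values()), showmedians=True)
ax.set_xticks(range(1, len(groups) + 1))  # violins sit at positions 1, 2, 3
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("Sales")
```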
7) Pie Chart
Pie charts are very commonly used; a pie chart is basically a categorical share analysis chart. We often use it to gauge whether the data is balanced or imbalanced in classification problems (binary and multi-class). In a pie chart, the numerical proportion of each category is represented as a slice of a circular pie. Whenever we have a discrete categorical variable and want to see each category's share of the total, we can plot a pie chart.
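The class-balance check mentioned above might look like this. The labels are made up, with a deliberate 90/10 imbalance.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Made-up target labels for a binary classification problem.
labels = ["no"] * 90 + ["yes"] * 10

class_counts = Counter(labels)
total = sum(class_counts.values())
# Share of each class as a percentage of all rows.
class_share = {name: 100 * count / total for name, count in class_counts.items()}

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    list(class_counts.values()),
    labels=list(class_counts.keys()),
    autopct="%1.1f%%",  # print the percentage on each slice
)
ax.set_title("Class balance")
```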
8) Swarm Plot
A swarm plot can be a good complement to a box plot or a violin plot. Whereas those plots show a summary of the data points, a swarm plot draws every data point, arranged like a swarm of bees, and so shows the shape of the data distribution. It is similar to a scatter plot but is better for comparing distributions across categories in a single chart. It is mainly used when we want to show the distribution of numerical data with respect to different categories in a single chart.
In Python, seaborn provides a swarmplot() function; matplotlib has no built-in swarm plot function. Swarm plots are also available in other tools, for example swarmchart in MATLAB and the beeswarm package in R.
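A minimal sketch using seaborn's swarmplot(), assuming a small made-up data set. Swarm plots get crowded quickly, so the example keeps to 40 points per category.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
# Made-up profit values for three hypothetical customer segments.
df = pd.DataFrame({
    "segment": np.repeat(["Consumer", "Corporate", "Home Office"], 40),
    "profit": np.concatenate([
        rng.normal(50, 15, 40),
        rng.normal(70, 25, 40),
        rng.normal(40, 10, 40),
    ]),
})

fig, ax = plt.subplots()
# Every row becomes one point; points are nudged sideways to avoid overlap.
sns.swarmplot(data=df, x="segment", y="profit", size=4, ax=ax)
```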
9) Line Chart
Line charts are simple yet very effective charts for analyzing numerical data. A line chart shows the evolution or trend of one or several numeric variables. It is used heavily in time series analysis, where we need the trend of the data plotted against time. Multiple line charts can also be drawn together to compare different trend lines. Line charts are very widely used in time series forecasting. Cyclical or seasonal data can also be plotted as a line chart, although the repeating pattern can make the underlying trend harder to read.
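Comparing two trend lines on one time axis can be sketched as below. The monthly values and the category names are made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up monthly sales for two hypothetical product categories.
months = pd.date_range("2023-01-01", periods=12, freq="MS")  # month starts
furniture = np.linspace(100, 210, 12)   # steadily rising
technology = np.linspace(180, 125, 12)  # steadily falling

fig, ax = plt.subplots()
ax.plot(months, furniture, marker="o", label="Furniture")
ax.plot(months, technology, marker="o", label="Technology")
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.legend()
fig.autofmt_xdate()  # tilt the date labels so they do not collide
```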
10) Heat map Chart
Heat maps are widely used in EDA, business analysis, financial analysis and so on. In its most common form, the correlation heat map, this chart shows the relationship between every pair of variables, with hue and intensity indicating the strength of the relationship. It is often used to spot multicollinearity in high-dimensional data. The seaborn library offers a very effective heatmap() function for drawing heat maps.
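A correlation heat map can be sketched as follows. The data is made up, with profit deliberately built to track sales so that one strongly correlated pair shows up.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
sales = rng.uniform(50, 500, size=300)
# Made-up columns; names are assumptions for illustration.
df = pd.DataFrame({
    "sales": sales,
    "profit": 0.3 * sales + rng.normal(0, 10, size=300),  # built to track sales
    "quantity": rng.integers(1, 10, size=300),            # independent of both
})

corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
# Fixing the color scale to [-1, 1] keeps colors comparable between charts.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

A pair of predictors with a correlation close to 1 or -1, like sales and profit here, is a candidate for multicollinearity.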
11) Bubble Chart
A bubble chart is much the same as a scatter plot but adds a third dimension: the value of a third variable is shown through the size of the bubbles. It is mainly used to compare the magnitude of numerical data across different categories or zones. Making a bubble chart takes three numeric variables (x position, y position, and bubble size); a categorical variable is often added to color or label the bubbles. We can easily make bubble charts using the plotly express library in Python, and also in Tableau.
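The sketch below builds a bubble chart with matplotlib's scatter() rather than plotly express, to keep the example dependency-light. In plotly express the equivalent is the size argument of scatter(). All figures are made up.

```python
import matplotlib.pyplot as plt

# Made-up figures for four hypothetical regions.
regions = ["Central", "East", "South", "West"]
sales = [500, 680, 390, 720]
profit = [40, 90, 45, 110]
orders = [230, 280, 160, 320]  # third numeric variable, shown as bubble size

fig, ax = plt.subplots()
# s is the marker area in points squared, so scale the raw values as needed.
bubbles = ax.scatter(sales, profit, s=[n * 3 for n in orders], alpha=0.5)
for name, x, y in zip(regions, sales, profit):
    ax.annotate(name, (x, y), ha="center")  # categorical variable as a label
ax.set_xlabel("Sales")
ax.set_ylabel("Profit")
```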
12) Distribution Plot
As the name suggests, it shows the distribution of data through a combination of a histogram and a smooth line. It is very useful when we need to determine whether the distribution of the data is normal or skewed to the left or right. We can use the seaborn and plotly libraries in Python to make distribution charts.
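One way to check for skew is to draw the histogram against a reference curve and compute the sample skewness. Note the swap: instead of the smoothed density line that seaborn or plotly would draw, this sketch overlays a normal curve with the same mean and standard deviation. The data is made up and deliberately right-skewed.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
# Made-up, right-skewed values (exponential distribution).
values = rng.exponential(scale=100, size=2000)

mean, std = values.mean(), values.std()
# Sample skewness: > 0 means a longer right tail, < 0 a longer left tail.
skewness = np.mean(((values - mean) / std) ** 3)

fig, ax = plt.subplots()
ax.hist(values, bins=40, density=True, alpha=0.5)  # density=True: area sums to 1
# Normal curve with the same mean and standard deviation, for comparison.
grid = np.linspace(values.min(), values.max(), 300)
normal_pdf = np.exp(-0.5 * ((grid - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
ax.plot(grid, normal_pdf)
ax.set_title(f"Skewness = {skewness:.2f}")
```

When the bars sit far from the reference curve and the skewness is clearly non-zero, the data is not close to normal.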
13) Area Chart
It is basically a chart used for continuous numerical data, in place of a line chart, to show trend lines with respect to date-type data. In an area plot, the area under each line is filled, and the line forms the boundary of the area, giving the values at different points in time. We can also make stacked area charts, which show both the absolute values and each series' relative share of the total. In Python, plotly express offers an area() function and matplotlib offers stackplot() and fill_between(); seaborn does not have a dedicated area chart function.
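A stacked area chart can be sketched with matplotlib's stackplot(). The monthly values and the category names are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

months = np.arange(1, 13)
# Made-up monthly sales for two hypothetical categories.
furniture = np.array([20, 22, 25, 24, 28, 30, 33, 31, 35, 38, 40, 45])
technology = np.array([30, 28, 32, 35, 34, 38, 41, 40, 44, 47, 50, 52])

fig, ax = plt.subplots()
# Layers are stacked, so the top edge of the chart is the combined total.
layers = ax.stackplot(months, furniture, technology, labels=["Furniture", "Technology"])
total = furniture + technology
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.legend(loc="upper left")
```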
14) Density Plot
It is a smoothed variation of the histogram that shows the density distribution of numerical data. Instead of counting points in bins, it estimates the probability density function of the variable using a kernel density estimate. It shows the peaks and troughs (local maxima and minima) of the distribution. It does not compute a p-value by itself, but it helps in seeing the tails of the distribution, which is where p-values are read from. The pandas, matplotlib and seaborn libraries help in making density plots. We mainly use density plots when we want to see the distribution of a numeric variable as a smooth curve.
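To show what a kernel density estimate does, the sketch below computes one by hand with NumPy instead of calling a library function. It assumes a Gaussian kernel and a common rule-of-thumb bandwidth; the data is made up with two clusters.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(11)
# Made-up data with two clusters, so the density has more than one peak.
data = np.concatenate([rng.normal(20, 3, 300), rng.normal(40, 5, 200)])

# Rule-of-thumb bandwidth: 1.06 * standard deviation * n^(-1/5).
n = data.size
bandwidth = 1.06 * data.std() * n ** (-1 / 5)

grid = np.linspace(data.min() - 3 * data.std(), data.max() + 3 * data.std(), 500)
# The estimate is the average of Gaussian bumps centred on every data point.
z = (grid[:, None] - data[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# Area under the curve, approximated on the grid; it should be close to 1.
area = density.sum() * (grid[1] - grid[0])

fig, ax = plt.subplots()
ax.plot(grid, density)
ax.set_ylabel("Density")
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve, which plays the role that bin width plays in a histogram.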
15) Waterfall Chart
A waterfall chart is a two-dimensional chart mainly used in financial analysis. It is used for understanding the cumulative effect, over time or across a variable, of adding positive or negative values. It is very commonly used to see the profit and loss contribution of various factors in financial analysis. It applies to continuous numerical variables and is typically used for static data, i.e. data which is not cyclical. It is useful for visualizing what is added to or subtracted from a total. There is a separate library for waterfall charts called waterfallcharts, which can be installed and which plots with the help of matplotlib.
The above chart plots the revenue for each month to show how much each month contributes cumulatively, with the values added sequentially.
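The sequential adding behind a waterfall chart can be sketched with plain matplotlib bars, without the waterfallcharts library. Each bar starts where the previous running total ended. The monthly changes are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up monthly changes in revenue (negative values are losses).
months = ["Jan", "Feb", "Mar", "Apr"]
changes = np.array([120, 80, -40, 60])

running_total = np.cumsum(changes)     # total after each month
bar_bottoms = running_total - changes  # where each bar starts

fig, ax = plt.subplots()
colors = ["tab:green" if c >= 0 else "tab:red" for c in changes]
# A negative height draws the bar downward from its starting point.
ax.bar(months, changes, bottom=bar_bottoms, color=colors)
ax.set_ylabel("Cumulative revenue")
```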
So, these are some of the most commonly used charts in data analytics and data exploration. There are many more kinds of charts available for making sense of data. While making each chart, we can also fine-tune it using the various parameters available in the seaborn, matplotlib and plotly libraries. The best way to explore different kinds of charts is to go through the official documentation of the seaborn and matplotlib libraries, and to experiment with the various parameters while exploring and visualizing the data.
Lastly, “Practice makes a man perfect”. So keep practicing, and gradually you will be able to generate deep insights by choosing and visualizing charts according to the data and business requirements.