CloudyML


Why is statistics so important in data science?

Statistics deals with methodologies to gather, review, analyze and draw conclusions from data. With specific statistics tools in hand, we can derive many key observations and make predictions from the data.

In the real world, we deal with many situations where we use statistics, knowingly or unknowingly. We see and talk about statistics all the time, but few of us know the science behind it. If we learn that science, we can debate with facts and figures and understand the statistics behind them better.

Types of Statistics

Descriptive Statistics:  

This type of statistics deals with numbers (numerical facts/figures/information) to describe a phenomenon. Such numbers are descriptive statistics, e.g. reports of industry production, cricket batting/bowling averages, government deficits, movie ratings, etc.

Inferential statistics

Inferential statistics is a decision/estimate/prediction/generalization about a population, based on a sample. Unlike descriptive statistics, which describe what’s going on in our data, inferential statistics are used to draw inferences that go beyond the data at hand.
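As a small sketch of this idea (the population of heights and its parameters here are entirely hypothetical, generated for illustration): we hold descriptive statistics about the sample we actually have, and use them to infer a property of the larger population.

```python
import random

# Hypothetical population: heights (cm) of 10,000 people.
random.seed(42)
population = [random.gauss(170, 10) for _ in range(10_000)]

# A sample is a portion of the population of interest.
sample = random.sample(population, 100)

# Descriptive statistics describe the data we have...
population_mean = sum(population) / len(population)

# ...while inferential statistics use the sample to estimate
# an unknown population quantity.
sample_mean = sum(sample) / len(sample)

print(f"population mean ~ {population_mean:.1f}")
print(f"sample mean     ~ {sample_mean:.1f}")
```

With a reasonably sized random sample, the sample mean lands close to the population mean, which is exactly the kind of generalization inferential statistics justifies.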

  • Population is a collection of all possible individuals, objects, or measurements of interest.
  • Sample is a portion, or a part, of the population of interest.

Types of Data

  1. Categorical data: It represents characteristics such as a person’s gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as “1” indicating male and “2” indicating female), but those numbers have no mathematical meaning; you can’t meaningfully add them together.
  2. Numerical data: It consists of quantities that denote a measurement, such as a person’s height, weight, IQ, or blood pressure, or a count, such as the number of stock shares a person owns.
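A quick sketch of the difference (the survey records here are made up for illustration): arithmetic on numerical data produces a meaningful quantity, while the same arithmetic on category codes produces a number that describes nothing.

```python
# Hypothetical survey records: gender coded 1/2 (categorical)
# and height in cm (numerical).
genders = [1, 2, 2, 1, 2]          # the codes are labels only
heights = [172.5, 158.0, 164.2, 180.1, 167.3]

# Averaging numerical data is meaningful:
mean_height = sum(heights) / len(heights)
print(f"mean height: {mean_height:.2f} cm")

# Averaging category codes is arithmetic on labels -- the
# result describes nothing about the people:
meaningless = sum(genders) / len(genders)
print(f"'mean gender': {meaningless}")
```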

Numerical data can be further broken into two types: discrete and continuous.

a) Discrete data represent items that can be counted; they take on possible values that can be listed out. The list of possible values may be fixed (also called finite), or it may go from 0, 1, 2, on to infinity (making it countably infinite).

Some examples are –

  • Number of children in a school
  • Number of books in your library
  • Number of cases a Lawyer has won

In all the above examples, the values can be 1, 2, 3, and so on, but can never be 1.2, 4.6, 8.7, etc., making the results countable.

b) Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line.

For example, the exact amount of petrol purchased at the petrol pump for bikes with 20-litre tanks would be continuous data from 0 litres to 20 litres, represented by the interval [0, 20], inclusive. You might pump 8.40 litres, or 8.41, or 8.414863 litres, or any possible number from 0 to 20. In this way, continuous data can be thought of as being uncountably infinite.

Levels of measurement

1. Qualitative Data

  • Nominal data levels of measurement:

A nominal variable is one in which values serve only as labels, even if those values are numbers.

Example: Suppose we code Male as 1 and Female as 2. Even though we code the different categories with numbers, we cannot say that since 2 > 1, Female > Male.

  • Ordinal data levels of measurement:

Values of ordinal variables have a meaningful order to them. We can use frequencies, percentages, and certain non-parametric statistics with ordinal data. However, means, standard deviations, and parametric statistical tests are generally not appropriate to use with ordinal data.

For example, satisfaction ratings (Highly Satisfied, Satisfied, Average, Below Average, Very Bad) give a sense of order, i.e., we can say that Highly Satisfied is better than Average. However, we cannot say that the difference between Highly Satisfied and Satisfied is the same as the difference between Below Average and Very Bad.
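The permitted summaries for ordinal data can be sketched as follows (the survey responses are hypothetical): frequencies and percentages are valid, whereas a mean of the rating labels would not be.

```python
from collections import Counter

# Hypothetical ordinal survey responses (satisfaction ratings).
responses = ["Highly satisfied", "Satisfied", "Average",
             "Satisfied", "Highly satisfied", "Below average",
             "Average", "Satisfied"]

# Frequencies and percentages are valid summaries for ordinal data.
counts = Counter(responses)
for level, n in counts.most_common():
    print(f"{level:16s} {n}  ({100 * n / len(responses):.0f}%)")
```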

2. Quantitative Data

  • Interval scale data levels of measurement

For interval variables, we can make arithmetic assumptions about the degree of difference between values.  An example of an interval variable would be temperature.

An interval variable can be used to compute commonly used statistical measures such as the average (mean), standard deviation etc.  Many other advanced statistical tests and techniques also require interval or ratio data.

Example: The measurement of the year of a historical event falls on an interval scale, since the year has no fixed origin, i.e., year 0 differs across calendars, religions, and countries.

  • Ratio scale data levels of measurement

All arithmetic operations are possible on a ratio variable.  An example of a ratio variable would be weight (e.g., in pounds).  We can accurately say that 20 pounds is twice as heavy as 10 pounds.  Additionally, ratio variables have a meaningful zero-point (e.g., exactly 0 pounds means the object has no weight).  Other examples of ratio variables include gross sales of a company, the expenditure of a company, the income of a company, etc.

Example: Measurement of temperature on the Kelvin scale, since Kelvin has an absolute zero; measurement of the average height of students in a class is another example.

A ratio variable can be used as a dependent variable for most parametric statistical tests such as t-tests, F-tests, correlation, and regression.

Measures of Central Tendency

A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value. In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of these measures calculates the location of the central point using a different method.

Mean: It is nothing but the arithmetic average of all the values. To calculate the mean, just add up all of the values and divide the sum by the number of observations in your dataset.

Let’s consider a dataset with n values: x1, x2, …, xn.

The mean for the above data is then given as:

Mean = (x1 + x2 + … + xn) / n
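The calculation can be sketched in a few lines (the data values are arbitrary examples); the stdlib `statistics.mean` gives the same result as summing and dividing by hand.

```python
from statistics import mean

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

# Mean: sum of the values divided by the number of observations.
manual_mean = sum(data) / len(data)
print(manual_mean)        # 5.2
print(mean(data))         # stdlib version agrees
```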

Median: It is the middle value that splits the dataset in half. To find the median, first sort your data in ascending order, then find the data point that has an equal number of values above it and below it. The method for locating the median varies slightly depending on whether your dataset has an even or odd number of values.

Let’s consider a sorted dataset with n values: x1 ≤ x2 ≤ … ≤ xn.

Case I) When n is odd: the median is the middle value, i.e., the value at position (n + 1)/2.

Case II) When n is even: the median is the average of the two middle values, i.e., the values at positions n/2 and (n/2) + 1.
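Both cases can be sketched as a small function (the sample lists are arbitrary); the stdlib `statistics.median` agrees with the positional rule.

```python
from statistics import median

def middle_value(values):
    """Median via the positional rule: sort, then take the middle."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:                    # odd n: single middle value
        return s[(n + 1) // 2 - 1]
    mid = n // 2                      # even n: average the two middle values
    return (s[mid - 1] + s[mid]) / 2

print(middle_value([7, 1, 5, 3, 9]))  # odd n  -> 5
print(middle_value([7, 1, 5, 3]))     # even n -> 4.0
print(median([7, 1, 5, 3]))           # stdlib agrees
```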

Mode: The mode is the value that occurs the most frequently in your data set i.e., has the highest frequency. On a bar chart, the mode is the highest bar. If the data have multiple values that are tied for occurring the most frequently, you have a multimodal distribution. If no value repeats, the data do not have a mode.

Frequency: The number of times a variable occurs in the data set is called its frequency.

Let’s consider a dataset with 5 values as follows:

D = {2, 3, 3, 5, 3}

Mode = 3
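Counting frequencies and picking the most frequent value can be sketched with `collections.Counter`, using the dataset D from above:

```python
from collections import Counter

D = [2, 3, 3, 5, 3]

# Frequency: how many times each value occurs in the data set.
freq = Counter(D)
print(freq)                # 3 occurs three times; 2 and 5 once each

# Mode: the value with the highest frequency.
mode_value, count = freq.most_common(1)[0]
print(mode_value)          # 3
```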

Mean, Median, Mode for Grouped Data

Mean for grouped data is calculated the same way as for ungrouped data; the variable (x) simply becomes the midpoint of each class interval, so Mean = Σ(f × x) / Σf, where f is the frequency of each class.

Median:

Median = l + ((N/2 − c) / f) × h

where:

l = lower class boundary of the median class

h = size of the median class interval

f = frequency corresponding to the median class

N = total number of observations, i.e., the sum of the frequencies

c = cumulative frequency of the class preceding the median class

Mode:

Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × h

where:

l = lower boundary of the modal class

h = size of the modal class

f1 = frequency corresponding to the modal class

f0 = frequency of the class preceding the modal class

f2 = frequency of the class succeeding the modal class
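The grouped median and mode formulas above can be sketched for a hypothetical frequency table (the class intervals and frequencies are invented for illustration), using the standard forms Median = l + ((N/2 − c)/f) × h and Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × h:

```python
# Hypothetical grouped data: class intervals with frequencies.
freqs = {(0, 10): 5, (10, 20): 8, (20, 30): 12, (30, 40): 9, (40, 50): 6}

N = sum(freqs.values())                      # total observations = 40

# --- Grouped median: l + ((N/2 - c) / f) * h ---
half, cum = N / 2, 0
for (l, u), f in freqs.items():
    if cum + f >= half:                      # this is the median class
        h = u - l                            # class width
        c = cum                              # cumulative freq before it
        grouped_median = l + ((half - c) / f) * h
        break
    cum += f
print(grouped_median)

# --- Grouped mode: l + ((f1 - f0) / (2*f1 - f0 - f2)) * h ---
items = list(freqs.items())
i = max(range(len(items)), key=lambda k: items[k][1])   # modal class index
(l, u), f1 = items[i]
f0 = items[i - 1][1] if i > 0 else 0         # preceding class frequency
f2 = items[i + 1][1] if i < len(items) - 1 else 0   # succeeding class frequency
grouped_mode = l + ((f1 - f0) / (2 * f1 - f0 - f2)) * (u - l)
print(grouped_mode)
```

Here the median class is [20, 30) (cumulative frequency first reaches N/2 = 20 there), and the same interval happens to be the modal class since it has the highest frequency.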

Skewness

Skewness is asymmetry in a statistical distribution, in which the curve distorts or skews either to the left or to the right. It is defined as the extent to which a distribution differs from a normal distribution.

In a normal distribution, the graph appears as a classical, symmetrical “bell-shaped curve.” The mean, or average, and the mode, or maximum point on the curve, are equal.

In a perfect normal distribution, the tails on either side of the curve are exact mirror images of each other.

When a distribution skews to the left, the tail on the curve’s left-hand side is longer than the tail on the right-hand side, and the mean is less than the mode. This situation is called negative skewness.

When a distribution is skewed to the right, the tail on the curve’s right-hand side is longer than the tail on the left-hand side, and the mean is greater than the mode. This situation is called positive skewness.

  • In a symmetric distribution, the mean and median both find the center accurately. They are approximately equal.
  • However, in a skewed distribution, the mean can miss the mark and fall outside the central area. This problem occurs because outliers have a substantial impact on the mean: extreme values in an extended tail pull the mean away from the center, and as the distribution becomes more skewed, the mean is drawn further away from it. Here the median better represents the central tendency of the distribution.
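The effect of skew on the mean versus the median can be sketched with two small made-up datasets, identical except for one extreme tail value:

```python
from statistics import mean, median

# Roughly symmetric data: mean and median nearly coincide.
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]
print(mean(symmetric), median(symmetric))   # both 6

# Right-skewed data: one extreme value in the long tail
# pulls the mean away from the center; the median resists.
skewed = [4, 5, 5, 6, 6, 6, 7, 7, 60]
print(mean(skewed))      # dragged toward the tail
print(median(skewed))    # still 6, representing the center
```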

Measures of Dispersion

It shows scattering of the data. It tells about variation in the data and gives a clear idea about the distribution of the data. The measure of dispersion shows the homogeneity or the heterogeneity of the distribution of the observations.

 

How is it useful?

  • Measures of dispersion show the variation in the data, which provides useful information. Less variation means the average closely represents the values, while with larger variation the average may not closely represent all the values in the sample.
  • Measures of dispersion enable us to compare two or more series with regard to their variations. This helps to determine consistency.
  • By checking for variation in the data, we can try to control the causes behind the variations.

1) Range: This is the most common and most easily understood measure of dispersion. It is the difference between the two extreme observations of the data set: if X_max and X_min are the largest and smallest observations, then

Range = X_max − X_min

Because it is based only on the two extreme observations, the range is strongly affected by fluctuations. Thus, the range is not a reliable measure of dispersion.

2) Standard Deviation

In statistics, the standard deviation is a very common measure of dispersion. Standard deviation measures how spread out the values in a data set are around the mean. More precisely, it is a measure of the average distance between the values in the data set and the mean. If the data values are all similar, then the standard deviation will be low (closer to zero). If the data values are highly variable, then the standard deviation is high (further from zero).

The standard deviation is always a positive number and is always measured in the same units as the original data. Squaring the deviations overcomes the drawback of ignoring signs in mean deviations i.e., distance of points from mean must always be positive.

3) Variance

It is the average of the squared differences from the mean:

Variance (σ²) = Σ(x − μ)² / n
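Both definitions can be sketched on a small example dataset (the values are arbitrary); the stdlib population versions, `statistics.pvariance` and `statistics.pstdev`, agree with the hand computation.

```python
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = sum(data) / len(data)                          # mean = 5

# Variance: average of the squared differences from the mean.
variance = sum((x - mu) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance,
# expressed in the same units as the original data.
std_dev = variance ** 0.5

print(variance, std_dev)                 # 4.0 and 2.0
print(pvariance(data), pstdev(data))     # stdlib versions agree
```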

Let’s take a look at an example:

A class of students took a test in Language Arts. The teacher determines that the mean grade in the exam is 65%. She is concerned that this is very low, so she determines the standard deviation to see if it seems that most students scored close to the mean, or not. The teacher finds that the standard deviation is high. After closely examining all the tests, the teacher determines that several students with very low scores were the outliers that pulled down the mean of the entire class scores.

Coefficient of Variation (CV)

The Coefficient of Variation (CV) is also known as Relative Standard Deviation (RSD). This is a standardized measure of dispersion of a probability distribution or frequency distribution. It gives the measure of variability and is expressed as a percentage. It is thus defined as the ratio of the standard deviation(σ) to the mean(μ).

CV = (Standard Deviation / Mean) × 100% = (σ / μ) × 100%
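Because the CV is a percentage, it lets us compare the variability of series measured in different units; a sketch with two made-up series (heights in cm, weights in kg):

```python
from statistics import mean, pstdev

# Hypothetical series on different scales and units.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 60, 70, 80, 95]

def cv(values):
    """Coefficient of variation: std dev / mean, as a percentage."""
    return 100 * pstdev(values) / mean(values)

print(f"heights CV: {cv(heights_cm):.1f}%")
print(f"weights CV: {cv(weights_kg):.1f}%")   # larger relative spread
```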

Covariance

  1. This is a measure of how two variables vary together.
  2. It captures the relationship between a pair of random variables wherein a change in one variable is accompanied by a change in the other.
  3. It ranges from -infinity to +infinity, where a negative value represents a negative relationship and a positive value represents a positive relationship.
  4. It is used for the linear relationship between variables.
  5. It gives the direction of the relationship between variables.
  6. It has dimensions (the units of the two variables multiplied together).

Correlation

  1. It checks whether pairs of variables are related and shows the strength of that relationship.
  2. It ranges between -1 and +1, where values close to +1 represent strong positive correlation and values close to -1 represent strong negative correlation.
  3. It indicates how the variables relate to each other without implying that one causes the other.
  4. It gives both the direction and the strength of the relationship between variables.
  5. It is the scaled version of covariance.
  6. It is dimensionless.
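The "scaled version of covariance" point can be sketched directly (the paired values are arbitrary examples): dividing the covariance by both standard deviations removes the units and bounds the result in [-1, 1].

```python
from statistics import pstdev

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Covariance: average product of paired deviations from the means.
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Correlation: covariance scaled by both standard deviations,
# making it dimensionless and bounded in [-1, 1].
corr = cov / (pstdev(xs) * pstdev(ys))

print(cov)     # positive: the variables move together
print(corr)    # strength on a fixed, unitless scale
```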

Thus, statistics is very useful for getting insights from data, and it is a must for getting into the Data Science domain. To create or develop anything artificial that can bring change to the ecosystem, you need to know statistics, because all machine learning and deep learning algorithms are built on mathematics. And to understand that mathematics, you need to be good at statistics and probability.

I hope you enjoyed reading the blog.

Happy Learning😊
