CloudyML

DATA SCIENCE WORLD

STATISTICS FOR DATA SCIENCE

A Brief Introduction to Statistics for Data-Science

Statistics is the branch of mathematics that deals with the collection, analysis, interpretation, and presentation of large amounts of numerical data. It gives us a way to tabulate and interpret numerical data. Statistical concepts can therefore be applied easily to real life, for example to calculate the time needed to get ready for the office, monthly expenditure, weekly gym sessions and diet counts, in education, etc.

 

Use of statistics in Data-Science

Statistics plays a vital role in large industries, where data is computed in bulk. It helps to collect, analyze and interpret the data. Also, with the help of statistical graphs, charts and tables it’s easier to present the data.

Table Of Contents:-

Why and how Statistical Analysis is done?

Types of Statistical Analysis – 

  • Descriptive
  • Inferential
  • Predictive
  • EDA(Exploratory Data Analysis)

Fundamentals of Statistics

  • Random Variables.
  • Probability – Complement, intersection, union, conditional probability, Bayes’ Theorem
  • Central Tendency – Mean, median, mode, Skewness, Kurtosis
  • Variations – distortion, error, shift in data – Range, Standard Deviation, Covariance, Correlation, Standard error.
  • Regression – Finding out the relationship between independent and dependent variables.
  • Types of Regression – Linear – Y = aX + C and Multiple linear – Y = aX1 + bX2 + cX3 + … + C

Importance of Statistical Analysis

This Statistics for Data Science blog is designed to introduce you to the fundamental principles of the statistical methods and techniques used for Data Analysis. After reading this blog you’ll have a practical understanding of crucial topics in statistics, including –

  i) Data gathering.
  ii) Summarizing data using descriptive statistics.
  iii) Displaying and visualizing data.
  iv) Examining relationships among variables.
  v) Regression types and correlation analysis.

In the end, our readers will be able to relate it with machine learning:

✔Calculate and apply measures of central tendency and measures of dispersion to grouped and ungrouped data.

✔Summarize, present, and visualize data in a way that is clear, concise, and provides practical insight for non-statisticians who need the results.

✔Conduct hypothesis tests, correlation tests, and regression analysis.

✔Demonstrate proficiency in statistical evaluation using Python and Jupyter Notebooks.

You will get to know a hands-on approach to statistical analysis using the tools of choice for Data Scientists and Data Analysts.

Also, at the end of the blog, you’ll complete a project that applies several of these concepts to a Data Science problem based on a simulated real-life situation, and demonstrate foundational statistical thinking and reasoning. Our main focus is on developing a clear understanding of the different approaches for different data types, building intuition, making appropriate assessments of the proposed methods, using Python to analyze our data, and interpreting the output accurately.

This blog is a good walkthrough for professionals and college students who want to begin their journey in data- and statistics-driven roles such as Data Scientist, Data Analyst, Business Analyst, Statistician, and Researcher.

Why study Statistics?

Because the most important aspect of any Data Science approach is how the information is processed. When we talk about developing insights from data, we are essentially digging out the possibilities. This is known as Statistical Analysis.

Most people are surprised at how data in the form of text, images, videos, and other relatively unstructured formats gets processed so effortlessly by Machine Learning models. In reality, we first convert that data into a numerical form, which is not exactly our data but its numerical equivalent. This brings us to a vital aspect of Data Science.

Predictive Analysis – Predictive analytics uses statistical techniques and machine learning algorithms to estimate the likelihood of future outcomes, behavior, and trends based on recent and past data. Widely used techniques under predictive analysis include data mining, data modelling, artificial intelligence, and machine learning. Example – Python Notebook for Performing Analysis of Meteorological Data.

Exploratory Data Analysis – This method focuses on analyzing patterns in the data to recognize potential relationships. EDA can be used to discover unknown associations within data, inspect missing data, obtain maximum insight, and examine assumptions and hypotheses. For Reference – Consider this Notebook for Exploratory Data Analysis of Terrorism.

Probability :

Probability is a measure of the chance that an event will occur. Many events cannot be predicted with total certainty. We can only predict the chance of an event occurring, i.e. how likely it is to happen. Probability ranges from 0 to 1, where 0 means the event is impossible and 1 indicates a certain event. The probabilities of all the outcomes in a sample space add up to 1.

For example, when we toss a coin we get either Head or Tail, so the possible outcomes are (H, T). But if we toss two coins in the air, there are three possible events: both coins show heads, both show tails, or one shows heads and the other tails, i.e. (H, H), (H, T), (T, T). With fair coins these three events are not equally likely: one head and one tail can occur in two ways (HT or TH), so it has probability 1/2, while (H, H) and (T, T) each have probability 1/4.

Formula for Probability : 

The probability of an event is the ratio of the number of favorable outcomes to the total number of outcomes, provided all outcomes are equally likely.

Probability of an Event to happen P(E) = Number of favorable outcomes / Total number of outcomes.
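The counting formula can be tried out on the two-coin toss. The sketch below is only an illustration: the helper name `probability` and the way the sample space is built are assumptions made for this example, and it treats the two coins as distinguishable so that every outcome is equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing two fair coins; order matters, so HT and TH are distinct outcomes.
sample_space = list(product("HT", repeat=2))


def probability(event, outcomes):
    """P(E) = favorable outcomes / total outcomes, valid for equally likely outcomes."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))


p_two_heads = probability(lambda o: o.count("H") == 2, sample_space)
p_one_each = probability(lambda o: o.count("H") == 1, sample_space)
p_two_tails = probability(lambda o: o.count("T") == 2, sample_space)

print(p_two_heads, p_one_each, p_two_tails)  # 1/4 1/2 1/4
```

Using `Fraction` keeps the results exact, and the three probabilities add up to 1 as expected for a complete set of events.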

Random Variables : 

The idea of a random variable is the cornerstone of many statistical concepts. Its formal mathematical definition may be hard to digest, but simply put, a random variable is a way to map the outcomes of random processes, such as flipping a coin or rolling a die, to numbers. For instance, we can describe the random process of flipping a coin by a random variable X which takes the value 1 if the outcome is heads and 0 if the outcome is tails.

In this example, we have a random process of flipping a coin, and this experiment can produce two possible outcomes: {0,1}. The set of all possible outcomes is referred to as the sample space of the experiment. Each repetition of the random process is called a trial, and an outcome (or set of outcomes) of a trial is called an event. Flipping a coin and getting a tail is an event. The chance of this event occurring is called the probability of that event. The probability of an event is the likelihood that a random variable takes a specific value x, which can be written as P(x). In the example of flipping a coin, the chance of getting heads or tails is the same, that is 0.5 or 50%. So we have the following setting:

P(X = 1) = P(X = 0) = 0.5

where the probability of an event, in this example, can only take values in the range [0,1].
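To see the coin-flip random variable in action, here is a small simulation sketch. The seed and the number of flips are arbitrary choices made for this illustration; the point is only that the relative frequency of heads settles close to 0.5.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Random variable X: 1 for heads, 0 for tails, each with probability 0.5
n_flips = 10_000
x = rng.integers(0, 2, size=n_flips)  # draws 0 or 1 (the upper bound is exclusive)

p_heads = x.mean()     # relative frequency of X = 1
p_tails = 1 - p_heads  # relative frequency of X = 0

print(p_heads, p_tails)
```

Both printed values should be close to 0.5, and they move closer as `n_flips` grows.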

The significance of statistics in data science and data analytics cannot be overstated. Statistics provides tools and methods for gaining deeper insight into data.

Mean, Variance, Standard Deviation :

To understand the concepts of mean, variance, and many other statistical topics, it is important to learn the concepts of population and sample. The population is the set of all observations and is usually very large and diverse, whereas a sample is a subset of observations from the population that ideally is a true representation of the population.

Given that experimenting with an entire population is either impossible or simply too expensive, researchers and analysts use samples rather than the entire population in their experiments or trials. To make sure that the results are reliable and hold for the entire population, the sample needs to be unbiased. For this purpose, you can use statistical sampling techniques such as Random Sampling, Systematic Sampling, Cluster Sampling, Weighted Sampling, and Stratified Sampling.
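Three of the sampling techniques named above can be sketched with NumPy. Everything in this example is hypothetical: the population is just the integers 1 to 100, and the two strata "A" and "B" are invented labels used only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

population = np.arange(1, 101)                 # hypothetical population: integers 1..100
groups = np.where(population <= 50, "A", "B")  # hypothetical strata labels

# Simple random sampling: every unit has the same chance of being selected
random_sample = rng.choice(population, size=10, replace=False)

# Systematic sampling: pick a random start, then take every k-th unit
k = 10
start = rng.integers(0, k)
systematic_sample = population[start::k]

# Stratified sampling: draw the same number of units from each stratum
stratified_sample = np.concatenate([
    rng.choice(population[groups == g], size=5, replace=False)
    for g in ("A", "B")
])

print(random_sample, systematic_sample, stratified_sample, sep="\n")
```

Stratified sampling guarantees that both groups are represented, which simple random sampling does not.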

Mean :

The mean, also referred to as the average, is a central value of a finite set of numbers. Let’s assume a random variable X in the data has the following values:

X = {x1, x2, …, xN}

where N is the number of observations or data points in the sample set. Then the sample mean, denoted by μ, which is used to approximate the population mean, can be expressed as follows:

μ = (x1 + x2 + … + xN) / N

The mean is also referred to as the expectation, denoted by E() or by the random variable with a bar on top. For example, the expectations of random variables X and Y, i.e. E(X) and E(Y), can be expressed as follows:

E(X) = X̄ = (x1 + x2 + … + xN) / N
E(Y) = Ȳ = (y1 + y2 + … + yN) / N

import math
import numpy as np

# Sample mean of a small array
x = np.array([1,3,5,6])
mean_x = np.mean(x)  # 3.75

# In case the data contains NaN values, np.nanmean ignores them
x_nan = np.array([1,3,5,6, math.nan])
mean_x_nan = np.nanmean(x_nan)  # 3.75

Variance :

The variance measures how far the data points are spread out from the average value. It is based on the sum of squared differences between the data values and the mean, divided by N – 1 for a sample. The sample variance, denoted by sigma squared, can be used to approximate the population variance:

σ² = ((x1 – μ)² + (x2 – μ)² + … + (xN – μ)²) / (N – 1)

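x = np.array([1,3,5,6])
# np.var divides by N by default (ddof=0), which gives the population variance
variance_x = np.var(x)  # 3.6875

# For the sample variance, set the delta degrees of freedom (ddof=1) so the divisor is N - 1,
# the number of logically independent data points that are free to vary
x_nan = np.array([1,3,5,6, math.nan])
variance_x_nan = np.nanvar(x_nan, ddof = 1)  # about 4.92, ignoring the NaN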

Standard Deviation :

Standard Deviation is the square root of the variance and measures the extent to which data varies from its mean. The standard deviation, denoted by sigma, can be expressed as follows:

σ = √( ((x1 – μ)² + (x2 – μ)² + … + (xN – μ)²) / (N – 1) )

Standard deviation is preferred over the variance because it has the same unit as the data points, which means you can interpret it more easily.

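x = np.array([1,3,5,6])
# Population standard deviation (ddof=0 by default)
std_x = np.std(x)  # about 1.92

# Sample standard deviation (ddof=1), ignoring NaN values
x_nan = np.array([1,3,5,6, math.nan])
std_x_nan = np.nanstd(x_nan, ddof = 1)  # about 2.22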

Skewness and Kurtosis :

“Skewness basically measures the symmetry of the distribution, whereas kurtosis determines the heaviness of the distribution tails.” Understanding the shape of the data is a critical step. It helps to identify where most of the data lies and to analyze the outliers in a given dataset.

In statistics, skewness is a measure of the asymmetry of a probability distribution, i.e. how far it deviates from the symmetrical normal distribution (bell curve) in a given set of data.

The normal distribution helps in understanding skewness. In a normal distribution, the data is symmetrically distributed. A symmetrical distribution has 0 skewness, as all measures of central tendency lie in the middle.

When data is symmetrically distributed, the left-hand side and the right-hand side contain an identical number of observations. (If the dataset has 90 values, then the left-hand side has 45 observations, and the right-hand side has 45 observations.) But what if the data is not symmetrically distributed? Such data is referred to as asymmetrical data, and this is where skewness comes into the picture.

Types of skewness : 

1. Positive skewed or right-skewed :

In symmetrically distributed data, all measures of central tendency (mean, median, and mode) equal each other. In a positively skewed distribution these measures spread apart: the tail on the right-hand side is longer, and the mean is typically greater than the median, which in turn is typically greater than the mode.

The mean of the data is greater than the median. In other words, most of the values are concentrated on the lower side, with a few large values stretching the tail to the right. The median is the middle value and the mode is the most frequent value (the peak of the distribution), while the mean is pulled toward the large values in the tail.

Extreme positive skewness is not desirable for a distribution, as a high level of skewness can cause misleading results. Data transformation tools help to make skewed data closer to a normal distribution. For positively skewed distributions, the most popular transformation is the log transformation, which replaces each value in the dataset with its natural logarithm.
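The effect of the log transformation can be checked on a small example. The data values below are made up for illustration, and the `skewness` helper is a plain moment-based formula written for this sketch (the mean of the cubed standardized values), not a function from any particular library.

```python
import numpy as np


def skewness(values):
    """Moment-based skewness: mean of cubed z-scores, using the population std."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.mean(z ** 3)


# Hypothetical right-skewed data: most values are small, a few are very large
data = np.array([1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 60], dtype=float)

# Natural logarithm of each value; this requires strictly positive data
log_data = np.log(data)

print(skewness(data), skewness(log_data))
```

The skewness of the transformed data is noticeably smaller than that of the raw data, although it does not have to reach exactly zero.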

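2. Negative skewed or left-skewed :

In a negatively skewed distribution the tail on the left-hand side is longer. The median is the middle value and the mode is the most frequent value, and because of the unbalanced distribution, the median will be higher than the mean.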

Calculate the skewness coefficient of the sample

Pearson’s first coefficient of skewness : Subtract the mode from the mean, then divide the difference by the standard deviation.

These coefficients should not be confused with Pearson’s correlation coefficient, which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with a value of 0 indicating no linear relationship. Dividing the covariance by the product of the standard deviations scales the value down to the limited range of -1 to +1, which is exactly the range of correlation values.

Pearson’s second coefficient of skewness : Subtract the median from the mean, multiply the difference by 3, and divide the product by the standard deviation.

  • If the skewness is between -0.5 and 0.5, the data are nearly symmetrical.
  • If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data are moderately skewed.
  • If the skewness is lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data are highly skewed.
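Pearson’s two skewness coefficients can be computed in a few lines. This is a sketch on a made-up sample with a single clear mode; the second coefficient uses the standard definition, 3 × (mean – median) / standard deviation.

```python
import statistics

import numpy as np

# Hypothetical sample with one clear mode
data = [2, 3, 3, 3, 4, 5, 6, 9, 12]

mean = np.mean(data)
median = np.median(data)
mode = statistics.mode(data)  # most frequent value
std = np.std(data, ddof=1)    # sample standard deviation

pearson_first = (mean - mode) / std         # (mean - mode) / standard deviation
pearson_second = 3 * (mean - median) / std  # 3 * (mean - median) / standard deviation

print(pearson_first, pearson_second)
```

Both coefficients come out positive here, which matches the long right tail of the sample (mean above median above mode).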

 

Kurtosis :

Kurtosis refers to the degree of presence of outliers in the distribution. It is a statistical measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution.

In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high level of risk for an investment because it indicates that there are high probabilities of extremely large and extremely small returns. On the other hand, a small kurtosis signals a moderate level of risk because the probabilities of extreme returns are relatively low.

Excess Kurtosis :

Excess kurtosis is used in statistics and probability theory to compare the kurtosis coefficient of a distribution with that of the normal distribution. Excess kurtosis can be positive (Leptokurtic distribution), negative (Platykurtic distribution), or near zero (Mesokurtic distribution). Since normal distributions have a kurtosis of 3, excess kurtosis is calculated by subtracting 3 from the kurtosis.

   Excess kurtosis  =  Kurt – 3
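As a rough illustration of the formula, the sketch below computes kurtosis as the fourth central moment divided by the squared variance and then subtracts 3. The helper names and the two tiny datasets are assumptions made for this example.

```python
import numpy as np


def kurtosis(values):
    """Moment-based kurtosis: fourth central moment divided by the squared variance."""
    values = np.asarray(values, dtype=float)
    deviations = values - values.mean()
    m2 = np.mean(deviations ** 2)
    m4 = np.mean(deviations ** 4)
    return m4 / m2 ** 2


def excess_kurtosis(values):
    """Excess kurtosis = Kurt - 3."""
    return kurtosis(values) - 3


flat = [1, 2, 3, 4, 5]                     # evenly spread values, light tails
heavy = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]  # mostly central values with two extremes

print(excess_kurtosis(flat))   # about -1.3, negative excess kurtosis
print(excess_kurtosis(heavy))  # about 2.0, positive excess kurtosis
```

A negative result points to lighter tails than the normal distribution, and a positive result points to heavier tails.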

Types of excess kurtosis –

  • Leptokurtic or heavy-tailed distribution (kurtosis more than normal distribution).
  • Mesokurtic (kurtosis same as the normal distribution).
  • Platykurtic or short-tailed distribution (kurtosis less than normal distribution).

Leptokurtic (kurtosis > 3) :

A leptokurtic distribution has long, heavy tails, which means there is a greater chance of outliers. Positive values of excess kurtosis indicate that the distribution is peaked and has thick tails. An extreme positive kurtosis indicates a distribution where more of the values are located in the tails of the distribution instead of around the mean.

Platykurtic (kurtosis < 3) :

A platykurtic distribution has thin tails and is stretched around the center, which means most of the data points lie in close proximity to the mean. A platykurtic distribution is flatter (less peaked) when compared with the normal distribution.

Mesokurtic (kurtosis = 3) :

A mesokurtic distribution has the same kurtosis as the normal distribution, which means its excess kurtosis is near 0. Mesokurtic distributions are moderate in breadth, and their curves have a medium peak height.

Bayes Theorem : 

Bayes’ Theorem (or Bayes’ Law) is one of the most powerful rules of probability and statistics, named after the famous English statistician and philosopher Thomas Bayes.

Bayes’ theorem is a powerful probability law that brings the concept of subjectivity into the world of Statistics and Mathematics, where everything is about facts. It describes the probability of an event based on prior information about conditions that might be related to that event. For instance, if the risk of getting Coronavirus or Covid-19 is known to increase with age, then Bayes’ theorem allows the risk to an individual of a known age to be determined more accurately by conditioning it on the age than by simply assuming that this individual is typical of the population as a whole.

The concept of conditional probability, which plays a central role in Bayes’ theorem, is a measure of the probability of an event happening, given that another event has already occurred. Bayes’ theorem can be described by the following expression, where X and Y stand for events X and Y, respectively:

Pr(X|Y) = Pr(Y|X) · Pr(X) / Pr(Y)

  • Pr(X|Y): the probability of event X occurring given that event or condition Y has occurred or is true.
  • Pr(Y|X): the probability of event Y occurring given that event or condition X has occurred or is true.
  • Pr(X) & Pr(Y): the probabilities of observing events X and Y, respectively.

In the case of the earlier example, the probability of getting Coronavirus (event X) conditional on being at a certain age (event Y) is Pr(X|Y). It is equal to the probability of being at a certain age given that one got Coronavirus, Pr(Y|X), multiplied by the probability of getting Coronavirus, Pr(X), and divided by the probability of being at a certain age, Pr(Y).
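The arithmetic behind this statement is short enough to sketch in code. All three input probabilities below are hypothetical numbers chosen only to illustrate the calculation; they are not real epidemiological figures.

```python
def bayes(p_y_given_x, p_x, p_y):
    """Pr(X|Y) = Pr(Y|X) * Pr(X) / Pr(Y)."""
    return p_y_given_x * p_x / p_y


# Hypothetical numbers, chosen only to illustrate the arithmetic
p_x = 0.10          # Pr(X): probability of infection in the whole population
p_y = 0.20          # Pr(Y): probability of being in the age group
p_y_given_x = 0.40  # Pr(Y|X): share of infected people who are in the age group

p_x_given_y = bayes(p_y_given_x, p_x, p_y)
print(p_x_given_y)  # about 0.2
```

With these assumed inputs, knowing the age group doubles the estimated risk from 0.10 to about 0.20, which is the kind of refinement that conditioning provides.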

Covariance :

The covariance is a measure of the joint variability of two random variables and describes the relationship between these two variables. It is defined as the expected value of the product of the two random variables’ deviations from their means. The covariance between two random variables X and Z can be described by the following expression, where E(X) and E(Z) represent the means of X and Z, respectively.

Cov(X, Z) = E[(X – E(X)) (Z – E(Z))]

Covariance can take negative or positive values as well as the value 0. A positive value of covariance indicates that two random variables tend to vary in the same direction, whereas a negative value suggests that these variables vary in opposite directions. Finally, the value 0 means that they don’t vary together.

x = np.array([1,3,5,6])
y = np.array([-2,-4,-5,-6])

# np.cov returns the 2x2 covariance matrix of x and y: the variances of x and y are on the diagonal,
# and the covariance of x and y is in the off-diagonal elements
cov_xy = np.cov(x,y)

Correlation :

Correlation measures the strength and the direction of the linear relationship between two variables. If a correlation is detected, it means that there is a relationship or pattern between the values of the two variables. The correlation between two random variables X and Z is equal to the covariance between these two variables divided by the product of their standard deviations, which can be described by the following expression.

Cor(X, Z) = Cov(X, Z) / (σX · σZ)

Correlation coefficients’ values range between -1 and 1. Keep in mind that the correlation of a variable with itself is always 1, that is Cor(X, X) = 1. Even if there is a correlation between two variables, you cannot conclude that one variable causes a change in the other. This relationship could be coincidental, or a third factor might be causing both variables to change.

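x = np.array([1,3,5,6])
y = np.array([-2,-4,-5,-6])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal elements hold the correlation of x and y
corr = np.corrcoef(x,y)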
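To connect the definitions with the NumPy calls, the sketch below computes the sample covariance and correlation by hand for the same arrays and compares them with `np.cov` and `np.corrcoef`. The variable names are chosen for this illustration.

```python
import numpy as np

x = np.array([1, 3, 5, 6])
y = np.array([-2, -4, -5, -6])

# Sample covariance: sum of products of deviations, divided by N - 1
dx = x - x.mean()
dy = y - y.mean()
cov_manual = np.sum(dx * dy) / (len(x) - 1)

# Correlation: covariance divided by the product of the sample standard deviations
corr_manual = cov_manual / (np.std(x, ddof=1) * np.std(y, ddof=1))

# np.cov and np.corrcoef return 2x2 matrices; entry [0, 1] is the pairwise value
cov_numpy = np.cov(x, y)[0, 1]
corr_numpy = np.corrcoef(x, y)[0, 1]

print(cov_manual, cov_numpy)    # both -3.75
print(corr_manual, corr_numpy)  # both about -0.99
```

The negative sign reflects that y decreases as x increases, and the correlation close to -1 shows that the relationship is almost perfectly linear.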

Regression Models :

Regression models are used to explain relationships between variables by fitting a line to the observed data. Regression allows us to estimate how a dependent variable changes as the independent variable(s) change.

Linear Regression :

Dependent variables are often called response variables or explained variables, while independent variables are often referred to as regressors or explanatory variables. When the Linear Regression model is based on a single independent variable, the model is known as Simple Linear Regression, and when the model is based on multiple independent variables, it’s called Multiple Linear Regression. Simple Linear Regression can be described by the following expression:

Y = β0 + β1 X + u

where Y is the dependent variable, X is the independent variable that is part of the data, β0 is the intercept, which is unknown and constant, and β1 is the slope coefficient, the parameter corresponding to the variable X, which is unknown and constant as well. Finally, u is the error term that the model makes when estimating the Y values. The main idea behind linear regression is to find the best-fitting straight line, the regression line, through a set of paired (X, Y) data. Example – Refer to this Notebook for hands-on practice of Stock Prediction Using Linear Regression and DecisionTree Regression Model.
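A minimal sketch of fitting a simple linear regression follows. The paired data is invented for this example (it roughly follows Y = 2 + 3X), and the closed-form least-squares estimates are just one common way to find the intercept and slope.

```python
import numpy as np

# Hypothetical paired data that roughly follows Y = 2 + 3X
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.9, 8.2, 10.8, 14.1, 17.0])

# Closed-form ordinary least squares estimates for a single regressor
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

Y_hat = beta0 + beta1 * X  # fitted values on the regression line
u_hat = Y - Y_hat          # residuals, the estimated error term

# np.polyfit with degree 1 fits the same least-squares line
slope, intercept = np.polyfit(X, Y, deg=1)

print(beta0, beta1)
print(intercept, slope)
```

Both approaches give the same line, with a slope near 3 and an intercept near 2 for this data.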

Multiple Linear Regression :

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable.

  • Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.
  • Multiple regression is an extension of simple linear (OLS) regression, which uses just one explanatory variable.
  • MLR is used extensively in economics and financial inference.

The multiple regression model is based on these assumptions:

  • There is a linear relationship between the dependent variable and the independent variables.
  • The independent variables are not too highly correlated with each other.
  • The yi observations are selected independently and randomly from the population.
  • Residuals should be normally distributed with a mean of 0 and a constant variance σ².

The coefficient of determination (R-squared) is a statistical metric used to measure how much of the variation in the outcome can be explained by the variation in the independent variables. R2 never decreases as more predictors are added to the MLR model, even if the predictors are not related to the outcome variable, so R2 by itself cannot tell you which predictors belong in the model.

R2 can only be between 0 and 1, where 0 indicates that the outcome cannot be predicted by any of the independent variables and 1 indicates that the outcome can be predicted without error from the independent variables.
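The behavior of R-squared can be sketched with a small simulated dataset. The data-generating equation, the seed, and the helper name `fit_r_squared` are all assumptions made for this illustration; the fit uses NumPy’s least-squares solver rather than any specific regression library.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical data: the outcome depends on two predictors plus noise
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)


def fit_r_squared(predictors, y):
    """Fit OLS with an intercept and return (coefficients, R-squared)."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return coef, 1 - ss_res / ss_tot


coef, r2 = fit_r_squared([x1, x2], y)

# Adding a pure-noise predictor cannot lower R-squared on the same data
noise = rng.normal(size=n)
_, r2_with_noise = fit_r_squared([x1, x2, noise], y)

print(coef)
print(r2, r2_with_noise)
```

The estimated coefficients land close to the values used to generate the data, and R-squared does not go down after the unrelated predictor is added, which is why it should not be used alone to choose predictors.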

Real-World examples of Multiple Linear Regression :-

Assume you’re a public health researcher interested in social factors that affect heart disease. You survey 500 towns and collect data on the percentage of people in each town who smoke, the percentage of people in each town who bike to work, and the percentage of people in each town who have heart disease.

Because you have two independent variables and one dependent variable, and all of your variables are quantitative, you can use multiple linear regression to analyze the relationship between them.

Visualizing the results in a graph :

It can also be helpful to include a graph with your results. Multiple linear regression is more complex than simple linear regression, as there are more parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.

Assumptions of multiple linear regression

Multiple linear regression makes all of the same assumptions as simple linear regression: –

Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

Independence of observations: the observations in the dataset were collected using statistically valid methods, and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

Normality: The data follows a normal distribution.

Linearity: The line of best fit through the data points is a straight line, rather than a curve.
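The check for highly correlated independent variables can be sketched as follows. The three predictors are simulated and their names are hypothetical: `cycling_km` is deliberately built as a near copy of `biking` so that the pair is flagged by the r2 > 0.6 rule of thumb mentioned above.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical predictors: 'biking' and 'smoking' are independent,
# while 'cycling_km' is nearly a rescaled copy of 'biking'
n = 500
biking = rng.uniform(0, 75, size=n)
smoking = rng.uniform(0, 30, size=n)
cycling_km = 2.0 * biking + rng.normal(scale=1.0, size=n)

names = ["biking", "smoking", "cycling_km"]
r = np.corrcoef(np.vstack([biking, smoking, cycling_km]))  # each row is one variable
r_squared = r ** 2

threshold = 0.6  # rule-of-thumb cutoff quoted in the text
flagged = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if r_squared[i, j] > threshold
]

print(flagged)  # [('biking', 'cycling_km')]
```

Only one variable from each flagged pair would then be kept in the regression model.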

Hope you got the key concepts of Statistics for Data-Science! 🙂
