CloudyML

Chi Square Test

The Best Statistical Test

Based on the concept of Hypothesis Testing

Prerequisite

To understand this article, you should be familiar with hypothesis testing, the null hypothesis, the alternative hypothesis, and some basics of machine learning.

Introduction

In this blog, we will learn about one of the most useful statistical tests for measuring the relationship between two categorical variables: the chi-square test.

This test is based on the concept of hypothesis testing, so solving a problem with the chi-square test follows the same steps as a hypothesis test.


What is the Chi-Square Test?

The chi-square test is a statistical method used by machine learning engineers and statisticians to check the correlation between two categorical variables.

In other words, the chi-square test evaluates whether two categorical variables are related in any way.


What does “correlated” mean?

Let’s suppose you have two categorical random variables, X and Y, and you want to know whether there is a relationship between them.

A relationship means that when the values of X change, the values of Y change along with them; for example, if Y tends to increase whenever X increases, we can say that X and Y are correlated.

But how do we find out whether two variables are correlated? It would be even better if we could express that relationship as a number.

To solve this problem, statisticians came up with a technique, which is nothing but the chi-square test.

When to use the Chi-Square Test

We can use the Chi-Square test in one of the following situations:

  1. When we want to check whether an observed distribution matches an expected distribution; this is also referred to as the goodness-of-fit test.
  2. When we want to estimate whether two random variables are dependent or independent.
  3. When we want to check the correlation between two categorical variables.
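As a quick illustration of situations 2 and 3, SciPy's `chi2_contingency` runs a chi-square independence test on a contingency table of counts. This is a minimal sketch; the table below contains made-up counts, not data from any real study.

```python
# Sketch of a chi-square independence test on a 2x2 contingency table,
# using scipy.stats.chi2_contingency. The counts are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: outcome Yes / No (made-up counts)
observed = np.array([[60, 40],
                     [30, 70]])

stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {stat:.2f}, dof = {dof}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the two variables appear to be dependent")
else:
    print("Fail to reject H0: no evidence of dependence")
```

For this table, the observed counts differ strongly from the counts expected under independence, so the p-value comes out below 0.05.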

How to use the Chi-Square Test

The chi-square test is a type of hypothesis test, so to solve any problem with it we follow the same steps as in hypothesis testing.

Revise the hypothesis test steps:

  • Define the null hypothesis (H0)
  • Define the alternative hypothesis (H1)
  • Design the test statistic (T)
  • Using T, H0, and H1, compute the p-value
  • If the p-value is greater than the significance level (e.g. 5%), fail to reject H0
  • Otherwise, reject H0 in favour of H1

We follow these same steps to solve problems with the chi-square test; we will also work through a problem using the chi-square test later in this blog.

First, let’s learn more about the chi-square test. The formula for the chi-square statistic is as follows:

χ² = Σ (O − E)² / E

χ² ⇒ the chi-square test statistic, the value we compute from the equation above.

O ⇒ the observed data, i.e. the counts you actually gather during the experiment; for example, the number of times something occurs.

E ⇒ the expected counts, which you determine under the null hypothesis (H0); for example, if the coin is not biased towards heads, we expect heads and tails to come up equally often.

So here the null hypothesis (H0) is that our coin is not biased towards heads.
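To make the formula concrete, here is a minimal sketch with hypothetical counts from 100 coin flips: we compute χ² = Σ (O − E)² / E by hand and cross-check it against SciPy's `chisquare` function.

```python
from scipy.stats import chisquare

observed = [55, 45]   # O: observed heads/tails in 100 flips (hypothetical)
expected = [50, 50]   # E: expected counts under H0 (a fair coin)

# chi2 = sum over categories of (O - E)^2 / E
chi2_manual = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2_manual)  # 1.0

# Same statistic from SciPy, which also returns the p-value
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)
```

Here χ² = (55−50)²/50 + (45−50)²/50 = 1.0, and the corresponding p-value (about 0.32) is well above 0.05, so we fail to reject H0: this sample gives no evidence that the coin is biased.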

Chi-square Test Using SKLearn Library

In this section, we take the Loan_Dataset, and our task is to find which features are not correlated with the target variable.

First, we define the hypotheses, and then we see how to perform the chi-square test using the library.

  • Null Hypothesis (H0): There is no relationship between the variables.
  • Alternative Hypothesis (H1): There is a relationship between variables.
  • Choose a significance level: SL = 0.05 (i.e. 95% confidence).
  1. If the p-value is greater than 0.05, the test result lies in the acceptance region and we fail to reject the null hypothesis, meaning no relationship is detected between the feature variable and the target variable.
  2. If the p-value is less than 0.05, the test result lies in the rejection (critical) region and we reject the null hypothesis in favour of the alternative, meaning there is a relationship between the feature variable and the target variable.
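This decision rule can be written as a tiny helper function (the 0.05 threshold is just the conventional default, not anything specific to this dataset):

```python
def decide(p_value, significance=0.05):
    """Apply the chi-square decision rule at a given significance level."""
    if p_value > significance:
        return "fail to reject H0: no relationship detected"
    return "reject H0: relationship detected"

print(decide(0.20))    # fail to reject H0: no relationship detected
print(decide(0.001))   # reject H0: relationship detected
```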

Before jumping to the code section, let’s learn about the chi2 and SelectKBest utilities.

SelectKBest:- this is available in the sklearn library; it takes two arguments.

  • score_func ⇒ the scoring function; in this case we want chi-square values, so we use chi2 as the score_func
  • k ⇒ the number of top features to select. The “all” option bypasses selection, for use in a parameter search.

Syntax

class sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)

chi2:- this is also available in the sklearn library; it takes two arguments.

  • X ⇒ sample vectors, i.e. all your feature columns
  • y ⇒ the target variable

Returns

  • It returns the chi2 statistic and the p-value for each feature

Syntax

sklearn.feature_selection.chi2(X, y)

Example

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)
X.shape
>>> (1797, 64)

X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
X_new.shape
>>> (1797, 20)

Now that we have some idea about these utilities, we can move on to the coding part.

Code Snippet

# Import the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math
from sklearn.feature_selection import chi2, SelectKBest

# Load the dataset
df_loan = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Loan_Dataset/loan_data_set.csv")
df_loan.head()

# Remove all rows with null values
df_loan.dropna(inplace=True)

# Drop the uninformative column ("Loan_ID")
df_loan.drop(labels=["Loan_ID"], axis=1, inplace=True)
df_loan.reset_index(drop=True, inplace=True)

# Recode Credit_History from 0/1 to "N"/"Y" so it is treated as categorical
df_loan["Credit_History"] = df_loan["Credit_History"].apply(lambda x: "N" if x == 0 else "Y")

# All categorical columns, including the target
cat_cols = df_loan.select_dtypes(include="object").columns
cat_cols

# Categorical feature columns only (target excluded)
cat_col = df_loan.select_dtypes(include="object").drop("Loan_Status", axis=1).columns
cat_col

# Convert object to category
df_loan[cat_cols] = df_loan[cat_cols].apply(lambda x: x.astype("category"))

# Encode the categories as integer codes
df_loan[cat_cols] = df_loan[cat_cols].apply(lambda x: x.cat.codes)
df_loan.head()

X = df_loan[cat_col]
y = df_loan["Loan_Status"]

# Let's use the sklearn chi2 function
cs = SelectKBest(score_func=chi2, k=7)
cs.fit(X, y)

feature_score = pd.DataFrame({"Score": cs.scores_, "P_Value": cs.pvalues_}, index=X.columns)
feature_score.nlargest(n=6, columns="Score")
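As a hypothetical follow-up, the fitted selector's p-values can be filtered against the 0.05 threshold directly, splitting the features into related and unrelated groups. The sketch below uses sklearn's built-in digits dataset so it runs on its own; the same filtering would apply to the Loan_Dataset p-values above.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)
cs = SelectKBest(score_func=chi2, k="all").fit(X, y)

scores = pd.DataFrame({"Score": cs.scores_, "P_Value": cs.pvalues_})
related = scores[scores["P_Value"] < 0.05]     # reject H0: related to target
unrelated = scores[scores["P_Value"] >= 0.05]  # fail to reject H0

print(len(related), "features related,", len(unrelated), "not related")
```

Note that `k="all"` keeps every feature, so the selector is used here only to compute the scores and p-values, not to drop columns.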

In these results, any feature whose p-value is higher than the significance level can be said to have no detected relationship with the target variable.

Gender’s p-value is higher than the SL, so we fail to reject H0 and conclude that Gender has no detected relation with the Loan_Status feature.

The Credit_History feature’s p-value is much lower than the SL, so we go along with H1 and conclude that Credit_History has a relation with the Loan_Status feature.
