Assumptions in Supervised Machine Learning Models
In the world of Data Science, we often get busy learning models and implementing them as quickly as possible to meet our requirements. But we forget that every model we use rests on certain assumptions, and when those assumptions do not hold, its results can be unreliable. Let’s see what lies behind these models and what we might have been missing. The assumptions of common supervised machine learning models are listed below:
1. Linear Regression
- The regression model is linear in the coefficients and the error term.
- The mean of the residuals is zero.
- Error terms are not correlated with each other, i.e., given an error value, we cannot predict the next error value.
- Independent variables (x) are uncorrelated with the error term, a property also known as exogeneity. In layman's terms, the error term should not be predictable from the values of the independent variables.
- The error terms have a constant variance, i.e., homoscedasticity.
- No multicollinearity, i.e., the independent variables should not be highly correlated with one another. If there is multicollinearity, the coefficient estimates of the OLS model become less precise. This assumption is tested using Variance Inflation Factor (VIF) values.
- The error terms are normally distributed.
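Some of the checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full diagnostic suite: the toy data, the helper names (`fit_simple_ols`, `durbin_watson`, `vif_two_predictors`) and the use of the Durbin-Watson statistic for autocorrelation are assumptions made for the example. The VIF shortcut used here, 1 / (1 - r^2), is only valid when there are exactly two predictors.

```python
import math

def fit_simple_ols(x, y):
    """Closed-form OLS for one predictor: y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def durbin_watson(res):
    """Values near 2 suggest no first-order autocorrelation in the residuals."""
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    den = sum(r ** 2 for r in res)
    return num / den

def pearson_r(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(x1, x2):
    """With exactly two predictors, both share VIF = 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Toy data (made up for illustration).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

intercept, slope = fit_simple_ols(x, y)
res = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# With a fitted intercept, the in-sample residual mean is zero by construction.
residual_mean = sum(res) / len(res)
dw = durbin_watson(res)

# A nearly collinear pair versus a weakly correlated pair.
vif_high = vif_two_predictors([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 13])
vif_low = vif_two_predictors([1, 2, 3, 4, 5, 6], [1, -1, 1, -1, 1, -1])

print(f"residual mean: {residual_mean:.2e}")
print(f"Durbin-Watson: {dw:.2f}")
print(f"VIF (collinear pair): {vif_high:.1f}, VIF (weakly correlated pair): {vif_low:.2f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, though the exact cutoff is a judgment call.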
Ridge & Lasso Regression
- The assumptions are the same as those used in regular linear regression: linearity, constant variance (no outliers), and independence. Since these methods do not provide confidence limits, normality need not be assumed.
2. Logistic Regression
- It assumes that there is minimal or no multicollinearity among the independent variables.
- Usually it requires a large sample size to predict properly.
- It assumes the observations to be independent of each other.
3. Decision Tree
- At the beginning, the whole training set is considered as the root.
- Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to building the model.
- Records are distributed recursively on the basis of attribute values.
- The order in which attributes are placed as the root or as internal nodes of the tree is decided using a statistical criterion, such as information gain or the Gini index.
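As a rough sketch of how an attribute might be chosen for the root, the snippet below discretizes a continuous feature and then picks the attribute with the lowest weighted Gini impurity. The dataset, the threshold of 75, and the function names are hypothetical and chosen only for illustration; real implementations differ in how they search for split points.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini_after_split(rows, labels, attribute):
    """Impurity left after distributing records by the attribute's values."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split leaves the least impurity."""
    return min(attributes, key=lambda a: weighted_gini_after_split(rows, labels, a))

def discretize(value, threshold):
    """Turn a continuous value into a categorical bin (threshold is assumed)."""
    return "high" if value >= threshold else "low"

# Toy records (made up): one categorical and one continuous feature.
raw_rows = [
    {"outlook": "sunny", "humidity": 85},
    {"outlook": "sunny", "humidity": 70},
    {"outlook": "rainy", "humidity": 90},
    {"outlook": "rainy", "humidity": 65},
    {"outlook": "overcast", "humidity": 60},
    {"outlook": "overcast", "humidity": 80},
]
labels = ["no", "yes", "no", "yes", "yes", "no"]

# Discretize the continuous feature before building the tree.
rows = [
    {"outlook": r["outlook"], "humidity_bin": discretize(r["humidity"], 75)}
    for r in raw_rows
]

root = best_attribute(rows, labels, ["outlook", "humidity_bin"])
print("impurity at root before splitting:", gini(labels))
print("attribute chosen for the root:", root)
```

The same selection step would then be repeated recursively on each group of records to place the internal nodes.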
4. Random Forest
- It assumes that sampling is representative.
- There are no formal distributional assumptions. Random forests are non-parametric and can thus handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal.
5. XGBoost
- It may assume that encoded integer values for each input variable have an ordinal relationship.
6. Naive Bayes
- The biggest and only assumption is conditional independence: given the class label, the features are assumed to be independent of one another.
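Conditional independence is the assumption behind Naive Bayes classifiers: the likelihood of a record is taken to be the product of per-feature likelihoods given the class. The sketch below shows that product for categorical features. The toy data, the function names, and the add-one (Laplace) smoothing are assumptions made for this example rather than part of the notes above.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    values_per_feature = defaultdict(set)   # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
            values_per_feature[i].add(value)
    return class_counts, feature_counts, values_per_feature

def posterior(row, model):
    """P(class | row), assuming features are independent given the class."""
    class_counts, feature_counts, values_per_feature = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Multiply in P(feature_i = value | class) with add-one smoothing.
            numerator = feature_counts[(i, label)][value] + 1
            denominator = count + len(values_per_feature[i])
            score *= numerator / denominator
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Toy training data (made up): (outlook, temperature) -> play?
rows = [
    ("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
    ("rainy", "cool"), ("sunny", "hot"), ("rainy", "cool"),
]
labels = ["no", "no", "yes", "yes", "no", "yes"]

model = train_naive_bayes(rows, labels)
probs = posterior(("rainy", "cool"), model)
prediction = max(probs, key=probs.get)
print(probs, "->", prediction)
```

If the features are in fact strongly dependent on each other, the multiplied likelihoods count the same evidence more than once, which is why this assumption matters.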
7. Support Vector Machines (SVM)
- It assumes data is independent and identically distributed.
8. K-Nearest Neighbors (KNN)
- The data lies in a feature space, which means the distance between data points can be measured with metrics such as Manhattan or Euclidean distance.
- Each of the training data points consists of a set of vectors and a class label associated with each vector.
- ‘K’ should preferably be an odd number in two-class classification, so that the vote among neighbours cannot end in a tie.
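The distance metrics and the odd value of K mentioned above are what a k-nearest-neighbours classifier relies on. Here is a minimal sketch, with made-up points and hypothetical function names, showing both metrics and a majority vote with K = 3.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set (made up): each vector has an associated class label.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2.0, 2.0), k=3))
print(knn_predict(points, labels, (6.0, 6.5), k=3, distance=manhattan))
```

With two classes and an odd K, one class always gets more votes than the other, so no tie-breaking rule is needed.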
These are the assumptions of some of the machine learning algorithms. You can always find more detailed explanations on GeeksforGeeks, Stackexchange, Quora, Stackoverflow, etc. My aim in writing this blog is to help anyone preparing for an interview by putting everything in one place as quick notes.
I hope you like my work. If you find any problems in the above details, please leave a comment so that I can improve them.