# Data Science Interview Questions and Answers

Python deletes unwanted objects (built-in types or class instances) automatically to free the memory space. The process by which Python periodically frees and reclaims blocks of memory that no longer are in use is called Garbage Collection.

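CPython, the reference implementation, manages memory mainly through reference counting: an object is deallocated as soon as its reference count reaches zero. A separate cyclic garbage collector runs periodically during program execution to reclaim groups of objects that reference each other but are otherwise unreachable. An object’s reference count changes as the number of aliases that point to it changes.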

An object’s reference count increases when it is assigned a new name or placed in a container (list, tuple, or dictionary). The object’s reference count decreases when it’s deleted with del, its reference is reassigned, or its reference goes out of scope. When an object’s reference count reaches zero, Python collects it automatically.

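In the given example:

**#Name b refers to the object 10**

b = 10

**#b is rebound to 4, so the reference count of object 10 drops by one**

b = 4

The literal value 10 is an object. In line 1, binding the name b to it increments its reference count.

In line 2, b is rebound to 4, so its reference to 10 is released. If that was the last reference, the count reaches 0 and the object is deallocated. (CPython caches small integers such as 10, so this particular object stays alive in practice; the example illustrates the principle.)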
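The behaviour described above can be observed directly. The sketch below assumes CPython (other interpreters manage memory differently) and uses a hypothetical `Node` class instead of a small integer, because small integers are cached. It shows that an alias raises the reference count, that `del` on the last reference frees the object immediately, and that a reference cycle is only reclaimed by the cyclic collector.

```python
import gc
import sys
import weakref


class Node:
    """Hypothetical helper class, used only for this illustration."""


def refcount_delta():
    obj = Node()
    before = sys.getrefcount(obj)
    alias = obj                      # a second name for the same object
    after = sys.getrefcount(obj)
    return after - before


def freed_after_del():
    obj = Node()
    probe = weakref.ref(obj)         # a weak reference does not keep obj alive
    del obj                          # the last strong reference is removed
    return probe() is None


def cycle_needs_collector():
    gc.disable()                     # keep the automatic collector out of the way
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a      # a and b now reference each other
        probe = weakref.ref(a)
        del a, b                     # counts never reach zero because of the cycle
        alive_before = probe() is not None
        gc.collect()                 # the cyclic collector reclaims the pair
        freed_after = probe() is None
    finally:
        gc.enable()
    return alive_before, freed_after
```

Using `weakref.ref` as a probe is a design choice for the demonstration: it lets the code check whether the object still exists without adding a reference of its own.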

Get Complete Data Science Course for Hands-on Practical Learning Experience.

- Supervised Learning is a machine learning approach that’s defined by its use of labeled datasets. Unsupervised Learning uses machine learning algorithms to analyze and cluster unlabeled data sets.
- The objective of supervised learning is to train a model that predicts the output for new data. The objective of unsupervised learning is to find useful insights and hidden patterns in unlabeled data.
- Supervised learning is used for two types of problems: regression and classification. Unsupervised learning is used for three types of problems: clustering, association and dimensionality reduction.
- Supervised models are typically trained offline on a labeled dataset before they are used for prediction. Unsupervised methods can also be applied to data as it arrives, since no labels have to be prepared first.
- Supervised learning receives direct feedback: predictions are compared against the known labels to assess whether the right output is being predicted. Unsupervised learning receives no such feedback.
- Computational cost depends on the algorithm and the data rather than on the paradigm alone; supervised learning additionally carries the cost of labeling the data.
- Supervised models generally produce more accurate results, and those results can be checked against known labels. Results from unsupervised models are harder to validate and are often less accurate.

**Example: Is it a cat or a dog?**

When all the images of the data are labelled under the classes – cat and dog, then we use a supervised learning approach to find the class of an image. For the same problem, if there are no class labels, we use the unsupervised learning approach to create two separate clusters.

**Conclusion:** When labeled data is available for the task, a supervised approach is usually preferable; when it is not, we opt for an unsupervised learning approach to solve the problem.

Learn Data Science and Get hands-on practical learning experience.

- The central limit theorem states that the sampling distribution of the mean approaches a normal distribution, as the sample size increases.
- As the sample size increases, the distribution of the sample means approximates a bell-shaped curve (i.e. the normal distribution curve). A key consequence of the Central Limit Theorem is that the average of your sample means will be the population mean.
- This is useful, as the researcher never knows which mean in the sampling distribution is the same as the population mean, but by selecting many random samples from a population the sample means will cluster together, allowing the researcher to make a very good estimate of the population mean.
- A sufficiently large sample can be used to estimate the parameters of a population, such as the mean and standard deviation.
- As a rule of thumb, sample sizes equal to or greater than 30 are considered sufficient for the central limit theorem to hold; heavily skewed populations may need larger samples.
- The Central Limit theorem plays a big role in hypothesis testing.
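A small simulation can make the theorem concrete. The sketch below is an illustration under stated assumptions: the population is an exponential distribution with mean 1 and standard deviation 1 (clearly not normal), and the sample size, number of samples and seed are arbitrary choices. It draws many samples and looks at how their means behave.

```python
import random
import statistics


def sample_means(sample_size, num_samples, seed=0):
    """Draw num_samples samples from an exponential population and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


means = sample_means(sample_size=50, num_samples=2000)

# The sample means cluster around the population mean (1.0) ...
center = statistics.fmean(means)

# ... with a spread close to sigma / sqrt(n) = 1 / sqrt(50).
spread = statistics.stdev(means)
expected_spread = 1 / 50 ** 0.5
```

Plotting `means` as a histogram would show the bell shape even though the underlying population is strongly skewed.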

- __Mutability__ – Lists and tuples are data structures that can store one or more objects or values. A list is mutable, whereas a tuple is immutable. This means that elements stored in a tuple cannot be reassigned or deleted, but it is possible to slice the tuple, and even to reassign or delete the whole tuple. Both lists and tuples can be copied, but only a list can be changed in place.
- __Length__ – Tuples have a fixed length, while lists have variable lengths. Thus, the size of a created list can be changed, but that is not the case for a tuple.
- __Memory efficiency__ – Since a tuple never changes size, Python can allocate exactly the memory it needs. A list carries extra bookkeeping and may reserve spare capacity so that it can grow without being resized on every append. So, tuples are more memory efficient than lists.
- **Debugging** – Tuples are easier to debug in large projects because of their immutability: lists can be changed anywhere in the code, while tuples cannot, which makes tuples easier to track. For a smaller project or a smaller amount of data, lists are fine to use.

**Conclusion** – Both types are built-in Python data structures, so choosing between a list and a tuple depends on whether the programmer wants to change the data later or not.
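The differences above can be checked with a few lines. This is a minimal sketch assuming CPython; the exact byte sizes reported by `sys.getsizeof` vary between versions, so only the comparison matters. The helper `try_item_assignment` is a hypothetical name introduced for the illustration.

```python
import sys

point = (1, 2, 3)      # tuple: immutable
values = [1, 2, 3]     # list: mutable


def try_item_assignment(container):
    """Try to overwrite the first element and report what happened."""
    try:
        container[0] = 99
        return "changed"
    except TypeError:
        return "TypeError"


tuple_result = try_item_assignment(point)    # tuples reject item assignment
list_result = try_item_assignment(values)    # lists accept it

# Both types can be copied; only the list copy can then be modified.
tuple_copy = tuple(point)
list_copy = list(values)

# For the same elements, the tuple takes less memory than the list.
tuple_size = sys.getsizeof(point)
list_size = sys.getsizeof(values)
```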

- Big data is a combination of structured, semi structured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modelling and other advanced analytics applications.
- While traditional data is measured in familiar sizes like megabytes, gigabytes and terabytes, big data is stored in petabytes and zettabytes.
- Big data is often characterized by the three V’s: 1) the large *volume* of data in many environments, 2) the wide *variety* of data types frequently stored in big data systems, and 3) the *velocity* at which much of the data is generated, collected and processed.
- Big data is often stored in a data lake. While data warehouses are commonly built on relational databases and contain structured data only, data lakes can support various data types and typically are based on Hadoop clusters, cloud object storage services, NoSQL databases or other big data platforms.
- Cloud is a popular location for managing big data systems. Organizations can deploy their own cloud-based systems or use managed big-data-as-service offerings from cloud providers.
- The various tools and technologies used in big data ecosystems are Hadoop, NoSQL databases, MapReduce, YARN, Spark, Tableau

**Solution #1 – Using List Comprehension**

>>> t = [1, 3, 6]

>>> v = [t[i+1]-t[i] for i in range(len(t)-1)]

>>> v

Output – [2, 3]

**Solution #2 – Using Numpy**

>>> import numpy

>>> t = [1, 3, 6]

>>> v = numpy.diff(t)

>>> v

Output – array([2, 3])

- A histogram is used to summarize discrete or continuous data. It provides a visual interpretation of numerical data by showing the number of data points that fall within each specified range of values (called “bins”).
- To construct a histogram from a continuous variable, you first need to split the data into intervals, called **bins**.
- Bins should be neither too small nor too large, so that the underlying pattern (frequency distribution) of the data can be easily seen.
- When all bins have the same width, the height of each bar is proportional to the frequency of that bin. When bin widths differ, it is the area of the bar that indicates the frequency of occurrences for each bin. In that case the height of the bar does not by itself indicate how many occurrences there were within the bin; it is the product of height multiplied by the width of the bin that indicates the frequency of occurrences within that bin.
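The binning and the height-versus-area point can be sketched without a plotting library. The functions below are illustrative helpers, not part of any package, and they assume half-open bins with the last bin also including its right edge.

```python
def histogram_counts(values, bin_edges):
    """Count how many values fall into each bin [lo, hi); the last bin includes hi."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            is_last = i == len(counts) - 1
            if lo <= v < hi or (is_last and v == hi):
                counts[i] += 1
                break
    return counts


def densities(counts, bin_edges):
    """Bar heights chosen so that height * width is the share of points in each bin."""
    total = sum(counts)
    return [
        c / (total * (hi - lo))
        for c, lo, hi in zip(counts, bin_edges, bin_edges[1:])
    ]


data = [1, 2, 2, 3, 5, 7, 8, 9, 10]

equal_counts = histogram_counts(data, [0, 5, 10])      # equal-width bins
unequal_edges = [0, 2, 10]                              # bins of width 2 and 8
unequal_counts = histogram_counts(data, unequal_edges)
heights = densities(unequal_counts, unequal_edges)
areas = [h * (hi - lo) for h, lo, hi in zip(heights, unequal_edges, unequal_edges[1:])]
```

With the unequal bins, the wide bin holds eight of the nine points but its bar is only twice as tall as the narrow one; the areas, not the heights, carry the frequencies.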

- **Lists** can hold elements of different data types, while **arrays** hold elements of a single data type.
- **Arrays need to be declared. Lists don’t**, since they are built into Python. Lists are created by simply enclosing a sequence of elements in square brackets. Creating an array, on the other hand, requires a specific function from either the *array* module (i.e., *array.array()*) or the *NumPy* package (i.e., *numpy.array()*). Because of this, lists are used more often than arrays.
- **Arrays can store data very compactly** and are more efficient for storing large amounts of data. Lists are preferred for shorter sequences of data items.
- **Arrays are great for numerical operations**; lists cannot directly handle element-wise math operations. For example, you can divide each element of a NumPy array by the same number with just one line of code. If you try the same with a list, you’ll get an error.
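A short sketch using only the standard library illustrates two of these points: a list accepts mixed types but rejects division, while an `array.array` enforces a single element type. The `outcome` helper is a hypothetical name used to capture errors; element-wise division in one expression is a NumPy feature and is not shown here.

```python
from array import array

numbers_list = [2, 4, 6]
numbers_array = array("i", [2, 4, 6])   # "i" declares signed integers


def outcome(action):
    """Run action() and return its result, or the string 'TypeError' if it fails."""
    try:
        return action()
    except TypeError:
        return "TypeError"


# Dividing a list by a number is not defined.
list_division = outcome(lambda: numbers_list / 2)

# With a list, the division has to be written element by element.
halved = [x / 2 for x in numbers_list]

# A list happily mixes types; a typed array refuses a value of the wrong type.
mixed_list = outcome(lambda: numbers_list + ["text"])
mixed_array = outcome(lambda: numbers_array.append("text"))
```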

A structured approach to coming up with the number is as follows:

Let’s start with the population of Delhi, which we take to be roughly 2 crore (20000000).

We will divide this population into two groups-

1. Family(80%) = 0.8*20000000 = 16000000 family members

2. Bachelors(Individuals) (20%) = 0.2*20000000 = 4000000

Number of families(assuming 4 members in each) = 16000000/4 = 4000000.

Guessing that 50% of families have cars, the number of families with cars = 2000000.

In Delhi, we can assume that 25% of these car-owning families are affluent enough to afford 2 cars on average, and the rest can afford only one car.

Therefore the number of cars with families

= 0.25*2000000*2 + 0.75*2000000*1

= 2500000.

Now let’s say only 10% of the individual population can afford a single car.

Therefore, the number of cars with individuals

= 0.10*4000000

= 400000.

So the total number of cars in Delhi can be estimated as

2500000+400000 = 2900000

which can be rounded off to 3000000 for simple calculations.
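The estimate above can be written as a small function so that each assumption is easy to change. The function name and parameter names are hypothetical; the default values are the guesses used in the text, expressed as whole percentages to keep the arithmetic exact.

```python
def estimate_cars(
    population=20_000_000,
    family_share_pct=80,
    family_size=4,
    families_with_car_pct=50,
    two_car_pct=25,
    individuals_with_car_pct=10,
):
    """Return (family cars, individual cars, total cars) under the stated guesses."""
    family_members = population * family_share_pct // 100
    individuals = population - family_members
    families = family_members // family_size

    car_families = families * families_with_car_pct // 100
    two_car_families = car_families * two_car_pct // 100
    one_car_families = car_families - two_car_families
    family_cars = two_car_families * 2 + one_car_families

    individual_cars = individuals * individuals_with_car_pct // 100
    return family_cars, individual_cars, family_cars + individual_cars


family_cars, individual_cars, total_cars = estimate_cars()
```

In an interview, the value of this structure is that changing one guess, such as the share of families with a car, immediately shows how sensitive the final number is to it.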

**#Method 1** – Deleting rows or columns: In Pandas, isnull() finds the rows and columns with missing or corrupted data, and dropna() drops them. This is the simplest option, but it discards data.

**#Method 2** – Replacing the missing data with a placeholder or an aggregated value: fillna() fills the invalid values with a placeholder (for ex – 0) or with an aggregate such as the column mean or median. This way we won’t lose any rows, but this method works mainly for small numeric datasets that are roughly linear in nature. These approximations can distort later results, so this method is not always the best.

**#Method 3** – Creating an unknown category: Categorical features have a number of possible values, which gives us an opportunity to create one more category for the missing values. This keeps the rows and records the fact that the value was missing. It can be used when the original information is missing or cannot be understood.

**#Method 4** – Predicting missing values: Based on the data where we have no missing values, we can train a statistical or machine learning algorithm to predict the missing values.
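The first three methods can be sketched on a tiny, made-up dataset. To stay dependency-free, this illustration uses plain Python dictionaries with `None` for missing values rather than Pandas; the rows and column names are invented for the example.

```python
import statistics

rows = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},   # missing numeric value
    {"age": 35, "city": None},        # missing categorical value
]

# Method 1: drop every row that has a missing value.
dropped = [r for r in rows if None not in r.values()]

# Method 2: replace a missing number with an aggregate (here the mean of known ages).
mean_age = statistics.mean(r["age"] for r in rows if r["age"] is not None)

# Method 3: replace a missing category with an explicit "Unknown" category.
filled = [
    {
        "age": r["age"] if r["age"] is not None else mean_age,
        "city": r["city"] if r["city"] is not None else "Unknown",
    }
    for r in rows
]
```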

- The WHERE clause filters records from the table based on the specified condition. The HAVING clause filters groups based on the specified condition.
- The WHERE clause operates on individual rows. The HAVING clause operates on grouped (aggregated) results.
- The WHERE clause can be used without a GROUP BY clause. The HAVING clause is normally used together with GROUP BY; where a database accepts it without one, the whole result is treated as a single group.
- The WHERE clause is applied before the GROUP BY clause. The HAVING clause is applied after the GROUP BY clause.
- The WHERE clause is used with single-row functions like UPPER, LOWER etc. and cannot contain aggregate functions. The HAVING clause is used with multiple-row (aggregate) functions like SUM, COUNT etc.
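The difference is easiest to see on a small table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `orders` table and its rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("asha", 50), ("asha", 70), ("ben", 20), ("ben", 30), ("chen", 200)],
)

# WHERE filters individual rows before any grouping happens.
where_rows = conn.execute(
    "SELECT customer, amount FROM orders "
    "WHERE amount > 40 ORDER BY customer, amount"
).fetchall()

# HAVING filters whole groups after GROUP BY, using an aggregate.
having_rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer HAVING SUM(amount) > 100 ORDER BY customer"
).fetchall()

# Both together: rows are filtered first, then the remaining groups.
both_rows = conn.execute(
    "SELECT customer, COUNT(*) FROM orders "
    "WHERE amount >= 30 GROUP BY customer HAVING COUNT(*) >= 2 ORDER BY customer"
).fetchall()

conn.close()
```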

- A dictionary maps a key to a value and cannot have duplicate keys. It is implemented as a hash table: the key (any hashable object, such as a string) is converted by a hash function into an integer value, which is used to find the right slot in the table.
- Lists, though, can be very slow when searching for a value, as the only way to search a list is to check each item, starting from the zeroth element and going up to the last element in the list.
- If the item being searched for happens to be first, the list is as fast as the dictionary, because the very first comparison succeeds. But every position further along costs one more comparison, so the lookup takes more and more time. The larger the list, the longer it takes on average.
- Thus, a dictionary has a faster lookup, with O(1) (constant) average time complexity, while searching a list is an O(n) operation.
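Counting comparisons, rather than timing, gives a deterministic picture of the O(n) versus O(1) behaviour. The sketch below is illustrative: `linear_search_steps` is a hypothetical helper that mimics how a list is searched, and the user names are made up.

```python
def linear_search_steps(items, target):
    """Return how many elements are examined before target is found."""
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps
    return len(items)   # target absent: every element was examined


names = [f"user{i}" for i in range(1000)]
index_by_name = {name: i for i, name in enumerate(names)}

first_steps = linear_search_steps(names, "user0")     # found immediately
last_steps = linear_search_steps(names, "user999")    # every element examined

# The dictionary finds the same entry by hashing the key, without scanning.
position = index_by_name["user999"]
```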

A module is a file containing Python statements and definitions. We use modules to break down large programs into small, manageable and organized files. Furthermore, modules provide reusability of code.

Some of the modules used are:

**Numpy** – It is an amazing module for doing any kind of mathematical operations in Python. Essentially, it allows you to work with array-like objects of multiple dimensions, like matrices, and helps to do one, two or three dimensional math very fast. The operations performed in Numpy are fast because a lot of them are implemented in C.

**Pandas** – It helps in reading and working with dataframes and with data in general. With pandas, it is easy to clean, manipulate and work with data.

**Regular expression (re)** – With the **re** module, we can do more complex text processing using regular expression pattern matching.

**Itertools** – This module provides various functions that work on iterators to produce more complex iterators. It offers fast, memory-efficient tools that can be used either by themselves or in combination to form an **iterator algebra**.

**tkinter** – It comes bundled with Python, using Tk and is Python’s standard GUI framework. It provides a fast and easy way to create GUI applications.

*delimiter.* If it is a positive number, this function returns all to the left of the delimiter. If it is a negative number, this function returns all to the right of the *delimiter.*

- __Linear Relationship:__ There is a linear relationship between the independent variable, x, and the dependent variable, y. The easiest way to detect if this assumption is met is to create a scatter plot of x vs. y. This allows you to visually see if there is a linear relationship between the two variables.
- __Independence:__ The next assumption of linear regression is that the residuals are independent. This is mostly relevant when working with time series data. Ideally, we don’t want there to be a pattern among consecutive residuals. The simplest way to test if this assumption is met is to look at a residual time series plot, which is a plot of residuals vs. time. Ideally, most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at about +/- 2 divided by the square root of *n*, where *n* is the sample size.
- **Homoscedasticity:** The next assumption of linear regression is that the residuals have constant variance at every level of x. This is known as *homoscedasticity*. Once you fit a regression line to a set of data, you can create a scatterplot that shows the fitted values of the model vs. the residuals of those fitted values. If the spread of the residuals stays roughly the same across the fitted values, the assumption is met.
- __Normality:__ The next assumption of linear regression is that the residuals are normally distributed. A Q-Q plot, short for quantile-quantile plot, is a type of plot that we can use to determine whether or not the residuals of a model follow a normal distribution. If the points on the plot roughly form a straight diagonal line, then the normality assumption is met.
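The independence check can be sketched numerically. The code below is a simplified illustration: it fits a line by ordinary least squares, computes the residuals, and compares only the lag-1 autocorrelation of the residuals with the approximate band of 2 divided by the square root of n. All function names are hypothetical, and a full check would look at several lags.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys, slope, intercept):
    """Observed value minus fitted value for every point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def lag1_autocorrelation(res):
    """Correlation between each residual and the one that follows it."""
    mean_r = sum(res) / len(res)
    dev = [r - mean_r for r in res]
    return sum(a * b for a, b in zip(dev, dev[1:])) / sum(d * d for d in dev)


def within_band(res):
    """True if the lag-1 autocorrelation lies inside +/- 2 / sqrt(n)."""
    return abs(lag1_autocorrelation(res)) <= 2 / math.sqrt(len(res))


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Residuals that flip sign at every step are clearly not independent.
alternating = [1, -1] * 10
```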

- K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of predefined clusters that need to be created in the process; for example, if K=2, there will be two clusters, for K=3, there will be three clusters, and so on.
- The K-means clustering algorithm works in three steps:
  - Select the value of k.
  - Initialize the centroids.
  - Assign each point to the group of its nearest centroid and recompute each centroid as the average of its group, repeating until the centroids stop changing.

- We will go through each figure one by one.
- Figure 1 shows data for two different items. The first item is shown in blue and the second item is shown in red. Here the value of K is chosen as 2, and two points are selected as the initial centroids. There are different methods by which we can choose the right value of K.
- In figure 2, join the two selected centroids with a line and draw the perpendicular bisector of that line. Each point is assigned to the centroid on its side of the bisector, and each centroid then moves to the average of its points. You will notice that some of the red points are now on the blue side, so these points now belong to the group of blue items.
- The same process continues in figure 3. We join the two centroids, draw the perpendicular bisector and recompute the centroids. The two centroids move again, and some more red points get converted to blue points.
- The same process happens in figure 4. This process continues until the assignments stop changing and we get two clearly separated clusters.

- Advantages of K-means: 1) It is very simple to implement 2) It scales to huge datasets and is fast on large datasets 3) It adapts easily to new examples 4) It can be generalized to clusters of different shapes and sizes.
- Disadvantages: 1) It is sensitive to outliers 2) As the number of dimensions increases, its scalability decreases.
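The steps described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the initial centroids are passed in explicitly (here simply the first two points) instead of being chosen at random, distances are squared Euclidean, and the data points are made up.

```python
def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    return clusters


def mean_point(cluster):
    """Average of the points in a cluster, coordinate by coordinate."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and averaging until the centroids stop changing."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        new_centroids = [
            mean_point(c) if c else centroids[i]   # keep the centroid of an empty cluster
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=points[:2])
```

Even though both starting centroids lie in the same corner of the data, the second centroid is pulled toward the far group after the first update, and the algorithm settles on the two visible clusters.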

1. Start the 7 minute sand timer and the 4 minute sand timer.

2. Once the 4 minute sand timer ends, turn it upside down instantly.

**Time Elapsed: 4 minutes**. At this moment, 3 minutes of sand is left in the 7 minute sand timer.

3. Once the 7 minute sand timer ends, turn it upside down instantly.

**Time Elapsed: 7 minutes**. At this moment, 1 minute of sand is left in the 4 minute sand timer.

4. When the 4 minute sand timer ends, only 1 minute of sand has run through the 7 minute sand timer since it was turned, so turn the 7 minute sand timer upside down again to time exactly one more minute.

__Time Elapsed: 8 minutes.__

5. When the 7 minute sand timer ends, total time elapsed is 9 minutes.

So effectively **8 + 1 = 9.**

**Bias** in machine learning is a type of error in which **certain elements of a dataset** are more heavily weighted and/or represented than others.

A biased dataset does not accurately represent a **model’s use case**, resulting in skewed outcomes, low accuracy levels, and analytical errors.

The 5 main types of machine learning bias, why they occur, and how to reduce their effect are given below:

**1) Algorithmic bias**

- Algorithmic bias is the error that occurs when the algorithm at the core of the machine learning process is faulty or inappropriate for the current application.
- Algorithmic bias can be spotted when the application starts giving wrong results for almost identical cases (input cases).

**2) Sample bias**

- Sample bias occurs when a dataset does not reflect the realities of the environment in which a model will run. To fix this, a larger, more diverse dataset should be used to train your model.
- **Example**: Certain facial recognition systems were trained primarily on images of white men. These models have considerably lower levels of accuracy with women and people of different ethnicities.

**3) Prejudice Bias**

- Prejudice bias is often the result of the data being **biased in the first place**. The data you extracted and used to train your model may have pre-existing bias, such as stereotypes and faulty case assumptions. So, using this data will result in biased results no matter what algorithms you use.
- Prejudice bias is quite difficult to solve; you can try to use an entirely new dataset, or try to modify the data to eliminate any existing biases.

**4) Measurement Bias**

- This type of bias occurs when the data collected for training differs from that collected in the real world, or when faulty measurements result in data distortion. Measurement bias can also occur due to inconsistent annotation during the data labelling stage of a project.
- **An example** of this bias occurs in image recognition datasets, where the training data is collected with one type of camera, but the production data is collected with a different camera.

**5) Exclusion Bias**

- Exclusion bias is most common at the data pre-processing stage. Most often it’s a case of deleting valuable data thought to be unimportant.
- **For example**, imagine you have a dataset of customer sales in America and Canada. 98% of the customers are from America, so you choose to delete the location data thinking it is irrelevant. However, this means your model will not pick up on the fact that your Canadian customers spend two times more.

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the tree. This algorithm compares the values of the root attribute with the record (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches the leaf node of the tree. The complete process can be better understood using the below algorithm:

**· Step-1**: Begin the tree with the root node, say S, which contains the complete dataset.

**· Step-2**: Find the best attribute in the dataset using **Attribute Selection Measure (ASM)**. By this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

**1. Information Gain**: It is the measurement of the change in entropy after a dataset is split on an attribute. The attribute with the highest information gain is split first.

**2. Gini Index**: It is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm. It creates binary splits, and an attribute with a low Gini index is preferred.

**· Step-3**: Divide ‘S’ into subsets that contain the possible values of the best attribute.

**· Step-4**: Generate the decision tree node that contains the best attribute. The tree consists of **Decision Nodes** and **Leaf Nodes**. Decision nodes test an attribute and have multiple branches, one per answer (for example Yes/No), whereas Leaf nodes are the output of those decisions and do not contain any further branches.

**· Step-5**: Recursively make new decision trees using the subsets of the dataset created in step -3. Continue this process until a stage is reached where you cannot further classify the nodes.

Decision Tree is a **Supervised learning technique** that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems.
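The two attribute selection measures from Step-2 can be computed in a few lines. The sketch below uses the standard definitions of entropy, Gini impurity and information gain; the function names and the yes/no labels are chosen for the illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]

# A split that separates the classes perfectly removes all uncertainty.
perfect_gain = information_gain(parent, [["yes", "yes"], ["no", "no"]])

# A split that leaves both subsets as mixed as the parent gains nothing.
useless_gain = information_gain(parent, [["yes", "no"], ["yes", "no"]])
```

A tree-building algorithm would evaluate such a measure for every candidate attribute and split on the one with the highest gain (or the lowest Gini index).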

**Overfitting:**

- Overfitting occurs when a statistical model fits too closely against its training data. When this happens, the algorithm unfortunately **cannot perform accurately against unseen data, defeating its purpose.**
- When machine learning algorithms are constructed, they leverage a sample dataset to train the model. However, when the model trains for too long on sample data or when the model is too complex, it can start to learn the “noise,” or irrelevant information, within the dataset. When the model memorizes the noise and fits too closely to the training set, the model becomes “overfitted,” and it is unable to generalize well to new data.
- A **low error rate** on the training data combined with a **high variance** (a much higher error on unseen data) is a good indicator of overfitting. In order to detect this type of behavior, part of the dataset is typically set aside as the “test set” to check for overfitting.

**Underfitting:**

- Underfitting occurs when a data model is **unable to capture the relationship between the input and output variables** accurately, generating a high error rate on both the training set and unseen data.
- It occurs when a model is too simple, which can be a result of a model needing more training time, more input features, or less regularization. Like overfitting, when a model is underfitted, it cannot establish the dominant trend within the data, resulting in training errors and poor performance of the model.
- **High bias** and **low variance** are good indicators of underfitting. Since this behavior can be seen while using the training dataset, underfitted models are usually easier to identify than overfitted ones.

**The ideal scenario when fitting a model is to find the balance between overfitting and underfitting.**

- **K-Nearest Neighbors** is a **supervised** classification algorithm where **K** describes the number of neighbour points that we look at for each data point to determine its classification. As a supervised algorithm, we have the labels of the data points, and use those to predict the labels of new data points. The concept behind this algorithm is that, for a new point, it finds its K nearest neighbours based on the closest distance. It needs only a single pass over the training data for each prediction, unlike K-Means Clustering, which iterates multiple times until convergence is reached.
- **K-Means Clustering** is an **unsupervised** algorithm, where **K** describes how many centroids of clusters there will be when applying the algorithm. As an unsupervised algorithm, we are not given any labels; instead, we use the features of the data points to group similar points together and find the clusters. The concept behind this algorithm is that we try to calculate the locations of the K centres, or the averages or means, of the data points, which are where the clusters are most likely centred. It recalculates the centres based on the data points, and iterates multiple times until convergence is reached, which happens when the newly computed centre locations stop changing.
- K-NN is a **classification** or **regression** machine learning algorithm, while K-means is a clustering algorithm.
- K-NN is a **lazy learner** while K-Means is an **eager learner**. K-NN is called a lazy learner because it involves minimal training: it simply stores the training data and defers all computation until a prediction is requested, rather than building a generalized model in advance.
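The "lazy" nature of K-NN shows up clearly in code: there is no training step, only a prediction function that consults the stored, labeled points. The sketch below is a minimal illustration with made-up data, squared Euclidean distance and a simple majority vote; `knn_predict` is a hypothetical name.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Label the query point by majority vote among its k nearest training points."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Labeled training data: (features, label). K-means would receive no labels.
train = [
    ((1, 1), "cat"), ((1, 2), "cat"), ((2, 1), "cat"),
    ((8, 8), "dog"), ((9, 8), "dog"), ((8, 9), "dog"),
]

near_cats = knn_predict(train, (2, 2))
near_dogs = knn_predict(train, (7, 7))
```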

The goal of any supervised machine learning algorithm is to achieve **low bias** and **low variance**.

If our model is too simple and has very few parameters then it may have high bias and low variance. On the other hand, if our model has a large number of parameters then it’s going to have high variance and low bias. So we need to find the right balance without overfitting or underfitting the data.

Thus, there is a trade-off between bias and variance in achieving good prediction performance. To build a good model, we need to find a balance between bias and variance such that it minimizes the total error.

**Total error = Bias^2 + Variance + Irreducible Error**

where, **Bias** is the difference between the average prediction of our model and the correct value which we are trying to predict, and **Variance** is the variability of the model’s prediction for a given data point, which tells us how much the predictions spread when the model is trained on different data.

**Irreducible error** is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.

**For example**: – The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.

· Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution.

· There are two types of hypothesis – Null and Alternative.

1. **Null Hypothesis:** It is denoted by H_{0}. A null hypothesis is the one in which sample observations result purely from chance. This means that the observations are not influenced by some non-random cause.

2. **Alternative Hypothesis:** It is denoted by H_{a} or H_{1}. An alternative hypothesis is the one in which sample observations are influenced by some non-random cause.

· A hypothesis test concludes whether to *reject* the null hypothesis and accept the alternative hypothesis or to *fail to reject* the null hypothesis.

· The following steps are involved in hypothesis testing:

1. **The first step** is to state the null and alternative hypothesis clearly. The alternative hypothesis determines whether the test is one tailed or two tailed.

2. **The second step** is to determine the test size, i.e. the significance level. Together with the choice of a one tailed or two tailed test, this gives the researcher the right critical value and the rejection region.

3. **The third step** is to compute the test statistic and the probability value. This step of the hypothesis testing also involves the construction of the confidence interval depending upon the testing approach.

4. **The fourth step** involves the decision making step. This step of hypothesis testing helps the researcher reject or fail to reject the null hypothesis by making comparisons between the criterion chosen in the second step and the test statistic or the probability value from the third step.

5. **The fifth step** is to draw a conclusion about the data and interpret the results obtained from the data.

· A null hypothesis is rejected or not rejected on the basis of the P value and the region of acceptance.

1. **P value** – it is a function of the observed sample results. A threshold value is chosen before the test is conducted and is called the significance level, which is represented as α. **If the calculated value of P ≤ α, it suggests an inconsistency between the observed data and the assumption that the null hypothesis is true**. This suggests that the null hypothesis should be rejected.

2. **Region of Acceptance** – It is the range of values that leads you to not reject the null hypothesis. When you collect and observe sample data, you compute a test statistic. If its value falls within the specific range, the null hypothesis is not rejected.

· A **random variable**, usually written *X*, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables, **discrete** and **continuous.**

**1. Discrete Random Variables**

· A **discrete random variable** is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4, ... Discrete random variables are usually (but not necessarily) counts. If a random variable can take only a finite number of distinct values, then it must be discrete.

· **Example**: Imagine a coin toss where, depending on the side of the coin landing face up, a bet of a dollar has been placed. The possibility of winning a dollar corresponding to the outcome of a coin toss before tossing the coin defines the random variable. The outcome of the coin toss is either heads or tails, with an equal probability of either outcome. Because the random variable can take only two distinct values (winning or losing a dollar), its probability distribution is discrete.

**2. Continuous Random Variables**

· A **continuous random variable** is one which takes an uncountably infinite number of possible values, such as every value in an interval. Continuous random variables are usually measurements.

· A continuous random variable is not defined at specific values. Instead, it is defined over an *interval* of values, and is represented by the **area under a curve** (in advanced mathematics, this is known as an *integral*). The probability of observing any single value is equal to 0, since the number of values which may be assumed by the random variable is infinite.

· **Example**: Imagine wanting to study the effects of caffeine intake on height. One’s height would be the continuous random variable as it is unknown before the completion of the experiment, and its value is taken from measuring within a range.


· A p-value, or probability value, is a number describing how likely it is that data at least as extreme as yours would have occurred by random chance if the null hypothesis were true.

· The level of statistical significance is often expressed as a *p*-value between 0 and 1. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

· A *p*-value of 0.05 or less (≤ 0.05) is conventionally treated as statistically significant. It indicates strong evidence against the null hypothesis, because data this extreme would be observed less than 5% of the time if the null hypothesis were true. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.

· However, if the *p*-value is below your threshold of significance (typically *p* < 0.05), you can reject the null hypothesis, but this does not mean that there is a 95% probability that the alternative hypothesis is true. The *p*-value is conditional upon the null hypothesis being true, but is unrelated to the truth or falsity of the alternative hypothesis.

· A *p*-value higher than 0.05 (> 0.05) is not statistically significant. It indicates that the evidence against the null hypothesis is not strong enough, not that the null hypothesis is true. This means we fail to reject (retain) the null hypothesis.

· The term **significance level (alpha)** is used to refer to a pre-chosen probability and the term “P value” is used to indicate a probability that you calculate after a given study.

· The significance level (alpha) is the probability of type I error. **Type I error** is the false rejection of the null hypothesis and **type II error** is the false acceptance of the null hypothesis. The power of a test is one minus the probability of type II error (beta).
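The decision rule described above can be sketched in Python. This is a minimal illustration, assuming a two-sided one-sample z-test with a known population standard deviation; the helper name `two_sided_p_value` and all the sample numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(sample_mean, mu0, sigma, n):
    """Two-sided p-value of a one-sample z-test (sigma assumed known)."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability, under the null hypothesis, of a statistic at least this extreme.
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05  # significance level, chosen before the test is run
p = two_sided_p_value(sample_mean=52.0, mu0=50.0, sigma=5.0, n=36)
decision = "reject H0" if p <= alpha else "fail to reject H0"
print(round(p, 4), decision)
```

Note that α is fixed in advance, while `p` is computed from the data afterwards.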


Classification algorithms take existing (labelled) datasets and use the available information to generate predictive models for use in classification of future data points. The following evaluation metrics are commonly used to measure a classification model’s performance:

**Confusion Matrix:** The Confusion Matrix is a two-dimensional matrix that allows visualization of the algorithm’s performance. Predictions are highlighted and divided by class (true/false), before being compared with the actual values. This matrix essentially helps you determine if the classification model is optimized. It shows what errors are being made and helps to determine their exact type.

**Accuracy:** A classification model’s accuracy is defined as the percentage of predictions it got right. However, it’s important to understand that it becomes less reliable when the probability of one outcome is significantly higher than the other one, making it less ideal as a stand-alone metric.

**Recall:** Recall is the number of correct positive results divided by the number of all samples that should have been identified as positive.

Recall = TP / (TP + FN)

**Precision:** This metric is the number of correct positive results divided by the number of positive results predicted by the classifier.

Precision = TP / (TP + FP)

**F1 score:** The F1 score is the harmonic mean of precision and recall. It is used to measure the accuracy of tests and is a direct indication of the model’s performance. The F1 score ranges from 0 to 1, with the goal being to get as close as possible to 1. It is calculated as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

**ROC-AUC:** The Receiver Operating Characteristic (ROC) is a probability curve, and the area under it (AUC) measures the capability of the model to distinguish correctly between the classes. ROC-AUC is used for binary classification and represents the degree or measure of separability. It shows the performance of a classification model at all classification thresholds. The AUC score is calculated from the plot of True Positive Rate (TPR) against False Positive Rate (FPR).
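As a rough sketch, the count-based metrics above can be computed directly from the four cells of a binary confusion matrix. The function name and the counts below are hypothetical, and the sketch assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # correct positives / predicted positives
    recall = tp / (tp + fn)      # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```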


Most organizations emphasize data to drive business decisions. But data alone is not the goal. Facts and figures are meaningless if you can’t gain valuable insights that lead to more-informed actions. The different types of analytics are as follows:

**Descriptive Analytics:** It describes what has happened or is happening in the data.

**For instance**, say that an unusually high number of people are admitted to the emergency room in a short period of time. Descriptive analytics tells you that this is happening and provides real-time data with all the corresponding statistics (date of occurrence, volume, patient details, etc.).

**Predictive Analysis:**Predictive tools attempt to fill in gaps in the available data. If descriptive analytics answer the question, “what happened in the past,” predictive analytics answer the question, “what might happen in the future?”

Predictive analytics take historical data from CRM, POS, HR and ERP systems and use it to highlight patterns. Then, algorithms, statistical models and machine learning are employed to capture the correlations between targeted data sets.

The most common commercial **example** is a credit score. Banks use historical information to predict whether or not a candidate is likely to keep up with payments.

**Prescriptive Analysis:** It takes predictive data to the next level. Now that you have an idea of what will likely happen in the future, prescriptive analytics suggests various courses of action and outlines what the potential implications would be for each.

**Back to our hospital example**: Now that you know the illness is spreading, the prescriptive analytics tool may suggest that you increase the number of staff on hand to adequately treat the influx of patients.


201 is not an even number, so let’s consider 200 games first. Let event A be that A fires more arrows on target than B in 200 games, event B be that B fires more arrows on target than A, and event C be that they fire equal numbers of arrows on target. We have:

P(A) + P(B) + P(C) = 1

Since A and B perform equally at archery, for 200 games, we have P(A) = P(B). Thus:

2P(A) + P(C) = 1

Now move to the extra game that person A plays. If in the last 200 games:

A is higher than B, then no matter whether A fires on target or not for this extra game, A is still higher than B.

If A is lower than B, then even if A fires on target in the extra game, we would observe at most A = B, and A will still not be over B.

If A=B, if A fires on target for the extra game, then A will be higher than B, and the probability that A shoots on target for any game is 0.5.

Thus, the total probability that A is higher than B is:

P(A) + 0.5*P(C)

We know that 2P(A) + P(C) = 1, if we divide by 2 on both sides, we will have:

P(A) + 0.5*P(C) = 0.5

The probability that A gets more targets than B when A plays 201 games and B plays 200 games is 0.5.
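The symmetry argument can be checked by brute force. The sketch below assumes, as in the reasoning above, that every arrow hits with probability 0.5 independently; the function name `prob_a_beats_b` is hypothetical. It sums the exact binomial probabilities using fractions, so there is no rounding error.

```python
from fractions import Fraction
from math import comb

def prob_a_beats_b(n):
    """P(A hits more than B) when A plays n + 1 games and B plays n games,
    assuming each game is an independent hit with probability 1/2."""
    favourable = 0
    for a in range(n + 2):               # a = number of hits by A
        for b in range(min(a, n + 1)):   # b = hits by B, with b < a and b <= n
            favourable += comb(n + 1, a) * comb(n, b)
    # There are 2^(n+1) * 2^n equally likely hit/miss sequences in total.
    return Fraction(favourable, 2 ** (2 * n + 1))

print(prob_a_beats_b(200))
```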


· Histograms and box plots are graphical representations for the frequency of numeric data values. They aim to describe the data and visually assess the central tendency, the amount of variation in the data as well as the presence of gaps, outliers or unusual data points.

· Histograms are preferred to determine the underlying probability distribution of a data. Box plots on the other hand are more useful when comparing between several data sets.

· A histogram is preferable when there is very little variance among the observed frequencies, while a box plot shows moderate variation among the observed frequencies.

· While histograms are better in displaying the distribution of data, a box plot is used to tell if the distribution is symmetric or skewed.

· Box plots are less detailed than histograms and take up less space.

· To conclude, both tools can be helpful to identify whether variability in data is within specification limits, and whether there is a shift in the process over time. Thus, the type of chart aid chosen depends on the type of data collected, rough analysis of data trends, and project goals.


Covariance is a measure to indicate the extent to which two random variables change in tandem. Correlation is a measure used to represent how strongly two random variables are related to each other.

Covariance is an unscaled measure of how two variables move together. Correlation is the scaled form of covariance, obtained by dividing the covariance by the product of the two standard deviations.

Covariance indicates the direction of the linear relationship between variables. Correlation on the other hand measures both the strength and direction of the linear relationship between two variables.

Covariance can vary between -∞ and +∞. Correlation ranges between -1 and +1

Covariance is affected by the change in scale. If all the values of one variable are multiplied by a constant and all the values of another variable are multiplied by a similar or different constant, then the covariance is changed. Correlation is not influenced by the change in scale.

Covariance of two dependent variables measures how much in real quantity (i.e. cm, kg, liters) on average they co-vary. Correlation of two dependent variables measures the proportion of how much on average these variables vary w.r.t one another.
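The effect of scale can be seen in a short NumPy sketch. The data values and the scaling constants (100 and 2) are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = np.cov(x, y)[0, 1]         # sample covariance
corr_xy = np.corrcoef(x, y)[0, 1]   # Pearson correlation

# Rescale both variables, e.g. a change of units.
cov_scaled = np.cov(100 * x, 2 * y)[0, 1]
corr_scaled = np.corrcoef(100 * x, 2 * y)[0, 1]

print("covariance:", cov_xy, "->", cov_scaled)     # changes by a factor of 200
print("correlation:", corr_xy, "->", corr_scaled)  # unchanged
```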


- In terms of the distribution of time spent per day on Facebook (FB), one can imagine there may be two groups of people on Facebook:

1. People who scroll quickly through their feed and don’t spend too much time on FB.

2. People who spend a large amount of their social media time on FB.

- Based on this, we can make a claim about the distribution of time spent on FB. The metrics to describe our distribution can be 1) Centre (mean, median, mode) 2) Spread (standard deviation, interquartile range) 3) Shape (skewness, kurtosis, unimodal or bimodal) 4) Outliers (do they exist?).
- We can give you a sample answer for your interview:
- If we assume that a person is visiting a Facebook page, there is a probability (p) that she will leave the page after one unit of time (t) has passed.
- With a probability of p her visit will be limited to 1 unit of time. With a probability of (1−p)p her visit will be limited to 2 units of time. With a probability of (1−p)²p her visit will be limited to 3 units of time, and so on. The probability mass function of this distribution is therefore (1−p)^(t−1)·p for t = 1, 2, 3, …, and hence we can say this is a geometric distribution.
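A small sketch of this geometric model, with the probability mass function written as (1−p)^(t−1)·p for t = 1, 2, 3, …. The value p = 0.2 is an arbitrary assumption, and the infinite sums are truncated at 199 terms, which is more than enough for this p.

```python
def geometric_pmf(t, p):
    """P(the visit lasts exactly t time units), for t = 1, 2, 3, ..."""
    return (1 - p) ** (t - 1) * p

p = 0.2  # hypothetical probability of leaving after each time unit
times = range(1, 200)
probs = [geometric_pmf(t, p) for t in times]

total = sum(probs)                                   # should be close to 1
mean_visit = sum(t * q for t, q in zip(times, probs))  # should be close to 1 / p

print(round(total, 6), round(mean_visit, 6))
```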


A type I error occurs when the null **hypothesis** is true but is rejected. In other words, if a true null hypothesis is incorrectly rejected, a type I error occurs. A type II error occurs when the null **hypothesis** is false but is not rejected. In other words, failure to reject a false null hypothesis results in a type II error.

A type I error is also known as a **false positive**. A type II error is also known as a **false negative**.

The probability of a type I error equals the level of significance (α). α is the threshold against which the p-value is compared; it is not the p-value itself.

The probability of a type II error is β. The probability 1 − β is called the statistical power of the study.

The probability that we will make a type I error is designated ‘α’ (alpha). Therefore, type I error is also known as alpha error. Probability that we will make a type II error is designated ‘β’ (beta). Therefore, type II error is also known as beta error.

A type I error refers to the non-acceptance of a hypothesis which ought to be accepted. A type II error refers to the acceptance of a hypothesis which ought to be rejected.

The probability of a type I error reduces with lower values of *α*, since a lower value makes it more difficult to reject the null hypothesis. The probability of a type II error reduces with higher values of *α*, since a higher value makes it easier to reject the null hypothesis.

In mathematics, the Euclidean distance is the length of the straight line segment between two points, which is the shortest distance between them in any number of dimensions. It is the square root of the sum of the squared differences between the corresponding coordinates of the two points.

In Python, the numpy, scipy modules are very well equipped with functions to perform mathematical operations and calculate this line segment between two points.

**Solution 1: Using Numpy Module**

The numpy module can be used to find the required distance when the coordinates are in the form of an array. It has the norm() function, which can return the vector norm of an array

import numpy as np

a = np.array((1, 2, 3))

b = np.array((4, 5, 6))

dist = np.linalg.norm(a-b)

print(dist)

*Output:*

5.196152422706632

**Solution 2: Using Scipy Library**

The scipy library has many functions for mathematical and scientific calculation. The distance.euclidean() function returns the Euclidean Distance between two points.

from scipy.spatial import distance

a = (1, 2, 3)

b = (4, 5, 6)

print(distance.euclidean(a, b))

*Output:*

5.196152422706632

**Solution 3: Using math module**

The math module can also be used as an alternative. The dist() function from this module (available in Python 3.8 and later) returns the Euclidean distance between two points.

from math import dist

a = (1, 2, 3)

b = (4, 5, 6)

print(dist(a, b))

*Output:*

5.196152422706632

The **scipy** and **math** module methods work when the coordinates are in the form of a tuple or a list, and they may be a faster alternative to the numpy method for a single pair of small points.
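For reference, the definition itself (square root of the sum of squared coordinate differences) can be written without any library helper. This is only a sketch of the formula; the function name `euclidean` is hypothetical.

```python
from math import sqrt

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Same points as in the solutions above: each coordinate differs by 3,
# so the distance is sqrt(9 + 9 + 9) = sqrt(27).
print(euclidean((1, 2, 3), (4, 5, 6)))
```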


A trigger in MySQL is a set of SQL statements that reside in a system catalog. **It is a special type of stored procedure that is invoked automatically in response to an event**. Each trigger is associated with a table and is activated by a DML statement such as **INSERT, UPDATE**, or **DELETE**.

A trigger is called a special procedure because it cannot be called directly like a stored procedure. The main difference between the trigger and procedure is that a trigger is called automatically when a data modification event is made against a table. In contrast, a stored procedure must be called explicitly.

Generally, **triggers are of two types** according to the SQL standard: row-level triggers and statement-level triggers.

**Row-Level Trigger:** It is a trigger, which is activated for each row by a triggering statement such as insert, update, or delete. For example, if a table has inserted, updated, or deleted multiple rows, the row trigger is fired automatically for each row affected by the insert, update or delete statement.

**Statement-Level Trigger:** It is a trigger, which is fired once for each event that occurs on a table regardless of how many rows are inserted, updated, or deleted. Note that MySQL itself supports only row-level (FOR EACH ROW) triggers.

For creating a new trigger, we need to use the CREATE TRIGGER statement. Its syntax is as follows:

CREATE TRIGGER trigger_name trigger_time trigger_event

ON table_name

FOR EACH ROW

BEGIN

…

END;

**Trigger_name** is the name of the trigger which must be put after the CREATE TRIGGER statement.

**Trigger_time** is the time of trigger activation and it can be BEFORE or AFTER. We must specify the activation time while defining a trigger.

**Trigger_event** can be INSERT, UPDATE, or DELETE. This event causes the trigger to be invoked. A trigger can be invoked by only one event.

**Table_name** is the name of the table. Actually, a trigger is always associated with a specific table.

**BEGIN…END** is the block in which we will define the logic for the trigger.
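To see a row-level trigger fire, here is a runnable sketch that uses Python's built-in sqlite3 module rather than MySQL, so that no server is needed. This is an assumption made for convenience: SQLite's trigger syntax is close to the template above but not identical to MySQL's. The table and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE audit_log (employee_id INTEGER, action TEXT);

-- trigger_name = log_employee_insert, trigger_time = AFTER, trigger_event = INSERT
CREATE TRIGGER log_employee_insert AFTER INSERT
ON employees
FOR EACH ROW
BEGIN
    INSERT INTO audit_log (employee_id, action) VALUES (NEW.id, 'INSERT');
END;
""")

# Inserting three rows fires the trigger three times, once per row.
conn.executemany("INSERT INTO employees (name) VALUES (?)",
                 [("Ann",), ("Bob",), ("Cy",)])
audit_rows = conn.execute(
    "SELECT employee_id, action FROM audit_log ORDER BY employee_id"
).fetchall()
print(audit_rows)
```

The trigger is never called explicitly; it runs as a side effect of the INSERT statements.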

The characteristic of a frequency distribution that ascertains its symmetry about the mean is called skewness. On the other hand, Kurtosis means the relative pointedness of the standard bell curve, defined by the frequency distribution.

Skewness is characteristic of the deviation from the mean, to be greater on one side than the other, i.e. attribute of the distribution having one tail heavier than the other. Skewness is used to indicate the shape of the distribution of data. Conversely, kurtosis is a measure to indicate the flatness or peakedness of the frequency distribution curve and measures the tails or outliers of the distribution.

Skewness is an indicator of lack of symmetry, i.e., both left and right sides of the curve are unequal, with respect to the central point. As against this, kurtosis is a measure of data that is either peaked or flat, with respect to the probability distribution.

Skewness shows how much and in which direction, the values deviate from the mean. In contrast, kurtosis explains how tall and sharp the central peak is.

In a skewed distribution, the curve is extended to either the left or right side. When the plot is extended more towards the right side, it denotes positive skewness, wherein typically mode < median < mean. On the other hand, when the plot is stretched more towards the left, it is called negative skewness, and typically mean < median < mode. Positive kurtosis (measured relative to the normal distribution) represents that the distribution is more peaked than the normal distribution, whereas negative kurtosis shows that the distribution is less peaked than the normal distribution.
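Both measures can be computed from standardized moments. The sketch below uses the population (biased) formulas and reports kurtosis as excess kurtosis, i.e. relative to the normal distribution's value of 3; the sample data are made up.

```python
from statistics import mean, median, pstdev

def skewness(data):
    """Third standardized moment (population formula)."""
    m, s = mean(data), pstdev(data)
    return mean([((x - m) / s) ** 3 for x in data])

def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    m, s = mean(data), pstdev(data)
    return mean([((x - m) / s) ** 4 for x in data]) - 3

right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]   # long tail on the right
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed), median(right_skewed), mean(right_skewed))
print(skewness(symmetric), excess_kurtosis(symmetric))
```

For the right-skewed sample the skewness is positive and the mean lies above the median, as described above.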


A recommendation algorithm for type-ahead search may sound like a job for an RNN (recurrent neural network), which is not easy to set up at all. But we can try building recommendations with a much simpler approach, a simple prefix-matching algorithm, and then expand on it until we have something on par with an RNN.

We will use a lookup in a database table, the prefix table. The prefix table takes an input string, which is your prefix, and outputs the suggested strings or suffixes. Example: what does the prefix “hello” map to, and which suffixes does the model suggest for it?

Scoping is very important here, for example fuzzy matching and context matching, or handling a user who is typing in a different language. If you input “big”, that could output any number of suffixes, for example: big shot, big sky, the big year, etc.

In the existing search corpus of billions of searches, what proportion of the time do people who type “the big” actually click on “the big shot”, and what proportion of the time do they click on “the big sky”? You can build a simple table that maps every search prefix that has ever been typed on Netflix to the most common title that users clicked on. Boom! That’s your prefix-matching recommendation algorithm for type-ahead search.
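A minimal sketch of such a prefix table, assuming the search corpus is available as (typed prefix, clicked title) pairs. The click log, the titles in it, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

def build_prefix_table(click_log):
    """Map each typed prefix to a count of the titles clicked after it."""
    table = defaultdict(Counter)
    for typed, clicked in click_log:
        table[typed.lower()][clicked] += 1
    return table

def suggest(table, prefix, k=1):
    """Return the k titles most often clicked after this prefix."""
    counts = table.get(prefix.lower(), Counter())
    return [title for title, _ in counts.most_common(k)]

# Hypothetical click log: (what the user had typed, what they clicked).
click_log = (
    [("the big", "The Big Shot")] * 5
    + [("the big", "The Big Sky")] * 3
    + [("the big", "The Big Year")] * 1
)
table = build_prefix_table(click_log)
print(suggest(table, "The Big"))
print(suggest(table, "the big", k=2))
```

The proportions mentioned above are simply these counts divided by the total number of clicks recorded for the prefix.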

Context matching is also important here: the input becomes a string plus a user profile with a number of features, and the output is still a string. You can cluster user profiles with K-means, for example into “John Stamos fan” and “not a John Stamos fan”. If you are a John Stamos fan and you type “the big”, Netflix recommends “the big shot” every time, and something else otherwise. The user profile can also be reduced to the right dimensionality.

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behaviour. It can be considered the thoughtful process of determining what is normal and what is not. *Anomalies* are also referred to as outliers, novelties, noise, exceptions and deviations. Simply put, anomaly detection is the task of defining a boundary around normal data points so that they can be distinguished from outliers.

Anomalies can be broadly categorized as:

**Point anomalies:** A single instance of data is anomalous if it’s too far off from the rest. **Business use case:** Detecting credit card fraud based on “amount spent.”

**Contextual anomalies:** The abnormality is context specific. This type of anomaly is common in time-series data. **Business use case:** Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.

**Collective anomalies:** A set of data instances collectively helps in detecting anomalies. **Business use case:** Someone is trying to copy data from a remote machine to a local host unexpectedly, an anomaly that would be flagged as a potential cyber-attack.

The different types of methods for anomaly detection are as follows:

**Simple Statistical Methods**

The simplest approach to identifying irregularities in data is to flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles. For example, a data point can be flagged as anomalous when it deviates from the mean by more than a certain number of standard deviations. However, computing the mean over time-series data isn’t exactly trivial, as it’s not static. Thus, you need a rolling window to compute the average across the data points; it is intended to smooth short-term fluctuations and highlight long-term ones.
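One way to sketch the rolling-window idea is to compare each point against the mean and standard deviation of the preceding window. The window size, the threshold of 3 standard deviations, and the readings below are illustrative assumptions.

```python
from statistics import mean, pstdev

def rolling_anomalies(series, window=5, threshold=3.0):
    """Indices of points that deviate from the mean of the preceding
    `window` points by more than `threshold` standard deviations."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

readings = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10, 12, 11]
anomalies = rolling_anomalies(readings)
print(anomalies)
```

One side effect worth noting: once a spike enters the window it inflates the standard deviation, so the points right after it are compared against a much wider band.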

**Machine Learning-Based Approaches for Anomaly Detection:**

**(a) Clustering-Based Anomaly Detection:**

The approach focuses on unsupervised learning: similar data points tend to belong to similar groups or clusters, as determined by their distance from local centroids.

The k-means algorithm can be used, which partitions the dataset into a given number of clusters. Any data points that lie far from the centroids of these clusters are considered anomalies.

**(b) Density-based anomaly detection:**

This approach is based on the K-nearest neighbors algorithm. The assumption is that normal data points occur in dense neighborhoods, while abnormalities lie far away from them. To measure how near a data point is to its neighbors, you can use Euclidean distance or a similar measure according to the type of data you have.
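A simple way to illustrate the density idea is to score every point by its average Euclidean distance to its k nearest neighbours: points in dense regions get small scores, isolated points get large ones. This is a brute-force sketch with made-up 2-D points and a hypothetical function name, and it relies on `math.dist` (Python 3.8+).

```python
from math import dist

def knn_scores(points, k=2):
    """Average distance from each point to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores

points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_scores(points)
outlier_index = max(range(len(points)), key=scores.__getitem__)
print(outlier_index, [round(s, 2) for s in scores])
```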

**(c) Support Vector Machine-Based Anomaly Detection:**

A support vector machine is another effective technique for detecting anomalies. One-Class SVMs have been devised for cases in which one class only is known, and the problem is to identify anything outside this class.

This is known as novelty detection, and it refers to automatic identification of unforeseen or abnormal phenomena, i.e. outliers, embedded in a large amount of normal data.

Anomaly detection helps to monitor any data source, including user logs, devices, networks, and servers. This helps in rapidly identifying zero-day attacks as well as unknown security threats.


Let’s understand t-test and Z-test first,

The Z-test is a univariate hypothesis test which ascertains whether the averages of 2 datasets are different from each other when the standard deviation or variance is given. Key assumptions made: 1) All data points are independent. 2) Z follows a normal distribution with mean 0 and variance 1.

The t-test is a parametric test that is applied to identify how the averages of 2 sets of data differ from each other when the standard deviation or variance is not given. Key assumptions made: 1) All data points are independent. 2) Sample values are recorded and taken accurately.

The t-test is based on the student’s t-distribution. On the contrary, the z-test depends upon the assumption that the distribution of sample means will be normal.

So, what are the conditions for conducting these tests?

One of the essential conditions for conducting a t-test is that population standard deviation or the variance is unknown. Conversely, the population variance should be assumed to be known or be known in the case of a z-test.

The Z-test is used when the sample size is large (n > 30), and the t-test is appropriate when the sample size is small (n < 30).

Let’s take a following dataset as example,

1 3 5 5 6 7

If I want to calculate the central tendency of this dataset then I have 3 choices to do so,

**Mean:** (1+3+5+5+6+7)/6 = 27/6 = 4.5

**Median:** (5+5)/2 = 5

**Mode:** Since 5 has shown up two times the mode would be 5 here.

From the above observation, we can clearly state that most of the data lies around 5. In the above example, there are no outliers in the data. Now let’s take an example where we have an outlier condition in the data.

Before that, let’s understand what an outlier is. Outliers are values that lie far outside the other values. They can lie at the higher or the lower end, depending on the data.

Dataset: 1 3 5 5 6 27

If you clearly observe the number 27 is an outlier because it lies outside the current values. Now let’s calculate the central tendency of the dataset.

**Mean:** (1+3+5+5+6+27)/6 = 47/6 = 7.83

**Median:** (5+5)/2 = 5

**Mode:** Since 5 has shown up two times the mode would be 5 here.

See, what happened to mean? I have just introduced outliers in the data and the mean gets affected heavily. Now look at Median and Mode, they remain the same.
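The same comparison can be reproduced with Python's standard `statistics` module, using the two datasets from the example above:

```python
from statistics import mean, median, mode

clean = [1, 3, 5, 5, 6, 7]
with_outlier = [1, 3, 5, 5, 6, 27]

for label, data in (("no outlier", clean), ("with outlier", with_outlier)):
    # The mean shifts when 7 is replaced by 27; median and mode do not.
    print(label, round(mean(data), 2), median(data), mode(data))
```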

**To conclude:**

Median is the preferred measure of central tendency when: 1) There are a few extreme scores in the distribution of the data. 2) There are some missing or undetermined values in your data. 3) There is an open-ended distribution (for example, if you have a data field which measures number of children and your options are 0, 1, 2, 3, 4, 5 or “6 or more”, then the “6 or more” field is open ended and makes calculating the mean impossible, since we do not know exact values for this field).

The mode is the only measure you can use for nominal or categorical data that can’t be ordered.


Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments:

One used to learn or train a model and the other used to validate the model. In typical cross-validation, the training and validation sets must cross-over in successive rounds such that each data point has a chance of being validated against. The basic form of cross-validation is k-fold cross-validation.

Other forms of cross-validation are special cases of *k*-fold cross-validation or involve repeated rounds of *k*-fold cross-validation.

In *k*-fold cross-validation, the data is first partitioned into *k* equally (or nearly equally) sized segments or folds. Subsequently *k* iterations of training and validation are performed such that within each iteration a different fold of the data is held-out for validation while the remaining *k* − 1 folds are used for learning.
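The partitioning step can be sketched in plain Python. This version assumes the data are already shuffled and simply yields index lists; the function name is hypothetical, and in practice a library splitter would normally be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample (nearly equal sizes).
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for training, validation in folds:
    print("train:", training, "validate:", validation)
```

Every sample appears in exactly one validation fold and in the training set of the other k − 1 iterations.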

**Why do we use cross-validation?**

It allows us to get more metrics and draw important conclusions about our algorithm and our data.

Helps to tune the hyper parameters of a given machine learning algorithm, to get good performance according to some suitable metric.

It mitigates overfitting while building a pipeline of models, such that the second model’s input will be real predictions on data that our first model has never seen before.

K-fold cross validation also significantly reduces bias as we are using most of the data for fitting, and also significantly reduces variance as most of the data is also being used in the validation set.

Let’s start with understanding correlation,

The correlation between two variables can be measured with a correlation coefficient, which ranges from -1 to 1. If the value is 0, there is no linear relationship between the two variables (although they are not necessarily independent). If the value is extremely close to -1 or 1, it indicates a strong linear relationship: the variables are highly correlated, and a change in one variable is associated with a significant change in the other.

How to test Multicollinearity?

Multicollinearity is a condition when there is a significant dependency or association between the independent variables or the predictor variables. A significant correlation between the independent variables is often the first evidence of presence of multicollinearity.

Correlation matrix / Correlation plot: A correlation plot can be used to identify the correlation or bivariate relationship between two independent variables.

Variance Inflation Factor (VIF): VIF is used to identify the correlation of one independent variable with a group of other variables.

Consider that we have 9 independent variables (V1 to V9). To calculate the VIF of variable V1, we isolate V1 and treat it as the target variable, and all the other variables (V2 to V9) are treated as the predictor variables.

We train a regression model using all the other predictor variables and find the corresponding R² value.

Using this R² value, we compute the VIF value as:

VIF = 1 / (1 − R²)

It is always desirable to have a VIF value as small as possible. A threshold is also set, which means that any independent variable with a VIF greater than the threshold will have to be removed.
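The procedure can be sketched with NumPy: regress one variable on the others by ordinary least squares, take R², and apply VIF = 1 / (1 − R²). The data are synthetic (three variables instead of nine, with V2 deliberately built to be nearly collinear with V1), and the function name `vif` is hypothetical.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on all other columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
v1 = rng.normal(size=200)
v2 = v1 + rng.normal(scale=0.1, size=200)  # nearly a copy of v1
v3 = rng.normal(size=200)                  # unrelated to the others
X = np.column_stack([v1, v2, v3])

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

Here V1 and V2 get large VIF values, while V3 stays close to 1.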


Linear regression is used to predict the continuous dependent variable using a given set of independent variables. Logistic Regression is used to predict the categorical dependent variable using a given set of independent variables.

Linear Regression is used for solving Regression Problems. Logistic Regression is used for mainly Classification problems.

Linear regression is used to estimate the dependent variable in case of a change in independent variables. For example, predict the price of houses. Whereas logistic regression is used to calculate the probability of an event. For example, classify if tissue is benign or malignant.

In Linear regression, we predict the value of continuous variables. In logistic Regression, we predict the values of categorical variables.

In linear regression, we find the best fit line, by which we can easily predict the output. In Logistic Regression, we find the S-curve by which we can classify the samples.

In linear regression, the least squares method is used to estimate the coefficients. In logistic regression, the maximum likelihood estimation method is used to estimate the coefficients.

The output for Linear Regression must be a continuous value, such as price, age, etc. The output of Logistic Regression must be a Categorical value such as 0 or 1, Yes or No, etc.

Linear regression assumes a normal (Gaussian) distribution of the dependent variable given the predictors, i.e. normally distributed errors. Logistic regression assumes a binomial distribution of the dependent variable.
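The difference in output can be sketched with a single feature. The weights below are arbitrary illustrative values, not fitted coefficients; the point is only that the linear model returns an unbounded number, while the logistic model passes the same linear score through the sigmoid (the S-curve) to obtain a probability.

```python
from math import exp

def linear_prediction(x, w, b):
    """Linear regression output: any real number."""
    return w * x + b

def logistic_probability(x, w, b):
    """Logistic regression output: sigmoid of the linear score, in (0, 1)."""
    return 1 / (1 + exp(-(w * x + b)))

w, b = 2.0, -3.0  # hypothetical coefficients
for x in (-2.0, 0.0, 1.5, 4.0):
    prob = logistic_probability(x, w, b)
    label = 1 if prob >= 0.5 else 0   # threshold the probability to classify
    print(x, linear_prediction(x, w, b), round(prob, 3), label)
```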

Oversampling and under sampling are 2 important techniques used in machine learning – classification problems in order to reduce the class imbalance thereby increasing the accuracy of the model.

Classification is nothing but predicting the category of a data point to which it may probably belong by learning about past characteristics of similar instances. When the segregation of classes is not approximately equal then it can be termed as a “**Class imbalance**” problem. To solve this scenario in our data set, we use oversampling and under sampling.

Oversampling is used when the amount of data collected is insufficient. A popular over-sampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by randomly sampling the characteristics from occurrences in the minority class.

Conversely, if a class of data is the overrepresented majority class, under sampling may be used to balance it with the minority class. Under sampling is used when the amount of collected data is sufficient. Common methods of under sampling include cluster centroids and Tomek links, both of which target potential overlapping characteristics within the collected data sets to reduce the amount of majority data.

Ex: Let’s say that in a bank the majority of the customers are from a specific race and very few customers are from other races. If the model is trained with this data, it is most likely that the model will reject loans for the minority race.

So, what should we do about it?

In oversampling, we increase the number of records belonging to the “minority race” category by duplicating its records, so that the difference between the numbers of records belonging to the two classes narrows.

In undersampling, we reduce the number of records belonging to the “majority race”. The records for deletion are selected strictly through a random process and are not influenced by any constraints or bias.

To conclude, oversampling is often preferable, as undersampling can result in the loss of important data. Undersampling is suggested when the amount of data collected is larger than ideal, and it can help data mining tools stay within the limits of what they can effectively process.
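
The two ideas can be sketched with plain random duplication and random deletion. This is a minimal sketch, not SMOTE or Tomek links; the function names and the toy records are hypothetical.

```python
import random
from collections import Counter

def _group_by_label(records, label_of):
    groups = {}
    for record in records:
        groups.setdefault(label_of(record), []).append(record)
    return groups

def random_oversample(records, label_of, rng):
    """Duplicate randomly chosen minority records until every class matches the largest one."""
    groups = _group_by_label(records, label_of)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

def random_undersample(records, label_of, rng):
    """Randomly keep only as many records per class as the smallest class has."""
    groups = _group_by_label(records, label_of)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced

# Hypothetical imbalanced data: 8 majority records, 2 minority records.
records = [("majority", i) for i in range(8)] + [("minority", i) for i in range(2)]
rng = random.Random(0)
over = random_oversample(records, lambda r: r[0], rng)
under = random_undersample(records, lambda r: r[0], rng)
print(Counter(r[0] for r in over), Counter(r[0] for r in under))
```

Oversampling keeps every original record, while undersampling discards six of the eight majority records, which is the data loss mentioned above.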

**SELECT * FROM employees;**

**Fig1:**

The following MySQL statement finds the maximum salary in each department. It uses the GROUP BY clause with the SELECT query to list each department along with its highest salary.

**SELECT department, MAX(salary)**

** FROM employees**

** GROUP BY department**

** ORDER BY MAX(salary) DESC;**

This query gives the output as follows:

Fig2:
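
To try the query without a MySQL server, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in (this standard GROUP BY query works the same way in SQLite). The table columns and the sample rows are hypothetical, since the original figures are not available.

```python
import sqlite3

# In-memory database with made-up employee rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Engineering", 120000),
        ("Ben", "Engineering", 95000),
        ("Chen", "Sales", 70000),
        ("Dee", "Sales", 82000),
        ("Eli", "HR", 60000),
    ],
)

# Same query as in the text: highest salary per department, largest first.
rows = conn.execute(
    "SELECT department, MAX(salary) "
    "FROM employees "
    "GROUP BY department "
    "ORDER BY MAX(salary) DESC"
).fetchall()
print(rows)
```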

**SVM** algorithms use a group of mathematical functions that are known as kernels. The function of a kernel is to take data as input and transform it into the desired form.

Kernel functions return the scalar product between two points in a suitable feature space. They thereby define a notion of similarity at little computational cost, even in very high-dimensional spaces.

Different SVM algorithms use different kinds of kernel functions, for instance linear and nonlinear ones such as polynomial, radial basis function (RBF), and sigmoid. A few of them are as follows:

**Linear Kernel :**

It is the most basic type of kernel, usually one dimensional in nature. It proves to be the best function when there are lots of features. The linear kernel is mostly preferred for text-classification problems as most of these kinds of classification problems can be linearly separated.

Linear kernel functions are faster than other functions.

Linear Kernel Formula

F(x, xj) = sum( x.xj)

Here, x, xj represents the data you’re trying to classify.

**Polynomial Kernel :**

It is a more generalized representation of the linear kernel. It is often less preferred than other kernel functions, as it tends to be less efficient and less accurate in practice.

Polynomial Kernel Formula

F(x, xj) = (x.xj+1)^d

Here ‘**.**’ shows the dot product of both the values, and d denotes the degree.

F(x, xj) is the kernel value for the pair of points; the SVM uses it to build the decision boundary that separates the given classes.

**Gaussian Radial Basis Function (RBF) :**

It is one of the most preferred and used kernel functions in SVM. It is usually chosen for non-linear data. It helps to make proper separation when there is no prior knowledge of data.

Gaussian Radial Basis Formula

F(x, xj) = exp(-gamma * ||x – xj||^2)

Gamma must be positive; values between 0 and 1 are common. Gamma is a hyperparameter, so you provide it in the code or tune it (for example, by cross-validation). A commonly used starting value is 0.1.
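
The three formulas can be written directly in Python. This is a minimal sketch on plain lists; the default values `d=2` and `gamma=0.1` are illustrative assumptions, not recommendations.

```python
import math

def dot(x, xj):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, xj))

def linear_kernel(x, xj):
    # F(x, xj) = sum(x.xj)
    return dot(x, xj)

def polynomial_kernel(x, xj, d=2):
    # F(x, xj) = (x.xj + 1)^d
    return (dot(x, xj) + 1) ** d

def rbf_kernel(x, xj, gamma=0.1):
    # F(x, xj) = exp(-gamma * ||x - xj||^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, xj))
    return math.exp(-gamma * squared_distance)

x, xj = [1, 2], [3, 4]
print(linear_kernel(x, xj), polynomial_kernel(x, xj), rbf_kernel(x, xj))
```

Note that the RBF kernel of a point with itself is exactly 1 and decays toward 0 as the points move apart, which is why it acts as a similarity measure.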

**Bagging** and **Boosting** are two types of **Ensemble Learning**, which helps to improve machine learning results by combining several models. This approach allows better predictive performance than a single model.

So, let’s understand the difference between Bagging and Boosting.

**Bagging (Bootstrap aggregation)**: Homogeneous weak learners are trained independently of each other, in parallel, and their outputs are combined (for example, by averaging or voting) to determine the final prediction.

**Boosting**: It also uses homogeneous weak learners, but here the learners are trained sequentially and adaptively, each one trying to improve on the predictions of the previous ones.

If the classifier is unstable (high variance), then we need to apply bagging. If the classifier is steady and straightforward (high bias), then we need to apply boosting.

In bagging, different training data subsets are randomly drawn with replacement from the entire training dataset. In boosting, every new subset contains the elements that were misclassified by previous models.

Bagging is the simplest way of combining predictions that belong to the same type. Boosting is the way of combining predictions that belong to the different types.

Each model is built independently for bagging. While in the case of boosting, new models are influenced by the performance of previously built models.

Bagging attempts to tackle the overfitting issue. Boosting tries to reduce bias.

Example: The Random Forest model uses bagging. AdaBoost uses boosting.
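
To make the bagging side concrete, here is a minimal sketch: each weak learner is a one-feature threshold ("stump") trained on a bootstrap sample drawn with replacement, and the ensemble predicts by majority vote. The stump, the toy data, and the assumption that class 1 has the larger feature values are all hypothetical simplifications; this is not how Random Forest is implemented.

```python
import random
from collections import Counter

def train_stump(sample):
    """Weak learner: a threshold on x placed midway between the two class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if xs0 and xs1:
        return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    # Degenerate bootstrap sample containing one class only: fall back to the mean of x.
    return sum(x for x, _ in sample) / len(sample)

def bagging_fit(data, n_models, rng):
    """Train each stump independently on its own bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the independent learners by majority vote."""
    votes = [1 if x >= threshold else 0 for threshold in models]
    return Counter(votes).most_common(1)[0][0]

# Toy data: x in 0..4 is class 0, x in 5..9 is class 1.
data = [(x, 0) for x in range(5)] + [(x, 1) for x in range(5, 10)]
models = bagging_fit(data, n_models=25, rng=random.Random(0))
print(bagging_predict(models, -1), bagging_predict(models, 10))
```

A boosting version would differ in the training loop: instead of independent samples, each new learner would be fitted with more weight on the examples the previous learners got wrong.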

**Feature scaling** is one of the most important data pre-processing steps in machine learning. Algorithms that compute the distance between the features are biased towards numerically larger values if the data is not scaled.

Normalization and standardization are two common types of feature scaling.

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

**X_new = (X – X_min)/(X_max – X_min)**

Here, X_max and X_min are the maximum and the minimum values of the feature respectively.

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

**X_new = (X – mean)/Std**

Now, let’s understand the difference between the two:

Normalization is used when features are of different scales. Standardization is used when we want to ensure zero mean and unit standard deviation.

Normalization squishes n-dimensional data into an n-dimensional unit hypercube. Standardization, on the other hand, translates the data so that the mean vector of the original data sits at the origin, and then squishes or expands it along each axis to unit standard deviation.

Normalization is useful when we don’t know about the distribution and scales values between [0,1] or [-1,1]. Standardization is used when the feature distribution is normal or Gaussian and is not bounded to a certain range.
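
Both formulas are short enough to sketch with the standard library. This is a minimal illustration that assumes the feature is not constant (otherwise the denominators would be zero) and uses the population standard deviation; the sample values are made up.

```python
import statistics

def min_max_normalize(values):
    """X_new = (X - X_min) / (X_max - X_min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """X_new = (X - mean) / Std, giving zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

feature = [10, 20, 30, 40, 50]
normalized = min_max_normalize(feature)
standardized = standardize(feature)
print(normalized)
print(standardized)
```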

An analytic function computes values over a group of rows and returns a single result for *each* row. This is different from an aggregate function, which returns a single result for *a group* of rows.

With analytic functions you can compute moving averages, rank items, calculate cumulative sums, and perform other analyses.

**RANK** gives you the ranking within your ordered partition. Ties are assigned the same rank, with the next ranking(s) skipped.

**DENSE_RANK** again gives you the ranking within your ordered partition, but the ranks are consecutive. No ranks are skipped if there are ranks with multiple items.

Example (ordered by salary):

Because RANK skips the rankings that follow a tie, DENSE_RANK is generally preferred when consecutive ranks are needed.
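
The difference between the two functions can be reproduced in plain Python. This is a minimal sketch of the ranking logic over one partition, ordered by salary descending; the function name and salary values are hypothetical.

```python
def rank_and_dense_rank(salaries):
    """Return (salary, rank, dense_rank) tuples, ordered by salary descending."""
    ordered = sorted(salaries, reverse=True)
    results = []
    rank = dense = 0
    previous = None
    for position, salary in enumerate(ordered, start=1):
        if salary != previous:
            rank = position   # RANK jumps to the row position, skipping after ties
            dense += 1        # DENSE_RANK only moves to the next consecutive number
            previous = salary
        results.append((salary, rank, dense))
    return results

for row in rank_and_dense_rank([4000, 5000, 3000, 4000]):
    print(row)
```

The two rows tied at 4000 share rank 2 in both schemes; the next row gets rank 4 under RANK but 3 under DENSE_RANK.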

An activation function is an important feature of an artificial neural network which basically decides whether neurons should be activated or not.

All the inputs Xi are multiplied by the weights Wi assigned to each link and summed together along with the bias b. Let Y = sum(Wi*Xi) + b.

The value of Y ranges from -inf to +inf. To build sense into our network, we add an activation function (f), which decides whether the information passed is useful or not; based on the result, the neuron gets fired.

The properties that activation function should hold are:

Derivative or differential: Change in y-axis w.r.t change in x-axis. It is also known as slope.

Monotonic function: A function which is either entirely non-increasing or non-decreasing.

The types of activation function are as follows:

1)Linear Function: For this function, no matter how many layers are present in the neural network, the final output is still a linear function of the input, so the whole network behaves like a single layer.

2)Binary-Step function: Binary step function is a threshold-based activation function which means after a certain threshold neuron is activated and below the said threshold neuron is deactivated.

3)Non-linear function: Modern neural network models use non-linear activation functions. They allow the model to create complex mappings between the network’s inputs and outputs, such as images, video, audio, and data sets that are non-linear or have high dimensionality.
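
The weighted sum and the three families of activation function can be sketched as follows. Sigmoid and ReLU are used here as two common non-linear examples; they are my choice for illustration and are not named in the text above, and the inputs, weights, and bias are made up.

```python
import math

def linear(y):
    return y

def binary_step(y, threshold=0.0):
    """Activated (1) at or above the threshold, deactivated (0) below it."""
    return 1 if y >= threshold else 0

def sigmoid(y):
    """Non-linear: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def relu(y):
    """Non-linear: passes positive values, zeroes out negative ones."""
    return max(0.0, y)

def neuron(inputs, weights, bias, activation):
    """Y = sum(Wi * Xi) + b, then apply the activation function f."""
    y = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(y)

inputs, weights, bias = [1.0, 2.0], [0.5, -1.0], 0.5   # Y = 0.5 - 2.0 + 0.5 = -1.0
for f in (linear, binary_step, sigmoid, relu):
    print(f.__name__, neuron(inputs, weights, bias, f))
```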

Here it means, 20% probability = 20/100 = 1/5

Probability of Seeing a Star in 15 minutes = 1/5

Probability of not seeing a Star in 15 minutes = 1 – 1/5 = 4/5

Probability that you see at least one shooting star in the period of an hour

= 1 – Probability of not seeing any Star in 60 minutes

= 1 – Probability of not seeing any Star in 15 * 4 minutes

= 1 – (4/5)⁴

= 1 – 0.4096

= 0.5904

**So, the probability of seeing at least one shooting star in a period of an hour is 0.5904.**
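
The same calculation in Python, as a quick check. It relies on the assumption already implicit in the answer: the four 15-minute intervals are independent of each other.

```python
p_seen_in_15_min = 0.20
intervals_per_hour = 60 // 15  # four 15-minute intervals

# P(at least one) = 1 - P(none in any of the four intervals)
p_none = (1 - p_seen_in_15_min) ** intervals_per_hour
p_at_least_one = 1 - p_none
print(round(p_at_least_one, 4))
```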

The statistical power of a study (sometimes called sensitivity) is how likely the study is to distinguish an actual effect from one of chance.

It’s the likelihood that the test correctly rejects the null hypothesis when a real effect exists. For example, a study with 80% power has an 80% chance of producing significant results if the effect is real.

A high statistical power means that the test results are likely valid. As the power increases, the probability of making a Type II error decreases.

A low statistical power means that the test results are questionable.

Statistical power helps you to determine if your sample size is large enough.

It is possible to perform a hypothesis test without calculating the statistical power. If your sample size is too small, your results may be inconclusive when they may have been conclusive if you had a large enough sample.

Learn Data Science and Get hands-on practical learning experience.

Facts

India’s population in a year – 1.3 billion

Population breakup – Rural – 70% and Urban – 30%

Assumptions

Every year India’s population would grow steadily, but the growth won’t be very fast-paced.

Every man and woman would eventually be married (homogeneously or heterogeneously). They won’t die prematurely or prefer not to marry. People would be married only once.

In rural areas the age of marriage (in average) is between 15 – 35 year range. Similarly, in urban areas = 20 – 35 years. India is a young country, and 15 – 35 year range has around 50% of the total population.

Rural estimation

Rural population = 70% * 1.3 bill = 910 mill, rounded to 900 mill

Population within marriage age in a year = 50% * 900 mill = 450 mill

Number of marriages to happen = 450 / 2 = 225 mill marriages

These people will marry within a 20 year time period according to our assumptions.

Number of rural marriages in a year = 225 mill / 20 = 11.25 mill marriages

Urban estimation

Urban population = 30% * 1.3 bill = 390 mill, rounded to 400 mill

Population within marriage age in a year = 50% * 400 mill = 200 mill

Number of marriages to happen = 200 / 2 = 100 mill marriages

These people will marry within a 15 year time period according to our assumptions.

Number of urban marriages in a year = 100 mill / 15 = approximately 6.7 mill marriages

Note and caveats

Many people die prematurely in accidents and never marry. In addition, some people choose not to marry as a matter of personal preference. So, our number is over-estimated. If we normalize it by introducing an error percentage, the final number will be lower by around 10%-15%.

Answer = 11.25 mill + 6.7 mill = approximately 17.9 mill marriages before adjustment, which comes to roughly 15 to 16 million marriages a year in India after the 10%-15% reduction.
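
The arithmetic of the estimate can be checked with a short script. It uses the rounded populations from the steps above (900 million rural, 400 million urban); the function name is hypothetical, and the figures are before the 10%-15% downward adjustment discussed in the caveats.

```python
def marriages_per_year(population, share_in_marriage_age, window_years):
    """People in the marriage-age band, paired up, spread evenly over the window."""
    people_marrying = population * share_in_marriage_age
    marriages = people_marrying / 2          # two people per marriage
    return marriages / window_years

rural = marriages_per_year(900e6, 0.5, window_years=20)   # ages 15-35
urban = marriages_per_year(400e6, 0.5, window_years=15)   # ages 20-35
total_before_adjustment = rural + urban

print(f"rural: {rural / 1e6:.2f} million")
print(f"urban: {urban / 1e6:.2f} million")
print(f"total before adjustment: {total_before_adjustment / 1e6:.2f} million")
```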

**Bayes’ theorem**, also known as **Bayes’ rule** or **Bayes’ law**, is a theorem in statistics that describes the probability of one event or condition as it relates to another known event or condition.

Mathematically, it’s expressed as the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of a condition.

Let’s understand this theorem with an example:

For instance, say a flu test comes back positive for 60% of the people who actually have the flu, but it also comes back positive for 50% of the people who do not have the flu, and the overall population only has a 5% chance of having the flu. Would you actually have a 60% chance of having the flu after having a positive test?

Bayes’ Theorem says no. It says that your chance of having the flu is:

(0.6 * 0.05) / ((0.6 * 0.05) + (0.5 * 0.95)) = 0.0594, or 5.94%

Here 0.6 * 0.05 is the True Positive Rate of a Condition Sample and 0.5 * 0.95 is the False Positive Rate of a Population.
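
The flu calculation can be reproduced with a small helper. This is a minimal sketch; the function and parameter names are my own labels for the three numbers used in the example (0.05, 0.6, and 0.5).

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_positives = true_positive_rate * prior            # 0.6 * 0.05
    false_positives = false_positive_rate * (1 - prior)    # 0.5 * 0.95
    return true_positives / (true_positives + false_positives)

p_flu_given_positive = posterior(prior=0.05, true_positive_rate=0.6, false_positive_rate=0.5)
print(round(p_flu_given_positive, 4))
```

The low prior (5%) dominates: even after a positive result, the probability of having the flu stays close to 6%.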

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other.

Naive Bayes uses a similar method **to predict the probability of different classes based on various attributes**. This algorithm is mostly used in text classification and with problems having multiple classes.

When we calculate the error for only a single data point, we use the term **loss function.**

Whereas, when we calculate the aggregate error over multiple data points, we use the term **cost function**.

In other words, the loss function is to capture the difference between the **actual** and **predicted** values for a single record whereas cost functions aggregate the difference for the entire **training dataset**.

The most commonly used loss functions are **Mean-squared error** and **Hinge loss**.

**Mean-Squared Error(MSE)**: In simple words, it measures how far the model’s predicted values are from the actual values, on average.

**MSE = (1/n) × Σ (predicted value – actual value)^2**

**Hinge loss**: It is used to train machine learning classifiers, and is defined as

**L(y) = max(0, 1 – t·y)**

Where t = -1 or 1 indicates the two classes and y represents the raw output of the classifier. (Outside machine learning, “cost function” can also mean the accounting equation y = mx + b, which represents total cost as the sum of fixed and variable costs; that sense is unrelated to the one used here.)

There are many cost functions in machine learning and each has its use cases depending on whether it is a **regression** problem or **classification **problem.

**Regression cost function:**

Regression models are used to forecast a continuous variable, such as an employee’s pay, the cost of a car, the likelihood of obtaining a loan, and so on. They are determined as follows depending on the distance-based error:

** Error = y-y’**

Where y is the actual output (target) and y’ is the predicted output.
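
The loss/cost distinction can be sketched in code: a loss is computed for one record, and a cost aggregates the losses over the whole dataset. In this minimal sketch, `label` is the true class (-1 or 1) and `score` is the classifier's raw output; the function names are hypothetical.

```python
def squared_error(actual, predicted):
    """Loss for a single record."""
    return (predicted - actual) ** 2

def mse_cost(actuals, predictions):
    """Cost: mean of the squared-error losses over the whole dataset."""
    losses = [squared_error(a, p) for a, p in zip(actuals, predictions)]
    return sum(losses) / len(losses)

def hinge_loss(label, score):
    """max(0, 1 - label * score): zero only for confident, correct predictions."""
    return max(0.0, 1 - label * score)

print(mse_cost([3, 5], [2, 7]))   # (1 + 4) / 2
print(hinge_loss(1, 2.0))         # correct and confident
print(hinge_loss(-1, 0.5))        # wrong side of the boundary
```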

There are a number of possible variables that can cause such a discrepancy that I would check to see:

The demographics of iOS and Android users might differ significantly. For example, according to Hootsuite, 43% of females use Instagram as opposed to 31% of men. If the proportion of female users for iOS is significantly larger than for Android then this can explain the discrepancy (or at least a part of it). This can also be said for age, race, ethnicity, location, etc…

Behavioural factors can also have an impact on the discrepancy. If iOS users use their phones more heavily than Android users, it’s more likely that they’ll indulge in Instagram and other apps than someone who spent significantly less time on their phones.

Another possible factor to consider is how Google Play and the App Store differ. For example, if Android users have significantly more apps (and social media apps) to choose from, that may cause greater dilution of users.

Lastly, any differences in the user experience can deter Android users from using Instagram compared to iOS users. If the app is more buggy for Android users than iOS users, they’ll be less likely to be active on the app.

Model Evaluation is a very important part of any analysis. It answers the following questions:

How well does the model fit the data? Which predictors are most important? Are the predictions accurate?

So, the following are the criteria to assess the model performance:

**Akaike Information Criteria (AIC)**: In simple terms, AIC estimates the relative amount of information lost by a given model. The less information lost, the higher the quality of the model. Therefore, among candidate models we prefer the one with the minimum AIC.

**Receiver operating characteristics (ROC curve)**: The ROC curve illustrates the diagnostic ability of a binary classifier. It is created by plotting the true positive rate against the false positive rate at various threshold settings. The performance metric of the ROC curve is AUC (area under the curve). The higher the area under the curve, the better the predictive power of the model.

**Confusion Matrix**: In order to find out how well the model does in predicting the target variable, we use a confusion matrix/ classification rate. It is nothing but a tabular representation of actual vs. predicted values, which helps us to find the accuracy of the model.
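
As a small illustration of the confusion matrix, the sketch below counts the four cells for a binary classifier and derives accuracy plus the two rates that form one point on the ROC curve. The labels and predictions are made up, and the function names are hypothetical.

```python
def confusion_matrix(actual, predicted):
    """Count the four cells of a binary confusion matrix (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
    }

def summary_rates(cm):
    total = cm["tp"] + cm["fp"] + cm["fn"] + cm["tn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total,
        "true_positive_rate": cm["tp"] / (cm["tp"] + cm["fn"]),
        "false_positive_rate": cm["fp"] / (cm["fp"] + cm["tn"]),
    }

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
cm = confusion_matrix(actual, predicted)
print(cm)
print(summary_rates(cm))
```

Repeating the rate calculation at several decision thresholds and plotting the (false positive rate, true positive rate) pairs would trace out the ROC curve.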

Deductive reasoning is the form of valid reasoning, to deduce new information or conclusion from known related facts and information. Inductive reasoning arrives at a conclusion by the process of generalization using specific facts or data.

Deductive reasoning uses a top-down approach, whereas inductive reasoning uses a bottom-up approach.

In deductive reasoning, the conclusions are certain, whereas, in Inductive reasoning, the conclusions are probabilistic.

Deductive arguments can be valid or invalid: in a valid argument, if the premises are true, the conclusion must be true. An inductive argument can be strong or weak, which means the conclusion may be false even if the premises are true.

Usage of deductive reasoning is difficult, as we need facts which must be true. Usage of inductive reasoning is fast and easy, as we need evidence instead of true facts.

Selection bias is the bias introduced when individuals, groups, or data are selected for analysis in a way that does not achieve proper randomization. It means that the sample obtained is not representative of the population intended to be analysed; it is sometimes referred to as the selection effect. It is a distortion of a statistical analysis that results from the method of collecting samples. If you don’t take selection bias into account, then some conclusions of the study may not be accurate.

The types of selection bias include:

**Sampling bias**: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.

**Time interval**: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.

**Data**: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.

**Attrition**: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.

There are 36 (6*6) outcomes for tossing two fair dice, and the outcomes in which the two dice sum to 8 are:

(2, 6), (3,5), (4,4), (5,3), (6,2);

The probability that two dice sum to 8 is 5/36.

For the second part, we are calculating a conditional probability. Let event A be the event that the sum of the dice is 8, and event B the event that the first die shows 3. We know that event B’s outcomes are:

(3,1), (3,2), (3,3), (3,4), (3,5), (3,6)

and only (3,5) makes event A happen, thus the probability is 1/6.

We can also solve this using Bayes Theorem and conditional probability:

P(A|B) = P(A intersection B) / P(B)

The difference between P(A intersection B) and P(A|B) is that:

P(A intersection B) is 1/36: out of 36 outcomes, only (3,5) satisfies both event A and event B;

P(A|B) is 1/6: out of the 6 outcomes in event B, (3,1), (3,2), (3,3), (3,4), (3,5), (3,6), only (3,5) sums to 8. It can also be calculated as (1/36) / (1/6) = 1/6.
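
Both answers can be verified by enumerating all 36 outcomes. This is a brute-force check rather than a derivation; the variable names are my own.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))          # all 36 (first, second) pairs
event_a = [o for o in outcomes if sum(o) == 8]           # sum of the dice is 8
event_b = [o for o in outcomes if o[0] == 3]             # first die shows 3
a_and_b = [o for o in event_b if sum(o) == 8]            # both events happen

p_a = Fraction(len(event_a), len(outcomes))
p_a_and_b = Fraction(len(a_and_b), len(outcomes))
p_b = Fraction(len(event_b), len(outcomes))
p_a_given_b = p_a_and_b / p_b                            # P(A|B) = P(A intersection B) / P(B)

print(p_a, p_a_and_b, p_a_given_b)
```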

**Reinforcement learning** (**RL**) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

In a typical reinforcement learning scheme, the agent observes a *state S* and chooses an *action A* based on the *policy P*. The environment then feeds back the *reward R* of the action A and switches to the *next state S’*. The process keeps looping until it reaches the *state DONE*. The ultimate goal of reinforcement learning is to maximize the *total reward*, i.e., **the long-term sum of rewards**.

In chess, for example, at each turn the agent observes the chess board (*state S*) and chooses a move (*action A*) based on its learned chess-playing algorithm (*policy P*). Then the game (environment) feeds back the result of the move (maybe just a piece changing position, or additionally capturing a piece of the rival, etc.), which corresponds to the *reward R*, a pre-defined value (usually positive for “good”, negative for “bad”, zero for “neutral” or “we don’t know if it’s good or bad”, although defining a reward function is hard). Then the game goes on and the move results in a new chess board state (*state S’*).

With reinforcement learning, we don’t have to deal with this problem, as the learning agent learns by playing the game. It will make a move (decision), check if it’s the right move (feedback), and keep the outcomes in memory for the next step it takes (learning). There is a reward for every correct decision the system takes and a punishment for every wrong one.

**Pruning** is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical and redundant to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.

Pruning processes can be divided into two types (pre- and post-pruning).

**Pre-pruning** procedures prevent a complete induction of the training set by adding a stopping criterion to the induction algorithm (for example, a maximum tree depth). Pre-pruning methods are considered to be more efficient because they do not induce an entire set; rather, trees remain small from the start.

**Post-pruning** is the most common way of simplifying trees. Here, nodes and subtrees are replaced with leaves to reduce complexity.

The procedures are differentiated on the basis of their approach in the tree (top-down or bottom-up).

**Top-down fashion**: It will traverse nodes and trim subtrees starting at the root.

**Bottom-up fashion**: It will begin at the leaf nodes.

There is a popular pruning algorithm called **reduced error pruning**, in which, starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, the change is kept.

A forecast refers to a calculation or an estimation which uses data from previous events, combined with recent trends, to come up with a future event outcome.

On the other hand, a prediction is an actual act of indicating that something will happen in the future with or without prior information.

Accuracy: A forecast is generally more accurate than a prediction. This is because forecasts are derived by analysing a set of past data together with present trends.

On the other hand, a prediction can be right or wrong. For example, if you predict the outcome of a football match, the result depends on how well the teams played no matter their recent performance or players.

Bias: Forecasting uses mathematical formulas and as a result, they are free from personal as well as intuition bias.

On the other hand, predictions are in most cases subjective and fatalistic in nature.

Quantification: When using a model to do a forecast, it’s possible to come up with the exact quantity. For example, the World Bank uses economic trends, and the previous GDP values and other inputs to come up with a percentage value for a country’s economic growth.

However, when doing prediction, since there is no data for processing, one can only say whether the economy of a given country will grow or not.

Application: Forecasts are mostly applicable in fields such as economics and meteorology, where there is a lot of information about the subject matter.

On the contrary, prediction can be applied anywhere as long as there is an expected future outcome.

So, what is regularization?

Regularization is any technique that aims to improve the validation score, sometimes at the cost of reducing the training score.

Some regularization techniques:

**L1** tries to minimize the absolute value of the parameters of the model. It produces sparse parameters.

**L2** tries to minimize the square value of the parameters of the model. It produces parameters with small values.

**Dropout** is a technique applied to neural networks that randomly sets some of the neurons’ outputs to zero during training. This forces the network to learn better representations of the data by preventing complex interactions between the neurons: Each neuron needs to learn useful features.

**Early stopping** will stop training when the validation score stops improving, even when the training score may be improving. This prevents overfitting on the training dataset.
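
Three of these techniques are easy to sketch in a few lines. In this minimal illustration, `lam` is a hypothetical regularization strength, and `patience` (how many non-improving epochs to tolerate) is an assumption of the sketch rather than something stated above; a higher validation score is taken to be better.

```python
def l1_penalty(weights, lam):
    """L1: lam * sum of absolute parameter values (encourages sparse parameters)."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2: lam * sum of squared parameter values (encourages small parameters)."""
    return lam * sum(w * w for w in weights)

def early_stopping_epoch(validation_scores, patience):
    """Return the epoch with the best validation score, stopping once the score
    has failed to improve for `patience` consecutive epochs."""
    best_score, best_epoch = float("-inf"), -1
    for epoch, score in enumerate(validation_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

weights = [1.0, -2.0, 0.0]
print(l1_penalty(weights, lam=0.5), l2_penalty(weights, lam=0.5))
print(early_stopping_epoch([0.60, 0.70, 0.75, 0.74, 0.73, 0.80], patience=2))
```

In the last call, training stops at epoch 4 and the model from epoch 2 is kept, so the later score of 0.80 is never seen. That is the usual trade-off when choosing the patience value.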

Let’s call the 5-gallon bucket A and the 3-gallon bucket B:

Fill B completely and empty it into A.

Now, fill B completely again and pour it into A until A is full.

Since A already holds 3 gallons, it can accommodate only 2 gallons more. Thus, 1 gallon is left in B.

You now empty A and pour the 1 gallon from B into A.

Fill B completely and pour all of it into A, which now has 1 + 3 = 4 gallons as required.
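
The steps above can be simulated to confirm that they end with exactly 4 gallons in A. The `pour` helper is a hypothetical function written for this sketch.

```python
def pour(source, target, target_capacity):
    """Pour from source into target until source is empty or target is full.
    Returns the new (source, target) amounts."""
    moved = min(source, target_capacity - target)
    return source - moved, target + moved

A_CAPACITY, B_CAPACITY = 5, 3
a, b = 0, 0

b = B_CAPACITY                      # fill B
b, a = pour(b, a, A_CAPACITY)       # A = 3, B = 0
b = B_CAPACITY                      # fill B again
b, a = pour(b, a, A_CAPACITY)       # A = 5 (full), B = 1
a = 0                               # empty A
b, a = pour(b, a, A_CAPACITY)       # A = 1, B = 0
b = B_CAPACITY                      # fill B
b, a = pour(b, a, A_CAPACITY)       # A = 4, B = 0

print(a, b)
```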

The data is split into three different categories while creating a model:

Training set: We use the training set for building the model and adjusting the model’s variables. But we cannot rely on the correctness of a model built on top of the training set alone. The model might give incorrect outputs when fed new inputs.

Validation set: We use a validation set to look into the model’s response on top of the samples that don’t exist in the training dataset. Then, we will tune hyperparameters on the basis of the estimated benchmark of the validation data.

When we are evaluating the model’s response using the validation set, we are indirectly training the model with the validation set. This may lead to the overfitting of the model to specific data. So, this model won’t be strong enough to give the desired response to the real-world data.

Test set: The test dataset is the subset of the actual dataset, which is not yet used to train the model. The model is unaware of this dataset. So, by using the test dataset, we can compute the response of the created model on hidden data. We evaluate the model’s performance on the basis of the test dataset.

The model is always exposed to the test dataset after tuning the hyperparameters on top of the validation set.

As we know, the evaluation of the model on the basis of the validation set would not be enough. Thus, we use a test set for computing the efficiency of the model.
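
A three-way split can be sketched with the standard library alone. The 60/20/20 proportions and the function name are assumptions made for illustration; the text above does not prescribe specific ratios.

```python
import random

def train_validation_test_split(records, validation_fraction, test_fraction, rng):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test

records = list(range(100))
train, validation, test = train_validation_test_split(
    records, validation_fraction=0.2, test_fraction=0.2, rng=random.Random(0)
)
print(len(train), len(validation), len(test))
```

The test set is set aside first and touched only once, after hyperparameters have been tuned on the validation set.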

Cross-validation is a technique for dividing data between training and validation sets. In typical cross-validation this split is done randomly. But in stratified cross-validation, the split preserves the ratio of the categories in both the training and validation datasets.

For example, if we have a dataset with 10% of category A and 90% of category B, and we use stratified cross-validation, we will have the same proportions in training and validation. In contrast, if we use simple cross-validation, in the worst case we may find that there are no samples of category A in the validation set.

Stratified cross-validation may be applied in the following scenarios:

On a dataset with multiple categories. The smaller the dataset and the more imbalanced the categories, the more important it will be to use stratified cross-validation.

On a dataset with data of different distributions. For example, in a dataset for autonomous driving, we may have images taken during the day and at night. If we do not ensure that both types are present in training and validation, we will have generalization problems.
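
The stratification idea can be sketched as a single hold-out split (a full k-fold version would repeat the same per-category logic for each fold). The function name and the 20% validation fraction are assumptions; the 10%/90% category mix follows the example above.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, validation_fraction, rng):
    """Split indices so that each category keeps the same share in both sets."""
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    train, validation = [], []
    for indices in indices_by_label.values():
        rng.shuffle(indices)
        n_validation = round(len(indices) * validation_fraction)
        validation.extend(indices[:n_validation])
        train.extend(indices[n_validation:])
    return train, validation

labels = ["A"] * 10 + ["B"] * 90
train, validation = stratified_split(labels, validation_fraction=0.2, rng=random.Random(0))
print(Counter(labels[i] for i in train))
print(Counter(labels[i] for i in validation))
```

Both sets keep the 10%/90% ratio, so the rare category A can never be missing from the validation set.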

There are several ways to make a model more robust to outliers, from different points of view (data preparation or model building). In this answer, an outlier is assumed to be an unwanted, unexpected, or must-be-wrong value given human knowledge so far (e.g. no one is 200 years old), rather than an event that is possible but rare.

Outliers are usually defined in relation to the distribution. Thus outliers can be removed in the pre-processing step (before any learning step) using threshold levels. For normally distributed data, standard deviations can be used (Mean +/- 2*SD). For non-normal or unknown distributions, the interquartile range Q3 – Q1 can be used, where Q1 is the “middle” value in the first half of the rank-ordered data set and Q3 is the “middle” value in the second half.

Moreover, data transformation (e.g. log transformation) may help if the data have a noticeable tail. When outliers are related to the sensitivity of the collecting instrument, which may not precisely record small values, Winsorization may be useful. This type of transformation has the same effect as clipping signals (i.e. it replaces extreme data values with less extreme values). Another option to reduce the influence of outliers is to use mean absolute error rather than mean squared error.

For model building, some models are resistant to outliers (e.g. tree-based approaches), as are non-parametric tests. Similar to the median effect, tree models divide each node into two at each split. Thus, at each split, all data points in a bucket are treated equally regardless of the extreme values they may have.
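
The interquartile-range threshold and the clipping idea can be sketched with the `statistics` module. The 1.5 multiplier is a widely used convention that I am assuming here (it is not stated above), and clipping to fixed bounds is a simplified stand-in for Winsorization; the ages are made-up data.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Lower and upper thresholds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip_to_bounds(values, lower, upper):
    """Replace extreme values with the nearest threshold instead of dropping them."""
    return [min(max(v, lower), upper) for v in values]

ages = [22, 25, 27, 30, 31, 34, 35, 38, 41, 200]
lower, upper = iqr_bounds(ages)
outliers = [v for v in ages if v < lower or v > upper]
clipped = clip_to_bounds(ages, lower, upper)

print(outliers)
print(clipped)
```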

In supervised learning, we train a model to learn the relationship between input data and output data. We need to have labelled data to be able to do supervised learning.

With unsupervised learning, we only have unlabelled data. The model learns a representation of the data. Unsupervised learning is frequently used to initialize the parameters of the model when we have a lot of unlabelled data and a small fraction of labelled data. We first train an unsupervised model and, after that, we use the weights of the model to train a supervised model.

In reinforcement learning, the model has some input data and a reward depending on the output of the model. The model learns a policy that maximizes the reward. Reinforcement learning has been applied successfully to strategic games such as Go and even classic Atari video games.

Let’s assume that we’re trying to predict the renewal rate for Netflix subscriptions. The problem statement is then to predict which users will renew their subscription plan for the next month.

Next, we must understand the data that is needed to solve this problem. In this case, we need to check the number of hours the channel is active for each household, the number of adults in the household, number of kids, which channels are streamed the most, how much time is spent on each channel, how much has the watch rate varied from last month, etc. Such data is needed to predict whether or not a person will continue the subscription for the upcoming month.

After collecting this data, it is important to find patterns and correlations. For example, we might find that households with kids are more likely to subscribe. Similarly, by studying the watch rate of the previous month, you can predict whether a person is still interested in a subscription. Such trends must be studied.

The next step is analysis. For this kind of problem statement, you must use a classification algorithm that classifies customers into 2 groups:

- Customers who are likely to subscribe next month
- Customers who are not likely to subscribe next month

Would you build predictive models? Yes, to achieve this you must build a predictive model that classifies customers into the 2 classes mentioned above.

Which algorithms to choose? You can choose classification algorithms such as Logistic Regression, Random Forest, Support Vector Machine, etc.

Once you’ve chosen an algorithm, you must perform model evaluation to measure how well it performs. This is followed by deployment.

Algorithms need features with specific characteristics to work properly. The data is initially in raw form, so you need to extract features from it before supplying it to the algorithm. This process is called feature engineering. With relevant features, a simpler model is often sufficient, and even a non-ideal algorithm can produce reasonably accurate results.

Feature engineering primarily has two goals:

- Prepare an input data set that is compatible with the constraints of the machine learning algorithm.
- Enhance the performance of machine learning models.

Some of the techniques used for feature engineering include imputation, binning, outlier handling, log transform, grouping operations, one-hot encoding, feature split, scaling, and extracting data.

The backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function are then considered to be a solution to the learning problem.

Training with backpropagation follows these steps:

1. Calculate the error: how far is the model output from the actual output?
2. Check the error: has it been minimized or not?
3. Update the parameters: if the error is large, update the parameters (weights and biases), then check the error again.
4. Repeat the process until the error reaches a minimum.
5. The model is ready to make predictions: once the error is at a minimum, you can feed inputs to the model and it will produce the output.
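The loop described above can be illustrated on the smallest possible case. The following is a sketch of gradient descent (the delta rule) for a single linear unit, not full backpropagation through several layers; the function name, learning rate, and toy data (generated from y = 2x + 1) are assumptions made for the example.

```python
def train_linear_unit(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Calculate the error of the current predictions.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Update the parameters in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so w should approach 2 and b should approach 1.
w, b = train_linear_unit([1, 2, 3, 4], [3, 5, 7, 9])
print(round(w, 3), round(b, 3))
```

In a multi-layer network the same update is applied to every weight, with the gradients obtained by propagating the error backwards through the layers.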

Suppose you are being tested for a disease that affects 1 in 1000 people. If you have the illness, the test will always say you have it. However, if you don’t have the illness, 5% of the time the test will wrongly say you have it, and 95% of the time it will correctly report that you don’t.

Thus there is a 5% error rate when you do not have the illness.

Out of 1000 people, the 1 person who has the disease will get a true positive result.

Out of the remaining 999 people, 5% will get a false positive result.

That is close to 50 people with a false positive result for the disease.

This means that out of 1000 people, about 51 will test positive even though only one person has the illness. So there is only about a 2% probability (1/51) that you have the disease even if your report says that you do.
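The same result follows from Bayes’ theorem. A minimal sketch, assuming (as in the example) that 1 person in 1000 has the disease, that the test never misses a true case, and that it wrongly flags 5% of healthy people; the variable names are illustrative.

```python
prevalence = 1 / 1000        # P(disease)
sensitivity = 1.0            # P(positive | disease)
false_positive_rate = 0.05   # P(positive | no disease)

# Total probability of a positive result, from sick and healthy people combined.
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' theorem: P(disease | positive).
posterior = sensitivity * prevalence / p_positive
print(f"{posterior:.4f}")
```

The posterior comes out just under 2%, matching the 1-in-51 count above.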

The main goal of an ads selection component is to narrow down the set of ads that are relevant for a given query. In a search-based system, the ads selection component is responsible for retrieving the top relevant ads from the ads database according to the user and query context.

In a feed-based system, the ads selection component will select the top k relevant ads based more on user interests than search terms.

Here is a general solution to this question. Say we use a funnel-based approach for modelling. It would make sense to structure the ad selection process in these three phases:

**Phase 1**: Quick selection of ads for the given query and user context according to selection criteria

**Phase 2**: Rank these selected ads based on a simple and fast algorithm to trim ads.

**Phase 3**: Apply the machine learning model on the trimmed ads to select the top ones.

Imagine the three zebras at the corners of an equilateral triangle. Each has two options for the direction to run in along the outline, toward either adjacent corner. Assuming each picks a direction at random, let’s compute the probability that they fail to collide.

There are really only two ways this can happen: the zebras either all run clockwise or all run counter-clockwise.

Let’s calculate the probability of each. The probability that every zebra goes clockwise is the product of the probabilities of each zebra choosing the clockwise direction. Given there are two choices (counter-clockwise or clockwise), that is 1/2 * 1/2 * 1/2 = 1/8.

The probability of every zebra going counter-clockwise is the same, 1/8. Summing the two, the probability that the zebras do not collide is 1/4 or 25%, so the probability of a collision is 3/4 or 75%.
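The counting argument can be checked by enumerating every combination of directions. A minimal sketch, assuming three zebras (one per corner) that each pick a direction independently and with equal probability:

```python
from fractions import Fraction
from itertools import product

directions = ["clockwise", "counter-clockwise"]

# All 2 * 2 * 2 = 8 equally likely combinations of choices.
outcomes = list(product(directions, repeat=3))

# No collision happens only when all three zebras run the same way.
no_collision = [o for o in outcomes if len(set(o)) == 1]

p_no_collision = Fraction(len(no_collision), len(outcomes))
p_collision = 1 - p_no_collision
print(p_no_collision, p_collision)
```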

The error of a model can come from bias, variance, or both. Very low bias combined with high variance indicates overfitting and an overly complex model. Averaging several models can reduce the variance, usually at the cost of a small increase in bias.

- a) A bagging algorithm can handle high variance. The dataset is randomly subsampled m times and a model is trained on each subsample. The models are then combined by averaging their predictions.
- b) With the k-nearest neighbour algorithm, the trade-off is controlled through k. Increasing k increases the number of neighbours that contribute to the prediction, which in turn increases the bias of the model and reduces its variance.
- c) With the support vector machine algorithm, the trade-off is controlled through the C parameter. When C is defined as the number of margin violations allowed in the training data, increasing it increases the bias but decreases the variance. Note that some libraries, such as scikit-learn, define C as the penalty on violations instead, so there the effect is reversed: a larger C lowers the bias and raises the variance.

When we use one-hot encoding, there is an increase in the dimensionality of a dataset.

The reason for the increase in dimensionality is that, for every class in the categorical variables, it forms a different variable.

Example: Suppose there is a variable ‘Color’ with three levels: Yellow, Purple, and Orange. One-hot encoding ‘Color’ will create three different variables: Color.Yellow, Color.Purple, and Color.Orange.

In label encoding, each level of the variable is replaced by an integer code (0, 1, 2, ...) in the same single column, so no new variables are created. Because these codes imply an order, label encoding is best suited to binary or ordinal variables.

This is the reason that one-hot encoding increases the dimensionality of the data and label encoding does not.
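The difference in dimensionality is easy to see on the ‘Color’ example. This is a plain-Python sketch rather than a library call; the sample rows and the alphabetical assignment of label codes are assumptions made for illustration.

```python
colors = ["Yellow", "Purple", "Orange", "Purple"]
levels = sorted(set(colors))  # ['Orange', 'Purple', 'Yellow']

# Label encoding: one integer per level, still a single column.
label_codes = {level: code for code, level in enumerate(levels)}
label_encoded = [label_codes[c] for c in colors]

# One-hot encoding: one 0/1 column per level, so three columns here.
one_hot_encoded = [
    {f"Color.{level}": int(c == level) for level in levels} for c in colors
]

print(label_encoded)
print(one_hot_encoded[0])
```

Each one-hot row has exactly one column set to 1, while the label-encoded data keeps its original width.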

The Gini index and node entropy help a binary classification tree make decisions. Basically, the tree algorithm looks for the feature that splits the data into the most homogeneous (pure) child nodes.

According to the Gini index, if we pick two objects at random from a group, they should be of the same class; for a completely pure group the probability of this event is 1.

To compute the Gini index, we do the following:

1. Compute Gini for each sub-node with the formula: the sum of the squares of the probabilities of success and failure (p^2 + q^2).
2. Compute Gini for the split as the weighted Gini score of every node in the split.

Entropy is the degree of impurity (disorder) in a node, given by:

$$\text{Entropy} = -a \log_2 a - b \log_2 b$$

where *a* and *b* are the probabilities of success and failure in the node.

When Entropy = 0, the node is homogeneous.

Entropy is highest when both groups are present at 50–50 percent in the node.

Finally, the feature whose split gives the lowest entropy is the most suitable choice for the root node.
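Both measures can be computed directly for a binary node. A minimal sketch, assuming p is the probability of success in a node and using the standard two-class entropy with base-2 logarithms; `weighted_split_score` and the example node sizes are hypothetical.

```python
import math

def gini_score(p):
    """Sum of squared class probabilities (p^2 + q^2); 1.0 means a pure node."""
    q = 1 - p
    return p ** 2 + q ** 2

def entropy(p):
    """Binary entropy in bits; 0 for a pure node, 1 for a 50-50 node."""
    # Terms with probability 0 are skipped because 0 * log(0) is taken as 0.
    return sum(-x * math.log2(x) for x in (p, 1 - p) if x > 0)

def weighted_split_score(nodes, score):
    """Weight each child node's score by its share of the samples.

    nodes is a list of (size, p) pairs, one per child node.
    """
    total = sum(size for size, _ in nodes)
    return sum(size / total * score(p) for size, p in nodes)

# Hypothetical split: 10 samples with 20% successes, 20 samples with 80% successes.
split = [(10, 0.2), (20, 0.8)]
print(weighted_split_score(split, gini_score))
print(weighted_split_score(split, entropy))
```

With this form of the Gini score a higher value means a purer split, whereas for entropy a lower value is better.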

The probability of Aman getting selected for the interview is 1/8:

P(A) = 1/8

The probability of Mohan getting selected for the interview is 5/12:

P(B) = 5/12

The probability of at least one of them getting selected is the probability of the union of A and B:

P(A ∪ B) = P(A) + P(B) – P(A ∩ B) ………………………(1)

where P(A ∩ B) is the probability of both Aman and Mohan getting selected.

To calculate the final answer, we first have to find P(A ∩ B). Assuming the two selections are independent:

P(A ∩ B) = P(A) * P(B) = 1/8 * 5/12 = 5/96

Now, put the value of P(A ∩ B) into equation (1):

P(A ∪ B) = 1/8 + 5/12 – 5/96 = 12/96 + 40/96 – 5/96 = 47/96

So, the answer is 47/96.
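The arithmetic can be verified with exact fractions. A small sketch, assuming the two selections are independent so that the joint probability is the product of the individual ones:

```python
from fractions import Fraction

p_a = Fraction(1, 8)    # Aman is selected
p_b = Fraction(5, 12)   # Mohan is selected

p_both = p_a * p_b               # independence assumption
p_at_least_one = p_a + p_b - p_both

print(p_both, p_at_least_one)
```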

We can use the following methods:

- Since logistic regression is used to predict probabilities, we can use the AUC-ROC curve along with the confusion matrix to determine its performance.
- The analogue of adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes models for the number of model coefficients. Therefore, we prefer the model with the minimum AIC value.
- Null deviance indicates how well the response is predicted by a model with nothing but an intercept. The lower the value, the better the model. Residual deviance indicates how well the response is predicted by a model once the independent variables are added. Again, the lower the value, the better the model.

Clarify with the interviewers whether the question is about only a single version of the iPhone or all versions put together. Here, we shall assume that all iPhones put together are being talked about.

The first step toward solving this query is segmentation. There are many ways in which India’s population can be segmented. Here, we first assume that only people who have reached working age and are under the age of retirement own an iPhone. Children and senior citizens do not own an iPhone. This removes 20% of the population as children and 20% as senior citizens, leaving 60%.

The next assumption is that only the upper stratum of India’s income range can afford an iPhone. We assume that only 5% of the eligible citizens from the previous filter can afford one.

Not every member of this upper stratum will own an iPhone, since other options such as OnePlus, Samsung, etc., are also available. However, a fair assumption would be that 50% of the eligible population from the previous filter owns an iPhone.

Calculating the proportion of the population that owns an iPhone –

0.6 x 0.05 x 0.5 = 0.015

Total iPhones in India = 0.015 x 130 crore = 1.95 crore
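Writing the estimate as a short script makes it easy to vary the assumptions. All percentages below are the guesses stated above, not measured data, and the variable names are illustrative.

```python
population_crore = 130       # assumed population of India, in crore
working_age_share = 0.6      # after removing 20% children and 20% senior citizens
can_afford_share = 0.05      # upper income stratum
owns_iphone_share = 0.5      # the rest own other brands

proportion = working_age_share * can_afford_share * owns_iphone_share
iphones_crore = population_crore * proportion

print(f"{proportion:.3f} of the population, about {iphones_crore:.2f} crore iPhones")
```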

Most likely outliers will have a negligible effect, because the splits are determined by the sample proportions in each split region and not by the absolute values. However, implementations differ in how they choose split points for continuous variables: some consider all possible split points, others use percentiles. In some poorly chosen schemes (e.g. dividing the range between min and max into equidistant split points), outliers might lead to sub-optimal split points, so such schemes are best avoided, but you shouldn’t encounter them in popular implementations. On the whole, tree-based models are quite robust to outliers.

The main difference between these two data types is the set of operations you can perform on them. Lists are containers for elements that may have different data types, whereas arrays are containers for elements of the same data type.
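The type restriction can be seen with Python’s built-in `array` module. This sketch assumes the standard-library array type; NumPy arrays enforce a single element type in a similar way but are not used here.

```python
from array import array

# A list accepts elements of different types.
mixed = [1, "two", 3.0]

# An array is declared with one type code ('i' = signed integer).
ints = array("i", [1, 2, 3])

try:
    ints.append(2.5)  # a float does not fit an integer array
    rejected = False
except TypeError:
    rejected = True

print(mixed, list(ints), rejected)
```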

A dictionary is faster because it uses a better algorithm for lookups. A dictionary lookup is a hash lookup, while searching a list is an iteration that walks through the list from the beginning, each time, until it finds the result.

With the one-vs-all method, training would take 4 * 10 = 40 seconds in total.

A kernel function in SVM is a method that takes data as input and transforms it into the form required for processing.

**Gaussian Kernel / Radial Basis Function (RBF):** It is used to perform the transformation when there is no prior knowledge about the data, and it uses a radial basis method to improve the transformation.

**Sigmoid Kernel:** This function is equivalent to a two-layer perceptron model of a neural network, where it is used as the activation function for artificial neurons.

**Polynomial Kernel:** It represents the similarity of vectors in the training set in a feature space over polynomials of the original variables.

**Linear Kernel:** Used when the data is linearly separable.
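Each of these kernels is a function that scores the similarity of two vectors. The following sketch writes out their common textbook forms in plain Python; the parameter names and default values (`gamma`, `degree`, `coef0`) are assumptions for illustration rather than the defaults of any particular library.

```python
import math

def linear_kernel(x, y):
    """Plain dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) raised to the given degree."""
    return (linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * squared Euclidean distance); equals 1 for identical points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, y, gamma=0.5, coef0=0.0):
    """tanh(gamma * x . y + coef0)."""
    return math.tanh(gamma * linear_kernel(x, y) + coef0)

u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```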

Imbalanced data sets are a special case of classification problems in which the class distribution is not uniform among the classes.

The approaches to treat missing values are:

- Listwise or case deletion
- Pairwise deletion
- Mean substitution
- Regression imputation
- Maximum likelihood

The key classification metrics are **Accuracy**, **Recall**, **Precision**, and **F1-Score**.

Bagging is a way to decrease the variance of the prediction by generating additional training sets from the dataset, sampling with replacement to produce multisets of the original data. Boosting is an iterative technique that adjusts the weight of an observation based on the last classification.

If we have 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue. This is an example of bagging algorithm.
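The voting step in this example can be sketched with a counter. Only the aggregation is shown; the five predictions are taken from the example above, and training the individual trees is out of scope.

```python
from collections import Counter

# Class predictions from the 5 bagged decision trees for one input sample.
tree_predictions = ["blue", "blue", "red", "blue", "red"]

# The ensemble predicts the most frequent class.
votes = Counter(tree_predictions)
ensemble_prediction, vote_count = votes.most_common(1)[0]

print(ensemble_prediction, vote_count)
```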

AdaBoost (Adaptive Boosting), Gradient Tree Boosting, and XGBoost are examples of boosting algorithms.

Techniques to Handle Imbalanced Data

- Use the right evaluation metrics
- Use K-fold Cross-Validation in the right way
- Ensemble different resampled datasets
- Resample with different ratios
- Cluster the abundant class
- Design your own models

The order of execution is

1. **FROM**
2. **JOIN**
3. **WHERE**
4. **GROUP BY**
5. **HAVING**
6. **SELECT**
7. **ORDER BY**

The special syntax *args in a function definition is used to pass a variable number of arguments to a function. It is used to pass a non-keyworded, variable-length argument list. The syntax is to use the symbol * to take in a variable number of arguments; by convention, it is often used with the word args.
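A short example shows how the extra positional arguments arrive inside the function as a tuple. The function name `total` is hypothetical.

```python
def total(*args):
    """Add up any number of positional arguments."""
    # Inside the function, args is a tuple of everything that was passed.
    return sum(args)

print(total(1, 2, 3))

# An existing list can be unpacked into positional arguments with *.
values = [4, 5, 6]
print(total(*values))
```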

One-Hot Encoding is the most common, correct way to deal with non-ordinal categorical data. It consists of creating an additional feature for each group of the categorical feature and marking each observation as belonging (Value=1) or not belonging (Value=0) to that group.

Interpolation is the process of estimating an unknown value that lies between known data points, whereas extrapolation is the process of estimating unknown values beyond the range of the known data points.

Ways to handle missing values in the dataset:

- Deleting Rows with missing values.
- Impute missing values for continuous variable.
- Impute missing values for categorical variable.
- Other Imputation Methods.
- Using Algorithms that support missing values.
- Prediction of missing values.

Multiple imputation is more advantageous than the single imputation because it uses several complete data sets and provides both the within-imputation and between-imputation variability.

The statistical power of an A/B test refers to the test’s sensitivity to certain magnitudes of effect sizes. More precisely, it is the probability of observing a statistically significant result at level alpha (α) if a true effect of a certain magnitude (MEI) is in fact present.

“Covariance” indicates the direction of the linear relationship between variables. “Correlation” on the other hand measures both the strength and direction of the linear relationship between two variables.
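The distinction shows up when the units of one variable change: covariance scales with the units while correlation does not. A minimal sketch using sample (n - 1) formulas written out by hand; the data is made up for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
x_rescaled = [100 * value for value in x]  # same data in different units

print(covariance(x, y), correlation(x, y))
print(covariance(x_rescaled, y), correlation(x_rescaled, y))
```

Both covariances are positive, which gives the direction, but only the correlation stays the same after rescaling, which is why it can be read as a strength.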

Negative indexing means that the index value of -1 gives the last element, and -2 gives the second-to-last element of an array. Negative indexing starts from where the array ends.

Example: for the list L = [0,2,35,3], L[-1] returns 3 in Python.
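The same list can be used to show a few more negative indices, including a negative slice:

```python
L = [0, 2, 35, 3]

last = L[-1]          # last element
second_last = L[-2]   # second-to-last element
last_two = L[-2:]     # slice containing the final two elements

print(last, second_last, last_two)
```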

A z-test is used for hypothesis testing when the sample size is large, generally n > 30, and the population variance is known. A t-test is used when the sample size is small, usually n < 30, where n denotes the sample size.

Overfitting is a modeling error which occurs when a function is too closely fit to a limited set of data points. Underfitting refers to a model that can neither model the training data nor generalize to new data.

The idea in kNN methods is to identify ‘k’ samples in the dataset that are similar or close in the space. Then we use these ‘k’ samples to estimate the value of the missing data points. Each sample’s missing values are imputed using the mean value of the ‘k’-neighbors found in the dataset.

Techniques to reduce overfitting:

- Increase training data.
- Reduce model complexity.
- Early stopping during the training phase.
- Ridge Regularization and Lasso Regularization.

Removing outliers is legitimate only for specific reasons, since outliers can be very informative about the subject area and the data collection process. If the outlier does not change the results but does affect assumptions, you may drop it. Alternatively, instead of truncating outliers completely, replace them with the nearest “good” data values.

For regression, R-squared or an average error measure can be used. For classification, evaluation metrics can be Precision, Recall, or F1-score.

Confusion matrix. This is an N × N matrix, where N is the number of classes being predicted. It is also called an error matrix, and it plays a central role in evaluating predictions in statistical classification problems.

LDA is a supervised technique that focuses on finding a feature subspace that maximizes the separability between the groups. Principal component analysis (PCA), by contrast, is an unsupervised dimensionality reduction technique that ignores the class labels and focuses on capturing the directions of maximum variation in the data set.

In linear regression, the outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values. In logistic regression, the outcome (dependent variable) has only a limited number of possible values. Logistic regression is used when the response variable is categorical in nature.

How do you avoid local minima? One option is to use stochastic gradient descent.

The main difference between correlation and regression lies in what they measure about the relationship between two variables, say x and y. Correlation measures the degree of the relationship, whereas regression determines how one variable affects another.

It is because squaring the residuals gives an extra penalty for larger errors, and because squared deviations were observed to be more efficient than mean absolute deviation.

Mean Absolute Error (MAE) is preferred when there are many outliers in the dataset, because MAE is robust to outliers, whereas MSE and RMSE are very sensitive to them: squaring the error terms penalizes outliers heavily.
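The effect of a single outlier on each metric can be seen on a toy example. The numbers below are made up for illustration; the last actual value plays the role of the outlier.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 12, 11, 13, 50]      # 50 is an outlier
predicted = [11, 11, 12, 12, 12]

# Every error is 1 except the outlier's error of 38.
print(mae(actual[:4], predicted[:4]), mse(actual[:4], predicted[:4]))
print(mae(actual, predicted), mse(actual, predicted))
```

Adding the outlier raises MAE from 1 to 8.4 but raises MSE from 1 to 289.6, because the single error of 38 is squared.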

Heteroskedasticity refers to situations where the variance of the residuals is unequal over a range of measured values. When running a regression analysis, heteroskedasticity results in an unequal scatter of the residuals (also known as the error term). To check for heteroskedasticity, you need to inspect a plot of the residuals against the fitted values. Typically, the telltale pattern for heteroskedasticity is that as the fitted values increase, the variance of the residuals also increases.

A p-value is the probability of observing a difference at least as large as the one observed if the null hypothesis were true, that is, by random chance alone. The lower the p-value, the greater the statistical significance of the observed difference. The p-value can be used as an alternative to, or in addition to, pre-selected confidence levels for hypothesis testing.

Root cause analysis (RCA) is defined as a collective term that describes a wide range of approaches used to uncover causes of problems. Some RCA approaches are geared more toward identifying true root causes than others, some are more general problem-solving techniques, and others simply offer support for the core activity of root cause analysis.

Regularization is the process which regularizes or shrinks the coefficients towards zero. In simple words, regularization discourages learning a more complex or flexible model, to prevent overfitting.

Density-based spatial clustering of applications with noise (DBSCAN) is a well-known data clustering algorithm that is commonly used in data mining and machine learning. The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster. It is able to find arbitrary shaped clusters and clusters with noise (i.e. outliers).

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class.

The principal components are eigenvectors of the data’s covariance matrix. Thus, the principal components are often computed by eigen decomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis.

Cut the cake through the middle first, then pile one piece on top of the other and cut straight through the middle again, which will leave you with 4 pieces. Finally, stack all 4 pieces on top of one another and cut a third time. This is how, with 3 straight cuts, you can cut a cake into 8 equal pieces.

K-means clustering aims to partition data into k clusters in a way that data points in the same cluster are similar and data points in the different clusters are farther apart. Similarity of two points is determined by the distance between them.

It is a classification problem.

The Gini index, also known as Gini impurity, measures the probability that a randomly selected element would be classified incorrectly if it were labeled at random according to the class distribution in the node.

Entropy is a measure of disorder or uncertainty, and the goal of machine learning models, and of data scientists in general, is to reduce uncertainty. In a decision tree, entropy measures the impurity of a sub-split. For a binary classification problem, entropy lies between 0 and 1.

Random Forest is a classifier that builds a number of decision trees on various subsets of the given dataset and combines their outputs (a majority vote for classification, an average for regression) to improve predictive accuracy. It is based on the ensemble learning technique.

XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

The variance inflation factor (VIF) is used to detect the presence of multicollinearity. It measures how much the variance of an estimated regression coefficient is inflated compared to when the predictor variables are not linearly related.

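The following statements are common misinterpretations of the p-value, and none of them is correct: that the p-value is the probability that the null hypothesis is true; that (1 – the p-value) is the probability that the alternative hypothesis is true; that a low p-value shows that the results are replicable; and that a low p-value shows that the effect is large or that the result is of major theoretical, clinical or practical importance. The p-value is only the probability of observing data at least as extreme as the actual result, assuming the null hypothesis is true.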

A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is actually false in the population.

Keep your model simple. Use regularization technique. Use cross-validation.

You can reduce high variance by reducing the number of features in the model. There are several methods available to check which features don’t add much value to the model and which are important. Increasing the size of the training set can also help the model generalize.

Model parameters: These are the parameters in the model that must be determined using the training data set. These are the fitted parameters.

Hyperparameters: These are adjustable parameters that must be tuned in order to obtain a model with optimal performance.

A simple method to detect multicollinearity in a model is by using something called the variance inflation factor or the VIF for each predicting variable. We can follow these steps in order to build a better model:

- Remove some of the highly correlated independent variables.
- Linearly combine the independent variables, such as adding them together.
- Perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.

It means that the model has mimicked the training patterns so closely that it will overfit and perform poorly on test samples. To avoid this overfitting, use techniques such as a less complex model or cross-validation.

The recall is the ratio of the relevant results returned by the search engine to the total number of the relevant results that could have been returned. Specificity (SP) is calculated as the number of correct negative predictions divided by the total number of negatives. The precision is the proportion of relevant results in the list of all returned search results.
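These definitions translate directly into formulas over the four cells of a binary confusion matrix. The counts below are made up for illustration, and accuracy and F1 are included for comparison.

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 40, 10, 20, 30

recall = tp / (tp + fn)        # relevant results returned / all relevant results
specificity = tn / (tn + fp)   # correct negatives / all actual negatives
precision = tp / (tp + fp)     # relevant results / all returned results
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} specificity={specificity:.3f} precision={precision:.3f}")
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```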

Variance is the average of the squared deviations from the mean, while standard deviation is the square root of this number. Both measures reflect variability in a distribution, but their units differ:

- Standard deviation is expressed in the same units as the original values (e.g., minutes or meters).
- Variance is expressed in squared units (e.g., meters squared).

Both variance and standard deviation describe how the data in a population are spread around the mean, but because standard deviation is in the original units, it gives a clearer picture of how far the data typically deviate from the mean.

While building the decision tree, we would prefer choosing the attribute/feature with the lowest Gini impurity as the root node.

A p-value is the probability of observing a difference at least as large as the one observed if the null hypothesis were true, that is, by random chance alone. The lower the p-value, the greater the statistical significance of the observed difference. The p-value can be used as an alternative to, or in addition to, pre-selected confidence levels for hypothesis testing. A p-value less than 0.05 is conventionally considered statistically significant. It indicates strong evidence against the null hypothesis, as there would be less than a 5% probability of seeing results this extreme if the null hypothesis were true. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.

The concat() function of pandas can be used to concatenate two DataFrames by adding the rows of one to the other.

It helps the training converge fast. It prevents any bias during the training. It prevents the model from learning the order of the training data.

The final score for each class should be independent of the others. Thus we cannot apply softmax activation, because softmax converts the scores into probabilities by taking the other scores into consideration.

The cost parameter decides how much an SVM should be allowed to “bend” with the data. For a low cost, you aim for a smooth decision surface and for a higher cost, you aim to classify more points correctly. It is also simply referred to as the cost of misclassification.

Training accuracy is much higher than validation accuracy, which indicates overfitting. In this case, try regularization, a less complex model, or another method that reduces overfitting.

Accuracy is calculated as the total number of correct predictions (TP + TN) divided by the total number of examples in the dataset (P + N).

eucl_distance = np.linalg.norm(point_a - point_b), where np stands for numpy.

Alex’s height = 164 + 1.30*15 = 183.5 cm.

Gradient boosting is a type of machine learning boosting. It relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. If a small change in the prediction for a case causes no change in error, then next target outcome of the case is zero. Gradient boosting produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter. The leaves are the decisions or the final outcomes. A decision tree is a machine learning algorithm that partitions the data into subsets.

Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. Also known as Artificial Neural Networks, they are the building blocks of Deep Learning, which uses neural networks with many layers.

The bias–variance tradeoff is the property of a model that the variance of the parameter estimated across samples can be reduced by increasing the bias in the estimated parameters.

One main difference between R^{2} and the adjusted R^{2}: R^{2} assumes that every single variable explains the variation in the dependent variable. The adjusted R^{2} tells you the percentage of variation explained by only the independent variables that actually affect the dependent variable.

When it comes to precision we’re talking about the true positives over the true positives plus the false positives. As opposed to recall which is the number of true positives over the true positives and the false negatives.
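
The definitions above can be sketched directly from confusion-matrix counts. The counts below are made up for illustration, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    # Precision: of everything predicted positive, how much was truly positive.
    precision = tp / (tp + fp)
    # Recall: of everything truly positive, how much was found.
    recall = tp / (tp + fn)
    # Accuracy: share of all predictions that were correct.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical counts from a binary classifier.
precision, recall, accuracy = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(precision, recall, accuracy)
```

Here precision is 40 / 50 and recall is 40 / 60, so the same model looks better on one metric than on the other.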

A decision tree combines a series of decisions, whereas a random forest combines several decision trees. Building a random forest is therefore a longer and slower process, and the model needs rigorous training. A single decision tree, by contrast, is fast and operates easily on large data sets, especially linear ones.

K-means clustering uses “centroids”, K different randomly-initiated points in the data, and assigns every data point to the nearest centroid. After every point has been assigned, the centroid is moved to the average of all of the points assigned to it.
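
The assign-then-update loop described above can be sketched in plain Python. This is a one-dimensional toy version; the starting centroids are fixed rather than random (an assumption made so the run is reproducible), and `kmeans_1d` is a hypothetical helper name.

```python
def kmeans_1d(points, centroids, iterations=10):
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # Centroids stopped moving, so the assignment is stable.
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 2.0])
print(centroids, clusters)
```

Even from a poor starting guess, the centroids drift toward the two obvious groups within a few iterations.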

Here are some important considerations while choosing an algorithm:

Size of the training data, Accuracy and/or Interpretability of the output, Speed or Training time, Linearity and number of features.

Unsupervised: Do not use the target variable (e.g. remove redundant variables). Correlation.

Supervised: Use the target variable (e.g. remove irrelevant variables). Wrapper: Search for well-performing subsets of features. RFE.

The intercept (often labeled the constant) is the point where the function crosses the y-axis. In some analyses, the regression model only becomes significant when we remove the intercept, and the regression line reduces to Y = b*X + error. The intercept is the expected mean value of Y when all X = 0. Start with a regression equation with one predictor, X. If X sometimes equals 0, the intercept is simply the expected mean value of Y at that value. If X never equals 0, then the intercept has no intrinsic meaning.

Measuring the performance of Logistic Regression:

1. One can evaluate it by looking at the confusion matrix and count the misclassifications (when using some probability value as the cutoff)

2. One can evaluate it by looking at statistical tests such as the Deviance or individual Z-scores.

It can vary from time to time, depending on how many updates have been made to the algorithm as requirements change.

Resampling methods are used to ensure that the model is good enough and can handle variations in data. The model does that by training it on the variety of patterns found in the dataset.

R^{2} = 1 - (RSS / TSS), where RSS is the residual sum of squares and TSS is the total sum of squares.
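
A minimal sketch of this formula in plain Python, using invented observed and predicted values (the function and variable names are assumptions for illustration):

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    # RSS: squared distance between observations and predictions.
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # TSS: squared distance between observations and their mean.
    tss = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - rss / tss

# Hypothetical observed values and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
r2 = r_squared(y_true, y_pred)
print(r2)
```

In this toy case RSS is 0.5 and TSS is 5.0, so the predictions account for 90% of the variation around the mean.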

A/B testing is a basic randomized control experiment. It is a way to compare the two versions of a variable to find out which performs better in a controlled environment.

No, not always. In some cases the algorithm settles at a local minimum (a local optimum) instead.

A feature vector is a vector containing multiple elements about an object. Putting feature vectors for objects together can make up a feature space.

A p-value higher than 0.05 (> 0.05) is not statistically significant. It means the data do not provide enough evidence against the null hypothesis, which is not the same as evidence for it. We therefore fail to reject the null hypothesis; we do not conclude that the alternative hypothesis is false.

R-squared (R^{2}) is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model. R-squared does not measure goodness of fit. R-squared does not measure predictive error. R-squared does not allow you to compare models using transformed responses. R-squared does not measure how one variable explains another. Some metrics that may be more informative than R^{2} are:

- Mean Squared Error (MSE).
- Root Mean Squared Error (RMSE).
- Mean Absolute Error (MAE)

The curse of dimensionality basically means that the error increases with the increase in the number of features. It refers to the fact that algorithms are harder to design in high dimensions and often have a running time exponential in the dimensions.

It can help identify obvious errors, as well as better understand patterns within the data, detect outliers or anomalous events, find interesting relations among the variables.

A 95% confidence interval, for example, implies that were the estimation process repeated again and again, then 95% of the calculated intervals would be expected to contain the true parameter value.

Simple approaches include filling in the column's average or, if the distribution is heavily skewed, the median or mode. A better approach is to perform regression or nearest-neighbor imputation on the column to predict the missing values. Then continue with your analysis or model.

- INNER JOIN: selects the rows from both tables for which the join condition is satisfied, i.e., the value of the common field is the same in both tables.
- LEFT JOIN: returns all the rows of the table on the left side of the join and the matching rows from the table on the right side. For rows with no match on the right side, the result set contains NULL. LEFT JOIN is also known as LEFT OUTER JOIN.
- RIGHT JOIN: similar to LEFT JOIN, but returns all the rows of the table on the right side of the join and the matching rows from the table on the left side. For rows with no match on the left side, the result set contains NULL. RIGHT JOIN is also known as RIGHT OUTER JOIN.
- FULL JOIN: creates the result set by combining the results of both LEFT JOIN and RIGHT JOIN. The result set contains all the rows from both tables, with NULL values wherever there is no match.
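
A small sketch of the first two join types, run against an in-memory SQLite database from Python. The `customers` and `orders` tables and their rows are invented for illustration. RIGHT and FULL joins are omitted because SQLite only supports them from version 3.39 onward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 'book'), (1, 'pen'), (2, 'lamp');
""")

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.item FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.item
""").fetchall()

# LEFT JOIN: every customer; the item is NULL (None) when there is no order.
left_rows = conn.execute("""
    SELECT c.name, o.item FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.item
""").fetchall()
conn.close()

print(inner_rows)
print(left_rows)
```

The customer without orders disappears from the inner join but survives the left join with a NULL item.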

Variance and covariance are mathematical terms frequently used in statistics and probability theory. Variance refers to the spread of a data set around its mean value, while a covariance refers to the measure of the directional relationship between two random variables.

The 2-sample t-test takes your sample data from two groups and boils it down to the t-value. The process is very similar to the 1-sample t-test, and you can still use the analogy of the signal-to-noise ratio. Unlike the paired t-test, the 2-sample t-test requires independent groups for each sample.

A pivot table is a table of grouped values that aggregates the individual items of a more extensive table within one or more discrete categories.

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application.
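
The split, process, and aggregate flow can be imitated on a single machine. The sketch below is a word count in plain Python; it runs sequentially and only mirrors the shape of a MapReduce job, and the function names and input chunks are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    # Each mapper emits a (word, 1) pair for every word in its chunk.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    # Group all emitted values by key so each reducer sees one word.
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Each reducer sums the counts for its word.
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input already split into chunks, one per "server".
chunks = ["big data big", "data pipelines"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

In a real cluster the map calls would run in parallel on different machines; the consolidated dictionary plays the role of the final output.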

Most important methods for statistical data analysis:

1. Mean.

2. Standard Deviation.

3. Regression.

4. Sample Size Determination.

5. Hypothesis Testing.

Sensitivity analysis is a method for predicting the outcome of a decision if a situation turns out to be different from the key predictions. It helps in assessing the riskiness of a strategy and in identifying how dependent the output is on a particular input value.

High p-values indicate that your evidence is not strong enough to suggest an effect exists in the population. An effect might exist but it’s possible that the effect size is too small, the sample size is too small, or there is too much variability for the hypothesis test to detect it. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

While the mean is the simple average of all the values, the expected value (or expectation) is the probability-weighted average value of a random variable.

Regression is interpolation. Time-series refers to an ordered series of data. Time-series models usually forecast what comes next in the series, much like childhood puzzles where we extrapolate and fill in patterns.

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: the True Positive Rate and the False Positive Rate.
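
As a rough sketch of how the points on such a curve arise, the code below sweeps a few thresholds over invented labels and scores and records the (False Positive Rate, True Positive Rate) pair at each one. The helper name and the data are assumptions for illustration.

```python
def roc_points(labels, scores, thresholds):
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # Predict positive whenever the score reaches the threshold.
        predicted = [s >= t for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical true labels and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores, thresholds=[0.9, 0.5, 0.3, 0.0])
print(points)
```

Lowering the threshold moves the point from the bottom-left corner (nothing flagged) to the top-right corner (everything flagged).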

Random Forest is suitable for situations when we have a large dataset, and interpretability is not a major concern. Decision trees are much easier to interpret and understand. Since a random forest combines multiple decision trees, it becomes more difficult to interpret.

A false positive is where you receive a positive result for a test when you should have received a negative result. Some examples of false positives: a pregnancy test is positive when in fact you aren’t pregnant; a cancer screening test comes back positive, but you don’t have the disease; an innocent party is found guilty.

Ways to minimize Bias in ML:

- Choose the correct learning model. There are two types of learning models, and each has its own pros and cons.
- Use the right training dataset.
- Perform data processing mindfully.
- Monitor real-world performance across the ML lifecycle.
- Make sure that there are no infrastructural issues.

If your data contains outliers, you would typically prefer the median, because otherwise the value of the mean would be dominated by the outliers rather than the typical values. In conclusion, if you are considering the mean, check your data for outliers; if there are any, the median is the better choice.

Ridge and Lasso regression are simple techniques to reduce model complexity and prevent the overfitting that may result from simple linear regression. Ridge regression: the cost function is altered by adding a penalty equivalent to the square of the magnitude of the coefficients. Lasso (least absolute shrinkage and selection operator; also written LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model.

Optimization is the problem of finding a set of inputs to an objective function that results in a maximum or minimum function evaluation. It is the challenging problem that underlies many machine learning algorithms, from fitting logistic regression models to training artificial neural networks.

The standard error of mean tells you how accurate the mean of any given sample from that population is likely to be compared to the true population mean. When the standard error increases, i.e. the means are more spread out, it becomes more likely that any given mean is an inaccurate representation of the true population mean.

Follow these methods:

- Remove some of the highly correlated independent variables.
- Linearly combine the independent variables, such as adding them together.
- Perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.

The two main differences are: How trees are built: random forests builds each tree independently while gradient boosting builds one tree at a time. Combining results: random forests combine results at the end of the process (by averaging or “majority rules”) while gradient boosting combines results along the way.

SQL Server allows us to create our own functions, called user-defined functions. For example, if we want to perform some complex calculations, we can place them in a separate function and store it in the database. Whenever we need the calculation, we can call it.

A hash table is a data structure that is used to store key/value pairs. It uses a hash function to compute an index into an array in which an element will be inserted or searched. Hash tables are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets. The idea of a hash table is to provide direct access to its items. That is why it calculates the “hash code” of the key and uses it to decide where to store the item, instead of using the key itself.
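
The idea can be sketched with a toy hash table in Python. Resolving collisions by chaining (a list per bucket) is one common choice and an assumption here, not something the answer above specifies; the class name is hypothetical.

```python
class HashTable:
    def __init__(self, size=8):
        # One list ("bucket") per array slot to hold colliding keys.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # The hash code of the key decides which slot is used.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # Overwrite an existing key.
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

table = HashTable()
table.put("alice", 30)
table.put("bob", 25)
table.put("alice", 31)
print(table.get("alice"), table.get("bob"), table.get("carol"))
```

Python's built-in `dict` follows the same principle with a far more refined implementation.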

Bias is the error between the actual values and the model’s predicted values. Variance is also an error, but one that comes from the model’s sensitivity to the training data. Prioritizing low bias over low variance tends to produce a model that overfits the data. Prioritizing low variance over low bias tends to produce a model that underfits the data.

Example:

import numpy as np

# initializing points as numpy arrays

point1 = np.array((1, 2, 3))

point2 = np.array((1, 1, 1))

# finding the sum of squared differences

sum_sq = np.sum(np.square(point1 - point2))

# taking the square root and printing the Euclidean distance

print(np.sqrt(sum_sq))

A Box Cox transformation is a transformation of non-normal dependent variables into a normal shape.

Collaborative filtering filters information by using the interactions and data collected by the system from other users. It’s based on the idea that people who agreed in their evaluation of certain items are likely to agree again in the future. Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.

Entropy can be defined as a measure of the impurity of a sub-split. For a two-class problem, entropy always lies between 0 and 1. The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
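
A minimal sketch of both quantities in plain Python, using base-2 logarithms and an invented yes/no dataset (the function names and the split are assumptions for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy in bits: 0 for a pure set, 1 for a 50/50 binary set.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    # Decrease in entropy after splitting `parent` into `children`,
    # weighting each child by its share of the samples.
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical labels before and after splitting on some attribute.
parent = ["yes"] * 4 + ["no"] * 4
split = [["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]]
gain = information_gain(parent, split)
print(entropy(parent), gain)
```

A split that produced two perfectly pure branches would give the maximum possible gain, equal to the parent's entropy.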

Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters to learn and the amount of computation performed in the network. The pooling layer summarizes the features present in a region of the feature map generated by a convolution layer.

Recurrent Neural Networks (RNN) are a class of Artificial Neural Networks that can process a sequence of inputs in deep learning and retain its state while processing the next sequence of inputs. Traditional neural networks will process an input and move onto the next one disregarding its sequence.

Dropout is a technique where randomly selected neurons are ignored during training. Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks.

Central tendency is defined as “the statistical measure that identifies a single value as representative of an entire distribution.”[2] It aims to provide an accurate description of the entire data. It is the single value that is most typical/representative of the collected data.

A chi-square test is a statistical test used to compare observed results with expected results. The purpose of this test is to determine if a difference between observed data and expected data is due to chance, or if it is due to a relationship between the variables you are studying.

A/B testing is an optimisation technique often used to understand how an altered variable affects audience or user engagement. It’s a common method used in marketing, web design, product development, and user experience design to improve campaigns and goal conversion rates.

Removing outliers is legitimate only for specific reasons. Outliers can be very informative about the subject area and the data collection process. If the outlier does not change the results but does affect assumptions, you may drop the outlier. Alternatively, instead of truncating outliers completely, replace them with the nearest “good” data.

Analysis of variance (ANOVA) is a statistical technique that is used to check if the means of two or more groups are significantly different from each other. ANOVA checks the impact of one or more factors by comparing the means of different samples.

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.
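
The splitting step can be sketched in plain Python. The helper below is hypothetical and does not shuffle the data (shuffling is usual in practice); it only shows how k groups of indices yield k train/test pairs.

```python
def k_fold_indices(n_samples, k):
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(n_samples=10, k=3)
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Each sample lands in exactly one test fold, so every observation is used for evaluation once and for training k - 1 times.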

Use the right evaluation metrics.

Use K-fold Cross-Validation in the right way.

Ensemble different resampled datasets.

Resample with different ratios.

Cluster the abundant class.

Design your own models.

It is a mathematical function that maps any real value to a value between 0 and 1, with a curve shaped like the letter “S”.

Y = 1 / (1 + e^{-z})

The Gini index, entropy, and information gain are the metrics the algorithm uses to create a decision tree.

Ensemble methods are machine learning techniques that combine several base models in order to produce one optimal predictive model.

While causation and correlation can exist at the same time, correlation does not imply causation. Causation explicitly applies to cases where action A causes outcome B. Correlation, on the other hand, is simply a relationship. For example, ice cream sales and sunglasses sales are correlated: as ice cream sales increase, so do sunglasses sales, yet neither causes the other. Causation goes a step further than correlation.

The recall is the ratio of the relevant results returned by the search engine to the total number of the relevant results that could have been returned. The precision is the proportion of relevant results in the list of all returned search results. Accuracy is the measurement used to determine which model is best at identifying relationships and patterns between variables in a dataset based on the input, or training, data.

There is a popular method known as elbow method which is used to determine the optimal value of K to perform the K-Means Clustering Algorithm. The basic idea behind this method is that it plots the various values of cost with changing k. As the value of K increases, there will be fewer elements in the cluster.

Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.

A bag of words is a representation of text that describes the occurrence of words within a document. We just keep track of word counts and disregard the grammatical details and the word order. It is called a “bag” of words because any information about the order or structure of words in the document is discarded.
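
A minimal sketch using `collections.Counter` from the standard library. Tokenizing by lowercasing and splitting on whitespace is a simplifying assumption, and the two sentences are invented.

```python
from collections import Counter

def bag_of_words(document):
    # Count word occurrences; order and grammar are deliberately discarded.
    return Counter(document.lower().split())

bow_a = bag_of_words("the cat chased the dog")
bow_b = bag_of_words("the dog chased the cat")
print(bow_a)
print(bow_a == bow_b)
```

The two sentences mean different things but produce identical bags, which is exactly the information this representation throws away.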

The learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0. The learning rate controls how quickly the model is adapted to the problem.

The main intuitive difference between L1 and L2 regularization is that L1 regularization tries to estimate the median of the data, while L2 regularization tries to estimate the mean of the data, to avoid overfitting. Mathematically, the value that minimizes the sum of absolute deviations is the median of the data distribution, and the value that minimizes the sum of squared deviations is the mean.

Some of the Feature selection techniques are: Information Gain, Chi-square test, Correlation Coefficient, Mean Absolute Difference (MAD), Exhaustive selection, Forward selection, Regularization.

The regression has five key assumptions:

Linear relationship.

Multivariate normality.

No or little multicollinearity.

No auto-correlation.

Homoscedasticity.

Random search differs from

Dimensionality reduction is used to shrink the feature space down to a smaller set of principal features.

Support Vector Machine Learning Algorithm performs better in the reduced space. It is beneficial to perform dimensionality reduction before fitting an SVM if the number of features is large when compared to the number of observations.

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds.

The recommender system mainly deals with the likes and dislikes of users. Its major objective is to recommend items that a user is likely to like or need, based on that user's previous purchases. It is like having a personalized team that understands our likes and dislikes and helps us make unbiased decisions about a particular item, drawing on the large amount of data that accumulates in repositories day by day.

The SQL JOIN clause is used to combine records from two or more tables in a database.

Mean squared error (MSE) and mean absolute error (MAE) are used to evaluate the accuracy of a regression model. The squared error is everywhere differentiable, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to the techniques of mathematical optimization.

Techniques to Handle unbalanced Data:

1. Use the right evaluation metrics

2. Use K-fold Cross-Validation in the right way

3. Ensemble different resampled datasets

4. Resample with different ratios

5. Design your own models

Activation functions are mathematical equations that determine the output of a neural network model. It is a non-linear transformation that we do over the input before sending it to the next layer of neurons or finalizing it as output.

Mean Squared Error (MSE) gives a relatively high weight to large errors — therefore, MSE tends to put too much emphasis on large deviations.
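
To see this weighting in numbers, the sketch below compares MSE with mean absolute error (MAE) on two invented sets of predictions that have the same total absolute error. The function names and data are assumptions for illustration.

```python
def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0, 9.0]
small_errors = [4.0, 6.0, 8.0, 10.0]   # four errors of size 1
one_big_error = [3.0, 5.0, 7.0, 13.0]  # a single error of size 4

print(mae(y_true, small_errors), mae(y_true, one_big_error))
print(mse(y_true, small_errors), mse(y_true, one_big_error))
```

MAE treats the two cases as equally bad, while MSE rates the single large miss four times worse.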

A long-tail distribution is a distribution with many occurrences far from the “head” or central part of the distribution. Most of the occurrences in this kind of distribution fall at the low values of the x-axis.

Pickling in python refers to the process of serializing objects into binary streams, while unpickling is the inverse of that. It’s called that because of the pickle module in Python which implements the methods to do this.
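
A short round trip with the `pickle` module shows both directions. The dictionary being serialized is an invented example.

```python
import pickle

# Hypothetical object to serialize.
model_config = {"name": "baseline", "layers": [64, 32], "dropout": 0.5}

# Pickling: object -> bytes.
payload = pickle.dumps(model_config)

# Unpickling: bytes -> a new, equal object.
restored = pickle.loads(payload)

print(type(payload), restored == model_config)
```

Because unpickling can execute arbitrary code, it should only be done on data from a trusted source.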

The *def* keyword is used to define a normal function in Python. Similarly, the *lambda* keyword is used to define an anonymous function in Python. An anonymous function can have any number of arguments but only one expression, which is evaluated and returned.
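
A brief side-by-side sketch; the function names and the word list are invented for illustration.

```python
# A normal function defined with def.
def square(x):
    return x * x

# The equivalent anonymous function: one expression, implicitly returned.
square_lambda = lambda x: x * x

# Lambdas are most often passed inline, for example as a sort key.
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=lambda w: len(w))

print(square(4), square_lambda(4), by_length)
```

Binding a lambda to a name, as in the second definition, is shown only for comparison; a `def` is the usual choice when the function needs a name.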

list1 + list2 gives a new list that combines list1 and list2, so its length is the sum of both lengths, i.e., len(list1) + len(list2).

Specificity (SP) is calculated as the number of correct negative predictions divided by the total number of negatives: SP = TN / (TN + FP).

It depends on what is to be predicted. For example, assume that you are to predict the necessity for chemotherapy in cancer patients. If a person has tested positive for cancerous tumors but doesn’t actually have cancer, it is a false positive case. In this scenario, if chemotherapy proceeds on a non-cancerous person, it may harm the patient and can even induce cancer. Here, reducing false positives is more important than reducing false negatives.

On the other hand, take the case of scanning for wanted criminals at the airport. If a wanted criminal is wrongly flagged as a non-threat, which becomes much easier when a high-security alert is raised during a staff shortage, the person passes through undetected. Here, avoiding false negatives is more crucial.

Lastly, there are some cases where a bias on either side may be equally disastrous. For example, bankers who give out loans determine credibility using credit scores. If the credit scoring produces false negatives, the bank will lose out on good customers. If it produces false positives, the bank will acquire bad customers.

Advantages: removes correlated features, improves algorithm performance, reduces overfitting, improves visualization.

Disadvantages: independent variables become less interpretable, data standardization is a must beforehand, some information is lost.

Follow these techniques:

Use the right evaluation metrics.

Use K-fold Cross-Validation in the right way.

Ensemble different resampled datasets.

Resample with different ratios.

Cluster the abundant class.

Design your own models.

ARIMA, short for ‘AutoRegressive Integrated Moving Average’, is a forecasting algorithm based on the idea that the information in the past values of the time series can alone be used to predict the future values.

Precision is the number of true positives divided by the sum of true positives and false positives. Recall, in contrast, is the number of true positives divided by the sum of true positives and false negatives.

Bias is a type of error that arises from wrong assumptions about the data, such as assuming the data is linear when in reality it follows a more complex function. Variance, on the other hand, is error introduced by high sensitivity to variations in the training data.

In classification, data objects are grouped using examples whose class labels are known, so there is prior knowledge of the attributes that characterize each class. Clustering analyzes data objects without knowing any class labels, so there is no prior knowledge of the attributes that define each cluster.

Techniques to Handle Imbalanced Data

Use the right evaluation metrics

Use K-fold Cross-Validation in the right way

Ensemble different resampled datasets

Resample with different ratios

Cluster the abundant class

Design your own models

A common and effective way to find outliers is to use the interquartile range (IQR). The IQR spans the middle 50% of the data, so outliers can be identified as values lying far outside it. Quartiles divide the sorted data set into four equal parts, so there are three quartiles: the first, second and third, denoted Q1, Q2 and Q3, respectively. Q2 is the median, and IQR = Q3 − Q1.
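The quartile-based idea can be sketched with the standard library. The 1.5 × IQR cutoff used below is the conventional rule of thumb (Tukey’s fences), not something stated in the answer above, and the sample data is made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional choice and is an assumption here.
    """
    q1, _q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [2, 3, 4, 5, 6, 7, 8, 50]
outliers = iqr_outliers(data)
```

With this data only the value 50 falls outside the fences.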

Multicollinearity occurs when two or more independent variables (also known as predictors) in a regression model are highly correlated with one another. This means that one independent variable can be predicted from another independent variable in the model.

To remove multicollinearity, we can do two things.

1. Create new features that combine the correlated variables.

2. Remove one or more of the correlated features from our data.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. It is a grouping of objects based on the similarity and dissimilarity between them, and it is an unsupervised technique.

Types of Clustering Methods:

1. Hierarchical based clustering

2. K-means clustering

3. Density-based clustering: Density-based clustering methods provide a safety valve. Instead of assuming that every point is part of some cluster, we only look at points that are tightly packed and assume everything else is noise. This approach requires two parameters: a radius 𝜖 and a neighborhood density 𝛳. The idea of density-based clustering is to compare the density around an object with the density around its local neighbors.

It is often a better approach because it does not require specifying the number of clusters a priori and is able to identify noisy data while clustering.

This is the case of an imbalanced classification dataset. You can follow these techniques:

Use the right evaluation metrics for the problem statement.

Use K-fold Cross-Validation in the right way.

Ensemble different resampled datasets.

Resample with different ratios.

Cluster the abundant class.

Design your own models.

ML pipelines automate the machine learning workflow by enabling data to be transformed and fed into a model, whose outputs can then be analyzed. Such a pipeline fully automates the process of feeding data into the ML model. The main objective of having a proper pipeline for any ML model is to exercise control over it. A well-organised pipeline also makes the implementation more flexible.

A common and effective way to find outliers is to use the interquartile range (IQR). The IQR spans the middle 50% of the data, so outliers can be identified as values lying far outside it. Quartiles divide the sorted data set into four equal parts, so there are three quartiles: the first, second and third, denoted Q1, Q2 and Q3, respectively. Q2 is the median, and IQR = Q3 − Q1.

They are:

Listwise (case) deletion

Pairwise deletion

Mean substitution

Regression imputation

Maximum likelihood.

Techniques to reduce overfitting:

Increase training data.

Reduce model complexity.

Early stopping during the training phase.

Ridge Regularization and Lasso Regularization.

Random search differs from grid search in that we no longer provide an explicit set of possible values for each hyperparameter; rather, we provide a statistical distribution for each hyperparameter from which values are sampled. Essentially, we define a sampling distribution for each hyperparameter to carry out a randomized search.

kernel: It maps the observations into some feature space. Ideally, the observations are more easily (linearly) separable after this transformation. There are multiple standard kernels for this transformation, e.g. the linear kernel, the polynomial kernel and the radial kernel.

C: It is a hyperparameter in SVM that controls the penalty for errors. The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly.

gamma: Gamma is used with the Gaussian RBF kernel. With a linear kernel you do not need gamma, only the C hyperparameter (some libraries also apply a gamma scaling factor to the polynomial kernel). It is sometimes expressed through sigma instead, with gamma = 1/(2σ²). Gamma decides how much curvature we want in the decision boundary.

Follow these techniques:

1. Use Validation methods

2. Add more data

3. Apply feature engineering techniques (normalization, imputation, etc.)

4. Compare Multiple algorithms

5. Hyperparameter Tuning

Standardization is the process of putting different variables on the same scale, which allows you to compare scores between different types of variables. Typically, to standardize a variable, you calculate its mean and standard deviation, then subtract the mean from each value and divide by the standard deviation. Log transformation is a data transformation method that replaces each value x with log(x). It decreases skew in some distributions, especially those with large outliers, but it may not be useful if the original distribution is not skewed. Also, the log transform cannot be applied in some cases (zero or negative values), whereas standardization is always applicable (except when σ = 0).
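Both transformations can be sketched in a few lines of plain Python. The function names and sample values are illustrative, and the population standard deviation is used here as an assumption; a sample standard deviation would work equally well.

```python
import math
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; an assumption of this sketch
    if std == 0:
        raise ValueError("standardization is undefined when sigma is 0")
    return [(v - mean) / std for v in values]


def log_transform(values):
    """Natural log of each value; only defined for positive inputs."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(v) for v in values]


values = [1.0, 10.0, 100.0, 1000.0]  # strongly right-skewed
z_scores = standardize(values)       # mean 0, standard deviation 1
logged = log_transform(values)       # evenly spaced after the transform
```

Standardization keeps the shape of the distribution and only shifts and rescales it, while the log transform changes the shape by compressing large values.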

Object detection is a computer vision technique for locating instances of objects in images or videos. Object detection algorithms typically leverage machine learning or deep learning to produce meaningful results.

Step 1: Remove duplicate or irrelevant observations from your dataset.

Step 2: Fix structural errors.

Step 3: Filter unwanted outliers.

Step 4: Handle missing data.

Step 5: Validate that your data is appropriate for the problem statement.

In a classification task, there is a high chance for the algorithm to be biased if the dataset is imbalanced. An imbalanced dataset is one in which the number of samples in one class is much higher or lower than the number of samples in the other class.

To counter such imbalanced datasets, we use techniques called up-sampling and down-sampling.

In up-sampling, we randomly duplicate observations from the minority class in order to reinforce its signal. The most common way is to resample with replacement.

In down-sampling, we randomly remove observations from the majority class. After up-sampling or down-sampling, the dataset becomes balanced, with the same number of observations in each class.
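A minimal sketch of both approaches using only the standard library. The function names, the fixed seed and the toy data are assumptions made for illustration, and the sketch assumes the majority class is at least as large as the minority class.

```python
import random


def upsample_minority(majority, minority, seed=0):
    """Duplicate random minority rows (with replacement) until classes match."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority, minority + extra


def downsample_majority(majority, minority, seed=0):
    """Keep a random subset of majority rows (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(majority, k=len(minority)), minority


majority_rows = [("neg", i) for i in range(10)]
minority_rows = [("pos", i) for i in range(3)]

up_major, up_minor = upsample_minority(majority_rows, minority_rows)
down_major, down_minor = downsample_majority(majority_rows, minority_rows)
```

Up-sampling keeps all ten majority rows and grows the minority class to ten; down-sampling keeps all three minority rows and shrinks the majority class to three.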

The softmax function is an activation function that turns numbers into probabilities which sum to one. It outputs a vector that represents the probability distribution over a list of outcomes, and it is a core element of deep learning classification tasks. The formula for softmax normalization can be given as:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K$$

where z is the vector of K input scores.
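A minimal softmax in plain Python might look like the sketch below. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result; the input scores are made up.

```python
import math


def softmax(scores):
    """Exponentiate each score and normalize so the outputs sum to one."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```

The largest input score receives the largest probability, and the ordering of the scores is preserved.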

To understand how Bagging works, assume we have N models and a dataset D with m records and n features per record, and that the task is binary classification. First, we split the dataset; for now, into a training set and a test set only.

Take a sample of records from the training set and use it to train the first model, say m1. For the next model, m2, resample the training set and take another sample. Repeat this for all N models. Since we resample the training set without removing anything from it, two or more samples may contain the same training records. This technique of resampling the training set and providing each sample to a model is termed Row Sampling with Replacement. Once every model is trained, we want to see the prediction on the test data. Since we are working on binary classification, the output can be either 0 or 1. Each test record is passed to every model, and we get a prediction from each. If more than N/2 of the N models predict 1, then by a model averaging technique such as majority vote, the predicted output for that record is 1.
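The bagging procedure can be sketched as follows. The base learner here is a deliberately simple one-feature threshold rule invented for illustration (it assumes class 1 has the larger feature values); in practice any model can play that role. The data and the seed are also made up.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Row sampling with replacement: same size as input, duplicates allowed."""
    return [rng.choice(rows) for _ in rows]


def train_stump(rows):
    """Hypothetical base learner: threshold halfway between the class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    if not zeros or not ones:  # degenerate sample containing one class only
        label = 1 if ones else 0
        return lambda x: label
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def bagging_predict(models, x):
    """Majority vote over the individual model predictions."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(42)
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
         (7.0, 1), (7.5, 1), (8.0, 1), (8.5, 1)]

# An odd number of models avoids tied votes in binary classification.
models = [train_stump(bootstrap_sample(train, rng)) for _ in range(11)]
low_prediction = bagging_predict(models, 1.2)
high_prediction = bagging_predict(models, 8.2)
```

Each model sees a different bootstrap sample, so the individual thresholds differ slightly, and the vote averages out those differences.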

In boosting, we take records from the dataset and pass them to base learners sequentially; the base learners can be any model. Suppose we have m records in the dataset. We pass a few records to base learner BL1 and train it. Once BL1 is trained, we pass all the records through it and see how it performs. We take only the records that BL1 classified incorrectly and pass them to the next base learner, BL2; similarly, the records misclassified by BL2 are used to train BL3. This goes on until we reach the specified number of base learners. Finally, we combine the outputs of these base learners into a strong learner, which improves the model’s predictive power.

**C:** It is a hyperparameter in SVM that controls the penalty for errors. The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly.

**Gamma:** Gamma is used with the Gaussian RBF kernel. With a linear kernel you do not need gamma, only the C hyperparameter (some libraries also apply a gamma scaling factor to the polynomial kernel). It is sometimes expressed through sigma instead, with gamma = 1/(2σ²). Gamma decides how much curvature we want in the decision boundary.

Depending on the problem and dataset, we decide whether outliers are important or not. Outliers do not always need to be removed, because sometimes they provide important information, especially when they are relevant to the business.

Confusion matrix. This is an N×N matrix, where N is the number of classes being predicted. It is also called an error matrix, and it plays a central role in evaluating predictions in statistical classification problems. A confusion matrix is a summary of prediction results on a classification problem: the numbers of correct and incorrect predictions are summarized as counts and broken down by class.
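A small sketch of how such a matrix can be built by hand. The convention of rows for actual classes and columns for predicted classes is an assumption of this example, and the labels and predictions are made up.

```python
def confusion_matrix(y_true, y_pred, labels):
    """N x N counts: rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


labels = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

matrix = confusion_matrix(y_true, y_pred, labels)
correct = sum(matrix[i][i] for i in range(len(labels)))  # diagonal = correct
```

The diagonal holds the correct predictions for each class, and every off-diagonal cell counts one specific kind of mistake.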

Wrapper methods measure the “usefulness” of features based on the classifier performance. In contrast, the filter methods pick up the intrinsic properties of the features (i.e., the “relevance” of the features) measured via univariate statistics instead of cross-validation performance. So, wrapper methods are essentially solving the “real” problem (optimizing the classifier performance), but they are also computationally more expensive compared to filter methods due to the repeated learning steps and cross-validation.

Filter methods: information gain, chi-square test, Fisher score, correlation coefficient, variance threshold

Wrapper methods: recursive feature elimination, sequential feature selection algorithms, genetic algorithms

In data science and machine learning, KMeans and DBSCAN are two of the most popular clustering (unsupervised) algorithms. Density-based clustering algorithms use the concept of reachability, i.e. how many neighbors a point has within a given radius. DBSCAN is often more convenient because it does not need the parameter k, the number of clusters we are trying to find, which KMeans requires. When you don’t know the number of clusters hidden in the dataset and there’s no way to visualize it, DBSCAN is a good choice. DBSCAN produces a varying number of clusters, based on the input data.

Here’s a list of advantages of KMeans and DBSCAN:

- KMeans is much faster than DBSCAN
- DBSCAN doesn’t need the number of clusters

Here’s a list of disadvantages of KMeans and DBSCAN:

- KMeans needs the number of clusters hidden in the dataset
- DBSCAN doesn’t work well over clusters with different densities
- DBSCAN needs a careful selection of its parameters

**Mean/Median imputation:** In mean or median substitution, the mean or median value of a variable is used in place of the missing values of that same variable. Prefer the median over the mean when the column has outliers.

**Mode substitution:** In mode substitution, the most frequently occurring value of a categorical variable is used in place of its missing values.
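These substitutions can be sketched with the standard `statistics` module. Missing values are represented as `None` here, which is an assumption of this example, and the `impute` helper and sample columns are made up.

```python
import statistics


def impute(values, strategy):
    """Fill None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill_functions = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in values]


incomes = [30, 32, 35, None, 1000]    # 1000 is an outlier
colors = ["red", "blue", None, "red"]  # categorical column

mean_filled = impute(incomes, "mean")
median_filled = impute(incomes, "median")
mode_filled = impute(colors, "mode")
```

The outlier pulls the mean fill value up to 274.25, while the median fill value stays at 33.5, which is why the median is preferred for columns with outliers.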

Gaussian mixture models (GMMs) are often used for data clustering. You can use GMMs to perform either hard clustering or soft clustering on query data. To perform hard clustering, the GMM assigns query data points to the multivariate normal components that maximize the component posterior probability, given the data.

K-means clustering aims to partition data into k clusters in a way that data points in the same cluster are similar and data points in the different clusters are farther apart. Similarity of two points is determined by the distance between them.

Formally, k-means is an application of vector quantization (VQ), while a GMM is typically fitted with the Expectation-Maximization (EM) algorithm. The two are closely related: k-means makes hard assignments and can be seen as a limiting case of EM for a GMM with equal, spherical covariances, whereas a GMM makes soft (probabilistic) assignments under Gaussian components.

A z-score of 1.5 indicates that the observation is 1.5 standard deviations above the mean, and a z-score of -1.5 means that the observation is 1.5 standard deviations below the mean.

Noisy data can be handled by following the given procedures:

1) Binning:

• Binning methods smooth a sorted data value by consulting the values around it.

• The sorted values are distributed into a number of “buckets,” or bins.

• Because binning methods consult the values around each value, they perform local smoothing.

• In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.

• Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.

• In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries.

• Each bin value is then replaced by the closest boundary value.

• In general, the larger the bin width, the greater the effect of the smoothing.

• Bins may be equal-frequency, with the same number of values in each bin, or alternatively equal-width, where the interval range of values in each bin is constant.

• Binning is also used as a discretization technique.

2) Regression:

• Here data can be smoothed by fitting the data to a function.

• Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other.

• Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.

3) Clustering:

• Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.”

• Intuitively, values that fall outside of the set of clusters may be considered outliers.
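The binning step can be sketched as follows, using equal-frequency bins of three values. The function names and the sample prices are illustrative, and sending ties to the lower boundary is an assumption of this sketch.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max.

    Ties go to the lower boundary (an arbitrary choice for this sketch).
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

For the first bin [4, 8, 15], smoothing by means gives 9.0 for every value, while smoothing by boundaries gives [4, 4, 15].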

The only difference between list and dictionary comprehension

If the values of the features are on similar scales, many algorithms train better and faster. When feature values differ greatly in magnitude, training takes more time and accuracy can be lower, because features with large ranges can dominate the others.

So if a dataset has feature values that are far apart in magnitude, scaling is a technique to bring them onto a comparable range. In simpler words, scaling makes the data points more uniform so that the distances between them are smaller.
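One common way to do this is min-max scaling, which maps each value into the range [0, 1]. The sketch below is one possible implementation with made-up data, not the only scaling method.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot scale a constant feature")
    return [(v - low) / (high - low) for v in values]


ages = [20, 30, 40, 60]
incomes = [20000, 30000, 40000, 60000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
```

After scaling, both features lie in the same [0, 1] range even though the raw incomes were a thousand times larger than the ages.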

The SelectKBest method selects the k features with the highest scores. By changing the ‘score_func’ parameter, we can apply the method to both classification and regression data. Selecting the best features is an important step when we prepare a large dataset for training. The SelectKBest clas

Probabilistic modeling is a statistical technique used to take into account the impact of random events or actions in predicting the potential occurrence of future outcomes. Statistical modeling is the process of applying statistical analysis to a dataset. A statistical model is a mathematical representation (or mathematical model) of observed data.

Min-Max Normalization

Left Skewed Distribution: In a left skewed distribution, the mean is less than the median. Right Skewed Distribution: In a right skewed distribution, the mean is greater than the median.

Image classification algorithms are algorithms that assign labels to images based on their characteristics. Example: Convolutional Neural Networks.

One-hot encoding is the most common and appropriate way to deal with non-ordinal categorical data. It consists of creating an additional feature for each category of the categorical feature and marking whether each observation belongs (value = 1) or does not belong (value = 0) to that category.
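A minimal hand-rolled version, assuming the categories are taken from the data itself and ordered alphabetically (both choices are assumptions of this sketch):

```python
def one_hot_encode(values):
    """Return the category list and one 0/1 row per observation."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
```

Each row contains exactly one 1, in the column of the category that observation belongs to.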

Backward selection chooses a subset of the predictor variables for the final model. Unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time. Forward selection starts with a null model (with no predictors) and adds variables one at a time, so unlike backward selection, it does not have to consider the full model (which includes all the predictors).

To remove multicollinearity, we can do two things.

1. Create new features that combine the correlated variables.

2. Remove one or more of the correlated features from our data.

Recall is the ratio of the relevant results returned by the search engine to the total number of relevant results that could have been returned. Precision is the proportion of relevant results in the list of all returned search results. When dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm’s performance. There is a close connection between ROC space and PR space: a curve dominates in ROC space if and only if it dominates in PR space.

Scatter Plot, Pair plots, Histogram, Box plots, Violin Plots, Contour plots etc.

The bias–variance tradeoff is the property of a model that the variance of the parameter estimated across samples can be reduced by increasing the bias in the estimated parameters.

Precision = True Positives / (True Positives + False Positives) = TP / (TP + FP).

Recall = True Positives / (True Positives + False Negatives) = TP / (TP + FN).
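Both formulas can be computed directly from label lists. The helper below is a sketch with made-up labels; returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # convention: 0.0 if undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
```

Here TP = 2, FP = 1 and FN = 2, so precision is 2/3 and recall is 1/2.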

The two are closely related. For symmetric intervals, such as the usual interval around a sample mean, the point estimate lies exactly in the middle of the confidence interval. However, confidence intervals provide much more information and are preferred when making inferences.

The statistical power of an A/B test refers to the test’s sensitivity to certain magnitudes of effect sizes. More precisely, it is the probability of observing a statistically significant result at level alpha (α) if a true effect of a certain magnitude (MEI) is in fact present.

tf-idf vectorization gives a numerical representation of words entirely dependent on the nature and number of documents being considered. The same words will have different vector representations in another corpus.

Binary Step Function, Linear Activation Function, Sigmoid/Logistic Activation Function, Tanh Function (Hyperbolic Tangent), ReLU Activation Function

Follow these techniques:

- Use the right evaluation metrics.
- Use K-fold Cross-Validation in the right way.
- Ensemble different resampled datasets.
- Resample with different ratios.
- Cluster the abundant class.
- Design your own models.

The sigmoid function is used for the two-class logistic regression, whereas the softmax function is used for the multiclass logistic regression (a.k.a. MaxEnt, multinomial logistic regression, softmax Regression, Maximum Entropy Classifier).

Optimizers are algorithms or methods used to change the attributes of the neural network such as weights and learning rate to reduce the losses. Optimizers are used to solve optimization problems by minimizing the function.

The idea behind the precision-recall trade-off is that changing the threshold for deciding whether a prediction is positive or negative tilts the scales: it causes precision to increase and recall to decrease, or vice versa.

These are the parameters used for building Decision Tree: min_samples_split, min_samples_leaf, max_features and criterion.

Overfitting is a modeling error that occurs when a function is fit too closely to a limited set of data points. Underfitting refers to a model that can neither model the training data nor generalize to new data.

When the bias is high, the model’s predictions are far from the actual values. This means the model does not have the capacity to capture the distribution of the data. This is underfitting. You will have to increase the complexity of the model so that it can better fit the data distribution.

On the other hand, when the variance of the model is high, the values predicted by the model are widely spread around the model’s expected prediction (not the actual value). This is overfitting. The model is too complicated and needs to be simplified; otherwise noise and outliers can take a great toll on it.

So if a dataset has feature values that are far apart in magnitude, scaling is a technique to bring them onto a comparable range. In simpler words, scaling makes the data points more uniform so that the distances between them are smaller.

Prob = (6/6) × (5/6) × (4/6) × (3/6) × (2/6) × (1/6) = 6!/6^6 ≈ 0.015
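The arithmetic can be checked exactly with Python’s `fractions` module. This sketch only verifies the product as written.

```python
from fractions import Fraction
from math import factorial

# Product of the six factors: 6/6 * 5/6 * 4/6 * 3/6 * 2/6 * 1/6.
product = Fraction(1)
for favorable in range(6, 0, -1):
    product *= Fraction(favorable, 6)

closed_form = Fraction(factorial(6), 6 ** 6)  # 6! / 6^6
approx = float(closed_form)
```

Both forms reduce to 5/324, which is about 0.0154.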

A decision tree combines a series of decisions, whereas a random forest combines several decision trees. A random forest is therefore a longer and slower process, whereas a single decision tree is fast and operates easily on large data sets, especially linear ones. The random forest model needs more rigorous training.

We can convert the timestam

A confounding variable is a variable that is not included in an experiment, yet affects the relationship between the two variables in the experiment.

PCA, Ridge regression, L1/L2 regularization etc.

Residual plots, validation curve, gain and lift chart, Kolmogorov-Smirnov chart.

A unimodal distribution, as the name suggests, is a single-peaked distribution, which means one value occurs with greater frequency than the other values. If a distribution has two fairly equal high points, it is called a bimodal distribution: two values occur with the greatest frequency, and the graph resembles the two humps on a camel’s back. Data is called skewed when its curve appears distorted either to the left or to the right. In a normal distribution the graph is symmetric, but skewed data is asymmetric.

An SVM need not overfit when using a lot of features, provided that you regularize correctly. With fewer features, an SVM model can sometimes have lower accuracy. So, having many features does not necessarily make an SVM an overly complex model.
