Data Science Interview Questions and Answers
Python deletes unwanted objects (built-in types or class instances) automatically to free the memory space. The process by which Python periodically frees and reclaims blocks of memory that no longer are in use is called Garbage Collection.
Python’s garbage collector runs during program execution and is triggered when an object’s reference count reaches zero. An object’s reference count changes as the number of aliases that point to it changes.
An object’s reference count increases when it is assigned a new name or placed in a container (list, tuple, or dictionary). The object’s reference count decreases when it’s deleted with del, its reference is reassigned, or its reference goes out of scope. When an object’s reference count reaches zero, Python collects it automatically.
In the given example:
#Literal 10 is an object
b = 10
#Reference count of object 10 becomes 0
b = 4
The literal value 10 is an object. The reference count of the object is incremented to 1, in line 1.
In line 2, it’s reference count becomes 0, as it is dereferenced. So the garbage collector deallocates the object.
- Supervised Learning is a machine learning approach that’s defined by its use of labeled datasets. Unsupervised Learning uses machine learning algorithms to analyze and cluster unlabeled data sets.
- Training the model to predict output when a new data is provided is the objective of Supervised Learning. Finding useful insights, hidden patterns from the unknown dataset is the objective of the unsupervised learning.
- Supervised Learning can be used for 2 different types of problems i.e. regression and classification. Unsupervised Learning can be used for 3 different types of problems i.e. clustering, association and dimensionality reduction.
- Supervised Learning will use off-line analysis. Unsupervised Learning uses real time analysis of data.
- To assess whether right output is being predicted, direct feedback is accepted by the Supervised Learning Model. No feedback will be taken by the unsupervised learning model.
- Computational Complexity is very complex in Supervised Learning compared to Unsupervised Learning.
- Accurate results are produced using a supervised learning model. The accuracy of results produced are less in unsupervised learning models.
- Example: Is it a cat or a dog?
When all the images of the data are labelled under the classes – cat and dog, then we use a supervised learning approach to find the class of an image. For the same problem, if there are no class labels, we use the unsupervised learning approach to create two separate clusters.
Conclusion: Having supervised data is always preferable and in the worst case scenario when we don’t have one, we opt for an unsupervised learning approach to solve the problem.
- The central limit theorem states that the sampling distribution of the mean approaches a normal distribution, as the sample size increases.
- As the sample size increases, the distribution of frequencies approximates a bell-shaped curve (i.e. normal distribution curve). The main objective of the Central Limit Theorem is that the average of your sample means will be the population mean.
- This is useful, as the research never knows which mean in the sampling distribution is the same as the population mean, but by selecting many random samples from a population the sample means will cluster together, allowing the research to make a very good estimate of the population mean.
- A sufficiently large sample can predict the parameters of a population such as the mean and standard deviation.
- Sample sizes equal to or greater than 30 are required for the central limit theorem to hold true.
- The Central Limit theorem plays a big role in hypothesis testing.
- Mutability – List and tuple are a class of data structure that can store one or more objects or values. List is mutable, whereas a tuple is immutable. This means that elements stored in the tuple cannot be reassigned or deleted, but it is possible to slice it, and even reassign and delete the whole tuple. Because tuples are immutable, they cannot be copied. On the other hand, lists can be copied and changed.
- Type of elements – Elements belonging to different data types are stored in a list. While homogeneous elements i.e. elements of the same data types, are stored in tuples.
- Length – Tuples have a fixed length, while lists have variable lengths. Thus, the size of created lists can be changed, but that is not the case for tuples.
- Memory efficiency– In terms of memory efficiency, since tuples are immutable, bigger chunks of memory can be allocated to them, while smaller chunks are required for lists to accommodate the variability. Bigger chunks in memory actually mean lower memory footprint as the overhead is low. So, tuples are more memory efficient than lists.
- Debugging – Tuples are easier to debug for large projects due to its immutability. So, if there is a smaller project or a lesser amount of data, it is better to use lists. This is because lists can be changed, while tuples cannot, making tuples easier to track.
- Big data is a combination of structured, semi structured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modelling and other advanced analytics applications.
- While traditional data is measured in familiar sizes like megabytes, gigabytes and terabytes, big data is stored in petabytes and zettabytes.
- Big data is often characterized by the three V’s: 1)The large volume of data in many environments 2) the wide variety of data types frequently stored in big data systems 3) the velocity at which much of the data is generated, collected and processed.
- Big data is often stored in a data lake. While data warehouses are commonly built on relational databases and contain structured data only, data lakes can support various data types and typically are based on Hadoop clusters, cloud object storage services, NoSQL databases or other big data platforms.
- Cloud is a popular location for managing big data systems. Organizations can deploy their own cloud-based systems or use managed big-data-as-service offerings from cloud providers.
- The various tools and technologies used in big data ecosystems are Hadoop, NoSQL databases, MapReduce, YARN, Spark, Tableau
Solution #1 – Using List Comprehension
>>> t = [1, 3, 6]
>>> v = [t[i+1]-t[i] for i in range(len(t)-1)]
Output – [2, 3]
Solution #2 – Using Numpy
>>> t = [1, 3, 6]
>>> v = numpy.diff(t)
Output – [2, 3]
- A histogram is used to summarize discrete or continuous data. It provides a visual interpretation of numerical data by showing the number of data points that fall within a specified range of values. (called “bins”)
- To construct a histogram from a continuous variable you first need to split the data into intervals, called bins.
- Bins should not be too small or too large, such that underlying pattern (frequency distribution) of the data can be easily seen
- In a histogram, it is the area of the bar that indicates the frequency of occurrences for each bin. This means that the height of the bar does not necessarily indicate how many occurrences of scores there were within each individual bin. It is the product of height multiplied by the width of the bin that indicates the frequency of occurrences within that bin.
- List consist of elements belonging to different data types. While Arrays consist of elements belonging to the same data type
- Arrays need to be declared. Lists don’t, since they are built into Python. Lists are created by simply enclosing a sequence of elements into square brackets. Creating an array, on the other hand, requires a specific function from either the array module (i.e., array.array()) or NumPy package (i.e., numpy.array()). Because of this, lists are used more often than arrays.
- Arrays can store data very compactly and are more efficient for storing large amounts of data. Lists are preferred for shorter sequences of data items.
- Arrays are great for numerical operations; lists cannot directly handle math operations. For example, you can divide each element of an array by the same number with just one line of code. If you try the same with a list, you’ll get an error.
The structural approach of coming up with the number is as follows:
Let’s start with the population of Delhi which is 2 Crores.
We will divide this population into two groups-
1. Family(80%) = 0.8*20000000 = 16000000 family members
2. Bachelors(Individuals) (20%) = 0.2*20000000 = 4000000
Number of families(assuming 4 members in each) = 16000000/4 = 4000000.
Guessing that 50% of families have cars, so the number of families with cars = 2000000.
In Delhi, we can assume that 25% of the families belong to high class society so they can afford 2 cars on an average and the rest can afford only one car.
Therefore the number of cars with families
= 0.25*2000000*2 + 0.75*2000000*1
Now let’s say only 10% of the individual population can afford a single car.
Therefore, the number of cars with individuals
So the total number of cars in Delhi can be estimated as
2500000+400000 = 2900000
which can be rounded off to 3000000 for simple calculations.
#Method 1 – Deleting rows or columns: In Pandas, there are two very useful functions: 1) isnull() and dropna() – to find columns of data with missing or corrupted data and drop those values.2) fillna() – to fill the invalid values with a placeholder value (for ex – 0).
#Method 2 – Replacing the missing data with aggregated values: This way we won’t lose any data, but this method works only for small numeric datasets, which are linear in nature. However, these approximations can mess further results and this method is not the best.
#Method 3 – Creating an unknown category: Categorical features have a number of possible values, which gives us an opportunity to create one more category for the missing values. This way we will lower the variance by adding new information to the data. This could be used when the original information is missing or cannot be understood.
#Method 4 – Predicting missing values: So based on the data where we have no missing values, we can train a statistical or machine learning algorithm in order to predict the missing values.
- WHERE Clause is used to filter the records from the table based on the specified condition.HAVING Clause is used to filter records from the groups based on the specified condition
- WHERE Clause implements in row operations. HAVING Clause implements in column operation.
- WHERE Clause can be used without GROUP BY clause. HAVING Clause cannot be used without GROUP BY clause.
- WHERE Clause is used before GROUP BY clause. HAVING Clause is used after GROUP BY clause.
- WHERE Clause is used with single row functions like UPPER, LOWER etc. HAVING Clause is used with multiple row functions like SUM, COUNT etc.
- The Dictionary maps a key to a value and cannot have duplicate keys. It utilizes a data structure called a hashmap and a key will be converted (using a hash algorithm) from a string into an integer value, to find the right index in the dictionary to look at.
- Lists though can be very slow when searching as the only way to search a list is to access each item in the list starting from the zeroth element and going up to the last element in the list.
- The list will be faster than the dictionary on the first item search because there is nothing to search in the first step. But in the second step, the list has to look through the first item and then the second item. So at each step the lookup takes more and more time. The larger the list, the longer it takes.
- Thus, Dictionary in principle has a faster lookup with O(1) i.e. constant time complexity; while the lookup performance of a List is an O(n) operation.
Modules refer to a file containing Python statements and definitions. We use modules to break down large programs into small manageable and organized files. Furthermore, modules provide reusability of code.
Some of the modules used are:
- Numpy – It is an amazing module for doing any kind of mathematical operations in Python. So essentially, it allows you to work with array-like objects of multiple dimensions like matrices and helps to do one, two, three dimensional math very fast. The operations performed in Numpy are fast because a lot of operations are implemented in C.
- Pandas – It helps in reading and working with dataframes and just data in general. With pandas, it is easy to clean, manipulate and work with data.
- Regular expression(re) – With re module, we can do more complex text processing using regular expression pattern matching.
- Itertools – This module provides various functions that work on iterators to produce complex iterators. This module works as a fast, memory-efficient tool that is used either by themselves or in combination to form iterator algebra.
- tkinter – It comes bundled with Python, using Tk and is Python’s standard GUI framework. It provides a fast and easy way to create GUI applications.
- Linear Relationship: There is a linear relationship between the independent variable, x, and the dependent variable, y. The easiest way to detect if this assumption is met is to create a scatter plot of x vs. y. This allows you to visually see if there is a linear relationship between the two variables.
- Independence: The next assumption of linear regression is that the residuals are independent. This is mostly relevant when working with time series data. Ideally, we don’t want there to be a pattern among consecutive residuals. The simplest way to test if this assumption is met, is to look at a residual time series plot, which is a plot of residuals vs. time. Ideally, most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at about +/- 2-over the square root of n, where n is the sample size.
- Homoscedasticity: The next assumption of linear regression is that the residuals have constant variance at every level of x.This is known as homoscedasticity. Once you fit a regression line to a set of data, you can then create a scatterplot that shows the fitted values of the model vs. the residuals of those fitted values.
- Normality: The next assumption of linear regression is that the residuals are normally distributed. A Q-Q plot, short for quantile-quantile plot, is a type of plot that we can use to determine whether or not the residuals of a model follow a normal distribution. If the points on the plot roughly form a straight diagonal line, then the normality assumption is met.
- K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of predefined clusters that need to be created in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
- K-means clustering algorithm works in three steps. Let’s see what these three steps are.
- Select the k values.
- Initialize the centroids.
- Select the group and find the average.
- K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of predefined clusters that need to be created in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
- We will understand each figure one by one.
- Figure 1 shows the representation of data of two different items. The first item is shown in blue color and the second item has shown in red color. Here I am choosing the value of K randomly as 2. There are different methods by which we can choose the right k values.
- In figure 2, Join the two selected points. Now to find out centroid, we will draw a perpendicular line to that line. The points will move to their centroid. If you will notice there, then you will see that some of the red points are now moved to the blue points. Now, these points belong to the group of blue color items.
- The same process will continue in figure 3. we will join the two points and draw a perpendicular line to that and find out the centroid. Now the two points will move to its centroid and again some of the red points get converted to blue points.
- The same process is happening in figure 4. This process will be continued until and unless we get two completely different clusters of these groups.
- Advantages of K-mean: 1)It is very simple to implement 2)It is scalable to a huge data set and also faster to large datasets 3) It adapts the new examples very frequently 4) Generalization of clusters for different shapes and sizes.
- Disadvantages: 1)It is sensitive to the outliers 2)As the number of dimensions increases its scalability decreases.
1. Start the 7 minute sand timer and the 4 minute sand timer.
2. Once the 4 minute sand timer ends, turn it upside down instantly.
Time Elapsed: 4 minutes. At this moment, 3 minutes of sand is left in the 7 minute sand timer.
3. Once the 7 minute sand timer ends, turn it upside down instantly.
Time Elapsed: 7 minutes. At this moment, 1 minutes of sand is left in the 4 minute sand timer.
4. After the 4 minute sand timer ends, only 1 minute is elapsed in the 7 minute sand timer, therefore for another minute turn the 7 minute sand timer upside down.
Time Elapsed: 8 minutes.
5. When the 7 minute sand timer ends, total time elapsed is 9 minutes.
So effectively 8 + 1 = 9.
Bias in machine learning is a type of error in which certain elements of a dataset are more heavily weighted and/or represented than others.
A biased dataset does not accurately represent a model’s use case, resulting in skewed outcomes, low accuracy levels, and analytical errors.
The 5 main types of machine learning bias, why they occur, and how to reduce their effect is as given below:
1) Algorithmic bias
- Algorithmic bias is the error that occurs when the algorithm at the core of the machine learning process is faulty or inappropriate for the current application.
- Algorithmic bias can be spotted when the application starts giving wrong results for almost identical cases (input cases).
- Sample bias occurs when a dataset does not reflect the realities of the environment in which a model will run. To fix this, a larger, more diverse dataset to train your model should be used.
- Example: Certain facial recognition systems trained primarily on images of white men. These models have considerably lower levels of accuracy with women and people of different ethnicities.
- Prejudice bias is often the result of the data being biased in the first place. The data you extracted and used to train your model may have pre-existing bias, such as stereotypes and faulty case assumptions. So, using this data will always result in biased results no matter what algorithms you use.
- Prejudice bias is quite difficult to solve; you can try to use entirely new dataset, try to modify the data to eliminate any existing biases.
- This type of bias occurs when the data collected for training differs from that collected in the real world, or when faulty measurements result in data distortion. Measurement bias can also occur due to inconsistent annotation during the data labelling stage of a project.
- An Example of this bias occurs in image recognition datasets, where the training data is collected with one type of camera, but the production data is collected with a different camera.
- Exclusion bias is most common at the data pre-processing stage. Most often it’s a case of deleting valuable data thought to be unimportant.
- For example, imagine you have a dataset of customer sales in America and Canada. 98% of the customers are from America, so you choose to delete the location data thinking it is irrelevant. However, this means your model will not pick up on the fact that your Canadian customers spend two times more.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the tree. This algorithm compares the values of the root attribute with the record (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches the leaf node of the tree. The complete process can be better understood using the below algorithm:
· Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
· Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM). By this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
1. Information Gain: It is the measurement of changes in entropy after the segmentation of a dataset based on an attribute. A node/attribute having the highest information gain is split first.
2. Gini Index: It is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm. It creates binary splits and an attribute with the low Gini index is preferred.
· Step-3: Divide ‘S’ into subsets that contain possible values for the best attributes.
· Step-4: Generate the decision tree, containing the Decision Node and Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches. Leaf nodes are split based on the answer (Yes/No) and contain the best attribute.
· Step-5: Recursively make new decision trees using the subsets of the dataset created in step -3. Continue this process until a stage is reached where you cannot further classify the nodes.
Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems.
- Overfitting occurs when a statistical model fits exactly against its training data. When this happens, the algorithm unfortunately cannot perform accurately against unseen data, defeating its purpose.
- When machine learning algorithms are constructed, they leverage a sample dataset to train the model. However, when the model trains for too long on sample data or when the model is too complex, it can start to learn the “noise,” or irrelevant information, within the dataset. When the model memorizes the noise and fits too closely to the training set, the model becomes “overfitted,” and it is unable to generalize well to new data.
- Low error rates and a high variance are good indicators of overfitting. In order to prevent this type of behavior, part of the training dataset is typically set aside as the “test set” to check for overfitting.
- Underfitting occurs when a data model is unable to capture the relationship between the input and output variables accurately, generating a high error rate on both the training set and unseen data.
- It occurs when a model is too simple, which can be a result of a model needing more training time, more input features, or less regularization. Like overfitting, when a model is underfitted, it cannot establish the dominant trend within the data, resulting in training errors and poor performance of the model.
- High bias and low variance are good indicators of underfitting. Since this behavior can be seen while using the training dataset, underfitted models are usually easier to identify than overfitted ones.
- K-Nearest Neighbors is a supervised classification algorithm where K describes the number of neighbour points that we look at for each data point to determine its classification. As a supervised algorithm, we have the labels of the data points, and use those to predict the labels of new data points.In addition, the concept behind this algorithm is that for a point, it will get its K nearest neighbours, based on the closest distance. And, this algorithm only really has to iterate one time through, unlike K-Means Clustering which iterates multiple times until convergence is reached.
- K-Means Clustering is an unsupervised algorithm, where the K is used to describe how many centroids of clusters there will be when applying the algorithm. As an unsupervised algorithm, we are not given any labels, but instead, we have parameters that we use to group similar data points together and find the clusters.The concept behind this algorithm is that we try to calculate the locations of the K centers, or the averages or means, of the data points, which are where the clusters are most likely centred on. It will recalculate the centre’s based on the data points, and will iterate multiple times until convergence is reached, which happens when the newly computed centre locations stop changing.
- K-NN is a classification or regression
machine learning algorithm. While K-means is a clustering machine learning algorithm and is used for scenarios such as getting deeper understanding of demographics, social media trends, marketing strategies evolution and so on.
- K-NN is a lazy learner while K-Means is an eager learner. KNN is also known as a lazy learner because it involves minimal training of the model. Hence, it doesn’t use training data to make generalizations on an unseen data set.
The goal of any supervised machine learning algorithm is to achieve low bias and low variance.
If our model is too simple and has very few parameters then it may have high bias and low variance. On the other hand, if our model has large number of parameters then it’s going to have high variance and low bias. So we need to find the right/good balance without overfitting and underfitting the data.
Thus, there is a trade-off between bias and variance; in order to achieve good prediction performance. To build a good model, we need to find a balance between bias and variance such that it minimizes the total error.
Total error = Bias^2 + Variance + Irreducible Error
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict, and
Variance is the variability of model prediction for a given data point or a value which tells us the spread of our data.
Irreducible error is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.
For example: – The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbors that contribute the prediction and in turn increases the bias of the model.
· Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution.
· There are two types of hypothesis – Null and Alternative.
1. Null Hypothesis: It is denoted by H0. A null hypothesis is the one in which sample observations result purely from chance. This means that the observations are not influenced by some non-random cause.
2. Alternative Hypothesis: It is denoted by Ha or H1. An alternative hypothesis is the one in which sample observations are influenced by some non-random cause.
· A hypothesis test concludes whether to reject the null hypothesis and accept the alternative hypothesis or to fail to reject the null hypothesis.
· The following steps are involved in hypothesis testing:
1. The first step is to state the null and alternative hypothesis clearly. The null and alternative hypothesis in hypothesis testing can be a one tailed or two tailed test.
2. The second step is to determine the test size. This means that the researcher decides whether a test should be one tailed or two tailed to get the right critical value and the rejection region.
3. The third step is to compute the test statistic and the probability value. This step of the hypothesis testing also involves the construction of the confidence interval depending upon the testing approach.
4. The fourth step involves the decision making step. This step of hypothesis testing helps the researcher reject or accept the null hypothesis by making comparisons between the subjective criterion from the second step and the objective test statistic or the probability value from the third step.
5. The fifth step is to draw a conclusion about the data and interpret the results obtained from the data.
· A null hypothesis is accepted or rejected basis P value and the region of acceptance.
1. P value – it is a function of the observed sample results. A threshold value is chosen before the test is conducted and is called the significance level, which is represented as α. If the calculated value of P ≤ α, it suggests the inconsistency between the observed data and the assumption that the null hypothesis is true. This suggests that the null hypothesis must be rejected
2. Region of Acceptance – It is the range of values that leads you to accept the null hypothesis. When you collect and observe sample data, you compute a test static. If its value falls within the specific range, the null hypothesis is accepted.
· A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables, discrete and contin
1. Discrete Random Variables
· A discrete random variable is one which may take on only a countable number of distinct values such as 0,1,2,3,4, …….. Discrete random variables are usually (but not necessarily) counts. If a random variable can take only a finite number of distinct values, then it must be discrete.
· Example: Imagine a coin toss where, depending on the side of the coin landing face up, a bet of a dollar has been placed. The possibility of winning a dollar corresponding to the outcome of a coin toss before tossing the coin defines the random variable. The outcome of the coin toss is either heads, or tails, creating an equal probability of either outcome. Because the value of the random variable is defined as a real-valued dollar, the probability distribution is discrete.
2. Continuous Random Variables
· A continuous random variable is one which takes an infinite number of possible values. Continuous random variables are usually measurements.
· A continuous random variable is not defined at specific values. Instead, it is defined over an interval of values, and is represented by the area under a curve (in advanced mathematics, this is known as an integral). The probability of observing any single value is equal to 0, since the number of values which may be assumed by the random variable is infinite.
· Example: Imagine wanting to study the effects of caffeine intake on height. One’s height would be the continuous random variable as it is unknown before the completion of the experiment, and its value is taken from measuring within a range.
· A p-value, or probability value, is a number describing how likely it is that your data would have occurred by random chance (i.e. that the null hypothesis is true).
· The level of statistical significance is often expressed as a p-value between 0 and 1. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.
· A p-value less than 0.05 (typically ≤ 0.05) is statistically significant. It indicates strong evidence against the null hypothesis, as there is less than a 5% probability the null is correct (and the results are random). Therefore, we reject the null hypothesis, and accept the alternative hypothesis.
· However, if the p-value is below your threshold of significance (typically p < 0.05), you can reject the null hypothesis, but this does not mean that there is a 95% probability that the alternative hypothesis is true. The p-value is conditional upon the null hypothesis being true, but is unrelated to the truth or falsity of the alternative hypothesis.
· A p-value higher than 0.05 (> 0.05) is not statistically significant and indicates strong evidence for the null hypothesis. This means we retain the null hypothesis and reject the alternative hypothesis.
· The term significance level (alpha) is used to refer to a pre-chosen probability and the term “P value” is used to indicate a probability that you calculate after a given study.
· The significance level (alpha) is the probability of type I error. Type I error is the false rejection of the null hypothesis and type II error is the false acceptance of the null hypothesis. The power of a test is one minus the probability of type II error (beta)
Classification algorithms take existing (labelled) datasets and use the available information to generate predictive models for use in classification of future data points. The following evaluation metrics for classification are an absolute measure of your machine learning model’s accuracy:
- Confusion Matrix: The Confusion Matrix is a two-dimensional matrix that allows visualization of the algorithm’s performance. Predictions are highlighted and divided by class (true/false), before being compared with the actual values. This matrix essentially helps you determine if the classification model is optimized. It shows what errors are being made and helps to determine their exact type.
- Accuracy: A classification model’s accuracy is defined as the percentage of predictions it got right. However, it’s important to understand that it becomes less reliable when the probability of one outcome is significantly higher than the other one, making it less ideal as a stand-alone metric.
- Recall: Recall is the number of correct positive results divided by the number of all samples that should have been identified as positive.
Recall = TP / TP + FN
- Precision: This metric is the number of correct positive results divided by the number of positive results predicted by the classifier.
Precision = TP / TP + FP
- F1 score: The F1 score is basically the harmonic mean between precision and recall. It is used to measure the accuracy of tests and is a direct indication of the model’s performance. The range of the F1 score is between 0 to 1, with the goal being to get as close as possible to 1. It is calculated as per:
- ROC-AUC: AUC of a classifier is equal to the capability of the model to distinguish correctly between the classes while the Receiver Operating System(ROC) is a probability curve. ROC-AUC is used for binary classification and represents the degree or measure of separability. It shows the performance of a classification model at all classification thresholds. AUC score is calculated from the plot for False Positive Rate(FPS) vs True Positive Rate(TPR).
Most organizations emphasize data to drive business decisions. But data alone is not the goal. Facts and figures are meaningless if you can’t gain valuable insights that lead to more-informed actions. The different types of analytics is as follows:-
- Descriptive Analytics: It looks at data statistically to tell you what happened in the past. Descriptive analytics helps a business understand how it is performing by providing context to help stakeholders interpret information. This can be in the form of data visualizations like graphs, charts, reports and dashboards.
For instance, say that an unusually high number of people are admitted to the emergency room in a short period of time. Descriptive analytics tells you that this is happening and provides real-time data with all the corresponding statistics (date of occurrence, volume, patient details, etc.).
- Predictive Analysis: Predictive tools attempt to fill in gaps in the available data. If descriptive analytics answer the question, “what happened in the past,” predictive analytics answer the question, “what might happen in the future?”
Predictive analytics take historical data from CRM, POS, HR and ERP systems and use it to highlight patterns. Then, algorithms, statistical models and machine learning are employed to capture the correlations between targeted data sets.
The most common commercial example is a credit score. Banks uses historical information to predict whether or not a candidate is likely to keep up with payments.
- Prescriptive Analysis: it takes predictive data to the next level. Now that you have an idea of what will likely happen in the future. Prescriptive analytics suggests various courses of action and outlines what the potential implications would be for each.
Back to our hospital example: Now that you know the illness is spreading, the prescriptive analytics tool may suggest that you increase the number of staff on hand to adequately treat the influx of patients.
201 is not an even number so let’s consider 200 games first. Assume event A fires more arrows on target than B in 200 games, and event B is B fires more arrows on target, and C is that they fire equal amounts of arrows on targets. We have:
P(A) + P(B) + P(C) = 1
Since A and B perform equally at archery, for 200 games, we have P(A) = P(B). Thus:
2P(A) + P(C) = 1
Now move to the extra game that person A plays. If in the last 200 games:
A is higher than B, then no matter whether A fires on target or not for this extra game, A is still higher than B.
If A is lower than B, even if A fires on target for the extra game, we would observe the most A=B, and A will still not be over B.
If A=B, if A fires on target for the extra game, then A will be higher than B, and the probability that A shoots on target for any game is 0.5.
Thus, the total probability that A is higher than B is:
P(A) + 0.5*P(C)
We know that 2P(A) + P(C) = 1, if we divide by 2 on both sides, we will have:
P(A) + 0.5*P(C) = 0.5
The probability that A gets more targets than B when A plays 201 games and B plays 200 games is 0.5.
· Histograms and box plots are graphical representations for the frequency of numeric data values. They aim to describe the data and visually assess the central tendency, the amount of variation in the data as well as the presence of gaps, outliers or unusual data points.
· Histograms are preferred to determine the underlying probability distribution of a data. Box plots on the other hand are more useful when comparing between several data sets.
· Histogram is preferable when there is very little variance among the observed frequencies. While box plot shows moderate variation among the observed frequencies
· While histograms are better in displaying the distribution of data, a box plot is used to tell if the distribution is symmetric or skewed.
· Box plots are less detailed than histograms and take up less space.
· To conclude, both tools can be helpful to identify whether variability in data is within specification limits, and whether there is a shift in the process over time. Thus, the type of chart aid chosen depends on the type of data collected, rough analysis of data trends, and project goals.
Covariance is a measure to indicate the extent to which two random variables change in tandem. Correlation is a measure used to represent how strongly two random variables are related to each other.
Covariance is nothing but a measure of correlation. Correlation refers to the scaled form of covariance.
Covariance indicates the direction of the linear relationship between variables. Correlation on the other hand measures both the strength and direction of the linear relationship between two variables.
Covariance can vary between -∞ and +∞. Correlation ranges between -1 and +1
Covariance is affected by the change in scale. If all the values of one variable are multiplied by a constant and all the values of another variable are multiplied by a similar or different constant, then the covariance is changed. Correlation is not influenced by the change in scale.
Covariance of two dependent variables measures how much in real quantity (i.e. cm, kg, liters) on average they co-vary. Correlation of two dependent variables measures the proportion of how much on average these variables vary w.r.t one another.
- In terms of the distribution of time spent per day on Facebook (FB), one can imagine there may be two groups of people on Facebook:
1. People who scroll quickly through their feed and don’t spend too much time on FB.
2. People who spend a large amount of their social media time on FB.
- Based on this, we make claim about the distribution of time spent on FB. The metrics to describe our distribution can be 1) Centre (mean, median, mode) 2) Spread (standard deviation, inter quartile range 3) Shape (skewness, kurtosis, uni or bimodal) 4) Outliers (Do they exist?).
- We can give you a sample answer for your interview: –
- If we assume that a person is visiting Facebook page, there is a probability(p) that after one unit of time(t) has passed that she will leave the page.
- With a probability of p her visit will be limited to 1 unit of time. With a probability of (1−p)p her visit will be limited to 2 units of time. With a probability of (1−p)2p her visit will be limited to 3 units of time and so on. The probability mass function of this distribution is therefore (1−p)tp, and hence we can say this a geometric distribution.
A type I error occurs when the null hypothesis is true but is rejected. In other words, if a true null hypothesis is incorrectly rejected, type I error occurs. A type II error occurs when the null hypothesis is false but invalidly fails to be rejected. In other words, failure to reject a false null hypothesis results in type II error.
A type I error also known as False positive. A type II error also known as False negative. It is also known as the false null hypothesis.
Type I error equals the level of significance (α). ‘α’ is the so-called p-value.
Type II error equals the statistical power of a test. The probability 1- ‘β’ is called the statistical power of the study.
The probability that we will make a type I error is designated ‘α’ (alpha). Therefore, type I error is also known as alpha error. Probability that we will make a type II error is designated ‘β’ (beta). Therefore, type II error is also known as beta error.
It refers to non-acceptance of hypotheses, which ought to be accepted. It refers to the acceptance of a hypothesis, which ought to be rejected.
The probability of Type I error reduces with lower values of (α) since the lower value makes it difficult to reject the null hypothesis. The probability of Type II error reduces with higher values of (α) since the higher value makes it easier to reject the null hypothesis.
In the world of mathematics, the shortest distance between two points in any dimension is termed the Euclidean distance. It is the square root of the sum of squares of the difference between two points.
In Python, the numpy, scipy modules are very well equipped with functions to perform mathematical operations and calculate this line segment between two points.
Solution 1: Using Numpy Module
The numpy module can be used to find the required distance when the coordinates are in the form of an array. It has the norm() function, which can return the vector norm of an array
import numpy as np
a = np.array((1, 2, 3))
b = np.array((4, 5, 6))
dist = np.linalg.norm(a-b)
Solution 2: Using Scipy Library
The scipy library has many functions for mathematical and scientific calculation. The distance.euclidean() function returns the Euclidean Distance between two points.
from scipy.spatial import distance
a = (1, 2, 3)
b = (4, 5, 6)
Solution 3: Using math module
The math module also can be used as an alternative. The dist() function from this module can return the line segment between two points.
from math import dist
a = (1, 2, 3)
b = (4, 5, 6)
The scipy and math module methods are a faster alternative to the numpy methods and work when the coordinates are in the form of a tuple or a list.
A trigger in MySQL is a set of SQL statements that reside in a system catalog. It is a special type of stored procedure that is invoked automatically in response to an event. Each trigger is associated with a table, which is activated on any DML statement such as INSERT, UPDATE, or DELETE.
A trigger is called a special procedure because it cannot be called directly like a stored procedure. The main difference between the trigger and procedure is that a trigger is called automatically when a data modification event is made against a table. In contrast, a stored procedure must be called explicitly.
Generally, triggers are of two types according to the SQL standard: row-level triggers and statement-level triggers.
Row-Level Trigger: It is a trigger, which is activated for each row by a triggering statement such as insert, update, or delete. For example, if a table has inserted, updated, or deleted multiple rows, the row trigger is fired automatically for each row affected by the insert, update or delete statement.
Statement-Level Trigger: It is a trigger, which is fired once for each event that occurs on a table regardless of how many rows are inserted, updated, or deleted.
For creating a new trigger, we need to use the CREATE TRIGGER statement. Its syntax is as follows:
CREATE TRIGGER trigger_name trigger_time trigger_event
FOR EACH ROW
Trigger_name is the name of the trigger which must be put after the CREATE TRIGGER statement.
Trigger_time is the time of trigger activation and it can be BEFORE or AFTER. We must have to specify the activation time while defining a trigger.
Trigger_event can be INSERT, UPDATE, or DELETE. This event causes the trigger to be invoked. A trigger only can be invoked by one event.
Table_name is the name of the table. Actually, a trigger is always associated with a specific table.
BEGIN…END is the block in which we will define the logic for the trigger.
The characteristic of a frequency distribution that ascertains its symmetry about the mean is called skewness. On the other hand, Kurtosis means the relative pointedness of the standard bell curve, defined by the frequency distribution.
Skewness is characteristic of the deviation from the mean, to be greater on one side than the other, i.e. attribute of the distribution having one tail heavier than the other. Skewness is used to indicate the shape of the distribution of data. Conversely, kurtosis is a measure to indicate the flatness or peakedness of the frequency distribution curve and measures the tails or outliers of the distribution.
Skewness is an indicator of lack of symmetry, i.e., both left and right sides of the curve are unequal, with respect to the central point. As against this, kurtosis is a measure of data that is either peaked or flat, with respect to the probability distribution.
Skewness shows how much and in which direction, the values deviate from the mean. In contrast, kurtosis explains how tall and sharp the central peak is.
In a skewed distribution, the curve is extended to either the left or right side. So, when the plot is extended towards the right side more, it denotes positive skewness, wherein mode < median < mean. On the other hand, when the plot is stretched more towards the left direction, then it is called as negative skewness and so, mean < median < mode. Positive kurtosis represents that the distribution is more peaked than the normal distribution, whereas negative kurtosis shows that the distribution is less peaked than the normal distribution.
We want a recommendation algorithm sounds like RNN which is a recurrent neural network which is not easy to setup at all, but we can try building recommendation with much simpler approach that is with a simple prefix matching algorithm and we can certainly go into expanding it on until we have something that will be on par with RNN.
We will use Lookup in this database table/ prefix table. This prefix table starts with an input string and that is your prefix and it will output the suggested string or suffixes. Example: what does “hello” prefix to and its suffixes to the model.
Scoping is very important by doing fuzzy matching, context matching like what if you were using a different language. So, if you are trying to input “big” that could output any number of suffixes example: big shot or big sky or the big year etc.
In existing search corpus of billions of searches what proportion of the time do people writing the big actually click on the big shot and what proportion of the time do they output the big sky, so you can just have a simple thing that has every possible search prefix that has ever been typed on Netflix and output that to the most common thing that they clicked on. Boom! That’s your prefix matching a recommendation algorithm for type ahead search.
Context matching is also important here, if u have string input and a user profile with various number of features into a string output. You can convert user profile into a K means clustering we can output this into either John Stamos’s fan and not John Stamos’s fan, so if you are John Stamos’s fan and if you type the big, every time Netflix is going to recommend the big shot or else other way round. Also user profile can be set to right dimensionality.
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behaviour. It can be considered the thoughtful process of determining what is normal and what is not. Anomalies are also referred to as outliers, novelties, noise, exceptions and deviations. Simply, anomaly detection is the task of defining a boundary around normal data points so that they can be distinguishable from outliers.
Anomalies can be broadly categorized as:
Point anomalies: A single instance of data is anomalous if it’s too far off from the rest. Business use case: Detecting credit card fraud based on “amount spent.”
Contextual anomalies: The abnormality is context specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.
Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone is trying to copy data from a remote machine to a local host unexpectedly, an anomaly that would be flagged as a potential cyber-attack.
The different types of methods for anomaly detection are as follows:
Simple Statistical Methods
The simplest approach to identifying irregularities in data is to flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles. When an anomalous data point deviates by a certain standard deviation from the mean, then traversing mean over time-series data isn’t exactly trivial, as it’s not static. Thus, a rolling window to compute the average across the data points and it’s intended to smooth short-term fluctuations and highlight long-term ones.
Machine Learning-Based Approaches for Anomaly Detection:
(a) Clustering-Based Anomaly Detection:
The approach focuses on unsupervised learning, similar data points tend to belong to similar groups or clusters, as determined by their distance from local centroids.
The k-means algorithm can be used which partition the dataset into a given number of clusters. Any data points that fall outside of these clusters are considered as anomalies.
(b) Density-based anomaly detection:
This approach is based on the K-nearest neighbors algorithm. It’s evident that normal data points always occur around a dense neighborhood and abnormalities deviate far away. To measure the nearest set of a data point, you can use Euclidean distance or similar measure according to the type of data you have.
(c) Support Vector Machine-Based Anomaly Detection:
A support vector machine is another effective technique for detecting anomalies. One-Class SVMs have been devised for cases in which one class only is known, and the problem is to identify anything outside this class.
This is known as novelty detection, and it refers to automatic identification of unforeseen or abnormal phenomena, i.e. outliers, embedded in a large amount of normal data.
Anomaly detections helps to monitor any data source, including user logs, devices, networks, and servers. This rapidly helps in identifying zero-day attacks as well as unknown security threats.
Let’s understand t-test and Z-test first,
Z-test is a univariate hypothesis test which ascertains if the averages of the 2 datasets are different from each other when standard deviation or variance is given. Key assumptions made: 1)All data points are independent 2) Normal Distribution for Z, with an average zero and variance = 1.
The t-test can be referred to as a kind of parametric test that is applied to an identity, how the averages of 2 sets of data differ from each other when the standard deviation or variance is not given. Key assumptions made: 1) All data points are not dependent 2)Sample values are to recorded and taken accurately
The t-test is based on the student’s t-distribution. On the contrary, the z-test depends upon the assumption that the distribution of sample means will be normal.
So, what are the conditions for conducting these tests?
One of the essential conditions for conducting a t-test is that population standard deviation or the variance is unknown. Conversely, the population variance should be assumed to be known or be known in the case of a z-test.
Z-test is used when the sample size is large, which is n > 30, and the t-test is appropriate when the size of the sample is not big, which is small, i.e., that n < 30.
Let’s take a following dataset as example,
1 3 5 5 6 7
If I want to calculate the central tendency of this dataset then I have 3 choices to do so,
Mean: (1+3+5+5+6+7)/6 = 27/6 = 4.5
Median: 5+5/2 = 5
Mode: Since 5 has shown up two times the mode would be 5 here.
From the above observation, we can clearly state that most of the data lies around 5. In the above example, there are no outliers in the data. Now let’s take an example where we have an outlier condition in the data.
Before that lets understand, what is an outlier? Outliers are the values which lie outside of the other values. It could either lie on a higher or lower level depending upon the data.
Dataset: 1 3 5 5 6 27
If you clearly observe the number 27 is an outlier because it lies outside the current values. Now let’s calculate the central tendency of the dataset.
Mean: (1+3+5+5+6+27)/6 = 47/6 = 7.83
Median: 5+5/2 = 5;
Mode: Since 5 has shown up two times the mode would be 5 here.
See, what happened to mean? I have just introduced outliers in the data and the mean gets affected heavily. Now look at Median and Mode, they remain the same.
Median is the preferred measure of central tendency when: 1) There are a few extreme scores in the distribution of the data. 2) There are some missing or undetermined values in your data. 3) There is an open-ended distribution (For example, if you have a data field which measures number of children and your options are 00, 11, 22, 33, 44, 55 or “6 or more,” than the “6 or more field” is open ended and makes calculating the mean impossible, since we do not know exact values for this field).
The mode is the only measure you can use for nominal or categorical data that can’t be ordered.
Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments:
One used to learn or train a model and the other used to validate the model. In typical cross-validation, the training and validation sets must cross-over in successive rounds such that each data point has a chance of being validated against. The basic form of cross-validation is k-fold cross-validation.
Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation.
In k-fold cross-validation, the data is first partitioned into k equally (or nearly equally) sized segments or folds. Subsequently k iterations of training and validation are performed such that within each iteration a different fold of the data is held-out for validation while the remaining k − 1 folds are used for learning.
Why do we use cross-validation?
It allows us to get more metrics and draw important conclusions about our algorithm and our data.
Helps to tune the hyper parameters of a given machine learning algorithm, to get good performance according to some suitable metric.
It mitigates overfitting while building a pipeline of models, such that second’s models input will be real predictions on data that our first model has never seen before.
K-fold cross validation also significantly reduces bias as we are using most of the data for fitting, and also significantly reduces variance as most of the data is also being used in the validation set.
Let’s start with understanding correlation,
The correlation between two variables can be measured with a correlation coefficient which can range between -1 to 1. If the value is 0, the two variables are independent and there is no correlation. If the measure is extremely close to one of these values, it indicates a linear relationship and highly correlated with each other. This means a change in one variable is associated with a significant change in other variables.
How to test Multicollinearity?
Multicollinearity is a condition when there is a significant dependency or association between the independent variables or the predictor variables. A significant correlation between the independent variables is often the first evidence of presence of multicollinearity.
Correlation matrix / Correlation plot: A correlation plot can be used to identify the correlation or bivariate relationship between two independent variables
Variation Inflation Factor (VIF): VIF is used to identify the correlation of one independent variable with a group of other variables.
Consider that we have 9(assume V1 to V9) independent variables. To calculate the VIF of variable V1, we isolate the variable V1 and consider as the target variable and all the other variables(i.e V2 to V9) will be treated as the predictor variables.
We use all the other predictor variables and train a regression model and find out the corresponding R2 value.
Using this R2 value, we compute the VIF value given as the image below.
It is always desirable to have a VIF value as small as possible. A threshold is also set, which means that any independent variable greater than the threshold will have to be removed.
Linear regression is used to predict the continuous dependent variable using a given set of independent variables. Logistic Regression is used to predict the categorical dependent variable using a given set of independent variables.
Linear Regression is used for solving Regression Problems. Logistic Regression is used for mainly Classification problems.
Linear regression is used to estimate the dependent variable in case of a change in independent variables. For example, predict the price of houses. Whereas logistic regression is used to calculate the probability of an event. For example, classify if tissue is benign or malignant.
In Linear regression, we predict the value of continuous variables. In logistic Regression, we predict the values of categorical variables.
In linear regression, we find the best fit line, by which we can easily predict the output. In Logistic Regression, we find the S-curve by which we can classify the samples.
Least square estimation method is used for estimation of accuracy. Maximum likelihood estimation method is used for estimation of accuracy.
The output for Linear Regression must be a continuous value, such as price, age, etc. The output of Logistic Regression must be a Categorical value such as 0 or 1, Yes or No, etc.
Linear Regression assumes the normal or gaussian distribution of the dependent variable. Logistic regression assumes the binomial distribution of the dependent variable.
Oversampling and under sampling are 2 important techniques used in machine learning – classification problems in order to reduce the class imbalance thereby increasing the accuracy of the model.
Classification is nothing but predicting the category of a data point to which it may probably belong by learning about past characteristics of similar instances. When the segregation of classes is not approximately equal then it can be termed as a “Class imbalance” problem. To solve this scenario in our data set, we use oversampling and under sampling.
Oversampling is used when the amount of data collected is insufficient. A popular over-sampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by randomly sampling the characteristics from occurrences in the minority class.
Conversely, if a class of data is the overrepresented majority class, under sampling may be used to balance it with the minority class. Under sampling is used when the amount of collected data is sufficient. Common methods of under sampling include cluster centroids and To meke links, both of which target potential overlapping characteristics within the collected data sets to reduce the amount of majority data.
Ex: Let’s say in a bank majority of the customers are from a specific Race and very few customers are from other races , hence if the model is trained with this data , it is most likely that Model will reject the loan for Minority Race.
So, what should we do about it?
For, oversampling we increase the number of records belonging to the “minority race” category by duplicating its presence. So that the difference between the numbers of records belonging to both of the classes will narrow down.
Under sampling we reduce the number of records belonging to the “majority race”. The records for the deletion are selected strictly through a random process and are not influenced by any constraints or bias.
To conclude, over sampling is preferable as under sampling can result in the loss of important data. Under sampling is suggested when the amount of data collected is larger than ideal and can help data mining tools to stay within the limits of what they can effectively process.
The following MySQL statement finds the maximum salary from each department and you will be required to use the GROUP BY clause with the SELECT query to generate the individual departments with the highest salary.
SELECT department, MAX(salary)
GROUP BY department
ORDER BY MAX(salary) DESC;
This query gives the output as follows:
SVM algorithms use a group of mathematical functions that are known as kernels. The function of a kernel is to require data as input and transform it into the desired form.
The kernel functions return the scalar product between two points in an exceedingly suitable feature space. Thus, by defining a notion of resemblance, with a little computing cost even in the case of very high-dimensional spaces.
Different SVM algorithms use differing kinds of kernel functions. These functions are of different kinds—for instance, linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid. Few of them are as follows:
Linear Kernel :
It is the most basic type of kernel, usually one dimensional in nature. It proves to be the best function when there are lots of features. The linear kernel is mostly preferred for text-classification problems as most of these kinds of classification problems can be linearly separated.
Linear kernel functions are faster than other functions.
Linear Kernel Formula
F(x, xj) = sum( x.xj)
Here, x, xj represents the data you’re trying to classify.
Polynomial Kernel :
It is a more generalized representation of the linear kernel. It is not as preferred as other kernel functions as it is less efficient and accurate.
Polynomial Kernel Formula
F(x, xj) = (x.xj+1)^d
Here ‘.’ shows the dot product of both the values, and d denotes the degree.
F(x, xj) representing the decision boundary to separate the given classes.
Gaussian Radial Basis Function (RBF)
It is one of the most preferred and used kernel functions in svm. It is usually chosen for non-linear data. It helps to make proper separation when there is no prior knowledge of data.
Gaussian Radial Basis Formula
F(x, xj) = exp(-gamma * ||x – xj||^2)
The value of gamma varies from 0 to 1. You have to manually provide the value of gamma in the code. The most preferred value for gamma is 0.1.
Bagging and Boosting are two types of Ensemble Learning, which helps to improve machine learning results by combining several models. This approach allows the production of better predictive performance compared to a single model.
So, let’s understand the difference between Bagging and Boosting?
Bagging(Bootstrap aggregation): It is a homogeneous weak learners’ model that learns from each other independently in parallel and combines them for determining the model average. Boosting: It is also a homogeneous weak learners’ model. In this model, learners learn sequentially and adaptively to improve model predictions of a learning algorithm.
If the classifier is unstable (high variance), then we need to apply bagging. If the classifier is steady and straightforward (high bias), then we need to apply boosting.
In bagging, different training data subsets are randomly drawn with replacement from the entire training dataset. In boosting, every new subset contains the elements that were misclassified by previous models.
Bagging is the simplest way of combining predictions that belong to the same type. Boosting is the way of combining predictions that belong to the different types.
Each model is built independently for bagging. While in the case of boosting, new models are influenced by the performance of previously built models.
Bagging attempts to tackle the overfitting issue. Boosting tries to reduce bias.
Example: The Random Forest model uses Bagging. The AdaBoost uses Boosting techniques.
Feature scaling is one of the most important data pre-processing step in machine learning. Algorithms that compute the distance between the features are biased towards numerically larger values if the data is not scaled.
Normalization and standardization are one of the few types of feature scaling.
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
X_new = (X – X_min)/(X_max – X_min)
Here, Xmax and Xmin are the maximum and the minimum values of the feature respectively.
Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
X_new = (X – mean)/Std
Now, let’s understand the difference between the two:
Normalization is used when features are of different scales. Standardization is used when we want to ensure zero mean and unit standard deviation.
Normalization squishes n-dimensional data into an n-dimensional unit hypercube. Standardization on the other hand, translates the data to the mean vector of original data to the origin and squishes or expands.
Normalization is useful when we don’t know about the distribution and scales values between [0,1] or [-1,1]. Standardization is used when the feature distribution is normal or Gaussian and is not bounded to a certain range.
An analytic function computes values over a group of rows and returns a single result for each row. This is different from an aggregate function, which returns a single result for a group of rows.
With analytic functions you can compute moving averages, rank items, calculate cumulative sums, and perform other analyses.
RANK gives you the ranking within your ordered position. Ties are assigned the same rank, with the next ranking(s) skipped.
DENSE_RANK again gives you the ranking within your ordered partition, but the ranks are consecutive. No ranks are skipped if there are ranks with multiple items.
Example below.. which is order partitioned on salary :
Due to next rankings skipped in the case of RANKS, generally DENSE_RANK is preferred as it gives proper ranking.
An activation function is an important feature of an artificial neural network which basically decides whether neurons should be activated or not.
All the input Xi’s are multiplied with their weights Wi’s assigned to each link, and summed together along with Bias b. Let Y be the summation of ((Wi*Xi) + b).
The value of Y ranges from -inf to +inf. To build sense into our network, we add activation function(f)-which will decide whether the information passed is useful or not, and based on the result it’ll get fired.
The properties that activation function should hold are:
Derivative or differential: Change in y-axis w.r.t change in x-axis. It is also known as slope.
Monotonic function: A function which is either entirely non-increasing or non-decreasing.
The types of activation function are as follows:
1)Linear Function: For this function, no matter how many layers are present in the neutral network, the output of the first layer is the same as the output of the nth layer.
2)Binary-Step function: Binary step function is a threshold-based activation function which means after a certain threshold neuron is activated and below the said threshold neuron is deactivated.
3)Non-linear function: Modern neural network models use non-linear activation functions. They allow the model to create complex mappings between the network’s inputs and outputs, such as images, video, audio, and data sets that are non-linear or have high dimensionality
Here it means, 20% probability = 20/100 = 1/5
Probability of Seeing a Star in 15 minutes = 1/5
Probability of not seeing a Star in 15 minutes = 1 – 1/5 = 4/5
Probability that you see at least one shooting star in the period of an hour
= 1 – Probability of not seeing any Star in 60 minutes
= 1 – Probability of not seeing any Star in 15 * 4 minutes
= 1 – (4/5)⁴
= 1 – 0.4096
So, the probability of seeing at least one shooting star in a period of an hour is 0.594
The statistical power of a study (sometimes called sensitivity) is how likely the study is to distinguish an actual effect from one of chance.
It’s the likelihood that the test is correctly rejecting the null hypothesis. For example, a study that has an 80% power means that the study has an 80% chance of the test having significant results.
A high statistical power means that the test results are likely valid. As the power increases, the probability of making a Type II error decreases.
A low statistical power means that the test results are questionable.
Statistical power helps you to determine if your sample size is large enough.
It is possible to perform a hypothesis test without calculating the statistical power. If your sample size is too small, your results may be inconclusive when they may have been conclusive if you had a large enough sample.
India’s population in a year – 1.3 bill
Population breakup – Rural – 70% and Urban – 30%
Every year India’s population would grow steadily, but the growth won’t be very fast-paced.
Every man and women would be eventually married (homogeneously or heterogeneously). They won’t prematurely die or prefer not to marry. People would be married only once.
In rural areas the age of marriage (in average) is between 15 – 35 year range. Similarly, in urban areas = 20 – 35 years. India is a young country, and 15 – 35 year range has around 50% of the total population.
Rural population = 70% * 1.3 bill = 900 mill
Population within marriage age in a year = 50% * 900 mill = 450 mill
Number of marriages to happen = 450 / 2 = 225 mill marriages
These people will marry within a 20 year time period according to our assumptions.
Number of rural marriages in a year = 225 mill / 20 = 11.25 mill marriages
Urban population = 30% * 1.3 bill = 400 mill
Population within marriage age in a year = 50% * 400 mill = 200 mill
Number of marriages to happen = 200 / 2 = 100 mill marriages
These people will marry within a 15 year time period according to our assumptions.
Number of urban marriages in a year = 100 mill / 15 = 6.6 mill marriages
Note and caveats
Many people die in accidents prematurely, and won’t marry. In addition, most people don’t marry as well as a consumer preference parameter. So, our market number is over-estimated. Even if we try to normalize it by introducing an error percentage of around 10%, the final number number will be lesser by around 10%-15%.
Answer = Approximately 14 million marriages occur in a year in India
Bayes’ theorem, also known as Bayes’ rule or Bayes’ law, is a theorem in statistics that describes the probability of one event or condition as it relates to another known event or condition.
Mathematically, it’s expressed as the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of a condition.
Let’s understand this theorem with an example:
For instance, say you had a 60% chance of actually having the flu after a flu test, but out of people who had the flu, the test will be false 50% of the time, and the overall population only has a 5% chance of having the flu. Would you actually have a 60% chance of having the flu after having a positive test?
Bayes’ Theorem says no. It says that you have a (.6 * 0.05) (True Positive Rate of a Condition Sample) / (.6*0.05)(True Positive Rate of a Condition Sample) + (.5*0.95) (False Positive Rate of a Population) = 0.0594 or 5.94% chance of getting a flu.
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other.
Naive Bayes uses a similar method to predict the probability of different class based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
When calculating loss we consider only a single data point, then we use the term loss function.
Whereas, when calculating the sum of error for multiple data then we use the cost function.
In other words, the loss function is to capture the difference between the actual and predicted values for a single record whereas cost functions aggregate the difference for the entire training dataset.
The most commonly used loss functions are Mean-squared error and Hinge loss.
Mean-Squared Error(MSE): In simple words, we can say how our model predicted values against the actual values.
MSE = √(predicted value – actual value)2
Hinge loss: It is used to train the machine learning classifier, which is
L(y) = max(0,1- yy)
Where y = -1 or 1 indicates two classes and y represents the output form of the classifier. The most common cost function represents the total cost as the sum of the fixed costs and the variable costs in the equation y = mx + b
There are many cost functions in machine learning and each has its use cases depending on whether it is a regression problem or classification problem.
Regression cost function:
Regression models are used to forecast a continuous variable, such as an employee’s pay, the cost of a car, the likelihood of obtaining a loan, and so on. They are determined as follows depending on the distance-based error:
Error = y-y’
Where, Y – Actual Input and Y’ – Predicted output
There are a number of possible variables that can cause such a discrepancy that I would check to see:
The demographics of iOS and Android users might differ significantly. For example, according to Hootsuite, 43% of females use Instagram as opposed to 31% of men. If the proportion of female users for iOS is significantly larger than for Android then this can explain the discrepancy (or at least a part of it). This can also be said for age, race, ethnicity, location, etc…
Behavioural factors can also have an impact on the discrepancy. If iOS users use their phones more heavily than Android users, it’s more likely that they’ll indulge in Instagram and other apps than someone who spent significantly less time on their phones.
Another possible factor to consider is how Google Play and the App Store differ. For example, if Android users have significantly more apps (and social media apps) to choose from, that may cause greater dilution of users.
Lastly, any differences in the user experience can deter Android users from using Instagram compared to iOS users. If the app is more buggy for Android users than iOS users, they’ll be less likely to be active on the app.
Model Evaluation is a very important part in any analysis to answer the following questions,
How well does the model fit the data? Which predictors are most important? Are the predictions accurate?
So, the following are the criterion to access the model performance,
- Akaike Information Criteria (AIC): In simple terms, AIC estimates the relative amount of information lost by a given model. So, the less information lost the higher the quality of the model. Therefore, we always prefer models with minimum AIC.
- Receiver operating characteristics (ROC curve): ROC curve illustrates the diagnostic ability of a binary classifier. It is calculated/ created by plotting True Positive against False Positive at various threshold settings. The performance metric of ROC curve is AUC (area under curve). Higher the area under the curve, better the prediction power of the model.
- Confusion Matrix: In order to find out how well the model does in predicting the target variable, we use a confusion matrix/ classification rate. It is nothing but a tabular representation of actual Vs predicted values which helps us to find the accuracy of the model.
Deductive reasoning is the form of valid reasoning, to deduce new information or conclusion from known related facts and information. Inductive reasoning arrives at a conclusion by the process of generalization using specific facts or data.
Deductive reasoning uses a top-down approach, whereas inductive reasoning uses a bottom-up approach.
In deductive reasoning, the conclusions are certain, whereas, in Inductive reasoning, the conclusions are probabilistic.
Deductive arguments can be valid or invalid, that means if premises or properties are true, the conclusion must be true. An Inductive argument can be strong or weak, that means conclusion may be false even if premises(properties) are true.
Usage of deductive reasoning is difficult, as we need facts which must be true. Usage of inductive reasoning is fast and easy, as we need evidence instead of true facts.
Selection bias stands for the bias which was introduced by the selection of individuals, groups or data for doing analysis in a way that the proper randomization is not achieved. It ensures that the sample obtained is not representative of the population intended to be analysed and sometimes it is referred to as the selection effect. This is the part of distortion of a statistical analysis which results from the method of collecting samples. If you don’t take selection bias into the account then some conclusions of the study may not be accurate.
The types of selection bias include:
Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.
There are 36 (6*6) outcomes for tossing two fair dices, and the outcomes when two dices sum to 8 are:
(2, 6), (3,5), (4,4), (5,3), (6,2);
The probability of two dice sums to be 8 is 5/36.
For the second part, it is a conditional probability that we are calculating. Assume event A is the event where the sum of the dice is 8, and event B is the first dice is 3. We know that event B’s outcomes are:
(3,1), (3,2), (3,3), (3,4), (3,5), (3,6)
and only (3,5) makes event A happen, thus the probability is 1/6.
We can also solve this using Bayes Theorem and conditional probability:
P(A|B) = P(A intersection B) / P(B)
The difference between P(AB) and P(A|B) is that:
P(AB) is 1/36: out of 36 outcomes, only (3,5) both satisfy event A and event B;
P(A|B) is 1/6: out of 6 outcomes from event B, (3,1), (3,2), (3,3), (3,4), (3,5), (3,6), only one outcome sums to 8 at (3,5), so that P(A|B) is 1/6. (also can be calculated by 1/36 / 1/6 = 1/6)
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning
In a typical reinforcement learning scheme, the agent observes a state S, and choose an action A based on the policy P, and then the environment feeds back the reward R of the action A, and the environment switches to the next state S’. And the process keeps looping until you reach the state DONE. The ultimate goal of reinforcement learning is to maximize the total reward, i.e., the long term sum of rewards.
At each turn for the agent, the agent observes the chess board (state S) and choose a move (action A) based on its learned chess-playing algorithm (policy P). Then the game (environment) feeds back the result of the move (maybe just position changing of a piece, or plus taking a piece of the rival, etc), which corresponds to the reward R, a value pre-defined (usually positive for “good”, negative for “bad”, zero for “neutral” or “we don’t know it’s good or bad”, yet defining a reward function is hard…). Then the game goes on and the move results in a new chess board state (state S’).
With reinforced learning, we don’t have to deal with this problem as the learning agent learns by playing the game. It will make a move (decision), check if it’s the right move (feedback), and keep the outcomes in memory for the next step it takes (learning). There is a reward for every correct decision the system takes and punishment for the wrong one
Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical and redundant to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.
Pruning processes can be divided into two types (pre- and post-pruning).
Pre-pruning procedures prevent a complete induction of the training set by replacing a stop () criterion in the induction algorithm. Pre-pruning methods are considered to be more efficient because they do not induce an entire set, but rather trees remain small from the start
Post-pruning is the most common way of simplifying trees. Here, nodes and subtrees are replaced with leaves to reduce complexity.
The procedures are differentiated on the basis of their approach in the tree (top-down or bottom-up).
Top-down fashion: It will traverse nodes and trim subtrees starting at the root
Bottom-up fashion: It will begin at the leaf nodes
There is a popular pruning algorithm called reduced error pruning, in which starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, the change is kept.
A forecast refers to a calculation or an estimation which uses data from previous events, combined with recent trends to come up a future event outcome.
On the other hand, a prediction is an actual act of indicating that something will happen in the future with or without prior information.
Accuracy: A Forecast is more accurate compared to a prediction. This is because forecasts are derived by analysing a set of past data from the past and present trends.
On the other hand, a prediction can be right or wrong. For example, if you predict the outcome of a football match, the result depends on how well the teams played no matter their recent performance or players.
Bias: Forecasting uses mathematical formulas and as a result, they are free from personal as well as intuition bias.
On the other hand, predictions are in most cases subjective and fatalistic in nature.
Quantification: When using a model to do a forecast, it’s possible to come up with the exact quantity. For example, the World Bank uses economic trends, and the previous GDP values and other inputs to come up with a percentage value for a country’s economic growth.
However, when doing prediction, since there is no data for processing, one can only say whether the economy of a given country will grow or not.
Application: Forecasts are only applicable in the economic and meteorology field where there is a lot of information about the subject matter.
On the contrary, prediction can be applied anywhere as long as there is an expected future outcome.
So, what is regularization,
Regularization is any technique that aims to improve the validation score, sometimes at the cost of reducing the training score.
Some regularization techniques:
L1 tries to minimize the absolute value of the parameters of the model. It produces sparse parameters.
L2 tries to minimize the square value of the parameters of the model. It produces parameters with small values.
Dropout is a technique applied to neural networks that randomly sets some of the neurons’ outputs to zero during training. This forces the network to learn better representations of the data by preventing complex interactions between the neurons: Each neuron needs to learn useful features.
Early stopping will stop training when the validation score stops improving, even when the training score may be improving. This prevents overfitting on the training dataset.
Let’s call the 5-gallon bucket as A and 3-gallon bucket as B:
Fill B completely and empty it into the A.
Now, again fill the B completely and empty it into A.
Since A consists of 3 gallons already, it would accommodate only 2 gallons more. Thus, there would be balance 1 gallon in B
You now empty A and pour the 1 gallon from B into A.
Fill B completely and pour it completely into A which now has 1 + 3 = 4 gallons as required.
The data is split into three different categories while creating a model:
Training set: We use the training set for building the model and adjusting the model’s variables. But we cannot rely on the correctness of the model build on top of the training set. The model might give incorrect outputs on feeding new inputs.
Validation set: We use a validation set to look into the model’s response on top of the samples that don’t exist in the training dataset. Then, we will tune hyperparameters on the basis of the estimated benchmark of the validation data.
When we are evaluating the model’s response using the validation set, we are indirectly training the model with the validation set. This may lead to the overfitting of the model to specific data. So, this model won’t be strong enough to give the desired response to the real-world data.
Test set: The test dataset is the subset of the actual dataset, which is not yet used to train the model. The model is unaware of this dataset. So, by using the test dataset, we can compute the response of the created model on hidden data. We evaluate the model’s performance on the basis of the test dataset.
The model is always exposed to the test dataset after tuning the hyperparameters on top of the validation set.
As we know, the evaluation of the model on the basis of the validation set would not be enough. Thus, we use a test set for computing the efficiency of the model.
Cross-validation is a technique for dividing data between training and validation sets. On typical cross-validation this split is done randomly. But in stratified cross-validation, the split preserves the ratio of the categories on both the training and validation datasets.
For example, if we have a dataset with 10% of category A and 90% of category B, and we use stratified cross-validation, we will have the same proportions in training and validation. In contrast, if we use simple cross-validation, in the worst case we may find that there are no samples of category A in the validation set.
Stratified cross-validation may be applied in the following scenarios:
On a dataset with multiple categories. The smaller the dataset and the more imbalanced the categories, the more important it will be to use stratified cross-validation.
On a dataset with data of different distributions. For example, in a dataset for autonomous driving, we may have images taken during the day and at night. If we do not ensure that both types are present in training and validation, we will have generalization problems.
There are several ways to make a model more robust to outliers, from different points of view (data preparation or model building). An outlier in the question and answer is assumed being unwanted, unexpected, or a must-be-wrong value to the human’s knowledge so far (e.g. no one is 200 years old) rather than a rare event which is possible but rare.
Outliers are usually defined in relation to the distribution. Thus outliers could be removed in the pre-processing step (before any learning step), by using standard deviations (Mean +/- 2*SD), it can be used for normality. Or interquartile ranges Q1 – Q3, Q1 – is the “middle” value in the first half of the rank-ordered data set, Q3 – is the “middle” value in the second half of the rank-ordered data set. It can be used for not normal/unknown as threshold levels.
Moreover, data transformation (e.g. log transformation) may help if data have a noticeable tail. When outliers related to the sensitivity of the collecting instrument which may not precisely record small values, Winsorization may be useful. This type of transformation has the same effect as clipping signals (i.e. replaces extreme data values with less extreme values). Another option to reduce the influence of outliers is using mean absolute difference rather mean squared error.
For model building, some models are resistant to outliers (e.g. tree-based approaches) or non-parametric tests. Similar to the median effect, tree models divide each node into two in each split. Thus, at each split, all data points in a bucket could be equally treated regardless of extreme values they may have.
In supervised learning, we train a model to learn the relationship between input data and output data. We need to have labelled data to be able to do supervised learning.
With unsupervised learning, we only have unlabelled data. The model learns a representation of the data. Unsupervised learning is frequently used to initialize the parameters of the model when we have a lot of unlabelled data and a small fraction of labelled data. We first train an unsupervised model and, after that, we use the weights of the model to train a supervised model.
In reinforcement learning, the model has some input data and a reward depending on the output of the model. The model learns a policy that maximizes the reward. Reinforcement learning has been applied successfully to strategic games such as Go and even classic Atari video games.
Let’s assume that we’re trying to predict renewal rate for Netflix subscription. So our problem statement is to predict which users will renew their subscription plan for the next month.
Next, we must understand the data that is needed to solve this problem. In this case, we need to check the number of hours the channel is active for each household, the number of adults in the household, number of kids, which channels are streamed the most, how much time is spent on each channel, how much has the watch rate varied from last month, etc. Such data is needed to predict whether or not a person will continue the subscription for the upcoming month.
After collecting this data, it is important that you find patterns and correlations. For example, we know that if a household has kids, then they are more likely to subscribe. Similarly, by studying the watch rate of the previous month, you can predict whether a person is still interested in a subscription. Such trends must be studied.
The next step is analysis. For this kind of problem statement, you must use a classification algorithm that classifies customers into 2 groups:
Customers who are likely to subscribe next month
Customers who are not likely to subscribe next month
Would you build predictive models? Yes, in order to achieve this you must build a predictive model that classifies the customers into 2 classes like mentioned above.
Which algorithms to choose? You can choose classification algorithms such as Logistic Regression, Random Forest, Support Vector Machine, etc.
Once you’ve opted the right algorithm, you must perform model evaluation to calculate the efficiency of the algorithm. This is followed by deployment.
Algorithms necessitate features with some specific characteristics to work appropriately. The data is initially in a raw form. You need to extract features from this data before supplying it to the algorithm. This process is called feature engineering. When you have relevant features, the complexity of the algorithms reduces. Then, even if a non-ideal algorithm is used, results come out to be accurate.
Feature engineering primarily has two goals:
Prepare the suitable input data set to be compatible with the machine learning algorithm constraints.
Enhance the performance of machine learning models.
Some of the techniques used for feature engineering include Imputation, Binning, Outliers Handling, Log transform, grouping operations, One-Hot encoding, Feature split, Scaling, Extracting data.
The Backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function is then considered to be a solution to the learning problem.
We need backpropagation because,
Calculate the error – How far is your model output from the actual output.
Minimum Error – Check whether the error is minimized or not.
Update the parameters – If the error is huge then, update the parameters (weights and biases). After that again check the error.
Repeat the process until the error becomes minimum.
Model is ready to make a prediction – Once the error becomes minimum, you can feed some inputs to your model and it will produce the output.
Let’s suppose you are being tested for a disease, if you have the illness the test will end up saying you have the illness. However, if you don’t have the illness- 5% of the times the test will end up saying you have the illness and 95% of the times the test will give accurate result that you don’t have the illness.
Thus there is a 5% error in case you do not have the illness.
Out of 1000 people, 1 person who has the disease will get true positive result.
Out of the remaining 999 people, 5% will also get true positive result.
Close to 50 people will get a true positive result for the disease.
This means that out of 1000 people, 51 people will be tested positive for the disease even though only one person has the illness. There is only a 2% probability of you having the disease even if your reports say that you have the disease.
The main goal of an ads selection component is to narrow down the set of ads that are relevant for a given query. In a search-based system, the ads selection component is responsible for retrieving the top relevant ads from the ads database according to the user and query context.
In a feed-based system, the ads selection component will select the top k relevant ads based more on user interests than search terms.
Here is a general solution to this question. Say we use a funnel-based approach for modelling. It would make sense to structure the ad selection process in these three phases:
Phase 1: Quick selection of ads for the given query and user context according to selection criteria
Phase 2: Rank these selected ads based on a simple and fast algorithm to trim ads.
Phase 3: Apply the machine learning model on the trimmed ads to select the top ones.
Our Popular Data Science Course