## Data Scientist Interview QnA

Company: Capgemini

## 1. Conditions for Overfitting and Underfitting.

If the training accuracy and test accuracy are close, the model has not overfit. If the training accuracy is very good but the test accuracy is poor, the model has overfit. If both the training accuracy and the test accuracy are low, the model has underfit.
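The three conditions above can be sketched as a small helper. The function name `diagnose_fit` and both thresholds (`low_threshold`, `gap_tolerance`) are hypothetical values chosen for illustration; sensible cutoffs depend on the problem.

```python
def diagnose_fit(train_acc, test_acc, low_threshold=0.7, gap_tolerance=0.05):
    """Label a model from its train/test accuracy (thresholds are illustrative assumptions)."""
    # Both scores low: the model has not even learned the training data well.
    if train_acc < low_threshold and test_acc < low_threshold:
        return "underfit"
    # Training score much better than test score: the model memorized the training data.
    if train_acc - test_acc > gap_tolerance:
        return "overfit"
    # Scores are close to each other and not both low.
    return "good fit"


print(diagnose_fit(0.99, 0.70))  # large train/test gap
print(diagnose_fit(0.60, 0.58))  # both low
print(diagnose_fit(0.91, 0.89))  # close and reasonably high
```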

## 2. What does it mean when the p-values are high and low?

The p-value is a number between 0 and 1, interpreted as follows. A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it. p-values very close to the cutoff (0.05) are considered marginal (could go either way).
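As a minimal sketch of this decision rule, the snippet below runs a two-sample t-test with `scipy.stats.ttest_ind` and compares the p-value to a 0.05 cutoff. The sample values and the helper name `decide` are made up for illustration.

```python
from scipy import stats


def decide(sample_a, sample_b, alpha=0.05):
    """Return the p-value of a two-sample t-test and the resulting decision."""
    p_value = stats.ttest_ind(sample_a, sample_b).pvalue
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    return p_value, decision


# Hypothetical measurements: group_b is clearly shifted, group_c is not.
group_a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
group_b = [12.0, 11.8, 12.3, 12.1, 11.9, 12.2]
group_c = [10.0, 10.2, 9.9, 10.1, 10.3, 9.8]

p_ab, decision_ab = decide(group_a, group_b)
p_ac, decision_ac = decide(group_a, group_c)
print(p_ab, decision_ab)
print(p_ac, decision_ac)
```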

## 3. What are the differences between correlation and covariance?

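Covariance is a measure of the extent to which two random variables change in tandem. Its sign gives the direction of the linear relationship, but its magnitude depends on the units of the variables. Correlation is a measure of how strongly two random variables are related to each other. Correlation is the scaled form of covariance: the covariance divided by the product of the two standard deviations, so it always lies between -1 and 1 and has no units.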
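A short NumPy sketch of the relationship between the two measures. The data values are invented for illustration; the point is that rescaling a variable changes the covariance but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical data: hours studied and exam score.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([52.0, 55.0, 61.0, 64.0, 70.0])

# Off-diagonal entries of the 2x2 matrices hold the pairwise values.
cov_xy = np.cov(hours, score)[0, 1]
corr_xy = np.corrcoef(hours, score)[0, 1]

# Correlation is covariance divided by the product of the standard deviations
# (ddof=1 matches the sample covariance that np.cov computes by default).
manual_corr = cov_xy / (hours.std(ddof=1) * score.std(ddof=1))

# Changing units (hours -> minutes) scales the covariance by 60
# but does not change the correlation.
minutes = hours * 60
cov_scaled = np.cov(minutes, score)[0, 1]
corr_scaled = np.corrcoef(minutes, score)[0, 1]

print(cov_xy, cov_scaled)
print(corr_xy, corr_scaled)
```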

## 4. How to create a sparse matrix in Python?

Python’s SciPy (the scipy.sparse module) provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix. The function csr_matrix() creates a sparse matrix in compressed sparse row format, whereas csc_matrix() creates a sparse matrix in compressed sparse column format.
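A minimal sketch of both routes, converting a dense array and building directly from coordinates. The matrix values are arbitrary examples.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

# A small dense matrix that is mostly zeros.
dense = np.array([
    [0, 0, 3],
    [4, 0, 0],
    [0, 0, 0],
])

# Convert the dense matrix to the two compressed formats.
sparse_csr = csr_matrix(dense)  # compressed sparse row
sparse_csc = csc_matrix(dense)  # compressed sparse column

# Build the same matrix directly from (value, (row, col)) triplets,
# which avoids materializing the dense array at all.
rows = np.array([0, 1])
cols = np.array([2, 0])
values = np.array([3, 4])
from_coords = csr_matrix((values, (rows, cols)), shape=(3, 3))

print(sparse_csr.nnz)         # number of stored non-zero entries
print(from_coords.toarray())  # back to a dense array for inspection
```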

## 5. p-value?

The p-value is the probability of observing a difference at least as extreme as the one actually observed, assuming the null hypothesis is true (that is, by random chance alone). The lower the p-value, the greater the statistical significance of the observed difference. The p-value can be used as an alternative to, or in addition to, pre-selected confidence levels for hypothesis testing.
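One way to make the "by random chance" idea concrete is a permutation test: shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. This is a sketch of one possible approach, not the only way to compute a p-value; the function name, sample data, and number of permutations are assumptions.

```python
import numpy as np


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means by label shuffling."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        # Under the null hypothesis the group labels are interchangeable.
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            count += 1
    return count / n_permutations


# Hypothetical measurements for two clearly separated groups.
control = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
treated = [12.0, 11.8, 12.3, 12.1, 11.9, 12.2]
p_value = permutation_p_value(control, treated)
print(p_value)
```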

## 6. Interpolation and Extrapolation?

Interpolation is the process of estimating an unknown value that lies between known data points, whereas extrapolation is the process of estimating unknown values beyond the range of the known data points.
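A small NumPy sketch of the difference, using made-up points that lie on the line y = 2x. Interpolation uses `np.interp`; for extrapolation the sketch assumes a straight-line model fitted with `np.polyfit`, which is only one possible choice.

```python
import numpy as np

# Known data points (hypothetical), lying on y = 2x.
x_known = np.array([1.0, 2.0, 3.0, 4.0])
y_known = np.array([2.0, 4.0, 6.0, 8.0])

# Interpolation: x = 2.5 lies inside the known range [1, 4].
y_inside = np.interp(2.5, x_known, y_known)

# np.interp does not extrapolate: outside the range it returns the edge value.
y_clamped = np.interp(6.0, x_known, y_known)

# Extrapolation: fit a line to the known points and evaluate it at x = 6,
# which lies beyond the known range.
slope, intercept = np.polyfit(x_known, y_known, deg=1)
y_outside = slope * 6.0 + intercept

print(y_inside, y_clamped, y_outside)
```

Extrapolated values depend entirely on the assumed model, so they are generally less reliable than interpolated ones.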

## 7. Uniform distribution & normal distribution?

The normal distribution is bell-shaped, which means values near the center of the distribution are more likely to occur than values in the tails. The uniform distribution is rectangular-shaped, which means every value in its range is equally likely to occur.
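The two shapes can be checked empirically by sampling. The sketch below draws from a standard normal and from a uniform distribution on [0, 1]; the sample size, seed, and bin count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

normal_samples = rng.normal(loc=0.0, scale=1.0, size=n)
uniform_samples = rng.uniform(low=0.0, high=1.0, size=n)

# Bell shape: roughly 68% of normal samples fall within one standard
# deviation of the mean, so the center is much denser than the tails.
normal_central_share = np.mean(np.abs(normal_samples) < 1.0)

# Rectangular shape: each of 10 equal-width bins holds about 10% of the samples.
bin_counts, _ = np.histogram(uniform_samples, bins=10, range=(0.0, 1.0))
uniform_bin_shares = bin_counts / n

print(normal_central_share)
print(uniform_bin_shares)
```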

## 8. Recommender Systems?

A recommender system mainly deals with the likes and dislikes of users. Its major objective is to recommend an item that a particular user is likely to like or need, based on that user's previous purchases. It is like having a personalized team that understands our likes and dislikes and helps us decide on a particular item without bias, by making use of the large amounts of data that are generated and stored in repositories every day.
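As an illustration of recommending from previous purchases, here is a minimal sketch of item-based collaborative filtering with cosine similarity. This is one common technique chosen for the example, not something the answer above prescribes; the purchase matrix, item names, and the `recommend` helper are all hypothetical.

```python
import numpy as np

items = ["laptop", "mouse", "keyboard", "monitor"]

# Rows are users, columns are items; 1 means the user bought the item.
purchases = np.array([
    [1, 1, 0, 0],  # user 0
    [1, 1, 1, 0],  # user 1
    [0, 0, 1, 1],  # user 2
], dtype=float)

# Cosine similarity between item columns: items bought by the same
# users end up with a high similarity score.
norms = np.linalg.norm(purchases, axis=0)
item_similarity = (purchases.T @ purchases) / np.outer(norms, norms)


def recommend(user_index):
    """Return the unpurchased item most similar to what the user already bought."""
    scores = purchases[user_index] @ item_similarity
    # Never recommend something the user already owns.
    scores[purchases[user_index] > 0] = -np.inf
    return items[int(np.argmax(scores))]


print(recommend(0))  # keyboard
```

User 0 bought a laptop and a mouse. User 1 bought both of those and a keyboard, so the keyboard scores highest for user 0.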

## 9. JOIN function in SQL

The SQL JOIN clause is used to combine records from two or more tables in a database, based on a related column between them.
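A minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their rows are invented for illustration. It contrasts an INNER JOIN (only matching rows) with a LEFT JOIN (all rows from the left table).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (1, 1, 'book'), (2, 1, 'pen'), (3, 2, 'lamp');
""")

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

# LEFT JOIN keeps every customer; missing order columns come back as NULL (None).
left_rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```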

## 10. Squared error and absolute error?

Mean squared error (MSE) and mean absolute error (MAE) are used to evaluate the accuracy of regression models. The squared error is differentiable everywhere, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to the techniques of mathematical optimization.
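A short NumPy sketch of both metrics and their gradients with respect to the predictions. The target and prediction values are made up for illustration.

```python
import numpy as np

# Hypothetical targets and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 4.0])

errors = y_pred - y_true
n = len(y_true)

mse = np.mean(errors ** 2)     # (1 + 0 + 4 + 9) / 4 = 3.5
mae = np.mean(np.abs(errors))  # (1 + 0 + 2 + 3) / 4 = 1.5

# The MSE gradient is smooth and shrinks as the error approaches zero.
grad_mse = 2 * errors / n

# The MAE gradient is the sign of the error. At an error of exactly 0 the
# derivative is undefined; np.sign returns 0 there, which is one common convention.
grad_mae = np.sign(errors) / n

print(mse, mae)
print(grad_mse)
print(grad_mae)
```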