Data Scientist Interview QnA
Company: Capgemini
1. Conditions for Overfitting and Underfitting.
If the training accuracy and the test accuracy are close to each other and both are reasonably high, the model has not overfit. If the training result is very good but the test result is poor, the model has overfitted. If both the training accuracy and the test accuracy are low, the model has underfit.
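The rules above can be sketched as a small helper. This is only an illustration: the function name `diagnose_fit` and the two thresholds (`gap_tol`, `low_acc`) are assumptions chosen for the example, not standard values, and should be tuned per problem.

```python
def diagnose_fit(train_acc, test_acc, gap_tol=0.05, low_acc=0.70):
    """Classify a model's fit from its training and test accuracy.

    gap_tol and low_acc are illustrative assumptions, not standard cutoffs.
    """
    # Both scores low: the model has not even learned the training data.
    if train_acc < low_acc and test_acc < low_acc:
        return "underfit"
    # Training much better than test: the model memorized the training data.
    if train_acc - test_acc > gap_tol:
        return "overfit"
    # Scores are close and not low.
    return "good fit"


print(diagnose_fit(0.99, 0.72))
print(diagnose_fit(0.62, 0.60))
print(diagnose_fit(0.91, 0.89))
```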
2. What does it mean when the p-values are high and low?
The p-value is a number between 0 and 1 and interpreted in the following way: A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis. p-values very close to the cutoff (0.05) are considered to be marginal (could go either way).
3. What are the differences between correlation and covariance?
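Covariance is a measure to indicate the extent to which two random variables change in tandem. Correlation is a measure used to represent how strongly two random variables are related to each other. Covariance shows the direction of the linear relationship, but its magnitude depends on the units of the variables. Correlation is the scaled form of covariance: the covariance divided by the product of the two standard deviations, so it has no units and always lies between -1 and 1.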
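A minimal sketch of the difference, written in plain Python. The helper names and the height/weight numbers are made up for illustration. Changing the units of one variable changes the covariance but leaves the correlation untouched.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: the same heights in metres and in centimetres.
heights_m = [1.5, 1.6, 1.7, 1.8]
heights_cm = [150, 160, 170, 180]
weights_kg = [50, 58, 69, 80]

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```

The covariance in centimetres is 100 times the covariance in metres, while the two correlations agree.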
4. How to create a sparse Matrix in Python?
Python's SciPy library provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix. The function csr_matrix() is used to create a sparse matrix in compressed sparse row format, whereas csc_matrix() is used to create a sparse matrix in compressed sparse column format.
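A short sketch of both routes, assuming NumPy and SciPy are installed. The 3x3 matrix is an arbitrary example.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

# A small dense matrix that is mostly zeros.
dense = np.array([
    [0, 0, 3],
    [4, 0, 0],
    [0, 5, 0],
])

# Convert the dense matrix to compressed sparse row / column format.
csr = csr_matrix(dense)
csc = csc_matrix(dense)

# Build the same matrix directly from (value, (row, column)) triples.
data = [3, 4, 5]
rows = [0, 1, 2]
cols = [2, 0, 1]
from_coords = csr_matrix((data, (rows, cols)), shape=(3, 3))

print(csr.nnz)        # number of stored non-zero entries
print(csr.toarray())  # convert back to a dense NumPy array
```

CSR is usually the better fit for row slicing and matrix-vector products, while CSC suits column slicing.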
5. p-value?
The p-value is the probability of observing a difference at least as extreme as the one actually observed, assuming the null hypothesis is true (that is, assuming only random chance is at work). The lower the p-value, the greater the statistical significance of the observed difference. The p-value can be used as an alternative to, or in addition to, pre-selected confidence levels for hypothesis testing.
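To make this concrete, here is a minimal sketch of an exact two-sided p-value for a coin-flip experiment, where the null hypothesis is that the coin is fair. The function name `binomial_p_value` and the example of 9 heads in 10 flips are assumptions made for illustration.

```python
from math import comb


def binomial_p_value(heads, flips):
    """Two-sided exact p-value for H0: the coin is fair (p = 0.5)."""
    expected = flips / 2
    observed_dev = abs(heads - expected)
    # Count every outcome at least as far from the expected value as the observed one.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_dev
    )
    # Under a fair coin, each of the 2**flips sequences is equally likely.
    return extreme / 2 ** flips


p = binomial_p_value(9, 10)
print(p)          # 22 / 1024
print(p <= 0.05)  # compare against the usual 0.05 cutoff
```

Here the outcomes with 0, 1, 9, or 10 heads are at least as extreme as 9 heads, which gives 22 of the 1024 equally likely sequences.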
6. Interpolation and Extrapolation?
Interpolation is the process of estimating an unknown value that lies within the range of the known data points, whereas extrapolation is the process of estimating unknown values beyond the range of the given data points.
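A minimal sketch using straight-line segments. The function name `linear_estimate` and the sample points are hypothetical, and linear estimation is only one of many possible methods. Inside the known range the function interpolates; outside it, it extends the nearest end segment, which is extrapolation.

```python
def linear_estimate(x, xs, ys):
    """Estimate y at x from known points (xs must be sorted ascending)."""
    if x <= xs[0]:
        i = 0                      # left of the data: extend the first segment
    elif x >= xs[-1]:
        i = len(xs) - 2            # right of the data: extend the last segment
    else:
        # Inside the data: pick the segment whose left end is closest below x.
        i = max(j for j in range(len(xs) - 1) if xs[j] <= x)
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (x - xs[i])


known_x = [1, 2, 3]
known_y = [10, 20, 40]

print(linear_estimate(2.5, known_x, known_y))  # interpolation: inside [1, 3]
print(linear_estimate(4, known_x, known_y))    # extrapolation: beyond 3
```

Extrapolated values are generally less reliable, because nothing in the data confirms that the trend continues outside the observed range.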
7. Uniform distribution & normal distribution?
The normal distribution is bell-shaped, which means values near the center of the distribution are more likely to occur than values in the tails of the distribution. The uniform distribution is rectangular-shaped, which means every value in the distribution is equally likely to occur.
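The two shapes can be seen directly from their density functions. The sketch below uses the standard formulas for a continuous uniform distribution on [a, b] and a normal distribution with mean `mu` and standard deviation `sigma`. The function names are chosen for this example.

```python
import math


def uniform_pdf(x, a, b):
    """Density of the continuous uniform distribution on [a, b]."""
    return 1 / (b - a) if a <= x <= b else 0.0


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Uniform: the density is flat across the whole interval.
print(uniform_pdf(0.1, 0, 1), uniform_pdf(0.9, 0, 1))

# Normal: the density is highest at the mean and falls off toward the tails.
print(normal_pdf(0), normal_pdf(1), normal_pdf(2))
```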
8. Recommender Systems?
A recommender system mainly deals with the likes and dislikes of users. Its main objective is to recommend items that a user is likely to like or need, based on that user's previous purchases. It is like having a personalized team that understands our likes and dislikes and helps us decide on a particular item without bias, by making use of the large amount of data that is generated day by day and stored in repositories.
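As a rough illustration of recommending from previous purchases, the sketch below scores items bought by other users, weighted by how much their purchase history overlaps with the target user's (Jaccard similarity). This is one simple collaborative-filtering idea among many. The function name, the users, and the items are all hypothetical.

```python
def recommend(target, purchases, top_n=2):
    """Recommend items for `target` from a dict of user -> set of purchased items."""
    owned = purchases[target]
    scores = {}
    for user, items in purchases.items():
        if user == target:
            continue
        union = owned | items
        # Jaccard similarity: shared items divided by all items of both users.
        similarity = len(owned & items) / len(union) if union else 0.0
        if similarity == 0:
            continue  # ignore users with nothing in common
        for item in items - owned:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


purchases = {
    "ana": {"laptop", "mouse"},
    "ben": {"laptop", "mouse", "keyboard"},
    "cy": {"mouse", "monitor"},
    "dee": {"novel"},
}

print(recommend("ana", purchases))
```

In this toy data, ben shares the most purchases with ana, so the item only ben owns is ranked first.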
9. JOIN function in SQL
The SQL JOIN clause is used to combine records from two or more tables in a database, based on a related column between them.
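A small runnable sketch using Python's built-in sqlite3 module with an in-memory database. The `customers` and `orders` tables and their rows are made up for the example. An INNER JOIN keeps only matching rows, while a LEFT JOIN also keeps customers that have no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi'), (3, 'Meera');
    INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen'), (12, 2, 'lamp');
""")

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# LEFT JOIN: every customer, with NULL (None in Python) where no order matches.
left_rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

conn.close()

print(inner_rows)
print(left_rows)
```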
10. Squared Error and Absolute Error?
Mean squared error (MSE) and mean absolute error (MAE) are used to evaluate the accuracy of a regression model. The squared error is everywhere differentiable, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to the techniques of mathematical optimization.
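A minimal sketch of both metrics and their derivatives with respect to the residual (prediction minus target). The function names and sample values are chosen for illustration. Returning 0.0 for the absolute-error derivative at a zero residual is a common convention, not a true derivative.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def squared_error_grad(residual):
    """Derivative of residual**2: smooth and defined everywhere."""
    return 2 * residual


def absolute_error_grad(residual):
    """Derivative of |residual|: jumps from -1 to +1 at 0, where it is undefined."""
    if residual > 0:
        return 1.0
    if residual < 0:
        return -1.0
    return 0.0  # convention only; the derivative does not exist here


y_true = [3, 5, 8]
y_pred = [2, 5, 12]

print(mse(y_true, y_pred))  # (1 + 0 + 16) / 3
print(mae(y_true, y_pred))  # (1 + 0 + 4) / 3
```

Because squaring magnifies large errors, MSE is also more sensitive to outliers than MAE.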
