CloudyML

Data Scientist Interview QnA
Company: Capgemini

1. Conditions for Overfitting and Underfitting.

If the training accuracy and test accuracy are both high and close to each other, the model has not overfit. If the training result is very good but the test result is poor, the model has overfitted. If both the training accuracy and the test accuracy are low, the model has underfit.
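The three conditions above can be written as a small check. This is only a sketch: the function name `diagnose_fit` and the thresholds `gap_tol` and `low_acc` are assumptions chosen for illustration, not standard values.

```python
def diagnose_fit(train_acc, test_acc, gap_tol=0.05, low_acc=0.70):
    """Label a model's fit from its training and test accuracy.

    gap_tol and low_acc are illustrative thresholds (assumptions);
    sensible values depend on the problem.
    """
    # Both scores low: the model has not learned the pattern at all.
    if train_acc < low_acc and test_acc < low_acc:
        return "underfit"
    # Training much better than test: the model memorised the training data.
    if train_acc - test_acc > gap_tol:
        return "overfit"
    # Scores are close and not low.
    return "good fit"


print(diagnose_fit(0.99, 0.72))
print(diagnose_fit(0.61, 0.58))
print(diagnose_fit(0.91, 0.89))
```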

2. What does it mean when the p-values are high and low?

The p-value is a number between 0 and 1 and is interpreted in the following way: A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis. p-values very close to the cutoff (0.05) are considered marginal (could go either way).

3. What are the differences between correlation and covariance?

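Covariance is a measure of the extent to which two random variables change in tandem. Correlation is a measure of how strongly two random variables are linearly related to each other. Covariance shows the direction of the relationship, but its magnitude depends on the units of the variables. Correlation is the scaled form of covariance: the covariance divided by the product of the two standard deviations, so it is unit-free and always lies between -1 and 1.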
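A short sketch in plain Python can make the scaling point concrete. The helper names and the height/weight numbers are made up for illustration. Changing the unit of one variable changes the covariance but leaves the correlation untouched.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Illustrative data (hypothetical values).
heights_cm = [150.0, 160.0, 170.0, 180.0]
weights_kg = [52.0, 58.0, 69.0, 77.0]
heights_m = [h / 100 for h in heights_cm]

# Covariance shrinks by a factor of 100 when heights are in metres...
print(covariance(heights_cm, weights_kg), covariance(heights_m, weights_kg))
# ...while correlation stays the same.
print(correlation(heights_cm, weights_kg), correlation(heights_m, weights_kg))
```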

4. How to create a sparse Matrix in Python?

Python's SciPy provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix. The function csr_matrix() creates a sparse matrix in compressed sparse row format, whereas csc_matrix() creates a sparse matrix in compressed sparse column format.
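As a minimal sketch, assuming NumPy and SciPy are installed, the two constructors can be used as shown here. The small matrix is made-up example data.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

# A mostly-zero dense matrix (illustrative values).
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 5]])

# Convert the dense matrix to compressed sparse row / column format.
csr = csr_matrix(dense)
csc = csc_matrix(dense)

# Build the same matrix directly from (value, (row, column)) triplets,
# which avoids ever materialising the dense array.
rows = np.array([0, 1, 2])
cols = np.array([2, 0, 2])
vals = np.array([3, 4, 5])
from_triplets = csr_matrix((vals, (rows, cols)), shape=(3, 3))

print(csr.nnz)                  # number of stored non-zero entries
print(from_triplets.toarray())  # back to a dense NumPy array
```

CSR tends to suit row slicing and matrix-vector products, while CSC tends to suit column slicing. Which one fits best depends on the access pattern.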

5. p-value?

The p-value is the probability of observing a difference at least as large as the one actually seen, assuming the null hypothesis is true, that is, assuming the difference arose just by random chance. The lower the p-value, the greater the statistical significance of the observed difference. The p-value can be used as an alternative to, or in addition to, pre-selected significance levels for hypothesis testing.
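One way to see "could have occurred just by random chance" in action is a permutation test. The group labels are shuffled many times, and the p-value is the share of shuffles that produce a difference at least as large as the observed one. This is a sketch of one possible test, not the only way to obtain a p-value. The function name, the data and the number of shuffles are assumptions for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        new_a, new_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical measurements for two groups.
p_different = permutation_p_value([10, 11, 12, 13, 14, 15], [20, 21, 22, 23, 24, 25])
p_same = permutation_p_value([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6])

alpha = 0.05  # conventional cutoff
print("reject null" if p_different <= alpha else "fail to reject null")
print("reject null" if p_same <= alpha else "fail to reject null")
```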

6. Interpolation and Extrapolation?

Interpolation is the process of estimating an unknown value that lies between known data points, whereas extrapolation is the process of estimating unknown values beyond the range of the given data points.
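A minimal sketch using a straight line through two known points shows the difference. The function names and the sample points are assumptions for illustration, and real data may call for a different curve.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


def estimate_kind(x, x0, x1):
    """Say whether x falls inside or outside the known range [x0, x1]."""
    return "interpolation" if x0 <= x <= x1 else "extrapolation"


# Known points (hypothetical): x = 10 gives y = 100, x = 20 gives y = 200.
for x in (15, 30):
    print(x, estimate_kind(x, 10, 20), linear_estimate(x, 10, 100, 20, 200))
```

Extrapolated values are usually less trustworthy, because nothing in the data confirms that the trend continues outside the observed range.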

7. Uniform distribution & normal distribution?

The normal distribution is bell-shaped, which means values near the center of the distribution are more likely to occur than values in the tails. The uniform distribution is rectangular-shaped, which means every value in the distribution's range is equally likely to occur.
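The two shapes can be compared by drawing samples and counting how many fall in the middle band. The sketch below uses only the standard library. The sample size, seed and band edges are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)
n = 50_000

# Standard normal (mean 0, standard deviation 1) and uniform on [-3, 3].
normal_samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
uniform_samples = [rng.uniform(-3.0, 3.0) for _ in range(n)]


def share_within(samples, low, high):
    """Fraction of samples that fall in the interval [low, high]."""
    return sum(low <= s <= high for s in samples) / len(samples)


# Normal: the central band holds most of the mass (about 68% within one
# standard deviation of the mean).
normal_center = share_within(normal_samples, -1.0, 1.0)
# Uniform: the central band is one third of the range, so it holds
# about one third of the samples, the same as any other band of equal width.
uniform_center = share_within(uniform_samples, -1.0, 1.0)
uniform_right = share_within(uniform_samples, 1.0, 3.0)

print(normal_center, uniform_center, uniform_right)
```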

8. Recommender Systems?

A recommender system mainly deals with the likes and dislikes of users. Its major objective is to recommend to a user an item that the user has a high chance of liking or needing, based on the user's previous purchases. It is like having a personalized team that understands our likes and dislikes and helps us make decisions about a particular item, without bias, by making use of the large amount of data that accumulates in the repositories day by day.
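As a toy sketch of recommending from previous purchases, the function below suggests items bought by users whose purchase history overlaps with the target user's. This is one simple approach chosen for illustration, not a description of any production system. The data and the function name `recommend` are hypothetical.

```python
from collections import Counter


def recommend(purchases, user, top_n=2):
    """Suggest items that similar users bought and `user` has not."""
    owned = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        # Similarity = number of items both users have bought.
        overlap = len(owned & items)
        if overlap == 0:
            continue
        # Each unseen item is weighted by how similar its buyer is.
        for item in items - owned:
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "alice": {"laptop", "mouse"},
    "bob": {"laptop", "mouse", "keyboard"},
    "carol": {"laptop", "monitor"},
    "dave": {"novel", "bookmark"},
}

print(recommend(purchases, "alice"))
```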

9. JOIN function in SQL

The SQL JOIN clause is used to combine records from two or more tables in a database, based on a related column between them.
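A small sketch using Python's built-in sqlite3 module shows two common joins on an in-memory database. The table names, column names and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 40.0), (12, 2, 99.0);
""")

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# LEFT JOIN: every customer; order columns are NULL (None) when there is no match.
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

conn.close()
print(inner_rows)
print(left_rows)
```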

10. Squared error and absolute error?

Mean squared error (MSE) and mean absolute error (MAE) are both used to evaluate the accuracy of a regression model. The squared error is differentiable everywhere, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to the techniques of mathematical optimization.
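The two metrics and their derivatives can be sketched in a few lines of plain Python. The function names and the sample values are assumptions for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def squared_error_grad(e):
    """Derivative of e**2 with respect to e; defined for every e."""
    return 2 * e


def absolute_error_subgrad(e):
    """Sign of e. The derivative of |e| is undefined at e = 0, so returning 0
    there is a convention (a subgradient), not a true derivative."""
    return (e > 0) - (e < 0)


# Hypothetical targets and predictions; the last prediction is far off.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]

print(mse(y_true, y_pred))  # 4.5
print(mae(y_true, y_pred))  # 1.5
```

The single large error (4) contributes 16 to the squared total but only 4 to the absolute total, which is why MSE is more sensitive to outliers than MAE.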
