Data Scientist Interview QnA
Company: ZS & Legato
1. What does it mean when p-values are high or low?
The p-value is the probability of seeing data at least as extreme as yours if the null hypothesis were true. High p-values indicate that your evidence is not strong enough to suggest an effect exists in the population. An effect might exist, but it’s possible that the effect size is too small, the sample size is too small, or there is too much variability for the hypothesis test to detect it. The smaller the p-value, the stronger the evidence against the null hypothesis, and the more justified you are in rejecting it.
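As a rough illustration of how sample size affects the p-value, the sketch below computes an exact one-sided binomial p-value for a coin-flip experiment. The function name binomial_p_value and the coin example are assumptions made for illustration, not part of the original answer.

```python
from math import comb

def binomial_p_value(heads: int, flips: int) -> float:
    """One-sided p-value: P(X >= heads) if the coin is fair (null hypothesis)."""
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips

# Same observed proportion of heads (60%), different sample sizes.
p_small_sample = binomial_p_value(6, 10)     # 6 heads in 10 flips
p_large_sample = binomial_p_value(60, 100)   # 60 heads in 100 flips

print(f"10 flips:  p = {p_small_sample:.3f}")
print(f"100 flips: p = {p_large_sample:.3f}")
```

With 10 flips the p-value is roughly 0.38, so the evidence is too weak to reject a fair coin; with 100 flips the same proportion gives roughly 0.03, which is much stronger evidence against the null hypothesis.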
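2. Difference between expected value and mean
The mean is the simple average of a set of observed values. The expected value (or expectation) is the probability-weighted average of all the values a random variable can take. As the number of observations grows, the sample mean tends toward the expected value.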
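A minimal sketch of the distinction, assuming a hypothetical loaded die that lands on 6 half of the time. The probabilities are invented for illustration only.

```python
import random

outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]  # assumed loaded die

# Expected value: each outcome weighted by its probability.
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))

# Simple (unweighted) average of the face values.
plain_average = sum(outcomes) / len(outcomes)

# Sample mean of observed rolls approaches the expected value.
rng = random.Random(0)
rolls = rng.choices(outcomes, weights=probabilities, k=10_000)
sample_mean = sum(rolls) / len(rolls)

print(expected_value, plain_average, sample_mean)
```

Here the expected value is 4.5 (up to floating-point rounding), the unweighted average of the faces is 3.5, and the sample mean of many rolls lands close to the expected value rather than to 3.5.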
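3. How are time series problems different from regression problems?
Regression is interpolation: it estimates a target from feature values, typically within the range of the training data, and the order of the observations does not matter. A time series is an ordered sequence of data points, so the order carries information. Time-series models usually forecast what comes next in the series, that is, they extrapolate, much like the childhood puzzles where we continue a pattern.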
4. ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters: the True Positive Rate, TPR = TP / (TP + FN), on the y-axis against the False Positive Rate, FPR = FP / (FP + TN), on the x-axis.
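The points of an ROC curve can be sketched by sweeping a threshold over the model scores. The helper roc_points and the toy labels and scores below are made up for illustration.

```python
def roc_points(labels, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold; a score >= threshold means 'positive'."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((fp / negatives, tp / positives))
    return points

# Toy data (assumed): true labels and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

points = roc_points(labels, scores, thresholds=[0.0, 0.5, 1.1])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A threshold of 0.0 labels everything positive and gives the top-right corner (1, 1); a threshold above every score labels everything negative and gives the origin (0, 0); thresholds in between trace the curve.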
5. Random forest or decision trees: which is better?
It depends on the goal. A random forest is suitable when we have a large dataset and interpretability is not a major concern; averaging many trees usually reduces variance and overfitting compared with a single tree. Decision trees are much easier to interpret and understand. Since a random forest combines multiple decision trees, it becomes more difficult to interpret.
6. Example when a false positive is more important than a false negative
A false positive occurs when a test returns a positive result although the true answer is negative. Some examples of false positives: a pregnancy test is positive when in fact you aren’t pregnant; a cancer screening test comes back positive but you don’t have the disease. A false positive matters more than a false negative when wrongly acting on a positive result is the costlier mistake, for example when an innocent party is found guilty in a trial.
7. What are the different activation functions?
Binary Step Function, Linear Activation Function, Sigmoid/Logistic Activation Function, Tanh Function (Hyperbolic Tangent), ReLU Activation Function.
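The functions listed above can be written out in a few lines each. This is a scalar sketch using the standard textbook definitions; real frameworks apply them element-wise to tensors.

```python
import math

def binary_step(x: float) -> float:
    """1 for non-negative input, 0 otherwise."""
    return 1.0 if x >= 0 else 0.0

def linear(x: float) -> float:
    """Identity: output equals input."""
    return x

def sigmoid(x: float) -> float:
    """Squashes input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    """Squashes input into the range (-1, 1)."""
    return math.tanh(x)

def relu(x: float) -> float:
    """Passes positive input through, clips negative input to 0."""
    return max(0.0, x)

for f in (binary_step, linear, sigmoid, tanh, relu):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```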
8. How do you handle imbalanced data?
Follow these techniques:
Use the right evaluation metrics.
Use K-fold Cross-Validation in the right way.
Ensemble different resampled datasets.
Resample with different ratios.
Cluster the abundant class.
Design your own models.
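One of the resampling ideas above, random oversampling of the rare class, can be sketched as follows. The function random_oversample and the toy data are hypothetical; in practice resampling is usually applied only to the training folds, so that validation data keeps the original class ratio.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(items) for items in by_class.values())

    out_rows, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_rows.extend(items + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data (assumed): 8 examples of class 0, 2 examples of class 1.
rows = list(range(10))
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Changing the target size per class is one way to experiment with different resampling ratios.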
9. Difference between sigmoid and softmax?
The sigmoid function maps a single score to a probability and is used for two-class logistic regression, whereas the softmax function maps a vector of scores to probabilities that sum to 1 and is used for multiclass logistic regression (a.k.a. MaxEnt, multinomial logistic regression, softmax regression, maximum entropy classifier). With two classes, softmax reduces to the sigmoid.
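A small sketch of the relationship between the two functions: softmax over the two scores [z, 0] gives the same probability as sigmoid(z). The example scores are arbitrary.

```python
import math

def sigmoid(z: float) -> float:
    """Probability of the positive class from a single score."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Probabilities over all classes; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

z = 2.0
two_class = softmax([z, 0.0])          # first entry matches sigmoid(z)
three_class = softmax([1.0, 2.0, 3.0])  # sums to 1 across three classes

print(sigmoid(z), two_class[0])
print(three_class, sum(three_class))
```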
10. What are optimizers?
Optimizers are algorithms or methods that change the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss. In other words, they solve an optimization problem by minimizing the loss function. Common examples are gradient descent, SGD, and Adam.
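The simplest optimizer, plain gradient descent, can be sketched on a one-parameter loss. The loss (w - 3)^2, the learning rate, and the step count below are assumptions chosen only to make the behaviour easy to follow.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly move the weight against the gradient of the loss."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Toy loss (assumed): loss(w) = (w - 3)^2, so its gradient is 2 * (w - 3).
def loss_gradient(w):
    return 2 * (w - 3)

w_min = gradient_descent(loss_gradient, w0=0.0)
print(w_min)  # close to 3, the minimizer of the toy loss
```

Optimizers such as SGD with momentum or Adam follow the same loop but change how the update step is computed from the gradients.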
11. Precision-recall trade-off?
The idea behind the precision-recall trade-off is that changing the threshold used to decide whether a prediction is positive or negative tilts the scales. Raising the threshold typically increases precision and decreases recall; lowering it typically does the opposite.
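The trade-off can be seen by scoring the same predictions at two thresholds. The helper precision_recall and the toy labels and scores are invented for illustration.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when a score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    # Convention (assumed): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data (assumed): true labels and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

low_threshold = precision_recall(labels, scores, 0.35)
high_threshold = precision_recall(labels, scores, 0.65)
print("threshold 0.35:", low_threshold)
print("threshold 0.65:", high_threshold)
```

At the lower threshold every positive is found (recall 1.0) at the cost of one false alarm (precision 0.75); at the higher threshold there are no false alarms (precision 1.0) but one positive is missed (recall 2/3).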
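12. Decision tree parameters?
These are commonly tuned parameters for building a decision tree (names as in the scikit-learn DecisionTreeClassifier): min_samples_split (minimum number of samples needed to split a node), min_samples_leaf (minimum number of samples required in a leaf), max_features (number of features considered at each split), and criterion (the measure of split quality, such as gini or entropy).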
13. Bagging and boosting?
Bagging (bootstrap aggregating) decreases the variance of predictions by training several models on bootstrap samples, that is, multisets drawn from the original dataset by sampling with replacement, and then averaging or voting over their outputs. Boosting is an iterative technique that adjusts the weight of each observation based on the previous classification: misclassified observations get more weight, so the next model focuses on them.
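Both ideas can be sketched in a few lines. The bagging part averages a trivial "model" (the sample mean) over bootstrap samples, and the boosting part shows a single AdaBoost-style weight update; the function names and toy inputs are assumptions, and real ensembles would train actual learners at each step.

```python
import math
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (a multiset of the original data)."""
    return [rng.choice(data) for _ in data]

def bagged_mean(data, n_models=50, seed=0):
    """Bagging sketch: fit a trivial model (the mean) per bootstrap sample, then average."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        estimates.append(sum(sample) / len(sample))
    return sum(estimates) / len(estimates)

def boosting_reweight(weights, misclassified):
    """One AdaBoost-style update; assumes the weighted error is strictly between 0 and 1."""
    error = sum(w for w, m in zip(weights, misclassified) if m)
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(alpha if m else -alpha)
               for w, m in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated]

bagged = bagged_mean(list(range(1, 11)))
new_weights = boosting_reweight([0.25] * 4, [True, False, False, False])
print(bagged)
print(new_weights)
```

After the update, the single misclassified observation carries half of the total weight, which is what pushes the next learner to concentrate on it.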