Data Scientist Interview QnA
Company: HP & Capital One
1. If training on all the features in the dataset gives an accuracy of 100%, but the accuracy on the validation set is only 75%, what should be looked out for?
Training accuracy being much higher than validation accuracy is a classic sign of overfitting. In this case, try regularization, a less complex model, or another method to reduce overfitting.
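As a minimal illustration (using a synthetic dataset, not the one from the question), a large gap between training and validation accuracy flags overfitting, and stronger L2 regularization (a smaller C in scikit-learn's logistic regression) typically narrows it:

```python
# Minimal sketch (synthetic data): a large train/validation accuracy gap
# signals overfitting; stronger regularization (smaller C) narrows it.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for C in (100.0, 1.0, 0.01):  # smaller C = stronger L2 penalty
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    print(f"C={C}: train={model.score(X_train, y_train):.2f}, "
          f"val={model.score(X_val, y_val):.2f}")
```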
2. How is skewness different from kurtosis?
Skewness measures the asymmetry of a distribution. Kurtosis, on the other hand, measures the tailedness (peakedness) of a distribution curve.
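A minimal sketch computing both with scipy on a synthetic right-skewed sample (names are illustrative):

```python
# Skewness (asymmetry) vs. kurtosis (tailedness/peakedness) on skewed data.
import numpy as np
from scipy.stats import skew, kurtosis

data = np.random.default_rng(0).exponential(scale=2.0, size=10_000)  # right-skewed
print("skewness:", skew(data))      # > 0 -> longer right tail
print("kurtosis:", kurtosis(data))  # excess kurtosis; > 0 -> heavier tails than normal
```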
3. How to calculate the accuracy of a binary classification algorithm using its confusion matrix?
Accuracy is calculated as the number of correct predictions (TP + TN) divided by the total number of predictions (TP + TN + FP + FN).
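A quick sketch of that calculation with scikit-learn's confusion_matrix (toy labels for illustration):

```python
# Accuracy from a confusion matrix: (TP + TN) / (TP + TN + FP + FN).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75
```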
4. How will you measure the Euclidean distance between the two arrays in numpy?
eucl_distance = np.linalg.norm(point_a - point_b), where np stands for numpy.
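A self-contained version of that snippet (the points are illustrative):

```python
# Euclidean distance between two numpy arrays via the L2 norm of their difference.
import numpy as np

point_a = np.array([1.0, 2.0, 3.0])
point_b = np.array([4.0, 6.0, 3.0])
eucl_distance = np.linalg.norm(point_a - point_b)
print(eucl_distance)  # 5.0
```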
5. In a survey conducted, the average height was 164cm with a standard deviation of 15cm. If Alex had a z-score of 1.30, what will be his height?
Since z = (x - mean) / standard deviation, Alex’s height = 164 + 1.30*15 = 183.5 cm.
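The same arithmetic as a one-liner:

```python
# Recover a value from its z-score: x = mean + z * std.
mean_height, std_height, z = 164, 15, 1.30
print(mean_height + z * std_height)  # 183.5
```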
6. How would you build a model to predict credit card fraud?
Use Kaggle’s credit card fraud dataset and start with EDA (Exploratory Data Analysis). Then apply a train/test split over the data and choose a model such as logistic regression, XGBoost, or Random Forest. After hyperparameter tuning and fitting the model, the final step is evaluating its performance.
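A minimal end-to-end sketch of that workflow, using a synthetic imbalanced dataset as a stand-in for the Kaggle data and logistic regression as the model (XGBoost or Random Forest could be swapped in):

```python
# Sketch of the fraud-model workflow: split, fit, evaluate on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```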
7. How would you derive new features from features that already exist?
Feature engineering is applied first to generate additional features, and then feature selection is done to eliminate irrelevant, redundant, or highly correlated features. Common techniques include binning, transformations, and other data manipulation.
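A small pandas sketch of deriving new features from existing ones (the columns are hypothetical): binning a numeric feature and building a ratio feature:

```python
# Derive new features: bin a numeric column and compute a ratio feature.
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 58, 71],
                   "income": [30_000, 52_000, 90_000, 40_000],
                   "debt": [5_000, 20_000, 10_000, 15_000]})
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                       labels=["young", "middle", "senior"])
df["debt_to_income"] = df["debt"] / df["income"]
print(df)
```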
8. If you’re attempting to predict a customer’s gender, and you only have 100 data points, what problems could arise?
Overfitting: with only 100 data points, the model may latch onto patterns specific to this small sample and lose the ability to generalize to new data.
9. Suppose you were given two years of transaction history. What features would you use to predict credit risk?
Following are the features that can be used in such a case (a small aggregation sketch follows the list):
1. Transaction amount
2. Transaction count
3. Transaction frequency
4. Transaction category: bar, grocery, jewelry, etc.
5. Transaction channel: credit card, debit card, international wire transfer, etc.
6. Distance between transaction address and mailing address
7. Fraud / risk score
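As an illustration, a hypothetical transaction table can be aggregated per customer into several of these features with pandas:

```python
# Aggregate a (hypothetical) transaction history into per-customer features.
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [120.0, 45.5, 300.0, 12.0, 78.0],
    "category": ["grocery", "bar", "jewelry", "grocery", "grocery"],
})
features = tx.groupby("customer_id")["amount"].agg(
    transaction_count="count", total_amount="sum", avg_amount="mean")
print(features)
```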
10. Explain overfitting and what steps you can take to prevent it.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the model’s performance on new data. Some steps we can take to avoid it (a small cross-validation sketch follows the list):
1. Data augmentation
2. L1/L2 Regularization
3. Remove layers / reduce the number of units per layer
4. Cross-validation
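A minimal cross-validation sketch on synthetic data (limiting tree depth here acts as a form of regularization); a mean validation score far below training performance flags overfitting:

```python
# k-fold cross-validation as an overfitting check.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
model = RandomForestClassifier(max_depth=3, random_state=0)  # limited depth
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```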
11. Why does SVM need to maximize the margin between support vectors?
Our goal is to maximize the margin because the hyperplane with the maximum margin is the optimal hyperplane. Thus SVM draws the decision boundary so that the separation between the two classes is as wide as possible.
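A small illustration with scikit-learn's linear SVC on toy data: for a linear SVM the margin width is 2 / ||w||, so maximizing the margin is equivalent to minimizing the norm of the weight vector:

```python
# Fit a (nearly) hard-margin linear SVM and report its margin width.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1000).fit(X, y)  # large C ~ hard margin
w = clf.coef_[0]
print("margin width:", 2 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_vectors_))
```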
