Data Scientist Interview QnA
Company: Larsen and Toubro
1. Ways to avoid overfitting
Some steps that we can take to avoid it:
1. Data augmentation
2. L1/L2 Regularization
3. Remove layers or reduce the number of units per layer
4. Cross-validation
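As an illustration of the cross-validation step, here is a minimal sketch of how k-fold splits can be generated. The helper name k_fold_indices is hypothetical, and the folds are contiguous for simplicity; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Each sample is used for validation exactly once across the k folds.
folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

A model that scores well on the training indices but poorly on the validation indices of most folds is likely overfitting.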
2. Image classification algorithms
Image classification algorithms assign a label to an image based on its visual characteristics. Example: Convolutional Neural Networks (CNNs).
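To give a feel for what a convolutional layer computes, here is a minimal pure-Python sketch of a single 2D convolution with no padding and stride 1. The function name conv2d_valid, the toy image, and the kernel are all assumptions made for illustration. A real CNN learns its kernels and stacks many such layers. Like most deep learning libraries, this sketch does not flip the kernel, so strictly speaking it is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# Toy image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds where brightness increases left to right.
kernel = [[-1, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

The non-zero column in the output marks the vertical edge in the image, which is the kind of feature a classifier can build on.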
3. What will *args return?
The special syntax *args in a function definition is used to pass a variable number of arguments to a function. It collects a non-keyworded, variable-length argument list: the * symbol takes in a variable number of positional arguments, and by convention it is used with the name args. Inside the function, args is a tuple containing the positional arguments that were passed.
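A short example of *args in action. The function names describe_args and total are made up for illustration.

```python
def describe_args(*args):
    # Inside the function, args is a tuple of the positional arguments.
    return type(args).__name__, args


def total(*args):
    # Works for any number of numeric positional arguments, including none.
    return sum(args)


kind, values = describe_args(1, "two", 3.0)
print(kind, values)    # tuple (1, 'two', 3.0)
print(total(1, 2, 3))  # 6
print(total(*[4, 5]))  # 9, a list unpacked into positional arguments
```

Calling the function with no arguments gives an empty tuple rather than an error.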
4. Difference between HAVING and WHERE clauses in SQL.
The WHERE clause filters individual rows of a table based on a specified condition, before any grouping takes place. The HAVING clause filters the groups produced by GROUP BY based on a specified condition, so it can use aggregate functions such as SUM or COUNT.
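The difference can be seen by running both clauses on the same data. This is a small sketch using Python's built-in sqlite3 module. The orders table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 50), ("alice", 200), ("bob", 30), ("bob", 40), ("carol", 500)],
)

# WHERE removes individual rows (amount < 40) before the groups are formed.
where_rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "WHERE amount >= 40 GROUP BY customer ORDER BY customer"
).fetchall()

# HAVING removes whole groups after the aggregate has been computed.
having_rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer HAVING SUM(amount) >= 100 ORDER BY customer"
).fetchall()
conn.close()

print(where_rows)   # bob stays, but only his 40 order is counted
print(having_rows)  # bob is dropped because his total of 70 is below 100
```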
5. How do you handle categorical data?
One-hot encoding is the most common way to handle non-ordinal (nominal) categorical data. It consists of creating an additional binary feature for each category of the categorical feature and marking each observation as belonging (Value=1) or not belonging (Value=0) to that category.
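A minimal sketch of one-hot encoding in plain Python. The helper one_hot_encode and the sample colors are assumptions for illustration. In practice a library encoder would normally be used.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per observation."""
    categories = sorted(set(values))
    encoded = [[1 if v == cat else 0 for cat in categories] for v in values]
    return categories, encoded


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)  # ['blue', 'green', 'red']
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, in the column of the category that the observation belongs to.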
6. What Is Interpolation And Extrapolation?
Interpolation is the process of estimating an unknown value that lies within the range of the known data points, whereas extrapolation is the process of estimating unknown values beyond the range of the given data points.
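A small worked example, assuming a straight line through two known points. The function names and the sample points (1, 10) and (3, 30) are chosen only for illustration. The same formula is used in both cases, and only the position of x relative to the known range differs.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x from the line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


def kind_of_estimate(x, x0, x1):
    # Inside the known range is interpolation, outside it is extrapolation.
    return "interpolation" if x0 <= x <= x1 else "extrapolation"


# Known data points: (1, 10) and (3, 30).
inside = linear_estimate(2, 1, 10, 3, 30)
outside = linear_estimate(5, 1, 10, 3, 30)

print(kind_of_estimate(2, 1, 3), inside)   # interpolation 20.0
print(kind_of_estimate(5, 1, 3), outside)  # extrapolation 50.0
```

Extrapolated values are generally less reliable, because nothing in the data confirms that the trend continues outside the observed range.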
7. SQL joins and Groups
The SQL JOIN clause is used to combine records from two or more tables in a database, based on a related column between them. The GROUP BY statement groups rows that have the same values into summary rows, like “find the number of customers in each country”.
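Both ideas can be tried with Python's built-in sqlite3 module. The customers and orders tables below are invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Asha', 'India'), (2, 'Ben', 'UK'), (3, 'Chen', 'India');
    INSERT INTO orders VALUES (1, 1, 100), (2, 1, 50), (3, 2, 70);
    """
)

# INNER JOIN: combine each order with the customer who placed it.
joined = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# GROUP BY: find the number of customers in each country.
per_country = conn.execute(
    "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
).fetchall()
conn.close()

print(joined)       # [('Asha', 100), ('Asha', 50), ('Ben', 70)]
print(per_country)  # [('India', 2), ('UK', 1)]
```

Chen has no orders, so the inner join leaves that customer out. A LEFT JOIN from customers would keep the row with a NULL amount.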
8. How do you handle null values and which Imputation method is more favorable?
Ways to handle missing values in the dataset:
1. Deleting rows with missing values.
2. Imputing missing values for continuous variables.
3. Imputing missing values for categorical variables.
4. Other imputation methods.
5. Using algorithms that support missing values.
6. Predicting the missing values.
Multiple imputation is more advantageous than single imputation because it generates several completed data sets and captures both the within-imputation and between-imputation variability, so the uncertainty about the missing values is reflected in the final estimates.
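As a simple illustration of imputing continuous and categorical variables, the sketch below fills missing values with the mean and the mode of the observed values. This is single imputation only; multiple imputation is not shown here. The helper impute and the sample columns are assumptions, with None standing in for a missing value.

```python
from statistics import mean, mode


def impute(values, strategy):
    """Replace None with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


ages = [20, None, 40, 30, None]           # continuous variable
cities = ["Pune", "Delhi", None, "Pune"]  # categorical variable

ages_filled = impute(ages, mean)      # mean of 20, 40, 30 is 30
cities_filled = impute(cities, mode)  # most frequent city is "Pune"

print(ages_filled)
print(cities_filled)
```

Filling every gap with one fixed value understates the variability in the data, which is the weakness that multiple imputation is meant to address.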
