
Google Data Science Interview Questions

1. What are the Assumptions of Error in Linear Regression?

Answer:

  • Independence of Errors: The error terms should be independent of one another, with no correlation between consecutive errors (no autocorrelation). In time series data, this is often tested with the Durbin-Watson statistic.
  • Homoscedasticity: The variance of the error terms should remain constant across all levels of the independent variables. Under heteroscedasticity, the OLS coefficient estimates stay unbiased but are no longer efficient, and the usual standard errors become unreliable.
  • Normality of Errors: The error terms should be normally distributed. This matters chiefly for inference: t-tests on coefficients, confidence intervals, and p-values all rely on it, especially in small samples.
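
These checks are straightforward to run in practice. Below is a minimal sketch on synthetic data, using statsmodels and scipy, of the three diagnostics mentioned above (Durbin-Watson for autocorrelation, Breusch-Pagan for heteroscedasticity, Shapiro-Wilk for normality); in a real setting, X and y would come from your dataset.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))   # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=200)

resid = sm.OLS(y, X).fit().resid

# Durbin-Watson: values near 2 suggest no first-order autocorrelation.
print("Durbin-Watson:", durbin_watson(resid))
# Breusch-Pagan: a large p-value is consistent with homoscedasticity.
print("Breusch-Pagan p-value:", het_breuschpagan(resid, X)[1])
# Shapiro-Wilk: a large p-value is consistent with normal residuals.
print("Shapiro-Wilk p-value:", shapiro(resid).pvalue)
```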

2. What is the Function of P-Values in High-Dimensional Linear Regression?

Answer:

P-values test the null hypothesis that a regression coefficient (for a given predictor) is zero. A low p-value means an estimate this large would be unlikely if the true coefficient were zero, so the predictor is considered statistically significant.

In high-dimensional models:

  • Testing many predictors at once inflates the chance of Type I errors (false positives), so some predictors may appear significant purely by chance. Multiple-testing adjustments such as the Bonferroni correction or false discovery rate (FDR) procedures like Benjamini-Hochberg counteract this; see the sketch after this list.
  • High-dimensional data often exhibits multicollinearity (highly correlated predictors), which inflates coefficient variances and makes p-values unstable. Removing or combining correlated features, or using regularization such as ridge, can stabilize the estimates.
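
As a minimal sketch of the adjustments above, the p-values here are hypothetical; in practice they would come from a fitted model (e.g., results.pvalues in statsmodels):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.04, 0.30, 0.0005, 0.09])

# Family-wise error control (conservative).
reject_bonf, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
# False discovery rate control (Benjamini-Hochberg), usually less strict.
reject_fdr, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("Bonferroni keeps:", reject_bonf)
print("Benjamini-Hochberg keeps:", reject_fdr)
```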

3. How Would You Encode a Categorical Variable with Thousands of Distinct Values?

Answer:

  • Leave-One-Out Encoding: This variation of target encoding computes, for each row, the mean of the target over its category while excluding the current observation, which avoids direct target leakage (see the pandas sketch after this list).
    • Pros: Reduces target leakage and works well with high-cardinality features.
    • Cons: Computationally more expensive than simple target encoding.
  • Embedding-Based Encoding: For very high-cardinality features, embedding-based approaches can be effective. This technique uses a neural network to learn a dense vector representation of each category.
    • Pros: Captures latent structures in data.
    • Cons: More complex to implement.
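
A minimal pandas sketch of leave-one-out encoding, with made-up column names ("city", "target") for illustration; in practice, libraries such as category_encoders provide a ready-made LeaveOneOutEncoder:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["a", "a", "b", "b", "b", "c"],
    "target": [1,   0,   1,   1,   0,   1],
})

grp = df.groupby("city")["target"]
sums = grp.transform("sum")
counts = grp.transform("count")

# Mean target of each row's category, excluding the row itself;
# singleton categories (count == 1) fall back to the global mean.
loo = (sums - df["target"]) / (counts - 1)
df["city_loo"] = loo.fillna(df["target"].mean())
print(df)
```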

4. Describe How PCA Works

Answer:

PCA (Principal Component Analysis) is a dimensionality reduction technique, useful when dealing with correlated features or noisy data, or when visualizing data in fewer dimensions.

Steps to perform PCA:

  1. Normalize Features: Standardize the data so each feature has zero mean and unit variance; otherwise features on larger scales dominate the components.
  2. Calculate Covariance Matrix: Indicates how variables change together.
  3. Find Eigenvectors and Eigenvalues: Eigenvectors give the directions along which the data spreads the most, and eigenvalues measure the variance captured in each direction.
  4. Project the Data: Sort the eigenvectors by eigenvalue, keep the top k (the principal components), and project the standardized data onto them.
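
A minimal NumPy sketch of these steps, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features

# 1. Standardize each feature.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features.
cov = np.cov(Xs, rowvar=False)

# 3. Eigen-decomposition (eigh: covariance matrices are symmetric).
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top-2 principal components.
X_2d = Xs @ eigvecs[:, :2]
print("explained variance ratio:", eigvals[:2] / eigvals.sum())
```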

Note: PCA assumes linear relationships, so it’s unsuitable for non-linear data. Also, since new dimensions are linear combinations of original ones, interpretation can become challenging.
