ML / NLP Interviews

This blog compiles some points that might come up in ML / NLP interviews. Good luck (to myself and to all readers looking for jobs).

  • Statistics
  • Probability
  • Causality, Fair ML
  • Models
  • Neural Network
  • Training
  • Natural Language Processing specific
    • NLP Basics
    • Information theory
    • Embedding
    • Parsing

Statistics

  • Evaluating a model. TP / FP / TN / FN
  • Type 1 (false positive) and type 2 errors (false negative).
  • Precision (TP / (TP + FP)) vs. Recall (TP / (TP + FN)).
  • Sensitivity / specificity.
  • F1 score: the harmonic mean of precision and recall. Micro / macro / weighted averaging (see the metric sketch after this list).
  • ROC (receiver operating characteristic) curve: true positive rate vs. false positive rate at various threshold settings.
  • AUC: area under (the ROC) curve.
  • Bias-variance tradeoff. How to decide whether the model is over- or underfitting from the learning curves?
    • High bias: underfitting. Address with a more expressive model, better features, or longer training.
    • High variance: overfitting; the model is too flexible and generalizes poorly. Address with (1) regularization, (2) bagging / ensembling, (3) fewer and simpler features, (4) more training data.
  • LOOCV, k-fold cross-validation: they give a less biased estimate of generalization error than a single hold-out split, at the cost of extra computation.
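A minimal sketch of the confusion-matrix metrics above, computed on made-up binary labels and scores (the arrays y_true and y_score and the 0.5 threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground-truth labels and predicted scores (illustrative only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.1])
y_pred = (y_score >= 0.5).astype(int)            # threshold the scores at 0.5

tp = np.sum((y_pred == 1) & (y_true == 1))       # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))       # false positives (type 1 errors)
fn = np.sum((y_pred == 0) & (y_true == 1))       # false negatives (type 2 errors)
tn = np.sum((y_pred == 0) & (y_true == 0))       # true negatives

precision = tp / (tp + fp)                       # TP / (TP + FP)
recall = tp / (tp + fn)                          # TP / (TP + FN), a.k.a. sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

# ROC is traced by sweeping the threshold; AUC summarizes it in one number.
auc = roc_auc_score(y_true, y_score)
print(precision, recall, specificity, f1, auc)
```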

Probability

  • Bayes Rule: posterior = likelihood * prior / evidence.
  • How to compute the evidence? Marginalize the likelihood over the prior: a sum for discrete parameters, an integral for continuous ones.
  • How to compute posterior when it is almost impossible to do the above steps? Variational inference (to maximize the variational lower bound) or Monte Carlo sampling (e.g., Metropolis-Hastings algorithm, Gibbs sampling).
  • Probability density functions (pdf).
  • Exponential family distributions. Conjugate priors (Beta prior + Binomial likelihood; Dirichlet prior + Multinomial likelihood); a Beta-Binomial sketch follows this list.
  • Mutual information
  • Conditional independence: $X \perp Y \mid Z$ iff $P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$, equivalently $P(X \mid Y, Z) = P(X \mid Z)$.
  • Bayesian Network, d-separation.
  • Variable elimination.
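As a concrete conjugacy example (a sketch with made-up counts and a Beta(2, 2) prior chosen only for illustration): a Beta prior combined with a Binomial likelihood yields a Beta posterior in closed form, so the evidence integral never has to be computed explicitly.

```python
from scipy import stats

# Hypothetical data: 7 heads out of 10 coin flips (illustrative numbers).
heads, tails = 7, 3

# Beta(2, 2) prior on the head probability theta (an assumed choice for this sketch).
alpha_prior, beta_prior = 2.0, 2.0

# Conjugacy: Beta(a, b) prior + Binomial likelihood -> Beta(a + heads, b + tails) posterior.
posterior = stats.beta(alpha_prior + heads, beta_prior + tails)

print("posterior mean:", posterior.mean())            # (2 + 7) / (2 + 2 + 10) ~ 0.64
print("95% credible interval:", posterior.interval(0.95))
```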

Causality and Fair Learning

  • Randomized control trials.
  • Rubin-Neyman causal model.
  • Traditional deconfounding approaches: residualization, inverse probability weighting (an IPW sketch follows this list).
  • Fair Learning: unfairness metrics (demographic parity, equalized odds / equal opportunity).
  • Directly optimizing a fairness criterion: (Zemel et al., 2013)
  • Adversarial components: (Madras et al., 2018)
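A minimal inverse probability weighting sketch on simulated data, assuming no unmeasured confounding and a correctly specified propensity model; the data-generating process and the true effect of 2.0 are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Simulated confounder x, treatment t (more likely when x is large), outcome y.
x = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))       # confounded treatment assignment
y = 2.0 * t + 1.5 * x + rng.normal(size=n)          # true treatment effect = 2.0

# Naive difference in means is biased by the confounder.
naive = y[t == 1].mean() - y[t == 0].mean()

# IPW: estimate propensity scores e(x) = P(t = 1 | x), then reweight each group.
e = LogisticRegression().fit(x.reshape(-1, 1), t).predict_proba(x.reshape(-1, 1))[:, 1]
ate_ipw = np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))

print(f"naive: {naive:.2f}, IPW: {ate_ipw:.2f}")    # IPW should land near 2.0
```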

Models

  • Supervised learning models: SVM, Random Forest, Gaussian Process, LogReg, Adaboost, Naive Bayes.
  • Unsupervised: clustering (e.g., k-means), domain adaptation, self-supervised embedding (note that kNN classification is itself supervised).
  • Visualization / dimensionality reduction: PCA, t-SNE.
  • Regression vs classification.
  • Kernel methods (see Wikipedia).
  • Decision trees are suitable for nonlinear data.
  • Explain boosting, gradient boosting, GBDT (a minimal boosting sketch follows this list).
  • CRF is discriminative (measures conditional probability) whereas HMM is generative (measures joint probability).
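A minimal gradient-boosting sketch for squared loss: each new shallow tree is fit to the residuals (the negative gradient) of the current ensemble. The toy data, depth, learning rate, and number of trees are illustrative choices, not tuned settings:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)    # toy nonlinear regression target

n_trees, lr = 100, 0.1
pred = np.full_like(y, y.mean())                    # start from the constant mean prediction
trees = []
for _ in range(n_trees):
    residual = y - pred                             # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * tree.predict(X)                    # shrunken additive update
    trees.append(tree)

print("train MSE:", np.mean((y - pred) ** 2))
```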

Neural Network

  • Long Short-Term Memory (LSTM)
    There are three gates: forget ($f_t$), input ($i_t$), and output ($o_t$), plus a candidate gate $g_t$. The cell state is updated as $c_t = f_t \odot c_{t-1} + i_t \odot g_t$, and the output is $h_t = o_t \odot \tanh(c_t)$.
  • Transformer (Vaswani et al., 2017): Key, Query, Value projections per head, multi-head attention, positional encoding (a single-head attention sketch follows this list).
  • Activations. ReLU vs. Sigmoid vs. Tanh.
  • Gradient explosion / gradient vanishing. How to address them? (Use ReLU; LSTM gating + gradient clipping etc.)
  • Batch normalization (Ioffe and Szegedy, 2015). The stated purpose is to reduce internal covariate shift. It includes a learnable scale-and-shift step ($\gamma$, $\beta$) after normalization.
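A numpy sketch of single-head scaled dot-product attention in the spirit of Vaswani et al. (2017); the sequence length, dimensions, and random projection matrices are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy example: sequence length 4, model / key dimension 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape, attn.shape)                         # (4, 8) (4, 4)
```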

Training

  • Cross-entropy loss or L2 loss?
  • Dropout. Explain it: randomly zero out units during training, and rescale so that expected activations match at test time.
  • Gradient optimizer: GD, SGD.
  • Adagrad: scale each parameter's step by $\eta / \sqrt{\sum_{\tau \le t} g_\tau^2 + \epsilon}$, so a history of large gradients leads to smaller steps.
  • Momentum method: use gradients to change the velocity of the weights changing.
    • Nesterov momentum: First jump, then measure gradients, and then make a correction.
    • RMSprop: divide the gradient by a running average of its recent magnitude.
    • Adam optimizer: combines the advantages of Adagrad and RMSprop; it keeps exponentially decayed first- and second-moment estimates of the gradient, with bias correction (see the sketch after this list).
    • See Hinton’s slides for more explanations.
  • Regularization. L1 / L2. L1 (“Least Absolute Shrinkage and Selection Operator”, LASSO) corresponds to a Laplacian prior; it pushes many weights to exactly 0, hence encourages sparsity and is regarded as having a built-in feature-selection mechanism. L2 corresponds to a Gaussian prior.
  • Handling imbalance classes.
  • Handling missing data.
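A minimal Adam update sketch on a toy quadratic objective, showing the first-moment (momentum-like) and second-moment (RMSprop-like) estimates with bias correction; the hyperparameters are the commonly cited defaults and the objective is made up for illustration:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad               # first-moment (momentum-like) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2          # second-moment (RMSprop-like) estimate
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta = np.array([2.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, theta, m, v, t, lr=0.05)
print(theta)                                         # close to the minimum at [0, 0]
```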

Open questions

  • What’s a favourite paper you recently read?
  • What do you plan to do? (short-term / long-term research goals)

NLP-basics

  • Techniques for keyword normalization? Lemmatization and stemming.
  • Techniques for string matching? Levenshtein (i.e., edit distance), soundex, and metaphone.
  • Regular Expression (examples in Python; runnable snippets follow this list).
    . (single dot) matches any single character.
    .* matches any character sequence of any length (including the empty string).
    [0-9] matches any single digit.
    [a-z] matches any single lower-case letter.
    Parentheses and brackets must be escaped, e.g., \(, \), \[, \], to match them literally.
    Backslashes need special care in Python; prefer raw strings such as r"\d+".
  • N-gram: a contiguous sequence of N words (or characters).
  • Zipf’s law. In a corpus, the frequency of a word is inversely proportional to its rank (by frequency).
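A few runnable Python re-module snippets for the patterns above; the example strings are made up:

```python
import re

print(re.findall(r"[0-9]", "Call 911 or 112"))         # ['9', '1', '1', '1', '1', '2']
print(re.findall(r"[0-9]+", "Call 911 or 112"))        # ['911', '112']
print(re.fullmatch(r"a.c", "abc") is not None)         # True: '.' matches any character
print(re.fullmatch(r"ab.*", "abcdef") is not None)     # True: '.*' matches any sequence
print(re.findall(r"\(\d+\)", "see (123) and (45)"))    # escaped parentheses: ['(123)', '(45)']
# Raw strings (r"...") keep Python from interpreting backslashes before re sees them.
```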

Information theory

  • Entropy
  • KL divergence
  • Jensen-Shannon divergence (a numpy sketch of all three quantities follows this list)
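A small numpy sketch of entropy, KL, and Jensen-Shannon divergence for discrete distributions (the two example distributions are arbitrary; logs are base 2, so the units are bits):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                    # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                    # assumes q > 0 wherever p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def js(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)          # symmetric, bounded by 1 bit

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(entropy(p), kl(p, q), kl(q, p), js(p, q))     # note that KL is asymmetric
```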

NLP-Feature extraction and embeddings

  • TF-IDF of term $t$ in document $d$: $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}$, where the term frequency $\mathrm{tf}(t, d)$ is the term's frequency in the document, $N$ is the number of documents, and the document frequency $\mathrm{df}(t)$ is the number of documents containing the term (a from-scratch sketch follows this list).
  • Popular word embedding methods? GloVe, Word2Vec, FastText, ELMo.
    • GloVe (Pennington et al., 2014): a global log-bilinear regression model trained on word-word co-occurrence counts.
    • Word2Vec (Mikolov et al., 2013): a shallow two-layer network that predicts the context words given a word (skip-gram) or predicts a word given its context (CBOW), typically trained with negative sampling or hierarchical softmax. Use the first-layer (projection) weights as the word embeddings.
    • FastText (Facebook AI Research): an optimized Word2Vec-style library that also incorporates sub-word information (e.g., character n-grams of lengths 3 to 5).
    • ELMo (Peters et al., 2018): use the intermediate-layer outputs of bidirectional LSTMs, in addition to the word embeddings, as the new (contextual) embeddings.
    • BERT (Devlin et al., 2018): randomly mask 15% of tokens and predict them from context with a Transformer encoder (masked language modeling), alongside a next-sentence prediction objective.
  • Topic modeling: LDA.
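A from-scratch TF-IDF sketch on a tiny made-up corpus, following the formula above; production implementations (e.g., scikit-learn's TfidfVectorizer) add smoothing and normalization options:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)    # term frequency within the document
    idf = math.log(N / df[term])                     # inverse document frequency
    return tf * idf

print(tf_idf("cat", tokenized[0]))    # appears in only one document: high idf
print(tf_idf("the", tokenized[0]))    # appears in two documents: down-weighted
```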

NLP-Parsing

  • Shift-reduce parser.
  • Parsing: dependency vs. constituency parsing.

NLP-Misc

  • Word sense disambiguation example: the Lesk algorithm. Compare the dictionary definition (gloss) of each sense of an ambiguous word with the terms in its neighborhood and pick the sense with the largest overlap (a simplified-Lesk sketch follows this list).
  • Recommendation system: HITS, PageRank, collaborative filtering.
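A simplified-Lesk sketch: pick the sense whose gloss shares the most words with the context. The two senses and their glosses below are a hypothetical mini dictionary made up for illustration:

```python
def simplified_lesk(context_sentence, sense_glosses):
    """Return the sense whose gloss overlaps most with the context words."""
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for two senses of "bank".
glosses = {
    "bank_financial": "an institution that accepts deposits and lends money",
    "bank_river": "sloping land beside a body of water such as a river",
}
print(simplified_lesk("I deposited money at the bank", glosses))   # bank_financial
```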