Statistics & Experimentation · 45–60 Minutes
15 problems that separate data scientists from intuition machines. Each with a think-aloud framework, the key formulas, and the Senior ✦ insight that shows you think causally — not just statistically.
What interviewers score
"Before touching any formula I need to answer five questions: What is the randomisation unit? What is the primary metric and its MDE? What is the exposure window? Are there SUTVA concerns? And what validity threats should I pre-register against?"
For proportions: σ² = p(1−p) using the control base rate p.
| Unit | Use when | Risk |
|---|---|---|
| User | Retention, LTV, personalisation | Long ramp-up for low-DAU products |
| Session | Per-session conversion, latency | Same user can be in both variants |
| Device | Platform-specific UI | Multi-device users split across variants |
| Request | Backend latency, cache experiments | Requires SUTVA — no user-level carryover |
Intent-to-treat (ITT) analysis includes all assigned users, even those who never saw the feature. This dilutes the estimated effect, sometimes massively. If only 20% of assigned users trigger the feature and the rest are unaffected, the ITT estimate is roughly one-fifth of the effect among triggered users (a ~5× dilution). The fix is a trigger analysis: analyse only users who triggered the experiment, with the same trigger condition logged in both arms. The tradeoff: trigger analysis narrows to an engaged subgroup, so the result does not generalise to all users, and if triggering itself depends on the variant it introduces selection bias. Always report both, state which is the primary result, and explain the difference.
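The dilution arithmetic can be checked in a few lines. This is a minimal sketch that assumes non-triggered users experience exactly zero effect; the function names and the 5-point lift are hypothetical illustration values.

```python
def itt_effect(effect_on_triggered: float, trigger_rate: float) -> float:
    """ITT effect, assuming users who never trigger the feature are unaffected."""
    return effect_on_triggered * trigger_rate


def dilution_factor(trigger_rate: float) -> float:
    """How many times smaller the ITT effect is than the triggered effect."""
    return 1.0 / trigger_rate


effect_on_triggered = 0.05  # hypothetical lift among users who saw the feature
trigger_rate = 0.20         # share of assigned users who triggered

itt = itt_effect(effect_on_triggered, trigger_rate)
print(f"ITT effect: {itt:.3f}")
print(f"Dilution: {dilution_factor(trigger_rate):.1f}x")
```

The same identity explains why ITT needs far more users than a trigger analysis: the effect shrinks by the trigger rate while the noise from untriggered users stays in the estimate.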
One-sided tests have more power for the same α, but require pre-commitment: you must declare the expected direction before seeing any data. If you choose one-sided and the effect goes the wrong way, you cannot flip the test post-hoc — that is p-hacking. Most large tech companies (Google, Netflix, Meta) default to two-sided for all primary metrics precisely because the pre-commitment discipline is hard to enforce at scale. Use one-sided only when the hypothesis direction is physically constrained (e.g., a latency improvement cannot make the product slower).
"Standard A/B testing rests on SUTVA — the Stable Unit Treatment Value Assumption — which requires (1) no interference between units and (2) a single version of treatment. Both break in social products. If I give user A a better feed and A's posts get more likes, that changes B's experience even if B is in control."
| Design | Best for | Key limitation |
|---|---|---|
| User-level RCT | No interference (backend, pricing) | Fails under social/network interference |
| Cluster RCT | Social graphs, social features | Fewer units → higher variance |
| Geo experiment | Marketplace, ads, broad rollouts | Few markets → very low power |
| Switchback | Ride-share, logistics, real-time systems | Carryover; requires wash-out periods |
| Ego-cluster | Social feed, notifications | Overlapping ego-networks → complex |
The ideal cluster minimises cross-cluster edges (to contain spillover) while maximising within-cluster density (to preserve social dynamics). Companies like LinkedIn and Meta use community detection algorithms (Louvain, METIS) on the interaction graph to form experiment clusters. The key metric: graph cut ratio = cross-cluster edges / total edges. Target < 5% for social experiments. Smaller cut ratio → less bias, but also fewer clusters → higher variance. The bias-variance tradeoff is explicit and tunable by changing the clustering resolution parameter.
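As a rough illustration of the cut-ratio metric, the sketch below counts cross-cluster edges for a given cluster assignment. The toy graph and the `cluster_of` mapping are hypothetical; in practice the assignment would come from a community detection step.

```python
from typing import Dict, Iterable, Tuple


def cut_ratio(edges: Iterable[Tuple[str, str]], cluster_of: Dict[str, int]) -> float:
    """Share of edges whose endpoints fall in different clusters."""
    edge_list = list(edges)
    if not edge_list:
        raise ValueError("graph has no edges")
    cross = sum(1 for u, v in edge_list if cluster_of[u] != cluster_of[v])
    return cross / len(edge_list)


# Two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

ratio = cut_ratio(edges, cluster_of)
print(f"cut ratio = {ratio:.3f}")  # 1 cross-cluster edge out of 7
```

Recomputing this ratio at several clustering resolutions makes the bias-variance tradeoff concrete: coarser clusters lower the ratio but leave fewer units to randomise.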
Rather than assuming full interference or no interference, exposure mapping (Aronow & Samii, 2017) models the specific interference structure. You define an exposure condition: e.g., "user i is directly treated AND has at least one treated neighbour." This lets you estimate direct effects, indirect effects (spillover), and total effects separately. Uber uses this to decompose marketplace treatment effects into driver-side and rider-side components. Stating this framework signals you understand interference as a modelling problem, not just a nuisance.
"A 4% lift after two weeks is a short-run estimate. I want to know whether it's novelty-driven (decaying over time) or adoption-driven (growing over time). These have opposite implications for whether the long-run effect is above or below what we measured."
| Observation | Interpretation | Action |
|---|---|---|
| Flat effect week 1→2 | Stable, likely real | Ship with confidence |
| Decaying effect week 1→2 | Likely novelty-driven | Extend 2–4 weeks or use holdout |
| Growing effect week 1→2 | Adoption lag — underestimated | Extend; effect may be larger |
| Volatile (high day-to-day var) | Metric too noisy | Increase n or switch metric |
Per-feature holdouts measure the effect of a single feature at long horizons. But the more powerful pattern, used at Google, Microsoft, and Netflix, is a permanent platform-level holdout: 1% of all users who never receive any new feature. This creates a counterfactual for the cumulative effect of all shipped features. The "holdout gap" (exposed users minus holdout users on the north-star metric) grows over time when the product is genuinely improving, and is one of the strongest signals of platform health. A gap that stalls or shrinks means recent launches are collectively adding nothing or doing harm, even if each looked positive in its own test. Per-feature holdouts cannot measure this compounding effect.
Hohnhold et al. (2015) at Google showed that for ad-click features, the novelty effect decays roughly exponentially and stabilises within 3–4 weeks. You can fit the decay curve on the time series of daily treatment effects and extrapolate to τ_steady. This requires ≥ 3 weeks of data and assumes the decay is monotone. The key output: a corrected long-run estimate with wider confidence intervals (because you're extrapolating). This is far better than either ignoring the decay or refusing to ship until 8 weeks of holdout data accumulates.
"A/B test and bandit answer different questions. A/B testing is a fixed-horizon hypothesis test — it produces an unbiased causal estimate with controlled error rates. A bandit is an online optimisation algorithm — it minimises regret during the experiment by adaptively routing traffic. They are not interchangeable."
| Criterion | A/B Test | Bandit |
|---|---|---|
| Causal estimate | Unbiased | Biased (arm selection feedback) |
| Error control | Controlled α, β | No Type I/II guarantees |
| Regret during experiment | High (equal traffic to bad arms) | Low (adaptive) |
| Arms | 2–5 | 10–thousands |
| Reward delay | Tolerates delay | Requires near-immediate reward |
| Decision shelf life | Long (months) | Short (exploit now) |
Bandit logs are adaptively collected, so naive per-arm averages are biased: how often an arm gets pulled depends on its earlier rewards, and an arm that was unlucky early is abandoned with a pessimistic estimate. If you want to evaluate a new policy (e.g., "what if we had used UCB1 instead of Thompson?") on historical bandit data, you need off-policy evaluation. The IPS (Inverse Propensity Score) estimator reweights each observed reward by the ratio of the target policy's action probability to the logging policy's action probability. The doubly-robust estimator adds an outcome model for variance reduction. Stating this distinction separates engineers who understand the bandit literature from those who just know the algorithm names.
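A minimal sketch of the IPS estimator, assuming a context-free setting where every logged event records the logging policy's probability for the action it took. The `LoggedEvent` layout and the example log are hypothetical.

```python
from typing import Callable, List, Tuple

# (action, reward, probability the logging policy assigned to that action)
LoggedEvent = Tuple[int, float, float]


def ips_value(log: List[LoggedEvent], target_prob: Callable[[int], float]) -> float:
    """Estimate the target policy's average reward from logged bandit data."""
    total = 0.0
    for action, reward, logging_prob in log:
        # Reweight by how much more (or less) often the target policy
        # would have chosen this action than the logging policy did.
        total += reward * target_prob(action) / logging_prob
    return total / len(log)


# Logging policy: uniform over two arms.
log = [(0, 1.0, 0.5), (0, 0.0, 0.5), (1, 1.0, 0.5), (1, 1.0, 0.5)]


def always_arm_1(action: int) -> float:
    """Target policy that plays arm 1 with probability 1."""
    return 1.0 if action == 1 else 0.0


value = ips_value(log, always_arm_1)
print(f"IPS estimate for 'always arm 1': {value:.2f}")
```

The estimator is only usable if the logging policy gave non-zero probability to every action the target policy can take, and small logging probabilities produce large weights, which is the variance problem the doubly-robust estimator addresses.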
Standard MAB ignores context — the same arm is optimal for all users. Contextual bandits (LinUCB, Neural Bandit) condition arm selection on user and context features, making the algorithm personalised. LinUCB maintains a linear model of reward per arm: reward_i(x) = x^T θ_i + noise, with an exploration bonus based on the uncertainty of the linear estimate. This is the dominant approach in production recommendation systems (news feeds, ads). The key challenge: balancing the exploration needed to learn the linear model parameters against exploitation of the current best estimate — exactly the same tradeoff as MAB, but in a higher-dimensional feature space.
"Difference-in-Differences removes time-invariant confounders and common time shocks by comparing the change in outcome for the treated city against the change in a comparable control city over the same period. The critical assumption I need to defend is parallel trends — not that the two cities are identical, but that they were moving in the same direction before the policy."
| Setting | Treatment | Control |
|---|---|---|
| Policy evaluation | State/city with new law | Similar state/city without |
| Feature launch | Country where feature went live | Country where it didn't |
| Dark launch | Users opted in early | Comparable users who didn't |
| Market shock | Firms hit by industry event | Unaffected peer firms |
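The comparison of changes described in this section reduces to one line of arithmetic in the two-group, two-period case. The sketch below uses hypothetical group means; it is a point estimate only and says nothing about standard errors or whether parallel trends holds.

```python
def did_estimate(treated_pre: float, treated_post: float,
                 control_pre: float, control_post: float) -> float:
    """2x2 difference-in-differences on group means."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change


# Hypothetical average weekly orders per store.
estimate = did_estimate(treated_pre=100.0, treated_post=130.0,
                        control_pre=90.0, control_post=105.0)
print(f"DiD estimate: {estimate:+.1f}")  # treated rose 30, control rose 15
```

Note that the two groups start at different levels (100 versus 90); the level gap cancels out, which is why the assumption to defend is about trends rather than levels.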
When different units receive treatment at different times (staggered adoption), the standard two-way fixed effects DiD regression is a weighted average of all pairwise 2×2 comparisons between timing groups (Goodman-Bacon, 2021). Some of those comparisons use an early-treated unit as the "control" for a later-treated unit. If treatment effects change over time, these comparisons subtract out part of the effect, which is equivalent to putting negative weights on some treatment effects and biases the aggregate estimate. The fix (Callaway & Sant'Anna, 2021): compute group-time ATTs using only never-treated or not-yet-treated units as controls, then aggregate. If you're running staggered DiD without this correction, your estimate can even have the wrong sign.
When you have a single treated unit (one city, one country) and can't rely on any single control, synthetic control (Abadie et al., 2010) constructs a weighted combination of control units that best replicates the treated unit's pre-treatment trajectory. The weights are optimised to minimise pre-period difference. The key insight: the synthetic control is a more principled counterfactual than any hand-picked comparator, and you can visualise the pre-period fit to assess validity. It's limited to cases where you have many pre-treatment periods and a small number of treated units — exactly the "one city, one policy" scenario that DiD struggles with.
"IV lets us estimate a causal effect when the treatment is endogenous — correlated with unobserved confounders. The instrument must satisfy three conditions: relevance (it predicts the treatment), exogeneity (it's independent of unobserved confounders), and the exclusion restriction (it affects the outcome only through the treatment). I'll validate each before trusting the estimate."
| Instrument | Treatment | Setting |
|---|---|---|
| Random onboarding variant | Feature adoption | Product analytics |
| Distance to treatment centre | Programme participation | Policy evaluation |
| Lottery assignment | Medicaid coverage | Health economics |
| Judge leniency | Incarceration | Criminal justice |
| Birthday/age cutoff | School starting age | Education economics |
2SLS estimates LATE: the effect for compliers, the users whose treatment status changed because of the instrument. If you're using the Medicaid lottery as an instrument, you estimate the effect for people who enrolled because they won the lottery and would not have enrolled otherwise, not the effect for the average uninsured person. If compliers benefit more or less than the rest of the population (for example, because they are unusually healthy or motivated), LATE differs from ATE, and the direction of the gap depends on how the effect varies. Always characterise who the compliers are by comparing their baseline covariates to the full sample. Saying "LATE ≠ ATE and here's why it matters for this policy decision" is the clearest signal of senior-level causal reasoning.
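With a single binary instrument and no covariates, the 2SLS estimate equals the Wald ratio: the instrument's effect on the outcome divided by its effect on treatment take-up. The sketch below uses hypothetical group means to show the arithmetic.

```python
def wald_late(y_z1: float, y_z0: float, d_z1: float, d_z0: float) -> float:
    """Wald estimator: reduced form divided by first stage.

    y_z1, y_z0: mean outcome with the instrument on / off
    d_z1, d_z0: treatment take-up rate with the instrument on / off
    """
    first_stage = d_z1 - d_z0
    if first_stage == 0:
        raise ValueError("instrument does not move treatment (irrelevant instrument)")
    return (y_z1 - y_z0) / first_stage


# Hypothetical numbers: the instrument raises take-up from 15% to 40%
# and raises the outcome rate from 50% to 52%.
late = wald_late(y_z1=0.52, y_z0=0.50, d_z1=0.40, d_z0=0.15)
complier_share = 0.40 - 0.15  # under monotonicity (no defiers)
print(f"LATE = {late:.3f} for the {complier_share:.0%} of users who are compliers")
```

The denominator doubles as the complier share under monotonicity, which is a quick way to see how narrow the population behind the estimate is.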
When instruments are weak (F < 10), the standard 2SLS confidence interval is unreliable — it can exclude the true effect with probability far exceeding the nominal α. The Anderson-Rubin (AR) test inverts a test of the hypothesis β = β₀ in the reduced form equation, bypassing the first stage entirely. The AR confidence interval is valid regardless of instrument strength. The cost: wider intervals. In practice: always check F-stat, report AR CI alongside 2SLS CI when F is borderline, and never claim precision you don't have. The correct statement when F = 6: "The instrument is weak; the 2SLS estimate is imprecise and potentially biased toward OLS."
"The key assumption is conditional independence — no unmeasured confounders: conditional on observed covariates X, treatment assignment is as good as random. This is the strong ignorability assumption. There's no way to test it from data alone. I'll proceed under this assumption, but I'll also run a sensitivity analysis at the end to understand how robust the estimate is to hidden bias."
Kang & Schafer (2007) demonstrated that in some simulation settings, IPW with a correctly specified propensity model performs worse than simple regression adjustment with a correctly specified outcome model. Doubly robust estimators (AIPW) win because they're consistent if either model is right — but if both are misspecified, AIPW can have catastrophic finite-sample performance due to near-zero propensity scores producing extreme weights. The modern fix is TMLE (Targeted Maximum Likelihood Estimation), which stabilises AIPW by "targeting" the estimate of interest rather than fitting a general nuisance model. Netflix and Uber use TMLE for their causal inference workflows.
If overlap is very poor — for example, power users are a completely different population from free users on every observable dimension — reweighting cannot make them comparable. No amount of propensity score adjustment creates valid counterfactuals when the groups don't overlap. In this case, consider: (1) restricting analysis to the region of common support and reporting the ATT for that subset, (2) using Manski-style partial identification bounds instead of a point estimate, or (3) finding an instrument. The worst outcome is reporting an IPW estimate with extreme, untrimmed weights as if it were valid — this is a common mistake in observational ML papers.
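One way to make overlap problems visible is to clip propensity scores and report the effective sample size of the weights. This is a minimal sketch assuming propensities have already been estimated; the clipping bounds of 0.05 and 0.95 are an illustrative choice, and clipping is a blunter tool than restricting to the region of common support.

```python
from typing import List, Sequence, Tuple


def ipw_weights(treated: Sequence[int], propensity: Sequence[float],
                clip: Tuple[float, float] = (0.05, 0.95)) -> List[float]:
    """Inverse-propensity weights with propensities clipped to [lo, hi]."""
    lo, hi = clip
    weights = []
    for t, e in zip(treated, propensity):
        e = min(max(e, lo), hi)
        weights.append(1.0 / e if t else 1.0 / (1.0 - e))
    return weights


def hajek_ate(treated: Sequence[int], outcome: Sequence[float],
              weights: Sequence[float]) -> float:
    """Normalised (Hajek) IPW estimate of the average treatment effect."""
    num1 = sum(w * y for t, y, w in zip(treated, outcome, weights) if t)
    den1 = sum(w for t, w in zip(treated, weights) if t)
    num0 = sum(w * y for t, y, w in zip(treated, outcome, weights) if not t)
    den0 = sum(w for t, w in zip(treated, weights) if not t)
    return num1 / den1 - num0 / den0


def effective_sample_size(weights: Sequence[float]) -> float:
    """Kish effective sample size; far below len(weights) signals poor overlap."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


treated = [1, 1, 0, 0]
outcome = [3.0, 5.0, 1.0, 2.0]
propensity = [0.5, 0.8, 0.5, 0.2]

w = ipw_weights(treated, propensity)
ate = hajek_ate(treated, outcome, w)
print(f"ATE = {ate:.3f}, ESS = {effective_sample_size(w):.2f} of {len(w)}")
```

If the effective sample size collapses to a handful of units, the estimate is being driven by a few extreme weights and should not be reported as a credible point estimate.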
"This is a regression discontinuity setup — the badge threshold at 100 reviews creates a sharp discontinuity in treatment assignment. Sellers just above 100 and just below 100 are likely very similar on unobserved dimensions, making the comparison near the cutoff as good as locally randomised. I'm estimating a LATE at the threshold — not the ATE across all sellers."
| Approach | Problem | Recommendation |
|---|---|---|
| Global polynomial (degree 4+) | Extrapolates strongly; boundary estimates depend on data far from cutoff | Avoid (Gelman & Imbens 2019) |
| Local linear | Uses only data near cutoff; lower bias at boundary | Default choice |
| Local quadratic | Slightly more flexible; higher variance | Robustness check only |
The bandwidth h controls a fundamental bias-variance tradeoff. A wide bandwidth includes more data (lower variance) but units far from the cutoff may differ systematically from units near it (higher bias). The CCT (Calonico, Cattaneo & Titiunik) optimal bandwidth minimises asymptotic MSE = Bias² + Variance, trading off the two. But asymptotic MSE optimality doesn't guarantee good finite-sample performance. Always report results across multiple bandwidths. If the estimated effect drifts monotonically as the bandwidth widens, that points to functional-form bias from observations far from the cutoff (or to an effect that varies with distance from it), not to a stable local effect.
Geographic RDD uses distance to an administrative boundary (school district lines, county borders, DMA boundaries) as the running variable, with the boundary itself as the cutoff. It is the spatial analogue of the review cutoff: units just on each side of a border are geographically proximate and therefore similar, making the boundary comparison as-good-as-random. The key validity check: density of units should be continuous across the boundary (no sorting). Geographic RDD is powerful for policy evaluation but requires that agents can't systematically choose which side of the boundary to be on. Homebuyers who choose school districts invalidate geographic RDD for education research.
"This is Simpson's paradox — an aggregate trend reverses within every subgroup. The cause is always a confounding variable that (1) differs in distribution between treatment groups and (2) is also associated with the outcome. But critically, which estimate is 'correct' — marginal or conditional — depends on the causal structure, not on the data alone."
Confounding is well-known: condition on a common cause C (confounder) to close the backdoor path T ← C → Y. Collider bias is less known but equally dangerous: conditioning on K when T → K ← Y OPENS a spurious T-Y path. A classic real-world example: studying the association between a genetic variant (T) and disease severity (Y) among hospitalised patients (K). Hospitalisation is a collider: both carrying the variant and having severe disease make admission more likely. Restricting the sample to hospitalised patients conditions on K, so the variant-severity association in that sample is biased. This is one reason why studies restricted to clinical populations often replicate poorly in the general population.
The rule is causal, not statistical: if C is a pre-treatment baseline covariate that you'd want to account for in treatment allocation, use the conditional (subgroup-specific) estimate. If C is a mediator — on the causal path T → C → Y — using the conditional estimate removes the indirect effect and underestimates the total effect. If your question is "what is the total causal effect of T on Y?" you want the marginal estimate after blocking backdoor paths (via randomisation or adjusting for all confounders, but not mediators). The data literally cannot tell you which is right — only the causal story can. State this upfront in your answer and interviewers will recognise you've thought causally.
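The reversal between marginal and conditional estimates is easiest to see with numbers. The sketch below uses the widely cited kidney-stone treatment counts (successes and patients by stone size), where treatment A wins within each stratum but loses in the pooled comparison because it was given more often to the harder cases.

```python
from typing import Dict, Tuple

# treatment -> stone size -> (successes, patients)
data: Dict[str, Dict[str, Tuple[int, int]]] = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def stratum_rate(treatment: str, stratum: str) -> float:
    """Success rate within one stone-size stratum (conditional estimate)."""
    successes, patients = data[treatment][stratum]
    return successes / patients


def pooled_rate(treatment: str) -> float:
    """Success rate ignoring stone size (marginal estimate)."""
    successes = sum(s for s, _ in data[treatment].values())
    patients = sum(n for _, n in data[treatment].values())
    return successes / patients


for stratum in ("small", "large"):
    print(f"{stratum}: A={stratum_rate('A', stratum):.3f} "
          f"B={stratum_rate('B', stratum):.3f}")
print(f"pooled: A={pooled_rate('A'):.3f} B={pooled_rate('B'):.3f}")
```

Here stone size is a pre-treatment covariate that influenced which treatment was chosen, so the conditional comparison is the causally relevant one; the code alone cannot tell you that.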
"The colleague is running a sequential hypothesis test with no α adjustment. Wald proved that if you can stop whenever p < 0.05, you can achieve ANY target α just by running long enough — even if the null is true. The nominal α = 0.05 is completely broken. We need a method that controls Type I error at any stopping time."
Under the null hypothesis, the running z-statistic is a normalised random walk, and each new look is another chance for it to wander past the fixed threshold. The maximum of the statistic over many looks has a much heavier tail than its distribution at any single look. By the law of the iterated logarithm, P(p(t) ≤ α for some t) approaches 1 as the number of looks grows without bound, so the peeking procedure has no Type I error control. The mSPRT sidesteps this by using a mixture likelihood ratio, which is a non-negative martingale with expectation 1 under the null. Ville's inequality then bounds the probability that it ever exceeds 1/α by α, giving always-valid inference without giving up the ability to stop early.
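The inflation from peeking can be measured directly with an A/A-style simulation. This sketch simplifies to a one-sample z-test with known unit variance and a true mean of zero; the batch size, number of looks, and simulation count are arbitrary illustration values.

```python
import math
import random


def false_positive_rate(looks: int, batch: int = 100, n_sims: int = 2000,
                        z_crit: float = 1.96, seed: int = 0) -> float:
    """Share of null experiments that reject at ANY of `looks` interim checks."""
    rng = random.Random(seed)
    batch_sd = math.sqrt(batch)
    rejections = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(looks):
            # The sum of `batch` independent N(0, 1) draws is N(0, batch).
            total += rng.gauss(0.0, batch_sd)
            n += batch
            if abs(total) / math.sqrt(n) > z_crit:
                rejections += 1
                break  # the experimenter stops at the first "significant" look
    return rejections / n_sims


single_look = false_positive_rate(looks=1)
twenty_looks = false_positive_rate(looks=20)
print(f"1 look:   {single_look:.3f}")
print(f"20 looks: {twenty_looks:.3f}")
```

With one look the rejection rate stays near the nominal 5%; with twenty looks it is several times higher, and it keeps climbing as looks are added.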
HARKing — Hypothesizing After Results are Known — is rampant in industry experiment reviews because it's almost undetectable after the fact. Red flags: (1) absence of a power analysis pre-run (no pre-specified n means you could stop at any convenient point), (2) post-hoc justifications for unexpected results ("we didn't pre-specify this, but in hindsight it makes sense because..."), (3) reporting only 3 of 20 tested subgroups, (4) no pre-analysis plan or experiment registration. The structural fix: maintain an immutable experiment log that records the hypothesis, primary metric, and planned runtime before any data is collected. This is standard practice at Google, Meta, and Microsoft — any experiment without a registered plan should be treated as exploratory.
"Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The recommendation team optimised the engagement proxy without improving subscriber value. I need to (1) diagnose the specific gaming vector, (2) validate that the proxy has decoupled from the north star (retention), and (3) redesign the metric to be structurally harder to game."
| Type | Mechanism | Fix |
|---|---|---|
| Regressional | Selection effect — high scorers regress to mean | Measure prospectively, not retrospectively |
| Extremal | Optimising to extremes breaks the proxy-objective link | Cap the metric; add a diminishing-returns penaliser |
| Causal | Optimising proxy breaks the proxy-objective correlation | Rotate metrics; validate quarterly |
| Reflective | Measurement changes behaviour (Hawthorne effect) | Blind measurement; use revealed preference |
Instead of measuring what users do (which can be gamed by surfacing content that triggers involuntary engagement), measure what users choose when given explicit alternatives. For content quality: measure the voluntary return visit rate — did the user voluntarily come back the next day? Did they share the content? Did they finish it without being auto-played into the next piece? These are revealed preference signals that require actual user satisfaction, not just interface design tricks. They're harder to game because you can't force a user to voluntarily return — you have to actually serve them well. Netflix uses 30-day voluntary retention rate (not engagement minutes) as their true north star for exactly this reason.
Some metrics are valid proxies for user value only within a specific operating range. Beyond that range, the proxy breaks down. Netflix observed that watch hours correlates positively with subscriber satisfaction up to about 3 hours per day — but beyond that threshold, it correlates negatively with long-run retention, presumably reflecting compulsive binge-watching of content the user later regrets. The metric has a valid operating range: [0, 3 hrs/day]. Optimising for watch hours beyond this range actively harms the north star. The implication: every proxy metric should be validated not just for direction but for its operating range. And if teams are optimising near or beyond the boundary, the metric has lost its validity signal.
"Power analysis has four linked parameters: α (false positive rate), power (1−β), metric variance σ², and MDE Δ. Fix any three and the fourth is determined. The real design decision is choosing an MDE that reflects business value — not the effect you hope to see."
| Change | Effect on n | Note |
|---|---|---|
| Halve MDE (Δ/2) | 4× more users | Most expensive lever — choose MDE carefully |
| α: 0.05 → 0.01 | ~49% more | For safety or financial tests |
| Power: 80% → 90% | ~34% more | Worth it for high-stakes launches |
| CUPED (ρ=0.7) | ~49% fewer | Use pre-experiment covariate |
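How n responds to each lever can be checked with the standard two-sample formula n = 2σ²(z₁₋α/₂ + z₁₋β)² / Δ² per arm. This is a sketch assuming a two-sided test, equal allocation, and a normal approximation; the 10% base rate and 1-point MDE are hypothetical.

```python
from statistics import NormalDist


def n_per_arm(sigma2: float, mde: float,
              alpha: float = 0.05, power: float = 0.80) -> float:
    """Users per arm for a two-sided, two-sample z-test with equal allocation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return 2 * sigma2 * (z_alpha + z_beta) ** 2 / mde ** 2


p = 0.10                 # hypothetical control conversion rate
sigma2 = p * (1 - p)     # variance of a proportion
base = n_per_arm(sigma2, mde=0.01)

print(f"baseline n per arm:   {base:,.0f}")
print(f"halve the MDE:        x{n_per_arm(sigma2, mde=0.005) / base:.2f}")
print(f"alpha 0.05 -> 0.01:   x{n_per_arm(sigma2, 0.01, alpha=0.01) / base:.2f}")
print(f"power 80% -> 90%:     x{n_per_arm(sigma2, 0.01, power=0.90) / base:.2f}")
# CUPED with correlation rho scales the variance by (1 - rho^2).
rho = 0.7
print(f"CUPED with rho=0.7:   x{n_per_arm(sigma2 * (1 - rho ** 2), 0.01) / base:.2f}")
```

Because n scales with 1/Δ², the MDE is the lever to negotiate first; the α and power adjustments are comparatively cheap.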
Fixed-horizon tests require committing to a runtime and not peeking, which is impractical in most product teams. The mSPRT (mixture Sequential Probability Ratio Test, Johari et al. 2015) produces always-valid p-values: the p-value at any look controls Type I error at α. The cost: roughly 10% less power than the equivalent fixed-horizon test at the same sample size. Optimizely, Netflix, and Airbnb use mSPRT by default. Mathematically: the classical fixed-horizon p-value is only valid at a single pre-specified sample size, and stopping the first time it crosses α inflates the Type I error far above α. The mSPRT instead mixes the likelihood ratio over a prior on effect sizes. The resulting statistic is a non-negative martingale under the null, which yields a confidence sequence that shrinks over time and is valid at every look.
A fundamental confusion in practice: the MDE is the smallest effect worth detecting, NOT the effect you expect the feature to produce. If you set MDE = your best guess for the effect size, you reach the planned power (say 80%) only when the guess is exactly right. For a true effect that is smaller but still worth shipping, power falls quickly and reaches roughly 50% once the true effect is about 70% of the MDE. Set the MDE at the decision threshold, which is usually below the expected effect. The right question to ask the PM: "What is the smallest improvement that would change your ship decision?" That's the MDE. If they say "1% CTR improvement," and your experiment is powered for a 2% MDE, you're underpowered for the decision they actually need to make.
"Testing 20 metrics at α=0.05 each gives a 64% chance of at least one false positive, even if the feature does nothing. We'd expect about 1 spurious significant result. The question is: was a primary metric pre-registered? If yes, that's the only one that counts for the ship decision. If no, we need to apply a multiple comparison correction before interpreting results."
| Method | Controls | Power | Use when |
|---|---|---|---|
| Bonferroni | FWER ≤ α | Lowest | Safety/financial, few tests (<5) |
| Holm-Bonferroni | FWER ≤ α | Slightly higher | Always dominates Bonferroni — prefer this |
| Benjamini-Hochberg | FDR ≤ α | High | Discovery setting, many metrics, FP tolerable |
| Pre-registration | FWER by design | Full α for primary | Best — do this before data collection |
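The corrections in the table differ only in how they set per-test thresholds. The sketch below implements Bonferroni, Holm, and Benjamini-Hochberg on a hypothetical set of five p-values, and reproduces the family-wise error rate for 20 independent tests.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject when p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Step-down: compare the k-th smallest p (k = 0, 1, ...) to alpha / (m - k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            rejected[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Step-up: reject the k smallest p-values, k = max rank with p <= rank * q / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= k_max
    return rejected


p_values = [0.001, 0.011, 0.016, 0.030, 0.20]  # hypothetical metric p-values
print("Bonferroni rejects:", sum(bonferroni(p_values)))
print("Holm rejects:      ", sum(holm(p_values)))
print("BH rejects:        ", sum(benjamini_hochberg(p_values)))

fwer_20 = 1 - (1 - 0.05) ** 20  # assumes 20 independent tests
print(f"FWER for 20 uncorrected tests: {fwer_20:.2f}")
```

On this input Bonferroni rejects one hypothesis, Holm three, and BH four, which matches the power ordering in the table.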
Multiple comparisons arise over time (peeking at interim analyses) as well as across metrics. Alpha spending functions (O'Brien-Fleming, Pocock) allocate the α budget across K planned interim looks. O'Brien-Fleming spends almost no α early (when evidence is weakest) and reserves most of the budget for the final look, providing near-full power at the end while strictly controlling the overall Type I error. Pocock spends α roughly uniformly, giving equal significance thresholds at each look but less power at the final look. The key insight: choosing a spending function is structurally the same as choosing a multiple-comparison correction across metrics. In both cases you're allocating a finite error budget across multiple decisions.
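The contrast between the two spending styles can be sketched with the Lan-DeMets approximations, which give the cumulative α spent by information fraction t (0 < t ≤ 1). These are the commonly used continuous versions rather than the original discrete boundaries, and the four equally spaced looks are an illustrative assumption.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obf_spent(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: cumulative alpha used by fraction t."""
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / math.sqrt(t)))


def pocock_spent(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: cumulative alpha used by fraction t."""
    return alpha * math.log(1 + (math.e - 1) * t)


for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t={t:.2f}  OBF={obf_spent(t):.5f}  Pocock={pocock_spent(t):.5f}")
```

Both functions reach the full α at t = 1; the difference is that the O'Brien-Fleming-type function has spent almost nothing at the first look while the Pocock-type function has already used about a third of the budget.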
BH at FDR=5% means: if you reject k hypotheses, you expect at most about 0.05k of them to be false discoveries. Teams who don't understand this celebrate "20 significant results" without realising that about 1 of them is likely spurious. Always report E[FD] = FDR × rejections explicitly. For example: "We reject 12 metrics at BH FDR=5%. We expect ~0.6 false positives among these 12." This framing forces intellectual honesty. A corollary: if your team regularly reports BH FDR=5% results from 50-metric dashboards and never discusses E[FD], they are almost certainly over-shipping features with no real effect.
"This is Goodhart's Law: when CTR became a target, it ceased to be a good measure. The team optimised the proxy without improving the underlying objective. The fix isn't adding more counter-metrics — it's redesigning the OEC to be structurally harder to game while remaining directionally aligned with the north star."
| Property | How to verify |
|---|---|
| Sensitive | Moves reliably when A/A has a known injected effect |
| North-star aligned | Corr(OEC, north-star) > 0.6 across historical experiments |
| Not gameable | No obvious path to increase OEC without creating user value |
| Counter-metric resistant | Gaming OEC moves at least one guardrail adversely |
| Computable | Available within experiment window (≤2 weeks for primary) |
The strongest validation of a new OEC is to replay historical A/B tests under the new metric and check that it correctly predicts long-run outcomes. Netflix does this systematically: for a proposed OEC, they compare its verdict (ship/no-ship) against each historical experiment's long-run impact on subscriber retention at 12 months. A good OEC agrees with the long-run outcome in >80% of cases. A bad OEC (raw engagement minutes) agrees far less often because it promotes short-term binge-watching that increases churn. This is the gold standard for OEC validation — far better than theoretical arguments about alignment.
Instead of one composite OEC, decompose into components and hold teams accountable for each. Total time spent is gameable by adding friction. But total time spent = sessions_per_week × avg_duration_per_session. A team that inflates per-session duration by adding friction sees sessions_per_week decrease. A team flooding low-quality content sees a quality-satisfaction guardrail decrease. It's much harder to simultaneously inflate multiple independent components than to game a single composite. This structural defence against Goodhart's Law is more robust than adding ever-more-complex counter-metrics to a single composite.
"CUPED subtracts out variation in the outcome metric that is predictable from pre-experiment behaviour. Because users who engage a lot before the experiment tend to engage a lot during it — regardless of which variant they see — this variation is noise that inflates experiment variance. Removing it lets us reach significance with fewer users."
| Method | Timing | Typical reduction | Limitation |
|---|---|---|---|
| Stratified assignment | Pre-assignment | 20–40% | Needs covariate before assignment |
| CUPED (linear) | Post-assignment | 30–60% | Linear Y-X only; needs pre-experiment data |
| MLRATE | Post-assignment | 50–80% | Complex; requires rich feature history |
| Post-stratification | Analysis time | 10–30% | Inflates variance if strata too small |
The algebraic equivalence between CUPED and ANCOVA is non-obvious but important. It means: (1) CUPED is not a heuristic — it's the standard OLS efficiency gain from including a covariate, backed by the Gauss-Markov theorem. (2) You can use any standard regression software, not a custom CUPED implementation. (3) If you extend to a machine-learning projection f(X) instead of linear θX, CUPED becomes equivalent to the partial linear Robinson (1988) estimator, which is semiparametrically efficient for the treatment effect under the partially linear model. In practice: always frame CUPED as "ANCOVA with a pre-experiment covariate" in technical reports — it's more recognisable and has better-understood properties.
Standard CUPED assumes a linear Y-X relationship. MLRATE (introduced by researchers at Meta) replaces the linear projection θ·X with a machine-learning model, such as gradient-boosted trees, trained on many pre-experiment features: Y_adj = Y − f(X). This achieves 70–80% variance reduction for complex engagement metrics where linear CUPED only gets 30–40%. Critical constraint: the features X must be measured strictly before the experiment, and f must not be fit in a way that lets treatment leak into the adjustment, or the treatment-effect estimate can be biased and the experiment invalidated. In practice: train f on data from a 4-week window ending the day before the experiment starts, then freeze f before the first user is assigned.