Statistics & Experimentation · 45–60 Minutes

Design it.
Defend it.
Causalise it.

15 problems that separate data scientists from intuition machines. Each with a think-aloud framework, the key formulas, and the Senior ✦ insight that shows you think causally — not just statistically.

What interviewers score

Causal thinking — do you distinguish correlation from causation?
Validity threats — do you proactively name what could go wrong?
Estimand clarity — ATT vs ATE vs LATE before you estimate?
Practical tradeoffs — connecting statistics to business decisions?
Framework fluency — right method recalled quickly under pressure?

👤 Senior Data Scientist · Interviewer

Open with

"Before touching any formula I need to answer five questions: What is the randomisation unit? What is the primary metric and its MDE? What is the exposure window? Are there SUTVA concerns? And what validity threats should I pre-register against?"

Step 1 — Randomisation unit: Match the unit to the metric granularity. User-level for retention and LTV; session-level for per-session conversions. Changing the unit mid-experiment is a fatal error.
Step 2 — Define OEC & guardrails: Pre-register one primary metric (e.g., checkout conversion rate) and 2–3 guardrail metrics (page load time, error rate, revenue per order). Declare both before seeing any data.
Step 3 — Power analysis: Specify MDE (the smallest lift worth shipping), set α = 0.05 and power = 0.80. Compute n per variant. If only a fraction of users trigger the feature and you analyse triggered users, divide n by the trigger rate to get the number of users you must assign.
Step 4 — Assignment: hash(user_id + experiment_salt) mod 100 for stable, isolated assignment. Verify independence across concurrent experiments.
Step 5 — SRM check: On day 1 run a chi-squared test on observed vs expected traffic split. Any SRM (p < 0.01) means the assignment is broken — do not analyse results.
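Step 6 — Novelty check: Compare week-1 vs week-2+ cohort effects. If week-1 lift is roughly 2× the week-2+ lift, much of the measured effect is likely novelty; treat the later weeks as the better guide to the steady state.
Step 7 — Analyse & decide: Two-sample t-test (Welch) on the primary metric; with large samples the CLT makes it robust to non-normal metrics. Mann-Whitney is an option for heavy-tailed metrics, but it tests a distributional shift rather than a difference in means. Ship only if: p < α AND effect size > MDE AND all guardrails are green.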
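
The assignment rule in Step 4 can be sketched as follows. This is a minimal illustration, assuming SHA-256 as the hash and string user IDs; the function names and the salt format are hypothetical, not taken from any particular experimentation platform.

```python
import hashlib


def assign_bucket(user_id: str, experiment_salt: str, n_buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, n_buckets) for one experiment."""
    # Salting with the experiment name decorrelates buckets across experiments.
    key = f"{experiment_salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % n_buckets


def assign_variant(user_id: str, experiment_salt: str, treatment_pct: int = 50) -> str:
    """Send the first treatment_pct buckets to treatment, the rest to control."""
    bucket = assign_bucket(user_id, experiment_salt)
    return "treatment" if bucket < treatment_pct else "control"
```

A cryptographic hash is used here instead of Python's built-in hash(), which is randomised per process for strings and would not give stable assignment across servers.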

Validity threats — name these proactively

SRM: bot filtering, assignment bug, or early stoppers leaking between variants
Novelty effect: users exploring a new UI inflate week-1 conversion
Simpson's paradox: platform mix (mobile/desktop) shifts between variants if not balanced
SUTVA violation: users share cart links or referral codes — treatment contaminates control
Multiple testing: post-hoc segment fishing (country, device) inflates Type I error

Sample Size — Two-sample t-test

n = 2σ²(z_α/2 + z_β)² / Δ²
where:
  n = sample size per variant
  Δ = MDE (minimum detectable effect, in absolute units of the metric)
  σ² = metric variance per user
  z_α/2 ≈ 1.96 (α = 0.05, two-tailed)
  z_β ≈ 0.84 (power = 80%)

For proportions: σ² = p(1−p) using the control base rate p.
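
As a worked instance of the sample-size formula, the sketch below computes n per variant using only the standard library. The 10% base rate and 1-percentage-point MDE are made-up inputs for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_variant(sigma2: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per variant for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    return ceil(2 * sigma2 * (z_alpha + z_beta) ** 2 / mde ** 2)


# Hypothetical inputs: 10% control conversion, detect an absolute lift of 1 point.
p = 0.10
n = n_per_variant(sigma2=p * (1 - p), mde=0.01)
```

With these inputs n comes out at roughly 14,000 users per variant. Halving the MDE quadruples n, which is usually the most useful fact to state aloud.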

SRM Check

χ² = Σ (O_i − E_i)² / E_i
df = k − 1
Reject (SRM detected) if p < 0.01. Stop analysis; do NOT interpret results until root cause is fixed.
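
For the common two-variant case the SRM check needs no statistics library, because a chi-squared variable with 1 degree of freedom is a squared standard normal. A minimal sketch, with illustrative traffic counts:

```python
from math import erfc, sqrt


def srm_check(n_control: int, n_treatment: int, expected_control_share: float = 0.5):
    """Chi-squared goodness-of-fit test for a two-variant traffic split (df = 1)."""
    total = n_control + n_treatment
    exp_c = total * expected_control_share
    exp_t = total * (1 - expected_control_share)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For df = 1: P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = srm_check(50_000, 51_000)  # a 1% imbalance on ~100k users
srm_detected = p_value < 0.01
```

With more than two variants the same statistic applies, but the p-value needs a general chi-squared survival function rather than the df = 1 shortcut used here.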

Randomisation Unit Decision

Unit	Use when	Risk
User	Retention, LTV, personalisation	Long ramp-up for low-DAU products
Session	Per-session conversion, latency	Same user can be in both variants
Device	Platform-specific UI	Multi-device users split across variants
Request	Backend latency, cache experiments	Requires SUTVA — no user-level carryover

✦Senior DS — Trigger Analysis

Intent-to-treat (ITT) analysis includes all assigned users, even those who never saw the feature. This dilutes the estimated effect size — sometimes massively. If only 20% of assigned users trigger the feature and the rest are unaffected, the ITT effect is about one fifth of the effect among triggered users. The fix is a trigger analysis: analyse only users who triggered the experiment, with the trigger condition logged identically in both arms (counterfactual logging in control). The tradeoff: the estimate applies only to the triggered, more engaged subgroup, and if triggering itself depends on the treatment, the comparison suffers selection bias. Always report both, state which is the primary result, and explain the difference.

✦Senior DS — Two-sided vs One-sided

One-sided tests have more power for the same α, but require pre-commitment: you must declare the expected direction before seeing any data. If you choose one-sided and the effect goes the wrong way, you cannot flip the test post-hoc — that is p-hacking. Most large tech companies (Google, Netflix, Meta) default to two-sided for all primary metrics precisely because the pre-commitment discipline is hard to enforce at scale. Use one-sided only when the hypothesis direction is physically constrained (e.g., a latency improvement cannot make the product slower).

Open with

"Standard A/B testing rests on SUTVA — the Stable Unit Treatment Value Assumption — which requires (1) no interference between units and (2) a single version of treatment. Both break in social products. If I give user A a better feed and A's posts get more likes, that changes B's experience even if B is in control."

Step 1 — Name the interference mechanism: Direct (A interacts with B), indirect (A's posts appear in B's feed), or market-mediated (two-sided marketplace supply/demand). Each requires a different fix.
Step 2 — Cluster randomisation: Randomise at the cluster level (social graph community, geography) rather than the user level. Clusters should have many intra-cluster edges and few cross-cluster edges to contain spillover.
Step 3 — Geo experiment: Randomise by DMA/city/country. Give entire markets the treatment. Best for marketplace or social products where clusters are well-defined geographically.
Step 4 — Switchback design: Alternate treatment and control over time periods within a single market. Period 1: treatment. Period 2: control. Repeat. Good when geographic clusters are infeasible.
Step 5 — Estimate correctly: For cluster RCT, use cluster-level averages as the unit of analysis. For switchback, use time-period averages. Apply Horvitz-Thompson weights if cluster sizes differ.

Validity threats

Cluster heterogeneity: dense urban clusters differ systematically from sparse ones — balance cluster-level covariates
Carryover in switchback: treatment effects persist into control periods — add wash-out windows between periods
Positive vs negative spillover: viral features spill positively; cannibalisation spills negatively — the direction matters for interpreting bias
Market equilibrium effects: treating one city's supply can attract demand from adjacent cities, inflating geo estimates

SUTVA Violation

SUTVA requires: Y_i(T) = Y_i(T_i)
i.e., unit i's outcome depends only on its own treatment.
If violated: E[Y_i(T_i=1)] ≠ E[Y_i | T_i=1]
The observed treated outcome is confounded by neighbours' treatments.

Naive Bias Under Spillover

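Naive ATE estimate = E[Y|T=1] − E[Y|T=0] = True ATE + Spillover Bias
Positive spillover (viral features): control users also benefit, so the naive difference typically understates the full-rollout effect (Spillover Bias < 0).
Negative spillover (cannibalisation): treated users gain at control's expense, so the naive difference typically overstates it (Spillover Bias > 0).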

Design Strategy Selection

Design	Best for	Key limitation
User-level RCT	No interference (backend, pricing)	Fails under social/network interference
Cluster RCT	Social graphs, social features	Fewer units → higher variance
Geo experiment	Marketplace, ads, broad rollouts	Few markets → very low power
Switchback	Ride-share, logistics, real-time systems	Carryover; requires wash-out periods
Ego-cluster	Social feed, notifications	Overlapping ego-networks → complex

✦Senior DS — Graph Clustering for Cluster RCT

The ideal cluster minimises cross-cluster edges (to contain spillover) while maximising within-cluster density (to preserve social dynamics). Companies like LinkedIn and Meta use community detection algorithms (Louvain, METIS) on the interaction graph to form experiment clusters. The key metric: graph cut ratio = cross-cluster edges / total edges. Target < 5% for social experiments. Smaller cut ratio → less bias, but also fewer clusters → higher variance. The bias-variance tradeoff is explicit and tunable by changing the clustering resolution parameter.
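
The cut ratio is simple to compute once a clustering exists. The sketch below assumes an undirected edge list and a node-to-cluster mapping; the toy graph and names are illustrative only.

```python
def cut_ratio(edges, cluster_of):
    """Share of edges whose endpoints fall in different clusters."""
    cross = sum(1 for u, v in edges if cluster_of[u] != cluster_of[v])
    return cross / len(edges)


# Toy interaction graph: a chain a-b-c-d-e split into two clusters.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
ratio = cut_ratio(edges, cluster_of)  # only the c-d edge crosses clusters
```

In practice the edges would usually be weighted by interaction volume, in which case the sum of cross-cluster weights replaces the count.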

✦Senior DS — Exposure Mapping

Rather than assuming full interference or no interference, exposure mapping (Aronow & Samii, 2017) models the specific interference structure. You define an exposure condition: e.g., "user i is directly treated AND has at least one treated neighbour." This lets you estimate direct effects, indirect effects (spillover), and total effects separately. Uber uses this to decompose marketplace treatment effects into driver-side and rider-side components. Stating this framework signals you understand interference as a modelling problem, not just a nuisance.

Open with

"A 4% lift after two weeks is a short-run estimate. I want to know whether it's novelty-driven (decaying over time) or adoption-driven (growing over time). These have opposite implications for whether the long-run effect is above or below what we measured."

Step 1 — Plot the effect over time: Break the two weeks into daily or weekly cohorts. Is the effect stable, growing, or decaying? Novelty: high in week 1, decays to steady-state. Adoption: low in week 1, grows as users learn the feature.
Step 2 — Cohort analysis: Group users by the day they first saw the feature (day 1 vs day 7 vs day 14) and compare lift by days since first exposure. If lift is highest in each cohort's first days and fades with tenure, novelty is a factor. Aligning on exposure age separates novelty from calendar-time effects.
Step 3 — Ask about survivorship bias: If users who churned during the experiment are dropped from the final analysis, the surviving users are more engaged than average and the estimate is biased, often upward. Analyse everyone who was assigned.
Step 4 — Recommend a holdout group: If the effect is unstable or potentially novelty-driven, keep 5–10% of users in a permanent holdout and measure the long-run effect at 4–8 weeks.
Step 5 — Decision framework: If week-1 vs week-2 effect differs by more than 30% of the MDE, extend the experiment or ship to a partial holdout before full rollout.

Validity threats

Novelty effect: Users explore new UI → inflated short-run engagement that decays
Adoption lag: Complex features take time to be discovered → underestimated effect at 2 weeks
Survivorship bias: Users who churn during experiment are excluded → surviving cohort is unrepresentative
Regression to the mean: Users recruited during a spike period regress toward average over time
Seasonal confounding: Holiday or product cycle effects masquerade as treatment effects in short windows

Novelty Decay Model

Effect(t) = τ_steady + (τ_0 − τ_steady) × exp(−λt)
where:
  τ_0 = initial effect (week 1)
  τ_steady = long-run steady-state effect
  λ = decay rate
If τ_0 > τ_steady: novelty-inflated — wait for steady-state
If τ_0 < τ_steady: adoption curve — extend experiment window
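
To make the decay model concrete, the sketch below evaluates it and derives the half-life of the transient component. The parameter values are invented for illustration; in a real analysis they would be fitted to the daily treatment-effect series.

```python
from math import exp, log


def effect_at(t: float, tau_0: float, tau_steady: float, lam: float) -> float:
    """Treatment effect at time t under exponential convergence to steady state."""
    return tau_steady + (tau_0 - tau_steady) * exp(-lam * t)


def half_life(lam: float) -> float:
    """Time for the gap between the current and steady-state effect to halve."""
    return log(2) / lam


# Hypothetical: 4% lift on day 0 decaying toward 1% with lambda = 0.2 per day.
day0 = effect_at(0, tau_0=4.0, tau_steady=1.0, lam=0.2)
at_half_life = effect_at(half_life(0.2), tau_0=4.0, tau_steady=1.0, lam=0.2)
```

The same function covers the adoption case: with tau_0 below tau_steady the curve rises toward the steady state instead of falling.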

Holdout Sizing Rule of Thumb

Holdout size: 5–10% of traffic
Duration: 4–12 weeks post-launch
Powered to detect: 20% degradation in north-star at 90% power
Holdout estimator: τ_long = Y_launched − Y_holdout, measured at week 8+ after launch

When to Extend vs Ship

Observation	Interpretation	Action
Flat effect week 1→2	Stable, likely real	Ship with confidence
Decaying effect week 1→2	Likely novelty-driven	Extend 2–4 weeks or use holdout
Growing effect week 1→2	Adoption lag — underestimated	Extend; effect may be larger
Volatile (high day-to-day var)	Metric too noisy	Increase n or switch metric

✦Senior DS — Platform-level Holdouts

Per-feature holdouts measure the effect of a single feature at long horizons. But the more powerful pattern — used at Google, Microsoft, and Netflix — is a permanent platform-level holdout: 1% of all users who never receive any new feature. This creates a counterfactual for the cumulative effect of all shipped features. The "holdout gap" grows over time and is one of the strongest signals of platform health. A shrinking holdout gap means features are collectively harmful; a growing one means the product is genuinely improving. Per-feature holdouts cannot measure this compounding effect.

✦Senior DS — Novelty Effect Correction

Hohnhold et al. (2015) at Google showed that for ad-click features, the novelty effect decays roughly exponentially and stabilises within 3–4 weeks. You can fit the decay curve on the time series of daily treatment effects and extrapolate to τ_steady. This requires ≥ 3 weeks of data and assumes the decay is monotone. The key output: a corrected long-run estimate with wider confidence intervals (because you're extrapolating). This is far better than either ignoring the decay or refusing to ship until 8 weeks of holdout data accumulates.

Open with

"A/B test and bandit answer different questions. A/B testing is a fixed-horizon hypothesis test — it produces an unbiased causal estimate with controlled error rates. A bandit is an online optimisation algorithm — it minimises regret during the experiment by adaptively routing traffic. They are not interchangeable."

Step 1 — When to use A/B: You need a confident, unbiased estimate of the causal effect. The decision has a long shelf life (you'll reference it for 6+ months). You need to control Type I/II errors for regulatory, financial, or high-stakes decisions.
Step 2 — When to use bandit: You have many arms (≥ 10, e.g., creative variants, push-notification copy). The reward signal is immediate. The arms are short-lived (content that expires). You care more about cumulative reward during the experiment than about a clean causal estimate.
Step 3 — Thompson Sampling mechanics: Maintain a Beta(α, β) posterior per arm. At each step, sample θ_i from each posterior and route to the arm with the highest θ_i. Arms that perform well accumulate higher α → posterior shifts right → get more traffic.
Step 4 — UCB1 mechanics: Route to the arm with the highest (x̄_i + sqrt(2 ln t / n_i)). The second term is the exploration bonus — arms with fewer observations get a larger bonus. Deterministic, with regret that grows only logarithmically in t for rewards bounded in [0, 1].
Step 5 — The bandit's cost: Naive bandit estimates are biased because traffic allocation depends on past rewards. An arm that looks bad early is starved of traffic, so its estimate is never corrected, and the winner's lead can partly reflect early luck. You cannot use bandit data for offline evaluation without bias correction (IPS/doubly robust estimator).

When bandits fail

Delayed rewards: Thompson Sampling requires immediate reward; 7-day conversion signal gives stale posteriors
Non-stationarity: Rewards drift over time — a bandit trained on Monday's distribution may lock in a suboptimal arm by Friday
Survivor bias in offline eval: Bandits create a biased log — you cannot evaluate alternative policies on bandit-collected data without off-policy correction
Long-run effects ignored: A bandit optimises immediate reward; novelty-inflated arms win early and dominate even after the effect decays

Thompson Sampling Update Rule

Prior: arm i ~ Beta(α_i, β_i) (initialise: α=1, β=1)
Observe: k successes in n_i trials
Posterior: Beta(α_i + k, β_i + n_i − k)
At each step:
  sample θ_i ~ Beta(α_i, β_i) for each arm i
  route to arm a* = argmax_i(θ_i)
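
A minimal simulation of the update rule for Bernoulli rewards, using only the standard library. The true conversion rates are made up, and the simulate helper is a hypothetical harness for illustration rather than a production router.

```python
import random


def choose_arm(posteriors, rng):
    """Sample theta_i from each Beta posterior and pick the largest draw."""
    samples = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(samples)), key=samples.__getitem__)


def update(posteriors, arm, reward):
    """Beta-Bernoulli conjugate update: success bumps alpha, failure bumps beta."""
    a, b = posteriors[arm]
    posteriors[arm] = (a + reward, b + 1 - reward)


def simulate(true_rates, n_steps, seed=0):
    rng = random.Random(seed)
    posteriors = [(1, 1) for _ in true_rates]  # uniform Beta(1, 1) priors
    pulls = [0] * len(true_rates)
    for _ in range(n_steps):
        arm = choose_arm(posteriors, rng)
        reward = 1 if rng.random() < true_rates[arm] else 0
        update(posteriors, arm, reward)
        pulls[arm] += 1
    return pulls, posteriors


pulls, posteriors = simulate(true_rates=[0.05, 0.10, 0.30], n_steps=5000)
```

Traffic concentrates on the best arm as its posterior sharpens, which is exactly why the per-arm sample sizes end up very unequal.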

UCB1 Action Rule

a_t = argmax_i [ x̄_i + sqrt(2 ln t / n_i) ]
  x̄_i = empirical mean reward for arm i
  n_i = number of pulls of arm i
  t = total pulls so far
Exploration bonus shrinks as n_i grows → exploitation increases
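
The action rule translates almost directly into code. This sketch assumes rewards in [0, 1] and adds the usual convention of pulling every arm once before the bonus is defined; the function name is illustrative.

```python
from math import log, sqrt


def ucb1_choose(means, counts, t):
    """Return the index of the arm with the highest upper confidence bound."""
    # An unpulled arm has an undefined bonus, so try each arm once first.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [m + sqrt(2 * log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


# A rarely pulled arm wins on its exploration bonus despite a lower mean.
explore = ucb1_choose(means=[0.50, 0.45], counts=[1000, 5], t=1005)
# With equal counts the bonus cancels and the higher mean wins.
exploit = ucb1_choose(means=[0.50, 0.40], counts=[100, 100], t=200)
```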

A/B Test vs Bandit — Decision Matrix

Criterion	A/B Test	Bandit
Causal estimate	Unbiased	Biased (arm selection feedback)
Error control	Controlled α, β	No Type I/II guarantees
Regret during experiment	High (equal traffic to bad arms)	Low (adaptive)
Arms	2–5	10–thousands
Reward delay	Tolerates delay	Requires near-immediate reward
Decision shelf life	Long (months)	Short (exploit now)

✦Senior DS — Off-Policy Evaluation

Bandit logs are biased data: the logging policy chose each arm based on past rewards, so naive per-arm averages do not estimate what another policy would have earned. If you want to evaluate a new policy (e.g., "what if we had used UCB1 instead of Thompson?") on historical bandit data, you need off-policy evaluation. The IPS (Inverse Propensity Score) estimator reweights each observed reward by the ratio of the target policy's action probability to the logging policy's action probability. The doubly-robust estimator adds an outcome model for variance reduction. Stating this distinction separates engineers who understand the bandit literature from those who just know the algorithm names.
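
A minimal sketch of the IPS estimator described above. It assumes each log record stores the probability with which the logging policy chose the action; the record layout and the toy policies are hypothetical.

```python
def ips_value(logs, target_prob):
    """Estimate the average reward a target policy would have earned.

    logs: iterable of (context, action, reward, logging_prob) tuples.
    target_prob: function (context, action) -> probability under the target policy.
    """
    total = 0.0
    count = 0
    for context, action, reward, logging_prob in logs:
        # Importance weight: how much more (or less) likely the target policy
        # is to take the logged action than the logging policy was.
        total += reward * target_prob(context, action) / logging_prob
        count += 1
    return total / count


# Toy log from a uniform-random logger over two actions; action 1 always pays 1.
logs = [(None, 0, 0.0, 0.5), (None, 1, 1.0, 0.5), (None, 0, 0.0, 0.5), (None, 1, 1.0, 0.5)]
always_one = lambda context, action: 1.0 if action == 1 else 0.0
estimate = ips_value(logs, always_one)
```

The estimator breaks down when logging_prob is near zero for actions the target policy favours, which is why production systems usually keep a floor on exploration probability.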

✦Senior DS — Contextual Bandits

Standard MAB ignores context — the same arm is optimal for all users. Contextual bandits (LinUCB, Neural Bandit) condition arm selection on user and context features, making the algorithm personalised. LinUCB maintains a linear model of reward per arm: reward_i(x) = x^T θ_i + noise, with an exploration bonus based on the uncertainty of the linear estimate. This is the dominant approach in production recommendation systems (news feeds, ads). The key challenge: balancing the exploration needed to learn the linear model parameters against exploitation of the current best estimate — exactly the same tradeoff as MAB, but in a higher-dimensional feature space.

Open with

"Difference-in-Differences removes time-invariant confounders and common time shocks by comparing the change in outcome for the treated city against the change in a comparable control city over the same period. The critical assumption I need to defend is parallel trends — not that the two cities are identical, but that they were moving in the same direction before the policy."

Step 1 — State the estimand: ATT — Average Treatment Effect on the Treated. What happened to ridership in the treated city because of the subsidy? Not the ATE across all cities.
Step 2 — Select a control group: Choose cities similar on pre-treatment ridership trends, demographics, and economic conditions. Avoid cities that also changed transit policy during the window.
Step 3 — Verify parallel trends: Plot pre-treatment ridership for treated and control cities. They need not have the same level — they must be moving in parallel. Test formally with an event study regression.
Step 4 — Estimate DiD: Regress ridership on Treated, Post, and their interaction (Treated × Post). The coefficient on the interaction is the DiD estimate.
Step 5 — Cluster standard errors: Cluster at the city level, not the observation level. Within-city observations are correlated. Failure to cluster understates standard errors dramatically. With only a handful of cities, cluster-robust SEs are themselves unreliable; use a wild cluster bootstrap or randomisation inference.
Step 6 — Placebo tests: Run DiD on a metric the subsidy should not affect (e.g., car sales). A significant effect on the placebo metric suggests the parallel trends assumption is violated.

Validity threats

Parallel trends violation: Treated city was already growing faster before the subsidy — most common failure mode
Anticipation effects: Commuters change behaviour before the policy takes effect, biasing the pre-period
Spillover: Nearby cities change behaviour in response to the policy, contaminating the control group
Compositional change: Who lives in the city changes over time (gentrification), making the pre-post comparison invalid

DiD Estimator

τ̂_DiD = (Ȳ_treated,post − Ȳ_treated,pre) − (Ȳ_control,post − Ȳ_control,pre)
Regression form (equivalent):
  Y = α + β₁·Treated + β₂·Post + τ·(Treated×Post) + ε
τ is the DiD estimate. SEs clustered on the group (city) level.
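
The 2×2 estimator is just four means. A minimal sketch with invented ridership figures (thousands of trips per week):

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences from the four group-by-period means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical data: the treated city rises by 20, the control city by 10.
tau_hat = did_estimate(
    treated_pre=[100, 102], treated_post=[120, 122],
    control_pre=[80, 82], control_post=[90, 92],
)
```

The two cities sit at different levels (about 100 vs 80), which is fine: only the difference in changes enters the estimate.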

Event Study Regression (Parallel Trends Test)

Y_it = α + Σ_{k ≠ −1} τ_k · (Treated_i × 1[t − t₀ = k]) + FE_i + FE_t + ε_it
  k = periods relative to the policy date t₀; k = −1 is the omitted reference period.
Pre-treatment τ_k (k < −1) should ≈ 0.
A plot of τ_k vs k (event study plot) is the standard validity check.
Significant pre-period τ_k = evidence that parallel trends is violated.

Common Applications

Setting	Treatment	Control
Policy evaluation	State/city with new law	Similar state/city without
Feature launch	Country where feature went live	Country where it didn't
Dark launch	Users opted in early	Comparable users who didn't
Market shock	Firms hit by industry event	Unaffected peer firms

✦Senior DS — Staggered DiD and Negative Weights

When different units receive treatment at different times (staggered adoption), the standard DiD regression is a weighted average of all pairwise (treated group g, control group g') comparisons — and the weights can be negative. This means an early-treated unit can act as the "control" for a later-treated unit, and if treatment effects are heterogeneous over time, the aggregate estimate is biased. Goodman-Bacon (2021) decomposed this. The fix (Callaway & Sant'Anna, 2021): compute group-time ATTs using only never-treated or not-yet-treated units as controls, then aggregate. If you're running staggered DiD without this correction, your estimates may be sign-reversed.

✦Senior DS — Synthetic Control

When you have a single treated unit (one city, one country) and can't rely on any single control, synthetic control (Abadie et al., 2010) constructs a weighted combination of control units that best replicates the treated unit's pre-treatment trajectory. The weights are optimised to minimise pre-period difference. The key insight: the synthetic control is a more principled counterfactual than any hand-picked comparator, and you can visualise the pre-period fit to assess validity. It's limited to cases where you have many pre-treatment periods and a small number of treated units — exactly the "one city, one policy" scenario that DiD struggles with.

Open with

"IV lets us estimate a causal effect when the treatment is endogenous — correlated with unobserved confounders. The instrument must satisfy three conditions: relevance (it predicts the treatment), exogeneity (it's independent of unobserved confounders), and the exclusion restriction (it affects the outcome only through the treatment). I'll validate each before trusting the estimate."

Step 1 — Relevance check: Regress app usage (D) on prompt length (Z). The first-stage F-statistic must be > 10 (Staiger-Stock 1997 rule of thumb). Weak instruments (F < 10) bias 2SLS toward OLS and inflate SEs.
Step 2 — Exogeneity argument: Prompt length was randomly assigned — this satisfies exogeneity. If it were self-selected, I'd need a stronger argument. Document the randomisation mechanism.
Step 3 — Exclusion restriction argument: Does prompt length affect 30-day retention through any path other than app usage? Risk: a longer onboarding prompt might signal product quality, directly affecting retention independent of usage. This is untestable — argue from domain knowledge.
Step 4 — Run 2SLS: Stage 1: regress D (app usage) on Z (prompt length) → get D̂. Stage 2: regress Y (retention) on D̂. The Stage 2 coefficient is the IV estimate.
Step 5 — Interpret LATE: 2SLS estimates LATE — the average effect for compliers (users whose usage changed due to prompt length). This may not equal ATE. Characterise who the compliers are.

Validity threats

Weak instrument: F < 10 → 2SLS is biased toward OLS, SEs are inflated; use Anderson-Rubin CI instead
Exclusion restriction violation: Prompt length directly affects retention independent of usage (e.g., longer prompt signals premium quality) — untestable
LATE ≠ ATE: Compliers (users sensitive to prompt length) may not represent the full user population
Monotonicity violation: If some users use the app less with a longer prompt (defiers), LATE is biased

Two-Stage Least Squares (2SLS)

Stage 1: D̂ = γ₀ + γ₁·Z + controls (get predicted usage)
Stage 2: Y = β₀ + τ·D̂ + controls (use predicted usage)
Wald estimator (no controls): τ_IV = Cov(Z, Y) / Cov(Z, D) = Reduced form / First stage
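
The Wald estimator can be checked on a tiny constructed dataset. In the sketch below the true effect of D on Y is 2, and an unobserved variable u pushes both D and Y, so OLS is biased while the instrument recovers the effect. All data and names are fabricated for illustration.

```python
def cov(xs, ys):
    """Population covariance; the normalisation cancels in the ratios below."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def wald_iv(z, d, y):
    """Reduced form divided by first stage: Cov(Z, Y) / Cov(Z, D)."""
    return cov(z, y) / cov(z, d)


def ols_slope(d, y):
    return cov(d, y) / cov(d, d)


z = [0, 0, 1, 1]                                 # randomised instrument
u = [0, 1, 0, 1]                                 # unobserved confounder, unrelated to z
d = [zi + ui for zi, ui in zip(z, u)]            # treatment depends on z and u
y = [2 * di + 3 * ui for di, ui in zip(d, u)]    # true effect of d is 2

iv_estimate = wald_iv(z, d, y)
ols_estimate = ols_slope(d, y)
```

Here OLS returns 3.5 because it attributes the confounder's contribution to usage, while the IV ratio returns the true 2.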

Instrument Strength Tests

First-stage F-stat > 10 (Staiger-Stock 1997, rule of thumb)
First-stage F-stat > 104.7 (Lee, McCrary, Moreira & Porter 2022: threshold for the usual 2SLS t-test to have true 5% size)
Weak instrument test: if F < 10, use the Anderson-Rubin test; its CI is valid regardless of instrument strength.

Classic Instruments in Tech & Economics

Instrument	Treatment	Setting
Random onboarding variant	Feature adoption	Product analytics
Distance to treatment centre	Programme participation	Policy evaluation
Lottery assignment	Medicaid coverage	Health economics
Judge leniency	Incarceration	Criminal justice
Birthday/age cutoff	School starting age	Education economics

✦Senior DS — LATE vs ATE and Policy Relevance

2SLS estimates LATE — the effect for users whose treatment status changed because of the instrument. If you're using the Medicaid lottery as an instrument, you estimate LATE for lottery winners who took up insurance — not the effect for the average uninsured person. If compliers are unusually healthy or motivated relative to the full population, LATE overestimates ATE. Always characterise who the compliers are by comparing their baseline covariates to the full sample. Saying "LATE ≠ ATE and here's why it matters for this policy decision" is the clearest signal of senior-level causal reasoning.

✦Senior DS — Anderson-Rubin Test for Weak Instruments

When instruments are weak (F < 10), the standard 2SLS confidence interval is unreliable — it can exclude the true effect with probability far exceeding the nominal α. The Anderson-Rubin (AR) test inverts a test of the hypothesis β = β₀ in the reduced form equation, bypassing the first stage entirely. The AR confidence interval is valid regardless of instrument strength. The cost: wider intervals. In practice: always check F-stat, report AR CI alongside 2SLS CI when F is borderline, and never claim precision you don't have. The correct statement when F = 6: "The instrument is weak; the 2SLS estimate is imprecise and potentially biased toward OLS."

Open with

"The key assumption is conditional independence — no unmeasured confounders: conditional on observed covariates X, treatment assignment is as good as random. This is the strong ignorability assumption. There's no way to test it from data alone. I'll proceed under this assumption, but I'll also run a sensitivity analysis at the end to understand how robust the estimate is to hidden bias."

Step 1 — Estimate propensity scores: e(X) = P(premium=1 | X). Use logistic regression or gradient boosting. Check calibration: decile plot of predicted vs observed subscription rates.
Step 2 — Check overlap: Plot the propensity score distribution for premium and free users. If scores for premium users are concentrated near 1 (or free users near 0), there's no common support — the comparison is invalid for those strata.
Step 3 — Choose estimator: Matching for ATT (average effect on premium subscribers). IPW for ATE. AIPW (doubly robust) for best of both — consistent if either the propensity or outcome model is correct.
Step 4 — Check balance: Compute Standardized Mean Difference (SMD) for each covariate before and after adjustment. Target SMD < 0.1 for all covariates. If not achieved, respecify the propensity model.
Step 5 — Sensitivity analysis: Rosenbaum bounds — how large would an unmeasured confounder need to be (in odds ratio terms) to overturn the conclusion? If the result is fragile to even mild confounding (Γ = 1.2), report with caution.

Validity threats

Unmeasured confounding: Users who subscribe premium are also more engaged — engagement is the confounder, and it may not be fully captured by observed features
Positivity violation: Some covariate combinations only occur in premium users — no valid counterfactual exists for these users
Extreme IPW weights: Scores near 0 or 1 produce very large weights → high variance, one influential observation can dominate
Model misspecification: Wrong propensity model → residual confounding even after adjustment

IPW Estimator (ATE)

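ATE estimate = (1/n) × Σ_i [ T_i·Y_i / e(X_i) − (1−T_i)·Y_i / (1−e(X_i)) ]
Raw weights:
  w_i = 1 / e(X_i) for treated
  w_i = 1 / (1−e(X_i)) for control
Stabilised weights (better variance): scale by the marginal treatment probability
  w_i = P(T=1) / e(X_i) for treated
  w_i = P(T=0) / (1−e(X_i)) for control
Clip weights at 99th percentile to reduce variance.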
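
A minimal sketch of the IPW estimator. As a simplification it bounds the propensity scores away from 0 and 1 rather than clipping weights at a percentile; that substitution, the default bounds, and the toy data are assumptions made for illustration.

```python
def ipw_ate(t, y, e, clip=(0.01, 0.99)):
    """Horvitz-Thompson style IPW estimate of the ATE.

    t: treatment indicators (0/1), y: outcomes, e: estimated propensity scores.
    """
    lo, hi = clip
    total = 0.0
    for ti, yi, ei in zip(t, y, e):
        ei = min(max(ei, lo), hi)  # bound scores so no single weight explodes
        total += ti * yi / ei - (1 - ti) * yi / (1 - ei)
    return total / len(y)


# Toy data: treated outcomes are 3, control outcomes are 1, propensity 0.5 for all.
ate_hat = ipw_ate(t=[1, 0, 1, 0], y=[3, 1, 3, 1], e=[0.5, 0.5, 0.5, 0.5])
```

With a constant propensity of 0.5 the estimator reduces to the plain difference in means, which is a quick sanity check before applying it to real scores.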

AIPW — Doubly Robust Estimator

AIPW = (1/n) × Σ_i [
    Q(1, X_i) − Q(0, X_i)                       ← outcome model term
  + T_i·(Y_i − Q(1, X_i)) / e(X_i)              ← IPW correction (treated)
  − (1−T_i)·(Y_i − Q(0, X_i)) / (1−e(X_i))      ← IPW correction (control)
]
Consistent if either e(X) or Q(T, X) is correctly specified.
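
The double-robustness property can be seen on toy data: the estimate stays correct when either nuisance model is wrong, as long as the other is right. The sketch takes pre-computed model predictions as inputs; the argument names q1 and q0 for Q(1, X) and Q(0, X) are illustrative.

```python
def aipw_ate(t, y, e, q1, q0):
    """AIPW estimate of the ATE from propensities e and outcome predictions q1, q0."""
    total = 0.0
    for ti, yi, ei, m1, m0 in zip(t, y, e, q1, q0):
        outcome_term = m1 - m0
        treated_correction = ti * (yi - m1) / ei
        control_correction = (1 - ti) * (yi - m0) / (1 - ei)
        total += outcome_term + treated_correction - control_correction
    return total / len(y)


t, y = [1, 0, 1, 0], [3.0, 1.0, 3.0, 1.0]  # true effect is 2

# Correct outcome model, wrong propensity (0.8 instead of 0.5).
right_outcome = aipw_ate(t, y, e=[0.8] * 4, q1=[3.0] * 4, q0=[1.0] * 4)
# Wrong outcome model (all zeros), correct propensity.
right_propensity = aipw_ate(t, y, e=[0.5] * 4, q1=[0.0] * 4, q0=[0.0] * 4)
```

Both calls return 2. When the outcome model is right the residuals vanish and the weights are irrelevant; when it is wrong the correctly weighted residuals repair it.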

Covariate Balance Check

SMD_j = (μ_1j − μ_0j) / sqrt((σ²_1j + σ²_0j) / 2)
Target: |SMD_j| < 0.1 for all covariates j after adjustment.
Love plot: bar chart of |SMD| before/after, one row per covariate.
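
A minimal sketch of the balance check for one covariate in the unweighted case. After IPW the means and variances would be weighted, which this version does not handle; the sample values are invented.

```python
from math import sqrt
from statistics import mean, variance


def smd(x_treated, x_control):
    """Standardised mean difference using the average of the two group variances."""
    pooled_sd = sqrt((variance(x_treated) + variance(x_control)) / 2)
    return (mean(x_treated) - mean(x_control)) / pooled_sd


# Hypothetical covariate (e.g. sessions per week) in each group.
imbalance = smd([2, 4, 6], [1, 3, 5])
needs_respecification = abs(imbalance) >= 0.1
```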

✦Senior DS — The Propensity Score Paradox

Kang & Schafer (2007) demonstrated that in some simulation settings, IPW with a correctly specified propensity model performs worse than simple regression adjustment with a correctly specified outcome model. Doubly robust estimators (AIPW) win because they're consistent if either model is right — but if both are misspecified, AIPW can have catastrophic finite-sample performance due to near-zero propensity scores producing extreme weights. The modern fix is TMLE (Targeted Maximum Likelihood Estimation), which stabilises AIPW by "targeting" the estimate of interest rather than fitting a general nuisance model. Netflix and Uber use TMLE for their causal inference workflows.

✦Senior DS — When Not to Use PSM

If overlap is very poor — for example, power users are a completely different population from free users on every observable dimension — reweighting cannot make them comparable. No amount of propensity score adjustment creates valid counterfactuals when the groups don't overlap. In this case, consider: (1) restricting analysis to the region of common support and reporting the ATT for that subset, (2) using Manski-style partial identification bounds instead of a point estimate, or (3) finding an instrument. The worst outcome is reporting an IPW estimate with extreme, untrimmed weights as if it were valid — this is a common mistake in observational ML papers.

Open with

"This is a regression discontinuity setup — the badge threshold at 100 reviews creates a sharp discontinuity in treatment assignment. Sellers just above 100 and just below 100 are likely very similar on unobserved dimensions, making the comparison near the cutoff as good as locally randomised. I'm estimating a LATE at the threshold — not the ATE across all sellers."

Step 1 — Density test (McCrary test): Plot the distribution of review counts near the threshold. A spike at exactly 100 suggests sellers manipulate their review count to just cross the threshold — this invalidates RDD. Formally test with McCrary (2008) density test.
Step 2 — Visual inspection: Plot average sales vs review count, binned, with a discontinuity marker at 100. A visible jump at 100 is the visual estimate. Absence of a jump = likely no effect.
Step 3 — Local linear regression: Fit separate linear regressions on each side of the cutoff within a bandwidth h. Use local linear, NOT a global polynomial — global polynomials extrapolate incorrectly near boundaries (Gelman & Imbens 2019).
Step 4 — Optimal bandwidth: Use the Calonico-Cattaneo-Titiunik (CCT) data-driven bandwidth selector. It minimises the MSE of the local linear estimator.
Step 5 — Robustness checks: Report estimates at 0.5h*, h*, and 2h*. If estimates change dramatically across bandwidths, the local relationship is unstable and results are fragile.
Step 6 — Placebo tests: Test for discontinuities at fake cutoffs (e.g., 80 and 120 reviews). Significant effects at non-thresholds indicate confounding.

Validity threats

Manipulation: Sellers soliciting reviews to reach exactly 100 — density test will catch this
Local validity: RDD LATE is only for sellers near 100 reviews; effect may not generalise to all sellers
Discrete running variable: Review count is integer — many sellers at exactly 100 creates a mass point; use bias-correction
Bandwidth sensitivity: If estimates differ sharply across bandwidths, the estimate is unreliable

Sharp RDD Estimand

τ_RDD = lim[x→c+] E[Y | X=x] − lim[x→c−] E[Y | X=x]
Local linear regression (within bandwidth h):
  Y = α + τ·T + β_L·(X−c) + β_R·(X−c)·T + ε   for |X−c| < h
  T = 1[X ≥ c], c = cutoff (100 reviews), h = bandwidth
τ is the RDD estimate (discontinuity in E[Y] at X=c).
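
With a uniform kernel, the local linear regression is equivalent to fitting one line on each side of the cutoff and differencing the intercepts at c. The sketch below does that with plain least squares; it omits kernel weighting, bias correction, and standard errors, and the simulated data are illustrative.

```python
def _fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def sharp_rdd(x, y, cutoff, h):
    """Jump in the fitted outcome at the cutoff, using data within bandwidth h."""
    # Centre the running variable so each intercept is the fitted value at the cutoff.
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff - h < xi < cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff <= xi < cutoff + h]
    intercept_left, _ = _fit_line(*zip(*left))
    intercept_right, _ = _fit_line(*zip(*right))
    return intercept_right - intercept_left


# Simulated sellers: sales rise 0.1 per review, plus a badge jump of 5 at 100 reviews.
reviews = list(range(90, 110))
sales = [2 + 0.1 * (r - 100) + (5 if r >= 100 else 0) for r in reviews]
tau_hat = sharp_rdd(reviews, sales, cutoff=100, h=10)
```

Re-running with different values of h is the bandwidth robustness check in its simplest form.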

Fuzzy RDD as IV

Fuzzy RDD: P(T=1 | X) jumps at c but doesn't go from 0 to 1 (e.g., badge is automatic only sometimes).
Fuzzy RDD estimator (Wald):
  τ_Fuzzy = τ_reduced_form / τ_first_stage
          = Jump in E[Y|X] at c / Jump in P(T|X) at c
This is exactly 2SLS with Z = 1[X ≥ c] as the instrument.

Why Local Linear, Not Global Polynomial

Approach	Problem	Recommendation
Global polynomial (degree 4+)	Extrapolates strongly; boundary estimates depend on data far from cutoff	Avoid (Gelman & Imbens 2019)
Local linear	Uses only data near cutoff; lower bias at boundary	Default choice
Local quadratic	Slightly more flexible; higher variance	Robustness check only

✦Senior DS — Bandwidth as a Bias-Variance Tradeoff

The bandwidth h controls a fundamental bias-variance tradeoff. A wide bandwidth includes more data (lower variance) but units far from the cutoff may differ systematically from units near it (higher bias). The CCT optimal bandwidth minimises asymptotic MSE = Bias² + Variance, trading off the two. But asymptotic MSE optimality doesn't guarantee good finite-sample performance. Always report results across multiple bandwidths. If the estimated effect drifts monotonically with bandwidth, suspect that observations far from the cutoff are contributing functional-form bias (or that the effect itself varies with distance from the cutoff), rather than concluding you have found a stable local effect.

✦Senior DS — Geographic RDD

Geographic RDD uses distance to an administrative boundary (school district lines, county borders, DMA boundaries) as the running variable, with the boundary itself as the cutoff. It is the spatial analogue of the review cutoff: units just on each side of a border are geographically proximate and therefore similar, making the boundary comparison as-good-as-random. The key validity check: density of units should be continuous across the boundary (no sorting). Geographic RDD is powerful for policy evaluation but requires that agents can't systematically choose which side of the boundary to be on; homebuyers who choose school districts invalidate geographic RDD for education research.

Open with

"This is Simpson's paradox — an aggregate trend reverses within every subgroup. The cause is always a confounding variable that (1) differs in distribution between treatment groups and (2) is also associated with the outcome. But critically, which estimate is 'correct' — marginal or conditional — depends on the causal structure, not on the data alone."

Step 1 — Identify the confounding variable: What determines which subgroup a patient is in? Severity. Severe patients are more likely to receive Drug B AND less likely to recover, so severity is both the confounder and the reason Drug B's aggregate looks worse.
Step 2 — Compute subgroup-specific rates: Drug B cures 80% of mild and 40% of severe. Drug A cures 70% of mild and 30% of severe. Drug B is better in each group. But Drug A is given more to mild patients (who have high recovery rates regardless), inflating A's aggregate.
Step 3 — Make the causal decision: Should we use marginal or conditional estimates? This depends on whether severity is a pre-treatment covariate (baseline characteristic) or a mediator (caused by the drug choice). If severity is pre-treatment, use conditional (subgroup-specific) estimates. The correct conclusion: Drug B is better.
Step 4 — Check for collider bias: Is the subgroup variable a collider — a common effect of both treatment and outcome? Conditioning on a collider opens a spurious path. This is the opposite of standard confounding.
Step 5 — Draw the DAG: Drug → Cure. Severity → Drug. Severity → Cure. Severity is a confounder (common cause of Drug and Cure). Adjust for severity — use conditional estimates.

Simpson's paradox traps

Choosing the wrong stratum: Using aggregate estimates when the confounder is a baseline covariate gives the wrong causal conclusion
Collider bias: Conditioning on a variable caused by both treatment and outcome creates spurious associations — the reverse of confounding
Ecological fallacy: Aggregate-level correlations (between country means) don't transfer to individual-level conclusions — related but distinct from Simpson's paradox
Not drawing the DAG: Without a causal diagram, you cannot tell which estimate is correct — the data alone won't tell you

Classic Numerical Example

                 Drug A              Drug B
Mild cases:      700/1000 (70%)      80/100 (80%)
Severe cases:    30/100 (30%)        400/1000 (40%)    ← Drug B better in BOTH groups
Aggregate:       730/1100 (66%)      480/1100 (44%)    ← Drug A looks better overall
Drug A's aggregate is dominated by its 1000 mild patients; Drug B's by its 1000 severe patients. (Counts are illustrative.)
Mnemonic: Drug B is better in each subgroup, but Drug A gets more of the "easier" patients.
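The reversal is easy to verify in a few lines. This sketch uses the subgroup cure rates from Step 2 (70% and 30% for Drug A, 80% and 40% for Drug B) with hypothetical patient counts chosen so that Drug A is given mostly to mild cases. The function names are illustrative. `standardised_rate` reweights each drug's stratum rates to the pooled severity mix, which is the adjustment the DAG calls for when severity is a pre-treatment confounder.

```python
# (cured, treated) per drug and severity stratum; counts are hypothetical.
counts = {
    "A": {"mild": (700, 1000), "severe": (30, 100)},
    "B": {"mild": (80, 100), "severe": (400, 1000)},
}

def stratum_rate(drug, stratum):
    cured, treated = counts[drug][stratum]
    return cured / treated

def aggregate_rate(drug):
    """Marginal cure rate, ignoring severity."""
    cured = sum(c for c, _ in counts[drug].values())
    treated = sum(n for _, n in counts[drug].values())
    return cured / treated

def standardised_rate(drug):
    """Stratum rates reweighted to the pooled severity distribution."""
    total = sum(n for d in counts.values() for _, n in d.values())
    rate = 0.0
    for stratum in ("mild", "severe"):
        weight = sum(counts[d][stratum][1] for d in counts) / total
        rate += weight * stratum_rate(drug, stratum)
    return rate

for drug in ("A", "B"):
    print(drug,
          f"mild={stratum_rate(drug, 'mild'):.2f}",
          f"severe={stratum_rate(drug, 'severe'):.2f}",
          f"aggregate={aggregate_rate(drug):.3f}",
          f"standardised={standardised_rate(drug):.3f}")
```

The aggregate favours Drug A while both strata and the standardised rate favour Drug B, so the marginal comparison is the misleading one under this causal structure.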

Causal Resolution via DAG

Confounder (C): common cause of Treatment (T) and Outcome (Y)
C → T, C → Y, T → Y
Marginal estimate: P(Y|T=1) − P(Y|T=0)   [biased by C]
Conditional: P(Y|T=1,C=c) − P(Y|T=0,C=c)   [unbiased]
Collider (K): common effect of T and Y
T → K ← Y
DO NOT condition on K: this opens a spurious T-Y path.

✦Senior DS — Collider Bias is the Hidden Twin of Confounding

Confounding is well-known: condition on a common cause C (confounder) to close the backdoor path T ← C → Y. Collider bias is less known but equally dangerous: conditioning on K when T → K ← Y OPENS a spurious T-Y path. A classic real-world example: studying the effect of a genetic variant (T) on disease severity (Y) among hospitalised patients (K). Hospitalisation is a collider: it is caused both by the genetic variant and by disease severity (along with other risk factors). The variant-severity association in the hospitalised sample is biased. This is one reason why studies restricted to clinical populations often replicate poorly in the general population.

✦Senior DS — When to Use Marginal vs Conditional Estimates

The rule is causal, not statistical: if C is a pre-treatment baseline covariate that you'd want to account for in treatment allocation, use the conditional (subgroup-specific) estimate. If C is a mediator — on the causal path T → C → Y — using the conditional estimate removes the indirect effect and underestimates the total effect. If your question is "what is the total causal effect of T on Y?" you want the marginal estimate after blocking backdoor paths (via randomisation or adjusting for all confounders, but not mediators). The data literally cannot tell you which is right — only the causal story can. State this upfront in your answer and interviewers will recognise you've thought causally.

Open with

"The colleague is running a sequential hypothesis test with no α adjustment. A classical result on optional stopping says that if you can stop whenever p < 0.05, you will eventually reject a true null with probability approaching 1, just by running long enough. The nominal α = 0.05 is completely broken. We need a method that controls Type I error at any stopping time."

Step 1 — Quantify the inflation: With daily checks and stopping at p < 0.05, simulations show the true Type I error reaches ~20-30% over two to four weeks of daily looks, and ~13% with only 4 planned looks. The test is anti-conservative.
Step 2 — Name the p-hacking variants: Peeking is the most common, but others include: subgroup fishing (testing 20 subgroups and reporting the significant one), metric swapping (testing 10 metrics and declaring the significant one primary), and optional stopping with covariate adjustment.
Step 3 — Fix 1 — Pre-registration: Commit to the exact analysis plan — primary metric, statistical test, planned runtime, any subgroup analyses — before any data is collected. The most important structural fix.
Step 4 — Fix 2 — Sequential testing (mSPRT): Use always-valid p-values. The mSPRT (Johari et al., 2015) produces a p-value that controls Type I error at α at every look. The colleague can check daily — the p-value is valid at any stopping time.
Step 5 — Fix 3 — Alpha spending (for planned looks): If you know in advance how many looks you'll take (e.g., weekly for 4 weeks), use O'Brien-Fleming or Pocock alpha spending to allocate the α budget across looks.

Other p-hacking forms in industry

HARKing: Hypothesizing After Results are Known — reporting a post-hoc hypothesis as if it were pre-specified
Garden of forking paths: Many plausible analytical decisions (outlier threshold, metric window, population filter) each seem justified, but collectively inflate Type I error
Selective reporting: Showing only the segments/metrics that were significant — FWER ≈ 64% for 20 metrics
Undisclosed flexibility: Adjusting the metric definition or analysis window after seeing preliminary results

Peeking Inflation (Simulation Result)

Checks per experiment → True Type I error (α=0.05 nominal):
1 check (fixed horizon): 5.0% ← correct
2 checks (at 50%, 100%): 8.3%
4 checks (at 25%, 50%, 75%, 100%): ~12.6%
100 daily checks: ~37%
Source: Johari et al. (2015), "Always Valid Inference"
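The inflation can be reproduced with a short simulation. This is a sketch under simplifying assumptions: a one-sample z-test with known variance, equally spaced looks, and a true null. Because only the standardised statistic matters, each look adds one standard-normal block sum. The function name and the choice of 5,000 simulated experiments are illustrative.

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(n_looks, n_sims=5000, alpha=0.05, seed=0):
    """Share of null experiments that hit |z| > z_crit at ANY of n_looks looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Sum of one equally sized block of data, standardised to N(0, 1).
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5  # z-statistic on all data so far
            if abs(z) > z_crit:
                false_positives += 1  # the analyst stops and declares a win
                break
    return false_positives / n_sims

rates = {k: peeking_false_positive_rate(k) for k in (1, 2, 4, 100)}
for k, rate in rates.items():
    print(f"{k:>3} looks: {rate:.1%} false positives")
```

With a single look the rate sits near the nominal 5%, and it climbs steadily as looks are added, even though nothing about the data has changed.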

O'Brien-Fleming Alpha Spending

At fraction t of total planned n, cumulative α spent: α(t) = 2·(1 − Φ(z_α/2 / √t))
At t=0.25: α(0.25) ≈ 0.0001 (almost nothing)
At t=0.50: α(0.50) ≈ 0.006
At t=1.00: α(1.00) = 0.05 (full α at end)
Conservative early on; near-full power at final look.
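The spending function is a one-liner with the standard library. This sketch evaluates the cumulative spend at four equally spaced looks and the increment available at each look; the function name `obf_spend` and the four-look schedule are assumptions for illustration. Turning the increments into exact rejection boundaries needs the joint distribution of the interim statistics, which group-sequential software handles.

```python
from statistics import NormalDist

def obf_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (O'Brien-Fleming type)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / t ** 0.5))

looks = [0.25, 0.50, 0.75, 1.00]
cumulative = [obf_spend(t) for t in looks]
# Alpha newly available at each look = cumulative spend minus what was spent before.
increments = [c - p for c, p in zip(cumulative, [0.0] + cumulative[:-1])]

for t, c, inc in zip(looks, cumulative, increments):
    print(f"t={t:.2f}: cumulative={c:.5f}, increment={inc:.5f}")
```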

mSPRT (Always-Valid p-value)

mSPRT p-value valid at ANY stopping time.
Rejects when mixture likelihood ratio Λ_t ≥ 1/α.
Key property: P(reject H₀ at any time | H₀) ≤ α
Key cost: ~10–20% more data than fixed-horizon test
Used by: Optimizely, Netflix, Airbnb
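A simplified sketch of the always-valid p-value, assuming a one-sample stream of normal observations with known σ and a normal mixing distribution N(θ₀, τ²) over the alternative. Under those assumptions the mixture likelihood ratio has the closed form Λ_n = √(σ²/(σ²+nτ²)) · exp(S_n²τ² / (2σ²(σ²+nτ²))) with S_n = Σ(y_i − θ₀), and the p-value is the running minimum of 1/Λ_n. The function names, τ = 1, and the simulation settings are illustrative choices; a production A/B version would work on the difference in means between arms.

```python
import math
import random

def msprt_pvalues(ys, theta0=0.0, sigma=1.0, tau=1.0):
    """Always-valid p-values p_n = min(p_{n-1}, 1 / Lambda_n) for a data stream."""
    p, s, out = 1.0, 0.0, []
    for n, y in enumerate(ys, start=1):
        s += y - theta0
        denom = sigma ** 2 + n * tau ** 2
        log_lr = (0.5 * math.log(sigma ** 2 / denom)
                  + (s * s * tau ** 2) / (2 * sigma ** 2 * denom))
        p = min(p, math.exp(-log_lr))
        out.append(p)
    return out

def rejection_rate(theta, n_sims, horizon=300, alpha=0.05, seed=0):
    """Share of simulated streams whose p-value ever drops to alpha or below."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        stream = [rng.gauss(theta, 1.0) for _ in range(horizon)]
        # The running minimum makes the last p-value the smallest one seen.
        if msprt_pvalues(stream)[-1] <= alpha:
            hits += 1
    return hits / n_sims

false_positive_rate = rejection_rate(theta=0.0, n_sims=1000)  # null is true
power = rejection_rate(theta=0.4, n_sims=200)                 # real effect
print(f"false positive rate with a check after every observation: "
      f"{false_positive_rate:.3f}")
print(f"power at theta=0.4: {power:.3f}")
```

Because the p-value only ever decreases, checking it after every observation is equivalent to checking it once at the end, which is what makes continuous monitoring safe here.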

✦Senior DS — Why Peeking Fails: Martingale Argument

Under the null hypothesis, the test statistic computed on cumulative data behaves like a normalised random walk. At any single fixed sample size it exceeds the critical value with probability α, but its running maximum over many looks has a much heavier tail. By the law of the iterated logarithm the statistic eventually crosses any fixed threshold, so P(p(t) ≤ α for some t) approaches 1 as the number of looks grows: the peeking procedure has no Type I error control. The mSPRT sidesteps this by using the mixture likelihood ratio Λ_t, which is a non-negative martingale with expectation 1 under the null. Ville's inequality then bounds its running maximum, P(Λ_t ≥ 1/α for some t) ≤ α, giving always-valid inference without giving up the ability to stop early.

✦Senior DS — HARKing Detection in Experiment Reviews

HARKing — Hypothesizing After Results are Known — is rampant in industry experiment reviews because it's almost undetectable after the fact. Red flags: (1) absence of a power analysis pre-run (no pre-specified n means you could stop at any convenient point), (2) post-hoc justifications for unexpected results ("we didn't pre-specify this, but in hindsight it makes sense because..."), (3) reporting only 3 of 20 tested subgroups, (4) no pre-analysis plan or experiment registration. The structural fix: maintain an immutable experiment log that records the hypothesis, primary metric, and planned runtime before any data is collected. This is standard practice at Google, Meta, and Microsoft — any experiment without a registered plan should be treated as exploratory.

Open with

"Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The recommendation team optimised the engagement proxy without improving subscriber value. I need to (1) diagnose the specific gaming vector, (2) validate that the proxy has decoupled from the north star (retention), and (3) redesign the metric to be structurally harder to game."

Step 1 — Diagnose the decoupling: Plot the historical correlation between the engagement metric and 30/60/90-day retention across past A/B tests. If the correlation has been declining for 2+ quarters, the proxy was already breaking. This is a falsifiable diagnostic, not a guess.
Step 2 — Identify the gaming vector: How specifically has engagement increased? Review what the team actually shipped: clickbait thumbnails, notification spam, autoplay, artificial content loops? Each vector leaves a measurable signature — check time-on-screen distribution, skip rate, content repeat rate.
Step 3 — Run a causal audit: For each shipped feature, check whether the engagement gain was accompanied by a retention gain, retention neutral, or retention loss. Sort features by "retention efficiency" = retention lift / engagement lift. Features with low ratio are likely gaming the metric.
Step 4 — Redesign the OEC: Replace raw engagement minutes with a metric that requires genuine user preference — e.g., "completion rate on content > 10 minutes" or "voluntary return visits within 7 days after exposure." These are much harder to game without delivering real value.
Step 5 — Add structural counter-metrics: For the new OEC, name one metric that gaming it necessarily damages: voluntary skip rate, unsubscribe rate following recommendation, or Net Promoter-style feedback score. Both metrics must be green for a feature to ship.

Goodhart's Law failure modes

Regressional: Metric improves due to regression to the mean, not genuine improvement
Extremal: Optimising the metric at the extreme degrades validity — engagement minutes past 3 hrs/day may reflect problematic binge, not value
Causal: Treating a proxy-north-star correlation as causal and optimising the proxy breaks the correlation
Reflective (Hawthorne): Teams change behaviour because they know they're measured, not because the product improved

Proxy-North-Star Correlation Decay Test

For each historical A/B test i, record:
Δ_engagement_i = engagement lift of the test
Δ_retention_i = 60-day retention lift of the test
Compute Corr(Δ_engagement, Δ_retention) over all tests.
If this correlation has declined from >0.6 to <0.3 over 2 years: the proxy is being Goodharted. Flag this as a metric validity failure.

Retention Efficiency Ratio

Retention Efficiency = Δ_retention / Δ_engagement
Features with ratio < 0 : engagement up, retention down (Goodharter)
Features with ratio ≈ 0 : engagement up, retention neutral (noise)
Features with ratio > 0 : engagement up, retention up (genuine value)
Sort the team's feature backlog by this ratio. Features with ratio < 0 should be rolled back.
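A small sketch of the audit, with made-up feature names and lifts (all hypothetical, in percentage points). The tolerance band around zero is also an assumption; in practice it should reflect the confidence interval on each lift rather than a fixed constant.

```python
def retention_efficiency(d_retention, d_engagement):
    """Retention lift per unit of engagement lift (engagement lift must be > 0)."""
    if d_engagement <= 0:
        raise ValueError("ratio is only interpretable for positive engagement lift")
    return d_retention / d_engagement

def classify(ratio, tol=0.01):
    if ratio < -tol:
        return "goodharter"      # engagement up, retention down
    if ratio <= tol:
        return "neutral"         # engagement up, retention flat
    return "genuine value"       # engagement up, retention up

# (feature, engagement lift, retention lift): hypothetical shipped features.
shipped = [
    ("autoplay_next", 4.0, -0.30),
    ("clickbait_thumbnails", 6.0, -0.60),
    ("better_search", 1.5, 0.45),
    ("layout_tweak", 2.0, 0.00),
]

ranked = sorted(
    ((name, retention_efficiency(ret, eng)) for name, eng, ret in shipped),
    key=lambda item: item[1],
)
rollback = [name for name, ratio in ranked if classify(ratio) == "goodharter"]

for name, ratio in ranked:
    print(f"{name:<22} ratio={ratio:+.3f}  {classify(ratio)}")
print("rollback candidates:", rollback)
```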

Goodhart Taxonomy (adapted from Manheim & Garrabrant, 2018, whose fourth category is adversarial rather than reflective)

Type	Mechanism	Fix
Regressional	Selection effect — high scorers regress to mean	Measure prospectively, not retrospectively
Extremal	Optimising to extremes breaks the proxy-objective link	Cap the metric; add a diminishing-returns penaliser
Causal	Optimising proxy breaks the proxy-objective correlation	Rotate metrics; validate quarterly
Reflective	Measurement changes behaviour (Hawthorne effect)	Blind measurement; use revealed preference

✦Senior DS — Revealed Preference as a Gaming-Resistant Metric

Instead of measuring what users do (which can be gamed by surfacing content that triggers involuntary engagement), measure what users choose when given explicit alternatives. For content quality: measure the voluntary return visit rate — did the user voluntarily come back the next day? Did they share the content? Did they finish it without being auto-played into the next piece? These are revealed preference signals that require actual user satisfaction, not just interface design tricks. They're harder to game because you can't force a user to voluntarily return — you have to actually serve them well. Netflix uses 30-day voluntary retention rate (not engagement minutes) as their true north star for exactly this reason.

✦Senior DS — Metric Validity Operating Range

Some metrics are valid proxies for user value only within a specific operating range. Beyond that range, the proxy breaks down. Netflix observed that watch hours correlates positively with subscriber satisfaction up to about 3 hours per day — but beyond that threshold, it correlates negatively with long-run retention, presumably reflecting compulsive binge-watching of content the user later regrets. The metric has a valid operating range: [0, 3 hrs/day]. Optimising for watch hours beyond this range actively harms the north star. The implication: every proxy metric should be validated not just for direction but for its operating range. And if teams are optimising near or beyond the boundary, the metric has lost its validity signal.

Open with

"Power analysis has four linked parameters: α (false positive rate), power (1−β), metric variance σ², and MDE Δ. Fix any three and the fourth is determined. The real design decision is choosing an MDE that reflects business value — not the effect you hope to see."

Step 1 — Set α and power: α = 0.05 standard; power = 0.80 standard. Use 0.90 for high-stakes launches (payments, safety features).
Step 2 — Choose MDE: The smallest effect worth shipping. Ask the PM: "If the lift were exactly X%, would you ship?" That X is your MDE. Smaller MDE = more users needed.
Step 3 — Estimate σ²: Use historical data from the same metric. For proportions: σ² = p(1−p) using the baseline control rate p.
Step 4 — Compute n per variant: n = 2σ²(z_α/2 + z_β)² / Δ². Round up. Total traffic = n × number_of_variants.
Step 5 — Account for dilution: If only 30% of assigned users trigger the feature, inflate n by 1/0.30 (trigger-analysis adjustment).
Step 6 — Convert to runtime: runtime = n_total / daily_eligible_traffic. Cap at 4 weeks to avoid seasonal confounding.
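Steps 3 to 6 can be chained in a few lines. The inputs below (10% baseline conversion, 1 percentage point absolute MDE, 30% trigger rate, two variants, 5,000 eligible users per day) are hypothetical, as are the function names; this is a sketch of the arithmetic, not a full planning tool.

```python
import math
from statistics import NormalDist

def n_per_variant(sigma2, mde_abs, alpha=0.05, power=0.80):
    """Users per variant for a two-sided, two-sample test with variance sigma2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma2 * (z_alpha + z_beta) ** 2 / mde_abs ** 2)

def runtime_days(n_triggered, trigger_rate, n_variants, daily_eligible_traffic):
    """Days needed once dilution is accounted for (assigned = triggered / rate)."""
    n_assigned = math.ceil(n_triggered / trigger_rate)
    return math.ceil(n_assigned * n_variants / daily_eligible_traffic)

p_baseline = 0.10                       # hypothetical baseline conversion rate
n = n_per_variant(sigma2=p_baseline * (1 - p_baseline), mde_abs=0.01)
days = runtime_days(n, trigger_rate=0.30, n_variants=2, daily_eligible_traffic=5000)

print(f"n per variant (triggered users): {n}")
print(f"runtime: {days} days", "(exceeds the 4-week cap)" if days > 28 else "")
```

If the runtime exceeds the cap, the levers are a larger MDE, variance reduction, or more traffic; shortening the test without changing one of those simply leaves it underpowered.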

Common mistakes

Peeking: Stopping when p < 0.05 mid-experiment inflates actual Type I error to ~13–14% with only 4–5 looks, and further with daily checks
Optimistic MDE: Choosing MDE = expected effect gives the nominal 80% power only if the guess is right; if the true effect is 70% of the guess, power falls to ~50%
Ignoring dilution: Undercounting trigger-eligible users leads to chronically underpowered experiments
Multiple variants without correction: 3 variants at α=0.05 each → ~14% FWER — apply Bonferroni

Sample Size — Two-sample t-test

n = 2σ²(z_α/2 + z_β)² / Δ²   (n per variant)
Δ = MDE (absolute)
σ² = metric variance per user
z_α/2 = 1.96 (α=0.05, two-tailed)
z_β = 0.84 (power=80%), 1.28 (power=90%)
Proportions: σ² = p(1−p), p = baseline conversion rate

Power Formula

Power = Φ( |Δ|·√(n/2) / σ − z_α/2 ), with n per variant (the standard error of the difference is σ·√(2/n)).
Increases with n and |Δ|; decreases with σ and z_α/2.
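A quick numerical check of the power calculation, written to be consistent with the per-variant sample-size formula n = 2σ²(z_α/2 + z_β)² / Δ²: with n users in each arm, the standard error of the difference in means is σ·√(2/n). The inputs (σ = 0.3, an absolute MDE of 0.01, and 14,128 users per arm) are hypothetical, and the negligible probability of rejecting in the wrong direction is ignored.

```python
from statistics import NormalDist

def power_two_sample(n_per_variant, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with n users per arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sigma * (2 / n_per_variant) ** 0.5
    return NormalDist().cdf(abs(delta) / standard_error - z_alpha)

sigma, mde, n = 0.3, 0.01, 14128   # hypothetical design powered at 80% for the MDE

power_at_mde = power_two_sample(n, mde, sigma)
power_at_70pct = power_two_sample(n, 0.7 * mde, sigma)
power_at_half = power_two_sample(n, 0.5 * mde, sigma)

print(f"true effect = MDE:       power = {power_at_mde:.2f}")
print(f"true effect = 0.7 x MDE: power = {power_at_70pct:.2f}")
print(f"true effect = 0.5 x MDE: power = {power_at_half:.2f}")
```

The steep drop in power for effects below the MDE is why the MDE should be set by the ship decision rather than by an optimistic forecast.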

Effect of Parameter Changes on n

Change	Effect on n	Note
Halve MDE (Δ/2)	4× more users	Most expensive lever — choose MDE carefully
α: 0.05 → 0.01	~49% more	For safety or financial tests
Power: 80% → 90%	~34% more	Worth it for high-stakes launches
CUPED (ρ=0.7)	~49% fewer	Use pre-experiment covariate
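The multipliers follow directly from the sample-size formula, since n is proportional to (z_α/2 + z_β)² · σ² / Δ². A short sketch that recomputes each row relative to the α = 0.05, power = 80% baseline; the dictionary keys and helper name are illustrative.

```python
from statistics import NormalDist

def z_term(alpha, power):
    """The (z_alpha/2 + z_beta)^2 factor in the sample-size formula."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2

baseline = z_term(0.05, 0.80)

multipliers = {
    "halve MDE": 1 / 0.5 ** 2,                           # n scales with 1 / delta^2
    "alpha 0.05 -> 0.01": z_term(0.01, 0.80) / baseline,
    "power 80% -> 90%": z_term(0.05, 0.90) / baseline,
    "CUPED rho=0.7": 1 - 0.7 ** 2,                       # variance scales by 1 - rho^2
}

for change, m in multipliers.items():
    print(f"{change:<20} n x {m:.2f}  ({(m - 1) * 100:+.0f}%)")
```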

✦Senior DS — Sequential Testing (mSPRT)

Fixed-horizon tests require committing to a runtime and not peeking, which is rarely practical. The mSPRT (mixture Sequential Probability Ratio Test, Johari et al. 2015) produces always-valid p-values: the p-value at any look controls Type I error at α. The cost: ~10–20% more data than the equivalent fixed-horizon test for the same power. Optimizely, Netflix, and Airbnb use mSPRT by default. Mathematically: under the null, the classical test statistic monitored over time will eventually cross any fixed threshold, so stopping the first time p < α leads to a heavily inflated Type I error. The mSPRT replaces it with a likelihood ratio mixed over effect-size hypotheses, which is a martingale under the null and yields a confidence sequence that shrinks over time.

✦Senior DS — MDE vs Expected Effect

A fundamental confusion in practice: the MDE is the smallest effect worth detecting, NOT the effect you expect the feature to produce. If you set MDE = your best guess for the effect size, you get the nominal 80% power only when the guess is exactly right. Best guesses tend to be optimistic: if the true effect is 70% of the guess, power drops to about 50%, and at half the guess it is under 30%. You need to set MDE lower than the expected effect to be well-powered across the plausible range. The right question to ask the PM: "What is the smallest improvement that would change your ship decision?" That's the MDE. If they say "1% CTR improvement," and your experiment is powered for a 2% MDE, you're underpowered for the decision they actually need to make.

Open with

"Testing 20 metrics at α=0.05 each gives a 64% chance of at least one false positive, even if the feature does nothing. We'd expect about 1 spurious significant result. The question is: was a primary metric pre-registered? If yes, that's the only one that counts for the ship decision. If no, we need to apply a multiple comparison correction before interpreting results."

Step 1 — Check pre-registration: Was a primary metric declared before the experiment ran? If yes, only that metric determines the decision. The other 19 are exploratory — flag them as "directional, requires replication."
Step 2 — Choose the error criterion: FWER (any false positive is catastrophic) → Bonferroni/Holm. FDR (some false positives acceptable, care more about power) → Benjamini-Hochberg.
Step 3 — Apply Bonferroni: α_adj = 0.05 / 20 = 0.0025. Reject metrics with p < 0.0025 only. Very conservative — likely none of your three are significant after this.
Step 4 — Apply BH as alternative: Sort p-values. Find the largest k such that p_(k) ≤ kα/m (m=20). Reject H_(1) through H_(k). Controls FDR ≤ α — more powerful than Bonferroni.
Step 5 — Recommend pre-registration going forward: For every experiment, declare one primary metric and the exact analysis plan before data collection. This is the structural fix — corrections are second-best.

Multiple testing errors in the wild

Metric fishing: Reporting only the 3 significant metrics out of 20 is an uncorrected 3-of-20 selection — FWER ≈ 64%
Subgroup fishing: Testing 15 country×device subgroups inflates Type I error in the same way: FWER ≈ 54% for 15 independent tests
Sequential metric addition: Adding metrics to a live experiment dashboard after seeing results is p-hacking

FWER Inflation Without Correction

FWER = 1 − (1−α)^m   (m independent tests, each at α = 0.05)
m=1: 5%
m=5: 23%
m=10: 40%
m=20: 64%
m=50: 92%
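The inflation curve and the matching Bonferroni threshold take a few lines to compute. This sketch assumes independent tests, as the formula does; correlated metrics inflate the FWER less than this, so the figures are an upper-end guide. The function names are illustrative.

```python
def fwer(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(m, alpha=0.05):
    """Per-test threshold that keeps FWER at or below alpha."""
    return alpha / m

for m in (1, 5, 10, 20, 50):
    print(f"m={m:>2}: uncorrected FWER={fwer(m):.0%}, "
          f"Bonferroni per-test alpha={bonferroni_alpha(m):.4f}, "
          f"corrected FWER={fwer(m, bonferroni_alpha(m)):.3f}")
```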

Benjamini-Hochberg (BH) Procedure

1. Sort p-values: p_(1) ≤ p_(2) ≤ ... ≤ p_(m)
2. Find largest k s.t. p_(k) ≤ k·α / m
3. Reject H_(1), ..., H_(k)
Controls FDR ≤ α under independence or positive dependence.
More powerful than Bonferroni when many true effects exist.
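The three steps translate directly into code. This sketch compares BH with Bonferroni on 20 hypothetical p-values (three small ones plus 17 clearly null ones); the function names and the p-values are made up for illustration.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a list of booleans: True where BH rejects at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank  # keep the LARGEST rank that passes its threshold
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

def bonferroni(pvals, alpha=0.05):
    """Return a list of booleans: True where p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

# 20 metrics: three promising p-values and 17 that look null (hypothetical).
pvals = [0.001, 0.004, 0.007] + [0.06 + 0.05 * i for i in range(17)]

bh_rejections = benjamini_hochberg(pvals)
bonf_rejections = bonferroni(pvals)
print("BH rejects:", sum(bh_rejections), "| Bonferroni rejects:", sum(bonf_rejections))
```

On these inputs BH keeps all three small p-values while Bonferroni keeps only the smallest, which is the power difference the method-selection guidance refers to.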

Method Selection Guide

Method	Controls	Power	Use when
Bonferroni	FWER ≤ α	Lowest	Safety/financial, few tests (<5)
Holm-Bonferroni	FWER ≤ α	Slightly higher	Always dominates Bonferroni — prefer this
Benjamini-Hochberg	FDR ≤ α	High	Discovery setting, many metrics, FP tolerable
Pre-registration	FWER by design	Full α for primary	Best — do this before data collection

✦Senior DS — Alpha Spending Across Time

Multiple comparisons arise over time (peeking at interim analyses) as well as across metrics. Alpha spending functions (O'Brien-Fleming, Pocock) allocate the α budget across K planned interim looks. O'Brien-Fleming spends almost no α early (when evidence is weakest) and reserves most of the budget for the final look, providing near-full power at the end while strictly controlling the overall Type I error. Pocock uses the same significance threshold at every look, which spends more α early but leaves less power at the final look. The key insight: choosing a spending function is analogous in structure to choosing a multiple-comparison correction across metrics. In both cases you're allocating a finite error budget across multiple decisions.

✦Senior DS — Reporting Expected False Discoveries

BH at FDR=5% means: if you reject k hypotheses, you expect 0.05k of them to be false. Teams who don't understand this celebrate "20 significant results" without realising 1 is likely spurious. Always report E[FD] = FDR × rejections explicitly. For example: "We reject 12 metrics at BH FDR=5%. We expect ~0.6 false positives among these 12." This framing forces intellectual honesty. A corollary: if your team regularly reports BH FDR=5% results from 50-metric dashboards and never discusses E[FD], they are almost certainly over-shipping features with no real effect.

Open with

"This is Goodhart's Law: when CTR became a target, it ceased to be a good measure. The team optimised the proxy without improving the underlying objective. The fix isn't adding more counter-metrics — it's redesigning the OEC to be structurally harder to game while remaining directionally aligned with the north star."

Step 1 — Diagnose the decoupling: Correlate CTR gains vs revenue gains across the history of past A/B tests. If the correlation has been declining, the proxy was already breaking before this quarter.
Step 2 — Identify the gaming vector: How specifically did the team increase CTR? Clickbait titles? Notification spam? Surfacing viral but shallow content? The vector tells you what the new metric must penalise.
Step 3 — Redesign the OEC: Replace CTR with a metric that captures value, not just clicks. Candidates: "satisfied-click rate" (clicked AND session lasted > 30 seconds), post-click conversion rate, or (click + downstream engagement) composite.
Step 4 — Add counter-metrics as guardrails: For every OEC, name a metric that must not move adversely. Gaming CTR with clickbait should increase unsubscribe rate and decrease session depth — both are now guardrails.
Step 5 — Backtest the new OEC: Replay past A/B tests using the new metric. Does it correctly reject tests that increased CTR but hurt revenue? Backtesting is the strongest validation method.

Metric design failure modes

Proxy-north-star correlation decay: Validate the OEC–north-star correlation quarterly as teams optimise the proxy
Short-horizon bias: Metrics measurable in 2 weeks underweight effects that compound over months
Composite instability: Weighted composite metrics shift due to weight choice, not feature quality
Counter-metric proliferation: Too many guardrails makes it impossible to optimise for anything — keep it to 3–5

Metric Hierarchy

North Star → True long-run objective (hard/slow to measure)
↓
Primary OEC → What experiments optimise (sensitive, measurable in 2 wks)
↓
Guardrails → What we protect (latency, revenue, abuse rate)
↓
Diagnostics → Explain why OEC moved (not decision metrics)

OEC Sensitivity

Required n ∝ σ²(metric) / Δ²
Lower-variance, higher-signal metrics need fewer users.
Prefer: normalised per-user metrics over aggregate totals.
Avoid: rare-event metrics (conversion < 0.1%), whose variance is very high relative to the size of the effects you can expect to move.

Good OEC Properties

Property	How to verify
Sensitive	Moves reliably when A/A has a known injected effect
North-star aligned	Corr(OEC, north-star) > 0.6 across historical experiments
Not gameable	No obvious path to increase OEC without creating user value
Counter-metric resistant	Gaming OEC moves at least one guardrail adversely
Computable	Available within experiment window (≤2 weeks for primary)

✦Senior DS — Backtesting OEC Validity

The strongest validation of a new OEC is to replay historical A/B tests under the new metric and check that it correctly predicts long-run outcomes. Netflix does this systematically: for a proposed OEC, they compare its verdict (ship/no-ship) against each historical experiment's long-run impact on subscriber retention at 12 months. A good OEC agrees with the long-run outcome in >80% of cases. A bad OEC (raw engagement minutes) agrees far less often because it promotes short-term binge-watching that increases churn. This is the gold standard for OEC validation — far better than theoretical arguments about alignment.

✦Senior DS — Metric Decomposition Against Gaming

Instead of one composite OEC, decompose into components and hold teams accountable for each. Weekly time spent is gameable by adding friction that stretches each session. But weekly time spent = sessions_per_week × avg_duration_per_session. A team that inflates per-session duration by adding friction sees sessions_per_week decrease. A team flooding low-quality content sees a quality-satisfaction guardrail decrease. It's much harder to simultaneously inflate multiple independent components than to game a single composite. This structural defence against Goodhart's Law is more robust than adding ever-more-complex counter-metrics to a single composite.

Open with

"CUPED subtracts out variation in the outcome metric that is predictable from pre-experiment behaviour. Because users who engage a lot before the experiment tend to engage a lot during it — regardless of which variant they see — this variation is noise that inflates experiment variance. Removing it lets us reach significance with fewer users."

Step 1 — Choose covariate X: The same metric Y measured in the 2 weeks before the experiment. Must be correlated with Y AND pre-determined (observed before treatment assignment — never use a post-assignment covariate).
Step 2 — Estimate θ: θ = Cov(Y, X) / Var(X). This is the OLS regression coefficient of Y on X, computed on pooled data.
Step 3 — Compute adjusted outcome: Y_adj = Y − θ·(X − mean(X)). Subtracts the part of Y predicted by X. The mean is unchanged; the variance is reduced.
Step 4 — Variance reduction: Var(Y_adj) = Var(Y) × (1 − ρ²), where ρ = Corr(Y, X). For engagement metrics with ρ ≈ 0.7, variance drops 49% (ρ² = 0.49), roughly halving the required sample size.
Step 5 — Run t-test on Y_adj: τ̂_CUPED = mean(Y_adj,T=1) − mean(Y_adj,T=0). Same inference as usual — just with lower variance.

When CUPED doesn't help

New users: No pre-experiment data — cannot compute X; use stratification or demographic proxies instead
Low correlation (ρ < 0.3): Variance reduction < 9% — barely worth implementation cost
Endogenous covariate: X measured after some users had early treatment exposure — biased adjustment
Non-linear Y-X relationship: Linear CUPED loses efficiency; use MLRATE (ML-based CUPED) instead

CUPED Estimator

θ = Cov(Y, X) / Var(X)
Y_adj = Y − θ·(X − mean(X))
τ̂_CUPED = mean(Y_adj,T=1) − mean(Y_adj,T=0)
Variance reduction: Var(Y_adj) = Var(Y) × (1 − ρ²), ρ = Corr(Y, X)
ρ=0.5 → 25% variance reduction
ρ=0.7 → 49% variance reduction → ~half the users needed
ρ=0.9 → 81% variance reduction → ~1/5 the users needed
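A minimal end-to-end sketch of the estimator on simulated data. The data-generating process (a pre-period covariate correlated about 0.7 with the outcome, and a true lift of 0.10) and all helper names are assumptions for illustration; θ is estimated on the pooled sample as in Step 2.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def cuped_adjust(y, x):
    """Return (Y_adj, theta) with theta = Cov(Y, X) / Var(X) on pooled data."""
    theta = cov(y, x) / cov(x, x)
    mean_x = mean(x)
    return [yi - theta * (xi - mean_x) for yi, xi in zip(y, x)], theta

def diff_in_means(values, treated):
    treat = [v for v, t in zip(values, treated) if t == 1]
    control = [v for v, t in zip(values, treated) if t == 0]
    return mean(treat) - mean(control)

# Simulated experiment (assumption): X is the pre-experiment metric.
random.seed(1)
N, TRUE_TAU = 20000, 0.10
t = [1 if random.random() < 0.5 else 0 for _ in range(N)]
x = [random.gauss(0, 1) for _ in range(N)]
y = [0.7 * xi + 0.51 ** 0.5 * random.gauss(0, 1) + TRUE_TAU * ti
     for xi, ti in zip(x, t)]

y_adj, theta = cuped_adjust(y, x)
rho = cov(y, x) / (cov(x, x) * cov(y, y)) ** 0.5
var_ratio = cov(y_adj, y_adj) / cov(y, y)   # equals 1 - rho^2 in-sample

tau_raw = diff_in_means(y, t)
tau_cuped = diff_in_means(y_adj, t)
print(f"theta={theta:.3f}, rho={rho:.3f}, Var(Y_adj)/Var(Y)={var_ratio:.3f}")
print(f"raw estimate={tau_raw:.4f}, CUPED estimate={tau_cuped:.4f}")
```

Both estimators target the same effect; the adjusted one simply has a smaller standard error, so the same experiment yields a tighter confidence interval.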

CUPED = ANCOVA (algebraic equivalence)

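CUPED t-test on Y_adj ≡ OLS regression: Y = α + τ·T + θ·X + ε, reading off τ̂
Both produce the same point estimate and standard error, up to small finite-sample differences that depend on how θ is estimated (pooled vs within the regression); they coincide asymptotically.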

Variance Reduction Methods

Method	Timing	Typical reduction	Limitation
Stratified assignment	Pre-assignment	20–40%	Needs covariate before assignment
CUPED (linear)	Post-assignment	30–60%	Linear Y-X only; needs pre-experiment data
MLRATE	Post-assignment	50–80%	Complex; requires rich feature history
Post-stratification	Analysis time	10–30%	Inflates variance if strata too small

✦Senior DS — Why CUPED = ANCOVA Matters

The algebraic equivalence between CUPED and ANCOVA is non-obvious but important. It means: (1) CUPED is not a heuristic — it's the standard OLS efficiency gain from including a covariate, backed by the Gauss-Markov theorem. (2) You can use any standard regression software, not a custom CUPED implementation. (3) If you extend to a machine-learning projection f(X) instead of linear θX, CUPED becomes equivalent to the partial linear Robinson (1988) estimator, which is semiparametrically efficient for the treatment effect under the partially linear model. In practice: always frame CUPED as "ANCOVA with a pre-experiment covariate" in technical reports — it's more recognisable and has better-understood properties.

✦Senior DS — MLRATE for Non-linear Variance Reduction

Standard CUPED assumes a linear Y-X relationship. Meta's MLRATE replaces the linear projection θ·X with a gradient-boosted model trained on many pre-experiment features: Y_adj = Y − f(X). This achieves 70–80% variance reduction for complex engagement metrics where linear CUPED only gets 30–40%. Critical constraint: f must be trained solely on pre-experiment data, never on the experiment window, or the treatment-effect estimate can be biased and you could invalidate the experiment. In practice: train f on data from a 4-week window ending the day before the experiment starts, then freeze f before the first user is assigned.
