
Data Scientist · EdTech Platform, New Delhi · 2022 – 2023

Content Recommendation Engine

Hybrid recommendation system — ALS collaborative filtering, TF-IDF content-based NLP, and a LightGBM learned-to-rank layer — personalising the student portal homepage for 40,000+ UPSC aspirants. Designed, built, and owned solo.

Role: Data Scientist
Platform: EdTech Portal
Users: 40K+ students
Infra: AWS EC2 · Redis · Celery

40K+ Students Served · +34% CTR Improvement · <8ms P95 Serve Latency · 5 A/B Experiments

Stack: Python 3.10 · Pandas · NumPy · implicit (ALS) · pdfplumber · NLTK · scikit-learn · LightGBM · PostgreSQL · Redis · Celery · FastAPI · MLflow · AWS EC2

The Problem

Zero Personalisation for 40,000 Students

The platform's content library held 2,400+ items — PDF notes, video lectures, practice tests, and previous-year question papers spanning 11 General Studies subjects and 30+ optional papers. Without personalisation, every student saw the same homepage carousel. High-value content was missed not because it was irrelevant, but because it was never surfaced. Six months of interaction data sat in PostgreSQL, unused.

Business Goal

Increase content engagement and completion rates by surfacing the right content for each student — without requiring explicit ratings, which students on exam-prep platforms never provide. Completion rate on premium content was the downstream metric the business cared about.

Key Constraints

Sole data scientist on shared AWS infrastructure with no dedicated ML budget. Recommendations must serve in under 10ms (embedded in the login portal homepage on mobile connections). Nightly retraining acceptable — content doesn't change hourly, and student preferences shift over days, not minutes.

Implicit Feedback Signals

Students never rate content explicitly. Positive signals are inferred from behaviour: PDF scroll depth >40%, video watch time >60 seconds, bookmarks, and exercise completions. Each signal carries a different confidence weight — a bookmark is a much stronger positive than a passive view — which determines how the ALS model weighs each interaction in the training matrix.
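A minimal sketch of how raw events might be mapped to these signal types. The thresholds come from the description above; the function name `classify_signal`, the `scroll_depth` argument, and the returned labels are illustrative assumptions rather than the production code.

```python
from typing import Optional

# Thresholds taken from the description above.
MIN_SCROLL_DEPTH = 0.40   # PDF scroll depth > 40%
MIN_WATCH_SECONDS = 60    # video watch time > 60 seconds


def classify_signal(event_type: str,
                    duration_s: Optional[int] = None,
                    scroll_depth: Optional[float] = None) -> Optional[str]:
    """Map one raw event to a signal label, or None if it is not a positive signal.

    The labels ('complete', 'bookmark', 'view_long', 'view_short') are
    hypothetical names chosen for this sketch.
    """
    if event_type in ("complete", "bookmark"):
        return event_type
    if event_type == "view":
        long_watch = duration_s is not None and duration_s > MIN_WATCH_SECONDS
        deep_scroll = scroll_depth is not None and scroll_depth > MIN_SCROLL_DEPTH
        return "view_long" if (long_watch or deep_scroll) else "view_short"
    return None  # e.g. 'skip' is not treated as a positive signal here
```

Keeping the classification separate from the weighting makes it possible to change a threshold without touching the confidence weights.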

Cold Start Problem

Two cold-start cases require separate strategies. New students have zero interaction history. New content items have zero engagement data. A 3-question onboarding survey (primary subjects + difficulty preference) solves user cold start from day one. New content gets a content-based similarity score computed synchronously on publish, making it recommendable before the next nightly batch runs.

System Architecture — 5 Layers

01 Logging Layer: PostgreSQL · content_events table · view / bookmark / complete / skip
02 Feature Layer: Pandas + NumPy · 90-day rolling window · confidence-weighted sparse matrix · TF-IDF content vectors
03 Modelling Layer: ALS (implicit) · TF-IDF cosine sim · LightGBM ranker · top-200 candidates → top-50 final
04 Batch Pipeline: Celery · nightly 2 AM IST · ~22 min on EC2 m5.xlarge · MLflow metrics logged per run
05 Serving Layer: FastAPI · Redis GET · <8ms P95 · 24h TTL · cold-start fallback on cache miss

Why Batch, Not Real-Time

The content library updates once or twice daily. Student subject preferences shift over weeks, not minutes: a student who clicks three Economics videos in one session doesn't need their recommendations updated within the hour; they need better recommendations tomorrow. A nightly Celery task retrains the ALS model on a 90-day rolling interaction window, re-scores candidates with the current ranker, and writes pre-computed top-50 recommendations per user to Redis. The homepage then becomes a single Redis GET: under 8ms regardless of model complexity. Real-time re-ranking on every page load would have required streaming infrastructure such as Kafka and Flink, with stateful operators to maintain, for marginal signal-quality gain at this scale.

Prerequisite: NLP Content Auto-Tagging Pipeline

Before collaborative or content-based filtering could run, 2,400+ unstructured PDFs needed structured metadata — subject tags, difficulty level, content type. This metadata is the upstream dependency for three core RecSys components:

  • TF-IDF content-based model — subject and topic tags are appended to the text representation, sharpening cosine similarity signal over raw PDF text alone.
  • LightGBM ranker — the difficulty_match feature requires a per-item difficulty label to compare against the user's revealed difficulty preference.
  • Cold-start onboarding — the 3-question survey maps student choices to subject labels that must exist on content items to bootstrap a preference vector.

A standalone NLP pipeline (Model 04) handled this: pdfplumber extracted text from the first 8 pages of each PDF, NLTK tokenised and stemmed it, and a TF-IDF + OneVsRestClassifier pipeline — trained on 400 manually-tagged items — assigned subject labels (multi-label, macro F1 = 0.91) and difficulty tier (macro F1 = 0.87). New uploads are tagged synchronously in under 2 seconds, so they enter the recommendation index before the next nightly batch.

Four-Model Architecture

Tagging → CF → Content-Based → LightGBM Ranker

No single model solved all cases cleanly. An NLP auto-tagger (Model 04) first turns raw PDFs into structured metadata — without it, content-based filtering has no reliable signal and the LightGBM ranker's difficulty feature has no ground truth. ALS collaborative filtering then dominates for warm users, TF-IDF handles cold start, and LightGBM learns to combine both in a nonlinear ranking function that manual blending cannot replicate.

01

ALS Collaborative Filtering

implicit library · matrix factorisation · implicit feedback

Finds students with similar interaction patterns and surfaces what they engaged with. Primary signal for warm users who have at least 20 tracked interactions.

Input Matrix

Sparse user-item interaction matrix of shape (40,000 × 2,400). Each cell is a confidence-weighted sum of interaction signals: completion = 3×, bookmark = 2×, view >60s = 1×, view <60s = 0.3×. Most cells are empty — the matrix is <2% dense, which is why standard SVD fails here.

Training Configuration

64 latent factors (tuned via MLflow grid search over 32/64/128). Regularisation = 0.01, 25 iterations, confidence scale α = 40. Trained on EC2 m5.xlarge using the C++ backend of the implicit library — full retraining completes in ~4 minutes, well within the nightly batch window.

Why ALS over SVD — the implicit feedback distinction

SVD treats every missing entry as a zero preference. At this scale, >98% of (user, item) pairs have no recorded interaction — not because the student dislikes that content, but because they've never encountered it. Recommending from a model that treats "unseen" as "disliked" is actively harmful. ALS with confidence weighting makes a critical distinction: high-confidence entries are observed interactions (the model is certain about the signal), low-confidence entries are unobserved (the model is uncertain, not negative). This matters significantly for a 2,400-item library where a student can only consume a fraction in a year.
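For reference, the confidence-weighted objective commonly used for implicit-feedback ALS (the Hu, Koren and Volinsky formulation) can be written as below. The notation is conventional and not taken from this project: r_ui is the weighted interaction value, p_ui the binarised preference, x_u and y_i the user and item factor vectors, and λ the regularisation strength. How a given library applies α internally may differ in detail.

```latex
\min_{x_*,\,y_*} \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr),
\qquad
c_{ui} = 1 + \alpha\, r_{ui},
\qquad
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}
```

Under this formulation with α = 40, an unobserved pair keeps the minimum confidence of 1, while a single completion (r = 3) would get a confidence of 1 + 40 × 3 = 121.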

Python — confidence-weighted interaction matrix + ALS training
import implicit
import numpy as np
import pandas as pd
import scipy.sparse as sparse

CONF_WEIGHTS = {
    'complete': 3.0,
    'bookmark': 2.0,
    'view_long': 1.0,    # watch > 60s or scroll > 40%
    'view_short': 0.3,   # passive, noisy signal
}

def build_interaction_matrix(events: pd.DataFrame, n_users: int, n_items: int):
    # assign() returns a copy, so the caller's DataFrame is not mutated.
    # Event types missing from CONF_WEIGHTS fall back to a weak 0.1 weight.
    events = events.assign(
        weight=events['event_type'].map(CONF_WEIGHTS).fillna(0.1))
    agg = events.groupby(['student_idx', 'content_idx'])['weight'].sum()
    return sparse.csr_matrix(
        (agg.values,
         (agg.index.get_level_values(0), agg.index.get_level_values(1))),
        shape=(n_users, n_items)
    )

# events, n_users, n_items and user_idx are supplied by the extraction step
interaction_matrix = build_interaction_matrix(events, n_users, n_items)

# Train ALS: confidence matrix C = 1 + alpha * R
# (assumes implicit >= 0.5, where fit() takes a user x item matrix)
model = implicit.als.AlternatingLeastSquares(
    factors=64, regularization=0.01, iterations=25, use_gpu=False
)
model.fit(interaction_matrix * 40)   # alpha = 40

# Generate top-200 candidates for one user
ids, scores = model.recommend(
    userid=user_idx,
    user_items=interaction_matrix[user_idx],
    N=200,
    filter_already_liked_items=True,
)
02

Content-Based Filtering

scikit-learn TF-IDF · cosine similarity · metadata features

Recommends content similar to what a student has engaged with, based on subject, topic, difficulty, and content type. Essential for cold-start users and new content items.

Content Feature String

Each item is represented as a concatenated metadata string: subject (History, Polity, Geography, Economy, Science, Environment, Art & Culture…) + topic + difficulty level (Prelims / Mains / Both / Optional) + content type (PDF note / video / PYQ / practice test) + title. This gives the TF-IDF vectoriser rich, structured text to work with.

Vectoriser Configuration

TF-IDF with max_features = 5,000, ngram_range = (1, 2), sublinear_tf = True (log-scaled frequencies reduce the dominance of common words like "questions"). Item-item cosine similarity computed once and stored as a compressed sparse matrix in memory.

User Profile Construction

A user's CB profile is the mean TF-IDF vector of their top-10 most recent positively-interacted items (completions and bookmarks preferred). For cold-start users, the profile is built from the subjects and difficulty level they selected during onboarding.
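The profile construction can be sketched as follows, assuming the TF-IDF item vectors are already available as a dense array (the production matrix is sparse; dense is used here only to keep the example short). The function names and the toy vectors are hypothetical.

```python
import numpy as np


def build_user_profile(item_vectors: np.ndarray, recent_item_ids: list, k: int = 10) -> np.ndarray:
    """Mean vector of the k most recent positively-interacted items.

    `recent_item_ids` is assumed to be non-empty and ordered most recent first.
    """
    chosen = recent_item_ids[:k]
    return item_vectors[chosen].mean(axis=0)


def cosine_scores(profile: np.ndarray, item_vectors: np.ndarray) -> np.ndarray:
    """Cosine similarity between one profile and every item vector."""
    denom = np.linalg.norm(item_vectors, axis=1) * np.linalg.norm(profile)
    denom[denom == 0] = 1.0          # avoid division by zero for empty vectors
    return (item_vectors @ profile) / denom


# Toy 4-item catalogue with 3 TF-IDF features (values are made up).
item_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
profile = build_user_profile(item_vectors, recent_item_ids=[0, 1])
scores = cosine_scores(profile, item_vectors)
ranked = np.argsort(-scores)   # item indices, most similar first
```

For a cold-start user the same scoring step would apply, with the profile built from the onboarding selections instead of from interacted items.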

Three Use Cases

  1. Cold-start users (<20 tracked interactions): CB is the primary model.
  2. New content (<50 total interactions): ALS has no signal; CB handles it using the item's metadata similarity to the user's profile.
  3. Serendipity injection: 10% of final recommendations are CB-only, injected to prevent filter-bubble collapse in long-running users.
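One way the 10% serendipity quota could be applied is sketched below. The source does not say where the CB-only items are placed in the list, so appending them after the main ranking is an assumption, as are the function and argument names.

```python
def inject_serendipity(ranked_ids: list, cb_only_ids: list,
                       n: int = 50, cb_share: float = 0.10) -> list:
    """Reserve roughly `cb_share` of the final n slots for CB-only items."""
    n_cb = int(round(n * cb_share))
    main = ranked_ids[: n - n_cb]
    seen = set(main)
    extras = [i for i in cb_only_ids if i not in seen][:n_cb]
    result = main + extras
    seen.update(extras)
    # Top up from the main ranking if there were too few CB-only items.
    for i in ranked_ids[n - n_cb:]:
        if len(result) >= n:
            break
        if i not in seen:
            result.append(i)
            seen.add(i)
    return result
```

With the defaults, 45 slots come from the main ranking and up to 5 from the content-based list.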

New content cold start — synchronous trigger

When a new item is added to the content library, a PostgreSQL trigger fires a Celery task that computes that item's TF-IDF vector and its cosine similarity to all user profiles. This runs in under 1 second and stores the results immediately. The item is recommendable via CB before the next nightly ALS batch — there is no 24-hour blind spot for new content.

03

LightGBM Learned-to-Rank

lambdarank objective (NDCG-optimised) · contextual features · nonlinear scoring

Takes 200 candidates from ALS + CB and re-scores them using a rich feature set. Learns the nonlinear interactions between CF score, CB score, recency, popularity, and user context — interactions that weighted linear blending cannot capture.

Feature Set (per candidate pair)

ALS score (dot product, normalised 0–1) · CB cosine similarity · item log-popularity (rolling 7-day view count, log-scaled) · recency score (exponential decay: 1.2× for items <30 days old, decaying to 0.8× at 90 days) · user subject affinity (last-14-day engagement share by subject) · difficulty match (binary: does item level match user's revealed preference).
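The recency description above (1.2× under 30 days, decaying to 0.8× at 90 days) admits more than one implementation. The sketch below is one possible reading: a flat boost for fresh items followed by exponential decay, with the decay rate solved so the curve reaches 0.8 at day 90. Clamping at 0.8 beyond 90 days is an assumption, and the ranker code later in this section uses a simpler single-exponential form.

```python
import math

BOOST = 1.2        # multiplier for items younger than 30 days
FLOOR = 0.8        # multiplier reached at 90 days
FRESH_DAYS = 30
FLOOR_DAYS = 90
# Rate chosen so that BOOST * exp(-RATE * (FLOOR_DAYS - FRESH_DAYS)) == FLOOR
RATE = math.log(BOOST / FLOOR) / (FLOOR_DAYS - FRESH_DAYS)


def recency_multiplier(days_old: int) -> float:
    """Piecewise recency score: flat boost, exponential decay, then a floor."""
    if days_old < FRESH_DAYS:
        return BOOST
    if days_old >= FLOOR_DAYS:
        return FLOOR          # assumption: clamp at the floor beyond 90 days
    return BOOST * math.exp(-RATE * (days_old - FRESH_DAYS))
```

Because the value feeds a tree-based ranker, only the ordering of the scores matters to the model, so the exact curve shape is less critical than it would be in a linear blend.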

Training Labels & Data

Positive labels: click events on model-surfaced items from A/B test traffic (Experiment 01 onwards). Using pre-recommendation editorial clicks as training labels would introduce selection bias — students only clicked from an editorially-curated item set, not the full catalog. Labels were collected from the 20% treatment group during Exp 01, then from 100% of traffic after Exp 01 shipped.

Training Configuration

Objective: lambdarank, LightGBM's NDCG-optimising ranking objective. 200 trees, learning rate = 0.05, max_depth = 6, num_leaves = 31. Trained weekly (not nightly), because daily click labels need to accumulate before there is sufficient volume for stable ranking model updates. Offline NDCG@10 on a held-out 10%: 0.73, which is 8% higher than weighted linear blending (0.7 ALS + 0.3 CB).
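Since NDCG@10 is the offline metric quoted throughout, a minimal per-user implementation is sketched here. It uses linear gain, which is identical to the exponential-gain variant when labels are binary click / no-click; the function names are illustrative.

```python
import math


def dcg_at_k(relevances: list, k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list, k: int = 10) -> float:
    """NDCG@k for one user: DCG of the model's order / DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

`relevances` is the list of labels in the order the model ranked the items; the reported figure would be the mean of this value over all users in the held-out set.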

Why LightGBM beats linear blending

Linear blending assumes CF and CB contributions are additive and independent — they aren't. A new student with high subject affinity in History needs a different CF/CB balance than a veteran student with 12 months of interaction history. LightGBM learns these conditional relationships: subject affinity × content type × recency interacting together produces splits that a weighted average simply cannot express.

Python — LightGBM ranker feature construction + scoring
import datetime as dt

import lightgbm as lgb
import numpy as np
import pandas as pd

# content_meta, item_views_7d, global_top20_pct, get_subject_affinity() and
# get_difficulty_preference() are project-level lookups defined elsewhere.

def build_ranking_features(
    user_id: int, candidates: list,
    cf_scores: np.ndarray, cb_scores: np.ndarray
) -> pd.DataFrame:
    rows = []
    user_affinity = get_subject_affinity(user_id, days=14)
    user_level    = get_difficulty_preference(user_id)
    today         = dt.date.today()

    for item_id, cf, cb in zip(candidates, cf_scores, cb_scores):
        meta = content_meta[item_id]
        days_old = (today - meta['published_at']).days   # assumes published_at is a date
        rows.append({
            'cf_score':         cf,
            'cb_score':         cb,
            'log_popularity':   np.log1p(item_views_7d.get(item_id, 0)),
            'recency_score':    1.2 * np.exp(-days_old / 30),
            'subject_affinity': user_affinity.get(meta['subject'], 0.0),
            'difficulty_match': int(meta['level'] == user_level),
        })
    return pd.DataFrame(rows)

def apply_debiasing(items, scores, n, penalty=0.4):
    """Penalise globally overrepresented items."""
    # The multiplicative penalty assumes non-negative scores; a negative raw
    # ranker margin would be pulled towards zero and so ranked higher.
    adjusted = []
    for item, score in zip(items, scores):
        if item in global_top20_pct:
            score *= (1 - penalty)
        adjusted.append((item, score))
    return [i for i, _ in sorted(adjusted, key=lambda x: -x[1])[:n]]

# Score + rank candidates (apply_debiasing is defined above so this runs top to bottom)
ranker   = lgb.Booster(model_file='ranker_weekly.lgb')
features = build_ranking_features(user_id, candidates, cf_scores, cb_scores)
scores   = ranker.predict(features)

# Top-50 after popularity debiasing
top50 = apply_debiasing(candidates, scores, n=50)
04

NLP Content Auto-Tagger

pdfplumber · NLTK · TF-IDF · OneVsRestClassifier (scikit-learn)

Automatically extracts subject tags and difficulty level from raw PDFs. Upstream prerequisite that provides the structured metadata all three RecSys models depend on.

Why This Exists

2,400+ content items had been added manually over years with inconsistent or missing metadata. Inconsistent labels meant the TF-IDF vectoriser was indexing noise instead of signal, and the LightGBM difficulty_match feature had no reliable per-item ground truth. Manual re-tagging was an editorial bottleneck: a new batch of 30 PDFs could sit unindexed for weeks. The auto-tagger reduced labelling latency to under 2 seconds per item, triggered synchronously on upload so new content enters the recommendation index before the next nightly batch runs.

Multi-Label Subject Classification

UPSC GS content is structurally multi-label. River basin ecology spans Geography, Environment, and sometimes Polity. Historical treaties touch History, International Relations, and Map Work simultaneously. A single-label classifier systematically under-tags content, reducing content-based recall. OneVsRestClassifier trains an independent logistic regression per subject — 11 classifiers for 11 GS subjects — applying each independently so any combination of labels is possible. Ground truth: 400 items manually tagged by a subject-matter expert. Macro F1 on held-out 20%: 0.91.

Difficulty Classification

Three-tier difficulty (Prelims / Mains / Advanced Optional) treated as single-label multi-class. Vocabulary complexity, question type, and abstract reasoning density correlate strongly with tier — TF-IDF bigrams (up to 5,000 features) capture these patterns well. A standalone logistic regression is trained separately from the subject classifiers, as difficulty is orthogonal to subject and shares no label structure. Macro F1 on held-out 20%: 0.87.

How Auto-Tag Outputs Feed the RecSys

Content-Based Model (TF-IDF): Subject and topic tags are concatenated into the item's text representation before vectorisation. This dramatically sharpens cosine similarity — two Physics videos about thermodynamics now share a subject:physics token regardless of surface-level vocabulary differences, improving CB recall for cold-start users whose profiles are built from subject preferences.

LightGBM Ranker: The difficulty_match feature (binary: does item difficulty match the student's revealed preference tier?) is computed from auto-tag output. A student who consistently bookmarks and completes Mains-level content gets a 1.0 for Mains items and a 0.0 for Prelims — a signal the ranker learned to weight heavily in its top splits.

Onboarding Cold-Start: The 3-question survey asks students to select primary subjects and target tier. Those choices are stored as a preference vector keyed on subject labels — labels that are only meaningful if every content item carries accurate subject tags. Without reliable auto-tagging, the onboarding personalisation would have had nothing to map to.

Python — PDF text extraction · NLTK preprocessing · multi-label subject + difficulty classifiers
import pdfplumber, re, joblib
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline

STOP_WORDS = set(stopwords.words('english'))
stemmer    = PorterStemmer()

def extract_text(pdf_path: str) -> str:
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages[:8]:          # first 8 pages capture subject well
            pages.append(page.extract_text() or '')
    return ' '.join(pages)

def preprocess(text: str) -> str:
    text   = re.sub(r'[^a-zA-Z\s]', ' ', text.lower())
    tokens = word_tokenize(text)
    return ' '.join(
        stemmer.stem(t) for t in tokens
        if t not in STOP_WORDS and len(t) > 2
    )

# ── Subject classifier: multi-label, 11 UPSC GS subjects ──────────────
# Ground truth: 400 items tagged manually by subject-matter expert
mlb       = MultiLabelBinarizer()
y_subject = mlb.fit_transform(train_labels_subject)   # (400, 11)

subject_clf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=8000,
                              sublinear_tf=True, min_df=2)),
    ('clf',   OneVsRestClassifier(
                  LogisticRegression(C=1.0, max_iter=1000, solver='lbfgs'))),
])
subject_clf.fit(train_texts, y_subject)
# Macro F1 on held-out 20%: 0.91

# ── Difficulty classifier: single-label, 3 tiers ──────────────────────
# Prelims / Mains / Advanced Optional
difficulty_clf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000,
                              sublinear_tf=True, min_df=2)),
    ('clf',   LogisticRegression(C=0.5, max_iter=1000, solver='lbfgs')),
])
difficulty_clf.fit(train_texts, train_labels_difficulty)
# Macro F1 on held-out 20%: 0.87

def tag_content_item(pdf_path: str) -> dict:
    """Tag a new PDF in <2s synchronously on upload."""
    clean      = preprocess(extract_text(pdf_path))
    subjects   = list(mlb.inverse_transform(subject_clf.predict([clean]))[0])
    difficulty = difficulty_clf.predict([clean])[0]
    return {'subjects': subjects, 'difficulty': difficulty}

joblib.dump(subject_clf,    'models/subject_clf.pkl')
joblib.dump(difficulty_clf, 'models/difficulty_clf.pkl')
joblib.dump(mlb,            'models/mlb.pkl')

Batch-First Architecture

Nightly Retraining, Sub-8ms Serving

The system is entirely batch-driven. A Celery task fires at 2 AM IST, retrains ALS, refreshes the TF-IDF similarity matrix if content metadata has changed, loads the latest weekly LightGBM ranker, computes top-50 recommendations for every active student, applies debiasing and recency adjustments, and writes results to Redis. The homepage serving layer is a single Redis GET, with no model inference at request time. Full pipeline runtime: ~22 minutes on a single EC2 m5.xlarge.

01

Interaction Extraction — PostgreSQL

Pull 90 days of rolling interaction data from the content_events table for all students who were active in the last 30 days. The 90-day window (not full history) is intentional: training on the full 12-month history amplified popularity bias from the pre-personalisation era, when the same 30 items dominated the editorial homepage and accumulated disproportionate interaction counts. Tested against 30/60/90/180-day windows in MLflow; 90 days produced the best NDCG@10 with sufficient signal for sparse users.

Approximately 500,000 quality interaction events (completions, bookmarks, long views) in the training window at steady state — roughly 12–15 meaningful signals per active student per month.

02

Matrix Construction — Pandas + NumPy

Aggregate events into a confidence-weighted sparse interaction matrix (40K × 2,400) using Pandas groupby and scipy.sparse.csr_matrix. The full matrix with all non-zero values fits comfortably in the m5.xlarge's 16 GB RAM, with peak memory usage of ~1.1 GB. No distributed processing (Spark or PySpark) is needed or appropriate at this scale; adding cluster infrastructure would have increased build complexity for no performance gain.

03

ALS Retraining — implicit library

ALS is retrained from scratch every night (~4 minutes). Full retraining is cheap enough that incremental updates would add complexity without meaningful benefit. TF-IDF vectoriser and item-item similarity matrix are only recomputed when content metadata changes (checked via a version hash). LightGBM ranker is retrained weekly every Sunday night — daily click labels need ~7 days to accumulate enough ranking signal volume for stable gradient boosting.
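The version-hash check for the TF-IDF recompute could look something like the sketch below. The source only states that a version hash is used; the choice of SHA-256 over canonical JSON, the `id` field, and the function names are assumptions.

```python
import hashlib
import json


def metadata_version_hash(items: list) -> str:
    """Stable SHA-256 over the metadata fields that feed the TF-IDF string."""
    canonical = json.dumps(
        sorted(items, key=lambda item: item["id"]),   # row order must not matter
        sort_keys=True, ensure_ascii=False,           # key order must not matter
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_recompute(items: list, stored_hash) -> bool:
    """True when the metadata differs from the last vectorised snapshot."""
    return metadata_version_hash(items) != stored_hash
```

Sorting rows and keys before hashing means a reordered query result does not trigger an unnecessary rebuild.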

04

Candidate Generation + Ranking

For each active student: ALS generates top-200 CF candidates (filtered for already-seen items). CB adds up to 50 candidates for cold-start users or new content items. Duplicates are removed. LightGBM ranker scores all candidates using the 6-feature set. Popularity debiasing penalty (0.4×) is applied to items in the top-20% by platform-wide 7-day view count. Recency boost (1.2× for items <30 days old, exponential decay) is applied. Top-50 by final score become the recommendation list.
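The duplicate-removal step can be sketched as a merge that keeps one row per item with both scores aligned. Filling a missing CF or CB score with 0.0 is an assumption of this sketch; the source does not say how items seen by only one model are scored.

```python
def deduplicate(cf_ids, cf_scores, cb_ids, cb_scores):
    """Merge CF and CB candidates into aligned (ids, cf, cb) lists.

    An item missing from one source gets a score of 0.0 for that source.
    """
    cf = dict(zip(cf_ids, cf_scores))
    cb = dict(zip(cb_ids, cb_scores))
    ids = list(dict.fromkeys(list(cf_ids) + list(cb_ids)))   # keeps first-seen order
    return (ids,
            [float(cf.get(i, 0.0)) for i in ids],
            [float(cb.get(i, 0.0)) for i in ids])
```

Returning three aligned lists lets the result be unpacked directly into a feature builder that expects candidates, CF scores, and CB scores.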

05

Redis Write — Pre-computed Recommendations

Top-50 item IDs are serialised as JSON and written to Redis with key recs:{student_id} and TTL = 86,400 seconds (24 hours). A Redis pipeline batches all writes in a single round-trip. ~28,000 active users are processed in the ~22-minute batch window. Cold-start users (no interaction history + no onboarding survey) receive a subject-popularity fallback list rather than empty recommendations.
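A sketch of the write step, assuming a redis-py style pipeline object that exposes `set(key, value, ex=...)`. The `FakePipeline` class is a stand-in so the example runs without a Redis server; it and the function name are hypothetical.

```python
import json


def write_recommendations(pipe, recs_by_student: dict, ttl_s: int = 86_400) -> int:
    """Queue one SET per student on a pipeline; returns the number queued."""
    for student_id, item_ids in recs_by_student.items():
        payload = json.dumps([int(i) for i in item_ids])   # plain ints are JSON-safe
        pipe.set(f"recs:{student_id}", payload, ex=ttl_s)
    return len(recs_by_student)


class FakePipeline:
    """Records queued commands instead of sending them to Redis."""
    def __init__(self):
        self.commands = []

    def set(self, key, value, ex=None):
        self.commands.append((key, value, ex))


pipe = FakePipeline()
queued = write_recommendations(pipe, {101: [7, 3, 9]})
```

Casting each ID to `int` guards against NumPy integer types, which `json.dumps` cannot serialise directly.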

06

FastAPI Serving — Single Redis GET

The student portal homepage hits GET /recommendations/{student_id}. The handler reads recs:{student_id} from Redis and returns the ordered list. No model inference happens at request time. P95 latency: <8ms including network overhead. On cache miss (new student, expired TTL, or batch failure), the cold-start fallback endpoint returns content-based recommendations computed from the student's onboarding survey responses.

SQL + Python — PostgreSQL schema + Celery batch task + FastAPI serving endpoint
-- PostgreSQL: interaction logging
CREATE TABLE content_events (
    event_id    BIGSERIAL   PRIMARY KEY,
    student_id  INTEGER     NOT NULL REFERENCES students(id),
    content_id  INTEGER     NOT NULL REFERENCES content_items(id),
    event_type  VARCHAR(20) NOT NULL,   -- 'view', 'bookmark', 'complete', 'skip'
    duration_s  INTEGER,                -- watch/read time in seconds
    created_at  TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_events_student  ON content_events (student_id, created_at DESC);
CREATE INDEX idx_events_content  ON content_events (content_id);
CREATE INDEX idx_events_type_day ON content_events (event_type, created_at DESC);


# Celery: nightly batch pipeline
# Assumes json, time, mlflow and lightgbm (as lgb) are imported, and that the
# celery app, the Redis client and the helper functions are defined elsewhere.
@celery.task(name='nightly_recs', max_retries=2, acks_late=True)
def run_nightly_recommendations():
    started = time.monotonic()

    # The context manager closes the MLflow run even if the batch raises.
    with mlflow.start_run(run_name='nightly_batch'):
        events  = extract_events(days=90)
        matrix  = build_interaction_matrix(events)   # wrapper that supplies matrix dimensions
        als     = train_als(matrix, factors=64, alpha=40)
        tfidf   = load_or_recompute_tfidf()
        ranker  = lgb.Booster(model_file='ranker_weekly.lgb')

        active_users = get_active_users(days=30)
        pipe = redis.pipeline(transaction=False)

        for user_id in active_users:
            cf_ids, cf_scores = als.recommend(user_id, matrix[user_id], N=200,
                                              filter_already_liked_items=True)
            cb_ids, cb_scores = get_cb_candidates(user_id, tfidf, N=50)
            # candidates is an (ids, cf_scores, cb_scores) tuple
            candidates = deduplicate(cf_ids, cf_scores, cb_ids, cb_scores)
            features   = build_ranking_features(user_id, *candidates)
            scores     = ranker.predict(features)
            top50      = apply_debiasing_and_recency(candidates[0], scores)
            # Cast to int: NumPy integers are not JSON-serialisable.
            pipe.set(f'recs:{user_id}', json.dumps([int(i) for i in top50]), ex=86400)

        pipe.execute()
        mlflow.log_metrics({'users_processed': len(active_users),
                            'batch_duration_s': time.monotonic() - started})


# FastAPI: recommendation serving
@app.get('/recommendations/{student_id}')
async def get_recommendations(student_id: int, user: User = Depends(verify_jwt)):
    if user.id != student_id:
        raise HTTPException(403)

    cached = await redis.get(f'recs:{student_id}')
    if cached:
        return {'items': json.loads(cached), 'source': 'model'}

    fallback = await get_cold_start_recs(student_id)
    return {'items': fallback, 'source': 'cold_start'}

Production Monitoring

Batch Health & Alerting

A batch pipeline that fails silently serves stale recommendations for 24 hours before anyone notices. The monitoring layer is simple by design: three signals cover the failure modes that actually occur in production.

Batch Health Signals

  • Celery Flower dashboard monitors task state in real time — running, success, failure, retry
  • CloudWatch alarm fires if batch duration exceeds 40 min (expected: ~22 min) — indicates data extraction bottleneck or OOM event
  • MLflow logs users_processed and batch_duration_s per run — visible drift signals a data quality issue before it becomes a model quality issue
  • Celery task configured with max_retries=2 and acks_late=True — task is re-queued on worker crash, not silently dropped

Failure Behaviour

  • On batch failure: Redis keys retain 24h TTL — stale recommendations (previous night's run) serve automatically as a graceful fallback. Users see slightly outdated but valid recs, not empty lists
  • On batch failure after 2 retries: Slack alert sent via Celery failure callback with task ID, failure reason, and timestamp
  • Cache miss on serving (new user or expired TTL): cold-start fallback endpoint called — never a blank homepage
  • EC2 instance health checked by a CloudWatch agent with auto-restart on failure — no manual intervention required for transient instance issues

MLflow Experiment Tracking

All 40+ offline experiments tracked in a self-hosted MLflow instance on the same EC2 instance. Parameters tracked: ALS factor count (32/64/128), regularisation (0.005/0.01/0.05), confidence scale α (20/40/80), interaction window (30/60/90/180 days), LightGBM learning rate and max depth. Primary metric: NDCG@10 on a 10% held-out validation set. The 90-day window + 64 factors + α=40 combination was the Pareto-optimal point — best NDCG@10 without overfitting to historical popularity. Models are promoted via the MLflow Model Registry: Staging stage requires offline validation metrics above the current Production threshold before promotion.

12 Months of Experimentation

Five A/B Tests, Each Building on the Last

Every model change shipped through an A/B experiment — 80/20 traffic split (80% treatment, 20% holdout), minimum 14-day run to capture weekly study patterns, primary metric: 14-day click-through rate on recommended items, secondary metric: 30-day content completion rate. Five experiments ran sequentially over 12 months, each treatment group becoming the new baseline for the next test.
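The source does not describe how students were assigned to groups. A common approach, shown here purely as an illustration, is deterministic hash-based bucketing so a student stays in the same group for the whole experiment; the function name and the experiment label are hypothetical.

```python
import hashlib


def assign_group(student_id: int, experiment: str, treatment_share: float = 0.80) -> str:
    """Deterministically bucket a student into 'treatment' or 'holdout'."""
    digest = hashlib.sha256(f"{experiment}:{student_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000     # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "holdout"
```

Including the experiment name in the hashed key re-shuffles students between experiments, so the same 20% are not held out every time.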

+34% Cumulative CTR lift vs. no personalisation · +18% 30-day content completion rate lift · 3× Catalog coverage improvement from popularity debiasing
#  | Hypothesis | Treatment | Control | Primary Result | Decision
---|---|---|---|---|---
01 | Any collaborative personalisation beats editorial curation | ALS-only recommendations | Editorial curated homepage feed | +22% CTR, +14% completion rate | Shipped
02 | Content-based outperforms ALS for cold-start users (<20 interactions) | CB recs for cold-start cohort; ALS for warm users | ALS for all users (sparse signal for cold-start) | +31% CTR in cold-start segment; no change in warm segment | Shipped (CB for cold-start, ALS for warm)
03 | LightGBM learned-to-rank outperforms weighted score blending | LightGBM ranker (ALS + CB + 4 contextual features) | Linear blend: 0.7 × CF score + 0.3 × CB score | +8% CTR, +12% NDCG@10 vs. linear blend | Shipped
04 | Popularity debiasing improves catalog coverage without hurting CTR | 0.4× penalty on overrepresented items (top-20% by 7-day views) | No debiasing (popularity-dominated ranking) | CTR: no significant change · 3× catalog coverage (distinct items recommended) | Shipped (catalog health justified despite neutral CTR)
05 | Recency boost increases engagement with newly published content | 1.2× score multiplier for content <30 days old, exponential decay | No recency adjustment | +7% CTR on recently-published content subset; overall CTR +2% | Shipped

The Experiment 04 Argument — Why Neutral CTR Was Enough to Ship

Experiment 04 was the most contested internally. Standard recommendation systems thinking: if CTR is neutral and you're adding complexity, don't ship. That reasoning fails here.

Before debiasing, the top 20 items out of 2,400 captured 78% of all recommendation slots. Students preparing for optional papers — Literature, Anthropology, Law, Agriculture — received the same mainstream General Studies content as everyone else, because those items had accumulated the most interactions during the pre-personalisation editorial era. The recommendation engine was perpetuating the bias it was supposed to replace.

After debiasing, the number of distinct items appearing in recommendations tripled. Optional-paper students started seeing subject-relevant content for the first time. The business case: optional-paper students convert to paid annual subscriptions at a higher rate than GS-only students. Improving their experience — even without a 14-day CTR signal — had long-term retention value that a short experiment window couldn't measure.

Lesson: short-run CTR is a measurement of what's easy to click, not what's good to recommend. Match the experiment metric to the business outcome that actually matters.
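The two catalog-health figures used in this argument (distinct items recommended, and the share of slots taken by the most-recommended items) can be computed from the per-user recommendation lists as sketched below. The function names are illustrative.

```python
from collections import Counter


def catalog_coverage(rec_lists: list, catalog_size: int) -> float:
    """Share of the catalogue that appears in at least one recommendation list."""
    distinct = {item for recs in rec_lists for item in recs}
    return len(distinct) / catalog_size


def top_k_slot_share(rec_lists: list, k: int = 20) -> float:
    """Fraction of all recommendation slots taken by the k most-recommended items."""
    counts = Counter(item for recs in rec_lists for item in recs)
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(k))
    return top / total if total else 0.0
```

Tracking both numbers per nightly run would make concentration visible as a regression, rather than something discovered only during an experiment.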

Business Outcome — What the Metrics Translated To

Across all 5 experiments over 12 months: +34% CTR and +18% completion rate on recommended content. Completion rate on premium (paid-tier) content — the downstream metric the business tracked for subscription value — sustained a measurable lift that persisted through the 6-month observation window post-launch. The recommendation system also reduced the support tickets asking "where do I find X subject notes" by surfacing relevant content proactively — an unmeasured but clearly visible qualitative improvement the product team noted.

What Actually Moved the Needle

Six Decisions Worth Carrying Forward

A year in production surfaced a clear hierarchy: data quality decisions outperformed architecture decisions; product-side changes sometimes outperformed model changes. Infrastructure was built correctly once and then left alone.

01 Implicit feedback weight design is the highest-leverage decision in the entire project

The first ALS model treated all interaction events equally. A student who left a PDF open in a browser tab while making tea registered the same training signal as a student who bookmarked a note after a full read-through. The model was dominated by passive view events — the noisiest, lowest-intent signal in the dataset.

Introducing confidence weights — completion = 3×, bookmark = 2×, view >60s = 1×, view <60s = 0.3× — improved offline NDCG@10 by 9% without changing model architecture, training time, or infrastructure. The weights weren't derived from theory; they came from a manual correlation analysis of which interaction types predicted repeat engagement on a 30-day holdout.

Principle: Spend more time designing what you count as a positive signal than tuning model hyperparameters. A better signal beats a better model.
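A minimal sketch of how the confidence weights above could be turned into per-pair training values. The weights are the ones quoted in the text; the event-type labels, the tuple layout, and the choice to sum weights per student-item pair are assumptions made for illustration, not the production code.

```python
from collections import defaultdict

# Weights quoted in the text; the event-type labels are hypothetical names.
CONFIDENCE_WEIGHTS = {
    "completion": 3.0,
    "bookmark": 2.0,
    "view_long": 1.0,   # view > 60 s
    "view_short": 0.3,  # view < 60 s
}


def build_confidence(events):
    """Sum confidence weights per (student_id, item_id) pair.

    `events` is an iterable of (student_id, item_id, event_type) tuples.
    Summing is one possible aggregation (an assumption); taking the
    maximum weight per pair is another reasonable choice.
    """
    confidence = defaultdict(float)
    for student_id, item_id, event_type in events:
        weight = CONFIDENCE_WEIGHTS.get(event_type)
        if weight is None:
            continue  # skip event types that have no assigned weight
        confidence[(student_id, item_id)] += weight
    return dict(confidence)


events = [
    ("s1", "notes_polity", "view_short"),
    ("s1", "notes_polity", "bookmark"),
    ("s2", "notes_polity", "view_long"),
    ("s2", "video_ethics", "completion"),
    ("s2", "video_ethics", "share"),  # unknown type, skipped
]
confidence = build_confidence(events)
```

The resulting dictionary maps each student-item pair to a single confidence value, which is the shape a sparse interaction matrix for an implicit-feedback ALS model is typically built from.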
02 A 90-day interaction window outperformed full history — historical bias is a real problem

Training the ALS model on the full 12-month interaction history amplified popularity bias from the pre-personalisation era. During that period, every student saw the same 30 editorial items on the homepage. Those items accumulated enormous interaction counts — not because they were the best content, but because they were the only visible content. When the ALS model was trained on this history, those same 30 items dominated its output, perpetuating the bias the system was designed to replace.

A 90-day rolling window reduced the influence of legacy editorial dominance and improved model responsiveness to current student interests. An MLflow experiment across four window lengths showed 90 days to be the best trade-off: fresh enough to reduce legacy bias, long enough to retain signal for students who study infrequently. The 30-day window had better recency but left 40% fewer training signals for students who study 2–3 times per week rather than daily.

MLflow comparison: 90-day window → NDCG@10 +4% vs. full history. 30-day window → NDCG@10 +1% vs. full history, but 40% fewer valid training rows for sparse users.
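The window comparison comes down to a cutoff filter on interaction timestamps before training. A rough sketch, assuming each interaction is a dict with a `ts` datetime field (the field names and sample rows are hypothetical):

```python
from datetime import datetime, timedelta


def rolling_window(events, as_of, days=90):
    """Keep events whose timestamp falls within `days` before `as_of`."""
    cutoff = as_of - timedelta(days=days)
    return [e for e in events if cutoff <= e["ts"] <= as_of]


as_of = datetime(2023, 6, 30)
events = [
    {"student_id": "s1", "item_id": "a", "ts": datetime(2022, 9, 1)},   # legacy era
    {"student_id": "s1", "item_id": "b", "ts": datetime(2023, 6, 10)},  # recent
    {"student_id": "s2", "item_id": "a", "ts": datetime(2023, 4, 10)},  # infrequent user
]

rows_90 = rolling_window(events, as_of, days=90)
rows_30 = rolling_window(events, as_of, days=30)
```

In this toy data the 30-day window drops the infrequent student's only recent interaction, which is the same sparsity effect the comparison above describes for students who study a few times per week.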
03 Cold start is a product problem as much as a modelling problem — and the product fix works better

The initial cold-start strategy was pure model-side: content-based recommendations built from the student's enrolled subject (captured during course registration). It worked, but produced generic subject-level popular content for 2–3 weeks until the ALS model had enough signal. During those early weeks, cold-start students had the worst recommendation quality on the platform.

A product change — a 3-question onboarding survey shown at first login, asking for primary preparation subjects, current stage (Prelims/Mains/Both), and difficulty preference — dramatically improved cold-start quality from day one. With those signals, the content-based profile could filter to subject-appropriate difficulty levels immediately instead of defaulting to subject-popular items.

CTR for new students in the first 30 days improved by 19% after the survey was introduced — more improvement than any model change delivered for the same cohort during the same period. The lesson: when cold-start quality is the bottleneck, ask the user before you model the user.
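One way the survey answers could drive the cold-start list is a plain metadata filter followed by a popularity sort. This is a sketch under assumptions: the catalog fields (`subject`, `stage`, `difficulty`, `views`) and the survey keys are hypothetical names, and the real content-based profile used TF-IDF features that are not shown here.

```python
def cold_start_candidates(catalog, survey, limit=10):
    """Filter the catalog by onboarding answers, then rank by popularity."""
    # "Both" expands to the two preparation stages named in the survey.
    stages = {"Prelims", "Mains"} if survey["stage"] == "Both" else {survey["stage"]}
    matches = [
        item for item in catalog
        if item["subject"] in survey["subjects"]
        and item["stage"] in stages
        and item["difficulty"] == survey["difficulty"]
    ]
    matches.sort(key=lambda item: item["views"], reverse=True)
    return [item["item_id"] for item in matches[:limit]]


catalog = [
    {"item_id": "p1", "subject": "Polity", "stage": "Prelims", "difficulty": "basic", "views": 120},
    {"item_id": "p2", "subject": "Polity", "stage": "Mains", "difficulty": "basic", "views": 300},
    {"item_id": "p3", "subject": "Polity", "stage": "Mains", "difficulty": "advanced", "views": 900},
    {"item_id": "h1", "subject": "History", "stage": "Prelims", "difficulty": "basic", "views": 500},
]
survey = {"subjects": {"Polity"}, "stage": "Both", "difficulty": "basic"}
picks = cold_start_candidates(catalog, survey)
```

Without the difficulty answer, the most-viewed Polity item (`p3`, advanced) would lead the list; with it, the new student sees only items at the level they asked for.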

04 Popularity bias is structural — it requires explicit correction, not just better models

Before popularity debiasing, the top 20 items out of 2,400 captured 78% of all recommendation slots. This wasn't a model failure — it was a structural feedback loop. Popular items get recommended → students click them → they become more popular in training data → they get recommended more. A better model trained on the same data would amplify the same loop, not break it.

Breaking the loop required an explicit intervention: a 0.4× penalty on items in the global top 20% by 7-day view count, applied after LightGBM scoring and before final ranking. Experiment 04 showed no CTR change (popular items are genuinely good) but tripled the number of distinct items appearing in recommendations across the platform. Optional-paper students, 20% of the user base, started receiving subject-relevant recommendations for the first time.

Rule: popularity debiasing is not optional in a system with implicit feedback. The feedback loop will always compress the effective catalog toward a small popular head unless you explicitly counteract it.
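The debiasing step can be sketched as a post-scoring adjustment. Two assumptions are made here and should be read as such: the 0.4× penalty is interpreted as multiplying the ranker score by 0.4, and ranker scores are taken to be non-negative (a multiplicative penalty would boost negative scores). Function and parameter names are hypothetical.

```python
import math


def apply_popularity_penalty(scores, views_7d, top_fraction=0.2, penalty=0.4):
    """Down-weight the most-viewed items, then return item ids by adjusted score.

    scores:   item_id -> ranker score for one student (assumed non-negative)
    views_7d: item_id -> global 7-day view count
    """
    n_head = math.ceil(len(views_7d) * top_fraction)
    head = set(sorted(views_7d, key=views_7d.get, reverse=True)[:n_head])
    adjusted = {
        item: score * penalty if item in head else score
        for item, score in scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)


views_7d = {"a": 900, "b": 400, "c": 50, "d": 30, "e": 10}
scores = {"a": 0.9, "b": 0.6, "c": 0.5, "d": 0.2}
ranking = apply_popularity_penalty(scores, views_7d)
```

The popular item is demoted rather than removed, which fits the Experiment 04 observation that CTR held steady: genuinely strong popular items can still win a slot when their score is high enough.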
05 Offline metrics don't always predict online impact — know which metric your stakeholder cares about

The LightGBM ranker was evaluated offline on NDCG@10 (+12% vs. linear blend). Online, in Experiment 03, it produced +8% CTR, a reasonable correlation. Experiment 04, however, showed a complete disconnect: debiasing produced 0% CTR change while tripling catalog coverage, and coverage wasn't even being tracked offline.

The business outcome that mattered — completion rate on premium content, which correlated with subscription renewal — didn't move in lockstep with 14-day CTR either. Completion rate improvement (+18% cumulatively) was a slower signal, visible only over 30-day observation windows, and required a separate tracking query to measure.

Lesson: define the downstream business metric before designing the experiment. NDCG@10 is a good offline proxy for ranking quality, but it cannot tell you whether the right users are getting the right content — only A/B testing with a business-aligned metric can do that.
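To make the gap between the two offline views concrete, here is a small sketch of binary-relevance NDCG@k next to a catalog coverage measure. Both are standard textbook definitions written out by hand; how the project actually computed them is not stated, so the function names and data shapes are assumptions.

```python
import math


def ndcg_at_k(ranked_items, relevant, k=10):
    """Binary-relevance NDCG@k for one student's ranked list."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def catalog_coverage(recommendations, catalog_size):
    """Fraction of the catalog that appears in at least one student's list."""
    distinct = {item for items in recommendations.values() for item in items}
    return len(distinct) / catalog_size


score = ndcg_at_k(["a", "b", "c"], relevant={"a", "c"})
coverage = catalog_coverage({"s1": ["a", "b"], "s2": ["a", "c"]}, catalog_size=6)
```

NDCG is computed per student and says nothing about how many distinct items the system surfaces overall, so a change can leave it flat while coverage moves substantially. Tracking both offline would have made the Experiment 04 effect visible before launch.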

06 Batch was the right infrastructure choice — and it still is for this scale

Content preferences for UPSC students shift over days, not minutes. A nightly batch pipeline matched the signal timescale correctly. Real-time re-ranking — Kafka, Flink, stateful stream operators — would have consumed the remaining infrastructure runway for marginal quality gain. The 22-minute nightly run on a single EC2 instance remains the most cost-effective decision made across the project.

Principle: match infrastructure complexity to the actual timescale of signal change, not to what's technically impressive. Batch + Redis served sub-8ms with zero streaming infrastructure.
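The batch-plus-cache pattern reduces the request path to a single key lookup. The sketch below uses a plain dict as a stand-in for Redis so it runs anywhere; the `rec:` key format, the fallback key, and both function names are hypothetical, not the production layout.

```python
import json


def publish_recommendations(store, recs_by_student, fallback):
    """Nightly batch step: write one precomputed list per student."""
    for student_id, items in recs_by_student.items():
        store[f"rec:{student_id}"] = json.dumps(items)
    # Default list for students with no precomputed entry yet.
    store["rec:fallback"] = json.dumps(fallback)


def serve(store, student_id):
    """Request path: one key lookup, no model inference."""
    payload = store.get(f"rec:{student_id}") or store["rec:fallback"]
    return json.loads(payload)


store = {}  # stand-in for a Redis client in this sketch
publish_recommendations(store, {"s1": ["a", "b"]}, fallback=["top1"])
known = serve(store, "s1")
unknown = serve(store, "new_student")
```

Because all scoring happens in the nightly job, serve-time cost is dominated by the cache round trip, which is what makes a sub-10ms budget achievable without streaming infrastructure.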
What this role led to: from structured data to unstructured knowledge

The recommendation engine proved the platform could build and own production ML infrastructure — models trained nightly, evaluated rigorously, deployed with monitoring. That foundation created the conditions for the next problem.

The next problem was different in kind. Students weren't just discovering content; they were asking specific questions buried inside that content. A recommendation system surfaces items. It cannot answer "what did the 2019 UPSC Mains Essay paper ask about Ethics?" That question requires retrieval over unstructured knowledge, generation grounded in source material, and accuracy guarantees no collaborative filter can provide.

That distinction — discovery vs. synthesis — is what drove the move from recommendation systems to retrieval-augmented generation. The infrastructure habits (async pipelines, Redis caching, evaluation gates before deployment) carried over. The problem class changed entirely.