Conversational AI Agent
for UPSC Aspirants
End-to-end production case study. Multi-agent RAG system, 8 system layers, full architecture with trade-offs, and production hardening from A to Z.
A domain where hallucination is not an option
UPSC Civil Services is India's most competitive examination — roughly 1 million candidates compete annually for under 1,000 positions. Students at a leading EdTech platform in New Delhi needed round-the-clock access to accurate, domain-expert answers across a vast multi-domain syllabus: History, Geography, Polity, Economy, Science, Current Affairs, and Ethics. The challenge was not building a chatbot. It was building a system where answer quality could never silently degrade.
Hallucination is unacceptable. A wrong answer about a constitutional article can directly damage a student's exam preparation. Generic LLM behaviour must be constrained to verified UPSC content.
Scale is real. 40,000 concurrent users means the system must handle exam-season spikes without degradation. Thread-blocking inference is not viable.
Cost must be controlled. GPT-4 at 40K users with unconstrained usage is financially unsustainable. Every architecture decision is a cost-quality trade-off.
Maximise answer quality while minimising GPT-4 API calls. This single constraint drives caching strategy, agent routing, retrieval design, and the evaluation gate in CI/CD.
Seven layers, one responsibility each
No layer does two things. The gateway never touches GPT-4. The workers never handle auth. The agents never write to Redis. Separation of concerns at this scale makes debugging tractable and scaling decisions obvious.
cache hit path: ──▶ Pinecone Semantic Cache ──▶ Stream to Client
cache miss: ──▶ Celery Task Queue ──▶ Celery Worker (AWS ECS)
Eight layers, fully decomposed
Each component: what it does, why this approach over alternatives, the implementation with real code, and the key trade-off.
FastAPI is the system's only public surface. Four responsibilities and nothing else: authenticate the student via JWT, enforce Redis rate-limits, check the semantic cache, and dispatch work to Celery. It never touches GPT-4. FastAPI's async event loop means a 1–2s RAG pipeline call does not block the thread — at 40K concurrent users, the difference between a working system and thread starvation.
FastAPI async patterns
Django and Flask are synchronous by default. A 1.5s GPT-4 call blocks a worker thread for its full duration, so a thread pool of 40 is exhausted in under a second at peak load. FastAPI with uvicorn handles each request as a coroutine, and the thread stays free while Celery executes the task. The two-step POST → stream pattern adds one round-trip but lets clients reconnect the stream after mobile network drops, which is essential for students on slow connections.
@app.post("/chat")
async def chat(req: ChatRequest, user: User = Depends(verify_jwt)):
    # Rate limit: 100 queries/student/day, counted with an atomic Redis INCR
    rate_key = f"rate:{user.id}:{today()}"
    count = await redis.incr(rate_key)
    if count == 1:
        # First request of the day: set the 24h expiry once
        await redis.expire(rate_key, 86400)
    if count > 100:
        raise HTTPException(429, "Daily query limit reached")

    # Semantic cache lookup: zero GPT-4 cost on a hit
    cached = await semantic_cache.lookup(req.query)
    if cached:
        return StreamingResponse(stream_string(cached))

    # Dispatch to Celery: return the task ID in <5ms, never block
    task = rag_pipeline.delay(req.query, req.session_id, user.id)
    return {"task_id": task.id}


@app.get("/stream/{task_id}")
async def stream_result(task_id: str):
    # SSE stream: tokens arrive as the worker produces them
    return StreamingResponse(
        token_generator(task_id),
        media_type="text/event-stream",
    )

Trade-off: the two-step POST → stream pattern adds one round-trip. Benefit: the gateway never blocks, and clients reconnect the SSE stream on network drops. At 40K mobile users, reconnect resilience is not optional.
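The `token_generator` used by the stream endpoint is not shown in the source. As a minimal sketch of just the Server-Sent Events framing it would need to produce, the helper below wraps an iterable of tokens into SSE frames. The names `format_sse` and `sse_frames`, the JSON payload shape, and the final `done` event are illustrative assumptions; how tokens travel from the Celery worker to the gateway is not specified in the source and is left out.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Events frame (hypothetical helper)."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A multi-line payload needs one "data:" field per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one frame per token, then a closing 'done' event (assumed convention)."""
    for token in tokens:
        # JSON-encode the token so embedded newlines cannot break the framing.
        yield format_sse(json.dumps({"token": token}))
    yield format_sse("[DONE]", event="done")
```

A generator like this can be passed to `StreamingResponse(..., media_type="text/event-stream")`; the explicit closing event gives a reconnecting client a way to tell a finished answer from a dropped connection.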
This layer does three completely separate jobs. Each is tuned, monitored, and reasoned about independently; mixing them into one key namespace would make each harder to debug and optimize. Job 1: semantic response caching (primary cost lever; the cached vectors live in a Pinecone namespace rather than in Redis). Job 2: per-user rate limiting in Redis (cost guard). Job 3: conversation session memory in Redis (context store for multi-turn coherence).
Data & caching patterns
With 40K students asking overlapping UPSC questions, cache hit rate is the single most impactful cost metric. A 40% hit rate means 40% of GPT-4 API spend simply does not happen. The similarity threshold is critical: at 0.88 we had 6% false cache hits, meaning wrong answers were served. Moving to 0.92 reduced false hits to under 1% with only a 4% hit rate reduction. That trade-off is correct for high-stakes exam content.
import json

# Job 1: Semantic cache, queried from the Pinecone "cache" namespace
async def cache_lookup(query: str) -> str | None:
    embedding = await embed(query)
    results = pinecone_index.query(
        vector=embedding, top_k=1, namespace="cache",
        include_metadata=True,
    )
    if results.matches and results.matches[0].score >= 0.92:
        return results.matches[0].metadata["answer"]
    return None


# Job 2: Rate limiter, atomic so there are no race conditions
async def check_rate_limit(user_id: str) -> bool:
    key = f"rate:{user_id}:{today()}"
    count = await redis.incr(key)
    if count == 1:
        await redis.expire(key, 86400)  # reset daily
    return count <= 100


# Job 3: Session memory, bounded 10-turn window with a 30-min TTL
async def get_history(session_id: str) -> list[dict]:
    # LPUSH stores newest first, so reverse to get chronological order
    raw = await redis.lrange(f"session:{session_id}", 0, 9)
    return [json.loads(m) for m in reversed(raw)]


async def append_turn(session_id: str, message: dict):
    key = f"session:{session_id}"
    await redis.lpush(key, json.dumps(message))
    await redis.ltrim(key, 0, 9)    # enforce 10-turn limit
    await redis.expire(key, 1800)   # 30-min inactivity TTL

Trade-off: the semantic cache lives in a Pinecone namespace (not Redis) because embedding similarity search is Pinecone's native operation. A Redis+FAISS alternative would require maintaining a second in-memory vector index, with the operational overhead that brings.
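The 0.88 versus 0.92 comparison implies some offline measurement of hit rate and false-hit rate per threshold. The source does not show how it was done; the sketch below is one way to do that arithmetic over a hand-labelled sample. `LabelledPair`, `threshold_report`, and the choice to express false hits as a share of all cache hits are assumptions for illustration, and the sample data is made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LabelledPair:
    score: float        # similarity between a new query and its nearest cached query
    same_answer: bool   # human label: would the cached answer be correct here?


def threshold_report(pairs: List[LabelledPair], threshold: float) -> Dict[str, float]:
    """Hit rate and false-hit rate if the cache served every pair at or above threshold."""
    hits = [p for p in pairs if p.score >= threshold]
    false_hits = [p for p in hits if not p.same_answer]
    return {
        "hit_rate": len(hits) / len(pairs) if pairs else 0.0,
        # Assumption: false hits are measured as a fraction of served cache hits.
        "false_hit_rate": len(false_hits) / len(hits) if hits else 0.0,
    }


# Illustrative labelled sample, not production data.
sample = [
    LabelledPair(0.95, True),
    LabelledPair(0.93, True),
    LabelledPair(0.90, False),
    LabelledPair(0.89, True),
    LabelledPair(0.80, False),
]
loose = threshold_report(sample, 0.88)
strict = threshold_report(sample, 0.92)
```

Running the same report across a grid of thresholds makes the trade-off explicit: the stricter threshold gives up some hits in exchange for fewer wrong answers served from cache.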
Celery decouples the API layer from the inference layer entirely. FastAPI dispatches a task and returns in under 5ms. Workers execute the full RAG pipeline independently on AWS ECS and scale horizontally without touching the API servers. A lightweight Lambda publishes Celery queue depth to CloudWatch every 30 seconds — ECS scales workers up when depth exceeds 50 pending tasks, down below 10.
MLOps & async pipelines
Without Celery, every FastAPI request would block for 1–2s while the RAG pipeline ran in-process. At 40K users, even with async FastAPI, you'd need 40K concurrent RAG pipelines running simultaneously, which is impossible. With Celery, workers scale as an independent fleet. Peak queue depth during exam season: ~380 tasks. Auto-scaling ECS absorbs this in seconds without over-provisioning year-round.
@celery.task(bind=True, max_retries=3)
def rag_pipeline(self, query: str, session_id: str, user_id: str):
    try:
        # Restore conversation context from Redis
        history = get_session_history_sync(session_id)

        # Invoke LangGraph supervisor agent
        result = supervisor_graph.invoke({
            "query": query,
            "history": history,
            "user_id": user_id,
        })

        # Cache answer in Pinecone cache namespace
        pinecone_index.upsert(
            vectors=[{
                "id": cache_id(query),
                "values": embed_sync(query),
                "metadata": {"answer": result["answer"]},
            }],
            namespace="cache",
        )

        # Append turn to session memory
        append_turn_sync(session_id, {
            "role": "assistant", "content": result["answer"]
        })
        return result
    except openai.RateLimitError as e:
        # Back off longer when the provider is rate limiting
        raise self.retry(exc=e, countdown=5)
    except Exception as e:
        raise self.retry(exc=e, countdown=2)

Trade-off: separating API and worker containers means each scales to a different signal. The API scales by CPU/memory; workers scale by queue depth. During exam-season spikes, workers go 4 → 20+ while the API tier stays at 3 instances unchanged.
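The queue-depth scaling rule (scale up above 50 pending tasks, scale down below 10) can be sketched as a small pure function. In the described system this decision is made by ECS auto-scaling reacting to the CloudWatch metric, not by application code, so the function below is only an illustration of the rule. The name `desired_workers`, the step size of 2, and the 4 to 20 worker bounds are assumptions, with the bounds loosely based on the "4 → 20+" figure above.

```python
def desired_workers(
    queue_depth: int,
    current: int,
    *,
    scale_up_above: int = 50,    # from the text: scale up past 50 pending tasks
    scale_down_below: int = 10,  # from the text: scale down under 10 pending tasks
    min_workers: int = 4,        # assumed floor
    max_workers: int = 20,       # assumed ceiling
    step: int = 2,               # assumed step size per evaluation
) -> int:
    """Return the worker count after one scaling evaluation."""
    if queue_depth > scale_up_above:
        return min(current + step, max_workers)
    if queue_depth < scale_down_below:
        return max(current - step, min_workers)
    # Between the two thresholds nothing changes, which avoids flapping.
    return current
```

The gap between the two thresholds acts as hysteresis: a queue hovering around 30 tasks neither adds nor removes workers.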
The agent follows a Supervisor → Specialist pattern. A supervisor node classifies query intent and routes to one of four domain specialists: Factual Lookup, PYQ Analysis, Essay Guidance, Current Affairs. Each specialist has its own Pinecone namespace, prompt template, and LangSmith trace — completely isolated. A regression in the PYQ agent does not require re-evaluating the full system.
Agentic workflows & LangGraph
A monolithic agent with a single system prompt for all UPSC query types consistently underperforms on edge cases. "What is the 42nd Constitutional Amendment?" needs a different retrieval namespace and prompt framing than "Give me an essay structure on India-China border disputes." Separate agents mean separate optimization loops, separate prompt versions, and isolated failure surfaces. The supervisor adds ~50ms of classification latency, which is worth it for the per-domain accuracy gains.
from langgraph.graph import StateGraph, END
from typing import TypedDict


class AgentState(TypedDict):
    query: str
    history: list[dict]
    intent: str               # set by supervisor
    retrieved_chunks: list[str]
    answer: str
    sources: list[str]
    guardrail_passed: bool
    token_usage: dict         # per-node cost attribution


def supervisor_node(state: AgentState) -> AgentState:
    intent = classify_intent(state["query"], state["history"])
    return {**state, "intent": intent}


def route(state: AgentState) -> str:
    return state["intent"]  # "factual" | "pyq" | "essay" | "current_affairs"


def factual_agent(state: AgentState) -> AgentState:
    chunks = pinecone_query(state["query"], namespace="factual", top_k=5)
    answer = gpt4_generate(FACTUAL_PROMPT, chunks, state["history"])
    return {**state, "retrieved_chunks": chunks, "answer": answer}


graph = StateGraph(AgentState)
graph.add_node("supervisor", supervisor_node)
graph.add_node("factual", factual_agent)
graph.add_node("pyq", pyq_agent)
graph.add_node("essay", essay_agent)
graph.add_node("current_affairs", current_affairs_agent)
graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", route, {
    "factual": "factual", "pyq": "pyq",
    "essay": "essay", "current_affairs": "current_affairs",
})
supervisor_graph = graph.compile()

Trade-off: routing across four specialist agents adds ~50ms for the classification step on every request, since all queries pass through the supervisor. The latency cost is fixed within the 1–2s budget and the per-domain accuracy improvement justifies it unambiguously.
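Because the conditional edges map the supervisor's label directly to a node name, a label outside the four expected values would have no matching edge. The source does not say how `classify_intent` guards against that. One minimal sketch, assuming the classifier returns free text, is to normalise the label and fall back to a default specialist; `normalize_intent` and the choice of `factual` as the default are hypothetical.

```python
VALID_INTENTS = ("factual", "pyq", "essay", "current_affairs")
DEFAULT_INTENT = "factual"  # assumption: safest specialist for unrecognised queries


def normalize_intent(raw: str) -> str:
    """Map a raw classifier label onto one of the four routable intents."""
    cleaned = raw.strip().lower().replace("-", "_").replace(" ", "_")
    return cleaned if cleaned in VALID_INTENTS else DEFAULT_INTENT
```

Calling this inside `supervisor_node` before writing `intent` into the state would keep the routing table total over whatever the classifier emits.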
Three retrieval decisions define accuracy: (1) namespace-per-domain — each UPSC subject has its own Pinecone namespace, eliminating cross-domain retrieval noise; (2) hierarchical parent-child chunking — 128-token child chunks for precision retrieval, 512-token parent chunks assembled for GPT-4 context; (3) a dedicated cache namespace for semantic response caching, eliminating a second vector store.
Advanced RAG & vector search
LangSmith trace data exposed that without namespacing, the factual agent occasionally pulled irrelevant Current Affairs chunks when UPSC topic keywords overlapped with recent news. Hierarchical chunking addressed truncated-answer hallucinations: small chunks retrieve precisely, but GPT-4 needs broader context to give complete answers. Reducing top-k from 8 to 5 (validated by RAGAS) cut token cost 22% with no accuracy loss.
# Namespace-per-domain: each subject isolated
DOMAIN_NAMESPACES = {
    "factual": ["polity", "history", "economy", "geography"],
    "pyq": ["pyq_prelims", "pyq_mains"],
    "essay": ["essay_structure", "essay_content"],
    "current": ["current_affairs"],
    "cache": ["cache"],
}


def upsert_document(doc: Document, domain: str):
    """Hierarchical: child chunks indexed, parent assembled for context."""
    for i, parent in enumerate(chunk(doc.text, size=512, overlap=50)):
        for j, child in enumerate(chunk(parent, size=128, overlap=20)):
            pinecone_index.upsert(vectors=[{
                "id": f"{doc.id}_p{i}_c{j}",
                "values": embed(child),
                "metadata": {
                    "child_text": child,
                    "parent_text": parent,  # full context for GPT-4
                    "domain": domain,
                    "updated_at": doc.updated_at,
                },
            }], namespace=domain)


def retrieve(query: str, namespace: str, top_k: int = 5) -> list[str]:
    """Returns parent text (complete context) for matched child chunks."""
    results = pinecone_index.query(
        vector=embed(query), top_k=top_k,
        namespace=namespace, include_metadata=True,
    )
    return [m.metadata["parent_text"] for m in results.matches]

Trade-off: hierarchical chunking doubles upsert complexity. Payload: two text fields (child + parent) per vector. Benefit: retrieval precision from child chunks plus generation context from parent chunks. The accuracy gain at the 95% benchmark validates the complexity.
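The `chunk` helper used in `upsert_document` is not shown in the source. A minimal sketch of a sliding window with overlap follows, assuming whitespace-separated words as a stand-in for model tokens (a real pipeline would more likely count tokens with the embedding model's tokenizer). `unique_parents` is a hypothetical addition: several matched children can share one parent, so `retrieve` may return the same parent text more than once, and deduplicating before prompting avoids paying for repeated context.

```python
from typing import List


def chunk(text: str, size: int, overlap: int) -> List[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    tokens = text.split()  # assumption: words approximate tokens
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def unique_parents(parents: List[str]) -> List[str]:
    """Drop repeated parent texts while keeping the ranking order."""
    return list(dict.fromkeys(parents))


words = " ".join(str(i) for i in range(10))
windows = chunk(words, size=4, overlap=1)
```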
Three images, each with a single entrypoint: API gateway, Celery worker, Redis (standard image). API and worker share the same Python codebase but deploy as independent ECS services with different resource allocations and scaling policies. Multi-stage builds keep the API image at ~190MB (no ML dependencies) and the worker image at ~380MB (full stack). The same docker-compose.yml runs locally and in CI for environment parity.
MLOps & deployment
If API and workers run as the same process, they must scale together. During a traffic spike, you need inference capacity (workers), not gateway capacity (API). Separating into two ECS services means the API scales by CPU/request count and workers scale by queue depth, which is the correct signal for each. Independent scaling avoids over-provisioning the gateway tier during exam-season inference spikes.
# Dockerfile.api: multi-stage, lean gateway image (~190MB)
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements/api.txt .
RUN pip install --no-cache-dir -r api.txt

FROM python:3.11-slim
WORKDIR /app
# Copy installed packages and their console scripts (the uvicorn executable is in /usr/local/bin)
COPY --from=builder /usr/local/lib/python3.11 /usr/local/lib/python3.11
COPY --from=builder /usr/local/bin /usr/local/bin
COPY src/ .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

# docker-compose.yml (local = production parity)
services:
  api:
    build: { context: ., dockerfile: Dockerfile.api }
    env_file: .env.local
    ports: ["8000:8000"]
    depends_on: [redis]
  worker:
    build: { context: ., dockerfile: Dockerfile.worker }
    command: celery -A tasks worker --concurrency=8 --loglevel=info
    env_file: .env.local
    depends_on: [redis]
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

Trade-off: two Dockerfiles mean two build contexts to maintain. Payoff: the API image starts 40% faster (no ML stack import time), workers can use GPU-optimised base images without bloating the API container, and each can be updated independently without a joint deploy.
The pipeline enforces one rule: no code or prompt change reaches production without RAGAS eval sign-off. The eval suite runs first — before unit tests, before Docker build. A failing eval blocks the entire pipeline. Prompts are versioned in LangSmith Hub and pinned by tag in production config. ECS blue-green keeps the old version fully alive until the new one passes health checks.
MLOps & CI/CD for LLMs
Blue-green over rolling: with 40K active students, a broken rolling deploy where 50% of requests fail for 3 minutes is a serious incident, especially during exam season. Blue-green means zero downtime and instant rollback. The RAGAS gate runs first because a green test suite is meaningless if answer quality dropped. The 4-minute eval overhead is the correct trade-off for a system students depend on for exam preparation.
# .github/workflows/deploy.yml
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run RAGAS eval suite  # blocks deploy on threshold failure
        run: |
          python evals/run_ragas.py \
            --dataset evals/golden_set.json \
            --thresholds '{"faithfulness":0.90,"answer_relevancy":0.88,"accuracy":0.93}'
  test:
    needs: evaluate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/ -v --cov=app --cov-fail-under=80
  build-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push API + worker images to ECR
        run: |
          docker build -f Dockerfile.api -t $ECR_REGISTRY/upsc-api:$GITHUB_SHA .
          docker build -f Dockerfile.worker -t $ECR_REGISTRY/upsc-worker:$GITHUB_SHA .
          docker push $ECR_REGISTRY/upsc-api:$GITHUB_SHA
          docker push $ECR_REGISTRY/upsc-worker:$GITHUB_SHA
  deploy:
    needs: build-push
    runs-on: ubuntu-latest
    steps:
      - name: Blue-green ECS deploy (old version stays live until health checks pass)
        run: |
          aws ecs update-service \
            --cluster upsc-prod --service upsc-api \
            --task-definition upsc-api:$NEW_REVISION \
            --deployment-configuration minimumHealthyPercent=100,maximumPercent=200

Trade-off: the RAGAS eval adds ~4 minutes to every deploy. That is the correct trade-off: a 4-minute delay to ensure answer quality never drops below threshold is a bargain for a 40K-user live exam-prep system.
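The contents of `evals/run_ragas.py` are not shown in the source. The part that turns scores into a pass or fail for CI can be sketched independently of how the scores are computed. Below, `failed_metrics` and `gate_exit_code` are hypothetical helpers that compare a dictionary of scores against the JSON thresholds passed on the command line and return a process exit code; a missing metric is treated as a failure, which is an assumption chosen to fail closed.

```python
import json
from typing import Dict, Optional, Tuple


def failed_metrics(
    scores: Dict[str, float], thresholds: Dict[str, float]
) -> Dict[str, Tuple[Optional[float], float]]:
    """Return {metric: (observed, required)} for every metric below its threshold."""
    failures = {}
    for name, minimum in thresholds.items():
        value = scores.get(name)
        # Fail closed: a metric the eval did not report counts as a failure.
        if value is None or value < minimum:
            failures[name] = (value, minimum)
    return failures


def gate_exit_code(scores: Dict[str, float], thresholds_json: str) -> int:
    """0 lets the pipeline continue; 1 makes the CI step (and the deploy) fail."""
    return 1 if failed_metrics(scores, json.loads(thresholds_json)) else 0


THRESHOLDS = '{"faithfulness":0.90,"answer_relevancy":0.88,"accuracy":0.93}'
```

A script would end with `sys.exit(gate_exit_code(scores, args.thresholds))`; because the later jobs declare `needs: evaluate`, a non-zero exit stops the test, build, and deploy stages.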
LangSmith makes the invisible visible. Every LangGraph node is traced with individual latency, token counts (input + output), retrieved chunk content with similarity scores, and the exact GPT-4 prompt for every production request. User feedback (thumbs up/down) feeds prompt refinement cycles. Custom CloudWatch dashboards track the four operational KPIs: queue depth, cache hit rate, per-specialist error rate, and P95 latency.
LLM system design & evaluation
Three production fixes came directly from LangSmith data: (1) a 12% Supervisor misclassification rate, discovered via intent trace analysis and fixed with 6 few-shot examples, after which it dropped to under 2%; (2) a Polity namespace latency anomaly, found via per-node latency traces, caused by a namespace index size imbalance and resolved by splitting the namespace; (3) a token cost opportunity from token count traces: reducing chunks 8→5 cut cost 22%, with zero accuracy loss validated via RAGAS. Operating without trace visibility is flying blind.
from langsmith import Client, traceable
from langsmith.run_helpers import get_current_run_tree

ls = Client()

# Pin prompt versions: never "latest" in production
SUPERVISOR_PROMPT = ls.pull_prompt("upsc-supervisor:prod-v1.4")
FACTUAL_PROMPT = ls.pull_prompt("upsc-factual:prod-v2.1")


@traceable(name="factual_agent", tags=["specialist", "factual"])
def factual_agent(state: AgentState) -> AgentState:
    run = get_current_run_tree()
    chunks = retrieve(state["query"], namespace="factual")
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=build_messages(FACTUAL_PROMPT, chunks, state["history"]),
    )
    # Token usage tracked per node for cost attribution
    run.extra["token_usage"] = response.usage.model_dump()
    return {**state, "answer": response.choices[0].message.content}


# User feedback captured per run for prompt refinement
def record_feedback(run_id: str, score: int, comment: str = ""):
    ls.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=score,  # 1 = thumbs up, 0 = thumbs down
        comment=comment,
    )

Trade-off: LangSmith stores every production trace. At 40K queries/day, trace volume is significant. Sampling strategy: 100% of errors always traced, 10% of successes sampled. This preserves full visibility on failures while keeping storage costs manageable.
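The sampling rule (keep every error, keep 10% of successes) can be expressed as a small decision function. The source does not show how sampling is wired into LangSmith, so this sketch covers only the decision itself. `should_trace` is a hypothetical name, and hashing the run ID instead of calling a random number generator is an assumption, made so that the same run always gets the same decision on every process.

```python
import hashlib


def should_trace(run_id: str, is_error: bool, success_rate: float = 0.10) -> bool:
    """Keep all failed runs and a deterministic sample of successful ones."""
    if is_error:
        return True
    # Hash the run ID into one of 10,000 buckets so the decision is stable
    # across workers and retries.
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(success_rate * 10_000)
```

Deterministic sampling also means a run kept by one component is kept by every other component that sees the same ID, so sampled traces stay complete.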
Eight decisions that shaped the system
Every architectural decision made, the alternative considered, and the reasoning. This is the section that matters most in a system design interview.
| Decision | Chosen Approach | Alternative | Impact | Reasoning |
|---|---|---|---|---|
| Agent pattern | Supervisor + 4 specialists | Single monolithic agent | High | Isolated prompts, namespaces, traces per domain. A regression in one specialist does not contaminate others. Debugging stays tractable. |
| Inference execution | Celery async workers | Sync FastAPI endpoint | Critical | Prevents FastAPI thread exhaustion at 40K concurrent users. Workers scale horizontally independent of the API layer. |
| Semantic cache store | Pinecone cache namespace | Redis + FAISS in-memory | Medium | Pinecone already running. Cache is a namespace, not a second service. Eliminates a second vector store to maintain. |
| Chunking strategy | Hierarchical parent-child | Fixed 512-token chunks | High | Child chunks give retrieval precision. Parent chunks give generation context. Reduces truncated-answer hallucinations. |
| Retrieved chunks | 5 per query | 8–10 chunks | High | LangSmith traces showed no accuracy gain above 5 chunks (RAGAS validated). Reducing to 5 cut latency by 300ms and GPT-4 cost by 22%. |
| Deployment strategy | ECS blue-green | Rolling update | Critical | 40K active users. A broken rolling deploy affects live traffic immediately. Blue-green keeps old version alive until health checks pass. |
| Prompt versioning | LangSmith Hub (pinned tags) | Git text files | High | LangSmith ties prompt versions to trace data. Every production request permanently linked to the exact prompt that generated it. |
| Cache threshold | 0.92 similarity | 0.88 (initial) | High | At 0.88, false cache hits were 6%. At 0.92, under 1%. Only a 4% hit rate reduction — correct trade-off for exam content. |
Production Lessons
The Ingestion Pipeline Is Half the System
Most teams spend 90% of effort on the query path and 10% on ingestion. The quality of what is in Pinecone determines 80% of answer quality. UPSC content changes annually. Without a robust, versioned ingestion pipeline, accuracy silently degrades after every content update. This is the piece that keeps a 95% benchmark honest over time.
LangSmith Pays for Itself in Week One
The first week of production tracing revealed three issues that would have taken days to reproduce from application logs: the Supervisor misclassification, the Polity namespace latency, and the token count opportunity. All were found through traces, not logs. Operating an LLM pipeline in production without full trace visibility is flying blind.
Cache Hit Rate Is Your Best Cost KPI
Tracking and optimising semantic cache hit rate is the highest-leverage cost intervention at 40K users. Tuning the similarity threshold from 0.88 to 0.92 reduced false cache hits from 6% to under 1% — the difference between occasionally serving wrong answers (unacceptable in an exam context) and reliably serving correct ones.
Four additions that complete the system
What separates a production system from a prototype is how it handles edge cases, failures, and operational realities. These four additions are the gap between 90% complete and production-ready.
Problem: Pinecone returns top-K chunks by embedding similarity — cosine distance in embedding space does not always correlate with contextual relevance for GPT-4. Long-tail UPSC queries (specific Article numbers, Act citations, historical dates) suffered most because surface similarity diverged from semantic fit.
Solution: Cohere Rerank cross-encoder between Pinecone retrieval and GPT-4 generation. Cross-encoders score query-chunk pairs jointly — not independently — producing significantly more accurate relevance rankings than bi-encoder similarity alone.
import cohere

co = cohere.Client(api_key=COHERE_API_KEY)


def retrieve_and_rerank(
    query: str, namespace: str, top_k: int = 5
) -> list[str]:
    # Step 1: broad candidate pool from Pinecone (20, not 5)
    raw_chunks = retrieve(query, namespace=namespace, top_k=20)

    # Step 2: cross-encoder reranking scores each of the 20 against the query
    results = co.rerank(
        query=query,
        documents=raw_chunks,
        top_n=top_k,
        model="rerank-english-v3.0",
        return_documents=True,
    )
    return [r.document.text for r in results.results]


# Replace in each specialist agent:
# Before: chunks = retrieve(state["query"], namespace="factual", top_k=5)
# After:  chunks = retrieve_and_rerank(state["query"], "factual", top_k=5)

Problem: OpenAI API outages during peak exam windows (Prelims in June, Mains in September) surface as 500 errors for 40K students with zero mitigation. No fallback tier existed: a single provider failure meant a complete service outage at the most critical usage time.
Solution: A LangGraph conditional node implementing three fallback tiers: GPT-4 → GPT-3.5-turbo → semantic cache (relaxed threshold 0.85). Each tier has a timeout. The response includes a model quality signal so the client can surface a subtle "reduced quality" indicator when serving a degraded-tier answer.
def generation_with_fallback(state: AgentState) -> AgentState:
    """Three-tier fallback. Each tier has an independent timeout."""
    for model, timeout in [("gpt-4", 10), ("gpt-3.5-turbo", 8)]:
        try:
            response = openai.chat.completions.create(
                model=model,
                messages=build_messages(state),
                timeout=timeout,
            )
            return {
                **state,
                "answer": response.choices[0].message.content,
                "model_used": model,
                "is_fallback": model != "gpt-4",
            }
        # openai.APITimeoutError is the timeout exception in the v1 SDK;
        # openai.Timeout is a configuration type, not an exception
        except (openai.APIError, openai.APITimeoutError, openai.APIConnectionError):
            continue

    # Tier 3: semantic cache with relaxed threshold (0.85 vs 0.92)
    cached = semantic_cache.lookup(state["query"], threshold=0.85)
    if cached:
        return {**state, "answer": cached, "model_used": "cache", "is_fallback": True}
    return {**state, "answer": SERVICE_DEGRADED_MSG, "model_used": "none", "is_fallback": True}

Problem: Updating the embedding model (e.g., text-embedding-ada-002 → text-embedding-3-large) silently invalidates all existing Pinecone vectors. New query embeddings are incompatible with old stored vectors: cosine distance comparisons return noise, answer quality collapses, and no error fires. No migration pipeline existed.
Solution: A Celery pipeline that creates a versioned namespace, re-embeds all documents with the new model, runs RAGAS eval against the new namespace, and performs an atomic config cutover only if the eval passes threshold. The old namespace is retained for a 7-day rollback window.
@celery.task(bind=True)
def migrate_embedding_model(self, domain: str, new_model: str):
    old_ns = get_active_namespace(domain)
    new_ns = f"{domain}_{model_version(new_model)}"
    try:
        # 1. Fetch all documents from old namespace
        docs = [v.metadata["parent_text"]
                for v in fetch_all_vectors(namespace=old_ns)]

        # 2. Re-embed with new model (batched)
        new_vecs = embed_batch(docs, model=new_model, batch_size=100)

        # 3. Upsert to new namespace
        upsert_namespace(new_vecs, namespace=new_ns)

        # 4. RAGAS eval gate: must pass before cutover
        score = run_ragas_eval(namespace=new_ns,
                               dataset="evals/golden_set.json")
        if score["accuracy"] < 0.93:
            raise ValueError(
                f"Migration failed RAGAS gate: {score['accuracy']:.2%}. "
                f"Old namespace {old_ns} unchanged."
            )

        # 5. Atomic cutover: old namespace retained for rollback
        set_active_namespace(domain, new_ns)
        schedule_cleanup(old_ns, delay_days=7)
    except Exception as e:
        notify_team(f"Migration FAILED for {domain}: {e}")
        raise

Problem: The 1–2s latency claim was observed organically in production but never validated upfront under simulated peak concurrent load. Before each major UPSC exam window (Prelims June, Mains September), the system ran untested at peak concurrency. Auto-scaling configuration was never stress-verified.
Solution: Locust load tests with realistic student behaviour profiles (weighted mix of factual, PYQ, and essay queries with think time). Run against staging with production-equivalent ECS task counts. Target: P95 latency ≤ 2.5s at 500 concurrent users. Scheduled to run 7 days before each exam cutoff date.
from locust import HttpUser, task, between
import random, uuid

FACTUAL = ["What is the 73rd Constitutional Amendment?",
           "Explain the doctrine of basic structure."]
ESSAY = ["Essay structure: climate diplomacy 250 words",
         "UPSC essay: federal governance challenges"]
PYQ = ["Previous year questions on monetary policy 2023",
       "UPSC 2022 GS2 questions on judiciary"]


class UPSCStudent(HttpUser):
    wait_time = between(2, 8)  # realistic reading + thinking time

    def on_start(self):
        self.session_id = str(uuid.uuid4())
        self.headers = {"Authorization": f"Bearer {get_test_token()}"}

    @task(4)  # 4x weight: factual is the most common query type
    def ask_factual(self):
        r = self.client.post("/chat",
                             json={"query": random.choice(FACTUAL),
                                   "session_id": self.session_id},
                             headers=self.headers)
        if r.status_code == 200:
            self.client.get(f"/stream/{r.json()['task_id']}",
                            headers=self.headers)

    @task(2)
    def ask_pyq(self):
        self.client.post("/chat",
                         json={"query": random.choice(PYQ), "session_id": self.session_id},
                         headers=self.headers)

    @task(1)
    def ask_essay(self):
        self.client.post("/chat",
                         json={"query": random.choice(ESSAY), "session_id": self.session_id},
                         headers=self.headers)

# Run: locust -f locustfile.py --headless -u 500 -r 50 --run-time 600s
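The pass criterion for the load test is P95 latency ≤ 2.5s. Locust's own statistics already report percentiles, so the sketch below is only the arithmetic behind that check, useful if raw latencies are exported and gated in a separate script. `percentile` uses the nearest-rank definition, which is one of several common conventions; `meets_target` and the sample data are hypothetical.

```python
import math
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(latencies_s: List[float], target_s: float = 2.5, pct: int = 95) -> bool:
    """True when the chosen percentile is within the latency target."""
    return percentile(latencies_s, pct) <= target_s


# Illustrative samples: 95 fast requests and 5 slow ones pass, 6 slow ones do not.
passing = [1.0] * 95 + [3.0] * 5
failing = [1.0] * 94 + [3.0] * 6
```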