ML Problem Types & Task Formulation
The single most impactful decision in any ML project happens before you open a notebook: converting a vague business question into a precise learning problem. Most production failures trace back to this stage, not to model choice.
Supervised learning requires labelled (X, y) pairs and optimises a mapping from inputs to targets. Unsupervised learning finds structure without labels — clustering, density estimation, dimensionality reduction. Self-supervised learning constructs labels from the data itself (next-token prediction, masked image modelling) enabling massive scale without manual annotation — this is how GPT and BERT are pre-trained. Reinforcement learning optimises a policy against a reward signal through environment interaction. The paradigm choice is driven by what you have: if you have ground-truth labels at scale → supervised. If labels are expensive → self-supervised pre-training then fine-tune. If the world is the reward signal → RL. If you want to understand the data structure → unsupervised.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# ── Supervised: labelled (X, y) pairs ────────────────────
X_train = np.random.randn(1000, 20)
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int) # binary label
clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train)) # training accuracy; close to 1.0 here because y is a linear function of X
# ── Unsupervised: no labels, find structure ───────────────
X_unlabelled = np.random.randn(500, 10)
km = KMeans(n_clusters=4, random_state=42, n_init=10)
km.fit(X_unlabelled)
print(km.labels_[:10]) # cluster assignments
# ── Self-supervised: construct labels from data itself ────
# Minimal masked-token example (concept)
text_tokens = [101, 2054, 2003, 103, 2023, 102] # 103 = [MASK]
# Model predicts the masked token — label comes from the data itself
# Full implementation uses HuggingFace AutoModelForMaskedLM
# ── RL: policy optimised against reward ──────────────────
# Pseudocode: agent observes state, takes action, receives reward
# import gymnasium as gym
# env = gym.make("CartPole-v1")
# state, _ = env.reset()
# for _ in range(200):
#     action = policy(state) # ε-greedy or learned policy
#     state, reward, terminated, truncated, _ = env.step(action)
#     if terminated or truncated: break
Predicting a binary "will the user click?" label trains a click estimator, not a ranker. The model can achieve high AUC while degrading NDCG: clicks at position 1 and position 10 are not equivalent, and a pointwise classification loss does not care about rank order.
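To make the AUC versus NDCG gap concrete, here is a minimal sketch using scikit-learn's `roc_auc_score` and `ndcg_score`. The five items, their click labels, and both score vectors are invented for illustration. They are constructed so that both rankings misorder the same number of (clicked, unclicked) pairs, but at different positions in the list.

```python
import numpy as np
from sklearn.metrics import ndcg_score, roc_auc_score

# Toy query with 5 items; 1 = clicked, 0 = not clicked (illustrative data).
y_true = np.array([1, 1, 0, 0, 0])

# Ranking A (by descending score): clicked, unclicked, unclicked, clicked, unclicked
scores_a = np.array([5.0, 2.0, 4.0, 3.0, 1.0])
# Ranking B (by descending score): unclicked, clicked, clicked, unclicked, unclicked
scores_b = np.array([4.0, 3.0, 5.0, 2.0, 1.0])

# AUC counts correctly ordered (clicked, unclicked) pairs, wherever they occur.
auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)

# NDCG discounts by position, so an error at the top of the list costs more.
ndcg_a = ndcg_score(y_true.reshape(1, -1), scores_a.reshape(1, -1))
ndcg_b = ndcg_score(y_true.reshape(1, -1), scores_b.reshape(1, -1))

print(f"AUC  A={auc_a:.3f}  B={auc_b:.3f}")
print(f"NDCG A={ndcg_a:.3f}  B={ndcg_b:.3f}")
```

Both rankings get the same AUC, yet ranking A scores higher on NDCG because its mistake sits lower in the list. A pointwise click classifier has no reason to prefer A over B.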
"User engagement" labels (clicks, watch-time) optimise for the proxy, not the underlying goal (satisfaction, retention). Models trained this way learn to maximise the metric, not the business outcome — leading to clickbait optimisation.
Start with the data and the decision. If you have labelled examples and a clear target → supervised. If labelling is expensive but raw data is abundant → self-supervised pre-training then fine-tune on a small labelled set. If the environment can simulate consequences of actions → RL. If you want to find structure or compress → unsupervised. In practice, most production systems combine paradigms: self-supervised for representation learning + supervised for the task head + unsupervised for anomaly detection in monitoring.
ML is the wrong tool in four situations: when the problem can be solved with deterministic rules that carry no meaningful false-positive cost; when the available data is too small to generalise from (fewer than a few hundred examples for tabular data; vision can work with fewer only when fine-tuning a pre-trained backbone); when a mistake is catastrophic and the model cannot provide uncertainty estimates; and when the system must be fully auditable under regulation and the model cannot explain its decisions. Start with a rule-based baseline and ship it if it is good enough. ML adds complexity, latency, and maintenance cost.
Self-supervised learning generates pseudo-labels automatically from the structure of the input itself (next token, masked patch, contrastive pairs) with no human labels at all. Semi-supervised learning uses a small set of human labels alongside a large unlabelled set, typically via pseudo-labelling, consistency regularisation (FixMatch combines both), or graph-based label propagation. BERT pre-training is self-supervised; fine-tuning on a 1% labelled subset while also exploiting the remaining unlabelled data is semi-supervised.
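Pseudo-labelling, the simplest of the semi-supervised methods named above, can be sketched with scikit-learn's `SelfTrainingClassifier`. The synthetic data, the 15-labels-per-class budget, and the 0.9 confidence threshold are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # ground truth, mostly hidden from the model

# Keep human labels for 15 examples per class; scikit-learn marks unlabelled rows with -1.
labelled_idx = np.concatenate([np.where(y == c)[0][:15] for c in (0, 1)])
y_partial = np.full_like(y, -1)
y_partial[labelled_idx] = y[labelled_idx]

# Self-training: fit on the labelled rows, adopt predictions whose confidence
# exceeds the threshold as pseudo-labels, refit, and repeat.
self_training = SelfTrainingClassifier(LogisticRegression(max_iter=500), threshold=0.9)
self_training.fit(X, y_partial)

n_labelled_after = int((self_training.transduction_ != -1).sum())
accuracy = self_training.score(X, y)
print(f"rows with a label after self-training: {n_labelled_after} of {len(y)}")
print(f"accuracy against the hidden ground truth: {accuracy:.3f}")
```

The hidden ground truth is available here only because the data is synthetic. On real data, evaluate on a separate labelled hold-out set.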
Classification predicts a discrete class label — binary (spam/not-spam), multiclass (10 digits), multilabel (multiple tags). Regression predicts a continuous value. Ranking orders items by relevance. Generation produces sequences, images, or structured output. The task type determines the output layer, loss function, and evaluation metric. Getting the task type wrong is a frequent source of miscalibrated models: treating ordinal regression (star ratings 1–5) as 5-class classification discards the ordinal structure, causing the model to treat "1 star vs 5 stars" the same as "2 stars vs 3 stars."
import torch
import torch.nn as nn
# ── Binary classification ─────────────────────────────────
# Head emits one raw logit per sample; BCEWithLogitsLoss applies the sigmoid internally
binary_head = nn.Linear(128, 1)
loss_binary = nn.BCEWithLogitsLoss()
logits = binary_head(torch.randn(32, 128)) # (batch, 1)
labels = torch.randint(0, 2, (32, 1)).float()
loss = loss_binary(logits, labels)
# ── Multiclass classification ─────────────────────────────
# Head emits C raw logits; CrossEntropyLoss applies log-softmax internally
mc_head = nn.Linear(128, 10) # 10 classes
loss_mc = nn.CrossEntropyLoss()
logits_mc = mc_head(torch.randn(32, 128)) # (batch, 10)
labels_mc = torch.randint(0, 10, (32,)) # (batch,) long
loss_mc_v = loss_mc(logits_mc, labels_mc)
# ── Regression ────────────────────────────────────────────
# Output: linear → scalar, loss: MSELoss or HuberLoss
reg_head = nn.Linear(128, 1)
loss_reg = nn.HuberLoss(delta=1.0) # robust to outliers
preds = reg_head(torch.randn(32, 128))
targets = torch.randn(32, 1)
loss_r = loss_reg(preds, targets)
# ── Ordinal regression (star ratings 1–5) ─────────────────
# Treat as 4 cumulative binary thresholds, not 5 classes
# P(y >= k) for k in {2,3,4,5} via 4 sigmoid outputs
ordinal_head = nn.Linear(128, 4)
# label encoding: rating 3 → [1, 1, 0, 0] (passes thresholds 2, 3, not 4, 5)
# ── Multilabel (tags: can have many) ─────────────────────
# Output: sigmoid per class independently
ml_head = nn.Linear(128, 20) # 20 possible tags
loss_ml = nn.BCEWithLogitsLoss()
targets_ml = torch.randint(0, 2, (32, 20)).float()
loss_ml_v = loss_ml(ml_head(torch.randn(32, 128)), targets_ml)
When star ratings are trained as 5-class classification, cross-entropy penalises predicting 1 star for a true 5-star review exactly as much as predicting 4 stars, because the loss only looks at the probability assigned to the true class. The model has no incentive to learn order. This causes poor calibration on boundary cases (1-2 star confusion, 4-5 star confusion), which are the most costly in production.
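The cumulative-threshold alternative for star ratings can be sketched end to end as follows. The helper names `encode_ordinal` and `decode_ordinal` are hypothetical, and the features are random stand-ins for a real encoder output. This simple version trains the four thresholds independently and does not enforce that P(y >= k) decreases with k, which dedicated ordinal methods do.

```python
import torch
import torch.nn as nn


def encode_ordinal(ratings: torch.Tensor, num_classes: int = 5) -> torch.Tensor:
    """Map ratings in 1..num_classes to num_classes - 1 cumulative binary targets."""
    thresholds = torch.arange(2, num_classes + 1)        # [2, 3, 4, 5]
    return (ratings.unsqueeze(1) >= thresholds).float()  # (batch, num_classes - 1)


def decode_ordinal(logits: torch.Tensor) -> torch.Tensor:
    """Predicted rating = 1 + number of thresholds with P(y >= k) above 0.5."""
    return 1 + (torch.sigmoid(logits) > 0.5).sum(dim=1)


torch.manual_seed(0)
ordinal_head = nn.Linear(128, 4)         # one logit per threshold k in {2, 3, 4, 5}
features = torch.randn(32, 128)          # placeholder for encoder output
ratings = torch.randint(1, 6, (32,))     # integer ratings 1..5

targets = encode_ordinal(ratings)        # rating 3 -> [1, 1, 0, 0]
logits = ordinal_head(features)
loss = nn.BCEWithLogitsLoss()(logits, targets)
predicted = decode_ordinal(logits)       # integer ratings 1..5
print(loss.item(), predicted[:8].tolist())
```

Under this encoding, predicting 1 star for a true 5-star review gets all four threshold targets wrong, while predicting 4 stars gets only one wrong, so the loss reflects the distance between ratings.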
Using sigmoid activations for mutually-exclusive classes allows all class probabilities to be high simultaneously. Softmax enforces they sum to 1. Probabilities from a sigmoid multiclass model are not comparable across classes.
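A quick numerical check of the difference, using made-up logits for a single sample with three mutually exclusive classes:

```python
import torch

# One sample, three mutually exclusive classes (illustrative logits).
logits = torch.tensor([[2.0, 1.5, 1.8]])

softmax_probs = torch.softmax(logits, dim=1)  # classes compete; row sums to 1
sigmoid_probs = torch.sigmoid(logits)         # each class scored independently

print("softmax:", softmax_probs.tolist(), "sum =", softmax_probs.sum().item())
print("sigmoid:", sigmoid_probs.tolist(), "sum =", sigmoid_probs.sum().item())
```

With sigmoid, every class here receives a score above 0.8, so the row cannot be read as a distribution over exclusive classes.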
Churn is naturally a binary classification problem (churned / not churned) if you have a clear event definition. However, if the business needs a probability score to decide intervention priority, calibrated probability from logistic regression or gradient boosting is more useful than a hard label. Consider regression when the target is a continuous time-to-event (days until churn) — this gives you more granularity for intervention scheduling. The business decision drives the formulation.
Multiclass: each sample belongs to exactly one of K classes (mutually exclusive). Multilabel: each sample can belong to zero or more classes simultaneously (e.g., a document tagged with both "science" and "politics"). Multiclass uses softmax + cross-entropy. Multilabel uses sigmoid per class + BCEWithLogitsLoss. scikit-learn's MultiLabelBinarizer converts tag lists to binary matrices. Evaluation differs: for multilabel use per-label F1, macro-F1, or Hamming loss.
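The multilabel workflow can be sketched with scikit-learn as follows. The three documents and their predicted tags are invented for illustration.

```python
from sklearn.metrics import f1_score, hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative data: true and predicted tag lists for three documents.
tags_true = [["science", "politics"], ["sports"], []]
tags_pred = [["politics"], ["sports"], ["science"]]

mlb = MultiLabelBinarizer()
Y_true = mlb.fit_transform(tags_true)  # columns follow mlb.classes_ (sorted)
Y_pred = mlb.transform(tags_pred)

# Hamming loss: fraction of individual (document, tag) cells that are wrong.
hamming = hamming_loss(Y_true, Y_pred)
# Macro-F1: F1 computed per tag, then averaged with equal weight per tag.
macro_f1 = f1_score(Y_true, Y_pred, average="macro", zero_division=0)

print(list(mlb.classes_))
print(Y_true.tolist())
print(f"Hamming loss={hamming:.3f}, macro-F1={macro_f1:.3f}")
```

Two of the nine cells are wrong, so the Hamming loss is 2/9. Macro-F1 is 2/3 because the "science" tag is never predicted correctly while the other two tags are perfect.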
The prediction-to-decision gap is the distance between what the model outputs and the decision the business needs. A fraud detection model might achieve 99% accuracy but only block 40% of fraudulent transactions because its decision threshold ignores the asymmetric cost structure. The gap has three sources: (1) Wrong metric: optimising accuracy on a 1% fraud dataset means always predicting "not fraud" wins. (2) Uncalibrated probabilities: a model that outputs 0.7 when the true probability is 0.3 leads to wrong threshold decisions. (3) Missing cost matrix: the costs of a false positive (blocking a legitimate transaction) and a false negative (allowing fraud) are very different, yet accuracy treats them equally.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import confusion_matrix
import matplotlib
matplotlib.use('Agg') # non-interactive
# ── Cost-sensitive threshold selection ────────────────────
# Cost matrix: FP=1 (blocked legit txn), FN=50 (fraud loss)
cost_fp, cost_fn = 1, 50
y_true = np.array([0]*990 + [1]*10) # 1% fraud
y_prob = np.random.beta(1, 9, 1000) # scores in [0,1]
y_prob[990:] = np.random.beta(5, 2, 10) # fraud scores higher
# Sweep thresholds and pick the lowest expected cost
thresholds = np.linspace(0.01, 0.99, 99)
costs = []
for t in thresholds:
    y_pred = (y_prob >= t).astype(int)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    expected_cost = fp * cost_fp + fn * cost_fn
    costs.append(expected_cost)
best_t = thresholds[np.argmin(costs)]
print(f"Optimal threshold: {best_t:.2f}")
# ── Calibration check ─────────────────────────────────────
# Perfectly calibrated: P(y=1 | score=0.7) should be 0.7
fraction_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
# If mean_pred >> fraction_pos → overconfident
# Use isotonic regression or Platt scaling to fix
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50, random_state=42)
cal_clf = CalibratedClassifierCV(clf, cv=3, method='isotonic')
On a 1% fraud dataset, a model that always predicts "not fraud" achieves 99% accuracy. That number is meaningless because the model never flags any actual fraud. It passes code review but fails in production.
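The accuracy trap is easy to reproduce. This sketch scores an "always not fraud" predictor on a synthetic 1% fraud label vector, using accuracy and recall from scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 990 + [1] * 10)  # 1% fraud
y_pred = np.zeros_like(y_true)           # always predict "not fraud"

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)    # share of fraud cases actually caught

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.99, recall=0.00
```

Reporting recall (or precision-recall AUC) next to accuracy makes this failure visible immediately.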
Random forests and gradient boosting usually rank examples well but often produce poorly calibrated probabilities. Using a 0.5 threshold on an uncalibrated model is arbitrary: the model may output 0.8 when the true risk is only 0.3.
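One way to check whether calibration helps on a given dataset is to compare Brier scores on held-out data before and after wrapping the model. This is a minimal sketch on synthetic data from `make_classification`; the dataset size and class weights are assumptions, and which variant scores better depends on the data, so no outcome is claimed here.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (roughly 10% positives), for illustration only.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Uncalibrated forest.
raw = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Same forest wrapped in cross-validated isotonic calibration.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=42), cv=3, method="isotonic"
).fit(X_train, y_train)

prob_raw = raw.predict_proba(X_test)[:, 1]
prob_cal = calibrated.predict_proba(X_test)[:, 1]

# Brier score: mean squared error between predicted probability and outcome (lower is better).
brier_raw = brier_score_loss(y_test, prob_raw)
brier_cal = brier_score_loss(y_test, prob_cal)
print(f"Brier raw={brier_raw:.4f}, calibrated={brier_cal:.4f}")
```

Isotonic regression needs a reasonable number of positives per fold; with very few positives, `method="sigmoid"` (Platt scaling) is the more conservative choice.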
Define the cost matrix first: what does a false positive cost compared with a false negative? Then sweep thresholds on a held-out validation set, compute expected cost at each threshold, and select the minimum. For equal costs and calibrated probabilities, 0.5 is optimal. For fraud (FN >> FP), a lower threshold (higher recall) is typically correct. Document the threshold and retune it when the business cost structure changes; the model does not need retraining just because the operating point shifted.
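When the probabilities are calibrated and correct decisions cost nothing (both are assumptions), the cost-minimising threshold has a closed form that serves as a sanity check on the empirical sweep. Flagging a case with fraud probability p has expected cost (1 - p) * cost_fp, and not flagging it has expected cost p * cost_fn, so flagging is cheaper whenever p > cost_fp / (cost_fp + cost_fn). The function name below is hypothetical.

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Threshold on a calibrated probability that minimises expected cost.

    Assumes true positives and true negatives cost nothing.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


equal_costs = cost_optimal_threshold(1, 1)    # 0.5
fraud_costs = cost_optimal_threshold(1, 50)   # 1/51, roughly 0.02
print(f"equal costs: {equal_costs:.2f}, FP=1 / FN=50: {fraud_costs:.3f}")
```

If the threshold found by sweeping differs sharply from this value, that is a hint that the scores are not calibrated or that the validation set is too small.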
A better model rarely fixes a badly framed problem. Write the decision first, then the prediction, then choose the model family. Reversing that order causes expensive pivots months later.