Accuracy Is a Lie: What the Confusion Matrix Actually Tells You

classification confusion-matrix precision-recall f1-score evaluation-metrics

If your classifier says it's 95% accurate, be suspicious. Accuracy is the most reported metric and often the most useless one, especially when classes are imbalanced. What you actually need is the confusion matrix, and once you understand it, you stop trusting any single number.

The confusion matrix breaks predictions into four buckets. True Positive (TP): predicted positive, actually positive. False Positive (FP): predicted positive, actually negative. True Negative (TN): predicted negative, actually negative. False Negative (FN): predicted negative, actually positive.
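To make the four buckets concrete, here is a minimal sketch that counts them by hand. The function name `confusion_counts` and the toy labels are made up for illustration, and it assumes binary labels where 1 means positive and 0 means negative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (assumed: 1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted positive, actually positive
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted positive, actually negative
        elif predicted == 0 and actual == 0:
            tn += 1  # predicted negative, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, fp, tn, fn


# Toy data, invented for this example.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # TP=3 FP=1 TN=3 FN=1
```

Every prediction lands in exactly one bucket, so the four counts always sum to the number of examples.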

Here's why accuracy fails: imagine a cancer detection model. Say 95% of patients are healthy. A model that always predicts "no cancer" gets 95% accuracy without learning anything. But it misses every actual cancer case: every sick patient becomes a false negative, so the false negative rate is 100%.

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Correct negatives inflate the score when negatives dominate. The model looks good. It is not good.
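The arithmetic behind that claim is easy to check. This is a sketch with a hypothetical cohort of 1000 patients, 50 of them sick; the numbers are invented to match the 95% healthy split and are not real data.

```python
# Hypothetical cohort: 50 patients with cancer (label 1), 950 healthy (label 0).
y_true = [1] * 50 + [0] * 950
# The "model" that always predicts healthy.
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
tn = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
# Share of actual cancer cases the model missed.
false_negative_rate = fn / (tp + fn)

print(f"accuracy            = {accuracy:.2f}")             # 0.95
print(f"false negative rate = {false_negative_rate:.2f}")  # 1.00
```

All of the 95% comes from TN. TP contributes nothing, which is exactly the part that matters here.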

Precision measures: of everything I flagged positive, how many actually were?

$\text{Precision} = \frac{TP}{TP + FP}$

Recall measures: of all actual positives, how many did I catch?

$\text{Recall} = \frac{TP}{TP + FN}$
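Both formulas translate directly into code. A small sketch, with made-up counts; returning 0.0 when the denominator is zero is a convention I picked for the example, not a universal rule.

```python
def precision(tp, fp):
    """Of everything flagged positive, the fraction that really was positive."""
    flagged = tp + fp
    # Assumption: report 0.0 when nothing was flagged, instead of dividing by zero.
    return tp / flagged if flagged else 0.0


def recall(tp, fn):
    """Of all actual positives, the fraction that was caught."""
    actual_positives = tp + fn
    # Assumption: report 0.0 when there are no actual positives.
    return tp / actual_positives if actual_positives else 0.0


# Invented counts: 10 cases flagged, 8 of them correct, 12 real positives missed.
tp, fp, fn = 8, 2, 12
p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision = {p:.2f}")  # 0.80
print(f"recall    = {r:.2f}")  # 0.40
```

Note that TN appears in neither formula, which is why these two metrics do not get inflated by a large negative class.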

For cancer detection, prioritize recall: catch every case even at the cost of false alarms. For spam filters, prioritize precision: don't kill real emails even if some spam slips through. These goals pull in opposite directions. You usually raise recall by lowering the decision threshold, which flags more cases and lets in more false positives, so precision tends to drop.
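The tug-of-war shows up as soon as you move the decision threshold. A sketch with invented scores and labels, assuming the model outputs a score in [0, 1] and predicts positive when the score is at or above the threshold:

```python
# Invented model scores and true labels (1 = positive), for illustration only.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


results = []
for threshold in (0.8, 0.5, 0.25):
    p, r = precision_recall_at(threshold)
    results.append((threshold, p, r))
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.8 to 0.25 takes recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67. Real models are noisier, but the direction of the trade is typically the same.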

The F1 score is the harmonic mean of precision and recall:

$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

What clicked

The harmonic mean gives 0.18 for P=0.9, R=0.1, which is honest about how catastrophically bad a 0.1 on either side actually is. The arithmetic mean gives 0.5, which sounds okay. It's not.
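A quick check of those two numbers, as a sketch. The function names are mine, and the zero-denominator fallback is an assumption made for the example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    # Assumption: define F1 as 0.0 when both inputs are zero.
    return 2 * precision * recall / total if total else 0.0


def arithmetic_mean(precision, recall):
    return (precision + recall) / 2


p, r = 0.9, 0.1
harmonic = f1_score(p, r)
arithmetic = arithmetic_mean(p, r)
print(f"F1 (harmonic mean) = {harmonic:.2f}")    # 0.18
print(f"arithmetic mean    = {arithmetic:.2f}")  # 0.50
```

The harmonic mean gets dragged toward the smaller value, so a model cannot hide a terrible recall behind a great precision, or the other way around.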

Still shaky on

ROC curves can be misleading when class imbalance is severe, and precision-recall curves are recommended instead. I don't fully understand why yet. That's the next gap.

What's next

To compare models across all decision thresholds, the ROC curve plots True Positive Rate (recall) against False Positive Rate, FP / (FP + TN), and AUC is the area under that curve. AUC = 1 is perfect; AUC = 0.5 is no better than random guessing.
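As a first pass at that, here is a hand-rolled sketch that builds ROC points by sweeping the threshold over every distinct score and then integrates with the trapezoid rule. The helper names and the toy data are invented, and it assumes both classes are present in the labels. In practice a library routine is the safer choice.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low.

    Assumes binary labels (1 = positive) with at least one of each class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented scores and labels, for illustration only.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]

toy_auc = auc(roc_points(labels, scores))
perfect_auc = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
constant_auc = auc(roc_points([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))

print(f"toy data       AUC = {toy_auc:.4f}")       # 0.8125
print(f"perfect split  AUC = {perfect_auc:.4f}")   # 1.0000
print(f"constant score AUC = {constant_auc:.4f}")  # 0.5000
```

The constant-score case is a useful sanity check: a model that gives every example the same score carries no ranking information, and it lands on 0.5.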