Entry 10 of 11
ML Fundamentals Series

MSE Penalizes Big Mistakes Harder: R² Tells You If Your Model Even Learned Anything

You've trained a regression model. It outputs numbers. How do you know if those numbers are any good? Two metrics cover this from different angles: MSE measures the size of your errors, and $R^2$ measures whether your model learned anything at all.

Mean Squared Error: for each prediction, take the difference between the actual and predicted value, square it, then average across all predictions:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The squaring does two things: positive and negative errors can't cancel each other out, and large errors get penalized disproportionately (an error of 10 contributes 100; an error of 1 contributes 1). This makes MSE sensitive to outliers. RMSE (the square root of MSE) has the same sensitivity, but its units match your original variable, which makes it easier to interpret.
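To make the outlier sensitivity concrete, here is a minimal sketch in plain Python. The function names `mse` and `rmse` and the numbers are made up for illustration. Both prediction sets are compared against the same actual values: one is off by 1 everywhere, the other is exact three times and off by 10 once.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target variable."""
    return math.sqrt(mse(y_true, y_pred))


# Illustrative data only (not from any real model).
actual = [10.0, 20.0, 30.0, 40.0]
small_errors = [11.0, 19.0, 31.0, 39.0]   # every prediction off by 1
one_big_error = [10.0, 20.0, 30.0, 50.0]  # three exact, one off by 10

print(mse(actual, small_errors))    # 1.0
print(mse(actual, one_big_error))   # 25.0
print(rmse(actual, one_big_error))  # 5.0
```

The average absolute error only grows from 1 to 2.5 between the two sets, but MSE grows from 1 to 25, because the single miss of 10 contributes 100 on its own.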

$R^2$ (the coefficient of determination) measures goodness of fit. It is at most 1 and usually lands between 0 and 1:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{total}}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

$R^2 = 1$ means a perfect fit. $R^2 = 0$ means the model does no better than always predicting the mean. $R^2 < 0$ is possible; it means the model is actively worse than the mean baseline.
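The three cases can be checked directly from the formula. This is a small sketch in plain Python; `r_squared` is a hypothetical helper name and the data is invented so the arithmetic is easy to follow by hand.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_total = sum((a - mean_y) ** 2 for a in y_true)
    if ss_total == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_total


# Illustrative data only: the mean of `actual` is 3.0, SS_total is 10.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = [1.0, 2.0, 3.0, 4.0, 5.0]      # matches every value
mean_only = [3.0, 3.0, 3.0, 3.0, 3.0]    # always predicts the mean
backwards = [5.0, 4.0, 3.0, 2.0, 1.0]    # trend is reversed

print(r_squared(actual, perfect))    # 1.0
print(r_squared(actual, mean_only))  # 0.0
print(r_squared(actual, backwards))  # -3.0
```

The reversed predictions have a residual sum of squares of 40 against a total of 10, which is how the score ends up well below zero.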

What clicked

$R^2$ is a relative measure: it tells you how much better your model is than the laziest possible baseline (always predict the mean). MSE tells you the absolute magnitude of errors. You need both.

Still shaky on

On the training data, $R^2$ for a least-squares fit never decreases when you add more features, even useless ones. Add 10 random noise columns and $R^2$ will almost always creep up. That's why Adjusted $R^2$ exists: it penalizes the number of features, so a new feature has to improve the fit enough to offset that penalty. I haven't gone deep on this yet.
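For reference, the usual textbook definition of the adjusted score, with n samples and p features, is 1 - (1 - R²)(n - 1)/(n - p - 1). Below is a minimal sketch of that formula in plain Python. The helper name `adjusted_r2` and the scenario numbers are assumptions for illustration, not results from a fitted model.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features plus one")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Hypothetical scenario: 50 samples, 2 real features, plain R^2 of 0.80.
# Adding 10 noise columns nudges plain R^2 up to 0.81.
before = adjusted_r2(0.80, n_samples=50, n_features=2)
after = adjusted_r2(0.81, n_samples=50, n_features=12)

print(round(before, 4))  # 0.7915
print(round(after, 4))   # 0.7484
```

In this made-up case the plain score rises by 0.01 while the adjusted score drops, because the small gain does not cover the cost of ten extra features.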

What's next

The equivalent story for classification: the confusion matrix, and why accuracy alone will mislead you.