Supervised ML · Chapter 5 · 25 min read

Model Diagnostics: Is Your Regression Actually Valid?

Prerequisites This page builds directly on OLS Regression — you should understand residuals $\hat{u}_i$, RSS, and $R^2$ before reading this. The model selection section also benefits from familiarity with Log-Linear Regression.

You've run your OLS regression, got your coefficients, and your $R^2$ looks decent. But does your model actually satisfy the assumptions it relies on? If the errors aren't normally distributed, your p-values are wrong. If the variance of the errors changes with $X$, your standard errors are wrong. And if you keep adding variables to chase a higher $R^2$, you're building a model that only works on the data you already have.

This page teaches you how to diagnose those problems — and how to compare models honestly.

Part 1: Checking Model Assumptions

All OLS inference (t-tests, F-tests, confidence intervals) assumes that the model's errors follow certain patterns. We can't observe the true errors $u_i$, but we can look at the OLS residuals $\hat{u}_i$ as stand-ins. The key idea: if the model is correct, the residuals should look like draws from a normal distribution centred at zero with constant spread.

Assumption 1: Normally Distributed Errors

If normality holds, 95% of your residuals should fall within $\pm 2\hat{\sigma}$ of zero. Any residual larger than $2\hat{\sigma}$ in absolute value is a potential outlier, and too many of those signal a problem.

Why it matters: OLS t-tests and F-tests assume errors are normal. If they're not, your p-values are only asymptotically valid (i.e., roughly correct in large samples, but unreliable in small ones). Log-transforming the dependent variable often fixes this — see the Log-Linear Regression page.

Visual Check 1: Histogram of Residuals

The fastest way to get a first impression. Plot a histogram of $\hat{u}_i$. If normality holds, it should look bell-shaped — one peak in the middle, tapering off symmetrically on both sides, with no heavy tails or extreme outliers dragging one side out.

Two things to watch for: skewness (the bump is shifted left or right, with a long tail on one side) and bimodality (two bumps, which suggests two different data-generating processes got mixed together).
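As a concrete sketch, here is how you might pull the residuals out of a fitted model and histogram them in Python. The simulated data, the statsmodels OLS fit, and the plotting choices are illustrative assumptions, not part of this page's example.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data purely for illustration: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)

# Fit OLS and extract the residuals u_hat.
model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# Histogram of the residuals: look for one symmetric, bell-shaped bump around 0.
plt.hist(residuals, bins=20, edgecolor="black")
plt.axvline(0, color="red", linestyle="--")
plt.xlabel("OLS residual")
plt.ylabel("Frequency")
plt.show()
```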

Interactive — Residual Histogram

A symmetric, single-peaked shape indicates normality. Use the sliders to introduce skewness or outliers and see how the distribution deforms.

Visual Check 2: QQ-Plot (Quantile-Quantile Plot)

A histogram is subjective — it depends on the bin width. A QQ-plot is more precise. It plots the actual quantiles of your residuals on the y-axis against the theoretical quantiles of a perfect normal distribution on the x-axis.

For standardised residuals, the reference line is the 45° diagonal, because a perfect normal distribution would mean: actual quantile = theoretical quantile at every point. Deviations from this line reveal the type of non-normality you're dealing with.

Reading a QQ-plot: Points hugging the diagonal line → normality is fine. Points dipping below the line at the left end and rising above it at the right (a stretched S) → heavy tails (leptokurtic). The opposite, flattened S → light tails (platykurtic). Points rising above the line only on the right → positive skew.
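A minimal sketch of the same check in Python, using scipy's probplot (theoretical normal quantiles on the x-axis, ordered residuals on the y-axis, with a fitted reference line); the simulated residuals are just a stand-in for your own model's residuals.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Stand-in residuals; in practice use the residuals from your fitted OLS model.
rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, size=200)

# QQ-plot: theoretical normal quantiles (x) vs ordered residuals (y).
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```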

Formalising It: Skewness and Kurtosis

Histograms and QQ-plots are visual — they give you a feel, not a number. The skewness coefficient $m_3$ and kurtosis coefficient $m_4$ turn those visual impressions into statistics you can compute and compare.

Skewness Coefficient $m_3$

Skewness measures whether one tail of the residual distribution is heavier than the other. A perfectly symmetric distribution has $m_3 = 0$.

$$m_3 = \frac{\frac{1}{n}\sum_{i=1}^{n} \hat{u}_i^3}{\left(\frac{RSS}{n}\right)^{3/2}}$$
$n$ = number of observations
$\hat{u}_i$ = OLS residual for observation $i$
$RSS = \sum_{i=1}^n \hat{u}_i^2$ = Residual Sum of Squares
$\left(\frac{RSS}{n}\right)^{3/2}$ = the estimated standard deviation of residuals, cubed (used to make $m_3$ scale-free)

Interpreting the sign: a positive $m_3$ means the distribution has a long right tail (more extreme positive residuals). A negative $m_3$ means a long left tail. A value of zero means perfect symmetry.

Kurtosis Coefficient $m_4$

Kurtosis measures the "peakedness" of a distribution — how much probability is concentrated in the tails versus the centre. The normal distribution has $m_4 = 3$. Think of that as the reference.

$$m_4 = \frac{\frac{1}{n}\sum_{i=1}^{n} \hat{u}_i^4}{\left(\frac{RSS}{n}\right)^{2}}$$
$\left(\frac{RSS}{n}\right)^{2}$ = the estimated variance of residuals, squared (makes $m_4$ scale-free)

If $m_4 > 3$: the distribution has heavier tails than normal (leptokurtic — more extreme outliers than expected). If $m_4 < 3$: lighter tails, fewer extremes (platykurtic). For your OLS residuals to be consistent with normality, $m_4$ should be close to 3.
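As a sketch, both coefficients translate directly into a few lines of Python; the only input assumed here is an array of OLS residuals.

```python
import numpy as np

def skewness_kurtosis(residuals):
    """m3 and m4 as defined above, computed from an array of OLS residuals."""
    n = len(residuals)
    sigma2 = np.sum(residuals**2) / n           # RSS / n
    m3 = np.mean(residuals**3) / sigma2**1.5    # 0 under normality
    m4 = np.mean(residuals**4) / sigma2**2      # 3 under normality
    return m3, m4

# Quick check on simulated normal residuals: m3 near 0, m4 near 3.
rng = np.random.default_rng(2)
m3, m4 = skewness_kurtosis(rng.normal(0, 1, size=500))
print(f"m3 = {m3:.3f}, m4 = {m4:.3f}")
```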

Interactive — Skewness and Kurtosis Explorer

The red dashed line is a reference normal curve ($m_3{=}0, m_4{=}3$). Slide the values to see how skewness shifts the tail and how kurtosis changes peak height and tail heaviness.

The Jarque-Bera Test: A Formal Normality Test

Instead of eyeballing a histogram, the Jarque-Bera test gives you an actual hypothesis test for normality.

The null hypothesis is: the errors follow a normal distribution. The idea is that under normality, both $m_3$ should be close to 0 and $m_4$ should be close to 3. The Jarque-Bera statistic combines both into a single number:

$$J = n \left(\frac{m_3^2}{6} + \frac{(m_4 - 3)^2}{24}\right)$$
$J$ = Jarque-Bera statistic
$n$ = number of observations
$m_3$ = skewness coefficient
$m_4$ = kurtosis coefficient
Under $H_0$ (normality), $J$ follows a $\chi^2(2)$ distribution — chi-squared with 2 degrees of freedom

The critical value of $\chi^2(2)$ at the 5% level is 5.991. If $J > 5.991$, you reject $H_0$ — meaning the residuals are significantly non-normal.

Reject H₀ when p < 0.05 ↔ J > 5.991. This means: at a 5% significance level, the evidence is strong enough to conclude the errors are not normally distributed. Possible remedies: log-transform the dependent variable, remove influential outliers, or use a different model.
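A minimal sketch of the test, built on the same quantities (scipy also provides scipy.stats.jarque_bera, which computes essentially the same statistic):

```python
import numpy as np
from scipy import stats

def jarque_bera(residuals):
    """Jarque-Bera statistic and p-value, following the formula above."""
    n = len(residuals)
    sigma2 = np.sum(residuals**2) / n
    m3 = np.mean(residuals**3) / sigma2**1.5
    m4 = np.mean(residuals**4) / sigma2**2
    J = n * (m3**2 / 6 + (m4 - 3)**2 / 24)
    p_value = stats.chi2.sf(J, df=2)     # upper tail of chi-squared(2)
    return J, p_value

rng = np.random.default_rng(3)
J, p = jarque_bera(rng.normal(0, 1, size=500))
print(f"J = {J:.2f}, p = {p:.3f}  (reject normality if J > 5.991, i.e. p < 0.05)")
```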

Assumption 2: Homoskedasticity

OLS assumes the variance of the errors is constant across all observations: $\text{Var}(u_i | X_i) = \sigma^2$ for every $i$. This is called homoskedasticity.

The opposite — where variance changes with $X$ or with the size of $Y$ — is called heteroskedasticity. It doesn't bias your coefficient estimates, but it makes your standard errors wrong, which means your t-statistics and p-values are unreliable.

Heteroskedasticity is common whenever you're dealing with data that spans a large range: wealthier households have more variable spending than poorer ones; larger firms have more variable profits; cities with bigger populations have more variable crime rates.

The quickest diagnostic: plot the residuals $\hat{u}_i$ against the fitted values $\hat{Y}_i$ (or against $X$). If homoskedasticity holds, the spread of the dots should look roughly the same across the whole range — a horizontal band of constant width. A funnel shape (spreading out to the right) is the classic sign of heteroskedasticity.
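A rough sketch of that plot, on simulated data where the error spread is deliberately made to grow with $X$ (so a funnel should appear); swap in your own fitted model in practice.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated heteroskedastic data: the error standard deviation grows with x.
rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=300)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)

model = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs fitted values: a widening funnel signals heteroskedasticity.
plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```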

Interactive — Residuals Plot: Homoskedastic vs Heteroskedastic

At 0, the spread of residuals is constant (homoskedastic). Increase the slider to see the classic funnel pattern of heteroskedasticity — variance growing with the fitted value.

Assumption 3: Zero-Mean Errors — $E(u\,|\,X) = 0$

This assumption says: on average, the model's errors are zero, regardless of the value of $X$. If this is violated, your model has a specification error — it's systematically wrong in a predictable way.

What does that mean concretely? If $E(u|X) > 0$, the true value of $Y$ will always be higher than what your model predicts — you're consistently underestimating. If $E(u|X) < 0$, you're consistently overestimating.

This assumption is typically violated in two situations. First, you've left out an important predictor that belongs in the model (omitted variable bias). Second, you've used the wrong functional form — for example, the true relationship is quadratic but you've modelled it as linear, so your residuals will systematically curve upward then downward across the range of $X$.

To check: plot the residuals and draw a horizontal line at their mean. If the mean line runs close to zero across the full range of $X$, the assumption is roughly satisfied. If the residuals have a curve or a trend pattern — clustering positive on one side and negative on the other — you have a specification problem.
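One way to sketch that check numerically is to average the residuals within bins of $X$; the bin count (10 here) and the deliberately misspecified linear fit to a quadratic relationship are illustrative choices.

```python
import numpy as np

def binned_residual_means(x, residuals, n_bins=10):
    """Average residual within equal-width bins of x; values far from zero
    in some bins point to a misspecified functional form."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bin_index = np.digitize(x, edges[1:-1])
    return np.array([residuals[bin_index == b].mean() for b in range(n_bins)])

# A quadratic relationship fitted with a straight line: the binned residual
# means trace out a systematic U-shape instead of hovering around zero.
rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=500)
y = 1.0 + x**2 + rng.normal(0, 1, size=500)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)
print(np.round(binned_residual_means(x, residuals), 2))
```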

Part 2: Choosing the Best Model

Once your model passes the diagnostic checks, a new question arises: is it the right model, or would adding (or removing) a variable give you something better?

Why $R^2$ Alone Will Mislead You: Overfitting

Here's an uncomfortable fact: $R^2$ can only increase (or stay the same) when you add a variable to a regression. It is mathematically impossible for $R^2$ to decrease, no matter how useless the new variable is — at worst, its coefficient is estimated as exactly zero, contributing nothing.

This creates the overfitting trap. If you keep adding variables to maximise $R^2$, you'll end up with a model that fits your specific dataset perfectly but fails badly on any new data. The model has memorised the noise in your sample rather than the underlying pattern.

The overfitting analogy: Imagine drawing a curve through 10 data points. A degree-9 polynomial will pass through all 10 exactly — $R^2 = 1$. But for point 11, it'll be wildly wrong. A simpler line might miss some points but generalises far better. More parameters = better fit on known data, worse prediction on new data.
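The analogy is easy to reproduce. A rough sketch, with arbitrary simulated points and a held-out 11th observation (exact numbers will vary with the random seed):

```python
import numpy as np

# The truth is a noisy straight line; we observe 11 points and hold one out.
rng = np.random.default_rng(6)
x = np.linspace(0, 1, 11)
y = 2.0 + 3.0 * x + rng.normal(0, 0.3, size=11)
x_train, y_train = x[:10], y[:10]
x_new, y_new = x[10], y[10]

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    rss = np.sum((y_train - np.polyval(coeffs, x_train)) ** 2)   # in-sample fit
    err_new = abs(y_new - np.polyval(coeffs, x_new))             # new-point error
    print(f"degree {degree}: in-sample RSS = {rss:.4f}, error at point 11 = {err_new:.2f}")
```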

$R^2$ is useful only at the extremes: when it's close to 1, the model explains nearly all variation; when it's close to 0, the predictors are barely better than nothing. Anywhere in between, you need better tools.

Adjusted $R^2$: Penalising Complexity

The adjusted $R^2$ (written $\bar{R}^2$) fixes the overfitting problem by penalising each additional variable you add. Unlike plain $R^2$, it can actually decrease when you add a variable that doesn't explain enough new variation to justify its presence.

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - K - 1}$$
$R^2$ = standard coefficient of determination
$n$ = number of observations
$K$ = number of predictor variables (not counting the intercept)
$(n-1)$ = total degrees of freedom (also in the denominator of the total variance)
$(n-K-1)$ = residual degrees of freedom — this is the penalty: the more variables $K$, the smaller this is, which increases the subtracted term

The logic: adding variable $K+1$ increases $R^2$, but it also increases $K$, shrinking $(n-K-1)$ and making the penalty term larger. $\bar{R}^2$ rises only if the new variable increases $R^2$ enough to overcome that growing penalty. If the variable is useless, $\bar{R}^2$ actually falls. Always aim to maximise $\bar{R}^2$, not $R^2$.
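The formula is a one-liner in code; the numbers below are made up purely to show a case where $R^2$ creeps up while $\bar{R}^2$ falls.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding a near-useless fourth predictor: R-squared rises slightly ...
print(adjusted_r2(0.600, n=40, k=3))   # ~0.567
# ... but adjusted R-squared falls, so the extra variable isn't worth it.
print(adjusted_r2(0.605, n=40, k=4))   # ~0.560
```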

Interactive — Adjusted R² vs R²

n = 50 is fixed. Watch how adjusted R² can be lower than R² and how it responds when you add variables without improving fit. The gap widens as K grows relative to n.

Information Criteria: AIC and BIC

Information criteria provide another way to compare models, and they work across a wider range of situations than adjusted $R^2$. The two most common are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Both penalise model complexity but in slightly different ways.

$$\text{AIC} = n \cdot \ln\!\left(\frac{RSS}{n}\right) + 2(K+1)$$ $$\text{BIC} = n \cdot \ln\!\left(\frac{RSS}{n}\right) + (K+1)\ln(n)$$
$n \cdot \ln(RSS/n)$ = a measure of how poorly the model fits (higher RSS = worse fit = higher value)
$K+1$ = number of estimated parameters ($K$ slopes + 1 intercept)
$2(K+1)$ = AIC's complexity penalty — grows linearly with parameters
$(K+1)\ln(n)$ = BIC's complexity penalty — grows with $\ln(n)$, so it penalises complexity more in large samples
$\ln(\cdot)$ = natural logarithm

The rule is simple: lower is better. When comparing two models, pick the one with the smaller AIC (or BIC). The values can be negative — that's fine. A more negative value is still better; it means the model fits well without needing many parameters.

AIC and BIC often agree, but when they conflict, AIC tends to favour slightly more complex models while BIC favours simpler ones (because its penalty grows with $\ln(n)$, which is larger than 2 for $n \geq 8$).

AIC vs BIC in practice: Use AIC when your primary goal is prediction accuracy. Use BIC when you want to identify the true underlying model (BIC is consistent — it will identify the correct model given enough data, AIC will not).
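Both criteria are straightforward to compute from the regression output. A minimal sketch, with made-up RSS values chosen so that the two criteria disagree:

```python
import numpy as np

def aic(rss, n, k):
    """AIC for an OLS model with k predictors plus an intercept."""
    return n * np.log(rss / n) + 2 * (k + 1)

def bic(rss, n, k):
    """BIC for the same model; the penalty grows with ln(n)."""
    return n * np.log(rss / n) + (k + 1) * np.log(n)

# Two hypothetical models on the same n = 80 observations.
print(aic(120.0, n=80, k=4), bic(120.0, n=80, k=4))   # smaller model
print(aic(110.0, n=80, k=7), bic(110.0, n=80, k=7))   # larger model
# Here AIC prefers the larger model, BIC the smaller one.
```

Note that statistical packages often report AIC and BIC computed from the full Gaussian log-likelihood rather than this simplified RSS form; for OLS the two differ only by a constant that depends on $n$, so comparisons are unaffected as long as you stick to one convention.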

Comparing a Linear Model to a Log-Linear Model

One tricky situation: AIC and BIC cannot be directly compared between a linear model and a log-linear model. The reason is that the dependent variable is different — in one it's $Y$, in the other it's $\ln(Y)$ — so the residuals are on different scales and their RSS values are not comparable.

To fix this, we apply a correction to the log-linear model's AIC or BIC before comparing:

$$\text{AIC}^*_{\text{log-linear}} = \text{AIC}_{\text{log-linear}} + 2\sum_{i=1}^n \ln(Y_i)$$
$\text{AIC}^*_{\text{log-linear}}$ = corrected AIC for the log-linear model (now comparable to the linear model)
$Y_i$ = observed value of the original dependent variable (not logged)
The correction term $2\sum \ln(Y_i)$ accounts for the scale change caused by logging the dependent variable

After correction, compare normally: lower corrected AIC wins. (The same correction applies to BIC.)
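A rough end-to-end sketch of the comparison, using simulated data with a multiplicative error (so the log-linear form is the natural one) and a tiny hand-rolled OLS helper; everything here is illustrative.

```python
import numpy as np

def ols_rss(x, y):
    """RSS from an OLS fit of y on x (intercept added)."""
    X = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def aic(rss, n, k):
    return n * np.log(rss / n) + 2 * (k + 1)

# Simulated data with multiplicative errors.
rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=100)
y = np.exp(0.5 + 0.2 * x + rng.normal(0, 0.2, size=100))

aic_linear = aic(ols_rss(x, y), n=100, k=1)          # dependent variable: Y
aic_log = aic(ols_rss(x, np.log(y)), n=100, k=1)     # dependent variable: ln(Y)

# The raw AICs are on different scales; add the correction to the log model.
aic_log_corrected = aic_log + 2 * np.sum(np.log(y))
print(aic_linear, aic_log_corrected)                  # now comparable: lower wins
```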

Worked Example: Diagnosing a Rental-Price Regression

Worked Example — Real Estate Prices

You've built a linear regression model that predicts apartment rental prices (€/month) using five predictors: size (m²), number of rooms, floor, distance to city centre, and building age. You have $n = 50$ observations. The RSS from your final model is 82.45.

Step 1: Compute the Skewness Coefficient

From your residuals you calculate $\sum \hat{u}_i^3 = 118.5$ and $\hat{\sigma}^2 \approx RSS/n = 82.45/50 = 1.649$.

$$m_3 = \frac{\frac{1}{50} \cdot 118.5}{(1.649)^{3/2}} = \frac{2.37}{2.118} \approx 1.119$$

A positive value of $m_3 = 1.119$ means the residuals have a right-skewed tail — a few apartments are priced much higher than the model predicts.

Step 2: Compute the Kurtosis Coefficient

From the residuals: $\sum \hat{u}_i^4 = 416.3$, so $\frac{1}{n}\sum \hat{u}_i^4 = 8.326$.

$$m_4 = \frac{8.326}{(1.649)^2} = \frac{8.326}{2.719} \approx 3.062$$

$m_4 = 3.062$ is very close to 3 — good. The kurtosis is almost normal.

Step 3: Compute the Jarque-Bera Statistic

$$J = 50 \cdot \left(\frac{(1.119)^2}{6} + \frac{(3.062 - 3)^2}{24}\right) = 50 \cdot \left(0.2088 + 0.00016\right) \approx 10.44$$

The critical value for $\chi^2(2)$ at 5% is $5.991$. Since $J = 10.44 > 5.991$, we reject $H_0$. The residuals are significantly non-normal (p-value ≈ 0.005).

The culprit is the skewness — a handful of luxury apartments are pulling the right tail. A sensible fix: try a log-linear model where $\ln(\text{price})$ is the dependent variable, which often reduces this kind of skew.
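If you want to double-check Steps 1 to 3, the whole calculation fits in a few lines, plugging in the sums quoted above:

```python
import numpy as np
from scipy import stats

n, rss = 50, 82.45
sum_u3, sum_u4 = 118.5, 416.3        # sums of cubed / fourth-power residuals

sigma2 = rss / n                      # 1.649
m3 = (sum_u3 / n) / sigma2**1.5       # ~1.119
m4 = (sum_u4 / n) / sigma2**2         # ~3.062
J = n * (m3**2 / 6 + (m4 - 3)**2 / 24)
print(J, stats.chi2.sf(J, df=2))      # ~10.4 and p ~ 0.005
```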

Step 4: Model Selection — Adding or Removing a Variable?

Your colleague suggests also including a variable for whether the apartment has a parking space. The new model has $K = 6$ predictors and $R^2$ increases from $0.72$ to $0.724$ (almost nothing). Which model should you use?

$R^2$ (K=5): 0.720
$\bar{R}^2$ (K=5): 0.688
$R^2$ (K=6): 0.724
$\bar{R}^2$ (K=6): 0.685

Even though raw $R^2$ increased, adjusted $\bar{R}^2$ fell from 0.688 to 0.685. The parking variable adds so little explanatory power that the penalty for the extra parameter outweighs the gain. Leave it out.

Now compare the five-variable model (AIC = 37.01, BIC = 48.48) against a simpler three-variable model with only size, rooms, and distance (RSS = 95.2): AIC = 40.20, BIC = 47.85. The full model wins on AIC (37.01 < 40.20), but the simpler model wins on BIC (47.85 < 48.48). In this case, AIC suggests keeping the extra variables for better prediction accuracy, while BIC recommends the leaner model as a truer representation of the data.
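The Step 4 comparison can be reproduced with the AIC/BIC helpers sketched earlier (repeated here so the snippet runs on its own):

```python
import numpy as np

def aic(rss, n, k):
    return n * np.log(rss / n) + 2 * (k + 1)

def bic(rss, n, k):
    return n * np.log(rss / n) + (k + 1) * np.log(n)

n = 50
print(aic(82.45, n, 5), bic(82.45, n, 5))   # five-variable model:  ~37.0, ~48.5
print(aic(95.20, n, 3), bic(95.20, n, 3))   # three-variable model: ~40.2, ~47.9
```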