OLS Regression: Fitting a Line That Actually Means Something
You have data. You suspect that one variable influences another. How do you quantify that relationship? How confident can you be in your estimates? OLS (Ordinary Least Squares) regression is the answer. It's the foundation of much of supervised machine learning — and understanding it deeply means you'll understand most of what comes after. Let's build it from the ground up.
Why Regression? The Big Picture
Regression answers a specific kind of question: How much does Y change when X changes by one unit, everything else held equal? That "everything else held equal" part (called ceteris paribus in econometrics) is key — it means we're isolating the effect of one variable while controlling for others.
Example: Does an extra year of education increase wages? Does advertising spending increase sales? Does a drug lower blood pressure? These are all regression questions. The model takes this form:

$$Y_i = \beta_0 + \beta_1 X_i + u_i, \qquad \mathbb{E}[u_i \mid X_i] = 0$$
Where:
The subscript $i$ labels each observation (from $i=1$ to $i=n$)
$Y_i$ is the dependent variable (what you want to explain)
$X_i$ is the explanatory variable (what you use to explain)
$\beta_0$ (Greek "beta-zero") is the intercept (expected value of $Y$ when $X=0$)
$\beta_1$ (Greek "beta-one") is the slope (how much $Y$ changes per one-unit increase in $X$)
$u_i$ is the error term (everything else that affects $Y$ but isn't in the model)
$\mathbb{E}[u_i \mid X_i] = 0$ means "the expected value (average) of $u_i$, conditional on knowing $X_i$, equals zero" — i.e. the model's errors are unbiased on average for any value of $X$
The model is linear: a one-unit change in $X$ always leads to the same $\beta_1$ change in $Y$, regardless of where you start. This is a strong simplification of reality, but it is extremely useful, interpretable, and often a good approximation.
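To make the data-generating process concrete, here is a minimal NumPy sketch (the parameter values and sample size are arbitrary choices for illustration, not anything from a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" parameters -- arbitrary values for this sketch
beta0, beta1 = 2.0, 0.5
n = 200

X = rng.uniform(0, 10, size=n)   # explanatory variable
u = rng.normal(0, 1, size=n)     # error term, mean zero for every X
Y = beta0 + beta1 * X + u        # the linear model Y = b0 + b1*X + u
```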
The OLS Estimator — Finding the Best Line
In real life, we don't know the true population parameters $\beta_0$ and $\beta_1$. We only have a sample — a slice of reality. So we have to estimate them from the data. The estimated versions are written $\hat{\beta}_0$ and $\hat{\beta}_1$.
The question is: out of all possible lines we could draw through the data, which one is the best? OLS answers: the best line minimises the Sum of Squared Residuals (SSR). A residual $\hat{u}_i = Y_i - \hat{Y}_i$ is the difference between the actual value and the predicted value. We square them (to eliminate signs and penalise large errors more) and minimise their sum:

$$\min_{\hat{\beta}_0, \hat{\beta}_1} \; \text{SSR} = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$$

Taking the first-order conditions (setting partial derivatives to zero) and solving gives us the OLS formulas:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
Notice something beautiful: $\hat{\beta}_1$ is just the covariance of $X$ and $Y$ divided by the variance of $X$. Here, $\hat{\beta}_1$ (read: "beta-one-hat") uses the hat symbol $\hat{\phantom{x}}$ to indicate an estimated value — we can't observe the true $\beta_1$, so the hat marks our best guess from the data. The sign of $\hat{\beta}_1$ is determined by the covariance — if $X$ and $Y$ move together, $\hat{\beta}_1$ is positive.
Mathematical interpretation: $\hat{\beta}_1$ equals the derivative of the conditional expectation, $\frac{\partial \mathbb{E}[Y|X]}{\partial X}$, where $\partial$ (the "curly d") denotes a partial derivative — the rate of change of one quantity with respect to another while all other quantities are held constant — and $\mathbb{E}[Y|X]$ means "the expected value (average) of $Y$ given $X$." This captures the marginal effect of $X$ on the expected value of $Y$. If $\hat{\beta}_1 = 0$, then $X$ provides no information about expected $Y$.
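As a sanity check on the covariance-over-variance formula, here is a short sketch that estimates both coefficients from the simulated `X` and `Y` above:

```python
# Slope: sample covariance of X and Y divided by sample variance of X
beta1_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
# Intercept: forces the fitted line through the point of means
beta0_hat = Y.mean() - beta1_hat * X.mean()

print(f"beta1_hat = {beta1_hat:.3f}")  # close to the true 0.5
print(f"beta0_hat = {beta0_hat:.3f}")  # close to the true 2.0
```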
Interactive — OLS Regression
n controls how many data points are generated. σ (sigma) controls the noise — how far points scatter from the true line. The red line is the OLS estimate; the stats panel above the chart shows the fitted equation, R², and standard error.
Multiple Linear Regression
Real-world outcomes are rarely driven by a single variable. Wages depend on education and experience and job type. To handle multiple explanatory variables, we extend the model. Now we have $K$ explanatory variables plus an intercept:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki} + u_i$$
In matrix notation (much more compact), with $\mathbf{y}$ as the $n \times 1$ vector of outcomes and $\mathbf{X}$ as the $n \times (K+1)$ matrix of regressors (bold letters denote vectors or matrices; first column is all 1s for the intercept):

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \quad \Rightarrow \quad \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Here $\mathbf{X}^\top$ is the transpose of $\mathbf{X}$ (rows and columns are swapped), $(\cdot)^{-1}$ denotes the matrix inverse (the matrix equivalent of dividing by a number), and $\Rightarrow$ means "it follows that." You don't need to apply this by hand — software computes it automatically. The important thing is what it gives you: the unique set of $\hat{\beta}$ values that minimises the SSR.
Each $\beta_j$ now represents the effect of variable $X_j$ on $Y$ while holding all other variables constant — this is the ceteris paribus interpretation. The intercept $\beta_0$ ensures the expected value of the residuals equals zero (by construction, when we include an intercept).
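Here is a sketch of the matrix formula with two made-up regressors (in production code `np.linalg.lstsq` is the numerically safer route; the normal equations are used here only to mirror the math):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Two illustrative regressors; first design-matrix column is all 1s (intercept)
X1 = rng.uniform(0, 10, size=n)
X2 = rng.normal(5, 2, size=n)
X = np.column_stack([np.ones(n), X1, X2])            # n x (K+1)
y = 1.0 + 2.0 * X1 - 0.5 * X2 + rng.normal(0, 1, n)  # arbitrary true betas

# beta_hat = (X'X)^(-1) X'y, computed via solve() instead of an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [1.0, 2.0, -0.5]
```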
Interaction Effects — When One Variable Changes Another's Impact
The standard multiple regression model assumes every predictor acts independently: the effect of $X_1$ on $Y$ is always $\beta_1$, regardless of what $X_2$ happens to be. But this is often unrealistic. Does advertising spend always boost sales by the same amount, whether the product is new or established? Does an extra year of education raise wages identically for someone with no experience versus twenty years of it? Usually not.
An interaction term lets the effect of one variable depend on the level of another. You create it by multiplying the two variables together and adding that product as an extra regressor:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \cdot X_{2i}) + u_i$$
$\beta_1$ = the effect of $X_1$ on $Y$ when $X_2 = 0$ (base slope of $X_1$)
$\beta_2$ = the effect of $X_2$ on $Y$ when $X_1 = 0$ (base slope of $X_2$)
$\beta_3$ = the interaction coefficient — by how much the slope of $X_1$ changes for each one-unit increase in $X_2$
Nothing about OLS changes mechanically — you just treat $X_1 \cdot X_2$ as a third ordinary regressor. The key difference is in interpretation: $\hat{\beta}_1$ alone no longer tells the full story of $X_1$'s effect. The full marginal effect of $X_1$ is found by differentiating with respect to $X_1$:

$$\frac{\partial \mathbb{E}[Y \mid X_1, X_2]}{\partial X_1} = \beta_1 + \beta_3 X_2$$
The result $\beta_1 + \beta_3 X_2$ shows that the slope on $X_1$ is not constant — it shifts up or down depending on the current value of $X_2$, as the sketch after this list demonstrates numerically.
If $\beta_3 > 0$: higher $X_2$ amplifies $X_1$'s effect — the two variables reinforce each other
If $\beta_3 < 0$: higher $X_2$ dampens $X_1$'s effect — they partially offset each other
If $\beta_3 = 0$: no interaction — the model collapses back to a standard additive regression
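Here is a sketch with made-up coefficients ($\beta_3 = 0.3$): the interaction column is built by plain multiplication, and the marginal effect of $X_1$ is then re-evaluated at several values of $X_2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 10, n)
# Arbitrary illustrative coefficients; beta3 = 0.3 creates the interaction
y = 1.0 + 2.0 * X1 + 0.5 * X2 + 0.3 * X1 * X2 + rng.normal(0, 1, n)

# The interaction term is just one more column in the design matrix
X = np.column_stack([np.ones(n), X1, X2, X1 * X2])
b0, b1, b2, b3 = np.linalg.solve(X.T @ X, X.T @ y)

for x2 in (0.0, 5.0, 10.0):
    print(f"marginal effect of X1 at X2 = {x2}: {b1 + b3 * x2:.2f}")
# Slope of X1 grows from ~2.0 at X2 = 0 to ~5.0 at X2 = 10
```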
Interactive — Interaction Effect Explorer
Each line is the regression of $Y$ on $X_1$ at a fixed value of $X_2$ (low, middle, and high). As $\beta_3$ changes, the slopes fan out — the same one-unit change in $X_1$ has a different effect depending on the context set by $X_2$.
How Good Is the Fit? — R-squared
Once you have a fitted model, you want to know: how well does it actually describe the data? The most common metric is $R^2$ — the fraction of the total variation in $Y$ that is explained by the regressors:

$$R^2 = 1 - \frac{\text{SSR}}{\text{TSS}}, \qquad \text{TSS} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
TSS (Total Sum of Squares) is the total variation in $Y$ before any model. SSR is what's left unexplained after the model. So $R^2$ is the share that's explained. It ranges from 0 to 1.
But $R^2$ has a flaw: it always increases when you add more variables, even if those variables are useless. This can tempt you to throw every variable into the model — a problem called overfitting. The solution is adjusted $R^2$:

$$\bar{R}^2 = 1 - \frac{n-1}{n-K-1}\,(1 - R^2)$$
Where $K$ is the number of regressors (excluding intercept). Each added variable increases the penalty term $\frac{n-1}{n-K-1}$, so a new variable only improves $\bar{R}^2$ if it explains enough variation to justify its inclusion. Use $\bar{R}^2$ when comparing models with different numbers of variables.
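Both metrics are a few lines of NumPy; here is a sketch (the function name and signature are this sketch's own):

```python
import numpy as np

def r_squared(y, y_hat, k):
    """Return (R2, adjusted R2) for a fit with k regressors (intercept excluded from k)."""
    n = len(y)
    ssr = np.sum((y - y_hat) ** 2)       # unexplained variation
    tss = np.sum((y - np.mean(y)) ** 2)  # total variation
    r2 = 1 - ssr / tss
    r2_adj = 1 - (n - 1) / (n - k - 1) * (1 - r2)
    return r2, r2_adj
```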
Standard Errors and t-Tests — Is a Coefficient Significant?
Getting an estimate $\hat{\beta}_1 = 3.15$ is great, but what if that number is highly uncertain? Maybe with a different sample it would be $-2$ or $+8$. The standard error $\text{SE}(\hat{\beta}_j)$ quantifies this uncertainty — it is the estimated standard deviation of the estimator across repeated samples.
First, we need the residual standard error $s$ (its square, $s^2$, is an estimate of the true error variance $\sigma^2$):

$$s^2 = \frac{\text{SSR}}{n - K - 1} = \frac{1}{n - K - 1} \sum_{i=1}^{n} \hat{u}_i^2, \qquad s = \sqrt{s^2}$$
We divide by $n-K-1$ (degrees of freedom), not $n$, to get an unbiased estimate. Then the standard error of each coefficient is:

$$\text{SE}(\hat{\beta}_j) = \frac{s}{\sqrt{\text{SSR}_j^*}}$$
Where $\text{SSR}_j^*$ (the asterisk $^*$ marks it as an auxiliary quantity, distinct from the main model's SSR) is the sum of squared residuals from a helper regression of $X_j$ on all other regressors — this isolates the variation in $X_j$ that's not explained by the other variables. The intuition: if $X_j$ is highly correlated with other regressors, $\text{SSR}_j^*$ is small, making $\text{SE}(\hat{\beta}_j)$ large (more uncertainty).
To test whether $\beta_j$ is significantly different from zero, compute the t-statistic:

$$t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}$$
This tells you by how many standard deviations $\hat{\beta}_j$ differs from zero. Under $H_0: \beta_j = 0$ (the null hypothesis that coefficient $j$ has no effect) and assuming normally distributed residuals, $t_j$ follows a $t$-distribution with $n-K-1$ degrees of freedom. The p-value is $p = 2[1 - F_{t}(|t_j|)]$, where $|t_j|$ is the absolute value of $t_j$ (its distance from zero) and $F_t$ is the CDF of the t-distribution. If $p < 0.05$, we call $\hat{\beta}_j$ statistically significant at the 5% level.
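Here is a sketch of the whole inference chain for the simple one-regressor case, where the auxiliary regression of $X$ on just the intercept makes $\text{SSR}^*$ collapse to $\sum_i (X_i - \bar{X})^2$ (the function name is this sketch's own):

```python
import numpy as np
from scipy import stats

def t_test_slope(X, Y):
    """Two-sided t-test of H0: beta1 = 0 in a simple OLS fit (K = 1)."""
    n = len(X)
    b1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
    b0 = Y.mean() - b1 * X.mean()
    resid = Y - (b0 + b1 * X)

    s = np.sqrt(np.sum(resid ** 2) / (n - 2))    # residual standard error, K = 1
    ssr_star = np.sum((X - X.mean()) ** 2)       # SSR* of the single regressor
    se_b1 = s / np.sqrt(ssr_star)

    t = b1 / se_b1
    p = 2 * (1 - stats.t.cdf(abs(t), df=n - 2))  # p = 2[1 - F_t(|t|)]
    return b1, se_b1, t, p
```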
The Four OLS Assumptions
OLS estimates are unbiased and efficient (meaning: correct on average and as precise as possible) only when certain conditions hold. Violating these conditions can silently wreck your results. Here are the four key assumptions and how to check them.
1. Homoskedasticity (Constant Variance of Residuals)
The residuals should have the same variance $\sigma^2$ for all values of the regressors: $\text{Var}(u_i | \mathbf{X}) = \sigma^2$. If this fails (heteroskedasticity), your standard errors are wrong — and so are your t-tests and p-values.
Check: Plot residuals $\hat{u}_i$ against fitted values $\hat{Y}_i$. If the spread of residuals increases with $\hat{Y}$, you have heteroskedasticity. A formal test is the Breusch-Pagan test.
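Here is a sketch of the visual check, with deliberately heteroskedastic simulated data so the fan shape is visible (statsmodels' `het_breuschpagan` would be the usual formal follow-up, not shown here):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 300

# Simulated data where the noise standard deviation grows with X
X = rng.uniform(1, 10, n)
Y = 1.0 + 2.0 * X + rng.normal(0, 0.3 * X)

b1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0 = Y.mean() - b1 * X.mean()
fitted = b0 + b1 * X
resid = Y - fitted

plt.scatter(fitted, resid, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Fan shape = heteroskedasticity")
plt.show()
```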
2. No Autocorrelation (Residuals Are Independent)
Residuals should not be correlated with each other: $\text{Cov}(u_i, u_j) = 0$ for $i \neq j$. Autocorrelation frequently occurs with time-series data — today's error is correlated with yesterday's.
Check: The Durbin-Watson (DW) statistic tests specifically for first-order autocorrelation:

$$DW = \frac{\sum_{i=2}^{n} (\hat{u}_i - \hat{u}_{i-1})^2}{\sum_{i=1}^{n} \hat{u}_i^2}$$
$DW \approx 2$ ($\approx$ means "approximately equal to"): no autocorrelation. $DW \ll 2$ ($\ll$ means "much less than"): positive autocorrelation. $DW \gg 2$ ($\gg$ means "much greater than"): negative autocorrelation. The Box-Ljung test generalises this to multiple lags and tests $H_0$: no autocorrelation up to lag $m$.
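The statistic itself is a one-liner over the residual series; a sketch:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
```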
3. Normally Distributed Residuals
Residuals should follow a normal distribution: $u_i \sim N(0, \sigma^2)$. This is needed for t-tests and F-tests to be valid, especially in small samples. For large samples, the Central Limit Theorem saves you — coefficients become approximately normally distributed regardless.
Check: A QQ-plot (standardised residual quantiles vs. normal quantiles — you want points on the 45° line) or the Jarque-Bera test (tests whether skewness = 0 and kurtosis = 3).
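Both checks ship with scipy; here is a sketch using stand-in residuals (a real analysis would pass in the residuals of the fitted model):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
resid = rng.normal(0, 1, 300)   # stand-in residuals for illustration

# QQ-plot: points should hug the 45-degree reference line under normality
stats.probplot(resid, dist="norm", plot=plt)
plt.show()

# Jarque-Bera: H0 is normality (skewness = 0, kurtosis = 3)
jb_stat, jb_p = stats.jarque_bera(resid)
print(f"JB = {jb_stat:.2f}, p = {jb_p:.3f}")  # large p: no evidence against normality
```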
4. No Multicollinearity
Regressors should not be perfectly (or near-perfectly) linearly related to each other. If two regressors are highly correlated, OLS can't distinguish their individual effects — both get large, uncertain standard errors.
Check: Compute the correlation matrix of your regressors. If any off-diagonal correlation is above ~0.8, be cautious. A formal measure is the Variance Inflation Factor (VIF).
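VIF is easy to compute by hand from the auxiliary-regression $R^2$ of each column; a sketch (the function name is this sketch's own):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing column j of X
    on all other columns (plus an intercept). Values above ~10 are a red flag."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```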
Interactive — Residuals vs. Fitted Values (Homoskedasticity Check)
Move slider right to introduce heteroskedasticity — watch the spread fan out.
Worked Example — Apartment Prices
A real estate analyst collects data on 10 apartments: their size (in square metres) and their sale price (in thousands of euros).
Size (m²): 50, 65, 75, 80, 90, 100, 115, 120, 130, 150
Price (k€): 150, 190, 225, 240, 270, 310, 350, 370, 400, 460
Step 1 — Compute the OLS Estimates
First, compute the means: $\bar{X} = 97.5$ m², $\bar{Y} = 296.5$ k€. From the data, $\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = 27{,}787.5$ and $\sum_i (X_i - \bar{X})^2 = 8{,}812.5$, so $\hat{\beta}_1 = 27{,}787.5 / 8{,}812.5 \approx 3.15$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \approx -10.94$.
The estimated model is:

$$\widehat{\text{Price}} = -10.94 + 3.15 \cdot \text{Size}$$
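The entire worked example fits in a few lines; here is a sketch reproducing Steps 1 to 4 with NumPy:

```python
import numpy as np

size  = np.array([50, 65, 75, 80, 90, 100, 115, 120, 130, 150], dtype=float)
price = np.array([150, 190, 225, 240, 270, 310, 350, 370, 400, 460], dtype=float)

sxx = np.sum((size - size.mean()) ** 2)
b1 = np.sum((size - size.mean()) * (price - price.mean())) / sxx
b0 = price.mean() - b1 * size.mean()

resid = price - (b0 + b1 * size)
r2 = 1 - np.sum(resid ** 2) / np.sum((price - price.mean()) ** 2)
s = np.sqrt(np.sum(resid ** 2) / (len(size) - 2))  # n - K - 1 with K = 1
t = b1 / (s / np.sqrt(sxx))

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")                # -10.94, 3.15
print(f"R2 = {r2:.3f}, s = {s:.2f}, t = {t:.1f}")     # 0.999, ~3.2, ~92
print(f"price at 110 m2: {b0 + b1 * 110:.1f} k EUR")  # ~335.9
```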
Step 2 — Interpret the Coefficients
$\hat{\beta}_1 = 3.15$: Each additional square metre of apartment size is associated with a €3,150 higher sale price, on average. $\hat{\beta}_0 = -10.94$: The model's intercept (a 0 m² apartment would cost −€10,940 — not meaningful here, just the mathematical baseline).
Step 3 — Predict
How much would a 110 m² apartment cost?

$$\widehat{\text{Price}} = -10.94 + 3.15 \times 110 \approx 336 \text{ k€}$$

So the model predicts a price of roughly €336,000.
Step 4 — Check the Fit
An $R^2$ of 0.999 means the model explains nearly all variation in price. The residual standard error is $s \approx 3.22$ k€, and the t-value for $\hat{\beta}_1$ is approximately $91.8$ — massively significant. Size is an overwhelmingly strong predictor of price in this data.
Step 5 — Real-World Interpretation
In a real analysis, you would want to include more variables (neighbourhood, floor, age of building) and check the four assumptions. Here, with just one predictor and 10 observations, the fit looks almost too good — in practice, no model is this clean.