F-Test: Are These Variables Actually Doing Anything — Together?
You've built a regression model with many predictors. Each one looks borderline insignificant on its own t-test. Does that mean you should drop them all? Not necessarily. Variables that look weak individually can still be powerful as a group. The F-test is the tool that lets you test a whole group of coefficients at once — and it's one of the most important tests in econometrics.
The Big Picture: Why Not Just Use t-Tests?
The t-test works well for testing one coefficient at a time. But what if you want to test whether height, age, and weight together have any effect on running speed — even if none of them is individually significant? Running three separate t-tests doesn't answer that question, for two reasons.
First, individual insignificance doesn't imply joint insignificance. If height and age are correlated with each other, a t-test on height might look weak simply because age "absorbed" some of its effect (and vice versa). Yet dropping both at once could cost the model substantial explanatory power. Second, running multiple tests inflates your false-positive rate — the more tests you run, the more likely one will cross the significance threshold by pure chance.
The F-test answers the real question directly: Can the group of variables we're thinking of dropping explain any meaningful variation in Y, or is their combined explanatory power basically zero?
The Setup: Restricted vs. Unrestricted Model
The F-test always compares two nested models:
The unrestricted model (full model) includes every predictor, including the ones you want to test. It has $k$ predictors total:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$
The restricted model enforces the null hypothesis $H_0$, which sets $q$ of those coefficients to zero — meaning you drop $q$ variables. Writing the tested coefficients, without loss of generality, as the last $q$:

$$H_0: \beta_{k-q+1} = \beta_{k-q+2} = \cdots = \beta_k = 0$$
Here, $q$ is the number of restrictions — the count of variables in the group you suspect are jointly irrelevant. This is a researcher's choice: you decide which group to test. (In an exam, the question will tell you which group to use.)
The restricted model is just the unrestricted model with those $q$ variables removed:

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{k-q} X_{k-q} + \varepsilon$$
Because the restricted model has fewer variables, it can never fit the data better: its residual sum of squares $RSS_R$ will always be at least as large as $RSS_U$ from the full model. The question is: how much worse is the fit?
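To make the comparison concrete, here is a minimal sketch (assuming numpy and statsmodels are available; the data and variable choices are simulated, not from the text) that fits both models and confirms $RSS_R \geq RSS_U$:

```python
import numpy as np
import statsmodels.api as sm

# Simulated illustration: three candidate predictors, only the first matters.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 0.5 * X[:, 0] + rng.normal(size=n)

X_full = sm.add_constant(X)            # unrestricted: all predictors
X_restr = sm.add_constant(X[:, [0]])   # restricted: drop the last two

rss_u = sm.OLS(y, X_full).fit().ssr    # residual sum of squares, RSS_U
rss_r = sm.OLS(y, X_restr).fit().ssr   # RSS_R

print(rss_r >= rss_u)                  # True: the restricted model never fits better
```

The `ssr` attribute of a fitted statsmodels OLS result is its residual sum of squares, so the comparison is a one-liner.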
Why Not the Aggregated t-Test?
It seems natural to test a group of $q$ coefficients by combining their individual t-statistics. If the $\hat{\beta}$s are uncorrelated and the error variance $\sigma^2$ is known, the sum of their squared t-stats does follow a known distribution:

$$\sum_{l=1}^{q} t_l^2 \sim \chi^2(q)$$
where $t_l = \hat{\beta}_l / se(\hat{\beta}_l)$ is the standard t-statistic for coefficient $l$, $se(\hat{\beta}_l)$ is the standard error of $\hat{\beta}_l$, and $\chi^2(q)$ is the chi-squared distribution with $q$ degrees of freedom.
But both conditions almost never hold in practice. Predictors are almost always correlated (multicollinearity), and we never know the true error variance $\sigma^2$ — we estimate it from the data. This invalidates the $\chi^2$ approach entirely.
The solution: use $\hat{\sigma}^2$ (estimated from residuals) instead of $\sigma^2$. This introduces extra uncertainty, which inflates the distribution from $\chi^2$ to an F-distribution. That's exactly what the F-test uses.
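For reference, this is the standard construction behind that statement (a general fact about the distributions, not something specific to this text): the ratio of two independent chi-squared variables, each divided by its degrees of freedom, is F-distributed.

$$\frac{W/q}{V/m} \sim F(q,\, m), \qquad W \sim \chi^2(q), \quad V \sim \chi^2(m), \quad W \text{ independent of } V$$

In the F-test, the numerator chi-squared comes from the $q$ restrictions and the denominator chi-squared from the $n-k-1$ residual degrees of freedom used to estimate $\hat{\sigma}^2$.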
The F-Statistic
The F-statistic compares the increase in residual sum of squares when you impose the restrictions (i.e., remove the group of variables) to the residual variance of the unrestricted model:

$$F = \frac{(RSS_R - RSS_U)/q}{RSS_U/(n-k-1)}$$
Each symbol explained:
- $RSS_R$ — Residual Sum of Squares from the restricted model (with $q$ variables removed)
- $RSS_U$ — Residual Sum of Squares from the unrestricted model (all variables included)
- $q$ — Number of restrictions (= number of variables removed under $H_0$)
- $n$ — Sample size
- $k$ — Number of predictors in the unrestricted model (not counting the intercept)
- $n - k - 1$ — Residual degrees of freedom of the unrestricted model (the denominator df)
The numerator measures how much explanatory power you lose by removing the $q$ variables, scaled by $q$ (to make it a per-restriction measure). The denominator is the baseline variance in the full model — roughly, the average squared residual. If the ratio is large, removing those variables hurt a lot → they're jointly important. If the ratio is close to zero, removing them barely changed the fit → they're jointly useless.
An equivalent formula via $R^2$:

$$F = \frac{(R^2_U - R^2_R)/q}{(1 - R^2_U)/(n-k-1)}$$
where $R^2_U$ and $R^2_R$ are the R-squared values from the unrestricted and restricted models respectively. Both formulas give the same result — use whichever is easier given your output.
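As a sanity check, here is a small sketch (simulated data, hypothetical names, statsmodels assumed available) that computes the statistic both ways and confirms the two formulas agree:

```python
import numpy as np
import statsmodels.api as sm

def f_stat_from_rss(rss_r, rss_u, q, n, k):
    """Partial F-statistic from residual sums of squares."""
    return ((rss_r - rss_u) / q) / (rss_u / (n - k - 1))

def f_stat_from_r2(r2_u, r2_r, q, n, k):
    """Same statistic expressed through the two R-squared values."""
    return ((r2_u - r2_r) / q) / ((1 - r2_u) / (n - k - 1))

# Illustrative check on simulated data (all numbers are made up):
rng = np.random.default_rng(1)
n, k, q = 100, 3, 2
X = rng.normal(size=(n, k))
y = 2.0 + 1.0 * X[:, 0] + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(X)).fit()              # unrestricted model
restr = sm.OLS(y, sm.add_constant(X[:, : k - q])).fit() # drop the last q predictors

print(f_stat_from_rss(restr.ssr, full.ssr, q, n, k))
print(f_stat_from_r2(full.rsquared, restr.rsquared, q, n, k))  # same value
```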
The F-Distribution and the Decision Rule
Under the null hypothesis $H_0$, the F-statistic follows an F-distribution with two parameters: $q$ (numerator degrees of freedom) and $n-k-1$ (denominator degrees of freedom):

$$F \sim F(q,\, n-k-1)$$
The F-distribution is always non-negative and right-skewed. Large values of $F$ are evidence against $H_0$. The decision rule at significance level $\alpha$ (typically 0.05):

$$\text{Reject } H_0 \quad \text{if} \quad F > F_\alpha(q,\, n-k-1)$$
where $F_\alpha(q, n-k-1)$ is the critical value — the $F$ value above which only $\alpha$ of the area under the distribution lies. In software (R, Python, Stata), the p-value is computed directly: reject $H_0$ if $p < \alpha$.
Note that the F-test is always one-sided (right tail only), even though the underlying t-tests are two-sided. This makes sense: $F$ values can only be positive, and only large values count as evidence against $H_0$.
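In practice, the critical value and p-value come straight from an F-distribution routine. A minimal sketch with scipy (the observed statistic and degrees of freedom below are illustrative, not from the text):

```python
from scipy import stats

q, df_denom = 2, 46
alpha = 0.05

f_crit = stats.f.ppf(1 - alpha, q, df_denom)   # right-tail critical value
print(f_crit)

f_observed = 4.7                               # hypothetical observed statistic
p_value = stats.f.sf(f_observed, q, df_denom)  # one-sided (right-tail) p-value
print(p_value < alpha)                         # True -> reject H0
```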
Interactive — F-Distribution and Rejection Region
The curve is the F-distribution for your chosen degrees of freedom. The red shaded area is the p-value — the probability of observing an F-statistic at least this large if H₀ were true. The dashed line marks your observed F. When it enters the red zone, you reject H₀.
The Overall Model F-Test
A special case you'll always see in regression output: the overall F-test, which tests whether all slope coefficients are jointly zero — i.e., $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$. Here $q = k$ (all predictors), and the restricted model is intercept-only, predicting $\bar{Y}$ for every observation, so $R^2_R = 0$. The F-statistic simplifies to:

$$F = \frac{R^2/k}{(1 - R^2)/(n-k-1)}$$
where $R^2$ is the R-squared of the full model. This is the F-statistic printed automatically at the bottom of every regression table in R, Stata, and Python's statsmodels. A significant overall F means the model as a whole explains significantly more variation than a flat line at the mean.
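For instance, with statsmodels the overall F-statistic and its p-value are available as attributes of the fitted result, and you can reproduce them from $R^2$ by hand. A sketch on simulated data (all names and numbers illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, k = 80, 3
X = rng.normal(size=(n, k))
y = 1.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(X)).fit()
print(res.fvalue, res.f_pvalue)   # overall F-statistic and its p-value

# The same number recomputed from R^2, k, and n - k - 1:
f_manual = (res.rsquared / k) / ((1 - res.rsquared) / (n - k - 1))
print(f_manual)                   # matches res.fvalue
```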
Worked Example: Sprint Speed, Height, and Age
A sports scientist tracks 50 athletes. For each athlete she records: weekly training hours, height (cm), age (years), and 100m sprint speed (m/s). She suspects that training drives speed, but she wonders whether height and age add anything on top of that. Let's test it.
The Two Models
Unrestricted (full) model — all three predictors:

$$\text{Speed}_i = \beta_0 + \beta_1\,\text{Training}_i + \beta_2\,\text{Height}_i + \beta_3\,\text{Age}_i + \varepsilon_i$$
Restricted model — null hypothesis $H_0: \beta_2 = \beta_3 = 0$ (height and age have zero effect):

$$\text{Speed}_i = \beta_0 + \beta_1\,\text{Training}_i + \varepsilon_i$$
Here $q = 2$ (we're testing two restrictions), $k = 3$ predictors in the full model, $n = 50$ observations.
Step 1 — Fit Both Models by OLS
Running OLS on the unrestricted model gives:
Individual t-statistics and p-values (with $n - k - 1 = 46$ degrees of freedom):
Height and age look individually insignificant. But does that mean they're jointly useless? Let's check properly.
Running OLS on the restricted model (training only) gives $\hat{\beta}_0 = 8.15$, $\hat{\beta}_1 = 0.109$. Now we read off the residual sums of squares:
Step 2 — Compute the F-Statistic
Plugging $RSS_R$ and $RSS_U$ into the F formula with $q = 2$ and $n - k - 1 = 46$ gives $F = 0.04$.
Step 3 — Compare to the Critical Value
At significance level $\alpha = 0.05$, the critical value from the $F(2, 46)$ distribution is:

$$F_{0.05}(2,\, 46) \approx 3.20$$
Our F-statistic of $0.04$ is far below $3.20$. The p-value is $0.96$ — nowhere near the 0.05 threshold.
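If you want to reproduce these two numbers yourself, a quick check with scipy (assuming it is installed) returns the same critical value and p-value:

```python
from scipy import stats

print(stats.f.ppf(0.95, 2, 46))   # ~3.20, the 5% critical value of F(2, 46)
print(stats.f.sf(0.04, 2, 46))    # ~0.96, the p-value for an observed F of 0.04
```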
Decision and Interpretation
We cannot reject the null hypothesis that $\beta_2 = \beta_3 = 0$. In plain English: height and age, as a group, add essentially no predictive power for sprint speed, beyond what training hours already explain. Dropping them from the model is justified.
Note that the overall model F-test (testing whether training itself matters at all) gives $F_{\text{overall}} = 27.5$ with $p \approx 0$ — so the model with training is highly significant overall. It's specifically the addition of height and age on top of training that is unnecessary.
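Finally, here is how the whole test could be run end to end in statsmodels. The athletes' data set is not reproduced in the text, so the sketch below uses simulated stand-in data purely to show the mechanics; the coefficients and test statistics it produces will not match the numbers above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (hypothetical values; only the workflow matters here).
rng = np.random.default_rng(3)
n = 50
df = pd.DataFrame({
    "training": rng.uniform(2, 15, n),    # weekly training hours
    "height": rng.normal(180, 8, n),      # cm
    "age": rng.integers(18, 35, n),       # years
})
df["speed"] = 8.0 + 0.11 * df["training"] + rng.normal(0, 0.3, n)  # m/s

full = smf.ols("speed ~ training + height + age", data=df).fit()

# Joint test of H0: the height and age coefficients are both zero.
print(full.f_test("height = 0, age = 0"))   # F-statistic, df, and p-value
print(full.fvalue, full.f_pvalue)           # overall model F-test
```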