
Dummy Variables: Teaching Regression to See Categories

OLS regression is naturally built for numbers. But the real world is full of categories: gender, education level, country, job type. How do you include "has a university degree" or "lives in Italy" in a regression? You turn it into a number — specifically a 0 or 1. That's a dummy variable. Simple in concept, powerful in practice.

The Big Picture: What Problem Are We Solving?

Imagine you're studying wages and you want to control for whether someone attended university. You can't just write "university" in a regression — OLS needs numbers. And you can't simply assign arbitrary numeric codes to categories and read the coefficient as a per-unit slope, because a category is not a scale: there's no meaningful "one unit of university."

The solution: create a binary indicator variable — a dummy variable $D$ — that equals 1 if the condition is true and 0 if it isn't. The coefficient on $D$ then measures the difference in the outcome between the two groups, controlling for everything else in the model. That's exactly the question you wanted to answer.

Intuition: A dummy variable splits your data into two groups. Its coefficient tells you how much the average outcome shifts from one group to the other — holding everything else constant.

One Dummy Variable: The Binary Case

Suppose you're regressing wage $Y$ on years of experience $X$ and a dummy $D$ (1 = university graduate, 0 = non-graduate). The model is:

$$Y_i = \beta_0 + \beta_1 D_i + \beta_2 X_i + u_i$$

where $Y_i$ is the wage for individual $i$, $D_i \in \{0, 1\}$ is the dummy variable (1 = university graduate), $X_i$ is years of experience, $u_i$ is the error term, and $\beta_0$, $\beta_1$, $\beta_2$ are the coefficients to be estimated.

To understand what $\beta_1$ means, let's look at what the model predicts for each group separately:

$$D_i = 0 \text{ (non-graduate):} \quad E[Y_i \mid D_i=0, X_i] = \beta_0 + \beta_2 X_i$$

$$D_i = 1 \text{ (graduate):} \quad E[Y_i \mid D_i=1, X_i] = (\beta_0 + \beta_1) + \beta_2 X_i$$

The difference between the two lines is exactly $\beta_1$. The non-graduate group is the baseline (intercept = $\beta_0$). The graduate group has a shifted intercept ($\beta_0 + \beta_1$) but the same slope $\beta_2$ on experience. So:

$$\hat{\beta}_1 = \text{estimated difference in average wage between graduates and non-graduates, ceteris paribus (i.e., at the same level of experience)}$$

The phrase ceteris paribus is Latin for "all else equal" — it means we're isolating the group difference by holding experience fixed. If $\hat{\beta}_1 = 4.2$, graduates earn on average €4.20/hr more than non-graduates with the same experience. The significance of this difference is tested with a standard t-test on $\hat{\beta}_1$, with $H_0: \beta_1 = 0$ (no difference between groups).
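To make this concrete, here is a minimal sketch in Python using statsmodels. The data is synthetic — the coefficient values, sample size, and noise level are made up for illustration, not taken from any dataset in this chapter:

```python
# Binary-dummy regression on synthetic data: wage ~ graduate dummy + experience.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
experience = rng.uniform(0, 30, n)            # X: years of experience
graduate = rng.integers(0, 2, n)              # D: 1 = university graduate
wage = 8 + 4.2 * graduate + 0.5 * experience + rng.normal(0, 2, n)

X = sm.add_constant(np.column_stack([graduate, experience]))
model = sm.OLS(wage, X).fit()
print(model.summary())   # the t-test on the dummy coefficient tests H0: beta1 = 0
```

The summary's row for the dummy column reports exactly the t-test described above: the estimated group gap, its standard error, and the p-value for $H_0: \beta_1 = 0$.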

Interactive — Dummy Variable Shifts the Intercept

Red dots = Group A (D = 0, baseline). Blue dots = Group B (D = 1). The vertical gap between the two regression lines is β₁ — the estimated group effect holding everything else equal. Adjust the sliders to see how the intercept shift changes the fitted lines.

Multiple Dummies — Combining Groups

You can include more than one dummy variable in a regression. For instance, if you also want to control for gender (dummy $D_{\text{female}}$) alongside the education dummy $D_{\text{grad}}$:

$$Y_i = \beta_0 + \beta_1 D_{\text{grad},i} + \beta_2 D_{\text{female},i} + \beta_3 X_i + u_i$$

The baseline group is now: non-graduate, male (both dummies = 0). Each coefficient measures the shift relative to that baseline:

$$\hat{\beta}_1 = \text{wage premium for being a graduate, ceteris paribus}$$

$$\hat{\beta}_2 = \text{wage difference for being female vs. male, ceteris paribus}$$

You can include as many binary dummies as you like. Each one asks the same question: how much does belonging to this group shift the expected outcome, holding everything else fixed?

General rule: Each dummy coefficient $\hat{\beta}_j$ tells you how much its group adds to (or subtracts from) the baseline intercept, ceteris paribus. The baseline is always the group with all dummies equal to 0.
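A sketch of the two-dummy model, again on synthetic data with made-up coefficients. The baseline group (both dummies = 0) is non-graduate male:

```python
# Two dummies plus one continuous regressor; baseline = non-graduate male.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
exp_years = rng.uniform(0, 30, n)
grad = rng.integers(0, 2, n)       # D_grad: 1 = university graduate
female = rng.integers(0, 2, n)     # D_female: 1 = female
wage = 8 + 4.0 * grad - 1.5 * female + 0.5 * exp_years + rng.normal(0, 2, n)

X = sm.add_constant(np.column_stack([grad, female, exp_years]))
fit = sm.OLS(wage, X).fit()
print(fit.params)   # [beta0, beta1 (grad), beta2 (female), beta3 (experience)]
```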

Categorical Variables: More Than Two Groups

So far the categorical variable itself was binary — two groups, one dummy. But what about a variable with more than two categories — like education level: dropout, high school, or college degree? Here you need to be careful.

Why not just code dropout = 0, high school = 1, college = 2 and use it as a regular numeric variable? Because that would force OLS to assume that the gap between dropout and high school equals the gap between high school and college — a very specific and often wrong assumption.
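A tiny numerical illustration of that constraint — the group means below are invented. Regressing them on the 0/1/2 code forces the fitted gaps to be equal, while two dummies reproduce each mean exactly:

```python
# Naive 0/1/2 coding vs. dummies, on made-up group means with unequal gaps.
import numpy as np

y = np.array([10.0, 13.0, 22.0])      # group means: dropout, HS, college
code = np.array([0.0, 1.0, 2.0])      # naive numeric coding

# OLS of y on the numeric code fits one straight line through the means:
X = np.column_stack([np.ones(3), code])
b = np.linalg.solve(X.T @ X, X.T @ y)
print(X @ b)    # fitted means [9, 15, 21] — the line forces equal gaps of 6

# With K - 1 = 2 dummies instead, OLS recovers each group mean exactly:
D = np.column_stack([np.ones(3), [0, 1, 0], [0, 0, 1]])
bd = np.linalg.solve(D.T @ D, D.T @ y)
print(D @ bd)   # [10, 13, 22] — no constraint linking the gaps
```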

The correct approach: create one dummy for each category, except one. If there are $K$ categories, you include $K - 1$ dummies. In the education example with $K = 3$ groups:

$$D_1 = \begin{cases} 1 & \text{if high school degree} \\ 0 & \text{otherwise} \end{cases} \qquad D_2 = \begin{cases} 1 & \text{if college degree} \\ 0 & \text{otherwise} \end{cases}$$

The omitted category (dropout, where $D_1 = D_2 = 0$) is the baseline. The model becomes:

$$Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 X_i + u_i$$

Now $\beta_1$ = wage premium for having a high school degree vs. being a dropout (for the same experience), and $\beta_2$ = wage premium for having a college degree vs. being a dropout. The gap between high school and college is then $\beta_2 - \beta_1$ — and unlike the naive coding, this gap can be different from the gap between dropout and high school. That's the flexibility dummy variables provide.
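In practice you rarely build the dummies by hand. Here is one way to do it with pandas (the column names and data are illustrative); `get_dummies` with `drop_first=True` implements exactly the $K - 1$ rule:

```python
# Build K - 1 dummies from a 3-category variable; dropout is the baseline.
import pandas as pd

df = pd.DataFrame({
    "education": ["dropout", "high_school", "college", "high_school", "college"],
    "experience": [5, 3, 8, 12, 2],
})
# Order the categories so that 'dropout' is first, hence the dropped baseline:
df["education"] = pd.Categorical(
    df["education"], categories=["dropout", "high_school", "college"]
)
dummies = pd.get_dummies(df["education"], prefix="D", drop_first=True, dtype=int)
X = pd.concat([dummies, df["experience"]], axis=1)
print(X)   # columns: D_high_school, D_college, experience — dropout is baseline
```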

The Dummy Variable Trap

If you include all $K$ dummies — one per category, including the baseline — you create perfect multicollinearity. The three dummies (dropout + high school + college) always sum to exactly 1 for every observation, so the intercept column is a perfect linear combination of the dummy columns. OLS cannot be computed — the cross-product matrix $\mathbf{X}^\top\mathbf{X}$ becomes singular (non-invertible).

This is called the dummy variable trap. The fix is simple: always omit one category. Which one you omit defines the baseline — all other coefficients are interpreted relative to it. The choice of baseline doesn't change the overall fit of the model, only the interpretation of the intercept and slope coefficients.

$$\text{Rule: For } K \text{ categories, include exactly } K - 1 \text{ dummies. Always.}$$
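You can see the trap directly in the linear algebra. A sketch with all $K = 3$ dummies plus an intercept:

```python
# The dummy variable trap: intercept + all K dummies makes X'X singular.
import numpy as np

# 6 observations, 3 categories, one dummy per category (no omission):
D = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])
X = np.column_stack([np.ones(6), D])    # intercept column + all 3 dummies
XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))       # 3, but XtX is 4x4 -> singular
# np.linalg.inv(XtX) would raise LinAlgError: Singular matrix
```

Dropping any one dummy column restores full rank, which is exactly what the $K - 1$ rule does.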

Estimation — It's Still Just OLS

Estimating a model with dummy variables uses the exact same OLS formulas as always:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}$$

where $\mathbf{X}$ is the design matrix, $\mathbf{Y}$ is the vector of outcomes, and $\hat{\boldsymbol{\beta}}$ is the vector of estimated coefficients. The only difference from a model with only continuous predictors is that the dummy columns of $\mathbf{X}$ contain only 0s and 1s. The math is identical — OLS doesn't care whether a column is binary or continuous.
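Here is a sketch applying the closed-form estimator to a design matrix with a dummy column (synthetic data with made-up true values), solving the normal equations rather than inverting $\mathbf{X}^\top\mathbf{X}$ explicitly:

```python
# Closed-form OLS on a design matrix mixing a 0/1 dummy and a continuous column.
import numpy as np

rng = np.random.default_rng(2)
n = 100
d = rng.integers(0, 2, n)                   # dummy column (0s and 1s)
x = rng.uniform(0, 30, n)                   # continuous column
X = np.column_stack([np.ones(n), d, x])     # design matrix with intercept
y = 10 + 4 * d + 0.5 * x + rng.normal(0, 2, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X'X) b = X'y
print(beta_hat)   # close to the true values [10, 4, 0.5]
```

Using `np.linalg.solve` instead of computing $(\mathbf{X}^\top\mathbf{X})^{-1}$ directly is the numerically safer way to evaluate the same formula.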

Interactive — Education Group Wage Differences

Each group gets its own regression line. The group difference slider shifts one line up or down (the intercept dummy effect). The dotted line is the pooled regression that ignores group membership — compare it with the split lines to see what ignoring the dummy costs you.

Worked Example — Education and Hourly Wages

A labour economist surveys 60 workers in a city. For each worker she records: hourly wage (€), years of work experience, and education group (dropout, high school, or college degree). Does education level significantly predict wages, even after controlling for experience?

The Model

With $K = 3$ education groups, she creates $K - 1 = 2$ dummies. Dropout is the baseline:

$$\text{wage}_i = \beta_0 + \beta_1 D_{\text{HS},i} + \beta_2 D_{\text{college},i} + \beta_3 \cdot \text{exp}_i + u_i$$

Step 1 — OLS Estimation

Running OLS on the 60 observations (seeded random data with true values β₀=10, β₁=4, β₂=10, β₃=0.5) gives:

$$\hat{\beta}_0 = 10.66 \quad \hat{\beta}_1 = 3.43 \quad \hat{\beta}_2 = 9.12 \quad \hat{\beta}_3 = 0.48$$

Standard errors and t-statistics:

$$se(\hat{\beta}_1) = 0.74 \;\Rightarrow\; t_1 = 4.63 \quad se(\hat{\beta}_2) = 0.78 \;\Rightarrow\; t_2 = 11.75 \quad se(\hat{\beta}_3) = 0.04 \;\Rightarrow\; t_3 = 11.18$$

All p-values are essentially 0 — all three variables are highly significant.
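A sketch that mirrors this setup — $n = 60$, the same true values — though the chapter's exact seed and noise level aren't specified, so the estimates it produces will be close to, but not identical to, the numbers quoted above:

```python
# Reproduce the worked example's structure: two education dummies + experience.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)             # seed is an assumption
n = 60
group = rng.integers(0, 3, n)               # 0 = dropout, 1 = HS, 2 = college
d_hs = (group == 1).astype(int)
d_col = (group == 2).astype(int)
exp_years = rng.uniform(0, 30, n)
wage = 10 + 4 * d_hs + 10 * d_col + 0.5 * exp_years + rng.normal(0, 2, n)

X = sm.add_constant(np.column_stack([d_hs, d_col, exp_years]))
fit = sm.OLS(wage, X).fit()
print(fit.params)     # [beta0_hat, beta1_hat, beta2_hat, beta3_hat]
print(fit.tvalues)    # t-statistics for each coefficient
```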

Step 2 — Interpret the Coefficients

The baseline (dropout group, both dummies = 0) has an average wage of approximately €10.66/hr for zero years of experience. Each year of experience adds €0.48/hr on average (ceteris paribus).

Now for the group differences:

$$\hat{\beta}_1 = 3.43 \;\Rightarrow\; \text{High school graduates earn €3.43/hr more than dropouts, ceteris paribus}$$

$$\hat{\beta}_2 = 9.12 \;\Rightarrow\; \text{College graduates earn €9.12/hr more than dropouts, ceteris paribus}$$

The gap between high school and college is $\hat{\beta}_2 - \hat{\beta}_1 = 9.12 - 3.43 = €5.69/hr$ — substantially different from the gap between dropout and high school (€3.43/hr). This asymmetry is exactly what the dummy variable approach captures, and what the naive numeric coding (0, 1, 2) would have missed.
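The high-school-vs-college gap is a linear combination of coefficients, and its standard error follows from the coefficient covariance matrix: $\text{Var}(\hat{\beta}_2 - \hat{\beta}_1) = \text{Var}(\hat{\beta}_2) + \text{Var}(\hat{\beta}_1) - 2\,\text{Cov}(\hat{\beta}_1, \hat{\beta}_2)$. A short continuation, reusing `fit` from the sketch above:

```python
# Estimate and test the gap beta2 - beta1 (HS vs. college).
import numpy as np

b = fit.params                  # [b0, b1 (HS), b2 (college), b3 (experience)]
V = fit.cov_params()            # estimated covariance matrix of the coefficients
gap = b[2] - b[1]
se_gap = np.sqrt(V[1, 1] + V[2, 2] - 2 * V[1, 2])
print(gap, gap / se_gap)        # gap estimate and its t-statistic
```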

Step 3 — Predict a Specific Worker's Wage

Predict the wage for a college graduate with 10 years of experience:

$$\hat{Y} = 10.66 + 3.43 \cdot 0 + 9.12 \cdot 1 + 0.48 \cdot 10 = 10.66 + 9.12 + 4.80 = €24.58\,\text{/hr}$$

For a dropout with 10 years of experience:

$$\hat{Y} = 10.66 + 0 + 0 + 0.48 \cdot 10 = €15.46\,\text{/hr}$$

The college-vs-dropout wage gap at the same experience level is $24.58 - 15.46 = €9.12/hr$ — exactly $\hat{\beta}_2$, as the formula predicts. This is the power of ceteris paribus: the €9.12 gap is already adjusted for experience differences between groups.
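The same two predictions by hand, using the chapter's estimates as the coefficient vector:

```python
# Predictions as dot products of the estimated coefficients with each worker's row.
import numpy as np

beta = np.array([10.66, 3.43, 9.12, 0.48])   # [const, D_HS, D_college, exp]
college_10yr = np.array([1, 0, 1, 10])       # college graduate, 10 yrs experience
dropout_10yr = np.array([1, 0, 0, 10])       # dropout, 10 yrs experience

print(beta @ college_10yr)                    # 24.58
print(beta @ dropout_10yr)                    # 15.46
print(beta @ (college_10yr - dropout_10yr))   # 9.12 = beta2_hat
```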

Interpretation Summary

The three regression lines (one per education group) are parallel — they share the same slope $\hat{\beta}_3 = 0.48$ on experience. Only the intercepts differ. This reflects the dummy variable model's core assumption: education shifts the entire wage level up, but the wage-experience relationship is the same across groups. If you believed the slopes also differed by group (e.g., college graduates benefit more from each year of experience), you'd need an interaction term — but that's a topic for another page.