Supervised ML · Chapter 6 · 22 min read

Nonlinear Regression: When a Straight Line Isn't Enough

Prerequisites: This page builds on OLS Regression — you should understand how $\beta$ coefficients are estimated and interpreted. For interaction terms involving dummy variables, also review Dummy Variables first.

OLS regression in its standard form draws a straight line through your data. But what if the true relationship curves? A drug that works better and better up to a certain dose, then becomes dangerous? Wages that rise with experience — but at a decreasing rate as you age? These aren't quirks. They're the norm in economics, medicine, and most real-world data.

The good news: you don't need a completely new method. You can extend ordinary OLS by adding squared terms or cross-products (interactions) to your model. The math stays the same — only the interpretation changes.

Part 1: Quadratic Terms — Modelling Curves

The Big Picture: Why Go Quadratic?

In a standard OLS model, the slope of $Y$ with respect to $X$ is always $\beta_1$ — a fixed constant. Every extra unit of $X$ has the exact same effect, no matter where you start. But if the true relationship curves, this constant slope is wrong everywhere.

Adding a quadratic term $X^2$ lets the slope vary with $X$: as $X$ changes, the slope shifts up or down, allowing you to model diminishing returns, saturation effects, or U-shaped relationships.

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + u_i$$
$Y_i$ = dependent variable (outcome) for observation $i$
$X_i$ = the predictor variable
$X_i^2$ = $X_i$ squared — you create this new column in your data and include it as a separate predictor
$\beta_0$ = intercept (value of $E(Y)$ when $X = 0$)
$\beta_1$ = linear part of the slope (effect when $X$ is near 0)
$\beta_2$ = curvature coefficient — determines whether the parabola opens upward or downward
$u_i$ = error term

Important: although the model contains $X^2$, it's still estimated by OLS without any modification. You simply treat $X$ and $X^2$ as two separate predictor columns. The linearity assumption in OLS refers to linearity in the parameters ($\beta$s), not in the variables.
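The point that OLS treats $X^2$ as just another column can be made concrete. A minimal sketch using NumPy — the synthetic data and true coefficients here are illustrative, not from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 1, size=200)  # true betas: 3, 2, -0.5

# Design matrix: intercept, X, and X^2 as three ordinary columns
X = np.column_stack([np.ones_like(x), x, x**2])

# Plain OLS via least squares — no special "nonlinear" machinery needed
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [3, 2, -0.5]
```

The fit recovers the true coefficients because the model is still linear in the $\beta$s — only the columns of the design matrix are nonlinear functions of $X$.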

How $\beta_2$ Shapes the Curve

The sign of $\beta_2$ tells you the shape of the relationship:

If $\beta_2 < 0$: the parabola opens downward — the effect of $X$ on $Y$ is initially positive but diminishes and eventually turns negative. Think of a worker's productivity peaking in mid-career, or a drug dose that helps at low levels but causes harm at high levels.

If $\beta_2 > 0$: the parabola opens upward — the slope increases with $X$. When $\beta_1 < 0$, the effect is initially negative and eventually turns positive (a U-shape); when $\beta_1 \geq 0$, the effect is positive and accelerating throughout $X > 0$. Think of environmental costs that are negligible at low pollution levels but accelerate sharply at high ones.

The Marginal Effect: Slope Is No Longer Constant

In a standard linear model, the slope is always $\beta_1$. In a quadratic model, the slope depends on the current value of $X$. We calculate it by taking the derivative of $E(Y|X)$ with respect to $X$:

$$\frac{\partial E(Y \mid X)}{\partial X} = \beta_1 + 2\beta_2 X$$
$\frac{\partial E(Y|X)}{\partial X}$ = the marginal effect of $X$ — the expected change in $Y$ for a one-unit increase in $X$, evaluated at the current value of $X$
$\beta_1 + 2\beta_2 X$ = the slope at the point $X$ — this changes as $X$ changes
$\partial$ (partial derivative symbol) = indicates we're changing only $X$ while holding everything else fixed

This is the key difference from a standard OLS model: the effect of one extra unit of $X$ depends on where you currently are. At low values of $X$, the marginal effect is $\approx \beta_1$. As $X$ grows, the term $2\beta_2 X$ kicks in and changes the slope — accelerating it if $\beta_2 > 0$, decelerating it if $\beta_2 < 0$.

Key insight: You can no longer say "a one-unit increase in $X$ changes $Y$ by $\beta_1$." Instead, you say "at $X = x_0$, a one-unit increase in $X$ changes $Y$ by $\beta_1 + 2\beta_2 x_0$." Always report the marginal effect at a specific value of $X$.
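The formula $\beta_1 + 2\beta_2 x_0$ is simple enough to compute by hand, but a tiny helper makes the "slope depends on where you are" point vivid. A sketch with illustrative coefficients (the numbers are placeholders, not estimates from real data):

```python
def marginal_effect(beta1, beta2, x0):
    """Slope of E(Y|X) at X = x0 in the quadratic model: beta1 + 2*beta2*x0."""
    return beta1 + 2 * beta2 * x0

# beta2 < 0: diminishing returns — the slope shrinks as X grows
print(marginal_effect(1800, -45, 0))   # 1800 (near X = 0 the effect is ~beta1)
print(marginal_effect(1800, -45, 10))  # 900
print(marginal_effect(1800, -45, 20))  # 0 (the turning point)
```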

Testing Whether the Quadratic Term Is Needed

Should you include $X^2$ at all? If the true relationship is actually linear, then $\beta_2 = 0$ and the model reduces to the standard linear case. So the test is simple: run the OLS regression with $X^2$ included, and look at the t-test for $\hat{\beta}_2$.

If $\hat{\beta}_2$ is not significantly different from zero (p-value > 0.05), you don't have evidence of curvature — stick with the linear model. If the t-test rejects $H_0: \beta_2 = 0$, the curvature is statistically significant and you should keep the quadratic term. Either way, keep the linear term $X$ in the model whenever $X^2$ is included.

The Vertex: Where the Slope Crosses Zero

The vertex (or "turning point") of the parabola is the value of $X$ where the slope equals zero — where the function switches from increasing to decreasing (or vice versa). It's the peak (if $\beta_2 < 0$) or the trough (if $\beta_2 > 0$).

$$X^* = -\frac{\beta_1}{2\beta_2}$$
$X^*$ = the vertex — the value of $X$ at which the marginal effect equals zero
Set $\frac{\partial E(Y|X)}{\partial X} = \beta_1 + 2\beta_2 X = 0$ and solve for $X$ to derive this formula
If $X^*$ lies outside the range of your data, the parabola is monotone in your data range — it only goes up or only goes down, and you won't observe the turning point

The vertex is practically important: it tells you, for example, at what years of experience wages peak, or at what level of advertising spending returns start diminishing. Always check whether $X^*$ falls within your observed data range — if not, your data only shows one side of the parabola.
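Both the vertex formula and the data-range check fit in a few lines. A sketch (the coefficients and range passed in are illustrative):

```python
def vertex(beta1, beta2, x_min=None, x_max=None):
    """Turning point X* = -beta1 / (2*beta2); flag it if outside the data range."""
    if beta2 == 0:
        raise ValueError("no curvature: beta2 = 0 means the model is linear")
    x_star = -beta1 / (2 * beta2)
    if x_min is not None and x_max is not None and not (x_min <= x_star <= x_max):
        print(f"X* = {x_star:.2f} lies outside [{x_min}, {x_max}]: "
              "the fitted curve is monotone over your data")
    return x_star

print(vertex(1800, -45, x_min=0, x_max=40))  # 20.0 — a peak, since beta2 < 0
```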

Interactive — Quadratic Regression Explorer

The red curve shows $E(Y|X) = 20 + \beta_1 X + \beta_2 X^2$. The dashed vertical line marks the vertex $X^* = -\beta_1 / (2\beta_2)$. The tangent line at the marker shows the marginal effect at that point. Drag the sliders to see how curvature and direction change.

Part 2: Interaction Terms — When Effects Depend on Context

The Big Picture: Why Interactions?

A standard regression assumes the effect of $X_1$ on $Y$ is the same regardless of the value of $X_2$. But is that realistic? Does education always have the same return on wages, whether you live in a rural area or a major city? Does a drug always have the same effect, regardless of age?

Interaction terms let you answer: does the effect of $X_1$ on $Y$ depend on the level of $X_2$? If it does, a pure additive model misses the story entirely.

The Model with Interaction

An interaction term is simply the product of two variables, $X_1 \cdot X_2$, added to the model:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \cdot X_{2i}) + u_i$$
$X_{1i} \cdot X_{2i}$ = the interaction term — a new variable created by multiplying $X_1$ and $X_2$ together
$\beta_1$ = the effect of $X_1$ on $Y$ when $X_2 = 0$
$\beta_2$ = the effect of $X_2$ on $Y$ when $X_1 = 0$
$\beta_3$ = the interaction coefficient — by how much does the effect of $X_1$ change for each one-unit increase in $X_2$?
(equivalently: by how much does the effect of $X_2$ change for each one-unit increase in $X_1$)

The Marginal Effect with Interactions

Taking the derivative with respect to $X_1$:

$$\frac{\partial E(Y \mid X_1, X_2)}{\partial X_1} = \beta_1 + \beta_3 X_2$$
The marginal effect of $X_1$ is no longer a fixed number — it depends on the current value of $X_2$
$\beta_1$ = base effect of $X_1$ (when $X_2 = 0$)
$\beta_3 X_2$ = how much the effect of $X_1$ changes as $X_2$ increases
If $\beta_3 > 0$: higher $X_2$ amplifies the effect of $X_1$
If $\beta_3 < 0$: higher $X_2$ dampens the effect of $X_1$
If $\beta_3 = 0$: no interaction — the two variables act independently
Interaction interpretation: $\beta_3$ tells you how the slope on $X_1$ shifts when $X_2$ increases by one unit. A positive $\beta_3$ means $X_1$'s effect grows stronger as $X_2$ rises — the two variables reinforce each other. Always report the full marginal effect $\beta_1 + \beta_3 X_2$ at a meaningful value of $X_2$.
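The same pattern as the quadratic case: the marginal effect is a small formula you evaluate at a chosen value of the other variable. A sketch with illustrative coefficients:

```python
def marginal_effect_x1(beta1, beta3, x2):
    """Effect of a one-unit increase in X1, evaluated at the given X2."""
    return beta1 + beta3 * x2

# beta3 > 0: higher X2 amplifies the effect of X1
for x2 in (0, 5, 10):
    print(x2, marginal_effect_x1(2.0, 0.3, x2))  # 2.0, then 3.5, then 5.0
```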

Dummy Variables with Interaction Terms

A particularly common and powerful application is interacting a continuous variable with a dummy variable. Recall from Dummy Variables that a dummy $D_i \in \{0, 1\}$ shifts the intercept of the regression line up or down for different groups. But what if the slope also differs between groups?

Interacting $X$ with $D$ allows each group to have a completely different regression line — different intercept and different slope:

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \cdot D_i) + u_i$$
$D_i$ = dummy variable (0 or 1) for group membership
$X_i \cdot D_i$ = interaction term: zero for the base group ($D=0$), equals $X_i$ for the other group ($D=1$)
$\beta_2$ = intercept shift for $D=1$ group (vertical gap between the two lines when $X = 0$)
$\beta_3$ = slope shift for $D=1$ group (how much steeper or flatter the line is for group $D=1$ compared to $D=0$)

Substituting $D = 0$ and $D = 1$ separately makes the model concrete:

Group D = 0 (base group): $$E(Y|X, D=0) = \beta_0 + \beta_1 X$$
Group D = 1: $$E(Y|X, D=1) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X$$

For the base group, the model is a simple line with intercept $\beta_0$ and slope $\beta_1$. For the $D=1$ group, the intercept is shifted by $\beta_2$ and the slope is shifted by $\beta_3$. The two lines can cross — which is exactly what happens when the effect of $X$ differs across groups.
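The substitution above can be sketched directly: plug $D = 0$ or $D = 1$ into the model and read off each group's intercept and slope. The coefficients here are illustrative placeholders:

```python
def group_line(beta0, beta1, beta2, beta3, d):
    """Intercept and slope of the regression line for group D = d (0 or 1)."""
    return beta0 + beta2 * d, beta1 + beta3 * d

# Illustrative coefficients: D = 1 starts higher but gains less per unit of X
b0, b1, b2, b3 = 10.0, 2.0, 5.0, -1.5

print(group_line(b0, b1, b2, b3, d=0))  # (10.0, 2.0)  base group
print(group_line(b0, b1, b2, b3, d=1))  # (15.0, 0.5)  shifted intercept, flatter slope

# The two lines cross where beta2 + beta3 * X = 0, i.e. X = -beta2 / beta3
print(-b2 / b3)  # ~3.33: beyond this X, the base group overtakes
```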

Interactive — Dummy Variable Interaction

Red = base group (D=0), slope = $\beta_1$. Blue = D=1 group, slope = $\beta_1 + \beta_3$. The vertical gap at $X=0$ is $\beta_2$. If the lines cross, the two groups' advantage reverses at that point.

Worked Example: Salary, Experience, and Education

Worked Example — Labour Economics

A labour economist models annual salary (€) as a function of years of experience ($\text{Exp}$) and whether the worker holds a college degree ($\text{Edu} = 1$) or not ($\text{Edu} = 0$). The estimated model is:

$$\widehat{\text{Salary}} = 18{,}000 + 1{,}800 \cdot \text{Exp} - 45 \cdot \text{Exp}^2 + 600 \cdot (\text{Exp} \times \text{Edu})$$

Step 1: What Does Each Coefficient Mean?

$\hat{\beta}_1 = 1{,}800$: at the start of a career ($\text{Exp} \approx 0$), each extra year of experience adds €1,800 to salary for non-graduates. But this effect shrinks as experience grows (because $\hat{\beta}_2 = -45 < 0$).

$\hat{\beta}_2 = -45$: the parabola opens downward — wages rise with experience but at a diminishing rate, eventually peaking and (in theory) declining.

$\hat{\beta}_3 = 600$: college graduates benefit more from each year of experience. Their salary slope is $1{,}800 + 600 = 2{,}400$ at the start of their career, compared to $1{,}800$ for non-graduates.

Step 2: Find the Vertex (Peak Salary) for Each Group

For non-graduates ($\text{Edu} = 0$), the slope is $1{,}800 + 2 \cdot (-45) \cdot \text{Exp}$. Setting this to zero:

$$\text{Exp}^*_{\text{Edu=0}} = -\frac{1{,}800}{2 \cdot (-45)} = \frac{1{,}800}{90} = \mathbf{20 \text{ years}}$$

For college graduates ($\text{Edu} = 1$), the effective linear slope is $\hat{\beta}_1 + \hat{\beta}_3 = 1{,}800 + 600 = 2{,}400$:

$$\text{Exp}^*_{\text{Edu=1}} = -\frac{2{,}400}{2 \cdot (-45)} = \frac{2{,}400}{90} \approx \mathbf{26.7 \text{ years}}$$

College graduates reach their salary peak later and at a higher level — because their slope is steeper, it takes longer to be overcome by the $-45 \cdot \text{Exp}^2$ term.

Step 3: Predict Salary and Marginal Effects at Exp = 10

For a non-graduate with 10 years of experience:

$$\widehat{\text{Salary}} = 18{,}000 + 1{,}800 \cdot 10 - 45 \cdot 100 + 0 = 18{,}000 + 18{,}000 - 4{,}500 = \mathbf{€31{,}500}$$

For a college graduate with 10 years of experience:

$$\widehat{\text{Salary}} = 18{,}000 + 1{,}800 \cdot 10 - 45 \cdot 100 + 600 \cdot 10 = 31{,}500 + 6{,}000 = \mathbf{€37{,}500}$$

The marginal effect of one more year of experience at $\text{Exp} = 10$:

Non-graduate: $1{,}800 + 2 \cdot (-45) \cdot 10 = \mathbf{€900}$ per year
Graduate: $(1{,}800 + 600) + 2 \cdot (-45) \cdot 10 = \mathbf{€1{,}500}$ per year

At exactly 20 years of experience, the non-graduate's marginal effect reaches zero: $1{,}800 + 2 \cdot (-45) \cdot 20 = 1{,}800 - 1{,}800 = 0$. That's the wage peak for non-graduates. Any more experience beyond 20 years actually decreases predicted salary for this group.

Step 4: Interpret the Full Story

The model reveals a nuanced labour market story. Both groups see diminishing returns to experience — each extra year adds less and less salary — but college graduates maintain a stronger momentum: their salary keeps rising until 26.7 years of experience, reaching a peak of €50,000, versus the non-graduate peak of €36,000 at 20 years. The interaction term ($\hat{\beta}_3 = 600$) captures exactly this difference in momentum: college education doesn't just shift wages upward uniformly — it fundamentally changes how much people gain from experience.
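The arithmetic in Steps 2–4 can be checked directly. A quick sketch — the coefficients come straight from the fitted model above:

```python
def salary(exp, edu):
    """Fitted model from the worked example: 18,000 + 1,800*Exp - 45*Exp^2 + 600*Exp*Edu."""
    return 18_000 + 1_800 * exp - 45 * exp**2 + 600 * exp * edu

def slope(exp, edu):
    """Marginal effect of one more year of experience at the given Exp and Edu."""
    return (1_800 + 600 * edu) + 2 * (-45) * exp

print(salary(10, edu=0))  # 31500
print(salary(10, edu=1))  # 37500
print(slope(10, edu=0))   # 900
print(slope(10, edu=1))   # 1500

# Peaks: Exp* = (beta1 + beta3*edu) / 90, then evaluate salary at the peak
peak0, peak1 = 1_800 / 90, 2_400 / 90   # 20.0 and ~26.67 years
print(round(salary(peak0, edu=0)))      # 36000
print(round(salary(peak1, edu=1)))      # 50000
```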

Salary curves for non-graduates (red, D = 0) and graduates (blue, D = 1). The vertex markers show the experience level at which salary peaks for each group. The interaction term β₃ = 600 means each extra year of experience is worth €600 more for graduates — the widening gap between the two curves.