Statistics · Foundation Chapter · 25 min read

Statistics Basics: The Language of Data

Before any machine learning model, before any hypothesis test, before anything — there is data. And to make sense of data, you need a small set of core tools. This page gives you those tools. Not as a list of formulas to memorise, but as a set of ideas to truly understand. By the end, you'll know how to describe, compare, and draw conclusions from data — and you'll understand exactly why every formula looks the way it does.

Mean, Median, and Mode — Describing the Centre

Imagine you have a list of 10 numbers. How would you summarise them with a single value? The most natural answer is the mean (average): add everything up and divide by how many numbers you have.

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Where $\bar{x}$ (read: "x-bar") is the mean, $n$ is the count of data points, $x_i$ is each individual value (the subscript $i$ runs from 1 to $n$, labelling each observation), and $\sum_{i=1}^{n}$ is the summation sign — it means "add up the following term for every $i$ from 1 to $n$." The mean is pulled toward every value — which makes it sensitive to extreme numbers (called outliers).

The median is the middle value once you sort the data. Half the values are below it, half above. The mode is simply the most frequent value. Both are more robust to outliers than the mean — meaning an extreme value in your data won't drag them far off.

Intuition: If one billionaire walks into a room of 99 average earners, the mean salary in the room skyrockets — but the median barely moves. That's why median is preferred for income data.
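The billionaire-in-the-room effect is easy to check with Python's standard `statistics` module (the salary figures are invented for illustration):

```python
import statistics

# Hypothetical room: 99 average earners plus one billionaire.
salaries = [50_000] * 99 + [1_000_000_000]

mean = statistics.mean(salaries)      # pulled far up by the single outlier
median = statistics.median(salaries)  # barely affected

print(f"mean:   {mean:,.0f}")    # mean:   10,049,500
print(f"median: {median:,.0f}")  # median: 50,000
```

One extreme value moved the mean by a factor of 200 while the median stayed exactly where it was.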

Interactive — Mean vs. Median vs. Mode

Click Add Outlier to insert a value of 150 and watch the mean (red line) shift while the median (blue line) stays put — this is what being robust to outliers means in practice. Click Remove Outlier to undo.

Variance and Standard Deviation — Describing the Spread

The mean tells you where the centre is, but it says nothing about how spread out the data is. Two classes could have the same average grade but very different distributions — one all near the mean, one all over the place. That's where variance comes in.

The idea: measure how far each data point is from the mean, square it (so negatives and positives don't cancel), then average. The squaring also penalises large deviations more than small ones, which is what we want.

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

We divide by $n-1$ instead of $n$ because we are working with a sample, not the full population. Dividing by $n$ would slightly underestimate the true spread — $n-1$ corrects for this (this is called Bessel's correction). The standard deviation $s$ is simply the square root of the variance, which brings us back to the original units of the data:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Where: $s^2$ = sample variance, $s$ = sample standard deviation, $\bar{x}$ = sample mean, $n$ = number of observations, $x_i$ = each individual observation. The expression $(x_i - \bar{x})^2$ is the squared distance of one observation from the mean — squaring removes the minus sign so positives and negatives don't cancel each other out.

Intuition: Standard deviation is the "average distance from the mean." A small $s$ means data is tightly packed around the mean. A large $s$ means data is widely spread.
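The two-classes example can be made concrete (the grades are invented; note that `statistics.stdev` divides by $n-1$, matching the sample formula above):

```python
import statistics

# Two hypothetical classes with the same mean grade but very different spread.
class_a = [78, 80, 79, 81, 82]   # tightly packed around the mean
class_b = [50, 95, 60, 100, 95]  # all over the place

print(statistics.mean(class_a) == statistics.mean(class_b))  # True (both 80)
print(round(statistics.stdev(class_a), 2))  # 1.58  (small s: tight cluster)
print(round(statistics.stdev(class_b), 2))  # 23.18 (large s: wide spread)
```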

Interactive — Adjust the Spread

σ (sigma) controls the spread. Drag the slider to widen or narrow the distribution. A larger σ flattens and broadens the bell; a smaller σ makes it tall and tight.

Covariance — Do Two Variables Move Together?

So far we've only looked at one variable at a time. But most interesting questions involve two variables: does more study time lead to higher grades? Does higher temperature boost ice cream sales? The covariance measures whether two variables tend to move in the same direction or opposite directions.

$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Where $s_{xy}$ is the sample covariance between variables $x$ and $y$, $\bar{x}$ and $\bar{y}$ are their means. When $x$ is above its mean and $y$ is also above its mean, the product $(x_i - \bar{x})(y_i - \bar{y})$ is positive. When they move in opposite directions, the product is negative. Averaging these products over all pairs gives the covariance.

Interpretation: A positive covariance means the variables tend to increase together. A negative covariance means when one goes up, the other tends to go down. Zero means no systematic linear relationship.

The problem with covariance is that its value depends on the units of measurement. Comparing covariances across different datasets is meaningless. That's why we normalise it — and get the correlation.

Intuition: Covariance tells you the direction of a relationship but not its strength in any comparable way. For that, use correlation.
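The formula translates directly into Python (the temperature and hot-drink figures are invented to show a negative relationship):

```python
def sample_covariance(xs, ys):
    """Sum of (x - x_bar) * (y - y_bar) over all pairs, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: warmer days, fewer hot drinks sold.
temps = [5, 10, 15, 20, 25]
hot_drinks = [90, 70, 60, 40, 20]

print(sample_covariance(temps, hot_drinks))  # -212.5 (negative: opposite directions)
```

The sign tells you the direction; the magnitude (-212.5 here) depends entirely on the units, which is exactly the limitation discussed above.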

Interactive — Positive, Negative, and Zero Covariance

Each dot is one (X, Y) data pair. Click the buttons to switch between positive covariance (cloud tilts up-right), negative (cloud tilts down-right), and zero (cloud is circular with no directional pattern).

Correlation Coefficient — How Strong Is the Relationship?

The correlation coefficient $r_{xy}$ solves the unit problem of covariance. It standardises the covariance by dividing by the standard deviations of both variables — so the result always lies between $-1$ and $+1$, regardless of units.

$$r_{xy} = \frac{s_{xy}}{s_x \cdot s_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

Where $s_x$ and $s_y$ are the standard deviations of $x$ and $y$ respectively, and the $\cdot$ in $s_x \cdot s_y$ denotes multiplication. Interpretation: $r = +1$ means perfect positive linear relationship. $r = -1$ means perfect negative linear relationship. $r = 0$ means no linear relationship (though there could still be a non-linear one). Values between $0.7$ and $1$ (or $-0.7$ and $-1$) are generally considered strong.

Important: Correlation does not imply causation. Two variables can correlate perfectly without one causing the other.
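A direct implementation of the right-hand side of the formula (the $\frac{1}{n-1}$ factors in numerator and denominator cancel, so they can be left out). The metres-to-feet pairing is a made-up example of a perfectly linear, unit-independent relationship:

```python
import math

def pearson_r(xs, ys):
    """Covariance over the product of standard deviations; the 1/(n-1) factors cancel."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return sxy / (sx * sy)

# A perfect linear relationship gives r = +1 regardless of units.
metres = [1, 2, 3, 4, 5]
feet = [m * 3.281 for m in metres]
print(round(pearson_r(metres, feet), 6))  # 1.0
```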

Interactive — Control the Correlation

r controls the strength and direction of the linear relationship. At r = +1 the points form a perfect upward line; at r = −1 a perfect downward line; at r = 0 the cloud is a circle with no pattern.

The Normal Distribution — The Bell Curve

Many natural phenomena — heights, measurement errors, exam scores — tend to cluster around a central value and become increasingly rare as you move away from it. This produces the famous bell-shaped curve, described by the normal distribution.

The normal distribution is fully described by just two parameters: the mean $\mu$ (where the peak sits) and the standard deviation $\sigma$ (how wide the bell is).

$$f(x;\,\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Where $\mu$ (Greek letter "mu") is the mean (centre of the distribution), $\sigma$ (Greek letter "sigma") is the standard deviation (controls the width), $e$ is Euler's number (~2.718, the base of the natural logarithm), $\pi \approx 3.14159$ is pi, and $\approx$ means "approximately equal to." A key property: about 68% of the data lies within $\pm 1\sigma$, 95% within $\pm 2\sigma$, and 99.7% within $\pm 3\sigma$. This is the so-called 68–95–99.7 rule.

Intuition: The normal distribution is the natural shape that emerges when many small, independent random effects add together — which is why it shows up everywhere in nature (Central Limit Theorem).
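Both the density formula and the 68–95–99.7 rule can be verified with the standard library; the CDF can be expressed via `math.erf`, so no lookup table is needed:

```python
import math

def normal_pdf(x, mu, sigma):
    """The density f(x; mu, sigma) from the formula above."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    """P(X <= x), written in terms of the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 0, 1
for k in (1, 2, 3):
    within = normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)
    print(f"within +/-{k} sigma: {within:.3f}")  # 0.683, 0.954, 0.997
```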

Interactive — Shape the Bell Curve

μ (mu) shifts the peak left or right. σ (sigma) stretches or compresses the bell — a larger σ makes it wider and flatter. The total area under the curve always stays 1, so widening it lowers the peak height.

The t-Distribution — When You Don't Know the True Spread

The normal distribution assumes you know the true population standard deviation $\sigma$. In practice, you almost never do — you only have the sample standard deviation $s$. When you use $s$ instead of $\sigma$, the resulting distribution has heavier tails than the normal: it is less certain, more spread out. This is the t-distribution.

$$f(t;\,k) = \frac{\Gamma\!\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\;\Gamma\!\left(\frac{k}{2}\right)} \left(1+\frac{t^2}{k}\right)^{-\frac{k+1}{2}}$$

Where $k$ is the degrees of freedom (in most tests, $k = n - 1$, i.e. sample size minus one), $f(t;\,k)$ is the probability density at value $t$ given $k$ degrees of freedom, and $\Gamma(\cdot)$ is the gamma function — a generalisation of the factorial to non-integer numbers (you don't need to compute it by hand; software does it). As $k$ increases (i.e. as your sample size grows), the t-distribution approaches the normal distribution. With small samples (say $n = 5$), the tails are much heavier — reflecting the added uncertainty from not knowing $\sigma$.

Intuition: With a tiny sample you're less sure of your estimate, so extreme results should be less surprising. The t-distribution accounts for this extra uncertainty with fatter tails.
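The density formula above translates directly, since `math.gamma` provides the gamma function. Comparing it against the standard normal makes the heavier tails visible:

```python
import math

def t_pdf(t, k):
    """The density f(t; k) from the formula above, with k degrees of freedom."""
    coeff = math.gamma((k + 1) / 2) / (math.sqrt(k * math.pi) * math.gamma(k / 2))
    return coeff * (1 + t ** 2 / k) ** (-(k + 1) / 2)

def std_normal_pdf(x):
    return math.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)

# At t = 3 (far out in the tail) the t-density with k = 4 sits well above the normal.
print(round(t_pdf(3, 4), 4))        # 0.0197
print(round(std_normal_pdf(3), 4))  # 0.0044
```

With k = 4 degrees of freedom a value three units from the centre is more than four times as likely as under the normal, which is precisely the "extreme results should be less surprising" effect.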

Interactive — t-Distribution vs. Normal

Grey = standard normal (used when σ is known). Red = t-distribution with k degrees of freedom. As k grows toward infinity the two curves converge — this is why large samples let you treat t ≈ normal.

The p-Value — How Surprising Is Your Result?

You ran an experiment and got a result. But could you have gotten this result just by chance, even if there was actually nothing going on? The p-value answers exactly that. It is the probability of observing a result as extreme as yours, or more extreme, assuming there is no real effect (the so-called null hypothesis $H_0$).

Small p-value → your result is very unlikely under $H_0$ → you have evidence against $H_0$. Here, $H_0$ (read: "H-naught" or "null hypothesis") is the assumption that there is no real effect. The conventional threshold is $\alpha = 0.05$ (where $\alpha$ is the Greek letter "alpha", used for the significance level): if $p < 0.05$, you reject $H_0$ and call the result "statistically significant."

In a two-sided t-test, the p-value is:

$$p = 2 \cdot P(T \geq |t_{\text{obs}}|) = 2\left[1 - F_{t,k}(|t_{\text{obs}}|)\right]$$

Where: $t_{\text{obs}}$ is your observed t-statistic (the test result from your data), $|t_{\text{obs}}|$ is the absolute value of $t_{\text{obs}}$ (i.e. its distance from zero, always positive — $|{-2.3}| = 2.3$), $P(T \geq \ldots)$ means "the probability that a random variable $T$ is greater than or equal to …", and $F_{t,k}$ is the cumulative distribution function (CDF) of the t-distribution with $k$ degrees of freedom — it gives the probability that $T$ is less than or equal to a given value. We multiply by 2 because we consider both tails (a result equally surprising in either direction counts as evidence against $H_0$).

Common misconception: The p-value is NOT the probability that $H_0$ is true. It is the probability of your data (or more extreme) given that $H_0$ is true. That's a crucial difference.
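Without a statistics library, the two-sided p-value can be approximated by numerically integrating the t-density between $-|t_{\text{obs}}|$ and $+|t_{\text{obs}}|$ and subtracting from 1. The observed value $t_{\text{obs}} = 2.5$ with $k = 9$ is a made-up example:

```python
import math

def t_pdf(t, k):
    coeff = math.gamma((k + 1) / 2) / (math.sqrt(k * math.pi) * math.gamma(k / 2))
    return coeff * (1 + t ** 2 / k) ** (-(k + 1) / 2)

def two_sided_p(t_obs, k, steps=20_000):
    """p = 2 * P(T >= |t_obs|) = 1 - P(-|t_obs| <= T <= |t_obs|), via midpoint-rule integration."""
    a, b = -abs(t_obs), abs(t_obs)
    h = (b - a) / steps
    middle = sum(t_pdf(a + (i + 0.5) * h, k) for i in range(steps)) * h
    return 1 - middle  # the two tails together

# Hypothetical test result: t_obs = 2.5 with k = 9 degrees of freedom.
p = two_sided_p(2.5, 9)
print(round(p, 3), p < 0.05)  # about 0.034: significant at alpha = 0.05
```

In practice you would use a library routine for the CDF, but the numerical version makes the definition tangible: the p-value really is just the area in the two tails.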

Interactive — p-Value Visualised

The red shaded areas are both tails — the region as extreme as or more extreme than your t-statistic. The p-value equals the total shaded area. The dashed line marks the α = 0.05 threshold.

Confidence Intervals — A Range of Plausible Values

A point estimate (like the sample mean) gives you a single best guess. But every estimate has uncertainty — your sample is just one of many possible samples from the population. A confidence interval (CI) gives you a range of values that plausibly contains the true population parameter.

A 95% CI for the population mean:

$$\bar{x} \pm t_{n-1,\,\alpha/2} \cdot \frac{s}{\sqrt{n}}$$

Where: $\bar{x}$ is the sample mean, $\pm$ means "plus or minus" (giving both the upper and lower bound of the interval), $t_{n-1,\,\alpha/2}$ is the critical t-value — the threshold from the t-distribution with $n-1$ degrees of freedom at significance level $\alpha/2$ (for $\alpha = 0.05$: the 97.5th percentile, approximately 1.96 for large $n$), $s$ is the sample standard deviation, and $\sqrt{n}$ is the square root of the sample size. The term $\frac{s}{\sqrt{n}}$ is called the standard error — it measures how much the sample mean is expected to vary from sample to sample.

Correct interpretation: if you repeated the experiment many times, 95% of the resulting confidence intervals would contain the true population mean. It does NOT mean there's a 95% chance the true mean lies in this specific interval.

Intuition: A larger sample $n$ shrinks the interval (more data → more certainty). A larger standard deviation $s$ widens it (more spread in data → more uncertainty).
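Using the exam scores from the worked example below ($n = 10$, so 9 degrees of freedom), the interval works out as follows. The critical value 2.262 is the standard t-table entry for a 95% CI at 9 degrees of freedom:

```python
import math
import statistics

scores = [72, 85, 90, 68, 95, 78, 82, 88, 76, 80]
n = len(scores)
x_bar = statistics.mean(scores)   # 81.4
s = statistics.stdev(scores)      # about 8.34
se = s / math.sqrt(n)             # standard error of the mean

t_crit = 2.262  # t_{9, 0.025}: 97.5th percentile of the t-distribution with 9 degrees of freedom
lower, upper = x_bar - t_crit * se, x_bar + t_crit * se
print(f"95% CI: ({lower:.1f}, {upper:.1f})")  # 95% CI: (75.4, 87.4)
```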

Interactive — Confidence Interval

Each bar is a 95% confidence interval around the estimated mean. The dashed line marks the true mean (50). Increasing n narrows the bar; increasing σ widens it.

Worked Example — Analysing Student Exam Scores

A professor records the exam scores of 10 students:

72, 85, 90, 68, 95, 78, 82, 88, 76, 80

Step 1 — Calculate the Mean

$$\bar{x} = \frac{72+85+90+68+95+78+82+88+76+80}{10} = \frac{814}{10} = 81.4$$

Step 2 — Find the Median

Sort the data: 68, 72, 76, 78, 80, 82, 85, 88, 90, 95. With $n=10$ (even), the median is the average of the 5th and 6th values:

$$\text{Median} = \frac{80 + 82}{2} = 81.0$$

The mean (81.4) and median (81.0) are very close — this suggests no strong outliers are skewing the data.

Step 3 — Calculate Sample Variance and Std Dev

Using $\bar{x} = 81.4$, compute $(x_i - 81.4)^2$ for each score, sum them, divide by $n-1 = 9$:

$$s^2 = \frac{(72-81.4)^2 + (85-81.4)^2 + \cdots + (80-81.4)^2}{9} = \frac{626.40}{9} = 69.60$$
$$s = \sqrt{69.60} \approx 8.34$$

So scores typically deviate from the mean by about 8.3 points.
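Steps 1 to 3 can be reproduced with the standard `statistics` module (`variance` and `stdev` both divide by $n - 1$, as in Step 3):

```python
import statistics

scores = [72, 85, 90, 68, 95, 78, 82, 88, 76, 80]

print(statistics.mean(scores))                # 81.4
print(statistics.median(scores))              # 81.0
print(round(statistics.variance(scores), 2))  # 69.6
print(round(statistics.stdev(scores), 2))     # 8.34
```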

Step 4 — Covariance and Correlation (Study Hours vs Score)

Suppose you also tracked weekly study hours for 5 of these students: hours = [2, 4, 6, 8, 10] and scores = [55, 65, 75, 82, 90].

$$\bar{h} = 6.0, \quad \bar{s} = 73.4$$
$$s_{hs} = \frac{(2-6)(55-73.4)+(4-6)(65-73.4)+\cdots+(10-6)(90-73.4)}{4} = \frac{174}{4} = 43.5$$
$$r_{hs} = \frac{43.5}{\sqrt{10}\cdot\sqrt{190.3}} \approx 0.997$$

A correlation of 0.997 is nearly perfect — study hours and exam scores are almost perfectly linearly related in this sample.
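Step 4 can be checked in a few lines; the denominator uses the sample variances $40/4 = 10$ for hours and $761.2/4 = 190.3$ for scores:

```python
hours = [2, 4, 6, 8, 10]
scores = [55, 65, 75, 82, 90]
n = len(hours)
h_bar = sum(hours) / n    # 6.0
s_bar = sum(scores) / n   # 73.4

cov = sum((h - h_bar) * (s - s_bar) for h, s in zip(hours, scores)) / (n - 1)
var_h = sum((h - h_bar) ** 2 for h in hours) / (n - 1)
var_s = sum((s - s_bar) ** 2 for s in scores) / (n - 1)
r = cov / (var_h ** 0.5 * var_s ** 0.5)

print(round(cov, 1))  # 43.5
print(round(r, 3))    # 0.997
```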

Step 5 — Interpretation

The data shows a tight distribution around 81.4 (std dev ~8.3), and study hours predict scores almost perfectly (r ≈ 1). If you wanted to decide whether to extend or shorten study material, these numbers give you a solid, quantitative foundation.