Multiple Linear Regression: A Step-by-Step Explanation

In the previous post on Simple Linear Regression, we predicted house prices using just one variable: house size. The model worked reasonably well, but in the real world, prices depend on many factors at once: size, number of bedrooms, location, age of the building, and more.

Multiple Linear Regression (MLR) extends simple regression to handle multiple predictors simultaneously. Instead of fitting a line through 2D data, we now fit a plane (with two predictors) or a hyperplane (with three or more predictors) through multi-dimensional data. The math is more involved, but the core idea (minimize the squared errors) is exactly the same.

1. The Idea Behind Multiple Linear Regression

The MLR equation extends the simple regression line by adding a term for each new predictor:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \]

$y$: the outcome we want to predict (e.g., house price)
$x_1, x_2, \dots, x_p$: the predictors (e.g., area, bedrooms, age)
$\beta_0$: the intercept, i.e. the expected value of $y$ when all predictors are zero
$\beta_j$: the coefficient for predictor $j$, i.e. how much $y$ changes per one-unit increase in $x_j$, holding all other predictors constant
$\varepsilon$: the error term, i.e. the variation we cannot explain with the predictors

The "holding others constant" interpretation is crucial. Think of cooking: suppose you want to know how much salt affects the final taste. If you also change the amount of oil and spices at the same time, you cannot isolate the effect of salt. MLR addresses this problem: each coefficient estimates the effect of one ingredient while all others are held fixed.

**Figure:** Animation showing how the regression plane Y = β₀ + β₁X₁ + β₂X₂ tilts and shifts as its parameters change. With two predictors, ordinary least squares (OLS) fits a plane (not a line) through the 3D data cloud. Source: Cfbaf / Wikimedia Commons (Public Domain)

2. How the Coefficients Are Found

The Least Squares Principle

Just as in simple regression, we find coefficients by minimizing the Sum of Squared Errors (SSE):

\[ \text{SSE} = \sum_{i=1}^n \bigl(y_i - \hat{y}_i\bigr)^2 \]

where the fitted values are:

\[ \hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} \]

With one predictor, calculus gives us simple formulas. With many predictors, the algebra is best handled using matrix notation. The compact solution is:

\[ \mathbf{b} = (X^\top X)^{-1} X^\top y \]

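Here $X$ is the matrix of predictor values with a leading column of ones for the intercept, $y$ is the vector of observed outcomes, and $\mathbf{b}$ is the vector of estimated coefficients. You do not need to memorize this formula; modern software like scikit-learn handles it automatically. But knowing it exists helps you understand that MLR is still "minimize squared errors, just in higher dimensions."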
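As a minimal sketch of the matrix formula, assuming NumPy is available, the snippet below builds the design matrix for the five-house dataset used in the worked example later in this post and solves the normal equations. The variable names are illustrative, and `np.linalg.solve` is used in place of an explicit inverse, which is usually the numerically safer choice.

```python
import numpy as np

# Five-house dataset from the worked example later in this post.
area = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)   # x1, sq.ft
bedrooms = np.array([2, 3, 4, 3, 5], dtype=float)              # x2
price = np.array([250, 400, 450, 500, 550], dtype=float)       # y, in $k

# Design matrix: a column of ones (intercept) plus one column per predictor.
X = np.column_stack([np.ones_like(area), area, bedrooms])

# Solve (X^T X) b = X^T y instead of forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ price)

# Order: intercept, area coefficient, bedrooms coefficient.
print(np.round(b, 4))  # approximately [137.5, 0.125, 12.5]
```

These values agree with the hand calculation in the step-by-step solution. A least-squares routine such as `np.linalg.lstsq` should give the same coefficients.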

For this tutorial, we will use the deviations-from-the-mean approach because it makes the calculations more transparent.
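A short sketch of why the deviations approach works for two predictors. Here $S_{uv}$ is shorthand for a sum of products of deviations, the same kind of quantity tabulated in Step 3. Setting the derivative of SSE with respect to $b_0$ to zero gives $b_0 = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2$, so the fitted plane passes through the point of means. Substituting this back leaves a problem in $b_1$ and $b_2$ only:

\[ \text{SSE} = \sum_{i=1}^n \bigl[(y_i - \bar{y}) - b_1 (x_{1i} - \bar{x}_1) - b_2 (x_{2i} - \bar{x}_2)\bigr]^2 \]

Setting the derivatives with respect to $b_1$ and $b_2$ to zero yields the two normal equations in deviation form:

\[ S_{x_1 x_1} \, b_1 + S_{x_1 x_2} \, b_2 = S_{x_1 y} \]
\[ S_{x_1 x_2} \, b_1 + S_{x_2 x_2} \, b_2 = S_{x_2 y} \]

\[ \text{where} \quad S_{uv} = \sum_{i=1}^n (u_i - \bar{u})(v_i - \bar{v}) \]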

3. Example Dataset

We will extend our house price example by adding a second predictor: number of bedrooms. Both area and bedrooms should affect price, but by how much does each one matter when we account for the other simultaneously?

House	Area (sq.ft), x₁	Bedrooms, x₂	Price ($k), y
1	1000	2	250
2	1500	3	400
3	2000	4	450
4	2500	3	500
5	3000	5	550

4. Step-by-Step Solution

Step 1: Compute the Means

Find the average of each variable. Every deviation and cross-product calculation below uses these as reference points:

\[ \bar{x}_1 = 2000, \quad \bar{x}_2 = 3.4, \quad \bar{y} = 430 \]

Step 2: Compute Deviations from the Mean

Subtract each variable's mean from its values. Deviations reveal whether each house is above or below average on each dimension.

House	x₁	x₁ − x̄₁	x₂	x₂ − x̄₂	y	y − ȳ
1	1000	−1000	2	−1.4	250	−180
2	1500	−500	3	−0.4	400	−30
3	2000	0	4	+0.6	450	+20
4	2500	+500	3	−0.4	500	+70
5	3000	+1000	5	+1.6	550	+120

House 1 is below average on all three variables: smaller, fewer bedrooms, and cheaper than typical. House 5 is above average on all three. These aligned deviations suggest that both area and bedrooms move with price, which is consistent with the positive coefficients in the final model.
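Steps 1 and 2 can be reproduced with a few lines of plain Python. This is only a sketch; the helper names `mean` and `deviations` are illustrative and not part of any library.

```python
# Dataset from the table in section 3.
area = [1000, 1500, 2000, 2500, 3000]   # x1, sq.ft
bedrooms = [2, 3, 4, 3, 5]              # x2
price = [250, 400, 450, 500, 550]       # y, in $k


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def deviations(values):
    """Each value minus the mean of the list."""
    m = mean(values)
    return [v - m for v in values]


print(mean(area), mean(bedrooms), mean(price))

# One row per house: (x1 - mean, x2 - mean, y - mean), rounded for display.
for row in zip(deviations(area), deviations(bedrooms), deviations(price)):
    print(tuple(round(d, 1) for d in row))
```

A quick sanity check is that the deviations of each variable sum to zero, up to floating-point rounding.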

Step 3: Compute Cross-Products

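To solve the normal equations, we need five cross-product sums. A prime marks a deviation from the mean (for example, $x_1' = x_1 - \bar{x}_1$), and each $S$ is the column total shown in the Sum row. Think of each one as measuring how two variables "move together" across the dataset.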

House	(x₁')² = S_x₁x₁	(x₂')² = S_x₂x₂	x₁'y' = S_x₁y	x₂'y' = S_x₂y	x₁'x₂' = S_x₁x₂
1	1,000,000	1.96	180,000	252	1400
2	250,000	0.16	15,000	12	200
3	0	0.36	0	12	0
4	250,000	0.16	35,000	−28	−200
5	1,000,000	2.56	120,000	192	1600
Sum	2,500,000	5.2	350,000	440	3000
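The Sum row can be checked with a small helper. In this sketch, `cross_sum` is a hypothetical name for a function that computes the sum of products of deviations for any two columns.

```python
area = [1000, 1500, 2000, 2500, 3000]   # x1, sq.ft
bedrooms = [2, 3, 4, 3, 5]              # x2
price = [250, 400, 450, 500, 550]       # y, in $k


def cross_sum(u, v):
    """Sum over houses of (u - mean(u)) * (v - mean(v))."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))


S_x1x1 = cross_sum(area, area)
S_x2x2 = cross_sum(bedrooms, bedrooms)
S_x1y = cross_sum(area, price)
S_x2y = cross_sum(bedrooms, price)
S_x1x2 = cross_sum(area, bedrooms)

# Rounded only to hide floating-point noise in the bedroom deviations.
print(round(S_x1x1, 6), round(S_x2x2, 6), round(S_x1y, 6),
      round(S_x2y, 6), round(S_x1x2, 6))
```

Passing the same column twice gives a sum of squared deviations, which is why one function covers all five columns of the table.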

Step 4: Set Up and Solve the Normal Equations

These sums form a system of two equations with two unknowns ($b_1$ and $b_2$):

\[ 2{,}500{,}000 \, b_1 + 3000 \, b_2 = 350{,}000 \]
\[ 3000 \, b_1 + 5.2 \, b_2 = 440 \]

Solving this system (using substitution or matrix methods) gives:

\[ b_1 = 0.125, \quad b_2 = 12.5 \]
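One way to carry out the solving step is Cramer's rule for a 2×2 system. The sketch below plugs in the sums from the Step 3 table; the variable names are illustrative.

```python
# Cross-product sums from the Step 3 table.
S_x1x1, S_x2x2 = 2_500_000, 5.2
S_x1y, S_x2y = 350_000, 440
S_x1x2 = 3000

# Cramer's rule for the system
#   S_x1x1 * b1 + S_x1x2 * b2 = S_x1y
#   S_x1x2 * b1 + S_x2x2 * b2 = S_x2y
det = S_x1x1 * S_x2x2 - S_x1x2 ** 2
b1 = (S_x1y * S_x2x2 - S_x1x2 * S_x2y) / det
b2 = (S_x1x1 * S_x2y - S_x1x2 * S_x1y) / det

print(round(det), round(b1, 6), round(b2, 6))  # 4000000 0.125 12.5
```

The determinant shrinks toward zero as the two predictors become more strongly correlated, which is one way to see why near-duplicate predictors make the coefficients unstable.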

Interpretation:

Each additional square foot of area adds approximately $125 to the predicted price, holding bedrooms constant.
Each additional bedroom adds approximately $12,500 to the predicted price, holding area constant.

Notice that the area coefficient differs from the 0.14 we found in simple regression. That is expected: area and bedrooms are correlated, so once bedrooms enter the model, part of the effect previously credited to area alone is attributed to bedrooms. This is the essence of MLR: each coefficient is adjusted to account for the other predictors.
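The gap between the two area coefficients can be traced directly to the cross term between area and bedrooms. The sketch below, using the Step 3 sums and illustrative variable names, computes both slopes and shows that the simple slope equals the MLR slope plus the bedroom effect that area "absorbs" when bedrooms are left out.

```python
# Cross-product sums from the Step 3 table.
S_x1x1, S_x2x2 = 2_500_000, 5.2
S_x1y, S_x2y = 350_000, 440
S_x1x2 = 3000

# Simple regression on area alone ignores bedrooms entirely.
simple_slope = S_x1y / S_x1x1

# Multiple regression coefficients from the 2x2 normal equations.
det = S_x1x1 * S_x2x2 - S_x1x2 ** 2
b1 = (S_x1y * S_x2x2 - S_x1x2 * S_x2y) / det
b2 = (S_x1x1 * S_x2y - S_x1x2 * S_x1y) / det

# Dividing the first normal equation by S_x1x1 gives:
#   simple_slope = b1 + b2 * (S_x1x2 / S_x1x1)
absorbed = b2 * S_x1x2 / S_x1x1

print(round(simple_slope, 4), round(b1, 4), round(absorbed, 4))  # 0.14 0.125 0.015
```

If the two predictors were uncorrelated (`S_x1x2 = 0`), the absorbed term would vanish and the two slopes would coincide.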

Step 5: Compute the Intercept

The intercept is calculated from the means:

\[ b_0 = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 = 430 - (0.125)(2000) - (12.5)(3.4) = 430 - 250 - 42.5 = 137.5 \]

Final regression equation:

\[ \hat{y} = 137.5 + 0.125 x_1 + 12.5 x_2 \]
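As a sanity check on the final equation, the sketch below evaluates the SSE from section 2 at the fitted coefficients and at a few arbitrarily nudged alternatives. The function name `sse` and the nudged values are illustrative choices, not part of the derivation.

```python
area = [1000, 1500, 2000, 2500, 3000]   # x1, sq.ft
bedrooms = [2, 3, 4, 3, 5]              # x2
price = [250, 400, 450, 500, 550]       # y, in $k


def sse(b0, b1, b2):
    """Sum of squared errors for the plane y_hat = b0 + b1*x1 + b2*x2."""
    return sum(
        (y - (b0 + b1 * x1 + b2 * x2)) ** 2
        for x1, x2, y in zip(area, bedrooms, price)
    )


fitted_sse = sse(137.5, 0.125, 12.5)
print(fitted_sse)  # 3750.0

# Moving any single coefficient away from the fitted value raises the SSE.
for nudged in [(140.0, 0.125, 12.5), (137.5, 0.13, 12.5), (137.5, 0.125, 10.0)]:
    print(nudged, sse(*nudged) > fitted_sse)
```

No other choice of the three coefficients gives a smaller SSE on this dataset, which is what "least squares" means.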

Visualizing the Regression Plane

With two predictors, the fitted model is a flat surface in 3D space: not a line, but a plane. In the figure, each house is a blue dot and the red plane is the model's best fit.

**Figure:** 3D regression plane for house price prediction.

As both area and bedroom count increase, the plane rises, predicting a higher price. The plane cannot bend or curve; it is a linear model. If the true relationship curves, a linear model will be systematically off in some regions.

5. Making a Prediction

Suppose we want to predict the price of a 2,200 sq ft house with 4 bedrooms:

\[ \hat{y} = 137.5 + 0.125(2200) + 12.5(4) = 137.5 + 275 + 50 = 462.5 \text{ (\$k)} \]

Our model estimates this house would cost around $462,500. The estimate combines the area and bedroom contributions, something a simple regression on area alone could not do.
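The same prediction can be wrapped in a small function. This is a sketch: `predict_price` is a hypothetical helper with the fitted coefficients from this post as defaults.

```python
def predict_price(area_sqft, bedrooms, b0=137.5, b1=0.125, b2=12.5):
    """Predicted price in $k from the fitted plane y_hat = b0 + b1*x1 + b2*x2."""
    return b0 + b1 * area_sqft + b2 * bedrooms


print(predict_price(2200, 4))  # 462.5
```

The model was fitted on houses between 1,000 and 3,000 sq ft with 2 to 5 bedrooms, so predictions far outside that range should be treated with caution.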

6. Assumptions and Pitfalls

MLR makes several assumptions. When they are violated, predictions become unreliable or misleading. Here are the key ones every beginner should know:

Linearity: each predictor is assumed to have a straight-line effect on the outcome. If the true relationship is curved, the linear model will be systematically off.
No multicollinearity: predictors should not be very highly correlated with each other. If area and number of rooms are almost perfectly correlated, the model cannot tell which one is doing the work, and coefficients become unstable. Think of trying to figure out which of two nearly identical ingredients changes the taste: you cannot isolate them.
Homoscedasticity: the spread of residuals should be roughly constant across all predicted values. If large houses show much more price variability than small houses, this assumption is violated.
Independence of observations: each data point should be independent of the others. House prices in the same neighborhood may be correlated, violating this assumption.
Normality of residuals (for hypothesis tests): residuals should roughly follow a normal distribution if you want valid confidence intervals and p-values. For prediction accuracy alone, this assumption matters less.
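A quick multicollinearity check for the example dataset is sketched below. It computes the correlation between area and bedrooms and the variance inflation factor (VIF), a common diagnostic that this post does not otherwise cover. With exactly two predictors the VIF reduces to 1 / (1 − r²). The helper name `cross_sum` is illustrative.

```python
import math

area = [1000, 1500, 2000, 2500, 3000]   # x1, sq.ft
bedrooms = [2, 3, 4, 3, 5]              # x2


def cross_sum(u, v):
    """Sum over houses of (u - mean(u)) * (v - mean(v))."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))


# Pearson correlation between the two predictors.
r = cross_sum(area, bedrooms) / math.sqrt(
    cross_sum(area, area) * cross_sum(bedrooms, bedrooms)
)

# With two predictors, both share the same VIF = 1 / (1 - r^2).
vif = 1 / (1 - r ** 2)

print(round(r, 3), round(vif, 2))  # 0.832 3.25
```

Rules of thumb vary, but VIF values above roughly 5 to 10 are often treated as a warning sign. Here the predictors are clearly correlated, yet the coefficients can still be estimated.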

Conclusion

Multiple Linear Regression extends simple regression by balancing contributions from several variables simultaneously. We walked through the derivation, solved an example by hand, and interpreted the coefficients, noting that they always have an "all else equal" meaning.

MLR is one of the most widely used tools in data science because it is transparent and interpretable. Once you are comfortable with two predictors, adding more follows the same principles. The next challenge is regularization: techniques like Ridge and Lasso that handle cases where you have too many predictors or predictors that are highly correlated.