A Beginner's Guide to Lasso Regression

1. Introduction

In the previous post, we saw how Ridge Regression shrinks all coefficients toward zero to prevent overfitting. Ridge is great for stability, but it has one limitation: it never removes any feature completely. All coefficients end up small but non-zero, which means the model still uses every predictor, even irrelevant ones.

Lasso Regression (short for Least Absolute Shrinkage and Selection Operator) takes regularization one step further: it can drive some coefficients to exactly zero, effectively removing those features from the model entirely. This makes Lasso a tool for both regularization and automatic feature selection.

If you have 100 potential predictors and suspect only 10 of them truly matter, Lasso will often find those 10 and set the rest to zero, leaving you with a simpler, more interpretable model.

2. The Lasso Cost Function

To understand what makes Lasso different, compare the three loss functions side by side:

Ordinary Least Squares (OLS), minimize prediction error only:

J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Ridge, adds an L2 penalty (sum of squared coefficients):

J_{\text{ridge}}(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2

Lasso, adds an L1 penalty (sum of absolute values of coefficients):

J_{\text{lasso}}(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p |\beta_j|

The only difference between Ridge and Lasso is whether you square the coefficients (\(\beta_j^2\)) or take their absolute values (\(|\beta_j|\)). That one change has a dramatic effect on the solution.
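As a minimal sketch, the three objectives can be written as small NumPy functions. The function names, the toy arrays, and the choice to leave out an intercept (so that \(\hat{y} = X\beta\)) are assumptions made for illustration only.

```python
import numpy as np

def ols_loss(beta, X, y):
    """Sum of squared residuals, with predictions X @ beta (no intercept)."""
    residuals = y - X @ beta
    return float(np.sum(residuals ** 2))

def ridge_loss(beta, X, y, lam):
    """OLS loss plus the L2 penalty: lam * sum of squared coefficients."""
    return ols_loss(beta, X, y) + lam * float(np.sum(beta ** 2))

def lasso_loss(beta, X, y, lam):
    """OLS loss plus the L1 penalty: lam * sum of absolute coefficients."""
    return ols_loss(beta, X, y) + lam * float(np.sum(np.abs(beta)))

# Hypothetical toy values, chosen only so the arithmetic is easy to follow
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, -2.0])
lam = 0.5

print("OLS:  ", ols_loss(beta, X, y))
print("Ridge:", ridge_loss(beta, X, y, lam))
print("Lasso:", lasso_loss(beta, X, y, lam))
```

The squared-error term is identical in all three functions; only the penalty added to it changes.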

3. Why Does Lasso Set Coefficients to Zero?

The geometric explanation is the most intuitive. Think of the squared-error term as a set of concentric elliptical contours centered on the OLS solution (the unconstrained best fit). The penalty acts as a constraint region, a shape that limits how large the coefficients can be. The solution must sit inside this region or on its boundary.

Ridge's constraint region is a circle (smooth, curved everywhere).
Lasso's constraint region is a diamond (in two dimensions, four sharp corners that lie on the axes).

When the error ellipses expand outward from the OLS solution until they first touch the constraint region, they almost always touch a smooth part of the Ridge circle, giving a solution where both coefficients are non-zero. But the Lasso diamond has sharp corners right on the axes. The ellipses tend to hit these corners first, giving a solution where one coefficient is exactly zero.

In simple terms: Ridge has no corners, Lasso does, and those corners force coefficients to zero.

L1 norm diamond constraint and L2 norm circle constraint visualisation — **Figure:** The L1 ball (left, diamond shape) vs the L2 ball (right, circle). When the OLS solution contour expands and first hits the L1 diamond, it often touches a corner on an axis, forcing a coefficient to exactly zero. The smooth L2 circle has no corners, so it only shrinks coefficients. Source: Nicoguaro / Wikimedia Commons (CC BY 4.0)

Geometric comparison of L1 diamond and L2 circle constraint regions for Lasso vs Ridge — **Figure:** The sharp corners of the L1 diamond align with coordinate axes. When the OLS solution contour first touches the constraint, it typically does so at a corner, setting one coefficient to exactly zero. The smooth L2 circle only shrinks them.

Soft-Thresholding: The Math Behind Sparsity

Lasso's solutions are computed using a technique called soft-thresholding. For each coefficient, the update rule is:

S(z, \gamma) = \text{sign}(z) \cdot \max(|z| - \gamma, 0)

Here is what this means in plain English:

If the magnitude of the raw value, |z|, is at most the threshold γ, the coefficient is set to exactly zero.
If |z| is larger than γ, the value is moved toward zero by γ while keeping its sign (shrunk, but not to zero).

This "push-to-zero" behavior is what makes Lasso a feature selector, not just a shrinker.
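The operator is short enough to write in one line of NumPy. The sketch below is a vectorized version; the function name and the sample values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Apply S(z, gamma) = sign(z) * max(|z| - gamma, 0) elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

# Hypothetical raw values on both sides of the threshold gamma = 1
z = np.array([-3.0, -0.5, 0.0, 0.4, 2.0])
shrunk = soft_threshold(z, 1.0)
print(shrunk)
```

Entries with magnitude at most 1 come out as zero, while -3 and 2 move toward zero by exactly 1 and keep their sign.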

Soft-thresholding function showing V-shape with flat zero region around origin — **Figure:** The soft-thresholding function. Coefficients smaller than γ are collapsed to exactly zero (flat region at bottom). Larger coefficients are shifted toward zero by γ. This is how Lasso performs feature selection.

4. Example with a Tiny Dataset

Let us trace through the soft-thresholding steps manually on a simple four-point dataset:

X	y
1	2
2	3
3	5
4	7

Scatter plot with OLS regression line fitted to the toy dataset — **Figure:** OLS regression line (no regularization) fitted to the four-point dataset. Lasso will shrink this slope coefficient depending on λ.

The OLS solution (without regularization) is a slope of 1.7 with an intercept of exactly zero:

\hat{y} = 1.7x
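The least-squares line for these four points can be checked with the closed-form formulas for simple linear regression. This is a small verification sketch; the variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 7.0])

# Closed-form simple linear regression: slope = S_xy / S_xx
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# The fitted line always passes through the point (x_mean, y_mean)
intercept = y_mean - slope * x_mean

print("slope:", slope)
print("intercept:", intercept)
```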

Now let us apply Lasso with \(\lambda = 1\). In the simplified single-feature case (no intercept, unscaled \(x\)):

Step 1: Compute the raw correlation z

z = \frac{1}{n} \sum x_i y_i = \frac{1}{4}(1 \times 2 + 2 \times 3 + 3 \times 5 + 4 \times 7) = 12.75

Step 2: Compute the threshold γ

\gamma = \frac{\lambda}{2n} = \frac{1}{8} = 0.125

Step 3: Apply soft-thresholding

\beta_1 = S(12.75,\ 0.125) = 12.75 - 0.125 = 12.625

The value shrinks slightly from 12.75 to 12.625. If we had used a larger \(\lambda\), the threshold would be larger and the shrinkage more dramatic. At some point, \(|z|\) would fall below \(\gamma\) and the coefficient would collapse to exactly zero.

Important note on scale: The value 12.625 is the result of a single coordinate-descent sub-step applied to the raw (unnormalized) quantity \(z = \frac{1}{n}\sum x_i y_i\). It is not yet a fitted slope and should not be compared directly to the OLS slope, which comes from a formula that accounts for the spread of \(x\). To turn the sub-step into a slope for a no-intercept model, divide by \(\frac{1}{n}\sum x_i^2 = 7.5\), which gives \(12.625 / 7.5 \approx 1.683\). The steps above illustrate the soft-thresholding operator in isolation, not a complete Lasso model fit.

Step 4: Manual Python Demo


import numpy as np

X = np.array([1,2,3,4])
y = np.array([2,3,5,7])
n = len(y)

# Compute raw correlation
z = (1/n) * np.sum(X * y)
lam = 1
gamma = lam / (2*n)   # soft-threshold value

def soft_threshold(z, gamma):
    if z > gamma:
        return z - gamma
    elif z < -gamma:
        return z + gamma
    else:
        return 0   # coefficient set to zero

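beta1 = soft_threshold(z, gamma)
print("z =", z)           # 12.75
print("gamma =", gamma)   # 0.125
print("Updated coefficient β1 =", beta1)   # 12.625

# beta1 is still on the scale of z; divide by mean(x^2) to get a slope for predictions
slope = beta1 / np.mean(X**2)   # 12.625 / 7.5, about 1.683
print("Predictions:", slope * X)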
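The single sub-step traced above extends to several features by cycling through the coefficients and soft-thresholding each one against the residual left by the others. The following is a minimal coordinate-descent sketch for the Lasso cost function from Section 2, not a production solver. It assumes no intercept, a fixed number of passes instead of a convergence check, and no all-zero columns; the function names are illustrative.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Scalar soft-thresholding: sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize sum((y - X @ beta)**2) + lam * sum(|beta|) without an intercept."""
    n, p = X.shape
    beta = np.zeros(p)
    gamma = lam / (2 * n)  # same threshold as in Step 2
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution added back in
            r_j = y - X @ beta + X[:, j] * beta[j]
            z_j = np.mean(X[:, j] * r_j)
            # Threshold, then rescale by mean(x_j^2) to get a coefficient
            beta[j] = soft_threshold(z_j, gamma) / np.mean(X[:, j] ** 2)
    return beta

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 3.0, 5.0, 7.0])

print(lasso_coordinate_descent(X, y, lam=1.0))    # slope of about 1.683
print(lasso_coordinate_descent(X, y, lam=200.0))  # threshold 25 exceeds z = 12.75, so the slope is 0
```

With one feature the loop settles after the first pass, and the result is the thresholded value 12.625 divided by mean(x²) = 7.5.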

5. Using Scikit-learn

In practice, Lasso is optimized iteratively (coordinate descent) because the absolute value function is not differentiable at zero. Scikit-learn handles all of this for you:


from sklearn.linear_model import Lasso
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1],[2],[3],[4]])
y = np.array([2,3,5,7])

# Always standardize before Lasso: the penalty is scale-sensitive
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = Lasso(alpha=1)   # alpha plays the role of lambda; scikit-learn scales the squared error by 1/(2n)
model.fit(X_scaled, y)

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

With a large enough alpha, the coefficient prints as exactly 0.0: Lasso has removed that feature from the model. This is feature selection in action.
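How large is "large enough" for this toy dataset? The sketch below works it out by hand with NumPy. It assumes scikit-learn's documented Lasso objective, (1/(2n)) × squared error + alpha × L1 norm, and treats centering y as a stand-in for fitting the intercept. It does not call scikit-learn, and `lasso_coef` is a hypothetical helper rather than a library function.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 7.0])

# Standardize x as StandardScaler does (population standard deviation, ddof=0)
x_scaled = (X - X.mean()) / X.std()
y_centered = y - y.mean()   # centering y stands in for fitting an intercept

# With a standardized feature, mean(x_scaled**2) == 1, so under the assumed
# objective the coefficient is just the soft-thresholded value of z
z = np.mean(x_scaled * y_centered)

def lasso_coef(z, alpha):
    """Single-feature Lasso coefficient for a standardized feature (assumed objective)."""
    return np.sign(z) * max(abs(z) - alpha, 0.0)

print("smallest alpha that zeroes the coefficient:", abs(z))
for alpha in (0.5, 1.0, 2.0):
    print("alpha =", alpha, "-> coefficient", lasso_coef(z, alpha))
```

Under these assumptions the coefficient is shrunk but non-zero at alpha=1, and it reaches exactly zero once alpha exceeds roughly 1.9.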

6. Regularization Path

A regularization path shows how each coefficient changes as \(\lambda\) increases. For Lasso, coefficients reach zero one after another as \(\lambda\) grows, and a feature whose coefficient is zero is eliminated from the model. Features that survive longest (coefficients reaching zero only at high \(\lambda\)) are usually the strongest predictors.

Lasso coefficient regularization path showing coefficients shrinking to zero as lambda increases — **Figure:** Regularization path, each colored line is one predictor's coefficient as λ increases. Lines that hit zero early are the features Lasso eliminates first. The last lines to reach zero are the most important features.
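A small path can be computed by hand in the special case where the feature columns are orthonormal, because each coefficient is then the soft-thresholded value of its own dot product with y. The sketch below relies on that assumption and uses the cost function from Section 2, which gives a threshold of λ/2. The design matrix, the targets, and the function names are made up for illustration.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Elementwise sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

# Hypothetical design matrix with orthonormal columns (X.T @ X is the identity)
X = 0.5 * np.array([[1, 1, 1],
                    [1, -1, 1],
                    [1, 1, -1],
                    [1, -1, -1]], dtype=float)
y = np.array([2.1, 1.1, 1.9, 0.9])

def lasso_path_orthonormal(X, y, lambdas):
    """Lasso coefficients for each lambda, valid only for orthonormal columns."""
    z = X.T @ y  # OLS coefficients in the orthonormal case
    return np.array([soft_threshold(z, lam / 2.0) for lam in lambdas])

lambdas = [0.0, 1.0, 4.0, 8.0]
path = lasso_path_orthonormal(X, y, lambdas)
for lam, beta in zip(lambdas, path):
    print("lambda =", lam, "coefficients =", np.round(beta, 3))
```

Here the OLS coefficients are 3, 1, and 0.2. The weakest feature is already gone at λ = 1, the second at λ = 4, and the strongest at λ = 8.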

7. Ridge vs Lasso: A Quick Comparison

Aspect	Ridge (L2)	Lasso (L1)
Penalty type	Squared coefficients (\(\beta_j^2\))	Absolute coefficients (\(|\beta_j|\))
Constraint shape	Circle (smooth)	Diamond (sharp corners)
Coefficients reach zero?	No, only shrinks	Yes, can eliminate features
Best for	Correlated predictors, stability	Feature selection, sparse models
When predictors are correlated	Keeps all, shrinks all	Arbitrarily picks one, zeroes others

When predictors are correlated, Lasso tends to arbitrarily pick one and discard the others, which can be unstable. In that case, Elastic Net (covered in the next post) combines both penalties to get the best of both approaches.

8. Key Takeaways

Lasso uses an L1 penalty (absolute values), which can drive coefficients to exactly zero, performing automatic feature selection.
Ridge uses L2 (squared values), which only shrinks coefficients, never eliminating them.
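Always standardize features before applying Lasso: the L1 penalty is scale-sensitive, so features measured on different scales are penalized unequally (a feature on a small numeric scale needs a large coefficient and is therefore penalized more heavily).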
The choice of \(\lambda\) (called alpha in scikit-learn) is critical: too small → overfitting, too large → underfitting. Use cross-validation to tune it.
When you have many features and expect most to be irrelevant, Lasso is often the first regularization method to try.
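The standardization point can be seen on the four-point dataset from Section 4 by expressing the same feature in two different units. The sketch below reuses the single-feature, no-intercept simplification from that section. The helper name and the factor of 1000 are illustrative assumptions.

```python
import numpy as np

def single_feature_lasso(x, y, lam):
    """No-intercept Lasso slope for one feature: sum of squares + lam * |beta|."""
    n = len(y)
    z = np.mean(x * y)
    gamma = lam / (2 * n)
    shrunk = np.sign(z) * max(abs(z) - gamma, 0.0)
    return shrunk / np.mean(x ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 7.0])

# Same feature, same lambda, but the second version is recorded in units
# 1000 times larger, so its numeric values are 1000 times smaller
beta_original = single_feature_lasso(x, y, lam=1.0)
beta_rescaled = single_feature_lasso(x / 1000.0, y, lam=1.0)

print("original units:", beta_original)
print("rescaled units:", beta_rescaled)
```

The rescaled feature would need a coefficient 1000 times larger to make the same predictions, so the same \(\lambda\) penalizes it far more heavily and here removes it entirely. Standardizing puts every feature on a common scale before the penalty is applied.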

9. Conclusion

Lasso is a powerful tool that does two jobs at once: it reduces overfitting through regularization, and it simplifies models by automatically removing irrelevant predictors. The key intuition: Ridge shrinks, Lasso selects. Use Ridge when you want stability and all features matter; use Lasso when you suspect many features are irrelevant and want a cleaner model.

References

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55–67.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
Scikit-learn. Lasso Documentation
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2), 301–320.