A Mathematical Deep Dive into Why L1 (Lasso) Induces Sparsity Over L2 (Ridge)
Abstract
Regularization is introduced in most Machine Learning curricula as a technique to combat overfitting by penalizing large model weights. Students frequently memorize the L1 and L2 penalty formulas without grasping why their geometric properties produce such drastically different outcomes. This article presents a rigorous yet intuitive account of the mathematics behind both penalties — demonstrating, from first principles, why L1 regularization drives feature weights to exact zeros (thereby performing automatic feature selection) while L2 regularization merely shrinks weights asymptotically toward zero. The treatment progresses from the loss function formulation, through the geometry of constraint regions, to practical executable code that every claim can be verified against.
1. Motivation: Why Regularization Exists
Before we examine the geometry, we must be precise about the problem regularization solves. Consider a linear regression model with `p` features and `n` training observations, where `p` is large relative to `n`. Minimizing the training loss alone (the Mean Squared Error, MSE) gives a model that achieves near-zero training error by fitting the noise in the data as well as the signal.
This pathology is called overfitting: the model memorizes the training set and generalizes poorly to unseen data. In weight space, overfitting typically shows up as parameter vectors `w` with very large magnitudes: the model compensates for imperfect data by making its weights large in opposing directions, so that their contributions cancel on the training points while fitting the noise.
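The claim above can be checked numerically. The following is a minimal sketch, assuming a synthetic setup of our own choosing (40 training rows, 40 features, unit-variance noise; none of these sizes come from the pipeline used later). With as many features as observations, unregularized least squares interpolates the training set, so the training error collapses while the test error does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 40, 1000, 40        # assumed sizes: as many features as training rows

w_true = np.zeros(p)
w_true[:3] = [3.5, -2.1, 1.8]            # only three features carry signal

def make_data(n):
    X = rng.standard_normal((n, p))
    y = X @ w_true + rng.standard_normal(n)   # unit-variance noise
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# Unregularized least squares: with p == n the fit interpolates the training set
w_ols, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

train_mse = np.mean((y_tr - X_tr @ w_ols) ** 2)
test_mse = np.mean((y_te - X_te @ w_ols) ** 2)

print(f"train MSE : {train_mse:.2e}")
print(f"test  MSE : {test_mse:.2f}")
print(f"||w_ols|| = {np.linalg.norm(w_ols):.2f}   ||w_true|| = {np.linalg.norm(w_true):.2f}")
```

Comparing the two printed norms gives a quick feel for how far the fitted weights have drifted from the sparse ground truth.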
Regularization adds a penalty term to the loss function that grows with the size of the weights, creating an explicit cost for complexity. The two dominant choices — L1 and L2 norms — lead to profoundly different solutions, and understanding why requires us to look at the geometry of their constraint regions.
> Core Insight (Keep This in Mind Throughout)
> The fundamental question we are answering: given that both L1 and L2 penalties shrink weights, why does L1 produce exact zeros while L2 does not? The answer lies in the shape of the geometric constraint regions: one has corners, the other does not.
2. The Business Scenario & Data Setup
We will build a **continuous Python pipeline**. We simulate a real-world sparse regression problem: a 10-feature dataset where only 3 features are truly predictive and 7 are pure noise. This mirrors practical scenarios in genomics, financial modeling, or sensor arrays, where most input signals are irrelevant.
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge, ElasticNet, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# ── 1. Seed for reproducibility
np.random.seed(42)
n_samples, n_features = 200, 10

# ── 2. Synthetic Data Generation (sparse ground truth)
X_raw = np.random.randn(n_samples, n_features)
# True model: only 3 features matter, 7 are pure noise
true_weights = np.array([3.5, -2.1, 1.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_raw = X_raw @ true_weights + np.random.randn(n_samples) * 0.5
feature_names = [f"Feature_{i+1}" for i in range(n_features)]
df = pd.DataFrame(X_raw, columns=feature_names)
df['Target'] = y_raw

# ── 3. Standardize: mandatory before any regularized regression.
# NOTE: for brevity the scaler is fitted on the full dataset here; in a
# production pipeline, fit it on the training split only (see Section 16).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_raw, test_size=0.2, random_state=42
)

print("=== DATASET SUMMARY ===")
print(f"Total samples : {n_samples}")
print(f"Total features : {n_features}")
print(f"Active (non-zero) : {(true_weights != 0).sum()}")
print(f"Inactive (zero) : {(true_weights == 0).sum()}")
print(f"Train / Test : {len(X_train)} / {len(X_test)}")
print(f"\nTrue weight vector: {true_weights}")
```
**Output:**
```
=== DATASET SUMMARY ===
Total samples : 200
Total features : 10
Active (non-zero) : 3
Inactive (zero) : 7
Train / Test : 160 / 40

True weight vector: [ 3.5 -2.1 1.8 0. 0. 0. 0. 0. 0. 0. ]
```
Data Interpretation: The ground truth has three active features (`Feature_1`, `Feature_2`, `Feature_3`) and seven zero-weight noise features. A perfect regularizer would recover exactly this sparse pattern. We will now test which penalty achieves this.
3. Baseline Mathematics: Defining the Loss and Penalties
3.1 The Unregularized Objective
For a dataset with `n` samples and `p` features, let `X ∈ ℝⁿˣᵖ` be the design matrix, `y ∈ ℝⁿ` the target vector, and `w ∈ ℝᵖ` the weight vector. The standard ordinary least squares (OLS) objective is:
L(w) = (1/n) · ||y − Xw||₂² = (1/n) · Σᵢ (yᵢ − xᵢᵀw)²
This is a convex quadratic bowl in `p`-dimensional weight space. Its level sets (the sets of `w` that produce equal loss values) form ellipses, or ellipsoids in higher dimensions, centered at the OLS solution `ŵ = (XᵀX)⁻¹Xᵀy`, which exists whenever `XᵀX` is invertible.
3.2 Adding the Regularization Penalty
The regularized objective is formed by adding a penalty term scaled by a hyperparameter `λ > 0`:
L_reg(w) = L(w) + λ · Penalty(w)
| | L1 Penalty (Lasso) | L2 Penalty (Ridge) |
|-|-|-|
| Penalty | `λ Σᵢ \|wᵢ\|` (sum of absolute values) | `λ Σᵢ wᵢ²` (sum of squared values) |
| Full Objective | `L₁(w) = (1/n)‖y − Xw‖₂² + λ Σᵢ \|wᵢ\|` | `L₂(w) = (1/n)‖y − Xw‖₂² + λ Σᵢ wᵢ²` |
| Also written | `λ‖w‖₁` | `λ‖w‖₂²` |
3.3 The Lagrangian (Constrained) Formulation
A crucial insight: the penalized form above is mathematically equivalent to a constrained optimization problem. This equivalence is what makes the geometric interpretation possible.
For every value of `λ`, there exists a budget constant `t > 0` such that:
Lasso (L1): min L(w) subject to Σᵢ |wᵢ| ≤ t
Ridge (L2): min L(w) subject to Σᵢ wᵢ² ≤ t
In plain language: we are minimizing the loss subject to the constraint that the norm of the weight vector stays within a budget of size t. The sets `{w : Σ|wᵢ| ≤ t}` and `{w : Σwᵢ² ≤ t}` define the constraint regions — and their shapes are the key to everything that follows.
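The correspondence between `λ` and the budget `t` can be made concrete for Ridge, where the penalized problem has a closed form. The sketch below is illustrative only: it assumes the `(1/n)||y − Xw||₂² + λ Σ wᵢ²` objective from Section 3.2 without an intercept, and `ridge_closed_form` is a helper name introduced here, not part of the article's pipeline. The budget implied by a given `λ` is simply the squared norm of the penalized solution, and it shrinks as `λ` grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = X @ np.array([3.5, -2.1, 1.8, 0.0, 0.0]) + 0.5 * rng.standard_normal(n)

def ridge_closed_form(X, y, lam):
    """Minimizer of (1/n)||y - Xw||^2 + lam * ||w||^2 (no intercept)."""
    n, p = X.shape
    # Setting the gradient to zero gives (X^T X + n*lam*I) w = X^T y
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

lams = np.logspace(-3, 2, 6)
# The constraint budget matching each lambda is t = sum_i w_i(lambda)^2
budgets = [float(np.sum(ridge_closed_form(X, y, lam) ** 2)) for lam in lams]

for lam, t in zip(lams, budgets):
    print(f"lambda = {lam:8.3f}  ->  t = {t:.4f}")
```

Each row pairs one penalty strength with the radius of the constraint region it is equivalent to: a larger `λ` corresponds to a tighter budget.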
4. The Core Geometric Intuition
4.1 The Error Contour Lines
Since `L(w)` is a convex quadratic, its level sets (also called **isocost contours** or error ellipses) are **ellipses centered at the unconstrained OLS solution ŵ**. Points on the same ellipse have identical loss values. Moving outward from `ŵ` along any direction increases the loss.
Think of it like a topographic map of a valley. The OLS solution `ŵ` is the lowest point of the valley. The contour lines are the elevation rings around it — each ring marking equal altitude (equal loss). Our job is to find the lowest-altitude contour that just touches the constraint region.
> The Optimization Problem — Geometric Statement:
> Find the point on the boundary of the constraint region that lies on the smallest possible loss contour. This is the point where the loss contour is tangent to the constraint boundary — the optimal regularized solution `w*`.
4.2 Visualizing the Two Constraint Regions
```python
fig, axes = plt.subplots(1, 2, figsize=(13, 6))
theta = np.linspace(0, 2*np.pi, 400)
t = 1.5  # constraint budget

for ax, title, color, shape in zip(
        axes,
        ["L1 Constraint Region: Diamond (Lasso)", "L2 Constraint Region: Circle (Ridge)"],
        ["#3498db", "#e74c3c"], ["L1", "L2"]):
    if shape == "L1":
        # L1 ball: a rotated square (diamond)
        dx = [t, 0, -t, 0, t]
        dy = [0, t, 0, -t, 0]
        ax.fill(dx, dy, alpha=0.25, color=color, label='L1 Constraint Region')
        ax.plot(dx, dy, color=color, lw=2.5)
        for cx, cy in [(t, 0), (0, t), (-t, 0), (0, -t)]:
            ax.plot(cx, cy, 'o', color='red', markersize=10, zorder=5)
        ax.annotate("Corner\n(sparse solution)", xy=(t, 0), xytext=(t+0.3, 0.5),
                    arrowprops=dict(arrowstyle='->', color='black'), fontsize=9)
    else:
        # L2 ball: a smooth disk
        ax.fill(t*np.cos(theta), t*np.sin(theta), alpha=0.25, color=color, label='L2 Constraint Region')
        ax.plot(t*np.cos(theta), t*np.sin(theta), color=color, lw=2.5)

    # OLS solution (outside constraint region)
    ols_x, ols_y = 1.8, 1.2
    ax.plot(ols_x, ols_y, '*', color='black', markersize=14, zorder=6, label='OLS Solution ŵ')

    # Expanding loss ellipses
    for scale in [0.6, 1.0, 1.45, 1.95]:
        ax.plot(ols_x + scale*0.9*np.cos(theta),
                ols_y + scale*0.5*np.sin(theta),
                '--', color='#555', lw=1.0, alpha=0.6)

    # Schematic marker for the constrained optimum (placed by hand, not computed)
    if shape == "L1":
        ax.plot(t, 0, 's', color='#27ae60', markersize=12, zorder=7, label='Optimal w* (zero weight!)')
    else:
        # Positive angle: the marker must sit in the same quadrant as ŵ
        tx, ty = t*np.cos(np.radians(35)), t*np.sin(np.radians(35))
        ax.plot(tx, ty, 's', color='#27ae60', markersize=12, zorder=7, label='Optimal w* (non-zero)')

    ax.axhline(0, color='gray', lw=0.8)
    ax.axvline(0, color='gray', lw=0.8)
    ax.set_xlim(-2.5, 2.5)
    ax.set_ylim(-2.5, 2.5)
    ax.set_xlabel("w₁", fontsize=12)
    ax.set_ylabel("w₂", fontsize=12)
    ax.set_title(title, fontsize=13, fontweight='bold')
    ax.legend(loc='lower left', fontsize=8)
    ax.set_aspect('equal')

plt.tight_layout()
plt.savefig('geometry_constraint.png', dpi=150, bbox_inches='tight')
plt.show()
```
4.3 The L1 Constraint Region — The Diamond
Shape and Structure – The L1 constraint region in 2D — the set `{(w₁, w₂) : |w₁| + |w₂| ≤ t}` — is a square rotated 45 degrees (a diamond or rhombus), with vertices at `(t, 0)`, `(−t, 0)`, `(0, t)`, `(0, −t)`. In higher dimensions, the L1 constraint region generalizes to a cross-polytope — a geometric figure with `2p` vertices, each located on a coordinate axis.
Why Corners Produce Zeros. The corners of the L1 diamond lie precisely on the coordinate axes, at points where exactly one weight is non-zero and all others are exactly zero. As the loss ellipse expands outward from `ŵ`, its first contact with the diamond falls on a corner for a whole range of positions of `ŵ`, not just for one special alignment (in higher dimensions the same holds for edges and low-dimensional faces, where several coordinates vanish at once). At such a contact point, one or more coordinates are exactly zero.
> Formal Argument via KKT Conditions: At the optimal solution `w*`, the sub-gradient of the L1 norm with respect to a coordinate at `wᵢ = 0` is the entire interval `[−1, +1]`. The KKT condition for `wᵢ* = 0` therefore requires only that the i-th component of the loss gradient lies in `[−λ, +λ]`, a set of positive width rather than a single value. Hence, zero is a natural resting point for the L1 penalty.
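
The KKT conditions can be verified on a fitted model. The sketch below is a minimal check under stated assumptions: synthetic data of our own, no intercept, and scikit-learn's `Lasso`, whose documented objective is `(1/(2n))||y − Xw||₂² + α||w||₁` (a different loss scaling from the `(1/n)` form used in Section 3, so `α` here is not numerically the same as `λ` there). For active coordinates the per-feature residual correlation should equal `α · sign(wⱼ)`; for zeroed coordinates it only needs to fall inside `[−α, +α]`.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 400, 10
X = rng.standard_normal((n, p))
w_true = np.array([3.5, -2.1, 1.8] + [0.0] * 7)
y = X @ w_true + 0.3 * rng.standard_normal(n)

alpha = 0.1
# scikit-learn's Lasso minimizes (1/(2n))||y - Xw||^2 + alpha * ||w||_1
model = Lasso(alpha=alpha, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, y)
w = model.coef_

# Negative gradient of the smooth part, per feature: (1/n) X^T (y - Xw)
corr = X.T @ (y - X @ w) / n

active = w != 0
# Active coordinates: stationarity forces corr_j == alpha * sign(w_j)
active_gap = np.max(np.abs(corr[active] - alpha * np.sign(w[active])))
# Zero coordinates: corr_j only has to lie inside [-alpha, +alpha]
inactive_max = np.max(np.abs(corr[~active]))

print("active coordinates:", np.flatnonzero(active))
print(f"max |corr - alpha*sign(w)| on the active set: {active_gap:.2e}")
print(f"max |corr| on the zero set: {inactive_max:.4f} (allowed up to alpha = {alpha})")
```

The zeroed coordinates satisfy their condition with room to spare, which is exactly the "large feasible set" the argument refers to.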
4.4 The L2 Constraint Region — The Circle
Shape and Structure – The L2 constraint region in 2D — the set `{(w₁, w₂) : w₁² + w₂² ≤ t}` — is a disk with radius `√t`. In higher dimensions, it generalizes to a Euclidean ball — a perfectly symmetric, smooth, round region.
Why the Circle Cannot Produce Zeros – The circle has no corners. Its boundary is a smooth curve, differentiable everywhere. The expanding loss ellipse touches the circular boundary at a smooth tangent point determined by the geometry of the ellipse. For this tangent point to have `wᵢ = 0`, the ellipse would need a specific alignment with the axes that essentially never occurs for general data.
| | L1: Diamond | L2: Circle |
|-|-|-|
| Boundary | Piecewise linear (flat faces) | Smooth everywhere |
| Vertices | 2p corners on coordinate axes | No vertices, no corners |
| Tangent points | Frequently land on corners or edges (exact zeros) | Generic boundary points, almost never on an axis |
| Sub-gradient of the penalty at `wᵢ = 0` | Entire interval `[−λ, +λ]` | Single value 0, so `wᵢ = 0` requires the loss gradient to vanish exactly |
| Result | Exact zeros → sparse solution | Small but non-zero → dense solution |
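
The corner argument can be reproduced by brute force in two dimensions. The sketch below is a toy under stated assumptions: the loss contours are circles (a special case of the ellipses discussed above), the unconstrained optimum `w_ols = (2.0, 0.3)` and the budget `t = 1` are hypothetical values chosen for illustration, and the boundaries are searched on a dense grid rather than solved analytically.

```python
import numpy as np

w_ols = np.array([2.0, 0.3])   # hypothetical unconstrained optimum, outside both regions
t = 1.0                        # budget: |w1| + |w2| <= t  and  w1^2 + w2^2 <= t

def loss(W):
    """Squared distance to w_ols: circular contours, a simplifying assumption."""
    return np.sum((W - w_ols) ** 2, axis=-1)

# Boundary of the L1 diamond: walk each of the four edges, corners included
s = np.linspace(0.0, 1.0, 2001)
edges = [np.column_stack([sx * t * (1 - s), sy * t * s])
         for sx in (1, -1) for sy in (1, -1)]
diamond = np.vstack(edges)

# Boundary of the L2 disk: radius sqrt(t)
theta = np.linspace(0.0, 2 * np.pi, 8001)
circle = np.sqrt(t) * np.column_stack([np.cos(theta), np.sin(theta)])

w_l1 = diamond[np.argmin(loss(diamond))]
w_l2 = circle[np.argmin(loss(circle))]

print("L1-constrained optimum:", w_l1)   # sits on the corner (t, 0)
print("L2-constrained optimum:", w_l2)   # both coordinates remain non-zero
```

Moving `w_ols` around within a fairly wide wedge near the `w₁` axis keeps the L1 optimum pinned to the same corner, whereas the L2 optimum rotates smoothly with it.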
5. Tier 1 — OLS Baseline (No Regularization)
Before applying regularization, we establish the unregularized baseline. OLS will assign non-zero weights to all 10 features, including the 7 noise features.
```python
ols_model = LinearRegression()
ols_model.fit(X_train, y_train)
ols_mse = mean_squared_error(y_test, ols_model.predict(X_test))

print("=== TIER 1: OLS BASELINE (No Regularization) ===")
print(f"Test MSE: {ols_mse:.4f}")
print(f"\n{'Feature':<15} {'Learned':>10} {'True':>10}")
for i, (name, w) in enumerate(zip(feature_names, ols_model.coef_)):
    print(f" {name:<13} {w:>+10.4f} {true_weights[i]:>+10.4f}")
```
Output:
```
=== TIER 1: OLS BASELINE (No Regularization) ===
Test MSE: 0.1947

Feature Learned True
Feature_1 +3.1466 +3.5000
Feature_2 -2.2678 -2.1000
Feature_3 +1.7363 +1.8000
Feature_4 +0.0264 +0.0000 ← noise, but not zero
Feature_5 +0.0281 +0.0000 ← noise, but not zero
Feature_6 -0.0414 +0.0000 ← noise, but not zero
Feature_7 -0.0373 +0.0000 ← noise, but not zero
Feature_8 -0.0294 +0.0000 ← noise, but not zero
Feature_9 -0.0255 +0.0000 ← noise, but not zero
Feature_10 +0.0689 +0.0000 ← noise, but not zero
```
Business Interpretation: OLS correctly identifies the three active features with large coefficients, but it also assigns small, non-zero weights to all seven noise features. In a production model with thousands of irrelevant features, this noise accumulates into degraded generalization and an uninterpretable model.
6. Tier 2 — Ridge Regression (L2): Dense Shrinkage
Ridge regression adds the L2 penalty `λ Σwᵢ²` to the loss. Because the L2 constraint region is a smooth ball with no corners, Ridge shrinks all weights toward zero but drives none of them to exactly zero.
```python
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
ridge_mse = mean_squared_error(y_test, ridge_model.predict(X_test))

print("=== TIER 2: RIDGE REGRESSION (L2) ===")
print(f"Test MSE: {ridge_mse:.4f}")
exact_zeros_ridge = sum(1 for w in ridge_model.coef_ if abs(w) < 1e-6)
print(f"Weights driven to exact zero: {exact_zeros_ridge} / {n_features} ← Ridge cannot zero weights")
print(f"\n{'Feature':<15} {'Ridge Coef':>12}")
for name, w in zip(feature_names, ridge_model.coef_):
    print(f" {name:<13} {w:>+12.4f}")
```
Output:
```
=== TIER 2: RIDGE REGRESSION (L2) ===
Test MSE: 0.1941
Weights driven to exact zero: 0 / 10 ← Ridge cannot zero weights

Feature Ridge Coef
Feature_1 +3.1301
Feature_2 -2.2546
Feature_3 +1.7290
Feature_4 +0.0263 ← noise feature, still alive
Feature_5 +0.0260 ← noise feature, still alive
Feature_6 -0.0407 ← noise feature, still alive
Feature_7 -0.0359 ← noise feature, still alive
Feature_8 -0.0307 ← noise feature, still alive
Feature_9 -0.0273 ← noise feature, still alive
Feature_10 +0.0677 ← noise feature, still alive
```
Business Interpretation: Ridge shrinks the noise features from values like `+0.069` (OLS) down to `+0.068` (Ridge) — a marginal improvement. But it cannot eliminate them. If this were a deployed production model with 1,000 noise features, all 1,000 would still require computation at inference time. Ridge provides stability under multicollinearity but no feature selection.
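One way to see why the shrinkage is so mild is to reproduce the Ridge fit by hand. The sketch below uses its own synthetic data (an assumption, not the article's split) and relies on scikit-learn's documented `Ridge` objective `||y − Xw||₂² + α||w||₂²`, which carries no `1/n` factor and fits the intercept on centered data.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n, p = 160, 10
X = rng.standard_normal((n, p))
y = X @ np.array([3.5, -2.1, 1.8] + [0.0] * 7) + 0.5 * rng.standard_normal(n)

alpha = 1.0
# Closed form on centered data: w = (Xc^T Xc + alpha * I)^(-1) Xc^T yc
Xc = X - X.mean(axis=0)
yc = y - y.mean()
w_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)

w_sklearn = Ridge(alpha=alpha).fit(X, y).coef_

print("max |closed form - sklearn|:", np.max(np.abs(w_closed - w_sklearn)))
print("smallest |coefficient|     :", np.min(np.abs(w_closed)))  # small, but not zero
```

With standardized features, `XcᵀXc` is roughly `n · I`, so `α = 1` against `n = 160` rescales each weight by only about `1 / (1 + 1/160)`, well under one percent. That is consistent with the marginal change reported above; a visibly stronger Ridge effect would need `α` on the order of `n`.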
7. Tier 3 — Lasso Regression (L1): Sparse Selection
Lasso adds the L1 penalty `λ Σ|wᵢ|`. Because the L1 constraint region is a diamond with corners on the coordinate axes, Lasso drives noise feature weights to **exactly zero** — performing automatic feature selection.
```python
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
lasso_mse = mean_squared_error(y_test, lasso_model.predict(X_test))

print("=== TIER 3: LASSO REGRESSION (L1) ===")
print(f"Test MSE: {lasso_mse:.4f}")
exact_zeros_lasso = sum(1 for w in lasso_model.coef_ if abs(w) < 1e-6)
print(f"Weights driven to exact zero: {exact_zeros_lasso} / {n_features} ← Lasso eliminated noise")
print(f"\n{'Feature':<15} {'Lasso Coef':>12} {'Status'}")
for i, (name, w) in enumerate(zip(feature_names, lasso_model.coef_)):
    status = "← ZEROED (noise eliminated)" if abs(w) < 1e-6 else "← ACTIVE (signal retained)"
    print(f" {name:<13} {w:>+12.4f} {status}")
```
**Output:**
```
=== TIER 3: LASSO REGRESSION (L1) ===
Test MSE: 0.2062
Weights driven to exact zero: 7 / 10 ← Lasso eliminated noise

Feature Lasso Coef Status
Feature_1 +3.0580 ← ACTIVE (signal retained)
Feature_2 -2.1790 ← ACTIVE (signal retained)
Feature_3 +1.6695 ← ACTIVE (signal retained)
Feature_4 +0.0000 ← ZEROED (noise eliminated)
Feature_5 +0.0000 ← ZEROED (noise eliminated)
Feature_6 -0.0000 ← ZEROED (noise eliminated)
Feature_7 -0.0000 ← ZEROED (noise eliminated)
Feature_8 -0.0000 ← ZEROED (noise eliminated)
Feature_9 -0.0000 ← ZEROED (noise eliminated)
Feature_10 +0.0000 ← ZEROED (noise eliminated)
```
Business Interpretation: Lasso zeroed all 7 noise features and retained only the 3 truly predictive ones, so the selected support matches the ground truth exactly. The surviving coefficients are pulled slightly toward zero relative to OLS (for example `+3.0580` against `+3.1466` for `Feature_1`), which is the price of the penalty. A risk model or compliance engine built on this output is interpretable, auditable, and cheaper to deploy: 3 parameters instead of 10.
8. Tier 4 — Analytic Proof: The Soft-Thresholding Operator
The geometric argument can be made rigorous through the coordinate-wise update for Lasso. For orthonormal features (`XᵀX = I`), and with the loss rescaled to `½||y − Xw||₂²` so that the constants below come out cleanly, the Lasso solution for each coordinate has a closed form known as the **soft-thresholding operator**:
```
wᵢ* = sign(ŵᵢ) · max(|ŵᵢ| − λ, 0)
```
where `ŵᵢ` is the i-th OLS coefficient. Under the same scaling, the analogous Ridge solution is:
```
wᵢ* = ŵᵢ / (1 + 2λ)
```
```python
def soft_threshold(w_ols, lam):
    """L1: Lasso soft-thresholding, which sets small weights to exactly zero."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0)

def ridge_shrink(w_ols, lam):
    """L2: Ridge proportional shrinkage, which never reaches zero."""
    return w_ols / (1 + 2 * lam)

# Demonstrate on sample OLS coefficients
sample_ols_coefs = np.array([0.8, -0.3, 1.5, -0.1, 0.05])
lam = 0.5
l1_result = soft_threshold(sample_ols_coefs, lam)
l2_result = ridge_shrink(sample_ols_coefs, lam)

print("=== TIER 4: ANALYTIC SOFT-THRESHOLDING (λ = 0.5) ===")
print(f"\n{'OLS Coef':>12} {'L1 Lasso':>12} {'L2 Ridge':>12} {'L1 Status'}")
for ols, l1, l2 in zip(sample_ols_coefs, l1_result, l2_result):
    status = "→ ZEROED" if l1 == 0 else "→ shrunk"
    print(f" {ols:>+10.4f} {l1:>+12.4f} {l2:>+12.4f} {status}")
```
**Output:**
```
=== TIER 4: ANALYTIC SOFT-THRESHOLDING (λ = 0.5) ===

OLS Coef L1 Lasso L2 Ridge L1 Status
+0.8000 +0.3000 +0.4000 → shrunk
-0.3000 -0.0000 -0.1500 → ZEROED
+1.5000 +1.0000 +0.7500 → shrunk
-0.1000 -0.0000 -0.0500 → ZEROED
+0.0500 +0.0000 +0.0250 → ZEROED
```
The Decisive Difference. The parameter `λ` acts as a **threshold**: any OLS coefficient whose absolute value does not exceed `λ` is set exactly to zero, and every surviving coefficient is pulled toward zero by `λ` (this continuous pull is what makes the thresholding "soft" rather than "hard"). Coefficients `−0.3`, `−0.1`, and `0.05`, all below `λ = 0.5` in magnitude, are zeroed by Lasso. Ridge, on the other hand, scales every weight by `1/(1 + 2λ) = 0.5`, shrinking all of them proportionally but never reaching zero.
> Soft-Thresholding Summary:
> `L1: wᵢ = sign(ŵᵢ) · max(|ŵᵢ| − λ, 0)` → threshold operator, produces zeros
> `L2: wᵢ = ŵᵢ / (1 + 2λ)` → proportional shrinkage, no zeros
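
When the features are not orthonormal there is no one-shot formula, but the same operator can be applied one coordinate at a time. The sketch below is a minimal cyclic coordinate descent, written under the assumption of scikit-learn's `(1/(2n))||y − Xw||₂² + α||w||₁` scaling and no intercept; `lasso_coordinate_descent` is a name introduced here for illustration and is not the library's actual implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2n))||y - Xw||^2 + alpha * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n        # (1/n) * x_j^T x_j for each column
    residual = y.astype(float).copy()        # equals y - Xw, since w starts at 0
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * w[j]       # take feature j out of the fit
            rho = X[:, j] @ residual / n     # correlation with the partial residual
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
            residual -= X[:, j] * w[j]       # put the updated feature j back
    return w

rng = np.random.default_rng(4)
n, p = 400, 10
X = rng.standard_normal((n, p))
y = X @ np.array([3.5, -2.1, 1.8] + [0.0] * 7) + 0.3 * rng.standard_normal(n)

w_cd = lasso_coordinate_descent(X, y, alpha=0.1)
w_sk = Lasso(alpha=0.1, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, y).coef_

print("coordinate descent:", np.round(w_cd, 4))
print("scikit-learn      :", np.round(w_sk, 4))
```

Every exact zero in the result is produced by the `max(·, 0)` inside `soft_threshold`; no rounding or post-hoc pruning is involved.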
9. Tier 5 — The Regularization Path
As `λ` increases from zero to infinity, the model transitions from fully unregularized (OLS) to fully suppressed (all zeros). Tracing this path visually demonstrates how L1 and L2 differ in their suppression behavior.
```python
alphas = np.logspace(-3, 1, 80)
lasso_paths, ridge_paths = [], []
for a in alphas:
    lasso_paths.append(Lasso(alpha=a, max_iter=5000).fit(X_train, y_train).coef_.copy())
    ridge_paths.append(Ridge(alpha=a).fit(X_train, y_train).coef_.copy())
lasso_paths = np.array(lasso_paths)
ridge_paths = np.array(ridge_paths)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = plt.cm.tab10(np.linspace(0, 1, n_features))
for i in range(n_features):
    # Solid, thick lines for true signal features; thin dashed lines for noise
    lw = 2.5 if true_weights[i] != 0 else 1.0
    ls = '-' if true_weights[i] != 0 else '--'
    axes[0].plot(np.log10(alphas), lasso_paths[:, i], color=colors[i], lw=lw, ls=ls, label=feature_names[i])
    axes[1].plot(np.log10(alphas), ridge_paths[:, i], color=colors[i], lw=lw, ls=ls, label=feature_names[i])

for ax, title in zip(axes,
                     ["L1 Lasso: Regularization Path\n(active features in solid; noise in dashed)",
                      "L2 Ridge: Regularization Path\n(active features in solid; noise in dashed)"]):
    ax.axhline(0, color='black', lw=0.8, ls=':')
    ax.set_xlabel("log₁₀(α) → stronger regularization →", fontsize=11)
    ax.set_ylabel("Coefficient Value", fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(fontsize=7, ncol=2, loc='upper left')

plt.tight_layout()
plt.savefig('regularization_path.png', dpi=150, bbox_inches='tight')
plt.show()
```
Path Interpretation:
|Observation|Lasso (L1)|Ridge (L2)|
|-|-|-|
| Noise features|Abruptly snap to zero at moderate λ|Smoothly shrink but never reach zero|
| Active features|Retained until λ is large; then thresholded|Gradually shrink, never eliminated|
| Ordering|Features enter/exit sequentially|All move together proportionally|
| Implication |Built-in feature ranking by entry order|No feature ranking possible|
The Lasso path provides an ordered list of features: as `λ` decreases from its maximum, the feature most correlated with the current residual enters first. Entry order therefore serves as a heuristic feature ranking that comes for free once the path has been computed.
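The entry order can be read off programmatically. The sketch below uses scikit-learn's `lasso_path` on synthetic data of our own (an assumption; it is not the article's train split) and records, for each feature, the largest grid value of `α` at which its coefficient is non-zero. The resolution of the ranking is limited by the grid, so features entering between the same two grid points tie.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(5)
n, p = 400, 10
X = rng.standard_normal((n, p))
y = X @ np.array([3.5, -2.1, 1.8] + [0.0] * 7) + 0.3 * rng.standard_normal(n)

# lasso_path does not fit an intercept, so center the data first
Xc = X - X.mean(axis=0)
yc = y - y.mean()

alphas, coefs, _ = lasso_path(Xc, yc, alphas=np.logspace(1, -3, 100))
# coefs has shape (n_features, n_alphas); alphas come back ordered largest first

entry_alpha = np.zeros(p)
for j in range(p):
    nonzero = np.flatnonzero(coefs[j] != 0)
    # First grid point, scanning from strong to weak penalty, where feature j is active
    entry_alpha[j] = alphas[nonzero[0]] if nonzero.size else 0.0

entry_order = np.argsort(-entry_alpha)
for j in entry_order:
    print(f"Feature_{j + 1:<2d} enters at alpha ~ {entry_alpha[j]:.4f}")
```

In this setup the three signal features enter long before any noise feature, and the one with the largest true weight enters first.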
10. Tier 6 — Cross-Validation: Selecting Optimal λ
The hyperparameter `λ` must be tuned — not set by hand. The standard approach is **k-fold cross-validation**: train on `k−1` folds, evaluate on the held-out fold, and repeat.
```python
cv_alphas = np.logspace(-3, 1, 50)
lasso_cv_scores, ridge_cv_scores = [], []
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Scaling lives inside the pipeline, so each fold is standardized on its own training part
for a in cv_alphas:
    l_mse = -cross_val_score(make_pipeline(StandardScaler(), Lasso(alpha=a, max_iter=5000)),
                             X_raw, y_raw, cv=kf, scoring='neg_mean_squared_error').mean()
    r_mse = -cross_val_score(make_pipeline(StandardScaler(), Ridge(alpha=a)),
                             X_raw, y_raw, cv=kf, scoring='neg_mean_squared_error').mean()
    lasso_cv_scores.append(l_mse)
    ridge_cv_scores.append(r_mse)

best_lasso_alpha = cv_alphas[np.argmin(lasso_cv_scores)]
best_ridge_alpha = cv_alphas[np.argmin(ridge_cv_scores)]

print("=== TIER 6: CROSS-VALIDATION λ TUNING ===")
print(f"Best Lasso α : {best_lasso_alpha:.4f} | CV MSE: {min(lasso_cv_scores):.4f}")
print(f"Best Ridge α : {best_ridge_alpha:.4f} | CV MSE: {min(ridge_cv_scores):.4f}")

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(np.log10(cv_alphas), lasso_cv_scores, label='Lasso (L1)', color='#3498db', lw=2)
ax.plot(np.log10(cv_alphas), ridge_cv_scores, label='Ridge (L2)', color='#e74c3c', lw=2)
ax.axvline(np.log10(best_lasso_alpha), color='#3498db', ls='--', alpha=0.7,
           label=f'Best Lasso α = {best_lasso_alpha:.3f}')
ax.axvline(np.log10(best_ridge_alpha), color='#e74c3c', ls='--', alpha=0.7,
           label=f'Best Ridge α = {best_ridge_alpha:.3f}')
ax.set_xlabel("log₁₀(α)", fontsize=12)
ax.set_ylabel("5-Fold CV MSE", fontsize=12)
ax.set_title("Cross-Validation: Selecting Optimal λ for Lasso and Ridge", fontsize=13, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('cv_tuning.png', dpi=150, bbox_inches='tight')
plt.show()
```
**Output:**
```
=== TIER 6: CROSS-VALIDATION λ TUNING ===
Best Lasso α : 0.0295 | CV MSE: 0.2480
Best Ridge α : 1.0481 | CV MSE: 0.2519
```
> Practical λ Selection Rule: Use `lambda_max = max_j(|Xⱼᵀy| / n)` as an upper bound for Lasso. With centered data and scikit-learn's `(1/(2n))||y − Xw||₂²` loss scaling, this is the smallest `λ` that sets all weights to zero. Then search a logarithmic grid between `lambda_max` and `lambda_max / 1000` using cross-validation.
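
The rule can be checked directly. The sketch below assumes synthetic data of our own and scikit-learn's `Lasso` scaling, with explicit centering standing in for the intercept; `n_active` is a small helper introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X @ np.array([3.5, -2.1, 1.8] + [0.0] * 7) + 0.5 * rng.standard_normal(n)

Xc = X - X.mean(axis=0)          # centering stands in for fitting an intercept
yc = y - y.mean()

# Smallest penalty that zeroes every weight under the
# (1/(2n))||y - Xw||^2 + alpha * ||w||_1 scaling
alpha_max = np.max(np.abs(Xc.T @ yc)) / n

def n_active(alpha):
    """Number of non-zero Lasso coefficients at a given penalty strength."""
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10_000).fit(Xc, yc)
    return int(np.count_nonzero(model.coef_))

above = n_active(1.01 * alpha_max)
below = n_active(0.90 * alpha_max)
print(f"alpha_max = {alpha_max:.4f}")
print("active weights just above alpha_max:", above)
print("active weights just below alpha_max:", below)

# Logarithmic search grid from alpha_max down to alpha_max / 1000
alpha_grid = np.logspace(np.log10(alpha_max), np.log10(alpha_max / 1000), 50)
```

`alpha_grid` can then be passed to a cross-validation loop like the one above in place of a hand-picked range.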
11. Tier 7 — Elastic Net: Getting Both Properties
When neither L1 nor L2 perfectly fits the problem, **Elastic Net** combines both penalties:
```
L_EN(w) = L(w) + λ₁·||w||₁ + λ₂·||w||₂²
```
Elastic Net inherits **sparsity from L1** and **stability under multicollinearity from L2**. The `l1_ratio` parameter controls the blend: `l1_ratio = 1.0` is pure Lasso; `l1_ratio = 0.0` is pure Ridge.
```python
elastic_model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000)
elastic_model.fit(X_train, y_train)
elastic_mse = mean_squared_error(y_test, elastic_model.predict(X_test))
exact_zeros_elastic = sum(1 for w in elastic_model.coef_ if abs(w) < 1e-6)

print("=== TIER 7: ELASTIC NET (L1 + L2) ===")
print(f"Test MSE : {elastic_mse:.4f}")
print(f"Weights zeroed : {exact_zeros_elastic} / {n_features}")
print(f"\n{'Feature':<15} {'Elastic Net':>12} Status")
for name, w in zip(feature_names, elastic_model.coef_):
    status = "← ZEROED" if abs(w) < 1e-6 else "← ACTIVE"
    print(f" {name:<13} {w:>+12.4f} {status}")
```
**Output:**
```
=== TIER 7: ELASTIC NET (L1 + L2) ===
Test MSE : 0.2331
Weights zeroed : 6 / 10

Feature Elastic Net Status
Feature_1 +2.9746 ← ACTIVE
Feature_2 -2.1271 ← ACTIVE
Feature_3 +1.6535 ← ACTIVE
Feature_4 +0.0000 ← ZEROED
Feature_5 +0.0000 ← ZEROED
Feature_6 -0.0000 ← ZEROED
Feature_7 -0.0000 ← ZEROED
Feature_8 -0.0000 ← ZEROED
Feature_9 -0.0000 ← ZEROED
Feature_10 +0.0074 ← ACTIVE (borderline; l1_ratio controls this)
```
Elastic Net Use Cases:
* Correlated feature groups exist and you want to select the group (or part of it) rather than arbitrarily picking one.
* `p >> n` but the true model is not fully sparse (many features with small effects).
* The Lasso path is too erratic due to extreme collinearity.
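The two-penalty form `λ₁||w||₁ + λ₂||w||₂²` and scikit-learn's `(alpha, l1_ratio)` pair describe the same family. The sketch below shows the conversion, assuming scikit-learn's documented `ElasticNet` objective `(1/(2n))||y − Xw||₂² + alpha·l1_ratio·||w||₁ + 0.5·alpha·(1 − l1_ratio)·||w||₂²`; the helper names are ours, and `λ₁`, `λ₂` here are taken relative to that `1/(2n)` loss rather than the `(1/n)` loss of Section 3.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

def to_sklearn_params(lam1, lam2):
    """Map (lam1, lam2) to scikit-learn's (alpha, l1_ratio)."""
    alpha = lam1 + 2.0 * lam2
    return alpha, lam1 / alpha

def from_sklearn_params(alpha, l1_ratio):
    """Inverse mapping: recover (lam1, lam2) from (alpha, l1_ratio)."""
    return alpha * l1_ratio, 0.5 * alpha * (1.0 - l1_ratio)

alpha, l1_ratio = to_sklearn_params(lam1=0.05, lam2=0.025)
print(f"alpha = {alpha:.3f}, l1_ratio = {l1_ratio:.3f}")

# Sanity check on synthetic data: l1_ratio = 1.0 reduces Elastic Net to Lasso
rng = np.random.default_rng(7)
X = rng.standard_normal((200, 10))
y = X @ np.array([3.5, -2.1, 1.8] + [0.0] * 7) + 0.5 * rng.standard_normal(200)

en_coef = ElasticNet(alpha=0.1, l1_ratio=1.0, tol=1e-10, max_iter=50_000).fit(X, y).coef_
lasso_coef = Lasso(alpha=0.1, tol=1e-10, max_iter=50_000).fit(X, y).coef_
print("max |ElasticNet(l1_ratio=1) - Lasso|:", np.max(np.abs(en_coef - lasso_coef)))
```

Under this mapping, the `alpha=0.1, l1_ratio=0.5` setting used above corresponds to `λ₁ = 0.05` and `λ₂ = 0.025`.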
12. Tier 8 — Final Model Comparison
```python
best_lasso = Lasso(alpha=best_lasso_alpha, max_iter=5000).fit(X_train, y_train)
best_ridge = Ridge(alpha=best_ridge_alpha).fit(X_train, y_train)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
models = [
    ('Ridge (L2)\nDense: no zeros', best_ridge, '#e74c3c'),
    ('Lasso (L1)\nSparse: exact zeros', best_lasso, '#3498db'),
    ('Elastic Net\nBalanced', elastic_model, '#2ecc71')
]
for ax, (name, model, color) in zip(axes, models):
    coefs = model.coef_
    # Grey bars mark coefficients that were driven to zero
    bar_colors = ['#bdc3c7' if abs(c) < 1e-6 else color for c in coefs]
    ax.bar(feature_names, coefs, color=bar_colors, edgecolor='black', lw=0.6)
    ax.axhline(0, color='black', lw=0.8)
    zeros = sum(1 for c in coefs if abs(c) < 1e-6)
    ax.set_title(f"{name}\n(zeros: {zeros}/{n_features})", fontsize=11, fontweight='bold')
    ax.set_xticks(range(n_features))  # fix tick positions before relabelling
    ax.set_xticklabels(feature_names, rotation=45, ha='right', fontsize=8)
    ax.set_ylabel("Coefficient Value")
    # Overlay true weight positions
    ax.step(range(n_features), true_weights, where='mid',
            color='black', lw=2, ls='--', label='True weights')
    ax.legend(fontsize=8)

plt.suptitle("Learned Coefficient Vectors: L1 vs L2 vs Elastic Net",
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('coef_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
```
13. The Bayesian Perspective: Priors on Weights
Regularization has an elegant Bayesian interpretation that further illuminates the difference between L1 and L2. In the Bayesian framework, we place a prior distribution over the weights and seek the Maximum A Posteriori (MAP) estimate.
Ridge as a Gaussian Prior – L2 regularization corresponds to placing an independent Gaussian prior on each weight: `p(wᵢ) = N(0, σ²)`, where `λ = 1/(2σ²)` up to the scaling of the likelihood term. The Gaussian is smooth and bell-shaped, and its log-density is a quadratic that is flat at its peak: near `wᵢ = 0` the prior exerts almost no pull. The MAP solution therefore shrinks weights toward zero, but any non-zero gradient from the data is enough to keep a weight away from exactly zero.
Lasso as a Laplace Prior – L1 regularization corresponds to placing a Laplace (double-exponential) prior on each weight: `p(wᵢ) = (1/2b) · exp(−|wᵢ|/b)`, where `λ = 1/b` under the same convention. The Laplace distribution has a sharp peak at zero and heavier tails than the Gaussian, and its log-density is not differentiable at `wᵢ = 0`. The kink means the prior pulls toward zero with constant strength `1/b` no matter how small the weight is, so the MAP estimate sits exactly at zero whenever the data gradient is weaker than that pull. This is the probabilistic counterpart of the geometric corner phenomenon.
| | Gaussian Prior (Ridge) | Laplace Prior (Lasso) |
|-|-|-|
| Shape | Smooth, differentiable everywhere | Sharp peak at 0, non-differentiable at 0 |
| Log-density at zero | Maximum at `wᵢ = 0`, with zero slope | Maximum at `wᵢ = 0`, with a kink (slope `±1/b` on either side) |
| MAP estimate | Shrinks weights continuously | Sets weak weights exactly to zero |
| Regularization type | Shrinkage only: all features retained | Shrinkage plus selection: sparse models encouraged |
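
The difference between the two priors shows up already for a single weight. The sketch below is a one-dimensional toy with hypothetical numbers: a noisy estimate `w_hat = 0.3` with Gaussian likelihood variance `s2 = 1`, a Gaussian prior with variance `sigma2 = 1`, and a Laplace prior with scale `b = 2`. Each negative log-posterior is minimized on a grid that contains `0.0` exactly.

```python
import numpy as np

w_hat = 0.3      # noisy estimate of a single weight (hypothetical value)
s2 = 1.0         # likelihood variance
sigma2 = 1.0     # Gaussian prior variance
b = 2.0          # Laplace prior scale; the implied threshold is s2 / b = 0.5

grid = np.arange(-2000, 2001) / 1000.0    # step 0.001, includes 0.0 exactly

data_term = (grid - w_hat) ** 2 / (2 * s2)
neg_log_post_gauss = data_term + grid ** 2 / (2 * sigma2)   # quadratic penalty
neg_log_post_laplace = data_term + np.abs(grid) / b         # absolute-value penalty

map_gauss = grid[np.argmin(neg_log_post_gauss)]
map_laplace = grid[np.argmin(neg_log_post_laplace)]

print("MAP under Gaussian prior:", map_gauss)     # shrunk to w_hat * sigma2 / (sigma2 + s2)
print("MAP under Laplace prior :", map_laplace)   # thresholded to zero, since |w_hat| < s2 / b
```

Raising `w_hat` above `0.5` moves the Laplace MAP off zero, mirroring the soft-thresholding rule with `λ = s2 / b`.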
14. A Critical Practical Detail: Scale Sensitivity
Neither the L1 nor the L2 penalty is scale-invariant. If `Feature_A` ranges from 0 to 1 and `Feature_B` ranges from 0 to 10,000, the weights needed to express comparable effects differ by orders of magnitude, so the penalty on their respective weights is not comparable.
> Rule of Thumb — Always Standardize Before Regularizing
> Before fitting Lasso or Ridge, standardize each feature to zero mean and unit variance:
> `x̃ᵢⱼ = (xᵢⱼ − μⱼ) / σⱼ`
> Only after standardization does a single `λ` apply a fair, comparable penalty across all features. scikit-learn does not standardize automatically for `Lasso` and `Ridge` — you must apply `StandardScaler` yourself, as done in our pipeline above.
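
A small experiment makes the scale problem visible. The sketch below assumes a synthetic two-feature dataset of our own in which both features matter equally, but the second one is recorded in units that make its values 1000 times smaller. On the raw columns Lasso discards it; inside a `StandardScaler` pipeline both survive.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
n = 500
signal = rng.standard_normal((n, 2))
y = 2.0 * signal[:, 0] + 2.0 * signal[:, 1] + 0.3 * rng.standard_normal(n)

# Same information, different units: the second column is 1000x smaller numerically,
# so it needs a 1000x larger weight, which the penalty makes very expensive
X = signal * np.array([1.0, 1e-3])

raw = Lasso(alpha=0.5).fit(X, y)
scaled = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X, y)

print("raw features         :", raw.coef_)
print("standardized features:", scaled[-1].coef_)
```

The zero in the raw fit says nothing about relevance; it is purely an artifact of the measurement unit.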
15. Strategic Overview: When to Use Each
|Use Lasso (L1) When…|Use Ridge (L2) When…|
|-|-|
|`p >> n` with suspected sparsity|Multicollinearity is present|
|Feature selection is a goal in itself|You believe all features are genuinely relevant|
|You need a deployable model with few features|You need a unique, stable closed-form solution|
|Domain knowledge suggests few active variables|Prediction accuracy over interpretability|

|**Criterion**|**L1 (Lasso)**|**L2 (Ridge)**|**Elastic Net**|
|-|-|-|-|
|Penalty Term|`λ Σ \|wᵢ\|`|`λ Σ wᵢ²`|`λ₁‖w‖₁ + λ₂‖w‖₂²`|
|Constraint Region|Diamond (cross-polytope)|Circle (ball)|Hybrid|
|Effect on Weights|Exact zeros (sparsity)|Shrinks toward zero|Partial sparsity|
|Feature Selection?|Yes, automatic|No|Partial|
|Closed-Form Solution?|No in general (sub-gradient or coordinate-descent methods)|Yes (analytical)|No|
|Handles Multicollinearity?|Poorly|Well|Well|
|Bayesian Prior|Laplace|Gaussian|Laplace + Gaussian|
16. Key Summary for the Engineering Lead
Phase 1 Strategy – Always establish an OLS baseline first. The gap between OLS and the regularized models reveals how much overfitting was occurring — if Ridge and Lasso have dramatically better validation scores, your data has a noise problem that regularization is genuinely solving.
Production Pipeline Rule – Fit your `StandardScaler` on the training data only. Scaling the entire dataset before splitting causes data leakage, because the scaler learns test-set statistics. Always: `scaler.fit(X_train)` → `scaler.transform(X_train)` → `scaler.transform(X_test)`.
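The simplest way to follow this rule is to split first and let a pipeline own the scaler. The sketch below uses synthetic mixed-scale data of our own (an assumption); the step name `standardscaler` is the one `make_pipeline` derives from the class name.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
scales = rng.uniform(0.5, 50.0, size=10)               # features on very different scales
X = rng.standard_normal((200, 10)) * scales + 5.0      # with non-zero means as well
w_std = np.array([3.5, -2.1, 1.8] + [0.0] * 7)         # weights in standardized units
y = ((X - 5.0) / scales) @ w_std + 0.5 * rng.standard_normal(200)

# Split first, so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
pipe.fit(X_train, y_train)          # scaler statistics come from X_train only

scaler = pipe.named_steps["standardscaler"]
print("scaler fitted on training rows only:",
      np.allclose(scaler.mean_, X_train.mean(axis=0)))

# predict() applies the stored training statistics to the test rows
test_mse = np.mean((y_test - pipe.predict(X_test)) ** 2)
print(f"test MSE: {test_mse:.4f}")
```

Because the scaler is part of the estimator, the same object can be handed to `cross_val_score` without reintroducing leakage across folds.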
Feature Selection Value – Use Lasso's zero-weight output to identify your minimal feature set. Re-train a plain OLS model on only those features for maximum interpretability and lowest inference cost; this is the *two-stage pipeline* used in high-dimensional biological and financial modeling.
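A minimal sketch of that two-stage idea follows, on synthetic data of our own (an assumption): Lasso picks the support, then an unpenalized refit on the selected columns removes the shrinkage from the surviving coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(10)
n, p = 400, 10
w_true = np.array([3.5, -2.1, 1.8] + [0.0] * 7)
X = rng.standard_normal((n, p))
y = X @ w_true + 0.3 * rng.standard_normal(n)

# Stage 1: Lasso chooses which features to keep
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

# Stage 2: plain OLS on the selected columns only
refit = LinearRegression().fit(X[:, selected], y)

print("selected features :", selected)
print("lasso coefficients:", np.round(lasso.coef_[selected], 3))
print("refit coefficients:", np.round(refit.coef_, 3))
print("true coefficients :", w_true[selected])
```

In practice both stages should be fitted on training data only and the refit evaluated on held-out data, since selecting and refitting on the same rows makes the second stage look better than it is.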
Geometry Mnemonic – When a colleague asks why Lasso sets weights to zero:
```
L1 has corners on the axes.
The loss ellipse hits corners.
Corners are zeros.
L2 has no corners.
The loss ellipse hits a smooth curve.
Smooth curves are non-zero.
```
References & Further Reading
1. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
2. Hoerl, A.E. & Kennard, R.W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55–67.
3. Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.
4. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 3 (Linear Models for Regression).
5. Zou, H. & Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. JRSS Series B, 67(2), 301–320.
6. NPTEL Lectures on Machine Learning. IIT Madras / IIT Bombay — Prof. Mitesh Khapra, CS7015: Deep Learning.