- Ways to reduce correlation in machine learning data
- Examples of each technique in Python
- When to use which technique, and in which use cases
Ways to Reduce Correlation
1. Feature Selection
Identify and remove one or more of the highly correlated features, retaining the most informative variables while discarding redundant ones.
Techniques:
- Manual Identification: Examining the correlation matrix (e.g., using a heatmap) and manually selecting features to remove based on a high correlation threshold.
- Variance Inflation Factor (VIF): VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis. A high VIF (typically > 5 or 10) indicates strong correlation with other predictors. Features with high VIF can be iteratively removed.
- Model-Based Selection: Using feature importance scores from models like tree-based methods (e.g., Random Forest, Gradient Boosting) or linear models with regularization (Lasso) to identify and remove less important correlated features.
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
import matplotlib.pyplot as plt
np.random.seed(42)
n_samples = 100
X1 = np.random.rand(n_samples)
X2 = 0.8 * X1 + 0.1 * np.random.rand(n_samples) # X2 is highly correlated with X1
X3 = np.random.rand(n_samples)
y = 2 * X1 + 3 * X2 + 0.5 * X3 + np.random.randn(n_samples)
df = pd.DataFrame({'Feature1': X1, 'Feature2': X2, 'Feature3': X3, 'Target': y})
X = df[['Feature1', 'Feature2', 'Feature3']]
correlation_matrix = X.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
print(correlation_matrix)
Python Examples:
# Manual identification using the correlation matrix (plotted above).
# Feature pairs shown in red in the heatmap are highly correlated.
# We might choose to remove one of them, e.g., Feature2.
X_manual_selected = X[['Feature1', 'Feature3']]
print("\nFeatures after manual selection:\n", X_manual_selected.head())
# Variance Inflation Factor (VIF)
# Add a constant column so each VIF is computed against a regression with an intercept.
X_with_const = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i + 1) for i in range(X.shape[1])]
print("\nVIF values:\n", vif_data)
# Feature1 and Feature2 both have very high VIFs because they are nearly collinear.
# Removing one of them (e.g., Feature2) resolves the multicollinearity.
X_vif_selected = X[['Feature1', 'Feature3']]
print("\nFeatures after VIF-based selection:\n", X_vif_selected.head())
# Model-Based Selection using Random Forest
model_rf = RandomForestRegressor(random_state=42)
model_rf.fit(X, df['Target'])
feature_importances_rf = pd.Series(model_rf.feature_importances_, index=X.columns)
print("\nRandom Forest Feature Importances:\n", feature_importances_rf.sort_values(ascending=False))
# If Feature2 has low importance despite high correlation, we might remove it.
X_rf_selected = X[['Feature1', 'Feature3']]
print("\nFeatures after Random Forest selection:\n", X_rf_selected.head())
# Model-Based Selection using Lasso (L1 Regularization)
model_lasso = Lasso(alpha=0.1, random_state=42) # Adjust alpha for desired sparsity
model_lasso.fit(X, df['Target'])
feature_coefficients_lasso = pd.Series(model_lasso.coef_, index=X.columns)
print("\nLasso Coefficients:\n", feature_coefficients_lasso)
# Features with zero coefficients are effectively removed by Lasso.
selected_features_lasso = feature_coefficients_lasso[feature_coefficients_lasso != 0].index.tolist()
X_lasso_selected = X[selected_features_lasso]
print("\nFeatures selected by Lasso:\n", X_lasso_selected.head())
When to Use:
- When interpretability of the model is important. Keeping a smaller set of features makes the model easier to understand.
- When the number of features is large, and you suspect redundancy.
- As a preprocessing step before applying algorithms sensitive to multicollinearity (like linear regression without regularization).
Use Cases:
- Linear Regression: To obtain stable and interpretable coefficients.
- Logistic Regression: For similar reasons as linear regression.
- Situations where you need to explain the model’s predictions based on a minimal set of factors.
2. Dimensionality Reduction Techniques
These techniques aim to transform the original features into a lower-dimensional space while preserving as much variance as possible. The new components are typically uncorrelated.
Techniques:
- Principal Component Analysis (PCA): A linear dimensionality reduction technique that finds orthogonal principal components (linear combinations of the original features) that capture the maximum variance in the data. The first few principal components often explain most of the variability, and they are uncorrelated by construction.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Scale the data before applying PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Original data shape:", X.shape)
print("PCA transformed data shape:", X_pca.shape)
print("\nFirst few PCA components:\n", X_pca[:5])
# The new components (columns of X_pca) are uncorrelated.
pca_df = pd.DataFrame(data=X_pca, columns=['Principal Component 1', 'Principal Component 2'])
print("\nCorrelation matrix of PCA components:\n", pca_df.corr())
When to Use:
- When the primary goal is to reduce the number of features without necessarily needing to interpret the new components in terms of the original features.
- As a preprocessing step for algorithms that perform better with lower-dimensional data or when dealing with a very large number of features.
- When visualization of high-dimensional data is needed (by reducing to 2 or 3 components).
Use Cases:
- Image Compression: Reducing the dimensionality of image data while retaining important information.
- Noise Reduction: PCA can sometimes help in separating signal from noise.
- Speeding up Machine Learning Algorithms: Training on a lower-dimensional dataset can be faster.
3. Creating Interaction Terms or Polynomial Features
Sometimes, the correlation arises because the relationship between the features and the target is not linear or additive. Creating new features that are combinations (interaction terms) or powers (polynomial features) of the original correlated features might capture the underlying relationship better and potentially reduce the need for both original features.
Python Example:
from sklearn.preprocessing import PolynomialFeatures
# Create polynomial features of degree 2 (includes interaction term)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
feature_names_poly = poly.get_feature_names_out(input_features=X.columns)
X_poly_df = pd.DataFrame(X_poly, columns=feature_names_poly)
print("Original data shape:", X.shape)
print("Polynomial features data shape:", X_poly_df.shape)
print("\nFirst few polynomial features:\n", X_poly_df.head())
# Check the correlation of the new features
correlation_matrix_poly = X_poly_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_poly.iloc[:3, :3], annot=True, cmap='coolwarm',
            xticklabels=X.columns, yticklabels=X.columns)
plt.title('Correlation Matrix of Original Features in Polynomial Feature Set')
plt.show()
# Note that the correlation between the original features might still be present
# in the set of polynomial features. The goal here is to potentially capture
# the target relationship better using these new combinations.
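One way to check whether the expanded feature set actually captures the target relationship better is to compare cross-validated scores for a plain linear fit on the original features versus the polynomial features. A minimal sketch; the 5-fold split and R² scoring are illustrative choices, not part of the original example:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Compare cross-validated R^2 with and without the polynomial/interaction terms
lr = LinearRegression()
r2_original = cross_val_score(lr, X, df['Target'], cv=5, scoring='r2').mean()
r2_poly = cross_val_score(lr, X_poly_df, df['Target'], cv=5, scoring='r2').mean()
print(f"Mean CV R^2, original features:   {r2_original:.3f}")
print(f"Mean CV R^2, polynomial features: {r2_poly:.3f}")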
When to Use:
- When you suspect non-linear relationships between features and the target variable.
- When domain knowledge suggests that the interaction between certain features is important.
Use Cases:
- Modeling physical phenomena where variables interact (e.g., force depends on mass and acceleration).
- Capturing synergistic or antagonistic effects of different factors.
4. Using Regularization Techniques
Regularized linear models like Ridge Regression (L2 regularization) and Lasso (L1 regularization) can mitigate the impact of multicollinearity. They add a penalty term to the loss function that discourages large coefficients. While they don’t explicitly remove correlated features (although Lasso can drive some coefficients to zero, effectively dropping those features), they stabilize the model and reduce the coefficient variance caused by multicollinearity.
Python Example:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, df['Target'], test_size=0.3, random_state=42)
# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Ridge Regression
ridge = Ridge(alpha=1.0) # Adjust alpha
ridge.fit(X_train_scaled, y_train)
ridge_predictions = ridge.predict(X_test_scaled)
ridge_mse = mean_squared_error(y_test, ridge_predictions)
print(f"Ridge Regression MSE: {ridge_mse:.4f}")
print("Ridge Coefficients:", ridge.coef_)
# Lasso Regression
lasso = Lasso(alpha=0.1) # Adjust alpha
lasso.fit(X_train_scaled, y_train)
lasso_predictions = lasso.predict(X_test_scaled)
lasso_mse = mean_squared_error(y_test, lasso_predictions)
print(f"\nLasso Regression MSE: {lasso_mse:.4f}")
print("Lasso Coefficients:", lasso.coef_)
When to Use:
- When building linear models (linear regression, logistic regression) and you suspect multicollinearity.
- When you want to prevent overfitting, especially when dealing with a large number of features.
- Lasso (L1) is particularly useful when you also want to perform feature selection by driving some coefficients to zero.
Use Cases:
- Predictive modeling with linear relationships where multicollinearity might be an issue.
- Situations where a balance between model complexity and prediction accuracy is needed.
5. Centering and Scaling
While centering (subtracting the mean) and scaling (dividing by the standard deviation) the features do not directly reduce the correlation between them, they can help in making the model training process more stable and can be a prerequisite for some techniques like PCA and regularized linear models. Scaling ensures that features with larger ranges do not dominate the model.
Python Example:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
print("Original data head:\n", X.head())
print("\nScaled data head:\n", X_scaled_df.head())
print("\nCorrelation of original data:\n", X.corr())
print("\nCorrelation of scaled data:\n", X_scaled_df.corr())
# Note that the correlation coefficients remain the same after scaling.
When to Use:
- As a standard preprocessing step before applying many machine learning algorithms, especially those sensitive to feature scales (e.g., gradient-based methods, distance-based methods, PCA, regularization).
Use Cases:
- Almost all machine learning tasks involving numerical features.
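When scaling is combined with a downstream model, it is usually wrapped in a Pipeline so the scaler is fit only on the training folds during cross-validation. A minimal sketch with Ridge as an example downstream model (the pairing is illustrative):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Fitting the scaler inside the pipeline avoids leaking test-fold statistics
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
scores = cross_val_score(pipe, X, df['Target'], cv=5, scoring='r2')
print("Mean CV R^2 with scaling + Ridge:", scores.mean())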
Choosing the Right Technique
The best approach depends on the specific dataset, the machine learning algorithm you intend to use, and your goals (e.g., interpretability, prediction accuracy).
- If interpretability is crucial and you have a good understanding of your features, manual feature selection based on the correlation matrix and domain knowledge can be effective.
- When dealing with a large number of features and you want to reduce dimensionality while retaining most of the variance, PCA is a powerful tool. However, the resulting principal components might not be easily interpretable.
- If you suspect non-linear relationships, creating polynomial or interaction terms might be beneficial, but be cautious of increasing the dimensionality significantly.
- For linear models, regularization techniques (Ridge and Lasso) are excellent for handling multicollinearity and preventing overfitting. Lasso can also perform feature selection.
- VIF can be a more systematic way to identify and remove highly collinear features in the context of linear regression.
- Model-based feature selection can be useful when the importance of features is context-dependent on the model being used.
It’s often good practice to explore multiple techniques and evaluate their impact on your model’s performance using cross-validation. Remember that reducing correlation is usually a trade-off: you need to balance it against the need to retain information relevant to your prediction task.
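As a closing illustration of that kind of evaluation, here is a minimal sketch comparing a plain linear model on the full feature set, the manually reduced set, and the PCA components from the earlier examples; the 5-fold R² setup is an assumption, not a prescription:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Compare the same model across differently preprocessed versions of the data
candidates = {
    "All features": X,
    "Manual selection": X_manual_selected,
    "PCA (2 components)": X_pca,
}
for name, features in candidates.items():
    score = cross_val_score(LinearRegression(), features, df['Target'], cv=5, scoring='r2').mean()
    print(f"{name}: mean CV R^2 = {score:.3f}")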