Saba Shahrukh · June 3, 2025
  • What is scaling in Machine Learning
  • When do we use scaling in ML
  • What are the scaling methods
  • How to do it in Python

Taming the Wild Numbers: Why Scaling Your Data is Like Fitting Shoes!

Ever tried to compare the weight of an elephant in kilograms with the length of an ant in millimeters? It’s like comparing apples and oranges, right? In the world of Machine Learning, our data can sometimes be just as wildly different in its “size” or scale. That’s where a handy technique called “Scaling” comes to the rescue!

Think of your data as a bunch of people with different shoe sizes. If you want to make everyone run a race fairly, you wouldn’t want someone with size 15 shoes to have a natural advantage over someone with size 5, just because their shoes are bigger! Scaling is like making sure everyone’s “shoe size” is brought to a similar range so the race is about their actual ability, not just their footwear.


So, What Exactly IS Scaling in Simple Terms?

Imagine you have a list of house prices in Rupees (say, from 50 lakhs to 5 crores) and the sizes of those houses in square feet (say, from 500 sq ft to 5000 sq ft). These numbers are on totally different scales! If we feed this raw data to some Machine Learning “brains,” it might get confused and think that the price is way more important than the size just because the numbers are much bigger.

Scaling is like putting these different types of numbers on a more level playing field. We transform them so they fall within a similar range, making it easier for the Machine Learning model to learn fairly from all the information.


When Do We Need This “Shoe Fitting”?

We usually need to scale our data when we’re using certain types of Machine Learning “brains” that are sensitive to the size of the numbers. Here are a few relatable examples:

  • Finding the Nearest Neighbor (Like Recommending Similar Movies): Imagine you’re trying to find movies similar to one you liked. If one movie has a rating out of 5 and another has box office collections in millions, the “distance” between them might be unfairly influenced by the large box office numbers if we don’t scale them. Scaling helps us compare them based on their actual similarity in features, not just the magnitude of the numbers (see the small sketch right after this list).
  • Learning Step-by-Step (Like a Student Studying): Some Machine Learning models learn by taking small steps to improve their understanding. If some features have very large values, the steps might become too big and overshoot the correct answer. Scaling helps keep these steps manageable and leads to faster and better learning.
  • Giving Everyone a Fair Score (Like Judging a Competition): If you’re judging a competition where one aspect is scored out of 10 and another out of 100, you’d probably want to bring them to a similar scale (maybe out of 100 for both) to give each aspect fair importance in the final score. Similarly, scaling ensures that all our data features contribute equally to the Machine Learning model’s decision.
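
To make the “nearest neighbor” point concrete, here is a minimal sketch (the movie numbers are invented purely for illustration) of how one large-scale feature can dominate a distance calculation, and how Min-Max scaling evens things out:

Python

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three made-up movies described by (rating out of 5, box office in millions).
movie_a = np.array([4.8, 500.0])
movie_b = np.array([4.6, 450.0])   # very similar rating to movie_a
movie_c = np.array([1.0, 520.0])   # very different rating, similar box office

# Without scaling, the box-office numbers dominate the distance:
print(np.linalg.norm(movie_a - movie_b))  # ~50.0 -> the similar film looks "far"
print(np.linalg.norm(movie_a - movie_c))  # ~20.4 -> the very different film looks "close"

# After Min-Max scaling both features to [0, 1], the rating matters again:
scaler = MinMaxScaler().fit([movie_a, movie_b, movie_c])
a, b, c = scaler.transform([movie_a, movie_b, movie_c])
print(np.linalg.norm(a - b))  # ~0.72 -> now the similar film is the closer one
print(np.linalg.norm(a - c))  # ~1.04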

Different Ways to “Fit the Shoes” (Scaling Methods):

There are a few common techniques we use to scale our data. Think of them as different ways to adjust the shoe sizes:

  1. The “Bring Everyone to the Same Finish Line” Method (Min-Max Scaling):
    • Imagine all runners need to finish within a track of 0 to 1 meter. We take each person’s original “distance” (their data value) and squeeze or stretch it to fit within this 0 to 1 range.
    • Example: If house prices range from 50 lakhs to 5 crores, Min-Max scaling would transform the 50 lakh price to 0 and the 5 crore price to 1, with all other prices falling somewhere in between (a tiny worked version of this follows the list).
    • Use Case: Useful when the exact range of values is important or when we’re using algorithms that work best with values in a specific range (like some types of image processing).
  2. The “How Far From Average” Method (Standardization):
    • Instead of a fixed finish line, we see how far each runner is from the average runner. We also consider how spread out the runners are. This method makes the average value 0 and the spread (standard deviation) 1.
    • Example: For house sizes, we’d find the average size and how much individual sizes vary from this average. A house much larger than average would get a positive standardized value, while a smaller one would get a negative value.
    • Use Case: Very common and works well for many Machine Learning algorithms, especially those that assume the data follows a bell-shaped curve (like linear regression and some neural networks). It’s less sensitive to extreme values (outliers) than Min-Max scaling.
  3. The “Ignoring the Super Unusual” Method (Robust Scaling):
    • Imagine a race where a few super-fast or super-slow runners might skew the average. Robust scaling focuses on the main group of runners in the middle and scales based on that, ignoring the extreme outliers.
    • Example: If we have some very unusually expensive houses in our dataset, Robust scaling will scale the majority of the house prices based on the typical range, without being heavily influenced by those few expensive mansions.
    • Use Case: Great when your data has a lot of outliers that you don’t want to affect the scaling too much.
  4. The “Scaling by the Biggest Shoe” Method (Max Absolute Scaling):
    • We find the runner with the absolute biggest “distance” (positive or negative) from the starting point and then scale everyone else based on that. This makes sure all values fall between -1 and 1.
    • Example: If house prices range from -₹1 crore (maybe a data error) to ₹5 crores, we’d divide all prices by 5 crores, ensuring all scaled values are between -0.2 and 1.
    • Use Case: Useful for data that has both positive and negative values and when preserving zero values is important (like in some text processing techniques).
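
Returning to the Min-Max example in point 1, here is the same idea worked out with actual numbers (a minimal pure-Python sketch; the middle prices are made up just to fill in the range):

Python

# Min-Max scaling by hand, using the house-price figures from above.
# 50 lakh = 5,000,000 rupees and 5 crore = 50,000,000 rupees.
prices = [5_000_000, 20_000_000, 35_000_000, 50_000_000]

lo, hi = min(prices), max(prices)
scaled = [(p - lo) / (hi - lo) for p in prices]

print(scaled)   # [0.0, 0.333..., 0.666..., 1.0]
# The cheapest house maps to 0, the most expensive to 1,
# and every other price lands proportionally in between.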

Doing the “Shoe Fitting” in Python (Simple Code):

Python makes scaling super easy with a library called scikit-learn. Here’s a simple example of how you can do it:

Python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Let's say we have some house data
data = {'Price': [5000000, 12000000, 8000000, 25000000],
        'Size': [1000, 2500, 1500, 4000]}
df = pd.DataFrame(data)

# Separate the features (what we want to scale)
X = df[['Price', 'Size']]

# It's a good practice to split your data into training and testing sets first!
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# 1. Min-Max Scaling
min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train) # Learn the scaling from the training data
X_test_minmax = min_max_scaler.transform(X_test)       # Apply the same scaling to the test data

print("Min-Max Scaled Training Data:\n", pd.DataFrame(X_train_minmax, columns=['Price_Scaled', 'Size_Scaled']))
print("\nMin-Max Scaled Testing Data:\n", pd.DataFrame(X_test_minmax, columns=['Price_Scaled', 'Size_Scaled']))

# 2. Standardization
standard_scaler = StandardScaler()
X_train_standard = standard_scaler.fit_transform(X_train) # Learn the scaling from the training data
X_test_standard = standard_scaler.transform(X_test)     # Apply the same scaling to the test data

print("\nStandardized Training Data:\n", pd.DataFrame(X_train_standard, columns=['Price_Scaled', 'Size_Scaled']))
print("\nStandardized Testing Data:\n", pd.DataFrame(X_test_standard, columns=['Price_Scaled', 'Size_Scaled']))

Key Takeaway:

Scaling is a simple yet powerful technique in Machine Learning. By bringing our data features to a similar scale, we ensure that our models learn fairly and effectively from all the information, just like making sure all runners have appropriately sized shoes for a fair race! Choosing the right scaling method depends on the specific characteristics of your data and the Machine Learning algorithm you’re using. So, next time you’re working with data that has features on different scales, remember to give them a good “shoe fitting”!


Diving deeper and getting more technical

Scaling in machine learning is a crucial preprocessing technique used to standardize or normalize the range of independent variables or features in a dataset. It aims to bring all the features onto a similar scale, preventing features with larger values from dominating those with smaller values, which can negatively impact the performance of many machine learning algorithms.


When do we use scaling?

Feature scaling is highly recommended in the following scenarios:

  • Algorithms sensitive to feature magnitude: Algorithms that rely on distance calculations (e.g., K-Nearest Neighbors, K-Means, Support Vector Machines) or gradient descent (e.g., linear regression, logistic regression, neural networks) are significantly affected by the scale of features. Without scaling, features with larger ranges can disproportionately influence the results.
  • Improved convergence speed: Gradient descent-based algorithms converge much faster when the features are on a similar scale. Scaling helps the algorithm move more directly towards the minimum of the cost function.
  • Preventing bias: Features with larger magnitudes might be mistakenly considered more important by the model if not scaled, leading to biased learning.
  • Regularization: When using regularization techniques (like L1 or L2 regularization), scaling ensures that all features are penalized appropriately.
  • Principal Component Analysis (PCA): PCA aims to find components with maximum variance. If features have different scales, those with larger scales might dominate the principal components.

However, scaling might not be necessary for all algorithms, especially tree-based models like Decision Trees and Random Forests. These algorithms make splits based on individual features and are generally insensitive to the scale of other features.
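
As a practical illustration of these points, a distance-based model such as K-Nearest Neighbors is usually combined with a scaler, while a tree-based model can work on the raw features. A minimal sketch, using scikit-learn's built-in wine dataset purely for demonstration:

Python

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The wine dataset has features on very different scales (e.g., proline vs. hue).
X, y = load_wine(return_X_y=True)

# KNN relies on distances, so scaling usually improves its cross-validated accuracy.
knn_raw = KNeighborsClassifier()
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())
print("KNN without scaling:", cross_val_score(knn_raw, X, y, cv=5).mean())
print("KNN with scaling   :", cross_val_score(knn_scaled, X, y, cv=5).mean())

# A random forest splits on one feature at a time, so scaling typically changes little.
forest = RandomForestClassifier(random_state=42)
print("Random forest      :", cross_val_score(forest, X, y, cv=5).mean())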


Scaling Methods

There are several common methods for feature scaling:

  1. Min-Max Scaling (Normalization):
    • Scales the features to a specific range, typically between 0 and 1.
    • Formula: X_scaled = (X − X_min) / (X_max − X_min)
    • Useful when you need values within a specific range and the distribution is not necessarily Gaussian. Sensitive to outliers.
  2. Standardization (Z-score Normalization):
    • Scales the features to have a mean of 0 and a standard deviation of 1.
    • Formula: X_scaled = (X − μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
    • Less affected by outliers compared to Min-Max scaling and is suitable for algorithms that assume a Gaussian distribution.
  3. Robust Scaling:
    • Scales features using statistics that are robust to outliers, such as the median and interquartile range (IQR).
    • Formula: X_scaled = (X − median) / IQR
    • Useful when the data contains significant outliers.
  4. Max Absolute Scaling:
    • Scales each feature by its maximum absolute value. The resulting values will be in the range [-1, 1].
    • Formula: X_scaled = X / max(|X|)
    • Suitable for sparse data as it preserves zero entries.
  5. Unit Vector Scaling (Normalization to Unit Norm):
    • Scales each sample (row) to have a unit norm (length of 1). This is often used in text processing or when the direction of the data is more important than the magnitude.
    • Common norms used are the L1 and L2 norms. For the L2 norm: X_scaled = X / sqrt(Σᵢ Xᵢ²)
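
To connect these formulas to actual numbers, here is a small NumPy sketch that applies each formula by hand to one feature (this mirrors what the scikit-learn scalers do column-wise, except for unit-norm scaling, which works row-wise):

Python

import numpy as np

x = np.array([5.0, 10.0, 15.0, 20.0, 100.0])   # one feature, with an outlier

# Min-Max: (x - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std  (population std, as StandardScaler uses)
standard = (x - x.mean()) / x.std()

# Robust: (x - median) / IQR, where IQR = Q3 - Q1
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

# Max Absolute: x / max(|x|)
maxabs = x / np.abs(x).max()

print("minmax  ", np.round(minmax, 3))
print("standard", np.round(standard, 3))
print("robust  ", np.round(robust, 3))
print("maxabs  ", np.round(maxabs, 3))

# Unit-norm scaling works per sample (row), not per feature:
row = np.array([3.0, 4.0])
print("unit L2 ", row / np.linalg.norm(row))   # [0.6, 0.8], length 1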

Scaling in Python

The scikit-learn library in Python provides convenient tools for scaling data using the preprocessing module. Here’s how you can implement the common scaling methods:

Python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler, Normalizer

# Sample DataFrame
data = {'Age': [25, 40, 30, 65, 22],
        'Income': [30000, 75000, 55000, 120000, 28000],
        'Height': [1.65, 1.80, 1.72, 1.78, 1.68]}
df = pd.DataFrame(data)

# Separate features
X = df[['Age', 'Income', 'Height']]

# Split data into training and testing sets (important to fit scaler on training data only)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# 1. Min-Max Scaling
min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)
df_train_minmax = pd.DataFrame(X_train_minmax, columns=X_train.columns)
df_test_minmax = pd.DataFrame(X_test_minmax, columns=X_test.columns)
print("Min-Max Scaled Training Data:\n", df_train_minmax)
print("\nMin-Max Scaled Testing Data:\n", df_test_minmax)

# 2. Standardization
standard_scaler = StandardScaler()
X_train_standard = standard_scaler.fit_transform(X_train)
X_test_standard = standard_scaler.transform(X_test)
df_train_standard = pd.DataFrame(X_train_standard, columns=X_train.columns)
df_test_standard = pd.DataFrame(X_test_standard, columns=X_test.columns)
print("\nStandardized Training Data:\n", df_train_standard)
print("\nStandardized Testing Data:\n", df_test_standard)

# 3. Robust Scaling
robust_scaler = RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)
X_test_robust = robust_scaler.transform(X_test)
df_train_robust = pd.DataFrame(X_train_robust, columns=X_train.columns)
df_test_robust = pd.DataFrame(X_test_robust, columns=X_test.columns)
print("\nRobust Scaled Training Data:\n", df_train_robust)
print("\nRobust Scaled Testing Data:\n", df_test_robust)

# 4. Max Absolute Scaling
max_abs_scaler = MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_test_maxabs = max_abs_scaler.transform(X_test)
df_train_maxabs = pd.DataFrame(X_train_maxabs, columns=X_train.columns)
df_test_maxabs = pd.DataFrame(X_test_maxabs, columns=X_test.columns)
print("\nMax Absolute Scaled Training Data:\n", df_train_maxabs)
print("\nMax Absolute Scaled Testing Data:\n", df_test_maxabs)

# 5. Unit Vector Scaling (L2 Norm)
normalizer_l2 = Normalizer(norm='l2')
X_train_normalized_l2 = normalizer_l2.fit_transform(X_train)
X_test_normalized_l2 = normalizer_l2.transform(X_test)
df_train_normalized_l2 = pd.DataFrame(X_train_normalized_l2, columns=X_train.columns)
df_test_normalized_l2 = pd.DataFrame(X_test_normalized_l2, columns=X_test.columns)
print("\nL2 Normalized Training Data:\n", df_train_normalized_l2)
print("\nL2 Normalized Testing Data:\n", df_test_normalized_l2)

Important Note: It is crucial to fit the scaler (the fit() method) only on the training data and then use the same fitted scaler to transform both the training and testing data (the transform() method). This prevents data leakage from the test set into the training process.
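
One convenient way to follow this rule automatically is to put the scaler inside a scikit-learn Pipeline, so that fit() only ever sees the training data. A minimal sketch, with the dataset and estimator chosen purely for illustration:

Python

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pipeline fits the scaler on the training data only, and automatically
# applies the same fitted transformation whenever it scores or predicts.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))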


Let’s delve deeper into each scaling method with more detailed explanations and illustrative Python code examples. We’ll use a sample dataset to visualize how each method transforms the data distribution.


1. Min-Max Scaling (Normalization)

In Detail:

Min-Max scaling, also known as normalization, linearly transforms the data to fit within a specified range, typically [0, 1]. It achieves this by subtracting the minimum value of the feature and then dividing by the range (the difference between the maximum and minimum values).

When to Use:

  • Useful when the range of the data is important.
  • Often used as a preliminary step for algorithms that are sensitive to the input data range, such as neural networks.
  • Can be beneficial when you need all features to have a consistent scale for comparison or visualization.

Sensitivity to Outliers:

Min-Max scaling is highly sensitive to outliers. Outliers can significantly stretch or compress the scaled data, making the majority of the data points fall within a very narrow range.

Python Example:

Python

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns

# Sample Data with potential outliers
data = {'Feature_A': [5, 10, 15, 20, 100],
        'Feature_B': [-2, 0, 5, 8, 15]}
df = pd.DataFrame(data)

# Initialize the MinMaxScaler
min_max_scaler = MinMaxScaler()

# Fit and transform the data
df_scaled = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)

# Visualize the original and scaled data
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.boxplot(data=df)
plt.title('Original Data Distribution')

plt.subplot(1, 2, 2)
sns.boxplot(data=df_scaled)
plt.title('Min-Max Scaled Data Distribution')

plt.tight_layout()
plt.show()

print("Original Data:\n", df)
print("\nMin-Max Scaled Data:\n", df_scaled)

In the box plots, you can observe how the range of each feature is transformed to [0, 1]. Notice how the outlier in ‘Feature_A’ (value 100) compresses the scaling of the other values.


2. Standardization (Z-score Normalization)

In Detail:

Standardization centers the data around the mean and scales it to have a unit standard deviation. It subtracts the mean from each data point and then divides by the standard deviation. The resulting distribution will have a mean of 0 and a standard deviation of 1.

When to Use:

  • Suitable for algorithms that assume a Gaussian-like distribution of the data.
  • Less sensitive to outliers compared to Min-Max scaling because it uses the standard deviation, which is less affected by extreme values than the range.
  • Commonly used in algorithms like linear regression, logistic regression, and support vector machines.

Sensitivity to Outliers:

While less sensitive than Min-Max scaling, extreme outliers can still influence the mean and standard deviation, thus affecting the standardization.

Python Example:

Python

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import seaborn as sns

# Sample Data with potential outliers
data = {'Feature_A': [5, 10, 15, 20, 100],
        'Feature_B': [-2, 0, 5, 8, 15]}
df = pd.DataFrame(data)

# Initialize the StandardScaler
standard_scaler = StandardScaler()

# Fit and transform the data
df_scaled = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)

# Visualize the original and scaled data
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.boxplot(data=df)
plt.title('Original Data Distribution')

plt.subplot(1, 2, 2)
sns.boxplot(data=df_scaled)
plt.title('Standardized Data Distribution')

plt.tight_layout()
plt.show()

print("Original Data:\n", df)
print("\nStandardized Data:\n", df_scaled)

The box plots illustrate how standardization centers the data around zero. The spread of the scaled data is related to the standard deviation of the original data. The outlier still has an effect but doesn’t compress the other data points as much as in Min-Max scaling.
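
A quick sanity check on the result (continuing the example above, where df_scaled is the standardized DataFrame) is to look at the column means and standard deviations directly:

Python

# Sanity check on the standardized columns (continuing the example above).
# StandardScaler divides by the population standard deviation (ddof=0).
print(df_scaled.mean().round(6))        # approximately 0 for every column
print(df_scaled.std(ddof=0).round(6))   # approximately 1 for every column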


3. Robust Scaling

In Detail:

Robust scaling uses the median and the interquartile range (IQR) to scale the data. It subtracts the median from each data point and then divides by the IQR. The IQR is the range between the 25th percentile (Q1) and the 75th percentile (Q3), which is less sensitive to outliers.

When to Use:

  • Highly beneficial when the dataset contains significant outliers.
  • Provides a more stable scaling compared to Min-Max and standardization in the presence of extreme values.

Sensitivity to Outliers:

Robust scaling is designed to be less affected by outliers because it relies on the median and IQR, which are robust statistics.

Python Example:

Python

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler
import seaborn as sns

# Sample Data with significant outliers
data = {'Feature_A': [5, 10, 15, 20, 100],
        'Feature_B': [-10, 0, 5, 10, 20]}
df = pd.DataFrame(data)

# Initialize the RobustScaler
robust_scaler = RobustScaler()

# Fit and transform the data
df_scaled = pd.DataFrame(robust_scaler.fit_transform(df), columns=df.columns)

# Visualize the original and scaled data
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.boxplot(data=df)
plt.title('Original Data Distribution (with Outliers)')

plt.subplot(1, 2, 2)
sns.boxplot(data=df_scaled)
plt.title('Robust Scaled Data Distribution')

plt.tight_layout()
plt.show()

print("Original Data:\n", df)
print("\nRobust Scaled Data:\n", df_scaled)

The box plots clearly show how robust scaling handles the outliers. The majority of the data points are scaled within a more reasonable range, and the influence of the extreme values is minimized.
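
To tie this back to the formula, you can reproduce RobustScaler’s output for one column by hand (a minimal check that continues the example above):

Python

import numpy as np

# Manual robust scaling of Feature_A (continuing the example above):
x = df['Feature_A'].to_numpy(dtype=float)   # [5, 10, 15, 20, 100]
q1, q3 = np.percentile(x, [25, 75])         # 10.0 and 20.0
manual = (x - np.median(x)) / (q3 - q1)     # (x - 15) / 10
print(manual)   # [-1.  -0.5  0.   0.5  8.5], matching df_scaled['Feature_A']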


4. Max Absolute Scaling

In Detail:

Max absolute scaling scales each feature by dividing each value by the maximum absolute value of that feature. As a result, all scaled values will be in the range [-1, 1]. This method does not shift the data; it only scales it.

When to Use:

  • Useful when preserving zero entries in sparse data is important.
  • Can be helpful when the magnitude of the data is more important than the relative differences between values.

Sensitivity to Outliers:

The maximum absolute value can be influenced by outliers, which in turn can affect the scaling of the other data points.

Python Example:

Python

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MaxAbsScaler
import seaborn as sns

# Sample Data with potential outliers and negative values
data = {'Feature_A': [-5, 10, -15, 20, -100],
        'Feature_B': [-2, 0, 5, -8, 15]}
df = pd.DataFrame(data)

# Initialize the MaxAbsScaler
max_abs_scaler = MaxAbsScaler()

# Fit and transform the data
df_scaled = pd.DataFrame(max_abs_scaler.fit_transform(df), columns=df.columns)

# Visualize the original and scaled data
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.boxplot(data=df)
plt.title('Original Data Distribution')

plt.subplot(1, 2, 2)
sns.boxplot(data=df_scaled)
plt.title('Max Absolute Scaled Data Distribution')

plt.tight_layout()
plt.show()

print("Original Data:\n", df)
print("\nMax Absolute Scaled Data:\n", df_scaled)

The box plots show how the data is scaled within the [-1, 1] range. The maximum absolute value in ‘Feature_A’ is 100, so all values are divided by 100. Similarly, for ‘Feature_B’, the maximum absolute value is 15.


5. Unit Vector Scaling (Normalization to Unit Norm)

In Detail:

Unit vector scaling aims to scale the vectors (rows of data) to have a unit norm (a length of 1). This is done independently for each sample. Common norms used are the L1 norm and the L2 norm.

  • L1 Norm: The sum of the absolute values of the vector elements is 1. X_scaled,i = X_i / Σⱼ |X_j|
  • L2 Norm: The Euclidean norm (square root of the sum of the squared values) of the vector is 1. X_scaled,i = X_i / sqrt(Σⱼ X_j²)

When to Use:

  • Often used in text processing (e.g., normalizing word vectors) and in scenarios where the direction or orientation of the data points is more important than their magnitude.
  • Can be useful in clustering algorithms or when dealing with data that lies on a sphere.

Sensitivity to Outliers:

Unit vector scaling can be sensitive to outliers within a sample (row), as the outlier’s magnitude can dominate the norm calculation.

Python Example:

Python

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import Normalizer
import numpy as np

# Sample Data
data = {'Feature_A': [5, 10],
        'Feature_B': [-3, 4],
        'Feature_C': [1, -2]}
df = pd.DataFrame(data)

# Initialize the Normalizer (using L2 norm by default)
normalizer_l2 = Normalizer(norm='l2')
df_scaled_l2 = pd.DataFrame(normalizer_l2.fit_transform(df), columns=df.columns)
print("L2 Normalized Data:\n", df_scaled_l2)

# Initialize the Normalizer with L1 norm
normalizer_l1 = Normalizer(norm='l1')
df_scaled_l1 = pd.DataFrame(normalizer_l1.fit_transform(df), columns=df.columns)
print("\nL1 Normalized Data:\n", df_scaled_l1)

# Visualizing the effect on a single data point (vector)
original_vector = df.iloc[0].values
scaled_vector_l2 = df_scaled_l2.iloc[0].values
scaled_vector_l1 = df_scaled_l1.iloc[0].values

print("\nOriginal Vector:", original_vector)
print("L2 Normalized Vector:", scaled_vector_l2, " (L2 Norm:", np.linalg.norm(scaled_vector_l2), ")")
print("L1 Normalized Vector:", scaled_vector_l1, " (L1 Norm:", np.linalg.norm(scaled_vector_l1, ord=1), ")")

In this example, we demonstrate both L1 and L2 normalization. For each row, the values are scaled such that the L1 norm (sum of absolute values) or the L2 norm (Euclidean distance) of the row becomes 1. This type of scaling operates row-wise, unlike the other scalers that operate column-wise.

Choosing the appropriate scaling method depends on the specific characteristics of your data and the requirements of the machine learning algorithm you intend to use. It’s often a good practice to experiment with different scaling techniques to see which one yields the best performance for your particular problem. Remember to always fit the scaler on the training data and then transform both the training and test sets using the same fitted scaler.

Finally, here is end-to-end Python code that visualizes the different scaling techniques and illustrates how scaling changes the data.

Python

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler

# Generate some synthetic data with different scales and an outlier
data = pd.DataFrame({
    'Feature_A': [5, 10, 15, 20, 100] * 10,
    'Feature_B': [-2, 0, 5, 8, 15] * 10,
    'Feature_C': [0.1, 0.2, 0.15, 0.25, 0.3] * 10  # Feature with a smaller scale
})

# Initialize scalers
min_max_scaler = MinMaxScaler()
standard_scaler = StandardScaler()
robust_scaler = RobustScaler()
max_abs_scaler = MaxAbsScaler()

# Apply scaling
data_minmax = pd.DataFrame(min_max_scaler.fit_transform(data), columns=data.columns)
data_standard = pd.DataFrame(standard_scaler.fit_transform(data), columns=data.columns)
data_robust = pd.DataFrame(robust_scaler.fit_transform(data), columns=data.columns)
data_maxabs = pd.DataFrame(max_abs_scaler.fit_transform(data), columns=data.columns)

# --- Visualization using Box Plots ---
plt.figure(figsize=(15, 10))

plt.subplot(2, 3, 1)
sns.boxplot(data=data)
plt.title('Original Data Distribution')

plt.subplot(2, 3, 2)
sns.boxplot(data=data_minmax)
plt.title('Min-Max Scaled Data')

plt.subplot(2, 3, 3)
sns.boxplot(data=data_standard)
plt.title('Standardized Data')

plt.subplot(2, 3, 4)
sns.boxplot(data=data_robust)
plt.title('Robust Scaled Data')

plt.subplot(2, 3, 5)
sns.boxplot(data=data_maxabs)
plt.title('Max Absolute Scaled Data')

plt.tight_layout()
plt.suptitle('Comparison of Scaling Techniques (Box Plots)', y=1.02)
plt.show()

# --- Visualization using Scatter Plots (for two features) ---
if data.shape[1] >= 2:
    plt.figure(figsize=(15, 10))

    plt.subplot(2, 3, 1)
    plt.scatter(data['Feature_A'], data['Feature_B'])
    plt.xlabel('Feature_A')
    plt.ylabel('Feature_B')
    plt.title('Original Data (Scatter)')
    plt.grid(True)

    plt.subplot(2, 3, 2)
    plt.scatter(data_minmax['Feature_A'], data_minmax['Feature_B'])
    plt.xlabel('Feature_A (Min-Max Scaled)')
    plt.ylabel('Feature_B (Min-Max Scaled)')
    plt.title('Min-Max Scaled Data (Scatter)')
    plt.grid(True)

    plt.subplot(2, 3, 3)
    plt.scatter(data_standard['Feature_A'], data_standard['Feature_B'])
    plt.xlabel('Feature_A (Standardized)')
    plt.ylabel('Feature_B (Standardized)')
    plt.title('Standardized Data (Scatter)')
    plt.grid(True)

    plt.subplot(2, 3, 4)
    plt.scatter(data_robust['Feature_A'], data_robust['Feature_B'])
    plt.xlabel('Feature_A (Robust Scaled)')
    plt.ylabel('Feature_B (Robust Scaled)')
    plt.title('Robust Scaled Data (Scatter)')
    plt.grid(True)

    plt.subplot(2, 3, 5)
    plt.scatter(data_maxabs['Feature_A'], data_maxabs['Feature_B'])
    plt.xlabel('Feature_A (MaxAbs Scaled)')
    plt.ylabel('Feature_B (MaxAbs Scaled)')
    plt.title('Max Absolute Scaled Data (Scatter)')
    plt.grid(True)

    plt.tight_layout()
    plt.suptitle('Comparison of Scaling Techniques (Scatter Plots - Feature A vs B)', y=1.02)
    plt.show()

# --- Visualization using Histograms ---
plt.figure(figsize=(15, 10))

for i, feature in enumerate(data.columns):
    plt.subplot(len(data.columns), 5, 1 + i * 5)
    sns.histplot(data[feature], kde=True)
    plt.title(f'Original - {feature}')

    plt.subplot(len(data.columns), 5, 2 + i * 5)
    sns.histplot(data_minmax[feature], kde=True)
    plt.title(f'Min-Max - {feature}')

    plt.subplot(len(data.columns), 5, 3 + i * 5)
    sns.histplot(data_standard[feature], kde=True)
    plt.title(f'Standard - {feature}')

    plt.subplot(len(data.columns), 5, 4 + i * 5)
    sns.histplot(data_robust[feature], kde=True)
    plt.title(f'Robust - {feature}')

    plt.subplot(len(data.columns), 5, 5 + i * 5)
    sns.histplot(data_maxabs[feature], kde=True)
    plt.title(f'MaxAbs - {feature}')

plt.tight_layout()
plt.suptitle('Comparison of Scaling Techniques (Histograms)', y=1.02)
plt.show()

Explanation:

  1. Import Libraries: We import pandas for data manipulation, matplotlib.pyplot and seaborn for visualization, and the scaling classes from sklearn.preprocessing.
  2. Generate Synthetic Data: We create a sample Pandas DataFrame with three features having different scales and one outlier in ‘Feature_A’. This helps to clearly see the effect of each scaling technique.
  3. Initialize Scalers: We create instances of MinMaxScaler, StandardScaler, RobustScaler, and MaxAbsScaler.
  4. Apply Scaling: We fit each scaler to the original data and then transform the data to get the scaled versions. The results are stored in new DataFrames.
  5. Visualization using Box Plots:
    • We create a figure with subplots to display box plots of the original and scaled data for all features.
    • Box plots are useful for visualizing the distribution, median, quartiles, and potential outliers in each feature before and after scaling.
    • You’ll observe how each scaler changes the range and distribution of the data. Min-Max scaling will squeeze the data into [0, 1]. Standardization will center the data around zero. Robust scaling will be less affected by the outlier. MaxAbs scaling will bring the maximum absolute value of each feature to 1.
  6. Visualization using Scatter Plots (for two features):
    • If the data has at least two features, we create scatter plots of ‘Feature_A’ against ‘Feature_B’ for the original and scaled data.
    • Scatter plots help visualize how the relationship between two features changes after scaling. While the relative positions of the points might remain, the scales of the axes will change.
  7. Visualization using Histograms:
    • We create a figure with subplots to display histograms of each feature before and after applying each scaling technique.
    • Histograms show the frequency distribution of the data. You can observe how scaling affects the spread and shape of the distribution for each feature.

How Scaling Changes Data Visually:

  • Min-Max Scaling: The data will be compressed or stretched to fit within the range of 0 to 1. The shape of the distribution remains the same, but the x-axis scale changes. Outliers can compress the majority of the data into a small range.
  • Standardization: The data will be centered around zero, and the spread will be adjusted to have a standard deviation of one. The shape of the distribution is preserved, but the x-axis is rescaled.
  • Robust Scaling: This method is less influenced by outliers. The central part of the distribution (between the quartiles) will be scaled to a smaller range around zero, while outliers will remain relatively far from the central cluster.
  • Max Absolute Scaling: The data will be scaled so that the maximum absolute value in each feature becomes 1. The distribution’s shape is retained, but the x-axis is scaled to [-1, 1] if there are negative values, or [0, 1] if all values are non-negative.

By running this code, you will get three sets of visualizations (box plots, scatter plots if applicable, and histograms) that clearly illustrate how each scaling technique transforms the data’s distribution and range. This visual comparison is invaluable for understanding the impact of different scaling methods on your dataset.
