Saba Shahrukh · May 28, 2025
  • How to check for outliers in Python in machine learning
  • All the ways to check for outliers in Python
  • Example code showing how to check for outliers

You’re looking to identify those pesky outliers in your machine learning datasets using Python! They can really throw off your model training, so it’s a great idea to get a handle on them. Here are several common methods, along with example code using popular Python libraries like NumPy, Pandas, and Scikit-learn:

1. Visual Inspection

This is often the first and most intuitive step. Visualizing your data can reveal points that lie far away from the general distribution.

  • Box Plots: Excellent for showing the distribution of a single numerical feature and highlighting potential outliers based on the interquartile range (IQR).
  • Scatter Plots: Useful for identifying outliers in the relationship between two numerical features.
  • Histograms: Can show unusual values or separate peaks that might indicate outliers.

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Sample data
data = pd.DataFrame({
    'feature_1': np.concatenate([np.random.normal(0, 1, 100), [5, -4]]),
    'feature_2': np.concatenate([np.random.normal(2, 1, 100), [8, -3]])
})

# Box plots
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.boxplot(x=data['feature_1'])
plt.title('Box Plot of Feature 1')

plt.subplot(1, 2, 2)
sns.boxplot(x=data['feature_2'])
plt.title('Box Plot of Feature 2')
plt.show()

# Scatter plot
plt.figure(figsize=(6, 6))
sns.scatterplot(x='feature_1', y='feature_2', data=data)
plt.title('Scatter Plot of Feature 1 vs Feature 2')
plt.show()

# Histograms
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.histplot(data['feature_1'], kde=True)
plt.title('Histogram of Feature 1')

plt.subplot(1, 2, 2)
sns.histplot(data['feature_2'], kde=True)
plt.title('Histogram of Feature 2')
plt.show()

2. Z-Score (Standard Score)

The Z-score measures how many standard deviations a data point is from the mean of the distribution. A common threshold for identifying outliers is a Z-score greater than 3 or less than -3.

Python

from scipy import stats

# Calculate Z-scores
data['feature_1_zscore'] = np.abs(stats.zscore(data['feature_1']))
data['feature_2_zscore'] = np.abs(stats.zscore(data['feature_2']))

# Identify outliers based on Z-score > 3
outliers_zscore_feature1 = data[data['feature_1_zscore'] > 3]
outliers_zscore_feature2 = data[data['feature_2_zscore'] > 3]

print("Outliers in Feature 1 (Z-score > 3):\n", outliers_zscore_feature1)
print("\nOutliers in Feature 2 (Z-score > 3):\n", outliers_zscore_feature2)

3. Interquartile Range (IQR) Method

The IQR method identifies outliers based on the IQR, which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Outliers are typically defined as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. The factor of 1.5 is common, but you can adjust it.

Python

# Calculate Q1, Q3, and IQR for Feature 1
Q1_f1 = data['feature_1'].quantile(0.25)
Q3_f1 = data['feature_1'].quantile(0.75)
IQR_f1 = Q3_f1 - Q1_f1

# Define outlier boundaries for Feature 1
lower_bound_f1 = Q1_f1 - 1.5 * IQR_f1
upper_bound_f1 = Q3_f1 + 1.5 * IQR_f1

# Identify outliers in Feature 1
outliers_iqr_feature1 = data[(data['feature_1'] < lower_bound_f1) | (data['feature_1'] > upper_bound_f1)]
print("Outliers in Feature 1 (IQR Method):\n", outliers_iqr_feature1)

# Calculate Q1, Q3, and IQR for Feature 2
Q1_f2 = data['feature_2'].quantile(0.25)
Q3_f2 = data['feature_2'].quantile(0.75)
IQR_f2 = Q3_f2 - Q1_f2

# Define outlier boundaries for Feature 2
lower_bound_f2 = Q1_f2 - 1.5 * IQR_f2
upper_bound_f2 = Q3_f2 + 1.5 * IQR_f2

# Identify outliers in Feature 2
outliers_iqr_feature2 = data[(data['feature_2'] < lower_bound_f2) | (data['feature_2'] > upper_bound_f2)]
print("\nOutliers in Feature 2 (IQR Method):\n", outliers_iqr_feature2)

4. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a clustering algorithm that can also be used for outlier detection. It identifies clusters of high density separated by regions of low density. Points in low-density regions are often considered outliers (noise).

Python

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Scale the data before applying DBSCAN
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data[['feature_1', 'feature_2']])

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5) # Adjust eps and min_samples as needed
clusters = dbscan.fit_predict(scaled_features)

# Add cluster labels to the DataFrame
data['dbscan_cluster'] = clusters

# Identify outliers (cluster label -1)
outliers_dbscan = data[data['dbscan_cluster'] == -1]
print("\nOutliers identified by DBSCAN:\n", outliers_dbscan)

# Visualize the clusters and outliers
plt.figure(figsize=(8, 6))
sns.scatterplot(x='feature_1', y='feature_2', hue='dbscan_cluster', data=data, palette='viridis')
plt.title('DBSCAN Clustering with Outliers')
plt.show()

5. Isolation Forest

Isolation Forest is an ensemble tree-based anomaly detection algorithm. It isolates outliers by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Outliers, being rare and different, tend to be isolated in fewer splits than normal data points.

Python

from sklearn.ensemble import IsolationForest

# Train the Isolation Forest model
iso_forest = IsolationForest(contamination='auto', random_state=42) # 'auto' uses the score threshold from the original paper
outlier_scores = iso_forest.fit_predict(data[['feature_1', 'feature_2']])

# Add outlier labels to the DataFrame (-1 for outlier, 1 for inlier)
data['isolation_forest_outlier'] = outlier_scores

# Identify outliers
outliers_iso_forest = data[data['isolation_forest_outlier'] == -1]
print("\nOutliers identified by Isolation Forest:\n", outliers_iso_forest)

# Visualize the outliers
plt.figure(figsize=(8, 6))
sns.scatterplot(x='feature_1', y='feature_2', hue='isolation_forest_outlier',
                data=data, palette={1: 'blue', -1: 'red'})
plt.title('Isolation Forest Outlier Detection')
plt.show()

6. Local Outlier Factor (LOF)

LOF measures the local deviation of a given data point with respect to its neighbors. It considers the density of neighbors around a data point compared to the density of neighbors of its neighbors. Points with significantly lower density than their neighbors are considered outliers.

Python

from sklearn.neighbors import LocalOutlierFactor

# Train the LOF model
lof = LocalOutlierFactor(n_neighbors=5, contamination='auto') # Adjust n_neighbors as needed
outlier_scores_lof = lof.fit_predict(data[['feature_1', 'feature_2']])

# Add outlier labels to the DataFrame (-1 for outlier, 1 for inlier)
data['lof_outlier'] = outlier_scores_lof

# Identify outliers
outliers_lof = data[data['lof_outlier'] == -1]
print("\nOutliers identified by Local Outlier Factor (LOF):\n", outliers_lof)

# Visualize the outliers
plt.figure(figsize=(8, 6))
sns.scatterplot(x='feature_1', y='feature_2', hue='lof_outlier',
                data=data, palette={1: 'blue', -1: 'red'})
plt.title('Local Outlier Factor (LOF) Outlier Detection')
plt.show()

Important Considerations:

  • Domain Knowledge: Always use your understanding of the data and the problem domain to guide your outlier detection process. What might look like an outlier statistically could be a valid and important data point.
  • Univariate vs. Multivariate Outliers: Some methods (like Z-score and IQR) are typically applied to individual features (univariate). Methods like DBSCAN, Isolation Forest, and LOF can detect outliers based on the relationships between multiple features (multivariate).
  • Scaling: For distance-based methods like DBSCAN and LOF, it’s often crucial to scale your data (e.g., using StandardScaler or MinMaxScaler) so that features with larger ranges don’t disproportionately influence the results.
  • Choosing the Right Method: There’s no one-size-fits-all approach. The best method depends on the characteristics of your data (e.g., distribution, dimensionality) and the type of outliers you expect. It’s often beneficial to try multiple methods and compare the results.
  • Handling Outliers: Once you’ve identified outliers, you need to decide how to handle them. Common approaches include:
    • Removal: Removing the outlier rows from your dataset. Be cautious, as you might lose valuable information.
    • Transformation: Applying transformations (e.g., log transformation, winsorizing) to reduce the impact of outliers.
    • Imputation: Replacing outlier values with more reasonable estimates (e.g., mean, median).
    • Separate Modeling: Sometimes, outliers might represent a distinct group that could be modeled separately.
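
As a minimal sketch of these handling strategies, the snippet below reuses the IQR bounds computed earlier for feature_1 (lower_bound_f1 and upper_bound_f1); the new names (inlier_mask, data_removed, data_winsorized, data_imputed) are purely illustrative.

Python

# Removal: drop rows whose feature_1 value falls outside the IQR bounds
inlier_mask = data['feature_1'].between(lower_bound_f1, upper_bound_f1)
data_removed = data[inlier_mask].copy()

# Transformation (winsorizing): clip extreme values to the boundary values
data_winsorized = data.copy()
data_winsorized['feature_1'] = data_winsorized['feature_1'].clip(lower_bound_f1, upper_bound_f1)

# Imputation: replace outlier values with the median of the inlier values
data_imputed = data.copy()
data_imputed.loc[~inlier_mask, 'feature_1'] = data.loc[inlier_mask, 'feature_1'].median()

print(len(data), len(data_removed))  # row counts before and after removal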

By combining these techniques and carefully considering your data, you can effectively identify and handle outliers in your machine learning workflows. Let me know if you have any more questions!

Which technique should you use in which situation and use case?

Ah, a crucial follow-up question! Knowing how to check for outliers is only half the battle. Knowing when to use which method is what makes you a savvy data scientist. Let’s break down the scenarios and use cases for each outlier detection technique:

1. Visual Inspection (Box Plots, Scatter Plots, Histograms)

  • When to Use:
    • Initial Data Exploration: Always a good first step to get a feel for the data distribution.
    • Low Dimensionality (1-3 features): Easier to visualize patterns and anomalies in fewer dimensions.
    • Understanding Univariate Distributions: Box plots and histograms are excellent for seeing the spread and skewness of single features.
    • Identifying Bivariate Relationships: Scatter plots can reveal outliers in the joint distribution of two variables.
    • Qualitative Assessment: Helps build intuition about potential outliers before applying more formal methods.
  • Use Cases:
    • Identifying unusual spending patterns in a single customer attribute (box plot).
    • Spotting data entry errors in height and weight measurements (scatter plot).
    • Recognizing unexpected peaks or tails in the distribution of website traffic (histogram).
  • Limitations: Becomes less effective and more subjective with higher dimensional data.

2. Z-Score (Standard Score)

  • When to Use:
    • Approximately Normally Distributed Data: Assumes a Gaussian distribution. Less reliable if the data is heavily skewed or has multiple modes.
    • Univariate Outlier Detection: Typically applied to individual features.
    • Simple and Quick: Easy to implement and understand.
  • Use Cases:
    • Identifying unusually high or low sensor readings, assuming the readings generally follow a normal distribution.
    • Detecting anomalous server response times, if the typical response times are normally distributed.
  • Limitations: Sensitive to the presence of outliers themselves, as they can affect the mean and standard deviation, potentially masking other outliers. Not suitable for non-normal data.
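
To make that masking effect concrete, here is a small illustrative sketch (the values and variable names are hypothetical): a single extreme value inflates the mean and standard deviation enough that a milder outlier no longer crosses the |Z| > 3 threshold.

Python

rng = np.random.default_rng(0)
base = rng.normal(0, 1, 100)

with_extreme = np.concatenate([base, [8, 80]])     # 80 is an extreme outlier
without_extreme = np.concatenate([base, [8]])

# Z-score of the value 8 in each case
print(np.abs(stats.zscore(with_extreme))[-2])      # well below 3: masked by the 80
print(np.abs(stats.zscore(without_extreme))[-1])   # well above 3: detected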

3. Interquartile Range (IQR) Method

  • When to Use:
    • Non-Normally Distributed Data: More robust to skewness and non-normality compared to the Z-score method.
    • Univariate Outlier Detection: Applied to individual features.
    • Relatively Simple and Effective: A good general-purpose outlier detection technique for single variables.
  • Use Cases:
    • Identifying extreme income values in a population, which might be skewed.
    • Detecting unusual product prices in an e-commerce dataset.
    • Finding outliers in exam scores that might not follow a perfect normal distribution.
  • Limitations: Doesn’t consider relationships between multiple features. The 1.5 multiplier is a common convention but might need adjustment depending on the specific dataset and domain.
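
As a minimal sketch of adjusting that multiplier, a small hypothetical helper like iqr_outliers below wraps the same logic as the earlier IQR code; a factor of 3.0 is sometimes used when only the most extreme values should be flagged.

Python

def iqr_outliers(series, factor=1.5):
    """Boolean mask of values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - factor * iqr) | (series > q3 + factor * iqr)

# The default fences (1.5) flag more points than the wider fences (3.0)
print(data[iqr_outliers(data['feature_1'], factor=1.5)])
print(data[iqr_outliers(data['feature_1'], factor=3.0)])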

4. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  • When to Use:
    • Identifying Outliers as Regions of Low Density: Effective when outliers are isolated points in a low-density region, while normal data forms dense clusters.
    • No Assumption About Data Distribution: Can handle complex, non-linear data distributions.
    • Multivariate Outlier Detection: Considers the relationships between multiple features.
    • Discovering Clusters of Arbitrary Shape: Can identify clusters that are not necessarily spherical.
  • Use Cases:
    • Detecting fraudulent credit card transactions that deviate significantly from normal spending patterns of user segments.
    • Identifying anomalous locations of vehicles in a fleet tracking system.
    • Finding unusual network traffic patterns that don’t belong to typical communication flows.
  • Limitations: Performance can degrade in high-dimensional data (curse of dimensionality). Sensitive to the choice of eps (the maximum distance between two samples for one to be considered in the neighborhood of the other) and min_samples (the number of samples in a neighborhood required for a point to count as a core point). Determining these parameters can be challenging.
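
One common heuristic for picking eps, sketched below on the scaled_features array created earlier, is a k-distance plot: sort every point's distance to its k-th nearest neighbor (with k matching min_samples) and look for the "elbow" where distances start rising sharply; eps is often chosen near that elbow.

Python

from sklearn.neighbors import NearestNeighbors

k = 5  # match min_samples (DBSCAN counts the point itself in its neighborhood)
nn = NearestNeighbors(n_neighbors=k).fit(scaled_features)
distances, _ = nn.kneighbors(scaled_features)

# Column 0 is each point's zero distance to itself, so the last column is the
# distance to its (k-1)-th other neighbor; sort ascending for the plot
k_distances = np.sort(distances[:, -1])

plt.figure(figsize=(6, 4))
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to (k-1)-th nearest neighbor')
plt.title('k-distance plot for choosing eps')
plt.show()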

5. Isolation Forest

  • When to Use:
    • High-Dimensional Data: Performs well in datasets with many features.
    • Large Datasets: Efficient and scalable to larger datasets.
    • Global Outlier Detection: Effective at identifying outliers that are globally rare and distinct.
    • No Strong Assumptions About Data Distribution: Works well for various data distributions.
  • Use Cases:
    • Detecting anomalies in network intrusion detection systems with numerous features.
    • Identifying fraudulent activities in financial transactions with many attributes.
    • Finding manufacturing defects based on a wide range of sensor readings.
  • Limitations: Might not perform as well in detecting local outliers (outliers that are unusual relative to their local neighborhood but not globally). The contamination parameter (expected proportion of outliers) needs to be estimated.
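
If estimating contamination up front is difficult, one option, sketched below with the iso_forest model fitted earlier, is to inspect the continuous anomaly scores from score_samples (lower means more anomalous) and pick a cut-off yourself; the 2% quantile used here is purely an illustrative assumption.

Python

# Continuous anomaly scores from the fitted model: lower = more anomalous
scores = iso_forest.score_samples(data[['feature_1', 'feature_2']])
data['iso_forest_score'] = scores

# Inspect the most anomalous points before committing to a threshold
print(data.sort_values('iso_forest_score').head(10))

# Example: flag the lowest-scoring 2% as outliers (an arbitrary choice)
threshold = np.quantile(scores, 0.02)
data['iso_forest_flag'] = scores < threshold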

6. Local Outlier Factor (LOF)

  • When to Use:
    • Local Outlier Detection: Excels at finding outliers that are unusual with respect to their local neighbors, even if they are not globally extreme.
    • No Strong Assumptions About Data Distribution: Works well for various data distributions.
    • Multivariate Outlier Detection: Considers the density of local neighborhoods in multiple dimensions.
  • Use Cases:
    • Identifying unusual user behavior within a specific segment of customers.
    • Detecting localized anomalies in sensor data where overall readings might be normal, but a specific sensor deviates from its nearby sensors.
    • Finding unusual patterns in spatial data where density varies across regions.
  • Limitations: Can be computationally expensive for very large datasets. Sensitive to the choice of n_neighbors (number of neighbors to consider).
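
For a finer-grained view than the binary labels, the lof object fitted earlier exposes negative_outlier_factor_ after fit_predict; scores close to -1 indicate typical points, while much smaller values indicate stronger outliers. A minimal sketch:

Python

# LOF scores for the training data: values far below -1 are more outlying
data['lof_score'] = lof.negative_outlier_factor_

# Inspect the most outlying points
print(data.sort_values('lof_score').head(10))

Note that fit_predict only scores the data it was fitted on; to score new, unseen points, LOF must be created with novelty=True and then used via predict or score_samples.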

Here’s a table summarizing the recommendations:

Method | Best Use Cases | Key Advantages | Key Limitations
Visual Inspection | Initial exploration, low-dimensional data, understanding univariate/bivariate distributions | Intuitive, helps build understanding | Subjective, ineffective in high dimensions
Z-Score | Approximately normal data, univariate outliers, simple and quick analysis | Simple, easy to implement | Assumes normality, sensitive to existing outliers
IQR | Non-normal data, univariate outliers, robust to skewness | Robust to non-normality, simple to implement | Univariate only; the 1.5 multiplier might need adjustment
DBSCAN | Outliers as low-density regions, complex distributions, multivariate outliers, arbitrarily shaped clusters | No distributional assumptions; can find complex outlier patterns | Sensitive to eps and min_samples; can struggle in high dimensions and with varying densities
Isolation Forest | High-dimensional data, large datasets, global outliers | Efficient and scalable; no strong distributional assumptions | Might miss local outliers; needs an estimate of the outlier proportion
LOF | Local outlier detection, non-normal data, multivariate outliers, varying densities | Finds outliers within local neighborhoods; no strong distributional assumptions | Computationally expensive for large datasets; sensitive to n_neighbors

General Guidance:

  • Start with Visualization: Always try to visualize your data first to gain initial insights.
  • Consider Data Distribution: If you believe your data is roughly normally distributed, Z-score can be a quick option. For non-normal data, IQR is often a better univariate choice.
  • Think About Outlier Nature: Are you looking for globally extreme values or anomalies within local contexts? This will guide you towards Isolation Forest or LOF, respectively.
  • Account for Dimensionality: For high-dimensional data, tree-based methods like Isolation Forest tend to perform well. Density-based methods might struggle due to the curse of dimensionality.
  • Experiment and Iterate: There’s no single best method for all situations. It’s often beneficial to try multiple techniques and evaluate their results based on your domain knowledge and the goals of your analysis (a small sketch of one way to combine methods follows this list).
  • Understand the Algorithms’ Assumptions and Parameters: Be aware of the underlying assumptions of each method and the impact of their parameters on the results. Tune parameters carefully.

By considering these factors, you can make more informed decisions about which outlier detection techniques to employ in different machine learning scenarios and use cases. Good luck with your outlier hunting!
