Saba Shahrukh · May 27, 2025
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder, OneHotEncoder, Normalizer, PowerTransformer, QuantileTransformer, PolynomialFeatures
from sklearn.impute import SimpleImputer

Explanation of Machine Learning Preprocessing Imports

This code imports essential tools for **preparing data for machine learning models**.

  • pandas (`pd`): For managing data in tables (like spreadsheets).
  • numpy (`np`): For advanced number crunching, especially with arrays.
  • train_test_split: To divide your data into training and testing sets.
  • scikit-learn preprocessing tools: A collection of transformer classes to clean and transform your data for better model performance:
    • Scaling Data: `StandardScaler`, `MinMaxScaler`, `RobustScaler`, `Normalizer` adjust numerical data scales.
    • Encoding Categories: `LabelEncoder` and `OneHotEncoder` convert text categories into numbers.
    • Handling Missing Data: `SimpleImputer` (imported from `sklearn.impute`) fills in gaps in your dataset.
    • Transforming Distributions: `PowerTransformer` and `QuantileTransformer` make data more “bell-curve” like.
    • Creating New Features: `PolynomialFeatures` generates new features from existing ones (e.g., squares or interactions).

1. Data Loading & Initial Exploration

Python

# Scenario: Reading data from a CSV file
df = pd.read_csv('inflammation-01.csv')
print("First 5 rows:\n", df.head())
print("\nLast 5 rows:\n", df.tail())
print("\nDataFrame Info:\n", df.info())
print("\nDescriptive Statistics:\n", df.describe())
print("\nDataFrame Shape:", df.shape)
print("\nColumn Data Types:\n", df.dtypes)

Explanation of Basic Data Loading and Inspection

This code snippet demonstrates the initial steps in a data analysis workflow: **loading data** from a CSV file into a pandas DataFrame and then performing some **basic inspections** to understand its structure and content.

  • df = pd.read_csv('inflammation-01.csv'):
    This line uses the `pandas` library (`pd`) to **read data from a CSV file** named ‘inflammation-01.csv’ and store it in a DataFrame variable called `df`. A DataFrame is like a table in a spreadsheet or database.
  • print("First 5 rows:\n", df.head()):
    Displays the **first 5 rows** of the DataFrame. This is useful for a quick peek at the top of your data.
  • print("\nLast 5 rows:\n", df.tail()):
    Shows the **last 5 rows** of the DataFrame, helping you see the end of your data.
  • print("\nDataFrame Info:\n", df.info()):
    Provides a concise **summary of the DataFrame**, including the number of entries, number of columns, non-null values per column, and the data type of each column. It’s great for quickly checking for missing data and data types.
  • print("\nDescriptive Statistics:\n", df.describe()):
    Generates **descriptive statistics** (like count, mean, standard deviation, min, max, quartiles) for each numerical column. This gives you a statistical overview of your data’s distribution.
  • print("\nDataFrame Shape:", df.shape):
    Outputs the **dimensions of the DataFrame** as a tuple (number of rows, number of columns).
  • print("\nColumn Data Types:\n", df.dtypes):
    Lists the **data type** for each column in the DataFrame. This is important for ensuring columns are interpreted correctly (e.g., as numbers, text, or dates).

  • Use Case: Loading datasets from various sources and getting an initial understanding of the data structure, types, and basic statistics.

You can also download a sample dataset from this link: ecommerce_dynamic_pricing_data
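
pandas offers similar one-line readers for other common formats. A minimal sketch (the file names below are placeholders, and Excel/Parquet support requires the optional openpyxl/pyarrow packages):

Python

# Hypothetical file names, shown only to illustrate the available readers
df_excel = pd.read_excel('sales.xlsx', sheet_name='Sheet1')  # Excel workbook (needs openpyxl)
df_json = pd.read_json('records.json')                       # JSON records
df_parquet = pd.read_parquet('events.parquet')               # Parquet columnar file (needs pyarrow)
print(df_excel.head())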

2. Handling Missing Values

Python

# Scenario: Dealing with missing values in a DataFrame
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [np.nan, 6, 7, np.nan, 9],
        'C': ['p', 'q', 'r', 's', 't']}
df_missing = pd.DataFrame(data)
print("Original DataFrame with Missing Values:\n", df_missing)
print("\nMissing Value Counts per Column:\n", df_missing.isnull().sum())

# Filling missing values with the mean of each column
df_filled_mean = df_missing.fillna(df_missing.mean(numeric_only=True))
print("\nDataFrame with Missing Values Filled (Mean):\n", df_filled_mean)

# Dropping rows with any missing values
df_dropped_rows = df_missing.dropna()
print("\nDataFrame with Rows Containing Missing Values Dropped:\n", df_dropped_rows)

# Using SimpleImputer for more complex imputation strategies
imputer_mean = SimpleImputer(strategy='mean')
df_imputed_mean = pd.DataFrame(imputer_mean.fit_transform(df_missing[['A', 'B']]), columns=['A', 'B'])
print("\nDataFrame with Missing Values Imputed (SimpleImputer - Mean):\n", df_imputed_mean)
  • Use Case: Identifying and handling missing data, which can negatively impact model performance. Choosing the appropriate strategy (filling or dropping) depends on the amount and nature of missingness.
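
`SimpleImputer` supports more strategies than the mean. A small sketch reusing the df_missing frame above, with the median (more robust to outliers) and a constant sentinel value:

Python

# Median imputation is less sensitive to extreme values than the mean
imputer_median = SimpleImputer(strategy='median')
df_imputed_median = pd.DataFrame(imputer_median.fit_transform(df_missing[['A', 'B']]), columns=['A', 'B'])
print("Imputed (median):\n", df_imputed_median)

# A constant fill value can explicitly flag "was missing"
imputer_const = SimpleImputer(strategy='constant', fill_value=-1)
df_imputed_const = pd.DataFrame(imputer_const.fit_transform(df_missing[['A', 'B']]), columns=['A', 'B'])
print("\nImputed (constant -1):\n", df_imputed_const)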

3. Data Type Conversion

Python

# Scenario: Converting column data types
df_types = pd.DataFrame({'col1': ['1', '2', '3'], 'col2': ['2023-01-01', '2023-01-02', '2023-01-03']})
print("Original Data Types:\n", df_types.dtypes)

df_types['col1'] = df_types['col1'].astype(int)
df_types['col2'] = pd.to_datetime(df_types['col2'])
print("\nConverted Data Types:\n", df_types.dtypes)
  • Use Case: Ensuring that data is in the correct format for analysis and modeling (e.g., numerical columns as integers or floats, date columns as datetime objects).
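
When a column contains values that cannot be converted cleanly, `pd.to_numeric` with `errors='coerce'` turns the unparseable entries into NaN instead of raising an error. A minimal sketch:

Python

# 'abc' cannot be parsed as a number; coercion replaces it with NaN
messy = pd.Series(['1', '2', 'abc', '4'])
clean = pd.to_numeric(messy, errors='coerce')
print(clean)        # NaN where parsing failed
print(clean.dtype)  # float64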

4. Handling Duplicates

Python

# Scenario: Identifying and removing duplicate rows
df_duplicates = pd.DataFrame({'col1': ['A', 'B', 'A', 'C', 'B'],
                              'col2': [1, 2, 1, 3, 2]})
print("Original DataFrame with Duplicates:\n", df_duplicates)
print("\nDuplicate Rows (Boolean Mask):\n", df_duplicates.duplicated())
df_no_duplicates = df_duplicates.drop_duplicates()
print("\nDataFrame with Duplicates Removed:\n", df_no_duplicates)
  • Use Case: Removing redundant data points that can bias models or lead to inaccurate analyses.
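
`drop_duplicates` can also judge duplicates by a subset of columns and control which copy survives. A small sketch reusing df_duplicates:

Python

# Consider only 'col1' when detecting duplicates, and keep the last occurrence
df_last = df_duplicates.drop_duplicates(subset=['col1'], keep='last')
print("Duplicates removed by 'col1', keeping the last occurrence:\n", df_last)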

5. Feature Scaling

Python

# Scenario: Scaling numerical features
data_scaling = pd.DataFrame({'feature1': [10, 20, 30, 40, 50],
                             'feature2': [1, 5, 2, 8, 3]})
print("Original Data for Scaling:\n", data_scaling)

# StandardScaler
scaler_standard = StandardScaler()
scaled_standard = scaler_standard.fit_transform(data_scaling)
print("\nStandardScaler Scaled Data:\n", scaled_standard)

# MinMaxScaler
scaler_minmax = MinMaxScaler()
scaled_minmax = scaler_minmax.fit_transform(data_scaling)
print("\nMinMaxScaler Scaled Data:\n", scaled_minmax)

# RobustScaler
scaler_robust = RobustScaler()
scaled_robust = scaler_robust.fit_transform(data_scaling)
print("\nRobustScaler Scaled Data:\n", scaled_robust)

# Normalizer
normalizer = Normalizer()
normalized_data = normalizer.fit_transform(data_scaling)
print("\nNormalized Data:\n", normalized_data)
  • Use Case: Standardizing or normalizing numerical features to have similar scales, which is crucial for many machine learning algorithms that are sensitive to feature magnitudes (e.g., gradient descent-based algorithms, distance-based algorithms).
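
In a real pipeline, scalers should be fitted on the training data only and then applied to the test data, so that test-set statistics do not leak into preprocessing. A minimal sketch reusing data_scaling from above:

Python

X_train_raw, X_test_raw = train_test_split(data_scaling.values, test_size=0.4, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)  # learn mean/std from the training split only
X_test_scaled = scaler.transform(X_test_raw)        # reuse those statistics on the test split
print("Scaled training data:\n", X_train_scaled)
print("\nScaled test data:\n", X_test_scaled)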

6. Encoding Categorical Features

Python

# Scenario: Encoding categorical data into numerical format
df_categorical = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue'],
                               'size': ['small', 'medium', 'large', 'small', 'medium']})
print("Original Categorical Data:\n", df_categorical)

# LabelEncoder
label_encoder = LabelEncoder()
df_categorical['color_encoded'] = label_encoder.fit_transform(df_categorical['color'])
print("\nLabelEncoder Output:\n", df_categorical)

# OneHotEncoder
onehot_encoder = OneHotEncoder()
color_encoded_ohe = onehot_encoder.fit_transform(df_categorical[['color']]).toarray()
df_color_ohe = pd.DataFrame(color_encoded_ohe, columns=onehot_encoder.categories_[0])
print("\nOneHotEncoder Output for 'color':\n", df_color_ohe)

# pd.get_dummies()
df_dummies = pd.get_dummies(df_categorical, columns=['color', 'size'])
print("\npd.get_dummies() Output:\n", df_dummies)
  • Use Case: Converting categorical variables into a numerical representation that machine learning models can understand. One-hot encoding suits nominal (unordered) categories, while ordinal encoding suits ordered ones; note that scikit-learn's `LabelEncoder` is intended for target labels, and `OrdinalEncoder` is the usual choice for ordinal input features (see the sketch below).
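
For a genuinely ordered category such as 'size', scikit-learn's `OrdinalEncoder` lets you state the order explicitly. A short sketch (the order small < medium < large is an assumption for this example):

Python

from sklearn.preprocessing import OrdinalEncoder

# Explicit ordering: small < medium < large -> 0, 1, 2
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df_categorical['size_encoded'] = ordinal_encoder.fit_transform(df_categorical[['size']]).ravel()
print("OrdinalEncoder Output for 'size':\n", df_categorical[['size', 'size_encoded']])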

7. Feature Engineering

Python

# Scenario: Creating new features from existing ones
df_fe = pd.DataFrame({'product': ['A', 'B', 'A', 'C'],
                      'price': [10, 20, 12, 15],
                      'quantity': [5, 2, 10, 3]})
print("Original DataFrame for Feature Engineering:\n", df_fe)

# Using .apply() to create a new feature
df_fe['total_cost'] = df_fe.apply(lambda row: row['price'] * row['quantity'], axis=1)
print("\nFeature Engineering with .apply():\n", df_fe)

# Using .groupby() and .agg() to create aggregate features
df_grouped = df_fe.groupby('product')['price'].agg(price_mean='mean', price_max='max').reset_index()
df_fe = pd.merge(df_fe, df_grouped, on='product')
print("\nFeature Engineering with .groupby() and .agg():\n", df_fe)

# Using .cut() for binning
df_fe['price_bins'] = pd.cut(df_fe['price'], bins=[0, 12, 18, 25], labels=['low', 'medium', 'high'])
print("\nFeature Engineering with .cut():\n", df_fe)

# Using PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df_fe[['price', 'quantity']])
df_poly = pd.DataFrame(poly_features, columns=['price', 'quantity', 'price^2', 'price*quantity', 'quantity^2'])
print("\nPolynomial Features:\n", df_poly)
  • Use Case: Creating new features from existing ones to potentially improve model performance by capturing more complex relationships in the data.
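
Rather than hard-coding the polynomial column names, you can ask the transformer for them (get_feature_names_out is available in scikit-learn 1.0 and later):

Python

# Let PolynomialFeatures report its own output column names
poly_cols = poly.get_feature_names_out(['price', 'quantity'])
df_poly_named = pd.DataFrame(poly_features, columns=poly_cols)
print("Polynomial Features with generated column names:\n", df_poly_named)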

8. Data Transformation

Python

# Scenario: Transforming data to achieve a more normal distribution
data_transform = pd.DataFrame({'skewed_feature': np.random.exponential(scale=2, size=100)})
print("Original Skewed Data:\n", data_transform.hist())

# PowerTransformer (Yeo-Johnson handles both positive and negative data)
power_transformer = PowerTransformer(method='yeo-johnson')
transformed_power = power_transformer.fit_transform(data_transform)
print("\nPowerTransformer Transformed Data:\n", pd.DataFrame(transformed_power, columns=['transformed_feature']).hist())

# QuantileTransformer (to uniform distribution)
quantile_transformer_uniform = QuantileTransformer(output_distribution='uniform', n_quantiles=50)
transformed_quantile_uniform = quantile_transformer_uniform.fit_transform(data_transform)
print("\nQuantileTransformer (Uniform) Transformed Data:\n", pd.DataFrame(transformed_quantile_uniform, columns=['transformed_feature']).hist())

# QuantileTransformer (to normal distribution)
quantile_transformer_normal = QuantileTransformer(output_distribution='normal', n_quantiles=50)
transformed_quantile_normal = quantile_transformer_normal.fit_transform(data_transform)
print("\nQuantileTransformer (Normal) Transformed Data:\n", pd.DataFrame(transformed_quantile_normal, columns=['transformed_feature']).hist())
  • Use Case: Transforming features to make their distribution more Gaussian-like, which can benefit some statistical models and algorithms that assume normality.
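
To inspect the distributions visually rather than through skewness numbers, the same arrays can be plotted with matplotlib (assumed to be installed for this sketch):

Python

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(data_transform['skewed_feature'], bins=30)
axes[0].set_title('Original (right-skewed)')
axes[1].hist(transformed_power.ravel(), bins=30)
axes[1].set_title('After PowerTransformer (Yeo-Johnson)')
plt.tight_layout()
plt.show()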

9. Data Splitting

Python

# Scenario: Splitting data into training and testing sets
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("X_train:\n", X_train)
print("\nX_test:\n", X_test)
print("\ny_train:", y_train)
print("\ny_test:", y_test)
  • Use Case: Dividing the dataset into separate sets for training the model and evaluating its performance on unseen data to avoid overfitting.
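
For classification tasks with imbalanced classes, passing stratify=y keeps the class proportions roughly equal in both splits. A minimal sketch reusing X and y from above:

Python

# Stratified split: the 0/1 ratio in y is preserved in y_train and y_test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
print("y_train (stratified):", y_train_s)
print("y_test (stratified):", y_test_s)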

10. Handling Outliers

Python

# Scenario: Identifying and removing outliers (example using Z-score)
data_outliers = pd.DataFrame({'values': np.concatenate([np.random.normal(0, 1, 100), [10, -10]])})
print("Original Data with Potential Outliers:\n", data_outliers.describe())

z_scores = np.abs((data_outliers['values'] - data_outliers['values'].mean()) / data_outliers['values'].std())
threshold = 3
outliers = data_outliers[z_scores > threshold]
df_no_outliers = data_outliers[z_scores <= threshold]

print("\nIdentified Outliers (Z-score > 3):\n", outliers)
print("\nData without Outliers:\n", df_no_outliers.describe())
  • Use Case: Identifying and handling extreme values that can disproportionately influence model training. The method for handling outliers (removal, transformation, or capping) depends on the context and the nature of the outliers.
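
Instead of dropping outliers, you can cap them at reasonable bounds (winsorizing). A small sketch applying the 1.5 * IQR rule to the same data_outliers frame:

Python

# Cap values outside 1.5 * IQR beyond the quartiles instead of removing them
q1 = data_outliers['values'].quantile(0.25)
q3 = data_outliers['values'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

df_capped = data_outliers.copy()
df_capped['values'] = df_capped['values'].clip(lower=lower, upper=upper)
print("Data with Outliers Capped (IQR rule):\n", df_capped.describe())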

11. Text Data Preprocessing (Illustrative – Libraries not fully imported)

Python

# Scenario: Basic text preprocessing (conceptual)
text_data = pd.Series(["This is the first sentence.", "Another sentence here!", "The quick brown fox."])

# Tokenization (splitting into words) - using a conceptual split
tokens = [sentence.lower().split() for sentence in text_data]
print("Tokens:\n", tokens)

# Libraries like NLTK, spaCy, and scikit-learn's CountVectorizer/TfidfVectorizer
# provide more sophisticated tools for tokenization, stemming, lemmatization,
# and feature extraction from text.
  • Use Case: Cleaning and transforming text data into a format suitable for natural language processing (NLP) tasks. This often involves steps like tokenization, lowercasing, removing punctuation and stop words, and converting text into numerical vectors.
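
As a concrete illustration of the vectorizers mentioned above, here is a minimal sketch using scikit-learn's CountVectorizer on the same three sentences (get_feature_names_out requires scikit-learn 1.0 or later):

Python

from sklearn.feature_extraction.text import CountVectorizer

# Tokenizes, lowercases, and builds a bag-of-words count matrix in one step
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(text_data)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bag-of-words matrix:\n", bow.toarray())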

Remember that the specific data preparation steps and the functions you use will depend heavily on the characteristics of your data and the machine learning task you are trying to solve.
