Saba Shahrukh · June 27, 2025

To navigate the challenges of model selection effectively and build robust, high-performing machine learning systems, it’s crucial to follow best practices. Here’s a comprehensive overview:

1. Thorough Understanding of the Problem and Data:

  • Define Business Objectives Clearly: Understand the ultimate goal of the machine learning project and how the model’s output will be used. This will influence the choice of evaluation metrics and the importance of interpretability.
  • In-depth Data Exploration and Analysis (EDA): Gain a deep understanding of the data’s characteristics, including its size, distribution, feature types, missing values, outliers, and potential biases. Visualize the data to identify patterns and relationships.
  • Feature Engineering and Selection: Create relevant features that capture the underlying patterns in the data. Apply appropriate feature scaling and handle missing values and outliers effectively. Consider feature selection techniques to reduce dimensionality and improve model performance and interpretability.
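
A minimal sketch of this preprocessing step with scikit-learn (the dataset here is synthetic and the chosen transformers are illustrative, not prescriptive):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X[::10, 0] = np.nan  # simulate missing values

prep = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),   # handle missing values
    ('scale', StandardScaler()),                    # feature scaling
    ('select', SelectKBest(f_classif, k=10)),       # reduce dimensionality
])
X_prepared = prep.fit_transform(X, y)
print(X_prepared.shape)  # (200, 10)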

2. Establish a Robust Evaluation Framework:

  • Choose Appropriate Evaluation Metrics: Select metrics that align with the business objectives and the nature of the problem (e.g., accuracy, precision, recall, F1-score for classification; RMSE, MAE, R-squared for regression).
  • Implement Rigorous Cross-Validation: Use appropriate cross-validation techniques (e.g., K-Fold, Stratified K-Fold, Time Series CV, Group K-Fold) based on the data characteristics and potential dependencies. This provides a more reliable estimate of generalization performance.
  • Hold-Out Test Set: Always reserve a completely separate test set (not used during training or hyperparameter tuning) for the final evaluation of the selected model. This provides an unbiased assessment of its performance on truly unseen data.
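
A minimal sketch of this evaluation framework, using a synthetic imbalanced dataset as a stand-in for real data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.85], random_state=42)

# Hold-out test set: untouched until the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Cross-validation on the training portion only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train,
                         cv=cv, scoring='f1')
print(f"CV F1: {scores.mean():.3f} +/- {scores.std():.3f}")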

3. Explore a Diverse Set of Candidate Models:

  • Consider Different Model Families: Don’t limit yourself to a single type of model. Explore linear models, tree-based models, support vector machines, neural networks, and ensemble methods.
  • Start with Simpler Models: Begin with simpler, more interpretable models as baselines. This helps establish a benchmark and provides insights into the complexity required for the task.
  • Gradually Increase Complexity: If simpler models don’t achieve satisfactory performance, progressively explore more complex models.
  • Leverage Domain Knowledge: Incorporate domain expertise to guide the selection of potentially suitable models.
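
One way to put this into practice is to score several model families under the same cross-validation protocol before investing in tuning; a sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=42)
candidates = {
    'logreg (baseline)': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(random_state=42),
    'svm': SVC(),
    'random_forest': RandomForestClassifier(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"{name:20s} ROC AUC = {scores.mean():.3f}")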

4. Systematic Hyperparameter Tuning:

  • Define a Relevant Hyperparameter Space: Understand the key hyperparameters of each model and define a reasonable range or distribution of values to explore.
  • Employ Effective Search Strategies: Use techniques like Grid Search, Random Search, or more advanced optimization methods (e.g., Bayesian Optimization, Population-Based Training) to efficiently find the optimal hyperparameter settings.
  • Use Cross-Validation During Tuning: Evaluate each hyperparameter configuration using cross-validation on the training data to avoid overfitting to the validation set.
  • Document Tuned Hyperparameters: Keep track of the hyperparameter settings and their corresponding performance.
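
For the last point, scikit-learn search objects expose their full tuning history via cv_results_, which is easy to persist; a minimal sketch:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=42)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={'n_estimators': [50, 100, 200], 'max_depth': [3, 5, None]},
    n_iter=5, cv=3, random_state=42)
search.fit(X, y)

# Persist the full tuning history alongside the winning configuration
pd.DataFrame(search.cv_results_).to_csv('tuning_history.csv', index=False)
print(search.best_params_, search.best_score_)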

5. Focus on Generalization, Not Just Training Performance:

  • Prioritize Performance on Validation and Test Sets: The ultimate goal is to build a model that performs well on unseen data. Pay close attention to the performance on the validation and hold-out test sets.
  • Monitor for Overfitting: Track the performance on both training and validation sets during training and hyperparameter tuning. A significant gap between the two indicates potential overfitting.
  • Apply Regularization Techniques: Use appropriate regularization methods (L1, L2, Dropout, pruning, etc.) to prevent overfitting, especially for complex models.
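
A quick sketch of how regularization strength affects the train/validation gap (L2 penalty here; in scikit-learn, C is the inverse regularization strength):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

for C in [100.0, 1.0, 0.01]:  # smaller C = stronger L2 penalty
    clf = LogisticRegression(C=C, max_iter=2000).fit(X_tr, y_tr)
    gap = clf.score(X_tr, y_tr) - clf.score(X_val, y_val)
    print(f"C={C:>6}: train-val accuracy gap = {gap:.3f}")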

6. Consider Interpretability and Explainability:

  • Understand the Trade-off: Be aware of the trade-off between model complexity/performance and interpretability.
  • Choose Interpretable Models When Necessary: If interpretability is a critical requirement (e.g., in regulated industries), prioritize simpler models or techniques that provide insights into the model’s decision-making process (e.g., feature importance, SHAP values).
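
Model-agnostic tools such as permutation importance offer a quick first view into a fitted model's decision-making; a minimal sketch:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                random_state=42)
for i in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature_{i}: importance = {result.importances_mean[i]:.3f}")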

7. Manage Computational Resources Effectively:

  • Balance Exploration with Efficiency: Be mindful of the computational cost of training and evaluating different models and hyperparameter configurations.
  • Utilize Efficient Tuning Techniques: Employ more efficient hyperparameter optimization methods when dealing with large models or search spaces.
  • Consider Model Size and Inference Speed: For deployment, consider the model’s size and the time it takes to make predictions, especially for real-time applications.

8. Document the Model Selection Process:

  • Keep Detailed Records: Document all the models explored, the hyperparameters tuned, the evaluation metrics used, the cross-validation strategies employed, and the reasons for selecting the final model.
  • Track Experiments: Use experiment tracking tools to manage and compare different model runs and configurations.
  • Ensure Reproducibility: Make sure the entire model selection process can be reproduced.
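
Even without a dedicated experiment tracker (e.g., MLflow or Weights & Biases), a plain JSON log goes a long way; a minimal sketch with hypothetical values:

import json, platform, sklearn

experiment = {
    'model': 'RandomForestClassifier',
    'params': {'n_estimators': 200, 'max_depth': 10},       # hypothetical
    'cv': 'StratifiedKFold(n_splits=5, shuffle=True, random_state=42)',
    'metrics': {'roc_auc_cv_mean': 0.87},                   # hypothetical
    'random_seed': 42,
    'sklearn_version': sklearn.__version__,
    'python_version': platform.python_version(),
}
with open('experiment_log.json', 'a') as f:
    f.write(json.dumps(experiment) + '\n')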

9. Iterate and Refine:

  • Model Selection is Not Always Linear: Be prepared to revisit earlier steps in the process based on the evaluation results. You might need to explore new features, try different models, or adjust the evaluation framework.
  • Continuous Monitoring and Retraining: Once a model is deployed, continuously monitor its performance and retrain it as new data becomes available or if performance degrades due to concept drift.

10. Consider Ensemble Methods:

  • Combine Multiple Models: Ensemble techniques (e.g., bagging, boosting, stacking) can often improve predictive performance and robustness by combining the strengths of multiple individual models.
  • Tune Ensemble Hyperparameters: Remember that ensemble methods also have hyperparameters that need to be tuned.
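
A minimal stacking sketch with scikit-learn's StackingClassifier (the base learners and meta-learner are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('svm', SVC(probability=True, random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    cv=5)
print(cross_val_score(stack, X, y, cv=3, scoring='roc_auc').mean())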

By adhering to these best practices, you can increase the likelihood of selecting a machine learning model that not only performs well on your data but also generalizes effectively to unseen data, meets your business objectives, and is practical for deployment and maintenance.

Use Case 1: Customer Churn Prediction for a Telecommunications Company

Problem Statement: A telecommunications company is experiencing customer churn, leading to significant revenue loss. They want to identify customers at high risk of churning so they can proactively offer retention incentives. The goal is to build a predictive model that accurately identifies these customers.

Challenges in Model Selection for Churn Prediction:

  1. Imbalanced Data: Churn datasets are typically imbalanced (far fewer churners than non-churners). Standard accuracy metrics can be misleading.
  2. Business Impact: False negatives (missing a churner) are often more costly than false positives (incorrectly identifying a non-churner as a churner). This necessitates careful consideration of metrics like Recall or F1-score.
  3. Model Interpretability: While performance is key, understanding why a customer might churn (feature importance) can inform business strategies.
  4. Avoiding Overfitting: A model that performs well on training data but poorly on new customers is useless. Robust validation is crucial.
  5. Hyperparameter Optimization: Different models have different hyperparameters that significantly impact their performance.

Best Practices Demonstrated:

  1. Exploratory Data Analysis (EDA) and Preprocessing: Understanding the data is fundamental.
  2. Stratified K-Fold Cross-Validation: Essential for imbalanced datasets to ensure each fold maintains the original class distribution. This provides a more reliable estimate of generalization performance.
  3. Multiple Model Candidates: Evaluating various algorithms (e.g., Logistic Regression, Random Forest, Gradient Boosting) to find the best fit for the problem.
  4. Appropriate Evaluation Metrics: Focusing on metrics like Precision, Recall, F1-score, and ROC AUC, which are more relevant for imbalanced classification than simple accuracy.
  5. Hyperparameter Tuning with RandomizedSearchCV (or GridSearchCV): Systematically searching for optimal hyperparameters for each model using cross-validation. RandomizedSearchCV is often preferred for its efficiency in large search spaces.
  6. Pipelines: Streamlining the machine learning workflow, ensuring consistent preprocessing across training and validation folds and preventing data leakage.
  7. Ensemble Methods (implicitly through trying Random Forest and Gradient Boosting): Showcasing models known for robust performance.
  8. Model Persistence: Saving the best performing model for future deployment.

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, classification_report
)
import matplotlib.pyplot as plt
import seaborn as sns
import joblib # For model persistence

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries loaded successfully.")

# --- 1. Data Generation (Simulated Telecom Churn Data) ---
def generate_telecom_churn_data(n_samples=5000, churn_rate=0.15):
    """Generates a synthetic telecom churn dataset."""
    data = {
        'MonthlyCharges': np.random.uniform(20, 120, n_samples),
        'TotalCharges': np.random.uniform(50, 6000, n_samples),
        'Tenure': np.random.randint(1, 72, n_samples),
        'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples, p=[0.6, 0.2, 0.2]),
        'InternetService': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples, p=[0.35, 0.45, 0.2]),
        'PaymentMethod': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)'], n_samples, p=[0.4, 0.2, 0.2, 0.2]),
        'Gender': np.random.choice(['Male', 'Female'], n_samples),
        'Dependents': np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
        'SeniorCitizen': np.random.choice([0, 1], n_samples, p=[0.8, 0.2]),
        'PhoneService': np.random.choice([0, 1], n_samples, p=[0.1, 0.9]),
        'MultipleLines': np.random.choice(['Yes', 'No', 'No phone service'], n_samples, p=[0.4, 0.5, 0.1])
    }
    df = pd.DataFrame(data)

    # Introduce some correlation with churn
    churn = np.zeros(n_samples, dtype=int)
    for i in range(n_samples):
        churn_prob = 0.05 # Base churn probability
        if df.loc[i, 'Contract'] == 'Month-to-month':
            churn_prob += 0.20
        if df.loc[i, 'InternetService'] == 'Fiber optic':
            churn_prob += 0.10
        if df.loc[i, 'MonthlyCharges'] > 80:
            churn_prob += 0.05
        if df.loc[i, 'Tenure'] < 12:
            churn_prob += 0.15
        if df.loc[i, 'PaymentMethod'] == 'Electronic check':
            churn_prob += 0.10

        if np.random.rand() < churn_prob:
            churn[i] = 1

    # Adjust churn rate to target
    current_churn_rate = np.mean(churn)
    if current_churn_rate < churn_rate:
        num_to_flip = int((churn_rate - current_churn_rate) * n_samples)
        non_churners_idx = np.where(churn == 0)[0]
        flip_indices = np.random.choice(non_churners_idx, num_to_flip, replace=False)
        churn[flip_indices] = 1
    elif current_churn_rate > churn_rate:
        num_to_flip = int((current_churn_rate - churn_rate) * n_samples)
        churners_idx = np.where(churn == 1)[0]
        flip_indices = np.random.choice(churners_idx, num_to_flip, replace=False)
        churn[flip_indices] = 0

    df['Churn'] = churn
    return df

df = generate_telecom_churn_data()
print("\nSimulated Data Head:")
print(df.head())
print("\nChurn Distribution:")
print(df['Churn'].value_counts(normalize=True))

# --- 2. Data Preprocessing Setup ---
# Identify numerical and categorical features
numerical_features = df.select_dtypes(include=np.number).columns.tolist()
numerical_features.remove('SeniorCitizen') # SeniorCitizen is 0/1-coded; treat it as categorical instead
if 'Churn' in numerical_features:
    numerical_features.remove('Churn') # Target variable

categorical_features = df.select_dtypes(include='object').columns.tolist()
categorical_features.append('SeniorCitizen') # Add SeniorCitizen to categorical

# Create preprocessing pipelines for numerical and categorical features
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create a preprocessor using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # Keep other columns if any, though none expected here
)

print(f"\nNumerical features: {numerical_features}")
print(f"Categorical features: {categorical_features}")

# --- 3. Model Candidates and Hyperparameter Grids ---
models = {
    'LogisticRegression': LogisticRegression(random_state=42, solver='liblinear'), # liblinear for small datasets, good default
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=42)
}

# Define hyperparameter distributions for RandomizedSearchCV
param_distributions = {
    'LogisticRegression': {
        'classifier__C': np.logspace(-4, 4, 20), # Regularization parameter
        'classifier__penalty': ['l1', 'l2']
    },
    'RandomForestClassifier': {
        'classifier__n_estimators': np.arange(100, 500, 100),
        'classifier__max_depth': [None, 10, 20, 30],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__min_samples_leaf': [1, 2, 4]
    },
    'GradientBoostingClassifier': {
        'classifier__n_estimators': np.arange(100, 500, 100),
        'classifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'classifier__max_depth': [3, 5, 7],
        'classifier__subsample': [0.7, 0.8, 0.9, 1.0]
    }
}

# --- 4. Prepare Data for Modeling ---
X = df.drop('Churn', axis=1)
y = df['Churn']

# Stratified split for initial train/test set to simulate unseen data
# This 'test_size' is the final hold-out set, not used in CV for model selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTraining set size: {len(X_train)} samples")
print(f"Test set size: {len(X_test)} samples")
print(f"Training churn distribution: {y_train.value_counts(normalize=True)}")
print(f"Test churn distribution: {y_test.value_counts(normalize=True)}")


# --- 5. Model Selection and Hyperparameter Tuning with Nested Cross-Validation Concept ---
# While not strictly 'nested cross-validation' in the scikit-learn sense of a separate outer loop,
# we are using RandomizedSearchCV with StratifiedKFold, which performs CV internally for tuning.
# The 'X_test' set remains completely untouched until the very end.

best_model = None
best_score = -np.inf # Initialize with negative infinity for AUC
best_model_name = ""
model_evaluation_results = {}

print("\nStarting Model Selection and Hyperparameter Tuning...")

for model_name, model in models.items():
    print(f"\n--- Processing {model_name} ---")

    # Create a pipeline for each model
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    # Define the cross-validation strategy
    # StratifiedKFold is crucial for imbalanced datasets
    cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Use RandomizedSearchCV for hyperparameter tuning with cross-validation
    # We choose 'roc_auc' as the primary scoring metric for churn prediction due to imbalance
    # and the importance of ranking customers by churn probability.
    search = RandomizedSearchCV(
        estimator=pipeline,
        param_distributions=param_distributions[model_name],
        n_iter=50, # Number of parameter settings that are sampled. Trade-off between runtime and performance
        cv=cv_strategy,
        scoring='roc_auc',
        verbose=1,
        n_jobs=-1, # Use all available CPU cores
        random_state=42
    )

    search.fit(X_train, y_train)

    print(f"Best parameters for {model_name}: {search.best_params_}")
    print(f"Best ROC AUC on training folds for {model_name}: {search.best_score_:.4f}")

    # Evaluate the refit best estimator on the full training set.
    # Note: these in-sample metrics are optimistic; the cross-validated score above is the honest estimate.
    # The 'best_estimator_' attribute holds the pipeline refit with the best params.
    y_pred_proba_train = search.best_estimator_.predict_proba(X_train)[:, 1]
    y_pred_train = search.best_estimator_.predict(X_train)

    # Store results
    model_evaluation_results[model_name] = {
        'best_params': search.best_params_,
        'train_roc_auc_cv_avg': search.best_score_,
        'train_roc_auc': roc_auc_score(y_train, y_pred_proba_train),
        'train_accuracy': accuracy_score(y_train, y_pred_train),
        'train_precision': precision_score(y_train, y_pred_train),
        'train_recall': recall_score(y_train, y_pred_train),
        'train_f1': f1_score(y_train, y_pred_train),
        'best_estimator': search.best_estimator_
    }

    if search.best_score_ > best_score:
        best_score = search.best_score_
        best_model = search.best_estimator_
        best_model_name = model_name

print(f"\n--- Model Selection Complete ---")
print(f"The best model based on cross-validated ROC AUC is: {best_model_name}")
print(f"Best cross-validated ROC AUC: {best_score:.4f}")

# --- 6. Final Evaluation on the Unseen Test Set ---
print("\n--- Final Evaluation on Unseen Test Set ---")

if best_model:
    y_pred_proba_test = best_model.predict_proba(X_test)[:, 1]
    y_pred_test = best_model.predict(X_test)

    test_roc_auc = roc_auc_score(y_test, y_pred_proba_test)
    test_accuracy = accuracy_score(y_test, y_pred_test)
    test_precision = precision_score(y_test, y_pred_test)
    test_recall = recall_score(y_test, y_pred_test)
    test_f1 = f1_score(y_test, y_pred_test)

    print(f"Metrics for the Best Model ({best_model_name}) on Test Set:")
    print(f"  ROC AUC: {test_roc_auc:.4f}")
    print(f"  Accuracy: {test_accuracy:.4f}")
    print(f"  Precision: {test_precision:.4f}")
    print(f"  Recall: {test_recall:.4f}")
    print(f"  F1-Score: {test_f1:.4f}")

    print("\nClassification Report on Test Set:")
    print(classification_report(y_test, y_pred_test))

    print("\nConfusion Matrix on Test Set:")
    cm = confusion_matrix(y_test, y_pred_test)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Not Churn', 'Churn'], yticklabels=['Not Churn', 'Churn'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title(f'Confusion Matrix for {best_model_name}')
    plt.show()

    # Visualize ROC curve
    from sklearn.metrics import roc_curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba_test)
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {test_roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'Receiver Operating Characteristic (ROC) Curve for {best_model_name}')
    plt.legend(loc="lower right")
    plt.grid(True)
    plt.show()

    # --- 7. Feature Importance (for tree-based models) ---
    if hasattr(best_model.named_steps['classifier'], 'feature_importances_'):
        print(f"\n--- Feature Importances for {best_model_name} ---")
        # Get feature names after one-hot encoding
        ohe_feature_names = best_model.named_steps['preprocessor'].named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)
        all_feature_names = numerical_features + list(ohe_feature_names)

        importances = best_model.named_steps['classifier'].feature_importances_
        feature_importance_df = pd.DataFrame({'feature': all_feature_names, 'importance': importances})
        feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)
        print(feature_importance_df.head(10))

        plt.figure(figsize=(10, 6))
        sns.barplot(x='importance', y='feature', data=feature_importance_df.head(10))
        plt.title(f'Top 10 Feature Importances for {best_model_name}')
        plt.xlabel('Importance')
        plt.ylabel('Feature')
        plt.show()

# --- 8. Model Persistence ---
if best_model:
    model_filename = f'best_churn_prediction_model_{best_model_name}.pkl'
    joblib.dump(best_model, model_filename)
    print(f"\nBest model saved as {model_filename}")

    # Example of loading the model
    loaded_model = joblib.load(model_filename)
    print(f"Model loaded successfully: {loaded_model}")
    # You can now use loaded_model.predict(new_data) or loaded_model.predict_proba(new_data)

Explanation of Best Practices in the Code:

  1. Synthetic Data Generation: For demonstration, we create a synthetic dataset that mimics common characteristics of telecom churn data, including imbalanced classes. In a real-world scenario, you would load your actual data here.
  2. ColumnTransformer for Preprocessing:
    • This is a cornerstone for robust preprocessing. It allows applying different transformers (e.g., StandardScaler for numerical, OneHotEncoder for categorical) to different columns.
    • It handles feature types gracefully, preventing errors and ensuring correct transformations.
    • handle_unknown='ignore' in OneHotEncoder is crucial for deployment, as it prevents errors if unseen categories appear in new data.
  3. Pipeline for Workflow Automation:
    • Pipelines combine preprocessing steps with the estimator into a single Scikit-learn object.
    • Prevents Data Leakage: When used with cross-validation (like in RandomizedSearchCV), the preprocessor is fitted only on the training folds and transformed on both training and validation folds in each iteration. This is critical to avoid data leakage from the validation set into the training process.
    • Simplifies code and makes the entire workflow reproducible.
  4. StratifiedKFold Cross-Validation:
    • Used within RandomizedSearchCV (cv_strategy).
    • Crucial for imbalanced datasets like churn prediction. It ensures that each fold used for training and validation maintains the same proportion of churners and non-churners as the original dataset. This leads to more reliable performance estimates.
  5. RandomizedSearchCV for Hyperparameter Tuning:
    • More efficient than GridSearchCV for large search spaces, as it samples combinations randomly instead of exhaustively. This allows finding good hyperparameters in less time.
    • n_iter: Controls the number of random combinations to try. Adjust based on computational budget.
    • scoring='roc_auc': For imbalanced classification, ROC AUC is generally a better metric than accuracy. It measures the ability of a classifier to distinguish between classes, considering the trade-off between true positive rate and false positive rate across various thresholds. For churn, we often care about ranking customers by risk.
    • n_jobs=-1: Utilizes all available CPU cores for parallel processing, speeding up the search.
  6. Comprehensive Evaluation Metrics:
    • Beyond accuracy, we explicitly calculate and display precision, recall, f1_score, and roc_auc_score.
    • Recall (Sensitivity): Maximizing recall is often a primary business objective for churn prediction (we want to catch as many churners as possible).
    • Precision: Important to ensure that proactive interventions are targeted effectively (we don’t want too many false positives).
    • F1-Score: A harmonic mean of precision and recall, providing a balanced view.
    • Confusion Matrix: Visualizes the types of errors the model makes (false positives, false negatives).
    • ROC Curve: A graphical representation of the model’s performance across different classification thresholds, valuable for understanding the trade-off between sensitivity and specificity. The operating threshold itself can be tuned to business costs, as sketched after this list.
  7. Hold-Out Test Set:
    • The X_test, y_test split is a completely unseen dataset.
    • It is used only once at the very end to provide an unbiased estimate of the best model’s generalization performance on new, never-before-seen data. This simulates how the model would perform in a real-world deployment.
  8. Model Persistence (joblib):
    • Once the best model is identified and evaluated, it’s saved to disk using joblib.dump(). This allows you to load the trained model later without retraining, essential for deployment.
  9. Feature Importance (for tree-based models):
    • For models like Random Forest and Gradient Boosting, we extract and visualize feature importances. This provides valuable insights into which factors are most influential in predicting churn, informing business strategy.
    • Handling One-Hot Encoded Features: The code correctly retrieves feature names after one-hot encoding to make the importance scores interpretable.
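
Because false negatives are costlier than false positives in churn, the default 0.5 probability cutoff is rarely optimal. A minimal sketch of cost-aware threshold tuning (the per-error costs are hypothetical and would come from the business):

import numpy as np
from sklearn.metrics import confusion_matrix

def pick_threshold(y_true, y_proba, cost_fn=5.0, cost_fp=1.0):
    """Choose the probability cutoff that minimizes expected business cost."""
    best_t, best_cost = 0.5, np.inf
    for t in np.linspace(0.05, 0.95, 19):
        preds = (y_proba >= t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, preds, labels=[0, 1]).ravel()
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# e.g. threshold = pick_threshold(y_test, y_pred_proba_test)
#      churn_flags = (y_pred_proba_test >= threshold).astype(int)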

This comprehensive approach ensures that the selected model is not only high-performing but also reliable, interpretable, and ready for deployment in a real-world business context.

Use Case 2: Generative AI for Novel Material Discovery and Optimization in Sustainable Manufacturing

Problem Statement: In industries like automotive, aerospace, and renewable energy, the demand for new materials with specific properties (e.g., lightweight, high strength-to-weight ratio, improved thermal conductivity, recyclability) is constantly increasing. Traditional material discovery and optimization processes are incredibly slow, expensive, and often rely on trial-and-error laboratory experiments. This bottleneck hinders innovation and the development of truly sustainable products.

The Role of Machine Learning (Specifically Generative AI):

Generative AI offers a revolutionary approach by designing new materials from scratch based on desired properties, rather than just optimizing existing ones. This is akin to “inverse design,” where the output (material properties) dictates the input (atomic structure and composition).

Challenges in Model Selection for Material Discovery:

  1. High-Dimensional and Complex Data: Material properties are influenced by intricate atomic arrangements, chemical compositions, and processing conditions, leading to highly complex and non-linear relationships.
  2. Limited Experimental Data: Obtaining experimental data for new materials is costly and time-consuming, meaning models often need to learn from relatively small, sparse datasets.
  3. Generative vs. Predictive: Unlike churn prediction (a classification task), this involves generating novel structures, which requires different model architectures (e.g., Variational Autoencoders, Generative Adversarial Networks, Diffusion Models, or specialized graph neural networks for molecular structures).
  4. Constraints and Manufacturability: Generated materials must not only possess desired properties but also be theoretically manufacturable and stable.
  5. Explainability: Understanding why a particular atomic structure leads to desired properties is crucial for scientific insight and further innovation.

Best Practices Demonstrated in this Use Case:

  1. Specialized Data Representation: Converting complex molecular/material structures into a machine-readable format (e.g., graph representations, descriptors).
  2. Generative Model Architectures: Utilizing models capable of synthesizing new data points (material structures) rather than just classifying or predicting labels.
  3. Property Prediction (Forward Model): Training a separate “forward” model to predict properties of generated materials, allowing for iterative refinement and optimization.
  4. Multi-Objective Optimization: Balancing multiple, sometimes conflicting, desired material properties (e.g., maximizing strength while minimizing weight).
  5. Active Learning / Bayesian Optimization: Strategically choosing which generated candidates to “synthesize” or simulate, minimizing expensive real-world experiments.
  6. Human-in-the-Loop Validation: Emphasizing that AI assists human material scientists and engineers, who validate generated designs through simulations and experiments.
  7. Robust Evaluation for Generative Models: Metrics beyond standard classification/regression, such as diversity, novelty, and validity of generated outputs, and their performance on downstream predictive tasks.
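
Practice 7 can be made concrete with simple geometric proxies: novelty as the distance from each generated sample to its nearest training sample, and diversity as the mean pairwise distance among generated samples. A minimal sketch (the descriptor arrays are placeholders):

import numpy as np
from scipy.spatial.distance import cdist, pdist

def generation_metrics(X_train, X_generated):
    """Crude novelty/diversity proxies for a set of generated descriptors."""
    novelty = cdist(X_generated, X_train).min(axis=1).mean()  # distance to nearest training sample
    diversity = pdist(X_generated).mean()                     # spread among generated samples
    return novelty, diversity

X_train_demo = np.random.rand(200, 5)  # placeholder training descriptors
X_gen_demo = np.random.rand(50, 5)     # placeholder generated descriptors
print(generation_metrics(X_train_demo, X_gen_demo))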

End-to-End Python Code for Generative Material Discovery (Conceptual & Simplified)

  • Disclaimer: A full, production-ready Generative AI model for material discovery is highly complex, requiring deep expertise in materials science, quantum chemistry, and advanced deep learning. This code provides a simplified, conceptual demonstration focusing on the workflow and best practices for model selection within this domain, using a hypothetical scenario.
  • We will simulate a dataset of material descriptors and their properties, and demonstrate how a generative model could propose new descriptors, which are then evaluated by a “property prediction” model.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import loguniform # For randomized search on continuous distributions

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

print("Libraries loaded successfully.")

# --- 1. Simulated Data Generation: Material Descriptors & Properties ---
# Imagine these are computationally derived descriptors (e.g., features from DFT calculations)
# and properties (e.g., band gap, thermal conductivity, elastic modulus).
# We'll simulate 5 'descriptors' and 2 'properties'.

def generate_material_data(n_samples=1000):
    """Generates synthetic material descriptor and property data."""
    data = {f'descriptor_{i+1}': np.random.rand(n_samples) * 10 for i in range(5)}
    df = pd.DataFrame(data)

    # Simulate some non-linear relationships to properties
    df['property_A'] = (
        2 * df['descriptor_1']
        + 0.5 * df['descriptor_2']**2
        - 3 * np.sin(df['descriptor_3'])
        + np.random.randn(n_samples) * 0.5
    )
    df['property_B'] = (
        1.5 * df['descriptor_4']
        + 0.8 * np.exp(-df['descriptor_5'])
        + np.random.randn(n_samples) * 0.3
    )

    return df

material_df = generate_material_data(n_samples=1000)
print("\nSimulated Material Data Head:")
print(material_df.head())
print(f"\nData Shape: {material_df.shape}")

# Define features (descriptors) and targets (properties)
X = material_df[[f'descriptor_{i+1}' for i in range(5)]]
y_A = material_df['property_A']
y_B = material_df['property_B'] # We'll focus on property_A for simplicity in this example

# --- 2. Data Splitting ---
# Split into training and a truly unseen test set for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y_A, test_size=0.2, random_state=42
)
print(f"\nTraining set size: {len(X_train)} samples")
print(f"Test set size: {len(X_test)} samples")

# --- 3. Preprocessing Pipeline ---
# Standardize features for most ML models
preprocessor = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# --- 4. Forward Model Candidates (Predicting Properties from Descriptors) ---
# These models will evaluate the "goodness" of generated material descriptors.
regressors = {
    'RandomForestRegressor': RandomForestRegressor(random_state=42),
    'GradientBoostingRegressor': GradientBoostingRegressor(random_state=42),
    'MLPRegressor': MLPRegressor(random_state=42, max_iter=1000) # Simple Neural Network
}

# Hyperparameter distributions for RandomizedSearchCV for forward models
param_distributions_forward = {
    'RandomForestRegressor': {
        'regressor__n_estimators': np.arange(50, 300, 50),
        'regressor__max_depth': [None, 10, 20],
        'regressor__min_samples_split': [2, 5]
    },
    'GradientBoostingRegressor': {
        'regressor__n_estimators': np.arange(50, 300, 50),
        'regressor__learning_rate': [0.01, 0.05, 0.1],
        'regressor__max_depth': [3, 5]
    },
    'MLPRegressor': {
        'regressor__hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
        'regressor__activation': ['relu', 'tanh'],
        'regressor__solver': ['adam'],
        'regressor__alpha': loguniform(1e-5, 1e-1) # L2 penalty (regularization)
    }
}

# --- 5. Model Selection and Hyperparameter Tuning for Forward Models ---
# We use cross-validation to select the best "forward" model that predicts material properties.
best_forward_model = None
best_forward_score = -np.inf # Maximize R2
best_forward_model_name = ""
forward_model_results = {}

print("\nStarting Forward Model Selection and Hyperparameter Tuning...")

for model_name, model in regressors.items():
    print(f"\n--- Processing Forward Model: {model_name} ---")

    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    # KFold for regression
    cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)

    search = RandomizedSearchCV(
        estimator=pipeline,
        param_distributions=param_distributions_forward[model_name],
        n_iter=20, # Reduced for faster demo, increase for real use
        cv=cv_strategy,
        scoring='r2', # R-squared as the primary metric
        verbose=0, # Set to 1 for more output
        n_jobs=-1,
        random_state=42
    )

    search.fit(X_train, y_train)

    print(f"Best parameters for {model_name}: {search.best_params_}")
    print(f"Best R2 on training folds for {model_name}: {search.best_score_:.4f}")

    y_pred_train = search.best_estimator_.predict(X_train)
    train_r2 = r2_score(y_train, y_pred_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))

    forward_model_results[model_name] = {
        'best_params': search.best_params_,
        'train_r2_cv_avg': search.best_score_,
        'train_r2': train_r2,
        'train_rmse': train_rmse,
        'best_estimator': search.best_estimator_
    }

    if search.best_score_ > best_forward_score:
        best_forward_score = search.best_score_
        best_forward_model = search.best_estimator_
        best_forward_model_name = model_name

print(f"\n--- Forward Model Selection Complete ---")
print(f"The best forward model based on cross-validated R2 is: {best_forward_model_name}")
print(f"Best cross-validated R2: {best_forward_score:.4f}")

# Final evaluation of the best forward model on the unseen test set
if best_forward_model:
    y_pred_test = best_forward_model.predict(X_test)
    test_r2 = r2_score(y_test, y_pred_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

    print(f"\nFinal Evaluation of Best Forward Model ({best_forward_model_name}) on Unseen Test Set:")
    print(f"  Test R2: {test_r2:.4f}")
    print(f"  Test RMSE: {test_rmse:.4f}")

    # Plotting actual vs. predicted for the best forward model
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=y_test, y=y_pred_test, alpha=0.6)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2) # Ideal line
    plt.xlabel(f'Actual {y_A.name}')
    plt.ylabel(f'Predicted {y_A.name}')
    plt.title(f'Actual vs. Predicted for Best Forward Model ({best_forward_model_name})')
    plt.grid(True)
    plt.show()

# --- 6. Generative Model: Simple VAE for Material Descriptors (Conceptual) ---
# This is a highly simplified VAE for demonstration.
# In a real scenario, this would be a much more sophisticated model
# trained on complex molecular graphs or atomic structures.

latent_dim = 2 # Small latent dimension for visualization
input_dim = X_train.shape[1]

# Encoder
encoder_inputs = keras.Input(shape=(input_dim,))
x = layers.Dense(64, activation='relu')(encoder_inputs)
x = layers.Dense(32, activation='relu')(x)
z_mean = layers.Dense(latent_dim, name="z_mean")(x)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)

# Custom sampling layer for VAE
class Sampling(layers.Layer):
    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(encoder_inputs, [z_mean, z_log_var, z], name="encoder")

# Decoder
latent_inputs = keras.Input(shape=(latent_dim,))
x = layers.Dense(32, activation='relu')(latent_inputs)
x = layers.Dense(64, activation='relu')(x)
decoder_outputs = layers.Dense(input_dim, activation='sigmoid')(x) # Sigmoid to keep in [0,1] range (scaled)
decoder = keras.Model(latent_inputs, decoder_outputs, name="decoder")

# VAE Model
class VAE(keras.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
        self.reconstruction_loss_tracker = keras.metrics.Mean(
            name="reconstruction_loss"
        )
        self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encoder(data)
            reconstruction = self.decoder(z)
            reconstruction_loss = tf.reduce_mean(
                tf.square(data - reconstruction) # MSE for reconstruction
            )
            kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
            kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
            total_loss = reconstruction_loss + kl_loss
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

# Scale X_train to [0, 1] before training the VAE (required by the sigmoid output layer)
vae_scaler = MinMaxScaler()
X_train_scaled_for_vae = vae_scaler.fit_transform(X_train)

vae = VAE(encoder, decoder)
vae.compile(optimizer=keras.optimizers.Adam())

print("\nTraining VAE (Generative Model) - this is a simplified example.")
vae.fit(X_train_scaled_for_vae, epochs=100, batch_size=32, verbose=0)
print("VAE training complete.")

# --- 7. Material Generation and Evaluation Loop (Iterative Design) ---
print("\n--- Generating Novel Materials and Evaluating Properties ---")

num_new_materials_to_generate = 100
latent_samples = np.random.normal(size=(num_new_materials_to_generate, latent_dim))
generated_descriptors_scaled = decoder.predict(latent_samples)

# Inverse transform back to the original feature scale.
# Clip to [0, 1] first so every generated value stays within the range seen during training.
generated_descriptors_scaled = np.clip(generated_descriptors_scaled, 0.0, 1.0)
generated_descriptors = vae_scaler.inverse_transform(generated_descriptors_scaled)
generated_descriptors = pd.DataFrame(generated_descriptors, columns=X.columns)

print("\nSample of Generated Material Descriptors:")
print(generated_descriptors.head())

# Predict properties of newly generated materials using the best forward model
predicted_properties = best_forward_model.predict(generated_descriptors)

# Add predicted properties to the generated descriptors DataFrame
generated_materials_df = generated_descriptors.copy()
generated_materials_df['predicted_property_A'] = predicted_properties

print("\nSample of Generated Materials with Predicted Properties:")
print(generated_materials_df.head())

# Sort by predicted property (e.g., highest property_A is desirable)
top_materials = generated_materials_df.sort_values(by='predicted_property_A', ascending=False).head(5)

print("\nTop 5 Hypothetical New Materials for Further Investigation:")
print(top_materials)

# Visualization of latent space (if latent_dim == 2) and generated points
if latent_dim == 2:
    z_mean_train, z_log_var_train, _ = encoder.predict(X_train_scaled_for_vae)

    plt.figure(figsize=(10, 8))
    plt.scatter(z_mean_train[:, 0], z_mean_train[:, 1], c='blue', alpha=0.5, label='Original Data Latent Space')
    plt.scatter(
        latent_samples[:, 0], latent_samples[:, 1],
        c='red', alpha=0.5, label='Sampled Latent Space for Generation'
    )
    plt.xlabel('Latent Dimension 1')
    plt.ylabel('Latent Dimension 2')
    plt.title('Latent Space Visualization: Original Data vs. Sampled Points')
    plt.legend()
    plt.grid(True)
    plt.show()


# --- 8. Model Persistence (Saving the Best Forward Model and VAE Components) ---
# Saving the best property prediction model
model_filename_forward = f'best_material_property_predictor_{best_forward_model_name}.pkl'
joblib.dump(best_forward_model, model_filename_forward)
print(f"\nBest forward model saved as {model_filename_forward}")

# Saving VAE components (encoder and decoder) separately for generating new designs
encoder.save('material_vae_encoder.h5')
decoder.save('material_vae_decoder.h5')
print("\nVAE encoder and decoder saved.")

# Example of loading the models for future use
# loaded_forward_model = joblib.load(model_filename_forward)
# loaded_encoder = keras.models.load_model('material_vae_encoder.h5', custom_objects={'Sampling': Sampling})
# loaded_decoder = keras.models.load_model('material_vae_decoder.h5')
# print("\nModels loaded successfully for inference.")

Explanation of Best Practices in the Code for Material Discovery:

  1. Specialized Data Representation (Conceptual):
    • The descriptor_X columns represent a simplified version of material descriptors (e.g., electronic, structural, compositional features). In reality, these would come from sophisticated computational chemistry tools (like Density Functional Theory, molecular dynamics simulations) or experimental characterization.
    • For actual molecular/material generation, one would use specialized data structures like SMILES strings (for molecules), graph representations (Graph Neural Networks), or crystallographic information files (CIFs) and deep learning models designed for these inputs. This example uses a simpler tabular descriptor for illustration.
  2. Two-Phase Modeling (Forward and Generative):
    • Forward Model: A standard regression model (RandomForestRegressor, GradientBoostingRegressor, MLPRegressor) is trained to predict material properties given their descriptors. This is crucial because it allows us to evaluate the properties of new materials generated by the VAE without needing to perform expensive physical experiments for every single candidate.
    • Generative Model (VAE – Variational Autoencoder): This model learns the underlying distribution of material descriptors. Once trained, its decoder component can sample from a latent (compressed) space and generate novel material descriptors.
  3. Model Selection for the Forward Model:
    • Similar to the churn example, RandomizedSearchCV with KFold (for regression) is used to find the best performing “forward” model.
    • R2_score and RMSE are appropriate metrics for regression tasks.
    • The best forward model is critical because its accuracy directly impacts the effectiveness of the generative design loop.
  4. Simplified Generative AI (VAE):
    • A basic Variational Autoencoder (VAE) is implemented using TensorFlow/Keras.
    • Encoder: Maps input material descriptors to a lower-dimensional “latent space” (mean and log-variance).
    • Sampling Layer: Introduces randomness, allowing the model to generate diverse, novel samples when decoding.
    • Decoder: Takes points from the latent space and reconstructs material descriptors.
    • Loss Function: Combines reconstruction loss (how well the decoder reconstructs the input) and KL-divergence loss (encouraging the latent space to follow a simple distribution, like a normal distribution, making sampling easier).
    • Scaling for VAE: Input data is scaled to a [0, 1] range, typical for models with sigmoid activation in the output layer.
  5. Iterative Material Design Loop:
    • Generation: The decoder samples random points from the latent space and generates new material descriptors.
    • Evaluation: The best_forward_model (trained previously) then predicts the properties of these newly generated descriptors.
    • Selection: The generated materials are ranked by their predicted properties, allowing material scientists to identify promising candidates for further, more detailed computational simulations (e.g., DFT) or actual laboratory synthesis and characterization. An uncertainty-aware variant of this ranking is sketched after this list.
    • This loop embodies the “design-by-property” approach, where we specify desired properties and the AI suggests materials.
  6. Visualization of Latent Space:
    • If the latent dimension is 2, we can visualize how the original data clusters and where the new generated points originate from in the learned latent space, providing some interpretability.
  7. Model Persistence:
    • Both the best “forward” prediction model and the components of the VAE (encoder, decoder) are saved. This is essential for deploying the system: the generative part can propose new materials, and the predictive part can quickly screen them.

This use case provides a glimpse into the exciting frontier of AI in materials science, where machine learning is not just predicting but actively creating new scientific knowledge and designs.

Use Case 3: Predictive Maintenance for Industrial Robotics in Manufacturing


Problem Statement: In modern manufacturing facilities, industrial robots are critical for production efficiency and consistency. Unexpected breakdowns of these robots lead to significant downtime, production losses, and costly emergency repairs. The goal is to predict potential equipment failures before they occur, enabling proactive maintenance, optimized scheduling of repairs, and reduced operational costs.

The Role of Machine Learning:

Machine learning models can analyze real-time sensor data (vibration, temperature, current, motor speed, pressure, etc.) from robots, identify subtle anomalies and patterns indicative of impending failures, and provide early warnings. This shifts maintenance from a reactive to a predictive paradigm.

Challenges in Model Selection for Predictive Maintenance:

  1. Time-Series Data: Sensor data is sequential, requiring models capable of handling temporal dependencies (e.g., LSTMs, Transformers, or specialized time-series features for tree-based models).
  2. Imbalanced Classes (Failure Events): Failures are rare events compared to normal operation. This imbalance is even more severe than churn prediction.
  3. Anomaly Detection vs. Classification: Sometimes, failures manifest as deviations from normal behavior rather than falling into predefined failure categories, necessitating anomaly detection techniques alongside traditional classification.
  4. Multi-Fault Prediction: A robot can fail in multiple ways (e.g., motor failure, bearing degradation, hydraulic leak). The model might need to predict different types of failures or overall Remaining Useful Life (RUL).
  5. Data Quality and Missing Data: Sensor readings can be noisy, erroneous, or have gaps, requiring robust imputation and filtering.
  6. Interpretability: Understanding which sensor readings or patterns indicate a failure is crucial for maintenance engineers to diagnose and fix the problem.

Best Practices Demonstrated in this Use Case:

  1. Feature Engineering for Time Series: Extracting meaningful features from raw sensor data (e.g., rolling averages, standard deviations, FFT components, trend indicators).
  2. Handling Imbalanced Data: Employing techniques like oversampling (SMOTE), undersampling, or using class weights in model training, alongside appropriate evaluation metrics.
  3. Time-Series Cross-Validation: Using validation strategies that respect the temporal order of data, preventing data leakage from the future into the past (e.g., TimeSeriesSplit).
  4. Early Warning System Metrics: Focusing on metrics like “Time to Detect,” “False Alarm Rate,” and “Missed Detection Rate,” which are more relevant than simple accuracy for operational effectiveness.
  5. Multi-Class Classification (for different failure modes) or Regression (for RUL): While we’ll focus on binary classification (failure vs. no-failure) for simplicity, the principles extend.
  6. Model Selection for Robustness: Evaluating models suited for complex time-series patterns (e.g., Gradient Boosting, LSTMs/GRUs, or specialized time-series algorithms).
  7. Threshold Optimization: For early warning systems, tuning the classification probability threshold to balance false positives and false negatives based on operational cost.

End-to-End Python Code for Predictive Maintenance

  • Disclaimer: Real-world sensor data is complex and often proprietary. This code simulates time-series sensor data from a hypothetical industrial robot to demonstrate the workflow. A full solution would involve robust data ingestion from PLCs/SCADA systems.
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, classification_report, precision_recall_curve, roc_curve, auc
)
from imblearn.over_sampling import SMOTE # For handling imbalance
from imblearn.pipeline import Pipeline as ImbPipeline # Use imblearn's pipeline for SMOTE
import matplotlib.pyplot as plt
import seaborn as sns
import joblib # For model persistence

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries loaded successfully.")

# --- 1. Data Generation (Simulated Robot Sensor Data with Failure Events) ---
def generate_robot_sensor_data(n_robots=5, n_days=365, readings_per_day=24):
    """
    Generates synthetic time-series sensor data for multiple robots with failure events.
    Failure events are simulated by abnormal sensor readings preceding the event.
    """
    data = []
    for robot_id in range(1, n_robots + 1):
        total_readings = n_days * readings_per_day
        time_index = pd.date_range(start='2024-01-01', periods=total_readings, freq='H')

        # Base normal operation values
        temp_base = np.random.normal(50, 2, total_readings) # Celsius
        vibration_base = np.random.normal(0.5, 0.1, total_readings) # G's
        current_base = np.random.normal(10, 0.5, total_readings) # Amps
        pressure_base = np.random.normal(80, 5, total_readings) # PSI

        # Initialize failure status and time to failure (target)
        failure = np.zeros(total_readings, dtype=int)
        time_to_failure = np.zeros(total_readings) # In hours

        # Simulate 1-3 failure events per robot
        num_failures = np.random.randint(1, 4)
        failure_indices = np.sort(np.random.choice(np.arange(total_readings // 4, total_readings - 100), num_failures, replace=False))

        for fail_idx in failure_indices:
            # Mark the actual failure point
            failure[fail_idx] = 1

            # Simulate abnormal sensor readings leading up to failure (e.g., 24-72 hours before)
            pre_fail_duration = np.random.randint(24, 73) # Hours before failure
            start_anomaly_idx = max(0, fail_idx - pre_fail_duration)

            # Introduce gradual increase/noise for anomalies
            anomaly_factor_temp = np.linspace(0, 5, pre_fail_duration)   # ramps up toward the failure point
            anomaly_factor_vib = np.linspace(0, 1.5, pre_fail_duration)
            anomaly_factor_curr = np.linspace(0, 2, pre_fail_duration)
            anomaly_factor_pres = np.linspace(0, 10, pre_fail_duration)

            temp_base[start_anomaly_idx:fail_idx] += anomaly_factor_temp
            vibration_base[start_anomaly_idx:fail_idx] += anomaly_factor_vib
            current_base[start_anomaly_idx:fail_idx] += anomaly_factor_curr
            pressure_base[start_anomaly_idx:fail_idx] += anomaly_factor_pres

            # Calculate time_to_failure for points before the failure
            for i in range(start_anomaly_idx, fail_idx + 1):
                time_to_failure[i] = fail_idx - i # Hours until failure

        robot_df = pd.DataFrame({
            'timestamp': time_index,
            'robot_id': robot_id,
            'temperature': temp_base,
            'vibration': vibration_base,
            'current': current_base,
            'pressure': pressure_base,
            'failure': failure,
            'time_to_failure': time_to_failure # Can be used for RUL prediction or as a feature
        })
        data.append(robot_df)

    df = pd.concat(data).reset_index(drop=True)
    return df

df = generate_robot_sensor_data(n_robots=10, n_days=180)
print("\nSimulated Data Head:")
print(df.head())
print("\nFailure Distribution:")
print(df['failure'].value_counts(normalize=True))

# --- 2. Feature Engineering (for time-series data) ---
# For simplicity, we'll use a single robot's data and engineer basic rolling features.
# In a real scenario, you'd apply this group-wise for each robot.
df_robot1 = df[df['robot_id'] == 1].copy()

# Sort by timestamp to ensure correct rolling window calculation
df_robot1 = df_robot1.sort_values(by='timestamp').set_index('timestamp')

# Define sensor columns
sensor_cols = ['temperature', 'vibration', 'current', 'pressure']

# Create rolling window features (e.g., mean, std dev over last X hours)
window_sizes = [3, 6, 12, 24] # Examples: last 3, 6, 12, 24 hours
for col in sensor_cols:
    for window in window_sizes:
        df_robot1[f'{col}_roll_mean_{window}h'] = df_robot1[col].rolling(window=f'{window}H', closed='left').mean()
        df_robot1[f'{col}_roll_std_{window}h'] = df_robot1[col].rolling(window=f'{window}H', closed='left').std()
    # Rate of change (computed once per sensor, outside the window loop)
    df_robot1[f'{col}_diff_1h'] = df_robot1[col].diff(periods=1)


# Drop rows with NaN values created by rolling windows (beginning of time series)
df_robot1.dropna(inplace=True)

# Select features (X) and target (y)
features = [col for col in df_robot1.columns if col not in ['robot_id', 'failure', 'time_to_failure']]
X = df_robot1[features]
y = df_robot1['failure'] # Our target is binary failure prediction

print(f"\nFeatures engineered. X shape: {X.shape}, y shape: {y.shape}")
print("Sample of Engineered Features:")
print(X.head())
print("Failure distribution after feature engineering:")
print(y.value_counts(normalize=True))
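
# (Group-wise sketch, as referenced above; a hypothetical helper, with the final
# apply call left commented out so the single-robot demo stays unchanged.)
# In a multi-robot deployment, rolling features must be computed per robot so that
# no window spans readings from two different machines.
def add_rolling_features(g):
    g = g.sort_values('timestamp').set_index('timestamp')
    for col in sensor_cols:
        for window in window_sizes:
            g[f'{col}_roll_mean_{window}h'] = g[col].rolling(window=f'{window}h', closed='left').mean()
            g[f'{col}_roll_std_{window}h'] = g[col].rolling(window=f'{window}h', closed='left').std()
        g[f'{col}_diff_1h'] = g[col].diff(periods=1)
    return g.reset_index()

# df_all_features = df.groupby('robot_id', group_keys=False).apply(add_rolling_features)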

# --- 3. Train/Test Split (Time-Series Split) ---
# CRITICAL: for time-series data a random split is wrong; temporal order must be preserved.
# We simulate deployment: training data comes from an earlier period, test data from a later one.
split_point = int(len(X) * 0.8) # 80% for training, 20% for testing

X_train, X_test = X.iloc[:split_point], X.iloc[split_point:]
y_train, y_test = y.iloc[:split_point], y.iloc[split_point:]

print(f"\nTime-series split applied.")
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_train failure ratio: {y_train.sum() / len(y_train):.4f}")
print(f"y_test failure ratio: {y_test.sum() / len(y_test):.4f}")

# --- 4. Model Candidates and Hyperparameter Grids ---
# We'll use Random Forest and Gradient Boosting; a linear model such as Logistic
# Regression is less suited to the nonlinear failure signatures in these features.
classifiers = {
    'RandomForestClassifier': RandomForestClassifier(random_state=42, class_weight='balanced'), # 'balanced' handles imbalance
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=42)
}

# Define hyperparameter distributions for RandomizedSearchCV
# Using smaller ranges for demo, expand for real scenarios
param_distributions = {
    'RandomForestClassifier': {
        'classifier__n_estimators': np.arange(50, 200, 50),
        'classifier__max_depth': [10, 20, None],
        'classifier__min_samples_leaf': [1, 5]
    },
    'GradientBoostingClassifier': {
        'classifier__n_estimators': np.arange(50, 200, 50),
        'classifier__learning_rate': [0.05, 0.1, 0.2],
        'classifier__max_depth': [3, 5]
    }
}

# --- 5. Model Selection with TimeSeriesSplit Cross-Validation and Imbalance Handling ---
best_model = None
best_score = -np.inf # Maximize ROC AUC for robust evaluation
best_model_name = ""
model_evaluation_results = {}

print("\nStarting Model Selection and Hyperparameter Tuning...")

for model_name, classifier in classifiers.items():
    print(f"\n--- Processing {model_name} ---")

    # Imblearn Pipeline: SMOTE for oversampling minority class *before* classifier
    # This prevents data leakage during cross-validation by applying SMOTE only on training folds.
    pipeline = ImbPipeline(steps=[
        ('preprocessor', StandardScaler()), # Standardize numerical features
        ('oversampler', SMOTE(random_state=42)), # Apply SMOTE to balance classes
        ('classifier', classifier)
    ])
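
    # For contrast, a leaky anti-pattern would be to resample *before* cross-validation:
    #     X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
    # Synthetic minority samples derived from validation-fold neighbours would then
    # leak into training, inflating CV scores. The ImbPipeline above avoids this by
    # fitting SMOTE on the training portion of each fold only.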

    # TimeSeriesSplit for cross-validation: ensures that validation folds are always 'future' data
    # than training folds. Crucial for time-series.
    # n_splits=5 cuts the series into ~6 consecutive blocks: train=block 1, test=block 2;
    # then train=blocks 1-2, test=block 3; ... up to train=blocks 1-5, test=block 6.
    tscv = TimeSeriesSplit(n_splits=5)
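
    # (Optional sanity check, an addition to the walkthrough): print the fold
    # boundaries TimeSeriesSplit produces, confirming validation always follows training.
    for fold, (tr_idx, va_idx) in enumerate(tscv.split(X_train)):
        print(f"  Fold {fold}: train=[0..{tr_idx[-1]}], val=[{va_idx[0]}..{va_idx[-1]}]")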

    # Use RandomizedSearchCV for hyperparameter tuning with TimeSeriesSplit
    # We choose 'roc_auc' as the primary scoring metric due to severe class imbalance
    # and the importance of ranking potential failure events.
    search = RandomizedSearchCV(
        estimator=pipeline,
        param_distributions=param_distributions[model_name],
        n_iter=10, # Reduced for faster demo, increase significantly for real use
        cv=tscv,
        scoring='roc_auc',
        verbose=1,
        n_jobs=-1,
        random_state=42,
        error_score='raise' # Raise errors to debug pipeline issues
    )

    search.fit(X_train, y_train)

    print(f"Best parameters for {model_name}: {search.best_params_}")
    print(f"Best ROC AUC on CV folds for {model_name}: {search.best_score_:.4f}")

    # Store results (no need to re-score on X_train: SMOTE only runs inside the pipeline
    # during fitting, and search.best_score_ already reports the cross-validated performance)
    model_evaluation_results[model_name] = {
        'best_params': search.best_params_,
        'cv_roc_auc_avg': search.best_score_,
        'best_estimator': search.best_estimator_
    }

    if search.best_score_ > best_score:
        best_score = search.best_score_
        best_model = search.best_estimator_
        best_model_name = model_name

print(f"\n--- Model Selection Complete ---")
print(f"The best model based on cross-validated ROC AUC is: {best_model_name}")
print(f"Best cross-validated ROC AUC: {best_score:.4f}")

# --- 6. Final Evaluation on the Unseen Test Set ---
print("\n--- Final Evaluation on Unseen Test Set ---")

if best_model:
    # Predict probabilities for ROC AUC
    y_pred_proba_test = best_model.predict_proba(X_test)[:, 1]

    # You would typically choose an operating threshold based on business costs
    # (we do exactly that in step 7 below); for now, use the default 0.5
    optimal_threshold = 0.5
    y_pred_test = (y_pred_proba_test >= optimal_threshold).astype(int)

    test_roc_auc = roc_auc_score(y_test, y_pred_proba_test)
    test_accuracy = accuracy_score(y_test, y_pred_test)
    test_precision = precision_score(y_test, y_pred_test)
    test_recall = recall_score(y_test, y_pred_test)
    test_f1 = f1_score(y_test, y_pred_test)

    print(f"Metrics for the Best Model ({best_model_name}) on Test Set (Threshold={optimal_threshold}):")
    print(f"  ROC AUC: {test_roc_auc:.4f}")
    print(f"  Accuracy: {test_accuracy:.4f}") # Less reliable due to imbalance
    print(f"  Precision: {test_precision:.4f}") # Low precision means many false alarms
    print(f"  Recall: {test_recall:.4f}")     # High recall means catching failures
    print(f"  F1-Score: {test_f1:.4f}")       # Balance between precision and recall

    print("\nClassification Report on Test Set:")
    print(classification_report(y_test, y_pred_test))

    print("\nConfusion Matrix on Test Set:")
    cm = confusion_matrix(y_test, y_pred_test)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['No Failure', 'Failure'], yticklabels=['No Failure', 'Failure'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title(f'Confusion Matrix for {best_model_name} (Threshold={optimal_threshold})')
    plt.show()

    # Visualize ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba_test)
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {test_roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'Receiver Operating Characteristic (ROC) Curve for {best_model_name}')
    plt.legend(loc="lower right")
    plt.grid(True)
    plt.show()

    # Visualize Precision-Recall Curve (often more informative for imbalanced data)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_test)
    plt.figure(figsize=(8, 6))
    plt.plot(recall, precision, color='b', alpha=0.7, label=f'Precision-Recall curve (PR AUC={auc(recall, precision):.2f})')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f'Precision-Recall Curve for {best_model_name}')
    plt.legend(loc="lower left")
    plt.grid(True)
    plt.show()

    # --- 7. Threshold Optimization based on Business Cost ---
    # In a real scenario, you'd define costs for False Positives (FP) and False Negatives (FN)
    # E.g., cost_fp = $50 (unnecessary maintenance check), cost_fn = $5000 (downtime + emergency repair)
    # We want to minimize total cost = FP * cost_fp + FN * cost_fn
    # Or, maximize (True Positives * benefit_tp) - (False Positives * cost_fp) - (False Negatives * cost_fn)

    # For demonstration, let's aim for a high recall while keeping precision reasonable.
    # We can iterate through thresholds to find the best balance.
    costs = []
    thresholds_to_test = np.linspace(0.01, 0.99, 100) # Test 100 thresholds

    # Example costs (hypothetical)
    COST_FALSE_POSITIVE = 100   # Cost of an unnecessary maintenance check
    COST_FALSE_NEGATIVE = 5000  # Cost of an actual breakdown (downtime, repair, lost production)

    for threshold in thresholds_to_test:
        y_pred_th = (y_pred_proba_test >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred_th, labels=[0, 1]).ravel()  # fixed labels keep ravel() at 4 values
        total_cost = (fp * COST_FALSE_POSITIVE) + (fn * COST_FALSE_NEGATIVE)
        costs.append(total_cost)

    best_cost_idx = np.argmin(costs)
    optimal_threshold_business = thresholds_to_test[best_cost_idx]
    min_total_cost = costs[best_cost_idx]

    print(f"\n--- Optimal Threshold based on Business Cost ---")
    print(f"Optimal threshold: {optimal_threshold_business:.4f}")
    print(f"Minimum estimated total cost per period: ${min_total_cost:.2f}")

    y_pred_optimized = (y_pred_proba_test >= optimal_threshold_business).astype(int)
    print("\nClassification Report with Business-Optimized Threshold:")
    print(classification_report(y_test, y_pred_optimized))
    print("Confusion Matrix with Business-Optimized Threshold:")
    print(confusion_matrix(y_test, y_pred_optimized))
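
    # (Optional visualization, an addition to the walkthrough): plot expected cost
    # against the threshold to see how sharp or flat the optimum is.
    plt.figure(figsize=(8, 4))
    plt.plot(thresholds_to_test, costs, color='b')
    plt.axvline(optimal_threshold_business, color='r', linestyle='--',
                label=f'Optimal threshold = {optimal_threshold_business:.2f}')
    plt.xlabel('Decision Threshold')
    plt.ylabel('Estimated Total Cost ($)')
    plt.title('Business Cost vs. Decision Threshold')
    plt.legend()
    plt.grid(True)
    plt.show()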


# --- 8. Model Persistence ---
if best_model:
    model_filename = f'best_robot_predictive_maintenance_model_{best_model_name}.pkl'
    joblib.dump(best_model, model_filename)
    print(f"\nBest model saved as {model_filename}")

    # Example of loading the model
    # loaded_model = joblib.load(model_filename)
    # print(f"Model loaded successfully: {loaded_model}")
    # You can now use loaded_model.predict(new_sensor_data) or loaded_model.predict_proba(new_sensor_data)

Explanation of Best Practices in the Code for Predictive Maintenance:

  1. Simulated Time-Series Data:
    • The generate_robot_sensor_data function creates a realistic dataset with multiple robots and introduces “failure signatures” (abnormal sensor readings) before the actual failure point. This simulates the real-world scenario of predicting impending issues.
    • It also includes time_to_failure, which could be used for Remaining Useful Life (RUL) prediction, though the example focuses on binary failure classification (a brief RUL sketch follows this list).
  2. Time-Series Feature Engineering:
    • Raw sensor data is often not directly usable. We extract rolling window statistics (mean, standard deviation) over different timeframes (e.g., 3, 6, 12, 24 hours). These capture trends, variability, and recent behavior, which are highly indicative of machine health.
    • Rate of change (.diff()) is another crucial time-series feature, indicating sudden shifts in sensor readings.
    • This transforms the raw time series into a tabular dataset suitable for traditional ML classifiers.
  3. Strict Time-Series Train/Test Split:
    • Crucial Best Practice: Instead of random splitting, we explicitly split the data such that the training set consists of earlier observations and the test set consists of later observations. This accurately simulates real-world deployment, where the model trains on historical data and predicts on future, unseen data.
  4. Imblearn.Pipeline with SMOTE:
    • Failure events are rare, leading to severe class imbalance.
    • SMOTE (Synthetic Minority Over-sampling Technique) is used to generate synthetic samples of the minority class (failures).
    • Key Best Practice: Using imblearn.pipeline.Pipeline ensures that SMOTE is applied only to the training data within each cross-validation fold. Applying SMOTE before the CV split (on the entire X_train) would lead to data leakage and overly optimistic performance estimates. class_weight='balanced' in the classifier is another way to handle imbalance, often used in conjunction with or as an alternative to oversampling.
  5. TimeSeriesSplit for Cross-Validation:
    • Essential for Time-Series: Standard KFold shuffling would violate the temporal order, leaking future information into the training data.
    • TimeSeriesSplit ensures that each validation fold is always after its corresponding training fold, maintaining the temporal integrity during model evaluation and hyperparameter tuning.
  6. Appropriate Evaluation Metrics for Imbalanced Classification:
    • roc_auc_score: The primary metric for hyperparameter tuning. It’s robust to class imbalance and measures the model’s ability to rank positive instances higher than negative ones, which is vital for an early warning system.
    • precision_score and recall_score: Critically important for predictive maintenance.
      • Recall (True Positive Rate): How many actual failures did we correctly identify? High recall is usually desired to avoid costly unplanned downtime.
      • Precision (Positive Predictive Value): Of all the predicted failures, how many were actual failures? High precision reduces false alarms, saving unnecessary maintenance checks.
    • f1_score: Harmonic mean of precision and recall, providing a balanced view.
    • classification_report and confusion_matrix: Provide detailed insights into the model’s performance across both classes.
    • Precision-Recall Curve: Often more informative than ROC for highly imbalanced datasets, as it directly shows the trade-off between identifying positives and avoiding false alarms.
  7. Threshold Optimization based on Business Cost:
    • The default 0.5 probability threshold is rarely optimal for imbalanced problems like predictive maintenance.
    • Business Best Practice: We demonstrate how to calculate a custom “optimal” threshold by considering the real-world costs of false positives (e.g., unnecessary inspection) and false negatives (e.g., catastrophic breakdown). This directly aligns the model’s output with business objectives.
  8. Model Persistence (joblib):
    • The final best model (including its preprocessing and imbalance handling steps via the ImbPipeline) is saved for later deployment in a live monitoring system.

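As noted in item 1, the simulated time_to_failure column also supports a regression formulation. The following is a minimal, hypothetical sketch of that RUL variant, reusing the engineered features and the same temporal split; the 72-hour cap is an arbitrary illustrative horizon, and a real regressor would get its own tuning and evaluation:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical RUL sketch: predict hours-to-failure instead of a binary label
y_rul = df_robot1['time_to_failure'].clip(upper=72)  # cap the horizon at 72 hours (illustrative)
rul_model = RandomForestRegressor(random_state=42)
rul_model.fit(X_train, y_rul.iloc[:split_point])
rul_mae = mean_absolute_error(y_rul.iloc[split_point:], rul_model.predict(X_test))
print(f"RUL sketch MAE on the held-out test period: {rul_mae:.2f} hours")
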
This use case provides a robust framework for building and selecting predictive maintenance models, directly addressing the unique challenges of time-series and imbalanced industrial data.
