Data Splitting: Time-Series Considerations in Machine Learning

Saba Shahrukh, June 11, 2025

Time-series data is a sequence of data points collected or recorded at successive points in time, usually at equally spaced intervals. Its defining characteristic is chronological order: the order of the observations matters and carries significant information.

Examples of time-series data include:

Daily stock prices
Monthly sales figures
Hourly temperature readings
Quarterly GDP measurements
Sensor data from IoT devices over time

The temporal ordering allows for the analysis of trends, seasonality, cyclical patterns, and irregularities, which are crucial for understanding past behavior and forecasting future values.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Generate Synthetic Time-Series Data ---
# Let's create a simple time-series dataset.
# We'll simulate daily sales data with a trend and some seasonality.

# Define a date range
dates = pd.date_range(start='2020-01-01', periods=365 * 3, freq='D') # 1,095 daily points; ends 2022-12-30 because 2020 is a leap year

# Create a sales column with a trend and seasonality
# Trend: steadily increasing over time
trend = np.linspace(0, 100, len(dates))

# Seasonality: higher sales on certain days/months
# Let's simulate a weekly seasonality (e.g., higher sales on weekends)
# And a yearly seasonality (e.g., peak sales around holidays)
weekly_seasonality = 10 * np.sin(np.arange(len(dates)) * 2 * np.pi / 7)
yearly_seasonality = 20 * np.sin(np.arange(len(dates)) * 2 * np.pi / 365)

# Noise (seeded so the example is reproducible)
rng = np.random.default_rng(42)
noise = rng.normal(0, 5, len(dates))

# Combine to get sales data
sales = trend + weekly_seasonality + yearly_seasonality + noise
sales[sales < 0] = 0 # Ensure sales are non-negative

# Create a DataFrame
df = pd.DataFrame({'Date': dates, 'Sales': sales})
df = df.set_index('Date')

print("--- Original DataFrame Head ---")
print(df.head())
print("\n--- Original DataFrame Tail ---")
print(df.tail())
print(f"\nTotal data points: {len(df)}")

# Plot the synthetic time-series data to visualize
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Sales'])
plt.title('Synthetic Time-Series Sales Data')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.grid(True)
plt.show()

# --- 2. Chronological Data Splitting ---
# For time-series, we split based on a specific point in time.
# For example, we can use the first 80% of data for training and the last 20% for testing.

# Calculate the split point
split_ratio = 0.8 # 80% for training, 20% for testing
split_index = int(len(df) * split_ratio)

# Split the data chronologically
train_df = df.iloc[:split_index]
test_df = df.iloc[split_index:]

print(f"\nTraining data points: {len(train_df)}")
print(f"Testing data points: {len(test_df)}")

print("\n--- Training Data Head and Tail ---")
print(train_df.head())
print(train_df.tail())

print("\n--- Testing Data Head and Tail ---")
print(test_df.head())
print(test_df.tail())

# Verify the chronological order
print(f"\nTraining data ends on: {train_df.index.max()}")
print(f"Testing data starts on: {test_df.index.min()}")

# Plot the split data to visualize
plt.figure(figsize=(12, 6))
plt.plot(train_df.index, train_df['Sales'], label='Training Data', color='blue')
plt.plot(test_df.index, test_df['Sales'], label='Testing Data', color='orange')
plt.axvline(x=train_df.index.max(), color='red', linestyle='--', label='Split Point')
plt.title('Time-Series Data Split (Chronological)')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()

# --- Alternative: Using a specific date for splitting ---
# You might want to split your data at a known point, e.g., before a certain year or event.

split_date = '2021-12-31' # Train on 2020-2021, test on 2022 (the data ends on 2022-12-30, so a later split date would leave the test set empty)

train_df_by_date = df[df.index <= split_date]
test_df_by_date = df[df.index > split_date]

print(f"\n--- Split by Date ({split_date}) ---")
print(f"Training data points (by date): {len(train_df_by_date)}")
print(f"Testing data points (by date): {len(test_df_by_date)}")

print(f"Training data ends on: {train_df_by_date.index.max()}")
print(f"Testing data starts on: {test_df_by_date.index.min()}")

# --- 3. Preparing for Machine Learning (Features and Target) ---
# Typically, you'd separate features (X) from the target variable (y).
# For simplicity, let's assume 'Sales' is our target.
# In a real scenario, you'd extract features like lagged sales, day of week, month, etc.

X = df.drop(columns='Sales') # Placeholder: no feature columns yet, only the Date index
y = df['Sales']

X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

print(f"\nShape of X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"Shape of X_test: {X_test.shape}, y_test: {y_test.shape}")

# Note: For time-series, X_train might include lagged versions of y_train,
# or other time-based features that you engineer from the Date index.
# This example focuses purely on the chronological splitting mechanism.
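
The note above mentions lagged and time-based features without showing them. A minimal sketch of one way to build them follows, assuming a daily `Sales` series like the one in this article; the column names (`lag_1`, `lag_7`, `rolling_mean_7`) and the choice of lags are illustrative assumptions, not part of the original example. Each feature for day t uses only values from earlier days, which keeps the chronological split free of leakage.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's `df`: three years of daily sales.
dates = pd.date_range(start="2020-01-01", periods=365 * 3, freq="D")
rng = np.random.default_rng(0)
sales = np.linspace(0, 100, len(dates)) + rng.normal(0, 5, len(dates))
df = pd.DataFrame({"Sales": sales}, index=dates)

features = pd.DataFrame(index=df.index)

# Lagged targets: shift(k) moves values forward, so row t holds Sales at t - k.
features["lag_1"] = df["Sales"].shift(1)
features["lag_7"] = df["Sales"].shift(7)

# Mean of the previous 7 days; shifting first keeps day t out of its own feature.
features["rolling_mean_7"] = df["Sales"].shift(1).rolling(window=7).mean()

# Calendar features derived from the Date index.
features["day_of_week"] = df.index.dayofweek
features["month"] = df.index.month

# Drop the warm-up rows that have no complete history yet.
data = features.join(df["Sales"]).dropna()
X, y = data.drop(columns="Sales"), data["Sales"]

# Same chronological 80/20 split as before, applied after feature creation.
split_index = int(len(data) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
print(f"Train ends {X_train.index.max().date()}, test starts {X_test.index.min().date()}")
```

Computing the features on the full series before splitting is safe here only because every feature looks strictly backward. Statistics fitted on the data, such as scaling parameters, should be fitted on the training portion alone.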


The code above splits time-series data chronologically, so every training observation comes before every test observation. This is a fundamental practice in time-series forecasting: it keeps the model evaluation honest and prevents information from the future leaking into training.

Key takeaways:

Always split chronologically: Do not shuffle or randomly sample observations when splitting time-series data for forecasting.
Prevent data leakage: Ensure future information does not influence model training.
Realistic evaluation: Your model’s performance on the chronologically split test set will be a more accurate reflection of its real-world predictive power.
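
To make the first two takeaways concrete, here is a small sketch contrasting a shuffled split with a chronological one. It uses scikit-learn's `train_test_split` with its default `shuffle=True` and then with `shuffle=False`. The helper `leaked_fraction` and the simplified trend-only data are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical daily series standing in for the article's sales data.
dates = pd.date_range(start="2020-01-01", periods=365 * 3, freq="D")
df = pd.DataFrame({"Sales": np.linspace(0, 100, len(dates))}, index=dates)

# Random split: shuffle=True is the default, so test rows are scattered in time.
rand_train, rand_test = train_test_split(df, test_size=0.2, random_state=0)

# Chronological split: shuffle=False keeps the last 20% of rows as the test set.
chrono_train, chrono_test = train_test_split(df, test_size=0.2, shuffle=False)


def leaked_fraction(train, test):
    """Share of test rows dated on or before the latest training date."""
    return float((test.index <= train.index.max()).mean())


print(f"Random split:        {leaked_fraction(rand_train, rand_test):.2f} of test rows precede the end of training")
print(f"Chronological split: {leaked_fraction(chrono_train, chrono_test):.2f} of test rows precede the end of training")
```

With the shuffled split, most test rows fall inside the training period, so the model is effectively asked to interpolate rather than forecast. With `shuffle=False`, the fraction is zero by construction.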

Further considerations for real-world scenarios:

Cross-validation for time series: Standard shuffled K-fold cross-validation is not suitable, because some folds would train on observations that come after the validation data. Consider techniques like “rolling origin” or “time series split” cross-validation, where each fold trains on earlier data and validates on the period that follows.
Feature engineering: For time-series, features often include lagged values of the target variable, moving averages, time-based features (e.g., day of week, month, year), and external regressors.
Gap between train and test: Sometimes a small gap is left between the training and testing sets. This keeps lagged or rolling features in the test set from overlapping the training period, and it mimics deployments where predictions are made some time after the last available observation.
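
The cross-validation and gap ideas above can be sketched with scikit-learn's `TimeSeriesSplit`, which accepts a `gap` argument in recent releases (0.24 and later). The values `n_splits=5` and `gap=7` are arbitrary choices for illustration, and the integer array stands in for a real feature matrix.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder feature matrix: one row per day, in chronological order.
n_samples = 365 * 3
X = np.arange(n_samples).reshape(-1, 1)

# Each fold trains on an expanding window of past rows and validates on the
# block that follows, skipping `gap` rows between the two.
tscv = TimeSeriesSplit(n_splits=5, gap=7)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    skipped = test_idx[0] - train_idx[-1] - 1
    print(
        f"Fold {fold}: train rows 0..{train_idx[-1]}, "
        f"test rows {test_idx[0]}..{test_idx[-1]}, gap of {skipped} rows"
    )
```

Every fold validates on data that comes strictly after its training window, so the average score across folds approximates how the model would perform if retrained and deployed at several points in time.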
