Most machine learning models try to predict outcomes. But what if your goal is simply to detect observations that look unusual in a massive, interconnected network?
When dealing with highly sparse datasets, such as financial networks where illicit credit card transactions make up less than 1% of the data, standard classification models face a major hurdle: the Accuracy Paradox. A model can label every transaction as legitimate, score above 99% accuracy, and still be completely useless for fraud detection.
In this article, we will walk through an end-to-end Python implementation demonstrating how to overcome this challenge. By combining the unsupervised anomaly-isolating power of Isolation Forests with the structural, message-passing capabilities of Graph Neural Networks (GNNs), we can build an architecture designed specifically to catch the unseen 1%.
Graph Neural Networks (GNNs) are a common choice for detecting illicit financial activity such as credit card fraud or money laundering, because transactions naturally form a graph: accounts or users are the nodes, and the transactions between them are the edges.
The Accuracy Paradox is what makes this hard in practice. In a dataset where 99.9% of transactions are legitimate and 0.1% are illicit, a model that predicts “legitimate” every time scores 99.9% accuracy and catches no fraud at all.
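The paradox is easy to reproduce. The sketch below uses invented labels (1,000 transactions, one of them fraudulent) and a hypothetical `always_legitimate` predictor; the counts are assumptions chosen to match the 0.1% figure above, not real data.

```python
# Toy illustration of the Accuracy Paradox (labels are invented for the example).
# 1 = illicit, 0 = legitimate; 1 fraud case in 1,000 transactions (0.1%).
y_true = [1] + [0] * 999

def always_legitimate(transactions):
    """Hypothetical 'model' that labels every transaction as legitimate."""
    return [0] * len(transactions)

y_pred = always_legitimate(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(y_true)

accuracy = correct / len(y_true)
recall = true_pos / actual_pos  # share of fraud cases that were caught

print(f"accuracy = {accuracy:.3f}")  # 0.999
print(f"recall   = {recall:.3f}")    # 0.000
```

Accuracy rewards the 999 easy negatives, while recall shows that the single case that matters was missed.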
This guide works through the problem end to end using Graph Neural Networks combined with Isolation Forests, moving from the concepts to a synthetic implementation and finally to the widely used Elliptic Dataset.
1. The Challenge of Sparse, Imbalanced Datasets
When dealing with highly sparse datasets (anomalies < 1%), standard loss functions such as Cross-Entropy weight every sample equally. The majority class therefore dominates the gradient updates, and the model settles into a degenerate solution: predicting the majority class for everything.
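One way to see why the majority class wins is to ask which constant prediction minimizes the loss. The sketch below is a numerical check, not part of the pipeline: it assumes a 1% fraud rate and a model that outputs the same fraud probability `p` for every sample, then grid-searches `p`. The function names and the `pos_weight` parameter are illustrative.

```python
import math

def expected_ce(p, fraud_rate, pos_weight=1.0):
    """Expected cross-entropy when every sample gets fraud probability p."""
    return -(fraud_rate * pos_weight * math.log(p)
             + (1.0 - fraud_rate) * math.log(1.0 - p))

def best_constant(fraud_rate, pos_weight=1.0):
    """Grid-search the constant p in (0, 1) that minimizes the expected loss."""
    grid = [i / 1000 for i in range(1, 1000)]
    return min(grid, key=lambda p: expected_ce(p, fraud_rate, pos_weight))

fraud_rate = 0.01
p_plain = best_constant(fraud_rate)                      # unweighted loss
p_weighted = best_constant(fraud_rate, pos_weight=99.0)  # weight = neg/pos = 0.99/0.01

print(p_plain, p_weighted)
```

With uniform weighting the best constant sits at the base rate (0.01), far below the 0.5 decision threshold, so every sample ends up labeled legitimate. Weighting the positive class by the negative-to-positive ratio moves the optimum to 0.5, which is the motivation for the class weights in the list that follows.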
Algorithms and Techniques to Overcome Imbalance:
- Evaluation Metrics Re-alignment: Discard Accuracy. Use Precision, Recall, F1-Score, and the Area Under the Precision-Recall Curve (AUPRC).
- Cost-Sensitive Learning (Class Weights): Assigning a higher penalty to misclassifying the minority class. If fraud is 100 times rarer than legitimate transactions, the loss for missing a fraud case is multiplied by 100.
- Focal Loss: Dynamically scales the cross-entropy loss based on prediction confidence, forcing the model to focus on hard-to-predict examples (fraud) rather than easy ones (normal transactions).
- Graph-Specific Resampling: Techniques like GraphSMOTE generate synthetic minority nodes and wire them into the existing graph geometry.
- Ensemble Anomaly Detection (Isolation Forest): Unsupervised anomaly detection that isolates outliers with random splits of the feature space rather than historical labels. Points that can be separated from the rest in only a few splits receive a high anomaly score.
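Of the techniques listed above, focal loss is the least obvious from its description. A minimal sketch of the binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), is shown here in plain Python so the down-weighting is visible. The function names are hypothetical; gamma=2.0 is a commonly used value, and alpha=0.5 is an arbitrary neutral choice for this illustration. Both would need tuning on real data.

```python
import math

def binary_focal_loss(p_fraud, label, gamma=2.0, alpha=0.5):
    """Focal loss for one sample.

    p_fraud: predicted probability of the positive (fraud) class
    label:   1 for fraud, 0 for legitimate
    """
    # p_t is the probability the model assigned to the true class
    p_t = p_fraud if label == 1 else 1.0 - p_fraud
    # alpha weights the positive class, (1 - alpha) the negative class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of confidently correct predictions
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def binary_cross_entropy(p_fraud, label):
    """Plain cross-entropy for one sample, for comparison."""
    p_t = p_fraud if label == 1 else 1.0 - p_fraud
    return -math.log(p_t)

easy = (0.02, 0)   # legitimate transaction, confidently scored as legitimate
hard = (0.10, 1)   # fraud case the model mostly missed

for name, (p, y) in [("easy", easy), ("hard", hard)]:
    ce = binary_cross_entropy(p, y)
    fl = binary_focal_loss(p, y)
    print(f"{name}: CE={ce:.4f}  focal={fl:.6f}  ratio={fl / ce:.4f}")
```

The easy negative keeps only 0.02% of its cross-entropy loss, while the missed fraud case keeps about 40%, so the hard example dominates the gradient.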
How Isolation Forest helps the GNN: Isolation Forests (IF) are highly efficient at finding feature-level anomalies but ignore the graph structure. GNNs excel at finding structural anomalies but can struggle with feature-level class imbalance. By passing our raw transaction data through an Isolation Forest first, we can extract an anomaly_score. We then append this score as a new node feature into our GNN. The GNN now has a powerful, unsupervised prior to guide its structural message passing.
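To make the isolation idea concrete, the following sketch counts how many random axis-aligned splits it takes to separate a point from the rest of a sample. It is a simplified illustration, not scikit-learn's implementation: it follows only the branch containing the query point instead of building full trees, skips subsampling and score normalization, and runs on invented 2-D data. All names are hypothetical.

```python
import random

def isolation_depth(point, data, rng, max_splits=50):
    """Number of random splits needed to isolate `point` (a member of `data`)."""
    subset = data
    for depth in range(max_splits):
        if len(subset) <= 1:
            return depth
        dim = rng.randrange(len(point))          # pick a random feature
        lo = min(row[dim] for row in subset)
        hi = max(row[dim] for row in subset)
        split = rng.uniform(lo, hi)              # pick a random threshold
        # keep only the side of the split that contains the query point
        if point[dim] < split:
            subset = [row for row in subset if row[dim] < split]
        else:
            subset = [row for row in subset if row[dim] >= split]
    return max_splits

rng = random.Random(0)
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)                             # far from the inlier cloud
data = inliers + [outlier]
# the inlier closest to the origin stands in for a "typical" transaction
central = min(inliers, key=lambda r: r[0] ** 2 + r[1] ** 2)

def mean_depth(point, n_trees=100):
    """Average isolation depth over several random split sequences."""
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

outlier_depth = mean_depth(outlier)
central_depth = mean_depth(central)
print("outlier:", outlier_depth)
print("central inlier:", central_depth)
```

The outlier is isolated in far fewer splits than the central point. scikit-learn's IsolationForest turns this average path length into a normalized score, and the implementation in the next section relies on that library rather than on this sketch.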
2. Phase 1: End-to-End Implementation on Synthetic Data
First, we will generate a synthetic credit card transaction graph and implement the hybrid IF-GNN architecture on it.
Step 2.1: Generating the Sparse Dataset and Applying Isolation Forest
Python
import torch
import numpy as np
import networkx as nx
from torch_geometric.data import Data
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, precision_recall_fscore_support
# 1. Generate Synthetic Imbalanced Graph Data
def generate_synthetic_fraud_graph(num_nodes=5000, fraud_ratio=0.02):
    """Generates a graph where nodes are transactions and edges are device/IP links."""
    np.random.seed(42)
    # Generate random features (e.g., transaction amount, time, location hash)
    # Normal transactions: standard normal in each of 10 dimensions
    X_normal = np.random.randn(num_nodes, 10)
    # Inject anomalies (fraud) by shifting the distribution for a small subset
    num_fraud = int(num_nodes * fraud_ratio)
    X_normal[:num_fraud] += np.random.normal(loc=3.0, scale=1.0, size=(num_fraud, 10))
    y = np.zeros(num_nodes, dtype=np.int64)
    y[:num_fraud] = 1  # Class 1 is fraud
    # Shuffle so fraud nodes are not clustered at the lowest node indices
    indices = np.random.permutation(num_nodes)
    X = X_normal[indices]
    y = y[indices]
    # Generate edges: preferential attachment to simulate real transaction networks.
    # Note: edges are drawn independently of the labels, so in this toy graph
    # only the features carry fraud signal. The seed keeps the graph reproducible.
    G = nx.barabasi_albert_graph(num_nodes, 3, seed=42)
    edges = list(G.edges)
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    # Create bidirectional edges
    edge_index = torch.cat([edge_index, edge_index[[1, 0]]], dim=1)
    return X, y, edge_index
X_raw, y_raw, edge_index = generate_synthetic_fraud_graph()
# 2. Train/Test split (80/20), done first so the Isolation Forest never sees test nodes
num_nodes = len(y_raw)
indices = np.random.permutation(num_nodes)
train_idx = indices[:int(0.8 * num_nodes)]
test_idx = indices[int(0.8 * num_nodes):]
# 3. Apply Isolation Forest (Unsupervised Anomaly Detection), fitted on training nodes only
print("Running Isolation Forest...")
iso_forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
iso_forest.fit(X_raw[train_idx])
# decision_function is lower for more anomalous points in sklearn,
# so we negate it: higher score = higher chance of being an anomaly
anomaly_scores = -iso_forest.decision_function(X_raw)
anomaly_scores = anomaly_scores.reshape(-1, 1)
# 4. Feature Augmentation: Combine original features with IF Anomaly Scores
X_augmented = np.hstack((X_raw, anomaly_scores))
x_tensor = torch.tensor(X_augmented, dtype=torch.float)
y_tensor = torch.tensor(y_raw, dtype=torch.long)
# Create PyTorch Geometric Data object
data = Data(x=x_tensor, edge_index=edge_index, y=y_tensor)
# Boolean masks select the training and test nodes inside the single graph
data.train_mask = torch.zeros(num_nodes, dtype=torch.bool)
data.test_mask = torch.zeros(num_nodes, dtype=torch.bool)
data.train_mask[train_idx] = True
data.test_mask[test_idx] = True
Step 2.2: Defining the Graph Neural Network (GraphSAGE)
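GraphSAGE suits transaction networks because it learns to aggregate features from each node's neighborhood and, when paired with neighbor sampling, scales to very large graphs. The model below applies PyG's SAGEConv to the full graph at once, which is sufficient for 5,000 nodes; no sampling is performed here.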
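With its default mean aggregator, a SAGEConv layer updates each node as x_i' = W_1 x_i + W_2 · mean(x_j) over the neighbors j of node i. The sketch below reproduces that arithmetic in plain Python on a three-node toy graph so the message passing is visible. The weights, features, and helper names are invented for illustration; the real layer learns W_1 and W_2 and also adds a bias term.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sage_mean_layer(x, edge_index, W_self, W_neigh):
    """One GraphSAGE-style layer with mean aggregation (no bias, no activation).

    x:          list of node feature vectors
    edge_index: [sources, targets]; messages flow source -> target
    """
    num_nodes, dim = len(x), len(x[0])
    sums = [[0.0] * dim for _ in range(num_nodes)]
    counts = [0] * num_nodes
    for src, dst in zip(edge_index[0], edge_index[1]):
        counts[dst] += 1
        for d in range(dim):
            sums[dst][d] += x[src][d]
    out = []
    for i in range(num_nodes):
        # nodes without incoming edges aggregate to a zero vector
        mean = [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        self_part = matvec(W_self, x[i])
        neigh_part = matvec(W_neigh, mean)
        out.append([a + b for a, b in zip(self_part, neigh_part)])
    return out

# Toy graph: node 0 is linked to nodes 1 and 2 (edges stored in both directions)
x = [[1.0, 0.0], [0.0, 2.0], [4.0, 2.0]]
edge_index = [[0, 1, 0, 2], [1, 0, 2, 0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

out = sage_mean_layer(x, edge_index, W_self=identity, W_neigh=identity)
print(out[0])  # [3.0, 2.0]: own features [1, 0] plus neighbor mean [2, 2]
```

Stacking two such layers, as the model below does, lets each node see information from neighbors up to two hops away.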
Python
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
class FraudGNN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        # SAGEConv aggregates information from neighboring nodes
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)
        self.dropout = torch.nn.Dropout(p=0.5)
    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.conv2(x, edge_index)
        return x  # Raw logits
model = FraudGNN(in_channels=data.num_node_features, hidden_channels=32, out_channels=2)
Step 2.3: Cost-Sensitive Training Loop
To address the sparse dataset, we apply inverse class weighting to the Cross-Entropy Loss.
Python
# Calculate class weights for imbalanced data
num_neg = (data.y[data.train_mask] == 0).sum().item()
num_pos = (data.y[data.train_mask] == 1).sum().item()
weight_pos = num_neg / num_pos
class_weights = torch.tensor([1.0, weight_pos], dtype=torch.float)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
def train():
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()
def test():
    model.eval()
    with torch.no_grad():
        out = model(data.x, data.edge_index)
        pred = out.argmax(dim=1)
    y_true = data.y[data.test_mask].numpy()
    y_pred = pred[data.test_mask].numpy()
    # Focus on precision, recall, and f1 instead of accuracy.
    # zero_division=0 avoids a warning when no node is predicted as fraud.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='binary', zero_division=0
    )
    return precision, recall, f1
# Training execution
for epoch in range(1, 101):
    loss = train()
    if epoch % 20 == 0:
        p, r, f1 = test()
        print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Precision: {p:.4f}, Recall: {r:.4f}, F1: {f1:.4f}')
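Notice that heavily weighting the positive class and passing the Isolation Forest scores as features pushes the model to flag anomalies rather than just outputting “0” everywhere to minimize the unweighted error. The usual trade-off is higher recall at some cost in precision, since a weighted loss tolerates more false positives.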
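The loop above reports precision, recall, and F1 at a single decision threshold (the argmax). The AUPRC recommended in Section 1 is threshold-free and is computed from scores instead. A minimal sketch of average precision, a standard estimate of that area, is given below on invented scores; it assumes there are no tied scores, and the function name is hypothetical. In the pipeline, one would pass the softmax probability of class 1 for the test nodes, and scikit-learn's `average_precision_score` offers a ready-made implementation of the same metric.

```python
def average_precision(y_true, scores):
    """Average precision: mean of precision@k over the ranks k that hold a positive.

    Assumes no tied scores (ties would need to be grouped by threshold).
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(y_true)
    hits, ap = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / k      # precision at this rank
    return ap / total_pos

# Invented fraud scores for 8 transactions, 2 of which are fraud
y_true = [0, 0, 1, 0, 0, 0, 1, 0]
scores = [0.10, 0.30, 0.90, 0.20, 0.05, 0.60, 0.40, 0.15]

ap = average_precision(y_true, scores)
print(f"average precision = {ap:.3f}")  # 0.833
```

The two fraud cases rank first and third, giving (1/1 + 2/3) / 2. A random ranking would score roughly the fraud rate (0.25 here), so the baseline for this metric shrinks as fraud gets rarer, unlike the fixed 0.5 baseline of ROC AUC.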
3. Phase 2: Applying to the Elliptic Dataset
The Elliptic Dataset is a widely used benchmark for graph-based financial anomaly detection. Although it maps Bitcoin transactions, its graph structure, feature sparsity, and detection logic closely resemble those of credit card or wire transfer graphs.
- Nodes: 203,769 transactions.
- Edges: 234,355 payment flows.
- Features: 166 (local features such as amount and time, plus features aggregated from neighboring transactions).
- Labels: 0 (Licit), 1 (Illicit, about 2% of all nodes and roughly 10% of labeled nodes), 2 (Unknown).
Implementation with the Elliptic Dataset
Python
from torch_geometric.datasets import EllipticBitcoinDataset
# 1. Load the Dataset
dataset = EllipticBitcoinDataset(root='./data/Elliptic')
data = dataset[0]
# PyG maps: Licit=0, Illicit=1, Unknown=2
# We must filter out 'Unknown' nodes during training/testing
labeled_mask = (data.y == 0) | (data.y == 1)
train_mask = data.train_mask & labeled_mask
test_mask = data.test_mask & labeled_mask
# 2. Extract features and apply Isolation Forest
print("Running Isolation Forest on Elliptic Dataset...")
# We only fit the IF on the training data to prevent data leakage!
X_train_np = data.x[train_mask].numpy()
iso_forest_elliptic = IsolationForest(n_estimators=150, contamination=0.1, random_state=42)
iso_forest_elliptic.fit(X_train_np)
# Predict anomaly scores for the ENTIRE graph
X_all_np = data.x.numpy()
anomaly_scores_elliptic = -iso_forest_elliptic.decision_function(X_all_np).reshape(-1, 1)
# Augment Node Features
x_augmented_elliptic = torch.cat(
    [data.x, torch.tensor(anomaly_scores_elliptic, dtype=torch.float)],
    dim=1
)
data.x = x_augmented_elliptic
# 3. Model Definition and Class Weights
model_elliptic = FraudGNN(in_channels=data.num_node_features, hidden_channels=64, out_channels=2)
num_neg_ell = (data.y[train_mask] == 0).sum().item()
num_pos_ell = (data.y[train_mask] == 1).sum().item()
weight_pos_ell = num_neg_ell / num_pos_ell
weights_ell = torch.tensor([1.0, weight_pos_ell], dtype=torch.float)
optimizer_ell = torch.optim.Adam(model_elliptic.parameters(), lr=0.005, weight_decay=1e-5)
criterion_ell = torch.nn.CrossEntropyLoss(weight=weights_ell)
# 4. Training Loop for Elliptic
def train_elliptic():
    model_elliptic.train()
    optimizer_ell.zero_grad()
    out = model_elliptic(data.x, data.edge_index)
    loss = criterion_ell(out[train_mask], data.y[train_mask])
    loss.backward()
    optimizer_ell.step()
    return loss.item()
def test_elliptic():
    model_elliptic.eval()
    with torch.no_grad():
        out = model_elliptic(data.x, data.edge_index)
        pred = out.argmax(dim=1)
    y_true = data.y[test_mask].cpu().numpy()
    y_pred = pred[test_mask].cpu().numpy()
    print("\nClassification Report (Test Data):")
    print(classification_report(y_true, y_pred, target_names=["Licit", "Illicit"]))
for epoch in range(1, 101):
    loss = train_elliptic()
    if epoch % 50 == 0:
        print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')
# Evaluate once after training has finished
test_elliptic()
Summary of the Synergy
By combining an Isolation Forest with a Graph Neural Network, we address the two biggest hurdles of sparse fraud detection:
- The Feature Hurdle: Isolation Forest natively isolates the rare statistical anomalies in continuous transaction features without needing labels.
- The Relational Hurdle: Smurfing (splitting a large transfer into many small transactions) and cyclic fraud can look unremarkable feature by feature yet stand out in the network topology. The GNN captures this by combining the nodes’ structural relationships with the Isolation Forest’s prior suspicion (the anomaly score) to make a graph-aware prediction.