
Analysing Fraud Patterns: A Data-Driven Approach
Fraud detection in modern fintech requires sophisticated data analysis. This article explores how to use SQL, Python, and data visualization tools to identify fraud patterns, build detection models, and create actionable insights for fraud prevention teams.
The Data-Driven Fraud Detection Framework
Effective fraud detection requires a systematic approach to data analysis (a sketch of how these stages chain together follows the list):
- Data Collection: Gathering transaction, account, and behavioural data
- Feature Engineering: Creating meaningful indicators from raw data
- Pattern Recognition: Identifying suspicious patterns and anomalies
- Model Development: Building detection models
- Visualization: Creating dashboards and reports
- Action: Taking preventive or investigative action
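A minimal sketch of how these stages might chain together, using the create_fraud_features and detect_anomalies helpers defined later in this article; the pipeline function itself is illustrative, not a prescribed design:

import pandas as pd

def run_detection_pipeline(connection):
    """Illustrative end-to-end flow from raw data to flagged accounts."""
    # Data Collection: pull raw transactions from the warehouse
    raw = pd.read_sql("SELECT * FROM transactions", connection)
    # Feature Engineering, Pattern Recognition, and Model Development
    # are handled inside detect_anomalies (defined below)
    flagged = detect_anomalies(raw)
    # Visualization and Action happen downstream: dashboards for analysts
    # and investigation cases for the fraud prevention team
    return flagged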
SQL for Fraud Pattern Analysis
SQL is essential for querying transaction databases and identifying patterns. Here are key techniques:
Transaction Velocity Analysis
Identifying rapid transactions:
SELECT
    account_id,
    COUNT(*) AS transaction_count,
    SUM(amount) AS total_amount,
    MIN(transaction_time) AS first_transaction,
    MAX(transaction_time) AS last_transaction,
    TIMESTAMPDIFF(HOUR, MIN(transaction_time), MAX(transaction_time)) AS hours_span
FROM transactions
WHERE transaction_date = CURRENT_DATE
GROUP BY account_id
HAVING transaction_count > 20
   AND hours_span < 24
ORDER BY transaction_count DESC;
Use Cases:
- Detecting fan-out patterns (one account rapidly sending to many)
- Identifying money mule activity
- Finding account takeover indicators
Structuring Detection
Finding transactions kept just below reporting thresholds:
SELECT
    account_id,
    COUNT(*) AS transaction_count,
    AVG(amount) AS avg_amount,
    MAX(amount) AS max_amount
FROM transactions
WHERE amount BETWEEN 9000 AND 9999  -- Just below the $10k reporting threshold
GROUP BY account_id
HAVING transaction_count >= 5
ORDER BY transaction_count DESC;
Cross-Account Pattern Analysis
Identifying related accounts:
WITH account_relationships AS (
    SELECT
        t1.account_id AS account_a,
        t2.account_id AS account_b,
        COUNT(*) AS shared_transactions
    FROM transactions t1
    JOIN transactions t2
        ON t1.counterparty = t2.counterparty
        AND t1.account_id < t2.account_id
    WHERE t1.transaction_date = t2.transaction_date
    GROUP BY t1.account_id, t2.account_id
    HAVING shared_transactions >= 3
)
SELECT * FROM account_relationships
ORDER BY shared_transactions DESC;
Python for Advanced Analysis
Python enables sophisticated fraud analysis beyond SQL capabilities:
Feature Engineering
Creating fraud indicators:
import pandas as pd
import numpy as np

def create_fraud_features(df):
    """
    Create features for fraud detection.
    Expects columns: account_id, timestamp (datetime), amount, location.
    """
    features = df.sort_values(['account_id', 'timestamp']).copy()
    # Transaction velocity: rolling count of transactions in the past hour
    features['transactions_last_hour'] = (
        features.set_index('timestamp')
                .groupby('account_id')['amount']
                .rolling('1h')
                .count()
                .to_numpy()
    )
    # Amount patterns: z-score of each amount within its account
    features['amount_deviation'] = features.groupby('account_id')['amount'].transform(
        lambda x: (x - x.mean()) / x.std()
    )
    # Time-based features
    features['hour'] = features['timestamp'].dt.hour
    features['day_of_week'] = features['timestamp'].dt.dayofweek
    features['is_weekend'] = features['day_of_week'].isin([5, 6])
    # Geographic features: did the location change since the previous transaction?
    features['location_change'] = features.groupby('account_id')['location'].transform(
        lambda x: x != x.shift(1)
    )
    return features
Pattern Detection with Machine Learning
Anomaly detection:
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def detect_anomalies(transaction_data):
    """
    Use Isolation Forest for anomaly detection.
    """
    # Prepare features
    features = create_fraud_features(transaction_data)
    feature_columns = ['amount', 'transactions_last_hour',
                       'amount_deviation', 'is_weekend']
    X = features[feature_columns].fillna(0)
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # Train isolation forest; contamination is the expected share of anomalies
    iso_forest = IsolationForest(contamination=0.01, random_state=42)
    anomalies = iso_forest.fit_predict(X_scaled)
    # Add predictions (fit_predict returns -1 for anomalies, 1 otherwise)
    features['anomaly_score'] = iso_forest.score_samples(X_scaled)
    features['is_anomaly'] = anomalies == -1
    return features[features['is_anomaly']]
Network Analysis
Identifying money mule networks:
import networkx as nx

def analyse_transaction_network(transactions):
    """
    Build and analyse a directed transaction network.
    """
    # Create graph: one edge per sender/receiver pair
    G = nx.DiGraph()
    for _, row in transactions.iterrows():
        G.add_edge(row['from_account'], row['to_account'],
                   weight=row['amount'], time=row['timestamp'])
    # High in-degree (receiving from many accounts)
    high_in_degree = [acc for acc, deg in G.in_degree() if deg > 10]
    # High out-degree (sending to many accounts)
    high_out_degree = [acc for acc, deg in G.out_degree() if deg > 10]
    # Centrality: accounts sitting on many paths between other accounts
    betweenness = nx.betweenness_centrality(G)
    high_betweenness = [acc for acc, cent in betweenness.items()
                        if cent > 0.1]
    suspicious_accounts = set(high_in_degree + high_out_degree + high_betweenness)
    return suspicious_accounts, G
Data Visualization for Fraud Insights
Effective visualization helps fraud analysts quickly identify patterns:
Transaction Flow Visualization
Using network graphs (a plotting sketch follows this list):
- Show connections between accounts
- Highlight suspicious clusters
- Visualise fund flows
- Identify central nodes (potential money mules)
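A minimal plotting sketch with networkx and matplotlib, reusing the from_account/to_account/amount columns from the network analysis above; the styling choices are illustrative:

import networkx as nx
import matplotlib.pyplot as plt

def plot_transaction_network(transactions, flagged=None):
    """Draw the account graph, colouring flagged accounts red."""
    G = nx.DiGraph()
    for _, row in transactions.iterrows():
        G.add_edge(row['from_account'], row['to_account'], weight=row['amount'])
    flagged = set(flagged or [])
    node_colors = ['red' if n in flagged else 'steelblue' for n in G.nodes]
    # Force-directed layout pulls densely connected accounts into visible clusters
    pos = nx.spring_layout(G, seed=42)
    nx.draw_networkx(G, pos, node_color=node_colors, node_size=120,
                     with_labels=False, arrows=True, edge_color='grey')
    plt.title('Transaction flows (red = flagged accounts)')
    plt.axis('off')
    plt.show()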
Time-Series Analysis
Transaction patterns over time (see the sketch after this list):
- Identify unusual spikes
- Detect seasonal patterns
- Compare to historical baselines
- Highlight anomalies
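A minimal pandas sketch of spike detection against a rolling baseline, using the same transactions frame; the 3x threshold is an illustrative choice:

import matplotlib.pyplot as plt

def plot_hourly_volume(transactions, window_days=7):
    """Compare hourly transaction counts against a rolling baseline."""
    hourly = (transactions.set_index('timestamp')
                          .resample('1h')['amount']
                          .count())
    baseline = hourly.rolling(f'{window_days}D').mean()
    ax = hourly.plot(label='hourly volume')
    baseline.plot(ax=ax, label=f'{window_days}-day rolling baseline')
    # Flag hours more than 3x the baseline as candidate spikes
    spikes = hourly[hourly > 3 * baseline]
    ax.scatter(spikes.index, spikes.values, color='red', label='spikes')
    ax.legend()
    plt.show()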
Geographic Analysis
Location-based patterns (an impossible-travel sketch follows the list):
- Map transaction origins and destinations
- Identify impossible travel (same-day transactions in distant locations)
- Highlight geographic clusters
- Detect cross-border patterns
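One way to make the impossible-travel check concrete, sketched below: compute the great-circle (haversine) distance between consecutive transactions per account and divide by the elapsed time. The lat and lon column names are assumptions.

import numpy as np

def flag_impossible_travel(df, max_kmh=900):
    """Flag consecutive transactions implying travel faster than an airliner."""
    df = df.sort_values(['account_id', 'timestamp']).copy()
    grouped = df.groupby('account_id')
    lat1 = np.radians(grouped['lat'].shift())
    lon1 = np.radians(grouped['lon'].shift())
    lat2, lon2 = np.radians(df['lat']), np.radians(df['lon'])
    # Haversine great-circle distance in km
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 6371 * 2 * np.arcsin(np.sqrt(a))
    hours = grouped['timestamp'].diff().dt.total_seconds() / 3600
    # Zero elapsed time with non-zero distance yields inf, which is also flagged
    df['implied_speed_kmh'] = dist_km / hours
    return df[df['implied_speed_kmh'] > max_kmh]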
Dashboard Creation
Key metrics to track (a sketch for the first two follows the list):
- Fraud detection rate
- False positive rate
- Average investigation time
- Top fraud patterns
- Risk score distribution
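A small sketch of how the first two metrics could be computed from investigation outcomes; the alerts frame with boolean predicted_fraud and confirmed_fraud columns is an assumption:

def alert_metrics(alerts):
    """Compute fraud detection rate (recall) and false positive rate."""
    pred = alerts['predicted_fraud']
    actual = alerts['confirmed_fraud']
    tp = (pred & actual).sum()
    fp = (pred & ~actual).sum()
    fn = (~pred & actual).sum()
    tn = (~pred & ~actual).sum()
    return {
        'fraud_detection_rate': tp / (tp + fn) if (tp + fn) else 0.0,
        'false_positive_rate': fp / (fp + tn) if (fp + tn) else 0.0,
    }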
Real-World Application: Case Study
Scenario: Detecting Money Mule Network
Step 1: Data Collection
- Gather 30 days of transaction data
- Include account metadata
- Collect device and location data
Step 2: SQL Analysis
-- Find accounts receiving from many sources
SELECT
    to_account,
    COUNT(DISTINCT from_account) AS unique_senders,
    SUM(amount) AS total_received,
    AVG(amount) AS avg_amount
FROM transactions
WHERE transaction_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
GROUP BY to_account
HAVING unique_senders > 5
ORDER BY unique_senders DESC;
Step 3: Python Analysis
# Load data
transactions = pd.read_sql(query, connection)
# Network analysis
suspicious_accounts, graph = analyse_transaction_network(transactions)
# Feature engineering (detect_anomalies also builds these internally)
features = create_fraud_features(transactions)
# Anomaly detection on the raw transactions
anomalies = detect_anomalies(transactions)
# Combine results
high_risk_accounts = set(suspicious_accounts) | set(anomalies['account_id'])
Step 4: Visualization
- Create network graph showing connections
- Generate time-series of transaction volumes
- Map geographic distribution
- Build risk score dashboard
Step 5: Action
- Flag high-risk accounts for investigation
- Generate suspicious activity reports
- Implement additional monitoring
- Update detection rules
Best Practices
Data Quality
- Ensure data completeness
- Handle missing values appropriately (see the sketch after this list)
- Validate data accuracy
- Maintain data lineage
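As one concrete example for missing values, a minimal sketch of a completeness check; the required column names are assumptions:

def check_completeness(df, required=('account_id', 'amount', 'timestamp')):
    """Report the share of missing values and drop rows that cannot be analysed."""
    missing_share = df[list(required)].isna().mean()
    print('Share of missing values per required column:')
    print(missing_share)
    # A transaction without an account or timestamp cannot be joined or
    # sequenced, so drop it rather than imputing
    return df.dropna(subset=list(required))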
Performance Optimization
- Index database columns used in queries
- Use appropriate data types
- Optimise query performance
- Cache frequently used results
Model Maintenance
- Regularly retrain models
- Monitor model performance (see the sketch after this list)
- Update features as patterns evolve
- A/B test new approaches
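A minimal sketch of one way to monitor performance and trigger retraining, assuming a frame of alerted transactions with a timestamp and a boolean confirmed_fraud outcome label (both assumptions, not part of the earlier examples):

def needs_retraining(alert_outcomes, min_precision=0.5, recent_weeks=4):
    """Flag retraining when weekly alert precision drops below a floor."""
    weekly_precision = (alert_outcomes.set_index('timestamp')
                                      .resample('1W')['confirmed_fraud']
                                      .mean())  # share of alerts confirmed as fraud
    recent = weekly_precision.tail(recent_weeks)
    return bool((recent < min_precision).any())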
Documentation
- Document analysis methodologies
- Maintain query libraries
- Create reusable functions
- Share insights across team
Tools and Technologies
SQL Databases
- PostgreSQL
- MySQL
- SQL Server
- BigQuery
Python Libraries
- pandas: Data manipulation
- numpy: Numerical computing
- scikit-learn: Machine learning
- networkx: Network analysis
- matplotlib/seaborn: Visualization
Visualization Tools
- Splunk: Log analysis and visualization
- Wazuh: Security monitoring
- Tableau: Business intelligence
- Custom dashboards
Conclusion
Data-driven fraud detection requires combining SQL for efficient data querying, Python for advanced analysis, and visualization tools for insights. The key is to think systematically about fraud patterns and build detection capabilities that evolve with new threats.
The approach rests on thinking deeply about fraud patterns, working through the analysis systematically, and delivering detection that genuinely protects customers.