Mohd Uwaish
A Python package combining 14+ overlap-handling techniques with 40+ complexity measures, tackling the problem that's often more harmful than imbalance
Everyone talks about class imbalance. But there's a more insidious problem lurking in your data: class overlap.
Santos et al. argue that class overlap is a more significant impediment to classifier performance than imbalance alone. Yet most practitioners don't have tools to diagnose or address it.
During my research on overlap-handling techniques, I investigated how different methods affect global structural complexity. The findings led me to build FairSample—a package specifically designed for the class overlap problem.
Class overlap occurs when instances from different classes share similar feature values. Your classifier sees:
Instance A: [feature1=5.2, feature2=3.1, feature3=1.4] → Class 0
Instance B: [feature1=5.1, feature2=3.2, feature3=1.3] → Class 1
These look almost identical, but belong to different classes. This confuses classifiers and degrades performance—even when your classes are perfectly balanced.
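To make "almost identical" concrete, here is a minimal sketch that measures how close Instance A and Instance B are. The third point, `instance_c`, is a hypothetical Class 1 instance added only for comparison; it does not come from any real dataset.

```python
import numpy as np

# The two instances from the example above (feature1, feature2, feature3).
instance_a = np.array([5.2, 3.1, 1.4])  # labelled Class 0
instance_b = np.array([5.1, 3.2, 1.3])  # labelled Class 1

# Hypothetical, clearly separated Class 1 instance (illustration only).
instance_c = np.array([6.5, 2.8, 4.6])

# Euclidean distance between A and B in feature space.
distance_ab = np.linalg.norm(instance_a - instance_b)
print(f"Distance A-B: {distance_ab:.4f}")  # → 0.1732

# Which labelled point is closest to B: A (Class 0) or C (Class 1)?
candidates = {0: instance_a, 1: instance_c}
nearest_label = min(
    candidates, key=lambda label: np.linalg.norm(instance_b - candidates[label])
)
print(f"Nearest neighbour of B belongs to Class {nearest_label}")
```

In this sketch, B sits far closer to a Class 0 point than to a member of its own class, so any distance-based rule would be pulled toward the wrong label.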
Imbalance: You have 100 instances of Class A, 10 of Class B
Overlap: Classes share the same feature space
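Imbalance shows up in the class counts alone, so a single ratio describes it. Overlap depends on where instances sit relative to each other in feature space, which is why it requires more sophisticated analysis.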
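The distinction can be sketched with two small one-feature datasets. The helper names `imbalance_ratio` and `fraction_in_shared_range` are made up for this illustration and are not part of FairSample; the range-based overlap check is a deliberately crude stand-in for real complexity measures.

```python
import numpy as np

def imbalance_ratio(y):
    """Majority-class count divided by minority-class count."""
    counts = np.bincount(y)
    counts = counts[counts > 0]
    return counts.max() / counts.min()

def fraction_in_shared_range(x, y):
    """Share of instances whose feature value lies inside the interval
    covered by both classes (binary labels 0/1, single feature)."""
    x0, x1 = x[y == 0], x[y == 1]
    low = max(x0.min(), x1.min())
    high = min(x0.max(), x1.max())
    if low > high:  # the class ranges do not intersect
        return 0.0
    return float(np.mean((x >= low) & (x <= high)))

# Imbalanced but separable: 100 vs 10 instances, disjoint ranges.
x_imb = np.concatenate([np.linspace(0, 5, 100), np.linspace(10, 15, 10)])
y_imb = np.array([0] * 100 + [1] * 10)

# Balanced but overlapping: 50 vs 50 instances, ranges share [5, 10].
x_ovl = np.concatenate([np.linspace(0, 10, 50), np.linspace(5, 15, 50)])
y_ovl = np.array([0] * 50 + [1] * 50)

for name, x, y in [("imbalanced", x_imb, y_imb), ("overlapping", x_ovl, y_ovl)]:
    print(f"{name}: ratio={imbalance_ratio(y):.1f}, "
          f"in shared range={fraction_in_shared_range(x, y):.2f}")
```

The class counts say nothing about the second dataset's problem, and the range check says nothing about the first one's, so the two properties have to be measured separately.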
Before fixing anything, measure the problem:
from fairsample.complexity import ComplexityMeasures
cm = ComplexityMeasures(X, y)
# Get comprehensive overlap analysis
all_measures = cm.get_all_complexity_measures(measures='all')
# Focus on instance overlap
instance_overlap = cm.get_all_complexity_measures(
    measures=['N3', 'N4', 'kDN', 'CM']
)
# Structural complexity
structural = cm.get_all_complexity_measures(
    measures=['T1', 'LSC', 'DBC']
)
Why this matters: Different overlap patterns require different solutions.
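For intuition about what an instance-level measure such as N3 captures: in the data-complexity literature, N3 is usually described as the leave-one-out error rate of a 1-nearest-neighbour classifier. The sketch below implements that definition with scikit-learn on toy data. It is an independent illustration, not FairSample's code; the function name `loo_1nn_error` is made up, and FairSample may use a different distance metric or normalization.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def loo_1nn_error(X, y):
    """Fraction of instances whose nearest *other* instance has a
    different label. Assumes X contains no exact duplicate rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Ask for two neighbours: the closest hit is the query point itself.
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest_other = idx[:, 1]
    return float(np.mean(y[nearest_other] != y))

# Toy data: two tight groups, each containing one instance of the other class.
X_toy = np.array([[0.0], [0.1], [0.25], [1.0], [1.1], [1.25]])
y_toy = np.array([0, 0, 1, 1, 1, 0])

print(f"LOO 1-NN error: {loo_1nn_error(X_toy, y_toy):.2f}")  # → 0.33
```

Values near 0 mean almost every instance is surrounded by its own class; higher values mean more instances sit next to the opposite class.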
FairSample implements algorithms specifically designed for overlap:
All from peer-reviewed research (2014-2024).
Evaluate how techniques affect your overlap:
from fairsample.utils import compare_techniques
# Compare multiple overlap-handling techniques
results = compare_techniques(
    X, y,
    techniques=['RFCL', 'EHSO', 'NBUS'],
    complexity_measures='basic'
)
# See impact on overlap metrics
print(results[['technique', 'N3', 'T1', 'training_time']])
Verify that overlap was actually reduced:
from fairsample import EHSO
from fairsample.complexity import compare_pre_post_overlap
# Apply overlap-handling technique
sampler = EHSO(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X, y)
# Measure structural changes
comparison = compare_pre_post_overlap(X, y, X_resampled, y_resampled)
print("Overlap Reduction:")
print(comparison['improvements'])
My research investigated how overlap-handling techniques affect data structure. Key insights:
Insight 1: Reducing overlap doesn't always improve classification
Insight 2: Different techniques, different structural effects
Insight 3: Context matters
These insights shaped FairSample's design—diagnostic tools before treatment.
from fairsample import RFCL
from fairsample.complexity import ComplexityMeasures
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Dataset with overlapping symptoms: different diseases, similar presentations.
# Assumes X_train, X_test, y_train, y_test already hold a train/test split.
# Step 1: Quantify overlap
cm = ComplexityMeasures(X_train, y_train)
print(f"Instance Overlap (N3): {cm.analyze_overlap()['N3']:.4f}")
# Step 2: Handle overlap
sampler = RFCL(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
# Step 3: Train classifier
clf = RandomForestClassifier()
clf.fit(X_resampled, y_resampled)
# Step 4: Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
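The example above uses `X_train`, `X_test`, `y_train`, and `y_test` without showing where they come from. One way to produce them is sketched below, using synthetic stand-in data from scikit-learn rather than a real medical dataset; the parameter values are arbitrary choices for illustration. Splitting before resampling keeps the test set untouched, so the evaluation reflects the original class distribution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a low class_sep pushes the two classes together,
# and weights makes the second class the minority.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=0,
    weights=[0.9, 0.1],
    class_sep=0.5,
    random_state=42,
)

# Stratify so both splits keep roughly the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```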
FairSample provides comprehensive overlap quantification:
Feature Overlap:
Instance Overlap:
Structural Complexity:
Multiresolution:
Each reveals different aspects of your overlap problem.
pip install fairsample
Complete workflow:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from fairsample import EHSO
from fairsample.complexity import ComplexityMeasures, compare_pre_post_overlap
# Load data with class overlap
df = pd.read_csv('overlapping_data.csv')
X = df.drop('target', axis=1)
y = df['target']
# Diagnose overlap
cm = ComplexityMeasures(X, y)
print("Overlap Analysis:")
print(f" Instance Overlap (N3): {cm.analyze_overlap()['N3']:.4f}")
# Handle overlap
sampler = EHSO(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X, y)
# Validate reduction
comparison = compare_pre_post_overlap(X, y, X_resampled, y_resampled)
print("\nOverlap Reduction:")
print(comparison['improvements'])
# Use in classification
clf = RandomForestClassifier()
clf.fit(X_resampled, y_resampled)
Class overlap is particularly problematic in:
| Domain | Why Overlap Occurs |
|---|---|
| Medical Diagnosis | Overlapping symptoms between diseases |
| Fraud Detection | Fraudsters mimic legitimate behavior |
| Software Defect Prediction | Similar code metrics for faulty/non-faulty modules |
| Network Intrusion | Attacks disguised as normal traffic |
| Image Classification | Visually similar objects in different categories |
In these domains, addressing overlap is crucial for performance.
# Scenario 1: Only Imbalance (No Overlap)
# Class 0: 1000 instances, features [0-5]
# Class 1: 100 instances, features [10-15]
# Solution: Simple resampling works well ✓
# Scenario 2: Only Overlap (Balanced)
# Class 0: 500 instances, features [0-10]
# Class 1: 500 instances, features [5-15]
# Solution: Need overlap-handling techniques ⚠️
# Scenario 3: Both Imbalance + Overlap
# Class 0: 1000 instances, features [0-10]
# Class 1: 100 instances, features [5-15]
# Solution: FairSample's specialized techniques ✓✓
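The three scenarios above can be simulated with a single uniformly distributed feature per class. The sketch below is an illustration using NumPy and scikit-learn only; the shallow decision tree, the 5-fold cross-validation, and the choice of Class 1 recall as the metric are assumptions made for this example, not results from the FairSample research.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_scenario(n0, range0, n1, range1):
    """One uniform feature per class, drawn from the given ranges."""
    x = np.concatenate([rng.uniform(*range0, n0), rng.uniform(*range1, n1)])
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])
    return x.reshape(-1, 1), y

scenarios = {
    "imbalance only": make_scenario(1000, (0, 5), 100, (10, 15)),
    "overlap only": make_scenario(500, (0, 10), 500, (5, 15)),
    "imbalance + overlap": make_scenario(1000, (0, 10), 100, (5, 15)),
}

class1_recall = {}
for name, (X, y) in scenarios.items():
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    # Out-of-fold predictions, so every instance is scored by a model
    # that did not see it during training.
    y_pred = cross_val_predict(clf, X, y, cv=5)
    class1_recall[name] = recall_score(y, y_pred, pos_label=1)
    print(f"{name:>20}: Class 1 recall = {class1_recall[name]:.2f}")
```

With disjoint ranges, the classifier recovers every Class 1 instance despite the 10:1 imbalance. Once the ranges overlap, some Class 1 instances are missed whether or not the classes are balanced.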
FairSample implements techniques from:
Full citations: CITATIONS.md
Class overlap is often more harmful than class imbalance. Yet most tools focus solely on balancing class ratios.
FairSample provides:
All specifically designed for the overlap problem.
Have overlap problems in your data? Try FairSample and share your results!
What domains have you encountered severe class overlap? Drop a comment below 👇