FairSample: Because Class Overlap Is Harder Than Class Imbalance

#machinelearning #python #datascience #opensource
Mohd Uwaish

A Python package that combines 14+ overlap-handling techniques with 40+ complexity measures to tackle a problem that is often more harmful than imbalance

The Overlooked Problem in Classification

Everyone talks about class imbalance. But there's a more insidious problem lurking in your data: class overlap.

Santos et al. argue that class overlap is a more significant impediment to classifier performance than imbalance alone. Yet most practitioners don't have tools to diagnose or address it.

During my research on overlap-handling techniques, I investigated how different methods affect global structural complexity. The findings led me to build FairSample—a package specifically designed for the class overlap problem.

What Is Class Overlap?

Class overlap occurs when instances from different classes share similar feature values. Your classifier sees:

Instance A: [feature1=5.2, feature2=3.1, feature3=1.4] → Class 0
Instance B: [feature1=5.1, feature2=3.2, feature3=1.3] → Class 1

These look almost identical, but belong to different classes. This confuses classifiers and degrades performance—even when your classes are perfectly balanced.
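To make "almost identical" concrete, the short sketch below measures how far apart the two example instances are. It uses only NumPy, and the variable names are illustrative rather than part of FairSample.

```python
import numpy as np

# The two instances from the example above
instance_a = np.array([5.2, 3.1, 1.4])  # labelled Class 0
instance_b = np.array([5.1, 3.2, 1.3])  # labelled Class 1

# Euclidean distance between the two feature vectors
distance = float(np.linalg.norm(instance_a - instance_b))
print(f"Distance between A and B: {distance:.3f}")

# Largest difference on any single feature
max_gap = float(np.max(np.abs(instance_a - instance_b)))
print(f"Largest per-feature gap: {max_gap:.1f}")
```

Whether a gap of this size counts as "close" depends on the scale and noise of each feature, so in practice it helps to compare it against typical within-class distances in your own data.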

Why Overlap Is Harder Than Imbalance

Imbalance: You have 100 instances of Class A and 10 of Class B

  • Solution: Resample to balance the class ratio
  • Outcome: The classifier sees both classes equally often

Overlap: Classes occupy the same region of feature space

  • Solution: Not straightforward, because any fix changes the structure of the data
  • Outcome: Depends on how you handle it

This is why overlap requires more sophisticated analysis.

FairSample's Approach

1. Quantify Overlap First

Before fixing anything, measure the problem:

from fairsample.complexity import ComplexityMeasures

# X is the feature matrix and y the class labels, loaded beforehand
cm = ComplexityMeasures(X, y)

# Get comprehensive overlap analysis
all_measures = cm.get_all_complexity_measures(measures='all')

# Focus on instance overlap
instance_overlap = cm.get_all_complexity_measures(
    measures=['N3', 'N4', 'kDN', 'CM']
)

# Structural complexity
structural = cm.get_all_complexity_measures(
    measures=['T1', 'LSC', 'DBC']
)

Why this matters: Different overlap patterns require different solutions.
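For intuition about what an instance-overlap measure such as N3 captures, here is a minimal standalone sketch. In the complexity-measure literature, N3 is commonly described as the leave-one-out error rate of a 1-nearest-neighbour classifier. The code below follows that description using scikit-learn. It is not FairSample's implementation, and the helper names `n3_leave_one_out` and `two_gaussians` are made up for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def n3_leave_one_out(X, y):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier.

    Illustrative only: assumes X contains no exact duplicate rows.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Query 2 neighbours because each point's closest neighbour is itself
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    nearest_other = idx[:, 1]
    return float(np.mean(y[nearest_other] != y))


def two_gaussians(shift, n_per_class=200, seed=0):
    """Two 2-D Gaussian classes whose means differ by `shift` on each axis."""
    rng = np.random.default_rng(seed)
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(n_per_class, 2)),
        rng.normal(shift, 1.0, size=(n_per_class, 2)),
    ])
    y = np.repeat([0, 1], n_per_class)
    return X, y


n3_separated = n3_leave_one_out(*two_gaussians(shift=6.0))
n3_overlapping = n3_leave_one_out(*two_gaussians(shift=1.0))

print(f"N3, well-separated classes: {n3_separated:.3f}")
print(f"N3, overlapping classes:    {n3_overlapping:.3f}")
```

Both datasets are perfectly balanced, so any difference in the score comes from overlap alone.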

2. 14+ Overlap-Handling Techniques

FairSample implements algorithms specifically designed for overlap:

  • EHSO - Evolutionary Hybrid Sampling in Overlap
  • RFCL - Repetitive Forward Class Learning
  • NBUS - Neighbourhood-Based Undersampling
  • URNS - Undersampling by Removing Noisy Samples
  • SVDDWSMOTE - Support Vector Data Description-based oversampling
  • OSM - Overlap-based Sampling Method
  • And more...

All are drawn from peer-reviewed research published between 2014 and 2024.

3. Multi-Dimensional Evaluation

Evaluate how techniques affect your overlap:

from fairsample.utils import compare_techniques

# Compare multiple overlap-handling techniques
results = compare_techniques(
    X, y,
    techniques=['RFCL', 'EHSO', 'NBUS'],
    complexity_measures='basic'
)

# See impact on overlap metrics
print(results[['technique', 'N3', 'T1', 'training_time']])

4. Before/After Validation

Verify that overlap actually decreased:

from fairsample import EHSO
from fairsample.complexity import compare_pre_post_overlap

# Apply overlap-handling technique
sampler = EHSO(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X, y)

# Measure structural changes
comparison = compare_pre_post_overlap(X, y, X_resampled, y_resampled)
print("Overlap Reduction:")
print(comparison['improvements'])
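The same before/after idea can be sketched without FairSample, which may help when checking what a comparison like this is measuring. The example below swaps in a plain Edited-Nearest-Neighbours-style cleaning step (drop every instance whose neighbours mostly belong to another class) as a stand-in for a real overlap-handling technique, and scores overlap with a leave-one-out 1-NN error rate. Both helpers are hypothetical and written for this illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def loo_1nn_error(X, y):
    """Leave-one-out 1-NN error rate, used here as an N3-style overlap score."""
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    return float(np.mean(y[idx[:, 1]] != y))


def enn_style_cleaning(X, y, k=3):
    """Keep only instances whose k nearest neighbours mostly share their class."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbour_labels = y[idx[:, 1:]]  # column 0 is the point itself
    agreement = (neighbour_labels == y[:, None]).mean(axis=1)
    keep = agreement >= 0.5
    return X[keep], y[keep]


# Two balanced but overlapping Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 2)),
    rng.normal(1.0, 1.0, size=(300, 2)),
])
y = np.repeat([0, 1], 300)

X_clean, y_clean = enn_style_cleaning(X, y)

before = loo_1nn_error(X, y)
after = loo_1nn_error(X_clean, y_clean)
print(f"Instances kept: {len(y_clean)} of {len(y)}")
print(f"Overlap score before: {before:.3f}, after: {after:.3f}")
```

A lower score after cleaning only shows that the remaining instances are easier to separate. It says nothing yet about performance on unseen data, which is why a separate classification check is still needed.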

Research Insights Applied

My research investigated how overlap-handling techniques affect data structure. Key insights:

Insight 1: Reducing overlap doesn't always improve classification

  • Some techniques reduce overlap but fragment class structure
  • Always validate with classification metrics

Insight 2: Different techniques, different structural effects

  • Some improve instance overlap but worsen structural complexity
  • Others balance both
  • Trade-offs vary by dataset

Insight 3: Context matters

  • No universal solution exists
  • Measure your specific overlap profile first

These insights shaped FairSample's design—diagnostic tools before treatment.
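One way to act on Insight 1 is to resample the training split only and score the classifier on an untouched test split. The sketch below does this with scikit-learn. The cleaning function is a simple neighbour-agreement filter used as a placeholder for any overlap-handling technique, not a FairSample algorithm, and the synthetic data is assumed purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def drop_disagreeing(X, y, k=3):
    """Placeholder technique: drop points whose k neighbours mostly disagree."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    agreement = (y[idx[:, 1:]] == y[:, None]).mean(axis=1)
    keep = agreement >= 0.5
    return X[keep], y[keep]


# Two balanced, partially overlapping Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 2)),
    rng.normal(1.5, 1.0, size=(300, 2)),
])
y = np.repeat([0, 1], 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Resample the training split only; the test split stays untouched
X_clean, y_clean = drop_disagreeing(X_train, y_train)

scores = {}
for name, (X_fit, y_fit) in {
    "original": (X_train, y_train),
    "cleaned": (X_clean, y_clean),
}.items():
    clf = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)
    scores[name] = balanced_accuracy_score(y_test, clf.predict(X_test))
    print(f"{name:>8}: balanced accuracy = {scores[name]:.3f}")
```

Whichever variant scores higher here, the point is the protocol: overlap metrics are computed on the training data, while the final decision rests on held-out classification metrics.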

Practical Example: Medical Diagnosis

from fairsample import RFCL
from fairsample.complexity import ComplexityMeasures
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Dataset with overlapping symptoms: different diseases, similar presentations.
# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split.

# Step 1: Quantify overlap on the training data
cm = ComplexityMeasures(X_train, y_train)
print(f"Instance Overlap (N3): {cm.analyze_overlap()['N3']:.4f}")

# Step 2: Handle overlap (training data only)
sampler = RFCL(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

# Step 3: Train classifier on the resampled data
clf = RandomForestClassifier()
clf.fit(X_resampled, y_resampled)

# Step 4: Evaluate on the untouched test set
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

40+ Complexity Measures

FairSample provides comprehensive overlap quantification:

Feature Overlap:

  • F1, F1v, F2, F3, F4 - How much features discriminate between classes

Instance Overlap:

  • N3, N4, kDN, CM, R-value - How much instances overlap in feature space

Structural Complexity:

  • T1, LSC, DBC - How complex the decision boundary needs to be

Multiresolution:

  • Purity, MRCA, C1, C2 - Multi-scale overlap analysis

Each reveals different aspects of your overlap problem.
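A small experiment can show why feature-level and instance-level measures may disagree. The sketch below builds an XOR-style dataset and computes two simplified scores: the largest per-feature Fisher discriminant ratio (the quantity behind F1-type measures) and the mean fraction of disagreeing neighbours (the idea behind kDN). These are textbook-style approximations written for this example. FairSample's own measures may be defined or normalised differently.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def max_fisher_ratio(X, y):
    """Largest per-feature Fisher discriminant ratio for a two-class problem."""
    a, b = X[y == 0], X[y == 1]
    ratio = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0))
    return float(ratio.max())


def mean_kdn(X, y, k=5):
    """Average share of each point's k nearest neighbours from another class."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    disagree = y[idx[:, 1:]] != y[:, None]  # column 0 is the point itself
    return float(disagree.mean())


# XOR layout: class 0 sits at (0, 0) and (4, 4), class 1 at (0, 4) and (4, 0)
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [4, 4], [0, 4], [4, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in centres])
y = np.repeat(labels, 50)

fisher = max_fisher_ratio(X, y)
kdn = mean_kdn(X, y)
print(f"Max Fisher ratio: {fisher:.4f}  (near 0: no single feature separates the classes)")
print(f"Mean kDN:         {kdn:.4f}  (near 0: neighbourhoods are class-pure)")
```

Judged one feature at a time, the classes look completely overlapped, yet locally they barely mix. This is the kind of disagreement that makes it worth reading several measure families side by side.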

Installation & Quick Start

pip install fairsample

Complete workflow:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

from fairsample import EHSO
from fairsample.complexity import ComplexityMeasures, compare_pre_post_overlap

# Load data with class overlap
df = pd.read_csv('overlapping_data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Diagnose overlap
cm = ComplexityMeasures(X, y)
print("Overlap Analysis:")
print(f"  Instance Overlap (N3): {cm.analyze_overlap()['N3']:.4f}")

# Handle overlap
sampler = EHSO(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X, y)

# Validate reduction
comparison = compare_pre_post_overlap(X, y, X_resampled, y_resampled)
print("\nOverlap Reduction:")
print(comparison['improvements'])

# Use in classification
clf = RandomForestClassifier()
clf.fit(X_resampled, y_resampled)

When Overlap Matters Most

Class overlap is particularly problematic in:

| Domain | Why Overlap Occurs |
| --- | --- |
| Medical Diagnosis | Overlapping symptoms between diseases |
| Fraud Detection | Fraudsters mimic legitimate behavior |
| Software Defect Prediction | Similar code metrics for faulty/non-faulty modules |
| Network Intrusion | Attacks disguised as normal traffic |
| Image Classification | Visually similar objects in different categories |

In these domains, addressing overlap is crucial for performance.

Overlap vs. Imbalance: A Comparison

# Scenario 1: Only Imbalance (No Overlap)
# Class 0: 1000 instances, features [0-5]
# Class 1: 100 instances, features [10-15]
# Solution: Simple resampling works well ✓

# Scenario 2: Only Overlap (Balanced)
# Class 0: 500 instances, features [0-10]
# Class 1: 500 instances, features [5-15]
# Solution: Need overlap-handling techniques ⚠️

# Scenario 3: Both Imbalance + Overlap
# Class 0: 1000 instances, features [0-10]
# Class 1: 100 instances, features [5-15]
# Solution: FairSample's specialized techniques ✓✓
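The three scenarios can be simulated directly. The sketch below draws a single feature uniformly from the ranges given in the comments above, fits a shallow decision tree with 5-fold cross-validation, and reports balanced accuracy and Class 1 recall. The uniform distribution, the tree depth, and the `simulate` helper are assumptions made for this illustration. No resampling or FairSample technique is applied, so the numbers show the raw difficulty of each scenario.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def simulate(n0, range0, n1, range1, seed=0):
    """One uniform feature per class; returns (balanced accuracy, class-1 recall)."""
    rng = np.random.default_rng(seed)
    X = np.concatenate([
        rng.uniform(*range0, size=n0),
        rng.uniform(*range1, size=n1),
    ]).reshape(-1, 1)
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])
    clf = DecisionTreeClassifier(max_depth=2, random_state=0)
    pred = cross_val_predict(clf, X, y, cv=5)
    return balanced_accuracy_score(y, pred), recall_score(y, pred, pos_label=1)


scenarios = {
    "1: imbalance only": simulate(1000, (0, 5), 100, (10, 15)),
    "2: overlap only": simulate(500, (0, 10), 500, (5, 15)),
    "3: imbalance + overlap": simulate(1000, (0, 10), 100, (5, 15)),
}

for name, (bal_acc, minority_recall) in scenarios.items():
    print(f"Scenario {name:<24} balanced acc = {bal_acc:.2f}, "
          f"class 1 recall = {minority_recall:.2f}")
```

In this toy setup the imbalanced but separable case is trivially easy, while both overlapping cases lose accuracy in the shared region. When imbalance is added on top of overlap, the minority class absorbs most of that loss.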

The Research Foundation

FairSample implements techniques from:

  • Vuttipittayamongkol & Elyan (2020) - EHSO, NBUS - Information Sciences
  • Das et al. (2014) - RFCL - IEEE TKDE
  • Santos et al. (2023) - Overlap analysis framework - Artificial Intelligence Review
  • Lorena et al. (2019) - Complexity measures - ACM Computing Surveys

Full citations: CITATIONS.md

The Bottom Line

Class overlap is often more harmful than class imbalance. Yet most tools focus solely on balancing class ratios.

FairSample provides:

  • Diagnostic tools - Quantify overlap with 40+ measures
  • Treatment options - 14+ research-backed techniques
  • Validation methods - Verify overlap reduction

All specifically designed for the overlap problem.


Have overlap problems in your data? Try FairSample and share your results!

In what domains have you encountered severe class overlap? Drop a comment below 👇

#python #machinelearning #datascience #opensource #classoverlap