How I Built a Late Delivery Risk Predictor for APL Logistics: What a 95% Delay Rate in First Class Shipping Taught Me About Supply Chain ML

#machinelearning #datascience #python #supplychain

Sugnik Mondal

Late deliveries are not just an inconvenience. For a global logistics operator like APL Logistics (KWE Group), a single delayed shipment can trigger SLA breaches, financial penalties, and long-term customer churn. Multiply that across hundreds of thousands of orders spanning five global markets, and the cost of reactive delay management becomes unsustainable.

The conventional approach has always been to handle delays after they happen — emergency rerouting, last-minute escalations, and reactive customer communication. This project takes a different approach entirely. Instead of reacting, it predicts.

This article walks through the end-to-end machine learning pipeline built to predict late delivery risk for APL Logistics — from raw data to a deployed Streamlit dashboard used by supply chain operations teams.

The Dataset

The project uses the DataCo Smart Supply Chain dataset — a comprehensive real-world transactional dataset from APL Logistics' global operations.

Raw dataset: 180,519 rows × 40 columns
After cleaning: 180,517 rows × 28 columns
Target variable: Late_delivery_risk (1 = Late, 0 = Not Late)

Target distribution:

Class	Count	Percentage
Not Late (0)	81,541	45.17%
Late (1)	98,976	54.83%

The near-balanced target distribution was an important early finding. It meant SMOTE was not required. Class weighting was sufficient: class_weight='balanced' in the scikit-learn models and scale_pos_weight in XGBoost.
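XGBoost's classifier has no class_weight argument; scale_pos_weight is its counterpart, and the usual choice is the number of negative examples divided by the number of positive examples. The model definitions later in this article pass a variable called ratio, so the sketch below assumes that is how it was computed. The helper name and the toy labels are hypothetical.

```python
from collections import Counter


def scale_pos_weight(labels):
    """Return n_negative / n_positive for binary 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive examples in labels")
    return counts[0] / counts[1]


# Toy labels for illustration only (not the project data): 9 negatives, 11 positives
toy_labels = [0] * 9 + [1] * 11
ratio = scale_pos_weight(toy_labels)
print(round(ratio, 3))
```

With a target that is close to balanced, this ratio stays close to 1, so the weighting has only a mild effect.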

Data Cleaning — The Leakage Problem

The most critical cleaning decisions were around data leakage — columns that would not be available at the time of prediction (before dispatch) but that reveal the outcome after the fact.

Leakage columns dropped:

Delivery Status: Cramér's V of 1.00 with the target, a perfect association. This column contains values like "Late delivery" and "Shipping on time", which are literally the answer. Using it would give near-perfect accuracy in training and a model that cannot be used in production, because the column does not exist before delivery.

Order Status: Values like COMPLETE, CLOSED, CANCELED are assigned after the order is fulfilled. At prediction time (before dispatch), this information does not exist.

The simple test for leakage: "At the moment the prediction is needed, would this information be available?" If not — drop it.
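As a companion to that question, a Cramér's V check can flag categorical columns that are suspiciously close to the target. The following is a minimal sketch using the standard formula without bias correction; the function name and the toy rows are illustrative, and the column names are assumed to match the dataset's.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V between two categorical series (no bias correction)."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence: outer product of the margins / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1  # assumes both variables have at least two levels
    return float(np.sqrt(chi2 / (n * k)))


# Toy rows, illustrative values only: the status fully determines the target
toy = pd.DataFrame({
    "Delivery Status": ["Late delivery", "Shipping on time", "Late delivery",
                        "Advance shipping", "Late delivery", "Shipping on time"],
    "Late_delivery_risk": [1, 0, 1, 0, 1, 0],
})
print(cramers_v(toy["Delivery Status"], toy["Late_delivery_risk"]))
```

A value at or near 1 means the column almost fully determines the target, which is a strong hint to check when that column is populated.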

Other columns dropped:

PII columns: Customer Fname, Customer Lname, Customer Street, Customer Zipcode
ID columns: Category Id, Department Id, Customer Id, Order Customer Id
Redundant location: Latitude, Longitude (redundant with Market and Order Region)

Missing values: Only Customer Lname (8 rows) and Customer Zipcode (3 rows) had nulls. Both columns were dropped entirely, since they were already marked for removal as PII.

Duplicates: 2 duplicate rows removed.

Final cleaned dataset: 180,517 rows × 28 columns

Exploratory Data Analysis — The Most Surprising Finding

Before building any model, the data was explored thoroughly. The most counterintuitive finding came from shipping mode analysis.

Late delivery rate by shipping mode:

Shipping Mode	Late Delivery Rate
First Class	95.3%
Second Class	76.6%
Same Day	45.7%
Standard Class	38.1%

First Class shipping — which customers expect to be faster and more reliable — has a 95.3% late delivery rate. This is not a rounding error. Nearly every First Class order in the dataset arrived late. This suggests that First Class commitments are systematically over-promised relative to operational capacity.
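Rates like these come from a simple group-by on the binary target. The sketch below shows the idea on toy rows; the helper name is hypothetical and the column names "Shipping Mode" and "Late_delivery_risk" are assumed to match the cleaned data.

```python
import pandas as pd


def late_rate_by(df, column, target="Late_delivery_risk"):
    """Share of late orders (mean of the 0/1 target) per category, highest first."""
    return df.groupby(column)[target].mean().sort_values(ascending=False)


# Toy rows, illustrative values only
toy = pd.DataFrame({
    "Shipping Mode": ["First Class", "First Class", "Standard Class",
                      "Standard Class", "Standard Class", "Same Day"],
    "Late_delivery_risk": [1, 1, 0, 1, 0, 0],
})
rates = late_rate_by(toy, "Shipping Mode")
print(rates)
```

The same helper can be pointed at Market or Order Region to reproduce the other breakdowns in this section.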

Late delivery rate by market:

All five global markets showed rates between 54.4% and 55.2% — an extremely narrow band. This finding is operationally significant: the delay problem is not geographically concentrated. It is systemic across all markets, meaning market-level interventions alone will not solve it.

Shipping delay gap:

The gap between actual shipping days and scheduled shipping days averaged +0.57 days across all orders. 103,399 orders (57.3%) shipped later than scheduled. Most delays were by exactly one day — suggesting a consistent operational mismatch between scheduling and execution.

Correlation heatmap revealed multicollinearity:

Benefit per order and Order Profit Per Order → 1.00 correlation
Order Item Product Price and Product Price → 1.00 correlation
Sales per customer, Order Item Total, Sales → 0.99 correlation

These redundant columns were dropped in preprocessing to prevent multicollinearity — particularly harmful for Logistic Regression.
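One way to surface such pairs programmatically is to scan the upper triangle of the correlation matrix. This is a sketch under assumptions: the 0.95 threshold and the function name are illustrative choices, not values taken from the project.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.95):
    """List (col_a, col_b, r) for numeric column pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes("number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips self-pairs
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Toy data, illustrative only: the second column duplicates the first
rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=50)
toy = pd.DataFrame({
    "Order Item Product Price": price,
    "Product Price": price,
    "Order Item Quantity": rng.integers(1, 6, size=50),
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Which column of each pair to keep is a judgment call; the scan only identifies candidates.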

Feature Engineering — Where the Real Signal Was Created

Six new features were engineered from existing columns. These turned out to be some of the most important features in the final model.

1. Shipping Delay Gap

shipping_delay_gap = Days_for_shipping_real - Days_for_shipment_scheduled

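Measures how many days actual shipping exceeded the scheduled commitment. This single feature ended up with an importance score of 0.7938, about 79% of the model's total feature importance. Because it uses actual shipping days, it is only available once that value has been recorded in the order management system.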

2. Shipping Pressure Index

shipping_pressure_index = Days_for_shipment_scheduled / (Order_Item_Quantity + 1)

Captures the relationship between delivery commitment and order complexity.

3. Is Express Flag

is_express = 1 if Shipping_Mode in ['First Class', 'Same Day'] else 0

Binary flag for the two expedited shipping modes. EDA showed First Class to be high-risk (95.3% late), while Same Day (45.7%) was not.

4. High Discount Flag

high_discount_flag = 1 if Order_Item_Discount_Rate > 0.06 else 0

Flags orders with above-median discount rates.

5. Order Complexity Score

order_complexity_score = Order_Item_Quantity * Order_Item_Product_Price

Measures the financial complexity of the order.

6. Regional Congestion Score

region_congestion_score = average_late_delivery_rate_per_region

Encodes the historically observed delay rate per region as a continuous risk signal — ranging from 0.488 (Canada) to 0.580 (Central Africa).
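The six definitions above can be sketched in pandas as follows. Column names follow the underscore spelling used in the formulas, and the function names are hypothetical. One assumption goes beyond the text: the per-region late rate is computed from training rows only and then mapped onto any other rows, which keeps test labels out of the feature. The toy rows are illustrative.

```python
import pandas as pd


def region_late_rates(train_df):
    """Mean of the 0/1 target per region, computed from training rows only."""
    return train_df.groupby("Order_Region")["Late_delivery_risk"].mean()


def add_engineered_features(df, region_rates, fallback_rate):
    """Return a copy of df with the six engineered columns added."""
    out = df.copy()
    out["shipping_delay_gap"] = (
        out["Days_for_shipping_real"] - out["Days_for_shipment_scheduled"]
    )
    out["shipping_pressure_index"] = (
        out["Days_for_shipment_scheduled"] / (out["Order_Item_Quantity"] + 1)
    )
    out["is_express"] = (
        out["Shipping_Mode"].isin(["First Class", "Same Day"]).astype(int)
    )
    out["high_discount_flag"] = (out["Order_Item_Discount_Rate"] > 0.06).astype(int)
    out["order_complexity_score"] = (
        out["Order_Item_Quantity"] * out["Order_Item_Product_Price"]
    )
    # Regions unseen in the training rows fall back to the overall training rate
    out["region_congestion_score"] = (
        out["Order_Region"].map(region_rates).fillna(fallback_rate)
    )
    return out


# Toy training rows, illustrative values only
train = pd.DataFrame({
    "Days_for_shipping_real": [3, 5, 2, 1],
    "Days_for_shipment_scheduled": [1, 4, 4, 0],
    "Order_Item_Quantity": [1, 3, 1, 2],
    "Order_Item_Discount_Rate": [0.05, 0.10, 0.00, 0.25],
    "Order_Item_Product_Price": [50.0, 20.0, 100.0, 10.0],
    "Shipping_Mode": ["First Class", "Standard Class", "Standard Class", "Same Day"],
    "Order_Region": ["Canada", "Canada", "Central Africa", "Central Africa"],
    "Late_delivery_risk": [1, 1, 0, 1],
})
rates = region_late_rates(train)
features = add_engineered_features(train, rates, train["Late_delivery_risk"].mean())
print(features[["shipping_delay_gap", "is_express", "region_congestion_score"]])
```

The first five features are row-wise arithmetic and need no fitting. The regional score is the exception, since it is derived from the target.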

The Anti-Leakage Preprocessing Pipeline

Preventing data leakage was not just about dropping columns. The entire preprocessing pipeline was structured to ensure no information from the test set contaminated the training process.

Load cleaned_data.csv
→ Feature Engineering (pure arithmetic — no fitting required)
→ Separate X and y
→ Train/Test Split (80/20, stratified) ← split happens HERE
→ Fit StandardScaler on X_train only
→ Transform X_train and X_test separately
→ Fit LabelEncoders on X_train only
→ Transform X_train and X_test separately
→ Save scaler.pkl and encoders.pkl
→ Train models on X_train only
→ Evaluate on X_test only

Why this matters for production:

The scaler saved to scaler.pkl carries the exact mean and standard deviation computed on X_train. When the Streamlit app receives a new order, it applies this saved scaler — not a newly fitted one. This guarantees that scaling is identical between training and inference, preventing silent prediction errors.
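The split-then-fit order can be sketched as below. This is a minimal illustration on toy data, not the project's notebook: the column subset, the temporary output directory, and the example new order are assumptions, while the file names scaler.pkl and encoders.pkl come from the pipeline above.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy stand-in for cleaned_data.csv (illustrative values only)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "Days_for_shipment_scheduled": rng.integers(0, 5, size=n).astype(float),
    "Order_Item_Quantity": rng.integers(1, 6, size=n).astype(float),
    "Shipping_Mode": np.tile(
        ["First Class", "Second Class", "Same Day", "Standard Class"], n // 4
    ),
    "Late_delivery_risk": np.tile([1, 0, 1, 1, 0], n // 5),
})
num_cols = ["Days_for_shipment_scheduled", "Order_Item_Quantity"]
cat_cols = ["Shipping_Mode"]

X = df.drop(columns="Late_delivery_risk")
y = df["Late_delivery_risk"]

# Split first, so nothing fitted below ever sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on training rows only, then transform both splits
scaler = StandardScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Fit one encoder per categorical column on training rows only
encoders = {}
for col in cat_cols:
    enc = LabelEncoder().fit(X_train[col])
    X_train[col] = enc.transform(X_train[col])
    X_test[col] = enc.transform(X_test[col])  # raises on categories unseen in training
    encoders[col] = enc

# Persist the fitted objects for the app
out_dir = tempfile.mkdtemp()
joblib.dump(scaler, os.path.join(out_dir, "scaler.pkl"))
joblib.dump(encoders, os.path.join(out_dir, "encoders.pkl"))

# At inference time: load and transform, never re-fit
loaded_scaler = joblib.load(os.path.join(out_dir, "scaler.pkl"))
new_order = pd.DataFrame(
    {"Days_for_shipment_scheduled": [2.0], "Order_Item_Quantity": [3.0]}
)
scaled_new_order = loaded_scaler.transform(new_order[num_cols])
print(scaled_new_order)
```

Note that LabelEncoder raises an error on a category it did not see during fitting, so an app built this way needs a policy for new category values.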

Split results:

Training set: 144,413 rows
Test set: 36,104 rows
Stratification maintained the 55/45 class ratio in both sets

Model Development — Dictionary Loop Approach

Three models were defined in a dictionary and trained in a loop, which avoids repetitive code and keeps the comparison uniform across models.

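from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Assumption: ratio = negatives / positives in y_train,
# the usual choice for XGBoost's scale_pos_weight
ratio = (y_train == 0).sum() / (y_train == 1).sum()

models = {
    "Logistic Regression": LogisticRegression(
        class_weight='balanced', max_iter=1000, random_state=42
    ),
    "Random Forest": RandomForestClassifier(
        class_weight='balanced', n_estimators=100, random_state=42, n_jobs=-1
    ),
    "XGBoost": XGBClassifier(
        scale_pos_weight=ratio, n_estimators=100,
        random_state=42, eval_metric='logloss'
    )
}

results = {}
for name, model in models.items():
    # 5-fold CV on the training set only; cross_val_score fits clones of the model
    cv_roc = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    cv_f1  = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    results[name] = {
        'cv_roc_auc': cv_roc.mean(),
        'cv_roc_auc_std': cv_roc.std(),
        'cv_f1': cv_f1.mean(),
        'cv_f1_std': cv_f1.std()
    }
    # Fit on the full training set for the later test-set evaluation
    model.fit(X_train, y_train)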

5-fold cross validation results (on X_train only):

Model	CV ROC-AUC	CV F1
Logistic Regression	0.9803 ± 0.0008	0.9749 ± 0.0019
Random Forest	0.9964 ± 0.0002	0.9786 ± 0.0004
XGBoost	0.9964 ± 0.0001	0.9792 ± 0.0005

Random Forest and XGBoost were essentially tied at baseline. XGBoost was selected for hyperparameter tuning due to its slightly lower variance and faster inference time.

Hyperparameter Tuning — RandomizedSearchCV

GridSearchCV was ruled out immediately. The parameter grid below has 3 × 4 × 4 × 4 × 4 × 3 = 2,304 combinations, so an exhaustive search with 5-fold CV would need 11,520 model fits on a dataset of 180,000+ rows. RandomizedSearchCV with 30 iterations and 5-fold CV (150 fits) was used instead, sampling the parameter space efficiently.

param_grid = {
    'n_estimators':     [100, 200, 300],
    'max_depth':        [3, 4, 5, 6],
    'learning_rate':    [0.01, 0.05, 0.1, 0.2],
    'subsample':        [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 3, 5]
}
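A randomized search over a grid like this can be set up as sketched below. Several things here are assumptions: scoring='roc_auc' is inferred from the reported CV ROC-AUC, the helper name is hypothetical, and scikit-learn's GradientBoostingClassifier stands in for XGBClassifier so the example depends only on scikit-learn. The toy grid keeps only parameters both estimators share, and the toy run uses fewer iterations and folds than the 30 and 5 described above so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV


def make_search(estimator, param_distributions, n_iter=30, cv=5):
    """Build a randomized search scored by ROC-AUC (defaults mirror the article)."""
    return RandomizedSearchCV(
        estimator,
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=cv,
        scoring="roc_auc",
        random_state=42,
    )


# Synthetic data and a reduced grid, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
toy_grid = {
    "n_estimators": [20, 40],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1, 0.2],
    "subsample": [0.8, 1.0],
}
search = make_search(
    GradientBoostingClassifier(random_state=42), toy_grid, n_iter=5, cv=3
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

With an XGBClassifier and the full param_grid, the same helper would be called with its defaults and fitted on X_train and y_train only.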

Best parameters found:

n_estimators: 200
max_depth: 6
learning_rate: 0.2
subsample: 0.9
colsample_bytree: 1.0
min_child_weight: 1

Best CV ROC-AUC: 0.9967

Model Evaluation — Final Results

Final model comparison (on X_test):

Model	Accuracy	Precision	Recall	F1 Score	ROC-AUC
Logistic Regression	0.9740	0.9589	0.9952	0.9767	0.9806
Random Forest	0.9773	0.9604	0.9998	0.9797	0.9970
XGBoost Baseline	0.9781	0.9633	0.9980	0.9804	0.9969
XGBoost Tuned	0.9787	0.9644	0.9980	0.9809	0.9972

Why XGBoost Tuned was selected as the best model:

Highest ROC-AUC: 0.9972
Highest Precision: 0.9644 — fewest false alarms
Lowest false positives: 730 (vs 845 for Logistic Regression)
Equal Recall to baseline XGBoost: 0.9980 — catches 99.8% of all true late deliveries
CV ROC-AUC (0.9967) and test ROC-AUC (0.9972) are consistent, so there is no sign of overfitting

Confusion matrix analysis:

Model	False Positives	False Negatives
Logistic Regression	845	95
Random Forest	817	4
XGBoost Baseline	752	39
XGBoost Tuned	730	39

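In an operations context, false positives (flagging an on-time order as high risk) waste intervention resources. False negatives (missing a truly late order) lead to unmitigated delays. XGBoost Tuned has the fewest false positives (730) and matches baseline XGBoost on false negatives (39). Random Forest misses fewer late orders (4) but raises more false alarms (817).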
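The link between these error counts and the metrics in the comparison table is plain arithmetic. The sketch below uses hypothetical counts, since the article reports only false positives and false negatives; the function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print({name: round(value, 4) for name, value in metrics.items()})
```

Fewer false positives raise precision, and fewer false negatives raise recall, which is why the two error columns pull the models in different directions.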

Feature Importance — What Actually Drives Late Deliveries

Top 10 global risk drivers (XGBoost Tuned):

Rank	Feature	Importance
1	Shipping Delay Gap	0.7938
2	Payment Type	0.1671
3	Scheduled Shipping Days	0.0023
4	Customer Country	0.0020
5	Order Country	0.0019
6	Market	0.0019
7	Regional Congestion Score	0.0019
8	Order State	0.0019
9	Customer City	0.0019
10	Order City	0.0018

shipping_delay_gap accounts for 79.38% of the model's total feature importance. This engineered feature, the difference between actual and scheduled shipping days, is overwhelmingly the strongest predictor of late delivery risk.

The second most important feature is Payment Type at 16.71%. This was unexpected. Transfer payments show notably lower late delivery rates (48.5%) compared to other payment types (56.6%–57.5%). The mechanism behind this relationship warrants further investigation.

All other features combined account for less than 4% of importance (1 − 0.7938 − 0.1671 = 0.0391), confirming that the delay gap dominates the model's predictions.

Risk Scoring

Each order received a Late Delivery Probability Score (0–1) and a Risk Category:

Low Risk: probability < 0.40
Medium Risk: 0.40 ≤ probability < 0.70
High Risk: probability ≥ 0.70
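These thresholds translate directly into a small mapping function. The sketch below is illustrative; the function name is hypothetical and the boundary handling follows the inequalities listed above.

```python
def risk_category(probability):
    """Map a late-delivery probability in [0, 1] to one of the three risk bands."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= 0.70:
        return "High Risk"
    if probability >= 0.40:
        return "Medium Risk"
    return "Low Risk"


for p in (0.12, 0.40, 0.69, 0.70, 0.98):
    print(p, risk_category(p))
```

In practice the probability would come from the positive-class column of the model's predict_proba output.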

Risk distribution across 36,104 test orders:

Risk Category	Count	Percentage
High Risk	19,977	55.33%
Medium Risk	589	1.63%
Low Risk	15,538	43.04%

The bimodal probability distribution — with most orders near 0.0 or 1.0 — reflects the model's high confidence. The dominant shipping_delay_gap feature provides such strong signal that the model is rarely uncertain about an order's risk classification.

The Streamlit Application

A four-module Streamlit dashboard was built for supply chain operations teams:

Home — Project overview, professional disclaimer, methodology summary, usage guide. The app explicitly states it is designed for supply chain managers, logistics analysts, and operations teams — not end consumers. The reason: the inputs required (scheduled shipping days, actual shipping days, profit ratios, financial metrics) are only available in internal order management systems.

Risk Predictor — Operations teams enter order details. The app automatically engineers all 6 derived features, applies the saved scaler and encoders, and outputs a probability score, risk category, top risk drivers, and recommended action.

Risk Dashboard — Portfolio-level view of risk distribution, probability histogram, and feature importance chart.

Operations Action Panel — Filterable table of high-risk orders with adjustable threshold slider and CSV export.

Key Takeaways

1. Leakage prevention is non-negotiable.
Delivery Status had a Cramér's V of 1.00 with the target. Including it would have given a perfect model on paper and a useless model in production. Always ask: would this feature exist at prediction time?

2. Feature engineering made the biggest difference.
shipping_delay_gap, a single engineered feature, accounts for 79% of the model's feature importance. No raw feature came close. In this project, feature engineering mattered far more than model tuning: tuning raised test ROC-AUC only from 0.9969 to 0.9972.

3. Class balance should be checked before reaching for SMOTE.
The target was 55/45, nearly balanced, so SMOTE was unnecessary. Class weighting (class_weight='balanced', or scale_pos_weight in XGBoost) was simpler and faster.

4. RandomizedSearchCV over GridSearchCV at scale.
With 180,000+ rows, GridSearch would have been impractical. RandomizedSearch with 30 iterations delivered strong results efficiently.

5. The most counterintuitive finding was the most actionable.
First Class shipping having a 95.3% late delivery rate is not a modeling artifact — it is a real operational failure that APL Logistics can act on directly, independent of any ML system.

Technical Stack

Python · Pandas · NumPy · Scikit-learn · XGBoost · Matplotlib · Seaborn · Plotly · Streamlit · Joblib · Jupyter Notebooks

This project was completed as part of the Data Science internship program at Unified Mentor Private Limited, in collaboration with APL Logistics (KWE Group).