
Sugnik Mondal
Late deliveries are not just an inconvenience. For a global logistics operator like APL Logistics (KWE Group), a single delayed shipment can trigger SLA breaches, financial penalties, and long-term customer churn. Multiply that across hundreds of thousands of orders spanning five global markets, and the cost of reactive delay management becomes unsustainable.
The conventional approach has always been to handle delays after they happen — emergency rerouting, last-minute escalations, and reactive customer communication. This project takes a different approach entirely. Instead of reacting, it predicts.
This article walks through the end-to-end machine learning pipeline built to predict late delivery risk for APL Logistics — from raw data to a deployed Streamlit dashboard used by supply chain operations teams.
The project uses the DataCo Smart Supply Chain dataset — a comprehensive real-world transactional dataset from APL Logistics' global operations.
Target variable: Late_delivery_risk (1 = Late, 0 = Not Late)
Target distribution:
| Class | Count | Percentage |
|---|---|---|
| Late (1) | 98,976 | 54.83% |
| Not Late (0) | 81,541 | 45.17% |
The near-balanced target distribution was an important early finding: SMOTE was not required. Class weighting (class_weight='balanced' in the scikit-learn models, scale_pos_weight in XGBoost) was sufficient.
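As a quick check of that reasoning, the class shares and the weights implied by scikit-learn's documented 'balanced' heuristic, n_samples / (n_classes * class_count), can be computed from the two counts in the table. This is a minimal sketch in plain Python; the variable names are illustrative and not taken from the project code.

```python
# Class counts copied from the target distribution table.
counts = [98_976, 81_541]

n_samples = sum(counts)
n_classes = len(counts)

# Share of each class in the dataset.
shares = [c / n_samples for c in counts]

# Weight per class under the 'balanced' heuristic:
# n_samples / (n_classes * class_count).
balanced_weights = [n_samples / (n_classes * c) for c in counts]

# Ratio of the larger class to the smaller one; a value close to 1
# suggests that resampling such as SMOTE would add little.
imbalance_ratio = max(counts) / min(counts)

print([round(s, 4) for s in shares])
print([round(w, 4) for w in balanced_weights])
print(round(imbalance_ratio, 3))
```

Both weights land close to 1, which is consistent with treating the target as nearly balanced.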
The most critical cleaning decisions were around data leakage — columns that would not be available at the time of prediction (before dispatch) but that reveal the outcome after the fact.
Leakage columns dropped:
Delivery Status: Cramér's V of 1.00 with the target, a perfect association. This column contains values like "Late delivery" and "Shipping on time", which is literally the answer. Using it would give near-perfect accuracy in training and a model that cannot be used in production, where the value is not known before delivery.
Order Status — Values like COMPLETE, CLOSED, CANCELED are assigned after the order is fulfilled. At prediction time (before dispatch), this information does not exist.
The simple test for leakage: "At the moment the prediction is needed, would this information be available?" If not — drop it.
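The Cramér's V check mentioned above can be sketched in plain Python. This is an illustrative implementation of the uncorrected statistic, sqrt(chi² / (n · (min(rows, cols) − 1))); the function name `cramers_v` and the toy values are assumptions, not the project's code.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences (no bias correction)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_totals = Counter(xs)
    y_totals = Counter(ys)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, nx in x_totals.items():
        for y, ny in y_totals.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(x_totals), len(y_totals)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Hypothetical toy rows: the status column fully determines the target.
status = ["Late delivery", "Shipping on time", "Late delivery", "Shipping on time"]
target = [1, 0, 1, 0]
leak_score = cramers_v(status, target)

# A column unrelated to the target scores 0.
unrelated_score = cramers_v(["a", "a", "b", "b"], [0, 1, 0, 1])
print(leak_score, unrelated_score)
```

A score at or near 1.0 against the target is a strong hint that a column encodes the outcome and should be reviewed with the availability question above.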
Other columns dropped:
Missing values: Only Customer Lname (8 rows) and Customer Zipcode (3 rows) had nulls — both dropped entirely as they were removal candidates anyway.
Duplicates: 2 duplicate rows removed.
Final cleaned dataset: 180,517 rows × 28 columns
Before building any model, the data was explored thoroughly. The most counterintuitive finding came from shipping mode analysis.
Late delivery rate by shipping mode:
| Shipping Mode | Late Delivery Rate |
|---|---|
| First Class | 95.3% |
| Second Class | 76.6% |
| Same Day | 45.7% |
| Standard Class | 38.1% |
First Class shipping — which customers expect to be faster and more reliable — has a 95.3% late delivery rate. This is not a rounding error. Nearly every First Class order in the dataset arrived late. This suggests that First Class commitments are systematically over-promised relative to operational capacity.
Late delivery rate by market:
All five global markets showed rates between 54.4% and 55.2% — an extremely narrow band. This finding is operationally significant: the delay problem is not geographically concentrated. It is systemic across all markets, meaning market-level interventions alone will not solve it.
Shipping delay gap:
The gap between actual shipping days and scheduled shipping days averaged +0.57 days across all orders. 103,399 orders (57.3%) shipped later than scheduled. Most delays were by exactly one day — suggesting a consistent operational mismatch between scheduling and execution.
Correlation heatmap revealed multicollinearity:
Benefit per order and Order Profit Per Order → 1.00 correlation
Order Item Product Price and Product Price → 1.00 correlation
Sales per customer, Order Item Total, Sales → 0.99 correlation
These redundant columns were dropped in preprocessing to prevent multicollinearity, which is particularly harmful for Logistic Regression.
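A redundancy check of this kind can be sketched without any libraries. The example below computes Pearson correlations pairwise and lists pairs at or above a cut-off; the 0.99 threshold, the helper names, and the toy values are assumptions for illustration only.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def redundant_pairs(columns, threshold=0.99):
    """Return column-name pairs whose absolute correlation meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]

# Hypothetical toy values, not rows from the real dataset.
toy = {
    "Product Price": [10.0, 20.0, 30.0, 40.0],
    "Order Item Product Price": [10.0, 20.0, 30.0, 40.0],
    "Order Item Quantity": [1, 3, 2, 5],
}
pairs = redundant_pairs(toy)
print(pairs)
```

For each flagged pair, keeping one column and dropping the other removes the duplication without losing information.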
Six new features were engineered from existing columns. These turned out to be some of the most important features in the final model.
1. Shipping Delay Gap
shipping_delay_gap = Days_for_shipping_real - Days_for_shipment_scheduled
Measures how many days actual shipping exceeded the scheduled commitment. This single feature ended up with an importance score of 0.7938, roughly 79% of the model's total feature importance.
2. Shipping Pressure Index
shipping_pressure_index = Days_for_shipment_scheduled / (Order_Item_Quantity + 1)
Captures the relationship between delivery commitment and order complexity.
3. Is Express Flag
is_express = 1 if Shipping_Mode in ['First Class', 'Same Day'] else 0
Binary flag directly capturing the high-risk shipping modes identified in EDA.
4. High Discount Flag
high_discount_flag = 1 if Order_Item_Discount_Rate > 0.06 else 0
Flags orders with above-median discount rates.
5. Order Complexity Score
order_complexity_score = Order_Item_Quantity × Order_Item_Product_Price
Measures the financial complexity of the order.
6. Regional Congestion Score
region_congestion_score = average_late_delivery_rate_per_region
Encodes the historically observed delay rate per region as a continuous risk signal — ranging from 0.488 (Canada) to 0.580 (Central Africa).
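The six formulas above can be gathered into one function that works on a single order. This is a sketch under stated assumptions: the order is a plain dict keyed by the column names used in the formulas, the `Order_Region` key and the `default_rate` fallback are hypothetical, and the regional lookup is assumed to have been computed beforehand from historical data.

```python
EXPRESS_MODES = {"First Class", "Same Day"}

def engineer_features(order, region_late_rate, default_rate=0.5):
    """Derive the six engineered features for one order (a plain dict)."""
    scheduled = order["Days_for_shipment_scheduled"]
    quantity = order["Order_Item_Quantity"]
    return {
        # Actual minus scheduled shipping days.
        "shipping_delay_gap": order["Days_for_shipping_real"] - scheduled,
        # Scheduled days relative to order size (+1 avoids division by zero).
        "shipping_pressure_index": scheduled / (quantity + 1),
        "is_express": 1 if order["Shipping_Mode"] in EXPRESS_MODES else 0,
        "high_discount_flag": 1 if order["Order_Item_Discount_Rate"] > 0.06 else 0,
        "order_complexity_score": quantity * order["Order_Item_Product_Price"],
        # Historical late rate of the region; default_rate is a hypothetical
        # fallback for regions missing from the lookup.
        "region_congestion_score": region_late_rate.get(
            order["Order_Region"], default_rate
        ),
    }

# The two regional rates quoted in the text; other regions are omitted here.
region_late_rate = {"Canada": 0.488, "Central Africa": 0.580}

# Hypothetical order used only to exercise the function.
order = {
    "Days_for_shipping_real": 3,
    "Days_for_shipment_scheduled": 1,
    "Order_Item_Quantity": 2,
    "Shipping_Mode": "First Class",
    "Order_Item_Discount_Rate": 0.10,
    "Order_Item_Product_Price": 50.0,
    "Order_Region": "Canada",
}
features = engineer_features(order, region_late_rate)
print(features)
```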
Preventing data leakage was not just about dropping columns. The entire preprocessing pipeline was structured to ensure no information from the test set contaminated the training process.
Load cleaned_data.csv
→ Feature Engineering (pure arithmetic — no fitting required)
→ Separate X and y
→ Train/Test Split (80/20, stratified) ← split happens HERE
→ Fit StandardScaler on X_train only
→ Transform X_train and X_test separately
→ Fit LabelEncoders on X_train only
→ Transform X_train and X_test separately
→ Save scaler.pkl and encoders.pkl
→ Train models on X_train only
→ Evaluate on X_test only
Why this matters for production:
The scaler saved to scaler.pkl carries the exact mean and standard deviation computed on X_train. When the Streamlit app receives a new order, it applies this saved scaler — not a newly fitted one. This guarantees that scaling is identical between training and inference, preventing silent prediction errors.
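The fit-on-train, reuse-at-inference idea can be shown with a tiny standard scaler written in plain Python. This is a simplified stand-in, not the project's StandardScaler: it handles one numeric column, uses the standard library's pickle in place of a file on disk, and all values are hypothetical.

```python
import pickle
from statistics import fmean, pstdev

def fit_scaler(values):
    """Learn mean and population standard deviation from training values only."""
    scale = pstdev(values) or 1.0  # guard against a constant column
    return {"mean": fmean(values), "scale": scale}

def apply_scaler(params, values):
    """Scale values with previously learned parameters (no refitting)."""
    return [(v - params["mean"]) / params["scale"] for v in values]

train_gap = [-1, 0, 1, 2, 3]   # hypothetical training values
new_orders = [1, 3]            # hypothetical values seen at inference time

params = fit_scaler(train_gap)       # statistics come from training data only
saved = pickle.dumps(params)         # stands in for writing scaler.pkl

restored = pickle.loads(saved)       # what the app would load at startup
scaled_new = apply_scaler(restored, new_orders)
print(restored, scaled_new)
```

Because the restored parameters are identical to the fitted ones, a new order is scaled exactly as the training rows were.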
Split results:
Three models were defined in a dictionary and trained in a loop — a clean, professional pattern that avoids repetitive code and makes comparison straightforward.
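    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    # Assumption: `ratio` is not defined in the original snippet. A common choice
    # for scale_pos_weight is negatives / positives in the training labels.
    ratio = (y_train == 0).sum() / (y_train == 1).sum()

    models = {
        "Logistic Regression": LogisticRegression(
            class_weight='balanced', max_iter=1000, random_state=42
        ),
        "Random Forest": RandomForestClassifier(
            class_weight='balanced', n_estimators=100, random_state=42, n_jobs=-1
        ),
        "XGBoost": XGBClassifier(
            scale_pos_weight=ratio, n_estimators=100,
            random_state=42, eval_metric='logloss'
        )
    }

    results = {}
    for name, model in models.items():
        # Fit on the full training set; this fitted model is evaluated later on X_test.
        model.fit(X_train, y_train)
        # cross_val_score clones the estimator, so each fold is fitted from scratch.
        cv_roc = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
        cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
        results[name] = {
            'cv_roc_auc': cv_roc.mean(),
            'cv_f1': cv_f1.mean()
        }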
5-fold cross validation results (on X_train only):
| Model | CV ROC-AUC | CV F1 |
|---|---|---|
| Logistic Regression | 0.9803 ± 0.0008 | 0.9749 ± 0.0019 |
| Random Forest | 0.9964 ± 0.0002 | 0.9786 ± 0.0004 |
| XGBoost | 0.9964 ± 0.0001 | 0.9792 ± 0.0005 |
Random Forest and XGBoost were essentially tied at baseline. XGBoost was selected for hyperparameter tuning due to its slightly lower ROC-AUC variance across folds (±0.0001 vs ±0.0002) and faster inference time.
GridSearchCV was ruled out immediately. With 180,000+ rows and a large parameter space, exhaustive search would have been computationally prohibitive. RandomizedSearchCV with 30 iterations and 5-fold CV was used instead — sampling the parameter space efficiently.
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 4, 5, 6],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'subsample': [0.7, 0.8, 0.9, 1.0],
        'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
        'min_child_weight': [1, 3, 5]
    }
Best parameters found:
n_estimators: 200
max_depth: 6
learning_rate: 0.2
subsample: 0.9
colsample_bytree: 1.0
min_child_weight: 1
Best CV ROC-AUC: 0.9967
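To put numbers on why exhaustive search was ruled out, the parameter grid above can simply be counted. The sketch below compares the number of model fits for a full grid with 5-fold CV against 30 random candidates with 5-fold CV, and draws one random configuration for illustration. The sampling line is a simplified picture of randomized search, not a reproduction of scikit-learn's sampler.

```python
import random
from math import prod

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "subsample": [0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.7, 0.8, 0.9, 1.0],
    "min_child_weight": [1, 3, 5],
}

cv_folds = 5
n_iter = 30

# Every combination of values in the grid.
n_combinations = prod(len(values) for values in param_grid.values())

grid_fits = n_combinations * cv_folds    # exhaustive search
random_fits = n_iter * cv_folds          # randomized search with 30 candidates

# One randomly drawn candidate, picking a value per parameter.
rng = random.Random(42)
sample = {name: rng.choice(values) for name, values in param_grid.items()}

print(n_combinations, grid_fits, random_fits)
print(sample)
```

The grid holds 2,304 combinations, so exhaustive search would need 11,520 fits against 150 for the randomized run.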
Final model comparison (on X_test):
| Model | Accuracy | Precision | Recall | F1 Score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.9740 | 0.9589 | 0.9952 | 0.9767 | 0.9806 |
| Random Forest | 0.9773 | 0.9604 | 0.9998 | 0.9797 | 0.9970 |
| XGBoost Baseline | 0.9781 | 0.9633 | 0.9980 | 0.9804 | 0.9969 |
| XGBoost Tuned | 0.9787 | 0.9644 | 0.9980 | 0.9809 | 0.9972 |
Why XGBoost Tuned was selected as the best model:
Confusion matrix analysis:
| Model | False Positives | False Negatives |
|---|---|---|
| Logistic Regression | 845 | 95 |
| Random Forest | 817 | 4 |
| XGBoost Baseline | 752 | 39 |
| XGBoost Tuned | 730 | 39 |
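In an operations context, false positives (flagging an on-time order as high risk) waste intervention resources. False negatives (missing a truly late order) lead to unmitigated delays. XGBoost Tuned has the fewest false positives (730) while keeping false negatives low (39), although Random Forest misses fewer late orders (4) at the cost of more false alarms (817).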
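How these two error types trade off depends on what each one costs, which the article does not state. The sketch below uses the false positive and false negative counts from the table with hypothetical unit costs to show how the lowest-cost model can be identified; `cheapest_model` and the cost values are illustrative assumptions.

```python
# (false positives, false negatives) copied from the confusion matrix table.
errors = {
    "Logistic Regression": (845, 95),
    "Random Forest": (817, 4),
    "XGBoost Baseline": (752, 39),
    "XGBoost Tuned": (730, 39),
}

def total_cost(model, fp_cost, fn_cost):
    """Total error cost of a model on the test set for the given unit costs."""
    false_pos, false_neg = errors[model]
    return false_pos * fp_cost + false_neg * fn_cost

def cheapest_model(fp_cost, fn_cost):
    """Return the model with the lowest total error cost."""
    return min(errors, key=lambda m: total_cost(m, fp_cost, fn_cost))

# Hypothetical cost scenarios.
equal_costs = cheapest_model(fp_cost=1, fn_cost=1)
costly_misses = cheapest_model(fp_cost=1, fn_cost=5)
print(equal_costs, costly_misses)
```

With equal costs the tuned XGBoost model comes out cheapest. If a missed late order is assumed to cost five times a false alarm, Random Forest's four misses make it the cheapest, so the choice is worth revisiting once real costs are known.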
Top 10 global risk drivers (XGBoost Tuned):
| Rank | Feature | Importance |
|---|---|---|
| 1 | Shipping Delay Gap | 0.7938 |
| 2 | Payment Type | 0.1671 |
| 3 | Scheduled Shipping Days | 0.0023 |
| 4 | Customer Country | 0.0020 |
| 5 | Order Country | 0.0019 |
| 6 | Market | 0.0019 |
| 7 | Regional Congestion Score | 0.0019 |
| 8 | Order State | 0.0019 |
| 9 | Customer City | 0.0019 |
| 10 | Order City | 0.0018 |
shipping_delay_gap accounts for 79.38% of the model's total feature importance. This engineered feature, the difference between actual and scheduled shipping days, is overwhelmingly the primary driver of the model's late delivery predictions.
The second most important feature is Payment Type at 16.71%. This was unexpected. Transfer payments show notably lower late delivery rates (48.5%) compared to other payment types (56.6%–57.5%). The mechanism behind this relationship warrants further investigation.
All other features combined account for less than 4% of importance, which marks the delay gap as the model's dominant signal by a wide margin.
Each order received a Late Delivery Probability Score (0–1) and a Risk Category:
Risk distribution across 36,104 test orders:
| Risk Category | Count | Percentage |
|---|---|---|
| High Risk | 19,977 | 55.33% |
| Medium Risk | 589 | 1.63% |
| Low Risk | 15,538 | 43.04% |
The bimodal probability distribution — with most orders near 0.0 or 1.0 — reflects the model's high confidence. The dominant shipping_delay_gap feature provides such strong signal that the model is rarely uncertain about an order's risk classification.
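Mapping a probability score to a risk category is a simple thresholding step. The cut-offs used by the project are not listed in this text, so the values below (0.4 and 0.7) are hypothetical defaults chosen only to make the sketch runnable; the function name is likewise illustrative.

```python
def risk_category(probability, medium_from=0.4, high_from=0.7):
    """Map a late-delivery probability to a risk label.

    The default cut-offs are hypothetical and for illustration only.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high_from:
        return "High Risk"
    if probability >= medium_from:
        return "Medium Risk"
    return "Low Risk"

scores = [0.02, 0.55, 0.97]   # hypothetical model outputs
labels = [risk_category(p) for p in scores]
print(labels)
```

With a bimodal score distribution like the one described, few orders fall between the two cut-offs, which matches the small Medium Risk group in the table.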
A four-module Streamlit dashboard was built for supply chain operations teams:
Home — Project overview, professional disclaimer, methodology summary, usage guide. The app explicitly states it is designed for supply chain managers, logistics analysts, and operations teams — not end consumers. The reason: the inputs required (scheduled shipping days, actual shipping days, profit ratios, financial metrics) are only available in internal order management systems.
Risk Predictor — Operations teams enter order details. The app automatically engineers all 6 derived features, applies the saved scaler and encoders, and outputs a probability score, risk category, top risk drivers, and recommended action.
Risk Dashboard — Portfolio-level view of risk distribution, probability histogram, and feature importance chart.
Operations Action Panel — Filterable table of high-risk orders with adjustable threshold slider and CSV export.
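The filtering and export logic behind a panel like the Operations Action Panel can be sketched with the standard library alone. The record fields (`order_id`, `late_probability`), the function name, and the sample orders are hypothetical; the actual app's column names are not given here.

```python
import csv
import io

def export_high_risk(orders, threshold):
    """Return CSV text for orders whose late probability meets the threshold."""
    selected = [o for o in orders if o["late_probability"] >= threshold]
    # Highest-risk orders first, so they appear at the top of the export.
    selected.sort(key=lambda o: o["late_probability"], reverse=True)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["order_id", "late_probability"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(selected)
    return buffer.getvalue()

# Hypothetical scored orders.
orders = [
    {"order_id": 101, "late_probability": 0.98},
    {"order_id": 102, "late_probability": 0.12},
    {"order_id": 103, "late_probability": 0.71},
]
csv_text = export_high_risk(orders, threshold=0.7)
print(csv_text)
```

In a Streamlit app, the threshold would typically come from a slider widget and the resulting text could be offered through a download button.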
1. Leakage prevention is non-negotiable.
Delivery Status had a Cramér's V of 1.00 with the target. Including it would have given a perfect model on paper and a useless model in production. Always ask: would this feature exist at prediction time?
2. Feature engineering made the biggest difference.
shipping_delay_gap, a single engineered feature, accounts for 79% of the model's total feature importance. No raw feature came close. In this project, time spent on thoughtful feature engineering paid off far more than time spent on model tuning, which moved test ROC-AUC only from 0.9969 to 0.9972.
3. Class balance should be checked before reaching for SMOTE.
The target was 55/45 — nearly balanced. SMOTE was unnecessary. class_weight='balanced' was cleaner, faster, and equally effective.
4. RandomizedSearchCV over GridSearchCV at scale.
With 180,000+ rows, GridSearch would have been impractical. RandomizedSearch with 30 iterations delivered strong results efficiently.
5. The most counterintuitive finding was the most actionable.
First Class shipping having a 95.3% late delivery rate is not a modeling artifact — it is a real operational failure that APL Logistics can act on directly, independent of any ML system.
Python · Pandas · NumPy · Scikit-learn · XGBoost · Matplotlib · Seaborn · Plotly · Streamlit · Joblib · Jupyter Notebooks
This project was completed as part of the Data Science internship program at Unified Mentor Private Limited, in collaboration with APL Logistics (KWE Group).