Starting Point for Kagglers: Customer Churn Prediction Competition

#kaggle #machinelearning #aimodeling #ai


You open the Playground Series S6E3 competition, see 250k+ rows of customer data, and think: “Where do I even start?”

I’ve been there. This post is exactly the first notebook I wish I had when I jumped in: a dead-simple, copy-paste-ready pipeline that takes you from raw CSV to a solid submission. No theory overload, just the steps that actually work (and why they matter). Let’s go!

1. Grab the Tools

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings("ignore")

These are my go-to imports for every tabular comp. LightGBM will be your hero later.

2. Load & Quick Look

df = pd.read_csv("/kaggle/input/competitions/playground-series-s6e3/train.csv")
X = df.drop(columns=["Churn", "id"])
y = df["Churn"]

Run df.shape, df.head(), df.info(). Clean data, zero missing values — we’re lucky today!
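The quick look boils down to three one-liners. Here is a minimal sketch on a toy frame (the columns below are illustrative stand-ins, not the full competition schema; in the notebook you run the same calls on the real df):

```python
import pandas as pd

# Tiny stand-in for the competition frame — same checks, toy scale
df = pd.DataFrame({
    "id": [0, 1, 2],
    "tenure": [1, 24, 60],
    "Churn": [1, 0, 0],
})

print(df.shape)               # (rows, columns)
print(df.head())              # eyeball the first rows
print(df.isna().sum().sum())  # total missing values; 0 means clean
```

If that last number isn’t 0, handle the gaps before training.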

3. Tiny Cleanup (Just in Case)

X["TotalCharges"] = pd.to_numeric(X["TotalCharges"], errors="coerce")  # bad strings become NaN
X["TotalCharges"] = X["TotalCharges"].fillna(X["TotalCharges"].median())  # fill them so every model can train

Always make sure numbers are actually numbers.

4. Know Your Columns

  • Numbers: tenure, MonthlyCharges, TotalCharges, SeniorCitizen
  • Categories: gender, Contract, PaymentMethod, streaming stuff, etc.

Models only understand numbers, so categories need love.
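You can let pandas sort the two groups for you instead of listing them by hand. A small sketch with select_dtypes (toy frame below, column names taken from the lists above):

```python
import pandas as pd

# Toy frame mixing numeric and categorical columns (stand-in for X)
X = pd.DataFrame({
    "tenure": [1, 24, 60],
    "MonthlyCharges": [29.9, 56.5, 99.0],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "gender": ["Female", "Male", "Female"],
})

num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()
print(num_cols)  # ['tenure', 'MonthlyCharges']
print(cat_cols)  # ['Contract', 'gender']
```

Handy when a dataset has dozens of columns and you don’t want to type them all out.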

5. My Secret Weapon: Merge Columns

This one trick makes everything faster and cleaner:

X['StreamingAny'] = ((X['StreamingTV'] == 'Yes') | (X['StreamingMovies'] == 'Yes')).astype(int)
X = X.drop(columns=['StreamingTV', 'StreamingMovies'])

Why I do this every time:

  • Cuts 4–5 columns → 20–40% faster training
  • Saves RAM (huge on big datasets)
  • Removes confusing duplicate signals
  • Model learns real customer habits instead of memorizing noise

Feels like decluttering your code: suddenly everything runs smoother.

6. Turn Words into Numbers

Easy Yes/No first:

binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
for col in binary_cols:
    X[col] = X[col].map({'Yes': 1, 'No': 0})

Then the rest:

X = pd.get_dummies(X, drop_first=True)

All numeric now. Boom.

7. Split Smart

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Stratify keeps the churn ratio the same in both splits, which is critical for this competition.
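You can see stratification at work on synthetic data. In this sketch (a toy 80/20 target standing in for the churn column), both splits keep exactly the original positive rate:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 negatives, 20 positives
y = pd.Series([0] * 80 + [1] * 20)
X = pd.DataFrame({"f": range(100)})

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Both splits preserve the 20% positive rate
print(y_train.mean(), y_val.mean())
```

Without stratify=y, a random split could hand your validation set a skewed churn rate and make your local score misleading.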

8. Train Two Models (Quick Check + Real Deal)

Baseline (Random Forest):

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print("RF ROC-AUC:", roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]))

The one that actually scores well (LightGBM):

lgb = LGBMClassifier(random_state=42)
lgb.fit(X_train, y_train)
print("LGB ROC-AUC:", roc_auc_score(y_val, lgb.predict_proba(X_val)[:, 1]))

LightGBM usually jumps ahead — this is your starting leaderboard model.

9. Test Set (Same Steps, No Leaks!)

test = pd.read_csv("/kaggle/input/competitions/playground-series-s6e3/test.csv")
test_X = test.drop(columns=['id'])

# Same cleanup
test_X["TotalCharges"] = pd.to_numeric(test_X["TotalCharges"], errors="coerce")

# Same merge
test_X['StreamingAny'] = ((test_X['StreamingTV'] == 'Yes') | (test_X['StreamingMovies'] == 'Yes')).astype(int)
test_X = test_X.drop(columns=['StreamingTV', 'StreamingMovies'])

# Same encoding: binary map first, then dummies
for col in binary_cols:
    test_X[col] = test_X[col].map({'Yes': 1, 'No': 0})
test_X = pd.get_dummies(test_X, drop_first=True)
test_X = test_X.reindex(columns=X.columns, fill_value=0)  # align to training columns; missing dummies become 0

preds = lgb.predict_proba(test_X)[:, 1]

submission = pd.DataFrame({"id": test["id"], "Churn": preds})
submission.to_csv("submission.csv", index=False)

Want to Level Up Later?

  • Add cross-validation
  • Merge more groups (add-ons, contract type)
  • Tune LightGBM with Optuna
  • Try CatBoost (zero encoding needed)
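Here is what the cross-validation bullet looks like in practice: a minimal sketch on synthetic data, using the Random Forest baseline (swap in LGBMClassifier the same way; make_classification just stands in for the encoded X and y from the steps above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the encoded training data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Stratified 5-fold keeps the class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```

The mean of five fold scores is a much steadier estimate of leaderboard performance than a single train/validation split.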

One-Sentence Recap

Start with clean loading → merge redundant columns → encode → split → train LGB → apply exact same steps to test → submit.

That’s the real starting point every Kaggler needs.

Copy this notebook, run it, and you’re already ahead.

Got a score? Hit a bug? Drop it in the comments or tag me; I reply to every one.

Happy starting!

Girma Wakeyo

Kaggle → https://www.kaggle.com/girmawakeyo

GitHub → https://github.com/Girma35

X → https://x.com/Girma880731631

Follow for more quick-start notebooks and competition tips. Let’s climb those leaderboards together!