You open the Playground Series S6E3 competition, see 250k+ rows of customer data, and think: “Where do I even start?”
I’ve been there. This post is exactly the first notebook I wish I had when I jumped in: a dead-simple, copy-paste-ready pipeline that takes you from raw CSV to a solid submission. No theory overload, just the steps that actually work (and why they matter). Let’s go!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings("ignore")
These are my go-to imports for every tabular comp. LightGBM will be your hero later.
df = pd.read_csv("/kaggle/input/competitions/playground-series-s6e3/train.csv")
X = df.drop(columns=["Churn", "id"])
y = df["Churn"]
Run df.shape, df.head(), and df.info() first. This dataset is clean, with zero missing values — we’re lucky today!
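Here’s what that first look can be, on a tiny stand-in frame so it runs anywhere (swap in your real train.csv):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the competition's train.csv.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "TotalCharges": ["29.85", "1889.5", "108.15"],
    "Churn": [0, 1, 0],
})

print(df.shape)          # (rows, columns)
print(df.head())         # eyeball the first few rows
df.info()                # dtypes and non-null counts
print(df.isna().sum())   # missing values per column
```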
X["TotalCharges"] = pd.to_numeric(X["TotalCharges"], errors="coerce")
Always make sure numbers are actually numbers — TotalCharges often loads as text.
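To see why errors="coerce" matters, here’s a toy series with a stray blank (a common gotcha in Telco-style data) — the blank becomes NaN instead of raising an error:

```python
import pandas as pd

s = pd.Series(["29.85", " ", "108.15"])    # a blank string hiding among numbers
nums = pd.to_numeric(s, errors="coerce")   # non-numeric entries become NaN
print(nums.isna().sum())                   # → 1
```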
Models only understand numbers, so categories need love.
This one trick makes everything faster and cleaner:
X['StreamingAny'] = ((X['StreamingTV'] == 'Yes') | (X['StreamingMovies'] == 'Yes')).astype(int)
X = X.drop(columns=['StreamingTV', 'StreamingMovies'])
Why I do this every time:
Feels like decluttering your code; suddenly everything runs smoother.
Easy Yes/No first:
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
for col in binary_cols:
    X[col] = X[col].map({'Yes': 1, 'No': 0})
Then the rest:
X = pd.get_dummies(X, drop_first=True)
All numeric now. Boom.
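Worth a one-line sanity check before training. A sketch on a hypothetical mini-frame, to confirm get_dummies leaves nothing non-numeric behind:

```python
import pandas as pd

# Hypothetical mini-frame standing in for X after encoding.
X = pd.get_dummies(
    pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"],
                  "tenure": [1, 34, 2]}),
    drop_first=True,
)

# Every remaining column should be numeric (pandas counts bool as numeric).
assert all(pd.api.types.is_numeric_dtype(X[c]) for c in X.columns)
print(X.dtypes)
```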
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
Stratify keeps the churn ratio the same in both splits, which is critical for this competition.
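You can verify the effect on a toy imbalanced target (your real y is df["Churn"]) — both splits end up with the same positive rate:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 20% positives.
y = pd.Series([0] * 80 + [1] * 20)
X = pd.DataFrame({"f": range(100)})

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_tr.mean(), y_va.mean())  # both ≈ 0.20 thanks to stratify
```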
Baseline (Random Forest):
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print("RF ROC-AUC:", roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]))
The one that actually scores well (LightGBM):
lgb = LGBMClassifier(random_state=42)
lgb.fit(X_train, y_train)
print("LGB ROC-AUC:", roc_auc_score(y_val, lgb.predict_proba(X_val)[:, 1]))
LightGBM usually jumps ahead — this is your starting leaderboard model.
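A single hold-out score can be noisy, so a 5-fold cross-validation gives a steadier estimate before you trust a number. A minimal sketch on synthetic data — shown with RandomForest so it runs without lightgbm installed; swap in LGBMClassifier and your real (X, y) for the actual run:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for (X, y); replace with your real data and LGBMClassifier.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean(), "+/-", scores.std())
```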
test = pd.read_csv("/kaggle/input/competitions/playground-series-s6e3/test.csv")
test_X = test.drop(columns=['id'])
# Same merge
test_X['StreamingAny'] = ((test_X['StreamingTV'] == 'Yes') | (test_X['StreamingMovies'] == 'Yes')).astype(int)
test_X = test_X.drop(columns=['StreamingTV', 'StreamingMovies'])
# Same encoding
test_X = pd.get_dummies(test_X, drop_first=True)
test_X = test_X.reindex(columns=X.columns, fill_value=0)
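That reindex line is the step people forget. If the test set is missing a category the training set had, get_dummies produces fewer columns and prediction breaks; reindex restores the training column order and fills the gaps with 0. A small demonstration with a hypothetical Contract column:

```python
import pandas as pd

# Train saw three contract types, test only two — dummies come out misaligned.
train_X = pd.get_dummies(
    pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]}),
    drop_first=True,
)
test_X = pd.get_dummies(
    pd.DataFrame({"Contract": ["Month-to-month", "One year"]}),
    drop_first=True,
)

# Align test columns to the training layout, filling missing dummies with 0.
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
assert list(test_X.columns) == list(train_X.columns)
```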
preds = lgb.predict_proba(test_X)[:, 1]
submission = pd.DataFrame({"id": test["id"], "Churn": preds})
submission.to_csv("submission.csv", index=False)
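Before uploading, it’s worth asserting the submission is well-formed. A sketch with mock predictions (the real preds come from lgb.predict_proba(test_X)[:, 1]):

```python
import pandas as pd

# Mock submission frame; replace with your real ids and predictions.
submission = pd.DataFrame({"id": [1, 2, 3], "Churn": [0.1, 0.85, 0.42]})

# Probabilities must lie in [0, 1] and each test id must appear exactly once.
assert submission["Churn"].between(0, 1).all()
assert submission["id"].is_unique
print(submission.head())
```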
Start with clean loading → merge redundant columns → encode → split → train LGB → apply exact same steps to test → submit.
That’s the real starting point every Kaggler needs.
Copy this notebook, run it, and you’re already ahead.
Got a score? Hit a bug? Drop it in the comments or tag me; I reply to every one.
Happy starting!
Girma Wakeyo
Kaggle → https://www.kaggle.com/girmawakeyo
GitHub → https://github.com/Girma35
X → https://x.com/Girma880731631
Follow for more quick-start notebooks and competition tips. Let’s climb those leaderboards together!