How I Built and Published My First Python Library as a Semester 4 Student
By Atharva Shah
Every data project I start looks the same. Load the data, spend 30 minutes hunting for outliers, write the same NaN handling code I wrote last week, watch my notebook eat RAM. Then repeat it all for the next project.
I got tired of it. So I built a library.
This is the story of how I went from a frustrated CS student to publishing pandasclean on PyPI — and what I learned along the way.
It started simple. I just wanted a function that could detect outliers and let me choose what to do with them. But once I had that, I thought — why not add NaN handling? And memory reduction? And a single function that runs everything?
Three weeks later I had a published library.
pip install pandasclean
It has four core functions:
find_outliers() — IQR-based outlier detection
from pandasclean import find_outliers
# Just show me the bounds
df, bounds = find_outliers(df, strategy='report')
# Drop outlier rows
df_clean, bounds = find_outliers(df, strategy='drop')
# Cap values instead of dropping (Winsorization)
df_clean, bounds = find_outliers(df, strategy='cap')
The IQR method computes bounds as:
lower = Q1 - (multiplier × IQR)
upper = Q3 + (multiplier × IQR)

Use multiplier=1.5 for mild outliers and multiplier=3.0 for extreme ones only.
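To make the formula concrete, here's a plain-pandas sketch of what the three strategies reduce to. This is illustrative only, not pandasclean's actual implementation:

```python
import pandas as pd

# Illustrative only: what each strategy boils down to in plain pandas,
# using the IQR bounds above (not pandasclean's actual code).
s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

reported = (lower, upper)                 # strategy='report': just the bounds
dropped = s[(s >= lower) & (s <= upper)]  # strategy='drop': remove outlier rows
capped = s.clip(lower, upper)             # strategy='cap': Winsorize into range
```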
handle_nan() — Missing value handling
from pandasclean import handle_nan
# Fill with mean, median, or custom values
df_clean, report = handle_nan(df, strategy='mean')
df_clean, report = handle_nan(df, strategy='custom', fill_value={'age': 0, 'name': 'unknown'})
# Or drop rows/columns entirely
df_clean, report = handle_nan(df, strategy='drop', axis='rows')
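For intuition, each strategy has a rough plain-pandas equivalent. This is a sketch of the idea, not the library's internals (and it doesn't reproduce the report that handle_nan returns):

```python
import numpy as np
import pandas as pd

# Rough plain-pandas equivalents of the strategies above (a sketch,
# not pandasclean's internals).
df = pd.DataFrame({"age": [25.0, np.nan, 31.0], "name": ["a", None, "c"]})

mean_filled = df.assign(age=df["age"].fillna(df["age"].mean()))
custom_filled = df.fillna({"age": 0, "name": "unknown"})
rows_dropped = df.dropna(axis=0)  # axis='rows' in handle_nan terms
```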
reduce_memory() — Dtype downcasting
This one surprised me the most when I saw the results.
from pandasclean import reduce_memory
before = df.memory_usage(deep=True).sum() / (1024*1024)
df_optimized, report = reduce_memory(df)
after = df_optimized.memory_usage(deep=True).sum() / (1024*1024)
print(f"Before: {before:.2f} MB")
print(f"After: {after:.2f} MB")
Before: 1527.07 MB
After: 371.93 MB
Reduction: 75.6%
On a 15 million row dataset. That number still makes me happy.
What's happening under the hood:
int64 → smallest safe integer type (int8, int16, or int32)
float64 → float32
object → category dtype
auto_clean() — One function to rule them all
from pandasclean import auto_clean
df_clean, report = auto_clean(df)
Runs NaN handling, memory reduction and outlier detection in the correct order with sensible defaults.
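The dtype downcasting behind reduce_memory() can be sketched with pandas' own helpers. A rough illustration of the idea; the library's internals may differ:

```python
import numpy as np
import pandas as pd

# A rough illustration of dtype downcasting using pandas' own helpers;
# reduce_memory()'s internals may differ.
df = pd.DataFrame({
    "count": np.array([1, 2, 3], dtype="int64"),
    "price": np.array([1.5, 2.5, 3.5], dtype="float64"),
    "city": ["NY", "NY", "LA"],  # low-cardinality strings
})

df["count"] = pd.to_numeric(df["count"], downcast="integer")  # int64 → int8
df["price"] = df["price"].astype("float32")                   # float64 → float32
df["city"] = df["city"].astype("category")                    # object → category
```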
Building this taught me things I never would have learned from a tutorial.
The IQR = 0 edge case — what happens when a column has constant values? Q1 == Q3, so IQR = 0, and the bounds collapse. I had to add a guard for this.
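Here's the failure mode in miniature (a sketch; the exact guard in the library may differ):

```python
import pandas as pd

# A constant column collapses the IQR bounds onto the constant itself.
s = pd.Series([5, 5, 5, 5])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1  # 0 for a constant column, so lower == upper == 5

# One possible guard: skip columns whose IQR is zero.
skip_column = iqr == 0
```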
pandas StringDtype compatibility — newer versions of pandas use pd.StringDtype() instead of plain object for string columns. My is_object_dtype() check was returning False for string columns and silently skipping them. Fixed by also checking is_string_dtype().
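The bug in miniature (a sketch of the dtype check, not the library's exact code):

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

obj_col = pd.Series(["a", "b"])                  # classic object dtype
str_col = pd.Series(["a", "b"], dtype="string")  # pd.StringDtype()

# is_object_dtype() alone misses the StringDtype column,
# so check both predicates.
is_stringy = is_object_dtype(str_col) or is_string_dtype(str_col)
```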
Nullable integers — pandas has two integer systems. int64 (numpy) can't hold NaN. Int64 (pandas nullable, capital I) can. Trying to downcast Int64 with NaN to numpy int8 raises cannot convert NA to integer. The fix was to detect NaN integers and downcast to nullable Int8/Int16/Int32 instead.
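In code, the two integer systems look like this (a sketch of the idea, not the library's exact fix):

```python
import pandas as pd

# numpy's int64 can't hold NaN, but pandas' nullable Int64 can.
s = pd.Series([1, 2, pd.NA], dtype="Int64")

# s.astype("int8") would raise "cannot convert NA to integer";
# downcasting to the nullable counterpart keeps the NA intact.
small = s.astype("Int8")
```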
Publishing to PyPI was simpler than I expected. You need three things:
1. pyproject.toml:
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
[project]
name = "pandasclean"
version = "0.1.2"
description = "A lightweight library for cleaning and optimizing pandas DataFrames"
dependencies = ["pandas>=1.3.0", "numpy>=1.21.0"]
2. Build:
pip install build twine
python -m build
3. Upload:
twine upload dist/*
That's it. Your library is live.
I got more out of this project than I expected. Here's the short version:
If you try it and something breaks — please open an issue. Feedback from real users is worth more than anything else at this stage.