Building pandasclean — a pandas data cleaning library from scratch to PyPI

Atharva Shah

How I Built and Published My First Python Library as a Semester 4 Student

Every data project I start looks the same. Load the data, spend 30 minutes hunting for outliers, write the same NaN handling code I wrote last week, watch my notebook eat RAM. Then repeat it all for the next project.

I got tired of it. So I built a library.

This is the story of how I went from a frustrated CS student to publishing pandasclean on PyPI — and what I learned along the way.


The Idea

It started simple. I just wanted a function that could detect outliers and let me choose what to do with them. But once I had that, I thought — why not add NaN handling? And memory reduction? And a single function that runs everything?

Three weeks later I had a published library.


What pandasclean Does

pip install pandasclean

It has four core functions:

1. find_outliers() — IQR-based outlier detection

from pandasclean import find_outliers

# Just show me the bounds
df, bounds = find_outliers(df, strategy='report')

# Drop outlier rows
df_clean, bounds = find_outliers(df, strategy='drop')

# Cap values instead of dropping (Winsorization)
df_clean, bounds = find_outliers(df, strategy='cap')

The IQR method computes bounds as:

  • lower = Q1 - (multiplier × IQR)
  • upper = Q3 + (multiplier × IQR)

Use multiplier=1.5 for mild outliers and multiplier=3.0 for extreme ones only.
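The bound computation above can be sketched in plain pandas (an illustrative standalone snippet, not pandasclean's actual implementation — the function name iqr_bounds is mine):

```python
import pandas as pd

def iqr_bounds(series: pd.Series, multiplier: float = 1.5):
    """Compute IQR-based lower/upper outlier bounds for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier
lower, upper = iqr_bounds(s)
print(lower, upper)  # -1.0 7.0 — so 100 falls outside the upper bound
```

Anything below lower or above upper is treated as an outlier; with multiplier=3.0 the bounds widen, so only extreme values get flagged.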

2. handle_nan() — Missing value handling

from pandasclean import handle_nan

# Fill with mean, median, or custom values
df_clean, report = handle_nan(df, strategy='mean')
df_clean, report = handle_nan(df, strategy='custom', fill_value={'age': 0, 'name': 'unknown'})

# Or drop rows/columns entirely
df_clean, report = handle_nan(df, strategy='drop', axis='rows')

3. reduce_memory() — Dtype downcasting

This one surprised me the most when I saw the results.

from pandasclean import reduce_memory

before = df.memory_usage(deep=True).sum() / (1024*1024)
df_optimized, report = reduce_memory(df)
after = df_optimized.memory_usage(deep=True).sum() / (1024*1024)

print(f"Before: {before:.2f} MB")
print(f"After:  {after:.2f} MB")
Before: 1527.07 MB
After:  371.93 MB
Reduction: 75.6%

On a 15 million row dataset. That number still makes me happy.

What's happening under the hood:

  • int64 → smallest safe integer type (int8, int16, or int32)
  • float64 → float32
  • Low cardinality string columns → category dtype
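That downcasting logic can be sketched roughly like this (illustrative only — downcast_demo and the cat_threshold cutoff are my own names, not pandasclean's API):

```python
import pandas as pd

def downcast_demo(df: pd.DataFrame, cat_threshold: float = 0.5) -> pd.DataFrame:
    """Shrink dtypes: ints/floats to the smallest safe type, low-cardinality
    object columns to category."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Picks int8/int16/int32 depending on the value range
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_object_dtype(out[col]):
            # Few unique values relative to length → category pays off
            if out[col].nunique() / len(out[col]) < cat_threshold:
                out[col] = out[col].astype("category")
    return out

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0],
    "c": ["x", "x", "x", "x", "y"],
})
small = downcast_demo(df)
print(small.dtypes)  # a: int8, b: float32, c: category
```

The category conversion is where most of the savings come from on string-heavy data: each repeated string is stored once, with rows holding small integer codes.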

4. auto_clean() — One function to rule them all

from pandasclean import auto_clean

df_clean, report = auto_clean(df)

Runs NaN handling, memory reduction, and outlier detection in the correct order, with sensible defaults.


The Interesting Technical Bits

Building this taught me things I never would have learned from a tutorial.

The IQR = 0 edge case — what happens when a column has constant values? Q1 == Q3, so IQR = 0, and the bounds collapse. I had to add a guard for this.
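The guard is simple once you see the failure mode (a minimal sketch, not the library's code — iqr_bounds_safe is a hypothetical name):

```python
import pandas as pd

def iqr_bounds_safe(series: pd.Series, multiplier: float = 1.5):
    """IQR bounds with a guard for constant columns (Q1 == Q3)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        # Constant column: both bounds collapse onto the same value,
        # so every tiny deviation would be "an outlier". Skip instead.
        return None
    return q1 - multiplier * iqr, q3 + multiplier * iqr

print(iqr_bounds_safe(pd.Series([5, 5, 5, 5])))  # → None
```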

pandas StringDtype compatibility — newer versions of pandas use pd.StringDtype() instead of plain object for string columns. My is_object_dtype() check was returning False for string columns and silently skipping them. Fixed by also checking is_string_dtype().
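The bug reproduces in a few lines (a standalone sketch of the dtype checks, not pandasclean's code):

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

s_obj = pd.Series(["a", "b"], dtype=object)      # classic string column
s_str = pd.Series(["a", "b"], dtype="string")    # pd.StringDtype()

print(is_object_dtype(s_obj), is_string_dtype(s_obj))  # True True
print(is_object_dtype(s_str), is_string_dtype(s_str))  # False True — the bug

# Checking both catches string columns under either dtype
def is_stringy(s: pd.Series) -> bool:
    return is_object_dtype(s) or is_string_dtype(s)
```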

Nullable integers — pandas has two integer systems. int64 (numpy) can't hold NaN. Int64 (pandas nullable, capital I) can. Trying to downcast an Int64 column containing NA to numpy int8 raises "cannot convert NA to integer". The fix was to detect integer columns with missing values and downcast them to the nullable Int8/Int16/Int32 types instead.
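A minimal reproduction of that fix (a sketch under the same idea; the library's actual logic differs):

```python
import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype="Int64")  # nullable integer with a missing value

# s.astype("int8") would fail here: numpy integers have no NA representation.
# Downcasting to the nullable counterpart keeps the missing value intact:
small = s.astype("Int8")
print(small.dtype)       # Int8
print(small.isna().sum())  # 1 — the NA survived the downcast
```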


Publishing to PyPI

This was simpler than I expected. You need three things:

1. pyproject.toml:

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "pandasclean"
version = "0.1.2"
description = "A lightweight library for cleaning and optimizing pandas DataFrames"
dependencies = ["pandas>=1.3.0", "numpy>=1.21.0"]

2. Build:

pip install build twine
python -m build

3. Upload:

twine upload dist/*

That's it. Your library is live.


Results

  • 96 downloads on day one
  • 75.6% memory reduction on a 15M row dataset
  • Works across Python 3.8 to 3.12

What I Learned

More than I expected. Here's the short version:

  • Write tests before you think you need them — they saved me multiple times
  • Edge cases hide everywhere — IQR = 0, nullable integers, pandas version differences
  • Shipping something imperfect is better than perfecting something unshipped
  • A published library on your resume hits differently than a GitHub repo

What's Next

  • Z-score outlier detection (v0.2.0)
  • Column name standardisation
  • Duplicate detection
  • sklearn pipeline integration

Links

If you try it and something breaks — please open an issue. Feedback from real users is worth more than anything else at this stage.