Building pandasclean — a pandas data cleaning library from scratch to PyPI

Atharva Shah

How I Built and Published My First Python Library as a Semester 4 Student

Every data project I start looks the same. Load the data, spend 30 minutes hunting for outliers, write the same NaN handling code I wrote last week, watch my notebook eat RAM. Then repeat it all for the next project.

I got tired of it. So I built a library.

This is the story of how I went from a frustrated CS student to publishing pandasclean on PyPI — and what I learned along the way.


The Idea

It started simple. I just wanted a function that could detect outliers and let me choose what to do with them. But once I had that, I thought — why not add NaN handling? And memory reduction? And a single function that runs everything?

Three weeks later I had a published library.


What pandasclean Does

pip install pandasclean

It has four core functions:

1. find_outliers() — IQR-based outlier detection

from pandasclean import find_outliers

# Just show me the bounds
df, bounds = find_outliers(df, strategy='report')

# Drop outlier rows
df_clean, bounds = find_outliers(df, strategy='drop')

# Cap values instead of dropping (Winsorization)
df_clean, bounds = find_outliers(df, strategy='cap')

The IQR method computes bounds as:

  • lower = Q1 - (multiplier × IQR)
  • upper = Q3 + (multiplier × IQR)

Use multiplier=1.5 for mild outliers and multiplier=3.0 for extreme ones only.
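The bound computation above can be sketched in plain pandas (an illustrative standalone snippet, not pandasclean's actual implementation — the function name iqr_bounds is mine):

```python
import pandas as pd

def iqr_bounds(series: pd.Series, multiplier: float = 1.5):
    """Compute IQR-based lower/upper outlier bounds for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier
lower, upper = iqr_bounds(s)
print(lower, upper)  # -1.0 7.0 — so 100 falls outside the upper bound
```

Anything below lower or above upper is treated as an outlier; with multiplier=3.0 the bounds widen, so only extreme values get flagged.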

2. handle_nan() — Missing value handling

from pandasclean import handle_nan

# Fill with mean, median, or custom values
df_clean, report = handle_nan(df, strategy='mean')
df_clean, report = handle_nan(df, strategy='custom', fill_value={'age': 0, 'name': 'unknown'})

# Or drop rows/columns entirely
df_clean, report = handle_nan(df, strategy='drop', axis='rows')

3. reduce_memory() — Dtype downcasting

This one surprised me the most when I saw the results.

from pandasclean import reduce_memory

before = df.memory_usage(deep=True).sum() / (1024*1024)
df_optimized, report = reduce_memory(df)
after = df_optimized.memory_usage(deep=True).sum() / (1024*1024)

print(f"Before: {before:.2f} MB")
print(f"After:  {after:.2f} MB")
Before: 1527.07 MB
After:  371.93 MB
Reduction: 75.6%

On a 15 million row dataset. That number still makes me happy.

What's happening under the hood:

  • int64 → smallest safe integer type (int8, int16, or int32)
  • float64 → float32
  • Low cardinality string columns → category dtype
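That downcasting logic can be sketched roughly like this (illustrative only — downcast_demo and the cat_threshold cutoff are my own names, not pandasclean's API):

```python
import pandas as pd

def downcast_demo(df: pd.DataFrame, cat_threshold: float = 0.5) -> pd.DataFrame:
    """Shrink dtypes: ints/floats to the smallest safe type, low-cardinality
    object columns to category."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Picks int8/int16/int32 depending on the value range
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_object_dtype(out[col]):
            # Few unique values relative to length → category pays off
            if out[col].nunique() / len(out[col]) < cat_threshold:
                out[col] = out[col].astype("category")
    return out

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0],
    "c": ["x", "x", "x", "x", "y"],
})
small = downcast_demo(df)
print(small.dtypes)  # a: int8, b: float32, c: category
```

The category conversion is where most of the savings come from on string-heavy data: each repeated string is stored once, with rows holding small integer codes.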

4. auto_clean() — One function to rule them all

from pandasclean import auto_clean

df_clean, report = auto_clean(df)

Runs NaN handling, memory reduction, and outlier detection in the correct order, with sensible defaults.


The Interesting Technical Bits

Building this taught me things I never would have learned from a tutorial.

The IQR = 0 edge case — what happens when a column has constant values? Q1 == Q3, so IQR = 0, and the bounds collapse. I had to add a guard for this.
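The guard is simple once you see the failure mode (a minimal sketch, not the library's code — iqr_bounds_safe is a hypothetical name):

```python
import pandas as pd

def iqr_bounds_safe(series: pd.Series, multiplier: float = 1.5):
    """IQR bounds with a guard for constant columns (Q1 == Q3)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        # Constant column: both bounds collapse onto the same value,
        # so every tiny deviation would be "an outlier". Skip instead.
        return None
    return q1 - multiplier * iqr, q3 + multiplier * iqr

print(iqr_bounds_safe(pd.Series([5, 5, 5, 5])))  # → None
```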

pandas StringDtype compatibility — newer versions of pandas use pd.StringDtype() instead of plain object for string columns. My is_object_dtype() check was returning False for string columns and silently skipping them. Fixed by also checking is_string_dtype().
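The bug reproduces in a few lines (a standalone sketch of the dtype checks, not pandasclean's code):

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

s_obj = pd.Series(["a", "b"], dtype=object)      # classic string column
s_str = pd.Series(["a", "b"], dtype="string")    # pd.StringDtype()

print(is_object_dtype(s_obj), is_string_dtype(s_obj))  # True True
print(is_object_dtype(s_str), is_string_dtype(s_str))  # False True — the bug

# Checking both catches string columns under either dtype
def is_stringy(s: pd.Series) -> bool:
    return is_object_dtype(s) or is_string_dtype(s)
```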

Nullable integers — pandas has two integer systems. int64 (numpy) can't hold NaN. Int64 (pandas nullable, capital I) can. Trying to downcast an Int64 column containing NA to numpy int8 raises "cannot convert NA to integer". The fix was to detect integer columns with missing values and downcast them to the nullable Int8/Int16/Int32 types instead.
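A minimal reproduction of that fix (a sketch under the same idea; the library's actual logic differs):

```python
import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype="Int64")  # nullable integer with a missing value

# s.astype("int8") would fail here: numpy integers have no NA representation.
# Downcasting to the nullable counterpart keeps the missing value intact:
small = s.astype("Int8")
print(small.dtype)       # Int8
print(small.isna().sum())  # 1 — the NA survived the downcast
```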


Publishing to PyPI

This was simpler than I expected. You need three things:

1. pyproject.toml:

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "pandasclean"
version = "0.1.2"
description = "A lightweight library for cleaning and optimizing pandas DataFrames"
dependencies = ["pandas>=1.3.0", "numpy>=1.21.0"]

2. Build:

pip install build twine
python -m build

3. Upload:

twine upload dist/*

That's it. Your library is live.


Results

  • 96 downloads on day one
  • 75.6% memory reduction on a 15M row dataset
  • Works across Python 3.8 to 3.12

What I Learned

More than I expected. Here's the short version:

  • Write tests before you think you need them — they saved me multiple times
  • Edge cases hide everywhere — IQR = 0, nullable integers, pandas version differences
  • Shipping something imperfect is better than perfecting something unshipped
  • A published library on your resume hits differently than a GitHub repo

What's Next

  • Z-score outlier detection (v0.2.0)
  • Column name standardisation
  • Duplicate detection
  • sklearn pipeline integration

Links

If you try it and something breaks — please open an issue. Feedback from real users is worth more than anything else at this stage.