Otto Brennan
Every time I start a new scraping project, I spend the first few hours doing the same things: wiring up retry logic, rotating headers, adding polite delays, and setting up storage.
This isn't the scraping work. It's setup work. And I kept doing it, project after project, because my previous code was buried in some old repo in a slightly different form.
Last month I finally got fed up and spent a weekend extracting all of it into a proper reusable kit. Here's what I ended up with.
Scraping tutorials always show you the happy path:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://example.com")
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.find("h1").text)
That works great — until you hit the real world, where requests time out, sites block default headers, rate limiters ban you, and half the content is rendered by JavaScript.
Retries with backoff (not just try/except)
A bare try/except that logs and continues is almost as bad as no error handling:
# Bad: silent failure
try:
    resp = requests.get(url)
except:
    pass
# Good: retry with backoff
@retry(max_tries=3, backoff=2.0)
def fetch(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp
The retry decorator I use handles exponential backoff automatically, with jitter to avoid a thundering herd.
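A minimal version of such a decorator might look like this — the name, parameters, and defaults here are assumptions for illustration, not the kit's actual API:

```python
import random
import time
from functools import wraps

def retry(max_tries=3, backoff=2.0, base_delay=1.0, jitter=0.5,
          exceptions=(Exception,)):
    """Retry a function with exponential backoff plus random jitter.

    Hypothetical sketch of the decorator described above; the kit's
    real signature may differ.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries:
                        raise  # out of tries: surface the last error
                    # jitter desynchronizes many clients retrying at once
                    time.sleep(delay + random.uniform(0, jitter))
                    delay *= backoff  # exponential growth between attempts
        return wrapper
    return decorator
```

Re-raising on the final attempt matters: the caller still sees the real exception instead of a silent `None`.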
Rotating just the User-Agent is the obvious move, and sites know to look for it. Real browsers send a consistent set of headers:
def get_headers(url):
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.8"]),
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Referer": random.choice(REFERERS),
        # ... 6 more headers that real browsers send
    }
Randomized delays (not time.sleep(1))
Fixed delays are easy to detect. Real human browsing has variance:
def polite_delay(min_s=1.5, max_s=4.0):
    time.sleep(random.uniform(min_s, max_s))
The randomness also means you never hit the server at exactly the same interval, which is another pattern rate limiters watch for.
Different consumers need different formats. I got tired of writing three exporters:
store = DataStore("output", formats=["csv", "json", "sqlite"])

for item in scrape_results:
    store.save(item)

store.finalize()  # writes all three formats at once
The store buffers items in memory and flushes every requested format on finalize(), inferring the schema automatically.
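A stripped-down sketch of how a store like this could work — class shape and method names mirror the snippet above, but the internals are my assumptions, not the kit's code:

```python
import csv
import json
import sqlite3
from pathlib import Path

class DataStore:
    """Hypothetical minimal buffered multi-format store."""

    def __init__(self, out_dir, formats=("csv", "json", "sqlite")):
        self.out = Path(out_dir)
        self.out.mkdir(parents=True, exist_ok=True)
        self.formats = formats
        self.rows = []

    def save(self, item):
        self.rows.append(dict(item))  # buffer in memory until finalize()

    def finalize(self, name="data"):
        # schema inference: union of keys across all rows, first-seen order
        fields = list(dict.fromkeys(k for row in self.rows for k in row))
        if "csv" in self.formats:
            with open(self.out / f"{name}.csv", "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=fields)
                writer.writeheader()
                writer.writerows(self.rows)  # missing keys become blanks
        if "json" in self.formats:
            (self.out / f"{name}.json").write_text(json.dumps(self.rows, indent=2))
        if "sqlite" in self.formats:
            con = sqlite3.connect(self.out / f"{name}.db")
            cols = ", ".join(f'"{c}" TEXT' for c in fields)
            con.execute(f'CREATE TABLE IF NOT EXISTS "{name}" ({cols})')
            placeholders = ", ".join("?" for _ in fields)
            con.executemany(
                f'INSERT INTO "{name}" VALUES ({placeholders})',
                [tuple(row.get(c) for c in fields) for row in self.rows],
            )
            con.commit()
            con.close()
```

Inferring the schema at finalize time (rather than from the first row) means late-appearing fields still get columns.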
Some sites are just not scrapable with requests. Playwright solves this:
class JSScraper:
    def get_html(self, url, wait_for=None, scroll=False):
        with sync_playwright() as pw:
            browser = pw.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url)
            if wait_for:
                page.wait_for_selector(wait_for)  # wait for dynamic content
            if scroll:
                self._scroll_to_bottom(page)  # loads infinite scroll content
            html = page.content()
            browser.close()
            return html
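The scroll helper referenced above might look something like this — a sketch of the usual "scroll until the page stops growing" loop, not the kit's actual implementation:

```python
def scroll_to_bottom(page, pause_ms=1000, max_rounds=20):
    """Scroll until document height stops growing (infinite-scroll loader).

    Hypothetical sketch: `page` is a Playwright Page, or anything with
    compatible evaluate() / wait_for_timeout() methods.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    for _ in range(max_rounds):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy content time to load
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared: we reached the real bottom
        last_height = new_height
```

The `max_rounds` cap keeps a page that loads content forever from hanging the scraper.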
I packaged everything into 10 files: 4 scrapers and 6 supporting modules. Each file works standalone or as part of the package.
Scrapers:
- scraper_basic.py — requests + BeautifulSoup with all the above
- scraper_js.py — Playwright for JS-heavy sites, with scroll and click helpers
- scraper_api.py — REST API pagination (offset-based and cursor-based)
- scraper_sitemap.py — XML sitemap crawler

Infrastructure:
- proxy_pool.py — fetches free proxies, validates them concurrently, rotates
- storage.py — CSV/JSON/SQLite output
- stealth_headers.py — 50+ UA strings, full header sets
- utils.py — retry decorator, data cleaners, URL normalization
- config.py — one central config file
- example_project/ — end-to-end example against a real site

If you want it: Web Scraping Starter Kit on Gumroad — $24, instant download.
While I was at it, I also cleaned up two other kits:
Python Automation Scripts Pack ($19) — 10 standalone scripts: file organizer, bulk renamer, CSV processor, email sender, PDF merger, API fetcher, image resizer, and more. Same philosophy: copy the file, edit the config block at the top, run it.
Freelance CRM in Notion ($9) — 5-database Notion workspace for managing clients, projects, invoices, and time. If you freelance and you're still tracking this in a spreadsheet, this is worth 9 dollars.
The scraping kit took about a weekend to clean up and document. I'd been doing this work for years — the time investment was in organizing it, not building it.
If you have code you keep rewriting, it's worth extracting. At minimum you'll stop losing it. If it's useful to you, it's probably useful to others.
What would you add to a scraping kit? Drop a comment — I'm planning to add async support with httpx and a smarter sitemap priority scorer next.