I Wasted 40 Hours Rebuilding the Same Python Scraper. So I Stopped.

#python #webdev #beginners #programming
Otto Brennan


Every time I start a new scraping project, I spend the first few hours doing the same things:

  • Setting up rotating user agents
  • Adding retry logic with exponential backoff
  • Wiring up proxy rotation
  • Writing yet another CSV exporter
  • Dealing with JavaScript-rendered pages

This isn't the scraping work. It's setup work. And I kept doing it, project after project, because my previous code was buried in some old repo in a slightly different form.

Last month I finally got fed up and spent a weekend extracting all of it into a proper reusable kit. Here's what I ended up with.


The Core Problem With Most Scraping Code

Scraping tutorials always show you the happy path:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com")
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.find("h1").text)

That works great — until you hit the real world, where:

  • Sites block repeated requests from the same IP
  • Pages return 429 with no warning
  • The data you want requires JavaScript to render
  • Your script runs fine for 5 pages, then silently fails
  • You need the data in 3 formats for 3 different tools

What Production Scraping Actually Needs

1. Retries with backoff (not just try/except)

A bare try/except that swallows the error and moves on is almost as bad as no error handling at all:

# Bad: silent failure
try:
    resp = requests.get(url)
except:
    pass

# Good: retry with backoff
@retry(max_tries=3, backoff=2.0)
def fetch(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp

The retry decorator I use handles exponential backoff automatically, with jitter to avoid thundering herd.
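The decorator itself isn't shown above, so here's a minimal sketch of how one with exponential backoff and full jitter could look. The `base` and `exceptions` parameters are my additions, not the kit's API:

```python
import functools
import random
import time

def retry(max_tries=3, backoff=2.0, base=1.0, exceptions=(Exception,)):
    """Retry the wrapped function, multiplying the wait window by
    `backoff` after each failure, with full jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base
            for attempt in range(1, max_tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries:
                        raise  # out of retries: surface the real error
                    # full jitter: sleep a random slice of the current window
                    time.sleep(random.uniform(0, delay))
                    delay *= backoff
        return wrapper
    return decorator
```

Full jitter (a random sleep between zero and the current window, rather than the window itself) is what prevents a fleet of retrying clients from hammering the server in lockstep.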

2. Rotating headers (not just user-agent)

Rotating only the User-Agent is an obvious trick, and sites know to look for it. Real browsers send a consistent, complete set of headers:

def get_headers(url):
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.8"]),
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Referer": random.choice(REFERERS),
        # ... 6 more headers that real browsers send
    }
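To make that snippet self-contained, here are illustrative definitions for the two pools it references (the kit's actual lists are much longer, and these sample values are mine, not the kit's):

```python
import random

# Illustrative pools -- the kit ships 50+ UA strings and more referers
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]
REFERERS = ["https://www.google.com/", "https://duckduckgo.com/"]

def get_headers(url=None):
    """Build a coherent browser-like header set for one request."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.8"]),
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Referer": random.choice(REFERERS),
    }
```

Then each request gets a fresh, internally consistent set: `requests.get(url, headers=get_headers(url))`.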

3. Polite delays (not just time.sleep(1))

Fixed delays are easy to detect. Real human browsing has variance:

def polite_delay(min_s=1.5, max_s=4.0):
    time.sleep(random.uniform(min_s, max_s))

The jitter also means your requests never hit the server at perfectly regular intervals, which both looks more human and spreads out the load you create.

4. Unified storage

Different consumers need different formats. I got tired of writing three exporters:

store = DataStore("output", formats=["csv", "json", "sqlite"])

for item in scrape_results:
    store.save(item)

store.finalize()  # writes all three formats at once

It buffers rows in memory, flushes every format at once on finalize(), and infers the schema automatically from the saved items.

5. JavaScript rendering when you need it

Some sites are just not scrapable with requests. Playwright solves this:

class JSScraper:
    def get_html(self, url, wait_for=None, scroll=False):
        with sync_playwright() as pw:
            browser = pw.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url)

            if wait_for:
                page.wait_for_selector(wait_for)  # block until the content appears

            if scroll:
                self._scroll_to_bottom(page)  # loads infinite scroll content

            html = page.content()
            browser.close()
            return html
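The `_scroll_to_bottom` helper isn't shown above; the usual heuristic is to keep scrolling until `document.body.scrollHeight` stops growing. A standalone sketch of that idea (the `pause_s` and `max_rounds` knobs are my assumptions, not the kit's API):

```python
def scroll_to_bottom(page, pause_s=1.0, max_rounds=20):
    """Scroll a Playwright page until its height stops growing,
    which is the usual signal that infinite scroll is exhausted."""
    last_height = page.evaluate("document.body.scrollHeight")
    for _ in range(max_rounds):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(int(pause_s * 1000))  # let lazy content load
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded: we've hit the real bottom
        last_height = new_height
```

The `max_rounds` cap matters: some feeds genuinely never end, and without it the loop doesn't either.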

The Kit

I packaged everything into 10 pieces: 4 scrapers and 6 infrastructure files. Each works standalone or as a module.

Scrapers:

  • scraper_basic.py — requests + BeautifulSoup with all the above
  • scraper_js.py — Playwright for JS-heavy sites, with scroll and click helpers
  • scraper_api.py — REST API pagination (offset-based and cursor-based)
  • scraper_sitemap.py — XML sitemap crawler

Infrastructure:

  • proxy_pool.py — fetches free proxies, validates them concurrently, rotates
  • storage.py — CSV/JSON/SQLite output
  • stealth_headers.py — 50+ UA strings, full header sets
  • utils.py — retry decorator, data cleaners, URL normalization
  • config.py — one central config file
  • example_project/ — end-to-end example against a real site
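To give a taste of the API scraper's pagination handling, here's a sketch of the cursor-based variant. The key names (`items`, `next_cursor`, `cursor`) are placeholders — every API names these differently — and `session` can be a `requests.Session` or anything with a compatible `.get`:

```python
def paginate_cursor(session, url, params=None, cursor_param="cursor",
                    items_key="items", next_key="next_cursor"):
    """Yield every item from a cursor-paginated API, page by page.

    The key names are illustrative defaults -- adjust them per API.
    """
    params = dict(params or {})
    while True:
        resp = session.get(url, params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        yield from data.get(items_key, [])
        cursor = data.get(next_key)
        if not cursor:
            break  # no next page
        params[cursor_param] = cursor
```

Because it's a generator, you can feed it straight into a DataStore-style sink without holding every page in memory.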

If you want it: Web Scraping Starter Kit on Gumroad — $24, instant download.


Two Other Things I Rebuilt

While I was at it, I also cleaned up two other kits:

Python Automation Scripts Pack ($19) — 10 standalone scripts: file organizer, bulk renamer, CSV processor, email sender, PDF merger, API fetcher, image resizer, and more. Same philosophy: copy the file, edit the config block at the top, run it.

Freelance CRM in Notion ($9) — 5-database Notion workspace for managing clients, projects, invoices, and time. If you freelance and you're still tracking this in a spreadsheet, this is worth 9 dollars.


What I Learned From This

The scraping kit took about a weekend to clean up and document. I'd been doing this work for years — the time investment was in organizing it, not building it.

If you have code you keep rewriting, it's worth extracting. At minimum you'll stop losing it. If it's useful to you, it's probably useful to others.

What would you add to a scraping kit? Drop a comment — I'm planning to add async support with httpx and a smarter sitemap priority scorer next.