Otto Brennan
Every time I start a new scraping project, I spend the first few hours doing the same things: wiring up retry logic, rotating headers, adding polite delays, and setting up storage.
This isn't the scraping work. It's setup work. And I kept doing it, project after project, because my previous code was buried in some old repo in a slightly different form.
Last month I finally got fed up and spent a weekend extracting all of it into a proper reusable kit. Here's what I ended up with.
Scraping tutorials always show you the happy path:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://example.com")
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.find("h1").text)
That works great — until you hit the real world, where requests time out, sites block default headers, rate limiters ban you, and half the content is rendered by JavaScript.
Retries with backoff (not just try/except)
A bare try/except that logs and continues is almost as bad as no error handling:
# Bad: silent failure
try:
    resp = requests.get(url)
except:
    pass
# Good: retry with backoff
@retry(max_tries=3, backoff=2.0)
def fetch(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp
The retry decorator I use handles exponential backoff automatically, with jitter to avoid a thundering herd.
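A minimal version of such a decorator might look like this — the name, parameters, and defaults here are assumptions for illustration, not the kit's actual API:

```python
import random
import time
from functools import wraps

def retry(max_tries=3, backoff=2.0, base_delay=1.0, jitter=0.5,
          exceptions=(Exception,)):
    """Retry a function with exponential backoff plus random jitter.

    Hypothetical sketch of the decorator described above; the kit's
    real signature may differ.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries:
                        raise  # out of tries: surface the last error
                    # jitter desynchronizes many clients retrying at once
                    time.sleep(delay + random.uniform(0, jitter))
                    delay *= backoff  # exponential growth between attempts
        return wrapper
    return decorator
```

Re-raising on the final attempt matters: the caller still sees the real exception instead of a silent `None`.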
Rotating just the User-Agent is the obvious move, and sites know to look for it. Real browsers send a consistent set of headers:
def get_headers(url):
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.8"]),
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Referer": random.choice(REFERERS),
        # ... 6 more headers that real browsers send
    }
Randomized delays (not time.sleep(1))
Fixed delays are easy to detect. Real human browsing has variance:
def polite_delay(min_s=1.5, max_s=4.0):
    time.sleep(random.uniform(min_s, max_s))
The randomness also means you never hit the server at exactly the same interval, which is another pattern rate limiters watch for.
Different consumers need different formats. I got tired of writing three exporters:
store = DataStore("output", formats=["csv", "json", "sqlite"])

for item in scrape_results:
    store.save(item)

store.finalize()  # writes all three formats at once
The store buffers items in memory and flushes every requested format on finalize(), inferring the schema automatically.
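A stripped-down sketch of how a store like this could work — class shape and method names mirror the snippet above, but the internals are my assumptions, not the kit's code:

```python
import csv
import json
import sqlite3
from pathlib import Path

class DataStore:
    """Hypothetical minimal buffered multi-format store."""

    def __init__(self, out_dir, formats=("csv", "json", "sqlite")):
        self.out = Path(out_dir)
        self.out.mkdir(parents=True, exist_ok=True)
        self.formats = formats
        self.rows = []

    def save(self, item):
        self.rows.append(dict(item))  # buffer in memory until finalize()

    def finalize(self, name="data"):
        # schema inference: union of keys across all rows, first-seen order
        fields = list(dict.fromkeys(k for row in self.rows for k in row))
        if "csv" in self.formats:
            with open(self.out / f"{name}.csv", "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=fields)
                writer.writeheader()
                writer.writerows(self.rows)  # missing keys become blanks
        if "json" in self.formats:
            (self.out / f"{name}.json").write_text(json.dumps(self.rows, indent=2))
        if "sqlite" in self.formats:
            con = sqlite3.connect(self.out / f"{name}.db")
            cols = ", ".join(f'"{c}" TEXT' for c in fields)
            con.execute(f'CREATE TABLE IF NOT EXISTS "{name}" ({cols})')
            placeholders = ", ".join("?" for _ in fields)
            con.executemany(
                f'INSERT INTO "{name}" VALUES ({placeholders})',
                [tuple(row.get(c) for c in fields) for row in self.rows],
            )
            con.commit()
            con.close()
```

Inferring the schema at finalize time (rather than from the first row) means late-appearing fields still get columns.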
Some sites are just not scrapable with requests. Playwright solves this:
class JSScraper:
    def get_html(self, url, wait_for=None, scroll=False):
        with sync_playwright() as pw:
            browser = pw.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url)
            if wait_for:
                page.wait_for_selector(wait_for)  # wait for dynamic content
            if scroll:
                self._scroll_to_bottom(page)  # loads infinite scroll content
            html = page.content()
            browser.close()
            return html
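The scroll helper referenced above might look something like this — a sketch of the usual "scroll until the page stops growing" loop, not the kit's actual implementation:

```python
def scroll_to_bottom(page, pause_ms=1000, max_rounds=20):
    """Scroll until document height stops growing (infinite-scroll loader).

    Hypothetical sketch: `page` is a Playwright Page, or anything with
    compatible evaluate() / wait_for_timeout() methods.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    for _ in range(max_rounds):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy content time to load
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared: we reached the real bottom
        last_height = new_height
```

The `max_rounds` cap keeps a page that loads content forever from hanging the scraper.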
I packaged everything into 10 files: 4 scrapers and 6 supporting modules. Each file works standalone or as part of the package.
Scrapers:
- scraper_basic.py — requests + BeautifulSoup with all the above
- scraper_js.py — Playwright for JS-heavy sites, with scroll and click helpers
- scraper_api.py — REST API pagination (offset-based and cursor-based)
- scraper_sitemap.py — XML sitemap crawler

Infrastructure:
- proxy_pool.py — fetches free proxies, validates them concurrently, rotates
- storage.py — CSV/JSON/SQLite output
- stealth_headers.py — 50+ UA strings, full header sets
- utils.py — retry decorator, data cleaners, URL normalization
- config.py — one central config file
- example_project/ — end-to-end example against a real site

If you want it: Web Scraping Starter Kit on Gumroad — $24, instant download.
While I was at it, I also cleaned up two other kits:
Python Automation Scripts Pack ($19) — 10 standalone scripts: file organizer, bulk renamer, CSV processor, email sender, PDF merger, API fetcher, image resizer, and more. Same philosophy: copy the file, edit the config block at the top, run it.
Freelance CRM in Notion ($9) — 5-database Notion workspace for managing clients, projects, invoices, and time. If you freelance and you're still tracking this in a spreadsheet, this is worth 9 dollars.
The scraping kit took about a weekend to clean up and document. I'd been doing this work for years — the time investment was in organizing it, not building it.
If you have code you keep rewriting, it's worth extracting. At minimum you'll stop losing it. If it's useful to you, it's probably useful to others.
What would you add to a scraping kit? Drop a comment — I'm planning to add async support with httpx and a smarter sitemap priority scorer next.