
TL;DR — requests plus BeautifulSoup is the right tool for tutorials, side projects, and one-off audits. It is the wrong tool for any scraper that has to run unsupervised, longer than a quarter, against a site that has even basic bot defenses. I've watched a dozen teams discover this the expensive way. Here's the diagnosis and the replacement.
I'm not anti-requests. The library is fast, predictable, and elegant. For 30% of scraping tasks it's still what I reach for first. The problem is that the rest of the scraping pipeline — JavaScript-rendered content, fingerprinting checks, modern auth flows, lazy loading — silently breaks the assumptions requests is built on.
Most teams discover this in stages. Here's the timeline.
You write the first version. requests.get(url) returns 200, BeautifulSoup parses the response, you find your selectors, you ship. Tests pass against the small URL set you tested with. Lunch.
You notice maybe 5% of pages return rows where half the fields are None. You add a check, log the URL, retry. The retry sometimes works.
What's actually happening: those pages render their data in JavaScript after the initial response. requests got the HTML skeleton. The data was never in it. The retries that "work" are coincidence — sometimes the cached page has stale rendered data; sometimes a CDN ships a different variant.
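A blind retry hides this signal. One alternative, sketched below, is to measure how incomplete a row is and flag the URL as probably unrendered instead of fetching it again. The names `missing_fields` and `looks_unrendered`, the field list, and the 0.5 threshold are all assumptions for illustration, not something from a real pipeline.

```python
from typing import List, Mapping, Sequence


def missing_fields(row: Mapping[str, object], required: Sequence[str]) -> List[str]:
    """Return the required field names that are absent, None, or empty strings."""
    return [name for name in required if row.get(name) in (None, "")]


def looks_unrendered(
    row: Mapping[str, object],
    required: Sequence[str],
    threshold: float = 0.5,  # assumed cut-off: "half the fields are None"
) -> bool:
    """True when at least `threshold` of the required fields are missing.

    A row like this usually means the parser saw the HTML skeleton,
    so retrying the same plain HTTP fetch is unlikely to help.
    """
    if not required:
        return False
    return len(missing_fields(row, required)) / len(required) >= threshold
```

Rows that trip `looks_unrendered` are candidates for a browser-based fetch; rows missing a single field are more likely a selector problem.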
The target site rolls out a fingerprinting check. requests sends a default User-Agent that screams python-requests/2.31.0. You add headers. It works for two days. Then they tighten the check: now they look at the TLS fingerprint, not just the User-Agent. requests negotiates TLS through Python's ssl module (OpenSSL), and that handshake looks different from any real browser's. The block returns.
Login flow now requires a CSRF token, which is rendered in JavaScript, which requests can't run. You spend two days reverse-engineering the login flow, find the API endpoint behind it, hit that directly. Works for six weeks. They rotate the auth scheme.
You finally migrate. Most of the team is annoyed because the rewrite took longer than they wanted. The team that does it later is annoyed for the same reason.
The fundamental issue: requests is an HTTP client. Modern websites are browser applications. The thing you're scraping is the output of running JavaScript, not a static document. You can fight that for a while — by reverse-engineering APIs, faking TLS fingerprints, hand-rolling JS interpreters — but you're paying interest on a debt you took on the day you reached for requests instead of a real browser.
Specific failure modes you're going to hit:
- <div id="root"></div> and not much else.
- requests looks like Python; real browsers look like Chrome/Firefox. Block lists distinguish them easily.

requests answers none of these. Playwright (or Puppeteer) answers all of them, because it drives a real browser.
Skip the year of pain. Start with Playwright. Use requests only when you've measured that the data is in the static HTML and the site has no fingerprinting:
```python
from playwright.async_api import async_playwright


async def scrape(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        ctx = await browser.new_context(
            user_agent="Mozilla/5.0 (...)",
            viewport={"width": 1920, "height": 1080},
        )
        page = await ctx.new_page()
        # Block heavy resources for speed.
        await page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}",
                         lambda r: r.abort())
        await page.goto(url, wait_until="domcontentloaded")
        # Wait for the *data* to appear, not just the document.
        await page.wait_for_selector('[data-product-id]', timeout=15_000)
        # extract_fields is the site-specific parsing helper; it is not shown in this post.
        return await extract_fields(page)
```
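The resource-blocking step can also be written as a named handler, which is easier to test than a glob. The sketch below is one possible variant, not the version used in the snippet above: `is_heavy_resource` and `block_heavy` are hypothetical names, and the extension set simply mirrors the glob. It checks only the URL path, so asset URLs that carry a query string (for example `?v=3`) are still caught; whether the glob form catches those is worth verifying against your Playwright version.

```python
from urllib.parse import urlsplit

# Same extensions as the glob in the scrape() snippet.
HEAVY_EXTENSIONS = {"png", "jpg", "jpeg", "gif", "svg", "woff", "woff2"}


def is_heavy_resource(url: str) -> bool:
    """True when the URL path ends in one of the heavy extensions.

    Query strings and fragments are ignored because only the path is inspected.
    """
    path = urlsplit(url).path.lower()
    _, dot, ext = path.rpartition(".")
    return bool(dot) and ext in HEAVY_EXTENSIONS


async def block_heavy(route) -> None:
    """Playwright route handler: abort heavy resources, let everything else through."""
    if is_heavy_resource(route.request.url):
        await route.abort()
    else:
        await route.continue_()
```

Inside `scrape`, this would be registered with `await page.route("**/*", block_heavy)` in place of the glob and lambda.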
Five things requests can't give you that Playwright does for free:
- wait_for_selector: semantic waits instead of time.sleep.

When requests is still right
Static documentation sites. Open RSS/Atom feeds. JSON APIs that don't require login. PDFs and CSVs hosted on S3. Anything where you've actually fetched the URL, looked at the response body, and confirmed your data is in it.
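That confirmation step is cheap to automate. Here is a minimal sketch using only the standard library: it counts elements in a response body that carry a marker attribute. The function name `count_marked_elements` is hypothetical, and `data-product-id` is borrowed from the selector in the Playwright snippet as an assumed marker; substitute whatever attribute your data actually hangs on.

```python
from html.parser import HTMLParser


class _AttrProbe(HTMLParser):
    """Counts start tags that carry a given attribute."""

    def __init__(self, attr_name: str) -> None:
        super().__init__()
        # HTMLParser lowercases attribute names, so normalize the target too.
        self.attr_name = attr_name.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if any(name == self.attr_name for name, _ in attrs):
            self.count += 1


def count_marked_elements(html: str, attr_name: str = "data-product-id") -> int:
    """Return how many elements in `html` have the attribute `attr_name`."""
    probe = _AttrProbe(attr_name)
    probe.feed(html)
    probe.close()
    return probe.count
```

Feed it the raw body, for example `count_marked_elements(requests.get(url, timeout=10).text)`. A result of zero on a page that shows products in a browser suggests the data arrives via JavaScript, and requests alone won't see it.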
That's a real category. Just don't assume the next site you scrape will fall into it.
Across our actor portfolio, the migration ratio settled around 80/20 — Playwright for 80% of jobs, requests for the 20% where the data is genuinely static. The 80% includes our entire Sephora catalog pipeline, which spent its first version as a requests + BeautifulSoup script and never made it past month 2. The Playwright rewrite has been running unsupervised for 14 months.
If your scraper is currently 100% requests, your sample size isn't "this works fine." Your sample size is "the sites I've scraped so far happen to have static HTML."
Which of the five failure modes have you shipped to production? Drop the symptom in the comments — I'll point at the fix.
Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.