许映洲
3:17 AM. My phone buzzed.
It was an alert from the monitoring system — the scheduled web scraper had crashed.
I grabbed my laptop, VPN'd into the server, and pulled the logs:
Error: ElementClickInterceptedException: element click intercepted
by iframe element: <iframe src="https://www.google.com/recaptcha/...">
The page was stuck on a reCAPTCHA challenge.
Honestly, this wasn't the first time. Last month it was hCaptcha. The month before that, Cloudflare Turnstile. Same script every time: the target site upgrades their bot detection, my scraper gets caught off guard, data collection stops, and the downstream pipeline breaks.
I stared at that CAPTCHA widget on the screen and thought: how am I supposed to deal with this?
The first thing that comes to mind: "don't let them know you're a bot."
The Puppeteer community has a popular solution: puppeteer-extra-plugin-stealth. It patches various fingerprints that Headless Chrome exposes — navigator.webdriver, Chrome DevTools Protocol artifacts, missing plugin lists, and so on.
Here's what the code looks like:
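import puppeteer from 'puppeteer-extra'; // .use() lives on the puppeteer-extra wrapper, not on puppeteer itself
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Register the plugin before launching so every new page gets the patches
puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Pretend to be human. The stealth plugin already masks navigator.webdriver;
// a manual override only helps if it is installed before any page script runs.
await page.evaluateOnNewDocument(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
});

await page.goto('https://example.com/protected-page');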
Sounds great, right? The problem is that it isn't guaranteed to work.
Cloudflare's bot detection team is continuously upgrading their systems. The maintainers of stealth plugins are essentially playing a cat-and-mouse game. You patch something today, they come up with a new detection method tomorrow.
More critically, stealth only reduces the probability of triggering a CAPTCHA — it doesn't eliminate it. For a scheduled task that needs to run 24/7, "it probably won't trigger" isn't good enough.
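One way to see how much stealth actually covers is to snapshot a few of the fingerprints mentioned above and check them in Node. This is a minimal sketch, not part of the stealth plugin: `auditFingerprint` and the snapshot shape are hypothetical names, and the three checks are only a small sample of what real bot detection inspects.

```javascript
// Hypothetical helper: flags common headless giveaways in a fingerprint snapshot.
// The snapshot is a plain object collected from the page (see the note below).
function auditFingerprint(snapshot) {
  const flags = [];
  if (snapshot.webdriver === true) flags.push('navigator.webdriver is true');
  if (snapshot.pluginCount === 0) flags.push('plugin list is empty');
  if (/HeadlessChrome/.test(snapshot.userAgent || '')) {
    flags.push('user agent contains HeadlessChrome');
  }
  return flags;
}

// Example snapshots (made-up values for illustration)
const unpatched = { webdriver: true, pluginCount: 0, userAgent: 'Mozilla/5.0 HeadlessChrome/120.0.0.0' };
const patched = { webdriver: false, pluginCount: 3, userAgent: 'Mozilla/5.0 Chrome/120.0.0.0' };
console.log(auditFingerprint(unpatched).length); // 3
console.log(auditFingerprint(patched).length); // 0
```

With Puppeteer, the snapshot could be collected with `page.evaluate(() => ({ webdriver: navigator.webdriver, pluginCount: navigator.plugins.length, userAgent: navigator.userAgent }))`. An empty result only means these particular checks passed, not that the page is undetectable.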
Verdict: Useful, but not a silver bullet. Good for reducing trigger frequency, but not suitable as your only line of defense.
If machines can't solve it, get a "smarter machine" to do it.
Services like 2Captcha, Anti-Captcha, and CapSolver work on a simple principle: you send them the challenge (a screenshot for image CAPTCHAs, or the site key and page URL for widget-based ones like reCAPTCHA), they either use AI to solve it or dispatch it to a real human to click through, and they send you back the result.
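import { Solver } from '2captcha-ts';

// The API key is read from a hypothetical TWOCAPTCHA_API_KEY environment variable
const solver = new Solver(process.env.TWOCAPTCHA_API_KEY);

async function bypassCaptcha(page, siteKey) {
  // Ask the service to solve the reCAPTCHA for this page; result.data holds the token
  const result = await solver.recaptcha({
    googlekey: siteKey,
    pageurl: page.url(),
  });

  // Write the token into the hidden response field the widget would normally fill
  await page.evaluate((token) => {
    document.querySelector('#g-recaptcha-response').value = token;
  }, result.data);

  await page.click('#submit-button');
}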
Each solve costs about $0.001 to $0.003. Doesn't sound like much. But let's do the math:
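To make that math concrete, here is a rough estimate. The volume of 500 solves per day and the 30-day month are assumptions chosen for illustration, not figures from a real workload; only the per-solve price range comes from the text above.

```javascript
// Rough monthly cost of a CAPTCHA-solving service.
// solvesPerDay and daysPerMonth are illustrative assumptions.
function estimateMonthlyCost(solvesPerDay, pricePerSolve, daysPerMonth = 30) {
  return solvesPerDay * daysPerMonth * pricePerSolve;
}

const low = estimateMonthlyCost(500, 0.001);
const high = estimateMonthlyCost(500, 0.003);
console.log(`$${low.toFixed(2)} to $${high.toFixed(2)} per month`);
```

At that assumed volume the bill lands at roughly $15 to $45 a month, and it scales linearly with the number of solves.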
A few dozen dollars a month is still within budget. But the real problems aren't about money:
Verdict: Functional, but cost, stability, and compliance are all questionable. Suitable for small-scale, non-critical use cases.
Another approach: since too many requests from one IP trigger CAPTCHAs, just keep switching IPs.
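import puppeteer from 'puppeteer';

const proxyList = [
  'http://user:pass@proxy1:8080',
  'http://user:pass@proxy2:8080',
  'http://user:pass@proxy3:8080',
];

async function crawlWithProxy(url) {
  // Pick a random proxy for this run
  const proxy = new URL(proxyList[Math.floor(Math.random() * proxyList.length)]);

  // Chrome ignores credentials embedded in --proxy-server, so pass only scheme, host, and port
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxy.protocol}//${proxy.host}`],
  });
  try {
    const page = await browser.newPage();
    // Answer the proxy's authentication challenge with the credentials from the URL
    await page.authenticate({ username: proxy.username, password: proxy.password });
    await page.goto(url);
    // ... scraping logic
  } finally {
    await browser.close(); // don't leak a Chrome process per crawl
  }
}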
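Random selection can keep handing out a proxy that has just been flagged. A slightly more careful sketch benches a proxy for a while after it runs into a CAPTCHA. `createProxyPool`, `next`, `reportBlocked`, and the ten-minute default cooldown are all hypothetical choices for illustration; the clock is injectable so the logic can be checked without waiting.

```javascript
function createProxyPool(proxies, { cooldownMs = 10 * 60 * 1000, now = () => Date.now() } = {}) {
  const blockedUntil = new Map(); // proxy -> timestamp when it may be used again
  let cursor = 0;

  return {
    // Return the next usable proxy in round-robin order, or null if all are cooling down
    next() {
      for (let i = 0; i < proxies.length; i++) {
        const proxy = proxies[(cursor + i) % proxies.length];
        if ((blockedUntil.get(proxy) ?? 0) <= now()) {
          cursor = (cursor + i + 1) % proxies.length;
          return proxy;
        }
      }
      return null;
    },
    // Call this when a request through `proxy` ran into a CAPTCHA
    reportBlocked(proxy) {
      blockedUntil.set(proxy, now() + cooldownMs);
    },
  };
}

// Usage with a fake clock (placeholder proxy names)
let clock = 0;
const pool = createProxyPool(['p1', 'p2'], { cooldownMs: 1000, now: () => clock });
console.log(pool.next()); // p1
pool.reportBlocked('p2');
console.log(pool.next()); // p1 (p2 is benched)
clock = 1000;
console.log(pool.next()); // p2 (cooldown elapsed)
```

This does nothing about proxy quality itself; it only avoids reusing an address right after it was challenged.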
Anyone who's used proxy pools knows the drill:
A friend of mine does e-commerce data collection. He spends over $2,000/month on proxies alone. He once told me, with complete sincerity: "My proxy bill is ten times my server bill."
Verdict: Can reduce trigger frequency, but expensive, and effectiveness depends entirely on proxy quality. Suitable for teams with budgets.
The three approaches above share a common flaw: they all try to "beat" the CAPTCHA.
But think about it differently — CAPTCHAs exist for a reason. They're designed to tell humans and machines apart. So why not let a human handle it?
This is the core idea behind "human-in-the-loop" automation:
The first step is knowing when a CAPTCHA appears on the page. The detection logic is actually straightforward:
const CAPTCHA_SELECTORS = [
  'iframe[src*="recaptcha"]',
  'iframe[src*="hcaptcha"]',
  'iframe[src*="challenges.cloudflare.com"]',
  'iframe[src*="turnstile"]',
  '.g-recaptcha',
  '.h-captcha',
  '#captcha',
  'div[data-sitekey]',
];

async function detectCaptcha(page) {
  for (const selector of CAPTCHA_SELECTORS) {
    try {
      const element = await page.$(selector); // null when nothing matches
      if (element) {
        // boundingBox() returns null if the element is not rendered
        const box = await element.boundingBox();
        return {
          detected: true,
          type: guessCaptchaType(selector),
          selector,
          position: box,
        };
      }
    } catch {
      // the query itself failed (for example, the page navigated mid-check); try the next selector
    }
  }
  return { detected: false };
}

function guessCaptchaType(selector) {
  if (selector.includes('recaptcha')) return 'reCAPTCHA';
  // '.h-captcha' is spelled with a hyphen, so check both forms
  if (selector.includes('hcaptcha') || selector.includes('h-captcha')) return 'hCaptcha';
  if (selector.includes('cloudflare') || selector.includes('turnstile')) return 'Cloudflare Turnstile';
  return 'unknown';
}
This covers the major CAPTCHA providers: Google reCAPTCHA, hCaptcha, and Cloudflare Turnstile. It uses iframe src attributes and class names for feature matching — simple but effective.
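`page.$` only queries the main frame, so a widget nested inside another iframe is missed. As a complement, the URLs of all frames on the page can be classified directly. This is a sketch under assumptions: `classifyCaptchaUrl` is a hypothetical helper, and the host and path patterns are the commonly seen ones for each provider, which may change.

```javascript
// Hypothetical helper: maps a frame URL to a CAPTCHA provider name, or null.
function classifyCaptchaUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return null; // not a parseable URL (for example, an empty string)
  }
  const host = url.hostname;
  const isHost = (domain) => host === domain || host.endsWith(`.${domain}`);

  if ((isHost('google.com') || isHost('recaptcha.net')) && url.pathname.startsWith('/recaptcha/')) {
    return 'reCAPTCHA';
  }
  if (isHost('hcaptcha.com')) return 'hCaptcha';
  if (host === 'challenges.cloudflare.com') return 'Cloudflare Turnstile';
  return null;
}

console.log(classifyCaptchaUrl('https://www.google.com/recaptcha/api2/anchor?k=abc')); // reCAPTCHA
console.log(classifyCaptchaUrl('https://example.com/')); // null
```

In Puppeteer, `page.frames().map((frame) => frame.url())` yields the URLs to feed in.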
Once a CAPTCHA is detected, the key is to let a real person see the current page and manually interact with it:
async function handleCaptcha(page) {
  const captcha = await detectCaptcha(page);
  if (!captcha.detected) return false;

  console.log(`[CAPTCHA] ${captcha.type} detected. Pausing for human intervention...`);

  // Live preview: Chrome's remote-debugging index page, which lists the inspectable tabs.
  // Assumes the browser was launched with --remote-debugging-port=9222.
  const previewUrl = 'http://localhost:9222';

  // Send notification (Slack, Telegram, Discord, etc.); sendNotification is your own helper
  await sendNotification({
    title: 'CAPTCHA detected: manual intervention needed',
    message: `Type: ${captcha.type}\nPreview: ${previewUrl}`,
    urgent: true,
  });

  // Poll until CAPTCHA disappears (human has solved it)
  while (true) {
    await sleep(2000);
    const check = await detectCaptcha(page);
    if (!check.detected) {
      console.log('[CAPTCHA] CAPTCHA resolved. Resuming...');
      break;
    }
  }
  return true;
}
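The polling loop above waits forever if nobody responds. One option is to bound the wait and let the caller decide what to do on timeout. This is a minimal sketch: `waitUntilResolved`, its option names, and the default values are hypothetical, and `isResolved` stands in for a check such as `async () => !(await detectCaptcha(page)).detected`.

```javascript
// Polls isResolved() until it returns true or timeoutMs elapses.
// Resolves to true if the condition was met, false on timeout.
async function waitUntilResolved(isResolved, { intervalMs = 2000, timeoutMs = 15 * 60 * 1000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await isResolved()) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}

// Example with a stand-in check that succeeds on the third poll
let polls = 0;
waitUntilResolved(async () => ++polls >= 3, { intervalMs: 10, timeoutMs: 1000 })
  .then((ok) => console.log(ok, polls)); // true 3
```

On a `false` result, a scheduled job could skip the URL, retry later, or send a second notification instead of hanging.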
Embed the detection logic into your normal scraping flow:
async function smartCrawl(urls) {
  const browser = await puppeteer.launch({
    headless: false, // Note: non-headless mode
    args: ['--remote-debugging-port=9222'], // exposes the port the preview URL points at
  });
  try {
    const page = await browser.newPage();
    for (const url of urls) {
      await page.goto(url, { waitUntil: 'networkidle2' });

      // Check for CAPTCHA after every page load
      const blocked = await handleCaptcha(page);
      if (blocked) {
        // CAPTCHA handled, reload the page
        await page.reload({ waitUntil: 'networkidle2' });
      }

      // Normal scraping logic
      const data = await page.evaluate(() => {
        return {
          title: document.querySelector('h1')?.textContent,
          price: document.querySelector('.price')?.textContent,
        };
      });
      await saveData(data);
      await sleep(randomInt(1000, 3000)); // Random delay to mimic human behavior
    }
  } finally {
    await browser.close(); // release Chrome even if a page throws
  }
}
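The snippets above call `sleep` and `randomInt` without defining them. One possible definition is sketched below; treating both bounds of `randomInt` as inclusive is an assumption, since the original does not say.

```javascript
// Resolve after roughly `ms` milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Random integer between min and max, both inclusive (an assumed convention)
function randomInt(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}
```

`saveData` and `sendNotification` are left to the reader, since they depend on where the data goes and which chat tool is in use.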
The caveat: this approach only works if your use case doesn't have to run fully unattended. For most small-to-medium scraping tasks, the occasional need for human intervention is perfectly acceptable.
Honestly, I don't really worry about CAPTCHAs anymore.
My approach is simple: stealth plugin as the first line of defense to reduce trigger frequency, and when a CAPTCHA does appear, the script pauses and sends me a notification. I wake up, tap through it, and the scraper continues.
It's cheaper than paying for CAPTCHA solving services, less hassle than maintaining proxy pools, and way more reliable than trying to outsmart the CAPTCHA.
CAPTCHAs exist for a reason. They protect websites from being overwhelmed by malicious bots. That's a valid design goal. Instead of trying to "beat" them, make your automation smart enough to know when to ask for help.
After all, the best code isn't code that can solve every problem — it's code that knows when to get a human involved.
If you want to try this "human-in-the-loop" approach, check out xbrowser — it has built-in CAPTCHA detection and live preview. When a CAPTCHA is detected, it automatically pauses and generates a preview link so you can take over with one click.