Scraper Capability Report
What the zig-browser anti-bot stack can and can't scrape
Bot-protection capability matrix
How the stack fares against each major anti-bot vendor, and the mechanism.
| Protection | Status | How / why | Examples |
|---|---|---|---|
| No protection / basic WAF | Full | Coherent Chrome fingerprint (TLS JA3/JA4 + headers + h2 + navigator all agree) over a clean residential IP. | Wikipedia, gov sites, most Tech/News |
| Cloudflare (score-based) | Full | A clean residential IP + coherent fingerprint scores below CF's challenge threshold; rotate-on-block clears borderline exits. The passive jsd beacon on real pages is correctly ignored. | economist, dictionary, cloudflare.com, vercel |
| AWS-WAF (token challenge) | Solved natively | We run the challenge's SHA-256 proof-of-work in QuickJS, mint the aws-waf-token, and retry — no browser. The marquee capability. | imdb, amazon |
| DataDome — low-friction tenant | Pass | Reputation + coherent fingerprint; the tenant serves real content without escalating. | reuters, wsj, tripadvisor |
| DataDome — aggressive tenant | Blocked | We run the plv3 VM in QuickJS and mint a cookie, but the device-class check scores us view:captcha — it fingerprints the whole browser engine + graphics stack, which a non-browser engine can't fully match. | yelp, etsy, viator, monster |
| Cloudflare — hard managed challenge | Partial | "Just a moment" interstitial: rotate-on-block finds a passing exit on most score-based cases; the full Turnstile/BotGuard-grade solve is not done. | (rare on the dataset) |
| Akamai Bot Manager | Blocked | Static "Access Denied" — the server already decided; needs sensor-data payload generation (real browser). | apartments, telegraph, hm, tesla |
| PerimeterX / HUMAN | Blocked | Press-and-hold human-interaction challenge. | wayfair, zillow |
| Kasada | Blocked | KPSDK obfuscated VM + 429; needs the Kasada VM solver. | realtor, chewy |
| reCAPTCHA / hCaptcha walls | Blocked | Interactive CAPTCHA — needs a solving service. | (when a site forces it) |
Reachability by website category
325 real-world domains, run live through the full stack (clean residential proxy + rotating --gen-profile + AWS-WAF solver + rotate-on-block). "Reached" means a real <title> plus content, not a challenge page.
| Category | Reached | Rate |
|---|---|---|
| Finance · Education · Reference · Social · Entertainment · Health · Food · Automotive · Sports · Streaming | all | 100% |
| Technology | 77 / 79 | 97% |
| News | 38 / 40 | 95% |
| Government | 18 / 19 | 94% |
| E-Commerce | 31 / 36 | 86% |
| Travel | 10 / 12 | 83% |
| Jobs | 3 / 4 | 75% |
| Real Estate | 5 / 8 | 62% |
| Classifieds | 0 / 1 | 0% |
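The "Reached" criterion used for this table can be sketched as a simple check on the response body. This is a minimal illustration, not the stack's actual classifier: the function name is_reached, the marker list, and the 200-character threshold are all assumptions, and the markers are taken only from the challenge pages named in this report.

```python
import re

# Hypothetical marker list, drawn from block pages named in this report.
CHALLENGE_MARKERS = (
    "just a moment",  # Cloudflare managed-challenge interstitial
    "access denied",  # Akamai static block page
)

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def is_reached(html: str, min_body_chars: int = 200) -> bool:
    """Return True if the page looks like real content rather than a challenge."""
    match = TITLE_RE.search(html)
    if match is None:
        return False  # no <title> at all
    title = match.group(1).strip()
    if not title:
        return False  # empty <title>
    if any(marker in title.lower() for marker in CHALLENGE_MARKERS):
        return False  # title matches a known challenge page
    # Strip tags and collapse whitespace to estimate how much text the page has.
    text = " ".join(re.sub(r"<[^>]+>", " ", html).split())
    return len(text) >= min_body_chars
```

Matching on the title rather than the whole body is a deliberate choice in this sketch: a real article may legitimately contain the words "access denied", while a block page usually puts them in the title.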
What the stack actually does
Identity & transport
- Coherent fingerprint: --gen-profile (browserforge) makes UA, headers, navigator, JA3/JA4, and h2 SETTINGS all agree, so there's no split-identity tell.
- Own HTTP/1.1 + HTTP/2 + TLS stack: Chrome/Firefox-shaped ClientHello, byte-exact cipher fingerprint.
- Residential proxy pool: ip-api-filtered clean exits (proxy=false, hosting=false), geo-targeted, with mobile-exit support and per-exit health scoring.
- Rotate-on-block + .gov-direct routing, the highest-leverage knob for score-based vendors.
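The exit filtering and rotate-on-block behavior listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the stack's code: the Exit type, the failures counter standing in for per-exit health scoring, and the injected fetch and is_blocked callables are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Exit:
    """One proxy exit with the ip-api style flags mentioned above."""
    address: str
    proxy: bool        # flagged as a known proxy/VPN
    hosting: bool      # flagged as a datacenter/hosting range
    failures: int = 0  # simple stand-in for per-exit health scoring


def clean_exits(exits: Iterable[Exit]) -> list[Exit]:
    """Keep only exits with proxy=false and hosting=false."""
    return [e for e in exits if not e.proxy and not e.hosting]


def fetch_with_rotation(
    url: str,
    exits: list[Exit],
    fetch: Callable[[str, Exit], str],
    is_blocked: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Try the healthiest clean exits in turn; return the first unblocked body."""
    candidates = sorted(clean_exits(exits), key=lambda e: e.failures)
    for exit_ in candidates[:max_attempts]:
        body = fetch(url, exit_)
        if not is_blocked(body):
            return body
        exit_.failures += 1  # demote this exit for later requests
    return None  # every attempted exit was blocked
```

Passing fetch and is_blocked in as callables keeps the sketch independent of any particular HTTP client or challenge detector, which also makes it easy to exercise with fakes.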
Challenge solving (QuickJS, no browser)
- AWS-WAF: runs the real challenge.js SHA-256 proof-of-work, mints aws-waf-token. ✅ Reliable.
- DataDome interstitial: runs the plv3 VM, mints a cookie, with a real canvas-2D rasterizer for the Picasso check. ⚠️ Clears lenient tenants; aggressive ones still score captcha.
- Cloudflare: detects IUAM/Turnstile; clears score-based via rotation. Full managed-challenge solve not implemented.
- Fingerprint fidelity: real canvas pixels, WebGPU/keyboard/screen props, [native code] toString masking.
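To make the proof-of-work idea concrete, here is a generic hashcash-style SHA-256 sketch. It does not reproduce the AWS-WAF challenge format: the report does not describe the input encoding or the difficulty rule, so the challenge string, the "leading zero bits" target, and the function names below are all assumptions chosen for illustration.

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # bit_length() gives the position of the highest set bit, so
        # 8 - bit_length() is the number of leading zero bits in this byte.
        bits += 8 - byte.bit_length()
        break
    return bits


def verify_pow(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Check a candidate nonce with a single hash."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve_pow(challenge: str, difficulty_bits: int, max_nonce: int = 10_000_000) -> int:
    """Brute-force the smallest nonce that satisfies the difficulty target."""
    for nonce in range(max_nonce):
        if verify_pow(challenge, nonce, difficulty_bits):
            return nonce
    raise RuntimeError("no solution within max_nonce")
```

The asymmetry is the point: solving takes on the order of 2^difficulty_bits hashes, while verifying takes one, which is why this kind of check has an answer that can be computed without a browser.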
The capability boundary — and why
There's a clean line between what we beat and what we don't, and it maps to how each vendor verifies a client:
✅ We win when the check is deterministic or reputation-based
AWS-WAF is a math puzzle (SHA-256 PoW) — a verifiable answer we compute. Cloudflare score-based and low-friction DataDome judge IP + fingerprint coherence — which our clean residential IP + --gen-profile satisfy. These have a "right answer" we can produce natively.
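Fingerprint coherence can be pictured as a cross-check between layers that should describe the same browser. The sketch below covers only three signals (User-Agent, the Sec-CH-UA-Platform client hint, and navigator.platform); the Profile type and the incoherences function are hypothetical, and the TLS and h2 layers that the stack also aligns are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical subset of a generated browser profile."""
    user_agent: str          # User-Agent request header
    sec_ch_ua_platform: str  # Sec-CH-UA-Platform header value, quotes included
    navigator_platform: str  # navigator.platform as seen by page JavaScript


# Expected (UA substring, client-hint value, navigator.platform) per desktop OS
# for Chrome-style profiles.
_PLATFORMS = {
    "windows": ("Windows NT", '"Windows"', "Win32"),
    "macos": ("Macintosh", '"macOS"', "MacIntel"),
    "linux": ("X11; Linux", '"Linux"', "Linux x86_64"),
}


def incoherences(p: Profile) -> list[str]:
    """Return the mismatches found; an empty list means the layers agree."""
    for name, (ua_token, hint, nav) in _PLATFORMS.items():
        if ua_token not in p.user_agent:
            continue
        problems = []
        if p.sec_ch_ua_platform != hint:
            problems.append(f"{name}: Sec-CH-UA-Platform is {p.sec_ch_ua_platform}, expected {hint}")
        if p.navigator_platform != nav:
            problems.append(f"{name}: navigator.platform is {p.navigator_platform}, expected {nav}")
        return problems
    return ["unrecognized platform in User-Agent"]
```

A profile that claims Windows in its User-Agent but reports MacIntel to page JavaScript is the kind of split-identity tell a score-based check can pick up.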
❌ We lose when the check fingerprints the real browser engine
Aggressive DataDome (device-class / Picasso), PerimeterX, Kasada, and Akamai score the exact behavior of a real Chrome's engine + graphics stack (canvas/audio/WebGL pixel output, timing, sensor data). There's no "right answer" to compute — you have to be a real browser. A pure QuickJS engine can run their VMs and mint cookies, but can't present a real-Chrome device class, so it scores as a bot.
Notes & caveats
- Numbers come from a single point-in-time run of the spider-rs/spider-browser-dataset (325 domains). Score-based vendors (Cloudflare, DataDome) shift by a few points from run to run with IP quality, so treat category rates as indicative, not exact.
- A few "misses" are transport (dead proxy timeouts) or JS-only SPAs (e.g. cve.org) that need a render pass, not anti-bot — those are recoverable, separate from the hard-vendor blocks.
- Build/test health: zig build -Dtls -Dquickjs green; 482/488 quickjs tests, 455/471 default. AWS-WAF (imdb, amazon) and the proxy/rotation layer are production-stable.
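As a worked illustration of the first caveat, the snippet below recomputes each category rate from the counts in the reachability table and shows how far the rate moves when a single domain flips. The counts are copied from the table; the helper names are hypothetical. The table's percentages match truncation to a whole percent, so the sketch truncates as well.

```python
# (reached, total) per category, copied from the reachability table.
CATEGORIES = {
    "Technology": (77, 79),
    "News": (38, 40),
    "Government": (18, 19),
    "E-Commerce": (31, 36),
    "Travel": (10, 12),
    "Jobs": (3, 4),
    "Real Estate": (5, 8),
    "Classifieds": (0, 1),
}


def rate_pct(reached: int, total: int) -> int:
    """Whole-percent reach rate, truncated rather than rounded."""
    return reached * 100 // total


def swing_pct(total: int) -> float:
    """Percentage points the rate moves if one domain flips."""
    return 100 / total


for name, (reached, total) in CATEGORIES.items():
    print(f"{name:12} {reached:>3}/{total:<3} {rate_pct(reached, total):>3}%  "
          f"one domain = {swing_pct(total):.1f} pts")
```

For the small categories the granularity dominates: with 4 domains in Jobs, one flipped result moves the rate by 25 points, so those rows say little about the category as a whole.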