Scraper Capability Report

What the zig-browser anti-bot stack can and can't scrape

zig-browser · June 12, 2026
Real-world reachability: 94%. 308/325 domains (spider-rs dataset, 18 categories).
Bot defenses we beat: 4. No protection / basic WAF · Cloudflare (score) · AWS-WAF (solved) · DataDome (lenient).
Defenses that block us: 4. Akamai · PerimeterX · Kasada · DataDome (aggressive).
Native JS challenge solver: AWS-WAF. QuickJS proof-of-work, no browser.

Bot-protection capability matrix

How the stack fares against each major anti-bot vendor, and the mechanism.

Protection | Status | How / why | Examples
No protection / basic WAF | Full | Coherent Chrome fingerprint (TLS JA3/JA4 + headers + h2 + navigator all agree) over a clean residential IP. | Wikipedia, gov sites, most Tech/News
Cloudflare (score-based) | Full | A clean residential IP + coherent fingerprint scores below CF's challenge threshold; rotate-on-block clears borderline exits. The passive jsd beacon on real pages is correctly ignored. | economist, dictionary, cloudflare.com, vercel
AWS-WAF (token challenge) | Solved natively | We run the challenge's SHA-256 proof-of-work in QuickJS, mint the aws-waf-token, and retry, with no browser involved. The marquee capability. | imdb, amazon
DataDome (low-friction tenant) | Pass | Reputation + coherent fingerprint; the tenant serves real content without escalating. | reuters, wsj, tripadvisor
DataDome (aggressive tenant) | Blocked | We run the plv3 VM in QuickJS and mint a cookie, but the device-class check scores us view:captcha, because it fingerprints the whole browser engine + graphics stack, which a non-browser engine can't fully match. | yelp, etsy, viator, monster
Cloudflare (hard managed challenge) | Partial | "Just a moment" interstitial: rotate-on-block finds a passing exit on most score-based cases; the full Turnstile/BotGuard-grade solve is not done. | (rare on the dataset)
Akamai Bot Manager | Blocked | Static "Access Denied": the server already decided; needs sensor-data payload generation (real browser). | apartments, telegraph, hm, tesla
PerimeterX / HUMAN | Blocked | Press-and-hold human-interaction challenge. | wayfair, zillow
Kasada | Blocked | KPSDK obfuscated VM + 429; needs the Kasada VM solver. | realtor, chewy
reCAPTCHA / hCaptcha walls | Blocked | Interactive CAPTCHA; needs a solving service. | (when a site forces it)

Reachability by website category

325 real-world domains

Run live through the full stack (clean residential proxy + rotating --gen-profile + AWS-WAF solver + rotate-on-block). "Reached" = real <title> + content, not a challenge page.
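The "Reached" criterion can be approximated with a small classifier. The sketch below is only an illustration of the idea, not the stack's actual check: `is_reached`, the `min_body_chars` threshold, and the choice of markers are assumptions. The two markers are the challenge strings quoted in this report ("Just a moment" and "Access Denied"); a real check would need more.

```python
import re

# Challenge-page markers quoted elsewhere in this report; real pages may use others.
CHALLENGE_MARKERS = ("just a moment", "access denied")

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def is_reached(status: int, html: str, min_body_chars: int = 500) -> bool:
    """Return True when a response looks like real content, not a challenge page."""
    if not 200 <= status < 300:
        return False
    match = TITLE_RE.search(html)
    if match is None:
        return False
    title = match.group(1).strip().lower()
    if not title or any(marker in title for marker in CHALLENGE_MARKERS):
        return False
    # Crude tag strip; the length threshold is a hypothetical cutoff.
    text = TAG_RE.sub(" ", html).strip()
    return len(text) >= min_body_chars
```

A title-only test would miss interstitials that reuse the site's own title, which is why the sketch also requires a minimum amount of page text.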

Category | Reached | Rate
Finance · Education · Reference · Social · Entertainment · Health · Food · Automotive · Sports · Streaming | all | 100%
Technology | 77 / 79 | 97%
News | 38 / 40 | 95%
Government | 18 / 19 | 94%
E-Commerce | 31 / 36 | 86%
Travel | 10 / 12 | 83%
Jobs | 3 / 4 | 75%
Real Estate | 5 / 8 | 62%
Classifieds | 0 / 1 | 0%
The pattern: content/info categories sit at or near 100% (news, finance, education, reference, health, etc.). The losses concentrate in commerce, real estate, travel, and jobs, exactly the verticals that deploy the hard interactive vendors (Akamai/PerimeterX/Kasada/aggressive DataDome) to stop price/inventory scraping.
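As a cross-check on the table, the per-category counts can be reconciled against the headline 308/325 figure. The sketch below copies the counts from the table; the `rate` helper and the assumption that percentages are truncated rather than rounded are inferred from the published numbers (5/8 is shown as 62%), not stated by the report.

```python
# (reached, total) per category, copied from the table above.
LISTED = {
    "Technology": (77, 79),
    "News": (38, 40),
    "Government": (18, 19),
    "E-Commerce": (31, 36),
    "Travel": (10, 12),
    "Jobs": (3, 4),
    "Real Estate": (5, 8),
    "Classifieds": (0, 1),
}
OVERALL_REACHED, OVERALL_TOTAL = 308, 325


def rate(reached: int, total: int) -> int:
    """Percentage truncated toward zero, which matches the table (5/8 -> 62)."""
    return reached * 100 // total


listed_reached = sum(reached for reached, _ in LISTED.values())
listed_total = sum(total for _, total in LISTED.values())

# Whatever remains belongs to the ten categories reported at 100%.
rest_reached = OVERALL_REACHED - listed_reached
rest_total = OVERALL_TOTAL - listed_total

print(f"overall: {rate(OVERALL_REACHED, OVERALL_TOTAL)}%")
print(f"100% categories: {rest_reached} / {rest_total}")
```

The remainder comes out equal on both sides, so the ten categories reported at 100% account for every domain not itemized in the table.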

What the stack actually does

Identity & transport

  • Coherent fingerprint: --gen-profile (browserforge) makes UA, headers, navigator, JA3/JA4, and h2 SETTINGS all agree, so there's no split-identity tell.
  • Own HTTP/1.1 + HTTP/2 + TLS stack — Chrome/Firefox-shaped ClientHello, byte-exact cipher fingerprint.
  • Residential proxy pool — ip-api-filtered clean exits (proxy=false, hosting=false), geo-targeted, with mobile-exit support and per-exit health scoring.
  • Rotate-on-block + .gov-direct routing, the highest-leverage knob for score-based vendors.
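The exit filtering and rotate-on-block behavior listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the stack's code: the `Exit` and `Response` types, the `fetch` and `is_blocked` callables, the `max_attempts` limit, and the health adjustments are all hypothetical. Only the proxy=false / hosting=false filter and the idea of per-exit health scoring come from the list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Exit:
    address: str
    proxy: bool          # ip-api style flag: known proxy/VPN
    hosting: bool        # ip-api style flag: datacenter range
    health: float = 1.0  # hypothetical score in [0, 1]


@dataclass
class Response:
    status: int
    body: str


def clean_exits(exits: Iterable[Exit]) -> List[Exit]:
    """Keep exits flagged proxy=false and hosting=false, healthiest first."""
    kept = [e for e in exits if not e.proxy and not e.hosting]
    return sorted(kept, key=lambda e: e.health, reverse=True)


def fetch_with_rotation(
    url: str,
    exits: Iterable[Exit],
    fetch: Callable[[str, Exit], Response],
    is_blocked: Callable[[Response], bool],
    max_attempts: int = 3,
) -> Optional[Response]:
    """Try clean exits in health order; rotate to the next one on a block."""
    for exit_ in clean_exits(exits)[:max_attempts]:
        response = fetch(url, exit_)
        if not is_blocked(response):
            exit_.health = min(1.0, exit_.health + 0.1)  # reward a passing exit
            return response
        exit_.health = max(0.0, exit_.health - 0.5)      # penalize a blocked exit
    return None  # every attempted exit was blocked
```

Passing `fetch` and `is_blocked` in as callables keeps the rotation policy separate from the transport, so the same loop can be exercised with a fake fetcher.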

Challenge solving (QuickJS, no browser)

  • AWS-WAF — runs the real challenge.js SHA-256 proof-of-work, mints aws-waf-token. ✅ Reliable.
  • DataDome interstitial — runs the plv3 VM, mints a cookie, with a real canvas-2D rasterizer for the Picasso check. ⚠️ Clears lenient tenants; aggressive ones still score captcha.
  • Cloudflare — detects IUAM/Turnstile; clears score-based via rotation. Full managed-challenge solve not implemented.
  • Fingerprint fidelity — real canvas pixels, WebGPU/keyboard/screen props, [native code] toString masking.

The capability boundary — and why

There's a clean line between what we beat and what we don't, and it maps to how each vendor verifies a client:

✅ We win when the check is deterministic or reputation-based

AWS-WAF is a math puzzle (SHA-256 PoW) — a verifiable answer we compute. Cloudflare score-based and low-friction DataDome judge IP + fingerprint coherence — which our clean residential IP + --gen-profile satisfy. These have a "right answer" we can produce natively.
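To show why a proof-of-work check has a computable "right answer", here is a generic hashcash-style SHA-256 puzzle. It is not the AWS-WAF challenge: this report does not describe the real input format, difficulty encoding, or token exchange, so `solve_pow`, `verify_pow`, the nonce encoding, and the leading-zero-bits rule are assumptions made for illustration.

```python
import hashlib
from typing import Optional


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def pow_digest(challenge: bytes, nonce: int) -> bytes:
    # Hypothetical encoding: challenge bytes followed by the decimal nonce.
    return hashlib.sha256(challenge + str(nonce).encode("ascii")).digest()


def solve_pow(challenge: bytes, difficulty_bits: int, max_nonce: int = 1 << 24) -> Optional[int]:
    """Brute-force the first nonce whose digest has enough leading zero bits."""
    for nonce in range(max_nonce):
        if leading_zero_bits(pow_digest(challenge, nonce)) >= difficulty_bits:
            return nonce
    return None


def verify_pow(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Checking an answer costs one hash, however long the search took."""
    return leading_zero_bits(pow_digest(challenge, nonce)) >= difficulty_bits
```

The asymmetry is the point: solving takes about 2^difficulty_bits hashes on average, while verifying takes one, and any engine that can hash (a browser or an embedded interpreter) produces an equally valid answer.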

❌ We lose when the check fingerprints the real browser engine

Aggressive DataDome (device-class / Picasso), PerimeterX, Kasada, and Akamai score the exact behavior of a real Chrome's engine + graphics stack (canvas/audio/WebGL pixel output, timing, sensor data). There's no "right answer" to compute — you have to be a real browser. A pure QuickJS engine can run their VMs and mint cookies, but can't present a real-Chrome device class, so it scores as a bot.

To extend coverage to the hard vendors (the ~6% remaining: Akamai/PerimeterX/Kasada/aggressive-DataDome verticals) the realistic options are: a real/stealth browser (Camoufox, nodriver) for just the cookie-mint step, a commercial solver API, or premium mobile IPs for the score-based subset. All cross the "pure-native" line by design.
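One way to picture the boundary is as a routing table from detected vendor to handling strategy. The sketch below mirrors the capability matrix in this report; the vendor keys, the `Strategy` names, and the choice to send unknown vendors to the external path are hypothetical, not the stack's configuration.

```python
from enum import Enum


class Strategy(Enum):
    NATIVE = "native"                # coherent fingerprint + clean residential IP
    NATIVE_SOLVER = "native_solver"  # challenge solved in the embedded JS engine
    ROTATE = "rotate"                # native, with rotate-on-block
    EXTERNAL = "external"            # real/stealth browser or commercial solver API


# Hypothetical vendor keys; the assignments follow the capability matrix.
ROUTES = {
    "none": Strategy.NATIVE,
    "cloudflare_score": Strategy.ROTATE,
    "aws_waf": Strategy.NATIVE_SOLVER,
    "datadome_lenient": Strategy.NATIVE,
    "datadome_aggressive": Strategy.EXTERNAL,
    "akamai": Strategy.EXTERNAL,
    "perimeterx": Strategy.EXTERNAL,
    "kasada": Strategy.EXTERNAL,
}


def pick_strategy(vendor: str) -> Strategy:
    """Look up the strategy for a vendor; unknown vendors take the external path."""
    return ROUTES.get(vendor, Strategy.EXTERNAL)


def stays_native(vendor: str) -> bool:
    """True when the vendor can be handled without crossing the pure-native line."""
    return pick_strategy(vendor) is not Strategy.EXTERNAL
```

Keeping the external path as a separate strategy makes the cost of crossing the "pure-native" line explicit per vendor, so it applies only to the verticals that need it.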
Notes & caveats