Files
claude-skills-reference/engineering/browser-automation/references/anti_detection_patterns.md
Reza Rezvani 97952ccbee feat(engineering): add browser-automation and spec-driven-workflow skills
browser-automation (564-line SKILL.md, 3 scripts, 3 references):
- Web scraping, form filling, screenshot capture, data extraction
- Anti-detection patterns, cookie/session management, dynamic content
- scraping_toolkit.py, form_automation_builder.py, anti_detection_checker.py
- NOT testing (that's playwright-pro) — this is automation & scraping

spec-driven-workflow (586-line SKILL.md, 3 scripts, 3 references):
- Spec-first development: write spec BEFORE code
- Bounded autonomy rules, 6-phase workflow, self-review checklist
- spec_generator.py, spec_validator.py, test_extractor.py
- Pairs with tdd-guide for red-green-refactor after spec

Updated engineering plugin.json (31 → 33 skills).
Added both to mkdocs.yml nav and generated docs pages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 12:57:18 +01:00

454 lines
14 KiB
Markdown

# Anti-Detection Patterns for Browser Automation
This reference covers techniques to make Playwright automation less detectable by anti-bot services. These are defense-in-depth measures — no single technique is sufficient, but combining them significantly reduces detection risk.
## Detection Vectors
Anti-bot systems detect automation through multiple signals. Understanding what they check helps you counter effectively.
### Tier 1: Trivial Detection (Every Site Checks These)
1. **navigator.webdriver** — Set to `true` by all automation frameworks
2. **User-Agent string** — Default headless UA contains "HeadlessChrome"
3. **WebGL renderer** — Headless Chrome reports "SwiftShader" or "Google SwiftShader"
### Tier 2: Common Detection (Most Anti-Bot Services)
4. **Viewport/screen dimensions** — Unusual sizes flag automation
5. **Plugins array** — Empty in headless mode, populated in real browsers
6. **Languages** — Missing or mismatched locale
7. **Request timing** — Machine-speed interactions
8. **Mouse movement** — No mouse events between clicks
### Tier 3: Advanced Detection (Cloudflare, DataDome, PerimeterX)
9. **Canvas fingerprint** — Headless renders differently
10. **WebGL fingerprint** — GPU-specific rendering variations
11. **Audio fingerprint** — AudioContext processing differences
12. **Font enumeration** — Different available fonts in headless
13. **Behavioral analysis** — Scroll patterns, click patterns, reading time
## Stealth Techniques
### 1. WebDriver Flag Removal
The most critical fix. Every anti-bot check starts here.
```python
await page.add_init_script("""
// Remove webdriver flag
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined,
});
// Remove Playwright-specific properties
delete window.__playwright;
delete window.__pw_manual;
""")
```
### 2. User Agent Configuration
Match the user agent to the browser you are launching. A Chrome UA with Firefox-specific headers is a red flag.
```python
# Chrome 120 on Windows 10 (most common configuration globally)
CHROME_WIN = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
# Chrome 120 on macOS
CHROME_MAC = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
# Chrome 120 on Linux
CHROME_LINUX = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
# Firefox 121 on Windows
FIREFOX_WIN = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0"
```
**Rules:**
- Update UAs every 2-3 months as browser versions increment
- Match UA platform to `navigator.platform` override
- If using Chromium, use Chrome UAs. If Firefox, use Firefox UAs.
- Never use obviously fake or ancient UAs
### 3. Viewport and Screen Properties
Common real-world screen resolutions (from analytics data):
| Resolution | Market Share | Use For |
|-----------|-------------|---------|
| 1920x1080 | ~23% | Default choice |
| 1366x768 | ~14% | Laptop simulation |
| 1536x864 | ~9% | Scaled laptop |
| 1440x900 | ~7% | MacBook |
| 2560x1440 | ~5% | High-end desktop |
```python
import random
VIEWPORTS = [
{"width": 1920, "height": 1080},
{"width": 1366, "height": 768},
{"width": 1536, "height": 864},
{"width": 1440, "height": 900},
]
viewport = random.choice(VIEWPORTS)
context = await browser.new_context(
viewport=viewport,
screen=viewport, # screen should match viewport
)
```
### 4. Navigator Properties Hardening
```python
STEALTH_INIT = """
// Plugins (headless Chrome has 0 plugins, real Chrome has 3-5)
Object.defineProperty(navigator, 'plugins', {
get: () => {
const plugins = [
{ name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer' },
{ name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai' },
{ name: 'Native Client', filename: 'internal-nacl-plugin' },
];
plugins.length = 3;
return plugins;
},
});
// Languages
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en'],
});
// Platform (match to user agent)
Object.defineProperty(navigator, 'platform', {
get: () => 'Win32', // or 'MacIntel' for macOS UA
});
// Hardware concurrency (real browsers report CPU cores)
Object.defineProperty(navigator, 'hardwareConcurrency', {
get: () => 8,
});
// Device memory (Chrome-specific)
Object.defineProperty(navigator, 'deviceMemory', {
get: () => 8,
});
// Connection info
Object.defineProperty(navigator, 'connection', {
get: () => ({
effectiveType: '4g',
rtt: 50,
downlink: 10,
saveData: false,
}),
});
"""
await context.add_init_script(STEALTH_INIT)
```
### 5. WebGL Fingerprint Evasion
Headless Chrome uses SwiftShader for WebGL, which anti-bot services detect.
```python
# Option A: Launch with a real GPU (headed mode on a machine with GPU)
browser = await p.chromium.launch(headless=False)
# Option B: Override WebGL renderer info
await page.add_init_script("""
const getParameter = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function(parameter) {
if (parameter === 37445) {
return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
}
if (parameter === 37446) {
return 'Intel(R) Iris(TM) Plus Graphics 640'; // UNMASKED_RENDERER_WEBGL
}
return getParameter.call(this, parameter);
};
""")
```
### 6. Canvas Fingerprint Noise
Anti-bot services render text/shapes to a canvas and hash the output. Headless Chrome produces a different hash.
```python
await page.add_init_script("""
const originalToDataURL = HTMLCanvasElement.prototype.toDataURL;
HTMLCanvasElement.prototype.toDataURL = function(type) {
if (type === 'image/png' || type === undefined) {
// Add minimal noise to the canvas to change fingerprint
const ctx = this.getContext('2d');
if (ctx) {
const imageData = ctx.getImageData(0, 0, this.width, this.height);
for (let i = 0; i < imageData.data.length; i += 4) {
// Shift one channel by +/- 1 (imperceptible)
imageData.data[i] = imageData.data[i] ^ 1;
}
ctx.putImageData(imageData, 0, 0);
}
}
return originalToDataURL.apply(this, arguments);
};
""")
```
## Request Throttling Patterns
### Human-Like Delays
Real users do not click at machine speed. Add realistic delays between actions.
```python
import random
import asyncio
async def human_delay(action_type="browse"):
"""Add realistic delay based on action type."""
delays = {
"browse": (1.0, 3.0), # Browsing between pages
"read": (2.0, 8.0), # Reading content
"fill": (0.3, 0.8), # Between form fields
"click": (0.1, 0.5), # Before clicking
"scroll": (0.5, 1.5), # Between scroll actions
}
min_s, max_s = delays.get(action_type, (0.5, 2.0))
await asyncio.sleep(random.uniform(min_s, max_s))
```
### Request Rate Limiting
```python
import time
class RateLimiter:
"""Enforce minimum delay between requests."""
def __init__(self, min_interval_seconds=1.0):
self.min_interval = min_interval_seconds
self.last_request_time = 0
async def wait(self):
elapsed = time.time() - self.last_request_time
if elapsed < self.min_interval:
await asyncio.sleep(self.min_interval - elapsed)
self.last_request_time = time.time()
# Usage
limiter = RateLimiter(min_interval_seconds=2.0)
for url in urls:
await limiter.wait()
await page.goto(url)
```
### Exponential Backoff on Errors
```python
async def with_backoff(coro_factory, max_retries=5, base_delay=1.0):
for attempt in range(max_retries):
try:
return await coro_factory()
except Exception as e:
if attempt == max_retries - 1:
raise
delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.1f}s...")
await asyncio.sleep(delay)
```
## Proxy Rotation Strategies
### Single Proxy
```python
browser = await p.chromium.launch(
proxy={"server": "http://proxy.example.com:8080"}
)
```
### Authenticated Proxy
```python
context = await browser.new_context(
proxy={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass",
}
)
```
### Rotating Proxy Pool
```python
PROXIES = [
"http://proxy1.example.com:8080",
"http://proxy2.example.com:8080",
"http://proxy3.example.com:8080",
]
async def create_context_with_proxy(browser):
proxy = random.choice(PROXIES)
return await browser.new_context(
proxy={"server": proxy}
)
```
### Per-Request Proxy (via Context Rotation)
Playwright does not support per-request proxy switching. Achieve it by creating a new context for each request or batch:
```python
async def scrape_url(browser, url, proxy):
context = await browser.new_context(proxy={"server": proxy})
page = await context.new_page()
try:
await page.goto(url)
data = await extract_data(page)
return data
finally:
await context.close()
```
### SOCKS5 Proxy
```python
browser = await p.chromium.launch(
proxy={"server": "socks5://proxy.example.com:1080"}
)
```
## Headless Detection Avoidance
### Running Chrome Channel Instead of Chromium
The bundled Chromium binary has different properties than a real Chrome install. Using the Chrome channel makes the browser indistinguishable from a normal install.
```python
# Use installed Chrome instead of bundled Chromium
browser = await p.chromium.launch(channel="chrome", headless=True)
```
**Requirements:** Chrome must be installed on the system.
### New Headless Mode (Chrome 112+)
Chrome's "new headless" mode is harder to detect than the old one:
```python
browser = await p.chromium.launch(
args=["--headless=new"],
)
```
### Avoiding Common Flags
Do NOT pass these flags — they are headless-detection signals:
- `--disable-gpu` (old headless workaround, not needed)
- `--no-sandbox` (security risk, detectable)
- `--disable-setuid-sandbox` (same as above)
## Behavioral Evasion
### Mouse Movement Simulation
Anti-bot services track mouse events. A click without preceding mouse movement is suspicious.
```python
async def human_click(page, selector):
"""Click with preceding mouse movement."""
element = await page.query_selector(selector)
box = await element.bounding_box()
if box:
# Move to element with slight offset
x = box["x"] + box["width"] / 2 + random.uniform(-5, 5)
y = box["y"] + box["height"] / 2 + random.uniform(-5, 5)
await page.mouse.move(x, y, steps=random.randint(5, 15))
await asyncio.sleep(random.uniform(0.05, 0.2))
await page.mouse.click(x, y)
```
### Typing Speed Variation
```python
async def human_type(page, selector, text):
"""Type with variable speed like a human."""
await page.click(selector)
for char in text:
await page.keyboard.type(char)
# Faster for common keys, slower for special characters
if char in "aeiou tnrs":
await asyncio.sleep(random.uniform(0.03, 0.08))
else:
await asyncio.sleep(random.uniform(0.08, 0.20))
```
### Scroll Behavior
Real users scroll gradually, not in instant jumps.
```python
async def human_scroll(page, distance=None):
"""Scroll down gradually like a human."""
if distance is None:
distance = random.randint(300, 800)
current = 0
while current < distance:
step = random.randint(50, 150)
await page.mouse.wheel(0, step)
current += step
await asyncio.sleep(random.uniform(0.05, 0.15))
```
## Detection Testing
### Self-Check Script
Navigate to these URLs to test your stealth configuration:
- `https://bot.sannysoft.com/` — Comprehensive bot detection test
- `https://abrahamjuliot.github.io/creepjs/` — Advanced fingerprint analysis
- `https://browserleaks.com/webgl` — WebGL fingerprint details
- `https://browserleaks.com/canvas` — Canvas fingerprint details
### Quick Test Pattern
```python
async def test_stealth(page):
"""Navigate to detection test page and report results."""
await page.goto("https://bot.sannysoft.com/")
await page.wait_for_timeout(3000)
# Check for failed tests
failed = await page.eval_on_selector_all(
"td.failed",
"els => els.map(e => e.parentElement.querySelector('td').textContent)"
)
if failed:
print(f"FAILED checks: {failed}")
else:
print("All checks passed.")
await page.screenshot(path="stealth_test.png", full_page=True)
```
## Recommended Stealth Stack
For most automation tasks, apply these in order of priority:
1. **WebDriver flag removal** — Critical, takes 2 lines
2. **Custom user agent** — Critical, takes 1 line
3. **Viewport configuration** — High priority, takes 1 line
4. **Request delays** — High priority, add random.uniform() calls
5. **Navigator properties** — Medium priority, init script block
6. **Chrome channel** — Medium priority, one launch option
7. **WebGL override** — Low priority unless hitting advanced anti-bot
8. **Canvas noise** — Low priority unless hitting advanced anti-bot
9. **Proxy rotation** — Only for high-volume or repeated scraping
10. **Behavioral simulation** — Only for sites with behavioral analysis