
Advanced Web Scraping Techniques

15 min read · January 2025

Web scraping at scale is both an art and a science. Over the years, I've built scraping systems that collect millions of data points daily while respecting website policies and avoiding blocks. Here are the advanced techniques that make it possible.

Important Disclaimer

Always respect websites' Terms of Service and robots.txt files. Use scraping responsibly and ethically. The techniques described here are for legitimate data collection purposes only.
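
One concrete way to honor robots.txt is Python's built-in urllib.robotparser. The sketch below is a minimal example (the user agent string is just a placeholder) that checks whether a given URL may be fetched before you ever request it:

from urllib import robotparser
from urllib.parse import urljoin, urlparse

def can_fetch(url: str, user_agent: str = "MyScraperBot/1.0") -> bool:
    # Download and parse the site's robots.txt, then ask if this URL is allowed
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    parser = robotparser.RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)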

The Tools of the Trade

Python, Scrapy, Playwright, BeautifulSoup, Redis, PostgreSQL, Docker

1. Intelligent Proxy Rotation

Using the same IP address for thousands of requests is a quick way to get blocked. A robust proxy rotation system is essential for large-scale scraping.

import random
from typing import List
import httpx

class ProxyRotator:
    def __init__(self, proxies: List[str]):
        self.proxies = proxies
        self.failed_proxies = set()

    def get_proxy(self) -> str:
        available = [p for p in self.proxies
                     if p not in self.failed_proxies]
        if not available:
            self.failed_proxies.clear()
            available = self.proxies
        return random.choice(available)

    def mark_failed(self, proxy: str):
        self.failed_proxies.add(proxy)

    async def fetch(self, url: str) -> str:
        proxy = self.get_proxy()
        try:
            async with httpx.AsyncClient(proxy=proxy) as client:
                response = await client.get(url, timeout=30)
                return response.text
        except Exception:
            self.mark_failed(proxy)
            raise

Proxy Types to Consider

  • Datacenter proxies: Fast and cheap, but easier to detect
  • Residential proxies: Real IPs from ISPs, harder to block
  • Mobile proxies: IPs from mobile carriers, highest trust level
  • Rotating proxies: Automatic IP rotation per request
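
The ProxyRotator above treats every proxy the same. One possible refinement, shown here as a rough sketch rather than production code, is to tag each proxy with a tier and only escalate from cheap datacenter IPs to residential or mobile ones once the cheaper tier keeps failing:

from dataclasses import dataclass
import random

@dataclass
class Proxy:
    url: str
    tier: str  # "datacenter", "residential", or "mobile"

class TieredProxyPool:
    # Prefer the cheapest tier first, escalate when it keeps failing
    TIER_ORDER = ["datacenter", "residential", "mobile"]

    def __init__(self, proxies: list[Proxy], failure_threshold: int = 3):
        self.proxies = proxies
        self.failures = {p.url: 0 for p in proxies}
        self.failure_threshold = failure_threshold

    def get_proxy(self) -> Proxy:
        for tier in self.TIER_ORDER:
            candidates = [
                p for p in self.proxies
                if p.tier == tier and self.failures[p.url] < self.failure_threshold
            ]
            if candidates:
                return random.choice(candidates)
        # Every proxy has hit the threshold: reset and start over
        self.failures = {p.url: 0 for p in self.proxies}
        return random.choice(self.proxies)

    def mark_failed(self, proxy: Proxy):
        self.failures[proxy.url] += 1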

2. Browser Fingerprint Management

Modern websites don't just check your IP - they analyze your browser fingerprint. This includes screen resolution, installed fonts, WebGL renderer, and dozens of other signals.

from playwright.async_api import async_playwright

async def create_stealth_browser():
    playwright = await async_playwright().start()

    browser = await playwright.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-dev-shm-usage',
            '--no-sandbox'
        ]
    )

    context = await browser.new_context(
        viewport={'width': 1920, 'height': 1080},
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0.0.0 Safari/537.36',
        locale='en-US',
        timezone_id='America/New_York'
    )

    # Override navigator properties
    await context.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
    """)

    return browser, context

3. Rate Limiting and Delays

Hitting a server with 1000 requests per second is a guaranteed way to get blocked (and potentially cause harm). Implement intelligent rate limiting that mimics human behavior.

import asyncio
import random

class AdaptiveRateLimiter:
    def __init__(self, base_delay: float = 1.0):
        self.base_delay = base_delay
        self.current_delay = base_delay
        self.consecutive_errors = 0

    async def wait(self):
        # Add randomness to appear more human
        jitter = random.uniform(0.5, 1.5)
        await asyncio.sleep(self.current_delay * jitter)

    def on_success(self):
        self.consecutive_errors = 0
        # Gradually speed up on success
        self.current_delay = max(
            self.base_delay,
            self.current_delay * 0.9
        )

    def on_error(self):
        self.consecutive_errors += 1
        # Exponential backoff: base_delay * 2, 4, 8, ... capped at 60 seconds
        self.current_delay = min(
            60,  # Max 60 seconds
            self.base_delay * (2 ** self.consecutive_errors)
        )

Pro Tip

Analyze the website's own traffic patterns. If real users typically make 1-2 requests per minute, your scraper should do the same.

4. Session and Cookie Management

Many websites track sessions to detect bots. Properly managing cookies and maintaining session state can make your scraper appear more legitimate.

import httpx
from http.cookiejar import CookieJar

class SessionManager:
    def __init__(self):
        self.sessions = {}

    def get_session(self, domain: str) -> httpx.AsyncClient:
        if domain not in self.sessions:
            self.sessions[domain] = httpx.AsyncClient(
                cookies=CookieJar(),
                follow_redirects=True,
                timeout=30
            )
        return self.sessions[domain]

    async def close_all(self):
        for session in self.sessions.values():
            await session.aclose()
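
A brief usage sketch: requests to the same domain reuse one client, so cookies set by the site persist across calls.

async def crawl_with_sessions(urls: list[str]):
    manager = SessionManager()
    try:
        for url in urls:
            domain = httpx.URL(url).host
            session = manager.get_session(domain)   # one client per domain
            response = await session.get(url)       # cookies persist here
            print(domain, response.status_code)
    finally:
        await manager.close_all()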

5. Handling JavaScript-Heavy Sites

Many modern websites render content dynamically with JavaScript. For these, you need a headless browser that can execute JavaScript and wait for content to load.

async def scrape_dynamic_page(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        await page.goto(url)

        # Wait for specific content to load
        await page.wait_for_selector('.product-list', timeout=10000)

        # Alternatively, wait for the network to go idle instead:
        # await page.wait_for_load_state('networkidle')

        content = await page.content()
        await browser.close()

        return content

6. Data Pipeline Architecture

For large-scale scraping, you need a robust pipeline that separates concerns:

  1. URL Queue: Redis-based queue for URLs to scrape
  2. Fetcher Workers: Distributed workers that fetch pages
  3. Parser Workers: Extract structured data from HTML
  4. Data Store: PostgreSQL for structured storage
  5. Monitoring: Track success rates, latency, and errors

# Using Redis as a URL queue
import redis.asyncio as redis

class URLQueue:
    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url)
        self.queue_key = "scraping:urls"
        self.seen_key = "scraping:seen"

    async def add_url(self, url: str):
        # SADD returns 1 only if the URL was not already in the seen set,
        # so deduplication stays atomic even with multiple producers
        if await self.redis.sadd(self.seen_key, url):
            await self.redis.rpush(self.queue_key, url)

    async def get_url(self) -> str | None:
        result = await self.redis.lpop(self.queue_key)
        return result.decode() if result else None

    async def queue_size(self) -> int:
        return await self.redis.llen(self.queue_key)
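
To show how the pieces fit together, here is a rough fetcher-worker loop, a sketch only, that drains the queue using the rate limiter and session manager from earlier sections; process_html stands in for whatever parsing or hand-off step comes next:

async def fetcher_worker(queue: URLQueue,
                         sessions: SessionManager,
                         limiter: AdaptiveRateLimiter):
    while True:
        url = await queue.get_url()
        if url is None:
            await asyncio.sleep(5)   # queue is empty, poll again shortly
            continue

        await limiter.wait()
        domain = httpx.URL(url).host
        try:
            response = await sessions.get_session(domain).get(url)
            response.raise_for_status()
            limiter.on_success()
            await process_html(url, response.text)  # placeholder hand-off to parser workers
        except Exception:
            limiter.on_error()  # back off; a real system would also record the failure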

7. Error Handling and Recovery

Things will go wrong. Networks fail, websites change, and CAPTCHAs appear. Build robust error handling from the start.

from tenacity import retry, stop_after_attempt, wait_exponential

class RateLimitError(Exception):
    """Raised when the server asks us to slow down (HTTP 429)."""

class BlockedError(Exception):
    """Raised when the server appears to have blocked us (HTTP 403)."""

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60)
)
async def fetch_with_retry(url: str, session: httpx.AsyncClient) -> str:
    response = await session.get(url)

    if response.status_code == 429:  # Too Many Requests
        raise RateLimitError("Rate limited, backing off")

    if response.status_code == 403:  # Forbidden
        raise BlockedError("IP might be blocked")

    response.raise_for_status()
    return response.text

8. CAPTCHA Handling

When you encounter CAPTCHAs, you have several options:

  • Slow down and reduce your request rate
  • Rotate to a different IP/proxy
  • Use CAPTCHA solving services (for legitimate use cases)
  • Implement browser automation with human-like behavior
"The best way to handle CAPTCHAs is to avoid triggering them in the first place. Slow down, vary your patterns, and respect the website."

Best Practices Summary

  1. Respect robots.txt: It's both ethical and legal protection
  2. Identify yourself: Use a descriptive User-Agent with contact info (see the example after this list)
  3. Rate limit aggressively: Slow scraping is reliable scraping
  4. Handle errors gracefully: Retry with backoff, don't hammer
  5. Store raw data: Parse later; you can't re-scrape deleted pages
  6. Monitor continuously: Detect blocks and changes quickly
  7. Stay updated: Anti-bot technology evolves constantly
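
For point 2, a self-identifying User-Agent might look something like this (the bot name, URL, and email below are placeholders, not real contact details):

IDENTIFYING_HEADERS = {
    "User-Agent": (
        "ExampleResearchBot/1.0 "
        "(+https://example.com/bot-info; mailto:bots@example.com)"
    )
}

client = httpx.AsyncClient(headers=IDENTIFYING_HEADERS)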

Conclusion

Web scraping at scale requires careful engineering and ethical consideration. The techniques described here have helped me build systems that reliably collect data while maintaining good relationships with the websites I scrape.

Remember: the goal is to collect data, not to play cat and mouse with website operators. When in doubt, reach out to the website owner - many will provide API access or data exports if you explain your legitimate use case.