D4Vinci/Scrapling

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

Topics: ai · automation · testing · ai-scraping · crawler · crawling · crawling-python · data · data-extraction · mcp · mcp-server · playwright · python · scraping · selectors · stealth · web-scraper · web-scraping · web-scraping-python · webscraping · xpath
Python · BSD-3-Clause · 17.1K stars · 1.1K forks · Updated 2/27/2026


    About Scrapling

Effortless Web Scraping for the Modern Web

العربية | Español | Deutsch | 简体中文 | 日本語 | Русский

Selection methods · Fetchers · Spiders · Proxy Rotation · CLI · MCP

    Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.

Blazing-fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users alike; there's something for everyone.

    from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
    StealthyFetcher.adaptive = True
    p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch website under the radar!
    products = p.css('.product', auto_save=True)                                        # Scrape data that survives website design changes!
    products = p.css('.product', adaptive=True)                                         # Later, if the website structure changes, pass `adaptive=True` to find them!
    

Or scale up to full crawls:

    from scrapling.spiders import Spider, Response
    
    class MySpider(Spider):
      name = "demo"
      start_urls = ["https://example.com/"]
    
      async def parse(self, response: Response):
          for item in response.css('.product'):
              yield {"title": item.css('h2::text').get()}
    
    MySpider().start()
    

    Platinum Sponsors

    Sponsors

Do you want to show your ad here? Click here and choose the tier that suits you!


    Key Features

Spiders — A Full Crawling Framework

• 🕷️ Scrapy-like Spider API: Define spiders with start_urls, async parse callbacks, and Request/Response objects.
• ⚡ Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
• 🔄 Multi-Session Support: Unified interface for HTTP requests and stealthy headless browsers in a single spider — route requests to different sessions by ID.
• 💾 Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
• 📡 Streaming Mode: Stream scraped items as they arrive via async for item in spider.stream() with real-time stats — ideal for UIs, pipelines, and long-running crawls.
• 🛡️ Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
• 📦 Built-in Export: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export with result.items.to_json() / result.items.to_jsonl() respectively.
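The streaming mode above hands you scraped items through an async iterator. As a rough, self-contained sketch of that consumption pattern — `fake_stream` below is a stub standing in for a real spider's `stream()` and is not part of Scrapling:

```python
import asyncio
from typing import AsyncIterator

async def fake_stream() -> AsyncIterator[dict]:
    """Stand-in for spider.stream(): yields items as the crawl produces them."""
    for i in range(3):
        await asyncio.sleep(0)            # simulate waiting on the network
        yield {"title": f"Product {i}"}

async def consume() -> list[dict]:
    items = []
    async for item in fake_stream():      # same shape as: async for item in spider.stream()
        items.append(item)                # push to a UI, queue, or pipeline here
    return items

print(asyncio.run(consume()))
```

The point of the pattern is backpressure-friendly consumption: each item is handled the moment it arrives instead of waiting for the whole crawl to finish.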

Advanced Website Fetching with Session Support

• HTTP Requests: Fast and stealthy HTTP requests with the Fetcher class. Can impersonate browsers' TLS fingerprints and headers, and can use HTTP/3.
    • Dynamic Loading: Fetch dynamic websites with full browser automation through the DynamicFetcher class supporting Playwright's Chromium and Google's Chrome.
    • Anti-bot Bypass: Advanced stealth capabilities with StealthyFetcher and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
    • Session Management: Persistent session support with FetcherSession, StealthySession, and DynamicSession classes for cookie and state management across requests.
    • Proxy Rotation: Built-in ProxyRotator with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides.
    • Domain Blocking: Block requests to specific domains (and their subdomains) in browser-based fetchers.
    • Async Support: Complete async support across all fetchers and dedicated async session classes.
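The cyclic strategy mentioned for ProxyRotator can be illustrated in plain Python. This is a conceptual sketch, not Scrapling's implementation; `CyclicRotator` and its `get()` method are hypothetical names for the example:

```python
from itertools import cycle

class CyclicRotator:
    """Hand out proxies in round-robin order, wrapping around at the end."""

    def __init__(self, proxies: list[str]):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._pool = cycle(proxies)

    def get(self) -> str:
        return next(self._pool)

rotator = CyclicRotator(["http://proxy1:8080", "http://proxy2:8080"])
print([rotator.get() for _ in range(4)])  # proxy1, proxy2, proxy1, proxy2
```

A custom strategy would simply replace the cycle with any callable that picks the next proxy (random choice, least-recently-failed, and so on).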

    Adaptive Scraping & AI Integration

• 🔄 Smart Element Tracking: Relocate elements after website changes using intelligent similarity algorithms.
• 🎯 Smart Flexible Selection: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
• 🔍 Find Similar Elements: Automatically locate elements similar to found elements.
• 🤖 MCP Server to be used with AI: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc.), thereby speeding up operations and reducing costs by minimizing token usage. (demo video)

High-Performance & Battle-Tested Architecture

• 🚀 Lightning Fast: Optimized performance that outperforms most Python scraping libraries.
• 🔋 Memory Efficient: Optimized data structures and lazy loading for a minimal memory footprint.
• ⚡ Fast JSON Serialization: 10x faster than the standard library.
• 🏗️ Battle-Tested: Not only does Scrapling have 92% test coverage and full type-hint coverage, it has been used daily by hundreds of Web Scrapers over the past year.

Developer/Web Scraper Friendly Experience

• 🎯 Interactive Web Scraping Shell: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools to speed up Web Scraping script development, such as converting curl commands to Scrapling requests and viewing request results in your browser.
• 🚀 Use It Directly from the Terminal: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
• 🛠️ Rich Navigation API: Advanced DOM traversal with parent, sibling, and child navigation methods.
• 🧬 Enhanced Text Processing: Built-in regex, cleaning methods, and optimized string operations.
• 📝 Auto Selector Generation: Generate robust CSS/XPath selectors for any element.
• 🔌 Familiar API: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
• 📘 Complete Type Coverage: Full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with PyRight and MyPy on each change.
• 🔋 Ready Docker Image: With each release, a Docker image containing all browsers is automatically built and pushed.
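To give an idea of what auto selector generation involves, here is a toy sketch (not Scrapling's algorithm; `Node` is a minimal stand-in class invented for the example): walk from the element up to the root, recording the tag and :nth-child position at each level.

```python
class Node:
    """Minimal DOM-like node for illustration."""
    def __init__(self, tag: str, parent: "Node | None" = None):
        self.tag, self.parent, self.children = tag, parent, []
        if parent is not None:
            parent.children.append(self)

def css_path(node: Node) -> str:
    """Build a selector that uniquely addresses `node` via :nth-child steps."""
    parts = []
    while node.parent is not None:
        position = node.parent.children.index(node) + 1  # CSS counts from 1
        parts.append(f"{node.tag}:nth-child({position})")
        node = node.parent
    parts.append(node.tag)  # the root carries no position
    return " > ".join(reversed(parts))

html = Node("html")
body = Node("body", html)
div = Node("div", body)
span = Node("span", div)
print(css_path(span))  # html > body:nth-child(1) > div:nth-child(1) > span:nth-child(1)
```

A robust generator (like the one the bullet describes) would also prefer stable anchors such as id and class attributes over positional steps, since positions break easily when pages change.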

    Getting Started

Here's a quick glimpse of what Scrapling can do, without diving too deep.

    Basic Usage

    HTTP requests with session support

    from scrapling.fetchers import Fetcher, FetcherSession
    
    with FetcherSession(impersonate='chrome') as session:  # Use latest version of Chrome's TLS fingerprint
        page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
        quotes = page.css('.quote .text::text').getall()
    
    # Or use one-off requests
    page = Fetcher.get('https://quotes.toscrape.com/')
    quotes = page.css('.quote .text::text').getall()
    

    Advanced stealth mode

    from scrapling.fetchers import StealthyFetcher, StealthySession
    
    with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
        page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
        data = page.css('#padded_content a').getall()
    
    # Or use one-off request style, it opens the browser for this request, then closes it after finishing
    page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
    data = page.css('#padded_content a').getall()
    

    Full browser automation

    from scrapling.fetchers import DynamicFetcher, DynamicSession
    
    with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
        page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
        data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selector if you prefer it
    
    # Or use one-off request style, it opens the browser for this request, then closes it after finishing
    page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
    data = page.css('.quote .text::text').getall()
    

    Spiders

    Build full crawlers with concurrent requests, multiple session types, and pause/resume:

    from scrapling.spiders import Spider, Request, Response
    
    class QuotesSpider(Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]
        concurrent_requests = 10
        
        async def parse(self, response: Response):
            for quote in response.css('.quote'):
                yield {
                    "text": quote.css('.text::text').get(),
                    "author": quote.css('.author::text').get(),
                }
                
            next_page = response.css('.next a')
            if next_page:
                yield response.follow(next_page[0].attrib['href'])
    
    result = QuotesSpider().start()
    print(f"Scraped {len(result.items)} quotes")
    result.items.to_json("quotes.json")
    

    Use multiple session types in a single spider:

    from scrapling.spiders import Spider, Request, Response
    from scrapling.fetchers import FetcherSession, AsyncStealthySession
    
    class MultiSessionSpider(Spider):
        name = "multi"
        start_urls = ["https://example.com/"]
        
        def configure_sessions(self, manager):
            manager.add("fast", FetcherSession(impersonate="chrome"))
            manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
        
        async def parse(self, response: Response):
            for link in response.css('a::attr(href)').getall():
                # Route protected pages through the stealth session
                if "protected" in link:
                    yield Request(link, sid="stealth")
                else:
                    yield Request(link, sid="fast", callback=self.parse)  # explicit callback
    

    Pause and resume long crawls with checkpoints by running the spider like this:

    QuotesSpider(crawldir="./crawl_data").start()
    

Press Ctrl+C to pause gracefully — progress is saved automatically. Later, when you start the spider again, pass the same crawldir, and it will resume from where it stopped.
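The checkpoint idea behind crawldir can be sketched in a few lines of plain Python. This illustrates the concept only; the file name and JSON layout below are invented for the example and are not Scrapling's on-disk format:

```python
import json
import tempfile
from pathlib import Path

def save_checkpoint(crawldir: Path, pending: list[str], done: set[str]) -> None:
    """Persist the crawl frontier so a later run can resume it."""
    crawldir.mkdir(parents=True, exist_ok=True)
    state = {"pending": pending, "done": sorted(done)}
    (crawldir / "checkpoint.json").write_text(json.dumps(state))

def load_checkpoint(crawldir: Path) -> tuple[list[str], set[str]]:
    """Return (pending, done); empty state when no checkpoint exists yet."""
    path = crawldir / "checkpoint.json"
    if not path.exists():
        return [], set()
    state = json.loads(path.read_text())
    return state["pending"], set(state["done"])

# Simulate pausing after the first page, then resuming in a new run
crawldir = Path(tempfile.mkdtemp())
save_checkpoint(crawldir, ["https://example.com/page2"], {"https://example.com/"})
pending, done = load_checkpoint(crawldir)
print(pending, sorted(done))
```

On resume, already-visited URLs are skipped and the pending queue is re-seeded, which is why restarting with the same crawldir picks up where the crawl stopped.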

    Advanced Parsing & Navigation

    from scrapling.fetchers import Fetcher
    
    # Rich element selection and navigation
    page = Fetcher.get('https://quotes.toscrape.com/')
    
    # Get quotes with multiple selection methods
    quotes = page.css('.quote')  # CSS selector
    quotes = page.xpath('//div[@class="quote"]')  # XPath
    quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup-style
    # Same as
    quotes = page.find_all('div', class_='quote')
    quotes = page.find_all(['div'], class_='quote')
    quotes = page.find_all(class_='quote')  # and so on...
    # Find element by text content
    quotes = page.find_by_text('quote', tag='div')
    
    # Advanced navigation
    quote_text = page.css('.quote')[0].css('.text::text').get()
    quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
    first_quote = page.css('.quote')[0]
    author = first_quote.next_sibling.css('.author::text')
    parent_container = first_quote.parent
    
    # Element relationships and similarity
    similar_elements = first_quote.find_similar()
    below_elements = first_quote.below_elements()
    

If you don't want to fetch websites, you can use the parser directly on HTML you already have, like below:

    from scrapling.parser import Selector
    
    page = Selector("<html>...</html>")
    

    And it works precisely the same way!

    Async Session Management Examples

    import asyncio
    from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
    
    async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and can work in both sync/async patterns
        page1 = session.get('https://quotes.toscrape.com/')
        page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
    
    # Async session usage
    async with AsyncStealthySession(max_pages=2) as session:
        tasks = []
        urls = ['https://example.com/page1', 'https://example.com/page2']
        
        for url in urls:
            task = session.fetch(url)
            tasks.append(task)
        
        print(session.get_pool_stats())  # Optional - The status of the browser tabs pool (busy/free/error)
        results = await asyncio.gather(*tasks)
        print(session.get_pool_stats())
    

    CLI & Interactive Shell

    Scrapling includes a powerful command-line interface:


    Launch the interactive Web Scraping shell

    scrapling shell
    

Extract pages to a file directly, without programming (the content inside the body tag is extracted by default). If the output file ends with .txt, the text content of the target is extracted; if it ends in .md, you get a Markdown representation of the HTML content; and if it ends in .html, you get the HTML content itself.

    scrapling extract get 'https://example.com' content.md
    scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
    scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
    scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
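The suffix-to-format rule described above boils down to a small dispatch. A sketch of that behaviour in Python (`output_mode` is a hypothetical helper, not part of the CLI):

```python
from pathlib import Path

def output_mode(filename: str) -> str:
    """Map the output file's suffix to the extraction format described above."""
    modes = {".txt": "text", ".md": "markdown", ".html": "html"}
    suffix = Path(filename).suffix.lower()
    if suffix not in modes:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return modes[suffix]

print(output_mode("content.md"))   # markdown
print(output_mode("content.txt"))  # text
```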
    

[!NOTE] There are many additional features, including the MCP server and the interactive Web Scraping shell, but we want to keep this page concise. Check out the full documentation here

    Performance Benchmarks

Scrapling isn't just powerful — it's also blazing fast. The following benchmarks compare Scrapling's parser with the latest versions of other popular libraries.

    Text Extraction Speed Test (5000 nested elements)

| # | Library | Time (ms) | vs Scrapling |
|---|-------------------|-----------|--------------|
| 1 | Scrapling | 2.02 | 1.0x |
| 2 | Parsel/Scrapy | 2.04 | 1.01x |
| 3 | Raw Lxml | 2.54 | 1.257x |
| 4 | PyQuery | 24.17 | ~12x |
| 5 | Selectolax | 82.63 | ~41x |
| 6 | MechanicalSoup | 1549.71 | ~767.1x |
| 7 | BS4 with Lxml | 1584.31 | ~784.3x |
| 8 | BS4 with html5lib | 3391.91 | ~1679.1x |

    Element Similarity & Text Search Performance

    Scrapling's adaptive element finding capabilities significantly outperform alternatives:

| Library | Time (ms) | vs Scrapling |
|-------------|-----------|--------------|
| Scrapling | 2.39 | 1.0x |
| AutoScraper | 12.45 | 5.209x |

    All benchmarks represent averages of 100+ runs. See benchmarks.py for methodology.

    Installation

    Scrapling requires Python 3.10 or higher:

    pip install scrapling
    

This installation includes only the parser engine and its dependencies, without any fetcher or command-line dependencies.

    Optional Dependencies

1. If you are going to use any of the extra features below, the fetchers, or their classes, you will need to install the fetchers' dependencies and their browser dependencies as follows:

      pip install "scrapling[fetchers]"
      
      scrapling install
      

      This downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies.

    2. Extra features:

      • Install the MCP server feature:
        pip install "scrapling[ai]"
        
      • Install shell features (Web Scraping shell and the extract command):
        pip install "scrapling[shell]"
        
      • Install everything:
        pip install "scrapling[all]"
        

Remember that you need to install the browser dependencies with scrapling install after any of these extras (if you haven't already).

    Docker

You can also pull a Docker image with all extras and browsers included, using the following command from DockerHub:

    docker pull pyd4vinci/scrapling
    

    Or download it from the GitHub registry:

    docker pull ghcr.io/d4vinci/scrapling:latest
    

    This image is automatically built and pushed using GitHub Actions and the repository's main branch.

    Contributing

    We welcome contributions! Please read our contributing guidelines before getting started.

    Disclaimer

    [!CAUTION] This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.

    License

    This work is licensed under the BSD-3-Clause License.

    Acknowledgments

    This project includes code adapted from:

• Parsel (BSD License) — used for the translator submodule

Designed & crafted with ❤️ by Karim Shoair.
