DataHen Till is a companion tool to your existing web scraper that instantly makes it scalable, maintainable, and harder to block, with minimal code changes to your scraper. It integrates with any scraper in 5 minutes.

Web scraping is usually easy to get started with, especially on a small scale. However, as you try to scale it up, it gets exponentially more difficult. Scraping 10,000 records can easily be done with a simple web scraper script in any programming language, but to scrape millions of pages you need to architect and build features into your script that allow you to scale, maintain, and unblock your scrapers.

Scraping millions or even billions of records requires much more planning up front. It is not simply a matter of running your existing web scraper script on a machine with more CPU and RAM.
Features
- Till provides a plug-and-play method of making your web scrapers scalable
- As you scale up the number of requests, target websites will often detect your scraper and try to block your requests with CAPTCHAs
- Till helps you avoid being detected as a web scraper by identifying your scraper as a real web browser
- Maintaining high-scale scrapers is challenging due to the massive volume of requests and interactions between your scrapers and the target websites
- Postmortem analysis & reproducibility
- User-Agent randomizer
- Proxy IP address rotation
- Sticky Sessions
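
To make the last three features concrete, here is a minimal Python sketch of the general ideas behind User-Agent randomization, proxy rotation, and sticky sessions. It is not Till's implementation or API: the `IdentityPicker` class, the `RequestIdentity` type, and the User-Agent and proxy values are all hypothetical and exist only for illustration.

```python
import itertools
import random
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# Hypothetical pools, for illustration only. A real setup would load these
# from configuration or from a proxy provider.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


@dataclass(frozen=True)
class RequestIdentity:
    """The User-Agent and proxy that a single request should use."""
    user_agent: str
    proxy: str


class IdentityPicker:
    """Chooses a User-Agent and proxy for each outgoing request.

    - User-Agent randomization: a random entry from the pool per request.
    - Proxy rotation: proxies are handed out round-robin.
    - Sticky sessions: requests that share a session_id reuse the identity
      chosen for the first request of that session.
    """

    def __init__(self, user_agents: Iterable[str], proxies: Iterable[str],
                 seed: Optional[int] = None) -> None:
        self._user_agents = list(user_agents)
        self._proxy_cycle = itertools.cycle(list(proxies))
        self._rng = random.Random(seed)
        self._sticky: Dict[str, RequestIdentity] = {}

    def pick(self, session_id: Optional[str] = None) -> RequestIdentity:
        # Sticky session: return the identity already bound to this session.
        if session_id is not None and session_id in self._sticky:
            return self._sticky[session_id]

        identity = RequestIdentity(
            user_agent=self._rng.choice(self._user_agents),
            proxy=next(self._proxy_cycle),
        )
        if session_id is not None:
            self._sticky[session_id] = identity
        return identity


if __name__ == "__main__":
    picker = IdentityPicker(USER_AGENTS, PROXIES)
    print(picker.pick())             # fresh identity for a one-off request
    print(picker.pick("checkout"))   # identity bound to the "checkout" session
    print(picker.pick("checkout"))   # same identity as the previous line
```

A scraper would apply the returned identity to each request by setting the `User-Agent` header and routing the request through the chosen proxy. Keeping the same identity for a whole session matters when the target site ties cookies or login state to the client it first saw.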