Heritrix: Internet Archive Web Crawler

The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.

Website	http://crawler.archive.org
Tags	Archiving, Library, WWW/HTTP
License	Apache License V2.0; GNU Library or Lesser General Public License version 2.0 (LGPLv2)
Platform	Linux, Mac, Windows
Features	deeply and thoroughly harvests website content; works on any Java platform (Linux recommended); stores content to ARC or ISO WARC aggregate/transcript format; web interface for operator control and monitoring of crawls

Related Projects: