We aggregate information from all open source repositories to provide the most of useful open source products, from Enterprise product to small libraries across all platforms. Please search and find what you want.

Heritrix: Internet Archive Web Crawler

The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accesible content.

Today’s small-to-medium-sized (SMB) businesses and large enterprises are saving on their monthly communications costs by making one simple decision: to switch to a VoIP service solution from their old, outdated Plain Old Telephone Service (POTS). By choosing a new VoIP service, these companies enjoy the flexibility, reliability, call features, and audio quality that only a VoIP service can provide. Plus, they cut their phone bill by up to 70%!

Website http://crawler.archive.org
Tags
License
Platform
Features
  • deeply and thoroughly harvests website content
  • works on any Java platform (Linux recommended)
  • stores content to ARC or ISO WARC aggregate/transcript format
  • web interface for operator control and monitoring of crawls