node simplecrawler

node-simplecrawler.

Simplecrawler is designed to provide the most basic possible API for crawling websites, while being as flexible and robust as possible. I wrote simplecrawler to archive, analyse, and search some very large websites. It has happily chewed through 50,000 pages and written tens of gigabytes to disk without issue.

What does simplecrawler do?

  • Provides a very simple event-driven API using EventEmitter
  • Offers an extremely configurable base for writing your own crawler
  • Provides some simple logic for autodetecting linked resources – which you can replace or augment
  • Has a flexible queue system which can be frozen to disk and defrosted
  • Provides basic statistics on network performance
  • Uses buffers for fetching and managing data, preserving binary data (except when discovering links)
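
To make the feature list above more concrete, here is a minimal sketch of the same ideas: an EventEmitter-based crawler with replaceable link discovery, a queue that can be frozen and defrosted, and page bodies kept as buffers until links are extracted. It is not simplecrawler's actual code or API. The class `SketchCrawler`, its method and event names, and the in-memory `pages` object (which stands in for real HTTP requests) are all assumptions made for illustration.

```javascript
const { EventEmitter } = require("events");

// Hypothetical in-memory "site" standing in for real HTTP responses.
const pages = {
  "http://example.com/": Buffer.from('<a href="http://example.com/a">A</a>'),
  "http://example.com/a": Buffer.from('<a href="http://example.com/">Home</a>'),
};

// Illustrative only: names and behaviour are assumptions, not the library's API.
class SketchCrawler extends EventEmitter {
  constructor(fetchPage) {
    super();
    this.fetchPage = fetchPage; // (url) => Buffer, or undefined if missing
    this.queue = [];            // URLs waiting to be fetched
    this.seen = new Set();      // every URL that has ever been queued
  }

  // Add a URL once; returns false if it was already queued.
  queueURL(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  // Replaceable link discovery: the buffer is decoded to text only here.
  discoverLinks(buffer) {
    const html = buffer.toString("utf8");
    return Array.from(html.matchAll(/href="([^"]+)"/g), (m) => m[1]);
  }

  // Drain the queue synchronously, emitting an event per page.
  start() {
    while (this.queue.length > 0) {
      const url = this.queue.shift();
      const body = this.fetchPage(url);
      if (body === undefined) {
        this.emit("fetchfailed", url);
        continue;
      }
      this.emit("fetchcomplete", url, body);
      for (const link of this.discoverLinks(body)) this.queueURL(link);
    }
    this.emit("complete");
  }

  // Serialise the queue state so it can be written to disk and restored.
  freeze() {
    return JSON.stringify({ queue: this.queue, seen: [...this.seen] });
  }

  defrost(json) {
    const state = JSON.parse(json);
    this.queue = state.queue;
    this.seen = new Set(state.seen);
  }
}

const crawler = new SketchCrawler((url) => pages[url]);
const fetched = [];
crawler.on("fetchcomplete", (url, body) => {
  fetched.push(`${url} (${body.length} bytes)`);
});
crawler.queueURL("http://example.com/");
crawler.start();
console.log(fetched);
```

Passing the fetch function into the constructor keeps the sketch free of network access. A real crawler would fetch asynchronously and would write the string returned by `freeze()` to a file.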

 
