WebCrawler


A WebCrawler is a robot that traverses the WorldWideWeb in depth-first or breadth-first order. There are smarter algorithms than pure DFS or BFS, especially ones that prevent inadvertent DenialOfService attacks and ones that adhere to the RobotsExclusionStandard.
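For reference, the RobotsExclusionStandard amounts to a plain-text /robots.txt file at the root of a site; a minimal sketch (the paths and robot name here are made up) looks like:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/

A well-behaved crawler fetches this file first and skips anything under the Disallow prefixes that apply to it.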

WebCrawlers are chiefly used as indexing robots for search engines and their ilk.


Is a spider different from a WebCrawler? Is a WebCrawler merely one particular spider?


Inktomi's crawler is called Slurp. It services Inktomi, Yahoo!, HotBot? and Snap amongst others. (http://www.inktomi.com/slurp.html) Slurp has been hitting hyperpolis.com (my domain) pretty hard lately and has inspired me to talk more about web crawling in general. -- SunirShah


As far as can be determined, the first web crawler was "The Wanderer", created by Matthew Gray. It was initially designed to track the growth of the Internet by counting the number of web servers that were up, but not long after its initial release it was modified to capture URLs. The resulting database, "The Wandex", was the first such collection of URLs. Many people had a problem with the Wanderer because, in its earlier incarnations, it noticeably degraded network performance by accessing the same web pages hundreds of times a day. This was fixed in later versions, but site administrators have been wary of web crawlers ever since.


Recommendations

Any recommendations? In particular: which ones are useful for making snapshots of a wiki?

wget

GNU Wget is a freely available network utility to retrieve files from the World Wide Web using HTTP and FTP. It works non-interactively. The recursive retrieval of HTML pages, as well as FTP sites, is supported: you can use Wget to make mirrors of archives and home pages, or traverse the web like a WWW robot (Wget understands /robots.txt).

It doesn't seem to be well suited to downloading an HTML snapshot of the wiki. Does anybody know the necessary settings?
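One possible starting point (untested here; the URL is a placeholder, and the exact flags vary with the wiki engine and the wget version):

    wget --recursive --level=2 --html-extension --convert-links \
         --page-requisites --no-parent --wait=1 \
         http://www.example.com/wiki/FrontPage

--convert-links rewrites the links for local browsing and --html-extension saves pages with a .html suffix; --wait=1 keeps the load on the server down. The catch with wikis is that edit, diff and search links get crawled too unless the site's /robots.txt (which wget honours) or some --reject/--exclude-directories rule keeps them out.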

LWP

There really is no other choice: if you know what you're looking for and want to control what you grab, what you store, and how you output it, Perl's LWP is what you want. Well, you also want HTML::Parser. With those two you can write your own spiders, search engines, or whatever. I think LWP::RobotUA? understands /robots.txt, but really, if you know what you're searching for, you can write the behavior yourself.
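As a rough illustration (not a drop-in spider; the start URL, agent name and page limit are made up), a breadth-first crawl with LWP::RobotUA and HTML::LinkExtor might look something like this:

    #!/usr/bin/perl
    # Sketch of a polite breadth-first crawler confined to one site.
    use strict;
    use warnings;
    use LWP::RobotUA;
    use HTML::LinkExtor;
    use URI;

    my $start = 'http://www.example.com/';    # placeholder start page
    my $host  = URI->new($start)->host;

    my $ua = LWP::RobotUA->new('ExampleSpider/0.1', 'webmaster@example.com');
    $ua->delay(1);    # wait one minute between hits on the same server

    my @queue   = ($start);
    my %seen    = ($start => 1);
    my $fetched = 0;

    while (my $url = shift @queue) {    # shift = breadth-first; pop would be depth-first
        last if ++$fetched > 50;        # keep the sketch from wandering forever
        my $res = $ua->get($url);
        next unless $res->is_success and $res->content_type eq 'text/html';

        # Pull out the links, made absolute against the page's base URL.
        my $extor = HTML::LinkExtor->new(undef, $res->base);
        $extor->parse($res->decoded_content);

        for my $link ($extor->links) {
            my ($tag, %attr) = @$link;
            next unless $tag eq 'a' and defined $attr{href};
            my $abs = URI->new($attr{href});    # already absolute thanks to $res->base
            next unless ($abs->scheme || '') =~ /^https?$/;    # skip mailto:, javascript:, etc.
            next unless $abs->host eq $host;                   # stay on the one site
            my $key = $abs->canonical->as_string;
            push @queue, $key unless $seen{$key}++;
        }
        # ... index or store $res->decoded_content here ...
    }

LWP::RobotUA takes care of fetching and respecting /robots.txt and of the per-host delay; swapping the shift for a pop on @queue is all it takes to turn the breadth-first traversal into a depth-first one.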

Of course, there are similar tools for Python, but since I learned and earned my living on Perl long before Python was installed on any machine I had an account on, I've not learned it.


Discussion
