<nowiki>WebSPHINX</nowiki> (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for Web crawlers. A WebCrawler (also called a robot or spider) is a program that browses and processes Web pages automatically.

http://www.cs.cmu.edu/~rcm/websphinx/

The Crawler Workbench is a Java applet that puts a customizable Web crawler right in your browser. Using the Crawler Workbench, you can: 

*Visualize a collection of Web pages as a graph 
*Save pages to your local disk for offline browsing 
*Concatenate pages together for viewing or printing them as a single document 
*Extract all text matching a certain pattern from a collection of pages
*Develop a custom crawler in Java or JavaScript that processes pages however you want
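
To make the "extract all text matching a pattern" and "custom crawler" items above concrete, here is a minimal sketch in plain Java. It does not use the <nowiki>WebSPHINX</nowiki> API: the names SamplePage and PatternCrawler are hypothetical, and the "site" is an in-memory map standing in for real HTTP fetches, so nothing here touches a server.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a fetched page: its text and the URLs it links to. */
final class SamplePage {
    final String text;
    final List<String> links;

    SamplePage(String text, List<String> links) {
        this.text = text;
        this.links = links;
    }
}

/** Illustrative crawler; an assumption for this page, not part of the library. */
final class PatternCrawler {
    /** Visits pages breadth-first from start and collects every match of pattern. */
    static List<String> crawl(Map<String, SamplePage> site, String start, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        Set<String> seen = new HashSet<>();      // URLs already queued, so cycles terminate
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            SamplePage page = site.get(url);
            if (page == null) {
                continue;                        // link points outside the collection
            }
            Matcher m = pattern.matcher(page.text);
            while (m.find()) {
                matches.add(m.group());          // the "process this page" step
            }
            for (String link : page.links) {
                if (seen.add(link)) {            // add() returns false if already seen
                    queue.add(link);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, SamplePage> site = Map.of(
            "a", new SamplePage("mail bob@example.org", List.of("b", "c")),
            "b", new SamplePage("no address here", List.of("a")),
            "c", new SamplePage("mail amy@example.org", List.of()));
        Pattern email = Pattern.compile("[\\w.]+@[\\w.]+");
        System.out.println(crawl(site, "a", email));
    }
}
```

The seen set is what keeps the crawl from looping when pages link back to each other, as "b" does to "a" here. A real crawler would replace the map lookup with a page fetch and would need to pace its requests.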

[http://www.cs.cmu.edu/~rcm/websphinx/workbench.html See it in action!] ... but please be careful, as it hammers the server. ''(Link now defunct.)''

----

Source code is available at the above link.

----
CategoryInformationVisualization