A SURVEY OF FOCUSED WEB CRAWLING ALGORITHMS
In: SIKDD 2004 at multiconference IS 2004, 12-15 Oct 2004, Ljubljana, Slovenia.
Web search engines collect data from the Web by “crawling” it: performing a simulated browsing of the Web by extracting links from pages, downloading them, and repeating the process ad infinitum. This process requires enormous hardware and network resources and ends with a large fraction of the visible Web on the crawler’s storage array. When only information about a predefined set of topics is desired, a specialization of this process called “focused crawling” is used instead. What follows is a short review of existing techniques for focused crawling.
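The fetch-extract-repeat loop described above, and the way focused crawling specializes it, can be sketched in a few lines. The sketch below is illustrative only: it uses a hypothetical in-memory “web” (a dict mapping URLs to page text and outgoing links) in place of real HTTP fetching, and a caller-supplied relevance predicate standing in for whatever topic classifier a focused crawler would actually use.

```python
from collections import deque

# Hypothetical in-memory "web" (for illustration only):
# URL -> (page text, outgoing links).
PAGES = {
    "a": ("crawling search engines", ["b", "c"]),
    "b": ("focused crawling topic", ["d"]),
    "c": ("cooking recipes", ["e"]),
    "d": ("web search algorithms", []),
    "e": ("more recipes", []),
}

def crawl(seed, relevant=lambda text: True):
    """Breadth-first crawl from `seed`: download a page, extract its
    links, repeat.  A focused crawler follows links only from pages
    judged on-topic, pruning whole regions of the link graph."""
    frontier = deque([seed])
    fetched = set()
    while frontier:
        url = frontier.popleft()
        if url in fetched:
            continue
        fetched.add(url)                 # "download" the page
        text, links = PAGES.get(url, ("", []))
        if relevant(text):               # the focused-crawling pruning step
            frontier.extend(l for l in links if l not in fetched)
    return fetched

full = crawl("a")                                # unrestricted crawl: all 5 pages
focused = crawl("a", lambda t: "crawling" in t)  # off-topic "c" fetched but not expanded
```

With the unrestricted predicate the crawl reaches every page, while the focused variant still downloads the off-topic page “c” (it cannot judge relevance before fetching) but refuses to expand its links, so the recipes-only subtree under it is never visited.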