Crawling the deep web

First Published: Tue, Mar 09 2010. 09 14 PM IST

 Illustration: Raajan / Mint
Updated: Tue, Mar 09 2010. 09 14 PM IST
Today, Google is synonymous with search. Search engines, driven by algorithms, return results faster than we can say “search” and make us believe we have all the information. While popular search engines can search much of the Web, many sites lie below their radar, and you will probably never come across them. Welcome to the deep Web.
Limited searches
Each search engine has a program called a spider, crawler or bot that constantly crawls the Internet. It indexes the pages it crawls and ranks them according to the relevance of their content. While crawling a website, the bot also follows the links on it, thus increasing its footprint. Depending on its algorithms, a search engine can either confirm the presence of a page without indexing it, or index the page content and look for hyperlinks on the page. How often the spider crawls a website is at the search engine’s discretion.
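The crawl-and-index loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the PAGES dictionary is a made-up, in-memory stand-in for the Web (a real crawler would fetch pages over HTTP), and the names LinkAndTextParser and crawl are hypothetical. Ranking by relevance is left out.

```python
from collections import deque
from html.parser import HTMLParser
import re

# Hypothetical in-memory "Web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "http://example.org/": '<a href="http://example.org/a">A</a> welcome home',
    "http://example.org/a": '<a href="http://example.org/b">B</a> deep web article',
    "http://example.org/b": "federated search article",
    # No page links to this one, so a link-following crawler never sees it.
    "http://example.org/hidden": "database record nobody links to",
}


class LinkAndTextParser(HTMLParser):
    """Collects the hyperlinks and the lowercase words found on one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z]+", data.lower()))


def crawl(seed):
    """Breadth-first crawl from seed; returns a word -> set-of-URLs index."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:  # page does not exist in our toy Web
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for link in parser.links:  # follow links to widen the footprint
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl("http://example.org/")
print(sorted(index["article"]))
print("database" in index)
```

The page at http://example.org/hidden never enters the index because nothing links to it, which is the same reason unlinked content stays invisible to a crawler.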
However, search engines have limitations. They operate on fixed algorithms and are sometimes unable to contextualize a query, which often leads to irrelevant results. Also, search engine bots only crawl static Web pages, whereas a majority of the information on the Net is stored in databases, which the spiders are not able to crawl. Thus, the search results miss out on the data in several databases, such as those of universities and government organizations, among others. These databases add up to a huge volume of data, so search results cover only a fraction of what is available.
Decoding content
The most logical question then is: if Google can’t find the data, where exactly is it and why can’t it be crawled? Let’s try to decode the deep Web by its content. A database contains information stored in tables that are created with database software such as Access or Oracle and are typically queried using SQL. This data can only be retrieved by posting a query. The query, when executed, searches the database and returns the results that match what was specified. This is very different from searching static Web pages that can be accessed directly by crawlers.
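A small example may make the difference concrete. The sketch below uses SQLite, which ships with Python, purely as a convenient stand-in for any database; the table name, columns and records are invented for illustration. The point is that the rows have no URL of their own and surface only when a query is posted.

```python
import sqlite3

# An in-memory database standing in for, say, a university's thesis catalogue.
# The table, columns and records are all hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE theses (title TEXT, author TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO theses VALUES (?, ?, ?)",
    [
        ("Crawling hidden databases", "A. Rao", 2008),
        ("Federated search in libraries", "B. Singh", 2009),
        ("Web crawler scheduling", "C. Mehta", 2007),
    ],
)


def search(conn, keyword):
    """Return (title, year) rows whose title contains keyword, oldest first."""
    rows = conn.execute(
        "SELECT title, year FROM theses WHERE title LIKE ? ORDER BY year",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


results = search(conn, "crawl")
for title, year in results:
    print(year, title)
```

On a real site, a search form would collect the keyword and run a query like this behind the scenes. A crawler that only follows links never fills in the form, so it never sees the rows.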
Federated search
So let us look at ways to harness this repository of information. Popular search engines crawl websites for links, but they are not able to search databases. Federated search engines are a category of search engine that searches multiple online databases or Web resources.
Juliana Freire, associate professor at the School of Computing at the University of Utah, and her team have come up with DeepPeep (www.deeppeep.org). This aims to crawl public databases for casual as well as expert users. “DeepPeep is a vertical search engine: It is specialized in Web forms—the entry points to hidden Web content. Currently, we index 45,000 pages that contain forms, spanning seven distinct domains,” she says.
According to her, a lot of mainstream websites are searching the deep Web to a certain extent. “Having the ability to combine visible and hidden Web information in a search engine’s results can lead to higher quality information being presented to users. But if not done right, this can lead to further overwhelming users with information,” she adds.
Pros and cons of federated search
Saves time: Federated search engines are a boon to research students as they help save time. These engines perform simultaneous searches on different databases so that the user does not have to visit individual sites to perform a search. They consolidate the results of all the various database searches on to one Web page.
Quality of results: The quality of results in a federated search is much better than that of a popular search engine’s results on the same topic. This is because the federated search gets its results from databases that are associated with organizations which are authorities on the subject.
Real-time search: Federated search does real-time searching, giving you the most up-to-date information from the source on your query. In popular search engines, search results are updated only when the crawlers crawl the Web. Deep Web search engines search each source live for every query. So as soon as the parent database is updated to include a new document, the search will find it.
However, federated search engines take longer to come up with results than popular search engines. This is because federated search results depend on how fast the underlying database search performs. Most federated search engines throw up the results to a query as and when they get a result.
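The trade-off described above, one query sent to several databases at once with results consolidated as they arrive, can be sketched as follows. This is a toy model under stated assumptions: the three sources, their records and their delays are invented, and each source is simulated with a short sleep rather than a real network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_source(name, records, delay):
    """Build a fake database search function that answers after delay seconds."""

    def search(query):
        time.sleep(delay)  # simulate the time the underlying database takes
        return [(name, r) for r in records if query.lower() in r.lower()]

    return search


# Hypothetical sources; names, records and delays are made up for illustration.
SOURCES = {
    "medical_db": make_source(
        "medical_db", ["Malaria vaccine trial", "Heart surgery outcomes"], 0.05
    ),
    "science_db": make_source(
        "science_db", ["Malaria parasite genome", "Solar cell efficiency"], 0.15
    ),
    "business_db": make_source("business_db", ["Vaccine market report"], 0.01),
}


def federated_search(query):
    """Query every source at the same time and merge results as they come in."""
    merged = []
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [pool.submit(search, query) for search in SOURCES.values()]
        for future in as_completed(futures):  # fastest source is handled first
            merged.extend(future.result())
    return merged


hits = federated_search("malaria")
for source, title in hits:
    print(source, "->", title)
```

Because the sources are queried in parallel, the whole search takes roughly as long as the slowest database, not the sum of all of them. That is still slower than looking up a prebuilt index, which is the cost of searching each source live.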
So next time you log on to the Web, do not limit your search to popular search engines. Do try to explore the deep Web.
Federated search engines
DeepPeep: This search engine’s speciality is Web forms. The current beta version tracks 45,000 forms across seven domains. These domains include auto, airfare, biology, book, job and rentals.
Deep Web Technologies provides federated searches for scientific and technology-related search queries. It has federated search sites for particular topics, such as Mednar for medical research; Biznar for business-related sources; and Worldwidescience for scientific content from databases across the world.
Marcus Zillman, executive director of Virtual Private Library and a deep Web expert, has developed a subject-tracer information blog which has links to research websites divided into neat categories. This blog does not have any search query box. So you will have to decide your area of interest and then follow the links on the virtual private library’s homepage.
This is an excellent site that scours the deep Web in a search for people based on their names, email addresses, usernames or even phone numbers.
Write to us at businessoflife@livemint.com
Content brought to you by Digit.
Copyright © 2014 HT Media All Rights Reserved