One day last summer, Google’s search engine trundled quietly past a milestone. It added the 1 trillionth address to the list of Web pages it knows about. But as impossibly big as that number may seem, it represents only a fraction of the entire Web. Beyond those trillion pages lies an even vaster Web of hidden data.
The challenges major search engines face in penetrating this so-called deep Web go a long way towards explaining why they still can’t provide satisfying answers to questions such as “What's the best fare from New York to London next Thursday?” or “When will the Yankees play the Red Sox this year?” The answers are readily available—if only the search engines knew how to find them.
Now a new breed of technologies is taking shape that will extend the reach of search engines into the Web’s hidden corners. When that happens, it will do more than just improve the quality of search results; it may ultimately reshape the way many companies do business online.
Search engines rely on programs known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together. While that approach works well for the pages that make up the surface Web, these programs have a harder time penetrating databases that are set up to respond to typed queries.
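To make the limitation concrete, here is a minimal sketch of a link-following crawler run against a toy, in-memory site. The pages, URLs and the `crawl` function are hypothetical illustrations, not any search engine's actual code. The fare listing sits behind a search form, and because no hyperlink points to it, the crawler never finds it.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical toy site: URL -> HTML. The form on /flights has no hyperlink
# to its results, so the fare page behind it is never discovered.
PAGES = {
    "/": '<a href="/news">News</a> <a href="/flights">Flights</a>',
    "/news": '<a href="/">Home</a>',
    "/flights": '<form action="/search"><input name="q"></form>',
    "/search?q=london": "Fares from New York to London",
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Breadth-first traversal that only follows hyperlinks."""
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        parser = LinkExtractor()
        parser.feed(PAGES.get(url, ""))
        for link in parser.links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


surface = crawl("/")            # pages reachable by following links
hidden = set(PAGES) - surface   # pages that exist but were never reached
```

In this toy example the home page, the news page and the flights page end up in `surface`, while the fare results remain in `hidden`: they would appear only if something typed a query into the form.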
“The crawlable Web is the tip of the iceberg,” says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources. “Most search engines try to help you find a needle in a haystack,” Rajaraman says, “but what we’re trying to do is help you explore the haystack.”
Google’s deep Web search strategy involves sending out a program to analyse the contents of every database it encounters. In a similar vein, Juliana Freire at the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org) that eventually aims to crawl and index every database on the public Web. Extracting the contents of such far-flung data sets requires a sophisticated kind of computational guessing game.
“The naive way would be to query all the words in the dictionary,” Freire says. Instead, DeepPeep starts by posing a small number of sample queries “so we can then use that to build up our understanding of the databases and choose which words to search”. Based on that analysis, the program then fires off automated search terms in an effort to dislodge as much data as possible. Freire claims her approach retrieves at least 90% of the content stored in any given database.
As the major search engines begin experimenting with incorporating deep Web content into their search results, they must figure out how to present different kinds of data without overly complicating their pages. This poses a special quandary for Google, which has long resisted the temptation to make significant changes to its tried-and-true search results format.
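The sample-then-probe idea Freire describes can be sketched roughly as follows. This is an illustration under simplifying assumptions, not DeepPeep's actual algorithm: the records, the `search` stand-in for a site's query form, and the rule of picking the most frequent unseen word as the next query are all hypothetical.

```python
from collections import Counter

# Hypothetical hidden database: reachable only through search(), never by links.
RECORDS = [
    "yankees red sox april game",
    "yankees orioles may game",
    "red sox rays june game",
    "mets phillies july game",
    "concert boston august",
]


def search(term):
    """Stand-in for a site's query form: returns records containing the term."""
    return {r for r in RECORDS if term in r.split()}


def probe(seed_terms, budget):
    """Issue seed queries, then choose follow-up terms from the results so far."""
    found = set()
    tried = set()
    queue = list(seed_terms)
    for _ in range(budget):
        if not queue:
            # Assumed heuristic: query the most frequent word seen in the
            # retrieved records that has not been tried yet.
            counts = Counter(
                w for r in found for w in r.split() if w not in tried
            )
            if not counts:
                break
            queue.append(counts.most_common(1)[0][0])
        term = queue.pop(0)
        tried.add(term)
        found |= search(term)
    return found, tried


found, tried = probe(["yankees"], budget=4)
coverage = len(found) / len(RECORDS)  # fraction of the database retrieved
```

On this toy data, one seed query and three follow-ups retrieve four of the five records. The concert listing shares no words with anything already retrieved, so it stays hidden, which hints at why choosing query terms well matters more than simply issuing many of them.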
Beyond the realm of consumer searches, deep Web technologies may eventually let businesses use data in new ways. For example, a health site could cross-reference data from pharmaceutical companies with the latest findings from medical researchers, or a local news site could extend its coverage by letting users tap into public records stored in government databases.
This level of data integration could eventually point the way to something like the Semantic Web, the much-hyped—but so far unrealized—vision of a Web of interconnected data. That vision has struggled to gain traction because of all the human labour required to create the necessary reference files to make it work. Deep Web technologies hold the promise of achieving similar benefits at a much lower cost by automating the process of analysing database structures and cross-referencing results.
“The huge thing is the ability to connect disparate data sources,” says Mike Bergman, a computer scientist and consultant who is credited with coining the term “deep Web”. As websites start to automate the process of searching one another's content, perhaps the day will come when the Web just starts surfing itself.
©2009/THE NEW YORK TIMES