Technology blogs have wondered whether Google is a lumbering giant in this Twitter moment, unable to handle streams of tweets that were broadcast just seconds earlier.
Google moves faster than some of its critics think. But even if it didn’t, the more important question is this: Do we really want Google’s search engine to swallow Twitter’s output as fast as it comes, without filtering, analysing and ranking by authority?
“Real-time search begets real-time spam,” writes Danny Sullivan, editor-in-chief of the website Search Engine Land.
Anyone who signs up to follow a particular Twitterer receives tweets instantaneously, as they are dispatched (when the system is functioning). Filtering is not an issue in such cases: The 1.77 million followers of Britney Spears presumably look forward to receiving every morsel of information broadcast from her account.
But if one wants to search Twitter for tweets about a topic—say, about Spears, but encompassing anyone’s tweet that happens to mention her—Twitter’s data fill an ocean in which it’s hard to find specific fish.
Twitter’s search page says, “See what’s happening—right now”. But Twitter’s database was not originally designed to be searched like Google’s was. Last year, in fact, Twitter bought another start-up, Summize, to provide it with search functionality.
Even so, search performance on Twitter is sluggish compared with the live tweet stream. Sullivan notes that Twitter’s search service does not consistently deliver real-time results: 20 or more minutes often pass before a given tweet appears in search results.
At Google only hundredths of a second are needed to check its index when a search phrase is submitted. But to prepare, the company re-surveys the wide Web to update that index on a schedule that the company does not divulge. Some websites, like those of news organizations, are checked very often. Others await their turn in a rotating schedule of visits by Google’s crawler, the software that collects copies of Web pages.
Peter Norvig, director of research at Google, says Larry Page, one of Google's co-founders, has consistently pushed the company’s engineers to index the most active Web pages faster. When the frequency was increased to hourly, Page insisted that the interval be referred to as “3,600 seconds” to emphasize that it would be reduced further, which it was.
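The re-crawl schedule described above can be pictured as a priority queue of revisit times, with busier sites assigned shorter intervals. The sketch below is purely illustrative: the site names, intervals and scheduling logic are assumptions for the example, not Google’s actual (undisclosed) schedule.

```python
import heapq

def plan_crawls(sites: dict[str, int], horizon: int) -> list[tuple[int, str]]:
    """Simulate a recrawl schedule. `sites` maps a site to its revisit
    interval in seconds; frequently updated sites (e.g. news) get short
    intervals. Returns the ordered (time, site) visits up to `horizon`
    seconds. Illustrative only -- not any real crawler's policy."""
    # Every site gets an initial visit at time zero.
    queue = [(0, site) for site in sites]
    heapq.heapify(queue)
    visits = []
    while queue:
        t, site = heapq.heappop(queue)
        if t > horizon:
            continue  # drop visits scheduled beyond the simulation window
        visits.append((t, site))
        # Schedule the next visit one interval later.
        heapq.heappush(queue, (t + sites[site], site))
    return visits
```

With a hypothetical news site revisited every 3,600 seconds and a blog revisited daily, a two-hour simulation yields three visits to the news site and one to the blog, which is the asymmetry the article describes.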
Google checks news feeds constantly but does not so easily pull in tweets. At a press event in London last month, Page was asked to comment on any plans that Google had to search Twitter in real time. After praising Twitter for doing a “great job” in showing information to users in real time, Page said he had long been pushing his search teams to index every second. “They sort of laugh at me and go, ‘It’s OK if it’s a few minutes old,’” he said. “And I’m like, ‘No, no, it needs to be every second’.”
A number of search start-ups have appeared recently that differentiate their offerings from older search engines’ by playing up their specialized focus on the real-time Web. For example, OneRiot, based in Boulder, Colorado, covers Twitter among other social media, but it has an intriguing means of reducing Twitter spam: it does not index the text in tweets—it plucks only the links, reasoning that the videos, news stories and blog posts that are being shared are what others will be most interested in.
OneRiot follows the link, checks for spam by comparing the content of the page with the content of the tweet, and then uses its own algorithms to figure out where the link should go in its always-changing index of “hot” items.
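The filtering step OneRiot is described as performing — index only the links, then compare the linked page against the tweet to weed out spam — can be sketched in a few lines. Everything below is an assumption for illustration (the regular expressions, the overlap measure and the `0.2` threshold are invented); OneRiot’s actual algorithms are not public.

```python
import re

def extract_links(tweet: str) -> list[str]:
    # Index only the URLs a tweet shares, not the tweet text itself.
    return re.findall(r"https?://\S+", tweet)

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def looks_like_spam(tweet: str, page_text: str, threshold: float = 0.2) -> bool:
    """Hypothetical spam check: if the tweet's words barely overlap with
    the content of the page it links to, treat the tweet as suspect."""
    tweet_words = _tokens(re.sub(r"https?://\S+", "", tweet))
    page_words = _tokens(page_text)
    if not tweet_words:
        return True  # a bare link with no describing text is suspect
    overlap = len(tweet_words & page_words) / len(tweet_words)
    return overlap < threshold
```

A tweet that actually describes the page it links to passes the check, while an unrelated come-on attached to the same link fails it; ranking the surviving links in a constantly refreshed index would be a separate step.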
Strictly speaking, this is not real-time processing. But checking links before adding them to the index seems to be time well spent. Google crawls Twitter’s website to collect the same links included in tweets that OneRiot collects, and these may show up in Google’s search results.
Google’s almost-real-time search provides much higher-quality results than does literal real-time search. When speaking about the need to index the Web “every second”, Page acknowledged the usefulness of taking a wee bit of time to analyse the gathered information. “If you really want up-to-the-second information, it’s not going to be as good as if you’re willing to wait a couple of minutes,” he said. “I’m not sure everybody needs to be seeing this stuff every second.”
©2009/THE NEW YORK TIMES
Randall Stross is an author based in Silicon Valley and a professor of business at San Jose State University.