Copyright holders challenge sites that scrape content
Updated: 02 Mar 2009, 11:32 PM IST
When the popular New York business blog Silicon Alley Insider quoted a quarter of Peggy Noonan’s Wall Street Journal column in mid-February, the editor added a caveat at the end: “We thank Dow Jones in advance for allowing us to bring it to you."
The editor added “in advance" because Dow Jones, the publisher of The Wall Street Journal, had not given the blog permission to use the column. (The Wall Street Journal has an exclusive content partnership in India with Mint.) The excerpt was published on the assumption that it would be permitted under the “fair use" provision of copyright law.
But some media executives are growing concerned that the Web’s increasingly popular curators, by taking large pieces of original work (a practice sometimes called scraping), are shaving away potential readers and profiting from the content.
With the Web’s advertising engine stalling just as newspapers are under pressure, some publishers are second-guessing their liberal attitude toward free content.
“A lot of news organizations are saying, ‘We’re not willing to accept the tiny fraction of a penny that we get from the page views that these links are sending in,’" said Joshua Benton, the director of the Nieman Journalism Lab at Harvard. “They think they need to defend their turf more aggressively."
Copyright infringement lawsuits directed at bloggers and other online publishers seem to be on the rise. David Ardia, the director of the Citizen Media Law Project, said his colleagues kept track of 16 such suits in 2007. In 2004 and 2005, the project monitored three such suits each year. And newspapers sometimes send cease-and-desist orders to sites that they believe have crossed the line.
Some publishers complained last week when Google News, a site that aggregates headlines from thousands of news sources, added advertising to its search results.
Last December, GateHouse Media sued The New York Times Co., alleging copyright infringement after local sites associated with The Boston Globe, a Times Co. newspaper, copied the headlines and lead sentences of GateHouse’s newspaper articles. The case was settled out of court in January.
In another case, which is pending, the Associated Press sued the online news distributor All Headline News last year, saying that it had improperly copied AP articles.
The legal disputes are emblematic of a larger question that has emerged from the Internet’s link economy. The editors of many websites, including ones operated by the Times Co., post excerpts from competitors’ content from time to time. At what point does excerpting from an article become illegal copying?
Courts have not provided much of an answer. In the US, the Copyright Law provides a four-point definition of fair use, which takes into consideration the purpose (commercial vs. educational) and the substantiality of the excerpt.
But editors in search of a legal word limit are sorely disappointed. Even before the Internet, lawyers lamented that the fair use factors “didn’t map well onto real life", Ardia said. “New modes of creation, reuse, mixing and mash-ups made possible by digital technologies and the Internet have made it even more clear that Congress’ attempt to define fair use is woefully inadequate."
For now, websites are defining it themselves. Sites such as Alley Insider and The Huffington Post are ad-supported businesses that filter the Web for readers, highlighting what they deem to be the most meaningful parts of newspaper articles and TV segments.
Alley Insider, according to its editor-in-chief, Henry Blodget, operates under a digital golden rule: “To excerpt others the way we want to be excerpted ourselves." The post about Noonan’s column, including five full paragraphs, had explicit credits to the author and the newspaper, three links to the source and a direct encouragement to users to read the original column.
Alley Insider doubtlessly exposed new readers to Noonan’s column, and an unknown number of users followed the links to The Journal’s website. But others probably did not follow the link, meaning that Alley Insider alone—and not The Journal—reaped the advertising pennies from the excerpt.
The Huffington Post, the popular news and opinion forum co-founded by the author and columnist Arianna Huffington, is perhaps the star of the excerpting debate. Huffington’s editors are especially adept at optimizing the site for search engine results, so that in a Google search, a Huffington Post summary of a Washington Post or a CNN.com report may appear ahead of the original article.
“We want to both drive traffic to ourselves and drive traffic to others," Huffington said in a telephone interview. Adding that “we are (at) the beginning of developing the rules of the road" online, she said the site’s editors were “constantly talking" about appropriate excerpting conduct.
To the extent that the site republishes articles produced by other organizations, “we excerpt to add value", Huffington said, sometimes by combining articles, videos and transcripts. Much of the Web works this way, skimming quotes and photos from other sources while trying to remain within the provisions of fair use.
Huffington said The Huffington Post, which had at least 20 million unique visitors in January, received at least 100 requests for links each weekday from reporters, editors and public relations representatives. “Everybody wants to be linked to," she said.
That is true as long as readers follow those links. The prevailing wisdom is that content should roam widely online, but lacklustre digital advertising of late has called that into question.
That has fuelled a round of recent commentaries about payment models for online news. Cablevision, the owner of the Newsday newspaper, said on Thursday that it would “end distribution of free Web content". Hearst, the owner of 16 newspapers, said on Friday that it would charge for some content on its websites.
Widespread excerpting would seem to make pay models harder to impose. Even more troubling for news organizations is blatant copying. In December, The Huffington Post’s new Chicago offshoot was accused of copying the full contents of local publications’ concert reviews. Huffington called it a “mistake made by an intern".
Other sites copy content from news organizations using automated syndication feeds. The sites typically display text or show ads around the excerpts to make money.
GateHouse’s suit against The New York Times Co. contended that the company was “link scraping" by automatically aggregating articles from GateHouse newspapers, to be excerpted on local news sites operated by The Boston Globe.
“They felt that The Globe was benefiting too much from the work of GateHouse journalists," said Benton of the Harvard journalism lab. The Times Co. denied that it was scraping GateHouse’s site and said its use of GateHouse content did not violate copyright laws.
It also said that GateHouse’s websites copied headlines and other text from Times Co. sites. Last month the Times Co. agreed to stop copying GateHouse’s headlines and lead paragraphs.
It remains to be seen whether excerpting standards from before the Internet age still apply. Ardia said quoting “is often a sign of respect" online.
“The norms are developing outside—or ahead of—the law," he said.
Alley Insider’s partial republication of Noonan’s column, for instance, was edited shortly after it was posted online. The reason, Blodget said, was that the excerpt seemed slightly too long.
©2009/THE NEW YORK TIMES