ZDNet UK



Anti-spam tool helps digitise books

Stephen Shankland CNET News.com

Published: 25 May 2007 13:11 BST


A group of Carnegie Mellon University programmers has launched a service called "ReCaptcha" that can help cut down on spam while letting people digitise books.

The project is a variation of the widely used "Captcha" (Completely Automated Public Turing test to tell Computers and Humans Apart) technique, which is used to block automated abuse such as sending spam email or posting spam in blog comments. Captchas require users to pass small pattern-recognition tests, most commonly reading distorted or obscured words.

ReCaptcha turns this chore into a productive task by letting users digitise scanned images of words that computers couldn't figure out.


"Not only can you solve your problems with spam, you can help preserve mankind's written history into the digital age," said Ben Maurer, the project's chief architect and a Carnegie Mellon University undergraduate, announcing the project on his blog on Wednesday.

Since the project launched on Tuesday, 150 websites have begun using it, said Luis von Ahn, a Carnegie Mellon assistant professor and ReCaptcha's "executive producer". In just the first half of Thursday, the project had digitised 8,000 words, he said.

It's a new example of how the internet can harness the collective energies of large numbers of people. Other examples include news sites such as Digg and Slashdot, which give prominence to content that users rate highly, and stock photography seller iStockphoto, which is beta testing an "Image Fight" site to rate photo quality.

ReCaptcha has the potential to digitise vast quantities of words. Von Ahn estimates that people perform 60 million Captcha tests daily.

The service presents users with two words: one from a conventional Captcha test, whose answer is already known, and one that optical character recognition (OCR) software couldn't figure out. If the user correctly identifies the known word, he or she is presumed to have decoded the unknown one correctly as well. Currently, ReCaptcha requires three separate people to transcribe a word identically before it is accepted as correct, von Ahn said.
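The two-word check and the three-person agreement rule can be sketched in a few lines of Python. This is a minimal illustration of the scheme as the article describes it, not ReCaptcha's actual code: the names `UnknownWord`, `submit` and `normalise` are hypothetical, and ignoring case and surrounding whitespace when comparing answers is an assumption.

```python
from collections import Counter
from dataclasses import dataclass, field

# From the article: three separate people must give the same transcription.
REQUIRED_AGREEMENT = 3


def normalise(text):
    """Assumption: answers are compared ignoring case and outer whitespace."""
    return text.strip().lower()


@dataclass
class UnknownWord:
    """A scanned word that OCR could not read, plus the guesses collected so far."""
    image_id: str
    # Keyed by user so that one person can never count as several votes.
    votes: dict = field(default_factory=dict)

    def resolved_text(self):
        """Return the agreed transcription, or None if agreement is not reached."""
        if not self.votes:
            return None
        text, count = Counter(self.votes.values()).most_common(1)[0]
        return text if count >= REQUIRED_AGREEMENT else None


def submit(user_id, control_answer, control_truth, unknown_guess, word):
    """Check the known (control) word; only if it is right, record the guess
    for the unknown word. Returns True when the user passes the test."""
    if normalise(control_answer) != normalise(control_truth):
        return False
    word.votes[user_id] = normalise(unknown_guess)
    return True


if __name__ == "__main__":
    word = UnknownWord("scan-0001")  # hypothetical image identifier
    guesses = [("ann", "morning"), ("bob", "morning"),
               ("cat", "mourning"), ("dan", "Morning")]
    for user, guess in guesses:
        submit(user, "overlooks", "overlooks", guess, word)
    print(word.resolved_text())  # prints "morning"
```

In this sketch a user who gets the known word wrong fails the test and contributes nothing, so only people who have already shown they can read distorted text get a say on the unknown word.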

Von Ahn was a member of the Carnegie Mellon team that developed Captcha in response to a Yahoo request for technology to keep computers from registering bogus email accounts, according to Carnegie Mellon. He's a recipient of a MacArthur Foundation "genius grant", which funded some ReCaptcha work.

Digital libraries
The ReCaptcha project is digitising books in the Internet Archive, a project that is building a digital library of cultural materials and that operates the Wayback Machine of historical website snapshots.

Among the first books being digitised is Psychology by philosopher John Dewey, von Ahn said. The project is considering other book archives, too, he added.

The ReCaptcha service is available now through an application programming interface (API) for people to integrate into their websites. Software plug-ins to use the API are open-source software packages hosted at Google Code.

ReCaptcha also can be used to shield email addresses from computers that harvest them for spam mailing lists.

Von Ahn's specialty is what he calls "human computation", which he defines as "novel techniques for utilising the computational abilities [or "cycles"] of humans".

Microsoft Research has its own philanthropic variation of Captcha technology: a project called Asirra that shows pictures of cats and dogs rather than text. Computers do a poor job telling the animals apart, but people can. To get a supply of constantly refreshed pet images, Microsoft pulls photos — and "adopt me" links — from the Petfinder website.

Two of von Ahn's higher-profile projects were the online games ESP Game and Peekaboom, which rely on crowds to label images. Like reading obfuscated text, it's a task at which computers are poor.

Google licensed the ESP Game technology and offers it as its Google Image Labeler to improve its own image-search technology.

Carnegie Mellon is hosting the ReCaptcha service on $30,000 (£15,112) worth of servers donated by Intel, von Ahn said. Other sponsors include Novell, which contributed its Suse Linux Enterprise Server support subscriptions, and Carnegie Mellon.
