Captcha: A Machine Learning AI Training Tool.
Never a wasted moment.
https://www.techradar.com/news/captcha-if-you-can-how-youve-been-training-ai-for-years-without-realising-it
From the Article
Congratulations are in order. You, yes you, dear reader, have been part of something incredible. Thanks to your hard work, millions of books containing pretty much the sum-total of human knowledge have been successfully digitised, saving their texts for future generations. All because of you.
No, seriously.
You know how occasionally you’ll be prompted with a “Captcha” when filling out a form on the internet, to prove that you’re fully human? Behind the scenes of one of the most popular Captcha systems – Google’s Recaptcha – your humanoid clicks have been helping figure out things that traditional computing just can’t manage, and in the process you’ve been helping to train Google’s AI to be even smarter.
And you thought you were merely logging into some website or other.
Origins
Recaptcha (or “reCAPTCHA” if you prefer) started out as a collaboration by a number of computer scientists at Carnegie Mellon University in Pittsburgh. It was first released in 2007, and it was quickly snaffled up by Google in 2009. The premise was as described above: by marrying up users who need to prove they are human to data that needs transcribing, both sides get something out of it.
So instead of one person carrying out the very boring task of typing or checking a whole book manually, millions of people can unknowingly collaborate to achieve the same goal. Remember how it always used to be two words you had to enter? Conceivably, only one was the “real” test, and the other was a new word that was yet to be transcribed. As the user you wouldn’t know which was which, so you’d have to attempt both accurately.
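To make the two-word idea concrete, here is a minimal Python sketch of how such a scheme could work. It is an illustration only, not Google's actual implementation: the class name, the method names and the three-vote threshold are all assumptions made up for this example.

```python
from collections import Counter


class TwoWordChallenge:
    """Hypothetical two-word captcha: one known control word, one unknown word."""

    def __init__(self, control_answer, votes_needed=3):
        self.control_answer = control_answer.lower()
        self.votes_needed = votes_needed  # assumed threshold, not from the article
        self.votes = Counter()            # guesses collected for the unknown word

    def submit(self, control_guess, unknown_guess):
        """Return True if the user passes; only then is their other guess recorded."""
        if control_guess.strip().lower() != self.control_answer:
            return False  # failed the real test, so the second guess is discarded
        self.votes[unknown_guess.strip().lower()] += 1
        return True

    def transcription(self):
        """Return the agreed transcription, or None until enough users agree."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= self.votes_needed else None


challenge = TwoWordChallenge(control_answer="morning")
answers = [
    ("morning", "upon"),
    ("morning", "upon"),
    ("mourning", "spam"),  # fails the control word, so "spam" is never counted
    ("morning", "up0n"),
    ("morning", "upon"),
]
for control, unknown in answers:
    challenge.submit(control, unknown)

print(challenge.transcription())  # prints "upon"
```

The design choice worth noticing is that a guess for the unknown word only counts when the same user got the known word right, and a word is only accepted once several independent users agree on it.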
Amazingly, thanks to Recaptcha boxes appearing on thousands of major websites and receiving tens of millions of completions a day, by 2011 Recaptcha had finished digitising the entire Google Books archive – as well as 13 million articles from the New York Times back-catalogue dating back to 1851.
So what did Google do next, with no books left to digitise? Perhaps by happy coincidence, this came just as artificial intelligence and machine learning were taking off.
Training montage
In 2012, Google started including not just words, but snippets of photos from Google Street View – making users transcribe door numbers and other signage. And in 2014, the system became all about training AI.
Essentially, the way machine learning works is that you hand the machine a bunch of data that is already sorted (say, a bunch of images of cats that you have tagged as cats), and it then uses this information to build a neural network that enables it to pick the cats out of other images. The more pictures of cats you feed it, the more accurate the AI becomes at picking out cats from other images.
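The "learn from sorted data" idea can be shown in a few lines of Python. Note that this sketch swaps the neural network for a far simpler technique, a nearest-centroid classifier, and that the two numeric "features" per image (ear pointiness and whisker length) are invented purely for illustration. Real image recognition works on raw pixels and vastly more examples.

```python
import math


def train(labeled_examples):
    """Average the feature vectors for each label (a nearest-centroid model)."""
    sums, counts = {}, {}
    for features, label in labeled_examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Pick the label whose average example is closest to the new one."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Hypothetical, hand-made data: (ear pointiness, whisker length), already tagged.
labeled = [
    ([0.9, 0.8], "cat"),
    ([0.8, 0.9], "cat"),
    ([0.2, 0.1], "not cat"),
    ([0.1, 0.3], "not cat"),
]

model = train(labeled)
print(predict(model, [0.7, 0.75]))  # prints "cat"
print(predict(model, [0.2, 0.2]))   # prints "not cat"
```

Every extra tagged example nudges the averages, which is the toy version of why more labelled pictures tend to make the real thing more accurate.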
Google has countless reasons to want to train AI to recognise objects in images: better Google Image Search results, more accurate Google Maps results, and enabling you to search your Google Photos library for all of the photos you have taken of a specific object or place. Oh, and the small matter of making sure that your driverless car doesn’t hit anything. You know when Recaptcha asks you to identify street signs? Essentially you’re playing a very small role in piloting a driverless car somewhere, at some point in the future.
So it is hugely convenient then that Google has at its disposal hundreds of millions of internet users to work for it: by using Recaptcha to tackle these problems, Google can use our need to prove we’re human to force us to use our very human intuitions to build its database.
This is why currently, instead of simply throwing up some text, Recaptcha is giving users more image-related tasks: “Click all of the images of cats”, “Click all of the boxes on the grid overlaying an image that contain a cat”, and so on. For thousands of different objects.
This is a particularly useful asset for Google, as it competes with other internet giants to grow its machine learning datasets and algorithms: The more data it can analyse, the better results will be – giving its current and future products a competitive advantage.
Using AI to beat AI
Amusingly, there is only one problem with using captchas to train machine learning algorithms. What’s to stop, for example, people who want to get around captchas from using machine learning against captchas?
Last year developer Francis Kim built a proof-of-concept means of beating Recaptcha by using machine learning against it. In just 40 lines of JavaScript, he was able to build a system that uses the rival Clarifai image recognition API to look at the images Google’s Recaptcha throws up, and identify the objects the captcha requires. So if Recaptcha demands the user select images of storefronts to prove their humanity, Clarifai is able to pick them out instead.
This sort of thing would also be possible using Google’s own technology. Because Google wants other companies to build on its clever tech, it makes TensorFlow available to developers as an open-source library. This means that you could conceivably use TensorFlow to trick the Captcha that trains Google’s AI. This wouldn’t work 100% of the time, but once an AI is sufficiently well trained, it should be able to do the trick in a large number of cases.
What’s clear from Recaptcha is not just that it is an ingenious idea, but also that thanks to our hard work, it is getting increasingly difficult to separate us humans from the machines.
8 Comments
Explains the questions I had over why the images changed. I figured when it came to stop lights and cross-walks it was the self-driving AI.
You mean we have been active in our own downfall? There must be a mad evil genius cackling somewhere.
A buddy of mine got me woke to captcha being used for data mining back around 2012 or 2013. Fun thing you can do to throw a tiny monkey wrench in the system is intentionally give it wrong answers the first time or two that you have to enter one, including providing profane words or offensive names in text captchas. It won’t make any serious impact on what they’ve already done with the system, but it may make you feel a little better and give them some garbage data that makes the system slightly less efficient.
And what information does Clarifai secretly capture? Or DuckDuckGo? Or WordPress? Or the Brave browser? What is the alternative? Unplugging our electronics? The point being, we are damned if we do, damned if we don’t.
I’ve thought on the tech end of things quite a bit. At war I saw how easy it was to compromise and leverage tech.
My personal top secret, classified, nearly fail-proof plan is to dump it all at the first sign of sketchiness. Right now it’s a valuable tool. There probably wouldn’t be a patriot movement without it, nor the cross pollination of information that takes place.
In my mind it will be similar to the bank robber documentary, when the gang meets up and dumps all their tech at the same time, effectively going dark.
The conundrum is deciding on that point :- ) Simply enough we don’t have any real clue just how deep the data collection and processing goes. We can make good assumptions and talk with great authority about it, but it’s all really in hindsight and hindsight is too fucking late to put the cork back.
There’s a dude whose whole life is structured around being anonymous, with practices that started pre-tech, who has literally written the books on the subject. The ironic part is that his greyness flags him regularly as being sketchy. There may not be enough that is actionable to determine exactly what he’s up to, but it’s clear he’s up to something. Yet he was easily scooped up by the FBI and interrogated, then dumped right back out because they were hunting someone else. Have a nice day, dude that has no history haha
Use it and abuse it, and be planning on the moment to lose it. Just being on a forum such as Brushbeater puts you in a network tree; this blog response does too. So does having been on Frugal Squirrel in the 80’s, Tea Party affiliation, military service. Your entire culture is being demonized, prepping the public mind.
When you come onto their radar your data will be sorted out.
I’ve read that they are choosy about what they will digitize. And some books that have been saved have since been deleted. ???