Renaud Vigourt
Gone are the days when the web was dominated by humans posting social media updates or exchanging memes. Earlier this year, for the first time since such data has been tracked, automated bots accounted for the bulk of web traffic.
Well over half of that bot traffic is from malicious bots, hoovering up personal data left unprotected online, for instance. But an increasing proportion comes from bots sent out by artificial intelligence companies to gather data for their models or respond to user prompts. Indeed, ChatGPT-User, a bot powering OpenAI’s ChatGPT, is now responsible for 6 per cent of all web traffic, while ClaudeBot, an automated system developed by AI company Anthropic, accounts for 13 per cent.
The AI companies say such data scraping is vital to keep their models up to date. Content creators feel differently, however, seeing AI bots as tools for copyright infringement on a grand scale. Earlier this year, for example, Disney and Universal sued AI company Midjourney, arguing that the tech firm’s image generator plagiarises characters from popular franchises like Star Wars and Despicable Me.
Few content creators have the money for lawsuits, so some are adopting more radical methods of fighting back. They are using online tools that make it harder for AI bots to find their content, or that manipulate it in ways that corrupt the AI models trained on it, so that an AI begins to confuse images of cars with images of cows, for example. But while this “AI poisoning” can help content creators protect their work, it might also inadvertently make the web a more dangerous place.
Copyright infringement
For centuries, copycats have made a quick profit by mimicking the work of artists. It is one reason why we have intellectual property and copyright laws. But the arrival in the past few years of AI image generators such as Midjourney or OpenAI’s DALL-E has supercharged the issue.
A central concern in the US is what is known as the fair use doctrine. This allows samples of copyrighted material to be used under certain conditions without requesting permission from the copyright holder. Fair use law is deliberately flexible, but at its heart is the idea that you can use an original work to create something new, provided it is altered enough and doesn鈥檛 have a detrimental market effect on the original work.
Many artists, musicians and other campaigners argue that AI tools are blurring the boundary between fair use and copyright infringement to the cost of content creators. For instance, it isn’t necessarily detrimental for someone to draw a picture of Mickey Mouse in, say, The Simpsons’ universe for their own entertainment. But with AI, it is now possible for anyone to spin up large numbers of such images quickly and in a manner where the transformative nature of what they have done is questionable. Once they have made these images, it would be easy to produce a range of T-shirts based on them, for example, which would cross from personal to commercial use and breach the fair use doctrine.
Keen to protect their commercial interests, some content creators in the US are taking legal action. The Disney and Universal lawsuit against Midjourney, launched in June, is just the latest example. Others include a lawsuit brought by The New York Times against OpenAI and Microsoft over alleged unauthorised use of the newspaper’s stories.
Disney sued AI company Midjourney over its image generator, which it says plagiarises Disney characters (Photo 12/Alamy)
The AI companies strongly deny any wrongdoing, insisting that data scraping is permissible under the fair use doctrine. In a policy submission in March, OpenAI’s chief global affairs officer, Chris Lehane, warned that rigid copyright rules elsewhere in the world “are repressing innovation and investment”. OpenAI has previously said it would be “impossible” to develop AI models that meet people’s needs without using copyrighted work. Google takes a similar view. In an open letter also published in March, the company said: “Three areas of law can impede appropriate access to data necessary for training leading models: copyright, privacy, and patents.”
However, at least for the moment, it seems the campaigners have the court of public opinion on their side. When the site IPWatchdog analysed public responses to an inquiry about copyright and AI by the US Copyright Office, it found that the overwhelming majority contained negative sentiments about AI.
What may not help AI firms gain public sympathy is a suspicion that their bots are sending so much traffic to some websites that it strains their servers, perhaps even knocking some offline, and that content creators are powerless to stop them. For instance, there are techniques content creators can use to opt out of having bots crawl their websites, including reconfiguring a small file at the heart of the website, known as robots.txt, to say that bots are banned. But there are indications that bots can sometimes ignore such instructions and continue crawling anyway.
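The robots.txt opt-out works by publishing a plain-text list of rules that well-behaved crawlers are expected to consult before fetching pages. Python’s standard library can parse such a file; the rules and URLs below are illustrative (GPTBot is a crawler name OpenAI documents), and the sketch also shows the mechanism’s weakness: the file only advises, and nothing forces a bot to check it.

```python
# A minimal sketch of the robots.txt opt-out described above.
# The rules and URLs are illustrative, not taken from any real site.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler that respects the file checks its own user agent first:
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/article")
browser_allowed = parser.can_fetch("SomeBrowser", "https://example.com/article")
# gptbot_allowed is False; browser_allowed is True. But a bot that never
# calls can_fetch() simply crawls anyway -- compliance is voluntary.
```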
AI data poisoning
It is little wonder, then, that new tools are being made available to content creators that offer stronger protection against AI bots. One such tool was launched this year by Cloudflare, an internet infrastructure company that provides its users protection against distributed denial-of-service (DDoS) attacks, in which an attacker floods a web server with so much traffic that it knocks the site itself offline. To combat AI bots that may pose their own DDoS-like risk, Cloudflare is fighting fire with fire: it produces a maze of AI-generated pages full of nonsense content so that AI bots expend all their time and energy looking at the nonsense, rather than the actual information they seek.
The tool, known as AI Labyrinth, is designed to trap the 50 billion requests a day from AI crawlers that Cloudflare says it encounters on the websites within its network. According to Cloudflare, AI Labyrinth should “slow down, confuse, and waste the resources of AI crawlers and other bots that don’t respect ‘no crawl’ directives”. Cloudflare has since released a follow-up scheme, Pay per Crawl, which asks AI companies to pay to access websites or else be blocked from crawling their content.
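The maze idea can be sketched in a few lines. This is not Cloudflare’s implementation, just an illustration of the general shape: each page is generated on demand, filled with plausible-looking filler, and links to yet more generated pages, so a crawler that follows links indefinitely burns its crawl budget on machine-made nonsense. The word list and URL scheme are invented for the example.

```python
# Toy illustration of a "labyrinth" of generated pages (NOT Cloudflare's
# actual system). Every page links to further pages that are themselves
# generated on demand, so the maze is effectively bottomless.
import random

WORDS = ["data", "signal", "archive", "protocol", "ledger", "vector"]

def nonsense_page(page_id: int, n_links: int = 3, seed: int = 0) -> str:
    """Deterministically generate a nonsense page with onward links."""
    rng = random.Random(seed + page_id)  # same id -> same page every time
    body = " ".join(rng.choice(WORDS) for _ in range(30))
    links = "".join(
        f'<a href="/maze/{rng.randrange(10**6)}">more</a>'
        for _ in range(n_links)
    )
    return f"<html><body><p>{body}</p>{links}</body></html>"

page = nonsense_page(42)  # a crawler following its links never escapes
```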
An alternative is to allow the AI bots access to online content, but to subtly “poison” it in such a way that the data becomes less useful for the bot’s purposes. The tools Glaze and Nightshade, developed at the University of Chicago, have become central to this form of resistance. Both are free to download from the university’s website and can run on a user’s computer.
Glaze, released in 2022, functions defensively, applying imperceptible, pixel-level alterations, or “style cloaks”, to an artist’s work. These changes, invisible to humans, cause AI models to misinterpret the art’s style. For example, a watercolour painting might be perceived as an oil painting. Nightshade, published in 2023, is a more offensive tool that poisons image data, again imperceptibly as far as humans are concerned, in a way that encourages an AI model to make an incorrect association, such as learning to link the word “cat” with images of dogs. Both tools have been downloaded more than 10 million times.
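Glaze and Nightshade compute their perturbations by optimising against real feature extractors, which is well beyond a short snippet. What can be sketched is the outer mechanic any such tool shares: a per-pixel change clamped to a small budget so the edit stays below the threshold of human perception. The epsilon value and the fixed perturbation below are purely illustrative.

```python
# Sketch of the *kind* of edit a poisoning tool applies: a perturbation
# bounded per channel (an "L-infinity budget") so it stays invisible.
# Real tools optimise the perturbation against actual models; here it is
# just a fixed pattern to show the clamping logic.
def cloak(pixels, perturbation, epsilon=3):
    """Apply a perturbation, limiting each channel's change to
    +/- epsilon and keeping values in the valid 0-255 range."""
    out = []
    for p, d in zip(pixels, perturbation):
        d = max(-epsilon, min(epsilon, d))   # enforce the budget
        out.append(max(0, min(255, p + d)))  # stay a valid pixel value
    return out

image = [120, 130, 140, 250, 5]     # a tiny strip of pixel values
delta = [10, -10, 2, 4, -9]         # requested changes, some too large
cloaked = cloak(image, delta)       # no value moves by more than 3
```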
The Nightshade tool gradually poisons AI bots so that they represent dogs as cats (Ben Y. Zhao)
The AI poisoning tools put power back in the hands of artists, says Ben Y. Zhao at the University of Chicago, the senior researcher behind both Glaze and Nightshade. “These are trillion-dollar market-cap companies, literally the biggest companies in the world, taking by force what they want,” he says.
Using tools like Zhao’s is a way for artists to exert the little power they have over how their work is used. “Glaze and Nightshade are really interesting, cool tools that show a neat method of action that doesn’t rely on changing regulations, which can take a while and might not be a place of advantage for artists,” says Jacob Hoffman-Andrews at the Electronic Frontier Foundation, a US-based digital rights non-profit.
The idea of self-sabotaging content to ward off alleged copycats isn’t new, says a researcher at Stockholm University in Sweden. “Back in the day, when there was a large unauthorised use of databases, from telephone directories to patent lists, it was advised to put in some errors to help you out in terms of evidence,” she says. For instance, a cartographer might deliberately include false place names on their maps. If those false names later appear in a map produced by a rival, that provides clear evidence of plagiarism. The practice still makes headlines today: music lyrics website Genius embedded a hidden pattern of apostrophes into its lyrics, which it alleged showed that Google had been using its content without permission. Google denies the allegations, and a court case filed by Genius against Google was dismissed.
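The trap-entry tactic is simple to express in code: seed a dataset with fabricated entries, then check whether a rival’s data contains them. “Agloe” and “Argleton” are real historical examples of fictitious place names used as copyright traps; the detection function itself is a hypothetical sketch.

```python
# Sketch of the "deliberate errors" tactic: fabricated entries that never
# existed in the real world, so their presence in someone else's dataset
# is evidence of copying rather than independent collection.
TRAP_ENTRIES = {"Agloe", "Argleton"}  # historical "paper town" trap names

def plagiarism_evidence(rival_dataset, traps=TRAP_ENTRIES):
    """Return the trap entries found in a rival's data. A non-empty
    result is strong evidence of copying: real-world surveying could
    never have produced these entries independently."""
    return traps & set(rival_dataset)

rival = ["Springfield", "Agloe", "Shelbyville"]
found = plagiarism_evidence(rival)  # the rival copied at least one trap
```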
Even calling it “sabotage” is debatable, according to Hoffman-Andrews. “I don’t think of it as sabotage necessarily,” he says. “These are the artist’s own images that they are applying their own edits to. They’re fully free to do what they want with their data.”
It is unclear to what extent AI companies are deploying countermeasures of their own against this poisoning of the well, whether by ignoring content flagged as poisoned or by trying to strip the poison from the data. But Zhao’s attempts to break his own system showed that Glaze remained 85 per cent effective against every countermeasure he could devise, suggesting that AI companies may conclude that dealing with poisoned data is more trouble than it’s worth.
Spreading fake news
However, it isn’t just artists with content to protect who are experimenting with poisoning the well against AI. Some nation states may be using similar principles to push false narratives. For instance, US-based think tank the Atlantic Council claimed earlier this year that Russia’s Pravda news network, whose name means “truth” in Russian, has been churning out huge volumes of content designed to trick AI bots into disseminating fake news stories.
Pravda’s approach, as alleged by the think tank, involves posting millions of web pages, somewhat like Cloudflare’s AI Labyrinth. But in this case, the Atlantic Council says, the pages are designed to look like real news articles and are being used to promote the Kremlin’s narrative about Russia’s war in Ukraine. The sheer volume of stories could lead AI crawlers to over-emphasise certain narratives when responding to users, and an analysis published this year by US technology firm NewsGuard, which tracks Pravda’s activities, found that leading AI chatbots repeated the network’s false claims in a third of cases.
Pravda’s relative success in shifting conversations highlights a problem inherent in all things AI: technology tricks used by good actors with good intentions can always be co-opted by bad actors with nefarious goals.
There is, however, a solution to these problems, says Zhao, although it may not be one that AI companies are willing to consider. Instead of indiscriminately gathering whatever data they can find online, AI companies could enter into formal agreements with legitimate content providers and ensure that their products are trained using only reliable data. But this approach carries a price, because licensing agreements can be costly. “These companies are unwilling to license these artists’ works,” says Zhao. “At the root of all this is money.”