Opt-In or Opt-Out? What’s at Stake in Regulating AI Crawling
Which model should the UK adopt for AI training on online content?
As AI models grow more data-hungry, they increasingly rely on large-scale web crawling to collect text, images and code from across the public internet. For decades, this kind of automated access was guided by voluntary norms like robots.txt, a simple file created in the 1990s to tell early search engine crawlers which pages they could or could not access.
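To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved crawler might check the file before fetching a page, using Python's standard urllib.robotparser module. The crawler names (ExampleAIBot, ExampleSearchBot), the example.org URL and the may_crawl helper are hypothetical and chosen only for illustration. Nothing in the file forces a crawler to run this check, which is why the norm is voluntary.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/article"
    print(may_crawl(ROBOTS_TXT, "ExampleAIBot", page))      # False
    print(may_crawl(ROBOTS_TXT, "ExampleSearchBot", page))  # True
```

The file only states a preference. Honouring it is left entirely to the crawler.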
That system was never designed for today’s industrial-scale AI scraping. Publishers, from newsrooms to public interest sites like Wikipedia, now face high server costs, lost revenue, and the reproduction of their work in AI outputs without permission or compensation. At the same time, governments, including the UK’s, want to support innovation, attract AI companies and keep their countries competitive.
Meanwhile, the Internet Engineering Task Force (IETF) is developing new standards that will let publishers signal, in a clear, machine-readable way, whether their content can be used for AI training. But even once these standards are released, much of the content on the web will still lack any clear consent signal, because the standards will be new and adoption will take time.
That brings us to a key policy question: in the future, should AI developers be allowed to train on online content by default, or only when permission is explicitly given?
What is opt-in?
Under an opt-in model, AI developers would need explicit permission before using a website’s content for training. Think of it as: “AI use is not allowed unless I say yes.”
Pros:
Strong protection of publishers’ rights and consent.
Clear, enforceable expectations for AI developers.
Works with emerging technical standards that give websites a clear, machine-readable way to express consent for AI training.
Cons:
Most websites currently do not signal AI preferences, meaning far less content would be available for training.
Smaller organisations may find it harder to update their content and adopt new settings.
Could limit access for legitimate research projects that rely on broad datasets.
What is opt-out?
Under an opt-out model, AI developers can use content unless a publisher says no. Think of it as: “AI use is allowed unless I tell you to stop.”
Pros:
Easier for AI developers and researchers to access data.
Minimal friction for innovation and model development.
Simple for large platforms that already support automated crawling controls.
Cons:
Puts the burden of protection on publishers, many of whom don’t even know opting out is possible.
Risks widespread unlicensed use of digital content.
Favours major AI firms that can strike exclusive deals with big publishers.
Fails to prevent AI training on downstream copies of a work: screenshots, reposts, embeds and scraped versions that appear on sites the publisher doesn’t control.
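The two models described above differ only in how they treat content that carries no signal at all, which can be sketched in a few lines of Python. This is an illustrative sketch, not a description of any real standard: the Regime enum, the may_train and usable_sites helpers and the sample sites are all hypothetical, and a signal is modelled simply as True (publisher said yes), False (publisher said no) or None (no signal present).

```python
from enum import Enum
from typing import Dict, List, Optional


class Regime(Enum):
    OPT_IN = "opt-in"
    OPT_OUT = "opt-out"


def may_train(regime: Regime, signal: Optional[bool]) -> bool:
    """Decide whether training is permitted for one piece of content.

    An explicit signal is respected under both regimes; only the
    default for unsignalled content (signal is None) differs.
    """
    if signal is not None:
        return signal
    return regime is Regime.OPT_OUT


def usable_sites(regime: Regime, signals: Dict[str, Optional[bool]]) -> List[str]:
    """List the sites whose content may be used under the given regime."""
    return [site for site, signal in signals.items() if may_train(regime, signal)]


if __name__ == "__main__":
    # Hypothetical sample: one yes, one no, one site with no signal.
    sample = {"site-a.example": True, "site-b.example": False, "site-c.example": None}
    for regime in Regime:
        print(regime.value, usable_sites(regime, sample))
    # opt-in ['site-a.example']
    # opt-out ['site-a.example', 'site-c.example']
```

Because most websites currently send no signal, the choice of default decides the outcome for the bulk of online content.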
Why does this matter for the UK?
The UK’s recent AI copyright consultation leaned toward an opt-out approach. Supporters like Nick Clegg back this approach, arguing that requiring permission before training would be “implausible” and would “basically kill the AI industry in Britain overnight.”
Creator advocates like Ed Newton-Rex, a composer and founder of the nonprofit Fairly Trained, argue that opt-outs give creators only the illusion of control. You can block AI crawling on your own site, but AI companies can still train on the countless “downstream copies” of your work that appear elsewhere online: screenshots, embeds, quotes, reposts, ads and scraped versions you don’t control. In addition, evidence suggests that most people who have the option to opt out of generative AI training don’t know that they can, and the administrative burden of opting out for all of your content can be huge.
Last spring, Dua Lipa, Elton John, Paul McCartney and many more UK artists wrote to the Prime Minister, urging him to give government support to proposals that would protect copyright in relation to AI.
What do you think?
Which model should the UK adopt for AI training on online content?



