Publishers Reject AI Scrapers, Block Bots at Server Level

The Rise of AI Bot Traffic and Website Resistance

The open web is experiencing a significant shift as more websites implement measures to block unwanted automated traffic. This trend is driven by concerns over the use of website content for training AI models and the strain that non-human users place on server resources. However, some companies continue to scrape data despite these restrictions.

According to an online traffic analysis conducted by BuiltWith, a web metrics company, the number of publishers attempting to prevent AI bots from scraping their content has increased substantially since July 2025. Specifically, about 5.6 million websites have added OpenAI's GPTBot to the disallow list in their robots.txt files, up from approximately 3.3 million at the start of July. This represents a nearly 70% increase.

Through entries in their robots.txt files, websites can tell crawlers whether automated requests to harvest their content are permitted. Compliance with these directives is voluntary, but repeated violations can lead to legal consequences, as seen in Reddit's recent lawsuit against Anthropic.
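As an illustration of how those directives work, here is a minimal sketch using Python's standard-library robots.txt parser. The disallow rules mirror the kind of entries publishers have been adding; example.com and the article path are placeholders.

```python
# Minimal sketch: check which crawlers a robots.txt file permits,
# using Python's standard-library parser. The user-agent tokens below
# (GPTBot, ClaudeBot, Googlebot) are the real ones those crawlers use.
from urllib.robotparser import RobotFileParser

# Disallow entries of the kind publishers have been adding.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    verdict = "allowed" if parser.can_fetch(agent, "https://example.com/article") else "blocked"
    print(f"{agent}: {verdict}")
# Prints: GPTBot: blocked, ClaudeBot: blocked, Googlebot: allowed
```

Note that the parser only reports what the site requests; nothing at this layer forces a crawler to obey, which is exactly why enforcement has shifted toward server-level blocking and the courts.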

Growing Resistance Against AI Bots

Anthropic's ClaudeBot is also facing increasing resistance. Currently, it is blocked on about 5.8 million websites, compared to 3.2 million in early July. Similarly, the company's Claude-SearchBot, used for surfacing sites in search results, is encountering a rising block rate.

Applebot, which Apple uses to index data for search, is seeing similar pushback: it is now blocked on approximately 5.8 million websites, up from around 3.2 million in July. Googlebot is encountering growing opposition too, possibly because of its role in feeding the AI Overviews that appear atop Google search results. BuiltWith reports that 18 million sites now ban Googlebot, which may mean those sites cannot be indexed in Google Search.

According to Arc XP, a publishing platform spun out of The Washington Post, about half of news sites blocked GPTBot as of July. OpenAI, Anthropic, and Google did not immediately respond to requests for comment.

Implications and Challenges

Anirudh Agarwal, CEO of OutreachX, a web marketing consultancy, noted that the rate at which GPTBot is being blocked signals how publishers view AI crawlers generally: if OpenAI's GPTBot is being blocked, other AI crawlers can expect similar treatment.

Tollbit, a company that helps publishers monetize AI traffic by charging crawlers access fees, said in its Q2 2025 report that the number of sites blocking AI crawlers has grown 336 percent over the past year. Across all AI bots, 13.26 percent of requests ignored robots.txt directives in Q2 2025, up from 3.3 percent in Q4 2024. That behavior has already drawn legal challenges, including a lawsuit filed by major news publishers against Perplexity in 2024.
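Tollbit has not published its counting code, but a publisher can estimate the same kind of compliance figure from its own access logs. The sketch below is one hypothetical way to do so, reusing the standard-library parser with made-up log entries.

```python
# Hypothetical sketch (not Tollbit's methodology): estimate what share
# of a bot's logged requests hit paths that robots.txt disallows for it.
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse("User-agent: GPTBot\nDisallow: /\n".splitlines())

# Made-up (user_agent, path) pairs standing in for parsed server logs.
log = [
    ("GPTBot", "/articles/1"),
    ("GPTBot", "/articles/2"),
    ("Mozilla/5.0 (Windows NT 10.0)", "/articles/1"),
]

bot_requests = [path for ua, path in log if "GPTBot" in ua]
violations = sum(
    not rules.can_fetch("GPTBot", f"https://example.com{path}")
    for path in bot_requests
)
print(f"{violations} of {len(bot_requests)} GPTBot requests ignored robots.txt")
```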

Complications in Bot Blocking Efforts

Bot blocking efforts have become more complicated as AI firms like OpenAI and Perplexity have introduced browsers that integrate their AI models. According to the Tollbit report, "The latest AI browsers like Perplexity Comet, and devtools like Firecrawl or Browserless are indistinguishable from humans in site logs." Publishers that block such tools risk inadvertently blocking human visitors, which is why Tollbit stresses that non-human site traffic should accurately identify itself.
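What that self-identification looks like in practice is a descriptive User-Agent string. The sketch below, which assumes the third-party requests library, uses a hypothetical bot name modeled on how crawlers like GPTBot announce themselves.

```python
# Sketch of a crawler identifying itself honestly in its User-Agent
# header. "ExampleNewsBot" and its info URL are hypothetical; real
# crawlers such as GPTBot publish comparable strings so operators can
# recognize them in logs and target them in robots.txt.
import requests

headers = {
    "User-Agent": "ExampleNewsBot/1.0 (+https://example.com/bot-info)"
}
response = requests.get("https://example.com/article", headers=headers, timeout=10)
print(response.status_code)
```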

For organizations that are not major publishers, the impact of AI bot traffic can be overwhelming. In October, the blogging service Bear experienced an outage due to AI bot traffic, a problem also reported by Belgium-based blogger Wouter Groeneveld. Developer David Gerard, who runs the AI-skeptic blog Pivot-to-AI, recently mentioned on Mastodon that RationalWiki.org was struggling to manage AI bot traffic.

Industry Responses and New Solutions

Will Allen, VP of product at Cloudflare, said the company observes "a lot of people that are out there trying to scrape large amounts of data, ignoring any robots.txt directives, and ignoring other attempts to block them." While bot traffic is increasing, that traffic isn't necessarily harmful in itself, but it does bring more attacks and more attempts to bypass paywalls and content restrictions.

To address this, Cloudflare launched a service called Pay per crawl, which lets content owners offer automated access for a price. Allen declined to say which sites have joined the private beta, but acknowledged that new economic arrangements would be beneficial.
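Cloudflare has not published the full mechanics of the private beta, but the general idea of charging for automated access can be sketched with a plain HTTP server: answer known AI crawlers with 402 Payment Required unless the request carries some proof of payment. Everything below, including the X-Crawl-Payment header, is a hypothetical illustration rather than Cloudflare's actual protocol.

```python
# Illustrative sketch only, not Cloudflare's Pay per crawl protocol:
# a server that answers known AI crawlers with HTTP 402 (Payment
# Required) unless the request carries a hypothetical payment token.
from http.server import BaseHTTPRequestHandler, HTTPServer

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot")  # real UA tokens

class PayPerCrawlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        paid = "X-Crawl-Payment" in self.headers  # hypothetical header
        if any(bot in user_agent for bot in AI_CRAWLERS) and not paid:
            self.send_response(402)  # signal that access is for sale
            self.end_headers()
            self.wfile.write(b"Automated access to this content requires payment.\n")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>Article content</body></html>")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), PayPerCrawlHandler).serve_forever()
```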

"We have a thesis or two about how that could evolve," he said. "But really, we think there's going to be a lot of different evolution, a lot of different experimentation. And so we're keeping a pretty tight private beta for our Pay per crawl product just to really learn, from both sides of the market – people who are looking to access content at scale and people who are looking to protect content."
