
How to make your site more (or less) accessible to AI crawlers


In this ChatGPT and Gemini content-scraping era, publishers find themselves at a crossroads, grappling with the double-edged sword of AI crawler bots. With the rise of generative AI, blogging is changing, and publishers are adjusting their strategies to either ward off these digital intruders or roll out the welcome mat. Let’s dive into the contrasting approaches of 404 Media, The Washington Post, and Politico EU, and explore how these decisions impact their digital footprint.

The Fortress Approach: 404 Media’s Bot Blockade

404 Media has taken a staunch stance against AI crawlers, effectively building a digital fortress around its content. By implementing strict bot-blocking measures and a registration wall, 404 Media aims to safeguard its original articles from the prying eyes of AI, ensuring that only human readers can access its valuable insights. This defensive strategy underscores a commitment to content exclusivity and control, but it’s not without its challenges (a rough sketch of what a bot block can look like follows the pros and cons below).

Pros: Content exclusivity and full control over who reads the work; original articles can’t be quietly scraped into AI training datasets.

Cons: The registration wall adds friction for casual readers, and blocking bots wholesale can cost the site visibility and referral traffic.
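
404 Media hasn’t published its exact setup, but the idea behind a bot blockade can be sketched at the application layer: refuse any request whose User-Agent matches a known AI crawler. The snippet below is a minimal illustration, not 404 Media’s actual stack; the framework (Flask) and the blocklist are assumptions, and in practice this kind of rule usually lives in the CDN or web-server config, with the registration wall as a separate layer.

# Minimal sketch of a user-agent bot block (illustrative only).
from flask import Flask, abort, request

app = Flask(__name__)

# AI crawler user-agent substrings mentioned in this article. Google-Extended is a
# robots.txt token rather than a request user agent, so it is handled in robots.txt.
BLOCKED_AI_AGENTS = ("gptbot", "omgili", "omgilibot")

@app.before_request
def block_ai_crawlers():
    user_agent = (request.headers.get("User-Agent") or "").lower()
    if any(bot in user_agent for bot in BLOCKED_AI_AGENTS):
        abort(403)  # refuse the request before any content is served

@app.route("/")
def home():
    return "Human-readable content lives here."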

The Balancing Act: The Washington Post’s Selective Content Strategy

The Washington Post navigates the AI crawler conundrum with a more nuanced strategy, cherry-picking which bots may crawl its site. This selective openness aims to preserve SEO rankings while keeping valuable content behind paywalls. It’s a delicate balance between staying visible in search and walling your content off from GPT-style bots, as the sample robots.txt after the list below illustrates.

Pros: Search engines can keep indexing and ranking the site, while premium content stays behind the paywall and out of reach of the AI crawlers it chooses to block.

Cons: The allow/deny list has to be curated as new AI crawlers appear, and whatever the permitted bots can read, they can still reuse.
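
A selective policy like this is typically expressed in robots.txt: keep the crawlers you rely on for search, and shut out the agents that only feed AI training. The rules below are an illustrative example, not The Washington Post’s actual file:

User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Because Google-Extended is a separate token from Googlebot, a site can stay fully indexed in Google Search while still opting its content out of Gemini training.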

The Open-Door Policy: Politico EU’s Embrace of AI Crawlers

Politico EU adopts an inclusive stance towards AI crawlers, betting on openness to boost brand visibility and reach. By making its content readily available to AI, Politico EU aims to capitalize on the expansive reach of AI-driven platforms, positioning itself as a primary source of political news for both humans and machines.

Pros: Maximum visibility and reach; Politico EU’s reporting can surface wherever AI-driven platforms pull in political news.

Cons: Its content can be reused by AI systems without compensation or guaranteed attribution, and the publisher has little control over how it is presented.

Stop ChatGPT and Gemini from stealing your content by adding this to your robots.txt file (GPTBot is OpenAI’s crawler, and Google-Extended is the token Google uses to control whether your content trains its AI models):

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Some other data-scraping bots worth blocking as well:

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /
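
Once the rules are live, you can sanity-check them with Python’s standard-library robots.txt parser. The domain below is a placeholder; swap in your own:

# Check whether a robots.txt actually blocks a given AI crawler.
# "https://www.example.com" is a placeholder domain.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()

for agent in ("GPTBot", "Google-Extended", "omgilibot"):
    allowed = parser.can_fetch(agent, "https://www.example.com/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")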

Related Read: GPT Bot Guide

When did publishers ever agree to train AI for free?

Voluntary Contributions vs. Involuntary Use:

The line between voluntary contribution and involuntary use becomes blurred when it comes to training AI. While some creators knowingly contribute to AI projects, believing in the potential benefits of technology advancement, many are unaware that their intellectual property is being used to train AI without explicit consent or compensation.

The Role of User Agreements and Fine Print:

Often, the permission to use this data for training AI is buried within the terms of service or user agreements of various platforms. Users and creators, by agreeing to these terms, may unknowingly grant companies the right to use their content for improving AI algorithms, effectively contributing to AI training for free.

Why are creators not being compensated?

The Economic Model of AI Development:

The development and training of AI models require substantial computational resources and vast datasets. Tech companies argue that the collective nature of these datasets makes individual compensation impractical. Furthermore, the economic model of many AI ventures relies on minimizing costs, including the cost of acquiring data, which often sidelines the idea of compensating individual creators.

The Challenge of Attribution:

Identifying and compensating individual creators for their contributions to AI training datasets is a logistical and technological challenge. Given the massive scale of data ingestion by AI models, tracing content back to its original creator and determining the value of each contribution is daunting, if not impossible, with current systems.

Intellectual Property Rights vs. Fair Use Doctrine:

The legality of using creators’ content to train AI without compensation sits at the intersection of intellectual property rights and the fair use doctrine. While creators hold copyright to their original content, AI companies often argue that their use of this content for training purposes falls under fair use, a legal doctrine allowing limited use of copyrighted material without permission for purposes such as research, teaching, or scholarship.

Emerging Regulations and Legal Battles:

The legal framework surrounding AI and copyright is evolving. In various jurisdictions, lawsuits and regulatory proposals are beginning to challenge the status quo, seeking clearer guidelines and protections for creators. These legal battles and potential regulatory changes could reshape how AI companies access and use data for training purposes, possibly leading to more explicit consent mechanisms and compensation models.

Conclusion: The Path Forward

As AI continues to evolve, the dialogue between creators, tech companies, and legislators will be crucial in shaping a fair and equitable ecosystem. Balancing the need for innovation with the rights of creators requires thoughtful regulation, transparent practices, and perhaps new models for compensation that recognize the value of contributions to the digital commons. The future of AI development hinges on finding a harmonious solution that respects both the creators’ rights and the potential benefits of AI for society.

Protect your content now by getting started here!

Related Reads

The Ups and Downs of ChatGPT for Publishers

How ChatGPT impacts Bot Traffic

Protect your content from AI Scraping
