Underminer

by Tellyworth

Description

Underminer aims to discourage AI model crawler bots and other unwanted bots, while allowing search engine crawlers and RSS aggregators to work normally.

It works in two ways:

  • Using robots.txt rules to prohibit well-behaved AI training bots from crawling your site.
  • Selectively corrupting the meaning of text content, to sabotage badly behaved bots that ignore those rules.

Well-behaved search engine bots are permitted to crawl the site as normal, without being restricted by robots.txt or being subjected to sabotage.

Well-behaved LLM training bots that obey robots.txt will not receive any corrupted content.
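As an illustration, robots.txt rules for this approach typically look like the following. The user agents shown (GPTBot, CCBot, Google-Extended) are well-known AI training crawlers; the rules Underminer actually generates may differ.

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```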

Features

  • 100% free and open source. No upsells or nags or subscriptions or promotions or freemium versions or donations. Ever.
  • Auto-detects crawler bots.
  • Completely invisible to real users and well-behaved bots.
  • Verifies the IP ranges of well-known bots to detect impostors (e.g. fake Googlebot crawlers).
  • (Almost) zero-configuration.
  • Language-neutral; works with page and post content in most languages.
  • Preview mode: you can see what a bad bot will see.
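The IP range verification mentioned above boils down to checking whether a request's IP address falls inside a CIDR range published by the search engine operator. A minimal sketch of that check (the function name is illustrative, not the plugin's actual API; it assumes IPv4 on a 64-bit PHP build):

```php
<?php
// Sketch: check whether an IPv4 address falls inside a CIDR range.
// Illustrative only -- not the plugin's actual implementation.
function ip_in_cidr( string $ip, string $cidr ): bool {
	[ $subnet, $bits ] = explode( '/', $cidr );
	$ip_long     = ip2long( $ip );
	$subnet_long = ip2long( $subnet );
	$mask        = -1 << ( 32 - (int) $bits ); // assumes 64-bit PHP integers
	return ( $ip_long & $mask ) === ( $subnet_long & $mask );
}

// Googlebot publishes ranges such as 66.249.64.0/19.
var_dump( ip_in_cidr( '66.249.66.1', '66.249.64.0/19' ) );  // bool(true)
var_dump( ip_in_cidr( '203.0.113.5', '66.249.64.0/19' ) );  // bool(false)
```

A crawler claiming to be Googlebot but connecting from an address outside Google's published ranges can then be treated as an impostor.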

Bad bots are not blocked outright; instead, they are simply served corrupt and useless content:

  • Words and sentences are randomly rearranged.
  • Lists and paragraphs are re-ordered.
  • Numbers are randomized.
  • Currency symbols and measurement units are randomly changed.
  • Alt text and descriptions are randomly switched around.
  • Links are switched around.
  • Image URLs are intentionally broken.
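The word-rearranging idea above can be sketched in a few lines of PHP. This is a minimal, hypothetical illustration of the technique, not the plugin's actual corruption code:

```php
<?php
// Sketch: shuffle the words within each sentence so the text keeps its
// vocabulary but loses its meaning. Illustrative only.
function shuffle_sentence_words( string $text ): string {
	$sentences = preg_split( '/(?<=[.!?])\s+/', $text );
	$out = array();
	foreach ( $sentences as $sentence ) {
		$words = explode( ' ', $sentence );
		shuffle( $words );
		$out[] = implode( ' ', $words );
	}
	return implode( ' ', $out );
}

echo shuffle_sentence_words( 'The quick brown fox jumps over the lazy dog.' );
```

A real user never sees output like this; it would be served only to bots that ignore robots.txt.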

Well-behaved bots

For the purposes of this plugin, a well-behaved bot is one that:

  • Obeys robots.txt.
  • Publishes an up-to-date list of IP ranges.
  • Identifies search crawling separately from AI model training crawling.

Requirements

  • PHP 8.0 or higher
  • WordPress 6.4 or higher

FAQ

Is this ethical?

Yes.

Will this harm well-behaved bots?

No. Bots that obey robots.txt will be unaffected.

Will this harm my site’s SEO standing?

Theoretically no; the plugin explicitly allows well-known and well-behaved search engine crawlers if you have your site visibility set to allow them.

Practically speaking though, there are no guarantees. If your livelihood or self-esteem is dependent on your search engine rankings, you should not use this plugin.

How are bots detected?

Bots are detected using the Crawler Detect library.

Verified IP ranges for known bots are fetched directly from the search engine operators and shipped with this plugin (see config/crawler-ip-ranges.php).

For that reason, I highly recommend enabling auto-updates for this plugin to ensure you have the latest IP range data. The plugin will automatically stop doing IP range verification if the IP ranges data hasn’t been updated in more than 180 days.
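The 180-day cutoff described above amounts to a simple timestamp comparison. A hypothetical sketch (the constant, function, and timestamp names are illustrative, not the plugin's API):

```php
<?php
// Sketch: skip IP range verification when the shipped range data is
// older than 180 days. Illustrative only.
const MAX_RANGE_AGE_DAYS = 180;

function ip_ranges_are_stale( int $generated_at, int $now ): bool {
	return ( $now - $generated_at ) > MAX_RANGE_AGE_DAYS * 86400;
}

$generated_at = strtotime( '2024-01-01' );
var_dump( ip_ranges_are_stale( $generated_at, strtotime( '2024-03-01' ) ) ); // bool(false)
var_dump( ip_ranges_are_stale( $generated_at, strtotime( '2024-12-01' ) ) ); // bool(true)
```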

Does this really discourage rogue LLM operators from greedily scraping content without regard for copyright, attribution, resource usage, or consent?

If enough people do something like this, I sure hope so.

Does this plugin rely on any external services or APIs?

No, not for its normal operation – all processing is local to your server.

The only external API used by the plugin is an optional one-time request to ipify.org to verify whether your web server’s IP address environment variables are correct. It only runs when (and if) an administrator explicitly requests it by clicking a button.

Why isn’t the bot counter working?

The bot counter is only available on systems that are configured to use Memcached for object caching and have a working wp_cache_incr() function.

It’s not intended as a critical feature, just a little fun.

Changelog

1.0.3

  • Fix a careless fatal error.

1.0.2

  • Fix handling of WordPress self-requests.
  • Refactor IP calculation code.
  • Consolidate IP ranges to reduce size of crawler IP data.

1.0.1

  • Improved robots.txt rules.

1.0.0

  • Initial public release.