Limiting ChatGPT's GPTBot Crawl Rate

Bye-bye, AI Crawler

Recently, ChatGPT's crawler, GPTBot, had been crawling this site so heavily that the processor load rendered the site inoperable for normal users.

It was hitting the server from multiple IP addresses, more than 30 times per second.

For the benefit of others, here are various ways to prevent this crawler from slowing down your Linux server.

Full User-agent string:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot
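
To confirm that GPTBot is the source of the load, you can count matching lines in your access log. A quick sketch (the log path is the NGINX default on Debian/Ubuntu; adjust it to match your server and distribution):

grep -c "GPTBot" /var/log/nginx/access.log
grep "GPTBot" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head

The first command counts GPTBot requests; the second lists the busiest source IP addresses.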

To limit the crawl rate and ensure your server remains usable, you can control how frequently GPTBot (and other bots) request pages by configuring your robots.txt file or using rate-limiting techniques at the server level.

Step 1: Adjust Crawl Rate with robots.txt

You can limit the crawl rate for GPTBot by adding the following to your robots.txt file:

User-agent: GPTBot
Crawl-delay: 10

Crawl-delay sets a delay (in seconds) between successive requests from the bot. Adjust the value to a rate that suits your server load.
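
Note that Crawl-delay is a non-standard directive, and OpenAI does not document whether GPTBot honours it. GPTBot does document support for Disallow, so if you would rather shut the crawler out entirely, a minimal robots.txt entry looks like this:

User-agent: GPTBot
Disallow: /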

Step 2: Use Server-Side Rate Limiting

You can also set rate limits using server-side tools like mod_qos (for Apache) or ngx_http_limit_req_module (for NGINX). These modules help manage how many requests are allowed per second per IP address.

NGINX Configuration (if you are using NGINX):

http {
    limit_req_zone $binary_remote_addr zone=bot_zone:10m rate=1r/s;

    server {
        location / {
            limit_req zone=bot_zone burst=5 nodelay;
        }
    }
}

This limits each client IP to 1 request per second, with a burst capacity of 5. Note that the zone is keyed on $binary_remote_addr, so as written it applies to every visitor, not just bots.
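
If you only want to slow down GPTBot rather than every visitor, one approach (a sketch; the zone and variable names are arbitrary) is to key the zone on a variable that is empty for ordinary traffic, since NGINX does not rate-limit requests with an empty key:

http {
    # Empty key for normal visitors (not limited); client IP for GPTBot.
    map $http_user_agent $gptbot_key {
        default   "";
        ~*gptbot  $binary_remote_addr;
    }

    limit_req_zone $gptbot_key zone=gptbot_zone:10m rate=1r/s;

    server {
        location / {
            limit_req zone=gptbot_zone burst=5 nodelay;
        }
    }
}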

Apache Configuration (if you are using Apache):

You can use mod_qos to throttle the request rate. Note that QS_SrvRequestRate does not limit requests per second; it defines the minimum transfer rate (in bytes per second) a client must sustain, as a defence against Slowloris-style attacks. The mod_qos directive that limits requests per second to a location is QS_LocRequestPerSecLimit:

QS_LocRequestPerSecLimit / 1

This throttles requests to everything under / to roughly 1 per second by delaying excess requests.
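
If you would rather refuse GPTBot outright than throttle it, Apache 2.4's <If> directive can match the User-Agent and deny the request with a 403 (a minimal sketch; place it in the relevant virtual host):

<If "%{HTTP_USER_AGENT} =~ /GPTBot/">
    Require all denied
</If>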

Step 3: Use Fail2Ban for Rate-Limiting Bots (Advanced)

If you are using Fail2Ban with iptables or firewalld, you can also set up a Fail2Ban rule to detect excessive bot traffic and throttle it:

Create a custom jail for GPTBot in /etc/fail2ban/jail.local:

[gptbot]
enabled  = true
port     = http,https
filter   = gptbot
logpath  = /var/log/apache2/access.log  # or /var/log/nginx/access.log
maxretry = 10
findtime = 60
bantime  = 600

Create a filter in /etc/fail2ban/filter.d/gptbot.conf:

[Definition]
# Match any request whose logged User-Agent contains "GPTBot"
failregex = ^<HOST> - - .*"(GET|POST|HEAD) .* HTTP.*".*GPTBot.*$
ignoreregex =

With maxretry = 10, findtime = 60 and bantime = 600, this bans any IP that sends more than 10 matching requests within 60 seconds, for 10 minutes.
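
Before reloading Fail2Ban, you can test the filter against your real log with the bundled fail2ban-regex tool, which reports how many lines matched:

fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/gptbot.conf

Once the filter matches as expected, apply the new jail with fail2ban-client reload.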

By combining robots.txt settings and server-side rate limiting, you can control bot activity and prevent server overload.

#tools #chatGPT #Robotstxt #fail2ban #iptables #firewalld #serverload #apache #webcrawler #pest