Add more spiders which do not seem to honour robots.txt

I went through the last couple of logs after the first round of 'turn
off the spiders' went out. I looked at the areas which /robots.txt
disallows, then looked for the bots which ignored that and still
looked up stuff in 'accounts'. Blocking these may cut down CPU spikes,
as they are hitting dynamic data which can 'blow' things up.

It might be good to add similar tooling to pagure and src since they
also seem to be hit a lot in the logs.
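A minimal sketch of what that could look like, assuming the same Apache
RewriteCond/RewriteRule approach as the mailman change below (the vhost
file names and the bot list here are illustrative, not taken from the
actual pagure or src configs):

    # Hypothetical excerpt for the pagure/src vhost configs (file names
    # assumed, e.g. pagure.conf or src.conf).
    # Return 403 to crawlers known to ignore robots.txt.
    RewriteEngine on
    RewriteCond %{HTTP_USER_AGENT} (Bytespider|ClaudeBot|Amazonbot|YandexBot|ChatGLM-Spider|GPTBot|Barkrowler|YisouSpider|MJ12bot) [NC]
    RewriteRule .* - [F,L]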

Signed-off-by: Stephen Smoogen <ssmoogen@redhat.com>
Stephen Smoogen 2024-07-08 10:12:02 -04:00 committed by zlopez
parent 377e83fdd1
commit 7e426dbf37
2 changed files with 7 additions and 2 deletions

@@ -9,3 +9,6 @@ Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Barkrowler
Disallow: /

@@ -33,9 +33,11 @@ ProxyPassReverse / http://127.0.0.1:8000/
# Redirecting to hyperkitty if nothing is specified
RewriteEngine on
RewriteRule ^/$ /archives [R,L]
# Spiders-gone-wild
# These spiders do not follow robots.txt
RewriteCond %{HTTP_USER_AGENT} ^.*(Bytespider|ClaudeBot).*$ [NC]
# These spiders may not follow robots.txt and will
# hit admin sections which consume large amounts of CPU
RewriteCond %{HTTP_USER_AGENT} ^.*(Bytespider|ClaudeBot|Amazonbot|YandexBot|claudebot|ChatGLM-Spider|GPTBot|Barkrowler|YisouSpider|MJ12bot).*$ [NC]
RewriteRule .* - [F,L]
# Old static archives