Crawler bots are downloading substitutes

Status: Done
Details
4 participants:
  • Leo Famulari
  • Tobias Geerinckx-Rice
  • Mark H Weaver
  • Mathieu Othacehe
Owner: unassigned
Submitted by: Leo Famulari
Severity: normal
Leo Famulari wrote on 6 Dec 2021 23:18
[maintenance] hydra: berlin: Create robots.txt.
(address . 52338@debbugs.gnu.org)
2f52f6b48db55f8a79b07dbb242b297ab49d6083.1638828946.git.leo@famulari.name
I tested that `guix system build` succeeds with this change, but I would
like a review of whether the resulting Nginx configuration is correct,
and whether /nar/ is the right path to disallow. It generates an Nginx
location block like this:

------
location /robots.txt {
  add_header Content-Type text/plain;
  return 200 "User-agent: *
Disallow: /nar/
";
}
------

* hydra/nginx/berlin.scm (berlin-locations): Add a robots.txt Nginx location.
---
hydra/nginx/berlin.scm | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hydra/nginx/berlin.scm b/hydra/nginx/berlin.scm
index 1f4b0be..3bb2129 100644
--- a/hydra/nginx/berlin.scm
+++ b/hydra/nginx/berlin.scm
@@ -174,7 +174,14 @@ PUBLISH-URL."
     (nginx-location-configuration
      (uri "/berlin.guixsd.org-export.pub")
      (body
-      (list "root /var/www/guix;"))))))
+      (list "root /var/www/guix;")))
+
+     (nginx-location-configuration
+      (uri "/robots.txt")
+      (body
+       (list
+        "add_header Content-Type text/plain;"
+        "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
 
 (define guix.gnu.org-redirect-locations
   (list
--
2.34.0
Mathieu Othacehe wrote on 9 Dec 2021 14:27
Re: bug#52338: Crawler bots are downloading substitutes
(name . Leo Famulari)(address . leo@famulari.name)(address . 52338@debbugs.gnu.org)
87tufh6h85.fsf_-_@gnu.org
Hello Leo,

> +     (nginx-location-configuration
> +      (uri "/robots.txt")
> +      (body
> +       (list
> +        "add_header Content-Type text/plain;"
> +        "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Nice, the bots are also accessing the Cuirass web interface, do you
think it would be possible to extend this snippet to prevent it?

Thanks,

Mathieu
Tobias Geerinckx-Rice wrote on 9 Dec 2021 16:42
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87sfv1ivl2.fsf@nckx
Mathieu Othacehe writes:
> Hello Leo,
>
>> +     (nginx-location-configuration
>> +      (uri "/robots.txt")

It's a micro-optimisation, but it can't hurt to generate ‘location
= /robots.txt’ instead of ‘location /robots.txt’ here.
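
The generated block above suggests the uri field is spliced verbatim into
the ‘location’ directive, so a sketch of that change (untested):

  (nginx-location-configuration
   (uri "= /robots.txt")               ; ‘=’ requests an exact match
   (body
    (list
     "add_header Content-Type text/plain;"
     "return 200 \"User-agent: *\nDisallow: /nar/\n\";")))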

>> +      (body
>> +       (list
>> +        "add_header Content-Type text/plain;"
>> +        "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Use \r\n instead of \n, even if \n happens to work.

There are many ‘buggy’ crawlers out there. It's in their own
interest to be fussy whilst claiming to respect robots.txt. The
less you deviate from the most basic norm imaginable, the better.

I tested embedding raw \r\n bytes in nginx.conf strings like this, and
it does seem to work, even though a human author would probably never
write them that way by hand.
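
Concretely, the body string in the patch would become (an untested
sketch):

  "return 200 \"User-agent: *\r\nDisallow: /nar/\r\n\";"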

> Nice, the bots are also accessing the Cuirass web interface, do you
> think it would be possible to extend this snippet to prevent it?

You can replace ‘/nar/’ with ‘/’ to disallow everything:

Disallow: /

If we want crawlers to index only the front page (so people can
search for ‘Guix CI’, I guess), that's possible:

Disallow: /
Allow: /$

Don't confuse that ‘$’ with regexp support: it's just a common
end-of-URL extension, and buggy bots that don't understand it should
simply fall back to obeying ‘Disallow: /’.

This is where it gets ugly: nginx doesn't support escaping ‘$’ in
strings. At all. It's insane.
  geo $dollar { default "$"; }  # stackoverflow.com/questions/57466554

  server {
    location = /robots.txt {
      return 200 "User-agent: *\r\nDisallow: /\r\nAllow: /$dollar\r\n";
    }
  }
*Obviously.*

An alternative to that is to serve a real on-disc robots.txt.
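
For instance, reusing the pattern from the signing-key location above (a
sketch, assuming a robots.txt file is dropped into /var/www/guix on
berlin):

  (nginx-location-configuration
   (uri "= /robots.txt")
   (body
    ;; Serve the on-disc /var/www/guix/robots.txt instead of an inline
    ;; string, sidestepping the ‘$’-escaping problem entirely.
    (list "root /var/www/guix;")))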

Kind regards,

T G-R

Leo Famulari wrote on 10 Dec 2021 17:22
(name . Tobias Geerinckx-Rice)(address . me@tobias.gr)
YbN+tx/MhHt/IdAD@jasmine.lan
On Thu, Dec 09, 2021 at 04:42:24PM +0100, Tobias Geerinckx-Rice wrote:
[...]
> An alternative to that is to serve a real on-disc robots.txt.

Alright, I leave it up to you. I just want to prevent bots from
downloading substitutes. I don't really have opinions about any of the
details.


Tobias Geerinckx-Rice wrote on 10 Dec 2021 17:47
(name . Leo Famulari)(address . leo@famulari.name)
87ilvw4db2.fsf@nckx
Leo Famulari writes:
> Alright, I leave it up to you.

Dammit.

Kind regards,

T G-R

Mark H Weaver wrote on 10 Dec 2021 22:21
87r1ak2m1p.fsf@netris.org
Hi Leo,

Leo Famulari <leo@famulari.name> writes:

> I noticed that some bots are downloading substitutes from
> ci.guix.gnu.org.
>
> We should add a robots.txt file to reduce this waste.
>
> Specifically, I see bots from Bing and Semrush:
>
> https://www.bing.com/bingbot.htm
> https://www.semrush.com/bot.html

For what it's worth: during the years that I administered Hydra, I found
that many bots disregarded the robots.txt file that was in place there.
In practice, I found that I needed to periodically scan the access logs
for bots and forcefully block their requests in order to keep Hydra from
becoming overloaded with expensive queries from bots.

Regards,
Mark
Tobias Geerinckx-Rice wrote on 10 Dec 2021 23:52
(name . Mark H Weaver)(address . mhw@netris.org)
875yrw3vvk.fsf@nckx
All,

Mark H Weaver writes:
> For what it's worth: during the years that I administered Hydra, I found
> that many bots disregarded the robots.txt file that was in place there.
> In practice, I found that I needed to periodically scan the access logs
> for bots and forcefully block their requests in order to keep Hydra from
> becoming overloaded with expensive queries from bots.

Very good point.

IME (which is a few years old at this point) at least the
highlighted BingBot & SemrushThing always respected my robots.txt,
but it's definitely a concern. I'll leave this bug open to remind
us of that in a few weeks or so…

If it does become a problem, we (I) might add some basic
User-Agent sniffing to either slow down or outright block
non-Guile downloaders. Whitelisting any legitimate ones, of
course. I think that's less hassle than dealing with dynamic IP
blocks whilst being equally effective here.
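
Something like the following, an untested sketch that blocks the two
bots named at the top of this bug rather than whitelisting (I'd have to
check what User-Agent guix itself sends before inverting it):

  # http context: classify crawlers by their User-Agent string.
  map $http_user_agent $blocked_agent {
      default       0;
      ~*bingbot     1;   # www.bing.com/bingbot.htm
      ~*semrushbot  1;   # www.semrush.com/bot.html
  }

  # server context: turn them away outright.
  if ($blocked_agent) { return 403; }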

Thanks (again) for taking care of Hydra, Mark, and thank you Leo
for keeping an eye on Cuirass :-)

T G-R

Mathieu Othacehe wrote on 11 Dec 2021 10:46
(name . Tobias Geerinckx-Rice)(address . me@tobias.gr)
87sfuzwk1u.fsf@gnu.org
Hey,

The Cuirass web interface logs were quite silent this morning and I
suspected an issue somewhere. Then I realized that you had updated the
Nginx conf and the bots were no longer knocking at our door, which is
great!

Thanks to both of you,

Mathieu
Mathieu Othacehe wrote on 19 Dec 2021 17:53
(address . 52338-done@debbugs.gnu.org)
87wnk0pmd4.fsf@gnu.org
> Thanks to both of you,

And closing!

Mathieu
Closed