'guix publish' without '--cache' can send garbled narinfo responses

  • Done
  • quality assurance status badge
Details
3 participants
  • Guillaume Le Vaillant
  • Ludovic Courtès
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Guillaume Le Vaillant
Severity
important
G
G
Guillaume Le Vaillant wrote on 5 Apr 2022 11:58
[PATCH] Check URI when verifying narinfo validity.
(address . guix-patches@gnu.org)
87a6czbzvh.fsf@kitej
When trying to upgrade a machine using a substitute server on the same
LAN, I get this crash a lot:

Toggle snippet (38 lines)
# guix system reconfigure --substitute-urls="http://192.168.0.22:8080 https://ci.guix.gnu.org" /etc/guix/config.scm
substitute: mise à jour des substituts depuis « http://192.168.0.22:8080 »... 100.0 %
substitute: Backtrace:
substitute: In ice-9/boot-9.scm:
substitute: 1752:10 17 (with-exception-handler _ _ #:unwind? _ # _)
substitute: In unknown file:
substitute: 16 (apply-smob/0 #<thunk 7fe08afb72e0>)
substitute: In ice-9/boot-9.scm:
substitute: 724:2 15 (call-with-prompt _ _ #<procedure default-prompt-handle…>)
substitute: In ice-9/eval.scm:
substitute: 619:8 14 (_ #(#(#<directory (guile-user) 7fe08afbcc80>)))
substitute: In guix/ui.scm:
substitute: 2209:7 13 (run-guix . _)
substitute: 2172:10 12 (run-guix-command _ . _)
substitute: In ice-9/boot-9.scm:
substitute: 1752:10 11 (with-exception-handler _ _ #:unwind? _ # _)
substitute: 1752:10 10 (with-exception-handler _ _ #:unwind? _ # _)
substitute: In guix/scripts/substitute.scm:
substitute: 757:18 9 (_)
substitute: 348:26 8 (process-query #<output: file 4> _ #:cache-urls _ #:acl _)
substitute: In guix/substitutes.scm:
substitute: 369:45 7 (lookup-narinfos/diverse _ _ #<procedure 7fe088c9cbc0 …> …)
substitute: In unknown file:
substitute: 6 (filter #<procedure 7fe088c9cbc0 at guix/scripts/subst…> …)
substitute: In guix/narinfo.scm:
substitute: 215:32 5 (valid-narinfo? _ _ #:verbose? _)
substitute: In ice-9/boot-9.scm:
substitute: 1685:16 4 (raise-exception _ #:continuable? _)
substitute: 1685:16 3 (raise-exception _ #:continuable? _)
substitute: 1780:13 2 (_ #<&compound-exception components: (#<&assertion-fail…>)
substitute: 1685:16 1 (raise-exception _ #:continuable? _)
substitute: 1685:16 0 (raise-exception _ #:continuable? _)
substitute:
substitute: ice-9/boot-9.scm:1685:16: In procedure raise-exception:
substitute: In procedure car: Wrong type argument in position 1 (expecting pair): ()
guix system: erreur : `/gnu/store/wgygsxcdy1z3pfvwhpgyl5vjp4xvwhhh-guix-1.3.0-23.a27e47f/bin/guix substitute' died unexpectedly

It looks like the 'narinfo-uri' field is an empty list instead of a list
of URIs. Is that supposed to be possible?
Does the the attached patch adding a check for the validity of
this field in the 'valid-narinfo?' function make sense?


The substitute server configuration is:

Toggle snippet (7 lines)
(service guix-publish-service-type
(guix-publish-configuration
(host "0.0.0.0")
(port 8080)
(compression '(("zstd" 3)))
(advertise? #t)))
From 8d9a45b2f38809fb3acfacf6f83532b7b556e78c Mon Sep 17 00:00:00 2001
From: Guillaume Le Vaillant <glv@posteo.net>
Date: Tue, 5 Apr 2022 11:50:48 +0200
Subject: [PATCH] narinfo: Check URI when verifying narinfo validity.

* guix/narinfo.scm (valid-narinfo?): Check if the 'uri' field is valid.
---
guix/narinfo.scm | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

Toggle diff (23 lines)
diff --git a/guix/narinfo.scm b/guix/narinfo.scm
index 4fc550aa6c..466ce20deb 100644
--- a/guix/narinfo.scm
+++ b/guix/narinfo.scm
@@ -209,11 +209,13 @@ (define %mandatory-fields
(define* (valid-narinfo? narinfo #:optional (acl (current-acl))
#:key verbose?)
- "Return #t if NARINFO's signature is not valid."
+ "Return #t if NARINFO's signature is valid."
(let ((hash (narinfo-sha256 narinfo))
(signature (narinfo-signature narinfo))
- (uri (uri->string (first (narinfo-uris narinfo)))))
- (and hash signature
+ (uri (if (null? (narinfo-uris narinfo))
+ #f
+ (uri->string (first (narinfo-uris narinfo))))))
+ (and hash signature uri
(signature-case (signature hash acl)
(valid-signature #t)
(invalid-signature
--
2.35.1
-----BEGIN PGP SIGNATURE-----

iIUEAREKAC0WIQTLxZxm7Ce5cXlAaz5r6CCK3yH+PwUCYkwWog8cZ2x2QHBvc3Rl
by5uZXQACgkQa+ggit8h/j9XpwD/V8fwTiGUelJ8zHqE9C8Jz4utsd/GUDfWLHFY
e/kXrDIA/3XZm1n0UrYE9KN/z03rtUtRbSp7TREBvzlt3R2daxLD
=o6sA
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 5 Apr 2022 19:08
(name . Guillaume Le Vaillant)(address . glv@posteo.net)(address . 54723@debbugs.gnu.org)
877d83lapv.fsf@gnu.org
Hi,

Guillaume Le Vaillant <glv@posteo.net> skribis:

Toggle quote (6 lines)
> When trying to upgrade a machine using a substitute server on the same
> LAN, I get this crash a lot:
>
> # guix system reconfigure --substitute-urls="http://192.168.0.22:8080 https://ci.guix.gnu.org" /etc/guix/config.scm
> substitute: mise à jour des substituts depuis « http://192.168.0.22:8080 »... 100.0 %

[...]

Toggle quote (3 lines)
> It looks like the 'narinfo-uri' field is an empty list instead of a list
> of URIs. Is that supposed to be possible?

I don’t think so. Could you grab a narinfo and share it?


where HASH is the hash part of a store item.

What could happen though is a situation where ‘guix publish’ only offers
a compression method not supported by the client. In that case,
‘narinfo-best-uri’ throws a match-error because ‘choices’ is the empty
list. We should fix that.

Toggle quote (3 lines)
> Does the the attached patch adding a check for the validity of
> this field in the 'valid-narinfo?' function make sense?

Maybe, but I’d like to make sure we understand the issue.

Thanks,
Ludo’.
G
G
Guillaume Le Vaillant wrote on 5 Apr 2022 19:51
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 54723@debbugs.gnu.org)
875ynnbe65.fsf@kitej
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (15 lines)
> Hi,
>
> [...]
>
> Could you grab a narinfo and share it?
>
> wget -qO - http://192.168.0.22:8080/HASH.narinfo
>
> where HASH is the hash part of a store item.
>
> What could happen though is a situation where ‘guix publish’ only offers
> a compression method not supported by the client. In that case,
> ‘narinfo-best-uri’ throws a match-error because ‘choices’ is the empty
> list. We should fix that.

I tried downloading a few narinfos and they don't look broken (2 of them
in attachment).

However for one of them it looks like the guix-publish server got stuck
on the request for several minutes (the second attempt worked):

Toggle snippet (17 lines)
wget http://192.168.0.22:8080/184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo
--2022-04-05 19:33:56-- http://192.168.0.22:8080/184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo
Connexion à 192.168.0.22:8080… connecté.
requête HTTP transmise, en attente de la réponse… Erreur de lecture (Connexion ré-initialisée par le correspondant) dans les en-têtes.
Nouvel essai.

--2022-04-05 19:36:40-- (essai : 2) http://192.168.0.22:8080/184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo
Connexion à 192.168.0.22:8080… connecté.
requête HTTP transmise, en attente de la réponse… 200 OK
Taille : 7428 (7,3K) [application/x-nix-narinfo]
Sauvegarde en : « 184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo »

184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo 100%[=====================================================================================================>] 7,25K --.-KB/s ds 0,02s

2022-04-05 19:36:40 (391 KB/s) — « 184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo » sauvegardé [7428/7428]

Could that be the cause of the issue?
L
L
Ludovic Courtès wrote on 9 Apr 2022 22:32
(name . Guillaume Le Vaillant)(address . glv@posteo.net)(address . 54723@debbugs.gnu.org)
87pmlq6lqx.fsf_-_@gnu.org
Hi,

Guillaume Le Vaillant <glv@posteo.net> skribis:

Toggle quote (21 lines)
> However for one of them it looks like the guix-publish server got stuck
> on the request for several minutes (the second attempt worked):
>
> wget http://192.168.0.22:8080/184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo
> --2022-04-05 19:33:56-- http://192.168.0.22:8080/184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo
> Connexion à 192.168.0.22:8080… connecté.
> requête HTTP transmise, en attente de la réponse… Erreur de lecture (Connexion ré-initialisée par le correspondant) dans les en-têtes.
> Nouvel essai.
>
> --2022-04-05 19:36:40-- (essai : 2) http://192.168.0.22:8080/184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo
> Connexion à 192.168.0.22:8080… connecté.
> requête HTTP transmise, en attente de la réponse… 200 OK
> Taille : 7428 (7,3K) [application/x-nix-narinfo]
> Sauvegarde en : « 184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo »
>
> 184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo 100%[=====================================================================================================>] 7,25K --.-KB/s ds 0,02s
>
> 2022-04-05 19:36:40 (391 KB/s) — « 184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo » sauvegardé [7428/7428]
>
> Could that be the cause of the issue?

Yes, it could be the issue. Is there any clue as to why ‘guix publish’
hung up first? Something in its log maybe?

Ludo’.
G
G
Guillaume Le Vaillant wrote on 9 Apr 2022 23:06
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 54723@debbugs.gnu.org)
877d7ydjwk.fsf@kitej
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (32 lines)
> Hi,
>
> Guillaume Le Vaillant <glv@posteo.net> skribis:
>
>> However for one of them it looks like the guix-publish server got stuck
>> on the request for several minutes (the second attempt worked):
>>
>> wget http://192.168.0.22:8080/184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo
>> --2022-04-05 19:33:56-- http://192.168.0.22:8080/184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo
>> Connexion à 192.168.0.22:8080… connecté.
>> requête HTTP transmise, en attente de la réponse… Erreur de lecture (Connexion ré-initialisée par le correspondant) dans les en-têtes.
>> Nouvel essai.
>>
>> --2022-04-05 19:36:40-- (essai : 2) http://192.168.0.22:8080/184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo
>> Connexion à 192.168.0.22:8080… connecté.
>> requête HTTP transmise, en attente de la réponse… 200 OK
>> Taille : 7428 (7,3K) [application/x-nix-narinfo]
>> Sauvegarde en : « 184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo »
>>
>> 184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo
>> 100%[=====================================================================================================>]
>> 7,25K --.-KB/s ds 0,02s
>>
>> 2022-04-05 19:36:40 (391 KB/s) — « 184b50qkcrkchc6dpwwcd7ypqb5yvrm3.narinfo » sauvegardé [7428/7428]
>>
>> Could that be the cause of the issue?
>
> Yes, it could be the issue. Is there any clue as to why ‘guix publish’
> hung up first? Something in its log maybe?
>
> Ludo’.

There are 2 errors that occur a lot in the guix-publish log files:

- "In procedure fport_write: Broken pipe"
It happens when trying to write to a socket apparently.

- "In procedure sign: gcrypt: Cannot allocate memory"
The machine has 64 GiB of RAM, of which at least 50 GiB is free, so
gcrypt should have enough to make a signature...

Note: full log file with complete error messages in attachment.
-----BEGIN PGP SIGNATURE-----

iIUEAREKAC0WIQTLxZxm7Ce5cXlAaz5r6CCK3yH+PwUCYlH6+w8cZ2x2QHBvc3Rl
by5uZXQACgkQa+ggit8h/j/j3QD+LDOGlj8OvUAiQ4siZ1kMZIjbtiyo/TyjUuE0
Upv7wuYA/2JBguRzSz9TOgR0rP8bB1W2737I7aOj02hQproqdc+i
=iS1C
-----END PGP SIGNATURE-----

G
G
Guillaume Le Vaillant wrote on 11 Apr 2022 15:31
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 54723@debbugs.gnu.org)
87wnfv90cl.fsf@kitej
Guillaume Le Vaillant <glv@posteo.net> skribis:

Toggle quote (9 lines)
> There are 2 errors that occur a lot in the guix-publish log files:
>
> - "In procedure fport_write: Broken pipe"
> It happens when trying to write to a socket apparently.
>
> - "In procedure sign: gcrypt: Cannot allocate memory"
> The machine has 64 GiB of RAM, of which at least 50 GiB is free, so
> gcrypt should have enough to make a signature...

I captured the network traffic between the "guix publish" server and
a "guix upgrade" client to see if the "broken pipe" errors could
come from real networking issues.
According to wireshark the problem doesn't come from there, the TCP
stream didn't have any error.

However, looking at the full TCP stream in wireshark I saw that the
"guix publish" server sends some bad narinfo responses.
Sometimes some parts of the response are missing (here, Signature
incomplete, URL and Compression fields missing):
Toggle snippet (12 lines)
HTTP/1.1 200 OK
Content-Length: 959
Content-Type: application/x-nix-narinfo;charset=UTF-8

StorePath: /gnu/store/dxpaqmix7zixm8pwcvvmq8q969q50jpp-pngload-2.0.0-2.91f1d70-checkout
NarHash: sha256:0s94fdbrbqj12qvgyn2g4lfwvz7qhhzbclrpz5ni7adwxgrmvxl1
NarSize: 245224
References:
Deriver: ybdimrfjs090kzmimf5j1x5hs8y4d24p-pngload-2.0.0-2.91f1d70-checkout.drv
Signature: 1;kitej;KHNpZ25hdHVyZSAKIChkYXRhIAogIChmbGFncyByZmM2OTc5KQogIChoYXNoIHNoYTI1NiAjNDY3NDk2RTJEOTZBMzc0QzFGN0M1MzJCNjc3MTM1NzVFOTkyRjQ0Qzc3MzQwRDUwQTcyRTkyMDJGRURDQkQxMyMpCiAgKQogKHNpZy12YWwgCiAgKGVjZHNhIAogICAociAjMDZEQTAwMkQyNjE3MEQ3ODVDNkM3NkMyMUEwM0UzNDlCMkUwMDc4MTUyQzFBQURFNjhFMEZGOUJDRkUyMUFDNSMpCiAgIChzICMwNjNDM0UyNjg2MEU2OTIzNDdEMjNGNTQ4RUM3RDJGRUZGQjc0Q0I4NjNEMjlDMUE3QjA4REFCQjEzQjZDRjAxIykKICAgKQogICkKIChwdWJsaWMta2V5IAogIC

Sometimes the response looks like almost complete garbage:
Toggle snippet (11 lines)
HTTP/1.1 200 OK
Content-Length: 970
Content-Type: application/x-nix-narinfo;charsetcharsetHTTP/=UTF-8

1
1

1
.S

When the client receives these bad narinfos, it often makes it crash
with errors like:
- Wrong type (expecting exact integer): #f
- unmatched line "1\r"
- Wrong type argument in position 1 (expecting pair): ()

So it looks like the broken pipe problem comes from the "guix publish"
server, or from Guile... And making the code reconstructing narinfos
from HTTP responses more robust in case of bad input would be useful.
-----BEGIN PGP SIGNATURE-----

iIUEAREKAC0WIQTLxZxm7Ce5cXlAaz5r6CCK3yH+PwUCYlQ2+g8cZ2x2QHBvc3Rl
by5uZXQACgkQa+ggit8h/j+M5AD+N63kh7hCuBNzyKtnlmunk6FIsv0lrnvQvH9K
ER1LZH4A/jPLvyEzcUq9MDLfg2xD69YBUorxzncfB0t4lOpiKVZa
=6EVZ
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 12 Apr 2022 09:47
(name . Guillaume Le Vaillant)(address . glv@posteo.net)(address . 54723@debbugs.gnu.org)
871qy2g2v3.fsf@gnu.org
Hi,

Guillaume Le Vaillant <glv@posteo.net> skribis:

Toggle quote (35 lines)
> However, looking at the full TCP stream in wireshark I saw that the
> "guix publish" server sends some bad narinfo responses.
> Sometimes some parts of the response are missing (here, Signature
> incomplete, URL and Compression fields missing):
>
> HTTP/1.1 200 OK
> Content-Length: 959
> Content-Type: application/x-nix-narinfo;charset=UTF-8
>
> StorePath: /gnu/store/dxpaqmix7zixm8pwcvvmq8q969q50jpp-pngload-2.0.0-2.91f1d70-checkout
> NarHash: sha256:0s94fdbrbqj12qvgyn2g4lfwvz7qhhzbclrpz5ni7adwxgrmvxl1
> NarSize: 245224
> References:
> Deriver: ybdimrfjs090kzmimf5j1x5hs8y4d24p-pngload-2.0.0-2.91f1d70-checkout.drv
> Signature: 1;kitej;KHNpZ25hdHVyZSAKIChkYXRhIAogIChmbGFncyByZmM2OTc5KQogIChoYXNoIHNoYTI1NiAjNDY3NDk2RTJEOTZBMzc0QzFGN0M1MzJCNjc3MTM1NzVFOTkyRjQ0Qzc3MzQwRDUwQTcyRTkyMDJGRURDQkQxMyMpCiAgKQogKHNpZy12YWwgCiAgKGVjZHNhIAogICAociAjMDZEQTAwMkQyNjE3MEQ3ODVDNkM3NkMyMUEwM0UzNDlCMkUwMDc4MTUyQzFBQURFNjhFMEZGOUJDRkUyMUFDNSMpCiAgIChzICMwNjNDM0UyNjg2MEU2OTIzNDdEMjNGNTQ4RUM3RDJGRUZGQjc0Q0I4NjNEMjlDMUE3QjA4REFCQjEzQjZDRjAxIykKICAgKQogICkKIChwdWJsaWMta2V5IAogIC
>
>
> Sometimes the response looks like almost complete garbage:
>
> HTTP/1.1 200 OK
> Content-Length: 970
> Content-Type: application/x-nix-narinfo;charsetcharsetHTTP/=UTF-8
>
> 1
> 1
>
> 1
> .S
>
> When the client receives these bad narinfos, it often makes it crash
> with errors like:
> - Wrong type (expecting exact integer): #f
> - unmatched line "1\r"
> - Wrong type argument in position 1 (expecting pair): ()

Woow. How do you build and run ‘guix publish’? Is it a distro package
or is it coming straight from Guix? What command-line options are you
passing?

I’ve never seen this, although we have it running on several servers,
notably ci.guix. I wonder what could cause this.

Thanks,
Ludo’.
G
G
Guillaume Le Vaillant wrote on 12 Apr 2022 10:54
(name . Ludovic Courtès)(address . ludo@gnu.org)
87o816wt36.fsf@kitej
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (10 lines)
> Woow. How do you build and run ‘guix publish’? Is it a distro package
> or is it coming straight from Guix? What command-line options are you
> passing?
>
> I’ve never seen this, although we have it running on several servers,
> notably ci.guix. I wonder what could cause this.
>
> Thanks,
> Ludo’.

I'm using guix-publish-service-type in the operating-system definition
to manage the "guix publish" server, using the on-the-fly mode and
fast Zstandard compression:

Toggle snippet (8 lines)
(service guix-publish-service-type
(guix-publish-configuration
(host "0.0.0.0")
(port 8080)
(compression '(("zstd" 3)))
(advertise? #t)))

When booting the machine, shepherd starts the server with the following
command-line options:

Toggle snippet (3 lines)
/gnu/store/059svbd32i4s0l9s5i7z0krcnl666bjy-guix-1.3.0-24.2fb4304/libexec/guix/guile \ /gnu/store/059svbd32i4s0l9s5i7z0krcnl666bjy-guix-1.3.0-24.2fb4304/bin/guix publish -u guix-publish -p 8080 -C zstd:3 --nar-path=nar --listen=0.0.0.0 --advertise

There's another report about this at https://issues.guix.gnu.org/53668
I had forgotten about, where Simon Streit and Maxim Cournoyer indicate
that they have seen this issue too.
-----BEGIN PGP SIGNATURE-----

iIUEAREKAC0WIQTLxZxm7Ce5cXlAaz5r6CCK3yH+PwUCYlVFvQ8cZ2x2QHBvc3Rl
by5uZXQACgkQa+ggit8h/j91WgD/XqYz6LtKogKSNZerrXAqJZUhPG2SPgR3E9Wv
rS/EZtAA/REc9W+O8OPtNRWtOkjnA060LMut97drfk89Yzc6O8f+
=JDIe
-----END PGP SIGNATURE-----

G
G
Guillaume Le Vaillant wrote on 14 Apr 2022 14:18
(name . Ludovic Courtès)(address . ludo@gnu.org)
87sfqfc0fq.fsf@kitej
Guillaume Le Vaillant <glv@posteo.net> skribis:

Toggle quote (33 lines)
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> Woow. How do you build and run ‘guix publish’? Is it a distro package
>> or is it coming straight from Guix? What command-line options are you
>> passing?
>>
>> I’ve never seen this, although we have it running on several servers,
>> notably ci.guix. I wonder what could cause this.
>>
>> Thanks,
>> Ludo’.
>
> I'm using guix-publish-service-type in the operating-system definition
> to manage the "guix publish" server, using the on-the-fly mode and
> fast Zstandard compression:
>
> (service guix-publish-service-type
> (guix-publish-configuration
> (host "0.0.0.0")
> (port 8080)
> (compression '(("zstd" 3)))
> (advertise? #t)))
>
>
> When booting the machine, shepherd starts the server with the following
> command-line options:
>
> /gnu/store/059svbd32i4s0l9s5i7z0krcnl666bjy-guix-1.3.0-24.2fb4304/libexec/guix/guile \ /gnu/store/059svbd32i4s0l9s5i7z0krcnl666bjy-guix-1.3.0-24.2fb4304/bin/guix publish -u guix-publish -p 8080 -C zstd:3 --nar-path=nar --listen=0.0.0.0 --advertise
>
> There's another report about this at <https://issues.guix.gnu.org/53668>
> I had forgotten about, where Simon Streit and Maxim Cournoyer indicate
> that they have seen this issue too.

I tested a few things, and it looks like there is a concurrency issue in
the 'http-write' function of the publish server.
After removing the 'call-with-new-thread' in the part of the function
sending narinfos, I can't reproduce the "broken pipe" error anymore.

Toggle snippet (53 lines)
diff --git a/guix/scripts/publish.scm b/guix/scripts/publish.scm
index 870dfc11e9..e9fad5845a 100644
--- a/guix/scripts/publish.scm
+++ b/guix/scripts/publish.scm
@@ -1002,28 +1002,25 @@ (define compression
(begin
(when (keep-alive? response)
(keep-alive client))
- (call-with-new-thread
- (lambda ()
- (set-thread-name "publish narinfo")
- (let* ((narinfo
- (with-store store
- (narinfo-string store (utf8->string body)
- #:nar-path nar-path
- #:compressions compressions)))
- (narinfo-bv (string->bytevector narinfo "UTF-8"))
- (narinfo-length
- (bytevector-length narinfo-bv))
- (response (write-response
- (with-content-length response
- narinfo-length)
- client))
- (output (response-port response)))
- (configure-socket client)
- (put-bytevector output narinfo-bv)
- (force-output output)
- (unless (keep-alive? response)
- (close-port output))
- (values)))))
+ (let* ((narinfo
+ (with-store store
+ (narinfo-string store (utf8->string body)
+ #:nar-path nar-path
+ #:compressions compressions)))
+ (narinfo-bv (string->bytevector narinfo "UTF-8"))
+ (narinfo-length
+ (bytevector-length narinfo-bv))
+ (response (write-response
+ (with-content-length response
+ narinfo-length)
+ client))
+ (output (response-port response)))
+ (configure-socket client)
+ (put-bytevector output narinfo-bv)
+ (force-output output)
+ (unless (keep-alive? response)
+ (close-port output))
+ (values)))
(%http-write server client response body))))
(_
(match (assoc-ref (response-headers response) 'x-raw-file)
-----BEGIN PGP SIGNATURE-----

iIUEAREKAC0WIQTLxZxm7Ce5cXlAaz5r6CCK3yH+PwUCYlgTyQ8cZ2x2QHBvc3Rl
by5uZXQACgkQa+ggit8h/j+hKAD/TUAKe/Vv8UWCHTciK8KcpJCI6VeHvl+liDxd
aNdrMjgA/0X0tqnhOfDNHEdgGZokZyHDKBl2gn1ICTPE5V0AppFw
=v6M3
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 18 Apr 2022 21:39
(name . Guillaume Le Vaillant)(address . glv@posteo.net)
87bkwyxjtp.fsf@gnu.org
Hi!

Guillaume Le Vaillant <glv@posteo.net> skribis:

Toggle quote (5 lines)
> I tested a few things, and it looks like there is a concurrency issue in
> the 'http-write' function of the publish server.
> After removing the 'call-with-new-thread' in the part of the function
> sending narinfos, I can't reproduce the "broken pipe" error anymore.

Ah ha! Little has changed in this code, so I think the culprit must be
f743f2046be2c5a338ab871ae8666d8f6de7440b.

Could you try running it from master with this patch reverted?

You can spawn it from the build tree with:

sudo -E ./pre-inst-env guix publish -u nobody

Bonus points if you have a script to reproduce the issue. I guess that
involves making pipelined keep-alive narinfo requests the way ‘guix
substitute’ and ‘guix weather’ do.

Thanks,
Ludo’.
G
G
Guillaume Le Vaillant wrote on 20 Apr 2022 16:10
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 54723@debbugs.gnu.org)
87v8v3ddz5.fsf@kitej
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (25 lines)
> Hi!
>
> Guillaume Le Vaillant <glv@posteo.net> skribis:
>
>> I tested a few things, and it looks like there is a concurrency issue in
>> the 'http-write' function of the publish server.
>> After removing the 'call-with-new-thread' in the part of the function
>> sending narinfos, I can't reproduce the "broken pipe" error anymore.
>
> Ah ha! Little has changed in this code, so I think the culprit must be
> f743f2046be2c5a338ab871ae8666d8f6de7440b.
>
> Could you try running it from master with this patch reverted?
>
> You can spawn it from the build tree with:
>
> sudo -E ./pre-inst-env guix publish -u nobody
>
> Bonus points if you have a script to reproduce the issue. I guess that
> involves making pipelined keep-alive narinfo requests the way ‘guix
> substitute’ and ‘guix weather’ do.
>
> Thanks,
> Ludo’.

Hi,

With master at 101edbe63a887678722bc26cd85a7b7f5637428f, I reproduce the
issue very often when trying to upgrade a profile with around 400
packages. The logs of the publish server show hundreds of narinfo
requests and tens of "broken pipe" errors.

With an additional commit reverting
f743f2046be2c5a338ab871ae8666d8f6de7440b, I can't reproduce the issue
anymore. The logs still show hundreds of narinfo requests, but no
errors.


I tried to make a script making many narinfo requests to reproduce the
issue more easily, but I didn't get it to work yet. I thought accessing
the 'fetch-narinfos' function from "guix/substitutes.scm" with:

Toggle snippet (6 lines)
(use-modules (guix)
(guix substitutes))

(define fetch-narinfos (@@ (guix substitutes) fetch-narinfos))

would work, but it doesn't:

Toggle snippet (14 lines)
$ guix repl fetch-narinfos.scm
Backtrace:
4 (primitive-load "/home/guillaume/fetch-narinfos.scm")
In ice-9/eval.scm:
626:19 3 (_ #<directory (guix-user) 7faab8e0e8c0>)
223:20 2 (proc #<directory (guix-user) 7faab8e0e8c0>)
In unknown file:
1 (%resolve-variable (5 (guix substitutes) # . #f) #<dire…>)
In ice-9/boot-9.scm:
1685:16 0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
erreur : fetch-narinfos : variable non liée
-----BEGIN PGP SIGNATURE-----

iIUEAREKAC0WIQTLxZxm7Ce5cXlAaz5r6CCK3yH+PwUCYmAY/g8cZ2x2QHBvc3Rl
by5uZXQACgkQa+ggit8h/j+sLQEAk6QV8vHv5acWW6Q6qX3nw/gfMDMYXL/zpByM
DoiM7vkA+wbpvXYmS4UcQrjoHzDfQZyHy2EOG4DPjb8vi1balYQB
=vEZs
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 29 Apr 2022 18:20
(name . Guillaume Le Vaillant)(address . glv@posteo.net)
87y1zn6ev5.fsf_-_@gnu.org
Hi,

(+Cc: Mathieu.)

Guillaume Le Vaillant <glv@posteo.net> skribis:

Toggle quote (10 lines)
> With master at 101edbe63a887678722bc26cd85a7b7f5637428f, I reproduce the
> issue very often when trying to upgrade a profile with around 400
> packages. The logs of the publish server show hundreds of narinfo
> requests and tens of "broken pipe" errors.
>
> With an additional commit reverting
> f743f2046be2c5a338ab871ae8666d8f6de7440b, I can't reproduce the issue
> anymore. The logs still show hundreds of narinfo requests, but no
> errors.

For the record, a simple way to reproduce it is to start ‘guix publish’:

sudo -E ./pre-inst-env guix publish -u nobody -p 9999

and to spawn ‘guix weather’, for example with:

guix weather $(guix package -I. -p ~/.guix-home/profile |cut -f1) \
--substitute-urls=http://localhost:9999

which crashes with the dreaded backtrace:

Toggle snippet (26 lines)
Backtrace:
11 (primitive-load "/home/ludo/.config/guix/current/bin/guix")
In guix/ui.scm:
2230:7 10 (run-guix . _)
2193:10 9 (run-guix-command _ . _)
In ice-9/boot-9.scm:
1752:10 8 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In guix/scripts/weather.scm:
595:9 7 (_)
In guix/build/utils.scm:
677:23 6 (every* #<procedure 7f4af9621de0 at guix/scripts/weather.scm:595:17 (server)> _)
In guix/scripts/weather.scm:
597:21 5 (_ "http://localhost:9999")
120:17 4 (report-server-coverage _ ("/gnu/store/428bzp9325mfapyr4ywzwsz4ic7ssx2b-shepherd-0.9.0" "/gnu/store/sll9nfmqk8lkrraqbkyp3y…" …) …)
In unknown file:
3 (_ #<procedure 7f4af896ca40 at guix/scripts/weather.scm:201:2 ()> #<procedure list _> . #w())
In guix/substitutes.scm:
322:31 2 (lookup-narinfos "http://localhost:9999" _ #:open-connection _ #:make-progress-reporter _)
245:26 1 (fetch-narinfos _ _ #:open-connection _ #:make-progress-reporter _)
In ice-9/boot-9.scm:
1685:16 0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Wrong type (expecting exact integer): #f

I looked more closely at f743f2046be2c5a338ab871ae8666d8f6de7440b and I
think it had a logical flaw: if you pipeline 100 GET requests, then,
with this commit, ‘guix publish’ would spawn 100 threads that would all
reply concurrently (more or less). This is clearly wrong: replies
should be sent sequentially.

So in commit c1719a0adf3fa7611b56ca4d75b3ac8cf5c9c8ac I went ahead and
reverted it that commit. I also added a test that reproduces the issue
above.


Now, commit f743f2046be2c5a338ab871ae8666d8f6de7440b was itself an
attempt to fix a bug whereby ‘narinfo-string’ would take too long,
thereby preventing ‘guix publish’ from accepting connections (since it’s
single-threaded).

I think the only reasonable way to fix it is by using Fibers to make
‘guix publish’ concurrent (using a fiberized server like in Cuirass).
We should do that at some point. That’ll also allow us to remove some
of the hacks we’ve accumulated.

Now, the only way ‘narinfo-string’ can take too long these days is (I
think) if the store GC lock is held (we should check that hypothesis,
but I believe that if the GC lock is held, then the ‘query-path-info’
RPC made from ‘narinfo-string’ might block until the lock is released).
The GC lock is no longer held for hours on berlin, so there’s less
pressure to address that.

To summarize: I think ‘guix publish’ is okay as-is but we should
fiberize it sometime.

Thoughts?

Ludo’.
L
L
Ludovic Courtès wrote on 29 Apr 2022 18:20
control message for bug #54723
(address . control@debbugs.gnu.org)
87wnf76eu3.fsf@gnu.org
retitle 54723 'guix publish' without '--cache' can send garbled narinfo responses
quit
L
L
Ludovic Courtès wrote on 29 Apr 2022 18:21
(address . control@debbugs.gnu.org)
87v8ur6ett.fsf@gnu.org
reassign 54723 guix
quit
L
L
Ludovic Courtès wrote on 29 Apr 2022 18:21
(address . control@debbugs.gnu.org)
87tuab6eth.fsf@gnu.org
severity 54723 important
quit
L
L
Ludovic Courtès wrote on 29 Apr 2022 18:21
(address . control@debbugs.gnu.org)
87sfpv6etc.fsf@gnu.org
close 54723
quit
L
L
Ludovic Courtès wrote on 29 Apr 2022 23:16
Re: bug#54723: [PATCH] Check URI when verifying narinfo validity.
(name . Guillaume Le Vaillant)(address . glv@posteo.net)
87ee1f615b.fsf_-_@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (4 lines)
> So in commit c1719a0adf3fa7611b56ca4d75b3ac8cf5c9c8ac I went ahead and
> reverted it that commit. I also added a test that reproduces the issue
> above.

Commit 9eca13094d9bdf091f9096e0f3a8b450dac0e595 updates the ‘guix’
package, so you can now pull, reconfigure, and enjoy. :-)

Ludo’.
G
G
Guillaume Le Vaillant wrote on 30 Apr 2022 14:12
(name . Ludovic Courtès)(address . ludo@gnu.org)
87tuaalqfc.fsf@kitej
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (11 lines)
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> So in commit c1719a0adf3fa7611b56ca4d75b3ac8cf5c9c8ac I went ahead and
>> reverted it that commit. I also added a test that reproduces the issue
>> above.
>
> Commit 9eca13094d9bdf091f9096e0f3a8b450dac0e595 updates the ‘guix’
> package, so you can now pull, reconfigure, and enjoy. :-)
>
> Ludo’.

I just tried and everything worked without a glitch.
Thanks.
-----BEGIN PGP SIGNATURE-----

iIUEAREKAC0WIQTLxZxm7Ce5cXlAaz5r6CCK3yH+PwUCYm0n9w8cZ2x2QHBvc3Rl
by5uZXQACgkQa+ggit8h/j8hzAD9HVaWYh++EYPMMGCShJb4Jbs3jLfYE4dBAVkJ
d1nMaEQA/1Z8EDpGG3VaIw9CPf4IOCVYoym440w+k057mGuploc3
=YS/d
-----END PGP SIGNATURE-----

M
M
Mathieu Othacehe wrote on 30 Apr 2022 14:15
(name . Ludovic Courtès)(address . ludo@gnu.org)
87ilqqwyvw.fsf@gnu.org
Hey Ludo,

Toggle quote (7 lines)
> Now, the only way ‘narinfo-string’ can take too long these days is (I
> think) if the store GC lock is held (we should check that hypothesis,
> but I believe that if the GC lock is held, then the ‘query-path-info’
> RPC made from ‘narinfo-string’ might block until the lock is released).
> The GC lock is no longer held for hours on berlin, so there’s less
> pressure to address that.

Thanks for having a look to that issue. On my side, I'd like to take
some time to look closely at the situation.

Toggle quote (3 lines)
> To summarize: I think ‘guix publish’ is okay as-is but we should
> fiberize it sometime.

I think however that it would be nice to create a wip branch for those
changes, deploy it and monitor it closely on Berlin before pushing. From
experience, the publish code is really fragile and the I/O pressure on
Berlin can exhibit a lot of unexpected issues.

Mathieu
L
L
Ludovic Courtès wrote on 1 May 2022 15:12
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87pmkx4ct4.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (5 lines)
> I think however that it would be nice to create a wip branch for those
> changes, deploy it and monitor it closely on Berlin before pushing. From
> experience, the publish code is really fragile and the I/O pressure on
> Berlin can exhibit a lot of unexpected issues.

Agreed, it looks like we have to be more cautious now.

Ludo’.
?