Bandwidth-induced offload timeout aborts whole operation

  • Done
Details
2 participants
  • Ludovic Courtès
  • Maxim Cournoyer
Owner
unassigned
Submitted by
Maxim Cournoyer
Severity
normal
Maxim Cournoyer wrote on 20 Feb 2023 04:28
(name . bug-guix)(address . bug-guix@gnu.org)
87ilfxm2wf.fsf@gmail.com
Hi Guix,

I can reproduce this rather easily on my system:

$ ./pre-inst-env guix build icedove
The following derivations will be built:
/gnu/store/l6r93asndd0kwv7024iyrl71zd0lbpbq-icedove-102.7.2.drv
/gnu/store/8zi808086b3vlfjrhdm87fgljziwdqx2-icedove-l10n-102.7.2.drv
/gnu/store/v0sq7rb8fk36kjasb27a71z1a27wxb1s-icedove-minimal-102.7.2.drv
process 19542 acquired build slot '/var/guix/offload/localhost:6666/0'
normalized load on machine 'localhost' is 0.08
building /gnu/store/8zi808086b3vlfjrhdm87fgljziwdqx2-icedove-l10n-102.7.2.drv...
process 19548 acquired build slot '/var/guix/offload/localhost:6666/1'
normalized load on machine 'localhost' is 0.08
building /gnu/store/v0sq7rb8fk36kjasb27a71z1a27wxb1s-icedove-minimal-102.7.2.drv...
guix offload: sending 1 store item (558 MiB) to 'localhost'...
exporting path `/gnu/store/bwb5hcdyzgq16kmbsva7ax0zq6lzg78z-icedove-102.7.2.tar.xz'
guix offload: error: failed to connect to 'localhost': Timeout connecting to localhost
cannot build derivation `/gnu/store/l6r93asndd0kwv7024iyrl71zd0lbpbq-icedove-102.7.2.drv': 1 dependencies couldn't be built
guix build: error: build of
`/gnu/store/l6r93asndd0kwv7024iyrl71zd0lbpbq-icedove-102.7.2.drv' failed

The third derivation tries to get a build slot and times out, because
the first two have already saturated the bandwidth of the link and it
takes more time than expected to get a reply.

The workaround is to use '-k' ("--keep-going") and retry the failing
third derivation once the first two have completed.
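
A minimal sketch of that workaround as a shell helper. The `retry`
function and the `RETRIES` and `RETRY_DELAY` variables are hypothetical
names introduced here for illustration; they are not part of Guix.

```shell
# retry CMD [ARGS...]: run CMD until it succeeds, at most RETRIES times
# (default 3), sleeping RETRY_DELAY seconds (default 5) between attempts.
# Hypothetical helper for illustration; not part of Guix.
retry() {
    attempts="${RETRIES:-3}"
    delay="${RETRY_DELAY:-5}"
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}
```

It could be used as `retry guix build --keep-going icedove`: the first
run builds whatever it can, and a later run picks up the derivation
whose offload connection timed out.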

I don't have a clear idea of how to improve the situation other than
using longer timeouts... but perhaps these timeouts could be dynamic,
based on the network or CPU load?
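
To make the idea concrete, here is a rough sketch of one possible
policy: scale a base timeout by the normalized load and cap the result.
The `scaled_timeout` function, the formula, and the 60-second cap are
all assumptions made for illustration; nothing like this exists in
`guix offload`.

```shell
# scaled_timeout BASE LOAD [CAP]: print BASE * (1 + LOAD) in whole
# seconds, never more than CAP (default 60).  Hypothetical policy,
# not what guix offload does.
scaled_timeout() {
    awk -v base="$1" -v load="$2" -v cap="${3:-60}" 'BEGIN {
        t = base * (1 + load)
        if (t > cap) t = cap
        printf "%d\n", t
    }'
}
```

With the load of 0.08 reported in the log above, `scaled_timeout 10 0.08`
would still give about 10 seconds, which suggests that CPU load alone
would not have helped here and that link saturation would need to be
measured some other way.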

--
Thanks,
Maxim
Ludovic Courtès wrote on 23 Feb 2023 23:26
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 61646@debbugs.gnu.org)
87wn483to1.fsf@gnu.org
Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> I can reproduce this rather easily on my system:
>
> $ ./pre-inst-env guix build icedove
> The following derivations will be built:
> /gnu/store/l6r93asndd0kwv7024iyrl71zd0lbpbq-icedove-102.7.2.drv
> /gnu/store/8zi808086b3vlfjrhdm87fgljziwdqx2-icedove-l10n-102.7.2.drv
> /gnu/store/v0sq7rb8fk36kjasb27a71z1a27wxb1s-icedove-minimal-102.7.2.drv
> process 19542 acquired build slot '/var/guix/offload/localhost:6666/0'
> normalized load on machine 'localhost' is 0.08
> building /gnu/store/8zi808086b3vlfjrhdm87fgljziwdqx2-icedove-l10n-102.7.2.drv...
> process 19548 acquired build slot '/var/guix/offload/localhost:6666/1'
> normalized load on machine 'localhost' is 0.08
> building /gnu/store/v0sq7rb8fk36kjasb27a71z1a27wxb1s-icedove-minimal-102.7.2.drv...
> guix offload: sending 1 store item (558 MiB) to 'localhost'...
> exporting path `/gnu/store/bwb5hcdyzgq16kmbsva7ax0zq6lzg78z-icedove-102.7.2.tar.xz'
> guix offload: error: failed to connect to 'localhost': Timeout connecting to localhost
> cannot build derivation `/gnu/store/l6r93asndd0kwv7024iyrl71zd0lbpbq-icedove-102.7.2.drv': 1 dependencies couldn't be built
> guix build: error: build of
> `/gnu/store/l6r93asndd0kwv7024iyrl71zd0lbpbq-icedove-102.7.2.drv' failed
>
> The third derivation tries to get a build slot and times out, because
> the first two have already saturated the bandwidth of the link and it
> takes more time than expected to get a reply.

Weird. Since it’s a timeout while connecting, I suppose the patch
below would improve the situation:

diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
index 578b3b9888..90cf97401c 100644
--- a/guix/scripts/offload.scm
+++ b/guix/scripts/offload.scm
@@ -220,7 +220,7 @@ (define* (open-ssh-session machine #:optional max-silent-time)
          (session (make-session #:user (build-machine-user machine)
                                 #:host (build-machine-name machine)
                                 #:port (build-machine-port machine)
-                                #:timeout 10      ;initial timeout (seconds)
+                                #:timeout 30      ;initial timeout (seconds)
                                 ;; #:log-verbosity 'protocol
                                 #:identity (build-machine-private-key machine)

WDYT?

Ludo’.
Maxim Cournoyer wrote on 25 Feb 2023 03:46
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 61646@debbugs.gnu.org)
87356uh37e.fsf@gmail.com
Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi Maxim,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> I can reproduce this rather easily on my system:
>>
>> $ ./pre-inst-env guix build icedove
>> The following derivations will be built:
>> /gnu/store/l6r93asndd0kwv7024iyrl71zd0lbpbq-icedove-102.7.2.drv
>> /gnu/store/8zi808086b3vlfjrhdm87fgljziwdqx2-icedove-l10n-102.7.2.drv
>> /gnu/store/v0sq7rb8fk36kjasb27a71z1a27wxb1s-icedove-minimal-102.7.2.drv
>> process 19542 acquired build slot '/var/guix/offload/localhost:6666/0'
>> normalized load on machine 'localhost' is 0.08
>> building /gnu/store/8zi808086b3vlfjrhdm87fgljziwdqx2-icedove-l10n-102.7.2.drv...
>> process 19548 acquired build slot '/var/guix/offload/localhost:6666/1'
>> normalized load on machine 'localhost' is 0.08
>> building /gnu/store/v0sq7rb8fk36kjasb27a71z1a27wxb1s-icedove-minimal-102.7.2.drv...
>> guix offload: sending 1 store item (558 MiB) to 'localhost'...
>> exporting path `/gnu/store/bwb5hcdyzgq16kmbsva7ax0zq6lzg78z-icedove-102.7.2.tar.xz'
>> guix offload: error: failed to connect to 'localhost': Timeout connecting to localhost
>> cannot build derivation
>> `/gnu/store/l6r93asndd0kwv7024iyrl71zd0lbpbq-icedove-102.7.2.drv': 1
>> dependencies couldn't be built
>> guix build: error: build of
>> `/gnu/store/l6r93asndd0kwv7024iyrl71zd0lbpbq-icedove-102.7.2.drv' failed
>>
>> The third derivation tries to get a build slot and times out, because
>> the first two have already saturated the bandwidth of the link and it
>> takes more time than expected to get a reply.
>
> Weird. Since it’s a timeout while connecting, I suppose the patch
> below would improve the situation:
>
> diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
> index 578b3b9888..90cf97401c 100644
> --- a/guix/scripts/offload.scm
> +++ b/guix/scripts/offload.scm
> @@ -220,7 +220,7 @@ (define* (open-ssh-session machine #:optional max-silent-time)
>          (session (make-session #:user (build-machine-user machine)
>                                 #:host (build-machine-name machine)
>                                 #:port (build-machine-port machine)
> -                                #:timeout 10      ;initial timeout (seconds)
> +                                #:timeout 30      ;initial timeout (seconds)
>                                 ;; #:log-verbosity 'protocol
>                                 #:identity (build-machine-private-key machine)

Hm, how can I test this again?

I tried launching a daemon both on the remote and locally, with
something like:

sudo -E ./pre-inst-env ./guix-daemon --build-users-group guixbuild \
--max-silent-time 0 --timeout 0 --log-compression none --discover=yes \
--substitute-urls "https://ci.guix.gnu.org https://bordeaux.guix.gnu.org" \
--max-jobs=20

and the edited code doesn't seem to run (I put an (error 'hello) in
there and nothing happened).

--
Thanks,
Maxim
Maxim Cournoyer wrote on 25 Feb 2023 04:07
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 61646-done@debbugs.gnu.org)
87y1omfnnk.fsf@gmail.com
Hello,

Ludovic Courtès <ludo@gnu.org> writes:

[...]

> Weird. Since it’s a timeout while connecting, I suppose the patch
> below would improve the situation:
>
> diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
> index 578b3b9888..90cf97401c 100644
> --- a/guix/scripts/offload.scm
> +++ b/guix/scripts/offload.scm
> @@ -220,7 +220,7 @@ (define* (open-ssh-session machine #:optional max-silent-time)
>          (session (make-session #:user (build-machine-user machine)
>                                 #:host (build-machine-name machine)
>                                 #:port (build-machine-port machine)
> -                                #:timeout 10      ;initial timeout (seconds)
> +                                #:timeout 30      ;initial timeout (seconds)
>                                 ;; #:log-verbosity 'protocol
>                                 #:identity (build-machine-private-key machine)

Never mind my previous message: --sysconfdir had not been set, so my
offload setup (/etc/guix/machines.scm) was being ignored. The following
command worked to test the change from the local machine:

sudo -E ./pre-inst-env ./guix-daemon --build-users-group guixbuild \
--max-silent-time 0 --timeout 0 --log-compression none --discover=yes \
--substitute-urls "https://ci.guix.gnu.org https://bordeaux.guix.gnu.org" \
--max-jobs=4
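
A quick sanity check for this kind of mistake, as a sketch: confirm that
the offload setup is readable under the sysconfdir the checkout was
configured with. The `has_machines_file` function is a hypothetical
helper, and the `<sysconfdir>/guix/machines.scm` layout is assumed from
the /etc/guix/machines.scm path mentioned above.

```shell
# has_machines_file SYSCONFDIR: succeed if SYSCONFDIR/guix/machines.scm
# is readable.  Hypothetical helper; the layout is an assumption based
# on the /etc/guix/machines.scm path in this thread.
has_machines_file() {
    file="$1/guix/machines.scm"
    if [ -r "$file" ]; then
        echo "found: $file"
    else
        echo "missing: $file" >&2
        return 1
    fi
}
```

For example, `has_machines_file /etc` checks the path used in this
report. If the check fails for the sysconfdir the checkout was built
with, the daemon started from `./pre-inst-env` would not see any offload
machines.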

I pushed the fix in commit 53d718f61b.

Closing, thank you!

--
Thanks,
Maxim
Closed
