Hi,
I have not checked all the details, since the code of “guix offload” is
run by root, IIUC, and so it is not as easy to debug as usual. :-)
On Fri, 17 Dec 2021 at 16:57, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote:
>> However, I think this behavior was unintentionally lost in
>> efbf5fdd01817ea75de369e3dd2761a85f8f7dd5. Maxim, WDYT?
>
> I just reviewed this commit, and don't see anywhere where the behavior
> would have changed. The discarding happens here:
>
> previously load could be set to +inf.0. Now it is a float between 0.0
> and 1.0, with threshold defaulting to 0.6.
My /etc/guix/machines.scm contains only one machine and --max-jobs=0.
Because the machine is unreachable, IIUC, ’node’ is (or should be) false
and ’load’ is thus not involved, I guess. Indeed, ’report-load’
displays nothing, and instead I get:
The following derivation will be built:
/gnu/store/c1qicg17ygn1a0biq0q4mkprzy4p2x74-hello-2.10.drv
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
waiting for locks or build slots...
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
C-c C-c
Well, if the machine is not reachable, then ’session’ is false, right?
@@ -472,11 +480,15 @@ (define (machine-faster? m1 m2)
 (let* ((session (false-if-exception (open-ssh-session best
                                                       %short-timeout)))
        (node (and session (remote-inferior session)))
-       (load (and node (normalized-load best (node-load node))))
+       (load (and node (node-load node)))
+       (threshold (build-machine-overload-threshold best))
        (space (and node (node-free-disk-space node))))
+  (when load (report-load best load))
   (when node (close-inferior node))
   (when session (disconnect! session))
-  (if (and node (< load 2.) (>= space %minimum-disk-space))
+  (if (and node
+           (or (not threshold) (< load threshold))
+           (>= space %minimum-disk-space))
 [...]
       (begin
         ;; BEST is unsuitable, so try the next one.
         (when (and space (< space %minimum-disk-space))
           (format (current-error-port)
                   "skipping machine '~a' because it is low \
 on disk space (~,2f MiB free)~%"
                   (build-machine-name best)
                   (/ space (expt 2 20) 1.)))
         (release-build-slot slot)
         (loop others)))))
Therefore, the ’else’ branch is taken and so the code does ’(loop others)’.
However, I do not see why ’others’ is not empty (there is only one machine
in /etc/guix/machines.scm). Well, the message «waiting for locks or build
slots...» suggests that something is restarted, so what we are observing
is not that ’loop’ but another one.
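As a sanity check of the reasoning above, here is a toy model of that branch, written in C++ only so that it can be compiled and tested standalone. It is a sketch, not the offload code: ’Probe’, ’Verdict’ and ’judge’ are hypothetical names, and the threshold and minimum disk space are plain parameters because their real values are not shown here.

```cpp
// Toy model of the probe result; hypothetical, not part of Guix.
struct Probe {
    bool reachable;    // false when the SSH session could not be opened
    double load;       // only meaningful when reachable
    double freeSpace;  // only meaningful when reachable
};

enum class Verdict { Use, SkipLowDisk, SkipSilently };

// hasThreshold == false models a threshold of #f (no overload check).
Verdict judge(const Probe &p, bool hasThreshold, double threshold,
              double minSpace)
{
    // 'node' is false as soon as 'session' is false, so 'load' and
    // 'space' are never looked at for an unreachable machine.
    bool node = p.reachable;

    if (node
        && (!hasThreshold || p.load < threshold)
        && p.freeSpace >= minSpace)
        return Verdict::Use;

    // Else branch: the "low on disk space" message is printed only
    // when the free space is known and too small.
    if (node && p.freeSpace < minSpace)
        return Verdict::SkipLowDisk;

    // Otherwise the machine is dropped without any message, e.g. when
    // it is unreachable or overloaded.
    return Verdict::SkipSilently;
}
```

In this model an unreachable machine always ends up in the silent case, which matches the absence of any «skipping machine» message in the log above.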
On the daemon side, I do not know what ’waitingForAWhile’ and
’lastWokenUp’ mean.
/* If we are polling goals that are waiting for a lock, then wake
   up after a few seconds at most. */
if (!waitingForAWhile.empty()) {
    useTimeout = true;
    if (lastWokenUp == 0)
        printMsg(lvlError, "waiting for locks or build slots...");
    if (lastWokenUp == 0 || lastWokenUp > before) lastWokenUp = before;
    timeout.tv_sec = std::max((time_t) 1, (time_t) (lastWokenUp + settings.pollInterval - before));
} else lastWokenUp = 0;
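Reading only the lines quoted above, the timeout arithmetic can be isolated as follows. This is a minimal sketch under assumptions: ’pollTimeout’ is a hypothetical helper, the daemon state is passed in explicitly, and -1 is an arbitrary marker for "no timeout"; whatever else updates ’lastWokenUp’ in the daemon is not modelled.

```cpp
#include <algorithm>
#include <ctime>

// Hypothetical helper mirroring the arithmetic of the quoted snippet.
// 'waiting' stands in for !waitingForAWhile.empty(); 'lastWokenUp' is
// updated in place, as in the original.  Returns the number of seconds
// to sleep, or -1 when no timeout would be set.
long pollTimeout(bool waiting, time_t &lastWokenUp, time_t before,
                 time_t pollInterval)
{
    if (!waiting) {
        lastWokenUp = 0;  // nothing is polling: reset the marker
        return -1;
    }

    // First pass (or clock went backwards): start counting from now.
    if (lastWokenUp == 0 || lastWokenUp > before)
        lastWokenUp = before;

    // Sleep for what remains of the poll interval, at least 1 second.
    return (long) std::max((time_t) 1,
                           (time_t) (lastWokenUp + pollInterval - before));
}
```

For instance, with an assumed poll interval of 5 seconds and ’lastWokenUp’ starting at 0, a call at time 100 returns 5 and sets ’lastWokenUp’ to 100; a call at time 103 returns 2; a call at time 110 returns the minimum of 1. Since the message is printed only when ’lastWokenUp’ is 0, this would be consistent with «waiting for locks or build slots...» appearing only once in the log, but that is a guess.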
Bah, it requires more investigation, and I agree with Maxim that
efbf5fdd01817ea75de369e3dd2761a85f8f7dd5 is probably not the issue
here.
Cheers,
simon