offloading should fall back to local build after n tries

Open. Submitted by ng0.
Participants: Ludovic Courtès, ng0
Owner: unassigned
Severity: normal
(address . bug-guix@gnu.org)
8760ppr3q3.fsf@we.make.ritual.n0.is
When I forgot that my build machine was offline and did not pass --no-build-hook, offloading kept trying forever until I had to cancel the build, boot the build machine, and start the build again.
A solution could be a config option or default behavior which, after failing to offload n times, gives up and uses the local builder.
Is this desired at all? Setups like hydra could get problems, but for small setups with the same architecture there could be a solution beyond --no-build-hook?
-- ng0
Ludovic Courtès wrote on 26 Sep 2016 11:20
(name . ng0)(address . ngillmann@runbox.com)(address . 24496@debbugs.gnu.org)
87r387nhjg.fsf@gnu.org
Hello!
ng0 <ngillmann@runbox.com> writes:
> When I forgot that my build machine is offline and I did not pass
> --no-build-hook, the offloading keeps trying forever until I had to
> cancel the build, boot the build-machine and started the build again.
>
> A solution could be a config option or default behavior which after
> failing to offload for n times gives up and uses the local builder.
>
> Is this desired at all? Setups like hydra could get problems, but for
> small setups with the same architecture there could be a solution beyond
> --no-build-hook?
Like you say, on a Hydra-style setup this could be a problem: the front-end machine may have --max-jobs=0, meaning that it cannot perform builds on its own.
So I guess we would need a command-line option to select a different behavior. I’m not sure how to do that because ‘guix offload’ is “hidden” behind ‘guix-daemon’, so there’s no obvious place for such an option.
In the meantime, you could also hack up your machines.scm: it would return a list where unreachable machines have been filtered out.
Ludo’.
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 24496@debbugs.gnu.org)
87vax8nis5.fsf@we.make.ritual.n0.is
Ludovic Courtès <ludo@gnu.org> writes:
> Hello!
>
> ng0 <ngillmann@runbox.com> writes:
>
>> When I forgot that my build machine is offline and I did not pass
>> --no-build-hook, the offloading keeps trying forever until I had to
>> cancel the build, boot the build-machine and started the build again.
>>
>> A solution could be a config option or default behavior which after
>> failing to offload for n times gives up and uses the local builder.
>>
>> Is this desired at all? Setups like hydra could get problems, but for
>> small setups with the same architecture there could be a solution beyond
>> --no-build-hook?
>
> Like you say, on Hydra-style setup this could be a problem: the
> front-end machine may have --max-jobs=0, meaning that it cannot perform
> builds on its own.
>
> So I guess we would need a command-line option to select a different
> behavior. I’m not sure how to do that because ‘guix offload’ is
> “hidden” behind ‘guix-daemon’, so there’s no obvious place for such an
> option.
Could the daemon run with either --enable-hydra-style or --disable-hydra-style, where --disable-hydra-style would allow falling back to a local build if the machine did not reply within a defined time (keeping slow connections in mind)?
> In the meantime, you could also hack up your machines.scm: it would
> return a list where unreachable machines have been filtered out.
How can I achieve this?
And to append to this bug: it seems to me that offloading requires one lsh key for each build machine (https://lists.gnu.org/archive/html/help-guix/2016-10/msg00007.html) and that you cannot address the machines directly. Say I want to create some system where I want to build on machine 1 AND machine 2. Having two x86_64 machines in machines.scm only selects one of them (if two were working, see linked thread) and builds on the one which is accessible first. If, however, the first machine is somehow blocked and fails, and therefore terminates the lsh connection, the build does not happen at all.
Leaving out the problems, what I want to do in short: how could I build on both systems at the same time when I desire to do so?
> Ludo’.
--
Ludovic Courtès wrote on 5 Oct 2016 13:36
(name . ng0)(address . ngillmann@runbox.com)(address . 24496@debbugs.gnu.org)
87a8ej81u3.fsf@gnu.org
ng0 <ngillmann@runbox.com> writes:
> Ludovic Courtès <ludo@gnu.org> writes:
[...]
>> Like you say, on Hydra-style setup this could be a problem: the
>> front-end machine may have --max-jobs=0, meaning that it cannot perform
>> builds on its own.
>>
>> So I guess we would need a command-line option to select a different
>> behavior. I’m not sure how to do that because ‘guix offload’ is
>> “hidden” behind ‘guix-daemon’, so there’s no obvious place for such an
>> option.
>
> Could the daemon run with --enable-hydra-style or --disable-hydra-style
> and --disable-hydra-style would allow falling back to local build if
> after a defined time - keeping slow connections in mind - the machine
> did not reply.
That would be too ad-hoc IMO, and the problem mentioned above remains.
>> In the meantime, you could also hack up your machines.scm: it would
>> return a list where unreachable machines have been filtered out.
>
> How can I achieve this?
Something like:
  (define the-machine (build-machine …))

  (if (managed-to-connect-timely the-machine)
      (list the-machine)
      '())
… where ‘managed-to-connect-timely’ would try to connect to the machine with a timeout.
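For illustration, such a check could be sketched as below. This is only a rough sketch under stated assumptions, not code from Guix: ‘host-reachable?’ is a hypothetical helper, "build.example.org", port 22 and the 5-second timeout are made-up values, and it only tests that a TCP connection can be opened, not that SSH authentication would succeed.

```scheme
;; Sketch only.  `host-reachable?' and "build.example.org" are hypothetical
;; names, not part of Guix.

(define (host-reachable? host port timeout)
  "Return #t if a TCP connection to HOST:PORT is established within
TIMEOUT seconds, #f otherwise."
  (catch #t
    (lambda ()
      (let* ((info (car (getaddrinfo host (number->string port))))
             (sock (socket (addrinfo:fam info) SOCK_STREAM 0)))
        (dynamic-wind
          (const #t)
          (lambda ()
            ;; Switch to non-blocking mode so that `connect' returns at once.
            (fcntl sock F_SETFL (logior O_NONBLOCK (fcntl sock F_GETFL)))
            (catch 'system-error
              (lambda ()
                (connect sock (addrinfo:addr info)))
              (lambda args
                ;; EINPROGRESS just means "still connecting"; anything
                ;; else is a real failure and is re-thrown.
                (unless (= EINPROGRESS (system-error-errno args))
                  (apply throw args))))
            ;; Wait until the socket becomes writable or TIMEOUT expires,
            ;; then ask the kernel whether the connection succeeded.
            (let ((ready (select '() (list sock) '() timeout)))
              (and (pair? (cadr ready))
                   (zero? (getsockopt sock SOL_SOCKET SO_ERROR)))))
          (lambda ()
            ;; Always release the socket, whatever happened above.
            (close-port sock)))))
    ;; Any error (name resolution, refused connection, ...) counts as
    ;; "unreachable".
    (lambda _ #f)))

;; Possible use in machines.scm, with the machine definition elided as in
;; the message above:
;;
;;   (if (host-reachable? "build.example.org" 22 5)
;;       (list the-machine)
;;       '())
```

Note that the check would run each time machines.scm is evaluated, so the timeout should probably stay small.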
> And to append to this bug: it seems to me that offloading requires 1
> lsh-key for each
> build-machine.
The main machine needs to be able to connect to each build machine over SSH, so indeed, that requires proper SSH key registration (host keys and authorized user keys).
> (https://lists.gnu.org/archive/html/help-guix/2016-10/msg00007.html)
> and that you can not directly address them (say I want to create some
> system where I want to build on machine 1 AND machine 2. Having 2
> x86_64 in machines.scm only selects one of them (if 2 were working,
> see linked thread) and builds on the one which is accessible first. If
> however the first machine is somehow blocked and it fails, therefore
> terminates lsh connection, the build does not happen at all.
The code that selects machines is in (guix scripts offload), specifically ‘choose-build-machine’. It tries to choose the “best” machine, which means, roughly, the fastest and least loaded one.
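To make “fastest and least loaded” concrete, here is a toy sketch of that kind of selection. It is not the actual ‘choose-build-machine’ code: the ‘<candidate>’ record, its fields and the scoring formula (speed divided by one plus the current load) are all assumptions made up for illustration.

```scheme
;; Toy illustration only; not the real `choose-build-machine'.
;; The <candidate> record and the scoring formula are hypothetical.
(use-modules (srfi srfi-1)
             (srfi srfi-9))

(define-record-type <candidate>
  (make-candidate name speed busy)
  candidate?
  (name  candidate-name)    ;host name, a string
  (speed candidate-speed)   ;relative speed factor, higher is faster
  (busy  candidate-busy))   ;current load, lower is better

(define (candidate-score c)
  ;; A fast, idle machine gets a high score; load divides it down.
  (/ (candidate-speed c) (+ 1.0 (candidate-busy c))))

(define (choose-candidate candidates)
  "Return the highest-scoring element of CANDIDATES, or #f if the list
is empty."
  (reduce (lambda (c best)
            (if (> (candidate-score c) (candidate-score best))
                c
                best))
          #f
          candidates))

;; Example: the slower but idle machine wins over the fast, busy one.
;;
;;   (candidate-name
;;    (choose-candidate (list (make-candidate "fast-but-busy" 4.0 7.0)
;;                            (make-candidate "slow-but-idle" 1.0 0.0))))
```

With a scheme like this only one machine is picked per build, which matches the behavior described in the quoted message.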
HTH,
Ludo’.