failed builds due to exceeding max-silent-time not marked as failed in db

  • Done
Details
3 participants
  • Leo Famulari
  • Ludovic Courtès
  • Florian Paul Schmidt
Owner
unassigned
Submitted by
Florian Paul Schmidt
Severity
normal
Florian Paul Schmidt wrote on 2 Dec 2015 23:03
(address . bug-guix@gnu.org)
565F6A9B.9050406@gmx.net
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256


Hi,

on my system, building the derivation for the package tbb (version
4.3.2) does not complete because it exceeds the max-silent-time default
value of 3600 seconds (one hour).

It seems that in this case the path is not marked as failed in the
sqlite3 db

/var/guix/db/db.sqlite

in the table FailedPaths. This is quite annoying, since several
packages seem to depend on it, so the derivation gets built several
times (each attempt taking over an hour to fail).

The guix daemon is running with the --cache-failures option and I
would expect the second run of

for n in `guix package -A | cut -f1`; do guix build --no-substitutes "$n" || true; done

to be mostly a no-op, since all failures from the first run should be
cached. Even in the first run, I wouldn't expect a dependency that has
already failed to be built again. Instead, on this box even the second
run takes about half a day to complete ;)

Flo

P.S.: FYI: The thing that takes over an hour to run is

./test_atomic.exe


- --
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBCAAGBQJWX2qaAAoJEA5f4Coltk8ZnasH/jOg+E0Y/CDxw5SGgcJN0Q6K
TYo41AVz0u9tLJEVYW4ZW9Z7A3UL5OTB+03LwC1zT7iDtFzU6a7BzaW2N3gP+GGi
Tx+Rq0z7ZIHEF1t71YFtPOAIpuyxwl1yMnRo0kd8BVsrNu843ITI4w+kzGV4tcP1
l9uDf7c+WQ8MFhoMDUqjW5ufIb3zy6yKk1GDXw14xZ8laeiE8hrXFE2LFV4WCxzP
VMPDgHBlPF6pAKLYpWSpL2RtL/WxO9tYIYpQ16EW7GjOouCy2ObT+1CJ75kSIOie
DZ/RLUSxa39amDFwii5liR+ETgvz3FCoBAcyI5AP/76uMToub1z3S1PNt58EnsE=
=Hivd
-----END PGP SIGNATURE-----
Florian Paul Schmidt wrote on 4 Dec 2015 23:40
(address . bug-guix@gnu.org)
56621663.4080007@gmx.net
Attached is a first stab at fixing this. There are additional options to
guix-daemon now:

--cache-failures          cache build failures
--cache-hook-failures     cache build failures due to hook failures
                          (depends on cache-failures)
--cache-timeout-failures  cache build failures due to timeouts
                          (depends on cache-failures)

The patch compiles but is still untested, since the system I need it on
has gone away for the time being.

Flo

From 3e376f7d22a62c19491d830c34182f2f4828f0a3 Mon Sep 17 00:00:00 2001
From: Florian Paul Schmidt <mista.tapas@gmx.net>
Date: Fri, 4 Dec 2015 23:37:13 +0100
Subject: [PATCH] guix-daemon: cache more failures if requested

---
nix/libstore/build.cc | 8 ++++++++
nix/libstore/globals.cc | 4 ++++
nix/libstore/globals.hh | 6 ++++++
nix/nix-daemon/guix-daemon.cc | 12 ++++++++++++
4 files changed, 30 insertions(+)

diff --git a/nix/libstore/build.cc b/nix/libstore/build.cc
index efe1ab2..48936f9 100644
--- a/nix/libstore/build.cc
+++ b/nix/libstore/build.cc
@@ -1483,12 +1483,20 @@ void DerivationGoal::buildDone()
if (settings.printBuildTrace)
printMsg(lvlError, format("@ build-failed %1% - timeout") % drvPath);
worker.timedOut = true;
+
+ if (settings.cacheFailure && settings.cacheTimeoutFailure)
+ foreach (DerivationOutputs::iterator, i, drv.outputs)
+ worker.store.registerFailedPath(i->second.path);
}
else if (hook && (!WIFEXITED(status) || WEXITSTATUS(status) != 100)) {
if (settings.printBuildTrace)
printMsg(lvlError, format("@ hook-failed %1% - %2% %3%")
% drvPath % status % e.msg());
+
+ if (settings.cacheFailure && settings.cacheHookFailure)
+ foreach (DerivationOutputs::iterator, i, drv.outputs)
+ worker.store.registerFailedPath(i->second.path);
}
else {
diff --git a/nix/libstore/globals.cc b/nix/libstore/globals.cc
index 07f23d4..7829c1c 100644
--- a/nix/libstore/globals.cc
+++ b/nix/libstore/globals.cc
@@ -48,6 +48,8 @@ Settings::Settings()
compressLog = true;
maxLogSize = 0;
cacheFailure = false;
+ cacheTimeoutFailure = false;
+ cacheHookFailure = false;
pollInterval = 5;
checkRootReachability = false;
gcKeepOutputs = false;
@@ -158,6 +160,8 @@ void Settings::update()
_get(compressLog, "build-compress-log");
_get(maxLogSize, "build-max-log-size");
_get(cacheFailure, "build-cache-failure");
+ _get(cacheTimeoutFailure, "build-cache-timeout-failure");
+ _get(cacheHookFailure, "build-cache-hook-failure");
_get(pollInterval, "build-poll-interval");
_get(checkRootReachability, "gc-check-reachability");
_get(gcKeepOutputs, "gc-keep-outputs");
diff --git a/nix/libstore/globals.hh b/nix/libstore/globals.hh
index c17e10d..bf8666a 100644
--- a/nix/libstore/globals.hh
+++ b/nix/libstore/globals.hh
@@ -170,6 +170,12 @@ struct Settings {
/* Whether to cache build failures. */
bool cacheFailure;
+ /* Whether to cache timeout failures */
+ bool cacheTimeoutFailure;
+
+ /* Whether to cache hook failures */
+ bool cacheHookFailure;
+
/* How often (in seconds) to poll for locks. */
unsigned int pollInterval;
diff --git a/nix/nix-daemon/guix-daemon.cc b/nix/nix-daemon/guix-daemon.cc
index 1934487..f613de9 100644
--- a/nix/nix-daemon/guix-daemon.cc
+++ b/nix/nix-daemon/guix-daemon.cc
@@ -80,6 +80,8 @@ builds derivations on behalf of its clients.");
#define GUIX_OPT_NO_BUILD_HOOK 14
#define GUIX_OPT_GC_KEEP_OUTPUTS 15
#define GUIX_OPT_GC_KEEP_DERIVATIONS 16
+#define GUIX_OPT_CACHE_TIMEOUT_FAILURES 17
+#define GUIX_OPT_CACHE_HOOK_FAILURES 18
static const struct argp_option options[] =
{
@@ -104,6 +106,10 @@ static const struct argp_option options[] =
n_("do not use the 'build hook'") },
{ "cache-failures", GUIX_OPT_CACHE_FAILURES, 0, 0,
n_("cache build failures") },
+ { "cache-timeout-failures", GUIX_OPT_CACHE_TIMEOUT_FAILURES, 0, 0,
+ n_("cache build failures due to timeouts (depends on cache-failures)") },
+ { "cache-hook-failures", GUIX_OPT_CACHE_HOOK_FAILURES, 0, 0,
+ n_("cache build failures due to hook failures (depends on cache-failures)") },
{ "lose-logs", GUIX_OPT_LOSE_LOGS, 0, 0,
n_("do not keep build logs") },
{ "disable-log-compression", GUIX_OPT_DISABLE_LOG_COMPRESSION, 0, 0,
@@ -189,6 +195,12 @@ parse_opt (int key, char *arg, struct argp_state *state)
case GUIX_OPT_CACHE_FAILURES:
settings.cacheFailure = true;
break;
+ case GUIX_OPT_CACHE_TIMEOUT_FAILURES:
+ settings.cacheTimeoutFailure = true;
+ break;
+ case GUIX_OPT_CACHE_HOOK_FAILURES:
+ settings.cacheHookFailure = true;
+ break;
case GUIX_OPT_IMPERSONATE_LINUX_26:
settings.impersonateLinux26 = true;
break;
--
2.5.0
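
The flag combination the patch introduces can be summarized as a small
decision function. The following is a sketch under stated assumptions, not
part of the patch: the three boolean fields mirror the names added to
Settings, while CacheSettings, FailureKind, shouldCacheFailure and the
in-memory FailedPaths stand-in are hypothetical. That an ordinary build
failure is cached whenever cacheFailure is set is also an assumption.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Field names follow the patch; the struct itself is a hypothetical stand-in
// for the daemon's Settings.
struct CacheSettings {
    bool cacheFailure = false;         // --cache-failures
    bool cacheTimeoutFailure = false;  // --cache-timeout-failures
    bool cacheHookFailure = false;     // --cache-hook-failures
};

enum class FailureKind { Build, Timeout, Hook };

// Decide whether a failure of the given kind should be recorded.
inline bool shouldCacheFailure(const CacheSettings& s, FailureKind kind) {
    if (!s.cacheFailure) return false;  // master switch: the others depend on it
    switch (kind) {
        case FailureKind::Timeout: return s.cacheTimeoutFailure;
        case FailureKind::Hook:    return s.cacheHookFailure;
        case FailureKind::Build:   return true;  // assumption, see lead-in
    }
    return false;
}

// Hypothetical in-memory stand-in for the FailedPaths table.
struct FailedPaths {
    std::set<std::string> paths;

    // Record every output path of a failed derivation.
    void registerAll(const std::vector<std::string>& outputs) {
        paths.insert(outputs.begin(), outputs.end());
    }

    bool contains(const std::string& path) const {
        return paths.count(path) != 0;
    }
};
```

In this model, setting only cacheTimeoutFailure has no effect; a timeout is
recorded only when cacheFailure is set as well, which matches the "depends on
cache-failures" wording of the option descriptions.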
Leo Famulari wrote on 9 Dec 2015 20:57
(name . Florian Paul Schmidt)(address . mista.tapas@gmx.net)(address . 22078@debbugs.gnu.org)
20151209195720.GA18503@jasmine
Attachment: file
Ludovic Courtès wrote on 14 Dec 2015 00:11
(name . Florian Paul Schmidt)(address . mista.tapas@gmx.net)(address . 22078@debbugs.gnu.org)
87a8peuep0.fsf@gnu.org
Florian Paul Schmidt <mista.tapas@gmx.net> skribis:

> Attached is a first stab at fixing this. There are additional options
> to guix-daemon now:
>
> --cache-failures          cache build failures
> --cache-hook-failures     cache build failures due to hook failures
>                           (depends on cache-failures)
> --cache-timeout-failures  cache build failures due to timeouts
>                           (depends on cache-failures)

OK. I’m unsure whether it makes sense to cache failures due to timeout
because, by definition, they’re non-deterministic.

Another problem is that clients can choose what the timeout is (both
max-silent-time and absolute max-time), so it’d be easy for a client to
force a timeout failure; on a multi-user system, that would amount to a
DoS attack.
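
The DoS concern above can be made concrete with a toy model. This is only a
sketch under simplifying assumptions: SharedDaemon, Result and build are
hypothetical names, a build is reduced to "time needed" versus "timeout
chosen by the client", and a timeout of 0 is assumed to mean "no limit".

```cpp
#include <cassert>
#include <set>
#include <string>

enum class Result { Built, TimedOut, CachedFailure };

// Hypothetical daemon shared by all users that caches timeout failures.
struct SharedDaemon {
    std::set<std::string> failedPaths;  // stand-in for the FailedPaths table

    // `needed` is how long the build really takes (seconds);
    // `timeout` is the limit chosen by the requesting client.
    Result build(const std::string& path, long needed, long timeout) {
        assert(needed >= 0 && timeout >= 0);
        if (failedPaths.count(path) != 0)
            return Result::CachedFailure;  // earlier failure is replayed
        if (timeout != 0 && needed > timeout) {
            failedPaths.insert(path);      // timeout failure gets cached
            return Result::TimedOut;
        }
        return Result::Built;
    }
};
```

In this model, one client requesting a 600-second build with a 1-second
timeout causes a later request with a 3600-second timeout to receive the
cached failure, even though that build would have succeeded on its own.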

I’m not sure how to address these issues, so I’m rather in favor of the
status quo.

WDYT?

Thanks,
Ludo’.
Florian Paul Schmidt wrote on 14 Dec 2015 09:39
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 22078@debbugs.gnu.org)
566E8026.6070903@gmx.net
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 14.12.2015 00:11, Ludovic Courtès wrote:

> OK. I’m unsure whether it makes sense to cache failures due to
> timeout because, by definition, they’re non-deterministic.

Except for cases where they are deterministic (consider a buggy
package that has a test case that reduces to while (true) { } that is
not optimized away). They very seldom are, though. Anyway: I'm not
proposing to make any of this the default.

> Another problem is that clients can choose what the timeout is
> (both max-silent-time and absolute max-time), so it’d be easy for a
> client to force a timeout failure; on a multi-user system, that
> would amount to a DoS attack.

You mean a user just builds all packages with a timeout that's
impossible to fulfill? And consequently all their failures will be
cached and if then another user tries to build them they just get the
cached failure? That points out another (though more contrived) flaw
indeed:

Even without caching failures a package might be nondeterministic for
some reason (bugs always happen). A user who knows how to trigger the
failure (assuming it's depending on something under the user's
control) then could DOS that particular build.

In general it would probably be good to have a way of resetting the
cached failures in the db. Maybe --check does almost this: If a failed
derivation gets built again with --check will the subsequent success
overwrite the failed one and remove the entry from the FailedPaths
table? Or will --check just happily report that the build is
nondeterministic?

>
> I’m not sure how to address these issues, so I’m rather in favor of
> the status quo.

I found that the changes I made don't seem to work correctly anyways.
So LNGTMUAC (let's not get that merged under any circumstances).

Flo

- --
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBCAAGBQJWboAmAAoJEA5f4Coltk8Zhe4H/2B8jFpvjyTCn87eRoPCVjYV
3bHgjl/WXByrei93l65q+TY+IxFxA66p1Q9GV/cBoj7k/gkFxylUamaqw0wbQEbm
yohD0G7YnKpywXCLp1pwJeFeBUGmAe/F0Fw4G45OGcAeIQ7AbDZRHmYq4KRe9x1q
i0n96plAirsy5zvBY88bdZU8Fbc4c8pm1Mw2e8B9i3EEWjwcXh8UWeuerTKHhMK4
KNtxgX+Wnx05ZmzOnM3yJKOM8qgujW4peYhVJl3SRAMv/5kLFVCOOUC3XbsijdMM
8ny68tXgE5pNtfHsGko/rqwBT/LQ0C94zO+ggkitd51sgLFKXMRdt+j9pmDLfS0=
=ZFzw
-----END PGP SIGNATURE-----
Ludovic Courtès wrote on 14 Dec 2015 17:40
(name . Florian Paul Schmidt)(address . mista.tapas@gmx.net)(address . 22078-done@debbugs.gnu.org)
87mvtdx9u2.fsf@gnu.org
Florian Paul Schmidt <mista.tapas@gmx.net> skribis:

> On 14.12.2015 00:11, Ludovic Courtès wrote:
>
>> OK. I’m unsure whether it makes sense to cache failures due to
>> timeout because, by definition, they’re non-deterministic.
>
> Except for cases where they are deterministic (consider a buggy
> package that has a test case that reduces to while (true) { } that is
> not optimized away). They very seldom are, though. Anyway: I'm not
> proposing to make any of this the default.

Yes.

>> Another problem is that clients can choose what the timeout is
>> (both max-silent-time and absolute max-time), so it’d be easy for a
>> client to force a timeout failure; on a multi-user system, that
>> would amount to a DoS attack.
>
> You mean a user just builds all packages with a timeout that's
> impossible to fulfill? And consequently all their failures will be
> cached and if then another user tries to build them they just get the
> cached failure?

Right.

> That points out another (though more contrived) flaw indeed:
>
> Even without caching failures a package might be nondeterministic for
> some reason (bugs always happen). A user who knows how to trigger the
> failure (assuming it's depending on something under the user's
> control) then could DOS that particular build.

That’s very unlikely because builds are performed under a separate UID,
in a container.

> In general it would probably be good to have a way of resetting the
> cached failures in the db.

One can do:

guix gc --clear-failures $(guix gc --list-failures)

> Maybe --check does almost this: If a failed derivation gets built
> again with --check will the subsequent success overwrite the failed
> one and remove the entry from the FailedPaths table? Or will --check
> just happily report that the build is nondeterministic?

Good question. I guess --check would just do nothing, but I haven’t
checked.

>> I’m not sure how to address these issues, so I’m rather in favor of
>> the status quo.
>
> I found that the changes I made don't seem to work correctly anyways.
> So LNGTMUAC (let's not get that merged under any circumstances).

Heh, OK. :-)

In general, I expect there should be very few packages that get stuck
forever (like Chicken currently), and it’s obviously a bug to fix. So I
guess we can simply live with the possibility that occasionally your
machine will try to build Chicken and fail again. ;-) You can
always choose a smaller timeout anyway.

Thanks,
Ludo’.
Closed