(guix status) occasionally crashes when max-jobs > 1

  • Open
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Maxim Cournoyer
Owner
unassigned
Submitted by
Maxim Cournoyer
Severity
important
M
M
Maxim Cournoyer wrote on 19 Sep 2020 19:36
Guix substitute crash in procedure raise-exception: wrong type agument in position 1: #f
(name . bug-guix)(address . bug-guix@gnu.org)
87imc9y8hq.fsf@gmail.com
Seems like a second exception raised in the process of raising an
exception :-).

The original crash reason is unkown, the network was operating normally.



qemu-5.0.0 1.6MiB/s 00:26 | 41.8MiB transferredBacktrace:
In ice-9/boot-9.scm:
1731:15 19 (with-exception-handler #<procedure 7f485337bd50 at ice-9/boot-9.scm:1815:7 (exn)> _ # _ # …)
In guix/scripts/package.scm:
966:10 18 (_)
In guix/status.scm:
776:4 17 (call-with-status-report _ _)
In guix/store.scm:
1300:8 16 (call-with-build-handler _ _)
1300:8 15 (call-with-build-handler #<procedure 7f485337b000 at guix/ui.scm:1166:2 (continue store …> …)
In guix/scripts/package.scm:
889:2 14 (_)
In guix/build/syscalls.scm:
1167:4 13 (call-with-file-lock/no-wait _ _ _)
In guix/scripts/package.scm:
143:19 12 (build-and-use-profile #<store-connection 256.99 7f4854803190> "/var/guix/profiles/per-u…" …)
In guix/store.scm:
2042:24 11 (run-with-store #<store-connection 256.99 7f4854803190> _ #:guile-for-build _ #:system _ # …)
In guix/profiles.scm:
1586:2 10 (_ _)
348:2 9 (_ _)
In guix/store.scm:
1924:12 8 (_ #<store-connection 256.99 7f4854803190>)
1357:5 7 (map/accumulate-builds #<store-connection 256.99 7f4854803190> _ _)
1368:15 6 (_ #<store-connection 256.99 7f4854803190> _ _)
740:13 5 (process-stderr _ _)
In unknown file:
4 (display "@ substituter-succeeded /gnu/store/277yzwcx6fshsxlqnwvyv94q66miv3ii-diffoscope…" …)
In guix/status.scm:
699:16 3 (write! _ _ _)
613:6 2 (_ (download-progress "/gnu/store/954b3kw59mdvanp33yx93q5pzzll4wg9-qemu-5.0.0" "http:…" …) …)
In guix/progress.scm:
213:14 1 (display-download-progress "qem@" _ #:start-time _ #:transferred _ #:log-port _)
In ice-9/boot-9.scm:
1669:16 0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1669:16: In procedure raise-exception:
In procedure =: Wrong type argument in position 1: #f

Maxim
M
M
Maxim Cournoyer wrote on 29 Sep 2020 19:00
(address . 43518@debbugs.gnu.org)
873630plg1.fsf@gmail.com
Hello,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

[...]

Toggle quote (40 lines)
> qemu-5.0.0 1.6MiB/s 00:26 | 41.8MiB transferredBacktrace:
> In ice-9/boot-9.scm:
> 1731:15 19 (with-exception-handler #<procedure 7f485337bd50 at ice-9/boot-9.scm:1815:7 (exn)> _ # _ # …)
> In guix/scripts/package.scm:
> 966:10 18 (_)
> In guix/status.scm:
> 776:4 17 (call-with-status-report _ _)
> In guix/store.scm:
> 1300:8 16 (call-with-build-handler _ _)
> 1300:8 15 (call-with-build-handler #<procedure 7f485337b000 at guix/ui.scm:1166:2 (continue store …> …)
> In guix/scripts/package.scm:
> 889:2 14 (_)
> In guix/build/syscalls.scm:
> 1167:4 13 (call-with-file-lock/no-wait _ _ _)
> In guix/scripts/package.scm:
> 143:19 12 (build-and-use-profile #<store-connection 256.99 7f4854803190> "/var/guix/profiles/per-u…" …)
> In guix/store.scm:
> 2042:24 11 (run-with-store #<store-connection 256.99 7f4854803190> _ #:guile-for-build _ #:system _ # …)
> In guix/profiles.scm:
> 1586:2 10 (_ _)
> 348:2 9 (_ _)
> In guix/store.scm:
> 1924:12 8 (_ #<store-connection 256.99 7f4854803190>)
> 1357:5 7 (map/accumulate-builds #<store-connection 256.99 7f4854803190> _ _)
> 1368:15 6 (_ #<store-connection 256.99 7f4854803190> _ _)
> 740:13 5 (process-stderr _ _)
> In unknown file:
> 4 (display "@ substituter-succeeded /gnu/store/277yzwcx6fshsxlqnwvyv94q66miv3ii-diffoscope…" …)
> In guix/status.scm:
> 699:16 3 (write! _ _ _)
> 613:6 2 (_ (download-progress "/gnu/store/954b3kw59mdvanp33yx93q5pzzll4wg9-qemu-5.0.0" "http:…" …) …)
> In guix/progress.scm:
> 213:14 1 (display-download-progress "qem@" _ #:start-time _ #:transferred _ #:log-port _)
> In ice-9/boot-9.scm:
> 1669:16 0 (raise-exception _ #:continuable? _)
>
> ice-9/boot-9.scm:1669:16: In procedure raise-exception:
> In procedure =: Wrong type argument in position 1: #f

Here's a second instance of that problem, that I could get just now,
from Guix commit d48b17adb91d68acf6fb3f321c05102fcc8c39eb:

Toggle snippet (106 lines)
guix environment --system=armhf-linux --ad-hoc gcc-toolchain
substitute: updating substitutes from 'http://127.0.0.1:8080'... 100.0%
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
0.4 MB will be downloaded
downloading from https://ci.guix.gnu.org/nar/lzip/jf2x64p5z1ykg4px2plh0v5a7r20pc50-glibc-utf8-locales-2.31 ...
glibc-utf8-locales-2.31 376KiB 683KiB/s 00:01 [##################] 100.0%

substitute: updating substitutes from 'http://127.0.0.1:8080'... 100.0%
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
The following derivations will be built:
/gnu/store/2iswjr97v4kfpb8q0jwzjwwjm9rvcq1q-profile.drv
/gnu/store/anx317lak54y5lg3hw8i8fn5hy485ayx-gcc-toolchain-10.2.0.drv
/gnu/store/6vqbb2jks246cjbqclxaq6x1j4zcvvg1-gcc-10.2.0.drv
/gnu/store/ih15jlds75wh246dnb6s9sixqz93mqrp-module-import-compiled.drv

10 items will be downloaded
building CA certificate bundle...
building fonts directory...
building directory of Info manuals...
building database for manual pages...
substitute: updating substitutes from 'http://127.0.0.1:8080'... 100.0%
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
process 18399 acquired build slot '/var/guix/offload/127.0.0.1:6666/1'
load on machine '127.0.0.1' is 0.78 (normalized: 0.39)
building /gnu/store/ih15jlds75wh246dnb6s9sixqz93mqrp-module-import-compiled.drv...
The following builds are still in progress:
/gnu/store/dzchr45vhqwa538skm6h5mjqb73ly01i-manual-database.drv
/gnu/store/ih15jlds75wh246dnb6s9sixqz93mqrp-module-import-compiled.drv
/gnu/store/s12hk0x8khglmqky7035a904ryjf4hbd-info-dir.drv
/gnu/store/zafx1kh4jnycqxhy2kfm6jajq3dgp3cm-fonts-dir.drv

The following builds are still in progress:
/gnu/store/dzchr45vhqwa538skm6h5mjqb73ly01i-manual-database.drv
/gnu/store/ih15jlds75wh246dnb6s9sixqz93mqrp-module-import-compiled.drv
/gnu/store/s12hk0x8khglmqky7035a904ryjf4hbd-info-dir.drv

The following builds are still in progress:
/gnu/store/dzchr45vhqwa538skm6h5mjqb73ly01i-manual-database.drv
/gnu/store/ih15jlds75wh246dnb6s9sixqz93mqrp-module-import-compiled.drv

The following build is still in progress:
/gnu/store/ih15jlds75wh246dnb6s9sixqz93mqrp-module-import-compiled.drv

|offloading build of /gnu/store/ih15jlds75wh246dnb6s9sixqz93mqrp-module-import-compiled.drv to '127.0.0.1'
downloading from https://ci.guix.gnu.org/nar/lzip/1nr37qqaqmcn3dbrjg5z7r0mfkaaf2aq-glibc-bootstrap-0 ...
downloading from https://ci.guix.gnu.org/nar/lzip/idr9bk61lar8i6fqpmkvpg8ph5wdbapr-isl-0.22.1 ...
downloading from https://ci.guix.gnu.org/nar/lzip/dnikscj9hksx4g4fs4qd2ynj1rzv6fz9-libelf-0.8.13 ...
downloading from https://ci.guix.gnu.org/nar/6m9zimmw8p6gbc1yfbg454c1r587b7h4-gcc-10.2.0.tar.xz ...
|

substitution of /gnu/store/dnikscj9hksx4g4fs4qd2ynj1rzv6fz9-libelf-0.8.13 complete
80% [############################################################################### ]

substitution of /gnu/store/idr9bk61lar8i6fqpmkvpg8ph5wdbapr-isl-0.22.1 complete
90% [######################################################################################### ]

downloading from https://ci.guix.gnu.org/nar/lzip/ws1p3jb5pjlfmsmklq2b6288ny9nwm55-libstdc%2B%2B-7.5.0 ...
substitution of /gnu/store/1nr37qqaqmcn3dbrjg5z7r0mfkaaf2aq-glibc-bootstrap-0 complete
100% [###################################################################################################]

downloading from https://ci.guix.gnu.org/nar/lzip/ldwybsllicl9h433aixm1cv28g70ifz2-libstdc%2B%2B-boot0-4.9.4 ...


4
substitution of /gnu/store/ws1p3jb5pjlfmsmklq2b6288ny9nwm55-libstdc++-7.5.0 complete
substitution of /gnu/store/ldwybsllicl9h433aixm1cv28g70ifz2-libstdc++-boot0-4.9.4 complete
gcc-10.2.0.tar.xz 74.3MiB 812KiB/s 00:25 [#### ] 27.2%

xz 77898736 21561440
gcc-10.2.0.tar.xz 74.3MiB 1.1MiB/s 00:25 [####### ] 39.1%Backtrace:
In ice-9/boot-9.scm:
1736:10 19 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
1731:15 18 (with-exception-handler #<procedure 7f5866dd6240 at ice-9/boot-9.scm:1815:7 (exn)> _ # _ # …)
1736:10 17 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In guix/store.scm:
631:37 16 (thunk)
1300:8 15 (call-with-build-handler _ _)
1300:8 14 (call-with-build-handler _ _)
1300:8 13 (call-with-build-handler #<procedure 7f5866e10b70 at guix/ui.scm:1166:2 (continue store …> …)
In guix/status.scm:
776:4 12 (call-with-status-report _ _)
In guix/scripts/environment.scm:
741:14 11 (_)
In guix/store.scm:
2042:24 10 (run-with-store #<store-connection 256.99 7f586aec4410> _ #:guile-for-build _ #:system _ # …)
In guix/scripts/environment.scm:
754:18 9 (_ _)
In guix/store.scm:
1911:38 8 (_ #<store-connection 256.99 7f5864a456e0>)
1368:15 7 (_ #<store-connection 256.99 7f5864a456e0> _ _)
740:13 6 (process-stderr _ _)
In unknown file:
5 (display "@ substituter-succeeded /gnu/store/r06j3ms57z4mzfpdzfclsi3i9hr4184g-module-imp…" …)
In guix/status.scm:
699:16 4 (write! _ _ _)
613:6 3 (_ (download-progress "/gnu/store/6m9zimmw8p6gbc1yfbg454c1r587b7h4-gcc-10.2.0.tar.xz" # …) …)
In guix/progress.scm:
213:14 2 (display-download-progress "6m9zimmw8p6gbc1yfbg454c1r58@" _ #:start-time _ #:transferred _ …)
In ice-9/boot-9.scm:
1669:16 1 (raise-exception _ #:continuable? _)
1669:16 0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1669:16: In procedure raise-exception:
In procedure =: Wrong type argument in position 1: #f

Maxim
L
L
Ludovic Courtès wrote on 30 Sep 2020 19:19
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 43518@debbugs.gnu.org)
87wo0bxjvd.fsf@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (1 lines)
[…]
Toggle quote (1 lines)
> gcc-10.2.0.tar.xz 74.3MiB 1.1MiB/s 00:25 [####### ] 39.1%Backtrace:
[…]
Toggle quote (14 lines)
> In unknown file:
> 5 (display "@ substituter-succeeded /gnu/store/r06j3ms57z4mzfpdzfclsi3i9hr4184g-module-imp…" …)
> In guix/status.scm:
> 699:16 4 (write! _ _ _)
> 613:6 3 (_ (download-progress "/gnu/store/6m9zimmw8p6gbc1yfbg454c1r587b7h4-gcc-10.2.0.tar.xz" # …) …)
> In guix/progress.scm:
> 213:14 2 (display-download-progress "6m9zimmw8p6gbc1yfbg454c1r58@" _ #:start-time _ #:transferred _ …)
> In ice-9/boot-9.scm:
> 1669:16 1 (raise-exception _ #:continuable? _)
> 1669:16 0 (raise-exception _ #:continuable? _)
>
> ice-9/boot-9.scm:1669:16: In procedure raise-exception:
> In procedure =: Wrong type argument in position 1: #f

If you have a case that’s reproducible, please take advantage of it and
add ‘pk’ calls in (guix progress) to see what happens. Here it seems
that ‘transferred’ (in ‘display-download-progress’) is #f.

The #f would come from this clause in (guix status):

('download-progress item uri
(= string->number size)
(= string->number transferred)) ;<- here

That in turn comes from ‘progress-reporter/trace’, called from (guix
scripts substitute).

It looks like this issue happens when max-jobs is (noticeably) greater
than 1, as if the client was receiving intermingled “@
download-progress” traces, and thus not getting the right ‘transferred’
value.

Could you try to reproduce it while ‘guix’ is running under strace?
Like:

strace -o log -s 200 guix build whatever --max-jobs=20

?

Then, if you reproduce the error, we’ll look for intermingled traces in
the log.

Thanks,
Ludo’.
M
M
Maxim Cournoyer wrote on 2 Oct 2020 19:50
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43518@debbugs.gnu.org)
87zh54ms9i.fsf@gmail.com
Hello!

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (35 lines)
> Hi,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> downloading from https://ci.guix.gnu.org/nar/6m9zimmw8p6gbc1yfbg454c1r587b7h4-gcc-10.2.0.tar.xz ...
> […]
>> gcc-10.2.0.tar.xz 74.3MiB 1.1MiB/s 00:25 [####### ] 39.1%Backtrace:
> […]
>> In unknown file:
>> 5 (display "@ substituter-succeeded /gnu/store/r06j3ms57z4mzfpdzfclsi3i9hr4184g-module-imp…" …)
>> In guix/status.scm:
>> 699:16 4 (write! _ _ _)
>> 613:6 3 (_ (download-progress "/gnu/store/6m9zimmw8p6gbc1yfbg454c1r587b7h4-gcc-10.2.0.tar.xz" # …) …)
>> In guix/progress.scm:
>> 213:14 2 (display-download-progress "6m9zimmw8p6gbc1yfbg454c1r58@" _ #:start-time _ #:transferred _ …)
>> In ice-9/boot-9.scm:
>> 1669:16 1 (raise-exception _ #:continuable? _)
>> 1669:16 0 (raise-exception _ #:continuable? _)
>>
>> ice-9/boot-9.scm:1669:16: In procedure raise-exception:
>> In procedure =: Wrong type argument in position 1: #f
>
> If you have a case that’s reproducible, please take advantage of it and
> add ‘pk’ calls in (guix progress) to see what happens. Here it seems
> that ‘transferred’ (in ‘display-download-progress’) is #f.
>
> The #f would come from this clause in (guix status):
>
> ('download-progress item uri
> (= string->number size)
> (= string->number transferred)) ;<- here
>
> That in turn comes from ‘progress-reporter/trace’, called from (guix
> scripts substitute).

It's a bit difficult for me to follow the calls here :-). I thought
it'd have to be in 'progress-reporter/file' because that's the only one
that ends up calling display-download-progress. The line 213 of (guix
progress) is:

(unless (zero? transferred)

and if transferred is #f that would indeed fail with the error message
shown in the backtrace, so it seems you are correct in your analysis.

Toggle quote (10 lines)
> It looks like this issue happens when max-jobs is (noticeably) greater
> than 1, as if the client was receiving intermingled “@
> download-progress” traces, and thus not getting the right ‘transferred’
> value.
>
> Could you try to reproduce it while ‘guix’ is running under strace?
> Like:
>
> strace -o log -s 200 guix build whatever --max-jobs=20

I've reconfigured my offload machine to support 20 jobs, and
instrumented my local checkouts with the following pk calls:

Toggle snippet (82 lines)
3 files changed, 18 insertions(+), 11 deletions(-)
guix/progress.scm | 7 ++++++-
guix/scripts/substitute.scm | 19 ++++++++++---------
guix/status.scm | 3 ++-

modified guix/progress.scm
@@ -194,6 +194,7 @@ throughput."
(time-difference (current-time (time-type start-time))
start-time)))
+ (pk 'display-download-progress 'file= file 'size= size 'transferred= transferred)
(if (and (number? size) (not (zero? size)))
(let* ((% (* 100.0 (/ transferred size)))
(throughput (/ transferred elapsed))
@@ -247,7 +248,10 @@ ABBREVIATION used to shorten FILE for display."
(report
(let ((rate-limited-render (rate-limited render %progress-interval)))
(lambda (value)
+ (pk 'called-render/transferred= transferred)
+ (pk 'file: file 'size: size)
(set! transferred value)
+ (pk 'transferred-set-to value)
(rate-limited-render))))
;; Don't miss the last report.
(stop render))))
@@ -307,6 +311,7 @@ progress reports, write \"build trace\" lines to be processed elsewhere."
log-port)))
(report (let ((report (rate-limited report-progress %progress-interval)))
(lambda (transferred)
+ (pk 'progress-reporter/trace 'total= total 'transferred= transferred)
(set! total transferred)
(report transferred))))
(stop (lambda ()
@@ -348,7 +353,7 @@ REPORTER, which should be a <progress-reporter> object."
((? eof-object?) 0)
(x x))))
(set! total (+ total n))
- (report total)
+ (report (pk 'progress-report-port:read!:total= total))
n))))
(start)
(make-custom-binary-input-port "progress-port-proc"
modified guix/scripts/substitute.scm
@@ -991,15 +991,16 @@ DESTINATION as a nar file. Verify the substitute against ACL."
(let* ((dl-size (or download-size
(and (equal? compression "none")
(narinfo-size narinfo))))
- (reporter (if print-build-trace?
- (progress-reporter/trace
- destination
- (uri->string uri) dl-size
- (current-error-port))
- (progress-reporter/file
- (uri->string uri) dl-size
- (current-error-port)
- #:abbreviation nar-uri-abbreviation))))
+ (reporter (pk 'reporter (if print-build-trace?
+ (progress-reporter/trace
+ destination
+ (uri->string uri) dl-size
+ (current-error-port))
+ (progress-reporter/file
+ (uri->string uri) dl-size
+ (current-error-port)
+ #:abbreviation nar-uri-abbreviation)))))
+ (error 'quit)
(progress-report-port reporter raw)))
((input pids)
;; NOTE: This 'progress' port of current process will be
modified guix/status.scm
@@ -532,7 +532,8 @@ addition to build events."
(= string->number transferred))
;; Print a progress bar, but only if there's only one on-going
;; job--otherwise the output would be intermingled.
- (when (= 1 (simultaneous-jobs status))
+ (pk 'print-build-event:download-progress 'transferred= transferred)
+ (when (= 1 (pk 'simultaneous-jobs= (simultaneous-jobs status)))
(match (find (matching-download item)
(build-status-downloading status))
(#f #f) ;shouldn't happen!

It seems that the only pk messages I see are the ones I put in (guix
status). Perhaps the other modules are not taken from the client but
from the daemon? I'm confused :-).

Toggle snippet (3 lines)
;;; (print-build-event:download-progress transferred= 4980832) ;;; (simultaneous-jobs= 3)

But so far haven't been able to reproduce the error. I'll keep
recording strace log when building things in the hope of catching the
bug!

To be continued...

Thanks for the help!

Maxim
L
L
Ludovic Courtès wrote on 3 Oct 2020 11:37
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 43518@debbugs.gnu.org)
87eemfhcpg.fsf@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (47 lines)
> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Hi,
>>
>> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>>
>>> downloading from https://ci.guix.gnu.org/nar/6m9zimmw8p6gbc1yfbg454c1r587b7h4-gcc-10.2.0.tar.xz ...
>> […]
>>> gcc-10.2.0.tar.xz 74.3MiB 1.1MiB/s 00:25 [####### ] 39.1%Backtrace:
>> […]
>>> In unknown file:
>>> 5 (display "@ substituter-succeeded /gnu/store/r06j3ms57z4mzfpdzfclsi3i9hr4184g-module-imp…" …)
>>> In guix/status.scm:
>>> 699:16 4 (write! _ _ _)
>>> 613:6 3 (_ (download-progress "/gnu/store/6m9zimmw8p6gbc1yfbg454c1r587b7h4-gcc-10.2.0.tar.xz" # …) …)
>>> In guix/progress.scm:
>>> 213:14 2 (display-download-progress "6m9zimmw8p6gbc1yfbg454c1r58@" _ #:start-time _ #:transferred _ …)
>>> In ice-9/boot-9.scm:
>>> 1669:16 1 (raise-exception _ #:continuable? _)
>>> 1669:16 0 (raise-exception _ #:continuable? _)
>>>
>>> ice-9/boot-9.scm:1669:16: In procedure raise-exception:
>>> In procedure =: Wrong type argument in position 1: #f
>>
>> If you have a case that’s reproducible, please take advantage of it and
>> add ‘pk’ calls in (guix progress) to see what happens. Here it seems
>> that ‘transferred’ (in ‘display-download-progress’) is #f.
>>
>> The #f would come from this clause in (guix status):
>>
>> ('download-progress item uri
>> (= string->number size)
>> (= string->number transferred)) ;<- here
>>
>> That in turn comes from ‘progress-reporter/trace’, called from (guix
>> scripts substitute).
>
> It's a bit difficult for me to follow the calls here :-). I thought
> it'd have to be in 'progress-reporter/file' because that's the only one
> that ends up calling display-download-progress. The line 213 of (guix
> progress) is:
>
> (unless (zero? transferred)
>
> and if transferred is #f that would indeed fail with the error message
> shown in the backtrace, so it seems you are correct in your analysis.

That’s correct.

The “@ download-progress” line is printed by (guix scripts substitute)
and later consumed by (guix status) in the client, which is why I
mentioned ‘progress-reporter/trace’ above.

I think the problem we’re looking at could occur if those traces are not
printed in an atomic way, and thus (guix status) gets to see
truncated/mixed up traces. So I tried this:

_NIX_OPTIONS=print-extended-build-trace=1 sudo -E \
./pre-inst-env strace -s 200 -o ,,s guix substitute \
--substitute /gnu/store/pknm43xsza6nlc7bn27djip8fis92akd-gcc-toolchain-10.2.0 /tmp/t.drv

It shows that traces are printed in a single write(2) call:

Toggle snippet (3 lines)
write(2, "@ download-progress /tmp/t.drv http://ci.guix.gnu.org/nar/lzip/pknm43xsza6nlc7bn27djip8fis92akd-gcc-toolchain-10.2.0 4843 4843\n", 127) = 127

So this side of things seems to be good. But then traces could be
mangled/truncated by the daemon maybe. An strace log of the failing
case would be very helpful.

Thanks,
Ludo’.
M
M
Maxim Cournoyer wrote on 4 Oct 2020 05:19
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43518@debbugs.gnu.org)
87v9fqlltn.fsf@gmail.com
Hello!

Ludovic Courtès <ludo@gnu.org> writes:

[...]

Toggle quote (20 lines)
> The “@ download-progress” line is printed by (guix scripts substitute)
> and later consumed by (guix status) in the client, which is why I
> mentioned ‘progress-reporter/trace’ above.
>
> I think the problem we’re looking at could occur if those traces are not
> printed in an atomic way, and thus (guix status) gets to see
> truncated/mixed up traces. So I tried this:
>
> _NIX_OPTIONS=print-extended-build-trace=1 sudo -E \
> ./pre-inst-env strace -s 200 -o ,,s guix substitute \
> --substitute /gnu/store/pknm43xsza6nlc7bn27djip8fis92akd-gcc-toolchain-10.2.0 /tmp/t.drv
>
> It shows that traces are printed in a single write(2) call:
>
> write(2, "@ download-progress /tmp/t.drv http://ci.guix.gnu.org/nar/lzip/pknm43xsza6nlc7bn27djip8fis92akd-gcc-toolchain-10.2.0 4843 4843\n", 127) = 127
>
> So this side of things seems to be good. But then traces could be
> mangled/truncated by the daemon maybe. An strace log of the failing
> case would be very helpful.

Not sure this matters or not, but thought I'd add the information here
in case: the ntpd service was stopped for unkown reasons on my local
machine, leading to 'guix offload status' to output the following
warning:

guix offload: warning: machine '127.0.0.1' is 106 seconds behind

I've since restarted the ntpd service and that warning disappeared.

Maxim
M
M
Maxim Cournoyer wrote on 4 Oct 2020 13:53
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43518@debbugs.gnu.org)
87r1qeky0k.fsf@gmail.com
Hello again!

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (26 lines)
> Hello!
>
> Ludovic Courtès <ludo@gnu.org> writes:
>
> [...]
>
>> The “@ download-progress” line is printed by (guix scripts substitute)
>> and later consumed by (guix status) in the client, which is why I
>> mentioned ‘progress-reporter/trace’ above.
>>
>> I think the problem we’re looking at could occur if those traces are not
>> printed in an atomic way, and thus (guix status) gets to see
>> truncated/mixed up traces. So I tried this:
>>
>> _NIX_OPTIONS=print-extended-build-trace=1 sudo -E \
>> ./pre-inst-env strace -s 200 -o ,,s guix substitute \
>> --substitute /gnu/store/pknm43xsza6nlc7bn27djip8fis92akd-gcc-toolchain-10.2.0 /tmp/t.drv
>>
>> It shows that traces are printed in a single write(2) call:
>>
>> write(2, "@ download-progress /tmp/t.drv http://ci.guix.gnu.org/nar/lzip/pknm43xsza6nlc7bn27djip8fis92akd-gcc-toolchain-10.2.0 4843 4843\n", 127) = 127
>>
>> So this side of things seems to be good. But then traces could be
>> mangled/truncated by the daemon maybe. An strace log of the failing
>> case would be very helpful.

[...]

I managed to capture two instances of 'transferred= #f' from my pk
output in the attached logs. Curiously, they didn't lead to the crash.
Attached is a pruned version of the strace log of a command like
'./pre-inst-env guix package -m my-manifest --max-jobs=20'. Offloading
was in use.
Attachment: log.pruned.lz
Thanks,

Maxim
M
M
Maxim Cournoyer wrote on 1 Dec 2020 18:23
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43518@debbugs.gnu.org)
87im9ll9ta.fsf@gmail.com
Hello Ludovic,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (38 lines)
> Hello again!
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Hello!
>>
>> Ludovic Courtès <ludo@gnu.org> writes:
>>
>> [...]
>>
>>> The “@ download-progress” line is printed by (guix scripts substitute)
>>> and later consumed by (guix status) in the client, which is why I
>>> mentioned ‘progress-reporter/trace’ above.
>>>
>>> I think the problem we’re looking at could occur if those traces are not
>>> printed in an atomic way, and thus (guix status) gets to see
>>> truncated/mixed up traces. So I tried this:
>>>
>>> _NIX_OPTIONS=print-extended-build-trace=1 sudo -E \
>>> ./pre-inst-env strace -s 200 -o ,,s guix substitute \
>>> --substitute /gnu/store/pknm43xsza6nlc7bn27djip8fis92akd-gcc-toolchain-10.2.0 /tmp/t.drv
>>>
>>> It shows that traces are printed in a single write(2) call:
>>>
>>> write(2, "@ download-progress /tmp/t.drv http://ci.guix.gnu.org/nar/lzip/pknm43xsza6nlc7bn27djip8fis92akd-gcc-toolchain-10.2.0 4843 4843\n", 127) = 127
>>>
>>> So this side of things seems to be good. But then traces could be
>>> mangled/truncated by the daemon maybe. An strace log of the failing
>>> case would be very helpful.
>
> [...]
>
> I managed to capture two instances of 'transferred= #f' from my pk
> output in the attached logs. Curiously, they didn't lead to the crash.
> Attached is a pruned version of the strace log of a command like
> './pre-inst-env guix package -m my-manifest --max-jobs=20'. Offloading
> was in use.

The trace attached is even better, in that it triggered the problem! I
don't have time to take a look now, but I hope it'll be useful in
understanding the issue better. It's rather precious (quite some luck
seems to be needed to reproduce) :-).
Attachment: log.precious
Thanks,

Maxim
M
M
Maxim Cournoyer wrote on 19 Jan 2021 04:43
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43518@debbugs.gnu.org)
874kjdh9rp.fsf@gmail.com
Hello,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (48 lines)
> Hello Ludovic,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Hello again!
>>
>> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>>
>>> Hello!
>>>
>>> Ludovic Courtès <ludo@gnu.org> writes:
>>>
>>> [...]
>>>
>>>> The “@ download-progress” line is printed by (guix scripts substitute)
>>>> and later consumed by (guix status) in the client, which is why I
>>>> mentioned ‘progress-reporter/trace’ above.
>>>>
>>>> I think the problem we’re looking at could occur if those traces are not
>>>> printed in an atomic way, and thus (guix status) gets to see
>>>> truncated/mixed up traces. So I tried this:
>>>>
>>>> _NIX_OPTIONS=print-extended-build-trace=1 sudo -E \
>>>> ./pre-inst-env strace -s 200 -o ,,s guix substitute \
>>>> --substitute /gnu/store/pknm43xsza6nlc7bn27djip8fis92akd-gcc-toolchain-10.2.0 /tmp/t.drv
>>>>
>>>> It shows that traces are printed in a single write(2) call:
>>>>
>>>> write(2, "@ download-progress /tmp/t.drv http://ci.guix.gnu.org/nar/lzip/pknm43xsza6nlc7bn27djip8fis92akd-gcc-toolchain-10.2.0 4843 4843\n", 127) = 127
>>>>
>>>> So this side of things seems to be good. But then traces could be
>>>> mangled/truncated by the daemon maybe. An strace log of the failing
>>>> case would be very helpful.
>>
>> [...]
>>
>> I managed to capture two instances of 'transferred= #f' from my pk
>> output in the attached logs. Curiously, they didn't lead to the crash.
>> Attached is a pruned version of the strace log of a command like
>> './pre-inst-env guix package -m my-manifest --max-jobs=20'. Offloading
>> was in use.
>
> The trace attached is even better, in that it triggered the problem! I
> don't have time to take a look now, but I hope it'll be useful in
> understanding the issue better. It's rather precious (quite some luck
> seems to be needed to reproduce) :-).
>

[...]

Toggle quote (52 lines)
> write(1, ";;; (print-build-event:download-progress transferred= #f)\n", 58) = 58
> write(1, "\n", 1) = 1
> write(1, ";;; (simultaneous-jobs= 1)\n", 27) = 27
> write(1, "\n", 1) = 1
> write(1, ";;; (display-download-progress file= \"https:@\" size= #f transferred= #f)\n", 73) = 73
> write(2, "Backtrace:\n", 11) = 11
> futex(0x7f81fbe97648, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> write(2, "In ice-9/boot-9.scm:\n", 21) = 21
> write(2, " 1731:15 19 (with-exception-handler #<procedure 7f81f8667210 at ice-9/boot-9.scm:1815:7 (exn)> _ # _ # \342\200\246)\n", 109) = 109
> write(2, "In guix/scripts/package.scm:\n", 29) = 29
> write(2, " 972:10 18 (_)\n", 17) = 17
> write(2, "In guix/status.scm:\n", 20) = 20
> write(2, " 777:4 17 (call-with-status-report _ _)\n", 43) = 43
> write(2, "In guix/store.scm:\n", 19) = 19
> write(2, " 1301:8 16 (call-with-build-handler _ _)\n", 43) = 43
> write(2, " 1301:8 15 (call-with-build-handler #<procedure 7f81f866d5d0 at guix/ui.scm:1169:2 (continue store \342\200\246> \342\200\246)\n", 111) = 111
> write(2, "In guix/scripts/package.scm:\n", 29) = 29
> write(2, " 895:2 14 (_)\n", 17) = 17
> write(2, "In guix/build/syscalls.scm:\n", 28) = 28
> write(2, " 1167:4 13 (call-with-file-lock/no-wait _ _ _)\n", 49) = 49
> write(2, "In guix/scripts/package.scm:\n", 29) = 29
> write(2, " 144:19 12 (build-and-use-profile #<store-connection 256.99 7f81f6934870> \"/var/guix/profiles/per-u\342\200\246\" \342\200\246)\n", 111) = 111
> mmap(NULL, 262144, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f81e1d3f000
> write(2, "In guix/store.scm:\n", 19) = 19
> write(2, " 2049:24 11 (run-with-store #<store-connection 256.99 7f81f6934870> _ #:guile-for-build _ #:system _ # \342\200\246)\n", 109) = 109
> write(2, "In guix/profiles.scm:\n", 22) = 22
> write(2, " 1598:2 10 (_ _)\n", 19) = 19
> write(2, " 348:2 9 (_ _)\n", 19) = 19
> write(2, "In guix/store.scm:\n", 19) = 19
> write(2, " 1931:12 8 (_ #<store-connection 256.99 7f81f6934870>)\n", 57) = 57
> write(2, " 1358:5 7 (map/accumulate-builds #<store-connection 256.99 7f81f6934870> _ _)\n", 81) = 81
> write(2, " 1369:15 6 (_ #<store-connection 256.99 7f81f6934870> _ _)\n", 61) = 61
> write(2, " 741:13 5 (process-stderr _ _)\n", 34) = 34
> write(2, "In unknown file:\n", 17) = 17
> write(2, " 4 (display \"@ substituter-succeeded /gnu/store/srcpjs17wygd5f7fa251j716n2lvdlrz-gwenhywfar\342\200\246\" \342\200\246)\n", 111) = 111
> write(2, "In guix/status.scm:\n", 20) = 20
> write(2, " 700:16 3 (write! _ _ _)\n", 28) = 28
> write(2, " 614:6 2 (_ (download-progress \"/gnu/store/ypz9jjlv4gkfd4m2n359d9ixma0sfydz-ghc-8.6.5\" \"https:@\" \342\200\246) \342\200\246)\n", 111) = 111
> write(2, "In guix/progress.scm:\n", 22) = 22
> write(2, " 214:14 1 (display-download-progress \"https:@\" _ #:start-time _ #:transferred _ #:log-port _)\n", 97) = 97
> write(2, "In ice-9/boot-9.scm:\n", 21) = 21
> write(2, " 1669:16 0 (raise-exception _ #:continuable? _)\n", 50) = 50
> write(2, "\n", 1) = 1
> futex(0x7f81fbe97650, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> futex(0x7f81fbe97ce8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> write(2, "ice-9/boot-9.scm:1669:16: In procedure raise-exception:\n", 56) = 56
> write(2, "In procedure =: Wrong type argument in position 1: #f\n", 54) = 54
> fcntl(18, F_SETLKW, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
> close(18) = 0
> exit_group(1) = ?
> +++ exited with 1 +++

[...]

Today I stumbled upon this issue while building stuff on core-updates,
with the failure happening at a different place:

Toggle snippet (38 lines)
jx934sgqx1qmybvn1yi1yfrgjk-rustc-1.24.1-src.tar.gz 67512512 4259936
Backtrace:
In guix/ui.scm:
2154:12 19 (run-guix-command _ . _)
In ice-9/boot-9.scm:
1736:10 18 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
1731:15 17 (with-exception-handler #<procedure 7f2203d53fc0 at ice-9/boot-9.scm:1815:7 (exn)> _ # _ # ?)
1736:10 16 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In guix/store.scm:
632:37 15 (thunk)
1301:8 14 (call-with-build-handler _ _)
1301:8 13 (call-with-build-handler #<procedure 7f2203d538a0 at guix/ui.scm:1169:2 (continue store ?> ?)
In guix/status.scm:
780:4 12 (call-with-status-report _ _)
In guix/scripts/environment.scm:
751:14 11 (_)
In guix/store.scm:
2049:24 10 (run-with-store #<store-connection 256.99 7f2205598690> _ #:guile-for-build _ #:system _ # ?)
In guix/scripts/environment.scm:
764:18 9 (_ _)
In guix/store.scm:
1918:38 8 (_ #<store-connection 256.99 7f2200b33d70>)
1369:15 7 (_ #<store-connection 256.99 7f2200b33d70> _ _)
741:13 6 (process-stderr _ _)
In unknown file:
5 (display "gz https://ci.guix.gnu.org/nar/9dlfwa8qza8k6jj9m5a6sw2rkv73nx2m-rustc-1.20.0-s?" ?)
In guix/status.scm:
703:16 4 (write! _ _ _)
616:15 3 (_ (download-succeeded "/gnu/store/9dlfwa8qza8k6jj9m5a6sw2rkv73nx2m-rustc-1.20.0-src.?" ?) ?)
273:33 2 (compute-status _ #<<build-status> building: (#<<build> derivation: "/gnu/store/wmm8ghla?> ?)
In ice-9/boot-9.scm:
1669:16 1 (raise-exception _ #:continuable? _)
1669:16 0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1669:16: In procedure raise-exception:
In procedure struct-vtable: Wrong type argument in position 1 (expecting struct): #f

Maxim
L
L
Ludovic Courtès wrote on 5 Feb 2021 10:49
control message for bug #43518
(address . control@debbugs.gnu.org)
874kiq7skp.fsf@gnu.org
retitle 43518 (guix status) occasionally crashes when max-jobs > 1
quit
L
L
Ludovic Courtès wrote on 5 Feb 2021 10:52
(address . control@debbugs.gnu.org)
8735ya7sfo.fsf@gnu.org
severity 43518 important
quit
?