Sporadic guix-offload crashes due to EOF errors

Status: Open
Submitted by: Marius Bakke
Participants: Ludovic Courtès, Marius Bakke
Owner: unassigned
Severity: normal
Marius Bakke wrote on 31 May 11:51 +0200
(address . bug-guix@gnu.org)
87mu5owhxh.fsf@gnu.org
Hello,
During 'guix build -s aarch64-linux dolphin' on Berlin, I got this crash:
building /gnu/store/87655bh9rqcr29qasl1c4yj3skmxkyiz-kfilemetadata-5.70.0.drv...
process 12989 acquired build slot '/var/guix/offload/overdrive1.guixsd.org:52522/1'
process 12989 acquired build slot '/var/guix/offload/dover.guix.info:9023/1'
process 12989 acquired build slot '/var/guix/offload/141.80.167.167:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.163:22/0'
process 12989 acquired build slot '/var/guix/offload/localhost:2223/1'
process 12989 acquired build slot '/var/guix/offload/141.80.167.168:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.173:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.176:22/0'
process 12989 acquired build slot '/var/guix/offload/localhost:2222/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.165:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.169:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.181:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.170:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.174:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.180:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.161:22/0'
Backtrace:
In ice-9/boot-9.scm:
  1736:10  5 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
           4 (apply-smob/0 #<thunk 7f3344d296c0>)
In ice-9/boot-9.scm:
    718:2  3 (call-with-prompt _ _ #<procedure default-prompt-handle…>)
In ice-9/eval.scm:
    619:8  2 (_ #(#(#<directory (guile-user) 7f3344933f00>)))
In guix/ui.scm:
  1936:12  1 (run-guix-command _ . _)
In guix/scripts/offload.scm:
   742:22  0 (guix-offload . _)

guix/scripts/offload.scm:742:22: In procedure guix-offload:
Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.
guix build: error: unexpected EOF reading a line
Which is strange because guix/scripts/offload.scm:742 is wrapped in a (unless (eof-object? ...)) block.
When this happens, the build command terminates, along with any other builds that it had started concurrently. Builds from other clients were unaffected, of course.
I have also seen this occur on my personal offloading setup once in a blue moon, but don't know what could have caused it.
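Editor's illustration: one general way an EOF can reach a pattern match despite an EOF guard is when the guard covers one input stream while the match operates on a value read from another. The Python sketch below shows only that general shape. It is not the code in guix/scripts/offload.scm and is not a diagnosis of this bug; the names serve, parse_load and remote_factory are hypothetical.

```python
import io

EOF = ""  # Python text streams return an empty string at end of file


def parse_load(remote):
    """Read one line from `remote` and destructure it, with no EOF check."""
    line = remote.readline()
    # Comparable to a pattern match with a single clause: input that does not
    # have three whitespace-separated fields raises instead of being handled.
    one, five, fifteen = line.split()[:3]
    return float(one)


def serve(requests, remote_factory):
    """Outer loop that is guarded against EOF on `requests` only."""
    results = []
    while True:
        line = requests.readline()
        if line == EOF:  # this guard covers the request stream...
            break
        # ...but the nested read happens on a different stream.
        results.append(parse_load(remote_factory()))
    return results


# A remote stream that answers normally works:
#   serve(io.StringIO("req\n"), lambda: io.StringIO("0.1 0.2 0.3\n"))
# A remote stream that is already at EOF raises ValueError inside serve().
```

In such a shape the error surfaces from within the guarded loop even though the guard itself is correct.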
Marius Bakke wrote on 31 May 12:12 +0200
(address . 41625@debbugs.gnu.org)
87k10swgyy.fsf@gnu.org
Marius Bakke <marius@gnu.org> writes:
> During 'guix build -s aarch64-linux dolphin' on Berlin, I got this crash:
Funny, I just got it _again_ building the same derivation. It seems to happen while checking load on the selected offload machine, which is 'overdrive1.guixsd.org' in this and the previous case:
successfully built /gnu/store/qwscp6h2zsxr0knaizn1fn0saw1pfimi-kidletime-5.70.0.drv
process 51308 acquired build slot '/var/guix/offload/141.80.167.174:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.181:22/0'
process 51308 acquired build slot '/var/guix/offload/overdrive1.guixsd.org:52522/1'
process 51308 acquired build slot '/var/guix/offload/localhost:2223/1'
process 51308 acquired build slot '/var/guix/offload/141.80.167.162:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.180:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.161:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.170:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.163:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.166:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.177:22/0'
process 51308 acquired build slot '/var/guix/offload/localhost:2222/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.173:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.168:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.169:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.176:22/0'
Backtrace:
In ice-9/boot-9.scm:
  1736:10  5 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
           4 (apply-smob/0 #<thunk 7f3e4487b1c0>)
In ice-9/boot-9.scm:
    718:2  3 (call-with-prompt _ _ #<procedure default-prompt-handle…>)
In ice-9/eval.scm:
    619:8  2 (_ #(#(#<directory (guile-user) 7f3e4445ff00>)))
In guix/ui.scm:
  1936:12  1 (run-guix-command _ . _)
In guix/scripts/offload.scm:
   742:22  0 (guix-offload . _)

guix/scripts/offload.scm:742:22: In procedure guix-offload:
Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.
guix build: error: unexpected EOF reading a line
'guix offload test' passes without problems.
Marius Bakke wrote on 31 May 13:21 +0200
(address . 41625@debbugs.gnu.org)
87h7vwwdrs.fsf@gnu.org
Marius Bakke <marius@gnu.org> writes:
> 'guix offload test' passes without problems.
Not so fast, running it in a loop reveals the crash.
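Editor's illustration: a sporadic failure like this can be hunted by rerunning the command until it fails and keeping the output of the failing run. The sketch below is one way to do that in Python. It assumes the crash yields a non-zero exit status; run_until_failure is a hypothetical helper, not something shipped with Guix.

```python
import subprocess


def run_until_failure(command, max_runs=100):
    """Run `command` up to `max_runs` times.

    Returns (run_number, CompletedProcess) for the first run that exits with
    a non-zero status, or None if every run succeeds.
    """
    for attempt in range(1, max_runs + 1):
        # Capture stdout and stderr so the backtrace of the failing run is kept.
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            return attempt, result
    return None


# Example call (requires a configured offloading setup):
#   failure = run_until_failure(["guix", "offload", "test"])
#   if failure:
#       attempt, result = failure
#       print(f"failed on run {attempt}:\n{result.stderr}")
```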
There is a trace file in /root/offloadtest.trace on Berlin with such an occurrence. It looks like a timeout is reached shortly before the EOF error:
10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])
10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)
10139 write(2, "Backtrace:\n", 11) = 11
This seems to be from a different node than the one reported previously, as the preceding connect() was to this machine:
10139 connect(44, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS (Operation now in progress)
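Editor's illustration: in the trace, the second poll() call waits up to 15000 ms for fd 14 to become readable and returns 0, meaning no data arrived in that time. The Python sketch below reproduces that call pattern on a local socket pair. The helper name wait_readable is made up for illustration, and this is not code from Guix or Guile-SSH.

```python
import select
import socket


def wait_readable(sock, timeout_ms):
    """Return True if `sock` becomes readable within `timeout_ms`, else False."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    # An empty event list is what strace prints as "= 0 (Timeout)".
    events = poller.poll(timeout_ms)
    return bool(events)


# Example with a connected pair where the peer never writes:
#   a, b = socket.socketpair()
#   wait_readable(a, 15000)   # blocks for 15 seconds, then returns False
```

A caller that does not distinguish this timeout from end of stream could end up treating "no data yet" as EOF, which is one thing to check for in the code that performs the read.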
Ludovic Courtès wrote on 4 Jun 14:05 +0200
(name . Marius Bakke)(address . marius@gnu.org)(address . 41625@debbugs.gnu.org)
87a71jc9yi.fsf@gnu.org
Hi,
Marius Bakke <marius@gnu.org> skribis:
> Marius Bakke <marius@gnu.org> writes:
>
>> 'guix offload test' passes without problems.
>
> Not so fast, running it in a loop reveals the crash.
>
> There is a trace file in /root/offloadtest.trace on Berlin with such an
> occurrence. It looks like a timeout is reached shortly before the EOF
> error:
>
> 10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])
> 10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)
> 10139 write(2, "Backtrace:\n", 11) = 11
>
> This seems to be from a different node than the one reported previously,
> as the preceding connect() was to this machine:
>
> 10139 connect(44, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS (Operation now in progress)
So it looks like ‘connect’ fails and eventually we get an EOF object. However, I don’t see where that EOF comes from because the return value of ‘connect!’ (the Guile-SSH procedure) is properly checked.
Ludo’.
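Editor's illustration: on a non-blocking socket, connect() returning -1 with EINPROGRESS is the normal "still connecting" result rather than a failure in itself. The outcome is learned later by polling for writability and reading SO_ERROR. The quoted trace line alone therefore does not show whether that connection eventually succeeded. The Python sketch below shows the usual sequence. connect_nonblocking is a hypothetical helper, and the sketch makes no claim about how Guile-SSH or libssh implement it.

```python
import errno
import os
import select
import socket


def connect_nonblocking(host, port, timeout_ms=15000):
    """Start a non-blocking TCP connect and wait for it to complete.

    Returns the connected socket. Raises TimeoutError if the connection does
    not complete within `timeout_ms`, or OSError if it fails.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    rc = sock.connect_ex((host, port))
    # 0 means connected at once; EINPROGRESS means the handshake is under way.
    if rc not in (0, errno.EINPROGRESS):
        sock.close()
        raise OSError(rc, os.strerror(rc))
    poller = select.poll()
    poller.register(sock, select.POLLOUT)
    if not poller.poll(timeout_ms):
        sock.close()
        raise TimeoutError("connect did not complete in time")
    # Writability alone is not success: SO_ERROR holds the final result.
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        sock.close()
        raise OSError(err, os.strerror(err))
    return sock
```

Checking both the poll result and SO_ERROR separates three outcomes (connected, timed out, refused or unreachable) that a single return-value check on the initial call cannot distinguish.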