Cuirass 504 errors

  • Done
Details
2 participants
  • Mathieu Othacehe
  • zimoun
Owner
unassigned
Submitted by
Mathieu Othacehe
Severity
normal
Mathieu Othacehe wrote on 26 Jul 2020 18:10
(address . bug-guix@gnu.org)
87eeoy9rzk.fsf@gnu.org
Hello,

Back from holidays, perfect time to fix some Cuirass issues :) The
Cuirass web interface frequently serves 504 errors for all requests,
requiring a service restart on berlin.

Having a look at /var/log/cuirass-web.log, it seems that multiple things
are indeed going wrong.

A first problem is caused by checkout entries pointing to removed
inputs. This should be fixed by f71f026a41d8e68e4a7f11ef6e708964594a599c
in Cuirass.

A second issue is caused when a build product download is started, then
aborted. In that case, sendfile throws an exception or enters an endless
loop.

There's a third issue, but the cause is not clear to me:

Uncaught exception in fiber ##f:
In ice-9/boot-9.scm:
1736:10 5 (with-exception-handler _ _ #:unwind? _ # _)
In web/server/fiberized.scm:
160:26 4 (_)
In ice-9/suspendable-ports.scm:
83:4 3 (write-bytes #<closed: file 7f3a4ed46310> #vu8(60 33 ?) ?)
In unknown file:
2 (port-write #<closed: file 7f3a4ed46310> #vu8(60 33 # ?) ?)
In ice-9/boot-9.scm:
1669:16 1 (raise-exception _ #:continuable? _)
1669:16 0 (raise-exception _ #:continuable? _)
ice-9/boot-9.scm:1669:16: In procedure raise-exception:

Thanks,

Mathieu
zimoun wrote on 28 Jul 2020 00:11
86zh7kzjys.fsf@gmail.com
Hi Mathieu,

On Sun, 26 Jul 2020 at 18:10, Mathieu Othacehe <othacehe@gnu.org> wrote:

> A second issue is caused when a build product download is started, then
> aborted. In that case, sendfile throws an exception or enters an endless
> loop.

What do you mean by “build product download is started, then aborted”?

Cheers,
simon
Mathieu Othacehe wrote on 28 Jul 2020 09:32
(name . zimoun)(address . zimon.toutoune@gmail.com)(address . 42548@debbugs.gnu.org)
87sgdckscm.fsf@gnu.org
Hey zimoun,

> What do you mean by “build product download is started, then aborted”?

I mean clicking on the downloadable image here[1] and then hitting
"cancel" when the download popup appears, or the abort button later on,
once the download has started.

Thanks,

Mathieu

zimoun wrote on 28 Jul 2020 10:49
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 42548@debbugs.gnu.org)
86d04gyqfs.fsf@gmail.com
Hi Mathieu,

On Tue, 28 Jul 2020 at 09:32, Mathieu Othacehe <othacehe@gnu.org> wrote:

> I mean clicking on the downloadable image here[1] and then hitting
> "cancel" when the download popup appears, or the abort button later on,
> once the download has started.

Ah, that’s annoying indeed. :-)

And does it mess up Cuirass if the connection is lost, e.g. if the
network goes down?

Cheers,
simon
Mathieu Othacehe wrote on 28 Jul 2020 16:56
(name . zimoun)(address . zimon.toutoune@gmail.com)(address . 42548@debbugs.gnu.org)
87mu3jheme.fsf@gnu.org
> And does it mess up Cuirass if the connection is lost, e.g. if the
> network goes down?

Not sure yet, I also found this message:

Uncaught exception in fiber ##f:
In ice-9/boot-9.scm:
1736:10 5 (with-exception-handler _ _ #:unwind? _ # _)
In web/server/fiberized.scm:
160:26 4 (_)
In ice-9/suspendable-ports.scm:
83:4 3 (write-bytes #<closed: file 7ff11c2dec40> #vu8(60 33 ?) ?)
In unknown file:
2 (port-write #<closed: file 7ff11c2dec40> #vu8(60 33 # ?) ?)
In ice-9/boot-9.scm:
1669:16 1 (raise-exception _ #:continuable? _)
1669:16 0 (raise-exception _ #:continuable? _)
ice-9/boot-9.scm:1669:16: In procedure raise-exception:
In procedure fport_write: Broken pipe

that suggests that we try to write something to a closed file.

To be investigated :)

Mathieu
Mathieu Othacehe wrote on 30 Jul 2020 16:47
(address . 42548@debbugs.gnu.org)
87r1st83gv.fsf@gnu.org
Hey,

> A second issue is caused when a build product download is started, then
> aborted. In that case, sendfile throws an exception or enters an endless
> loop.

Ok, so I found a couple of errors here. First, I noticed that it was not
possible to download two build products simultaneously, because the
first download was blocking the whole process.

This is solved by: 6ad9c602697ffe33c8fbb09ccd796b74bf600223. In short,
current-fiber was set to #f, both in the context of the caller and the
spawned thread. So I think the get-message operation was blocking the
whole thread instead of suspending the current fiber. It would be nice
if someone else could take a look, though :).

Second issue: sendfile may throw EPIPE or ECONNRESET if the client
disconnects before the end of the transfer. I think that, apart from the
dirty backtrace, it was not harmful. But anyway, it's better to catch
this as we are doing in "guix publish", see:
0955a11abd9e27c96a1375cca6a1c97869b5780a.
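
For illustration, the "catch EPIPE/ECONNRESET" idea can be sketched as
follows. This is a Python stand-in, not the Guile code from the commit
above; the name send_quietly and its parameters are hypothetical.

```python
import io
import socket


def send_quietly(sock: socket.socket, fileobj: io.BufferedIOBase,
                 chunk_size: int = 65536) -> bool:
    """Stream FILEOBJ over SOCK in chunks.

    Return True when the whole file was sent, False when the client
    went away mid-transfer.  Any other error is re-raised.
    """
    try:
        while True:
            chunk = fileobj.read(chunk_size)
            if not chunk:
                return True  # end of file: transfer complete
            sock.sendall(chunk)
    except (BrokenPipeError, ConnectionResetError):
        # EPIPE / ECONNRESET: the client cancelled the download.
        # This is expected, so swallow it instead of printing a backtrace.
        return False
```

Only the two "client disconnected" errors are swallowed; anything else
still propagates, so genuine I/O problems remain visible in the logs.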

I fear it won't be enough to fix the 504 errors, but at least it's a
start.

Thanks,

Mathieu
Mathieu Othacehe wrote on 4 Aug 2020 18:48
(address . 42548@debbugs.gnu.org)
87bljq8ihz.fsf@gnu.org
Hello,

> that suggests that we try to write something to a closed file.
>
> To be investigated :)

Ok, so I have a better grasp of what's going on. The Cuirass web server
was receiving requests such as "/builds/1234)" which were not rejected
and, worse, triggered SQL queries such as "select * from Builds".

As the table is quite large, it caused some of the DB workers to
hang. Once all DB workers were hanging, the queries started to
accumulate until the open fd limit (1024) was reached.

I consolidated the validation of HTTP requests, and the Cuirass web
server has now been running for 48 hours, which has not happened in
months I think.
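
As a rough illustration of this kind of validation (a Python sketch, not
the actual Cuirass code; parse_build_id, handle_build_request and the
lookup callback are hypothetical names), the idea is to accept a path
only when the whole of it matches the expected shape, and to reject
everything else before any SQL query is built:

```python
import re

# The entire path must be "/builds/" followed by ASCII digits only.
_BUILD_ROUTE = re.compile(r"/builds/([0-9]+)")


def parse_build_id(path):
    """Return the build id as an int, or None if PATH is malformed."""
    match = _BUILD_ROUTE.fullmatch(path)
    return int(match.group(1)) if match else None


def handle_build_request(path, lookup):
    """Return a (status, body) pair; LOOKUP maps a build id to a body."""
    build_id = parse_build_id(path)
    if build_id is None:
        # Malformed request: answer immediately, never touch the database.
        return 404, "Not found"
    return 200, lookup(build_id)
```

With this, "/builds/1234" yields the id 1234, while "/builds/1234)" is
answered with a 404 without reaching the database.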

I also added some warnings to detect DB workers hanging for more than 5
seconds. The next step is to log all SQL queries using [1]. This should
allow us to spot this kind of issue more easily. Logging the duration
of each query should also help us optimize the queries.
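
A minimal sketch of per-query duration logging, in Python with the
standard sqlite3 module purely for illustration (timed_query, the "db"
logger name and the SLOW_QUERY_SECONDS constant are assumptions, not
Cuirass code):

```python
import logging
import sqlite3
import time

log = logging.getLogger("db")

# Assumed threshold, mirroring the 5 seconds mentioned above.
SLOW_QUERY_SECONDS = 5.0


def timed_query(conn, sql, params=()):
    """Run SQL on CONN, log how long it took, and return all rows."""
    start = time.monotonic()
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        elapsed = time.monotonic() - start
        # Slow queries are logged as warnings, the rest at debug level.
        level = logging.WARNING if elapsed > SLOW_QUERY_SECONDS else logging.DEBUG
        log.log(level, "%.3fs %s", elapsed, sql)
```

Note that this only reports a slow query once it has returned; noticing
a worker that is still hanging needs a separate timer or watchdog.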

I'm still waiting a few days before closing this issue.

Thanks,

Mathieu

Mathieu Othacehe wrote on 6 Aug 2020 10:16
(address . 42548-done@debbugs.gnu.org)
87a6z8ur2j.fsf@gnu.org
Hello,

> I'm still waiting a few days before closing this issue.

No issues so far, closing this one.

Mathieu
Closed
This issue is archived.

To comment on this conversation send an email to 42548@debbugs.gnu.org