[Cuirass] Remote server not picking up job, losing workers

  • Done
  • quality assurance status badge
Details
3 participants
  • Ludovic Courtès
  • Ludovic Courtès
  • Christopher Baines
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
normal
L
L
Ludovic Courtès wrote on 26 Jun 2023 10:54
(address . bug-guix@gnu.org)
87fs6ey56z.fsf@inria.fr
As of cuirass@1.1.0-16.b825967, ‘cuirass remote-server’ appears to not
pick jobs as quickly as it should and to lose sight of workers (you can
see them come and go on https://ci.guix.gnu.org/workers).

/var/log/cuirass-remote-worker.log shows that it does build things, but
only sporadically. Then there are things like:

2023-06-26 10:07:58 warning: Poll loop busy during 3404 seconds.

This is presumably related to Cuirass commit
c4743b54720e86b0e0b0295fb6d33977e4293644 (previously ‘remote-worker’
would have a database worker thread; now it opens a new connection every
time—a stopgap before it’s fiberized, but apparently not a good one).

Ludo’.
L
L
Ludovic Courtès wrote on 30 Jun 2023 17:45
(address . 64297@debbugs.gnu.org)
87wmzlymvn.fsf@gnu.org
Ludovic Courtès <ludovic.courtes@inria.fr> skribis:

Toggle quote (14 lines)
> As of cuirass@1.1.0-16.b825967, ‘cuirass remote-server’ appears to not
> pick jobs as quickly as it should and to lose sight of workers (you can
> see them come and go on <https://ci.guix.gnu.org/workers>).
>
> /var/log/cuirass-remote-worker.log shows that it does build things, but
> only sporadically. Then there are things like:
>
> 2023-06-26 10:07:58 warning: Poll loop busy during 3404 seconds.
>
> This is presumably related to Cuirass commit
> c4743b54720e86b0e0b0295fb6d33977e4293644 (previously ‘remote-worker’
> would have a database worker thread; now it opens a new connection every
> time—a stopgap before it’s fiberized, but apparently not a good one).

Apparently this has to do with squee calling ‘current-read-waiter’
(i.e., poll(2)) while waiting for its response and passing checking the
wrong FD for some reason, as in this case:

Toggle snippet (24 lines)
18484 15:01:00 connect(55, {sa_family=AF_UNIX, sun_path="/tmp/ephemeralpg.58xnKh/.s.PGSQL.5432"}, 110) = 0 <0.000019>
18484 15:01:00 getsockopt(55, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000014>
18484 15:01:00 getsockname(55, {sa_family=AF_UNIX}, [128 => 2]) = 0 <0.000010>
18484 15:01:00 poll([{fd=55, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=55, revents=POLLOUT}]) <0.000011>
18484 15:01:00 sendto(55, "\0\0\0!\0\3\0\0user\0ludo\0database\0test\0\0", 33, MSG_NOSIGNAL, NULL, 0) = 33 <0.000014>
18484 15:01:00 poll([{fd=55, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
18484 15:01:00 <... poll resumed>) = 1 ([{fd=55, revents=POLLIN}]) <0.001786>
18484 15:01:00 recvfrom(55, "R\0\0\0\10\0\0\0\0S\0\0\0\26application_name\0\0S\0\0\0\31client_encoding\0UTF8\0S\0\0\0\27DateStyle\0ISO, MD"..., 16384, 0, NULL, NULL) = 376 <0.000016>
18484 15:01:00 sendto(55, "P\0\0\0\366\0\nUPDATE Builds SET status = -2, worker = null FROM\n(SELECT id FROM Workers"..., 291, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
18484 15:01:00 <... sendto resumed>) = 291 <0.000589>
18484 15:01:00 poll([{fd=54, events=POLLIN}], 1, -1 <unfinished ...>
18484 15:03:14 <... poll resumed>) = 1 ([{fd=54, revents=POLLIN}]) <134.319198>
18484 15:03:14 recvfrom(55, <unfinished ...>
18484 15:03:14 <... recvfrom resumed>"1\0\0\0\0042\0\0\0\4n\0\0\0\4C\0\0\0\rUPDATE 0\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 35 <0.000026>
18484 15:03:14 sendto(55, "P\0\0\0V\0DELETE FROM Workers WHERE\n(extract(epoch from now())::int - last_seen) > "..., 131, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
18484 15:03:14 <... sendto resumed>) = 131 <0.000084>
18484 15:03:14 poll([{fd=54, events=POLLIN}], 1, -1) = 1 ([{fd=54, revents=POLLNVAL}]) <0.000013>
18484 15:03:14 recvfrom(55, "1\0\0\0\0042\0\0\0\4n\0\0\0\4C\0\0\0\rDELETE 4\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 35 <0.000016>
18484 15:03:14 sendto(55, "X\0\0\0\4", 5, MSG_NOSIGNAL, NULL, 0) = 5 <0.000025>
18484 15:03:14 close(55) = 0 <0.000018>
18484 15:03:14 openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000019>
18484 15:03:14 write(2, "2023-06-30T15:03:14 warning: Poll loop busy during 134 seconds.\n", 64) = 64 <0.000024>

In this case FD 54 is a connection with a worker process; terminating
that process led poll(2) to return, thus unblocking the “poll loop”.

The problem is most likely with the connection-to-port caching in
squee’s ‘connection-socket-port’, as can be seen in this other trace
where I added ‘pk’ calls in ‘connection-socket-port’:

Toggle snippet (13 lines)
19468 15:37:43 connect(55, {sa_family=AF_UNIX, sun_path="/tmp/ephemeralpg.58xnKh/.s.PGSQL.5432"}, 110) = 0 <0.000018>
19468 15:37:43 getsockopt(55, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000010>
19468 15:37:43 getsockname(55, {sa_family=AF_UNIX}, [128 => 2]) = 0 <0.000009>
19468 15:37:43 poll([{fd=55, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=55, revents=POLLOUT}]) <0.000009>
19468 15:37:43 sendto(55, "\0\0\0!\0\3\0\0user\0ludo\0database\0test\0\0", 33, MSG_NOSIGNAL, NULL, 0) = 33 <0.000012>
19468 15:37:43 poll([{fd=55, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=55, revents=POLLIN}]) <0.002109>
19468 15:37:43 recvfrom(55, "R\0\0\0\10\0\0\0\0S\0\0\0\26application_name\0\0S\0\0\0\31client_encoding\0UTF8\0S\0\0\0\27DateStyle\0ISO, MD"..., 16384, 0, NULL, NULL) = 376 <0.000009>
19468 15:37:43 sendto(55, "P\0\0\0\366\0\nUPDATE Builds SET status = -2, worker = null FROM\n(SELECT id FROM Workers"..., 291, MSG_NOSIGNAL, NULL, 0) = 291 <0.000012>
19468 15:37:43 write(1, "\n", 1) = 1 <0.000015>
19468 15:37:43 ioctl(54, TCGETS, 0x7ffd8fde6660) = -1 ENOTTY (Inappropriate ioctl for device) <0.000012>
19468 15:37:43 write(1, ";;; (cached #<pg-conn cdac20 (connected)> #<input-output: file 54>)\n", 68) = 68 <0.000012>

Above we open a fresh connection on FD 55, but ‘connection-socket-port’
determines that the connection object is cached and associated with a
port for FD 54.

To be continued…

Ludo’.
L
L
Ludovic Courtès wrote on 1 Jul 2023 00:42
(address . 64297@debbugs.gnu.org)
87sfa8zi57.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (4 lines)
> The problem is most likely with the connection-to-port caching in
> squee’s ‘connection-socket-port’, as can be seen in this other trace
> where I added ‘pk’ calls in ‘connection-socket-port’:

C
C
Christopher Baines wrote on 1 Jul 2023 12:28
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 64297@debbugs.gnu.org)
87zg4fq5zs.fsf@cbaines.net
Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (10 lines)
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> The problem is most likely with the connection-to-port caching in
>> squee’s ‘connection-socket-port’, as can be seen in this other trace
>> where I added ‘pk’ calls in ‘connection-socket-port’:
>
> Confirmed, with a fix!
>
> https://notabug.org/cwebber/guile-squee/pulls/8

I've merged that change, updated guile-squee in Guix, pulled on berlin,
reconfigured and restarted Cuirass now.

It seems to be building some now stuff at least.
-----BEGIN PGP SIGNATURE-----

iQKlBAEBCgCPFiEEPonu50WOcg2XVOCyXiijOwuE9XcFAmSgACdfFIAAAAAALgAo
aXNzdWVyLWZwckBub3RhdGlvbnMub3BlbnBncC5maWZ0aGhvcnNlbWFuLm5ldDNF
ODlFRUU3NDU4RTcyMEQ5NzU0RTBCMjVFMjhBMzNCMEI4NEY1NzcRHG1haWxAY2Jh
aW5lcy5uZXQACgkQXiijOwuE9Xfspw//cJ4hauzPD7e/PokmBaoz8ul61GsF670v
XWvv4OFywZe3IBlVXkzGiIX731YBb02cdIFIeWK4YrAOTh5s53JaTLABRV7mByov
ys3eFaZyMrZJkB0F8ROzjWDgDubXTY2coQqoN8Q/2Q7u5Jk7hJ2VVkAiIz8tqxJM
QUvyg99tCUbSmgwl1A33GB/1n0z93MU9WNT8W62qoN3hFiN/+uJrxXltpjJn/Q8m
N6koTLe1VAa90aLGcp54kJselDKVBZcH+AeOAxFL8JufmpfPq8eFAG3eMcFi0Z5o
fp3tj3DGE8hEzNSgo+N6iHnk65cTGIvse9vhzNwmYzFlATyv4qC4Yeo/gSWrnOHs
ZJAjxqEG/vPDyTicB3q4bp9cubcJZ154YF4I5XUXL0gzb4sVsyRZTRyQecApmaS/
rfTXGfoOFGbyLi9m2b8ymG2jSKk70hitHTAQLKQEO5U5TTcXq7in7Q4cDh0ZK7ho
iPO8XOfyNa8WEkL+KfNy7FNzhlONPGDDtSLFU/dsf65N4MHbqPp9Qx/4sW1vGvJs
3glebYlKrqWlXYwJZkAXvBsI+bPdj/WMtVz07EviDjYNt1/KYdguNV7GekT+DK9j
K/zaSI+9mylxhXjCK/BcrpHebeD9eppr4v/FDhGWAmFp8GMdPyuw9z0j1tZWTsz3
GAY9UCiZ3pg=
=n00e
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 1 Jul 2023 18:57
(name . Christopher Baines)(address . mail@cbaines.net)(address . 64297-done@debbugs.gnu.org)
87mt0fwovi.fsf@gnu.org
Hello,

Christopher Baines <mail@cbaines.net> skribis:

Toggle quote (15 lines)
> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Ludovic Courtès <ludo@gnu.org> skribis:
>>
>>> The problem is most likely with the connection-to-port caching in
>>> squee’s ‘connection-socket-port’, as can be seen in this other trace
>>> where I added ‘pk’ calls in ‘connection-socket-port’:
>>
>> Confirmed, with a fix!
>>
>> https://notabug.org/cwebber/guile-squee/pulls/8
>
> I've merged that change, updated guile-squee in Guix, pulled on berlin,
> reconfigured and restarted Cuirass now.

Awesome, thanks a lot! It does seem to be working hard now.

(I’ve just reconfigured as well, as I hadn’t seen your message…)

Ludo’.
Closed
?