Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Mark H Weaver <mhw@netris.org> skribis:
>>
>>> Sometimes when compiling a package with --verbosity=1, some parts of the
>>> compile output leak through.  For example, see the transcript below.
>>
>> Weird.
>
> FWIW, a few observations, possibly relevant:
>
> (1) Each chunk of leaked output begins with 1 or 2 Unicode Replacement
>     characters (U+FFFD).  In the transcript I provided, the first leak
>     began with 1 replacement char, and all later leaks began with 2.
>
> (2) The replacement characters are immediately followed by
>     "@ build-log 30033 4096\n", and that string is also sprinkled
>     throughout the leaked output, with approximately ~4060-4070
>     characters of leaked output between each occurrence of
>     "@ build-log 30033 4096\n".

Indeed.  I managed to reproduce it while building modem-manager.  I
strace’d ‘guix build’ with an additional ‘pk’¹ to see what happens, and
here’s what leads to the wrong “write(2, "�@ build-log…")” call:

--8<---------------cut here---------------start------------->8---
read(13, "gmlo\0\0\0\0", 8)             = 8
read(13, "\27\20\0\0\0\0\0\0", 8)       = 8
read(13, "@ build-log 22090 4096\n […] warning: \342\200", 4119) = 4119
read(13, "\0", 1)                       = 1
write(1, "\n", 1)                       = 1
write(1, ";;; (write 1008 <> #f 0)\n", 25) = 25
write(1, "\n", 1)                       = 1
write(1, ";;; (write 985 <> 22090 4096)\n", 30) = 30
write(1, "\n", 1)                       = 1
write(1, ";;; (write 1008 <> 22090 3111)\n", 31) = 31
write(1, "\n", 1)                       = 1
write(1, ";;; (write 1008 <> 22090 2103)\n", 31) = 31
write(1, "\n", 1)                       = 1
write(1, ";;; (write 1008 <> 22090 1095)\n", 31) = 31
write(1, "\n", 1)                       = 1
write(1, ";;; (write 88 <> 22090 87)\n", 27) = 27
write(2, "\r\33[K\\ 'build' phase", 19) = 19
[…]
write(2, "\r\33[K\\ 'build' phase", 19) = 19
write(1, "\n", 1)                       = 1
write(1, ";;; (write 1 <> #f 0)\n", 22) = 22
read(13, "gmlo\0\0\0\0", 8)             = 8
read(13, "\27\20\0\0\0\0\0\0", 8)       = 8
read(13, "@ build-log 22090 4096\n\230g_simple_async_result_take_error\342\200\231 is deprecated[…]", 4119) = 4119
read(13, "\0", 1)                       = 1
write(1, "\n", 1)                       = 1
write(1, ";;; (write 1008 <> #f 0)\n", 25) = 25
write(2, "\357\277\275@ build-log 22090 4096\n", 26) = 26
--8<---------------cut here---------------end--------------->8---

The third read(2) call here ends on a partial UTF-8 sequence for LEFT
SINGLE QUOTATION MARK (we get the first two bytes of a three byte
sequence.)

What happens is that ‘process-stderr’ in (guix store) gets that byte
string from the daemon, passes it through ‘read-maybe-utf8-string’,
which replaces the last two bytes with REPLACEMENT CHARACTER, which is
itself a 3-byte sequence.

Thus, we have this extra byte that’s being inserted.  That confuses the
whole machinery since the build log was announced as being 4096-byte
long, and it’s now 4097-byte long.

Internally, ‘build-event-output-port’ keeps the last byte of the
REPLACEMENT CHARACTER sequence in the ‘%fragments’ buffer.
Consequently, the “@ build-log” string that comes next doesn’t start on
a newline, and thus it is considered build output.  Since the first byte
does not constitute a valid UTF-8 sequence, another REPLACEMENT
CHARACTER is inserted there when it gets printed.


So ‘build-event-output-port’ is working as expected.  The problem is the
first layer of UTF-8 decoding that happens in ‘process-stderr’, in the
‘%stderr-next’ case.  We would need to disable it, but only if the build
output port is ‘build-event-output-port’ (i.e., it’s capable of
interpreting “multiplexed build output” correctly.)

Thanks,
Ludo’.

¹ pk: