From: Mark H Weaver
To: Ludovic Courtès
Cc: 35350@debbugs.gnu.org
Subject: Re: bug#35350: Some compile output still leaks through with --verbosity=1
Date: Sat, 04 May 2019 14:53:50 -0400
Message-ID: <87tveauk2u.fsf@netris.org>
In-Reply-To: <87r29e5zsw.fsf@gnu.org> (Ludovic Courtès's message of "Sat, 04 May 2019 11:33:51 +0200")
References: <87mukkfd2j.fsf@netris.org> <87r29v2jz2.fsf@gnu.org>
 <87ftq9silk.fsf@netris.org> <87imv5jai5.fsf@gnu.org>
 <87k1fgh9c0.fsf@netris.org> <874l6jh0bx.fsf@gnu.org>
 <87imuvme7g.fsf@netris.org> <87r29e5zsw.fsf@gnu.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hi Ludovic,

Ludovic Courtès writes:

> Mark H Weaver wrote:
>
>> Ludovic Courtès writes:
>
> [...]
>
>>> So there are two things. To fix the issue you reported (build output
>>> that goes through), I think we must simply turn off UTF-8 decoding from
>>> ‘process-stderr’ and leave that entirely to ‘build-event-output-port’.
>>
>> Can we assume that UTF-8 is the appropriate encoding for
>> (current-build-output-port)? My interpretation of the Guix manual entry
>> for 'current-build-output-port' suggests that the answer should be "no".
>
> What goes to ‘current-build-output-port’ comes from build processes.
> It’s usually UTF-8 but it can be anything, including binary garbage,
> which should be gracefully handled.
>
> That’s why ‘process-stderr’ currently uses ‘read-maybe-utf8-string’.

I agree that we should (permissively) interpret the build process output
as UTF-8, regardless of locale settings. However, the encoding of
'current-build-output-port' is orthogonal, and I see no reason to assume
that it's UTF-8.

As 'process-stderr' is currently implemented, it makes no assumptions
about the encoding of 'current-build-output-port'. That's because it
uses only textual I/O on it. The end result is that the UTF-8 build
output is effectively converted into the port encoding of
'current-build-output-port', whatever it might be. I think that's how
it should be, no?

>> Also, in your previous message you wrote:
>>
>>   The problem is the first layer of UTF-8 decoding that happens in
>>   ‘process-stderr’, in the ‘%stderr-next’ case. We would need to
>>   disable it, but only if the build output port is
>>   ‘build-event-output-port’ (i.e., it’s capable of interpreting
>>   “multiplexed build output” correctly.)
>>
>> It sounds like you're suggesting that 'process-stderr' should look to
>> see if (current-build-output-port) is a 'build-event-output-port', and
>> in that case it should use binary I/O primitives to write raw binary
>> data to it, otherwise it should use text I/O primitives and write
>> characters to it. Do I understand correctly?
>
> Yes. (Actually, rather than guessing if (current-build-output-port) is
> a ‘build-event-output-port’, there could be a fluid to ask for the use
> of raw binary primitives.)
>
>> IMO, it would be cleaner to treat 'build-event-output-port' uniformly,
>> and specifically as a textual port of unknown encoding.
>
> (You mean ‘current-build-output-port’, right?)

Yes, indeed.

> I think you’re right. I’m not yet entirely sure what the implications
> are. There’s a couple of tests in tests/store.scm for UTF-8
> interpretation that describe behavior that I think we should preserve.

I certainly agree that we should preserve those tests. I would go
further and add two more tests that bind 'current-build-output-port' to
a port with a non-UTF-8 encoding (e.g. UTF-16) and verify that the λ
gets converted correctly. The test build process would output the λ as
UTF-8, but it should be written to 'current-build-output-port' as
e.g. UTF-16.

What do you think?

>> I would suggest changing 'build-event-output-port' to create an R6RS
>> custom *textual* output port, so that it wouldn't have to worry about
>> encodings at all, and it would only be given whole characters.
>> Internally, it would be doing exactly what you suggest above, but those
>> details would be encapsulated within the custom textual port.
>>
>> However, I don't think we can use Guile's current implementation of R6RS
>> custom textual output ports, which are currently built on Guile's legacy
>> soft ports, which I suspect have a similar bug with multibyte characters
>> sometimes being split (see 'soft_port_write' in vports.c).
>>
>> Having said all of this, my suggestions would ultimately entail having
>> two separate places along the stderr pipeline where 'utf8->string!'
>> would be used, and maybe that's too much until we have a more optimized
>> C implementation of it.
>
> Yeah it looks like we don’t yet have custom textual output ports that we
> could rely on, do we?
>
> I support your work to add that in Guile proper!

For now, I can offer a new implementation of custom textual output ports
built upon custom binary ports and the 'utf8->string!' that I previously
sent. See attached.

     Thanks,
       Mark


--8<---------------cut here---------------start------------->8---
GNU Guile 2.2.4
Copyright (C) 1995-2017 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> (load "utf8-decoder.scm")
scheme@(guile-user)> (load "guile-new-custom-textual-ports.scm")
scheme@(guile-user)> (define (my-write! str start count) (pk 'my-write! (substring str start (+ start count))) count)
scheme@(guile-user)> (define port (make-custom-textual-output-port "test1" my-write! #f #f #f))
scheme@(guile-user)> (display "Hello λ world!" port)
scheme@(guile-user)> (force-output port)

;;; (my-write! "Hello λ world!")
scheme@(guile-user)> (string->utf8 "λ")
$2 = #vu8(206 187)
scheme@(guile-user)> (string->utf8 "Hello λ world!")
$3 = #vu8(72 101 108 108 111 32 206 187 32 119 111 114 108 100 33)
scheme@(guile-user)> (put-bytevector port #vu8(72 101 108 108 111 32 206))
scheme@(guile-user)> (force-output port)

;;; (my-write! "Hello ")
scheme@(guile-user)> (put-bytevector port #vu8(187 32 119 111 114 108 100 33))
scheme@(guile-user)> (force-output port)

;;; (my-write! "λ world!")
scheme@(guile-user)>
--8<---------------cut here---------------end--------------->8---

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline; filename=guile-new-custom-textual-ports.scm
Content-Transfer-Encoding: 8bit
Content-Description: New implementation of custom textual output ports for Guile

;;; Copyright © 2019 Mark H Weaver
;;;
;;; This program is free software: you can redistribute it and/or modify
;;; it under the terms of the GNU General Public License as published by
;;; the Free Software Foundation, either version 3 of the License, or
;;; (at your option) any later version.
;;;
;;; This program is distributed in the hope that it will be useful,
;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with this program. If not, see .

(use-modules (rnrs io ports))

(define (make-custom-textual-output-port id
                                         write!
                                         get-position
                                         set-position!
                                         close)
  (let (;; Allocate a per-port string buffer which will be used as a
        ;; temporary buffer for decoding, to avoid heap allocation
        ;; during normal operation.
        (buffer (make-string 4096))
        ;; 'state' is the UTF-8 decoder state, which represents a
        ;; proper prefix of a well-formed UTF-8 byte sequence. These
        ;; are bytes that 'binary-write!'
        ;; has accepted and reported as having been written, although
        ;; we are not able to decode them into a character to pass to
        ;; (textual) 'write!' until more bytes arrive.
        (state 0))
    (define (binary-write! bv start count)
      (call-with-values
          (lambda ()
            ;; XXX FIXME: Consider performing this
            ;; decoding strictly.
            (utf8->string! state bv start (+ start count)
                           buffer 0 (string-length buffer)))
        (lambda (new-state bv-pos char-count)
          (let* (;; Avoid calling write! with (char-count = 0) unless
                 ;; (count = 0) was passed to us, because calling
                 ;; 'write!' with count=0 has a special meaning: it
                 ;; means to pass an EOF object to the byte/character
                 ;; sink.
                 (chars-accepted
                  (if (and (zero? char-count)
                           (not (zero? count)))
                      0
                      (write! buffer 0 char-count)))
                 ;; Compute 'bytes-accepted' in such a way that the
                 ;; bytes from STATE are not included, because they
                 ;; were passed to us in previous calls, and are not
                 ;; part of the bytevector range that we are now being
                 ;; asked to write. However, it's important to note
                 ;; that if 'write!' did not accept the bytes from
                 ;; STATE, 'bytes-accepted' will be negative. We must
                 ;; handle that case specially below.
                 (bytes-accepted
                  (- count
                     (string-utf8-length
                      (substring buffer chars-accepted char-count)))))
            ;; If 'bytes-accepted' is negative, that means the bytes
            ;; from STATE were not written. This can only happen if
            ;; 'chars-accepted' is 0, because 'write!' can only accept
            ;; whole code points, and the bytes from STATE are part of
            ;; at most a single code point. In this case, we must
            ;; leave STATE unchanged and return 0.
            (if (negative? bytes-accepted)
                0
                (begin
                  (set! state new-state)
                  bytes-accepted))))))
    (define (binary-close)
      (set! buffer #f)
      (when close (close)))
    (define port
      (make-custom-binary-output-port id binary-write!
                                      get-position set-position!
                                      binary-close))
    ;; Always use UTF-8 as the encoding for custom textual ports, as
    ;; an internal implementation detail, to ensure that all Unicode
    ;; characters will pass through regardless of the current locale.
    (set-port-encoding! port "UTF-8")
    port))

--=-=-=--
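
The encoding check proposed in the message (UTF-8 in from the build
process, a different encoding out on the receiving port) can be sketched
without a Guix store. This is only an illustration under stated
assumptions: it uses stock Guile 2.2, a bytevector output port stands in
for 'current-build-output-port', and 'encode-through-port' is a
hypothetical helper that exists neither in Guix nor in the attachment.
UTF-16BE is used instead of plain UTF-16 so that no byte-order mark is
involved.

```scheme
(use-modules (rnrs bytevectors)
             (ice-9 binary-ports))

;; Hypothetical helper (not part of Guix or of the attachment): write
;; STR through a fresh port whose encoding is ENCODING and return the
;; raw bytes that reached the underlying sink.
(define (encode-through-port str encoding)
  (call-with-values open-bytevector-output-port
    (lambda (port get-bytes)
      (set-port-encoding! port encoding)
      (display str port)
      (force-output port)
      (get-bytes))))

;; The build process side: λ (U+03BB) as UTF-8 bytes, decoded to a
;; string before any textual I/O takes place.
(define lambda-string (utf8->string #vu8(206 187)))

;; The receiving side: the same character written through a port that
;; is not UTF-8.  U+03BB is the byte pair 03 BB in UTF-16BE.
(define lambda-utf16be (encode-through-port lambda-string "UTF-16BE"))
```

A real test in tests/store.scm would presumably bind
'current-build-output-port' to such a port and run an actual build; the
sketch covers only the conversion step.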