From debbugs-submit-bounces@debbugs.gnu.org Tue Apr 13 16:08:15 2021 Received: (at 33848) by debbugs.gnu.org; 13 Apr 2021 20:08:15 +0000 Received: from localhost ([127.0.0.1]:60911 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1lWPKY-0001p4-HV for submit@debbugs.gnu.org; Tue, 13 Apr 2021 16:08:15 -0400 Received: from world.peace.net ([64.112.178.59]:35660) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1lWPKV-0001oj-VF for 33848@debbugs.gnu.org; Tue, 13 Apr 2021 16:08:13 -0400 Received: from mhw by world.peace.net with esmtpsa (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1lWPKN-00048N-PD; Tue, 13 Apr 2021 16:08:04 -0400 From: Mark H Weaver To: Ludovic =?utf-8?Q?Court=C3=A8s?= Subject: Re: bug#33848: Store references in SBCL-compiled code are "invisible" In-Reply-To: <87ft01axta.fsf@gnu.org> References: <87r2e8jpfx.fsf@gnu.org> <87tvizgghs.fsf@ambrevar.xyz> <87k1juaomo.fsf@gnu.org> <87muoqhk62.fsf@ambrevar.xyz> <87zhsq8wkj.fsf@gnu.org> <87d0pmhbgn.fsf@ambrevar.xyz> <87r2e28tkv.fsf@gnu.org> <874laygkiy.fsf@ambrevar.xyz> <87lfa5eymf.fsf@ambrevar.xyz> <87tuoscsk9.fsf@gnu.org> <87im57b8u7.fsf@ambrevar.xyz> <87czvebky2.fsf@netris.org> <87eefu30a4.fsf@gnu.org> <87im56l6es.fsf@yamatai> <87wntm8j18.fsf@ambrevar.xyz> <87a6qil4b1.fsf@yamatai> <87a6qiz5b3.fsf@ambrevar.xyz> <871rbtc3j5.fsf@netris.org> <87r1js9udv.fsf@netris.org> <87h7kj7j7x.fsf@netris.org> <87ft01axta.fsf@gnu.org> Date: Tue, 13 Apr 2021 16:06:19 -0400 Message-ID: <87k0p6rlt5.fsf@netris.org> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 33848 Cc: Guillaume Le Vaillant , Pierre Neidhardt , 33848@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Hi Ludovic, Ludovic Court=C3=A8s writes: > Please add a =E2=80=9CFixes=E2=80=9D line in the commit log. Done, thanks. > I=E2=80=99m not reviewing the code in depth and I trust your judgment. > > The risks of bugs I can think of are: missed ASCII references (a > regression, whereby some ASCII references would not get rewritten), This is indeed a risk whenever we modify the grafting code, which is written for efficiency, not clarity. I've tried to be careful, and have checked that my newly grafted system and user profiles do not retain references to ungrafted code, modulo the following pre-existing issues: (ibus-daemon launches ungrafted subprocesses) (Chunked store references in .zo files in Racket 8) > and unrelated UTF-{16,32}-base32-looking references getting rewritten. > > I guess the latter is very unlikely because only sequences found in the > replacement table may be rewritten, right? Yes, that's correct. > The former should be caught by =E2=80=98tests/grafts.scm=E2=80=99 but we = could also > check the closure of real-world systems, for instance, to make sure it > doesn=E2=80=99t refer to ungrafted things. As I mention above, I've already done so for my own GNOME-based Guix system. > Do you know how this affects performance? I have not yet measured it, but subjectively, I do not notice any obvious difference in speed. I expect that the main performance impact is that large blocks of NULs will be processed more slowly by the current draft version of the new grafting code. That's because NULs can now be part of a nix hash, and therefore the new grafting code can only advance 1 byte position when seeing a NUL, whereas previously it would skip ahead 33 positions in that case. If desired, the handling of NULs could be made more efficient, at the cost of a bit more complexity. When seeing a NUL, we could check the adjacent bytes to see if this is part of a nix-base32 character in UTF-16 or UTF-32 encoding. If not, we could skip ahead. > Perhaps add short docstrings for clarity. Done. >> +(for-each >> + (lambda (char-size1) >> + (for-each >> + (lambda (char-size2) >> + (for-each >> + (lambda (gap) >> + (for-each >> + (lambda (offset) >> + (test-equal (format #f "replace-store-references, char-sizes= ~a ~a, gap ~s, offset ~a" >> + char-size1 char-size2 gap offset) [...] >> + ;; offsets to test >> + (map (lambda (i) (- buffer-size (* 40 char-size1) i)) >> + (iota 30)))) >> + ;; gaps >> + '("" "-" " " "a"))) >> + ;; char-size2 values to test >> + '(1 2))) >> + ;; char-size1 values to test >> + '(1 2 4)) > > For clarity, perhaps you can define a top-level procedure for the test > and call it from =E2=80=98for-each=E2=80=99. Good idea. I'd like to optimize the tests a bit before pushing this. They take a fairly long time to run, and lead to a huge 1.5G grafts.log file. It might be sufficient to avoid 'test-equal' here. > Modulo these very minor issues, it looks like it=E2=80=99s ready to go! Sounds good. Thanks for the review! Below, I've attached my current revision of this draft patch, which incorporates your suggestions regarding the main code. What remains is to improve the tests. Regards, Mark --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline; filename=0001-DRAFT-grafts-Support-rewriting-UTF-16-and-UTF-32-sto.patch Content-Transfer-Encoding: quoted-printable Content-Description: [PATCH] DRAFT: grafts: Support rewriting UTF-16 and UTF-32 store references From f3141eae346a66ff52c70708c87f880938bdbb24 Mon Sep 17 00:00:00 2001 From: Mark H Weaver Date: Fri, 2 Apr 2021 18:36:23 -0400 Subject: [PATCH] DRAFT: grafts: Support rewriting UTF-16 and UTF-32 store references. Partially fixes . * guix/build/graft.scm (replace-store-references): Add support for finding and rewriting UTF-16 and UTF-32 store references. * tests/grafts.scm: Add tests. --- guix/build/graft.scm | 281 +++++++++++++++++++++++++++++-------------- tests/grafts.scm | 68 +++++++++++ 2 files changed, 258 insertions(+), 91 deletions(-) diff --git a/guix/build/graft.scm b/guix/build/graft.scm index c119ee71d1..30be988587 100644 --- a/guix/build/graft.scm +++ b/guix/build/graft.scm @@ -1,6 +1,6 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright =C2=A9 2014, 2015, 2016, 2018 Ludovic Court=C3=A8s -;;; Copyright =C2=A9 2016 Mark H Weaver +;;; Copyright =C2=A9 2016, 2021 Mark H Weaver ;;; ;;; This file is part of GNU Guix. ;;; @@ -55,6 +55,52 @@ (string->char-set "0123456789abcdfghijklmnpqrsvwxyz") <>)) =20 +(define (nix-base32-char-or-nul? c) + "Return true if C is a nix-base32 character or NUL, otherwise return fal= se." + (or (nix-base32-char? c) + (char=3D? c #\nul))) + +(define (possible-utf16-hash? buffer i w) + "Return true if (I - W) is large enough to hold a UTF-16 encoded +nix-base32 hash and if BUFFER contains NULs in all positions where NULs +are to be expected in a UTF-16 encoded hash+dash pattern whose dash is +found at position I. Otherwise, return false." + (and (<=3D (* 2 hash-length) (- i w)) + (let loop ((j (+ 1 (- i (* 2 hash-length))))) + (or (>=3D j i) + (and (zero? (bytevector-u8-ref buffer j)) + (loop (+ j 2))))))) + +(define (possible-utf32-hash? buffer i w) + "Return true if (I - W) is large enough to hold a UTF-32 encoded +nix-base32 hash and if BUFFER contains NULs in all positions where NULs +are to be expected in a UTF-32 encoded hash+dash pattern whose dash is +found at position I. Otherwise, return false." + (and (<=3D (* 4 hash-length) (- i w)) + (let loop ((j (+ 1 (- i (* 4 hash-length))))) + (or (>=3D j i) + (and (zero? (bytevector-u8-ref buffer j)) + (zero? (bytevector-u8-ref buffer (+ j 1))) + (zero? (bytevector-u8-ref buffer (+ j 2))) + (loop (+ j 4))))))) + +(define (insert-nuls char-size bv) + "Given a bytevector BV, return a bytevector containing the same bytes but +with (CHAR-SIZE - 1) NULs inserted between every two adjacent bytes from B= V. +For example, (insert-nuls 4 #u8(1 2 3)) =3D> #u8(1 0 0 0 2 0 0 0 3)." + (if (=3D char-size 1) + bv + (let* ((len (bytevector-length bv)) + (bv* (make-bytevector (+ 1 (* char-size + (- len 1))) + 0))) + (let loop ((i 0)) + (when (< i len) + (bytevector-u8-set! bv* (* i char-size) + (bytevector-u8-ref bv i)) + (loop (+ i 1)))) + bv*))) + (define* (replace-store-references input output replacement-table #:optional (store (%store-directory))) "Read data from INPUT, replacing store references according to @@ -76,9 +122,9 @@ bytevectors to the same value." (list->vector (map pred (iota 256))) <>)) =20 - (define nix-base32-byte? + (define nix-base32-byte-or-nul? (optimize-u8-predicate - (compose nix-base32-char? + (compose nix-base32-char-or-nul? integer->char))) =20 (define (dash? byte) (=3D byte 45)) @@ -86,100 +132,153 @@ bytevectors to the same value." (define request-size (expt 2 20)) ; 1 MiB =20 ;; We scan the file for the following 33-byte pattern: 32 bytes of - ;; nix-base32 characters followed by a dash. To accommodate large files, - ;; we do not read the entire file, but instead work on buffers of up to - ;; 'request-size' bytes. To ensure that every 33-byte sequence appears - ;; entirely within exactly one buffer, adjacent buffers must overlap, - ;; i.e. they must share 32 byte positions. We accomplish this by - ;; "ungetting" the last 32 bytes of each buffer before reading the next - ;; buffer, unless we know that we've reached the end-of-file. + ;; nix-base32 characters followed by a dash. When we find such a pattern + ;; whose hash is in REPLACEMENT-TABLE, we perform the required rewrite a= nd + ;; continue scanning. + ;; + ;; To support UTF-16 and UTF-32 store references, the 33 bytes comprising + ;; this hash+dash pattern may optionally be interspersed by extra NUL by= tes. + ;; This simple approach works because the characters we are looking for = are + ;; restricted to ASCII. UTF-16 hashes are interspersed with single NUL + ;; bytes ("\0"), and UTF-32 hashes are interspersed with triplets of NULs + ;; ("\0\0\0"). Note that we require NULs to be present only *between* t= he + ;; other bytes, and not at either end, in order to be insensitive to byte + ;; order. + ;; + ;; To accommodate large files, we do not read the entire file at once, b= ut + ;; instead work on buffers of up to REQUEST-SIZE bytes. To ensure that + ;; every hash+dash pattern appears in its entirety in at least one buffe= r, + ;; adjacent buffers must overlap by one byte less than the maximum size = of a + ;; hash+dash pattern. We accomplish this by "ungetting" a suffix of each + ;; buffer before reading the next buffer, unless we know that we've reac= hed + ;; the end-of-file. (let ((buffer (make-bytevector request-size))) - (let loop () - ;; Note: We avoid 'get-bytevector-n' to work around - ;; . + (define-syntax-rule (byte-at i) + (bytevector-u8-ref buffer i)) + (let outer-loop () (match (get-bytevector-n! input buffer 0 request-size) ((? eof-object?) 'done) (end - ;; We scan the buffer for dashes that might be preceded by a - ;; nix-base32 hash. The key optimization here is that whenever we - ;; find a NON-nix-base32 character at position 'i', we know that = it - ;; cannot be part of a hash, so the earliest position where the n= ext - ;; hash could start is i+1 with the following dash at position i+= 33. - ;; - ;; Since nix-base32 characters comprise only 1/8 of the 256 possi= ble - ;; byte values, and exclude some of the most common letters in - ;; English text (e t o u), in practice we can advance by 33 posit= ions - ;; most of the time. - (let scan-from ((i hash-length) (written 0)) - ;; 'i' is the first position where we look for a dash. 'writte= n' - ;; is the number of bytes in the buffer that have already been - ;; written. + (define (scan-from i w) + ;; Scan the buffer for dashes that might be preceded by nix has= hes, + ;; where I is the minimum position where such a dash might be + ;; found, and W is the number of bytes in the buffer that have = been + ;; written so far. We assume that I - W >=3D HASH-LENGTH. + ;; + ;; The key optimization here is that whenever we find a byte at + ;; position I that cannot occur within a nix hash (because it's + ;; neither a nix-base32 character nor NUL), we can infer that t= he + ;; earliest position where the next hash could start is at I + = 1, + ;; and therefore the earliest position for the following dash is + ;; (+ I 1 HASH-LENGTH), which is I + 33. + ;; + ;; Since nix-base32-or-nul characters comprise only about 1/8 of + ;; the 256 possible byte values, and exclude some of the most + ;; common letters in English text (e t o u), we can advance 33 + ;; positions much of the time. (if (< i end) - (let ((byte (bytevector-u8-ref buffer i))) - (cond ((and (dash? byte) - ;; We've found a dash. Note that we do not k= now - ;; whether the preceeding 32 bytes are nix-ba= se32 - ;; characters, but we do not need to know. If - ;; they are not, the following lookup will fa= il. - (lookup-replacement - (string-tabulate (lambda (j) - (integer->char - (bytevector-u8-ref buffer - (+ j (- i hash-length))= ))) - hash-length))) - =3D> (lambda (replacement) - ;; We've found a hash that needs to be replac= ed. - ;; First, write out all bytes preceding the h= ash - ;; that have not yet been written. - (put-bytevector output buffer written - (- i hash-length written)) - ;; Now write the replacement string. - (put-bytevector output replacement) - ;; Since the byte at position 'i' is a dash, - ;; which is not a nix-base32 char, the earlie= st - ;; position where the next hash might start is - ;; i+1, and the earliest position where the - ;; following dash might start is (+ i 1 - ;; hash-length). Also, increase the write - ;; position to account for REPLACEMENT. - (let ((len (bytevector-length replacement))) - (scan-from (+ i 1 len) - (+ i (- len hash-length)))))) - ;; If the byte at position 'i' is a nix-base32 char, - ;; then the dash we're looking for might be as earl= y as - ;; the following byte, so we can only advance by 1. - ((nix-base32-byte? byte) - (scan-from (+ i 1) written)) - ;; If the byte at position 'i' is NOT a nix-base32 - ;; char, then the earliest position where the next = hash - ;; might start is i+1, with the following dash at - ;; position (+ i 1 hash-length). + (let ((byte (byte-at i))) + (cond ((dash? byte) + (found-dash i w)) + ((nix-base32-byte-or-nul? byte) + (scan-from (+ i 1) w)) (else - (scan-from (+ i 1 hash-length) written)))) - - ;; We have finished scanning the buffer. Now we determine = how - ;; many bytes have not yet been written, and how many bytes= to - ;; "unget". If 'end' is less than 'request-size' then we r= ead - ;; less than we asked for, which indicates that we are at E= OF, - ;; so we needn't unget anything. Otherwise, we unget up to - ;; 'hash-length' bytes (32 bytes). However, we must be car= eful - ;; not to unget bytes that have already been written, becau= se - ;; that would cause them to be written again from the next - ;; buffer. In practice, this case occurs when a replacemen= t is - ;; made near or beyond the end of the buffer. When REPLACE= MENT - ;; went beyond END, we consume the extra bytes from INPUT. - (begin - (if (> written end) - (get-bytevector-n! input buffer 0 (- written end)) - (let* ((unwritten (- end written)) - (unget-size (if (=3D end request-size) - (min hash-length unwritten) - 0)) - (write-size (- unwritten unget-size))) - (put-bytevector output buffer written write-size) - (unget-bytevector input buffer (+ written write-siz= e) - unget-size))) - (loop))))))))) + (not-part-of-hash i w)))) + (finish-buffer i w))) + + (define (not-part-of-hash i w) + ;; Position I is known to not be within a nix hash that we must + ;; rewrite. Therefore, the earliest position where the next ha= sh + ;; might start is I + 1, and therefore the earliest position of + ;; the following dash is (+ I 1 HASH-LENGTH). + (scan-from (+ i 1 hash-length) w)) + + (define (found-dash i w) + ;; We know that there is a dash '-' at position I, and that + ;; I >=3D HASH-LENGTH. The immediately preceding bytes *might* + ;; contain a nix-base32 hash, but that is not yet known. Here, + ;; we rule out all but one possible encoding (ASCII, UTF-16, + ;; UTF-32) by counting how many NULs precede the dash. + (cond ((not (zero? (byte-at (- i 1)))) + ;; The dash is *not* preceded by a NUL, therefore it + ;; cannot possibly be a UTF-16 or UTF-32 hash. Proceed + ;; to check for an ASCII hash. + (found-possible-hash 1 i w)) + + ((not (zero? (byte-at (- i 2)))) + ;; The dash is preceded by exactly one NUL, therefore it + ;; cannot be an ASCII or UTF-32 hash. Proceed to check + ;; for a UTF-16 hash. + (if (possible-utf16-hash? buffer i w) + (found-possible-hash 2 i w) + (not-part-of-hash i w))) + + (else + ;; The dash is preceded by at least two NULs, therefore + ;; it cannot be an ASCII or UTF-16 hash. Proceed to + ;; check for a UTF-32 hash. + (if (possible-utf32-hash? buffer i w) + (found-possible-hash 4 i w) + (not-part-of-hash i w))))) + + (define (found-possible-hash char-size i w) + ;; We know that there is a dash '-' at position I, that + ;; I >=3D CHAR-SIZE * HASH-LENGTH, and that the only possible + ;; encoding for the preceding hash is as indicated by + ;; CHAR-SIZE. Here we check to see if the given hash is in + ;; REPLACEMENT-TABLE, and if so, we perform the required + ;; rewrite. + (let* ((hash (string-tabulate + (lambda (j) + (integer->char + (byte-at (- i (* char-size + (- hash-length j)))))) + hash-length)) + (replacement* (lookup-replacement hash)) + (replacement (and replacement* + (insert-nuls char-size replacement*)))) + (cond + ((not replacement) + (not-part-of-hash i w)) + (else + ;; We've found a hash that needs to be replaced. + ;; First, write out all bytes preceding the hash + ;; that have not yet been written. + (put-bytevector output buffer w + (- i (* char-size hash-length) w)) + ;; Now write the replacement string. + (put-bytevector output replacement) + ;; Now compute the new values of W and I and continue. + (let ((w (+ (- i (* char-size hash-length)) + (bytevector-length replacement)))) + (scan-from (+ w hash-length) w)))))) + + (define (finish-buffer i w) + ;; We have finished scanning the buffer. Now we determine how = many + ;; bytes have not yet been written, and how many bytes to "unge= t". + ;; If END is less than REQUEST-SIZE then we read less than we a= sked + ;; for, which indicates that we are at EOF, so we needn't unget + ;; anything. Otherwise, we unget up to (* 4 HASH-LENGTH) bytes. + ;; However, we must be careful not to unget bytes that have alr= eady + ;; been written, because that would cause them to be written ag= ain + ;; from the next buffer. In practice, this case occurs when a + ;; replacement is made near or beyond the end of the buffer. W= hen + ;; REPLACEMENT went beyond END, we consume the extra bytes from + ;; INPUT. + (if (> w end) + (get-bytevector-n! input buffer 0 (- w end)) + (let* ((unwritten (- end w)) + (unget-size (if (=3D end request-size) + (min (* 4 hash-length) + unwritten) + 0)) + (write-size (- unwritten unget-size))) + (put-bytevector output buffer w write-size) + (unget-bytevector input buffer (+ w write-size) + unget-size))) + (outer-loop)) + + (scan-from hash-length 0)))))) =20 (define (rename-matching-files directory mapping) "Apply MAPPING to the names of all the files in DIRECTORY, where MAPPING= is diff --git a/tests/grafts.scm b/tests/grafts.scm index a12c6a5911..0e1c7355b1 100644 --- a/tests/grafts.scm +++ b/tests/grafts.scm @@ -1,5 +1,6 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright =C2=A9 2014, 2015, 2016, 2017, 2018, 2019 Ludovic Court=C3= =A8s +;;; Copyright =C2=A9 2021 Mark H Weaver ;;; ;;; This file is part of GNU Guix. ;;; @@ -468,4 +469,71 @@ replacement "/gnu/store"))))) =20 +(define (nul-expand str char-size) + (string-join (map string (string->list str)) + (make-string (- char-size 1) #\nul))) + +(for-each + (lambda (char-size1) + (for-each + (lambda (char-size2) + (for-each + (lambda (gap) + (for-each + (lambda (offset) + (test-equal (format #f "replace-store-references, char-sizes ~a= ~a, gap ~s, offset ~a" + char-size1 char-size2 gap offset) + (string-append (make-string offset #\=3D) + (nul-expand (string-append "/gnu/store/" + (make-string 32 #\6) + "-BlahBlaH") + char-size1) + gap + (nul-expand (string-append "/gnu/store/" + (make-string 32 #\8) + "-SoMeTHiNG") + char-size2) + (list->string (map integer->char (iota 77 33))= )) + + ;; Create input data where the right-hand-size of the dash ("= -something" + ;; here) goes beyond the end of the internal buffer of + ;; 'replace-store-references'. + (let* ((content (string-append (make-string offset #\=3D) + (nul-expand (string-append= "/gnu/store/" + = (make-string 32 #\5) + = "-blahblah") + char-size1) + gap + (nul-expand (string-append= "/gnu/store/" + = (make-string 32 #\7) + = "-something") + char-size2) + (list->string + (map integer->char (iota = 77 33))))) + (replacement (alist->vhash + `((,(make-string 32 #\5) + . ,(string->utf8 (string-append + (make-string 32 #\6) + "-BlahBlaH"))) + (,(make-string 32 #\7) + . ,(string->utf8 (string-append + (make-string 32 #\8) + "-SoMeTHiNG"))))))) + (call-with-output-string + (lambda (output) + ((@@ (guix build graft) replace-store-references) + (open-input-string content) output + replacement + "/gnu/store")))))) + ;; offsets to test + (map (lambda (i) (- buffer-size (* 40 char-size1) i)) + (iota 30)))) + ;; gaps + '("" "-" " " "a"))) + ;; char-size2 values to test + '(1 2))) + ;; char-size1 values to test + '(1 2 4)) + + (test-end) --=20 2.31.1 --=-=-=--