ludo@gnu.org (Ludovic Courtès) writes:
Toggle quote (42 lines)
> Mark H Weaver <mhw@netris.org> skribis:
>
>> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>>
>>> ludo@gnu.org (Ludovic Courtès) writes:
>>>
>>>> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>>>>
>>>>> I've encountered the following crash when trying to use substitute on a
>>>>> file which contains NUL characters:
>>>>
>>>> Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
>>>> which does not handle NULs.
>>>>
>>>> We should consider switching to the pure-Scheme SRFI-115:
>>>>
>>>> https://srfi.schemers.org/srfi-115/srfi-115.html
>>>
>>> This looks good, and I started looking into porting `substitute' to it,
>>> but quickly noticed it doesn't seem to be implemented in Guile yet?
>
> ISTR that the reference implementation works fine on Guile.
>
>> Indeed. SRFI-115 for Guile is on my TODO list, although it might be
>> better to wait until after we switch to using UTF-8 encoding internally
>> for strings, since that will drastically affect the implementation of
>> any efficient regexp matcher on Scheme strings.
>
> Indeed, though I suppose it doesn’t matter much for the cases where
> ‘substitute*’ is used?
>
>> Anyway, 'substitute*' is to be used only on text files, and NUL bytes
>> are not a valid textual character. So, I think that this case is
>> outside of what 'substitute*' is meant to do, and therefore not a bug in
>> 'substitute*', although of course a more graceful error would surely be
>> preferable.
>
> Yes, that’s also a good point.
>
> So yeah, I think it may be good “eventually” to switch to SRFI-115, but
> that’s not urgent.
Sorry for taking some time to answer; I was puzzled by the fact that my
repro didn't work when ran from the REPL. It seems the problem only
occurs when run inside Guix's build environment, maybe a side effect
which depends on the locale used?
In the `patch-el-files' phase of the emacs-build-system, we find the
following snippet:
(with-directory-excursion el-dir
;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
;; with the "ISO-8859-1" locale.
(unless (false-if-exception (substitute-cmd))
(with-fluids ((%default-port-encoding "ISO-8859-1"))
(substitute-cmd))))
In case an exception is returned while processing the file, it is
retried being opened with the "ISO-8859-1" encoding. Or, this resolves
to a call to `open-file', which documentation says:
‘b’
Use binary mode, ensuring that each byte in the file will be
read as one Scheme character.
To provide this property, the file will be opened with the
8-bit character encoding "ISO-8859-1", ignoring the default
port encoding. *Note Ports::, for more information on port
encodings.
So, by opening an file whose encoding is unknown as a ISO-8859-1 file,
we are doing the same as if we had passed the 'binary option. Could this
explain why we end up with NUL characters where we were expecting text?
To validate this hypothesis, I've added the following test message to
the patch-el-files phase:
(unless (false-if-exception (substitute-cmd))
(format (current-error-port) ">>> IS THIS IT? <<<")
(with-fluids ((%default-port-encoding "ISO-8859-1"))
(substitute-cmd))))
And re-ran the emacs-realgud build (minus the patch working around this
issue), and this is what I got:
Toggle snippet (41 lines)
starting phase `patch-el-files'
>>> IS THIS IT? <<<Backtrace:
15 (primitive-load "/gnu/store/xdmirz24yxpxzqh8xj83z9h0axm…")
In ice-9/eval.scm:
191:35 14 (_ _)
In srfi/srfi-1.scm:
863:16 13 (every1 #<procedure 9a9540 at /gnu/store/mz8vs1cxv1z7y…> …)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/gnu-build-system.scm:
684:27 12 (_ _)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/emacs-build-system.scm:
117:10 11 (patch-el-files #:outputs _)
In srfi/srfi-1.scm:
640:9 10 (for-each #<procedure substitute-one-file (file-name)> _)
In ice-9/boot-9.scm:
849:4 9 (with-throw-handler _ _ _)
In ice-9/ports.scm:
444:17 8 (call-with-input-file _ _ #:binary _ #:encoding _ # _)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/utils.scm:
609:26 7 (_ _)
635:26 6 (_ #<input: ./realgud/common/bp-image-data.el 17> #<inp…>)
In srfi/srfi-1.scm:
466:18 5 (fold #<procedure 7ffff53d1520 at /gnu/store/mz8vs1cxv…> …)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/utils.scm:
638:37 4 (_ _ "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\…")
In ice-9/regex.scm:
189:12 3 (list-matches _ _ _)
177:19 2 (fold-matches _ "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿ…" …)
In unknown file:
1 (regexp-exec #<regexp 81e400> "\"II*\x00(\x03\x00\x00ÿ…" …)
In ice-9/boot-9.scm:
760:25 0 (dispatch-exception _ _ _)
ice-9/boot-9.scm:760:25: In procedure dispatch-exception:
ice-9/boot-9.scm:760:25: string contains #\nul character: "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿÿÿÿÿÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿþÿ<<<<ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ<<<<ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ<<<<ÿÿÿÿ\x0f\x0f\x0f\x0fÿÿÿÿ\x0f\x0f\x0f\x0fÿÿÿÿ\x0f\x0f\x0f\x0fÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x14\x00\x00\x01\x03\x00\x01\x00\x00\x00\n"
builder for `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv' failed with exit code 1
@ build-failed /gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv - 1 builder for `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv' failed with exit code 1
guix build: error: build failed: build of
`/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv'
failed
So it is indeed triggered by switching to the "ISO-8859-1" encoding
(although I still cannot reproduce this from the REPL?).
If I remove the exception guard like this:
Toggle snippet (10 lines)
(with-directory-excursion el-dir
;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
;; with the "ISO-8859-1" locale.
- (unless (false-if-exception (substitute-cmd))
- (with-fluids ((%default-port-encoding "ISO-8859-1"))
- (substitute-cmd))))
+ (substitute-cmd))
#t))
the exception thrown on the first substitute* call is this:
Toggle snippet (33 lines)
starting phase `patch-el-files'
Backtrace:
12 (primitive-load "/gnu/store/dvyyqxfr08fsr18k2f43gakh23d…")
In ice-9/eval.scm:
191:35 11 (_ _)
In srfi/srfi-1.scm:
863:16 10 (every1 #<procedure 9a96a0 at /gnu/store/xn6p33hhfyz6l…> …)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/gnu-build-system.scm:
684:27 9 (_ _)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/emacs-build-system.scm:
104:27 8 (patch-el-files #:outputs _)
In srfi/srfi-1.scm:
640:9 7 (for-each #<procedure substitute-one-file (file-name)> _)
In ice-9/boot-9.scm:
849:4 6 (with-throw-handler _ _ _)
In ice-9/ports.scm:
444:17 5 (call-with-input-file _ _ #:binary _ #:encoding _ # _)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/utils.scm:
609:26 4 (_ _)
645:22 3 (_ #<input: ./realgud/common/bp-image-data.el 15> #<inp…>)
In ice-9/rdelim.scm:
195:24 2 (read-line _ _)
In unknown file:
1 (%read-line #<input: ./realgud/common/bp-image-data.el …>)
In ice-9/boot-9.scm:
760:25 0 (dispatch-exception _ _ _)
ice-9/boot-9.scm:760:25: In procedure dispatch-exception:
ice-9/boot-9.scm:760:25: Throw to key `decoding-error' with args
`("peek-char" "input decoding error" 84 #<input:
./realgud/common/bp-image-data.el 15>)'.
Should we keep my workaround for now? It seems there are valid cases to
have the file opened as "ISO-8859-1", but this can mean introducing
binary symbols such as NUL in the data (thus regexp crashes). When we
finally move to srfi-115, we should remove this workaround.
WDYT?
Here's an updated patch with Ludovic's suggestion:
From 573ecb3570355c47aa18c091c0a193e7d90a6949 Mon Sep 17 00:00:00 2001
NUL chars.
Fixes issue #30116.
* guix/build/utils.scm (substitute): Add condition to skip lines containing
the NUL character.
---
guix/build/utils.scm | 46 ++++++++++++++++++++++++++++------------------
1 file changed, 28 insertions(+), 18 deletions(-)
Toggle diff (70 lines)
diff --git a/guix/build/utils.scm b/guix/build/utils.scm
index 7391307c8..2a37dba06 100644
--- a/guix/build/utils.scm
+++ b/guix/build/utils.scm
@@ -3,6 +3,7 @@
;;; Copyright © 2013 Andreas Enge <andreas@enge.fr>
;;; Copyright © 2013 Nikita Karetnikov <nikita@karetnikov.org>
;;; Copyright © 2015 Mark H Weaver <mhw@netris.org>
+;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer@gmail.com>
;;;
;;; This file is part of GNU Guix.
;;;
@@ -621,28 +622,37 @@ PROC as (PROC LINE MATCHES); PROC must return the line that will be written as
a substitution of the original line. Be careful about using '$' to match the
end of a line; by itself it won't match the terminating newline of a line."
(let ((rx+proc (map (match-lambda
- (((? regexp? pattern) . proc)
- (cons pattern proc))
- ((pattern . proc)
- (cons (make-regexp pattern regexp/extended)
- proc)))
+ (((? regexp? pattern) . proc)
+ (cons pattern proc))
+ ((pattern . proc)
+ (cons (make-regexp pattern regexp/extended)
+ proc)))
pattern+procs)))
(with-atomic-file-replacement file
(lambda (in out)
(let loop ((line (read-line in 'concat)))
- (if (eof-object? line)
- #t
- (let ((line (fold (lambda (r+p line)
- (match r+p
- ((regexp . proc)
- (match (list-matches regexp line)
- ((and m+ (_ _ ...))
- (proc line m+))
- (_ line)))))
- line
- rx+proc)))
- (display line out)
- (loop (read-line in 'concat)))))))))
+ (cond
+ ((eof-object? line)
+ #t)
+ ((string-index line #\nul)
+ ;; The regexp functions of the GNU C library (which Guile uses)
+ ;; cannot deal with NUL characters, so skip to the next line.
+ ;; TODO: Port to srfi-115, once we have it implemented in Guile.
+ (format (current-error-port)
+ "skipping line with NUL characters: ~s\n" line)
+ (loop (read-line in 'concat)))
+ (else
+ (let ((line (fold (lambda (r+p line)
+ (match r+p
+ ((regexp . proc)
+ (match (list-matches regexp line)
+ ((and m+ (_ _ ...))
+ (proc line m+))
+ (_ line)))))
+ line
+ rx+proc)))
+ (display line out)
+ (loop (read-line in 'concat))))))))))
(define-syntax let-matches
--
2.16.0