[PATCH] `substitute' crashes when file contains NUL characters (core-updates)

  • Done
  • quality assurance status badge
Details
3 participants
  • Ludovic Courtès
  • Maxim Cournoyer
  • Mark H Weaver
Owner
unassigned
Submitted by
Maxim Cournoyer
Severity
normal
M
M
Maxim Cournoyer wrote on 15 Jan 2018 02:27
(name . bug-guix)(address . bug-guix@gnu.org)
87r2qrc3mq.fsf@gmail.com
Hello,
I've encountered the following crash when trying to use substitute on a
file which contains NUL characters:
Toggle snippet (28 lines)
(define problematic-file "/tmp/bp-image-data.el")
scheme@(guix build utils)> ,m (guix build utils)
scheme@(guix build utils)> (substitute* problematic-file
(("toto") "tata"))
ice-9/boot-9.scm:752:25: In procedure dispatch-exception:
string contains #\nul character: "\"II*\x00(\x03\x00\x00??????????@@@@????\x04\x04\x04\x04????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x04\x04\x04\x04????BBBB????????????????????@@@@????\x04\x04\x04\x04????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x04\x04\x04\x04????BBBB????????????\x04\x04\x04\x04????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x04\x04\x04\x04????BBBB????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x10\x10\x10\x10????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x10\x10\x10\x10????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x10\x10\x10\x10????\x04\x04\x04\x04????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x04\x04\x04\x04????>>>>????<<<<????\x04\x04\x04\x04????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x04\x04\x04\x04????>>>>????????????????????<<<<????\x04\x04\x04\x04????\x01\x01\x01\x01????\x01\x01\x01\x01????\x01\x01\x01\x01????\x04\x04\x04\x04????>>>>????????????????????????????????????<<<<????\x0f\x0f\x0f\x0f????\x0f\x0f\x0f\x0f????\x0f\x0f\x0f\x0f????>>>>??????????????????????????\x14\x00\x00\x01\x03\x00\x01\x00\x00\x00\n"
Entering a new prompt. Type `,bt' for a backtrace or `,q' to continue.
scheme@(guix build utils) [1]> ,bt
In ice-9/boot-9.scm:
841:4 9 (with-throw-handler _ _ _)
In ice-9/ports.scm:
444:17 8 (call-with-input-file _ _ #:binary _ #:encoding _ #:guess-encoding _)
In guix/build/utils.scm:
609:26 7 (_ _)
635:26 6 (_ #<input: /tmp/bp-image-data.el 14> #<input-output: /tmp/bp-image-data.el.qVytzo 13>)
In srfi/srfi-1.scm:
466:18 5 (fold #<procedure 7f29b8929520 at guix/build/utils.scm:635:32 (r+p line)> "\"II*\x00(\x03\x00\x00???…" …)
In guix/build/utils.scm:
638:37 4 (_ _ "\"II*\x00(\x03\x00\x00??????????@@@@????\x04\x04\x04\x04????\x01\x01\x01\x01????\x01\x01\x01\x0…")
In ice-9/regex.scm:
189:12 3 (list-matches _ _ _)
177:19 2 (fold-matches _ "\"II*\x00(\x03\x00\x00??????????@@@@????\x04\x04\x04\x04????\x01\x01\x01\x01????\x0…" …)
In unknown file:
1 (regexp-exec #<regexp 51f3bc0> "\"II*\x00(\x03\x00\x00??????????@@@@????\x04\x04\x04\x04????\x01\x01…" …)
In ice-9/boot-9.scm:
752:25 0 (dispatch-exception _ _ _)
That file comes from emacs-realgud, which I'm attempting to package:
This was discovered when the patch-el-files phase of the
emacs-build-system crashed as above when it called substitute*.
Patch to follow.
Maxim
M
M
Maxim Cournoyer wrote on 15 Jan 2018 02:38
[PATCH] `substitute' crashes when file contains NUL characters (core-updates))
(address . 30116@debbugs.gnu.org)
87k1wjc35d.fsf_-_@gmail.com
From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Sun, 14 Jan 2018 20:31:33 -0500
Subject: [PATCH] utils: Prevent substitute from crashing on files containing
NUL chars.

Fixes issue #30116.

* guix/build/utils.scm (substitute): Add condition to skip lines containing
the NUL character.
---
guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------
1 file changed, 26 insertions(+), 18 deletions(-)

Toggle diff (68 lines)
diff --git a/guix/build/utils.scm b/guix/build/utils.scm
index 7391307c8..975f4e70a 100644
--- a/guix/build/utils.scm
+++ b/guix/build/utils.scm
@@ -3,6 +3,7 @@
;;; Copyright © 2013 Andreas Enge <andreas@enge.fr>
;;; Copyright © 2013 Nikita Karetnikov <nikita@karetnikov.org>
;;; Copyright © 2015 Mark H Weaver <mhw@netris.org>
+;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer@gmail.com>
;;;
;;; This file is part of GNU Guix.
;;;
@@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the line that will be written as
a substitution of the original line. Be careful about using '$' to match the
end of a line; by itself it won't match the terminating newline of a line."
(let ((rx+proc (map (match-lambda
- (((? regexp? pattern) . proc)
- (cons pattern proc))
- ((pattern . proc)
- (cons (make-regexp pattern regexp/extended)
- proc)))
+ (((? regexp? pattern) . proc)
+ (cons pattern proc))
+ ((pattern . proc)
+ (cons (make-regexp pattern regexp/extended)
+ proc)))
pattern+procs)))
(with-atomic-file-replacement file
(lambda (in out)
(let loop ((line (read-line in 'concat)))
- (if (eof-object? line)
- #t
- (let ((line (fold (lambda (r+p line)
- (match r+p
- ((regexp . proc)
- (match (list-matches regexp line)
- ((and m+ (_ _ ...))
- (proc line m+))
- (_ line)))))
- line
- rx+proc)))
- (display line out)
- (loop (read-line in 'concat)))))))))
+ (cond
+ ((eof-object? line)
+ #t)
+ ((string-contains line (make-string 1 #\nul))
+ ;; The regexp functions of the GNU C library (which Guile uses)
+ ;; cannot deal with NUL characters, so skip to the next line.
+ (format #t "skipping line with NUL characters: ~s\n" line)
+ (loop (read-line in 'concat)))
+ (else
+ (let ((line (fold (lambda (r+p line)
+ (match r+p
+ ((regexp . proc)
+ (match (list-matches regexp line)
+ ((and m+ (_ _ ...))
+ (proc line m+))
+ (_ line)))))
+ line
+ rx+proc)))
+ (display line out)
+ (loop (read-line in 'concat))))))))))
(define-syntax let-matches
--
2.15.1
L
L
Ludovic Courtès wrote on 16 Jan 2018 12:23
Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates)
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 30116@debbugs.gnu.org)
87o9lu6o9m.fsf@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (3 lines)
> I've encountered the following crash when trying to use substitute on a
> file which contains NUL characters:

Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
which does not handle NULs.

We should consider switching to the pure-Scheme SRFI-115:


Ludo’.
L
L
Ludovic Courtès wrote on 17 Jan 2018 15:37
Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates))
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 30116@debbugs.gnu.org)
87h8rk4kl9.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (11 lines)
> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
> Date: Sun, 14 Jan 2018 20:31:33 -0500
> Subject: [PATCH] utils: Prevent substitute from crashing on files containing
> NUL chars.
>
> Fixes issue #30116.
>
> * guix/build/utils.scm (substitute): Add condition to skip lines containing
> the NUL character.

[...]

Toggle quote (2 lines)
> + ((string-contains line (make-string 1 #\nul))

Rather (string-index line #\nul).

Toggle quote (5 lines)
> + ;; The regexp functions of the GNU C library (which Guile uses)
> + ;; cannot deal with NUL characters, so skip to the next line.
> + (format #t "skipping line with NUL characters: ~s\n" line)
> + (loop (read-line in 'concat)))

Rather (format (current-error-port) …).

It’s strange semantics, but it’s probably better than crashing in the
contexts where we use it.

Otherwise LGTM. This would have to go to the next ‘core-updates’ (or
‘core-updates-next’ in the meantime.)

Thanks!

Ludo’.
M
M
Maxim Cournoyer wrote on 21 Jan 2018 05:24
Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates)
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 30116@debbugs.gnu.org)
87607vu9dp.fsf@gmail.com
ludo@gnu.org (Ludovic Courtès) writes:

Toggle quote (14 lines)
> Hi,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> I've encountered the following crash when trying to use substitute on a
>> file which contains NUL characters:
>
> Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
> which does not handle NULs.
>
> We should consider switching to the pure-Scheme SRFI-115:
>
> https://srfi.schemers.org/srfi-115/srfi-115.html

This looks good, and I started looking into porting `substitute' to it,
but quickly noticed it doesn't seem to be implemented in Guile yet?

Thanks,

Maxim
M
M
Mark H Weaver wrote on 21 Jan 2018 19:17
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
877esb84ae.fsf@netris.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (17 lines)
> ludo@gnu.org (Ludovic Courtès) writes:
>
>> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>>
>>> I've encountered the following crash when trying to use substitute on a
>>> file which contains NUL characters:
>>
>> Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
>> which does not handle NULs.
>>
>> We should consider switching to the pure-Scheme SRFI-115:
>>
>> https://srfi.schemers.org/srfi-115/srfi-115.html
>
> This looks good, and I started looking into porting `substitute' to it,
> but quickly noticed it doesn't seem to be implemented in Guile yet?

Indeed. SRFI-115 for Guile is on my TODO list, although it might be
better to wait until after we switch to using UTF-8 encoding internally
for strings, since that will drastically affect the implementation of
any efficient regexp matcher on Scheme strings.

Anyway, 'substitute*' is to be used only on text files, and NUL bytes
are not a valid textual character. So, I think that this case is
outside of what 'substitute*' is meant to do, and therefore not a bug in
'substitute*', although of course a more graceful error would surely be
preferable.

What do you think?

Mark
L
L
Ludovic Courtès wrote on 22 Jan 2018 11:58
(name . Mark H Weaver)(address . mhw@netris.org)
87o9lmp3c2.fsf@gnu.org
Mark H Weaver <mhw@netris.org> skribis:

Toggle quote (19 lines)
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> ludo@gnu.org (Ludovic Courtès) writes:
>>
>>> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>>>
>>>> I've encountered the following crash when trying to use substitute on a
>>>> file which contains NUL characters:
>>>
>>> Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
>>> which does not handle NULs.
>>>
>>> We should consider switching to the pure-Scheme SRFI-115:
>>>
>>> https://srfi.schemers.org/srfi-115/srfi-115.html
>>
>> This looks good, and I started looking into porting `substitute' to it,
>> but quickly noticed it doesn't seem to be implemented in Guile yet?

ISTR that the reference implementation works fine on Guile.

Toggle quote (5 lines)
> Indeed. SRFI-115 for Guile is on my TODO list, although it might be
> better to wait until after we switch to using UTF-8 encoding internally
> for strings, since that will drastically affect the implementation of
> any efficient regexp matcher on Scheme strings.

Indeed, though I suppose it doesn’t matter much for the cases where
‘substitute*’ is used?

Toggle quote (6 lines)
> Anyway, 'substitute*' is to be used only on text files, and NUL bytes
> are not a valid textual character. So, I think that this case is
> outside of what 'substitute*' is meant to do, and therefore not a bug in
> 'substitute*', although of course a more graceful error would surely be
> preferable.

Yes, that’s also a good point.

So yeah, I think it may be good “eventually” to switch to SRFI-115, but
that’s not urgent.

Thoughts?

Ludo’.
M
M
Maxim Cournoyer wrote on 23 Jan 2018 05:27
(name . Ludovic Courtès)(address . ludo@gnu.org)
87k1w9ryhz.fsf@gmail.com
ludo@gnu.org (Ludovic Courtès) writes:

Toggle quote (42 lines)
> Mark H Weaver <mhw@netris.org> skribis:
>
>> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>>
>>> ludo@gnu.org (Ludovic Courtès) writes:
>>>
>>>> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>>>>
>>>>> I've encountered the following crash when trying to use substitute on a
>>>>> file which contains NUL characters:
>>>>
>>>> Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
>>>> which does not handle NULs.
>>>>
>>>> We should consider switching to the pure-Scheme SRFI-115:
>>>>
>>>> https://srfi.schemers.org/srfi-115/srfi-115.html
>>>
>>> This looks good, and I started looking into porting `substitute' to it,
>>> but quickly noticed it doesn't seem to be implemented in Guile yet?
>
> ISTR that the reference implementation works fine on Guile.
>
>> Indeed. SRFI-115 for Guile is on my TODO list, although it might be
>> better to wait until after we switch to using UTF-8 encoding internally
>> for strings, since that will drastically affect the implementation of
>> any efficient regexp matcher on Scheme strings.
>
> Indeed, though I suppose it doesn’t matter much for the cases where
> ‘substitute*’ is used?
>
>> Anyway, 'substitute*' is to be used only on text files, and NUL bytes
>> are not a valid textual character. So, I think that this case is
>> outside of what 'substitute*' is meant to do, and therefore not a bug in
>> 'substitute*', although of course a more graceful error would surely be
>> preferable.
>
> Yes, that’s also a good point.
>
> So yeah, I think it may be good “eventually” to switch to SRFI-115, but
> that’s not urgent.

Sorry for taking some time to answer; I was puzzled by the fact that my
repro didn't work when ran from the REPL. It seems the problem only
occurs when run inside Guix's build environment, maybe a side effect
which depends on the locale used?

In the `patch-el-files' phase of the emacs-build-system, we find the
following snippet:

(with-directory-excursion el-dir
;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
;; with the "ISO-8859-1" locale.
(unless (false-if-exception (substitute-cmd))
(with-fluids ((%default-port-encoding "ISO-8859-1"))
(substitute-cmd))))

In case an exception is returned while processing the file, it is
retried being opened with the "ISO-8859-1" encoding. Or, this resolves
to a call to `open-file', which documentation says:

‘b’
Use binary mode, ensuring that each byte in the file will be
read as one Scheme character.

To provide this property, the file will be opened with the
8-bit character encoding "ISO-8859-1", ignoring the default
port encoding. *Note Ports::, for more information on port
encodings.

So, by opening an file whose encoding is unknown as a ISO-8859-1 file,
we are doing the same as if we had passed the 'binary option. Could this
explain why we end up with NUL characters where we were expecting text?

To validate this hypothesis, I've added the following test message to
the patch-el-files phase:

(unless (false-if-exception (substitute-cmd))
(format (current-error-port) ">>> IS THIS IT? <<<")
(with-fluids ((%default-port-encoding "ISO-8859-1"))
(substitute-cmd))))

And re-ran the emacs-realgud build (minus the patch working around this
issue), and this is what I got:

Toggle snippet (41 lines)
starting phase `patch-el-files'
>>> IS THIS IT? <<<Backtrace:
15 (primitive-load "/gnu/store/xdmirz24yxpxzqh8xj83z9h0axm…")
In ice-9/eval.scm:
191:35 14 (_ _)
In srfi/srfi-1.scm:
863:16 13 (every1 #<procedure 9a9540 at /gnu/store/mz8vs1cxv1z7y…> …)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/gnu-build-system.scm:
684:27 12 (_ _)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/emacs-build-system.scm:
117:10 11 (patch-el-files #:outputs _)
In srfi/srfi-1.scm:
640:9 10 (for-each #<procedure substitute-one-file (file-name)> _)
In ice-9/boot-9.scm:
849:4 9 (with-throw-handler _ _ _)
In ice-9/ports.scm:
444:17 8 (call-with-input-file _ _ #:binary _ #:encoding _ # _)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/utils.scm:
609:26 7 (_ _)
635:26 6 (_ #<input: ./realgud/common/bp-image-data.el 17> #<inp…>)
In srfi/srfi-1.scm:
466:18 5 (fold #<procedure 7ffff53d1520 at /gnu/store/mz8vs1cxv…> …)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/utils.scm:
638:37 4 (_ _ "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\…")
In ice-9/regex.scm:
189:12 3 (list-matches _ _ _)
177:19 2 (fold-matches _ "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿ…" …)
In unknown file:
1 (regexp-exec #<regexp 81e400> "\"II*\x00(\x03\x00\x00ÿ…" …)
In ice-9/boot-9.scm:
760:25 0 (dispatch-exception _ _ _)

ice-9/boot-9.scm:760:25: In procedure dispatch-exception:
ice-9/boot-9.scm:760:25: string contains #\nul character: "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿÿÿÿÿÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿþÿ<<<<ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ<<<<ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ<<<<ÿÿÿÿ\x0f\x0f\x0f\x0fÿÿÿÿ\x0f\x0f\x0f\x0fÿÿÿÿ\x0f\x0f\x0f\x0fÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x14\x00\x00\x01\x03\x00\x01\x00\x00\x00\n"
builder for `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv' failed with exit code 1
@ build-failed /gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv - 1 builder for `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv' failed with exit code 1
guix build: error: build failed: build of
`/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv'
failed

So it is indeed triggered by switching to the "ISO-8859-1" encoding
(although I still cannot reproduce this from the REPL?).

If I remove the exception guard like this:

Toggle snippet (10 lines)
(with-directory-excursion el-dir
;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
;; with the "ISO-8859-1" locale.
- (unless (false-if-exception (substitute-cmd))
- (with-fluids ((%default-port-encoding "ISO-8859-1"))
- (substitute-cmd))))
+ (substitute-cmd))
#t))

the exception thrown on the first substitute* call is this:

Toggle snippet (33 lines)
starting phase `patch-el-files'
Backtrace:
12 (primitive-load "/gnu/store/dvyyqxfr08fsr18k2f43gakh23d…")
In ice-9/eval.scm:
191:35 11 (_ _)
In srfi/srfi-1.scm:
863:16 10 (every1 #<procedure 9a96a0 at /gnu/store/xn6p33hhfyz6l…> …)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/gnu-build-system.scm:
684:27 9 (_ _)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/emacs-build-system.scm:
104:27 8 (patch-el-files #:outputs _)
In srfi/srfi-1.scm:
640:9 7 (for-each #<procedure substitute-one-file (file-name)> _)
In ice-9/boot-9.scm:
849:4 6 (with-throw-handler _ _ _)
In ice-9/ports.scm:
444:17 5 (call-with-input-file _ _ #:binary _ #:encoding _ # _)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/utils.scm:
609:26 4 (_ _)
645:22 3 (_ #<input: ./realgud/common/bp-image-data.el 15> #<inp…>)
In ice-9/rdelim.scm:
195:24 2 (read-line _ _)
In unknown file:
1 (%read-line #<input: ./realgud/common/bp-image-data.el …>)
In ice-9/boot-9.scm:
760:25 0 (dispatch-exception _ _ _)

ice-9/boot-9.scm:760:25: In procedure dispatch-exception:
ice-9/boot-9.scm:760:25: Throw to key `decoding-error' with args
`("peek-char" "input decoding error" 84 #<input:
./realgud/common/bp-image-data.el 15>)'.

Should we keep my workaround for now? It seems there are valid cases to
have the file opened as "ISO-8859-1", but this can mean introducing
binary symbols such as NUL in the data (thus regexp crashes). When we
finally move to srfi-115, we should remove this workaround.

WDYT?

Here's an updated patch with Ludovic's suggestion:
From 573ecb3570355c47aa18c091c0a193e7d90a6949 Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Sun, 14 Jan 2018 20:31:33 -0500
Subject: [PATCH] utils: Prevent substitute from crashing on files containing
NUL chars.

Fixes issue #30116.

* guix/build/utils.scm (substitute): Add condition to skip lines containing
the NUL character.
---
guix/build/utils.scm | 46 ++++++++++++++++++++++++++++------------------
1 file changed, 28 insertions(+), 18 deletions(-)

Toggle diff (70 lines)
diff --git a/guix/build/utils.scm b/guix/build/utils.scm
index 7391307c8..2a37dba06 100644
--- a/guix/build/utils.scm
+++ b/guix/build/utils.scm
@@ -3,6 +3,7 @@
;;; Copyright © 2013 Andreas Enge <andreas@enge.fr>
;;; Copyright © 2013 Nikita Karetnikov <nikita@karetnikov.org>
;;; Copyright © 2015 Mark H Weaver <mhw@netris.org>
+;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer@gmail.com>
;;;
;;; This file is part of GNU Guix.
;;;
@@ -621,28 +622,37 @@ PROC as (PROC LINE MATCHES); PROC must return the line that will be written as
a substitution of the original line. Be careful about using '$' to match the
end of a line; by itself it won't match the terminating newline of a line."
(let ((rx+proc (map (match-lambda
- (((? regexp? pattern) . proc)
- (cons pattern proc))
- ((pattern . proc)
- (cons (make-regexp pattern regexp/extended)
- proc)))
+ (((? regexp? pattern) . proc)
+ (cons pattern proc))
+ ((pattern . proc)
+ (cons (make-regexp pattern regexp/extended)
+ proc)))
pattern+procs)))
(with-atomic-file-replacement file
(lambda (in out)
(let loop ((line (read-line in 'concat)))
- (if (eof-object? line)
- #t
- (let ((line (fold (lambda (r+p line)
- (match r+p
- ((regexp . proc)
- (match (list-matches regexp line)
- ((and m+ (_ _ ...))
- (proc line m+))
- (_ line)))))
- line
- rx+proc)))
- (display line out)
- (loop (read-line in 'concat)))))))))
+ (cond
+ ((eof-object? line)
+ #t)
+ ((string-index line #\nul)
+ ;; The regexp functions of the GNU C library (which Guile uses)
+ ;; cannot deal with NUL characters, so skip to the next line.
+ ;; TODO: Port to srfi-115, once we have it implemented in Guile.
+ (format (current-error-port)
+ "skipping line with NUL characters: ~s\n" line)
+ (loop (read-line in 'concat)))
+ (else
+ (let ((line (fold (lambda (r+p line)
+ (match r+p
+ ((regexp . proc)
+ (match (list-matches regexp line)
+ ((and m+ (_ _ ...))
+ (proc line m+))
+ (_ line)))))
+ line
+ rx+proc)))
+ (display line out)
+ (loop (read-line in 'concat))))))))))
(define-syntax let-matches
--
2.16.0
Maxim
L
L
Ludovic Courtès wrote on 23 Jan 2018 15:11
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87o9lk64xz.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (27 lines)
> In the `patch-el-files' phase of the emacs-build-system, we find the
> following snippet:
>
> (with-directory-excursion el-dir
> ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
> ;; with the "ISO-8859-1" locale.
> (unless (false-if-exception (substitute-cmd))
> (with-fluids ((%default-port-encoding "ISO-8859-1"))
> (substitute-cmd))))
>
> In case an exception is returned while processing the file, it is
> retried being opened with the "ISO-8859-1" encoding. Or, this resolves
> to a call to `open-file', which documentation says:
>
> ‘b’
> Use binary mode, ensuring that each byte in the file will be
> read as one Scheme character.
>
> To provide this property, the file will be opened with the
> 8-bit character encoding "ISO-8859-1", ignoring the default
> port encoding. *Note Ports::, for more information on port
> encodings.
>
> So, by opening an file whose encoding is unknown as a ISO-8859-1 file,
> we are doing the same as if we had passed the 'binary option. Could this
> explain why we end up with NUL characters where we were expecting text?

That could be the reason. Guile provides a way to honor Emacs-style
‘encoding’ declarations, and ‘call-with-input-file’ does that if we pass
#:guess-encoding #t (info "(guile) Character Encoding of Source Files").

Did the faulty file have such a declaration?

Thanks,
Ludo’.
M
M
Maxim Cournoyer wrote on 25 Jan 2018 06:11
(name . Ludovic Courtès)(address . ludo@gnu.org)
87po5y8qv5.fsf@gmail.com
ludo@gnu.org (Ludovic Courtès) writes:

Toggle quote (35 lines)
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> In the `patch-el-files' phase of the emacs-build-system, we find the
>> following snippet:
>>
>> (with-directory-excursion el-dir
>> ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
>> ;; with the "ISO-8859-1" locale.
>> (unless (false-if-exception (substitute-cmd))
>> (with-fluids ((%default-port-encoding "ISO-8859-1"))
>> (substitute-cmd))))
>>
>> In case an exception is returned while processing the file, it is
>> retried being opened with the "ISO-8859-1" encoding. Or, this resolves
>> to a call to `open-file', which documentation says:
>>
>> ‘b’
>> Use binary mode, ensuring that each byte in the file will be
>> read as one Scheme character.
>>
>> To provide this property, the file will be opened with the
>> 8-bit character encoding "ISO-8859-1", ignoring the default
>> port encoding. *Note Ports::, for more information on port
>> encodings.
>>
>> So, by opening an file whose encoding is unknown as a ISO-8859-1 file,
>> we are doing the same as if we had passed the 'binary option. Could this
>> explain why we end up with NUL characters where we were expecting text?
>
> That could be the reason. Guile provides a way to honor Emacs-style
> ‘encoding’ declarations, and ‘call-with-input-file’ does that if we pass
> #:guess-encoding #t (info "(guile) Character Encoding of Source Files").
>
> Did the faulty file have such a declaration?

Sadly, it doesn't. Although even if it did, I don't think it would be
very robust to expect every misbehaving files we might encounter to
include one!

So I think we should apply my v2 patch to core-updates for now (see my
previous reply on this thread), until we have our substitute routine
implemented using srfi-115!

Thanks,

Maxim
L
L
Ludovic Courtès wrote on 25 Jan 2018 12:11
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87shauxkf1.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (41 lines)
> ludo@gnu.org (Ludovic Courtès) writes:
>
>> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>>
>>> In the `patch-el-files' phase of the emacs-build-system, we find the
>>> following snippet:
>>>
>>> (with-directory-excursion el-dir
>>> ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
>>> ;; with the "ISO-8859-1" locale.
>>> (unless (false-if-exception (substitute-cmd))
>>> (with-fluids ((%default-port-encoding "ISO-8859-1"))
>>> (substitute-cmd))))
>>>
>>> In case an exception is returned while processing the file, it is
>>> retried being opened with the "ISO-8859-1" encoding. Or, this resolves
>>> to a call to `open-file', which documentation says:
>>>
>>> ‘b’
>>> Use binary mode, ensuring that each byte in the file will be
>>> read as one Scheme character.
>>>
>>> To provide this property, the file will be opened with the
>>> 8-bit character encoding "ISO-8859-1", ignoring the default
>>> port encoding. *Note Ports::, for more information on port
>>> encodings.
>>>
>>> So, by opening an file whose encoding is unknown as a ISO-8859-1 file,
>>> we are doing the same as if we had passed the 'binary option. Could this
>>> explain why we end up with NUL characters where we were expecting text?
>>
>> That could be the reason. Guile provides a way to honor Emacs-style
>> ‘encoding’ declarations, and ‘call-with-input-file’ does that if we pass
>> #:guess-encoding #t (info "(guile) Character Encoding of Source Files").
>>
>> Did the faulty file have such a declaration?
>
> Sadly, it doesn't. Although even if it did, I don't think it would be
> very robust to expect every misbehaving files we might encounter to
> include one!

Sure, I was asking just because it’s an Emacs-related package.

Toggle quote (4 lines)
> So I think we should apply my v2 patch to core-updates for now (see my
> previous reply on this thread), until we have our substitute routine
> implemented using srfi-115!

Sounds good! Note that I’ll wait until after the current ‘core-updates’
has been merged. Please do ping me if you think I’ve forgotten!

Thanks,
Ludo’.
M
M
Maxim Cournoyer wrote on 14 Jun 2018 03:40
Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates))
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 30116@debbugs.gnu.org)
877en2taam.fsf@gmail.com
ludo@gnu.org (Ludovic Courtès) writes:

Toggle quote (36 lines)
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
>> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
>> Date: Sun, 14 Jan 2018 20:31:33 -0500
>> Subject: [PATCH] utils: Prevent substitute from crashing on files containing
>> NUL chars.
>>
>> Fixes issue #30116.
>>
>> * guix/build/utils.scm (substitute): Add condition to skip lines containing
>> the NUL character.
>
> [...]
>
>> + ((string-contains line (make-string 1 #\nul))
>
> Rather (string-index line #\nul).
>
>> + ;; The regexp functions of the GNU C library (which Guile uses)
>> + ;; cannot deal with NUL characters, so skip to the next line.
>> + (format #t "skipping line with NUL characters: ~s\n" line)
>> + (loop (read-line in 'concat)))
>
> Rather (format (current-error-port) …).
>
> It’s strange semantics, but it’s probably better than crashing in the
> contexts where we use it.
>
> Otherwise LGTM. This would have to go to the next ‘core-updates’ (or
> ‘core-updates-next’ in the meantime.)
>
> Thanks!
>
> Ludo’.

Ping. Is it the right time to merge this?

Maxim
M
M
Mark H Weaver wrote on 14 Jun 2018 09:02
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87r2l9q294.fsf@netris.org
Hi Maxim,

Thanks for working on this. I found a problem with this patch,
and I also have a suggestion. Please see below.

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (66 lines)
> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
> Date: Sun, 14 Jan 2018 20:31:33 -0500
> Subject: [PATCH] utils: Prevent substitute from crashing on files containing
> NUL chars.
>
> Fixes issue #30116.
>
> * guix/build/utils.scm (substitute): Add condition to skip lines containing
> the NUL character.
> ---
> guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------
> 1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/guix/build/utils.scm b/guix/build/utils.scm
> index 7391307c8..975f4e70a 100644
> --- a/guix/build/utils.scm
> +++ b/guix/build/utils.scm
> @@ -3,6 +3,7 @@
> ;;; Copyright © 2013 Andreas Enge <andreas@enge.fr>
> ;;; Copyright © 2013 Nikita Karetnikov <nikita@karetnikov.org>
> ;;; Copyright © 2015 Mark H Weaver <mhw@netris.org>
> +;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer@gmail.com>
> ;;;
> ;;; This file is part of GNU Guix.
> ;;;
> @@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the line that will be written as
> a substitution of the original line. Be careful about using '$' to match the
> end of a line; by itself it won't match the terminating newline of a line."
> (let ((rx+proc (map (match-lambda
> - (((? regexp? pattern) . proc)
> - (cons pattern proc))
> - ((pattern . proc)
> - (cons (make-regexp pattern regexp/extended)
> - proc)))
> + (((? regexp? pattern) . proc)
> + (cons pattern proc))
> + ((pattern . proc)
> + (cons (make-regexp pattern regexp/extended)
> + proc)))
> pattern+procs)))
> (with-atomic-file-replacement file
> (lambda (in out)
> (let loop ((line (read-line in 'concat)))
> - (if (eof-object? line)
> - #t
> - (let ((line (fold (lambda (r+p line)
> - (match r+p
> - ((regexp . proc)
> - (match (list-matches regexp line)
> - ((and m+ (_ _ ...))
> - (proc line m+))
> - (_ line)))))
> - line
> - rx+proc)))
> - (display line out)
> - (loop (read-line in 'concat)))))))))
> + (cond
> + ((eof-object? line)
> + #t)
> + ((string-contains line (make-string 1 #\nul))
> + ;; The regexp functions of the GNU C library (which Guile uses)
> + ;; cannot deal with NUL characters, so skip to the next line.
> + (format #t "skipping line with NUL characters: ~s\n" line)
> + (loop (read-line in 'concat)))

This code will unconditionally *delete* all lines that contain NULs.

This will happen because the lines with NULs are not being written to
the output file, which will replace the original file when this loop
reaches EOF. So, any lines that are not copied to the output will be
lost.

To preserve the lines with NULs, you should call (display line out)
before calling 'loop'.

Also, please use (string-index line #\nul) to check for NULs instead of
'string-contains'. It should be more efficient.

Toggle quote (13 lines)
> + (else
> + (let ((line (fold (lambda (r+p line)
> + (match r+p
> + ((regexp . proc)
> + (match (list-matches regexp line)
> + ((and m+ (_ _ ...))
> + (proc line m+))
> + (_ line)))))
> + line
> + rx+proc)))
> + (display line out)
> + (loop (read-line in 'concat))))))))))

With the changes suggested above, I would have no objection to pushing
this to core-updates. However, it occurs to me that we could handle the
NUL case in a better way:

Since the C regex functions that we use cannot handle NUL bytes, we
could use a different code point to represent NUL during those
operations. We could choose a code point from one of the Unicode
does not occur in the string.

Let NUL* be the code point which will represent NUL bytes. First
replace all NULs with NUL*s, then perform the substitutions, and finally
replace all ALT*s with NULs before writing to the output.

As an important optimization, we should avoid performing these extra
operations unless (string-index line #\nul) finds a NUL.

We could then perform these extra substitutions with simple, inefficient
code. Maybe something like this (untested):

Toggle snippet (23 lines)
(with-atomic-file-replacement file
(lambda (in out)
(let loop ((line (read-line in 'concat)))
(if (eof-object? line)
#t
(let* ((nul* (or (and (string-index line #\nul)
(unused-private-use-code-point line))
#\nul))
(line* (replace-char #\nul nul* line))
(line1* (fold (lambda (r+p line)
(match r+p
((regexp . proc)
(match (list-matches regexp line)
((and m+ (_ _ ...))
(proc line m+))
(_ line)))))
line*
rx+proc))
(line1 (replace-char nul* #\nul line1*)))
(display line1 out)
(loop (read-line in 'concat)))))))))

Where the following additional private procedures would be added to
(guix build utils) above the definition for 'substitute':

Toggle snippet (26 lines)
(define (unused-private-use-code-point s)
"Find a code point within a Unicode Private Use Area that is not
present in S, and return the corresponding character object. If one
cannot be found, return false."
(define (scan lo hi)
(and (<= lo hi)
(let ((c (integer->char lo)))
(if (string-index s c)
(scan (+ lo 1) hi)
c))))
(or (scan #xE000 #xF8FF)
(scan #xF0000 #xFFFFD)
(scan #x100000 #x10FFFD)))

(define (replace-char c1 c2 s)
"Return a string which is equal to S except with all instances of C1
replaced by C2. If C1 and C2 are equal, return S."
(if (char=? c1 c2)
s
(string-map (lambda (c)
(if (char=? c c1)
c2
c))
s)))

What do you think?

Mark
L
L
Ludovic Courtès wrote on 14 Jun 2018 10:01
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 30116@debbugs.gnu.org)
87r2l923wa.fsf@gnu.org
Hello Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (2 lines)
> ludo@gnu.org (Ludovic Courtès) writes:

[...]

Toggle quote (9 lines)
>> Otherwise LGTM. This would have to go to the next ‘core-updates’ (or
>> ‘core-updates-next’ in the meantime.)
>>
>> Thanks!
>>
>> Ludo’.
>
> Ping. Is it the right time to merge this?

Yes you can push it to ‘core-updates’ now. Thank you!

Ludo’.
L
L
Ludovic Courtès wrote on 14 Jun 2018 10:02
(name . Mark H Weaver)(address . mhw@netris.org)
87muvx23u8.fsf@gnu.org
Mark H Weaver <mhw@netris.org> skribis:

Toggle quote (3 lines)
> Thanks for working on this. I found a problem with this patch,
> and I also have a suggestion. Please see below.

I hadn’t seen Mark’s reply, which raises valid concerns. Please dismiss
the message I just sent, Maxim.

Ludo’.
M
M
Maxim Cournoyer wrote on 16 Jun 2018 18:47
(name . Mark H Weaver)(address . mhw@netris.org)
87tvq2smpm.fsf@gmail.com
Hi Mark,

Mark H Weaver <mhw@netris.org> writes:

Toggle quote (83 lines)
> Hi Maxim,
>
> Thanks for working on this. I found a problem with this patch,
> and I also have a suggestion. Please see below.
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
>> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
>> Date: Sun, 14 Jan 2018 20:31:33 -0500
>> Subject: [PATCH] utils: Prevent substitute from crashing on files containing
>> NUL chars.
>>
>> Fixes issue #30116.
>>
>> * guix/build/utils.scm (substitute): Add condition to skip lines containing
>> the NUL character.
>> ---
>> guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------
>> 1 file changed, 26 insertions(+), 18 deletions(-)
>>
>> diff --git a/guix/build/utils.scm b/guix/build/utils.scm
>> index 7391307c8..975f4e70a 100644
>> --- a/guix/build/utils.scm
>> +++ b/guix/build/utils.scm
>> @@ -3,6 +3,7 @@
>> ;;; Copyright © 2013 Andreas Enge <andreas@enge.fr>
>> ;;; Copyright © 2013 Nikita Karetnikov <nikita@karetnikov.org>
>> ;;; Copyright © 2015 Mark H Weaver <mhw@netris.org>
>> +;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer@gmail.com>
>> ;;;
>> ;;; This file is part of GNU Guix.
>> ;;;
>> @@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the line that will be written as
>> a substitution of the original line. Be careful about using '$' to match the
>> end of a line; by itself it won't match the terminating newline of a line."
>> (let ((rx+proc (map (match-lambda
>> - (((? regexp? pattern) . proc)
>> - (cons pattern proc))
>> - ((pattern . proc)
>> - (cons (make-regexp pattern regexp/extended)
>> - proc)))
>> + (((? regexp? pattern) . proc)
>> + (cons pattern proc))
>> + ((pattern . proc)
>> + (cons (make-regexp pattern regexp/extended)
>> + proc)))
>> pattern+procs)))
>> (with-atomic-file-replacement file
>> (lambda (in out)
>> (let loop ((line (read-line in 'concat)))
>> - (if (eof-object? line)
>> - #t
>> - (let ((line (fold (lambda (r+p line)
>> - (match r+p
>> - ((regexp . proc)
>> - (match (list-matches regexp line)
>> - ((and m+ (_ _ ...))
>> - (proc line m+))
>> - (_ line)))))
>> - line
>> - rx+proc)))
>> - (display line out)
>> - (loop (read-line in 'concat)))))))))
>> + (cond
>> + ((eof-object? line)
>> + #t)
>> + ((string-contains line (make-string 1 #\nul))
>> + ;; The regexp functions of the GNU C library (which Guile uses)
>> + ;; cannot deal with NUL characters, so skip to the next line.
>> + (format #t "skipping line with NUL characters: ~s\n" line)
>> + (loop (read-line in 'concat)))
>
> This code will unconditionally *delete* all lines that contain NULs.
>
> This will happen because the lines with NULs are not being written to
> the output file, which will replace the original file when this loop
> reaches EOF. So, any lines that are not copied to the output will be
> lost.
>
> To preserve the lines with NULs, you should call (display line out)
> before calling 'loop'.

Good observation! I agree that we should keep limit the effect of
ignoring NULs only to the substitution.

Toggle quote (3 lines)
> Also, please use (string-index line #\nul) to check for NULs instead of
> 'string-contains'. It should be more efficient.

OK!

Toggle quote (28 lines)
>
>> + (else
>> + (let ((line (fold (lambda (r+p line)
>> + (match r+p
>> + ((regexp . proc)
>> + (match (list-matches regexp line)
>> + ((and m+ (_ _ ...))
>> + (proc line m+))
>> + (_ line)))))
>> + line
>> + rx+proc)))
>> + (display line out)
>> + (loop (read-line in 'concat))))))))))
>
> With the changes suggested above, I would have no objection to pushing
> this to core-updates. However, it occurs to me that we could handle the
> NUL case in a better way:
>
> Since the C regex functions that we use cannot handle NUL bytes, we
> could use a different code point to represent NUL during those
> operations. We could choose a code point from one of the Unicode
> Private Use Areas <https://en.wikipedia.org/wiki/Private_Use_Areas> that
> does not occur in the string.
>
> Let NUL* be the code point which will represent NUL bytes. First
> replace all NULs with NUL*s, then perform the substitutions, and finally
> replace all ALT*s with NULs before writing to the output.

Do I understand this transformation as NULs -> NUL*s and back from NUL*s
-> NULs correctly? I'm not sure how NUL*s became ALT*s in your explanation.

Toggle quote (3 lines)
> As an important optimization, we should avoid performing these extra
> operations unless (string-index line #\nul) finds a NUL.

OK.

Toggle quote (56 lines)
> We could then perform these extra substitutions with simple, inefficient
> code. Maybe something like this (untested):
>
> (with-atomic-file-replacement file
> (lambda (in out)
> (let loop ((line (read-line in 'concat)))
> (if (eof-object? line)
> #t
> (let* ((nul* (or (and (string-index line #\nul)
> (unused-private-use-code-point line))
> #\nul))
> (line* (replace-char #\nul nul* line))
> (line1* (fold (lambda (r+p line)
> (match r+p
> ((regexp . proc)
> (match (list-matches regexp line)
> ((and m+ (_ _ ...))
> (proc line m+))
> (_ line)))))
> line*
> rx+proc))
> (line1 (replace-char nul* #\nul line1*)))
> (display line1 out)
> (loop (read-line in 'concat)))))))))
>
>
> Where the following additional private procedures would be added to
> (guix build utils) above the definition for 'substitute':
>
> (define (unused-private-use-code-point s)
> "Find a code point within a Unicode Private Use Area that is not
> present in S, and return the corresponding character object. If one
> cannot be found, return false."
> (define (scan lo hi)
> (and (<= lo hi)
> (let ((c (integer->char lo)))
> (if (string-index s c)
> (scan (+ lo 1) hi)
> c))))
> (or (scan #xE000 #xF8FF)
> (scan #xF0000 #xFFFFD)
> (scan #x100000 #x10FFFD)))
>
> (define (replace-char c1 c2 s)
> "Return a string which is equal to S except with all instances of C1
> replaced by C2. If C1 and C2 are equal, return S."
> (if (char=? c1 c2)
> s
> (string-map (lambda (c)
> (if (char=? c c1)
> c2
> c))
> s)))
>
> What do you think?

It raises the complexity level a bit for something which doesn't seem to
be a very common scenario, but otherwise seems a very elegant
workaround. It seems to me that your implementation is already pretty
complete. I'll try write a test for validating it and report back.

Thank you for sharing your ideas!

Maxim
M
M
Mark H Weaver wrote on 17 Jun 2018 06:36
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87d0wq3u8f.fsf@netris.org
Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (19 lines)
> Mark H Weaver <mhw@netris.org> writes:
>
>> With the changes suggested above, I would have no objection to pushing
>> this to core-updates. However, it occurs to me that we could handle the
>> NUL case in a better way:
>>
>> Since the C regex functions that we use cannot handle NUL bytes, we
>> could use a different code point to represent NUL during those
>> operations. We could choose a code point from one of the Unicode
>> Private Use Areas <https://en.wikipedia.org/wiki/Private_Use_Areas> that
>> does not occur in the string.
>>
>> Let NUL* be the code point which will represent NUL bytes. First
>> replace all NULs with NUL*s, then perform the substitutions, and finally
>> replace all ALT*s with NULs before writing to the output.
>
> Do I understand this transformation as NULs -> NUL*s and back from NUL*s
> -> NULs correctly? I'm not sure how NUL*s became ALT*s in your explanation.

Sorry, it's a typo. Where I wrote "ALT*s", I meant to write "NUL*s".

Toggle quote (5 lines)
>> What do you think?
>
> It raises the complexity level a bit for something which doesn't seem to
> be a very common scenario,

FWIW, I agree that it's not a common scenario, and it's not entirely
clear that it was worth the time I spent on it, or the added complexity.
On the other hand, I would dislike having a basic API like 'substitute*'
be subtly broken in this way.

Toggle quote (4 lines)
> but otherwise seems a very elegant
> workaround. It seems to me that your implementation is already pretty
> complete. I'll try write a test for validating it and report back.

Sounds good. Thank you!

Mark
M
M
Maxim Cournoyer wrote on 8 Jan 2021 20:14
Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates)
(name . Mark H Weaver)(address . mhw@netris.org)
87wnwn5jgv.fsf_-_@gmail.com
Hello Mark,

I was recently reminded of this bug by a new encounter; at last wrote a
test for your proposed fix, and it appear to work as intended! I've
committed it on your behalf in commit 485ac28235 on the core-updates
branch.

Closing!

Thank you for the clever hack :-)

Maxim
Closed
M
M
Mark H Weaver wrote on 8 Jan 2021 22:42
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 30116-done@debbugs.gnu.org)
871rev14wh.fsf@netris.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
Toggle quote (5 lines)
> I was recently reminded of this bug by a new encounter; at last wrote a
> test for your proposed fix, and it appear to work as intended! I've
> committed it on your behalf in commit 485ac28235 on the core-updates
> branch.

Thanks for taking care of this Maxim, and for adding the test case.

Regards,
Mark
Closed
?