Disarchive occasionally fails tests

Open. Submitted by Ludovic Courtès.
Details
4 participants
  • Bengt Richter
  • Ludovic Courtès
  • Ludovic Courtès
  • Timothy Sample
Owner
unassigned
Severity
normal
Ludovic Courtès wrote on 30 Apr 12:00 +0200
(address . bug-guix@gnu.org)
87v984gkhn.fsf@inria.fr
Hi Timothy,
Disarchive 0.2.0 occasionally fails two tests:
  FAIL: tests/kinds/octal.scm - [prop] Writing is reversible
  FAIL: tests/kinds/octal.scm - [prop] Serializing is reversible
(Thanks, Quickcheck! :-))
I added ‘pk’ calls like so:
  (test-assert "[prop] Writing is reversible"
    (quickcheck
     (property ((octal $octal))
       (test-when (valid-octal? octal)
         (begin
           (equal? (pk 'oct octal)
                   (pk 'decode (decode-octal (encode-octal octal)))))))))

  (test-assert "[prop] Serializing is reversible"
    (quickcheck
     (property ((octal $octal))
       (test-when (valid-octal? octal)
         (equal? (pk 'OCT octal)
                 (pk 'DECODE (serdeser -octal- octal)))))))
and got this output:
  ;;; (oct #<<unstructured-octal> value: 0 source: #<<zero-string> value: "\U0f94a4\u0912\U025627\U10e96a\u9576\u2077\u048f\U0f2f60\U0f744b" trailer: #vu8(172 156 23 48 25 29 159 226 210)>>)

  ;;; (decode #<<unstructured-octal> value: 0 source: #<<zero-string> value: "\U0f94a4\u0912\U025627\U10e96a\u9576\u2077\u048f\U0f2f60\U0f744b" trailer: #vu8(172 156 23 48 25 29 159 226 210)>>)
  actual-value: #f
  actual-error:
  + (out-of-range
  +   #f
  +   "Value out of range ~S to ~S: ~S"
  +   (8 9 10)
  +   (10))
  result: FAIL

  […]

  ;;; (OCT #<<unstructured-octal> value: 0 source: #<<zero-string> value: "\U0f94a4\u0912\U025627\U10e96a\u9576\u2077\u048f\U0f2f60\U0f744b" trailer: #vu8(172 156 23 48 25 29 159 226 210)>>)

  ;;; (DECODE #<<unstructured-octal> value: 0 source: #<<zero-string> value: "\U0f94a4\u0912\U025627\U10e96a\u9576\u2077\u048f\U0f2f60\U0f744b" trailer: #vu8(172 156 23 48 25 29 159 226 210)>>)
  actual-value: #f
  actual-error:
  + (out-of-range
  +   #f
  +   "Value out of range ~S to ~S: ~S"
  +   (8 9 10)
  +   (10))
  result: FAIL
I’m not sure where the exception comes from though.
Thoughts?
Thanks,
Ludo’.
Timothy Sample wrote on 30 Apr 21:49 +0200
(name . Ludovic Courtès)(address . ludovic.courtes@inria.fr)(address . 48114@debbugs.gnu.org)
87pmybeen3.fsf@ngyro.com
Hey,
Ludovic Courtès <ludovic.courtes@inria.fr> writes:
> Disarchive 0.2.0 occasionally fails two tests:
>
>   FAIL: tests/kinds/octal.scm - [prop] Writing is reversible
>   FAIL: tests/kinds/octal.scm - [prop] Serializing is reversible
These two tests have a bit of a problem. They occasionally fail by “giving up”, which is when too many test cases are discarded rather than used. (This happens because you might write a generator for a superset of the values you’re interested in, and then filter out some values with “test-when”.) I don’t think this is happening here, though. You would see something like “Gave up! Passed only 0 ests [sic].”
> I added ‘pk’ calls like so:
>
>   (test-assert "[prop] Writing is reversible"
>     (quickcheck
>      (property ((octal $octal))
>        (test-when (valid-octal? octal)
>          (begin
>            (equal? (pk 'oct octal) (pk 'decode (decode-octal (encode-octal octal)))))))))
>
>   (test-assert "[prop] Serializing is reversible"
>     (quickcheck
>      (property ((octal $octal))
>        (test-when (valid-octal? octal)
>          (equal? (pk 'OCT octal) (pk 'DECODE (serdeser -octal- octal)))))))
>
> and got this output:
>
> ;;; (oct #<<unstructured-octal> value: 0 source: #<<zero-string> value: "\U0f94a4\u0912\U025627\U10e96a\u9576\u2077\u048f\U0f2f60\U0f744b" trailer: #vu8(172 156 23 48 25 29 159 226 210)>>)
>
> ;;; (decode #<<unstructured-octal> value: 0 source: #<<zero-string> value: "\U0f94a4\u0912\U025627\U10e96a\u9576\u2077\u048f\U0f2f60\U0f744b" trailer: #vu8(172 156 23 48 25 29 159 226 210)>>)
> actual-value: #f
> actual-error:
> + (out-of-range
> +   #f
> +   "Value out of range ~S to ~S: ~S"
> +   (8 9 10)
> +   (10))
> result: FAIL
>
> […]
>
> ;;; (OCT #<<unstructured-octal> value: 0 source: #<<zero-string> value: "\U0f94a4\u0912\U025627\U10e96a\u9576\u2077\u048f\U0f2f60\U0f744b" trailer: #vu8(172 156 23 48 25 29 159 226 210)>>)
>
> ;;; (DECODE #<<unstructured-octal> value: 0 source: #<<zero-string> value: "\U0f94a4\u0912\U025627\U10e96a\u9576\u2077\u048f\U0f2f60\U0f744b" trailer: #vu8(172 156 23 48 25 29 159 226 210)>>)
> actual-value: #f
> actual-error:
> + (out-of-range
> +   #f
> +   "Value out of range ~S to ~S: ~S"
> +   (8 9 10)
> +   (10))
> result: FAIL
>
> I’m not sure where the exception comes from though.
I can’t seem to reproduce this. I’ve run the test suite many, many times, but I also tried:

  ,use (disarchive kinds octal)
  ,use (disarchive kinds zero-string)
  ,use (disarchive serialization)
  (define the-zero-string
    (make-zero-string
     "\U0f94a4\u0912\U025627\U10e96a\u9576\u2077\u048f\U0f2f60\U0f744b"
     #vu8(172 156 23 48 25 29 159 226 210)))
  (define the-octal (make-unstructured-octal 0 the-zero-string))
  (equal? the-octal (decode-octal (encode-octal the-octal)))
  (equal? the-octal (serdeser -octal- the-octal))
Which works fine. (Does it work for you?)
However, isn’t it possible that these values aren’t the culprits? With the “pk” calls you added, isn’t it printing the last OK value without telling us the value causing the issue?
What if you run it with the following?
  (test-assert "[prop] Writing is reversible"
    (quickcheck
     (property ((octal $octal))
       (test-when (valid-octal? octal)
         (false-if-exception ; <-- changed!
          (equal? octal (decode-octal (encode-octal octal))))))))

This way, Guile-QuickCheck should print the offending value and the seed used for the tests, which could be helpful for reproducing. (The fact that it doesn’t handle exceptions well is a known bug!)

-- Tim
Ludovic Courtès wrote on 2 May 21:57 +0200
(name . Timothy Sample)(address . samplet@ngyro.com)(address . 48114@debbugs.gnu.org)
874kfk6h8o.fsf@gnu.org
Hello!
Timothy Sample <samplet@ngyro.com> skribis:
> I can’t seem to reproduce this. I’ve run the test suite many, many
> times, but I also tried:
I can reproduce it quickly with:
  while make check TESTS=tests/kinds/octal.scm -j5 ; do : ; done
… in C locale (LC_ALL & co. all unset).
> However, isn’t it possible that these values aren’t the culprits? With
> the “pk” calls you added, isn’t it printing the last OK value without
> telling us the value causing the issue?

You’re right, the values printed are not the culprit. The problem comes from the generator (I had to raise the (quickcheck …) form out of ‘test-assert’ so I could get a backtrace):
Backtrace:
          13 (primitive-load "/data/src/disarchive/./build-aux/test-driver.scm")
In ice-9/eval.scm:
   619:8  12 (_ #(#(#<directory (guile-user) 7fccb09d9f00> ((() "./tests/kinds/octal.scm") (# . "no") (# . #) ?)) #))
   619:8  11 (_ #(#(#(#(#(#(#(#(#<directory (guile-user) 7fccb09d9f00> ("./tests/kinds/octal?") ?)) ?) ?) ?) ?) ?) ?))
In ice-9/boot-9.scm:
   142:2  10 (dynamic-wind _ _ #<procedure 7fccaf5b81a0 at ice-9/eval.scm:330:13 ()>)
In unknown file:
           9 (primitive-load "./tests/kinds/octal.scm")
In quickcheck.scm:
   118:6   8 (check #<<quickcheck-config> seed: 321557891 stop?: #<procedure 7fccaf8c3540 at ice-9/eval.scm:336:13?> ?)
   98:12   7 (check-results _ #<<property> names: (octal) gen/arbs: (#<<arbitrary> gen: #<<generator> proc: #<proce?>)
In quickcheck/generator.scm:
    65:2   6 (_ 7 #<<rng-state> start: #(1907167801 2749187034 1190323419 1039883844 766725436 3567744198) s1: #(29?>)
    65:2   5 (_ 7 #<<rng-state> start: #(1907167801 2749187034 1190323419 1039883844 766725436 3567744198) s1: #(29?>)
   78:17   4 (_ 7 #<<rng-state> start: #(1907167801 2749187034 1190323419 1039883844 766725436 3567744198) s1: #(28?>)
  105:22   3 (_ _)
In tests/kinds.scm:
   84:22   2 (fix-unstructured-octal-value #<<unstructured-octal> value: 7 source: #<<zero-string> value: "\U0f99aa?>)
   86:47   1 (_ _)
In unknown file:
           0 (substring "\U0f99aa?\U0ff7c1\U0fb97a\U0ff933?\U0fe7a1" 6 8)

ERROR: In procedure substring:
Value out of range 6 to 7: 8
Note that this is in C locale, which may mean that ‘regexp-exec’, which passes strings to libc, gets offsets wrong somehow (see ‘fixup_multibyte_match’ in libguile), though I couldn’t reproduce it with the string above.

Anyway, ‘guix build disarchive’ builds in en_US.utf8 locale, so the thing above is probably a wrong lead.

If I switch to en_US.utf8, I occasionally get the following error instead:
test-name: [prop] Serializing is reversible
location: tests/kinds/octal.scm:154
source:
+ (test-assert
+   "[prop] Serializing is reversible"
+   (quickcheck
+     (property
+       ((octal $octal))
+       (test-when
+         (valid-octal? octal)
+         (equal?
+           (pk 'OCT octal)
+           (pk 'DECODE (serdeser -octal- octal)))))))

;;; (OCT #<<unstructured-octal> value: 0 source: #<<zero-string> value: "" trailer: "">>)

;;; (DECODE #<<unstructured-octal> value: 0 source: #<<zero-string> value: "" trailer: "">>)
Gave up! Passed only 1 est.
actual-value: #f
result: FAIL
This is more in line with what you described. Any ideas on how to address that?

Thanks,
Ludo’.
Timothy Sample wrote on 3 May 04:24 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 48114@debbugs.gnu.org)
87a6pceerf.fsf@ngyro.com
Hi,
Ludovic Courtès <ludo@gnu.org> writes:
[...]
> ERROR: In procedure substring:
> Value out of range 6 to 7: 8
>
> Note that this is in C locale, which may mean that ‘regexp-exec’, which
> passes strings to libc, gets offsets wrong somehow (see
> ‘fixup_multibyte_match’ in libguile), though I couldn’t reproduce it
> with the string above.

I’m still looking into this, but I wanted to quickly post this reproducer for the Guile bug:

  (use-modules (ice-9 regex))
  (define str
    "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
  (match:substring (string-match "[0-8]+" str))
This triggers the out-of-range error when run with “LC_ALL=C”.

-- Tim
Timothy Sample wrote on 3 May 06:02 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 48114@debbugs.gnu.org)
8735v4ea7y.fsf@ngyro.com
Timothy Sample <samplet@ngyro.com> writes:
> I’m still looking into this, but I wanted to quickly post this
> reproducer for the Guile bug:
>
>   (use-modules (ice-9 regex))
>   (define str
>     "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
>   (match:substring (string-match "[0-8]+" str))
>
> This triggers the out-of-range error when run with “LC_ALL=C”.
It turns out that all that’s needed is the last code point, which is “Number Eleven Full Stop”, or ‘⒒’. When Guile converts this to an ASCII C string using ‘u32_conv_from_encoding’, it becomes “11.”. The regex (“[0-8]+”) matches the “11” part with start index 0 and end index 2. The ‘fixup_multibyte_match’ function does nothing (it only matters when the locale encoding is multibyte) [1]. Guile then builds the match vector with the original string but keeps the ASCII offsets. In other words, it thinks the match substring goes from 0 to 2 in a single code point string:
  ,use (ice-9 regex)
  (string-match "11" "\u2492")
  => #("\u2492" (0 . 2))
I’m not sure there’s any way to solve this nicely in Guile. It would be clearer if the match vector included the string as libc matched it, but it’s still surprising that the match happens with a different string.
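Code that has to keep using regexps under a non-UTF-8 locale could at least check the offsets before slicing. The following is only a minimal sketch: ‘safe-match-substring’ is a hypothetical helper, not part of Guile or Disarchive, and it only detects an end offset that overruns the original string. Offsets that are wrong but still in range would go unnoticed.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper (an assumption for illustration, not a Guile or
;; Disarchive procedure).  Return the substring of STR matched by
;; PATTERN, or #f when there is no match or when the end offset of the
;; match does not fit inside STR.
(define (safe-match-substring pattern str)
  (let ((m (string-match pattern str)))
    (and m
         ;; In the case described above the match vector can carry
         ;; offsets computed on the converted string, so compare them
         ;; with the length of the original string first.
         (<= (match:end m) (string-length str))
         (match:substring m))))
```

With plain ASCII input the helper behaves like ‘match:substring’; with the “\u2492” case described above it would be expected to return #f rather than raise an out-of-range error.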
In Disarchive, I can rewrite the generator without regex. I’ll do that and see what I can do about the “Gave up!” issue.

[1] It works on the converted-to-ASCII C string, which means that the byte offsets and code point offsets are the same. Hence, it has nothing to do.
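As an illustration of what a regex-free scan could look like, here is a minimal sketch using only string and char-set procedures, so it operates on code points and never goes through libc. The name ‘first-octal-run’ and the choice of ASCII 0 to 7 as the accepted digits are assumptions; this is not the actual Disarchive generator.

```scheme
(use-modules (srfi srfi-14))

;; Assumption: only the ASCII digits 0-7 count as octal digits.
(define octal-digits (string->char-set "01234567"))

;; Hypothetical helper: return the first maximal run of octal digits
;; in STR, or #f when STR contains none.
(define (first-octal-run str)
  (let ((start (string-index str octal-digits)))
    (and start
         ;; string-skip returns the index of the first character after
         ;; START that is not an octal digit, or #f if the run extends
         ;; to the end of the string.
         (let ((end (or (string-skip str octal-digits start)
                        (string-length str))))
           (substring str start end)))))
```

Because the indices come from the same string that ‘substring’ slices, a character such as ‘⒒’ is simply not a digit here, whatever the locale.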

-- Tim
Bengt Richter wrote on 3 May 08:19 +0200
(name . Timothy Sample)(address . samplet@ngyro.com)
20210503061950.GA26660@LionPure
Hi Timothy, Ludo,
On +2021-05-03 00:02:09 -0400, Timothy Sample wrote:
> Timothy Sample <samplet@ngyro.com> writes:
>
> > I’m still looking into this, but I wanted to quickly post this
> > reproducer for the Guile bug:
> >
> >   (use-modules (ice-9 regex))
> >   (define str
> >     "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
> >   (match:substring (string-match "[0-8]+" str))
> >
> > This triggers the out-of-range error when run with “LC_ALL=C”.
>
> It turns out that all that’s needed is the last code point, which is
> “Number Eleven Full Stop”, or ‘⒒’. When Guile converts this to an ASCII
> C string using ‘u32_conv_from_encoding’, it becomes “11.”. The regex
> (“[0-8]+”) matches the “11” part with start index 0 and end index 2.
> The ‘fixup_multibyte_match’ function does nothing (it only matters when
> the locale encoding is multibyte) [1]. Guile then builds the match
> vector with the original string but keeps the ASCII offsets. In other
> words, it thinks the match substring goes from 0 to 2 in a single code
> point string:
>
>   ,use (ice-9 regex)
>   (string-match "11" "\u2492")
>   => #("\u2492" (0 . 2))
>
> I’m not sure there’s any way to solve this nicely in Guile. It would be
> clearer if the match vector included the string as libc matched it, but
> it’s still surprising that the match happens with a different string.
>
> In Disarchive, I can rewrite the generator without regex. I’ll do that
> and see what I can do about the “Gave up!” issue.
>
> [1] It works on the converted-to-ASCII C string, which means that the
> byte offsets and code point offsets are the same. Hence, it has nothing
> to do.
>
> -- Tim
What happens with these? (code points in decimal)

   8554 _Ⅺ_ "ROMAN NUMERAL ELEVEN"
   8570 _ⅺ_ "SMALL ROMAN NUMERAL ELEVEN"
   9322 _⑪_ "CIRCLED NUMBER ELEVEN"
   9342 _⑾_ "PARENTHESIZED NUMBER ELEVEN"
   9362 _⒒_ "NUMBER ELEVEN FULL STOP"
   9451 _⓫_ "NEGATIVE CIRCLED NUMBER ELEVEN"
  13155 _㍣_ "IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ELEVEN"
  13290 _㏪_ "IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ELEVEN"
I would argue that none of these should be "decoded" into ascii polyglyphs since they are atomic character glyphs. IMO it is over-eager transformation to make them into ascii polyglyphs.

/Super/sub/-script placement metadata is another thing to consider -- "decode" to ascii art?? ;-)

Unicode characters representing mathematical values in other languages are different. Those are subject to natural language translation with locale-dependent semantics.

These might be candidates for that?: (code points in decimal)
   8544 _Ⅰ_ "ROMAN NUMERAL ONE"
   8545 _Ⅱ_ "ROMAN NUMERAL TWO"
   8546 _Ⅲ_ "ROMAN NUMERAL THREE"
   8547 _Ⅳ_ "ROMAN NUMERAL FOUR"
   8548 _Ⅴ_ "ROMAN NUMERAL FIVE"
   8549 _Ⅵ_ "ROMAN NUMERAL SIX"
   8550 _Ⅶ_ "ROMAN NUMERAL SEVEN"
   8551 _Ⅷ_ "ROMAN NUMERAL EIGHT"
   8552 _Ⅸ_ "ROMAN NUMERAL NINE"
   8553 _Ⅹ_ "ROMAN NUMERAL TEN"
   8554 _Ⅺ_ "ROMAN NUMERAL ELEVEN"
   8555 _Ⅻ_ "ROMAN NUMERAL TWELVE"
   8556 _Ⅼ_ "ROMAN NUMERAL FIFTY"
   8557 _Ⅽ_ "ROMAN NUMERAL ONE HUNDRED"
   8558 _Ⅾ_ "ROMAN NUMERAL FIVE HUNDRED"
   8559 _Ⅿ_ "ROMAN NUMERAL ONE THOUSAND"
   8560 _ⅰ_ "SMALL ROMAN NUMERAL ONE"
   8561 _ⅱ_ "SMALL ROMAN NUMERAL TWO"
   8562 _ⅲ_ "SMALL ROMAN NUMERAL THREE"
   8563 _ⅳ_ "SMALL ROMAN NUMERAL FOUR"
   8564 _ⅴ_ "SMALL ROMAN NUMERAL FIVE"
   8565 _ⅵ_ "SMALL ROMAN NUMERAL SIX"
   8566 _ⅶ_ "SMALL ROMAN NUMERAL SEVEN"
   8567 _ⅷ_ "SMALL ROMAN NUMERAL EIGHT"
   8568 _ⅸ_ "SMALL ROMAN NUMERAL NINE"
   8569 _ⅹ_ "SMALL ROMAN NUMERAL TEN"
   8570 _ⅺ_ "SMALL ROMAN NUMERAL ELEVEN"
   8571 _ⅻ_ "SMALL ROMAN NUMERAL TWELVE"
   8572 _ⅼ_ "SMALL ROMAN NUMERAL FIFTY"
   8573 _ⅽ_ "SMALL ROMAN NUMERAL ONE HUNDRED"
   8574 _ⅾ_ "SMALL ROMAN NUMERAL FIVE HUNDRED"
   8575 _ⅿ_ "SMALL ROMAN NUMERAL ONE THOUSAND"
   8576 _ↀ_ "ROMAN NUMERAL ONE THOUSAND C D"
   8577 _ↁ_ "ROMAN NUMERAL FIVE THOUSAND"
   8578 _ↂ_ "ROMAN NUMERAL TEN THOUSAND"
   8579 _Ↄ_ "ROMAN NUMERAL REVERSED ONE HUNDRED"
   8581 _ↅ_ "ROMAN NUMERAL SIX LATE FORM"
   8582 _ↆ_ "ROMAN NUMERAL FIFTY EARLY FORM"
   8583 _ↇ_ "ROMAN NUMERAL FIFTY THOUSAND"
   8584 _ↈ_ "ROMAN NUMERAL ONE HUNDRED THOUSAND"
  12321 _〡_ "HANGZHOU NUMERAL ONE"
  12322 _〢_ "HANGZHOU NUMERAL TWO"
  12323 _〣_ "HANGZHOU NUMERAL THREE"
  12324 _〤_ "HANGZHOU NUMERAL FOUR"
  12325 _〥_ "HANGZHOU NUMERAL FIVE"
  12326 _〦_ "HANGZHOU NUMERAL SIX"
  12327 _〧_ "HANGZHOU NUMERAL SEVEN"
  12328 _〨_ "HANGZHOU NUMERAL EIGHT"
  12329 _〩_ "HANGZHOU NUMERAL NINE"
  12344 _〸_ "HANGZHOU NUMERAL TEN"
  12345 _〹_ "HANGZHOU NUMERAL TWENTY"
  12346 _〺_ "HANGZHOU NUMERAL THIRTY"
Just my intuitive reaction, no academic creds to back it up ;)
--
Regards,
Bengt Richter
Ludovic Courtès wrote on 3 May 22:03 +0200
(name . Timothy Sample)(address . samplet@ngyro.com)(address . 48114@debbugs.gnu.org)
874kfjwpn4.fsf@gnu.org
Hi!
Timothy Sample <samplet@ngyro.com> skribis:
> Timothy Sample <samplet@ngyro.com> writes:
>
>> I’m still looking into this, but I wanted to quickly post this
>> reproducer for the Guile bug:
>>
>>   (use-modules (ice-9 regex))
>>   (define str
>>     "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
>>   (match:substring (string-match "[0-8]+" str))
>>
>> This triggers the out-of-range error when run with “LC_ALL=C”.
>
> It turns out that all that’s needed is the last code point, which is
> “Number Eleven Full Stop”, or ‘⒒’.

Whaaat? “Number Eleven Full Stop”, I wonder how the Unicode folks came up with that one. ㊷ = ㉚ + ⒓
> When Guile converts this to an ASCII C string using
> ‘u32_conv_from_encoding’, it becomes “11.”. The regex (“[0-8]+”)
> matches the “11” part with start index 0 and end index 2. The
> ‘fixup_multibyte_match’ function does nothing (it only matters when
> the locale encoding is multibyte) [1]. Guile then builds the match
> vector with the original string but keeps the ASCII offsets. In other
> words, it thinks the match substring goes from 0 to 2 in a single code
> point string:
>
>   ,use (ice-9 regex)
>   (string-match "11" "\u2492")
>   => #("\u2492" (0 . 2))
>
> I’m not sure there’s any way to solve this nicely in Guile. It would be
> clearer if the match vector included the string as libc matched it, but
> it’s still surprising that the match happens with a different string.
Yeah, I don’t think there’s much we can do. It’s a lot of fun anyway.
Thanks for investigating!
Ludo’.