> Timothy Sample <samplet@ngyro.com> writes:
>
> > I’m still looking into this, but I wanted to quickly post this
> > reproducer for the Guile bug:
> >
> > (use-modules (ice-9 regex))
> > (define str
> > "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
> > (match:substring (string-match "[0-8]+" str))
> >
> > This triggers the out-of-range error when run with “LC_ALL=C”.
>
> It turns out that all that’s needed is the last code point, which is
> “Number Eleven Full Stop”, or ‘⒒’. When Guile converts this to an ASCII
> C string using ‘u32_conv_from_encoding’, it becomes “11.”. The regex
> (“[0-8]+”) matches the “11” part with start index 0 and end index 2.
> The ‘fixup_multibyte_match’ function does nothing (it only matters when
> the locale encoding is multibyte) [1]. Guile then builds the match
> vector with the original string but keeps the ASCII offsets. In other
> words, it thinks the match substring goes from 0 to 2 in a single code
> point string:
>
> ,use (ice-9 regex)
> (string-match "11" "\u2492")
> => #("\u2492" (0 . 2))
>
> I’m not sure there’s any way to solve this nicely in Guile. It would be
> clearer if the match vector included the string as libc matched it, but
> it’s still surprising that the match happens with a different string.
>
> In Disarchive, I can rewrite the generator without regex. I’ll do that
> and see what I can do about the “Gave up!” issue.
>
> [1] It works on the converted-to-ASCII C string, which means that the
> byte offsets and code point offsets are the same. Hence, it has nothing
> to do.
>
>
> -- Tim
>
>
>
What happens with these?
(code points in decimal)
8554 _Ⅺ_ "ROMAN NUMERAL ELEVEN"
8570 _ⅺ_ "SMALL ROMAN NUMERAL ELEVEN"
9322 _⑪_ "CIRCLED NUMBER ELEVEN"
9342 _⑾_ "PARENTHESIZED NUMBER ELEVEN"
9362 _⒒_ "NUMBER ELEVEN FULL STOP"
9451 _⓫_ "NEGATIVE CIRCLED NUMBER ELEVEN"
13155 _㍣_ "IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ELEVEN"
13290 _㏪_ "IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ELEVEN"
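One way to find out might be a small probe like the following. It is only a sketch: the `probe` helper and the `candidates` list are names I made up, and the pattern "11" is just the obvious guess for what an ASCII conversion would produce. It runs `string-match` on each code point as a one-character string and reports whether the returned end offset still fits inside the original string. The result should depend on the locale, so it would need to be run once normally and once with LC_ALL=C.

```scheme
(use-modules (ice-9 regex))

;; The "eleven" code points listed above, in decimal.
(define candidates '(8554 8570 9322 9342 9362 9451 13155 13290))

;; Hypothetical helper: classify what string-match does with the
;; one-character string for code point CP.
;; Returns 'no-match, 'in-range, 'out-of-range, or 'error.
(define (probe cp)
  (let ((str (string (integer->char cp))))
    (catch #t
      (lambda ()
        (let ((m (string-match "11" str)))
          (cond ((not m) 'no-match)
                ;; The end offset must not exceed the original string.
                ((<= (match:end m) (string-length str)) 'in-range)
                (else 'out-of-range))))
      ;; Any error (e.g. a failed encoding conversion) is reported as such.
      (lambda (key . args) 'error))))

(for-each
 (lambda (cp) (format #t "~a: ~a\n" cp (probe cp)))
 candidates)
```

Anything reported as out-of-range would be a case where `match:substring` can be expected to fail in the way Tim describes.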
I would argue that none of these should be "decoded" into ASCII polyglyphs,
since they are atomic character glyphs. IMO it is an over-eager transformation
to turn them into ASCII polyglyphs.
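To see which of them even have a multi-character expansion defined, here is a rough sketch using Guile's `string-normalize-nfkd`. The caveat is that this shows Unicode's own compatibility decomposition (NFKD), which I am only assuming is close to what the ASCII conversion under LC_ALL=C produces; it is not the same mechanism. The `decompose` helper and the `elevens` list are names invented for the example. Code points are printed as numbers to avoid output-encoding trouble in the C locale.

```scheme
;; The "eleven" code points listed above, in decimal.
(define elevens '(8554 8570 9322 9342 9362 9451 13155 13290))

;; Hypothetical helper: NFKD form of the one-character string for CP.
;; Assumption: NFKD only approximates the locale conversion Guile uses.
(define (decompose cp)
  (string-normalize-nfkd (string (integer->char cp))))

(for-each
 (lambda (cp)
   (let ((d (decompose cp)))
     ;; Print the decomposition as a list of code points plus its length.
     (format #t "~a -> ~s (length ~a)\n"
             cp
             (map char->integer (string->list d))
             (string-length d))))
 elevens)
```

Any entry whose length comes out greater than 1 is one where offsets computed on the expanded string can no longer be applied to the original string.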
Superscript/subscript placement metadata is another thing to consider:
"decode" that to ASCII art?? ;-)
Unicode characters representing mathematical values in other languages are
different. Those are subject to natural language translation with
locale-dependent semantics.
Might these be candidates for that?
(code points in decimal)
8544 _Ⅰ_ "ROMAN NUMERAL ONE"
8545 _Ⅱ_ "ROMAN NUMERAL TWO"
8546 _Ⅲ_ "ROMAN NUMERAL THREE"
8547 _Ⅳ_ "ROMAN NUMERAL FOUR"
8548 _Ⅴ_ "ROMAN NUMERAL FIVE"
8549 _Ⅵ_ "ROMAN NUMERAL SIX"
8550 _Ⅶ_ "ROMAN NUMERAL SEVEN"
8551 _Ⅷ_ "ROMAN NUMERAL EIGHT"
8552 _Ⅸ_ "ROMAN NUMERAL NINE"
8553 _Ⅹ_ "ROMAN NUMERAL TEN"
8554 _Ⅺ_ "ROMAN NUMERAL ELEVEN"
8555 _Ⅻ_ "ROMAN NUMERAL TWELVE"
8556 _Ⅼ_ "ROMAN NUMERAL FIFTY"
8557 _Ⅽ_ "ROMAN NUMERAL ONE HUNDRED"
8558 _Ⅾ_ "ROMAN NUMERAL FIVE HUNDRED"
8559 _Ⅿ_ "ROMAN NUMERAL ONE THOUSAND"
8560 _ⅰ_ "SMALL ROMAN NUMERAL ONE"
8561 _ⅱ_ "SMALL ROMAN NUMERAL TWO"
8562 _ⅲ_ "SMALL ROMAN NUMERAL THREE"
8563 _ⅳ_ "SMALL ROMAN NUMERAL FOUR"
8564 _ⅴ_ "SMALL ROMAN NUMERAL FIVE"
8565 _ⅵ_ "SMALL ROMAN NUMERAL SIX"
8566 _ⅶ_ "SMALL ROMAN NUMERAL SEVEN"
8567 _ⅷ_ "SMALL ROMAN NUMERAL EIGHT"
8568 _ⅸ_ "SMALL ROMAN NUMERAL NINE"
8569 _ⅹ_ "SMALL ROMAN NUMERAL TEN"
8570 _ⅺ_ "SMALL ROMAN NUMERAL ELEVEN"
8571 _ⅻ_ "SMALL ROMAN NUMERAL TWELVE"
8572 _ⅼ_ "SMALL ROMAN NUMERAL FIFTY"
8573 _ⅽ_ "SMALL ROMAN NUMERAL ONE HUNDRED"
8574 _ⅾ_ "SMALL ROMAN NUMERAL FIVE HUNDRED"
8575 _ⅿ_ "SMALL ROMAN NUMERAL ONE THOUSAND"
8576 _ↀ_ "ROMAN NUMERAL ONE THOUSAND C D"
8577 _ↁ_ "ROMAN NUMERAL FIVE THOUSAND"
8578 _ↂ_ "ROMAN NUMERAL TEN THOUSAND"
8579 _Ↄ_ "ROMAN NUMERAL REVERSED ONE HUNDRED"
8581 _ↅ_ "ROMAN NUMERAL SIX LATE FORM"
8582 _ↆ_ "ROMAN NUMERAL FIFTY EARLY FORM"
8583 _ↇ_ "ROMAN NUMERAL FIFTY THOUSAND"
8584 _ↈ_ "ROMAN NUMERAL ONE HUNDRED THOUSAND"
12321 _〡_ "HANGZHOU NUMERAL ONE"
12322 _〢_ "HANGZHOU NUMERAL TWO"
12323 _〣_ "HANGZHOU NUMERAL THREE"
12324 _〤_ "HANGZHOU NUMERAL FOUR"
12325 _〥_ "HANGZHOU NUMERAL FIVE"
12326 _〦_ "HANGZHOU NUMERAL SIX"
12327 _〧_ "HANGZHOU NUMERAL SEVEN"
12328 _〨_ "HANGZHOU NUMERAL EIGHT"
12329 _〩_ "HANGZHOU NUMERAL NINE"
12344 _〸_ "HANGZHOU NUMERAL TEN"
12345 _〹_ "HANGZHOU NUMERAL TWENTY"
12346 _〺_ "HANGZHOU NUMERAL THIRTY"
Just my intuitive reaction, no academic creds to back it up ;)
--
Regards,
Bengt Richter