Hi!

Timothy Sample skribis:

> Timothy Sample writes:
>
>> I’m still looking into this, but I wanted to quickly post this
>> reproducer for the Guile bug:
>>
>>     (use-modules (ice-9 regex))
>>     (define str
>>       "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
>>     (match:substring (string-match "[0-8]+" str))
>>
>> This triggers the out-of-range error when run with “LC_ALL=C”.
>
> It turns out that all that’s needed is the last code point, which is
> “Number Eleven Full Stop”, or ‘⒒’.

Whaaat?  “Number Eleven Full Stop”, I wonder how the Unicode folks came
up with that one.

  ㊷ = ㉚ + ⒓

> When Guile converts this to an ASCII C string using
> ‘u32_conv_from_encoding’, it becomes “11.”.  The regex (“[0-8]+”)
> matches the “11” part with start index 0 and end index 2.  The
> ‘fixup_multibyte_match’ function does nothing (it only matters when
> the locale encoding is multibyte) [1].  Guile then builds the match
> vector with the original string but keeps the ASCII offsets.  In other
> words, it thinks the match substring goes from 0 to 2 in a single code
> point string:
>
>     ,use (ice-9 regex)
>     (string-match "11" "\u2492")
>     => #("\u2492" (0 . 2))
>
> I’m not sure there’s any way to solve this nicely in Guile.  It would be
> clearer if the match vector included the string as libc matched it, but
> it’s still surprising that the match happens with a different string.

Yeah, I don’t think there’s much we can do.

It’s a lot of fun anyway.  Thanks for investigating!

Ludo’.