Timothy Sample writes:

> I’m still looking into this, but I wanted to quickly post this
> reproducer for the Guile bug:
>
>     (use-modules (ice-9 regex))
>     (define str
>       "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
>     (match:substring (string-match "[0-8]+" str))
>
> This triggers the out-of-range error when run with “LC_ALL=C”.

It turns out that all that’s needed is the last code point, U+2492,
which is “Number Eleven Full Stop”, or ‘⒒’.  When Guile converts this
to an ASCII C string using ‘u32_conv_to_encoding’, it becomes “11.”.
The regex (“[0-8]+”) matches the “11” part with start index 0 and end
index 2.  The ‘fixup_multibyte_match’ function does nothing (it only
matters when the locale encoding is multibyte) [1].  Guile then builds
the match vector with the original string but keeps the ASCII offsets.
In other words, it thinks the match substring goes from 0 to 2 in a
string that is only one code point long:

    ,use (ice-9 regex)
    (string-match "11" "\u2492")
    => #("\u2492" (0 . 2))

I’m not sure there’s any way to solve this nicely in Guile.  It would
be clearer if the match vector included the string as libc matched it,
but it would still be surprising that the match happens against a
different string from the one that was passed in.

In Disarchive, I can rewrite the generator without regex.  I’ll do that
and see what I can do about the “Gave up!” issue.

[1] It works on the converted-to-ASCII C string, which means that the
    byte offsets and code point offsets are the same.  Hence, it has
    nothing to do.


-- 
Tim
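
For callers that cannot avoid regex, one possible stopgap is to check the offsets in the match vector against the length of the original string before extracting anything. The sketch below is only an illustration: ‘safe-match-substring’ is a hypothetical helper, not part of Guile or Disarchive, and it assumes that the match vector keeps the original string alongside the libc offsets as described above.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper (not part of Guile): return the matched substring,
;; or #f when there is no match or when the recorded offsets do not fit
;; inside the original string (as in the "\u2492" case under LC_ALL=C).
(define (safe-match-substring m)
  (and m
       (let ((start (match:start m))
             (end   (match:end m))
             (len   (string-length (match:string m))))
         (and start end
              (<= start end len)
              (substring (match:string m) start end)))))

;; Plain ASCII input behaves as usual.
(safe-match-substring (string-match "[0-8]+" "abc123"))
;; => "123"
```

This only catches offsets that run past the end of the string. If the converted C string differs from the original but the offsets happen to stay in range, the helper would presumably return the wrong substring without complaint, so it does not address the underlying mismatch.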