Attila Lendvai wrote on Wed 13-04-2022 at 07:51 [+0000]:
> i'm not sure why the wrong locale breaks file-system walking and deleting, though.
>
> i assume if every function in guile uses/assumes the same locale (character
> encoding), then both directions through the guile FFI should be idempotent, no?
> and i think both ASCII and UTF-8 are idempotent wrt C bytes <-> scheme string
> conversions.

The problem is that the default character encoding is ANSI_X3.4-1968
(US-ASCII), and any byte above 127 is not valid ASCII. Also, the string
procedures internally always use UTF-8 (or possibly ISO-8859-1 as an
optimisation?). Strings are not raw bytes; instead, they can be considered
a vector of characters (string-ref returns characters, not bytes, and
doesn't use byte positions).

> IOW, it's only the displaying of the chars that should be broken,
> not file operations.

  LANG=bogus guile
  (guile-user)> (setlocale LC_ALL)
  (guile-user)> (use-modules (ice-9 i18n))
  (guile-user)> (locale-encoding)
  $2 = "ANSI_X3.4-1968"

Apparently the fallback encoding is ‘ANSI_X3.4-1968’. Let's take a look
at this encoding. According to IANA
(https://www.iana.org/assignments/character-sets/character-sets.xhtml),
this character encoding can also be named ‘US-ASCII’ and is specified in
RFC 2046. An excerpt:

   "US-ASCII" does not indicate an arbitrary 7-bit character set[sic], but
   specifies that all octets in the body must be interpreted as characters
   according to the US-ASCII character set.

So it looks like, say, é cannot be encoded as US-ASCII: it does not
belong to the character set of the encoding. More generally, any
character beyond (Unicode) code point 127 cannot be encoded in
ANSI_X3.4-1968.

Let's test this (in a new REPL with a UTF-8 locale):

  ((@ (ice-9 iconv) string->bytevector) "é" "ANSI_X3.4-1968")
  ice-9/boot-9.scm:1669:16: In procedure raise-exception:
  Throw to key `encoding-error' with args `("put-char" "conversion to port encoding failed" 84 # #\é)'.

  ((@ (ice-9 iconv) string->bytevector) "é" "ANSI_X3.4-1968" 'substitute)
  $2 = #vu8(63)
  ((@ (rnrs bytevectors) utf8->string) #vu8(63))
  $3 = "?"

and the other direction:

  ((@ (ice-9 iconv) bytevector->string) #vu8(128) "ANSI_X3.4-1968" 'substitute)
  $5 = "�" ;; why #\� and not #\?? I don't know, I guess Guile is inconsistent

(FWIW, I would throw a decoding-error here instead of silently
corrupting the file names.)

Greetings,
Maxime.