From: Ludovic Courtès
To: Timothy Sample
Cc: 48114@debbugs.gnu.org
Subject: Re: bug#48114: Disarchive occasionally fails tests
Date: Mon, 03 May 2021 22:03:59 +0200
Message-ID: <874kfjwpn4.fsf@gnu.org>
In-Reply-To: <8735v4ea7y.fsf@ngyro.com> (Timothy Sample's message of "Mon, 03 May 2021 00:02:09 -0400")

Hi!

Timothy Sample wrote:

> Timothy Sample writes:
>
>> I’m still looking into this, but I wanted to quickly post this
>> reproducer for the Guile bug:
>>
>>     (use-modules (ice-9 regex))
>>     (define str
>>       "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
>>     (match:substring (string-match "[0-8]+" str))
>>
>> This triggers the out-of-range error when run with “LC_ALL=C”.
>
> It turns out that all that’s needed is the last code point, which is
> “Number Eleven Full Stop”, or ‘⒒’.

Whaaat?  “Number Eleven Full Stop”, I wonder how the Unicode folks came
up with that one.  ㊷ = ㉚ + ⒓

> When Guile converts this to an ASCII C string using
> ‘u32_conv_from_encoding’, it becomes “11.”.  The regex (“[0-8]+”)
> matches the “11” part with start index 0 and end index 2.  The
> ‘fixup_multibyte_match’ function does nothing (it only matters when
> the locale encoding is multibyte) [1].  Guile then builds the match
> vector with the original string but keeps the ASCII offsets.  In other
> words, it thinks the match substring goes from 0 to 2 in a single code
> point string:
>
>     ,use (ice-9 regex)
>     (string-match "11" "\u2492")
>     => #("\u2492" (0 . 2))
>
> I’m not sure there’s any way to solve this nicely in Guile.  It would be
> clearer if the match vector included the string as libc matched it, but
> it’s still surprising that the match happens with a different string.

Yeah, I don’t think there’s much we can do.  It’s a lot of fun anyway.

Thanks for investigating!

Ludo’.
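
For anyone following the thread, the offset mismatch described above can be modeled outside Guile. What follows is only a sketch in Python, not Guile's actual code path: it assumes Unicode NFKD compatibility decomposition as a stand-in for the ASCII conversion done by ‘u32_conv_from_encoding’, and the helper names `ascii_view` and `checked_substring` are hypothetical, introduced here for illustration. It shows that the offsets returned by the regex are valid for the converted string (“11.”) but out of range for the original one-code-point string.

```python
import re
import unicodedata


def ascii_view(s: str) -> str:
    """Stand-in for the ASCII conversion step.

    Assumption: NFKD compatibility decomposition is used here purely to
    model the transliteration; it is not what Guile or libc actually call.
    """
    return unicodedata.normalize("NFKD", s)


def checked_substring(s: str, start: int, end: int) -> str:
    """Substring that rejects out-of-range indices.

    Python slicing silently clamps, so this helper raises instead, to
    mimic the out-of-range error reported in the thread.
    """
    if not (0 <= start <= end <= len(s)):
        raise IndexError(
            f"range ({start} . {end}) out of bounds for length {len(s)}"
        )
    return s[start:end]


original = "\u2492"               # NUMBER ELEVEN FULL STOP, one code point
converted = ascii_view(original)  # three ASCII characters

m = re.search(r"[0-8]+", converted)
span = m.span()                   # offsets are only valid for `converted`

# Consistent use: apply the offsets to the string that was matched.
matched_text = checked_substring(converted, *span)

# The bug pattern: apply the same offsets to the original string.
try:
    checked_substring(original, *span)
    mismatch_detected = False
except IndexError:
    mismatch_detected = True
```

The point of the sketch is the pairing of a match with the string it was computed on: as long as offsets and string travel together (as `converted` and `span` do here), the substring is well defined, which is the idea behind the suggestion that the match vector could carry the string as libc matched it.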