Stack marking race conditions lead to crashes for multi-threaded programs in 2.2.7/3.0.1

  • Done
Details
3 participants
  • Ludovic Courtès
  • Mark H Weaver
  • Andy Wingo
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
serious
Ludovic Courtès wrote on 24 Aug 2017 00:20
Grafting code triggers GC/thread-safety issue on Guile 2.2.2
(address . bug-guix@gnu.org)
877exuj58y.fsf@gnu.org
On current ‘core-updates’, the code in (guix build graft) triggers
random Guile crashes (GC issue? thread-safety issue?) when running on
Guile 2.2.2, as initially reported by Marius at
<https://lists.gnu.org/archive/html/guix-devel/2017-08/msg00013.html>:

grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> '/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
ERROR: In procedure put-bytevector: Wrong type argument in position 1 (expecting open output port): #<closed: file 8f1930>
builder for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed with exit code 1

[...]

ludo@ribbon ~/src/guix/+core-updates$ guix gc --clear-failures $(guix gc --list-failures)
ludo@ribbon ~/src/guix/+core-updates$ guix build /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
La jena derivo estos konstruata:
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
@ build-started /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - x86_64-linux /var/log/guix/drvs/3c//rrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv.bz2
grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> '/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
ERROR: In procedure put-bytevector: Wrong type argument in position 1 (expecting open output port): #<closed: file 7517e0>
builder for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed with exit code 1
@ build-failed /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 1 builder for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed with exit code 1
guix build: error: build failed: build of `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed
ludo@ribbon ~/src/guix/+core-updates$ guix gc --clear-failures $(guix gc --list-failures)
ludo@ribbon ~/src/guix/+core-updates$ guix build /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
La jena derivo estos konstruata:
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
@ build-started /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - x86_64-linux /var/log/guix/drvs/3c//rrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv.bz2
grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> '/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
ERROR: In procedure variable-ref: Not a variable: (194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255)
builder for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed with exit code 1
@ build-failed /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 1 builder for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed with exit code 1

The problem does not show up when running on a single thread:

guix build: error: build failed: build of `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed
ludo@ribbon ~/src/guix/+core-updates$ guix gc --clear-failures $(guix gc --list-failures)
ludo@ribbon ~/src/guix/+core-updates$ guix build /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv --cores=1
La jena derivo estos konstruata:
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
@ build-started /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - x86_64-linux /var/log/guix/drvs/3c//rrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv.bz2
grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> '/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
@ build-succeeded /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv -
/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1

Ludo’.
Ludovic Courtès wrote on 24 Aug 2017 00:26
control message for bug #28211
(address . control@debbugs.gnu.org)
8760ddkjjh.fsf@gnu.org
severity 28211 serious
Ludovic Courtès wrote on 24 Aug 2017 00:48
Re: bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2
(address . 28211@debbugs.gnu.org)
87ziapj3ye.fsf@gnu.org
ludo@gnu.org (Ludovic Courtès) skribis:

> On current ‘core-updates’, the code in (guix build graft) triggers
> random Guile crashes (GC issue? thread-safety issue?) when running on
> Guile 2.2.2, as initially reported by Marius at
> <https://lists.gnu.org/archive/html/guix-devel/2017-08/msg00013.html>:

Commit e4925e00ca420737556e2039b4fa1c40121ee567 works around it by using
Guile 2.0 for these derivations.

Ludo’.
Ludovic Courtès wrote on 24 Aug 2017 00:53
Re: Grafting fails for latest Go release candidate
(name . Marius Bakke)(address . mbakke@fastmail.com)
87lgm9j3q9.fsf@gnu.org
Marius Bakke <mbakke@fastmail.com> skribis:

> Most likely just lucky, new grafting failure today.
>
> I've noticed they are more likely to occur when the host system is busy.
> Grafting a large package such as 'texlive' is almost guaranteed to fail
> on a fully-utilized machine.
>
> Maybe we should switch back to 2.0 for the time being?

Yes, done: <https://bugs.gnu.org/28211>.

I took a look: the code in (guix build graft) is multi-threaded but
there’s no shared state among threads (except for the current
output/error ports, which may partly explain the error messages that
suggest memory corruption).  So it may be that the problem is in
libguile.

Ludo’.
Mark H Weaver wrote on 24 Aug 2017 06:56
Re: bug#28211: Grafting fails for latest Go release candidate
(name . Ludovic Courtès)(address . ludo@gnu.org)
878ti9o96p.fsf@netris.org
ludo@gnu.org (Ludovic Courtès) writes:

> Marius Bakke <mbakke@fastmail.com> skribis:
>
>> Most likely just lucky, new grafting failure today.
>>
>> I've noticed they are more likely to occur when the host system is busy.
>> Grafting a large package such as 'texlive' is almost guaranteed to fail
>> on a fully-utilized machine.
>>
>> Maybe we should switch back to 2.0 for the time being?
>
> Yes, done: <https://bugs.gnu.org/28211>.
>
> I took a look: the code in (guix build graft) is multi-threaded but
> there’s no shared state among threads (except for the current
> output/error ports,

In Guile, port operations do not include built-in thread
synchronization, so it's not safe for multiple threads to write to the
current output/error ports concurrently. Those writes should be
protected by a mutex.
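
A minimal sketch of that discipline, written against Guile's C API (the
helper and the mutex variable are invented names for illustration; in
the Guix Scheme code the natural equivalent would be 'with-mutex' from
(ice-9 threads) around the port writes):

#include <libguile.h>

/* Invented helper: serialize writes to the current output port.
   'output_port_lock' must be created once before threads start, e.g.
   output_port_lock = scm_make_mutex ();  */
static SCM output_port_lock;

static void
locked_message (const char *msg)
{
  scm_lock_mutex (output_port_lock);
  scm_puts (msg, scm_current_output_port ());
  /* A robust version would unlock via dynwind so a non-local exit
     cannot leave the mutex held.  */
  scm_unlock_mutex (output_port_lock);
}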

However, we had agreed that in 2.2, the port operations should be
implemented in such a way that the worst that could happen on typical
hardware in practice was for the output to be garbled. I didn't review
the new 2.2 ports code, so I'm not sure whether this was accomplished.

Mark
Ludovic Courtès wrote on 24 Apr 2018 18:03
Re: bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2
(address . 28211@debbugs.gnu.org)
87d0yo1tie.fsf@gnu.org
ludo@gnu.org (Ludovic Courtès) skribis:

> On current ‘core-updates’, the code in (guix build graft) triggers
> random Guile crashes (GC issue? thread-safety issue?) when running on
> Guile 2.2.2, as initially reported by Marius at
> <https://lists.gnu.org/archive/html/guix-devel/2017-08/msg00013.html>:

The problem still shows up with Guile 2.2.3+.  Here are some raw notes on
the clues I have so far.

When running the grafting code under Helgrind, we get things like this:

==30263== ----------------------------------------------------------------
==30263==
==30263== Lock at 0x5168F60 was first observed
==30263== at 0x4C34943: pthread_mutex_init (in /gnu/store/yiiz2cdfsx6rqvp1gvlz34pvrl0cjf5c-valgrind-3.13.0/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==30263== by 0x4EF36C3: scm_threads_prehistory (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E9F64D: scm_i_init_guile (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EF2557: scm_i_init_thread_for_guile (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EF2588: with_guile (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x5185757: GC_call_with_stack_base (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x4EF2957: scm_with_guile (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E9F5F1: scm_boot_guile (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x400B5F: main (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/bin/guile)
==30263== Address 0x5168f60 is 0 bytes inside data symbol "scm_i_misc_mutex"
==30263==
==30263== Lock at 0x53AA240 was first observed
==30263== at 0x4C30D6D: mutex_trylock_WRK (in /gnu/store/yiiz2cdfsx6rqvp1gvlz34pvrl0cjf5c-valgrind-3.13.0/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==30263== by 0x4C349E6: pthread_mutex_trylock (in /gnu/store/yiiz2cdfsx6rqvp1gvlz34pvrl0cjf5c-valgrind-3.13.0/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==30263== by 0x517C9C3: GC_register_disappearing_link_inner (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x4F0585D: weak_set_add_x (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F05E1F: scm_c_weak_set_add_x (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EF0964: scm_i_str2symbol (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EF1162: scm_from_utf8_symboln (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EAA950: scm_c_public_ref (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4ED39BF: scm_compile_shell_switches (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4ED3A24: scm_shell (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E9F45C: invoke_main_func (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E81C59: c_body (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== Address 0x53aa240 is 0 bytes inside data symbol "GC_allocate_ml"
==30263==
==30263== Possible data race during write of size 8 at 0xB349528 by thread #8
==30263== Locks held: 1, at address 0x5168F60
==30263== at 0x4E85E65: scm_strerror (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4573E4F: ???
==30263== by 0x4E85F12: scm_syserror (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E8F228: scm_mkdir (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F0117C: vm_regular_engine (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F047D9: scm_call_n (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E78EB7: scm_call_with_unblocked_asyncs (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F0117C: vm_regular_engine (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F047D9: scm_call_n (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EF2495: really_launch (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E81C59: c_body (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F0117C: vm_regular_engine (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==
==30263== This conflicts with a previous read of size 8 by thread #9
==30263== Locks held: 1, at address 0x53AA240
==30263== at 0x5182F70: GC_push_all_eager (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x518CA2C: GC_push_all_stacks (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x51834AC: GC_mark_some (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x5178F7C: GC_stopped_mark (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x5179A68: GC_try_to_collect_inner (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x517A43B: GC_collect_or_expand (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x517FD6A: GC_alloc_large (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x5180182: GC_generic_malloc (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== Address 0xb349528 is on thread #8's stack
==30263== in frame #0, created by scm_strerror (???:)
==30263==
==30263== ----------------------------------------------------------------
==30263==
==30263== Lock at 0x5168F60 was first observed
==30263== at 0x4C34943: pthread_mutex_init (in /gnu/store/yiiz2cdfsx6rqvp1gvlz34pvrl0cjf5c-valgrind-3.13.0/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==30263== by 0x4EF36C3: scm_threads_prehistory (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E9F64D: scm_i_init_guile (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EF2557: scm_i_init_thread_for_guile (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EF2588: with_guile (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x5185757: GC_call_with_stack_base (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x4EF2957: scm_with_guile (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E9F5F1: scm_boot_guile (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x400B5F: main (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/bin/guile)
==30263== Address 0x5168f60 is 0 bytes inside data symbol "scm_i_misc_mutex"
==30263==
==30263== Lock at 0x53AA240 was first observed
==30263== at 0x4C30D6D: mutex_trylock_WRK (in /gnu/store/yiiz2cdfsx6rqvp1gvlz34pvrl0cjf5c-valgrind-3.13.0/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==30263== by 0x4C349E6: pthread_mutex_trylock (in /gnu/store/yiiz2cdfsx6rqvp1gvlz34pvrl0cjf5c-valgrind-3.13.0/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==30263== by 0x517C9C3: GC_register_disappearing_link_inner (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x4F0585D: weak_set_add_x (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F05E1F: scm_c_weak_set_add_x (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EF0964: scm_i_str2symbol (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EF1162: scm_from_utf8_symboln (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EAA950: scm_c_public_ref (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4ED39BF: scm_compile_shell_switches (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4ED3A24: scm_shell (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E9F45C: invoke_main_func (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E81C59: c_body (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== Address 0x53aa240 is 0 bytes inside data symbol "GC_allocate_ml"
==30263==
==30263== Possible data race during write of size 8 at 0xB3494F8 by thread #8
==30263== Locks held: 1, at address 0x5168F60
==30263== at 0x4EC2C2F: scm_i_default_port_conversion_strategy (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0xC3F0D9F: ???
==30263== by 0x4EEC5DE: scm_from_locale_stringn (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E85E69: scm_strerror (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E85F12: scm_syserror (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E8F228: scm_mkdir (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F0117C: vm_regular_engine (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F047D9: scm_call_n (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4E78EB7: scm_call_with_unblocked_asyncs (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F0117C: vm_regular_engine (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4F047D9: scm_call_n (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263== by 0x4EF2495: really_launch (in /gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==
==30263== This conflicts with a previous read of size 8 by thread #9
==30263== Locks held: 1, at address 0x53AA240
==30263== at 0x5182F70: GC_push_all_eager (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x518CA2C: GC_push_all_stacks (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x51834AC: GC_mark_some (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x5178F7C: GC_stopped_mark (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x5179A68: GC_try_to_collect_inner (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x517A43B: GC_collect_or_expand (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x517FD6A: GC_alloc_large (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== by 0x5180182: GC_generic_malloc (in /gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263== Address 0xb3494f8 is on thread #8's stack
==30263== in frame #0, created by scm_i_default_port_conversion_strategy (???:)

I find it hard to draw any conclusions from this, though.

I was also able to gather backtraces at the time of one of these crashes
(here it’s “Wrong type to apply”) with GDB:

(gdb) catch syscall exit_group
Catchpoint 1 (syscall 'exit_group' [231])
(gdb) condition 1 ($rdi != 0)
(gdb) r
Starting program: /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/bin/guile t.scm
ERROR: In procedure type-pointer:
ERROR: In procedure gdbscm_type_pointer: Wrong type argument in position 1 (expecting gdb:type): #f
Error while executing Scheme code.[Thread debugging using libthread_db enabled]
Using host libthread_db library "/gnu/store/4sqaib7c2dfjv62ivrg9b8wa7bh226la-glibc-2.26.105-g0890d5379c/lib/libthread_db.so.1".
[New Thread 0x7ffff5ea5700 (LWP 8359)]
[New Thread 0x7ffff5654700 (LWP 8360)]
[New Thread 0x7ffff4e03700 (LWP 8361)]
[New Thread 0x7ffff3ecd700 (LWP 8362)]
grafting '/gnu/store/w3hxb7hrafkhxplcf5qzvsc0fbipqb3i-perl-5.26.1' -> '/gnu/store/w3hxb7hrafkhxplcf5qzvsc0fbipqb3i-perl-5.XX.X'...

;;; (jobs 8)
[New Thread 0x7ffff35ac700 (LWP 8363)]
[New Thread 0x7fffead5b700 (LWP 8364)]
[New Thread 0x7ffff2d5b700 (LWP 8365)]
[New Thread 0x7ffff24f8700 (LWP 8366)]
[New Thread 0x7ffff1ca7700 (LWP 8367)]
[New Thread 0x7ffff1456700 (LWP 8368)]
[New Thread 0x7ffff0c05700 (LWP 8369)]
[New Thread 0x7fffebfff700 (LWP 8370)]
Wrong type to apply: "lib"
[Switching to Thread 0x7ffff2d5b700 (LWP 8365)]

Thread 8 "guile" hit Catchpoint 1 (call to syscall exit_group), 0x00007ffff5f65a88 in _exit ()
from /gnu/store/n6acaivs0jwiwpidjr551dhdni5kgpcr-glibc-2.26.105-g0890d5379c/lib/libc.so.6
(gdb) thread apply all bt

Thread 13 (Thread 0x7fffebfff700 (LWP 8370)):
#0 0x00007ffff5eda5c6 in sigsuspend ()
from /gnu/store/n6acaivs0jwiwpidjr551dhdni5kgpcr-glibc-2.26.105-g0890d5379c/lib/libc.so.6
#1 0x00007ffff785d65a in GC_suspend_handler_inner ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#2 0x00007ffff785d70f in GC_suspend_handler ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#3 <signal handler called>
#4 0x00007ffff762caea in __lll_lock_wait ()
from /gnu/store/n6acaivs0jwiwpidjr551dhdni5kgpcr-glibc-2.26.105-g0890d5379c/lib/libpthread.so.0
#5 0x00007ffff7625c65 in pthread_mutex_lock ()
from /gnu/store/n6acaivs0jwiwpidjr551dhdni5kgpcr-glibc-2.26.105-g0890d5379c/lib/libpthread.so.0
#6 0x00007ffff784f9ed in GC_notify_or_invoke_finalizers ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#7 0x00007ffff7851dcd in GC_generic_malloc_many ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#8 0x00007ffff785a2ce in GC_malloc_kind ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#9 0x00007ffff7b591c6 in scm_i_make_string ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#10 0x00007ffff7b5b4d6 in scm_from_stringn ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#11 0x00007ffff7b5b76e in scm_take_locale_stringn ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#12 0x00007ffff7aff4b3 in scm_i_relativize_path ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#13 0x00007ffff7b041f7 in scm_open_file_with_encoding ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#14 0x00007ffff7b04333 in scm_i_open_file ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#15 0x00007ffff7b7017d in vm_regular_engine ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#16 0x00007ffff7b737da in scm_call_n ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#17 0x00007ffff7ae7eb8 in scm_call_with_unblocked_asyncs ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#18 0x00007ffff7b7017d in vm_regular_engine ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#19 0x00007ffff7b737da in scm_call_n ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#20 0x00007ffff7b61496 in really_launch ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#21 0x00007ffff7af0c5a in c_body ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#22 0x00007ffff7b7017d in vm_regular_engine ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#23 0x00007ffff7b737da in scm_call_n ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#24 0x00007ffff7b629a6 in catch ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#25 0x00007ffff7af1240 in scm_i_with_continuation_barrier ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#26 0x00007ffff7af12d5 in scm_c_with_continuation_barrier ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#27 0x00007ffff7b615bc in with_guile ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#28 0x00007ffff7856758 in GC_call_with_stack_base ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#29 0x00007ffff7b60c8d in launch_thread ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#30 0x00007ffff76234d5 in start_thread ()
from /gnu/store/n6acaivs0jwiwpidjr551dhdni5kgpcr-glibc-2.26.105-g0890d5379c/lib/libpthread.so.0
#31 0x00007ffff5f962cf in clone ()
from /gnu/store/n6acaivs0jwiwpidjr551dhdni5kgpcr-glibc-2.26.105-g0890d5379c/lib/libc.so.6

Thread 12 (Thread 0x7ffff0c05700 (LWP 8369)):
#0 0x00007ffff762c0b7 in do_futex_wait.constprop ()
from /gnu/store/n6acaivs0jwiwpidjr551dhdni5kgpcr-glibc-2.26.105-g0890d5379c/lib/libpthread.so.0
#1 0x00007ffff762c164 in __new_sem_wait_slow.constprop.0 ()
from /gnu/store/n6acaivs0jwiwpidjr551dhdni5kgpcr-glibc-2.26.105-g0890d5379c/lib/libpthread.so.0
#2 0x00007ffff785dd62 in GC_stop_world ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#3 0x00007ffff7849f0e in GC_stopped_mark ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#4 0x00007ffff784aa69 in GC_try_to_collect_inner ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#5 0x00007ffff784b43c in GC_collect_or_expand ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#6 0x00007ffff7850d6b in GC_alloc_large ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#7 0x00007ffff7851183 in GC_generic_malloc ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#8 0x00007ffff7851398 in GC_malloc_kind_global ()
from /gnu/store/7ivzka4by2casz1p65l8pm9vjnis0zjw-libgc-7.6.0/lib/libgc.so.1
#9 0x00007ffff7aedabf in scm_make_bytevector ()
from /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/lib/libguile-2.2.so.1
#10 0x00007ffff7b7017d in vm_regular_engine ()
from /gnu/store/3
[This message was truncated.]
Ludovic Courtès wrote on 8 May 2018 23:55
(address . 28211@debbugs.gnu.org)
87fu3124nt.fsf@gnu.org
Hello Andy & Mark,

ludo@gnu.org (Ludovic Courtès) skribis:

> ludo@gnu.org (Ludovic Courtès) skribis:
>
>> On current ‘core-updates’, the code in (guix build graft) triggers
>> random Guile crashes (GC issue? thread-safety issue?) when running on
>> Guile 2.2.2, as initially reported by Marius at
>> <https://lists.gnu.org/archive/html/guix-devel/2017-08/msg00013.html>:
>
> The problem still shows up with Guile 2.2.3+.

Here’s a clearer backtrace:

Core was generated by `/gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/bin/guile t.scm'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f7019db0d79 in scm_is_pair (x=0x0) at ../libguile/pairs.h:159
159 ../libguile/pairs.h: Dosiero aŭ dosierujo ne ekzistas.
[Current thread is 1 (Thread 0x7f6fe6f5d700 (LWP 2856))]
(gdb) thread apply all bt

Thread 18 (Thread 0x7f7015859700 (LWP 2845)):
#0 0x00007f70198c77af in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x12715a0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x129bd80, cond=cond@entry=0x1271578) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x1271578, mutex=mutex@entry=0x129bd80) at pthread_cond_wait.c:655
#3 0x00007f7019dffe45 in scm_pthread_cond_wait (cond=cond@entry=0x1271578, mutex=mutex@entry=0x129bd80) at threads.c:1621
#4 0x00007f7019e0000b in block_self (
queue=((#<smob thread 1425d50>) #<smob thread 1404cf0> #<smob thread 1554c70> #<smob thread 1554ff0> #<smob thread 1425db0> #<smob thread 1554de0> #<smob thread 1404a60> #<smob thread 1425e10> #<smob thread 1404fe0> #<smob thread 1425aa0> #<smob thread 1404a00> #<smob thread 1404c20> #<smob thread 1425d50>), mutex=mutex@entry=0x129bd80,
waittime=waittime@entry=0x0) at threads.c:316
#5 0x00007f7019e00157 in lock_mutex (current_thread=0x1271540, waittime=0x0, m=0x129bd80, kind=SCM_MUTEX_STANDARD) at threads.c:1037
#6 scm_timed_lock_mutex (mutex=#<smob mutex 1552b70>, timeout=<optimized out>) at threads.c:1098
#7 0x00007f7019e0e17d in vm_regular_engine (thread=0x12715a0, vp=0x144ecf0, registers=0x0, resume=428636079) at vm-engine.c:784
#8 0x00007f7019e117da in scm_call_n (proc=proc@entry=#<program 1423960>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#9 0x00007f7019d94879 in scm_call_0 (proc=proc@entry=#<program 1423960>) at eval.c:481
#10 0x00007f7019d85eb8 in scm_call_with_unblocked_asyncs (proc=#<program 1423960>) at async.c:400
#11 0x00007f7019e0e17d in vm_regular_engine (thread=0x12715a0, vp=0x144ecf0, registers=0x0, resume=428636079) at vm-engine.c:784
#12 0x00007f7019e117da in scm_call_n (proc=#<program 1371a80>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#13 0x00007f7019d94879 in scm_call_0 (proc=<optimized out>) at eval.c:481
#14 0x00007f7019dff496 in really_launch (d=0x1633240) at threads.c:794
#15 0x00007f7019d8ec5a in c_body (d=0x7f7015858e60) at continuations.c:422
#16 0x00007f7019e0e17d in vm_regular_engine (thread=0x12715a0, vp=0x144ecf0, registers=0x0, resume=428636079) at vm-engine.c:784
#17 0x00007f7019e117da in scm_call_n (proc=proc@entry=#<smob catch-closure 1423d60>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#18 0x00007f7019d94879 in scm_call_0 (proc=proc@entry=#<smob catch-closure 1423d60>) at eval.c:481
#19 0x00007f7019e009a6 in catch (tag=tag@entry=#t, thunk=#<smob catch-closure 1423d60>,
handler=<error reading variable: ERROR: Cannot access memory at address 0x100000000>0x1423ce0,
pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x100000000>0x1423c60) at throw.c:137
#20 0x00007f7019e00ce5 in scm_catch_with_pre_unwind_handler (key=key@entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>) at throw.c:254
#21 0x00007f7019e00e9f in scm_c_catch (tag=tag@entry=#t, body=body@entry=0x7f7019d8ec50 <c_body>, body_data=body_data@entry=0x7f7015858e60,
handler=handler@entry=0x7f7019d8eee0 <c_handler>, handler_data=handler_data@entry=0x7f7015858e60,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7019d8ed40 <pre_unwind_handler>, pre_unwind_handler_data=0x1311bc0) at throw.c:377
#22 0x00007f7019d8f240 in scm_i_with_continuation_barrier (body=body@entry=0x7f7019d8ec50 <c_body>, body_data=body_data@entry=0x7f7015858e60,
handler=handler@entry=0x7f7019d8eee0 <c_handler>, handler_data=handler_data@entry=0x7f7015858e60,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7019d8ed40 <pre_unwind_handler>, pre_unwind_handler_data=0x1311bc0) at continuations.c:360
#23 0x00007f7019d8f2d5 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#24 0x00007f7019dff5bc in with_guile (base=base@entry=0x7f7015858ec8, data=data@entry=0x7f7015858ef0) at threads.c:661
#25 0x00007f7019af4758 in GC_call_with_stack_base (fn=fn@entry=0x7f7019dff570 <with_guile>, arg=arg@entry=0x7f7015858ef0) at misc.c:1935
#26 0x00007f7019dfec8d in scm_i_with_guile (dynamic_state=<optimized out>, data=0x1633240, func=0x7f7019dff420 <really_launch>) at threads.c:704
#27 launch_thread (d=0x1633240) at threads.c:803
#28 0x00007f70198c14d5 in start_thread (arg=0x7f7015859700) at pthread_create.c:465
#29 0x00007f70182342cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 17 (Thread 0x7f701714f700 (LWP 2842)):
#0 0x00007f70198c77af in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x1271ae0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x129bd80, cond=cond@entry=0x1271ab8) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x1271ab8, mutex=mutex@entry=0x129bd80) at pthread_cond_wait.c:655
#3 0x00007f7019dffe45 in scm_pthread_cond_wait (cond=cond@entry=0x1271ab8, mutex=mutex@entry=0x129bd80) at threads.c:1621
#4 0x00007f7019e0000b in block_self (
queue=((#<smob thread 1425d50>) #<smob thread 1404cf0> #<smob thread 1554c70> #<smob thread 1554ff0> #<smob thread 1425db0> #<smob thread 1554de0> #<smob thread 1404a60> #<smob thread 1425e10> #<smob thread 1404fe0> #<smob thread 1425aa0> #<smob thread 1404a00> #<smob thread 1404c20> #<smob thread 1425d50>), mutex=mutex@entry=0x129bd80,
waittime=waittime@entry=0x0) at threads.c:316
#5 0x00007f7019e00157 in lock_mutex (current_thread=0x1271a80, waittime=0x0, m=0x129bd80, kind=SCM_MUTEX_STANDARD) at threads.c:1037
#6 scm_timed_lock_mutex (mutex=#<smob mutex 1552b70>, timeout=<optimized out>) at threads.c:1098
#7 0x00007f7019e0e17d in vm_regular_engine (thread=0x1271ae0, vp=0x144eea0, registers=0x0, resume=428636079) at vm-engine.c:784
#8 0x00007f7019e117da in scm_call_n (proc=proc@entry=#<program 163cfc0>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#9 0x00007f7019d94879 in scm_call_0 (proc=proc@entry=#<program 163cfc0>) at eval.c:481
#10 0x00007f7019d85eb8 in scm_call_with_unblocked_asyncs (proc=#<program 163cfc0>) at async.c:400
#11 0x00007f7019e0e17d in vm_regular_engine (thread=0x1271ae0, vp=0x144eea0, registers=0x0, resume=428636079) at vm-engine.c:784
#12 0x00007f7019e117da in scm_call_n (proc=#<program 1371b40>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#13 0x00007f7019d94879 in scm_call_0 (proc=<optimized out>) at eval.c:481
#14 0x00007f7019dff496 in really_launch (d=0x16333c0) at threads.c:794
#15 0x00007f7019d8ec5a in c_body (d=0x7f701714ee60) at continuations.c:422
#16 0x00007f7019e0e17d in vm_regular_engine (thread=0x1271ae0, vp=0x144eea0, registers=0x0, resume=428636079) at vm-engine.c:784
#17 0x00007f7019e117da in scm_call_n (proc=proc@entry=#<smob catch-closure 141e400>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#18 0x00007f7019d94879 in scm_call_0 (proc=proc@entry=#<smob catch-closure 141e400>) at eval.c:481
#19 0x00007f7019e009a6 in catch (tag=tag@entry=#t, thunk=#<smob catch-closure 141e400>,
handler=<error reading variable: ERROR: Cannot access memory at address 0x100000000>0x141e380,
pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x100000000>0x141e300) at throw.c:137
#20 0x00007f7019e00ce5 in scm_catch_with_pre_unwind_handler (key=key@entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>) at throw.c:254
#21 0x00007f7019e00e9f in scm_c_catch (tag=tag@entry=#t, body=body@entry=0x7f7019d8ec50 <c_body>, body_data=body_data@entry=0x7f701714ee60,
handler=handler@entry=0x7f7019d8eee0 <c_handler>, handler_data=handler_data@entry=0x7f701714ee60,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7019d8ed40 <pre_unwind_handler>, pre_unwind_handler_data=0x1311bc0) at throw.c:377
#22 0x00007f7019d8f240 in scm_i_with_continuation_barrier (body=body@entry=0x7f7019d8ec50 <c_body>, body_data=body_data@entry=0x7f701714ee60,
handler=handler@entry=0x7f7019d8eee0 <c_handler>, handler_data=handler_data@entry=0x7f701714ee60,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7019d8ed40 <pre_unwind_handler>, pre_unwind_handler_data=0x1311bc0) at continuations.c:360
#23 0x00007f7019d8f2d5 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#24 0x00007f7019dff5bc in with_guile (base=base@entry=0x7f701714eec8, data=data@entry=0x7f701714eef0) at threads.c:661
#25 0x00007f7019af4758 in GC_call_with_stack_base (fn=fn@entry=0x7f7019dff570 <with_guile>, arg=arg@entry=0x7f701714eef0) at misc.c:1935
#26 0x00007f7019dfec8d in scm_i_with_guile (dynamic_state=<optimized out>, data=0x16333c0, func=0x7f7019dff420 <really_launch>) at threads.c:704
#27 launch_thread (d=0x16333c0) at threads.c:803
#28 0x00007f70198c14d5 in start_thread (arg=0x7f701714f700) at pthread_create.c:465
#29 0x00007f70182342cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 16 (Thread 0x7f7006f5d700 (LWP 2849)):
#0 0x00007f70198c77af in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x14e4ca0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x129bd80, cond=cond@entry=0x14e4c78) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x14e4c78, mutex=mutex@entry=0x129bd80) at pthread_cond_wait.c:655
#3 0x00007f7019dffe45 in scm_pthread_cond_wait (cond=cond@entry=0x14e4c78, mutex=mutex@entry=0x129bd80) at threads.c:1621
#4 0x00007f7019e0000b in block_self (
queue=((#<smob thread 1425d50>) #<smob thread 1404cf0> #<smob thread 1554c70> #<smob thread 1554ff0> #<smob thread 1425db0> #<smob thread 1554de0> #<smob thread 1404a60> #<smob thread 1425e10> #<smob thread 1404fe0> #<smob thread 1425aa0> #<smob thread 1404a00> #<smob thread 1404c20> #<smob thread 1425d50>), mutex=mutex@entry=0x129bd80,
waittime=waittime@entry=0x0) at threads.c:316
#5 0x00007f7019e00157 in lock_mutex (current_thread=0x14e4c40, waittime=0x0, m=0x129bd80, kind=SCM_MUTEX_STANDARD) at threads.c:1037
#6 scm_timed_lock_mutex (mutex=#<smob mutex 1552b70>, timeout=<optimized out>) at threads.c:1098
#7 0x00007f7019e0e17d in vm_regular_engine (thread=0x14e4ca0, vp=0x144eab0, registers=0x0, resume=428636079) at vm-engine.c:784
#8 0x00007f7019e117da in scm_call_n (proc=proc@entry=#<program 1513fe0>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#9 0x00007f7019d94879 in scm_call_0 (proc=proc@entry=#<program 1513fe0>) at eval.c:481
#10 0x00007f7019d85eb8 in scm_call_with_unblocked_asyncs (proc=#<program 1513fe0>) at async.c:400
#11 0x00007f7019e0e17d in vm_regular_engine (thread=0x14e4ca0, vp=0x144eab0, registers=0x0, resume=428636079) at vm-engine.c:784
#12 0x00007f7019e117da in scm_call_n (proc=#<program 1371980>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#13 0x00007f7019d94879 in scm_call_0 (proc=<optimized out>) at eval.c:481
#14 0x00007f7019dff496 in really_launch (d=0x149ed00) at threads.c:794
#15 0x00007f7019d8ec5a in c_body (d=0x7f7006f5ce60) at continuations.c:422
#16 0x00007f7019e0e17d in vm_regular_engine (thread=0x14e4ca0, vp=0x144eab0, registers=0x0, resume=428636079) at vm-engine.c:784
#17 0x00007f7019e117da in scm_call_n (proc=proc@entry=#<smob catch-closure 141f440>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#18 0x00007f7019d94879 in scm_call_0 (proc=proc@entry=#<smob catch-closure 141f440>) at eval.c:481
#19 0x00007f7019e009a6 in catch (tag=tag@entry=#t, thunk=#<smob catch-closure 141f440>,
handler=<error reading variable: ERROR: Cannot access memory at address 0x100000000>0x141f400,
pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x100000000>0x141f380) at throw.c:137
#20 0x00007f7019e00ce5 in scm_catch_with_pre_unwind_handler (key=key@entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>) at throw.c:254
#21 0x00007f7019e00e9f in scm_c_catch (tag=tag@entry=#t, body=body@entry=0x7f7019d8ec50 <c_body>, body_data=body_data@entry=0x7f7006f5ce60,
handler=handler@entry=0x7f7019d8eee0 <c_handler>, handler_data=handler_data@entry=0x7f7006f5ce60,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7019d8ed40 <pre_unwind_handler>, pre_unwind_handler_data=0x1311bc0) at throw.c:377
#22 0x00007f7019d8f240 in scm_i_with_continuation_barrier (body=body@entry=0x7f7019d8ec50 <c_body>, body_data=body_data@entry=0x7f7006f5ce60,
handler=handler@entry=0x7f7019d8eee0 <c_handler>, handler_data=handler_data@entry=0x7f7006f5ce60,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7019d8ed40 <pre_unwind_handler>, pre_unwind_handler_data=0x1311bc0) at continuations.c:360
#23 0x00007f7019d8f2d5 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#24 0x00007f7019dff5bc in with_guile (base=base@entry=0x7f7006f5cec8, data=data@entry=0x7f7006f5cef0) at threads.c:661
#25 0x00007f7019af4758 in GC_call_with_stack_base (fn=fn@entry=0x7f7019dff570 <with_guile>, arg=arg@entry=0x7f7006f5cef0) at misc.c:1935
#26 0x00007f7019dfec8d in scm_i_with_guile (dynamic_state=<optimized out>, data=0x149ed00, func=0x7f7019dff420 <really_launch>) at threads.c:704
#27 launch_thread (d=0x149ed00) at threads.c:803
#28 0x00007f70198c14d5 in start_thread (arg=0x7f7006f5d700) at pthread_create.c:465
#29 0x00007f70182342cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 15 (Thread 0x7f7007fff700 (LWP 2847)):
#0 0x00007f70198c77af in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x1271220) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x129bd80, cond=cond@entry=0x12711f8) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x12711f8, mutex=mutex@entry=0x129bd80) at pthread_cond_wait.c:655
#3 0x00007f7019dffe45 in scm_pthread_cond_wait (cond=cond@entry=0x12711f8, mutex=mutex@entry=0x129bd80) at threads.c:1621
#4 0x00007f7019e0000b in block_self (
queue=((#<smob thread 1425d50>) #<smob thread 1404cf0> #<smob thread 1554c70> #<smob thread 1554ff0> #<smob thread 1425db0> #<smob thread 1554de0> #<smob thread 1404a60> #<smob thread 1425e10> #<smob thread 1404fe0> #<smob thread 1425aa0> #<smob thread 1404a00> #<smob thread 1404c20> #<smob thread 1425d50>), mutex=mutex@entry=0x129bd80,
waittime=waittime@entry=0x0) at threads.c:316
#5 0x00007f7019e00157 in lock_mutex (current_thread=0x12711c0, waittime=0x0, m=0x129bd80, kind=SCM_MUTEX_STANDARD) at threads.c:1037
#6 scm_timed_lock_mutex (mutex=#<smob mutex 1552b70>, timeout=<optimized out>) at threads.c:1098
#7 0x00007f7019e0e17d in vm_regular_engine (thread=0x1271220, vp=0x144ebd0, registers=0x0, resume=428636079) at vm-engine.c:784
#8 0x00007f7019e117da in scm_call_n (proc=proc@entry=#<program 146fe80>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#9 0x00007f7019d94879 in scm_call_0 (proc=proc@entry=#<program 146fe80>) at eval.c:481
#10 0x00007f7019d85eb8 in scm_call_with_unblocked_asyncs (proc=#<program 146fe80>) at async.c:400
#11 0x00007f7019e0e17d in vm_regular_engine (thread=0x1271220, vp=0x144ebd0, registers=0x0, resume=428636079) at vm-engine.c:784
#12 0x00007f7019e117da in scm_call_n (proc=#<program 1371a00>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#13 0x00007f7019d94879 in scm_call_0 (proc=<optimized out>) at eval.c:481
#14 0x00007f7019dff496 in really_launch (d=0x1633100) at threads.c:794
#15 0x00007f7019d8ec5a in c_body (d=0x7f7007ffee60) at continuations.c:422
#16 0x00007f7019e0e17d in vm_regular_engine (thread=0x1271220, vp=0x144ebd0, registers=0x0, resume=428636079) at vm-engine.c:784
#17 0x00007f7019e117da in scm_call_n (proc=proc@entry=#<smob catch-closure 146fa00>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#18 0x00007f7019d94879 in scm_call_0 (proc=proc@entry=#<smob catch-closure 146fa00>) at eval.c:481
#19 0x00007f7019e009a6 in catch (tag=tag@entry=#t, thunk=#<smob catch-closure 146fa00>,
handler=<error reading variable: ERROR: Cannot access memory at address 0x100000000>0x146f8e0,
pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x100000000>0x146f8a0) at throw.c:137
#20 0x00007f7019e00ce5 in scm_catch_with_pre_unwind_handler (key=key@entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>) at throw.c:254
#21 0x00007f7019e00e9f in scm_c_catch (tag=tag@entry=#t, body=body@entry=0x7f7019d8ec50 <c_body>, body_data=body_data@entry=0x7f7007ffee60,
handler=handler@entry=0x7f7019d8eee0 <c_handler>, handler_data=handler_data@entry=0x7f7007ffee60,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7019d8ed40 <pre_unwind_handler>, pre_unwind_handler_data=0x1311bc0) at throw.c:377
#22 0x00007f7019d8f240 in scm_i_with_continuation_barrier (body=body@entry=0x7f7019d8ec50 <c_body>, body_data=body_data@entry=0x7f7007ffee60,
handler=handler@entry=0x7f7019d8eee0 <c_handler>, handler_data=handler_data@entry=0x7f7007ffee60,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7019d8ed40 <pre_unwind_handler>, pre_unwind_handler_data=0x1311bc0) at continuations.c:360
#23 0x00007f7019d8f2d5 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#24 0x00007f7019dff5bc in with_guile (base=base@entry=0x7f7007ffeec8, data=data@entry=0x7f7007ffeef0) at threads.c:661
#25 0x00007f7019af4758 in GC_call_with_stack_base (fn=fn@entry=0x7f7019dff570 <with_guile>, arg=arg@entry=0x7f7007ffeef0) at misc.c:1935
#26 0x00007f7019dfec8d in scm_i_with_guile (dynamic_state=<optimized out>, data=0x1633100, func=0x7f7019dff420 <really_launch>) at threads.c:704
#27 launch_thread (d=0x1633100) at threads.c:803
#28 0x00007f70198c14d5 in start_thread (arg=0x7f7007fff700) at pthread_create.c:465
#29 0x00007f70182342cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 14 (Thread 0x7f70160ab700 (LWP 2844)):
#0 0x00007f70198c77af in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x1271764) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x129bd80, cond=cond@entry=0x1271738) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x1271738, mutex=mutex@entry=0x129bd80) at pthread_cond_wait.c:655
#3 0x00007f7019dffe45 in scm_pthread_cond_wait (cond=cond@entry=0x1271738, mutex=mutex@entry=0x129bd80) at threads.c:1621
#4 0x00007f7019e0000b in block_self (
queue=((#<smob thread 1425d50>) #<smob thread 1404cf0> #<smob thread 1554c70> #<smob thread 1554ff0> #<smob thread 1425db0> #<smob thread 1554de0> #<smob thread 1404a60> #<smob thread 1425e10> #<smob thread 1404fe0> #<smob thread 1425aa0> #<smob thread 1404a00> #<smob thread 1404c20> #<smob thread 1425d50>), mutex=mutex@entry=0x129bd80,
waittime=waittime@entry=0x0) at threads.c:316
#5 0x00007f7019e00157 in lock_mutex (current_thread=0x1271700, waittime=0x0, m=0x129bd80, kind=SCM_MUTEX_STANDARD) at threads.c
[This message was truncated.]

Then I thought vp->sp might be out-of-sync compared to the local
variable ‘sp’, which in turn could cause ‘scm_i_vm_mark_stack’ to not
mark a few items on the tip of the stack.  So I did this:

diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
index 9509cd643..1136b2271 100644
--- a/libguile/vm-engine.c
+++ b/libguile/vm-engine.c
@@ -151,7 +151,8 @@
code, or otherwise push anything on the stack, you will need to
CACHE_SP afterwards to restore the possibly-changed stack pointer. */
-#define SYNC_IP() vp->ip = (ip)
+#define SYNC_IP() \
+ do { vp->ip = (ip); vp->sp = (sp); } while (0)
#define CACHE_SP() sp = vp->sp
#define CACHE_REGISTER() \
That seemed to help but I eventually got another similar crash.

FWIW I managed to reduce (guix build graft) to just the code below, and
it’s enough to trigger a crash after a dozen runs:

(define-module (guix build graft)
#:use-module (guix build utils)
#:use-module (rnrs bytevectors)
#:use-module (ice-9 vlist)
#:use-module (ice-9 match)
#:use-module (ice-9 threads)
#:use-module (ice-9 binary-ports)
#:use-module (ice-9 iconv)
#:use-module (srfi srfi-1) ; list library
#:use-module (srfi srfi-26) ; cut and cute
#:export (replace-store-references
rewrite-directory))

(define (exit-on-exception proc)
"Return a procedure that wraps PROC so that 'primitive-exit' is called when
an exception is caught."
(lambda (arg)
(catch #t
(lambda ()
(proc arg))
(const #t)
(lambda (key . args)
;; Since ports are not thread-safe as of Guile 2.0, reopen stderr.
(let ((port (fdopen 2 "w0")))
(print-exception port #f key args)
(display-backtrace (make-stack #t) port)
(dynamic-call "abort" (dynamic-link))
(sleep 1000)
(primitive-exit 1))))))

(define* (rewrite-directory directory output mapping)
(define prefix-len
(string-length directory))

(define (destination file)
(string-append output (string-drop file prefix-len)))

(define (rewrite-leaf file)
(let (#;(stat (lstat file))
(dest (destination file)))
(catch 'foo
(lambda ()
(throw 'foo (dirname (string-append "/tmp/x" dest))))
(lambda (key file)
(call-with-output-string
(lambda (output)
(make-bytevector (expt 2 10) #x77)
(string->bytevector file "UTF-8")
(open-input-string file)
(make-bytevector (expt 2 20) #x77)))))))

(n-par-for-each (pk 'jobs (parallel-job-count))
(exit-on-exception rewrite-leaf)
(find-files directory (const #t)
#:directories? #t)))

;;; graft.scm ends here


Thoughts? Whatever ideas you have could be helpful. :-)

Cheers,
Ludo’.
Mark H Weaver wrote on 9 May 2018 02:32
(name . Ludovic Courtès)(address . ludo@gnu.org)
87d0y5k6sl.fsf@netris.org
Hi Ludovic,

ludo@gnu.org (Ludovic Courtès) writes:
[...]
> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
> #0 0x00007f7019db0d79 in scm_is_pair (x=<error reading variable: ERROR: Cannot access memory at address 0x0>0x0) at ../libguile/pairs.h:159
> #1 scm_ilength (sx=<optimized out>) at list.c:190
[...]
> What this means is that Thread 1 gets NULL instead of a list as its
> on-stack argument (vm-engine.c:909 is ‘tail-apply’).
>
> How can arguments on the VM stack be zeroed?

I doubt that's what happened, because I expect that each VM stack is
dedicated to a single hardware thread. In theory, if a single VM stack
is used by one thread, and then later used by another thread,
thread-safety issues on the VM stack could arise in the absence of
proper thread synchronization.

However, I think it's _far_ more likely that the NULL argument on the
stack was copied from memory shared by multiple threads without proper
thread synchronization.

On modern weakly-ordered memory architectures, writes by one thread may
be seen in a different order by another thread. For example, if one
thread allocates a pair, initializes its CAR and CDR to non-NULL values,
and then writes the address of the pair somewhere, another thread could
first read the address of the pair, and then read NULL from its CAR or
CDR, unless proper thread synchronization is used. At best, this
requires memory barriers in both the reader and writer which are
typically quite expensive. On x86 processors the read barrier could
expand into a no-op in typical cases, but the write barrier cannot be
avoided, and must be placed after initializing the structure and before
writing its pointer.

I think it's most likely that something like this is happening, because
NULL is not a valid SCM value. The only code that should normally write
a NULL to an SCM slot is Boehm GC, which clears memory before handing it
to the user. So, if you read a NULL from a memory location that's meant
to hold an SCM, then it's most likely because the reading thread does
not yet see the writes that initialized it, because of the
weakly-ordered memory architecture.
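
A compressed C illustration of that publication pattern (generic code,
not libguile's): with plain stores, the compiler and a weakly-ordered
CPU may make the pointer visible before the fields it points to, so a
reader can observe NULL; a release store paired with an acquire load
rules that out.

#include <stdatomic.h>
#include <stdlib.h>

struct pair { void *car, *cdr; };

/* The "somewhere" the pair's address gets written.  */
static _Atomic(struct pair *) shared;

void
publish (void *a, void *b)
{
  struct pair *cell = malloc (sizeof *cell);   /* the GC allocator in libguile's case */
  cell->car = a;
  cell->cdr = b;
  /* Release: the stores to car/cdr become visible no later than the
     store of the pointer itself.  */
  atomic_store_explicit (&shared, cell, memory_order_release);
}

void *
consume (void)
{
  struct pair *cell = atomic_load_explicit (&shared, memory_order_acquire);
  /* With plain, non-atomic loads and stores instead, this read could
     legitimately see NULL in cell->car even after seeing a non-NULL
     'cell'.  */
  return cell ? cell->car : NULL;
}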

> I commented out the MADV_DONTNEED call to be sure, but I can still
> reproduce the bug.

I strongly doubt that the MADV_DONTNEED is relevant to this issue.

> Then I thought vp->sp might be out-of-sync compared to the local
> variable ‘sp’, which in turn could cause ‘scm_i_vm_mark_stack’ to not
> mark a few items on the tip of the stack. So I did this:
>
> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
> index 9509cd643..1136b2271 100644
> --- a/libguile/vm-engine.c
> +++ b/libguile/vm-engine.c
> @@ -151,7 +151,8 @@
> code, or otherwise push anything on the stack, you will need to
> CACHE_SP afterwards to restore the possibly-changed stack pointer. */
>
> -#define SYNC_IP() vp->ip = (ip)
> +#define SYNC_IP() \
> + do { vp->ip = (ip); vp->sp = (sp); } while (0)

I don't see how a change like this could be useful for any thread safety
problem. Since stores by one thread may be seen in a different order by
other threads, the memory write corresponding to "vp->sp = (sp)" might
be delayed for some arbitrarily long period of time after the writes
that follow it, up until the next appropriate memory barrier.

For now, I would suggest avoiding multi-threaded code in Guix, or at
least to avoid loading any Scheme code from multiple threads.

How about using multiple processes instead?
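
As a rough sketch of the shape that could take (invented C code, one
worker process per file, no shared Scheme heap at all):

#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for the per-file grafting work.  */
extern void rewrite_one_file (const char *file);

void
rewrite_all (char *const files[], size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      pid_t pid = fork ();
      if (pid == 0)
        {
          rewrite_one_file (files[i]);
          _exit (0);
        }
    }

  /* Reap every child; a real version would cap the number of live
     children at the job count and check their exit statuses.  */
  while (wait (NULL) > 0)
    ;
}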

Mark
Ludovic Courtès wrote on 9 May 2018 09:17
(name . Mark H Weaver)(address . mhw@netris.org)
87zi19z4a8.fsf@gnu.org
Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:
> [...]
>> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
>> #0 0x00007f7019db0d79 in scm_is_pair (x=<error reading variable: ERROR: Cannot access memory at address 0x0>0x0) at ../libguile/pairs.h:159
>> #1 scm_ilength (sx=<optimized out>) at list.c:190
> [...]
>> What this means is that Thread 1 gets NULL instead of a list as its
>> on-stack argument (vm-engine.c:909 is ‘tail-apply’).
>>
>> How can arguments on the VM stack be zeroed?
>
> I doubt that's what happened, because I expect that each VM stack is
> dedicated to a single hardware thread. In theory, if a single VM stack
> is used by one thread, and then later used by another thread,
> thread-safety issues on the VM stack could arise in the absense of
> proper thread synchronization.
>
> However, I think it's _far_ more likely that the NULL argument on the
> stack was copied from memory shared by multiple threads without proper
> thread synchronization.

It could be this, but this particular case is an “embarrassingly
parallel” program where threads work on independent data sets without
any inter-thread communication whatsoever.

What you describe could nevertheless be happening at a lower level,
within libguile, though it’s not clear to me where that could be.

>> I commented out the MADV_DONTNEED call to be sure, but I can still
>> reproduce the bug.
>
> I strongly doubt that the MADV_DONTNEED is relevant to this issue.

I thought about it because that’s one way part of the VM stack could be
zeroed out.

>> Then I thought vp->sp might be out-of-sync compared to the local
>> variable ‘sp’, which in turn could cause ‘scm_i_vm_mark_stack’ to not
>> mark a few items on the tip of the stack. So I did this:
>>
>> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
>> index 9509cd643..1136b2271 100644
>> --- a/libguile/vm-engine.c
>> +++ b/libguile/vm-engine.c
>> @@ -151,7 +151,8 @@
>> code, or otherwise push anything on the stack, you will need to
>> CACHE_SP afterwards to restore the possibly-changed stack pointer. */
>>
>> -#define SYNC_IP() vp->ip = (ip)
>> +#define SYNC_IP() \
>> + do { vp->ip = (ip); vp->sp = (sp); } while (0)
>
> I don't see how a change like this could be useful for any thread safety
> problem.

I witnessed situations where the local ‘sp’ seemed to be different from
‘vp->sp’, though it’s hard to tell because I’m unsure where gcc stores
‘sp’. Here’s an example:

(gdb) frame
#16 0x00007fabf30af2ca in vm_regular_engine (thread=0x24e6000, vp=0x22de6c0, registers=0x0, resume=40) at vm-engine.c:785
785 ret = scm_apply_subr (sp, FRAME_LOCALS_COUNT ());
(gdb) p vp->sp
$5 = (union scm_vm_stack_element *) 0x7fabec158718
(gdb) p (union scm_vm_stack_element *) $r13
$6 = (union scm_vm_stack_element *) 0x7fabec158e30
(gdb) p $6 - $5
$7 = 227
(gdb) p vp->fp
$8 = (union scm_vm_stack_element *) 0x7fabec158730
(gdb) p vp->stack_top
$9 = (union scm_vm_stack_element *) 0x7fabec159000
(gdb) p vp->stack_bottom
$10 = (union scm_vm_stack_element *) 0x7fabec158000
(gdb) p vp->sp_min_since_gc
$11 = (union scm_vm_stack_element *) 0x7fabec158620
(gdb) info registers
rax 0x1 1
rbx 0xa 10
rcx 0x28 40
rdx 0x0 0
rsi 0x23f1920 37689632
rdi 0x24e6000 38690816
rbp 0x22de6c0 0x22de6c0
rsp 0x7fabcce18660 0x7fabcce18660
r8 0x1 1
r9 0x1 1
r10 0x100 256
r11 0x23f1920 37689632
r12 0x7fabf330b8c0 140376496191680
r13 0x7fabec158e30 140376376970800
r14 0x7fabf30c6d7c 140376493813116
r15 0x7fabf0fa7f28 140376459083560
rip 0x7fabf30af2ca 0x7fabf30af2ca <vm_regular_engine+14058>
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0

My hypothesis was that such a bug could lead heap elements to be
reclaimed too early. This is more likely to happen in a multi-threaded
context because one thread could be allocating memory and triggering a
GC while another thread is invoking a subr with an out-of-sync ‘vp->sp’.

Does that make sense?

Toggle quote (5 lines)
> For now, I would suggest avoiding multi-threaded code in Guix, or at
> least to avoid loading any Scheme code from multiple threads.
>
> How about using multiple processes instead?

We could do that, but with my Guile maintainer hat on (a hat I don’t
wear that often as you might have noticed ;-)) I think it would be nice
to fix the issue.

Thanks,
Ludo’.
A
A
Andy Wingo wrote on 9 May 2018 11:11
(name . Mark H Weaver)(address . mhw@netris.org)
871sel6vnq.fsf@igalia.com
On Wed 09 May 2018 02:32, Mark H Weaver <mhw@netris.org> writes:

Toggle quote (4 lines)
> However, I think it's _far_ more likely that the NULL argument on the
> stack was copied from memory shared by multiple threads without proper
> thread synchronization.

I think this is unlikely on x86 given its total-store-ordering memory
model. I agree with you about the value of barriers, but I don't think
they are part of this bug that Ludo is seeing.

Toggle quote (5 lines)
>> I commented out the MADV_DONTNEED call to be sure, but I can still
>> reproduce the bug.
>
> I strongly doubt that the MADV_DONTNEED is relevant to this issue.

It could be. It would zero out VM stack frames, and if GC is called
when/if vp->sp is out of date, then that would be possible. However I
think vp->sp is never out of date, so that's probably not it. The
things that can be out of date are the on-heap copy of the IP (vp->ip)
and the local register copy of the sp (sp). It's more likely for the
local "sp" cache to be out of date -- if we recursed through Scheme in a
call out from the VM, eventually causing stack expansion and relocation,
and then on return forgot to re-cache the sp value, that could be it.

Similarly, forgetting to set vp->ip before calling out to something that
could GC could likewise cause a problem because the stack map wouldn't
be right and the precise stack marker could clear a value by mistake.
This only happens for non-innermost frames though; the innermost frame
is marked conservatively.

The rules are: update vp->ip before something that can allocate, and
update local "sp" after returning from a C function that could have
recursively called Scheme.
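
To make those two rules concrete, here is a small stand-alone sketch;
the types, the stub callee, and the opcode body are simplified
stand-ins for illustration only (just the SYNC_IP and CACHE_SP names
come from the real vm-engine.c):

    #include <stdint.h>

    union stack_element { uintptr_t as_uint; };

    struct vm {
      uint32_t *ip;                 /* on-heap copy of the instruction pointer */
      union stack_element *sp;      /* on-heap copy of the stack pointer */
    };

    /* Rule 1: publish the local ip before anything that can allocate, so
       a GC triggered during the call sees the right stack map.  */
    #define SYNC_IP()   (vp->ip = ip)

    /* Rule 2: after returning from C code that may have recursed into
       Scheme (possibly growing and relocating the VM stack), re-read sp.  */
    #define CACHE_SP()  (sp = vp->sp)

    static void call_out_that_may_gc (struct vm *vp)
    {
      /* Stand-in for scm_apply_subr & friends: may allocate, may call
         back into Scheme, may relocate the VM stack and update vp->sp.  */
      (void) vp;
    }

    static void example_opcode (struct vm *vp, uint32_t *ip,
                                union stack_element *sp)
    {
      SYNC_IP ();                   /* rule 1 */
      call_out_that_may_gc (vp);
      CACHE_SP ();                  /* rule 2 */
      (void) sp;
    }

Forgetting either step leaves vp->ip or the local sp stale across a
potential GC, which is the kind of slip the commits mentioned below fix.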

I did find a couple places in the VM where we forgot to do one of these,
e.g. 07b7490f73fb4a6cb0c9577d082d37c8d9cee7b0 and just now
9a72e212622fa3bd118d7c02c4386601285b3224. These two patches aren't
shipped yet fwiw.

Andy
M
M
Mark H Weaver wrote on 10 May 2018 08:50
(name . Andy Wingo)(address . wingo@igalia.com)
87fu30dmx3.fsf@netris.org
Hi Andy,

Andy Wingo <wingo@igalia.com> writes:

Toggle quote (10 lines)
> On Wed 09 May 2018 02:32, Mark H Weaver <mhw@netris.org> writes:
>
>> However, I think it's _far_ more likely that the NULL argument on the
>> stack was copied from memory shared by multiple threads without proper
>> thread synchronization.
>
> I think this is unlikely on x86 given its total-store-ordering memory
> model. I agree with you about the value of barriers, but I don't think
> they are part of this bug that Ludo is seeing.

I think you're forgetting about the C compiler. It's true that x86
machine code has a TSO memory model, but C does not. In the absence of
barriers, the C compiler may freely reorder stores to non-volatile,
non-atomic objects. In particular, it is free to reorder the
initialization of an object with the write of that object's address.
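
As a stand-alone illustration of that kind of reordering (the names and
the struct here are invented for the example; the release store is just
one standard way to rule it out):

    #include <stddef.h>

    struct node { int value; struct node *next; };

    static struct node *shared;     /* read by another thread */

    void publish_unsafely (struct node *n, int v)
    {
      n->value = v;
      n->next = NULL;
      /* Nothing orders the two stores above with the store below, so the
         compiler (and, on a weakly-ordered CPU, the hardware) may publish
         'n' before its fields are initialized.  */
      shared = n;
    }

    void publish_with_release (struct node *n, int v)
    {
      n->value = v;
      n->next = NULL;
      /* A release store keeps the initialization before the publication.  */
      __atomic_store_n (&shared, n, __ATOMIC_RELEASE);
    }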

I admit that I haven't checked whether GCC 5.5.0 does this in practice.
Do you have reason to believe that it never does so?

Thanks,
Mark
A
A
Andy Wingo wrote on 10 May 2018 09:53
(name . Mark H Weaver)(address . mhw@netris.org)
87tvrg3q1d.fsf@igalia.com
On Thu 10 May 2018 08:50, Mark H Weaver <mhw@netris.org> writes:

Toggle quote (21 lines)
> Andy Wingo <wingo@igalia.com> writes:
>
>> On Wed 09 May 2018 02:32, Mark H Weaver <mhw@netris.org> writes:
>>
>>> However, I think it's _far_ more likely that the NULL argument on the
>>> stack was copied from memory shared by multiple threads without proper
>>> thread synchronization.
>>
>> I think this is unlikely on x86 given its total-store-ordering memory
>> model. I agree with you about the value of barriers, but I don't think
>> they are part of this bug that Ludo is seeing.
>
> I think you're forgetting about the C compiler. It's true that x86
> machine code has a TSO memory model, but C does not. In the absence of
> barriers, the C compiler may freely reorder stores to non-volatile,
> non-atomic objects. In particular, it is free to reorder the
> initialization of an object with the write of that object's address.
>
> I admit that I haven't checked whether GCC 5.5.0 does this in practice.
> Do you have reason to believe that it never does so?

Oh I agree with you here as well, and compiler reordering could well be
happening here. My suspicion, however, is that it's not happening. In
libguile, we rarely mutate shared state, and in that case it's usually
within mutexes. The main source of mutation in libguile is
initialization -- but there that's on a fresh object local to a thread,
and we try to avoid publishing that object to other threads without a
barrier (atomic or mutex), and in any case such publishing is usually
outside of the region that a compiler can work on.

There is the possibility of mutation via e.g. vector-set!, but hopefully
we aren't doing that on shared data; likewise in Scheme there are
barriers (the atomic box instructions and mutexes, both of which are
compiler barriers as well). It's still possible to write Scheme
programs with races, of course, but I don't think that's what's
happening here.

I could be misunderstanding things of course!

Andy
M
M
Mark H Weaver wrote on 10 May 2018 17:48
(name . Ludovic Courtès)(address . ludo@gnu.org)
878t8reckg.fsf@netris.org
Hi,

I've been studying the backtrace provided by Ludovic, trying to
determine where, in the Scheme code, this error happened in thread 1.

ludo@gnu.org (Ludovic Courtès) writes:
Toggle quote (12 lines)
> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
> #0 0x00007f7019db0d79 in scm_is_pair (x=<error reading variable: ERROR: Cannot access memory at address 0x0>0x0) at ../libguile/pairs.h:159
> #1 scm_ilength (sx=<optimized out>) at list.c:190
> #2 0x00007f7019e0e2f6 in vm_regular_engine (thread=0x1425670, vp=0x144e6c0, registers=0x0, resume=16) at vm-engine.c:909
> #3 0x00007f7019e117da in scm_call_n (proc=proc@entry=#<program 1555fe0>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
> #4 0x00007f7019d94879 in scm_call_0 (proc=proc@entry=#<program 1555fe0>) at eval.c:481
> #5 0x00007f7019d85eb8 in scm_call_with_unblocked_asyncs (proc=#<program 1555fe0>) at async.c:400
> #6 0x00007f7019e0e17d in vm_regular_engine (thread=0x1425670, vp=0x144e6c0, registers=0x0, resume=16) at vm-engine.c:784
> #7 0x00007f7019e117da in scm_call_n (proc=#<program 13717c0>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
> #8 0x00007f7019d94879 in scm_call_0 (proc=<optimized out>) at eval.c:481
> #9 0x00007f7019dff496 in really_launch (d=0x149e520) at threads.c:794

The call to 'scm_call_with_unblocked_asyncs' in frame #5 is an important
clue, because it is used in very few places. I see only two such places
in Guile (and none in Guix):


I assume that it must be the second one in this case. So, I believe
frames 5-9 correspond to lines 147 through 158 of ice-9/threads.scm:


which is within 'call-with-new-thread', called via the 'begin-thread'
macro used by 'n-par-for-each', here:


However, I would expect to see a 'catch' frame deeper in the stack,
since the procedure passed to 'n-par-for-each' by graft.scm is wrapped
by 'exit-on-exception'.

From these facts, I believe we can conclude that the error is happening
within the body of 'begin-thread' in 'n-par-for-each', but outside of
the body of 'proc' passed to 'n-par-for-each'.

There's not a lot of code in there. My best guess at the moment is that
this error is happening within the (apply proc args) on line 335,
although I do not yet have an explanation of how that could happen.

To be continued...

Mark
M
M
Mark H Weaver wrote on 10 May 2018 18:01
(name . Ludovic Courtès)(address . ludo@gnu.org)
874ljfebzo.fsf@netris.org
Mark H Weaver <mhw@netris.org> writes:
Toggle quote (4 lines)
> From these facts, I believe we can conclude that the error is happening
> within the body of 'begin-thread' in 'n-par-for-each', but outside of
> the body of 'proc' passed to 'n-par-for-each'.

It could also be happening within the 'call-with-backtrace', used here:


More investigation is needed, but it's unlikely that I'll be able to
spend more time on this today.

Mark
L
L
Ludovic Courtès wrote on 29 Jun 2018 17:03
Stack marking issue in multi-threaded code
(name . Andy Wingo)(address . wingo@igalia.com)
87a7rdvdm9.fsf_-_@gnu.org
Hey hey, comrades!

I have a fix for some (most?) of the crashes we were seeing while
running multi-threaded code such as (guix build compile), and,
presumably, the grafting code mentioned at the beginning of this bug
report, although I haven’t checked yet.

So, ‘scm_i_vm_mark_stack’ marks the stack precisely, but contrary to
what I suspected, precise marking is not at fault.

Instead, the problem has to do with the fact that some VM instructions
change the frame pointer (vp->fp) before they have set up the dynamic
link for that new frame.

As a consequence, if a stop-the-world GC is triggered after vp->fp has
been changed and before its dynamic link has been set, the stack-walking
loop in ‘scm_i_vm_mark_stack’ could stop very early, leaving a lot of
objects unmarked.
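
In simplified form, the walk looks roughly like this (a sketch, not the
actual ‘scm_i_vm_mark_stack’ code; the two frame macros are the real
ones from frames.h):

    union scm_vm_stack_element *fp = vp->fp, *sp = vp->sp;

    while (fp < vp->stack_top)
      {
        /* ... mark the slots between sp and fp ... */
        sp = SCM_FRAME_PREVIOUS_SP (fp);
        fp = SCM_FRAME_DYNAMIC_LINK (fp);   /* reads fp[1] */
      }

If vp->fp already points to the new frame but fp[1] has not been written
yet, that last read returns whatever happens to be on the stack, and the
walk can end after the very first frame.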

The patch below fixes the problem for me. \o/

I’m thinking we could perhaps add a compiler barrier before ‘vp->fp = new_fp’
statements, but in practice it’s not necessary here (x86_64, gcc 7).

Thoughts?

I’d like to push this real soon. I’ll also do more testing on real
workloads from Guix, and then I’d like to release 2.2.4, hopefully
within a few days.

Thank you and thanks Andy for the discussions on IRC!

Ludo’, who’s going to party all night long. :-)
Toggle diff (96 lines)
diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
index 1aa4e9699..19ff3e498 100644
--- a/libguile/vm-engine.c
+++ b/libguile/vm-engine.c
@@ -548,7 +548,7 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,
VM_DEFINE_OP (1, call, "call", OP2 (X8_F24, X8_C24))
{
scm_t_uint32 proc, nlocals;
- union scm_vm_stack_element *old_fp;
+ union scm_vm_stack_element *old_fp, *new_fp;
UNPACK_24 (op, proc);
UNPACK_24 (ip[1], nlocals);
@@ -556,9 +556,10 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,
PUSH_CONTINUATION_HOOK ();
old_fp = vp->fp;
- vp->fp = SCM_FRAME_SLOT (old_fp, proc - 1);
- SCM_FRAME_SET_DYNAMIC_LINK (vp->fp, old_fp);
- SCM_FRAME_SET_RETURN_ADDRESS (vp->fp, ip + 2);
+ new_fp = SCM_FRAME_SLOT (old_fp, proc - 1);
+ SCM_FRAME_SET_DYNAMIC_LINK (new_fp, old_fp);
+ SCM_FRAME_SET_RETURN_ADDRESS (new_fp, ip + 2);
+ vp->fp = new_fp;
RESET_FRAME (nlocals);
@@ -586,7 +587,7 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,
{
scm_t_uint32 proc, nlocals;
scm_t_int32 label;
- union scm_vm_stack_element *old_fp;
+ union scm_vm_stack_element *old_fp, *new_fp;
UNPACK_24 (op, proc);
UNPACK_24 (ip[1], nlocals);
@@ -595,9 +596,10 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,
PUSH_CONTINUATION_HOOK ();
old_fp = vp->fp;
- vp->fp = SCM_FRAME_SLOT (old_fp, proc - 1);
- SCM_FRAME_SET_DYNAMIC_LINK (vp->fp, old_fp);
- SCM_FRAME_SET_RETURN_ADDRESS (vp->fp, ip + 3);
+ new_fp = SCM_FRAME_SLOT (old_fp, proc - 1);
+ SCM_FRAME_SET_DYNAMIC_LINK (new_fp, old_fp);
+ SCM_FRAME_SET_RETURN_ADDRESS (new_fp, ip + 3);
+ vp->fp = new_fp;
RESET_FRAME (nlocals);
@@ -3893,7 +3895,7 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,
NEXT (1);
{
- union scm_vm_stack_element *old_fp;
+ union scm_vm_stack_element *old_fp, *new_fp;
size_t old_frame_size = FRAME_LOCALS_COUNT ();
SCM proc = scm_i_async_pop (thread);
@@ -3907,9 +3909,10 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,
handle-interrupts opcode to handle any additional
interrupts. */
old_fp = vp->fp;
- vp->fp = SCM_FRAME_SLOT (old_fp, old_frame_size + 1);
- SCM_FRAME_SET_DYNAMIC_LINK (vp->fp, old_fp);
- SCM_FRAME_SET_RETURN_ADDRESS (vp->fp, ip);
+ new_fp = SCM_FRAME_SLOT (old_fp, old_frame_size + 1);
+ SCM_FRAME_SET_DYNAMIC_LINK (new_fp, old_fp);
+ SCM_FRAME_SET_RETURN_ADDRESS (new_fp, ip);
+ vp->fp = new_fp;
SP_SET (0, proc);
diff --git a/libguile/vm.c b/libguile/vm.c
index c8ec6e1b2..7749159e5 100644
--- a/libguile/vm.c
+++ b/libguile/vm.c
@@ -1011,6 +1011,18 @@ scm_i_vm_mark_stack (struct scm_vm *vp, struct GC_ms_entry *mark_stack_ptr,
slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
}
+ size_t extra = 0;
+ for (; sp < vp->stack_top; sp++)
+ {
+ if (GC_is_heap_ptr (sp->as_ptr))
+ extra++;
+ }
+ if (extra)
+ {
+ printf ("%s extra: %zi\n", __func__, extra);
+ abort ();
+ }
+
return_unused_stack_to_os (vp);
return mark_stack_ptr;
M
M
Mark H Weaver wrote on 29 Jun 2018 18:54
(name . Ludovic Courtès)(address . ludo@gnu.org)
87in61y1m0.fsf@netris.org
Hi Ludovic,

ludo@gnu.org (Ludovic Courtès) writes:

Toggle quote (19 lines)
> I have a fix for some (most?) of the crashes we were seeing while
> running multi-threaded code such as (guix build compile), and,
> presumably, the grafting code mentioned at the beginning of this bug
> report, although I haven’t checked yet.
>
> So, ‘scm_i_vm_mark_stack’ marks the stack precisely, but contrary to
> what I suspected, precise marking is not at fault.
>
> Instead, the problem has to do with the fact that some VM instructions
> change the frame pointer (vp->fp) before they have set up the dynamic
> link for that new frame.
>
> As a consequence, if a stop-the-world GC is triggered after vp->fp has
> been changed and before its dynamic link has been set, the stack-walking
> loop in ‘scm_i_vm_mark_stack’ could stop very early, leaving a lot of
> objects unmarked.
>
> The patch below fixes the problem for me. \o/

That's great news! Thanks for investigating.

Toggle quote (5 lines)
> I’m thinking we could perhaps add a compiler barrier before ‘vp->fp = new_fp’
> statements, but in practice it’s not necessary here (x86_64, gcc 7).
>
> Thoughts?

Indeed, we need something to ensure that the compiler won't reorder
these writes. My recommendation would be to use the 'volatile'
qualifier in the declarations of both the 'fp' field of 'struct scm_vm'
and the stack element modified by SCM_FRAME_SET_DYNAMIC_LINK.

Maybe something like this:

Toggle snippet (36 lines)
diff --git a/libguile/frames.h b/libguile/frames.h
index ef2db3df5..8554f886b 100644
--- a/libguile/frames.h
+++ b/libguile/frames.h
@@ -89,6 +89,7 @@
union scm_vm_stack_element
{
scm_t_uintptr as_uint;
+ volatile scm_t_uintptr as_volatile_uint;
scm_t_uint32 *as_ip;
SCM as_scm;
double as_f64;
@@ -104,7 +105,7 @@ union scm_vm_stack_element
#define SCM_FRAME_RETURN_ADDRESS(fp) ((fp)[0].as_ip)
#define SCM_FRAME_SET_RETURN_ADDRESS(fp, ra) ((fp)[0].as_ip = (ra))
#define SCM_FRAME_DYNAMIC_LINK(fp) ((fp) + (fp)[1].as_uint)
-#define SCM_FRAME_SET_DYNAMIC_LINK(fp, dl) ((fp)[1].as_uint = ((dl) - (fp)))
+#define SCM_FRAME_SET_DYNAMIC_LINK(fp, dl) ((fp)[1].as_volatile_uint = ((dl) - (fp)))
#define SCM_FRAME_SLOT(fp,i) ((fp) - (i) - 1)
#define SCM_FRAME_LOCAL(fp,i) (SCM_FRAME_SLOT (fp, i)->as_scm)
#define SCM_FRAME_NUM_LOCALS(fp, sp) ((fp) - (sp))
diff --git a/libguile/vm.h b/libguile/vm.h
index a1cac391f..0a6179c19 100644
--- a/libguile/vm.h
+++ b/libguile/vm.h
@@ -38,7 +38,7 @@ enum {
struct scm_vm {
scm_t_uint32 *ip; /* instruction pointer */
union scm_vm_stack_element *sp; /* stack pointer */
- union scm_vm_stack_element *fp; /* frame pointer */
+ union scm_vm_stack_element *volatile fp; /* frame pointer */
union scm_vm_stack_element *stack_limit; /* stack limit address */
int trace_level; /* traces enabled if trace_level > 0 */
union scm_vm_stack_element *sp_min_since_gc; /* deepest sp since last gc */

Toggle quote (4 lines)
> I’d like to push this real soon. I’ll also do more testing on real
> workloads from Guix, and then I’d like to release 2.2.4, hopefully
> within a few days.

Sounds good to me.

Thanks,
Mark
L
L
Ludovic Courtès wrote on 29 Jun 2018 23:18
(name . Mark H Weaver)(address . mhw@netris.org)
87muvdthp4.fsf@gnu.org
Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

Toggle quote (2 lines)
> ludo@gnu.org (Ludovic Courtès) writes:

[...]

Toggle quote (5 lines)
>> I’m thinking we could perhaps add a compiler barrier before ‘vp->fp = new_fp’
>> statements, but in practice it’s not necessary here (x86_64, gcc 7).
>>
>> Thoughts?

I just pushed the patch as 23af45e248e8e2bec99c712842bf24d6661abbe2.

Toggle quote (17 lines)
> Indeed, we need something to ensure that the compiler won't reorder
> these writes. My recommendation would be to use the 'volatile'
> qualifier in the declarations of both the 'fp' field of 'struct scm_vm'
> and the stack element modified by SCM_FRAME_SET_DYNAMIC_LINK.
>
> Maybe something like this:
>
> diff --git a/libguile/frames.h b/libguile/frames.h
> index ef2db3df5..8554f886b 100644
> --- a/libguile/frames.h
> +++ b/libguile/frames.h
> @@ -89,6 +89,7 @@
> union scm_vm_stack_element
> {
> scm_t_uintptr as_uint;
> + volatile scm_t_uintptr as_volatile_uint;

[...]

Toggle quote (2 lines)
> +#define SCM_FRAME_SET_DYNAMIC_LINK(fp, dl) ((fp)[1].as_volatile_uint = ((dl) - (fp)))

I’m not sure this is exactly what we need.

What I had in mind, to make sure the vp->fp assignment really happens
after the SCM_FRAME_SET_DYNAMIC_LINK, was to do something like this:

SCM_FRAME_SET_RETURN_ADDRESS (new_fp, …);
SCM_FRAME_SET_DYNAMIC_LINK (new_fp, …);

asm volatile ("" : : : "memory");

vp->fp = new_fp;

I think that more accurately reflects what we want, WDYT?

It’s GCC-specific though (I don’t think Clang implements ‘asm’ in a
compatible way, does it?) and I suppose we’d have to ignore the non-GCC
case.

Thoughts?

Ludo’.
M
M
Mark H Weaver wrote on 30 Jun 2018 01:18
(name . Ludovic Courtès)(address . ludo@gnu.org)
87efgpw5ao.fsf@netris.org
Hi Ludovic,

ludo@gnu.org (Ludovic Courtès) writes:

Toggle quote (52 lines)
> Mark H Weaver <mhw@netris.org> skribis:
>
>> ludo@gnu.org (Ludovic Courtès) writes:
>
> [...]
>
>>> I’m thinking we could perhaps add a compiler barrier before ‘vp->fp = new_fp’
>>> statements, but in practice it’s not necessary here (x86_64, gcc 7).
>>>
>>> Thoughts?
>
> I just pushed the patch as 23af45e248e8e2bec99c712842bf24d6661abbe2.
>
>> Indeed, we need something to ensure that the compiler won't reorder
>> these writes. My recommendation would be to use the 'volatile'
>> qualifier in the declarations of both the 'fp' field of 'struct scm_vm'
>> and the stack element modified by SCM_FRAME_SET_DYNAMIC_LINK.
>>
>> Maybe something like this:
>>
>> diff --git a/libguile/frames.h b/libguile/frames.h
>> index ef2db3df5..8554f886b 100644
>> --- a/libguile/frames.h
>> +++ b/libguile/frames.h
>> @@ -89,6 +89,7 @@
>> union scm_vm_stack_element
>> {
>> scm_t_uintptr as_uint;
>> + volatile scm_t_uintptr as_volatile_uint;
>
> [...]
>
>> +#define SCM_FRAME_SET_DYNAMIC_LINK(fp, dl) ((fp)[1].as_volatile_uint = ((dl) - (fp)))
>
> I’m not sure this is exactly what we need.
>
> What I had in mind, to make sure the vp->fp assignment really happens
> after the SCM_FRAME_SET_DYNAMIC_LINK, was to do something like this:
>
> SCM_FRAME_SET_RETURN_ADDRESS (new_fp, …);
> SCM_FRAME_SET_DYNAMIC_LINK (new_fp, …);
>
> asm volatile ("" : : : "memory");
>
> vp->fp = new_fp;
>
> I think that more accurately reflects what we want, WDYT?
>
> It’s GCC-specific though (I don’t think Clang implements ‘asm’ in a
> compatible way, does it?) and I suppose we’d have to ignore the non-GCC
> case.

Right, the problem is that the "asm volatile" solution is not portable.

To my mind, it's fine to optionally use GCC extensions for improved
performance, but I think we should ensure that Guile works reliably when
compiled with any conforming C compiler.

What problem do you see with my proposed portable solution? If I
understand C99 section 5.1.2.3 paragraph 2 correctly, compilers are not
permitted to reorder accesses to volatile objects as long as there is a
sequence point between them.

I should say that I'm not confident that _either_ of these proposed
solutions will adequately address all of the possible problems that
could occur when GC is performed on VM threads stopped at arbitrary
points in their execution.

Mark
L
L
Ludovic Courtès wrote on 30 Jun 2018 22:53
(name . Mark H Weaver)(address . mhw@netris.org)
87h8lkro7c.fsf@gnu.org
Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

Toggle quote (2 lines)
> ludo@gnu.org (Ludovic Courtès) writes:

[...]

Toggle quote (45 lines)
>>> Indeed, we need something to ensure that the compiler won't reorder
>>> these writes. My recommendation would be to use the 'volatile'
>>> qualifier in the declarations of both the 'fp' field of 'struct scm_vm'
>>> and the stack element modified by SCM_FRAME_SET_DYNAMIC_LINK.
>>>
>>> Maybe something like this:
>>>
>>> diff --git a/libguile/frames.h b/libguile/frames.h
>>> index ef2db3df5..8554f886b 100644
>>> --- a/libguile/frames.h
>>> +++ b/libguile/frames.h
>>> @@ -89,6 +89,7 @@
>>> union scm_vm_stack_element
>>> {
>>> scm_t_uintptr as_uint;
>>> + volatile scm_t_uintptr as_volatile_uint;
>>
>> [...]
>>
>>> +#define SCM_FRAME_SET_DYNAMIC_LINK(fp, dl) ((fp)[1].as_volatile_uint = ((dl) - (fp)))
>>
>> I’m not sure this is exactly what we need.
>>
>> What I had in mind, to make sure the vp->fp assignment really happens
>> after the SCM_FRAME_SET_DYNAMIC_LINK, was to do something like this:
>>
>> SCM_FRAME_SET_RETURN_ADDRESS (new_fp, …);
>> SCM_FRAME_SET_DYNAMIC_LINK (new_fp, …);
>>
>> asm volatile ("" : : : "memory");
>>
>> vp->fp = new_fp;
>>
>> I think that more accurately reflects what we want, WDYT?
>>
>> It’s GCC-specific though (I don’t think Clang implements ‘asm’ in a
>> compatible way, does it?) and I suppose we’d have to ignore the non-GCC
>> case.
>
> Right, the problem is that the "asm volatile" solution is not portable.
>
> To my mind, it's fine to optionally use GCC extensions for improved
> performance, but I think we should ensure that Guile works reliably when
> compiled with any conforming C compiler.

I agree, of course (that said, most compilers implement most GCC
extensions nowadays, but ‘asm’ is probably an exception).

Toggle quote (5 lines)
> What problem do you see with my proposed portable solution? If I
> understand C99 section 5.1.2.3 paragraph 2 correctly, compilers are not
> permitted to reorder accesses to volatile objects as long as there is a
> sequence point between them.

My understanding is that what you propose is (almost*) equivalent to the
asm trick, provided SCM_FRAME_SET_DYNAMIC_LINK is the last statement
before the vp->fp assignment. So we’d have to make sure we keep things
in this order, right?

[*] The ‘volatile’ qualifier on the field does more than just an
instruction reordering barrier: AIUI, it forces the compiler to emit
a load and store instruction right at the assignment point, which is
a stricter constraint than what we need, I think.

Toggle quote (5 lines)
> I should say that I'm not confident that _either_ of these proposed
> solutions will adequately address all of the possible problems that
> could occur when GC is performed on VM threads stopped at arbitrary
> points in their execution.

Yeah, as discussed on IRC with Andy, we’d be better off if we were sure
that each stack is marked by the thread it belongs to, in terms of data
locality, and thus also in terms of being sure that vp->fp is up-to-date
when the marker reads it. It’s not something we can change now, though.

Alternately, we could use atomics when accessing vp->fp to ensure memory
consistency (I tried that during my debugging journey…). It could be
costly.

Anyway, I don’t think we’ll have the final word on all this before
2.2.4. The way I see it we should keep working on improving it, but
there are difficult choices to make, so it will probably take a bit of
time.

Thanks,
Ludo’.
M
M
Mark H Weaver wrote on 30 Jun 2018 23:49
(name . Ludovic Courtès)(address . ludo@gnu.org)
87tvpkrlm8.fsf@netris.org
Hi Ludovic,

ludo@gnu.org (Ludovic Courtès) writes:

Toggle quote (64 lines)
> Mark H Weaver <mhw@netris.org> skribis:
>
>> ludo@gnu.org (Ludovic Courtès) writes:
>
> [...]
>
>>>> Indeed, we need something to ensure that the compiler won't reorder
>>>> these writes. My recommendation would be to use the 'volatile'
>>>> qualifier in the declarations of both the 'fp' field of 'struct scm_vm'
>>>> and the stack element modified by SCM_FRAME_SET_DYNAMIC_LINK.
>>>>
>>>> Maybe something like this:
>>>>
>>>> diff --git a/libguile/frames.h b/libguile/frames.h
>>>> index ef2db3df5..8554f886b 100644
>>>> --- a/libguile/frames.h
>>>> +++ b/libguile/frames.h
>>>> @@ -89,6 +89,7 @@
>>>> union scm_vm_stack_element
>>>> {
>>>> scm_t_uintptr as_uint;
>>>> + volatile scm_t_uintptr as_volatile_uint;
>>>
>>> [...]
>>>
>>>> +#define SCM_FRAME_SET_DYNAMIC_LINK(fp, dl) ((fp)[1].as_volatile_uint = ((dl) - (fp)))
>>>
>>> I’m not sure this is exactly what we need.
>>>
>>> What I had in mind, to make sure the vp->fp assignment really happens
>>> after the SCM_FRAME_SET_DYNAMIC_LINK, was to do something like this:
>>>
>>> SCM_FRAME_SET_RETURN_ADDRESS (new_fp, …);
>>> SCM_FRAME_SET_DYNAMIC_LINK (new_fp, …);
>>>
>>> asm volatile ("" : : : "memory");
>>>
>>> vp->fp = new_fp;
>>>
>>> I think that more accurately reflects what we want, WDYT?
>>>
>>> It’s GCC-specific though (I don’t think Clang implements ‘asm’ in a
>>> compatible way, does it?) and I suppose we’d have to ignore the non-GCC
>>> case.
>>
>> Right, the problem is that the "asm volatile" solution is not portable.
>>
>> To my mind, it's fine to optionally use GCC extensions for improved
>> performance, but I think we should ensure that Guile works reliably when
>> compiled with any conforming C compiler.
>
> I agree, of course (that said, most compilers implement most GCC
> extensions nowadays, but ‘asm’ is probably an exception).
>
>> What problem do you see with my proposed portable solution? If I
>> understand C99 section 5.1.2.3 paragraph 2 correctly, compilers are not
>> permitted to reorder accesses to volatile objects as long as there is a
>> sequence point between them.
>
> My understanding is that what you propose is (almost*) equivalent to the
> asm trick, provided SCM_FRAME_SET_DYNAMIC_LINK is the last statement
> before the vp->fp assignment. So we’d have to make sure we keep things
> in this order, right?

I don't think it matters. This would only ensure the relative ordering
of the SCM_FRAME_SET_DYNAMIC_LINK and the vp->fp assignment. IIUC, the
non-volatile memory accesses could be reordered relative to either of
them.

The "asm volatile" memory barrier would be stronger, guaranteeing that
_all_ accesses before the barrier would be performed before _all_
accessed after the barrier, within that thread.

Toggle quote (5 lines)
> [*] The ‘volatile’ qualifier on the field does more than just an
> instruction reordering barrier: AIUI, it forces the compiler to emit
> a load and store instruction right at the assignment point, which is
> a stricter constraint than what we need, I think.

I'm not sure what "right at the assignment point" means exactly, given
that the compiler is free to reorder things quite a bit.

In addition to the C99 standard which I cited in my previous email, you
might also find the following link informative:


In particular:

[...] The minimum requirement is that at a sequence point all
previous accesses to volatile objects have stabilized and no
subsequent accesses have occurred. Thus an implementation is free to
reorder and combine volatile accesses that occur between sequence
points, but cannot do so for accesses across a sequence point. The
use of volatile does not allow you to violate the restriction on
updating objects multiple times between two sequence points.
Accesses to non-volatile objects are not ordered with respect to
volatile accesses. You cannot use a volatile object as a memory
barrier to order a sequence of writes to non-volatile memory. For
instance:
int *ptr = something;
volatile int vobj;
*ptr = something;
vobj = 1;
Unless *ptr and vobj can be aliased, it is not guaranteed that the
write to *ptr occurs by the time the update of vobj happens. If you
need this guarantee, you must use a stronger memory barrier [...]

Toggle quote (10 lines)
>> I should say that I'm not confident that _either_ of these proposed
>> solutions will adequately address all of the possible problems that
>> could occur when GC is performed on VM threads stopped at arbitrary
>> points in their execution.
>
> Yeah, as discussed on IRC with Andy, we’d be better off if we were sure
> that each stack is marked by the thread it belongs to, in terms of data
> locality, and thus also in terms of being sure that vp->fp is up-to-date
> when the marker reads it. It’s not something we can change now, though.

I'm not sure it matters what thread the marking is done in, because when
the actual collection happens, all threads are first stopped in their
signal handlers, and presumably the appropriate memory barriers are
performed so that all threads are synchronized before the full
collection.

Now, I'm aware that libgc does incremental marking as well, but I'm
optimistically hoping that they've done their homework to ensure that if
anything might have changed between the incremental marking and the full
collection, that the relevant objects will be marked again before the
full collection. However, I admit that I don't know enough of libgc
internals to know if this is really the case.

What do you think?

Toggle quote (4 lines)
> Alternately, we could use atomics when accessing vp->fp to ensure memory
> consistency (I tried that during my debugging journey…). It could be
> costly.

This would be far too costly to consider, in my opinion.

Toggle quote (5 lines)
> Anyway, I don’t think we’ll have the final word on all this before
> 2.2.4. The way I see it we should keep working on improving it, but
> there are difficult choices to make, so it will probably take a bit of
> time.

Sounds good.

Thanks,
Mark
A
A
Andy Wingo wrote on 1 Jul 2018 12:12
(name . Mark H Weaver)(address . mhw@netris.org)
87d0w7l0wr.fsf@igalia.com
Hi!

First of all, I said on IRC but: ___nice___ debugging, Ludo! An
impressive show of persistence. Thanks Mark also for insightful
comments!

On Sat 30 Jun 2018 23:49, Mark H Weaver <mhw@netris.org> writes:

Toggle quote (16 lines)
>>> I should say that I'm not confident that _either_ of these proposed
>>> solutions will adequately address all of the possible problems that
>>> could occur when GC is performed on VM threads stopped at arbitrary
>>> points in their execution.
>>
>> Yeah, as discussed on IRC with Andy, we’d be better off if we were sure
>> that each stack is marked by the thread it belongs to, in terms of data
>> locality, and thus also in terms of being sure that vp->fp is up-to-date
>> when the marker reads it. It’s not something we can change now, though.
>
> I'm not sure it matters what thread the marking is done in, because when
> the actual collection happens, all threads are first stopped in their
> signal handlers, and presumably the appropriate memory barriers are
> performed so that all threads are synchronized before the full
> collection.

I think you are right here. Still, it would be nice from a locality
POV if threads could mark themselves. In some future I think it would
be nice if threads cooperatively reached safepoints, instead of using
the signal mechanism. In that case we could precisely mark the most
recent stack frame as well.

Toggle quote (7 lines)
>> Anyway, I don’t think we’ll have the final word on all this before
>> 2.2.4. The way I see it we should keep working on improving it, but
>> there are difficult choices to make, so it will probably take a bit of
>> time.
>
> Sounds good.

Yeah! Really great that this is fixed, and apologies for introducing it
in the first place!!

A
L
L
Ludovic Courtès wrote on 2 Jul 2018 12:28
Re: bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2
(address . 28211@debbugs.gnu.org)
87fu12hqxn.fsf@gnu.org
Hi,

ludo@gnu.org (Ludovic Courtès) skribis:

Toggle quote (31 lines)
> On current ‘core-updates’, the code in (guix build graft) triggers
> random Guile crashes (GC issue? thread-safety issue?) when running on
> Guile 2.2.2, as initially reported by Marius at
> <https://lists.gnu.org/archive/html/guix-devel/2017-08/msg00013.html>:
>
> grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> '/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
> ERROR: In procedure put-bytevector: Wrong type argument in position 1 (expecting open output port): #<closed: file 8f1930>
> builder for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed with exit code 1
>
> [...]
>
> ludo@ribbon ~/src/guix/+core-updates$ guix gc --clear-failures $(guix gc --list-failures)
> ludo@ribbon ~/src/guix/+core-updates$ guix build /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
> La jena derivo estos konstruata:
> /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
> @ build-started /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - x86_64-linux /var/log/guix/drvs/3c//rrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv.bz2
> grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> '/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
> ERROR: In procedure put-bytevector: Wrong type argument in position 1 (expecting open output port): #<closed: file 7517e0>
> builder for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed with exit code 1
> @ build-failed /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 1 builder for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed with exit code 1
> guix build: error: build failed: build of `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed
> ludo@ribbon ~/src/guix/+core-updates$ guix gc --clear-failures $(guix gc --list-failures)
> ludo@ribbon ~/src/guix/+core-updates$ guix build /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
> La jena derivo estos konstruata:
> /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
> @ build-started /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - x86_64-linux /var/log/guix/drvs/3c//rrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv.bz2
> grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> '/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
> ERROR: In procedure variable-ref: Not a variable: (194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255)
> builder for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed with exit code 1
> @ build-failed /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 1 builder for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed with exit code 1

Bad news: when using Guile 2.2.4 for grafting, I’m still getting similar
crashes, perhaps marginally less frequently.

Ludo’.
M
M
Mark H Weaver wrote on 3 Jul 2018 21:01
Re: Stack marking issue in multi-threaded code
(name . Andy Wingo)(address . wingo@igalia.com)
87in5wnnyo.fsf@netris.org
Hi Andy,

Andy Wingo <wingo@igalia.com> writes:
Toggle quote (24 lines)
> On Sat 30 Jun 2018 23:49, Mark H Weaver <mhw@netris.org> writes:
>
>>>> I should say that I'm not confident that _either_ of these proposed
>>>> solutions will adequately address all of the possible problems that
>>>> could occur when GC is performed on VM threads stopped at arbitrary
>>>> points in their execution.
>>>
>>> Yeah, as discussed on IRC with Andy, we’d be better off if we were sure
>>> that each stack is marked by the thread it belongs to, in terms of data
>>> locality, and thus also in terms of being sure that vp->fp is up-to-date
>>> when the marker reads it. It’s not something we can change now, though.
>>
>> I'm not sure it matters what thread the marking is done in, because when
>> the actual collection happens, all threads are first stopped in their
>> signal handlers, and presumably the appropriate memory barriers are
>> performed so that all threads are synchronized before the full
>> collection.
>
> I think you are right here. Still, it would be nice from a locality
> POV if threads could mark themselves. In some future I think it would
> be nice if threads cooperatively reached safepoints, instead of using
> the signal mechanism. In that case we could precisely mark the most
> recent stack frame as well.

I agree that stopping threads at safepoints before collections would be
ideal.

Toggle quote (10 lines)
>>> Anyway, I don’t think we’ll have the final word on all this before
>>> 2.2.4. The way I see it we should keep working on improving it, but
>>> there are difficult choices to make, so it will probably take a bit of
>>> time.
>>
>> Sounds good.
>
> Yeah! Really great that this is fixed, and apologies for introducing it
> in the first place!!

It's great that Ludovic found the problem (great debugging Ludovic!),
but FWIW, my assessment is that this bug is not fixed by commit
23af45e248e8e2bec99c712842bf24d6661abbe2, and therefore is not fixed in
Guile-2.2.4, contrary to the claims made in the commit log and the NEWS.
Unless I'm mistaken, that commit makes *no* difference to the
requirements on the compiler regarding the order of those writes.

Mark
L
L
Ludovic Courtès wrote on 12 Mar 2020 22:59
Stack marking issue in multi-threaded code, 2020 edition
(name . Andy Wingo)(address . wingo@igalia.com)(address . 28211@debbugs.gnu.org)
87tv2tp74g.fsf_-_@gnu.org
Hi!

I think I’ve found another race condition involving stack marking, as a
followup to https://issues.guix.gnu.org/issue/28211 (this time on
3.0.1+, but the code is almost the same.)

‘abort_to_prompt’ does this:

Toggle snippet (20 lines)
fp = vp->stack_top - fp_offset;
sp = vp->stack_top - sp_offset;

/* Continuation gets nargs+1 values: the one more is for the cont. */
sp = sp - nargs - 1;

/* Shuffle abort arguments down to the prompt continuation. We have
to be jumping to an older part of the stack. */
if (sp < vp->sp)
abort ();
sp[nargs].as_scm = cont;
while (nargs--)
sp[nargs] = vp->sp[nargs];

/* Restore VM regs */
vp->fp = fp;
vp->sp = sp;
vp->ip = vra;

What if ‘scm_i_vm_mark_stack’ walks the stack right before the ‘vp->fp’
assignment? It can determine that one of the just-assigned ‘sp[nargs]’
is a dead slot, and thus set it to SCM_UNSPECIFIED. Later, when we set
‘vp->fp’, that stack slot that we just initialized has been overwritten
by ‘scm_i_vm_mark_stack’. Down the road, we get something like:

Wrong type to apply: #<unspecified>

I believe this is what I’m seeing here (0x7ff7f838dda0 is being set to
SCM_UNSPECIFIED while thread 2 is in ‘abort_to_prompt’):

Toggle snippet (46 lines)
(rr) thread 5
[Switching to thread 5 (Thread 24572.24575)]
#0 scm_i_vm_mark_stack (vp=0x7ff7fd820b48, mark_stack_ptr=0x7ff7fc0ebf90,
mark_stack_limit=0x7ff7fc0fbec0) at vm.c:743
743 break;
(rr) list
738 break;
739 case SLOT_DESC_DEAD:
740 /* This value may become dead as a result of GC,
741 so we can't just leave it on the stack. */
742 sp->as_scm = SCM_UNSPECIFIED;
743 break;
744 }
745 }
746 sp = SCM_FRAME_PREVIOUS_SP (fp);
747 /* Inner frames may have a dead slots map for precise marking.
(rr) p sp->as_scm
$59 = #<unspecified>
(rr) p sp
$60 = (union scm_vm_stack_element *) 0x7ff7f838dda0
(rr) thread 2
[Switching to thread 2 (Thread 24572.24577)]
#0 0x00007ff7fdb7bb36 in __GI___sigsuspend (
set=set@entry=0x7ff7fe132720 <suspend_handler_mask>)
at ../sysdeps/unix/sysv/linux/sigsuspend.c:26
26 ../sysdeps/unix/sysv/linux/sigsuspend.c: Dosiero aŭ dosierujo ne ekzistas.
(rr) frame 4
#4 0x00007ff7fe228f14 in abort_to_prompt (thread=0x7ff7fd820b40,
saved_mra=<optimized out>) at vm.c:1465
1465 sp[nargs] = vp->sp[nargs];
(rr) p sp
$61 = (union scm_vm_stack_element *) 0x7ff7f838dd90
(rr) p fp
$62 = (union scm_vm_stack_element *) 0x7ff7f838ddb0
(rr) p &sp[2]
$63 = (union scm_vm_stack_element *) 0x7ff7f838dda0
(rr) p vp->sp
$64 = (union scm_vm_stack_element *) 0x7ff7f838dcf0
(rr) p vp->fp
$65 = (union scm_vm_stack_element *) 0x7ff7f838dd08
(rr) p vp->stack_bottom
$66 = (union scm_vm_stack_element *) 0x7ff7f838a000
(rr) p vp->stack_top
$67 = (union scm_vm_stack_element *) 0x7ff7f838e000

Comments about this analysis?

How do we fix it? It’s a bit troubling that this is all lock-free. A
fix I can think of is to just re-do the sp[nargs] assignments after the
vp->sp etc. assignments.

Thoughts?

Ludo’.
L
L
Ludovic Courtès wrote on 13 Mar 2020 23:38
(name . Andy Wingo)(address . wingo@igalia.com)(address . 28211@debbugs.gnu.org)
87blozkhiu.fsf@gnu.org
Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (8 lines)
> What if ‘scm_i_vm_mark_stack’ walks the stack right before the ‘vp->fp’
> assignment? It can determine that one of the just-assigned ‘sp[nargs]’
> is a dead slot, and thus set it to SCM_UNSPECIFIED. Later, when we set
> ‘vp->fp’, that stack slot that we just initialized has been overwritten
> by ‘scm_i_vm_mark_stack’. Down the road, we get something like:
>
> Wrong type to apply: #<unspecified>

I believe the patch below solves this problem. Also attached is a small
program that reproduces the bug (without the patch) after a couple of
minutes.

Thoughts?

The full grafting test case exposes another VM stack-corruption-looking
crash that I’m still investigating.

Ludo’.
Toggle diff (81 lines)
diff --git a/libguile/vm.c b/libguile/vm.c
index b20c6eb5f..5cc934e99 100644
--- a/libguile/vm.c
+++ b/libguile/vm.c
@@ -1351,7 +1351,8 @@ scm_i_vm_emergency_abort (SCM *tag_and_argv, size_t n)
scm_t_bits *prompt;
scm_t_dynstack_prompt_flags flags;
ptrdiff_t fp_offset, sp_offset;
- union scm_vm_stack_element *fp, *sp;
+ union scm_vm_stack_element *fp;
+ volatile union scm_vm_stack_element *sp;
SCM *argv;
uint32_t *vra;
uint8_t *mra;
@@ -1392,16 +1393,19 @@ scm_i_vm_emergency_abort (SCM *tag_and_argv, size_t n)
to be jumping to an older part of the stack. */
if (sp < vp->sp)
abort ();
- sp[nargs].as_scm = cont;
-
- while (nargs--)
- sp[nargs].as_scm = *argv++;
/* Restore VM regs */
vp->fp = fp;
- vp->sp = sp;
+ vp->sp = (union scm_vm_stack_element *) sp;
vp->ip = vra;
+ /* Restore the arguments on SP. This must be done after 'vp->fp' has
+ been set so that a concurrent 'scm_i_vm_mark_stack' does not
+ overwrite it (see <https://bugs.gnu.org/28211>). */
+ sp[nargs].as_scm = cont;
+ while (nargs--)
+ sp[nargs].as_scm = *argv++;
+
/* Jump! */
vp->mra_after_abort = mra;
longjmp (*registers, 1);
@@ -1417,7 +1421,8 @@ abort_to_prompt (scm_thread *thread, uint8_t *saved_mra)
scm_t_bits *prompt;
scm_t_dynstack_prompt_flags flags;
ptrdiff_t fp_offset, sp_offset;
- union scm_vm_stack_element *fp, *sp;
+ union scm_vm_stack_element *fp, *orig_sp;
+ volatile union scm_vm_stack_element *sp;
uint32_t *vra;
uint8_t *mra;
jmp_buf *registers;
@@ -1452,6 +1457,7 @@ abort_to_prompt (scm_thread *thread, uint8_t *saved_mra)
/* Recompute FP, as scm_dynstack_unwind may have expanded the stack. */
fp = vp->stack_top - fp_offset;
sp = vp->stack_top - sp_offset;
+ orig_sp = vp->sp;
/* Continuation gets nargs+1 values: the one more is for the cont. */
sp = sp - nargs - 1;
@@ -1460,15 +1466,19 @@ abort_to_prompt (scm_thread *thread, uint8_t *saved_mra)
to be jumping to an older part of the stack. */
if (sp < vp->sp)
abort ();
- sp[nargs].as_scm = cont;
- while (nargs--)
- sp[nargs] = vp->sp[nargs];
/* Restore VM regs */
vp->fp = fp;
- vp->sp = sp;
+ vp->sp = (union scm_vm_stack_element *) sp;
vp->ip = vra;
+ /* Restore the arguments on SP. This must be done after 'vp->fp' has
+ been set so that a concurrent 'scm_i_vm_mark_stack' does not
+ overwrite it (see <https://bugs.gnu.org/28211>). */
+ sp[nargs].as_scm = cont;
+ while (nargs--)
+ sp[nargs] = orig_sp[nargs];
+
/* If there are intervening C frames, then jump over them, making a
nonlocal exit. Otherwise fall through and let the VM pick up where
it left off. */
;; https://issues.guix.gnu.org/issue/28211

(use-modules (ice-9 threads))

(define (thunk)
  (catch 'foo
    (lambda ()
      (throw 'foo (iota 10)))
    (lambda (key lst)
      (unless (equal? lst (iota 10))
        (primitive-_exit 42)))))

(n-par-for-each (* 2 (current-processor-count))
                (lambda _
                  (let loop ()
                    (thunk)
                    (loop)))
                (iota 1000))
A
A
Andy Wingo wrote on 17 Mar 2020 22:16
Re: Stack marking issue in multi-threaded code, 2020 edition
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 28211@debbugs.gnu.org)
87a74eznq1.fsf@pobox.com
On Thu 12 Mar 2020 22:59, Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (30 lines)
> I think I’ve found another race condition involving stack marking, as a
> followup to <https://issues.guix.gnu.org/issue/28211> (this time on
> 3.0.1+, but the code is almost the same.)
>
> ‘abort_to_prompt’ does this:
>
> fp = vp->stack_top - fp_offset;
> sp = vp->stack_top - sp_offset;
>
> /* Continuation gets nargs+1 values: the one more is for the cont. */
> sp = sp - nargs - 1;
>
> /* Shuffle abort arguments down to the prompt continuation. We have
> to be jumping to an older part of the stack. */
> if (sp < vp->sp)
> abort ();
> sp[nargs].as_scm = cont;
> while (nargs--)
> sp[nargs] = vp->sp[nargs];
>
> /* Restore VM regs */
> vp->fp = fp;
> vp->sp = sp;
> vp->ip = vra;
>
>
> What if ‘scm_i_vm_mark_stack’ walks the stack right before the ‘vp->fp’
> assignment? It can determine that one of the just-assigned ‘sp[nargs]’
> is a dead slot, and thus set it to SCM_UNSPECIFIED.

I think you're right here.

Given that the most-recently-pushed frame is marked conservatively, I
think it would be sufficient to reset vp->fp before shuffling stack
args; that would make it so that the frame includes the values to
shuffle, their target locations, and probably some other crap in
between. Given that marking the crap is harmless, I think that would be
enough. WDYT?
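
Concretely, on the ‘abort_to_prompt’ excerpt quoted above, that would
mean an ordering along these lines (a sketch of the proposal, not
necessarily the exact code of the commit that was eventually pushed):

      /* Publish the prompt's frame pointer first: everything between
         vp->sp and the new vp->fp is then the innermost frame, which the
         marker scans conservatively, so the slots about to be written
         cannot be classified as dead and cleared.  */
      vp->fp = fp;

      /* Shuffle abort arguments down to the prompt continuation, as
         before; vp->sp still points to the old tip and serves as the
         source.  */
      sp[nargs].as_scm = cont;
      while (nargs--)
        sp[nargs] = vp->sp[nargs];

      /* Only now move sp and ip.  */
      vp->sp = sp;
      vp->ip = vra;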

In a more perfect world, initiating GC should tell threads to reach a
safepoint and mark their own stacks -- preserves thread locality and
prevents this class of bug. But given that libgc uses signals to stop
threads, we have to be less precise.

Cheers,

Andy
L
L
Ludovic Courtès wrote on 17 Mar 2020 22:53
control message for bug #28211
(address . control@debbugs.gnu.org)
87v9n2bqbu.fsf@gnu.org
reassign 28211 guile
quit
L
L
Ludovic Courtès wrote on 17 Mar 2020 22:56
(address . control@debbugs.gnu.org)
87tv2mbq7p.fsf@gnu.org
retitle 28211 Stack marking race conditions lead to crashes for multi-threaded programs in 2.2.7/3.0.1
quit
L
L
Ludovic Courtès wrote on 18 Mar 2020 00:22
Re: bug#28211: Stack marking issue in multi-threaded code, 2020 edition
(name . Andy Wingo)(address . wingo@igalia.com)(address . 28211@debbugs.gnu.org)
87mu8ebm82.fsf@gnu.org
Hi!

Andy Wingo <wingo@igalia.com> skribis:

Toggle quote (41 lines)
> On Thu 12 Mar 2020 22:59, Ludovic Courtès <ludo@gnu.org> writes:
>
>> I think I’ve found another race condition involving stack marking, as a
>> followup to <https://issues.guix.gnu.org/issue/28211> (this time on
>> 3.0.1+, but the code is almost the same.)
>>
>> ‘abort_to_prompt’ does this:
>>
>> fp = vp->stack_top - fp_offset;
>> sp = vp->stack_top - sp_offset;
>>
>> /* Continuation gets nargs+1 values: the one more is for the cont. */
>> sp = sp - nargs - 1;
>>
>> /* Shuffle abort arguments down to the prompt continuation. We have
>> to be jumping to an older part of the stack. */
>> if (sp < vp->sp)
>> abort ();
>> sp[nargs].as_scm = cont;
>> while (nargs--)
>> sp[nargs] = vp->sp[nargs];
>>
>> /* Restore VM regs */
>> vp->fp = fp;
>> vp->sp = sp;
>> vp->ip = vra;
>>
>>
>> What if ‘scm_i_vm_mark_stack’ walks the stack right before the ‘vp->fp’
>> assignment? It can determine that one of the just-assigned ‘sp[nargs]’
>> is a dead slot, and thus set it to SCM_UNSPECIFIED.
>
> I think you're right here.
>
> Given that the most-recently-pushed frame is marked conservatively, I
> think it would be sufficient to reset vp->fp before shuffling stack
> args; that would make it so that the frame includes the values to
> shuffle, their target locations, and probably some other crap in
> between. Given that marking the crap is harmless, I think that would be
> enough. WDYT?

Sounds good. Following our discussion on IRC, I pushed what you
proposed as 89edd1bc2dcff50fb05c3598a846d6b51b172f7c. \o/

I confirmed with and without rr that it no longer triggers the dreaded
crash.

BTW, pro tip: to run ./meta/guile under rr, I do:

sed -i libguile/guile \
-e 's/exec /exec rr record -n --syscall-buffer-sig=SIGUSR1 /g'

where ‘-n’ disables stack switching.

Toggle quote (5 lines)
> In a more perfect world, initiating GC should tell threads to reach a
> safepoint and mark their own stacks -- preserves thread locality and
> prevents this class of bug. But given that libgc uses signals to stop
> threads, we have to be less precise.

Yup, agreed.

Thanks,
Ludo’.
L
L
Ludovic Courtès wrote on 18 Mar 2020 00:23
control message for bug #28211
(address . control@debbugs.gnu.org)
87lfnybm7e.fsf@gnu.org
tags 28211 fixed
close 28211
quit