Hi,

Andy Wingo writes:

> On Thu 05 Jul 2018 05:33, Mark H Weaver writes:
>
>>> One problem I’ve noticed is that the child process that
>>> ‘call-with-decompressed-port’ spawns would be stuck trying to get the
>>> allocation lock:
>>>
>>> So it seems quite clear that the thing has the alloc lock taken.  I
>>> suppose this can happen if one of the libgc threads runs right when we
>>> call fork and takes the alloc lock, right?
>>
>> Does libgc spawn threads that run concurrently with user threads?  If
>> so, that would be news to me.  My understanding was that incremental
>> marking occurs within GC allocation calls, and marking threads are only
>> spawned after all user threads have been stopped, but I could be wrong.
>
> I think Mark is correct.

Actually, looking at the libgc code more closely, it seems that Ludovic
is correct.  GC_init calls GC_start_mark_threads_inner if PARALLEL_MARK
is defined at compile time.

However, it's also the case that libgc uses 'pthread_atfork' (where
available) to arrange to grab the GC allocation lock as well as the
"mark locks" in the case where parallel marking is enabled.  See
fork_prepare_proc, fork_parent_proc, and fork_child_proc in
pthread_support.c.  As the libgc developers admit in the comment above
'fork_prepare_proc', they are not strictly meeting the requirements for
safe use of 'fork', but they _are_ grabbing the allocation lock during
'fork', at least on systems that support 'pthread_atfork'.

It looks like setting the GC_MARKERS environment variable to 1 should
result in 'available_markers_m1' being set to 0, and thus effectively
disable parallel marking.  In that case, if I understand the code
correctly, no marker threads will be spawned.

It would be interesting to see if this problem can be reproduced when
GC_MARKERS is set to 1.  (See the postscript below for a quick way to
check that this setting actually disables parallel marking.)

>> The first idea that comes to my mind is that perhaps the finalization
>> thread is holding the GC allocation lock when 'fork' is called.
>
> So of course we agree you're only supposed to "fork" when there are no
> other threads running, I think.
>
> As far as the finalizer thread goes, "primitive-fork" calls
> "scm_i_finalizer_pre_fork" which should join the finalizer thread,
> before the fork.  There could be a bug obviously but the intention is
> for Guile to shut down its internal threads.

Ah, good!  I didn't know this.  So, I guess the problem is most likely
elsewhere.

We should probably arrange to join the signal delivery thread at the
same time, and then respawn it in the parent and child if needed.  What
do you think?

> Here's the body of primitive-fork fwiw:
>
>   {
>     int pid;
>     scm_i_finalizer_pre_fork ();
>     if (scm_ilength (scm_all_threads ()) != 1)

I think there's a race here: the finalizer thread could be respawned
after 'scm_i_finalizer_pre_fork' in two different ways:

(1) 'scm_all_threads' performs allocation, which could trigger GC.

(2) Another thread could perform heap allocation and trigger GC after
    'scm_i_finalizer_pre_fork' joins the finalizer thread; that thread
    might then shut down before 'scm_all_threads' is called.

However, these are highly unlikely scenarios, and most likely not the
problem we're seeing here.  Still, I think 'scm_i_finalizer_pre_fork'
should be moved after the 'if', to avoid race (1).
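Here's an untested sketch of the reordering I have in mind, based on my
reading of libguile; the warning and error paths, which I don't have in
front of me, are elided as comments.  Only the ordering is the point:

  SCM
  scm_fork (void)
  {
    int pid;

    if (scm_ilength (scm_all_threads ()) != 1)
      {
        /* ... warn that forking while other threads are running is
           unsafe ... */
      }

    /* Join the finalizer thread immediately before forking, so the
       allocation done by 'scm_all_threads' above can no longer trigger
       a GC that respawns it.  */
    scm_i_finalizer_pre_fork ();

    pid = fork ();
    if (pid == -1)
      {
        /* ... raise a system error ... */
      }

    return scm_from_int (pid);
  }

Note that this narrows, but does not close, window (2): another thread
could still allocate between 'scm_i_finalizer_pre_fork' and 'fork'.
Only ensuring that no other threads are running would close that one.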
>> Another possibility: both the finalization thread and the signal
>> delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
>> which also temporarily grabs the GC allocation lock before calling the
>> specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
>> libgc.  You spawn the signal delivery thread by calling 'sigaction',
>> and you make work for it to do every second when the SIGALRM is
>> delivered.
>
> The signal thread is a possibility though in that case you'd get a
> warning; the signal-handling thread appears in scm_all_threads.  Do you
> see a warning?  If you do, that is a problem :)

Good point!

>>> If that is correct, the fix would be to call fork within
>>> ‘GC_call_with_alloc_lock’.
>>>
>>> How does that sound?
>>
>> Sure, sounds good to me.
>
> I don't think this is necessary.  I think the problem is that other
> threads are running.  If we solve that, then we solve this issue; if we
> don't solve that, we don't know what else those threads are doing, so we
> don't know what mutexes and other state they might have.

On second thought, I agree with Andy here.  Grabbing the allocation
lock shouldn't be needed, and moreover it's not sufficient.  I think we
really need to ensure that no other threads are running, in which case
grabbing the allocation lock is pointless.  Anyway, it seems that libgc
is already arranging to do this, at least on modern GNU/Linux systems.

     Thanks,
       Mark
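P.S.  Regarding the GC_MARKERS experiment above: the real test is just
to run the affected Guile process with GC_MARKERS=1 in its environment.
If I'm reading bdwgc's gc.h correctly, 'GC_get_parallel' reports
whether parallel marking is in effect, so a quick sanity check that the
variable does what I claim might look like this (untested; link with
-lgc, and the header may be <gc.h> rather than <gc/gc.h> depending on
how libgc is installed):

  #include <stdio.h>
  #include <stdlib.h>
  #include <gc/gc.h>

  int
  main (void)
  {
    /* Must be set before the collector initializes; equivalent to
       running the program with GC_MARKERS=1 in its environment.  */
    setenv ("GC_MARKERS", "1", 1);
    GC_INIT ();
    /* Expect 0 here, meaning no parallel marker threads.  */
    printf ("GC_get_parallel () => %d\n", GC_get_parallel ());
    return EXIT_SUCCESS;
  }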