Hi Ludovic,

ludo@gnu.org (Ludovic Courtès) writes:

[...]
> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
> #0  0x00007f7019db0d79 in scm_is_pair (x=0x0) at ../libguile/pairs.h:159
> #1  scm_ilength (sx=) at list.c:190
[...]
> What this means is that Thread 1 gets NULL instead of a list as its
> on-stack argument (vm-engine.c:909 is ‘tail-apply’).
>
> How can arguments on the VM stack be zeroed?

I doubt that's what happened, because I expect that each VM stack is
dedicated to a single hardware thread.  In theory, if a single VM stack
is used by one thread, and then later used by another thread,
thread-safety issues on the VM stack could arise in the absence of
proper thread synchronization.

However, I think it's _far_ more likely that the NULL argument on the
stack was copied from memory shared by multiple threads without proper
thread synchronization.

On modern weakly-ordered memory architectures, writes by one thread may
be seen in a different order by another thread.  For example, if one
thread allocates a pair, initializes its CAR and CDR to non-NULL values,
and then writes the address of the pair somewhere, another thread could
first read the address of the pair, and then read NULL from its CAR or
CDR, unless proper thread synchronization is used.

At best, this requires memory barriers in both the reader and the
writer, which are typically quite expensive.  On x86 processors the read
barrier could expand into a no-op in typical cases, but the write
barrier cannot be avoided, and must be placed after initializing the
structure and before writing its pointer.

I think it's most likely that something like this is happening, because
NULL is not a valid SCM value.  The only code that should normally write
a NULL to an SCM slot is Boehm GC, which clears memory before handing it
to the user.
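
To make the pair example concrete, here is a minimal sketch of the
publication pattern in C11, using a release store in the writer and an
acquire load in the reader.  It is only an illustration under stated
assumptions: 'struct pair', 'shared_slot', 'publish_pair',
'consume_pair' and 'run_demo' are hypothetical names invented for this
message, and none of this is Guile's actual allocation or VM code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a Scheme pair; not Guile's real layout. */
struct pair
{
  void *car;
  void *cdr;
};

/* Location shared between threads.  NULL means "nothing published". */
static _Atomic (struct pair *) shared_slot = NULL;

static int car_value = 1, cdr_value = 2;   /* any non-NULL addresses */

/* Writer: initialize the pair first, then publish its address with a
   release store, so the initializing writes cannot become visible
   after the pointer itself. */
static void
publish_pair (void *car, void *cdr)
{
  struct pair *p = calloc (1, sizeof *p);  /* zeroed, like fresh GC memory */
  if (p == NULL)
    abort ();
  p->car = car;
  p->cdr = cdr;
  atomic_store_explicit (&shared_slot, p, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above.  If it
   returns non-NULL, the CAR and CDR written before publication are
   guaranteed to be visible to this thread. */
static struct pair *
consume_pair (void)
{
  return atomic_load_explicit (&shared_slot, memory_order_acquire);
}

static void *
writer_thread (void *unused)
{
  (void) unused;
  publish_pair (&car_value, &cdr_value);
  return NULL;
}

static void *
reader_thread (void *unused)
{
  struct pair *p;
  (void) unused;
  /* Spin until the writer has published the pair. */
  while ((p = consume_pair ()) == NULL)
    ;
  /* With release/acquire in place, neither field can still be NULL. */
  assert (p->car != NULL && p->cdr != NULL);
  return p;
}

/* Runs one writer and one reader; returns 1 if the reader saw the
   fully initialized pair, 0 otherwise. */
int
run_demo (void)
{
  pthread_t reader, writer;
  void *result = NULL;
  struct pair *seen;
  int ok;

  if (pthread_create (&reader, NULL, reader_thread, NULL) != 0
      || pthread_create (&writer, NULL, writer_thread, NULL) != 0)
    abort ();
  pthread_join (writer, NULL);
  pthread_join (reader, &result);

  seen = result;
  ok = seen != NULL && seen->car == &car_value && seen->cdr == &cdr_value;

  /* Reset the slot so the demo can be run again, then release the pair. */
  atomic_store_explicit (&shared_slot, NULL, memory_order_relaxed);
  free (seen);
  return ok;
}
```

Calling 'run_demo ()' from a 'main' function and building with
something like 'cc -std=c11 -pthread' should be enough to try it.  If
the store and the load were replaced by plain, unsynchronized accesses,
the C11 memory model would no longer rule out the reader observing the
pointer before the fields it points to.
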
So, if you read a NULL from a memory location that's meant to hold an
SCM, then it's most likely because the reading thread does not yet see
the writes that initialized it, owing to the weakly-ordered memory
architecture.

> I commented out the MADV_DONTNEED call to be sure, but I can still
> reproduce the bug.

I strongly doubt that the MADV_DONTNEED is relevant to this issue.

> Then I thought vp->sp might be out-of-sync compared to the local
> variable ‘sp’, which in turn could cause ‘scm_i_vm_mark_stack’ to not
> mark a few items on the tip of the stack.  So I did this:
>
> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
> index 9509cd643..1136b2271 100644
> --- a/libguile/vm-engine.c
> +++ b/libguile/vm-engine.c
> @@ -151,7 +151,8 @@
>     code, or otherwise push anything on the stack, you will need to
>     CACHE_SP afterwards to restore the possibly-changed stack pointer.  */
>
> -#define SYNC_IP() vp->ip = (ip)
> +#define SYNC_IP() \
> +  do { vp->ip = (ip); vp->sp = (sp); } while (0)

I don't see how a change like this could be useful for any thread-safety
problem.  Since stores by one thread may be seen in a different order by
other threads, the memory write corresponding to "vp->sp = (sp)" might
be delayed for some arbitrarily long period of time after the writes
that follow it, up until the next appropriate memory barrier.

For now, I would suggest avoiding multi-threaded code in Guix, or at
least avoiding loading any Scheme code from multiple threads.  How about
using multiple processes instead?

      Mark