So for example:

  • 20501 tries to acquire
    aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock,
    which is held by 20503;

  • 20503 tries to acquire
    amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock, which is held by
    20501.

This comes from the fact that ‘LocalStore::buildPaths’ takes the
user-supplied derivation list as is, without sorting it, and then
acquires locks in that order in ‘Worker::run’.

A topological sort (or maybe an alphanumeric sort?) should allow us to
guarantee that guix-daemon processes take locks in the same order, and
then don’t end up in a deadlock.

I discovered this bug while monitoring Cuirass on berlin: several
sessions submit batches of 200 derivations in ‘build-paths’ RPCs, and
sometimes most of the corresponding guix-daemon processes would end up
being stuck in a lock-acquiring loop.

Ludo’.
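As a rough illustration of the sorting idea above, here is a minimal
sketch, not taken from the guix-daemon sources. ‘lockOrder’ is a
hypothetical helper: it only shows that two processes handed the same
lock names in different orders end up with the same acquisition order
once the list is sorted.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, NOT part of guix-daemon: return the lock file names
// in one canonical (lexicographic) order, with duplicates removed, so that
// every process acquires them in the same sequence.
std::vector<std::string> lockOrder(std::vector<std::string> names)
{
    std::sort(names.begin(), names.end());
    names.erase(std::unique(names.begin(), names.end()), names.end());
    assert(std::is_sorted(names.begin(), names.end()));
    return names;
}
```

With this helper, a process given the m4 lock first and the gettext lock
second would take them in the same order as a process given them the
other way around, which rules out the circular wait shown in the example.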
> This comes from the fact that ‘LocalStore::buildPaths’ takes the
> user-supplied derivation list as is, without sorting it, and then
> acquires locks in that order in ‘Worker::run’.
This diagnosis is incorrect: ‘Goals’ is a set sorted according to
‘CompareGoalPtrs’, which is a lexical sort arranged so that substitution
goals come before derivation goals.  Thus, ‘_topGoals’ and ‘awake2’ in
‘Worker::run’ are sorted in a deterministic fashion.

The problem is that ‘Worker::waitForAWhile’ reshuffles the order of
goals by temporarily moving goals out of the way.  This can happen when
offloading replies “postpone”, which is inherently non-deterministic
(which goals are put to sleep will vary from one session to another).

When those goals are eventually woken up from ‘Worker::waitForInput’,
they’re reprocessed, in sorted order, but potentially with “holes”
compared to other ‘guix-daemon’ processes.

That’s only a partial explanation; we need to go further to come up with
an actual deadlock scenario.

Ludo’.
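
To make the “holes” idea concrete, here is a toy model. It is an
assumption for illustration, not the daemon’s logic: ‘effectiveOrder’ is
a hypothetical function in which goals are visited in sorted order,
postponed goals are skipped on the first pass, and those are revisited,
still in sorted order, on a second pass.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model, NOT guix-daemon code: 'goals' is iterated in sorted order
// (std::set); goals listed in 'postponed' are skipped in the first pass and
// processed in a second pass, again in sorted order.
std::vector<std::string> effectiveOrder(const std::set<std::string>& goals,
                                        const std::set<std::string>& postponed)
{
    std::vector<std::string> order;
    for (const auto& g : goals)          // first pass: goals that stayed awake
        if (!postponed.count(g)) order.push_back(g);
    for (const auto& g : goals)          // second pass: goals woken up later
        if (postponed.count(g)) order.push_back(g);
    assert(order.size() == goals.size()); // every goal is processed once
    return order;
}
```

In this model, a process that postpones “gettext” handles “m4” first,
while a process that postpones “m4” handles “gettext” first, even though
both start from the same sorted set.  That shows how two sessions can
diverge, though it is not by itself a deadlock scenario.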