Hi! Caleb Ristvedt skribis: >> This change had an undesirable effect: the connection would be kept for >> the body of ‘with-temporary-store-file’, during which we’d call: >> >> finalize-store-file -> register-path >> >> which accesses the database. At this point, for each ‘guix offload’ >> process, we’d thus have the database open twice: once for the session’s >> guix-daemon, and once for that ‘register-path’ call. > > If the connection wasn't kept for the body of with-temporary-store-file, > the temporary store file wouldn't be protected from GC during the body > (the daemon treats unlocked temproots files as "stale"), thus rather > defeating the purpose. It makes sense, then, that the connection was > also kept for the body prior to this patch - indeed, unless emacs's > parenthesis-matching capabilities are failing me, it appears that the > body is solidly within the 'with-store' form in > 37edbc91e34fb5658261e637e6ffccdb381e5271. Oh you’re right, sorry for the confusion. >> On berlin, the effect is that we see many ‘guix offload’ processes >> stalled because the SQLite database is busy: > > ... which makes this quite the mystery indeed. I assume you've tested > with the patch reverted and found that this issue goes away? No. I observed the behavior and looked for recent changes that could cause the problem. But I guess I was tired and jumped to silly conclusions. > > > AFAIK just having the database open doesn't by itself impose any > locks. The daemon process we're connected to should have it open, but > should just be blocked waiting for our next RPC. Database locks happen > when transactions are started (either explicitly or implicitly), and > implicitly-started transactions are automatically committed by sqlite > (specifically when the statement that started the transaction is either > reset or finalized). The only loose end I can think of right now is that > call-with-transaction only catches exceptions of type 'sqlite-error, so > in theory if a different type of exception were to be thrown, it could > exit in a broken state where neither a commit nor a rollback has been > performed. Really it should catch all exception types, and use match in > the handler to pick out the sqlite-errors. If that were causing the > problems, though, we'd expect to see some errors appearing in the > offload output. Good point but yes, we’d see an error, and ‘guix offload’ would probably exit right away. > Actually, come to think of it, there could be another issue with > call-with-transaction: if somehow it's possible for SQLITE_BUSY errors > to occur despite the connection having succeeded with a 'begin > immediate;' (which immediately starts a write transaction), then the > rollback wouldn't occur, and what should be a failed transaction > followed by a successful transaction becomes one long, > restarted-in-the-middle transaction. I'm not sure if that's a problem in > practice, though. > > And now that I look at it again, it turns out that most of our database > query procedures in (guix store database) aren't finalizing their > statements in case of a nonlocal exit... which would tend to happen if, > for example, an SQLITE_BUSY error occurred. Which would cause the > statement to not be finalized until the garbage collector got ahold of > it. But due to statement caching the garbage collector likely won't get > ahold of it until the database connection itself is destroyed. The > wording at https://www.sqlite.org/lang_transaction.html makes me think > that this shouldn't be an issue because the errors we'd expect all seem > to roll back automatically, but if we somehow got one that didn't roll > back automatically, there would potentially be an extended amount of > time before the statement was finalized and the implicit transaction > committed. > > Also, I've noticed that with the way that finalize-store-file is > written, we actually already have a database open when we call > register-path. This is because it's needed in order to call path-id, but > the scope of that with-database form is rather larger than it needs to > be. > > We may have a situation here where things go fine until a single > SQLITE_BUSY error is produced by chance, and that causes more > SQLITE_BUSY errors, and so on. Hmm, sounds plausible. > In summary, there are many things I could imagine going wrong to cause / > contribute to the observed behavior, but the patch, barring some absurd > guile compilation bug, is not one of them. I do, however, think that > (guix store database) needs some love. Yeah. >> They loop pretty much indefinitely on this and nothing (or very little) >> happens on the system. > > To be clear, the nothing-happening status is common to all processes > that use the database, including daemon processes? That's quite severe. I just did a random sample, but several offload processes were stuck like the one I showed, and clients would usually get “database is locked” messages from the daemon. >> I’ll revert this patch but I’m happy to hear what you think, Caleb. > > If the data says it's causing those problems, I'd tend to agree with > that. I would really like to understand how, though, because even after > a few hours of brainstorming bizarre edge cases I still can't come up > with a satisfying explanation. No you’re right, my analysis was wrong. Further investigation needed… >> Another reason to implement temp roots in Scheme, as it would allow us >> to not open a connection to the daemon from ‘guix offload’! > > Soon™. Conceptually the code is there, I'm working towards a rebase that > tries to first make the rest of daemon-side guix compatible with fibers > - thread pools✓, eval-with-container✓, fibers-friendly waitpid✓, etc. Neat! For master we could do with a simpler implementation, but we’ll see. Thanks, Ludo’.