mcron randomly stops running jobs

  • Done
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Robert Vollmert
Owner
unassigned
Submitted by
Robert Vollmert
Severity
important
R
R
Robert Vollmert wrote on 30 Aug 2019 19:31
(address . bug-guix@gnu.org)
1DDB50D6-8CC4-4617-B2DC-80BCAADE2126@vllmrt.net
This is the third time I’ve seen this, and this time I’m
sure that nothing else happened.

I have numerous mcron jobs configured, many of which run every 15 minutes,
and one which runs once a minute. Just now, at 19:22, I noticed the minutely
cron job didn’t seem to run, and saw that one every-15-minute job which always
logs to mcron.log had last run at 18:00. I restarted mcron, and jobs started
running again. There is no timezone confusion either: That every-15-minute job
logged its first run at 19:30.
L
L
Ludovic Courtès wrote on 21 Feb 2020 21:19
control message for bug #37237
(address . control@debbugs.gnu.org)
87v9nz3d7v.fsf@gnu.org
severity 37237 important
quit
L
L
Ludovic Courtès wrote on 23 Feb 2020 19:30
Re: bug#37237: mcron randomly stops running jobs
(name . Robert Vollmert)(address . rob@vllmrt.net)(address . 37237@debbugs.gnu.org)
87eeulqhq5.fsf@gnu.org
Hi,

Robert Vollmert <rob@vllmrt.net> skribis:

Toggle quote (7 lines)
> I have numerous mcron jobs configured, many of which run every 15 minutes,
> and one which runs once a minute. Just now, at 19:22, I noticed the minutely
> cron job didn’t seem to run, and saw that one every-15-minute job which always
> logs to mcron.log had last run at 18:00. I restarted mcron, and jobs started
> running again. There is no timezone confusion either: That every-15-minute job
> logged its first run at 19:30.

I’ve been bitten by this problem several times already. I suspect the
patch at
helps with zombies that mcron tends to leave behind, but I’m not sure it
helps with this “stops running jobs” problem (I’d need to grab a C stack
trace of mcron when that happens.)

Anyway, let’s get this fixed!

Ludo’.
L
L
Ludovic Courtès wrote on 18 Mar 2020 12:10
(name . Robert Vollmert)(address . rob@vllmrt.net)(address . 37237-done@debbugs.gnu.org)
878sjxapgc.fsf@gnu.org
Hi,

Robert Vollmert <rob@vllmrt.net> skribis:

Toggle quote (7 lines)
> I have numerous mcron jobs configured, many of which run every 15 minutes,
> and one which runs once a minute. Just now, at 19:22, I noticed the minutely
> cron job didn’t seem to run, and saw that one every-15-minute job which always
> logs to mcron.log had last run at 18:00. I restarted mcron, and jobs started
> running again. There is no timezone confusion either: That every-15-minute job
> logged its first run at 19:30.

When mcron is stuck, it’s apparently stuck in ‘pthread_join’, waiting
for the finalization thread to terminate:

Toggle snippet (37 lines)
(gdb) bt
#0 0x00007f7f730d8478 in __GI___pthread_timedjoin_ex (threadid=140185229293312, thread_return=thread_return@entry=0x0, abstime=abstime@entry=0x0,
block=block@entry=true) at pthread_join_common.c:84
#1 0x00007f7f730d82dc in __pthread_join (threadid=<optimized out>, thread_return=thread_return@entry=0x0) at pthread_join.c:24
#2 0x00007f7f731d7508 in stop_finalization_thread () at finalizers.c:265
#3 0x00007f7f731d7729 in scm_i_finalizer_pre_fork () at finalizers.c:290
#4 0x00007f7f73250426 in scm_fork () at posix.c:1220
#5 0x00007f7f7324979f in vm_regular_engine (thread=0x7f7f6acb79d0, vp=0x7f7f6b299f30, registers=0xafaf, resume=1930265720) at vm-engine.c:786
#6 0x00007f7f7324bfd7 in scm_call_n (proc=#<program 7f7f6aeaf5d0>, argv=argv@entry=0x7ffd9b10bed8, nargs=nargs@entry=1) at vm.c:1260
#7 0x00007f7f731cdf58 in scm_call_1 (proc=<optimized out>, arg1=<optimized out>) at eval.c:485
#8 0x000000000040130c in inner_main ()
#9 0x00007f7f731e5cfd in invoke_main_func (body_data=0x7ffd9b10c3b0) at init.c:341
#10 0x00007f7f731c835a in c_body (d=0x7ffd9b10c2f0) at continuations.c:422
#11 0x00007f7f7324979f in vm_regular_engine (thread=0x7f7f6acb79d0, vp=0x7f7f6b299f30, registers=0xafaf, resume=1930265720) at vm-engine.c:786
#12 0x00007f7f7324bfd7 in scm_call_n (proc=proc@entry=#<smob catch-closure 7f7f6b2a1460>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1260
#13 0x00007f7f731cdf39 in scm_call_0 (proc=proc@entry=#<smob catch-closure 7f7f6b2a1460>) at eval.c:479
#14 0x00007f7f7323ac12 in catch (tag=tag@entry=#t, thunk=#<smob catch-closure 7f7f6b2a1460>,
handler=<error reading variable: ERROR: Cannot access memory at address 0x73776b3469667269>0x7f7f6b2a1420,
pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x73776b3469667269>0x7f7f6b2a1400) at throw.c:137
#15 0x00007f7f7323aef5 in scm_catch_with_pre_unwind_handler (key=key@entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>)
at throw.c:254
#16 0x00007f7f7323b0bf in scm_c_catch (tag=tag@entry=#t, body=body@entry=0x7f7f731c8350 <c_body>, body_data=body_data@entry=0x7ffd9b10c2f0,
handler=handler@entry=0x7f7f731c85e0 <c_handler>, handler_data=handler_data@entry=0x7ffd9b10c2f0,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7f731c8440 <pre_unwind_handler>, pre_unwind_handler_data=0x7f7f6b2939a0) at throw.c:377
#17 0x00007f7f731c8940 in scm_i_with_continuation_barrier (body=body@entry=0x7f7f731c8350 <c_body>, body_data=body_data@entry=0x7ffd9b10c2f0,
handler=handler@entry=0x7f7f731c85e0 <c_handler>, handler_data=handler_data@entry=0x7ffd9b10c2f0,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7f731c8440 <pre_unwind_handler>, pre_unwind_handler_data=0x7f7f6b2939a0) at continuations.c:360
#18 0x00007f7f731c89d5 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#19 0x00007f7f7323984c in with_guile (base=base@entry=0x7ffd9b10c358, data=data@entry=0x7ffd9b10c380) at threads.c:661
#20 0x00007f7f73127a68 in GC_call_with_stack_base (fn=fn@entry=0x7f7f73239800 <with_guile>, arg=arg@entry=0x7ffd9b10c380) at misc.c:1941
#21 0x00007f7f73239b68 in scm_i_with_guile (dynamic_state=<optimized out>, data=data@entry=0x7ffd9b10c380, func=func@entry=0x7f7f731e5ce0 <invoke_main_func>)
at threads.c:704
#22 scm_with_guile (func=func@entry=0x7f7f731e5ce0 <invoke_main_func>, data=data@entry=0x7ffd9b10c3b0) at threads.c:710
#23 0x00007f7f731e5e92 in scm_boot_guile (argc=12, argv=0x7ffd9b10c4f8, main_func=0x4012a0 <inner_main>, closure=0x0) at init.c:324
#24 0x000000000040118b in main ()

The finalization thread lives its life as if it hadn’t received the
“stop” message in its pipe:

Toggle snippet (36 lines)
(gdb) thread 17
[Switching to thread 17 (Thread 0x7f7f6acb7700 (LWP 44975))]
#0 0x00007f7f730e0344 in __libc_read (fd=5, buf=buf@entry=0x7f7f6acb6a40, nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:26
26 ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0 0x00007f7f730e0344 in __libc_read (fd=5, buf=buf@entry=0x7f7f6acb6a40, nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:26
#1 0x00007f7f731d7497 in read_finalization_pipe_data (data=0x7f7f6acb6a40) at finalizers.c:199
#2 0x00007f7f7312d503 in GC_do_blocking_inner (data=0x7f7f6acb6a00 "\200t\035s\177\177", context=<optimized out>) at pthread_support.c:1362
#3 0x00007f7f73121d62 in GC_with_callee_saves_pushed (fn=0x7f7f7312d4c0 <GC_do_blocking_inner>, arg=arg@entry=0x7f7f6acb6a00 "\200t\035s\177\177") at mach_dep.c:328
#4 0x00007f7f73127a9c in GC_do_blocking (fn=fn@entry=0x7f7f731d7480 <read_finalization_pipe_data>, client_data=client_data@entry=0x7f7f6acb6a40) at misc.c:2053
#5 0x00007f7f73239bba in scm_without_guile (func=0x7f7f731d7480 <read_finalization_pipe_data>, data=0x7f7f6acb6a40) at threads.c:722
#6 0x00007f7f731d784b in finalization_thread_proc (unused=<optimized out>) at finalizers.c:212
#7 0x00007f7f731c835a in c_body (d=0x7f7f6acb6e50) at continuations.c:422
#8 0x00007f7f7324979f in vm_regular_engine (thread=0x5, vp=0x7f7f6b299ea0, registers=0x1, resume=1930298180) at vm-engine.c:786
#9 0x00007f7f7324bfd7 in scm_call_n (proc=proc@entry=#<smob catch-closure 7f7f6b22fb80>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1260
#10 0x00007f7f731cdf39 in scm_call_0 (proc=proc@entry=#<smob catch-closure 7f7f6b22fb80>) at eval.c:479
#11 0x00007f7f7323ac12 in catch (tag=tag@entry=#t, thunk=#<smob catch-closure 7f7f6b22fb80>,
handler=<error reading variable: ERROR: Cannot access memory at address 0x73776b3469667269>0x7f7f6b22f740,
pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x73776b3469667269>0x7f7f6b22f720) at throw.c:137
#12 0x00007f7f7323aef5 in scm_catch_with_pre_unwind_handler (key=key@entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>)
at throw.c:254
#13 0x00007f7f7323b0bf in scm_c_catch (tag=tag@entry=#t, body=body@entry=0x7f7f731c8350 <c_body>, body_data=body_data@entry=0x7f7f6acb6e50,
handler=handler@entry=0x7f7f731c85e0 <c_handler>, handler_data=handler_data@entry=0x7f7f6acb6e50,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7f731c8440 <pre_unwind_handler>, pre_unwind_handler_data=0x7f7f6b2939a0) at throw.c:377
#14 0x00007f7f731c8940 in scm_i_with_continuation_barrier (body=body@entry=0x7f7f731c8350 <c_body>, body_data=body_data@entry=0x7f7f6acb6e50,
handler=handler@entry=0x7f7f731c85e0 <c_handler>, handler_data=handler_data@entry=0x7f7f6acb6e50,
pre_unwind_handler=pre_unwind_handler@entry=0x7f7f731c8440 <pre_unwind_handler>, pre_unwind_handler_data=0x7f7f6b2939a0) at continuations.c:360
#15 0x00007f7f731c89d5 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#16 0x00007f7f7323984c in with_guile (base=base@entry=0x7f7f6acb6eb8, data=data@entry=0x7f7f6acb6ee0) at threads.c:661
#17 0x00007f7f73127a68 in GC_call_with_stack_base (fn=fn@entry=0x7f7f73239800 <with_guile>, arg=arg@entry=0x7f7f6acb6ee0) at misc.c:1941
#18 0x00007f7f73239b68 in scm_i_with_guile (dynamic_state=<optimized out>, data=<optimized out>, func=<optimized out>) at threads.c:704
#19 scm_with_guile (func=<optimized out>, data=<optimized out>) at threads.c:710
#20 0x00007f7f730d7015 in start_thread (arg=0x7f7f6acb7700) at pthread_create.c:486
#21 0x00007f7f7300891f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This is on 2.2.6 and very likely the same issue as
https://bugs.gnu.org/37757: ‘finalization_thread_proc’ gets EINTR, by
chance ‘data.byte’ is zero, and so it keeps going instead of exiting.

Commit e002332b8a136638c2a4fc1a2f2a26541ce0211d moves mcron to 2.2.7,
which should fix this problem. \o/

Ludo’.
Closed
?
Your comment

This issue is archived.

To comment on this conversation send an email to 37237@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it
mumi current 37237
Then, you may apply the latest patchset in this issue (with sign off)
mumi am -- -s
Or, compose a reply to this issue
mumi compose
Or, send patches to this issue
mumi send-email *.patch