[shepherd] Timer not executed

  • Done
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Tomas Volf
Owner
unassigned
Submitted by
Tomas Volf
Severity
normal

Debbugs page

Tomas Volf wrote 2 weeks ago
(address . bug-guix@gnu.org)
87y0xw1fmz.fsf@wolfsden.cz
Hi,

I might have find a bug in shepherd timers. I have a timer scheduled to
run every 24 hours, the definition is as follow:

Toggle snippet (21 lines)
(define %kerberos-log-in-refresh-service
(let ((name 'kerberos-log-in-refresh))
(simple-service
name
home-shepherd-service-type
(list (shepherd-service
(documentation "Refresh the kerberos ticket.")
(provision (list name))
(requirement '(kerberos-reachable?))
(start #~(make-timer-constructor
(calendar-event #:hours '(12) #:minutes '(0))
(command (list #$%kerberos-log-in))))
(stop #~(make-timer-destructor))
(modules (cons '(shepherd service timer)
%default-modules))
(actions (list (shepherd-action
(name 'trigger)
(documentation "Immediately refresh the ticket.")
(procedure #~trigger-timer)))))))))

This should run every 24 hours (at noon) and execute the
%kerberos-log-in script (simple guile program that authenticates against
kerberos).

However that did not happen. Here are the logs:

Toggle snippet (9 lines)
2025-02-22 19:17:00 Service kerberos-log-in running with value #<<process> id: 730 command: ("/gnu/store/8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in")>.
2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).
2025-02-23 22:00:01 Not rotating '/home/<redacted>/.local/state/shepherd/dbus.log', which is below the 8192 B threshold.

The ones from 19:17:00 are from 'kerberos-log-in service, which is
one-shot executed upon login. That went fine.

However the 'kerberos-log-in-refresh is only at 12:00:02, and only as
"Waiting anew ...". The message indicates that the computer might be
resuming from sleep, however that was not the case here. It is a
desktop machine, and it was left running over night.

Here is herd status:

Toggle snippet (21 lines)
$ herd status kerberos-log-in-refresh
● Status of kerberos-log-in-refresh:
It is running since Sat 22 Feb 2025 07:17:00 PM CET (28 hours ago).
Timed service.
Periodically running: /gnu/store/8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in
It is enabled.
Provides: kerberos-log-in-refresh
Requires: kerberos-reachable?
Custom action: trigger
Will be respawned.

Upcoming timer alarms:
Mon 24 Feb 2025 12:00:00 PM CET (in 13 hours)
Tue 25 Feb 2025 12:00:00 PM CET (in 37 hours)
Wed 26 Feb 2025 12:00:00 PM CET (in 3 days)
Thu 27 Feb 2025 12:00:00 PM CET (in 4 days)
Fri 28 Feb 2025 12:00:00 PM CET (in 5 days)



Have a nice day,
Tomas

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
-----BEGIN PGP SIGNATURE-----

iQJCBAEBCgAsFiEEt4NJs4wUfTYpiGikL7/ufbZ/wakFAme7m5QOHH5Ad29sZnNk
ZW4uY3oACgkQL7/ufbZ/wakYBQ/+K15AMkG2pgFPzykw3IaxyZTs0WYDaYgJwgw2
WQ9LrN0zit3qp3V3TaCPxjx33Lo3oiYeMpm4RsCz6MbuKkpPESZjeNp0Ei/dP2Ju
m6oDV5xLbfrknxzOSmtguCCOLvFmGmmout33df6Mg2gfBEBKfNBZlRYZHQgiK12E
szCSQ5QHLbV75BJfie0N+92B3XKUYDO006e1giVVNV4nhM/qxf+aPc5ezZkXtmAn
t2jd6d+UCptiBhVx5TGSEACJXWg9zl9Rvac0MhXEOSKPXtaYCKwb7wbjO1jI3gAg
nk53vs1s5Q9iGZE5agHlzTkwwDS8BgTBzZ7mv9nLaKF7MRIhf+5ey4P943l0goXa
mfsuY7ccDpAFh028C69OBIogMswzJ24IRsIlKyxxSLRq65iKzLBfcoNsx+8uSjUV
KgBuhCiAIM96KDn8KfRTFZ6WoveWln/khtRtQYFPkQRCQTHbY3PLhv/PO+Wp+Ars
QAnxbhB7u1VbG8/pz/DktY4A30fguevwxf2WFYEPYh0QPvFn6+VTiFHMhO2qhoqo
A+apJzk8msbMzLGmUXDl0F4n6lya3Qlq7KPRCEImtWE722s/fc8yeNxJkPfh+iTl
ylOMQayzgeMamBabSAyEfWNSjBG+KzzBfxgdwqT+1i6JJVCB3H0nRdYe0U5cLXqk
mrBZuGY=
=RcTh
-----END PGP SIGNATURE-----

Ludovic Courtès wrote 2 weeks ago
(name . Tomas Volf)(address . ~@wolfsden.cz)(address . 76516@debbugs.gnu.org)
87cyf7tiro.fsf@gnu.org
Hi Tomas,

Tomas Volf <~@wolfsden.cz> skribis:

Toggle quote (19 lines)
> However that did not happen. Here are the logs:
>
> 2025-02-22 19:17:00 Service kerberos-log-in running with value #<<process> id: 730 command: ("/gnu/store/8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in")>.
> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
> 2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).
> 2025-02-23 22:00:01 Not rotating '/home/<redacted>/.local/state/shepherd/dbus.log', which is below the 8192 B threshold.
>
>
> The ones from 19:17:00 are from 'kerberos-log-in service, which is
> one-shot executed upon login. That went fine.
>
> However the 'kerberos-log-in-refresh is only at 12:00:02, and only as
> "Waiting anew ...". The message indicates that the computer might be
> resuming from sleep, however that was not the case here. It is a
> desktop machine, and it was left running over night.

What architecture is this on?

From the excerpt above, the ‘log-rotation’ timer did fire as expected.
Did it also have “Waiting anew” messages?

Ludo’.
Ludovic Courtès wrote 2 weeks ago
(name . Tomas Volf)(address . ~@wolfsden.cz)(address . 76516@debbugs.gnu.org)
874j0jtijs.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (2 lines)
>> 2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).

The “Waiting anew” message happens when the timer fires 2 seconds or
more later than expected (see ‘sleep-operation/check’), which is indeed
the case here.

It’s not supposed to happen normally. Before we bump that to 10
seconds, say, it would be good to understand why the timer got late
here.

Are there services that could block shepherd somehow, for instance by
calling ‘waitpid’, or running computations at 12:00pm?

Ludo’.
Tomas Volf wrote 2 weeks ago
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 76516@debbugs.gnu.org)
87o6yr17t2.fsf@wolfsden.cz
Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (25 lines)
> Hi Tomas,
>
> Tomas Volf <~@wolfsden.cz> skribis:
>
>> However that did not happen. Here are the logs:
>>
>> 2025-02-22 19:17:00 Service kerberos-log-in running with value #<<process> id: 730 command: ("/gnu/store/8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in")>.
>> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
>> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
>> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
>> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
>> 2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).
>> 2025-02-23 22:00:01 Not rotating '/home/<redacted>/.local/state/shepherd/dbus.log', which is below the 8192 B threshold.
>>
>>
>> The ones from 19:17:00 are from 'kerberos-log-in service, which is
>> one-shot executed upon login. That went fine.
>>
>> However the 'kerberos-log-in-refresh is only at 12:00:02, and only as
>> "Waiting anew ...". The message indicates that the computer might be
>> resuming from sleep, however that was not the case here. It is a
>> desktop machine, and it was left running over night.
>
> What architecture is this on?

x86_64-linux, AMD Ryzen 5 5600G with Radeon Graphics

Toggle quote (4 lines)
>
> From the excerpt above, the ‘log-rotation’ timer did fire as expected.
> Did it also have “Waiting anew” messages?

No, no such message. Actually, there are only 2 additional lines in the
log file. So the following are the last 4 lines of the
shepherd.log.1.zstd (you did already see the first 2 lines):

Toggle snippet (6 lines)
2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).
2025-02-23 22:00:01 Not rotating '/home/<redacted>/.local/state/shepherd/dbus.log', which is below the 8192 B threshold.

2025-02-23 22:00:01 Rotating log.

The empty line is in the log, that is not a copy&paste error. The next
log (shepherd.log) starts with:

Toggle snippet (4 lines)
2025-02-23 22:00:01 Rotating '/home/<redacted>/.local/state/shepherd/shepherd.log' to '/home/<redacted>/.local/state/shepherd/shepherd.log.1'.
2025-02-23 22:00:01 Rotated '/home/<redacted>/.local/state/shepherd/shepherd.log'.

So there is not much indication what happened.

Tomas

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
-----BEGIN PGP SIGNATURE-----

iQJCBAEBCgAsFiEEt4NJs4wUfTYpiGikL7/ufbZ/wakFAme8wzkOHH5Ad29sZnNk
ZW4uY3oACgkQL7/ufbZ/wam8IQ//Z5vwd6+bijsT0k8rb1hJRKlpABudFG1eMJXN
cePwzsnOq8r1CLvjRcDNXiILgze3QJ4rzjj6bljj24UR5OxrcKy2PitJPcjynCQ4
tH+ZZ3CH/brzdm9G35/n37jW3EnIjj2NAwZuqr/kWiK5vj4yU2o+fkmJkNAjxrBz
euO56+/VX3bjkKSyrzXmzMOykkaC+dk0BsPbIbQXWeSfrk2crJOblbpNgblaJhlJ
41/+hXn7QRmuiGB4oU0puLEVH4CKJrvA0IFYFErkO5IMwMWG2E0ceuAREdg3rCO9
RK4XlPo4rpnuSeZeBGbpqUkRrDATf/Jv79ehDavMzuCBP47xJLx4nrf17S4Liy1/
rKbD48DxF3DU+16DFA9lfge17beR9c5UUFh9DoGBvFRUpCSBc6C86dG3IzSK7iyI
5bchXIrJ2IfYYuMJukU9M0QzuWWLNRzCFlc+DAYUzWkVHsvNR86P/NYJApMFFukk
CV7Rzaaw2YGz6QMoxY5uY0ReafRxANWBAYS3TlUxGTgxPWbnlLKzA0/+gIBQ7YT4
Tj5EpRW48ict2cGr2eVyGyxkpsYnKf+1MHOkeF+uXJs+iiYK0Y8nXLZnFAPDjpZo
m23c5b6vO8TUdlbGgQTTuuBKXBcBq1u1WKQoNrS0x3hjvcrlR8VWEJEcokdjHMnf
6AoMk7E=
=a6aS
-----END PGP SIGNATURE-----

Tomas Volf wrote 2 weeks ago
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 76516@debbugs.gnu.org)
87jz9f16yt.fsf@wolfsden.cz
Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (12 lines)
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>>> 2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).
>
> The “Waiting anew” message happens when the timer fires 2 seconds or
> more later than expected (see ‘sleep-operation/check’), which is indeed
> the case here.
>
> It’s not supposed to happen normally. Before we bump that to 10
> seconds, say, it would be good to understand why the timer got late
> here.

I definitely agree on this.

(I wonder if there is better way to detect the sleep. I feel like *any*
number will be wrong for someone. Do we know how for example systemd's
timers handle this?)

Toggle quote (4 lines)
>
> Are there services that could block shepherd somehow, for instance by
> calling ‘waitpid’, or running computations at 12:00pm?

Not really (I think). This is full shepherd status output:

Toggle snippet (15 lines)
$ herd status
Started:
+ dbus
+ pulseaudio
+ root
+ timer
+ transient
Running timers:
+ kerberos-log-in-refresh
+ log-rotation
One-shot:
* kerberos-log-in
* kerberos-reachable?

I have already shared the definition of kerberos-log-in-refresh. There
is no other timer scheduled (except for log rotation). Other services
are from Guix, with the exception of pulseaudio:

Toggle snippet (14 lines)
(define (home-pulseaudio-shepherd-services _)
"Return a shepherd service to run a pulseaudio daemon.

Currently no configuration is supported."
(list
(shepherd-service
(documentation "Run a pulseaudio daemon.")
(provision '(pulseaudio))
(start #~(make-forkexec-constructor
'(#$(file-append pulseaudio "/bin/pulseaudio")
"--daemonize=false")))
(stop #~(make-kill-destructor)))))

There is a timer scheduled to run every 15 minutes in the system
shepherd, but is it not compute heavy (it just checks error counts from
the root filesystem). The machine has 12 cores, each at ~3GHz, 32GB of
RAM and SSD for /. I am not aware of any significant resource use that
should happen at noon, but even if there would be one, it is hard to
believe shepherd would not get a time slice on *any* core for 2 seconds.

For what it is worth, today the cronjob worked fine, however even today
it was executed at :01, so a second later then it should have been.

Toggle snippet (4 lines)
2025-02-24 12:00:01 Timer 'kerberos-log-in-refresh' spawned process 24129.
2025-02-24 12:00:01 Registering new logger for kerberos-log-in-refresh.

If you have any idea what additional information would be useful, I have
no problem deploying patched shepherd with extra logging to this machine
(assuming you know what extra logs we need).

Tomas

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
-----BEGIN PGP SIGNATURE-----

iQJCBAEBCgAsFiEEt4NJs4wUfTYpiGikL7/ufbZ/wakFAme8x3oOHH5Ad29sZnNk
ZW4uY3oACgkQL7/ufbZ/wanNkg//dRdCSCidVNQDR2IikDpxjzbSDm8APG5KKR3O
9u/BMoPJjKOAFWbHd6XjJkSqr4nRmWJeTDvrFtbYPwq6hEZYR0wFg11X+aopARKS
X81Nsu7Nh1+56ntWvAEJT5PnIyTuPwHhe+B4F99WxZdTQiZpyo2AOVmZPMnrQ6TK
/yNsJqHtwj51xNPj5BlkX8IJ/SFggASYwzWmOefoA02pamvEcxwzHAPhnFwHqmPy
hQv8Bi//VdfuYO5TYuTxn2ZxmWZ+GuKMUiAKg0O6ICwHS52qo9C2OTMrznJnLVUW
CFvUzoK+e6wsMCNNXOGx4cGuUlPLOy5yE7tAlgcGQjiD8D3xtHzEBQE38tlQOb33
zC6zuD1r39F4xvpVUidNIgZGPi6u9H7Ey+oXmjwbQuX4NZpx3wmQnQEOexOBqlVS
evmJLp2u/4MEUP+PB4NJJPkIUqzBeDdJmjrsENXdIvSLjykZ4m7E8EVdFNxVl/vu
TwE2PJsZ/T2o3D4Kcolo5cWlWZ2SNu64nYcwyrF/6AR0VVOuHmRPsP865L13kFFK
pYdMMg/93QgnwlXzAJ6eCVjxehnrUn7zpc8jazfvihtWNJowZHwlLQynm6N7XvCO
x8eX1yXYIrmhdLVQJRhtiGRaiE254a3Pn17TkAJcG0jThgAb0qHPpvaKpvJVrRNU
JBJM5d8=
=a45N
-----END PGP SIGNATURE-----

Ludovic Courtès wrote 2 weeks ago
(name . Tomas Volf)(address . ~@wolfsden.cz)(address . 76516@debbugs.gnu.org)
87h64jrosm.fsf@gnu.org
Tomas Volf <~@wolfsden.cz> skribis:

Toggle quote (4 lines)
> (I wonder if there is better way to detect the sleep. I feel like *any*
> number will be wrong for someone. Do we know how for example systemd's
> timers handle this?)

I believe systemd is the one initiating hibernation, so it has the
information first-hand; in our case this is initiated by elogind and
shepherd doesn’t know. Probably something to fix.

Anyway, this time drift remains a mystery to me. I would go for a hack
like this:
Toggle diff (23 lines)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index adc4530..1587a02 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -2490,6 +2490,10 @@ keyword arguments as @code{fork+exec-command}: @code{#:user},
"Make an operation that returns @var{timeout} when @var{seconds} have
elapsed and @var{overslept} when many more seconds have elapsed--this can
happen if the machine is suspended or put into hibernation mode."
+ (define max-delay
+ ;; Time after which we consider that we missed the deadline.
+ (if (> seconds 180) 10 2))
+
(let ((expiry (+ (get-internal-real-time)
(inexact->exact
(round (* seconds internal-time-units-per-second))))))
@@ -2497,7 +2501,7 @@ happen if the machine is suspended or put into hibernation mode."
(lambda ()
(let* ((now (get-internal-real-time))
(delta (- now expiry)))
- (if (> delta (* 2 internal-time-units-per-second))
+ (if (> delta (* max-delay internal-time-units-per-second))
overslept
timeout))))))
WDYT?
Ludo’.
Tomas Volf wrote 2 weeks ago
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 76516@debbugs.gnu.org)
87bjurymtv.fsf@wolfsden.cz
Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (24 lines)
> Tomas Volf <~@wolfsden.cz> skribis:
>
>> (I wonder if there is better way to detect the sleep. I feel like *any*
>> number will be wrong for someone. Do we know how for example systemd's
>> timers handle this?)
>
> I believe systemd is the one initiating hibernation, so it has the
> information first-hand; in our case this is initiated by elogind and
> shepherd doesn’t know. Probably something to fix.
>
> Anyway, this time drift remains a mystery to me. I would go for a hack
> like this:
>
> diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
> index adc4530..1587a02 100644
> --- a/modules/shepherd/service.scm
> +++ b/modules/shepherd/service.scm
> @@ -2490,6 +2490,10 @@ keyword arguments as @code{fork+exec-command}: @code{#:user},
> "Make an operation that returns @var{timeout} when @var{seconds} have
> elapsed and @var{overslept} when many more seconds have elapsed--this can
> happen if the machine is suspended or put into hibernation mode."
> + (define max-delay
> + ;; Time after which we consider that we missed the deadline.

I would extend the comment to describe why both 10 and 2 are used.

Toggle quote (9 lines)
> + (if (> seconds 180) 10 2))
> +
> (let ((expiry (+ (get-internal-real-time)
> (inexact->exact
> (round (* seconds internal-time-units-per-second))))))
> @@ -2497,7 +2501,7 @@ happen if the machine is suspended or put into hibernation mode."
> (lambda ()
> (let* ((now (get-internal-real-time))

I have no idea how Shepherd works internally (and much less how Fibers
work), so maybe this comment is completely off, but this seems
suspicious. Should this lambda not get the wake up time as an argument,
instead of calling get-internal-real-time to get the "now"?

I have no idea what guarantees do Fibers make regarding the delays
between detecting that time is up and calling the callback. And after
quick look at the source code I have decided that it is way beyond me to
try to figure it out.

Is there a way to enable logging of the events? So we would know when
fibers decided the timer is up, and when the lambda was called?

Toggle quote (10 lines)
> (delta (- now expiry)))
> - (if (> delta (* 2 internal-time-units-per-second))
> + (if (> delta (* max-delay internal-time-units-per-second))
> overslept
> timeout))))))
>
>
>
> WDYT?

Well, in *this* particular case it would have resolved the problem, so
great for me I guess. However I have left a suggestion above.

Out of curiosity, I have scheduled a timer event for tomorrow 23:0{0..5}
to see if they will fire with delay. Testing with short timer (closest
whole minute) did not bring any results (the timers were executed
exactly on time), so maybe the long wait is a factor? Will report
tomorrow.

Tomas

PS: Looking into timer.scm, I see this comment

Toggle snippet (6 lines)
;; Reached when resuming from sleep state: we slept
;; significantly more than the requested number of seconds. To
;; avoid triggering every timer when resuming from sleep state,
;; sleep again to remain in sync.

Not sure I would call 2 (or even the 10) a "significantly more". :) If I
expect the cron to sleep for 86400 seconds, 10 more seems... minor.

Maybe (I did not put too much though into this and the numbers are
completely thumb-sucked), the "overslept" could be if the sleep was
longer by more than 10% of the timer period, clipped to be at least 2,
and at most 30 minutes?

If I have a cron scheduled to run once a month, I would guess most
people would prefer to have it run 20 minutes late than to skip a month
completely.

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
-----BEGIN PGP SIGNATURE-----

iQJCBAEBCgAsFiEEt4NJs4wUfTYpiGikL7/ufbZ/wakFAme8+OwOHH5Ad29sZnNk
ZW4uY3oACgkQL7/ufbZ/wanulRAAjMwIwYvgmpwxYkejU/inXftNqIc6uTVd1aq4
VCBrI5ntgmfJf9xDqQQnklkPCmCHR8AqytlGHNOr0+cVGlQyLfpvkoX4L9mBsDrh
UqtsqOl2gnrHlU9t9jfMM9RZlQAn4vL8CzKRg2Z+PRa+W4TqxRYI4rBHG05TIg7N
mOmd17wKbXmUC7B6MUtojdzYMVvKTzMo/1y/4pTXKIXVf8//3PJ077p2NePygNIx
KR66nKnPz0Vezp1C+Aibv1Mvmibk1ZUjdHoOT2cfPSBtzsm+1E9lY7wXUcba5vYL
2YS2huXbNS8QKnTPBUa8wL3BaYlXW+hwshQt6EIsHJXNCIk4Y96VhE/Ru4lTeORA
dg69E625rlU1cm1XY2L89hOCLE9vx5TKoDCUN6AoV6wuh+JLA8YfFCQliUIZCpb7
KOw4YGPmf+cDZXVxNEgbEXVv7nl3jUlB9y3ab0WzLWTusAgLe9ZIeV3QegslPVyW
0tUHnW8xi45tMynst9nhzqJxuPSboaR4Q/pzWM367cxo1EVQMBcTErYNmMysSGXS
c2AAjCFu5L/+oYWpumPSKZko3Pjja8UY7Tpuq+yvAzb8d18kLC6LlgvGeIXv+PgW
KaObrXx6lsm5YwuMj6Tgahscwb2O3Syu5IR0yKY8JlbNdcfLlOnpVzhs0oLscS7c
9FRh9Is=
=4G9P
-----END PGP SIGNATURE-----

Ludovic Courtès wrote 2 weeks ago
(name . Tomas Volf)(address . ~@wolfsden.cz)(address . 76516@debbugs.gnu.org)
875xkx3tm5.fsf@gnu.org
Hi,

Tomas Volf <~@wolfsden.cz> skribis:

Toggle quote (5 lines)
> I have no idea how Shepherd works internally (and much less how Fibers
> work), so maybe this comment is completely off, but this seems
> suspicious. Should this lambda not get the wake up time as an argument,
> instead of calling get-internal-real-time to get the "now"?

Yes, it would probably be nicer, but it wouldn’t make much of a
difference here (and it’s not related to the bug: the bug shows that we
sleep longer than asked for).

Toggle quote (3 lines)
> Is there a way to enable logging of the events? So we would know when
> fibers decided the timer is up, and when the lambda was called?

There’s no logging at the Fibers level; all we have is logging by
shepherd itself.

Toggle quote (15 lines)
> PS: Looking into timer.scm, I see this comment
>
> ;; Reached when resuming from sleep state: we slept
> ;; significantly more than the requested number of seconds. To
> ;; avoid triggering every timer when resuming from sleep state,
> ;; sleep again to remain in sync.
>
> Not sure I would call 2 (or even the 10) a "significantly more". :) If I
> expect the cron to sleep for 86400 seconds, 10 more seems... minor.
>
> Maybe (I did not put too much though into this and the numbers are
> completely thumb-sucked), the "overslept" could be if the sleep was
> longer by more than 10% of the timer period, clipped to be at least 2,
> and at most 30 minutes?

Yeah, though there’s no reason for sleeps to drift this much, it’s a
pretty fundamental assumption. Maybe this:

(define max-delay
;; Time after which we consider that we missed the deadline. Tolerate a
;; slight drift, which can happen occasionally.
(max (min (/ seconds 10.) 120) 2))

Thanks,
Ludo’.
Tomas Volf wrote 2 weeks ago
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 76516@debbugs.gnu.org)
87senyq2k1.fsf@wolfsden.cz
Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (13 lines)
> Hi,
>
> Tomas Volf <~@wolfsden.cz> skribis:
>
>> I have no idea how Shepherd works internally (and much less how Fibers
>> work), so maybe this comment is completely off, but this seems
>> suspicious. Should this lambda not get the wake up time as an argument,
>> instead of calling get-internal-real-time to get the "now"?
>
> Yes, it would probably be nicer, but it wouldn’t make much of a
> difference here (and it’s not related to the bug: the bug shows that we
> sleep longer than asked for).

I am not sure this is correct. What the bug shows is that the callback
is called later then expected. We do not know how long the sleep was.
Am I missing something?

Toggle quote (25 lines)
>
>> Is there a way to enable logging of the events? So we would know when
>> fibers decided the timer is up, and when the lambda was called?
>
> There’s no logging at the Fibers level; all we have is logging by
> shepherd itself.
>
>> PS: Looking into timer.scm, I see this comment
>>
>> ;; Reached when resuming from sleep state: we slept
>> ;; significantly more than the requested number of seconds. To
>> ;; avoid triggering every timer when resuming from sleep state,
>> ;; sleep again to remain in sync.
>>
>> Not sure I would call 2 (or even the 10) a "significantly more". :) If I
>> expect the cron to sleep for 86400 seconds, 10 more seems... minor.
>>
>> Maybe (I did not put too much though into this and the numbers are
>> completely thumb-sucked), the "overslept" could be if the sleep was
>> longer by more than 10% of the timer period, clipped to be at least 2,
>> and at most 30 minutes?
>
> Yeah, though there’s no reason for sleeps to drift this much, it’s a
> pretty fundamental assumption.

Does not seem to hold in this particular case (at least for the lower
bound). ¯\_(ツ)_/¯

Toggle quote (7 lines)
> Maybe this:
>
> (define max-delay
> ;; Time after which we consider that we missed the deadline. Tolerate a
> ;; slight drift, which can happen occasionally.
> (max (min (/ seconds 10.) 120) 2))

That should work, yeah. At least as a temporary measure. :)

Few additional data-points: The timers I have scheduled for almost 24h
in the future fired exactly on time. As for the kerberos-log-in-refresh
timer, twice it fired within the 2 seconds (12:00:01), once outside
(12:00:02).

I was thinking about this some more, and the right solution here
probably is to use netlink to listen for ACPI events, the same way acpid
does. That should provide reliable information about the suspend and
resume events.

Tomas

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
Ludovic Courtès wrote 2 weeks ago
(name . Tomas Volf)(address . ~@wolfsden.cz)(address . 76516@debbugs.gnu.org)
87ldtok6ia.fsf@gnu.org
Hi,

Tomas Volf <~@wolfsden.cz> skribis:

Toggle quote (13 lines)
>>> I have no idea how Shepherd works internally (and much less how Fibers
>>> work), so maybe this comment is completely off, but this seems
>>> suspicious. Should this lambda not get the wake up time as an argument,
>>> instead of calling get-internal-real-time to get the "now"?
>>
>> Yes, it would probably be nicer, but it wouldn’t make much of a
>> difference here (and it’s not related to the bug: the bug shows that we
>> sleep longer than asked for).
>
> I am not sure this is correct. What the bug shows is that the callback
> is called later then expected. We do not know how long the sleep was.
> Am I missing something?

The bug is that it slept longer than expected, not that it was late, if
you see what I mean.

Toggle quote (9 lines)
>> Maybe this:
>>
>> (define max-delay
>> ;; Time after which we consider that we missed the deadline. Tolerate a
>> ;; slight drift, which can happen occasionally.
>> (max (min (/ seconds 10.) 120) 2))
>
> That should work, yeah. At least as a temporary measure. :)

Heh, agreed. Pushed as 7a7b4e16f9697c4822b7693e63cc4ba0ace134a2.

Toggle quote (5 lines)
> Few additional data-points: The timers I have scheduled for almost 24h
> in the future fired exactly on time. As for the kerberos-log-in-refresh
> timer, twice it fired within the 2 seconds (12:00:01), once outside
> (12:00:02).

OK.

Toggle quote (5 lines)
> I was thinking about this some more, and the right solution here
> probably is to use netlink to listen for ACPI events, the same way acpid
> does. That should provide reliable information about the suspend and
> resume events.

Sounds like a good idea (though it’s a bit annoying to depend on
guile-netlink and low-level details).

Another thing I had in mind was to use an elogind hook so that shepherd
would know when we’re suspending; this is necessary for other things
such as locking LUKS devices on suspend. But that’s a change for 1.1.x.

Thanks,
Ludo’.
Ludovic Courtès wrote 2 weeks ago
control message for bug #76516
(address . control@debbugs.gnu.org)
87jz98k6hk.fsf@gnu.org
close 76516
quit
Tomas Volf wrote 1 weeks ago
Re: bug#76516: [shepherd] Timer not executed
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 76516@debbugs.gnu.org)
87ldtonx0i.fsf@wolfsden.cz
Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (6 lines)
> [..]
>
> Another thing I had in mind was to use an elogind hook so that shepherd
> would know when we’re suspending; this is necessary for other things
> such as locking LUKS devices on suspend. But that’s a change for 1.1.x.

I see two possible problem here (both solvable).

1. AFAICT shepherd currently does not depend on elogind at all. Having
it as a run-time dependency might be fine on Guix (assuming we move
elogind into %base-services), but could be annoying on foreign
distributions, especially from non-root user's point of view.

2. How will the hook know what all processes it should let know? There
is no global registry of all running shepherd processes no?

Though I am sure both of these are solvable.

Have a nice day (and thanks for the fix :) ),
Tomas

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
-----BEGIN PGP SIGNATURE-----

iQJCBAEBCgAsFiEEt4NJs4wUfTYpiGikL7/ufbZ/wakFAmfDmh0OHH5Ad29sZnNk
ZW4uY3oACgkQL7/ufbZ/walWsA//d0cl5VZY40lIkozTwHIg0MI+qPSkl98TVrtg
yYuiKlvplCZenfDhK57eEVK2G4EDSYububZIGTkYoX0+WFeR1iLmzLt4zGLMBZ1K
pK3S99uqBbCUbYuRL5oXH+myKHDOJgW8NcqafIFDiS/8ORzsxyaU/Mv50Ql4e/PQ
wh/tFOC2pvrVUmLGSeqWn6kAamdigb23mAOw4TUJ0rQxV0s6gnZrYV4dGR5sP1Vh
UsbTg0sSfLQDn3aCd3Uka1BcrFZFHXgYcPs6vqdMm8hTWiwakJIV3Vkl70fe9Rcp
UE/wBKerd7riGO82xNhTa7c8Rj0HBWi1/lEueuzF2fhLpTXIkcznUvQ2DASTIRsN
y6HIIkIfrBE7dPb0cD5+THLYCSWarRbQ18RYxRNzs6B74t/aVNH9fxr4zwliy3Kq
ndGu4YLGCdSMGFQOlL5mdbsvSrpwruOPS15eAXvjRbWCR9c+Hb3VvIq+7q13JERU
4DFu3Ljr+7xY7USKW0S2hgWx9cXd2O8dT6jeUsNyOvefRlrz8KUSjIYJs1Jqv4Go
WRSO2Ls+hzx7jj2hMT1b9bQ6PBxBf6M0+iAZmqsz5tSS8mtJyw7tbT/Pzyyunu8+
7wH51lKAV3khonl+tDnA0MvPaZGTN/oDomzhPynPiKp6+zkDPsvZSobPGU42QIfL
VF4oVo4=
=wisB
-----END PGP SIGNATURE-----

?
Your comment

Commenting via the web interface is currently disabled.

To comment on this conversation send an email to 76516@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it
mumi current 76516
Then, you may apply the latest patchset in this issue (with sign off)
mumi am -- -s
Or, compose a reply to this issue
mumi compose
Or, send patches to this issue
mumi send-email *.patch
You may also tag this issue. See list of standard tags. For example, to set the confirmed and easy tags
mumi command -t +confirmed -t +easy
Or, remove the moreinfo tag and set the help tag
mumi command -t -moreinfo -t +help