Cuirass & pointer finalization.

  • Done
  • quality assurance status badge
Details
3 participants
  • Ludovic Courtès
  • Mathieu Othacehe
  • zimoun
Owner
unassigned
Submitted by
Mathieu Othacehe
Severity
important
M
M
Mathieu Othacehe wrote on 26 Feb 2021 15:14
(address . bug-guix@gnu.org)
8735xihq60.fsf@gnu.org
Hello,

I'm trying to fix a memory corruption in the remote-server process of
Cuirass since a few days. Even though I don't have a usable core dump
file yet, I'm pretty sure the error comes from the "zmq-msg-init"
procedure of Guile-Simple-ZMQ.

This procedure creates a bytevector, call the C function zmq_msg_init to
initialize it, adds zmq_msg_close as pointer finalizer and returns a
wrapped pointer.

My understanding is that the wrapped pointer that is passed around in
Cuirass ensures that the underlying bytevector is not garbage collected
until the pointer goes out of scope. However, some assertions failures
such as this one:

Toggle snippet (3 lines)
Assertion failed: check () (src/msg.cpp:394)

let me think that the bytevector is garbage collected, while ZMQ is
still using it. Some help would be much appreciated here :).

Thanks,

Mathieu
Z
Z
zimoun wrote on 26 Feb 2021 21:12
86im6e1tbr.fsf@gmail.com
Hi Mathieu,

I know nothing about the topic and I probably out-of-scope.

On Fri, 26 Feb 2021 at 15:14, Mathieu Othacehe <othacehe@gnu.org> wrote:

Toggle quote (21 lines)
> I'm trying to fix a memory corruption in the remote-server process of
> Cuirass since a few days. Even though I don't have a usable core dump
> file yet, I'm pretty sure the error comes from the "zmq-msg-init"
> procedure of Guile-Simple-ZMQ.
>
> This procedure creates a bytevector, call the C function zmq_msg_init to
> initialize it, adds zmq_msg_close as pointer finalizer and returns a
> wrapped pointer.
>
> My understanding is that the wrapped pointer that is passed around in
> Cuirass ensures that the underlying bytevector is not garbage collected
> until the pointer goes out of scope. However, some assertions failures
> such as this one:
>
> --8<---------------cut here---------------start------------->8---
> Assertion failed: check () (src/msg.cpp:394)
> --8<---------------cut here---------------end--------------->8---
>
> let me think that the bytevector is garbage collected, while ZMQ is
> still using it. Some help would be much appreciated here :).

From ’zmq-msg-init’ defined here:


and why is ’zmq-message-content’ used for? Since ’message’ is
initialized with zero, I guess. Well, I am confused by:

Toggle snippet (13 lines)
(let ((content-ptr (zmq_msg_data (message->pointer message)))
[...]
(pointer->bytevector content-ptr size))))


(let ((msg (pointer->message! msg-pointer)))
(when content-bv
(let ((target (zmq-message-content msg)))
(bytevector-copy! content-bv 0 target 0 len)))
msg))))

Is ’target’ at the same address than ’msg’? Maybe ’target’ creates
somehow a dangling pointer.


Cheers,
simon
M
M
Mathieu Othacehe wrote on 27 Feb 2021 13:50
(address . 46796@debbugs.gnu.org)(name . Ludovic Courtès)(address . ludo@gnu.org)
87ft1hvfm4.fsf@gnu.org
Hey,

Here's a Valgrind backtrace:

Toggle snippet (17 lines)
==97844== Thread 17:
==97844== Invalid read of size 4
==97844== at 0x114465B9: zmq::msg_t::close() (in /gnu/store/zd9lbfqa3170nsfd4177dnr38k1sjbnc-zeromq-4.3.4/lib/libzmq.so.5.2.4)
==97844== by 0x3A58E98F: ???
==97844== by 0x489AC78: chained_finalizer (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
==97844== by 0x49A16EE: GC_invoke_finalizers (in /gnu/store/iycnpxxrg8m9wf9w58d6zvp9sdby6m9d-libgc-7.6.12/lib/libgc.so.1.3.6)
==97844== by 0x489AF08: scm_run_finalizers (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
==97844== by 0x489AF8C: finalization_thread_proc (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
==97844== by 0x488BB09: c_body (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
==97844== by 0x4913148: vm_regular_engine (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
==97844== by 0x49145B4: scm_call_n (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
==97844== by 0x4890BB9: scm_call_2 (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
==97844== by 0x48923B9: scm_c_with_exception_handler (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
==97844== by 0x4909C3C: scm_c_catch (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
==97844== Address 0x7373313569316263 is not stack'd, malloc'd or (recently) free'd

It looks like the finalizer is operating on a memory region that has
already been free'd. The documentation associated with the
finalization functions in <gc.h> says:

Toggle snippet (7 lines)
/* When obj is no longer accessible, invoke */
/* (*fn)(obj, cd). If a and b are inaccessible, and */
/* a points to b (after disappearing links have been */
/* made to disappear), then only a will be */


As far as I understand, OBJ is the wrapped pointer to the bytevector
created in "zmq-msg-init". There's a weak reference between the pointer
and the bytevector that is introduced by "register_weak_reference" in
"bytevector->pointer".

My interrogation is: do I have the guarantee that the pointer and its
references are still readable from within the finalizer? The above
snippet says that FN is invoked when OBJ is unaccessible, but does this
mean its content may have already been collected?

Cc'ing Ludo :)

Thanks,

Mathieu
M
M
Mathieu Othacehe wrote on 27 Feb 2021 13:59
(name . zimoun)(address . zimon.toutoune@gmail.com)(address . 46796@debbugs.gnu.org)
87blc5vf7d.fsf@gnu.org
Hello zimoun,

Toggle quote (18 lines)
> and why is ’zmq-message-content’ used for? Since ’message’ is
> initialized with zero, I guess. Well, I am confused by:
>
> (let ((content-ptr (zmq_msg_data (message->pointer message)))
> [...]
> (pointer->bytevector content-ptr size))))
>
> …
>
> (let ((msg (pointer->message! msg-pointer)))
> (when content-bv
> (let ((target (zmq-message-content msg)))
> (bytevector-copy! content-bv 0 target 0 len)))
> msg))))
>
> Is ’target’ at the same address than ’msg’? Maybe ’target’ creates
> somehow a dangling pointer.

No 'target' is not at the same address than 'msg', it's just a field of
'msg' that is allocated internally when "zmq_msg_init_size" is called.

Allocating a message with "zmq_msg_init_size" and filling its content by
memcpy'ing data to the memory region pointed by "zmq_msg_data" is the
example given in "Man 3 zmq_msg_send", to I hope this is a valid
use-case :).

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 1 Mar 2021 10:37
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 46796@debbugs.gnu.org)
87k0qrusde.fsf@gnu.org
Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (6 lines)
> Here's a Valgrind backtrace:
>
> ==97844== Thread 17:
> ==97844== Invalid read of size 4
> ==97844== at 0x114465B9: zmq::msg_t::close() (in /gnu/store/zd9lbfqa3170nsfd4177dnr38k1sjbnc-zeromq-4.3.4/lib/libzmq.so.5.2.4)

First, is this function idempotent? (Is it OK to close an msg_t
multiple times.)

Second, remember that finalizers can run in a separate thread. Thus,
you must make sure there are no other threads, such as ZMQ’s internal
threads, still operating on the object when it is freed.

Toggle quote (29 lines)
> ==97844== by 0x3A58E98F: ???
> ==97844== by 0x489AC78: chained_finalizer (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
> ==97844== by 0x49A16EE: GC_invoke_finalizers (in /gnu/store/iycnpxxrg8m9wf9w58d6zvp9sdby6m9d-libgc-7.6.12/lib/libgc.so.1.3.6)
> ==97844== by 0x489AF08: scm_run_finalizers (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
> ==97844== by 0x489AF8C: finalization_thread_proc (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
> ==97844== by 0x488BB09: c_body (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
> ==97844== by 0x4913148: vm_regular_engine (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
> ==97844== by 0x49145B4: scm_call_n (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
> ==97844== by 0x4890BB9: scm_call_2 (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
> ==97844== by 0x48923B9: scm_c_with_exception_handler (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
> ==97844== by 0x4909C3C: scm_c_catch (in /gnu/store/q8brh7j5mwy0hbrly6hjb1m3wwndxqc8-guile-3.0.5/lib/libguile-3.0.so.1.3.0)
> ==97844== Address 0x7373313569316263 is not stack'd, malloc'd or (recently) free'd
>
>
> It looks like the finalizer is operating on a memory region that has
> already been free'd. The documentation associated with the
> finalization functions in <gc.h> says:
>
> /* When obj is no longer accessible, invoke */
> /* (*fn)(obj, cd). If a and b are inaccessible, and */
> /* a points to b (after disappearing links have been */
> /* made to disappear), then only a will be */
>
>
> As far as I understand, OBJ is the wrapped pointer to the bytevector
> created in "zmq-msg-init". There's a weak reference between the pointer
> and the bytevector that is introduced by "register_weak_reference" in
> "bytevector->pointer".

There are (roughly) three objects here: the “msg”, the pointer object,
and the bytevector that pointer refers to.

The bytevector may be freed when the pointer object becomes unreachable.

But you probably also need a weak link from the “msg” object to the
pointer object to ensure that the pointer object outlives the msg
object.

You also need to check the zmq::msg_t::close memory semantics: does it
free the data associated with the message? If so, that’s redundant with
the pointer finalizer.

Toggle quote (5 lines)
> My interrogation is: do I have the guarantee that the pointer and its
> references are still readable from within the finalizer? The above
> snippet says that FN is invoked when OBJ is unaccessible, but does this
> mean its content may have already been collected?

Not sure, but most likely the problem is at a higher layer. :-)

If you can get a reduced test case to run under ‘rr’, that should allow
you to see where the message was first freed.

This is all pretty vague and general, but I HTH!

Ludo’.
L
L
Ludovic Courtès wrote on 1 Mar 2021 15:14
control message for bug #46796
(address . control@debbugs.gnu.org)
87ft1frmel.fsf@gnu.org
severity 46796 important
quit
M
M
Mathieu Othacehe wrote on 2 Mar 2021 09:08
Re: bug#46796: Cuirass & pointer finalization.
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 46796@debbugs.gnu.org)
87mtvmnfjb.fsf@gnu.org
Hey Ludo,

Many thanks for your help here, it feels great to have your support as
always.

Toggle quote (3 lines)
> First, is this function idempotent? (Is it OK to close an msg_t
> multiple times.)

The zmq_msg_close function is mostly responsible of freeing the memory
allocated by ZMQ to store the data associated with the message. I think
it is safe to use it multiple times and from different threads.

It is not responsible of freeing the zmq_msg_t structure itself which is
allocated by the user, usually on the stack in C programs.

I have written a small reproducer of the situation:

Toggle snippet (13 lines)
(use-modules (system foreign)
(rnrs bytevectors))

(define close
(dynamic-func "test_close" (dynamic-link "/home/mathieu/tmp/libtest")))

(let loop ()
(let* ((bv (make-bytevector 64))
(ptr (bytevector->pointer bv)))
(set-pointer-finalizer! ptr close)
(loop)))

this program creates a bytevector of 64 bytes and attaches the C
function "test_close" as a pointer finalizer to the bytevector pointer.

This function looks like:

Toggle snippet (14 lines)
int
test_close (void *x)
{
int i;
char *v = x;

for (i = 0; i < 64; i++) {
v[i] = '1';
}

return 0;
}

It overrides the bytevector content, that should cause a segmentation
error if the bytevector is already freed.

And it does indeed, which makes me think that despite the weak reference
between the pointer object and the bytevector, the bytevector is already
GC'd when the finalizer is called.

I'm now using guardians in Guile-Simple-ZMQ instead of pointer
finalizers, and do not experience crashes anymore, but I would really
like to understand what's happening here.

Any clues :)?

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 2 Mar 2021 14:50
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 46796@debbugs.gnu.org)
87blc1oeba.fsf@gnu.org
Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (40 lines)
> I have written a small reproducer of the situation:
>
> (use-modules (system foreign)
> (rnrs bytevectors))
>
> (define close
> (dynamic-func "test_close" (dynamic-link "/home/mathieu/tmp/libtest")))
>
> (let loop ()
> (let* ((bv (make-bytevector 64))
> (ptr (bytevector->pointer bv)))
> (set-pointer-finalizer! ptr close)
> (loop)))
>
>
> this program creates a bytevector of 64 bytes and attaches the C
> function "test_close" as a pointer finalizer to the bytevector pointer.
>
> This function looks like:
>
> int
> test_close (void *x)
> {
> int i;
> char *v = x;
>
> for (i = 0; i < 64; i++) {
> v[i] = '1';
> }
>
> return 0;
> }
>
> It overrides the bytevector content, that should cause a segmentation
> error if the bytevector is already freed.
>
> And it does indeed, which makes me think that despite the weak reference
> between the pointer object and the bytevector, the bytevector is already
> GC'd when the finalizer is called.

Hmm I think the bytevector and the pointer object can be finalized in
the same GC cycle; when that happens, you have no guarantee as to the
order in which they are finalized.

IOW, I think you cannot assume, from the pointer finalizer, that the
bytevector is still reachable.

But… is it really similar to your ZMQ issue? There you had message
object wrappers (as per ‘define-wrapped-pointer-type’) and a pointer
object to the underlying C object, right?

Toggle quote (4 lines)
> I'm now using guardians in Guile-Simple-ZMQ instead of pointer
> finalizers, and do not experience crashes anymore, but I would really
> like to understand what's happening here.

Guardians are finalizers; it’s just a different interface.

Ludo’.
M
M
Mathieu Othacehe wrote on 2 Mar 2021 18:02
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 46796@debbugs.gnu.org)
87im69bia7.fsf@gnu.org
Hey,

Toggle quote (4 lines)
> Hmm I think the bytevector and the pointer object can be finalized in
> the same GC cycle; when that happens, you have no guarantee as to the
> order in which they are finalized.

That would explain the crashes indeed.

Toggle quote (4 lines)
> But… is it really similar to your ZMQ issue? There you had message
> object wrappers (as per ‘define-wrapped-pointer-type’) and a pointer
> object to the underlying C object, right?

I think the only difference is that the reproducer doesn't introduce the
wrapped pointer object. Using ZMQ, the message creation looks like:

zmq-msg-init
Bytevector creation with make-bytevector at address P
Bytevector initialization with zmq_msg_init(P)
Install zmq_msg_close as finalizer on P
Message wrapping using (pointer->message P)
Return the wrapped message

The user can then operate on the wrapped message by passing it to other
message API procedures such as zmq-message-size. Those procedures will
call ZMQ using the underlying pointer.

The bytevector/pointer object undetermined GC order is really
problematic then. I'm not sure why I'm not experiencing this crash using
Guardians since they are also using finalizers.

The ultimate work around would be to leave the message closing
responsibility to the user but that would be sad. Do you know if there's
another to prevent the bytevector from being collected before the
pointer object?

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 8 Mar 2021 14:45
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 46796@debbugs.gnu.org)
87v9a1ep3p.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (20 lines)
>> Hmm I think the bytevector and the pointer object can be finalized in
>> the same GC cycle; when that happens, you have no guarantee as to the
>> order in which they are finalized.
>
> That would explain the crashes indeed.
>
>> But… is it really similar to your ZMQ issue? There you had message
>> object wrappers (as per ‘define-wrapped-pointer-type’) and a pointer
>> object to the underlying C object, right?
>
> I think the only difference is that the reproducer doesn't introduce the
> wrapped pointer object. Using ZMQ, the message creation looks like:
>
> zmq-msg-init
> Bytevector creation with make-bytevector at address P
> Bytevector initialization with zmq_msg_init(P)
> Install zmq_msg_close as finalizer on P
> Message wrapping using (pointer->message P)
> Return the wrapped message

Shouldn’t the finalizer be on <message>, then?

Toggle quote (8 lines)
> The user can then operate on the wrapped message by passing it to other
> message API procedures such as zmq-message-size. Those procedures will
> call ZMQ using the underlying pointer.
>
> The bytevector/pointer object undetermined GC order is really
> problematic then. I'm not sure why I'm not experiencing this crash using
> Guardians since they are also using finalizers.

Guardians “revive” objects: when you call the guardian, it returns the
object that _would have_ been GC’d. IOW, guardians delay “actual”
finalization. That may be the explanation.

Toggle quote (3 lines)
> The ultimate work around would be to leave the message closing
> responsibility to the user but that would be sad.

Yeah, don’t do that. :-)

Toggle quote (3 lines)
> Do you know if there's another to prevent the bytevector from being
> collected before the pointer object?

I’d really need to dive into the code but I’m confident there’s nothing
special about this scenario; we’re probably just overlooking some
pointer ownership rule.

I see something risky: AIUI, ‘zmq-message-content’ returns a bytevector
that aliases a message’s buffer. The problem is that the bytevector may
still be used from Scheme after the message is destroyed, and then bad
things can happen.

Also, regarding the message API, my goal back then (but I never got
around to it) was to not expose the msg API as such, and instead to
have ‘zmq-send’, ‘zmq-receive’ etc. transparently create msg_t objects.
That simplifies things for users and perhaps also for the
implementation.

HTH,
Ludo’.
L
L
Ludovic Courtès wrote on 23 Nov 2023 12:39
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 46796-done@debbugs.gnu.org)
87pm00d7gr.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (5 lines)
> I'm trying to fix a memory corruption in the remote-server process of
> Cuirass since a few days. Even though I don't have a usable core dump
> file yet, I'm pretty sure the error comes from the "zmq-msg-init"
> procedure of Guile-Simple-ZMQ.

I’m closing this old bug because it’s most likely been fixed with
Cuirass commit 40f70d28aed55c404cca6a0760860fb4942e6bee and this:


Ludo’.
Closed
?