Installer: null pointer exception during partitioning

  • Done
  • quality assurance status badge
Details
3 participants
  • Ludovic Courtès
  • Mathieu Othacehe
  • Juan
Owner
unassigned
Submitted by
Juan
Severity
important
Merged with
J
installation error
(name . bug-guix@gnu.org)(address . bug-guix@gnu.org)
wWdteTLTUnxVX5pRpeceVgWYy_OZ39lObX0cx9myYTbL9YOVOv3zJ7Mj6kXszbwI5-IxhGglrmg-gIuEtzMGefejvXedlqOLDf04AIW4HB8=@protonmail.com
Hi!

I ran into some trouble while attempting to install Guix SD (1.0.1.x86_64). It happens when I try to do the guided graphical installation, I'll transcript the whole text here:

"
The installer has encountered an unexpected problem. The backtrace is displayed below. Please report it by email to <bug-guix@gnu.org>.

In ice-9/boot-9.scm:
829:9 19 (catch srfi-34 #<procedure 2636000 at ./gnu/installer/steps.scm:144:7 ()> #<procedure 25db4b0 at ./gnu/installer/steps.scm:144:7 (keyc)> _)
829:9 18 (catch srfi-34 #<procedure 2695e00 at ./gnu/installer/steps.scm:144:7 ()> #<procedure 25db460 at ./gnu/installer/steps.scm:144:7 (keyc)> _)
829:9 17 (catch srfi-34 #<procedure 2695c00 at ./gnu/installer/steps.scm:144:7 ()> #<procedure 25db410 at ./gnu/installer/steps.scm:144:7 (keyc)> _)
829:9 16 (catch srfi-34 #<procedure 1079dc0 at ./gnu/installer/steps.scm:144:7 ()> #<procedure 107df00 at ./gnu/installer/steps.scm:144:7 (keyc)> _)
In ./gnu/installer/steps.scm:
182:21 15(_)
In ./gnu/installer/newt/partition.scm:
755:33 14 (run-partitioning-page)
In ./gnu/installer/parted.scm:
1010:14 13 (auto-partition! #<<disk> bytestructure: #<bytestructure 0x106d840>> #:scheme _)
870:21 12 (loop _ _ _)
863:17 11 (loop _ 2617712816 1289318400)
771:25 10 (mkpart #<<disk> bytestructure: #<bytestructure 0x106d840>> _ #:previous-partition _)
In parted/structs.scm:
552:19 9 (pointer->partition _)
132:3 8 (pointer->bytestructure #<pointer 0x0> #<bytestructure-descriptor 0x29f4740>)
In unknown file:
7 (pointer->bytevector #<pointer 0x0> 88 #<undefined> #<undefined>)
In ice-9/boot.scm:
751:25 6 (dispatch-exception 5 null-pointer-error ("pointer->bytevector" "null pointer dereference" () ()))
In ice-9/eval.scm:
619:8 5 (_ #(#(#<directory (guile-user) b67140> #<<installer> name: newt init: #<procedure init ()> exit: #procedure exit ()> exit-error:
#<procedure exit-error (file key args> final-p...>) ...))

619:8 4 (_ #(#(#(#<directory (guile-user) b67140> #<<installer> name: newt init: #<procedure init ()> exit: #procedure exit ()> exit-error:
#<procedure exit-error (file key args> fi...>) ...) #))
In ice-9/ports.scm:
462:17 3 (call-with-output-file _ _ #:binary _ #:encoding _)
In ice-9/eval.scm:
619:8 2 (_ #(#(#<directory (guile-user) b67140> null-pointer-error ("pointer->bytevector" "null pointer dereference" () ())) #<output: /tmp/last-installer-error 12>))
159:9 1 (_ #(#(#<directory (guile-user) b67140> null-pointer-error ("pointer->bytevector" "null pointer dereference" () ())) #<output: /tmp/last-installer-error 12>))
In unknown file:
0 (make-stack #t)
ice-9/eval.scm:159:9: In procedure pointer->bytevector: null pointer dereference
"

Here are the specifics of my computer:
ASUS MAXIMUS VI IMPACT ACPI BIOS Revision 1301
CPU: Intel Core 15-4440 CPU @ 3.19GHz

I'm not sure if it's a hardware compatibility problem, a bug in the guided graphical installation, or something else.

Thanks in advance, kind regards.
Attachment: file
L
L
Ludovic Courtès wrote on 27 Jun 2019 22:30
control message for bug #36402
(address . control@debbugs.gnu.org)
87ftnuvkig.fsf@gnu.org
retitle 36402 Installer: null pointer exception during partitioning
quit
L
L
L
Ludovic Courtès wrote on 27 Jun 2019 22:30
control message for bug #35858
(address . control@debbugs.gnu.org)
87d0iyvki3.fsf@gnu.org
merge 35858 36402
quit
L
L
Ludovic Courtès wrote on 27 Jun 2019 23:08
Re: bug#36402: installation error
(name . Juan)(address . r5jm@protonmail.com)
87lfxmu47j.fsf@gnu.org
Hi Juan,

Juan <r5jm@protonmail.com> skribis:

Toggle quote (2 lines)
> I ran into some trouble while attempting to install Guix SD (1.0.1.x86_64). It happens when I try to do the guided graphical installation, I'll transcript the whole text here:

[...]

Toggle quote (14 lines)
> 755:33 14 (run-partitioning-page)
> In ./gnu/installer/parted.scm:
> 1010:14 13 (auto-partition! #<<disk> bytestructure: #<bytestructure 0x106d840>> #:scheme _)
> 870:21 12 (loop _ _ _)
> 863:17 11 (loop _ 2617712816 1289318400)
> 771:25 10 (mkpart #<<disk> bytestructure: #<bytestructure 0x106d840>> _ #:previous-partition _)
> In parted/structs.scm:
> 552:19 9 (pointer->partition _)
> 132:3 8 (pointer->bytestructure #<pointer 0x0> #<bytestructure-descriptor 0x29f4740>)
> In unknown file:
> 7 (pointer->bytevector #<pointer 0x0> 88 #<undefined> #<undefined>)
> In ice-9/boot.scm:
> 751:25 6 (dispatch-exception 5 null-pointer-error ("pointer->bytevector" "null pointer dereference" () ()))

That looks like what was reported at
https://issues.guix.gnu.org/issue/35858, so I’ve merged both. Thanks
for the report, Juan!

Mathieu, in the same spirit as
https://issues.guix.gnu.org/issue/35783, I think we have an object
life cycle and memory management issue.

I hadn’t noticed but we’re doing manual memory management by calling
things like ‘disk-destroy’ in the installer. That’s crash-prone and
best avoided.

The usual way to handle it in bindings is by:

1. Adding pointer finalizers. So for example the pointer object
associated with a <disk> record would have a finalizer that calls
‘ped_disk_destroy’.

2. Having a weak-key hash table to track object dependencies when
needed. So, if a <device> aggregates a <disk>, there must be an
entry in the hash table that maps the <device> to the <disk>. That
way, we ensure that the <disk> object remains live as long as the
<device> is live.

We can expose “close” functions that free OS resources such as file
descriptors, but we should not expose deallocation functions like
‘ped_disk_destroy’; instead, we let the GC call them when the objects
become unreachable.

Does that make sense?

I think we should audit and adjust Guile-Parted in that spirit. WDYT?

Thanks,
Ludo’.
M
M
Mathieu Othacehe wrote on 29 Jun 2019 17:47
(name . Ludovic Courtès)(address . ludo@gnu.org)
87woh4mm08.fsf@gmail.com
Hey Ludo,

Toggle quote (4 lines)
> Does that make sense?
>
> I think we should audit and adjust Guile-Parted in that spirit. WDYT?

Yes, it seems like the right thing to do. I'll try to apply those
changes to Guile-Parted next week. However, as we cannot reproduce those
null-pointer issues, we won't be sure if we fixed them for sure.

In the meantime, and completely unrelated, I created a new
wip-cross-system branch to fix most of the cross-compilation issues
preventing from cross-building a Guix System.

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 2 Jul 2019 15:25
(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)
878stg8t68.fsf@gnu.org
Hi,

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (8 lines)
>> Does that make sense?
>>
>> I think we should audit and adjust Guile-Parted in that spirit. WDYT?
>
> Yes, it seems like the right thing to do. I'll try to apply those
> changes to Guile-Parted next week. However, as we cannot reproduce those
> null-pointer issues, we won't be sure if we fixed them for sure.

Perhaps we can reproduce them by adding a bunch of calls to ‘gc’ in the
code here and there. That’s often a good way to stress-test memory
management.

Toggle quote (4 lines)
> In the meantime, and completely unrelated, I created a new
> wip-cross-system branch to fix most of the cross-compilation issues
> preventing from cross-building a Guix System.

Neat! Consider opening a new issue for this. :-)

Thank you,
Ludo’.
M
M
Mathieu Othacehe wrote on 31 Aug 2019 20:43
(name . Ludovic Courtès)(address . ludo@gnu.org)
878sr989br.fsf@gmail.com
Hey Ludo,

Toggle quote (4 lines)
> Does that make sense?
>
> I think we should audit and adjust Guile-Parted in that spirit. WDYT?

Sorry for the delay!

I followed your advice and hid all destroy related functions behind
pointer finalizers. I also added some unit tests to Guile-Parted.

I pushed everything but feel free to comment! Then, I'll make a new
release and try to adapt the installer to those changes.

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 1 Sep 2019 21:55
(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)
87a7bnaj0b.fsf@gnu.org
Howdy,

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (3 lines)
> I followed your advice and hid all destroy related functions behind
> pointer finalizers. I also added some unit tests to Guile-Parted.

Nice!

I tried “guix build guile-parted --with-branch=guile-parted=master”, but
that fails because ‘build-aux/test-driver.scm’ is missing from the repo,
I think.

It might be useful to add calls to ‘gc’ here and there in the tests to
stress-test memory management.

Toggle quote (3 lines)
> I pushed everything but feel free to comment! Then, I'll make a new
> release and try to adapt the installer to those changes.

Awesome, thanks a lot!

Ludo’.
M
M
Mathieu Othacehe wrote on 2 Sep 2019 11:50
(name . Ludovic Courtès)(address . ludo@gnu.org)
87imqbgh71.fsf@gmail.com
Hey,

I pushed the missing file :).

Toggle quote (3 lines)
> It might be useful to add calls to ‘gc’ here and there in the tests to
> stress-test memory management.

Inserting gc calls here:

Toggle snippet (14 lines)
(test-assert "partition-remove extended"
(with-tmp-device
"device-extended.iso"
(lambda (new-device)
(let* ((device (get-device new-device))
(disk (disk-new device))
(partitions (disk-partitions disk))
(extended-partition (find extended-partition? partitions)))
(gc) ; <-- Try to destroy disk?
(disk-remove-partition* disk extended-partition)
(gc)
(equal? (extended-partition-count disk) 0)))))

causes a segfault. Is it legal to call GC here? Do you have any clue on
how to investigate what the GC is doing?

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 3 Sep 2019 11:13
(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)
87woepyc7d.fsf@gnu.org
Hello,

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (21 lines)
>> It might be useful to add calls to ‘gc’ here and there in the tests to
>> stress-test memory management.
>
> Inserting gc calls here:
>
> (test-assert "partition-remove extended"
> (with-tmp-device
> "device-extended.iso"
> (lambda (new-device)
> (let* ((device (get-device new-device))
> (disk (disk-new device))
> (partitions (disk-partitions disk))
> (extended-partition (find extended-partition? partitions)))
> (gc) ; <-- Try to destroy disk?
> (disk-remove-partition* disk extended-partition)
> (gc)
> (equal? (extended-partition-count disk) 0)))))
>
> causes a segfault. Is it legal to call GC here? Do you have any clue on
> how to investigate what the GC is doing?

GC might run at any time, so yes, it’s valid to insert calls to ‘gc’
anywhere. So this is good, this is kind of issue we want to catch. :-)

To investigate, I would recommend re-reading how memory management works
in Parted. Questions such as:

1. Can Parted free a C object (disk, partition, etc.) behind your
back? Is there a way to prevent it?

2. When a Parted object aggregates another object, how’s memory
managed? For example, if a “disk” aggregates (refers to) a
“partition”, who’s responsible for freeing that partition?

3. Relatedly, if, say, a “disk” aggregates a “partition”, do you make
sure on the Scheme side that you do not free the partition while
the disk is still alive?

You can make sure this doesn’t happen by using a weak-key hash
table, as discussed before, where the key is the disk and the value
is the list of partitions it aggregates.

If you can get a backtrace from the core dump, that might give clues.

Setting the environment variable:

export GLIBC_TUNABLES=glibc.malloc.check=1

might tell you if it’s a double-free error or something.

You can also use Valgrind though libgc creates a lot of noise there.

Please share whatever you gather before you get depressed. ;-)

HTH!

Ludo’.
M
M
Mathieu Othacehe wrote on 4 Sep 2019 14:31
(name . Ludovic Courtès)(address . ludo@gnu.org)
87imq8tf88.fsf@gmail.com
Hey Ludo,

Toggle quote (2 lines)
> Please share whatever you gather before you get depressed. ;-)

Thanks a lot for your help, it was really useful! I used valgrind a lot
to understand what was happening.

Ok so here's a little summary:

* Using ped_device_get Parted returns new devices. It may return already
existing devices if the given path is already known. Users are
responsible for their destruction calling ped_device_destroy.

* Using ped_disk_new Parted returns new disks from devices. Users are
responsible for their destruction calling ped_disk_destroy.

* A disk contains partitions. A user can remove partitions without
deleting them. It is also possible to delete them calling
ped_partition_destroy. On disk destruction, all the partitions
associated are destroyed.

Here's how memory is managed in Guile-Parted:

* ped_device_destroy is set a finalizer for device pointers.

* ped_disk_destroy is set a finalizer for disk pointers. Device object
associated to disks are recorded in a weak key hash table, to ensure
that the lifetime of a disk is shorted than that of its device.

* No finalizer is set for partition pointers. However, disk associated
to partitions are recorded in a weak key hash table, to ensure that the
lifetime of a partition is shorter that that of its disk. The user can
access ped_disk_remove_partition function from Parted but cannot acces
ped_partition_destroy function. Partition destruction is done by Parted
of disk destruction.

And here is what was going wrong:

ped_device_get and ped_device_get_next can return pointers to already
existing device object. So set-pointer-finalizer! was possibly called
multiple times on the same device pointer, resulting in calling
ped_device_destroy multiple times on the same device pointer.

To prevent that, I created a weak value hash table to make sure that one
<device> object maps to exactly one device pointer, and that the pointer
finalizer is set only once. See commit b35839b.

I also added (gc) calls at various locations in the tests in commit
728fd01.

WDYT?

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 5 Sep 2019 10:32
(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)
87pnkfi1nx.fsf@gnu.org
Hello!

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (11 lines)
> And here is what was going wrong:
>
> ped_device_get and ped_device_get_next can return pointers to already
> existing device object. So set-pointer-finalizer! was possibly called
> multiple times on the same device pointer, resulting in calling
> ped_device_destroy multiple times on the same device pointer.
>
> To prevent that, I created a weak value hash table to make sure that one
> <device> object maps to exactly one device pointer, and that the pointer
> finalizer is set only once. See commit b35839b.

Good catch!

I confirm that:

guix build guile-parted --with-branch=guile-parted=master --check

passed several times in a row. :-)

b35839b LGTM!

(‘define-wrapped-pointer-type’ takes care of this, but we can’t use it
while we use bytestructures (info "(guile) Void Pointers and Byte
Access").)

It seems to me that the fix should be not just for ‘pointer->device!’
but for all the ‘pointer->RECORD!’ procedures, where we potentially have
similar scenarios, and where we’d rather have:

(eq? (pointer->X ptr) (pointer->X ptr))

So perhaps you should define your own ‘define-wrapped-type’ macro that
does ‘define-record-type’ + the weak hash table thing, and replace all
‘define-record-type’ instances in structs.scm with
‘define-wrapped-type’. How does that sound?

Thank you!

Ludo’.
M
M
Mathieu Othacehe wrote on 5 Sep 2019 15:53
(name . Ludovic Courtès)(address . ludo@gnu.org)
87mufiq27h.fsf@gmail.com
Hey,

Toggle quote (5 lines)
> So perhaps you should define your own ‘define-wrapped-type’ macro that
> does ‘define-record-type’ + the weak hash table thing, and replace all
> ‘define-record-type’ instances in structs.scm with
> ‘define-wrapped-type’. How does that sound?

Seems like the right thing to do :) However, I had a look to all Parted
functions which result is passed to a pointer->X! function, and except
ped_device_get, they always return newly allocated objects. So I guess
we are safe for now.

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 8 Sep 2019 21:35
(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)
877e6i8ttu.fsf@gnu.org
Hello,

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (10 lines)
>> So perhaps you should define your own ‘define-wrapped-type’ macro that
>> does ‘define-record-type’ + the weak hash table thing, and replace all
>> ‘define-record-type’ instances in structs.scm with
>> ‘define-wrapped-type’. How does that sound?
>
> Seems like the right thing to do :) However, I had a look to all Parted
> functions which result is passed to a pointer->X! function, and except
> ped_device_get, they always return newly allocated objects. So I guess
> we are safe for now.

OK, sounds good!

Ludo’.
M
M
Mathieu Othacehe wrote on 18 Mar 2020 18:56
(name . Ludovic Courtès)(address . ludo@gnu.org)
87sgi5czrq.fsf@gmail.com
Hello,

This has hopefully been resolved by Guile-Parted 0.0.2 update, so
closing!

Thanks,

Mathieu
Closed
?