Installer: null pointer exception during partitioning

Done

Details

3 participants

Ludovic Courtès
Mathieu Othacehe
Juan

Owner: unassigned

Submitted by: Juan

Severity: important

Merged with

35858

Juan wrote on 27 Jun 2019 03:47

installation error

Recipients:(name . bug-guix@gnu.org)(address . bug-guix@gnu.org)

Message-ID:wWdteTLTUnxVX5pRpeceVgWYy_OZ39lObX0cx9myYTbL9YOVOv3zJ7Mj6kXszbwI5-IxhGglrmg-gIuEtzMGefejvXedlqOLDf04AIW4HB8=@protonmail.com

Hi!

I ran into some trouble while attempting to install Guix SD (1.0.1.x86_64). It happens when I try to do the guided graphical installation, I'll transcript the whole text here:

"
The installer has encountered an unexpected problem. The backtrace is displayed below. Please report it by email to <bug-guix@gnu.org>.

In ice-9/boot-9.scm:
    829:9 19 (catch srfi-34 #<procedure 2636000 at ./gnu/installer/steps.scm:144:7 ()> #<procedure 25db4b0 at ./gnu/installer/steps.scm:144:7 (keyc)> _)
    829:9 18 (catch srfi-34 #<procedure 2695e00 at ./gnu/installer/steps.scm:144:7 ()> #<procedure 25db460 at ./gnu/installer/steps.scm:144:7 (keyc)> _)
    829:9 17 (catch srfi-34 #<procedure 2695c00 at ./gnu/installer/steps.scm:144:7 ()> #<procedure 25db410 at ./gnu/installer/steps.scm:144:7 (keyc)> _)
    829:9 16 (catch srfi-34 #<procedure 1079dc0 at ./gnu/installer/steps.scm:144:7 ()> #<procedure 107df00 at ./gnu/installer/steps.scm:144:7 (keyc)> _)
In ./gnu/installer/steps.scm:
    182:21 15(_)
In ./gnu/installer/newt/partition.scm:
    755:33 14 (run-partitioning-page)
In ./gnu/installer/parted.scm:
   1010:14 13 (auto-partition! #<<disk> bytestructure: #<bytestructure 0x106d840>> #:scheme _)
    870:21 12 (loop _ _ _)
    863:17 11 (loop _ 2617712816 1289318400)
    771:25 10 (mkpart #<<disk> bytestructure: #<bytestructure 0x106d840>> _ #:previous-partition _)
In parted/structs.scm:
    552:19 9 (pointer->partition _)
     132:3 8 (pointer->bytestructure #<pointer 0x0> #<bytestructure-descriptor 0x29f4740>)
In unknown file:
    7 (pointer->bytevector #<pointer 0x0> 88 #<undefined> #<undefined>)
In ice-9/boot.scm:
    751:25 6 (dispatch-exception 5 null-pointer-error ("pointer->bytevector" "null pointer dereference" () ()))
In ice-9/eval.scm:
    619:8 5 (_ #(#(#<directory (guile-user) b67140> #<<installer> name: newt init: #<procedure init ()> exit: #procedure exit ()> exit-error:
#<procedure exit-error (file key args> final-p...>) ...))

    619:8 4 (_ #(#(#(#<directory (guile-user) b67140> #<<installer> name: newt init: #<procedure init ()> exit: #procedure exit ()> exit-error:
#<procedure exit-error (file key args> fi...>) ...) #))
In ice-9/ports.scm:
    462:17 3 (call-with-output-file _ _ #:binary _ #:encoding _)
In ice-9/eval.scm:
    619:8 2 (_ #(#(#<directory (guile-user) b67140> null-pointer-error ("pointer->bytevector" "null pointer dereference" () ())) #<output: /tmp/last-installer-error 12>))
    159:9 1 (_ #(#(#<directory (guile-user) b67140> null-pointer-error ("pointer->bytevector" "null pointer dereference" () ())) #<output: /tmp/last-installer-error 12>))
In unknown file:
    0 (make-stack #t)
ice-9/eval.scm:159:9: In procedure pointer->bytevector: null pointer dereference
"

Here are the specifics of my computer:
ASUS MAXIMUS VI IMPACT ACPI BIOS Revision 1301
CPU: Intel Core 15-4440 CPU @ 3.19GHz

I'm not sure if it's a hardware compatibility problem, a bug in the guided graphical installation, or something else.

Thanks in advance, kind regards.

Attachment: file

Ludovic Courtès wrote on 27 Jun 2019 22:30

control message for bug #36402

Recipients:(address . control@debbugs.gnu.org)

Message-ID:87ftnuvkig.fsf@gnu.org

retitle 36402 Installer: null pointer exception during partitioning

quit

Ludovic Courtès wrote on 27 Jun 2019 22:30

Recipients:(address . control@debbugs.gnu.org)

Message-ID:87ef3evkia.fsf@gnu.org

severity 36402 important

quit

Ludovic Courtès wrote on 27 Jun 2019 22:30

control message for bug #35858

Recipients:(address . control@debbugs.gnu.org)

Message-ID:87d0iyvki3.fsf@gnu.org

merge 35858 36402

quit

Ludovic Courtès wrote on 27 Jun 2019 23:08

Re: bug#36402: installation error

Recipients:(name . Juan)(address . r5jm@protonmail.com)

Message-ID:87lfxmu47j.fsf@gnu.org

Hi Juan,

Juan <r5jm@protonmail.com> skribis:

Toggle quote (2 lines)

> I ran into some trouble while attempting to install Guix SD (1.0.1.x86_64). It happens when I try to do the guided graphical installation, I'll transcript the whole text here:

[...]

Toggle quote (14 lines)>     755:33 14 (run-partitioning-page)
> In ./gnu/installer/parted.scm:
>    1010:14 13 (auto-partition! #<<disk> bytestructure: #<bytestructure 0x106d840>> #:scheme _)
>     870:21 12 (loop _ _ _)
>     863:17 11 (loop _ 2617712816 1289318400)
>     771:25 10 (mkpart #<<disk> bytestructure: #<bytestructure 0x106d840>> _ #:previous-partition _)
> In parted/structs.scm:
>     552:19 9 (pointer->partition _)
>      132:3 8 (pointer->bytestructure #<pointer 0x0> #<bytestructure-descriptor 0x29f4740>)
> In unknown file:
>     7 (pointer->bytevector #<pointer 0x0> 88 #<undefined> #<undefined>)
> In ice-9/boot.scm:
>     751:25 6 (dispatch-exception 5 null-pointer-error ("pointer->bytevector" "null pointer dereference" () ()))

That looks like what was reported at

https://issues.guix.gnu.org/issue/35858, so I’ve merged both. Thanks

for the report, Juan!

Mathieu, in the same spirit as

https://issues.guix.gnu.org/issue/35783, I think we have an object

life cycle and memory management issue.

I hadn’t noticed but we’re doing manual memory management by calling

things like ‘disk-destroy’ in the installer. That’s crash-prone and

best avoided.

The usual way to handle it in bindings is by:

1. Adding pointer finalizers. So for example the pointer object

associated with a <disk> record would have a finalizer that calls

‘ped_disk_destroy’.

2. Having a weak-key hash table to track object dependencies when

needed. So, if a <device> aggregates a <disk>, there must be an

entry in the hash table that maps the <device> to the <disk>. That

way, we ensure that the <disk> object remains live as long as the

<device> is live.

We can expose “close” functions that free OS resources such as file

descriptors, but we should not expose deallocation functions like

‘ped_disk_destroy’; instead, we let the GC call them when the objects

become unreachable.

Does that make sense?

I think we should audit and adjust Guile-Parted in that spirit. WDYT?

Thanks,

Ludo’.

Mathieu Othacehe wrote on 29 Jun 2019 17:47

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:87woh4mm08.fsf@gmail.com

Hey Ludo,

Toggle quote (4 lines)

> Does that make sense?

> I think we should audit and adjust Guile-Parted in that spirit. WDYT?

Yes, it seems like the right thing to do. I'll try to apply those

changes to Guile-Parted next week. However, as we cannot reproduce those

null-pointer issues, we won't be sure if we fixed them for sure.

In the meantime, and completely unrelated, I created a new

wip-cross-system branch to fix most of the cross-compilation issues

preventing from cross-building a Guix System.

Thanks,

Mathieu

Ludovic Courtès wrote on 2 Jul 2019 15:25

Recipients:(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)

Message-ID:878stg8t68.fsf@gnu.org

Hi,

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (8 lines)

>> Does that make sense?

>> I think we should audit and adjust Guile-Parted in that spirit. WDYT?

> Yes, it seems like the right thing to do. I'll try to apply those

> changes to Guile-Parted next week. However, as we cannot reproduce those

> null-pointer issues, we won't be sure if we fixed them for sure.

Perhaps we can reproduce them by adding a bunch of calls to ‘gc’ in the

code here and there. That’s often a good way to stress-test memory

management.

Toggle quote (4 lines)

> In the meantime, and completely unrelated, I created a new

> wip-cross-system branch to fix most of the cross-compilation issues

> preventing from cross-building a Guix System.

Neat! Consider opening a new issue for this. :-)

Thank you,

Ludo’.

Mathieu Othacehe wrote on 31 Aug 2019 20:43

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:878sr989br.fsf@gmail.com

Hey Ludo,

Toggle quote (4 lines)

> Does that make sense?

> I think we should audit and adjust Guile-Parted in that spirit. WDYT?

Sorry for the delay!

I followed your advice and hid all destroy related functions behind

pointer finalizers. I also added some unit tests to Guile-Parted.

I pushed everything but feel free to comment! Then, I'll make a new

release and try to adapt the installer to those changes.

Thanks,

Mathieu

Ludovic Courtès wrote on 1 Sep 2019 21:55

Recipients:(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)

Message-ID:87a7bnaj0b.fsf@gnu.org

Howdy,

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (3 lines)

> I followed your advice and hid all destroy related functions behind

> pointer finalizers. I also added some unit tests to Guile-Parted.

Nice!

I tried “guix build guile-parted --with-branch=guile-parted=master”, but

that fails because ‘build-aux/test-driver.scm’ is missing from the repo,

I think.

It might be useful to add calls to ‘gc’ here and there in the tests to

stress-test memory management.

Toggle quote (3 lines)

> I pushed everything but feel free to comment! Then, I'll make a new

> release and try to adapt the installer to those changes.

Awesome, thanks a lot!

Ludo’.

Mathieu Othacehe wrote on 2 Sep 2019 11:50

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:87imqbgh71.fsf@gmail.com

Hey,

I pushed the missing file :).

Toggle quote (3 lines)

> It might be useful to add calls to ‘gc’ here and there in the tests to

> stress-test memory management.

Inserting gc calls here:

Toggle snippet (14 lines)(test-assert "partition-remove extended"
  (with-tmp-device
   "device-extended.iso"
   (lambda (new-device)
     (let* ((device (get-device new-device))
            (disk (disk-new device))
            (partitions (disk-partitions disk))
            (extended-partition (find extended-partition? partitions)))
       (gc) ; <-- Try to destroy disk?
       (disk-remove-partition* disk extended-partition)
       (gc)
       (equal? (extended-partition-count disk) 0)))))

causes a segfault. Is it legal to call GC here? Do you have any clue on

how to investigate what the GC is doing?

Thanks,

Mathieu

Ludovic Courtès wrote on 3 Sep 2019 11:13

Recipients:(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)

Message-ID:87woepyc7d.fsf@gnu.org

Hello,

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (21 lines)>> It might be useful to add calls to ‘gc’ here and there in the tests to
>> stress-test memory management.
>
> Inserting gc calls here:
>
> (test-assert "partition-remove extended"
>   (with-tmp-device
>    "device-extended.iso"
>    (lambda (new-device)
>      (let* ((device (get-device new-device))
>             (disk (disk-new device))
>             (partitions (disk-partitions disk))
>             (extended-partition (find extended-partition? partitions)))
>        (gc) ; <-- Try to destroy disk?
>        (disk-remove-partition* disk extended-partition)
>        (gc)
>        (equal? (extended-partition-count disk) 0)))))
>
> causes a segfault. Is it legal to call GC here? Do you have any clue on
> how to investigate what the GC is doing?

GC might run at any time, so yes, it’s valid to insert calls to ‘gc’

anywhere. So this is good, this is kind of issue we want to catch. :-)

To investigate, I would recommend re-reading how memory management works

in Parted. Questions such as:

1. Can Parted free a C object (disk, partition, etc.) behind your

back? Is there a way to prevent it?

2. When a Parted object aggregates another object, how’s memory

managed? For example, if a “disk” aggregates (refers to) a

“partition”, who’s responsible for freeing that partition?

3. Relatedly, if, say, a “disk” aggregates a “partition”, do you make

sure on the Scheme side that you do not free the partition while

the disk is still alive?

You can make sure this doesn’t happen by using a weak-key hash

table, as discussed before, where the key is the disk and the value

is the list of partitions it aggregates.

If you can get a backtrace from the core dump, that might give clues.

Setting the environment variable:

export GLIBC_TUNABLES=glibc.malloc.check=1

might tell you if it’s a double-free error or something.

You can also use Valgrind though libgc creates a lot of noise there.

Please share whatever you gather before you get depressed. ;-)

HTH!

Ludo’.

Mathieu Othacehe wrote on 4 Sep 2019 14:31

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:87imq8tf88.fsf@gmail.com

Hey Ludo,

Toggle quote (2 lines)

> Please share whatever you gather before you get depressed. ;-)

Thanks a lot for your help, it was really useful! I used valgrind a lot

to understand what was happening.

Ok so here's a little summary:

* Using ped_device_get Parted returns new devices. It may return already

existing devices if the given path is already known. Users are

responsible for their destruction calling ped_device_destroy.

* Using ped_disk_new Parted returns new disks from devices. Users are

responsible for their destruction calling ped_disk_destroy.

* A disk contains partitions. A user can remove partitions without

deleting them. It is also possible to delete them calling

ped_partition_destroy. On disk destruction, all the partitions

associated are destroyed.

Here's how memory is managed in Guile-Parted:

* ped_device_destroy is set a finalizer for device pointers.

* ped_disk_destroy is set a finalizer for disk pointers. Device object

associated to disks are recorded in a weak key hash table, to ensure

that the lifetime of a disk is shorted than that of its device.

* No finalizer is set for partition pointers. However, disk associated

to partitions are recorded in a weak key hash table, to ensure that the

lifetime of a partition is shorter that that of its disk. The user can

access ped_disk_remove_partition function from Parted but cannot acces

ped_partition_destroy function. Partition destruction is done by Parted

of disk destruction.

And here is what was going wrong:

ped_device_get and ped_device_get_next can return pointers to already

existing device object. So set-pointer-finalizer! was possibly called

multiple times on the same device pointer, resulting in calling

ped_device_destroy multiple times on the same device pointer.

To prevent that, I created a weak value hash table to make sure that one

<device> object maps to exactly one device pointer, and that the pointer

finalizer is set only once. See commit b35839b.

I also added (gc) calls at various locations in the tests in commit

728fd01.

WDYT?

Thanks,

Mathieu

Ludovic Courtès wrote on 5 Sep 2019 10:32

Recipients:(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)

Message-ID:87pnkfi1nx.fsf@gnu.org

Hello!

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (11 lines)> And here is what was going wrong:
>
> ped_device_get and ped_device_get_next can return pointers to already
> existing device object. So set-pointer-finalizer! was possibly called
> multiple times on the same device pointer, resulting in calling
> ped_device_destroy multiple times on the same device pointer.
>
> To prevent that, I created a weak value hash table to make sure that one
> <device> object maps to exactly one device pointer, and that the pointer
> finalizer is set only once. See commit b35839b.

Good catch!

I confirm that:

guix build guile-parted --with-branch=guile-parted=master --check

passed several times in a row. :-)

b35839b LGTM!

(‘define-wrapped-pointer-type’ takes care of this, but we can’t use it

while we use bytestructures (info "(guile) Void Pointers and Byte

Access").)

It seems to me that the fix should be not just for ‘pointer->device!’

but for all the ‘pointer->RECORD!’ procedures, where we potentially have

similar scenarios, and where we’d rather have:

(eq? (pointer->X ptr) (pointer->X ptr))

So perhaps you should define your own ‘define-wrapped-type’ macro that

does ‘define-record-type’ + the weak hash table thing, and replace all

‘define-record-type’ instances in structs.scm with

‘define-wrapped-type’. How does that sound?

Thank you!

Ludo’.

Mathieu Othacehe wrote on 5 Sep 2019 15:53

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:87mufiq27h.fsf@gmail.com

Hey,

Toggle quote (5 lines)

> So perhaps you should define your own ‘define-wrapped-type’ macro that

> does ‘define-record-type’ + the weak hash table thing, and replace all

> ‘define-record-type’ instances in structs.scm with

> ‘define-wrapped-type’. How does that sound?

Seems like the right thing to do :) However, I had a look to all Parted

functions which result is passed to a pointer->X! function, and except

ped_device_get, they always return newly allocated objects. So I guess

we are safe for now.

Thanks,

Mathieu

Ludovic Courtès wrote on 8 Sep 2019 21:35

Recipients:(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)

Message-ID:877e6i8ttu.fsf@gnu.org

Hello,

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (10 lines)>> So perhaps you should define your own ‘define-wrapped-type’ macro that
>> does ‘define-record-type’ + the weak hash table thing, and replace all
>> ‘define-record-type’ instances in structs.scm with
>> ‘define-wrapped-type’.  How does that sound?
>
> Seems like the right thing to do :) However, I had a look to all Parted
> functions which result is passed to a pointer->X! function, and except
> ped_device_get, they always return newly allocated objects. So I guess
> we are safe for now.

OK, sounds good!

Ludo’.

Mathieu Othacehe wrote on 18 Mar 2020 18:56

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:87sgi5czrq.fsf@gmail.com

Hello,

This has hopefully been resolved by Guile-Parted 0.0.2 update, so

closing!

Thanks,

Mathieu

Closed

Your comment

This issue is archived.

To comment on this conversation send an email to 36402@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it

mumi current 36402

Then, you may apply the latest patchset in this issue (with sign off)

mumi am -- -s

Or, compose a reply to this issue

mumi compose

Or, send patches to this issue

mumi send-email *.patch

is:open	open issues
is:done	closed issues
submitter:<who>	search issue submitter
author:<who>	search by message author
date:yesterday..now	search by issue date
mdate:3m..2d	search by message date