Guix test failure on tests/store.

  • Done
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Roel Janssen
Owner
unassigned
Submitted by
Roel Janssen
Severity
normal
R
R
Roel Janssen wrote on 21 Dec 2017 17:49
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 29676@debbugs.gnu.org)
87tvwkc9g1.fsf@gnu.org
Ludovic Courtès writes:

Toggle quote (52 lines)
> Roel Janssen <roel@gnu.org> skribis:
>
>> Ludovic Courtès writes:
>>
>>> Hi Roel,
>>>
>>> ludo@gnu.org (Ludovic Courtès) skribis:
>>>
>>>> Roel Janssen <roel@gnu.org> skribis:
>>>
>>> [...]
>>>
>>>>>>> actual-error:
>>>>>>> + (srfi-34
>>>>>>> + #<condition &nix-protocol-error [message: "path `dtmp/guix-tests/store/462z3fnl7bs44vp9s97jyg1z74nsfvly-tar' is not in the Nix store" status: 1] 1071090>)
>>>>>>> result: FAIL
>>>>>>
>>>>>> The “dtmp” bit (instead of “/tmp”) looks fishy and would suggest a
>>>>>> memory corruption, presumably in the daemon.
>>>>>
>>>>> The daemon used in the test, or the daemon used to do the package build?
>>>>
>>>> The daemon under test (within the build environment).
>>>>
>>>>>> I suppose the failure random, isn’t it?
>>>>>
>>>>> I ran it again, and I've got the same error:
>>>>>
>>>>> actual-error:
>>>>> + (srfi-34
>>>>> + #<condition &nix-protocol-error [message: "path `dtmp/guix-tests/store/462z3fnl7bs44vp9s97jyg1z74nsfvly-tar' is not in the Nix store" status: 1] 1754ed0>)
>>>>>
>>>>> So that's either a very funny coincidence, or it's a structural problem.
>>>>
>>>> It’s better if it’s not random. :-)
>>>>
>>>>> Do you have any suggestions for how I can debug this problem?
>>>>
>>>> Assuming the failure also happens when you run “make check” outside the
>>>> build container (in the failed build tree), can you add ‘valgrind’ in
>>>> ‘test-env’? Specifically, in ‘test-env’, look for the line that invokes
>>>> ‘./pre-inst-env guix-daemon’ and change it to ‘./pre-inst-env valgrind
>>>> guix-daemon’.
>>>
>>> Did you manage to gather more info?
>>
>> No. When I run "make check -k" many tests fail.
>
> In a fresh checkout? If many tests fail, then there’s probably
> something wrong with the environment, such as guix-daemon failing to
> start or something. Could you check the logs?

Indeed. I found out that guix-daemon couldn't start because I didn't
set --localstatedir properly. I "make check" again and that yields the
same issue with tests/store.scm.

However, upon changing test-env to include valgrind, more tests fail
again because of a problem accessing the daemon-socket.

So I manually started the guix-daemon with --disable-chroot and
prepended valgrind --leak-check=full to it. Then I ran:
$ guile -s tests/store.scm

Which yielding many more failures.
Then:
$ make check

Which also yielded more test failures.

I don't think the memory problem is in the log, but at least it shows
that we should have a look at the memory allocation of guix-daemon.

Kind regards,
Roel Janssen
Attachment: valgrind.log
R
R
Roel Janssen wrote on 21 Dec 2017 20:53
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 29676@debbugs.gnu.org)
87inczhn78.fsf@gnu.org
Ludovic Courtès writes:

Toggle quote (55 lines)
> Roel Janssen <roel@gnu.org> skribis:
>
>> Ludovic Courtès writes:
>>
>>> Hi Roel,
>>>
>>> ludo@gnu.org (Ludovic Courtès) skribis:
>>>
>>>> Roel Janssen <roel@gnu.org> skribis:
>>>
>>> [...]
>>>
>>>>>>> actual-error:
>>>>>>> + (srfi-34
>>>>>>> + #<condition &nix-protocol-error [message: "path `dtmp/guix-tests/store/462z3fnl7bs44vp9s97jyg1z74nsfvly-tar' is not in the Nix store" status: 1] 1071090>)
>>>>>>> result: FAIL
>>>>>>
>>>>>> The “dtmp” bit (instead of “/tmp”) looks fishy and would suggest a
>>>>>> memory corruption, presumably in the daemon.
>>>>>
>>>>> The daemon used in the test, or the daemon used to do the package build?
>>>>
>>>> The daemon under test (within the build environment).
>>>>
>>>>>> I suppose the failure random, isn’t it?
>>>>>
>>>>> I ran it again, and I've got the same error:
>>>>>
>>>>> actual-error:
>>>>> + (srfi-34
>>>>> + #<condition &nix-protocol-error [message: "path `dtmp/guix-tests/store/462z3fnl7bs44vp9s97jyg1z74nsfvly-tar' is not in the Nix store" status: 1] 1754ed0>)
>>>>>
>>>>> So that's either a very funny coincidence, or it's a structural problem.
>>>>
>>>> It’s better if it’s not random. :-)
>>>>
>>>>> Do you have any suggestions for how I can debug this problem?
>>>>
>>>> Assuming the failure also happens when you run “make check” outside the
>>>> build container (in the failed build tree), can you add ‘valgrind’ in
>>>> ‘test-env’? Specifically, in ‘test-env’, look for the line that invokes
>>>> ‘./pre-inst-env guix-daemon’ and change it to ‘./pre-inst-env valgrind
>>>> guix-daemon’.
>>>
>>> Did you manage to gather more info?
>>
>> No. When I run "make check -k" many tests fail.
>
> In a fresh checkout? If many tests fail, then there’s probably
> something wrong with the environment, such as guix-daemon failing to
> start or something. Could you check the logs?
>
> Thanks,
> Ludo’.

Indeed. I found out that guix-daemon couldn't start because I didn't
set --localstatedir properly. I "make check" again and that yields the
same issue with tests/store.scm.

However, upon changing test-env to include valgrind, more tests fail
again because of a problem accessing the daemon-socket.

So I manually started the guix-daemon with --disable-chroot and
prepended valgrind --leak-check=full to it. Then I ran:
$ guile -s tests/store.scm

Which yielding many more failures.
Then:
$ make check

Which also yielded more test failures. But at least I have a valgrind
log now, which I uploaded here:

I don't think the memory problem is in the log, but at least it shows
that we should have a look at the memory allocation of guix-daemon.
Sometimes it seems to leak memory.

Kind regards,
Roel Janssen
L
L
Ludovic Courtès wrote on 22 Dec 2017 10:24
(name . Roel Janssen)(address . roel@gnu.org)(address . 29676@debbugs.gnu.org)
87lghv9ku7.fsf@gnu.org
Hello,

Roel Janssen <roel@gnu.org> skribis:

Toggle quote (14 lines)
> ==24971== 4,104 bytes in 1 blocks are possibly lost in loss record 351 of 365
> ==24971== at 0x4C2AAD6: malloc (in /gnu/store/18w3ykyqkcq5zp1qx17qhamkxlczzl0n-valgrind-3.12.0/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==24971== by 0x4E719E3: sqlite3MemMalloc (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
> ==24971== by 0x4E4DB8B: sqlite3Malloc (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
> ==24971== by 0x4E51316: pcache1Alloc (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
> ==24971== by 0x4E6E71A: sqlite3BtreeCursor (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
> ==24971== by 0x4EA9053: sqlite3VdbeExec (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
> ==24971== by 0x4EB271E: sqlite3_step (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
> ==24971== by 0x4EB34D1: sqlite3_exec (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
> ==24971== by 0x426886: nix::LocalStore::openDB(bool) (local-store.cc:293)
> ==24971== by 0x42BEF4: nix::LocalStore::LocalStore(bool) (local-store.cc:169)
> ==24971== by 0x40A356: acceptConnection(int)::{lambda()#1}::operator()() const (nix-daemon.cc:755)
> ==24971== by 0x40E16B: std::_Function_handler<void (), acceptConnection(int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1871)

I suspect these “possibly lost” reports are false alarms.

Anyway, there’s no “invalid read” or “invalid write” report, which is
what we were looking for. :-/

Thanks,
Ludo’.
R
R
Roel Janssen wrote on 3 Jan 2018 18:59
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 29676@debbugs.gnu.org)
87fu7m6cxs.fsf@gnu.org
Ludovic Courtès writes:

Toggle quote (23 lines)
> Hello,
>
> Roel Janssen <roel@gnu.org> skribis:
>
>> ==24971== 4,104 bytes in 1 blocks are possibly lost in loss record 351 of 365
>> ==24971== at 0x4C2AAD6: malloc (in /gnu/store/18w3ykyqkcq5zp1qx17qhamkxlczzl0n-valgrind-3.12.0/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
>> ==24971== by 0x4E719E3: sqlite3MemMalloc (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4E4DB8B: sqlite3Malloc (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4E51316: pcache1Alloc (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4E6E71A: sqlite3BtreeCursor (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4EA9053: sqlite3VdbeExec (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4EB271E: sqlite3_step (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4EB34D1: sqlite3_exec (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x426886: nix::LocalStore::openDB(bool) (local-store.cc:293)
>> ==24971== by 0x42BEF4: nix::LocalStore::LocalStore(bool) (local-store.cc:169)
>> ==24971== by 0x40A356: acceptConnection(int)::{lambda()#1}::operator()() const (nix-daemon.cc:755)
>> ==24971== by 0x40E16B: std::_Function_handler<void (), acceptConnection(int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1871)
>
> I suspect these “possibly lost” reports are false alarms.
>
> Anyway, there’s no “invalid read” or “invalid write” report, which is
> what we were looking for. :-/

Indeed. There are some 'definitely lost' reports as well, but these are
very small.

I can't seem to reproduce this on my laptop, which is strange because
the machine I do get the test failure on has ECC memory, and my laptop
doesn't have that.

Kind regards,
Roel Janssen
R
R
Roel Janssen wrote on 3 Feb 2018 01:13
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 29676-close@debbugs.gnu.org)
87po5nhqv3.fsf@gnu.org
Ludovic Courtès writes:

Toggle quote (26 lines)
> Hello,
>
> Roel Janssen <roel@gnu.org> skribis:
>
>> ==24971== 4,104 bytes in 1 blocks are possibly lost in loss record 351 of 365
>> ==24971== at 0x4C2AAD6: malloc (in /gnu/store/18w3ykyqkcq5zp1qx17qhamkxlczzl0n-valgrind-3.12.0/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
>> ==24971== by 0x4E719E3: sqlite3MemMalloc (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4E4DB8B: sqlite3Malloc (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4E51316: pcache1Alloc (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4E6E71A: sqlite3BtreeCursor (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4EA9053: sqlite3VdbeExec (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4EB271E: sqlite3_step (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x4EB34D1: sqlite3_exec (in /gnu/store/6d4ihp7xbdh3a0ffbpm5n45q4v3w0l35-sqlite-3.19.3/lib/libsqlite3.so.0.8.6)
>> ==24971== by 0x426886: nix::LocalStore::openDB(bool) (local-store.cc:293)
>> ==24971== by 0x42BEF4: nix::LocalStore::LocalStore(bool) (local-store.cc:169)
>> ==24971== by 0x40A356: acceptConnection(int)::{lambda()#1}::operator()() const (nix-daemon.cc:755)
>> ==24971== by 0x40E16B: std::_Function_handler<void (), acceptConnection(int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1871)
>
> I suspect these “possibly lost” reports are false alarms.
>
> Anyway, there’s no “invalid read” or “invalid write” report, which is
> what we were looking for. :-/
>
> Thanks,
> Ludo’.

So it seems that this issue has resolved itself. Notably, this bug
disappeared after a software update on the storage system that provides
the /gnu filesystem mount (NFS).

So this makes me think that it wasn't a problem in the guix-daemon.

Kind regards,
Roel Janssen
?
Your comment

This issue is archived.

To comment on this conversation send an email to 29676@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it
mumi current 29676
Then, you may apply the latest patchset in this issue (with sign off)
mumi am -- -s
Or, compose a reply to this issue
mumi compose
Or, send patches to this issue
mumi send-email *.patch