Setting inodes to 0 leads to incorrect output when extracting with GNU cpio

  • Open
Details
One participant
  • Skyler Ferris
Owner
unassigned
Submitted by
Skyler Ferris
Severity
normal
Skyler Ferris wrote on 24 Mar 17:17 +0100
(address . bug-guix@gnu.org)
83c29759-4e54-40ef-a9d3-b27c4774cd02@protonmail.com
Hello,

I have encountered a bug that is caused by the interaction of
write-cpio-archive from (gnu build linux-initrd) writing all inodes as 0
and the way that GNU cpio processes file headers. I observed this bug
while creating a custom initramfs where init is based on a bash script
used by another distribution (but I will provide a minimal reproducer
below). This bug only manifests when the input directory contains two or
more distinct sets of hard-linked files. This email will contain a
short set of reproduction steps, an explanation of what I understand the
cause of the bug to be, some possible paths forward, and a disclaimer
about my limitations due to my background.

To reproduce this bug, run the following commands:

```shell
$ mkdir /tmp/source
$ cd /tmp/source
$ echo contents1 > file1.txt
$ ln file1.txt link1.txt
$ echo contents2 > file2.txt
$ echo contents3 > file3.txt
$ ln file3.txt link3.txt
$ guix repl
> (use-modules (gnu build linux-initrd))
> ; disable compression so we don't waste time on it while debugging;
> ; it does not impact reproduction
> (write-cpio-archive "." "../archive.cpio" #:compress? #f)
> ,q
$ cd ..
$ mkdir out
$ cd out
$ cat ../archive.cpio | cpio -i
$ cat *
```

After running the final step you will see that all of file1.txt,
link1.txt, file3.txt, and link3.txt have the contents "contents1": the
files which should contain "contents3" have been created incorrectly.
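
One way to confirm that the files were actually hard linked together, rather than merely holding the same text, is to test whether two paths refer to the same inode. The sketch below assumes a shell whose `test` builtin supports the `-ef` operator (bash and dash do); `same_file` and `check_extraction` are hypothetical helper names, not part of Guix or cpio.

```shell
# Succeed if both paths refer to the same file (same device and inode).
same_file() {
    [ "$1" -ef "$2" ]
}

# Hypothetical check, meant to be run from inside the extraction directory.
check_extraction() {
    if same_file file1.txt file3.txt; then
        echo "BUG: file3.txt is a hard link to file1.txt"
    else
        echo "ok: file1.txt and file3.txt are distinct files"
    fi
}
```

Running `check_extraction` inside the `out` directory from the steps above would be expected to print the BUG line if the behavior described here occurs.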

Now I will list the set of steps the relevant programs performed which
caused this error, followed by a more verbose explanation with
references to source code:

1. Guix creates the archive with the inode and major & minor device
numbers set to 0. Number of hard links is reported accurately.
2. GNU cpio reads the archive and hard links files when the header indicates
that there are multiple links. It uses the inode and major & minor
device numbers to find the correct file to hard link to.
3. As file3.txt and link3.txt both have multiple links and share their
inode and major & minor device numbers with file1.txt, they are all
linked to file1.txt.
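
Step 1 can be checked directly against the archive. The following is a minimal sketch that walks an uncompressed archive and prints the inode, link count, and device numbers stored in each header. It assumes the `newc` layout (a 110-byte ASCII header starting with the magic `070701`, followed by thirteen 8-digit hexadecimal fields, with the name and the data each padded to a 4-byte boundary) and that `head -c` and `tail -c` are available; `newc_field` and `list_newc_entries` are hypothetical helper names.

```shell
# Extract a character range (e.g. 7-14) from a header string.
newc_field() {
    printf '%s' "$1" | cut -c"$2"
}

# Print "NAME ino=N nlink=N dev=MAJ:MIN" for each entry of an
# uncompressed newc archive, stopping at the trailer entry.
list_newc_entries() {
    archive=$1
    offset=0
    while :; do
        header=$(tail -c +"$((offset + 1))" "$archive" | head -c 110)
        # Stop on a short read or on anything that is not a newc header.
        [ "${#header}" -eq 110 ] || break
        case $header in
            070701*) ;;
            *) break ;;
        esac
        ino=$((0x$(newc_field "$header" 7-14)))
        nlink=$((0x$(newc_field "$header" 39-46)))
        filesize=$((0x$(newc_field "$header" 55-62)))
        devmajor=$((0x$(newc_field "$header" 63-70)))
        devminor=$((0x$(newc_field "$header" 71-78)))
        namesize=$((0x$(newc_field "$header" 95-102)))
        # namesize counts the terminating NUL, which is dropped here.
        name=$(tail -c +"$((offset + 111))" "$archive" | head -c "$((namesize - 1))")
        [ "$name" = "TRAILER!!!" ] && break
        printf '%s ino=%d nlink=%d dev=%d:%d\n' \
            "$name" "$ino" "$nlink" "$devmajor" "$devminor"
        # The name and the data are each padded to a 4-byte boundary.
        offset=$(( (offset + 110 + namesize + 3) / 4 * 4 ))
        offset=$(( (offset + filesize + 3) / 4 * 4 ))
    done
}
```

For the archive produced by the reproduction steps, `list_newc_entries /tmp/archive.cpio` should show `ino=0` and `dev=0:0` on every line if the analysis here is right, with `nlink=2` on the hard-linked files.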

This error occurs when the cpio utility processes files with hard links.
In `copyin_regular_file`, there is a code block which only runs if the
file has multiple hard links and the newascii (or checksummed new ascii)
format is in use (1). Within that code block there is a conditional to
check if the file size is 0, with a comment explaining that the newascii
format only records the data for the final file pointing to the relevant
inode rather than repeating the data each time. The code in
guix/cpio.scm does not do this (it stores the data for every link), so
this code block never executes. Instead, the other code block runs,
which simply calls
`link_to_maj_min_ino` (and checks for an error code) (2). This uses
`find_inode_file` which references a hash table that associates the
inode/major device/minor device with a file path, and if it finds a
match then it creates a hard link on the target file system. However,
Guix's `file->cpio-header*` sets all of the inode and device numbers to
0 for reproducibility. This causes cpio to hard link every file with
multiple links to the first file that has multiple links.
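
The lookup behavior described above can be modeled in a few lines. This is a simplified sketch of the decision logic as this report describes it, not GNU cpio's actual code: it reads one entry per line in the form `INO DEVMAJOR DEVMINOR NLINK NAME` (a made-up input format for illustration) and remembers the first multiply-linked name seen for each inode/device key. `simulate_link_decisions` is a hypothetical name.

```shell
# Read "INO DEVMAJOR DEVMINOR NLINK NAME" lines on stdin and print, for each
# entry, whether this model would create a new file or a hard link.
simulate_link_decisions() {
    awk '
    {
        key = $1 ":" $2 ":" $3
        if ($4 > 1 && (key in first)) {
            # A multiply-linked entry with a known key is linked to the
            # first name recorded under that key.
            print $5 " -> hard link to " first[key]
        } else {
            # Only multiply-linked entries are recorded in the table.
            if ($4 > 1) first[key] = $5
            print $5 " -> new file"
        }
    }'
}
```

With every key zeroed, as in the archive Guix writes, the model sends `file3.txt`, `link1.txt`, and `link3.txt` to `file1.txt`. With a distinct inode per link group, `link3.txt` is linked to `file3.txt` instead, and with a link count of 1 on every entry each line comes out as a new file.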

I see 3 possible paths forward to address this issue:

1. Provide spoofed inode numbers, tracking hard link data. In (gnu build
linux-initrd), the `write-cpio-archive` procedure sorts the files by
name so we can provide inode numbers that increase sequentially.
However, in order to make sure that the correct hard links are findable
by the cpio utility we would need to track the real inode numbers as
well and use the correct pseudonym in each place. This would noticeably
increase the complexity of the code.
2. Provide spoofed inode numbers and spoofed hard link data. In order to
avoid tracking the real hard link numbers we can just report all files
as having only a single link, and still provide sequential inode numbers
as above. This will not increase the size of the cpio archives we
generate compared to current output because we are storing the data for
each link anyway. This will add some complexity to the cpio code, but
less than option 1.
3. Don't support inputs with multiple hard links and require callers to
work around this issue. This avoids any changes to the cpio code.
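
For option 3, one possible caller-side workaround is to break the hard links in a writable copy of the input directory before archiving, so that every file reports a single link. The sketch below is only an illustration under that assumption: `break_hard_links` is a hypothetical helper, it modifies the directory in place (so it must not be pointed at the store), it uses a made-up `.unlink-tmp` suffix for its temporary files, and it does not handle file names containing newlines.

```shell
# Replace every multiply-linked regular file under "$1" with a private copy,
# so that each remaining file has a link count of 1.
break_hard_links() {
    find "$1" -type f -links +1 | while IFS= read -r f; do
        # Copy to a temporary name (new inode), then rename over the original.
        cp -p "$f" "$f.unlink-tmp" && mv -f "$f.unlink-tmp" "$f"
    done
}
```

This trades disk space in the working copy for an archive in which no entry claims more than one link, which matches what the archive already stores since the data is written for each link.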

I am in favor of option 2 because I think it strikes a good balance
between keeping the cpio code stable and supporting reasonable use
cases. The cpio code is used to build the initramfs in Guix systems so a
bug here could make some systems unbootable. Guix does provide
transactional rollbacks which is helpful but it is still a frustrating
experience to reboot and immediately see a crash; debugging issues in
this early environment is significantly more difficult than debugging
post-boot issues. Hard links are not common on many systems because they
add complexity to filesystem analysis, but Guix makes good use of them
to save space in the store, where it is common for many files to share
data and creating symlinks would prevent the garbage collector from
deleting otherwise unused outputs.

The limitations I referred to in the beginning of the email are that I
am inexperienced in this domain. I have only recently (over the past
month or so) started looking at building a custom initramfs, and I have
never worked with CPIO archives before. I think that my analysis makes
sense based on the code I have read and the behavior I have observed,
but take everything I say with a grain of salt.

I would appreciate any thoughts that anyone has on this matter.

Regards,
Skyler

(1)
(2)