Setting inodes to 0 leads to incorrect output when extracting with GNU cpio

Skyler Ferris wrote on 24 Mar 17:17 +0100
Recipients:(address . bug-guix@gnu.org)
Message-ID:83c29759-4e54-40ef-a9d3-b27c4774cd02@protonmail.com
Hello,

I have encountered a bug that is caused by the interaction of 
write-cpio-archive from (gnu build linux-initrd) writing all inodes as 0 
and the way that GNU cpio processes file headers. I observed this bug 
while creating a custom initramfs where init is based on a bash script 
used by another distribution (but I will provide a minimal reproducer 
below). This bug only manifests when the input directory contains more 
than one set of hard-linked files. This email will contain a 
short set of reproduction steps, an explanation of what I understand the 
cause of the bug to be, some possible paths forward, and a disclaimer 
about my limitations due to my background.

To reproduce this bug, run the following commands:

```shell
$ mkdir /tmp/source
$ cd /tmp/source
$ echo contents1 > file1.txt
$ ln file1.txt link1.txt
$ echo contents2 > file2.txt
$ echo contents3 > file3.txt
$ ln file3.txt link3.txt
$ guix repl
 > (use-modules (gnu build linux-initrd))
 > ; disable compression so we don't waste time on it while debugging; it does not impact reproduction
 > (write-cpio-archive "." "../archive.cpio" #:compress? #f)
 > ,q
$ cd ..
$ mkdir out
$ cd out
$ cat ../archive.cpio | cpio -i
$ cat *
```

After running the final step you will see that all of file1.txt, 
link1.txt, file3.txt, and link3.txt have the contents "contents1": the 
files which should contain "contents3" have been created incorrectly.

Now I will list the set of steps the relevant programs performed which 
caused this error, followed by a more verbose explanation with 
references to source code:

1. Guix creates the archive with the inode and major & minor device 
numbers set to 0. Number of hard links is reported accurately.
2. CPIO reads the archive and hard links files when the header indicates 
that there are multiple links. It uses the inode and major & minor 
device numbers to find the correct file to hard link to.
3. As file3.txt and link3.txt both have multiple links and share their 
inode and major & minor device numbers with file1.txt, they are all 
linked to file1.txt.
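
The header fields named in these steps can be read straight out of an archive. The following is a minimal sketch, assuming the archive uses the newc ("070701") layout, in which a 6-character magic is followed by thirteen 8-digit hexadecimal fields (inode first, link count fifth, device major and minor eighth and ninth). The `first_header_fields` helper and the hand-built sample header are hypothetical and exist only for illustration; the sample is not output from Guix.

```shell
#!/bin/sh
set -e

# Hypothetical helper: print selected fields of the first newc header in file $1.
first_header_fields() {
    # A newc header is 110 ASCII characters: "070701" plus 13 fields of 8 hex digits.
    header=$(dd if="$1" bs=110 count=1 2>/dev/null)
    printf 'ino=%s nlink=%s devmajor=%s devminor=%s\n' \
        "$(printf '%s' "$header" | cut -c7-14)" \
        "$(printf '%s' "$header" | cut -c39-46)" \
        "$(printf '%s' "$header" | cut -c63-70)" \
        "$(printf '%s' "$header" | cut -c71-78)"
}

# Hand-built sample header (an assumption for illustration, not real Guix output):
# ino=0, mode=0100644, uid=0, gid=0, nlink=2, mtime=0, filesize=10,
# devmajor=0, devminor=0, rdevmajor=0, rdevminor=0, namesize=10, check=0.
sample=$(mktemp)
printf '070701%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X' \
    0 33188 0 0 2 0 10 0 0 0 0 10 0 > "$sample"

first_header_fields "$sample"
# prints: ino=00000000 nlink=00000002 devmajor=00000000 devminor=00000000
```

Only the first header is inspected here; reading later entries would also require skipping each entry's padded name and data.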

This error occurs when the cpio utility processes files with hard links. 
In `copyin_regular_file`, there is a code block which only runs if the 
file has multiple hard links and the newascii (or checksummed new ascii) 
format is in use (1). Within that code block there is a conditional to 
check if the file size is 0, with a comment explaining that the newascii 
format only records the data for the final file pointing to the relevant 
inode rather than repeating the data each time. The code in 
guix/cpio.scm does not actually do this, so this code block never 
executes. Instead, the other code block runs which simply calls 
`link_to_maj_min_ino` (and checks for an error code) (2). This uses 
`find_inode_file` which references a hash table that associates the 
inode/major device/minor device with a file path, and if it finds a 
match then it creates a hard link on the target file system. However, 
Guix's `file->cpio-header*` sets all of the inode and device numbers to 
0 for reproducibility. This causes cpio to hard link every file with 
multiple links to the first file that has multiple links.
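
The lookup described above can be imitated with a toy model. This is only a sketch of my understanding, not GNU cpio's actual code: the `extract` function, the one-line record format (name, link count, inode, contents), and the use of a temporary directory in place of the hash table are all assumptions made for illustration, and device numbers are left out because they would all be 0 as well.

```shell
#!/bin/sh
set -e

# Toy model of the lookup described above; this is NOT GNU cpio's actual code.
# Records on stdin have the form: NAME NLINK INODE CONTENT
extract() {
    out=$1
    table=$(mktemp -d)   # stands in for the inode hash table: one file per inode key
    while read -r name nlink ino content; do
        if [ "$nlink" -gt 1 ] && [ -e "$table/$ino" ]; then
            # Key already seen: hard link to the recorded path and
            # discard the data carried by this record.
            ln "$out/$(cat "$table/$ino")" "$out/$name"
        else
            printf '%s\n' "$content" > "$out/$name"
            # Only files with several links are remembered for later lookups.
            if [ "$nlink" -gt 1 ]; then
                printf '%s\n' "$name" > "$table/$ino"
            fi
        fi
    done
    rm -rf "$table"
}

demo=$(mktemp -d)
mkdir "$demo/all-zero" "$demo/distinct"

# Same files as the reproducer, sorted by name, with every inode set to 0.
extract "$demo/all-zero" <<'EOF'
file1.txt 2 0 contents1
file2.txt 1 0 contents2
file3.txt 2 0 contents3
link1.txt 2 0 contents1
link3.txt 2 0 contents3
EOF

# Same files, but each set of hard links has its own inode number.
extract "$demo/distinct" <<'EOF'
file1.txt 2 1 contents1
file2.txt 1 2 contents2
file3.txt 2 3 contents3
link1.txt 2 1 contents1
link3.txt 2 3 contents3
EOF

cat "$demo/all-zero/file3.txt"   # prints: contents1 (wrong, linked to file1.txt)
cat "$demo/distinct/file3.txt"   # prints: contents3
```

In the all-zero case file2.txt is unaffected because its link count is 1, which matches the observation that only the files with multiple links end up with the wrong contents.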

I see 3 possible paths forward to address this issue:

1. Provide spoofed inode numbers, tracking hard link data. In (gnu build 
linux-initrd), the `write-cpio-archive` procedure sorts the files by 
name so we can provide inode numbers that increase sequentially. 
However, in order to make sure that the correct hard links are findable 
by the cpio utility we would need to track the real inode numbers as 
well and use the correct pseudonym in each place. This would noticeably 
increase the complexity of the code.
2. Provide spoofed inode numbers and spoofed hard link data. In order to 
avoid tracking the real hard link numbers we can just report all files 
as having only a single link, and still provide sequential inode numbers 
as above. This will not increase the size of the cpio archives we 
generate compared to current output because we are storing the data for 
each link anyway. This will add some complexity to the cpio code, but 
less than option 1.
3. Don't support inputs with multiple hard links and require callers to 
work around this issue. This avoids any changes to the cpio code.
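
To make the difference between options 1 and 2 concrete, here is a rough sketch of the two numbering schemes applied to the reproducer's files. It is written in shell purely for illustration and is not a proposed patch to guix/cpio.scm; starting the numbering at 1, the `ino=`/`nlink=` output format, and the parsing of `ls -li` output (which is only reliable for simple file names like these) are assumptions of mine.

```shell
#!/bin/sh
set -e

src=$(mktemp -d)
echo contents1 > "$src/file1.txt"; ln "$src/file1.txt" "$src/link1.txt"
echo contents2 > "$src/file2.txt"
echo contents3 > "$src/file3.txt"; ln "$src/file3.txt" "$src/link3.txt"

# Option 1: map each real inode to a pseudonym, assigned in sorted-name order,
# and reuse that pseudonym for every other name sharing the real inode.
# The real link count (third column of `ls -li`) is kept.
option1=$(LC_ALL=C ls -li "$src" | awk '
    $1 == "total" { next }
    {
        if (!($1 in pseudo)) pseudo[$1] = ++count
        printf "%s ino=%d nlink=%s\n", $NF, pseudo[$1], $3
    }')

# Option 2: ignore real inodes entirely; every name gets the next number
# and is reported as having a single link.
option2=$(LC_ALL=C ls -li "$src" | awk '
    $1 == "total" { next }
    {
        n = n + 1
        printf "%s ino=%d nlink=1\n", $NF, n
    }')

printf 'Option 1:\n%s\n\nOption 2:\n%s\n' "$option1" "$option2"
```

With option 1, link1.txt and link3.txt reuse the numbers given to file1.txt and file3.txt (1 and 3). With option 2, they receive 4 and 5, so nothing needs to be remembered between files.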

I am in favor of option 2 because I think it strikes a good balance 
between keeping the cpio code stable and supporting reasonable use 
cases. The cpio code is used to build the initramfs in Guix systems so a 
bug here could make some systems unbootable. Guix does provide 
transactional rollbacks which is helpful but it is still a frustrating 
experience to reboot and immediately see a crash; debugging issues in 
this early environment is significantly more difficult than debugging 
post-boot issues. Hard links are not common on many systems because they 
add complexity to filesystem analysis, but Guix makes good use of them 
to save space in the store, where it is common for many files to share 
data and creating symlinks would prevent the garbage collector from 
deleting otherwise unused outputs.

The limitations I referred to in the beginning of the email are that I 
am inexperienced in this domain. I have only recently (over the past 
month or so) started looking at building a custom initramfs, and I have 
never worked with CPIO archives before. I think that my analysis makes 
sense based on the code I have read and the behavior I have observed, 
but take everything I say with a grain of salt.

I would appreciate any thoughts that anyone has on this matter.

Regards,
Skyler

(1) https://git.savannah.gnu.org/cgit/cpio.git/tree/src/copyin.c?id=900bab656ff24db5e3099941fb909c79c07962ed#n400
(2) https://git.savannah.gnu.org/cgit/cpio.git/tree/src/copypass.c?id=900bab656ff24db5e3099941fb909c79c07962ed#n341