Hi Marius,

On Fri, 25 Sep 2020 01:11:18 +0200
Marius Bakke <marius@gnu.org> wrote:

> Arguably running code for foreign architectures through QEMU binfmt is
> something of a hack.

The problem here is not qemu.  It's not even the size of the whole struct
dirent that's the problem, or endianness, or alignment, or any of that stuff.

The problem is that if you have a 64-bit hash value then you can't fit it into
a 32-bit field.

You have to make the field bigger.

It's the only non-insane way to fix this and still be able to use the host
filesystem from the guest.

This bug is making readdir randomly and silently succeed or fail.

Note: The bug I'm referring to is caused by having 32 bit offsets in user space
in the first place--nothing else.  That is the root of the problem, not qemu,
not ext4, not the kernel.  Having 32 bit offsets in user space when kernel space
uses 64 bit offsets is just weird (note: the kernel always uses 64 bit
offsets--also on 32 bit architectures!).

Note: This also affects 9p, which has been recently "fixed" in the Linux kernel.

>Why is this not an issue with i686 on x86_64 kernels?

I'm not sure.  I'll check.

>Or armhf on AArch64?

I cannot check that, but it totally should be a problem there.

It has nothing to do with qemu specifically, and IMO qemu is right in not
"fixing" it (because it's NOT a problem in qemu), and in expecting
32 bit guests not to use the getdents64 system call :P

But even using getdents (no 64) as a flag what size d_off the caller
expects is an ugly workaround.
Not because of the flag, but because of what you have to do if it's set.

The right fix is to switch to LFS (and thus also 64 bit d_off) on 32 bit guests.
It's 2020; normal systems have drives a lot bigger than 4 GiB.  Even Raspberrys
have bigger drives.

Also, just for the record, glibc always calls getdents64, never getdents.
Using getdents64 and then being surprised when a 64 bit result comes back is
seriously weird.  And they are still defending it!

If they didn't do that, at least the kernel would know something is up,
and qemu would still pass through getdents as getdents to the host kernel and
getdents64 as getdents64 to the host kernel.  Also, qemu could warn on calls
to getdents (and not on calls to getdents64).

There are no downsides to knowing what the user expects.

But even then, user space probably would have LFS disabled.  That means
as soon as user space mounts something from NFS or 9p or whatever on their
(let's say embedded) platform, they would have this problem again,
and it would be very difficult to find what happened.  Why keep this problem
alive?

>This problem only manifests when using QEMUs syscall emulation, right?

No.

It's a fundamental problem with the slot you want to store something in
being too small for it.

We just didn't see it yet, because usually the system *silently returns* halfway
through the readdir loop and does not tell us about any problems once it
encounters any d_off >= 2**32 along the way.

In the "json-c" bug, luckily json-c or cmake actually NEEDED a file that had
not been returned because of that.  That is not a given.

>Mandating that every package *must* be patched to support it seems user-hostile.

Ok, then how about a flag that users can set in order to let it compile
regardless?  But then they shouldn't use that flag in any (non-cross-compiled)
guix packages.
We can mention it in the news so people are not surprised.
(Thinking about it, the flag could just be "-D_FILE_OFFSET_BITS=32".  Because
when you are setting it explicitly, you probably know what you are doing)

It really depends on whether we officially support running 32 bit guests
on 64 bit hosts.  Qemu has nothing to do with it.

Since the majority of our build farm runs qemu transparent emulation, and
since a majority of our ARM packages are built using it, it seems that we
decided to support 32 bit guests on 64 bit hosts.  Well, then something like
this patch has to be done, otherwise we have Schrödinger's build system--and
no one likes that.

(I want to stress that the main problem is the case where readdir SUCCEEDS
even though there is a problem;
if readdir would always fail in the situation with a 32 bit guest on
64 bit host, that would be much less bad.  But it doesn't always fail!)

I'd be really uneasy if the sanity check this patch introduces were only
enabled when explicitly setting a flag--for something as bad as this.

Especially for an error that silently makes it into base derivations and then
distributes all the way to who-knows-what normal client derivations.

Keeping the user from shooting himself in the foot by default if he doesn't
say "yes, please shoot me in the foot" beforehand is not "user-hostile".

We could also do a deprecation warning instead:

struct dirent* readdir(DIR* d) __attribute__((deprecated("Unsafe if not using large file support")));

f1.c:5:2: warning: 'readdir' is deprecated: Unsafe if not using large file support [-Wdeprecated-declarations]

But then the warning would not be emitted in the, for example, json-c package,
but in cmake.  I would never have found the bug that way.  I guess now we
know, but still... only emitting a warning seems weak for something this bad.

So I'm against it.

>  I'm more in favor of dropping it on the
> build farm,

>or just keep fixing things on a per-package basis.

That means you have to *find* the things first.  The only reason glibc found
the problem this one time is because the d_off telldir pointer just happened
to be >= 2**32.  If it's not bigger one time, and is bigger another time in
the same package build run, then you get weird unreproducible results.

That is horrible.  A copy-recursively could leave half the files off and not
tell anyone.  PLEASE let's do something about it.  Also, let's delete all
our ARM substitutes we have so far.

> A less user-hostile solution could perhaps be to (setenv "CFLAGS"
> "-D_FILE_OFFSET_BITS=64") on 32-bit architectures in gnu-build-system.

Sure, we can definitely do that in addition.  It's a good idea.  There can
be no ABI compatibility problem inside Guix if *all* Guix packages use that
flag.

While we are at it, we can do it on ALL architectures.  Because that's what
we actually want.  Should there be a 128 bit CPU, we still want all CPUs
(no matter how many bit registers the CPU has) use the same file offset
bit size.

But will that be picked up everywhere?  cmake?  Rust C extensions?  Python
extensions?

> Not sure whether that could cause any adverse effects.

It will.  And if we want to find out where those effects would be, this
patch--or one with warnings (that one is only useful if there is a way to
automatically collect AND KEEP all warnings into one place)--is how to find
out.

>  But again, I don't like the idea of optimizing for QEMUs user-mode emulation.

It's not optimizing for qemu specifically.  Qemu is not at fault here.
(guest) glibc is at fault.  Because of the way we build it.

I've thought through a few possible fixes for this problem:

(1) Wait until the kernel adds a "I-have-space-for-x-bits" field to the
getdents64 system call and for qemu to actually use it.  But how would qemu
know how much space the guest glibc has in its field?
But fine, let's say they have a crystal ball.  Even then, what is ext4 going
to do with this request?  Just cut the cookie somewhere and throw away half?
That's not safe.  So this solution is right out.

(2) Have glibc use getdents on 32 bit and getdents64 on 64 bit.  See above--only
now qemu can tell know much space the guest glibc has, and so can the host kernel
(which is much better).
Cutting the cookie is not safe.  So this solution is still out, barely.

(3) Have qemu maintain a mapping table per file of which 64-bit d_off is
associated with which weird cookie qemu invents and remembers.

That doesn't help us in the case where you DO NOT USE qemu but still have
a 32 bit executable running on a 64 bit host (natively).

Therefore, this solution is out.

(4) Build user space with large file support.  There are no downsides to this,
and as a distribution we can indeed decide only to support large file guix
packages.  I argue that this is the right solution for all "Guix system" hosts,
because of the "SD card" argument above.  Also, directories with Guix on it
aren't exactly small either.  So you need it anyway.  If there are embedded
platforms with disks smaller than 4 GiB used in guix, we can given those
glibcs an original dirent.h (they will fail loudly--so we will find them easily).

I prefer this solution.

(5) Build the qemu that emulates 32 bit ARM for 32 bit i686 (this is not a typo).
In that case the kernel would enable compatibility mode (X86_32 personality) and
it would work.  I want to stress that this compatibility mode calculates
different hashes depending on the program that is requesting it, on the
same host, directory and entry.  This is not something one would want,
especially when building things--the kernel is basically lying to you.
See kernel option CONFIG_X86_32--which we enable.

It would work okay in practice, because user space programs usually don't store
d_off.  But (4) is just better.

P.S. by now, my local build found a binutils 2.35.1 bfd plugin, and
(libstdc++-v3/src/filesystem/dir.cc in) libstdc++ 7.5.0, that also don't use
large files and thus would also only work sporadically in random ways.

How many hundreds of packages does this affect that we don't know yet?

PPS. I get a lot of "grep: invalid option -o" messages when building
glibc-2.32--even without the patch.  That's probably not good either.

PPPS. new patch will come soon.