Hi,

I’ve noticed that appending to a file on btrfs creates a hole
before the appended data when all of the following hold:
- The write appends, i.e. it uses O_APPEND or RWF_APPEND,
- The write is done in direct mode, i.e. with O_DIRECT, and
- The source buffer is not present in the page tables (what mmap()
  calls “unpopulated”).
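As a minimal sketch of these three conditions combined (this is not the attached reproducer; the function name and the use of an untouched anonymous mapping as the "unpopulated" buffer are assumptions made for illustration):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for O_DIRECT */
#endif
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append `len` bytes from a never-touched anonymous
 * mapping to `path` using O_APPEND | O_DIRECT. Returns the file size after
 * the write, or -1 if any step fails (e.g. the filesystem rejects O_DIRECT).
 * `len` should be a multiple of the page size to satisfy O_DIRECT alignment.
 */
off_t append_unpopulated_direct(const char *path, size_t len)
{
    off_t size = -1;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND | O_DIRECT,
                  0644);
    if (fd < 0)
        return -1;

    /* No MAP_POPULATE and no access to the memory before write(): the
     * pages should not yet be present in this process's page tables. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf != MAP_FAILED) {
        struct stat st;
        ssize_t written = write(fd, buf, len);

        if (written == (ssize_t)len && fstat(fd, &st) == 0)
            size = st.st_size;
        munmap(buf, len);
    }
    close(fd);
    return size;
}
```

Because the file is truncated on open, the returned size should equal `len`; a larger value on an affected kernel would point to a hole in front of the data.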
The hole generally seems to have the same size as the data being
written (i.e. a 64k write creates a 64k hole, followed by the
64k of data we actually wanted to write), but I assume this holds
only up to a certain size (depending on how the request is
split in the kernel).
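One way to check for such a leading hole from user space is lseek() with SEEK_DATA. The sketch below is only an illustration; the helper name is hypothetical, and filesystems without hole reporting will treat the whole file as data:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for SEEK_DATA */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the offset of the first data byte in `path`,
 * or -1 on error (this includes a file without any data, for which lseek()
 * fails with ENXIO). A result greater than 0 means the file starts with a
 * hole of that many bytes.
 */
off_t first_data_offset(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t off = lseek(fd, 0, SEEK_DATA);
    close(fd);
    return off;
}
```

For a file written by a single 64k append, the expected result is 0; with the behavior described above it would be 64k instead.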
I’ve appended a reproducer.
We encounter this problem with virtio-fs (sharing of directories
between a virtual machine host and guest), where writing is done
by virtiofsd, a process external to the VMM (e.g. qemu), but the
buffer comes from the VM guest. Memory is shared between
virtiofsd and the VMM, so virtiofsd often writes data from shared
memory that it has never accessed itself, which means the pages
are not present in virtiofsd’s page tables.
Stefano Garzarella (CC-ed) has tested some Fedora kernels and
confirmed that this bug does not appear in 6.0.7-301.fc37.x86_64
but does appear in 6.0.9-300.fc37.x86_64.
While I don’t know anything about btrfs code, I looked into it,
and I believe the changes made by commit
8184620ae21213d51eaf2e0bd4186baacb928172 (btrfs: fix lost file
sync on direct IO write with nowait and dsync iocb) may have
caused this. Specifically, it changed the `goto again` on EFAULT
to `goto relock`, a much earlier label, which causes
btrfs_direct_write() to call generic_write_checks() again after
the first btrfs_dio_write() attempt.
btrfs_dio_write() will have already allocated extents for the data
and updated the file length before trying to actually write the
data (which generates the EFAULT), so this second
generic_write_checks() call will move the I/O position to this
new file length, which is exactly the end of the area where the
data was supposed to be written.
To test this hypothesis, I’ve tried skipping the
generic_write_checks() call after reaching this specific goto, and
that does make the bug disappear.
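To make the suspected sequence concrete, here is a toy user-space model of it. This is not kernel code: `struct toy_file`, `toy_append_with_fault()` and the `recheck_on_retry` flag are made up for illustration and only mirror the order of events described above.

```c
#include <stdbool.h>

/* Toy stand-in for an inode: only the file size matters here. */
struct toy_file {
    long size;
};

/*
 * Model one appending direct write of `len` bytes whose first attempt
 * faults. Returns the offset at which the data finally lands.
 * `recheck_on_retry` models whether the retry path repeats the append
 * position check (as with `goto relock`) or skips it.
 */
long toy_append_with_fault(struct toy_file *f, long len, bool recheck_on_retry)
{
    /* Append check: position the write at the current end of file. */
    long pos = f->size;

    /* First attempt: space is allocated and the file size is bumped ... */
    f->size = pos + len;
    /* ... then reading the unpopulated buffer fails (EFAULT). */

    /* Retry after the buffer has been faulted in. */
    if (recheck_on_retry)
        pos = f->size; /* position moves to the already-bumped size */

    /* Second attempt succeeds and writes `len` bytes at `pos`. */
    if (pos + len > f->size)
        f->size = pos + len;
    return pos;
}
```

For a 64k write to an empty file, the model places the data at offset 64k (file size 128k) when the check is repeated, and at offset 0 (file size 64k) when it is skipped, which matches what I observe.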
Hanna