From: Tingmao Wang <m@maowtm.org>
To: Dominique Martinet <asmadeus@codewreck.org>,
Christian Schoenebeck <linux_oss@crudebyte.com>,
Eric Van Hensbergen <ericvh@kernel.org>,
Latchesar Ionkov <lucho@ionkov.net>
Cc: Tingmao Wang <m@maowtm.org>, v9fs@lists.linux.dev
Subject: [PATCH 0/1] fs/9p: Do not open remote file with APPEND mode when writeback cache is used
Date: Sun, 2 Nov 2025 20:24:38 +0000
Message-ID: <cover.1762115015.git.m@maowtm.org>
Hi,
Earlier I noticed that when using cache=mmap mode for my QEMU VM's rootfs,
fish sometimes reports a corrupted history file. It turns out that, for
some reason, when fish writes the file, a bunch of NUL bytes are added to
the end.
Some further investigation led me to conclude that the problem is with
how O_APPEND is handled in v9fs when a caching mode with writeback is used
(e.g. cache=loose or cache=mmap). Basically, the file is opened with
O_APPEND on the server side as well. Writes work fine like this in
uncached mode, but when the page cache is involved, Linux writes whole
pages (with the position for that write pointing to the start of the
page), causing previously written content to be written again (since when
the file is opened with O_APPEND by QEMU, all writes go to the end
regardless of offset).
Pasted at the end is a program to reproduce the problem. It will:
1. Open a file with O_APPEND
2. Write "Hello\n", sync, then write "Goodbye\n"
3. At this point the file is corrupted. It will use drop_caches to make
   the problem immediately observable in the guest when it tries to read
   the data back, even though this is not required - the file on the host
   contains duplicate content as soon as a problematic write is issued.
Here is what happens:
root@6-18-0-rc3-next-20251031-dev-dirty ~# linux/reproducer-write-then-read /tmp/9p/a
Try reading /tmp/9p/a from the host now.
Press Enter to continue...
We can inspect the content from the host at this point:
> hexdump -C /tmp/linux-test/a
00000000 48 65 6c 6c 6f 0a 48 65 6c 6c 6f 0a 47 6f 6f 64 |Hello.Hello.Good|
00000010 62 79 65 0a |bye.|
00000014
The program will also detect this.
Here is the same setup but with some debug logs (I added logging to dump
the content being sent to the host)
openat(AT_FDCWD, "/tmp/9p/a", O_WRONLY|O_CREAT|O_APPEND, 0644
[ 10.207738][ T197] 9pnet: -- v9fs_vfs_lookup (197): dir: ffff888103228000 dentry: (a) ffff88810042ebc0 flags: 0
...
) = 3
write(3, "Hello\n", 6
[ 10.211944][ T197] 9pnet: -- v9fs_file_write_iter (197): fid 2
[ 10.212057][ T197] 9pnet: -- v9fs_file_write_iter (197): (cached)
) = 6
sync(
[ 10.212550][ T61] 9pnet: -- v9fs_fid_find_inode (61): inode: ffff88810a128000
[ 10.212794][ T61] 9pnet: (00000061) >>> TWRITE fid 2 offset 0 count 6 (/6)
[ 10.212932][ T61] content to be written: 00000000: 48 65 6c 6c 6f 0a Hello.
[ 10.213224][ T61] 9pnet: (00000061) >>> size=29 type: 118 tag: 0
[ 10.213447][ T61] 9pnet: (00000061) <<< size=11 type: 119 tag: 0
[ 10.213565][ T61] 9pnet: (00000061) <<< RWRITE count 6
[ 10.213751][ T61] 9pnet: -- v9fs_write_inode_dotl (61): v9fs_write_inode_dotl: inode ffff88810a128000
) = 0
write(3, "Goodbye\n", 8
[ 10.270821][ T197] 9pnet: -- v9fs_file_write_iter (197): fid 2
[ 10.270920][ T197] 9pnet: -- v9fs_file_write_iter (197): (cached)
) = 8
fsync(3
[ 10.271346][ T197] 9pnet: -- v9fs_fid_find_inode (197): inode: ffff88810a128000
[ 10.271501][ T197] 9pnet: (00000197) >>> TWRITE fid 2 offset 0 count 14 (/14)
[ 10.271610][ T197] content to be written: 00000000: 48 65 6c 6c 6f 0a 47 6f 6f 64 62 79 65 0a Hello.Goodbye.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This causes "Hello\n" to be written again due to the file being opened
in append mode on the host.
[ 10.271772][ T197] 9pnet: (00000197) >>> size=37 type: 118 tag: 0
[ 10.271933][ T197] 9pnet: (00000197) <<< size=11 type: 119 tag: 0
[ 10.272030][ T197] 9pnet: (00000197) <<< RWRITE count 14
[ 10.272205][ T197] 9pnet: -- v9fs_file_fsync_dotl (197): filp ffff88810952b800 datasync 0
[ 10.272325][ T197] 9pnet: (00000197) >>> TFSYNC fid 2 datasync:0
[ 10.272494][ T197] 9pnet: (00000197) >>> size=15 type: 50 tag: 0
[ 10.272640][ T197] 9pnet: (00000197) <<< size=7 type: 51 tag: 0
[ 10.272733][ T197] 9pnet: (00000197) <<< RFSYNC fid 2
) = 0
close(3
[ 10.273010][ T197] 9pnet: -- v9fs_dir_release (197): inode: ffff88810a128000 filp: ffff88810952b800 fid: 2
...
My understanding is that when we get to v9fs' write_iter, iocb->ki_pos
will, for an O_APPEND file, always point at the end (cf.
generic_write_checks_count), so technically, except to mitigate guest vs
host write race conditions, we never needed to open the file as O_APPEND
on the server side in the first place, as we always send the correct
offset.
This can also lead to unexpected file lengthening if an fstat is issued
after a write, since we will (first flush the dirty pages, then) refresh
the i_size from the server. This is ultimately the cause of the NUL
bytes. This case can be tested via the reproducer by setting
DO_FSTAT_AFTER_WRITE=1.
I haven't tested what happens in cached mode if you have two fds open to
the same file, one with O_APPEND and one without - not sure yet whether
the folio writeback will use the "correct" fid... but I somewhat suspect
the behaviour even for the non-append fd will not be correct if the
O_APPEND one is used?
The patch that follows is an attempt at fixing this. Technically opening
the file with O_APPEND on the server side should be fine for uncached
mode, so I've preserved that, but I'm not 100% sure if there are any
problematic edge cases.
I did test, by having two cats pointing to the same file, one with ">" and
one with ">>", that the "two fd" situation is correctly handled when this
patch is applied.
Testing was done mainly on cache=none, cache=loose and cache=mmap modes,
but I also ran this reproducer and some previous ones over all caching bit
combinations from 0 to 15. I also confirmed that the problem with
fish_history no longer reproduces.
reproducer-write-then-read.c:
#include <unistd.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

int main(int argc, const char **argv)
{
	int fd1, fd2;
	char buf[1024];
	ssize_t size;
	struct stat statbuf;
	size_t i;
	int ret;
	bool corrupted = false;
	size_t expected_total_size = 0;
	bool do_fstat_after_write = false;
	const char *expected_final_content = "Hello\nGoodbye\n";

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <file>\n", argv[0]);
		return 1;
	}

	if (getenv("DO_FSTAT_AFTER_WRITE") != NULL) {
		do_fstat_after_write = true;
	}

	/* Start from a fresh file opened in append mode. */
	unlink(argv[1]);
	fd1 = open(argv[1], O_WRONLY | O_APPEND | O_CREAT, 0644);
	if (fd1 < 0) {
		perror("open fd1");
		return 1;
	}

	/* First write, flushed to the server by the sync() below. */
	size = sprintf(buf, "Hello\n");
	errno = 0;
	if (write(fd1, buf, size) != size) {
		perror("write");
		close(fd1);
		return 1;
	}
	expected_total_size += size;

	if (do_fstat_after_write) {
		ret = fstat(fd1, &statbuf);
		if (ret < 0) {
			perror("fstat");
			close(fd1);
			return 1;
		}
		if (statbuf.st_size != expected_total_size) {
			fprintf(stderr, "File size returned by fstat mismatch after writes (1)\n");
			corrupted = true;
		}
	}

	sync();

	/* Second write: its writeback resends the page from offset 0. */
	size = sprintf(buf, "Goodbye\n");
	errno = 0;
	if (write(fd1, buf, size) != size) {
		perror("write");
		close(fd1);
		return 1;
	}
	expected_total_size += size;

	if (do_fstat_after_write) {
		ret = fstat(fd1, &statbuf);
		if (ret < 0) {
			perror("fstat");
			close(fd1);
			return 1;
		}
		if (statbuf.st_size != expected_total_size) {
			fprintf(stderr, "File size returned by fstat mismatch after writes (2)\n");
			corrupted = true;
		}
		size = sprintf(buf, "Final goodbye\n");
		errno = 0;
		if (write(fd1, buf, size) != size) {
			perror("write");
			close(fd1);
			return 1;
		}
		expected_total_size += size;
		expected_final_content = "Hello\nGoodbye\nFinal goodbye\n";
	}

	fsync(fd1);
	close(fd1);
	sync();

	printf("Try reading %s from the host now.\nPress Enter to continue...\n", argv[1]);
	getchar();

	/* Drop the page cache so the read below goes back to the server. */
	fd1 = open("/proc/sys/vm/drop_caches", O_WRONLY);
	if (fd1 < 0) {
		perror("open drop_caches");
		return 1;
	}
	errno = 0;
	if (write(fd1, "1\n", 2) != 2) {
		/* Report before close() so errno is not clobbered. */
		perror("write");
		close(fd1);
		return 1;
	}
	close(fd1);

	fd2 = open(argv[1], O_RDONLY);
	if (fd2 < 0) {
		perror("open fd2");
		return 1;
	}
	if ((size = read(fd2, buf, sizeof(buf))) < 0) {
		perror("read");
		close(fd2);
		return 1;
	}
	if (size != expected_total_size) {
		fprintf(stderr, "File size mismatch after reopen\n");
		corrupted = true;
	}
	close(fd2);

	fprintf(stdout, "File content:\n");
	for (i = 0; i < size; i++) {
		if (buf[i] == '\0') {
			corrupted = true;
			printf("\\0");
		} else {
			printf("%c", buf[i]);
		}
	}
	if (memcmp(buf, expected_final_content, expected_total_size) != 0) {
		fprintf(stdout, "\nFile content mismatch\n");
		corrupted = true;
	}
	if (corrupted) {
		fprintf(stdout, "\nFile corrupted\n");
		return 1;
	}
	return 0;
}
Tingmao Wang (1):
fs/9p: Do not open remote file with APPEND mode when writeback cache
is used
fs/9p/vfs_file.c | 13 +++++++++++--
fs/9p/vfs_inode.c | 7 ++++++-
fs/9p/vfs_inode_dotl.c | 6 ++++++
3 files changed, 23 insertions(+), 3 deletions(-)
base-commit: 98bd8b16ae57e8f25c95d496fcde3dfdd8223d41
--
2.51.2