From: Chao Shi <coshi036@gmail.com>
To: Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>,
linux-fsdevel@vger.kernel.org
Cc: Jan Kara <jack@suse.cz>,
linux-kernel@vger.kernel.org, Chao Shi <coshi036@gmail.com>,
Sungwoo Kim <iam@sung-woo.kim>, Dave Tian <daveti@purdue.edu>,
Weidong Zhu <weizhu@fiu.edu>
Subject: [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears
Date: Sat, 25 Apr 2026 22:01:37 -0400 [thread overview]
Message-ID: <20260426020137.1221985-1-coshi036@gmail.com> (raw)
A WARN_ON_ONCE(!buffer_uptodate(bh)) in mark_buffer_dirty() is reachable
from the buffered write path on a block device when the underlying
device returns I/O errors at high density. Reproduced by fuzzing an
NVMe controller (FEMU) that returns crafted error completions for a
sustained workload from /dev/nvme0n1.
The race is:
CPU A: block_commit_write (folio lock held) CPU B: end_buffer_async_read
set_buffer_uptodate(bh);
clear_buffer_uptodate(bh);
mark_buffer_dirty(bh); /* WARN fires */
The contract documented at set_buffer_uptodate() in
include/linux/buffer_head.h:140 already states:
"Any other serialization (with IO errors or whatever that might
clear the bit) has to come from other state (eg BH_Lock)."
block_commit_write() and the buffer_new() branch in
__block_write_begin_int() violate this contract: they hold the folio
lock but not BH_Lock when calling set_buffer_uptodate() immediately
followed by mark_buffer_dirty(). Take BH_Lock around the pair so the
documented serialization holds.
The race is the same family as 558d6450c775 ("ext4: fix
WARN_ON_ONCE(!buffer_uptodate) after an error writing the
superblock"), which addressed the ext4 superblock-specific case via
state recovery. No equivalent recovery hook exists in the generic
block_commit_write() path, so apply BH_Lock instead.
WARN stack:
RIP: mark_buffer_dirty+0x4c2/0x560 fs/buffer.c:1183
Call Trace:
block_commit_write fs/buffer.c
block_write_end fs/buffer.c
iomap_write_end fs/iomap/buffered-io.c
iomap_file_buffered_write
blkdev_buffered_write block/fops.c
blkdev_write_iter
vfs_write
__x64_sys_write
Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
Notes for reviewers (RFC):
1. lock_buffer() in the buffered-write end path: in the steady state
the bh should be unlocked when block_commit_write() runs
(block_write_begin already waited for any RMW read), so contention
should be rare. fio numbers TBD; happy to defer until measured if
that is the bar.
2. Several other call sites have the same set_buffer_uptodate(bh)
immediately followed by mark_buffer_dirty(bh) pattern without
BH_Lock:
fs/nilfs2/mdt.c:60
fs/ocfs2/alloc.c:6840
fs/ocfs2/aops.c:655
fs/exfat/fatent.c:408
fs/exfat/misc.c:168
fs/ufs/ialloc.c:146
fs/ufs/inode.c:1076, 1088
fs/ufs/balloc.c:324
fs/jfs/super.c:770
fs/ext2/super.c:1594
fs/ntfs3/fsntfs.c:1096, 1491
fs/ntfs3/bitmap.c:755, 796, 1387
Most look like one-shot init / metadata paths where concurrent IO
completion on the same bh is unlikely, but I have not audited each.
Per-fs follow-ups by respective maintainers, or one tree-wide
series?
3. Reproducer is currently fuzzer-only (FEMU + syz-executor). A
minimal C reproducer using dm-flakey for read-error injection is
in progress.
fs/buffer.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/fs/buffer.c b/fs/buffer.c
index 4d7f84e77d2..bc4fad93392 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2041,9 +2041,16 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
if (buffer_new(bh)) {
clean_bdev_bh_alias(bh);
if (folio_test_uptodate(folio)) {
+ /*
+ * See block_commit_write() for why we
+ * must hold BH_Lock around set_uptodate
+ * + mark_dirty.
+ */
+ lock_buffer(bh);
clear_buffer_new(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
+ unlock_buffer(bh);
continue;
}
if (block_end > to || block_start < from)
@@ -2104,8 +2111,20 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
if (!buffer_uptodate(bh))
partial = true;
} else {
+ /*
+ * Per the contract documented at set_buffer_uptodate()
+ * (include/linux/buffer_head.h), callers must hold
+ * BH_Lock to serialize against concurrent clears of
+ * BH_Uptodate. Holding only the folio lock is not
+ * sufficient: a concurrent end_buffer_async_read() on
+ * a previously failed read can clear BH_Uptodate
+ * between set_buffer_uptodate() and mark_buffer_dirty(),
+ * tripping the WARN_ON_ONCE in mark_buffer_dirty().
+ */
+ lock_buffer(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
+ unlock_buffer(bh);
}
if (buffer_new(bh))
clear_buffer_new(bh);
--
2.43.0
next reply other threads:[~2026-04-26 2:01 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-26 2:01 Chao Shi [this message]
2026-04-26 2:42 ` [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears Matthew Wilcox
2026-04-27 11:24 ` Jan Kara
2026-04-27 13:46 ` Matthew Wilcox
2026-04-27 15:48 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260426020137.1221985-1-coshi036@gmail.com \
--to=coshi036@gmail.com \
--cc=brauner@kernel.org \
--cc=daveti@purdue.edu \
--cc=iam@sung-woo.kim \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=viro@zeniv.linux.org.uk \
--cc=weizhu@fiu.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox