[RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears
@ 2026-04-26  2:01 Chao Shi
  2026-04-26  2:42 ` Matthew Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: Chao Shi @ 2026-04-26  2:01 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, linux-fsdevel
  Cc: Jan Kara, linux-kernel, Chao Shi, Sungwoo Kim, Dave Tian,
	Weidong Zhu

A WARN_ON_ONCE(!buffer_uptodate(bh)) in mark_buffer_dirty() is reachable
from the buffered write path on a block device when the underlying
device returns I/O errors at high density.  Reproduced by fuzzing an
NVMe controller (FEMU) that returns crafted error completions for a
sustained workload from /dev/nvme0n1.

The race is:

  CPU A: block_commit_write (folio lock held)        CPU B: end_buffer_async_read
    set_buffer_uptodate(bh);
                                                       clear_buffer_uptodate(bh);
    mark_buffer_dirty(bh);  /* WARN fires */

The contract documented at set_buffer_uptodate() in
include/linux/buffer_head.h:140 already states:

  "Any other serialization (with IO errors or whatever that might
   clear the bit) has to come from other state (eg BH_Lock)."

block_commit_write() and the buffer_new() branch in
__block_write_begin_int() violate this contract: they hold the folio
lock but not BH_Lock when calling set_buffer_uptodate() immediately
followed by mark_buffer_dirty().  Take BH_Lock around the pair so the
documented serialization holds.

The race is the same family as 558d6450c775 ("ext4: fix
WARN_ON_ONCE(!buffer_uptodate) after an error writing the
superblock"), which addressed the ext4 superblock-specific case via
state recovery.  No equivalent recovery hook exists in the generic
block_commit_write() path, so apply BH_Lock instead.

WARN stack:

  RIP: mark_buffer_dirty+0x4c2/0x560 fs/buffer.c:1183
  Call Trace:
    block_commit_write          fs/buffer.c
    block_write_end             fs/buffer.c
    iomap_write_end             fs/iomap/buffered-io.c
    iomap_file_buffered_write
    blkdev_buffered_write       block/fops.c
    blkdev_write_iter
    vfs_write
    __x64_sys_write

Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).

Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---

Notes for reviewers (RFC):

1. lock_buffer() in the buffered-write end path: in the steady state
   the bh should be unlocked when block_commit_write() runs
   (block_write_begin already waited for any RMW read), so contention
   should be rare.  fio numbers TBD; happy to defer until measured if
   that is the bar.

2. Several other call sites have the same set_buffer_uptodate(bh)
   immediately followed by mark_buffer_dirty(bh) pattern without
   BH_Lock:

     fs/nilfs2/mdt.c:60
     fs/ocfs2/alloc.c:6840
     fs/ocfs2/aops.c:655
     fs/exfat/fatent.c:408
     fs/exfat/misc.c:168
     fs/ufs/ialloc.c:146
     fs/ufs/inode.c:1076, 1088
     fs/ufs/balloc.c:324
     fs/jfs/super.c:770
     fs/ext2/super.c:1594
     fs/ntfs3/fsntfs.c:1096, 1491
     fs/ntfs3/bitmap.c:755, 796, 1387

   Most look like one-shot init / metadata paths where concurrent IO
   completion on the same bh is unlikely, but I have not audited each.
   Per-fs follow-ups by respective maintainers, or one tree-wide
   series?

3. Reproducer is currently fuzzer-only (FEMU + syz-executor).  A
   minimal C reproducer using dm-flakey for read-error injection is
   in progress.

 fs/buffer.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4d7f84e77d2..bc4fad93392 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2041,9 +2041,16 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
 			if (buffer_new(bh)) {
 				clean_bdev_bh_alias(bh);
 				if (folio_test_uptodate(folio)) {
+					/*
+					 * See block_commit_write() for why we
+					 * must hold BH_Lock around set_uptodate
+					 * + mark_dirty.
+					 */
+					lock_buffer(bh);
 					clear_buffer_new(bh);
 					set_buffer_uptodate(bh);
 					mark_buffer_dirty(bh);
+					unlock_buffer(bh);
 					continue;
 				}
 				if (block_end > to || block_start < from)
@@ -2104,8 +2111,20 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 			if (!buffer_uptodate(bh))
 				partial = true;
 		} else {
+			/*
+			 * Per the contract documented at set_buffer_uptodate()
+			 * (include/linux/buffer_head.h), callers must hold
+			 * BH_Lock to serialize against concurrent clears of
+			 * BH_Uptodate.  Holding only the folio lock is not
+			 * sufficient: a concurrent end_buffer_async_read() on
+			 * a previously failed read can clear BH_Uptodate
+			 * between set_buffer_uptodate() and mark_buffer_dirty(),
+			 * tripping the WARN_ON_ONCE in mark_buffer_dirty().
+			 */
+			lock_buffer(bh);
 			set_buffer_uptodate(bh);
 			mark_buffer_dirty(bh);
+			unlock_buffer(bh);
 		}
 		if (buffer_new(bh))
 			clear_buffer_new(bh);
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears
  2026-04-26  2:01 [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears Chao Shi
@ 2026-04-26  2:42 ` Matthew Wilcox
  2026-04-27 11:24   ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2026-04-26  2:42 UTC (permalink / raw)
  To: Chao Shi
  Cc: Alexander Viro, Christian Brauner, linux-fsdevel, Jan Kara,
	linux-kernel, Sungwoo Kim, Dave Tian, Weidong Zhu

On Sat, Apr 25, 2026 at 10:01:37PM -0400, Chao Shi wrote:
> A WARN_ON_ONCE(!buffer_uptodate(bh)) in mark_buffer_dirty() is reachable
> from the buffered write path on a block device when the underlying
> device returns I/O errors at high density.  Reproduced by fuzzing an
> NVMe controller (FEMU) that returns crafted error completions for a
> sustained workload from /dev/nvme0n1.
> 
> The race is:
> 
>   CPU A: block_commit_write (folio lock held)        CPU B: end_buffer_async_read
>     set_buffer_uptodate(bh);
>                                                        clear_buffer_uptodate(bh);
>     mark_buffer_dirty(bh);  /* WARN fires */

Why are we calling clear_buffer_uptodate() in end_buffer_async_read()?
If the buffer is uptodate, we shouldn't be reading into it.  If it's
not uptodate, we don't need to clear the uptodate flag because it's
already clear.

I've been deleting calls to ClearPageUptodate and folio_clear_uptodate()
from filesystems; it's almost always the wrong thing to do.  But the
buffer cache does have slightly different rules from the page cache,
so this may not translate well.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears
  2026-04-26  2:42 ` Matthew Wilcox
@ 2026-04-27 11:24   ` Jan Kara
  2026-04-27 13:46     ` Matthew Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2026-04-27 11:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chao Shi, Alexander Viro, Christian Brauner, linux-fsdevel,
	Jan Kara, linux-kernel, Sungwoo Kim, Dave Tian, Weidong Zhu

On Sun 26-04-26 03:42:38, Matthew Wilcox wrote:
> On Sat, Apr 25, 2026 at 10:01:37PM -0400, Chao Shi wrote:
> > A WARN_ON_ONCE(!buffer_uptodate(bh)) in mark_buffer_dirty() is reachable
> > from the buffered write path on a block device when the underlying
> > device returns I/O errors at high density.  Reproduced by fuzzing an
> > NVMe controller (FEMU) that returns crafted error completions for a
> > sustained workload from /dev/nvme0n1.
> > 
> > The race is:
> > 
> >   CPU A: block_commit_write (folio lock held)        CPU B: end_buffer_async_read
> >     set_buffer_uptodate(bh);
> >                                                        clear_buffer_uptodate(bh);
> >     mark_buffer_dirty(bh);  /* WARN fires */
> 
> Why are we calling clear_buffer_uptodate() in end_buffer_async_read()?
> If the buffer is uptodate, we shouldn't be reading into it.  If it's
> not uptodate, we don't need to clear the uptodate flag because it's
> already clear.
> 
> I've been deleting calls to ClearPageUptodate and folio_clear_uptodate()
> from filesystems; it's almost always the wrong thing to do.  But the
> buffer cache does have slightly different rules from the page cache,
> so this may not translate well.

Right, I think the clear_buffer_uptodate() call in end_buffer_async_read()
is indeed superfluous and we can just remove it - possibly replace with
WARN_ON_ONCE(buffer_uptodate(bh)) - to deal with this race.

That being said block_commit_write() could possibly race with
end_buffer_write_sync() as well and there the clear_buffer_uptodate()
should stay (well, unless we want to do a serious rework of IO error bh
handling). So the changes to block_commit_write() might still be needed.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears
  2026-04-27 11:24   ` Jan Kara
@ 2026-04-27 13:46     ` Matthew Wilcox
  2026-04-27 15:48       ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2026-04-27 13:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Chao Shi, Alexander Viro, Christian Brauner, linux-fsdevel,
	linux-kernel, Sungwoo Kim, Dave Tian, Weidong Zhu

On Mon, Apr 27, 2026 at 01:24:45PM +0200, Jan Kara wrote:
> On Sun 26-04-26 03:42:38, Matthew Wilcox wrote:
> > Why are we calling clear_buffer_uptodate() in end_buffer_async_read()?
> > If the buffer is uptodate, we shouldn't be reading into it.  If it's
> > not uptodate, we don't need to clear the uptodate flag because it's
> > already clear.
> > 
> > I've been deleting calls to ClearPageUptodate and folio_clear_uptodate()
> > from filesystems; it's almost always the wrong thing to do.  But the
> > buffer cache does have slightly different rules from the page cache,
> > so this may not translate well.
> 
> Right, I think the clear_buffer_uptodate() call in end_buffer_async_read()
> is indeed superfluous and we can just remove it - possibly replace with
> WARN_ON_ONCE(buffer_uptodate(bh)) - to deal with this race.
> 
> That being said block_commit_write() could possibly race with
> end_buffer_write_sync() as well and there the clear_buffer_uptodate()
> should stay (well, unless we want to do a serious rework of IO error bh
> handling). So the changes to block_commit_write() might still be needed.

Well, I don't want to do a serious rework of io error bh handling, but
I do have questions!  Let's say we take this patch, so we do:

	lock_buffer(bh);
	set_buffer_uptodate(bh);
	mark_buffer_dirty(bh);
	unlock_buffer(bh);

That means we don't hit the warning, but we do hit an error in
end_buffer_write_sync() anyway:

	mark_buffer_write_io_error(bh);
	clear_buffer_uptodate(bh);
	unlock_buffer(bh);

We now have a buffer which is marked as write_io_error, dirty and
!uptodate.  That's a logically incoherent state -- dirty means "is
newer than what is on disc" and !uptodate means "on disc is newer
than this buffer".  I don't know if we have any assertions that
check for this, but we probably should?

So I think what we need here is (in addition to this patch):

+++ b/fs/buffer.c
@@ -169,7 +169,8 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
        } else {
                buffer_io_error(bh, ", lost sync page write");
                mark_buffer_write_io_error(bh);
-               clear_buffer_uptodate(bh);
+               if (!buffer_dirty(bh)
+                       clear_buffer_uptodate(bh);
        }
        unlock_buffer(bh);
        put_bh(bh);

This makes sense to me since we need to retry the write anyway.

And of course, now I'm looking at buffer.c, I found another whoopsie.
Remember 7375f22495e7 where we called put_bh() on a stack buffer
after unlocking it?  Don't we have the same problem in
end_buffer_write_sync()?  I can't find anyone calling it with a
BH that's on the stack, but it's a bit of a nasty footgun.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears
  2026-04-27 13:46     ` Matthew Wilcox
@ 2026-04-27 15:48       ` Jan Kara
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2026-04-27 15:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Chao Shi, Alexander Viro, Christian Brauner,
	linux-fsdevel, linux-kernel, Sungwoo Kim, Dave Tian, Weidong Zhu

On Mon 27-04-26 14:46:41, Matthew Wilcox wrote:
> On Mon, Apr 27, 2026 at 01:24:45PM +0200, Jan Kara wrote:
> > On Sun 26-04-26 03:42:38, Matthew Wilcox wrote:
> > > Why are we calling clear_buffer_uptodate() in end_buffer_async_read()?
> > > If the buffer is uptodate, we shouldn't be reading into it.  If it's
> > > not uptodate, we don't need to clear the uptodate flag because it's
> > > already clear.
> > > 
> > > I've been deleting calls to ClearPageUptodate and folio_clear_uptodate()
> > > from filesystems; it's almost always the wrong thing to do.  But the
> > > buffer cache does have slightly different rules from the page cache,
> > > so this may not translate well.
> > 
> > Right, I think the clear_buffer_uptodate() call in end_buffer_async_read()
> > is indeed superfluous and we can just remove it - possibly replace with
> > WARN_ON_ONCE(buffer_uptodate(bh)) - to deal with this race.
> > 
> > That being said block_commit_write() could possibly race with
> > end_buffer_write_sync() as well and there the clear_buffer_uptodate()
> > should stay (well, unless we want to do a serious rework of IO error bh
> > handling). So the changes to block_commit_write() might still be needed.
> 
> Well, I don't want to do a serious rework of io error bh handling, but
> I do have questions!  Let's say we take this patch, so we do:
> 
> 	lock_buffer(bh);
> 	set_buffer_uptodate(bh);
> 	mark_buffer_dirty(bh);
> 	unlock_buffer(bh);
> 
> That means we don't hit the warning, but we do hit an error in
> end_buffer_write_sync() anyway:
> 
> 	mark_buffer_write_io_error(bh);
> 	clear_buffer_uptodate(bh);
> 	unlock_buffer(bh);
> 
> We now have a buffer which is marked as write_io_error, dirty and
> !uptodate.  That's a logically incoherent state -- dirty means "is
> newer than what is on disc" and !uptodate means "on disc is newer
> than this buffer".  I don't know if we have any assertions that
> check for this, but we probably should?

So I tend to agree with you that clearing buffer_uptodate bit on write IO
error (as much as it goes back many many years) is wrong. When I tried to
fix that some 20 years ago, Linus was against it arguing that uptodate
means "it matches the on-disk content unless dirty is set" so we just
worked-around it in ext3 / ext4. Fixing this was the "serious rework" I was
alluding to but these days we probably have less code depending on uptodate
bit getting cleared on write IO error than back then. So maybe it will be
mostly smooth.

> So I think what we need here is (in addition to this patch):
> 
> +++ b/fs/buffer.c
> @@ -169,7 +169,8 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
>         } else {
>                 buffer_io_error(bh, ", lost sync page write");
>                 mark_buffer_write_io_error(bh);
> -               clear_buffer_uptodate(bh);
> +               if (!buffer_dirty(bh)
> +                       clear_buffer_uptodate(bh);

Well, if we touch this code, I'd just remove the clearing altogether.

>         }
>         unlock_buffer(bh);
>         put_bh(bh);
> 
> This makes sense to me since we need to retry the write anyway.
> 
> And of course, now I'm looking at buffer.c, I found another whoopsie.
> Remember 7375f22495e7 where we called put_bh() on a stack buffer
> after unlocking it?  Don't we have the same problem in
> end_buffer_write_sync()?  I can't find anyone calling it with a
> BH that's on the stack, but it's a bit of a nasty footgun.

Yes, I'm not aware of any such place either. Culprit of this is that struct
buffer_head is used both as a structure attached to a folio caching mapping
information (perfectly fine to submit for IO) and structure to pass around
mapping information which gets sometimes also abused to submit IO and then
there are issues. We are getting rid of this second use but it's slow.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-04-27 15:48 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-26  2:01 [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears Chao Shi
2026-04-26  2:42 ` Matthew Wilcox
2026-04-27 11:24   ` Jan Kara
2026-04-27 13:46     ` Matthew Wilcox
2026-04-27 15:48       ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox