[PATCH v2] fs/buffer: serialize set_buffer_uptodate against concurrent clears

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2] fs/buffer: serialize set_buffer_uptodate against concurrent clears
@ 2026-04-30  5:36 Chao Shi
  2026-04-30  7:28 ` Jan Kara
  0 siblings, 1 reply; 2+ messages in thread
From: Chao Shi @ 2026-04-30  5:36 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, brauner, jack, willy, linux-kernel, Chao Shi, Sungwoo Kim,
	Dave Tian, Weidong Zhu

A WARN_ON_ONCE(!buffer_uptodate(bh)) in mark_buffer_dirty() is
reachable from the buffered write path on a block device when the
underlying device returns I/O errors at high density.  Reproduced
by fuzzing an NVMe controller (FEMU) that returns crafted error
completions for a sustained workload from /dev/nvme0n1.

The contract documented at set_buffer_uptodate() in
include/linux/buffer_head.h reads:

  Any other serialization (with IO errors or whatever that might
  clear the bit) has to come from other state (eg BH_Lock).

In fs/buffer.c, BH_Uptodate can be cleared from four I/O completion
callbacks: __end_buffer_read_notouch, end_buffer_write_sync,
end_buffer_async_read, end_buffer_async_write.

end_buffer_async_read() runs with BH_Lock held throughout, so its
clear is already serialized against any caller that also holds
BH_Lock around set_buffer_uptodate(); the call may in fact be
redundant, but addressing that is independent of this fix.

end_buffer_write_sync() likewise holds BH_Lock while it clears
BH_Uptodate on the write-error path.  Removing that clear would
change long-standing buffer-cache I/O-error semantics and is out
of scope here.

The race is therefore between block_commit_write() and
end_buffer_write_sync():

  CPU A: block_commit_write              CPU B: end_buffer_write_sync
         (folio lock held, BH_Lock NOT)         (BH_Lock held)
    set_buffer_uptodate(bh);
                                           clear_buffer_uptodate(bh);
                                           unlock_buffer(bh);
    mark_buffer_dirty(bh);  /* WARN */

CPU B observes the contract; CPU A does not.  With one side unlocked
the serialization is one-sided and ineffective: CPU A's set can be
immediately followed by CPU B's clear, tripping the WARN_ON_ONCE.
In the fuzzing reproducer, write-error completions are frequent
(visible as repeated "lost async page write" and per-LBA write
failures); buffer I/O completion callbacks on the write-error path
(e.g. end_buffer_write_sync, end_buffer_async_write) clear
BH_Uptodate while holding BH_Lock.

The bug is not benign: a not-uptodate buffer can be marked dirty
and subsequently written back; depending on whether the buffer was
fully or partially covered by the user write, this can leave on-disk
content that does not match the intended buffered write state.

Fix this by taking BH_Lock around set_buffer_uptodate() +
mark_buffer_dirty() in block_commit_write(), so both sides of the
contract use the documented serialization.

Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).

Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
Hi Matthew, Hi Jan,

Thanks for the review on v1.  v2 takes the feedback, quick notes below.

To Matthew:

You were right that the v1's timing diagram named the wrong racer.

The actual race is with end_buffer_write_sync() on the write-error
path, as Jan pointed out -- it also clears BH_Uptodate under BH_Lock,
but block_commit_write()'s else branch was reaching set_buffer_uptodate
without BH_Lock, leaving the serialization one-sided.  v2's commit
message and in-code comment now name end_buffer_write_sync() as the
racer.

To Jan:

Thanks for confirming the racer and for the historical context on the
dirty + !uptodate state question.  v2 keeps the fix scoped to taking
BH_Lock in block_commit_write(); the broader semantic question and any
change to end_buffer_write_sync()'s clear is left out of scope here.

The redundant lock_buffer/unlock_buffer that v1 added to the
buffer_new branch of __block_write_begin_int() is also dropped in v2 --
that bh has no in-flight async I/O so no race exists there.

v1 thread for context:
https://lore.kernel.org/all/20260426020137.1221985-1-coshi036@gmail.com/

Thanks,
Chao

Changes in v2:
  - Drop the lock_buffer/unlock_buffer added in v1 to the buffer_new
    branch of __block_write_begin_int(): that bh is freshly BH_New
    and has no in-flight async I/O on it, so no race exists at that
    site.
  - Rewrite the commit message and the in-code comment to identify
    end_buffer_write_sync() as the actual racer, not
    end_buffer_async_read() as v1 claimed; end_buffer_async_read()
    holds BH_Lock across its clear so a caller that also holds
    BH_Lock would already be serialized.
  - Reference the BH_Lock contract at set_buffer_uptodate() in
    include/linux/buffer_head.h explicitly.
  - Drop verbose line-number citations and the WARN stack dump from
    the commit message; tighten wording around reproducer evidence
    and on-disk impact.

v1: https://lore.kernel.org/all/20260426020137.1221985-1-coshi036@gmail.com/

 fs/buffer.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4d7f84e77d2..6fddb2f1e7c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2104,8 +2104,22 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 			if (!buffer_uptodate(bh))
 				partial = true;
 		} else {
+			/*
+			 * Per the contract documented at set_buffer_uptodate()
+			 * in include/linux/buffer_head.h, callers must hold
+			 * BH_Lock to serialize against concurrent clears of
+			 * BH_Uptodate.  Holding only the folio lock is not
+			 * sufficient: a concurrent end_buffer_write_sync() on
+			 * the write-error path clears BH_Uptodate while
+			 * holding BH_Lock; without BH_Lock here the clear can
+			 * land between set_buffer_uptodate() and
+			 * mark_buffer_dirty(), tripping the WARN_ON_ONCE in
+			 * mark_buffer_dirty().
+			 */
+			lock_buffer(bh);
 			set_buffer_uptodate(bh);
 			mark_buffer_dirty(bh);
+			unlock_buffer(bh);
 		}
 		if (buffer_new(bh))
 			clear_buffer_new(bh);

base-commit: ffe69af1f87fa77da975ad4b0b093d48c3cbe6c3
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: [PATCH v2] fs/buffer: serialize set_buffer_uptodate against concurrent clears
  2026-04-30  5:36 [PATCH v2] fs/buffer: serialize set_buffer_uptodate against concurrent clears Chao Shi
@ 2026-04-30  7:28 ` Jan Kara
  0 siblings, 0 replies; 2+ messages in thread
From: Jan Kara @ 2026-04-30  7:28 UTC (permalink / raw)
  To: Chao Shi
  Cc: linux-fsdevel, viro, brauner, jack, willy, linux-kernel,
	Sungwoo Kim, Dave Tian, Weidong Zhu

On Thu 30-04-26 01:36:45, Chao Shi wrote:
> A WARN_ON_ONCE(!buffer_uptodate(bh)) in mark_buffer_dirty() is
> reachable from the buffered write path on a block device when the
> underlying device returns I/O errors at high density.  Reproduced
> by fuzzing an NVMe controller (FEMU) that returns crafted error
> completions for a sustained workload from /dev/nvme0n1.
> 
> The contract documented at set_buffer_uptodate() in
> include/linux/buffer_head.h reads:
> 
>   Any other serialization (with IO errors or whatever that might
>   clear the bit) has to come from other state (eg BH_Lock).
> 
> In fs/buffer.c, BH_Uptodate can be cleared from four I/O completion
> callbacks: __end_buffer_read_notouch, end_buffer_write_sync,
> end_buffer_async_read, end_buffer_async_write.
> 
> end_buffer_async_read() runs with BH_Lock held throughout, so its
> clear is already serialized against any caller that also holds
> BH_Lock around set_buffer_uptodate(); the call may in fact be
> redundant, but addressing that is independent of this fix.
> 
> end_buffer_write_sync() likewise holds BH_Lock while it clears
> BH_Uptodate on the write-error path.  Removing that clear would
> change long-standing buffer-cache I/O-error semantics and is out
> of scope here.
> 
> The race is therefore between block_commit_write() and
> end_buffer_write_sync():
> 
>   CPU A: block_commit_write              CPU B: end_buffer_write_sync
>          (folio lock held, BH_Lock NOT)         (BH_Lock held)
>     set_buffer_uptodate(bh);
>                                            clear_buffer_uptodate(bh);
>                                            unlock_buffer(bh);
>     mark_buffer_dirty(bh);  /* WARN */
> 
> CPU B observes the contract; CPU A does not.  With one side unlocked
> the serialization is one-sided and ineffective: CPU A's set can be
> immediately followed by CPU B's clear, tripping the WARN_ON_ONCE.
> In the fuzzing reproducer, write-error completions are frequent
> (visible as repeated "lost async page write" and per-LBA write
> failures); buffer I/O completion callbacks on the write-error path
> (e.g. end_buffer_write_sync, end_buffer_async_write) clear
> BH_Uptodate while holding BH_Lock.
> 
> The bug is not benign: a not-uptodate buffer can be marked dirty
> and subsequently written back; depending on whether the buffer was
> fully or partially covered by the user write, this can leave on-disk
> content that does not match the intended buffered write state.
> 
> Fix this by taking BH_Lock around set_buffer_uptodate() +
> mark_buffer_dirty() in block_commit_write(), so both sides of the
> contract use the documented serialization.
> 
> Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).
> 
> Acked-by: Sungwoo Kim <iam@sung-woo.kim>
> Acked-by: Dave Tian <daveti@purdue.edu>
> Acked-by: Weidong Zhu <weizhu@fiu.edu>
> Signed-off-by: Chao Shi <coshi036@gmail.com>

Thanks for trying but this basically just silences the warning without
addressing the real problem. Sure enough it silences the warning in
mark_buffer_dirty() but effectively it just papers over the real problem -
the IO completion with error can come the moment you unlock the bh and it
will happily clear the uptodate bit, resulting in the same "dirty but not
uptodate" invalid buffer state. So I actually prefer keeping things as they
currently are so that we are reminded there is this unresolved issue.

But it would be actually very welcome if you took the work, went through
all the places using end_buffer_write_sync() and converted them in
filesystem-by-filesystem to check for IO error by checking
buffer_write_io_error() instead of buffer_uptodate() and then we can drop
clear_buffer_uptodate() call from end_buffer_write_sync(). Yes, it is much
more work but also much more useful.

								Honza

> ---
> Hi Matthew, Hi Jan,
> 
> Thanks for the review on v1.  v2 takes the feedback, quick notes below.
> 
> To Matthew:
> 
> You were right that the v1's timing diagram named the wrong racer.
> 
> The actual race is with end_buffer_write_sync() on the write-error
> path, as Jan pointed out -- it also clears BH_Uptodate under BH_Lock,
> but block_commit_write()'s else branch was reaching set_buffer_uptodate
> without BH_Lock, leaving the serialization one-sided.  v2's commit
> message and in-code comment now name end_buffer_write_sync() as the
> racer.
> 
> To Jan:
> 
> Thanks for confirming the racer and for the historical context on the
> dirty + !uptodate state question.  v2 keeps the fix scoped to taking
> BH_Lock in block_commit_write(); the broader semantic question and any
> change to end_buffer_write_sync()'s clear is left out of scope here.
> 
> The redundant lock_buffer/unlock_buffer that v1 added to the
> buffer_new branch of __block_write_begin_int() is also dropped in v2 --
> that bh has no in-flight async I/O so no race exists there.
> 
> v1 thread for context:
> https://lore.kernel.org/all/20260426020137.1221985-1-coshi036@gmail.com/
> 
> Thanks,
> Chao
> 
> Changes in v2:
>   - Drop the lock_buffer/unlock_buffer added in v1 to the buffer_new
>     branch of __block_write_begin_int(): that bh is freshly BH_New
>     and has no in-flight async I/O on it, so no race exists at that
>     site.
>   - Rewrite the commit message and the in-code comment to identify
>     end_buffer_write_sync() as the actual racer, not
>     end_buffer_async_read() as v1 claimed; end_buffer_async_read()
>     holds BH_Lock across its clear so a caller that also holds
>     BH_Lock would already be serialized.
>   - Reference the BH_Lock contract at set_buffer_uptodate() in
>     include/linux/buffer_head.h explicitly.
>   - Drop verbose line-number citations and the WARN stack dump from
>     the commit message; tighten wording around reproducer evidence
>     and on-disk impact.
> 
> v1: https://lore.kernel.org/all/20260426020137.1221985-1-coshi036@gmail.com/
> 
>  fs/buffer.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 4d7f84e77d2..6fddb2f1e7c 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2104,8 +2104,22 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
>  			if (!buffer_uptodate(bh))
>  				partial = true;
>  		} else {
> +			/*
> +			 * Per the contract documented at set_buffer_uptodate()
> +			 * in include/linux/buffer_head.h, callers must hold
> +			 * BH_Lock to serialize against concurrent clears of
> +			 * BH_Uptodate.  Holding only the folio lock is not
> +			 * sufficient: a concurrent end_buffer_write_sync() on
> +			 * the write-error path clears BH_Uptodate while
> +			 * holding BH_Lock; without BH_Lock here the clear can
> +			 * land between set_buffer_uptodate() and
> +			 * mark_buffer_dirty(), tripping the WARN_ON_ONCE in
> +			 * mark_buffer_dirty().
> +			 */
> +			lock_buffer(bh);
>  			set_buffer_uptodate(bh);
>  			mark_buffer_dirty(bh);
> +			unlock_buffer(bh);
>  		}
>  		if (buffer_new(bh))
>  			clear_buffer_new(bh);
> 
> base-commit: ffe69af1f87fa77da975ad4b0b093d48c3cbe6c3
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2026-04-30  7:28 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-30  5:36 [PATCH v2] fs/buffer: serialize set_buffer_uptodate against concurrent clears Chao Shi
2026-04-30  7:28 ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox