public inbox for stable@vger.kernel.org
* [PATCH] xfs: annotate lockless bli_flags access in buf item paths
@ 2026-03-27 13:14 Cen Zhang
  2026-03-30  1:33 ` Dave Chinner
  0 siblings, 1 reply; 3+ messages in thread
From: Cen Zhang @ 2026-03-27 13:14 UTC (permalink / raw)
  To: cem; +Cc: linux-xfs, linux-kernel, baijiaju1990, Cen Zhang, stable

xfs_buf_item_unpin() and xfs_buf_item_committed() read bip->bli_flags
without holding the buffer lock, while xfs_buf_item_release() clears
XFS_BLI_LOGGED, XFS_BLI_HOLD, and XFS_BLI_ORDERED under that lock.
This can happen when an older checkpoint still holds a CIL reference to
the BLI while a new transaction is finishing with the buffer.

The lockless readers only test XFS_BLI_STALE and
XFS_BLI_INODE_ALLOC_BUF, which are disjoint from the bits being
cleared, so correctness is not affected in practice.  However, the
plain C accesses still constitute a data race that may allow the
compiler to optimize them in unexpected ways (e.g., load tearing or
fused reloads), so they should be marked explicitly.

Annotate the two lockless reads with READ_ONCE(), make the clearing
store in xfs_buf_item_release() a WRITE_ONCE(), and use READ_ONCE() in
the buf-item tracepoint that may snapshot bli_flags without the lock.

Fixes: 8e1238508633 ("xfs: remove stale parameter from ->iop_unpin method")
Fixes: 71e330b59390 ("xfs: Introduce delayed logging core code")
Cc: stable@vger.kernel.org
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
---
 fs/xfs/xfs_buf_item.c | 18 +++++++++++-------
 fs/xfs/xfs_trace.h    |  2 +-
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 8487635579e5..c3d0dc17ee10 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -502,7 +502,8 @@ xfs_buf_item_unpin(
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
 	struct xfs_buf		*bp = bip->bli_buf;
-	int			stale = bip->bli_flags & XFS_BLI_STALE;
+	unsigned int		flags = READ_ONCE(bip->bli_flags);
+	int			stale = flags & XFS_BLI_STALE;
 	int			freed;
 
 	ASSERT(bp->b_log_item == bip);
@@ -679,13 +680,14 @@ xfs_buf_item_release(
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
 	struct xfs_buf		*bp = bip->bli_buf;
-	bool			hold = bip->bli_flags & XFS_BLI_HOLD;
-	bool			stale = bip->bli_flags & XFS_BLI_STALE;
+	unsigned int		flags = bip->bli_flags;
+	bool			hold = flags & XFS_BLI_HOLD;
+	bool			stale = flags & XFS_BLI_STALE;
 	bool			aborted = test_bit(XFS_LI_ABORTED,
 						   &lip->li_flags);
-	bool			dirty = bip->bli_flags & XFS_BLI_DIRTY;
+	bool			dirty = flags & XFS_BLI_DIRTY;
 #if defined(DEBUG) || defined(XFS_WARN)
-	bool			ordered = bip->bli_flags & XFS_BLI_ORDERED;
+	bool			ordered = flags & XFS_BLI_ORDERED;
 #endif
 
 	trace_xfs_buf_item_release(bip);
@@ -705,7 +707,8 @@ xfs_buf_item_release(
 	 * per-transaction state from the bli, which has been copied above.
 	 */
 	bp->b_transp = NULL;
-	bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED);
+	WRITE_ONCE(bip->bli_flags,
+		   flags & ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED));
 
 	/* If there are other references, then we have nothing to do. */
 	if (!atomic_dec_and_test(&bip->bli_refcount))
@@ -792,10 +795,11 @@ xfs_buf_item_committed(
 	xfs_lsn_t		lsn)
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
+	unsigned int		flags = READ_ONCE(bip->bli_flags);
 
 	trace_xfs_buf_item_committed(bip);
 
-	if ((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0)
+	if ((flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0)
 		return lip->li_lsn;
 	return lsn;
 }
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 813e5a9f57eb..2069534bb0c1 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -895,7 +895,7 @@ DECLARE_EVENT_CLASS(xfs_buf_item_class,
 	),
 	TP_fast_assign(
 		__entry->dev = bip->bli_buf->b_target->bt_dev;
-		__entry->bli_flags = bip->bli_flags;
+		__entry->bli_flags = READ_ONCE(bip->bli_flags);
 		__entry->bli_recur = bip->bli_recur;
 		__entry->bli_refcount = atomic_read(&bip->bli_refcount);
 		__entry->buf_bno = xfs_buf_daddr(bip->bli_buf);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] xfs: annotate lockless bli_flags access in buf item paths
  2026-03-27 13:14 [PATCH] xfs: annotate lockless bli_flags access in buf item paths Cen Zhang
@ 2026-03-30  1:33 ` Dave Chinner
  2026-03-30  2:51   ` Cen Zhang
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2026-03-30  1:33 UTC (permalink / raw)
  To: Cen Zhang; +Cc: cem, linux-xfs, linux-kernel, baijiaju1990, stable

On Fri, Mar 27, 2026 at 09:14:48PM +0800, Cen Zhang wrote:
> xfs_buf_item_unpin() and xfs_buf_item_committed() read bip->bli_flags
> without holding the buffer lock, while xfs_buf_item_release() clears
> XFS_BLI_LOGGED, XFS_BLI_HOLD, and XFS_BLI_ORDERED under that lock.
> This can happen when an older checkpoint still holds a CIL reference to
> the BLI while a new transaction is finishing with the buffer.

It's much more complex than that, and I think that there aren't any
meaningful data races in these cases.

> The lockless readers only test XFS_BLI_STALE and
> XFS_BLI_INODE_ALLOC_BUF, which are disjoint from the bits being
> cleared, so correctness is not affected in practice.  However, the
> plain C accesses still constitute a data race that may allow the
> compiler to optimize them in unexpected ways (e.g., load tearing or
> fused reloads), so they should be marked explicitly.

I'm not sure that sort of thing will ever occur given that bli_flags
is a 32-bit int with natural alignment.

> Annotate the two lockless reads with READ_ONCE(), make the clearing
> store in xfs_buf_item_release() a WRITE_ONCE(), and use READ_ONCE() in
> the buf-item tracepoint that may snapshot bli_flags without the lock.
> 
> Fixes: 8e1238508633 ("xfs: remove stale parameter from ->iop_unpin method")
> Fixes: 71e330b59390 ("xfs: Introduce delayed logging core code")
> Cc: stable@vger.kernel.org
> Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
> ---
>  fs/xfs/xfs_buf_item.c | 18 +++++++++++-------
>  fs/xfs/xfs_trace.h    |  2 +-
>  2 files changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 8487635579e5..c3d0dc17ee10 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -502,7 +502,8 @@ xfs_buf_item_unpin(
>  {
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
>  	struct xfs_buf		*bp = bip->bli_buf;
> -	int			stale = bip->bli_flags & XFS_BLI_STALE;
> +	unsigned int		flags = READ_ONCE(bip->bli_flags);
> +	int			stale = flags & XFS_BLI_STALE;
>  	int			freed;

When XFS_BLI_STALE is set in unpin, then by definition the buffer
must be locked and the BLI referenced. i.e. the buffer is locked
when XFS_BLI_STALE is set, and it remains locked until journal
completion runs xfs_buf_item_unpin() and drops the final BLI
reference. So if XFS_BLI_STALE is set, this isn't an unlocked read.

Regardless of that, if the unpin call doesn't drop the last
reference, then we return from xfs_buf_item_unpin() without having
ever checked the stale flag, i.e. the read is completely irrelevant
in all cases but the last reference to the BLI.

And if this is the last reference to the BLI, then it is not joined
into a current transaction, which means nothing is going to be
changing the BLI state.

Hence we only care about the bli_flags when unpin has the only
remaining reference to the bli, and in that case the read cannot be
racing with anything else changing its state. There is no
data access race here on the final reference, regardless of what
locks are held.

IOWs, using READ_ONCE to indicate a data access race is -incorrect-
in the cases where we actually use the result of the read.

Seriously: if you want to mark data accesses as being racy, you
better be adding comments to explain what the actual data race is
and why it matters enough that it needs annotation. Otherwise it is
just useless noise in the code...

>  	ASSERT(bp->b_log_item == bip);
> @@ -679,13 +680,14 @@ xfs_buf_item_release(
>  {
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
>  	struct xfs_buf		*bp = bip->bli_buf;
> -	bool			hold = bip->bli_flags & XFS_BLI_HOLD;
> -	bool			stale = bip->bli_flags & XFS_BLI_STALE;
> +	unsigned int		flags = bip->bli_flags;
> +	bool			hold = flags & XFS_BLI_HOLD;
> +	bool			stale = flags & XFS_BLI_STALE;
>  	bool			aborted = test_bit(XFS_LI_ABORTED,
>  						   &lip->li_flags);
> -	bool			dirty = bip->bli_flags & XFS_BLI_DIRTY;
> +	bool			dirty = flags & XFS_BLI_DIRTY;
>  #if defined(DEBUG) || defined(XFS_WARN)
> -	bool			ordered = bip->bli_flags & XFS_BLI_ORDERED;
> +	bool			ordered = flags & XFS_BLI_ORDERED;
>  #endif
>  
>  	trace_xfs_buf_item_release(bip);
> @@ -705,7 +707,8 @@ xfs_buf_item_release(
>  	 * per-transaction state from the bli, which has been copied above.
>  	 */
>  	bp->b_transp = NULL;
> -	bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED);
> +	WRITE_ONCE(bip->bli_flags,
> +		   flags & ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED));
>  
>  	/* If there are other references, then we have nothing to do. */

This is always called with the buffer locked, so it already
serialises correctly against all the other code that sets or checks
those flags under the buffer lock.

Indeed, why does this one specific write matter, and not all the
other writes to bip->bli_flags that set the flags that "lockless
reads" might trip over? Like, say, in xfs_trans_inode_alloc_buf() or
xfs_trans_binval()?

>  	if (!atomic_dec_and_test(&bip->bli_refcount))
> @@ -792,10 +795,11 @@ xfs_buf_item_committed(
>  	xfs_lsn_t		lsn)
>  {
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
> +	unsigned int		flags = READ_ONCE(bip->bli_flags);
>  
>  	trace_xfs_buf_item_committed(bip);
>  
> -	if ((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0)
> +	if ((flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0)
>  		return lip->li_lsn;
>  	return lsn;

There is no consequential race here, either. Once a BLI is marked
as an XFS_BLI_INODE_ALLOC_BUF under the buffer lock - the same lock
that is held when xfs_buf_item_release() clears flags - that flag
never gets cleared from bip->bli_flags. Hence we just don't care
what the compiler does for this read, because it will always return
the correct value no matter what. i.e. it is yet another benign
data race....

> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 813e5a9f57eb..2069534bb0c1 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -895,7 +895,7 @@ DECLARE_EVENT_CLASS(xfs_buf_item_class,
>  	),
>  	TP_fast_assign(
>  		__entry->dev = bip->bli_buf->b_target->bt_dev;
> -		__entry->bli_flags = bip->bli_flags;
> +		__entry->bli_flags = READ_ONCE(bip->bli_flags);

We just don't care here...

-Dave.
-- 
Dave Chinner
dgc@kernel.org


* Re: [PATCH] xfs: annotate lockless bli_flags access in buf item paths
  2026-03-30  1:33 ` Dave Chinner
@ 2026-03-30  2:51   ` Cen Zhang
  0 siblings, 0 replies; 3+ messages in thread
From: Cen Zhang @ 2026-03-30  2:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: cem, linux-xfs, linux-kernel, baijiaju1990, stable

Hi Dave,

> When XFS_BLI_STALE is set in unpin, then by definition the buffer
> must be locked and the BLI referenced. [...]
> Hence we only care about the bli_flags when unpin has the only
> remaining reference to the bli, and in that case the read cannot be
> racing with anything else changing its state.

Thank you for the detailed walkthrough.  You're right about several
things: READ_ONCE()/WRITE_ONCE() was the wrong tool here, the
WRITE_ONCE in xfs_buf_item_release() is useless since it's under
the buffer lock, and the tracepoint change doesn't matter.  I'm
dropping those.

However, I want to share the actual concurrent access evidence
before we decide on the remaining two reads:

  (1) xfs_buf_item_unpin:510 reads bli_flags
  (2) xfs_buf_item_committed:803 reads bli_flags

Both race with two different write paths:

  Write A: xfs_trans_dirty_buf:532  (bli_flags |= DIRTY|LOGGED)
           via new transaction on the same buffer
  Write B: xfs_buf_item_release:713 (bli_flags &= ~(LOGGED|HOLD|ORDERED))
           via iop_committing on xlog_cil_commit path

The concurrent paths are:
  CPU 0: old checkpoint completing on workqueue
         → xlog_cil_committed → iop_committed / iop_unpin
  CPU 1: new transaction with the same buffer committing
         → xlog_cil_commit → iop_committing → iop_release
         (or xfs_trans_dirty_buf during the transaction)

The comment at xlog_cil.c:1856 even acknowledges this:
  "the CIL checkpoint can race with us and we can run checkpoint
   completion before we've updated and unlocked the log items"

> IOWs, using READ_ONCE to indicate a data access race is -incorrect-
> in the cases where we actually use the result of the read.

Agreed.  For xfs_buf_item_unpin(), you're right that stale is
only consumed on the last-reference path where no concurrent writer
exists.  But the read at line 510 happens unconditionally _before_
atomic_dec_and_test(), so on the non-last-reference path it does
race with the writes listed above.  The value is discarded in that
case, so there's no semantic issue.

Rather than annotating a race whose result is unused, a cleaner fix
is to move the stale read to after the freed check, so the read
only occurs when we actually need the value -- at which point there
is no race:

    freed = atomic_dec_and_test(&bip->bli_refcount);
    ...
    if (!freed) { xfs_buf_rele(bp); return; }
  + stale = bip->bli_flags & XFS_BLI_STALE;

This eliminates the concurrent access entirely with no behavior
change.

> There is no consequential race here, either. Once an BLI is marked
> as an XFS_BLI_INODE_ALLOC_BUF under the buffer lock [...] that
> flag never gets cleared from bip->bli_flags.

Correct -- the bit being read (INODE_ALLOC_BUF) is never modified
by the concurrent writers, so the result is always correct.  But
the concurrent writes to other bits in the same word (Write A and
Write B above) do constitute a data access race on that memory
location.  Since the race is confirmed benign, data_race() is the
right annotation.

  - if ((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && ...)
  + if ((data_race(bip->bli_flags) & XFS_BLI_INODE_ALLOC_BUF) && ...)

So the v2 would be:
  - Move stale read in xfs_buf_item_unpin() after freed check
  - data_race() in xfs_buf_item_committed()
  - Drop everything else (WRITE_ONCE, tracepoint, release-side)

Does this sound reasonable?

Thanks,
Cen

