* [PATCH] xfs: annotate lockless bli_flags access in buf item paths
@ 2026-03-27 13:14 Cen Zhang
2026-03-30 1:33 ` Dave Chinner
0 siblings, 1 reply; 3+ messages in thread
From: Cen Zhang @ 2026-03-27 13:14 UTC (permalink / raw)
To: cem; +Cc: linux-xfs, linux-kernel, baijiaju1990, Cen Zhang, stable
xfs_buf_item_unpin() and xfs_buf_item_committed() read bip->bli_flags
without holding the buffer lock, while xfs_buf_item_release() clears
XFS_BLI_LOGGED, XFS_BLI_HOLD, and XFS_BLI_ORDERED under that lock.
This can happen when an older checkpoint still holds a CIL reference to
the BLI while a new transaction is finishing with the buffer.
The lockless readers only test XFS_BLI_STALE and
XFS_BLI_INODE_ALLOC_BUF, which are disjoint from the bits being
cleared, so correctness is not affected in practice. However, the
plain C accesses still constitute a data race that may allow the
compiler to optimize them in unexpected ways (e.g., load tearing or
fused reloads), so they should be marked explicitly.
Annotate the two lockless reads with READ_ONCE(), make the clearing
store in xfs_buf_item_release() a WRITE_ONCE(), and use READ_ONCE() in
the buf-item tracepoint that may snapshot bli_flags without the lock.
Fixes: 8e1238508633 ("xfs: remove stale parameter from ->iop_unpin method")
Fixes: 71e330b59390 ("xfs: Introduce delayed logging core code")
Cc: stable@vger.kernel.org
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
---
fs/xfs/xfs_buf_item.c | 18 +++++++++++-------
fs/xfs/xfs_trace.h | 2 +-
2 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 8487635579e5..c3d0dc17ee10 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -502,7 +502,8 @@ xfs_buf_item_unpin(
{
struct xfs_buf_log_item *bip = BUF_ITEM(lip);
struct xfs_buf *bp = bip->bli_buf;
- int stale = bip->bli_flags & XFS_BLI_STALE;
+ unsigned int flags = READ_ONCE(bip->bli_flags);
+ int stale = flags & XFS_BLI_STALE;
int freed;
ASSERT(bp->b_log_item == bip);
@@ -679,13 +680,14 @@ xfs_buf_item_release(
{
struct xfs_buf_log_item *bip = BUF_ITEM(lip);
struct xfs_buf *bp = bip->bli_buf;
- bool hold = bip->bli_flags & XFS_BLI_HOLD;
- bool stale = bip->bli_flags & XFS_BLI_STALE;
+ unsigned int flags = bip->bli_flags;
+ bool hold = flags & XFS_BLI_HOLD;
+ bool stale = flags & XFS_BLI_STALE;
bool aborted = test_bit(XFS_LI_ABORTED,
&lip->li_flags);
- bool dirty = bip->bli_flags & XFS_BLI_DIRTY;
+ bool dirty = flags & XFS_BLI_DIRTY;
#if defined(DEBUG) || defined(XFS_WARN)
- bool ordered = bip->bli_flags & XFS_BLI_ORDERED;
+ bool ordered = flags & XFS_BLI_ORDERED;
#endif
trace_xfs_buf_item_release(bip);
@@ -705,7 +707,8 @@ xfs_buf_item_release(
* per-transaction state from the bli, which has been copied above.
*/
bp->b_transp = NULL;
- bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED);
+ WRITE_ONCE(bip->bli_flags,
+ flags & ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED));
/* If there are other references, then we have nothing to do. */
if (!atomic_dec_and_test(&bip->bli_refcount))
@@ -792,10 +795,11 @@ xfs_buf_item_committed(
xfs_lsn_t lsn)
{
struct xfs_buf_log_item *bip = BUF_ITEM(lip);
+ unsigned int flags = READ_ONCE(bip->bli_flags);
trace_xfs_buf_item_committed(bip);
- if ((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0)
+ if ((flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0)
return lip->li_lsn;
return lsn;
}
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 813e5a9f57eb..2069534bb0c1 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -895,7 +895,7 @@ DECLARE_EVENT_CLASS(xfs_buf_item_class,
),
TP_fast_assign(
__entry->dev = bip->bli_buf->b_target->bt_dev;
- __entry->bli_flags = bip->bli_flags;
+ __entry->bli_flags = READ_ONCE(bip->bli_flags);
__entry->bli_recur = bip->bli_recur;
__entry->bli_refcount = atomic_read(&bip->bli_refcount);
__entry->buf_bno = xfs_buf_daddr(bip->bli_buf);
--
2.34.1
^ permalink raw reply related [flat|nested] 3+ messages in thread* Re: [PATCH] xfs: annotate lockless bli_flags access in buf item paths 2026-03-27 13:14 [PATCH] xfs: annotate lockless bli_flags access in buf item paths Cen Zhang @ 2026-03-30 1:33 ` Dave Chinner 2026-03-30 2:51 ` Cen Zhang 0 siblings, 1 reply; 3+ messages in thread From: Dave Chinner @ 2026-03-30 1:33 UTC (permalink / raw) To: Cen Zhang; +Cc: cem, linux-xfs, linux-kernel, baijiaju1990, stable On Fri, Mar 27, 2026 at 09:14:48PM +0800, Cen Zhang wrote: > xfs_buf_item_unpin() and xfs_buf_item_committed() read bip->bli_flags > without holding the buffer lock, while xfs_buf_item_release() clears > XFS_BLI_LOGGED, XFS_BLI_HOLD, and XFS_BLI_ORDERED under that lock. > This can happen when an older checkpoint still holds a CIL reference to > the BLI while a new transaction is finishing with the buffer. It's much more complex than that, and I think that there aren't any meaningful data races in these cases. > The lockless readers only test XFS_BLI_STALE and > XFS_BLI_INODE_ALLOC_BUF, which are disjoint from the bits being > cleared, so correctness is not affected in practice. However, the > plain C accesses still constitute a data race that may allow the > compiler to optimize them in unexpected ways (e.g., load tearing or > fused reloads), so they should be marked explicitly. I'm not sure that sort of thing will ever occur given that bli_flags is a 32 bit int with natural alignment. > Annotate the two lockless reads with READ_ONCE(), make the clearing > store in xfs_buf_item_release() a WRITE_ONCE(), and use READ_ONCE() in > the buf-item tracepoint that may snapshot bli_flags without the lock. > > Fixes: 8e1238508633 ("xfs: remove stale parameter from ->iop_unpin method") > Fixes: 71e330b59390 ("xfs: Introduce delayed logging core code") > Cc: stable@vger.kernel.org > Signed-off-by: Cen Zhang <zzzccc427@gmail.com> > --- > fs/xfs/xfs_buf_item.c | 18 +++++++++++------- > fs/xfs/xfs_trace.h | 2 +- > 2 files changed, 12 insertions(+), 8 deletions(-) > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c > index 8487635579e5..c3d0dc17ee10 100644 > --- a/fs/xfs/xfs_buf_item.c > +++ b/fs/xfs/xfs_buf_item.c > @@ -502,7 +502,8 @@ xfs_buf_item_unpin( > { > struct xfs_buf_log_item *bip = BUF_ITEM(lip); > struct xfs_buf *bp = bip->bli_buf; > - int stale = bip->bli_flags & XFS_BLI_STALE; > + unsigned int flags = READ_ONCE(bip->bli_flags); > + int stale = flags & XFS_BLI_STALE; > int freed; When XFS_BLI_STALE is set in unpin, then by definition the buffer must be locked and the BLI referenced. i.e. the buffer is locked when XFS_BLI_STALE is set, and it remains locked until journal completion runs xfs_buf_item_unpin() and drops the final BLI reference. So if XFS_BLI_STALE is set, this isn't an unlocked read. Regardless of that, if the unpin call doesn't drop the last reference, then we return from xfs_buf_item_unpin() without having ever checked the stale flag. i.e the read is completely irrelevant in cases but the last reference to the BLI. And if this is the last reference to the BLI, then it is not joined into a current transaction, which means nothing is going to be changing the BLI state. Hence we only care about the bli_flags when unpin has the only remaining reference to the bli, and in that case the read cannot be racing with anything else changing it's state. There is no data access race here on the final reference, regardless of what locks are held. IOWs, using READ_ONCE to indicate a data access race is -incorrect- in the cases where we actually use the result of the read. Seriously: if you want to mark data accesses as being racy, you better be adding comments to explain what the actual data race is and why it matters enough that it needs annotation. Otherwise it is just useless noise in the code... > ASSERT(bp->b_log_item == bip); > @@ -679,13 +680,14 @@ xfs_buf_item_release( > { > struct xfs_buf_log_item *bip = BUF_ITEM(lip); > struct xfs_buf *bp = bip->bli_buf; > - bool hold = bip->bli_flags & XFS_BLI_HOLD; > - bool stale = bip->bli_flags & XFS_BLI_STALE; > + unsigned int flags = bip->bli_flags; > + bool hold = flags & XFS_BLI_HOLD; > + bool stale = flags & XFS_BLI_STALE; > bool aborted = test_bit(XFS_LI_ABORTED, > &lip->li_flags); > - bool dirty = bip->bli_flags & XFS_BLI_DIRTY; > + bool dirty = flags & XFS_BLI_DIRTY; > #if defined(DEBUG) || defined(XFS_WARN) > - bool ordered = bip->bli_flags & XFS_BLI_ORDERED; > + bool ordered = flags & XFS_BLI_ORDERED; > #endif > > trace_xfs_buf_item_release(bip); > @@ -705,7 +707,8 @@ xfs_buf_item_release( > * per-transaction state from the bli, which has been copied above. > */ > bp->b_transp = NULL; > - bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED); > + WRITE_ONCE(bip->bli_flags, > + flags & ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED)); > > /* If there are other references, then we have nothing to do. */ This is always called with the buffer locked, so it already serialises correctly against all the other reads that set or check those flags under the buffer lock. Indeed, why does this one specific write matter, and not all the other writes to bip->bli_flags that set the flags that "lockless reads" might trip over? Like, say, in xfs_trans_inode_alloc_buf() or xfs_trans_binval()? > if (!atomic_dec_and_test(&bip->bli_refcount)) > @@ -792,10 +795,11 @@ xfs_buf_item_committed( > xfs_lsn_t lsn) > { > struct xfs_buf_log_item *bip = BUF_ITEM(lip); > + unsigned int flags = READ_ONCE(bip->bli_flags); > > trace_xfs_buf_item_committed(bip); > > - if ((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0) > + if ((flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0) > return lip->li_lsn; > return lsn; There is no consequential race here, either. Once an BLI is marked as an XFS_BLI_INODE_ALLOC_BUF under the buffer lock - the same lock hold that clears flags in xfs_buf_item_release() - that flag never gets clears from bip->bli_flags. Hence we just don't care what the compiler does for this read, because it will always return the correct value no matter what. i.e. it is yet another benign data race.... > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index 813e5a9f57eb..2069534bb0c1 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -895,7 +895,7 @@ DECLARE_EVENT_CLASS(xfs_buf_item_class, > ), > TP_fast_assign( > __entry->dev = bip->bli_buf->b_target->bt_dev; > - __entry->bli_flags = bip->bli_flags; > + __entry->bli_flags = READ_ONCE(bip->bli_flags); We just don't care here... -Dave. -- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH] xfs: annotate lockless bli_flags access in buf item paths 2026-03-30 1:33 ` Dave Chinner @ 2026-03-30 2:51 ` Cen Zhang 0 siblings, 0 replies; 3+ messages in thread From: Cen Zhang @ 2026-03-30 2:51 UTC (permalink / raw) To: Dave Chinner; +Cc: cem, linux-xfs, linux-kernel, baijiaju1990, stable Hi Dave, > When XFS_BLI_STALE is set in unpin, then by definition the buffer > must be locked and the BLI referenced. [...] > Hence we only care about the bli_flags when unpin has the only > remaining reference to the bli, and in that case the read cannot be > racing with anything else changing it's state. Thank you for the detailed walkthrough. You're right about several things: READ_ONCE()/WRITE_ONCE() was the wrong tool here, the WRITE_ONCE in xfs_buf_item_release() is useless since it's under the buffer lock, and the tracepoint change doesn't matter. I'm dropping those. However, I want to share the actual concurrent access evidence before we decide on the remaining two reads: (1) xfs_buf_item_unpin:510 reads bli_flags (2) xfs_buf_item_committed:803 reads bli_flags Both race with two different write paths: Write A: xfs_trans_dirty_buf:532 (bli_flags |= DIRTY|LOGGED) via new transaction on the same buffer Write B: xfs_buf_item_release:713 (bli_flags &= ~LOGGED|HOLD|ORDERED) via iop_committing on xlog_cil_commit path The concurrent paths are: CPU 0: old checkpoint completing on workqueue → xlog_cil_committed → iop_committed / iop_unpin CPU 1: new transaction with the same buffer committing → xlog_cil_commit → iop_committing → iop_release (or xfs_trans_dirty_buf during the transaction) The comment at xlog_cil.c:1856 even acknowledges this: "the CIL checkpoint can race with us and we can run checkpoint completion before we've updated and unlocked the log items" > IOWs, using READ_ONCE to indicate a data access race is -incorrect- > in the cases where we actually use the result of the read. Agreed. For xfs_buf_item_unpin(), you're right that stale is only consumed on the last-reference path where no concurrent writer exists. But the read at line 510 happens unconditionally _before_ atomic_dec_and_test(), so on the non-last-reference path it does race with the writes listed above. The value is discarded in that case, so there's no semantic issue. Rather than annotating a race whose result is unused, a cleaner fix is to move the stale read to after the freed check, so the read only occurs when we actually need the value -- at which point there is no race: freed = atomic_dec_and_test(&bip->bli_refcount); ... if (!freed) { xfs_buf_rele(bp); return; } + stale = bip->bli_flags & XFS_BLI_STALE; This eliminates the concurrent access entirely with no behavior change. > There is no consequential race here, either. Once an BLI is marked > as an XFS_BLI_INODE_ALLOC_BUF under the buffer lock [...] that > flag never gets cleared from bip->bli_flags. Correct -- the bit being read (INODE_ALLOC_BUF) is never modified by the concurrent writers, so the result is always correct. But the concurrent writes to other bits in the same word (Write A and Write B above) do constitute a data access race on that memory location. Since the race is confirmed benign, data_race() is the right annotation. - if ((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && ...) + if ((data_race(bip->bli_flags) & XFS_BLI_INODE_ALLOC_BUF) && ...) So the v2 would be: - Move stale read in xfs_buf_item_unpin() after freed check - data_race() in xfs_buf_item_committed() - Drop everything else (WRITE_ONCE, tracepoint, release-side) Does this sound reasonable? Thanks, Cen ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-03-30 2:51 UTC | newest] Thread overview: 3+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-03-27 13:14 [PATCH] xfs: annotate lockless bli_flags access in buf item paths Cen Zhang 2026-03-30 1:33 ` Dave Chinner 2026-03-30 2:51 ` Cen Zhang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox