From: Dave Chinner <dgc@kernel.org>
To: Cen Zhang <zzzccc427@gmail.com>
Cc: cem@kernel.org, linux-xfs@vger.kernel.org,
linux-kernel@vger.kernel.org, baijiaju1990@gmail.com,
stable@vger.kernel.org
Subject: Re: [PATCH] xfs: annotate lockless bli_flags access in buf item paths
Date: Mon, 30 Mar 2026 12:33:22 +1100 [thread overview]
Message-ID: <acnS4iOihfWL4ay5@dread> (raw)
In-Reply-To: <20260327131448.156177-1-zzzccc427@gmail.com>
On Fri, Mar 27, 2026 at 09:14:48PM +0800, Cen Zhang wrote:
> xfs_buf_item_unpin() and xfs_buf_item_committed() read bip->bli_flags
> without holding the buffer lock, while xfs_buf_item_release() clears
> XFS_BLI_LOGGED, XFS_BLI_HOLD, and XFS_BLI_ORDERED under that lock.
> This can happen when an older checkpoint still holds a CIL reference to
> the BLI while a new transaction is finishing with the buffer.
It's much more complex than that, and I think that there aren't any
meaningful data races in these cases.
> The lockless readers only test XFS_BLI_STALE and
> XFS_BLI_INODE_ALLOC_BUF, which are disjoint from the bits being
> cleared, so correctness is not affected in practice. However, the
> plain C accesses still constitute a data race that may allow the
> compiler to optimize them in unexpected ways (e.g., load tearing or
> fused reloads), so they should be marked explicitly.
I'm not sure that sort of thing will ever occur given that bli_flags
is a 32 bit int with natural alignment.
> Annotate the two lockless reads with READ_ONCE(), make the clearing
> store in xfs_buf_item_release() a WRITE_ONCE(), and use READ_ONCE() in
> the buf-item tracepoint that may snapshot bli_flags without the lock.
>
> Fixes: 8e1238508633 ("xfs: remove stale parameter from ->iop_unpin method")
> Fixes: 71e330b59390 ("xfs: Introduce delayed logging core code")
> Cc: stable@vger.kernel.org
> Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
> ---
> fs/xfs/xfs_buf_item.c | 18 +++++++++++-------
> fs/xfs/xfs_trace.h | 2 +-
> 2 files changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 8487635579e5..c3d0dc17ee10 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -502,7 +502,8 @@ xfs_buf_item_unpin(
> {
> struct xfs_buf_log_item *bip = BUF_ITEM(lip);
> struct xfs_buf *bp = bip->bli_buf;
> - int stale = bip->bli_flags & XFS_BLI_STALE;
> + unsigned int flags = READ_ONCE(bip->bli_flags);
> + int stale = flags & XFS_BLI_STALE;
> int freed;
When XFS_BLI_STALE is set in unpin, then by definition the buffer
must be locked and the BLI referenced. i.e. the buffer is locked
when XFS_BLI_STALE is set, and it remains locked until journal
completion runs xfs_buf_item_unpin() and drops the final BLI
reference. So if XFS_BLI_STALE is set, this isn't an unlocked read.
Regardless of that, if the unpin call doesn't drop the last
reference, then we return from xfs_buf_item_unpin() without having
ever checked the stale flag. i.e the read is completely irrelevant
in cases but the last reference to the BLI.
And if this is the last reference to the BLI, then it is not joined
into a current transaction, which means nothing is going to be
changing the BLI state.
Hence we only care about the bli_flags when unpin has the only
remaining reference to the bli, and in that case the read cannot be
racing with anything else changing it's state. There is no
data access race here on the final reference, regardless of what
locks are held.
IOWs, using READ_ONCE to indicate a data access race is -incorrect-
in the cases where we actually use the result of the read.
Seriously: if you want to mark data accesses as being racy, you
better be adding comments to explain what the actual data race is
and why it matters enough that it needs annotation. Otherwise it is
just useless noise in the code...
> ASSERT(bp->b_log_item == bip);
> @@ -679,13 +680,14 @@ xfs_buf_item_release(
> {
> struct xfs_buf_log_item *bip = BUF_ITEM(lip);
> struct xfs_buf *bp = bip->bli_buf;
> - bool hold = bip->bli_flags & XFS_BLI_HOLD;
> - bool stale = bip->bli_flags & XFS_BLI_STALE;
> + unsigned int flags = bip->bli_flags;
> + bool hold = flags & XFS_BLI_HOLD;
> + bool stale = flags & XFS_BLI_STALE;
> bool aborted = test_bit(XFS_LI_ABORTED,
> &lip->li_flags);
> - bool dirty = bip->bli_flags & XFS_BLI_DIRTY;
> + bool dirty = flags & XFS_BLI_DIRTY;
> #if defined(DEBUG) || defined(XFS_WARN)
> - bool ordered = bip->bli_flags & XFS_BLI_ORDERED;
> + bool ordered = flags & XFS_BLI_ORDERED;
> #endif
>
> trace_xfs_buf_item_release(bip);
> @@ -705,7 +707,8 @@ xfs_buf_item_release(
> * per-transaction state from the bli, which has been copied above.
> */
> bp->b_transp = NULL;
> - bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED);
> + WRITE_ONCE(bip->bli_flags,
> + flags & ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED));
>
> /* If there are other references, then we have nothing to do. */
This is always called with the buffer locked, so it already
serialises correctly against all the other reads that set or check
those flags under the buffer lock.
Indeed, why does this one specific write matter, and not all the
other writes to bip->bli_flags that set the flags that "lockless
reads" might trip over? Like, say, in xfs_trans_inode_alloc_buf() or
xfs_trans_binval()?
> if (!atomic_dec_and_test(&bip->bli_refcount))
> @@ -792,10 +795,11 @@ xfs_buf_item_committed(
> xfs_lsn_t lsn)
> {
> struct xfs_buf_log_item *bip = BUF_ITEM(lip);
> + unsigned int flags = READ_ONCE(bip->bli_flags);
>
> trace_xfs_buf_item_committed(bip);
>
> - if ((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0)
> + if ((flags & XFS_BLI_INODE_ALLOC_BUF) && lip->li_lsn != 0)
> return lip->li_lsn;
> return lsn;
There is no consequential race here, either. Once an BLI is marked
as an XFS_BLI_INODE_ALLOC_BUF under the buffer lock - the same lock
hold that clears flags in xfs_buf_item_release() - that flag never
gets clears from bip->bli_flags. Hence we just don't care what the
compiler does for this read, because it will always return the correct
value no matter what. i.e. it is yet another benign data race....
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 813e5a9f57eb..2069534bb0c1 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -895,7 +895,7 @@ DECLARE_EVENT_CLASS(xfs_buf_item_class,
> ),
> TP_fast_assign(
> __entry->dev = bip->bli_buf->b_target->bt_dev;
> - __entry->bli_flags = bip->bli_flags;
> + __entry->bli_flags = READ_ONCE(bip->bli_flags);
We just don't care here...
-Dave.
--
Dave Chinner
dgc@kernel.org
next prev parent reply other threads:[~2026-03-30 1:33 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-27 13:14 [PATCH] xfs: annotate lockless bli_flags access in buf item paths Cen Zhang
2026-03-30 1:33 ` Dave Chinner [this message]
2026-03-30 2:51 ` Cen Zhang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=acnS4iOihfWL4ay5@dread \
--to=dgc@kernel.org \
--cc=baijiaju1990@gmail.com \
--cc=cem@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=stable@vger.kernel.org \
--cc=zzzccc427@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox