* fix buffer refcount races
From: Christoph Hellwig @ 2025-01-13  4:24 UTC
To: Carlos Maiolino; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs
Hi all,
this series fixes two races in buffer refcount handling, that I've
so far not actually seen in real life, but only found while reading
through the buffer cache code to understand some of the stranger looking
locking decisions.
One can cause a buffer about to be freed to be returned from
xfs_buf_insert. I think this one is extremely unlikely to be hit,
as it requires the buffer to not be present for the initial lookup,
but already being evicted when trying to add the new buffer. But
at least the fix is trivial.
The second causes buffer lookups to be missed when moving to the LRU.
This might actually be able to trigger the first one, but otherwise
just means we're doing a pass through insert which will find it.
For pure lookups using xfs_buf_incore it could cause us to miss buffer
invalidation. The fix for that is bigger and has bigger implications
because it not requires all b_hold increments to be done under d_lock.
This causes more contention, but as releasing the buffer always takes
the lock it can't be too horrible. I also have a only minimally
tested series to switch it over to a lockref here:
http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-buffer-locking
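
To illustrate the second race, here is a heavily simplified userspace model
(C11 atomics; a schematic of the pre-fix logic, not the kernel code): the
release path lets the count touch zero before re-incrementing it for the
LRU, and a lookup running in exactly that window gives up on a live buffer.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct buf {
        atomic_int      hold;           /* models bp->b_hold */
        bool            on_lru;         /* models bp->b_lru membership */
};

/* lookup fastpath: only takes a reference if the count is non-zero */
static bool lookup_try_hold(struct buf *bp)
{
        int c = atomic_load(&bp->hold);

        while (c != 0)
                if (atomic_compare_exchange_weak(&bp->hold, &c, c + 1))
                        return true;
        return false;                   /* transient zero: lookup misses */
}

/* release path, pre-fix: the count is zero between the two steps */
static void rele_last_reference(struct buf *bp)
{
        if (atomic_fetch_sub(&bp->hold, 1) == 1) {
                /* window: hold == 0, but the buffer is being kept */
                bp->on_lru = true;
                atomic_fetch_add(&bp->hold, 1); /* reference for the LRU */
        }
}

int main(void)
{
        struct buf b = { .hold = 1, .on_lru = false };

        /* single-threaded re-enactment of the window only */
        atomic_fetch_sub(&b.hold, 1);           /* rele() half-way done... */
        printf("lookup hit: %d\n", lookup_try_hold(&b));        /* 0: missed */
        return 0;
}

The first race is the same zero-count hazard seen from the insert side: the
existing-buffer path revived the count unconditionally instead of refusing
a buffer that had already reached zero.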
Diffstat:
b/fs/xfs/xfs_buf.c | 3 -
b/fs/xfs/xfs_buf.h | 4 +-
b/fs/xfs/xfs_trace.h | 10 ++---
fs/xfs/xfs_buf.c | 93 ++++++++++++++++++++++++++-------------------------
4 files changed, 56 insertions(+), 54 deletions(-)

* [PATCH 1/2] xfs: check for dead buffers in xfs_buf_find_insert
From: Christoph Hellwig @ 2025-01-13  4:24 UTC
To: Carlos Maiolino; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

Commit 32dd4f9c506b ("xfs: remove a superflous hash lookup when inserting
new buffers") converted xfs_buf_find_insert to use
rhashtable_lookup_get_insert_fast and thus an operation that returns the
existing buffer when an insert would duplicate the hash key.  But this
code path misses the check for a buffer with a reference count of zero,
which could lead to reusing an about to be freed buffer.  Fix this by
using the same atomic_inc_not_zero pattern as xfs_buf_insert.

Fixes: 32dd4f9c506b ("xfs: remove a superflous hash lookup when inserting new buffers")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_buf.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 6f313fbf7669..f80e39fde53b 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -664,9 +664,8 @@ xfs_buf_find_insert(
 		spin_unlock(&bch->bc_lock);
 		goto out_free_buf;
 	}
-	if (bp) {
+	if (bp && atomic_inc_not_zero(&bp->b_hold)) {
 		/* found an existing buffer */
-		atomic_inc(&bp->b_hold);
 		spin_unlock(&bch->bc_lock);
 		error = xfs_buf_find_lock(bp, flags);
 		if (error)
-- 
2.45.2
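
To make the closed window concrete, a minimal userspace model of the two
acquire variants (C11 atomics; this sketches the semantics only, it is not
the kernel's implementation of atomic_inc_not_zero):

#include <stdatomic.h>
#include <stdbool.h>

/* pre-fix behaviour: blindly revives a buffer whose count already hit 0 */
static void get_unconditional(atomic_int *b_hold)
{
        atomic_fetch_add(b_hold, 1);    /* 0 -> 1: hands out a dying buffer */
}

/* post-fix behaviour: refuse the buffer once the count has reached zero */
static bool get_not_zero(atomic_int *b_hold)
{
        int c = atomic_load(b_hold);

        while (c != 0) {
                if (atomic_compare_exchange_weak(b_hold, &c, c + 1))
                        return true;    /* safely took a reference */
        }
        return false;
}

With the check in place, xfs_buf_find_insert refuses to hand out a buffer
whose reference count has already dropped to zero instead of reviving it.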

* Re: [PATCH 1/2] xfs: check for dead buffers in xfs_buf_find_insert
From: Darrick J. Wong @ 2025-01-13  4:56 UTC
To: Christoph Hellwig; +Cc: Carlos Maiolino, Dave Chinner, linux-xfs

On Mon, Jan 13, 2025 at 05:24:26AM +0100, Christoph Hellwig wrote:
> Commit 32dd4f9c506b ("xfs: remove a superflous hash lookup when inserting
> new buffers") converted xfs_buf_find_insert to use
> rhashtable_lookup_get_insert_fast and thus an operation that returns the
> existing buffer when an insert would duplicate the hash key.  But this
> code path misses the check for a buffer with a reference count of zero,
> which could lead to reusing an about to be freed buffer.  Fix this by
> using the same atomic_inc_not_zero pattern as xfs_buf_insert.
>
> Fixes: 32dd4f9c506b ("xfs: remove a superflous hash lookup when inserting new buffers")
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Cc: <stable@vger.kernel.org> # v6.0

and with that added,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

* Re: [PATCH 1/2] xfs: check for dead buffers in xfs_buf_find_insert
From: Dave Chinner @ 2025-01-14  0:24 UTC
To: Christoph Hellwig
Cc: Carlos Maiolino, Darrick J. Wong, Dave Chinner, linux-xfs

On Mon, Jan 13, 2025 at 05:24:26AM +0100, Christoph Hellwig wrote:
> Commit 32dd4f9c506b ("xfs: remove a superflous hash lookup when inserting
> new buffers") converted xfs_buf_find_insert to use
> rhashtable_lookup_get_insert_fast and thus an operation that returns the
> existing buffer when an insert would duplicate the hash key.  But this
> code path misses the check for a buffer with a reference count of zero,
> which could lead to reusing an about to be freed buffer.  Fix this by
> using the same atomic_inc_not_zero pattern as xfs_buf_insert.
>
> Fixes: 32dd4f9c506b ("xfs: remove a superflous hash lookup when inserting new buffers")
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_buf.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 6f313fbf7669..f80e39fde53b 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -664,9 +664,8 @@ xfs_buf_find_insert(
>  		spin_unlock(&bch->bc_lock);
>  		goto out_free_buf;
>  	}
> -	if (bp) {
> +	if (bp && atomic_inc_not_zero(&bp->b_hold)) {
>  		/* found an existing buffer */
> -		atomic_inc(&bp->b_hold);
>  		spin_unlock(&bch->bc_lock);
>  		error = xfs_buf_find_lock(bp, flags);
>  		if (error)

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

* [PATCH 2/2] xfs: fix buffer lookup vs release race
From: Christoph Hellwig @ 2025-01-13  4:24 UTC
To: Carlos Maiolino; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

Since commit 298f34224506 ("xfs: lockless buffer lookup") the buffer
lookup fastpath is done without a hash-wide lock (then pag_buf_lock, now
bc_lock) and only under RCU protection.  But this means that nothing
serializes lookups against the temporary 0 reference count for buffers
that are added to the LRU after dropping the last regular reference,
and a concurrent lookup would fail to find them.

Fix this by doing all b_hold modifications under b_lock.  We're already
doing this for release so this "only" ~ doubles the b_lock round trips.
We'll later look into the lockref infrastructure to optimize the number
of lock round trips again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_buf.c   | 93 ++++++++++++++++++++++++----------------------
 fs/xfs/xfs_buf.h   |  4 +-
 fs/xfs/xfs_trace.h | 10 ++---
 3 files changed, 55 insertions(+), 52 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index f80e39fde53b..dc219678003c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -133,15 +133,6 @@ __xfs_buf_ioacct_dec(
 	}
 }
 
-static inline void
-xfs_buf_ioacct_dec(
-	struct xfs_buf	*bp)
-{
-	spin_lock(&bp->b_lock);
-	__xfs_buf_ioacct_dec(bp);
-	spin_unlock(&bp->b_lock);
-}
-
 /*
  * When we mark a buffer stale, we remove the buffer from the LRU and clear the
  * b_lru_ref count so that the buffer is freed immediately when the buffer
@@ -177,9 +168,9 @@ xfs_buf_stale(
 	atomic_set(&bp->b_lru_ref, 0);
 	if (!(bp->b_state & XFS_BSTATE_DISPOSE) &&
 	    (list_lru_del_obj(&bp->b_target->bt_lru, &bp->b_lru)))
-		atomic_dec(&bp->b_hold);
+		bp->b_hold--;
 
-	ASSERT(atomic_read(&bp->b_hold) >= 1);
+	ASSERT(bp->b_hold >= 1);
 	spin_unlock(&bp->b_lock);
 }
 
@@ -238,14 +229,14 @@ _xfs_buf_alloc(
 	 */
 	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
 
-	atomic_set(&bp->b_hold, 1);
+	spin_lock_init(&bp->b_lock);
+	bp->b_hold = 1;
 	atomic_set(&bp->b_lru_ref, 1);
 	init_completion(&bp->b_iowait);
 	INIT_LIST_HEAD(&bp->b_lru);
 	INIT_LIST_HEAD(&bp->b_list);
 	INIT_LIST_HEAD(&bp->b_li_list);
 	sema_init(&bp->b_sema, 0); /* held, no waiters */
-	spin_lock_init(&bp->b_lock);
 	bp->b_target = target;
 	bp->b_mount = target->bt_mount;
 	bp->b_flags = flags;
@@ -589,6 +580,20 @@ xfs_buf_find_lock(
 	return 0;
 }
 
+static bool
+xfs_buf_try_hold(
+	struct xfs_buf	*bp)
+{
+	spin_lock(&bp->b_lock);
+	if (bp->b_hold == 0) {
+		spin_unlock(&bp->b_lock);
+		return false;
+	}
+	bp->b_hold++;
+	spin_unlock(&bp->b_lock);
+	return true;
+}
+
 static inline int
 xfs_buf_lookup(
 	struct xfs_buf_cache	*bch,
@@ -601,7 +606,7 @@ xfs_buf_lookup(
 
 	rcu_read_lock();
 	bp = rhashtable_lookup(&bch->bc_hash, map, xfs_buf_hash_params);
-	if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
+	if (!bp || !xfs_buf_try_hold(bp)) {
 		rcu_read_unlock();
 		return -ENOENT;
 	}
@@ -664,7 +669,7 @@ xfs_buf_find_insert(
 		spin_unlock(&bch->bc_lock);
 		goto out_free_buf;
 	}
-	if (bp && atomic_inc_not_zero(&bp->b_hold)) {
+	if (bp && xfs_buf_try_hold(bp)) {
 		/* found an existing buffer */
 		spin_unlock(&bch->bc_lock);
 		error = xfs_buf_find_lock(bp, flags);
@@ -1043,7 +1048,10 @@ xfs_buf_hold(
 	struct xfs_buf	*bp)
 {
 	trace_xfs_buf_hold(bp, _RET_IP_);
-	atomic_inc(&bp->b_hold);
+
+	spin_lock(&bp->b_lock);
+	bp->b_hold++;
+	spin_unlock(&bp->b_lock);
 }
 
 static void
@@ -1051,10 +1059,15 @@ xfs_buf_rele_uncached(
 	struct xfs_buf	*bp)
 {
 	ASSERT(list_empty(&bp->b_lru));
-	if (atomic_dec_and_test(&bp->b_hold)) {
-		xfs_buf_ioacct_dec(bp);
-		xfs_buf_free(bp);
+
+	spin_lock(&bp->b_lock);
+	if (--bp->b_hold) {
+		spin_unlock(&bp->b_lock);
+		return;
 	}
+	__xfs_buf_ioacct_dec(bp);
+	spin_unlock(&bp->b_lock);
+	xfs_buf_free(bp);
 }
 
 static void
@@ -1064,51 +1077,40 @@ xfs_buf_rele_cached(
 	struct xfs_buftarg	*btp = bp->b_target;
 	struct xfs_perag	*pag = bp->b_pag;
 	struct xfs_buf_cache	*bch = xfs_buftarg_buf_cache(btp, pag);
-	bool			release;
 	bool			freebuf = false;
 
 	trace_xfs_buf_rele(bp, _RET_IP_);
 
-	ASSERT(atomic_read(&bp->b_hold) > 0);
-
-	/*
-	 * We grab the b_lock here first to serialise racing xfs_buf_rele()
-	 * calls. The pag_buf_lock being taken on the last reference only
-	 * serialises against racing lookups in xfs_buf_find(). IOWs, the second
-	 * to last reference we drop here is not serialised against the last
-	 * reference until we take bp->b_lock. Hence if we don't grab b_lock
-	 * first, the last "release" reference can win the race to the lock and
-	 * free the buffer before the second-to-last reference is processed,
-	 * leading to a use-after-free scenario.
-	 */
 	spin_lock(&bp->b_lock);
-	release = atomic_dec_and_lock(&bp->b_hold, &bch->bc_lock);
-	if (!release) {
+	ASSERT(bp->b_hold >= 1);
+	if (bp->b_hold > 1) {
 		/*
 		 * Drop the in-flight state if the buffer is already on the LRU
 		 * and it holds the only reference. This is racy because we
 		 * haven't acquired the pag lock, but the use of _XBF_IN_FLIGHT
 		 * ensures the decrement occurs only once per-buf.
 		 */
-		if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
+		if (--bp->b_hold == 1 && !list_empty(&bp->b_lru))
 			__xfs_buf_ioacct_dec(bp);
 		goto out_unlock;
 	}
 
-	/* the last reference has been dropped ... */
+	/* we are asked to drop the last reference */
+	spin_lock(&bch->bc_lock);
 	__xfs_buf_ioacct_dec(bp);
 	if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
 		/*
-		 * If the buffer is added to the LRU take a new reference to the
+		 * If the buffer is added to the LRU, keep the reference to the
		 * buffer for the LRU and clear the (now stale) dispose list
-		 * state flag
+		 * state flag, else drop the reference.
		 */
-		if (list_lru_add_obj(&btp->bt_lru, &bp->b_lru)) {
+		if (list_lru_add_obj(&btp->bt_lru, &bp->b_lru))
 			bp->b_state &= ~XFS_BSTATE_DISPOSE;
-			atomic_inc(&bp->b_hold);
-		}
+		else
+			bp->b_hold--;
 		spin_unlock(&bch->bc_lock);
 	} else {
+		bp->b_hold--;
 		/*
 		 * most of the time buffers will already be removed from the
 		 * LRU, so optimise that case by checking for the
@@ -1863,13 +1865,14 @@ xfs_buftarg_drain_rele(
 	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
 	struct list_head	*dispose = arg;
 
-	if (atomic_read(&bp->b_hold) > 1) {
+	if (!spin_trylock(&bp->b_lock))
+		return LRU_SKIP;
+	if (bp->b_hold > 1) {
 		/* need to wait, so skip it this pass */
+		spin_unlock(&bp->b_lock);
 		trace_xfs_buf_drain_buftarg(bp, _RET_IP_);
 		return LRU_SKIP;
 	}
-	if (!spin_trylock(&bp->b_lock))
-		return LRU_SKIP;
 
 	/*
 	 * clear the LRU reference count so the buffer doesn't get
@@ -2208,7 +2211,7 @@ xfs_buf_delwri_queue(
 	 */
 	bp->b_flags |= _XBF_DELWRI_Q;
 	if (list_empty(&bp->b_list)) {
-		atomic_inc(&bp->b_hold);
+		xfs_buf_hold(bp);
 		list_add_tail(&bp->b_list, list);
 	}
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 3d56bc7a35cc..cbf7c2a076c7 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -172,7 +172,8 @@ struct xfs_buf {
 
 	xfs_daddr_t		b_rhash_key;	/* buffer cache index */
 	int			b_length;	/* size of buffer in BBs */
-	atomic_t		b_hold;		/* reference count */
+	spinlock_t		b_lock;		/* internal state lock */
+	unsigned int		b_hold;		/* reference count */
 	atomic_t		b_lru_ref;	/* lru reclaim ref count */
 	xfs_buf_flags_t		b_flags;	/* status flags */
 	struct semaphore	b_sema;		/* semaphore for lockables */
@@ -182,7 +183,6 @@ struct xfs_buf {
 	 * bt_lru_lock and not by b_sema
 	 */
 	struct list_head	b_lru;		/* lru list */
-	spinlock_t		b_lock;		/* internal state lock */
 	unsigned int		b_state;	/* internal state flags */
 	int			b_io_error;	/* internal IO error state */
 	wait_queue_head_t	b_waiters;	/* unpin waiters */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 4fe689410eb6..b29462363b81 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -498,7 +498,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
 		__entry->dev = bp->b_target->bt_dev;
 		__entry->bno = xfs_buf_daddr(bp);
 		__entry->nblks = bp->b_length;
-		__entry->hold = atomic_read(&bp->b_hold);
+		__entry->hold = bp->b_hold;
 		__entry->pincount = atomic_read(&bp->b_pin_count);
 		__entry->lockval = bp->b_sema.count;
 		__entry->flags = bp->b_flags;
@@ -569,7 +569,7 @@ DECLARE_EVENT_CLASS(xfs_buf_flags_class,
 		__entry->bno = xfs_buf_daddr(bp);
 		__entry->length = bp->b_length;
 		__entry->flags = flags;
-		__entry->hold = atomic_read(&bp->b_hold);
+		__entry->hold = bp->b_hold;
 		__entry->pincount = atomic_read(&bp->b_pin_count);
 		__entry->lockval = bp->b_sema.count;
 		__entry->caller_ip = caller_ip;
@@ -612,7 +612,7 @@ TRACE_EVENT(xfs_buf_ioerror,
 		__entry->dev = bp->b_target->bt_dev;
 		__entry->bno = xfs_buf_daddr(bp);
 		__entry->length = bp->b_length;
-		__entry->hold = atomic_read(&bp->b_hold);
+		__entry->hold = bp->b_hold;
 		__entry->pincount = atomic_read(&bp->b_pin_count);
 		__entry->lockval = bp->b_sema.count;
 		__entry->error = error;
@@ -656,7 +656,7 @@ DECLARE_EVENT_CLASS(xfs_buf_item_class,
 		__entry->buf_bno = xfs_buf_daddr(bip->bli_buf);
 		__entry->buf_len = bip->bli_buf->b_length;
 		__entry->buf_flags = bip->bli_buf->b_flags;
-		__entry->buf_hold = atomic_read(&bip->bli_buf->b_hold);
+		__entry->buf_hold = bip->bli_buf->b_hold;
 		__entry->buf_pincount = atomic_read(&bip->bli_buf->b_pin_count);
 		__entry->buf_lockval = bip->bli_buf->b_sema.count;
 		__entry->li_flags = bip->bli_item.li_flags;
@@ -4978,7 +4978,7 @@ DECLARE_EVENT_CLASS(xfbtree_buf_class,
 		__entry->xfino = file_inode(xfbt->target->bt_file)->i_ino;
 		__entry->bno = xfs_buf_daddr(bp);
 		__entry->nblks = bp->b_length;
-		__entry->hold = atomic_read(&bp->b_hold);
+		__entry->hold = bp->b_hold;
 		__entry->pincount = atomic_read(&bp->b_pin_count);
 		__entry->lockval = bp->b_sema.count;
 		__entry->flags = bp->b_flags;
-- 
2.45.2

* Re: [PATCH 2/2] xfs: fix buffer lookup vs release race
From: Darrick J. Wong @ 2025-01-13 17:55 UTC
To: Christoph Hellwig; +Cc: Carlos Maiolino, Dave Chinner, linux-xfs

On Mon, Jan 13, 2025 at 05:24:27AM +0100, Christoph Hellwig wrote:
> Since commit 298f34224506 ("xfs: lockless buffer lookup") the buffer
> lookup fastpath is done without a hash-wide lock (then pag_buf_lock, now
> bc_lock) and only under RCU protection.  But this means that nothing
> serializes lookups against the temporary 0 reference count for buffers
> that are added to the LRU after dropping the last regular reference,
> and a concurrent lookup would fail to find them.

Hrmm, and then what happens?  The rhashtable lookup in xfs_buf_lookup
returns a buffer, but we cannot atomic_inc_not_zero it due to the
transient zero state, so xfs_buf_lookup returns -ENOENT?  Then
xfs_buf_find_insert creates a new buffer and tries to
rhashtable_lookup_get_insert_fast it into the cache, but either finds
the old buffer and returns it; or inserts the new buffer?

Oh, I see, the XBF_INCORE causes xfs_buf_get_map to return NULL if the
lookup fails, but there's actually still a buffer in the rhashtable
which is actually the one that the xfs_buf_incore caller wanted to
examine... and possibly now the _rele_cached thread has bumped the
b_hold count back up to 1, so we've now missed the buffer.

And the end result is that remote xattr invalidation / online repair
reap fails to stale an old buffer, which is just hilarious if the space
gets reallocated to something with the same daddr but a different
buffer size (e.g. btree block becomes inode cluster).

> Fix this by doing all b_hold modifications under b_lock.  We're already
> doing this for release so this "only" ~ doubles the b_lock round trips.
> We'll later look into the lockref infrastructure to optimize the number
> of lock round trips again.

So I guess we no longer let b_hold temporarily touch zero in
_rele_cached, which prevents the above race?

> Signed-off-by: Christoph Hellwig <hch@lst.de>

The code looks right to me, and if my understanding of the problem is
correct (i.e. the answers to all the boolean questions are 'yes') then

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

* Re: [PATCH 2/2] xfs: fix buffer lookup vs release race
From: Dave Chinner @ 2025-01-13 20:55 UTC
To: Christoph Hellwig
Cc: Carlos Maiolino, Darrick J. Wong, Dave Chinner, linux-xfs

On Mon, Jan 13, 2025 at 05:24:27AM +0100, Christoph Hellwig wrote:
> Since commit 298f34224506 ("xfs: lockless buffer lookup") the buffer
> lookup fastpath is done without a hash-wide lock (then pag_buf_lock, now
> bc_lock) and only under RCU protection.  But this means that nothing
> serializes lookups against the temporary 0 reference count for buffers
> that are added to the LRU after dropping the last regular reference,
> and a concurrent lookup would fail to find them.
>
> Fix this by doing all b_hold modifications under b_lock.  We're already
> doing this for release so this "only" ~ doubles the b_lock round trips.
> We'll later look into the lockref infrastructure to optimize the number
> of lock round trips again.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Logic changes look ok, but...

> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 3d56bc7a35cc..cbf7c2a076c7 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -172,7 +172,8 @@ struct xfs_buf {
>
>  	xfs_daddr_t		b_rhash_key;	/* buffer cache index */
>  	int			b_length;	/* size of buffer in BBs */
> -	atomic_t		b_hold;		/* reference count */
> +	spinlock_t		b_lock;		/* internal state lock */
> +	unsigned int		b_hold;		/* reference count */
>  	atomic_t		b_lru_ref;	/* lru reclaim ref count */
>  	xfs_buf_flags_t		b_flags;	/* status flags */
>  	struct semaphore	b_sema;		/* semaphore for lockables */
> @@ -182,7 +183,6 @@ struct xfs_buf {
>  	 * bt_lru_lock and not by b_sema
>  	 */
>  	struct list_head	b_lru;		/* lru list */
> -	spinlock_t		b_lock;		/* internal state lock */
>  	unsigned int		b_state;	/* internal state flags */
>  	int			b_io_error;	/* internal IO error state */
>  	wait_queue_head_t	b_waiters;	/* unpin waiters */

... I think this is misguided.

The idea behind the initial cacheline layout is that it should stay
read-only as much as possible so that cache lookups can walk the
buffer without causing shared/exclusive cacheline contention with
existing buffer users.

This was really important back in the days when the cache used a
rb-tree (i.e. the rbnode pointers dominated lookup profiles), and
it's still likely important with the rhashtable on large caches.

i.e. Putting a spinlock in that first cache line will result in
lookups and shrinker walks having cacheline contention as the
shrinker needs exclusive access for the spin lock, whilst the lookup
walk needs shared access for the b_rhash_head, b_rhash_key and
b_length fields in _xfs_buf_obj_cmp() for lockless lookup
concurrency.

Hence I think it would be better to move the b_hold field to the
same cacheline as the b_state field rather than move it to the
initial cacheline that cache lookups walk...

-Dave.
-- 
Dave Chinner
david@fromorbit.com
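
The layout concern is easier to see with the fields annotated; a rough,
self-contained sketch (stand-in types for the kernel ones, and the
assumption that these fields share the first 64-byte cacheline on a
typical x86_64 non-debug build):

#include <stdint.h>

typedef int64_t xfs_daddr_t;                    /* stand-in */
struct rhash_head_sketch { void *next; };       /* stand-in */
struct spinlock_sketch { int raw; };            /* stand-in */

struct xfs_buf_first_cacheline_sketch {
        /* read-mostly: walked under RCU by the lookup fastpath */
        struct rhash_head_sketch b_rhash_head;  /* hash chain walk */
        xfs_daddr_t             b_rhash_key;    /* _xfs_buf_obj_cmp() */
        int                     b_length;       /* _xfs_buf_obj_cmp() */
        /*
         * v1 of the patch moves these two here: every hold/rele and
         * every shrinker scan then pulls this line exclusive, fighting
         * the shared (read-only) access the lookup walk wants above.
         */
        struct spinlock_sketch  b_lock;
        unsigned int            b_hold;
};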

* Re: [PATCH 2/2] xfs: fix buffer lookup vs release race
From: Christoph Hellwig @ 2025-01-15  5:38 UTC
To: Dave Chinner
Cc: Christoph Hellwig, Carlos Maiolino, Darrick J. Wong, Dave Chinner, linux-xfs

On Tue, Jan 14, 2025 at 07:55:30AM +1100, Dave Chinner wrote:
> The idea behind the initial cacheline layout is that it should stay
> read-only as much as possible so that cache lookups can walk the
> buffer without causing shared/exclusive cacheline contention with
> existing buffer users.
>
> This was really important back in the days when the cache used a
> rb-tree (i.e. the rbnode pointers dominated lookup profiles), and
> it's still likely important with the rhashtable on large caches.
>
> i.e. Putting a spinlock in that first cache line will result in
> lookups and shrinker walks having cacheline contention as the
> shrinker needs exclusive access for the spin lock, whilst the lookup
> walk needs shared access for the b_rhash_head, b_rhash_key and
> b_length fields in _xfs_buf_obj_cmp() for lockless lookup
> concurrency.

Hmm, this contradicts the comment on top of xfs_buf, which explicitly
wants the lock and count in the semaphore to stay in the first cache
line.  These, similar to the count that already is in the cacheline
and the newly moved lock (which would still keep the semaphore partial
layout) are modified for the uncontended lookup there.  Note that
since the comment was written b_sema actually moved entirely into
the first cache line, and this patch keeps it there, nicely aligning
b_lru_ref on my x86_64 no-debug config.

Now I'm usually pretty bad about these cacheline micro-optimizations
and I'm talking to the person who wrote that comment here, so that
rationale might not make sense, but then the comment doesn't either.

I'm kinda tempted to just stick to the rationale there for now and then
let someone smarter than me optimize the layout for the new world order.

* Re: [PATCH 2/2] xfs: fix buffer lookup vs release race
From: Dave Chinner @ 2025-01-15 11:21 UTC
To: Christoph Hellwig
Cc: Carlos Maiolino, Darrick J. Wong, Dave Chinner, linux-xfs

On Wed, Jan 15, 2025 at 06:38:00AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 14, 2025 at 07:55:30AM +1100, Dave Chinner wrote:
> > The idea behind the initial cacheline layout is that it should stay
> > read-only as much as possible so that cache lookups can walk the
> > buffer without causing shared/exclusive cacheline contention with
> > existing buffer users.
> >
> > This was really important back in the days when the cache used a
> > rb-tree (i.e. the rbnode pointers dominated lookup profiles), and
> > it's still likely important with the rhashtable on large caches.
> >
> > i.e. Putting a spinlock in that first cache line will result in
> > lookups and shrinker walks having cacheline contention as the
> > shrinker needs exclusive access for the spin lock, whilst the lookup
> > walk needs shared access for the b_rhash_head, b_rhash_key and
> > b_length fields in _xfs_buf_obj_cmp() for lockless lookup
> > concurrency.
>
> Hmm, this contradicts the comment on top of xfs_buf, which explicitly
> wants the lock and count in the semaphore to stay in the first cache
> line.

The semaphore, yes, because locking the buffer is something the fast
path lookup does, and buffer locks are rarely contended.

Shrinkers, OTOH, work on the b_lru_ref count and use the b_lock spin
lock right up until the point that the buffer is going to be
reclaimed.  These are not shared with the cache lines accessed by
lookups.

Indeed, it looks to me like the historic placing of the b_lru_ref on
the first cacheline is now incorrect, because it is no longer modified
during lookup - we moved that to the lookup callers a long time ago.
i.e. shrinker reclaim shouldn't touch the first cacheline until it is
going to reclaim the buffer.  A racing lookup at this point is also
very rare, so the fact it modifies the first cacheline of the buffer
is fine - it's going to need that exclusive to remove it from the
cache, anyway.

IOWs, the current separation largely keeps the lookup fast path and
shrinker reclaim operating on different cachelines in the same buffer
object, and hence they don't interfere with each other.

However, the change to use the b_lock and a non-atomic hold count
means that every time a shrinker scans a buffer - even before looking
at the lru ref count - it will pull the first cache line exclusive due
to the unconditional spin lock attempt it now makes.

When we are under tight memory pressure, only the frequently
referenced buffers will stay in memory (hence lookup hits them), and
they will be scanned by reclaim just as frequently as they are
accessed by the filesystem to keep them referenced and on the LRUs...

> These, similar to the count that already is in the cacheline
> and the newly moved lock (which would still keep the semaphore partial
> layout) are modified for the uncontended lookup there.  Note that
> since the comment was written b_sema actually moved entirely into
> the first cache line, and this patch keeps it there, nicely aligning
> b_lru_ref on my x86_64 no-debug config.

The comment was written back in the days of the rbtree based index,
where all we could fit on the first cacheline was the rbnode, the
lookup critical fields (daddr, length, flags), the buffer data offset
(long gone) and the part of the semaphore structure involved in
locking the semaphore...

While the code may not exactly match the comment anymore, the comment
is actually still valid and correct and we should be fixing the code
to match the comment, not making the situation worse...

> Now I'm usually pretty bad about these cacheline micro-optimizations
> and I'm talking to the person who wrote that comment here, so that
> rationale might not make sense, but then the comment doesn't either.
>
> I'm kinda tempted to just stick to the rationale there for now and then
> let someone smarter than me optimize the layout for the new world order.

I'd just leave b_lock where it is for now - if it is now going to be
contended between lookup and reclaim, we want it isolated to a
cacheline that minimises contention with other lookup related data....

-Dave.
-- 
Dave Chinner
david@fromorbit.com
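
For context on the lockref idea both cover letters point at:
include/linux/lockref.h packs a spinlock and a reference count into one
aligned word pair so the common get/put can often be done with a single
cmpxchg over both halves, only falling back to the spinlock under
contention.  A rough userspace analogue of just that locked slow path
(names are mine; the cmpxchg fast path is omitted):

#include <pthread.h>
#include <stdbool.h>

struct lockref_sketch {
        pthread_spinlock_t      lock;
        int                     count;
};

/* analogue of lockref_get_not_zero(): same semantics as the b_hold
 * scheme in this series, one lock round trip per get */
static bool sketch_get_not_zero(struct lockref_sketch *lr)
{
        bool got = false;

        pthread_spin_lock(&lr->lock);
        if (lr->count > 0) {
                lr->count++;
                got = true;
        }
        pthread_spin_unlock(&lr->lock);
        return got;
}

/* analogue of lockref_put_or_lock(): returns with the lock still held
 * when the caller is dropping the last reference */
static bool sketch_put_or_lock(struct lockref_sketch *lr)
{
        pthread_spin_lock(&lr->lock);
        if (lr->count <= 1)
                return false;   /* last ref: caller tears down, then unlocks */
        lr->count--;
        pthread_spin_unlock(&lr->lock);
        return true;
}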

* Re: fix buffer refcount races
From: Darrick J. Wong @ 2025-01-13  5:08 UTC
To: Christoph Hellwig; +Cc: Carlos Maiolino, Dave Chinner, linux-xfs

On Mon, Jan 13, 2025 at 05:24:25AM +0100, Christoph Hellwig wrote:
> Hi all,
>
> this series fixes two races in buffer refcount handling, that I've
> so far not actually seen in real life, but only found while reading
> through the buffer cache code to understand some of the stranger looking
> locking decisions.
>
> One can cause a buffer about to be freed to be returned from
> xfs_buf_insert.  I think this one is extremely unlikely to be hit,
> as it requires the buffer to not be present for the initial lookup,
> but already being evicted when trying to add the new buffer.  But
> at least the fix is trivial.
>
> The second causes buffer lookups to be missed when moving to the LRU.
> This might actually be able to trigger the first one, but otherwise
> just means we're doing a pass through insert which will find it.
> For pure lookups using xfs_buf_incore it could cause us to miss buffer
> invalidation.  The fix for that is bigger and has bigger implications
> because it not requires all b_hold increments to be done under d_lock.

Just to be clear, should this sentence say
"...because it *now* requires"?

> This causes more contention, but as releasing the buffer always takes
> the lock it can't be too horrible.  I also have a only minimally
> tested series to switch it over to a lockref here:
>
> http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-buffer-locking

Will take a look; some of those patches look familiar. ;)

--D

* Re: fix buffer refcount races
From: Christoph Hellwig @ 2025-01-13  5:14 UTC
To: Darrick J. Wong
Cc: Christoph Hellwig, Carlos Maiolino, Dave Chinner, linux-xfs

On Sun, Jan 12, 2025 at 09:08:46PM -0800, Darrick J. Wong wrote:
> > This might actually be able to trigger the first one, but otherwise
> > just means we're doing a pass through insert which will find it.
> > For pure lookups using xfs_buf_incore it could cause us to miss buffer
> > invalidation.  The fix for that is bigger and has bigger implications
> > because it not requires all b_hold increments to be done under d_lock.
>
> Just to be clear, should this sentence say
> "...because it *now* requires"?

Yes.

> > This causes more contention, but as releasing the buffer always takes
> > the lock it can't be too horrible.  I also have a only minimally
> > tested series to switch it over to a lockref here:
> >
> > http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-buffer-locking
>
> Will take a look; some of those patches look familiar. ;)

Well, the first batch after these fixes are the buffer cleanups I
reposted last week that you've mostly but not entirely reviewed.  The
remainder only really depends on them for changed context.

* Re: fix buffer refcount races
From: Darrick J. Wong @ 2025-01-13  7:13 UTC
To: Christoph Hellwig; +Cc: Carlos Maiolino, Dave Chinner, linux-xfs

On Mon, Jan 13, 2025 at 06:14:35AM +0100, Christoph Hellwig wrote:
> On Sun, Jan 12, 2025 at 09:08:46PM -0800, Darrick J. Wong wrote:
> > Just to be clear, should this sentence say
> > "...because it *now* requires"?
>
> Yes.

<nod>

> > Will take a look; some of those patches look familiar. ;)
>
> Well, the first batch after these fixes are the buffer cleanups I
> reposted last week that you've mostly but not entirely reviewed.  The
> remainder only really depends on them for changed context.

That should be done now. :)

--D
* fix buffer refcount races v2
From: Christoph Hellwig @ 2025-01-16  6:01 UTC
To: Carlos Maiolino; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs
Hi all,
this series fixes two races in buffer refcount handling that I've
so far not actually seen in real life, but only found while reading
through the buffer cache code to understand some of the stranger looking
locking decisions.
One can cause a buffer about to be freed to be returned from
xfs_buf_insert. I think this one is extremely unlikely to be hit,
as it requires the buffer to not be present for the initial lookup,
but already being evicted when trying to add the new buffer. But
at least the fix is trivial.
The second causes buffer lookups to be missed when moving to the LRU.
This might actually be able to trigger the first one, but otherwise
just means we're doing a pass through insert which will find it.
For pure lookups using xfs_buf_incore it could cause us to miss buffer
invalidation. The fix for that is bigger and has bigger implications
because it now requires all b_hold increments to be done under b_lock.
This causes more contention, but as releasing the buffer always takes
the lock it can't be too horrible. I also have an only minimally
tested series to switch it over to a lockref here:
http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-buffer-locking
Changes since v1:
- don't move b_lock
Diffstat:
xfs_buf.c | 94 ++++++++++++++++++++++++++++++------------------------------
xfs_buf.h | 2 -
xfs_trace.h | 10 +++---
3 files changed, 54 insertions(+), 52 deletions(-)

* [PATCH 2/2] xfs: fix buffer lookup vs release race
From: Christoph Hellwig @ 2025-01-16  6:01 UTC
To: Carlos Maiolino; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

Since commit 298f34224506 ("xfs: lockless buffer lookup") the buffer
lookup fastpath is done without a hash-wide lock (then pag_buf_lock, now
bc_lock) and only under RCU protection.  But this means that nothing
serializes lookups against the temporary 0 reference count for buffers
that are added to the LRU after dropping the last regular reference,
and a concurrent lookup would fail to find them.

Fix this by doing all b_hold modifications under b_lock.  We're already
doing this for release so this "only" ~ doubles the b_lock round trips.
We'll later look into the lockref infrastructure to optimize the number
of lock round trips again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_buf.c   | 93 ++++++++++++++++++++++++----------------------
 fs/xfs/xfs_buf.h   |  2 +-
 fs/xfs/xfs_trace.h | 10 ++---
 3 files changed, 54 insertions(+), 51 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index f80e39fde53b..dc219678003c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -133,15 +133,6 @@ __xfs_buf_ioacct_dec(
 	}
 }
 
-static inline void
-xfs_buf_ioacct_dec(
-	struct xfs_buf	*bp)
-{
-	spin_lock(&bp->b_lock);
-	__xfs_buf_ioacct_dec(bp);
-	spin_unlock(&bp->b_lock);
-}
-
 /*
  * When we mark a buffer stale, we remove the buffer from the LRU and clear the
  * b_lru_ref count so that the buffer is freed immediately when the buffer
@@ -177,9 +168,9 @@ xfs_buf_stale(
 	atomic_set(&bp->b_lru_ref, 0);
 	if (!(bp->b_state & XFS_BSTATE_DISPOSE) &&
 	    (list_lru_del_obj(&bp->b_target->bt_lru, &bp->b_lru)))
-		atomic_dec(&bp->b_hold);
+		bp->b_hold--;
 
-	ASSERT(atomic_read(&bp->b_hold) >= 1);
+	ASSERT(bp->b_hold >= 1);
 	spin_unlock(&bp->b_lock);
 }
 
@@ -238,14 +229,14 @@ _xfs_buf_alloc(
 	 */
 	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
 
-	atomic_set(&bp->b_hold, 1);
+	spin_lock_init(&bp->b_lock);
+	bp->b_hold = 1;
 	atomic_set(&bp->b_lru_ref, 1);
 	init_completion(&bp->b_iowait);
 	INIT_LIST_HEAD(&bp->b_lru);
 	INIT_LIST_HEAD(&bp->b_list);
 	INIT_LIST_HEAD(&bp->b_li_list);
 	sema_init(&bp->b_sema, 0); /* held, no waiters */
-	spin_lock_init(&bp->b_lock);
 	bp->b_target = target;
 	bp->b_mount = target->bt_mount;
 	bp->b_flags = flags;
@@ -589,6 +580,20 @@ xfs_buf_find_lock(
 	return 0;
 }
 
+static bool
+xfs_buf_try_hold(
+	struct xfs_buf	*bp)
+{
+	spin_lock(&bp->b_lock);
+	if (bp->b_hold == 0) {
+		spin_unlock(&bp->b_lock);
+		return false;
+	}
+	bp->b_hold++;
+	spin_unlock(&bp->b_lock);
+	return true;
+}
+
 static inline int
 xfs_buf_lookup(
 	struct xfs_buf_cache	*bch,
@@ -601,7 +606,7 @@ xfs_buf_lookup(
 
 	rcu_read_lock();
 	bp = rhashtable_lookup(&bch->bc_hash, map, xfs_buf_hash_params);
-	if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
+	if (!bp || !xfs_buf_try_hold(bp)) {
 		rcu_read_unlock();
 		return -ENOENT;
 	}
@@ -664,7 +669,7 @@ xfs_buf_find_insert(
 		spin_unlock(&bch->bc_lock);
 		goto out_free_buf;
 	}
-	if (bp && atomic_inc_not_zero(&bp->b_hold)) {
+	if (bp && xfs_buf_try_hold(bp)) {
 		/* found an existing buffer */
 		spin_unlock(&bch->bc_lock);
 		error = xfs_buf_find_lock(bp, flags);
@@ -1043,7 +1048,10 @@ xfs_buf_hold(
 	struct xfs_buf	*bp)
 {
 	trace_xfs_buf_hold(bp, _RET_IP_);
-	atomic_inc(&bp->b_hold);
+
+	spin_lock(&bp->b_lock);
+	bp->b_hold++;
+	spin_unlock(&bp->b_lock);
 }
 
 static void
@@ -1051,10 +1059,15 @@ xfs_buf_rele_uncached(
 	struct xfs_buf	*bp)
 {
 	ASSERT(list_empty(&bp->b_lru));
-	if (atomic_dec_and_test(&bp->b_hold)) {
-		xfs_buf_ioacct_dec(bp);
-		xfs_buf_free(bp);
+
+	spin_lock(&bp->b_lock);
+	if (--bp->b_hold) {
+		spin_unlock(&bp->b_lock);
+		return;
 	}
+	__xfs_buf_ioacct_dec(bp);
+	spin_unlock(&bp->b_lock);
+	xfs_buf_free(bp);
 }
 
 static void
@@ -1064,51 +1077,40 @@ xfs_buf_rele_cached(
 	struct xfs_buftarg	*btp = bp->b_target;
 	struct xfs_perag	*pag = bp->b_pag;
 	struct xfs_buf_cache	*bch = xfs_buftarg_buf_cache(btp, pag);
-	bool			release;
 	bool			freebuf = false;
 
 	trace_xfs_buf_rele(bp, _RET_IP_);
 
-	ASSERT(atomic_read(&bp->b_hold) > 0);
-
-	/*
-	 * We grab the b_lock here first to serialise racing xfs_buf_rele()
-	 * calls. The pag_buf_lock being taken on the last reference only
-	 * serialises against racing lookups in xfs_buf_find(). IOWs, the second
-	 * to last reference we drop here is not serialised against the last
-	 * reference until we take bp->b_lock. Hence if we don't grab b_lock
-	 * first, the last "release" reference can win the race to the lock and
-	 * free the buffer before the second-to-last reference is processed,
-	 * leading to a use-after-free scenario.
-	 */
 	spin_lock(&bp->b_lock);
-	release = atomic_dec_and_lock(&bp->b_hold, &bch->bc_lock);
-	if (!release) {
+	ASSERT(bp->b_hold >= 1);
+	if (bp->b_hold > 1) {
 		/*
 		 * Drop the in-flight state if the buffer is already on the LRU
 		 * and it holds the only reference. This is racy because we
 		 * haven't acquired the pag lock, but the use of _XBF_IN_FLIGHT
 		 * ensures the decrement occurs only once per-buf.
 		 */
-		if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
+		if (--bp->b_hold == 1 && !list_empty(&bp->b_lru))
 			__xfs_buf_ioacct_dec(bp);
 		goto out_unlock;
 	}
 
-	/* the last reference has been dropped ... */
+	/* we are asked to drop the last reference */
+	spin_lock(&bch->bc_lock);
 	__xfs_buf_ioacct_dec(bp);
 	if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
 		/*
-		 * If the buffer is added to the LRU take a new reference to the
+		 * If the buffer is added to the LRU, keep the reference to the
		 * buffer for the LRU and clear the (now stale) dispose list
-		 * state flag
+		 * state flag, else drop the reference.
		 */
-		if (list_lru_add_obj(&btp->bt_lru, &bp->b_lru)) {
+		if (list_lru_add_obj(&btp->bt_lru, &bp->b_lru))
 			bp->b_state &= ~XFS_BSTATE_DISPOSE;
-			atomic_inc(&bp->b_hold);
-		}
+		else
+			bp->b_hold--;
 		spin_unlock(&bch->bc_lock);
 	} else {
+		bp->b_hold--;
 		/*
 		 * most of the time buffers will already be removed from the
 		 * LRU, so optimise that case by checking for the
@@ -1863,13 +1865,14 @@ xfs_buftarg_drain_rele(
 	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
 	struct list_head	*dispose = arg;
 
-	if (atomic_read(&bp->b_hold) > 1) {
+	if (!spin_trylock(&bp->b_lock))
+		return LRU_SKIP;
+	if (bp->b_hold > 1) {
 		/* need to wait, so skip it this pass */
+		spin_unlock(&bp->b_lock);
 		trace_xfs_buf_drain_buftarg(bp, _RET_IP_);
 		return LRU_SKIP;
 	}
-	if (!spin_trylock(&bp->b_lock))
-		return LRU_SKIP;
 
 	/*
 	 * clear the LRU reference count so the buffer doesn't get
@@ -2208,7 +2211,7 @@ xfs_buf_delwri_queue(
 	 */
 	bp->b_flags |= _XBF_DELWRI_Q;
 	if (list_empty(&bp->b_list)) {
-		atomic_inc(&bp->b_hold);
+		xfs_buf_hold(bp);
 		list_add_tail(&bp->b_list, list);
 	}
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 3d56bc7a35cc..9ccc6f93f636 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -172,7 +172,7 @@ struct xfs_buf {
 
 	xfs_daddr_t		b_rhash_key;	/* buffer cache index */
 	int			b_length;	/* size of buffer in BBs */
-	atomic_t		b_hold;		/* reference count */
+	unsigned int		b_hold;		/* reference count */
 	atomic_t		b_lru_ref;	/* lru reclaim ref count */
 	xfs_buf_flags_t		b_flags;	/* status flags */
 	struct semaphore	b_sema;		/* semaphore for lockables */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 4fe689410eb6..b29462363b81 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -498,7 +498,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
 		__entry->dev = bp->b_target->bt_dev;
 		__entry->bno = xfs_buf_daddr(bp);
 		__entry->nblks = bp->b_length;
-		__entry->hold = atomic_read(&bp->b_hold);
+		__entry->hold = bp->b_hold;
 		__entry->pincount = atomic_read(&bp->b_pin_count);
 		__entry->lockval = bp->b_sema.count;
 		__entry->flags = bp->b_flags;
@@ -569,7 +569,7 @@ DECLARE_EVENT_CLASS(xfs_buf_flags_class,
 		__entry->bno = xfs_buf_daddr(bp);
 		__entry->length = bp->b_length;
 		__entry->flags = flags;
-		__entry->hold = atomic_read(&bp->b_hold);
+		__entry->hold = bp->b_hold;
 		__entry->pincount = atomic_read(&bp->b_pin_count);
 		__entry->lockval = bp->b_sema.count;
 		__entry->caller_ip = caller_ip;
@@ -612,7 +612,7 @@ TRACE_EVENT(xfs_buf_ioerror,
 		__entry->dev = bp->b_target->bt_dev;
 		__entry->bno = xfs_buf_daddr(bp);
 		__entry->length = bp->b_length;
-		__entry->hold = atomic_read(&bp->b_hold);
+		__entry->hold = bp->b_hold;
 		__entry->pincount = atomic_read(&bp->b_pin_count);
 		__entry->lockval = bp->b_sema.count;
 		__entry->error = error;
@@ -656,7 +656,7 @@ DECLARE_EVENT_CLASS(xfs_buf_item_class,
 		__entry->buf_bno = xfs_buf_daddr(bip->bli_buf);
 		__entry->buf_len = bip->bli_buf->b_length;
 		__entry->buf_flags = bip->bli_buf->b_flags;
-		__entry->buf_hold = atomic_read(&bip->bli_buf->b_hold);
+		__entry->buf_hold = bip->bli_buf->b_hold;
 		__entry->buf_pincount = atomic_read(&bip->bli_buf->b_pin_count);
 		__entry->buf_lockval = bip->bli_buf->b_sema.count;
 		__entry->li_flags = bip->bli_item.li_flags;
@@ -4978,7 +4978,7 @@ DECLARE_EVENT_CLASS(xfbtree_buf_class,
 		__entry->xfino = file_inode(xfbt->target->bt_file)->i_ino;
 		__entry->bno = xfs_buf_daddr(bp);
 		__entry->nblks = bp->b_length;
-		__entry->hold = atomic_read(&bp->b_hold);
+		__entry->hold = bp->b_hold;
 		__entry->pincount = atomic_read(&bp->b_pin_count);
 		__entry->lockval = bp->b_sema.count;
 		__entry->flags = bp->b_flags;
-- 
2.45.2
* Re: [PATCH 2/2] xfs: fix buffer lookup vs release race
  2025-01-16  6:01 ` [PATCH 2/2] xfs: fix buffer lookup vs release race Christoph Hellwig
@ 2025-01-25  7:35   ` Lai, Yi
  2025-01-26  5:50     ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Lai, Yi @ 2025-01-25  7:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Carlos Maiolino, Darrick J. Wong, Dave Chinner, linux-xfs, yi1.lai

Hi Christoph Hellwig,

Greetings!

Using Syzkaller, I found a possible deadlock in xfs_buf_get_map in
linux-next tag next-20250121.

After bisection, the first bad commit is:
"
ee10f6fcdb96 xfs: fix buffer lookup vs release race
"

All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250124_181030_xfs_buf_get_map

Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250124_181030_xfs_buf_get_map/repro.c

Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250124_181030_xfs_buf_get_map/repro.prog

Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250124_181030_xfs_buf_get_map/repro.report

Kconfig (make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250124_181030_xfs_buf_get_map/kconfig_origin

Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250124_181030_xfs_buf_get_map/bisect_info.log

bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250124_181030_xfs_buf_get_map/bzImage_f066b5a6c7a06adfb666b7652cc99b4ff264f4ed

Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250124_181030_xfs_buf_get_map/f066b5a6c7a06adfb666b7652cc99b4ff264f4ed_dmesg.log

"
[   57.155207] ======================================================
[   57.155506] WARNING: possible circular locking dependency detected
[   57.155814] 6.13.0-next-20250121-f066b5a6c7a0-dirty #1 Not tainted
[   57.156126] ------------------------------------------------------
[   57.156434] repro/749 is trying to acquire lock:
[   57.156672] ffff888023a9daa0 (&bp->b_lock){+.+.}-{3:3}, at: xfs_buf_get_map+0x1ba3/0x2fb0
[   57.157105]
[   57.157105] but task is already holding lock:
[   57.157389] ffff88801f182980 (&bch->bc_lock){+.+.}-{3:3}, at: xfs_buf_get_map+0x156f/0x2fb0
[   57.157816]
[   57.157816] which lock already depends on the new lock.
[   57.157816]
[   57.158210]
[   57.158210] the existing dependency chain (in reverse order) is:
[   57.158575]
[   57.158575] -> #1 (&bch->bc_lock){+.+.}-{3:3}:
[   57.158880]        _raw_spin_lock+0x38/0x50
[   57.159114]        xfs_buf_rele+0x2c1/0x1670
[   57.159331]        xfs_trans_brelse+0x385/0x410
[   57.159569]        xfs_imap_lookup+0x38d/0x5a0
[   57.159803]        xfs_imap+0x668/0xc80
[   57.160006]        xfs_iget+0x875/0x2dd0
[   57.160219]        xfs_mountfs+0x121d/0x2130
[   57.160444]        xfs_fs_fill_super+0x12bb/0x1ed0
[   57.160697]        get_tree_bdev_flags+0x3d8/0x6c0
[   57.160952]        get_tree_bdev+0x29/0x40
[   57.161166]        xfs_fs_get_tree+0x26/0x30
[   57.161389]        vfs_get_tree+0x9e/0x390
[   57.161600]        path_mount+0x707/0x2000
[   57.161820]        __x64_sys_mount+0x2a8/0x330
[   57.162055]        x64_sys_call+0x1e1d/0x2140
[   57.162288]        do_syscall_64+0x6d/0x140
[   57.162512]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   57.162796]
[   57.162796] -> #0 (&bp->b_lock){+.+.}-{3:3}:
[   57.163093]        __lock_acquire+0x2ff8/0x5d60
[   57.163329]        lock_acquire+0x1bd/0x550
[   57.163541]        _raw_spin_lock+0x38/0x50
[   57.163756]        xfs_buf_get_map+0x1ba3/0x2fb0
[   57.163968]        xfs_buf_read_map+0xe8/0xaa0
[   57.164191]        xfs_trans_read_buf_map+0x12f/0x990
[   57.164452]        xfs_read_agf+0x259/0x5a0
[   57.164654]        xfs_alloc_read_agf+0x127/0xb50
[   57.164889]        xfs_ioc_trim+0x1369/0x22f0
[   57.165116]        xfs_file_ioctl+0x18a0/0x2270
[   57.165349]        __x64_sys_ioctl+0x1ba/0x220
[   57.165583]        x64_sys_call+0x1227/0x2140
[   57.165810]        do_syscall_64+0x6d/0x140
[   57.166031]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   57.166314]
[   57.166314] other info that might help us debug this:
[   57.166314]
[   57.166677]  Possible unsafe locking scenario:
[   57.166677]
[   57.166973]        CPU0                    CPU1
[   57.167198]        ----                    ----
[   57.167425]   lock(&bch->bc_lock);
[   57.167604]                                lock(&bp->b_lock);
[   57.167890]                                lock(&bch->bc_lock);
[   57.168184]   lock(&bp->b_lock);
[   57.168355]
[   57.168355]  *** DEADLOCK ***
[   57.168355]
[   57.168623] 1 lock held by repro/749:
[   57.168791]  #0: ffff88801f182980 (&bch->bc_lock){+.+.}-{3:3}, at: xfs_buf_get_map+0x156f/0x2fb0
[   57.169238]
[   57.169238] stack backtrace:
[   57.169464] CPU: 1 UID: 0 PID: 749 Comm: repro Not tainted 6.13.0-next-20250121-f066b5a6c7a0-dirty #1
[   57.169915] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/014
[   57.170448] Call Trace:
[   57.170585]  <TASK>
[   57.170703]  dump_stack_lvl+0xea/0x150
[   57.170915]  dump_stack+0x19/0x20
[   57.171096]  print_circular_bug+0x47f/0x750
[   57.171320]  check_noncircular+0x2f4/0x3e0
[   57.171536]  ? __pfx_check_noncircular+0x10/0x10
[   57.171777]  ? register_lock_class+0xbf/0x10f0
[   57.172009]  ? __pfx_lockdep_lock+0x10/0x10
[   57.172223]  ? xfs_trans_read_buf_map+0x12f/0x990
[   57.172486]  __lock_acquire+0x2ff8/0x5d60
[   57.172699]  ? __pfx___lock_acquire+0x10/0x10
[   57.172922]  ? __pfx___lock_acquire+0x10/0x10
[   57.173144]  ? __pfx_mark_lock.part.0+0x10/0x10
[   57.173375]  lock_acquire+0x1bd/0x550
[   57.173564]  ? xfs_buf_get_map+0x1ba3/0x2fb0
[   57.173782]  ? __pfx_lock_acquire+0x10/0x10
[   57.173997]  ? lock_release+0x441/0x870
[   57.174196]  ? __pfx_lock_release+0x10/0x10
[   57.174414]  _raw_spin_lock+0x38/0x50
[   57.174606]  ? xfs_buf_get_map+0x1ba3/0x2fb0
[   57.174822]  xfs_buf_get_map+0x1ba3/0x2fb0
[   57.175034]  ? __pfx_xfs_buf_get_map+0x10/0x10
[   57.175264]  ? kasan_save_alloc_info+0x3c/0x50
[   57.175505]  ? kmem_cache_alloc_noprof+0x13e/0x440
[   57.175754]  ? xfs_trans_alloc+0x86/0x860
[   57.175970]  ? xfs_trans_alloc_empty+0x92/0xc0
[   57.176199]  ? __x64_sys_ioctl+0x1ba/0x220
[   57.176429]  ? x64_sys_call+0x1227/0x2140
[   57.176641]  ? do_syscall_64+0x6d/0x140
[   57.176841]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   57.177115]  xfs_buf_read_map+0xe8/0xaa0
[   57.177316]  ? xfs_read_agf+0x259/0x5a0
[   57.177518]  ? __pfx_xfs_buf_read_map+0x10/0x10
[   57.177749]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
[   57.178009]  ? xfs_trans_buf_item_match+0x257/0x300
[   57.178263]  xfs_trans_read_buf_map+0x12f/0x990
[   57.178499]  ? xfs_read_agf+0x259/0x5a0
[   57.178701]  ? __pfx_xfs_trans_read_buf_map+0x10/0x10
[   57.178962]  ? xfs_trans_alloc+0x86/0x860
[   57.179175]  xfs_read_agf+0x259/0x5a0
[   57.179368]  ? rcu_is_watching+0x19/0xc0
[   57.179579]  ? __pfx_xfs_read_agf+0x10/0x10
[   57.179806]  ? xfs_trans_alloc_empty+0x92/0xc0
[   57.180039]  xfs_alloc_read_agf+0x127/0xb50
[   57.180252]  ? __pfx_xfs_alloc_read_agf+0x10/0x10
[   57.180495]  ? xfs_log_force+0x351/0xa40
[   57.180702]  xfs_ioc_trim+0x1369/0x22f0
[   57.180910]  ? __pfx_xfs_ioc_trim+0x10/0x10
[   57.181106]  ? __kasan_check_read+0x15/0x20
[   57.181317]  ? mark_lock.part.0+0xf3/0x17b0
[   57.181540]  ? __kasan_check_read+0x15/0x20
[   57.181760]  ? __kasan_check_read+0x15/0x20
[   57.181983]  ? mark_lock.part.0+0xf3/0x17b0
[   57.182201]  ? mark_lock.part.0+0xf3/0x17b0
[   57.182420]  ? __pfx_mark_lock.part.0+0x10/0x10
[   57.182655]  xfs_file_ioctl+0x18a0/0x2270
[   57.182859]  ? xfs_file_ioctl+0x18a0/0x2270
[   57.183077]  ? __pfx_xfs_file_ioctl+0x10/0x10
[   57.183306]  ? __kasan_check_read+0x15/0x20
[   57.183526]  ? __lock_acquire+0xdb8/0x5d60
[   57.183738]  ? __sanitizer_cov_trace_switch+0x58/0xa0
[   57.184004]  ? do_vfs_ioctl+0x4fa/0x1920
[   57.184212]  ? __pfx___lock_acquire+0x10/0x10
[   57.184441]  ? __pfx_do_vfs_ioctl+0x10/0x10
[   57.184662]  ? __this_cpu_preempt_check+0x21/0x30
[   57.184911]  ? lock_release+0x441/0x870
[   57.185114]  ? __pfx_lock_release+0x10/0x10
[   57.185334]  ? __fget_files+0x1fb/0x3a0
[   57.185533]  ? __pfx_xfs_file_ioctl+0x10/0x10
[   57.185755]  __x64_sys_ioctl+0x1ba/0x220
[   57.185964]  x64_sys_call+0x1227/0x2140
[   57.186170]  do_syscall_64+0x6d/0x140
[   57.186369]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   57.186631] RIP: 0033:0x7f9490a3ee5d
[   57.186829] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c88
[   57.187726] RSP: 002b:00007f94909fedf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   57.188108] RAX: ffffffffffffffda RBX: 00007f94909ff640 RCX: 00007f9490a3ee5d
[   57.188460] RDX: 0000000020000000 RSI: 00000000c0185879 RDI: 0000000000000004
[   57.188813] RBP: 00007f94909fee20 R08: 00007ffe62e59aaf R09: 0000000000000000
[   57.189162] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f94909ff640
[   57.189518] R13: 0000000000000006 R14: 00007f9490a9f560 R15: 0000000000000000
[   57.189876]  </TASK>
"

Hope this could be insightful to you.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem, or
if you already have a reproduction environment, please ignore the
following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64, and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You can change the bzImage_xxx as you want
  // You may need to remove the line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for a different qemu version
You can use the command below to log in; there is no password for root.
ssh -p 10023 root@localhost

After logging in to the VM successfully, you can transfer the reproducer
binary to it as follows and reproduce the problem there:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

To get the bzImage for the target kernel:
Use the target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage    // x should be equal to or less than the number of CPUs your PC has

Point start3.sh above at the resulting bzImage to load the target kernel
in the VM.

Tips:
If you already have qemu-system-x86_64, please ignore the info below.
To install qemu v7.1.0:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install

On Thu, Jan 16, 2025 at 07:01:42AM +0100, Christoph Hellwig wrote:
> Since commit 298f34224506 ("xfs: lockless buffer lookup") the buffer
> lookup fastpath is done without a hash-wide lock (then pag_buf_lock, now
> bc_lock) and only under RCU protection. But this means that nothing
> serializes lookups against the temporary 0 reference count for buffers
> that are added to the LRU after dropping the last regular reference,
> and a concurrent lookup would fail to find them.
> 
> Fix this by doing all b_hold modifications under b_lock. We're already
> doing this for release, so this "only" roughly doubles the b_lock round
> trips. We'll later look into the lockref infrastructure to optimize the
> number of lock round trips again.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/xfs/xfs_buf.c   | 93 ++++++++++++++++++++++++----------------------
>  fs/xfs/xfs_buf.h   |  2 +-
>  fs/xfs/xfs_trace.h | 10 ++---
>  3 files changed, 54 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f80e39fde53b..dc219678003c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -133,15 +133,6 @@ __xfs_buf_ioacct_dec(
>  	}
>  }
>  
> -static inline void
> -xfs_buf_ioacct_dec(
> -	struct xfs_buf	*bp)
> -{
> -	spin_lock(&bp->b_lock);
> -	__xfs_buf_ioacct_dec(bp);
> -	spin_unlock(&bp->b_lock);
> -}
> -
>  /*
>   * When we mark a buffer stale, we remove the buffer from the LRU and clear the
>   * b_lru_ref count so that the buffer is freed immediately when the buffer
> @@ -177,9 +168,9 @@ xfs_buf_stale(
>  	atomic_set(&bp->b_lru_ref, 0);
>  	if (!(bp->b_state & XFS_BSTATE_DISPOSE) &&
>  	    (list_lru_del_obj(&bp->b_target->bt_lru, &bp->b_lru)))
> -		atomic_dec(&bp->b_hold);
> +		bp->b_hold--;
>  
> -	ASSERT(atomic_read(&bp->b_hold) >= 1);
> +	ASSERT(bp->b_hold >= 1);
>  	spin_unlock(&bp->b_lock);
>  }
>  
> @@ -238,14 +229,14 @@ _xfs_buf_alloc(
>  	 */
>  	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
>  
> -	atomic_set(&bp->b_hold, 1);
> +	spin_lock_init(&bp->b_lock);
> +	bp->b_hold = 1;
>  	atomic_set(&bp->b_lru_ref, 1);
>  	init_completion(&bp->b_iowait);
>  	INIT_LIST_HEAD(&bp->b_lru);
>  	INIT_LIST_HEAD(&bp->b_list);
>  	INIT_LIST_HEAD(&bp->b_li_list);
>  	sema_init(&bp->b_sema, 0); /* held, no waiters */
> -	spin_lock_init(&bp->b_lock);
>  	bp->b_target = target;
>  	bp->b_mount = target->bt_mount;
>  	bp->b_flags = flags;
> @@ -589,6 +580,20 @@ xfs_buf_find_lock(
>  	return 0;
>  }
>  
> +static bool
> +xfs_buf_try_hold(
> +	struct xfs_buf	*bp)
> +{
> +	spin_lock(&bp->b_lock);
> +	if (bp->b_hold == 0) {
> +		spin_unlock(&bp->b_lock);
> +		return false;
> +	}
> +	bp->b_hold++;
> +	spin_unlock(&bp->b_lock);
> +	return true;
> +}
> +
>  static inline int
>  xfs_buf_lookup(
>  	struct xfs_buf_cache	*bch,
> @@ -601,7 +606,7 @@
>  
>  	rcu_read_lock();
>  	bp = rhashtable_lookup(&bch->bc_hash, map, xfs_buf_hash_params);
> -	if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
> +	if (!bp || !xfs_buf_try_hold(bp)) {
>  		rcu_read_unlock();
>  		return -ENOENT;
>  	}
> @@ -664,7 +669,7 @@ xfs_buf_find_insert(
>  		spin_unlock(&bch->bc_lock);
>  		goto out_free_buf;
>  	}
> -	if (bp && atomic_inc_not_zero(&bp->b_hold)) {
> +	if (bp && xfs_buf_try_hold(bp)) {
>  		/* found an existing buffer */
>  		spin_unlock(&bch->bc_lock);
>  		error = xfs_buf_find_lock(bp, flags);
> @@ -1043,7 +1048,10 @@ xfs_buf_hold(
>  	struct xfs_buf	*bp)
>  {
>  	trace_xfs_buf_hold(bp, _RET_IP_);
> -	atomic_inc(&bp->b_hold);
> +
> +	spin_lock(&bp->b_lock);
> +	bp->b_hold++;
> +	spin_unlock(&bp->b_lock);
>  }
>  
>  static void
> @@ -1051,10 +1059,15 @@ xfs_buf_rele_uncached(
>  	struct xfs_buf	*bp)
>  {
>  	ASSERT(list_empty(&bp->b_lru));
> -	if (atomic_dec_and_test(&bp->b_hold)) {
> -		xfs_buf_ioacct_dec(bp);
> -		xfs_buf_free(bp);
> +
> +	spin_lock(&bp->b_lock);
> +	if (--bp->b_hold) {
> +		spin_unlock(&bp->b_lock);
> +		return;
>  	}
> +	__xfs_buf_ioacct_dec(bp);
> +	spin_unlock(&bp->b_lock);
> +	xfs_buf_free(bp);
>  }
>  
>  static void
> @@ -1064,51 +1077,40 @@ xfs_buf_rele_cached(
>  	struct xfs_buftarg	*btp = bp->b_target;
>  	struct xfs_perag	*pag = bp->b_pag;
>  	struct xfs_buf_cache	*bch = xfs_buftarg_buf_cache(btp, pag);
> -	bool			release;
>  	bool			freebuf = false;
>  
>  	trace_xfs_buf_rele(bp, _RET_IP_);
>  
> -	ASSERT(atomic_read(&bp->b_hold) > 0);
> -
> -	/*
> -	 * We grab the b_lock here first to serialise racing xfs_buf_rele()
> -	 * calls. The pag_buf_lock being taken on the last reference only
> -	 * serialises against racing lookups in xfs_buf_find(). IOWs, the second
> -	 * to last reference we drop here is not serialised against the last
> -	 * reference until we take bp->b_lock. Hence if we don't grab b_lock
> -	 * first, the last "release" reference can win the race to the lock and
> -	 * free the buffer before the second-to-last reference is processed,
> -	 * leading to a use-after-free scenario.
> -	 */
>  	spin_lock(&bp->b_lock);
> -	release = atomic_dec_and_lock(&bp->b_hold, &bch->bc_lock);
> -	if (!release) {
> +	ASSERT(bp->b_hold >= 1);
> +	if (bp->b_hold > 1) {
>  		/*
>  		 * Drop the in-flight state if the buffer is already on the LRU
>  		 * and it holds the only reference. This is racy because we
>  		 * haven't acquired the pag lock, but the use of _XBF_IN_FLIGHT
>  		 * ensures the decrement occurs only once per-buf.
>  		 */
> -		if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
> +		if (--bp->b_hold == 1 && !list_empty(&bp->b_lru))
>  			__xfs_buf_ioacct_dec(bp);
>  		goto out_unlock;
>  	}
>  
> -	/* the last reference has been dropped ... */
> +	/* we are asked to drop the last reference */
> +	spin_lock(&bch->bc_lock);
>  	__xfs_buf_ioacct_dec(bp);
>  	if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
>  		/*
> -		 * If the buffer is added to the LRU take a new reference to the
> +		 * If the buffer is added to the LRU, keep the reference to the
>  		 * buffer for the LRU and clear the (now stale) dispose list
> -		 * state flag
> +		 * state flag, else drop the reference.
>  		 */
> -		if (list_lru_add_obj(&btp->bt_lru, &bp->b_lru)) {
> +		if (list_lru_add_obj(&btp->bt_lru, &bp->b_lru))
>  			bp->b_state &= ~XFS_BSTATE_DISPOSE;
> -			atomic_inc(&bp->b_hold);
> -		}
> +		else
> +			bp->b_hold--;
>  		spin_unlock(&bch->bc_lock);
>  	} else {
> +		bp->b_hold--;
>  		/*
>  		 * most of the time buffers will already be removed from the
>  		 * LRU, so optimise that case by checking for the
> @@ -1863,13 +1865,14 @@ xfs_buftarg_drain_rele(
>  	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
>  	struct list_head	*dispose = arg;
>  
> -	if (atomic_read(&bp->b_hold) > 1) {
> +	if (!spin_trylock(&bp->b_lock))
> +		return LRU_SKIP;
> +	if (bp->b_hold > 1) {
>  		/* need to wait, so skip it this pass */
> +		spin_unlock(&bp->b_lock);
>  		trace_xfs_buf_drain_buftarg(bp, _RET_IP_);
>  		return LRU_SKIP;
>  	}
> -	if (!spin_trylock(&bp->b_lock))
> -		return LRU_SKIP;
>  
>  	/*
>  	 * clear the LRU reference count so the buffer doesn't get
> @@ -2208,7 +2211,7 @@ xfs_buf_delwri_queue(
>  	 */
>  	bp->b_flags |= _XBF_DELWRI_Q;
>  	if (list_empty(&bp->b_list)) {
> -		atomic_inc(&bp->b_hold);
> +		xfs_buf_hold(bp);
>  		list_add_tail(&bp->b_list, list);
>  	}
>  
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 3d56bc7a35cc..9ccc6f93f636 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -172,7 +172,7 @@ struct xfs_buf {
>  
>  	xfs_daddr_t		b_rhash_key;	/* buffer cache index */
>  	int			b_length;	/* size of buffer in BBs */
> -	atomic_t		b_hold;		/* reference count */
> +	unsigned int		b_hold;		/* reference count */
>  	atomic_t		b_lru_ref;	/* lru reclaim ref count */
>  	xfs_buf_flags_t		b_flags;	/* status flags */
>  	struct semaphore	b_sema;		/* semaphore for lockables */
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 4fe689410eb6..b29462363b81 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -498,7 +498,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
>  		__entry->dev = bp->b_target->bt_dev;
>  		__entry->bno = xfs_buf_daddr(bp);
>  		__entry->nblks = bp->b_length;
> -		__entry->hold = atomic_read(&bp->b_hold);
> +		__entry->hold = bp->b_hold;
>  		__entry->pincount = atomic_read(&bp->b_pin_count);
>  		__entry->lockval = bp->b_sema.count;
>  		__entry->flags = bp->b_flags;
> @@ -569,7 +569,7 @@ DECLARE_EVENT_CLASS(xfs_buf_flags_class,
>  		__entry->bno = xfs_buf_daddr(bp);
>  		__entry->length = bp->b_length;
>  		__entry->flags = flags;
> -		__entry->hold = atomic_read(&bp->b_hold);
> +		__entry->hold = bp->b_hold;
>  		__entry->pincount = atomic_read(&bp->b_pin_count);
>  		__entry->lockval = bp->b_sema.count;
>  		__entry->caller_ip = caller_ip;
> @@ -612,7 +612,7 @@ TRACE_EVENT(xfs_buf_ioerror,
>  		__entry->dev = bp->b_target->bt_dev;
>  		__entry->bno = xfs_buf_daddr(bp);
>  		__entry->length = bp->b_length;
> -		__entry->hold = atomic_read(&bp->b_hold);
> +		__entry->hold = bp->b_hold;
>  		__entry->pincount = atomic_read(&bp->b_pin_count);
>  		__entry->lockval = bp->b_sema.count;
>  		__entry->error = error;
> @@ -656,7 +656,7 @@ DECLARE_EVENT_CLASS(xfs_buf_item_class,
>  		__entry->buf_bno = xfs_buf_daddr(bip->bli_buf);
>  		__entry->buf_len = bip->bli_buf->b_length;
>  		__entry->buf_flags = bip->bli_buf->b_flags;
> -		__entry->buf_hold = atomic_read(&bip->bli_buf->b_hold);
> +		__entry->buf_hold = bip->bli_buf->b_hold;
>  		__entry->buf_pincount = atomic_read(&bip->bli_buf->b_pin_count);
>  		__entry->buf_lockval = bip->bli_buf->b_sema.count;
>  		__entry->li_flags = bip->bli_item.li_flags;
> @@ -4978,7 +4978,7 @@ DECLARE_EVENT_CLASS(xfbtree_buf_class,
>  		__entry->xfino = file_inode(xfbt->target->bt_file)->i_ino;
>  		__entry->bno = xfs_buf_daddr(bp);
>  		__entry->nblks = bp->b_length;
> -		__entry->hold = atomic_read(&bp->b_hold);
> +		__entry->hold = bp->b_hold;
>  		__entry->pincount = atomic_read(&bp->b_pin_count);
>  		__entry->lockval = bp->b_sema.count;
>  		__entry->flags = bp->b_flags;
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread
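[Aside: the lockdep splat in the report above is a classic AB-BA inversion. The lookup path in xfs_buf_get_map takes bc_lock and then b_lock (via xfs_buf_try_hold), while the release path in xfs_buf_rele takes b_lock and then bc_lock on the final reference. Below is a toy pthread model of the two orders; lock_a and lock_b are invented stand-ins for bc_lock and b_lock, and the sleeps only widen the race window. By design, running it is expected to hang:]

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* ~ bch->bc_lock */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* ~ bp->b_lock */

/* Models the lookup path: hash-wide lock first, then per-buffer lock. */
static void *lookup_path(void *arg)
{
	pthread_mutex_lock(&lock_a);
	usleep(1000);			/* widen the race window */
	pthread_mutex_lock(&lock_b);
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

/* Models the release path: per-buffer lock first, then hash-wide lock
 * when the last reference is dropped. Opposite order: deadlock. */
static void *release_path(void *arg)
{
	pthread_mutex_lock(&lock_b);
	usleep(1000);
	pthread_mutex_lock(&lock_a);
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, lookup_path, NULL);
	pthread_create(&t2, NULL, release_path, NULL);
	pthread_join(t1, NULL);	/* likely never returns: AB-BA deadlock */
	pthread_join(t2, NULL);
	return 0;
}

[The fix that follows removes bc_lock from these paths altogether instead of trying to impose a single lock order.]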
* Re: [PATCH 2/2] xfs: fix buffer lookup vs release race
  2025-01-25  7:35 ` Lai, Yi
@ 2025-01-26  5:50   ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2025-01-26  5:50 UTC (permalink / raw)
  To: Lai, Yi
  Cc: Christoph Hellwig, Carlos Maiolino, Darrick J. Wong, Dave Chinner,
	linux-xfs, yi1.lai

On Sat, Jan 25, 2025 at 03:35:10PM +0800, Lai, Yi wrote:
> Hi Christoph Hellwig,
>
> Greetings!
>
> Using Syzkaller, I found a possible deadlock in xfs_buf_get_map in
> linux-next tag next-20250121.

Oops, yes. It turns out this is fixed by a new patch in my queue, which
is what led to finding the original race in the first place. So
something like the patch below, with a little more commit message
language and a Fixes tag, should take care of it:

---
From 36df4ec6acdc876838ce1708e07a2d6cd1aebff2 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Sun, 12 Jan 2025 07:00:04 +0100
Subject: xfs: remove xfs_buf_cache.bc_lock

xfs_buf_cache.bc_lock serializes adding buffers to and removing them
from the hashtable. But as the rhashtable code already uses
fine-grained internal locking for inserts and removals, the extra
protection isn't actually required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_buf.c | 20 ++++++++------------
 fs/xfs/xfs_buf.h |  1 -
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 601723135e8f..da3860d5666b 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -41,8 +41,7 @@ struct kmem_cache *xfs_buf_cache;
  *
  * xfs_buf_rele:
  *	b_lock
- *	  pag_buf_lock
- *	    lru_lock
+ *	  lru_lock
  *
  * xfs_buftarg_drain_rele
  *	lru_lock
@@ -497,7 +496,6 @@ int
 xfs_buf_cache_init(
 	struct xfs_buf_cache	*bch)
 {
-	spin_lock_init(&bch->bc_lock);
 	return rhashtable_init(&bch->bc_hash, &xfs_buf_hash_params);
 }
 
@@ -647,17 +645,20 @@ xfs_buf_find_insert(
 	if (error)
 		goto out_free_buf;
 
-	spin_lock(&bch->bc_lock);
+	/* The new buffer keeps the perag reference until it is freed. */
+	new_bp->b_pag = pag;
+
+	rcu_read_lock();
 	bp = rhashtable_lookup_get_insert_fast(&bch->bc_hash,
 			&new_bp->b_rhash_head, xfs_buf_hash_params);
 	if (IS_ERR(bp)) {
+		rcu_read_unlock();
 		error = PTR_ERR(bp);
-		spin_unlock(&bch->bc_lock);
 		goto out_free_buf;
 	}
 	if (bp && xfs_buf_try_hold(bp)) {
 		/* found an existing buffer */
-		spin_unlock(&bch->bc_lock);
+		rcu_read_unlock();
 		error = xfs_buf_find_lock(bp, flags);
 		if (error)
 			xfs_buf_rele(bp);
@@ -665,10 +666,8 @@ xfs_buf_find_insert(
 		*bpp = bp;
 		goto out_free_buf;
 	}
+	rcu_read_unlock();
 
-	/* The new buffer keeps the perag reference until it is freed. */
-	new_bp->b_pag = pag;
-	spin_unlock(&bch->bc_lock);
 	*bpp = new_bp;
 	return 0;
 
@@ -1085,7 +1084,6 @@ xfs_buf_rele_cached(
 	}
 
 	/* we are asked to drop the last reference */
-	spin_lock(&bch->bc_lock);
 	__xfs_buf_ioacct_dec(bp);
 	if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
 		/*
@@ -1097,7 +1095,6 @@ xfs_buf_rele_cached(
 			bp->b_state &= ~XFS_BSTATE_DISPOSE;
 		else
 			bp->b_hold--;
-		spin_unlock(&bch->bc_lock);
 	} else {
 		bp->b_hold--;
 		/*
@@ -1115,7 +1112,6 @@ xfs_buf_rele_cached(
 		ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
 		rhashtable_remove_fast(&bch->bc_hash, &bp->b_rhash_head,
 				xfs_buf_hash_params);
-		spin_unlock(&bch->bc_lock);
 		if (pag)
 			xfs_perag_put(pag);
 		freebuf = true;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 7e73663c5d4a..3b4ed42e11c0 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -80,7 +80,6 @@ typedef unsigned int xfs_buf_flags_t;
 #define XFS_BSTATE_IN_FLIGHT	 (1 << 1)	/* I/O in flight */
 
 struct xfs_buf_cache {
-	spinlock_t		bc_lock;
 	struct rhashtable	bc_hash;
 };
 
-- 
2.45.2

^ permalink raw reply related	[flat|nested] 15+ messages in thread
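[Aside: the patch above leans on two properties of rhashtable: mutations are serialized by per-bucket locks inside the library, and lookups are safe under rcu_read_lock(). Below is a schematic, kernel-module-style sketch of the resulting find-or-insert pattern; struct item, item_params, and item_find_or_insert are invented for illustration, and the -EAGAIN leg would need a retry loop in real code, much like the dead-buffer case patch 1/2 handles:]

#include <linux/refcount.h>
#include <linux/rhashtable.h>

struct item {
	u64			key;
	refcount_t		ref;
	struct rhash_head	node;
};

static const struct rhashtable_params item_params = {
	.key_len	= sizeof_field(struct item, key),
	.key_offset	= offsetof(struct item, key),
	.head_offset	= offsetof(struct item, node),
};

/*
 * Find an existing live entry for new->key or insert @new. Returns the
 * existing entry with a reference held, NULL when @new was inserted, or
 * an ERR_PTR. No table-wide lock; only RCU plus a per-object refcount.
 */
static struct item *item_find_or_insert(struct rhashtable *ht,
		struct item *new)
{
	struct item *old;

	rcu_read_lock();
	old = rhashtable_lookup_get_insert_fast(ht, &new->node, item_params);
	if (IS_ERR(old)) {
		rcu_read_unlock();
		return old;		/* insert failed, e.g. -ENOMEM */
	}
	if (old) {
		/*
		 * RCU keeps @old from being freed while we look at it,
		 * but it may already be dying; only a zero-aware
		 * increment may take a new reference.
		 */
		bool alive = refcount_inc_not_zero(&old->ref);

		rcu_read_unlock();
		if (alive)
			return old;	/* caller discards @new */
		return ERR_PTR(-EAGAIN);	/* dying duplicate: retry */
	}
	rcu_read_unlock();
	return NULL;			/* @new now lives in the table */
}

[The shape matches xfs_buf_find_insert() after this patch: RCU pins the found object long enough to decide, and the zero-aware reference grab rejects entries that are concurrently being torn down; XFS does the latter with b_lock and b_hold rather than a refcount_t, but the logic is the same.]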
end of thread, other threads:[~2025-01-26  5:50 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-13  4:24 fix buffer refcount races Christoph Hellwig
2025-01-13  4:24 ` [PATCH 1/2] xfs: check for dead buffers in xfs_buf_find_insert Christoph Hellwig
2025-01-13  4:56   ` Darrick J. Wong
2025-01-14  0:24   ` Dave Chinner
2025-01-13  4:24 ` [PATCH 2/2] xfs: fix buffer lookup vs release race Christoph Hellwig
2025-01-13 17:55   ` Darrick J. Wong
2025-01-13 20:55   ` Dave Chinner
2025-01-15  5:38     ` Christoph Hellwig
2025-01-15 11:21       ` Dave Chinner
2025-01-13  5:08 ` fix buffer refcount races Darrick J. Wong
2025-01-13  5:14   ` Christoph Hellwig
2025-01-13  7:13     ` Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2025-01-16  6:01 fix buffer refcount races v2 Christoph Hellwig
2025-01-16  6:01 ` [PATCH 2/2] xfs: fix buffer lookup vs release race Christoph Hellwig
2025-01-25  7:35   ` Lai, Yi
2025-01-26  5:50     ` Christoph Hellwig