* [PATCH] xfs: fix use-after-free in xfs_inode_item_push()
@ 2026-03-04 16:24 Yuto Ohnuki
From: Yuto Ohnuki @ 2026-03-04 16:24 UTC (permalink / raw)
To: Carlos Maiolino
Cc: Darrick J . Wong, Dave Chinner, Brian Foster, linux-xfs,
linux-kernel, Yuto Ohnuki, syzbot+652af2b3c5569c4ab63c
Since commit 90c60e164012 ("xfs: xfs_iflush() is no longer necessary"),
xfs_inode_item_push() no longer holds the inode locked (ILOCK_SHARED)
while flushing, so the inode and its log item can be freed via
RCU callback (xfs_inode_free_callback) while the AIL lock is
temporarily dropped.
This results in a use-after-free when the function reacquires the AIL
lock by dereferencing the freed log item's li_ailp pointer at offset 48.
Fix this by saving the ailp pointer in a local variable while the AIL
lock is held and the log item is guaranteed to be valid.
Also move trace_xfs_ail_push() before xfsaild_push_item() because the
log item may be freed during the push.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
---
fs/xfs/xfs_inode_item.c | 5 +++--
fs/xfs/xfs_trans_ail.c | 8 +++++++-
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 8913036b8024..0a8957f9c72f 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -746,6 +746,7 @@ xfs_inode_item_push(
struct xfs_inode_log_item *iip = INODE_ITEM(lip);
struct xfs_inode *ip = iip->ili_inode;
struct xfs_buf *bp = lip->li_buf;
+ struct xfs_ail *ailp = lip->li_ailp;
uint rval = XFS_ITEM_SUCCESS;
int error;
@@ -771,7 +772,7 @@ xfs_inode_item_push(
if (!xfs_buf_trylock(bp))
return XFS_ITEM_LOCKED;
- spin_unlock(&lip->li_ailp->ail_lock);
+ spin_unlock(&ailp->ail_lock);
/*
* We need to hold a reference for flushing the cluster buffer as it may
@@ -795,7 +796,7 @@ xfs_inode_item_push(
rval = XFS_ITEM_LOCKED;
}
- spin_lock(&lip->li_ailp->ail_lock);
+ spin_lock(&ailp->ail_lock);
return rval;
}
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 923729af4206..e34d8a7e341d 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -510,6 +510,13 @@ xfsaild_push(
if (test_bit(XFS_LI_FLUSHING, &lip->li_flags))
goto next_item;
+ /*
+ * The log item may be freed after the push if the AIL lock is
+ * temporarily dropped and the RCU grace period expires,
+ * so trace it before pushing.
+ */
+ trace_xfs_ail_push(lip);
+
/*
* Note that iop_push may unlock and reacquire the AIL lock. We
* rely on the AIL cursor implementation to be able to deal with
@@ -519,7 +526,6 @@ xfsaild_push(
switch (lock_result) {
case XFS_ITEM_SUCCESS:
XFS_STATS_INC(mp, xs_push_ail_success);
- trace_xfs_ail_push(lip);
ailp->ail_last_pushed_lsn = lsn;
break;
--
2.50.1
Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284
Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705
* Re: [PATCH] xfs: fix use-after-free in xfs_inode_item_push()
From: Darrick J. Wong @ 2026-03-04 16:44 UTC (permalink / raw)
To: Yuto Ohnuki
Cc: Carlos Maiolino, Darrick J . Wong, Dave Chinner, Brian Foster,
linux-xfs, linux-kernel, syzbot+652af2b3c5569c4ab63c
On Wed, Mar 04, 2026 at 04:24:06PM +0000, Yuto Ohnuki wrote:
> Since commit 90c60e164012 ("xfs: xfs_iflush() is no longer necessary"),
> xfs_inode_item_push() no longer holds the inode locked (ILOCK_SHARED)
> while flushing, so the inode and its log item can be freed via
> RCU callback (xfs_inode_free_callback) while the AIL lock is
> temporarily dropped.
>
> This results in a use-after-free when the function reacquires the AIL
> lock by dereferencing the freed log item's li_ailp pointer at offset 48.
>
> Fix this by saving the ailp pointer in a local variable while the AIL
> lock is held and the log item is guaranteed to be valid.
>
> Also move trace_xfs_ail_push() before xfsaild_push_item() because the
> log item may be freed during the push.
>
> Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
> Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
> Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Cc: <stable@vger.kernel.org> # v5.9
> ---
> fs/xfs/xfs_inode_item.c | 5 +++--
> fs/xfs/xfs_trans_ail.c | 8 +++++++-
> 2 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 8913036b8024..0a8957f9c72f 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -746,6 +746,7 @@ xfs_inode_item_push(
> struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> struct xfs_inode *ip = iip->ili_inode;
> struct xfs_buf *bp = lip->li_buf;
> + struct xfs_ail *ailp = lip->li_ailp;
> uint rval = XFS_ITEM_SUCCESS;
> int error;
>
> @@ -771,7 +772,7 @@ xfs_inode_item_push(
> if (!xfs_buf_trylock(bp))
> return XFS_ITEM_LOCKED;
>
> - spin_unlock(&lip->li_ailp->ail_lock);
> + spin_unlock(&ailp->ail_lock);
>
> /*
> * We need to hold a reference for flushing the cluster buffer as it may
> @@ -795,7 +796,7 @@ xfs_inode_item_push(
> rval = XFS_ITEM_LOCKED;
> }
>
> - spin_lock(&lip->li_ailp->ail_lock);
> + spin_lock(&ailp->ail_lock);
Hmm, so the @lip UAF is here, where we try to re-acquire the AIL lock?
> return rval;
> }
>
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 923729af4206..e34d8a7e341d 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -510,6 +510,13 @@ xfsaild_push(
> if (test_bit(XFS_LI_FLUSHING, &lip->li_flags))
> goto next_item;
>
> + /*
> + * The log item may be freed after the push if the AIL lock is
> + * temporarily dropped and the RCU grace period expires,
> + * so trace it before pushing.
> + */
> + trace_xfs_ail_push(lip);
> +
> /*
> * Note that iop_push may unlock and reacquire the AIL lock. We
> * rely on the AIL cursor implementation to be able to deal with
> @@ -519,7 +526,6 @@ xfsaild_push(
> switch (lock_result) {
> case XFS_ITEM_SUCCESS:
> XFS_STATS_INC(mp, xs_push_ail_success);
> - trace_xfs_ail_push(lip);
Do the tracepoints in the other legs of the switch statement have a
similar UAF problem because they dereference @lip?
--D
>
> ailp->ail_last_pushed_lsn = lsn;
> break;
> --
> 2.50.1
>
* Re: [PATCH] xfs: fix use-after-free in xfs_inode_item_push()
From: Yuto Ohnuki @ 2026-03-04 17:41 UTC (permalink / raw)
To: djwong
Cc: bfoster, cem, darrick.wong, dchinner, linux-kernel, linux-xfs,
syzbot+652af2b3c5569c4ab63c, ytohnuki
> > ---
> > fs/xfs/xfs_inode_item.c | 5 +++--
> > fs/xfs/xfs_trans_ail.c | 8 +++++++-
> > 2 files changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 8913036b8024..0a8957f9c72f 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -746,6 +746,7 @@ xfs_inode_item_push(
> > struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> > struct xfs_inode *ip = iip->ili_inode;
> > struct xfs_buf *bp = lip->li_buf;
> > + struct xfs_ail *ailp = lip->li_ailp;
> > uint rval = XFS_ITEM_SUCCESS;
> > int error;
> >
> > @@ -771,7 +772,7 @@ xfs_inode_item_push(
> > if (!xfs_buf_trylock(bp))
> > return XFS_ITEM_LOCKED;
> >
> > - spin_unlock(&lip->li_ailp->ail_lock);
> > + spin_unlock(&ailp->ail_lock);
> >
> > /*
> > * We need to hold a reference for flushing the cluster buffer as it may
> > @@ -795,7 +796,7 @@ xfs_inode_item_push(
> > rval = XFS_ITEM_LOCKED;
> > }
> >
> > - spin_lock(&lip->li_ailp->ail_lock);
> > + spin_lock(&ailp->ail_lock);
>
> Hmm, so the @lip UAF is here, where we try to re-acquire the AIL lock?
Yes. The syzbot report shows a Read of size 8 at offset 48 (li_ailp)
when spin_lock() dereferences the freed log item to get the
AIL pointer.
> > return rval;
> > }
> >
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 923729af4206..e34d8a7e341d 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -510,6 +510,13 @@ xfsaild_push(
> > if (test_bit(XFS_LI_FLUSHING, &lip->li_flags))
> > goto next_item;
> >
> > + /*
> > + * The log item may be freed after the push if the AIL lock is
> > + * temporarily dropped and the RCU grace period expires,
> > + * so trace it before pushing.
> > + */
> > + trace_xfs_ail_push(lip);
> > +
> > /*
> > * Note that iop_push may unlock and reacquire the AIL lock. We
> > * rely on the AIL cursor implementation to be able to deal with
> > @@ -519,7 +526,6 @@ xfsaild_push(
> > switch (lock_result) {
> > case XFS_ITEM_SUCCESS:
> > XFS_STATS_INC(mp, xs_push_ail_success);
> > - trace_xfs_ail_push(lip);
>
> Do the tracepoints in the other legs of the switch statement have a
> similar UAF problem because they dereference @lip?
>
> --D
Thank you very much for pointing out the other legs of the switch
statement. XFS_ITEM_PINNED is always returned before the AIL lock
is dropped, so trace_xfs_ail_pinned() is safe.
However, looking into it further, XFS_ITEM_FLUSHING and
XFS_ITEM_LOCKED can also be returned via the rval path after the AIL
lock is dropped and reacquired. So trace_xfs_ail_flushing() and
trace_xfs_ail_locked() could also hit a UAF in those cases.
I'll send a v2 that addresses those as well.
* Re: [PATCH] xfs: fix use-after-free in xfs_inode_item_push()
From: Dave Chinner @ 2026-03-04 23:06 UTC (permalink / raw)
To: Yuto Ohnuki
Cc: Carlos Maiolino, Darrick J . Wong, Dave Chinner, Brian Foster,
linux-xfs, linux-kernel, syzbot+652af2b3c5569c4ab63c
On Wed, Mar 04, 2026 at 04:24:06PM +0000, Yuto Ohnuki wrote:
> Since commit 90c60e164012 ("xfs: xfs_iflush() is no longer necessary"),
> xfs_inode_item_push() no longer holds the inode locked (ILOCK_SHARED)
> while flushing, so the inode and its log item can be freed via
> RCU callback (xfs_inode_free_callback) while the AIL lock is
> temporarily dropped.
How? You state this can happen, but then don't explain how this
occurs. i.e. the commit message just describes the UAF failure and
the fix in the code rather than how the failure occurs in the first
place.
IOWs, I have to spend a bunch of my time reverse engineering the bug
report to understand exactly how this UAF occurs, because it
certainly doesn't happen in normal operation.
i.e. in normal operation, either (or both) the IFLUSHING lock is held
on the inode, or the inode is attached to and referenced by the
inode cluster buffer.
xfs_reclaim_inode() explicitly takes IFLUSHING and checks the inode
is clean (i.e. not attached to the cluster buffer) before starting
the process of freeing it.
IOWs, the inode cannot be reclaimed until it is unlocked and removed
from the inode cluster buffer, and this happens during inode cluster
buffer IO completion in normal operation whilst the buffer is still
locked.
The only way that we can do anything differently in reclaim is if we
are in shutdown conditions. In which case reclaim locks the inode
and calls xfs_iflush_shutdown_abort(). This locks the buffer, then
removes the inode from the buffer, the AIL and clears the IFLUSHING
state, then allows inode reclaim to continue.
Omitted from the syzbot traces is this:
[ 1161.203695][T20859] XFS (loop6): metadata I/O error in "xfs_btree_read_buf_block+0x2b0/0x490" at daddr 0x4 len 4 error 74
[ 1161.254510][T20859] XFS (loop6): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x518/0x950 (fs/xfs/xfs_trans_buf.c:311). Shutting down filesystem.
[ 1161.254574][T20859] XFS (loop6): Please unmount the filesystem and rectify the problem(s)
[ 1162.085767][T19986] XFS (loop6): Unmounting Filesystem 9f91832a-3b79-45c3-9d6d-ed0bc7357fe4
[ 1162.203508][T20924] ==================================================================
[ 1162.203522][T20924] BUG: KASAN: slab-use-after-free in xfs_inode_item_push+0x396/0x720
[ 1162.203550][T20924] Read of size 8 at addr ffff88805e822ae8 by task xfsaild/loop6/20924
The filesystem is indeed in a shutdown state and is being unmounted.
However, the AIL is still running, which means unmount hasn't yet
got to reclaiming inodes - the AIL is flushed and emptied before
that happens. Hence this must be a race between background inode
reclaim and the xfsaild. Interestingly, some of the failures are on
quota inodes, which get released in unmount moments before we call
xfs_unmount_flush_inodes(). This implies unmount triggered the AIL
to run and is waiting in xfs_ail_push_all_sync() for all dirty
inodes to be removed from the AIL.
IOWs, the reason this race condition occurred is that unmount has
triggered the AIL to push -and fail- all the dirty inodes in the
system at the same time that background reclaim is trying to -fail
and reclaim- all the dirty inodes.
There's the underlying root cause of the race condition that is
resulting in the UAF bug being exposed - the unmount code
is allowing background inode reclaim to remove dirty inodes from the
AIL whilst it is explicitly pushing dirty inodes and failing them to
remove them from the AIL.
Indeed, xfs_unmount_flush_inodes() does:
	xfs_ail_push_all_sync(mp->m_ail);
	xfs_inodegc_stop(mp);
	cancel_delayed_work_sync(&mp->m_reclaim_work);
	xfs_reclaim_inodes(mp);
IOWs, it pushes the entire AIL to fail everything on it and waits
for that to complete, then stops inodegc, then stops the
background reclaim work, then it reclaims all the remaining
inodes.
Honestly, that looks broken. If there are inodes queued for GC, then
they can be dirtied and inserted into the AIL as part of the inodegc
queue flush. This is unlikely if the filesystem is shut down, but
it could happen in normal operation. We also don't need background
inode reclaim running here, because we are about to run a blocking
inode reclaim pass, too.
Hence if we change this like so:
+ cancel_delayed_work_sync(&mp->m_reclaim_work);
+ xfs_inodegc_stop(mp);
xfs_ail_push_all_sync(mp->m_ail);
- xfs_inodegc_stop(mp);
- cancel_delayed_work_sync(&mp->m_reclaim_work);
xfs_reclaim_inodes(mp);
We no longer have a vector for adding inodes to the AIL after it has
been flushed, and we will not have two background threads both
racing to abort/fail/free dirty inodes during unmount. I think this
needs to be done for correctness of unmount, regardless of the fact
it exposes a UAF issue elsewhere in the code.
Ok, so that's one problem, but how is background inode reclaim and
inode item pushing racing to create a UAF situation?
Again, I think that's a shutdown related issue. xfs_iflush_cluster()
does:
	/*
	 * Abort flushing this inode if we are shut down because the
	 * inode may not currently be in the AIL. This can occur when
	 * log I/O failure unpins the inode without inserting into the
	 * AIL, leaving a dirty/unpinned inode attached to the buffer
	 * that otherwise looks like it should be flushed.
	 */
	if (xlog_is_shutdown(mp->m_log)) {
		xfs_iunpin_wait(ip);
		xfs_iflush_abort(ip);
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
		error = -EIO;
		continue;
	}
......
	if (error) {
		/*
		 * Shutdown first so we kill the log before we release this
		 * buffer. If it is an INODE_ALLOC buffer and pins the tail
		 * of the log, failing it before the _log_ is shut down can
		 * result in the log tail being moved forward in the journal
		 * on disk because log writes can still be taking place. Hence
		 * unpinning the tail will allow the ICREATE intent to be
		 * removed from the log and recovery will fail with uninitialised
		 * inode cluster buffers.
		 */
		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
		bp->b_flags |= XBF_ASYNC;
		xfs_buf_ioend_fail(bp);
		return error;
	}
The first hunk results in individual inodes being aborted, marked
clean and removed from the AIL. This then allows reclaim to lock
them and call xfs_iflush_shutdown_abort() on them, which then
serialises on the buffer lock.
The second hunk is where the buffer lock gets dropped. This runs IO
completion to abort any inodes that were previously marked IFLUSHING
and so skipped by the first hunk. This will result in them being
marked clean and removed from the AIL, hence also allowing reclaim
to lock and abort them.
At this point in xfs_buf_ioend_fail(), the buffer is then unlocked
and any pending inode reclaim will now make progress. For a UAF to
then occur a few lines of code later in xfs_inode_item_push() on the
log item, the xfsaild() -must- be preempted and not rescheduled
until the inode reclaim has done all it's work (which is quite a
lot, with lots of locking involved) and then have the rcu grace
period expire and have all the RCU callbacks run to free the inode
before it gets scheduled again to relock the AIL....
So, yes, we need to handle this case in xfs_inode_item_push().
More importantly:
Now that I understand why the UAF occurs, it becomes obvious that
xfs_dquot_item_push() has the same log item UAF bug in it, too. The
path to it is slightly more convoluted and does not involve RCU
freeing at all. i.e. dquots can be reclaimed asynchronously via a
memory pressure driven shrinker, so if xfsaild can be preempted for
long periods of time, the same race window exists where a flushed
dquot can be marked clean, reclaimed and freed before the AIL lock
is regained.
> This results in a use-after-free when the function reacquires the AIL
> lock by dereferencing the freed log item's li_ailp pointer at offset 48.
>
> Fix this by saving the ailp pointer in a local variable while the AIL
> lock is held and the log item is guaranteed to be valid.
>
> Also move trace_xfs_ail_push() before xfsaild_push_item() because the
> log item may be freed during the push.
That will simply create lots of unnecessary noise in AIL
tracing. We do not want "push" traces on every item, we only want
them on the items that returned XFS_ITEM_SUCCESS. We have other
trace points for different return values, too, and they are often
much noisier than XFS_ITEM_SUCCESS. e.g. inode cluster flushing can
have a 30:1 ratio of XFS_ITEM_FLUSHING to XFS_ITEM_SUCCESS - we do
not want a "push" trace for every "flushing" trace, as that will
massively increase the number of traces emitted by this code.
Also, because I now understand how the race condition occurs, I can
state that the UAF doesn't just occur when XFS_ITEM_SUCCESS is
returned. The UAF is most likely to occur when the buffer is failed
because xfs_iflush_cluster() returns -EIO, which results in
XFS_ITEM_LOCKED being returned.
In that case, we run:
	case XFS_ITEM_LOCKED:
		XFS_STATS_INC(mp, xs_push_ail_locked);
>>>>>>>>	trace_xfs_ail_locked(lip);
		stuck++;
		break;
See the problem?
If we cannot rely on the log item being valid after
xfsaild_push_item() is called in shutdown conditions due to
pre-emption, then we cannot rely on it being valid for any of
the code we run after that call. Hence there are more issues than
just moving a tracepoint.
The tracepoints will need to be modified to take individual values,
which will need to be stored in temporary variables before
xfsaild_push_item() is called.
Comments need to be added to xfsaild_push_item() to indicate that
after ->iop_push is called, the log item may be invalid.
Comments need to be added to xfsaild_push() at the
xfsaild_push_item() call site that the log item cannot be referenced
after this call as it may already be free on return.
Comments need to be added to xfs_inode_item_push() and
xfs_dquot_item_push() indicating the exact line of code where we can
no longer reference the log item because it may have been freed.
And the unmount inode flushing operation ordering also needs to be
fixed...
Cheers,
Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [PATCH] xfs: fix use-after-free in xfs_inode_item_push()
From: Yuto Ohnuki @ 2026-03-05 18:28 UTC (permalink / raw)
To: dgc
Cc: bfoster, cem, darrick.wong, dchinner, linux-kernel, linux-xfs,
syzbot+652af2b3c5569c4ab63c, ytohnuki
> [ detailed race analysis snipped ]
>
> The tracepoints will need to be modified to take individual values,
> which will need to be stored in temporary variables before
> xfsaild_push_item() is called.
>
> Comments need to be added to xfsaild_push_item() to indicate that
> after ->iop_push is called, the log item may be invalid.
>
> Comments need to be added to xfsaild_push() at the
> xfsaild_push_item() call site that the log item cannot be referenced
> after this call as it may already be free on return.
>
> Comments need to be added to xfs_inode_item_push() and
> xfs_dquot_item_push() indicating the exact line of code where we can
> no longer reference the log item because it may have been freed.
>
> And the unmount inode flushing operation ordering also needs to be
> fixed...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> dgc@kernel.org
Hi Dave,
Thank you for the detailed analysis and comprehensive feedback on this
patch. Your explanation of the race condition mechanism and the
additional issues you identified (dquot_item_push, unmount ordering,
and all switch case tracepoints) were extremely helpful.
I've prepared a v2 patch that addresses all the points you raised:
- Fixed the unmount ordering in xfs_unmount_flush_inodes()
- Added ailp local variables to both xfs_inode_item_push() and
xfs_dquot_item_push()
- Modified all 4 tracepoints in the switch statement to avoid
dereferencing the log item after xfsaild_push_item() returns
- Introduced a new xfs_ail_push_class to preserve compatibility with
existing tracepoints
- Added comments documenting where log items must not be referenced
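
For reference, the value-based event class could look roughly like this
(a hypothetical sketch only, not the actual v2 patch; the field choices,
the dev_t access path, and the TP_printk format are guesses):

```c
/*
 * Hypothetical sketch: an event class that takes the values it needs
 * by copy, so nothing dereferences the log item after
 * xfsaild_push_item() returns.  The ailp->ail_log->l_mp chain used
 * for the dev_t is an assumption, not taken from the real patch.
 */
DECLARE_EVENT_CLASS(xfs_ail_push_class,
	TP_PROTO(struct xfs_ail *ailp, xfs_lsn_t lsn, unsigned long flags),
	TP_ARGS(ailp, lsn, flags),
	TP_STRUCT__entry(
		__field(dev_t, dev)
		__field(xfs_lsn_t, lsn)
		__field(unsigned long, flags)
	),
	TP_fast_assign(
		__entry->dev = ailp->ail_log->l_mp->m_super->s_dev;
		__entry->lsn = lsn;
		__entry->flags = flags;
	),
	TP_printk("dev %d:%d lsn 0x%llx flags 0x%lx",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  __entry->lsn, __entry->flags)
);
```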
I've also updated the commit message to provide a more detailed
explanation of how the UAF occurs, as you suggested.
I'll send the v2 patch shortly.
Thanks again for taking the time to review this thoroughly.
Best regards,
Yuto