From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 54994354AC7; Thu, 5 Mar 2026 21:49:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1772747396; cv=none; b=DK/J/m/tfx4nqfxiQxm2Vm57d1Gl3WvknzJiBwtaw7I4TpaB7HPKYogDTSGLbKDgBj8QWffml65WI1Pw3en7auTZa6kh/AtVy3SlcD2msZgI7lqI60oKQS42Ia9uySYxvgn5tOV2Ok17F8O3L9VpH9IccMFm9jU0MuMb1RBNQwA= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1772747396; c=relaxed/simple; bh=3/t4bdbqokSl/X8GtHRwTOf3R8xbOYBso+1WOELuR7s=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=LyoccI83vr3ZolFr5EXOfQDrqcWeFZ3c5Honxn0C214E7F6J2sE4d0W8/0DG+ABj4ixczNa2JJz24v0+wDclS5DPdPoLKqY6hWIzG2bjptuaI+i+oldtbrA5Jc7uH+sh3yBBSJteGrA1rgyO+Qv9tgNMPy0FQWR7djkOOc5fQDs= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=UsxlvzQ9; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="UsxlvzQ9" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 32FDFC116C6; Thu, 5 Mar 2026 21:49:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1772747395; bh=3/t4bdbqokSl/X8GtHRwTOf3R8xbOYBso+1WOELuR7s=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=UsxlvzQ9KXgtMtIpPJ+zqFJEuvlmuPmSwCwsHOviZBVX8ZtXf0GaldknmJueqrQxu nVAmlJmdUQPrBbScEB8yyDBb2U22ATs1L7brIpCNuKkvZIkOod//2ql9331i2vOnqJ 1qVkewAt9KFSZw6mery+wgb39Nh1/KfkjvM/0SgsIpeth3+lTeZ/GcEk18rE9rHXr0 tLjvO2Kf2yA9cyv05j78v7Od/o4jqzxK/BNceiRvKHdHU8BqzJvHYkCHSjGK+1A1Lp HYIr4wu7MokP4H/gr4ArtMa/kBmRxGwiTv+PQM/hNlsQPdD2jfvamAkEPxOM9aIZ8j hGe6rqtmtX/Jg== Date: Fri, 6 Mar 2026 08:49:45 +1100 From: Dave Chinner To: Yuto Ohnuki Cc: Carlos Maiolino , Dave Chinner , "Darrick J . Wong" , Brian Foster , linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org, syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com, stable@vger.kernel.org Subject: Re: [PATCH v2] xfs: fix use-after-free of log items during AIL pushing Message-ID: References: <20260305185836.56478-2-ytohnuki@amazon.com> Precedence: bulk X-Mailing-List: stable@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260305185836.56478-2-ytohnuki@amazon.com> On Thu, Mar 05, 2026 at 06:58:37PM +0000, Yuto Ohnuki wrote: > When a filesystem is shut down, background inode reclaim and the xfsaild > can race to abort and free dirty inodes. During xfs_iflush_cluster(), if > the filesystem is shut down, individual inodes are aborted, marked clean > and removed from the AIL. When the buffer is subsequently failed via > xfs_buf_ioend_fail(), the buffer is unlocked and pending inode reclaim > can make progress. > > If the xfsaild is then preempted long enough for reclaim to complete its > work and the RCU grace period to expire, the inode and its log item are > freed before the xfsaild reacquires the AIL lock. This results in a > use-after-free when dereferencing the log item's li_ailp pointer at > offset 48. > > Since commit 90c60e164012 ("xfs: xfs_iflush() is no longer necessary"), > xfs_inode_item_push() no longer holds ILOCK_SHARED while flushing, > removing the protection that prevented the inode from being reclaimed > during the flush. > > xfs_dquot_item_push() has the same issue, as dquots can be reclaimed > asynchronously via a memory pressure driven shrinker while the AIL lock > is temporarily dropped. > > The unmount sequence in xfs_unmount_flush_inodes() also contributes to > the race by pushing the AIL while background reclaim and inodegc are > still running. > > Additionally, all tracepoints in the xfsaild_push() switch statement > dereference the log item after xfsaild_push_item() returns, when the > item may already be freed. The UAF is most likely when > xfs_iflush_cluster() returns -EIO and XFS_ITEM_LOCKED is returned. > > Fix this by: > - Reordering xfs_unmount_flush_inodes() to stop background reclaim and > inodegc before pushing the AIL. > - Saving the ailp pointer in local variables in xfs_inode_item_push() > and xfs_dquot_item_push() when the AIL lock is held and the log item > is guaranteed to be valid. > - Capturing log item fields before calling xfsaild_push_item() so that > tracepoints do not dereference potentially freed log items. > A new xfs_ail_push_class trace event class is introduced for this > purpose, while the existing xfs_log_item_class remains unchanged to > preserve compatibility. > - Adding comments documenting that log items must not be referenced > after iop_push() returns. I think this should be broken up into separate commits. Certainly the unmount changes should be a standalone commit... > Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com > Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c > Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary") > Cc: # v5.9 > Signed-off-by: Yuto Ohnuki > --- > Changes in v2: > - Reordered xfs_unmount_flush_inodes() to stop reclaim before pushing > AIL suggested by Dave Chinner > - Introduced xfs_ail_push_class trace event to avoid dereferencing > freed log items in tracepoints > - Added comments documenting that log items must not be referenced > after iop_push() returns > - Saved ailp pointer in local variables in push functions > - Link to v1: https://lore.kernel.org/all/20260304162405.58017-2-ytohnuki@amazon.com/ > --- > fs/xfs/xfs_dquot_item.c | 10 ++++++++-- > fs/xfs/xfs_inode_item.c | 9 +++++++-- > fs/xfs/xfs_mount.c | 4 ++-- > fs/xfs/xfs_trace.h | 35 +++++++++++++++++++++++++++++++---- > fs/xfs/xfs_trans_ail.c | 28 ++++++++++++++++++++++++---- > 5 files changed, 72 insertions(+), 14 deletions(-) > > diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c > index 491e2a7053a3..223e7162db02 100644 > --- a/fs/xfs/xfs_dquot_item.c > +++ b/fs/xfs/xfs_dquot_item.c > @@ -125,6 +125,7 @@ xfs_qm_dquot_logitem_push( > struct xfs_dq_logitem *qlip = DQUOT_ITEM(lip); > struct xfs_dquot *dqp = qlip->qli_dquot; > struct xfs_buf *bp; > + struct xfs_ail *ailp = lip->li_ailp; > uint rval = XFS_ITEM_SUCCESS; > int error; > > @@ -153,7 +154,12 @@ xfs_qm_dquot_logitem_push( > goto out_unlock; > } > > - spin_unlock(&lip->li_ailp->ail_lock); > + /* > + * After dropping the AIL lock, the log item may be freed via > + * memory pressure driven shrinker. Do not reference lip after > + * this point. > + */ > + spin_unlock(&ailp->ail_lock); Not true. We hold the buffer lock here, which prevents reclaim from removing the log item from the buffer. Hence the point at which it becomes unsafe to reference the log item is when the cluster buffer lock is dropped. i.e.: > > error = xfs_dquot_use_attached_buf(dqp, &bp); > if (error == -EAGAIN) { > @@ -174,7 +180,7 @@ xfs_qm_dquot_logitem_push( > xfs_buf_relse(bp); > > out_relock_ail: > - spin_lock(&lip->li_ailp->ail_lock); > + spin_lock(&ailp->ail_lock); the log item only becomes unsafe to reference after the call to xfs_buf_relse() just above the code that regains the AIL lock. > out_unlock: > mutex_unlock(&dqp->q_qlock); > return rval; > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 8913036b8024..f584e0a2f174 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -746,6 +746,7 @@ xfs_inode_item_push( > struct xfs_inode_log_item *iip = INODE_ITEM(lip); > struct xfs_inode *ip = iip->ili_inode; > struct xfs_buf *bp = lip->li_buf; > + struct xfs_ail *ailp = lip->li_ailp; > uint rval = XFS_ITEM_SUCCESS; > int error; > > @@ -771,7 +772,11 @@ xfs_inode_item_push( > if (!xfs_buf_trylock(bp)) > return XFS_ITEM_LOCKED; > > - spin_unlock(&lip->li_ailp->ail_lock); > + /* > + * After dropping the AIL lock, the log item may be freed via > + * RCU callback. Do not reference lip after this point. > + */ > + spin_unlock(&ailp->ail_lock); Same for the inode log item here - it's when the cluster buffer is unlocked that the log item can be freed from under us. > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c > index 9c295abd0a0a..786e1fc720e5 100644 > --- a/fs/xfs/xfs_mount.c > +++ b/fs/xfs/xfs_mount.c > @@ -621,9 +621,9 @@ xfs_unmount_flush_inodes( > > xfs_set_unmounting(mp); > > - xfs_ail_push_all_sync(mp->m_ail); > - xfs_inodegc_stop(mp); > cancel_delayed_work_sync(&mp->m_reclaim_work); > + xfs_inodegc_stop(mp); > + xfs_ail_push_all_sync(mp->m_ail); > xfs_reclaim_inodes(mp); > xfs_health_unmount(mp); > xfs_healthmon_unmount(mp); This should be in it's own patch. > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index 813e5a9f57eb..ee4b72878f7b 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -1646,14 +1646,41 @@ TRACE_EVENT(xfs_log_force, > __entry->lsn, (void *)__entry->caller_ip) > ) > > +DECLARE_EVENT_CLASS(xfs_ail_push_class, > + TP_PROTO(dev_t dev, uint type, unsigned long flags, xfs_lsn_t lsn), > + TP_ARGS(dev, type, flags, lsn), > + TP_STRUCT__entry( > + __field(dev_t, dev) > + __field(uint, type) > + __field(unsigned long, flags) > + __field(xfs_lsn_t, lsn) > + ), > + TP_fast_assign( > + __entry->dev = dev; > + __entry->type = type; > + __entry->flags = flags; > + __entry->lsn = lsn; > + ), > + TP_printk("dev %d:%d lsn %d/%d type %s flags %s", > + MAJOR(__entry->dev), MINOR(__entry->dev), > + CYCLE_LSN(__entry->lsn), BLOCK_LSN(__entry->lsn), > + __print_symbolic(__entry->type, XFS_LI_TYPE_DESC), > + __print_flags(__entry->flags, "|", XFS_LI_FLAGS)) > +) > + > +#define DEFINE_AIL_PUSH_EVENT(name) \ > +DEFINE_EVENT(xfs_ail_push_class, name, \ > + TP_PROTO(dev_t dev, uint type, unsigned long flags, xfs_lsn_t lsn), \ > + TP_ARGS(dev, type, flags, lsn)) > +DEFINE_AIL_PUSH_EVENT(xfs_ail_push); > +DEFINE_AIL_PUSH_EVENT(xfs_ail_pinned); > +DEFINE_AIL_PUSH_EVENT(xfs_ail_locked); > +DEFINE_AIL_PUSH_EVENT(xfs_ail_flushing); > + > #define DEFINE_LOG_ITEM_EVENT(name) \ > DEFINE_EVENT(xfs_log_item_class, name, \ > TP_PROTO(struct xfs_log_item *lip), \ > TP_ARGS(lip)) > -DEFINE_LOG_ITEM_EVENT(xfs_ail_push); > -DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned); > -DEFINE_LOG_ITEM_EVENT(xfs_ail_locked); > -DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing); > DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark); > DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip); > DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin); Move the definition of the xfs_ail_push_class to after the definition of the log item event trace points. i.e. the log item event definitions should be grouped with the definition of the log item event class, not have a whole new event class and trace event definitions between the class definition and event definitions... > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > index 923729af4206..48b14146826b 100644 > --- a/fs/xfs/xfs_trans_ail.c > +++ b/fs/xfs/xfs_trans_ail.c > @@ -387,6 +387,11 @@ xfsaild_push_item( > return XFS_ITEM_PINNED; > if (test_bit(XFS_LI_FAILED, &lip->li_flags)) > return xfsaild_resubmit_item(lip, &ailp->ail_buf_list); > + > + /* > + * Once iop_push() returns, the log item may have been freed > + * and must not be dereferenced. > + */ > return lip->li_ops->iop_push(lip, &ailp->ail_buf_list); > } This should really be a comment in the header describing the funciton behaviour. i.e. something like: "@lip may have been released and freed by the time this function returns, so callers must not dereference the log item after calling this function." > > @@ -506,20 +511,35 @@ xfsaild_push( > lsn = lip->li_lsn; > while ((XFS_LSN_CMP(lip->li_lsn, ailp->ail_target) <= 0)) { > int lock_result; > + dev_t dev; > + uint type; > + unsigned long flags; > + xfs_lsn_t item_lsn; > > if (test_bit(XFS_LI_FLUSHING, &lip->li_flags)) > goto next_item; > > + /* > + * Store log item information before pushing, as the item > + * may be freed after dropping the AIL lock. > + */ > + dev = lip->li_log->l_mp->m_super->s_dev; Don't need dev - we can get to it from the ailp structure. i.e. the tracepoints can be passed the ailp and so the dev part of the tracepoint can remain hidden from this code (as it should be). > + type = lip->li_type; > + flags = lip->li_flags; > + item_lsn = lip->li_lsn; > + > /* > * Note that iop_push may unlock and reacquire the AIL lock. We > * rely on the AIL cursor implementation to be able to deal with > * the dropped lock. > + * After this call returns, the log item may have been freed and > + * must not be referenced. > */ > lock_result = xfsaild_push_item(ailp, lip); > switch (lock_result) { > case XFS_ITEM_SUCCESS: > XFS_STATS_INC(mp, xs_push_ail_success); > - trace_xfs_ail_push(lip); > + trace_xfs_ail_push(dev, type, flags, item_lsn); > > ailp->ail_last_pushed_lsn = lsn; > break; > @@ -537,7 +557,7 @@ xfsaild_push( > * AIL is being flushed. > */ > XFS_STATS_INC(mp, xs_push_ail_flushing); > - trace_xfs_ail_flushing(lip); > + trace_xfs_ail_flushing(dev, type, flags, item_lsn); > > flushing++; > ailp->ail_last_pushed_lsn = lsn; > @@ -545,14 +565,14 @@ xfsaild_push( > > case XFS_ITEM_PINNED: > XFS_STATS_INC(mp, xs_push_ail_pinned); > - trace_xfs_ail_pinned(lip); > + trace_xfs_ail_pinned(dev, type, flags, item_lsn); > > stuck++; > ailp->ail_log_flush++; > break; > case XFS_ITEM_LOCKED: > XFS_STATS_INC(mp, xs_push_ail_locked); > - trace_xfs_ail_locked(lip); > + trace_xfs_ail_locked(dev, type, flags, item_lsn); > > stuck++; > break; At this point, xfsaild_push() is getting too long to easily read and understand. I think that we should factor the entire loop contents into a separate function so the loop looks like this: while ((XFS_LSN_CMP(lip->li_lsn, ailp->ail_target) <= 0)) { xfsaild_process_logitem(ailp, lip, &stuck, &flushing); count++; /* * Are there too many items we can't do anything with? * * If we are skipping too many items because we can't flush * them or they are already being flushed, we back off and * given them time to complete whatever operation is being * done. i.e. remove pressure from the AIL while we can't make * progress so traversals don't slow down further inserts and * removals to/from the AIL. * * The value of 100 is an arbitrary magic number based on * observation. */ if (stuck > 100) break; lip = xfs_trans_ail_cursor_next(ailp, &cur); if (lip == NULL) break; if (lip->li_lsn != lsn && count > 1000) break; lsn = lip->li_lsn; } Probably worth doing that as a separate patch, then adding the modifications to fix the bug as a followup patch. -Dave. -- Dave Chinner dgc@kernel.org