public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] xfs: fix and cleanups for log item push
@ 2024-08-23 11:04 Long Li
  2024-08-23 11:04 ` [PATCH 1/5] xfs: remove redundant set null for ip->i_itemp Long Li
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Long Li @ 2024-08-23 11:04 UTC (permalink / raw)
  To: djwong, chandanbabu
  Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun

Hi all,

This patch series fix some issues during item push, The following is a
brief overview of the patches, see the patches for more details.

Patch 1 : Simple clean code.
Patch 2 : Fixed an issue where the tail lsn was moved forward abnormally
          due to deleting log item from AIL before log shutdown.
Patch 3-5 : Fix log item access UAF after inode/dquot item pushed.


Long Li (5):
  xfs: remove redundant set null for ip->i_itemp
  xfs: ensuere deleting item from AIL after shutdown in dquot flush
  xfs: add XFS_ITEM_UNSAFE for log item push return result
  xfs: fix a UAF when dquot item push
  xfs: fix a UAF when inode item push

 fs/xfs/xfs_dquot.c      |  8 +++++++-
 fs/xfs/xfs_dquot_item.c | 10 +++++++++-
 fs/xfs/xfs_icache.c     |  1 -
 fs/xfs/xfs_inode_item.c | 21 ++++++++++++++-------
 fs/xfs/xfs_stats.h      |  1 +
 fs/xfs/xfs_trans.h      |  1 +
 fs/xfs/xfs_trans_ail.c  |  7 +++++++
 7 files changed, 39 insertions(+), 10 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/5] xfs: remove redundant set null for ip->i_itemp
  2024-08-23 11:04 [PATCH 0/5] xfs: fix and cleanups for log item push Long Li
@ 2024-08-23 11:04 ` Long Li
  2024-08-23 16:37   ` Darrick J. Wong
  2024-08-25  4:52   ` Christoph Hellwig
  2024-08-23 11:04 ` [PATCH 2/5] xfs: ensuere deleting item from AIL after shutdown in dquot flush Long Li
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 26+ messages in thread
From: Long Li @ 2024-08-23 11:04 UTC (permalink / raw)
  To: djwong, chandanbabu
  Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun

ip->i_itemp has been set null in xfs_inode_item_destroy(), so there is
no need set it null again in xfs_inode_free_callback().

Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/xfs_icache.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index cf629302d48e..a5e5e5520a3b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -143,7 +143,6 @@ xfs_inode_free_callback(
 		ASSERT(!test_bit(XFS_LI_IN_AIL,
 				 &ip->i_itemp->ili_item.li_flags));
 		xfs_inode_item_destroy(ip);
-		ip->i_itemp = NULL;
 	}
 
 	kmem_cache_free(xfs_inode_cache, ip);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/5] xfs: ensuere deleting item from AIL after shutdown in dquot flush
  2024-08-23 11:04 [PATCH 0/5] xfs: fix and cleanups for log item push Long Li
  2024-08-23 11:04 ` [PATCH 1/5] xfs: remove redundant set null for ip->i_itemp Long Li
@ 2024-08-23 11:04 ` Long Li
  2024-08-23 17:00   ` Darrick J. Wong
  2024-08-23 11:04 ` [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result Long Li
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Long Li @ 2024-08-23 11:04 UTC (permalink / raw)
  To: djwong, chandanbabu
  Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun

Deleting items from the AIL before the log is shut down can result in the
log tail moving forward in the journal on disk because log writes can still
be taking place. As a result, items that have been deleted from the AIL
might not be recovered during the next mount, even though they should be,
as they were never written back to disk.

Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/xfs_dquot.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index c1b211c260a9..4cbe3db6fc32 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1332,9 +1332,15 @@ xfs_qm_dqflush(
 	return 0;
 
 out_abort:
+	/*
+	 * Shutdown first to stop the log before deleting items from the AIL.
+	 * Deleting items from the AIL before the log is shut down can result
+	 * in the log tail moving forward in the journal on disk because log
+	 * writes can still be taking place.
+	 */
+	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 	dqp->q_flags &= ~XFS_DQFLAG_DIRTY;
 	xfs_trans_ail_delete(lip, 0);
-	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 out_unlock:
 	xfs_dqfunlock(dqp);
 	return error;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-23 11:04 [PATCH 0/5] xfs: fix and cleanups for log item push Long Li
  2024-08-23 11:04 ` [PATCH 1/5] xfs: remove redundant set null for ip->i_itemp Long Li
  2024-08-23 11:04 ` [PATCH 2/5] xfs: ensuere deleting item from AIL after shutdown in dquot flush Long Li
@ 2024-08-23 11:04 ` Long Li
  2024-08-23 17:17   ` Darrick J. Wong
  2024-08-24  3:34   ` Christoph Hellwig
  2024-08-23 11:04 ` [PATCH 4/5] xfs: fix a UAF when dquot item push Long Li
  2024-08-23 11:04 ` [PATCH 5/5] xfs: fix a UAF when inode " Long Li
  4 siblings, 2 replies; 26+ messages in thread
From: Long Li @ 2024-08-23 11:04 UTC (permalink / raw)
  To: djwong, chandanbabu
  Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun

After pushing log items, the log item may have been freed, making it
unsafe to access in tracepoints. This commit introduces XFS_ITEM_UNSAFE
to indicate when an item might be freed during the item push operation.

Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/xfs_stats.h     | 1 +
 fs/xfs/xfs_trans.h     | 1 +
 fs/xfs/xfs_trans_ail.c | 7 +++++++
 3 files changed, 9 insertions(+)

diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index a61fb56ed2e6..9a7a020587cf 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -86,6 +86,7 @@ struct __xfsstats {
 	uint32_t		xs_push_ail_pushbuf;
 	uint32_t		xs_push_ail_pinned;
 	uint32_t		xs_push_ail_locked;
+	uint32_t		xs_push_ail_unsafe;
 	uint32_t		xs_push_ail_flushing;
 	uint32_t		xs_push_ail_restarts;
 	uint32_t		xs_push_ail_flush;
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f06cc0f41665..fd4f04853fe2 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -117,6 +117,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
 #define XFS_ITEM_PINNED		1
 #define XFS_ITEM_LOCKED		2
 #define XFS_ITEM_FLUSHING	3
+#define XFS_ITEM_UNSAFE		4
 
 /*
  * This is the structure maintained for every active transaction.
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 8ede9d099d1f..a5ab1ffb8937 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -561,6 +561,13 @@ xfsaild_push(
 
 			stuck++;
 			break;
+		case XFS_ITEM_UNSAFE:
+			/*
+			 * The item may have been freed, so we can't access the
+			 * log item here.
+			 */
+			XFS_STATS_INC(mp, xs_push_ail_unsafe);
+			break;
 		default:
 			ASSERT(0);
 			break;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/5] xfs: fix a UAF when dquot item push
  2024-08-23 11:04 [PATCH 0/5] xfs: fix and cleanups for log item push Long Li
                   ` (2 preceding siblings ...)
  2024-08-23 11:04 ` [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result Long Li
@ 2024-08-23 11:04 ` Long Li
  2024-08-23 17:20   ` Darrick J. Wong
  2024-08-23 11:04 ` [PATCH 5/5] xfs: fix a UAF when inode " Long Li
  4 siblings, 1 reply; 26+ messages in thread
From: Long Li @ 2024-08-23 11:04 UTC (permalink / raw)
  To: djwong, chandanbabu
  Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun

If errors are encountered while pushing a dquot log item, the dquot dirty
flag is cleared. Without the protection of dqlock and dqflock locks, the
dquot reclaim thread may free the dquot. Accessing the log item in xfsaild
after this can trigger a UAF.

  CPU0                              CPU1
  push item                         reclaim dquot
  -----------------------           -----------------------
  xfsaild_push_item
    xfs_qm_dquot_logitem_push(lip)
      xfs_dqlock_nowait(dqp)
      xfs_dqflock_nowait(dqp)
      spin_unlock(&lip->li_ailp->ail_lock)
      xfs_qm_dqflush(dqp, &bp)
                       <encountered some errors>
        xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE)
        dqp->q_flags &= ~XFS_DQFLAG_DIRTY
                       <dquot is not diry>
        xfs_trans_ail_delete(lip, 0)
        xfs_dqfunlock(dqp)
      spin_lock(&lip->li_ailp->ail_lock)
      xfs_dqunlock(dqp)
                                    xfs_qm_shrink_scan
                                      list_lru_shrink_walk
                                        xfs_qm_dquot_isolate
                                          xfs_dqlock_nowait(dqp)
                                          xfs_dqfunlock(dqp)
                                          //dquot is clean, not flush it
                                          xfs_dqfunlock(dqp)
                                          dqp->q_flags |= XFS_DQFLAG_FREEING
                                          xfs_dqunlock(dqp)
                                          //add dquot to dispose list
                                      //free dquot in dispose list
                                      xfs_qm_dqfree_one(dqp)
  trace_xfs_ail_xxx(lip)  //UAF

Fix this by returning XFS_ITEM_UNSAFE in xfs_qm_dquot_logitem_push() when
dquot flush encounters errors (excluding EAGAIN error), ensuring xfsaild
does not access the log item after it is pushed.

Fixes: 9e4c109ac822 ("xfs: add AIL pushing tracepoints")
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/xfs_dquot_item.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 7d19091215b0..afc7ad91ddef 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -160,8 +160,16 @@ xfs_qm_dquot_logitem_push(
 		if (!xfs_buf_delwri_queue(bp, buffer_list))
 			rval = XFS_ITEM_FLUSHING;
 		xfs_buf_relse(bp);
-	} else if (error == -EAGAIN)
+	} else if (error == -EAGAIN) {
 		rval = XFS_ITEM_LOCKED;
+	} else {
+		/*
+		 * The dirty flag has been cleared; the dquot may be reclaimed
+		 * after unlock. It's unsafe to access the item after it has
+		 * been pushed.
+		 */
+		rval = XFS_ITEM_UNSAFE;
+	}
 
 	spin_lock(&lip->li_ailp->ail_lock);
 out_unlock:
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/5] xfs: fix a UAF when inode item push
  2024-08-23 11:04 [PATCH 0/5] xfs: fix and cleanups for log item push Long Li
                   ` (3 preceding siblings ...)
  2024-08-23 11:04 ` [PATCH 4/5] xfs: fix a UAF when dquot item push Long Li
@ 2024-08-23 11:04 ` Long Li
  2024-08-23 17:22   ` Darrick J. Wong
  4 siblings, 1 reply; 26+ messages in thread
From: Long Li @ 2024-08-23 11:04 UTC (permalink / raw)
  To: djwong, chandanbabu
  Cc: linux-xfs, david, yi.zhang, houtao1, leo.lilong, yangerkun

KASAN reported a UAF bug while fault injection test:

  ==================================================================
  BUG: KASAN: use-after-free in xfs_inode_item_push+0x2db/0x2f0
  Read of size 8 at addr ffff888022f74788 by task xfsaild/sda/479

  CPU: 0 PID: 479 Comm: xfsaild/sda Not tainted 6.2.0-rc7-00003-ga8a43e2eb5f6 #89
  Call Trace:
   <TASK>
   dump_stack_lvl+0x51/0x6a
   print_report+0x171/0x4a6
   kasan_report+0xb7/0x130
   xfs_inode_item_push+0x2db/0x2f0
   xfsaild+0x729/0x1f70
   kthread+0x290/0x340
   ret_from_fork+0x1f/0x30
   </TASK>

  Allocated by task 494:
   kasan_save_stack+0x22/0x40
   kasan_set_track+0x25/0x30
   __kasan_slab_alloc+0x58/0x70
   kmem_cache_alloc+0x197/0x5d0
   xfs_inode_item_init+0x62/0x170
   xfs_trans_ijoin+0x15e/0x240
   xfs_init_new_inode+0x573/0x1820
   xfs_create+0x6a1/0x1020
   xfs_generic_create+0x544/0x5d0
   vfs_mkdir+0x5d0/0x980
   do_mkdirat+0x14e/0x220
   __x64_sys_mkdir+0x6a/0x80
   do_syscall_64+0x39/0x80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

  Freed by task 14:
   kasan_save_stack+0x22/0x40
   kasan_set_track+0x25/0x30
   kasan_save_free_info+0x2e/0x40
   __kasan_slab_free+0x114/0x1b0
   kmem_cache_free+0xee/0x4e0
   xfs_inode_free_callback+0x187/0x2a0
   rcu_do_batch+0x317/0xce0
   rcu_core+0x686/0xa90
   __do_softirq+0x1b6/0x626

  The buggy address belongs to the object at ffff888022f74758
   which belongs to the cache xfs_ili of size 200
  The buggy address is located 48 bytes inside of
   200-byte region [ffff888022f74758, ffff888022f74820)

  The buggy address belongs to the physical page:
  page:ffffea00008bdd00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22f74
  head:ffffea00008bdd00 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
  flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
  raw: 001fffff80010200 ffff888010ed4040 ffffea00008b2510 ffffea00008bde10
  raw: 0000000000000000 00000000001a001a 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888022f74680: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
   ffff888022f74700: fc fc fc fc fc fc fc fc fc fc fc fa fb fb fb fb
  >ffff888022f74780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                        ^
   ffff888022f74800: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
   ffff888022f74880: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ==================================================================

When push inode item in xfsaild, it will race with reclaim inodes task.
Consider the following call graph, both tasks deal with the same inode.
During flushing the cluster, it will enter xfs_iflush_abort() in shutdown
conditions, inode's XFS_IFLUSHING flag will be cleared and lip->li_buf set
to null. Concurrently, inode will be reclaimed in shutdown conditions,
there is no need to wait xfs buf lock because of lip->li_buf is null at
this time, inode will be freed via rcu callback if xfsaild task schedule
out during flushing the cluster. so, it is unsafe to reference lip after
flushing the cluster in xfs_inode_item_push().

			<log item is in AIL>
			<filesystem shutdown>
spin_lock(&ailp->ail_lock)
xfs_inode_item_push(lip)
  xfs_buf_trylock(bp)
  spin_unlock(&lip->li_ailp->ail_lock)
  xfs_iflush_cluster(bp)
    if (xfs_is_shutdown())
      xfs_iflush_abort(ip)
	xfs_trans_ail_delete(ip)
	  spin_lock(&ailp->ail_lock)
	  spin_unlock(&ailp->ail_lock)
	xfs_iflush_abort_clean(ip)
      error = -EIO
			<log item removed from AIL>
			<log item li_buf set to null>
    if (error)
      xfs_force_shutdown()
	xlog_shutdown_wait(mp->m_log)
	  might_sleep()
					xfs_reclaim_inode(ip)
					if (shutdown)
					  xfs_iflush_shutdown_abort(ip)
					    if (!bp)
					      xfs_iflush_abort(ip)
					      return
				        __xfs_inode_free(ip)
					   call_rcu(ip, xfs_inode_free_callback)
			......
			<rcu grace period expires>
			<rcu free callbacks run somewhere>
			  xfs_inode_free_callback(ip)
			    kmem_cache_free(ip->i_itemp)
			......
<starts running again>
    xfs_buf_ioend_fail(bp);
      xfs_buf_ioend(bp)
        xfs_buf_relse(bp);
    return error
spin_lock(&lip->li_ailp->ail_lock)
  <UAF on log item>

Additionally, after xfsaild_push_item(), the tracepoints can still access
the log item, potentially causing a UAF. I've previously submitted two
versions [1][2] attempting to solve this issue, but the solutions had
flaws.

Fix it by returning XFS_ITEM_UNSAFE in xfs_inode_item_push() when the log
item might be freed, ensuring xfsaild does not access the log item after
it is pushed.

[1] https://patchwork.kernel.org/project/xfs/patch/20230211022941.GA1515023@ceph-admin/
[2] https://patchwork.kernel.org/project/xfs/patch/20230722025721.312909-1-leo.lilong@huawei.com/
Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/xfs_inode_item.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index b509cbd191f4..c855cd2c81a5 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -720,10 +720,11 @@ STATIC uint
 xfs_inode_item_push(
 	struct xfs_log_item	*lip,
 	struct list_head	*buffer_list)
-		__releases(&lip->li_ailp->ail_lock)
-		__acquires(&lip->li_ailp->ail_lock)
+		__releases(&ailp->ail_lock)
+		__acquires(&ailp->ail_lock)
 {
 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+	struct xfs_ail		*ailp = lip->li_ailp;
 	struct xfs_inode	*ip = iip->ili_inode;
 	struct xfs_buf		*bp = lip->li_buf;
 	uint			rval = XFS_ITEM_SUCCESS;
@@ -748,7 +749,7 @@ xfs_inode_item_push(
 	if (!xfs_buf_trylock(bp))
 		return XFS_ITEM_LOCKED;
 
-	spin_unlock(&lip->li_ailp->ail_lock);
+	spin_unlock(&ailp->ail_lock);
 
 	/*
 	 * We need to hold a reference for flushing the cluster buffer as it may
@@ -762,17 +763,23 @@ xfs_inode_item_push(
 		if (!xfs_buf_delwri_queue(bp, buffer_list))
 			rval = XFS_ITEM_FLUSHING;
 		xfs_buf_relse(bp);
-	} else {
+	} else if (error == -EAGAIN) {
 		/*
 		 * Release the buffer if we were unable to flush anything. On
 		 * any other error, the buffer has already been released.
 		 */
-		if (error == -EAGAIN)
-			xfs_buf_relse(bp);
+		xfs_buf_relse(bp);
 		rval = XFS_ITEM_LOCKED;
+	} else {
+		/*
+		 * The filesystem has already been shut down. If there's a race
+		 * between inode flush and inode reclaim, the inode might be
+		 * freed. Accessing the item after this point would be unsafe.
+		 */
+		rval = XFS_ITEM_UNSAFE;
 	}
 
-	spin_lock(&lip->li_ailp->ail_lock);
+	spin_lock(&ailp->ail_lock);
 	return rval;
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] xfs: remove redundant set null for ip->i_itemp
  2024-08-23 11:04 ` [PATCH 1/5] xfs: remove redundant set null for ip->i_itemp Long Li
@ 2024-08-23 16:37   ` Darrick J. Wong
  2024-08-25  4:52   ` Christoph Hellwig
  1 sibling, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2024-08-23 16:37 UTC (permalink / raw)
  To: Long Li; +Cc: chandanbabu, linux-xfs, david, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 07:04:35PM +0800, Long Li wrote:
> ip->i_itemp has been set null in xfs_inode_item_destroy(), so there is
> no need set it null again in xfs_inode_free_callback().
> 
> Signed-off-by: Long Li <leo.lilong@huawei.com>

Seems reasonable,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_icache.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index cf629302d48e..a5e5e5520a3b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -143,7 +143,6 @@ xfs_inode_free_callback(
>  		ASSERT(!test_bit(XFS_LI_IN_AIL,
>  				 &ip->i_itemp->ili_item.li_flags));
>  		xfs_inode_item_destroy(ip);
> -		ip->i_itemp = NULL;
>  	}
>  
>  	kmem_cache_free(xfs_inode_cache, ip);
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/5] xfs: ensuere deleting item from AIL after shutdown in dquot flush
  2024-08-23 11:04 ` [PATCH 2/5] xfs: ensuere deleting item from AIL after shutdown in dquot flush Long Li
@ 2024-08-23 17:00   ` Darrick J. Wong
  2024-08-24  3:08     ` Long Li
  2024-08-27  9:40     ` Dave Chinner
  0 siblings, 2 replies; 26+ messages in thread
From: Darrick J. Wong @ 2024-08-23 17:00 UTC (permalink / raw)
  To: Long Li; +Cc: chandanbabu, linux-xfs, david, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 07:04:36PM +0800, Long Li wrote:
> Deleting items from the AIL before the log is shut down can result in the
> log tail moving forward in the journal on disk because log writes can still
> be taking place. As a result, items that have been deleted from the AIL
> might not be recovered during the next mount, even though they should be,
> as they were never written back to disk.
> 
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/xfs_dquot.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index c1b211c260a9..4cbe3db6fc32 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1332,9 +1332,15 @@ xfs_qm_dqflush(
>  	return 0;
>  
>  out_abort:
> +	/*
> +	 * Shutdown first to stop the log before deleting items from the AIL.
> +	 * Deleting items from the AIL before the log is shut down can result
> +	 * in the log tail moving forward in the journal on disk because log
> +	 * writes can still be taking place.
> +	 */
> +	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  	dqp->q_flags &= ~XFS_DQFLAG_DIRTY;
>  	xfs_trans_ail_delete(lip, 0);
> -	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);

I see the logic in shutting down the log before letting go of the dquot
log item that triggered the shutdown, but I wonder, why do we delete the
item from the AIL?  AFAICT the inode items don't do that on iflush
failure, but OTOH I couldn't figure out how the log items in the AIL get
deleted from the AIL after a shutdown.  Or maybe during a shutdown we
just stop xfsaild and let the higher level objects free the log items
during reclaim?

--D

>  out_unlock:
>  	xfs_dqfunlock(dqp);
>  	return error;
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-23 11:04 ` [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result Long Li
@ 2024-08-23 17:17   ` Darrick J. Wong
  2024-08-24  3:30     ` Long Li
  2024-08-27  9:44     ` Dave Chinner
  2024-08-24  3:34   ` Christoph Hellwig
  1 sibling, 2 replies; 26+ messages in thread
From: Darrick J. Wong @ 2024-08-23 17:17 UTC (permalink / raw)
  To: Long Li; +Cc: chandanbabu, linux-xfs, david, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 07:04:37PM +0800, Long Li wrote:
> After pushing log items, the log item may have been freed, making it
> unsafe to access in tracepoints. This commit introduces XFS_ITEM_UNSAFE
> to indicate when an item might be freed during the item push operation.
> 
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/xfs_stats.h     | 1 +
>  fs/xfs/xfs_trans.h     | 1 +
>  fs/xfs/xfs_trans_ail.c | 7 +++++++
>  3 files changed, 9 insertions(+)
> 
> diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
> index a61fb56ed2e6..9a7a020587cf 100644
> --- a/fs/xfs/xfs_stats.h
> +++ b/fs/xfs/xfs_stats.h
> @@ -86,6 +86,7 @@ struct __xfsstats {
>  	uint32_t		xs_push_ail_pushbuf;
>  	uint32_t		xs_push_ail_pinned;
>  	uint32_t		xs_push_ail_locked;
> +	uint32_t		xs_push_ail_unsafe;
>  	uint32_t		xs_push_ail_flushing;
>  	uint32_t		xs_push_ail_restarts;
>  	uint32_t		xs_push_ail_flush;
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index f06cc0f41665..fd4f04853fe2 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -117,6 +117,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
>  #define XFS_ITEM_PINNED		1
>  #define XFS_ITEM_LOCKED		2
>  #define XFS_ITEM_FLUSHING	3
> +#define XFS_ITEM_UNSAFE		4
>  
>  /*
>   * This is the structure maintained for every active transaction.
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 8ede9d099d1f..a5ab1ffb8937 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -561,6 +561,13 @@ xfsaild_push(
>  
>  			stuck++;
>  			break;
> +		case XFS_ITEM_UNSAFE:
> +			/*
> +			 * The item may have been freed, so we can't access the
> +			 * log item here.
> +			 */
> +			XFS_STATS_INC(mp, xs_push_ail_unsafe);

Aha, so this is in reaction to Dave's comment "So, yeah, these failure
cases need to return something different to xfsaild_push() so it knows
that it is unsafe to reference the log item, even for tracing purposes."

What we're trying to communicate through xfsaild_push_item is that we've
cycled the AIL lock and possibly done something (e.g. deleting the log
item from the AIL) such that the lip reference might be stale.

Can we call this XFS_ITEM_STALEREF?  "Unsafe" is vague.

--D

> +			break;
>  		default:
>  			ASSERT(0);
>  			break;
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] xfs: fix a UAF when dquot item push
  2024-08-23 11:04 ` [PATCH 4/5] xfs: fix a UAF when dquot item push Long Li
@ 2024-08-23 17:20   ` Darrick J. Wong
  2024-08-24  2:03     ` Long Li
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2024-08-23 17:20 UTC (permalink / raw)
  To: Long Li; +Cc: chandanbabu, linux-xfs, david, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 07:04:38PM +0800, Long Li wrote:
> If errors are encountered while pushing a dquot log item, the dquot dirty
> flag is cleared. Without the protection of dqlock and dqflock locks, the
> dquot reclaim thread may free the dquot. Accessing the log item in xfsaild
> after this can trigger a UAF.
> 
>   CPU0                              CPU1
>   push item                         reclaim dquot
>   -----------------------           -----------------------
>   xfsaild_push_item
>     xfs_qm_dquot_logitem_push(lip)
>       xfs_dqlock_nowait(dqp)
>       xfs_dqflock_nowait(dqp)
>       spin_unlock(&lip->li_ailp->ail_lock)
>       xfs_qm_dqflush(dqp, &bp)
>                        <encountered some errors>
>         xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE)
>         dqp->q_flags &= ~XFS_DQFLAG_DIRTY
>                        <dquot is not diry>
>         xfs_trans_ail_delete(lip, 0)
>         xfs_dqfunlock(dqp)
>       spin_lock(&lip->li_ailp->ail_lock)
>       xfs_dqunlock(dqp)
>                                     xfs_qm_shrink_scan
>                                       list_lru_shrink_walk
>                                         xfs_qm_dquot_isolate
>                                           xfs_dqlock_nowait(dqp)
>                                           xfs_dqfunlock(dqp)
>                                           //dquot is clean, not flush it
>                                           xfs_dqfunlock(dqp)
>                                           dqp->q_flags |= XFS_DQFLAG_FREEING
>                                           xfs_dqunlock(dqp)
>                                           //add dquot to dispose list
>                                       //free dquot in dispose list
>                                       xfs_qm_dqfree_one(dqp)
>   trace_xfs_ail_xxx(lip)  //UAF
> 
> Fix this by returning XFS_ITEM_UNSAFE in xfs_qm_dquot_logitem_push() when
> dquot flush encounters errors (excluding EAGAIN error), ensuring xfsaild
> does not access the log item after it is pushed.
> 
> Fixes: 9e4c109ac822 ("xfs: add AIL pushing tracepoints")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/xfs_dquot_item.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> index 7d19091215b0..afc7ad91ddef 100644
> --- a/fs/xfs/xfs_dquot_item.c
> +++ b/fs/xfs/xfs_dquot_item.c
> @@ -160,8 +160,16 @@ xfs_qm_dquot_logitem_push(
>  		if (!xfs_buf_delwri_queue(bp, buffer_list))
>  			rval = XFS_ITEM_FLUSHING;
>  		xfs_buf_relse(bp);
> -	} else if (error == -EAGAIN)
> +	} else if (error == -EAGAIN) {
>  		rval = XFS_ITEM_LOCKED;
> +	} else {
> +		/*
> +		 * The dirty flag has been cleared; the dquot may be reclaimed
> +		 * after unlock. It's unsafe to access the item after it has
> +		 * been pushed.
> +		 */
> +		rval = XFS_ITEM_UNSAFE;
> +	}
>  
>  	spin_lock(&lip->li_ailp->ail_lock);

Um, didn't we just establish that lip could have been freed?  Why is it
safe to continue accessing the AIL through the lip here?

--D

>  out_unlock:
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] xfs: fix a UAF when inode item push
  2024-08-23 11:04 ` [PATCH 5/5] xfs: fix a UAF when inode " Long Li
@ 2024-08-23 17:22   ` Darrick J. Wong
  2024-08-27  8:14     ` Long Li
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2024-08-23 17:22 UTC (permalink / raw)
  To: Long Li; +Cc: chandanbabu, linux-xfs, david, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 07:04:39PM +0800, Long Li wrote:
> KASAN reported a UAF bug while fault injection test:
> 
>   ==================================================================
>   BUG: KASAN: use-after-free in xfs_inode_item_push+0x2db/0x2f0
>   Read of size 8 at addr ffff888022f74788 by task xfsaild/sda/479
> 
>   CPU: 0 PID: 479 Comm: xfsaild/sda Not tainted 6.2.0-rc7-00003-ga8a43e2eb5f6 #89
>   Call Trace:
>    <TASK>
>    dump_stack_lvl+0x51/0x6a
>    print_report+0x171/0x4a6
>    kasan_report+0xb7/0x130
>    xfs_inode_item_push+0x2db/0x2f0
>    xfsaild+0x729/0x1f70
>    kthread+0x290/0x340
>    ret_from_fork+0x1f/0x30
>    </TASK>
> 
>   Allocated by task 494:
>    kasan_save_stack+0x22/0x40
>    kasan_set_track+0x25/0x30
>    __kasan_slab_alloc+0x58/0x70
>    kmem_cache_alloc+0x197/0x5d0
>    xfs_inode_item_init+0x62/0x170
>    xfs_trans_ijoin+0x15e/0x240
>    xfs_init_new_inode+0x573/0x1820
>    xfs_create+0x6a1/0x1020
>    xfs_generic_create+0x544/0x5d0
>    vfs_mkdir+0x5d0/0x980
>    do_mkdirat+0x14e/0x220
>    __x64_sys_mkdir+0x6a/0x80
>    do_syscall_64+0x39/0x80
>    entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
>   Freed by task 14:
>    kasan_save_stack+0x22/0x40
>    kasan_set_track+0x25/0x30
>    kasan_save_free_info+0x2e/0x40
>    __kasan_slab_free+0x114/0x1b0
>    kmem_cache_free+0xee/0x4e0
>    xfs_inode_free_callback+0x187/0x2a0
>    rcu_do_batch+0x317/0xce0
>    rcu_core+0x686/0xa90
>    __do_softirq+0x1b6/0x626
> 
>   The buggy address belongs to the object at ffff888022f74758
>    which belongs to the cache xfs_ili of size 200
>   The buggy address is located 48 bytes inside of
>    200-byte region [ffff888022f74758, ffff888022f74820)
> 
>   The buggy address belongs to the physical page:
>   page:ffffea00008bdd00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22f74
>   head:ffffea00008bdd00 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
>   flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
>   raw: 001fffff80010200 ffff888010ed4040 ffffea00008b2510 ffffea00008bde10
>   raw: 0000000000000000 00000000001a001a 00000001ffffffff 0000000000000000
>   page dumped because: kasan: bad access detected
> 
>   Memory state around the buggy address:
>    ffff888022f74680: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
>    ffff888022f74700: fc fc fc fc fc fc fc fc fc fc fc fa fb fb fb fb
>   >ffff888022f74780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                         ^
>    ffff888022f74800: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>    ffff888022f74880: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   ==================================================================
> 
> When push inode item in xfsaild, it will race with reclaim inodes task.
> Consider the following call graph, both tasks deal with the same inode.
> During flushing the cluster, it will enter xfs_iflush_abort() in shutdown
> conditions, inode's XFS_IFLUSHING flag will be cleared and lip->li_buf set
> to null. Concurrently, inode will be reclaimed in shutdown conditions,
> there is no need to wait xfs buf lock because of lip->li_buf is null at
> this time, inode will be freed via rcu callback if xfsaild task schedule
> out during flushing the cluster. so, it is unsafe to reference lip after
> flushing the cluster in xfs_inode_item_push().
> 
> 			<log item is in AIL>
> 			<filesystem shutdown>
> spin_lock(&ailp->ail_lock)
> xfs_inode_item_push(lip)
>   xfs_buf_trylock(bp)
>   spin_unlock(&lip->li_ailp->ail_lock)
>   xfs_iflush_cluster(bp)
>     if (xfs_is_shutdown())
>       xfs_iflush_abort(ip)
> 	xfs_trans_ail_delete(ip)
> 	  spin_lock(&ailp->ail_lock)
> 	  spin_unlock(&ailp->ail_lock)
> 	xfs_iflush_abort_clean(ip)
>       error = -EIO
> 			<log item removed from AIL>
> 			<log item li_buf set to null>
>     if (error)
>       xfs_force_shutdown()
> 	xlog_shutdown_wait(mp->m_log)
> 	  might_sleep()
> 					xfs_reclaim_inode(ip)
> 					if (shutdown)
> 					  xfs_iflush_shutdown_abort(ip)
> 					    if (!bp)
> 					      xfs_iflush_abort(ip)
> 					      return
> 				        __xfs_inode_free(ip)
> 					   call_rcu(ip, xfs_inode_free_callback)
> 			......
> 			<rcu grace period expires>
> 			<rcu free callbacks run somewhere>
> 			  xfs_inode_free_callback(ip)
> 			    kmem_cache_free(ip->i_itemp)
> 			......
> <starts running again>
>     xfs_buf_ioend_fail(bp);
>       xfs_buf_ioend(bp)
>         xfs_buf_relse(bp);
>     return error
> spin_lock(&lip->li_ailp->ail_lock)
>   <UAF on log item>
> 
> Additionally, after xfsaild_push_item(), the tracepoints can still access
> the log item, potentially causing a UAF. I've previously submitted two
> versions [1][2] attempting to solve this issue, but the solutions had
> flaws.
> 
> Fix it by returning XFS_ITEM_UNSAFE in xfs_inode_item_push() when the log
> item might be freed, ensuring xfsaild does not access the log item after
> it is pushed.
> 
> [1] https://patchwork.kernel.org/project/xfs/patch/20230211022941.GA1515023@ceph-admin/
> [2] https://patchwork.kernel.org/project/xfs/patch/20230722025721.312909-1-leo.lilong@huawei.com/
> Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/xfs_inode_item.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index b509cbd191f4..c855cd2c81a5 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -720,10 +720,11 @@ STATIC uint
>  xfs_inode_item_push(
>  	struct xfs_log_item	*lip,
>  	struct list_head	*buffer_list)
> -		__releases(&lip->li_ailp->ail_lock)
> -		__acquires(&lip->li_ailp->ail_lock)
> +		__releases(&ailp->ail_lock)
> +		__acquires(&ailp->ail_lock)

I wonder, is smatch or whatever actually uses these annotations smart
enough to read through the local variable declarations below?

>  {
>  	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> +	struct xfs_ail		*ailp = lip->li_ailp;
>  	struct xfs_inode	*ip = iip->ili_inode;
>  	struct xfs_buf		*bp = lip->li_buf;
>  	uint			rval = XFS_ITEM_SUCCESS;
> @@ -748,7 +749,7 @@ xfs_inode_item_push(
>  	if (!xfs_buf_trylock(bp))
>  		return XFS_ITEM_LOCKED;
>  
> -	spin_unlock(&lip->li_ailp->ail_lock);
> +	spin_unlock(&ailp->ail_lock);
>  
>  	/*
>  	 * We need to hold a reference for flushing the cluster buffer as it may
> @@ -762,17 +763,23 @@ xfs_inode_item_push(
>  		if (!xfs_buf_delwri_queue(bp, buffer_list))
>  			rval = XFS_ITEM_FLUSHING;
>  		xfs_buf_relse(bp);
> -	} else {
> +	} else if (error == -EAGAIN) {
>  		/*
>  		 * Release the buffer if we were unable to flush anything. On
>  		 * any other error, the buffer has already been released.
>  		 */
> -		if (error == -EAGAIN)
> -			xfs_buf_relse(bp);
> +		xfs_buf_relse(bp);
>  		rval = XFS_ITEM_LOCKED;
> +	} else {
> +		/*
> +		 * The filesystem has already been shut down. If there's a race
> +		 * between inode flush and inode reclaim, the inode might be
> +		 * freed. Accessing the item after this point would be unsafe.
> +		 */
> +		rval = XFS_ITEM_UNSAFE;

I wonder if it's time to convert this to a switch statement but the fix
looks correct so

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


>  	}
>  
> -	spin_lock(&lip->li_ailp->ail_lock);
> +	spin_lock(&ailp->ail_lock);
>  	return rval;
>  }
>  
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] xfs: fix a UAF when dquot item push
  2024-08-23 17:20   ` Darrick J. Wong
@ 2024-08-24  2:03     ` Long Li
  0 siblings, 0 replies; 26+ messages in thread
From: Long Li @ 2024-08-24  2:03 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: chandanbabu, linux-xfs, david, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 10:20:38AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 23, 2024 at 07:04:38PM +0800, Long Li wrote:
> > If errors are encountered while pushing a dquot log item, the dquot dirty
> > flag is cleared. Without the protection of dqlock and dqflock locks, the
> > dquot reclaim thread may free the dquot. Accessing the log item in xfsaild
> > after this can trigger a UAF.
> > 
> >   CPU0                              CPU1
> >   push item                         reclaim dquot
> >   -----------------------           -----------------------
> >   xfsaild_push_item
> >     xfs_qm_dquot_logitem_push(lip)
> >       xfs_dqlock_nowait(dqp)
> >       xfs_dqflock_nowait(dqp)
> >       spin_unlock(&lip->li_ailp->ail_lock)
> >       xfs_qm_dqflush(dqp, &bp)
> >                        <encountered some errors>
> >         xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE)
> >         dqp->q_flags &= ~XFS_DQFLAG_DIRTY
> >                        <dquot is not diry>
> >         xfs_trans_ail_delete(lip, 0)
> >         xfs_dqfunlock(dqp)
> >       spin_lock(&lip->li_ailp->ail_lock)
> >       xfs_dqunlock(dqp)
> >                                     xfs_qm_shrink_scan
> >                                       list_lru_shrink_walk
> >                                         xfs_qm_dquot_isolate
> >                                           xfs_dqlock_nowait(dqp)
> >                                           xfs_dqfunlock(dqp)
> >                                           //dquot is clean, not flush it
> >                                           xfs_dqfunlock(dqp)
> >                                           dqp->q_flags |= XFS_DQFLAG_FREEING
> >                                           xfs_dqunlock(dqp)
> >                                           //add dquot to dispose list
> >                                       //free dquot in dispose list
> >                                       xfs_qm_dqfree_one(dqp)
> >   trace_xfs_ail_xxx(lip)  //UAF
> > 
> > Fix this by returning XFS_ITEM_UNSAFE in xfs_qm_dquot_logitem_push() when
> > dquot flush encounters errors (excluding EAGAIN error), ensuring xfsaild
> > does not access the log item after it is pushed.
> > 
> > Fixes: 9e4c109ac822 ("xfs: add AIL pushing tracepoints")
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/xfs/xfs_dquot_item.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> > index 7d19091215b0..afc7ad91ddef 100644
> > --- a/fs/xfs/xfs_dquot_item.c
> > +++ b/fs/xfs/xfs_dquot_item.c
> > @@ -160,8 +160,16 @@ xfs_qm_dquot_logitem_push(
> >  		if (!xfs_buf_delwri_queue(bp, buffer_list))
> >  			rval = XFS_ITEM_FLUSHING;
> >  		xfs_buf_relse(bp);
> > -	} else if (error == -EAGAIN)
> > +	} else if (error == -EAGAIN) {
> >  		rval = XFS_ITEM_LOCKED;
> > +	} else {
> > +		/*
> > +		 * The dirty flag has been cleared; the dquot may be reclaimed
> > +		 * after unlock. It's unsafe to access the item after it has
> > +		 * been pushed.
> > +		 */
> > +		rval = XFS_ITEM_UNSAFE;
> > +	}
> >  
> >  	spin_lock(&lip->li_ailp->ail_lock);
> 
> Um, didn't we just establish that lip could have been freed?  Why is it
> safe to continue accessing the AIL through the lip here?
> 
> --D
> 

We are still within the dqlock context here, and the dquot will only be
released after the dqlock is released. In contrast, during the inode item
push, the inode log item may be released within xfs_iflush_cluster() - this
is where the two cases differ. However, for the sake of code consistency,
should we also avoid accessing the AIL through lip here?

Thanks,
Long Li


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/5] xfs: ensuere deleting item from AIL after shutdown in dquot flush
  2024-08-23 17:00   ` Darrick J. Wong
@ 2024-08-24  3:08     ` Long Li
  2024-08-27  9:40     ` Dave Chinner
  1 sibling, 0 replies; 26+ messages in thread
From: Long Li @ 2024-08-24  3:08 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: chandanbabu, linux-xfs, david, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 10:00:06AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 23, 2024 at 07:04:36PM +0800, Long Li wrote:
> > Deleting items from the AIL before the log is shut down can result in the
> > log tail moving forward in the journal on disk because log writes can still
> > be taking place. As a result, items that have been deleted from the AIL
> > might not be recovered during the next mount, even though they should be,
> > as they were never written back to disk.
> > 
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/xfs/xfs_dquot.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > index c1b211c260a9..4cbe3db6fc32 100644
> > --- a/fs/xfs/xfs_dquot.c
> > +++ b/fs/xfs/xfs_dquot.c
> > @@ -1332,9 +1332,15 @@ xfs_qm_dqflush(
> >  	return 0;
> >  
> >  out_abort:
> > +	/*
> > +	 * Shutdown first to stop the log before deleting items from the AIL.
> > +	 * Deleting items from the AIL before the log is shut down can result
> > +	 * in the log tail moving forward in the journal on disk because log
> > +	 * writes can still be taking place.
> > +	 */
> > +	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> >  	dqp->q_flags &= ~XFS_DQFLAG_DIRTY;
> >  	xfs_trans_ail_delete(lip, 0);
> > -	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> 
> I see the logic in shutting down the log before letting go of the dquot
> log item that triggered the shutdown, but I wonder, why do we delete the
> item from the AIL?  AFAICT the inode items don't do that on iflush
> failure, but OTOH I couldn't figure out how the log items in the AIL get
> deleted from the AIL after a shutdown.  Or maybe during a shutdown we
> just stop xfsaild and let the higher level objects free the log items
> during reclaim?
> 
> --D
> 

When inode flush failure, the inode item is also removed from the AIL. Since
the inode item has already been added to bp->b_li_list during precommit, it
can be deleted through the error handling xfs_buf_ioend_fail(bp), and this
deletion occurs after the log shutdown. However, during dquot item push,
the dquot item has not yet been attached to the buffer. Therefore, in this
case, the dquot item is directly removed from the AIL.

Thanks,
Long Li

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-23 17:17   ` Darrick J. Wong
@ 2024-08-24  3:30     ` Long Li
  2024-08-27  9:44     ` Dave Chinner
  1 sibling, 0 replies; 26+ messages in thread
From: Long Li @ 2024-08-24  3:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: chandanbabu, linux-xfs, david, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 10:17:09AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 23, 2024 at 07:04:37PM +0800, Long Li wrote:
> > After pushing log items, the log item may have been freed, making it
> > unsafe to access in tracepoints. This commit introduces XFS_ITEM_UNSAFE
> > to indicate when an item might be freed during the item push operation.
> > 
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/xfs/xfs_stats.h     | 1 +
> >  fs/xfs/xfs_trans.h     | 1 +
> >  fs/xfs/xfs_trans_ail.c | 7 +++++++
> >  3 files changed, 9 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
> > index a61fb56ed2e6..9a7a020587cf 100644
> > --- a/fs/xfs/xfs_stats.h
> > +++ b/fs/xfs/xfs_stats.h
> > @@ -86,6 +86,7 @@ struct __xfsstats {
> >  	uint32_t		xs_push_ail_pushbuf;
> >  	uint32_t		xs_push_ail_pinned;
> >  	uint32_t		xs_push_ail_locked;
> > +	uint32_t		xs_push_ail_unsafe;
> >  	uint32_t		xs_push_ail_flushing;
> >  	uint32_t		xs_push_ail_restarts;
> >  	uint32_t		xs_push_ail_flush;
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index f06cc0f41665..fd4f04853fe2 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -117,6 +117,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
> >  #define XFS_ITEM_PINNED		1
> >  #define XFS_ITEM_LOCKED		2
> >  #define XFS_ITEM_FLUSHING	3
> > +#define XFS_ITEM_UNSAFE		4
> >  
> >  /*
> >   * This is the structure maintained for every active transaction.
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 8ede9d099d1f..a5ab1ffb8937 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -561,6 +561,13 @@ xfsaild_push(
> >  
> >  			stuck++;
> >  			break;
> > +		case XFS_ITEM_UNSAFE:
> > +			/*
> > +			 * The item may have been freed, so we can't access the
> > +			 * log item here.
> > +			 */
> > +			XFS_STATS_INC(mp, xs_push_ail_unsafe);
> 
> Aha, so this is in reaction to Dave's comment "So, yeah, these failure
> cases need to return something different to xfsaild_push() so it knows
> that it is unsafe to reference the log item, even for tracing purposes."
> 

Yse ...

> What we're trying to communicate through xfsaild_push_item is that we've
> cycled the AIL lock and possibly done something (e.g. deleting the log
> item from the AIL) such that the lip reference might be stale.
> 
> Can we call this XFS_ITEM_STALEREF?  "Unsafe" is vague.
> 

Yes, it's seems reasonable, but doesn't XFS_ITEM_STALED looks more consistent??
 
Thanks,
Long Li

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-23 11:04 ` [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result Long Li
  2024-08-23 17:17   ` Darrick J. Wong
@ 2024-08-24  3:34   ` Christoph Hellwig
  2024-08-27  9:41     ` Long Li
  2024-08-27 10:00     ` Dave Chinner
  1 sibling, 2 replies; 26+ messages in thread
From: Christoph Hellwig @ 2024-08-24  3:34 UTC (permalink / raw)
  To: Long Li; +Cc: djwong, chandanbabu, linux-xfs, david, yi.zhang, houtao1,
	yangerkun

On Fri, Aug 23, 2024 at 07:04:37PM +0800, Long Li wrote:
> After pushing log items, the log item may have been freed, making it
> unsafe to access in tracepoints. This commit introduces XFS_ITEM_UNSAFE
> to indicate when an item might be freed during the item push operation.

So instead of this magic unsafe operation I think declaring a rule that
the lip must never be accessed after the return is the much saner
choice.

We'll then need to figure out how we can still keep useful tracing
without accessing the lip.  The only information the trace points need
from the lip itself are the type, flags, and lsn and those seem small
enough to save on the stack before calling into ->iop_push.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] xfs: remove redundant set null for ip->i_itemp
  2024-08-23 11:04 ` [PATCH 1/5] xfs: remove redundant set null for ip->i_itemp Long Li
  2024-08-23 16:37   ` Darrick J. Wong
@ 2024-08-25  4:52   ` Christoph Hellwig
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2024-08-25  4:52 UTC (permalink / raw)
  To: Long Li; +Cc: djwong, chandanbabu, linux-xfs, david, yi.zhang, houtao1,
	yangerkun

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] xfs: fix a UAF when inode item push
  2024-08-23 17:22   ` Darrick J. Wong
@ 2024-08-27  8:14     ` Long Li
  0 siblings, 0 replies; 26+ messages in thread
From: Long Li @ 2024-08-27  8:14 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: chandanbabu, linux-xfs, david, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 10:22:42AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 23, 2024 at 07:04:39PM +0800, Long Li wrote:
> > KASAN reported a UAF bug while fault injection test:
> > 
> >   ==================================================================
> >   BUG: KASAN: use-after-free in xfs_inode_item_push+0x2db/0x2f0
> >   Read of size 8 at addr ffff888022f74788 by task xfsaild/sda/479
> > 
> >   CPU: 0 PID: 479 Comm: xfsaild/sda Not tainted 6.2.0-rc7-00003-ga8a43e2eb5f6 #89
> >   Call Trace:
> >    <TASK>
> >    dump_stack_lvl+0x51/0x6a
> >    print_report+0x171/0x4a6
> >    kasan_report+0xb7/0x130
> >    xfs_inode_item_push+0x2db/0x2f0
> >    xfsaild+0x729/0x1f70
> >    kthread+0x290/0x340
> >    ret_from_fork+0x1f/0x30
> >    </TASK>
> > 
> >   Allocated by task 494:
> >    kasan_save_stack+0x22/0x40
> >    kasan_set_track+0x25/0x30
> >    __kasan_slab_alloc+0x58/0x70
> >    kmem_cache_alloc+0x197/0x5d0
> >    xfs_inode_item_init+0x62/0x170
> >    xfs_trans_ijoin+0x15e/0x240
> >    xfs_init_new_inode+0x573/0x1820
> >    xfs_create+0x6a1/0x1020
> >    xfs_generic_create+0x544/0x5d0
> >    vfs_mkdir+0x5d0/0x980
> >    do_mkdirat+0x14e/0x220
> >    __x64_sys_mkdir+0x6a/0x80
> >    do_syscall_64+0x39/0x80
> >    entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > 
> >   Freed by task 14:
> >    kasan_save_stack+0x22/0x40
> >    kasan_set_track+0x25/0x30
> >    kasan_save_free_info+0x2e/0x40
> >    __kasan_slab_free+0x114/0x1b0
> >    kmem_cache_free+0xee/0x4e0
> >    xfs_inode_free_callback+0x187/0x2a0
> >    rcu_do_batch+0x317/0xce0
> >    rcu_core+0x686/0xa90
> >    __do_softirq+0x1b6/0x626
> > 
> >   The buggy address belongs to the object at ffff888022f74758
> >    which belongs to the cache xfs_ili of size 200
> >   The buggy address is located 48 bytes inside of
> >    200-byte region [ffff888022f74758, ffff888022f74820)
> > 
> >   The buggy address belongs to the physical page:
> >   page:ffffea00008bdd00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22f74
> >   head:ffffea00008bdd00 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
> >   flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
> >   raw: 001fffff80010200 ffff888010ed4040 ffffea00008b2510 ffffea00008bde10
> >   raw: 0000000000000000 00000000001a001a 00000001ffffffff 0000000000000000
> >   page dumped because: kasan: bad access detected
> > 
> >   Memory state around the buggy address:
> >    ffff888022f74680: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
> >    ffff888022f74700: fc fc fc fc fc fc fc fc fc fc fc fa fb fb fb fb
> >   >ffff888022f74780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >                         ^
> >    ffff888022f74800: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
> >    ffff888022f74880: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >   ==================================================================
> > 
> > When push inode item in xfsaild, it will race with reclaim inodes task.
> > Consider the following call graph, both tasks deal with the same inode.
> > During flushing the cluster, it will enter xfs_iflush_abort() in shutdown
> > conditions, inode's XFS_IFLUSHING flag will be cleared and lip->li_buf set
> > to null. Concurrently, inode will be reclaimed in shutdown conditions,
> > there is no need to wait xfs buf lock because of lip->li_buf is null at
> > this time, inode will be freed via rcu callback if xfsaild task schedule
> > out during flushing the cluster. so, it is unsafe to reference lip after
> > flushing the cluster in xfs_inode_item_push().
> > 
> > 			<log item is in AIL>
> > 			<filesystem shutdown>
> > spin_lock(&ailp->ail_lock)
> > xfs_inode_item_push(lip)
> >   xfs_buf_trylock(bp)
> >   spin_unlock(&lip->li_ailp->ail_lock)
> >   xfs_iflush_cluster(bp)
> >     if (xfs_is_shutdown())
> >       xfs_iflush_abort(ip)
> > 	xfs_trans_ail_delete(ip)
> > 	  spin_lock(&ailp->ail_lock)
> > 	  spin_unlock(&ailp->ail_lock)
> > 	xfs_iflush_abort_clean(ip)
> >       error = -EIO
> > 			<log item removed from AIL>
> > 			<log item li_buf set to null>
> >     if (error)
> >       xfs_force_shutdown()
> > 	xlog_shutdown_wait(mp->m_log)
> > 	  might_sleep()
> > 					xfs_reclaim_inode(ip)
> > 					if (shutdown)
> > 					  xfs_iflush_shutdown_abort(ip)
> > 					    if (!bp)
> > 					      xfs_iflush_abort(ip)
> > 					      return
> > 				        __xfs_inode_free(ip)
> > 					   call_rcu(ip, xfs_inode_free_callback)
> > 			......
> > 			<rcu grace period expires>
> > 			<rcu free callbacks run somewhere>
> > 			  xfs_inode_free_callback(ip)
> > 			    kmem_cache_free(ip->i_itemp)
> > 			......
> > <starts running again>
> >     xfs_buf_ioend_fail(bp);
> >       xfs_buf_ioend(bp)
> >         xfs_buf_relse(bp);
> >     return error
> > spin_lock(&lip->li_ailp->ail_lock)
> >   <UAF on log item>
> > 
> > Additionally, after xfsaild_push_item(), the tracepoints can still access
> > the log item, potentially causing a UAF. I've previously submitted two
> > versions [1][2] attempting to solve this issue, but the solutions had
> > flaws.
> > 
> > Fix it by returning XFS_ITEM_UNSAFE in xfs_inode_item_push() when the log
> > item might be freed, ensuring xfsaild does not access the log item after
> > it is pushed.
> > 
> > [1] https://patchwork.kernel.org/project/xfs/patch/20230211022941.GA1515023@ceph-admin/
> > [2] https://patchwork.kernel.org/project/xfs/patch/20230722025721.312909-1-leo.lilong@huawei.com/
> > Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/xfs/xfs_inode_item.c | 21 ++++++++++++++-------
> >  1 file changed, 14 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index b509cbd191f4..c855cd2c81a5 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -720,10 +720,11 @@ STATIC uint
> >  xfs_inode_item_push(
> >  	struct xfs_log_item	*lip,
> >  	struct list_head	*buffer_list)
> > -		__releases(&lip->li_ailp->ail_lock)
> > -		__acquires(&lip->li_ailp->ail_lock)
> > +		__releases(&ailp->ail_lock)
> > +		__acquires(&ailp->ail_lock)
> 
> I wonder, is smatch or whatever actually uses these annotations smart
> enough to read through the local variable declarations below?


Static analysis tools, including smatch, sparse, and others, have limitations
in their ability to fully understand complex pointer relationships and aliasing.

This limitation can lead to false positives or missed issues when the lock
reference in the annotation differs from its usage in the function body.
Therefore, it's usually best practice to maintain consistency between
annotations and actual code usage.

So I think it's just to keep the declaration and the function body consistent.

> 
> >  {
> >  	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> > +	struct xfs_ail		*ailp = lip->li_ailp;
> >  	struct xfs_inode	*ip = iip->ili_inode;
> >  	struct xfs_buf		*bp = lip->li_buf;
> >  	uint			rval = XFS_ITEM_SUCCESS;
> > @@ -748,7 +749,7 @@ xfs_inode_item_push(
> >  	if (!xfs_buf_trylock(bp))
> >  		return XFS_ITEM_LOCKED;
> >  
> > -	spin_unlock(&lip->li_ailp->ail_lock);
> > +	spin_unlock(&ailp->ail_lock);
> >  
> >  	/*
> >  	 * We need to hold a reference for flushing the cluster buffer as it may
> > @@ -762,17 +763,23 @@ xfs_inode_item_push(
> >  		if (!xfs_buf_delwri_queue(bp, buffer_list))
> >  			rval = XFS_ITEM_FLUSHING;
> >  		xfs_buf_relse(bp);
> > -	} else {
> > +	} else if (error == -EAGAIN) {
> >  		/*
> >  		 * Release the buffer if we were unable to flush anything. On
> >  		 * any other error, the buffer has already been released.
> >  		 */
> > -		if (error == -EAGAIN)
> > -			xfs_buf_relse(bp);
> > +		xfs_buf_relse(bp);
> >  		rval = XFS_ITEM_LOCKED;
> > +	} else {
> > +		/*
> > +		 * The filesystem has already been shut down. If there's a race
> > +		 * between inode flush and inode reclaim, the inode might be
> > +		 * freed. Accessing the item after this point would be unsafe.
> > +		 */
> > +		rval = XFS_ITEM_UNSAFE;
> 
> I wonder if it's time to convert this to a switch statement but the fix
> looks correct so
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 

Using a switch statement would be more concise. If we're still using this 
approach in the next version, I can refactor it to use a switch statement.

Thanks for your review,
Long Li

> 
> >  	}
> >  
> > -	spin_lock(&lip->li_ailp->ail_lock);
> > +	spin_lock(&ailp->ail_lock);
> >  	return rval;
> >  }
> >  
> > -- 
> > 2.39.2
> > 
> > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/5] xfs: ensuere deleting item from AIL after shutdown in dquot flush
  2024-08-23 17:00   ` Darrick J. Wong
  2024-08-24  3:08     ` Long Li
@ 2024-08-27  9:40     ` Dave Chinner
  2024-08-31 13:45       ` Long Li
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2024-08-27  9:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Long Li, chandanbabu, linux-xfs, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 10:00:06AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 23, 2024 at 07:04:36PM +0800, Long Li wrote:
> > Deleting items from the AIL before the log is shut down can result in the
> > log tail moving forward in the journal on disk because log writes can still
> > be taking place. As a result, items that have been deleted from the AIL
> > might not be recovered during the next mount, even though they should be,
> > as they were never written back to disk.
> > 
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/xfs/xfs_dquot.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > index c1b211c260a9..4cbe3db6fc32 100644
> > --- a/fs/xfs/xfs_dquot.c
> > +++ b/fs/xfs/xfs_dquot.c
> > @@ -1332,9 +1332,15 @@ xfs_qm_dqflush(
> >  	return 0;
> >  
> >  out_abort:
> > +	/*
> > +	 * Shutdown first to stop the log before deleting items from the AIL.
> > +	 * Deleting items from the AIL before the log is shut down can result
> > +	 * in the log tail moving forward in the journal on disk because log
> > +	 * writes can still be taking place.
> > +	 */
> > +	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> >  	dqp->q_flags &= ~XFS_DQFLAG_DIRTY;
> >  	xfs_trans_ail_delete(lip, 0);
> > -	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> 
> I see the logic in shutting down the log before letting go of the dquot
> log item that triggered the shutdown, but I wonder, why do we delete the
> item from the AIL?  AFAICT the inode items don't do that on iflush
> failure, but OTOH I couldn't figure out how the log items in the AIL get
> deleted from the AIL after a shutdown. 

Intents are removed from the AIL when the transaction containing
the deferred intent chain is cancelled instead of committed due the
log being shut down.

For everything else in the AIL, the ->iop_push method is supposed to
do any cleanup that is necessary by failing the item push and
running the item failure method itself.

For buffers, this is running IO completion as if an IO error
occurred. Error handling sees the shutdown and removes the item from
the AIL.

For inodes, xfs_iflush_cluster() fails the inode buffer as if an IO
error occurred, that then runs the individual inode abort code that
removes the inode items from the AIL.

For dquots, it has the ancient cleanup method that inodes used to
have. i.e. if the dquot has been flushed to the buffer, it is attached to
the buffer and then the buffer submission will fail and run IO
completion with an error. If the dquot hasn't been flushed to the
buffer because either it or the underlying dquot buffer is corrupt
it will remove the dquot from the AIL and then shut down the
filesystem.

It's the latter case that could be an issue. It's not the same as
the inode item case, because the tail pinning that the INODE_ALLOC
inode item type flag causes does not happen with dquots. There is
still a potential window where the dquot could be at the tail of the
log, and remocing it moves the tail forward at exactly the same time
the log tail is being sampled during a log write, and the shutdown
doesn't happen fast enough to prevent the log write going out to
disk.

To make timing of such a race even more unlikely, it would have to
race with a log write that contains a commit record, otherwise the
log tail lsn in the iclog will be ignored because it wasn't
contained within a complete checkpoint in the journal.  It's very
unlikely that a filesystem will read a corrupt dquot from disk at
exactly the same point in time these other journal pre-conditions
are met, but it could happen...

> Or maybe during a shutdown we just stop xfsaild and let the higher
> level objects free the log items during reclaim?

The AIL contains objects that have no references elsewhere in the
filesystem. It must be pushed until empty during unmount after a
shutdown to ensure that all the items in it have been pushed,
failed, removed from the AIL and freed...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-24  3:34   ` Christoph Hellwig
@ 2024-08-27  9:41     ` Long Li
  2024-08-27 10:00     ` Dave Chinner
  1 sibling, 0 replies; 26+ messages in thread
From: Long Li @ 2024-08-27  9:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: djwong, chandanbabu, linux-xfs, david, yi.zhang, houtao1,
	yangerkun

On Fri, Aug 23, 2024 at 08:34:43PM -0700, Christoph Hellwig wrote:
> On Fri, Aug 23, 2024 at 07:04:37PM +0800, Long Li wrote:
> > After pushing log items, the log item may have been freed, making it
> > unsafe to access in tracepoints. This commit introduces XFS_ITEM_UNSAFE
> > to indicate when an item might be freed during the item push operation.
> 
> So instead of this magic unsafe operation I think declaring a rule that
> the lip must never be accessed after the return is the much saner
> choice.
> 
> We'll then need to figure out how we can still keep useful tracing
> without accessing the lip.  The only information the trace points need
> from the lip itself are the type, flags, and lsn and those seem small
> enough to save on the stack before calling into ->iop_push.
> 
> 

Hi Christoph,

Thank you for pointing out the issues with the current approach. Establishing
a rule to not access 'lip' after the item has been pushed would indeed make
the logic clearer.

However, saving the log item information that needs to be traced on the stack
also looks unappealing:

	1. We would need to add new log item trace code, instead of using the
	generic DEFINE_LOG_ITEM_EVENT macro definition. This increases code
	redundancy.

	2. We would need to use CONFIG_TRACEPOINTS to manage whether we need
	to save type, flags, lsn, and other information on the stack.

	3. If we need to extend tracing to other fields of the log item in
	the future, we would need to add new variables.

If we save log item on the stack before each push, I think it would affect
performance, although this impact would be minimal. I wonder what other 
people's opinions are?

Thanks,
Long Li

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-23 17:17   ` Darrick J. Wong
  2024-08-24  3:30     ` Long Li
@ 2024-08-27  9:44     ` Dave Chinner
  1 sibling, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2024-08-27  9:44 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Long Li, chandanbabu, linux-xfs, yi.zhang, houtao1, yangerkun

On Fri, Aug 23, 2024 at 10:17:09AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 23, 2024 at 07:04:37PM +0800, Long Li wrote:
> > After pushing log items, the log item may have been freed, making it
> > unsafe to access in tracepoints. This commit introduces XFS_ITEM_UNSAFE
> > to indicate when an item might be freed during the item push operation.
> > 
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/xfs/xfs_stats.h     | 1 +
> >  fs/xfs/xfs_trans.h     | 1 +
> >  fs/xfs/xfs_trans_ail.c | 7 +++++++
> >  3 files changed, 9 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
> > index a61fb56ed2e6..9a7a020587cf 100644
> > --- a/fs/xfs/xfs_stats.h
> > +++ b/fs/xfs/xfs_stats.h
> > @@ -86,6 +86,7 @@ struct __xfsstats {
> >  	uint32_t		xs_push_ail_pushbuf;
> >  	uint32_t		xs_push_ail_pinned;
> >  	uint32_t		xs_push_ail_locked;
> > +	uint32_t		xs_push_ail_unsafe;
> >  	uint32_t		xs_push_ail_flushing;
> >  	uint32_t		xs_push_ail_restarts;
> >  	uint32_t		xs_push_ail_flush;
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index f06cc0f41665..fd4f04853fe2 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -117,6 +117,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
> >  #define XFS_ITEM_PINNED		1
> >  #define XFS_ITEM_LOCKED		2
> >  #define XFS_ITEM_FLUSHING	3
> > +#define XFS_ITEM_UNSAFE		4
> >  
> >  /*
> >   * This is the structure maintained for every active transaction.
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 8ede9d099d1f..a5ab1ffb8937 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -561,6 +561,13 @@ xfsaild_push(
> >  
> >  			stuck++;
> >  			break;
> > +		case XFS_ITEM_UNSAFE:
> > +			/*
> > +			 * The item may have been freed, so we can't access the
> > +			 * log item here.
> > +			 */
> > +			XFS_STATS_INC(mp, xs_push_ail_unsafe);
> 
> Aha, so this is in reaction to Dave's comment "So, yeah, these failure
> cases need to return something different to xfsaild_push() so it knows
> that it is unsafe to reference the log item, even for tracing purposes."

I wish I knew when I said that and what I said it about. There's no
pointer to previous versions of this fix in the series description -
it's not even marked as a "v2" patchset.

> What we're trying to communicate through xfsaild_push_item is that we've
> cycled the AIL lock and possibly done something (e.g. deleting the log
> item from the AIL) such that the lip reference might be stale.
> 
> Can we call this XFS_ITEM_STALEREF?  "Unsafe" is vague.

In this case, the returned value tells the aild what the result of
the push was. i.e. the return value is XFS_ITEM_FREED, telling the
aild that the push has freed the item (and so the aild should not
access it again).

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-24  3:34   ` Christoph Hellwig
  2024-08-27  9:41     ` Long Li
@ 2024-08-27 10:00     ` Dave Chinner
  2024-08-27 12:30       ` Christoph Hellwig
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2024-08-27 10:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Long Li, djwong, chandanbabu, linux-xfs, yi.zhang, houtao1,
	yangerkun

On Fri, Aug 23, 2024 at 08:34:43PM -0700, Christoph Hellwig wrote:
> On Fri, Aug 23, 2024 at 07:04:37PM +0800, Long Li wrote:
> > After pushing log items, the log item may have been freed, making it
> > unsafe to access in tracepoints. This commit introduces XFS_ITEM_UNSAFE
> > to indicate when an item might be freed during the item push operation.
> 
> So instead of this magic unsafe operation I think declaring a rule that
> the lip must never be accessed after the return is the much saner
> choice.

That may well be the case, but in the normal case the only way to
remove the item from the AIL is to run IO completion. We don't
actually submit IO for anything we've pushed until after we've
dropped out of the item push loop, so IO completion and potential
item AIL removal and freeing can't occur for anything we need to do
IO on.

Hence the only cases where the item might have been already removed
from the AIL by the ->iop_push() are those where the push itself
removes the item from the AIL. This only occurs in shutdown
situations, so it's not the common case.

In which case, returning XFS_ITEM_FREED to tell the push code that
it was freed and should not reference it at all is fine. We don't
really even need tracing for this case because if the items can't be
removed from the AIL, they will leave some other AIL trace when
pushe (i.e.  they will be stuck locked, pinned or flushing and those
will leave traces...)

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-27 10:00     ` Dave Chinner
@ 2024-08-27 12:30       ` Christoph Hellwig
  2024-08-27 21:52         ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2024-08-27 12:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Long Li, djwong, chandanbabu, linux-xfs,
	yi.zhang, houtao1, yangerkun

On Tue, Aug 27, 2024 at 08:00:05PM +1000, Dave Chinner wrote:
> Hence the only cases where the item might have been already removed
> from the AIL by the ->iop_push() are those where the push itself
> removes the item from the AIL. This only occurs in shutdown
> situations, so it's not the common case.
> 
> In which case, returning XFS_ITEM_FREED to tell the push code that
> it was freed and should not reference it at all is fine. We don't
> really even need tracing for this case because if the items can't be
> removed from the AIL, they will leave some other AIL trace when
> pushe (i.e.  they will be stuck locked, pinned or flushing and those
> will leave traces...)

So XFS_ITEM_FREED is definitively a better name, but it still feels
a bit fragile that any of these shutdown paths need special handling
inside ->iop_push.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-27 12:30       ` Christoph Hellwig
@ 2024-08-27 21:52         ` Dave Chinner
  2024-08-28  4:23           ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2024-08-27 21:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Long Li, djwong, chandanbabu, linux-xfs, yi.zhang, houtao1,
	yangerkun

On Tue, Aug 27, 2024 at 05:30:49AM -0700, Christoph Hellwig wrote:
> On Tue, Aug 27, 2024 at 08:00:05PM +1000, Dave Chinner wrote:
> > Hence the only cases where the item might have been already removed
> > from the AIL by the ->iop_push() are those where the push itself
> > removes the item from the AIL. This only occurs in shutdown
> > situations, so it's not the common case.
> > 
> > In which case, returning XFS_ITEM_FREED to tell the push code that
> > it was freed and should not reference it at all is fine. We don't
> > really even need tracing for this case because if the items can't be
> > removed from the AIL, they will leave some other AIL trace when
> > pushe (i.e.  they will be stuck locked, pinned or flushing and those
> > will leave traces...)
> 
> So XFS_ITEM_FREED is definitively a better name, but it still feels
> a bit fragile that any of these shutdown paths need special handling
> inside ->iop_push.

Agreed, but I don't see an easy way to fix that right now because
the shutdown behaviour is both item type and item state specific.

I suspect that we'd do better to have explicit shutdown processing
of log items in the AIL (i.e. a ->iop_shutdown method) that is
called instead of ->iop_push when the AIL detects that the
filesystem has shut down. We can then define the exact behaviour we
want in this case and processing does not have to be non-blocking
for performance and latency reasons.

If we go down that route, I think we'd want to add a
XFS_ITEM_SHUTDOWN return value after the push code calls
xfs_force_shutdown(). The push code does not error out the item or
remove it from the AIL, just shuts down the fs and returns
XFS_ITEM_SHUTDOWN.

The AIL then breaks out of the push loop and submits the delwri
buffers which will error them all out and remove them from the AIL
because the fs is shut down. It then starts a new walk from the tail
calling ->iop_shutdown on everything remaining in the AIL until the
AIL is empty....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-27 21:52         ` Dave Chinner
@ 2024-08-28  4:23           ` Christoph Hellwig
  2024-08-29 10:16             ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2024-08-28  4:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Long Li, djwong, chandanbabu, linux-xfs,
	yi.zhang, houtao1, yangerkun

On Wed, Aug 28, 2024 at 07:52:48AM +1000, Dave Chinner wrote:
> I suspect that we'd do better to have explicit shutdown processing
> of log items in the AIL (i.e. a ->iop_shutdown method) that is
> called instead of ->iop_push when the AIL detects that the
> filesystem has shut down. We can then define the exact behaviour we
> want in this case and processing does not have to be non-blocking
> for performance and latency reasons.
> 
> If we go down that route, I think we'd want to add a
> XFS_ITEM_SHUTDOWN return value after the push code calls
> xfs_force_shutdown(). The push code does not error out the item or
> remove it from the AIL, just shuts down the fs and returns
> XFS_ITEM_SHUTDOWN.

Yes, that seems even better.  But it would probably be a fair amount
of work.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result
  2024-08-28  4:23           ` Christoph Hellwig
@ 2024-08-29 10:16             ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2024-08-29 10:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Long Li, djwong, chandanbabu, linux-xfs, yi.zhang, houtao1,
	yangerkun

On Tue, Aug 27, 2024 at 09:23:33PM -0700, Christoph Hellwig wrote:
> On Wed, Aug 28, 2024 at 07:52:48AM +1000, Dave Chinner wrote:
> > I suspect that we'd do better to have explicit shutdown processing
> > of log items in the AIL (i.e. a ->iop_shutdown method) that is
> > called instead of ->iop_push when the AIL detects that the
> > filesystem has shut down. We can then define the exact behaviour we
> > want in this case and processing does not have to be non-blocking
> > for performance and latency reasons.
> > 
> > If we go down that route, I think we'd want to add a
> > XFS_ITEM_SHUTDOWN return value after the push code calls
> > xfs_force_shutdown(). The push code does not error out the item or
> > remove it from the AIL, just shuts down the fs and returns
> > XFS_ITEM_SHUTDOWN.
> 
> Yes, that seems even better.  But it would probably be a fair amount
> of work.

I don't think so. Only the log items that implement ->iop_push would
need to implement ->iop_shutdown, and there are only 3 items that
implement iop_push. i.e. inode, dquot and buffer items.

The buffer item shutdown method would simply be:

{
	xfs_buf_lock(bp);
	bp->b_flags |= XBF_ASYNC;
	xfs_buf_ioend_fail(bp);
	return XFS_ITEM_FREED;
}

The inode item would use the same setup as xfs_inode_item_push()
to lock the inode cluster buffer, then run the same code as above
for the buffer item shutdown.

I think that dquots end up the same - do the same setup work to get
the locked dquot buffer, then fail it directly without doing IO.

It doesn't seem like it's all that complex....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/5] xfs: ensuere deleting item from AIL after shutdown in dquot flush
  2024-08-27  9:40     ` Dave Chinner
@ 2024-08-31 13:45       ` Long Li
  0 siblings, 0 replies; 26+ messages in thread
From: Long Li @ 2024-08-31 13:45 UTC (permalink / raw)
  To: Dave Chinner, Darrick J. Wong
  Cc: chandanbabu, linux-xfs, yi.zhang, houtao1, yangerkun

On Tue, Aug 27, 2024 at 07:40:14PM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2024 at 10:00:06AM -0700, Darrick J. Wong wrote:
> > On Fri, Aug 23, 2024 at 07:04:36PM +0800, Long Li wrote:
> > > Deleting items from the AIL before the log is shut down can result in the
> > > log tail moving forward in the journal on disk because log writes can still
> > > be taking place. As a result, items that have been deleted from the AIL
> > > might not be recovered during the next mount, even though they should be,
> > > as they were never written back to disk.
> > > 
> > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > > ---
> > >  fs/xfs/xfs_dquot.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > > index c1b211c260a9..4cbe3db6fc32 100644
> > > --- a/fs/xfs/xfs_dquot.c
> > > +++ b/fs/xfs/xfs_dquot.c
> > > @@ -1332,9 +1332,15 @@ xfs_qm_dqflush(
> > >  	return 0;
> > >  
> > >  out_abort:
> > > +	/*
> > > +	 * Shutdown first to stop the log before deleting items from the AIL.
> > > +	 * Deleting items from the AIL before the log is shut down can result
> > > +	 * in the log tail moving forward in the journal on disk because log
> > > +	 * writes can still be taking place.
> > > +	 */
> > > +	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > >  	dqp->q_flags &= ~XFS_DQFLAG_DIRTY;
> > >  	xfs_trans_ail_delete(lip, 0);
> > > -	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > 
> > I see the logic in shutting down the log before letting go of the dquot
> > log item that triggered the shutdown, but I wonder, why do we delete the
> > item from the AIL?  AFAICT the inode items don't do that on iflush
> > failure, but OTOH I couldn't figure out how the log items in the AIL get
> > deleted from the AIL after a shutdown. 
> 
> Intents are removed from the AIL when the transaction containing
> the deferred intent chain is cancelled instead of committed due the
> log being shut down.
> 
> For everything else in the AIL, the ->iop_push method is supposed to
> do any cleanup that is necessary by failing the item push and
> running the item failure method itself.
> 
> For buffers, this is running IO completion as if an IO error
> occurred. Error handling sees the shutdown and removes the item from
> the AIL.
> 
> For inodes, xfs_iflush_cluster() fails the inode buffer as if an IO
> error occurred, that then runs the individual inode abort code that
> removes the inode items from the AIL.
> 
> For dquots, it has the ancient cleanup method that inodes used to
> have. i.e. if the dquot has been flushed to the buffer, it is attached to
> the buffer and then the buffer submission will fail and run IO
> completion with an error. If the dquot hasn't been flushed to the
> buffer because either it or the underlying dquot buffer is corrupt
> it will remove the dquot from the AIL and then shut down the
> filesystem.
> 
> It's the latter case that could be an issue. It's not the same as
> the inode item case, because the tail pinning that the INODE_ALLOC
> inode item type flag causes does not happen with dquots. There is

I'd like to know if the "INODE_ALLOC inode item" refers to a buf
item with the XFS_BLI_INODE_ALLOC_BUF flag? I understand that when
this type of buf item undergoes relog, the tail lsn might be pinned,
but I'm not sure why it's mentioned here, Why does it cause inode
and dquot to be very different?

> still a potential window where the dquot could be at the tail of the
> log, and remocing it moves the tail forward at exactly the same time
> the log tail is being sampled during a log write, and the shutdown
> doesn't happen fast enough to prevent the log write going out to
> disk.
> 
> To make timing of such a race even more unlikely, it would have to
> race with a log write that contains a commit record, otherwise the
> log tail lsn in the iclog will be ignored because it wasn't
> contained within a complete checkpoint in the journal.  It's very
> unlikely that a filesystem will read a corrupt dquot from disk at
> exactly the same point in time these other journal pre-conditions
> are met, but it could happen...
> 

This is a very detailed explanation. I will add this to my commit
message in the next version. Yes, although the conditions for it
to occur are strict, it's still possible to happen.

Thanks,
Long Li

> > Or maybe during a shutdown we just stop xfsaild and let the higher
> > level objects free the log items during reclaim?
> 
> The AIL contains objects that have no references elsewhere in the
> filesystem. It must be pushed until empty during unmount after a
> shutdown to ensure that all the items in it have been pushed,
> failed, removed from the AIL and freed...
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-08-31 13:35 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-23 11:04 [PATCH 0/5] xfs: fix and cleanups for log item push Long Li
2024-08-23 11:04 ` [PATCH 1/5] xfs: remove redundant set null for ip->i_itemp Long Li
2024-08-23 16:37   ` Darrick J. Wong
2024-08-25  4:52   ` Christoph Hellwig
2024-08-23 11:04 ` [PATCH 2/5] xfs: ensuere deleting item from AIL after shutdown in dquot flush Long Li
2024-08-23 17:00   ` Darrick J. Wong
2024-08-24  3:08     ` Long Li
2024-08-27  9:40     ` Dave Chinner
2024-08-31 13:45       ` Long Li
2024-08-23 11:04 ` [PATCH 3/5] xfs: add XFS_ITEM_UNSAFE for log item push return result Long Li
2024-08-23 17:17   ` Darrick J. Wong
2024-08-24  3:30     ` Long Li
2024-08-27  9:44     ` Dave Chinner
2024-08-24  3:34   ` Christoph Hellwig
2024-08-27  9:41     ` Long Li
2024-08-27 10:00     ` Dave Chinner
2024-08-27 12:30       ` Christoph Hellwig
2024-08-27 21:52         ` Dave Chinner
2024-08-28  4:23           ` Christoph Hellwig
2024-08-29 10:16             ` Dave Chinner
2024-08-23 11:04 ` [PATCH 4/5] xfs: fix a UAF when dquot item push Long Li
2024-08-23 17:20   ` Darrick J. Wong
2024-08-24  2:03     ` Long Li
2024-08-23 11:04 ` [PATCH 5/5] xfs: fix a UAF when inode " Long Li
2024-08-23 17:22   ` Darrick J. Wong
2024-08-27  8:14     ` Long Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox