* [PATCHSET 0/3] xfs: fix cpu hotplug mess
@ 2023-08-24 23:21 Darrick J. Wong
From: Darrick J. Wong @ 2023-08-24 23:21 UTC (permalink / raw)
To: chandan.babu, djwong
Cc: peterz, ritesh.list, sandeen, tglx, linux-xfs, david
Hi all,
Ritesh and Eric separately reported crashes in XFS's CPU hot-remove hook
when the hot-remove event races with a filesystem being mounted. I also
noticed via generic/650 that once in a while the log
will shut down over an apparent overrun of a transaction reservation;
this turned out to be due to CIL percpu list aggregation failing to pick
up the percpu list items from a dying CPU.
Either way, the solution here is to eliminate the need for a CPU dying
hook entirely: use a private cpumask to track directly which CPUs have
added items to their percpu lists, and iterate over that mask instead of
the online mask. This fixes the log problems and (I think) solves a
theoretical UAF bug in the inodegc code too.
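Roughly speaking, both fixes follow the same pattern (simplified sketch
with made-up names; the real code is in the patches below):

	/* when queueing work, record which CPU's percpu structure we touched */
	cpu = get_cpu();
	pcp = this_cpu_ptr(pcp_base);
	if (!cpumask_test_cpu(cpu, &priv_cpumask))
		cpumask_test_and_set_cpu(cpu, &priv_cpumask);
	/* ...add items to the pcp lists, update pcp counters... */
	put_cpu();

	/* when aggregating, walk the private mask, not cpu_online_mask */
	for_each_cpu(cpu, &priv_cpumask) {
		pcp = per_cpu_ptr(pcp_base, cpu);
		/* ...fold the pcp state back into the global structure... */
	}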
If you're going to start using this code, I strongly recommend pulling
from my git tree, which is linked below.
This has been lightly tested with fstests. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-percpu-lists-6.6
---
fs/xfs/xfs_icache.c | 60 ++++++++++----------------------------------
fs/xfs/xfs_icache.h | 1 -
fs/xfs/xfs_log_cil.c | 50 +++++++++++--------------------------
fs/xfs/xfs_log_priv.h | 14 ++++------
fs/xfs/xfs_mount.h | 6 +++-
fs/xfs/xfs_super.c | 55 +---------------------------------------
include/linux/cpuhotplug.h | 1 -
7 files changed, 40 insertions(+), 147 deletions(-)
* [PATCH 1/3] xfs: fix per-cpu CIL structure aggregation racing with dying cpus
@ 2023-08-24 23:21 ` Darrick J. Wong
From: Darrick J. Wong @ 2023-08-24 23:21 UTC (permalink / raw)
To: chandan.babu, djwong
Cc: tglx, peterz, ritesh.list, sandeen, linux-xfs, david
From: Darrick J. Wong <djwong@kernel.org>
In commit 7c8ade2121200 ("xfs: implement percpu cil space used
calculation"), the XFS committed (log) item list code was converted to
use per-cpu lists and space tracking to reduce cpu contention when
multiple threads are modifying different parts of the filesystem and
hence end up contending on the log structures during transaction commit.
Each CPU tracks its own commit items and space usage, and these do not
have to be merged into the main CIL until either someone wants to push
the CIL items, or we run over a soft threshold and switch to slower (but
more accurate) accounting with atomics.
Unfortunately, the for_each_online_cpu() iteration suffers from the same race
with cpu dying problem that was identified in commit 8b57b11cca88f
("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from
cpu_online_mask before the CPUHP_XFS_DEAD callback gets called. As a
result, both CIL percpu structure aggregation functions fail to collect
the items and accounted space usage at the correct point in time.
If we're lucky, the items that are collected from the online cpus exceed
the space given to those cpus, and the log immediately shuts down in
xlog_cil_insert_items due to the (apparent) log reservation overrun.
This happens periodically with generic/650, which exercises cpu hotplug
vs. the filesystem code:
smpboot: CPU 3 is now offline
XFS (sda3): ctx ticket reservation ran out. Need to up reservation
XFS (sda3): ticket reservation summary:
XFS (sda3): unit res = 9268 bytes
XFS (sda3): current res = -40 bytes
XFS (sda3): original count = 1
XFS (sda3): remaining count = 1
XFS (sda3): Filesystem has been shut down due to log error (0x2).
Applying the same sort of fix from 8b57b11cca88f to the CIL code seems
to make the generic/650 problem go away, but I've been told that tglx
was not happy when he saw:
"...the only thing we actually need to care about is that
percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when
there are no CPUs dying, it has no addition overhead except for a
cpumask_or() operation."
The CPU hotplug code is rather complex, and I don't want to have to
understand its locking well enough to use the cpu_dying mask safely.
Furthermore, there's a performance improvement to be had here. Attach a
private cpu mask to the CIL context structure so
that we can track exactly which cpus have accessed the percpu data at
all. It doesn't matter if the cpu has since gone offline; log item
aggregation will still find the items. Better yet, we skip cpus that
have not recently logged anything.
Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU
hot remove racing with an xfs mount can crash the kernel if the cpu_dead
notifier tries to access the log before the mount has set it up.
Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/
Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/
Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/
Link: https://lore.kernel.org/linux-xfs/ZOVkjxWZq0YmjrJu@dread.disaster.area/T/
Cc: tglx@linutronix.de
Cc: peterz@infradead.org
Reported-by: Ritesh Harjani <ritesh.list@gmail.com>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Fixes: af1c2146a50b ("xfs: introduce per-cpu CIL tracking structure")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_log_cil.c | 50 +++++++++++++++----------------------------------
fs/xfs/xfs_log_priv.h | 14 ++++++--------
fs/xfs/xfs_super.c | 1 -
3 files changed, 21 insertions(+), 44 deletions(-)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index eccbfb99e894..6eb767a49188 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -124,7 +124,7 @@ xlog_cil_push_pcp_aggregate(
struct xlog_cil_pcp *cilpcp;
int cpu;
- for_each_online_cpu(cpu) {
+ for_each_cpu(cpu, &ctx->cil_pcpmask) {
cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
ctx->ticket->t_curr_res += cilpcp->space_reserved;
@@ -165,7 +165,13 @@ xlog_cil_insert_pcp_aggregate(
if (!test_and_clear_bit(XLOG_CIL_PCP_SPACE, &cil->xc_flags))
return;
- for_each_online_cpu(cpu) {
+ /*
+ * We can race with other cpus setting cil_pcpmask. However, we've
+ * atomically cleared PCP_SPACE which forces other threads to add to
+ * the global space used count. cil_pcpmask is a superset of cilpcp
+ * structures that could have a nonzero space_used.
+ */
+ for_each_cpu(cpu, &ctx->cil_pcpmask) {
int old, prev;
cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
@@ -554,6 +560,7 @@ xlog_cil_insert_items(
int iovhdr_res = 0, split_res = 0, ctx_res = 0;
int space_used;
int order;
+ unsigned int cpu_nr;
struct xlog_cil_pcp *cilpcp;
ASSERT(tp);
@@ -577,7 +584,12 @@ xlog_cil_insert_items(
* can't be scheduled away between split sample/update operations that
* are done without outside locking to serialise them.
*/
- cilpcp = get_cpu_ptr(cil->xc_pcp);
+ cpu_nr = get_cpu();
+ cilpcp = this_cpu_ptr(cil->xc_pcp);
+
+ /* Tell the future push that there was work added by this CPU. */
+ if (!cpumask_test_cpu(cpu_nr, &ctx->cil_pcpmask))
+ cpumask_test_and_set_cpu(cpu_nr, &ctx->cil_pcpmask);
/*
* We need to take the CIL checkpoint unit reservation on the first
@@ -1790,38 +1802,6 @@ xlog_cil_force_seq(
return 0;
}
-/*
- * Move dead percpu state to the relevant CIL context structures.
- *
- * We have to lock the CIL context here to ensure that nothing is modifying
- * the percpu state, either addition or removal. Both of these are done under
- * the CIL context lock, so grabbing that exclusively here will ensure we can
- * safely drain the cilpcp for the CPU that is dying.
- */
-void
-xlog_cil_pcp_dead(
- struct xlog *log,
- unsigned int cpu)
-{
- struct xfs_cil *cil = log->l_cilp;
- struct xlog_cil_pcp *cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
- struct xfs_cil_ctx *ctx;
-
- down_write(&cil->xc_ctx_lock);
- ctx = cil->xc_ctx;
- if (ctx->ticket)
- ctx->ticket->t_curr_res += cilpcp->space_reserved;
- cilpcp->space_reserved = 0;
-
- if (!list_empty(&cilpcp->log_items))
- list_splice_init(&cilpcp->log_items, &ctx->log_items);
- if (!list_empty(&cilpcp->busy_extents))
- list_splice_init(&cilpcp->busy_extents, &ctx->busy_extents);
- atomic_add(cilpcp->space_used, &ctx->space_used);
- cilpcp->space_used = 0;
- up_write(&cil->xc_ctx_lock);
-}
-
/*
* Perform initial CIL structure initialisation.
*/
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 1bd2963e8fbd..af87648331d5 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -231,6 +231,12 @@ struct xfs_cil_ctx {
struct work_struct discard_endio_work;
struct work_struct push_work;
atomic_t order_id;
+
+ /*
+ * CPUs that could have added items to the percpu CIL data. Access is
+ * coordinated with xc_ctx_lock.
+ */
+ struct cpumask cil_pcpmask;
};
/*
@@ -278,9 +284,6 @@ struct xfs_cil {
wait_queue_head_t xc_push_wait; /* background push throttle */
void __percpu *xc_pcp; /* percpu CIL structures */
-#ifdef CONFIG_HOTPLUG_CPU
- struct list_head xc_pcp_list;
-#endif
} ____cacheline_aligned_in_smp;
/* xc_flags bit values */
@@ -705,9 +708,4 @@ xlog_kvmalloc(
return p;
}
-/*
- * CIL CPU dead notifier
- */
-void xlog_cil_pcp_dead(struct xlog *log, unsigned int cpu);
-
#endif /* __XFS_LOG_PRIV_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 09638e8fb4ee..ef7775657ce3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2313,7 +2313,6 @@ xfs_cpu_dead(
list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
spin_unlock(&xfs_mount_list_lock);
xfs_inodegc_cpu_dead(mp, cpu);
- xlog_cil_pcp_dead(mp->m_log, cpu);
spin_lock(&xfs_mount_list_lock);
}
spin_unlock(&xfs_mount_list_lock);
* [PATCH 2/3] xfs: use per-mount cpumask to track nonempty percpu inodegc lists
@ 2023-08-24 23:21 ` Darrick J. Wong
From: Darrick J. Wong @ 2023-08-24 23:21 UTC (permalink / raw)
To: chandan.babu, djwong; +Cc: linux-xfs, david, ritesh.list, sandeen
From: Darrick J. Wong <djwong@kernel.org>
Directly track which CPUs have contributed to the inodegc percpu lists
instead of trusting the cpu online mask. This eliminates a theoretical
problem where the inodegc flush functions might fail to flush a CPU's
inodes if that CPU happened to be dying at exactly the same time. Most
likely nobody's noticed this because the CPU dead hook moves the percpu
inodegc list to another CPU and schedules that worker immediately. But
it's quite possible that this is a subtle race leading to UAF if the
inodegc flush were part of an unmount.
Further benefits: This reduces the overhead of the inodegc flush code
slightly by allowing us to ignore CPUs that have empty lists. Better
yet, it reduces our dependence on the cpu online masks, which have been
the cause of confusion and drama lately.
Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_icache.c | 60 +++++++++++----------------------------------------
fs/xfs/xfs_icache.h | 1 -
fs/xfs/xfs_mount.h | 6 +++--
fs/xfs/xfs_super.c | 4 +--
4 files changed, 18 insertions(+), 53 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index e541f5c0bc25..7fd876e94ecb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -443,7 +443,7 @@ xfs_inodegc_queue_all(
int cpu;
bool ret = false;
- for_each_online_cpu(cpu) {
+ for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
gc = per_cpu_ptr(mp->m_inodegc, cpu);
if (!llist_empty(&gc->list)) {
mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
@@ -463,7 +463,7 @@ xfs_inodegc_wait_all(
int error = 0;
flush_workqueue(mp->m_inodegc_wq);
- for_each_online_cpu(cpu) {
+ for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
struct xfs_inodegc *gc;
gc = per_cpu_ptr(mp->m_inodegc, cpu);
@@ -1845,10 +1845,12 @@ xfs_inodegc_worker(
struct xfs_inodegc, work);
struct llist_node *node = llist_del_all(&gc->list);
struct xfs_inode *ip, *n;
+ struct xfs_mount *mp = gc->mp;
unsigned int nofs_flag;
ASSERT(gc->cpu == smp_processor_id());
+ cpumask_test_and_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
WRITE_ONCE(gc->items, 0);
if (!node)
@@ -1862,7 +1864,7 @@ xfs_inodegc_worker(
nofs_flag = memalloc_nofs_save();
ip = llist_entry(node, struct xfs_inode, i_gclist);
- trace_xfs_inodegc_worker(ip->i_mount, READ_ONCE(gc->shrinker_hits));
+ trace_xfs_inodegc_worker(mp, READ_ONCE(gc->shrinker_hits));
WRITE_ONCE(gc->shrinker_hits, 0);
llist_for_each_entry_safe(ip, n, node, i_gclist) {
@@ -2057,6 +2059,7 @@ xfs_inodegc_queue(
struct xfs_inodegc *gc;
int items;
unsigned int shrinker_hits;
+ unsigned int cpu_nr;
unsigned long queue_delay = 1;
trace_xfs_inode_set_need_inactive(ip);
@@ -2064,12 +2067,16 @@ xfs_inodegc_queue(
ip->i_flags |= XFS_NEED_INACTIVE;
spin_unlock(&ip->i_flags_lock);
- gc = get_cpu_ptr(mp->m_inodegc);
+ cpu_nr = get_cpu();
+ gc = this_cpu_ptr(mp->m_inodegc);
llist_add(&ip->i_gclist, &gc->list);
items = READ_ONCE(gc->items);
WRITE_ONCE(gc->items, items + 1);
shrinker_hits = READ_ONCE(gc->shrinker_hits);
+ if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
+ cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
+
/*
* We queue the work while holding the current CPU so that the work
* is scheduled to run on this CPU.
@@ -2093,47 +2100,6 @@ xfs_inodegc_queue(
}
}
-/*
- * Fold the dead CPU inodegc queue into the current CPUs queue.
- */
-void
-xfs_inodegc_cpu_dead(
- struct xfs_mount *mp,
- unsigned int dead_cpu)
-{
- struct xfs_inodegc *dead_gc, *gc;
- struct llist_node *first, *last;
- unsigned int count = 0;
-
- dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu);
- cancel_delayed_work_sync(&dead_gc->work);
-
- if (llist_empty(&dead_gc->list))
- return;
-
- first = dead_gc->list.first;
- last = first;
- while (last->next) {
- last = last->next;
- count++;
- }
- dead_gc->list.first = NULL;
- dead_gc->items = 0;
-
- /* Add pending work to current CPU */
- gc = get_cpu_ptr(mp->m_inodegc);
- llist_add_batch(first, last, &gc->list);
- count += READ_ONCE(gc->items);
- WRITE_ONCE(gc->items, count);
-
- if (xfs_is_inodegc_enabled(mp)) {
- trace_xfs_inodegc_queue(mp, __return_address);
- mod_delayed_work_on(current_cpu(), mp->m_inodegc_wq, &gc->work,
- 0);
- }
- put_cpu_ptr(gc);
-}
-
/*
* We set the inode flag atomically with the radix tree tag. Once we get tag
* lookups on the radix tree, this inode flag can go away.
@@ -2195,7 +2161,7 @@ xfs_inodegc_shrinker_count(
if (!xfs_is_inodegc_enabled(mp))
return 0;
- for_each_online_cpu(cpu) {
+ for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
gc = per_cpu_ptr(mp->m_inodegc, cpu);
if (!llist_empty(&gc->list))
return XFS_INODEGC_SHRINKER_COUNT;
@@ -2220,7 +2186,7 @@ xfs_inodegc_shrinker_scan(
trace_xfs_inodegc_shrinker_scan(mp, sc, __return_address);
- for_each_online_cpu(cpu) {
+ for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
gc = per_cpu_ptr(mp->m_inodegc, cpu);
if (!llist_empty(&gc->list)) {
unsigned int h = READ_ONCE(gc->shrinker_hits);
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 2fa6f2e09d07..905944dafbe5 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -79,7 +79,6 @@ void xfs_inodegc_push(struct xfs_mount *mp);
int xfs_inodegc_flush(struct xfs_mount *mp);
void xfs_inodegc_stop(struct xfs_mount *mp);
void xfs_inodegc_start(struct xfs_mount *mp);
-void xfs_inodegc_cpu_dead(struct xfs_mount *mp, unsigned int cpu);
int xfs_inodegc_register_shrinker(struct xfs_mount *mp);
#endif
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a25eece3be2b..f4a8879ba0e9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -60,6 +60,7 @@ struct xfs_error_cfg {
* Per-cpu deferred inode inactivation GC lists.
*/
struct xfs_inodegc {
+ struct xfs_mount *mp;
struct llist_head list;
struct delayed_work work;
int error;
@@ -67,9 +68,7 @@ struct xfs_inodegc {
/* approximate count of inodes in the list */
unsigned int items;
unsigned int shrinker_hits;
-#if defined(DEBUG) || defined(XFS_WARN)
unsigned int cpu;
-#endif
};
/*
@@ -249,6 +248,9 @@ typedef struct xfs_mount {
unsigned int *m_errortag;
struct xfs_kobj m_errortag_kobj;
#endif
+
+ /* cpus that have inodes queued for inactivation */
+ struct cpumask m_inodegc_cpumask;
} xfs_mount_t;
#define M_IGEO(mp) (&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ef7775657ce3..6a4f8b2f6159 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1110,9 +1110,8 @@ xfs_inodegc_init_percpu(
for_each_possible_cpu(cpu) {
gc = per_cpu_ptr(mp->m_inodegc, cpu);
-#if defined(DEBUG) || defined(XFS_WARN)
gc->cpu = cpu;
-#endif
+ gc->mp = mp;
init_llist_head(&gc->list);
gc->items = 0;
gc->error = 0;
@@ -2312,7 +2311,6 @@ xfs_cpu_dead(
spin_lock(&xfs_mount_list_lock);
list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
spin_unlock(&xfs_mount_list_lock);
- xfs_inodegc_cpu_dead(mp, cpu);
spin_lock(&xfs_mount_list_lock);
}
spin_unlock(&xfs_mount_list_lock);
* [PATCH 3/3] xfs: remove cpu hotplug hooks
@ 2023-08-24 23:21 ` Darrick J. Wong
From: Darrick J. Wong @ 2023-08-24 23:21 UTC (permalink / raw)
To: chandan.babu, djwong; +Cc: linux-xfs, david, ritesh.list, sandeen
From: Darrick J. Wong <djwong@kernel.org>
There are no users of the cpu hotplug hooks in xfs now, so remove the
infrastructure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_super.c | 50 +-------------------------------------------
include/linux/cpuhotplug.h | 1 -
2 files changed, 1 insertion(+), 50 deletions(-)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 6a4f8b2f6159..1403aace4fe3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2301,47 +2301,6 @@ xfs_destroy_workqueues(void)
destroy_workqueue(xfs_alloc_wq);
}
-#ifdef CONFIG_HOTPLUG_CPU
-static int
-xfs_cpu_dead(
- unsigned int cpu)
-{
- struct xfs_mount *mp, *n;
-
- spin_lock(&xfs_mount_list_lock);
- list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
- spin_unlock(&xfs_mount_list_lock);
- spin_lock(&xfs_mount_list_lock);
- }
- spin_unlock(&xfs_mount_list_lock);
- return 0;
-}
-
-static int __init
-xfs_cpu_hotplug_init(void)
-{
- int error;
-
- error = cpuhp_setup_state_nocalls(CPUHP_XFS_DEAD, "xfs:dead", NULL,
- xfs_cpu_dead);
- if (error < 0)
- xfs_alert(NULL,
-"Failed to initialise CPU hotplug, error %d. XFS is non-functional.",
- error);
- return error;
-}
-
-static void
-xfs_cpu_hotplug_destroy(void)
-{
- cpuhp_remove_state_nocalls(CPUHP_XFS_DEAD);
-}
-
-#else /* !CONFIG_HOTPLUG_CPU */
-static inline int xfs_cpu_hotplug_init(void) { return 0; }
-static inline void xfs_cpu_hotplug_destroy(void) {}
-#endif
-
STATIC int __init
init_xfs_fs(void)
{
@@ -2358,13 +2317,9 @@ init_xfs_fs(void)
xfs_dir_startup();
- error = xfs_cpu_hotplug_init();
- if (error)
- goto out;
-
error = xfs_init_caches();
if (error)
- goto out_destroy_hp;
+ goto out;
error = xfs_init_workqueues();
if (error)
@@ -2448,8 +2403,6 @@ init_xfs_fs(void)
xfs_destroy_workqueues();
out_destroy_caches:
xfs_destroy_caches();
- out_destroy_hp:
- xfs_cpu_hotplug_destroy();
out:
return error;
}
@@ -2473,7 +2426,6 @@ exit_xfs_fs(void)
xfs_destroy_workqueues();
xfs_destroy_caches();
xfs_uuid_table_free();
- xfs_cpu_hotplug_destroy();
}
module_init(init_xfs_fs);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 25b6e6e6ba6b..a93e539efd85 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -90,7 +90,6 @@ enum cpuhp_state {
CPUHP_FS_BUFF_DEAD,
CPUHP_PRINTK_DEAD,
CPUHP_MM_MEMCQ_DEAD,
- CPUHP_XFS_DEAD,
CPUHP_PERCPU_CNT_DEAD,
CPUHP_RADIX_DEAD,
CPUHP_PAGE_ALLOC,
* [PATCH 4/3] generic/650: race mount and unmount with cpu hotplug too
@ 2023-08-24 23:45 ` Darrick J. Wong
From: Darrick J. Wong @ 2023-08-24 23:45 UTC (permalink / raw)
To: chandan.babu; +Cc: peterz, ritesh.list, sandeen, tglx, linux-xfs, david
From: Darrick J. Wong <djwong@kernel.org>
Ritesh Harjani reported that mount and unmount can race with the xfs cpu
hotplug notifier hooks and crash the kernel. Extend this test to cover
that case by cycling the mount in between shorter fsstress runs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
tests/generic/650 | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/tests/generic/650 b/tests/generic/650
index 05c939b84f..773f93c7cb 100755
--- a/tests/generic/650
+++ b/tests/generic/650
@@ -67,11 +67,18 @@ fsstress_args=(-w -d $stress_dir)
nr_cpus=$((LOAD_FACTOR * nr_hotplug_cpus))
test "$nr_cpus" -gt 1024 && nr_cpus="$nr_hotplug_cpus"
fsstress_args+=(-p $nr_cpus)
-test -n "$SOAK_DURATION" && fsstress_args+=(--duration="$SOAK_DURATION")
+if [ -n "$SOAK_DURATION" ]; then
+ test "$SOAK_DURATION" -lt 10 && SOAK_DURATION=10
+ fsstress_args+=(--duration="$((SOAK_DURATION / 10))")
+fi
-nr_ops=$((25000 * TIME_FACTOR))
+nr_ops=$((2500 * TIME_FACTOR))
fsstress_args+=(-n $nr_ops)
-$FSSTRESS_PROG $FSSTRESS_AVOID -w "${fsstress_args[@]}" >> $seqres.full
+for ((i = 0; i < 10; i++)); do
+ $FSSTRESS_PROG $FSSTRESS_AVOID -w "${fsstress_args[@]}" >> $seqres.full
+ _test_cycle_mount
+done
+
rm -f $sentinel_file
# success, all done
* Re: [PATCH 1/3] xfs: fix per-cpu CIL structure aggregation racing with dying cpus
@ 2023-08-24 23:53 ` Dave Chinner
From: Dave Chinner @ 2023-08-24 23:53 UTC (permalink / raw)
To: Darrick J. Wong
Cc: chandan.babu, tglx, peterz, ritesh.list, sandeen, linux-xfs
On Thu, Aug 24, 2023 at 04:21:20PM -0700, Darrick J. Wong wrote:
> @@ -554,6 +560,7 @@ xlog_cil_insert_items(
> int iovhdr_res = 0, split_res = 0, ctx_res = 0;
> int space_used;
> int order;
> + unsigned int cpu_nr;
> struct xlog_cil_pcp *cilpcp;
>
> ASSERT(tp);
> @@ -577,7 +584,12 @@ xlog_cil_insert_items(
> * can't be scheduled away between split sample/update operations that
> * are done without outside locking to serialise them.
> */
> - cilpcp = get_cpu_ptr(cil->xc_pcp);
> + cpu_nr = get_cpu();
> + cilpcp = this_cpu_ptr(cil->xc_pcp);
> +
> + /* Tell the future push that there was work added by this CPU. */
> + if (!cpumask_test_cpu(cpu_nr, &ctx->cil_pcpmask))
> + cpumask_test_and_set_cpu(cpu_nr, &ctx->cil_pcpmask);
>
> /*
> * We need to take the CIL checkpoint unit reservation on the first
This code also needs the put_cpu_ptr(cil->xc_pcp) converted to
put_cpu(), even though they end up doing exactly the same thing.
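i.e. the tail of xlog_cil_insert_items() should then pair with the
get_cpu() above, something like this (sketch, not the exact hunk):

	/* ...percpu space accounting and list additions... */
	put_cpu();		/* rather than the old put_cpu_ptr() */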
Other than that, it looks good. I'll pull this into my test trees
and give it a run...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/3] xfs: use per-mount cpumask to track nonempty percpu inodegc lists
@ 2023-08-25 0:12 ` Dave Chinner
From: Dave Chinner @ 2023-08-25 0:12 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: chandan.babu, linux-xfs, ritesh.list, sandeen
On Thu, Aug 24, 2023 at 04:21:25PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Directly track which CPUs have contributed to the inodegc percpu lists
> instead of trusting the cpu online mask. This eliminates a theoretical
> problem where the inodegc flush functions might fail to flush a CPU's
> inodes if that CPU happened to be dying at exactly the same time. Most
> likely nobody's noticed this because the CPU dead hook moves the percpu
> inodegc list to another CPU and schedules that worker immediately. But
> it's quite possible that this is a subtle race leading to UAF if the
> inodegc flush were part of an unmount.
>
> Further benefits: This reduces the overhead of the inodegc flush code
> slightly by allowing us to ignore CPUs that have empty lists. Better
> yet, it reduces our dependence on the cpu online masks, which have been
> the cause of confusion and drama lately.
>
> Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/xfs_icache.c | 60 +++++++++++----------------------------------------
> fs/xfs/xfs_icache.h | 1 -
> fs/xfs/xfs_mount.h | 6 +++--
> fs/xfs/xfs_super.c | 4 +--
> 4 files changed, 18 insertions(+), 53 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index e541f5c0bc25..7fd876e94ecb 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -443,7 +443,7 @@ xfs_inodegc_queue_all(
> int cpu;
> bool ret = false;
>
> - for_each_online_cpu(cpu) {
> + for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
> gc = per_cpu_ptr(mp->m_inodegc, cpu);
> if (!llist_empty(&gc->list)) {
> mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> @@ -463,7 +463,7 @@ xfs_inodegc_wait_all(
> int error = 0;
>
> flush_workqueue(mp->m_inodegc_wq);
> - for_each_online_cpu(cpu) {
> + for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
> struct xfs_inodegc *gc;
>
> gc = per_cpu_ptr(mp->m_inodegc, cpu);
> @@ -1845,10 +1845,12 @@ xfs_inodegc_worker(
> struct xfs_inodegc, work);
> struct llist_node *node = llist_del_all(&gc->list);
> struct xfs_inode *ip, *n;
> + struct xfs_mount *mp = gc->mp;
> unsigned int nofs_flag;
>
> ASSERT(gc->cpu == smp_processor_id());
>
> + cpumask_test_and_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
Why does this need to be a test-and-clear operation? If it is set,
we clear it. If it is not set, clearing it is a no-op. Hence we
don't need to test whether the bit is set first. Also,
cpumask_clear_cpu() uses clear_bit(), which is an atomic operation,
so clearing the bit isn't going to race with any other updates.
As it is, we probably want acquire semantics for the gc structure
here (see below), so I think this likely should be:
/*
* Clear the cpu mask bit and ensure that we have seen the
* latest update of the gc structure associated with this
* CPU. This matches with the release semantics used when
* setting the cpumask bit in xfs_inodegc_queue.
*/
cpumask_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
smp_mb__after_atomic();
> WRITE_ONCE(gc->items, 0);
>
> if (!node)
> @@ -1862,7 +1864,7 @@ xfs_inodegc_worker(
> nofs_flag = memalloc_nofs_save();
>
> ip = llist_entry(node, struct xfs_inode, i_gclist);
> - trace_xfs_inodegc_worker(ip->i_mount, READ_ONCE(gc->shrinker_hits));
> + trace_xfs_inodegc_worker(mp, READ_ONCE(gc->shrinker_hits));
>
> WRITE_ONCE(gc->shrinker_hits, 0);
> llist_for_each_entry_safe(ip, n, node, i_gclist) {
> @@ -2057,6 +2059,7 @@ xfs_inodegc_queue(
> struct xfs_inodegc *gc;
> int items;
> unsigned int shrinker_hits;
> + unsigned int cpu_nr;
> unsigned long queue_delay = 1;
>
> trace_xfs_inode_set_need_inactive(ip);
> @@ -2064,12 +2067,16 @@ xfs_inodegc_queue(
> ip->i_flags |= XFS_NEED_INACTIVE;
> spin_unlock(&ip->i_flags_lock);
>
> - gc = get_cpu_ptr(mp->m_inodegc);
> + cpu_nr = get_cpu();
> + gc = this_cpu_ptr(mp->m_inodegc);
> llist_add(&ip->i_gclist, &gc->list);
> items = READ_ONCE(gc->items);
> WRITE_ONCE(gc->items, items + 1);
> shrinker_hits = READ_ONCE(gc->shrinker_hits);
>
> + if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
> + cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
> +
> /*
> * We queue the work while holding the current CPU so that the work
> * is scheduled to run on this CPU.
I think we need release/acquire memory ordering on this atomic bit
set now. i.e. to guarantee that if the worker sees the cpumask bit
set (with acquire semantics), it will always see the latest item
added to the list. i.e.
/*
* Ensure the list add is always seen by anyone that
* finds the cpumask bit set. This effectively gives
* the cpumask bit set operation release ordering semantics.
*/
smp_mb__before_atomic();
if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
Also, same comment about put_cpu() vs put_cpu_ptr() as the last patch.
Otherwise this seems sane.
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 3/3] xfs: remove cpu hotplug hooks
@ 2023-08-25 0:15 ` Dave Chinner
From: Dave Chinner @ 2023-08-25 0:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: chandan.babu, linux-xfs, ritesh.list, sandeen
On Thu, Aug 24, 2023 at 04:21:31PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> There are no users of the cpu hotplug hooks in xfs now, so remove the
> infrastructure.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/xfs_super.c | 50 +-------------------------------------------
> include/linux/cpuhotplug.h | 1 -
> 2 files changed, 1 insertion(+), 50 deletions(-)
Looks good.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
>
>
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 6a4f8b2f6159..1403aace4fe3 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -2301,47 +2301,6 @@ xfs_destroy_workqueues(void)
> destroy_workqueue(xfs_alloc_wq);
> }
>
> -#ifdef CONFIG_HOTPLUG_CPU
> -static int
> -xfs_cpu_dead(
> - unsigned int cpu)
> -{
> - struct xfs_mount *mp, *n;
> -
> - spin_lock(&xfs_mount_list_lock);
> - list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
> - spin_unlock(&xfs_mount_list_lock);
> - spin_lock(&xfs_mount_list_lock);
> - }
> - spin_unlock(&xfs_mount_list_lock);
> - return 0;
> -}
You also can kill the xfs_mount_list and xfs_mount_list_lock now,
right? That can be done in a separate patch, though.
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/3] xfs: use per-mount cpumask to track nonempty percpu inodegc lists
@ 2023-08-25 4:07 ` Darrick J. Wong
From: Darrick J. Wong @ 2023-08-25 4:07 UTC (permalink / raw)
To: Dave Chinner; +Cc: chandan.babu, linux-xfs, ritesh.list, sandeen
On Fri, Aug 25, 2023 at 10:12:49AM +1000, Dave Chinner wrote:
> On Thu, Aug 24, 2023 at 04:21:25PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Directly track which CPUs have contributed to the inodegc percpu lists
> > instead of trusting the cpu online mask. This eliminates a theoretical
> > problem where the inodegc flush functions might fail to flush a CPU's
> > inodes if that CPU happened to be dying at exactly the same time. Most
> > likely nobody's noticed this because the CPU dead hook moves the percpu
> > inodegc list to another CPU and schedules that worker immediately. But
> > it's quite possible that this is a subtle race leading to UAF if the
> > inodegc flush were part of an unmount.
> >
> > Further benefits: This reduces the overhead of the inodegc flush code
> > slightly by allowing us to ignore CPUs that have empty lists. Better
> > yet, it reduces our dependence on the cpu online masks, which have been
> > the cause of confusion and drama lately.
> >
> > Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/xfs_icache.c | 60 +++++++++++----------------------------------------
> > fs/xfs/xfs_icache.h | 1 -
> > fs/xfs/xfs_mount.h | 6 +++--
> > fs/xfs/xfs_super.c | 4 +--
> > 4 files changed, 18 insertions(+), 53 deletions(-)
> >
> >
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index e541f5c0bc25..7fd876e94ecb 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -443,7 +443,7 @@ xfs_inodegc_queue_all(
> > int cpu;
> > bool ret = false;
> >
> > - for_each_online_cpu(cpu) {
> > + for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
> > gc = per_cpu_ptr(mp->m_inodegc, cpu);
> > if (!llist_empty(&gc->list)) {
> > mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > @@ -463,7 +463,7 @@ xfs_inodegc_wait_all(
> > int error = 0;
> >
> > flush_workqueue(mp->m_inodegc_wq);
> > - for_each_online_cpu(cpu) {
> > + for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
> > struct xfs_inodegc *gc;
> >
> > gc = per_cpu_ptr(mp->m_inodegc, cpu);
> > @@ -1845,10 +1845,12 @@ xfs_inodegc_worker(
> > struct xfs_inodegc, work);
> > struct llist_node *node = llist_del_all(&gc->list);
> > struct xfs_inode *ip, *n;
> > + struct xfs_mount *mp = gc->mp;
> > unsigned int nofs_flag;
> >
> > ASSERT(gc->cpu == smp_processor_id());
> >
> > + cpumask_test_and_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
>
> Why does this need to be a test-and-clear operation? If it is set,
> we clear it. If it is not set, clearing it is a no-op. Hence we
> don't need to test whether the bit is set first. Also,
> cpumask_clear_cpu() uses clear_bit(), which is an atomic operation,
> so clearing the bit isn't going to race with any other updates.
Aha, I didn't realize that clear_bit is atomic. Oh, I guess
atomic_bitops.txt mentions that an RMW operation is atomic if there's a
separate '__' version. Subtle. :(
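(i.e., illustratively, with arbitrary nr/addr:

	clear_bit(nr, addr);	/* atomic RMW */
	__clear_bit(nr, addr);	/* non-atomic variant; caller must serialise */

per Documentation/atomic_bitops.txt.)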
> As it is, we probably want acquire semantics for the gc structure
> here (see below), so I think this likely should be:
>
> /*
> * Clear the cpu mask bit and ensure that we have seen the
> * latest update of the gc structure associated with this
> * CPU. This matches with the release semantics used when
> * setting the cpumask bit in xfs_inodegc_queue.
> */
> cpumask_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
> smp_mb__after_atomic();
<nod>
> > WRITE_ONCE(gc->items, 0);
> >
> > if (!node)
> > @@ -1862,7 +1864,7 @@ xfs_inodegc_worker(
> > nofs_flag = memalloc_nofs_save();
> >
> > ip = llist_entry(node, struct xfs_inode, i_gclist);
> > - trace_xfs_inodegc_worker(ip->i_mount, READ_ONCE(gc->shrinker_hits));
> > + trace_xfs_inodegc_worker(mp, READ_ONCE(gc->shrinker_hits));
> >
> > WRITE_ONCE(gc->shrinker_hits, 0);
> > llist_for_each_entry_safe(ip, n, node, i_gclist) {
> > @@ -2057,6 +2059,7 @@ xfs_inodegc_queue(
> > struct xfs_inodegc *gc;
> > int items;
> > unsigned int shrinker_hits;
> > + unsigned int cpu_nr;
> > unsigned long queue_delay = 1;
> >
> > trace_xfs_inode_set_need_inactive(ip);
> > @@ -2064,12 +2067,16 @@ xfs_inodegc_queue(
> > ip->i_flags |= XFS_NEED_INACTIVE;
> > spin_unlock(&ip->i_flags_lock);
> >
> > - gc = get_cpu_ptr(mp->m_inodegc);
> > + cpu_nr = get_cpu();
> > + gc = this_cpu_ptr(mp->m_inodegc);
> > llist_add(&ip->i_gclist, &gc->list);
> > items = READ_ONCE(gc->items);
> > WRITE_ONCE(gc->items, items + 1);
> > shrinker_hits = READ_ONCE(gc->shrinker_hits);
> >
> > + if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
> > + cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
> > +
> > /*
> > * We queue the work while holding the current CPU so that the work
> > * is scheduled to run on this CPU.
>
> I think we need release/acquire memory ordering on this atomic bit
> set now. i.e. to guarantee that if the worker sees the cpumask bit
> set (with acquire semantics), it will always see the latest item
> added to the list. i.e.
>
> /*
> * Ensure the list add is always seen by anyone that
> * finds the cpumask bit set. This effectively gives
> * the cpumask bit set operation release ordering semantics.
> */
> smp_mb__before_atomic();
> if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
> cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
<nod>
> Also, same comment about put_cpu() vs put_cpu_ptr() as the last patch.
<nod>
> Otherwise this seems sane.
Thanks!
--D
> -Dave.
>
> --
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH 1/3] xfs: fix per-cpu CIL structure aggregation racing with dying cpus
@ 2023-08-25 4:07 ` Darrick J. Wong
From: Darrick J. Wong @ 2023-08-25 4:07 UTC (permalink / raw)
To: Dave Chinner; +Cc: chandan.babu, tglx, peterz, ritesh.list, sandeen, linux-xfs
On Fri, Aug 25, 2023 at 09:53:04AM +1000, Dave Chinner wrote:
> On Thu, Aug 24, 2023 at 04:21:20PM -0700, Darrick J. Wong wrote:
> > @@ -554,6 +560,7 @@ xlog_cil_insert_items(
> > int iovhdr_res = 0, split_res = 0, ctx_res = 0;
> > int space_used;
> > int order;
> > + unsigned int cpu_nr;
> > struct xlog_cil_pcp *cilpcp;
> >
> > ASSERT(tp);
> > @@ -577,7 +584,12 @@ xlog_cil_insert_items(
> > * can't be scheduled away between split sample/update operations that
> > * are done without outside locking to serialise them.
> > */
> > - cilpcp = get_cpu_ptr(cil->xc_pcp);
> > + cpu_nr = get_cpu();
> > + cilpcp = this_cpu_ptr(cil->xc_pcp);
> > +
> > + /* Tell the future push that there was work added by this CPU. */
> > + if (!cpumask_test_cpu(cpu_nr, &ctx->cil_pcpmask))
> > + cpumask_test_and_set_cpu(cpu_nr, &ctx->cil_pcpmask);
> >
> > /*
> > * We need to take the CIL checkpoint unit reservation on the first
>
> This code also needs the put_cpu_ptr(cil->xc_pcp) converted to
> put_cpu(), even though they end up doing exactly the same thing.
>
> Other than that, it looks good. I'll pull this into my test trees
> and give it a run...
Ok, I'll look forward to seeing what happens. :)
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com