* [PATCHSET 0/4] xfs: inodegc fixes for 6.4-rc1
@ 2023-04-27 22:49 Darrick J. Wong
2023-04-27 22:49 ` [PATCH 1/4] xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately Darrick J. Wong
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Darrick J. Wong @ 2023-04-27 22:49 UTC (permalink / raw)
To: david, djwong; +Cc: linux-xfs
Hi all,
This series fixes some assorted bugs in the inodegc code that have been
lurking for a while.
The first is that inodes that are ready to be inactivated are put on
per-cpu lockless lists and a per-cpu worker is scheduled to clear that
list eventually. Unfortunately, WORK_CPU_UNBOUND doesn't guarantee that
a worker will actually be scheduled on that CPU, so we need to force
this by specifying the CPU explicitly.
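The difference between unbound and CPU-bound queueing can be sketched with a toy userspace model (plain Python; WORK_CPU_UNBOUND is the only name borrowed from the kernel, everything else here is purely illustrative):

```python
import random

NR_CPUS = 4
WORK_CPU_UNBOUND = -1  # kernel constant meaning "any CPU will do"

def schedule_worker(requested_cpu):
    """Return the CPU a queued worker actually runs on."""
    if requested_cpu == WORK_CPU_UNBOUND:
        # Unbound work may land on whichever CPU the scheduler picks.
        return random.randrange(NR_CPUS)
    # Bound work runs on the requested CPU.
    return requested_cpu

filling_cpu = 2
# Binding the work keeps it on the CPU whose llist it will process.
assert all(schedule_worker(filling_cpu) == filling_cpu for _ in range(100))
# Unbound scheduling offers no such guarantee.
sampled = {schedule_worker(WORK_CPU_UNBOUND) for _ in range(1000)}
assert len(sampled) > 1
```

Patch 1 below is exactly this move from the unbound to the bound queueing variant.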
The second problem is that xfs_inodegc_stop races with other
threads that are trying to add inodes to the percpu list and schedule
the inodegc workers. The solution here is to drain the inodegc lists
by scheduling workers immediately, flushing the workqueue, and
repeating until no new inodes appear.
We also disable the broken fscounters usage of inodegc_stop by neutering
the whole scrubber for now because the proper fixes for it are in the
next batch of online repair patches for 6.5, and the code is still
marked experimental.
If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.
This is an extraordinary way to destroy everything. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=inodegc-fixes-6.4
---
fs/xfs/scrub/common.c | 26 --------------------------
fs/xfs/scrub/common.h | 2 --
fs/xfs/scrub/fscounters.c | 10 +++-------
fs/xfs/scrub/scrub.c | 2 --
fs/xfs/scrub/scrub.h | 1 -
fs/xfs/scrub/trace.h | 1 -
fs/xfs/xfs_icache.c | 38 +++++++++++++++++++++++++++++++-------
fs/xfs/xfs_mount.h | 3 +++
fs/xfs/xfs_super.c | 3 +++
9 files changed, 40 insertions(+), 46 deletions(-)
* [PATCH 1/4] xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately
2023-04-27 22:49 [PATCHSET 0/4] xfs: inodegc fixes for 6.4-rc1 Darrick J. Wong
@ 2023-04-27 22:49 ` Darrick J. Wong
2023-04-28 2:14 ` Dave Chinner
2023-04-27 22:49 ` [PATCH 2/4] xfs: check that per-cpu inodegc workers actually run on that cpu Darrick J. Wong
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2023-04-27 22:49 UTC (permalink / raw)
To: david, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
I've been noticing odd racing behavior in the inodegc code that could
only be explained by one cpu adding an inode to its inactivation llist
at the same time that another cpu is processing that cpu's llist.
Preemption is disabled between get/put_cpu_ptr, so the only explanation
is scheduler mayhem. I inserted the following debug code into
xfs_inodegc_worker (see the next patch):
ASSERT(gc->cpu == smp_processor_id());
This assertion tripped during overnight tests on the arm64 machines, but
curiously not on x86_64. I think we haven't observed any resource leaks
here because the lockfree list code can handle simultaneous llist_add
and llist_del_all functions operating on the same list. However, the
whole point of having percpu inodegc lists is to take advantage of warm
memory caches by inactivating inodes on the last processor to touch the
inode.
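Why no inodes are lost can be illustrated with a toy single-threaded model of the llist semantics the commit message relies on (Python sketch only; in the kernel, llist_add and llist_del_all are atomic, which is what makes the genuinely concurrent case safe):

```python
from collections import deque

class LList:
    """Toy model of the kernel's lock-less list: add() pushes one node
    onto the head, del_all() detaches the entire list in one step."""
    def __init__(self):
        self.head = deque()

    def add(self, item):
        self.head.appendleft(item)

    def del_all(self):
        taken, self.head = self.head, deque()
        return taken

gc_list = LList()
gc_list.add("inode-A")
batch = gc_list.del_all()   # worker grabs the whole list at once
gc_list.add("inode-B")      # a racing add lands on the now-empty list
assert list(batch) == ["inode-A"]
# The straggler is not lost; it is picked up by the next worker pass.
assert list(gc_list.del_all()) == ["inode-B"]
```

A racing add either lands before the del_all (and is taken in that batch) or after it (and waits for the next pass); either way nothing leaks, it just may run on the wrong CPU.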
The incorrect scheduling seems to occur after an inodegc worker is
subjected to mod_delayed_work(). This wraps mod_delayed_work_on with
WORK_CPU_UNBOUND specified as the cpu number. Unbound allows for
scheduling on any cpu, not necessarily the same one that scheduled the
work.
Because preemption is disabled for as long as we have the gc pointer, I
think it's safe to use current_cpu() (aka smp_processor_id) to queue the
delayed work item on the correct cpu.
Fixes: 7cf2b0f9611b ("xfs: bound maximum wait time for inodegc work")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_icache.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 351849fc18ff..58712113d5d6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -2069,7 +2069,8 @@ xfs_inodegc_queue(
queue_delay = 0;
trace_xfs_inodegc_queue(mp, __return_address);
- mod_delayed_work(mp->m_inodegc_wq, &gc->work, queue_delay);
+ mod_delayed_work_on(current_cpu(), mp->m_inodegc_wq, &gc->work,
+ queue_delay);
put_cpu_ptr(gc);
if (xfs_inodegc_want_flush_work(ip, items, shrinker_hits)) {
@@ -2113,7 +2114,8 @@ xfs_inodegc_cpu_dead(
if (xfs_is_inodegc_enabled(mp)) {
trace_xfs_inodegc_queue(mp, __return_address);
- mod_delayed_work(mp->m_inodegc_wq, &gc->work, 0);
+ mod_delayed_work_on(current_cpu(), mp->m_inodegc_wq, &gc->work,
+ 0);
}
put_cpu_ptr(gc);
}
* [PATCH 2/4] xfs: check that per-cpu inodegc workers actually run on that cpu
2023-04-27 22:49 [PATCHSET 0/4] xfs: inodegc fixes for 6.4-rc1 Darrick J. Wong
2023-04-27 22:49 ` [PATCH 1/4] xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately Darrick J. Wong
@ 2023-04-27 22:49 ` Darrick J. Wong
2023-04-28 2:18 ` Dave Chinner
2023-04-27 22:49 ` [PATCH 3/4] xfs: disable reaping in fscounters scrub Darrick J. Wong
2023-04-27 22:49 ` [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work Darrick J. Wong
3 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2023-04-27 22:49 UTC (permalink / raw)
To: david, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Now that we've allegedly worked out the problem of the per-cpu inodegc
workers being scheduled on the wrong cpu, let's put in a debugging knob
to let us know if a worker ever gets mis-scheduled again.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_icache.c | 2 ++
fs/xfs/xfs_mount.h | 3 +++
fs/xfs/xfs_super.c | 3 +++
3 files changed, 8 insertions(+)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 58712113d5d6..4b63c065ef19 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1856,6 +1856,8 @@ xfs_inodegc_worker(
struct xfs_inode *ip, *n;
unsigned int nofs_flag;
+ ASSERT(gc->cpu == smp_processor_id());
+
WRITE_ONCE(gc->items, 0);
if (!node)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index f3269c0626f0..b51dc8cb7484 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -66,6 +66,9 @@ struct xfs_inodegc {
/* approximate count of inodes in the list */
unsigned int items;
unsigned int shrinker_hits;
+#ifdef DEBUG
+ unsigned int cpu;
+#endif
};
/*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 4d2e87462ac4..4f498cc1387c 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1095,6 +1095,9 @@ xfs_inodegc_init_percpu(
for_each_possible_cpu(cpu) {
gc = per_cpu_ptr(mp->m_inodegc, cpu);
+#ifdef DEBUG
+ gc->cpu = cpu;
+#endif
init_llist_head(&gc->list);
gc->items = 0;
INIT_DELAYED_WORK(&gc->work, xfs_inodegc_worker);
* [PATCH 3/4] xfs: disable reaping in fscounters scrub
2023-04-27 22:49 [PATCHSET 0/4] xfs: inodegc fixes for 6.4-rc1 Darrick J. Wong
2023-04-27 22:49 ` [PATCH 1/4] xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately Darrick J. Wong
2023-04-27 22:49 ` [PATCH 2/4] xfs: check that per-cpu inodegc workers actually run on that cpu Darrick J. Wong
@ 2023-04-27 22:49 ` Darrick J. Wong
2023-04-28 2:20 ` Dave Chinner
2023-04-27 22:49 ` [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work Darrick J. Wong
3 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2023-04-27 22:49 UTC (permalink / raw)
To: david, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
The fscounters scrub code doesn't work properly because it cannot
quiesce updates to the percpu counters in the filesystem, hence it
returns false corruption reports. This has been fixed properly in
one of the online repair patchsets that are under review by replacing
the xchk_disable_reaping calls with an exclusive filesystem freeze.
Disabling background gc isn't sufficient to fix the problem.
In other words, scrub doesn't need to call xfs_inodegc_stop, which is
just as well since it wasn't correct to allow scrub to call
xfs_inodegc_start when something else could be calling xfs_inodegc_stop
(e.g. trying to freeze the filesystem).
Neuter the scrubber for now, and remove the xchk_*_reaping functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/common.c | 26 --------------------------
fs/xfs/scrub/common.h | 2 --
fs/xfs/scrub/fscounters.c | 10 +++-------
fs/xfs/scrub/scrub.c | 2 --
fs/xfs/scrub/scrub.h | 1 -
fs/xfs/scrub/trace.h | 1 -
6 files changed, 3 insertions(+), 39 deletions(-)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 9aa79665c608..7a20256be969 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -1164,32 +1164,6 @@ xchk_metadata_inode_forks(
return 0;
}
-/* Pause background reaping of resources. */
-void
-xchk_stop_reaping(
- struct xfs_scrub *sc)
-{
- sc->flags |= XCHK_REAPING_DISABLED;
- xfs_blockgc_stop(sc->mp);
- xfs_inodegc_stop(sc->mp);
-}
-
-/* Restart background reaping of resources. */
-void
-xchk_start_reaping(
- struct xfs_scrub *sc)
-{
- /*
- * Readonly filesystems do not perform inactivation or speculative
- * preallocation, so there's no need to restart the workers.
- */
- if (!xfs_is_readonly(sc->mp)) {
- xfs_inodegc_start(sc->mp);
- xfs_blockgc_start(sc->mp);
- }
- sc->flags &= ~XCHK_REAPING_DISABLED;
-}
-
/*
* Enable filesystem hooks (i.e. runtime code patching) before starting a scrub
* operation. Callers must not hold any locks that intersect with the CPU
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 18b5f2b62f13..791235cd9b00 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -156,8 +156,6 @@ static inline bool xchk_skip_xref(struct xfs_scrub_metadata *sm)
}
int xchk_metadata_inode_forks(struct xfs_scrub *sc);
-void xchk_stop_reaping(struct xfs_scrub *sc);
-void xchk_start_reaping(struct xfs_scrub *sc);
/*
* Setting up a hook to wait for intents to drain is costly -- we have to take
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index faa315be7978..0d90d00de770 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -150,13 +150,6 @@ xchk_setup_fscounters(
if (error)
return error;
- /*
- * Pause background reclaim while we're scrubbing to reduce the
- * likelihood of background perturbations to the counters throwing off
- * our calculations.
- */
- xchk_stop_reaping(sc);
-
return xchk_trans_alloc(sc, 0);
}
@@ -453,6 +446,9 @@ xchk_fscounters(
if (frextents > mp->m_sb.sb_rextents)
xchk_set_corrupt(sc);
+ /* XXX: We can't quiesce percpu counter updates, so exit early. */
+ return 0;
+
/*
* If ifree exceeds icount by more than the minimum variance then
* something's probably wrong with the counters.
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 02819bedc5b1..3d98f604765e 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -186,8 +186,6 @@ xchk_teardown(
}
if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
mnt_drop_write_file(sc->file);
- if (sc->flags & XCHK_REAPING_DISABLED)
- xchk_start_reaping(sc);
if (sc->buf) {
if (sc->buf_cleanup)
sc->buf_cleanup(sc->buf);
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index e71903474cd7..b38e93830dde 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -106,7 +106,6 @@ struct xfs_scrub {
/* XCHK state flags grow up from zero, XREP state flags grown down from 2^31 */
#define XCHK_TRY_HARDER (1 << 0) /* can't get resources, try again */
-#define XCHK_REAPING_DISABLED (1 << 1) /* background block reaping paused */
#define XCHK_FSGATES_DRAIN (1 << 2) /* defer ops draining enabled */
#define XCHK_NEED_DRAIN (1 << 3) /* scrub needs to drain defer ops */
#define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 68efd6fda61c..b3894daeb86a 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -98,7 +98,6 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS);
#define XFS_SCRUB_STATE_STRINGS \
{ XCHK_TRY_HARDER, "try_harder" }, \
- { XCHK_REAPING_DISABLED, "reaping_disabled" }, \
{ XCHK_FSGATES_DRAIN, "fsgates_drain" }, \
{ XCHK_NEED_DRAIN, "need_drain" }, \
{ XREP_ALREADY_FIXED, "already_fixed" }
* [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work
2023-04-27 22:49 [PATCHSET 0/4] xfs: inodegc fixes for 6.4-rc1 Darrick J. Wong
` (2 preceding siblings ...)
2023-04-27 22:49 ` [PATCH 3/4] xfs: disable reaping in fscounters scrub Darrick J. Wong
@ 2023-04-27 22:49 ` Darrick J. Wong
2023-04-28 2:29 ` Dave Chinner
3 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2023-04-27 22:49 UTC (permalink / raw)
To: david, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
syzbot reported this warning from the faux inodegc shrinker that tries
to kick off inodegc work:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444
RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444
Call Trace:
__queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672
mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746
xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline]
xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191
do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853
shrink_slab+0x175/0x660 mm/vmscan.c:1013
shrink_one+0x502/0x810 mm/vmscan.c:5343
shrink_many mm/vmscan.c:5394 [inline]
lru_gen_shrink_node mm/vmscan.c:5511 [inline]
shrink_node+0x2064/0x35f0 mm/vmscan.c:6459
kswapd_shrink_node mm/vmscan.c:7262 [inline]
balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452
kswapd+0x677/0xd60 mm/vmscan.c:7712
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
This warning corresponds to this code in __queue_work:
/*
* For a draining wq, only works from the same workqueue are
* allowed. The __WQ_DESTROYING helps to spot the issue that
* queues a new work item to a wq after destroy_workqueue(wq).
*/
if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
WARN_ON_ONCE(!is_chained_work(wq))))
return;
For this to trip, we must have a thread draining the inodegc workqueue
and a second thread trying to queue inodegc work to that workqueue.
This can happen if freezing or a ro remount races with reclaim poking our
faux inodegc shrinker and another thread dropping an unlinked O_RDONLY
file:
Thread 0 Thread 1 Thread 2
xfs_inodegc_stop
xfs_inodegc_shrinker_scan
xfs_is_inodegc_enabled
<yes, will continue>
xfs_clear_inodegc_enabled
xfs_inodegc_queue_all
<list empty, do not queue inodegc worker>
xfs_inodegc_queue
<add to list>
xfs_is_inodegc_enabled
<no, returns>
drain_workqueue
<set WQ_DRAINING>
llist_empty
<no, will queue list>
mod_delayed_work_on(..., 0)
__queue_work
<sees WQ_DRAINING, kaboom>
In other words, everything between the access to inodegc_enabled state
and the decision to poke the inodegc workqueue requires some kind of
coordination to avoid the WQ_DRAINING state. We could perhaps introduce
a lock here, but we could also try to eliminate WQ_DRAINING from the
picture.
We could replace the drain_workqueue call with a loop that flushes the
workqueue and queues workers as long as there is at least one inode
present in the per-cpu inodegc llists. We've disabled inodegc at this
point, so we know that the number of queued inodes will eventually hit
zero as long as xfs_inodegc_start cannot reactivate the workers.
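The proposed loop can be modelled in userspace as follows (a Python sketch under the stated assumption that, once inodegc is disabled, racing adders are a finite supply; the function names only mirror the kernel ones):

```python
def queue_all(percpu_lists):
    """Model xfs_inodegc_queue_all: run the worker for every
    non-empty per-CPU list; report whether any work was queued."""
    ran = False
    for lst in percpu_lists:
        if lst:
            lst.clear()  # the "worker" inactivates the queued inodes
            ran = True
    return ran

def inodegc_stop(percpu_lists, straggler_batches):
    """Model the flush-and-requeue drain loop.  straggler_batches
    stands in for racing threads that sampled the enabled flag just
    before it was cleared; the supply is finite, so the loop ends."""
    queue_all(percpu_lists)
    while True:
        # flush_workqueue(): racing adds that slipped in land now.
        if straggler_batches:
            cpu, items = straggler_batches.pop(0)
            percpu_lists[cpu].extend(items)
        if not queue_all(percpu_lists):
            return

lists = [["a"], [], ["b", "c"], []]
inodegc_stop(lists, [(1, ["late-1"]), (3, ["late-2"])])
assert all(not lst for lst in lists)
```

Termination depends entirely on the guarantee argued below: nothing may call xfs_inodegc_start while the loop runs.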
There are four callers of xfs_inodegc_start. Three of them come from the
VFS with s_umount held: filesystem thawing, failed filesystem freezing,
and the rw remount transition. The fourth caller is mounting rw (no
remount or freezing possible).
There are three callers of xfs_inodegc_stop. One is unmounting (no
remount or thaw possible). Two of them come from the VFS with s_umount
held: fs freezing and ro remount transition.
Hence, it is correct to replace the drain_workqueue call with a loop
that drains the inodegc llists.
Fixes: 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_icache.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4b63c065ef19..14cb660d7f55 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -435,18 +435,23 @@ xfs_iget_check_free_state(
}
/* Make all pending inactivation work start immediately. */
-static void
+static bool
xfs_inodegc_queue_all(
struct xfs_mount *mp)
{
struct xfs_inodegc *gc;
int cpu;
+ bool ret = false;
for_each_online_cpu(cpu) {
gc = per_cpu_ptr(mp->m_inodegc, cpu);
- if (!llist_empty(&gc->list))
+ if (!llist_empty(&gc->list)) {
mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
+ ret = true;
+ }
}
+
+ return ret;
}
/*
@@ -1911,24 +1916,39 @@ xfs_inodegc_flush(
/*
* Flush all the pending work and then disable the inode inactivation background
- * workers and wait for them to stop.
+ * workers and wait for them to stop. Do not call xfs_inodegc_start until this
+ * finishes.
*/
void
xfs_inodegc_stop(
struct xfs_mount *mp)
{
+ bool rerun;
+
if (!xfs_clear_inodegc_enabled(mp))
return;
+ /*
+ * Drain all pending inodegc work, including inodes that could be
+ * queued by racing xfs_inodegc_queue or xfs_inodegc_shrinker_scan
+ * threads that sample the inodegc state just prior to us clearing it.
+ * The inodegc flag state prevents new threads from queuing more
+ * inodes, so we queue pending work items and flush the workqueue until
+ * all inodegc lists are empty.
+ */
xfs_inodegc_queue_all(mp);
- drain_workqueue(mp->m_inodegc_wq);
+ do {
+ flush_workqueue(mp->m_inodegc_wq);
+ rerun = xfs_inodegc_queue_all(mp);
+ } while (rerun);
trace_xfs_inodegc_stop(mp, __return_address);
}
/*
* Enable the inode inactivation background workers and schedule deferred inode
- * inactivation work if there is any.
+ * inactivation work if there is any. Do not call this while xfs_inodegc_stop
+ * is running.
*/
void
xfs_inodegc_start(
* Re: [PATCH 1/4] xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately
2023-04-27 22:49 ` [PATCH 1/4] xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately Darrick J. Wong
@ 2023-04-28 2:14 ` Dave Chinner
0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2023-04-28 2:14 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Apr 27, 2023 at 03:49:26PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> I've been noticing odd racing behavior in the inodegc code that could
> only be explained by one cpu adding an inode to its inactivation llist
> at the same time that another cpu is processing that cpu's llist.
> Preemption is disabled between get/put_cpu_ptr, so the only explanation
> is scheduler mayhem. I inserted the following debug code into
> xfs_inodegc_worker (see the next patch):
>
> ASSERT(gc->cpu == smp_processor_id());
>
> This assertion tripped during overnight tests on the arm64 machines, but
> curiously not on x86_64. I think we haven't observed any resource leaks
> here because the lockfree list code can handle simultaneous llist_add
> and llist_del_all functions operating on the same list. However, the
> whole point of having percpu inodegc lists is to take advantage of warm
> memory caches by inactivating inodes on the last processor to touch the
> inode.
>
> The incorrect scheduling seems to occur after an inodegc worker is
> subjected to mod_delayed_work(). This wraps mod_delayed_work_on with
> WORK_CPU_UNBOUND specified as the cpu number. Unbound allows for
> scheduling on any cpu, not necessarily the same one that scheduled the
> work.
Ugh, that's a bit of a landmine.
> Because preemption is disabled for as long as we have the gc pointer, I
> think it's safe to use current_cpu() (aka smp_processor_id) to queue the
> delayed work item on the correct cpu.
>
> Fixes: 7cf2b0f9611b ("xfs: bound maximum wait time for inodegc work")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/xfs_icache.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
Looks good.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/4] xfs: check that per-cpu inodegc workers actually run on that cpu
2023-04-27 22:49 ` [PATCH 2/4] xfs: check that per-cpu inodegc workers actually run on that cpu Darrick J. Wong
@ 2023-04-28 2:18 ` Dave Chinner
0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2023-04-28 2:18 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Apr 27, 2023 at 03:49:31PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Now that we've allegedly worked out the problem of the per-cpu inodegc
> workers being scheduled on the wrong cpu, let's put in a debugging knob
> to let us know if a worker ever gets mis-scheduled again.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/xfs_icache.c | 2 ++
> fs/xfs/xfs_mount.h | 3 +++
> fs/xfs/xfs_super.c | 3 +++
> 3 files changed, 8 insertions(+)
>
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 58712113d5d6..4b63c065ef19 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1856,6 +1856,8 @@ xfs_inodegc_worker(
> struct xfs_inode *ip, *n;
> unsigned int nofs_flag;
>
> + ASSERT(gc->cpu == smp_processor_id());
I kinda wish there was a reverse "per cpu item to cpu"
resolution function, but this is only debugging code so it'll do.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 3/4] xfs: disable reaping in fscounters scrub
2023-04-27 22:49 ` [PATCH 3/4] xfs: disable reaping in fscounters scrub Darrick J. Wong
@ 2023-04-28 2:20 ` Dave Chinner
2023-04-28 4:28 ` Darrick J. Wong
0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2023-04-28 2:20 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Apr 27, 2023 at 03:49:37PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> The fscounters scrub code doesn't work properly because it cannot
> quiesce updates to the percpu counters in the filesystem, hence it
> returns false corruption reports. This has been fixed properly in
> one of the online repair patchsets that are under review by replacing
> the xchk_disable_reaping calls with an exclusive filesystem freeze.
> Disabling background gc isn't sufficient to fix the problem.
>
> In other words, scrub doesn't need to call xfs_inodegc_stop, which is
> just as well since it wasn't correct to allow scrub to call
> xfs_inodegc_start when something else could be calling xfs_inodegc_stop
> (e.g. trying to freeze the filesystem).
>
> Neuter the scrubber for now, and remove the xchk_*_reaping functions.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Looks ok, minor nit below.
> @@ -453,6 +446,9 @@ xchk_fscounters(
> if (frextents > mp->m_sb.sb_rextents)
> xchk_set_corrupt(sc);
>
> + /* XXX: We can't quiesce percpu counter updates, so exit early. */
> + return 0;
Can you just add to this that we can re-enable this functionality
when we have the exclusive freeze functionality in the kernel?
With that,
Reviewed-by: Dave Chinner <dchinner@redhat.com>
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work
2023-04-27 22:49 ` [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work Darrick J. Wong
@ 2023-04-28 2:29 ` Dave Chinner
2023-04-28 4:28 ` Darrick J. Wong
0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2023-04-28 2:29 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Apr 27, 2023 at 03:49:43PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> syzbot reported this warning from the faux inodegc shrinker that tries
> to kick off inodegc work:
>
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444
> RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444
> Call Trace:
> __queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672
> mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746
> xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline]
> xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191
> do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853
> shrink_slab+0x175/0x660 mm/vmscan.c:1013
> shrink_one+0x502/0x810 mm/vmscan.c:5343
> shrink_many mm/vmscan.c:5394 [inline]
> lru_gen_shrink_node mm/vmscan.c:5511 [inline]
> shrink_node+0x2064/0x35f0 mm/vmscan.c:6459
> kswapd_shrink_node mm/vmscan.c:7262 [inline]
> balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452
> kswapd+0x677/0xd60 mm/vmscan.c:7712
> kthread+0x2e8/0x3a0 kernel/kthread.c:376
> ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
>
> This warning corresponds to this code in __queue_work:
>
> /*
> * For a draining wq, only works from the same workqueue are
> * allowed. The __WQ_DESTROYING helps to spot the issue that
> * queues a new work item to a wq after destroy_workqueue(wq).
> */
> if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
> WARN_ON_ONCE(!is_chained_work(wq))))
> return;
>
> For this to trip, we must have a thread draining the inodegc workqueue
> and a second thread trying to queue inodegc work to that workqueue.
> This can happen if freezing or a ro remount races with reclaim poking our
> faux inodegc shrinker and another thread dropping an unlinked O_RDONLY
> file:
>
> Thread 0 Thread 1 Thread 2
>
> xfs_inodegc_stop
>
> xfs_inodegc_shrinker_scan
> xfs_is_inodegc_enabled
> <yes, will continue>
>
> xfs_clear_inodegc_enabled
> xfs_inodegc_queue_all
> <list empty, do not queue inodegc worker>
>
> xfs_inodegc_queue
> <add to list>
> xfs_is_inodegc_enabled
> <no, returns>
>
> drain_workqueue
> <set WQ_DRAINING>
>
> llist_empty
> <no, will queue list>
> mod_delayed_work_on(..., 0)
> __queue_work
> <sees WQ_DRAINING, kaboom>
>
> In other words, everything between the access to inodegc_enabled state
> and the decision to poke the inodegc workqueue requires some kind of
> coordination to avoid the WQ_DRAINING state. We could perhaps introduce
> a lock here, but we could also try to eliminate WQ_DRAINING from the
> picture.
>
> We could replace the drain_workqueue call with a loop that flushes the
> workqueue and queues workers as long as there is at least one inode
> present in the per-cpu inodegc llists. We've disabled inodegc at this
> point, so we know that the number of queued inodes will eventually hit
> zero as long as xfs_inodegc_start cannot reactivate the workers.
>
> There are four callers of xfs_inodegc_start. Three of them come from the
> VFS with s_umount held: filesystem thawing, failed filesystem freezing,
> and the rw remount transition. The fourth caller is mounting rw (no
> remount or freezing possible).
>
> There are three callers of xfs_inodegc_stop. One is unmounting (no
> remount or thaw possible). Two of them come from the VFS with s_umount
> held: fs freezing and ro remount transition.
Ok, so effectively we are using s_umount to serialise inodegc
start/stop transitions.
....
> @@ -1911,24 +1916,39 @@ xfs_inodegc_flush(
>
> /*
> * Flush all the pending work and then disable the inode inactivation background
> - * workers and wait for them to stop.
> + * workers and wait for them to stop. Do not call xfs_inodegc_start until this
> + * finishes.
> */
> void
> xfs_inodegc_stop(
> struct xfs_mount *mp)
> {
I'd prefer to document that these two functions should not be called
without holding the sb->s_umount lock rather than leaving the
serialisation mechanism undocumented. When we add the exclusive
freeze code for the fscounter scrub sync, that's also going to hold
the sb->s_umount lock while inodegc is stopped and started, right?
> + bool rerun;
> +
> if (!xfs_clear_inodegc_enabled(mp))
> return;
>
> + /*
> + * Drain all pending inodegc work, including inodes that could be
> + * queued by racing xfs_inodegc_queue or xfs_inodegc_shrinker_scan
> + * threads that sample the inodegc state just prior to us clearing it.
> + * The inodegc flag state prevents new threads from queuing more
> + * inodes, so we queue pending work items and flush the workqueue until
> + * all inodegc lists are empty.
> + */
> xfs_inodegc_queue_all(mp);
> - drain_workqueue(mp->m_inodegc_wq);
> + do {
> + flush_workqueue(mp->m_inodegc_wq);
> + rerun = xfs_inodegc_queue_all(mp);
> + } while (rerun);
Would it be worth noting that we don't use drain_workqueue() because
it doesn't allow other unserialised mechanisms to reschedule inodegc
work while the draining is in progress?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work
2023-04-28 2:29 ` Dave Chinner
@ 2023-04-28 4:28 ` Darrick J. Wong
0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2023-04-28 4:28 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Fri, Apr 28, 2023 at 12:29:47PM +1000, Dave Chinner wrote:
> On Thu, Apr 27, 2023 at 03:49:43PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > syzbot reported this warning from the faux inodegc shrinker that tries
> > to kick off inodegc work:
> >
> > ------------[ cut here ]------------
> > WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444
> > RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444
> > Call Trace:
> > __queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672
> > mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746
> > xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline]
> > xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191
> > do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853
> > shrink_slab+0x175/0x660 mm/vmscan.c:1013
> > shrink_one+0x502/0x810 mm/vmscan.c:5343
> > shrink_many mm/vmscan.c:5394 [inline]
> > lru_gen_shrink_node mm/vmscan.c:5511 [inline]
> > shrink_node+0x2064/0x35f0 mm/vmscan.c:6459
> > kswapd_shrink_node mm/vmscan.c:7262 [inline]
> > balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452
> > kswapd+0x677/0xd60 mm/vmscan.c:7712
> > kthread+0x2e8/0x3a0 kernel/kthread.c:376
> > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
> >
> > This warning corresponds to this code in __queue_work:
> >
> > /*
> > * For a draining wq, only works from the same workqueue are
> > * allowed. The __WQ_DESTROYING helps to spot the issue that
> > * queues a new work item to a wq after destroy_workqueue(wq).
> > */
> > if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
> > WARN_ON_ONCE(!is_chained_work(wq))))
> > return;
> >
> > For this to trip, we must have a thread draining the inodegc workqueue
> > and a second thread trying to queue inodegc work to that workqueue.
> > This can happen if freezing or a ro remount races with reclaim poking our
> > faux inodegc shrinker and another thread dropping an unlinked O_RDONLY
> > file:
> >
> > Thread 0                 Thread 1                 Thread 2
> >
> > xfs_inodegc_stop
> >
> >                          xfs_inodegc_shrinker_scan
> >                          xfs_is_inodegc_enabled
> >                          <yes, will continue>
> >
> > xfs_clear_inodegc_enabled
> > xfs_inodegc_queue_all
> > <list empty, do not queue inodegc worker>
> >
> >                                                   xfs_inodegc_queue
> >                                                   <add to list>
> >                                                   xfs_is_inodegc_enabled
> >                                                   <no, returns>
> >
> > drain_workqueue
> > <set WQ_DRAINING>
> >
> >                          llist_empty
> >                          <no, will queue list>
> >                          mod_delayed_work_on(..., 0)
> >                          __queue_work
> >                          <sees WQ_DRAINING, kaboom>
> >
> > In other words, everything between the access to inodegc_enabled state
> > and the decision to poke the inodegc workqueue requires some kind of
> > coordination to avoid the WQ_DRAINING state. We could perhaps introduce
> > a lock here, but we could also try to eliminate WQ_DRAINING from the
> > picture.
> >
> > We could replace the drain_workqueue call with a loop that flushes the
> > workqueue and queues workers as long as there is at least one inode
> > present in the per-cpu inodegc llists. We've disabled inodegc at this
> > point, so we know that the number of queued inodes will eventually hit
> > zero as long as xfs_inodegc_start cannot reactivate the workers.
> >
> > There are four callers of xfs_inodegc_start. Three of them come from the
> > VFS with s_umount held: filesystem thawing, failed filesystem freezing,
> > and the rw remount transition. The fourth caller is mounting rw (no
> > remount or freezing possible).
> >
> > There are three callers of xfs_inodegc_stop. One is unmounting (no
> > remount or thaw possible). Two of them come from the VFS with s_umount
> > held: fs freezing and ro remount transition.
>
> Ok, so effectively we are using s_umount to serialise inodegc
> start/stop transitions.
Correct.
> ....
>
> > @@ -1911,24 +1916,39 @@ xfs_inodegc_flush(
> >
> > /*
> > * Flush all the pending work and then disable the inode inactivation background
> > - * workers and wait for them to stop.
> > + * workers and wait for them to stop. Do not call xfs_inodegc_start until this
> > + * finishes.
> > */
> > void
> > xfs_inodegc_stop(
> > struct xfs_mount *mp)
> > {
>
> I'd prefer to document that these two functions should not be called
> without holding the sb->s_umount lock rather than leaving the
> serialisation mechanism undocumented.
Done. "Caller must hold sb->s_umount to coordinate changes in the
inodegc_enabled state."
> When we add the exclusive
> freeze code for the fscounter scrub sync, that's also going to hold
> the sb->s_umount lock while inodegc is stopped and started, right?
Yes, the exclusive freeze mechanism takes all the same locks and makes
all the same state changes as regular freeze.
>
> > + bool rerun;
> > +
> > if (!xfs_clear_inodegc_enabled(mp))
> > return;
> >
> > + /*
> > + * Drain all pending inodegc work, including inodes that could be
> > + * queued by racing xfs_inodegc_queue or xfs_inodegc_shrinker_scan
> > + * threads that sample the inodegc state just prior to us clearing it.
> > + * The inodegc flag state prevents new threads from queuing more
> > + * inodes, so we queue pending work items and flush the workqueue until
> > + * all inodegc lists are empty.
> > + */
> > xfs_inodegc_queue_all(mp);
> > - drain_workqueue(mp->m_inodegc_wq);
> > + do {
> > + flush_workqueue(mp->m_inodegc_wq);
> > + rerun = xfs_inodegc_queue_all(mp);
> > + } while (rerun);
>
> Would it be worth noting that we don't use drain_workqueue() because
> it doesn't allow other unserialised mechanisms to reschedule inodegc
> work while the draining is in progress?
I'll paste that right in!
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH 3/4] xfs: disable reaping in fscounters scrub
2023-04-28 2:20 ` Dave Chinner
@ 2023-04-28 4:28 ` Darrick J. Wong
0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2023-04-28 4:28 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Fri, Apr 28, 2023 at 12:20:41PM +1000, Dave Chinner wrote:
> On Thu, Apr 27, 2023 at 03:49:37PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > The fscounters scrub code doesn't work properly because it cannot
> > quiesce updates to the percpu counters in the filesystem, hence it
> > returns false corruption reports. This has been fixed properly in
> > one of the online repair patchsets that are under review by replacing
> > the xchk_disable_reaping calls with an exclusive filesystem freeze.
> > Disabling background gc isn't sufficient to fix the problem.
> >
> > In other words, scrub doesn't need to call xfs_inodegc_stop, which is
> > just as well since it wasn't correct to allow scrub to call
> > xfs_inodegc_start when something else could be calling xfs_inodegc_stop
> > (e.g. trying to freeze the filesystem).
> >
> > Neuter the scrubber for now, and remove the xchk_*_reaping functions.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
>
> Looks ok, minor nit below.
>
> > @@ -453,6 +446,9 @@ xchk_fscounters(
> > if (frextents > mp->m_sb.sb_rextents)
> > xchk_set_corrupt(sc);
> >
> > + /* XXX: We can't quiesce percpu counter updates, so exit early. */
> > + return 0;
>
> Can you just add to this that we can re-enable this functionality
> when we have the exclusive freeze functionality in the kernel?
Will do.
--D
> With that,
>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
>
> --
> Dave Chinner
> david@fromorbit.com
* [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work
2023-05-01 18:27 [PATCHSET v2 0/4] xfs: inodegc fixes for 6.4-rc1 Darrick J. Wong
@ 2023-05-01 18:27 ` Darrick J. Wong
2023-05-01 23:10 ` Dave Chinner
0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2023-05-01 18:27 UTC (permalink / raw)
To: david, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
syzbot reported this warning from the faux inodegc shrinker that tries
to kick off inodegc work:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444
RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444
Call Trace:
__queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672
mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746
xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline]
xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191
do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853
shrink_slab+0x175/0x660 mm/vmscan.c:1013
shrink_one+0x502/0x810 mm/vmscan.c:5343
shrink_many mm/vmscan.c:5394 [inline]
lru_gen_shrink_node mm/vmscan.c:5511 [inline]
shrink_node+0x2064/0x35f0 mm/vmscan.c:6459
kswapd_shrink_node mm/vmscan.c:7262 [inline]
balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452
kswapd+0x677/0xd60 mm/vmscan.c:7712
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
This warning corresponds to this code in __queue_work:

	/*
	 * For a draining wq, only works from the same workqueue are
	 * allowed. The __WQ_DESTROYING helps to spot the issue that
	 * queues a new work item to a wq after destroy_workqueue(wq).
	 */
	if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
		     WARN_ON_ONCE(!is_chained_work(wq))))
		return;

For this to trip, we must have a thread draining the inodegc workqueue
and a second thread trying to queue inodegc work to that workqueue.
This can happen if freezing or a ro remount races with reclaim poking our
faux inodegc shrinker and another thread dropping an unlinked O_RDONLY
file:

Thread 0                 Thread 1                 Thread 2

xfs_inodegc_stop

                         xfs_inodegc_shrinker_scan
                         xfs_is_inodegc_enabled
                         <yes, will continue>

xfs_clear_inodegc_enabled
xfs_inodegc_queue_all
<list empty, do not queue inodegc worker>

                                                  xfs_inodegc_queue
                                                  <add to list>
                                                  xfs_is_inodegc_enabled
                                                  <no, returns>

drain_workqueue
<set WQ_DRAINING>

                         llist_empty
                         <no, will queue list>
                         mod_delayed_work_on(..., 0)
                         __queue_work
                         <sees WQ_DRAINING, kaboom>

In other words, everything between the access to inodegc_enabled state
and the decision to poke the inodegc workqueue requires some kind of
coordination to avoid the WQ_DRAINING state. We could perhaps introduce
a lock here, but we could also try to eliminate WQ_DRAINING from the
picture.

We could replace the drain_workqueue call with a loop that flushes the
workqueue and queues workers as long as there is at least one inode
present in the per-cpu inodegc llists. We've disabled inodegc at this
point, so we know that the number of queued inodes will eventually hit
zero as long as xfs_inodegc_start cannot reactivate the workers.

There are four callers of xfs_inodegc_start. Three of them come from the
VFS with s_umount held: filesystem thawing, failed filesystem freezing,
and the rw remount transition. The fourth caller is mounting rw (no
remount or freezing possible).

There are three callers of xfs_inodegc_stop. One is unmounting (no
remount or thaw possible). Two of them come from the VFS with s_umount
held: fs freezing and ro remount transition.

Hence, it is correct to replace the drain_workqueue call with a loop
that drains the inodegc llists.

Fixes: 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_icache.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4b63c065ef19..0f60e301eb1f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -435,18 +435,23 @@ xfs_iget_check_free_state(
}
/* Make all pending inactivation work start immediately. */
-static void
+static bool
xfs_inodegc_queue_all(
struct xfs_mount *mp)
{
struct xfs_inodegc *gc;
int cpu;
+ bool ret = false;
for_each_online_cpu(cpu) {
gc = per_cpu_ptr(mp->m_inodegc, cpu);
- if (!llist_empty(&gc->list))
+ if (!llist_empty(&gc->list)) {
mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
+ ret = true;
+ }
}
+
+ return ret;
}
/*
@@ -1911,24 +1916,41 @@ xfs_inodegc_flush(
/*
* Flush all the pending work and then disable the inode inactivation background
- * workers and wait for them to stop.
+ * workers and wait for them to stop. Caller must hold sb->s_umount to
+ * coordinate changes in the inodegc_enabled state.
*/
void
xfs_inodegc_stop(
struct xfs_mount *mp)
{
+ bool rerun;
+
if (!xfs_clear_inodegc_enabled(mp))
return;
+ /*
+ * Drain all pending inodegc work, including inodes that could be
+ * queued by racing xfs_inodegc_queue or xfs_inodegc_shrinker_scan
+ * threads that sample the inodegc state just prior to us clearing it.
+ * The inodegc flag state prevents new threads from queuing more
+ * inodes, so we queue pending work items and flush the workqueue until
+ * all inodegc lists are empty. IOWs, we cannot use drain_workqueue
+ * here because it does not allow other unserialized mechanisms to
+ * reschedule inodegc work while this draining is in progress.
+ */
xfs_inodegc_queue_all(mp);
- drain_workqueue(mp->m_inodegc_wq);
+ do {
+ flush_workqueue(mp->m_inodegc_wq);
+ rerun = xfs_inodegc_queue_all(mp);
+ } while (rerun);
trace_xfs_inodegc_stop(mp, __return_address);
}
/*
* Enable the inode inactivation background workers and schedule deferred inode
- * inactivation work if there is any.
+ * inactivation work if there is any. Caller must hold sb->s_umount to
+ * coordinate changes in the inodegc_enabled state.
*/
void
xfs_inodegc_start(
* Re: [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work
2023-05-01 18:27 ` [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work Darrick J. Wong
@ 2023-05-01 23:10 ` Dave Chinner
0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2023-05-01 23:10 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Mon, May 01, 2023 at 11:27:41AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> syzbot reported this warning from the faux inodegc shrinker that tries
> to kick off inodegc work:
>
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444
> RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444
> Call Trace:
> __queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672
> mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746
> xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline]
> xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191
> do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853
> shrink_slab+0x175/0x660 mm/vmscan.c:1013
> shrink_one+0x502/0x810 mm/vmscan.c:5343
> shrink_many mm/vmscan.c:5394 [inline]
> lru_gen_shrink_node mm/vmscan.c:5511 [inline]
> shrink_node+0x2064/0x35f0 mm/vmscan.c:6459
> kswapd_shrink_node mm/vmscan.c:7262 [inline]
> balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452
> kswapd+0x677/0xd60 mm/vmscan.c:7712
> kthread+0x2e8/0x3a0 kernel/kthread.c:376
> ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
>
> This warning corresponds to this code in __queue_work:
>
> /*
> * For a draining wq, only works from the same workqueue are
> * allowed. The __WQ_DESTROYING helps to spot the issue that
> * queues a new work item to a wq after destroy_workqueue(wq).
> */
> if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
> WARN_ON_ONCE(!is_chained_work(wq))))
> return;
>
> For this to trip, we must have a thread draining the inodegc workqueue
> and a second thread trying to queue inodegc work to that workqueue.
> This can happen if freezing or a ro remount races with reclaim poking our
> faux inodegc shrinker and another thread dropping an unlinked O_RDONLY
> file:
>
> Thread 0                 Thread 1                 Thread 2
>
> xfs_inodegc_stop
>
>                          xfs_inodegc_shrinker_scan
>                          xfs_is_inodegc_enabled
>                          <yes, will continue>
>
> xfs_clear_inodegc_enabled
> xfs_inodegc_queue_all
> <list empty, do not queue inodegc worker>
>
>                                                   xfs_inodegc_queue
>                                                   <add to list>
>                                                   xfs_is_inodegc_enabled
>                                                   <no, returns>
>
> drain_workqueue
> <set WQ_DRAINING>
>
>                          llist_empty
>                          <no, will queue list>
>                          mod_delayed_work_on(..., 0)
>                          __queue_work
>                          <sees WQ_DRAINING, kaboom>
>
> In other words, everything between the access to inodegc_enabled state
> and the decision to poke the inodegc workqueue requires some kind of
> coordination to avoid the WQ_DRAINING state. We could perhaps introduce
> a lock here, but we could also try to eliminate WQ_DRAINING from the
> picture.
>
> We could replace the drain_workqueue call with a loop that flushes the
> workqueue and queues workers as long as there is at least one inode
> present in the per-cpu inodegc llists. We've disabled inodegc at this
> point, so we know that the number of queued inodes will eventually hit
> zero as long as xfs_inodegc_start cannot reactivate the workers.
>
> There are four callers of xfs_inodegc_start. Three of them come from the
> VFS with s_umount held: filesystem thawing, failed filesystem freezing,
> and the rw remount transition. The fourth caller is mounting rw (no
> remount or freezing possible).
>
> There are three callers of xfs_inodegc_stop. One is unmounting (no
> remount or thaw possible). Two of them come from the VFS with s_umount
> held: fs freezing and ro remount transition.
>
> Hence, it is correct to replace the drain_workqueue call with a loop
> that drains the inodegc llists.
>
> Fixes: 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/xfs_icache.c | 32 +++++++++++++++++++++++++++-----
> 1 file changed, 27 insertions(+), 5 deletions(-)
Looks good.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
--
Dave Chinner
david@fromorbit.com
Thread overview: 13+ messages
2023-04-27 22:49 [PATCHSET 0/4] xfs: inodegc fixes for 6.4-rc1 Darrick J. Wong
2023-04-27 22:49 ` [PATCH 1/4] xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately Darrick J. Wong
2023-04-28 2:14 ` Dave Chinner
2023-04-27 22:49 ` [PATCH 2/4] xfs: check that per-cpu inodegc workers actually run on that cpu Darrick J. Wong
2023-04-28 2:18 ` Dave Chinner
2023-04-27 22:49 ` [PATCH 3/4] xfs: disable reaping in fscounters scrub Darrick J. Wong
2023-04-28 2:20 ` Dave Chinner
2023-04-28 4:28 ` Darrick J. Wong
2023-04-27 22:49 ` [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work Darrick J. Wong
2023-04-28 2:29 ` Dave Chinner
2023-04-28 4:28 ` Darrick J. Wong
-- strict thread matches above, loose matches on Subject: below --
2023-05-01 18:27 [PATCHSET v2 0/4] xfs: inodegc fixes for 6.4-rc1 Darrick J. Wong
2023-05-01 18:27 ` [PATCH 4/4] xfs: fix xfs_inodegc_stop racing with mod_delayed_work Darrick J. Wong
2023-05-01 23:10 ` Dave Chinner