From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Brian Foster <bfoster@redhat.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/2] xfs: run blockgc on freeze to avoid iget stalls after reclaim
Date: Mon, 14 Feb 2022 17:54:16 -0800 [thread overview]
Message-ID: <20220215015416.GG8338@magnolia> (raw)
In-Reply-To: <20220210230806.GO59729@dread.disaster.area>
On Fri, Feb 11, 2022 at 10:08:06AM +1100, Dave Chinner wrote:
> On Thu, Feb 10, 2022 at 02:03:23PM -0500, Brian Foster wrote:
> > On Wed, Feb 02, 2022 at 01:22:40PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 24, 2022 at 11:57:12AM -0500, Brian Foster wrote:
> > > > On Wed, Jan 19, 2022 at 04:36:36PM -0800, Darrick J. Wong wrote:
> > > > > > Of course if you wanted to recycle inactive inodes or do something else
> > > > > > entirely, then it's probably not worth going down this path..
> > > > >
> > > > > I'm a bit partial to /trying/ to recycle inactive inodes because (a)
> > > > > it's less tangling with -fsdevel for you and (b) inode scans in the
> > > > > online repair patchset got a little weird once xfs_iget lost the ability
> > > > > to recycle _NEEDS_INACTIVE inodes...
> > > > >
> > > > > OTOH I tried to figure out how to deal with the lockless list that those
> > > > > inodes are put on, and I couldn't figure out how to get them off the
> > > > > list safely, so that might be a dead end. If you have any ideas I'm all
> > > > > ears. :)
> > > > >
> > > >
> > > > So one of the things that I've been kind of unclear on about the current
> > > > deferred inactivation implementation is the primary goal of the percpu
> > > > optimization. I obviously can see the value of the percpu list in
> > > > general, but how much processing needs to occur in percpu context to
> > > > achieve the primary goal?
> > > >
> > > > For example, I can see how a single or small multi threaded sustained
> > > > file removal might be batched efficiently, but what happens if said task
> > > > happens to bounce around many cpus?
> > >
> > > In that case we have a scheduler problem, not a per-cpu
> > > infrastructure issue.
> > >
> >
> > Last I tested on my box, a single threaded rm -rf had executed on
> > something like 24 of the 80 cpus available after about ~1m of runtime.
>
> Not surprising - the scheduler will select a sibling cores that
> share caches when the previous CPU the task was running on is busy
> running CPU affine tasks (i.e. the per-cpu inactive kworker thread).
>
> But how many CPUs it bounces the workload around over a long period
> is not important. What is important is the cache residency between
> the inodes when they are queued and when they are then inactivated.
> That's measured in microseconds to single digit milliseconds (i.e.
> within a single scheduling tick), and this is the metric that
> matters the most. It doesn't matter if the first batch is unlinked
> on CPU 1 and then inactived on CPU 1 while the scheduler moves the
> unlink task to CPU 2 where it queues the next batch to be
> inactivated on CPU 2, and so on. What matters is the data set in
> each batch remains on the same CPU for inactivation processing.
>
> The tracing I've done indicates taht the majority of the time that
> the scehduler bounces the tasks between two or three sibling CPUs
> repeatedly. This occurs when the inactivation is keeping up with the
> unlink queueing side. When we have lots of extents to remove in
> inactivation, the amount of inactivation work is much greater than
> the unlink() work, and so we see inactivation batch processing
> taking longer and the CPU spread of the unlink task gets larger
> because there are more CPUs running CPU-affine tasks doing
> background inactivation.
>
> IOWs, having the number of CPUs the unlink task is scheduled on
> grow over the long term is not at all unexpected - this is exactly
> what we'd expect to see when we move towards async background
> processing of complex operations...
>
> > Of course the inactivation work for an inode occurs on the cpu that the
> > rm task was running on when the inode was destroyed, but I don't think
> > there's any guarantee that kworker task will be the next task selected
> > on the cpu if the system is loaded with other runnable tasks (and then
> > whatever task is selected doesn't pollute the cache).
>
> The scheduler will select the next CPU based on phsyical machine
> topology - core latencies, shared caches, numa distances, whether
> the target CPU has affinity bound tasks already queued, etc.
>
> > For example, I
> > also noticed rm-<pidX> -> rm-<pidY> context switching on concurrent
> > remove tests. That is obviously not a caching issue in this case because
> > both tasks are still doing remove work, but behavior is less
> > deterministic of the target task happens to be something unrelated. Of
> > course, that doesn't mean the approach might not otherwise be effective
> > for the majority of common workloads..
>
> As long as the context switch rate does not substantially increase,
> having tasks migrate between sibling cores every so often isn't a
> huge problem.
>
> > > That per-ag based back end processing is exactly what Darrick's
> > > original proposals did:
> > >
> > > https://lore.kernel.org/linux-xfs/162758431072.332903.17159226037941080971.stgit@magnolia/
> > >
> > > It used radix tree walks run in background kworker threads to batch
> > > all the inode inactivation in a given AG via radix tree walks to
> > > find them.
> > >
> > > It performed poorly at scale because it destroyed the CPU affinity
> > > between the unlink() operation and the inactivation operation of the
> > > unlinked inode when the last reference to the inode is dropped and
> > > the inode evicted from task syscall exit context. REgardless of
> > > whether there is a per-cpu front end or not, the per-ag based
> > > processing will destroy the CPU affinity of the data set being
> > > processed because we cannot guarantee that the per-ag objects are
> > > all local to the CPU that is processing the per-ag objects....
> > >
> >
> > Ok. The role/significance of cpu caching was not as apparent to me when
> > I had last replied to this thread. The discussion around the rcu inode
> > lifecycle issue helped clear some of that up.
> >
> > That said, why not conditionally tag and divert to a background worker
> > when the inodegc is disabled? That could allow NEEDS_INACTIVE inodes to
> > be claimed/recycled from other contexts in scenarios like when the fs is
> > frozen, since they won't be stuck in inaccessible and inactive percpu
> > queues, but otherwise preserves current behavior in normal runtime
> > conditions. Darrick mentioned online repair wanting to do something
> > similar earlier, but it's not clear to me if scrub could or would want
> > to disable the percpu inodegc workers in favor of a temporary/background
> > mode while repair is running. I'm just guessing that performance is
> > probably small enough of a concern in that situation that it wouldn't be
> > a mitigating factor. Hm?
>
> WE probably could do this, but I'm not sure the complexity is
> justified by the rarity of the problem it is trying to avoid.
> Freezes are not long term, nor are they particularly common for
> performance sensitive workloads. Hence I'm just not this corner case
> is important enough to justify doing the work given that we've had
> similar freeze-will-delay-some-stuff-indefinitely behaviour for a
> long time...
We /do/ have a few complaints lodged about hangcheck warnings when the
filesystem has to be frozen for a very long time. It'd be nice to
unblock the callers that want to grab a still-reachable inode, though.
As for the online repair case that Brian mentioned, I've largely
eliminated the need for scrub freezes by using jump labels, notifier
chains, and an in-memory shadow btree to make it so that full fs scans
can run in the background while the rest of the fs continues running.
That'll be discussed ... soon, when I get to the point where I'm ready
to start a full design review.
(Hilariously, between that and better management of the DONTCACHE flag,
I don't actually need deferred inactivation for repair anymore... :P)
--D
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2022-02-15 1:54 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-01-13 13:36 [PATCH 0/2] xfs: a couple misc/small deferred inactivation tweaks Brian Foster
2022-01-13 13:37 ` [PATCH 1/2] xfs: flush inodegc workqueue tasks before cancel Brian Foster
2022-01-13 18:35 ` Darrick J. Wong
2022-01-13 22:19 ` Dave Chinner
2022-01-13 13:37 ` [PATCH 2/2] xfs: run blockgc on freeze to avoid iget stalls after reclaim Brian Foster
2022-01-13 17:13 ` Darrick J. Wong
2022-01-13 19:58 ` Brian Foster
2022-01-13 20:43 ` Darrick J. Wong
2022-01-13 21:01 ` Darrick J. Wong
2022-01-13 22:38 ` Dave Chinner
2022-01-14 17:35 ` Darrick J. Wong
2022-01-14 19:45 ` Brian Foster
2022-01-14 21:30 ` Darrick J. Wong
2022-01-15 4:09 ` Darrick J. Wong
2022-01-15 22:40 ` Dave Chinner
2022-01-17 13:37 ` Brian Foster
2022-01-18 18:56 ` Darrick J. Wong
2022-01-19 20:07 ` Brian Foster
2022-01-20 0:36 ` Darrick J. Wong
2022-01-20 5:18 ` Dave Chinner
2022-01-24 16:57 ` Brian Foster
2022-02-02 2:22 ` Dave Chinner
2022-02-10 19:03 ` Brian Foster
2022-02-10 23:08 ` Dave Chinner
2022-02-15 1:54 ` Darrick J. Wong [this message]
2022-02-15 9:26 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220215015416.GG8338@magnolia \
--to=djwong@kernel.org \
--cc=bfoster@redhat.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.