From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Brian Foster <bfoster@redhat.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/2] xfs: run blockgc on freeze to avoid iget stalls after reclaim
Date: Mon, 14 Feb 2022 17:54:16 -0800
Message-ID: <20220215015416.GG8338@magnolia>
In-Reply-To: <20220210230806.GO59729@dread.disaster.area>

On Fri, Feb 11, 2022 at 10:08:06AM +1100, Dave Chinner wrote:
> On Thu, Feb 10, 2022 at 02:03:23PM -0500, Brian Foster wrote:
> > On Wed, Feb 02, 2022 at 01:22:40PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 24, 2022 at 11:57:12AM -0500, Brian Foster wrote:
> > > > On Wed, Jan 19, 2022 at 04:36:36PM -0800, Darrick J. Wong wrote:
> > > > > > Of course if you wanted to recycle inactive inodes or do something else
> > > > > > entirely, then it's probably not worth going down this path..
> > > > >
> > > > > I'm a bit partial to /trying/ to recycle inactive inodes because (a)
> > > > > it's less tangling with -fsdevel for you and (b) inode scans in the
> > > > > online repair patchset got a little weird once xfs_iget lost the ability
> > > > > to recycle _NEEDS_INACTIVE inodes...
> > > > >
> > > > > OTOH I tried to figure out how to deal with the lockless list that those
> > > > > inodes are put on, and I couldn't figure out how to get them off the
> > > > > list safely, so that might be a dead end. If you have any ideas I'm all
> > > > > ears. :)
> > > > >
> > > >
> > > > So one of the things that I've been kind of unclear on with the current
> > > > deferred inactivation implementation is the primary goal of the percpu
> > > > optimization. I obviously can see the value of the percpu list in
> > > > general, but how much processing needs to occur in percpu context to
> > > > achieve the primary goal?
> > > >
> > > > For example, I can see how a single or small multi threaded sustained
> > > > file removal might be batched efficiently, but what happens if said task
> > > > happens to bounce around many cpus?
> > >
> > > In that case we have a scheduler problem, not a per-cpu
> > > infrastructure issue.
> > >
> >
> > Last I tested on my box, a single threaded rm -rf had executed on
> > something like 24 of the 80 cpus available after about ~1m of runtime.
>
> Not surprising - the scheduler will select sibling cores that
> share caches when the previous CPU the task was running on is busy
> running CPU affine tasks (i.e. the per-cpu inactive kworker thread).
>
> But how many CPUs it bounces the workload around over a long period
> is not important. What is important is the cache residency between
> the inodes when they are queued and when they are then inactivated.
> That's measured in microseconds to single digit milliseconds (i.e.
> within a single scheduling tick), and this is the metric that
> matters the most. It doesn't matter if the first batch is unlinked
> on CPU 1 and then inactivated on CPU 1 while the scheduler moves the
> unlink task to CPU 2 where it queues the next batch to be
> inactivated on CPU 2, and so on. What matters is the data set in
> each batch remains on the same CPU for inactivation processing.
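
(Interjecting with a sketch to make sure I'm picturing the same
mechanism -- this is not the actual xfs_icache.c code, just the shape
of it with invented names: the unlink task queues each inode on the
CPU-local list, and a worker bound to that same CPU drains the whole
batch, so the batch stays cache hot.)

#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

/* Illustrative only; every name here is made up. */
struct gc_inode {
	struct llist_node	gclist;
	/* ... the rest of the inode ... */
};

struct gc_pcpu {
	struct llist_head	list;	/* inodes queued on this CPU */
	struct work_struct	work;	/* drains @list on this CPU */
};

static DEFINE_PER_CPU(struct gc_pcpu, gc_pcpu);
static struct workqueue_struct	*gc_wq;	/* per-cpu (bound) workqueue */

/* Final iput() context, on whatever CPU the unlink task is running on. */
static void gc_queue(struct gc_inode *ip)
{
	struct gc_pcpu	*gc = get_cpu_ptr(&gc_pcpu);

	llist_add(&ip->gclist, &gc->list);
	queue_work(gc_wq, &gc->work);	/* bound wq: worker runs on this CPU */
	put_cpu_ptr(&gc_pcpu);
}

/* Worker (wired up with INIT_WORK at init time): inactivate the batch
 * that was queued on this CPU while it is still cache hot. */
static void gc_worker(struct work_struct *work)
{
	struct gc_pcpu		*gc = container_of(work, struct gc_pcpu, work);
	struct llist_node	*batch = llist_del_all(&gc->list);
	struct gc_inode		*ip, *n;

	llist_for_each_entry_safe(ip, n, batch, gclist)
		gc_inactivate(ip);	/* extent freeing etc.; invented helper */
}
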
>
> The tracing I've done indicates that most of the time the scheduler
> bounces the tasks between two or three sibling CPUs
> repeatedly. This occurs when the inactivation is keeping up with the
> unlink queueing side. When we have lots of extents to remove in
> inactivation, the amount of inactivation work is much greater than
> the unlink() work, and so we see inactivation batch processing
> taking longer and the CPU spread of the unlink task getting larger
> because there are more CPUs running CPU-affine tasks doing
> background inactivation.
>
> IOWs, having the number of CPUs the unlink task is scheduled on
> grow over the long term is not at all unexpected - this is exactly
> what we'd expect to see when we move towards async background
> processing of complex operations...
>
> > Of course the inactivation work for an inode occurs on the cpu that the
> > rm task was running on when the inode was destroyed, but I don't think
> > there's any guarantee that the kworker task will be the next task selected
> > on the cpu if the system is loaded with other runnable tasks (and that
> > whatever task is selected doesn't pollute the cache).
>
> The scheduler will select the next CPU based on physical machine
> topology - core latencies, shared caches, numa distances, whether
> the target CPU has affinity bound tasks already queued, etc.
>
> > For example, I
> > also noticed rm-<pidX> -> rm-<pidY> context switching on concurrent
> > remove tests. That is obviously not a caching issue in this case because
> > both tasks are still doing remove work, but behavior is less
> > deterministic if the target task happens to be something unrelated. Of
> > course, that doesn't mean the approach might not otherwise be effective
> > for the majority of common workloads..
>
> As long as the context switch rate does not substantially increase,
> having tasks migrate between sibling cores every so often isn't a
> huge problem.
>
> > > That per-ag based back end processing is exactly what Darrick's
> > > original proposals did:
> > >
> > > https://lore.kernel.org/linux-xfs/162758431072.332903.17159226037941080971.stgit@magnolia/
> > >
> > > It used radix tree walks run in background kworker threads to find
> > > and batch all the inode inactivation in a given AG.
> > >
> > > It performed poorly at scale because it destroyed the CPU affinity
> > > between the unlink() operation and the inactivation operation of the
> > > unlinked inode when the last reference to the inode is dropped and
> > > the inode is evicted from task syscall exit context. Regardless of
> > > whether there is a per-cpu front end or not, the per-ag based
> > > processing will destroy the CPU affinity of the data set being
> > > processed because we cannot guarantee that the per-ag objects are
> > > all local to the CPU that is processing the per-ag objects....
> > >
> >
> > Ok. The role/significance of cpu caching was not as apparent to me when
> > I had last replied to this thread. The discussion around the rcu inode
> > lifecycle issue helped clear some of that up.
> >
> > That said, why not conditionally tag and divert to a background worker
> > when the inodegc is disabled? That could allow NEEDS_INACTIVE inodes to
> > be claimed/recycled from other contexts in scenarios like when the fs is
> > frozen, since they won't be stuck in inaccessible and inactive percpu
> > queues, but otherwise preserves current behavior in normal runtime
> > conditions. Darrick mentioned online repair wanting to do something
> > similar earlier, but it's not clear to me if scrub could or would want
> > to disable the percpu inodegc workers in favor of a temporary/background
> > mode while repair is running. I'm just guessing that performance is
> > probably a small enough concern in that situation that it wouldn't be
> > a mitigating factor. Hm?
>
> We probably could do this, but I'm not sure the complexity is
> justified by the rarity of the problem it is trying to avoid.
> Freezes are not long term, nor are they particularly common for
> performance sensitive workloads. Hence I'm just not sure this corner case
> is important enough to justify doing the work given that we've had
> similar freeze-will-delay-some-stuff-indefinitely behaviour for a
> long time...

We /do/ have a few complaints lodged about hangcheck warnings when the
filesystem has to be frozen for a very long time. It'd be nice to
unblock the callers that want to grab a still-reachable inode, though.
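
Just to check that I'm reading Brian's suggestion the same way: a
strawman, totally untested and with invented names, would be to park
the inode on a per-mount list under a lock whenever gc is turned off
(e.g. for a freeze), so that a lookup which trips over a
NEEDS_INACTIVE inode can pull it back off that list and recycle it
instead of blocking:

#include <linux/list.h>
#include <linux/spinlock.h>

/* Strawman only; not actual XFS code. */
struct gc_inode {
	struct list_head	divert_entry;	/* empty when not parked */
	/* ... the rest of the inode ... */
};

struct gc_mount {
	spinlock_t		divert_lock;
	struct list_head	divert_list;	/* inodes parked while gc is off */
	bool			gc_enabled;
};

static void gc_queue(struct gc_mount *mp, struct gc_inode *ip)
{
	if (!READ_ONCE(mp->gc_enabled)) {
		/* gc disabled: park the inode where lookup can steal it */
		spin_lock(&mp->divert_lock);
		list_add_tail(&ip->divert_entry, &mp->divert_list);
		spin_unlock(&mp->divert_lock);
		return;
	}
	gc_queue_percpu(ip);	/* the normal cache-hot percpu path */
}

/* Lookup side: try to reclaim a parked inode instead of waiting on gc. */
static bool gc_try_steal(struct gc_mount *mp, struct gc_inode *ip)
{
	bool	stolen = false;

	spin_lock(&mp->divert_lock);
	if (!list_empty(&ip->divert_entry)) {
		list_del_init(&ip->divert_entry);
		stolen = true;
	}
	spin_unlock(&mp->divert_lock);
	return stolen;
}

Re-enabling gc would then drain divert_list back into the normal percpu
queues.  Whether that's worth the complexity is exactly the question
Dave raises above, of course.
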
As for the online repair case that Brian mentioned, I've largely
eliminated the need for scrub freezes by using jump labels, notifier
chains, and an in-memory shadow btree to make it so that full fs scans
can run in the background while the rest of the fs continues running.
That'll be discussed ... soon, when I get to the point where I'm ready
to start a full design review.
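
(If a sketch of the mechanism helps: it's nothing more exotic than a
static key gating a notifier call at the hooked sites, so the hooks
cost roughly nothing while no scan is running; scrub registers a
callback that applies live updates to the shadow structure it's
building and flips the key on for the duration.  The names below are
invented for illustration, not the actual patchset:)

#include <linux/jump_label.h>
#include <linux/notifier.h>

static DEFINE_STATIC_KEY_FALSE(scan_hooks_on);
static BLOCKING_NOTIFIER_HEAD(scan_hooks);

/* Hooked site in the regular (sleepable) update path. */
static inline void scan_hook(unsigned long op, void *update)
{
	if (static_branch_unlikely(&scan_hooks_on))
		blocking_notifier_call_chain(&scan_hooks, op, update);
}

/* Scrub side: start/stop a background scan that stays coherent. */
static int scan_start(struct notifier_block *nb)
{
	int	error = blocking_notifier_chain_register(&scan_hooks, nb);

	if (!error)
		static_branch_enable(&scan_hooks_on);
	return error;
}

static void scan_stop(struct notifier_block *nb)
{
	static_branch_disable(&scan_hooks_on);
	blocking_notifier_chain_unregister(&scan_hooks, nb);
}
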
(Hilariously, between that and better management of the DONTCACHE flag,
I don't actually need deferred inactivation for repair anymore... :P)

--D
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com