From: Dave Chinner <david@fromorbit.com>
To: Chris Mason <clm@fb.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
Date: Thu, 17 Nov 2016 10:31:36 +1100 [thread overview]
Message-ID: <20161116233136.GA19783@dastard> (raw)
In-Reply-To: <20161116030344.GA7746@clm-mbp.masoncoding.com>
On Tue, Nov 15, 2016 at 10:03:52PM -0500, Chris Mason wrote:
> On Wed, Nov 16, 2016 at 12:30:09PM +1100, Dave Chinner wrote:
> >On Tue, Nov 15, 2016 at 02:00:47PM -0500, Chris Mason wrote:
> >>On 11/15/2016 12:54 AM, Dave Chinner wrote:
> >>>On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
> >>>>On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
> >>>There have been 1.2 million inodes reclaimed from the cache, but
> >>>there have only been 20,000 dirty inode buffer writes. Yes, that's
> >>>written 440,000 dirty inodes - the inode write clustering is
> >>>capturing about 22 inodes per write - but the inode writeback load
> >>>is minimal at about 10 IO/s. XFS inode reclaim is not blocking
> >>>significantly on dirty inodes.
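(Spelling the arithmetic out: 440,000 dirty inodes over 20,000 inode
buffer writes is 440,000 / 20,000 = 22 inodes per cluster write, which
is where the "about 22 inodes per write" figure above comes from.)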
> >>
> >>I think our machines are different enough that we're not seeing the
> >>same problems. Or at least we're seeing different sides of the
> >>problem.
> >>
> >>We have 130GB of ram and on average about 300-500MB of XFS slab,
> >>total across all 15 filesystems. Your inodes are small and cuddly,
> >>and I'd rather have more than less. I see more with simoop than we
> >>see in prod, but either way it's a reasonable percentage of system
> >>ram considering the horrible things being done.
> >
> >So I'm running on 16GB RAM and have 100-150MB of XFS slab.
> >Percentage wise, the inode cache is a larger portion of memory than
> >in your machines. I can increase the number of files to increase it
> >further, but I don't think that will change anything.
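(For scale: 100-150MB of a 16GB box is getting on for 1% of RAM, while
300-500MB of 130GB is well under half a percent - hence "a larger
portion" above.)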
>
> I think the way to see what I'm seeing would be to drop the number
> of IO threads (-T) and bump both -m and -M. Basically less inode
> working set and more memory working set.
If I increase m/M by any non-trivial amount, the test OOMs within a
couple of minutes of starting even after cutting the number of IO
threads in half. I've managed to increase -m by 10% without OOM -
I'll keep trying to increase this part of the load as much as I
can as I refine the patchset I have.
> >>Both patched (yours or mine) and unpatched, XFS inode reclaim is
> >>keeping up. With my patch in place, tracing during simoop does
> >>show more kswapd prio=1 scanning than unpatched, so I'm clearly
> >>stretching the limits a little more. But we've got 30+ days of
> >>uptime in prod on almost 60 machines. The oom rate is roughly in
> >>line with v3.10, and miles better than v4.0.
> >
> >IOWs, you have a workaround that keeps your production systems
> >running. That's fine for your machines that are running this load,
> >but it's not working well for any of the other loads I've
> >looked at. That is, removing the throttling from the XFS inode
> >shrinker causes instability and adverse reclaim of the inode cache
> >in situations where maintaining a working set in memory is
> >required for performance.
>
> We agree on all of this much more than not. Josef has spent a lot
> of time recently on shrinkers (w/btrfs but the ideas are similar),
> and I'm wrapping duct tape around workloads until the overall
> architecture is less fragile.
>
> Using slab for metadata in an FS like btrfs where dirty metadata is
> almost unbounded is a huge challenge in the current framework. Ext4
> is moving to dramatically bigger logs, so it would eventually have
> the same problems.
Your 8TB XFS filesystems will be using 2GB logs (unless mkfs
settings were tweaked manually), so there's a huge amount of
metadata that 15x8TB XFS filesystems can pin in memory, too...
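(Back of the envelope: with the default ~2GB internal log per
filesystem, 15 filesystems have up to 15 x 2GB = ~30GB of log space
between them, so dirty metadata alone can pin memory on that order
before the journals force writeback.)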
> >Indeed, one of the things I noticed with the simoops workload
> >running the shrinker patches is that it no longer kept either the
> >inode cache or the XFS metadata cache in memory long enough for the
> >du to run without requiring IO. i.e. the caches no longer maintained
> >the working set of objects needed to optimise a regular operation
> >and the du scans took a lot longer.
>
> With simoop, du is supposed to do IO. It's crazy to expect to be
> able to scan all the inodes on a huge FS (or 15 of them) and keep it
> all in cache along with everything else hadoop does. I completely
> agree there are cases where having the working set in ram is valid,
> just simoop isn't one ;)
Sure, I was just pointing out that even simoop was seeing significant
changes in cache residency as a result of this change....
> >That's why removing the blocking from the shrinker causes the
> >overall work rate to go down - it results in the cache not
> >maintaining a working set of inodes and so increases the IO load and
> >that then slows everything down.
>
> At least on my machines, it made the overall work rate go up. Both
> simoop and prod are 10-15% faster.
Ok, I'll see if I can tune the workload here to behave more like
this....
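For anyone reading the archive without a tree handy: the flag in the
subject line is passed in from the shrinker entry point in
fs/xfs/xfs_icache.c, which in the 4.x kernels of this era looks roughly
like the sketch below (reconstructed from memory, so treat it as an
illustration of where SYNC_WAIT sits rather than the literal patch
context):

	long
	xfs_reclaim_inodes_nr(
		struct xfs_mount	*mp,
		int			nr_to_scan)
	{
		/* kick background reclaim and push the AIL so there is work to make progress on */
		xfs_reclaim_work_queue(mp);
		xfs_ail_push_all(mp->m_ail);

		/*
		 * SYNC_WAIT is what makes direct reclaim block waiting for
		 * dirty inodes to be flushed; the RFC drops it so the
		 * shrinker trylocks and moves on instead of throttling the
		 * caller.
		 */
		return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
	}

The whole argument above is about whether losing that blocking point
trades cache residency (my concern) for lower reclaim latency (Chris's
numbers).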
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
Thread overview: 24+ messages
2016-10-14 12:27 [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim Chris Mason
2016-10-15 22:34 ` Dave Chinner
2016-10-17 0:24 ` Chris Mason
2016-10-17 1:52 ` Dave Chinner
2016-10-17 13:30 ` Chris Mason
2016-10-17 22:30 ` Dave Chinner
2016-10-17 23:20 ` Chris Mason
2016-10-18 2:03 ` Dave Chinner
2016-11-14 1:00 ` Chris Mason
2016-11-14 7:27 ` Dave Chinner
2016-11-14 20:56 ` Chris Mason
2016-11-14 23:58 ` Dave Chinner
2016-11-15 3:09 ` Chris Mason
2016-11-15 5:54 ` Dave Chinner
2016-11-15 19:00 ` Chris Mason
2016-11-16 1:30 ` Dave Chinner
2016-11-16 3:03 ` Chris Mason
2016-11-16 23:31 ` Dave Chinner [this message]
2016-11-17 0:27 ` Chris Mason
2016-11-17 1:00 ` Dave Chinner
2016-11-17 0:47 ` Dave Chinner
2016-11-17 1:07 ` Chris Mason
2016-11-17 3:39 ` Dave Chinner
2019-06-14 12:58 ` Amir Goldstein