From: Dave Chinner <david@fromorbit.com>
To: Chris Mason <clm@fb.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
Date: Thu, 17 Nov 2016 10:31:36 +1100 [thread overview]
Message-ID: <20161116233136.GA19783@dastard> (raw)
In-Reply-To: <20161116030344.GA7746@clm-mbp.masoncoding.com>
On Tue, Nov 15, 2016 at 10:03:52PM -0500, Chris Mason wrote:
> On Wed, Nov 16, 2016 at 12:30:09PM +1100, Dave Chinner wrote:
> >On Tue, Nov 15, 2016 at 02:00:47PM -0500, Chris Mason wrote:
> >>On 11/15/2016 12:54 AM, Dave Chinner wrote:
> >>>On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
> >>>>On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
> >>>There have been 1.2 million inodes reclaimed from the cache, but
> >>>there have only been 20,000 dirty inode buffer writes. Yes, that's
> >>>written 440,000 dirty inodes - the inode write clustering is
> >>>capturing about 22 inodes per write - but the inode writeback load
> >>>is minimal at about 10 IO/s. XFS inode reclaim is not blocking
> >>>significantly on dirty inodes.
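(Spelling the arithmetic out: 440,000 dirty inodes over 20,000 inode
buffer writes is 440,000 / 20,000 = 22 inodes per cluster write, which
is where the "about 22 inodes per write" figure above comes from.)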
> >>
> >>I think our machines are different enough that we're not seeing the
> >>same problems. Or at least we're seeing different sides of the
> >>problem.
> >>
> >>We have 130GB of ram and on average about 300-500MB of XFS slab,
> >>total across all 15 filesystems. Your inodes are small and cuddly,
> >>and I'd rather have more than less. I see more with simoop than we
> >>see in prod, but either way it's a reasonable percentage of system
> >>ram considering the horrible things being done.
> >
> >So I'm running on 16GB RAM and have 100-150MB of XFS slab.
> >Percentage wise, the inode cache is a larger portion of memory than
> >in your machines. I can increase the number of files to increase it
> >further, but I don't think that will change anything.
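(For scale: 100-150MB of a 16GB box is getting on for 1% of RAM, while
300-500MB of 130GB is well under half a percent - hence "a larger
portion" above.)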
>
> I think the way to see what I'm seeing would be to drop the number
> of IO threads (-T) and bump both -m and -M. Basically less inode
> working set and more memory working set.
If I increase m/M by any non-trivial amount, the test OOMs within a
couple of minutes of starting even after cutting the number of IO
threads in half. I've managed to increase -m by 10% without OOM -
I'll keep trying to increase this part of the load as much as I
can as I refine the patchset I have.
> >>Both patched (yours or mine) and unpatched, XFS inode reclaim is
> >>keeping up. With my patch in place, tracing during simoop does
> >>show more kswapd prio=1 scanning than unpatched, so I'm clearly
> >>stretching the limits a little more. But we've got 30+ days of
> >>uptime in prod on almost 60 machines. The oom rate is roughly in
> >>line with v3.10, and miles better than v4.0.
> >
> >IOWs, you have a workaround that keeps your production systems
> >running. That's fine for your machines that are running this load,
> >but it's not working well for any of the other loads I've
> >looked at. That is, removing the throttling from the XFS inode
> >shrinker causes instability and adverse reclaim of the inode cache
> >in situations where maintaining a working set in memory is
> >required for performance.
>
> We agree on all of this much more than not. Josef has spent a lot
> of time recently on shrinkers (w/btrfs but the ideas are similar),
> and I'm wrapping duct tape around workloads until the overall
> architecture is less fragile.
>
> Using slab for metadata in an FS like btrfs where dirty metadata is
> almost unbounded is a huge challenge in the current framework. Ext4
> is moving to dramatically bigger logs, so it would eventually have
> the same problems.
Your 8TB XFS filesystems will be using 2GB logs (unless mkfs
settings were tweaked manually), so there's a huge amount of
metadata that 15x8TB XFS filesystems can pin in memory, too...
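(Back of the envelope: with the default ~2GB internal log per
filesystem, 15 filesystems have up to 15 x 2GB = ~30GB of log space
between them, so dirty metadata alone can pin memory on that order
before the journals force writeback.)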
> >Indeed, one of the things I noticed with the simoops workload
> >running the shrinker patches is that it no longer kept either the
> >inode cache or the XFS metadata cache in memory long enough for the
> >du to run without requiring IO. i.e. the caches no longer maintained
> >the working set of objects needed to optimise a regular operation
> >and the du scans took a lot longer.
>
> With simoop, du is supposed to do IO. It's crazy to expect to be
> able to scan all the inodes on a huge FS (or 15 of them) and keep it
> all in cache along with everything else hadoop does. I completely
> agree there are cases where having the working set in ram is valid,
> just simoop isn't one ;)
Sure, I was just pointing out that even simoop was seeing significant
changes in cache residency as a result of this change....
> >That's why removing the blocking from the shrinker causes the
> >overall work rate to go down - it results in the cache not
> >maintaining a working set of inodes and so increases the IO load and
> >that then slows everything down.
>
> At least on my machines, it made the overall work rate go up. Both
> simoop and prod are 10-15% faster.
Ok, I'll see if I can tune the workload here to behave more like
this....
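For anyone reading the archive without a tree handy: the flag in the
subject line is passed in from the shrinker entry point in
fs/xfs/xfs_icache.c, which in the 4.x kernels of this era looks roughly
like the sketch below (reconstructed from memory, so treat it as an
illustration of where SYNC_WAIT sits rather than the literal patch
context):

	long
	xfs_reclaim_inodes_nr(
		struct xfs_mount	*mp,
		int			nr_to_scan)
	{
		/* kick background reclaim and push the AIL so there is work to make progress on */
		xfs_reclaim_work_queue(mp);
		xfs_ail_push_all(mp->m_ail);

		/*
		 * SYNC_WAIT is what makes direct reclaim block waiting for
		 * dirty inodes to be flushed; the RFC drops it so the
		 * shrinker trylocks and moves on instead of throttling the
		 * caller.
		 */
		return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
	}

The whole argument above is about whether losing that blocking point
trades cache residency (my concern) for lower reclaim latency (Chris's
numbers).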
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
Thread overview: 24+ messages
2016-10-14 12:27 [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim Chris Mason
2016-10-15 22:34 ` Dave Chinner
2016-10-17 0:24 ` Chris Mason
2016-10-17 1:52 ` Dave Chinner
2016-10-17 13:30 ` Chris Mason
2016-10-17 22:30 ` Dave Chinner
2016-10-17 23:20 ` Chris Mason
2016-10-18 2:03 ` Dave Chinner
2016-11-14 1:00 ` Chris Mason
2016-11-14 7:27 ` Dave Chinner
2016-11-14 20:56 ` Chris Mason
2016-11-14 23:58 ` Dave Chinner
2016-11-15 3:09 ` Chris Mason
2016-11-15 5:54 ` Dave Chinner
2016-11-15 19:00 ` Chris Mason
2016-11-16 1:30 ` Dave Chinner
2016-11-16 3:03 ` Chris Mason
2016-11-16 23:31 ` Dave Chinner [this message]
2016-11-17 0:27 ` Chris Mason
2016-11-17 1:00 ` Dave Chinner
2016-11-17 0:47 ` Dave Chinner
2016-11-17 1:07 ` Chris Mason
2016-11-17 3:39 ` Dave Chinner
2019-06-14 12:58 ` Amir Goldstein