Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim


From: Dave Chinner <david@fromorbit.com>
To: Chris Mason <clm@fb.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
Date: Thu, 17 Nov 2016 10:31:36 +1100
Message-ID: <20161116233136.GA19783@dastard>
In-Reply-To: <20161116030344.GA7746@clm-mbp.masoncoding.com>

On Tue, Nov 15, 2016 at 10:03:52PM -0500, Chris Mason wrote:
> On Wed, Nov 16, 2016 at 12:30:09PM +1100, Dave Chinner wrote:
> >On Tue, Nov 15, 2016 at 02:00:47PM -0500, Chris Mason wrote:
> >>On 11/15/2016 12:54 AM, Dave Chinner wrote:
> >>>On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
> >>>>On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
> >>>There have been 1.2 million inodes reclaimed from the cache, but
> >>>there have only been 20,000 dirty inode buffer writes. Yes, that's
> >>>written 440,000 dirty inodes - the inode write clustering is
> >>>capturing about 22 inodes per write - but the inode writeback load
> >>>is minimal at about 10 IO/s. XFS inode reclaim is not blocking
> >>>significantly on dirty inodes.
> >>
> >>I think our machines are different enough that we're not seeing the
> >>same problems.  Or at least we're seeing different sides of the
> >>problem.
> >>
> >>We have 130GB of ram and on average about 300-500MB of XFS slab,
> >>total across all 15 filesystems.  Your inodes are small and cuddly,
> >>and I'd rather have more than less.  I see more with simoop than we
> >>see in prod, but either way it's a reasonable percentage of system
> >>ram considering the horrible things being done.
> >
> >So I'm running on 16GB RAM and have 100-150MB of XFS slab.
> >Percentage wise, the inode cache is a larger portion of memory than
> >in your machines. I can increase the number of files to increase it
> >further, but I don't think that will change anything.
> 
> I think the way to see what I'm seeing would be to drop the number
> of IO threads (-T) and bump both -m and -M.  Basically less inode
> working set and more memory working set.

If I increase m/M by any non-trivial amount, the test OOMs within a
couple of minutes of starting even after cutting the number of IO
threads in half. I've managed to increase -m by 10% without OOM -
I'll keep trying to increase this part of the load as much as I
can as I refine the patchset I have.

> >>Both patched (yours or mine) and unpatched, XFS inode reclaim is
> >>keeping up.   With my patch in place, tracing during simoop does
> >>show more kswapd prio=1 scanning than unpatched, so I'm clearly
> >>stretching the limits a little more.  But we've got 30+ days of
> >>uptime in prod on almost 60 machines.  The oom rate is roughly in
> >>line with v3.10, and miles better than v4.0.
> >
> >IOWs, you have a workaround that keeps your production systems
> >running. That's fine for your machines that are running this load,
> >but it's not working well for any of the other loads I've
> >looked at.  That is, removing the throttling from the XFS inode
> >shrinker causes instability and adverse reclaim of the inode cache
> >in situations where maintaining a working set in memory is
> >required for performance.
> 
> We agree on all of this much more than not.  Josef has spent a lot
> of time recently on shrinkers (w/btrfs but the ideas are similar),
> and I'm wrapping duct tape around workloads until the overall
> architecture is less fragile.
> 
> Using slab for metadata in an FS like btrfs where dirty metadata is
> almost unbounded is a huge challenge in the current framework.  Ext4
> is moving to dramatically bigger logs, so it would eventually have
> the same problems.

Your 8TB XFS filesystems will be using 2GB logs (unless mkfs
settings were tweaked manually), so there's a huge amount of
metadata 15x8TB XFS filesystems can pin in memory, too...
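
To put a rough scale on that, here is a minimal sketch of the arithmetic
using only the figures from this thread (15 filesystems, 2GB logs, 130GB
of RAM). The helper names are made up for illustration. It only adds up
log capacity; how much memory the logged metadata actually pins depends
on the workload and is not modelled here.

```python
# Figures quoted in this thread: 15 filesystems, a 2GB log on each,
# and 130GB of RAM. The function names are hypothetical.
GIB = 1024 ** 3


def aggregate_log_bytes(num_filesystems: int, log_bytes_per_fs: int) -> int:
    """Total log capacity summed across all filesystems."""
    return num_filesystems * log_bytes_per_fs


def fraction_of_ram(nbytes: int, ram_bytes: int) -> float:
    """Express a byte count as a fraction of system RAM."""
    return nbytes / ram_bytes


total_log = aggregate_log_bytes(15, 2 * GIB)
print(f"aggregate log capacity: {total_log / GIB:.0f}GB")           # 30GB
print(f"relative to RAM: {fraction_of_ram(total_log, 130 * GIB):.1%}")  # 23.1%
```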

> >Indeed, one of the things I noticed with the simoop workload
> >running the shrinker patches is that it no longer kept either the
> >inode cache or the XFS metadata cache in memory long enough for the
> >du to run without requiring IO. i.e. the caches no longer maintained
> >the working set of objects needed to optimise a regular operation
> >and the du scans took a lot longer.
> 
> With simoop, du is supposed to do IO.  It's crazy to expect to be
> able to scan all the inodes on a huge FS (or 15 of them) and keep it
> all in cache along with everything else hadoop does.  I completely
> agree there are cases where having the working set in ram is valid,
> just simoop isn't one ;)

Sure, I was just pointing out that even simoop was seeing significant
changes in cache residency as a result of this change....

> >That's why removing the blocking from the shrinker causes the
> >overall work rate to go down - it results in the cache not
> >maintaining a working set of inodes and so increases the IO load and
> >that then slows everything down.
> 
> At least on my machines, it made the overall work rate go up.  Both
> simoop and prod are 10-15% faster. 

Ok, I'll see if I can tune the workload here to behave more like
this....
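
For reference, the numbers quoted further up (write clustering, and XFS
slab as a share of RAM on the two setups) can be cross-checked with a
short sketch like the one below. It uses only figures from this thread;
the helper names are made up for illustration.

```python
# Figures as quoted in this thread; helper names are illustrative only.
MIB = 1024 ** 2
GIB = 1024 ** 3


def inodes_per_write(dirty_inodes_written: int, buffer_writes: int) -> float:
    """Average number of dirty inodes captured by each inode buffer write."""
    return dirty_inodes_written / buffer_writes


def slab_fraction(slab_bytes: int, ram_bytes: int) -> float:
    """XFS slab usage as a fraction of total RAM."""
    return slab_bytes / ram_bytes


# 440,000 dirty inodes written in 20,000 inode buffer writes.
cluster = inodes_per_write(440_000, 20_000)
print(f"inodes per buffer write: {cluster:.0f}")  # 22

# 100-150MB of slab on 16GB versus 300-500MB of slab on 130GB.
small_box = [slab_fraction(mb * MIB, 16 * GIB) for mb in (100, 150)]
large_box = [slab_fraction(mb * MIB, 130 * GIB) for mb in (300, 500)]
print("16GB box:  " + " to ".join(f"{f:.2%}" for f in small_box))
print("130GB box: " + " to ".join(f"{f:.2%}" for f in large_box))
```

Even the low end of the 16GB range is a larger share of memory than the
high end of the 130GB range.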

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
