From: Dave Chinner <david@fromorbit.com>
To: Glauber Costa <glommer@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>, Mel Gorman <mgorman@suse.de>,
linux-mm@kvack.org, cgroups@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Greg Thelen <gthelen@google.com>,
kamezawa.hiroyu@jp.fujitsu.com, Michal Hocko <mhocko@suse.cz>,
Johannes Weiner <hannes@cmpxchg.org>,
linux-fsdevel@vger.kernel.org, hughd@google.com
Subject: Re: [PATCH v7 00/34] kmemcg shrinkers
Date: Wed, 22 May 2013 16:26:57 +1000
Message-ID: <20130522062657.GU24543@dastard>
In-Reply-To: <519B21D5.9090109@parallels.com>
On Tue, May 21, 2013 at 11:27:17AM +0400, Glauber Costa wrote:
> On 05/21/2013 11:18 AM, Dave Chinner wrote:
> > On Tue, May 21, 2013 at 11:03:33AM +0400, Glauber Costa wrote:
> >> On 05/20/2013 12:06 AM, Glauber Costa wrote:
> >>> Initial notes:
> >>> ==============
> >>>
> >>> Please pay attention to new patches that are debuting in this series. Patch 1
> >>> changes our unused counters from int to long, since Dave noticed that int
> >>> wasn't enough in some cases. Aside from that, the major change is that we now
> >>> compute and keep deferred work per-node (Patch 13). The biggest effect of this
> >>> is that, to avoid storing a new nodemask on the stack, I am passing only the
> >>> node id down to the API. This means that the lru API *does not* take a nodemask
> >>> any longer, which in turn makes it simpler.
> >>>
> >>> I deeply considered this matter, and decided this would be the best way to go.
> >>> It is no different from what I have already done for memcgs: only a single one
> >>> is passed down, and the complexity of scanning them is moved up to the
> >>> caller, where all the scanning logic should belong anyway.
> >>>
> >>> If you want, you can also grab from branch "kmemcg-lru-shrinker" at:
> >>>
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/glommer/memcg.git
> >>>
> >>> I hope the performance problems are all gone. My testing now shows a smoother
> >>> and steadier state for the objects during the lifetime of the workload, and
> >>> postmark numbers are closer to base, although we do deviate a bit.
> >>>
> >>
> >> Mel, Dave, et al.
> >>
> >> I have applied some more fixes for things I have found here and there as
> >> a result of a new round of testing. I won't post the result here until
> >> Thursday or Friday, to avoid patchbombing you guys. In the meantime I
> >> will be merging comments I receive from this version.
> >>
> >> My git tree is up to date, so if you want to test it further, please
> >> pick that up.
> >
> > Will do. I hope to do some testing of it tomorrow.
> >
> >> I am attaching the result of my postmark run. I think the results look
> >> really good now.
> >
> > What version and command line are you using? I'll see if I can
> > reproduce the same results on my test system....
> >
>
> I am using Mel's mmtests. So I cloned it, changed the config to run the
> postmark benchmark, set TEST_PARTITION to my disk and TEST_FILESYSTEM to
> ext3 (especially since fsmark was already running xfs with your script),
> and then ran ./run-mmtests.sh <name_of_test>
Well, I haven't got to running postmark yet, but so far the
behaviour of this version of the patch series on my usual benchmarks
is, well, damn good. Better than I've ever seen it, and I'd say the
big change is due to the per-node reclaim deferral.
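
For anyone who hasn't dug through the series, the shape of the
node-aware API looks roughly like this. This is a sketch from my
reading of the "mm: new shrinker API", "shrinker: add node
awareness" and "list_lru: per-node API" patches, with unrelated
fields and declarations trimmed - not the literal diff:

/* reclaim is driven by a single node id, not a nodemask */
struct shrink_control {
	gfp_t		gfp_mask;
	unsigned long	nr_to_scan;
	int		nid;
};

/* the count/scan pair replaces the old combined ->shrink() method */
struct shrinker {
	unsigned long (*count_objects)(struct shrinker *shrinker,
				       struct shrink_control *sc);
	unsigned long (*scan_objects)(struct shrinker *shrinker,
				      struct shrink_control *sc);
	int seeks;
	long batch;
};

/*
 * The lru API takes a bare node id; iterating over nodes is the
 * caller's problem, which is where the scanning logic belongs.
 */
unsigned long list_lru_count_node(struct list_lru *lru, int nid);
unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
				 list_lru_walk_cb isolate, void *cb_arg,
				 unsigned long *nr_to_walk);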
The overall cache balance and stability is as close to identical to
a single node machine(*) as I've ever seen for the workloads I've
run. It's a little more variable than for the single node test runs,
but the curves all have the same height, the same shape and the same
relative behaviour. Compared to 3.9 numa behaviour, they are worlds
apart.
(*) Same VM; the only difference is fake-numa=4 for the numa results.
In terms of performance, the difference is mostly within the
variance of the benchmarks, maybe ever so slightly faster. e.g.
for the fsmark workload I posted previously (create 50m zero-length
files, walk them, remove them), the numbers are:
             create    walk     remove
3.9          8m07s     5m42s    11m50s
3.10-lru     8m13s     5m29s    11m40s
Of note, under mixed page cache/slab workloads that generate memory
pressure, I am seeing slightly elevated system CPU time. Nothing in
the profiles shows up as being significantly different, so it may
just be that, because kswapd is not emptying the caches all the time,
more of the reclaim work is being accounted to the user processes...
There is one problem I've found, however, that I haven't got to the
bottom of yet. During concurrent inode read workloads (find, grep,
etc.) I'm getting a hang in the XFS inode cache. It's finding an xfs
inode in the XFS inode cache, but when it tries to grab the VFS
inode, that fails. This means we have an XFS inode without the
XFS_IRECLAIMABLE flag set but with I_FREEING|I_WILL_FREE set on the
VFS inode. That means the VFS has let go of the inode and destroyed
it, but XFS doesn't think that it has been destroyed. It's stuck in
limbo.
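
To illustrate where it gets stuck, the cache-hit lookup path does
roughly this. It's heavily simplified from fs/xfs/xfs_icache.c -
locking, tracing and the real recycle code are trimmed, and
recycle_inode() is just a placeholder:

static int
xfs_iget_cache_hit(struct xfs_inode *ip)	/* args trimmed */
{
	struct inode	*inode = VFS_I(ip);

	/* inode is in flux - back off and retry the lookup */
	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM))
		return EAGAIN;

	/* the VFS is done with it - reinitialise and reuse it */
	if (ip->i_flags & XFS_IRECLAIMABLE)
		return recycle_inode(ip);	/* placeholder */

	/*
	 * The hang: the VFS inode is I_FREEING|I_WILL_FREE so igrab()
	 * fails, but XFS_IRECLAIMABLE was never set, so we never take
	 * the recycle path above and just keep retrying the lookup
	 * forever.
	 */
	if (!igrab(inode))
		return EAGAIN;

	return 0;
}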
This isn't easily reproducible, so it might take me a while to track
down. However....
.... I went looking at the xfs inode reclaim code, and realised
there's something missing from the overall patch set. The patch that
does node-aware reclaim of the XFS inode cache is missing, as well
as the followup that cleans up the mess that is no longer needed.
I'll port these patches forward from my old guilt tree that contains
them, and post them once I have them working. I'll also try to get
to the bottom of whatever strangeness is causing this hang...
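
Roughly, those patches plumb the node id from the superblock
callbacks down into XFS inode reclaim, along these lines. This is a
sketch against the node-aware super_operations from the "fs: convert
inode and dentry shrinking to be node aware" patch; the real work is
making the XFS reclaim walk per-node rather than global:

static long
xfs_fs_free_cached_objects(
	struct super_block	*sb,
	long			nr_to_scan,
	int			nid)
{
	/*
	 * Today this ignores nid and reclaims inodes from all nodes;
	 * the missing patch confines the walk to inodes backed by
	 * the node being shrunk.
	 */
	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
}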
But, overall, the system is behaving very well, and so from an
infrastructure perspective I think the patch set is in pretty good
shape. Nice work, Glauber. :)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com