From: Andrea Arcangeli <aarcange@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Pavel Emelyanov <xemul@parallels.com>,
	Hugh Dickins <hughd@google.com>, Nick Piggin <npiggin@kernel.dk>,
	Rik van Riel <riel@redhat.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)
Date: Sat, 18 Jun 2011 15:30:38 +0200	[thread overview]
Message-ID: <20110618133038.GE3238@redhat.com> (raw)
In-Reply-To: <20110507000108.GH26837@dastard>

Hi everyone,

I would suggest re-submitting the first few locking improvements that
are independent of the per-container dentry limit. Bumping the seqlock
when there's no modification to the structure is unnecessary; the
change looks nice and we don't want it lost if it's a valid
micro-optimization. The size of the patchset left to discuss would
shrink too ;).
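
To illustrate the micro-optimization I mean, here is a minimal sketch
with made-up names (it is not the code from the posted "vfs: Lighten
r/o rename_lock lockers" patch): the writer only bumps the sequence
count when it actually modifies the protected structure, so lockless
readers are not forced to retry by writers that turn out to be no-ops.

#include <linux/seqlock.h>
#include <linux/spinlock.h>

struct foo {				/* made-up example structure */
	spinlock_t	lock;		/* serializes writers */
	seqcount_t	seq;		/* lets lockless readers detect changes */
	int		val;
};

static void foo_set(struct foo *f, int new_val)
{
	spin_lock(&f->lock);
	if (f->val != new_val) {	/* only a real change invalidates readers */
		write_seqcount_begin(&f->seq);
		f->val = new_val;
		write_seqcount_end(&f->seq);
	}
	spin_unlock(&f->lock);
}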

On Sat, May 07, 2011 at 10:01:08AM +1000, Dave Chinner wrote:
> They aren't immediately reclaimable - they are all still pinned by
> the VFS inode (L2) cache, and will be dirtied by having to truncate
> away speculative allocation beyond EOF when the VFS inode cache
> frees them. So there is IO required on all of those inodes before
> they can be reclaimed. That's why the caches have ended up with this
> size ratio, and that's from a long running, steady-state workload.
> Controlling the dentry cache size won't help reduce that inode cache
> size one bit on such workloads....

Certainly opening a flood of inodes, changing some attribute and
writing one page to disk while reusing the same dentry wouldn't have a
very nice effect, but from the container point of view it would still
be better than an unlimited number of simultaneously pinned inodes,
which makes a DoS far too easy. Maybe the next step would be some
other logic to limit the number of dirty inodes a container can keep
open. Waiting on inode writeback, pagecache writeback and shrinkage
during open(2) wouldn't even have to fail with an error; it would just
wait, which is more graceful than the effect of too many dentries.
That would likely be a lot more complex than a dentry limit though...
so if that is the next thing to expect, we should take the added
complexity into account now.
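
Something like the following is what I mean by waiting in open(2)
instead of failing; it is just a sketch, and every name in it is
hypothetical (nothing like it exists in the posted series):

#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/wait.h>

struct container_info {			/* hypothetical per-container state */
	atomic_t		nr_dirty_pinned;	/* dirty inodes pinned by this container */
	int			dirty_pinned_limit;
	wait_queue_head_t	dirty_wait;		/* woken as inodes get cleaned */
};

/* Called from open(2): wait for writeback instead of returning -ENOMEM. */
static int container_throttle_dirty_inodes(struct container_info *ct)
{
	while (atomic_read(&ct->nr_dirty_pinned) > ct->dirty_pinned_limit) {
		container_kick_writeback(ct);	/* hypothetical per-container flusher hook */
		if (wait_event_killable(ct->dirty_wait,
				atomic_read(&ct->nr_dirty_pinned) <=
					ct->dirty_pinned_limit))
			return -ERESTARTSYS;	/* only a fatal signal aborts the open */
	}
	return 0;
}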

Overall, -ENOMEM failures from d_alloc in open(2) aren't so nice for
apps, which is why I'm not so fond of container virtualization
compared to a virtualization where the guest manages its own memory:
no DoS issue like this one can ever materialize for the host, and no
added complexity is required on the host side. The container approach
will never be as reliable as a guest OS at avoiding these issues, so
maybe we shouldn't complain that this solution isn't perfect for the
inode cache when it clearly helps their usage.

> > > global lists and locks for LRU, shrinker and mob management is the
> > > opposite direction we are taking - we want to make the LRUs more
> > > fine-grained and more closely related to the MM structures,
> > > shrinkers confined to per-sb context (no more lifecycle issues,
> > > ever) and operate per-node/-zone rather than globally, etc.  It
> > > seems to me that this containerisation will make much of that work
> > > difficult to achieve effectively because it doesn't take any of this
> > > ongoing scalability work into account.
> > 
> > Two things from my side on this:
> > 
> > 1. Can you be more specific on this - which parts of VFS suffer from the
> > LRU being global?
> 
> Performance. It doesn't scale beyond a few CPUs before lock
> contention becomes the limiting factor.

The global vs per-zone/per-node dentry LRU question seems an
interesting point. We probably have two different points of view here
that pull development in two different directions because of
different production objectives and priorities. This won't be easy to
get an agreement on.

Maybe I remember wrong, but I seem to recall Nick was proposing to
split the VFS LRUs per-zone/per-node too, and Christoph was against
it (or maybe it was the other way around :). But it wasn't discussed
in a container context and I don't remember exactly what the cons
were. A global LRU usually provides better LRU behavior, while
splitting it arbitrarily among zones/nodes tends to be a scalability
boost; if we only see the split as a lock-scalability improvement it
becomes a tradeoff between better LRU information and better
scalability, and then we could just as well split the LRU without
regard for the actual zone/node sizes. It is much better to split
LRUs on zone/node boundaries when there's a real need to shrink
specific zones/nodes from a reclaim point of view, not just to get
better scalability when taking the lock.

We obviously use that zone/node split in the pagecache LRU, and when
we have to shrink a single node it clearly helps with more than just
lock scalability, so per-node LRUs are certainly needed in NUMA
setups with HARDWALL NUMA pins: they save a ton of CPU and avoid
global LRU churning as well. So maybe a per-zone/node LRU would
provide similar benefits for the VFS caches and would indeed be the
right direction from the MM point of view (ignoring this very issue
of containers). Right now we do blind VFS shrinks when we could do
selective zone/node ones, since the caller of the VFS shrinker
already has the zone/node information. Maybe whoever was against it
(regardless of this container dentry-limit discussion) should point
out what the cons are.
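
As a rough sketch of what a per-node VFS LRU could look like (field
and function names invented for this mail, and ignoring the locking
subtleties of the real dcache): the LRU a dentry goes on is picked
from the node its memory lives on, so the shrinker can target exactly
the node that is under pressure.

#include <linux/dcache.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/nodemask.h>
#include <linux/spinlock.h>

struct dcache_node_lru {
	spinlock_t		lock;
	struct list_head	list;		/* unused dentries resident on this node */
	long			nr_items;
};

static struct dcache_node_lru dcache_lru[MAX_NUMNODES];

static void dentry_lru_add_node(struct dentry *dentry)
{
	int nid = page_to_nid(virt_to_page(dentry));
	struct dcache_node_lru *lru = &dcache_lru[nid];

	spin_lock(&lru->lock);
	list_add_tail(&dentry->d_lru, &lru->list);
	lru->nr_items++;
	spin_unlock(&lru->lock);
}

/* Reclaim from one node only, instead of churning a global list. */
static long dentry_lru_shrink_node(int nid, long nr_to_scan)
{
	struct dcache_node_lru *lru = &dcache_lru[nid];
	LIST_HEAD(dispose);
	long moved = 0;

	spin_lock(&lru->lock);
	while (nr_to_scan-- > 0 && !list_empty(&lru->list)) {
		struct dentry *d = list_first_entry(&lru->list,
						    struct dentry, d_lru);
		list_move(&d->d_lru, &dispose);
		lru->nr_items--;
		moved++;
	}
	spin_unlock(&lru->lock);

	dispose_dentry_list(&dispose);	/* hypothetical: drop them with proper dcache locking */
	return moved;
}

With something along these lines the zone/node-aware reclaim caller
could pass down the node it is reclaiming from instead of asking for
a blind global scan.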

> I never implied quotas were for limiting cache usage. I only
> suggested they were the solution to your DOS example by preventing
> unbound numbers of inodes from being created by an unprivileged
> user.
> 
> To me, it sounds like you overprovision your servers and then
> have major troubles when everyone tries to use what you supplied
> them with simultaneously. There is a simple solution to that. ;)
> Otherwise, I think you need to directly limit the size of the inode
> caches, not try to do it implicitly via 2nd and 3rd order side
> effects of controlling the size of the dentry cache.

They want to limit the amount of simultaneously pinned kernel RAM
structures, while still allowing a huge number of files in the
filesystem to keep life simple during installs etc... So you can
untar a backup of whatever size into the container regardless of
quotas, and if only a part of the unpacked data is used by the apps
(the common case) it just works. Again, I don't think the objective
is perfect accounting, just something that happens to work better; if
one wants perfect accounting of the memory and bytes used by the
on-disk image, there are other types of virtualization available.
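
To make the kind of limit concrete, here is a hypothetical sketch
(not the accounting the posted "mob" patches actually implement):
charge each new dentry to its container, try a targeted prune of that
container's LRU when the limit is hit, and only then fail the
allocation.

#include <linux/atomic.h>
#include <linux/errno.h>

struct dcache_container {		/* hypothetical per-container accounting */
	atomic_long_t	nr_dentries;	/* dentries currently pinning kernel RAM */
	long		dentry_limit;
};

/* Called from d_alloc(): charge the new dentry or push back on the container. */
static int dcache_container_charge(struct dcache_container *ct)
{
	if (atomic_long_inc_return(&ct->nr_dentries) <= ct->dentry_limit)
		return 0;

	prune_container_dcache(ct, 128);	/* hypothetical prune of this container's LRU */
	if (atomic_long_read(&ct->nr_dentries) <= ct->dentry_limit)
		return 0;

	atomic_long_dec(&ct->nr_dentries);
	return -ENOMEM;			/* this is the failure open(2) would see */
}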

Yet another approach would be to account for how much kernel data
each process keeps pinned and unfreeable, and add that to the process
RAM in the OOM-killer decision. But that wouldn't be a hard
per-container limit, and it sounds way too CPU costly to account for
the VFS-pinned RAM every time somebody calls open() or chdir(); it
would require counting too many things up toward the dentry root.


Thread overview: 23+ messages
2011-05-03 12:14 [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Pavel Emelyanov
2011-05-03 12:15 ` [PATCH 1/13] vfs: Lighten r/o rename_lock lockers Pavel Emelyanov
2011-05-03 12:15 ` [PATCH 2/13] vfs: Factor out rename_lock locking Pavel Emelyanov
2011-05-03 12:16 ` [PATCH 3/13] vfs: Make the rename_lock per-sb Pavel Emelyanov
2011-05-03 12:16 ` [PATCH 4/13] vfs: Factor out tree (of four) shrinkers code Pavel Emelyanov
2011-05-03 12:17 ` [PATCH 5/13] vfs: Make dentry LRU list global Pavel Emelyanov
2011-05-03 12:17 ` [PATCH 6/13] vfs: Turn the nr_dentry into percpu_counter Pavel Emelyanov
2011-05-03 12:18 ` [PATCH 7/13] vfs: Limit the number of dentries globally Pavel Emelyanov
2011-05-03 12:18 ` [PATCH 8/13] vfs: Introduce the dentry mobs Pavel Emelyanov
2011-06-18 13:40   ` Andrea Arcangeli
2011-05-03 12:18 ` [PATCH 9/13] vfs: More than one mob management Pavel Emelyanov
2011-05-03 12:19 ` [PATCH 10/13] vfs: Routnes for setting mob size and getting stats Pavel Emelyanov
2011-05-03 12:19 ` [PATCH 11/13] vfs: Make shrink_dcache_memory prune dcache from all mobs Pavel Emelyanov
2011-05-03 12:20 ` [PATCH 12/13] vfs: Mobs creation and mgmt API Pavel Emelyanov
2011-05-03 12:20 ` [PATCH 13/13] vfs: Dentry mobs listing in proc Pavel Emelyanov
2011-05-06  1:05 ` [RFC][PATCH 0/13] Per-container dcache management (and a bit more) Dave Chinner
2011-05-06 12:15   ` Pavel Emelyanov
2011-05-07  0:01     ` Dave Chinner
2011-05-10 11:18       ` Pavel Emelyanov
2011-06-18 13:30       ` Andrea Arcangeli [this message]
2011-06-20  0:49         ` Dave Chinner
2011-07-04  5:32           ` Pavel Emelyanov
2011-05-23  6:43 ` Pavel Emelyanov
