From: Glauber Costa <glommer@parallels.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-kernel@vger.kernel.org, xfs@oss.sgi.com,
linux-mm@kvack.org, Ying Han <yinghan@google.com>,
Suleiman Souhlal <suleiman@google.com>,
linux-fsdevel@vger.kernel.org, Greg Thelen <gthelen@google.com>
Subject: Re: [PATCH 09/19] list_lru: per-node list infrastructure
Date: Thu, 17 Jan 2013 10:21:12 -0800 [thread overview]
Message-ID: <50F84118.7030608@parallels.com> (raw)
In-Reply-To: <20130117042245.GG2498@dastard>
>> Deepest fears:
>>
>> 1) snakes.
>
> Snakes are merely poisonous. Drop Bears are far more dangerous :P
>
fears are irrational anyway...
>> 2) It won't surprise you to know that I am adapting your work, which
>> provides a very sane and helpful API, to memcg shrinking.
>>
>> The dumb and simple approach in there is to copy all lrus that are
>> marked memcg aware at memcg creation time. The API is kept the same,
>> but when you do something like list_lru_add(lru, obj), for instance, we
>> derive the memcg context from obj and relay it to the right list.
>
> At which point, you don't want the overhead of per-node lists.
>
This is one of the assumptions we may have to end up doing here.
> The problem I see here is that someone might still need the
> scalability of the per-node lists. If someone runs a large memcg in terms
> of CPU and memory, then we most definitely are going to need to
> retain per-node lists regardless of the fact that the workload is
> running in a constrained environment. And if you are running a mix
> of large and small containers, then one static solution is not going
> to cut it for some workload.
>
Yes, this is spot on. As a first approach, I think single-node lists
will do.
> This is a problem that superblock contexts don't care about - they
> are global by their very nature. Hence I'm wondering if trying to
> fit these two very different behaviours into the one LRU list is
> the wrong approach.
>
I am not that much concerned about that, honestly. I like the API, and I
like the fact that it allow me to have the subsystems using it
transparently, just by referring to the "master" lru (the dentry, inode,
etc). It reduces complexity to reuse the data structures, but that is
not paramount.
However, a more flexible data structure in which we could select at
least at creation time if we want per-node lists or not, would be quite
helpful.
And it seems it would be at least moderately helpful to you as well for
usually-small filesystems, so I think it would be a good addition to
next version.
> Consider this: these patches give us a generic LRU list structure.
> It currently uses a list_head in each object for indexing, and we
> are talking about single LRU lists because of this limitation and
> trying to build infrastructure that can support this indexing
> mechanism.
>
> I think that all of thses problems go away if we replace the
> list_head index in the object with a "struct lru_item" index. To
> start with, it's just a s/list_head/lru_item/ changeover, but from
> there we can expand.
>
> What I'm getting at is that we want to have multiple axis of
> tracking and reclaim, but we only have a single axis for tracking.
> If the lru_item grew a second list_head called "memcg_lru", then
> suddenly the memcg LRUs can be maintained separately to the global
> (per-superblock) LRU. i.e.:
>
> struct lru_item {
> struct list_head global_list;
> struct list_head memcg_list;
> }
>
I may be misunderstanding you, but that is not how I see it. Your global
list AFAIU, is more like a hook to keep the lists together. The actual
accesses to it are controlled by a parent structure, like the
super-block, which in turns, embeds a shrinker.
So we get (in the sb case), from shrinker to sb, and from sb to dentry
list (or inode). We never care about the global list head.
>From this point on, we "entered" the LRU, but we still don't know which
list to reclaim from: there is one list per node, and we need to figure
out which is our target, based on the flags.
This list selection mechanism is where I am usually hooking memcg: and
for the same way you are using an array - given a node, you want fast
access to the underlying list - so am I. Given the memcg context, I want
to get to the corresponding memcg list.
Now, in my earliest implementations, the memcg would still take me to a
node-wide array, and an extra level would be required. We seem to agree
that (at least as a starting point) getting rid of this extra level, so
the memcg colapses all objects in the same list would provide decent
behavior in most cases, while still keeping the footprint manageable. So
that is what I am pursuing at the moment.
> And then you can use whatever tracking structure you want for a
> memcg LRU list. Indeed, this would allow per-node lists for the
> global LRU, and whatever LRU type is appropriate for the memcg using
> the object (e.g. single list for small memcgs, per-node for large
> memcgs)....
>
Yes. You can have either a small number of big memcgs or a big number of
small memcgs. So if we adopt selectively per-node scalability, our
memory usage is always bounded by #memcgs x avg_size. It works perfectly.
> i.e. rather than trying to make the infrastructure jump through hoops
> to only have one LRU index per object, have a second index that
> allows memcg's to have a separate LRU index and a method for the
> global LRU structure to find them. This woul dallow memcg specific
> shrinker callouts switch to the memcg LRU rather than the global LRU
> and operate on that. That way we still only instantiate a single
> LRU/shrinker pair per cache context, but the memcg code doesn't need
> to duplicate the entire LRU infrastructure into every memcg that
> contains that type of object for that cache context....
>
> /me stops rambling....
>
>> Your current suggestion of going per-node only in the performance
>> critical filesystems could also possibly work, provided this count is
>> expected to be small.
>
> The problem is deciding on a per filesystem basis. I was thinking
> that all filesytsems of a specific type would use a particular type
> of structure, not that specific instances of a filesystem could use
> different types....
>
Yes, I referred to "critical" as an attribute of the fs class as well.
Most of the disk filesystems (if not all) would use per-node lists, and
I would guess most of the pseudo fs would do fine with a single-node.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2013-01-17 18:21 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-11-27 23:14 [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Dave Chinner
2012-11-27 23:14 ` [PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu counters Dave Chinner
2012-11-27 23:14 ` [PATCH 02/19] dentry: move to per-sb LRU locks Dave Chinner
2012-11-27 23:14 ` [PATCH 03/19] dcache: remove dentries from LRU before putting on dispose list Dave Chinner
2012-11-27 23:14 ` [PATCH 04/19] mm: new shrinker API Dave Chinner
2012-11-27 23:14 ` [PATCH 05/19] shrinker: convert superblock shrinkers to new API Dave Chinner
2012-12-20 11:06 ` Glauber Costa
2012-12-21 1:46 ` Dave Chinner
2012-12-21 10:17 ` Glauber Costa
2012-11-27 23:14 ` [PATCH 06/19] list: add a new LRU list type Dave Chinner
2012-11-28 16:10 ` Christoph Hellwig
2012-11-27 23:14 ` [PATCH 07/19] inode: convert inode lru list to generic lru list code Dave Chinner
2012-11-27 23:14 ` [PATCH 08/19] dcache: convert to use new lru list infrastructure Dave Chinner
2012-11-27 23:14 ` [PATCH 09/19] list_lru: per-node " Dave Chinner
2012-12-20 11:21 ` Glauber Costa
2012-12-21 1:54 ` Dave Chinner
2013-01-16 19:21 ` Glauber Costa
2013-01-16 22:55 ` Dave Chinner
2013-01-17 0:35 ` Glauber Costa
2013-01-17 4:22 ` Dave Chinner
2013-01-17 18:21 ` Glauber Costa [this message]
2013-01-18 0:10 ` Dave Chinner
2013-01-18 0:14 ` Glauber Costa
2013-01-18 8:11 ` Dave Chinner
2013-01-18 19:10 ` Glauber Costa
2013-01-19 0:10 ` Dave Chinner
2013-01-19 0:13 ` Glauber Costa
2013-01-18 0:51 ` Glauber Costa
2013-01-18 8:08 ` Dave Chinner
2013-01-18 19:01 ` Glauber Costa
2012-11-27 23:14 ` [PATCH 10/19] shrinker: add node awareness Dave Chinner
2012-11-27 23:14 ` [PATCH 11/19] fs: convert inode and dentry shrinking to be node aware Dave Chinner
2012-11-27 23:14 ` [PATCH 12/19] xfs: convert buftarg LRU to generic code Dave Chinner
2012-11-27 23:14 ` [PATCH 13/19] xfs: Node aware direct inode reclaim Dave Chinner
2012-11-27 23:14 ` [PATCH 14/19] xfs: use generic AG walk for background " Dave Chinner
2012-11-27 23:14 ` [PATCH 15/19] xfs: convert dquot cache lru to list_lru Dave Chinner
2012-11-28 16:17 ` Christoph Hellwig
2012-11-27 23:14 ` [PATCH 16/19] fs: convert fs shrinkers to new scan/count API Dave Chinner
2012-11-27 23:14 ` [PATCH 17/19] drivers: convert shrinkers to new count/scan API Dave Chinner
2012-11-28 1:13 ` Chris Wilson
2012-11-28 3:17 ` Dave Chinner
2012-11-28 8:21 ` Glauber Costa
2012-11-28 21:28 ` Dave Chinner
2012-11-29 10:29 ` Glauber Costa
2012-11-29 22:02 ` Dave Chinner
2013-06-07 13:37 ` Konrad Rzeszutek Wilk
2012-11-27 23:14 ` [PATCH 18/19] shrinker: convert remaining shrinkers to " Dave Chinner
2012-11-27 23:14 ` [PATCH 19/19] shrinker: Kill old ->shrink API Dave Chinner
2012-11-29 19:02 ` [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers Andi Kleen
2012-11-29 22:09 ` Dave Chinner
2012-12-20 11:45 ` Glauber Costa
2012-12-21 2:50 ` Dave Chinner
2012-12-21 10:41 ` Glauber Costa
2013-01-21 16:08 ` Glauber Costa
2013-01-21 23:21 ` Dave Chinner
2013-01-23 14:36 ` Glauber Costa
2013-01-23 23:46 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=50F84118.7030608@parallels.com \
--to=glommer@parallels.com \
--cc=david@fromorbit.com \
--cc=gthelen@google.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=suleiman@google.com \
--cc=xfs@oss.sgi.com \
--cc=yinghan@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).