From: Glauber Costa <glommer@parallels.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>,
<linux-kernel@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>,
<linux-mm@kvack.org>, <xfs@oss.sgi.com>
Subject: Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API
Date: Wed, 28 Nov 2012 12:21:54 +0400
Message-ID: <50B5C9A2.6000408@parallels.com>
In-Reply-To: <20121128031719.GR6434@dastard>
On 11/28/2012 07:17 AM, Dave Chinner wrote:
> On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <david@fromorbit.com> wrote:
>>> +/*
>>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
>>> + *
>>> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
>>> + * size is PAGE_SIZE.
>>
>> No, purge() expects a count of pages to be freed. Each pass of the
>> shrinker therefore tries to free a minimum of 128 pages.
>
> Ah, I got the shifts mixed up. I'd been looking at way too much crap
> already when I saw this. But the fact this can be misunderstood says
> something about the level of documentation that the code has (i.e.
> none).
>
>>> The shrinker doesn't work on bytes - it works on
>>> + * *objects*.
>>
>> And I thought you were reviewing the shrinker API to be useful where a
>> single object may range between 4K and 4G.
>
> Which requires rewriting all the algorithms to not be dependent on
> the subsystems using a fixed size object. The shrinker control
> function is called shrink_slab() for a reason - it was expected to
> be used to shrink caches of fixed sized objects allocated from slab
> memory.
>
> It has no concept of the amount of memory that each object consumes,
> just an idea of how much *IO* it takes to replace the object in
> memory once it's been reclaimed. The DEFAULT_SEEKS is designed to
> encode the fact it generally takes 2 IOs to replace either a LRU
> page or a filesystem slab object, and so balances the scanning based
> on that value. i.e. the shrinker algorithms are solidly based around
> fixed sized objects that have some relationship to the cost of
> physical IO operations to replace them in the cache.
One nit: it shouldn't take 2 IOs to replace a slab object, right? That
should be the cost of allocating a new page, which can contain multiple
objects.
Once the page is in, a new object should be quite cheap to come up with.
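
To make the seeks-based balancing quoted above concrete, here is a rough
userspace sketch. It is assumed to follow the proportional calculation
that shrink_slab() used in kernels of this period; the names scan_target
and SKETCH_DEFAULT_SEEKS are hypothetical, and batching, overflow checks
and the carried-over scan count are left out.

```c
#include <stdint.h>

/* Assumption: mirrors the kernel's DEFAULT_SEEKS value of 2. */
#define SKETCH_DEFAULT_SEEKS 2

/*
 * Hypothetical helper: how many cache objects to scan, given how many
 * LRU pages were just scanned out of lru_pages total. A higher seeks
 * value (objects that are more expensive to recreate) lowers the
 * pressure put on the cache.
 */
uint64_t scan_target(uint64_t nr_pages_scanned, uint64_t lru_pages,
		     uint64_t cache_objects, unsigned int seeks)
{
	uint64_t delta;

	if (seeks == 0)
		return 0;

	delta = (4 * nr_pages_scanned) / seeks;
	delta *= cache_objects;
	/* +1 avoids dividing by zero when there are no LRU pages. */
	return delta / (lru_pages + 1);
}
```

Note that nothing in this calculation knows how large an object is or
how many objects share a page, which is the point of the nit above.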
This is a very wild thought, but now that I am diving deep in the
shrinker API, and seeing things like this:
	if (reclaim_state) {
		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
		reclaim_state->reclaimed_slab = 0;
	}
I am becoming more convinced that we should have a page-based mechanism,
like the rest of vmscan.
Also, if we are seeing pressure from someone requesting user pages, what
good does it do to free, say, 35 MB of memory, if this means we are
freeing objects across 5k different pages, without actually releasing
any of them? (Whether this is a theoretical problem or a practical one
is still TBD.) It would maybe be better to free objects that are
moderately hot, but are on pages dominated by cold objects...
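
The concern can be illustrated with a toy model. The sketch below is not
kernel code: it assumes nr_pages slab pages that each hold objs_per_page
equally sized objects, all initially in use, and compares how many pages
become completely empty when the same number of objects is freed
scattered across pages versus page by page. Both function names are
hypothetical.

```c
#include <stddef.h>

/*
 * Free nr_free objects round-robin, one object per page in turn.
 * Returns the number of pages that end up completely empty.
 */
size_t pages_released_scattered(size_t nr_pages, size_t objs_per_page,
				size_t nr_free)
{
	size_t total, per_page, extra;

	if (nr_pages == 0 || objs_per_page == 0)
		return 0;

	total = nr_pages * objs_per_page;
	if (nr_free > total)
		nr_free = total;

	/* Every page loses per_page objects; the first 'extra' pages lose one more. */
	per_page = nr_free / nr_pages;
	extra = nr_free % nr_pages;

	if (per_page >= objs_per_page)
		return nr_pages;
	if (per_page + 1 == objs_per_page)
		return extra;
	return 0;
}

/*
 * Free nr_free objects page by page, emptying one page before
 * touching the next. Returns the number of completely empty pages.
 */
size_t pages_released_clustered(size_t nr_pages, size_t objs_per_page,
				size_t nr_free)
{
	size_t total;

	if (nr_pages == 0 || objs_per_page == 0)
		return 0;

	total = nr_pages * objs_per_page;
	if (nr_free > total)
		nr_free = total;

	return nr_free / objs_per_page;
}
```

With 5000 pages of 16 objects each, freeing half of all objects in the
scattered pattern releases no pages in this model, while the clustered
pattern releases 2500.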
>
> The API change is the first step in the path to removing these built
> in assumptions. The current API is just insane and any attempt to
> build on it is going to be futile.
Amen, brother!
> The way I see this developing is
> this:
>
> - make the shrink_slab count -> scan algorithm per node
>
Pages are per-node.
> - add information about size of objects in the cache for
> fixed size object caches.
> - the shrinker now has some idea of how many objects
> need to be freed to be able to free a page of
> memory, as well as the relative penalty for
> replacing them.
This is still guesswork; telling the shrinker how many pages it should
free could be a better idea.
> - tells the shrinker the size of the cache
> in bytes so overall memory footprint of the caches
> can be taken into account
> - add new count and scan operations for caches that are
> based on memory used, not object counts
> - allows us to use the same count/scan algorithm for
> calculating how much pressure to put on caches
> with variable size objects.
IOW, pages.
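
As a sketch of what "pages" as the common unit could look like, the
helper below reports a cache's footprint in pages whether it holds
fixed-size objects or tracks bytes used directly. The struct, its
fields, and SKETCH_PAGE_SIZE are hypothetical, and the fixed-size case
ignores slab packing and per-page overhead, so it is only a lower bound.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, for illustration only. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical description of a cache, not a kernel structure. */
struct cache_footprint {
	uint64_t nr_objects;	/* number of cached objects */
	uint64_t object_size;	/* bytes per object, 0 if variable size */
	uint64_t bytes_used;	/* total bytes, used when object_size == 0 */
};

/* Footprint in pages, rounded up. */
uint64_t footprint_pages(const struct cache_footprint *c)
{
	uint64_t bytes = c->object_size ?
		c->nr_objects * c->object_size : c->bytes_used;

	return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

Either kind of cache then yields a number that can be compared directly
against page-based reclaim targets.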
> My care factor mostly ends here, as it will allow XFS to correctly
> balance the metadata buffer cache (variable size objects) against the
> inode, dentry and dquot caches which are object based. The next
> steps that I'm about to give you are based on some discussions with
> some MM people over bottles of red wine, so take it with a grain of
> salt...
>
> - calculate a "pressure" value for each cache controlled by a
> shrinker so that the relative memory pressure between
> caches can be compared. This allows the shrinkers to bias
> reclaim based on where the memory pressure is being
> generated
>
OK, if a cache is using a lot of memory, this would indicate it has the
dominant workload, right? Should we free from it, or should we free from
the others, so this one gets the pages it needs?
> - start grouping shrinkers into a hierarchy, allowing
> related shrinkers (e.g. all the caches in a memcg) to be
> shrunk according to resource limits that can be placed on the
> group. i.e. memory pressure is proportioned across
> groups rather than many individual shrinkers.
>
Pages are already grouped like that!
> - comments have been made to the extent that with generic
> per-node lists and a node aware shrinker, all of the page
> scanning could be driven by the shrinker infrastructure,
> rather than the shrinkers being driven by how many pages
> in the page cache just got scanned for reclaim.
>
> IOWs, the main memory reclaim algorithm walks all the
> shrinkers groups to calculate overall memory pressure,
> calculate how much reclaim is necessary, and then
> proportion reclaim across all the shrinker groups. i.e.
> everything is a shrinker.
>
> This patch set is really just the start of a long process. Balance
> between the page cache and VFS/filesystem shrinkers is critical to
> the efficient operation of the OS under many, many workloads, so I'm
> not about to change more than one little thing at a time. This API
> change is just one little step. You'll get what you want eventually,
> but you're not going to get it as a first step.
>
I have to note again that this is my first *serious* look at the
problem... but this is a summary of what I got, that fits in the context
of this particular discussion =)
I still have to go through all your other patches...
But one thing we seem to agree on is that we have quite a long road ahead.