Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Vladimir Davydov <vdavydov@parallels.com>
To: Dave Chinner <david@fromorbit.com>
Cc: hannes@cmpxchg.org, mhocko@suse.cz, dchinner@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org, devel@openvz.org,
	glommer@openvz.org, Mel Gorman <mgorman@suse.de>,
	Rik van Riel <riel@redhat.com>, Al Viro <viro@zeniv.linux.org.uk>,
	Balbir Singh <bsingharora@gmail.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
Date: Tue, 3 Dec 2013 16:15:57 +0400	[thread overview]
Message-ID: <529DCB7D.10205@parallels.com> (raw)
In-Reply-To: <20131203104849.GD8803@dastard>

On 12/03/2013 02:48 PM, Dave Chinner wrote:
>> @@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>  		return 0;
>>  
>>  	/*
>> -	 * copy the current shrinker scan count into a local variable
>> -	 * and zero it so that other concurrent shrinker invocations
>> -	 * don't also do this scanning work.
>> +	 * Do not touch global counter of deferred objects on memcg pressure to
>> +	 * avoid isolation issues. Ideally the counter should be per-memcg.
>>  	 */
>> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>> +	if (!shrinkctl->target_mem_cgroup) {
>> +		/*
>> +		 * copy the current shrinker scan count into a local variable
>> +		 * and zero it so that other concurrent shrinker invocations
>> +		 * don't also do this scanning work.
>> +		 */
>> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>> +	}
> That's ugly. Effectively it means that memcg reclaim is going to be
> completely ineffective when large numbers of allocations and hence
> reclaim attempts are done under GFP_NOFS context.
>
> The only thing that keeps filesystem caches in balance when there is
> lots of filesystem work going on (i.e. lots of GFP_NOFS allocations)
> is the deferal of reclaim work to a context that can do something
> about it.

Imagine the situation: a memcg issues a GFP_NOFS allocation and goes to
shrink_slab() where it defers them to the global counter; then another
memcg issues a GFP_KERNEL allocation, also goes to shrink_slab() where
it sees a huge number of deferred objects and starts shrinking them,
which is not good IMHO. I understand that nr_deferred is necessary, but
I think it should be per-memcg. What do you think about moving it to
list_lru?

>>  	total_scan = nr;
>>  	delta = (4 * fraction) / shrinker->seeks;
>> @@ -296,21 +302,46 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>  		cond_resched();
>>  	}
>>  
>> -	/*
>> -	 * move the unused scan count back into the shrinker in a
>> -	 * manner that handles concurrent updates. If we exhausted the
>> -	 * scan, there is no need to do an update.
>> -	 */
>> -	if (total_scan > 0)
>> -		new_nr = atomic_long_add_return(total_scan,
>> +	if (!shrinkctl->target_mem_cgroup) {
>> +		/*
>> +		 * move the unused scan count back into the shrinker in a
>> +		 * manner that handles concurrent updates. If we exhausted the
>> +		 * scan, there is no need to do an update.
>> +		 */
>> +		if (total_scan > 0)
>> +			new_nr = atomic_long_add_return(total_scan,
>>  						&shrinker->nr_deferred[nid]);
>> -	else
>> -		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>> +		else
>> +			new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>> +	}
> So, if the memcg can't make progress, why wouldn't you defer the
> work to the global scan? Or can't a global scan trim memcg LRUs?
> And if it can't, then isn't that a major design flaw? Why not just
> allow kswapd to walk memcg LRUs in the background?
>
> /me just looked at patch 13
>
> Yeah, this goes some way to explaining why something like patch 13
> is necessary - slab shrinkers are not keeping up with page cache
> reclaim because of GFP_NOFS allocations, and so the page cache
> empties only leaving slab caches to be trimmed....
>
>
>> +static unsigned long
>> +shrink_slab_memcg(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>> +		  unsigned long fraction, unsigned long denominator)
> what's this function got to do with memcgs? Why did you rename it
> from the self explanitory shrink_slab_one() name that Glauber gave
> it?

When I sent the previous version, Johannes Weiner disliked the name that
was why I renamed it, now you don't like the new name and ask for the
old one :-) But why do you think that shrink_slab_one() is
self-explanatory while shrink_slab_memcg() is not? I mean
shrink_slab_memcg() means "shrink slab accounted to a memcg" just like
shrink_slab_node() means "shrink slab on the node" while seeing
shrink_slab_one() I would ask "one what?".

>> +{
>> +	unsigned long freed = 0;
>> +
>> +	if (shrinkctl->memcg && !memcg_kmem_is_active(shrinkctl->memcg))
>> +		return 0;
> Why here? why not check that in the caller where memcg's are being
> iterated?

No problem, I'll move it.

>> +
>> +	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
>> +		if (!node_online(shrinkctl->nid))
>> +			continue;
>> +
>> +		if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
>> +		    (shrinkctl->nid != 0))
>> +			break;
> Hmmm - this looks broken. Nothing guarantees that node 0 in
> shrinkctl->nodes_to_scan is ever set, so non-numa aware shrinkers
> will do nothing when the first node in the mask is not set. For non-numa
> aware shrinkers, the shrinker should always be called once with a
> node id of 0.

That's how it operates now - this patch simply moved this piece from
shrink_slab(). I'll fix this.

> That's what earlier versions of the numa aware shrinker patch set
> did, and it seems to have been lost along the way.  Yeah, there's
> the last version from Glauber's tree that I saw:
>
> static unsigned long
> shrink_slab_one(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>                unsigned long nr_pages_scanned, unsigned long lru_pages)
> {
>        unsigned long freed = 0;
>
>        if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
>                shrinkctl->nid = 0;
>
>                return shrink_slab_node(shrinkctl, shrinker,
>                         nr_pages_scanned, lru_pages,
>                         &shrinker->nr_deferred);
>        }
>
>        for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan)
>
>                if (!node_online(shrinkctl->nid))
>                        continue;
>
>                freed += shrink_slab_node(shrinkctl, shrinker,
>                         nr_pages_scanned, lru_pages,
> 			 &shrinker->nr_deferred_node[shrinkctl->nid]);
>        }
>
>        return freed;
> }
>
> So, that's likely to be another reason that all the non-numa slab
> caches are not being shrunk appropriately and need to be hit with a
> bit hammer...
>
>> @@ -352,18 +383,23 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
>>  	}
>>  
>>  	list_for_each_entry(shrinker, &shrinker_list, list) {
>> -		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
>> -			if (!node_online(shrinkctl->nid))
>> -				continue;
>> -
>> -			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
>> -			    (shrinkctl->nid != 0))
>> +		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
>> +		do {
>> +			if (!(shrinker->flags & SHRINKER_MEMCG_AWARE) &&
>> +			    (shrinkctl->memcg != NULL)) {
>> +				mem_cgroup_iter_break(
>> +						shrinkctl->target_mem_cgroup,
>> +						shrinkctl->memcg);
>>  				break;
>> +			}
>>  
>> -			freed += shrink_slab_node(shrinkctl, shrinker,
>> -						  fraction, denominator);
>> +			freed += shrink_slab_memcg(shrinkctl, shrinker,
>> +						   fraction, denominator);
>> +			shrinkctl->memcg = mem_cgroup_iter(
>> +						shrinkctl->target_mem_cgroup,
>> +						shrinkctl->memcg, NULL);
>> +		} while (shrinkctl->memcg);
> Glauber's tree also had a bunch of comments explaining what was
> going on here. I've got no idea what the hell this code is doing,
> and why the hell we are iterating memcgs here and how and why the
> normal, non-memcg scan and shrinkers still worked.

I found this code straightforward, just like the shrink_zone(), which
also lacks comments, but I admit I was wrong and I'll try to improve.

> This is now just a bunch of memcg gobbledegook with no explanations
> to tell us what it is supposed to be doing. Comments are important -
> you might not think they are necessary, but seeing comments like
> this:
>
> +               /*
> +                * In a hierarchical chain, it might be that not all memcgs are
> +                * kmem active. kmemcg design mandates that when one memcg is
> +                * active, its children will be active as well. But it is
> +                * perfectly possible that its parent is not.
> +                *
> +                * We also need to make sure we scan at least once, for the
> +                * global case. So if we don't have a target memcg (saved in
> +                * root), we proceed normally and expect to break in the next
> +                * round.
> +                */
>
> in Glauber's tree helped an awful lot to explain the mess that the
> memcg stuff was making of the code...
>
> I'm liking this patch set less and less as I work my way through
> it...

Will try to improve.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2013-12-03 12:16 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 01/18] memcg: make cache index determination more robust Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 02/18] memcg: consolidate callers of memcg_cache_id Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 03/18] memcg: move initialization to memcg creation Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 04/18] memcg: move several kmemcg functions upper Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path Vladimir Davydov
2013-12-03  9:00   ` Dave Chinner
2013-12-03  9:23     ` Vladimir Davydov
2013-12-03 13:37       ` Al Viro
2013-12-03 13:48         ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 06/18] vmscan: rename shrink_slab() args to make it more generic Vladimir Davydov
2013-12-03  9:33   ` Dave Chinner
2013-12-03  9:44     ` Vladimir Davydov
2013-12-03 10:04       ` Dave Chinner
2013-12-02 11:19 ` [PATCH v12 07/18] vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 08/18] vmscan: do_try_to_free_pages(): remove shrink_control argument Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 09/18] vmscan: shrink slab on memcg pressure Vladimir Davydov
2013-12-03 10:48   ` Dave Chinner
2013-12-03 12:15     ` Vladimir Davydov [this message]
2013-12-04  4:51       ` Dave Chinner
2013-12-04  6:31         ` Vladimir Davydov
2013-12-05  5:01           ` Dave Chinner
2013-12-05  6:57             ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 10/18] memcg,list_lru: add per-memcg LRU list infrastructure Vladimir Davydov
2013-12-03 11:18   ` Dave Chinner
2013-12-03 12:29     ` Vladimir Davydov
2013-12-05 21:19       ` Dave Chinner
2013-12-02 11:19 ` [PATCH v12 11/18] memcg,list_lru: add function walking over all lists of a per-memcg LRU Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 12/18] fs: make icache, dcache shrinkers memcg-aware Vladimir Davydov
2013-12-03 11:45   ` Dave Chinner
2013-12-03 12:34     ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 13/18] memcg: per-memcg kmem shrinking Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 14/18] vmscan: take at least one pass with shrinkers Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 15/18] memcg: allow kmem limit to be resized down Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 16/18] vmpressure: in-kernel notifications Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 17/18] memcg: reap dead memcgs upon global memory pressure Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 18/18] memcg: flush memcg items upon memcg destruction Vladimir Davydov
2013-12-02 11:22 ` [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=529DCB7D.10205@parallels.com \
    --to=vdavydov@parallels.com \
    --cc=akpm@linux-foundation.org \
    --cc=bsingharora@gmail.com \
    --cc=cgroups@vger.kernel.org \
    --cc=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=devel@openvz.org \
    --cc=glommer@openvz.org \
    --cc=hannes@cmpxchg.org \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=riel@redhat.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).