All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vladimir Davydov <vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
To: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Cc: hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org,
	mhocko-AlSwsSmVLrQ@public.gmane.org,
	dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devel-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org,
	glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org,
	Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>,
	Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>,
	Balbir Singh
	<bsingharora-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	KAMEZAWA Hiroyuki
	<kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Subject: Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
Date: Tue, 3 Dec 2013 16:15:57 +0400	[thread overview]
Message-ID: <529DCB7D.10205@parallels.com> (raw)
In-Reply-To: <20131203104849.GD8803@dastard>

On 12/03/2013 02:48 PM, Dave Chinner wrote:
>> @@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>  		return 0;
>>  
>>  	/*
>> -	 * copy the current shrinker scan count into a local variable
>> -	 * and zero it so that other concurrent shrinker invocations
>> -	 * don't also do this scanning work.
>> +	 * Do not touch global counter of deferred objects on memcg pressure to
>> +	 * avoid isolation issues. Ideally the counter should be per-memcg.
>>  	 */
>> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>> +	if (!shrinkctl->target_mem_cgroup) {
>> +		/*
>> +		 * copy the current shrinker scan count into a local variable
>> +		 * and zero it so that other concurrent shrinker invocations
>> +		 * don't also do this scanning work.
>> +		 */
>> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>> +	}
> That's ugly. Effectively it means that memcg reclaim is going to be
> completely ineffective when large numbers of allocations and hence
> reclaim attempts are done under GFP_NOFS context.
>
> The only thing that keeps filesystem caches in balance when there is
> lots of filesystem work going on (i.e. lots of GFP_NOFS allocations)
> is the deferal of reclaim work to a context that can do something
> about it.

Imagine the situation: a memcg issues a GFP_NOFS allocation and goes to
shrink_slab() where it defers them to the global counter; then another
memcg issues a GFP_KERNEL allocation, also goes to shrink_slab() where
it sees a huge number of deferred objects and starts shrinking them,
which is not good IMHO. I understand that nr_deferred is necessary, but
I think it should be per-memcg. What do you think about moving it to
list_lru?

>>  	total_scan = nr;
>>  	delta = (4 * fraction) / shrinker->seeks;
>> @@ -296,21 +302,46 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>  		cond_resched();
>>  	}
>>  
>> -	/*
>> -	 * move the unused scan count back into the shrinker in a
>> -	 * manner that handles concurrent updates. If we exhausted the
>> -	 * scan, there is no need to do an update.
>> -	 */
>> -	if (total_scan > 0)
>> -		new_nr = atomic_long_add_return(total_scan,
>> +	if (!shrinkctl->target_mem_cgroup) {
>> +		/*
>> +		 * move the unused scan count back into the shrinker in a
>> +		 * manner that handles concurrent updates. If we exhausted the
>> +		 * scan, there is no need to do an update.
>> +		 */
>> +		if (total_scan > 0)
>> +			new_nr = atomic_long_add_return(total_scan,
>>  						&shrinker->nr_deferred[nid]);
>> -	else
>> -		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>> +		else
>> +			new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>> +	}
> So, if the memcg can't make progress, why wouldn't you defer the
> work to the global scan? Or can't a global scan trim memcg LRUs?
> And if it can't, then isn't that a major design flaw? Why not just
> allow kswapd to walk memcg LRUs in the background?
>
> /me just looked at patch 13
>
> Yeah, this goes some way to explaining why something like patch 13
> is necessary - slab shrinkers are not keeping up with page cache
> reclaim because of GFP_NOFS allocations, and so the page cache
> empties only leaving slab caches to be trimmed....
>
>
>> +static unsigned long
>> +shrink_slab_memcg(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>> +		  unsigned long fraction, unsigned long denominator)
> what's this function got to do with memcgs? Why did you rename it
> from the self explanitory shrink_slab_one() name that Glauber gave
> it?

When I sent the previous version, Johannes Weiner disliked the name that
was why I renamed it, now you don't like the new name and ask for the
old one :-) But why do you think that shrink_slab_one() is
self-explanatory while shrink_slab_memcg() is not? I mean
shrink_slab_memcg() means "shrink slab accounted to a memcg" just like
shrink_slab_node() means "shrink slab on the node" while seeing
shrink_slab_one() I would ask "one what?".

>> +{
>> +	unsigned long freed = 0;
>> +
>> +	if (shrinkctl->memcg && !memcg_kmem_is_active(shrinkctl->memcg))
>> +		return 0;
> Why here? why not check that in the caller where memcg's are being
> iterated?

No problem, I'll move it.

>> +
>> +	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
>> +		if (!node_online(shrinkctl->nid))
>> +			continue;
>> +
>> +		if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
>> +		    (shrinkctl->nid != 0))
>> +			break;
> Hmmm - this looks broken. Nothing guarantees that node 0 in
> shrinkctl->nodes_to_scan is ever set, so non-numa aware shrinkers
> will do nothing when the first node in the mask is not set. For non-numa
> aware shrinkers, the shrinker should always be called once with a
> node id of 0.

That's how it operates now - this patch simply moved this piece from
shrink_slab(). I'll fix this.

> That's what earlier versions of the numa aware shrinker patch set
> did, and it seems to have been lost along the way.  Yeah, there's
> the last version from Glauber's tree that I saw:
>
> static unsigned long
> shrink_slab_one(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>                unsigned long nr_pages_scanned, unsigned long lru_pages)
> {
>        unsigned long freed = 0;
>
>        if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
>                shrinkctl->nid = 0;
>
>                return shrink_slab_node(shrinkctl, shrinker,
>                         nr_pages_scanned, lru_pages,
>                         &shrinker->nr_deferred);
>        }
>
>        for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan)
>
>                if (!node_online(shrinkctl->nid))
>                        continue;
>
>                freed += shrink_slab_node(shrinkctl, shrinker,
>                         nr_pages_scanned, lru_pages,
> 			 &shrinker->nr_deferred_node[shrinkctl->nid]);
>        }
>
>        return freed;
> }
>
> So, that's likely to be another reason that all the non-numa slab
> caches are not being shrunk appropriately and need to be hit with a
> bit hammer...
>
>> @@ -352,18 +383,23 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
>>  	}
>>  
>>  	list_for_each_entry(shrinker, &shrinker_list, list) {
>> -		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
>> -			if (!node_online(shrinkctl->nid))
>> -				continue;
>> -
>> -			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
>> -			    (shrinkctl->nid != 0))
>> +		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
>> +		do {
>> +			if (!(shrinker->flags & SHRINKER_MEMCG_AWARE) &&
>> +			    (shrinkctl->memcg != NULL)) {
>> +				mem_cgroup_iter_break(
>> +						shrinkctl->target_mem_cgroup,
>> +						shrinkctl->memcg);
>>  				break;
>> +			}
>>  
>> -			freed += shrink_slab_node(shrinkctl, shrinker,
>> -						  fraction, denominator);
>> +			freed += shrink_slab_memcg(shrinkctl, shrinker,
>> +						   fraction, denominator);
>> +			shrinkctl->memcg = mem_cgroup_iter(
>> +						shrinkctl->target_mem_cgroup,
>> +						shrinkctl->memcg, NULL);
>> +		} while (shrinkctl->memcg);
> Glauber's tree also had a bunch of comments explaining what was
> going on here. I've got no idea what the hell this code is doing,
> and why the hell we are iterating memcgs here and how and why the
> normal, non-memcg scan and shrinkers still worked.

I found this code straightforward, just like the shrink_zone(), which
also lacks comments, but I admit I was wrong and I'll try to improve.

> This is now just a bunch of memcg gobbledegook with no explanations
> to tell us what it is supposed to be doing. Comments are important -
> you might not think they are necessary, but seeing comments like
> this:
>
> +               /*
> +                * In a hierarchical chain, it might be that not all memcgs are
> +                * kmem active. kmemcg design mandates that when one memcg is
> +                * active, its children will be active as well. But it is
> +                * perfectly possible that its parent is not.
> +                *
> +                * We also need to make sure we scan at least once, for the
> +                * global case. So if we don't have a target memcg (saved in
> +                * root), we proceed normally and expect to break in the next
> +                * round.
> +                */
>
> in Glauber's tree helped an awful lot to explain the mess that the
> memcg stuff was making of the code...
>
> I'm liking this patch set less and less as I work my way through
> it...

Will try to improve.

Thanks.

WARNING: multiple messages have this Message-ID (diff)
From: Vladimir Davydov <vdavydov@parallels.com>
To: Dave Chinner <david@fromorbit.com>
Cc: hannes@cmpxchg.org, mhocko@suse.cz, dchinner@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org, devel@openvz.org,
	glommer@openvz.org, Mel Gorman <mgorman@suse.de>,
	Rik van Riel <riel@redhat.com>, Al Viro <viro@zeniv.linux.org.uk>,
	Balbir Singh <bsingharora@gmail.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
Date: Tue, 3 Dec 2013 16:15:57 +0400	[thread overview]
Message-ID: <529DCB7D.10205@parallels.com> (raw)
In-Reply-To: <20131203104849.GD8803@dastard>

On 12/03/2013 02:48 PM, Dave Chinner wrote:
>> @@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>  		return 0;
>>  
>>  	/*
>> -	 * copy the current shrinker scan count into a local variable
>> -	 * and zero it so that other concurrent shrinker invocations
>> -	 * don't also do this scanning work.
>> +	 * Do not touch global counter of deferred objects on memcg pressure to
>> +	 * avoid isolation issues. Ideally the counter should be per-memcg.
>>  	 */
>> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>> +	if (!shrinkctl->target_mem_cgroup) {
>> +		/*
>> +		 * copy the current shrinker scan count into a local variable
>> +		 * and zero it so that other concurrent shrinker invocations
>> +		 * don't also do this scanning work.
>> +		 */
>> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>> +	}
> That's ugly. Effectively it means that memcg reclaim is going to be
> completely ineffective when large numbers of allocations and hence
> reclaim attempts are done under GFP_NOFS context.
>
> The only thing that keeps filesystem caches in balance when there is
> lots of filesystem work going on (i.e. lots of GFP_NOFS allocations)
> is the deferal of reclaim work to a context that can do something
> about it.

Imagine the situation: a memcg issues a GFP_NOFS allocation and goes to
shrink_slab() where it defers them to the global counter; then another
memcg issues a GFP_KERNEL allocation, also goes to shrink_slab() where
it sees a huge number of deferred objects and starts shrinking them,
which is not good IMHO. I understand that nr_deferred is necessary, but
I think it should be per-memcg. What do you think about moving it to
list_lru?

>>  	total_scan = nr;
>>  	delta = (4 * fraction) / shrinker->seeks;
>> @@ -296,21 +302,46 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>  		cond_resched();
>>  	}
>>  
>> -	/*
>> -	 * move the unused scan count back into the shrinker in a
>> -	 * manner that handles concurrent updates. If we exhausted the
>> -	 * scan, there is no need to do an update.
>> -	 */
>> -	if (total_scan > 0)
>> -		new_nr = atomic_long_add_return(total_scan,
>> +	if (!shrinkctl->target_mem_cgroup) {
>> +		/*
>> +		 * move the unused scan count back into the shrinker in a
>> +		 * manner that handles concurrent updates. If we exhausted the
>> +		 * scan, there is no need to do an update.
>> +		 */
>> +		if (total_scan > 0)
>> +			new_nr = atomic_long_add_return(total_scan,
>>  						&shrinker->nr_deferred[nid]);
>> -	else
>> -		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>> +		else
>> +			new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>> +	}
> So, if the memcg can't make progress, why wouldn't you defer the
> work to the global scan? Or can't a global scan trim memcg LRUs?
> And if it can't, then isn't that a major design flaw? Why not just
> allow kswapd to walk memcg LRUs in the background?
>
> /me just looked at patch 13
>
> Yeah, this goes some way to explaining why something like patch 13
> is necessary - slab shrinkers are not keeping up with page cache
> reclaim because of GFP_NOFS allocations, and so the page cache
> empties only leaving slab caches to be trimmed....
>
>
>> +static unsigned long
>> +shrink_slab_memcg(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>> +		  unsigned long fraction, unsigned long denominator)
> what's this function got to do with memcgs? Why did you rename it
> from the self explanitory shrink_slab_one() name that Glauber gave
> it?

When I sent the previous version, Johannes Weiner disliked the name that
was why I renamed it, now you don't like the new name and ask for the
old one :-) But why do you think that shrink_slab_one() is
self-explanatory while shrink_slab_memcg() is not? I mean
shrink_slab_memcg() means "shrink slab accounted to a memcg" just like
shrink_slab_node() means "shrink slab on the node" while seeing
shrink_slab_one() I would ask "one what?".

>> +{
>> +	unsigned long freed = 0;
>> +
>> +	if (shrinkctl->memcg && !memcg_kmem_is_active(shrinkctl->memcg))
>> +		return 0;
> Why here? why not check that in the caller where memcg's are being
> iterated?

No problem, I'll move it.

>> +
>> +	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
>> +		if (!node_online(shrinkctl->nid))
>> +			continue;
>> +
>> +		if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
>> +		    (shrinkctl->nid != 0))
>> +			break;
> Hmmm - this looks broken. Nothing guarantees that node 0 in
> shrinkctl->nodes_to_scan is ever set, so non-numa aware shrinkers
> will do nothing when the first node in the mask is not set. For non-numa
> aware shrinkers, the shrinker should always be called once with a
> node id of 0.

That's how it operates now - this patch simply moved this piece from
shrink_slab(). I'll fix this.

> That's what earlier versions of the numa aware shrinker patch set
> did, and it seems to have been lost along the way.  Yeah, there's
> the last version from Glauber's tree that I saw:
>
> static unsigned long
> shrink_slab_one(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>                unsigned long nr_pages_scanned, unsigned long lru_pages)
> {
>        unsigned long freed = 0;
>
>        if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
>                shrinkctl->nid = 0;
>
>                return shrink_slab_node(shrinkctl, shrinker,
>                         nr_pages_scanned, lru_pages,
>                         &shrinker->nr_deferred);
>        }
>
>        for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan)
>
>                if (!node_online(shrinkctl->nid))
>                        continue;
>
>                freed += shrink_slab_node(shrinkctl, shrinker,
>                         nr_pages_scanned, lru_pages,
> 			 &shrinker->nr_deferred_node[shrinkctl->nid]);
>        }
>
>        return freed;
> }
>
> So, that's likely to be another reason that all the non-numa slab
> caches are not being shrunk appropriately and need to be hit with a
> bit hammer...
>
>> @@ -352,18 +383,23 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
>>  	}
>>  
>>  	list_for_each_entry(shrinker, &shrinker_list, list) {
>> -		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
>> -			if (!node_online(shrinkctl->nid))
>> -				continue;
>> -
>> -			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
>> -			    (shrinkctl->nid != 0))
>> +		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
>> +		do {
>> +			if (!(shrinker->flags & SHRINKER_MEMCG_AWARE) &&
>> +			    (shrinkctl->memcg != NULL)) {
>> +				mem_cgroup_iter_break(
>> +						shrinkctl->target_mem_cgroup,
>> +						shrinkctl->memcg);
>>  				break;
>> +			}
>>  
>> -			freed += shrink_slab_node(shrinkctl, shrinker,
>> -						  fraction, denominator);
>> +			freed += shrink_slab_memcg(shrinkctl, shrinker,
>> +						   fraction, denominator);
>> +			shrinkctl->memcg = mem_cgroup_iter(
>> +						shrinkctl->target_mem_cgroup,
>> +						shrinkctl->memcg, NULL);
>> +		} while (shrinkctl->memcg);
> Glauber's tree also had a bunch of comments explaining what was
> going on here. I've got no idea what the hell this code is doing,
> and why the hell we are iterating memcgs here and how and why the
> normal, non-memcg scan and shrinkers still worked.

I found this code straightforward, just like the shrink_zone(), which
also lacks comments, but I admit I was wrong and I'll try to improve.

> This is now just a bunch of memcg gobbledegook with no explanations
> to tell us what it is supposed to be doing. Comments are important -
> you might not think they are necessary, but seeing comments like
> this:
>
> +               /*
> +                * In a hierarchical chain, it might be that not all memcgs are
> +                * kmem active. kmemcg design mandates that when one memcg is
> +                * active, its children will be active as well. But it is
> +                * perfectly possible that its parent is not.
> +                *
> +                * We also need to make sure we scan at least once, for the
> +                * global case. So if we don't have a target memcg (saved in
> +                * root), we proceed normally and expect to break in the next
> +                * round.
> +                */
>
> in Glauber's tree helped an awful lot to explain the mess that the
> memcg stuff was making of the code...
>
> I'm liking this patch set less and less as I work my way through
> it...

Will try to improve.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Vladimir Davydov <vdavydov@parallels.com>
To: Dave Chinner <david@fromorbit.com>
Cc: <hannes@cmpxchg.org>, <mhocko@suse.cz>, <dchinner@redhat.com>,
	<akpm@linux-foundation.org>, <linux-kernel@vger.kernel.org>,
	<linux-mm@kvack.org>, <cgroups@vger.kernel.org>,
	<devel@openvz.org>, <glommer@openvz.org>,
	Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Balbir Singh <bsingharora@gmail.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
Date: Tue, 3 Dec 2013 16:15:57 +0400	[thread overview]
Message-ID: <529DCB7D.10205@parallels.com> (raw)
In-Reply-To: <20131203104849.GD8803@dastard>

On 12/03/2013 02:48 PM, Dave Chinner wrote:
>> @@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>  		return 0;
>>  
>>  	/*
>> -	 * copy the current shrinker scan count into a local variable
>> -	 * and zero it so that other concurrent shrinker invocations
>> -	 * don't also do this scanning work.
>> +	 * Do not touch global counter of deferred objects on memcg pressure to
>> +	 * avoid isolation issues. Ideally the counter should be per-memcg.
>>  	 */
>> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>> +	if (!shrinkctl->target_mem_cgroup) {
>> +		/*
>> +		 * copy the current shrinker scan count into a local variable
>> +		 * and zero it so that other concurrent shrinker invocations
>> +		 * don't also do this scanning work.
>> +		 */
>> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>> +	}
> That's ugly. Effectively it means that memcg reclaim is going to be
> completely ineffective when large numbers of allocations and hence
> reclaim attempts are done under GFP_NOFS context.
>
> The only thing that keeps filesystem caches in balance when there is
> lots of filesystem work going on (i.e. lots of GFP_NOFS allocations)
> is the deferal of reclaim work to a context that can do something
> about it.

Imagine the situation: a memcg issues a GFP_NOFS allocation and goes to
shrink_slab() where it defers them to the global counter; then another
memcg issues a GFP_KERNEL allocation, also goes to shrink_slab() where
it sees a huge number of deferred objects and starts shrinking them,
which is not good IMHO. I understand that nr_deferred is necessary, but
I think it should be per-memcg. What do you think about moving it to
list_lru?

>>  	total_scan = nr;
>>  	delta = (4 * fraction) / shrinker->seeks;
>> @@ -296,21 +302,46 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>  		cond_resched();
>>  	}
>>  
>> -	/*
>> -	 * move the unused scan count back into the shrinker in a
>> -	 * manner that handles concurrent updates. If we exhausted the
>> -	 * scan, there is no need to do an update.
>> -	 */
>> -	if (total_scan > 0)
>> -		new_nr = atomic_long_add_return(total_scan,
>> +	if (!shrinkctl->target_mem_cgroup) {
>> +		/*
>> +		 * move the unused scan count back into the shrinker in a
>> +		 * manner that handles concurrent updates. If we exhausted the
>> +		 * scan, there is no need to do an update.
>> +		 */
>> +		if (total_scan > 0)
>> +			new_nr = atomic_long_add_return(total_scan,
>>  						&shrinker->nr_deferred[nid]);
>> -	else
>> -		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>> +		else
>> +			new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>> +	}
> So, if the memcg can't make progress, why wouldn't you defer the
> work to the global scan? Or can't a global scan trim memcg LRUs?
> And if it can't, then isn't that a major design flaw? Why not just
> allow kswapd to walk memcg LRUs in the background?
>
> /me just looked at patch 13
>
> Yeah, this goes some way to explaining why something like patch 13
> is necessary - slab shrinkers are not keeping up with page cache
> reclaim because of GFP_NOFS allocations, and so the page cache
> empties only leaving slab caches to be trimmed....
>
>
>> +static unsigned long
>> +shrink_slab_memcg(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>> +		  unsigned long fraction, unsigned long denominator)
> what's this function got to do with memcgs? Why did you rename it
> from the self explanitory shrink_slab_one() name that Glauber gave
> it?

When I sent the previous version, Johannes Weiner disliked the name that
was why I renamed it, now you don't like the new name and ask for the
old one :-) But why do you think that shrink_slab_one() is
self-explanatory while shrink_slab_memcg() is not? I mean
shrink_slab_memcg() means "shrink slab accounted to a memcg" just like
shrink_slab_node() means "shrink slab on the node" while seeing
shrink_slab_one() I would ask "one what?".

>> +{
>> +	unsigned long freed = 0;
>> +
>> +	if (shrinkctl->memcg && !memcg_kmem_is_active(shrinkctl->memcg))
>> +		return 0;
> Why here? why not check that in the caller where memcg's are being
> iterated?

No problem, I'll move it.

>> +
>> +	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
>> +		if (!node_online(shrinkctl->nid))
>> +			continue;
>> +
>> +		if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
>> +		    (shrinkctl->nid != 0))
>> +			break;
> Hmmm - this looks broken. Nothing guarantees that node 0 in
> shrinkctl->nodes_to_scan is ever set, so non-numa aware shrinkers
> will do nothing when the first node in the mask is not set. For non-numa
> aware shrinkers, the shrinker should always be called once with a
> node id of 0.

That's how it operates now - this patch simply moved this piece from
shrink_slab(). I'll fix this.

> That's what earlier versions of the numa aware shrinker patch set
> did, and it seems to have been lost along the way.  Yeah, there's
> the last version from Glauber's tree that I saw:
>
> static unsigned long
> shrink_slab_one(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>                unsigned long nr_pages_scanned, unsigned long lru_pages)
> {
>        unsigned long freed = 0;
>
>        if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
>                shrinkctl->nid = 0;
>
>                return shrink_slab_node(shrinkctl, shrinker,
>                         nr_pages_scanned, lru_pages,
>                         &shrinker->nr_deferred);
>        }
>
>        for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan)
>
>                if (!node_online(shrinkctl->nid))
>                        continue;
>
>                freed += shrink_slab_node(shrinkctl, shrinker,
>                         nr_pages_scanned, lru_pages,
> 			 &shrinker->nr_deferred_node[shrinkctl->nid]);
>        }
>
>        return freed;
> }
>
> So, that's likely to be another reason that all the non-numa slab
> caches are not being shrunk appropriately and need to be hit with a
> bit hammer...
>
>> @@ -352,18 +383,23 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
>>  	}
>>  
>>  	list_for_each_entry(shrinker, &shrinker_list, list) {
>> -		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
>> -			if (!node_online(shrinkctl->nid))
>> -				continue;
>> -
>> -			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
>> -			    (shrinkctl->nid != 0))
>> +		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
>> +		do {
>> +			if (!(shrinker->flags & SHRINKER_MEMCG_AWARE) &&
>> +			    (shrinkctl->memcg != NULL)) {
>> +				mem_cgroup_iter_break(
>> +						shrinkctl->target_mem_cgroup,
>> +						shrinkctl->memcg);
>>  				break;
>> +			}
>>  
>> -			freed += shrink_slab_node(shrinkctl, shrinker,
>> -						  fraction, denominator);
>> +			freed += shrink_slab_memcg(shrinkctl, shrinker,
>> +						   fraction, denominator);
>> +			shrinkctl->memcg = mem_cgroup_iter(
>> +						shrinkctl->target_mem_cgroup,
>> +						shrinkctl->memcg, NULL);
>> +		} while (shrinkctl->memcg);
> Glauber's tree also had a bunch of comments explaining what was
> going on here. I've got no idea what the hell this code is doing,
> and why the hell we are iterating memcgs here and how and why the
> normal, non-memcg scan and shrinkers still worked.

I found this code straightforward, just like the shrink_zone(), which
also lacks comments, but I admit I was wrong and I'll try to improve.

> This is now just a bunch of memcg gobbledegook with no explanations
> to tell us what it is supposed to be doing. Comments are important -
> you might not think they are necessary, but seeing comments like
> this:
>
> +               /*
> +                * In a hierarchical chain, it might be that not all memcgs are
> +                * kmem active. kmemcg design mandates that when one memcg is
> +                * active, its children will be active as well. But it is
> +                * perfectly possible that its parent is not.
> +                *
> +                * We also need to make sure we scan at least once, for the
> +                * global case. So if we don't have a target memcg (saved in
> +                * root), we proceed normally and expect to break in the next
> +                * round.
> +                */
>
> in Glauber's tree helped an awful lot to explain the mess that the
> memcg stuff was making of the code...
>
> I'm liking this patch set less and less as I work my way through
> it...

Will try to improve.

Thanks.

  reply	other threads:[~2013-12-03 12:15 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
2013-12-02 11:19 ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 01/18] memcg: make cache index determination more robust Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 02/18] memcg: consolidate callers of memcg_cache_id Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 03/18] memcg: move initialization to memcg creation Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 04/18] memcg: move several kmemcg functions upper Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-03  9:00   ` Dave Chinner
2013-12-03  9:00     ` Dave Chinner
2013-12-03  9:23     ` Vladimir Davydov
2013-12-03  9:23       ` Vladimir Davydov
2013-12-03  9:23       ` Vladimir Davydov
2013-12-03 13:37       ` Al Viro
2013-12-03 13:37         ` Al Viro
2013-12-03 13:48         ` Vladimir Davydov
2013-12-03 13:48           ` Vladimir Davydov
2013-12-03 13:48           ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 06/18] vmscan: rename shrink_slab() args to make it more generic Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-03  9:33   ` Dave Chinner
2013-12-03  9:33     ` Dave Chinner
2013-12-03  9:44     ` Vladimir Davydov
2013-12-03  9:44       ` Vladimir Davydov
2013-12-03  9:44       ` Vladimir Davydov
2013-12-03 10:04       ` Dave Chinner
2013-12-03 10:04         ` Dave Chinner
2013-12-02 11:19 ` [PATCH v12 07/18] vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 08/18] vmscan: do_try_to_free_pages(): remove shrink_control argument Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 09/18] vmscan: shrink slab on memcg pressure Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-03 10:48   ` Dave Chinner
2013-12-03 10:48     ` Dave Chinner
2013-12-03 12:15     ` Vladimir Davydov [this message]
2013-12-03 12:15       ` Vladimir Davydov
2013-12-03 12:15       ` Vladimir Davydov
2013-12-04  4:51       ` Dave Chinner
2013-12-04  4:51         ` Dave Chinner
2013-12-04  6:31         ` Vladimir Davydov
2013-12-04  6:31           ` Vladimir Davydov
2013-12-04  6:31           ` Vladimir Davydov
2013-12-05  5:01           ` Dave Chinner
2013-12-05  5:01             ` Dave Chinner
2013-12-05  6:57             ` Vladimir Davydov
2013-12-05  6:57               ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 10/18] memcg,list_lru: add per-memcg LRU list infrastructure Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
     [not found]   ` <73d7942f31ac80dfa53bbdd0f957ce5e9a301958.1385974612.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-12-03 11:18     ` Dave Chinner
2013-12-03 11:18       ` Dave Chinner
2013-12-03 11:18       ` Dave Chinner
2013-12-03 12:29       ` Vladimir Davydov
2013-12-03 12:29         ` Vladimir Davydov
2013-12-03 12:29         ` Vladimir Davydov
2013-12-05 21:19         ` Dave Chinner
2013-12-05 21:19           ` Dave Chinner
2013-12-02 11:19 ` [PATCH v12 11/18] memcg,list_lru: add function walking over all lists of a per-memcg LRU Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 12/18] fs: make icache, dcache shrinkers memcg-aware Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-03 11:45   ` Dave Chinner
2013-12-03 11:45     ` Dave Chinner
2013-12-03 12:34     ` Vladimir Davydov
2013-12-03 12:34       ` Vladimir Davydov
2013-12-03 12:34       ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 13/18] memcg: per-memcg kmem shrinking Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 14/18] vmscan: take at least one pass with shrinkers Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 15/18] memcg: allow kmem limit to be resized down Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 16/18] vmpressure: in-kernel notifications Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 17/18] memcg: reap dead memcgs upon global memory pressure Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 18/18] memcg: flush memcg items upon memcg destruction Vladimir Davydov
2013-12-02 11:19   ` Vladimir Davydov
2013-12-02 11:22 ` [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
2013-12-02 11:22   ` Vladimir Davydov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=529DCB7D.10205@parallels.com \
    --to=vdavydov-bzqdu9zft3wakbo8gow8eq@public.gmane.org \
    --cc=akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org \
    --cc=bsingharora-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    --cc=cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org \
    --cc=dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=devel-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org \
    --cc=glommer-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org \
    --cc=hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org \
    --cc=kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org \
    --cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org \
    --cc=mgorman-l3A5Bk7waGM@public.gmane.org \
    --cc=mhocko-AlSwsSmVLrQ@public.gmane.org \
    --cc=riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.