Re: [PATCH 7/8] memcg static scan reclaim for asyncrhonous reclaim

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrew Morton <akpm@linux-foundation.org>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>,
	"balbir@linux.vnet.ibm.com" <balbir@linux.vnet.ibm.com>,
	Ying Han <yinghan@google.com>,
	hannes@cmpxchg.org, Michal Hocko <mhocko@suse.cz>
Subject: Re: [PATCH 7/8] memcg static scan reclaim for asyncrhonous reclaim
Date: Fri, 20 May 2011 14:50:08 -0700	[thread overview]
Message-ID: <20110520145008.1ea51f41.akpm@linux-foundation.org> (raw)
In-Reply-To: <20110520124753.56730b37.kamezawa.hiroyu@jp.fujitsu.com>

On Fri, 20 May 2011 12:47:53 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Ostatic scan rate async memory reclaim for memcg.
> 
> This patch implements a routine for asynchronous memory reclaim for memory
> cgroup, which will be triggered when the usage is near to the limit.
> This patch includes only code codes for memory freeing.
> 
> Asynchronous memory reclaim can be a help for reduce latency because
> memory reclaim goes while an application need to wait or compute something.
> 
> To do memory reclaim in async, we need some thread or worker.
> Unlike node or zones, memcg can be created on demand and there may be
> a system with thousands of memcgs. So, the number of jobs for memcg
> asynchronous memory reclaim can be big number in theory. So, node kswapd
> codes doesn't fit well. And some scheduling on memcg layer will be appreciated.
> 
> This patch implements a static scan rate memory reclaim.
> When shrink_mem_cgroup_static_scan() is called, it scans pages at most
> MEMCG_STATIC_SCAN_LIMIT(2048) pages and returnes how memory shrinking
> was hard. When the function returns false, the caller can assume memory
> reclaim on the memcg seemed difficult and can add some scheduling delay
> for the job.

Fully and carefully define the new term "static scan rate"?

> Note:
>   - I think this concept can be used for enhancing softlimit, too.
>     But need more study.
> 
>
> ...
>
> +		total_scan += nr[l];
> +	}
> +	/*
> +	 * Asynchronous reclaim for memcg uses static scan rate for avoiding
> +	 * too much cpu consumption in a memcg. Adjust the scan count to fit
> +	 * into scan_limit.
> +	 */
> +	if (total_scan > sc->scan_limit) {
> +		for_each_evictable_lru(l) {
> +			if (!nr[l] < SWAP_CLUSTER_MAX)

That statement doesn't do what you think it does!

> +				continue;
> +			nr[l] = div64_u64(nr[l] * sc->scan_limit, total_scan);
> +			nr[l] = max((unsigned long)SWAP_CLUSTER_MAX, nr[l]);
> +		}
>  	}

This gets included in CONFIG_CGROUP_MEM_RES_CTLR=n kernels.  Needlessly?

It also has the potential to affect non-memcg behaviour at runtime.

>  }
>  
> @@ -1938,6 +1955,11 @@ restart:
>  		 */
>  		if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
>  			break;
> +		/*
> +		 * static scan rate memory reclaim ?

I still don't know what "static scan rate" means :(

> +		 */
> +		if (sc->nr_scanned > sc->scan_limit)
> +			break;
>  	}
>  	sc->nr_reclaimed += nr_reclaimed;
>  
>
> ...
>
> +static void shrink_mem_cgroup_node(int nid,
> +		int priority, struct scan_control *sc)
> +{
> +	unsigned long this_scanned = 0;
> +	unsigned long this_reclaimed = 0;
> +	int i;
> +
> +	for (i = 0; i < NODE_DATA(nid)->nr_zones; i++) {
> +		struct zone *zone = NODE_DATA(nid)->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +		if (!mem_cgroup_zone_reclaimable_pages(sc->mem_cgroup, nid, i))
> +			continue;
> +		/* If recent scan didn't go good, do writepate */
> +		sc->nr_scanned = 0;
> +		sc->nr_reclaimed = 0;
> +		shrink_zone(priority, zone, sc);
> +		this_scanned += sc->nr_scanned;
> +		this_reclaimed += sc->nr_reclaimed;
> +		if (this_reclaimed >= sc->nr_to_reclaim)
> +			break;
> +		if (sc->scan_limit < this_scanned)
> +			break;
> +		if (need_resched())
> +			break;

Whoa!  Explain?

> +	}
> +	sc->nr_scanned = this_scanned;
> +	sc->nr_reclaimed = this_reclaimed;
> +	return;
> +}
> +
> +#define MEMCG_ASYNCSCAN_LIMIT		(2048)

Needs documentation.  What happens if I set it to 1024?

> +bool mem_cgroup_shrink_static_scan(struct mem_cgroup *mem, long required)

Exported function has no interface documentation.

`required' appears to have units of "number of pages".  Should be unsigned.

> +{
> +	int nid, priority, noscan;

`noscan' is poorly named and distressingly mysterious.  Basically I
don't have a clue what you're doing with this.

It should be unsigned.

> +	unsigned long total_scanned, total_reclaimed, reclaim_target;
> +	struct scan_control sc = {
> +		.gfp_mask      = GFP_HIGHUSER_MOVABLE,
> +		.may_unmap     = 1,
> +		.may_swap      = 1,
> +		.order         = 0,
> +		/* we don't writepage in our scan. but kick flusher threads */
> +		.may_writepage = 0,
> +	};
> +	struct mem_cgroup *victim, *check_again;
> +	bool congested = true;
> +
> +	total_scanned = 0;
> +	total_reclaimed = 0;
> +	reclaim_target = min(required, MEMCG_ASYNCSCAN_LIMIT/2L);
> +	sc.swappiness = mem_cgroup_swappiness(mem);
> +
> +	noscan = 0;
> +	check_again = NULL;
> +
> +	do {
> +		victim = mem_cgroup_select_victim(mem);
> +
> +		if (!mem_cgroup_test_reclaimable(victim)) {
> +			mem_cgroup_release_victim(victim);
> +			/*
> +			 * if selected a hopeless victim again, give up.
> +		 	 */
> +			if (check_again == victim)
> +				goto out;
> +			if (!check_again)
> +				check_again = victim;
> +		} else
> +			check_again = NULL;
> +	} while (check_again);

What's all this trying to do?

> +	current->flags |= PF_SWAPWRITE;
> +	/*
> +	 * We can use arbitrary priority for our run because we just scan
> +	 * up to MEMCG_ASYNCSCAN_LIMIT and reclaim only the half of it.
> +	 * But, we need to have early-give-up chance for avoid cpu hogging.
> +	 * So, start from a small priority and increase it.
> +	 */
> +	priority = DEF_PRIORITY;
> +
> +	while ((total_scanned < MEMCG_ASYNCSCAN_LIMIT) &&
> +		(total_reclaimed < reclaim_target)) {
> +
> +		/* select a node to scan */
> +		nid = mem_cgroup_select_victim_node(victim);
> +
> +		sc.mem_cgroup = victim;
> +		sc.nr_scanned = 0;
> +		sc.scan_limit = MEMCG_ASYNCSCAN_LIMIT - total_scanned;
> +		sc.nr_reclaimed = 0;
> +		sc.nr_to_reclaim = reclaim_target - total_reclaimed;
> +		shrink_mem_cgroup_node(nid, priority, &sc);
> +		if (sc.nr_scanned) {
> +			total_scanned += sc.nr_scanned;
> +			total_reclaimed += sc.nr_reclaimed;
> +			noscan = 0;
> +		} else
> +			noscan++;
> +		mem_cgroup_release_victim(victim);
> +		/* ok, check condition */
> +		if (total_scanned > total_reclaimed * 2)
> +			wakeup_flusher_threads(sc.nr_scanned);
> +
> +		if (mem_cgroup_async_should_stop(mem))
> +			break;
> +		/* If memory reclaim seems heavy, return that we're congested */
> +		if (total_scanned > MEMCG_ASYNCSCAN_LIMIT/4 &&
> +		    total_scanned > total_reclaimed*8)
> +			break;
> +		/*
> +		 * The whole system is busy or some status update
> +		 * is not synched. It's better to wait for a while.
> +		 */
> +		if ((noscan > 1) || (need_resched()))
> +			break;

So we bale out if there were two priority levels at which
shrink_mem_cgroup_node() didn't scan any pages?  What on earth???

And what was the point in calling shrink_mem_cgroup_node() if it didn't
scan anything?  I could understand using nr_reclaimed...

> +		/* ok, we can do deeper scanning. */
> +		priority--;
> +	}
> +	current->flags &= ~PF_SWAPWRITE;
> +	/*
> +	 * If we successfully freed the half of target, report that
> +	 * memory reclaim went smoothly.
> +	 */
> +	if (total_reclaimed > reclaim_target/2)
> +		congested = false;
> +out:
> +	return congested;
> +}
>  #endif



I dunno, the whole thing seems sprinkled full of arbitrary assumptions
and guess-and-giggle magic numbers.  I expect a lot of this stuff is
just unnecessary.  And if it _is_ necessary then I'd expect there to
be lots of situations and corner cases in which it malfunctions,
because the magic numbers weren't tuned to that case.

WARNING: multiple messages have this Message-ID (diff)

From: Andrew Morton <akpm@linux-foundation.org>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>,
	"balbir@linux.vnet.ibm.com" <balbir@linux.vnet.ibm.com>,
	Ying Han <yinghan@google.com>,
	hannes@cmpxchg.org, Michal Hocko <mhocko@suse.cz>
Subject: Re: [PATCH 7/8] memcg static scan reclaim for asyncrhonous reclaim
Date: Fri, 20 May 2011 14:50:08 -0700	[thread overview]
Message-ID: <20110520145008.1ea51f41.akpm@linux-foundation.org> (raw)
In-Reply-To: <20110520124753.56730b37.kamezawa.hiroyu@jp.fujitsu.com>

On Fri, 20 May 2011 12:47:53 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Ostatic scan rate async memory reclaim for memcg.
> 
> This patch implements a routine for asynchronous memory reclaim for memory
> cgroup, which will be triggered when the usage is near to the limit.
> This patch includes only code codes for memory freeing.
> 
> Asynchronous memory reclaim can be a help for reduce latency because
> memory reclaim goes while an application need to wait or compute something.
> 
> To do memory reclaim in async, we need some thread or worker.
> Unlike node or zones, memcg can be created on demand and there may be
> a system with thousands of memcgs. So, the number of jobs for memcg
> asynchronous memory reclaim can be big number in theory. So, node kswapd
> codes doesn't fit well. And some scheduling on memcg layer will be appreciated.
> 
> This patch implements a static scan rate memory reclaim.
> When shrink_mem_cgroup_static_scan() is called, it scans pages at most
> MEMCG_STATIC_SCAN_LIMIT(2048) pages and returnes how memory shrinking
> was hard. When the function returns false, the caller can assume memory
> reclaim on the memcg seemed difficult and can add some scheduling delay
> for the job.

Fully and carefully define the new term "static scan rate"?

> Note:
>   - I think this concept can be used for enhancing softlimit, too.
>     But need more study.
> 
>
> ...
>
> +		total_scan += nr[l];
> +	}
> +	/*
> +	 * Asynchronous reclaim for memcg uses static scan rate for avoiding
> +	 * too much cpu consumption in a memcg. Adjust the scan count to fit
> +	 * into scan_limit.
> +	 */
> +	if (total_scan > sc->scan_limit) {
> +		for_each_evictable_lru(l) {
> +			if (!nr[l] < SWAP_CLUSTER_MAX)

That statement doesn't do what you think it does!

> +				continue;
> +			nr[l] = div64_u64(nr[l] * sc->scan_limit, total_scan);
> +			nr[l] = max((unsigned long)SWAP_CLUSTER_MAX, nr[l]);
> +		}
>  	}

This gets included in CONFIG_CGROUP_MEM_RES_CTLR=n kernels.  Needlessly?

It also has the potential to affect non-memcg behaviour at runtime.

>  }
>  
> @@ -1938,6 +1955,11 @@ restart:
>  		 */
>  		if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
>  			break;
> +		/*
> +		 * static scan rate memory reclaim ?

I still don't know what "static scan rate" means :(

> +		 */
> +		if (sc->nr_scanned > sc->scan_limit)
> +			break;
>  	}
>  	sc->nr_reclaimed += nr_reclaimed;
>  
>
> ...
>
> +static void shrink_mem_cgroup_node(int nid,
> +		int priority, struct scan_control *sc)
> +{
> +	unsigned long this_scanned = 0;
> +	unsigned long this_reclaimed = 0;
> +	int i;
> +
> +	for (i = 0; i < NODE_DATA(nid)->nr_zones; i++) {
> +		struct zone *zone = NODE_DATA(nid)->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +		if (!mem_cgroup_zone_reclaimable_pages(sc->mem_cgroup, nid, i))
> +			continue;
> +		/* If recent scan didn't go good, do writepate */
> +		sc->nr_scanned = 0;
> +		sc->nr_reclaimed = 0;
> +		shrink_zone(priority, zone, sc);
> +		this_scanned += sc->nr_scanned;
> +		this_reclaimed += sc->nr_reclaimed;
> +		if (this_reclaimed >= sc->nr_to_reclaim)
> +			break;
> +		if (sc->scan_limit < this_scanned)
> +			break;
> +		if (need_resched())
> +			break;

Whoa!  Explain?

> +	}
> +	sc->nr_scanned = this_scanned;
> +	sc->nr_reclaimed = this_reclaimed;
> +	return;
> +}
> +
> +#define MEMCG_ASYNCSCAN_LIMIT		(2048)

Needs documentation.  What happens if I set it to 1024?

> +bool mem_cgroup_shrink_static_scan(struct mem_cgroup *mem, long required)

Exported function has no interface documentation.

`required' appears to have units of "number of pages".  Should be unsigned.

> +{
> +	int nid, priority, noscan;

`noscan' is poorly named and distressingly mysterious.  Basically I
don't have a clue what you're doing with this.

It should be unsigned.

> +	unsigned long total_scanned, total_reclaimed, reclaim_target;
> +	struct scan_control sc = {
> +		.gfp_mask      = GFP_HIGHUSER_MOVABLE,
> +		.may_unmap     = 1,
> +		.may_swap      = 1,
> +		.order         = 0,
> +		/* we don't writepage in our scan. but kick flusher threads */
> +		.may_writepage = 0,
> +	};
> +	struct mem_cgroup *victim, *check_again;
> +	bool congested = true;
> +
> +	total_scanned = 0;
> +	total_reclaimed = 0;
> +	reclaim_target = min(required, MEMCG_ASYNCSCAN_LIMIT/2L);
> +	sc.swappiness = mem_cgroup_swappiness(mem);
> +
> +	noscan = 0;
> +	check_again = NULL;
> +
> +	do {
> +		victim = mem_cgroup_select_victim(mem);
> +
> +		if (!mem_cgroup_test_reclaimable(victim)) {
> +			mem_cgroup_release_victim(victim);
> +			/*
> +			 * if selected a hopeless victim again, give up.
> +		 	 */
> +			if (check_again == victim)
> +				goto out;
> +			if (!check_again)
> +				check_again = victim;
> +		} else
> +			check_again = NULL;
> +	} while (check_again);

What's all this trying to do?

> +	current->flags |= PF_SWAPWRITE;
> +	/*
> +	 * We can use arbitrary priority for our run because we just scan
> +	 * up to MEMCG_ASYNCSCAN_LIMIT and reclaim only the half of it.
> +	 * But, we need to have early-give-up chance for avoid cpu hogging.
> +	 * So, start from a small priority and increase it.
> +	 */
> +	priority = DEF_PRIORITY;
> +
> +	while ((total_scanned < MEMCG_ASYNCSCAN_LIMIT) &&
> +		(total_reclaimed < reclaim_target)) {
> +
> +		/* select a node to scan */
> +		nid = mem_cgroup_select_victim_node(victim);
> +
> +		sc.mem_cgroup = victim;
> +		sc.nr_scanned = 0;
> +		sc.scan_limit = MEMCG_ASYNCSCAN_LIMIT - total_scanned;
> +		sc.nr_reclaimed = 0;
> +		sc.nr_to_reclaim = reclaim_target - total_reclaimed;
> +		shrink_mem_cgroup_node(nid, priority, &sc);
> +		if (sc.nr_scanned) {
> +			total_scanned += sc.nr_scanned;
> +			total_reclaimed += sc.nr_reclaimed;
> +			noscan = 0;
> +		} else
> +			noscan++;
> +		mem_cgroup_release_victim(victim);
> +		/* ok, check condition */
> +		if (total_scanned > total_reclaimed * 2)
> +			wakeup_flusher_threads(sc.nr_scanned);
> +
> +		if (mem_cgroup_async_should_stop(mem))
> +			break;
> +		/* If memory reclaim seems heavy, return that we're congested */
> +		if (total_scanned > MEMCG_ASYNCSCAN_LIMIT/4 &&
> +		    total_scanned > total_reclaimed*8)
> +			break;
> +		/*
> +		 * The whole system is busy or some status update
> +		 * is not synched. It's better to wait for a while.
> +		 */
> +		if ((noscan > 1) || (need_resched()))
> +			break;

So we bale out if there were two priority levels at which
shrink_mem_cgroup_node() didn't scan any pages?  What on earth???

And what was the point in calling shrink_mem_cgroup_node() if it didn't
scan anything?  I could understand using nr_reclaimed...

> +		/* ok, we can do deeper scanning. */
> +		priority--;
> +	}
> +	current->flags &= ~PF_SWAPWRITE;
> +	/*
> +	 * If we successfully freed the half of target, report that
> +	 * memory reclaim went smoothly.
> +	 */
> +	if (total_reclaimed > reclaim_target/2)
> +		congested = false;
> +out:
> +	return congested;
> +}
>  #endif



I dunno, the whole thing seems sprinkled full of arbitrary assumptions
and guess-and-giggle magic numbers.  I expect a lot of this stuff is
just unnecessary.  And if it _is_ necessary then I'd expect there to
be lots of situations and corner cases in which it malfunctions,
because the magic numbers weren't tuned to that case.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2011-05-20 21:50 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-05-20  3:37 [PATCH 0/8] memcg async reclaim v2 KAMEZAWA Hiroyuki
2011-05-20  3:37 ` KAMEZAWA Hiroyuki
2011-05-20  3:41 ` [PATCH 1/8] memcg: export zone reclaimable pages KAMEZAWA Hiroyuki
2011-05-20  3:41   ` KAMEZAWA Hiroyuki
2011-05-20  3:42 ` [PATCH 2/8] memcg: easy check routine for reclaimable KAMEZAWA Hiroyuki
2011-05-20  3:42   ` KAMEZAWA Hiroyuki
2011-05-20 21:49   ` Andrew Morton
2011-05-20 21:49     ` Andrew Morton
2011-05-20 23:57     ` Hiroyuki Kamezawa
2011-05-20 23:57       ` Hiroyuki Kamezawa
2011-05-20  3:43 ` [PATCH 0/8] memcg: clean up, export swapiness KAMEZAWA Hiroyuki
2011-05-20  3:43   ` KAMEZAWA Hiroyuki
2011-05-23 17:26   ` Ying Han
2011-05-23 17:26     ` Ying Han
2011-05-23 23:55     ` KAMEZAWA Hiroyuki
2011-05-23 23:55       ` KAMEZAWA Hiroyuki
2011-05-20  3:44 ` [PATCH 4/8] memcg: export release victim KAMEZAWA Hiroyuki
2011-05-20  3:44   ` KAMEZAWA Hiroyuki
2011-05-20  3:46 ` [PATCH 6/8] memcg asynchronous memory reclaim interface KAMEZAWA Hiroyuki
2011-05-20  3:46   ` KAMEZAWA Hiroyuki
2011-05-20 21:49   ` Andrew Morton
2011-05-20 21:49     ` Andrew Morton
2011-05-20 23:56     ` Hiroyuki Kamezawa
2011-05-20 23:56       ` Hiroyuki Kamezawa
2011-05-23 23:36       ` Ying Han
2011-05-23 23:36         ` Ying Han
2011-05-24  0:11         ` KAMEZAWA Hiroyuki
2011-05-24  0:11           ` KAMEZAWA Hiroyuki
2011-05-24  0:26           ` Ying Han
2011-05-24  0:26             ` Ying Han
2011-05-20  3:47 ` [PATCH 7/8] memcg static scan reclaim for asyncrhonous reclaim KAMEZAWA Hiroyuki
2011-05-20  3:47   ` KAMEZAWA Hiroyuki
2011-05-20 21:50   ` Andrew Morton [this message]
2011-05-20 21:50     ` Andrew Morton
2011-05-21  0:23     ` Hiroyuki Kamezawa
2011-05-21  0:23       ` Hiroyuki Kamezawa
2011-05-20  3:48 ` [PATCH 8/8] memcg asyncrhouns reclaim workqueue KAMEZAWA Hiroyuki
2011-05-20  3:48   ` KAMEZAWA Hiroyuki
2011-05-20 21:51   ` Andrew Morton
2011-05-20 21:51     ` Andrew Morton
2011-05-21  0:41     ` Hiroyuki Kamezawa
2011-05-21  0:41       ` Hiroyuki Kamezawa
2011-05-21  1:26       ` Andrew Morton
2011-05-21  1:26         ` Andrew Morton
2011-05-23  0:25         ` KAMEZAWA Hiroyuki
2011-05-23  0:25           ` KAMEZAWA Hiroyuki
2011-05-25  5:51           ` Ying Han
     [not found] ` <BANLkTimd0CAqoAnuGz7WvKsbwphJxo0eZQ@mail.gmail.com>
2011-05-24  0:19   ` [PATCH 0/8] memcg async reclaim v2 KAMEZAWA Hiroyuki
2011-05-24  0:19     ` KAMEZAWA Hiroyuki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110520145008.1ea51f41.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=balbir@linux.vnet.ibm.com \
    --cc=hannes@cmpxchg.org \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.cz \
    --cc=nishimura@mxp.nes.nec.co.jp \
    --cc=yinghan@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.