[PATCH v4 0/5] memcg : make numa scanning better

linux-mm.kvack.org archive mirror

* [PATCH v4 0/5] memcg : make numa scanning better
@ 2011-07-27  5:44 KAMEZAWA Hiroyuki
  2011-07-27  5:46 ` [PATCH v4 1/5] memcg : update numascan info by schedule_work KAMEZAWA Hiroyuki
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27  5:44 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	Michal Hocko, nishimura@mxp.nes.nec.co.jp


When 889976db (memcg: reclaim memory from nodes in round-robin order) was
pushed, I mentioned "But yes, a better algorithm is needed."

Here is one.

I have already split out some pieces that were in this set and pushed them
upstream. This series contains more fixes and the new core logic.

The concept is to select a node based on its page usage.
This series calculates a weight for each node and schedules scans
in proportion to those weights. The weight is calculated adaptively,
considering the status of the whole memcg. In short, a node that holds
a lot of (inactive) file cache is likely to be chosen as the victim.


As before, I ran an apache-bench test as follows.

Host
   Host : Xeon 8cpu
   Memory: 24GB

What test?
   Access a CGI script which reads a file at random, driven by
   apache-bench. The randomness of file access is normalized.
   The full working set is 600MB.
   httpd runs under memcg, which causes memory reclaim and read I/O.

[Set limit as 300M]
 
<mmotm-0709 + some merged bugfixes>
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:    41   48  15.0     46    1161
Waiting:       40   46  10.5     44     623
Total:         41   48  15.0     46    1161

scanned_pages_by_limit 410693
elapsed_ns_by_limit 2393975561

<mmotm-0709 + cpuset's page cache spread nodes>
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:    42   48  16.9     46    1616
Waiting:       40   46  14.7     44    1614
Total:         42   48  16.9     46    1616

scanned_pages_by_limit 271733
elapsed_ns_by_limit 1415085661

<patch>
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       1
Processing:    41   46   7.5     45     706
Waiting:       39   45   6.4     44     630
Total:         41   46   7.5     45     706

scanned_pages_by_limit 302282
elapsed_ns_by_limit 1312758481

<patch + cpuset's page cache spread nodes>
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       4
Processing:    42   47  11.4     46     962
Waiting:       40   45   8.7     44     493
Total:         42   47  11.4     46     962

scanned_pages_by_limit 349020
elapsed_ns_by_limit 1594144061

[Set Limit as 400M]
<mmotm-0709>
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       3
Processing:    40   45   4.7     45     467
Waiting:       39   44   4.4     43     465
Total:         40   45   4.7     45     467

scanned_pages_by_limit 156279
elapsed_ns_by_limit 1274982214

<mmotm-0709 + cpuset's node spread>
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:    41   46   6.9     45     458
Waiting:       40   44   4.5     44     388
Total:         41   46   6.9     45     459

scanned_pages_by_limit 346534
elapsed_ns_by_limit 2612352442


<Patch>
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       1
Processing:    42   45   5.1     45     467
Waiting:       38   44   4.5     43     465
Total:         42   45   5.1     45     467

scanned_pages_by_limit 116307
elapsed_ns_by_limit 624529569

<patch+spread>
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       1
Processing:    41   46   5.3     45     392
Waiting:       39   44   4.1     43     388
Total:         41   46   5.3     45     392

scanned_pages_by_limit 154865
elapsed_ns_by_limit 830638510


In general, this patch set reduces memory reclaim scans and reclaim time,
and helps reclaim memory more efficiently.
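(Illustration only, not part of the posted series.) To compare the
scanned_pages_by_limit and elapsed_ns_by_limit figures above, the relative
reduction can be computed as below. The helper name reduction_percent() is
hypothetical; the inputs are the counters reported in this mail.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Relative reduction from 'before' to 'after', in whole percent
 * (rounded down). Hypothetical helper for reading the tables above.
 */
static unsigned int reduction_percent(uint64_t before, uint64_t after)
{
	assert(before > 0 && after <= before);
	return (unsigned int)((before - after) * 100 / before);
}
```

Comparing <patch> against the plain mmotm run, this gives about 26% fewer
scanned pages and 45% less reclaim time at the 300M limit, and about 25%
fewer scanned pages and 51% less reclaim time at the 400M limit.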

Thanks,
-Kame









--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v4 1/5] memcg : update numascan info by schedule_work
  2011-07-27  5:44 [PATCH v4 0/5] memcg : make numa scanning better KAMEZAWA Hiroyuki
@ 2011-07-27  5:46 ` KAMEZAWA Hiroyuki
  2011-07-27  5:47 ` [PATCH v4 2/5] memcg : pass scan nodemask KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27  5:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, Michal Hocko,
	nishimura@mxp.nes.nec.co.jp


Update memcg's NUMA scanning information via schedule_work().

Currently, memcg's NUMA information is updated in the context of the thread
doing memory reclaim. That update is not very heavy yet, but upcoming changes
to NUMA scanning will add more work. This patch moves the update into
schedule_work() to reduce the latency these updates add to reclaim.
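(Illustration only, not part of the patch.) The gating pattern used here
can be sketched in userspace C11 as follows: the reclaim path only checks
two counters and queues the work, and the worker reopens the gate when it
is done. struct numainfo_model, may_queue_update() and update_work() are
hypothetical names, and the "events <= 0" check is an assumption about the
part of mem_cgroup_may_update_nodemask() that the diff does not show.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the fields the patch touches in struct mem_cgroup. */
struct numainfo_model {
	atomic_int events;	/* > 0: enough page events since the last update */
	atomic_int updating;	/* > 0: an update is already queued or running */
	int queued;		/* how many times the work was queued */
	int completed;		/* how many times the work ran */
};

/* Reclaim path: cheap check only. Returns true if it queued the work. */
static bool may_queue_update(struct numainfo_model *m)
{
	if (atomic_load(&m->events) <= 0)
		return false;
	/* Same effect as "atomic_inc_return(...) > 1": only the first caller passes. */
	if (atomic_fetch_add(&m->updating, 1) > 0)
		return false;
	atomic_store(&m->events, 0);
	m->queued++;		/* stands in for css_get() + schedule_work() */
	return true;
}

/* Worker: does the heavy part later, then reopens the gate. */
static void update_work(struct numainfo_model *m)
{
	/* ... the nodemask would be rebuilt here ... */
	m->completed++;
	atomic_store(&m->updating, 0);	/* stands in for atomic_set() + css_put() */
}
```

Until update_work() has run, every further call to may_queue_update()
returns early, so at most one update per memcg is pending at a time.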

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   42 ++++++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 12 deletions(-)

Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -284,6 +284,7 @@ struct mem_cgroup {
 	nodemask_t	scan_nodes;
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
+	struct work_struct	numainfo_update_work;
 #endif
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
@@ -1551,6 +1552,23 @@ static bool test_mem_cgroup_node_reclaim
 }
 #if MAX_NUMNODES > 1
 
+static void mem_cgroup_numainfo_update_work(struct work_struct *work)
+{
+	struct mem_cgroup *memcg;
+	int nid;
+
+	memcg = container_of(work, struct mem_cgroup, numainfo_update_work);
+
+	memcg->scan_nodes = node_states[N_HIGH_MEMORY];
+	for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
+		if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
+			node_clear(nid, memcg->scan_nodes);
+	}
+	atomic_set(&memcg->numainfo_updating, 0);
+	css_put(&memcg->css);
+}
+
+
 /*
  * Always updating the nodemask is not very good - even if we have an empty
  * list or the wrong list here, we can start from some node and traverse all
@@ -1559,7 +1577,6 @@ static bool test_mem_cgroup_node_reclaim
  */
 static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
 {
-	int nid;
 	/*
 	 * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
 	 * pagein/pageout changes since the last update.
@@ -1568,18 +1585,9 @@ static void mem_cgroup_may_update_nodema
 		return;
 	if (atomic_inc_return(&mem->numainfo_updating) > 1)
 		return;
-
-	/* make a nodemask where this memcg uses memory from */
-	mem->scan_nodes = node_states[N_HIGH_MEMORY];
-
-	for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
-
-		if (!test_mem_cgroup_node_reclaimable(mem, nid, false))
-			node_clear(nid, mem->scan_nodes);
-	}
-
 	atomic_set(&mem->numainfo_events, 0);
-	atomic_set(&mem->numainfo_updating, 0);
+	css_get(&mem->css);
+	schedule_work(&mem->numainfo_update_work);
 }
 
 /*
@@ -1652,6 +1660,12 @@ bool mem_cgroup_reclaimable(struct mem_c
 	return false;
 }
 
+static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+{
+	INIT_WORK(&memcg->numainfo_update_work,
+		mem_cgroup_numainfo_update_work);
+}
+
 #else
 int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
 {
@@ -1662,6 +1676,9 @@ bool mem_cgroup_reclaimable(struct mem_c
 {
 	return test_mem_cgroup_node_reclaimable(mem, 0, noswap);
 }
+static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 static void __mem_cgroup_record_scanstat(unsigned long *stats,
@@ -5032,6 +5049,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
 	spin_lock_init(&mem->scanstat.lock);
+	mem_cgroup_numascan_init(mem);
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v4 2/5] memcg : pass scan nodemask
  2011-07-27  5:44 [PATCH v4 0/5] memcg : make numa scanning better KAMEZAWA Hiroyuki
  2011-07-27  5:46 ` [PATCH v4 1/5] memcg : update numascan info by schedule_work KAMEZAWA Hiroyuki
@ 2011-07-27  5:47 ` KAMEZAWA Hiroyuki
  2011-08-01 13:59   ` Michal Hocko
  2011-07-27  5:49 ` [PATCH v4 3/5] memcg : stop scanning if enough KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27  5:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, Michal Hocko,
	nishimura@mxp.nes.nec.co.jp


Pass memcg's nodemask to try_to_free_pages().

try_to_free_pages() can take a nodemask as its argument, but memcg
doesn't pass one. Considering that memcg can be used with cpuset on
big NUMA machines, memcg should pass a nodemask if one is available.

memcg now maintains a nodemask with periodic updates, so pass it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    2 +-
 mm/memcontrol.c            |    8 ++++++--
 mm/vmscan.c                |    3 ++-
 3 files changed, 9 insertions(+), 4 deletions(-)

Index: mmotm-0710/include/linux/memcontrol.h
===================================================================
--- mmotm-0710.orig/include/linux/memcontrol.h
+++ mmotm-0710/include/linux/memcontrol.h
@@ -117,7 +117,7 @@ extern void mem_cgroup_end_migration(str
  */
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -1602,10 +1602,11 @@ static void mem_cgroup_may_update_nodema
  *
  * Now, we use round-robin. Better algorithm is welcomed.
  */
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
 {
 	int node;
 
+	*mask = NULL;
 	mem_cgroup_may_update_nodemask(mem);
 	node = mem->last_scanned_node;
 
@@ -1620,6 +1621,8 @@ int mem_cgroup_select_victim_node(struct
 	 */
 	if (unlikely(node == MAX_NUMNODES))
 		node = numa_node_id();
+	else
+		*mask = &mem->scan_nodes;
 
 	mem->last_scanned_node = node;
 	return node;
@@ -1667,8 +1670,9 @@ static void mem_cgroup_numascan_init(str
 }
 
 #else
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
 {
+	*mask = NULL;
 	return 0;
 }
 
Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -2280,6 +2280,7 @@ unsigned long try_to_free_mem_cgroup_pag
 	unsigned long nr_reclaimed;
 	unsigned long start, end;
 	int nid;
+	nodemask_t *mask;
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
@@ -2302,7 +2303,7 @@ unsigned long try_to_free_mem_cgroup_pag
 	 * take care of from where we get pages. So the node where we start the
 	 * scan does not need to be the current node.
 	 */
-	nid = mem_cgroup_select_victim_node(mem_cont);
+	nid = mem_cgroup_select_victim_node(mem_cont, &mask);
 
 	zonelist = NODE_DATA(nid)->node_zonelists;
 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v4 2/5] memcg : pass scan nodemask
  2011-07-27  5:47 ` [PATCH v4 2/5] memcg : pass scan nodemask KAMEZAWA Hiroyuki
@ 2011-08-01 13:59   ` Michal Hocko
  2011-08-02  2:21     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2011-08-01 13:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp

On Wed 27-07-11 14:47:42, KAMEZAWA Hiroyuki wrote:
> 
> Pass memcg's nodemask to try_to_free_pages().
> 
> try_to_free_pages() can take a nodemask as its argument, but memcg
> doesn't pass one. Considering that memcg can be used with cpuset on
> big NUMA machines, memcg should pass a nodemask if one is available.
> 
> memcg now maintains a nodemask with periodic updates, so pass it.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |    2 +-
>  mm/memcontrol.c            |    8 ++++++--
>  mm/vmscan.c                |    3 ++-
>  3 files changed, 9 insertions(+), 4 deletions(-)
> 
[...]
> Index: mmotm-0710/mm/vmscan.c
> ===================================================================
> --- mmotm-0710.orig/mm/vmscan.c
> +++ mmotm-0710/mm/vmscan.c
> @@ -2280,6 +2280,7 @@ unsigned long try_to_free_mem_cgroup_pag
>  	unsigned long nr_reclaimed;
>  	unsigned long start, end;
>  	int nid;
> +	nodemask_t *mask;
>  	struct scan_control sc = {
>  		.may_writepage = !laptop_mode,
>  		.may_unmap = 1,
> @@ -2302,7 +2303,7 @@ unsigned long try_to_free_mem_cgroup_pag
>  	 * take care of from where we get pages. So the node where we start the
>  	 * scan does not need to be the current node.
>  	 */
> -	nid = mem_cgroup_select_victim_node(mem_cont);
> +	nid = mem_cgroup_select_victim_node(mem_cont, &mask);

The mask is not used anywhere AFAICS and using it is a point of the
patch AFAIU. I guess you wanted to use &sc.nodemask, right?
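(Illustration only, not from the thread.) A userspace sketch of what the
comment above asks for: the out-parameter points at the scan control's
nodemask field, so the selected mask actually reaches the scan. All types
and names here are simplified stand-ins (an unsigned long bitmap instead of
nodemask_t, a 4-node limit, node 0 as the fallback), not kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_NODES 4

typedef unsigned long nodemask_model_t;	/* one bit per node, stand-in type */

struct scan_control_model {
	nodemask_model_t *nodemask;	/* NULL means "no restriction" */
};

struct memcg_model {
	nodemask_model_t scan_nodes;
	int last_scanned_node;		/* -1 before the first scan */
};

/* Round-robin over scan_nodes; *mask is set only if a node was found. */
static int select_victim_node(struct memcg_model *memcg, nodemask_model_t **mask)
{
	*mask = NULL;
	for (int i = 1; i <= MODEL_MAX_NODES; i++) {
		int node = (memcg->last_scanned_node + i) % MODEL_MAX_NODES;

		if (memcg->scan_nodes & (1UL << node)) {
			*mask = &memcg->scan_nodes;
			memcg->last_scanned_node = node;
			return node;
		}
	}
	/* Empty mask: fall back to a default node and leave *mask NULL. */
	memcg->last_scanned_node = 0;
	return 0;
}
```

A caller would then write nid = select_victim_node(&memcg, &sc.nodemask)
rather than passing the address of an otherwise unused local variable.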

Other than that, looks good to me.

Reviewed-by: Michal Hocko <mhocko@suse.cz>
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v4 2/5] memcg : pass scan nodemask
  2011-08-01 13:59   ` Michal Hocko
@ 2011-08-02  2:21     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-02  2:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp

On Mon, 1 Aug 2011 15:59:53 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Wed 27-07-11 14:47:42, KAMEZAWA Hiroyuki wrote:
> > 
> > Pass memcg's nodemask to try_to_free_pages().
> > 
> > try_to_free_pages() can take a nodemask as its argument, but memcg
> > doesn't pass one. Considering that memcg can be used with cpuset on
> > big NUMA machines, memcg should pass a nodemask if one is available.
> > 
> > memcg now maintains a nodemask with periodic updates, so pass it.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/memcontrol.h |    2 +-
> >  mm/memcontrol.c            |    8 ++++++--
> >  mm/vmscan.c                |    3 ++-
> >  3 files changed, 9 insertions(+), 4 deletions(-)
> > 
> [...]
> > Index: mmotm-0710/mm/vmscan.c
> > ===================================================================
> > --- mmotm-0710.orig/mm/vmscan.c
> > +++ mmotm-0710/mm/vmscan.c
> > @@ -2280,6 +2280,7 @@ unsigned long try_to_free_mem_cgroup_pag
> >  	unsigned long nr_reclaimed;
> >  	unsigned long start, end;
> >  	int nid;
> > +	nodemask_t *mask;
> >  	struct scan_control sc = {
> >  		.may_writepage = !laptop_mode,
> >  		.may_unmap = 1,
> > @@ -2302,7 +2303,7 @@ unsigned long try_to_free_mem_cgroup_pag
> >  	 * take care of from where we get pages. So the node where we start the
> >  	 * scan does not need to be the current node.
> >  	 */
> > -	nid = mem_cgroup_select_victim_node(mem_cont);
> > +	nid = mem_cgroup_select_victim_node(mem_cont, &mask);
> 
> The mask is not used anywhere AFAICS and using it is a point of the
> patch AFAIU. I guess you wanted to use &sc.nodemask, right?
> 
> Other than that, looks good to me.
> 
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Ah, sorry. I'll fix.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v4 3/5] memcg : stop scanning if enough
  2011-07-27  5:44 [PATCH v4 0/5] memcg : make numa scanning better KAMEZAWA Hiroyuki
  2011-07-27  5:46 ` [PATCH v4 1/5] memcg : update numascan info by schedule_work KAMEZAWA Hiroyuki
  2011-07-27  5:47 ` [PATCH v4 2/5] memcg : pass scan nodemask KAMEZAWA Hiroyuki
@ 2011-07-27  5:49 ` KAMEZAWA Hiroyuki
  2011-08-01 14:37   ` Michal Hocko
  2011-07-27  5:49 ` [PATCH v4 4/5] memcg : calculate node scan weight KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27  5:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, Michal Hocko,
	nishimura@mxp.nes.nec.co.jp

memcg: avoid node fallback scan if possible.

Currently, try_to_free_pages() scans the whole zonelist because the page
allocator needs every zone in it to be visited. That behavior is harmful
for memcg: memcg scans memory only because it hit its limit, not because
there is a memory shortage in the passed zonelist.

For example, with the following unbalanced nodes

     Node 0    Node 1
File 1G        0
Anon 200M      200M

memcg will cause swap-out from Node 1 at every vmscan.

As another example, on a 1024-node system memcg will visit the pages of
all 1024 nodes per vmscan. This is overkill.

This is why memcg's victim node selection logic doesn't work
as expected.

This patch helps stop vmscan once we have scanned enough.
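(Illustration only, not part of the patch.) The early exit can be modelled
in userspace as below. struct scan_model and shrink_zones_model() are
hypothetical names; reclaimable[] simply states how many pages each zone in
the fallback order would yield.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct scan_model {
	unsigned long nr_to_reclaim;
	unsigned long nr_reclaimed;
	bool global;	/* true: global reclaim, false: memcg limit reclaim */
};

/*
 * Walk the zones in fallback order and return how many were visited.
 * For memcg limit reclaim there is no real memory shortage, so the walk
 * stops as soon as the reclaim target has been reached.
 */
static size_t shrink_zones_model(struct scan_model *sc,
				 const unsigned long *reclaimable,
				 size_t nr_zones)
{
	size_t visited = 0;

	for (size_t i = 0; i < nr_zones; i++) {
		sc->nr_reclaimed += reclaimable[i];	/* stands in for shrink_zone() */
		visited++;
		if (!sc->global && sc->nr_to_reclaim <= sc->nr_reclaimed)
			break;
	}
	return visited;
}
```

With a target of 32 pages and a first zone that yields 40, the memcg case
visits one zone while the global case still walks the whole list.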

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/vmscan.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -2058,6 +2058,16 @@ static void shrink_zones(int priority, s
 		}
 
 		shrink_zone(priority, zone, sc);
+		if (!scanning_global_lru(sc)) {
+			/*
+			 * When scanning because of a memcg limit, falling
+			 * back to more nodes/zones is harmful because there
+			 * is no memory shortage. Quit as early as possible
+			 * once we reach the target.
+			 */
+			if (sc->nr_to_reclaim <= sc->nr_reclaimed)
+				break;
+		}
 	}
 }
 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v4 3/5] memcg : stop scanning if enough
  2011-07-27  5:49 ` [PATCH v4 3/5] memcg : stop scanning if enough KAMEZAWA Hiroyuki
@ 2011-08-01 14:37   ` Michal Hocko
  2011-08-01 19:49     ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2011-08-01 14:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp

On Wed 27-07-11 14:49:00, KAMEZAWA Hiroyuki wrote:
> memcg: avoid node fallback scan if possible.
> 
> Currently, try_to_free_pages() scans the whole zonelist because the page
> allocator needs every zone in it to be visited. That behavior is harmful
> for memcg: memcg scans memory only because it hit its limit, not because
> there is a memory shortage in the passed zonelist.
> 
> For example, with the following unbalanced nodes
> 
>      Node 0    Node 1
> File 1G        0
> Anon 200M      200M
> 
> memcg will cause swap-out from Node 1 at every vmscan.
> 
> As another example, on a 1024-node system memcg will visit the pages of
> all 1024 nodes per vmscan. This is overkill.
> 
> This is why memcg's victim node selection logic doesn't work
> as expected.

Previous patch adds nodemask filled by
mem_cgroup_select_victim_node. Shouldn't we rather limit that nodemask
to a victim node?

Or am I missing something?

The patch as is doesn't look nice; it makes shrink_zones even more
memcg-hacky:
	for_each_zone_zonelist_nodemask
		if (scanning_global_lru(sc))
			/.../

		shrink_zone(priority, zone, sc);

		if (!scanning_global_lru(sc))
			/.../

> 
> This patch helps stop vmscan once we have scanned enough.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/vmscan.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> Index: mmotm-0710/mm/vmscan.c
> ===================================================================
> --- mmotm-0710.orig/mm/vmscan.c
> +++ mmotm-0710/mm/vmscan.c
> @@ -2058,6 +2058,16 @@ static void shrink_zones(int priority, s
>  		}
>  
>  		shrink_zone(priority, zone, sc);
> +		if (!scanning_global_lru(sc)) {
> +			/*
> +			 * When scanning because of a memcg limit, falling
> +			 * back to more nodes/zones is harmful because there
> +			 * is no memory shortage. Quit as early as possible
> +			 * once we reach the target.
> +			 */
> +			if (sc->nr_to_reclaim <= sc->nr_reclaimed)
> +				break;
> +		}
>  	}
>  }
>  

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v4 3/5] memcg : stop scanning if enough
  2011-08-01 14:37   ` Michal Hocko
@ 2011-08-01 19:49     ` Michal Hocko
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2011-08-01 19:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp

On Mon 01-08-11 16:37:45, Michal Hocko wrote:
> On Wed 27-07-11 14:49:00, KAMEZAWA Hiroyuki wrote:
> > memcg: avoid node fallback scan if possible.
> > 
> > Currently, try_to_free_pages() scans the whole zonelist because the page
> > allocator needs every zone in it to be visited. That behavior is harmful
> > for memcg: memcg scans memory only because it hit its limit, not because
> > there is a memory shortage in the passed zonelist.
> > 
> > For example, with the following unbalanced nodes
> > 
> >      Node 0    Node 1
> > File 1G        0
> > Anon 200M      200M
> > 
> > memcg will cause swap-out from Node 1 at every vmscan.
> > 
> > As another example, on a 1024-node system memcg will visit the pages of
> > all 1024 nodes per vmscan. This is overkill.
> > 
> > This is why memcg's victim node selection logic doesn't work
> > as expected.
> 
> Previous patch adds nodemask filled by
> mem_cgroup_select_victim_node. Shouldn't we rather limit that nodemask
> to a victim node?

Bahh, scratch that. I was jumping from one thing to another and got 
totally confused. Victim memcg is not bound to any particular node in 
general...
Sorry for noise. I will try to get back to this tomorrow.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v4 4/5] memcg : calculate node scan weight
  2011-07-27  5:44 [PATCH v4 0/5] memcg : make numa scanning better KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2011-07-27  5:49 ` [PATCH v4 3/5] memcg : stop scanning if enough KAMEZAWA Hiroyuki
@ 2011-07-27  5:49 ` KAMEZAWA Hiroyuki
  2011-07-27  5:51 ` [PATCH v4 5/5] memcg : select a victim node by weights KAMEZAWA Hiroyuki
  2011-07-27  5:52 ` [PATCH v4 6/5] memcg : check numa balance KAMEZAWA Hiroyuki
  5 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27  5:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, Michal Hocko,
	nishimura@mxp.nes.nec.co.jp

Calculate node scan weight.

Currently, memory cgroup selects a scan target node in round-robin order.
That is not very good: there is no scheduling based on page usage.

This patch calculates each node's weight for scanning.
If a node's weight is high, the node is worth scanning.

The weight is currently calculated on the following principles.

   - Make use of swappiness.
   - If there are enough inactive file pages, ignore active file pages.
   - If there are enough file pages (w.r.t. swappiness), ignore anon pages.
   - Make use of the recent_scanned/recent_rotated reclaim stats.

As a result, a node that contains many inactive file pages will be the
first victim. The node selection logic based on this weight is in the
next patch.
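(Illustration only, not part of the patch.) The per-node formula used in
the diff below can be tried out in userspace as follows. The function names
are hypothetical; the arithmetic mirrors the patch: each page count is
scaled by (scanned - rotated) / (scanned + 1), then the two are combined
with anon_prio = swappiness + 1 and file_prio = 200 - anon_prio + 1.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a page count by the share of recently scanned pages not rotated. */
static uint64_t scale_by_scan_ratio(uint64_t pages, uint64_t scanned,
				    uint64_t rotated)
{
	assert(rotated <= scanned);
	return pages * (scanned - rotated) / (scanned + 1);
}

/* Scan weight of one node, following the formula in this patch. */
static uint64_t node_scan_weight(uint64_t anon, uint64_t file,
				 uint64_t anon_scanned, uint64_t anon_rotated,
				 uint64_t file_scanned, uint64_t file_rotated,
				 unsigned int swappiness)
{
	uint64_t anon_prio = swappiness + 1;
	uint64_t file_prio = 200 - anon_prio + 1;

	anon = scale_by_scan_ratio(anon, anon_scanned, anon_rotated);
	file = scale_by_scan_ratio(file, file_scanned, file_rotated);
	return (anon * anon_prio + file * file_prio) / 200;
}
```

With swappiness 60, a node holding 1000 file pages (99 scanned, none
rotated) gets weight 693, while a node holding 1000 anon pages with the
same scan history gets 301. If 49 of the 99 scanned file pages had been
rotated, the file node's weight would drop to 350.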

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  110 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 105 insertions(+), 5 deletions(-)

Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -143,6 +143,7 @@ struct mem_cgroup_per_zone {
 
 struct mem_cgroup_per_node {
 	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+	unsigned long weight;
 };
 
 struct mem_cgroup_lru_info {
@@ -285,6 +286,7 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 	struct work_struct	numainfo_update_work;
+	unsigned long total_weight;
 #endif
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
@@ -1552,18 +1554,108 @@ static bool test_mem_cgroup_node_reclaim
 }
 #if MAX_NUMNODES > 1
 
+static unsigned long
+__mem_cgroup_calc_numascan_weight(struct mem_cgroup * memcg,
+				int nid,
+				unsigned long anon_prio,
+				unsigned long file_prio,
+				int lru_mask)
+{
+	u64 file, anon;
+	unsigned long weight, mask;
+	unsigned long rotated[2], scanned[2];
+	int zid;
+
+	scanned[0] = 0;
+	scanned[1] = 0;
+	rotated[0] = 0;
+	rotated[1] = 0;
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		struct mem_cgroup_per_zone *mz;
+
+		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+		scanned[0] += mz->reclaim_stat.recent_scanned[0];
+		scanned[1] += mz->reclaim_stat.recent_scanned[1];
+		rotated[0] += mz->reclaim_stat.recent_rotated[0];
+		rotated[1] += mz->reclaim_stat.recent_rotated[1];
+	}
+	file = mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask & LRU_ALL_FILE);
+
+	if (total_swap_pages)
+		anon = mem_cgroup_node_nr_lru_pages(memcg,
+					nid, lru_mask & LRU_ALL_ANON);
+	else
+		anon = 0;
+	if (!(file + anon))
+		node_clear(nid, memcg->scan_nodes);
+
+	/* '(scanned - rotated) / scanned' is the ratio of pages found not active. */
+	anon = anon * (scanned[0] - rotated[0]) / (scanned[0] + 1);
+	file = file * (scanned[1] - rotated[1]) / (scanned[1] + 1);
+
+	weight = (anon * anon_prio + file * file_prio) / 200;
+	return weight;
+}
+
+/*
+ * Calculate each NUMA node's scan weight. The scan weight is determined by
+ * the amount of pages, the recent scan ratio, and swappiness.
+ */
+static unsigned long
+mem_cgroup_calc_numascan_weight(struct mem_cgroup *memcg)
+{
+	unsigned long weight, total_weight;
+	u64 anon_prio, file_prio, nr_anon, nr_file;
+	int lru_mask;
+	int nid;
+
+	anon_prio = mem_cgroup_swappiness(memcg) + 1;
+	file_prio = 200 - anon_prio + 1;
+
+	lru_mask = BIT(LRU_INACTIVE_FILE);
+	if (mem_cgroup_inactive_file_is_low(memcg))
+		lru_mask |= BIT(LRU_ACTIVE_FILE);
+	/*
+	 * In vmscan.c, we'll scan anonymous pages with regard to memcg/zone's
+	 * amounts of file/anon pages and swappiness and reclaim_stat. Here,
+	 * we try to find good node to be scanned. If the memcg contains enough
+	 * file caches, we'll ignore anon's weight.
+	 * (Note) scanning anon-only node tends to be waste of time.
+	 */
+
+	nr_file = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
+	nr_anon = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
+
+	/* If file cache is small w.r.t swappiness, check anon page's weight */
+	if (nr_file * file_prio < nr_anon * anon_prio)
+		lru_mask |= BIT(LRU_INACTIVE_ANON);
+
+	total_weight = 0;
+
+	for_each_node_state(nid, N_HIGH_MEMORY) {
+		weight = __mem_cgroup_calc_numascan_weight(memcg,
+				nid, anon_prio, file_prio, lru_mask);
+		memcg->info.nodeinfo[nid]->weight = weight;
+		total_weight += weight;
+	}
+
+	return total_weight;
+}
+
+/*
+ * Update all nodes' scan weights in the background.
+ */
 static void mem_cgroup_numainfo_update_work(struct work_struct *work)
 {
 	struct mem_cgroup *memcg;
-	int nid;
 
 	memcg = container_of(work, struct mem_cgroup, numainfo_update_work);
 
 	memcg->scan_nodes = node_states[N_HIGH_MEMORY];
-	for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
-		if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
-			node_clear(nid, memcg->scan_nodes);
-	}
+
+	memcg->total_weight = mem_cgroup_calc_numascan_weight(memcg);
+
 	atomic_set(&memcg->numainfo_updating, 0);
 	css_put(&memcg->css);
 }
@@ -4212,6 +4304,14 @@ static int mem_control_numa_stat_show(st
 		seq_printf(m, " N%d=%lu", nid, node_nr);
 	}
 	seq_putc(m, '\n');
+
+	seq_printf(m, "scan_weight=%lu", mem_cont->total_weight);
+	for_each_node_state(nid, N_HIGH_MEMORY) {
+		unsigned long weight;
+		weight = mem_cont->info.nodeinfo[nid]->weight;
+		seq_printf(m, " N%d=%lu", nid, weight);
+	}
+	seq_putc(m, '\n');
 	return 0;
 }
 #endif /* CONFIG_NUMA */



* [PATCH v4 5/5] memcg : select a victim node by weights
  2011-07-27  5:44 [PATCH v4 0/5] memcg : make numa scanning better KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2011-07-27  5:49 ` [PATCH v4 4/5] memcg : calculate node scan weight KAMEZAWA Hiroyuki
@ 2011-07-27  5:51 ` KAMEZAWA Hiroyuki
  2011-07-27  5:52 ` [PATCH v4 6/5] memcg : check numa balance KAMEZAWA Hiroyuki
  5 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27  5:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, Michal Hocko,
	nishimura@mxp.nes.nec.co.jp


This patch implements a node selection logic based on each node's weight.

This patch adds a new array, nodescan_tickets[]. Each entry holds
one node's scan weight as a tuple of 2 values, filled in as follows:

    for (i = 0, total_weight = 0; i < nodes; i++) {
        weight = node->weight;
        nodescan_tickets[i].start = total_weight;
        nodescan_tickets[i].length = weight;
        total_weight += weight;
    }

After this, a lottery draws 'ticket = random32() % total_weight',
and bsearch(ticket, nodescan_tickets[]) finds the node whose
[start, start + length) range contains the ticket.
(This is lottery scheduling.)

By this, a node is selected in a fair manner, proportional to
its weight.
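
As a rough illustration of the lottery described above, here is a minimal
userspace C sketch. It is not the kernel code in this patch: the names
ticket_range, build_ranges, ticket_cmp and pick_node are hypothetical and
exist only for this example, it uses libc bsearch() instead of the kernel
one, and it assumes each node owns the half-open range
[start, start + tickets).

```c
#include <stdlib.h>

/* Hypothetical illustration type: one contiguous range of tickets per node. */
struct ticket_range {
	int nid;
	unsigned int start;   /* first ticket owned by this node */
	unsigned int tickets; /* number of tickets owned by this node */
};

/* Fill ranges[] from per-node weights; returns the total ticket count. */
static unsigned int build_ranges(struct ticket_range *ranges,
				 const unsigned int *weights, int n)
{
	unsigned int total = 0;
	int i;

	for (i = 0; i < n; i++) {
		ranges[i].nid = i;
		ranges[i].start = total;
		ranges[i].tickets = weights[i];
		total += weights[i];
	}
	return total;
}

/* bsearch() comparator: key points at the drawn ticket number. */
static int ticket_cmp(const void *key, const void *elt)
{
	unsigned int lottery = *(const unsigned int *)key;
	const struct ticket_range *r = elt;

	if (lottery < r->start)
		return -1;
	if (lottery >= r->start + r->tickets)
		return 1;
	return 0;
}

/* Returns the node id whose range holds the ticket, or -1 if none does. */
static int pick_node(const struct ticket_range *ranges, int n,
		     unsigned int ticket)
{
	const struct ticket_range *r;

	r = bsearch(&ticket, ranges, n, sizeof(*ranges), ticket_cmp);
	return r ? r->nid : -1;
}
```

A caller would draw a ticket uniformly from [0, total), for example with
rand() % total in userspace, and pass it to pick_node(). A node with three
times the weight of another then owns three times as many tickets and is
picked about three times as often.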

This patch improves the scan time. The following is a test result
of apache bench on 2-node fake-numa. In this test, almost all
pages are file cache, so too much scanning of anon pages and swap-out
is harmful. (The result itself was measured with the patches that
follow this one applied.)

   Working set: 600Mbytes random access in normalized distribution
   Memory Limit: 300MBytes

   <before patch>
   Connection Times (ms)
                 min  mean[+/-sd] median   max
   Connect:        0    0   0.1      0       1
   Processing:    41   48  15.0     46    1161
   Waiting:       40   46  10.5     44     623
   Total:         41   48  15.0     46    1161

   memory.vmscan_stat
   scanned_pages_by_limit 410693
   elapsed_ns_by_limit 2393975561

   <after patch>
   Connection Times (ms)
                 min  mean[+/-sd] median   max
   Connect:        0    0   0.0      0       1
   Processing:    41   46   7.5     45     706
   Waiting:       39   45   6.4     44     630
   Total:         41   46   7.5     45     706

   scanned_pages_by_limit 302282
   elapsed_ns_by_limit 1312758481

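vmscan time is much reduced: elapsed_ns_by_limit drops by about 45%
(2393975561 -> 1312758481) and scanned_pages_by_limit by about 26%
(410693 -> 302282).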

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    3 
 mm/memcontrol.c            |  149 ++++++++++++++++++++++++++++++++++++++-------
 mm/vmscan.c                |    4 -
 3 files changed, 130 insertions(+), 26 deletions(-)

Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -48,6 +48,9 @@
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/random.h>
+#include <linux/bsearch.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -150,6 +153,11 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+struct numascan_ticket {
+	int nid;
+	unsigned int start, tickets;
+};
+
 /*
  * Cgroups above their limits are maintained in a RB-Tree, independent of
  * their hierarchy representation
@@ -286,7 +294,10 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 	struct work_struct	numainfo_update_work;
-	unsigned long total_weight;
+	unsigned long	total_weight;
+	int		numascan_generation;
+	int		numascan_tickets_num[2];
+	struct numascan_ticket *numascan_tickets[2];
 #endif
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
@@ -1644,6 +1655,46 @@ mem_cgroup_calc_numascan_weight(struct m
 }
 
 /*
+ * For lottery scheduling, this routine distributes "tickets" for
+ * scanning to each node. Tickets are recorded in the numascan_ticket
+ * array, and this array is used for scheduling later.
+ * To make the lottery fair, we limit the sum of tickets to almost 0xffff.
+ * Later, random() & 0xffff will do a proportionally fair lottery.
+ */
+#define NUMA_TICKET_SHIFT	(16)
+#define NUMA_TICKET_FACTOR	((1 << NUMA_TICKET_SHIFT) - 1)
+static void mem_cgroup_update_numascan_tickets(struct mem_cgroup *memcg)
+{
+	struct numascan_ticket *nt;
+	unsigned int node_ticket, assigned_tickets;
+	u64 weight;
+	int nid, assigned_num, generation;
+
+	/* update ticket information by double buffering */
+	generation = memcg->numascan_generation ^ 0x1;
+
+	nt = memcg->numascan_tickets[generation];
+	assigned_tickets = 0;
+	assigned_num = 0;
+	for_each_node_mask(nid, memcg->scan_nodes) {
+		weight = memcg->info.nodeinfo[nid]->weight;
+		node_ticket = div64_u64(weight << NUMA_TICKET_SHIFT,
+					memcg->total_weight + 1);
+		if (!node_ticket)
+			node_ticket = 1;
+		nt->nid = nid;
+		nt->start = assigned_tickets;
+		nt->tickets = node_ticket;
+		assigned_tickets += node_ticket;
+		nt++;
+		assigned_num++;
+	}
+	memcg->numascan_tickets_num[generation] = assigned_num;
+	smp_wmb();
+	memcg->numascan_generation = generation;
+}
+
+/*
  * Update all node's scan weight in background.
  */
 static void mem_cgroup_numainfo_update_work(struct work_struct *work)
@@ -1656,6 +1707,8 @@ static void mem_cgroup_numainfo_update_w
 
 	memcg->total_weight = mem_cgroup_calc_numascan_weight(memcg);
 
+	synchronize_rcu();
+	mem_cgroup_update_numascan_tickets(memcg);
 	atomic_set(&memcg->numainfo_updating, 0);
 	css_put(&memcg->css);
 }
@@ -1682,6 +1735,18 @@ static void mem_cgroup_may_update_nodema
 	schedule_work(&mem->numainfo_update_work);
 }
 
+static int node_weight_compare(const void *key, const void *elt)
+{
+	unsigned long lottery = (unsigned long)key;
+	struct numascan_ticket *nt = (struct numascan_ticket *)elt;
+
+	if (lottery < nt->start)
+		return -1;
+	if (lottery >= (nt->start + nt->tickets))
+		return 1;
+	return 0;
+}
+
 /*
  * Selecting a node where we start reclaim from. Because what we need is just
  * reducing usage counter, start from anywhere is O,K. Considering
@@ -1691,32 +1756,38 @@ static void mem_cgroup_may_update_nodema
  * we'll use or we've used. So, it may make LRU bad. And if several threads
  * hit limits, it will see a contention on a node. But freeing from remote
  * node means more costs for memory reclaim because of memory latency.
- *
- * Now, we use round-robin. Better algorithm is welcomed.
  */
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
+				struct memcg_scanrecord *rec)
 {
-	int node;
+	int node = MAX_NUMNODES;
+	struct numascan_ticket *nt;
+	unsigned long lottery;
+	int generation;
 
+	if (rec->context == SCAN_BY_SHRINK)
+		goto out;
+
+	mem_cgroup_may_update_nodemask(memcg);
 	*mask = NULL;
-	mem_cgroup_may_update_nodemask(mem);
-	node = mem->last_scanned_node;
+	lottery = random32() & NUMA_TICKET_FACTOR;
 
-	node = next_node(node, mem->scan_nodes);
-	if (node == MAX_NUMNODES)
-		node = first_node(mem->scan_nodes);
-	/*
-	 * We call this when we hit limit, not when pages are added to LRU.
-	 * No LRU may hold pages because all pages are UNEVICTABLE or
-	 * memcg is too small and all pages are not on LRU. In that case,
-	 * we use curret node.
-	 */
-	if (unlikely(node == MAX_NUMNODES))
+	rcu_read_lock();
+	generation = memcg->numascan_generation;
+	nt = bsearch((void *)lottery,
+		memcg->numascan_tickets[generation],
+		memcg->numascan_tickets_num[generation],
+		sizeof(struct numascan_ticket), node_weight_compare);
+	rcu_read_unlock();
+	if (nt)
+		node = nt->nid;
+out:
+	if (unlikely(node == MAX_NUMNODES)) {
 		node = numa_node_id();
-	else
-		*mask = &mem->scan_nodes;
+		*mask = NULL;
+	} else
+		*mask = &memcg->scan_nodes;
 
-	mem->last_scanned_node = node;
 	return node;
 }
 
@@ -1755,14 +1826,42 @@ bool mem_cgroup_reclaimable(struct mem_c
 	return false;
 }
 
-static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
+static bool mem_cgroup_numascan_init(struct mem_cgroup *memcg)
 {
+	struct numascan_ticket *nt;
+	int nr_nodes;
+
 	INIT_WORK(&memcg->numainfo_update_work,
 		mem_cgroup_numainfo_update_work);
+
+	nr_nodes = num_possible_nodes();
+	nt = kmalloc(sizeof(struct numascan_ticket) * nr_nodes,
+			GFP_KERNEL);
+	if (!nt)
+		return false;
+	memcg->numascan_tickets[0] = nt;
+	nt = kmalloc(sizeof(struct numascan_ticket) * nr_nodes,
+			GFP_KERNEL);
+	if (!nt) {
+		kfree(memcg->numascan_tickets[0]);
+		memcg->numascan_tickets[0] = NULL;
+		return false;
+	}
+	memcg->numascan_tickets[1] = nt;
+	memcg->numascan_tickets_num[0] = 0;
+	memcg->numascan_tickets_num[1] = 0;
+	return true;
+}
+
+static void mem_cgroup_numascan_free(struct mem_cgroup *memcg)
+{
+	kfree(memcg->numascan_tickets[0]);
+	kfree(memcg->numascan_tickets[1]);
 }
 
 #else
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
+int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask,
+				struct memcg_scanrecord *rec)
 {
 	*mask = NULL;
 	return 0;
@@ -1775,6 +1874,9 @@ bool mem_cgroup_reclaimable(struct mem_c
 static void mem_cgroup_numascan_init(struct mem_cgroup *memcg)
 {
 }
+static void mem_cgroup_numascan_free(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 static void __mem_cgroup_record_scanstat(unsigned long *stats,
@@ -5015,6 +5117,7 @@ static void __mem_cgroup_free(struct mem
 	int node;
 
 	mem_cgroup_remove_from_trees(mem);
+	mem_cgroup_numascan_free(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -5153,7 +5256,8 @@ mem_cgroup_create(struct cgroup_subsys *
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
 	spin_lock_init(&mem->scanstat.lock);
-	mem_cgroup_numascan_init(mem);
+	if (!mem_cgroup_numascan_init(mem))
+		goto free_out;
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -2313,9 +2313,9 @@ unsigned long try_to_free_mem_cgroup_pag
 	 * take care of from where we get pages. So the node where we start the
 	 * scan does not need to be the current node.
 	 */
-	nid = mem_cgroup_select_victim_node(mem_cont, &mask);
+	nid = mem_cgroup_select_victim_node(mem_cont, &mask, rec);
 
-	zonelist = NODE_DATA(nid)->node_zonelists;
+	zonelist = &NODE_DATA(nid)->node_zonelists[0];
 
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
Index: mmotm-0710/include/linux/memcontrol.h
===================================================================
--- mmotm-0710.orig/include/linux/memcontrol.h
+++ mmotm-0710/include/linux/memcontrol.h
@@ -117,7 +117,8 @@ extern void mem_cgroup_end_migration(str
  */
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask);
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
+				struct memcg_scanrecord *rec);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,


* [PATCH v4 6/5] memcg : check numa balance
  2011-07-27  5:44 [PATCH v4 0/5] memcg : make numa scanning better KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2011-07-27  5:51 ` [PATCH v4 5/5] memcg : select a victim node by weights KAMEZAWA Hiroyuki
@ 2011-07-27  5:52 ` KAMEZAWA Hiroyuki
  5 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-27  5:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, Michal Hocko,
	nishimura@mxp.nes.nec.co.jp

This patch was required for handling numa-unbalanced memcg.
==
Because do_try_to_free_pages() scans nodes based on the zonelist,
even if we select a victim node, we may scan other nodes.

When the nodes are balanced, this is good because we'll quit the scan loop
before updating 'priority'. But when the nodes are unbalanced,
it will force scanning of very small nodes and will cause
swap-out when a node doesn't contain enough file cache.

This patch selects the zonelist[] used as the vmscan scan list for memcg.
If memcg is well balanced among nodes, the usual fallback (and mask) is used.
If not, it selects the node-local zonelist and does targeted reclaim.

This will reduce unnecessary (anon page) scans when memcg is not balanced.

Now, memcg/NUMA is considered balanced when each node's weight is between
 80% and 120% of the average node weight.
 (*) This value is just a magic number but works well in several tests.
     Further study to determine this value is appreciated.
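
To make the 80%-120% rule concrete, here is a minimal userspace C sketch of
the balance check. It is only an illustration under stated assumptions: the
function name weights_balanced and the BALANCE_*_PCT macros are hypothetical,
and it uses the plain arithmetic mean, whereas the patch below divides
total_weight by (number of scan nodes + 1).

```c
#include <stdbool.h>

/* Hypothetical names; the values mirror the 80% and 120% bounds above. */
#define BALANCE_LOW_PCT  80
#define BALANCE_HIGH_PCT 120

/*
 * Returns true when every node's weight lies within
 * [80%, 120%] of the average weight, false otherwise.
 */
static bool weights_balanced(const unsigned long *weights, int n)
{
	unsigned long total = 0, average, low, high;
	int i;

	if (n <= 0)
		return true; /* nothing to compare */

	for (i = 0; i < n; i++)
		total += weights[i];

	average = total / n; /* assumption: plain mean, unlike the patch */
	low = BALANCE_LOW_PCT * average / 100;
	high = BALANCE_HIGH_PCT * average / 100;

	for (i = 0; i < n; i++)
		if (weights[i] < low || weights[i] > high)
			return false;
	return true;
}
```

When such a check returns true, the usual zonelist with fallback would be
used; when it returns false, the node-local zonelist would be chosen so that
reclaim stays on the selected victim node.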

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    2 +-
 mm/memcontrol.c            |   20 ++++++++++++++++++--
 mm/vmscan.c                |    8 ++++++--
 3 files changed, 25 insertions(+), 5 deletions(-)

Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -295,6 +295,7 @@ struct mem_cgroup {
 	atomic_t	numainfo_updating;
 	struct work_struct	numainfo_update_work;
 	unsigned long	total_weight;
+	bool		numascan_balance;
 	int		numascan_generation;
 	int		numascan_tickets_num[2];
 	struct numascan_ticket *numascan_tickets[2];
@@ -1663,12 +1664,15 @@ mem_cgroup_calc_numascan_weight(struct m
  */
 #define NUMA_TICKET_SHIFT	(16)
 #define NUMA_TICKET_FACTOR	((1 << NUMA_TICKET_SHIFT) - 1)
+#define NUMA_BALANCE_RANGE_LOW	(80)
+#define NUMA_BALANCE_RANGE_HIGH	(120)
 static void mem_cgroup_update_numascan_tickets(struct mem_cgroup *memcg)
 {
 	struct numascan_ticket *nt;
 	unsigned int node_ticket, assigned_tickets;
 	u64 weight;
 	int nid, assigned_num, generation;
+	unsigned long average, balance_low, balance_high;
 
 	/* update ticket information by double buffering */
 	generation = memcg->numascan_generation ^ 0x1;
@@ -1676,6 +1680,11 @@ static void mem_cgroup_update_numascan_t
 	nt = memcg->numascan_tickets[generation];
 	assigned_tickets = 0;
 	assigned_num = 0;
+	average = memcg->total_weight / (nodes_weight(memcg->scan_nodes) + 1);
+	balance_low = NUMA_BALANCE_RANGE_LOW * average / 100;
+	balance_high = NUMA_BALANCE_RANGE_HIGH * average / 100;
+	memcg->numascan_balance = true;
+
 	for_each_node_mask(nid, memcg->scan_nodes) {
 		weight = memcg->info.nodeinfo[nid]->weight;
 		node_ticket = div64_u64(weight << NUMA_TICKET_SHIFT,
@@ -1688,6 +1697,9 @@ static void mem_cgroup_update_numascan_t
 		assigned_tickets += node_ticket;
 		nt++;
 		assigned_num++;
+		if ((weight < balance_low) ||
+		    (weight > balance_high))
+			memcg->numascan_balance = false;
 	}
 	memcg->numascan_tickets_num[generation] = assigned_num;
 	smp_wmb();
@@ -1758,7 +1770,7 @@ static int node_weight_compare(const voi
  * node means more costs for memory reclaim because of memory latency.
  */
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
-				struct memcg_scanrecord *rec)
+				struct memcg_scanrecord *rec, bool *fallback)
 {
 	int node = MAX_NUMNODES;
 	struct numascan_ticket *nt;
@@ -1785,8 +1797,11 @@ out:
 	if (unlikely(node == MAX_NUMNODES)) {
 		node = numa_node_id();
 		*mask = NULL;
-	} else
+		*fallback = true;
+	} else {
 		*mask = &memcg->scan_nodes;
+		*fallback = memcg->numascan_balance;
+	}
 
 	return node;
 }
@@ -1864,6 +1879,7 @@ int mem_cgroup_select_victim_node(struct
 				struct memcg_scanrecord *rec)
 {
 	*mask = NULL;
+	*fallback = true;
 	return 0;
 }
 
Index: mmotm-0710/include/linux/memcontrol.h
===================================================================
--- mmotm-0710.orig/include/linux/memcontrol.h
+++ mmotm-0710/include/linux/memcontrol.h
@@ -118,7 +118,7 @@ extern void mem_cgroup_end_migration(str
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
-				struct memcg_scanrecord *rec);
+				struct memcg_scanrecord *rec, bool *fallback);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -2290,6 +2290,7 @@ unsigned long try_to_free_mem_cgroup_pag
 	unsigned long nr_reclaimed;
 	unsigned long start, end;
 	int nid;
+	bool fallback;
 	nodemask_t *mask;
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
@@ -2313,9 +2314,12 @@ unsigned long try_to_free_mem_cgroup_pag
 	 * take care of from where we get pages. So the node where we start the
 	 * scan does not need to be the current node.
 	 */
-	nid = mem_cgroup_select_victim_node(mem_cont, &mask, rec);
+	nid = mem_cgroup_select_victim_node(mem_cont, &mask, rec, &fallback);
 
-	zonelist = &NODE_DATA(nid)->node_zonelists[0];
+	if (fallback) /* memcg/NUMA is balanced and fallback works well */
+		zonelist = &NODE_DATA(nid)->node_zonelists[0];
+	else /* memcg/NUMA is not balanced, do target reclaim */
+		zonelist = &NODE_DATA(nid)->node_zonelists[1];
 
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,


end of thread, other threads:[~2011-08-02  2:29 UTC | newest]

Thread overview: 11+ messages
2011-07-27  5:44 [PATCH v4 0/5] memcg : make numa scanning better KAMEZAWA Hiroyuki
2011-07-27  5:46 ` [PATCH v4 1/5] memcg : update numascan info by schedule_work KAMEZAWA Hiroyuki
2011-07-27  5:47 ` [PATCH v4 2/5] memcg : pass scan nodemask KAMEZAWA Hiroyuki
2011-08-01 13:59   ` Michal Hocko
2011-08-02  2:21     ` KAMEZAWA Hiroyuki
2011-07-27  5:49 ` [PATCH v4 3/5] memcg : stop scanning if enough KAMEZAWA Hiroyuki
2011-08-01 14:37   ` Michal Hocko
2011-08-01 19:49     ` Michal Hocko
2011-07-27  5:49 ` [PATCH v4 4/5] memcg : calculate node scan weight KAMEZAWA Hiroyuki
2011-07-27  5:51 ` [PATCH v4 5/5] memcg : select a victim node by weights KAMEZAWA Hiroyuki
2011-07-27  5:52 ` [PATCH v4 6/5] memcg : check numa balance KAMEZAWA Hiroyuki
