* [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge
@ 2009-10-02 4:55 KAMEZAWA Hiroyuki
2009-10-02 5:01 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation KAMEZAWA Hiroyuki
` (4 more replies)
0 siblings, 5 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-02 4:55 UTC (permalink / raw)
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
Hi,
This patch set is against mmotm + the softlimit fix patches
(which are now in the -rc git tree).
In the latest -rc series, the kernel avoids accessing res_counter when
the cgroup is the root cgroup. This helps scalability when memcg is not used.
It's also necessary to improve scalability when memcg is used; this patch set
is for that. Balbir's previous work shows that the biggest obstacle to
better scalability is memcg's res_counter. There are 2 ways:
(1) make counter scale well.
(2) avoid accessing core counter as much as possible.
My first direction was (1). But no counter is free from false sharing
when it needs system-wide fine-grained synchronization, and res_counter
carries several pieces of functionality... this makes (1) difficult.
A spin_lock (in the slow path) around the counter means tons of cache-line
invalidations happen even when we just read the counter without modifying it.
This patch series is for (2). It implements charge/uncharge in a batched manner,
coalescing accesses to res_counter at charge/uncharge by exploiting the
locality of those accesses.
Tested for a month, and I got good reports from Balbir and Nishimura, thanks.
One concern is that this adds some members to the bottom of task_struct.
A better idea is welcome.
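For illustration only, here is a minimal user-space sketch of idea (2), batching
updates to a shared counter. It is not part of the series and the names
(shared_usage, uncharge_batched, ...) are made up: instead of one shared-counter
update per page, a thread accumulates locally and touches the shared counter
once per batch, which is what the patches below do for res_counter.
==
/* Illustration only: coalescing updates to a shared counter (user space). */
#include <stdatomic.h>
#include <stdio.h>

#define PAGE_SIZE	4096
#define NPAGES		512

static atomic_long shared_usage;	/* stands in for res_counter usage */

/* one shared-counter update per page: heavy cache-line traffic */
static void uncharge_one_by_one(void)
{
	int i;

	for (i = 0; i < NPAGES; i++)
		atomic_fetch_sub(&shared_usage, PAGE_SIZE);
}

/* accumulate locally, touch the shared counter once per batch */
static void uncharge_batched(void)
{
	long batch = 0;
	int i;

	for (i = 0; i < NPAGES; i++)
		batch += PAGE_SIZE;		/* local, no contention */
	atomic_fetch_sub(&shared_usage, batch);	/* single shared update */
}

int main(void)
{
	atomic_store(&shared_usage, 2L * NPAGES * PAGE_SIZE);
	uncharge_one_by_one();
	uncharge_batched();
	printf("usage now: %ld\n", atomic_load(&shared_usage));
	return 0;
}
==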
Following is test result of continuous page-fault on my 8cpu box(x86-64).
A loop like this runs on all cpus in parallel for 60secs.
==
while (1) {
	x = mmap(NULL, MEGA, PROT_READ|PROT_WRITE,
		 MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	for (off = 0; off < MEGA; off += PAGE_SIZE)
		x[off] = 0;
	munmap(x, MEGA);
}
==
Please see the number of page faults; I think this is a good improvement.
[Before]
Performance counter stats for './runpause.sh' (5 runs):
474539.756944 task-clock-msecs # 7.890 CPUs ( +- 0.015% )
10284 context-switches # 0.000 M/sec ( +- 0.156% )
12 CPU-migrations # 0.000 M/sec ( +- 0.000% )
18425800 page-faults # 0.039 M/sec ( +- 0.107% )
1486296285360 cycles # 3132.080 M/sec ( +- 0.029% )
380334406216 instructions # 0.256 IPC ( +- 0.058% )
3274206662 cache-references # 6.900 M/sec ( +- 0.453% )
1272947699 cache-misses # 2.682 M/sec ( +- 0.118% )
60.147907341 seconds time elapsed ( +- 0.010% )
[After]
Performance counter stats for './runpause.sh' (5 runs):
474658.997489 task-clock-msecs # 7.891 CPUs ( +- 0.006% )
10250 context-switches # 0.000 M/sec ( +- 0.020% )
11 CPU-migrations # 0.000 M/sec ( +- 0.000% )
33177858 page-faults # 0.070 M/sec ( +- 0.152% )
1485264748476 cycles # 3129.120 M/sec ( +- 0.021% )
409847004519 instructions # 0.276 IPC ( +- 0.123% )
3237478723 cache-references # 6.821 M/sec ( +- 0.574% )
1182572827 cache-misses # 2.491 M/sec ( +- 0.179% )
60.151786309 seconds time elapsed ( +- 0.014% )
Regards,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation
2009-10-02 4:55 [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge KAMEZAWA Hiroyuki
@ 2009-10-02 5:01 ` KAMEZAWA Hiroyuki
2009-10-02 6:47 ` Hiroshi Shimamoto
2009-10-09 4:01 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation Balbir Singh
2009-10-02 5:03 ` [PATCH 2/2] memcg: coalescing charges per cpu KAMEZAWA Hiroyuki
` (3 subsequent siblings)
4 siblings, 2 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-02 5:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
In a massively parallel environment, res_counter can be a performance bottleneck.
One strong technique for reducing lock contention is to reduce the number of
calls by coalescing several calls into one.
Considering the charge/uncharge characteristics:
- charge is done one by one via demand paging.
- uncharge is done
- in chunks at munmap, truncate, exit, execve...
- one by one via vmscan/paging.
It seems we have a chance to coalesce uncharge at unmap/truncation.
This patch is for coalescing uncharge. To avoid scattering memcg's
structures into functions under mm/, this patch adds memcg batch-uncharge
information to the task.
The degree of coalescing depends on the caller:
- at invalidate/truncate... pagevec size
- at unmap ... ZAP_BLOCK_SIZE
(memory itself is freed in chunks of this size.)
So we will not coalesce too much.
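For illustration, the intended call pattern looks like the sketch below;
uncharge_pages_in_chunk() is a made-up stand-in for the real callers (the
pagevec loops in mm/truncate.c and the pgd loop in unmap_page_range()),
which are patched further down.
==
/* Sketch only: how a caller brackets a chunk of per-page uncharges. */
static void uncharge_pages_in_chunk(struct page **pages, int nr)
{
	int i;

	mem_cgroup_uncharge_start();	/* start coalescing on this task */
	for (i = 0; i < nr; i++)
		/* typically just accumulates into current->memcg_batch */
		mem_cgroup_uncharge_page(pages[i]);
	mem_cgroup_uncharge_end();	/* res_counter is hit once for the batch */
}
==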
Changelog(now):
- rebased onto the latest mmotm + softlimit fix patches.
Changelog(old):
- unified patch for callers
- added comments.
- made ->do_batch a bool.
- removed css_get() et al. We don't need it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 13 ++++++
include/linux/sched.h | 7 +++
mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
mm/memory.c | 2
mm/truncate.c | 6 ++
5 files changed, 113 insertions(+), 6 deletions(-)
Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
+++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
@@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
enum lru_list from, enum lru_list to);
+
+/* For coalescing uncharge to reduce memcg overhead */
+extern void mem_cgroup_uncharge_start(void);
+extern void mem_cgroup_uncharge_end(void);
+
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shmem_charge_fallback(struct page *page,
@@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
{
}
+static inline void mem_cgroup_uncharge_batch_start(void)
+{
+}
+
+static inline void mem_cgroup_uncharge_batch_start(void)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
css_put(&mem->css);
}
+static void
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+{
+ struct memcg_batch_info *batch = NULL;
+ bool uncharge_memsw = true;
+ /* If swapout, usage of swap doesn't decrease */
+ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+ uncharge_memsw = false;
+ /*
+ * do_batch > 0 when unmapping pages or inode invalidate/truncate.
+ * In those cases, all pages freed continuously can be expected to be in
+ * the same cgroup, and we have a chance to coalesce uncharges.
+ * And we do uncharge one by one if this task is being killed by OOM.
+ */
+ if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
+ goto direct_uncharge;
+
+ batch = &current->memcg_batch;
+ /*
+ * Usually, we do css_get() when we remember the memcg pointer.
+ * But in this case, we keep res->usage until the end of a series of
+ * uncharges. Then, it's OK to ignore memcg's refcnt.
+ */
+ if (!batch->memcg)
+ batch->memcg = mem;
+ /*
+ * In the typical case, batch->memcg == mem. This means we can
+ * merge a series of uncharges into one uncharge of the res_counter.
+ * If not, we uncharge the res_counter one by one.
+ */
+ if (batch->memcg != mem)
+ goto direct_uncharge;
+ /* remember freed charge and uncharge it later */
+ batch->pages += PAGE_SIZE;
+ if (uncharge_memsw)
+ batch->memsw += PAGE_SIZE;
+ return;
+direct_uncharge:
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (uncharge_memsw)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ return;
+}
/*
* uncharge if !page_mapped(page)
@@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
break;
}
- if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (do_swap_account &&
- (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
- }
+ if (!mem_cgroup_is_root(mem))
+ __do_uncharge(mem, ctype);
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
@@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
+/*
+ * batch_start/batch_end is called in unmap_page_range/invalidate/truncate.
+ * In those cases, pages are freed continuously and we can expect the pages
+ * to be in the same memcg. Each of these callers itself limits the number of
+ * pages freed at once, so uncharge_start/end() is called at a proper granularity.
+ */
+
+void mem_cgroup_uncharge_start(void)
+{
+ if (!current->memcg_batch.do_batch) {
+ current->memcg_batch.memcg = NULL;
+ current->memcg_batch.pages = 0;
+ current->memcg_batch.memsw = 0;
+ }
+ current->memcg_batch.do_batch++;
+}
+
+void mem_cgroup_uncharge_end(void)
+{
+ struct mem_cgroup *mem;
+
+ if (!current->memcg_batch.do_batch)
+ return;
+
+ current->memcg_batch.do_batch--;
+ if (current->memcg_batch.do_batch) /* Nested ? */
+ return;
+
+ mem = current->memcg_batch.memcg;
+ if (!mem)
+ return;
+ /* This "mem" is valid bacause we hide charges behind us. */
+ if (current->memcg_batch.pages)
+ res_counter_uncharge(&mem->res, current->memcg_batch.pages);
+ if (current->memcg_batch.memsw)
+ res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
+ /* Not strictly necessary, but forget this pointer */
+ current->memcg_batch.memcg = NULL;
+}
+
#ifdef CONFIG_SWAP
/*
* called after __delete_from_swap_cache() and drop "page" account.
Index: mmotm-2.6.31-Sep28/include/linux/sched.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
+++ mmotm-2.6.31-Sep28/include/linux/sched.h
@@ -1549,6 +1549,13 @@ struct task_struct {
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
unsigned long stack_start;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
+ struct memcg_batch_info {
+ int do_batch;
+ struct mem_cgroup *memcg;
+ long pages, memsw;
+ } memcg_batch;
+#endif
};
/* Future-safe accessor for struct task_struct's cpus_allowed. */
Index: mmotm-2.6.31-Sep28/mm/memory.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memory.c
+++ mmotm-2.6.31-Sep28/mm/memory.c
@@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
details = NULL;
BUG_ON(addr >= end);
+ mem_cgroup_uncharge_start();
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
zap_work, details);
} while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
+ mem_cgroup_uncharge_end();
return addr;
}
Index: mmotm-2.6.31-Sep28/mm/truncate.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/truncate.c
+++ mmotm-2.6.31-Sep28/mm/truncate.c
@@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
pagevec_release(&pvec);
break;
}
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
@@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
}
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
pagevec_init(&pvec, 0);
while (next <= end &&
pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t index;
@@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
break;
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;
@@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
while (next <= end && !wrapped &&
pagevec_lookup(&pvec, mapping, next,
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
@@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 2/2] memcg: coalescing charges per cpu
2009-10-02 4:55 [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge KAMEZAWA Hiroyuki
2009-10-02 5:01 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation KAMEZAWA Hiroyuki
@ 2009-10-02 5:03 ` KAMEZAWA Hiroyuki
2009-10-08 22:26 ` Andrew Morton
2009-10-09 4:15 ` Balbir Singh
2009-10-02 8:53 ` [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge KAMEZAWA Hiroyuki
` (2 subsequent siblings)
4 siblings, 2 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-02 5:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This is a patch for coalescing accesses to res_counter at charge time by
percpu caching. At charge, memcg charges CHARGE_SIZE (32 pages in this patch)
at once and remembers the surplus in a percpu cache. Because it's a cache,
drain/flush is done when necessary.
This version uses a public percpu area.
Two benefits of using a public percpu area:
1. The sum of stocked charges in the system is limited by the number of cpus,
not by the number of memcgs. This gives better synchronization.
2. The drain code for flush/cpu-hotplug is very easy (and quick).
The most important point of this patch is that we never touch res_counter
in the fast path. res_counter is a system-wide shared counter which is modified
very frequently. We should avoid touching it as much as we can, to avoid
false sharing.
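For illustration, a simplified sketch of the charge fast path this patch
introduces is below; the real logic is in __mem_cgroup_try_charge() in the
patch, and reclaim, the memsw counter and error handling are omitted here.
charge_one_page_sketch() is just an illustrative name.
==
/* Simplified sketch only -- see __mem_cgroup_try_charge() in the patch. */
static int charge_one_page_sketch(struct mem_cgroup *mem)
{
	struct res_counter *fail_res;

	/* fast path: take PAGE_SIZE from this cpu's stock, no res_counter access */
	if (consume_stock(mem))
		return 0;

	/* slow path: charge CHARGE_SIZE (32 pages) to res_counter at once */
	if (res_counter_charge(&mem->res, CHARGE_SIZE, &fail_res))
		return -ENOMEM;

	/* keep one page for this fault, stock the rest on this cpu */
	refill_stock(mem, CHARGE_SIZE - PAGE_SIZE);
	return 0;
}
==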
Changelog (new):
- rebased onto the latest mmotm
Changelog (old):
- moved the charge size check before the __GFP_WAIT check for avoiding unnecessary
- added an asynchronous flush routine.
- fixed bugs pointed out by Nishimura-san.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 126 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 120 insertions(+), 6 deletions(-)
Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -38,6 +38,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/cpu.h>
#include "internal.h"
#include <asm/uaccess.h>
@@ -275,6 +276,7 @@ enum charge_type {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
+static void drain_all_stock_async(void);
static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1137,6 +1139,8 @@ static int mem_cgroup_hierarchical_recla
victim = mem_cgroup_select_victim(root_mem);
if (victim == root_mem) {
loop++;
+ if (loop >= 1)
+ drain_all_stock_async();
if (loop >= 2) {
/*
* If we have not been able to reclaim
@@ -1258,6 +1262,103 @@ done:
unlock_page_cgroup(pc);
}
+/* size of first charge trial. "32" comes from vmscan.c's magic value */
+#define CHARGE_SIZE (32 * PAGE_SIZE)
+struct memcg_stock_pcp {
+ struct mem_cgroup *cached;
+ int charge;
+ struct work_struct work;
+};
+static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
+static DEFINE_MUTEX(memcg_drain_mutex);
+
+static bool consume_stock(struct mem_cgroup *mem)
+{
+ struct memcg_stock_pcp *stock;
+ bool ret = true;
+
+ stock = &get_cpu_var(memcg_stock);
+ if (mem == stock->cached && stock->charge)
+ stock->charge -= PAGE_SIZE;
+ else
+ ret = false;
+ put_cpu_var(memcg_stock);
+ return ret;
+}
+
+static void drain_stock(struct memcg_stock_pcp *stock)
+{
+ struct mem_cgroup *old = stock->cached;
+
+ if (stock->charge) {
+ res_counter_uncharge(&old->res, stock->charge);
+ if (do_swap_account)
+ res_counter_uncharge(&old->memsw, stock->charge);
+ }
+ stock->cached = NULL;
+ stock->charge = 0;
+}
+
+static void drain_local_stock(struct work_struct *dummy)
+{
+ struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
+ drain_stock(stock);
+ put_cpu_var(memcg_stock);
+}
+
+static void refill_stock(struct mem_cgroup *mem, int val)
+{
+ struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
+
+ if (stock->cached != mem) {
+ drain_stock(stock);
+ stock->cached = mem;
+ }
+ stock->charge += val;
+ put_cpu_var(memcg_stock);
+}
+
+static void drain_all_stock_async(void)
+{
+ int cpu;
+ /* Contention means someone tries to flush. */
+ if (!mutex_trylock(&memcg_drain_mutex))
+ return;
+ for_each_online_cpu(cpu) {
+ struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
+ if (work_pending(&stock->work))
+ continue;
+ INIT_WORK(&stock->work, drain_local_stock);
+ schedule_work_on(cpu, &stock->work);
+ }
+ mutex_unlock(&memcg_drain_mutex);
+ /* We don't wait for flush_work */
+}
+
+static void drain_all_stock_sync(void)
+{
+ /* called when force_empty is called */
+ mutex_lock(&memcg_drain_mutex);
+ schedule_on_each_cpu(drain_local_stock);
+ mutex_unlock(&memcg_drain_mutex);
+}
+
+static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
+ unsigned long action,
+ void *hcpu)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+ int cpu = (unsigned long)hcpu;
+ struct memcg_stock_pcp *stock;
+
+ if (action != CPU_DEAD)
+ return NOTIFY_OK;
+ stock = &per_cpu(memcg_stock, cpu);
+ drain_stock(stock);
+#endif
+ return NOTIFY_OK;
+}
+
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
* oom-killer can be invoked.
@@ -1269,6 +1370,7 @@ static int __mem_cgroup_try_charge(struc
struct mem_cgroup *mem, *mem_over_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct res_counter *fail_res;
+ int csize = CHARGE_SIZE;
if (unlikely(test_thread_flag(TIF_MEMDIE))) {
/* Don't account this! */
@@ -1293,23 +1395,25 @@ static int __mem_cgroup_try_charge(struc
return 0;
VM_BUG_ON(css_is_removed(&mem->css));
+ if (mem_cgroup_is_root(mem))
+ goto done;
while (1) {
int ret = 0;
unsigned long flags = 0;
- if (mem_cgroup_is_root(mem))
- goto done;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+ if (consume_stock(mem))
+ goto charged;
+
+ ret = res_counter_charge(&mem->res, csize, &fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
- ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
- &fail_res);
+ ret = res_counter_charge(&mem->memsw, csize, &fail_res);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, csize);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -1318,6 +1422,11 @@ static int __mem_cgroup_try_charge(struc
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
res);
+ /* reduce request size and retry */
+ if (csize > PAGE_SIZE) {
+ csize = PAGE_SIZE;
+ continue;
+ }
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
@@ -1347,6 +1456,9 @@ static int __mem_cgroup_try_charge(struc
goto nomem;
}
}
+ if (csize > PAGE_SIZE)
+ refill_stock(mem, csize - PAGE_SIZE);
+charged:
/*
* Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
* if they exceeds softlimit.
@@ -2463,6 +2575,7 @@ move_account:
goto out;
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
+ drain_all_stock_sync();
ret = 0;
for_each_node_state(node, N_HIGH_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
@@ -3181,6 +3294,7 @@ mem_cgroup_create(struct cgroup_subsys *
root_mem_cgroup = mem;
if (mem_cgroup_soft_limit_tree_init())
goto free_out;
+ hotcpu_notifier(memcg_stock_cpu_callback, 0);
} else {
parent = mem_cgroup_from_cont(cont->parent);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation
2009-10-02 5:01 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation KAMEZAWA Hiroyuki
@ 2009-10-02 6:47 ` Hiroshi Shimamoto
2009-10-02 6:53 ` Hiroshi Shimamoto
2009-10-02 7:02 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation (fixed compile bug) KAMEZAWA Hiroyuki
2009-10-09 4:01 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation Balbir Singh
1 sibling, 2 replies; 22+ messages in thread
From: Hiroshi Shimamoto @ 2009-10-02 6:47 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
KAMEZAWA Hiroyuki wrote:
> In massive parallel enviroment, res_counter can be a performance bottleneck.
> One strong techinque to reduce lock contention is reducing calls by
> coalescing some amount of calls into one.
>
> Considering charge/uncharge chatacteristic,
> - charge is done one by one via demand-paging.
> - uncharge is done by
> - in chunk at munmap, truncate, exit, execve...
> - one by one via vmscan/paging.
>
> It seems we have a chance in uncharge at unmap/truncation.
>
> This patch is a for coalescing uncharge. For avoiding scattering memcg's
> structure to functions under /mm, this patch adds memcg batch uncharge
> information to the task.
>
> The degree of coalescing depends on callers
> - at invalidate/trucate... pagevec size
> - at unmap ....ZAP_BLOCK_SIZE
> (memory itself will be freed in this degree.)
> Then, we'll not coalescing too much.
>
> Changelog(now):
> - rebased onto the latest mmotm + softlimit fix patches.
>
> Changelog(old):
> - unified patch for callers
> - added commetns.
> - make ->do_batch as bool.
> - removed css_get() at el. We don't need it.
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/memcontrol.h | 13 ++++++
> include/linux/sched.h | 7 +++
> mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
> mm/memory.c | 2
> mm/truncate.c | 6 ++
> 5 files changed, 113 insertions(+), 6 deletions(-)
>
> Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> @@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
> extern void mem_cgroup_del_lru(struct page *page);
> extern void mem_cgroup_move_lists(struct page *page,
> enum lru_list from, enum lru_list to);
> +
> +/* For coalescing uncharge for reducing memcg' overhead*/
> +extern void mem_cgroup_uncharge_start(void);
> +extern void mem_cgroup_uncharge_end(void);
> +
> extern void mem_cgroup_uncharge_page(struct page *page);
> extern void mem_cgroup_uncharge_cache_page(struct page *page);
> extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
> {
> }
>
> +static inline void mem_cgroup_uncharge_batch_start(void)
> +{
> +}
> +
> +static inline void mem_cgroup_uncharge_batch_start(void)
mem_cgroup_uncharge_batch_end?
Thanks,
Hiroshi
> +{
> +}
> +
> static inline void mem_cgroup_uncharge_page(struct page *page)
> {
> }
> Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> @@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
> css_put(&mem->css);
> }
>
> +static void
> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +{
> + struct memcg_batch_info *batch = NULL;
> + bool uncharge_memsw = true;
> + /* If swapout, usage of swap doesn't decrease */
> + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> + uncharge_memsw = false;
> + /*
> + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> + * In those cases, all pages freed continously can be expected to be in
> + * the same cgroup and we have chance to coalesce uncharges.
> + * And, we do uncharge one by one if this is killed by OOM.
> + */
> + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> + goto direct_uncharge;
> +
> + batch = &current->memcg_batch;
> + /*
> + * In usual, we do css_get() when we remember memcg pointer.
> + * But in this case, we keep res->usage until end of a series of
> + * uncharges. Then, it's ok to ignore memcg's refcnt.
> + */
> + if (!batch->memcg)
> + batch->memcg = mem;
> + /*
> + * In typical case, batch->memcg == mem. This means we can
> + * merge a series of uncharges to an uncharge of res_counter.
> + * If not, we uncharge res_counter ony by one.
> + */
> + if (batch->memcg != mem)
> + goto direct_uncharge;
> + /* remember freed charge and uncharge it later */
> + batch->pages += PAGE_SIZE;
> + if (uncharge_memsw)
> + batch->memsw += PAGE_SIZE;
> + return;
> +direct_uncharge:
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + if (uncharge_memsw)
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> + return;
> +}
>
> /*
> * uncharge if !page_mapped(page)
> @@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
> break;
> }
>
> - if (!mem_cgroup_is_root(mem)) {
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> - if (do_swap_account &&
> - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> - }
> + if (!mem_cgroup_is_root(mem))
> + __do_uncharge(mem, ctype);
> if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> mem_cgroup_swap_statistics(mem, true);
> mem_cgroup_charge_statistics(mem, pc, false);
> @@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> }
>
> +/*
> + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
> + * In that cases, pages are freed continuously and we can expect pages
> + * are in the same memcg. All these calls itself limits the number of
> + * pages freed at once, then uncharge_start/end() is called properly.
> + */
> +
> +void mem_cgroup_uncharge_start(void)
> +{
> + if (!current->memcg_batch.do_batch) {
> + current->memcg_batch.memcg = NULL;
> + current->memcg_batch.pages = 0;
> + current->memcg_batch.memsw = 0;
> + }
> + current->memcg_batch.do_batch++;
> +}
> +
> +void mem_cgroup_uncharge_end(void)
> +{
> + struct mem_cgroup *mem;
> +
> + if (!current->memcg_batch.do_batch)
> + return;
> +
> + current->memcg_batch.do_batch--;
> + if (current->memcg_batch.do_batch) /* Nested ? */
> + return;
> +
> + mem = current->memcg_batch.memcg;
> + if (!mem)
> + return;
> + /* This "mem" is valid bacause we hide charges behind us. */
> + if (current->memcg_batch.pages)
> + res_counter_uncharge(&mem->res, current->memcg_batch.pages);
> + if (current->memcg_batch.memsw)
> + res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
> + /* Not necessary. but forget this pointer */
> + current->memcg_batch.memcg = NULL;
> +}
> +
> #ifdef CONFIG_SWAP
> /*
> * called after __delete_from_swap_cache() and drop "page" account.
> Index: mmotm-2.6.31-Sep28/include/linux/sched.h
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
> +++ mmotm-2.6.31-Sep28/include/linux/sched.h
> @@ -1549,6 +1549,13 @@ struct task_struct {
> unsigned long trace_recursion;
> #endif /* CONFIG_TRACING */
> unsigned long stack_start;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> + struct memcg_batch_info {
> + int do_batch;
> + struct mem_cgroup *memcg;
> + long pages, memsw;
> + } memcg_batch;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> Index: mmotm-2.6.31-Sep28/mm/memory.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memory.c
> +++ mmotm-2.6.31-Sep28/mm/memory.c
> @@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
> details = NULL;
>
> BUG_ON(addr >= end);
> + mem_cgroup_uncharge_start();
> tlb_start_vma(tlb, vma);
> pgd = pgd_offset(vma->vm_mm, addr);
> do {
> @@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
> zap_work, details);
> } while (pgd++, addr = next, (addr != end && *zap_work > 0));
> tlb_end_vma(tlb, vma);
> + mem_cgroup_uncharge_end();
>
> return addr;
> }
> Index: mmotm-2.6.31-Sep28/mm/truncate.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/truncate.c
> +++ mmotm-2.6.31-Sep28/mm/truncate.c
> @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
> pagevec_release(&pvec);
> break;
> }
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
>
> @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
> unlock_page(page);
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> }
> }
> EXPORT_SYMBOL(truncate_inode_pages_range);
> @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
> pagevec_init(&pvec, 0);
> while (next <= end &&
> pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> pgoff_t index;
> @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
> break;
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> cond_resched();
> }
> return ret;
> @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
> while (next <= end && !wrapped &&
> pagevec_lookup(&pvec, mapping, next,
> min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> pgoff_t page_index;
> @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
> unlock_page(page);
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> cond_resched();
> }
> return ret;
>
* Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation
2009-10-02 6:47 ` Hiroshi Shimamoto
@ 2009-10-02 6:53 ` Hiroshi Shimamoto
2009-10-02 7:04 ` KAMEZAWA Hiroyuki
2009-10-02 7:02 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation (fixed coimpile bug) KAMEZAWA Hiroyuki
1 sibling, 1 reply; 22+ messages in thread
From: Hiroshi Shimamoto @ 2009-10-02 6:53 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
Hiroshi Shimamoto wrote:
> KAMEZAWA Hiroyuki wrote:
>> In massive parallel enviroment, res_counter can be a performance bottleneck.
>> One strong techinque to reduce lock contention is reducing calls by
>> coalescing some amount of calls into one.
>>
>> Considering charge/uncharge chatacteristic,
>> - charge is done one by one via demand-paging.
>> - uncharge is done by
>> - in chunk at munmap, truncate, exit, execve...
>> - one by one via vmscan/paging.
>>
>> It seems we have a chance in uncharge at unmap/truncation.
>>
>> This patch is a for coalescing uncharge. For avoiding scattering memcg's
>> structure to functions under /mm, this patch adds memcg batch uncharge
>> information to the task.
>>
>> The degree of coalescing depends on callers
>> - at invalidate/trucate... pagevec size
>> - at unmap ....ZAP_BLOCK_SIZE
>> (memory itself will be freed in this degree.)
>> Then, we'll not coalescing too much.
>>
>> Changelog(now):
>> - rebased onto the latest mmotm + softlimit fix patches.
>>
>> Changelog(old):
>> - unified patch for callers
>> - added commetns.
>> - make ->do_batch as bool.
>> - removed css_get() at el. We don't need it.
>>
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>> include/linux/memcontrol.h | 13 ++++++
>> include/linux/sched.h | 7 +++
>> mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
>> mm/memory.c | 2
>> mm/truncate.c | 6 ++
>> 5 files changed, 113 insertions(+), 6 deletions(-)
>>
>> Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
>> ===================================================================
>> --- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
>> +++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
>> @@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
>> extern void mem_cgroup_del_lru(struct page *page);
>> extern void mem_cgroup_move_lists(struct page *page,
>> enum lru_list from, enum lru_list to);
>> +
>> +/* For coalescing uncharge for reducing memcg' overhead*/
>> +extern void mem_cgroup_uncharge_start(void);
>> +extern void mem_cgroup_uncharge_end(void);
>> +
>> extern void mem_cgroup_uncharge_page(struct page *page);
>> extern void mem_cgroup_uncharge_cache_page(struct page *page);
>> extern int mem_cgroup_shmem_charge_fallback(struct page *page,
>> @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
>> {
>> }
>>
>> +static inline void mem_cgroup_uncharge_batch_start(void)
>> +{
>> +}
>> +
>> +static inline void mem_cgroup_uncharge_batch_start(void)
>
> mem_cgroup_uncharge_batch_end?
s/_batch// too?
Thanks,
Hiroshi
>
>
> Thanks,
> Hiroshi
>
>> +{
>> +}
>> +
>> static inline void mem_cgroup_uncharge_page(struct page *page)
>> {
>> }
>> Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
>> ===================================================================
>> --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
>> +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
>> @@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
>> css_put(&mem->css);
>> }
>>
>> +static void
>> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
>> +{
>> + struct memcg_batch_info *batch = NULL;
>> + bool uncharge_memsw = true;
>> + /* If swapout, usage of swap doesn't decrease */
>> + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>> + uncharge_memsw = false;
>> + /*
>> + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
>> + * In those cases, all pages freed continously can be expected to be in
>> + * the same cgroup and we have chance to coalesce uncharges.
>> + * And, we do uncharge one by one if this is killed by OOM.
>> + */
>> + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
>> + goto direct_uncharge;
>> +
>> + batch = &current->memcg_batch;
>> + /*
>> + * In usual, we do css_get() when we remember memcg pointer.
>> + * But in this case, we keep res->usage until end of a series of
>> + * uncharges. Then, it's ok to ignore memcg's refcnt.
>> + */
>> + if (!batch->memcg)
>> + batch->memcg = mem;
>> + /*
>> + * In typical case, batch->memcg == mem. This means we can
>> + * merge a series of uncharges to an uncharge of res_counter.
>> + * If not, we uncharge res_counter ony by one.
>> + */
>> + if (batch->memcg != mem)
>> + goto direct_uncharge;
>> + /* remember freed charge and uncharge it later */
>> + batch->pages += PAGE_SIZE;
>> + if (uncharge_memsw)
>> + batch->memsw += PAGE_SIZE;
>> + return;
>> +direct_uncharge:
>> + res_counter_uncharge(&mem->res, PAGE_SIZE);
>> + if (uncharge_memsw)
>> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
>> + return;
>> +}
>>
>> /*
>> * uncharge if !page_mapped(page)
>> @@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
>> break;
>> }
>>
>> - if (!mem_cgroup_is_root(mem)) {
>> - res_counter_uncharge(&mem->res, PAGE_SIZE);
>> - if (do_swap_account &&
>> - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
>> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
>> - }
>> + if (!mem_cgroup_is_root(mem))
>> + __do_uncharge(mem, ctype);
>> if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>> mem_cgroup_swap_statistics(mem, true);
>> mem_cgroup_charge_statistics(mem, pc, false);
>> @@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
>> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
>> }
>>
>> +/*
>> + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
>> + * In that cases, pages are freed continuously and we can expect pages
>> + * are in the same memcg. All these calls itself limits the number of
>> + * pages freed at once, then uncharge_start/end() is called properly.
>> + */
>> +
>> +void mem_cgroup_uncharge_start(void)
>> +{
>> + if (!current->memcg_batch.do_batch) {
>> + current->memcg_batch.memcg = NULL;
>> + current->memcg_batch.pages = 0;
>> + current->memcg_batch.memsw = 0;
>> + }
>> + current->memcg_batch.do_batch++;
>> +}
>> +
>> +void mem_cgroup_uncharge_end(void)
>> +{
>> + struct mem_cgroup *mem;
>> +
>> + if (!current->memcg_batch.do_batch)
>> + return;
>> +
>> + current->memcg_batch.do_batch--;
>> + if (current->memcg_batch.do_batch) /* Nested ? */
>> + return;
>> +
>> + mem = current->memcg_batch.memcg;
>> + if (!mem)
>> + return;
>> + /* This "mem" is valid bacause we hide charges behind us. */
>> + if (current->memcg_batch.pages)
>> + res_counter_uncharge(&mem->res, current->memcg_batch.pages);
>> + if (current->memcg_batch.memsw)
>> + res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
>> + /* Not necessary. but forget this pointer */
>> + current->memcg_batch.memcg = NULL;
>> +}
>> +
>> #ifdef CONFIG_SWAP
>> /*
>> * called after __delete_from_swap_cache() and drop "page" account.
>> Index: mmotm-2.6.31-Sep28/include/linux/sched.h
>> ===================================================================
>> --- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
>> +++ mmotm-2.6.31-Sep28/include/linux/sched.h
>> @@ -1549,6 +1549,13 @@ struct task_struct {
>> unsigned long trace_recursion;
>> #endif /* CONFIG_TRACING */
>> unsigned long stack_start;
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
>> + struct memcg_batch_info {
>> + int do_batch;
>> + struct mem_cgroup *memcg;
>> + long pages, memsw;
>> + } memcg_batch;
>> +#endif
>> };
>>
>> /* Future-safe accessor for struct task_struct's cpus_allowed. */
>> Index: mmotm-2.6.31-Sep28/mm/memory.c
>> ===================================================================
>> --- mmotm-2.6.31-Sep28.orig/mm/memory.c
>> +++ mmotm-2.6.31-Sep28/mm/memory.c
>> @@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
>> details = NULL;
>>
>> BUG_ON(addr >= end);
>> + mem_cgroup_uncharge_start();
>> tlb_start_vma(tlb, vma);
>> pgd = pgd_offset(vma->vm_mm, addr);
>> do {
>> @@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
>> zap_work, details);
>> } while (pgd++, addr = next, (addr != end && *zap_work > 0));
>> tlb_end_vma(tlb, vma);
>> + mem_cgroup_uncharge_end();
>>
>> return addr;
>> }
>> Index: mmotm-2.6.31-Sep28/mm/truncate.c
>> ===================================================================
>> --- mmotm-2.6.31-Sep28.orig/mm/truncate.c
>> +++ mmotm-2.6.31-Sep28/mm/truncate.c
>> @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
>> pagevec_release(&pvec);
>> break;
>> }
>> + mem_cgroup_uncharge_start();
>> for (i = 0; i < pagevec_count(&pvec); i++) {
>> struct page *page = pvec.pages[i];
>>
>> @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
>> unlock_page(page);
>> }
>> pagevec_release(&pvec);
>> + mem_cgroup_uncharge_end();
>> }
>> }
>> EXPORT_SYMBOL(truncate_inode_pages_range);
>> @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
>> pagevec_init(&pvec, 0);
>> while (next <= end &&
>> pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
>> + mem_cgroup_uncharge_start();
>> for (i = 0; i < pagevec_count(&pvec); i++) {
>> struct page *page = pvec.pages[i];
>> pgoff_t index;
>> @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
>> break;
>> }
>> pagevec_release(&pvec);
>> + mem_cgroup_uncharge_end();
>> cond_resched();
>> }
>> return ret;
>> @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
>> while (next <= end && !wrapped &&
>> pagevec_lookup(&pvec, mapping, next,
>> min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
>> + mem_cgroup_uncharge_start();
>> for (i = 0; i < pagevec_count(&pvec); i++) {
>> struct page *page = pvec.pages[i];
>> pgoff_t page_index;
>> @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
>> unlock_page(page);
>> }
>> pagevec_release(&pvec);
>> + mem_cgroup_uncharge_end();
>> cond_resched();
>> }
>> return ret;
>>
* [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation (fixed compile bug)
2009-10-02 6:47 ` Hiroshi Shimamoto
2009-10-02 6:53 ` Hiroshi Shimamoto
@ 2009-10-02 7:02 ` KAMEZAWA Hiroyuki
2009-10-08 22:17 ` Andrew Morton
1 sibling, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-02 7:02 UTC (permalink / raw)
To: Hiroshi Shimamoto
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Fri, 02 Oct 2009 15:47:22 +0900
Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> wrote:
> > extern void mem_cgroup_uncharge_page(struct page *page);
> > extern void mem_cgroup_uncharge_cache_page(struct page *page);
> > extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> > @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
> > {
> > }
> >
> > +static inline void mem_cgroup_uncharge_batch_start(void)
> > +{
> > +}
> > +
> > +static inline void mem_cgroup_uncharge_batch_start(void)
>
> mem_cgroup_uncharge_batch_end?
>
Yes, of course... this came from the lack of a !CONFIG_MEM_CGROUP build test after updating.
Thank you!
==
In a massively parallel environment, res_counter can be a performance bottleneck.
One strong technique for reducing lock contention is to reduce the number of
calls by coalescing several calls into one.
Considering the charge/uncharge characteristics:
- charge is done one by one via demand paging.
- uncharge is done
- in chunks at munmap, truncate, exit, execve...
- one by one via vmscan/paging.
It seems we have a chance to coalesce uncharge at unmap/truncation.
This patch is for coalescing uncharge. To avoid scattering memcg's
structures into functions under mm/, this patch adds memcg batch-uncharge
information to the task.
The degree of coalescing depends on the caller:
- at invalidate/truncate... pagevec size
- at unmap ... ZAP_BLOCK_SIZE
(memory itself is freed in chunks of this size.)
So we will not coalesce too much.
Changelog(now):
- fixed the !CONFIG_MEM_CGROUP case.
- rebased onto the latest mmotm + softlimit fix patches.
Changelog(old):
- unified patch for callers
- added comments.
- made ->do_batch a bool.
- removed css_get() et al. We don't need it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 13 ++++++
include/linux/sched.h | 7 +++
mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
mm/memory.c | 2
mm/truncate.c | 6 ++
5 files changed, 113 insertions(+), 6 deletions(-)
Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
+++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
@@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
enum lru_list from, enum lru_list to);
+
+/* For coalescing uncharge to reduce memcg overhead */
+extern void mem_cgroup_uncharge_start(void);
+extern void mem_cgroup_uncharge_end(void);
+
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shmem_charge_fallback(struct page *page,
@@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
{
}
+static inline void mem_cgroup_uncharge_start(void)
+{
+}
+
+static inline void mem_cgroup_uncharge_end(void)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
css_put(&mem->css);
}
+static void
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+{
+ struct memcg_batch_info *batch = NULL;
+ bool uncharge_memsw = true;
+ /* If swapout, usage of swap doesn't decrease */
+ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+ uncharge_memsw = false;
+ /*
+ * do_batch > 0 when unmapping pages or inode invalidate/truncate.
+ * In those cases, all pages freed continuously can be expected to be in
+ * the same cgroup, and we have a chance to coalesce uncharges.
+ * And we do uncharge one by one if this task is being killed by OOM.
+ */
+ if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
+ goto direct_uncharge;
+
+ batch = &current->memcg_batch;
+ /*
+ * Usually, we do css_get() when we remember the memcg pointer.
+ * But in this case, we keep res->usage until the end of a series of
+ * uncharges. Then, it's OK to ignore memcg's refcnt.
+ */
+ if (!batch->memcg)
+ batch->memcg = mem;
+ /*
+ * In the typical case, batch->memcg == mem. This means we can
+ * merge a series of uncharges into one uncharge of the res_counter.
+ * If not, we uncharge the res_counter one by one.
+ */
+ if (batch->memcg != mem)
+ goto direct_uncharge;
+ /* remember freed charge and uncharge it later */
+ batch->pages += PAGE_SIZE;
+ if (uncharge_memsw)
+ batch->memsw += PAGE_SIZE;
+ return;
+direct_uncharge:
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (uncharge_memsw)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ return;
+}
/*
* uncharge if !page_mapped(page)
@@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
break;
}
- if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (do_swap_account &&
- (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
- }
+ if (!mem_cgroup_is_root(mem))
+ __do_uncharge(mem, ctype);
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
@@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
+/*
+ * batch_start/batch_end is called in unmap_page_range/invalidate/truncate.
+ * In those cases, pages are freed continuously and we can expect the pages
+ * to be in the same memcg. Each of these callers itself limits the number of
+ * pages freed at once, so uncharge_start/end() is called at a proper granularity.
+ */
+
+void mem_cgroup_uncharge_start(void)
+{
+ if (!current->memcg_batch.do_batch) {
+ current->memcg_batch.memcg = NULL;
+ current->memcg_batch.pages = 0;
+ current->memcg_batch.memsw = 0;
+ }
+ current->memcg_batch.do_batch++;
+}
+
+void mem_cgroup_uncharge_end(void)
+{
+ struct mem_cgroup *mem;
+
+ if (!current->memcg_batch.do_batch)
+ return;
+
+ current->memcg_batch.do_batch--;
+ if (current->memcg_batch.do_batch) /* Nested ? */
+ return;
+
+ mem = current->memcg_batch.memcg;
+ if (!mem)
+ return;
+ /* This "mem" is valid bacause we hide charges behind us. */
+ if (current->memcg_batch.pages)
+ res_counter_uncharge(&mem->res, current->memcg_batch.pages);
+ if (current->memcg_batch.memsw)
+ res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
+ /* Not strictly necessary, but forget this pointer */
+ current->memcg_batch.memcg = NULL;
+}
+
#ifdef CONFIG_SWAP
/*
* called after __delete_from_swap_cache() and drop "page" account.
Index: mmotm-2.6.31-Sep28/include/linux/sched.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
+++ mmotm-2.6.31-Sep28/include/linux/sched.h
@@ -1549,6 +1549,13 @@ struct task_struct {
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
unsigned long stack_start;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
+ struct memcg_batch_info {
+ int do_batch;
+ struct mem_cgroup *memcg;
+ long pages, memsw;
+ } memcg_batch;
+#endif
};
/* Future-safe accessor for struct task_struct's cpus_allowed. */
Index: mmotm-2.6.31-Sep28/mm/memory.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memory.c
+++ mmotm-2.6.31-Sep28/mm/memory.c
@@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
details = NULL;
BUG_ON(addr >= end);
+ mem_cgroup_uncharge_start();
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
zap_work, details);
} while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
+ mem_cgroup_uncharge_end();
return addr;
}
Index: mmotm-2.6.31-Sep28/mm/truncate.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/truncate.c
+++ mmotm-2.6.31-Sep28/mm/truncate.c
@@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
pagevec_release(&pvec);
break;
}
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
@@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
}
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
pagevec_init(&pvec, 0);
while (next <= end &&
pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t index;
@@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
break;
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;
@@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
while (next <= end && !wrapped &&
pagevec_lookup(&pvec, mapping, next,
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
@@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;
* Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation
2009-10-02 6:53 ` Hiroshi Shimamoto
@ 2009-10-02 7:04 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-02 7:04 UTC (permalink / raw)
To: Hiroshi Shimamoto
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Fri, 02 Oct 2009 15:53:19 +0900
Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> wrote:
> Hiroshi Shimamoto wrote:
> > KAMEZAWA Hiroyuki wrote:
> >> +static inline void mem_cgroup_uncharge_batch_start(void)
> >> +{
> >> +}
> >> +
> >> +static inline void mem_cgroup_uncharge_batch_start(void)
> >
> > mem_cgroup_uncharge_batch_end?
>
> s/_batch// too?
>
thank you. fixed.
-Kame
* Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge
2009-10-02 4:55 [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge KAMEZAWA Hiroyuki
2009-10-02 5:01 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation KAMEZAWA Hiroyuki
2009-10-02 5:03 ` [PATCH 2/2] memcg: coalescing charges per cpu KAMEZAWA Hiroyuki
@ 2009-10-02 8:53 ` KAMEZAWA Hiroyuki
2009-10-05 7:18 ` KAMEZAWA Hiroyuki
2009-10-05 10:37 ` Balbir Singh
[not found] ` <604427e00910091737s52e11ce9p256c95d533dc2837@mail.gmail.com>
4 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-02 8:53 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
[-- Attachment #1: Type: text/plain, Size: 3158 bytes --]
On Fri, 2 Oct 2009 13:55:31 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Following is test result of continuous page-fault on my 8cpu box(x86-64).
>
> A loop like this runs on all cpus in parallel for 60secs.
> ==
> while (1) {
> x = mmap(NULL, MEGA, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
>
> for (off = 0; off < MEGA; off += PAGE_SIZE)
> x[off]=0;
> munmap(x, MEGA);
> }
> ==
> please see # of page faults. I think this is good improvement.
>
>
> [Before]
> Performance counter stats for './runpause.sh' (5 runs):
>
> 474539.756944 task-clock-msecs # 7.890 CPUs ( +- 0.015% )
> 10284 context-switches # 0.000 M/sec ( +- 0.156% )
> 12 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 18425800 page-faults # 0.039 M/sec ( +- 0.107% )
> 1486296285360 cycles # 3132.080 M/sec ( +- 0.029% )
> 380334406216 instructions # 0.256 IPC ( +- 0.058% )
> 3274206662 cache-references # 6.900 M/sec ( +- 0.453% )
> 1272947699 cache-misses # 2.682 M/sec ( +- 0.118% )
>
> 60.147907341 seconds time elapsed ( +- 0.010% )
>
> [After]
> Performance counter stats for './runpause.sh' (5 runs):
>
> 474658.997489 task-clock-msecs # 7.891 CPUs ( +- 0.006% )
> 10250 context-switches # 0.000 M/sec ( +- 0.020% )
> 11 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 33177858 page-faults # 0.070 M/sec ( +- 0.152% )
> 1485264748476 cycles # 3129.120 M/sec ( +- 0.021% )
> 409847004519 instructions # 0.276 IPC ( +- 0.123% )
> 3237478723 cache-references # 6.821 M/sec ( +- 0.574% )
> 1182572827 cache-misses # 2.491 M/sec ( +- 0.179% )
>
> 60.151786309 seconds time elapsed ( +- 0.014% )
>
BTW, this is the score in the root cgroup.
473811.590852 task-clock-msecs # 7.878 CPUs ( +- 0.006% )
10257 context-switches # 0.000 M/sec ( +- 0.049% )
10 CPU-migrations # 0.000 M/sec ( +- 0.000% )
36418112 page-faults # 0.077 M/sec ( +- 0.195% )
1482880352588 cycles # 3129.684 M/sec ( +- 0.011% )
410948762898 instructions # 0.277 IPC ( +- 0.123% )
3182986911 cache-references # 6.718 M/sec ( +- 0.555% )
1147144023 cache-misses # 2.421 M/sec ( +- 0.137% )
Then,
36418112 * 100 / 33177858 = 109%, i.e. the child cgroup is about 9% slower
than the root cgroup.
But, hmm, this test is an extreme case (60 seconds of continuous page faults
on all cpus). We may be able to do something more, but this score itself is
not so bad, I think.
Results on more cpus are welcome. The programs I used are attached.
Thanks,
-Kame
[-- Attachment #2: pagefault.c --]
[-- Type: text/x-csrc, Size: 453 bytes --]
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <signal.h>
#define PAGE_SIZE (4096)
#define MEGA (1024 * 1024)
void sigalarm_handler(int sig)
{
}
int main(int argc, char *argv[])
{
char *x;
int off;
signal(SIGALRM, sigalarm_handler);
pause();
while (1) {
x = mmap(NULL, MEGA, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
for (off = 0; off < MEGA; off += PAGE_SIZE)
x[off]=0;
munmap(x, MEGA);
}
}
[-- Attachment #3: runpause.sh --]
[-- Type: text/x-sh, Size: 128 bytes --]
#!/bin/sh
for i in 0 1 2 3 4 5 6 7 ;do
taskset -c $i ./pagefault &
done
pkill -ALRM pagefault
sleep 60
pkill -HUP pagefault
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge
2009-10-02 8:53 ` [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge KAMEZAWA Hiroyuki
@ 2009-10-05 7:18 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-05 7:18 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Fri, 2 Oct 2009 17:53:10 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > [After]
> > Performance counter stats for './runpause.sh' (5 runs):
> >
> > 474658.997489 task-clock-msecs # 7.891 CPUs ( +- 0.006% )
> > 10250 context-switches # 0.000 M/sec ( +- 0.020% )
> > 11 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> > 33177858 page-faults # 0.070 M/sec ( +- 0.152% )
> > 1485264748476 cycles # 3129.120 M/sec ( +- 0.021% )
> > 409847004519 instructions # 0.276 IPC ( +- 0.123% )
> > 3237478723 cache-references # 6.821 M/sec ( +- 0.574% )
> > 1182572827 cache-misses # 2.491 M/sec ( +- 0.179% )
> >
> > 60.151786309 seconds time elapsed ( +- 0.014% )
> >
> BTW, this is a score in root cgroup.
>
>
> 473811.590852 task-clock-msecs # 7.878 CPUs ( +- 0.006% )
> 10257 context-switches # 0.000 M/sec ( +- 0.049% )
> 10 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 36418112 page-faults # 0.077 M/sec ( +- 0.195% )
> 1482880352588 cycles # 3129.684 M/sec ( +- 0.011% )
> 410948762898 instructions # 0.277 IPC ( +- 0.123% )
> 3182986911 cache-references # 6.718 M/sec ( +- 0.555% )
> 1147144023 cache-misses # 2.421 M/sec ( +- 0.137% )
>
>
> Then,
> 36418112 x 100 / 33177858 = 109%, i.e. the children cgroup is about 9% slower than root.
>
This is an additional patch, now under testing (just experimental).
Result of the above test:
==
[root cgroup]
37062405 page-faults # 0.078 M/sec ( +- 0.156% )
[children]
35876894 page-faults # 0.076 M/sec ( +- 0.233% )
==
Near to my target....
The coalescing-uncharge patch reduces references to res_counter, but
css_put() is still called once per page. Of course, we can coalesce plural
css_put() calls into a single call of bulk_css_put().
This patch adds bulk_css_put() and coalesces css_put() calls in the
batched-uncharge path, avoiding frequent css_put() calls. This reduces
false sharing and should have a good effect on scalability.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/cgroup.h | 10 ++++++++--
kernel/cgroup.c | 5 ++---
mm/memcontrol.c | 16 +++++++++++-----
3 files changed, 21 insertions(+), 10 deletions(-)
Index: mmotm-2.6.31-Sep28/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/cgroup.h
+++ mmotm-2.6.31-Sep28/include/linux/cgroup.h
@@ -117,11 +117,17 @@ static inline bool css_tryget(struct cgr
* css_get() or css_tryget()
*/
-extern void __css_put(struct cgroup_subsys_state *css);
+extern void __css_put(struct cgroup_subsys_state *css, int val);
static inline void css_put(struct cgroup_subsys_state *css)
{
if (!test_bit(CSS_ROOT, &css->flags))
- __css_put(css);
+ __css_put(css, 1);
+}
+
+static inline void bulk_css_put(struct cgroup_subsys_state *css, int val)
+{
+ if (!test_bit(CSS_ROOT, &css->flags))
+ __css_put(css, val);
}
/* bits in struct cgroup flags field */
Index: mmotm-2.6.31-Sep28/kernel/cgroup.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/kernel/cgroup.c
+++ mmotm-2.6.31-Sep28/kernel/cgroup.c
@@ -3705,12 +3705,11 @@ static void check_for_release(struct cgr
}
}
-void __css_put(struct cgroup_subsys_state *css)
+void __css_put(struct cgroup_subsys_state *css, int val)
{
struct cgroup *cgrp = css->cgroup;
- int val;
rcu_read_lock();
- val = atomic_dec_return(&css->refcnt);
+ val = atomic_sub_return(val, &css->refcnt);
if (val == 1) {
if (notify_on_release(cgrp)) {
set_bit(CGRP_RELEASABLE, &cgrp->flags);
Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -1977,8 +1977,14 @@ __do_uncharge(struct mem_cgroup *mem, co
return;
direct_uncharge:
res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (uncharge_memsw)
+ if (uncharge_memsw) {
res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ /*
+ * swapout-uncharge do css_put() by itself. then we do
+ * css_put() only in this case.
+ */
+ css_put(&mem->css);
+ }
return;
}
@@ -2048,9 +2054,6 @@ __mem_cgroup_uncharge_common(struct page
if (mem_cgroup_soft_limit_check(mem))
mem_cgroup_update_tree(mem, page);
- /* at swapout, this memcg will be accessed to record to swap */
- if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
- css_put(&mem->css);
return mem;
@@ -2108,8 +2111,11 @@ void mem_cgroup_uncharge_end(void)
if (!mem)
return;
/* This "mem" is valid bacause we hide charges behind us. */
- if (current->memcg_batch.pages)
+ if (current->memcg_batch.pages) {
res_counter_uncharge(&mem->res, current->memcg_batch.pages);
+ bulk_css_put(&mem->css,
+ current->memcg_batch.pages >> PAGE_SHIFT);
+ }
if (current->memcg_batch.memsw)
res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
/* Not necessary. but forget this pointer */
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge
2009-10-02 4:55 [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2009-10-02 8:53 ` [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge KAMEZAWA Hiroyuki
@ 2009-10-05 10:37 ` Balbir Singh
[not found] ` <604427e00910091737s52e11ce9p256c95d533dc2837@mail.gmail.com>
4 siblings, 0 replies; 22+ messages in thread
From: Balbir Singh @ 2009-10-05 10:37 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-10-02 13:55:31]:
> Hi,
>
> This patch is against mmotm + softlimit fix patches.
> (which are now in -rc git tree.)
>
> In the latest -rc series, the kernel avoids accessing res_counter when
> cgroup is root cgroup. This helps scalabilty when memcg is not used.
>
> It's necessary to improve scalabilty even when memcg is used. This patch
> is for that. Previous Balbir's work shows that the biggest obstacles for
> better scalabilty is memcg's res_counter. Then, there are 2 ways.
>
> (1) make counter scale well.
> (2) avoid accessing core counter as much as possible.
>
> My first direction was (1). But no, there is no counter which is free
> from false sharing when it needs system-wide fine grain synchronization.
> And res_counter has several functionality...this makes (1) difficult.
> spin_lock (in slow path) around counter means tons of invalidation will
> happen even when we just access counter without modification.
>
> This patch series is for (2). This implements charge/uncharge in bached manner.
> This coalesces access to res_counter at charge/uncharge using nature of
> access locality.
>
> Tested for a month. And I got good reorts from Balbir and Nishimura, thanks.
> One concern is that this adds some members to the bottom of task_struct.
> Better idea is welcome.
>
> Following is test result of continuous page-fault on my 8cpu box(x86-64).
>
> A loop like this runs on all cpus in parallel for 60secs.
> ==
> while (1) {
> x = mmap(NULL, MEGA, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
>
> for (off = 0; off < MEGA; off += PAGE_SIZE)
> x[off]=0;
> munmap(x, MEGA);
> }
> ==
> please see # of page faults. I think this is good improvement.
>
>
> [Before]
> Performance counter stats for './runpause.sh' (5 runs):
>
> 474539.756944 task-clock-msecs # 7.890 CPUs ( +- 0.015% )
> 10284 context-switches # 0.000 M/sec ( +- 0.156% )
> 12 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 18425800 page-faults # 0.039 M/sec ( +- 0.107% )
> 1486296285360 cycles # 3132.080 M/sec ( +- 0.029% )
> 380334406216 instructions # 0.256 IPC ( +- 0.058% )
> 3274206662 cache-references # 6.900 M/sec ( +- 0.453% )
> 1272947699 cache-misses # 2.682 M/sec ( +- 0.118% )
>
> 60.147907341 seconds time elapsed ( +- 0.010% )
>
> [After]
> Performance counter stats for './runpause.sh' (5 runs):
>
> 474658.997489 task-clock-msecs # 7.891 CPUs ( +- 0.006% )
> 10250 context-switches # 0.000 M/sec ( +- 0.020% )
> 11 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 33177858 page-faults # 0.070 M/sec ( +- 0.152% )
> 1485264748476 cycles # 3129.120 M/sec ( +- 0.021% )
> 409847004519 instructions # 0.276 IPC ( +- 0.123% )
> 3237478723 cache-references # 6.821 M/sec ( +- 0.574% )
> 1182572827 cache-misses # 2.491 M/sec ( +- 0.179% )
>
> 60.151786309 seconds time elapsed ( +- 0.014% )
>
I agree, I liked the previous patchset; let me re-review this one!
Definitely a good candidate for -mm.
--
Balbir
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation (fixed coimpile bug)
2009-10-02 7:02 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation (fixed coimpile bug) KAMEZAWA Hiroyuki
@ 2009-10-08 22:17 ` Andrew Morton
2009-10-08 23:48 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-10-08 22:17 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Hiroshi Shimamoto, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Fri, 2 Oct 2009 16:02:13 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> ...
>
> In massive parallel enviroment, res_counter can be a performance bottleneck.
> One strong techinque to reduce lock contention is reducing calls by
> coalescing some amount of calls into one.
>
> Considering charge/uncharge chatacteristic,
> - charge is done one by one via demand-paging.
> - uncharge is done by
> - in chunk at munmap, truncate, exit, execve...
> - one by one via vmscan/paging.
>
> It seems we have a chance in uncharge at unmap/truncation.
>
> This patch is a for coalescing uncharge. For avoiding scattering memcg's
> structure to functions under /mm, this patch adds memcg batch uncharge
> information to the task.
>
> The degree of coalescing depends on callers
> - at invalidate/trucate... pagevec size
> - at unmap ....ZAP_BLOCK_SIZE
> (memory itself will be freed in this degree.)
> Then, we'll not coalescing too much.
>
>
> ...
>
> +static void
> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +{
> + struct memcg_batch_info *batch = NULL;
> + bool uncharge_memsw = true;
> + /* If swapout, usage of swap doesn't decrease */
> + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> + uncharge_memsw = false;
> + /*
> + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> + * In those cases, all pages freed continously can be expected to be in
> + * the same cgroup and we have chance to coalesce uncharges.
> + * And, we do uncharge one by one if this is killed by OOM.
> + */
> + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> + goto direct_uncharge;
> +
> + batch = ¤t->memcg_batch;
> + /*
> + * In usual, we do css_get() when we remember memcg pointer.
> + * But in this case, we keep res->usage until end of a series of
> + * uncharges. Then, it's ok to ignore memcg's refcnt.
> + */
> + if (!batch->memcg)
> + batch->memcg = mem;
> + /*
> + * In typical case, batch->memcg == mem. This means we can
> + * merge a series of uncharges to an uncharge of res_counter.
> + * If not, we uncharge res_counter ony by one.
> + */
> + if (batch->memcg != mem)
> + goto direct_uncharge;
> + /* remember freed charge and uncharge it later */
> + batch->pages += PAGE_SIZE;
->pages is really confusingly named. It doesn't count pages, it counts
bytes!
We could call it `bytes', but perhaps charge_bytes would be more
communicative?
> +/*
> + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
> + * In that cases, pages are freed continuously and we can expect pages
> + * are in the same memcg. All these calls itself limits the number of
> + * pages freed at once, then uncharge_start/end() is called properly.
> + */
> +
> +void mem_cgroup_uncharge_start(void)
> +{
> + if (!current->memcg_batch.do_batch) {
> + current->memcg_batch.memcg = NULL;
> + current->memcg_batch.pages = 0;
> + current->memcg_batch.memsw = 0;
what's memsw?
> + }
> + current->memcg_batch.do_batch++;
> +}
> +
>
> ...
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> + struct memcg_batch_info {
> + int do_batch;
> + struct mem_cgroup *memcg;
> + long pages, memsw;
> + } memcg_batch;
> +#endif
I find the most valuable documentation is that which is devoted to the
data structures. This one didn't get any :(
Negative values of `pages' and `memsw' are meaningless, so it would be
better to give them an unsigned type. That matches the
res_counter_charge() expectations also.
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] memcg: coalescing charges per cpu
2009-10-02 5:03 ` [PATCH 2/2] memcg: coalescing charges per cpu KAMEZAWA Hiroyuki
@ 2009-10-08 22:26 ` Andrew Morton
2009-10-08 23:54 ` KAMEZAWA Hiroyuki
2009-10-09 4:15 ` Balbir Singh
1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-10-08 22:26 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
On Fri, 2 Oct 2009 14:03:43 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> This is a patch for coalescing access to res_counter at charging by
> percpu caching. At charge, memcg charges 64pages and remember it in
> percpu cache. Because it's cache, drain/flush is done if necessary.
>
> This version uses public percpu area.
> 2 benefits of using public percpu area.
> 1. Sum of stocked charge in the system is limited to # of cpus
> not to the number of memcg. This shows better synchonization.
> 2. drain code for flush/cpuhotplug is very easy (and quick)
>
> The most important point of this patch is that we never touch res_counter
> in fast path. The res_counter is system-wide shared counter which is modified
> very frequently. We shouldn't touch it as far as we can for avoiding
> false sharing.
>
> ...
>
> +/* size of first charge trial. "32" comes from vmscan.c's magic value */
> +#define CHARGE_SIZE (32 * PAGE_SIZE)
> +struct memcg_stock_pcp {
> + struct mem_cgroup *cached;
> + int charge;
> + struct work_struct work;
> +};
> +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> +static DEFINE_MUTEX(memcg_drain_mutex);
> +
> +static bool consume_stock(struct mem_cgroup *mem)
> +{
> + struct memcg_stock_pcp *stock;
> + bool ret = true;
> +
> + stock = &get_cpu_var(memcg_stock);
> + if (mem == stock->cached && stock->charge)
> + stock->charge -= PAGE_SIZE;
> + else
> + ret = false;
> + put_cpu_var(memcg_stock);
> + return ret;
> +}
It's unobvious what the return value from this function means. A nice
comment would help.
> +static void drain_stock(struct memcg_stock_pcp *stock)
> +{
> + struct mem_cgroup *old = stock->cached;
> +
> + if (stock->charge) {
> + res_counter_uncharge(&old->res, stock->charge);
> + if (do_swap_account)
> + res_counter_uncharge(&old->memsw, stock->charge);
> + }
> + stock->cached = NULL;
> + stock->charge = 0;
> +}
> +
> +static void drain_local_stock(struct work_struct *dummy)
> +{
> + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> + drain_stock(stock);
> + put_cpu_var(memcg_stock);
> +}
drain_local_stock() is only ever run by a thread which is pinned to a
particular CPU, so we can use plain old __get_cpu_var() here and remove
the put_cpu_var(), methinks. If this is done, a comment should be
added explaining why we can use that optimisation.
> +static void refill_stock(struct mem_cgroup *mem, int val)
> +{
> + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> +
> + if (stock->cached != mem) {
> + drain_stock(stock);
> + stock->cached = mem;
> + }
> + stock->charge += val;
> + put_cpu_var(memcg_stock);
> +}
> +
> +static void drain_all_stock_async(void)
> +{
> + int cpu;
> + /* Contention means someone tries to flush. */
> + if (!mutex_trylock(&memcg_drain_mutex))
> + return;
Any time I see a trylock I ask "hm, what happens when it fails - what
are the consequences and how do we later rerun the aborted operation".
That's unobvious here and merits a comment.
> + for_each_online_cpu(cpu) {
Did we need get_online_cpus() here?
> + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> + if (work_pending(&stock->work))
> + continue;
> + INIT_WORK(&stock->work, drain_local_stock);
> + schedule_work_on(cpu, &stock->work);
> + }
> + mutex_unlock(&memcg_drain_mutex);
> + /* We don't wait for flush_work */
> +}
> +
> +static void drain_all_stock_sync(void)
> +{
> + /* called when force_empty is called */
> + mutex_lock(&memcg_drain_mutex);
> + schedule_on_each_cpu(drain_local_stock);
> + mutex_unlock(&memcg_drain_mutex);
> +}
In fact it would be nice were each of these functions to have a brief
note describing their role.
> +static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
> + unsigned long action,
> + void *hcpu)
> +{
> +#ifdef CONFIG_HOTPLUG_CPU
> + int cpu = (unsigned long)hcpu;
> + struct memcg_stock_pcp *stock;
> +
> + if (action != CPU_DEAD)
> + return NOTIFY_OK;
> + stock = &per_cpu(memcg_stock, cpu);
> + drain_stock(stock);
> +#endif
> + return NOTIFY_OK;
> +}
Is the ifdef needed? Using hotcpu_notifier() should cause this entire
function to disappear from vmlinux when CONFIG_HOTPLUG_CPU=n.
> /*
> * Unlike exported interface, "oom" parameter is added. if oom==true,
> * oom-killer can be invoked.
> @@ -1269,6 +1370,7 @@ static int __mem_cgroup_try_charge(struc
> struct mem_cgroup *mem, *mem_over_limit;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> struct res_counter *fail_res;
> + int csize = CHARGE_SIZE;
>
> if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> /* Don't account this! */
> @@ -1293,23 +1395,25 @@ static int __mem_cgroup_try_charge(struc
> return 0;
>
> VM_BUG_ON(css_is_removed(&mem->css));
> + if (mem_cgroup_is_root(mem))
> + goto done;
>
> while (1) {
> int ret = 0;
> unsigned long flags = 0;
>
> - if (mem_cgroup_is_root(mem))
> - goto done;
> - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> + if (consume_stock(mem))
> + goto charged;
> +
> + ret = res_counter_charge(&mem->res, csize, &fail_res);
> if (likely(!ret)) {
> if (!do_swap_account)
> break;
> - ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> - &fail_res);
> + ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> if (likely(!ret))
> break;
> /* mem+swap counter fails */
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> + res_counter_uncharge(&mem->res, csize);
> flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> memsw);
> @@ -1318,6 +1422,11 @@ static int __mem_cgroup_try_charge(struc
> mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> res);
>
> + /* reduce request size and retry */
> + if (csize > PAGE_SIZE) {
> + csize = PAGE_SIZE;
> + continue;
> + }
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> @@ -1347,6 +1456,9 @@ static int __mem_cgroup_try_charge(struc
> goto nomem;
> }
> }
> + if (csize > PAGE_SIZE)
> + refill_stock(mem, csize - PAGE_SIZE);
> +charged:
> /*
> * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> * if they exceeds softlimit.
> @@ -2463,6 +2575,7 @@ move_account:
> goto out;
> /* This is for making all *used* pages to be on LRU. */
> lru_add_drain_all();
> + drain_all_stock_sync();
> ret = 0;
> for_each_node_state(node, N_HIGH_MEMORY) {
> for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> @@ -3181,6 +3294,7 @@ mem_cgroup_create(struct cgroup_subsys *
> root_mem_cgroup = mem;
> if (mem_cgroup_soft_limit_tree_init())
> goto free_out;
> + hotcpu_notifier(memcg_stock_cpu_callback, 0);
>
> } else {
> parent = mem_cgroup_from_cont(cont->parent);
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation (fixed coimpile bug)
2009-10-08 22:17 ` Andrew Morton
@ 2009-10-08 23:48 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-08 23:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Hiroshi Shimamoto, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Thu, 8 Oct 2009 15:17:10 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 2 Oct 2009 16:02:13 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> >
> > ...
> >
> > In massive parallel enviroment, res_counter can be a performance bottleneck.
> > One strong techinque to reduce lock contention is reducing calls by
> > coalescing some amount of calls into one.
> >
> > Considering charge/uncharge chatacteristic,
> > - charge is done one by one via demand-paging.
> > - uncharge is done by
> > - in chunk at munmap, truncate, exit, execve...
> > - one by one via vmscan/paging.
> >
> > It seems we have a chance in uncharge at unmap/truncation.
> >
> > This patch is a for coalescing uncharge. For avoiding scattering memcg's
> > structure to functions under /mm, this patch adds memcg batch uncharge
> > information to the task.
> >
> > The degree of coalescing depends on callers
> > - at invalidate/trucate... pagevec size
> > - at unmap ....ZAP_BLOCK_SIZE
> > (memory itself will be freed in this degree.)
> > Then, we'll not coalescing too much.
> >
> >
> > ...
> >
> > +static void
> > +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> > +{
> > + struct memcg_batch_info *batch = NULL;
> > + bool uncharge_memsw = true;
> > + /* If swapout, usage of swap doesn't decrease */
> > + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > + uncharge_memsw = false;
> > + /*
> > + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> > + * In those cases, all pages freed continously can be expected to be in
> > + * the same cgroup and we have chance to coalesce uncharges.
> > + * And, we do uncharge one by one if this is killed by OOM.
> > + */
> > + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> > + goto direct_uncharge;
> > +
> > + batch = ¤t->memcg_batch;
> > + /*
> > + * In usual, we do css_get() when we remember memcg pointer.
> > + * But in this case, we keep res->usage until end of a series of
> > + * uncharges. Then, it's ok to ignore memcg's refcnt.
> > + */
> > + if (!batch->memcg)
> > + batch->memcg = mem;
> > + /*
> > + * In typical case, batch->memcg == mem. This means we can
> > + * merge a series of uncharges to an uncharge of res_counter.
> > + * If not, we uncharge res_counter ony by one.
> > + */
> > + if (batch->memcg != mem)
> > + goto direct_uncharge;
> > + /* remember freed charge and uncharge it later */
> > + batch->pages += PAGE_SIZE;
>
> ->pages is really confusingly named. It doesn't count pages, it counts
> bytes!
>
> We could call it `bytes', but perhaps charge_bytes would be more
> communicative?
>
Ah, I agree. I'll change this to "bytes".
> > +/*
> > + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
> > + * In that cases, pages are freed continuously and we can expect pages
> > + * are in the same memcg. All these calls itself limits the number of
> > + * pages freed at once, then uncharge_start/end() is called properly.
> > + */
> > +
> > +void mem_cgroup_uncharge_start(void)
> > +{
> > + if (!current->memcg_batch.do_batch) {
> > + current->memcg_batch.memcg = NULL;
> > + current->memcg_batch.pages = 0;
> > + current->memcg_batch.memsw = 0;
>
> what's memsw?
>
Ah, memcontrol.c uses "memsw" in several places.
For example, the memory usage interface to the user is shown as
memory.usage_in_bytes
and the memory+swap usage interface to the user is shown as
memory.memsw.usage_in_bytes.
But, hmm, this is visible from sched.h then...
memsw_bytes or memory_and_swap_bytes ?
> > + }
> > + current->memcg_batch.do_batch++;
> > +}
> > +
> >
> > ...
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> > + struct memcg_batch_info {
> > + int do_batch;
> > + struct mem_cgroup *memcg;
> > + long pages, memsw;
> > + } memcg_batch;
> > +#endif
>
> I find the most valuable documentation is that which is devoted to the
> data structures. This one didn't get any :(
>
> Negative values of `pages' and `memsw' are meaningless, so it would be
> better to give them an unsigned type. That matches the
> res_counter_charge() expectations also.
>
Agreed. I'll rewrite this part with appropriate comments.
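Just to sketch that direction (names and comments below are only an
illustration of the review feedback, not the respin itself), the member
could end up looking something like:
==
#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
	struct memcg_batch_info {
		int do_batch;	/* incremented by mem_cgroup_uncharge_start(), nests */
		struct mem_cgroup *memcg;	/* memcg the current batch belongs to */
		unsigned long bytes;		/* uncharged usage of mem->res, in bytes */
		unsigned long memsw_bytes;	/* uncharged usage of mem->memsw, in bytes */
	} memcg_batch;
#endif
==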
Thank you for pointing that out.
Regards,
-Kame
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] memcg: coalescing charges per cpu
2009-10-08 22:26 ` Andrew Morton
@ 2009-10-08 23:54 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-08 23:54 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
On Thu, 8 Oct 2009 15:26:20 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 2 Oct 2009 14:03:43 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > This is a patch for coalescing access to res_counter at charging by
> > percpu caching. At charge, memcg charges 64pages and remember it in
> > percpu cache. Because it's cache, drain/flush is done if necessary.
> >
> > This version uses public percpu area.
> > 2 benefits of using public percpu area.
> > 1. Sum of stocked charge in the system is limited to # of cpus
> > not to the number of memcg. This shows better synchonization.
> > 2. drain code for flush/cpuhotplug is very easy (and quick)
> >
> > The most important point of this patch is that we never touch res_counter
> > in fast path. The res_counter is system-wide shared counter which is modified
> > very frequently. We shouldn't touch it as far as we can for avoiding
> > false sharing.
> >
> > ...
> >
> > +/* size of first charge trial. "32" comes from vmscan.c's magic value */
> > +#define CHARGE_SIZE (32 * PAGE_SIZE)
> > +struct memcg_stock_pcp {
> > + struct mem_cgroup *cached;
> > + int charge;
> > + struct work_struct work;
> > +};
> > +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> > +static DEFINE_MUTEX(memcg_drain_mutex);
> > +
> > +static bool consume_stock(struct mem_cgroup *mem)
> > +{
> > + struct memcg_stock_pcp *stock;
> > + bool ret = true;
> > +
> > + stock = &get_cpu_var(memcg_stock);
> > + if (mem == stock->cached && stock->charge)
> > + stock->charge -= PAGE_SIZE;
> > + else
> > + ret = false;
> > + put_cpu_var(memcg_stock);
> > + return ret;
> > +}
>
> It's unobvious what the return value from this function means. A nice
> comment would help.
will do.
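For example, something along these lines (just a sketch of the wording):
==
/*
 * Try to consume stocked charge on this cpu. If the stock is cached for
 * "mem" and is not empty, a page worth of charge is taken from it and
 * true is returned. If false is returned, the caller has to fall back
 * to res_counter_charge() by itself.
 */
static bool consume_stock(struct mem_cgroup *mem)
==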
>
> > +static void drain_stock(struct memcg_stock_pcp *stock)
> > +{
> > + struct mem_cgroup *old = stock->cached;
> > +
> > + if (stock->charge) {
> > + res_counter_uncharge(&old->res, stock->charge);
> > + if (do_swap_account)
> > + res_counter_uncharge(&old->memsw, stock->charge);
> > + }
> > + stock->cached = NULL;
> > + stock->charge = 0;
> > +}
> > +
> > +static void drain_local_stock(struct work_struct *dummy)
> > +{
> > + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> > + drain_stock(stock);
> > + put_cpu_var(memcg_stock);
> > +}
>
> drain_local_stock() is only ever run by a thread which is pinned to a
> particular CPU, so we can use plain old __get_cpu_var() here and remove
> the put_cpu_var(), methinks. If this is done, a comment should be
> added explaining why we can use that optimisation.
>
Ok. I'll try and test that.
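Roughly, that would look like the following (untested sketch):
==
/*
 * Drain the stock of the cpu this work runs on. The work is always queued
 * with schedule_work_on(), i.e. it runs pinned to its cpu, so a plain
 * __get_cpu_var() is enough and no get/put_cpu_var() pair is needed.
 */
static void drain_local_stock(struct work_struct *dummy)
{
	drain_stock(&__get_cpu_var(memcg_stock));
}
==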
> > +static void refill_stock(struct mem_cgroup *mem, int val)
> > +{
> > + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> > +
> > + if (stock->cached != mem) {
> > + drain_stock(stock);
> > + stock->cached = mem;
> > + }
> > + stock->charge += val;
> > + put_cpu_var(memcg_stock);
> > +}
> > +
> > +static void drain_all_stock_async(void)
> > +{
> > + int cpu;
> > + /* Contention means someone tries to flush. */
> > + if (!mutex_trylock(&memcg_drain_mutex))
> > + return;
>
> Any time I see a trylock I ask "hm, what happens when it fails - what
> are the consequences and how do we later rerun the aborted operation".
>
I'll add more comments.
> That's unobvious here and merits a comment.
>
> > + for_each_online_cpu(cpu) {
>
> Did we need get_online_cpus() here?
>
Ahhhh, yes. It's necessary.
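I.e. the loop would be bracketed roughly like this (sketch only):
==
	get_online_cpus();	/* keep the set of online cpus stable */
	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		if (work_pending(&stock->work))
			continue;
		INIT_WORK(&stock->work, drain_local_stock);
		schedule_work_on(cpu, &stock->work);
	}
	put_online_cpus();
==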
> > + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > + if (work_pending(&stock->work))
> > + continue;
> > + INIT_WORK(&stock->work, drain_local_stock);
> > + schedule_work_on(cpu, &stock->work);
> > + }
> > + mutex_unlock(&memcg_drain_mutex);
> > + /* We don't wait for flush_work */
> > +}
> > +
> > +static void drain_all_stock_sync(void)
> > +{
> > + /* called when force_empty is called */
> > + mutex_lock(&memcg_drain_mutex);
> > + schedule_on_each_cpu(drain_local_stock);
> > + mutex_unlock(&memcg_drain_mutex);
> > +}
>
> In fact it would be nice were each of these functions to have a brief
> note describing their role.
>
Okay. I'll add.
> > +static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
> > + unsigned long action,
> > + void *hcpu)
> > +{
> > +#ifdef CONFIG_HOTPLUG_CPU
> > + int cpu = (unsigned long)hcpu;
> > + struct memcg_stock_pcp *stock;
> > +
> > + if (action != CPU_DEAD)
> > + return NOTIFY_OK;
> > + stock = &per_cpu(memcg_stock, cpu);
> > + drain_stock(stock);
> > +#endif
> > + return NOTIFY_OK;
> > +}
>
> Is the ifdef needed? Using hotcpu_notifier() should cause this entire
> function to disappear from vmlinux when CONFIG_HOTPLUG_CPU=n.
>
Ah, I didn't notice that. I'll try.
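If that works out, the callback could presumably just lose the #ifdef
(untested sketch):
==
static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
					unsigned long action, void *hcpu)
{
	int cpu = (unsigned long)hcpu;
	struct memcg_stock_pcp *stock;
	if (action != CPU_DEAD)
		return NOTIFY_OK;
	stock = &per_cpu(memcg_stock, cpu);
	drain_stock(stock);	/* return the dead cpu's stock to res_counter */
	return NOTIFY_OK;
}
==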
Thank you.
Regards,
-Kame
> > /*
> > * Unlike exported interface, "oom" parameter is added. if oom==true,
> > * oom-killer can be invoked.
> > @@ -1269,6 +1370,7 @@ static int __mem_cgroup_try_charge(struc
> > struct mem_cgroup *mem, *mem_over_limit;
> > int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > struct res_counter *fail_res;
> > + int csize = CHARGE_SIZE;
> >
> > if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> > /* Don't account this! */
> > @@ -1293,23 +1395,25 @@ static int __mem_cgroup_try_charge(struc
> > return 0;
> >
> > VM_BUG_ON(css_is_removed(&mem->css));
> > + if (mem_cgroup_is_root(mem))
> > + goto done;
> >
> > while (1) {
> > int ret = 0;
> > unsigned long flags = 0;
> >
> > - if (mem_cgroup_is_root(mem))
> > - goto done;
> > - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> > + if (consume_stock(mem))
> > + goto charged;
> > +
> > + ret = res_counter_charge(&mem->res, csize, &fail_res);
> > if (likely(!ret)) {
> > if (!do_swap_account)
> > break;
> > - ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> > - &fail_res);
> > + ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> > if (likely(!ret))
> > break;
> > /* mem+swap counter fails */
> > - res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + res_counter_uncharge(&mem->res, csize);
> > flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> > mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> > memsw);
> > @@ -1318,6 +1422,11 @@ static int __mem_cgroup_try_charge(struc
> > mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> > res);
> >
> > + /* reduce request size and retry */
> > + if (csize > PAGE_SIZE) {
> > + csize = PAGE_SIZE;
> > + continue;
> > + }
> > if (!(gfp_mask & __GFP_WAIT))
> > goto nomem;
> >
> > @@ -1347,6 +1456,9 @@ static int __mem_cgroup_try_charge(struc
> > goto nomem;
> > }
> > }
> > + if (csize > PAGE_SIZE)
> > + refill_stock(mem, csize - PAGE_SIZE);
> > +charged:
> > /*
> > * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> > * if they exceeds softlimit.
> > @@ -2463,6 +2575,7 @@ move_account:
> > goto out;
> > /* This is for making all *used* pages to be on LRU. */
> > lru_add_drain_all();
> > + drain_all_stock_sync();
> > ret = 0;
> > for_each_node_state(node, N_HIGH_MEMORY) {
> > for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> > @@ -3181,6 +3294,7 @@ mem_cgroup_create(struct cgroup_subsys *
> > root_mem_cgroup = mem;
> > if (mem_cgroup_soft_limit_tree_init())
> > goto free_out;
> > + hotcpu_notifier(memcg_stock_cpu_callback, 0);
> >
> > } else {
> > parent = mem_cgroup_from_cont(cont->parent);
>
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation
2009-10-02 5:01 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation KAMEZAWA Hiroyuki
2009-10-02 6:47 ` Hiroshi Shimamoto
@ 2009-10-09 4:01 ` Balbir Singh
2009-10-09 4:17 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-10-09 4:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-10-02 14:01:26]:
>
> In massive parallel enviroment, res_counter can be a performance bottleneck.
> One strong techinque to reduce lock contention is reducing calls by
> coalescing some amount of calls into one.
>
> Considering charge/uncharge chatacteristic,
> - charge is done one by one via demand-paging.
> - uncharge is done by
> - in chunk at munmap, truncate, exit, execve...
> - one by one via vmscan/paging.
>
> It seems we have a chance in uncharge at unmap/truncation.
A chance to improve scalability?
>
> This patch is a for coalescing uncharge. For avoiding scattering memcg's
> structure to functions under /mm, this patch adds memcg batch uncharge
> information to the task.
>
Is there a reason for associating batch with the task rather than
per-cpu or per-memcg? per-memcg, I suspect would add some locking
overhead, per-cpu would require synchronization across cpu's while
uncharging, is that where per-task helps? I suspect per-mm,
per-signal will have the issues above.
> The degree of coalescing depends on callers
> - at invalidate/trucate... pagevec size
> - at unmap ....ZAP_BLOCK_SIZE
> (memory itself will be freed in this degree.)
> Then, we'll not coalescing too much.
>
> Changelog(now):
> - rebased onto the latest mmotm + softlimit fix patches.
>
> Changelog(old):
> - unified patch for callers
> - added commetns.
> - make ->do_batch as bool.
> - removed css_get() at el. We don't need it.
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/memcontrol.h | 13 ++++++
> include/linux/sched.h | 7 +++
> mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
> mm/memory.c | 2
> mm/truncate.c | 6 ++
> 5 files changed, 113 insertions(+), 6 deletions(-)
>
> Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> @@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
> extern void mem_cgroup_del_lru(struct page *page);
> extern void mem_cgroup_move_lists(struct page *page,
> enum lru_list from, enum lru_list to);
> +
> +/* For coalescing uncharge for reducing memcg' overhead*/
> +extern void mem_cgroup_uncharge_start(void);
> +extern void mem_cgroup_uncharge_end(void);
> +
> extern void mem_cgroup_uncharge_page(struct page *page);
> extern void mem_cgroup_uncharge_cache_page(struct page *page);
> extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
> {
> }
>
> +static inline void mem_cgroup_uncharge_batch_start(void)
> +{
> +}
> +
> +static inline void mem_cgroup_uncharge_batch_start(void)
> +{
> +}
> +
> static inline void mem_cgroup_uncharge_page(struct page *page)
> {
> }
> Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> @@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
> css_put(&mem->css);
> }
>
> +static void
> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +{
> + struct memcg_batch_info *batch = NULL;
> + bool uncharge_memsw = true;
> + /* If swapout, usage of swap doesn't decrease */
> + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> + uncharge_memsw = false;
> + /*
> + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> + * In those cases, all pages freed continously can be expected to be in
> + * the same cgroup and we have chance to coalesce uncharges.
> + * And, we do uncharge one by one if this is killed by OOM.
> + */
> + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> + goto direct_uncharge;
Should we also not uncharge the current batch when the task is dying?
> +
> + batch = ¤t->memcg_batch;
> + /*
> + * In usual, we do css_get() when we remember memcg pointer.
> + * But in this case, we keep res->usage until end of a series of
> + * uncharges. Then, it's ok to ignore memcg's refcnt.
> + */
> + if (!batch->memcg)
> + batch->memcg = mem;
> + /*
> + * In typical case, batch->memcg == mem. This means we can
> + * merge a series of uncharges to an uncharge of res_counter.
> + * If not, we uncharge res_counter ony by one.
> + */
> + if (batch->memcg != mem)
> + goto direct_uncharge;
> + /* remember freed charge and uncharge it later */
> + batch->pages += PAGE_SIZE;
> + if (uncharge_memsw)
> + batch->memsw += PAGE_SIZE;
> + return;
> +direct_uncharge:
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + if (uncharge_memsw)
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> + return;
> +}
>
> /*
> * uncharge if !page_mapped(page)
> @@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
> break;
> }
>
> - if (!mem_cgroup_is_root(mem)) {
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> - if (do_swap_account &&
> - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> - }
> + if (!mem_cgroup_is_root(mem))
> + __do_uncharge(mem, ctype);
> if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> mem_cgroup_swap_statistics(mem, true);
> mem_cgroup_charge_statistics(mem, pc, false);
> @@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> }
>
> +/*
> + * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
> + * In that cases, pages are freed continuously and we can expect pages
> + * are in the same memcg. All these calls itself limits the number of
> + * pages freed at once, then uncharge_start/end() is called properly.
> + */
> +
> +void mem_cgroup_uncharge_start(void)
> +{
> + if (!current->memcg_batch.do_batch) {
> + current->memcg_batch.memcg = NULL;
> + current->memcg_batch.pages = 0;
> + current->memcg_batch.memsw = 0;
> + }
> + current->memcg_batch.do_batch++;
> +}
> +
> +void mem_cgroup_uncharge_end(void)
> +{
> + struct mem_cgroup *mem;
> +
> + if (!current->memcg_batch.do_batch)
> + return;
> +
> + current->memcg_batch.do_batch--;
> + if (current->memcg_batch.do_batch) /* Nested ? */
> + return;
> +
> + mem = current->memcg_batch.memcg;
> + if (!mem)
> + return;
> + /* This "mem" is valid bacause we hide charges behind us. */
> + if (current->memcg_batch.pages)
> + res_counter_uncharge(&mem->res, current->memcg_batch.pages);
> + if (current->memcg_batch.memsw)
> + res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
> + /* Not necessary. but forget this pointer */
> + current->memcg_batch.memcg = NULL;
> +}
> +
> #ifdef CONFIG_SWAP
> /*
> * called after __delete_from_swap_cache() and drop "page" account.
> Index: mmotm-2.6.31-Sep28/include/linux/sched.h
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
> +++ mmotm-2.6.31-Sep28/include/linux/sched.h
> @@ -1549,6 +1549,13 @@ struct task_struct {
> unsigned long trace_recursion;
> #endif /* CONFIG_TRACING */
> unsigned long stack_start;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> + struct memcg_batch_info {
> + int do_batch;
> + struct mem_cgroup *memcg;
> + long pages, memsw;
> + } memcg_batch;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> Index: mmotm-2.6.31-Sep28/mm/memory.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memory.c
> +++ mmotm-2.6.31-Sep28/mm/memory.c
> @@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
> details = NULL;
>
> BUG_ON(addr >= end);
> + mem_cgroup_uncharge_start();
> tlb_start_vma(tlb, vma);
> pgd = pgd_offset(vma->vm_mm, addr);
> do {
> @@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
> zap_work, details);
> } while (pgd++, addr = next, (addr != end && *zap_work > 0));
> tlb_end_vma(tlb, vma);
> + mem_cgroup_uncharge_end();
>
> return addr;
> }
> Index: mmotm-2.6.31-Sep28/mm/truncate.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/truncate.c
> +++ mmotm-2.6.31-Sep28/mm/truncate.c
> @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
> pagevec_release(&pvec);
> break;
> }
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
>
> @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
> unlock_page(page);
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> }
> }
> EXPORT_SYMBOL(truncate_inode_pages_range);
> @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
> pagevec_init(&pvec, 0);
> while (next <= end &&
> pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> pgoff_t index;
> @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
> break;
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> cond_resched();
> }
> return ret;
> @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
> while (next <= end && !wrapped &&
> pagevec_lookup(&pvec, mapping, next,
> min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> + mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> pgoff_t page_index;
> @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
> unlock_page(page);
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_end();
> cond_resched();
> }
> return ret;
>
The patch overall looks good, just some questions about it.
--
Balbir
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] memcg: coalescing charges per cpu
2009-10-02 5:03 ` [PATCH 2/2] memcg: coalescing charges per cpu KAMEZAWA Hiroyuki
2009-10-08 22:26 ` Andrew Morton
@ 2009-10-09 4:15 ` Balbir Singh
2009-10-09 4:25 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-10-09 4:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-10-02 14:03:43]:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> This is a patch for coalescing access to res_counter at charging by
> percpu caching. At charge, memcg charges 64pages and remember it in
> percpu cache. Because it's cache, drain/flush is done if necessary.
>
> This version uses public percpu area.
> 2 benefits of using public percpu area.
> 1. Sum of stocked charge in the system is limited to # of cpus
> not to the number of memcg. This shows better synchonization.
> 2. drain code for flush/cpuhotplug is very easy (and quick)
>
> The most important point of this patch is that we never touch res_counter
> in fast path. The res_counter is system-wide shared counter which is modified
> very frequently. We shouldn't touch it as far as we can for avoiding
> false sharing.
>
> Changelog (new):
> - rabased onto the latest mmotm
> Changelog (old):
> - moved charge size check before __GFP_WAIT check for avoiding unnecesary
> - added asynchronous flush routine.
> - fixed bugs pointed out by Nishimura-san.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 126 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 120 insertions(+), 6 deletions(-)
>
> Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> @@ -38,6 +38,7 @@
> #include <linux/vmalloc.h>
> #include <linux/mm_inline.h>
> #include <linux/page_cgroup.h>
> +#include <linux/cpu.h>
> #include "internal.h"
>
> #include <asm/uaccess.h>
> @@ -275,6 +276,7 @@ enum charge_type {
> static void mem_cgroup_get(struct mem_cgroup *mem);
> static void mem_cgroup_put(struct mem_cgroup *mem);
> static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> +static void drain_all_stock_async(void);
>
> static struct mem_cgroup_per_zone *
> mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -1137,6 +1139,8 @@ static int mem_cgroup_hierarchical_recla
> victim = mem_cgroup_select_victim(root_mem);
> if (victim == root_mem) {
> loop++;
> + if (loop >= 1)
> + drain_all_stock_async();
> if (loop >= 2) {
> /*
> * If we have not been able to reclaim
> @@ -1258,6 +1262,103 @@ done:
> unlock_page_cgroup(pc);
> }
>
> +/* size of first charge trial. "32" comes from vmscan.c's magic value */
> +#define CHARGE_SIZE (32 * PAGE_SIZE)
Should this then be SWAP_CLUSTER_MAX instead of 32?
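(Just to make the suggestion concrete, that would presumably read
	#define CHARGE_SIZE	(SWAP_CLUSTER_MAX * PAGE_SIZE)
given that SWAP_CLUSTER_MAX is 32 in this tree.)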
> +struct memcg_stock_pcp {
> + struct mem_cgroup *cached;
> + int charge;
> + struct work_struct work;
> +};
> +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> +static DEFINE_MUTEX(memcg_drain_mutex);
> +
> +static bool consume_stock(struct mem_cgroup *mem)
> +{
> + struct memcg_stock_pcp *stock;
> + bool ret = true;
> +
> + stock = &get_cpu_var(memcg_stock);
> + if (mem == stock->cached && stock->charge)
> + stock->charge -= PAGE_SIZE;
> + else
> + ret = false;
> + put_cpu_var(memcg_stock);
> + return ret;
> +}
Shouldn't consume_stock() and the routines below check for mem_cgroup_is_root()?
> +
> +static void drain_stock(struct memcg_stock_pcp *stock)
> +{
> + struct mem_cgroup *old = stock->cached;
> +
> + if (stock->charge) {
> + res_counter_uncharge(&old->res, stock->charge);
> + if (do_swap_account)
> + res_counter_uncharge(&old->memsw, stock->charge);
> + }
> + stock->cached = NULL;
> + stock->charge = 0;
> +}
> +
> +static void drain_local_stock(struct work_struct *dummy)
> +{
> + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> + drain_stock(stock);
> + put_cpu_var(memcg_stock);
> +}
> +
> +static void refill_stock(struct mem_cgroup *mem, int val)
> +{
> + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> +
> + if (stock->cached != mem) {
> + drain_stock(stock);
> + stock->cached = mem;
> + }
More comments here would help
> + stock->charge += val;
> + put_cpu_var(memcg_stock);
> +}
> +
> +static void drain_all_stock_async(void)
> +{
> + int cpu;
> + /* Contention means someone tries to flush. */
> + if (!mutex_trylock(&memcg_drain_mutex))
> + return;
> + for_each_online_cpu(cpu) {
> + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> + if (work_pending(&stock->work))
> + continue;
> + INIT_WORK(&stock->work, drain_local_stock);
> + schedule_work_on(cpu, &stock->work);
> + }
> + mutex_unlock(&memcg_drain_mutex);
> + /* We don't wait for flush_work */
> +}
> +
> +static void drain_all_stock_sync(void)
> +{
> + /* called when force_empty is called */
> + mutex_lock(&memcg_drain_mutex);
> + schedule_on_each_cpu(drain_local_stock);
> + mutex_unlock(&memcg_drain_mutex);
> +}
> +
> +static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
> + unsigned long action,
> + void *hcpu)
> +{
> +#ifdef CONFIG_HOTPLUG_CPU
> + int cpu = (unsigned long)hcpu;
> + struct memcg_stock_pcp *stock;
> +
> + if (action != CPU_DEAD)
> + return NOTIFY_OK;
> + stock = &per_cpu(memcg_stock, cpu);
> + drain_stock(stock);
> +#endif
> + return NOTIFY_OK;
> +}
> +
> /*
> * Unlike exported interface, "oom" parameter is added. if oom==true,
> * oom-killer can be invoked.
> @@ -1269,6 +1370,7 @@ static int __mem_cgroup_try_charge(struc
> struct mem_cgroup *mem, *mem_over_limit;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> struct res_counter *fail_res;
> + int csize = CHARGE_SIZE;
>
> if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> /* Don't account this! */
> @@ -1293,23 +1395,25 @@ static int __mem_cgroup_try_charge(struc
> return 0;
>
> VM_BUG_ON(css_is_removed(&mem->css));
> + if (mem_cgroup_is_root(mem))
> + goto done;
>
> while (1) {
> int ret = 0;
> unsigned long flags = 0;
>
> - if (mem_cgroup_is_root(mem))
> - goto done;
> - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> + if (consume_stock(mem))
> + goto charged;
> +
> + ret = res_counter_charge(&mem->res, csize, &fail_res);
> if (likely(!ret)) {
> if (!do_swap_account)
> break;
> - ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> - &fail_res);
> + ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> if (likely(!ret))
> break;
> /* mem+swap counter fails */
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> + res_counter_uncharge(&mem->res, csize);
> flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> memsw);
> @@ -1318,6 +1422,11 @@ static int __mem_cgroup_try_charge(struc
> mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> res);
>
> + /* reduce request size and retry */
> + if (csize > PAGE_SIZE) {
> + csize = PAGE_SIZE;
> + continue;
> + }
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> @@ -1347,6 +1456,9 @@ static int __mem_cgroup_try_charge(struc
> goto nomem;
> }
> }
> + if (csize > PAGE_SIZE)
> + refill_stock(mem, csize - PAGE_SIZE);
> +charged:
> /*
> * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> * if they exceeds softlimit.
> @@ -2463,6 +2575,7 @@ move_account:
> goto out;
> /* This is for making all *used* pages to be on LRU. */
> lru_add_drain_all();
> + drain_all_stock_sync();
> ret = 0;
> for_each_node_state(node, N_HIGH_MEMORY) {
> for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> @@ -3181,6 +3294,7 @@ mem_cgroup_create(struct cgroup_subsys *
> root_mem_cgroup = mem;
> if (mem_cgroup_soft_limit_tree_init())
> goto free_out;
> + hotcpu_notifier(memcg_stock_cpu_callback, 0);
>
> } else {
> parent = mem_cgroup_from_cont(cont->parent);
>
I tested earlier versions of the patchset and they worked absolutely
fine for me!
--
Balbir
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation
2009-10-09 4:01 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation Balbir Singh
@ 2009-10-09 4:17 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-09 4:17 UTC (permalink / raw)
To: balbir
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp
On Fri, 9 Oct 2009 09:31:42 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-10-02 14:01:26]:
>
> >
> > In massive parallel enviroment, res_counter can be a performance bottleneck.
> > One strong techinque to reduce lock contention is reducing calls by
> > coalescing some amount of calls into one.
> >
> > Considering charge/uncharge chatacteristic,
> > - charge is done one by one via demand-paging.
> > - uncharge is done by
> > - in chunk at munmap, truncate, exit, execve...
> > - one by one via vmscan/paging.
> >
> > It seems we have a chance in uncharge at unmap/truncation.
>
> A chance to improve scalability?
yes.
> >
> > This patch is a for coalescing uncharge. For avoiding scattering memcg's
> > structure to functions under /mm, this patch adds memcg batch uncharge
> > information to the task.
> >
>
> Is there a reason for associating batch with the task rather than
> per-cpu or per-memcg? per-memcg,
Per-cpu is bad because we can't be robust against task migration.
(I mean, per-cpu cannot be as clean/simple as this patch.)
Per-memcg doesn't work well. If we design it per-memcg, we'll need some logic
like a percpu counter, and its batch length cannot be big enough.
And we'd need more fine-grained synchronization than this patch.
> I suspect would add some locking
> overhead, per-cpu would require synchronization across cpu's while
> uncharging, is that where per-task helps?
Yes, it's one of the reasons for this design.
> I suspect per-mm, per-signal will have the issues above.
Per-mm and per-signal can't help. Multiple threads can do file truncations (and unmap)
in parallel.
The biggest reason for per-task is that I'd like to change this behavior based on
the calling _context_. A per-task batch means that we can make use of the caller's
context. For example, when uncharge is called by try_to_unmap() or
__remove_mapping() in vmscan.c, we do direct uncharge.
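So the callers decide the batching window; a chunk-freeing path brackets its
frees roughly like this (a simplified illustration of the calling pattern,
not real code):
==
	mem_cgroup_uncharge_start();	/* current->memcg_batch.do_batch++ */
	for (i = 0; i < nr; i++)	/* e.g. one pagevec, or one ZAP_BLOCK_SIZE chunk */
		mem_cgroup_uncharge_page(pages[i]);	/* accumulates in current->memcg_batch */
	mem_cgroup_uncharge_end();	/* uncharges the accumulated total at once */
==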
>
> > The degree of coalescing depends on callers
> > - at invalidate/trucate... pagevec size
> > - at unmap ....ZAP_BLOCK_SIZE
> > (memory itself will be freed in this degree.)
> > Then, we'll not coalescing too much.
> >
> > Changelog(now):
> > - rebased onto the latest mmotm + softlimit fix patches.
> >
> > Changelog(old):
> > - unified patch for callers
> > - added commetns.
> > - make ->do_batch as bool.
> > - removed css_get() at el. We don't need it.
> >
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> > include/linux/memcontrol.h | 13 ++++++
> > include/linux/sched.h | 7 +++
> > mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
> > mm/memory.c | 2
> > mm/truncate.c | 6 ++
> > 5 files changed, 113 insertions(+), 6 deletions(-)
> >
> > Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
> > +++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> > @@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
> > extern void mem_cgroup_del_lru(struct page *page);
> > extern void mem_cgroup_move_lists(struct page *page,
> > enum lru_list from, enum lru_list to);
> > +
> > +/* For coalescing uncharge, to reduce memcg's overhead */
> > +extern void mem_cgroup_uncharge_start(void);
> > +extern void mem_cgroup_uncharge_end(void);
> > +
> > extern void mem_cgroup_uncharge_page(struct page *page);
> > extern void mem_cgroup_uncharge_cache_page(struct page *page);
> > extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> > @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
> > {
> > }
> >
> > +static inline void mem_cgroup_uncharge_start(void)
> > +{
> > +}
> > +
> > +static inline void mem_cgroup_uncharge_end(void)
> > +{
> > +}
> > +
> > static inline void mem_cgroup_uncharge_page(struct page *page)
> > {
> > }
> > Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> > +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> > @@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
> > css_put(&mem->css);
> > }
> >
> > +static void
> > +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> > +{
> > + struct memcg_batch_info *batch = NULL;
> > + bool uncharge_memsw = true;
> > + /* If swapout, usage of swap doesn't decrease */
> > + if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > + uncharge_memsw = false;
> > + /*
> > + * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> > + * In those cases, all pages freed continuously can be expected to be in
> > + * the same cgroup and we have a chance to coalesce uncharges.
> > + * And we uncharge one by one if this task is being killed by OOM.
> > + */
> > + if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> > + goto direct_uncharge;
>
> Should we also not uncharge the current batch when the task is dying?
>
TIF_MEMDIE means an OOM kill is in progress. I take the "reduce usage as fast
as possible" approach here.
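If it makes that clearer, I can add a comment at the check, something like
(same two lines as in the hunk above, comment added only for illustration):
==
	/* Under OOM (TIF_MEMDIE), skip batching and give the charge back to
	 * the res_counter immediately, so usage drops as fast as possible. */
	if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
		goto direct_uncharge;
==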
> > +
> > + batch = ¤t->memcg_batch;
> > + /*
> > + * Usually, we do css_get() when we remember a memcg pointer.
> > + * But in this case, we keep res->usage until the end of a series of
> > + * uncharges. Then, it's OK to ignore memcg's refcnt.
> > + */
> > + if (!batch->memcg)
> > + batch->memcg = mem;
> > + /*
> > + * In the typical case, batch->memcg == mem. This means we can
> > + * merge a series of uncharges into one res_counter uncharge.
> > + * If not, we uncharge the res_counter one by one.
> > + */
> > + if (batch->memcg != mem)
> > + goto direct_uncharge;
> > + /* remember freed charge and uncharge it later */
> > + batch->pages += PAGE_SIZE;
> > + if (uncharge_memsw)
> > + batch->memsw += PAGE_SIZE;
> > + return;
> > +direct_uncharge:
> > + res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + if (uncharge_memsw)
> > + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > + return;
> > +}
> >
> > /*
> > * uncharge if !page_mapped(page)
> > @@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
> > break;
> > }
> >
> > - if (!mem_cgroup_is_root(mem)) {
> > - res_counter_uncharge(&mem->res, PAGE_SIZE);
> > - if (do_swap_account &&
> > - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> > - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > - }
> > + if (!mem_cgroup_is_root(mem))
> > + __do_uncharge(mem, ctype);
> > if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > mem_cgroup_swap_statistics(mem, true);
> > mem_cgroup_charge_statistics(mem, pc, false);
> > @@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
> > __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> > }
> >
> > +/*
> > + * batch_start/batch_end is called in unmap_page_range/invalidate/truncate.
> > + * In those cases, pages are freed continuously and we can expect them
> > + * to be in the same memcg. Each of these callers itself limits the number
> > + * of pages freed at once, so uncharge_start/end() is called appropriately.
> > + */
> > +
> > +void mem_cgroup_uncharge_start(void)
> > +{
> > + if (!current->memcg_batch.do_batch) {
> > + current->memcg_batch.memcg = NULL;
> > + current->memcg_batch.pages = 0;
> > + current->memcg_batch.memsw = 0;
> > + }
> > + current->memcg_batch.do_batch++;
> > +}
> > +
> > +void mem_cgroup_uncharge_end(void)
> > +{
> > + struct mem_cgroup *mem;
> > +
> > + if (!current->memcg_batch.do_batch)
> > + return;
> > +
> > + current->memcg_batch.do_batch--;
> > + if (current->memcg_batch.do_batch) /* Nested ? */
> > + return;
> > +
> > + mem = current->memcg_batch.memcg;
> > + if (!mem)
> > + return;
> > + /* This "mem" is valid because we hide charges behind us. */
> > + if (current->memcg_batch.pages)
> > + res_counter_uncharge(&mem->res, current->memcg_batch.pages);
> > + if (current->memcg_batch.memsw)
> > + res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
> > + /* Not necessary. but forget this pointer */
> > + current->memcg_batch.memcg = NULL;
> > +}
> > +
> > #ifdef CONFIG_SWAP
> > /*
> > * called after __delete_from_swap_cache() and drop "page" account.
> > Index: mmotm-2.6.31-Sep28/include/linux/sched.h
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
> > +++ mmotm-2.6.31-Sep28/include/linux/sched.h
> > @@ -1549,6 +1549,13 @@ struct task_struct {
> > unsigned long trace_recursion;
> > #endif /* CONFIG_TRACING */
> > unsigned long stack_start;
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> > + struct memcg_batch_info {
> > + int do_batch;
> > + struct mem_cgroup *memcg;
> > + long pages, memsw;
> > + } memcg_batch;
> > +#endif
> > };
> >
> > /* Future-safe accessor for struct task_struct's cpus_allowed. */
> > Index: mmotm-2.6.31-Sep28/mm/memory.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/memory.c
> > +++ mmotm-2.6.31-Sep28/mm/memory.c
> > @@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
> > details = NULL;
> >
> > BUG_ON(addr >= end);
> > + mem_cgroup_uncharge_start();
> > tlb_start_vma(tlb, vma);
> > pgd = pgd_offset(vma->vm_mm, addr);
> > do {
> > @@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
> > zap_work, details);
> > } while (pgd++, addr = next, (addr != end && *zap_work > 0));
> > tlb_end_vma(tlb, vma);
> > + mem_cgroup_uncharge_end();
> >
> > return addr;
> > }
> > Index: mmotm-2.6.31-Sep28/mm/truncate.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/truncate.c
> > +++ mmotm-2.6.31-Sep28/mm/truncate.c
> > @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
> > pagevec_release(&pvec);
> > break;
> > }
> > + mem_cgroup_uncharge_start();
> > for (i = 0; i < pagevec_count(&pvec); i++) {
> > struct page *page = pvec.pages[i];
> >
> > @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
> > unlock_page(page);
> > }
> > pagevec_release(&pvec);
> > + mem_cgroup_uncharge_end();
> > }
> > }
> > EXPORT_SYMBOL(truncate_inode_pages_range);
> > @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
> > pagevec_init(&pvec, 0);
> > while (next <= end &&
> > pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> > + mem_cgroup_uncharge_start();
> > for (i = 0; i < pagevec_count(&pvec); i++) {
> > struct page *page = pvec.pages[i];
> > pgoff_t index;
> > @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
> > break;
> > }
> > pagevec_release(&pvec);
> > + mem_cgroup_uncharge_end();
> > cond_resched();
> > }
> > return ret;
> > @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
> > while (next <= end && !wrapped &&
> > pagevec_lookup(&pvec, mapping, next,
> > min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> > + mem_cgroup_uncharge_start();
> > for (i = 0; i < pagevec_count(&pvec); i++) {
> > struct page *page = pvec.pages[i];
> > pgoff_t page_index;
> > @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
> > unlock_page(page);
> > }
> > pagevec_release(&pvec);
> > + mem_cgroup_uncharge_end();
> > cond_resched();
> > }
> > return ret;
> >
>
> The patch overall looks good, just some questions about it.
>
Thanks,
-Kame
* Re: [PATCH 2/2] memcg: coalescing charges per cpu
2009-10-09 4:15 ` Balbir Singh
@ 2009-10-09 4:25 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-09 4:25 UTC (permalink / raw)
To: balbir
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp
On Fri, 9 Oct 2009 09:45:36 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-10-02 14:03:43]:
>
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > This is a patch for coalescing access to res_counter at charging by
> > percpu caching. At charge, memcg charges CHARGE_SIZE (32 pages in this
> > version) at once and remembers it in a percpu cache. Because it's a cache,
> > drain/flush is done when necessary.
> >
> > This version uses a public (shared) percpu area.
> > 2 benefits of using the public percpu area:
> > 1. The sum of stocked charge in the system is limited by the # of cpus,
> > not by the number of memcgs. This gives better synchronization behavior.
> > 2. The drain code for flush/cpu hotplug is very easy (and quick).
> >
> > The most important point of this patch is that we never touch the res_counter
> > in the fast path. The res_counter is a system-wide shared counter which is
> > modified very frequently. We should avoid touching it as much as we can,
> > to avoid false sharing.
> >
> > Changelog (new):
> > - rebased onto the latest mmotm
> > Changelog (old):
> > - moved the charge size check before the __GFP_WAIT check to avoid unnecessary ...
> > - added asynchronous flush routine.
> > - fixed bugs pointed out by Nishimura-san.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> > mm/memcontrol.c | 126 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 120 insertions(+), 6 deletions(-)
> >
> > Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> > +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> > @@ -38,6 +38,7 @@
> > #include <linux/vmalloc.h>
> > #include <linux/mm_inline.h>
> > #include <linux/page_cgroup.h>
> > +#include <linux/cpu.h>
> > #include "internal.h"
> >
> > #include <asm/uaccess.h>
> > @@ -275,6 +276,7 @@ enum charge_type {
> > static void mem_cgroup_get(struct mem_cgroup *mem);
> > static void mem_cgroup_put(struct mem_cgroup *mem);
> > static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> > +static void drain_all_stock_async(void);
> >
> > static struct mem_cgroup_per_zone *
> > mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> > @@ -1137,6 +1139,8 @@ static int mem_cgroup_hierarchical_recla
> > victim = mem_cgroup_select_victim(root_mem);
> > if (victim == root_mem) {
> > loop++;
> > + if (loop >= 1)
> > + drain_all_stock_async();
> > if (loop >= 2) {
> > /*
> > * If we have not been able to reclaim
> > @@ -1258,6 +1262,103 @@ done:
> > unlock_page_cgroup(pc);
> > }
> >
> > +/* size of first charge trial. "32" comes from vmscan.c's magic value */
> > +#define CHARGE_SIZE (32 * PAGE_SIZE)
>
> Should this then be SWAP_CLUSTER_MAX instead of 32?
>
I'm not sure what number is best here, so I just borrowed a number from vmscan.
(Probably not a bad number.)
I'd like to keep this as it is for a while, but I'll document why I selected it,
something like:
==
SWAP_CLUSTER_MAX (32) is the number of pages reclaimed by vmscan.c in one turn.
So, we borrow that number here. Considering scalability, this should
be proportional to the number of cpus. But that's a TO-DO for now.
==
Maybe we can't tune this until we get 32- or 64+-cpu machines...
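If we later switch to the symbolic constant, it would be a one-liner, something
like this sketch of Balbir's suggestion (not what the posted patch does):
==
#include <linux/swap.h>	/* SWAP_CLUSTER_MAX, if not already pulled in */

/* hypothetical alternative: tie the charge batch to vmscan's reclaim unit */
#define CHARGE_SIZE	(SWAP_CLUSTER_MAX * PAGE_SIZE)
==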
> > +struct memcg_stock_pcp {
> > + struct mem_cgroup *cached;
> > + int charge;
> > + struct work_struct work;
> > +};
> > +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> > +static DEFINE_MUTEX(memcg_drain_mutex);
> > +
> > +static bool consume_stock(struct mem_cgroup *mem)
> > +{
> > + struct memcg_stock_pcp *stock;
> > + bool ret = true;
> > +
> > + stock = &get_cpu_var(memcg_stock);
> > + if (mem == stock->cached && stock->charge)
> > + stock->charge -= PAGE_SIZE;
> > + else
> > + ret = false;
> > + put_cpu_var(memcg_stock);
> > + return ret;
> > +}
>
>
> Shouldn't consume stock and routines below check for memcg_is_root()?
>
This is never called if mem_cgroup_is_root(mem); that's checked by the caller.
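For reference, the caller-side check is this (taken from the
__mem_cgroup_try_charge() hunk quoted below; the comment is added here only
for illustration):
==
	VM_BUG_ON(css_is_removed(&mem->css));
	/* root cgroup is not accounted: skip the res_counter and the
	 * percpu stock entirely */
	if (mem_cgroup_is_root(mem))
		goto done;
==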
> > +
> > +static void drain_stock(struct memcg_stock_pcp *stock)
> > +{
> > + struct mem_cgroup *old = stock->cached;
> > +
> > + if (stock->charge) {
> > + res_counter_uncharge(&old->res, stock->charge);
> > + if (do_swap_account)
> > + res_counter_uncharge(&old->memsw, stock->charge);
> > + }
> > + stock->cached = NULL;
> > + stock->charge = 0;
> > +}
> > +
> > +static void drain_local_stock(struct work_struct *dummy)
> > +{
> > + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> > + drain_stock(stock);
> > + put_cpu_var(memcg_stock);
> > +}
> > +
> > +static void refill_stock(struct mem_cgroup *mem, int val)
> > +{
> > + struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> > +
> > + if (stock->cached != mem) {
> > + drain_stock(stock);
> > + stock->cached = mem;
> > + }
>
> More comments here would help
>
Ah, yes. I'll add.
> > + stock->charge += val;
> > + put_cpu_var(memcg_stock);
> > +}
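Something along these lines, maybe (a sketch of the comments I have in mind;
the code itself is unchanged from the quoted hunk above):
==
static void refill_stock(struct mem_cgroup *mem, int val)
{
	struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);

	/* A cpu caches charge for only one memcg at a time. If the cached
	 * memcg differs from "mem", flush the old charge back to its
	 * res_counter before caching charge for the new memcg. */
	if (stock->cached != mem) {
		drain_stock(stock);
		stock->cached = mem;
	}
	/* keep the leftover charge from try_charge(); consume_stock()
	 * on this cpu will use it for later page-sized charges */
	stock->charge += val;
	put_cpu_var(memcg_stock);
}
==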
> > +
> > +static void drain_all_stock_async(void)
> > +{
> > + int cpu;
> > + /* Contention means someone else is already trying to flush. */
> > + if (!mutex_trylock(&memcg_drain_mutex))
> > + return;
> > + for_each_online_cpu(cpu) {
> > + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > + if (work_pending(&stock->work))
> > + continue;
> > + INIT_WORK(&stock->work, drain_local_stock);
> > + schedule_work_on(cpu, &stock->work);
> > + }
> > + mutex_unlock(&memcg_drain_mutex);
> > + /* We don't wait for flush_work */
> > +}
> > +
> > +static void drain_all_stock_sync(void)
> > +{
> > + /* called when force_empty is called */
> > + mutex_lock(&memcg_drain_mutex);
> > + schedule_on_each_cpu(drain_local_stock);
> > + mutex_unlock(&memcg_drain_mutex);
> > +}
> > +
> > +static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
> > + unsigned long action,
> > + void *hcpu)
> > +{
> > +#ifdef CONFIG_HOTPLUG_CPU
> > + int cpu = (unsigned long)hcpu;
> > + struct memcg_stock_pcp *stock;
> > +
> > + if (action != CPU_DEAD)
> > + return NOTIFY_OK;
> > + stock = &per_cpu(memcg_stock, cpu);
> > + drain_stock(stock);
> > +#endif
> > + return NOTIFY_OK;
> > +}
> > +
> > /*
> > * Unlike exported interface, "oom" parameter is added. if oom==true,
> > * oom-killer can be invoked.
> > @@ -1269,6 +1370,7 @@ static int __mem_cgroup_try_charge(struc
> > struct mem_cgroup *mem, *mem_over_limit;
> > int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > struct res_counter *fail_res;
> > + int csize = CHARGE_SIZE;
> >
> > if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> > /* Don't account this! */
> > @@ -1293,23 +1395,25 @@ static int __mem_cgroup_try_charge(struc
> > return 0;
> >
> > VM_BUG_ON(css_is_removed(&mem->css));
> > + if (mem_cgroup_is_root(mem))
> > + goto done;
> >
> > while (1) {
> > int ret = 0;
> > unsigned long flags = 0;
> >
> > - if (mem_cgroup_is_root(mem))
> > - goto done;
> > - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> > + if (consume_stock(mem))
> > + goto charged;
> > +
> > + ret = res_counter_charge(&mem->res, csize, &fail_res);
> > if (likely(!ret)) {
> > if (!do_swap_account)
> > break;
> > - ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> > - &fail_res);
> > + ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> > if (likely(!ret))
> > break;
> > /* mem+swap counter fails */
> > - res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + res_counter_uncharge(&mem->res, csize);
> > flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> > mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> > memsw);
> > @@ -1318,6 +1422,11 @@ static int __mem_cgroup_try_charge(struc
> > mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> > res);
> >
> > + /* reduce request size and retry */
> > + if (csize > PAGE_SIZE) {
> > + csize = PAGE_SIZE;
> > + continue;
> > + }
> > if (!(gfp_mask & __GFP_WAIT))
> > goto nomem;
> >
> > @@ -1347,6 +1456,9 @@ static int __mem_cgroup_try_charge(struc
> > goto nomem;
> > }
> > }
> > + if (csize > PAGE_SIZE)
> > + refill_stock(mem, csize - PAGE_SIZE);
> > +charged:
> > /*
> > * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> > * if they exceeds softlimit.
> > @@ -2463,6 +2575,7 @@ move_account:
> > goto out;
> > /* This is for making all *used* pages to be on LRU. */
> > lru_add_drain_all();
> > + drain_all_stock_sync();
> > ret = 0;
> > for_each_node_state(node, N_HIGH_MEMORY) {
> > for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> > @@ -3181,6 +3294,7 @@ mem_cgroup_create(struct cgroup_subsys *
> > root_mem_cgroup = mem;
> > if (mem_cgroup_soft_limit_tree_init())
> > goto free_out;
> > + hotcpu_notifier(memcg_stock_cpu_callback, 0);
> >
> > } else {
> > parent = mem_cgroup_from_cont(cont->parent);
> >
>
> I tested earlier versions of the patchset and they worked absolutely
> fine for me!
>
Thanks! I'll update the comments.
Regards,
-Kame
* Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge
[not found] ` <604427e00910091737s52e11ce9p256c95d533dc2837@mail.gmail.com>
@ 2009-10-11 2:33 ` KAMEZAWA Hiroyuki
[not found] ` <604427e00910111134o6f22f0ddg2b87124dd334ec02@mail.gmail.com>
0 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-11 2:33 UTC (permalink / raw)
To: Ying Han
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
Ying Han wrote:
> Hi KAMEZAWA-san: I tested your patch set based on 2.6.32-rc3 but I don't
> see much improvement in the page-fault rate. Here are the numbers I got:
>
> [Before]
> Performance counter stats for './runpause.sh 10' (5 runs):
>
> 226272.271246 task-clock-msecs # 3.768 CPUs ( +- 0.193% )
> 4424 context-switches # 0.000 M/sec ( +- 14.418% )
> 25 CPU-migrations # 0.000 M/sec ( +- 23.077% )
> 80499059 page-faults # 0.356 M/sec ( +- 2.586% )
> 499246232482 cycles # 2206.396 M/sec ( +- 0.055% )
> 193036122022 instructions # 0.387 IPC ( +- 0.281% )
> 76548856038 cache-references # 338.304 M/sec ( +- 0.832% )
> 480196860 cache-misses # 2.122 M/sec ( +- 2.741% )
>
> 60.051646892 seconds time elapsed ( +- 0.010% )
>
> [After]
> Performance counter stats for './runpause.sh 10' (5 runs):
>
> 226491.338475 task-clock-msecs # 3.772 CPUs ( +- 0.176% )
> 3377 context-switches # 0.000 M/sec ( +- 14.713% )
> 12 CPU-migrations # 0.000 M/sec ( +- 23.077% )
> 81867014 page-faults # 0.361 M/sec ( +- 3.201% )
> 499835798750 cycles # 2206.865 M/sec ( +- 0.036% )
> 196685031865 instructions # 0.393 IPC ( +- 0.286% )
> 81143829910 cache-references # 358.265 M/sec ( +- 0.428% )
> 119362559 cache-misses # 0.527 M/sec ( +- 5.291% )
>
> 60.048917062 seconds time elapsed ( +- 0.010% )
>
> I ran it on a 4-core machine with 16G of RAM. And I modified
> runpause.sh to fork 4 page-fault processes instead of 8. I mounted cgroup
> with only the memory subsystem and ran the test in the root cgroup.
>
> I believe that we might have different running environments, including the
> cgroup configuration. Any suggestions?
>
This patch series only affects "child" cgroups. Sorry, I should have written
that more clearly. It has no effect on the root cgroup.
Regards,
-Kame
> --Ying
>
> On Thu, Oct 1, 2009 at 9:55 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> Hi,
>>
>> This patch is against mmotm + softlimit fix patches.
>> (which are now in -rc git tree.)
>>
>> In the latest -rc series, the kernel avoids accessing res_counter when
>> cgroup is root cgroup. This helps scalabilty when memcg is not used.
>>
>> It's necessary to improve scalabilty even when memcg is used. This patch
>> is for that. Previous Balbir's work shows that the biggest obstacles for
>> better scalabilty is memcg's res_counter. Then, there are 2 ways.
>>
>> (1) make counter scale well.
>> (2) avoid accessing core counter as much as possible.
>>
>> My first direction was (1). But no, there is no counter which is free
>> from false sharing when it needs system-wide fine grain synchronization.
>> And res_counter has several functionality...this makes (1) difficult.
>> spin_lock (in slow path) around counter means tons of invalidation will
>> happen even when we just access counter without modification.
>>
>> This patch series is for (2). This implements charge/uncharge in bached
>> manner.
>> This coalesces access to res_counter at charge/uncharge using nature of
>> access locality.
>>
>> Tested for a month. And I got good reorts from Balbir and Nishimura,
>> thanks.
>> One concern is that this adds some members to the bottom of task_struct.
>> Better idea is welcome.
>>
>> Following is test result of continuous page-fault on my 8cpu
>> box(x86-64).
>>
>> A loop like this runs on all cpus in parallel for 60secs.
>> ==
>> while (1) {
>> x = mmap(NULL, MEGA, PROT_READ|PROT_WRITE,
>> MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
>>
>> for (off = 0; off < MEGA; off += PAGE_SIZE)
>> x[off]=0;
>> munmap(x, MEGA);
>> }
>> ==
>> please see # of page faults. I think this is good improvement.
>>
>>
>> [Before]
>> Performance counter stats for './runpause.sh' (5 runs):
>>
>> 474539.756944 task-clock-msecs # 7.890 CPUs ( +- 0.015% )
>> 10284 context-switches # 0.000 M/sec ( +- 0.156% )
>> 12 CPU-migrations # 0.000 M/sec ( +- 0.000% )
>> 18425800 page-faults # 0.039 M/sec ( +- 0.107% )
>> 1486296285360 cycles # 3132.080 M/sec ( +- 0.029% )
>> 380334406216 instructions # 0.256 IPC ( +- 0.058% )
>> 3274206662 cache-references # 6.900 M/sec ( +- 0.453% )
>> 1272947699 cache-misses # 2.682 M/sec ( +- 0.118% )
>>
>> 60.147907341 seconds time elapsed ( +- 0.010% )
>>
>> [After]
>> Performance counter stats for './runpause.sh' (5 runs):
>>
>> 474658.997489 task-clock-msecs # 7.891 CPUs ( +- 0.006% )
>> 10250 context-switches # 0.000 M/sec ( +- 0.020% )
>> 11 CPU-migrations # 0.000 M/sec ( +- 0.000% )
>> 33177858 page-faults # 0.070 M/sec ( +- 0.152% )
>> 1485264748476 cycles # 3129.120 M/sec ( +- 0.021% )
>> 409847004519 instructions # 0.276 IPC ( +- 0.123% )
>> 3237478723 cache-references # 6.821 M/sec ( +- 0.574% )
>> 1182572827 cache-misses # 2.491 M/sec ( +- 0.179% )
>>
>> 60.151786309 seconds time elapsed ( +- 0.014% )
>>
>> Regards,
>> -Kame
>>
>
* Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge
[not found] ` <604427e00910111134o6f22f0ddg2b87124dd334ec02@mail.gmail.com>
@ 2009-10-12 11:38 ` Balbir Singh
2009-10-13 0:29 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 22+ messages in thread
From: Balbir Singh @ 2009-10-12 11:38 UTC (permalink / raw)
To: Ying Han
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
nishimura@mxp.nes.nec.co.jp
* Ying Han <yinghan@google.com> [2009-10-11 11:34:39]:
> 2009/10/10 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> > Ying Han wrote:
> > > Hi KAMEZAWA-san: I tested your patch set based on 2.6.32-rc3 but I don't
> > > see much improvement in the page-fault rate. Here are the numbers I got:
> > >
> > > [Before]
> > > Performance counter stats for './runpause.sh 10' (5 runs):
> > >
> > > 226272.271246 task-clock-msecs # 3.768 CPUs ( +- 0.193% )
> > > 4424 context-switches # 0.000 M/sec ( +- 14.418% )
> > > 25 CPU-migrations # 0.000 M/sec ( +- 23.077% )
> > > 80499059 page-faults # 0.356 M/sec ( +- 2.586% )
> > > 499246232482 cycles # 2206.396 M/sec ( +- 0.055% )
> > > 193036122022 instructions # 0.387 IPC ( +- 0.281% )
> > > 76548856038 cache-references # 338.304 M/sec ( +- 0.832% )
> > > 480196860 cache-misses # 2.122 M/sec ( +- 2.741% )
> > >
> > > 60.051646892 seconds time elapsed ( +- 0.010% )
> > >
> > > [After]
> > > Performance counter stats for './runpause.sh 10' (5 runs):
> > >
> > > 226491.338475 task-clock-msecs # 3.772 CPUs ( +- 0.176% )
> > > 3377 context-switches # 0.000 M/sec ( +- 14.713% )
> > > 12 CPU-migrations # 0.000 M/sec ( +- 23.077% )
> > > 81867014 page-faults # 0.361 M/sec ( +- 3.201% )
> > > 499835798750 cycles # 2206.865 M/sec ( +- 0.036% )
> > > 196685031865 instructions # 0.393 IPC ( +- 0.286% )
> > > 81143829910 cache-references # 358.265 M/sec ( +- 0.428% )
> > > 119362559 cache-misses # 0.527 M/sec ( +- 5.291% )
> > >
> > > 60.048917062 seconds time elapsed ( +- 0.010% )
> > >
> > > I ran it on a 4-core machine with 16G of RAM. And I modified
> > > runpause.sh to fork 4 page-fault processes instead of 8. I mounted cgroup
> > > with only the memory subsystem and ran the test in the root cgroup.
> > >
> > > I believe that we might have different running environments, including the
> > > cgroup configuration. Any suggestions?
> > >
> >
> > This patch series only affects "child" cgroups. Sorry, I should have written
> > that more clearly. It has no effect on the root cgroup.
> >
>
> Ok, thanks for making it clearer. :) So do you mind posting the cgroup+memcg
> configuration you are running on your host?
>
> Thanks
>
Yes, root was fixed by another patchset, now in mainline. Another check
is to see whether the res_counter lock shows up in /proc/lock_stat.
--
Balbir
* Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge
[not found] ` <604427e00910111134o6f22f0ddg2b87124dd334ec02@mail.gmail.com>
2009-10-12 11:38 ` Balbir Singh
@ 2009-10-13 0:29 ` KAMEZAWA Hiroyuki
[not found] ` <604427e00910121818w71dd4b7dl8781d7f5bc4f7dd9@mail.gmail.com>
1 sibling, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-13 0:29 UTC (permalink / raw)
To: Ying Han
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Sun, 11 Oct 2009 11:34:39 -0700
Ying Han <yinghan@google.com> wrote:
> 2009/10/10 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > This patch series only affects "child" cgroups. Sorry, I should have written
> > that more clearly. It has no effect on the root cgroup.
> >
>
> Ok, thanks for making it clearer. :) So do you mind posting the cgroup+memcg
> configuration you are running on your host?
>
#mount -t cgroup /dev/null /cgroups -omemory
#mkdir /cgroups/A
#echo $$ > /cgroups/A
and run test.
Thanks,
-Kame
* Re: [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge
[not found] ` <604427e00910121818w71dd4b7dl8781d7f5bc4f7dd9@mail.gmail.com>
@ 2009-10-13 1:28 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-13 1:28 UTC (permalink / raw)
To: Ying Han
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Mon, 12 Oct 2009 18:18:44 -0700
Ying Han <yinghan@google.com> wrote:
> Ok. After running the task in the child cgroup, I do see a performance
> improvement in page faults.
Thank you.
Hmm... I wonder what makes the cache misses dramatically larger.
4 processes on 4 cpus? (My script was for 8 cpus.)
-Kame
> [Before]
> Performance counter stats for './runpause.sh 10' (5 runs):
>
> 226190.612998 task-clock-msecs # 3.766 CPUs ( +- 0.192% )
> 4454 context-switches # 0.000 M/sec ( +- 14.029% )
> 38 CPU-migrations # 0.000 M/sec ( +- 15.385% )
> 21445296 page-faults # 0.095 M/sec ( +- 1.686% )
> 498345012124 cycles # 2203.208 M/sec ( +- 0.023% )
> 255638483632 instructions # 0.513 IPC ( +- 0.196% )
> 92240143452 cache-references # 407.798 M/sec ( +- 0.673% )
> 175412792 cache-misses # 0.776 M/sec ( +- 2.188% )
>
> 60.068670564 seconds time elapsed ( +- 0.014% )
>
> [After]
> Performance counter stats for './runpause.sh 10' (5 runs):
>
> 214409.926571 task-clock-msecs # 3.570 CPUs ( +- 0.317% )
> 3097 context-switches # 0.000 M/sec ( +- 25.726% )
> 14 CPU-migrations # 0.000 M/sec ( +- 20.000% )
> 33977467 page-faults # 0.158 M/sec ( +- 4.884% )
> 472369769787 cycles # 2203.115 M/sec ( +- 0.024% )
> 275624185415 instructions # 0.583 IPC ( +- 0.271% )
> 98359325470 cache-references # 458.744 M/sec ( +- 0.281% )
> 941121561 cache-misses # 4.389 M/sec ( +- 4.969% )
>
> 60.052748032 seconds time elapsed ( +- 0.013% )
>
> --Ying
>
>
> On Mon, Oct 12, 2009 at 5:29 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Sun, 11 Oct 2009 11:34:39 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> > > 2009/10/10 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > > > This patch series only affects "child" cgroups. Sorry, I should have written
> > > > that more clearly. It has no effect on the root cgroup.
> > > >
> > >
> > > Ok, thanks for making it clearer. :) So do you mind posting the cgroup+memcg
> > > configuration you are running on your host?
> > >
> >
> > #mount -t cgroup /dev/null /cgroups -omemory
> > #mkdir /cgroups/A
> > #echo $$ > /cgroups/A
> >
> > and run test.
> >
> > Thanks,
> > -Kame
> >
> >
> >
>
Thread overview: 22+ messages
2009-10-02 4:55 [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge KAMEZAWA Hiroyuki
2009-10-02 5:01 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation KAMEZAWA Hiroyuki
2009-10-02 6:47 ` Hiroshi Shimamoto
2009-10-02 6:53 ` Hiroshi Shimamoto
2009-10-02 7:04 ` KAMEZAWA Hiroyuki
2009-10-02 7:02 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation (fixed coimpile bug) KAMEZAWA Hiroyuki
2009-10-08 22:17 ` Andrew Morton
2009-10-08 23:48 ` KAMEZAWA Hiroyuki
2009-10-09 4:01 ` [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation Balbir Singh
2009-10-09 4:17 ` KAMEZAWA Hiroyuki
2009-10-02 5:03 ` [PATCH 2/2] memcg: coalescing charges per cpu KAMEZAWA Hiroyuki
2009-10-08 22:26 ` Andrew Morton
2009-10-08 23:54 ` KAMEZAWA Hiroyuki
2009-10-09 4:15 ` Balbir Singh
2009-10-09 4:25 ` KAMEZAWA Hiroyuki
2009-10-02 8:53 ` [PATCH 0/2] memcg: improving scalability by reducing lock contention at charge/uncharge KAMEZAWA Hiroyuki
2009-10-05 7:18 ` KAMEZAWA Hiroyuki
2009-10-05 10:37 ` Balbir Singh
[not found] ` <604427e00910091737s52e11ce9p256c95d533dc2837@mail.gmail.com>
2009-10-11 2:33 ` KAMEZAWA Hiroyuki
[not found] ` <604427e00910111134o6f22f0ddg2b87124dd334ec02@mail.gmail.com>
2009-10-12 11:38 ` Balbir Singh
2009-10-13 0:29 ` KAMEZAWA Hiroyuki
[not found] ` <604427e00910121818w71dd4b7dl8781d7f5bc4f7dd9@mail.gmail.com>
2009-10-13 1:28 ` KAMEZAWA Hiroyuki