public inbox for linux-kernel@vger.kernel.org
* [PATCH 1/2] memcg: coalescing uncharge at unmap/truncate (Oct/9)
@ 2009-10-09  7:58 KAMEZAWA Hiroyuki
       [not found] ` <20091009170105.170e025f.kamezawa.hiroyu@jp.fujitsu.com>
  0 siblings, 1 reply; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-09  7:58 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	akpm@linux-foundation.org, h-shimamoto, linux-kernel

Thank you for all the reviews of the previous ones (Oct/2).

This patch is against mmotm + the softlimit fix patches
(which are now in the -rc git tree).
==
In a massively parallel environment, res_counter can be a performance bottleneck.
One strong technique for reducing lock contention is to reduce the number of
calls by coalescing some amount of calls into one.

Considering charge/uncharge characteristics:
	- charge is done one by one via demand paging.
	- uncharge is done
		- in chunks at munmap, truncate, exit, execve...
		- one by one via vmscan/paging.

It seems we have a chance to coalesce uncharges to improve scalability
at unmap/truncation.

This patch implements coalescing of uncharges. To avoid scattering memcg's
structures into functions under /mm, this patch adds memcg batch-uncharge
information to the task. The reason for per-task batching is to make
use of the caller's context information. We do batched (delayed) uncharge
when truncation/unmap occurs, but do direct uncharge when uncharge is
called by memory reclaim (vmscan.c).

The degree of coalescing depends on the caller:
  - at invalidate/truncate ... pagevec size
  - at unmap ... ZAP_BLOCK_SIZE
(Memory itself is freed at this granularity.)
So we will not coalesce too much.
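As a rough illustration, the per-task batching described above can be modeled in plain userspace C as below. This is only a sketch under invented names (batch_info, batch_start, the modeled res_counter, and a fixed 4096-byte page); it does not match the kernel code exactly.

```c
#include <assert.h>

/* Hypothetical model of the per-task batch info (the real patch keeps
 * this in task_struct as ->memcg_batch). */
struct batch_info {
	int do_batch;		/* nesting depth of batch sections */
	unsigned long bytes;	/* coalesced, not-yet-uncharged usage */
};

static unsigned long res_usage = 10 * 4096;	/* modeled res_counter */

static void direct_uncharge(unsigned long sz) { res_usage -= sz; }

static void uncharge_page(struct batch_info *b)
{
	if (!b->do_batch) {	/* e.g. vmscan path: uncharge immediately */
		direct_uncharge(4096);
		return;
	}
	b->bytes += 4096;	/* unmap/truncate path: coalesce */
}

static void batch_start(struct batch_info *b)
{
	if (b->do_batch++ == 0)	/* nesting is allowed */
		b->bytes = 0;
}

static void batch_end(struct batch_info *b)
{
	if (--b->do_batch)
		return;		/* still nested: keep accumulating */
	if (b->bytes)
		direct_uncharge(b->bytes); /* one counter op for N pages */
	b->bytes = 0;
}
```

The point is that N page frees inside a batch section turn into a single operation on the shared counter, while frees outside a section stay immediate.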

On an x86-64 8-CPU server, I tested the overhead of memcg at page fault by
running a program which does map/fault/unmap in a loop, running
one task per CPU with taskset, and summing the number of page faults
in 60 secs.
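The test program itself was not posted; a hypothetical reconstruction of such a map/fault/unmap loop might look like the following (region size and iteration count are arbitrary choices, not the original's). The munmap() at the end of each iteration is where the batched uncharge path is exercised:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>

#define SZ (64 * 4096)	/* 64 pages per iteration (arbitrary) */

/* Map a region, demand-fault every page by touching it, unmap, repeat. */
static long fault_loop(int iters)
{
	long faults = 0;
	for (int i = 0; i < iters; i++) {
		char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			abort();
		for (size_t off = 0; off < SZ; off += 4096)
			p[off] = 1;	/* each touch faults in one page */
		munmap(p, SZ);		/* frees (and uncharges) in bulk */
		faults += SZ / 4096;
	}
	return faults;
}
```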

[without memcg config] 
  40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
  27.67 cache-miss/faults
[root cgroup]
  36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
  31.58 miss/faults
[in a child cgroup]
  18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
  69.96 miss/faults
[child with this patch]
  27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
  47.16 miss/faults

We can see a fair amount of improvement.
(The root cgroup is not affected by this patch.)
Another patch, for "charge", will follow this, and the above will improve further.

Changelog(since 2009/10/02):
 - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
 - some clean up and commentary/description updates.
 - added initialize code to copy_process(). (possible bug fix)

Changelog(old):
 - fixed the !CONFIG_MEM_CGROUP case.
 - rebased onto the latest mmotm + softlimit fix patches.
 - unified the patch for callers.
 - added comments.
 - made ->do_batch a bool.
 - removed css_get() et al. We don't need them.

Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   13 ++++++
 include/linux/sched.h      |    8 +++
 kernel/fork.c              |    4 +
 mm/memcontrol.c            |   96 ++++++++++++++++++++++++++++++++++++++++++---
 mm/memory.c                |    2 
 mm/truncate.c              |    6 ++
 6 files changed, 123 insertions(+), 6 deletions(-)

Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
+++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
@@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
 extern void mem_cgroup_del_lru(struct page *page);
 extern void mem_cgroup_move_lists(struct page *page,
 				  enum lru_list from, enum lru_list to);
+
+/* For coalescing uncharges to reduce memcg's overhead */
+extern void mem_cgroup_uncharge_start(void);
+extern void mem_cgroup_uncharge_end(void);
+
 extern void mem_cgroup_uncharge_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
 extern int mem_cgroup_shmem_charge_fallback(struct page *page,
@@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
 {
 }
 
+static inline void mem_cgroup_uncharge_start(void)
+{
+}
+
+static inline void mem_cgroup_uncharge_end(void)
+{
+}
+
 static inline void mem_cgroup_uncharge_page(struct page *page)
 {
 }
Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -1826,6 +1826,50 @@ void mem_cgroup_cancel_charge_swapin(str
 	css_put(&mem->css);
 }
 
+static void
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+{
+	struct memcg_batch_info *batch = NULL;
+	bool uncharge_memsw = true;
+	/* If swapout, usage of swap doesn't decrease */
+	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+		uncharge_memsw = false;
+	/*
+	 * do_batch > 0 when unmapping pages or invalidating/truncating inodes.
+	 * In those cases, all pages freed continuously can be expected to be
+	 * in the same cgroup, and we have a chance to coalesce uncharges.
+	 * But we do uncharge one by one if the task is being killed by OOM
+	 * (TIF_MEMDIE), because we want to do uncharge as soon as possible.
+	 */
+	if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
+		goto direct_uncharge;
+
+	batch = &current->memcg_batch;
+	/*
+	 * Usually, we do css_get() when we remember a memcg pointer.
+	 * But in this case, we keep res->usage until the end of a series of
+	 * uncharges. Then, it's OK to ignore memcg's refcnt.
+	 */
+	if (!batch->memcg)
+		batch->memcg = mem;
+	/*
+	 * In the typical case, batch->memcg == mem. This means we can
+	 * merge a series of uncharges into one uncharge of the res_counter.
+	 * If not, we uncharge the res_counter one by one.
+	 */
+	if (batch->memcg != mem)
+		goto direct_uncharge;
+	/* remember freed charge and uncharge it later */
+	batch->bytes += PAGE_SIZE;
+	if (uncharge_memsw)
+		batch->memsw_bytes += PAGE_SIZE;
+	return;
+direct_uncharge:
+	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	if (uncharge_memsw)
+		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+	return;
+}
 
 /*
  * uncharge if !page_mapped(page)
@@ -1874,12 +1918,8 @@ __mem_cgroup_uncharge_common(struct page
 		break;
 	}
 
-	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		if (do_swap_account &&
-				(ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-	}
+	if (!mem_cgroup_is_root(mem))
+		__do_uncharge(mem, ctype);
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		mem_cgroup_swap_statistics(mem, true);
 	mem_cgroup_charge_statistics(mem, pc, false);
@@ -1925,6 +1965,50 @@ void mem_cgroup_uncharge_cache_page(stru
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
+/*
+ * batch_start/batch_end is called in unmap_page_range/invalidate/truncate.
+ * In those cases, pages are freed continuously and we can expect the pages
+ * to be in the same memcg. Each of these callers itself limits the number
+ * of pages freed at once, so uncharge_start/end() is called properly.
+ * This may be called multiple (nested) times in a context.
+ */
+
+void mem_cgroup_uncharge_start(void)
+{
+	current->memcg_batch.do_batch++;
+	/* We can do nest. */
+	if (current->memcg_batch.do_batch == 1) {
+		current->memcg_batch.memcg = NULL;
+		current->memcg_batch.bytes = 0;
+		current->memcg_batch.memsw_bytes = 0;
+	}
+}
+
+void mem_cgroup_uncharge_end(void)
+{
+	struct memcg_batch_info *batch = &current->memcg_batch;
+
+	if (!batch->do_batch)
+		return;
+
+	batch->do_batch--;
+	if (batch->do_batch) /* If stacked, do nothing. */
+		return;
+
+	if (!batch->memcg)
+		return;
+	/*
+	 * This "batch->memcg" is valid without any css_get/put, etc.,
+	 * because we hide charges behind us.
+	 */
+	if (batch->bytes)
+		res_counter_uncharge(&batch->memcg->res, batch->bytes);
+	if (batch->memsw_bytes)
+		res_counter_uncharge(&batch->memcg->memsw, batch->memsw_bytes);
+	/* forget this pointer (for sanity check) */
+	batch->memcg = NULL;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * called after __delete_from_swap_cache() and drop "page" account.
Index: mmotm-2.6.31-Sep28/include/linux/sched.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
+++ mmotm-2.6.31-Sep28/include/linux/sched.h
@@ -1549,6 +1549,14 @@ struct task_struct {
 	unsigned long trace_recursion;
 #endif /* CONFIG_TRACING */
 	unsigned long stack_start;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
+	struct memcg_batch_info {
+		int do_batch;	/* incremented when batch uncharge started */
+		struct mem_cgroup *memcg; /* target memcg of uncharge */
+		unsigned long bytes; 		/* uncharged usage */
+		unsigned long memsw_bytes; /* uncharged mem+swap usage */
+	} memcg_batch;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
Index: mmotm-2.6.31-Sep28/mm/memory.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memory.c
+++ mmotm-2.6.31-Sep28/mm/memory.c
@@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
 		details = NULL;
 
 	BUG_ON(addr >= end);
+	mem_cgroup_uncharge_start();
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
 						zap_work, details);
 	} while (pgd++, addr = next, (addr != end && *zap_work > 0));
 	tlb_end_vma(tlb, vma);
+	mem_cgroup_uncharge_end();
 
 	return addr;
 }
Index: mmotm-2.6.31-Sep28/mm/truncate.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/truncate.c
+++ mmotm-2.6.31-Sep28/mm/truncate.c
@@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
 			pagevec_release(&pvec);
 			break;
 		}
+		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
+		mem_cgroup_uncharge_end();
 	}
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
 	pagevec_init(&pvec, 0);
 	while (next <= end &&
 			pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t index;
@@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
 				break;
 		}
 		pagevec_release(&pvec);
+		mem_cgroup_uncharge_end();
 		cond_resched();
 	}
 	return ret;
@@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
 	while (next <= end && !wrapped &&
 		pagevec_lookup(&pvec, mapping, next,
 			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t page_index;
@@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
+		mem_cgroup_uncharge_end();
 		cond_resched();
 	}
 	return ret;
Index: mmotm-2.6.31-Sep28/kernel/fork.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/kernel/fork.c
+++ mmotm-2.6.31-Sep28/kernel/fork.c
@@ -1114,6 +1114,10 @@ static struct task_struct *copy_process(
 #ifdef CONFIG_DEBUG_MUTEXES
 	p->blocked_on = NULL; /* not blocked yet */
 #endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	p->memcg_batch.do_batch = 0;
+	p->memcg_batch.memcg = NULL;
+#endif
 
 	p->bts = NULL;
 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] memcg: coalescing charge by percpu (Oct/9)
       [not found] ` <20091009170105.170e025f.kamezawa.hiroyu@jp.fujitsu.com>
@ 2009-10-09 23:50   ` Andrew Morton
  2009-10-11  2:37     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2009-10-09 23:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	nishimura@mxp.nes.nec.co.jp, h-shimamoto, linux-kernel

On Fri, 9 Oct 2009 17:01:05 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> +static void drain_all_stock_async(void)
> +{
> +	int cpu;
> +	/* This function is for scheduling "drain" in asynchronous way.
> +	 * The result of "drain" is not directly handled by callers. Then,
> +	 * if someone is calling drain, we don't have to call drain more.
> +	 * Anyway, work_pending() will catch if there is a race. We just do
> +	 * loose check here.
> +	 */
> +	if (atomic_read(&memcg_drain_count))
> +		return;
> +	/* Notify other cpus that system-wide "drain" is running */
> +	atomic_inc(&memcg_drain_count);
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> +		if (work_pending(&stock->work))
> +			continue;
> +		INIT_WORK(&stock->work, drain_local_stock);
> +		schedule_work_on(cpu, &stock->work);
> +	}
> + 	put_online_cpus();
> +	atomic_dec(&memcg_drain_count);
> +	/* We don't wait for flush_work */
> +}

It's unusual to run INIT_WORK() each time we use a work_struct. 
Usually we will run INIT_WORK a single time, then just repeatedly use
that structure.  Because after the work has completed, it is still in a
ready-to-use state.

Running INIT_WORK() repeatedly against the same work_struct adds a risk
that we'll scribble on an in-use work_struct, which would make a big
mess.
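Andrew's point can be modeled in userspace C: a work item carries a pending flag that queueing sets and the worker clears; re-running the init helper while the item is still queued wipes that state and loses the work. All names below are invented for the model; this is not the kernel workqueue API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a work item. */
struct work {
	bool pending;		/* stands in for WORK_STRUCT_PENDING */
	void (*fn)(void);
};

static int runs;
static void worker_fn(void) { runs++; }

static void init_work(struct work *w, void (*fn)(void))
{
	w->pending = false;	/* like INIT_WORK: clears the pending state */
	w->fn = fn;
}

static bool queue_work(struct work *w)
{
	if (w->pending)
		return false;	/* already queued */
	w->pending = true;
	return true;
}

/* Stands in for the workqueue thread executing the item. */
static void run_pending(struct work *w)
{
	if (!w->pending)
		return;
	w->pending = false;
	w->fn();
}
```

In the model, a completed item is immediately reusable without re-initialization, while re-initializing a queued item silently drops it, which mirrors the risk described above.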



* Re: [PATCH 2/2] memcg: coalescing charge by percpu (Oct/9)
  2009-10-09 23:50   ` [PATCH 2/2] memcg: coalescing charge by percpu (Oct/9) Andrew Morton
@ 2009-10-11  2:37     ` KAMEZAWA Hiroyuki
  2009-10-13  7:57       ` Daisuke Nishimura
  0 siblings, 1 reply; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-11  2:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	nishimura@mxp.nes.nec.co.jp, h-shimamoto, linux-kernel

Andrew Morton wrote:
> On Fri, 9 Oct 2009 17:01:05 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> +static void drain_all_stock_async(void)
>> +{
>> +	int cpu;
>> +	/* This function is for scheduling "drain" in asynchronous way.
>> +	 * The result of "drain" is not directly handled by callers. Then,
>> +	 * if someone is calling drain, we don't have to call drain more.
>> +	 * Anyway, work_pending() will catch if there is a race. We just do
>> +	 * loose check here.
>> +	 */
>> +	if (atomic_read(&memcg_drain_count))
>> +		return;
>> +	/* Notify other cpus that system-wide "drain" is running */
>> +	atomic_inc(&memcg_drain_count);
>> +	get_online_cpus();
>> +	for_each_online_cpu(cpu) {
>> +		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
>> +		if (work_pending(&stock->work))
>> +			continue;
>> +		INIT_WORK(&stock->work, drain_local_stock);
>> +		schedule_work_on(cpu, &stock->work);
>> +	}
>> + 	put_online_cpus();
>> +	atomic_dec(&memcg_drain_count);
>> +	/* We don't wait for flush_work */
>> +}
>
> It's unusual to run INIT_WORK() each time we use a work_struct.
> Usually we will run INIT_WORK a single time, then just repeatedly use
> that structure.  Because after the work has completed, it is still in a
> ready-to-use state.
>
> Running INIT_WORK() repeatedly against the same work_struct adds a risk
> that we'll scribble on an in-use work_struct, which would make a big
> mess.
>
Ah, OK. I'll prepare a fix. (And I think the atomic_dec/inc placement is not
very good... I'll do a total review, again.)

Thank you for review.

Regards,
-Kame




* Re: [PATCH 2/2] memcg: coalescing charge by percpu (Oct/9)
  2009-10-11  2:37     ` KAMEZAWA Hiroyuki
@ 2009-10-13  7:57       ` Daisuke Nishimura
  2009-10-13  8:05         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 8+ messages in thread
From: Daisuke Nishimura @ 2009-10-13  7:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	h-shimamoto, linux-kernel, Daisuke Nishimura

On Sun, 11 Oct 2009 11:37:35 +0900 (JST), "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Andrew Morton wrote:
> > On Fri, 9 Oct 2009 17:01:05 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> >> +static void drain_all_stock_async(void)
> >> +{
> >> +	int cpu;
> >> +	/* This function is for scheduling "drain" in asynchronous way.
> >> +	 * The result of "drain" is not directly handled by callers. Then,
> >> +	 * if someone is calling drain, we don't have to call drain more.
> >> +	 * Anyway, work_pending() will catch if there is a race. We just do
> >> +	 * loose check here.
> >> +	 */
> >> +	if (atomic_read(&memcg_drain_count))
> >> +		return;
> >> +	/* Notify other cpus that system-wide "drain" is running */
> >> +	atomic_inc(&memcg_drain_count);
Shouldn't we use atomic_inc_not_zero() ?
(Is this the problem you meant by "is not very good" below ?)


Thanks,
Daisuke Nishimura.

> >> +	get_online_cpus();
> >> +	for_each_online_cpu(cpu) {
> >> +		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> >> +		if (work_pending(&stock->work))
> >> +			continue;
> >> +		INIT_WORK(&stock->work, drain_local_stock);
> >> +		schedule_work_on(cpu, &stock->work);
> >> +	}
> >> + 	put_online_cpus();
> >> +	atomic_dec(&memcg_drain_count);
> >> +	/* We don't wait for flush_work */
> >> +}
> >
> > It's unusual to run INIT_WORK() each time we use a work_struct.
> > Usually we will run INIT_WORK a single time, then just repeatedly use
> > that structure.  Because after the work has completed, it is still in a
> > ready-to-use state.
> >
> > Running INIT_WORK() repeatedly against the same work_struct adds a risk
> > that we'll scribble on an in-use work_struct, which would make a big
> > mess.
> >
> Ah, ok. I'll prepare a fix. (And I think atomic_dec/inc placement is not
> very good....I'll do total review, again.)
> 
> Thank you for review.
> 
> Regards,
> -Kame
> 
> 


* Re: [PATCH 2/2] memcg: coalescing charge by percpu (Oct/9)
  2009-10-13  7:57       ` Daisuke Nishimura
@ 2009-10-13  8:05         ` KAMEZAWA Hiroyuki
  2009-10-14  6:42           ` Daisuke Nishimura
  0 siblings, 1 reply; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-13  8:05 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	h-shimamoto, linux-kernel

On Tue, 13 Oct 2009 16:57:19 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Sun, 11 Oct 2009 11:37:35 +0900 (JST), "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Andrew Morton wrote:
> > > On Fri, 9 Oct 2009 17:01:05 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > >> +static void drain_all_stock_async(void)
> > >> +{
> > >> +	int cpu;
> > >> +	/* This function is for scheduling "drain" in asynchronous way.
> > >> +	 * The result of "drain" is not directly handled by callers. Then,
> > >> +	 * if someone is calling drain, we don't have to call drain more.
> > >> +	 * Anyway, work_pending() will catch if there is a race. We just do
> > >> +	 * loose check here.
> > >> +	 */
> > >> +	if (atomic_read(&memcg_drain_count))
> > >> +		return;
> > >> +	/* Notify other cpus that system-wide "drain" is running */
> > >> +	atomic_inc(&memcg_drain_count);
> Shouldn't we use atomic_inc_not_zero() ?
> (Do you mean this problem by "is not very good" below ?)
> 
As the comment says, "we just do a loose check". There is no terrible race except
for wasting CPU time.

I'm now thinking about following.

==
for_each_online_cpu(cpu) {
	struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
	if (work_pending(&stock->work))
		continue;
+       atomic_inc(&memcg_drain_count);
	INIT_WORK(&stock->work, drain_local_stock);
	schedule_work_on(cpu, &stock->work);
}
==
Or use a cpumask to avoid scheduling the work twice.

atomic_dec will be added to the worker routine, after the drain.

I'm now preparing slides for JLS (ah, yes, the deadline has passed), so plz give me time..
If you want to review it, plz let me know.

Thanks,
-Kame


> 
> Thanks,
> Daisuke Nishimura.
> 
> > >> +	get_online_cpus();
> > >> +	for_each_online_cpu(cpu) {
> > >> +		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > >> +		if (work_pending(&stock->work))
> > >> +			continue;
> > >> +		INIT_WORK(&stock->work, drain_local_stock);
> > >> +		schedule_work_on(cpu, &stock->work);
> > >> +	}
> > >> + 	put_online_cpus();
> > >> +	atomic_dec(&memcg_drain_count);
> > >> +	/* We don't wait for flush_work */
> > >> +}
> > >
> > > It's unusual to run INIT_WORK() each time we use a work_struct.
> > > Usually we will run INIT_WORK a single time, then just repeatedly use
> > > that structure.  Because after the work has completed, it is still in a
> > > ready-to-use state.
> > >
> > > Running INIT_WORK() repeatedly against the same work_struct adds a risk
> > > that we'll scribble on an in-use work_struct, which would make a big
> > > mess.
> > >
> > Ah, ok. I'll prepare a fix. (And I think atomic_dec/inc placement is not
> > very good....I'll do total review, again.)
> > 
> > Thank you for review.
> > 
> > Regards,
> > -Kame
> > 
> > 
> 



* Re: [PATCH 2/2] memcg: coalescing charge by percpu (Oct/9)
  2009-10-13  8:05         ` KAMEZAWA Hiroyuki
@ 2009-10-14  6:42           ` Daisuke Nishimura
  2009-10-14  7:02             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 8+ messages in thread
From: Daisuke Nishimura @ 2009-10-14  6:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	h-shimamoto, linux-kernel, Daisuke Nishimura

> I'm now preparing slides for JLS (ah, yes, the deadline has passed), so plz give me time..
I see. I think there is still time left until the next merge window, so I'm in no hurry.
I will go to JLS and am looking forward to attending your session :)


Some comments and a fix about this patch are inlined.

On Tue, 13 Oct 2009 17:05:45 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 13 Oct 2009 16:57:19 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Sun, 11 Oct 2009 11:37:35 +0900 (JST), "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > Andrew Morton wrote:
> > > > On Fri, 9 Oct 2009 17:01:05 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > >
> > > >> +static void drain_all_stock_async(void)
> > > >> +{
> > > >> +	int cpu;
> > > >> +	/* This function is for scheduling "drain" in asynchronous way.
> > > >> +	 * The result of "drain" is not directly handled by callers. Then,
> > > >> +	 * if someone is calling drain, we don't have to call drain more.
> > > >> +	 * Anyway, work_pending() will catch if there is a race. We just do
> > > >> +	 * loose check here.
> > > >> +	 */
> > > >> +	if (atomic_read(&memcg_drain_count))
> > > >> +		return;
> > > >> +	/* Notify other cpus that system-wide "drain" is running */
> > > >> +	atomic_inc(&memcg_drain_count);
> > Shouldn't we use atomic_inc_not_zero() ?
> > (Do you mean this problem by "is not very good" below ?)
> > 
> As comment says, "we just do loose check". There is no terrible race except
> for wasting cpu time.
> 
Considering it more, I think the loose check itself is not bad, but using
INIT_WORK() here is not good behavior, as Andrew said.

It clears the WORK_STRUCT_PENDING bit and initializes the list_head of the
work_struct. So I think the following race can happen.

    work_pending() ->false
    INIT_WORK()
    schedule_work_on()    
      queue_work_on()        
                                  work_pending() -> false
        test_and_set_bit()
          -> sets WORK_STRUCT_PENDING
                                  INIT_WORK() -> clears WORK_STRUCT_PENDING
                             

And actually, I've seen this BUG several times while testing mmotm-2009-10-09-01-07 + these 2 patches.
It does not seem to happen on plain mmotm-2009-10-09-01-07.


[ 2264.213803] ------------[ cut here ]------------
[ 2264.214058] WARNING: at lib/list_debug.c:30 __list_add+0x6e/0x87()
[ 2264.214058] Hardware name: Express5800/140Rd-4 [N8100-1065]
[ 2264.214058] list_add corruption. prev->next should be next (ffff880029fd8e80), but was
ffff880029fd36b8. (prev=ffff880029fd36b8).
[ 2264.214058] Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables b
ridge stp autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_
cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_t
cp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc bat
tery ac lp sg ide_cd_mod cdrom serio_raw acpi_memhotplug button parport_pc parport rtc_cmo
s rtc_core rtc_lib e1000 i2c_i801 i2c_core pata_acpi ata_generic pcspkr dm_region_hash dm_
log dm_mod ata_piix libata shpchp megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd uhci_
hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
[ 2264.214058] Pid: 14398, comm: shmem_test_02 Tainted: G      D W  2.6.32-rc3-mmotm-2009-
10-09-01-07-00249-g1112566 #2
[ 2264.214058] Call Trace:
[ 2264.214058]  [<ffffffff811e0f43>] ? __list_add+0x6e/0x87
[ 2264.214058]  [<ffffffff8104af21>] warn_slowpath_common+0x7c/0x94
[ 2264.214058]  [<ffffffff8104afb8>] warn_slowpath_fmt+0x69/0x6b
[ 2264.214058]  [<ffffffff81077354>] ? print_lock_contention_bug+0x1b/0xe0
[ 2264.214058]  [<ffffffff810b819d>] ? probe_workqueue_insertion+0x40/0x9f
[ 2264.214058]  [<ffffffff81076720>] ? trace_hardirqs_off+0xd/0xf
[ 2264.214058]  [<ffffffff81386bef>] ? _spin_unlock_irqrestore+0x3d/0x4c
[ 2264.214058]  [<ffffffff811e0f43>] __list_add+0x6e/0x87
[ 2264.214058]  [<ffffffff81065230>] insert_work+0x78/0x9a
[ 2264.214058]  [<ffffffff81065c47>] __queue_work+0x2f/0x43
[ 2264.214058]  [<ffffffff81065cdb>] queue_work_on+0x44/0x50
[ 2264.214058]  [<ffffffff81065d02>] schedule_work_on+0x1b/0x1d
[ 2264.214058]  [<ffffffff81104623>] mem_cgroup_hierarchical_reclaim+0x232/0x40e
[ 2264.214058]  [<ffffffff81077eeb>] ? trace_hardirqs_on+0xd/0xf
[ 2264.214058]  [<ffffffff81104ec9>] __mem_cgroup_try_charge+0x14c/0x25b
[ 2264.214058]  [<ffffffff811058df>] mem_cgroup_charge_common+0x55/0x82
[ 2264.214058]  [<ffffffff810d5f9d>] ? shmem_swp_entry+0x134/0x140
[ 2264.214058]  [<ffffffff81106a44>] mem_cgroup_cache_charge+0xef/0x118
[ 2264.214058]  [<ffffffff810f766e>] ? alloc_page_vma+0xe0/0xef
[ 2264.214058]  [<ffffffff810d689a>] shmem_getpage+0x6a7/0x86b
[ 2264.214058]  [<ffffffff81077eeb>] ? trace_hardirqs_on+0xd/0xf
[ 2264.214058]  [<ffffffff810ce1d4>] ? get_page_from_freelist+0x4fc/0x6ce
[ 2264.214058]  [<ffffffff81386794>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 2264.214058]  [<ffffffff8100e385>] ? do_softirq+0x77/0x86
[ 2264.214058]  [<ffffffff8100c5bc>] ? restore_args+0x0/0x30
[ 2264.214058]  [<ffffffff810502bf>] ? current_fs_time+0x27/0x2e
[ 2264.214058]  [<ffffffff81077354>] ? print_lock_contention_bug+0x1b/0xe0
[ 2264.214058]  [<ffffffff811236c2>] ? file_update_time+0x44/0x144
[ 2264.214058]  [<ffffffff810d6b11>] shmem_fault+0x42/0x63
[ 2264.214058]  [<ffffffff810e07db>] __do_fault+0x55/0x484
[ 2264.214058]  [<ffffffff81077354>] ? print_lock_contention_bug+0x1b/0xe0
[ 2264.214058]  [<ffffffff810e28e1>] handle_mm_fault+0x1ea/0x813
[ 2264.214058]  [<ffffffff81389770>] do_page_fault+0x25a/0x2ea
[ 2264.214058]  [<ffffffff8138757f>] page_fault+0x1f/0x30
[ 2264.214058] ---[ end trace d435092dd749cff5 ]---
[ 2264.538186] ------------[ cut here ]------------
[ 2264.538219] ------------[ cut here ]------------
[ 2264.538231] kernel BUG at kernel/workqueue.c:287!
[ 2264.538240] invalid opcode: 0000 [#2] SMP
[ 2264.538251] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index1/shared_cpu_map
[ 2264.538259] CPU 0
[ 2264.538270] Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables b
ridge stp autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_
cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_t
cp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc bat
tery ac lp sg ide_cd_mod cdrom serio_raw acpi_memhotplug button parport_pc parport rtc_cmo
s rtc_core rtc_lib e1000 i2c_i801 i2c_core pata_acpi ata_generic pcspkr dm_region_hash dm_
log dm_mod ata_piix libata shpchp megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd uhci_
hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
[ 2264.538627] Pid: 51, comm: events/0 Tainted: G      D W  2.6.32-rc3-mmotm-2009-10-09-01
-07-00249-g1112566 #2 Express5800/140Rd-4 [N8100-1065]
[ 2264.538638] RIP: 0010:[<ffffffff81065f56>]  [<ffffffff81065f56>] worker_thread+0x157/0x
2b3
[ 2264.538663] RSP: 0000:ffff8803a068fe10  EFLAGS: 00010207
[ 2264.538670] RAX: 0000000000000000 RBX: ffff8803a045a478 RCX: ffff880029fd36b8
[ 2264.538679] RDX: 0000000000001918 RSI: 00000001a068fdd0 RDI: ffffffff81386bad
[ 2264.538684] RBP: ffff8803a068feb0 R08: 0000000000000002 R09: 0000000000000001
[ 2264.538694] R10: ffffffff81389860 R11: ffff880029fd6518 R12: ffff880029fd36b0
[ 2264.538701] R13: ffff880029fd8e40 R14: ffff8803a067e590 R15: ffffffff811049f3
[ 2264.538719] FS:  0000000000000000(0000) GS:ffff880029e00000(0000) knlGS:000000000000000
0
[ 2264.538734] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 2264.538742] CR2: 00007ffcc30a0001 CR3: 0000000001001000 CR4: 00000000000006f0
[ 2264.538755] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2264.538766] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2264.538773] Process events/0 (pid: 51, threadinfo ffff8803a068e000, task ffff8803a067e5
90)
[ 2264.538783] Stack:
[ 2264.538788]  ffffffff81065f5a ffffffff813847ec ffffffff827d6a20 0000000000000000
[ 2264.538807] <0> ffffffff814f5a42 0000000000000002 0000000000000000 ffff8803a067e948
[ 2264.538835] <0> 0000000000000000 ffff8803a067e590 ffffffff8106a61b ffff8803a068fe68
[ 2264.538857] Call Trace:
[ 2264.538868]  [<ffffffff81065f5a>] ? worker_thread+0x15b/0x2b3
[ 2264.538883]  [<ffffffff813847ec>] ? thread_return+0x3e/0xee
[ 2264.538887]  [<ffffffff8106a61b>] ? autoremove_wake_function+0x0/0x3d
[ 2264.538887]  [<ffffffff81065dff>] ? worker_thread+0x0/0x2b3
[ 2264.538887]  [<ffffffff8106a510>] kthread+0x82/0x8a
[ 2264.538887]  [<ffffffff8100cc1a>] child_rip+0xa/0x20
[ 2264.538887]  [<ffffffff8100c5bc>] ? restore_args+0x0/0x30
[ 2264.538887]  [<ffffffff8106a46d>] ? kthreadd+0xd5/0xf6
[ 2264.538887]  [<ffffffff8106a48e>] ? kthread+0x0/0x8a
[ 2264.538887]  [<ffffffff8100cc10>] ? child_rip+0x0/0x20
[ 2264.538887] Code: 00 4c 89 ef 48 8b 08 48 8b 50 08 48 89 51 08 48 89 0a 48 89 40 08 48
89 00 e8 34 0c 32 00 49 8b 04 24 48 83 e0 fc 4c 39 e8 74 04 <0f> 0b eb fe f0 41 80 24 24 f
e 49 8b bd a8 00 00 00 48 8d 9d 70
[ 2264.538887] RIP  [<ffffffff81065f56>] worker_thread+0x157/0x2b3
[ 2264.538887]  RSP <ffff8803a068fe10>
[ 2264.539304] ---[ end trace d435092dd749cff6 ]---


How about doing INIT_WORK() once during initialization, like this?
With this patch, it has survived the same test for more than 6 hours without hitting the BUG,
while it survived for only 1 hour at most before.

===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Don't do INIT_WORK() repeatedly against the same work_struct.
Just do it once in initialization.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bc57916..fd8f65f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1349,8 +1349,8 @@ static void drain_all_stock_async(void)
 	/* This function is for scheduling "drain" in asynchronous way.
 	 * The result of "drain" is not directly handled by callers. Then,
 	 * if someone is calling drain, we don't have to call drain more.
-	 * Anyway, work_pending() will catch if there is a race. We just do
-	 * loose check here.
+	 * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
+	 * there is a race. We just do loose check here.
 	 */
 	if (atomic_read(&memcg_drain_count))
 		return;
@@ -1359,9 +1359,6 @@ static void drain_all_stock_async(void)
 	get_online_cpus();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		if (work_pending(&stock->work))
-			continue;
-		INIT_WORK(&stock->work, drain_local_stock);
 		schedule_work_on(cpu, &stock->work);
 	}
  	put_online_cpus();
@@ -3327,11 +3324,17 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 
 	/* root ? */
 	if (cont->parent == NULL) {
+		int cpu;
 		enable_swap_cgroup();
 		parent = NULL;
 		root_mem_cgroup = mem;
 		if (mem_cgroup_soft_limit_tree_init())
 			goto free_out;
+		for_each_possible_cpu(cpu) {
+			struct memcg_stock_pcp *stock =
+						&per_cpu(memcg_stock, cpu);
+			INIT_WORK(&stock->work, drain_local_stock);
+		}
 		hotcpu_notifier(memcg_stock_cpu_callback, 0);
 
 	} else {
-- 
1.5.6.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] memcg: coalescing charge by percpu (Oct/9)
  2009-10-14  6:42           ` Daisuke Nishimura
@ 2009-10-14  7:02             ` KAMEZAWA Hiroyuki
  2009-10-16  0:32               ` [BUGFIX][PATCH -mmotm] memcg: don't do INIT_WORK() repeatedly against the same work_struct Daisuke Nishimura
  0 siblings, 1 reply; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-14  7:02 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	h-shimamoto, linux-kernel

On Wed, 14 Oct 2009 15:42:11 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> It clears WORK_STRUCT_PENDING bit and initializes the list_head of the work_struct.
> So I think a following race can happen.
> 
>     work_pending() ->false
>     INIT_WORK()
>     schedule_work_on()    
>       queue_work_on()        
>                                   work_pending() -> false
>         test_and_set_bit()
>           -> sets WORK_STRUCT_PENDING
>                                   INIT_WORK() -> clears WORK_STRUCT_PENDING
>                              
> 
> And actually, I've seen the BUG several times in testing mmotm-2009-10-09-01-07 + these 2 patches.
> This seems not to happen on plain mmotm-2009-10-09-01-07.
> 
> 
> [ 2264.213803] ------------[ cut here ]------------
> [ 2264.214058] WARNING: at lib/list_debug.c:30 __list_add+0x6e/0x87()
> [ 2264.214058] Hardware name: Express5800/140Rd-4 [N8100-1065]
> [ 2264.214058] list_add corruption. prev->next should be next (ffff880029fd8e80), but was
> ffff880029fd36b8. (prev=ffff880029fd36b8).
> [ 2264.214058] Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables b
> ridge stp autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_
> cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_t
> cp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc bat
> tery ac lp sg ide_cd_mod cdrom serio_raw acpi_memhotplug button parport_pc parport rtc_cmo
> s rtc_core rtc_lib e1000 i2c_i801 i2c_core pata_acpi ata_generic pcspkr dm_region_hash dm_
> log dm_mod ata_piix libata shpchp megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd uhci_
> hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
> [ 2264.214058] Pid: 14398, comm: shmem_test_02 Tainted: G      D W  2.6.32-rc3-mmotm-2009-
> 10-09-01-07-00249-g1112566 #2
> [ 2264.214058] Call Trace:
> [ 2264.214058]  [<ffffffff811e0f43>] ? __list_add+0x6e/0x87
> [ 2264.214058]  [<ffffffff8104af21>] warn_slowpath_common+0x7c/0x94
> [ 2264.214058]  [<ffffffff8104afb8>] warn_slowpath_fmt+0x69/0x6b
> [ 2264.214058]  [<ffffffff81077354>] ? print_lock_contention_bug+0x1b/0xe0
> [ 2264.214058]  [<ffffffff810b819d>] ? probe_workqueue_insertion+0x40/0x9f
> [ 2264.214058]  [<ffffffff81076720>] ? trace_hardirqs_off+0xd/0xf
> [ 2264.214058]  [<ffffffff81386bef>] ? _spin_unlock_irqrestore+0x3d/0x4c
> [ 2264.214058]  [<ffffffff811e0f43>] __list_add+0x6e/0x87
> [ 2264.214058]  [<ffffffff81065230>] insert_work+0x78/0x9a
> [ 2264.214058]  [<ffffffff81065c47>] __queue_work+0x2f/0x43
> [ 2264.214058]  [<ffffffff81065cdb>] queue_work_on+0x44/0x50
> [ 2264.214058]  [<ffffffff81065d02>] schedule_work_on+0x1b/0x1d
> [ 2264.214058]  [<ffffffff81104623>] mem_cgroup_hierarchical_reclaim+0x232/0x40e
> [ 2264.214058]  [<ffffffff81077eeb>] ? trace_hardirqs_on+0xd/0xf
> [ 2264.214058]  [<ffffffff81104ec9>] __mem_cgroup_try_charge+0x14c/0x25b
> [ 2264.214058]  [<ffffffff811058df>] mem_cgroup_charge_common+0x55/0x82
> [ 2264.214058]  [<ffffffff810d5f9d>] ? shmem_swp_entry+0x134/0x140
> [ 2264.214058]  [<ffffffff81106a44>] mem_cgroup_cache_charge+0xef/0x118
> [ 2264.214058]  [<ffffffff810f766e>] ? alloc_page_vma+0xe0/0xef
> [ 2264.214058]  [<ffffffff810d689a>] shmem_getpage+0x6a7/0x86b
> [ 2264.214058]  [<ffffffff81077eeb>] ? trace_hardirqs_on+0xd/0xf
> [ 2264.214058]  [<ffffffff810ce1d4>] ? get_page_from_freelist+0x4fc/0x6ce
> [ 2264.214058]  [<ffffffff81386794>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 2264.214058]  [<ffffffff8100e385>] ? do_softirq+0x77/0x86
> [ 2264.214058]  [<ffffffff8100c5bc>] ? restore_args+0x0/0x30
> [ 2264.214058]  [<ffffffff810502bf>] ? current_fs_time+0x27/0x2e
> [ 2264.214058]  [<ffffffff81077354>] ? print_lock_contention_bug+0x1b/0xe0
> [ 2264.214058]  [<ffffffff811236c2>] ? file_update_time+0x44/0x144
> [ 2264.214058]  [<ffffffff810d6b11>] shmem_fault+0x42/0x63
> [ 2264.214058]  [<ffffffff810e07db>] __do_fault+0x55/0x484
> [ 2264.214058]  [<ffffffff81077354>] ? print_lock_contention_bug+0x1b/0xe0
> [ 2264.214058]  [<ffffffff810e28e1>] handle_mm_fault+0x1ea/0x813
> [ 2264.214058]  [<ffffffff81389770>] do_page_fault+0x25a/0x2ea
> [ 2264.214058]  [<ffffffff8138757f>] page_fault+0x1f/0x30
> [ 2264.214058] ---[ end trace d435092dd749cff5 ]---
> [ 2264.538186] ------------[ cut here ]------------
> [ 2264.538219] ------------[ cut here ]------------
> [ 2264.538231] kernel BUG at kernel/workqueue.c:287!
> [ 2264.538240] invalid opcode: 0000 [#2] SMP
> [ 2264.538251] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index1/shared_cpu_map
> [ 2264.538259] CPU 0
> [ 2264.538270] Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables b
> ridge stp autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_
> cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_t
> cp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc bat
> tery ac lp sg ide_cd_mod cdrom serio_raw acpi_memhotplug button parport_pc parport rtc_cmo
> s rtc_core rtc_lib e1000 i2c_i801 i2c_core pata_acpi ata_generic pcspkr dm_region_hash dm_
> log dm_mod ata_piix libata shpchp megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd uhci_
> hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
> [ 2264.538627] Pid: 51, comm: events/0 Tainted: G      D W  2.6.32-rc3-mmotm-2009-10-09-01
> -07-00249-g1112566 #2 Express5800/140Rd-4 [N8100-1065]
> [ 2264.538638] RIP: 0010:[<ffffffff81065f56>]  [<ffffffff81065f56>] worker_thread+0x157/0x
> 2b3
> [ 2264.538663] RSP: 0000:ffff8803a068fe10  EFLAGS: 00010207
> [ 2264.538670] RAX: 0000000000000000 RBX: ffff8803a045a478 RCX: ffff880029fd36b8
> [ 2264.538679] RDX: 0000000000001918 RSI: 00000001a068fdd0 RDI: ffffffff81386bad
> [ 2264.538684] RBP: ffff8803a068feb0 R08: 0000000000000002 R09: 0000000000000001
> [ 2264.538694] R10: ffffffff81389860 R11: ffff880029fd6518 R12: ffff880029fd36b0
> [ 2264.538701] R13: ffff880029fd8e40 R14: ffff8803a067e590 R15: ffffffff811049f3
> [ 2264.538719] FS:  0000000000000000(0000) GS:ffff880029e00000(0000) knlGS:000000000000000
> 0
> [ 2264.538734] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 2264.538742] CR2: 00007ffcc30a0001 CR3: 0000000001001000 CR4: 00000000000006f0
> [ 2264.538755] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2264.538766] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 2264.538773] Process events/0 (pid: 51, threadinfo ffff8803a068e000, task ffff8803a067e5
> 90)
> [ 2264.538783] Stack:
> [ 2264.538788]  ffffffff81065f5a ffffffff813847ec ffffffff827d6a20 0000000000000000
> [ 2264.538807] <0> ffffffff814f5a42 0000000000000002 0000000000000000 ffff8803a067e948
> [ 2264.538835] <0> 0000000000000000 ffff8803a067e590 ffffffff8106a61b ffff8803a068fe68
> [ 2264.538857] Call Trace:
> [ 2264.538868]  [<ffffffff81065f5a>] ? worker_thread+0x15b/0x2b3
> [ 2264.538883]  [<ffffffff813847ec>] ? thread_return+0x3e/0xee
> [ 2264.538887]  [<ffffffff8106a61b>] ? autoremove_wake_function+0x0/0x3d
> [ 2264.538887]  [<ffffffff81065dff>] ? worker_thread+0x0/0x2b3
> [ 2264.538887]  [<ffffffff8106a510>] kthread+0x82/0x8a
> [ 2264.538887]  [<ffffffff8100cc1a>] child_rip+0xa/0x20
> [ 2264.538887]  [<ffffffff8100c5bc>] ? restore_args+0x0/0x30
> [ 2264.538887]  [<ffffffff8106a46d>] ? kthreadd+0xd5/0xf6
> [ 2264.538887]  [<ffffffff8106a48e>] ? kthread+0x0/0x8a
> [ 2264.538887]  [<ffffffff8100cc10>] ? child_rip+0x0/0x20
> [ 2264.538887] Code: 00 4c 89 ef 48 8b 08 48 8b 50 08 48 89 51 08 48 89 0a 48 89 40 08 48
> 89 00 e8 34 0c 32 00 49 8b 04 24 48 83 e0 fc 4c 39 e8 74 04 <0f> 0b eb fe f0 41 80 24 24 f
> e 49 8b bd a8 00 00 00 48 8d 9d 70
> [ 2264.538887] RIP  [<ffffffff81065f56>] worker_thread+0x157/0x2b3
> [ 2264.538887]  RSP <ffff8803a068fe10>
> [ 2264.539304] ---[ end trace d435092dd749cff6 ]---
> 
> 
> How about doing INIT_WORK() once in initialization like this?
> After this patch, the kernel has survived the same test for more than 6 hours without
> hitting the BUG, while before it survived for only 1 hour at most.
> 
> ===
> From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
> Don't do INIT_WORK() repeatedly against the same work_struct.
> Just do it once in initialization.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thanks. You're much quicker than me.

Regards,
-Kame

> ---
>  mm/memcontrol.c |   13 ++++++++-----
>  1 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bc57916..fd8f65f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1349,8 +1349,8 @@ static void drain_all_stock_async(void)
>  	/* This function is for scheduling "drain" in asynchronous way.
>  	 * The result of "drain" is not directly handled by callers. Then,
>  	 * if someone is calling drain, we don't have to call drain more.
> -	 * Anyway, work_pending() will catch if there is a race. We just do
> -	 * loose check here.
> +	 * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
> +	 * there is a race. We just do loose check here.
>  	 */
>  	if (atomic_read(&memcg_drain_count))
>  		return;
> @@ -1359,9 +1359,6 @@ static void drain_all_stock_async(void)
>  	get_online_cpus();
>  	for_each_online_cpu(cpu) {
>  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> -		if (work_pending(&stock->work))
> -			continue;
> -		INIT_WORK(&stock->work, drain_local_stock);
>  		schedule_work_on(cpu, &stock->work);
>  	}
>   	put_online_cpus();
> @@ -3327,11 +3324,17 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  
>  	/* root ? */
>  	if (cont->parent == NULL) {
> +		int cpu;
>  		enable_swap_cgroup();
>  		parent = NULL;
>  		root_mem_cgroup = mem;
>  		if (mem_cgroup_soft_limit_tree_init())
>  			goto free_out;
> +		for_each_possible_cpu(cpu) {
> +			struct memcg_stock_pcp *stock =
> +						&per_cpu(memcg_stock, cpu);
> +			INIT_WORK(&stock->work, drain_local_stock);
> +		}
>  		hotcpu_notifier(memcg_stock_cpu_callback, 0);
>  
>  	} else {
> -- 
> 1.5.6.1
> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [BUGFIX][PATCH -mmotm] memcg: don't do INIT_WORK() repeatedly against the same work_struct
  2009-10-14  7:02             ` KAMEZAWA Hiroyuki
@ 2009-10-16  0:32               ` Daisuke Nishimura
  0 siblings, 0 replies; 8+ messages in thread
From: Daisuke Nishimura @ 2009-10-16  0:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	h-shimamoto, linux-kernel, Daisuke Nishimura

This is a fix for memcg-coalesce-charging-via-percpu-storage.patch,
and can be applied after memcg-coalesce-charging-via-percpu-storage-fix.patch.

===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Don't do INIT_WORK() repeatedly against the same work_struct.
It can actually lead to a BUG.

Just do it once in initialization.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f850941..bf02bea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1349,8 +1349,8 @@ static void drain_all_stock_async(void)
 	/* This function is for scheduling "drain" in asynchronous way.
 	 * The result of "drain" is not directly handled by callers. Then,
 	 * if someone is calling drain, we don't have to call drain more.
-	 * Anyway, work_pending() will catch if there is a race. We just do
-	 * loose check here.
+	 * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
+	 * there is a race. We just do loose check here.
 	 */
 	if (atomic_read(&memcg_drain_count))
 		return;
@@ -1359,9 +1359,6 @@ static void drain_all_stock_async(void)
 	get_online_cpus();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		if (work_pending(&stock->work))
-			continue;
-		INIT_WORK(&stock->work, drain_local_stock);
 		schedule_work_on(cpu, &stock->work);
 	}
  	put_online_cpus();
@@ -3327,11 +3324,17 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 
 	/* root ? */
 	if (cont->parent == NULL) {
+		int cpu;
 		enable_swap_cgroup();
 		parent = NULL;
 		root_mem_cgroup = mem;
 		if (mem_cgroup_soft_limit_tree_init())
 			goto free_out;
+		for_each_possible_cpu(cpu) {
+			struct memcg_stock_pcp *stock =
+						&per_cpu(memcg_stock, cpu);
+			INIT_WORK(&stock->work, drain_local_stock);
+		}
 		hotcpu_notifier(memcg_stock_cpu_callback, 0);
 
 	} else {
-- 
1.5.6.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-10-16  0:41 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-09  7:58 [PATCH 1/2] memcg: coalescing uncharge at unmap/truncate (Oct/9) KAMEZAWA Hiroyuki
     [not found] ` <20091009170105.170e025f.kamezawa.hiroyu@jp.fujitsu.com>
2009-10-09 23:50   ` [PATCH 2/2] memcg: coalescing charge by percpu (Oct/9) Andrew Morton
2009-10-11  2:37     ` KAMEZAWA Hiroyuki
2009-10-13  7:57       ` Daisuke Nishimura
2009-10-13  8:05         ` KAMEZAWA Hiroyuki
2009-10-14  6:42           ` Daisuke Nishimura
2009-10-14  7:02             ` KAMEZAWA Hiroyuki
2009-10-16  0:32               ` [BUGFIX][PATCH -mmotm] memcg: don't do INIT_WORK() repeatedly against the same work_struct Daisuke Nishimura
