linux-mm.kvack.org archive mirror
* [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
@ 2008-02-19 12:54 KAMEZAWA Hiroyuki
  2008-02-19 15:40 ` Hugh Dickins
                   ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-19 12:54 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp, hugh,
	riel@redhat.com

I'd like to start this as an RFC.

In the following code:
==
  lock_page_cgroup(page);
  pc = page_get_page_cgroup(page);
  unlock_page_cgroup(page);

  access 'pc' later..
== (See mem_cgroup_move_lists())

There is a race because 'pc' is not a stable value without lock_page_cgroup()
(mem_cgroup_uncharge() can free this 'pc').

For example, mem_cgroup_move_lists() accesses pc without the lock.
There is a small race window between mem_cgroup_move_lists()
and mem_cgroup_uncharge(). At uncharge, the page_cgroup struct is immediately
freed, but move_lists can still access it after taking the lru_lock.
(*) mem_cgroup_uncharge_page() can be called without the zone->lru lock.
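
Roughly, the interleaving I have in mind is something like this (just an
illustrative sketch; the function names are those of current -rc2):
==
  CPU0 (e.g. shrink_active_list)         CPU1
  pc = page_get_page_cgroup(page);
  /* no lock_page_cgroup() held */
                                         mem_cgroup_uncharge_page(page)
                                           -> pc->ref_cnt reaches zero
                                           -> clear_page_cgroup(page, pc)
                                           -> kfree(pc)      /* pc is freed */
  mem_cgroup_move_lists(pc, active)
    -> mz = page_cgroup_zoneinfo(pc);    /* reads freed memory */
    -> spin_lock_irqsave(&mz->lru_lock, flags);  /* use-after-free */
==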

This is not good practice.
.....
There is no quick fix (maybe). Moreover, I hear some people around me say
the current memcontrol.c code is very complicated.
I agree ;( ..it was caused by my work.

I'd like to fix the problems in a clean way.
(Note: the current -rc2 code works well under heavy pressure, but there
 is a possibility of a race, I think.)



This patch is a first trial to fix the issue by cleaning up the whole
lock-related code of mem_cgroup, and adding the necessary explanations as comments.

To make things clearer, it adds the following:
 - page_cgroup->usage:  a reference count of charges.
 - page_cgroup->refcnt: references from kernel objects.

page_cgroup->usage is incremented at charge and
decremented at uncharge.

Usually there are 2 references in page_cgroup->refcnt:
 (a) A reference from struct page's page->page_cgroup
 (b) A reference from the cgroup's LRU.

 (a) is dropped when page_cgroup->usage goes down to zero.
 (b) is dropped when the page_cgroup is removed from the LRU list.
 An extra reference can be taken while accessing a page_cgroup without
 holding lock_page_cgroup().

Typical usage is
==
	lock_page_cgroup(page);
	pc = page_get_page_cgroup(page);
	if (pc)
		get_page_cgroup(pc); /* increment pc->refcnt here */
	unlock_page_cgroup(page);
==

It is safe when:
 * handling 'pc' under lock_page_cgroup(page) && page_get_page_cgroup(page) == pc,
 * handling 'pc' under lru_lock && !list_empty(&pc->lru),
 * handling 'pc' while holding a page_cgroup->refcnt reference (see the sketch below).
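
For the third case, the access pattern is roughly the following (a sketch;
it mirrors what the new mem_cgroup_move_lists() in the patch below does):
==
	lock_page_cgroup(page);
	pc = page_get_page_cgroup(page);
	if (!pc) {
		unlock_page_cgroup(page);
		return;
	}
	get_page_cgroup(pc);			/* pin 'pc' against kfree() */
	unlock_page_cgroup(page);

	mz = page_cgroup_zoneinfo(pc);
	spin_lock_irqsave(&mz->lru_lock, flags);
	if (!list_empty(&pc->lru))		/* still on the LRU ? */
		__mem_cgroup_move_lists(pc, active);
	spin_unlock_irqrestore(&mz->lru_lock, flags);
	put_page_cgroup(pc);			/* may free 'pc' now */
==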

What this patch does:
 * remove the special functions (which I added... sigh).
 * add pc->usage.
 * refactor to make the lock rules clearer.
 * add the lock rules as a comment.
 * add node_id/zone_id information to page_cgroup.
   The current code has to access pc->page while page->page_cgroup is cleared;
   this allows accessing the LRU without accessing pc->page.
 * make force_empty take the page lock. This will help us avoid
   all the racy conditions.

Although the number of lines is increased (mainly because of the added comments),
I think this version is easier to read.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 include/linux/memcontrol.h |    2 
 mm/memcontrol.c            |  283 ++++++++++++++++++++++++++-------------------
 mm/vmscan.c                |    4 
 3 files changed, 172 insertions(+), 117 deletions(-)

Index: linux-2.6.25-rc2/mm/memcontrol.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/memcontrol.c
+++ linux-2.6.25-rc2/mm/memcontrol.c
@@ -17,6 +17,44 @@
  * GNU General Public License for more details.
  */
 
+/*
+ * Lock and Refcnt Rule for page_cgroup.
+ * (pc means struct page_cgroup)
+ *
+ * - pc->usage is the reference count of 'charge'.
+ *     + must be modified under lock_page_cgroup()
+ *     + incremented at charge.
+ *     + decremented at uncharge.
+ *     + if pc->usage turns to zero, page_cgroup drops the reference from
+ *       struct page.
+ *
+ * - pc->refcnt is a reference count from kernel code/objects, using atomic ops.
+ *     + One reference from the page struct.
+ *     + One reference from the LRU.
+ *     + Under lock_page_cgroup(), page_cgroup is alive if page->page_cgroup
+ *	 points to it.
+ *     + Under lru_lock, page_cgroup is alive if it's linked to the list.
+ *     + Any other code which handles page_cgroup without lock_page_cgroup()
+ *	 must increment this.
+ *
+ *  - under lock_page_cgroup().
+ *     + you can access pc by checking pc == page_get_page_cgroup(page).
+ *     + you must not take lock_page() under lock_page_cgroup().
+ *	 lock_page_cgroup() under lock_page() is ok.
+ *     + you must not take mz->lru_lock under lock_page_cgroup().
+ *
+ *  - under lru_lock.
+ *     + you can access pc safely by checking !list_empty(&pc->lru);
+ *
+ *  - lru_lock must not be taken under lock_page_cgroup(). You must increment
+ *    pc->refcnt, call unlock_page_cgroup(), and only then take lru_lock.
+ *    You can check the page_cgroup is still on the LRU by !list_empty(&pc->lru);
+ *
+ *  - pc->page will not be NULL once it is filled, but page->page_cgroup can
+ *    become NULL if you don't take lock_page_cgroup().
+ */
+
+
 #include <linux/res_counter.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
@@ -30,6 +68,7 @@
 #include <linux/spinlock.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
+#include <linux/pagemap.h>
 
 #include <asm/uaccess.h>
 
@@ -154,23 +193,15 @@ struct page_cgroup {
 	struct list_head lru;		/* per cgroup LRU list */
 	struct page *page;
 	struct mem_cgroup *mem_cgroup;
-	atomic_t ref_cnt;		/* Helpful when pages move b/w  */
-					/* mapped and cached states     */
+	int    usage;			/* # of charged users  */
+	atomic_t refcnt;		/* reference from kernel */
 	int	 flags;
+	short	 nid;
+	short	 zid;
 };
 #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
 
-static inline int page_cgroup_nid(struct page_cgroup *pc)
-{
-	return page_to_nid(pc->page);
-}
-
-static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
-{
-	return page_zonenum(pc->page);
-}
-
 enum {
 	MEM_CGROUP_TYPE_UNSPEC = 0,
 	MEM_CGROUP_TYPE_MAPPED,
@@ -184,6 +215,21 @@ enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
 };
 
+static int get_page_cgroup(struct page_cgroup *pc)
+{
+	if (atomic_inc_not_zero(&pc->refcnt))
+		return 0;
+	return 1;
+}
+
+/*
+ * decrement and free when refcnt goes down to zero.
+ */
+static void put_page_cgroup(struct page_cgroup *pc)
+{
+	if (atomic_dec_and_test(&pc->refcnt))
+		kfree(pc);
+}
 
 /*
  * Always modified under lru lock. Then, not necessary to preempt_disable()
@@ -213,10 +259,7 @@ static inline struct mem_cgroup_per_zone
 page_cgroup_zoneinfo(struct page_cgroup *pc)
 {
 	struct mem_cgroup *mem = pc->mem_cgroup;
-	int nid = page_cgroup_nid(pc);
-	int zid = page_cgroup_zid(pc);
-
-	return mem_cgroup_zoneinfo(mem, nid, zid);
+	return mem_cgroup_zoneinfo(mem, pc->nid, pc->zid);
 }
 
 static unsigned long mem_cgroup_get_all_zonestat(struct mem_cgroup *mem,
@@ -303,47 +346,6 @@ static void __always_inline unlock_page_
 	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
 }
 
-/*
- * Tie new page_cgroup to struct page under lock_page_cgroup()
- * This can fail if the page has been tied to a page_cgroup.
- * If success, returns 0.
- */
-static int page_cgroup_assign_new_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
-{
-	int ret = 0;
-
-	lock_page_cgroup(page);
-	if (!page_get_page_cgroup(page))
-		page_assign_page_cgroup(page, pc);
-	else /* A page is tied to other pc. */
-		ret = 1;
-	unlock_page_cgroup(page);
-	return ret;
-}
-
-/*
- * Clear page->page_cgroup member under lock_page_cgroup().
- * If given "pc" value is different from one page->page_cgroup,
- * page->cgroup is not cleared.
- * Returns a value of page->page_cgroup at lock taken.
- * A can can detect failure of clearing by following
- *  clear_page_cgroup(page, pc) == pc
- */
-
-static struct page_cgroup *clear_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
-{
-	struct page_cgroup *ret;
-	/* lock and clear */
-	lock_page_cgroup(page);
-	ret = page_get_page_cgroup(page);
-	if (likely(ret == pc))
-		page_assign_page_cgroup(page, NULL);
-	unlock_page_cgroup(page);
-	return ret;
-}
-
 static void __mem_cgroup_remove_list(struct page_cgroup *pc)
 {
 	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
@@ -407,18 +409,28 @@ int task_in_mem_cgroup(struct task_struc
 /*
  * This routine assumes that the appropriate zone's lru lock is already held
  */
-void mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
+void mem_cgroup_move_lists(struct page *page, bool active)
 {
+	struct page_cgroup *pc;
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
-	if (!pc)
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
+	if (!pc) {
+		unlock_page_cgroup(page);
 		return;
+	}
+	/* Because we release lock after this, get refcnt */
+	get_page_cgroup(pc);
+	unlock_page_cgroup(page);
 
 	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_move_lists(pc, active);
+	if (!list_empty(&pc->lru))
+		__mem_cgroup_move_lists(pc, active);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	put_page_cgroup(pc);
 }
 
 /*
@@ -530,14 +542,19 @@ unsigned long mem_cgroup_isolate_pages(u
 
 	spin_lock(&mz->lru_lock);
 	scan = 0;
+	/* Handling 'pc' is always safe here because we hold lru_lock and
+	   every 'pc' on this list is linked to the LRU. */
 	list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
 		if (scan >= nr_to_scan)
 			break;
 		page = pc->page;
-		VM_BUG_ON(!pc);
+		VM_BUG_ON(!page);
 
 		if (unlikely(!PageLRU(page)))
 			continue;
+		/* being uncharged ? */
+		if (!pc->usage)
+			continue;
 
 		if (PageActive(page) && !active) {
 			__mem_cgroup_move_lists(pc, true);
@@ -595,18 +612,20 @@ retry:
 		 * the page has already been accounted.
 		 */
 		if (pc) {
-			if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
+			if (!pc->usage) {
 				/* this page is under being uncharged ? */
 				unlock_page_cgroup(page);
 				cpu_relax();
 				goto retry;
 			} else {
+				++pc->usage;
 				unlock_page_cgroup(page);
 				goto done;
 			}
 		}
 		unlock_page_cgroup(page);
-	}
+	} else
+		return 0;
 
 	pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
 	if (pc == NULL)
@@ -658,33 +677,32 @@ retry:
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	atomic_set(&pc->ref_cnt, 1);
-	pc->mem_cgroup = mem;
-	pc->page = page;
-	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
-		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
+	lock_page_cgroup(page);
 
-	if (!page || page_cgroup_assign_new_page_cgroup(page, pc)) {
-		/*
-		 * Another charge has been added to this page already.
-		 * We take lock_page_cgroup(page) again and read
-		 * page->cgroup, increment refcnt.... just retry is OK.
-		 */
+	if (page_get_page_cgroup(page) != NULL) {
 		res_counter_uncharge(&mem->res, PAGE_SIZE);
 		css_put(&mem->css);
-		kfree(pc);
-		if (!page)
-			goto done;
+		unlock_page_cgroup(page);
 		goto retry;
 	}
+	pc->usage = 1;
+	atomic_set(&pc->refcnt, 1); /* A reference from page struct */
+	pc->mem_cgroup = mem;
+	pc->page = page;	/* This will never be NULL while 'pc' is alive */
+	pc->nid = page_to_nid(page);
+	pc->zid = page_zonenum(page);
+
+	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
+	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
+		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
+	page_assign_page_cgroup(page, pc);
+	unlock_page_cgroup(page);
 
 	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
-	/* Update statistics vector */
+	get_page_cgroup(pc);	/* reference from LRU */
 	__mem_cgroup_add_list(pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-
 done:
 	return 0;
 out:
@@ -716,10 +734,6 @@ int mem_cgroup_cache_charge(struct page 
 	return ret;
 }
 
-/*
- * Uncharging is always a welcome operation, we never complain, simply
- * uncharge. This routine should be called with lock_page_cgroup held
- */
 void mem_cgroup_uncharge(struct page_cgroup *pc)
 {
 	struct mem_cgroup *mem;
@@ -732,24 +746,30 @@ void mem_cgroup_uncharge(struct page_cgr
 	 */
 	if (!pc)
 		return;
+	--pc->usage;
+	if (!pc->usage) {
+		/* At first, drop charge */
+		mem = pc->mem_cgroup;
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		css_put(&mem->css);
 
-	if (atomic_dec_and_test(&pc->ref_cnt)) {
 		page = pc->page;
-		mz = page_cgroup_zoneinfo(pc);
+
 		/*
-		 * get page->cgroup and clear it under lock.
-		 * force_empty can drop page->cgroup without checking refcnt.
+		 * There is no user. Drop the reference from 'page'.
+		 * We got this 'pc' under lock_page_cgroup(), so the below is safe.
 		 */
+		page_assign_page_cgroup(page, NULL);
+		put_page_cgroup(pc); /* drop reference from 'page' */
 		unlock_page_cgroup(page);
-		if (clear_page_cgroup(page, pc) == pc) {
-			mem = pc->mem_cgroup;
-			css_put(&mem->css);
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			spin_lock_irqsave(&mz->lru_lock, flags);
-			__mem_cgroup_remove_list(pc);
-			spin_unlock_irqrestore(&mz->lru_lock, flags);
-			kfree(pc);
-		}
+
+		/* Remove from our LRU */
+		mz = page_cgroup_zoneinfo(pc);
+		spin_lock_irqsave(&mz->lru_lock, flags);
+		__mem_cgroup_remove_list(pc);
+		put_page_cgroup(pc); /* drop reference from LRU */
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+
 		lock_page_cgroup(page);
 	}
 }
@@ -772,8 +792,12 @@ int mem_cgroup_prepare_migration(struct 
 	int ret = 0;
 	lock_page_cgroup(page);
 	pc = page_get_page_cgroup(page);
-	if (pc && atomic_inc_not_zero(&pc->ref_cnt))
+	if (pc && pc->usage) {
+		++pc->usage;
 		ret = 1;
+	}
+	if (ret)
+		get_page_cgroup(pc);
 	unlock_page_cgroup(page);
 	return ret;
 }
@@ -784,7 +808,10 @@ void mem_cgroup_end_migration(struct pag
 
 	lock_page_cgroup(page);
 	pc = page_get_page_cgroup(page);
-	mem_cgroup_uncharge(pc);
+	if (pc) {
+		mem_cgroup_uncharge(pc);
+		put_page_cgroup(pc);
+	}
 	unlock_page_cgroup(page);
 }
 /*
@@ -799,21 +826,27 @@ void mem_cgroup_page_migration(struct pa
 	struct mem_cgroup *mem;
 	unsigned long flags;
 	struct mem_cgroup_per_zone *mz;
-retry:
+
+	/*
+	 * Against uncharge, we have one extra usage count.
+	 * Against force_empty, this is done under the page lock.
+	 */
+	lock_page_cgroup(page);
 	pc = page_get_page_cgroup(page);
-	if (!pc)
-		return;
+	page_assign_page_cgroup(page, NULL);
+	unlock_page_cgroup(page);
+
 	mem = pc->mem_cgroup;
 	mz = page_cgroup_zoneinfo(pc);
-	if (clear_page_cgroup(page, pc) != pc)
-		goto retry;
 	spin_lock_irqsave(&mz->lru_lock, flags);
-
 	__mem_cgroup_remove_list(pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 
-	pc->page = newpage;
 	lock_page_cgroup(newpage);
+	VM_BUG_ON(page_get_page_cgroup(newpage) != NULL);
+	pc->page = newpage;
+	pc->nid = page_to_nid(newpage);
+	pc->zid = page_zonenum(newpage);
 	page_assign_page_cgroup(newpage, pc);
 	unlock_page_cgroup(newpage);
 
@@ -826,7 +859,7 @@ retry:
 
 /*
  * This routine traverse page_cgroup in given list and drop them all.
- * This routine ignores page_cgroup->ref_cnt.
+ * This routine ignores page_cgroup->usage.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
  */
 #define FORCE_UNCHARGE_BATCH	(128)
@@ -854,17 +887,42 @@ retry:
 
 	while (--count && !list_empty(list)) {
 		pc = list_entry(list->prev, struct page_cgroup, lru);
+
+		/* pc->page is never cleared while 'pc' is alive */
 		page = pc->page;
-		/* Avoid race with charge */
-		atomic_set(&pc->ref_cnt, 0);
-		if (clear_page_cgroup(page, pc) == pc) {
-			css_put(&mem->css);
+		VM_BUG_ON(!page);
+
+		/*
+		 * This guarantees that page migration will not touch
+		 * this page.
+		 */
+		if (TestSetPageLocked(page))
+			goto unlock_exit_loop;
+
+		lock_page_cgroup(page);
+		if (page_get_page_cgroup(page) == pc &&
+		    pc->usage) {
+			/* See also mem_cgroup_uncharge(). */
+			/* drop charge at first */
+			pc->usage = 0;
 			res_counter_uncharge(&mem->res, PAGE_SIZE);
+			css_put(&mem->css);
+			page_assign_page_cgroup(page, NULL);
+			put_page_cgroup(pc);
+			unlock_page_cgroup(page);
+			/* Here, no reference from struct 'page' */
+			/* Just remove from LRU */
 			__mem_cgroup_remove_list(pc);
-			kfree(pc);
-		} else 	/* being uncharged ? ...do relax */
-			break;
+			put_page_cgroup(pc);
+			unlock_page(page);
+		} else {
+			/* This page is being uncharged */
+			unlock_page_cgroup(page);
+			unlock_page(page);
+			goto unlock_exit_loop;
+		}
 	}
+unlock_exit_loop:
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	if (!list_empty(list)) {
 		cond_resched();
Index: linux-2.6.25-rc2/include/linux/memcontrol.h
===================================================================
--- linux-2.6.25-rc2.orig/include/linux/memcontrol.h
+++ linux-2.6.25-rc2/include/linux/memcontrol.h
@@ -39,7 +39,7 @@ extern int mem_cgroup_charge(struct page
 				gfp_t gfp_mask);
 extern void mem_cgroup_uncharge(struct page_cgroup *pc);
 extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_move_lists(struct page_cgroup *pc, bool active);
+extern void mem_cgroup_move_lists(struct page *page, bool active);
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
Index: linux-2.6.25-rc2/mm/vmscan.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/vmscan.c
+++ linux-2.6.25-rc2/mm/vmscan.c
@@ -1128,7 +1128,7 @@ static void shrink_active_list(unsigned 
 		ClearPageActive(page);
 
 		list_move(&page->lru, &zone->inactive_list);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), false);
+		mem_cgroup_move_lists(page, false);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
@@ -1157,7 +1157,7 @@ static void shrink_active_list(unsigned 
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 		list_move(&page->lru, &zone->active_list);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
+		mem_cgroup_move_lists(page, true);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
Index: linux-2.6.25-rc2/mm/swap.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/swap.c
+++ linux-2.6.25-rc2/mm/swap.c
@@ -176,7 +176,7 @@ void activate_page(struct page *page)
 		SetPageActive(page);
 		add_page_to_active_list(zone, page);
 		__count_vm_event(PGACTIVATE);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
+		mem_cgroup_move_lists(page, true);
 	}
 	spin_unlock_irq(&zone->lru_lock);
 }


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-19 12:54 [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races KAMEZAWA Hiroyuki
@ 2008-02-19 15:40 ` Hugh Dickins
  2008-02-20  1:03   ` KAMEZAWA Hiroyuki
                     ` (3 more replies)
  2008-02-19 15:54 ` kamezawa.hiroyu
  2008-02-20  5:00 ` Balbir Singh
  2 siblings, 4 replies; 63+ messages in thread
From: Hugh Dickins @ 2008-02-19 15:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	yamamoto@valinux.co.jp, riel@redhat.com

On Tue, 19 Feb 2008, KAMEZAWA Hiroyuki wrote:
> I'd like to start this as an RFC.
> 
> In the following code:
> ==
>   lock_page_cgroup(page);
>   pc = page_get_page_cgroup(page);
>   unlock_page_cgroup(page);
> 
>   access 'pc' later..
> == (See mem_cgroup_move_lists())
> 
> There is a race because 'pc' is not a stable value without lock_page_cgroup()
> (mem_cgroup_uncharge() can free this 'pc').
> 
> For example, mem_cgroup_move_lists() accesses pc without the lock.
> There is a small race window between mem_cgroup_move_lists()
> and mem_cgroup_uncharge(). At uncharge, the page_cgroup struct is immediately
> freed, but move_lists can still access it after taking the lru_lock.
> (*) mem_cgroup_uncharge_page() can be called without the zone->lru lock.
> 
> This is not good practice.
> .....
> There is no quick fix (maybe). Moreover, I hear some people around me say
> the current memcontrol.c code is very complicated.
> I agree ;( ..it was caused by my work.
> 
> I'd like to fix the problems in a clean way.
> (Note: the current -rc2 code works well under heavy pressure, but there
>  is a possibility of a race, I think.)

Yes, yes, indeed, I've been working away on this too.

Ever since the VM_BUG_ON(page_get_page_cgroup(page)) went into
free_hot_cold_page (at my own prompting), I've been hitting it
just very occasionally in my kernel build testing.  Was unable
to reproduce it over the New Year, but a week or two ago found
one machine and config on which it is relatively reproducible,
pretty sure to happen within 12 hours.

And on Saturday evening at last identified the cause, exactly
where you have: that unsafety in mem_cgroup_move_lists - which
has the nice property of putting pages from the lru on to SLUB's
freelist!

Unlike the unsafeties of force_empty, this is liable to hit anyone
running with MEM_CONT compiled in, they don't have to be consciously
using mem_cgroups at all.

(I consider that, by the way, quite a serious defect in the current
mem_cgroup work: that a distro compiling it in for 1% of customers
is then subjecting all to the mem_cgroup overhead - effectively
doubled struct page size and unnecessary accounting overhead.  I
believe there needs to be a way to opt out, a force_empty which
sticks.  Yes, I know the page_cgroup which does that doubling of
size is only allocated on demand, but every page cache page and
every anonymous page is going to have one.  A kmem_cache for them
will reduce the extra, but there still needs to be a way to opt
out completely.)

Since then I've been working on patches too, testing, currently
breaking up my one big patch into pieces while running more tests.
A lot in common with yours, a lot not.  (And none of it addressing
that issue of opt-out I raise in the last paragraph: haven't begun
to go into that one, hoped you and Balbir would look into it.)

I've not had time to study yours yet, but a first impression is
that you're adding extra complexity (usage in addition to ref_cnt)
where I'm more taking it away (it's pointless for ref_cnt to be an
atomic: the one place which isn't already using lock_page_cgroup
around it needs to).  But that could easily turn out to be because
I'm skirting issues which you're addressing properly: we'll see.

I haven't completed my solution in mem_cgroup_move_lists yet: but
the way it wants a lock in a structure which isn't stabilized until
it's got that lock, reminds me very much of my page_lock_anon_vma,
so I'm expecting to use a SLAB_DESTROY_BY_RCU cache there.
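
For concreteness, the kind of thing I mean - only a sketch, not part of the
patch below, and the cache name here is illustrative - is a kmem_cache
created with SLAB_DESTROY_BY_RCU, so that a freed page_cgroup's memory
remains a page_cgroup until a grace period has passed, and a racer can
revalidate after it finally gets the lock:

	static struct kmem_cache *page_cgroup_cache;

	page_cgroup_cache = kmem_cache_create("page_cgroup",
				sizeof(struct page_cgroup), 0,
				SLAB_DESTROY_BY_RCU, NULL);
	...
	pc = kmem_cache_zalloc(page_cgroup_cache, gfp_mask);
	...
	kmem_cache_free(page_cgroup_cache, pc);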

Ha, you have lock_page_cgroup in your mem_cgroup_move_lists: yes,
tried that too, and it deadlocks: someone holding lock_page_cgroup
can be interrupted by an end of I/O interrupt which does
rotate_reclaimable_page and wants the main lru_lock, but that
main lru_lock is held across mem_cgroup_move_lists.  There are
several different ways to address that, but for this release I
think we just go for a try_lock_page_cgroup there.
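
Something along these lines, which is in fact what the patch below does:

	static int try_lock_page_cgroup(struct page *page)
	{
		return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
	}

and then in mem_cgroup_move_lists():

	if (!try_lock_page_cgroup(page))
		return;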

(And that answers my old question, of why you use spin_lock_irq
on your mz->lru_lock: because if you didn't, the same deadlock
could hit one of the other places which lock mz->lru_lock.)

How should I proceed now?  I think it's best if I press ahead with
my patchset, to get that out on to the list; and only then come
back to look at yours, while you can be looking at mine.  Then
we take the best out of both and push that forward - this does
need to be fixed for 2.6.25.

Hugh


* Re: Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-19 12:54 [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races KAMEZAWA Hiroyuki
  2008-02-19 15:40 ` Hugh Dickins
@ 2008-02-19 15:54 ` kamezawa.hiroyu
  2008-02-19 16:26   ` Hugh Dickins
  2008-02-20  5:00 ` Balbir Singh
  2 siblings, 1 reply; 63+ messages in thread
From: kamezawa.hiroyu @ 2008-02-19 15:54 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: KAMEZAWA Hiroyuki, linux-mm, balbir, yamamoto, riel

>How should I proceed now?  I think it's best if I press ahead with
>my patchset, to get that out on to the list; and only then come
>back to look at yours, while you can be looking at mine.  Then
>we take the best out of both and push that forward - this does
>need to be fixed for 2.6.25.
>
I'm very glad to hear that you have been working on this already.

I think it's better to test yours first because it sounds like
you've already seen the BUG much more than I have, and
I think my patch will need more work to be simple.

Could you post yours? I'll try it on my box.

Thanks,
-Kame


* Re: Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-19 15:54 ` kamezawa.hiroyu
@ 2008-02-19 16:26   ` Hugh Dickins
  2008-02-20  1:55     ` KAMEZAWA Hiroyuki
  2008-02-20  6:38     ` Balbir Singh
  0 siblings, 2 replies; 63+ messages in thread
From: Hugh Dickins @ 2008-02-19 16:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, yamamoto, riel

On Wed, 20 Feb 2008, kamezawa.hiroyu@jp.fujitsu.com wrote:
> >How should I proceed now?  I think it's best if I press ahead with
> >my patchset, to get that out on to the list; and only then come
> >back to look at yours, while you can be looking at mine.  Then
> >we take the best out of both and push that forward - this does
> >need to be fixed for 2.6.25.
> >
> I'm very glad to hear that you have been working on this already.
> 
> I think it's better to test yours first because it sounds like
> you've already seen the BUG much more than I have, and
> I think my patch will need more work to be simple.
> 
> Could you post yours? I'll try it on my box.

Okay, thanks, on the understanding that I may decide things differently
in splitting it up.  And you'll immediately see why I need to split it:
there's several unrelated mods across that area, and a lot of cleanup
(another cleanup I'd like to make but held back from, is removing the
"_page" from mem_cgroup_uncharge_page).

One thing I've already reverted while splitting it: mm/memory.c still
needs to use page_assign_page_cgroup, not in initializing the struct
pages, but its VM_BUG_ON(page_get_page_cgroup) needs to become a bad
page state instead - because most people build without DEBUG_VM, and
page->cgroup must be reset before the next user corrupts through it.

There's a build warning on mem in charge_common which I want to get
rid of; and I've not yet decided if I like that restructuring or not.

Hugh

diff -purN 26252/include/linux/memcontrol.h 26252h/include/linux/memcontrol.h
--- 26252/include/linux/memcontrol.h	2008-02-11 07:18:10.000000000 +0000
+++ 26252h/include/linux/memcontrol.h	2008-02-17 13:05:03.000000000 +0000
@@ -32,14 +32,11 @@ struct mm_struct;
 
 extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
 extern void mm_free_cgroup(struct mm_struct *mm);
-extern void page_assign_page_cgroup(struct page *page,
-					struct page_cgroup *pc);
 extern struct page_cgroup *page_get_page_cgroup(struct page *page);
 extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
-extern void mem_cgroup_uncharge(struct page_cgroup *pc);
 extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_move_lists(struct page_cgroup *pc, bool active);
+extern void mem_cgroup_move_lists(struct page *page, bool active);
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -51,7 +48,7 @@ extern int mem_cgroup_cache_charge(struc
 					gfp_t gfp_mask);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
-#define vm_match_cgroup(mm, cgroup)	\
+#define mm_match_cgroup(mm, cgroup)	\
 	((cgroup) == rcu_dereference((mm)->mem_cgroup))
 
 extern int mem_cgroup_prepare_migration(struct page *page);
@@ -85,11 +82,6 @@ static inline void mm_free_cgroup(struct
 {
 }
 
-static inline void page_assign_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
-{
-}
-
 static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
 {
 	return NULL;
@@ -101,16 +93,11 @@ static inline int mem_cgroup_charge(stru
 	return 0;
 }
 
-static inline void mem_cgroup_uncharge(struct page_cgroup *pc)
-{
-}
-
 static inline void mem_cgroup_uncharge_page(struct page *page)
 {
 }
 
-static inline void mem_cgroup_move_lists(struct page_cgroup *pc,
-						bool active)
+static inline void mem_cgroup_move_lists(struct page *page, bool active)
 {
 }
 
@@ -121,7 +108,7 @@ static inline int mem_cgroup_cache_charg
 	return 0;
 }
 
-static inline int vm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
+static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
 {
 	return 1;
 }
diff -purN 26252/mm/memcontrol.c 26252h/mm/memcontrol.c
--- 26252/mm/memcontrol.c	2008-02-11 07:18:12.000000000 +0000
+++ 26252h/mm/memcontrol.c	2008-02-17 13:31:53.000000000 +0000
@@ -137,6 +137,7 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat stat;
 };
+static struct mem_cgroup init_mem_cgroup;
 
 /*
  * We use the lower bit of the page->page_cgroup pointer as a bit spin
@@ -144,7 +145,7 @@ struct mem_cgroup {
  * byte aligned (based on comments from Nick Piggin)
  */
 #define PAGE_CGROUP_LOCK_BIT 	0x0
-#define PAGE_CGROUP_LOCK 		(1 << PAGE_CGROUP_LOCK_BIT)
+#define PAGE_CGROUP_LOCK 	(1 << PAGE_CGROUP_LOCK_BIT)
 
 /*
  * A page_cgroup page is associated with every page descriptor. The
@@ -154,37 +155,27 @@ struct page_cgroup {
 	struct list_head lru;		/* per cgroup LRU list */
 	struct page *page;
 	struct mem_cgroup *mem_cgroup;
-	atomic_t ref_cnt;		/* Helpful when pages move b/w  */
-					/* mapped and cached states     */
-	int	 flags;
+	int ref_cnt;			/* cached, mapped, migrating */
+	int flags;
 };
 #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
 
-static inline int page_cgroup_nid(struct page_cgroup *pc)
+static int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
 }
 
-static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 {
 	return page_zonenum(pc->page);
 }
 
-enum {
-	MEM_CGROUP_TYPE_UNSPEC = 0,
-	MEM_CGROUP_TYPE_MAPPED,
-	MEM_CGROUP_TYPE_CACHED,
-	MEM_CGROUP_TYPE_ALL,
-	MEM_CGROUP_TYPE_MAX,
-};
-
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
 };
 
-
 /*
  * Always modified under lru lock. Then, not necessary to preempt_disable()
  */
@@ -193,23 +184,22 @@ static void mem_cgroup_charge_statistics
 {
 	int val = (charge)? 1 : -1;
 	struct mem_cgroup_stat *stat = &mem->stat;
-	VM_BUG_ON(!irqs_disabled());
 
+	VM_BUG_ON(!irqs_disabled());
 	if (flags & PAGE_CGROUP_FLAG_CACHE)
-		__mem_cgroup_stat_add_safe(stat,
-					MEM_CGROUP_STAT_CACHE, val);
+		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
 	else
 		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
 }
 
-static inline struct mem_cgroup_per_zone *
+static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
 {
-	BUG_ON(!mem->info.nodeinfo[nid]);
+	VM_BUG_ON(!mem->info.nodeinfo[nid]);
 	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
 }
 
-static inline struct mem_cgroup_per_zone *
+static struct mem_cgroup_per_zone *
 page_cgroup_zoneinfo(struct page_cgroup *pc)
 {
 	struct mem_cgroup *mem = pc->mem_cgroup;
@@ -234,18 +224,14 @@ static unsigned long mem_cgroup_get_all_
 	return total;
 }
 
-static struct mem_cgroup init_mem_cgroup;
-
-static inline
-struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
+static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
 {
 	return container_of(cgroup_subsys_state(cont,
 				mem_cgroup_subsys_id), struct mem_cgroup,
 				css);
 }
 
-static inline
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
 	return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
 				struct mem_cgroup, css);
@@ -265,83 +251,29 @@ void mm_free_cgroup(struct mm_struct *mm
 	css_put(&mm->mem_cgroup->css);
 }
 
-static inline int page_cgroup_locked(struct page *page)
-{
-	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT,
-					&page->page_cgroup);
-}
-
-void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
+static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
 {
-	int locked;
-
-	/*
-	 * While resetting the page_cgroup we might not hold the
-	 * page_cgroup lock. free_hot_cold_page() is an example
-	 * of such a scenario
-	 */
-	if (pc)
-		VM_BUG_ON(!page_cgroup_locked(page));
-	locked = (page->page_cgroup & PAGE_CGROUP_LOCK);
-	page->page_cgroup = ((unsigned long)pc | locked);
+	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
 }
 
 struct page_cgroup *page_get_page_cgroup(struct page *page)
 {
-	return (struct page_cgroup *)
-		(page->page_cgroup & ~PAGE_CGROUP_LOCK);
+	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
 }
 
-static void __always_inline lock_page_cgroup(struct page *page)
+static void lock_page_cgroup(struct page *page)
 {
 	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-	VM_BUG_ON(!page_cgroup_locked(page));
 }
 
-static void __always_inline unlock_page_cgroup(struct page *page)
+static int try_lock_page_cgroup(struct page *page)
 {
-	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
 }
 
-/*
- * Tie new page_cgroup to struct page under lock_page_cgroup()
- * This can fail if the page has been tied to a page_cgroup.
- * If success, returns 0.
- */
-static int page_cgroup_assign_new_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
+static void unlock_page_cgroup(struct page *page)
 {
-	int ret = 0;
-
-	lock_page_cgroup(page);
-	if (!page_get_page_cgroup(page))
-		page_assign_page_cgroup(page, pc);
-	else /* A page is tied to other pc. */
-		ret = 1;
-	unlock_page_cgroup(page);
-	return ret;
-}
-
-/*
- * Clear page->page_cgroup member under lock_page_cgroup().
- * If given "pc" value is different from one page->page_cgroup,
- * page->cgroup is not cleared.
- * Returns a value of page->page_cgroup at lock taken.
- * A can can detect failure of clearing by following
- *  clear_page_cgroup(page, pc) == pc
- */
-
-static struct page_cgroup *clear_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
-{
-	struct page_cgroup *ret;
-	/* lock and clear */
-	lock_page_cgroup(page);
-	ret = page_get_page_cgroup(page);
-	if (likely(ret == pc))
-		page_assign_page_cgroup(page, NULL);
-	unlock_page_cgroup(page);
-	return ret;
+	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
 }
 
 static void __mem_cgroup_remove_list(struct page_cgroup *pc)
@@ -399,7 +331,7 @@ int task_in_mem_cgroup(struct task_struc
 	int ret;
 
 	task_lock(task);
-	ret = task->mm && vm_match_cgroup(task->mm, mem);
+	ret = task->mm && mm_match_cgroup(task->mm, mem);
 	task_unlock(task);
 	return ret;
 }
@@ -407,17 +339,43 @@ int task_in_mem_cgroup(struct task_struc
 /*
  * This routine assumes that the appropriate zone's lru lock is already held
  */
-void mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
+void mem_cgroup_move_lists(struct page *page, bool active)
 {
+	struct page_cgroup *pc;
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
-	if (!pc)
+	/*
+	 * We cannot lock_page_cgroup while holding zone's lru_lock,
+	 * because other holders of lock_page_cgroup can be interrupted
+	 * with an attempt to rotate_reclaimable_page.
+	 *
+	 * Change lock_page_cgroup to an interrupt-disabling lock?
+	 * Perhaps, but we'd prefer not.  Hold zone's lru_lock while
+	 * uncharging?  Overhead we'd again prefer to avoid - though
+	 * it may turn out to be just right to uncharge when finally
+	 * removing a page from LRU; but there are probably awkward
+	 * details to that which would need shaking down.
+	 */
+	if (!try_lock_page_cgroup(page))
 		return;
 
-	mz = page_cgroup_zoneinfo(pc);
+	pc = page_get_page_cgroup(page);
+	mz = pc? page_cgroup_zoneinfo(pc): NULL;
+	unlock_page_cgroup(page);
+
+	if (!mz)
+		return;
+
+	/*
+	 * The memory used for this mem_cgroup_per_zone could get
+	 * reused before we take its lru_lock: we probably want to
+	 * use a SLAB_DESTROY_BY_RCU kmem_cache for it.  But that's
+	 * an unlikely race, so for now continue testing without it.
+	 */
 	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_move_lists(pc, active);
+	if (page_get_page_cgroup(page) == pc)
+		__mem_cgroup_move_lists(pc, active);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 }
 
@@ -437,6 +395,7 @@ int mem_cgroup_calc_mapped_ratio(struct 
 	rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
 	return (int)((rss * 100L) / total);
 }
+
 /*
  * This function is called from vmscan.c. In page reclaiming loop. balance
  * between active and inactive list is calculated. For memory controller
@@ -500,7 +459,6 @@ long mem_cgroup_calc_reclaim_inactive(st
 	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
 
 	nr_inactive = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE);
-
 	return (nr_inactive >> priority);
 }
 
@@ -534,7 +492,6 @@ unsigned long mem_cgroup_isolate_pages(u
 		if (scan >= nr_to_scan)
 			break;
 		page = pc->page;
-		VM_BUG_ON(!pc);
 
 		if (unlikely(!PageLRU(page)))
 			continue;
@@ -575,9 +532,11 @@ static int mem_cgroup_charge_common(stru
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
+	struct page_cgroup *new_pc = NULL;
 	unsigned long flags;
 	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup_per_zone *mz;
+	int error;
 
 	/*
 	 * Should page_cgroup's go to their own slab?
@@ -586,31 +545,20 @@ static int mem_cgroup_charge_common(stru
 	 * to see if the cgroup page already has a page_cgroup associated
 	 * with it
 	 */
-retry:
+
 	if (page) {
+		error = 0;
 		lock_page_cgroup(page);
 		pc = page_get_page_cgroup(page);
-		/*
-		 * The page_cgroup exists and
-		 * the page has already been accounted.
-		 */
-		if (pc) {
-			if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
-				/* this page is under being uncharged ? */
-				unlock_page_cgroup(page);
-				cpu_relax();
-				goto retry;
-			} else {
-				unlock_page_cgroup(page);
-				goto done;
-			}
-		}
+		if (pc)
+			goto incref;
 		unlock_page_cgroup(page);
 	}
 
-	pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
-	if (pc == NULL)
-		goto err;
+	error = -ENOMEM;
+	new_pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
+	if (!new_pc)
+		goto done;
 
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
@@ -624,16 +572,11 @@ retry:
 	rcu_read_lock();
 	mem = rcu_dereference(mm->mem_cgroup);
 	/*
-	 * For every charge from the cgroup, increment reference
-	 * count
+	 * For every charge from the cgroup, increment reference count
 	 */
 	css_get(&mem->css);
 	rcu_read_unlock();
 
-	/*
-	 * If we created the page_cgroup, we should free it on exceeding
-	 * the cgroup limit.
-	 */
 	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
 		if (!(gfp_mask & __GFP_WAIT))
 			goto out;
@@ -642,12 +585,12 @@ retry:
 			continue;
 
 		/*
- 		 * try_to_free_mem_cgroup_pages() might not give us a full
- 		 * picture of reclaim. Some pages are reclaimed and might be
- 		 * moved to swap cache or just unmapped from the cgroup.
- 		 * Check the limit again to see if the reclaim reduced the
- 		 * current usage of the cgroup before giving up
- 		 */
+		 * try_to_free_mem_cgroup_pages() might not give us a full
+		 * picture of reclaim. Some pages are reclaimed and might be
+		 * moved to swap cache or just unmapped from the cgroup.
+		 * Check the limit again to see if the reclaim reduced the
+		 * current usage of the cgroup before giving up
+		 */
 		if (res_counter_check_under_limit(&mem->res))
 			continue;
 
@@ -658,106 +601,101 @@ retry:
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	atomic_set(&pc->ref_cnt, 1);
-	pc->mem_cgroup = mem;
-	pc->page = page;
-	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
+	error = 0;
+	if (!page)
+		goto out;
+
+	new_pc->ref_cnt = 1;
+	new_pc->mem_cgroup = mem;
+	new_pc->page = page;
+	new_pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
-		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
+		new_pc->flags |= PAGE_CGROUP_FLAG_CACHE;
 
-	if (!page || page_cgroup_assign_new_page_cgroup(page, pc)) {
-		/*
-		 * Another charge has been added to this page already.
-		 * We take lock_page_cgroup(page) again and read
-		 * page->cgroup, increment refcnt.... just retry is OK.
-		 */
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		css_put(&mem->css);
-		kfree(pc);
-		if (!page)
-			goto done;
-		goto retry;
-	}
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
+	if (!pc) {
+		page_assign_page_cgroup(page, new_pc);
+		unlock_page_cgroup(page);
 
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	/* Update statistics vector */
-	__mem_cgroup_add_list(pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
+		mz = page_cgroup_zoneinfo(new_pc);
+		spin_lock_irqsave(&mz->lru_lock, flags);
+		__mem_cgroup_add_list(new_pc);
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+		goto done;
+	}
 
-done:
-	return 0;
+incref:
+	VM_BUG_ON(pc->page != page);
+	VM_BUG_ON(pc->ref_cnt <= 0);
+	pc->ref_cnt++;
+	unlock_page_cgroup(page);
 out:
-	css_put(&mem->css);
-	kfree(pc);
-err:
-	return -ENOMEM;
+	if (new_pc) {
+		if (!error)
+			res_counter_uncharge(&mem->res, PAGE_SIZE);
+		css_put(&mem->css);
+		kfree(new_pc);
+	}
+done:
+	return error;
 }
 
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
-			gfp_t gfp_mask)
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 {
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
 			MEM_CGROUP_CHARGE_TYPE_MAPPED);
 }
 
-/*
- * See if the cached pages should be charged at all?
- */
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
-	int ret = 0;
 	if (!mm)
 		mm = &init_mm;
-
-	ret = mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_CACHE);
-	return ret;
+	return mem_cgroup_charge_common(page, mm, gfp_mask,
+			MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
 /*
  * Uncharging is always a welcome operation, we never complain, simply
- * uncharge. This routine should be called with lock_page_cgroup held
+ * uncharge.
  */
-void mem_cgroup_uncharge(struct page_cgroup *pc)
+void mem_cgroup_uncharge_page(struct page *page)
 {
+	struct page_cgroup *pc;
 	struct mem_cgroup *mem;
 	struct mem_cgroup_per_zone *mz;
-	struct page *page;
 	unsigned long flags;
 
 	/*
 	 * Check if our page_cgroup is valid
 	 */
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
 	if (!pc)
-		return;
+		goto unlock;
 
-	if (atomic_dec_and_test(&pc->ref_cnt)) {
-		page = pc->page;
-		mz = page_cgroup_zoneinfo(pc);
-		/*
-		 * get page->cgroup and clear it under lock.
-		 * force_empty can drop page->cgroup without checking refcnt.
-		 */
+	VM_BUG_ON(pc->page != page);
+	VM_BUG_ON(pc->ref_cnt <= 0);
+
+	if (--(pc->ref_cnt) == 0) {
+		page_assign_page_cgroup(page, NULL);
 		unlock_page_cgroup(page);
-		if (clear_page_cgroup(page, pc) == pc) {
-			mem = pc->mem_cgroup;
-			css_put(&mem->css);
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			spin_lock_irqsave(&mz->lru_lock, flags);
-			__mem_cgroup_remove_list(pc);
-			spin_unlock_irqrestore(&mz->lru_lock, flags);
-			kfree(pc);
-		}
-		lock_page_cgroup(page);
+
+		mem = pc->mem_cgroup;
+		css_put(&mem->css);
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+
+		mz = page_cgroup_zoneinfo(pc);
+		spin_lock_irqsave(&mz->lru_lock, flags);
+		__mem_cgroup_remove_list(pc);
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+		kfree(pc);
+		return;
 	}
-}
 
-void mem_cgroup_uncharge_page(struct page *page)
-{
-	lock_page_cgroup(page);
-	mem_cgroup_uncharge(page_get_page_cgroup(page));
+unlock:
 	unlock_page_cgroup(page);
 }
 
@@ -765,50 +703,46 @@ void mem_cgroup_uncharge_page(struct pag
  * Returns non-zero if a page (under migration) has valid page_cgroup member.
  * Refcnt of page_cgroup is incremented.
  */
-
 int mem_cgroup_prepare_migration(struct page *page)
 {
 	struct page_cgroup *pc;
-	int ret = 0;
+
 	lock_page_cgroup(page);
 	pc = page_get_page_cgroup(page);
-	if (pc && atomic_inc_not_zero(&pc->ref_cnt))
-		ret = 1;
+	if (pc)
+		pc->ref_cnt++;
 	unlock_page_cgroup(page);
-	return ret;
+	return pc != NULL;
 }
 
 void mem_cgroup_end_migration(struct page *page)
 {
-	struct page_cgroup *pc;
-
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	mem_cgroup_uncharge(pc);
-	unlock_page_cgroup(page);
+	mem_cgroup_uncharge_page(page);
 }
+
 /*
  * We know both *page* and *newpage* are now not-on-LRU and Pg_locked.
  * And no race with uncharge() routines because page_cgroup for *page*
  * has extra one reference by mem_cgroup_prepare_migration.
  */
-
 void mem_cgroup_page_migration(struct page *page, struct page *newpage)
 {
 	struct page_cgroup *pc;
-	struct mem_cgroup *mem;
-	unsigned long flags;
 	struct mem_cgroup_per_zone *mz;
-retry:
+	unsigned long flags;
+
+	lock_page_cgroup(page);
 	pc = page_get_page_cgroup(page);
-	if (!pc)
+	if (!pc) {
+		unlock_page_cgroup(page);
 		return;
-	mem = pc->mem_cgroup;
+	}
+
+	page_assign_page_cgroup(page, NULL);
+	unlock_page_cgroup(page);
+
 	mz = page_cgroup_zoneinfo(pc);
-	if (clear_page_cgroup(page, pc) != pc)
-		goto retry;
 	spin_lock_irqsave(&mz->lru_lock, flags);
-
 	__mem_cgroup_remove_list(pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 
@@ -821,7 +755,6 @@ retry:
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_add_list(pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-	return;
 }
 
 /*
@@ -830,8 +763,7 @@ retry:
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
  */
 #define FORCE_UNCHARGE_BATCH	(128)
-static void
-mem_cgroup_force_empty_list(struct mem_cgroup *mem,
+static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			    struct mem_cgroup_per_zone *mz,
 			    int active)
 {
@@ -855,30 +787,33 @@ retry:
 	while (--count && !list_empty(list)) {
 		pc = list_entry(list->prev, struct page_cgroup, lru);
 		page = pc->page;
-		/* Avoid race with charge */
-		atomic_set(&pc->ref_cnt, 0);
-		if (clear_page_cgroup(page, pc) == pc) {
+		lock_page_cgroup(page);
+		if (page_get_page_cgroup(page) == pc) {
+			page_assign_page_cgroup(page, NULL);
+			unlock_page_cgroup(page);
 			css_put(&mem->css);
 			res_counter_uncharge(&mem->res, PAGE_SIZE);
 			__mem_cgroup_remove_list(pc);
 			kfree(pc);
-		} else 	/* being uncharged ? ...do relax */
+		} else {
+			/* racing uncharge: let page go then retry */
+			unlock_page_cgroup(page);
 			break;
+		}
 	}
+
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	if (!list_empty(list)) {
 		cond_resched();
 		goto retry;
 	}
-	return;
 }
 
 /*
  * make mem_cgroup's charge to be 0 if there is no task.
  * This enables deleting this mem_cgroup.
  */
-
-int mem_cgroup_force_empty(struct mem_cgroup *mem)
+static int mem_cgroup_force_empty(struct mem_cgroup *mem)
 {
 	int ret = -EBUSY;
 	int node, zid;
@@ -907,9 +842,7 @@ out:
 	return ret;
 }
 
-
-
-int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
+static int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
 {
 	*tmp = memparse(buf, &buf);
 	if (*buf != '\0')
@@ -956,7 +889,6 @@ static ssize_t mem_force_empty_write(str
 /*
  * Note: This should be removed if cgroup supports write-only file.
  */
-
 static ssize_t mem_force_empty_read(struct cgroup *cont,
 				struct cftype *cft,
 				struct file *file, char __user *userbuf,
@@ -965,7 +897,6 @@ static ssize_t mem_force_empty_read(stru
 	return -EINVAL;
 }
 
-
 static const struct mem_cgroup_stat_desc {
 	const char *msg;
 	u64 unit;
@@ -1018,8 +949,6 @@ static int mem_control_stat_open(struct 
 	return single_open(file, mem_control_stat_show, cont);
 }
 
-
-
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -1085,9 +1014,6 @@ static void free_mem_cgroup_per_zone_inf
 	kfree(mem->info.nodeinfo[node]);
 }
 
-
-static struct mem_cgroup init_mem_cgroup;
-
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -1177,7 +1103,6 @@ static void mem_cgroup_move_task(struct 
 
 out:
 	mmput(mm);
-	return;
 }
 
 struct cgroup_subsys mem_cgroup_subsys = {
diff -purN 26252/mm/memory.c 26252h/mm/memory.c
--- 26252/mm/memory.c	2008-02-15 23:43:20.000000000 +0000
+++ 26252h/mm/memory.c	2008-02-17 10:26:22.000000000 +0000
@@ -1711,7 +1711,7 @@ unlock:
 	}
 	return ret;
 oom_free_new:
-	__free_page(new_page);
+	page_cache_release(new_page);
 oom:
 	if (old_page)
 		page_cache_release(old_page);
@@ -2093,12 +2093,9 @@ static int do_swap_page(struct mm_struct
 	unlock_page(page);
 
 	if (write_access) {
-		/* XXX: We could OR the do_wp_page code with this one? */
-		if (do_wp_page(mm, vma, address,
-				page_table, pmd, ptl, pte) & VM_FAULT_OOM) {
-			mem_cgroup_uncharge_page(page);
-			ret = VM_FAULT_OOM;
-		}
+		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+		if (ret & VM_FAULT_ERROR)
+			ret &= VM_FAULT_ERROR;
 		goto out;
 	}
 
@@ -2163,7 +2160,7 @@ release:
 	page_cache_release(page);
 	goto unlock;
 oom_free_page:
-	__free_page(page);
+	page_cache_release(page);
 oom:
 	return VM_FAULT_OOM;
 }
diff -purN 26252/mm/page_alloc.c 26252h/mm/page_alloc.c
--- 26252/mm/page_alloc.c	2008-02-11 07:18:12.000000000 +0000
+++ 26252h/mm/page_alloc.c	2008-02-17 10:26:11.000000000 +0000
@@ -981,6 +981,7 @@ static void free_hot_cold_page(struct pa
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
+	VM_BUG_ON(page_get_page_cgroup(page));
 	if (PageAnon(page))
 		page->mapping = NULL;
 	if (free_pages_check(page))
@@ -988,7 +989,6 @@ static void free_hot_cold_page(struct pa
 
 	if (!PageHighMem(page))
 		debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
-	VM_BUG_ON(page_get_page_cgroup(page));
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
@@ -2527,7 +2527,6 @@ void __meminit memmap_init_zone(unsigned
 		set_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		reset_page_mapcount(page);
-		page_assign_page_cgroup(page, NULL);
 		SetPageReserved(page);
 
 		/*
diff -purN 26252/mm/rmap.c 26252h/mm/rmap.c
--- 26252/mm/rmap.c	2008-02-11 07:18:12.000000000 +0000
+++ 26252h/mm/rmap.c	2008-02-17 10:26:22.000000000 +0000
@@ -321,7 +321,7 @@ static int page_referenced_anon(struct p
 		 * counting on behalf of references from different
 		 * cgroups
 		 */
-		if (mem_cont && !vm_match_cgroup(vma->vm_mm, mem_cont))
+		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
 			continue;
 		referenced += page_referenced_one(page, vma, &mapcount);
 		if (!mapcount)
@@ -382,7 +382,7 @@ static int page_referenced_file(struct p
 		 * counting on behalf of references from different
 		 * cgroups
 		 */
-		if (mem_cont && !vm_match_cgroup(vma->vm_mm, mem_cont))
+		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
 			continue;
 		if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
 				  == (VM_LOCKED|VM_MAYSHARE)) {
diff -purN 26252/mm/swap.c 26252h/mm/swap.c
--- 26252/mm/swap.c	2008-02-11 07:18:12.000000000 +0000
+++ 26252h/mm/swap.c	2008-02-17 13:01:50.000000000 +0000
@@ -176,7 +176,7 @@ void activate_page(struct page *page)
 		SetPageActive(page);
 		add_page_to_active_list(zone, page);
 		__count_vm_event(PGACTIVATE);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
+		mem_cgroup_move_lists(page, true);
 	}
 	spin_unlock_irq(&zone->lru_lock);
 }
diff -purN 26252/mm/vmscan.c 26252h/mm/vmscan.c
--- 26252/mm/vmscan.c	2008-02-11 07:18:12.000000000 +0000
+++ 26252h/mm/vmscan.c	2008-02-17 13:02:33.000000000 +0000
@@ -1128,7 +1128,7 @@ static void shrink_active_list(unsigned 
 		ClearPageActive(page);
 
 		list_move(&page->lru, &zone->inactive_list);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), false);
+		mem_cgroup_move_lists(page, false);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
@@ -1157,7 +1157,7 @@ static void shrink_active_list(unsigned 
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 		list_move(&page->lru, &zone->active_list);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
+		mem_cgroup_move_lists(page, true);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-19 15:40 ` Hugh Dickins
@ 2008-02-20  1:03   ` KAMEZAWA Hiroyuki
  2008-02-20  4:14     ` Hugh Dickins
  2008-02-20  3:14   ` KAMEZAWA Hiroyuki
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20  1:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	yamamoto@valinux.co.jp, riel@redhat.com

Good morning ;)

On Tue, 19 Feb 2008 15:40:45 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> Since then I've been working on patches too, testing, currently
> breaking up my one big patch into pieces while running more tests.
> A lot in common with yours, a lot not.  (And none of it addressing
> that issue of opt-out I raise in the last paragraph: haven't begun
> to go into that one, hoped you and Balbir would look into it.)
> 
I have some trial patches for reducing atomic_ops by a do-it-lazy method.
Now, I'm afraid that the performance is too bad when there is *no* memory pressure.

> I've not had time to study yours yet, but a first impression is
> that you're adding extra complexity (usage in addition to ref_cnt)
> where I'm more taking it away (it's pointless for ref_cnt to be an
> atomic: the one place which isn't already using lock_page_cgroup
> around it needs to).  But that could easily turn out to be because
> I'm skirting issues which you're addressing properly: we'll see.
> 
> I haven't completed my solution in mem_cgroup_move_lists yet: but
> the way it wants a lock in a structure which isn't stabilized until
> it's got that lock, reminds me very much of my page_lock_anon_vma,
> so I'm expecting to use a SLAB_DESTROY_BY_RCU cache there.
> 

IMHO, because tons of page_cgroup can be freed at once, we need some good
idea for reducing RCU's GC work to do that.

> Ha, you have lock_page_cgroup in your mem_cgroup_move_lists: yes,
> tried that too, and it deadlocks: someone holding lock_page_cgroup
> can be interrupted by an end of I/O interrupt which does
> rotate_reclaimable_page and wants the main lru_lock, but that
> main lru_lock is held across mem_cgroup_move_lists.  There are
> several different ways to address that, but for this release I
> think we just go for a try_lock_page_cgroup there.
> 
Hm, I'd like to remove mem_cgroup_move_lists if possible ;(
(But the result would be bad LRU ordering.)

Regards,
-Kame


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-19 16:26   ` Hugh Dickins
@ 2008-02-20  1:55     ` KAMEZAWA Hiroyuki
  2008-02-20  2:05       ` YAMAMOTO Takashi
  2008-02-20  6:38     ` Balbir Singh
  1 sibling, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20  1:55 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, yamamoto, riel, hugh@veritas.com

Balbir-san,

On Tue, 19 Feb 2008 16:26:10 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> @@ -575,9 +532,11 @@ static int mem_cgroup_charge_common(stru
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> +	struct page_cgroup *new_pc = NULL;
>  	unsigned long flags;
>  	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct mem_cgroup_per_zone *mz;
> +	int error;
>  
>  	/*
>  	 * Should page_cgroup's go to their own slab?
> @@ -586,31 +545,20 @@ static int mem_cgroup_charge_common(stru
>  	 * to see if the cgroup page already has a page_cgroup associated
>  	 * with it
>  	 */
> -retry:
> +
>  	if (page) {
> +		error = 0;
>  		lock_page_cgroup(page);

What is the !page case in mem_cgroup_charge_xxx()?

Thanks
-Kame


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  1:55     ` KAMEZAWA Hiroyuki
@ 2008-02-20  2:05       ` YAMAMOTO Takashi
  2008-02-20  2:15         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 63+ messages in thread
From: YAMAMOTO Takashi @ 2008-02-20  2:05 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, linux-mm, riel, hugh

> Balbir-san,
> 
> On Tue, 19 Feb 2008 16:26:10 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> > @@ -575,9 +532,11 @@ static int mem_cgroup_charge_common(stru
> >  {
> >  	struct mem_cgroup *mem;
> >  	struct page_cgroup *pc;
> > +	struct page_cgroup *new_pc = NULL;
> >  	unsigned long flags;
> >  	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >  	struct mem_cgroup_per_zone *mz;
> > +	int error;
> >  
> >  	/*
> >  	 * Should page_cgroup's go to their own slab?
> > @@ -586,31 +545,20 @@ static int mem_cgroup_charge_common(stru
> >  	 * to see if the cgroup page already has a page_cgroup associated
> >  	 * with it
> >  	 */
> > -retry:
> > +
> >  	if (page) {
> > +		error = 0;
> >  		lock_page_cgroup(page);
> 
> What is !page case in mem_cgroup_charge_xxx() ?

see a hack in shmem_getpage.

YAMAMOTO Takashi

> 
> Thanks
> -Kame


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  2:05       ` YAMAMOTO Takashi
@ 2008-02-20  2:15         ` KAMEZAWA Hiroyuki
  2008-02-20  2:32           ` YAMAMOTO Takashi
  0 siblings, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20  2:15 UTC (permalink / raw)
  To: YAMAMOTO Takashi; +Cc: balbir, linux-mm, riel, hugh

On Wed, 20 Feb 2008 11:05:12 +0900 (JST)
yamamoto@valinux.co.jp (YAMAMOTO Takashi) wrote:

	error = 0;
> > >  		lock_page_cgroup(page);
> > 
> > What is !page case in mem_cgroup_charge_xxx() ?
> 
> see a hack in shmem_getpage.
> 
Aha, OK. Maybe we should add try_to_shrink_page_cgroup() to make room
rather than adding a special case.

Thanks,
-Kame


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  2:15         ` KAMEZAWA Hiroyuki
@ 2008-02-20  2:32           ` YAMAMOTO Takashi
  2008-02-20  4:27             ` Hugh Dickins
  0 siblings, 1 reply; 63+ messages in thread
From: YAMAMOTO Takashi @ 2008-02-20  2:32 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, linux-mm, riel, hugh

> On Wed, 20 Feb 2008 11:05:12 +0900 (JST)
> yamamoto@valinux.co.jp (YAMAMOTO Takashi) wrote:
> 
> 	error = 0;
> > > >  		lock_page_cgroup(page);
> > > 
> > > What is !page case in mem_cgroup_charge_xxx() ?
> > 
> > see a hack in shmem_getpage.
> > 
> Aha, ok. maybe we should add try_to_shrink_page_cgroup() for making room
> rather than adding special case.
> 
> Thanks,
> -Kame

yes.
or, even better, implement cgroup background reclaim.

YAMAMOTO Takashi


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-19 15:40 ` Hugh Dickins
  2008-02-20  1:03   ` KAMEZAWA Hiroyuki
@ 2008-02-20  3:14   ` KAMEZAWA Hiroyuki
  2008-02-20  3:37     ` YAMAMOTO Takashi
  2008-02-20  4:32     ` Hugh Dickins
  2008-02-20  5:57   ` Balbir Singh
  2008-02-20  6:27   ` Hirokazu Takahashi
  3 siblings, 2 replies; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20  3:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	yamamoto@valinux.co.jp, riel@redhat.com

On Tue, 19 Feb 2008 15:40:45 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> I haven't completed my solution in mem_cgroup_move_lists yet: but
> the way it wants a lock in a structure which isn't stabilized until
> it's got that lock, reminds me very much of my page_lock_anon_vma,
> so I'm expecting to use a SLAB_DESTROY_BY_RCU cache there.
> 
Could I ask a question about anon_vma's RCU?

I think SLAB_DESTROY_BY_RCU guarantees that a slab's page is not freed back
to the buddy allocator while someone holds rcu_read_lock().

Why is it safe against the slab fast path (array_cache) reusing a freed object?

Thanks,
-Kame


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  3:14   ` KAMEZAWA Hiroyuki
@ 2008-02-20  3:37     ` YAMAMOTO Takashi
  2008-02-20  4:13       ` KAMEZAWA Hiroyuki
  2008-02-20  4:32     ` Hugh Dickins
  1 sibling, 1 reply; 63+ messages in thread
From: YAMAMOTO Takashi @ 2008-02-20  3:37 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: hugh, linux-mm, balbir, riel

> On Tue, 19 Feb 2008 15:40:45 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> > I haven't completed my solution in mem_cgroup_move_lists yet: but
> > the way it wants a lock in a structure which isn't stabilized until
> > it's got that lock, reminds me very much of my page_lock_anon_vma,
> > so I'm expecting to use a SLAB_DESTROY_BY_RCU cache there.
> > 
> Could I make a question about anon_vma's RCU ?
> 
> I think SLAB_DESTROY_BY_RCU guarantees that slab's page is not freed back
> to buddy allocator while some holds rcu_read_lock().
> 
> Why it's safe against reusing freed one by slab fast path (array_cache) ?

Reuse for another anon_vma is OK;
page_check_address() checks whether it was really for this page.

YAMAMOTO Takashi

> 
> Thanks,
> -Kame
> 

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  3:37     ` YAMAMOTO Takashi
@ 2008-02-20  4:13       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20  4:13 UTC (permalink / raw)
  To: YAMAMOTO Takashi; +Cc: hugh, linux-mm, balbir, riel

On Wed, 20 Feb 2008 12:37:24 +0900 (JST)
yamamoto@valinux.co.jp (YAMAMOTO Takashi) wrote:

> > Why it's safe against reusing freed one by slab fast path (array_cache) ?
> 
> reuse for another anon_vma is ok.
> page_check_address checks if it was really for this page.
> 
I see. (And I checked the migration code again; it does check.)

Then, page_cgroup should have some check code if it is to use RCU, or
use call_rcu() by itself.
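
A minimal sketch of the call_rcu() variant (the rcu member and the callback
below are hypothetical, not part of any posted patch):

==
struct page_cgroup {
	struct list_head lru;		/* per cgroup LRU list */
	struct page *page;
	struct mem_cgroup *mem_cgroup;
	int ref_cnt;
	int flags;
	struct rcu_head rcu;		/* hypothetical new member */
};

static void page_cgroup_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct page_cgroup, rcu));
}

	/* at uncharge time, instead of kfree(pc): */
	call_rcu(&pc->rcu, page_cgroup_free_rcu);
==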

Thanks,
-Kame


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  1:03   ` KAMEZAWA Hiroyuki
@ 2008-02-20  4:14     ` Hugh Dickins
  2008-02-20  4:37       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 63+ messages in thread
From: Hugh Dickins @ 2008-02-20  4:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	yamamoto@valinux.co.jp, riel@redhat.com

On Wed, 20 Feb 2008, KAMEZAWA Hiroyuki wrote:
> On Tue, 19 Feb 2008 15:40:45 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> > A lot in common with yours, a lot not.  (And none of it addressing
> > that issue of opt-out I raise in the last paragraph: haven't begun
> > to go into that one, hoped you and Balbir would look into it.)
> > 
> I have some trial patches for reducing atomic_ops by do_it_lazy method.
> Now, I'm afraid that performence is too bad when there is *no* memory
> pressure.

But it isn't just the atomic ops, it's the whole business of
mem_cgroup_charge_common plus mem_cgroup_uncharge_page per page.

The existence of force_empty indicates that the system can get along
without the charge on the page.  What's needed, I think, is something
in struct mm, a flag or a reserved value in mm->mem_cgroup, to say
don't do any of this mem_cgroup stuff on me; and a cgroup fs interface
to set that, in the same way as force_empty is done.
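
A rough sketch of the reserved-value idea (the sentinel and the helper below
are made-up names; nothing like this exists yet):

==
/* hypothetical sentinel: "this mm opted out of mem_cgroup accounting" */
#define MEM_CGROUP_OPTED_OUT	((struct mem_cgroup *)1)

static inline int mm_mem_cgroup_disabled(struct mm_struct *mm)
{
	return rcu_dereference(mm->mem_cgroup) == MEM_CGROUP_OPTED_OUT;
}

	/* early in mem_cgroup_charge_common() and friends: */
	if (!mm)
		mm = &init_mm;
	if (mm_mem_cgroup_disabled(mm))
		return 0;		/* behave as if the charge succeeded */
==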

> > I haven't completed my solution in mem_cgroup_move_lists yet: but
> > the way it wants a lock in a structure which isn't stabilized until
> > it's got that lock, reminds me very much of my page_lock_anon_vma,
> > so I'm expecting to use a SLAB_DESTROY_BY_RCU cache there.
> > 
> 
> IMHO, because tons of page_cgroup can be freed at once, we need some good
> idea for reducing RCU's GC work to do that.

That's a good point that hadn't yet crossed my mind, but it may not
be relevant.  It's not the struct page_cgroups that would need to go
into a SLAB_DESTROY_BY_RCU slab, but the struct mem_cgroups.

> 
> > Ha, you have lock_page_cgroup in your mem_cgroup_move_lists: yes,
> > tried that too, and it deadlocks: someone holding lock_page_cgroup
> > can be interrupted by an end of I/O interrupt which does
> > rotate_reclaimable_page and wants the main lru_lock, but that
> > main lru_lock is held across mem_cgroup_move_lists.  There are
> > several different ways to address that, but for this release I
> > think we just go for a try_lock_page_cgroup there.
> > 
> Hm, I'd like to remove mem_cgroup_move_lists if possible ;(
> (But its result will be bad LRU ordering.)

I'm not sure if you're actually proposing to revert all that, or just
expressing regret at the difficulty it introduces.  I'll assume the
latter: certainly I'm not arguing for such a large reversion.

Hugh


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  2:32           ` YAMAMOTO Takashi
@ 2008-02-20  4:27             ` Hugh Dickins
  0 siblings, 0 replies; 63+ messages in thread
From: Hugh Dickins @ 2008-02-20  4:27 UTC (permalink / raw)
  To: YAMAMOTO Takashi; +Cc: kamezawa.hiroyu, balbir, linux-mm, riel

On Wed, 20 Feb 2008, YAMAMOTO Takashi wrote:
> > On Wed, 20 Feb 2008 11:05:12 +0900 (JST)
> > yamamoto@valinux.co.jp (YAMAMOTO Takashi) wrote:
> > 
> > 	error = 0;
> > > > >  		lock_page_cgroup(page);
> > > > 
> > > > What is !page case in mem_cgroup_charge_xxx() ?
> > > 
> > > see a hack in shmem_getpage.

My hack, yes.  I did wonder even when submitting it, and was slightly
surprised no voices raised against it.  I've left it in because it
poses no real problem at all, apart from complicating the flow in
charge_common.

But I'll happily remove it if you like: shmem_getpage has proved to
get along fine using charge and uncharge on swappage there instead -
since it only happens once per freeing-batch of pages,
the unnecessary overhead remains in the noise.

> > Aha, ok. maybe we should add try_to_shrink_page_cgroup() for making room
> > rather than adding special case.
> 
> yes.
> or, even better, implement cgroup background reclaim.

Well, either of those might be wanted in some future, but not by me:
for now it sounds like I'll please you both if I remove the !page
special case from charge_common.

Hugh


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  3:14   ` KAMEZAWA Hiroyuki
  2008-02-20  3:37     ` YAMAMOTO Takashi
@ 2008-02-20  4:32     ` Hugh Dickins
  1 sibling, 0 replies; 63+ messages in thread
From: Hugh Dickins @ 2008-02-20  4:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	yamamoto@valinux.co.jp, riel@redhat.com

On Wed, 20 Feb 2008, KAMEZAWA Hiroyuki wrote:
> On Tue, 19 Feb 2008 15:40:45 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> > I haven't completed my solution in mem_cgroup_move_lists yet: but
> > the way it wants a lock in a structure which isn't stabilized until
> > it's got that lock, reminds me very much of my page_lock_anon_vma,
> > so I'm expecting to use a SLAB_DESTROY_BY_RCU cache there.
> > 
> Could I make a question about anon_vma's RCU ?
> 
> I think SLAB_DESTROY_BY_RCU guarantees that slab's page is not freed back
> to buddy allocator while some holds rcu_read_lock().
> 
> Why it's safe against reusing freed one by slab fast path (array_cache) ?

Because so long as that piece of memory is used for the same type of
structure, what's a spinlock remains a spinlock, so (if you've got
rcu_read_lock) it's safe to take the spinlock in the structure even
if it's no longer "yours": then, once you've got the spinlock,
check to see if it's still yours and get out immediately if not.
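
In other words, something like this (a sketch of the pattern only, glossing
over exactly which structure's cache becomes SLAB_DESTROY_BY_RCU):

==
	/* the anon_vma-like pattern, sketched for the lru_lock case */
	rcu_read_lock();
	pc = page_get_page_cgroup(page);
	if (pc) {
		mz = page_cgroup_zoneinfo(pc);
		/* mz may already belong to someone else, but the memory
		 * is still a struct of the same type, so its lru_lock
		 * is still a valid spinlock */
		spin_lock_irqsave(&mz->lru_lock, flags);
		if (page_get_page_cgroup(page) == pc)	/* still ours? */
			__mem_cgroup_move_lists(pc, active);
		spin_unlock_irqrestore(&mz->lru_lock, flags);
	}
	rcu_read_unlock();
==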

Hugh


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  4:14     ` Hugh Dickins
@ 2008-02-20  4:37       ` KAMEZAWA Hiroyuki
  2008-02-20  4:39         ` Hugh Dickins
  2008-02-20  6:40         ` Balbir Singh
  0 siblings, 2 replies; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20  4:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	yamamoto@valinux.co.jp, riel@redhat.com

On Wed, 20 Feb 2008 04:14:58 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> On Wed, 20 Feb 2008, KAMEZAWA Hiroyuki wrote:
> > On Tue, 19 Feb 2008 15:40:45 +0000 (GMT)
> > Hugh Dickins <hugh@veritas.com> wrote:
> > 
> > > A lot in common with yours, a lot not.  (And none of it addressing
> > > that issue of opt-out I raise in the last paragraph: haven't begun
> > > to go into that one, hoped you and Balbir would look into it.)
> > > 
> > I have some trial patches for reducing atomic_ops by do_it_lazy method.
> > Now, I'm afraid that performence is too bad when there is *no* memory
> > pressure.
> 
> But it isn't just the atomic ops, it's the whole business of
> mem_cgroup_charge_common plus mem_cgroup_uncharge_page per page.
> 
> The existence of force_empty indicates that the system can get along
> without the charge on the page. 
Yes.

> What's needed, I think, is something in struct mm, a flag or a reserved value
> in mm->mem_cgroup, to say don't do any of this mem_cgroup stuff on me; and a cgroup
> fs interface to set that, in the same way as force_empty is done.

I agree here. I believe we need a "no charge" flag, at least for the root group.
For the root group, it would be better to have a boot option if that's not complicated.
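
A sketch of what such a boot option could look like (the parameter name and
the flag are hypothetical):

==
/* hypothetical boot option: disable charging altogether */
static int mem_cgroup_disabled __read_mostly;

static int __init mem_cgroup_disable_setup(char *str)
{
	mem_cgroup_disabled = 1;
	return 1;
}
__setup("nomemcgroup", mem_cgroup_disable_setup);

	/* then, at the top of the charge paths: */
	if (mem_cgroup_disabled)
		return 0;
==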

Thanks,
-Kame


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  4:37       ` KAMEZAWA Hiroyuki
@ 2008-02-20  4:39         ` Hugh Dickins
  2008-02-20  4:41           ` Balbir Singh
  2008-02-20  6:40         ` Balbir Singh
  1 sibling, 1 reply; 63+ messages in thread
From: Hugh Dickins @ 2008-02-20  4:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	yamamoto@valinux.co.jp, riel@redhat.com

On Wed, 20 Feb 2008, KAMEZAWA Hiroyuki wrote:
> On Wed, 20 Feb 2008 04:14:58 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> > What's needed, I think, is something in struct mm, a flag or a reserved value
> > in mm->mem_cgroup, to say don't do any of this mem_cgroup stuff on me; and a cgroup
> > fs interface to set that, in the same way as force_empty is done.
> 
> I agree here. I believe we need "no charge" flag at least to the root group.
> For root group, it's better to have boot option if not complicated.

I expect we'll end up wanting both the cgroupfs interface and the boot
option for the root; but yes, for now, the boot option would be enough.

Hugh


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  4:39         ` Hugh Dickins
@ 2008-02-20  4:41           ` Balbir Singh
  0 siblings, 0 replies; 63+ messages in thread
From: Balbir Singh @ 2008-02-20  4:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, yamamoto@valinux.co.jp,
	riel@redhat.com

Hugh Dickins wrote:
> On Wed, 20 Feb 2008, KAMEZAWA Hiroyuki wrote:
>> On Wed, 20 Feb 2008 04:14:58 +0000 (GMT)
>> Hugh Dickins <hugh@veritas.com> wrote:
>>
>>> What's needed, I think, is something in struct mm, a flag or a reserved value
>>> in mm->mem_cgroup, to say don't do any of this mem_cgroup stuff on me; and a cgroup
>>> fs interface to set that, in the same way as force_empty is done.
>> I agree here. I believe we need "no charge" flag at least to the root group.
>> For root group, it's better to have boot option if not complicated.
> 
> I expect we'll end up wanting both the cgroupfs interface and the boot
> option for the root; but yes, for now, the boot option would be enough.
> 
> Hugh

Yes, a boot option would be good for now. Sorry, I've just woken up and am reading
through the other emails, trying to catch up with the threads. I'll try and
respond to them soon.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-19 12:54 [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races KAMEZAWA Hiroyuki
  2008-02-19 15:40 ` Hugh Dickins
  2008-02-19 15:54 ` kamezawa.hiroyu
@ 2008-02-20  5:00 ` Balbir Singh
  2 siblings, 0 replies; 63+ messages in thread
From: Balbir Singh @ 2008-02-20  5:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, yamamoto@valinux.co.jp, hugh, riel@redhat.com

KAMEZAWA Hiroyuki wrote:
> I'd like to start from RFC.
> 
> In following code
> ==
>   lock_page_cgroup(page);
>   pc = page_get_page_cgroup(page);
>   unlock_page_cgroup(page);
> 
>   access 'pc' later..
> == (See, page_cgroup_move_lists())
> 

Hi, KAMEZAWA-San,

I assume that when you say page_cgroup_move_lists(), you mean
mem_cgroup_move_lists().

> There is a race because 'pc' is not a stable value without lock_page_cgroup().
> (mem_cgroup_uncharge can free this 'pc').
> 
> For example, page_cgroup_move_lists() access pc without lock.
> There is a small race window, between page_cgroup_move_lists()
> and mem_cgroup_uncharge(). At uncharge, page_cgroup struct is immedieately
> freed but move_list can access it after taking lru_lock.
> (*) mem_cgroup_uncharge_page() can be called without zone->lru lock.
> 
> This is not good manner.

Yes, correct. Thanks for catching this. I'll try and review all functions, to
see if there are other violations of correct locking.

> .....
> There is no quick fix (maybe). Moreover, I hear some people around me said
> current memcontrol.c codes are very complicated.
> I agree ;( ..it's caued by my work.
> 
> I'd like to fix problems in clean way.
> (Note: current -rc2 codes works well under heavy pressure. but there
>  is possibility of race, I think.)
> 

I am not looking at the patch below, since I saw that Hugh Dickins has also
posted his fixes and updates. We could review them and then see what else needs
to be done.


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-19 15:40 ` Hugh Dickins
  2008-02-20  1:03   ` KAMEZAWA Hiroyuki
  2008-02-20  3:14   ` KAMEZAWA Hiroyuki
@ 2008-02-20  5:57   ` Balbir Singh
  2008-02-20  9:58     ` Hirokazu Takahashi
  2008-02-20  6:27   ` Hirokazu Takahashi
  3 siblings, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2008-02-20  5:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, yamamoto@valinux.co.jp,
	riel@redhat.com

Hugh Dickins wrote:
> On Tue, 19 Feb 2008, KAMEZAWA Hiroyuki wrote:
>> I'd like to start from RFC.
>>
>> In following code
>> ==
>>   lock_page_cgroup(page);
>>   pc = page_get_page_cgroup(page);
>>   unlock_page_cgroup(page);
>>
>>   access 'pc' later..
>> == (See, page_cgroup_move_lists())
>>
>> There is a race because 'pc' is not a stable value without lock_page_cgroup().
>> (mem_cgroup_uncharge can free this 'pc').
>>
>> For example, page_cgroup_move_lists() access pc without lock.
>> There is a small race window, between page_cgroup_move_lists()
>> and mem_cgroup_uncharge(). At uncharge, page_cgroup struct is immedieately
>> freed but move_list can access it after taking lru_lock.
>> (*) mem_cgroup_uncharge_page() can be called without zone->lru lock.
>>
>> This is not good manner.
>> .....
>> There is no quick fix (maybe). Moreover, I hear some people around me said
>> current memcontrol.c codes are very complicated.
>> I agree ;( ..it's caued by my work.
>>
>> I'd like to fix problems in clean way.
>> (Note: current -rc2 codes works well under heavy pressure. but there
>>  is possibility of race, I think.)
> 
> Yes, yes, indeed, I've been working away on this too.
> 
> Ever since the VM_BUG_ON(page_get_page_cgroup(page)) went into
> free_hot_cold_page (at my own prompting), I've been hitting it
> just very occasionally in my kernel build testing.  Was unable
> to reproduce it over the New Year, but a week or two ago found
> one machine and config on which it is relatively reproducible,
> pretty sure to happen within 12 hours.
> 
> And on Saturday evening at last identified the cause, exactly
> where you have: that unsafety in mem_cgroup_move_lists - which
> has the nice property of putting pages from the lru on to SLUB's
> freelist!
> 
> Unlike the unsafeties of force_empty, this is liable to hit anyone
> running with MEM_CONT compiled in, they don't have to be consciously
> using mem_cgroups at all.
> 
> (I consider that, by the way, quite a serious defect in the current
> mem_cgroup work: that a distro compiling it in for 1% of customers
> is then subjecting all to the mem_cgroup overhead - effectively
> doubled struct page size and unnecessary accounting overhead.  I
> believe there needs to be a way to opt out, a force_empty which
> sticks.  Yes, I know the page_cgroup which does that doubling of
> size is only allocated on demand, but every page cache page and
> every anonymous page is going to have one.  A kmem_cache for them
> will reduce the extra, but there still needs to be a way to opt
> out completely.)
> 

I've been thinking along these lines as well:

1. Have a boot option to turn on/off the memory controller
2. Have a separate cache for the page_cgroup structure. I sent this suggestion
   out just yesterday or so. (A rough sketch follows below.)

I agree that these are necessary enhancements/changes.
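
For point 2, a rough sketch of the dedicated slab (all names are placeholders;
only the kmem_cache_* calls are real APIs):

==
static struct kmem_cache *page_cgroup_cache __read_mostly;

static int __init page_cgroup_cache_init(void)
{
	page_cgroup_cache = kmem_cache_create("page_cgroup",
				sizeof(struct page_cgroup), 0, 0, NULL);
	return page_cgroup_cache ? 0 : -ENOMEM;
}

	/* in mem_cgroup_charge_common(), instead of kzalloc(): */
	new_pc = kmem_cache_zalloc(page_cgroup_cache, gfp_mask);

	/* and wherever kfree(pc) is done today: */
	kmem_cache_free(page_cgroup_cache, pc);
==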

[snip]



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-19 15:40 ` Hugh Dickins
                     ` (2 preceding siblings ...)
  2008-02-20  5:57   ` Balbir Singh
@ 2008-02-20  6:27   ` Hirokazu Takahashi
  2008-02-20  6:50     ` KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 63+ messages in thread
From: Hirokazu Takahashi @ 2008-02-20  6:27 UTC (permalink / raw)
  To: kamezawa.hiroyu, hugh; +Cc: linux-mm, balbir, yamamoto, riel

Hi,

> On Tue, 19 Feb 2008, KAMEZAWA Hiroyuki wrote:
> > I'd like to start from RFC.
> > 
> > In following code
> > ==
> >   lock_page_cgroup(page);
> >   pc = page_get_page_cgroup(page);
> >   unlock_page_cgroup(page);
> > 
> >   access 'pc' later..
> > == (See, page_cgroup_move_lists())
> > 
> > There is a race because 'pc' is not a stable value without lock_page_cgroup().
> > (mem_cgroup_uncharge can free this 'pc').
> > 
> > For example, page_cgroup_move_lists() access pc without lock.
> > There is a small race window, between page_cgroup_move_lists()
> > and mem_cgroup_uncharge(). At uncharge, page_cgroup struct is immedieately
> > freed but move_list can access it after taking lru_lock.
> > (*) mem_cgroup_uncharge_page() can be called without zone->lru lock.
> > 
> > This is not good manner.
> > .....
> > There is no quick fix (maybe). Moreover, I hear some people around me said
> > current memcontrol.c codes are very complicated.
> > I agree ;( ..it's caued by my work.
> > 
> > I'd like to fix problems in clean way.
> > (Note: current -rc2 codes works well under heavy pressure. but there
> >  is possibility of race, I think.)
> 
> Yes, yes, indeed, I've been working away on this too.
> 
> Ever since the VM_BUG_ON(page_get_page_cgroup(page)) went into
> free_hot_cold_page (at my own prompting), I've been hitting it
> just very occasionally in my kernel build testing.  Was unable
> to reproduce it over the New Year, but a week or two ago found
> one machine and config on which it is relatively reproducible,
> pretty sure to happen within 12 hours.
> 
> And on Saturday evening at last identified the cause, exactly
> where you have: that unsafety in mem_cgroup_move_lists - which
> has the nice property of putting pages from the lru on to SLUB's
> freelist!
> 
> Unlike the unsafeties of force_empty, this is liable to hit anyone
> running with MEM_CONT compiled in, they don't have to be consciously
> using mem_cgroups at all.

As for force_empty, though this may not be the main topic here,
mem_cgroup_force_empty_list() can be implemented more simply.
It is possible to make the function just call mem_cgroup_uncharge_page()
instead of releasing page_cgroups by itself. The trick is to call get_page()
before invoking mem_cgroup_uncharge_page(), so the page won't be released
during this function.

Kamezawa-san, you may want to look into the attached patch.
I think it will free you from the weird complexity here.

This code can be optimized, but it should be enough since this function
isn't performance critical.

Thanks.


Signed-off-by: Hirokazu Takahashi <taka@vallinux.co.jp>

--- mm/memcontrol.c.ORG	2008-02-12 18:44:45.000000000 +0900
+++ mm/memcontrol.c 2008-02-20 14:23:38.000000000 +0900
@@ -837,7 +837,7 @@ mem_cgroup_force_empty_list(struct mem_c
 {
 	struct page_cgroup *pc;
 	struct page *page;
-	int count;
+	int count = FORCE_UNCHARGE_BATCH;
 	unsigned long flags;
 	struct list_head *list;
 
@@ -846,30 +846,21 @@ mem_cgroup_force_empty_list(struct mem_c
 	else
 		list = &mz->inactive_list;
 
-	if (list_empty(list))
-		return;
-retry:
-	count = FORCE_UNCHARGE_BATCH;
 	spin_lock_irqsave(&mz->lru_lock, flags);
-
-	while (--count && !list_empty(list)) {
+	while (!list_empty(list)) {
 		pc = list_entry(list->prev, struct page_cgroup, lru);
 		page = pc->page;
-		/* Avoid race with charge */
-		atomic_set(&pc->ref_cnt, 0);
-		if (clear_page_cgroup(page, pc) == pc) {
-			css_put(&mem->css);
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			__mem_cgroup_remove_list(pc);
-			kfree(pc);
-		} else 	/* being uncharged ? ...do relax */
-			break;
+		get_page(page);
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+		mem_cgroup_uncharge_page(page);
+		put_page(page);
+		if (--count <= 0) {
+			count = FORCE_UNCHARGE_BATCH;
+			cond_resched();
+		}
+		spin_lock_irqsave(&mz->lru_lock, flags);
 	}
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-	if (!list_empty(list)) {
-		cond_resched();
-		goto retry;
-	}
 	return;
 }
 


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-19 16:26   ` Hugh Dickins
  2008-02-20  1:55     ` KAMEZAWA Hiroyuki
@ 2008-02-20  6:38     ` Balbir Singh
  2008-02-20 11:00       ` Hugh Dickins
  1 sibling, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2008-02-20  6:38 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: KAMEZAWA Hiroyuki, linux-mm, yamamoto, riel

Hugh Dickins wrote:
> On Wed, 20 Feb 2008, kamezawa.hiroyu@jp.fujitsu.com wrote:
>>> How should I proceed now?  I think it's best if I press ahead with
>>> my patchset, to get that out on to the list; and only then come
>>> back to look at yours, while you can be looking at mine.  Then
>>> we take the best out of both and push that forward - this does
>>> need to be fixed for 2.6.25.
>>>
>> I'm very glad to hear that you have been working on this already.
>>
>> I think it's better to test your one at first because it sounds
>> you've already seem the BUG much more than I've seen and
>> I think my patch will need more work to be simple.
>>
>> Could you post your one ? I'll try it on my box.
> 
> Okay, thanks, on the understanding that I may decide things differently
> in splitting it up.  And you'll immediately see why I need to split it:
> there's several unrelated mods across that area, and a lot of cleanup
> (another cleanup I'd like to make but held back from, is remove the
> "_page" from mem_cgroup_uncharge_page).
> 
> One thing I've already reverted while splitting it: mm/memory.c still
> needs to use page_assign_page_cgroup, not in initializing the struct
> pages, but its VM_BUG_ON(page_get_page_cgroup) needs to become a bad
> page state instead - because most people build without DEBUG_VM, and
> page->cgroup must be reset before the next user corrupts through it.
> 
> There's a build warning on mem in charge_common which I want to get
> rid of; and I've not yet decided if I like that restructuring or not.
> 
> Hugh
> 

The changes look good and clean overall. I'll apply the patch and test it. I have
some review comments below; I'll review it again to check for locking issues.


> diff -purN 26252/include/linux/memcontrol.h 26252h/include/linux/memcontrol.h
> --- 26252/include/linux/memcontrol.h	2008-02-11 07:18:10.000000000 +0000
> +++ 26252h/include/linux/memcontrol.h	2008-02-17 13:05:03.000000000 +0000
> @@ -32,14 +32,11 @@ struct mm_struct;
> 
>  extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
>  extern void mm_free_cgroup(struct mm_struct *mm);
> -extern void page_assign_page_cgroup(struct page *page,
> -					struct page_cgroup *pc);
>  extern struct page_cgroup *page_get_page_cgroup(struct page *page);
>  extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask);
> -extern void mem_cgroup_uncharge(struct page_cgroup *pc);
>  extern void mem_cgroup_uncharge_page(struct page *page);
> -extern void mem_cgroup_move_lists(struct page_cgroup *pc, bool active);
> +extern void mem_cgroup_move_lists(struct page *page, bool active);
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -51,7 +48,7 @@ extern int mem_cgroup_cache_charge(struc
>  					gfp_t gfp_mask);
>  int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
> 
> -#define vm_match_cgroup(mm, cgroup)	\
> +#define mm_match_cgroup(mm, cgroup)	\
>  	((cgroup) == rcu_dereference((mm)->mem_cgroup))
> 
>  extern int mem_cgroup_prepare_migration(struct page *page);
> @@ -85,11 +82,6 @@ static inline void mm_free_cgroup(struct
>  {
>  }
> 
> -static inline void page_assign_page_cgroup(struct page *page,
> -						struct page_cgroup *pc)
> -{
> -}
> -
>  static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
>  {
>  	return NULL;
> @@ -101,16 +93,11 @@ static inline int mem_cgroup_charge(stru
>  	return 0;
>  }
> 
> -static inline void mem_cgroup_uncharge(struct page_cgroup *pc)
> -{
> -}
> -
>  static inline void mem_cgroup_uncharge_page(struct page *page)
>  {
>  }
> 
> -static inline void mem_cgroup_move_lists(struct page_cgroup *pc,
> -						bool active)
> +static inline void mem_cgroup_move_lists(struct page *page, bool active)
>  {
>  }
> 
> @@ -121,7 +108,7 @@ static inline int mem_cgroup_cache_charg
>  	return 0;
>  }
> 
> -static inline int vm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
> +static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
>  {
>  	return 1;
>  }
> diff -purN 26252/mm/memcontrol.c 26252h/mm/memcontrol.c
> --- 26252/mm/memcontrol.c	2008-02-11 07:18:12.000000000 +0000
> +++ 26252h/mm/memcontrol.c	2008-02-17 13:31:53.000000000 +0000
> @@ -137,6 +137,7 @@ struct mem_cgroup {
>  	 */
>  	struct mem_cgroup_stat stat;
>  };
> +static struct mem_cgroup init_mem_cgroup;
> 
>  /*
>   * We use the lower bit of the page->page_cgroup pointer as a bit spin
> @@ -144,7 +145,7 @@ struct mem_cgroup {
>   * byte aligned (based on comments from Nick Piggin)
>   */
>  #define PAGE_CGROUP_LOCK_BIT 	0x0
> -#define PAGE_CGROUP_LOCK 		(1 << PAGE_CGROUP_LOCK_BIT)
> +#define PAGE_CGROUP_LOCK 	(1 << PAGE_CGROUP_LOCK_BIT)
> 
>  /*
>   * A page_cgroup page is associated with every page descriptor. The
> @@ -154,37 +155,27 @@ struct page_cgroup {
>  	struct list_head lru;		/* per cgroup LRU list */
>  	struct page *page;
>  	struct mem_cgroup *mem_cgroup;
> -	atomic_t ref_cnt;		/* Helpful when pages move b/w  */
> -					/* mapped and cached states     */
> -	int	 flags;
> +	int ref_cnt;			/* cached, mapped, migrating */
> +	int flags;
>  };
>  #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
>  #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
> 
> -static inline int page_cgroup_nid(struct page_cgroup *pc)
> +static int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
>  }
> 
> -static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
> +static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
>  {
>  	return page_zonenum(pc->page);
>  }
> 
> -enum {
> -	MEM_CGROUP_TYPE_UNSPEC = 0,
> -	MEM_CGROUP_TYPE_MAPPED,
> -	MEM_CGROUP_TYPE_CACHED,
> -	MEM_CGROUP_TYPE_ALL,
> -	MEM_CGROUP_TYPE_MAX,
> -};
> -
>  enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
>  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
>  };
> 
> -
>  /*
>   * Always modified under lru lock. Then, not necessary to preempt_disable()
>   */
> @@ -193,23 +184,22 @@ static void mem_cgroup_charge_statistics
>  {
>  	int val = (charge)? 1 : -1;
>  	struct mem_cgroup_stat *stat = &mem->stat;
> -	VM_BUG_ON(!irqs_disabled());
> 
> +	VM_BUG_ON(!irqs_disabled());
>  	if (flags & PAGE_CGROUP_FLAG_CACHE)
> -		__mem_cgroup_stat_add_safe(stat,
> -					MEM_CGROUP_STAT_CACHE, val);
> +		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
>  	else
>  		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
>  }
> 
> -static inline struct mem_cgroup_per_zone *
> +static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
>  {
> -	BUG_ON(!mem->info.nodeinfo[nid]);
> +	VM_BUG_ON(!mem->info.nodeinfo[nid]);
>  	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
>  }
> 
> -static inline struct mem_cgroup_per_zone *
> +static struct mem_cgroup_per_zone *
>  page_cgroup_zoneinfo(struct page_cgroup *pc)
>  {
>  	struct mem_cgroup *mem = pc->mem_cgroup;
> @@ -234,18 +224,14 @@ static unsigned long mem_cgroup_get_all_
>  	return total;
>  }
> 
> -static struct mem_cgroup init_mem_cgroup;
> -
> -static inline
> -struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
> +static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
>  {
>  	return container_of(cgroup_subsys_state(cont,
>  				mem_cgroup_subsys_id), struct mem_cgroup,
>  				css);
>  }
> 
> -static inline
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>  {
>  	return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
>  				struct mem_cgroup, css);
> @@ -265,83 +251,29 @@ void mm_free_cgroup(struct mm_struct *mm
>  	css_put(&mm->mem_cgroup->css);
>  }
> 
> -static inline int page_cgroup_locked(struct page *page)
> -{
> -	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT,
> -					&page->page_cgroup);
> -}
> -
> -void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
> +static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
>  {
> -	int locked;
> -
> -	/*
> -	 * While resetting the page_cgroup we might not hold the
> -	 * page_cgroup lock. free_hot_cold_page() is an example
> -	 * of such a scenario
> -	 */
> -	if (pc)
> -		VM_BUG_ON(!page_cgroup_locked(page));
> -	locked = (page->page_cgroup & PAGE_CGROUP_LOCK);
> -	page->page_cgroup = ((unsigned long)pc | locked);
> +	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);

We are explicitly setting the PAGE_CGROUP_LOCK bit; shouldn't we keep the
VM_BUG_ON(!page_cgroup_locked(page))?

>  }
> 
>  struct page_cgroup *page_get_page_cgroup(struct page *page)
>  {
> -	return (struct page_cgroup *)
> -		(page->page_cgroup & ~PAGE_CGROUP_LOCK);
> +	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
>  }
> 
> -static void __always_inline lock_page_cgroup(struct page *page)
> +static void lock_page_cgroup(struct page *page)
>  {
>  	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -	VM_BUG_ON(!page_cgroup_locked(page));
>  }
> 
> -static void __always_inline unlock_page_cgroup(struct page *page)
> +static int try_lock_page_cgroup(struct page *page)
>  {
> -	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> +	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
>  }
> 
> -/*
> - * Tie new page_cgroup to struct page under lock_page_cgroup()
> - * This can fail if the page has been tied to a page_cgroup.
> - * If success, returns 0.
> - */
> -static int page_cgroup_assign_new_page_cgroup(struct page *page,
> -						struct page_cgroup *pc)
> +static void unlock_page_cgroup(struct page *page)
>  {
> -	int ret = 0;
> -
> -	lock_page_cgroup(page);
> -	if (!page_get_page_cgroup(page))
> -		page_assign_page_cgroup(page, pc);
> -	else /* A page is tied to other pc. */
> -		ret = 1;
> -	unlock_page_cgroup(page);
> -	return ret;
> -}
> -
> -/*
> - * Clear page->page_cgroup member under lock_page_cgroup().
> - * If given "pc" value is different from one page->page_cgroup,
> - * page->cgroup is not cleared.
> - * Returns a value of page->page_cgroup at lock taken.
> - * A can can detect failure of clearing by following
> - *  clear_page_cgroup(page, pc) == pc
> - */
> -
> -static struct page_cgroup *clear_page_cgroup(struct page *page,
> -						struct page_cgroup *pc)
> -{
> -	struct page_cgroup *ret;
> -	/* lock and clear */
> -	lock_page_cgroup(page);
> -	ret = page_get_page_cgroup(page);
> -	if (likely(ret == pc))
> -		page_assign_page_cgroup(page, NULL);
> -	unlock_page_cgroup(page);
> -	return ret;
> +	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
>  }
> 
>  static void __mem_cgroup_remove_list(struct page_cgroup *pc)
> @@ -399,7 +331,7 @@ int task_in_mem_cgroup(struct task_struc
>  	int ret;
> 
>  	task_lock(task);
> -	ret = task->mm && vm_match_cgroup(task->mm, mem);
> +	ret = task->mm && mm_match_cgroup(task->mm, mem);
>  	task_unlock(task);
>  	return ret;
>  }
> @@ -407,17 +339,43 @@ int task_in_mem_cgroup(struct task_struc
>  /*
>   * This routine assumes that the appropriate zone's lru lock is already held
>   */
> -void mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
> +void mem_cgroup_move_lists(struct page *page, bool active)
>  {
> +	struct page_cgroup *pc;
>  	struct mem_cgroup_per_zone *mz;
>  	unsigned long flags;
> 
> -	if (!pc)
> +	/*
> +	 * We cannot lock_page_cgroup while holding zone's lru_lock,
> +	 * because other holders of lock_page_cgroup can be interrupted
> +	 * with an attempt to rotate_reclaimable_page.
> +	 *
> +	 * Change lock_page_cgroup to an interrupt-disabling lock?
> +	 * Perhaps, but we'd prefer not.  Hold zone's lru_lock while
> +	 * uncharging?  Overhead we'd again prefer to avoid - though
> +	 * it may turn out to be just right to uncharge when finally
> +	 * removing a page from LRU; but there are probably awkward
> +	 * details to that which would need shaking down.
> +	 */
> +	if (!try_lock_page_cgroup(page))
>  		return;
> 
> -	mz = page_cgroup_zoneinfo(pc);
> +	pc = page_get_page_cgroup(page);
> +	mz = pc? page_cgroup_zoneinfo(pc): NULL;
> +	unlock_page_cgroup(page);
> +
> +	if (!mz)
> +		return;
> +
> +	/*
> +	 * The memory used for this mem_cgroup_per_zone could get
> +	 * reused before we take its lru_lock: we probably want to
> +	 * use a SLAB_DESTROY_BY_RCU kmem_cache for it.  But that's
> +	 * an unlikely race, so for now continue testing without it.
> +	 */
>  	spin_lock_irqsave(&mz->lru_lock, flags);
> -	__mem_cgroup_move_lists(pc, active);
> +	if (page_get_page_cgroup(page) == pc)
> +		__mem_cgroup_move_lists(pc, active);
>  	spin_unlock_irqrestore(&mz->lru_lock, flags);
>  }
> 
> @@ -437,6 +395,7 @@ int mem_cgroup_calc_mapped_ratio(struct 
>  	rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
>  	return (int)((rss * 100L) / total);
>  }
> +
>  /*
>   * This function is called from vmscan.c. In page reclaiming loop. balance
>   * between active and inactive list is calculated. For memory controller
> @@ -500,7 +459,6 @@ long mem_cgroup_calc_reclaim_inactive(st
>  	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
> 
>  	nr_inactive = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE);
> -
>  	return (nr_inactive >> priority);
>  }
> 
> @@ -534,7 +492,6 @@ unsigned long mem_cgroup_isolate_pages(u
>  		if (scan >= nr_to_scan)
>  			break;
>  		page = pc->page;
> -		VM_BUG_ON(!pc);
> 
>  		if (unlikely(!PageLRU(page)))
>  			continue;
> @@ -575,9 +532,11 @@ static int mem_cgroup_charge_common(stru
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> +	struct page_cgroup *new_pc = NULL;
>  	unsigned long flags;
>  	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct mem_cgroup_per_zone *mz;
> +	int error;
> 
>  	/*
>  	 * Should page_cgroup's go to their own slab?
> @@ -586,31 +545,20 @@ static int mem_cgroup_charge_common(stru
>  	 * to see if the cgroup page already has a page_cgroup associated
>  	 * with it
>  	 */
> -retry:
> +
>  	if (page) {
> +		error = 0;
>  		lock_page_cgroup(page);
>  		pc = page_get_page_cgroup(page);
> -		/*
> -		 * The page_cgroup exists and
> -		 * the page has already been accounted.
> -		 */
> -		if (pc) {
> -			if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
> -				/* this page is under being uncharged ? */
> -				unlock_page_cgroup(page);
> -				cpu_relax();
> -				goto retry;
> -			} else {
> -				unlock_page_cgroup(page);
> -				goto done;
> -			}
> -		}
> +		if (pc)
> +			goto incref;
>  		unlock_page_cgroup(page);
>  	}
> 
> -	pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
> -	if (pc == NULL)
> -		goto err;
> +	error = -ENOMEM;
> +	new_pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
> +	if (!new_pc)
> +		goto done;
> 
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.
> @@ -624,16 +572,11 @@ retry:
>  	rcu_read_lock();
>  	mem = rcu_dereference(mm->mem_cgroup);
>  	/*
> -	 * For every charge from the cgroup, increment reference
> -	 * count
> +	 * For every charge from the cgroup, increment reference count
>  	 */
>  	css_get(&mem->css);
>  	rcu_read_unlock();
> 
> -	/*
> -	 * If we created the page_cgroup, we should free it on exceeding
> -	 * the cgroup limit.
> -	 */
>  	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
>  		if (!(gfp_mask & __GFP_WAIT))
>  			goto out;
> @@ -642,12 +585,12 @@ retry:
>  			continue;
> 
>  		/*
> - 		 * try_to_free_mem_cgroup_pages() might not give us a full
> - 		 * picture of reclaim. Some pages are reclaimed and might be
> - 		 * moved to swap cache or just unmapped from the cgroup.
> - 		 * Check the limit again to see if the reclaim reduced the
> - 		 * current usage of the cgroup before giving up
> - 		 */
> +		 * try_to_free_mem_cgroup_pages() might not give us a full
> +		 * picture of reclaim. Some pages are reclaimed and might be
> +		 * moved to swap cache or just unmapped from the cgroup.
> +		 * Check the limit again to see if the reclaim reduced the
> +		 * current usage of the cgroup before giving up
> +		 */
>  		if (res_counter_check_under_limit(&mem->res))
>  			continue;
> 
> @@ -658,106 +601,101 @@ retry:
>  		congestion_wait(WRITE, HZ/10);
>  	}
> 
> -	atomic_set(&pc->ref_cnt, 1);
> -	pc->mem_cgroup = mem;
> -	pc->page = page;
> -	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> +	error = 0;
> +	if (!page)
> +		goto out;
> +
> +	new_pc->ref_cnt = 1;
> +	new_pc->mem_cgroup = mem;
> +	new_pc->page = page;
> +	new_pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
>  	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
> -		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
> +		new_pc->flags |= PAGE_CGROUP_FLAG_CACHE;
> 
> -	if (!page || page_cgroup_assign_new_page_cgroup(page, pc)) {
> -		/*
> -		 * Another charge has been added to this page already.
> -		 * We take lock_page_cgroup(page) again and read
> -		 * page->cgroup, increment refcnt.... just retry is OK.
> -		 */
> -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> -		css_put(&mem->css);
> -		kfree(pc);
> -		if (!page)
> -			goto done;
> -		goto retry;
> -	}
> +	lock_page_cgroup(page);
> +	pc = page_get_page_cgroup(page);
> +	if (!pc) {
> +		page_assign_page_cgroup(page, new_pc);
> +		unlock_page_cgroup(page);
> 
> -	mz = page_cgroup_zoneinfo(pc);
> -	spin_lock_irqsave(&mz->lru_lock, flags);
> -	/* Update statistics vector */
> -	__mem_cgroup_add_list(pc);
> -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> +		mz = page_cgroup_zoneinfo(new_pc);
> +		spin_lock_irqsave(&mz->lru_lock, flags);
> +		__mem_cgroup_add_list(new_pc);
> +		spin_unlock_irqrestore(&mz->lru_lock, flags);
> +		goto done;
> +	}
> 
> -done:
> -	return 0;
> +incref:
> +	VM_BUG_ON(pc->page != page);
> +	VM_BUG_ON(pc->ref_cnt <= 0);
> +	pc->ref_cnt++;
> +	unlock_page_cgroup(page);
>  out:
> -	css_put(&mem->css);
> -	kfree(pc);
> -err:
> -	return -ENOMEM;
> +	if (new_pc) {
> +		if (!error)
> +			res_counter_uncharge(&mem->res, PAGE_SIZE);
> +		css_put(&mem->css);
> +		kfree(new_pc);
> +	}
> +done:
> +	return error;
>  }
> 
> -int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
> -			gfp_t gfp_mask)
> +int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
>  {
>  	return mem_cgroup_charge_common(page, mm, gfp_mask,
>  			MEM_CGROUP_CHARGE_TYPE_MAPPED);
>  }
> 
> -/*
> - * See if the cached pages should be charged at all?
> - */
>  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask)
>  {
> -	int ret = 0;
>  	if (!mm)
>  		mm = &init_mm;
> -
> -	ret = mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_CACHE);
> -	return ret;
> +	return mem_cgroup_charge_common(page, mm, gfp_mask,
> +			MEM_CGROUP_CHARGE_TYPE_CACHE);
>  }
> 
>  /*
>   * Uncharging is always a welcome operation, we never complain, simply
> - * uncharge. This routine should be called with lock_page_cgroup held
> + * uncharge.
>   */
> -void mem_cgroup_uncharge(struct page_cgroup *pc)
> +void mem_cgroup_uncharge_page(struct page *page)
>  {
> +	struct page_cgroup *pc;
>  	struct mem_cgroup *mem;
>  	struct mem_cgroup_per_zone *mz;
> -	struct page *page;
>  	unsigned long flags;
> 
>  	/*
>  	 * Check if our page_cgroup is valid
>  	 */
> +	lock_page_cgroup(page);
> +	pc = page_get_page_cgroup(page);
>  	if (!pc)
> -		return;
> +		goto unlock;
> 
> -	if (atomic_dec_and_test(&pc->ref_cnt)) {
> -		page = pc->page;
> -		mz = page_cgroup_zoneinfo(pc);
> -		/*
> -		 * get page->cgroup and clear it under lock.
> -		 * force_empty can drop page->cgroup without checking refcnt.
> -		 */
> +	VM_BUG_ON(pc->page != page);
> +	VM_BUG_ON(pc->ref_cnt <= 0);
> +
> +	if (--(pc->ref_cnt) == 0) {
> +		page_assign_page_cgroup(page, NULL);
>  		unlock_page_cgroup(page);
> -		if (clear_page_cgroup(page, pc) == pc) {
> -			mem = pc->mem_cgroup;
> -			css_put(&mem->css);
> -			res_counter_uncharge(&mem->res, PAGE_SIZE);
> -			spin_lock_irqsave(&mz->lru_lock, flags);
> -			__mem_cgroup_remove_list(pc);
> -			spin_unlock_irqrestore(&mz->lru_lock, flags);
> -			kfree(pc);
> -		}
> -		lock_page_cgroup(page);
> +
> +		mem = pc->mem_cgroup;
> +		css_put(&mem->css);
> +		res_counter_uncharge(&mem->res, PAGE_SIZE);
> +
> +		mz = page_cgroup_zoneinfo(pc);
> +		spin_lock_irqsave(&mz->lru_lock, flags);
> +		__mem_cgroup_remove_list(pc);
> +		spin_unlock_irqrestore(&mz->lru_lock, flags);
> +
> +		kfree(pc);
> +		return;
>  	}
> -}
> 
> -void mem_cgroup_uncharge_page(struct page *page)
> -{
> -	lock_page_cgroup(page);
> -	mem_cgroup_uncharge(page_get_page_cgroup(page));
> +unlock:
>  	unlock_page_cgroup(page);
>  }
> 
> @@ -765,50 +703,46 @@ void mem_cgroup_uncharge_page(struct pag
>   * Returns non-zero if a page (under migration) has valid page_cgroup member.
>   * Refcnt of page_cgroup is incremented.
>   */
> -
>  int mem_cgroup_prepare_migration(struct page *page)
>  {
>  	struct page_cgroup *pc;
> -	int ret = 0;
> +
>  	lock_page_cgroup(page);
>  	pc = page_get_page_cgroup(page);
> -	if (pc && atomic_inc_not_zero(&pc->ref_cnt))
> -		ret = 1;
> +	if (pc)
> +		pc->ref_cnt++;
>  	unlock_page_cgroup(page);
> -	return ret;
> +	return pc != NULL;
>  }
> 
>  void mem_cgroup_end_migration(struct page *page)
>  {
> -	struct page_cgroup *pc;
> -
> -	lock_page_cgroup(page);
> -	pc = page_get_page_cgroup(page);
> -	mem_cgroup_uncharge(pc);
> -	unlock_page_cgroup(page);
> +	mem_cgroup_uncharge_page(page);
>  }
> +
>  /*
>   * We know both *page* and *newpage* are now not-on-LRU and Pg_locked.
>   * And no race with uncharge() routines because page_cgroup for *page*
>   * has extra one reference by mem_cgroup_prepare_migration.
>   */
> -
>  void mem_cgroup_page_migration(struct page *page, struct page *newpage)
>  {
>  	struct page_cgroup *pc;
> -	struct mem_cgroup *mem;
> -	unsigned long flags;
>  	struct mem_cgroup_per_zone *mz;
> -retry:
> +	unsigned long flags;
> +
> +	lock_page_cgroup(page);
>  	pc = page_get_page_cgroup(page);
> -	if (!pc)
> +	if (!pc) {
> +		unlock_page_cgroup(page);
>  		return;
> -	mem = pc->mem_cgroup;
> +	}
> +
> +	page_assign_page_cgroup(page, NULL);
> +	unlock_page_cgroup(page);
> +
>  	mz = page_cgroup_zoneinfo(pc);
> -	if (clear_page_cgroup(page, pc) != pc)
> -		goto retry;
>  	spin_lock_irqsave(&mz->lru_lock, flags);
> -
>  	__mem_cgroup_remove_list(pc);
>  	spin_unlock_irqrestore(&mz->lru_lock, flags);
> 
> @@ -821,7 +755,6 @@ retry:
>  	spin_lock_irqsave(&mz->lru_lock, flags);
>  	__mem_cgroup_add_list(pc);
>  	spin_unlock_irqrestore(&mz->lru_lock, flags);
> -	return;
>  }
> 
>  /*
> @@ -830,8 +763,7 @@ retry:
>   * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
>   */
>  #define FORCE_UNCHARGE_BATCH	(128)
> -static void
> -mem_cgroup_force_empty_list(struct mem_cgroup *mem,
> +static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
>  			    struct mem_cgroup_per_zone *mz,
>  			    int active)
>  {
> @@ -855,30 +787,33 @@ retry:
>  	while (--count && !list_empty(list)) {
>  		pc = list_entry(list->prev, struct page_cgroup, lru);
>  		page = pc->page;
> -		/* Avoid race with charge */
> -		atomic_set(&pc->ref_cnt, 0);
> -		if (clear_page_cgroup(page, pc) == pc) {
> +		lock_page_cgroup(page);
> +		if (page_get_page_cgroup(page) == pc) {
> +			page_assign_page_cgroup(page, NULL);
> +			unlock_page_cgroup(page);
>  			css_put(&mem->css);
>  			res_counter_uncharge(&mem->res, PAGE_SIZE);
>  			__mem_cgroup_remove_list(pc);
>  			kfree(pc);
> -		} else 	/* being uncharged ? ...do relax */
> +		} else {
> +			/* racing uncharge: let page go then retry */
> +			unlock_page_cgroup(page);
>  			break;
> +		}
>  	}
> +
>  	spin_unlock_irqrestore(&mz->lru_lock, flags);
>  	if (!list_empty(list)) {
>  		cond_resched();
>  		goto retry;
>  	}
> -	return;
>  }
> 
>  /*
>   * make mem_cgroup's charge to be 0 if there is no task.
>   * This enables deleting this mem_cgroup.
>   */
> -
> -int mem_cgroup_force_empty(struct mem_cgroup *mem)
> +static int mem_cgroup_force_empty(struct mem_cgroup *mem)
>  {
>  	int ret = -EBUSY;
>  	int node, zid;
> @@ -907,9 +842,7 @@ out:
>  	return ret;
>  }
> 
> -
> -
> -int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
> +static int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
>  {
>  	*tmp = memparse(buf, &buf);
>  	if (*buf != '\0')
> @@ -956,7 +889,6 @@ static ssize_t mem_force_empty_write(str
>  /*
>   * Note: This should be removed if cgroup supports write-only file.
>   */
> -
>  static ssize_t mem_force_empty_read(struct cgroup *cont,
>  				struct cftype *cft,
>  				struct file *file, char __user *userbuf,
> @@ -965,7 +897,6 @@ static ssize_t mem_force_empty_read(stru
>  	return -EINVAL;
>  }
> 
> -
>  static const struct mem_cgroup_stat_desc {
>  	const char *msg;
>  	u64 unit;
> @@ -1018,8 +949,6 @@ static int mem_control_stat_open(struct 
>  	return single_open(file, mem_control_stat_show, cont);
>  }
> 
> -
> -
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -1085,9 +1014,6 @@ static void free_mem_cgroup_per_zone_inf
>  	kfree(mem->info.nodeinfo[node]);
>  }
> 
> -
> -static struct mem_cgroup init_mem_cgroup;
> -
>  static struct cgroup_subsys_state *
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -1177,7 +1103,6 @@ static void mem_cgroup_move_task(struct 
> 
>  out:
>  	mmput(mm);
> -	return;
>  }
> 
>  struct cgroup_subsys mem_cgroup_subsys = {
> diff -purN 26252/mm/memory.c 26252h/mm/memory.c
> --- 26252/mm/memory.c	2008-02-15 23:43:20.000000000 +0000
> +++ 26252h/mm/memory.c	2008-02-17 10:26:22.000000000 +0000
> @@ -1711,7 +1711,7 @@ unlock:
>  	}
>  	return ret;
>  oom_free_new:
> -	__free_page(new_page);
> +	page_cache_release(new_page);
>  oom:
>  	if (old_page)
>  		page_cache_release(old_page);
> @@ -2093,12 +2093,9 @@ static int do_swap_page(struct mm_struct
>  	unlock_page(page);
> 
>  	if (write_access) {
> -		/* XXX: We could OR the do_wp_page code with this one? */
> -		if (do_wp_page(mm, vma, address,
> -				page_table, pmd, ptl, pte) & VM_FAULT_OOM) {
> -			mem_cgroup_uncharge_page(page);
> -			ret = VM_FAULT_OOM;
> -		}
> +		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
> +		if (ret & VM_FAULT_ERROR)
> +			ret &= VM_FAULT_ERROR;

I'm afraid I do not understand this change (maybe I need to look at the final
code and not the diff). We no longer uncharge the charged page here.

>  		goto out;
>  	}
> 
> @@ -2163,7 +2160,7 @@ release:
>  	page_cache_release(page);
>  	goto unlock;
>  oom_free_page:
> -	__free_page(page);
> +	page_cache_release(page);
>  oom:
>  	return VM_FAULT_OOM;
>  }
> diff -purN 26252/mm/page_alloc.c 26252h/mm/page_alloc.c
> --- 26252/mm/page_alloc.c	2008-02-11 07:18:12.000000000 +0000
> +++ 26252h/mm/page_alloc.c	2008-02-17 10:26:11.000000000 +0000
> @@ -981,6 +981,7 @@ static void free_hot_cold_page(struct pa
>  	struct per_cpu_pages *pcp;
>  	unsigned long flags;
> 
> +	VM_BUG_ON(page_get_page_cgroup(page));
>  	if (PageAnon(page))
>  		page->mapping = NULL;
>  	if (free_pages_check(page))
> @@ -988,7 +989,6 @@ static void free_hot_cold_page(struct pa
> 
>  	if (!PageHighMem(page))
>  		debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
> -	VM_BUG_ON(page_get_page_cgroup(page));
>  	arch_free_page(page, 0);
>  	kernel_map_pages(page, 1, 0);
> 
> @@ -2527,7 +2527,6 @@ void __meminit memmap_init_zone(unsigned
>  		set_page_links(page, zone, nid, pfn);
>  		init_page_count(page);
>  		reset_page_mapcount(page);
> -		page_assign_page_cgroup(page, NULL);
>  		SetPageReserved(page);
> 
>  		/*
> diff -purN 26252/mm/rmap.c 26252h/mm/rmap.c
> --- 26252/mm/rmap.c	2008-02-11 07:18:12.000000000 +0000
> +++ 26252h/mm/rmap.c	2008-02-17 10:26:22.000000000 +0000
> @@ -321,7 +321,7 @@ static int page_referenced_anon(struct p
>  		 * counting on behalf of references from different
>  		 * cgroups
>  		 */
> -		if (mem_cont && !vm_match_cgroup(vma->vm_mm, mem_cont))
> +		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
>  			continue;
>  		referenced += page_referenced_one(page, vma, &mapcount);
>  		if (!mapcount)
> @@ -382,7 +382,7 @@ static int page_referenced_file(struct p
>  		 * counting on behalf of references from different
>  		 * cgroups
>  		 */
> -		if (mem_cont && !vm_match_cgroup(vma->vm_mm, mem_cont))
> +		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
>  			continue;
>  		if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
>  				  == (VM_LOCKED|VM_MAYSHARE)) {
> diff -purN 26252/mm/swap.c 26252h/mm/swap.c
> --- 26252/mm/swap.c	2008-02-11 07:18:12.000000000 +0000
> +++ 26252h/mm/swap.c	2008-02-17 13:01:50.000000000 +0000
> @@ -176,7 +176,7 @@ void activate_page(struct page *page)
>  		SetPageActive(page);
>  		add_page_to_active_list(zone, page);
>  		__count_vm_event(PGACTIVATE);
> -		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
> +		mem_cgroup_move_lists(page, true);
>  	}
>  	spin_unlock_irq(&zone->lru_lock);
>  }
> diff -purN 26252/mm/vmscan.c 26252h/mm/vmscan.c
> --- 26252/mm/vmscan.c	2008-02-11 07:18:12.000000000 +0000
> +++ 26252h/mm/vmscan.c	2008-02-17 13:02:33.000000000 +0000
> @@ -1128,7 +1128,7 @@ static void shrink_active_list(unsigned 
>  		ClearPageActive(page);
> 
>  		list_move(&page->lru, &zone->inactive_list);
> -		mem_cgroup_move_lists(page_get_page_cgroup(page), false);
> +		mem_cgroup_move_lists(page, false);
>  		pgmoved++;
>  		if (!pagevec_add(&pvec, page)) {
>  			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
> @@ -1157,7 +1157,7 @@ static void shrink_active_list(unsigned 
>  		SetPageLRU(page);
>  		VM_BUG_ON(!PageActive(page));
>  		list_move(&page->lru, &zone->active_list);
> -		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
> +		mem_cgroup_move_lists(page, true);
>  		pgmoved++;
>  		if (!pagevec_add(&pvec, page)) {
>  			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
> 


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  4:37       ` KAMEZAWA Hiroyuki
  2008-02-20  4:39         ` Hugh Dickins
@ 2008-02-20  6:40         ` Balbir Singh
  2008-02-20  7:23           ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2008-02-20  6:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, linux-mm@kvack.org, yamamoto@valinux.co.jp,
	riel@redhat.com

KAMEZAWA Hiroyuki wrote:
> On Wed, 20 Feb 2008 04:14:58 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
>> On Wed, 20 Feb 2008, KAMEZAWA Hiroyuki wrote:
>>> On Tue, 19 Feb 2008 15:40:45 +0000 (GMT)
>>> Hugh Dickins <hugh@veritas.com> wrote:
>>>
>>>> A lot in common with yours, a lot not.  (And none of it addressing
>>>> that issue of opt-out I raise in the last paragraph: haven't begun
>>>> to go into that one, hoped you and Balbir would look into it.)
>>>>
>>> I have some trial patches for reducing atomic_ops by do_it_lazy method.
>>> Now, I'm afraid that performence is too bad when there is *no* memory
>>> pressure.
>> But it isn't just the atomic ops, it's the whole business of
>> mem_cgroup_charge_common plus mem_cgroup_uncharge_page per page.
>>
>> The existence of force_empty indicates that the system can get along
>> without the charge on the page. 
> Yes.
> 
>> What's needed, I think, is something in struct mm, a flag or a reserved value
>> in mm->mem_cgroup, to say don't do any of this mem_cgroup stuff on me; and a cgroup
>> fs interface to set that, in the same way as force_empty is done.
> 
> I agree here. I believe we need "no charge" flag at least to the root group.
> For root group, it's better to have boot option if not complicated.

I don't think that would work very well. A boot option to turn off everything
makes more sense. The reasons I say it would not work very well are:

1. You might want to control the memory used by the root cgroup. Consider a
system where all tasks are distributed across sub-cgroups and only a few
default tasks are left in the root cgroup. The administrator might still want
to control how much memory they can use.
2. We need to account for root's usage. This will become very important once we
implement a hierarchy and build in support for shares.
3. The interface remains consistent that way; treating the root as special
would make our interface inconsistent (since we cannot apply/enforce limits at
root).

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  6:27   ` Hirokazu Takahashi
@ 2008-02-20  6:50     ` KAMEZAWA Hiroyuki
  2008-02-20  8:32       ` Clean up force_empty (Was Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.) KAMEZAWA Hiroyuki
  2008-02-22  9:24       ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Hugh Dickins
  0 siblings, 2 replies; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20  6:50 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: hugh, linux-mm, balbir, yamamoto, riel

On Wed, 20 Feb 2008 15:27:53 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:

> > Unlike the unsafeties of force_empty, this is liable to hit anyone
> > running with MEM_CONT compiled in, they don't have to be consciously
> > using mem_cgroups at all.
> 
> As for force_empty, though this may not be the main topic here,
> mem_cgroup_force_empty_list() can be implemented simpler.
> It is possible to make the function just call mem_cgroup_uncharge_page()
> instead of releasing page_cgroups by itself. The tips is to call get_page()
> before invoking mem_cgroup_uncharge_page() so the page won't be released
> during this function.
> 
> Kamezawa-san, you may want look into the attached patch.
> I think you will be free from the weired complexity here.
> 
> This code can be optimized but it will be enough since this function
> isn't critical.
> 
> Thanks.
> 
> 
> Signed-off-by: Hirokazu Takahashi <taka@vallinux.co.jp>
> 
> --- mm/memcontrol.c.ORG	2008-02-12 18:44:45.000000000 +0900
> +++ mm/memcontrol.c 2008-02-20 14:23:38.000000000 +0900
> @@ -837,7 +837,7 @@ mem_cgroup_force_empty_list(struct mem_c
>  {
>  	struct page_cgroup *pc;
>  	struct page *page;
> -	int count;
> +	int count = FORCE_UNCHARGE_BATCH;
>  	unsigned long flags;
>  	struct list_head *list;
>  
> @@ -846,30 +846,21 @@ mem_cgroup_force_empty_list(struct mem_c
>  	else
>  		list = &mz->inactive_list;
>  
> -	if (list_empty(list))
> -		return;
> -retry:
> -	count = FORCE_UNCHARGE_BATCH;
>  	spin_lock_irqsave(&mz->lru_lock, flags);
> -
> -	while (--count && !list_empty(list)) {
> +	while (!list_empty(list)) {
>  		pc = list_entry(list->prev, struct page_cgroup, lru);
>  		page = pc->page;
> -		/* Avoid race with charge */
> -		atomic_set(&pc->ref_cnt, 0);
> -		if (clear_page_cgroup(page, pc) == pc) {
> -			css_put(&mem->css);
> -			res_counter_uncharge(&mem->res, PAGE_SIZE);
> -			__mem_cgroup_remove_list(pc);
> -			kfree(pc);
> -		} else 	/* being uncharged ? ...do relax */
> -			break;
> +		get_page(page);
> +		spin_unlock_irqrestore(&mz->lru_lock, flags);
> +		mem_cgroup_uncharge_page(page);
> +		put_page(page);
> +		if (--count <= 0) {
> +			count = FORCE_UNCHARGE_BATCH;
> +			cond_resched();
> +		}
> +		spin_lock_irqsave(&mz->lru_lock, flags);
>  	}
Seems simple. But isn't there a case like the following?

==in force_empty==

pc1 = list_entry(list->prev, struct page_cgroup, lru);
page = pc1->page;
get_page(page)
spin_unlock_irqrestore(&mz->lru_lock, flags)
mem_cgroup_uncharge_page(page);
	=> lock_page_cgroup(page);
		=> pc2 = page_get_page_cgroup(page);

Here, pc2 != pc1 and pc2->mem_cgroup != pc1->mem_cgroup.
We probably need some check against that.
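Something like this, perhaps (just a rough sketch; __mem_cgroup_uncharge() is
a made-up helper here, and it would have to be safe to call with
lock_page_cgroup() held):

==
	pc = list_entry(list->prev, struct page_cgroup, lru);
	page = pc->page;
	get_page(page);
	spin_unlock_irqrestore(&mz->lru_lock, flags);

	lock_page_cgroup(page);
	/* the page may have been uncharged and re-charged while unlocked */
	if (page_get_page_cgroup(page) == pc)
		__mem_cgroup_uncharge(page, pc);	/* made-up helper */
	unlock_page_cgroup(page);

	put_page(page);
	spin_lock_irqsave(&mz->lru_lock, flags);
==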

But maybe yours is a good direction.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  6:40         ` Balbir Singh
@ 2008-02-20  7:23           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20  7:23 UTC (permalink / raw)
  To: balbir
  Cc: Hugh Dickins, linux-mm@kvack.org, yamamoto@valinux.co.jp,
	riel@redhat.com

On Wed, 20 Feb 2008 12:10:53 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Wed, 20 Feb 2008 04:14:58 +0000 (GMT)
> > Hugh Dickins <hugh@veritas.com> wrote:
> > 
> >> On Wed, 20 Feb 2008, KAMEZAWA Hiroyuki wrote:
> >>> On Tue, 19 Feb 2008 15:40:45 +0000 (GMT)
> >>> Hugh Dickins <hugh@veritas.com> wrote:
> >>>
> >>>> A lot in common with yours, a lot not.  (And none of it addressing
> >>>> that issue of opt-out I raise in the last paragraph: haven't begun
> >>>> to go into that one, hoped you and Balbir would look into it.)
> >>>>
> >>> I have some trial patches for reducing atomic_ops by do_it_lazy method.
> >>> Now, I'm afraid that performence is too bad when there is *no* memory
> >>> pressure.
> >> But it isn't just the atomic ops, it's the whole business of
> >> mem_cgroup_charge_common plus mem_cgroup_uncharge_page per page.
> >>
> >> The existence of force_empty indicates that the system can get along
> >> without the charge on the page. 
> > Yes.
> > 
> >> What's needed, I think, is something in struct mm, a flag or a reserved value
> >> in mm->mem_cgroup, to say don't do any of this mem_cgroup stuff on me; and a cgroup
> >> fs interface to set that, in the same way as force_empty is done.
> > 
> > I agree here. I believe we need "no charge" flag at least to the root group.
> > For root group, it's better to have boot option if not complicated.
> 
> I don't think that would work very well. A boot option to turn off everything
> makes more sense. The reason why I say it would not work very well is
> 
Turning everything off is OK for me.

In that case, we have to take care of the following:
1. disable mounting of the memory controller
2. hide "memory" from /proc/cgroup

At least, it seems we can agree on an "everything turned off" option.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Clean up force_empty (Was Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.)
  2008-02-20  6:50     ` KAMEZAWA Hiroyuki
@ 2008-02-20  8:32       ` KAMEZAWA Hiroyuki
  2008-02-20 10:07         ` Clean up force_empty Hirokazu Takahashi
  2008-02-22  9:24       ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Hugh Dickins
  1 sibling, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20  8:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hirokazu Takahashi, hugh, linux-mm, balbir, yamamoto, riel

How about this? I tested Takahashi's version and added comments.
I like this, but it's also okay to just wait and revisit this later.
-Kame
=

Clean up force_empty.

This patch makes force_empty no longer a special case.

The old code used a customized freeing loop; this one uses mem_cgroup_uncharge()
on each page.

Signed-off-by: Hirokazu Takahashi <taka@vallinux.co.jp>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 mm/memcontrol.c |   39 +++++++++++++++++++--------------------
 1 file changed, 19 insertions(+), 20 deletions(-)

Index: linux-2.6.25-rc2/mm/memcontrol.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/memcontrol.c
+++ linux-2.6.25-rc2/mm/memcontrol.c
@@ -837,7 +837,7 @@ mem_cgroup_force_empty_list(struct mem_c
 {
 	struct page_cgroup *pc;
 	struct page *page;
-	int count;
+	int count = FORCE_UNCHARGE_BATCH;
 	unsigned long flags;
 	struct list_head *list;
 
@@ -846,30 +846,29 @@ mem_cgroup_force_empty_list(struct mem_c
 	else
 		list = &mz->inactive_list;
 
-	if (list_empty(list))
-		return;
-retry:
-	count = FORCE_UNCHARGE_BATCH;
 	spin_lock_irqsave(&mz->lru_lock, flags);
-
-	while (--count && !list_empty(list)) {
+	while (!list_empty(list)) {
 		pc = list_entry(list->prev, struct page_cgroup, lru);
 		page = pc->page;
-		/* Avoid race with charge */
-		atomic_set(&pc->ref_cnt, 0);
-		if (clear_page_cgroup(page, pc) == pc) {
-			css_put(&mem->css);
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			__mem_cgroup_remove_list(pc);
-			kfree(pc);
-		} else 	/* being uncharged ? ...do relax */
-			break;
+		get_page(page);
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+		lock_page_cgroup(page);
+		/* Because we released lock, we have to chack the page still
+		   points this pc. */
+		if (page_get_page_cgroup(page) == pc)
+			mem_cgroup_uncharge(pc);
+		unlock_page_cgroup(page);
+
+		put_page(page);
+
+		if (--count == 0) {
+			count = FORCE_UNCHARGE_BATCH;
+			cond_resched();
+		}
+		spin_lock_irqsave(&mz->lru_lock, flags);
 	}
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-	if (!list_empty(list)) {
-		cond_resched();
-		goto retry;
-	}
 	return;
 }
 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  5:57   ` Balbir Singh
@ 2008-02-20  9:58     ` Hirokazu Takahashi
  2008-02-20 10:06       ` Paul Menage
  2008-02-20 11:36       ` Balbir Singh
  0 siblings, 2 replies; 63+ messages in thread
From: Hirokazu Takahashi @ 2008-02-20  9:58 UTC (permalink / raw)
  To: balbir; +Cc: hugh, kamezawa.hiroyu, linux-mm, yamamoto, riel

Hi,

> >> I'd like to start from RFC.
> >>
> >> In following code
> >> ==
> >>   lock_page_cgroup(page);
> >>   pc = page_get_page_cgroup(page);
> >>   unlock_page_cgroup(page);
> >>
> >>   access 'pc' later..
> >> == (See, page_cgroup_move_lists())
> >>
> >> There is a race because 'pc' is not a stable value without lock_page_cgroup().
> >> (mem_cgroup_uncharge can free this 'pc').
> >>
> >> For example, page_cgroup_move_lists() access pc without lock.
> >> There is a small race window, between page_cgroup_move_lists()
> >> and mem_cgroup_uncharge(). At uncharge, page_cgroup struct is immedieately
> >> freed but move_list can access it after taking lru_lock.
> >> (*) mem_cgroup_uncharge_page() can be called without zone->lru lock.
> >>
> >> This is not good manner.
> >> .....
> >> There is no quick fix (maybe). Moreover, I hear some people around me said
> >> current memcontrol.c codes are very complicated.
> >> I agree ;( ..it's caued by my work.
> >>
> >> I'd like to fix problems in clean way.
> >> (Note: current -rc2 codes works well under heavy pressure. but there
> >>  is possibility of race, I think.)
> > 
> > Yes, yes, indeed, I've been working away on this too.
> > 
> > Ever since the VM_BUG_ON(page_get_page_cgroup(page)) went into
> > free_hot_cold_page (at my own prompting), I've been hitting it
> > just very occasionally in my kernel build testing.  Was unable
> > to reproduce it over the New Year, but a week or two ago found
> > one machine and config on which it is relatively reproducible,
> > pretty sure to happen within 12 hours.
> > 
> > And on Saturday evening at last identified the cause, exactly
> > where you have: that unsafety in mem_cgroup_move_lists - which
> > has the nice property of putting pages from the lru on to SLUB's
> > freelist!
> > 
> > Unlike the unsafeties of force_empty, this is liable to hit anyone
> > running with MEM_CONT compiled in, they don't have to be consciously
> > using mem_cgroups at all.
> > 
> > (I consider that, by the way, quite a serious defect in the current
> > mem_cgroup work: that a distro compiling it in for 1% of customers
> > is then subjecting all to the mem_cgroup overhead - effectively
> > doubled struct page size and unnecessary accounting overhead.  I
> > believe there needs to be a way to opt out, a force_empty which
> > sticks.  Yes, I know the page_cgroup which does that doubling of
> > size is only allocated on demand, but every page cache page and
> > every anonymous page is going to have one.  A kmem_cache for them
> > will reduce the extra, but there still needs to be a way to opt
> > out completely.)
> > 
> 
> I've been thinking along these lines as well
> 
> 1. Have a boot option to turn on/off the memory controller

It will be much more convenient if the memory controller can be turned on/off
on demand. I think you can turn it off if there aren't any mem_cgroups except
the root mem_cgroup.

> 2. Have a separate cache for the page_cgroup structure. I sent this suggestion
>    out just yesterday or so.

I think the policy that every page_cgroup should be dynamically allocated and
assigned to the proper page every time is causing some overhead and spinlock
contention.

What do you think about allocating all page_cgroups and assigning them to all
the pages when the memory controller gets turned on? That would allow you to
remove most of the spinlocks.

And you may also have a chance to remove the page->page_cgroup member if you
allocate an array of page_cgroups and attach it to the zone the pages belong
to, as in the diagram and the sketch below.

               zone
    page[]    +----+    page_cgroup[]
    +----+<----    ---->+----+
    |    |    |    |    |    |
    +----+    |    |    +----+
    |    |    +----+    |    |
    +----+              +----+
    |    |              |    |
    +----+              +----+
    |    |              |    |
    +----+              +----+
    |    |              |    |
    +----+              +----+
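
A very rough sketch of the lookup this layout would allow (page_cgroup_base is
a made-up per-zone array, allocated when the controller is turned on; nothing
of this exists yet):

	/* sketch: find a page's page_cgroup without touching struct page */
	static inline struct page_cgroup *lookup_page_cgroup(struct page *page)
	{
		struct zone *zone = page_zone(page);

		/* page_cgroup_base[] has one entry per page in the zone */
		return zone->page_cgroup_base +
			(page_to_pfn(page) - zone->zone_start_pfn);
	}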


> I agree that these are necessary enhancements/changes.

Thank you,
Hirokazu Takahashi.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  9:58     ` Hirokazu Takahashi
@ 2008-02-20 10:06       ` Paul Menage
  2008-02-20 10:11         ` Balbir Singh
  2008-02-20 11:36       ` Balbir Singh
  1 sibling, 1 reply; 63+ messages in thread
From: Paul Menage @ 2008-02-20 10:06 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: balbir, hugh, kamezawa.hiroyu, linux-mm, yamamoto, riel

On Feb 20, 2008 1:58 AM, Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> >
> > 1. Have a boot option to turn on/off the memory controller
>
> It will be much convenient if the memory controller can be turned on/off on
> demand. I think you can turn it off if there aren't any mem_cgroups except
> the root mem_cgroup,

Or possibly turned on when the memory controller is bound to a
non-default hierarchy, and off when it's unbound?

Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Clean up force_empty
  2008-02-20  8:32       ` Clean up force_empty (Was Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.) KAMEZAWA Hiroyuki
@ 2008-02-20 10:07         ` Hirokazu Takahashi
  0 siblings, 0 replies; 63+ messages in thread
From: Hirokazu Takahashi @ 2008-02-20 10:07 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: hugh, linux-mm, balbir, yamamoto, riel

Hi,

> How about this ?

Looks good but I found a typo in the comment.

> +		lock_page_cgroup(page);
> +		/* Because we released lock, we have to chack the page still
						        ^^^^^
							check
> +		   points this pc. */

Thank you,
Hirokazu Takahashi.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 10:06       ` Paul Menage
@ 2008-02-20 10:11         ` Balbir Singh
  2008-02-20 10:18           ` Paul Menage
  0 siblings, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2008-02-20 10:11 UTC (permalink / raw)
  To: Paul Menage
  Cc: Hirokazu Takahashi, hugh, kamezawa.hiroyu, linux-mm, yamamoto,
	riel

Paul Menage wrote:
> On Feb 20, 2008 1:58 AM, Hirokazu Takahashi <taka@valinux.co.jp> wrote:
>>> 1. Have a boot option to turn on/off the memory controller
>> It will be much convenient if the memory controller can be turned on/off on
>> demand. I think you can turn it off if there aren't any mem_cgroups except
>> the root mem_cgroup,
> 
> Or possibly turned on when the memory controller is bound to a
> non-default hierarchy, and off when it's unbound?
> 

Dynamically turning the memory controller on/off can/will lead to accounting
issues and deficiencies, since the memory controller would then have no idea of
how much memory has been allocated by which cgroup.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 10:11         ` Balbir Singh
@ 2008-02-20 10:18           ` Paul Menage
  2008-02-20 10:55             ` Balbir Singh
  0 siblings, 1 reply; 63+ messages in thread
From: Paul Menage @ 2008-02-20 10:18 UTC (permalink / raw)
  To: balbir; +Cc: Hirokazu Takahashi, hugh, kamezawa.hiroyu, linux-mm, yamamoto,
	riel

On Feb 20, 2008 2:11 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> Dynamically turning on/off the memory controller, can/will lead to accounting
> issues and deficiencies, since the memory controller would now have no idea of
> how much memory has been allocated by which cgroup.
>

A cgroups subsystem can only be unbound from its hierarchy when there
are no child cgroups of the root cgroup in that hierarchy. So this
shouldn't be too much of a problem - when this transition occurs, all
tasks are in the same group, and no other groups exist.

Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 10:18           ` Paul Menage
@ 2008-02-20 10:55             ` Balbir Singh
  2008-02-20 11:21               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2008-02-20 10:55 UTC (permalink / raw)
  To: Paul Menage
  Cc: Hirokazu Takahashi, hugh, kamezawa.hiroyu, linux-mm, yamamoto,
	riel

Paul Menage wrote:
> On Feb 20, 2008 2:11 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> Dynamically turning on/off the memory controller, can/will lead to accounting
>> issues and deficiencies, since the memory controller would now have no idea of
>> how much memory has been allocated by which cgroup.
>>
> 
> A cgroups subsystem can only be unbound from its hierarchy when there
> are no child cgroups of the root cgroup in that hierarchy. So this
> shouldn't be too much of a problem - when this transition occurs, all
> tasks are in the same group, and no other groups exist.
> 
> Paul

Yes, I agree, but then at the point of unbinding, tasks could already have
allocated several pages to their RSS or brought pages into the page cache.
Accounting from this state is not so straightforward and will lead to more
complexity in the code.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  6:38     ` Balbir Singh
@ 2008-02-20 11:00       ` Hugh Dickins
  2008-02-20 11:32         ` Balbir Singh
  0 siblings, 1 reply; 63+ messages in thread
From: Hugh Dickins @ 2008-02-20 11:00 UTC (permalink / raw)
  To: Balbir Singh; +Cc: KAMEZAWA Hiroyuki, linux-mm, yamamoto, riel

On Wed, 20 Feb 2008, Balbir Singh wrote:
> 
> The changes look good and clean overall. I'll apply the patch, test it.

Thanks, yes, it's fine for applying as a patch for testing;
just don't send it up the line until I've split and commented it.

> I have
> some review comments below. I'll review it again to check for locking issues
...
> 
> > -void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
> > +static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
> >  {
> > -	int locked;
> > -
> > -	/*
> > -	 * While resetting the page_cgroup we might not hold the
> > -	 * page_cgroup lock. free_hot_cold_page() is an example
> > -	 * of such a scenario
> > -	 */
> > -	if (pc)
> > -		VM_BUG_ON(!page_cgroup_locked(page));
> > -	locked = (page->page_cgroup & PAGE_CGROUP_LOCK);
> > -	page->page_cgroup = ((unsigned long)pc | locked);
> > +	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
> 
> We are explicitly setting the PAGE_CGROUP_LOCK bit, shouldn't we keep the
> VM_BUG_ON(!page_cgroup_locked(page))?

Could do, but it seemed quite unnecessary to me now that it's a static
function with the obvious rule everywhere that you call it holding lock,
no matter whether pc is or isn't NULL.  If somewhere in memcontrol.c
did call it without holding the lock, it'd also have to bizarrely
remember to unlock while forgetting to lock, for it to escape notice.

(I did say earlier that I was reversing making it static, but that
didn't work out so well: ended up adding a specific page_reset_bad_cgroup
inline in memcontrol.h, just for the bad_page case.)

> > @@ -2093,12 +2093,9 @@ static int do_swap_page(struct mm_struct
> >  	unlock_page(page);
> > 
> >  	if (write_access) {
> > -		/* XXX: We could OR the do_wp_page code with this one? */
> > -		if (do_wp_page(mm, vma, address,
> > -				page_table, pmd, ptl, pte) & VM_FAULT_OOM) {
> > -			mem_cgroup_uncharge_page(page);
> > -			ret = VM_FAULT_OOM;
> > -		}
> > +		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
> > +		if (ret & VM_FAULT_ERROR)
> > +			ret &= VM_FAULT_ERROR;
> 
> I am afraid, I do not understand this change (may be I need to look at the final
> code and not the diff). We no longer uncharge the charged page here.

The page that was charged is there in the pagetable, and will be
uncharged as usual when that area is unmapped.  What has failed here
is just the COWing of that page.  You could argue that we should ignore
the retval from do_wp_page and return our own retval: I hesitated over
that, but since we skip do_swap_page's update_mmu_cache here, it seems
conceivable that some architecture might loop endlessly if we claimed
success when do_wp_page has skipped it too.

This is of course an example of why I didn't post the patch originally,
just when Kame asked for a copy for testing: it badly needs the split
and comments.  You're brave to be reviewing it at all - thanks!

Hugh


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 11:21               ` KAMEZAWA Hiroyuki
@ 2008-02-20 11:18                 ` Balbir Singh
  2008-02-20 11:32                   ` KAMEZAWA Hiroyuki
  2008-02-20 11:41                   ` Paul Menage
  0 siblings, 2 replies; 63+ messages in thread
From: Balbir Singh @ 2008-02-20 11:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Paul Menage, Hirokazu Takahashi, hugh, linux-mm, yamamoto, riel

KAMEZAWA Hiroyuki wrote:
> On Wed, 20 Feb 2008 16:25:00 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> Paul Menage wrote:
>>> On Feb 20, 2008 2:11 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>> Dynamically turning on/off the memory controller, can/will lead to accounting
>>>> issues and deficiencies, since the memory controller would now have no idea of
>>>> how much memory has been allocated by which cgroup.
>>>>
>>> A cgroups subsystem can only be unbound from its hierarchy when there
>>> are no child cgroups of the root cgroup in that hierarchy. So this
>>> shouldn't be too much of a problem - when this transition occurs, all
>>> tasks are in the same group, and no other groups exist.
>>>
>>> Paul
>> Yes, I agree, but then at the point of unbinding them, tasks could have already
>> allocated several pages to their RSS or brought in pages into the page cache.
>> Accounting from this state is not so straight forward and will lead to more
>> complexity in code.
> 
> unbind -> force_empty can't work ?
> 
> Thanks,
> -Kame
> 

Kame, unbind->force_empty can work, but we can't force_empty the root cgroup.
Even if we could, the code to deal with turning on/off the entire memory
controller and accounting is likely to be very complex and probably racy.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 10:55             ` Balbir Singh
@ 2008-02-20 11:21               ` KAMEZAWA Hiroyuki
  2008-02-20 11:18                 ` Balbir Singh
  0 siblings, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20 11:21 UTC (permalink / raw)
  To: balbir; +Cc: Paul Menage, Hirokazu Takahashi, hugh, linux-mm, yamamoto, riel

On Wed, 20 Feb 2008 16:25:00 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Paul Menage wrote:
> > On Feb 20, 2008 2:11 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> Dynamically turning on/off the memory controller, can/will lead to accounting
> >> issues and deficiencies, since the memory controller would now have no idea of
> >> how much memory has been allocated by which cgroup.
> >>
> > 
> > A cgroups subsystem can only be unbound from its hierarchy when there
> > are no child cgroups of the root cgroup in that hierarchy. So this
> > shouldn't be too much of a problem - when this transition occurs, all
> > tasks are in the same group, and no other groups exist.
> > 
> > Paul
> 
> Yes, I agree, but then at the point of unbinding them, tasks could have already
> allocated several pages to their RSS or brought in pages into the page cache.
> Accounting from this state is not so straight forward and will lead to more
> complexity in code.

unbind -> force_empty can't work ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 11:18                 ` Balbir Singh
@ 2008-02-20 11:32                   ` KAMEZAWA Hiroyuki
  2008-02-20 11:34                     ` Balbir Singh
  2008-02-20 11:41                   ` Paul Menage
  1 sibling, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-20 11:32 UTC (permalink / raw)
  To: balbir; +Cc: Paul Menage, Hirokazu Takahashi, hugh, linux-mm, yamamoto, riel

On Wed, 20 Feb 2008 16:48:10 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> Kame, unbind->force_empty can work, but we can't force_empty the root cgroup.
> Even if we could, the code to deal with turning on/off the entire memory
> controller and accounting is likely to be very complex and probably racy.
> 
Ok, just put it on my far-future-to-do list.
(we have other things to do now ;)

But a boot option for turning off the entire (memory) controller, even if it is
configured in, will be a good thing.

like..
   cgroup_subsys_disable_mask = ...
or
   memory_controller=off|on

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 11:00       ` Hugh Dickins
@ 2008-02-20 11:32         ` Balbir Singh
  2008-02-20 14:19           ` Hugh Dickins
  0 siblings, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2008-02-20 11:32 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: KAMEZAWA Hiroyuki, linux-mm, yamamoto, riel

Hugh Dickins wrote:
> On Wed, 20 Feb 2008, Balbir Singh wrote:
>> The changes look good and clean overall. I'll apply the patch, test it.
> 
> Thanks, yes, it's fine for applying as a patch for testing;
> just don't send it up the line until I've split and commented it.
> 

Absolutely! It's too big and does too many things to be sent upstream as one patch.

>> I have
>> some review comments below. I'll review it again to check for locking issues
> ...
>>> -void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
>>> +static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
>>>  {
>>> -	int locked;
>>> -
>>> -	/*
>>> -	 * While resetting the page_cgroup we might not hold the
>>> -	 * page_cgroup lock. free_hot_cold_page() is an example
>>> -	 * of such a scenario
>>> -	 */
>>> -	if (pc)
>>> -		VM_BUG_ON(!page_cgroup_locked(page));
>>> -	locked = (page->page_cgroup & PAGE_CGROUP_LOCK);
>>> -	page->page_cgroup = ((unsigned long)pc | locked);
>>> +	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
>> We are explicitly setting the PAGE_CGROUP_LOCK bit, shouldn't we keep the
>> VM_BUG_ON(!page_cgroup_locked(page))?
> 
> Could do, but it seemed quite unnecessary to me now that it's a static
> function with the obvious rule everywhere that you call it holding lock,
> no matter whether pc is or isn't NULL.  If somewhere in memcontrol.c
> did call it without holding the lock, it'd also have to bizarrely
> remember to unlock while forgetting to lock, for it to escape notice.
> 

Yes, you are, as always, of course right. I was thinking of future uses of the
function. Someone could use it; a VM_BUG_ON will help.

> (I did say earlier that I was reversing making it static, but that
> didn't work out so well: ended up adding a specific page_reset_bad_cgroup
> inline in memcontrol.h, just for the bad_page case.)
> 
>>> @@ -2093,12 +2093,9 @@ static int do_swap_page(struct mm_struct
>>>  	unlock_page(page);
>>>
>>>  	if (write_access) {
>>> -		/* XXX: We could OR the do_wp_page code with this one? */
>>> -		if (do_wp_page(mm, vma, address,
>>> -				page_table, pmd, ptl, pte) & VM_FAULT_OOM) {
>>> -			mem_cgroup_uncharge_page(page);
>>> -			ret = VM_FAULT_OOM;
>>> -		}
>>> +		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
>>> +		if (ret & VM_FAULT_ERROR)
>>> +			ret &= VM_FAULT_ERROR;
>> I am afraid, I do not understand this change (may be I need to look at the final
>> code and not the diff). We no longer uncharge the charged page here.
> 
> The page that was charged is there in the pagetable, and will be
> uncharged as usual when that area is unmapped.  What has failed here
> is just the COWing of that page.  You could argue that we should ignore
> the retval from do_wp_page and return our own retval: I hesitated over
> that, but since we skip do_swap_page's update_mmu_cache here, it seems
> conceivable that some architecture might loop endlessly if we claimed
> success when do_wp_page has skipped it too.
> 

That sounds very reasonable

> This is of course an example of why I didn't post the patch originally,
> just when Kame asked for a copy for testing: it badly needs the split
> and comments.  You're brave to be reviewing it at all - thanks!
> 

I think posting the patch is helpful; we can test it in more places.
Thanks for posting your patch.

> Hugh


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 11:32                   ` KAMEZAWA Hiroyuki
@ 2008-02-20 11:34                     ` Balbir Singh
  2008-02-20 11:44                       ` Paul Menage
  0 siblings, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2008-02-20 11:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Paul Menage, Hirokazu Takahashi, hugh, linux-mm, yamamoto, riel

KAMEZAWA Hiroyuki wrote:
> On Wed, 20 Feb 2008 16:48:10 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> Kame, unbind->force_empty can work, but we can't force_empty the root cgroup.
>> Even if we could, the code to deal with turning on/off the entire memory
>> controller and accounting is likely to be very complex and probably racy.
>>
> Ok, just put it on my far-future-to-do-list.
> (we have another things to do now ;)
> 

Yes, too many things queued up. Avoiding the race is the primary one.

> But a boot option for turning off entire (memory) controller even if it is
> configured will be a good thing.
> 
> like..
>    cgroup_subsys_disable_mask = ...

I like this very much. This way, we get control over all controllers.

> or
>    memory_controller=off|on
> 

This can also be done, provided we don't do what has been suggested above.


> Thanks,
> -Kame


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  9:58     ` Hirokazu Takahashi
  2008-02-20 10:06       ` Paul Menage
@ 2008-02-20 11:36       ` Balbir Singh
  2008-02-20 11:55         ` Paul Menage
  2008-02-21  2:49         ` Hirokazu Takahashi
  1 sibling, 2 replies; 63+ messages in thread
From: Balbir Singh @ 2008-02-20 11:36 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: hugh, kamezawa.hiroyu, linux-mm, yamamoto, riel

Hirokazu Takahashi wrote:
> Hi,
> 
>>>> I'd like to start from RFC.
>>>>
>>>> In following code
>>>> ==
>>>>   lock_page_cgroup(page);
>>>>   pc = page_get_page_cgroup(page);
>>>>   unlock_page_cgroup(page);
>>>>
>>>>   access 'pc' later..
>>>> == (See, page_cgroup_move_lists())
>>>>
>>>> There is a race because 'pc' is not a stable value without lock_page_cgroup().
>>>> (mem_cgroup_uncharge can free this 'pc').
>>>>
>>>> For example, page_cgroup_move_lists() access pc without lock.
>>>> There is a small race window, between page_cgroup_move_lists()
>>>> and mem_cgroup_uncharge(). At uncharge, page_cgroup struct is immedieately
>>>> freed but move_list can access it after taking lru_lock.
>>>> (*) mem_cgroup_uncharge_page() can be called without zone->lru lock.
>>>>
>>>> This is not good manner.
>>>> .....
>>>> There is no quick fix (maybe). Moreover, I hear some people around me said
>>>> current memcontrol.c codes are very complicated.
>>>> I agree ;( ..it's caued by my work.
>>>>
>>>> I'd like to fix problems in clean way.
>>>> (Note: current -rc2 codes works well under heavy pressure. but there
>>>>  is possibility of race, I think.)
>>> Yes, yes, indeed, I've been working away on this too.
>>>
>>> Ever since the VM_BUG_ON(page_get_page_cgroup(page)) went into
>>> free_hot_cold_page (at my own prompting), I've been hitting it
>>> just very occasionally in my kernel build testing.  Was unable
>>> to reproduce it over the New Year, but a week or two ago found
>>> one machine and config on which it is relatively reproducible,
>>> pretty sure to happen within 12 hours.
>>>
>>> And on Saturday evening at last identified the cause, exactly
>>> where you have: that unsafety in mem_cgroup_move_lists - which
>>> has the nice property of putting pages from the lru on to SLUB's
>>> freelist!
>>>
>>> Unlike the unsafeties of force_empty, this is liable to hit anyone
>>> running with MEM_CONT compiled in, they don't have to be consciously
>>> using mem_cgroups at all.
>>>
>>> (I consider that, by the way, quite a serious defect in the current
>>> mem_cgroup work: that a distro compiling it in for 1% of customers
>>> is then subjecting all to the mem_cgroup overhead - effectively
>>> doubled struct page size and unnecessary accounting overhead.  I
>>> believe there needs to be a way to opt out, a force_empty which
>>> sticks.  Yes, I know the page_cgroup which does that doubling of
>>> size is only allocated on demand, but every page cache page and
>>> every anonymous page is going to have one.  A kmem_cache for them
>>> will reduce the extra, but there still needs to be a way to opt
>>> out completely.)
>>>
>> I've been thinking along these lines as well
>>
>> 1. Have a boot option to turn on/off the memory controller
> 
> It will be much convenient if the memory controller can be turned on/off on
> demand. I think you can turn it off if there aren't any mem_cgroups except
> the root mem_cgroup, 
> 
>> 2. Have a separate cache for the page_cgroup structure. I sent this suggestion
>>    out just yesterday or so.
> 
> I think the policy that every mem_cgroup should be dynamically allocated and
> assigned to the proper page every time is causing some overheads and spinlock
> contentions.
> 
> What do you think if you allocate all page_cgroups and assign to all the pages
> when the memory controller gets turned on, which will allow you to remove
> most of the spinlocks.
> 
> And you may possibly have a chance to remove page->page_cgroup member
> if you allocate array of page_cgroups and attach them to the zone which
> the pages belong to.
> 

We thought of this as well. We dropped it, because we need to track only user
pages at the moment. Doing it for all pages means having the overhead for each
page on the system.

>                zone
>     page[]    +----+    page_cgroup[]
>     +----+<----    ---->+----+
>     |    |    |    |    |    |
>     +----+    |    |    +----+
>     |    |    +----+    |    |
>     +----+              +----+
>     |    |              |    |
>     +----+              +----+
>     |    |              |    |
>     +----+              +----+
>     |    |              |    |
>     +----+              +----+
> 
> 
>> I agree that these are necessary enhancements/changes.
> 
> Thank you,
> Hirokazu Takahashi.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 11:18                 ` Balbir Singh
  2008-02-20 11:32                   ` KAMEZAWA Hiroyuki
@ 2008-02-20 11:41                   ` Paul Menage
  1 sibling, 0 replies; 63+ messages in thread
From: Paul Menage @ 2008-02-20 11:41 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, Hirokazu Takahashi, hugh, linux-mm, yamamoto,
	riel

On Feb 20, 2008 3:18 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> Kame, unbind->force_empty can work, but we can't force_empty the root cgroup.
> Even if we could, the code to deal with turning on/off the entire memory
> controller and accounting is likely to be very complex and probably racy.
>

How about just being able to turn it on, but not turn it off again?
i.e. at the first cgroups bind() event for the memory controller,
memory accounting is enabled.
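
Very roughly, something like this (only a sketch; mem_cgroup_accounting is an
invented flag, and I'm assuming the subsystem's bind() callback is a suitable
hook):

	static int mem_cgroup_accounting __read_mostly;	/* invented flag */

	static void mem_cgroup_bind(struct cgroup_subsys *ss, struct cgroup *root)
	{
		/* the first bind to a hierarchy switches accounting on for good */
		mem_cgroup_accounting = 1;
	}

and then mem_cgroup_charge_common() would begin with something like:

	if (!mem_cgroup_accounting)
		return 0;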

Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 11:34                     ` Balbir Singh
@ 2008-02-20 11:44                       ` Paul Menage
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Menage @ 2008-02-20 11:44 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, Hirokazu Takahashi, hugh, linux-mm, yamamoto,
	riel

On Feb 20, 2008 3:34 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > like..
> >    cgroup_subsys_disable_mask = ...
>
> I like this very much. This way, we get control over all controllers.
>

We'd want to do it by name, rather than by mask, since the ids depend
on what's compiled in to the kernel.

We could have a (possibly repeated) boot option such as
cgroup_disable=memory (or other subsystem). This would set a flag in
the appropriate subsystem indicating that it was disabled, and make
the subsystem not mountable, etc.
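
Roughly like this (just a sketch; the 'disabled' field doesn't exist yet and
would have to be added to struct cgroup_subsys):

	static int __init cgroup_disable(char *str)
	{
		char *token;
		int i;

		while ((token = strsep(&str, ",")) != NULL) {
			if (!*token)
				continue;
			for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
				struct cgroup_subsys *ss = subsys[i];

				if (!strcmp(token, ss->name)) {
					ss->disabled = 1;	/* new field */
					break;
				}
			}
		}
		return 1;
	}
	__setup("cgroup_disable=", cgroup_disable);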

Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 11:36       ` Balbir Singh
@ 2008-02-20 11:55         ` Paul Menage
  2008-02-21  2:49         ` Hirokazu Takahashi
  1 sibling, 0 replies; 63+ messages in thread
From: Paul Menage @ 2008-02-20 11:55 UTC (permalink / raw)
  To: balbir; +Cc: Hirokazu Takahashi, hugh, kamezawa.hiroyu, linux-mm, yamamoto,
	riel

On Feb 20, 2008 3:36 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > And you may possibly have a chance to remove page->page_cgroup member
> > if you allocate array of page_cgroups and attach them to the zone which
> > the pages belong to.
> >
>
> We thought of this as well. We dropped it, because we need to track only user
> pages at the moment. Doing it for all pages means having the overhead for each
> page on the system.
>

While having an array of page_cgroup objects may or may not be a good
idea, I'm not sure that the overhead argument against them is a very
good one.

I suspect that on most systems that want to use the cgroup memory
controller, user-allocated pages will fill the vast majority of
memory. So using the arrays and eliminating the extra pointer in
struct page would actually reduce overhead.
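
Rough numbers to make that concrete (assuming 64-bit, 4KB pages, and a
page_cgroup of very roughly 40 bytes, which is only an estimate): the
page->page_cgroup pointer alone costs 8/4096, about 0.2% of RAM, for every
page whether or not it is ever charged; a statically allocated per-page
page_cgroup would be about 40/4096, roughly 1% of RAM, but it replaces both
that pointer and the per-page kmalloc with its allocator overhead, and on such
systems the dynamic scheme ends up allocating nearly one page_cgroup per page
anyway.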

Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 11:32         ` Balbir Singh
@ 2008-02-20 14:19           ` Hugh Dickins
  0 siblings, 0 replies; 63+ messages in thread
From: Hugh Dickins @ 2008-02-20 14:19 UTC (permalink / raw)
  To: Balbir Singh; +Cc: KAMEZAWA Hiroyuki, linux-mm, yamamoto, riel

On Wed, 20 Feb 2008, Balbir Singh wrote:
> Hugh Dickins wrote:
> > On Wed, 20 Feb 2008, Balbir Singh wrote:
> >>> -	if (pc)
> >>> -		VM_BUG_ON(!page_cgroup_locked(page));
> >>> -	locked = (page->page_cgroup & PAGE_CGROUP_LOCK);
> >>> -	page->page_cgroup = ((unsigned long)pc | locked);
> >>> +	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
> >> We are explicitly setting the PAGE_CGROUP_LOCK bit, shouldn't we keep the
> >> VM_BUG_ON(!page_cgroup_locked(page))?
> > 
> > Could do, but it seemed quite unnecessary to me now that it's a static
> > function with the obvious rule everywhere that you call it holding lock,
> > no matter whether pc is or isn't NULL.  If somewhere in memcontrol.c
> > did call it without holding the lock, it'd also have to bizarrely
> > remember to unlock while forgetting to lock, for it to escape notice.
> 
> Yes, you are as always of-course right. I was thinking of future uses of the
> function. Some one could use it, a VM_BUG_ON will help.

In looking to reinstate that VM_BUG_ON, I notice that actually I'm
strictly wrong to be forcing the lock bit on there - because on UP
without DEBUG_SPINLOCK, the bit_spin_lock/unlock wouldn't use it,
so it'd be rather untidy to put it there and leave it in the assign.
I think the answer will be to #define PAGE_CGROUP_LOCK 0 for that case.
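
Probably something like this (sketch):

	/*
	 * bit_spin_lock() only touches the bit on SMP or with DEBUG_SPINLOCK,
	 * so don't pretend the lock bit is set in page->page_cgroup otherwise.
	 */
	#define PAGE_CGROUP_LOCK_BIT	0x0
	#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
	#define PAGE_CGROUP_LOCK	(1 << PAGE_CGROUP_LOCK_BIT)
	#else
	#define PAGE_CGROUP_LOCK	0x0
	#endif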

Hugh


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20 11:36       ` Balbir Singh
  2008-02-20 11:55         ` Paul Menage
@ 2008-02-21  2:49         ` Hirokazu Takahashi
  2008-02-21  6:35           ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 63+ messages in thread
From: Hirokazu Takahashi @ 2008-02-21  2:49 UTC (permalink / raw)
  To: balbir; +Cc: hugh, kamezawa.hiroyu, linux-mm, yamamoto, riel

Hi,

> >> I've been thinking along these lines as well
> >>
> >> 1. Have a boot option to turn on/off the memory controller
> > 
> > It will be much convenient if the memory controller can be turned on/off on
> > demand. I think you can turn it off if there aren't any mem_cgroups except
> > the root mem_cgroup, 
> > 
> >> 2. Have a separate cache for the page_cgroup structure. I sent this suggestion
> >>    out just yesterday or so.
> > 
> > I think the policy that every mem_cgroup should be dynamically allocated and
> > assigned to the proper page every time is causing some overheads and spinlock
> > contentions.
> > 
> > What do you think if you allocate all page_cgroups and assign to all the pages
> > when the memory controller gets turned on, which will allow you to remove
> > most of the spinlocks.
> > 
> > And you may possibly have a chance to remove page->page_cgroup member
> > if you allocate array of page_cgroups and attach them to the zone which
> > the pages belong to.
> > 
> 
> We thought of this as well. We dropped it, because we need to track only user
> pages at the moment. Doing it for all pages means having the overhead for each
> page on the system.

Let me clarify: the overhead you mean is that you'll waste some memory on
page_cgroups for pages assigned to kernel internal use, right?
If so, it wouldn't be a big problem, since most pages are assigned to
process anonymous memory or to the page cache, as Paul said.

Paul> I suspect that on most systems that want to use the cgroup memory
Paul> controller, user-allocated pages will fill the vast majority of
Paul> memory. So using the arrays and eliminating the extra pointer in
Paul> struct page would actually reduce overhead.

> >                zone
> >     page[]    +----+    page_cgroup[]
> >     +----+<----    ---->+----+
> >     |    |    |    |    |    |
> >     +----+    |    |    +----+
> >     |    |    +----+    |    |
> >     +----+              +----+
> >     |    |              |    |
> >     +----+              +----+
> >     |    |              |    |
> >     +----+              +----+
> >     |    |              |    |
> >     +----+              +----+
> > 
> > 
> >> I agree that these are necessary enhancements/changes.

Thank you,
Hirokazu Takahashi.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-21  2:49         ` Hirokazu Takahashi
@ 2008-02-21  6:35           ` KAMEZAWA Hiroyuki
  2008-02-21  9:07             ` Hirokazu Takahashi
  0 siblings, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-21  6:35 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: balbir, hugh, linux-mm, yamamoto, riel

On Thu, 21 Feb 2008 11:49:29 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:

> > We thought of this as well. We dropped it, because we need to track only user
> > pages at the moment. Doing it for all pages means having the overhead for each
> > page on the system.
> 
> Let me clarify that the overhead you said is you'll waste some memory
> whose pages are assigned for the kernel internal use, right?
> If so, it wouldn't be a big problem since most of the pages are assigned to
> process anonymous memory or to the page cache as Paul said.
> 
My idea is: it would be a big waste of memory to pre-allocate all page_cgroup
structs at boot, because the following two will not need them:
    (1) kernel memory
    (2) HugeTLB memory
Mainly because of (2), I don't like pre-allocation.

But we'll be able to achieve a pfn <-> page_cgroup relationship using an
on-demand memmap style.
(Someone mentioned using a radix-tree in another thread.)
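
(For illustration, a rough sketch of what such an on-demand
pfn <-> page_cgroup lookup could look like with a radix-tree.  The names
pcg_tree, lookup_page_cgroup() and insert_page_cgroup() are hypothetical,
not from any posted patch; lookups run under RCU and only insertion takes
a spinlock.)

#include <linux/radix-tree.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/mm.h>

struct page_cgroup;

/* Hypothetical on-demand pfn -> page_cgroup mapping. */
static RADIX_TREE(pcg_tree, GFP_ATOMIC);
static DEFINE_SPINLOCK(pcg_tree_lock);

static struct page_cgroup *lookup_page_cgroup(struct page *page)
{
        struct page_cgroup *pc;

        rcu_read_lock();
        pc = radix_tree_lookup(&pcg_tree, page_to_pfn(page));
        rcu_read_unlock();
        return pc;
}

static int insert_page_cgroup(struct page *page, struct page_cgroup *pc)
{
        int err;

        /* Preload so the tree-node allocation can sleep outside the lock. */
        err = radix_tree_preload(GFP_KERNEL);
        if (err)
                return err;
        spin_lock(&pcg_tree_lock);
        err = radix_tree_insert(&pcg_tree, page_to_pfn(page), pc);
        spin_unlock(&pcg_tree_lock);
        radix_tree_preload_end();
        return err;
}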

Balbir-san, I'd like to do some work around this because I have experience
with sparsemem and memory hotplug development.

Or have you already started?

Thanks,
-Kame



* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-21  6:35           ` KAMEZAWA Hiroyuki
@ 2008-02-21  9:07             ` Hirokazu Takahashi
  2008-02-21  9:21               ` KAMEZAWA Hiroyuki
  2008-02-21  9:25               ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Balbir Singh
  0 siblings, 2 replies; 63+ messages in thread
From: Hirokazu Takahashi @ 2008-02-21  9:07 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, hugh, linux-mm, yamamoto, riel

Hi,

> > > We thought of this as well. We dropped it, because we need to track only user
> > > pages at the moment. Doing it for all pages means having the overhead for each
> > > page on the system.
> > 
> > Let me clarify that the overhead you said is you'll waste some memory
> > whose pages are assigned for the kernel internal use, right?
> > If so, it wouldn't be a big problem since most of the pages are assigned to
> > process anonymous memory or to the page cache as Paul said.
> > 
> My idea is..
> (1) It will be big waste of memory to pre-allocate all page_cgroup struct at
>     boot.  Because following two will not need it.
>     (1) kernel memory
>     (2) HugeTLB memory
> Mainly because of (2), I don't like pre-allocation.

I thought kernel memory wasn't a big deal, but I hadn't thought about HugeTLB memory.

> But we'll be able to archive  pfn <-> page_cgroup relationship using
> on-demand memmap style.
> (Someone mentioned about using radix-tree in other thread.)

My concern is that this approach seems to require some spinlocks to protect
the radix-tree. If you really don't want to allocate page_cgroups for HugeTLB
memory, what do you think about turning on the memory controller only after
the HugeTLB pages have been allocated?

> Balbir-san, I'd like to do some work aroung this becasue I've experience
> sparsemem and memory hotplug developments.
> 
> Or have you already started ?

Not yet. So you can go ahead.

> Thanks,
> -Kame

Thank you,
Hirokazu Takahashi.


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-21  9:07             ` Hirokazu Takahashi
@ 2008-02-21  9:21               ` KAMEZAWA Hiroyuki
  2008-02-21  9:28                 ` Balbir Singh
  2008-02-21  9:25               ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Balbir Singh
  1 sibling, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-21  9:21 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: balbir, hugh, linux-mm, yamamoto, riel

On Thu, 21 Feb 2008 18:07:45 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> > But we'll be able to archive  pfn <-> page_cgroup relationship using
> > on-demand memmap style.
> > (Someone mentioned about using radix-tree in other thread.)
> 
> My concern is this approach seems to require some spinlocks to protect the
> radix-tree. 

Unlike the file cache, radix-tree entries are not changed frequently, so we
have a chance to cache recently used values in a per-cpu area to avoid the
radix-tree lock.

But yes, I'm very much afraid of lock contention. I'll find another lock-less
way if necessary. One idea is to map each area like sparsemem_vmemmap on
64-bit systems. Now I'm convinced that it will be complicated ;)

I'd like to start with the easy way and measure the performance.
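
(Again only a rough sketch of the per-cpu caching idea, assuming the
hypothetical pcg_tree from the earlier sketch; invalidation on uncharge is
deliberately omitted, so this is incomplete, just the shape of it.)

#include <linux/percpu.h>
#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
#include <linux/mm.h>

struct page_cgroup;

/* Hypothetical pfn -> page_cgroup radix tree, as in the earlier sketch. */
extern struct radix_tree_root pcg_tree;

struct pcg_cache {
        unsigned long pfn;
        struct page_cgroup *pc;
};
static DEFINE_PER_CPU(struct pcg_cache, pcg_cache);

/*
 * Remember the last pfn/page_cgroup pair seen on this CPU so that
 * repeated lookups of the same page skip the radix tree entirely.
 */
static struct page_cgroup *cached_lookup_page_cgroup(struct page *page)
{
        unsigned long pfn = page_to_pfn(page);
        struct pcg_cache *cache;
        struct page_cgroup *pc;

        cache = &get_cpu_var(pcg_cache);
        pc = (cache->pfn == pfn) ? cache->pc : NULL;
        put_cpu_var(pcg_cache);
        if (pc)
                return pc;

        rcu_read_lock();
        pc = radix_tree_lookup(&pcg_tree, pfn);
        rcu_read_unlock();

        if (pc) {
                cache = &get_cpu_var(pcg_cache);
                cache->pfn = pfn;
                cache->pc = pc;
                put_cpu_var(pcg_cache);
        }
        return pc;
}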

Thanks,
-Kame


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-21  9:07             ` Hirokazu Takahashi
  2008-02-21  9:21               ` KAMEZAWA Hiroyuki
@ 2008-02-21  9:25               ` Balbir Singh
  1 sibling, 0 replies; 63+ messages in thread
From: Balbir Singh @ 2008-02-21  9:25 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: kamezawa.hiroyu, hugh, linux-mm, yamamoto, riel

Hirokazu Takahashi wrote:
> Hi,
> 
>>>> We thought of this as well. We dropped it, because we need to track only user
>>>> pages at the moment. Doing it for all pages means having the overhead for each
>>>> page on the system.
>>> Let me clarify that the overhead you said is you'll waste some memory
>>> whose pages are assigned for the kernel internal use, right?
>>> If so, it wouldn't be a big problem since most of the pages are assigned to
>>> process anonymous memory or to the page cache as Paul said.
>>>
>> My idea is..
>> (1) It will be big waste of memory to pre-allocate all page_cgroup struct at
>>     boot.  Because following two will not need it.
>>     (1) kernel memory
>>     (2) HugeTLB memory
>> Mainly because of (2), I don't like pre-allocation.
> 
> I thought kernel memory wasn't big deal but I didn't think about HugeTLB memory.
> 

Yes, true. I don't like pre-allocation either, but people have complained
about the increase in the size of struct page. I am not committing to
pre-allocation, but I promised I would look at bringing the overhead down.
Nick suggested using a radix tree.

>> But we'll be able to archive  pfn <-> page_cgroup relationship using
>> on-demand memmap style.
>> (Someone mentioned about using radix-tree in other thread.)
> 
> My concern is this approach seems to require some spinlocks to protect the
> radix-tree. If you really don't want to allocate page_cgroups for HugeTLB
> memory, what do you think if you should turn on the memory controller after
> allocating HugeTlb pages?
> 

Yes, I expressed that concern on IRC as well. We'll have to measure and see
the performance impact. I was told, from experience, that the overhead would
be lost in the noise (which I find hard to believe).

>> Balbir-san, I'd like to do some work aroung this becasue I've experience
>> sparsemem and memory hotplug developments.
>>
>> Or have you already started ?
> 
> Not yet. So you can go ahead.
> 

Yes, please feel free to do so.

>> Thanks,
>> -Kame
> 
> Thank you,
> Hirokazu Takahashi.
> 


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-21  9:21               ` KAMEZAWA Hiroyuki
@ 2008-02-21  9:28                 ` Balbir Singh
  2008-02-21  9:44                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2008-02-21  9:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Hirokazu Takahashi, hugh, linux-mm, yamamoto, riel

KAMEZAWA Hiroyuki wrote:
> On Thu, 21 Feb 2008 18:07:45 +0900 (JST)
> Hirokazu Takahashi <taka@valinux.co.jp> wrote:
>>> But we'll be able to archive  pfn <-> page_cgroup relationship using
>>> on-demand memmap style.
>>> (Someone mentioned about using radix-tree in other thread.)
>> My concern is this approach seems to require some spinlocks to protect the
>> radix-tree. 
> 
> Unlike file-cache, radix-tree enries are not frequently changed.
> Then we have a chance to cache recently used value to per_cpu area for avoiding
> radix_tree lock.
> 
> But yes. I'm afraid of lock contention very much. I'll find another lock-less way
> if necessary. One idea is map each area like sparsemem_vmemmap for 64bit systems.
> Now, I'm convinced that it will be complicated ;)
> 

The radix tree base is lockless (it uses RCU), so we might have a partial
solution to the locking problem. But it's uncharted territory, so no one knows.

> I'd like to start from easy way and see performance.
> 

Sure, please keep me in the loop as well.

> Thanks,
> -Kame


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-21  9:28                 ` Balbir Singh
@ 2008-02-21  9:44                   ` KAMEZAWA Hiroyuki
  2008-02-22  3:31                     ` [RFC] Block I/O Cgroup Hirokazu Takahashi
  0 siblings, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-21  9:44 UTC (permalink / raw)
  To: balbir; +Cc: Hirokazu Takahashi, hugh, linux-mm, yamamoto, riel

On Thu, 21 Feb 2008 14:58:24 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > But yes. I'm afraid of lock contention very much. I'll find another lock-less way
> > if necessary. One idea is map each area like sparsemem_vmemmap for 64bit systems.
> > Now, I'm convinced that it will be complicated ;)
> > 
> 
> The radix tree base is lockless (it uses RCU), so we might have a partial
> solution to the locking problem. But it's unchartered territory, so no one knows.
> 
> > I'd like to start from easy way and see performance.
> > 
> 
> Sure, please keep me in the loop as well.
> 
Okay, I'll do my best.

Thanks,
-Kame


* [RFC] Block I/O Cgroup
  2008-02-21  9:44                   ` KAMEZAWA Hiroyuki
@ 2008-02-22  3:31                     ` Hirokazu Takahashi
  2008-02-22  5:05                       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 63+ messages in thread
From: Hirokazu Takahashi @ 2008-02-22  3:31 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, hugh, linux-mm, yamamoto, riel

Hi,

It'll be great if you could make this feature --- the page to mem_cgroup
mapping mechanism --- generic, which will make it easy to implement a Block
I/O controller. With this feature, you can easily determine the origin cgroup
from the page that is about to start I/O.

 mem_cgroup        block_io_cgroup
     ^                 ^       |
     |                 |       |
     |                 |       |
     +------->page<----+       V
                ^          io_context
                |            ^
                |            |
               bio ----------+

Every page should be associated with the proper block_io_cgroup when the need
arises. The simplest way is to do it at the same point as mem_cgroup does.
It's also possible to do this when the pages get dirtied.
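
(A rough sketch of the sort of thing I mean, assuming the mapping were
exported generically; struct blkio_cgroup and page_to_blkio_cgroup() are
made-up names for illustration, not an existing interface.)

#include <linux/bio.h>
#include <linux/mm.h>

struct blkio_cgroup;

/* Hypothetical generic mapping, exported by the (not yet existing) core. */
extern struct blkio_cgroup *page_to_blkio_cgroup(struct page *page);

/*
 * Find the origin cgroup of a bio from the page that is about to start
 * I/O.  The first page of the bio identifies who originally dirtied
 * the data.
 */
static struct blkio_cgroup *bio_to_blkio_cgroup(struct bio *bio)
{
        struct page *page;

        if (!bio->bi_vcnt)
                return NULL;

        page = bio_page(bio);
        return page_to_blkio_cgroup(page);
}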

> > > But yes. I'm afraid of lock contention very much. I'll find another lock-less way
> > > if necessary. One idea is map each area like sparsemem_vmemmap for 64bit systems.
> > > Now, I'm convinced that it will be complicated ;)
> > > 
> > 
> > The radix tree base is lockless (it uses RCU), so we might have a partial
> > solution to the locking problem. But it's unchartered territory, so no one knows.
> > 
> > > I'd like to start from easy way and see performance.
> > > 
> > 
> > Sure, please keep me in the loop as well.
> > 
> Okay, I'll do my best.
> 
> Thanks,
> -Kame


Thank you,
Hirokazu Takahashi.


* Re: [RFC] Block I/O Cgroup
  2008-02-22  3:31                     ` [RFC] Block I/O Cgroup Hirokazu Takahashi
@ 2008-02-22  5:05                       ` KAMEZAWA Hiroyuki
  2008-02-22  5:45                         ` Hirokazu Takahashi
  0 siblings, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-22  5:05 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: balbir, hugh, linux-mm, yamamoto, riel

On Fri, 22 Feb 2008 12:31:00 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:

> Hi,
> 
> It'll be great if you make the feature --- page to mem_cgroup mapping
> mechanism --- generic, which will make it easy to implement a Block I/O
> controller. With this feature, you can easily determine the origin cgroup
> from the page which is going to start I/O.
> 
>  mem_cgroup        block_io_cgroup
>      ^                 ^       |
>      |                 |       |
>      |                 |       |
>      +------->page<----+       V
>                 ^          io_context
>                 |            ^
>                 |            |
>                bio ----------+
> 
> Every page should be associated with the proper block_io_cgroup when
> the need arises. The simplest way is to make it at the same point as
> mem_cgroups does. It's also possible to do this when the pages get dirtied.
> 

Just exporting a page_to_page_cgroup() interface will be okay.
But then it seems your I/O controller cannot be used if the memory resource
controller is not available.
Is that acceptable?

Thanks,
-Kame


* Re: [RFC] Block I/O Cgroup
  2008-02-22  5:05                       ` KAMEZAWA Hiroyuki
@ 2008-02-22  5:45                         ` Hirokazu Takahashi
  0 siblings, 0 replies; 63+ messages in thread
From: Hirokazu Takahashi @ 2008-02-22  5:45 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, hugh, linux-mm, yamamoto, riel

Hi,

> > It'll be great if you make the feature --- page to mem_cgroup mapping
> > mechanism --- generic, which will make it easy to implement a Block I/O
> > controller. With this feature, you can easily determine the origin cgroup
> > from the page which is going to start I/O.
> > 
> >  mem_cgroup        block_io_cgroup
> >      ^                 ^       |
> >      |                 |       |
> >      |                 |       |
> >      +------->page<----+       V
> >                 ^          io_context
> >                 |            ^
> >                 |            |
> >                bio ----------+
> > 
> > Every page should be associated with the proper block_io_cgroup when
> > the need arises. The simplest way is to make it at the same point as
> > mem_cgroups does. It's also possible to do this when the pages get dirtied.
> > 
> 
> Just exporiting interface for page_to_page_cgroup() interface will be ok.
> But, then, it seems your io controller cannot be used if memory resource
> controller is not available.
> That's acceptable ?

I think you can split the current code into two parts, the generic part and
the mem_cgroup-dependent part. It's okay if the generic part is enabled when
other cgroups such as a "block io controller" want to use it.
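
(For example, the split could be gated roughly like this.  The names
CGROUP_PAGE_TRACKING and CONFIG_CGROUP_BLKIO are invented for illustration;
only CONFIG_CGROUP_MEM_RES_CTLR exists today.)

struct page;
struct page_cgroup;

/*
 * Hypothetical split: a generic page -> cgroup tracking core that is
 * built whenever any client controller needs it, independent of the
 * memory controller itself.
 */
#if defined(CONFIG_CGROUP_MEM_RES_CTLR) || defined(CONFIG_CGROUP_BLKIO)
#define CGROUP_PAGE_TRACKING 1
#endif

#ifdef CGROUP_PAGE_TRACKING
/* real implementation, shared by mem_cgroup and a block I/O controller */
extern struct page_cgroup *page_get_page_cgroup(struct page *page);
#else
/* compiled away when no controller needs page tracking */
static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
{
        return NULL;
}
#endif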


Thank you,
Hirokazu Takahashi.


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-20  6:50     ` KAMEZAWA Hiroyuki
  2008-02-20  8:32       ` Clean up force_empty (Was Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.) KAMEZAWA Hiroyuki
@ 2008-02-22  9:24       ` Hugh Dickins
  2008-02-22 10:07         ` KAMEZAWA Hiroyuki
                           ` (2 more replies)
  1 sibling, 3 replies; 63+ messages in thread
From: Hugh Dickins @ 2008-02-22  9:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Hirokazu Takahashi, linux-mm, balbir, yamamoto, riel

On Wed, 20 Feb 2008, KAMEZAWA Hiroyuki wrote:
> On Wed, 20 Feb 2008 15:27:53 +0900 (JST)
> Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> 
> > > Unlike the unsafeties of force_empty, this is liable to hit anyone
> > > running with MEM_CONT compiled in, they don't have to be consciously
> > > using mem_cgroups at all.
> > 
> > As for force_empty, though this may not be the main topic here,
> > mem_cgroup_force_empty_list() can be implemented simpler.
> > It is possible to make the function just call mem_cgroup_uncharge_page()
> > instead of releasing page_cgroups by itself. The tips is to call get_page()
> > before invoking mem_cgroup_uncharge_page() so the page won't be released
> > during this function.
> > 
> > Kamezawa-san, you may want look into the attached patch.
> > I think you will be free from the weired complexity here.
> > 
> > This code can be optimized but it will be enough since this function
> > isn't critical.
> > 
> > Thanks.
> > 
> > 
> > Signed-off-by: Hirokazu Takahashi <taka@vallinux.co.jp>

Hirokazu-san, may I change that to <taka@valinux.co.jp>?

>...
>
> Seems simple. But isn't there following case ?
> 
> ==in force_empty==
> 
> pc1 = list_entry(list->prev, struct page_cgroup, lru);
> page = pc1->page;
> get_page(page)
> spin_unlock_irqrestore(&mz->lru_lock, flags)
> mem_cgroup_uncharge_page(page);
> 	=> lock_page_cgroup(page);
> 		=> pc2 = page_get_page_cgroup(page);
> 
> Here, pc2 != pc1 and pc2->mem_cgroup != pc1->mem_cgroup.
> maybe need some check.
> 
> But maybe yours is good direction.

I like Hirokazu-san's approach very much.

Although I eventually completed the locking for my mem_cgroup_move_lists
(SLAB_DESTROY_BY_RCU didn't help there, actually, because it left a
possibility that the same page_cgroup got reused for the same page
but a different mem_cgroup: in which case we got the wrong spinlock),
his reversal in force_empty lets us use straightforward locking in
mem_cgroup_move_lists (though it still has to try_lock_page_cgroup).
So I want to take Hirokazu-san's patch into my bugfix and cleanup
series, where it's testing out fine so far.
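
(Roughly, the shape of that reversal is as below; this is only my paraphrase,
not Hirokazu-san's actual patch, and it omits the retry and termination
handling the real thing needs if a page_cgroup cannot be uncharged.)

static void mem_cgroup_force_empty_list(struct mem_cgroup_per_zone *mz,
                                        struct list_head *list)
{
        struct page_cgroup *pc;
        struct page *page;
        unsigned long flags;

        spin_lock_irqsave(&mz->lru_lock, flags);
        while (!list_empty(list)) {
                pc = list_entry(list->prev, struct page_cgroup, lru);
                page = pc->page;
                get_page(page);         /* keep the page alive across the unlock */
                spin_unlock_irqrestore(&mz->lru_lock, flags);

                mem_cgroup_uncharge_page(page); /* does the real uncharging */
                put_page(page);

                spin_lock_irqsave(&mz->lru_lock, flags);
        }
        spin_unlock_irqrestore(&mz->lru_lock, flags);
}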

Regarding your point above, Kamezawa-san: you're surely right that can
happen, but is it a case that we actually need to avoid?  Aren't we
entitled to take the page out of pc2->mem_cgroup there, because if any
such race occurs, it could easily have happened the other way around,
removing the page from pc1->mem_cgroup just after pc2->mem_cgroup
touched it, so ending up with that page in neither?

I'd just prefer not to handle it as you did in your patch, because
earlier in my series I'd removed the mem_cgroup_uncharge level (which
just gets in the way, requiring a silly lock_page_cgroup at the end
just to match the unlock_page_cgroup at the mem_cgroup_uncharge_page
level), and don't much want to add it back in.

While we're thinking of races...

It seemed to me that mem_cgroup_uncharge should be doing its css_put
after its __mem_cgroup_remove_list: doesn't doing it before leave open
a slight danger that the struct mem_cgroup could be freed before the
remove_list?  Perhaps there's some other refcounting that makes that
impossible, but I've felt safer shifting those around.

Hugh


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-22  9:24       ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Hugh Dickins
@ 2008-02-22 10:07         ` KAMEZAWA Hiroyuki
  2008-02-22 10:25           ` Hugh Dickins
  2008-02-22 10:50         ` Hirokazu Takahashi
  2008-02-22 11:14         ` Balbir Singh
  2 siblings, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-22 10:07 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Hirokazu Takahashi, linux-mm, balbir, yamamoto, riel

On Fri, 22 Feb 2008 09:24:36 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> On Wed, 20 Feb 2008, KAMEZAWA Hiroyuki wrote:
> > On Wed, 20 Feb 2008 15:27:53 +0900 (JST)
> > Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> > 
> > > > Unlike the unsafeties of force_empty, this is liable to hit anyone
> > > > running with MEM_CONT compiled in, they don't have to be consciously
> > > > using mem_cgroups at all.
> > > 
> > > As for force_empty, though this may not be the main topic here,
> > > mem_cgroup_force_empty_list() can be implemented simpler.
> > > It is possible to make the function just call mem_cgroup_uncharge_page()
> > > instead of releasing page_cgroups by itself. The tips is to call get_page()
> > > before invoking mem_cgroup_uncharge_page() so the page won't be released
> > > during this function.
> > > 
> > > Kamezawa-san, you may want look into the attached patch.
> > > I think you will be free from the weired complexity here.
> > > 
> > > This code can be optimized but it will be enough since this function
> > > isn't critical.
> > > 
> > > Thanks.
> > > 
> > > 
> > > Signed-off-by: Hirokazu Takahashi <taka@vallinux.co.jp>
> 
> Hirokazu-san, may I change that to <taka@valinux.co.jp>?
> 
> >...
> >
> > Seems simple. But isn't there following case ?
> > 
> > ==in force_empty==
> > 
> > pc1 = list_entry(list->prev, struct page_cgroup, lru);
> > page = pc1->page;
> > get_page(page)
> > spin_unlock_irqrestore(&mz->lru_lock, flags)
> > mem_cgroup_uncharge_page(page);
> > 	=> lock_page_cgroup(page);
> > 		=> pc2 = page_get_page_cgroup(page);
> > 
> > Here, pc2 != pc1 and pc2->mem_cgroup != pc1->mem_cgroup.
> > maybe need some check.
> > 
> > But maybe yours is good direction.
> 
> I like Hirokazu-san's approach very much.
> 
me, too.

> It seemed to me that mem_cgroup_uncharge should be doing its css_put
> after its __mem_cgroup_remove_list: doesn't doing it before leave open
> a slight danger that the struct mem_cgroup could be freed before the
> remove_list?  Perhaps there's some other refcounting that makes that
> impossible, but I've felt safer shifting those around.
> 
Sigh, it's very complicated. An idea that comes to me now is to disallow
uncharge while force_empty is running and to use Takahashi-san's method.
It would not be so complicated.

Thanks,
-Kame


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-22 10:07         ` KAMEZAWA Hiroyuki
@ 2008-02-22 10:25           ` Hugh Dickins
  2008-02-22 10:34             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 63+ messages in thread
From: Hugh Dickins @ 2008-02-22 10:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Hirokazu Takahashi, linux-mm, balbir, yamamoto, riel

On Fri, 22 Feb 2008, KAMEZAWA Hiroyuki wrote:
> On Fri, 22 Feb 2008 09:24:36 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> > It seemed to me that mem_cgroup_uncharge should be doing its css_put
> > after its __mem_cgroup_remove_list: doesn't doing it before leave open
> > a slight danger that the struct mem_cgroup could be freed before the
> > remove_list?  Perhaps there's some other refcounting that makes that
> > impossible, but I've felt safer shifting those around.
> > 
> Sigh, it's very complicated. An idea which comes to me now is disallowing
> uncharge while force_empty is running and use Takahashi-san's method.
> It will be not so complicated.

Really?  I'd expect disallowing something to add to the complication.
I agree it's all rather subtle, but I'd rather it worked naturally
with itself than we bolt on extra prohibitions.  (I was frustrated
by the EBUSY failure of force_empty, so doing my testing with that
commented out, forcing empty with concurrent activity.)

And I'm not clear whether you're saying I'm wrong to move down that
css_put, for complicated reasons that you've not explained; or that
I'm right, and this is another example of how easy it is to get it
slightly wrong.  Please clarify!

Thanks,
Hugh


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-22 10:25           ` Hugh Dickins
@ 2008-02-22 10:34             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-22 10:34 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Hirokazu Takahashi, linux-mm, balbir, yamamoto, riel

On Fri, 22 Feb 2008 10:25:56 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> > Sigh, it's very complicated. An idea which comes to me now is disallowing
> > uncharge while force_empty is running and use Takahashi-san's method.
> > It will be not so complicated.
> 
> Really?  I'd expect disallowing something to add to the complication.
> I agree it's all rather subtle, but I'd rather it worked naturally
> with itself than we bolt on extra prohibitions.  (I was frustrated
> by the EBUSY failure of force_empty, so doing my testing with that
> commented out, forcing empty with concurrent activity.)
> 
> And I'm not clear whether you're saying I'm wrong to move down that
> css_put, for complicated reasons that you've not explained; or that
> I'm right, and this is another example of how easy it is to get it
> slightly wrong.  Please clarify!
> 
Sorry, all the messy things "I added" for force_empty are complicated ;(
(sorry for the noise)

And you're right that doing css_put() only after remove_list has succeeded is
a good idea, I think.


Thanks,
-Kame


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-22  9:24       ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Hugh Dickins
  2008-02-22 10:07         ` KAMEZAWA Hiroyuki
@ 2008-02-22 10:50         ` Hirokazu Takahashi
  2008-02-22 11:14         ` Balbir Singh
  2 siblings, 0 replies; 63+ messages in thread
From: Hirokazu Takahashi @ 2008-02-22 10:50 UTC (permalink / raw)
  To: hugh; +Cc: kamezawa.hiroyu, linux-mm, balbir, yamamoto, riel

Hi,

> > > > Unlike the unsafeties of force_empty, this is liable to hit anyone
> > > > running with MEM_CONT compiled in, they don't have to be consciously
> > > > using mem_cgroups at all.
> > > 
> > > As for force_empty, though this may not be the main topic here,
> > > mem_cgroup_force_empty_list() can be implemented simpler.
> > > It is possible to make the function just call mem_cgroup_uncharge_page()
> > > instead of releasing page_cgroups by itself. The tips is to call get_page()
> > > before invoking mem_cgroup_uncharge_page() so the page won't be released
> > > during this function.
> > > 
> > > Kamezawa-san, you may want look into the attached patch.
> > > I think you will be free from the weired complexity here.
> > > 
> > > This code can be optimized but it will be enough since this function
> > > isn't critical.
> > > 
> > > Thanks.
> > > 
> > > 
> > > Signed-off-by: Hirokazu Takahashi <taka@vallinux.co.jp>
> 
> Hirokazu-san, may I change that to <taka@valinux.co.jp>?

Oops! You can change it.


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-22  9:24       ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Hugh Dickins
  2008-02-22 10:07         ` KAMEZAWA Hiroyuki
  2008-02-22 10:50         ` Hirokazu Takahashi
@ 2008-02-22 11:14         ` Balbir Singh
  2008-02-22 12:00           ` Hugh Dickins
  2 siblings, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2008-02-22 11:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, Hirokazu Takahashi, linux-mm, yamamoto, riel

Hugh Dickins wrote:
> On Wed, 20 Feb 2008, KAMEZAWA Hiroyuki wrote:
>> On Wed, 20 Feb 2008 15:27:53 +0900 (JST)
>> Hirokazu Takahashi <taka@valinux.co.jp> wrote:
>>
>>>> Unlike the unsafeties of force_empty, this is liable to hit anyone
>>>> running with MEM_CONT compiled in, they don't have to be consciously
>>>> using mem_cgroups at all.
>>> As for force_empty, though this may not be the main topic here,
>>> mem_cgroup_force_empty_list() can be implemented simpler.
>>> It is possible to make the function just call mem_cgroup_uncharge_page()
>>> instead of releasing page_cgroups by itself. The tips is to call get_page()
>>> before invoking mem_cgroup_uncharge_page() so the page won't be released
>>> during this function.
>>>
>>> Kamezawa-san, you may want look into the attached patch.
>>> I think you will be free from the weired complexity here.
>>>
>>> This code can be optimized but it will be enough since this function
>>> isn't critical.
>>>
>>> Thanks.
>>>
>>>
>>> Signed-off-by: Hirokazu Takahashi <taka@vallinux.co.jp>
> 
> Hirokazu-san, may I change that to <taka@valinux.co.jp>?
> 
>> ...
>>
>> Seems simple. But isn't there following case ?
>>
>> ==in force_empty==
>>
>> pc1 = list_entry(list->prev, struct page_cgroup, lru);
>> page = pc1->page;
>> get_page(page)
>> spin_unlock_irqrestore(&mz->lru_lock, flags)
>> mem_cgroup_uncharge_page(page);
>> 	=> lock_page_cgroup(page);
>> 		=> pc2 = page_get_page_cgroup(page);
>>
>> Here, pc2 != pc1 and pc2->mem_cgroup != pc1->mem_cgroup.
>> maybe need some check.
>>
>> But maybe yours is good direction.
> 
> I like Hirokazu-san's approach very much.
> 
> Although I eventually completed the locking for my mem_cgroup_move_lists
> (SLAB_DESTROY_BY_RCU didn't help there, actually, because it left a
> possibility that the same page_cgroup got reused for the same page
> but a different mem_cgroup: in which case we got the wrong spinlock),
> his reversal in force_empty lets us use straightforward locking in
> mem_cgroup_move_lists (though it still has to try_lock_page_cgroup).
> So I want to take Hirokazu-san's patch into my bugfix and cleanup
> series, where it's testing out fine so far.
> 
> Regarding your point above, Kamezawa-san: you're surely right that can
> happen, but is it a case that we actually need to avoid?  Aren't we
> entitled to take the page out of pc2->mem_cgroup there, because if any
> such race occurs, it could easily have happened the other way around,
> removing the page from pc1->mem_cgroup just after pc2->mem_cgroup
> touched it, so ending up with that page in neither?
> 
> I'd just prefer not to handle it as you did in your patch, because
> earlier in my series I'd removed the mem_cgroup_uncharge level (which
> just gets in the way, requiring a silly lock_page_cgroup at the end
> just to match the unlock_page_cgroup at the mem_cgroup_uncharge_page
> level), and don't much want to add it back in.
> 

I've been looking through the code time and again, looking for races. I will
try to build a sketch of all the functions and dependencies tonight. One thing
that struck me was that making page_get_page_cgroup() call lock_page_cgroup()
internally might potentially fix a lot of racy call sites. I was thinking of
splitting page_get_page_cgroup() into __page_get_page_cgroup(), which just
gets the pc without the lock, and page_get_page_cgroup(), which takes the lock
and then returns the pc.

Of course, this is just a thought process. I have yet to write the code and
look at the results.
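
(A minimal sketch of that split, reusing the page->page_cgroup field and the
lock helpers already inside memcontrol.c; just the thought experiment above
in code form, not something from Hugh's series.)

/*
 * Get the pc without taking the lock; only safe if the caller already
 * holds lock_page_cgroup(page) or otherwise knows pc cannot change.
 */
static struct page_cgroup *__page_get_page_cgroup(struct page *page)
{
        return (struct page_cgroup *)(page->page_cgroup & ~PAGE_CGROUP_LOCK);
}

/*
 * Take the lock and return the pc with the lock still held; every
 * caller must pair this with unlock_page_cgroup(page) when done.
 */
static struct page_cgroup *page_get_page_cgroup(struct page *page)
{
        lock_page_cgroup(page);
        return __page_get_page_cgroup(page);
}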

> While we're thinking of races...
> 
> It seemed to me that mem_cgroup_uncharge should be doing its css_put
> after its __mem_cgroup_remove_list: doesn't doing it before leave open
> a slight danger that the struct mem_cgroup could be freed before the
> remove_list?  Perhaps there's some other refcounting that makes that
> impossible, but I've felt safer shifting those around.
> 

Yes, it's a good idea to move it down.

> Hugh


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-22 11:14         ` Balbir Singh
@ 2008-02-22 12:00           ` Hugh Dickins
  2008-02-22 12:28             ` Balbir Singh
  0 siblings, 1 reply; 63+ messages in thread
From: Hugh Dickins @ 2008-02-22 12:00 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Hirokazu Takahashi, linux-mm, yamamoto, riel

On Fri, 22 Feb 2008, Balbir Singh wrote:
> 
> I've been looking through the code time and again, looking for races. I will try

Well worth doing.

> and build a sketch of all the functions and dependencies tonight. One thing that
> struck me was that making page_get_page_cgroup() call lock_page_cgroup()
> internally might potentially fix a lot of racy call sites. I was thinking of
> splitting page_get_page_cgroup into __page_get_page_cgroup() <--> just get the
> pc without lock and page_get_page_cgroup(), that holds the lock and then returns pc.

I don't think that would help.  One of the problems with what's there
(before my patches) is how, for example, clear_page_cgroup takes the
lock itself - forcing you into dropping the lock before calling it
(you contemplate keeping an __ which doesn't take the lock, but then
I cannot see the point).

What's there after the patches looks fairly tidy and straightforward
to me, but emphasize "fairly".  (Often I think there's a race against
page->page_cgroup going NULL, but then realize that pc->page remains
stable and there's no such race.)

> 
> Of course, this is just a thought process. I am yet to write the code and look
> at the results.

I'd hoped to send out my series last night, but was unable to get
quite that far, sorry, and haven't tested the page migration paths yet.
The total is not unlike what I already showed, but plus Hirokazu-san's
patch and minus shmem's NULL page and minus my rearrangement of
mem_cgroup_charge_common.

Hugh


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-22 12:00           ` Hugh Dickins
@ 2008-02-22 12:28             ` Balbir Singh
  2008-02-22 12:53               ` Hugh Dickins
  0 siblings, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2008-02-22 12:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, Hirokazu Takahashi, linux-mm, yamamoto, riel

Hugh Dickins wrote:
> On Fri, 22 Feb 2008, Balbir Singh wrote:
>> I've been looking through the code time and again, looking for races. I will try
> 
> Well worth doing.
> 

Yes, I agree 100%. Unfortunately I am not a spin expert, and modeling it that
way takes longer than reviewing it a few times.

>> and build a sketch of all the functions and dependencies tonight. One thing that
>> struck me was that making page_get_page_cgroup() call lock_page_cgroup()
>> internally might potentially fix a lot of racy call sites. I was thinking of
>> splitting page_get_page_cgroup into __page_get_page_cgroup() <--> just get the
>> pc without lock and page_get_page_cgroup(), that holds the lock and then returns pc.
> 
> I don't think that would help.  One of the problems with what's there
> (before my patches) is how, for example, clear_page_cgroup takes the
> lock itself - forcing you into dropping the lock before calling it
> (you contemplate keeping an __ which doesn't take the lock, but then
> I cannot see the point).
> 

I just proposed the __ version in case there was a reason to keep it. If we
can get away without it, I won't add __page_get_page_cgroup() at all.

> What's there after the patches looks fairly tidy and straightforward
> to me, but emphasize "fairly".  (Often I think there's a race against
> page->page_cgroup going NULL, but then realize that pc->page remains
> stable and there's no such race.)
> 

I agree, and I find some of the refactoring very welcome! I did a quick code
check and found that almost all of the cases where we were worried about pc
called page_get_page_cgroup() at some point. I thought this might be a good
common place to attack and fix.

>> Of course, this is just a thought process. I am yet to write the code and look
>> at the results.
> 
> I'd hoped to send out my series last night, but was unable to get
> quite that far, sorry, and haven't tested the page migration paths yet.
> The total is not unlike what I already showed, but plus Hirokazu-san's
> patch and minus shmem's NULL page and minus my rearrangement of
> mem_cgroup_charge_common.
> 

Do let me know when you have a version to test; I can run LTP, LTP stress,
and other tests overnight.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-22 12:28             ` Balbir Singh
@ 2008-02-22 12:53               ` Hugh Dickins
  2008-02-25  3:18                 ` Balbir Singh
  0 siblings, 1 reply; 63+ messages in thread
From: Hugh Dickins @ 2008-02-22 12:53 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Hirokazu Takahashi, linux-mm, yamamoto, riel

On Fri, 22 Feb 2008, Balbir Singh wrote:
> Hugh Dickins wrote:
> > I'd hoped to send out my series last night, but was unable to get
> > quite that far, sorry, and haven't tested the page migration paths yet.
> > The total is not unlike what I already showed, but plus Hirokazu-san's
> > patch and minus shmem's NULL page and minus my rearrangement of
> > mem_cgroup_charge_common.
> 
> Do let me know when you'll have a version to test, I can run LTP, LTP stress
> and other tests overnight.

This is the rollup, I'll try hard not to depart from this later without
good reason - thanks, Hugh

diff -purN 26252/include/linux/memcontrol.h memcg12/include/linux/memcontrol.h
--- 26252/include/linux/memcontrol.h	2008-02-11 07:18:10.000000000 +0000
+++ memcg12/include/linux/memcontrol.h	2008-02-21 20:08:08.000000000 +0000
@@ -32,14 +32,16 @@ struct mm_struct;
 
 extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
 extern void mm_free_cgroup(struct mm_struct *mm);
-extern void page_assign_page_cgroup(struct page *page,
-					struct page_cgroup *pc);
+
+#define page_reset_bad_cgroup(page)	((page)->page_cgroup = 0)
+
 extern struct page_cgroup *page_get_page_cgroup(struct page *page);
 extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
-extern void mem_cgroup_uncharge(struct page_cgroup *pc);
+extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
+					gfp_t gfp_mask);
 extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_move_lists(struct page_cgroup *pc, bool active);
+extern void mem_cgroup_move_lists(struct page *page, bool active);
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -47,11 +49,9 @@ extern unsigned long mem_cgroup_isolate_
 					struct mem_cgroup *mem_cont,
 					int active);
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
-#define vm_match_cgroup(mm, cgroup)	\
+#define mm_match_cgroup(mm, cgroup)	\
 	((cgroup) == rcu_dereference((mm)->mem_cgroup))
 
 extern int mem_cgroup_prepare_migration(struct page *page);
@@ -85,8 +85,7 @@ static inline void mm_free_cgroup(struct
 {
 }
 
-static inline void page_assign_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
+static inline void page_reset_bad_cgroup(struct page *page)
 {
 }
 
@@ -95,33 +94,27 @@ static inline struct page_cgroup *page_g
 	return NULL;
 }
 
-static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask)
+static inline int mem_cgroup_charge(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
 {
 	return 0;
 }
 
-static inline void mem_cgroup_uncharge(struct page_cgroup *pc)
+static inline int mem_cgroup_cache_charge(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
 {
+	return 0;
 }
 
 static inline void mem_cgroup_uncharge_page(struct page *page)
 {
 }
 
-static inline void mem_cgroup_move_lists(struct page_cgroup *pc,
-						bool active)
-{
-}
-
-static inline int mem_cgroup_cache_charge(struct page *page,
-						struct mm_struct *mm,
-						gfp_t gfp_mask)
+static inline void mem_cgroup_move_lists(struct page *page, bool active)
 {
-	return 0;
 }
 
-static inline int vm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
+static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
 {
 	return 1;
 }
diff -purN 26252/mm/memcontrol.c memcg12/mm/memcontrol.c
--- 26252/mm/memcontrol.c	2008-02-11 07:18:12.000000000 +0000
+++ memcg12/mm/memcontrol.c	2008-02-21 20:08:34.000000000 +0000
@@ -137,14 +137,21 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat stat;
 };
+static struct mem_cgroup init_mem_cgroup;
 
 /*
  * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is atleast two
- * byte aligned (based on comments from Nick Piggin)
+ * lock.  We need to ensure that page->page_cgroup is at least two
+ * byte aligned (based on comments from Nick Piggin).  But since
+ * bit_spin_lock doesn't actually set that lock bit in a non-debug
+ * uniprocessor kernel, we should avoid setting it here too.
  */
 #define PAGE_CGROUP_LOCK_BIT 	0x0
-#define PAGE_CGROUP_LOCK 		(1 << PAGE_CGROUP_LOCK_BIT)
+#if defined (CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define PAGE_CGROUP_LOCK 	(1 << PAGE_CGROUP_LOCK_BIT)
+#else
+#define PAGE_CGROUP_LOCK	0x0
+#endif
 
 /*
  * A page_cgroup page is associated with every page descriptor. The
@@ -154,37 +161,27 @@ struct page_cgroup {
 	struct list_head lru;		/* per cgroup LRU list */
 	struct page *page;
 	struct mem_cgroup *mem_cgroup;
-	atomic_t ref_cnt;		/* Helpful when pages move b/w  */
-					/* mapped and cached states     */
-	int	 flags;
+	int ref_cnt;			/* cached, mapped, migrating */
+	int flags;
 };
 #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
 
-static inline int page_cgroup_nid(struct page_cgroup *pc)
+static int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
 }
 
-static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 {
 	return page_zonenum(pc->page);
 }
 
-enum {
-	MEM_CGROUP_TYPE_UNSPEC = 0,
-	MEM_CGROUP_TYPE_MAPPED,
-	MEM_CGROUP_TYPE_CACHED,
-	MEM_CGROUP_TYPE_ALL,
-	MEM_CGROUP_TYPE_MAX,
-};
-
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
 };
 
-
 /*
  * Always modified under lru lock. Then, not necessary to preempt_disable()
  */
@@ -193,23 +190,21 @@ static void mem_cgroup_charge_statistics
 {
 	int val = (charge)? 1 : -1;
 	struct mem_cgroup_stat *stat = &mem->stat;
-	VM_BUG_ON(!irqs_disabled());
 
+	VM_BUG_ON(!irqs_disabled());
 	if (flags & PAGE_CGROUP_FLAG_CACHE)
-		__mem_cgroup_stat_add_safe(stat,
-					MEM_CGROUP_STAT_CACHE, val);
+		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
 	else
 		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
 }
 
-static inline struct mem_cgroup_per_zone *
+static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
 {
-	BUG_ON(!mem->info.nodeinfo[nid]);
 	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
 }
 
-static inline struct mem_cgroup_per_zone *
+static struct mem_cgroup_per_zone *
 page_cgroup_zoneinfo(struct page_cgroup *pc)
 {
 	struct mem_cgroup *mem = pc->mem_cgroup;
@@ -234,18 +229,14 @@ static unsigned long mem_cgroup_get_all_
 	return total;
 }
 
-static struct mem_cgroup init_mem_cgroup;
-
-static inline
-struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
+static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
 {
 	return container_of(cgroup_subsys_state(cont,
 				mem_cgroup_subsys_id), struct mem_cgroup,
 				css);
 }
 
-static inline
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
 	return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
 				struct mem_cgroup, css);
@@ -267,81 +258,33 @@ void mm_free_cgroup(struct mm_struct *mm
 
 static inline int page_cgroup_locked(struct page *page)
 {
-	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT,
-					&page->page_cgroup);
+	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
 }
 
-void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
+static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
 {
-	int locked;
-
-	/*
-	 * While resetting the page_cgroup we might not hold the
-	 * page_cgroup lock. free_hot_cold_page() is an example
-	 * of such a scenario
-	 */
-	if (pc)
-		VM_BUG_ON(!page_cgroup_locked(page));
-	locked = (page->page_cgroup & PAGE_CGROUP_LOCK);
-	page->page_cgroup = ((unsigned long)pc | locked);
+	VM_BUG_ON(!page_cgroup_locked(page));
+	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
 }
 
 struct page_cgroup *page_get_page_cgroup(struct page *page)
 {
-	return (struct page_cgroup *)
-		(page->page_cgroup & ~PAGE_CGROUP_LOCK);
+	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
 }
 
-static void __always_inline lock_page_cgroup(struct page *page)
+static void lock_page_cgroup(struct page *page)
 {
 	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-	VM_BUG_ON(!page_cgroup_locked(page));
-}
-
-static void __always_inline unlock_page_cgroup(struct page *page)
-{
-	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
 }
 
-/*
- * Tie new page_cgroup to struct page under lock_page_cgroup()
- * This can fail if the page has been tied to a page_cgroup.
- * If success, returns 0.
- */
-static int page_cgroup_assign_new_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
+static int try_lock_page_cgroup(struct page *page)
 {
-	int ret = 0;
-
-	lock_page_cgroup(page);
-	if (!page_get_page_cgroup(page))
-		page_assign_page_cgroup(page, pc);
-	else /* A page is tied to other pc. */
-		ret = 1;
-	unlock_page_cgroup(page);
-	return ret;
+	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
 }
 
-/*
- * Clear page->page_cgroup member under lock_page_cgroup().
- * If given "pc" value is different from one page->page_cgroup,
- * page->cgroup is not cleared.
- * Returns a value of page->page_cgroup at lock taken.
- * A can can detect failure of clearing by following
- *  clear_page_cgroup(page, pc) == pc
- */
-
-static struct page_cgroup *clear_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
+static void unlock_page_cgroup(struct page *page)
 {
-	struct page_cgroup *ret;
-	/* lock and clear */
-	lock_page_cgroup(page);
-	ret = page_get_page_cgroup(page);
-	if (likely(ret == pc))
-		page_assign_page_cgroup(page, NULL);
-	unlock_page_cgroup(page);
-	return ret;
+	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
 }
 
 static void __mem_cgroup_remove_list(struct page_cgroup *pc)
@@ -399,7 +342,7 @@ int task_in_mem_cgroup(struct task_struc
 	int ret;
 
 	task_lock(task);
-	ret = task->mm && vm_match_cgroup(task->mm, mem);
+	ret = task->mm && mm_match_cgroup(task->mm, mem);
 	task_unlock(task);
 	return ret;
 }
@@ -407,18 +350,30 @@ int task_in_mem_cgroup(struct task_struc
 /*
  * This routine assumes that the appropriate zone's lru lock is already held
  */
-void mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
+void mem_cgroup_move_lists(struct page *page, bool active)
 {
+	struct page_cgroup *pc;
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
-	if (!pc)
+	/*
+	 * We cannot lock_page_cgroup while holding zone's lru_lock,
+	 * because other holders of lock_page_cgroup can be interrupted
+	 * with an attempt to rotate_reclaimable_page.  But we cannot
+	 * safely get to page_cgroup without it, so just try_lock it:
+	 * mem_cgroup_isolate_pages allows for page left on wrong list.
+	 */
+	if (!try_lock_page_cgroup(page))
 		return;
 
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_move_lists(pc, active);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	pc = page_get_page_cgroup(page);
+	if (pc) {
+		mz = page_cgroup_zoneinfo(pc);
+		spin_lock_irqsave(&mz->lru_lock, flags);
+		__mem_cgroup_move_lists(pc, active);
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+	}
+	unlock_page_cgroup(page);
 }
 
 /*
@@ -437,6 +392,7 @@ int mem_cgroup_calc_mapped_ratio(struct 
 	rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
 	return (int)((rss * 100L) / total);
 }
+
 /*
  * This function is called from vmscan.c. In page reclaiming loop. balance
  * between active and inactive list is calculated. For memory controller
@@ -500,7 +456,6 @@ long mem_cgroup_calc_reclaim_inactive(st
 	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
 
 	nr_inactive = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE);
-
 	return (nr_inactive >> priority);
 }
 
@@ -534,7 +489,6 @@ unsigned long mem_cgroup_isolate_pages(u
 		if (scan >= nr_to_scan)
 			break;
 		page = pc->page;
-		VM_BUG_ON(!pc);
 
 		if (unlikely(!PageLRU(page)))
 			continue;
@@ -587,26 +541,21 @@ static int mem_cgroup_charge_common(stru
 	 * with it
 	 */
 retry:
-	if (page) {
-		lock_page_cgroup(page);
-		pc = page_get_page_cgroup(page);
-		/*
-		 * The page_cgroup exists and
-		 * the page has already been accounted.
-		 */
-		if (pc) {
-			if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
-				/* this page is under being uncharged ? */
-				unlock_page_cgroup(page);
-				cpu_relax();
-				goto retry;
-			} else {
-				unlock_page_cgroup(page);
-				goto done;
-			}
-		}
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
+	/*
+	 * The page_cgroup exists and
+	 * the page has already been accounted.
+	 */
+	if (pc) {
+		VM_BUG_ON(pc->page != page);
+		VM_BUG_ON(pc->ref_cnt <= 0);
+
+		pc->ref_cnt++;
 		unlock_page_cgroup(page);
+		goto done;
 	}
+	unlock_page_cgroup(page);
 
 	pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
 	if (pc == NULL)
@@ -624,16 +573,11 @@ retry:
 	rcu_read_lock();
 	mem = rcu_dereference(mm->mem_cgroup);
 	/*
-	 * For every charge from the cgroup, increment reference
-	 * count
+	 * For every charge from the cgroup, increment reference count
 	 */
 	css_get(&mem->css);
 	rcu_read_unlock();
 
-	/*
-	 * If we created the page_cgroup, we should free it on exceeding
-	 * the cgroup limit.
-	 */
 	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
 		if (!(gfp_mask & __GFP_WAIT))
 			goto out;
@@ -642,12 +586,12 @@ retry:
 			continue;
 
 		/*
- 		 * try_to_free_mem_cgroup_pages() might not give us a full
- 		 * picture of reclaim. Some pages are reclaimed and might be
- 		 * moved to swap cache or just unmapped from the cgroup.
- 		 * Check the limit again to see if the reclaim reduced the
- 		 * current usage of the cgroup before giving up
- 		 */
+		 * try_to_free_mem_cgroup_pages() might not give us a full
+		 * picture of reclaim. Some pages are reclaimed and might be
+		 * moved to swap cache or just unmapped from the cgroup.
+		 * Check the limit again to see if the reclaim reduced the
+		 * current usage of the cgroup before giving up
+		 */
 		if (res_counter_check_under_limit(&mem->res))
 			continue;
 
@@ -658,14 +602,16 @@ retry:
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	atomic_set(&pc->ref_cnt, 1);
+	pc->ref_cnt = 1;
 	pc->mem_cgroup = mem;
 	pc->page = page;
 	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
 		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
 
-	if (!page || page_cgroup_assign_new_page_cgroup(page, pc)) {
+	lock_page_cgroup(page);
+	if (page_get_page_cgroup(page)) {
+		unlock_page_cgroup(page);
 		/*
 		 * Another charge has been added to this page already.
 		 * We take lock_page_cgroup(page) again and read
@@ -674,14 +620,13 @@ retry:
 		res_counter_uncharge(&mem->res, PAGE_SIZE);
 		css_put(&mem->css);
 		kfree(pc);
-		if (!page)
-			goto done;
 		goto retry;
 	}
+	page_assign_page_cgroup(page, pc);
+	unlock_page_cgroup(page);
 
 	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
-	/* Update statistics vector */
 	__mem_cgroup_add_list(pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 
@@ -694,70 +639,61 @@ err:
 	return -ENOMEM;
 }
 
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
-			gfp_t gfp_mask)
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 {
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-			MEM_CGROUP_CHARGE_TYPE_MAPPED);
+				MEM_CGROUP_CHARGE_TYPE_MAPPED);
 }
 
-/*
- * See if the cached pages should be charged at all?
- */
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
-	int ret = 0;
 	if (!mm)
 		mm = &init_mm;
-
-	ret = mem_cgroup_charge_common(page, mm, gfp_mask,
+	return mem_cgroup_charge_common(page, mm, gfp_mask,
 				MEM_CGROUP_CHARGE_TYPE_CACHE);
-	return ret;
 }
 
 /*
  * Uncharging is always a welcome operation, we never complain, simply
- * uncharge. This routine should be called with lock_page_cgroup held
+ * uncharge.
  */
-void mem_cgroup_uncharge(struct page_cgroup *pc)
+void mem_cgroup_uncharge_page(struct page *page)
 {
+	struct page_cgroup *pc;
 	struct mem_cgroup *mem;
 	struct mem_cgroup_per_zone *mz;
-	struct page *page;
 	unsigned long flags;
 
 	/*
 	 * Check if our page_cgroup is valid
 	 */
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
 	if (!pc)
-		return;
+		goto unlock;
 
-	if (atomic_dec_and_test(&pc->ref_cnt)) {
-		page = pc->page;
-		mz = page_cgroup_zoneinfo(pc);
-		/*
-		 * get page->cgroup and clear it under lock.
-		 * force_empty can drop page->cgroup without checking refcnt.
-		 */
+	VM_BUG_ON(pc->page != page);
+	VM_BUG_ON(pc->ref_cnt <= 0);
+
+	if (--(pc->ref_cnt) == 0) {
+		page_assign_page_cgroup(page, NULL);
 		unlock_page_cgroup(page);
-		if (clear_page_cgroup(page, pc) == pc) {
-			mem = pc->mem_cgroup;
-			css_put(&mem->css);
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			spin_lock_irqsave(&mz->lru_lock, flags);
-			__mem_cgroup_remove_list(pc);
-			spin_unlock_irqrestore(&mz->lru_lock, flags);
-			kfree(pc);
-		}
-		lock_page_cgroup(page);
+
+		mz = page_cgroup_zoneinfo(pc);
+		spin_lock_irqsave(&mz->lru_lock, flags);
+		__mem_cgroup_remove_list(pc);
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+		mem = pc->mem_cgroup;
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		css_put(&mem->css);
+
+		kfree(pc);
+		return;
 	}
-}
 
-void mem_cgroup_uncharge_page(struct page *page)
-{
-	lock_page_cgroup(page);
-	mem_cgroup_uncharge(page_get_page_cgroup(page));
+unlock:
 	unlock_page_cgroup(page);
 }
 
@@ -765,50 +701,46 @@ void mem_cgroup_uncharge_page(struct pag
  * Returns non-zero if a page (under migration) has valid page_cgroup member.
  * Refcnt of page_cgroup is incremented.
  */
-
 int mem_cgroup_prepare_migration(struct page *page)
 {
 	struct page_cgroup *pc;
-	int ret = 0;
+
 	lock_page_cgroup(page);
 	pc = page_get_page_cgroup(page);
-	if (pc && atomic_inc_not_zero(&pc->ref_cnt))
-		ret = 1;
+	if (pc)
+		pc->ref_cnt++;
 	unlock_page_cgroup(page);
-	return ret;
+	return pc != NULL;
 }
 
 void mem_cgroup_end_migration(struct page *page)
 {
-	struct page_cgroup *pc;
-
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	mem_cgroup_uncharge(pc);
-	unlock_page_cgroup(page);
+	mem_cgroup_uncharge_page(page);
 }
+
 /*
- * We know both *page* and *newpage* are now not-on-LRU and Pg_locked.
+ * We know both *page* and *newpage* are now not-on-LRU and PG_locked.
  * And no race with uncharge() routines because page_cgroup for *page*
  * has extra one reference by mem_cgroup_prepare_migration.
  */
-
 void mem_cgroup_page_migration(struct page *page, struct page *newpage)
 {
 	struct page_cgroup *pc;
-	struct mem_cgroup *mem;
-	unsigned long flags;
 	struct mem_cgroup_per_zone *mz;
-retry:
+	unsigned long flags;
+
+	lock_page_cgroup(page);
 	pc = page_get_page_cgroup(page);
-	if (!pc)
+	if (!pc) {
+		unlock_page_cgroup(page);
 		return;
-	mem = pc->mem_cgroup;
+	}
+
+	page_assign_page_cgroup(page, NULL);
+	unlock_page_cgroup(page);
+
 	mz = page_cgroup_zoneinfo(pc);
-	if (clear_page_cgroup(page, pc) != pc)
-		goto retry;
 	spin_lock_irqsave(&mz->lru_lock, flags);
-
 	__mem_cgroup_remove_list(pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 
@@ -821,7 +753,6 @@ retry:
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_add_list(pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-	return;
 }
 
 /*
@@ -830,14 +761,13 @@ retry:
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
  */
 #define FORCE_UNCHARGE_BATCH	(128)
-static void
-mem_cgroup_force_empty_list(struct mem_cgroup *mem,
+static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			    struct mem_cgroup_per_zone *mz,
 			    int active)
 {
 	struct page_cgroup *pc;
 	struct page *page;
-	int count;
+	int count = FORCE_UNCHARGE_BATCH;
 	unsigned long flags;
 	struct list_head *list;
 
@@ -846,46 +776,36 @@ mem_cgroup_force_empty_list(struct mem_c
 	else
 		list = &mz->inactive_list;
 
-	if (list_empty(list))
-		return;
-retry:
-	count = FORCE_UNCHARGE_BATCH;
 	spin_lock_irqsave(&mz->lru_lock, flags);
-
-	while (--count && !list_empty(list)) {
+	while (!list_empty(list)) {
 		pc = list_entry(list->prev, struct page_cgroup, lru);
 		page = pc->page;
-		/* Avoid race with charge */
-		atomic_set(&pc->ref_cnt, 0);
-		if (clear_page_cgroup(page, pc) == pc) {
-			css_put(&mem->css);
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			__mem_cgroup_remove_list(pc);
-			kfree(pc);
-		} else 	/* being uncharged ? ...do relax */
-			break;
+		get_page(page);
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+		mem_cgroup_uncharge_page(page);
+		put_page(page);
+		if (--count <= 0) {
+			count = FORCE_UNCHARGE_BATCH;
+			cond_resched();
+		}
+		spin_lock_irqsave(&mz->lru_lock, flags);
 	}
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-	if (!list_empty(list)) {
-		cond_resched();
-		goto retry;
-	}
-	return;
 }
 
 /*
  * make mem_cgroup's charge to be 0 if there is no task.
  * This enables deleting this mem_cgroup.
  */
-
-int mem_cgroup_force_empty(struct mem_cgroup *mem)
+static int mem_cgroup_force_empty(struct mem_cgroup *mem)
 {
 	int ret = -EBUSY;
 	int node, zid;
+
 	css_get(&mem->css);
 	/*
 	 * page reclaim code (kswapd etc..) will move pages between
-`	 * active_list <-> inactive_list while we don't take a lock.
+	 * active_list <-> inactive_list while we don't take a lock.
 	 * So, we have to do loop here until all lists are empty.
 	 */
 	while (mem->res.usage > 0) {
@@ -907,9 +827,7 @@ out:
 	return ret;
 }
 
-
-
-int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
+static int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
 {
 	*tmp = memparse(buf, &buf);
 	if (*buf != '\0')
@@ -946,8 +864,7 @@ static ssize_t mem_force_empty_write(str
 				size_t nbytes, loff_t *ppos)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-	int ret;
-	ret = mem_cgroup_force_empty(mem);
+	int ret = mem_cgroup_force_empty(mem);
 	if (!ret)
 		ret = nbytes;
 	return ret;
@@ -956,7 +873,6 @@ static ssize_t mem_force_empty_write(str
 /*
  * Note: This should be removed if cgroup supports write-only file.
  */
-
 static ssize_t mem_force_empty_read(struct cgroup *cont,
 				struct cftype *cft,
 				struct file *file, char __user *userbuf,
@@ -965,7 +881,6 @@ static ssize_t mem_force_empty_read(stru
 	return -EINVAL;
 }
 
-
 static const struct mem_cgroup_stat_desc {
 	const char *msg;
 	u64 unit;
@@ -1018,8 +933,6 @@ static int mem_control_stat_open(struct 
 	return single_open(file, mem_control_stat_show, cont);
 }
 
-
-
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -1085,9 +998,6 @@ static void free_mem_cgroup_per_zone_inf
 	kfree(mem->info.nodeinfo[node]);
 }
 
-
-static struct mem_cgroup init_mem_cgroup;
-
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -1177,7 +1087,6 @@ static void mem_cgroup_move_task(struct 
 
 out:
 	mmput(mm);
-	return;
 }
 
 struct cgroup_subsys mem_cgroup_subsys = {
diff -purN 26252/mm/memory.c memcg12/mm/memory.c
--- 26252/mm/memory.c	2008-02-15 23:43:20.000000000 +0000
+++ memcg12/mm/memory.c	2008-02-21 20:07:58.000000000 +0000
@@ -1711,7 +1711,7 @@ unlock:
 	}
 	return ret;
 oom_free_new:
-	__free_page(new_page);
+	page_cache_release(new_page);
 oom:
 	if (old_page)
 		page_cache_release(old_page);
@@ -2093,12 +2093,9 @@ static int do_swap_page(struct mm_struct
 	unlock_page(page);
 
 	if (write_access) {
-		/* XXX: We could OR the do_wp_page code with this one? */
-		if (do_wp_page(mm, vma, address,
-				page_table, pmd, ptl, pte) & VM_FAULT_OOM) {
-			mem_cgroup_uncharge_page(page);
-			ret = VM_FAULT_OOM;
-		}
+		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+		if (ret & VM_FAULT_ERROR)
+			ret &= VM_FAULT_ERROR;
 		goto out;
 	}
 
@@ -2163,7 +2160,7 @@ release:
 	page_cache_release(page);
 	goto unlock;
 oom_free_page:
-	__free_page(page);
+	page_cache_release(page);
 oom:
 	return VM_FAULT_OOM;
 }
diff -purN 26252/mm/page_alloc.c memcg12/mm/page_alloc.c
--- 26252/mm/page_alloc.c	2008-02-11 07:18:12.000000000 +0000
+++ memcg12/mm/page_alloc.c	2008-02-21 20:08:04.000000000 +0000
@@ -221,13 +221,19 @@ static inline int bad_range(struct zone 
 
 static void bad_page(struct page *page)
 {
-	printk(KERN_EMERG "Bad page state in process '%s'\n"
-		KERN_EMERG "page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n"
-		KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
-		KERN_EMERG "Backtrace:\n",
+	void *pc = page_get_page_cgroup(page);
+
+	printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
+		"page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
 		current->comm, page, (int)(2*sizeof(unsigned long)),
 		(unsigned long)page->flags, page->mapping,
 		page_mapcount(page), page_count(page));
+	if (pc) {
+		printk(KERN_EMERG "cgroup:%p\n", pc);
+		page_reset_bad_cgroup(page);
+	}
+	printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
+		KERN_EMERG "Backtrace:\n");
 	dump_stack();
 	page->flags &= ~(1 << PG_lru	|
 			1 << PG_private |
@@ -453,6 +459,7 @@ static inline int free_pages_check(struc
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
+		(page_get_page_cgroup(page) != NULL) |
 		(page_count(page) != 0)  |
 		(page->flags & (
 			1 << PG_lru	|
@@ -602,6 +609,7 @@ static int prep_new_page(struct page *pa
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
+		(page_get_page_cgroup(page) != NULL) |
 		(page_count(page) != 0)  |
 		(page->flags & (
 			1 << PG_lru	|
@@ -988,7 +996,6 @@ static void free_hot_cold_page(struct pa
 
 	if (!PageHighMem(page))
 		debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
-	VM_BUG_ON(page_get_page_cgroup(page));
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
@@ -2527,7 +2534,6 @@ void __meminit memmap_init_zone(unsigned
 		set_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		reset_page_mapcount(page);
-		page_assign_page_cgroup(page, NULL);
 		SetPageReserved(page);
 
 		/*
diff -purN 26252/mm/rmap.c memcg12/mm/rmap.c
--- 26252/mm/rmap.c	2008-02-11 07:18:12.000000000 +0000
+++ memcg12/mm/rmap.c	2008-02-21 20:07:51.000000000 +0000
@@ -321,7 +321,7 @@ static int page_referenced_anon(struct p
 		 * counting on behalf of references from different
 		 * cgroups
 		 */
-		if (mem_cont && !vm_match_cgroup(vma->vm_mm, mem_cont))
+		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
 			continue;
 		referenced += page_referenced_one(page, vma, &mapcount);
 		if (!mapcount)
@@ -382,7 +382,7 @@ static int page_referenced_file(struct p
 		 * counting on behalf of references from different
 		 * cgroups
 		 */
-		if (mem_cont && !vm_match_cgroup(vma->vm_mm, mem_cont))
+		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
 			continue;
 		if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
 				  == (VM_LOCKED|VM_MAYSHARE)) {
diff -purN 26252/mm/shmem.c memcg12/mm/shmem.c
--- 26252/mm/shmem.c	2008-02-11 07:18:12.000000000 +0000
+++ memcg12/mm/shmem.c	2008-02-21 20:08:34.000000000 +0000
@@ -1370,14 +1370,17 @@ repeat:
 			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			unlock_page(swappage);
-			page_cache_release(swappage);
 			if (error == -ENOMEM) {
 				/* allow reclaim from this memory cgroup */
-				error = mem_cgroup_cache_charge(NULL,
+				error = mem_cgroup_cache_charge(swappage,
 					current->mm, gfp & ~__GFP_HIGHMEM);
-				if (error)
+				if (error) {
+					page_cache_release(swappage);
 					goto failed;
+				}
+				mem_cgroup_uncharge_page(swappage);
 			}
+			page_cache_release(swappage);
 			goto repeat;
 		}
 	} else if (sgp == SGP_READ && !filepage) {
diff -purN 26252/mm/swap.c memcg12/mm/swap.c
--- 26252/mm/swap.c	2008-02-11 07:18:12.000000000 +0000
+++ memcg12/mm/swap.c	2008-02-21 20:08:01.000000000 +0000
@@ -176,7 +176,7 @@ void activate_page(struct page *page)
 		SetPageActive(page);
 		add_page_to_active_list(zone, page);
 		__count_vm_event(PGACTIVATE);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
+		mem_cgroup_move_lists(page, true);
 	}
 	spin_unlock_irq(&zone->lru_lock);
 }
diff -purN 26252/mm/vmscan.c memcg12/mm/vmscan.c
--- 26252/mm/vmscan.c	2008-02-11 07:18:12.000000000 +0000
+++ memcg12/mm/vmscan.c	2008-02-21 20:08:01.000000000 +0000
@@ -1128,7 +1128,7 @@ static void shrink_active_list(unsigned 
 		ClearPageActive(page);
 
 		list_move(&page->lru, &zone->inactive_list);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), false);
+		mem_cgroup_move_lists(page, false);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
@@ -1156,8 +1156,9 @@ static void shrink_active_list(unsigned 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
+
 		list_move(&page->lru, &zone->active_list);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
+		mem_cgroup_move_lists(page, true);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);

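As a side note on the force_empty rework above: the new mem_cgroup_force_empty_list() follows a simple drain shape -- pop one entry under the lru_lock, drop the lock across the (possibly sleeping) uncharge, and cond_resched() every FORCE_UNCHARGE_BATCH iterations. Below is a minimal generic sketch of that shape; it is not part of the patch, and drain_list_batched()/DRAIN_BATCH are made-up names used only for illustration.

==
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

#define DRAIN_BATCH	128	/* illustrative, mirrors FORCE_UNCHARGE_BATCH */

/*
 * Drain a list one entry at a time: take the lock only to detach the
 * next entry, do the expensive per-entry work unlocked, and give the
 * scheduler a chance every DRAIN_BATCH entries.
 */
static void drain_list_batched(struct list_head *list, spinlock_t *lock,
			       void (*process)(struct list_head *entry))
{
	unsigned long flags;
	int count = DRAIN_BATCH;

	spin_lock_irqsave(lock, flags);
	while (!list_empty(list)) {
		struct list_head *entry = list->prev;

		list_del_init(entry);
		spin_unlock_irqrestore(lock, flags);

		process(entry);		/* may sleep, may take other locks */

		if (--count <= 0) {
			count = DRAIN_BATCH;
			cond_resched();
		}
		spin_lock_irqsave(lock, flags);
	}
	spin_unlock_irqrestore(lock, flags);
}
==

Unlike the patch, which pins the page with get_page() rather than detaching the page_cgroup from the LRU before dropping the lock, this sketch detaches the entry first so the unlocked work owns it outright; the batching and lock-dropping pattern is the same.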


* Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
  2008-02-22 12:53               ` Hugh Dickins
@ 2008-02-25  3:18                 ` Balbir Singh
  0 siblings, 0 replies; 63+ messages in thread
From: Balbir Singh @ 2008-02-25  3:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, Hirokazu Takahashi, linux-mm, yamamoto, riel

Hugh Dickins wrote:
> On Fri, 22 Feb 2008, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> I'd hoped to send out my series last night, but was unable to get
>>> quite that far, sorry, and haven't tested the page migration paths yet.
>>> The total is not unlike what I already showed, but plus Hirokazu-san's
>>> patch and minus shmem's NULL page and minus my rearrangement of
>>> mem_cgroup_charge_common.
>> Do let me know when you'll have a version to test, I can run LTP, LTP stress
>> and other tests overnight.
> 
> This is the rollup, I'll try hard not to depart from this later without
> good reason - thanks, Hugh

Hi, Hugh,

Thanks, I'll test these against 2.6.25-rc3.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL



end of thread

Thread overview: 63+ messages -- links below jump to the message on this page --
2008-02-19 12:54 [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races KAMEZAWA Hiroyuki
2008-02-19 15:40 ` Hugh Dickins
2008-02-20  1:03   ` KAMEZAWA Hiroyuki
2008-02-20  4:14     ` Hugh Dickins
2008-02-20  4:37       ` KAMEZAWA Hiroyuki
2008-02-20  4:39         ` Hugh Dickins
2008-02-20  4:41           ` Balbir Singh
2008-02-20  6:40         ` Balbir Singh
2008-02-20  7:23           ` KAMEZAWA Hiroyuki
2008-02-20  3:14   ` KAMEZAWA Hiroyuki
2008-02-20  3:37     ` YAMAMOTO Takashi
2008-02-20  4:13       ` KAMEZAWA Hiroyuki
2008-02-20  4:32     ` Hugh Dickins
2008-02-20  5:57   ` Balbir Singh
2008-02-20  9:58     ` Hirokazu Takahashi
2008-02-20 10:06       ` Paul Menage
2008-02-20 10:11         ` Balbir Singh
2008-02-20 10:18           ` Paul Menage
2008-02-20 10:55             ` Balbir Singh
2008-02-20 11:21               ` KAMEZAWA Hiroyuki
2008-02-20 11:18                 ` Balbir Singh
2008-02-20 11:32                   ` KAMEZAWA Hiroyuki
2008-02-20 11:34                     ` Balbir Singh
2008-02-20 11:44                       ` Paul Menage
2008-02-20 11:41                   ` Paul Menage
2008-02-20 11:36       ` Balbir Singh
2008-02-20 11:55         ` Paul Menage
2008-02-21  2:49         ` Hirokazu Takahashi
2008-02-21  6:35           ` KAMEZAWA Hiroyuki
2008-02-21  9:07             ` Hirokazu Takahashi
2008-02-21  9:21               ` KAMEZAWA Hiroyuki
2008-02-21  9:28                 ` Balbir Singh
2008-02-21  9:44                   ` KAMEZAWA Hiroyuki
2008-02-22  3:31                     ` [RFC] Block I/O Cgroup Hirokazu Takahashi
2008-02-22  5:05                       ` KAMEZAWA Hiroyuki
2008-02-22  5:45                         ` Hirokazu Takahashi
2008-02-21  9:25               ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Balbir Singh
2008-02-20  6:27   ` Hirokazu Takahashi
2008-02-20  6:50     ` KAMEZAWA Hiroyuki
2008-02-20  8:32       ` Clean up force_empty (Was Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.) KAMEZAWA Hiroyuki
2008-02-20 10:07         ` Clean up force_empty Hirokazu Takahashi
2008-02-22  9:24       ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Hugh Dickins
2008-02-22 10:07         ` KAMEZAWA Hiroyuki
2008-02-22 10:25           ` Hugh Dickins
2008-02-22 10:34             ` KAMEZAWA Hiroyuki
2008-02-22 10:50         ` Hirokazu Takahashi
2008-02-22 11:14         ` Balbir Singh
2008-02-22 12:00           ` Hugh Dickins
2008-02-22 12:28             ` Balbir Singh
2008-02-22 12:53               ` Hugh Dickins
2008-02-25  3:18                 ` Balbir Singh
2008-02-19 15:54 ` kamezawa.hiroyu
2008-02-19 16:26   ` Hugh Dickins
2008-02-20  1:55     ` KAMEZAWA Hiroyuki
2008-02-20  2:05       ` YAMAMOTO Takashi
2008-02-20  2:15         ` KAMEZAWA Hiroyuki
2008-02-20  2:32           ` YAMAMOTO Takashi
2008-02-20  4:27             ` Hugh Dickins
2008-02-20  6:38     ` Balbir Singh
2008-02-20 11:00       ` Hugh Dickins
2008-02-20 11:32         ` Balbir Singh
2008-02-20 14:19           ` Hugh Dickins
2008-02-20  5:00 ` Balbir Singh
