* [PATCH 00/15] memcg: fixes and cleanups
From: Hugh Dickins @ 2008-02-25 23:34 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
Here's my series of fifteen mem_cgroup patches against 2.6.25-rc3:
fixes to bugs I've found and cleanups made on the way to those fixes.
I believe we'll want these in 2.6.25-rc4. Mostly they're what was
included in the rollup I posted on Friday, but 05/15 and 15/15 fix
(pre-existing) page migration and force_empty bugs I found after that.
I'm not at all happy with force_empty, 15/15 discusses it a little.
 include/linux/memcontrol.h |   37 +--
 mm/memcontrol.c            |  365 +++++++++++++----------------------
 mm/memory.c                |   13 -
 mm/migrate.c               |   19 +
 mm/page_alloc.c            |   18 +
 mm/rmap.c                  |    4
 mm/shmem.c                 |    9
 mm/swap.c                  |    2
 mm/vmscan.c                |    5
 9 files changed, 196 insertions(+), 276 deletions(-)
Hugh
* [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup
From: Hugh Dickins @ 2008-02-25 23:35 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, David Rientjes, linux-mm
vm_match_cgroup is a perverse name for a macro to match mm with cgroup:
rename it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
include/linux/memcontrol.h | 4 ++--
mm/memcontrol.c | 2 +-
mm/rmap.c | 4 ++--
3 files changed, 5 insertions(+), 5 deletions(-)
--- 2.6.25-rc3/include/linux/memcontrol.h 2008-02-24 22:39:48.000000000 +0000
+++ memcg01/include/linux/memcontrol.h 2008-02-25 14:05:35.000000000 +0000
@@ -48,7 +48,7 @@ extern int mem_cgroup_cache_charge(struc
gfp_t gfp_mask);
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
-#define vm_match_cgroup(mm, cgroup) \
+#define mm_match_cgroup(mm, cgroup) \
((cgroup) == rcu_dereference((mm)->mem_cgroup))
extern int mem_cgroup_prepare_migration(struct page *page);
@@ -118,7 +118,7 @@ static inline int mem_cgroup_cache_charg
return 0;
}
-static inline int vm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
+static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
{
return 1;
}
--- 2.6.25-rc3/mm/memcontrol.c 2008-02-24 22:39:48.000000000 +0000
+++ memcg01/mm/memcontrol.c 2008-02-25 14:05:35.000000000 +0000
@@ -399,7 +399,7 @@ int task_in_mem_cgroup(struct task_struc
int ret;
task_lock(task);
- ret = task->mm && vm_match_cgroup(task->mm, mem);
+ ret = task->mm && mm_match_cgroup(task->mm, mem);
task_unlock(task);
return ret;
}
--- 2.6.25-rc3/mm/rmap.c 2008-02-11 07:18:12.000000000 +0000
+++ memcg01/mm/rmap.c 2008-02-25 14:05:35.000000000 +0000
@@ -321,7 +321,7 @@ static int page_referenced_anon(struct p
* counting on behalf of references from different
* cgroups
*/
- if (mem_cont && !vm_match_cgroup(vma->vm_mm, mem_cont))
+ if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
continue;
referenced += page_referenced_one(page, vma, &mapcount);
if (!mapcount)
@@ -382,7 +382,7 @@ static int page_referenced_file(struct p
* counting on behalf of references from different
* cgroups
*/
- if (mem_cont && !vm_match_cgroup(vma->vm_mm, mem_cont))
+ if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
continue;
if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
== (VM_LOCKED|VM_MAYSHARE)) {
* [PATCH 02/15] memcg: move_lists on page not page_cgroup
From: Hugh Dickins @ 2008-02-25 23:36 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
Each caller of mem_cgroup_move_lists is having to use page_get_page_cgroup:
it's more convenient if it acts upon the page itself not the page_cgroup;
and in a later patch this becomes important to handle within memcontrol.c.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
include/linux/memcontrol.h | 5 ++---
mm/memcontrol.c | 4 +++-
mm/swap.c | 2 +-
mm/vmscan.c | 5 +++--
4 files changed, 9 insertions(+), 7 deletions(-)
--- memcg01/include/linux/memcontrol.h 2008-02-25 14:05:35.000000000 +0000
+++ memcg02/include/linux/memcontrol.h 2008-02-25 14:05:38.000000000 +0000
@@ -36,7 +36,7 @@ extern int mem_cgroup_charge(struct page
gfp_t gfp_mask);
extern void mem_cgroup_uncharge(struct page_cgroup *pc);
extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_move_lists(struct page_cgroup *pc, bool active);
+extern void mem_cgroup_move_lists(struct page *page, bool active);
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
@@ -106,8 +106,7 @@ static inline void mem_cgroup_uncharge_p
{
}
-static inline void mem_cgroup_move_lists(struct page_cgroup *pc,
- bool active)
+static inline void mem_cgroup_move_lists(struct page *page, bool active)
{
}
--- memcg01/mm/memcontrol.c 2008-02-25 14:05:35.000000000 +0000
+++ memcg02/mm/memcontrol.c 2008-02-25 14:05:38.000000000 +0000
@@ -407,11 +407,13 @@ int task_in_mem_cgroup(struct task_struc
/*
* This routine assumes that the appropriate zone's lru lock is already held
*/
-void mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
+void mem_cgroup_move_lists(struct page *page, bool active)
{
+ struct page_cgroup *pc;
struct mem_cgroup_per_zone *mz;
unsigned long flags;
+ pc = page_get_page_cgroup(page);
if (!pc)
return;
--- memcg01/mm/swap.c 2008-02-11 07:18:12.000000000 +0000
+++ memcg02/mm/swap.c 2008-02-25 14:05:38.000000000 +0000
@@ -176,7 +176,7 @@ void activate_page(struct page *page)
SetPageActive(page);
add_page_to_active_list(zone, page);
__count_vm_event(PGACTIVATE);
- mem_cgroup_move_lists(page_get_page_cgroup(page), true);
+ mem_cgroup_move_lists(page, true);
}
spin_unlock_irq(&zone->lru_lock);
}
--- memcg01/mm/vmscan.c 2008-02-11 07:18:12.000000000 +0000
+++ memcg02/mm/vmscan.c 2008-02-25 14:05:38.000000000 +0000
@@ -1128,7 +1128,7 @@ static void shrink_active_list(unsigned
ClearPageActive(page);
list_move(&page->lru, &zone->inactive_list);
- mem_cgroup_move_lists(page_get_page_cgroup(page), false);
+ mem_cgroup_move_lists(page, false);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
@@ -1156,8 +1156,9 @@ static void shrink_active_list(unsigned
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
VM_BUG_ON(!PageActive(page));
+
list_move(&page->lru, &zone->active_list);
- mem_cgroup_move_lists(page_get_page_cgroup(page), true);
+ mem_cgroup_move_lists(page, true);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
* [PATCH 03/15] memcg: page_cache_release not __free_page
From: Hugh Dickins @ 2008-02-25 23:37 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
There's nothing wrong with mem_cgroup_charge failure in do_wp_page and
do_anonymous_page using __free_page, but it does look odd when nearby
code uses page_cache_release: use that instead (while turning a blind
eye to ancient inconsistencies of page_cache_release versus put_page).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
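(Not part of the patch, just for reference: page_cache_release has long
been a plain alias for put_page, so this really is only a readability
change.  From include/linux/pagemap.h:

	#define page_cache_release(page)	put_page(page)

which is why there is no behavioural difference to worry about here.)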
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- memcg02/mm/memory.c 2008-02-15 23:43:20.000000000 +0000
+++ memcg03/mm/memory.c 2008-02-25 14:05:43.000000000 +0000
@@ -1711,7 +1711,7 @@ unlock:
}
return ret;
oom_free_new:
- __free_page(new_page);
+ page_cache_release(new_page);
oom:
if (old_page)
page_cache_release(old_page);
@@ -2163,7 +2163,7 @@ release:
page_cache_release(page);
goto unlock;
oom_free_page:
- __free_page(page);
+ page_cache_release(page);
oom:
return VM_FAULT_OOM;
}
* [PATCH 04/15] memcg: when do_swap's do_wp_page fails
From: Hugh Dickins @ 2008-02-25 23:38 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, Nick Piggin, linux-mm
Don't uncharge when do_swap_page's call to do_wp_page fails: the page which
was charged for is there in the pagetable, and will be correctly uncharged
when that area is unmapped - it was only its COWing which failed.
And while we're here, remove earlier XXX comment: yes, OR in do_wp_page's
return value (maybe VM_FAULT_WRITE) with do_swap_page's there; but if it
fails, mask out success bits, which might confuse some arches e.g. sparc.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
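(A worked example of the masking, not part of the patch - bit values as
in include/linux/mm.h here: VM_FAULT_OOM 0x0001, VM_FAULT_SIGBUS 0x0002,
VM_FAULT_MAJOR 0x0004, and VM_FAULT_ERROR is OOM|SIGBUS.  Suppose
do_swap_page's swapin had set VM_FAULT_MAJOR, then do_wp_page fails:

	ret = VM_FAULT_MAJOR;		/* 0x0004: swapin did I/O */
	ret |= VM_FAULT_OOM;		/* 0x0005: success bit still set */
	if (ret & VM_FAULT_ERROR)
		ret &= VM_FAULT_ERROR;	/* 0x0001: plain VM_FAULT_OOM */

so an error return carries no stray success bits up to the arch fault
handlers.)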
mm/memory.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
--- memcg03/mm/memory.c 2008-02-25 14:05:43.000000000 +0000
+++ memcg04/mm/memory.c 2008-02-25 14:05:47.000000000 +0000
@@ -2093,12 +2093,9 @@ static int do_swap_page(struct mm_struct
unlock_page(page);
if (write_access) {
- /* XXX: We could OR the do_wp_page code with this one? */
- if (do_wp_page(mm, vma, address,
- page_table, pmd, ptl, pte) & VM_FAULT_OOM) {
- mem_cgroup_uncharge_page(page);
- ret = VM_FAULT_OOM;
- }
+ ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+ if (ret & VM_FAULT_ERROR)
+ ret &= VM_FAULT_ERROR;
goto out;
}
* [PATCH 05/15] memcg: fix VM_BUG_ON from page migration
From: Hugh Dickins @ 2008-02-25 23:39 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
remove_migration_pte was calling mem_cgroup_charge on the new page whenever
it found a swap pte, before it had determined it to be a migration entry.
That left a surplus reference count on the page_cgroup, so it was still
attached when the page was later freed.
Move that mem_cgroup_charge down to where we're sure it's a migration entry.
We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
already inappropriate: change that to GFP_ATOMIC.
It's essential that remove_migration_pte removes all the migration entries;
other crashes follow if not. So proceed even when the charge fails: normally
it cannot, but after a mem_cgroup_force_empty it might - comment in the code.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
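(For context, a sketch of the surrounding remove_migration_pte code,
abridged: only after decoding the pte under its lock do we know this
really is a migration entry for the old page, which is why the charge
has to move down below these checks:

	pte = *ptep;
	if (!is_swap_pte(pte))
		goto out;
	entry = pte_to_swp_entry(pte);
	if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
		goto out;
	/* only here is it correct to charge the new page */

and by this point we hold the pte lock, hence GFP_ATOMIC.)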
mm/migrate.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
--- memcg04/mm/migrate.c 2008-02-11 07:18:12.000000000 +0000
+++ memcg05/mm/migrate.c 2008-02-25 14:05:50.000000000 +0000
@@ -153,11 +153,6 @@ static void remove_migration_pte(struct
return;
}
- if (mem_cgroup_charge(new, mm, GFP_KERNEL)) {
- pte_unmap(ptep);
- return;
- }
-
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
pte = *ptep;
@@ -169,6 +164,20 @@ static void remove_migration_pte(struct
if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
goto out;
+ /*
+ * Yes, ignore the return value from a GFP_ATOMIC mem_cgroup_charge.
+ * Failure is not an option here: we're now expected to remove every
+ * migration pte, and will cause crashes otherwise. Normally this
+ * is not an issue: mem_cgroup_prepare_migration bumped up the old
+ * page_cgroup count for safety, that's now attached to the new page,
+ * so this charge should just be another incrementation of the count,
+ * to keep in balance with rmap.c's mem_cgroup_uncharging. But if
+ * there's been a force_empty, those reference counts may no longer
+ * be reliable, and this charge can actually fail: oh well, we don't
+ * make the situation any worse by proceeding as if it had succeeded.
+ */
+ mem_cgroup_charge(new, mm, GFP_ATOMIC);
+
get_page(new);
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
if (is_write_migration_entry(entry))
* [PATCH 06/15] memcg: bad page if page_cgroup when free
From: Hugh Dickins @ 2008-02-25 23:40 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a
"Bad page state" and clear: most users don't have CONFIG_DEBUG_VM on, and
if it were set here, it'd likely cause corruption when the page is reused.
Don't use page_assign_page_cgroup to clear it: that should be private to
memcontrol.c, and always called with the lock taken; and memmap_init_zone
doesn't need it either - like page->mapping and other pointers throughout
the kernel, Linux assumes pointers in zeroed structures are NULL pointers.
Instead use page_reset_bad_cgroup, added to memcontrol.h for this only.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
I had page_reset_bad_cgroup as the approved inline function a few days ago,
but now there's been a cull of included header files, so it's a #define.
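(And on the PAGE_CGROUP_LOCK change below: the reason the lock bit must
not be set on a non-debug UP kernel is visible in bit_spin_lock itself,
abridged from include/linux/bit_spinlock.h of this era:

	static inline void bit_spin_lock(int bitnum, unsigned long *addr)
	{
		preempt_disable();
	#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
		while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
			...
		}
	#endif
		__acquire(bitlock);
	}

so there the "lock" is just preempt_disable, bit_spin_unlock has the
matching #if, and a bit ORed in by page_assign_page_cgroup would never
be cleared.)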
include/linux/memcontrol.h | 8 ++++----
mm/memcontrol.c | 27 ++++++++++++---------------
mm/page_alloc.c | 18 ++++++++++++------
3 files changed, 28 insertions(+), 25 deletions(-)
--- memcg05/include/linux/memcontrol.h 2008-02-25 14:05:38.000000000 +0000
+++ memcg06/include/linux/memcontrol.h 2008-02-25 14:05:55.000000000 +0000
@@ -29,8 +29,9 @@ struct mm_struct;
extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
extern void mm_free_cgroup(struct mm_struct *mm);
-extern void page_assign_page_cgroup(struct page *page,
- struct page_cgroup *pc);
+
+#define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0)
+
extern struct page_cgroup *page_get_page_cgroup(struct page *page);
extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
@@ -82,8 +83,7 @@ static inline void mm_free_cgroup(struct
{
}
-static inline void page_assign_page_cgroup(struct page *page,
- struct page_cgroup *pc)
+static inline void page_reset_bad_cgroup(struct page *page)
{
}
--- memcg05/mm/memcontrol.c 2008-02-25 14:05:38.000000000 +0000
+++ memcg06/mm/memcontrol.c 2008-02-25 14:05:55.000000000 +0000
@@ -140,11 +140,17 @@ struct mem_cgroup {
/*
* We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is atleast two
- * byte aligned (based on comments from Nick Piggin)
+ * lock. We need to ensure that page->page_cgroup is at least two
+ * byte aligned (based on comments from Nick Piggin). But since
+ * bit_spin_lock doesn't actually set that lock bit in a non-debug
+ * uniprocessor kernel, we should avoid setting it here too.
*/
#define PAGE_CGROUP_LOCK_BIT 0x0
-#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
+#else
+#define PAGE_CGROUP_LOCK 0x0
+#endif
/*
* A page_cgroup page is associated with every page descriptor. The
@@ -271,19 +277,10 @@ static inline int page_cgroup_locked(str
&page->page_cgroup);
}
-void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
+static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
{
- int locked;
-
- /*
- * While resetting the page_cgroup we might not hold the
- * page_cgroup lock. free_hot_cold_page() is an example
- * of such a scenario
- */
- if (pc)
- VM_BUG_ON(!page_cgroup_locked(page));
- locked = (page->page_cgroup & PAGE_CGROUP_LOCK);
- page->page_cgroup = ((unsigned long)pc | locked);
+ VM_BUG_ON(!page_cgroup_locked(page));
+ page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
}
struct page_cgroup *page_get_page_cgroup(struct page *page)
--- memcg05/mm/page_alloc.c 2008-02-24 22:39:48.000000000 +0000
+++ memcg06/mm/page_alloc.c 2008-02-25 14:05:55.000000000 +0000
@@ -221,13 +221,19 @@ static inline int bad_range(struct zone
static void bad_page(struct page *page)
{
- printk(KERN_EMERG "Bad page state in process '%s'\n"
- KERN_EMERG "page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n"
- KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
- KERN_EMERG "Backtrace:\n",
+ void *pc = page_get_page_cgroup(page);
+
+ printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
+ "page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
current->comm, page, (int)(2*sizeof(unsigned long)),
(unsigned long)page->flags, page->mapping,
page_mapcount(page), page_count(page));
+ if (pc) {
+ printk(KERN_EMERG "cgroup:%p\n", pc);
+ page_reset_bad_cgroup(page);
+ }
+ printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
+ KERN_EMERG "Backtrace:\n");
dump_stack();
page->flags &= ~(1 << PG_lru |
1 << PG_private |
@@ -453,6 +459,7 @@ static inline int free_pages_check(struc
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
+ (page_get_page_cgroup(page) != NULL) |
(page_count(page) != 0) |
(page->flags & (
1 << PG_lru |
@@ -602,6 +609,7 @@ static int prep_new_page(struct page *pa
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
+ (page_get_page_cgroup(page) != NULL) |
(page_count(page) != 0) |
(page->flags & (
1 << PG_lru |
@@ -988,7 +996,6 @@ static void free_hot_cold_page(struct pa
if (!PageHighMem(page))
debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
- VM_BUG_ON(page_get_page_cgroup(page));
arch_free_page(page, 0);
kernel_map_pages(page, 1, 0);
@@ -2527,7 +2534,6 @@ void __meminit memmap_init_zone(unsigned
set_page_links(page, zone, nid, pfn);
init_page_count(page);
reset_page_mapcount(page);
- page_assign_page_cgroup(page, NULL);
SetPageReserved(page);
/*
* [PATCH 07/15] memcg: mem_cgroup_charge never NULL
From: Hugh Dickins @ 2008-02-25 23:41 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling
to mem_cgroup_charge_common. It seemed convenient at the time, but hard
to justify now: there's a perfectly appropriate swappage to charge and
uncharge instead, this is not on any hot path through shmem_getpage,
and no performance hit was observed from the slight extra overhead.
So revert that NULL page handling from mem_cgroup_charge_common; and
make it clearer by bringing page_cgroup_assign_new_page_cgroup into its
body - that was a helper I found more of a hindrance to understanding.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
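(The shmem.c hunk may look odd at first sight - charge then immediately
uncharge - so spelling out the intent with comments; this is the same
code as in the hunk below, nothing beyond what the patch itself does:

	if (error == -ENOMEM) {
		/*
		 * Charging the swappage is done purely to provoke
		 * reclaim within this memory cgroup...
		 */
		error = mem_cgroup_cache_charge(swappage,
				current->mm, gfp & ~__GFP_HIGHMEM);
		if (error) {
			page_cache_release(swappage);
			goto failed;
		}
		/*
		 * ...and once that has succeeded, drop the charge
		 * again: the retry will charge the page it settles on.
		 */
		mem_cgroup_uncharge_page(swappage);
	}
)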
mm/memcontrol.c | 61 +++++++++++++++-------------------------------
mm/shmem.c | 9 ++++--
2 files changed, 27 insertions(+), 43 deletions(-)
--- memcg06/mm/memcontrol.c 2008-02-25 14:05:55.000000000 +0000
+++ memcg07/mm/memcontrol.c 2008-02-25 14:05:58.000000000 +0000
@@ -301,25 +301,6 @@ static void __always_inline unlock_page_
}
/*
- * Tie new page_cgroup to struct page under lock_page_cgroup()
- * This can fail if the page has been tied to a page_cgroup.
- * If success, returns 0.
- */
-static int page_cgroup_assign_new_page_cgroup(struct page *page,
- struct page_cgroup *pc)
-{
- int ret = 0;
-
- lock_page_cgroup(page);
- if (!page_get_page_cgroup(page))
- page_assign_page_cgroup(page, pc);
- else /* A page is tied to other pc. */
- ret = 1;
- unlock_page_cgroup(page);
- return ret;
-}
-
-/*
* Clear page->page_cgroup member under lock_page_cgroup().
* If given "pc" value is different from one page->page_cgroup,
* page->cgroup is not cleared.
@@ -585,26 +566,24 @@ static int mem_cgroup_charge_common(stru
* with it
*/
retry:
- if (page) {
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- /*
- * The page_cgroup exists and
- * the page has already been accounted.
- */
- if (pc) {
- if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
- /* this page is under being uncharged ? */
- unlock_page_cgroup(page);
- cpu_relax();
- goto retry;
- } else {
- unlock_page_cgroup(page);
- goto done;
- }
+ lock_page_cgroup(page);
+ pc = page_get_page_cgroup(page);
+ /*
+ * The page_cgroup exists and
+ * the page has already been accounted.
+ */
+ if (pc) {
+ if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
+ /* this page is under being uncharged ? */
+ unlock_page_cgroup(page);
+ cpu_relax();
+ goto retry;
+ } else {
+ unlock_page_cgroup(page);
+ goto done;
}
- unlock_page_cgroup(page);
}
+ unlock_page_cgroup(page);
pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
if (pc == NULL)
@@ -663,7 +642,9 @@ retry:
if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
pc->flags |= PAGE_CGROUP_FLAG_CACHE;
- if (!page || page_cgroup_assign_new_page_cgroup(page, pc)) {
+ lock_page_cgroup(page);
+ if (page_get_page_cgroup(page)) {
+ unlock_page_cgroup(page);
/*
* Another charge has been added to this page already.
* We take lock_page_cgroup(page) again and read
@@ -672,10 +653,10 @@ retry:
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
kfree(pc);
- if (!page)
- goto done;
goto retry;
}
+ page_assign_page_cgroup(page, pc);
+ unlock_page_cgroup(page);
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
--- memcg06/mm/shmem.c 2008-02-11 07:18:12.000000000 +0000
+++ memcg07/mm/shmem.c 2008-02-25 14:05:58.000000000 +0000
@@ -1370,14 +1370,17 @@ repeat:
shmem_swp_unmap(entry);
spin_unlock(&info->lock);
unlock_page(swappage);
- page_cache_release(swappage);
if (error == -ENOMEM) {
/* allow reclaim from this memory cgroup */
- error = mem_cgroup_cache_charge(NULL,
+ error = mem_cgroup_cache_charge(swappage,
current->mm, gfp & ~__GFP_HIGHMEM);
- if (error)
+ if (error) {
+ page_cache_release(swappage);
goto failed;
+ }
+ mem_cgroup_uncharge_page(swappage);
}
+ page_cache_release(swappage);
goto repeat;
}
} else if (sgp == SGP_READ && !filepage) {
* [PATCH 08/15] memcg: remove mem_cgroup_uncharge
From: Hugh Dickins @ 2008-02-25 23:42 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
Nothing uses mem_cgroup_uncharge apart from mem_cgroup_uncharge_page
(a trivial wrapper around it) and mem_cgroup_end_migration (which does
the same as mem_cgroup_uncharge_page). And it often ends up having to
lock just to let its caller unlock. Remove it (but leave the silly
locking until a later patch).
Moved mem_cgroup_cache_charge next to mem_cgroup_charge in memcontrol.h.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
I'd happily rename mem_cgroup_uncharge_page to mem_cgroup_uncharge later,
matching mem_cgroup_charge: that patch would touch more, what do you think?
include/linux/memcontrol.h | 20 +++++++-------------
mm/memcontrol.c | 23 ++++++++---------------
2 files changed, 15 insertions(+), 28 deletions(-)
--- memcg07/include/linux/memcontrol.h 2008-02-25 14:05:55.000000000 +0000
+++ memcg08/include/linux/memcontrol.h 2008-02-25 14:06:02.000000000 +0000
@@ -35,7 +35,8 @@ extern void mm_free_cgroup(struct mm_str
extern struct page_cgroup *page_get_page_cgroup(struct page *page);
extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-extern void mem_cgroup_uncharge(struct page_cgroup *pc);
+extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask);
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_move_lists(struct page *page, bool active);
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
@@ -45,8 +46,6 @@ extern unsigned long mem_cgroup_isolate_
struct mem_cgroup *mem_cont,
int active);
extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask);
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
#define mm_match_cgroup(mm, cgroup) \
@@ -92,14 +91,16 @@ static inline struct page_cgroup *page_g
return NULL;
}
-static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask)
+static inline int mem_cgroup_charge(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
{
return 0;
}
-static inline void mem_cgroup_uncharge(struct page_cgroup *pc)
+static inline int mem_cgroup_cache_charge(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
{
+ return 0;
}
static inline void mem_cgroup_uncharge_page(struct page *page)
@@ -110,13 +111,6 @@ static inline void mem_cgroup_move_lists
{
}
-static inline int mem_cgroup_cache_charge(struct page *page,
- struct mm_struct *mm,
- gfp_t gfp_mask)
-{
- return 0;
-}
-
static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
{
return 1;
--- memcg07/mm/memcontrol.c 2008-02-25 14:05:58.000000000 +0000
+++ memcg08/mm/memcontrol.c 2008-02-25 14:06:02.000000000 +0000
@@ -697,20 +697,22 @@ int mem_cgroup_cache_charge(struct page
/*
* Uncharging is always a welcome operation, we never complain, simply
- * uncharge. This routine should be called with lock_page_cgroup held
+ * uncharge.
*/
-void mem_cgroup_uncharge(struct page_cgroup *pc)
+void mem_cgroup_uncharge_page(struct page *page)
{
+ struct page_cgroup *pc;
struct mem_cgroup *mem;
struct mem_cgroup_per_zone *mz;
- struct page *page;
unsigned long flags;
/*
* Check if our page_cgroup is valid
*/
+ lock_page_cgroup(page);
+ pc = page_get_page_cgroup(page);
if (!pc)
- return;
+ goto unlock;
if (atomic_dec_and_test(&pc->ref_cnt)) {
page = pc->page;
@@ -731,12 +733,8 @@ void mem_cgroup_uncharge(struct page_cgr
}
lock_page_cgroup(page);
}
-}
-void mem_cgroup_uncharge_page(struct page *page)
-{
- lock_page_cgroup(page);
- mem_cgroup_uncharge(page_get_page_cgroup(page));
+unlock:
unlock_page_cgroup(page);
}
@@ -759,12 +757,7 @@ int mem_cgroup_prepare_migration(struct
void mem_cgroup_end_migration(struct page *page)
{
- struct page_cgroup *pc;
-
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- mem_cgroup_uncharge(pc);
- unlock_page_cgroup(page);
+ mem_cgroup_uncharge_page(page);
}
/*
* We know both *page* and *newpage* are now not-on-LRU and Pg_locked.
* [PATCH 09/15] memcg: memcontrol whitespace cleanups
From: Hugh Dickins @ 2008-02-25 23:43 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
Sorry, before getting down to more important changes, I'd like to do some
cleanup in memcontrol.c. This patch doesn't change the code generated,
but cleans up whitespace, moves up a double declaration, removes an unused
enum, removes void returns, removes misleading comments, that kind of thing.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/memcontrol.c | 94 +++++++++++++---------------------------------
1 file changed, 28 insertions(+), 66 deletions(-)
--- memcg08/mm/memcontrol.c 2008-02-25 14:06:02.000000000 +0000
+++ memcg09/mm/memcontrol.c 2008-02-25 14:06:09.000000000 +0000
@@ -137,6 +137,7 @@ struct mem_cgroup {
*/
struct mem_cgroup_stat stat;
};
+static struct mem_cgroup init_mem_cgroup;
/*
* We use the lower bit of the page->page_cgroup pointer as a bit spin
@@ -162,7 +163,7 @@ struct page_cgroup {
struct mem_cgroup *mem_cgroup;
atomic_t ref_cnt; /* Helpful when pages move b/w */
/* mapped and cached states */
- int flags;
+ int flags;
};
#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
@@ -177,20 +178,11 @@ static inline enum zone_type page_cgroup
return page_zonenum(pc->page);
}
-enum {
- MEM_CGROUP_TYPE_UNSPEC = 0,
- MEM_CGROUP_TYPE_MAPPED,
- MEM_CGROUP_TYPE_CACHED,
- MEM_CGROUP_TYPE_ALL,
- MEM_CGROUP_TYPE_MAX,
-};
-
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
};
-
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
@@ -199,11 +191,10 @@ static void mem_cgroup_charge_statistics
{
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
- VM_BUG_ON(!irqs_disabled());
+ VM_BUG_ON(!irqs_disabled());
if (flags & PAGE_CGROUP_FLAG_CACHE)
- __mem_cgroup_stat_add_safe(stat,
- MEM_CGROUP_STAT_CACHE, val);
+ __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
else
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
}
@@ -240,8 +231,6 @@ static unsigned long mem_cgroup_get_all_
return total;
}
-static struct mem_cgroup init_mem_cgroup;
-
static inline
struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
{
@@ -273,8 +262,7 @@ void mm_free_cgroup(struct mm_struct *mm
static inline int page_cgroup_locked(struct page *page)
{
- return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT,
- &page->page_cgroup);
+ return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
}
static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
@@ -285,8 +273,7 @@ static void page_assign_page_cgroup(stru
struct page_cgroup *page_get_page_cgroup(struct page *page)
{
- return (struct page_cgroup *)
- (page->page_cgroup & ~PAGE_CGROUP_LOCK);
+ return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
}
static void __always_inline lock_page_cgroup(struct page *page)
@@ -308,7 +295,6 @@ static void __always_inline unlock_page_
* A can can detect failure of clearing by following
* clear_page_cgroup(page, pc) == pc
*/
-
static struct page_cgroup *clear_page_cgroup(struct page *page,
struct page_cgroup *pc)
{
@@ -417,6 +403,7 @@ int mem_cgroup_calc_mapped_ratio(struct
rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
return (int)((rss * 100L) / total);
}
+
/*
* This function is called from vmscan.c. In page reclaiming loop. balance
* between active and inactive list is calculated. For memory controller
@@ -480,7 +467,6 @@ long mem_cgroup_calc_reclaim_inactive(st
struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
nr_inactive = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE);
-
return (nr_inactive >> priority);
}
@@ -601,16 +587,11 @@ retry:
rcu_read_lock();
mem = rcu_dereference(mm->mem_cgroup);
/*
- * For every charge from the cgroup, increment reference
- * count
+ * For every charge from the cgroup, increment reference count
*/
css_get(&mem->css);
rcu_read_unlock();
- /*
- * If we created the page_cgroup, we should free it on exceeding
- * the cgroup limit.
- */
while (res_counter_charge(&mem->res, PAGE_SIZE)) {
if (!(gfp_mask & __GFP_WAIT))
goto out;
@@ -619,12 +600,12 @@ retry:
continue;
/*
- * try_to_free_mem_cgroup_pages() might not give us a full
- * picture of reclaim. Some pages are reclaimed and might be
- * moved to swap cache or just unmapped from the cgroup.
- * Check the limit again to see if the reclaim reduced the
- * current usage of the cgroup before giving up
- */
+ * try_to_free_mem_cgroup_pages() might not give us a full
+ * picture of reclaim. Some pages are reclaimed and might be
+ * moved to swap cache or just unmapped from the cgroup.
+ * Check the limit again to see if the reclaim reduced the
+ * current usage of the cgroup before giving up
+ */
if (res_counter_check_under_limit(&mem->res))
continue;
@@ -660,7 +641,6 @@ retry:
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- /* Update statistics vector */
__mem_cgroup_add_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
@@ -673,26 +653,19 @@ err:
return -ENOMEM;
}
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask)
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED);
+ MEM_CGROUP_CHARGE_TYPE_MAPPED);
}
-/*
- * See if the cached pages should be charged at all?
- */
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
- int ret = 0;
if (!mm)
mm = &init_mm;
-
- ret = mem_cgroup_charge_common(page, mm, gfp_mask,
+ return mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_CACHE);
- return ret;
}
/*
@@ -742,11 +715,11 @@ unlock:
* Returns non-zero if a page (under migration) has valid page_cgroup member.
* Refcnt of page_cgroup is incremented.
*/
-
int mem_cgroup_prepare_migration(struct page *page)
{
struct page_cgroup *pc;
int ret = 0;
+
lock_page_cgroup(page);
pc = page_get_page_cgroup(page);
if (pc && atomic_inc_not_zero(&pc->ref_cnt))
@@ -759,28 +732,30 @@ void mem_cgroup_end_migration(struct pag
{
mem_cgroup_uncharge_page(page);
}
+
/*
- * We know both *page* and *newpage* are now not-on-LRU and Pg_locked.
+ * We know both *page* and *newpage* are now not-on-LRU and PG_locked.
* And no race with uncharge() routines because page_cgroup for *page*
* has extra one reference by mem_cgroup_prepare_migration.
*/
-
void mem_cgroup_page_migration(struct page *page, struct page *newpage)
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
unsigned long flags;
struct mem_cgroup_per_zone *mz;
+
retry:
pc = page_get_page_cgroup(page);
if (!pc)
return;
+
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
if (clear_page_cgroup(page, pc) != pc)
goto retry;
- spin_lock_irqsave(&mz->lru_lock, flags);
+ spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_remove_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
@@ -793,7 +768,6 @@ retry:
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
- return;
}
/*
@@ -802,8 +776,7 @@ retry:
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
*/
#define FORCE_UNCHARGE_BATCH (128)
-static void
-mem_cgroup_force_empty_list(struct mem_cgroup *mem,
+static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
int active)
{
@@ -837,27 +810,27 @@ retry:
} else /* being uncharged ? ...do relax */
break;
}
+
spin_unlock_irqrestore(&mz->lru_lock, flags);
if (!list_empty(list)) {
cond_resched();
goto retry;
}
- return;
}
/*
* make mem_cgroup's charge to be 0 if there is no task.
* This enables deleting this mem_cgroup.
*/
-
int mem_cgroup_force_empty(struct mem_cgroup *mem)
{
int ret = -EBUSY;
int node, zid;
+
css_get(&mem->css);
/*
* page reclaim code (kswapd etc..) will move pages between
-` * active_list <-> inactive_list while we don't take a lock.
+ * active_list <-> inactive_list while we don't take a lock.
* So, we have to do loop here until all lists are empty.
*/
while (mem->res.usage > 0) {
@@ -879,8 +852,6 @@ out:
return ret;
}
-
-
int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
{
*tmp = memparse(buf, &buf);
@@ -918,8 +889,7 @@ static ssize_t mem_force_empty_write(str
size_t nbytes, loff_t *ppos)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- int ret;
- ret = mem_cgroup_force_empty(mem);
+ int ret = mem_cgroup_force_empty(mem);
if (!ret)
ret = nbytes;
return ret;
@@ -928,7 +898,6 @@ static ssize_t mem_force_empty_write(str
/*
* Note: This should be removed if cgroup supports write-only file.
*/
-
static ssize_t mem_force_empty_read(struct cgroup *cont,
struct cftype *cft,
struct file *file, char __user *userbuf,
@@ -937,7 +906,6 @@ static ssize_t mem_force_empty_read(stru
return -EINVAL;
}
-
static const struct mem_cgroup_stat_desc {
const char *msg;
u64 unit;
@@ -990,8 +958,6 @@ static int mem_control_stat_open(struct
return single_open(file, mem_control_stat_show, cont);
}
-
-
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -1057,9 +1023,6 @@ static void free_mem_cgroup_per_zone_inf
kfree(mem->info.nodeinfo[node]);
}
-
-static struct mem_cgroup init_mem_cgroup;
-
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
@@ -1149,7 +1112,6 @@ static void mem_cgroup_move_task(struct
out:
mmput(mm);
- return;
}
struct cgroup_subsys mem_cgroup_subsys = {
* [PATCH 10/15] memcg: memcontrol uninlined and static
From: Hugh Dickins @ 2008-02-25 23:44 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, Adrian Bunk, linux-mm
More cleanup to memcontrol.c, this time changing some of the code generated.
Let the compiler decide what to inline (except for page_cgroup_locked which
is only used when CONFIG_DEBUG_VM): the __always_inline on lock_page_cgroup
etc. was quite a waste since bit_spin_lock etc. are inlines in a header file;
made mem_cgroup_force_empty and mem_cgroup_write_strategy static.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/memcontrol.c | 28 +++++++++++-----------------
1 file changed, 11 insertions(+), 17 deletions(-)
--- memcg09/mm/memcontrol.c 2008-02-25 14:06:09.000000000 +0000
+++ memcg10/mm/memcontrol.c 2008-02-25 14:06:12.000000000 +0000
@@ -168,12 +168,12 @@ struct page_cgroup {
#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-static inline int page_cgroup_nid(struct page_cgroup *pc)
+static int page_cgroup_nid(struct page_cgroup *pc)
{
return page_to_nid(pc->page);
}
-static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
{
return page_zonenum(pc->page);
}
@@ -199,14 +199,13 @@ static void mem_cgroup_charge_statistics
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
}
-static inline struct mem_cgroup_per_zone *
+static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
{
- BUG_ON(!mem->info.nodeinfo[nid]);
return &mem->info.nodeinfo[nid]->zoneinfo[zid];
}
-static inline struct mem_cgroup_per_zone *
+static struct mem_cgroup_per_zone *
page_cgroup_zoneinfo(struct page_cgroup *pc)
{
struct mem_cgroup *mem = pc->mem_cgroup;
@@ -231,16 +230,14 @@ static unsigned long mem_cgroup_get_all_
return total;
}
-static inline
-struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
+static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
{
return container_of(cgroup_subsys_state(cont,
mem_cgroup_subsys_id), struct mem_cgroup,
css);
}
-static inline
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
{
return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
struct mem_cgroup, css);
@@ -276,13 +273,12 @@ struct page_cgroup *page_get_page_cgroup
return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
}
-static void __always_inline lock_page_cgroup(struct page *page)
+static void lock_page_cgroup(struct page *page)
{
bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
- VM_BUG_ON(!page_cgroup_locked(page));
}
-static void __always_inline unlock_page_cgroup(struct page *page)
+static void unlock_page_cgroup(struct page *page)
{
bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
}
@@ -741,16 +737,14 @@ void mem_cgroup_end_migration(struct pag
void mem_cgroup_page_migration(struct page *page, struct page *newpage)
{
struct page_cgroup *pc;
- struct mem_cgroup *mem;
- unsigned long flags;
struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
retry:
pc = page_get_page_cgroup(page);
if (!pc)
return;
- mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
if (clear_page_cgroup(page, pc) != pc)
goto retry;
@@ -822,7 +816,7 @@ retry:
* make mem_cgroup's charge to be 0 if there is no task.
* This enables deleting this mem_cgroup.
*/
-int mem_cgroup_force_empty(struct mem_cgroup *mem)
+static int mem_cgroup_force_empty(struct mem_cgroup *mem)
{
int ret = -EBUSY;
int node, zid;
@@ -852,7 +846,7 @@ out:
return ret;
}
-int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
+static int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
{
*tmp = memparse(buf, &buf);
if (*buf != '\0')
* [PATCH 11/15] memcg: remove clear_page_cgroup and atomics
From: Hugh Dickins @ 2008-02-25 23:46 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
Remove clear_page_cgroup: it's an unhelpful helper; see for example how
mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call
it (serious races from that? I'm not sure).
Once that's gone, you can see it's pointless for page_cgroup's ref_cnt
to be atomic: it's always manipulated under lock_page_cgroup, except
where force_empty unilaterally reset it to 0 (and how does uncharge's
atomic_dec_and_test protect against that?).
Simplify this page_cgroup locking: if you've got the lock and the pc
is attached, then the ref_cnt must be positive: VM_BUG_ONs to check
that, and to check that pc->page matches page (we're on the way to
finding why sometimes it doesn't, but this patch doesn't fix that).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
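(The invariant relied on throughout, shown at the charge fast path - a
sketch matching the code below: ref_cnt is only ever read or written
under lock_page_cgroup, so a plain int serves, and while a page_cgroup
is attached its count must be positive:

	lock_page_cgroup(page);
	pc = page_get_page_cgroup(page);
	if (pc) {
		VM_BUG_ON(pc->page != page);
		VM_BUG_ON(pc->ref_cnt <= 0);
		pc->ref_cnt++;	/* serialized by lock_page_cgroup */
	}
	unlock_page_cgroup(page);

which is what makes the atomic_t unnecessary.)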
mm/memcontrol.c | 106 ++++++++++++++++++----------------------------
1 file changed, 43 insertions(+), 63 deletions(-)
--- memcg10/mm/memcontrol.c 2008-02-25 14:06:12.000000000 +0000
+++ memcg11/mm/memcontrol.c 2008-02-25 14:06:16.000000000 +0000
@@ -161,8 +161,7 @@ struct page_cgroup {
struct list_head lru; /* per cgroup LRU list */
struct page *page;
struct mem_cgroup *mem_cgroup;
- atomic_t ref_cnt; /* Helpful when pages move b/w */
- /* mapped and cached states */
+ int ref_cnt; /* cached, mapped, migrating */
int flags;
};
#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
@@ -283,27 +282,6 @@ static void unlock_page_cgroup(struct pa
bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
}
-/*
- * Clear page->page_cgroup member under lock_page_cgroup().
- * If given "pc" value is different from one page->page_cgroup,
- * page->cgroup is not cleared.
- * Returns a value of page->page_cgroup at lock taken.
- * A can can detect failure of clearing by following
- * clear_page_cgroup(page, pc) == pc
- */
-static struct page_cgroup *clear_page_cgroup(struct page *page,
- struct page_cgroup *pc)
-{
- struct page_cgroup *ret;
- /* lock and clear */
- lock_page_cgroup(page);
- ret = page_get_page_cgroup(page);
- if (likely(ret == pc))
- page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
- return ret;
-}
-
static void __mem_cgroup_remove_list(struct page_cgroup *pc)
{
int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
@@ -555,15 +533,12 @@ retry:
* the page has already been accounted.
*/
if (pc) {
- if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
- /* this page is under being uncharged ? */
- unlock_page_cgroup(page);
- cpu_relax();
- goto retry;
- } else {
- unlock_page_cgroup(page);
- goto done;
- }
+ VM_BUG_ON(pc->page != page);
+ VM_BUG_ON(pc->ref_cnt <= 0);
+
+ pc->ref_cnt++;
+ unlock_page_cgroup(page);
+ goto done;
}
unlock_page_cgroup(page);
@@ -612,7 +587,7 @@ retry:
congestion_wait(WRITE, HZ/10);
}
- atomic_set(&pc->ref_cnt, 1);
+ pc->ref_cnt = 1;
pc->mem_cgroup = mem;
pc->page = page;
pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
@@ -683,24 +658,24 @@ void mem_cgroup_uncharge_page(struct pag
if (!pc)
goto unlock;
- if (atomic_dec_and_test(&pc->ref_cnt)) {
- page = pc->page;
- mz = page_cgroup_zoneinfo(pc);
- /*
- * get page->cgroup and clear it under lock.
- * force_empty can drop page->cgroup without checking refcnt.
- */
+ VM_BUG_ON(pc->page != page);
+ VM_BUG_ON(pc->ref_cnt <= 0);
+
+ if (--(pc->ref_cnt) == 0) {
+ page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);
- if (clear_page_cgroup(page, pc) == pc) {
- mem = pc->mem_cgroup;
- css_put(&mem->css);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- kfree(pc);
- }
- lock_page_cgroup(page);
+
+ mem = pc->mem_cgroup;
+ css_put(&mem->css);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+
+ mz = page_cgroup_zoneinfo(pc);
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ __mem_cgroup_remove_list(pc);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+ kfree(pc);
+ return;
}
unlock:
@@ -714,14 +689,13 @@ unlock:
int mem_cgroup_prepare_migration(struct page *page)
{
struct page_cgroup *pc;
- int ret = 0;
lock_page_cgroup(page);
pc = page_get_page_cgroup(page);
- if (pc && atomic_inc_not_zero(&pc->ref_cnt))
- ret = 1;
+ if (pc)
+ pc->ref_cnt++;
unlock_page_cgroup(page);
- return ret;
+ return pc != NULL;
}
void mem_cgroup_end_migration(struct page *page)
@@ -740,15 +714,17 @@ void mem_cgroup_page_migration(struct pa
struct mem_cgroup_per_zone *mz;
unsigned long flags;
-retry:
+ lock_page_cgroup(page);
pc = page_get_page_cgroup(page);
- if (!pc)
+ if (!pc) {
+ unlock_page_cgroup(page);
return;
+ }
- mz = page_cgroup_zoneinfo(pc);
- if (clear_page_cgroup(page, pc) != pc)
- goto retry;
+ page_assign_page_cgroup(page, NULL);
+ unlock_page_cgroup(page);
+ mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_remove_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
@@ -794,15 +770,19 @@ retry:
while (--count && !list_empty(list)) {
pc = list_entry(list->prev, struct page_cgroup, lru);
page = pc->page;
- /* Avoid race with charge */
- atomic_set(&pc->ref_cnt, 0);
- if (clear_page_cgroup(page, pc) == pc) {
+ lock_page_cgroup(page);
+ if (page_get_page_cgroup(page) == pc) {
+ page_assign_page_cgroup(page, NULL);
+ unlock_page_cgroup(page);
css_put(&mem->css);
res_counter_uncharge(&mem->res, PAGE_SIZE);
__mem_cgroup_remove_list(pc);
kfree(pc);
- } else /* being uncharged ? ...do relax */
+ } else {
+ /* racing uncharge: let page go then retry */
+ unlock_page_cgroup(page);
break;
+ }
}
spin_unlock_irqrestore(&mz->lru_lock, flags);
* [PATCH 12/15] memcg: css_put after remove_list
From: Hugh Dickins @ 2008-02-25 23:47 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging
from it, and before removing page_cgroup from one of its lru lists: isn't
there a danger that struct mem_cgroup memory could be freed and reused
before completing that, so corrupting something? Never seen it, and
for all I know there may be other constraints which make it impossible;
but let's be defensive and reverse the ordering there.
mem_cgroup_force_empty_list is safe because there's an extra css_get
around all its works; but even so, change its ordering the same way
round, to help get in the habit of doing it like this.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/memcontrol.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
--- memcg11/mm/memcontrol.c 2008-02-25 14:06:16.000000000 +0000
+++ memcg12/mm/memcontrol.c 2008-02-25 14:06:21.000000000 +0000
@@ -665,15 +665,15 @@ void mem_cgroup_uncharge_page(struct pag
page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);
- mem = pc->mem_cgroup;
- css_put(&mem->css);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
-
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_remove_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
+ mem = pc->mem_cgroup;
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
+
kfree(pc);
return;
}
@@ -774,9 +774,9 @@ retry:
if (page_get_page_cgroup(page) == pc) {
page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);
- css_put(&mem->css);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
__mem_cgroup_remove_list(pc);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
kfree(pc);
} else {
/* racing uncharge: let page go then retry */
* [PATCH 13/15] memcg: fix mem_cgroup_move_lists locking
From: Hugh Dickins @ 2008-02-25 23:49 UTC
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state)
went into page freeing, I've hit it from time to time in testing on some
machines, sometimes only after many days. Recently found a machine which
could usually produce it within a few hours, which got me there at last.
The culprit is mem_cgroup_move_lists, whose locking is inadequate; and
the arrangement of structures was such that you got page_cgroups from
the lru list neatly put on to SLUB's freelist. Kamezawa-san identified
the same hole independently.
The main problem was that it was missing the lock_page_cgroup it needs
to safely page_get_page_cgroup; but it's tricky to go beyond that too,
and I couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected.
See the code for comments on the constraints.
This patch immediately gets replaced by a simpler one from Hirokazu-san;
but is it just foolish pride that tells me to put this one on record,
in case we need to come back to it later?
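(A deliberately simplified userspace model of the pattern used below,
pthreads rather than the kernel's bit spinlocks, all names invented;
the real constraints are subtler, see the comments in the diff:
trylock the object lock, pin the object with a reference, drop the
lock, take the other lock, then revalidate before acting, since the
object may have been reassigned in the window.)

#include <pthread.h>
#include <stdio.h>

struct group {
	int refs;
	pthread_mutex_t lru_lock;
};

static pthread_mutex_t obj_lock = PTHREAD_MUTEX_INITIALIZER;
static struct group *obj_group;		/* protected by obj_lock */

static void move_obj(void)
{
	struct group *grp;

	if (pthread_mutex_trylock(&obj_lock))
		return;			/* contended: harmless to give up */
	grp = obj_group;
	if (!grp) {
		pthread_mutex_unlock(&obj_lock);
		return;
	}
	grp->refs++;			/* pin the group: like css_get */
	pthread_mutex_unlock(&obj_lock);

	pthread_mutex_lock(&grp->lru_lock);
	if (obj_group == grp) {		/* revalidate: like the pc/mem check */
		/* safe to move the object on grp's list here */
	}
	pthread_mutex_unlock(&grp->lru_lock);
	grp->refs--;			/* unpin: like css_put */
}

int main(void)
{
	struct group g;

	g.refs = 1;
	pthread_mutex_init(&g.lru_lock, NULL);
	obj_group = &g;
	move_obj();
	printf("refs back to %d\n", g.refs);
	return 0;
}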
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/memcontrol.c | 49 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 43 insertions(+), 6 deletions(-)
--- memcg12/mm/memcontrol.c 2008-02-25 14:06:21.000000000 +0000
+++ memcg13/mm/memcontrol.c 2008-02-25 14:06:25.000000000 +0000
@@ -277,6 +277,11 @@ static void lock_page_cgroup(struct page
bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
}
+static int try_lock_page_cgroup(struct page *page)
+{
+ return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
static void unlock_page_cgroup(struct page *page)
{
bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
@@ -348,17 +353,49 @@ int task_in_mem_cgroup(struct task_struc
void mem_cgroup_move_lists(struct page *page, bool active)
{
struct page_cgroup *pc;
+ struct mem_cgroup *mem;
struct mem_cgroup_per_zone *mz;
unsigned long flags;
- pc = page_get_page_cgroup(page);
- if (!pc)
+ /*
+ * We cannot lock_page_cgroup while holding zone's lru_lock,
+ * because other holders of lock_page_cgroup can be interrupted
+ * with an attempt to rotate_reclaimable_page. But we cannot
+ * safely get to page_cgroup without it, so just try_lock it:
+ * mem_cgroup_isolate_pages allows for page left on wrong list.
+ */
+ if (!try_lock_page_cgroup(page))
return;
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, active);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ /*
+ * Now page_cgroup is stable, but we cannot acquire mz->lru_lock
+ * while holding it, because mem_cgroup_force_empty_list does the
+ * reverse. Get a hold on the mem_cgroup before unlocking, so that
+ * the zoneinfo remains stable, then take mz->lru_lock; then check
+ * that page still points to pc and pc (even if freed and reassigned
+ * to that same page meanwhile) still points to the same mem_cgroup.
+ * Then we know mz still points to the right spinlock, so it's safe
+ * to move_lists (page->page_cgroup might be reset while we do so, but
+ * that doesn't matter: pc->page is stable till we drop mz->lru_lock).
+ * We're being a little naughty not to try_lock_page_cgroup again
+ * inside there, but we are safe, aren't we? Aren't we? Whistle...
+ */
+ pc = page_get_page_cgroup(page);
+ if (pc) {
+ mem = pc->mem_cgroup;
+ mz = page_cgroup_zoneinfo(pc);
+ css_get(&mem->css);
+
+ unlock_page_cgroup(page);
+
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ if (page_get_page_cgroup(page) == pc && pc->mem_cgroup == mem)
+ __mem_cgroup_move_lists(pc, active);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+ css_put(&mem->css);
+ } else
+ unlock_page_cgroup(page);
}
/*
* [PATCH 14/15] memcg: simplify force_empty and move_lists
2008-02-25 23:34 [PATCH 00/15] memcg: fixes and cleanups Hugh Dickins
` (12 preceding siblings ...)
2008-02-25 23:49 ` [PATCH 13/15] memcg: fix mem_cgroup_move_lists locking Hugh Dickins
@ 2008-02-25 23:50 ` Hugh Dickins, Hirokazu Takahashi
2008-02-26 1:48 ` KAMEZAWA Hiroyuki
2008-02-25 23:51 ` [PATCH 15/15] memcg: fix oops on NULL lru list Hugh Dickins
2008-02-26 1:26 ` [PATCH 00/15] memcg: fixes and cleanups KAMEZAWA Hiroyuki
15 siblings, 1 reply; 50+ messages in thread
From: Hugh Dickins, Hirokazu Takahashi @ 2008-02-25 23:50 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
As for force_empty, though this may not be the main topic here,
mem_cgroup_force_empty_list() can be implemented more simply.
It is possible to make the function just call mem_cgroup_uncharge_page()
instead of releasing page_cgroups by itself. The trick is to call get_page()
before invoking mem_cgroup_uncharge_page(), so the page won't be released
during this function (a sketch of the idea follows below).
Kamezawa-san points out that by the time mem_cgroup_uncharge_page()
uncharges, the page might have been reassigned to an lru of a different
mem_cgroup, and now be emptied from that; but Hugh claims that's okay:
the end state is the same as if it hadn't gone to another list.
And once force_empty stops taking lock_page_cgroup within mz->lru_lock,
mem_cgroup_move_lists() can be simplified to take mz->lru_lock directly
while holding page_cgroup lock (but still has to use try_lock_page_cgroup).
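(Here is that sketch: a tiny userspace model, invented names, of why
the get_page() matters. The temporary reference keeps the refcount
from reaching zero while the risky operation runs, so it's our own
final put which may free the page, not something underneath the call.)

#include <stdio.h>
#include <stdlib.h>

struct fake_page {
	int count;
};

static void put_fake_page(struct fake_page *p)
{
	if (--p->count == 0) {
		printf("freeing page\n");
		free(p);
	}
}

/* May drop another reference to the page, as uncharging can indirectly
 * do once force_empty has interfered with the ref_cnt. */
static void risky_operation(struct fake_page *p)
{
	put_fake_page(p);
}

int main(void)
{
	struct fake_page *p = malloc(sizeof(*p));

	p->count = 1;
	p->count++;			/* get_page(): pin across the call */
	risky_operation(p);		/* cannot free: we hold a reference */
	put_fake_page(p);		/* ours may now be the final put */
	return 0;
}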
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/memcontrol.c | 62 +++++++++-------------------------------------
1 file changed, 13 insertions(+), 49 deletions(-)
--- memcg13/mm/memcontrol.c 2008-02-25 14:06:25.000000000 +0000
+++ memcg14/mm/memcontrol.c 2008-02-25 14:06:28.000000000 +0000
@@ -353,7 +353,6 @@ int task_in_mem_cgroup(struct task_struc
void mem_cgroup_move_lists(struct page *page, bool active)
{
struct page_cgroup *pc;
- struct mem_cgroup *mem;
struct mem_cgroup_per_zone *mz;
unsigned long flags;
@@ -367,35 +366,14 @@ void mem_cgroup_move_lists(struct page *
if (!try_lock_page_cgroup(page))
return;
- /*
- * Now page_cgroup is stable, but we cannot acquire mz->lru_lock
- * while holding it, because mem_cgroup_force_empty_list does the
- * reverse. Get a hold on the mem_cgroup before unlocking, so that
- * the zoneinfo remains stable, then take mz->lru_lock; then check
- * that page still points to pc and pc (even if freed and reassigned
- * to that same page meanwhile) still points to the same mem_cgroup.
- * Then we know mz still points to the right spinlock, so it's safe
- * to move_lists (page->page_cgroup might be reset while we do so, but
- * that doesn't matter: pc->page is stable till we drop mz->lru_lock).
- * We're being a little naughty not to try_lock_page_cgroup again
- * inside there, but we are safe, aren't we? Aren't we? Whistle...
- */
pc = page_get_page_cgroup(page);
if (pc) {
- mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
- css_get(&mem->css);
-
- unlock_page_cgroup(page);
-
spin_lock_irqsave(&mz->lru_lock, flags);
- if (page_get_page_cgroup(page) == pc && pc->mem_cgroup == mem)
- __mem_cgroup_move_lists(pc, active);
+ __mem_cgroup_move_lists(pc, active);
spin_unlock_irqrestore(&mz->lru_lock, flags);
-
- css_put(&mem->css);
- } else
- unlock_page_cgroup(page);
+ }
+ unlock_page_cgroup(page);
}
/*
@@ -789,7 +767,7 @@ static void mem_cgroup_force_empty_list(
{
struct page_cgroup *pc;
struct page *page;
- int count;
+ int count = FORCE_UNCHARGE_BATCH;
unsigned long flags;
struct list_head *list;
@@ -798,35 +776,21 @@ static void mem_cgroup_force_empty_list(
else
list = &mz->inactive_list;
- if (list_empty(list))
- return;
-retry:
- count = FORCE_UNCHARGE_BATCH;
spin_lock_irqsave(&mz->lru_lock, flags);
-
- while (--count && !list_empty(list)) {
+ while (!list_empty(list)) {
pc = list_entry(list->prev, struct page_cgroup, lru);
page = pc->page;
- lock_page_cgroup(page);
- if (page_get_page_cgroup(page) == pc) {
- page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
- __mem_cgroup_remove_list(pc);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
- kfree(pc);
- } else {
- /* racing uncharge: let page go then retry */
- unlock_page_cgroup(page);
- break;
+ get_page(page);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ mem_cgroup_uncharge_page(page);
+ put_page(page);
+ if (--count <= 0) {
+ count = FORCE_UNCHARGE_BATCH;
+ cond_resched();
}
+ spin_lock_irqsave(&mz->lru_lock, flags);
}
-
spin_unlock_irqrestore(&mz->lru_lock, flags);
- if (!list_empty(list)) {
- cond_resched();
- goto retry;
- }
}
/*
* [PATCH 15/15] memcg: fix oops on NULL lru list
2008-02-25 23:34 [PATCH 00/15] memcg: fixes and cleanups Hugh Dickins
` (13 preceding siblings ...)
2008-02-25 23:50 ` [PATCH 14/15] memcg: simplify force_empty and move_lists Hugh Dickins, Hirokazu Takahashi
@ 2008-02-25 23:51 ` Hugh Dickins
2008-02-26 1:26 ` [PATCH 00/15] memcg: fixes and cleanups KAMEZAWA Hiroyuki
15 siblings, 0 replies; 50+ messages in thread
From: Hugh Dickins @ 2008-02-25 23:51 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
While testing force_empty, during an exit_mmap, __mem_cgroup_remove_list
called from mem_cgroup_uncharge_page oopsed on a NULL pointer in the lru
list. I couldn't see what racing tasks on other cpus were doing, but
surmise that another must have been in mem_cgroup_charge_common on the
same page, between its unlock_page_cgroup and spin_lock_irqsave near
done (thanks to that kzalloc which I'd almost changed to a kmalloc).
Normally such a race cannot happen, the ref_cnt prevents it, the final
uncharge cannot race with the initial charge. But force_empty buggers
the ref_cnt, that's what it's all about; and thereafter forced pages
are vulnerable to races such as this (just think of a shared page
also mapped into an mm of another mem_cgroup than that just emptied).
And they remain vulnerable until they're finally freed, perhaps indefinitely later.
This patch just fixes the oops by moving the unlock_page_cgroup calls
down below the adding to and removing from the list (only possible given the
previous patch); and while we're at it, we might as well make it an
invariant that page->page_cgroup is always set while pc is on lru.
But this behaviour of force_empty seems highly unsatisfactory to me:
why have a ref_cnt if we always have to cope with it being violated
(as in the earlier page migration patch)? We may prefer force_empty
to move pages to an orphan mem_cgroup (could be the root, but better
not), from which other cgroups could recover them; we might need to
reverse the locking again; but no time now for such concerns.
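(For clarity, a minimal userspace sketch, pthreads and invented names,
of the invariant this patch establishes: the pointer assignment and
the list membership now change together inside one critical section,
so no observer can catch them disagreeing.)

#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int on_list;			/* models pc on mz's lru list */
static void *cgroup_ptr;		/* models page->page_cgroup */

static void attach(void *pc)
{
	pthread_mutex_lock(&lock);
	cgroup_ptr = pc;		/* assign... */
	on_list = 1;			/* ...and add to list, one lock */
	pthread_mutex_unlock(&lock);
}

static void detach(void)
{
	pthread_mutex_lock(&lock);
	on_list = 0;			/* remove from list... */
	cgroup_ptr = NULL;		/* ...and clear, before unlocking */
	pthread_mutex_unlock(&lock);
}

static void check(void)
{
	pthread_mutex_lock(&lock);
	assert(on_list == (cgroup_ptr != NULL));	/* the invariant */
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	int dummy;

	attach(&dummy);
	check();
	detach();
	check();
	return 0;
}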
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/memcontrol.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
--- memcg14/mm/memcontrol.c 2008-02-25 14:06:28.000000000 +0000
+++ memcg15/mm/memcontrol.c 2008-02-25 14:06:33.000000000 +0000
@@ -623,13 +623,13 @@ retry:
goto retry;
}
page_assign_page_cgroup(page, pc);
- unlock_page_cgroup(page);
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
+ unlock_page_cgroup(page);
done:
return 0;
out:
@@ -677,14 +677,14 @@ void mem_cgroup_uncharge_page(struct pag
VM_BUG_ON(pc->ref_cnt <= 0);
if (--(pc->ref_cnt) == 0) {
- page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
-
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_remove_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
+ page_assign_page_cgroup(page, NULL);
+ unlock_page_cgroup(page);
+
mem = pc->mem_cgroup;
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
@@ -736,23 +736,24 @@ void mem_cgroup_page_migration(struct pa
return;
}
- page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
-
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_remove_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
+ page_assign_page_cgroup(page, NULL);
+ unlock_page_cgroup(page);
+
pc->page = newpage;
lock_page_cgroup(newpage);
page_assign_page_cgroup(newpage, pc);
- unlock_page_cgroup(newpage);
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+ unlock_page_cgroup(newpage);
}
/*
* Re: [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup
2008-02-25 23:35 ` [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup Hugh Dickins
@ 2008-02-26 0:39 ` David Rientjes
2008-02-26 3:27 ` Hugh Dickins
2008-02-26 2:41 ` Balbir Singh
` (2 subsequent siblings)
3 siblings, 1 reply; 50+ messages in thread
From: David Rientjes @ 2008-02-26 0:39 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki,
Hirokazu Takahashi, YAMAMOTO Takashi, Linus Torvalds, linux-mm
On Mon, 25 Feb 2008, Hugh Dickins wrote:
> vm_match_cgroup is a perverse name for a macro to match mm with cgroup:
> rename it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
+torvalds, who suggested the vm_match_cgroup name.
David
* Re: [PATCH 00/15] memcg: fixes and cleanups
2008-02-25 23:34 [PATCH 00/15] memcg: fixes and cleanups Hugh Dickins
` (14 preceding siblings ...)
2008-02-25 23:51 ` [PATCH 15/15] memcg: fix oops on NULL lru list Hugh Dickins
@ 2008-02-26 1:26 ` KAMEZAWA Hiroyuki
15 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 1:26 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Mon, 25 Feb 2008 23:34:04 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> Here's my series of fifteen mem_cgroup patches against 2.6.25-rc3:
> fixes to bugs I've found and cleanups made on the way to those fixes.
> I believe we'll want these in 2.6.25-rc4. Mostly they're what was
> included in the rollup I posted on Friday, but 05/15 and 15/15 fix
> (pre-existing) page migration and force_empty bugs I found after that.
> I'm not at all happy with force_empty, 15/15 discusses it a little.
>
Hi, thank you.
I'll work on this.
Regards,
-Kame
* Re: [PATCH 05/15] memcg: fix VM_BUG_ON from page migration
2008-02-25 23:39 ` [PATCH 05/15] memcg: fix VM_BUG_ON from page migration Hugh Dickins
@ 2008-02-26 1:30 ` KAMEZAWA Hiroyuki
2008-02-27 5:52 ` Balbir Singh
1 sibling, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 1:30 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Mon, 25 Feb 2008 23:39:23 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
> remove_migration_pte was calling mem_cgroup_charge on the new page whenever
> it found a swap pte, before it had determined it to be a migration entry.
> That left a surplus reference count on the page_cgroup, so it was still
> attached when the page was later freed.
>
> Move that mem_cgroup_charge down to where we're sure it's a migration entry.
> We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
> already inappropriate: change that to GFP_ATOMIC.
>
> It's essential that remove_migration_pte removes all the migration entries,
> other crashes follow if not. So proceed even when the charge fails: normally
> it cannot, but after a mem_cgroup_force_empty it might - comment in the code.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
Makes sense.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 07/15] memcg: mem_cgroup_charge never NULL
2008-02-25 23:41 ` [PATCH 07/15] memcg: mem_cgroup_charge never NULL Hugh Dickins
@ 2008-02-26 1:32 ` KAMEZAWA Hiroyuki
2008-02-27 8:42 ` Balbir Singh
1 sibling, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 1:32 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Mon, 25 Feb 2008 23:41:17 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling
> to mem_cgroup_charge_common. It seemed convenient at the time, but hard
> to justify now: there's a perfectly appropriate swappage to charge and
> uncharge instead, this is not on any hot path through shmem_getpage,
> and no performance hit was observed from the slight extra overhead.
>
> So revert that NULL page handling from mem_cgroup_charge_common; and
> make it clearer by bringing page_cgroup_assign_new_page_cgroup into its
> body - that was a helper I found more of a hindrance to understanding.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
This is welcome.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 08/15] memcg: remove mem_cgroup_uncharge
2008-02-25 23:42 ` [PATCH 08/15] memcg: remove mem_cgroup_uncharge Hugh Dickins
@ 2008-02-26 1:34 ` KAMEZAWA Hiroyuki
2008-02-28 18:22 ` Balbir Singh
1 sibling, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 1:34 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Mon, 25 Feb 2008 23:42:05 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> Nothing uses mem_cgroup_uncharge apart from mem_cgroup_uncharge_page,
> (a trivial wrapper around it) and mem_cgroup_end_migration (which does
> the same as mem_cgroup_uncharge_page). And it often ends up having to
> lock just to let its caller unlock. Remove it (but leave the silly
> locking until a later patch).
>
> Moved mem_cgroup_cache_charge next to mem_cgroup_charge in memcontrol.h.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Hmm, ok.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 10/15] memcg: memcontrol uninlined and static
2008-02-25 23:44 ` [PATCH 10/15] memcg: memcontrol uninlined and static Hugh Dickins
@ 2008-02-26 1:36 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 1:36 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
Adrian Bunk, linux-mm
On Mon, 25 Feb 2008 23:44:44 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> More cleanup to memcontrol.c, this time changing some of the code generated.
> Let the compiler decide what to inline (except for page_cgroup_locked which
> is only used when CONFIG_DEBUG_VM): the __always_inline on lock_page_cgroup
> etc. was quite a waste since bit_spin_lock etc. are inlines in a header file;
> made mem_cgroup_force_empty and mem_cgroup_write_strategy static.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 11/15] memcg: remove clear_page_cgroup and atomics
2008-02-25 23:46 ` [PATCH 11/15] memcg: remove clear_page_cgroup and atomics Hugh Dickins
@ 2008-02-26 1:38 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 1:38 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Mon, 25 Feb 2008 23:46:22 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> Remove clear_page_cgroup: it's an unhelpful helper, see for example how
> mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call
> it (serious races from that? I'm not sure).
>
> Once that's gone, you can see it's pointless for page_cgroup's ref_cnt
> to be atomic: it's always manipulated under lock_page_cgroup, except
> where force_empty unilaterally reset it to 0 (and how does uncharge's
> atomic_dec_and_test protect against that?).
>
> Simplify this page_cgroup locking: if you've got the lock and the pc
> is attached, then the ref_cnt must be positive: VM_BUG_ONs to check
> that, and to check that pc->page matches page (we're on the way to
> finding why sometimes it doesn't, but this patch doesn't fix that).
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
O.K.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 12/15] memcg: css_put after remove_list
2008-02-25 23:47 ` [PATCH 12/15] memcg: css_put after remove_list Hugh Dickins
@ 2008-02-26 1:39 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 1:39 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Mon, 25 Feb 2008 23:47:10 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging
> from it, and before removing page_cgroup from one of its lru lists: isn't
> there a danger that struct mem_cgroup memory could be freed and reused
> before completing that, so corrupting something? Never seen it, and
> for all I know there may be other constraints which make it impossible;
> but let's be defensive and reverse the ordering there.
>
> mem_cgroup_force_empty_list is safe because there's an extra css_get
> around all its works; but even so, change its ordering the same way
> round, to help get in the habit of doing it like this.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
Makes sense.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 13/15] memcg: fix mem_cgroup_move_lists locking
2008-02-25 23:49 ` [PATCH 13/15] memcg: fix mem_cgroup_move_lists locking Hugh Dickins
@ 2008-02-26 1:43 ` KAMEZAWA Hiroyuki
2008-02-26 2:56 ` Hugh Dickins
0 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 1:43 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Mon, 25 Feb 2008 23:49:04 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state)
> went into page freeing, I've hit it from time to time in testing on some
> machines, sometimes only after many days. Recently found a machine which
> could usually produce it within a few hours, which got me there at last.
>
> The culprit is mem_cgroup_move_lists, whose locking is inadequate; and
> the arrangement of structures was such that you got page_cgroups from
> the lru list neatly put on to SLUB's freelist. Kamezawa-san identified
> the same hole independently.
>
> The main problem was that it was missing the lock_page_cgroup it needs
> to safely page_get_page_cgroup; but it's tricky to go beyond that too,
> and I couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected.
> See the code for comments on the constraints.
>
> This patch immediately gets replaced by a simpler one from Hirokazu-san;
> but is it just foolish pride that tells me to put this one on record,
> in case we need to come back to it later?
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
Yes, we need this patch.
BTW, what is "a simpler one from Hirokazu-san"?
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 14/15] memcg: simplify force_empty and move_lists
2008-02-25 23:50 ` [PATCH 14/15] memcg: simplify force_empty and move_lists Hugh Dickins, Hirokazu Takahashi
@ 2008-02-26 1:48 ` KAMEZAWA Hiroyuki
2008-02-26 3:23 ` Hugh Dickins
0 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 1:48 UTC (permalink / raw)
To: Hugh Dickins, Hirokazu Takahashi
Cc: Balbir Singh, Andrew Morton, YAMAMOTO Takashi, linux-mm
Complaining again ;)
On Mon, 25 Feb 2008 23:50:27 +0000 (GMT)
Hugh Dickins <hugh@veritas.com>, Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> + get_page(page);
How about this?
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
local_irq_save(flags);
if (TestSetPageLocked(page)) {
> + mem_cgroup_uncharge_page(page);
> + put_page(page);
> + if (--count <= 0) {
> + count = FORCE_UNCHARGE_BATCH;
> + cond_resched();
> }
> + spin_lock_irqsave(&mz->lru_lock, flags);
unlock_page(page);
}
local_irq_restore(flags);
The page's lock bit guarantees 100% safety against page migration.
(And against most of the other charging/uncharging callers.)
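(For anyone following along: roughly speaking, TestSetPageLocked is an
atomic test-and-set trylock on a page flag bit. A userspace
approximation with C11 atomics, not the kernel implementation, might
look like this:)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define PG_LOCKED_BIT	(1UL << 0)

static atomic_ulong page_flags;

/* Returns true if the page was ALREADY locked (we did not get it),
 * mirroring TestSetPageLocked's return convention. */
static bool test_set_page_locked(void)
{
	return atomic_fetch_or(&page_flags, PG_LOCKED_BIT) & PG_LOCKED_BIT;
}

static void clear_page_locked(void)
{
	atomic_fetch_and(&page_flags, ~PG_LOCKED_BIT);
}

int main(void)
{
	printf("first:  %d\n", test_set_page_locked());	/* 0: acquired */
	printf("second: %d\n", test_set_page_locked());	/* 1: contended */
	clear_page_locked();
	return 0;
}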
Thanks,
-Kame
* Re: [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup
2008-02-25 23:35 ` [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup Hugh Dickins
2008-02-26 0:39 ` David Rientjes
@ 2008-02-26 2:41 ` Balbir Singh
2008-02-26 23:46 ` KAMEZAWA Hiroyuki
2008-02-28 3:47 ` Andrew Morton
3 siblings, 0 replies; 50+ messages in thread
From: Balbir Singh @ 2008-02-26 2:41 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, David Rientjes, linux-mm
Hugh Dickins wrote:
> vm_match_cgroup is a perverse name for a macro to match mm with cgroup:
> rename it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Agreed
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [PATCH 13/15] memcg: fix mem_cgroup_move_lists locking
2008-02-26 1:43 ` KAMEZAWA Hiroyuki
@ 2008-02-26 2:56 ` Hugh Dickins
0 siblings, 0 replies; 50+ messages in thread
From: Hugh Dickins @ 2008-02-26 2:56 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Tue, 26 Feb 2008, KAMEZAWA Hiroyuki wrote:
> On Mon, 25 Feb 2008 23:49:04 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> >
> > This patch immediately gets replaced by a simpler one from Hirokazu-san;
> > but is it just foolish pride that tells me to put this one on record,
> > in case we need to come back to it later?
> >
> > Signed-off-by: Hugh Dickins <hugh@veritas.com>
> > ---
> yes, we need this patch.
>
> BTW, what is "a simpler one from Hirokazu-san" ?
14/15: most of the complexity of this 13/15 is immediately removed in 14/15.
Hugh
* Re: [PATCH 14/15] memcg: simplify force_empty and move_lists
2008-02-26 1:48 ` KAMEZAWA Hiroyuki
@ 2008-02-26 3:23 ` Hugh Dickins
2008-02-26 4:09 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 50+ messages in thread
From: Hugh Dickins @ 2008-02-26 3:23 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Hirokazu Takahashi, Balbir Singh, Andrew Morton, YAMAMOTO Takashi,
linux-mm
On Tue, 26 Feb 2008, KAMEZAWA Hiroyuki wrote:
> > + get_page(page);
> How about this?
> > + spin_unlock_irqrestore(&mz->lru_lock, flags);
> local_irq_save(flags):
> if (TestSetPageLocked(page)) {
I think you meant !TestSetPageLocked ;)
> > + mem_cgroup_uncharge_page(page);
> > + put_page(page);
> > + if (--count <= 0) {
> > + count = FORCE_UNCHARGE_BATCH;
> > + cond_resched();
> > }
> > + spin_lock_irqsave(&mz->lru_lock, flags);
> unlock_page(page);
> }
> local_irq_restore(flags);
>
> page's lock bit guarantees 100% safe against page migration.
> (And most of other charging/uncharging callers.)
That simply doesn't solve any problem I've observed yet. It appears
(so far!) that I can safely run for hours with 1-15/15, doing random
page migrations and force_empties concurrently (commenting out the
EBUSY check on mem->css.cgroup->count).
The problem with force_empty is that it leaves the pages it touched
in a state inconsistent with normality, not that it's racy while it's
touching them.
If your TestSetPageLocked actually solves some problem, we could add
that; though it'd be the first reference to PageLocked in that source
file, and you're adding a long busy loop there while a page is locked
(broken by cond_resched, but still burning cpu). Hmm, on top of that,
add_to_page_cache puts the page on the mem_cgroup lru a few instants
before it does its SetPageLocked. So I'd certainly want you to show
what you're solving with this before we should add it.
Hugh
* Re: [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup
2008-02-26 0:39 ` David Rientjes
@ 2008-02-26 3:27 ` Hugh Dickins
0 siblings, 0 replies; 50+ messages in thread
From: Hugh Dickins @ 2008-02-26 3:27 UTC (permalink / raw)
To: David Rientjes
Cc: Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki,
Hirokazu Takahashi, YAMAMOTO Takashi, Linus Torvalds, linux-mm
On Mon, 25 Feb 2008, David Rientjes wrote:
> On Mon, 25 Feb 2008, Hugh Dickins wrote:
>
> > vm_match_cgroup is a perverse name for a macro to match mm with cgroup:
> > rename it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.
> >
> > Signed-off-by: Hugh Dickins <hugh@veritas.com>
>
> +torvalds, who suggested the vm_match_cgroup name.
Ah, then I apologize for implying that you're a pervert, David:
Linus, well, we all know about him ... ;)
Hugh
* Re: [PATCH 14/15] memcg: simplify force_empty and move_lists
2008-02-26 3:23 ` Hugh Dickins
@ 2008-02-26 4:09 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 4:09 UTC (permalink / raw)
To: Hugh Dickins
Cc: Hirokazu Takahashi, Balbir Singh, Andrew Morton, YAMAMOTO Takashi,
linux-mm
On Tue, 26 Feb 2008 03:23:17 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> On Tue, 26 Feb 2008, KAMEZAWA Hiroyuki wrote:
> > > + get_page(page);
> > How about this?
> > > + spin_unlock_irqrestore(&mz->lru_lock, flags);
> > local_irq_save(flags):
> > if (TestSetPageLocked(page)) {
>
> I think you meant !TestSetPageLocked ;)
>
> > > + mem_cgroup_uncharge_page(page);
> > > + put_page(page);
> > > + if (--count <= 0) {
> > > + count = FORCE_UNCHARGE_BATCH;
> > > + cond_resched();
> > > }
> > > + spin_lock_irqsave(&mz->lru_lock, flags);
> > unlock_page(page);
> > }
> > local_irq_restore(flags);
> >
> > The page's lock bit guarantees 100% safety against page migration.
> > (And against most of the other charging/uncharging callers.)
>
> That simply doesn't solve any problem I've observed yet. It appears
> (so far!) that I can safely run for hours with 1-15/15, doing random
> page migrations and force_empties concurrently (commenting out the
> EBUSY check on mem->css.cgroup->count).
>
> The problem with force_empty is that it leaves the pages it touched
> in a state inconsistent with normality, not that it's racy while it's
> touching them.
>
> If your TestSetPageLocked actually solves some problem, we could add
> that; though it'd be the first reference to PageLocked in that source
> file, and you're adding a long busy loop there while a page is locked
> (broken by cond_resched, but still burning cpu). Hmm, on top of that,
> add_to_page_cache puts the page on the mem_cgroup lru a few instants
> before it does its SetPageLocked. So I'd certainly want you to show
> what you're solving with this before we should add it.
>
The unlock placement was bad ;(
But OK, PageLock is no help here.
BTW, I'm now removing page->page_cgroup, adding a spinlock into
page_cgroup, and changing all the rules, probably adding new races
and breaking something. I'm now rebasing those changes on to your
fixes. I hope for your help again.
Thanks,
-Kame
* Re: [PATCH 02/15] memcg: move_lists on page not page_cgroup
2008-02-25 23:36 ` [PATCH 02/15] memcg: move_lists on page not page_cgroup Hugh Dickins
@ 2008-02-26 15:52 ` Balbir Singh
2008-02-26 23:45 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 50+ messages in thread
From: Balbir Singh @ 2008-02-26 15:52 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
* Hugh Dickins <hugh@veritas.com> [2008-02-25 23:36:20]:
> Each caller of mem_cgroup_move_lists is having to use page_get_page_cgroup:
> it's more convenient if it acts upon the page itself not the page_cgroup;
> and in a later patch this becomes important to handle within memcontrol.c.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [PATCH 03/15] memcg: page_cache_release not __free_page
2008-02-25 23:37 ` [PATCH 03/15] memcg: page_cache_release not __free_page Hugh Dickins
@ 2008-02-26 16:02 ` Balbir Singh
2008-02-26 23:38 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 50+ messages in thread
From: Balbir Singh @ 2008-02-26 16:02 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
* Hugh Dickins <hugh@veritas.com> [2008-02-25 23:37:05]:
> There's nothing wrong with mem_cgroup_charge failure in do_wp_page and
> do_anonymous page using __free_page, but it does look odd when nearby
> code uses page_cache_release: use that instead (while turning a blind
> eye to ancient inconsistencies of page_cache_release versus put_page).
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [PATCH 03/15] memcg: page_cache_release not __free_page
2008-02-25 23:37 ` [PATCH 03/15] memcg: page_cache_release not __free_page Hugh Dickins
2008-02-26 16:02 ` Balbir Singh
@ 2008-02-26 23:38 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 23:38 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Mon, 25 Feb 2008 23:37:05 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> There's nothing wrong with mem_cgroup_charge failure in do_wp_page and
> do_anonymous page using __free_page, but it does look odd when nearby
> code uses page_cache_release: use that instead (while turning a blind
> eye to ancient inconsistencies of page_cache_release versus put_page).
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 04/15] memcg: when do_swap's do_wp_page fails
2008-02-25 23:38 ` [PATCH 04/15] memcg: when do_swap's do_wp_page fails Hugh Dickins
@ 2008-02-26 23:41 ` KAMEZAWA Hiroyuki
2008-02-27 5:08 ` Balbir Singh
1 sibling, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 23:41 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
Nick Piggin, linux-mm
On Mon, 25 Feb 2008 23:38:02 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> Don't uncharge when do_swap_page's call to do_wp_page fails: the page which
> was charged for is there in the pagetable, and will be correctly uncharged
> when that area is unmapped - it was only its COWing which failed.
>
> And while we're here, remove earlier XXX comment: yes, OR in do_wp_page's
> return value (maybe VM_FAULT_WRITE) with do_swap_page's there; but if it
> fails, mask out success bits, which might confuse some arches e.g. sparc.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 06/15] memcg: bad page if page_cgroup when free
2008-02-25 23:40 ` [PATCH 06/15] memcg: bad page if page_cgroup when free Hugh Dickins
@ 2008-02-26 23:44 ` KAMEZAWA Hiroyuki
2008-02-27 8:38 ` Balbir Singh
1 sibling, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 23:44 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Mon, 25 Feb 2008 23:40:14 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a
> "Bad page state" and clear: most users don't have CONFIG_DEBUG_VM on, and
> if it were set here, it'd likely cause corruption when the page is reused.
>
> Don't use page_assign_page_cgroup to clear it: that should be private to
> memcontrol.c, and always called with the lock taken; and memmap_init_zone
> doesn't need it either - like page->mapping and other pointers throughout
> the kernel, Linux assumes pointers in zeroed structures are NULL pointers.
>
> Instead use page_reset_bad_cgroup, added to memcontrol.h for this only.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
Seems reasonable
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 02/15] memcg: move_lists on page not page_cgroup
2008-02-25 23:36 ` [PATCH 02/15] memcg: move_lists on page not page_cgroup Hugh Dickins
2008-02-26 15:52 ` Balbir Singh
@ 2008-02-26 23:45 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 23:45 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
linux-mm
On Mon, 25 Feb 2008 23:36:20 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> Each caller of mem_cgroup_move_lists is having to use page_get_page_cgroup:
> it's more convenient if it acts upon the page itself not the page_cgroup;
> and in a later patch this becomes important to handle within memcontrol.c.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup
2008-02-25 23:35 ` [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup Hugh Dickins
2008-02-26 0:39 ` David Rientjes
2008-02-26 2:41 ` Balbir Singh
@ 2008-02-26 23:46 ` KAMEZAWA Hiroyuki
2008-02-28 3:47 ` Andrew Morton
3 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 23:46 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, Andrew Morton, Hirokazu Takahashi, YAMAMOTO Takashi,
David Rientjes, linux-mm
On Mon, 25 Feb 2008 23:35:33 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> vm_match_cgroup is a perverse name for a macro to match mm with cgroup:
> rename it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
Makes sense.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 04/15] memcg: when do_swap's do_wp_page fails
2008-02-25 23:38 ` [PATCH 04/15] memcg: when do_swap's do_wp_page fails Hugh Dickins
2008-02-26 23:41 ` KAMEZAWA Hiroyuki
@ 2008-02-27 5:08 ` Balbir Singh
2008-02-27 12:57 ` Hugh Dickins
1 sibling, 1 reply; 50+ messages in thread
From: Balbir Singh @ 2008-02-27 5:08 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, Nick Piggin, linux-mm
* Hugh Dickins <hugh@veritas.com> [2008-02-25 23:38:02]:
> Don't uncharge when do_swap_page's call to do_wp_page fails: the page which
> was charged for is there in the pagetable, and will be correctly uncharged
> when that area is unmapped - it was only its COWing which failed.
>
> And while we're here, remove earlier XXX comment: yes, OR in do_wp_page's
> return value (maybe VM_FAULT_WRITE) with do_swap_page's there; but if it
> fails, mask out success bits, which might confuse some arches e.g. sparc.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
>
> mm/memory.c | 9 +++------
> 1 file changed, 3 insertions(+), 6 deletions(-)
>
> --- memcg03/mm/memory.c 2008-02-25 14:05:43.000000000 +0000
> +++ memcg04/mm/memory.c 2008-02-25 14:05:47.000000000 +0000
> @@ -2093,12 +2093,9 @@ static int do_swap_page(struct mm_struct
> unlock_page(page);
>
> if (write_access) {
> - /* XXX: We could OR the do_wp_page code with this one? */
> - if (do_wp_page(mm, vma, address,
> - page_table, pmd, ptl, pte) & VM_FAULT_OOM) {
> - mem_cgroup_uncharge_page(page);
> - ret = VM_FAULT_OOM;
> - }
> + ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
> + if (ret & VM_FAULT_ERROR)
> + ret &= VM_FAULT_ERROR;
> goto out;
> }
>
Looks good to me. Do you think we could add some of the description
from above as a comment in the code? People would not have to look at
the git log to understand why we did not uncharge.
Otherwise, it looks very good
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [PATCH 05/15] memcg: fix VM_BUG_ON from page migration
2008-02-25 23:39 ` [PATCH 05/15] memcg: fix VM_BUG_ON from page migration Hugh Dickins
2008-02-26 1:30 ` KAMEZAWA Hiroyuki
@ 2008-02-27 5:52 ` Balbir Singh
2008-02-27 13:23 ` Hugh Dickins
1 sibling, 1 reply; 50+ messages in thread
From: Balbir Singh @ 2008-02-27 5:52 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
* Hugh Dickins <hugh@veritas.com> [2008-02-25 23:39:23]:
> Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
> remove_migration_pte was calling mem_cgroup_charge on the new page whenever
> it found a swap pte, before it had determined it to be a migration entry.
> That left a surplus reference count on the page_cgroup, so it was still
> attached when the page was later freed.
>
> Move that mem_cgroup_charge down to where we're sure it's a migration entry.
> We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
> already inappropriate: change that to GFP_ATOMIC.
>
One side effect I see of this patch is that the page_cgroup lock and
the lru_lock can now be taken from within i_mmap_lock or
anon_vma->lock.
> It's essential that remove_migration_pte removes all the migration entries,
> other crashes follow if not. So proceed even when the charge fails: normally
> it cannot, but after a mem_cgroup_force_empty it might - comment in the code.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
>
> mm/migrate.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> --- memcg04/mm/migrate.c 2008-02-11 07:18:12.000000000 +0000
> +++ memcg05/mm/migrate.c 2008-02-25 14:05:50.000000000 +0000
> @@ -153,11 +153,6 @@ static void remove_migration_pte(struct
> return;
> }
>
> - if (mem_cgroup_charge(new, mm, GFP_KERNEL)) {
> - pte_unmap(ptep);
> - return;
> - }
> -
> ptl = pte_lockptr(mm, pmd);
> spin_lock(ptl);
> pte = *ptep;
> @@ -169,6 +164,20 @@ static void remove_migration_pte(struct
> if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
> goto out;
Is it not easier to uncharge here than to move the charging to the
context below? Do you suspect this will be a common operation (so we
might end up charging/uncharging more frequently)?
>
> + /*
> + * Yes, ignore the return value from a GFP_ATOMIC mem_cgroup_charge.
> + * Failure is not an option here: we're now expected to remove every
> + * migration pte, and will cause crashes otherwise. Normally this
> + * is not an issue: mem_cgroup_prepare_migration bumped up the old
> + * page_cgroup count for safety, that's now attached to the new page,
> + * so this charge should just be another incrementation of the count,
> + * to keep in balance with rmap.c's mem_cgroup_uncharging. But if
> + * there's been a force_empty, those reference counts may no longer
> + * be reliable, and this charge can actually fail: oh well, we don't
> + * make the situation any worse by proceeding as if it had succeeded.
> + */
> + mem_cgroup_charge(new, mm, GFP_ATOMIC);
> +
> get_page(new);
> pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
> if (is_write_migration_entry(entry))
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [PATCH 06/15] memcg: bad page if page_cgroup when free
2008-02-25 23:40 ` [PATCH 06/15] memcg: bad page if page_cgroup when free Hugh Dickins
2008-02-26 23:44 ` KAMEZAWA Hiroyuki
@ 2008-02-27 8:38 ` Balbir Singh
1 sibling, 0 replies; 50+ messages in thread
From: Balbir Singh @ 2008-02-27 8:38 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
* Hugh Dickins <hugh@veritas.com> [2008-02-25 23:40:14]:
> Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a
> "Bad page state" and clear: most users don't have CONFIG_DEBUG_VM on, and
> if it were set here, it'd likely cause corruption when the page is reused.
>
> Don't use page_assign_page_cgroup to clear it: that should be private to
> memcontrol.c, and always called with the lock taken; and memmap_init_zone
> doesn't need it either - like page->mapping and other pointers throughout
> the kernel, Linux assumes pointers in zeroed structures are NULL pointers.
>
> Instead use page_reset_bad_cgroup, added to memcontrol.h for this only.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Looks good to me
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [PATCH 07/15] memcg: mem_cgroup_charge never NULL
2008-02-25 23:41 ` [PATCH 07/15] memcg: mem_cgroup_charge never NULL Hugh Dickins
2008-02-26 1:32 ` KAMEZAWA Hiroyuki
@ 2008-02-27 8:42 ` Balbir Singh
1 sibling, 0 replies; 50+ messages in thread
From: Balbir Singh @ 2008-02-27 8:42 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
* Hugh Dickins <hugh@veritas.com> [2008-02-25 23:41:17]:
> My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling
> to mem_cgroup_charge_common. It seemed convenient at the time, but hard
> to justify now: there's a perfectly appropriate swappage to charge and
> uncharge instead, this is not on any hot path through shmem_getpage,
> and no performance hit was observed from the slight extra overhead.
>
> So revert that NULL page handling from mem_cgroup_charge_common; and
> make it clearer by bringing page_cgroup_assign_new_page_cgroup into its
> body - that was a helper I found more of a hindrance to understanding.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [PATCH 04/15] memcg: when do_swap's do_wp_page fails
2008-02-27 5:08 ` Balbir Singh
@ 2008-02-27 12:57 ` Hugh Dickins
0 siblings, 0 replies; 50+ messages in thread
From: Hugh Dickins @ 2008-02-27 12:57 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, Nick Piggin, linux-mm
On Wed, 27 Feb 2008, Balbir Singh wrote:
> * Hugh Dickins <hugh@veritas.com> [2008-02-25 23:38:02]:
> > Don't uncharge when do_swap_page's call to do_wp_page fails: the page which
> > was charged for is there in the pagetable, and will be correctly uncharged
> > when that area is unmapped - it was only its COWing which failed.
>
> Looks good to me. Do you think we could add some of the description
> from above as a comment in the code? People would not have to look at
> the git log to understand why we did not uncharge.
Sorry to be uncooperative, but I honestly think not. If we put in
a comment everywhere somebody once made a mistake (a temptation at
the time indeed), or everywhere we remove some unnecessary code,
the kernel source will not become more readable. We don't have (and
don't need) a comment there for why there isn't a page_cache_release.
If any comment were needed (but its repetition would become tedious,
I'm happier without it), it's on the mem_cgroup_charges, pointing out
that it's a speculative charge while we're still allowed to sleep for
memory, which the subsequent add_rmap will either preserve or reverse
according to whether this is the first mapping of the page or not.
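(If it helps, a rough userspace sketch, all names invented, of that
speculative-charge pattern: charge while sleeping is still allowed,
then the later rmap step either keeps the charge or reverses it.)

#include <stdbool.h>
#include <stdio.h>

static long charged;

static bool charge(long size)
{
	charged += size;	/* real version may sleep to reclaim here */
	return true;
}

static void uncharge(long size)
{
	charged -= size;
}

static void map_page(bool first_mapping)
{
	if (!charge(4096))
		return;			/* speculative charge failed */
	if (!first_mapping)
		uncharge(4096);		/* page already charged earlier */
}

int main(void)
{
	map_page(true);
	map_page(false);
	printf("charged: %ld\n", charged);	/* 4096: one charge kept */
	return 0;
}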
You know that, and the matching page_add_anon_rmap has clearly been
done above this write_access call to do_wp_page, so the mystery would
be why we should mem_cgroup_uncharge_page there. If do_wp_page were
an unmapping operation, which for some reason didn't uncharge what
it unmapped, then a mem_cgroup_uncharge_page with comment would be
appropriate. But that's not the case.
Hugh
* Re: [PATCH 05/15] memcg: fix VM_BUG_ON from page migration
2008-02-27 5:52 ` Balbir Singh
@ 2008-02-27 13:23 ` Hugh Dickins
2008-02-27 13:43 ` Balbir Singh
0 siblings, 1 reply; 50+ messages in thread
From: Hugh Dickins @ 2008-02-27 13:23 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
On Wed, 27 Feb 2008, Balbir Singh wrote:
> * Hugh Dickins <hugh@veritas.com> [2008-02-25 23:39:23]:
>
> > Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
> > remove_migration_pte was calling mem_cgroup_charge on the new page whenever
> > it found a swap pte, before it had determined it to be a migration entry.
> > That left a surplus reference count on the page_cgroup, so it was still
> > attached when the page was later freed.
> >
> > Move that mem_cgroup_charge down to where we're sure it's a migration entry.
> > We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
> > already inappropriate: change that to GFP_ATOMIC.
> >
>
> One side effect I see of this patch is that the page_cgroup lock and
> the lru_lock can now be taken from within i_mmap_lock or
> anon_vma->lock.
That's not a side-effect of this patch, but it is something which was
already being done there, and you're absolutely right to draw attention
to it, thank you.
Although I mentioned they were held in the comment, it hadn't really
dawned on me how unwelcome that is: it's not a violation, and lockdep
doesn't protest, but we'd all be happier not to interweave those
otherwise independent locks in that one place.
Oh, hold on, no, it's not that one place. It's a well-established
nesting of locks, as when mem_cgroup_uncharge_page is called by
page_remove_rmap from try_to_unmap_one. Panic over! But we'd
better add memcontrol's locks to the hierarchies shown in
filemap.c and in rmap.c.
> > - if (mem_cgroup_charge(new, mm, GFP_KERNEL)) {
> > - pte_unmap(ptep);
> > - return;
> > - }
> > -
> > ptl = pte_lockptr(mm, pmd);
> > spin_lock(ptl);
> > pte = *ptep;
> > @@ -169,6 +164,20 @@ static void remove_migration_pte(struct
> > if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
> > goto out;
>
> Is it not easier to uncharge here than to move the charging to the
> context below? Do you suspect this will be a common operation (so we
> might end up charging/uncharging more frequently)?
In what way would it be easier to charge too early, then uncharge
when we find it was wrong, than simply to charge once we know it's
right, as the patch does?
If we were not already in atomic context, it would definitely be
better to do it the way you suggest, with GFP_KERNEL not GFP_ATOMIC;
but we're already in atomic context, so I cannot see any advantage
to doing it your way.
What would be a real improvement is a way of doing it outside the
atomic context: I've given that little thought, but it's not obvious
how. And really the wider issue of force_empty inconsistencies is
more important than this singular wart in page migration.
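For reference, the reordered path now has this shape (simplified from
the patch, with error handling elided):

	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	pte = *ptep;
	if (!is_swap_pte(pte))
		goto out;

	entry = pte_to_swp_entry(pte);
	if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
		goto out;

	/*
	 * Only now is it certain to be a migration entry; and since
	 * i_mmap_lock or anon_vma->lock is already held here, the
	 * charge must be GFP_ATOMIC rather than GFP_KERNEL.
	 */
	mem_cgroup_charge(new, mm, GFP_ATOMIC);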
Hugh
* Re: [PATCH 05/15] memcg: fix VM_BUG_ON from page migration
2008-02-27 13:23 ` Hugh Dickins
@ 2008-02-27 13:43 ` Balbir Singh
0 siblings, 0 replies; 50+ messages in thread
From: Balbir Singh @ 2008-02-27 13:43 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
Hugh Dickins wrote:
> On Wed, 27 Feb 2008, Balbir Singh wrote:
>> * Hugh Dickins <hugh@veritas.com> [2008-02-25 23:39:23]:
>>
>>> Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
>>> remove_migration_pte was calling mem_cgroup_charge on the new page whenever
>>> it found a swap pte, before it had determined it to be a migration entry.
>>> That left a surplus reference count on the page_cgroup, so it was still
>>> attached when the page was later freed.
>>>
>>> Move that mem_cgroup_charge down to where we're sure it's a migration entry.
>>> We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
>>> already inappropriate: change that to GFP_ATOMIC.
>>>
>> One side effect I see of this patch is that the page_cgroup lock and
>> the lru_lock can now be taken from within i_mmap_lock or
>> anon_vma->lock.
>
> That's not a side-effect of this patch, but it is something which was
> already being done there, and you're absolutely right to draw attention
> to it, thank you.
>
> Although I mentioned in the comment that they were held, it hadn't
> really dawned on me how unwelcome that is: it's not a violation, and
> lockdep doesn't protest, but we'd all be happier not to interweave
> those otherwise independent locks in that one place.
>
> Oh, hold on, no, it's not that one place. It's a well-established
> nesting of locks, as when mem_cgroup_uncharge_page is called by
> page_remove_rmap from try_to_unmap_one. Panic over! But we'd
> better add memcontrol's locks to the hierarchies shown in
> filemap.c and in rmap.c.
>
Yes, I agree. I also agree it is not a side-effect, and that we're
already doing that; but it is well worth documenting in the places
you've mentioned.
>>> - if (mem_cgroup_charge(new, mm, GFP_KERNEL)) {
>>> - pte_unmap(ptep);
>>> - return;
>>> - }
>>> -
>>> ptl = pte_lockptr(mm, pmd);
>>> spin_lock(ptl);
>>> pte = *ptep;
>>> @@ -169,6 +164,20 @@ static void remove_migration_pte(struct
>>> if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
>>> goto out;
>> Is it not easier to uncharge here than to move the charging to the
>> context below? Do you suspect this will be a common operation (so we
>> might end up charging/uncharging more frequently)?
>
> In what way would it be easier to charge too early, then uncharge
> when we find it was wrong, than simply to charge once we know it's
> right, as the patch does?
>
> If we were not already in atomic context, it would definitely be
> better to do it the way you suggest, with GFP_KERNEL not GFP_ATOMIC;
> but we're already in atomic context, so I cannot see any advantage
> to doing it your way.
>
> What would be a real improvement is a way of doing it outside the
> atomic context: I've given that little thought, but it's not obvious
> how. And really the wider issue of force_empty inconsistencies is
> more important than this singular wart in page migration.
Yes, you are absolutely right: we enter this path with the anon_vma lock or the
i_mmap_lock held, so moving it outside does not really help. In fact this patch
fixes the bug where we were using GFP_KERNEL while holding one of those locks.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
PS: I know I've been slow in reviewing the patches. I am trying to run and
compile each patch as I review it. Please bear with me while I catch up with the
series of patches.
* Re: [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup
2008-02-25 23:35 ` [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup Hugh Dickins
` (2 preceding siblings ...)
2008-02-26 23:46 ` KAMEZAWA Hiroyuki
@ 2008-02-28 3:47 ` Andrew Morton
2008-02-28 7:19 ` David Rientjes
3 siblings, 1 reply; 50+ messages in thread
From: Andrew Morton @ 2008-02-28 3:47 UTC (permalink / raw)
To: Hugh Dickins
Cc: Balbir Singh, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, David Rientjes, linux-mm
On Mon, 25 Feb 2008 23:35:33 +0000 (GMT) Hugh Dickins <hugh@veritas.com> wrote:
> -#define vm_match_cgroup(mm, cgroup) \
> +#define mm_match_cgroup(mm, cgroup) \
> ((cgroup) == rcu_dereference((mm)->mem_cgroup))
Could be written in C, methinks.
Unless we really want to be able to pass a `struct page_cgroup *' in place
of arg `mm' here. If we don't want to be able to do that (prays fervently)
then let's sleep happily in the knowledge that the C type system prevents
us from doing it accidentally?
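Something like this, presumably (sketch only, keeping the
rcu_dereference from the macro):

	static inline int mm_match_cgroup(struct mm_struct *mm,
					  struct mem_cgroup *cgroup)
	{
		/* needs struct mm_struct and rcu_dereference() in scope */
		return cgroup == rcu_dereference(mm->mem_cgroup);
	}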
* Re: [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup
2008-02-28 3:47 ` Andrew Morton
@ 2008-02-28 7:19 ` David Rientjes
2008-02-28 7:26 ` Andrew Morton
0 siblings, 1 reply; 50+ messages in thread
From: David Rientjes @ 2008-02-28 7:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Hugh Dickins, Balbir Singh, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
On Wed, 27 Feb 2008, Andrew Morton wrote:
> > -#define vm_match_cgroup(mm, cgroup) \
> > +#define mm_match_cgroup(mm, cgroup) \
> > ((cgroup) == rcu_dereference((mm)->mem_cgroup))
>
> Could be written in C, methinks.
>
> Unless we really want to be able to pass a `struct page_cgroup *' in place
> of arg `mm' here. If we don't want to be able to do that (prays fervently)
> then let's sleep happily in the knowledge that the C type system prevents
> us from doing it accidentally?
>
Writing vm_match_cgroup() as a static inline function in
include/linux/memcontrol.h created all the sparc build errors about two
weeks ago because of the dependency on linux/mm.h and linux/rcupdate.h.
* Re: [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup
2008-02-28 7:19 ` David Rientjes
@ 2008-02-28 7:26 ` Andrew Morton
2008-02-28 8:08 ` Hugh Dickins
0 siblings, 1 reply; 50+ messages in thread
From: Andrew Morton @ 2008-02-28 7:26 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, Balbir Singh, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
On Wed, 27 Feb 2008 23:19:08 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> On Wed, 27 Feb 2008, Andrew Morton wrote:
>
> > > -#define vm_match_cgroup(mm, cgroup) \
> > > +#define mm_match_cgroup(mm, cgroup) \
> > > ((cgroup) == rcu_dereference((mm)->mem_cgroup))
> >
> > Could be written in C, methinks.
> >
> > Unless we really want to be able to pass a `struct page_cgroup *' in place
> > of arg `mm' here. If we don't want to be able to do that (prays fervently)
> > then let's sleep happily in the knowledge that the C type system prevents
> > us from doing it accidentally?
> >
>
> Writing vm_match_cgroup() as a static inline function in
> include/linux/memcontrol.h created all the sparc build errors about two
> weeks ago because of the dependency on linux/mm.h and linux/rcupdate.h.
It's become an FAQ already? Should have put a comment in there...
That's the second ugly hack in that file because of missing includes. It's
preferable to add the needed includes, or just temper our little
inline-everything fetish.
Oh well.
* Re: [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup
2008-02-28 7:26 ` Andrew Morton
@ 2008-02-28 8:08 ` Hugh Dickins
0 siblings, 0 replies; 50+ messages in thread
From: Hugh Dickins @ 2008-02-28 8:08 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, Balbir Singh, KAMEZAWA Hiroyuki,
Hirokazu Takahashi, YAMAMOTO Takashi, linux-mm
On Wed, 27 Feb 2008, Andrew Morton wrote:
> On Wed, 27 Feb 2008 23:19:08 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> >
> > Writing vm_match_cgroup() as a static inline function in
> > include/linux/memcontrol.h created all the sparc build errors about two
> > weeks ago because of the dependency on linux/mm.h and linux/rcupdate.h.
>
> It's become an faq already? Should have put a comment in there..
>
> That's the second ugly hack in that file because of missing includes. It's
> preferable to add the needed includes, or just temper our little
> inline-everything fetish.
>
> Oh well.
Temper our inline-everything fetish and say "Oh well". We prefer
inline functions to macros; we prefer to avoid include hell; and macros
are the key to avoiding include hell.
Oh well: it really doesn't matter much; both David and I left our
!CONFIG_CGROUP_MEM_CONT versions as static inlines, so those without
MEM_CONT will be doing that part of the build testing for those with it.
Hugh
* Re: [PATCH 08/15] memcg: remove mem_cgroup_uncharge
2008-02-25 23:42 ` [PATCH 08/15] memcg: remove mem_cgroup_uncharge Hugh Dickins
2008-02-26 1:34 ` KAMEZAWA Hiroyuki
@ 2008-02-28 18:22 ` Balbir Singh
1 sibling, 0 replies; 50+ messages in thread
From: Balbir Singh @ 2008-02-28 18:22 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hirokazu Takahashi,
YAMAMOTO Takashi, linux-mm
* Hugh Dickins <hugh@veritas.com> [2008-02-25 23:42:05]:
> Nothing uses mem_cgroup_uncharge apart from mem_cgroup_uncharge_page
> (a trivial wrapper around it) and mem_cgroup_end_migration (which does
> the same as mem_cgroup_uncharge_page). And it often ends up having to
> lock just to let its caller unlock. Remove it (but leave the silly
> locking until a later patch).
>
> Moved mem_cgroup_cache_charge next to mem_cgroup_charge in memcontrol.h.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
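For anyone following along, the removed function and its surviving
wrapper related roughly like this (a sketch of the pre-patch shape,
not the exact source):

	void mem_cgroup_uncharge_page(struct page *page)
	{
		/*
		 * Take the page_cgroup lock around the call; the callee
		 * had to drop and retake it internally: the "lock just
		 * to let its caller unlock" silliness mentioned above.
		 */
		lock_page_cgroup(page);
		mem_cgroup_uncharge(page_get_page_cgroup(page));
		unlock_page_cgroup(page);
	}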
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Thread overview: 50+ messages
2008-02-25 23:34 [PATCH 00/15] memcg: fixes and cleanups Hugh Dickins
2008-02-25 23:35 ` [PATCH 01/15] memcg: mm_match_cgroup not vm_match_cgroup Hugh Dickins
2008-02-26 0:39 ` David Rientjes
2008-02-26 3:27 ` Hugh Dickins
2008-02-26 2:41 ` Balbir Singh
2008-02-26 23:46 ` KAMEZAWA Hiroyuki
2008-02-28 3:47 ` Andrew Morton
2008-02-28 7:19 ` David Rientjes
2008-02-28 7:26 ` Andrew Morton
2008-02-28 8:08 ` Hugh Dickins
2008-02-25 23:36 ` [PATCH 02/15] memcg: move_lists on page not page_cgroup Hugh Dickins
2008-02-26 15:52 ` Balbir Singh
2008-02-26 23:45 ` KAMEZAWA Hiroyuki
2008-02-25 23:37 ` [PATCH 03/15] memcg: page_cache_release not __free_page Hugh Dickins
2008-02-26 16:02 ` Balbir Singh
2008-02-26 23:38 ` KAMEZAWA Hiroyuki
2008-02-25 23:38 ` [PATCH 04/15] memcg: when do_swap's do_wp_page fails Hugh Dickins
2008-02-26 23:41 ` KAMEZAWA Hiroyuki
2008-02-27 5:08 ` Balbir Singh
2008-02-27 12:57 ` Hugh Dickins
2008-02-25 23:39 ` [PATCH 05/15] memcg: fix VM_BUG_ON from page migration Hugh Dickins
2008-02-26 1:30 ` KAMEZAWA Hiroyuki
2008-02-27 5:52 ` Balbir Singh
2008-02-27 13:23 ` Hugh Dickins
2008-02-27 13:43 ` Balbir Singh
2008-02-25 23:40 ` [PATCH 06/15] memcg: bad page if page_cgroup when free Hugh Dickins
2008-02-26 23:44 ` KAMEZAWA Hiroyuki
2008-02-27 8:38 ` Balbir Singh
2008-02-25 23:41 ` [PATCH 07/15] memcg: mem_cgroup_charge never NULL Hugh Dickins
2008-02-26 1:32 ` KAMEZAWA Hiroyuki
2008-02-27 8:42 ` Balbir Singh
2008-02-25 23:42 ` [PATCH 08/15] memcg: remove mem_cgroup_uncharge Hugh Dickins
2008-02-26 1:34 ` KAMEZAWA Hiroyuki
2008-02-28 18:22 ` Balbir Singh
2008-02-25 23:43 ` [PATCH 09/15] memcg: memcontrol whitespace cleanups Hugh Dickins
2008-02-25 23:44 ` [PATCH 10/15] memcg: memcontrol uninlined and static Hugh Dickins
2008-02-26 1:36 ` KAMEZAWA Hiroyuki
2008-02-25 23:46 ` [PATCH 11/15] memcg: remove clear_page_cgroup and atomics Hugh Dickins
2008-02-26 1:38 ` KAMEZAWA Hiroyuki
2008-02-25 23:47 ` [PATCH 12/15] memcg: css_put after remove_list Hugh Dickins
2008-02-26 1:39 ` KAMEZAWA Hiroyuki
2008-02-25 23:49 ` [PATCH 13/15] memcg: fix mem_cgroup_move_lists locking Hugh Dickins
2008-02-26 1:43 ` KAMEZAWA Hiroyuki
2008-02-26 2:56 ` Hugh Dickins
2008-02-25 23:50 ` [PATCH 14/15] memcg: simplify force_empty and move_lists Hugh Dickins, Hirokazu Takahashi
2008-02-26 1:48 ` KAMEZAWA Hiroyuki
2008-02-26 3:23 ` Hugh Dickins
2008-02-26 4:09 ` KAMEZAWA Hiroyuki
2008-02-25 23:51 ` [PATCH 15/15] memcg: fix oops on NULL lru list Hugh Dickins
2008-02-26 1:26 ` [PATCH 00/15] memcg: fixes and cleanups KAMEZAWA Hiroyuki