linux-mm.kvack.org archive mirror
* [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction
@ 2008-02-25  3:07 KAMEZAWA Hiroyuki
  2008-02-25  3:10 ` [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup KAMEZAWA Hiroyuki
                   ` (9 more replies)
  0 siblings, 10 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  3:07 UTC (permalink / raw)
  To: balbir@linux.vnet.ibm.com
  Cc: hugh@veritas.com, yamamoto@valinux.co.jp, taka, Andi Kleen,
	nickpiggin@yahoo.com.au, linux-mm@kvack.org,
	kamezawa.hiroyu@jp.fujitsu.com

This patch series implements a radix-tree based page_cgroup.

This series
  - removes the page_cgroup member from struct page.
  - adds a lookup function, get_page_cgroup(page).

Because the page_cgroup member is removed, the whole locking rule has to change.
In this series, page_cgroup is allocated on demand but never freed (see TODO).

This is a first trial and I hope to get advice and comments.
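
As a quick illustration of how the new lookup is meant to be used (a rough sketch
based on the charge path later in this series, not final code):

        struct page_cgroup *pc;

        pc = get_page_cgroup(page, gfp_mask);   /* may allocate when gfp_mask != 0 */
        if (!pc || IS_ERR(pc))                  /* not activated yet, or allocation failed */
                return PTR_ERR(pc);
        /* pc stays valid (never freed in this series); fields are protected by pc->lock */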


The following are UnixBench results on an ia64/NUMA box with 8 CPUs.
(The Shell Scripts 8-concurrent result was not available in the first run, for an unknown reason.)
./Run fstime execl shell C hanoi

== rc2 + CONFIG_CGROUP_MEM_CONT ==
File Read 1024 bufsize 2000 maxblocks    937399.0 KBps  (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks   323117.0 KBps  (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks    233737.0 KBps  (30.0 secs, 3 samples)
Execl Throughput                           2418.7 lps   (29.7 secs, 3 samples)
Shell Scripts (1 concurrent)               5506.0 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               988.3 lpm   (60.0 secs, 3 samples)
C Compiler Throughput                       741.7 lpm   (60.0 secs, 3 samples)
Recursion Test--Tower of Hanoi            74555.8 lps   (20.0 secs, 3 samples)

== rc2 + CONFIG_CGROUP_MEM_CONT + radix-tree based page_cgroup ==
File Read 1024 bufsize 2000 maxblocks    966342.0 KBps  (30.0 secs, 2 samples)
File Write 1024 bufsize 2000 maxblocks   316999.0 KBps  (30.0 secs, 2 samples)
File Copy 1024 bufsize 2000 maxblocks    234167.0 KBps  (30.0 secs, 2 samples)
Execl Throughput                           2410.5 lps   (29.8 secs, 2 samples)
Shell Scripts (1 concurrent)               5505.0 lpm   (60.0 secs, 2 samples)
Shell Scripts (8 concurrent)               1824.5 lpm   (60.0 secs, 2 samples)
Shell Scripts (16 concurrent)               987.0 lpm   (60.0 secs, 2 samples)
C Compiler Throughput                       742.5 lpm   (60.0 secs, 2 samples)
Recursion Test--Tower of Hanoi            74335.6 lps   (20.0 secs, 2 samples)

This looks good as a first result.

Because my machine time for today is over, I am posting this now. I'll rebase this to
rc3 and reflect comments in the next trial.

Series of patches:
[1/8] --- definitions in the header file.
[2/8] --- changes in charge/uncharge path and remove locks.
[3/8] --- changes in page_cgroup_move_lists()
[4/8] --- changes in page migration with page_cgroup
[5/8] --- changes in force_empty
[6/8] --- radix-tree based page_cgroup
[7/8] --- (Optional) per-cpu fast lookup helper
[8/8] --- (Optional) Use vmalloc for 64bit machines.


TODO
 - Move to -rc3 or -mm ?
 - This patch series doesn't implement page_cgroup removal.
   I think it's worth trying to remove page_cgroup when the page is used for a
   HugePage or the page is offlined, but this will increase complexity. So, do it later.
 - More performance measurement and code cleanup.
 - Check lock dependencies... do more testing.
 - Should I add a smaller chunk size for page_cgroup?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup
  2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
@ 2008-02-25  3:10 ` KAMEZAWA Hiroyuki
  2008-02-25  7:47   ` Hirokazu Takahashi
  2008-02-25  3:12 ` [RFC][PATCH] radix-tree based page_cgroup. [2/7] charge/uncharge KAMEZAWA Hiroyuki
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  3:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

(This is one patch in the "lookup page_cgroup" series.)

 * Export the page_cgroup definition.
 * Remove the page_cgroup member from struct page.
 * As a result, PAGE_CGROUP_LOCK_BIT and the assign/access functions are removed.

Other changes will appear in the following patches.
There is a change in the structure itself: a spinlock is added.
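
For orientation, a rough before/after of the locking (a sketch only; the actual
call sites change in later patches of this series):

        /* before: the lock lives in the low bit of page->page_cgroup */
        bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
        ...
        bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);

        /* after: each page_cgroup carries its own spinlock */
        spin_lock_irqsave(&pc->lock, flags);
        ...
        spin_unlock_irqrestore(&pc->lock, flags);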

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 include/linux/memcontrol.h  |   13 -----
 include/linux/mm_types.h    |    3 -
 include/linux/page_cgroup.h |   46 +++++++++++++++++++
 mm/memcontrol.c             |  103 --------------------------------------------
 mm/page_alloc.c             |    2 
 5 files changed, 47 insertions(+), 120 deletions(-)

Index: linux-2.6.25-rc2/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ linux-2.6.25-rc2/include/linux/page_cgroup.h
@@ -0,0 +1,44 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+
+#ifdef CONFIG_CGROUP_MEM_CONT
+/*
+ * page_cgroup is yet another mem_map-like structure for accounting usage,
+ * but, unlike mem_map, it is allocated on demand for accounted pages.
+ * see also memcontrol.h
+ * By nature, this consumes a large amount of memory.
+ */
+
+struct mem_cgroup;
+
+struct page_cgroup {
+	struct page 		*page;       /* the page this accounts for*/
+	struct mem_cgroup 	*mem_cgroup; /* current cgroup subsys */
+	int    			flags;	     /* See below */
+	int    			refcnt;      /* reference count */
+	spinlock_t		lock;        /* lock for all above members */
+	struct list_head 	lru;         /* for per cgroup LRU */
+};
+
+/* flags */
+#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache. */
+#define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
+
+/*
+ * Lookup and return page_cgroup struct.
+ * Returns NULL when
+ * 1. Page Cgroup is not activated yet.
+ * 2. the entry cannot be looked up and gfp_mask was 0.
+ * Returns ERR_PTR(-ENOMEM) if memory cannot be allocated.
+ */
+
+struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask);
+
+#else
+
+static inline struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask)
+{
+	return NULL;
+}
+#endif
+#endif
Index: linux-2.6.25-rc2/include/linux/mm_types.h
===================================================================
--- linux-2.6.25-rc2.orig/include/linux/mm_types.h
+++ linux-2.6.25-rc2/include/linux/mm_types.h
@@ -91,9 +91,6 @@ struct page {
 	void *virtual;			/* Kernel virtual address (NULL if
 					   not kmapped, ie. highmem) */
 #endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_CGROUP_MEM_CONT
-	unsigned long page_cgroup;
-#endif
 };
 
 /*
Index: linux-2.6.25-rc2/mm/memcontrol.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/memcontrol.c
+++ linux-2.6.25-rc2/mm/memcontrol.c
@@ -30,6 +30,7 @@
 #include <linux/spinlock.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/uaccess.h>
 
@@ -138,29 +139,6 @@ struct mem_cgroup {
 	struct mem_cgroup_stat stat;
 };
 
-/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is atleast two
- * byte aligned (based on comments from Nick Piggin)
- */
-#define PAGE_CGROUP_LOCK_BIT 	0x0
-#define PAGE_CGROUP_LOCK 		(1 << PAGE_CGROUP_LOCK_BIT)
-
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
-	struct list_head lru;		/* per cgroup LRU list */
-	struct page *page;
-	struct mem_cgroup *mem_cgroup;
-	atomic_t ref_cnt;		/* Helpful when pages move b/w  */
-					/* mapped and cached states     */
-	int	 flags;
-};
-#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
-
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
@@ -265,85 +243,6 @@ void mm_free_cgroup(struct mm_struct *mm
 	css_put(&mm->mem_cgroup->css);
 }
 
-static inline int page_cgroup_locked(struct page *page)
-{
-	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT,
-					&page->page_cgroup);
-}
-
-void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
-{
-	int locked;
-
-	/*
-	 * While resetting the page_cgroup we might not hold the
-	 * page_cgroup lock. free_hot_cold_page() is an example
-	 * of such a scenario
-	 */
-	if (pc)
-		VM_BUG_ON(!page_cgroup_locked(page));
-	locked = (page->page_cgroup & PAGE_CGROUP_LOCK);
-	page->page_cgroup = ((unsigned long)pc | locked);
-}
-
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
-	return (struct page_cgroup *)
-		(page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void __always_inline lock_page_cgroup(struct page *page)
-{
-	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-	VM_BUG_ON(!page_cgroup_locked(page));
-}
-
-static void __always_inline unlock_page_cgroup(struct page *page)
-{
-	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-/*
- * Tie new page_cgroup to struct page under lock_page_cgroup()
- * This can fail if the page has been tied to a page_cgroup.
- * If success, returns 0.
- */
-static int page_cgroup_assign_new_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
-{
-	int ret = 0;
-
-	lock_page_cgroup(page);
-	if (!page_get_page_cgroup(page))
-		page_assign_page_cgroup(page, pc);
-	else /* A page is tied to other pc. */
-		ret = 1;
-	unlock_page_cgroup(page);
-	return ret;
-}
-
-/*
- * Clear page->page_cgroup member under lock_page_cgroup().
- * If given "pc" value is different from one page->page_cgroup,
- * page->cgroup is not cleared.
- * Returns a value of page->page_cgroup at lock taken.
- * A can can detect failure of clearing by following
- *  clear_page_cgroup(page, pc) == pc
- */
-
-static struct page_cgroup *clear_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
-{
-	struct page_cgroup *ret;
-	/* lock and clear */
-	lock_page_cgroup(page);
-	ret = page_get_page_cgroup(page);
-	if (likely(ret == pc))
-		page_assign_page_cgroup(page, NULL);
-	unlock_page_cgroup(page);
-	return ret;
-}
-
 static void __mem_cgroup_remove_list(struct page_cgroup *pc)
 {
 	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
Index: linux-2.6.25-rc2/include/linux/memcontrol.h
===================================================================
--- linux-2.6.25-rc2.orig/include/linux/memcontrol.h
+++ linux-2.6.25-rc2/include/linux/memcontrol.h
@@ -32,9 +32,6 @@ struct mm_struct;
 
 extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
 extern void mm_free_cgroup(struct mm_struct *mm);
-extern void page_assign_page_cgroup(struct page *page,
-					struct page_cgroup *pc);
-extern struct page_cgroup *page_get_page_cgroup(struct page *page);
 extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 extern void mem_cgroup_uncharge(struct page_cgroup *pc);
@@ -85,16 +82,6 @@ static inline void mm_free_cgroup(struct
 {
 }
 
-static inline void page_assign_page_cgroup(struct page *page,
-						struct page_cgroup *pc)
-{
-}
-
-static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
-	return NULL;
-}
-
 static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask)
 {
Index: linux-2.6.25-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/page_alloc.c
+++ linux-2.6.25-rc2/mm/page_alloc.c
@@ -988,7 +988,6 @@ static void free_hot_cold_page(struct pa
 
 	if (!PageHighMem(page))
 		debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
-	VM_BUG_ON(page_get_page_cgroup(page));
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
@@ -2527,7 +2526,6 @@ void __meminit memmap_init_zone(unsigned
 		set_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		reset_page_mapcount(page);
-		page_assign_page_cgroup(page, NULL);
 		SetPageReserved(page);
 
 		/*


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC][PATCH] radix-tree based page_cgroup. [2/7] charge/uncharge
  2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
  2008-02-25  3:10 ` [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup KAMEZAWA Hiroyuki
@ 2008-02-25  3:12 ` KAMEZAWA Hiroyuki
  2008-02-25  3:13 ` [RFC][PATCH] radix-tree based page_cgroup. [3/7] move lists KAMEZAWA Hiroyuki
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  3:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

Changes in core logic: charge and uncharge.

Because the bit spinlock is removed and a spinlock is added to page_cgroup,
there is a fair amount of change.

This patch
	- modifies charge/uncharge
	- removes the add_list/remove_list functions and adds plain stat functions
	- adds simple lock-rule comments (see the sketch below).
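
A condensed sketch of the new charge fast path under pc->lock (simplified from
the hunk below; error handling and the slow path are omitted):

        pc = get_page_cgroup(page, gfp_mask);
        if (!pc || IS_ERR(pc))
                return PTR_ERR(pc);

        spin_lock_irqsave(&pc->lock, flags);
        if (pc->refcnt > 0) {                   /* already charged: just take a reference */
                ++pc->refcnt;
                spin_unlock_irqrestore(&pc->lock, flags);
                return 0;
        }
        spin_unlock_irqrestore(&pc->lock, flags);
        /* slow path: charge mem->res, then set pc->refcnt = 1 under pc->lock */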

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 mm/memcontrol.c |  173 +++++++++++++++++++++++---------------------------------
 1 file changed, 74 insertions(+), 99 deletions(-)

Index: linux-2.6.25-rc2/mm/memcontrol.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/memcontrol.c
+++ linux-2.6.25-rc2/mm/memcontrol.c
@@ -34,6 +34,16 @@
 
 #include <asm/uaccess.h>
 
+/*
+ * Lock Rule
+ * zone->lru_lock (global LRU)
+ *	-> mz->lru_lock (mem_cgroup's per_zone lock.)
+ *		-> pc->lock (page_cgroup's lock)
+ *
+ * mz->lru_lock and pc->lock should be acquired irq off.
+ *
+ */
+
 struct cgroup_subsys mem_cgroup_subsys;
 static const int MEM_CGROUP_RECLAIM_RETRIES = 5;
 
@@ -243,33 +253,29 @@ void mm_free_cgroup(struct mm_struct *mm
 	css_put(&mm->mem_cgroup->css);
 }
 
-static void __mem_cgroup_remove_list(struct page_cgroup *pc)
+static void mem_cgroup_dec_stat(struct mem_cgroup *mem,
+		struct mem_cgroup_per_zone *mz, unsigned long pc_flags)
 {
-	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
-	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+	int from = pc_flags & PAGE_CGROUP_FLAG_ACTIVE;
 
 	if (from)
 		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
 	else
 		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1;
 
-	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
-	list_del_init(&pc->lru);
+	mem_cgroup_charge_statistics(mem, pc_flags, false);
 }
 
-static void __mem_cgroup_add_list(struct page_cgroup *pc)
+static void mem_cgroup_inc_stat(struct mem_cgroup *mem,
+			struct mem_cgroup_per_zone *mz, unsigned long pc_flags)
 {
-	int to = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
-	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+	int to = pc_flags & PAGE_CGROUP_FLAG_ACTIVE;
 
-	if (!to) {
+	if (!to)
 		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
-		list_add(&pc->lru, &mz->inactive_list);
-	} else {
+	else
 		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
-		list_add(&pc->lru, &mz->active_list);
-	}
-	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
+	mem_cgroup_charge_statistics(mem, pc_flags, true);
 }
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
@@ -478,38 +484,22 @@ static int mem_cgroup_charge_common(stru
 	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup_per_zone *mz;
 
+	pc = get_page_cgroup(page, gfp_mask);
+	if (!pc || IS_ERR(pc))
+		return PTR_ERR(pc);
+
+	spin_lock_irqsave(&pc->lock, flags);
+
+	if (pc->refcnt > 0) {
+		++pc->refcnt;
+		spin_unlock_irqrestore(&pc->lock, flags);
+		return 0;
+	}
 	/*
-	 * Should page_cgroup's go to their own slab?
-	 * One could optimize the performance of the charging routine
-	 * by saving a bit in the page_flags and using it as a lock
-	 * to see if the cgroup page already has a page_cgroup associated
-	 * with it
+	 * Note: refcnt is still 0 here. We charge resource usage
+	 * before incrementing refcnt.
 	 */
-retry:
-	if (page) {
-		lock_page_cgroup(page);
-		pc = page_get_page_cgroup(page);
-		/*
-		 * The page_cgroup exists and
-		 * the page has already been accounted.
-		 */
-		if (pc) {
-			if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
-				/* this page is under being uncharged ? */
-				unlock_page_cgroup(page);
-				cpu_relax();
-				goto retry;
-			} else {
-				unlock_page_cgroup(page);
-				goto done;
-			}
-		}
-		unlock_page_cgroup(page);
-	}
-
-	pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
-	if (pc == NULL)
-		goto err;
+	spin_unlock_irqrestore(&pc->lock, flags);
 
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
@@ -522,11 +512,6 @@ retry:
 
 	rcu_read_lock();
 	mem = rcu_dereference(mm->mem_cgroup);
-	/*
-	 * For every charge from the cgroup, increment reference
-	 * count
-	 */
-	css_get(&mem->css);
 	rcu_read_unlock();
 
 	/*
@@ -535,7 +520,7 @@ retry:
 	 */
 	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
 		if (!(gfp_mask & __GFP_WAIT))
-			goto out;
+			goto nomem;
 
 		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
 			continue;
@@ -552,44 +537,41 @@ retry:
 
 		if (!nr_retries--) {
 			mem_cgroup_out_of_memory(mem, gfp_mask);
-			goto out;
+			goto nomem;
 		}
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	atomic_set(&pc->ref_cnt, 1);
+	spin_lock_irqsave(&pc->lock, flags);
+	if (pc->refcnt) { /* Someone charged before me. */
+		++pc->refcnt;
+		spin_unlock_irqrestore(&pc->lock, flags);
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		return 0;
+	}
+	pc->refcnt = 1;
 	pc->mem_cgroup = mem;
-	pc->page = page;
 	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
 		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
+	spin_unlock_irqrestore(&pc->lock, flags);
 
-	if (!page || page_cgroup_assign_new_page_cgroup(page, pc)) {
-		/*
-		 * Another charge has been added to this page already.
-		 * We take lock_page_cgroup(page) again and read
-		 * page->cgroup, increment refcnt.... just retry is OK.
-		 */
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		css_put(&mem->css);
-		kfree(pc);
-		if (!page)
-			goto done;
-		goto retry;
-	}
+	css_get(&mem->css);
+	/*
+	 * Wait until a racing uncharge finishes removing pc from the LRU.
+	 */
+	while (unlikely(!list_empty(&pc->lru)))
+		smp_rmb();
 
 	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	/* Update statistics vector */
-	__mem_cgroup_add_list(pc);
+	mem_cgroup_inc_stat(mem, mz, pc->flags);
+	list_add(&pc->lru, &mz->active_list);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 
-done:
 	return 0;
-out:
-	css_put(&mem->css);
-	kfree(pc);
-err:
+nomem:
 	return -ENOMEM;
 }
 
@@ -623,41 +605,34 @@ void mem_cgroup_uncharge(struct page_cgr
 {
 	struct mem_cgroup *mem;
 	struct mem_cgroup_per_zone *mz;
-	struct page *page;
-	unsigned long flags;
+	unsigned long flags, pc_flags;
 
-	/*
-	 * Check if our page_cgroup is valid
-	 */
 	if (!pc)
 		return;
-
-	if (atomic_dec_and_test(&pc->ref_cnt)) {
-		page = pc->page;
-		mz = page_cgroup_zoneinfo(pc);
-		/*
-		 * get page->cgroup and clear it under lock.
-		 * force_empty can drop page->cgroup without checking refcnt.
-		 */
-		unlock_page_cgroup(page);
-		if (clear_page_cgroup(page, pc) == pc) {
-			mem = pc->mem_cgroup;
-			css_put(&mem->css);
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			spin_lock_irqsave(&mz->lru_lock, flags);
-			__mem_cgroup_remove_list(pc);
-			spin_unlock_irqrestore(&mz->lru_lock, flags);
-			kfree(pc);
-		}
-		lock_page_cgroup(page);
+	spin_lock_irqsave(&pc->lock, flags);
+	if (!pc->refcnt || --pc->refcnt) {
+		spin_unlock_irqrestore(&pc->lock, flags);
+		return;
 	}
+	mz = page_cgroup_zoneinfo(pc);
+	mem = pc->mem_cgroup;
+	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	css_put(&mem->css);
+	pc_flags = pc->flags;
+	pc->flags = 0;
+	pc->mem_cgroup = NULL;
+	spin_unlock_irqrestore(&pc->lock, flags);
+
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	mem_cgroup_dec_stat(mem, mz, pc_flags);
+	list_del_init(&pc->lru);
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	css_put(&mem->css);
 }
 
 void mem_cgroup_uncharge_page(struct page *page)
 {
-	lock_page_cgroup(page);
-	mem_cgroup_uncharge(page_get_page_cgroup(page));
-	unlock_page_cgroup(page);
+	mem_cgroup_uncharge(get_page_cgroup(page, 0));
 }
 
 /*


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC][PATCH] radix-tree based page_cgroup. [3/7] move lists
  2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
  2008-02-25  3:10 ` [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup KAMEZAWA Hiroyuki
  2008-02-25  3:12 ` [RFC][PATCH] radix-tree based page_cgroup. [2/7] charge/uncharge KAMEZAWA Hiroyuki
@ 2008-02-25  3:13 ` KAMEZAWA Hiroyuki
  2008-02-25  3:14 ` [RFC][PATCH] radix-tree based page_cgroup. [4/7] migration KAMEZAWA Hiroyuki
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  3:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

mem_cgroup_move_lists() is called from external functions to notify that
the lru is rotated. It now works as follows (see the sketch below):

	1. get a reference if the page_cgroup is not freed.
	2. rotate the page.
	3. put the reference.
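
A minimal sketch of that sequence, following the code added below (the
list_empty() re-check is needed because no lock was held across the lookup):

        pc = page_cgroup_getref(page);          /* 1. take a reference if still charged */
        if (!pc)
                return;
        mz = page_cgroup_zoneinfo(pc);
        spin_lock_irqsave(&mz->lru_lock, flags);
        if (!list_empty(&pc->lru))              /* 2. rotate within the cgroup's LRU */
                __mem_cgroup_move_lists(pc, active);
        spin_unlock_irqrestore(&mz->lru_lock, flags);
        mem_cgroup_uncharge_page(page);         /* 3. drop the reference */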

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/memcontrol.h |    7 -----
 mm/memcontrol.c            |   60 ++++++++++++++++++++++++++++++++-------------
 mm/swap.c                  |    2 -
 mm/vmscan.c                |    4 +--
 4 files changed, 47 insertions(+), 26 deletions(-)

Index: linux-2.6.25-rc2/mm/memcontrol.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/memcontrol.c
+++ linux-2.6.25-rc2/mm/memcontrol.c
@@ -310,23 +310,6 @@ int task_in_mem_cgroup(struct task_struc
 }
 
 /*
- * This routine assumes that the appropriate zone's lru lock is already held
- */
-void mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
-{
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
-
-	if (!pc)
-		return;
-
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_move_lists(pc, active);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
-}
-
-/*
  * Calculate mapped_ratio under memory controller. This will be used in
  * vmscan.c for deteremining we have to reclaim mapped pages.
  */
@@ -470,6 +453,27 @@ unsigned long mem_cgroup_isolate_pages(u
 }
 
 /*
+ * Just increment page_cgroup's refcnt and return it.
+ * if there is used one.
+ */
+static struct page_cgroup *page_cgroup_getref(struct page *page)
+{
+	struct page_cgroup *pc = get_page_cgroup(page, 0);
+	struct page_cgroup *ret = NULL;
+	unsigned long flags;
+
+	if (!pc)
+		return ret;
+	spin_lock_irqsave(&pc->lock, flags);
+	if (pc->refcnt) {
+		++pc->refcnt;
+		ret = pc;
+	}
+	spin_unlock_irqrestore(&pc->lock, flags);
+	return ret;
+}
+
+/*
  * Charge the memory controller for page usage.
  * Return
  * 0 if the charge was successful
@@ -635,6 +639,28 @@ void mem_cgroup_uncharge_page(struct pag
 	mem_cgroup_uncharge(get_page_cgroup(page, 0));
 }
 
+void mem_cgroup_move_lists(struct page *page, bool active)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
+
+	pc = page_cgroup_getref(page);
+	if (!pc)
+		return;
+	/* The pc<->page relationship is stable. */
+	mz = page_cgroup_zoneinfo(pc);
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	/* This check is necessary. Anything can happen because
+	   we released the lock. */
+	if (!list_empty(&pc->lru))
+		__mem_cgroup_move_lists(pc, active);
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+	mem_cgroup_uncharge_page(page);
+}
+
+
 /*
  * Returns non-zero if a page (under migration) has valid page_cgroup member.
  * Refcnt of page_cgroup is incremented.
Index: linux-2.6.25-rc2/mm/swap.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/swap.c
+++ linux-2.6.25-rc2/mm/swap.c
@@ -176,7 +176,7 @@ void activate_page(struct page *page)
 		SetPageActive(page);
 		add_page_to_active_list(zone, page);
 		__count_vm_event(PGACTIVATE);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
+		mem_cgroup_move_lists(page, true);
 	}
 	spin_unlock_irq(&zone->lru_lock);
 }
Index: linux-2.6.25-rc2/mm/vmscan.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/vmscan.c
+++ linux-2.6.25-rc2/mm/vmscan.c
@@ -1128,7 +1128,7 @@ static void shrink_active_list(unsigned 
 		ClearPageActive(page);
 
 		list_move(&page->lru, &zone->inactive_list);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), false);
+		mem_cgroup_move_lists(page, false);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
@@ -1157,7 +1157,7 @@ static void shrink_active_list(unsigned 
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 		list_move(&page->lru, &zone->active_list);
-		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
+		mem_cgroup_move_lists(page, true);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
Index: linux-2.6.25-rc2/include/linux/memcontrol.h
===================================================================
--- linux-2.6.25-rc2.orig/include/linux/memcontrol.h
+++ linux-2.6.25-rc2/include/linux/memcontrol.h
@@ -36,7 +36,7 @@ extern int mem_cgroup_charge(struct page
 				gfp_t gfp_mask);
 extern void mem_cgroup_uncharge(struct page_cgroup *pc);
 extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_move_lists(struct page_cgroup *pc, bool active);
+extern void mem_cgroup_move_lists(struct page *page, bool active);
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -96,11 +96,6 @@ static inline void mem_cgroup_uncharge_p
 {
 }
 
-static inline void mem_cgroup_move_lists(struct page_cgroup *pc,
-						bool active)
-{
-}
-
 static inline int mem_cgroup_cache_charge(struct page *page,
 						struct mm_struct *mm,
 						gfp_t gfp_mask)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC][PATCH] radix-tree based page_cgroup. [4/7] migration
  2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2008-02-25  3:13 ` [RFC][PATCH] radix-tree based page_cgroup. [3/7] move lists KAMEZAWA Hiroyuki
@ 2008-02-25  3:14 ` KAMEZAWA Hiroyuki
  2008-02-25  3:16 ` [RFC][PATCH] radix-tree based page_cgroup. [5/7] force_empty KAMEZAWA Hiroyuki
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  3:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

Changes in the page migration handler.

Migration works as follows:
  1. unmap all.
  2. replace trees and copy contents.
  3. map all again.

Steps 1 and 3 can drop all charges, and the page would lose its cgroup information.

To prevent that (see the sketch below):
  * increment refcnt before unmapping everything.
  * at the end of copying, put newpage under oldpage's cgroup.
  * drop the extra refcnt.
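
Roughly, the migration sequence then becomes (a sketch of the intent only; the
exact call sites live in mm/migrate.c and are not changed by this patch):

        charged = mem_cgroup_prepare_migration(page);   /* take the extra refcnt */

        /* ... unmap, copy contents, remap ... */

        mem_cgroup_page_migration(page, newpage);       /* charge newpage to oldpage's cgroup */
        mem_cgroup_end_migration(page);                 /* drop the extra refcnt */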

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



 mm/memcontrol.c |   93 +++++++++++++++++++++++++-------------------------------
 1 file changed, 43 insertions(+), 50 deletions(-)

Index: linux-2.6.25-rc2/mm/memcontrol.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/memcontrol.c
+++ linux-2.6.25-rc2/mm/memcontrol.c
@@ -480,9 +480,9 @@ static struct page_cgroup *page_cgroup_g
  * < 0 if the cgroup is over its limit
  */
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype)
+				gfp_t gfp_mask, enum charge_type ctype,
+				struct mem_cgroup *mem)
 {
-	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 	unsigned long flags;
 	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -505,18 +505,20 @@ static int mem_cgroup_charge_common(stru
 	 */
 	spin_unlock_irqrestore(&pc->lock, flags);
 
-	/*
-	 * We always charge the cgroup the mm_struct belongs to.
-	 * The mm_struct's mem_cgroup changes on task migration if the
-	 * thread group leader migrates. It's possible that mm is not
-	 * set, if so charge the init_mm (happens for pagecache usage).
-	 */
-	if (!mm)
-		mm = &init_mm;
+	if (likely(!mem)) {
+		/*
+		 * We always charge the cgroup the mm_struct belongs to.
+		 * The mm_struct's mem_cgroup changes on task migration if the
+		 * thread group leader migrates. It's possible that mm is not
+		 * set, if so charge the init_mm (happens for pagecache usage).
+		 */
+		if (!mm)
+			mm = &init_mm;
 
-	rcu_read_lock();
-	mem = rcu_dereference(mm->mem_cgroup);
-	rcu_read_unlock();
+		rcu_read_lock();
+		mem = rcu_dereference(mm->mem_cgroup);
+		rcu_read_unlock();
+	}
 
 	/*
 	 * If we created the page_cgroup, we should free it on exceeding
@@ -583,7 +585,7 @@ int mem_cgroup_charge(struct page *page,
 			gfp_t gfp_mask)
 {
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-			MEM_CGROUP_CHARGE_TYPE_MAPPED);
+			MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
 }
 
 /*
@@ -597,7 +599,7 @@ int mem_cgroup_cache_charge(struct page 
 		mm = &init_mm;
 
 	ret = mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_CACHE);
+				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
 	return ret;
 }
 
@@ -668,24 +670,12 @@ void mem_cgroup_move_lists(struct page *
 
 int mem_cgroup_prepare_migration(struct page *page)
 {
-	struct page_cgroup *pc;
-	int ret = 0;
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (pc && atomic_inc_not_zero(&pc->ref_cnt))
-		ret = 1;
-	unlock_page_cgroup(page);
-	return ret;
+	return page_cgroup_getref(page)? 1 : 0;
 }
 
 void mem_cgroup_end_migration(struct page *page)
 {
-	struct page_cgroup *pc;
-
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	mem_cgroup_uncharge(pc);
-	unlock_page_cgroup(page);
+	mem_cgroup_uncharge_page(page);
 }
 /*
  * We know both *page* and *newpage* are now not-on-LRU and Pg_locked.
@@ -696,31 +686,34 @@ void mem_cgroup_end_migration(struct pag
 void mem_cgroup_page_migration(struct page *page, struct page *newpage)
 {
 	struct page_cgroup *pc;
-	struct mem_cgroup *mem;
-	unsigned long flags;
-	struct mem_cgroup_per_zone *mz;
-retry:
-	pc = page_get_page_cgroup(page);
+	struct mem_cgroup *mem = NULL;
+	unsigned long flags, pc_flags;
+
+	/* This is done under RCU read lock */
+	pc = get_page_cgroup(page, 0);
 	if (!pc)
 		return;
-	mem = pc->mem_cgroup;
-	mz = page_cgroup_zoneinfo(pc);
-	if (clear_page_cgroup(page, pc) != pc)
-		goto retry;
-	spin_lock_irqsave(&mz->lru_lock, flags);
-
-	__mem_cgroup_remove_list(pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	spin_lock_irqsave(&pc->lock, flags);
+	if (pc->refcnt) {
+		mem = pc->mem_cgroup;
+		pc_flags = pc->flags;
+		css_get(&mem->css);
+	}
+	spin_unlock_irqrestore(&pc->lock, flags);
+	if (!mem)
+		return;
 
-	pc->page = newpage;
-	lock_page_cgroup(newpage);
-	page_assign_page_cgroup(newpage, pc);
-	unlock_page_cgroup(newpage);
+	/* We got all necessary information */
+	mem_cgroup_uncharge_page(page);
 
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_add_list(pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	/* Following extra charge will be dropped by end_migration */
+	if (pc_flags & PAGE_CGROUP_FLAG_CACHE)
+		mem_cgroup_charge_common(newpage, NULL, GFP_ATOMIC,
+				MEM_CGROUP_CHARGE_TYPE_CACHE, mem);
+	else
+		mem_cgroup_charge_common(newpage, NULL,  GFP_ATOMIC,
+				MEM_CGROUP_CHARGE_TYPE_MAPPED, mem);
+	css_put(&mem->css);
 	return;
 }
 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC][PATCH] radix-tree based page_cgroup. [5/7] force_empty
  2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2008-02-25  3:14 ` [RFC][PATCH] radix-tree based page_cgroup. [4/7] migration KAMEZAWA Hiroyuki
@ 2008-02-25  3:16 ` KAMEZAWA Hiroyuki
  2008-02-25  3:17 ` [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  3:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

Lock the page and uncharge it.
This *lock* ensures we have no race with migration.
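
The per-page pattern used below, roughly (get_page() pins the page while
mz->lru_lock is dropped; the uncharge is done only when the page lock is
actually acquired):

        get_page(page);
        spin_unlock_irqrestore(&mz->lru_lock, flags);
        if (!TestSetPageLocked(page)) {         /* got the page lock: no race with migration */
                mem_cgroup_uncharge_page(page);
                unlock_page(page);
        }
        put_page(page);
        spin_lock_irqsave(&mz->lru_lock, flags);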

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |   34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6.25-rc2/mm/memcontrol.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/memcontrol.c
+++ linux-2.6.25-rc2/mm/memcontrol.c
@@ -31,6 +31,7 @@
 #include <linux/fs.h>
 #include <linux/seq_file.h>
 #include <linux/page_cgroup.h>
+#include <linux/pagemap.h>
 
 #include <asm/uaccess.h>
 
@@ -730,7 +731,7 @@ mem_cgroup_force_empty_list(struct mem_c
 {
 	struct page_cgroup *pc;
 	struct page *page;
-	int count;
+	int count = FORCE_UNCHARGE_BATCH;
 	unsigned long flags;
 	struct list_head *list;
 
@@ -741,28 +742,27 @@ mem_cgroup_force_empty_list(struct mem_c
 
 	if (list_empty(list))
 		return;
-retry:
-	count = FORCE_UNCHARGE_BATCH;
+
 	spin_lock_irqsave(&mz->lru_lock, flags);
 
-	while (--count && !list_empty(list)) {
+	while (!list_empty(list)) {
 		pc = list_entry(list->prev, struct page_cgroup, lru);
 		page = pc->page;
-		/* Avoid race with charge */
-		atomic_set(&pc->ref_cnt, 0);
-		if (clear_page_cgroup(page, pc) == pc) {
-			css_put(&mem->css);
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			__mem_cgroup_remove_list(pc);
-			kfree(pc);
-		} else 	/* being uncharged ? ...do relax */
-			break;
+		get_page(page);
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+		/* check against page migration */
+		if (!TestSetPageLocked(page)) {
+			mem_cgroup_uncharge_page(page);
+			unlock_page(page);
+		}
+		put_page(page);
+		if (--count == 0) {
+			count = FORCE_UNCHARGE_BATCH;
+			cond_resched();
+		}
+		spin_lock_irqsave(&mz->lru_lock, flags);
 	}
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-	if (!list_empty(list)) {
-		cond_resched();
-		goto retry;
-	}
 	return;
 }
 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup
  2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2008-02-25  3:16 ` [RFC][PATCH] radix-tree based page_cgroup. [5/7] force_empty KAMEZAWA Hiroyuki
@ 2008-02-25  3:17 ` KAMEZAWA Hiroyuki
  2008-02-25  5:56   ` YAMAMOTO Takashi
  2008-02-25  6:40   ` Hirokazu Takahashi
  2008-02-25  3:18 ` [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  3:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

A lookup routine for the page_cgroup struct.

Currently, page_cgroup is pointed to by struct page's page_cgroup member:

struct page {
	...
	unsigned long page_cgroup;
	..
}

But people dislike this because it increases sizeof(struct page).

To avoid that, we add a lookup routine for
	pfn <-> page_cgroup
based on a radix-tree.

The new function is

struct page_cgroup *get_page_cgroup(struct page *page, gfp_t mask);

If (mask != 0), look up and allocate a new entry if necessary.
If (mask == 0), just look up and return NULL if it does not exist.

Each radix-tree entry holds the base address of an array of page_cgroup
covering PCGRP_SIZE pages; the lookup adds (pfn - base_pfn) to that base
for the entry, much as sparsemem maps a section's mem_map. See sparsemem's logic if unsure.
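
A minimal sketch of the pfn -> page_cgroup translation (taken from the lookup
below; PCGRP_SHIFT is 8 here, so each entry covers 256 pages):

        unsigned long pfn = page_to_pfn(page);
        unsigned long idx = pfn >> PCGRP_SHIFT;         /* radix-tree index */
        unsigned long base_pfn = idx << PCGRP_SHIFT;    /* first pfn covered by this entry */
        struct page_cgroup *base;

        base = radix_tree_lookup(&root->root_node, idx);
        if (base)
                pc = base + (pfn - base_pfn);           /* page_cgroup for this page */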

Signed-off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/Makefile      |    2 
 mm/page_cgroup.c |  151 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 152 insertions(+), 1 deletion(-)

Index: linux-2.6.25-rc2/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ linux-2.6.25-rc2/mm/page_cgroup.c
@@ -0,0 +1,151 @@
+/*
+ * page_cgroup management code.
+ * page_cgroup is yet another mem_map used when the cgroup memory resource
+ * controller is activated. It contains information which cannot be stored in
+ * the usual mem_map (it would make struct page too big).
+ * This allows us to keep 'struct page' small when a user doesn't activate
+ * memory resource controller.
+ *
+ * Note: all things are allocated on demand.
+ *
+ * We can translate : struct page <-> pfn -> page_cgroup -> struct page.
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/radix-tree.h>
+#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
+#include <linux/err.h>
+
+#define PCGRP_SHIFT	(8)
+#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
+
+struct page_cgroup_root {
+	spinlock_t	       tree_lock;
+	struct radix_tree_root root_node;
+};
+
+static struct page_cgroup_root *root_dir[MAX_NUMNODES];
+
+static void init_page_cgroup(struct page_cgroup *base, unsigned long pfn)
+{
+	int i;
+	int size = PCGRP_SIZE * sizeof (struct page_cgroup);
+	struct page_cgroup *pc;
+
+	memset(base, 0, size);
+	for (i = 0; i < PCGRP_SIZE; ++i) {
+		pc = base + i;
+		pc->page = pfn_to_page(pfn + i);
+		spin_lock_init(&pc->lock);
+		INIT_LIST_HEAD(&pc->lru);
+	}
+}
+
+
+
+static struct page_cgroup *alloc_init_page_cgroup(unsigned long pfn, int nid,
+					gfp_t mask)
+{
+	int size, order;
+	struct page *page;
+
+	size = PCGRP_SIZE * sizeof(struct page_cgroup);
+	order = get_order(PAGE_ALIGN(size));
+	page = alloc_pages_node(nid, mask, order);
+	if (!page)
+		return NULL;
+
+	init_page_cgroup(page_address(page), pfn);
+
+	return page_address(page);
+}
+
+void free_page_cgroup(struct page_cgroup *pc)
+{
+	int size = PCGRP_SIZE * sizeof(struct page_cgroup);
+	int order = get_order(PAGE_ALIGN(size));
+	__free_pages(virt_to_page(pc), order);
+}
+
+
+/*
+ * Look up page_cgroup struct for struct page (page's pfn)
+ * if (gfp_mask != 0), look up and allocate new one if necessary.
+ * if (gfp_mask == 0), look up and return NULL if it cannot be found.
+ */
+
+struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask)
+{
+	struct page_cgroup_root *root;
+	struct page_cgroup *pc, *base_addr;
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long idx = pfn >> PCGRP_SHIFT;
+	int nid	= page_to_nid(page);
+	unsigned long base_pfn, flags;
+	int error;
+
+	root = root_dir[nid];
+	/* Before Init ? */
+	if (unlikely(!root))
+		return NULL;
+
+	base_pfn = idx << PCGRP_SHIFT;
+retry:
+	error = 0;
+	rcu_read_lock();
+	pc = radix_tree_lookup(&root->root_node, idx);
+	rcu_read_unlock();
+
+	if (likely(pc))
+		return pc + (pfn - base_pfn);
+	if (!gfpmask)
+		return NULL;
+
+	/* Very Slow Path. On demand allocation. */
+	gfpmask = gfpmask & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
+
+	base_addr = alloc_init_page_cgroup(base_pfn, nid, gfpmask);
+	if (!base_addr)
+		return ERR_PTR(-ENOMEM);
+
+	error = radix_tree_preload(gfpmask);
+	if (error)
+		goto out;
+	spin_lock_irqsave(&root->tree_lock, flags);
+	error = radix_tree_insert(&root->root_node, idx, base_addr);
+
+	if (error)
+		pc  = NULL;
+	else
+		pc = base_addr + (pfn - base_pfn);
+	spin_unlock_irqrestore(&root->tree_lock, flags);
+	radix_tree_preload_end();
+out:
+	if (!pc) {
+		free_page_cgroup(base_addr);
+		if (error == -EEXIST)
+			goto retry;
+	}
+	if (error)
+		pc = ERR_PTR(error);
+	return pc;
+}
+
+__init int page_cgroup_init(void)
+{
+	int nid;
+	struct page_cgroup_root *root;
+	for_each_node(nid) {
+		root = kmalloc_node(sizeof(struct page_cgroup_root),
+					GFP_KERNEL, nid);
+		INIT_RADIX_TREE(&root->root_node, GFP_ATOMIC);
+		spin_lock_init(&root->tree_lock);
+		smp_wmb();
+		root_dir[nid] = root;
+	}
+	printk(KERN_INFO "Page Accounting is activated\n");
+	return 0;
+}
+late_initcall(page_cgroup_init);
Index: linux-2.6.25-rc2/mm/Makefile
===================================================================
--- linux-2.6.25-rc2.orig/mm/Makefile
+++ linux-2.6.25-rc2/mm/Makefile
@@ -32,5 +32,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
+obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o page_cgroup.o
 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup
  2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
                   ` (5 preceding siblings ...)
  2008-02-25  3:17 ` [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup KAMEZAWA Hiroyuki
@ 2008-02-25  3:18 ` KAMEZAWA Hiroyuki
  2008-02-25  5:36   ` YAMAMOTO Takashi
  2008-02-26 13:26   ` minchan Kim
  2008-02-25  3:19 ` [RFC][PATCH] radix-tree based page_cgroup. [8/7] vmalloc for large machines KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  3:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

Add per-cpu cache entries to avoid the locking and the radix-tree walk on every page_cgroup lookup.
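
The fast path added by this patch, roughly (a condensed sketch of the hashed
per-cpu cache; on a miss it falls back to the radix-tree lookup):

        unsigned long idx = pfn >> PCGRP_SHIFT;
        int hash = idx & (PAGE_CGROUP_NR_CACHE - 1);

        preempt_disable();
        pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
        if (pcp->ents[hash].idx == idx && pcp->ents[hash].base)
                pc = pcp->ents[hash].base + (pfn - (idx << PCGRP_SHIFT));  /* hit */
        else
                pc = NULL;                      /* miss: use __get_page_cgroup() */
        preempt_enable();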

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/page_cgroup.h |   33 ++++++++++++++++++++++++++++++++-
 mm/page_cgroup.c            |   29 ++++++++++++++++++++++-------
 2 files changed, 54 insertions(+), 8 deletions(-)

Index: linux-2.6.25-rc2/mm/page_cgroup.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/page_cgroup.c
+++ linux-2.6.25-rc2/mm/page_cgroup.c
@@ -17,6 +17,7 @@
 #include <linux/memcontrol.h>
 #include <linux/page_cgroup.h>
 #include <linux/err.h>
+#include <linux/interrupt.h>
 
 #define PCGRP_SHIFT	(8)
 #define PCGRP_SIZE	(1 << PCGRP_SHIFT)
@@ -27,6 +28,7 @@ struct page_cgroup_root {
 };
 
 static struct page_cgroup_root *root_dir[MAX_NUMNODES];
+DEFINE_PER_CPU(struct page_cgroup_cache, pcpu_page_cgroup_cache);
 
 static void init_page_cgroup(struct page_cgroup *base, unsigned long pfn)
 {
@@ -68,13 +70,25 @@ void free_page_cgroup(struct page_cgroup
 }
 
 
+
+static void save_result(struct page_cgroup  *base, unsigned long idx)
+{
+	int hash = idx & (PAGE_CGROUP_NR_CACHE - 1);
+	struct page_cgroup_cache *pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
+	preempt_disable();
+	pcp->ents[hash].idx = idx;
+	pcp->ents[hash].base = base;
+	preempt_enable();
+}
+
+
 /*
  * Look up page_cgroup struct for struct page (page's pfn)
  * if (gfp_mask != 0), look up and allocate new one if necessary.
  * if (gfp_mask == 0), look up and return NULL if it cannot be found.
  */
 
-struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask)
+struct page_cgroup *__get_page_cgroup(struct page *page, gfp_t gfpmask)
 {
 	struct page_cgroup_root *root;
 	struct page_cgroup *pc, *base_addr;
@@ -96,8 +110,14 @@ retry:
 	pc = radix_tree_lookup(&root->root_node, idx);
 	rcu_read_unlock();
 
+	if (pc) {
+		if (!in_interrupt())
+			save_result(pc, idx);
+	}
+
 	if (likely(pc))
 		return pc + (pfn - base_pfn);
+
 	if (!gfpmask)
 		return NULL;
 
Index: linux-2.6.25-rc2/include/linux/page_cgroup.h
===================================================================
--- linux-2.6.25-rc2.orig/include/linux/page_cgroup.h
+++ linux-2.6.25-rc2/include/linux/page_cgroup.h
@@ -24,6 +24,19 @@ struct page_cgroup {
 #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache. */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
 
+#define PAGE_CGROUP_NR_CACHE        (0x8)
+struct page_cgroup_cache {
+	struct	{
+		unsigned long idx;
+		struct page_cgroup *base;
+	} ents[PAGE_CGROUP_NR_CACHE];
+};
+
+DECLARE_PER_CPU(struct page_cgroup_cache, pcpu_page_cgroup_cache);
+
+#define PCGRP_SHIFT     (8)
+#define PCGRP_SIZE      (1 << PCGRP_SHIFT)
+
 /*
  * Lookup and return page_cgroup struct.
  * returns NULL when
@@ -32,7 +45,26 @@ struct page_cgroup {
  * return -ENOMEM if cannot allocate memory.
  */
 
-struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask);
+struct page_cgroup *__get_page_cgroup(struct page *page, gfp_t gfpmask);
+
+static inline struct page_cgroup *
+get_page_cgroup(struct page *page, gfp_t gfpmask)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct page_cgroup_cache *pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
+	struct page_cgroup *ret;
+	unsigned long idx = pfn >> PCGRP_SHIFT;
+	int hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
+
+	preempt_disable();
+	if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base)
+		ret = pcp->ents[hnum].base + (pfn - (idx << PCGRP_SHIFT));
+	else
+		ret = NULL;
+	preempt_enable();
+
+	return (ret)? ret : __get_page_cgroup(page, gfpmask);
+}
 
 #else
 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC][PATCH] radix-tree based page_cgroup. [8/7] vmalloc for large machines
  2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
                   ` (6 preceding siblings ...)
  2008-02-25  3:18 ` [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup KAMEZAWA Hiroyuki
@ 2008-02-25  3:19 ` KAMEZAWA Hiroyuki
  2008-02-25  7:06   ` KAMEZAWA Hiroyuki
  2008-02-25  3:24 ` [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction Balbir Singh
  2008-02-25  3:31 ` KAMEZAWA Hiroyuki
  9 siblings, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  3:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

On 64bit arches (and some others?) we have plenty of vmalloc area.

This patch uses vmalloc() to allocate a large contiguous area and
reduces the number of entries in the radix-tree.
Each entry covers 256MB of area.

Note: because of vmalloc, we don't allocate new entries in interrupt context.
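
The arithmetic behind "256MB per entry", as a sketch (assuming 4KB pages; the
macro names here are illustrative and may not match the hunk below exactly):

        /* one radix-tree entry maps 2^28 bytes = 256MB of physical address space */
        #define PCGRP_BASE_SHIFT        28
        /* pages per entry: 2^(PCGRP_BASE_SHIFT - PAGE_SHIFT) = 65536 with 4KB pages */
        #define PCGRP_SHIFT             (PCGRP_BASE_SHIFT - PAGE_SHIFT)
        #define PCGRP_SIZE              (1 << PCGRP_SHIFT)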


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 include/linux/page_cgroup.h |    6 ++++++
 init/Kconfig                |    4 ++++
 mm/page_cgroup.c            |   29 ++++++++++++++++++++++++++++-
 3 files changed, 38 insertions(+), 1 deletion(-)

Index: linux-2.6.25-rc2/include/linux/page_cgroup.h
===================================================================
--- linux-2.6.25-rc2.orig/include/linux/page_cgroup.h
+++ linux-2.6.25-rc2/include/linux/page_cgroup.h
@@ -34,7 +34,13 @@ struct page_cgroup_cache {
 
 DECLARE_PER_CPU(struct page_cgroup_cache, pcpu_page_cgroup_cache);
 
+#ifdef CONFIG_PAGE_CGROUP_VMALLOC
+#define PCGRP_BASE_SHIFT	(28)	/* covers 256M per entry */
+#define PCGRP_SHIFT		(PCGRP_BASE_SHIFT - PAGE_SHIFT)
+#else
 #define PCGRP_SHIFT     (8)
+#endif
+
 #define PCGRP_SIZE      (1 << PCGRP_SHIFT)
 
 /*
Index: linux-2.6.25-rc2/mm/page_cgroup.c
===================================================================
--- linux-2.6.25-rc2.orig/mm/page_cgroup.c
+++ linux-2.6.25-rc2/mm/page_cgroup.c
@@ -18,6 +18,7 @@
 #include <linux/page_cgroup.h>
 #include <linux/err.h>
 #include <linux/interrupt.h>
+#include <linux/vmalloc.h>
 
 #define PCGRP_SHIFT	(8)
 #define PCGRP_SIZE	(1 << PCGRP_SHIFT)
@@ -43,6 +44,29 @@ static void init_page_cgroup(struct page
 	}
 }
 
+#ifndef CONFIG_PAGE_CGROUP_VMALLOC
+static struct page_cgroup *alloc_init_page_cgroup(unsigned long pfn, int nid,
+				gfp_t mask)
+{
+	int size, order;
+	struct page_cgroup *base;
+
+	size = PCGRP_SIZE * sizeof(struct page_cgroup);
+	order = get_order(PAGE_ALIGN(size));
+	base = vmalloc_node(size, nid);
+
+	init_page_cgroup(base, pfn);
+	return base;
+}
+
+void free_page_cgroup(struct page_cgroup *pc)
+{
+	vfree(pc);
+}
+
+#else
+
+
 
 
 static struct page_cgroup *alloc_init_page_cgroup(unsigned long pfn, int nid,
@@ -68,7 +92,7 @@ void free_page_cgroup(struct page_cgroup
 	int order = get_order(PAGE_ALIGN(size));
 	__free_pages(virt_to_page(pc), order);
 }
-
+#endif
 
 
 static void save_result(struct page_cgroup  *base, unsigned long idx)
@@ -121,6 +145,9 @@ retry:
 	if (!gfpmask)
 		return NULL;
 
+	if (in_interrupt())
+		return NULL;
+
 	/* Very Slow Path. On demand allocation. */
 	gfpmask = gfpmask & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
 
Index: linux-2.6.25-rc2/init/Kconfig
===================================================================
--- linux-2.6.25-rc2.orig/init/Kconfig
+++ linux-2.6.25-rc2/init/Kconfig
@@ -394,6 +394,10 @@ config CGROUP_MEM_CONT
 	  Provides a memory controller that manages both page cache and
 	  RSS memory.
 
+config CGROUP_PAGE_CGROUP_VMALLOC
+	def_bool y
+	depends on CGROUP_MEM_CONT && 64BIT
+
 config PROC_PID_CPUSET
 	bool "Include legacy /proc/<pid>/cpuset file"
 	depends on CPUSETS


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction
  2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
                   ` (7 preceding siblings ...)
  2008-02-25  3:19 ` [RFC][PATCH] radix-tree based page_cgroup. [8/7] vmalloc for large machines KAMEZAWA Hiroyuki
@ 2008-02-25  3:24 ` Balbir Singh
  2008-02-25  4:02   ` KAMEZAWA Hiroyuki
  2008-02-25  3:31 ` KAMEZAWA Hiroyuki
  9 siblings, 1 reply; 35+ messages in thread
From: Balbir Singh @ 2008-02-25  3:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: hugh@veritas.com, yamamoto@valinux.co.jp, taka, Andi Kleen,
	nickpiggin@yahoo.com.au, linux-mm@kvack.org

KAMEZAWA Hiroyuki wrote:
> This patch series is for implementing radix-tree based page_cgroup.
> 
> This patch does
>   - remove page_cgroup member from struct page.
>   - add a lookup function get_page_cgroup(page).
> 
> And, by removing page_cgroup member, we'll have to change the whole lock rule.
> In this patch, page_cgroup is allocated on demand but not freed. (see TODO).
> 
> This is first trial and I hope I get advices, comments.
> 
> 
> Following is unix bench result under ia64/NUMA box, 8 cpu system. 
> (Shell Script 8 concurrent result was not available from unknown reason.)
> ./Run fstime execl shell C hanoi
> 
> == rc2 + CONFIG_CGROUP_MEM_CONT ==
> File Read 1024 bufsize 2000 maxblocks    937399.0 KBps  (30.0 secs, 3 samples)
> File Write 1024 bufsize 2000 maxblocks   323117.0 KBps  (30.0 secs, 3 samples)
> File Copy 1024 bufsize 2000 maxblocks    233737.0 KBps  (30.0 secs, 3 samples)
> Execl Throughput                           2418.7 lps   (29.7 secs, 3 samples)
> Shell Scripts (1 concurrent)               5506.0 lpm   (60.0 secs, 3 samples)
> Shell Scripts (16 concurrent)               988.3 lpm   (60.0 secs, 3 samples)
> C Compiler Throughput                       741.7 lpm   (60.0 secs, 3 samples)
> Recursion Test--Tower of Hanoi            74555.8 lps   (20.0 secs, 3 samples)
> 
> == rc2 + CONFIG_CGROUP_MEM_CONT + radix-tree based page_cgroup ==
> File Read 1024 bufsize 2000 maxblocks    966342.0 KBps  (30.0 secs, 2 samples)
> File Write 1024 bufsize 2000 maxblocks   316999.0 KBps  (30.0 secs, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks    234167.0 KBps  (30.0 secs, 2 samples)
> Execl Throughput                           2410.5 lps   (29.8 secs, 2 samples)
> Shell Scripts (1 concurrent)               5505.0 lpm   (60.0 secs, 2 samples)
> Shell Scripts (8 concurrent)               1824.5 lpm   (60.0 secs, 2 samples)
> Shell Scripts (16 concurrent)               987.0 lpm   (60.0 secs, 2 samples)
> C Compiler Throughput                       742.5 lpm   (60.0 secs, 2 samples)
> Recursion Test--Tower of Hanoi            74335.6 lps   (20.0 secs, 2 samples)
> 

Hi, KAMEZAWA-San,

The results look quite good.

> looks good as first result.
> 
> Becaue today's my machine time is over, I post this now. I'll rebase this to
> rc3 and reflect comments in the next trial.
> 
> series of patches
> [1/8] --- defintions of header file. 
> [2/8] --- changes in charge/uncharge path and remove locks.
> [3/8] --- changes in page_cgroup_move_lists()
> [4/8] --- changes in page migration with page_cgroup
> [5/8] --- changes in force_empty
> [6/8] --- radix-tree based page_cgroup
> [7/8] --- (Optional) per-cpu fast lookup helper
> [8/8] --- (Optional) Use vmalloc for 64bit machines.
> 
> 
> TODO
>  - Move to -rc3 or -mm ?

I think -mm is better, since we have been pushing all the patches through -mm
and that way we'll get some testing before the patches go in as well.

>  - This patch series doesn't implement page_cgroup removal.
>    I consider it's worth tring to remove page_cgroup when the page is used for
>    HugePage or the page is offlined. But this will incease complexity. So, do later.

Why don't we remove the page_cgroup? What is the increased complexity? I'll take
a look at the patches.

>  - More perfomance measurement and brush codes up.
>  - Check lock dependency...Do more test.

I think we should work this out as well. Hugh is working on cleanup for locking
right now. I suspect that with the radix tree changes, there might a conflict
between your and hugh's work. I think for the moment, while we stabilize and get
the radix tree patches ready, we should get Hugh's cleanup in.

>  - Should I add smaller chunk size for page_cgroup ?
> 
> Thanks,
> -Kame
> 


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction
  2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
                   ` (8 preceding siblings ...)
  2008-02-25  3:24 ` [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction Balbir Singh
@ 2008-02-25  3:31 ` KAMEZAWA Hiroyuki
  9 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  3:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

On Mon, 25 Feb 2008 12:07:58 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> And, by removing page_cgroup member, we'll have to change the whole lock rule.
> In this patch, page_cgroup is allocated on demand but not freed. (see TODO).
> 
The introduction lacks this explanation:
==
  BTW, we are now planning to add boot options which allow turning off cgroup
  subsystems. If the memory resource controller is turned off by a boot option,
  we won't allocate page_cgroup.
==

Thanks,
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction
  2008-02-25  3:24 ` [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction Balbir Singh
@ 2008-02-25  4:02   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  4:02 UTC (permalink / raw)
  To: balbir
  Cc: hugh@veritas.com, yamamoto@valinux.co.jp, taka, Andi Kleen,
	nickpiggin@yahoo.com.au, linux-mm@kvack.org

On Mon, 25 Feb 2008 08:54:25 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > TODO
> >  - Move to -rc3 or -mm ?
> 
> I think -mm is better, since we have been pushing all the patches through -mm
> and that way we'll get some testing before the patches go in as well.
> 
Okay,

> >  - This patch series doesn't implement page_cgroup removal.
> >    I consider it's worth tring to remove page_cgroup when the page is used for
> >    HugePage or the page is offlined. But this will incease complexity. So, do later.
> 
> Why don't we remove the page_cgroup? What is the increased complexity? I'll take
> a look at the patches.
> 
> >  - More perfomance measurement and brush codes up.
> >  - Check lock dependency...Do more test.
> 
> I think we should work this out as well. Hugh is working on a cleanup for locking
> right now. I suspect that with the radix tree changes, there might be a conflict
> between your and Hugh's work. I think for the moment, while we stabilize and get
> the radix tree patches ready, we should get Hugh's cleanup in.
> 
I agree here.
I think Hugh-san's patch should go in via the fast path, because it's a bugfix.
This set should be tested under -mm or a private tree until it has had enough testing.

Thanks,
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup
  2008-02-25  3:18 ` [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup KAMEZAWA Hiroyuki
@ 2008-02-25  5:36   ` YAMAMOTO Takashi
  2008-02-25  5:46     ` KAMEZAWA Hiroyuki
  2008-02-26 13:26   ` minchan Kim
  1 sibling, 1 reply; 35+ messages in thread
From: YAMAMOTO Takashi @ 2008-02-25  5:36 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, hugh, taka, ak, nickpiggin, linux-mm

> +
> +static void save_result(struct page_cgroup  *base, unsigned long idx)
> +{
> +	int hash = idx & (PAGE_CGROUP_NR_CACHE - 1);
> +	struct page_cgroup_cache *pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> +	preempt_disable();
> +	pcp->ents[hash].idx = idx;
> +	pcp->ents[hash].base = base;
> +	preempt_enable();

preempt_disable after __get_cpu_var doesn't make much sense.
it should be get_cpu_var and put_cpu_var.

> -struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask);
> +struct page_cgroup *__get_page_cgroup(struct page *page, gfp_t gfpmask);
> +
> +static inline struct page_cgroup *
> +get_page_cgroup(struct page *page, gfp_t gfpmask)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	struct page_cgroup_cache *pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> +	struct page_cgroup *ret;
> +	unsigned long idx = pfn >> PCGRP_SHIFT;
> +	int hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
> +
> +	preempt_disable();
> +	if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base)
> +		ret = pcp->ents[hnum].base + (pfn - (idx << PCGRP_SHIFT));
> +	else
> +		ret = NULL;
> +	preempt_enable();

ditto.

YAMAMOTO Takashi


* Re: [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup
  2008-02-25  5:36   ` YAMAMOTO Takashi
@ 2008-02-25  5:46     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  5:46 UTC (permalink / raw)
  To: YAMAMOTO Takashi; +Cc: balbir, hugh, taka, ak, nickpiggin, linux-mm

On Mon, 25 Feb 2008 14:36:40 +0900 (JST)
yamamoto@valinux.co.jp (YAMAMOTO Takashi) wrote:

> > +
> > +static void save_result(struct page_cgroup  *base, unsigned long idx)
> > +{
> > +	int hash = idx & (PAGE_CGROUP_NR_CACHE - 1);
> > +	struct page_cgroup_cache *pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> > +	preempt_disable();
> > +	pcp->ents[hash].idx = idx;
> > +	pcp->ents[hash].base = base;
> > +	preempt_enable();
> 
> preempt_disable after __get_cpu_var doesn't make much sense.
> it should be get_cpu_var and put_cpu_var.
> 
I'm sorry, __get_cpu_var() here may not refer to the local cpu's data because
preemption is not yet disabled. Will fix.
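
For reference, a minimal sketch of the suggested fix (names follow the posted
patch; illustration only, not a tested change):

==
static void save_result(struct page_cgroup *base, unsigned long idx)
{
	int hash = idx & (PAGE_CGROUP_NR_CACHE - 1);
	/* get_cpu_var() disables preemption before taking the per-cpu pointer */
	struct page_cgroup_cache *pcp = &get_cpu_var(pcpu_page_cgroup_cache);

	pcp->ents[hash].idx = idx;
	pcp->ents[hash].base = base;
	put_cpu_var(pcpu_page_cgroup_cache);
}
==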


> > -struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask);
> > +struct page_cgroup *__get_page_cgroup(struct page *page, gfp_t gfpmask);
> > +
> > +static inline struct page_cgroup *
> > +get_page_cgroup(struct page *page, gfp_t gfpmask)
> > +{
> > +	unsigned long pfn = page_to_pfn(page);
> > +	struct page_cgroup_cache *pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> > +	struct page_cgroup *ret;
> > +	unsigned long idx = pfn >> PCGRP_SHIFT;
> > +	int hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
> > +
> > +	preempt_disable();
> > +	if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base)
> > +		ret = pcp->ents[hnum].base + (pfn - (idx << PCGRP_SHIFT));
> > +	else
> > +		ret = NULL;
> > +	preempt_enable();
> 
> ditto.
> 

Thank you for pointing that out.

Thanks,
-Kame



* Re: [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup
  2008-02-25  3:17 ` [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup KAMEZAWA Hiroyuki
@ 2008-02-25  5:56   ` YAMAMOTO Takashi
  2008-02-25  6:07     ` KAMEZAWA Hiroyuki
  2008-02-25  6:40   ` Hirokazu Takahashi
  1 sibling, 1 reply; 35+ messages in thread
From: YAMAMOTO Takashi @ 2008-02-25  5:56 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, hugh, taka, ak, nickpiggin, linux-mm

> struct page *get_page_cgroup(struct page *page, gfp_mask mask);
> 
> if (mask != 0), look up and allocate new one if necessary.
> if (mask == 0), just do look up and return NULL if not exist.

0 is a valid gfp mask.  (GFP_NOWAIT)

YAMAMOTO Takashi


* Re: [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup
  2008-02-25  5:56   ` YAMAMOTO Takashi
@ 2008-02-25  6:07     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  6:07 UTC (permalink / raw)
  To: YAMAMOTO Takashi; +Cc: balbir, hugh, taka, ak, nickpiggin, linux-mm

On Mon, 25 Feb 2008 14:56:31 +0900 (JST)
yamamoto@valinux.co.jp (YAMAMOTO Takashi) wrote:

> > struct page *get_page_cgroup(struct page *page, gfp_mask mask);
> > 
> > if (mask != 0), look up and allocate new one if necessary.
> > if (mask == 0), just do look up and return NULL if not exist.
> 
> 0 is a valid gfp mask.  (GFP_NOWAIT)
> 
> YAMAMOTO Takashi
> 
Hmm, maybe adding a new arg is better, like

struct page_cgroup *get_page_cgroup(struct page *page, gfp_t mask, bool allocate);

I will change it. Thank you.
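
Just to illustrate how the two cases would then look (a sketch only, not code
from the patches):

==
/* look up, allocating a new page_cgroup if none exists yet */
pc = get_page_cgroup(page, GFP_KERNEL, true);

/* pure lookup: never allocates, may return NULL (the mask is ignored here) */
pc = get_page_cgroup(page, GFP_NOWAIT, false);
==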

Regards,
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup
  2008-02-25  3:17 ` [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup KAMEZAWA Hiroyuki
  2008-02-25  5:56   ` YAMAMOTO Takashi
@ 2008-02-25  6:40   ` Hirokazu Takahashi
  2008-02-25  6:52     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 35+ messages in thread
From: Hirokazu Takahashi @ 2008-02-25  6:40 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

Hi,

I looked into the code a bit and I have some comments.

> Each radix-tree entry contains base address of array of page_cgroup.
> As sparsemem does, this registered base address is subtracted by base_pfn
> for that entry. See sparsemem's logic if unsure.
> 
> Signed-off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

  (snip)

> +#define PCGRP_SHIFT	(8)
> +#define PCGRP_SIZE	(1 << PCGRP_SHIFT)

I wonder where the value of PCGRP_SHIFT comes from.

  (snip)

> +static struct page_cgroup *alloc_init_page_cgroup(unsigned long pfn, int nid,
> +					gfp_t mask)
> +{
> +	int size, order;
> +	struct page *page;
> +
> +	size = PCGRP_SIZE * sizeof(struct page_cgroup);
> +	order = get_order(PAGE_ALIGN(size));

I wonder if this alignment will waste some memory.

> +	page = alloc_pages_node(nid, mask, order);

I think you should make "order" be 0 not to cause extra memory pressure
if possible.

> +	if (!page)
> +		return NULL;
> +
> +	init_page_cgroup(page_address(page), pfn);
> +
> +	return page_address(page);
> +}


Thanks.


* Re: [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup
  2008-02-25  6:40   ` Hirokazu Takahashi
@ 2008-02-25  6:52     ` KAMEZAWA Hiroyuki
  2008-02-25  7:05       ` Hirokazu Takahashi
  0 siblings, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  6:52 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

On Mon, 25 Feb 2008 15:40:51 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:

> Hi,
> 
> I looked into the code a bit and I have some comments.
> 
> > Each radix-tree entry contains base address of array of page_cgroup.
> > As sparsemem does, this registered base address is subtracted by base_pfn
> > for that entry. See sparsemem's logic if unsure.
> > 
> > Signed-off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
>   (snip)
> 
> > +#define PCGRP_SHIFT	(8)
> > +#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> 
> I wonder where the value of PCGRP_SHIFT comes from.
> 
On 32bit systems, (I think 64bit should use vmalloc),
this order comes from sizeof(struct page_cgroup) * 2^8  <= 8192 ,  2 pages.


>   (snip)
> 
> > +static struct page_cgroup *alloc_init_page_cgroup(unsigned long pfn, int nid,
> > +					gfp_t mask)
> > +{
> > +	int size, order;
> > +	struct page *page;
> > +
> > +	size = PCGRP_SIZE * sizeof(struct page_cgroup);
> > +	order = get_order(PAGE_ALIGN(size));
> 
> I wonder if this alignment will waste some memory.
> 
Maybe. 

> > +	page = alloc_pages_node(nid, mask, order);
> 
> I think you should make "order" be 0 not to cause extra memory pressure
> if possible.
> 
Hmm, and increase depth of radix-tree ? 
But ok, starting from safe code is better. will make this order to be 0
and see what happens.

Thanks
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup
  2008-02-25  6:52     ` KAMEZAWA Hiroyuki
@ 2008-02-25  7:05       ` Hirokazu Takahashi
  2008-02-25  7:25         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 35+ messages in thread
From: Hirokazu Takahashi @ 2008-02-25  7:05 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

Hi,

> > I looked into the code a bit and I have some comments.
> > 
> > > Each radix-tree entry contains base address of array of page_cgroup.
> > > As sparsemem does, this registered base address is subtracted by base_pfn
> > > for that entry. See sparsemem's logic if unsure.
> > > 
> > > Signed-off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> >   (snip)
> > 
> > > +#define PCGRP_SHIFT	(8)
> > > +#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> > 
> > I wonder where the value of PCGRP_SHIFT comes from.
> > 
> On 32bit systems, (I think 64bit should use vmalloc),
> this order comes from sizeof(struct page_cgroup) * 2^8  <= 8192 ,  2 pages.

The size of struct page_cgroup on 32bit will be 28byte,
so that sizeof(struct page_cgroup) * 2^8 = 28 * 2^8 = 7168 byte.
I'm not sure it is acceptable if we lose (8192 - 7168)/8192 = 0.125 = 12.5%
of memory for page_cgroup.

+struct page_cgroup {
+	struct page 		*page;       /* the page this accounts for*/
+	struct mem_cgroup 	*mem_cgroup; /* current cgroup subsys */
+	int    			flags;	     /* See below */
+	int    			refcnt;      /* reference count */
+	spinlock_t		lock;        /* lock for all above members */
+	struct list_head 	lru;         /* for per cgroup LRU */
+};

I wonder if we can find any better way.

> >   (snip)
> > 
> > > +static struct page_cgroup *alloc_init_page_cgroup(unsigned long pfn, int nid,
> > > +					gfp_t mask)
> > > +{
> > > +	int size, order;
> > > +	struct page *page;
> > > +
> > > +	size = PCGRP_SIZE * sizeof(struct page_cgroup);
> > > +	order = get_order(PAGE_ALIGN(size));
> > 
> > I wonder if this alignment will waste some memory.
> > 
> Maybe. 
> 
> > > +	page = alloc_pages_node(nid, mask, order);
> > 
> > I think you should make "order" be 0 not to cause extra memory pressure
> > if possible.
> > 
> Hmm, and increase depth of radix-tree ? 
> But ok, starting from safe code is better. will make this order to be 0
> and see what happens.
> 
> Thanks
> -Kame
> 


* Re: [RFC][PATCH] radix-tree based page_cgroup. [8/7] vmalloc for large machines
  2008-02-25  3:19 ` [RFC][PATCH] radix-tree based page_cgroup. [8/7] vmalloc for large machines KAMEZAWA Hiroyuki
@ 2008-02-25  7:06   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  7:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

On Mon, 25 Feb 2008 12:19:59 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> +#ifdef CONFIG_PAGE_CGROUP_VMALLOC
> +#define PCGRUP_BASE_SHIFT	(28)	/* covers 256M per entry */
> +#define PCGRP_SHIFT		(PCGROUP_PAGE_SHIFT - PCGRP_SHIFT)
> +#else
>  #define PCGRP_SHIFT     (8)
> +#endif
Above is broken, maybe reflesh miss...
please ignore this.

Sorry,
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup
  2008-02-25  7:05       ` Hirokazu Takahashi
@ 2008-02-25  7:25         ` KAMEZAWA Hiroyuki
  2008-02-25  8:02           ` Hirokazu Takahashi
  0 siblings, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  7:25 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

On Mon, 25 Feb 2008 16:05:40 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> The size of struct page_cgroup on 32bit will be 28byte,
> so that sizeof(struct page_cgroup) * 2^8 = 28 * 2^8 = 7168 byte.
> I'm not sure it is acceptable if we lose (8192 - 7168)/8192 = 0.125 = 12.5%
> of memory for page_cgroup.
> 
> +struct page_cgroup {
> +	struct page 		*page;       /* the page this accounts for*/
> +	struct mem_cgroup 	*mem_cgroup; /* current cgroup subsys */
> +	int    			flags;	     /* See below */
> +	int    			refcnt;      /* reference count */
> +	spinlock_t		lock;        /* lock for all above members */
> +	struct list_head 	lru;         /* for per cgroup LRU */
> +};
> 

- 28 bytes * (2^7) = 3584 bytes; wastes 512 bytes per 512k of user memory.
- 28 bytes * (2^8) = 7168 bytes; wastes 1024 bytes per 1M of user memory.

The loss is about 0.1%. Or is there any user for an extra 4 bytes in page_cgroup?

Thanks,
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup
  2008-02-25  3:10 ` [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup KAMEZAWA Hiroyuki
@ 2008-02-25  7:47   ` Hirokazu Takahashi
  2008-02-25  7:56     ` Balbir Singh
  2008-02-25  8:03     ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 35+ messages in thread
From: Hirokazu Takahashi @ 2008-02-25  7:47 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

Hi,

> (This is one of a series of patch for "lookup page_cgroup" patches..)
> 
>  * Exporting page_cgroup definition.
>  * Remove page_cgroup member from sturct page.
>  * As result, PAGE_CGROUP_LOCK_BIT and assign/access functions are removed.
> 
> Other chages will appear in following patches.
> There is a change in the structure itself, spin_lock is added.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
     (snip)
> +#ifdef CONFIG_CGROUP_MEM_CONT
> +/*
> + * page_cgroup is yet another mem_map structure for accounting  usage.
> + * but, unlike mem_map, allocated on demand for accounted pages.
> + * see also memcontrol.h
> + * In nature, this cosumes much amount of memory.
> + */
> +
> +struct mem_cgroup;
> +
> +struct page_cgroup {
> +	struct page 		*page;       /* the page this accounts for*/
> +	struct mem_cgroup 	*mem_cgroup; /* current cgroup subsys */
> +	int    			flags;	     /* See below */
> +	int    			refcnt;      /* reference count */
> +	spinlock_t		lock;        /* lock for all above members */
> +	struct list_head 	lru;         /* for per cgroup LRU */
> +};

You can possibly reduce the size of the page_cgroup structure so that it does
not consume a lot of memory. I think this is important.

I have some ideas:
(1) I don't think every struct page_cgroup needs to have a "lock" member.
    I think one "lock" variable for several page_cgroup will be also enough
    from a performance viewpoint. In addition, it will become low-impact for
    cache memory. I guess it may be okay if each array of page_cgroup --
    which you just introduced now -- has one lock variable.
(2) The "flags" member and the "refcnt" member can be encoded into
    one member.
(3) The page member can be replaced with the page frame number and it will be
    also possible to use some kind of ID instead of the mem_cgroup member.
    This means these members can be encoded to one members with other members
    such as "flags" and "refcnt"

You don't need to hurry to implement this but will you put these on the
ToDo list.

> +
> +/* flags */
> +#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache. */
> +#define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */

I've also been wondering about the preallocation-of-page_cgroup approach,
with which the page member of page_cgroup could be completely removed.


Thank you,
Hirokazu Takahashi.


* Re: [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup
  2008-02-25  7:47   ` Hirokazu Takahashi
@ 2008-02-25  7:56     ` Balbir Singh
  2008-02-25  8:03     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 35+ messages in thread
From: Balbir Singh @ 2008-02-25  7:56 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: kamezawa.hiroyu, hugh, yamamoto, ak, nickpiggin, linux-mm

Hirokazu Takahashi wrote:
> Hi,
> 
>> (This is one of a series of patch for "lookup page_cgroup" patches..)
>>
>>  * Exporting page_cgroup definition.
>>  * Remove page_cgroup member from sturct page.
>>  * As result, PAGE_CGROUP_LOCK_BIT and assign/access functions are removed.
>>
>> Other chages will appear in following patches.
>> There is a change in the structure itself, spin_lock is added.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>      (snip)
>> +#ifdef CONFIG_CGROUP_MEM_CONT
>> +/*
>> + * page_cgroup is yet another mem_map structure for accounting  usage.
>> + * but, unlike mem_map, allocated on demand for accounted pages.
>> + * see also memcontrol.h
>> + * In nature, this cosumes much amount of memory.
>> + */
>> +
>> +struct mem_cgroup;
>> +
>> +struct page_cgroup {
>> +	struct page 		*page;       /* the page this accounts for*/
>> +	struct mem_cgroup 	*mem_cgroup; /* current cgroup subsys */
>> +	int    			flags;	     /* See below */
>> +	int    			refcnt;      /* reference count */
>> +	spinlock_t		lock;        /* lock for all above members */
>> +	struct list_head 	lru;         /* for per cgroup LRU */
>> +};
> 
> You can possibly reduce the size of the page_cgroup structure so that it does
> not consume a lot of memory. I think this is important.
> 
> I have some ideas:
> (1) I don't think every struct page_cgroup needs to have a "lock" member.
>     I think one "lock" variable for several page_cgroup will be also enough
>     from a performance viewpoint. In addition, it will become low-impact for
>     cache memory. I guess it may be okay if each array of page_cgroup --
>     which you just introduced now -- has one lock variable.

You'll hit lock contention quite a bit with this approach. Several pages sharing
one lock field is not a good idea.

> (2) The "flags" member and the "refcnt" member can be encoded into
>     one member.

This can be done, but then the assumption is that refcnt is always bounded,
which is true for now. I think we should think of these optimizations *after* we
have a stable radix-tree based patch.

> (3) The page member can be replaced with the page frame number and it will be
>     also possible to use some kind of ID instead of the mem_cgroup member.
>     This means these members can be encoded to one members with other members
>     such as "flags" and "refcnt"
> 

This will hurt speed again. First we need a mechanism for id'ing each control
group and then doing a lookup from the id to the pointer.
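
Purely as an illustration of such a mechanism (not part of these patches; the
helpers below are hypothetical, and idr_alloc() is a newer interface than the
idr API of this era):

==
#include <linux/idr.h>

static DEFINE_IDR(mem_cgroup_idr);

/* assign a small integer id to a mem_cgroup at creation time */
static int mem_cgroup_alloc_id(struct mem_cgroup *memcg)
{
	return idr_alloc(&mem_cgroup_idr, memcg, 1, 0, GFP_KERNEL);
}

/* map an id stored in page_cgroup back to the mem_cgroup pointer */
static struct mem_cgroup *mem_cgroup_from_id(int id)
{
	return idr_find(&mem_cgroup_idr, id);
}
==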


> You don't need to hurry to implement this but will you put these on the
> ToDo list.
> 

As possible optimizations, yes!

>> +
>> +/* flags */
>> +#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache. */
>> +#define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
> 
> I've also been wondering about the preallocation-of-page_cgroup approach,
> with which the page member of page_cgroup could be completely removed.
> 
> 
> Thank you,
> Hirokazu Takahashi.
> 


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup
  2008-02-25  7:25         ` KAMEZAWA Hiroyuki
@ 2008-02-25  8:02           ` Hirokazu Takahashi
  2008-02-25  8:11             ` KAMEZAWA Hiroyuki
  2008-02-25  8:28             ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 35+ messages in thread
From: Hirokazu Takahashi @ 2008-02-25  8:02 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

Hi,

> Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> > The size of struct page_cgroup on 32bit will be 28byte,
> > so that sizeof(struct page_cgroup) * 2^8 = 28 * 2^8 = 7168 byte.
> > I'm not sure it is acceptable if we lose (8192 - 7168)/8192 = 0.125 = 12.5%
> > of memory for page_cgroup.
> > 
> > +struct page_cgroup {
> > +	struct page 		*page;       /* the page this accounts for*/
> > +	struct mem_cgroup 	*mem_cgroup; /* current cgroup subsys */
> > +	int    			flags;	     /* See below */
> > +	int    			refcnt;      /* reference count */
> > +	spinlock_t		lock;        /* lock for all above members */
> > +	struct list_head 	lru;         /* for per cgroup LRU */
> > +};
> > 
> 
> - 28 bytes * (2^7) = 3584 bytes; wastes 512 bytes per 512k of user memory.
> - 28 bytes * (2^8) = 7168 bytes; wastes 1024 bytes per 1M of user memory.
> 
> The loss is about 0.1%. Or is there any user for an extra 4 bytes in page_cgroup?
> 
> Thanks,
> -Kame

I feel kernel memory is an expensive resource, especially on 32 bit linux
machines. I think this is one of the reasons why a lot of people don't
want to increase the size of the page structure.

Thank you,
Hirokazu Takahashi.


* Re: [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup
  2008-02-25  7:47   ` Hirokazu Takahashi
  2008-02-25  7:56     ` Balbir Singh
@ 2008-02-25  8:03     ` KAMEZAWA Hiroyuki
  2008-02-26  7:46       ` Hirokazu Takahashi
  1 sibling, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  8:03 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

On Mon, 25 Feb 2008 16:47:45 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> > +struct mem_cgroup;
> > +
> > +struct page_cgroup {
> > +	struct page 		*page;       /* the page this accounts for*/
> > +	struct mem_cgroup 	*mem_cgroup; /* current cgroup subsys */
> > +	int    			flags;	     /* See below */
> > +	int    			refcnt;      /* reference count */
> > +	spinlock_t		lock;        /* lock for all above members */
> > +	struct list_head 	lru;         /* for per cgroup LRU */
> > +};
> 
> You can possibly reduce the size of the page_cgroup structure so that it does
> not consume a lot of memory. I think this is important.
> 
> I have some ideas:
> (1) I don't think every struct page_cgroup needs to have a "lock" member.
>     I think one "lock" variable for several page_cgroup will be also enough
>     from a performance viewpoint. In addition, it will become low-impact for
>     cache memory. I guess it may be okay if each array of page_cgroup --
>     which you just introduced now -- has one lock variable.

I think it will increase cache-bouncing, but I have no data.
(I noticed that the lock bit can be moved into flags and bit_spin_lock used.
 But I wouldn't like to do that at this stage.)


> (2) The "flags" member and the "refcnt" member can be encoded into
>     one member.

I don't like this idea.
Because some people are discussing enlarging the 32bit counters in struct 'page'
to 64bit, I wonder whether refcnt should be "unsigned long" now.

> (3) The page member can be replaced with the page frame number and it will be
>     also possible to use some kind of ID instead of the mem_cgroup member.
>     This means these members can be encoded to one members with other members
>     such as "flags" and "refcnt"
 
I think there are cases where "pfn" doesn't fit in 32bit.
(64bit systems tend to have a sparse address space.)
We need unsigned long anyway.

Thanks,
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup
  2008-02-25  8:02           ` Hirokazu Takahashi
@ 2008-02-25  8:11             ` KAMEZAWA Hiroyuki
  2008-02-25  8:28             ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  8:11 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

On Mon, 25 Feb 2008 17:02:29 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:

> I feel kernel memory is an expensive resource, especially on 32 bit linux
> machines. I think this is one of the reasons why a lot of people don't
> want to increase the size of the page structure.
> 
Hmm, making PCGRP_SHIFT 0 and using kmalloc() is an idea.
I'll add a config option for small systems.
(And we can get rid of the get_order()-related code.)
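
A rough sketch of what that variant could look like (the config name is made
up; init_page_cgroup() is the helper from patch 6/7):

==
#ifdef CONFIG_PAGE_CGROUP_SMALL		/* hypothetical option name */
/* PCGRP_SHIFT == 0: one page_cgroup per radix-tree entry, no get_order() */
static struct page_cgroup *alloc_init_page_cgroup(unsigned long pfn, int nid,
					gfp_t mask)
{
	struct page_cgroup *pc = kmalloc_node(sizeof(*pc), mask, nid);

	if (!pc)
		return NULL;
	init_page_cgroup(pc, pfn);
	return pc;
}
#endif
==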

Thanks,
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup
  2008-02-25  8:02           ` Hirokazu Takahashi
  2008-02-25  8:11             ` KAMEZAWA Hiroyuki
@ 2008-02-25  8:28             ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-25  8:28 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

On Mon, 25 Feb 2008 17:02:29 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:

> > - 28 bytes * (2^7) = 3584 bytes; wastes 512 bytes per 512k of user memory.
> > - 28 bytes * (2^8) = 7168 bytes; wastes 1024 bytes per 1M of user memory.
> > 
> > The loss is about 0.1%. Or is there any user for an extra 4 bytes in page_cgroup?
> > 
> > Thanks,
> > -Kame
> 
> I feel kernel memory is an expensive resource, especially on 32 bit linux
> machines. I think this is one of the reasons why a lot of people don't
> want to increase the size of the page structure.
> 
Maybe the smallest-loss case is 2^4.

28 bytes * 2^4 * 9 = 4032 bytes. (A private kmem_cache_alloc() would be used here.)

I'll add this config and test it.

Thanks,
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup
  2008-02-25  8:03     ` KAMEZAWA Hiroyuki
@ 2008-02-26  7:46       ` Hirokazu Takahashi
  2008-02-26  9:07         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 35+ messages in thread
From: Hirokazu Takahashi @ 2008-02-26  7:46 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

Hi,

> > > +struct mem_cgroup;
> > > +
> > > +struct page_cgroup {
> > > +	struct page 		*page;       /* the page this accounts for*/
> > > +	struct mem_cgroup 	*mem_cgroup; /* current cgroup subsys */
> > > +	int    			flags;	     /* See below */
> > > +	int    			refcnt;      /* reference count */
> > > +	spinlock_t		lock;        /* lock for all above members */
> > > +	struct list_head 	lru;         /* for per cgroup LRU */
> > > +};
> > 
> > You can possibly reduce the size of the page_cgroup structure so that it does
> > not consume a lot of memory. I think this is important.
> > 
> > I have some ideas:
> > (1) I don't think every struct page_cgroup needs to have a "lock" member.
> >     I think one "lock" variable for several page_cgroup will be also enough
> >     from a performance viewpoint. In addition, it will become low-impact for
> >     cache memory. I guess it may be okay if each array of page_cgroup --
> >     which you just introduced now -- has one lock variable.
> 
> I think it will increase cache-bouncing, but I have no data.

Yes, that's the point. There will be some tradeoff between the cache-bouncing
and the memory usage. 

> (I noticed that the lock bit can be moved into flags and bit_spin_lock used.
>  But I wouldn't like to do that at this stage.)

Yep.

> > (2) The "flags" member and the "refcnt" member can be encoded into
> >     one member.
> 
> I don't like this idea.
> Because some people are discussing enlarging the 32bit counters in struct 'page'
> to 64bit, I wonder whether refcnt should be "unsigned long" now.

I don't think the refcnt member of page_cgroup will need such a large
counter. I think you can make it small.

> > (3) The page member can be replaced with the page frame number and it will be
> >     also possible to use some kind of ID instead of the mem_cgroup member.
> >     This means these members can be encoded to one members with other members
> >     such as "flags" and "refcnt"
>  
> I think there are cases where "pfn" doesn't fit in 32bit.
> (64bit systems tend to have a sparse address space.)
> We need unsigned long anyway.

It will be a 64bit variable on a 64bit machine, where the pointers are
also 64bit long. I think you can encode "pfn" and other stuff into one
64bit variable.
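
Purely as an illustration of that encoding (the field widths below are made-up
assumptions, not from the patches):

==
#define PC_REF_BITS	10	/* assumed bound on refcnt */
#define PC_FLAG_BITS	4
#define PC_ID_BITS	10	/* assumed bound on number of cgroups */
#define PC_PFN_SHIFT	(PC_REF_BITS + PC_FLAG_BITS + PC_ID_BITS)

/* pack pfn, cgroup id, flags and refcnt into one 64bit word */
static inline u64 pc_encode(unsigned long pfn, unsigned int id,
			    unsigned int flags, unsigned int refcnt)
{
	return ((u64)pfn << PC_PFN_SHIFT) |
	       ((u64)id << (PC_FLAG_BITS + PC_REF_BITS)) |
	       ((u64)flags << PC_REF_BITS) |
	       (u64)refcnt;
}

static inline unsigned long pc_pfn(u64 v)
{
	return v >> PC_PFN_SHIFT;
}
==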

Thank you,
Hirokazu Takahashi.


* Re: [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup
  2008-02-26  7:46       ` Hirokazu Takahashi
@ 2008-02-26  9:07         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26  9:07 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: balbir, hugh, yamamoto, ak, nickpiggin, linux-mm

On Tue, 26 Feb 2008 16:46:41 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:

>
> > > (3) The page member can be replaced with the page frame number and it will be
> > >     also possible to use some kind of ID instead of the mem_cgroup member.
> > >     This means these members can be encoded to one members with other members
> > >     such as "flags" and "refcnt"
> >  
> > I think there is a case that "pfn" doesn't fit in 32bit.
> > (64bit system tend to have sparse address space.)
> > We need unsigned long anyway.
> 
> It will be a 64bit variable on a 64bit machine, where the pointers are
> also 64bit long. I think you can encode "pfn" and other stuff into one
> 64bit variable.
> 
The next version will have reclaim of page_cgroup; I'm working on that now.
The size itself will be discussed later.

Thanks,
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup
  2008-02-25  3:18 ` [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup KAMEZAWA Hiroyuki
  2008-02-25  5:36   ` YAMAMOTO Takashi
@ 2008-02-26 13:26   ` minchan Kim
  2008-02-26 13:31     ` minchan Kim
  2008-02-26 23:37     ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 35+ messages in thread
From: minchan Kim @ 2008-02-26 13:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

>  -struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask)
>  +struct page_cgroup *__get_page_cgroup(struct page *page, gfp_t gfpmask)
>   {
>         struct page_cgroup_root *root;
>         struct page_cgroup *pc, *base_addr;
>  @@ -96,8 +110,14 @@ retry:
>         pc = radix_tree_lookup(&root->root_node, idx);
>         rcu_read_unlock();
>
>  +       if (pc) {
>  +               if (!in_interrupt())
>  +                       save_result(pc, idx);
>  +       }
>  +

I didn't look through mem_control's patches yet.
Please understand me if my question might be foolish.

why do you prevent when it happen in interrupt context ?
Do you have any reason ?

-- 
Thanks,
barrios


* Re: [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup
  2008-02-26 13:26   ` minchan Kim
@ 2008-02-26 13:31     ` minchan Kim
  2008-02-26 23:37     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 35+ messages in thread
From: minchan Kim @ 2008-02-26 13:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

I understood your purpose after seeing the following patch: <radix-tree based
page_cgroup. [8/7] vmalloc for large machines>
I am afraid your patch list isn't arranged well. :-<

On Tue, Feb 26, 2008 at 10:26 PM, minchan Kim <barrioskmc@gmail.com> wrote:
> >  -struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask)
>  >  +struct page_cgroup *__get_page_cgroup(struct page *page, gfp_t gfpmask)
>  >   {
>  >         struct page_cgroup_root *root;
>  >         struct page_cgroup *pc, *base_addr;
>  >  @@ -96,8 +110,14 @@ retry:
>  >         pc = radix_tree_lookup(&root->root_node, idx);
>  >         rcu_read_unlock();
>  >
>  >  +       if (pc) {
>  >  +               if (!in_interrupt())
>  >  +                       save_result(pc, idx);
>  >  +       }
>  >  +
>
>  I didn't look through mem_control's patches yet.
>  Please understand me if my question might be foolish.
>
>  why do you prevent when it happen in interrupt context ?
>  Do you have any reason ?
>
>  --
>  Thanks,
>  barrios
>



-- 
Thanks,
barrios


* Re: [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup
  2008-02-26 13:26   ` minchan Kim
  2008-02-26 13:31     ` minchan Kim
@ 2008-02-26 23:37     ` KAMEZAWA Hiroyuki
  2008-02-27  0:57       ` minchan Kim
  1 sibling, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-26 23:37 UTC (permalink / raw)
  To: minchan Kim
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

On Tue, 26 Feb 2008 22:26:40 +0900
"minchan Kim" <barrioskmc@gmail.com> wrote:

> >  -struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask)
> >  +struct page_cgroup *__get_page_cgroup(struct page *page, gfp_t gfpmask)
> >   {
> >         struct page_cgroup_root *root;
> >         struct page_cgroup *pc, *base_addr;
> >  @@ -96,8 +110,14 @@ retry:
> >         pc = radix_tree_lookup(&root->root_node, idx);
> >         rcu_read_unlock();
> >
> >  +       if (pc) {
> >  +               if (!in_interrupt())
> >  +                       save_result(pc, idx);
> >  +       }
> >  +
> 
> I didn't look through mem_control's patches yet.
> Please understand me if my question might be foolish.
> 
> why do you prevent when it happen in interrupt context ?
> Do you have any reason ?
> 
looking up isn't done under irq disable but under preempt disable.

Thanks,
-Kame


* Re: [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup
  2008-02-26 23:37     ` KAMEZAWA Hiroyuki
@ 2008-02-27  0:57       ` minchan Kim
  2008-02-27  1:09         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 35+ messages in thread
From: minchan Kim @ 2008-02-27  0:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

On Wed, Feb 27, 2008 at 8:37 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 26 Feb 2008 22:26:40 +0900
>
> "minchan Kim" <barrioskmc@gmail.com> wrote:
>
>
>
> > >  -struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask)
>  > >  +struct page_cgroup *__get_page_cgroup(struct page *page, gfp_t gfpmask)
>  > >   {
>  > >         struct page_cgroup_root *root;
>  > >         struct page_cgroup *pc, *base_addr;
>  > >  @@ -96,8 +110,14 @@ retry:
>  > >         pc = radix_tree_lookup(&root->root_node, idx);
>  > >         rcu_read_unlock();
>  > >
>  > >  +       if (pc) {
>  > >  +               if (!in_interrupt())
>  > >  +                       save_result(pc, idx);
>  > >  +       }
>  > >  +
>  >
>  > I didn't look through mem_control's patches yet.
>  > Please understand me if my question might be foolish.
>  >
>  > why do you prevent when it happen in interrupt context ?
>  > Do you have any reason ?
>  >
>  looking up isn't done under irq disable but under preempt disable.

I can't understand your point.
Is that check really necessary if the save_result function uses
get_cpu_var and put_cpu_var?

>  Thanks,
>  -Kame
>
>



-- 
Thanks,
barrios


* Re: [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup
  2008-02-27  0:57       ` minchan Kim
@ 2008-02-27  1:09         ` KAMEZAWA Hiroyuki
  2008-02-27  1:21           ` minchan Kim
  0 siblings, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-27  1:09 UTC (permalink / raw)
  To: minchan Kim
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

On Wed, 27 Feb 2008 09:57:49 +0900
"minchan Kim" <barrioskmc@gmail.com> wrote:

> > > >  -struct page_cgroup *get_page_cgroup(struct page *page, gfp_t gfpmask)
> >  > >  +struct page_cgroup *__get_page_cgroup(struct page *page, gfp_t gfpmask)
> >  > >   {
> >  > >         struct page_cgroup_root *root;
> >  > >         struct page_cgroup *pc, *base_addr;
> >  > >  @@ -96,8 +110,14 @@ retry:
> >  > >         pc = radix_tree_lookup(&root->root_node, idx);
> >  > >         rcu_read_unlock();
> >  > >
> >  > >  +       if (pc) {
> >  > >  +               if (!in_interrupt())
> >  > >  +                       save_result(pc, idx);
> >  > >  +       }
> >  > >  +
> >  >
> >  > I didn't look through mem_control's patches yet.
> >  > Please understand me if my question might be foolish.
> >  >
> >  > why do you prevent when it happen in interrupt context ?
> >  > Do you have any reason ?
> >  >
> >  looking up isn't done under irq disable but under preempt disable.
> 
> I can't understand your point.
> Is that check really necessary if the save_result function uses
> get_cpu_var and put_cpu_var?
> 
Lookup is done by this routine.
==
+	if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base)
+		ret = pcp->ents[hnum].base + (pfn - (idx << PCGRP_SHIFT));
==
Then,

 check pcp->ents[hnum].idx == idx, match.
 <interrupt>
              ---------------------------> some codes.
                                           get_page_cgroup()
                                               cache_miss--> __get_page_cgroup()
                                                              --> save_result()
                                           .............
 <ret_from_IRQ> <--------------------------

What will I see ?

Thanks,
-Kame
   





* Re: [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup
  2008-02-27  1:09         ` KAMEZAWA Hiroyuki
@ 2008-02-27  1:21           ` minchan Kim
  0 siblings, 0 replies; 35+ messages in thread
From: minchan Kim @ 2008-02-27  1:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir@linux.vnet.ibm.com, hugh@veritas.com,
	yamamoto@valinux.co.jp, taka, Andi Kleen, nickpiggin@yahoo.com.au,
	linux-mm@kvack.org

>  > >  > why do you prevent when it happen in interrupt context ?
>  > >  > Do you have any reason ?
>  > >  >
>  > >  looking up isn't done under irq disable but under preempt disable.
>  >
>  > I can't understand your point.
>  > Is that check really necessary if the save_result function uses
>  > get_cpu_var and put_cpu_var?
>  >
>  Lookup is done by this routine.
>  ==
>
> +       if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base)
>  +               ret = pcp->ents[hnum].base + (pfn - (idx << PCGRP_SHIFT));
>  ==
>  Then,
>
>   check pcp->ents[hnum].idx == idx, match.
>   <interrupt>
>               ---------------------------> some codes.
>                                            get_page_cgroup()
>                                                cache_miss--> __get_page_cgroup()
>                                                               --> save_result()
>                                            .............
>   <ret_from_IRQ> <--------------------------
>
>  What will I see ?

I see. Thanks for your good explanation.  :-)
I hope you add the above explanation as a comment in the code.
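
Something along these lines, perhaps (the wording is only a suggestion):

==
	/*
	 * Do not update the per-cpu lookup cache from interrupt context:
	 * an interrupt can arrive between the idx check and the use of
	 * ents[hnum] in get_page_cgroup(), and a save_result() done from
	 * that interrupt would change the cached entry under the caller.
	 */
	if (pc) {
		if (!in_interrupt())
			save_result(pc, idx);
	}
==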

>  Thanks,
>  -Kame
>
-- 
Thanks,
barrios


end of thread, other threads:[~2008-02-27  1:21 UTC | newest]

Thread overview: 35+ messages
2008-02-25  3:07 [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction KAMEZAWA Hiroyuki
2008-02-25  3:10 ` [RFC][PATCH] radix-tree based page_cgroup. [1/7] definitions for page_cgroup KAMEZAWA Hiroyuki
2008-02-25  7:47   ` Hirokazu Takahashi
2008-02-25  7:56     ` Balbir Singh
2008-02-25  8:03     ` KAMEZAWA Hiroyuki
2008-02-26  7:46       ` Hirokazu Takahashi
2008-02-26  9:07         ` KAMEZAWA Hiroyuki
2008-02-25  3:12 ` [RFC][PATCH] radix-tree based page_cgroup. [2/7] charge/uncharge KAMEZAWA Hiroyuki
2008-02-25  3:13 ` [RFC][PATCH] radix-tree based page_cgroup. [3/7] move lists KAMEZAWA Hiroyuki
2008-02-25  3:14 ` [RFC][PATCH] radix-tree based page_cgroup. [4/7] migration KAMEZAWA Hiroyuki
2008-02-25  3:16 ` [RFC][PATCH] radix-tree based page_cgroup. [5/7] force_empty KAMEZAWA Hiroyuki
2008-02-25  3:17 ` [RFC][PATCH] radix-tree based page_cgroup. [6/7] radix-tree based page cgroup KAMEZAWA Hiroyuki
2008-02-25  5:56   ` YAMAMOTO Takashi
2008-02-25  6:07     ` KAMEZAWA Hiroyuki
2008-02-25  6:40   ` Hirokazu Takahashi
2008-02-25  6:52     ` KAMEZAWA Hiroyuki
2008-02-25  7:05       ` Hirokazu Takahashi
2008-02-25  7:25         ` KAMEZAWA Hiroyuki
2008-02-25  8:02           ` Hirokazu Takahashi
2008-02-25  8:11             ` KAMEZAWA Hiroyuki
2008-02-25  8:28             ` KAMEZAWA Hiroyuki
2008-02-25  3:18 ` [RFC][PATCH] radix-tree based page_cgroup. [7/7] per cpu fast lookup KAMEZAWA Hiroyuki
2008-02-25  5:36   ` YAMAMOTO Takashi
2008-02-25  5:46     ` KAMEZAWA Hiroyuki
2008-02-26 13:26   ` minchan Kim
2008-02-26 13:31     ` minchan Kim
2008-02-26 23:37     ` KAMEZAWA Hiroyuki
2008-02-27  0:57       ` minchan Kim
2008-02-27  1:09         ` KAMEZAWA Hiroyuki
2008-02-27  1:21           ` minchan Kim
2008-02-25  3:19 ` [RFC][PATCH] radix-tree based page_cgroup. [8/7] vmalloc for large machines KAMEZAWA Hiroyuki
2008-02-25  7:06   ` KAMEZAWA Hiroyuki
2008-02-25  3:24 ` [RFC][PATCH] radix-tree based page_cgroup. [0/7] introduction Balbir Singh
2008-02-25  4:02   ` KAMEZAWA Hiroyuki
2008-02-25  3:31 ` KAMEZAWA Hiroyuki
