[PATCH 0/7] memcg: radix-tree page

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/7] memcg: radix-tree page_cgroup
@ 2008-03-14  9:59 KAMEZAWA Hiroyuki
  2008-03-14 10:03 ` [PATCH 1/7] re-define page_cgroup KAMEZAWA Hiroyuki
                   ` (7 more replies)
  0 siblings, 8 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-14  9:59 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: balbir@linux.vnet.ibm.com, xemul, hugh@veritas.com,
	kamezawa.hiroyu@jp.fujitsu.com

This is a patch set for implemening page_cgroup under radix-tree.
against 2.6.25-rc5-mm1.

Before this patch, page and page_cgroup relationship is

    pfn <-> struct page <-> struct page_cgroup
    lock for page_cgroup is in struct page.

After this patch, 
    struct page <-> pfn <-> struct page_cgroup.
    lock for page_cgroup is in struct page_cgroup itself.

Pros.
   - we can remove an extra pointer in struct page.
   - lock for page_cgroup is moved to page_cgroup inself from struct page.
Cons.
   - For avoiding too much access to radix-tree, some kind of workaround is
     necessary. On this patch set, page_cgroup is managed by chunk of an order.

I think this will hunk with patches which are already merged into -mm tree.
So, if I got positive answers from people, I'd like to rebase this to the next
-mm again in the next week.

Changes from previous (preview) version.
 - page migration algorithm is changed. (thanks to Chrisotph)
 - freeing routine is added.

I did tested on ia64/NUMA box. My x86_64 box is busy....
I'm glad if someone test this in his own box.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 1/7] re-define page_cgroup.
  2008-03-14  9:59 [PATCH 0/7] memcg: radix-tree page_cgroup KAMEZAWA Hiroyuki
@ 2008-03-14 10:03 ` KAMEZAWA Hiroyuki
  2008-03-16 14:15   ` Balbir Singh
                     ` (2 more replies)
  2008-03-14 10:06 ` [PATCH 2/7] charge/uncharge KAMEZAWA Hiroyuki
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-14 10:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

(This is one of a series of patch for "lookup page_cgroup" patches..)

 * Exporting page_cgroup definition.
 * Remove page_cgroup member from sturct page.
 * As result, PAGE_CGROUP_LOCK_BIT and assign/access functions are removed.

Other chages will appear in following patches.
There is a change in the structure itself, spin_lock is added.

Changelog:
 - adjusted to rc5-mm1

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 include/linux/memcontrol.h  |   11 --------
 include/linux/mm_types.h    |    3 --
 include/linux/page_cgroup.h |   47 +++++++++++++++++++++++++++++++++++
 mm/memcontrol.c             |   59 --------------------------------------------
 mm/page_alloc.c             |    8 -----
 5 files changed, 48 insertions(+), 80 deletions(-)

Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
@@ -0,0 +1,47 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * page_cgroup is yet another mem_map structure for accounting  usage.
+ * but, unlike mem_map, allocated on demand for accounted pages.
+ * see also memcontrol.h
+ * In nature, this cosumes much amount of memory.
+ */
+
+struct mem_cgroup;
+
+struct page_cgroup {
+	struct page 		*page;       /* the page this accounts for*/
+	struct mem_cgroup 	*mem_cgroup; /* current cgroup subsys */
+	int    			flags;	     /* See below */
+	int    			refcnt;      /* reference count */
+	spinlock_t		lock;        /* lock for all above members */
+	struct list_head 	lru;         /* for per cgroup LRU */
+};
+
+/* flags */
+#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache. */
+#define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
+
+/*
+ * Lookup and return page_cgroup struct.
+ * returns NULL when
+ * 1. Page Cgroup is not activated yet.
+ * 2. cannot lookup entry and allocate was false.
+ * return -ENOMEM if cannot allocate memory.
+ * If allocate==false, gfpmask will be ignored as a result.
+ */
+
+struct page_cgroup *
+get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate);
+
+#else
+
+static struct page_cgroup *
+get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
+{
+	return NULL;
+}
+#endif
+#endif
Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
===================================================================
--- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
+++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
@@ -30,6 +30,7 @@
 #include <linux/spinlock.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/uaccess.h>
 
@@ -139,33 +140,6 @@ struct mem_cgroup {
 };
 static struct mem_cgroup init_mem_cgroup;
 
-/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock.  We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin).  But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 	0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK 	(1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK	0x0
-#endif
-
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
-	struct list_head lru;		/* per cgroup LRU list */
-	struct page *page;
-	struct mem_cgroup *mem_cgroup;
-	int ref_cnt;			/* cached, mapped, migrating */
-	int flags;
-};
-#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
 
 static int page_cgroup_nid(struct page_cgroup *pc)
 {
@@ -256,37 +230,6 @@ void mm_free_cgroup(struct mm_struct *mm
 	css_put(&mm->mem_cgroup->css);
 }
 
-static inline int page_cgroup_locked(struct page *page)
-{
-	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
-{
-	VM_BUG_ON(!page_cgroup_locked(page));
-	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
-}
-
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
-	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
-	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
-	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
-	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
 static void __mem_cgroup_remove_list(struct page_cgroup *pc)
 {
 	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
Index: mm-2.6.25-rc5-mm1/include/linux/memcontrol.h
===================================================================
--- mm-2.6.25-rc5-mm1.orig/include/linux/memcontrol.h
+++ mm-2.6.25-rc5-mm1/include/linux/memcontrol.h
@@ -30,9 +30,6 @@ struct mm_struct;
 extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
 extern void mm_free_cgroup(struct mm_struct *mm);
 
-#define page_reset_bad_cgroup(page)	((page)->page_cgroup = 0)
-
-extern struct page_cgroup *page_get_page_cgroup(struct page *page);
 extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -82,14 +79,6 @@ static inline void mm_free_cgroup(struct
 {
 }
 
-static inline void page_reset_bad_cgroup(struct page *page)
-{
-}
-
-static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
-	return NULL;
-}
 
 static inline int mem_cgroup_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
Index: mm-2.6.25-rc5-mm1/mm/page_alloc.c
===================================================================
--- mm-2.6.25-rc5-mm1.orig/mm/page_alloc.c
+++ mm-2.6.25-rc5-mm1/mm/page_alloc.c
@@ -222,17 +222,11 @@ static inline int bad_range(struct zone 
 
 static void bad_page(struct page *page)
 {
-	void *pc = page_get_page_cgroup(page);
-
 	printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
 		"page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
 		current->comm, page, (int)(2*sizeof(unsigned long)),
 		(unsigned long)page->flags, page->mapping,
 		page_mapcount(page), page_count(page));
-	if (pc) {
-		printk(KERN_EMERG "cgroup:%p\n", pc);
-		page_reset_bad_cgroup(page);
-	}
 	printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
 		KERN_EMERG "Backtrace:\n");
 	dump_stack();
@@ -478,7 +472,6 @@ static inline int free_pages_check(struc
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_get_page_cgroup(page) != NULL) |
 		(page_count(page) != 0)  |
 		(page->flags & (
 			1 << PG_lru	|
@@ -628,7 +621,6 @@ static int prep_new_page(struct page *pa
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_get_page_cgroup(page) != NULL) |
 		(page_count(page) != 0)  |
 		(page->flags & (
 			1 << PG_lru	|
Index: mm-2.6.25-rc5-mm1/include/linux/mm_types.h
===================================================================
--- mm-2.6.25-rc5-mm1.orig/include/linux/mm_types.h
+++ mm-2.6.25-rc5-mm1/include/linux/mm_types.h
@@ -88,9 +88,6 @@ struct page {
 	void *virtual;			/* Kernel virtual address (NULL if
 					   not kmapped, ie. highmem) */
 #endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-	unsigned long page_cgroup;
-#endif
 #ifdef CONFIG_PAGE_OWNER
 	int order;
 	unsigned int gfp_mask;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/7] re-define page_cgroup.
  2008-03-14 10:03 ` [PATCH 1/7] re-define page_cgroup KAMEZAWA Hiroyuki
@ 2008-03-16 14:15   ` Balbir Singh
  2008-03-18  1:10     ` KAMEZAWA Hiroyuki
  2008-03-17  0:21   ` Li Zefan
  2008-03-17  2:07   ` Li Zefan
  2 siblings, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2008-03-16 14:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:03:13]:

> (This is one of a series of patch for "lookup page_cgroup" patches..)
> 
>  * Exporting page_cgroup definition.
>  * Remove page_cgroup member from sturct page.
>  * As result, PAGE_CGROUP_LOCK_BIT and assign/access functions are removed.
>

The memory controller effectively becomes unavailable/unusable since
page_get_page_cgroup() returns NULL. What happens when we try to do
page_assign_page_cgroup() in mem_cgroup_charge_common()? I fear that
this will break git-bisect. I am in the middle of compiling the
patches one-by-one. I'll report the results soon
 
> Other chages will appear in following patches.
> There is a change in the structure itself, spin_lock is added.
> 
> Changelog:
>  - adjusted to rc5-mm1
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> 
>  include/linux/memcontrol.h  |   11 --------
>  include/linux/mm_types.h    |    3 --
>  include/linux/page_cgroup.h |   47 +++++++++++++++++++++++++++++++++++
>  mm/memcontrol.c             |   59 --------------------------------------------
>  mm/page_alloc.c             |    8 -----
>  5 files changed, 48 insertions(+), 80 deletions(-)
> 
> Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> ===================================================================
> --- /dev/null
> +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> @@ -0,0 +1,47 @@
> +#ifndef __LINUX_PAGE_CGROUP_H
> +#define __LINUX_PAGE_CGROUP_H
> +

Since this is a new file could you please add a copyright and license.

> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * page_cgroup is yet another mem_map structure for accounting  usage.
> + * but, unlike mem_map, allocated on demand for accounted pages.
> + * see also memcontrol.h
> + * In nature, this cosumes much amount of memory.
                     ^^^^^^^^
                     consumes
> + */
> +
> +struct mem_cgroup;
> +
> +struct page_cgroup {
> +	struct page 		*page;       /* the page this accounts for*/
                                                                         ^^^
                                                              please add space
> +	struct mem_cgroup 	*mem_cgroup; /* current cgroup subsys */
> +	int    			flags;	     /* See below */
> +	int    			refcnt;      /* reference count */
> +	spinlock_t		lock;        /* lock for all above members */
> +	struct list_head 	lru;         /* for per cgroup LRU */
> +};
> +
> +/* flags */
> +#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache. */
> +#define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
> +
> +/*
> + * Lookup and return page_cgroup struct.
> + * returns NULL when
> + * 1. Page Cgroup is not activated yet.
> + * 2. cannot lookup entry and allocate was false.
> + * return -ENOMEM if cannot allocate memory.
> + * If allocate==false, gfpmask will be ignored as a result.
> + */
> +
> +struct page_cgroup *
> +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate);
> +

Shouldn't we split this into two functions

get_page_cgroup() and allocate_new_page_cgroup(). I know we do pass
boolean parameters all the time to tweak the behaviour of a function,
but I suspect splitting this into two will create better code.

> +#else
> +
> +static struct page_cgroup *
> +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> +{
> +	return NULL;
> +}
> +#endif
> +#endif
> Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
> +++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
> @@ -30,6 +30,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/fs.h>
>  #include <linux/seq_file.h>
> +#include <linux/page_cgroup.h>
> 

Isn't it better if memcontrol.h includes this header file?

>  #include <asm/uaccess.h>
> 
> @@ -139,33 +140,6 @@ struct mem_cgroup {
>  };
>  static struct mem_cgroup init_mem_cgroup;
> 
> -/*
> - * We use the lower bit of the page->page_cgroup pointer as a bit spin
> - * lock.  We need to ensure that page->page_cgroup is at least two
> - * byte aligned (based on comments from Nick Piggin).  But since
> - * bit_spin_lock doesn't actually set that lock bit in a non-debug
> - * uniprocessor kernel, we should avoid setting it here too.
> - */
> -#define PAGE_CGROUP_LOCK_BIT 	0x0
> -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> -#define PAGE_CGROUP_LOCK 	(1 << PAGE_CGROUP_LOCK_BIT)
> -#else
> -#define PAGE_CGROUP_LOCK	0x0
> -#endif
> -
> -/*
> - * A page_cgroup page is associated with every page descriptor. The
> - * page_cgroup helps us identify information about the cgroup
> - */
> -struct page_cgroup {
> -	struct list_head lru;		/* per cgroup LRU list */
> -	struct page *page;
> -	struct mem_cgroup *mem_cgroup;
> -	int ref_cnt;			/* cached, mapped, migrating */
> -	int flags;
> -};
> -#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
> -#define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
> 
>  static int page_cgroup_nid(struct page_cgroup *pc)
>  {
> @@ -256,37 +230,6 @@ void mm_free_cgroup(struct mm_struct *mm
>  	css_put(&mm->mem_cgroup->css);
>  }
> 
> -static inline int page_cgroup_locked(struct page *page)
> -{
> -	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
> -{
> -	VM_BUG_ON(!page_cgroup_locked(page));
> -	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
> -}
> -
> -struct page_cgroup *page_get_page_cgroup(struct page *page)
> -{
> -	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
> -}
> -
> -static void lock_page_cgroup(struct page *page)
> -{
> -	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static int try_lock_page_cgroup(struct page *page)
> -{
> -	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static void unlock_page_cgroup(struct page *page)
> -{
> -	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
>  static void __mem_cgroup_remove_list(struct page_cgroup *pc)
>  {
>  	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
> Index: mm-2.6.25-rc5-mm1/include/linux/memcontrol.h
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/include/linux/memcontrol.h
> +++ mm-2.6.25-rc5-mm1/include/linux/memcontrol.h
> @@ -30,9 +30,6 @@ struct mm_struct;
>  extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
>  extern void mm_free_cgroup(struct mm_struct *mm);
> 
> -#define page_reset_bad_cgroup(page)	((page)->page_cgroup = 0)
> -
> -extern struct page_cgroup *page_get_page_cgroup(struct page *page);
>  extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask);
>  extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> @@ -82,14 +79,6 @@ static inline void mm_free_cgroup(struct
>  {
>  }
> 
> -static inline void page_reset_bad_cgroup(struct page *page)
> -{
> -}
> -
> -static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
> -{
> -	return NULL;
> -}
> 
>  static inline int mem_cgroup_charge(struct page *page,
>  					struct mm_struct *mm, gfp_t gfp_mask)
> Index: mm-2.6.25-rc5-mm1/mm/page_alloc.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/page_alloc.c
> +++ mm-2.6.25-rc5-mm1/mm/page_alloc.c
> @@ -222,17 +222,11 @@ static inline int bad_range(struct zone 
> 
>  static void bad_page(struct page *page)
>  {
> -	void *pc = page_get_page_cgroup(page);
> -
>  	printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
>  		"page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
>  		current->comm, page, (int)(2*sizeof(unsigned long)),
>  		(unsigned long)page->flags, page->mapping,
>  		page_mapcount(page), page_count(page));
> -	if (pc) {
> -		printk(KERN_EMERG "cgroup:%p\n", pc);
> -		page_reset_bad_cgroup(page);
> -	}
>  	printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
>  		KERN_EMERG "Backtrace:\n");
>  	dump_stack();
> @@ -478,7 +472,6 @@ static inline int free_pages_check(struc
>  {
>  	if (unlikely(page_mapcount(page) |
>  		(page->mapping != NULL)  |
> -		(page_get_page_cgroup(page) != NULL) |
>  		(page_count(page) != 0)  |
>  		(page->flags & (
>  			1 << PG_lru	|
> @@ -628,7 +621,6 @@ static int prep_new_page(struct page *pa
>  {
>  	if (unlikely(page_mapcount(page) |
>  		(page->mapping != NULL)  |
> -		(page_get_page_cgroup(page) != NULL) |
>  		(page_count(page) != 0)  |
>  		(page->flags & (
>  			1 << PG_lru	|
> Index: mm-2.6.25-rc5-mm1/include/linux/mm_types.h
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/include/linux/mm_types.h
> +++ mm-2.6.25-rc5-mm1/include/linux/mm_types.h
> @@ -88,9 +88,6 @@ struct page {
>  	void *virtual;			/* Kernel virtual address (NULL if
>  					   not kmapped, ie. highmem) */
>  #endif /* WANT_PAGE_VIRTUAL */
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -	unsigned long page_cgroup;
> -#endif
>  #ifdef CONFIG_PAGE_OWNER
>  	int order;
>  	unsigned int gfp_mask;
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/7] re-define page_cgroup.
  2008-03-16 14:15   ` Balbir Singh
@ 2008-03-18  1:10     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-18  1:10 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

On Sun, 16 Mar 2008 19:45:28 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:03:13]:
> 
> > (This is one of a series of patch for "lookup page_cgroup" patches..)
> > 
> >  * Exporting page_cgroup definition.
> >  * Remove page_cgroup member from sturct page.
> >  * As result, PAGE_CGROUP_LOCK_BIT and assign/access functions are removed.
> >
> 
> The memory controller effectively becomes unavailable/unusable since
> page_get_page_cgroup() returns NULL. What happens when we try to do
> page_assign_page_cgroup() in mem_cgroup_charge_common()? I fear that
> this will break git-bisect. I am in the middle of compiling the
> patches one-by-one. I'll report the results soon
>  

Hmm, I canoot get your point. After this patches are applied,
page_assing_page_cgroup() will disappear.


> > Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> > ===================================================================
> > --- /dev/null
> > +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> > @@ -0,0 +1,47 @@
> > +#ifndef __LINUX_PAGE_CGROUP_H
> > +#define __LINUX_PAGE_CGROUP_H
> > +
> 
> Since this is a new file could you please add a copyright and license.
> 
ok.


> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/*
> > + * page_cgroup is yet another mem_map structure for accounting  usage.
> > + * but, unlike mem_map, allocated on demand for accounted pages.
> > + * see also memcontrol.h
> > + * In nature, this cosumes much amount of memory.
>                      ^^^^^^^^
>                      consumes
i.c.


> > +/* flags */
> > +#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache. */
> > +#define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
> > +
> > +/*
> > + * Lookup and return page_cgroup struct.
> > + * returns NULL when
> > + * 1. Page Cgroup is not activated yet.
> > + * 2. cannot lookup entry and allocate was false.
> > + * return -ENOMEM if cannot allocate memory.
> > + * If allocate==false, gfpmask will be ignored as a result.
> > + */
> > +
> > +struct page_cgroup *
> > +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate);
> > +
> 
> Shouldn't we split this into two functions
> 
> get_page_cgroup() and allocate_new_page_cgroup(). I know we do pass
> boolean parameters all the time to tweak the behaviour of a function,
> but I suspect splitting this into two will create better code.

Hmm, ok.

get_page_cgroup() / get_alloc_page_cgroup()


> > Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
> > ===================================================================
> > --- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
> > +++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
> > @@ -30,6 +30,7 @@
> >  #include <linux/spinlock.h>
> >  #include <linux/fs.h>
> >  #include <linux/seq_file.h>
> > +#include <linux/page_cgroup.h>
> > 
> 
> Isn't it better if memcontrol.h includes this header file?
> 
I'm wondering this will be reused by some other controller.
But merging is ok.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/7] re-define page_cgroup.
  2008-03-14 10:03 ` [PATCH 1/7] re-define page_cgroup KAMEZAWA Hiroyuki
  2008-03-16 14:15   ` Balbir Singh
@ 2008-03-17  0:21   ` Li Zefan
  2008-03-18  1:12     ` KAMEZAWA Hiroyuki
  2008-03-17  2:07   ` Li Zefan
  2 siblings, 1 reply; 53+ messages in thread
From: Li Zefan @ 2008-03-17  0:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

KAMEZAWA Hiroyuki wrote:
> (This is one of a series of patch for "lookup page_cgroup" patches..)
> 
>  * Exporting page_cgroup definition.
>  * Remove page_cgroup member from sturct page.
>  * As result, PAGE_CGROUP_LOCK_BIT and assign/access functions are removed.
> 
> Other chages will appear in following patches.
> There is a change in the structure itself, spin_lock is added.
> 
> Changelog:
>  - adjusted to rc5-mm1
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 

Please don't break git-bisect. Make sure your patches can be applied one
by one.

mm/memcontrol.c: In function a??mem_cgroup_move_listsa??:
mm/memcontrol.c:309: error: implicit declaration of function a??try_lock_page_cgroupa??
mm/memcontrol.c:312: error: implicit declaration of function a??page_get_page_cgroupa??
mm/memcontrol.c:312: warning: assignment makes pointer from integer without a cast
mm/memcontrol.c:319: error: implicit declaration of function a??unlock_page_cgroupa??
mm/memcontrol.c: In function a??mem_cgroup_charge_commona??:
mm/memcontrol.c:490: error: implicit declaration of function a??lock_page_cgroupa??
mm/memcontrol.c:491: warning: assignment makes pointer from integer without a cast
mm/memcontrol.c:500: error: a??struct page_cgroupa?? has no member named a??ref_cnta??
mm/memcontrol.c:551: error: a??struct page_cgroupa?? has no member named a??ref_cnta??
mm/memcontrol.c:571: error: implicit declaration of function a??page_assign_page_cgroupa??
mm/memcontrol.c: In function a??mem_cgroup_uncharge_pagea??:
mm/memcontrol.c:621: warning: assignment makes pointer from integer without a cast
mm/memcontrol.c:628: error: a??struct page_cgroupa?? has no member named a??ref_cnta??
mm/memcontrol.c: In function a??mem_cgroup_prepare_migrationa??:
mm/memcontrol.c:661: warning: assignment makes pointer from integer without a cast
mm/memcontrol.c:663: error: a??struct page_cgroupa?? has no member named a??ref_cnta??
mm/memcontrol.c: In function a??mem_cgroup_page_migrationa??:
mm/memcontrol.c:685: warning: assignment makes pointer from integer without a cast

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/7] re-define page_cgroup.
  2008-03-17  0:21   ` Li Zefan
@ 2008-03-18  1:12     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-18  1:12 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Mon, 17 Mar 2008 09:21:57 +0900
Li Zefan <lizf@cn.fujitsu.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > (This is one of a series of patch for "lookup page_cgroup" patches..)
> > 
> >  * Exporting page_cgroup definition.
> >  * Remove page_cgroup member from sturct page.
> >  * As result, PAGE_CGROUP_LOCK_BIT and assign/access functions are removed.
> > 
> > Other chages will appear in following patches.
> > There is a change in the structure itself, spin_lock is added.
> > 
> > Changelog:
> >  - adjusted to rc5-mm1
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> 
> Please don't break git-bisect. Make sure your patches can be applied one
> by one.
> 
folding all 7 patches into one patch ?

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/7] re-define page_cgroup.
  2008-03-14 10:03 ` [PATCH 1/7] re-define page_cgroup KAMEZAWA Hiroyuki
  2008-03-16 14:15   ` Balbir Singh
  2008-03-17  0:21   ` Li Zefan
@ 2008-03-17  2:07   ` Li Zefan
  2008-03-18  1:11     ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 53+ messages in thread
From: Li Zefan @ 2008-03-17  2:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

KAMEZAWA Hiroyuki wrote:
> (This is one of a series of patch for "lookup page_cgroup" patches..)
> 
>  * Exporting page_cgroup definition.
>  * Remove page_cgroup member from sturct page.
>  * As result, PAGE_CGROUP_LOCK_BIT and assign/access functions are removed.
> 
> Other chages will appear in following patches.
> There is a change in the structure itself, spin_lock is added.
> 
> Changelog:
>  - adjusted to rc5-mm1
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> 
>  include/linux/memcontrol.h  |   11 --------
>  include/linux/mm_types.h    |    3 --
>  include/linux/page_cgroup.h |   47 +++++++++++++++++++++++++++++++++++
>  mm/memcontrol.c             |   59 --------------------------------------------
>  mm/page_alloc.c             |    8 -----
>  5 files changed, 48 insertions(+), 80 deletions(-)
> 
> Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> ===================================================================
> --- /dev/null
> +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> @@ -0,0 +1,47 @@
> +#ifndef __LINUX_PAGE_CGROUP_H
> +#define __LINUX_PAGE_CGROUP_H
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * page_cgroup is yet another mem_map structure for accounting  usage.

                                                      two spaces ^^

> + * but, unlike mem_map, allocated on demand for accounted pages.
> + * see also memcontrol.h
> + * In nature, this cosumes much amount of memory.
> + */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/7] re-define page_cgroup.
  2008-03-17  2:07   ` Li Zefan
@ 2008-03-18  1:11     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-18  1:11 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Mon, 17 Mar 2008 11:07:02 +0900
Li Zefan <lizf@cn.fujitsu.com> wrote:

> > Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> > ===================================================================
> > --- /dev/null
> > +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> > @@ -0,0 +1,47 @@
> > +#ifndef __LINUX_PAGE_CGROUP_H
> > +#define __LINUX_PAGE_CGROUP_H
> > +
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/*
> > + * page_cgroup is yet another mem_map structure for accounting  usage.
> 
>                                                       two spaces ^^
> 
thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 2/7] charge/uncharge
  2008-03-14  9:59 [PATCH 0/7] memcg: radix-tree page_cgroup KAMEZAWA Hiroyuki
  2008-03-14 10:03 ` [PATCH 1/7] re-define page_cgroup KAMEZAWA Hiroyuki
@ 2008-03-14 10:06 ` KAMEZAWA Hiroyuki
  2008-03-17  1:46   ` Balbir Singh
  2008-03-17  2:26   ` Li Zefan
  2008-03-14 10:07 ` [PATCH 3/7] memcg: move_lists KAMEZAWA Hiroyuki
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-14 10:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

Because bit spin lock is removed and spinlock is added to page_cgroup.
There are some amount of changes.

This patch does
	- modify charge/uncharge to adjust it to the new lock.
	- Added simple lock rule comments.

Major changes from current(-mm) version is
	- pc->refcnt is set as "1" after the charge is done.

Changelog
  - Rebased to rc5-mm1

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 mm/memcontrol.c |  136 +++++++++++++++++++++++++-------------------------------
 1 file changed, 62 insertions(+), 74 deletions(-)

Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
===================================================================
--- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
+++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
@@ -34,6 +34,16 @@
 
 #include <asm/uaccess.h>
 
+/*
+ * Lock Rule
+ * zone->lru_lcok (global LRU)
+ *	-> pc->lock (page_cgroup's lock)
+ *		-> mz->lru_lock (mem_cgroup's per_zone lock.)
+ *
+ * At least, mz->lru_lock and pc->lock should be acquired irq off.
+ *
+ */
+
 struct cgroup_subsys mem_cgroup_subsys;
 static const int MEM_CGROUP_RECLAIM_RETRIES = 5;
 
@@ -479,33 +489,22 @@ static int mem_cgroup_charge_common(stru
 	if (mem_cgroup_subsys.disabled)
 		return 0;
 
+	pc = get_page_cgroup(page, gfp_mask, true);
+	if (!pc || IS_ERR(pc))
+		return PTR_ERR(pc);
+
+	spin_lock_irqsave(&pc->lock, flags);
 	/*
-	 * Should page_cgroup's go to their own slab?
-	 * One could optimize the performance of the charging routine
-	 * by saving a bit in the page_flags and using it as a lock
-	 * to see if the cgroup page already has a page_cgroup associated
-	 * with it
-	 */
-retry:
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	/*
-	 * The page_cgroup exists and
-	 * the page has already been accounted.
+	 * Has the page already been accounted ?
 	 */
-	if (pc) {
-		VM_BUG_ON(pc->page != page);
-		VM_BUG_ON(pc->ref_cnt <= 0);
-
-		pc->ref_cnt++;
-		unlock_page_cgroup(page);
-		goto done;
+	if (pc->refcnt > 0) {
+		pc->refcnt++;
+		spin_unlock_irqrestore(&pc->lock, flags);
+		goto success;
 	}
-	unlock_page_cgroup(page);
+	spin_unlock_irqrestore(&pc->lock, flags);
 
-	pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
-	if (pc == NULL)
-		goto err;
+	/* Note: pc->refcnt is still 0 here. */
 
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
@@ -526,7 +525,7 @@ retry:
 
 	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
 		if (!(gfp_mask & __GFP_WAIT))
-			goto out;
+			goto nomem;
 
 		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
 			continue;
@@ -543,45 +542,40 @@ retry:
 
 		if (!nr_retries--) {
 			mem_cgroup_out_of_memory(mem, gfp_mask);
-			goto out;
+			goto nomem;
 		}
 		congestion_wait(WRITE, HZ/10);
 	}
-
-	pc->ref_cnt = 1;
+	/*
+ 	 * We have to acquire 2 spinlocks.
+	 */
+	spin_lock_irqsave(&pc->lock, flags);
+	if (pc->refcnt) {
+		/* Someone charged this page while we released the lock */
+		++pc->refcnt;
+		spin_unlock_irqrestore(&pc->lock, flags);
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		css_put(&mem->css);
+		goto success;
+	}
+	/* Anyone doesn't touch this. */
+	VM_BUG_ON(pc->mem_cgroup);
+	VM_BUG_ON(!list_empty(&pc->lru));
+	pc->refcnt = 1;
 	pc->mem_cgroup = mem;
-	pc->page = page;
 	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
 		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
-
-	lock_page_cgroup(page);
-	if (page_get_page_cgroup(page)) {
-		unlock_page_cgroup(page);
-		/*
-		 * Another charge has been added to this page already.
-		 * We take lock_page_cgroup(page) again and read
-		 * page->cgroup, increment refcnt.... just retry is OK.
-		 */
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		css_put(&mem->css);
-		kfree(pc);
-		goto retry;
-	}
-	page_assign_page_cgroup(page, pc);
-
 	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
+	spin_lock(&mz->lru_lock);
 	__mem_cgroup_add_list(pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	spin_unlock(&mz->lru_lock);
+	spin_unlock_irqrestore(&pc->lock, flags);
 
-	unlock_page_cgroup(page);
-done:
+success:
 	return 0;
-out:
+nomem:
 	css_put(&mem->css);
-	kfree(pc);
-err:
 	return -ENOMEM;
 }
 
@@ -617,33 +611,27 @@ void mem_cgroup_uncharge_page(struct pag
 	/*
 	 * Check if our page_cgroup is valid
 	 */
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
+	pc = get_page_cgroup(page, GFP_ATOMIC, false); /* No allocation */
 	if (!pc)
-		goto unlock;
-
-	VM_BUG_ON(pc->page != page);
-	VM_BUG_ON(pc->ref_cnt <= 0);
-
-	if (--(pc->ref_cnt) == 0) {
-		mz = page_cgroup_zoneinfo(pc);
-		spin_lock_irqsave(&mz->lru_lock, flags);
-		__mem_cgroup_remove_list(pc);
-		spin_unlock_irqrestore(&mz->lru_lock, flags);
-
-		page_assign_page_cgroup(page, NULL);
-		unlock_page_cgroup(page);
-
-		mem = pc->mem_cgroup;
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		css_put(&mem->css);
-
-		kfree(pc);
+		return;
+	spin_lock_irqsave(&pc->lock, flags);
+	if (!pc->refcnt || --pc->refcnt > 0) {
+		spin_unlock_irqrestore(&pc->lock, flags);
 		return;
 	}
+	VM_BUG_ON(pc->page != page);
+	mz = page_cgroup_zoneinfo(pc);
+	mem = pc->mem_cgroup;
 
-unlock:
-	unlock_page_cgroup(page);
+	spin_lock(&mz->lru_lock);
+	__mem_cgroup_remove_list(pc);
+	spin_unlock(&mz->lru_lock);
+
+	pc->flags = 0;
+	pc->mem_cgroup = 0;
+	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	css_put(&mem->css);
+	spin_unlock_irqrestore(&pc->lock, flags);
 }
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/7] charge/uncharge
  2008-03-14 10:06 ` [PATCH 2/7] charge/uncharge KAMEZAWA Hiroyuki
@ 2008-03-17  1:46   ` Balbir Singh
  2008-03-18  1:14     ` KAMEZAWA Hiroyuki
  2008-03-17  2:26   ` Li Zefan
  1 sibling, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2008-03-17  1:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:06:22]:

> Because bit spin lock is removed and spinlock is added to page_cgroup.
> There are some amount of changes.
> 
> This patch does
> 	- modify charge/uncharge to adjust it to the new lock.
> 	- Added simple lock rule comments.
> 
> Major changes from current(-mm) version is
> 	- pc->refcnt is set as "1" after the charge is done.
> 
> Changelog
>   - Rebased to rc5-mm1
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>

Hi, KAMEZAWA-San,

The build continues to be broken, even after this patch is applied.
We will have to find another way to refactor the code, so that we
don't break git-bisect.
 
> 
>  mm/memcontrol.c |  136 +++++++++++++++++++++++++-------------------------------
>  1 file changed, 62 insertions(+), 74 deletions(-)
> 
> Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
> +++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
> @@ -34,6 +34,16 @@
> 
>  #include <asm/uaccess.h>
> 
> +/*
> + * Lock Rule
> + * zone->lru_lcok (global LRU)
> + *	-> pc->lock (page_cgroup's lock)
> + *		-> mz->lru_lock (mem_cgroup's per_zone lock.)
> + *
> + * At least, mz->lru_lock and pc->lock should be acquired irq off.
> + *
> + */
> +

I think the rule applies to even the zone's lru_lock, so we could just
state that these two locks should be acquired with irq's off.

>  struct cgroup_subsys mem_cgroup_subsys;
>  static const int MEM_CGROUP_RECLAIM_RETRIES = 5;
> 
> @@ -479,33 +489,22 @@ static int mem_cgroup_charge_common(stru
>  	if (mem_cgroup_subsys.disabled)
>  		return 0;
> 
> +	pc = get_page_cgroup(page, gfp_mask, true);
> +	if (!pc || IS_ERR(pc))
> +		return PTR_ERR(pc);
> +
> +	spin_lock_irqsave(&pc->lock, flags);
>  	/*
> -	 * Should page_cgroup's go to their own slab?
> -	 * One could optimize the performance of the charging routine
> -	 * by saving a bit in the page_flags and using it as a lock
> -	 * to see if the cgroup page already has a page_cgroup associated
> -	 * with it
> -	 */
> -retry:
> -	lock_page_cgroup(page);
> -	pc = page_get_page_cgroup(page);
> -	/*
> -	 * The page_cgroup exists and
> -	 * the page has already been accounted.
> +	 * Has the page already been accounted ?
>  	 */
> -	if (pc) {
> -		VM_BUG_ON(pc->page != page);
> -		VM_BUG_ON(pc->ref_cnt <= 0);
> -
> -		pc->ref_cnt++;
> -		unlock_page_cgroup(page);
> -		goto done;
> +	if (pc->refcnt > 0) {
> +		pc->refcnt++;
> +		spin_unlock_irqrestore(&pc->lock, flags);
> +		goto success;
>  	}
> -	unlock_page_cgroup(page);
> +	spin_unlock_irqrestore(&pc->lock, flags);
> 
> -	pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
> -	if (pc == NULL)
> -		goto err;
> +	/* Note: pc->refcnt is still 0 here. */
>

I think the comment can be updated to say for new pc's the refcnt is
0.
 
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.
> @@ -526,7 +525,7 @@ retry:
> 
>  	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
>  		if (!(gfp_mask & __GFP_WAIT))
> -			goto out;
> +			goto nomem;
> 
>  		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
>  			continue;
> @@ -543,45 +542,40 @@ retry:
> 
>  		if (!nr_retries--) {
>  			mem_cgroup_out_of_memory(mem, gfp_mask);
> -			goto out;
> +			goto nomem;
>  		}
>  		congestion_wait(WRITE, HZ/10);
>  	}
> -
> -	pc->ref_cnt = 1;
> +	/*
> + 	 * We have to acquire 2 spinlocks.
> +	 */
> +	spin_lock_irqsave(&pc->lock, flags);
> +	if (pc->refcnt) {
> +		/* Someone charged this page while we released the lock */
> +		++pc->refcnt;

We used pc->refcnt++ earlier, for consistency we could use that here
as well.

> +		spin_unlock_irqrestore(&pc->lock, flags);
> +		res_counter_uncharge(&mem->res, PAGE_SIZE);
> +		css_put(&mem->css);
> +		goto success;
> +	}
> +	/* Anyone doesn't touch this. */
> +	VM_BUG_ON(pc->mem_cgroup);
> +	VM_BUG_ON(!list_empty(&pc->lru));
> +	pc->refcnt = 1;
>  	pc->mem_cgroup = mem;
> -	pc->page = page;
>  	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
>  	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
>  		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
> -
> -	lock_page_cgroup(page);
> -	if (page_get_page_cgroup(page)) {
> -		unlock_page_cgroup(page);
> -		/*
> -		 * Another charge has been added to this page already.
> -		 * We take lock_page_cgroup(page) again and read
> -		 * page->cgroup, increment refcnt.... just retry is OK.
> -		 */
> -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> -		css_put(&mem->css);
> -		kfree(pc);
> -		goto retry;
> -	}
> -	page_assign_page_cgroup(page, pc);
> -
>  	mz = page_cgroup_zoneinfo(pc);
> -	spin_lock_irqsave(&mz->lru_lock, flags);
> +	spin_lock(&mz->lru_lock);
>  	__mem_cgroup_add_list(pc);
> -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> +	spin_unlock(&mz->lru_lock);
> +	spin_unlock_irqrestore(&pc->lock, flags);
> 
> -	unlock_page_cgroup(page);
> -done:
> +success:
>  	return 0;
> -out:
> +nomem:
>  	css_put(&mem->css);
> -	kfree(pc);
> -err:
>  	return -ENOMEM;
>  }
> 
> @@ -617,33 +611,27 @@ void mem_cgroup_uncharge_page(struct pag
>  	/*
>  	 * Check if our page_cgroup is valid
>  	 */
> -	lock_page_cgroup(page);
> -	pc = page_get_page_cgroup(page);
> +	pc = get_page_cgroup(page, GFP_ATOMIC, false); /* No allocation */
>  	if (!pc)
> -		goto unlock;
> -
> -	VM_BUG_ON(pc->page != page);
> -	VM_BUG_ON(pc->ref_cnt <= 0);
> -
> -	if (--(pc->ref_cnt) == 0) {
> -		mz = page_cgroup_zoneinfo(pc);
> -		spin_lock_irqsave(&mz->lru_lock, flags);
> -		__mem_cgroup_remove_list(pc);
> -		spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> -		page_assign_page_cgroup(page, NULL);
> -		unlock_page_cgroup(page);
> -
> -		mem = pc->mem_cgroup;
> -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> -		css_put(&mem->css);
> -
> -		kfree(pc);
> +		return;
> +	spin_lock_irqsave(&pc->lock, flags);
> +	if (!pc->refcnt || --pc->refcnt > 0) {
> +		spin_unlock_irqrestore(&pc->lock, flags);
>  		return;
>  	}
> +	VM_BUG_ON(pc->page != page);
> +	mz = page_cgroup_zoneinfo(pc);
> +	mem = pc->mem_cgroup;
> 
> -unlock:
> -	unlock_page_cgroup(page);
> +	spin_lock(&mz->lru_lock);
> +	__mem_cgroup_remove_list(pc);
> +	spin_unlock(&mz->lru_lock);
> +
> +	pc->flags = 0;
> +	pc->mem_cgroup = 0;
> +	res_counter_uncharge(&mem->res, PAGE_SIZE);
> +	css_put(&mem->css);
> +	spin_unlock_irqrestore(&pc->lock, flags);
>  }
>


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/7] charge/uncharge
  2008-03-17  1:46   ` Balbir Singh
@ 2008-03-18  1:14     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-18  1:14 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

On Mon, 17 Mar 2008 07:16:01 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:06:22]:
> 
> > Because bit spin lock is removed and spinlock is added to page_cgroup.
> > There are some amount of changes.
> > 
> > This patch does
> > 	- modify charge/uncharge to adjust it to the new lock.
> > 	- Added simple lock rule comments.
> > 
> > Major changes from current(-mm) version is
> > 	- pc->refcnt is set as "1" after the charge is done.
> > 
> > Changelog
> >   - Rebased to rc5-mm1
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> 
> Hi, KAMEZAWA-San,
> 
> The build continues to be broken, even after this patch is applied.
> We will have to find another way to refactor the code, so that we
> don't break git-bisect.
>  
At least, patch 1-5 should be applied.
Hmm, ok. folding patch 1-5 to one patch.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/7] charge/uncharge
  2008-03-14 10:06 ` [PATCH 2/7] charge/uncharge KAMEZAWA Hiroyuki
  2008-03-17  1:46   ` Balbir Singh
@ 2008-03-17  2:26   ` Li Zefan
  2008-03-18  1:15     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 53+ messages in thread
From: Li Zefan @ 2008-03-17  2:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

KAMEZAWA Hiroyuki wrote:
> Because bit spin lock is removed and spinlock is added to page_cgroup.
> There are some amount of changes.
> 
> This patch does
> 	- modify charge/uncharge to adjust it to the new lock.
> 	- Added simple lock rule comments.
> 
> Major changes from current(-mm) version is
> 	- pc->refcnt is set as "1" after the charge is done.
> 
> Changelog
>   - Rebased to rc5-mm1
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> 
>  mm/memcontrol.c |  136 +++++++++++++++++++++++++-------------------------------
>  1 file changed, 62 insertions(+), 74 deletions(-)
> 
> Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
> +++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
> @@ -34,6 +34,16 @@
>  
>  #include <asm/uaccess.h>
>  
> +/*
> + * Lock Rule
> + * zone->lru_lcok (global LRU)
> + *	-> pc->lock (page_cgroup's lock)
> + *		-> mz->lru_lock (mem_cgroup's per_zone lock.)
> + *
> + * At least, mz->lru_lock and pc->lock should be acquired irq off.
> + *
> + */
> +
>  struct cgroup_subsys mem_cgroup_subsys;
>  static const int MEM_CGROUP_RECLAIM_RETRIES = 5;
>  
> @@ -479,33 +489,22 @@ static int mem_cgroup_charge_common(stru
>  	if (mem_cgroup_subsys.disabled)
>  		return 0;
>  
> +	pc = get_page_cgroup(page, gfp_mask, true);
> +	if (!pc || IS_ERR(pc))
> +		return PTR_ERR(pc);
> +

If get_page_cgroup() returns NULL, you will end up return *sucesss* by
returning PTR_ERR(pc)

> +	spin_lock_irqsave(&pc->lock, flags);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/7] charge/uncharge
  2008-03-17  2:26   ` Li Zefan
@ 2008-03-18  1:15     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-18  1:15 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Mon, 17 Mar 2008 11:26:46 +0900
Li Zefan <lizf@cn.fujitsu.com> wrote:

 
> > +	pc = get_page_cgroup(page, gfp_mask, true);
> > +	if (!pc || IS_ERR(pc))
> > +		return PTR_ERR(pc);
> > +
> 
> If get_page_cgroup() returns NULL, you will end up return *sucesss* by
> returning PTR_ERR(pc)
> 
NULL is success. (NULL only returns in boot. I'll add commetns.)

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 3/7] memcg: move_lists
  2008-03-14  9:59 [PATCH 0/7] memcg: radix-tree page_cgroup KAMEZAWA Hiroyuki
  2008-03-14 10:03 ` [PATCH 1/7] re-define page_cgroup KAMEZAWA Hiroyuki
  2008-03-14 10:06 ` [PATCH 2/7] charge/uncharge KAMEZAWA Hiroyuki
@ 2008-03-14 10:07 ` KAMEZAWA Hiroyuki
  2008-03-18 16:44   ` Balbir Singh
  2008-03-14 10:15 ` [PATCH 4/7] memcg: page migration KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-14 10:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

Modifies mem_cgroup_move_lists() to use get_page_cgroup().
No major algorithm changes just adjusted to new locks.

Signed-off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
===================================================================
--- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
+++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
@@ -309,6 +309,10 @@ void mem_cgroup_move_lists(struct page *
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
+	/* This GFP will be ignored..*/
+	pc = get_page_cgroup(page, GFP_ATOMIC, false);
+	if (!pc)
+		return;
 	/*
 	 * We cannot lock_page_cgroup while holding zone's lru_lock,
 	 * because other holders of lock_page_cgroup can be interrupted
@@ -316,17 +320,15 @@ void mem_cgroup_move_lists(struct page *
 	 * safely get to page_cgroup without it, so just try_lock it:
 	 * mem_cgroup_isolate_pages allows for page left on wrong list.
 	 */
-	if (!try_lock_page_cgroup(page))
+	if (!spin_trylock_irqsave(&pc->lock, flags))
 		return;
-
-	pc = page_get_page_cgroup(page);
-	if (pc) {
+	if (pc->refcnt) {
 		mz = page_cgroup_zoneinfo(pc);
-		spin_lock_irqsave(&mz->lru_lock, flags);
+		spin_lock(&mz->lru_lock);
 		__mem_cgroup_move_lists(pc, active);
-		spin_unlock_irqrestore(&mz->lru_lock, flags);
+		spin_unlock(&mz->lru_lock);
 	}
-	unlock_page_cgroup(page);
+	spin_unlock_irqrestore(&pc->lock, flags);
 }
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/7] memcg: move_lists
  2008-03-14 10:07 ` [PATCH 3/7] memcg: move_lists KAMEZAWA Hiroyuki
@ 2008-03-18 16:44   ` Balbir Singh
  2008-03-19  2:34     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2008-03-18 16:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:07:31]:

> Modifies mem_cgroup_move_lists() to use get_page_cgroup().
> No major algorithm changes just adjusted to new locks.
> 
> Signed-off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
>  mm/memcontrol.c |   16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
> +++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
> @@ -309,6 +309,10 @@ void mem_cgroup_move_lists(struct page *
>  	struct mem_cgroup_per_zone *mz;
>  	unsigned long flags;
> 
> +	/* This GFP will be ignored..*/
> +	pc = get_page_cgroup(page, GFP_ATOMIC, false);
> +	if (!pc)
> +		return;

Splitting get_page_cgroup will help avoid thse kinds of hacks. Please
see my earlier comment.

>  	/*
>  	 * We cannot lock_page_cgroup while holding zone's lru_lock,
>  	 * because other holders of lock_page_cgroup can be interrupted
> @@ -316,17 +320,15 @@ void mem_cgroup_move_lists(struct page *
>  	 * safely get to page_cgroup without it, so just try_lock it:
>  	 * mem_cgroup_isolate_pages allows for page left on wrong list.
>  	 */
> -	if (!try_lock_page_cgroup(page))
> +	if (!spin_trylock_irqsave(&pc->lock, flags))
>  		return;
> -
> -	pc = page_get_page_cgroup(page);
> -	if (pc) {
> +	if (pc->refcnt) {
>  		mz = page_cgroup_zoneinfo(pc);
> -		spin_lock_irqsave(&mz->lru_lock, flags);
> +		spin_lock(&mz->lru_lock);
>  		__mem_cgroup_move_lists(pc, active);
> -		spin_unlock_irqrestore(&mz->lru_lock, flags);
> +		spin_unlock(&mz->lru_lock);
>  	}
> -	unlock_page_cgroup(page);
> +	spin_unlock_irqrestore(&pc->lock, flags);
>  }
> 
>  /*
> 

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 3/7] memcg: move_lists
  2008-03-18 16:44   ` Balbir Singh
@ 2008-03-19  2:34     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-19  2:34 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

On Tue, 18 Mar 2008 22:14:37 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:07:31]:
> 
> > Modifies mem_cgroup_move_lists() to use get_page_cgroup().
> > No major algorithm changes just adjusted to new locks.
> > 
> > Signed-off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> >  mm/memcontrol.c |   16 +++++++++-------
> >  1 file changed, 9 insertions(+), 7 deletions(-)
> > 
> > Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
> > ===================================================================
> > --- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
> > +++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
> > @@ -309,6 +309,10 @@ void mem_cgroup_move_lists(struct page *
> >  	struct mem_cgroup_per_zone *mz;
> >  	unsigned long flags;
> > 
> > +	/* This GFP will be ignored..*/
> > +	pc = get_page_cgroup(page, GFP_ATOMIC, false);
> > +	if (!pc)
> > +		return;
> 
> Splitting get_page_cgroup will help avoid thse kinds of hacks. Please
> see my earlier comment.
> 
My new version has 2 funcs.

get_page_cgroup(struct page *page)
get_alloc_page_cgroup(struct page *page, gfp_t mask);

I will post after I can get test machine..

Thanks,
-Kame



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 4/7] memcg: page migration
  2008-03-14  9:59 [PATCH 0/7] memcg: radix-tree page_cgroup KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2008-03-14 10:07 ` [PATCH 3/7] memcg: move_lists KAMEZAWA Hiroyuki
@ 2008-03-14 10:15 ` KAMEZAWA Hiroyuki
  2008-03-17  2:36   ` Li Zefan
  2008-03-18 18:11   ` Balbir Singh
  2008-03-14 10:17 ` [PATCH 5/7] radix-tree page cgroup KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-14 10:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

Christoph Lameter, writer of page migraion, suggested me that
new page_cgroup should be assignd to new page at page is allocated.
This patch changes migration path to assign page_cgroup of new page
at allocation.

Pros:
 - We can avoid compliated lock depndencies.
Cons:
 - Have to handle a page which is not on LRU in memory resource controller.

For pages not-on-LRU, I added PAGE_CGROUP_FLAG_MIGRATION and
mem_cgroup->migrations counter.
(force_empty will not end while migration because new page's
 refcnt is alive until the end of migration.)

I think this version simplifies complicated lock dependency in page migraiton,
but I admit this adds some hacky codes. If you have good idea, please advise me.

Works well under my tests.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>




 include/linux/memcontrol.h  |    5 -
 include/linux/page_cgroup.h |    1 
 mm/memcontrol.c             |  153 +++++++++++++++++++++++++++-----------------
 mm/migrate.c                |   22 ++++--
 4 files changed, 113 insertions(+), 68 deletions(-)

Index: mm-2.6.25-rc5-mm1/include/linux/memcontrol.h
===================================================================
--- mm-2.6.25-rc5-mm1.orig/include/linux/memcontrol.h
+++ mm-2.6.25-rc5-mm1/include/linux/memcontrol.h
@@ -48,9 +48,8 @@ int task_in_mem_cgroup(struct task_struc
 #define mm_match_cgroup(mm, cgroup)	\
 	((cgroup) == rcu_dereference((mm)->mem_cgroup))
 
-extern int mem_cgroup_prepare_migration(struct page *page);
-extern void mem_cgroup_end_migration(struct page *page);
-extern void mem_cgroup_page_migration(struct page *page, struct page *newpage);
+extern int mem_cgroup_prepare_migration(struct page *, struct page *);
+extern void mem_cgroup_end_migration(struct page *);
 
 /*
  * For memory reclaim.
Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
===================================================================
--- mm-2.6.25-rc5-mm1.orig/include/linux/page_cgroup.h
+++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
@@ -23,6 +23,7 @@ struct page_cgroup {
 /* flags */
 #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache. */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
+#define PAGE_CGROUP_FLAG_MIGRATION (0x4) /* is on active list */
 
 /*
  * Lookup and return page_cgroup struct.
Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
===================================================================
--- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
+++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
@@ -147,6 +147,8 @@ struct mem_cgroup {
 	 * statistics.
 	 */
 	struct mem_cgroup_stat stat;
+	/* migration is under going ? */
+	atomic_t migrations;
 };
 static struct mem_cgroup init_mem_cgroup;
 
@@ -164,6 +166,9 @@ static enum zone_type page_cgroup_zid(st
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
+	/* this 2 types are not linked to LRU */
+	MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE,
+	MEM_CGROUP_CHARGE_TYPE_MIGRATION_MAPPED,
 };
 
 /*
@@ -480,7 +485,8 @@ unsigned long mem_cgroup_isolate_pages(u
  * < 0 if the cgroup is over its limit
  */
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype)
+				gfp_t gfp_mask, enum charge_type ctype,
+				struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
@@ -514,16 +520,21 @@ static int mem_cgroup_charge_common(stru
 	 * thread group leader migrates. It's possible that mm is not
 	 * set, if so charge the init_mm (happens for pagecache usage).
 	 */
-	if (!mm)
+	if (!mm && !memcg)
 		mm = &init_mm;
 
-	rcu_read_lock();
-	mem = rcu_dereference(mm->mem_cgroup);
-	/*
-	 * For every charge from the cgroup, increment reference count
-	 */
-	css_get(&mem->css);
-	rcu_read_unlock();
+	if (mm) {
+		rcu_read_lock();
+		mem = rcu_dereference(mm->mem_cgroup);
+		/*
+	 	* For every charge from the cgroup, increment reference count
+	 	*/
+		css_get(&mem->css);
+		rcu_read_unlock();
+	} else {
+		mem = memcg;
+		css_get(&mem->css);
+	}
 
 	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
 		if (!(gfp_mask & __GFP_WAIT))
@@ -566,12 +577,24 @@ static int mem_cgroup_charge_common(stru
 	pc->refcnt = 1;
 	pc->mem_cgroup = mem;
 	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
+	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE ||
+		ctype == MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE)
 		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock(&mz->lru_lock);
-	__mem_cgroup_add_list(pc);
-	spin_unlock(&mz->lru_lock);
+
+	if (ctype == MEM_CGROUP_CHARGE_TYPE_MIGRATION_MAPPED ||
+		ctype == MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE)
+		pc->flags |= PAGE_CGROUP_FLAG_MIGRATION;
+
+	if (pc->flags & PAGE_CGROUP_FLAG_MIGRATION) {
+		/* just remember there is migration */
+		atomic_inc(&mem->migrations);
+	} else {
+		/* add to LRU */
+		mz = page_cgroup_zoneinfo(pc);
+		spin_lock(&mz->lru_lock);
+		__mem_cgroup_add_list(pc);
+		spin_unlock(&mz->lru_lock);
+	}
 	spin_unlock_irqrestore(&pc->lock, flags);
 
 success:
@@ -584,7 +607,7 @@ nomem:
 int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 {
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_MAPPED);
+				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
 }
 
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -593,7 +616,7 @@ int mem_cgroup_cache_charge(struct page 
 	if (!mm)
 		mm = &init_mm;
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_CACHE);
+				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
 }
 
 /*
@@ -637,65 +660,75 @@ void mem_cgroup_uncharge_page(struct pag
 }
 
 /*
- * Returns non-zero if a page (under migration) has valid page_cgroup member.
- * Refcnt of page_cgroup is incremented.
+ * Pre-charge against newpage while moving a page.
+ * This function is called before taking page locks.
  */
-int mem_cgroup_prepare_migration(struct page *page)
+int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
 {
 	struct page_cgroup *pc;
+	struct mem_cgroup *mem = NULL;
+	int ret = 0;
+	enum charge_type type;
+	unsigned long flags;
 
 	if (mem_cgroup_subsys.disabled)
-		return 0;
+		return ret;
+	/* check newpage isn't under memory resource control */
+	pc = get_page_cgroup(newpage, GFP_ATOMIC, false);
+	VM_BUG_ON(pc && pc->refcnt);
 
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (pc)
-		pc->ref_cnt++;
-	unlock_page_cgroup(page);
-	return pc != NULL;
-}
+	pc = get_page_cgroup(page, GFP_ATOMIC, false);
+	spin_lock_irqsave(&pc->lock, flags);
+	if (pc && pc->refcnt) {
+		mem = pc->mem_cgroup;
+		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
+			type = MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE;
+		else
+			type = MEM_CGROUP_CHARGE_TYPE_MIGRATION_MAPPED;
+	}
+	spin_unlock_irqrestore(&pc->lock, flags);
 
-void mem_cgroup_end_migration(struct page *page)
-{
-	mem_cgroup_uncharge_page(page);
+	if (mem) {
+		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
+				type, mem);
+	}
+	return ret;
 }
-
 /*
- * We know both *page* and *newpage* are now not-on-LRU and PG_locked.
- * And no race with uncharge() routines because page_cgroup for *page*
- * has extra one reference by mem_cgroup_prepare_migration.
+ * At the end of migration, we'll push newpage to LRU and
+ * drop one refcnt which added at prepare_migration.
  */
-void mem_cgroup_page_migration(struct page *page, struct page *newpage)
+void mem_cgroup_end_migration(struct page *newpage)
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup *mem;
 	unsigned long flags;
+	int moved = 0;
 
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (!pc) {
-		unlock_page_cgroup(page);
+	if (mem_cgroup_subsys.disabled)
 		return;
-	}
-
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_remove_list(pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
-
-	page_assign_page_cgroup(page, NULL);
-	unlock_page_cgroup(page);
-
-	pc->page = newpage;
-	lock_page_cgroup(newpage);
-	page_assign_page_cgroup(newpage, pc);
 
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_add_list(pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
-
-	unlock_page_cgroup(newpage);
+	pc = get_page_cgroup(newpage, GFP_ATOMIC, false);
+	if (!pc)
+		return;
+	spin_lock_irqsave(&pc->lock, flags);
+	if (pc->flags & PAGE_CGROUP_FLAG_MIGRATION) {
+		pc->flags &= ~PAGE_CGROUP_FLAG_MIGRATION;
+		mem = pc->mem_cgroup;
+		mz = page_cgroup_zoneinfo(pc);
+		spin_lock(&mz->lru_lock);
+		__mem_cgroup_add_list(pc);
+		spin_unlock(&mz->lru_lock);
+		moved = 1;
+	}
+	spin_unlock_irqrestore(&pc->lock, flags);
+	if (!pc)
+		return;
+	if (moved) {
+		mem_cgroup_uncharge_page(newpage);
+		atomic_dec(&mem->migrations);
+	}
 }
 
 /*
@@ -757,6 +790,10 @@ static int mem_cgroup_force_empty(struct
 	while (mem->res.usage > 0) {
 		if (atomic_read(&mem->css.cgroup->count) > 0)
 			goto out;
+		if (atomic_read(&mem->migrations)) {
+			cond_resched();
+			continue;
+		}
 		for_each_node_state(node, N_POSSIBLE)
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				struct mem_cgroup_per_zone *mz;
Index: mm-2.6.25-rc5-mm1/mm/migrate.c
===================================================================
--- mm-2.6.25-rc5-mm1.orig/mm/migrate.c
+++ mm-2.6.25-rc5-mm1/mm/migrate.c
@@ -358,6 +358,12 @@ static int migrate_page_move_mapping(str
 
 	write_unlock_irq(&mapping->tree_lock);
 
+	/* by mem_cgroup_prepare_migration, newpage is already
+	   assigned to valid cgroup. and current->mm and GFP_ATOMIC
+	   will not be used...*/
+	mem_cgroup_uncharge_page(page);
+	mem_cgroup_cache_charge(newpage, current->mm ,GFP_ATOMIC);
+
 	return 0;
 }
 
@@ -603,7 +609,6 @@ static int move_to_new_page(struct page 
 		rc = fallback_migrate_page(mapping, newpage, page);
 
 	if (!rc) {
-		mem_cgroup_page_migration(page, newpage);
 		remove_migration_ptes(page, newpage);
 	} else
 		newpage->mapping = NULL;
@@ -633,6 +638,12 @@ static int unmap_and_move(new_page_t get
 		/* page was freed from under us. So we are done. */
 		goto move_newpage;
 
+	charge = mem_cgroup_prepare_migration(page, newpage);
+	if (charge == -ENOMEM) {
+		rc = -ENOMEM;
+		goto move_newpage;
+	}
+
 	rc = -EAGAIN;
 	if (TestSetPageLocked(page)) {
 		if (!force)
@@ -684,19 +695,14 @@ static int unmap_and_move(new_page_t get
 		goto rcu_unlock;
 	}
 
-	charge = mem_cgroup_prepare_migration(page);
 	/* Establish migration ptes or remove ptes */
 	try_to_unmap(page, 1);
 
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);
 
-	if (rc) {
+	if (rc)
 		remove_migration_ptes(page, page);
-		if (charge)
-			mem_cgroup_end_migration(page);
-	} else if (charge)
- 		mem_cgroup_end_migration(newpage);
 rcu_unlock:
 	if (rcu_locked)
 		rcu_read_unlock();
@@ -717,6 +723,8 @@ unlock:
 	}
 
 move_newpage:
+	if (!charge)
+		mem_cgroup_end_migration(newpage);
 	/*
 	 * Move the new page to the LRU. If migration was not successful
 	 * then this will free the page.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/7] memcg: page migration
  2008-03-14 10:15 ` [PATCH 4/7] memcg: page migration KAMEZAWA Hiroyuki
@ 2008-03-17  2:36   ` Li Zefan
  2008-03-18  1:17     ` KAMEZAWA Hiroyuki
  2008-03-18 18:11   ` Balbir Singh
  1 sibling, 1 reply; 53+ messages in thread
From: Li Zefan @ 2008-03-17  2:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

KAMEZAWA Hiroyuki wrote:
> Christoph Lameter, writer of page migraion, suggested me that
> new page_cgroup should be assignd to new page at page is allocated.
> This patch changes migration path to assign page_cgroup of new page
> at allocation.
> 
> Pros:
>  - We can avoid compliated lock depndencies.
> Cons:
>  - Have to handle a page which is not on LRU in memory resource controller.
> 
> For pages not-on-LRU, I added PAGE_CGROUP_FLAG_MIGRATION and
> mem_cgroup->migrations counter.
> (force_empty will not end while migration because new page's
>  refcnt is alive until the end of migration.)
> 
> I think this version simplifies complicated lock dependency in page migraiton,
> but I admit this adds some hacky codes. If you have good idea, please advise me.
> 
> Works well under my tests.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> 
> 
> 
>  include/linux/memcontrol.h  |    5 -
>  include/linux/page_cgroup.h |    1 
>  mm/memcontrol.c             |  153 +++++++++++++++++++++++++++-----------------
>  mm/migrate.c                |   22 ++++--
>  4 files changed, 113 insertions(+), 68 deletions(-)
> 
> Index: mm-2.6.25-rc5-mm1/include/linux/memcontrol.h
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/include/linux/memcontrol.h
> +++ mm-2.6.25-rc5-mm1/include/linux/memcontrol.h
> @@ -48,9 +48,8 @@ int task_in_mem_cgroup(struct task_struc
>  #define mm_match_cgroup(mm, cgroup)	\
>  	((cgroup) == rcu_dereference((mm)->mem_cgroup))
>  
> -extern int mem_cgroup_prepare_migration(struct page *page);
> -extern void mem_cgroup_end_migration(struct page *page);
> -extern void mem_cgroup_page_migration(struct page *page, struct page *newpage);
> +extern int mem_cgroup_prepare_migration(struct page *, struct page *);
> +extern void mem_cgroup_end_migration(struct page *);
>  
>  /*
>   * For memory reclaim.
> Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/include/linux/page_cgroup.h
> +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> @@ -23,6 +23,7 @@ struct page_cgroup {
>  /* flags */
>  #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache. */
>  #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
> +#define PAGE_CGROUP_FLAG_MIGRATION (0x4) /* is on active list */
>  
>  /*
>   * Lookup and return page_cgroup struct.
> Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
> +++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
> @@ -147,6 +147,8 @@ struct mem_cgroup {
>  	 * statistics.
>  	 */
>  	struct mem_cgroup_stat stat;
> +	/* migration is under going ? */

Please stick to this comment style:
	/*
	 * ...
	 */

> +	atomic_t migrations;
>  };
>  static struct mem_cgroup init_mem_cgroup;
>  
> @@ -164,6 +166,9 @@ static enum zone_type page_cgroup_zid(st
>  enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
>  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> +	/* this 2 types are not linked to LRU */
> +	MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE,
> +	MEM_CGROUP_CHARGE_TYPE_MIGRATION_MAPPED,
>  };
>  
>  /*
> @@ -480,7 +485,8 @@ unsigned long mem_cgroup_isolate_pages(u
>   * < 0 if the cgroup is over its limit
>   */
>  static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask, enum charge_type ctype)
> +				gfp_t gfp_mask, enum charge_type ctype,
> +				struct mem_cgroup *memcg)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> @@ -514,16 +520,21 @@ static int mem_cgroup_charge_common(stru
>  	 * thread group leader migrates. It's possible that mm is not
>  	 * set, if so charge the init_mm (happens for pagecache usage).
>  	 */
> -	if (!mm)
> +	if (!mm && !memcg)
>  		mm = &init_mm;
>  
> -	rcu_read_lock();
> -	mem = rcu_dereference(mm->mem_cgroup);
> -	/*
> -	 * For every charge from the cgroup, increment reference count
> -	 */
> -	css_get(&mem->css);
> -	rcu_read_unlock();
> +	if (mm) {
> +		rcu_read_lock();
> +		mem = rcu_dereference(mm->mem_cgroup);
> +		/*
> +	 	* For every charge from the cgroup, increment reference count
> +	 	*/
> +		css_get(&mem->css);
> +		rcu_read_unlock();
> +	} else {
> +		mem = memcg;
> +		css_get(&mem->css);
> +	}
>  
>  	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
>  		if (!(gfp_mask & __GFP_WAIT))
> @@ -566,12 +577,24 @@ static int mem_cgroup_charge_common(stru
>  	pc->refcnt = 1;
>  	pc->mem_cgroup = mem;
>  	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> -	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
> +	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE ||
> +		ctype == MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE)
>  		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
> -	mz = page_cgroup_zoneinfo(pc);
> -	spin_lock(&mz->lru_lock);
> -	__mem_cgroup_add_list(pc);
> -	spin_unlock(&mz->lru_lock);
> +
> +	if (ctype == MEM_CGROUP_CHARGE_TYPE_MIGRATION_MAPPED ||
> +		ctype == MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE)
> +		pc->flags |= PAGE_CGROUP_FLAG_MIGRATION;
> +
> +	if (pc->flags & PAGE_CGROUP_FLAG_MIGRATION) {
> +		/* just remember there is migration */
> +		atomic_inc(&mem->migrations);
> +	} else {
> +		/* add to LRU */
> +		mz = page_cgroup_zoneinfo(pc);
> +		spin_lock(&mz->lru_lock);
> +		__mem_cgroup_add_list(pc);
> +		spin_unlock(&mz->lru_lock);
> +	}
>  	spin_unlock_irqrestore(&pc->lock, flags);
>  
>  success:
> @@ -584,7 +607,7 @@ nomem:
>  int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
>  {
>  	return mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_MAPPED);
> +				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
>  }
>  
>  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> @@ -593,7 +616,7 @@ int mem_cgroup_cache_charge(struct page 
>  	if (!mm)
>  		mm = &init_mm;
>  	return mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_CACHE);
> +				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
>  }
>  
>  /*
> @@ -637,65 +660,75 @@ void mem_cgroup_uncharge_page(struct pag
>  }
>  
>  /*
> - * Returns non-zero if a page (under migration) has valid page_cgroup member.
> - * Refcnt of page_cgroup is incremented.
> + * Pre-charge against newpage while moving a page.
> + * This function is called before taking page locks.
>   */
> -int mem_cgroup_prepare_migration(struct page *page)
> +int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
>  {
>  	struct page_cgroup *pc;
> +	struct mem_cgroup *mem = NULL;
> +	int ret = 0;
> +	enum charge_type type;
> +	unsigned long flags;
>  
>  	if (mem_cgroup_subsys.disabled)
> -		return 0;
> +		return ret;
> +	/* check newpage isn't under memory resource control */
> +	pc = get_page_cgroup(newpage, GFP_ATOMIC, false);
> +	VM_BUG_ON(pc && pc->refcnt);
>  
> -	lock_page_cgroup(page);
> -	pc = page_get_page_cgroup(page);
> -	if (pc)
> -		pc->ref_cnt++;
> -	unlock_page_cgroup(page);
> -	return pc != NULL;
> -}
> +	pc = get_page_cgroup(page, GFP_ATOMIC, false);
> +	spin_lock_irqsave(&pc->lock, flags);
> +	if (pc && pc->refcnt) {

You check if (pc) after you deference it by &pc->lock, it's a bug
here or the check is unneeded ?

> +		mem = pc->mem_cgroup;
> +		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
> +			type = MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE;
> +		else
> +			type = MEM_CGROUP_CHARGE_TYPE_MIGRATION_MAPPED;
> +	}
> +	spin_unlock_irqrestore(&pc->lock, flags);
>  
> -void mem_cgroup_end_migration(struct page *page)
> -{
> -	mem_cgroup_uncharge_page(page);
> +	if (mem) {
> +		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> +				type, mem);
> +	}
> +	return ret;
>  }
> -
>  /*
> - * We know both *page* and *newpage* are now not-on-LRU and PG_locked.
> - * And no race with uncharge() routines because page_cgroup for *page*
> - * has extra one reference by mem_cgroup_prepare_migration.
> + * At the end of migration, we'll push newpage to LRU and
> + * drop one refcnt which added at prepare_migration.
>   */
> -void mem_cgroup_page_migration(struct page *page, struct page *newpage)
> +void mem_cgroup_end_migration(struct page *newpage)
>  {
>  	struct page_cgroup *pc;
>  	struct mem_cgroup_per_zone *mz;
> +	struct mem_cgroup *mem;
>  	unsigned long flags;
> +	int moved = 0;
>  
> -	lock_page_cgroup(page);
> -	pc = page_get_page_cgroup(page);
> -	if (!pc) {
> -		unlock_page_cgroup(page);
> +	if (mem_cgroup_subsys.disabled)
>  		return;
> -	}
> -
> -	mz = page_cgroup_zoneinfo(pc);
> -	spin_lock_irqsave(&mz->lru_lock, flags);
> -	__mem_cgroup_remove_list(pc);
> -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> -	page_assign_page_cgroup(page, NULL);
> -	unlock_page_cgroup(page);
> -
> -	pc->page = newpage;
> -	lock_page_cgroup(newpage);
> -	page_assign_page_cgroup(newpage, pc);
>  
> -	mz = page_cgroup_zoneinfo(pc);
> -	spin_lock_irqsave(&mz->lru_lock, flags);
> -	__mem_cgroup_add_list(pc);
> -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> -	unlock_page_cgroup(newpage);
> +	pc = get_page_cgroup(newpage, GFP_ATOMIC, false);
> +	if (!pc)
> +		return;
> +	spin_lock_irqsave(&pc->lock, flags);
> +	if (pc->flags & PAGE_CGROUP_FLAG_MIGRATION) {
> +		pc->flags &= ~PAGE_CGROUP_FLAG_MIGRATION;
> +		mem = pc->mem_cgroup;
> +		mz = page_cgroup_zoneinfo(pc);
> +		spin_lock(&mz->lru_lock);
> +		__mem_cgroup_add_list(pc);
> +		spin_unlock(&mz->lru_lock);
> +		moved = 1;
> +	}
> +	spin_unlock_irqrestore(&pc->lock, flags);
> +	if (!pc)
> +		return;

redundant check ?

> +	if (moved) {
> +		mem_cgroup_uncharge_page(newpage);
> +		atomic_dec(&mem->migrations);
> +	}
>  }
>  
>  /*
> @@ -757,6 +790,10 @@ static int mem_cgroup_force_empty(struct
>  	while (mem->res.usage > 0) {
>  		if (atomic_read(&mem->css.cgroup->count) > 0)
>  			goto out;
> +		if (atomic_read(&mem->migrations)) {
> +			cond_resched();
> +			continue;
> +		}
>  		for_each_node_state(node, N_POSSIBLE)
>  			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>  				struct mem_cgroup_per_zone *mz;
> Index: mm-2.6.25-rc5-mm1/mm/migrate.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/migrate.c
> +++ mm-2.6.25-rc5-mm1/mm/migrate.c
> @@ -358,6 +358,12 @@ static int migrate_page_move_mapping(str
>  
>  	write_unlock_irq(&mapping->tree_lock);
>  
> +	/* by mem_cgroup_prepare_migration, newpage is already
> +	   assigned to valid cgroup. and current->mm and GFP_ATOMIC
> +	   will not be used...*/
> +	mem_cgroup_uncharge_page(page);
> +	mem_cgroup_cache_charge(newpage, current->mm ,GFP_ATOMIC);
> +
>  	return 0;
>  }
>  
> @@ -603,7 +609,6 @@ static int move_to_new_page(struct page 
>  		rc = fallback_migrate_page(mapping, newpage, page);
>  
>  	if (!rc) {
> -		mem_cgroup_page_migration(page, newpage);
>  		remove_migration_ptes(page, newpage);
>  	} else
>  		newpage->mapping = NULL;
> @@ -633,6 +638,12 @@ static int unmap_and_move(new_page_t get
>  		/* page was freed from under us. So we are done. */
>  		goto move_newpage;
>  
> +	charge = mem_cgroup_prepare_migration(page, newpage);
> +	if (charge == -ENOMEM) {
> +		rc = -ENOMEM;
> +		goto move_newpage;
> +	}
> +
>  	rc = -EAGAIN;
>  	if (TestSetPageLocked(page)) {
>  		if (!force)
> @@ -684,19 +695,14 @@ static int unmap_and_move(new_page_t get
>  		goto rcu_unlock;
>  	}
>  
> -	charge = mem_cgroup_prepare_migration(page);
>  	/* Establish migration ptes or remove ptes */
>  	try_to_unmap(page, 1);
>  
>  	if (!page_mapped(page))
>  		rc = move_to_new_page(newpage, page);
>  
> -	if (rc) {
> +	if (rc)
>  		remove_migration_ptes(page, page);
> -		if (charge)
> -			mem_cgroup_end_migration(page);
> -	} else if (charge)
> - 		mem_cgroup_end_migration(newpage);
>  rcu_unlock:
>  	if (rcu_locked)
>  		rcu_read_unlock();
> @@ -717,6 +723,8 @@ unlock:
>  	}
>  
>  move_newpage:
> +	if (!charge)
> +		mem_cgroup_end_migration(newpage);
>  	/*
>  	 * Move the new page to the LRU. If migration was not successful
>  	 * then this will free the page.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/7] memcg: page migration
  2008-03-17  2:36   ` Li Zefan
@ 2008-03-18  1:17     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-18  1:17 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Mon, 17 Mar 2008 11:36:41 +0900
Li Zefan <lizf@cn.fujitsu.com> wrote:

> > @@ -147,6 +147,8 @@ struct mem_cgroup {
> >  	 * statistics.
> >  	 */
> >  	struct mem_cgroup_stat stat;
> > +	/* migration is under going ? */
> 
> Please stick to this comment style:
> 	/*
> 	 * ...
> 	 */
> 
ok.


> > +	pc = get_page_cgroup(page, GFP_ATOMIC, false);
> > +	spin_lock_irqsave(&pc->lock, flags);
> > +	if (pc && pc->refcnt) {
> 
> You check if (pc) after you deference it by &pc->lock, it's a bug
> here or the check is unneeded ?
> 
Ah, BUG. Thanks.

> > +		mem = pc->mem_cgroup;
> > +		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
> > +			type = MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE;
> > +		else
> > +			type = MEM_CGROUP_CHARGE_TYPE_MIGRATION_MAPPED;
> > +	}
> > +	spin_unlock_irqrestore(&pc->lock, flags);
> >  
> > -void mem_cgroup_end_migration(struct page *page)
> > -{
> > -	mem_cgroup_uncharge_page(page);
> > +	if (mem) {
> > +		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> > +				type, mem);
> > +	}
> > +	return ret;
> >  }
> > -
> >  /*
> > - * We know both *page* and *newpage* are now not-on-LRU and PG_locked.
> > - * And no race with uncharge() routines because page_cgroup for *page*
> > - * has extra one reference by mem_cgroup_prepare_migration.
> > + * At the end of migration, we'll push newpage to LRU and
> > + * drop one refcnt which added at prepare_migration.
> >   */
> > -void mem_cgroup_page_migration(struct page *page, struct page *newpage)
> > +void mem_cgroup_end_migration(struct page *newpage)
> >  {
> >  	struct page_cgroup *pc;
> >  	struct mem_cgroup_per_zone *mz;
> > +	struct mem_cgroup *mem;
> >  	unsigned long flags;
> > +	int moved = 0;
> >  
> > -	lock_page_cgroup(page);
> > -	pc = page_get_page_cgroup(page);
> > -	if (!pc) {
> > -		unlock_page_cgroup(page);
> > +	if (mem_cgroup_subsys.disabled)
> >  		return;
> > -	}
> > -
> > -	mz = page_cgroup_zoneinfo(pc);
> > -	spin_lock_irqsave(&mz->lru_lock, flags);
> > -	__mem_cgroup_remove_list(pc);
> > -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> > -
> > -	page_assign_page_cgroup(page, NULL);
> > -	unlock_page_cgroup(page);
> > -
> > -	pc->page = newpage;
> > -	lock_page_cgroup(newpage);
> > -	page_assign_page_cgroup(newpage, pc);
> >  
> > -	mz = page_cgroup_zoneinfo(pc);
> > -	spin_lock_irqsave(&mz->lru_lock, flags);
> > -	__mem_cgroup_add_list(pc);
> > -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> > -
> > -	unlock_page_cgroup(newpage);
> > +	pc = get_page_cgroup(newpage, GFP_ATOMIC, false);
> > +	if (!pc)
> > +		return;
> > +	spin_lock_irqsave(&pc->lock, flags);
> > +	if (pc->flags & PAGE_CGROUP_FLAG_MIGRATION) {
> > +		pc->flags &= ~PAGE_CGROUP_FLAG_MIGRATION;
> > +		mem = pc->mem_cgroup;
> > +		mz = page_cgroup_zoneinfo(pc);
> > +		spin_lock(&mz->lru_lock);
> > +		__mem_cgroup_add_list(pc);
> > +		spin_unlock(&mz->lru_lock);
> > +		moved = 1;
> > +	}
> > +	spin_unlock_irqrestore(&pc->lock, flags);
> > +	if (!pc)
> > +		return;
> 
> redundant check ?
> 
yes. will fix.


Thank you.
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/7] memcg: page migration
  2008-03-14 10:15 ` [PATCH 4/7] memcg: page migration KAMEZAWA Hiroyuki
  2008-03-17  2:36   ` Li Zefan
@ 2008-03-18 18:11   ` Balbir Singh
  2008-03-19  2:44     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2008-03-18 18:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:15:43]:

> Christoph Lameter, writer of page migraion, suggested me that
> new page_cgroup should be assignd to new page at page is allocated.
> This patch changes migration path to assign page_cgroup of new page
> at allocation.
> 
> Pros:
>  - We can avoid compliated lock depndencies.
> Cons:
>  - Have to handle a page which is not on LRU in memory resource controller.
> 
> For pages not-on-LRU, I added PAGE_CGROUP_FLAG_MIGRATION and
> mem_cgroup->migrations counter.
> (force_empty will not end while migration because new page's
>  refcnt is alive until the end of migration.)
> 
> I think this version simplifies complicated lock dependency in page migraiton,
> but I admit this adds some hacky codes. If you have good idea, please advise me.
> 
> Works well under my tests.

This code is easier to read as well. I think this a good approach. To
be honest, I've not had the chance to test page migration very often.
Should we update Documentation/controllers/memory.txt to indicate that
migration might prevent force_empty and hence rmdir() from working?

> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> 
> 
> 
>  include/linux/memcontrol.h  |    5 -
>  include/linux/page_cgroup.h |    1 
>  mm/memcontrol.c             |  153 +++++++++++++++++++++++++++-----------------
>  mm/migrate.c                |   22 ++++--
>  4 files changed, 113 insertions(+), 68 deletions(-)
> 
> Index: mm-2.6.25-rc5-mm1/include/linux/memcontrol.h
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/include/linux/memcontrol.h
> +++ mm-2.6.25-rc5-mm1/include/linux/memcontrol.h
> @@ -48,9 +48,8 @@ int task_in_mem_cgroup(struct task_struc
>  #define mm_match_cgroup(mm, cgroup)	\
>  	((cgroup) == rcu_dereference((mm)->mem_cgroup))
> 
> -extern int mem_cgroup_prepare_migration(struct page *page);
> -extern void mem_cgroup_end_migration(struct page *page);
> -extern void mem_cgroup_page_migration(struct page *page, struct page *newpage);
> +extern int mem_cgroup_prepare_migration(struct page *, struct page *);
> +extern void mem_cgroup_end_migration(struct page *);
> 
>  /*
>   * For memory reclaim.
> Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/include/linux/page_cgroup.h
> +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> @@ -23,6 +23,7 @@ struct page_cgroup {
>  /* flags */
>  #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache. */
>  #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
> +#define PAGE_CGROUP_FLAG_MIGRATION (0x4) /* is on active list */
> 
>  /*
>   * Lookup and return page_cgroup struct.
> Index: mm-2.6.25-rc5-mm1/mm/memcontrol.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/memcontrol.c
> +++ mm-2.6.25-rc5-mm1/mm/memcontrol.c
> @@ -147,6 +147,8 @@ struct mem_cgroup {
>  	 * statistics.
>  	 */
>  	struct mem_cgroup_stat stat;
> +	/* migration is under going ? */
> +	atomic_t migrations;
>  };
>  static struct mem_cgroup init_mem_cgroup;
> 
> @@ -164,6 +166,9 @@ static enum zone_type page_cgroup_zid(st
>  enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
>  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> +	/* this 2 types are not linked to LRU */
          ^^^^^
          these
> +	MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE,
> +	MEM_CGROUP_CHARGE_TYPE_MIGRATION_MAPPED,
>  };
> 
>  /*
> @@ -480,7 +485,8 @@ unsigned long mem_cgroup_isolate_pages(u
>   * < 0 if the cgroup is over its limit
>   */
>  static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask, enum charge_type ctype)
> +				gfp_t gfp_mask, enum charge_type ctype,
> +				struct mem_cgroup *memcg)

I think it'll be a good idea to add a comment stating that memcg is
NOT NULL only for migration cases.

>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> @@ -514,16 +520,21 @@ static int mem_cgroup_charge_common(stru
>  	 * thread group leader migrates. It's possible that mm is not
>  	 * set, if so charge the init_mm (happens for pagecache usage).
>  	 */
> -	if (!mm)
> +	if (!mm && !memcg)
>  		mm = &init_mm;
> 
> -	rcu_read_lock();
> -	mem = rcu_dereference(mm->mem_cgroup);
> -	/*
> -	 * For every charge from the cgroup, increment reference count
> -	 */
> -	css_get(&mem->css);
> -	rcu_read_unlock();
> +	if (mm) {
> +		rcu_read_lock();
> +		mem = rcu_dereference(mm->mem_cgroup);
> +		/*
> +	 	* For every charge from the cgroup, increment reference count
> +	 	*/
> +		css_get(&mem->css);
> +		rcu_read_unlock();
> +	} else {
> +		mem = memcg;
> +		css_get(&mem->css);
> +	}
> 
>  	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
>  		if (!(gfp_mask & __GFP_WAIT))
> @@ -566,12 +577,24 @@ static int mem_cgroup_charge_common(stru
>  	pc->refcnt = 1;
>  	pc->mem_cgroup = mem;
>  	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> -	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
> +	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE ||
> +		ctype == MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE)
>  		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
> -	mz = page_cgroup_zoneinfo(pc);
> -	spin_lock(&mz->lru_lock);
> -	__mem_cgroup_add_list(pc);
> -	spin_unlock(&mz->lru_lock);
> +
> +	if (ctype == MEM_CGROUP_CHARGE_TYPE_MIGRATION_MAPPED ||
> +		ctype == MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE)
> +		pc->flags |= PAGE_CGROUP_FLAG_MIGRATION;
> +
> +	if (pc->flags & PAGE_CGROUP_FLAG_MIGRATION) {
> +		/* just remember there is migration */
> +		atomic_inc(&mem->migrations);
> +	} else {
> +		/* add to LRU */
> +		mz = page_cgroup_zoneinfo(pc);
> +		spin_lock(&mz->lru_lock);
> +		__mem_cgroup_add_list(pc);
> +		spin_unlock(&mz->lru_lock);
> +	}
>  	spin_unlock_irqrestore(&pc->lock, flags);
> 
>  success:
> @@ -584,7 +607,7 @@ nomem:
>  int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
>  {
>  	return mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_MAPPED);
> +				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
>  }
> 
>  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> @@ -593,7 +616,7 @@ int mem_cgroup_cache_charge(struct page 
>  	if (!mm)
>  		mm = &init_mm;
>  	return mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_CACHE);
> +				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
>  }
> 
>  /*
> @@ -637,65 +660,75 @@ void mem_cgroup_uncharge_page(struct pag
>  }
> 
>  /*
> - * Returns non-zero if a page (under migration) has valid page_cgroup member.
> - * Refcnt of page_cgroup is incremented.
> + * Pre-charge against newpage while moving a page.
> + * This function is called before taking page locks.
>   */
> -int mem_cgroup_prepare_migration(struct page *page)
> +int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
>  {
>  	struct page_cgroup *pc;
> +	struct mem_cgroup *mem = NULL;
> +	int ret = 0;
> +	enum charge_type type;
> +	unsigned long flags;
> 
>  	if (mem_cgroup_subsys.disabled)
> -		return 0;
> +		return ret;
> +	/* check newpage isn't under memory resource control */
> +	pc = get_page_cgroup(newpage, GFP_ATOMIC, false);
> +	VM_BUG_ON(pc && pc->refcnt);
> 
> -	lock_page_cgroup(page);
> -	pc = page_get_page_cgroup(page);
> -	if (pc)
> -		pc->ref_cnt++;
> -	unlock_page_cgroup(page);
> -	return pc != NULL;
> -}
> +	pc = get_page_cgroup(page, GFP_ATOMIC, false);
> +	spin_lock_irqsave(&pc->lock, flags);
> +	if (pc && pc->refcnt) {
> +		mem = pc->mem_cgroup;
> +		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
> +			type = MEM_CGROUP_CHARGE_TYPE_MIGRATION_CACHE;
> +		else
> +			type = MEM_CGROUP_CHARGE_TYPE_MIGRATION_MAPPED;
> +	}
> +	spin_unlock_irqrestore(&pc->lock, flags);
> 
> -void mem_cgroup_end_migration(struct page *page)
> -{
> -	mem_cgroup_uncharge_page(page);
> +	if (mem) {
> +		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> +				type, mem);
> +	}
> +	return ret;
>  }
> -
>  /*
> - * We know both *page* and *newpage* are now not-on-LRU and PG_locked.
> - * And no race with uncharge() routines because page_cgroup for *page*
> - * has extra one reference by mem_cgroup_prepare_migration.
> + * At the end of migration, we'll push newpage to LRU and
> + * drop one refcnt which added at prepare_migration.
>   */
> -void mem_cgroup_page_migration(struct page *page, struct page *newpage)
> +void mem_cgroup_end_migration(struct page *newpage)
>  {
>  	struct page_cgroup *pc;
>  	struct mem_cgroup_per_zone *mz;
> +	struct mem_cgroup *mem;
>  	unsigned long flags;
> +	int moved = 0;
> 
> -	lock_page_cgroup(page);
> -	pc = page_get_page_cgroup(page);
> -	if (!pc) {
> -		unlock_page_cgroup(page);
> +	if (mem_cgroup_subsys.disabled)
>  		return;
> -	}
> -
> -	mz = page_cgroup_zoneinfo(pc);
> -	spin_lock_irqsave(&mz->lru_lock, flags);
> -	__mem_cgroup_remove_list(pc);
> -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> -	page_assign_page_cgroup(page, NULL);
> -	unlock_page_cgroup(page);
> -
> -	pc->page = newpage;
> -	lock_page_cgroup(newpage);
> -	page_assign_page_cgroup(newpage, pc);
> 
> -	mz = page_cgroup_zoneinfo(pc);
> -	spin_lock_irqsave(&mz->lru_lock, flags);
> -	__mem_cgroup_add_list(pc);
> -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> -	unlock_page_cgroup(newpage);
> +	pc = get_page_cgroup(newpage, GFP_ATOMIC, false);
> +	if (!pc)
> +		return;
> +	spin_lock_irqsave(&pc->lock, flags);
> +	if (pc->flags & PAGE_CGROUP_FLAG_MIGRATION) {
> +		pc->flags &= ~PAGE_CGROUP_FLAG_MIGRATION;
> +		mem = pc->mem_cgroup;
> +		mz = page_cgroup_zoneinfo(pc);
> +		spin_lock(&mz->lru_lock);
> +		__mem_cgroup_add_list(pc);
> +		spin_unlock(&mz->lru_lock);
> +		moved = 1;
> +	}
> +	spin_unlock_irqrestore(&pc->lock, flags);
> +	if (!pc)
> +		return;
> +	if (moved) {
> +		mem_cgroup_uncharge_page(newpage);

Not sure I understand this part correctly? Why uncharge on move? Is it
just to drop the extra reference count we have? Also should we
disallow task migration as in migrating to different cgroups when page
migration is in progress?

> +		atomic_dec(&mem->migrations);
> +	}
>  }
> 
>  /*
> @@ -757,6 +790,10 @@ static int mem_cgroup_force_empty(struct
>  	while (mem->res.usage > 0) {
>  		if (atomic_read(&mem->css.cgroup->count) > 0)
>  			goto out;
> +		if (atomic_read(&mem->migrations)) {
> +			cond_resched();
> +			continue;
> +		}
>  		for_each_node_state(node, N_POSSIBLE)
>  			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>  				struct mem_cgroup_per_zone *mz;
> Index: mm-2.6.25-rc5-mm1/mm/migrate.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/migrate.c
> +++ mm-2.6.25-rc5-mm1/mm/migrate.c
> @@ -358,6 +358,12 @@ static int migrate_page_move_mapping(str
> 
>  	write_unlock_irq(&mapping->tree_lock);
> 
> +	/* by mem_cgroup_prepare_migration, newpage is already
> +	   assigned to valid cgroup. and current->mm and GFP_ATOMIC
> +	   will not be used...*/
> +	mem_cgroup_uncharge_page(page);
> +	mem_cgroup_cache_charge(newpage, current->mm ,GFP_ATOMIC);
> +
>  	return 0;
>  }
> 
> @@ -603,7 +609,6 @@ static int move_to_new_page(struct page 
>  		rc = fallback_migrate_page(mapping, newpage, page);
> 
>  	if (!rc) {
> -		mem_cgroup_page_migration(page, newpage);
>  		remove_migration_ptes(page, newpage);
>  	} else
>  		newpage->mapping = NULL;
> @@ -633,6 +638,12 @@ static int unmap_and_move(new_page_t get
>  		/* page was freed from under us. So we are done. */
>  		goto move_newpage;
> 
> +	charge = mem_cgroup_prepare_migration(page, newpage);
> +	if (charge == -ENOMEM) {
> +		rc = -ENOMEM;
> +		goto move_newpage;
> +	}
> +
>  	rc = -EAGAIN;
>  	if (TestSetPageLocked(page)) {
>  		if (!force)
> @@ -684,19 +695,14 @@ static int unmap_and_move(new_page_t get
>  		goto rcu_unlock;
>  	}
> 
> -	charge = mem_cgroup_prepare_migration(page);
>  	/* Establish migration ptes or remove ptes */
>  	try_to_unmap(page, 1);
> 
>  	if (!page_mapped(page))
>  		rc = move_to_new_page(newpage, page);
> 
> -	if (rc) {
> +	if (rc)
>  		remove_migration_ptes(page, page);
> -		if (charge)
> -			mem_cgroup_end_migration(page);
> -	} else if (charge)
> - 		mem_cgroup_end_migration(newpage);
>  rcu_unlock:
>  	if (rcu_locked)
>  		rcu_read_unlock();
> @@ -717,6 +723,8 @@ unlock:
>  	}
> 
>  move_newpage:
> +	if (!charge)
> +		mem_cgroup_end_migration(newpage);
>  	/*
>  	 * Move the new page to the LRU. If migration was not successful
>  	 * then this will free the page.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 4/7] memcg: page migration
  2008-03-18 18:11   ` Balbir Singh
@ 2008-03-19  2:44     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-19  2:44 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

On Tue, 18 Mar 2008 23:41:41 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:15:43]:
> 
> > Christoph Lameter, writer of page migraion, suggested me that
> > new page_cgroup should be assignd to new page at page is allocated.
> > This patch changes migration path to assign page_cgroup of new page
> > at allocation.
> > 
> > Pros:
> >  - We can avoid compliated lock depndencies.
> > Cons:
> >  - Have to handle a page which is not on LRU in memory resource controller.
> > 
> > For pages not-on-LRU, I added PAGE_CGROUP_FLAG_MIGRATION and
> > mem_cgroup->migrations counter.
> > (force_empty will not end while migration because new page's
> >  refcnt is alive until the end of migration.)
> > 
> > I think this version simplifies complicated lock dependency in page migraiton,
> > but I admit this adds some hacky codes. If you have good idea, please advise me.
> > 
> > Works well under my tests.
> 
> This code is easier to read as well. I think this a good approach. To
> be honest, I've not had the chance to test page migration very often.
> Should we update Documentation/controllers/memory.txt to indicate that
> migration might prevent force_empty and hence rmdir() from working?
> 
I'm now rewriting this code to use 'list' instead of a counter but
reconsidering this care is really necessary or not.

Because !Page_lru pages are already handled in mem_cgroup_isolate_page(),
we don't have to do something special to non-LRU pages.

I'd like to drop this care-nolru-pages check today and check it.

Thanks,
-Kame














--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 5/7] radix-tree page cgroup
  2008-03-14  9:59 [PATCH 0/7] memcg: radix-tree page_cgroup KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2008-03-14 10:15 ` [PATCH 4/7] memcg: page migration KAMEZAWA Hiroyuki
@ 2008-03-14 10:17 ` KAMEZAWA Hiroyuki
  2008-03-17  2:56   ` Li Zefan
                     ` (3 more replies)
  2008-03-14 10:18 ` [PATCH 6/7] memcg: speed up by percpu KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  7 siblings, 4 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-14 10:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

A lookup routine for page_cgroup struct.

Now, page_cgroup is pointed by struct page's page_cgroup entry

struct page {
	...
	struct page_cgroup *page_cgroup;
	..
}

But some people dislike this because this increases sizeof(struct page).

For avoiding that, we'll have to add a lookup routine for
	pfn <-> page_cgroup.
by radix-tree.

New function is

struct page *get_page_cgroup(struct page *page, gfp_mask mask, bool allocate);

if (allocate == true), look up and allocate new one if necessary.
if (allocate == false), just do look up and return NULL if not exist.

Changes:
  - add the 3rd argument 'allocate'
  - making page_cgroup chunk size to be configurable (for test.)


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 init/Kconfig     |   14 ++++
 mm/Makefile      |    2 
 mm/page_cgroup.c |  169 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 184 insertions(+), 1 deletion(-)

Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
@@ -0,0 +1,173 @@
+/*
+ * page_cgroup mamagement codes.
+ * page_cgroup is yet another mem_map when cgroup's memory resoruce controller
+ * is activated. It containes information which cannot be stored in usual
+ * mem_map. (it's too big.)
+ * This allows us to keep 'struct page' small when a user doesn't activate
+ * memory resource controller.
+ *
+ * Note: all things are allocated on demand.
+ *
+ * We can translate : struct page <-> pfn -> page_cgroup -> struct page.
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/radix-tree.h>
+#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
+#include <linux/err.h>
+
+
+
+#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
+#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
+
+struct page_cgroup_head {
+	struct page_cgroup pc[PCGRP_SIZE];
+};
+
+struct page_cgroup_root {
+	spinlock_t	       tree_lock;
+	struct radix_tree_root root_node;
+};
+
+static struct page_cgroup_root *root_dir[MAX_NUMNODES];
+
+static void init_page_cgroup(struct page_cgroup_head *head, unsigned long pfn)
+{
+	int i;
+	struct page_cgroup *pc;
+
+	memset(head, 0, sizeof(*head));
+	for (i = 0; i < PCGRP_SIZE; ++i) {
+		pc = &head->pc[i];
+		pc->page = pfn_to_page(pfn + i);
+		spin_lock_init(&pc->lock);
+		INIT_LIST_HEAD(&pc->lru);
+	}
+}
+
+
+struct kmem_cache *page_cgroup_cachep;
+
+static struct page_cgroup_head *
+alloc_init_page_cgroup(unsigned long pfn, int nid, gfp_t mask)
+{
+	struct page_cgroup_head *head;
+
+	head = kmem_cache_alloc_node(page_cgroup_cachep, mask, nid);
+	if (!head)
+		return NULL;
+
+	init_page_cgroup(head, pfn);
+
+	return head;
+}
+
+void free_page_cgroup(struct page_cgroup_head *head)
+{
+	kmem_cache_free(page_cgroup_cachep, head);
+}
+
+
+/*
+ * Look up page_cgroup struct for struct page (page's pfn)
+ * if (allocate == true), look up and allocate new one if necessary.
+ * if (allocate == false), look up and return NULL if it cannot be found.
+ */
+
+struct page_cgroup *
+get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
+{
+	struct page_cgroup_root *root;
+	struct page_cgroup_head *head;
+	struct page_cgroup *pc;
+	unsigned long pfn, idx;
+	int nid;
+	unsigned long base_pfn, flags;
+	int error;
+
+	if (!page)
+		return NULL;
+
+	pfn = page_to_pfn(page);
+	idx = pfn >> PCGRP_SHIFT;
+	nid = page_to_nid(page);
+
+	root = root_dir[nid];
+	/* Before Init ? */
+	if (unlikely(!root))
+		return NULL;
+
+	base_pfn = idx << PCGRP_SHIFT;
+retry:
+	error = 0;
+	rcu_read_lock();
+	head = radix_tree_lookup(&root->root_node, idx);
+	rcu_read_unlock();
+
+	if (likely(head))
+		return &head->pc[pfn - base_pfn];
+	if (allocate == false)
+		return NULL;
+
+	/* Very Slow Path. On demand allocation. */
+	gfpmask = gfpmask & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
+
+	head = alloc_init_page_cgroup(base_pfn, nid, gfpmask);
+	if (!head)
+		return ERR_PTR(-ENOMEM);
+	pc = NULL;
+	error = radix_tree_preload(gfpmask);
+	if (error)
+		goto out;
+	spin_lock_irqsave(&root->tree_lock, flags);
+	error = radix_tree_insert(&root->root_node, idx, head);
+
+	if (!error)
+		pc = &head->pc[pfn - base_pfn];
+	spin_unlock_irqrestore(&root->tree_lock, flags);
+	radix_tree_preload_end();
+out:
+	if (!pc) {
+		free_page_cgroup(head);
+		if (error == -EEXIST)
+			goto retry;
+	}
+	if (error)
+		pc = ERR_PTR(error);
+	return pc;
+}
+
+__init int page_cgroup_init(void)
+{
+	int nid;
+	struct page_cgroup_root *root;
+
+	page_cgroup_cachep = kmem_cache_create("page_cgroup",
+				sizeof(struct page_cgroup_head), 0,
+				SLAB_PANIC | SLAB_DESTROY_BY_RCU, NULL);
+	if (!page_cgroup_cachep) {
+		printk(KERN_ERR "page accouning setup failure\n");
+		printk(KERN_ERR "can't initialize slab memory\n");
+		/* FIX ME: should return some error code ? */
+		return 0;
+	}
+	for_each_online_node(nid) {
+		if (node_state(nid, N_NORMAL_MEMORY)
+			root = kmalloc_node(sizeof(struct page_cgroup_root),
+					GFP_KERNEL, nid);
+		else
+			root = kmalloc(sizeof(struct page_cgroup_root),
+					GFP_KERNEL);
+		INIT_RADIX_TREE(&root->root_node, GFP_ATOMIC);
+		spin_lock_init(&root->tree_lock);
+		smp_wmb();
+		root_dir[nid] = root;
+	}
+
+	printk(KERN_INFO "Page Accouintg is activated\n");
+	return 0;
+}
+late_initcall(page_cgroup_init);
Index: mm-2.6.25-rc5-mm1/mm/Makefile
===================================================================
--- mm-2.6.25-rc5-mm1.orig/mm/Makefile
+++ mm-2.6.25-rc5-mm1/mm/Makefile
@@ -32,5 +32,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 
Index: mm-2.6.25-rc5-mm1/init/Kconfig
===================================================================
--- mm-2.6.25-rc5-mm1.orig/init/Kconfig
+++ mm-2.6.25-rc5-mm1/init/Kconfig
@@ -405,6 +405,20 @@ config SYSFS_DEPRECATED_V2
 	  If you are using a distro with the most recent userspace
 	  packages, it should be safe to say N here.
 
+config CGROUP_PAGE_CGROUP_ORDER
+	int "Order of page accounting subsystem"
+	range 0 10
+	default 3 if HIGHMEM64G
+	default 10 if 64BIT
+	default 7
+	depends on CGROUP_MEM_RES_CTLR
+	help
+	  By making this value to be small, wastes in memory usage of page
+	  accounting can be small. But big number is good for perfomance.
+	  Especially, HIGHMEM64G users should keep this to be small because
+	  you tend to have small kernel memory.
+	  If unsure, use default.
+
 config PROC_PID_CPUSET
 	bool "Include legacy /proc/<pid>/cpuset file"
 	depends on CPUSETS

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-14 10:17 ` [PATCH 5/7] radix-tree page cgroup KAMEZAWA Hiroyuki
@ 2008-03-17  2:56   ` Li Zefan
  2008-03-17  3:26     ` Li Zefan
  2008-03-18  1:23     ` KAMEZAWA Hiroyuki
  2008-03-19  2:05   ` Balbir Singh
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 53+ messages in thread
From: Li Zefan @ 2008-03-17  2:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

KAMEZAWA Hiroyuki wrote:
> A lookup routine for page_cgroup struct.
> 
> Now, page_cgroup is pointed by struct page's page_cgroup entry
> 
> struct page {
> 	...
> 	struct page_cgroup *page_cgroup;
> 	..
> }
> 
> But some people dislike this because this increases sizeof(struct page).
> 
> For avoiding that, we'll have to add a lookup routine for
> 	pfn <-> page_cgroup.
> by radix-tree.
> 
> New function is
> 
> struct page *get_page_cgroup(struct page *page, gfp_mask mask, bool allocate);
> 
> if (allocate == true), look up and allocate new one if necessary.
> if (allocate == false), just do look up and return NULL if not exist.
> 
> Changes:
>   - add the 3rd argument 'allocate'
>   - making page_cgroup chunk size to be configurable (for test.)
> 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
>  init/Kconfig     |   14 ++++
>  mm/Makefile      |    2 
>  mm/page_cgroup.c |  169 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 184 insertions(+), 1 deletion(-)
> 
> Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> ===================================================================
> --- /dev/null
> +++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> @@ -0,0 +1,173 @@
> +/*
> + * page_cgroup mamagement codes.
> + * page_cgroup is yet another mem_map when cgroup's memory resoruce controller
> + * is activated. It containes information which cannot be stored in usual
> + * mem_map. (it's too big.)
> + * This allows us to keep 'struct page' small when a user doesn't activate
> + * memory resource controller.
> + *
> + * Note: all things are allocated on demand.
> + *
> + * We can translate : struct page <-> pfn -> page_cgroup -> struct page.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/radix-tree.h>
> +#include <linux/memcontrol.h>
> +#include <linux/page_cgroup.h>
> +#include <linux/err.h>
> +
> +
> +
> +#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
> +#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> +
> +struct page_cgroup_head {
> +	struct page_cgroup pc[PCGRP_SIZE];
> +};
> +
> +struct page_cgroup_root {
> +	spinlock_t	       tree_lock;
> +	struct radix_tree_root root_node;
> +};
> +
> +static struct page_cgroup_root *root_dir[MAX_NUMNODES];
> +
> +static void init_page_cgroup(struct page_cgroup_head *head, unsigned long pfn)
> +{
> +	int i;
> +	struct page_cgroup *pc;
> +
> +	memset(head, 0, sizeof(*head));
> +	for (i = 0; i < PCGRP_SIZE; ++i) {

Usually we use 'i++' in this case, gcc will take care of it.

> +		pc = &head->pc[i];
> +		pc->page = pfn_to_page(pfn + i);
> +		spin_lock_init(&pc->lock);
> +		INIT_LIST_HEAD(&pc->lru);
> +	}
> +}
> +
> +
> +struct kmem_cache *page_cgroup_cachep;
> +
> +static struct page_cgroup_head *
> +alloc_init_page_cgroup(unsigned long pfn, int nid, gfp_t mask)
> +{
> +	struct page_cgroup_head *head;
> +
> +	head = kmem_cache_alloc_node(page_cgroup_cachep, mask, nid);
> +	if (!head)
> +		return NULL;
> +
> +	init_page_cgroup(head, pfn);
> +
> +	return head;
> +}
> +
> +void free_page_cgroup(struct page_cgroup_head *head)
> +{
> +	kmem_cache_free(page_cgroup_cachep, head);
> +}
> +
> +
> +/*
> + * Look up page_cgroup struct for struct page (page's pfn)
> + * if (allocate == true), look up and allocate new one if necessary.
> + * if (allocate == false), look up and return NULL if it cannot be found.
> + */
> +

It's confusing when NULL will be returned and when -EFXXX...

if (allocate == true) -EFXXX may still be returned ?

> +struct page_cgroup *
> +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> +{
> +	struct page_cgroup_root *root;
> +	struct page_cgroup_head *head;
> +	struct page_cgroup *pc;
> +	unsigned long pfn, idx;
> +	int nid;
> +	unsigned long base_pfn, flags;
> +	int error;
> +
> +	if (!page)
> +		return NULL;
> +
> +	pfn = page_to_pfn(page);
> +	idx = pfn >> PCGRP_SHIFT;
> +	nid = page_to_nid(page);
> +
> +	root = root_dir[nid];
> +	/* Before Init ? */
> +	if (unlikely(!root))
> +		return NULL;
> +
> +	base_pfn = idx << PCGRP_SHIFT;
> +retry:
> +	error = 0;
> +	rcu_read_lock();
> +	head = radix_tree_lookup(&root->root_node, idx);
> +	rcu_read_unlock();
> +
> +	if (likely(head))
> +		return &head->pc[pfn - base_pfn];
> +	if (allocate == false)
> +		return NULL;
> +
> +	/* Very Slow Path. On demand allocation. */
> +	gfpmask = gfpmask & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> +
> +	head = alloc_init_page_cgroup(base_pfn, nid, gfpmask);
> +	if (!head)
> +		return ERR_PTR(-ENOMEM);
> +	pc = NULL;
> +	error = radix_tree_preload(gfpmask);
> +	if (error)
> +		goto out;
> +	spin_lock_irqsave(&root->tree_lock, flags);
> +	error = radix_tree_insert(&root->root_node, idx, head);
> +
> +	if (!error)
> +		pc = &head->pc[pfn - base_pfn];
> +	spin_unlock_irqrestore(&root->tree_lock, flags);
> +	radix_tree_preload_end();
> +out:
> +	if (!pc) {
> +		free_page_cgroup(head);
> +		if (error == -EEXIST)
> +			goto retry;
> +	}
> +	if (error)
> +		pc = ERR_PTR(error);
> +	return pc;
> +}
> +
> +__init int page_cgroup_init(void)
> +{
> +	int nid;
> +	struct page_cgroup_root *root;
> +
> +	page_cgroup_cachep = kmem_cache_create("page_cgroup",
> +				sizeof(struct page_cgroup_head), 0,
> +				SLAB_PANIC | SLAB_DESTROY_BY_RCU, NULL);
> +	if (!page_cgroup_cachep) {
> +		printk(KERN_ERR "page accouning setup failure\n");
> +		printk(KERN_ERR "can't initialize slab memory\n");
> +		/* FIX ME: should return some error code ? */
> +		return 0;

why can't return -ENOMEM ?

> +	}
> +	for_each_online_node(nid) {
> +		if (node_state(nid, N_NORMAL_MEMORY)
> +			root = kmalloc_node(sizeof(struct page_cgroup_root),
> +					GFP_KERNEL, nid);

if (root == NULL)

> +		else
> +			root = kmalloc(sizeof(struct page_cgroup_root),
> +					GFP_KERNEL);

ditto

> +		INIT_RADIX_TREE(&root->root_node, GFP_ATOMIC);
> +		spin_lock_init(&root->tree_lock);
> +		smp_wmb();
> +		root_dir[nid] = root;
> +	}
> +
> +	printk(KERN_INFO "Page Accouintg is activated\n");
> +	return 0;
> +}
> +late_initcall(page_cgroup_init);
> Index: mm-2.6.25-rc5-mm1/mm/Makefile
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/Makefile
> +++ mm-2.6.25-rc5-mm1/mm/Makefile
> @@ -32,5 +32,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_SMP) += allocpercpu.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
>  
> Index: mm-2.6.25-rc5-mm1/init/Kconfig
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/init/Kconfig
> +++ mm-2.6.25-rc5-mm1/init/Kconfig
> @@ -405,6 +405,20 @@ config SYSFS_DEPRECATED_V2
>  	  If you are using a distro with the most recent userspace
>  	  packages, it should be safe to say N here.
>  
> +config CGROUP_PAGE_CGROUP_ORDER
> +	int "Order of page accounting subsystem"
> +	range 0 10
> +	default 3 if HIGHMEM64G
> +	default 10 if 64BIT
> +	default 7
> +	depends on CGROUP_MEM_RES_CTLR
> +	help
> +	  By making this value to be small, wastes in memory usage of page
> +	  accounting can be small. But big number is good for perfomance.

s/perfomance/performance

> +	  Especially, HIGHMEM64G users should keep this to be small because
> +	  you tend to have small kernel memory.
> +	  If unsure, use default.
> +
>  config PROC_PID_CPUSET
>  	bool "Include legacy /proc/<pid>/cpuset file"
>  	depends on CPUSETS
> 
> --

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-17  2:56   ` Li Zefan
@ 2008-03-17  3:26     ` Li Zefan
  2008-03-18  1:18       ` KAMEZAWA Hiroyuki
  2008-03-18  1:23     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 53+ messages in thread
From: Li Zefan @ 2008-03-17  3:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

Li Zefan wrote:
>> +/*
>> + * Look up page_cgroup struct for struct page (page's pfn)
>> + * if (allocate == true), look up and allocate new one if necessary.
>> + * if (allocate == false), look up and return NULL if it cannot be found.
>> + */
>> +
> 
> It's confusing when NULL will be returned and when -EFXXX...
> 
> if (allocate == true) -EFXXX may still be returned ?
> 

Sorry, my comment is:

It's confusing when NULL will be returned and when -EFXXX... 

if (allocate == true), *NULL* may still be returned ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-17  3:26     ` Li Zefan
@ 2008-03-18  1:18       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-18  1:18 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Mon, 17 Mar 2008 12:26:01 +0900
Li Zefan <lizf@cn.fujitsu.com> wrote:

> Li Zefan wrote:
> >> +/*
> >> + * Look up page_cgroup struct for struct page (page's pfn)
> >> + * if (allocate == true), look up and allocate new one if necessary.
> >> + * if (allocate == false), look up and return NULL if it cannot be found.
> >> + */
> >> +
> > 
> > It's confusing when NULL will be returned and when -EFXXX...
> > 
> > if (allocate == true) -EFXXX may still be returned ?
> > 
> 
> Sorry, my comment is:
> 
> It's confusing when NULL will be returned and when -EFXXX... 
> 
> if (allocate == true), *NULL* may still be returned ?

yes. at boot time. before kmalloc ready.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-17  2:56   ` Li Zefan
  2008-03-17  3:26     ` Li Zefan
@ 2008-03-18  1:23     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-18  1:23 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Mon, 17 Mar 2008 11:56:06 +0900
Li Zefan <lizf@cn.fujitsu.com> wrote:

> KAMEZAWA Hiroyuki wrote:

> > +static void init_page_cgroup(struct page_cgroup_head *head, unsigned long pfn)
> > +{
> > +	int i;
> > +	struct page_cgroup *pc;
> > +
> > +	memset(head, 0, sizeof(*head));
> > +	for (i = 0; i < PCGRP_SIZE; ++i) {
> 
> Usually we use 'i++' in this case, gcc will take care of it.
> 
Hmm, ok.

> > +		pc = &head->pc[i];
> > +		pc->page = pfn_to_page(pfn + i);
> > +		spin_lock_init(&pc->lock);
> > +		INIT_LIST_HEAD(&pc->lru);
> > +	}
> > +}
> > +
> > +
> > +struct kmem_cache *page_cgroup_cachep;
> > +
> > +static struct page_cgroup_head *
> > +alloc_init_page_cgroup(unsigned long pfn, int nid, gfp_t mask)
> > +{
> > +	struct page_cgroup_head *head;
> > +
> > +	head = kmem_cache_alloc_node(page_cgroup_cachep, mask, nid);
> > +	if (!head)
> > +		return NULL;
> > +
> > +	init_page_cgroup(head, pfn);
> > +
> > +	return head;
> > +}
> > +
> > +void free_page_cgroup(struct page_cgroup_head *head)
> > +{
> > +	kmem_cache_free(page_cgroup_cachep, head);
> > +}
> > +
> > +
> > +/*
> > + * Look up page_cgroup struct for struct page (page's pfn)
> > + * if (allocate == true), look up and allocate new one if necessary.
> > + * if (allocate == false), look up and return NULL if it cannot be found.
> > + */
> > +
> 
> It's confusing when NULL will be returned and when -EFXXX...
> 
> if (allocate == true) -EFXXX may still be returned ?
> 
> > +struct page_cgroup *
> > +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> > +{
> > +	struct page_cgroup_root *root;
> > +	struct page_cgroup_head *head;
> > +	struct page_cgroup *pc;
> > +	unsigned long pfn, idx;
> > +	int nid;
> > +	unsigned long base_pfn, flags;
> > +	int error;
> > +
> > +	if (!page)
> > +		return NULL;
> > +
> > +	pfn = page_to_pfn(page);
> > +	idx = pfn >> PCGRP_SHIFT;
> > +	nid = page_to_nid(page);
> > +
> > +	root = root_dir[nid];
> > +	/* Before Init ? */
> > +	if (unlikely(!root))
> > +		return NULL;
> > +
> > +	base_pfn = idx << PCGRP_SHIFT;
> > +retry:
> > +	error = 0;
> > +	rcu_read_lock();
> > +	head = radix_tree_lookup(&root->root_node, idx);
> > +	rcu_read_unlock();
> > +
> > +	if (likely(head))
> > +		return &head->pc[pfn - base_pfn];
> > +	if (allocate == false)
> > +		return NULL;
> > +
> > +	/* Very Slow Path. On demand allocation. */
> > +	gfpmask = gfpmask & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> > +
> > +	head = alloc_init_page_cgroup(base_pfn, nid, gfpmask);
> > +	if (!head)
> > +		return ERR_PTR(-ENOMEM);
> > +	pc = NULL;
> > +	error = radix_tree_preload(gfpmask);
> > +	if (error)
> > +		goto out;
> > +	spin_lock_irqsave(&root->tree_lock, flags);
> > +	error = radix_tree_insert(&root->root_node, idx, head);
> > +
> > +	if (!error)
> > +		pc = &head->pc[pfn - base_pfn];
> > +	spin_unlock_irqrestore(&root->tree_lock, flags);
> > +	radix_tree_preload_end();
> > +out:
> > +	if (!pc) {
> > +		free_page_cgroup(head);
> > +		if (error == -EEXIST)
> > +			goto retry;
> > +	}
> > +	if (error)
> > +		pc = ERR_PTR(error);
> > +	return pc;
> > +}
> > +
> > +__init int page_cgroup_init(void)
> > +{
> > +	int nid;
> > +	struct page_cgroup_root *root;
> > +
> > +	page_cgroup_cachep = kmem_cache_create("page_cgroup",
> > +				sizeof(struct page_cgroup_head), 0,
> > +				SLAB_PANIC | SLAB_DESTROY_BY_RCU, NULL);
> > +	if (!page_cgroup_cachep) {
> > +		printk(KERN_ERR "page accouning setup failure\n");
> > +		printk(KERN_ERR "can't initialize slab memory\n");
> > +		/* FIX ME: should return some error code ? */
> > +		return 0;
> 
> why can't return -ENOMEM ?
> 
It seems I misunderstand initcalls. Sorry.


> > +	}
> > +	for_each_online_node(nid) {
> > +		if (node_state(nid, N_NORMAL_MEMORY)
> > +			root = kmalloc_node(sizeof(struct page_cgroup_root),
> > +					GFP_KERNEL, nid);
> 
> if (root == NULL)
> 
> > +		else
> > +			root = kmalloc(sizeof(struct page_cgroup_root),
> > +					GFP_KERNEL);
> 
> ditto
> 

ok.


> > +		INIT_RADIX_TREE(&root->root_node, GFP_ATOMIC);
> > +		spin_lock_init(&root->tree_lock);
> > +		smp_wmb();
> > +		root_dir[nid] = root;
> > +	}
> > +
> > +	printk(KERN_INFO "Page Accouintg is activated\n");
> > +	return 0;
> > +}
> > +late_initcall(page_cgroup_init);
> > Index: mm-2.6.25-rc5-mm1/mm/Makefile
> > ===================================================================
> > --- mm-2.6.25-rc5-mm1.orig/mm/Makefile
> > +++ mm-2.6.25-rc5-mm1/mm/Makefile
> > @@ -32,5 +32,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
> >  obj-$(CONFIG_MIGRATION) += migrate.o
> >  obj-$(CONFIG_SMP) += allocpercpu.o
> >  obj-$(CONFIG_QUICKLIST) += quicklist.o
> > -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> > +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> >  
> > Index: mm-2.6.25-rc5-mm1/init/Kconfig
> > ===================================================================
> > --- mm-2.6.25-rc5-mm1.orig/init/Kconfig
> > +++ mm-2.6.25-rc5-mm1/init/Kconfig
> > @@ -405,6 +405,20 @@ config SYSFS_DEPRECATED_V2
> >  	  If you are using a distro with the most recent userspace
> >  	  packages, it should be safe to say N here.
> >  
> > +config CGROUP_PAGE_CGROUP_ORDER
> > +	int "Order of page accounting subsystem"
> > +	range 0 10
> > +	default 3 if HIGHMEM64G
> > +	default 10 if 64BIT
> > +	default 7
> > +	depends on CGROUP_MEM_RES_CTLR
> > +	help
> > +	  By making this value to be small, wastes in memory usage of page
> > +	  accounting can be small. But big number is good for perfomance.
> 
> s/perfomance/performance
> 
ok

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-14 10:17 ` [PATCH 5/7] radix-tree page cgroup KAMEZAWA Hiroyuki
  2008-03-17  2:56   ` Li Zefan
@ 2008-03-19  2:05   ` Balbir Singh
  2008-03-19  2:51     ` KAMEZAWA Hiroyuki
  2008-03-19  3:14   ` Balbir Singh
  2008-03-19 21:11   ` Peter Zijlstra
  3 siblings, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2008-03-19  2:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:17:33]:

> A lookup routine for page_cgroup struct.
> 
> Now, page_cgroup is pointed by struct page's page_cgroup entry
> 
> struct page {
> 	...
> 	struct page_cgroup *page_cgroup;
> 	..
> }
> 
> But some people dislike this because this increases sizeof(struct page).
> 
> For avoiding that, we'll have to add a lookup routine for
> 	pfn <-> page_cgroup.
> by radix-tree.
> 
> New function is
> 
> struct page *get_page_cgroup(struct page *page, gfp_mask mask, bool allocate);
> 
> if (allocate == true), look up and allocate new one if necessary.
> if (allocate == false), just do look up and return NULL if not exist.
> 
> Changes:
>   - add the 3rd argument 'allocate'
>   - making page_cgroup chunk size to be configurable (for test.)
> 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
>  init/Kconfig     |   14 ++++
>  mm/Makefile      |    2 
>  mm/page_cgroup.c |  169 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 184 insertions(+), 1 deletion(-)
> 
> Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> ===================================================================
> --- /dev/null
> +++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> @@ -0,0 +1,173 @@
> +/*
> + * page_cgroup mamagement codes.
> + * page_cgroup is yet another mem_map when cgroup's memory resoruce controller
> + * is activated. It containes information which cannot be stored in usual
> + * mem_map. (it's too big.)
> + * This allows us to keep 'struct page' small when a user doesn't activate
> + * memory resource controller.
> + *
> + * Note: all things are allocated on demand.
> + *
> + * We can translate : struct page <-> pfn -> page_cgroup -> struct page.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/radix-tree.h>
> +#include <linux/memcontrol.h>
> +#include <linux/page_cgroup.h>
> +#include <linux/err.h>
> +
> +
> +

Too many extra lines

> +#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
> +#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> +
> +struct page_cgroup_head {
> +	struct page_cgroup pc[PCGRP_SIZE];

Are we over optimizing here?

> +};
> +
> +struct page_cgroup_root {
> +	spinlock_t	       tree_lock;
> +	struct radix_tree_root root_node;
> +};
> +

Comments describing induvidual members of the structures would be nice

> +static struct page_cgroup_root *root_dir[MAX_NUMNODES];
> +


The root_dir is per node, should we call it node_root_dir?

> +static void init_page_cgroup(struct page_cgroup_head *head, unsigned long pfn)
> +{
> +	int i;
> +	struct page_cgroup *pc;
> +
> +	memset(head, 0, sizeof(*head));
> +	for (i = 0; i < PCGRP_SIZE; ++i) {
> +		pc = &head->pc[i];
> +		pc->page = pfn_to_page(pfn + i);
> +		spin_lock_init(&pc->lock);
> +		INIT_LIST_HEAD(&pc->lru);
> +	}
> +}
> +
> +
> +struct kmem_cache *page_cgroup_cachep;
> +
> +static struct page_cgroup_head *
> +alloc_init_page_cgroup(unsigned long pfn, int nid, gfp_t mask)
> +{
> +	struct page_cgroup_head *head;
> +
> +	head = kmem_cache_alloc_node(page_cgroup_cachep, mask, nid);
> +	if (!head)
> +		return NULL;
> +
> +	init_page_cgroup(head, pfn);
> +
> +	return head;
> +}
> +
> +void free_page_cgroup(struct page_cgroup_head *head)
> +{
> +	kmem_cache_free(page_cgroup_cachep, head);
> +}
> +
> +
> +/*
> + * Look up page_cgroup struct for struct page (page's pfn)
> + * if (allocate == true), look up and allocate new one if necessary.
> + * if (allocate == false), look up and return NULL if it cannot be found.
> + */
> +
> +struct page_cgroup *
> +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> +{
> +	struct page_cgroup_root *root;
> +	struct page_cgroup_head *head;
> +	struct page_cgroup *pc;
> +	unsigned long pfn, idx;
> +	int nid;
> +	unsigned long base_pfn, flags;
> +	int error;
> +
> +	if (!page)
> +		return NULL;
> +
> +	pfn = page_to_pfn(page);
> +	idx = pfn >> PCGRP_SHIFT;
> +	nid = page_to_nid(page);
> +
> +	root = root_dir[nid];
> +	/* Before Init ? */
> +	if (unlikely(!root))
> +		return NULL;
> +

Shouldn't this be a BUG_ON? We don't expect any user space pages to be
allocated prior to init.

> +	base_pfn = idx << PCGRP_SHIFT;
> +retry:
> +	error = 0;
> +	rcu_read_lock();
> +	head = radix_tree_lookup(&root->root_node, idx);
> +	rcu_read_unlock();
> +
> +	if (likely(head))
> +		return &head->pc[pfn - base_pfn];
> +	if (allocate == false)
> +		return NULL;
> +
> +	/* Very Slow Path. On demand allocation. */
> +	gfpmask = gfpmask & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> +
> +	head = alloc_init_page_cgroup(base_pfn, nid, gfpmask);

This name is a bit confusing, it sounds like the page_cgroup is
allocated from initialzation context, but I think it stands for
allocate and initialize right?

> +	if (!head)
> +		return ERR_PTR(-ENOMEM);
> +	pc = NULL;
> +	error = radix_tree_preload(gfpmask);
> +	if (error)
> +		goto out;
> +	spin_lock_irqsave(&root->tree_lock, flags);
> +	error = radix_tree_insert(&root->root_node, idx, head);
> +
> +	if (!error)
> +		pc = &head->pc[pfn - base_pfn];
> +	spin_unlock_irqrestore(&root->tree_lock, flags);
> +	radix_tree_preload_end();
> +out:
> +	if (!pc) {
> +		free_page_cgroup(head);

We free the entire page_cgroup_head?

> +		if (error == -EEXIST)
> +			goto retry;
> +	}
> +	if (error)
> +		pc = ERR_PTR(error);
> +	return pc;
> +}
> +
> +__init int page_cgroup_init(void)
> +{
> +	int nid;
> +	struct page_cgroup_root *root;
> +
> +	page_cgroup_cachep = kmem_cache_create("page_cgroup",
> +				sizeof(struct page_cgroup_head), 0,
> +				SLAB_PANIC | SLAB_DESTROY_BY_RCU, NULL);
> +	if (!page_cgroup_cachep) {
> +		printk(KERN_ERR "page accouning setup failure\n");
> +		printk(KERN_ERR "can't initialize slab memory\n");
> +		/* FIX ME: should return some error code ? */
> +		return 0;
> +	}
> +	for_each_online_node(nid) {
> +		if (node_state(nid, N_NORMAL_MEMORY)
> +			root = kmalloc_node(sizeof(struct page_cgroup_root),
> +					GFP_KERNEL, nid);
> +		else
> +			root = kmalloc(sizeof(struct page_cgroup_root),
> +					GFP_KERNEL);
> +		INIT_RADIX_TREE(&root->root_node, GFP_ATOMIC);
> +		spin_lock_init(&root->tree_lock);
> +		smp_wmb();

Could you please explain why we need a barrier here and comment it as
well.

> +		root_dir[nid] = root;
> +	}
> +
> +	printk(KERN_INFO "Page Accouintg is activated\n");
> +	return 0;
> +}
> +late_initcall(page_cgroup_init);
> Index: mm-2.6.25-rc5-mm1/mm/Makefile
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/Makefile
> +++ mm-2.6.25-rc5-mm1/mm/Makefile
> @@ -32,5 +32,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_SMP) += allocpercpu.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> 
> Index: mm-2.6.25-rc5-mm1/init/Kconfig
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/init/Kconfig
> +++ mm-2.6.25-rc5-mm1/init/Kconfig
> @@ -405,6 +405,20 @@ config SYSFS_DEPRECATED_V2
>  	  If you are using a distro with the most recent userspace
>  	  packages, it should be safe to say N here.
> 
> +config CGROUP_PAGE_CGROUP_ORDER
> +	int "Order of page accounting subsystem"
> +	range 0 10
> +	default 3 if HIGHMEM64G
> +	default 10 if 64BIT
> +	default 7

What are these defaults based on?

> +	depends on CGROUP_MEM_RES_CTLR
> +	help
> +	  By making this value to be small, wastes in memory usage of page
> +	  accounting can be small. But big number is good for perfomance.
> +	  Especially, HIGHMEM64G users should keep this to be small because
> +	  you tend to have small kernel memory.
> +	  If unsure, use default.
> +
>  config PROC_PID_CPUSET
>  	bool "Include legacy /proc/<pid>/cpuset file"
>  	depends on CPUSETS
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-19  2:05   ` Balbir Singh
@ 2008-03-19  2:51     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-19  2:51 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

On Wed, 19 Mar 2008 07:35:36 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > +#include <linux/mm.h>
> > +#include <linux/slab.h>
> > +#include <linux/radix-tree.h>
> > +#include <linux/memcontrol.h>
> > +#include <linux/page_cgroup.h>
> > +#include <linux/err.h>
> > +
> > +
> > +
> 
> Too many extra lines
> 
fixed.

> > +#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
> > +#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> > +
> > +struct page_cgroup_head {
> > +	struct page_cgroup pc[PCGRP_SIZE];
> 
> Are we over optimizing here?
> 
what is over optimizing ? PCGRP_SIZE is too large ?

> > +};
> > +
> > +struct page_cgroup_root {
> > +	spinlock_t	       tree_lock;
> > +	struct radix_tree_root root_node;
> > +};
> > +
> 
> Comments describing induvidual members of the structures would be nice
> 
will add.

> > +static struct page_cgroup_root *root_dir[MAX_NUMNODES];
> > +
> 
> 
> The root_dir is per node, should we call it node_root_dir?
> 
ok.

>
> > +	root = root_dir[nid];
> > +	/* Before Init ? */
> > +	if (unlikely(!root))
> > +		return NULL;
> > +
> 
> Shouldn't this be a BUG_ON? We don't expect any user space pages to be
> allocated prior to init.
No. this routine uses kmalloc(). To use it, I uses late_initcall().
Before late_initcall(), pages will be used.

> 
> > +	base_pfn = idx << PCGRP_SHIFT;
> > +retry:
> > +	error = 0;
> > +	rcu_read_lock();
> > +	head = radix_tree_lookup(&root->root_node, idx);
> > +	rcu_read_unlock();
> > +
> > +	if (likely(head))
> > +		return &head->pc[pfn - base_pfn];
> > +	if (allocate == false)
> > +		return NULL;
> > +
> > +	/* Very Slow Path. On demand allocation. */
> > +	gfpmask = gfpmask & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> > +
> > +	head = alloc_init_page_cgroup(base_pfn, nid, gfpmask);
> 
> This name is a bit confusing, it sounds like the page_cgroup is
> allocated from initialzation context, but I think it stands for
> allocate and initialize right?
> 
Ah, yes. I'll rename this.


> > +	if (!head)
> > +		return ERR_PTR(-ENOMEM);
> > +	pc = NULL;
> > +	error = radix_tree_preload(gfpmask);
> > +	if (error)
> > +		goto out;
> > +	spin_lock_irqsave(&root->tree_lock, flags);
> > +	error = radix_tree_insert(&root->root_node, idx, head);
> > +
> > +	if (!error)
> > +		pc = &head->pc[pfn - base_pfn];
> > +	spin_unlock_irqrestore(&root->tree_lock, flags);
> > +	radix_tree_preload_end();
> > +out:
> > +	if (!pc) {
> > +		free_page_cgroup(head);
> 
> We free the entire page_cgroup_head?
> 
yes. We alloc a head and free a head.


> > +		if (error == -EEXIST)
> > +			goto retry;
> > +	}
> > +	if (error)
> > +		pc = ERR_PTR(error);
> > +	return pc;
> > +}
> > +
> > +__init int page_cgroup_init(void)
> > +{
> > +	int nid;
> > +	struct page_cgroup_root *root;
> > +
> > +	page_cgroup_cachep = kmem_cache_create("page_cgroup",
> > +				sizeof(struct page_cgroup_head), 0,
> > +				SLAB_PANIC | SLAB_DESTROY_BY_RCU, NULL);
> > +	if (!page_cgroup_cachep) {
> > +		printk(KERN_ERR "page accouning setup failure\n");
> > +		printk(KERN_ERR "can't initialize slab memory\n");
> > +		/* FIX ME: should return some error code ? */
> > +		return 0;
> > +	}
> > +	for_each_online_node(nid) {
> > +		if (node_state(nid, N_NORMAL_MEMORY)
> > +			root = kmalloc_node(sizeof(struct page_cgroup_root),
> > +					GFP_KERNEL, nid);
> > +		else
> > +			root = kmalloc(sizeof(struct page_cgroup_root),
> > +					GFP_KERNEL);
> > +		INIT_RADIX_TREE(&root->root_node, GFP_ATOMIC);
> > +		spin_lock_init(&root->tree_lock);
> > +		smp_wmb();
> 
> Could you please explain why we need a barrier here and comment it as
> well.
ok. will add.

Before root_dir[nid] != NULL is visible to other cpus, all initizlization to
root should be finished.
> > +		root_dir[nid] = root;

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-14 10:17 ` [PATCH 5/7] radix-tree page cgroup KAMEZAWA Hiroyuki
  2008-03-17  2:56   ` Li Zefan
  2008-03-19  2:05   ` Balbir Singh
@ 2008-03-19  3:14   ` Balbir Singh
  2008-03-19  3:24     ` KAMEZAWA Hiroyuki
  2008-03-19 21:11   ` Peter Zijlstra
  3 siblings, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2008-03-19  3:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:17:33]:

> +		if (node_state(nid, N_NORMAL_MEMORY)

Small Typo - missing parenthesis

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-19  3:14   ` Balbir Singh
@ 2008-03-19  3:24     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-19  3:24 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

On Wed, 19 Mar 2008 08:44:45 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-03-14 19:17:33]:
> 
> > +		if (node_state(nid, N_NORMAL_MEMORY)
> 
> Small Typo - missing parenthesis
> 
ok...maybe magically fixed in my set...

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-14 10:17 ` [PATCH 5/7] radix-tree page cgroup KAMEZAWA Hiroyuki
                     ` (2 preceding siblings ...)
  2008-03-19  3:14   ` Balbir Singh
@ 2008-03-19 21:11   ` Peter Zijlstra
  2008-03-20  4:45     ` KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2008-03-19 21:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Fri, 2008-03-14 at 19:17 +0900, KAMEZAWA Hiroyuki wrote:
> A lookup routine for page_cgroup struct.
> 
> Now, page_cgroup is pointed by struct page's page_cgroup entry
> 
> struct page {
> 	...
> 	struct page_cgroup *page_cgroup;
> 	..
> }
> 
> But some people dislike this because this increases sizeof(struct page).
> 
> For avoiding that, we'll have to add a lookup routine for
> 	pfn <-> page_cgroup.
> by radix-tree.

Good stuff.

> New function is
> 
> struct page *get_page_cgroup(struct page *page, gfp_mask mask, bool allocate);
> 
> if (allocate == true), look up and allocate new one if necessary.
> if (allocate == false), just do look up and return NULL if not exist.

I think others said as well, but we generally just write

 if (allocate)

 if (!allocate)


> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
>  init/Kconfig     |   14 ++++
>  mm/Makefile      |    2 
>  mm/page_cgroup.c |  169 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 184 insertions(+), 1 deletion(-)
> 
> Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> ===================================================================
> --- /dev/null
> +++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> @@ -0,0 +1,173 @@
> +/*
> + * page_cgroup mamagement codes.
> + * page_cgroup is yet another mem_map when cgroup's memory resoruce controller
> + * is activated. It containes information which cannot be stored in usual
> + * mem_map. (it's too big.)
> + * This allows us to keep 'struct page' small when a user doesn't activate
> + * memory resource controller.
> + *
> + * Note: all things are allocated on demand.
> + *
> + * We can translate : struct page <-> pfn -> page_cgroup -> struct page.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/radix-tree.h>
> +#include <linux/memcontrol.h>
> +#include <linux/page_cgroup.h>
> +#include <linux/err.h>
> +
> +
> +
> +#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
> +#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> +
> +struct page_cgroup_head {
> +	struct page_cgroup pc[PCGRP_SIZE];
> +};
> +
> +struct page_cgroup_root {
> +	spinlock_t	       tree_lock;
> +	struct radix_tree_root root_node;
> +};
> +
> +static struct page_cgroup_root *root_dir[MAX_NUMNODES];
> +
> +static void init_page_cgroup(struct page_cgroup_head *head, unsigned long pfn)
> +{
> +	int i;
> +	struct page_cgroup *pc;
> +
> +	memset(head, 0, sizeof(*head));
> +	for (i = 0; i < PCGRP_SIZE; ++i) {
> +		pc = &head->pc[i];
> +		pc->page = pfn_to_page(pfn + i);
> +		spin_lock_init(&pc->lock);
> +		INIT_LIST_HEAD(&pc->lru);
> +	}
> +}
> +
> +
> +struct kmem_cache *page_cgroup_cachep;
> +
> +static struct page_cgroup_head *
> +alloc_init_page_cgroup(unsigned long pfn, int nid, gfp_t mask)
> +{
> +	struct page_cgroup_head *head;
> +
> +	head = kmem_cache_alloc_node(page_cgroup_cachep, mask, nid);
> +	if (!head)
> +		return NULL;
> +
> +	init_page_cgroup(head, pfn);

Just because I'm lazy, I'll suggest the shorter:

if (head)
   init_page_cgroup(head, pfn)

> +
> +	return head;
> +}
> +
> +void free_page_cgroup(struct page_cgroup_head *head)
> +{
> +	kmem_cache_free(page_cgroup_cachep, head);
> +}
> +
> +
> +/*
> + * Look up page_cgroup struct for struct page (page's pfn)
> + * if (allocate == true), look up and allocate new one if necessary.
> + * if (allocate == false), look up and return NULL if it cannot be found.
> + */
> +
> +struct page_cgroup *
> +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> +{
> +	struct page_cgroup_root *root;
> +	struct page_cgroup_head *head;
> +	struct page_cgroup *pc;
> +	unsigned long pfn, idx;
> +	int nid;
> +	unsigned long base_pfn, flags;
> +	int error;

Would a this make sense? :

  might_sleep_if(allocate && (gfp_mask & __GFP_WAIT));

> +	if (!page)
> +		return NULL;
> +
> +	pfn = page_to_pfn(page);
> +	idx = pfn >> PCGRP_SHIFT;
> +	nid = page_to_nid(page);
> +
> +	root = root_dir[nid];
> +	/* Before Init ? */
> +	if (unlikely(!root))
> +		return NULL;
> +
> +	base_pfn = idx << PCGRP_SHIFT;
> +retry:
> +	error = 0;
> +	rcu_read_lock();
> +	head = radix_tree_lookup(&root->root_node, idx);
> +	rcu_read_unlock();

This looks iffy, who protects head here?

> +	if (likely(head))
> +		return &head->pc[pfn - base_pfn];
> +	if (allocate == false)
> +		return NULL;
> +
> +	/* Very Slow Path. On demand allocation. */
> +	gfpmask = gfpmask & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> +
> +	head = alloc_init_page_cgroup(base_pfn, nid, gfpmask);
> +	if (!head)
> +		return ERR_PTR(-ENOMEM);
> +	pc = NULL;
> +	error = radix_tree_preload(gfpmask);
> +	if (error)
> +		goto out;
> +	spin_lock_irqsave(&root->tree_lock, flags);
> +	error = radix_tree_insert(&root->root_node, idx, head);
> +
> +	if (!error)
> +		pc = &head->pc[pfn - base_pfn];
> +	spin_unlock_irqrestore(&root->tree_lock, flags);
> +	radix_tree_preload_end();
> +out:
> +	if (!pc) {
> +		free_page_cgroup(head);
> +		if (error == -EEXIST)
> +			goto retry;
> +	}
> +	if (error)
> +		pc = ERR_PTR(error);
> +	return pc;
> +}
> +
> +__init int page_cgroup_init(void)
> +{
> +	int nid;
> +	struct page_cgroup_root *root;
> +
> +	page_cgroup_cachep = kmem_cache_create("page_cgroup",
> +				sizeof(struct page_cgroup_head), 0,
> +				SLAB_PANIC | SLAB_DESTROY_BY_RCU, NULL);
> +	if (!page_cgroup_cachep) {
> +		printk(KERN_ERR "page accouning setup failure\n");
> +		printk(KERN_ERR "can't initialize slab memory\n");
> +		/* FIX ME: should return some error code ? */
> +		return 0;
> +	}
> +	for_each_online_node(nid) {
> +		if (node_state(nid, N_NORMAL_MEMORY)
> +			root = kmalloc_node(sizeof(struct page_cgroup_root),
> +					GFP_KERNEL, nid);
> +		else
> +			root = kmalloc(sizeof(struct page_cgroup_root),
> +					GFP_KERNEL);

if (!node_state(nid, N_NORMAL_MEMORY))
  nid = -1;

allows us to use a single kmalloc_node() statement.

> +		INIT_RADIX_TREE(&root->root_node, GFP_ATOMIC);
> +		spin_lock_init(&root->tree_lock);
> +		smp_wmb();

unadorned barrier; we usually require a comment outlining the race, and
a reference to the matching barrier.

> +		root_dir[nid] = root;
> +	}
> +
> +	printk(KERN_INFO "Page Accouintg is activated\n");
> +	return 0;
> +}
> +late_initcall(page_cgroup_init);
> Index: mm-2.6.25-rc5-mm1/mm/Makefile
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/Makefile
> +++ mm-2.6.25-rc5-mm1/mm/Makefile
> @@ -32,5 +32,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_SMP) += allocpercpu.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
>  
> Index: mm-2.6.25-rc5-mm1/init/Kconfig
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/init/Kconfig
> +++ mm-2.6.25-rc5-mm1/init/Kconfig
> @@ -405,6 +405,20 @@ config SYSFS_DEPRECATED_V2
>  	  If you are using a distro with the most recent userspace
>  	  packages, it should be safe to say N here.
>  
> +config CGROUP_PAGE_CGROUP_ORDER
> +	int "Order of page accounting subsystem"
> +	range 0 10
> +	default 3 if HIGHMEM64G
> +	default 10 if 64BIT
> +	default 7
> +	depends on CGROUP_MEM_RES_CTLR
> +	help
> +	  By making this value to be small, wastes in memory usage of page
> +	  accounting can be small. But big number is good for perfomance.
> +	  Especially, HIGHMEM64G users should keep this to be small because
> +	  you tend to have small kernel memory.
> +	  If unsure, use default.
> +
>  config PROC_PID_CPUSET
>  	bool "Include legacy /proc/<pid>/cpuset file"
>  	depends on CPUSETS
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-19 21:11   ` Peter Zijlstra
@ 2008-03-20  4:45     ` KAMEZAWA Hiroyuki
  2008-03-20  5:09       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-20  4:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

thank you for review.

On Wed, 19 Mar 2008 22:11:06 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > New function is
> > 
> > struct page *get_page_cgroup(struct page *page, gfp_mask mask, bool allocate);
> > 
> > if (allocate == true), look up and allocate new one if necessary.
> > if (allocate == false), just do look up and return NULL if not exist.
> 
> I think others said as well, but we generally just write
> 
>  if (allocate)
> 
>  if (!allocate)
> 
ok. I'm now separating this function to 2 functions.
just look-up/ look-up and allocate.


> > +	struct page_cgroup_head *head;
> > +
> > +	head = kmem_cache_alloc_node(page_cgroup_cachep, mask, nid);
> > +	if (!head)
> > +		return NULL;
> > +
> > +	init_page_cgroup(head, pfn);
> 
> Just because I'm lazy, I'll suggest the shorter:
> 
> if (head)
>    init_page_cgroup(head, pfn)
I'll fix.


> > +	struct page_cgroup_root *root;
> > +	struct page_cgroup_head *head;
> > +	struct page_cgroup *pc;
> > +	unsigned long pfn, idx;
> > +	int nid;
> > +	unsigned long base_pfn, flags;
> > +	int error;
> 
> Would a this make sense? :
> 
>   might_sleep_if(allocate && (gfp_mask & __GFP_WAIT));
> 
seems good. I'll add it.


> > +	base_pfn = idx << PCGRP_SHIFT;
> > +retry:
> > +	error = 0;
> > +	rcu_read_lock();
> > +	head = radix_tree_lookup(&root->root_node, idx);
> > +	rcu_read_unlock();
> 
> This looks iffy, who protects head here?
> 

In this patch, a routine for freeing "head" is not included.
Then....Hmm.....rcu_read_xxx is not required...I'll remove it.
I'll check the whole logic around here again.

> > +	for_each_online_node(nid) {
> > +		if (node_state(nid, N_NORMAL_MEMORY)
> > +			root = kmalloc_node(sizeof(struct page_cgroup_root),
> > +					GFP_KERNEL, nid);
> > +		else
> > +			root = kmalloc(sizeof(struct page_cgroup_root),
> > +					GFP_KERNEL);
> 
> if (!node_state(nid, N_NORMAL_MEMORY))
>   nid = -1;
> 
> allows us to use a single kmalloc_node() statement.
> 
Oh, ok. it seems good.

> > +		INIT_RADIX_TREE(&root->root_node, GFP_ATOMIC);
> > +		spin_lock_init(&root->tree_lock);
> > +		smp_wmb();
> 
> unadorned barrier; we usually require a comment outlining the race, and
> a reference to the matching barrier.
> 
I'll add comments.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/7] radix-tree page cgroup
  2008-03-20  4:45     ` KAMEZAWA Hiroyuki
@ 2008-03-20  5:09       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-20  5:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Peter Zijlstra, linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	xemul, hugh@veritas.com

On Thu, 20 Mar 2008 13:45:13 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

>
> > > +	base_pfn = idx << PCGRP_SHIFT;
> > > +retry:
> > > +	error = 0;
> > > +	rcu_read_lock();
> > > +	head = radix_tree_lookup(&root->root_node, idx);
> > > +	rcu_read_unlock();
> > 
> > This looks iffy, who protects head here?
> > 
> 
> In this patch, a routine for freeing "head" is not included.
> Then....Hmm.....rcu_read_xxx is not required...I'll remove it.
> I'll check the whole logic around here again.
> 
Sorry, I was confused...for radix-tree, ruc_xxx is necessary.

Regards,
-Kame
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 6/7] memcg: speed up by percpu
  2008-03-14  9:59 [PATCH 0/7] memcg: radix-tree page_cgroup KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2008-03-14 10:17 ` [PATCH 5/7] radix-tree page cgroup KAMEZAWA Hiroyuki
@ 2008-03-14 10:18 ` KAMEZAWA Hiroyuki
  2008-03-17  3:03   ` Li Zefan
  2008-03-19 21:19   ` Peter Zijlstra
  2008-03-14 10:22 ` [PATCH 7/7] memcg: freeing page_cgroup at suitable chance KAMEZAWA Hiroyuki
  2008-03-15  6:15 ` [PATCH 0/7] memcg: radix-tree page_cgroup Balbir Singh
  7 siblings, 2 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-14 10:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

This patch adds per-cpu look up cache for get_page_cgroup().
Works well when nearby pages are accessed continuously.
(And it's an usual case under buddy allocator.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 include/linux/page_cgroup.h |   37 +++++++++++++++++++++++++++++++++++--
 mm/page_cgroup.c            |   26 +++++++++++++++++++++-----
 2 files changed, 56 insertions(+), 7 deletions(-)

Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
===================================================================
--- mm-2.6.25-rc5-mm1.orig/mm/page_cgroup.c
+++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
@@ -17,11 +17,10 @@
 #include <linux/memcontrol.h>
 #include <linux/page_cgroup.h>
 #include <linux/err.h>
+#include <linux/interrupt.h>
 
 
-
-#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
-#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
+DEFINE_PER_CPU(struct page_cgroup_cache, pcpu_page_cgroup_cache);
 
 struct page_cgroup_head {
 	struct page_cgroup pc[PCGRP_SIZE];
@@ -71,6 +70,19 @@ void free_page_cgroup(struct page_cgroup
 }
 
 
+static void save_result(struct page_cgroup  *base, unsigned long idx)
+{
+	int hash = idx & (PAGE_CGROUP_NR_CACHE - 1);
+	struct page_cgroup_cache *pcp;
+	/* look up is done under preempt_disable(). then, don't call
+	   this under interrupt(). */
+	preempt_disable();
+	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
+	pcp->ents[hash].idx = idx;
+	pcp->ents[hash].base = base;
+	preempt_enable();
+}
+
 /*
  * Look up page_cgroup struct for struct page (page's pfn)
  * if (allocate == true), look up and allocate new one if necessary.
@@ -78,7 +90,7 @@ void free_page_cgroup(struct page_cgroup
  */
 
 struct page_cgroup *
-get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
+__get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
 {
 	struct page_cgroup_root *root;
 	struct page_cgroup_head *head;
@@ -107,8 +119,12 @@ retry:
 	head = radix_tree_lookup(&root->root_node, idx);
 	rcu_read_unlock();
 
-	if (likely(head))
+	if (likely(head)) {
+		if (!in_interrupt())
+			save_result(&head->pc[0], idx);
 		return &head->pc[pfn - base_pfn];
+	}
+
 	if (allocate == false)
 		return NULL;
 
Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
===================================================================
--- mm-2.6.25-rc5-mm1.orig/include/linux/page_cgroup.h
+++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
@@ -25,6 +25,20 @@ struct page_cgroup {
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
 #define PAGE_CGROUP_FLAG_MIGRATION (0x4) /* is on active list */
 
+/* per cpu cashing for fast access */
+#define PAGE_CGROUP_NR_CACHE	(0x8)
+struct page_cgroup_cache {
+	struct {
+		unsigned long idx;
+		struct page_cgroup *base;
+	} ents[PAGE_CGROUP_NR_CACHE];
+};
+
+DECLARE_PER_CPU(struct page_cgroup_cache, pcpu_page_cgroup_cache);
+
+#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
+#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
+
 /*
  * Lookup and return page_cgroup struct.
  * returns NULL when
@@ -33,9 +47,28 @@ struct page_cgroup {
  * return -ENOMEM if cannot allocate memory.
  * If allocate==false, gfpmask will be ignored as a result.
  */
-
 struct page_cgroup *
-get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate);
+__get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate);
+
+static inline struct page_cgroup *
+get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct page_cgroup_cache *pcp;
+	struct page_cgroup *ret;
+	unsigned long idx = pfn >> PCGRP_SHIFT;
+	int hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
+
+	preempt_disable();
+	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
+	if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base)
+		ret = pcp->ents[hnum].base + (pfn - (idx << PCGRP_SHIFT));
+	else
+		ret = NULL;
+	preempt_enable();
+
+	return (ret)? ret : __get_page_cgroup(page, gfpmask, allocate);
+}
 
 #else
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/7] memcg: speed up by percpu
  2008-03-14 10:18 ` [PATCH 6/7] memcg: speed up by percpu KAMEZAWA Hiroyuki
@ 2008-03-17  3:03   ` Li Zefan
  2008-03-18  1:25     ` KAMEZAWA Hiroyuki
  2008-03-19 21:19   ` Peter Zijlstra
  1 sibling, 1 reply; 53+ messages in thread
From: Li Zefan @ 2008-03-17  3:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

KAMEZAWA Hiroyuki wrote:
> This patch adds per-cpu look up cache for get_page_cgroup().
> Works well when nearby pages are accessed continuously.
> (And it's an usual case under buddy allocator.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> 
>  include/linux/page_cgroup.h |   37 +++++++++++++++++++++++++++++++++++--
>  mm/page_cgroup.c            |   26 +++++++++++++++++++++-----
>  2 files changed, 56 insertions(+), 7 deletions(-)
> 
> Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/page_cgroup.c
> +++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> @@ -17,11 +17,10 @@
>  #include <linux/memcontrol.h>
>  #include <linux/page_cgroup.h>
>  #include <linux/err.h>
> +#include <linux/interrupt.h>
>  
>  
> -
> -#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
> -#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> +DEFINE_PER_CPU(struct page_cgroup_cache, pcpu_page_cgroup_cache);
>  
>  struct page_cgroup_head {
>  	struct page_cgroup pc[PCGRP_SIZE];
> @@ -71,6 +70,19 @@ void free_page_cgroup(struct page_cgroup
>  }
>  
>  

redundant empty line.

> +static void save_result(struct page_cgroup  *base, unsigned long idx)
> +{
> +	int hash = idx & (PAGE_CGROUP_NR_CACHE - 1);
> +	struct page_cgroup_cache *pcp;
> +	/* look up is done under preempt_disable(). then, don't call
> +	   this under interrupt(). */
> +	preempt_disable();
> +	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> +	pcp->ents[hash].idx = idx;
> +	pcp->ents[hash].base = base;
> +	preempt_enable();
> +}
> +
>  /*
>   * Look up page_cgroup struct for struct page (page's pfn)
>   * if (allocate == true), look up and allocate new one if necessary.
> @@ -78,7 +90,7 @@ void free_page_cgroup(struct page_cgroup
>   */
>  
>  struct page_cgroup *
> -get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> +__get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
>  {
>  	struct page_cgroup_root *root;
>  	struct page_cgroup_head *head;
> @@ -107,8 +119,12 @@ retry:
>  	head = radix_tree_lookup(&root->root_node, idx);
>  	rcu_read_unlock();
>  
> -	if (likely(head))
> +	if (likely(head)) {
> +		if (!in_interrupt())
> +			save_result(&head->pc[0], idx);
>  		return &head->pc[pfn - base_pfn];
> +	}
> +
>  	if (allocate == false)
>  		return NULL;
>  
> Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/include/linux/page_cgroup.h
> +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> @@ -25,6 +25,20 @@ struct page_cgroup {
>  #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
>  #define PAGE_CGROUP_FLAG_MIGRATION (0x4) /* is on active list */
>  
> +/* per cpu cashing for fast access */
> +#define PAGE_CGROUP_NR_CACHE	(0x8)
> +struct page_cgroup_cache {
> +	struct {
> +		unsigned long idx;
> +		struct page_cgroup *base;
> +	} ents[PAGE_CGROUP_NR_CACHE];
> +};
> +
> +DECLARE_PER_CPU(struct page_cgroup_cache, pcpu_page_cgroup_cache);
> +
> +#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
> +#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> +
>  /*
>   * Lookup and return page_cgroup struct.
>   * returns NULL when
> @@ -33,9 +47,28 @@ struct page_cgroup {
>   * return -ENOMEM if cannot allocate memory.
>   * If allocate==false, gfpmask will be ignored as a result.
>   */
> -
>  struct page_cgroup *
> -get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate);
> +__get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate);
> +
> +static inline struct page_cgroup *
> +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)

This function is too big to be inline

> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	struct page_cgroup_cache *pcp;
> +	struct page_cgroup *ret;
> +	unsigned long idx = pfn >> PCGRP_SHIFT;
> +	int hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
> +
> +	preempt_disable();
> +	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> +	if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base)
> +		ret = pcp->ents[hnum].base + (pfn - (idx << PCGRP_SHIFT));
> +	else
> +		ret = NULL;
> +	preempt_enable();
> +
> +	return (ret)? ret : __get_page_cgroup(page, gfpmask, allocate);
> +}
>  
>  #else
>  
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/7] memcg: speed up by percpu
  2008-03-17  3:03   ` Li Zefan
@ 2008-03-18  1:25     ` KAMEZAWA Hiroyuki
  2008-03-18 23:55       ` Li Zefan
  0 siblings, 1 reply; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-18  1:25 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Mon, 17 Mar 2008 12:03:26 +0900
Li Zefan <lizf@cn.fujitsu.com> wrote:

> > +static inline struct page_cgroup *
> > +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> 
> This function is too big to be inline
> 
I don't think so. This just does shift, mask, access to per-cpu.
And enough benefit to make this to be inline.
But ok, will check text size.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/7] memcg: speed up by percpu
  2008-03-18  1:25     ` KAMEZAWA Hiroyuki
@ 2008-03-18 23:55       ` Li Zefan
  2008-03-19  2:51         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 53+ messages in thread
From: Li Zefan @ 2008-03-18 23:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

KAMEZAWA Hiroyuki wrote:
> On Mon, 17 Mar 2008 12:03:26 +0900
> Li Zefan <lizf@cn.fujitsu.com> wrote:
> 
>>> +static inline struct page_cgroup *
>>> +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
>> This function is too big to be inline
>>
> I don't think so. This just does shift, mask, access to per-cpu.
> And enough benefit to make this to be inline.
> But ok, will check text size.
> 

Normally 1-line or 2-lines functions are good candidates for inline,
but others are usually not.

> Thanks,
> -Kame
> 
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/7] memcg: speed up by percpu
  2008-03-18 23:55       ` Li Zefan
@ 2008-03-19  2:51         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-19  2:51 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Wed, 19 Mar 2008 08:55:45 +0900
Li Zefan <lizf@cn.fujitsu.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Mon, 17 Mar 2008 12:03:26 +0900
> > Li Zefan <lizf@cn.fujitsu.com> wrote:
> > 
> >>> +static inline struct page_cgroup *
> >>> +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> >> This function is too big to be inline
> >>
> > I don't think so. This just does shift, mask, access to per-cpu.
> > And enough benefit to make this to be inline.
> > But ok, will check text size.
> > 
> 
> Normally 1-line or 2-lines functions are good candidates for inline,
> but others are usually not.
> 
next version will not contain inline func.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/7] memcg: speed up by percpu
  2008-03-14 10:18 ` [PATCH 6/7] memcg: speed up by percpu KAMEZAWA Hiroyuki
  2008-03-17  3:03   ` Li Zefan
@ 2008-03-19 21:19   ` Peter Zijlstra
  2008-03-19 21:41     ` Peter Zijlstra
  2008-03-20  4:46     ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 53+ messages in thread
From: Peter Zijlstra @ 2008-03-19 21:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Fri, 2008-03-14 at 19:18 +0900, KAMEZAWA Hiroyuki wrote:
> This patch adds per-cpu look up cache for get_page_cgroup().
> Works well when nearby pages are accessed continuously.
> (And it's an usual case under buddy allocator.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> 
>  include/linux/page_cgroup.h |   37 +++++++++++++++++++++++++++++++++++--
>  mm/page_cgroup.c            |   26 +++++++++++++++++++++-----
>  2 files changed, 56 insertions(+), 7 deletions(-)
> 
> Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/page_cgroup.c
> +++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> @@ -17,11 +17,10 @@
>  #include <linux/memcontrol.h>
>  #include <linux/page_cgroup.h>
>  #include <linux/err.h>
> +#include <linux/interrupt.h>
>  
> 
> -
> -#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
> -#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> +DEFINE_PER_CPU(struct page_cgroup_cache, pcpu_page_cgroup_cache);
>  
>  struct page_cgroup_head {
>  	struct page_cgroup pc[PCGRP_SIZE];
> @@ -71,6 +70,19 @@ void free_page_cgroup(struct page_cgroup
>  }
>  
> 
> +static void save_result(struct page_cgroup  *base, unsigned long idx)
> +{
> +	int hash = idx & (PAGE_CGROUP_NR_CACHE - 1);
> +	struct page_cgroup_cache *pcp;
> +	/* look up is done under preempt_disable(). then, don't call
> +	   this under interrupt(). */
> +	preempt_disable();
> +	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> +	pcp->ents[hash].idx = idx;
> +	pcp->ents[hash].base = base;
> +	preempt_enable();
> +}
> +
>  /*
>   * Look up page_cgroup struct for struct page (page's pfn)
>   * if (allocate == true), look up and allocate new one if necessary.
> @@ -78,7 +90,7 @@ void free_page_cgroup(struct page_cgroup
>   */
>  
>  struct page_cgroup *
> -get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> +__get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
>  {
>  	struct page_cgroup_root *root;
>  	struct page_cgroup_head *head;
> @@ -107,8 +119,12 @@ retry:
>  	head = radix_tree_lookup(&root->root_node, idx);
>  	rcu_read_unlock();
>  
> -	if (likely(head))
> +	if (likely(head)) {
> +		if (!in_interrupt())
> +			save_result(&head->pc[0], idx);
>  		return &head->pc[pfn - base_pfn];
> +	}
> +
>  	if (allocate == false)
>  		return NULL;
>  
> Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/include/linux/page_cgroup.h
> +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> @@ -25,6 +25,20 @@ struct page_cgroup {
>  #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* is on active list */
>  #define PAGE_CGROUP_FLAG_MIGRATION (0x4) /* is on active list */
>  
> +/* per cpu cashing for fast access */
> +#define PAGE_CGROUP_NR_CACHE	(0x8)
> +struct page_cgroup_cache {
> +	struct {
> +		unsigned long idx;
> +		struct page_cgroup *base;
> +	} ents[PAGE_CGROUP_NR_CACHE];
> +};
> +
> +DECLARE_PER_CPU(struct page_cgroup_cache, pcpu_page_cgroup_cache);
> +
> +#define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
> +#define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> +
>  /*
>   * Lookup and return page_cgroup struct.
>   * returns NULL when
> @@ -33,9 +47,28 @@ struct page_cgroup {
>   * return -ENOMEM if cannot allocate memory.
>   * If allocate==false, gfpmask will be ignored as a result.
>   */
> -
>  struct page_cgroup *
> -get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate);
> +__get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate);
> +
> +static inline struct page_cgroup *
> +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	struct page_cgroup_cache *pcp;
> +	struct page_cgroup *ret;
> +	unsigned long idx = pfn >> PCGRP_SHIFT;
> +	int hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
> +
> +	preempt_disable();

get_cpu_var()

> +	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> +	if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base)
> +		ret = pcp->ents[hnum].base + (pfn - (idx << PCGRP_SHIFT));
> +	else
> +		ret = NULL;
> +	preempt_enable();

put_cpu_var()

> +	return (ret)? ret : __get_page_cgroup(page, gfpmask, allocate);

if (!ret)
  ret = __get_page_cgroup();

return ret;

> +}
>  
>  #else
>  
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/7] memcg: speed up by percpu
  2008-03-19 21:19   ` Peter Zijlstra
@ 2008-03-19 21:41     ` Peter Zijlstra
  2008-03-20  9:08       ` Andy Whitcroft
  2008-03-20  4:46     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2008-03-19 21:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com, apw, Ingo Molnar

On Wed, 2008-03-19 at 22:19 +0100, Peter Zijlstra wrote:

> > +static void save_result(struct page_cgroup  *base, unsigned long idx)
> > +{
> > +	int hash = idx & (PAGE_CGROUP_NR_CACHE - 1);
> > +	struct page_cgroup_cache *pcp;
> > +	/* look up is done under preempt_disable(). then, don't call
> > +	   this under interrupt(). */
> > +	preempt_disable();
> > +	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> > +	pcp->ents[hash].idx = idx;
> > +	pcp->ents[hash].base = base;
> > +	preempt_enable();
> > +}

Another instance where get_cpu_var(), put_cpu_var() would be preferable.

Raw preempt_{disable,enable)() calls are discouraged, because just as
the BKL they are opaque, they don't tell us what data is protected from
what, or if we rely upon interaction with the RCU grace period or things
like that.

These things are a pain to sort out if you try to change the preemption
model after a few years.

Ingo, does it make sense to make checkpatch.pl warn about raw
preempt_{disable,enable} calls?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/7] memcg: speed up by percpu
  2008-03-19 21:41     ` Peter Zijlstra
@ 2008-03-20  9:08       ` Andy Whitcroft
  0 siblings, 0 replies; 53+ messages in thread
From: Andy Whitcroft @ 2008-03-20  9:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	xemul, hugh@veritas.com, Ingo Molnar

On Wed, Mar 19, 2008 at 10:41:56PM +0100, Peter Zijlstra wrote:
> On Wed, 2008-03-19 at 22:19 +0100, Peter Zijlstra wrote:
> 
> > > +static void save_result(struct page_cgroup  *base, unsigned long idx)
> > > +{
> > > +	int hash = idx & (PAGE_CGROUP_NR_CACHE - 1);
> > > +	struct page_cgroup_cache *pcp;
> > > +	/* look up is done under preempt_disable(). then, don't call
> > > +	   this under interrupt(). */
> > > +	preempt_disable();
> > > +	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> > > +	pcp->ents[hash].idx = idx;
> > > +	pcp->ents[hash].base = base;
> > > +	preempt_enable();
> > > +}
> 
> Another instance where get_cpu_var(), put_cpu_var() would be preferable.
> 
> Raw preempt_{disable,enable)() calls are discouraged, because just as
> the BKL they are opaque, they don't tell us what data is protected from
> what, or if we rely upon interaction with the RCU grace period or things
> like that.
> 
> These things are a pain to sort out if you try to change the preemption
> model after a few years.
> 
> Ingo, does it make sense to make checkpatch.pl warn about raw
> preempt_{disable,enable} calls?

That signature would likely be detectable even if all preempt_disable()s
were not reportable.  I do wonder looking at the shape of that whether
there is a ganged version for where you want more than one variable; I
can't see one.

-apw

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 6/7] memcg: speed up by percpu
  2008-03-19 21:19   ` Peter Zijlstra
  2008-03-19 21:41     ` Peter Zijlstra
@ 2008-03-20  4:46     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-20  4:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Wed, 19 Mar 2008 22:19:25 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > +static inline struct page_cgroup *
> > +get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> > +{
> > +	unsigned long pfn = page_to_pfn(page);
> > +	struct page_cgroup_cache *pcp;
> > +	struct page_cgroup *ret;
> > +	unsigned long idx = pfn >> PCGRP_SHIFT;
> > +	int hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
> > +
> > +	preempt_disable();
> 
> get_cpu_var()
> 
> > +	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> > +	if (pcp->ents[hnum].idx == idx && pcp->ents[hnum].base)
> > +		ret = pcp->ents[hnum].base + (pfn - (idx << PCGRP_SHIFT));
> > +	else
> > +		ret = NULL;
> > +	preempt_enable();
> 
> put_cpu_var()
> 
> > +	return (ret)? ret : __get_page_cgroup(page, gfpmask, allocate);
> 
> if (!ret)
>   ret = __get_page_cgroup();
> 
> return ret;
> 
ok, I'll fix. 

Thanks,
-Kame



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 7/7] memcg: freeing page_cgroup at suitable chance
  2008-03-14  9:59 [PATCH 0/7] memcg: radix-tree page_cgroup KAMEZAWA Hiroyuki
                   ` (5 preceding siblings ...)
  2008-03-14 10:18 ` [PATCH 6/7] memcg: speed up by percpu KAMEZAWA Hiroyuki
@ 2008-03-14 10:22 ` KAMEZAWA Hiroyuki
  2008-03-17  3:10   ` Li Zefan
  2008-03-19 21:33   ` Peter Zijlstra
  2008-03-15  6:15 ` [PATCH 0/7] memcg: radix-tree page_cgroup Balbir Singh
  7 siblings, 2 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-14 10:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

This patch is for freeing page_cgroup if a chunk of pages are freed.

How this works 
 * when the order of free page reaches PCGRP_SHRINK_ORDER, pcgrp is freed.
   This will be done by RCU.

I think this works well because
   - unnecessary freeing will not occur in busy servers.
   - page_cgroup will be removed at necessary point (allocating Hugepage,etc..)
   - If tons of pages are freed (ex. big file is removed), page_cgroup will
     be removed.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsuc.com>


 include/linux/page_cgroup.h |   15 +++++++++++-
 mm/page_alloc.c             |    3 ++
 mm/page_cgroup.c            |   54 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 71 insertions(+), 1 deletion(-)

Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
===================================================================
--- mm-2.6.25-rc5-mm1.orig/include/linux/page_cgroup.h
+++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
@@ -39,6 +39,12 @@ DECLARE_PER_CPU(struct page_cgroup_cache
 #define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
 #define PCGRP_SIZE	(1 << PCGRP_SHIFT)
 
+#if PCGRP_SHIFT + 3 >= MAX_ORDER
+#define PCGRP_SHRINK_ORDER	(MAX_ORDER - 1)
+#else
+#define PCGRP_SHRINK_ORDER	(PCGRP_SHIFT + 3)
+#endif
+
 /*
  * Lookup and return page_cgroup struct.
  * returns NULL when
@@ -70,12 +76,19 @@ get_page_cgroup(struct page *page, gfp_t
 	return (ret)? ret : __get_page_cgroup(page, gfpmask, allocate);
 }
 
+void try_to_shrink_page_cgroup(struct page *page, int order);
+
 #else
 
-static struct page_cgroup *
+static inline struct page_cgroup *
 get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
 {
 	return NULL;
 }
+static inline void try_to_shrink_page_cgroup(struct page *page, int order)
+{
+	return;
+}
+#define PCGRP_SHRINK_ORDER	(MAX_ORDER)
 #endif
 #endif
Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
===================================================================
--- mm-2.6.25-rc5-mm1.orig/mm/page_cgroup.c
+++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
@@ -12,6 +12,7 @@
  */
 
 #include <linux/mm.h>
+#include <linux/mmzone.h>
 #include <linux/slab.h>
 #include <linux/radix-tree.h>
 #include <linux/memcontrol.h>
@@ -80,6 +81,7 @@ static void save_result(struct page_cgro
 	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
 	pcp->ents[hash].idx = idx;
 	pcp->ents[hash].base = base;
+	smp_wmb();
 	preempt_enable();
 }
 
@@ -156,6 +158,58 @@ out:
 	return pc;
 }
 
+/* Must be called under zone->lock */
+void try_to_shrink_page_cgroup(struct page *page, int order)
+{
+	unsigned long pfn = page_to_pfn(page);
+	int nid = page_to_nid(page);
+	int idx = pfn >> PCGRP_SHIFT;
+	int hnum = (PAGE_CGROUP_NR_CACHE - 1);
+	struct page_cgroup_cache *pcp;
+	struct page_cgroup_head *head;
+	struct page_cgroup_root *root;
+	unsigned long end_pfn;
+	int cpu;
+
+
+	root = root_dir[nid];
+	if (!root || in_interrupt() || (order < PCGRP_SHIFT))
+		return;
+
+	pfn = page_to_pfn(page);
+	end_pfn = pfn + (1 << order);
+
+	while (pfn != end_pfn) {
+		idx = pfn >> PCGRP_SHIFT;
+		/* Is this pfn has entry ? */
+		rcu_read_lock();
+		head = radix_tree_lookup(&root->root_node, idx);
+		rcu_read_unlock();
+		if (!head) {
+			pfn += (1 << PCGRP_SHIFT);
+			continue;
+		}
+		/* It's guaranteed that no one access to this pfn/idx
+		   because there is no reference to this page. */
+		hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
+		for_each_online_cpu(cpu) {
+			pcp = &per_cpu(pcpu_page_cgroup_cache, cpu);
+			smp_rmb();
+			if (pcp->ents[hnum].idx == idx)
+				pcp->ents[hnum].base = NULL;
+		}
+		if (spin_trylock(&root->tree_lock)) {
+			/* radix tree is freed by RCU. so they will not call
+			   free_pages() right now.*/
+			radix_tree_delete(&root->root_node, idx);
+			spin_unlock(&root->tree_lock);
+			/* We can free this in lazy fashion .*/
+			free_page_cgroup(head);
+		}
+		pfn += (1 << PCGRP_SHIFT);
+	}
+}
+
 __init int page_cgroup_init(void)
 {
 	int nid;
Index: mm-2.6.25-rc5-mm1/mm/page_alloc.c
===================================================================
--- mm-2.6.25-rc5-mm1.orig/mm/page_alloc.c
+++ mm-2.6.25-rc5-mm1/mm/page_alloc.c
@@ -45,6 +45,7 @@
 #include <linux/fault-inject.h>
 #include <linux/page-isolation.h>
 #include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -463,6 +464,8 @@ static inline void __free_one_page(struc
 		order++;
 	}
 	set_page_order(page, order);
+	if (order >= PCGRP_SHRINK_ORDER)
+		try_to_shrink_page_cgroup(page, order);
 	list_add(&page->lru,
 		&zone->free_area[order].free_list[migratetype]);
 	zone->free_area[order].nr_free++;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/7] memcg: freeing page_cgroup at suitable chance
  2008-03-14 10:22 ` [PATCH 7/7] memcg: freeing page_cgroup at suitable chance KAMEZAWA Hiroyuki
@ 2008-03-17  3:10   ` Li Zefan
  2008-03-18  1:30     ` KAMEZAWA Hiroyuki
  2008-03-19 21:33   ` Peter Zijlstra
  1 sibling, 1 reply; 53+ messages in thread
From: Li Zefan @ 2008-03-17  3:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

KAMEZAWA Hiroyuki wrote:
> This patch is for freeing page_cgroup if a chunk of pages are freed.
> 
> How this works 
>  * when the order of free page reaches PCGRP_SHRINK_ORDER, pcgrp is freed.
>    This will be done by RCU.
> 
> I think this works well because
>    - unnecessary freeing will not occur in busy servers.
>    - page_cgroup will be removed at necessary point (allocating Hugepage,etc..)
>    - If tons of pages are freed (ex. big file is removed), page_cgroup will
>      be removed.
> 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsuc.com>
> 
> 
>  include/linux/page_cgroup.h |   15 +++++++++++-
>  mm/page_alloc.c             |    3 ++
>  mm/page_cgroup.c            |   54 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 71 insertions(+), 1 deletion(-)
> 
> Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/include/linux/page_cgroup.h
> +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> @@ -39,6 +39,12 @@ DECLARE_PER_CPU(struct page_cgroup_cache
>  #define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
>  #define PCGRP_SIZE	(1 << PCGRP_SHIFT)
>  
> +#if PCGRP_SHIFT + 3 >= MAX_ORDER
> +#define PCGRP_SHRINK_ORDER	(MAX_ORDER - 1)
> +#else
> +#define PCGRP_SHRINK_ORDER	(PCGRP_SHIFT + 3)
> +#endif
> +
>  /*
>   * Lookup and return page_cgroup struct.
>   * returns NULL when
> @@ -70,12 +76,19 @@ get_page_cgroup(struct page *page, gfp_t
>  	return (ret)? ret : __get_page_cgroup(page, gfpmask, allocate);
>  }
>  
> +void try_to_shrink_page_cgroup(struct page *page, int order);
> +

extern void

>  #else
>  
> -static struct page_cgroup *
> +static inline struct page_cgroup *
>  get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
>  {
>  	return NULL;
>  }
> +static inline void try_to_shrink_page_cgroup(struct page *page, int order)
> +{
> +	return;
> +}
> +#define PCGRP_SHRINK_ORDER	(MAX_ORDER)
>  #endif
>  #endif
> Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/page_cgroup.c
> +++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> @@ -12,6 +12,7 @@
>   */
>  
>  #include <linux/mm.h>
> +#include <linux/mmzone.h>
>  #include <linux/slab.h>
>  #include <linux/radix-tree.h>
>  #include <linux/memcontrol.h>
> @@ -80,6 +81,7 @@ static void save_result(struct page_cgro
>  	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
>  	pcp->ents[hash].idx = idx;
>  	pcp->ents[hash].base = base;
> +	smp_wmb();

Whenever you add a memory barrier, you should comment on it.

>  	preempt_enable();
>  }
>  
> @@ -156,6 +158,58 @@ out:
>  	return pc;
>  }
>  
> +/* Must be called under zone->lock */
> +void try_to_shrink_page_cgroup(struct page *page, int order)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	int nid = page_to_nid(page);
> +	int idx = pfn >> PCGRP_SHIFT;
> +	int hnum = (PAGE_CGROUP_NR_CACHE - 1);
> +	struct page_cgroup_cache *pcp;
> +	struct page_cgroup_head *head;
> +	struct page_cgroup_root *root;
> +	unsigned long end_pfn;
> +	int cpu;
> +
> +

redundant empty line

> +	root = root_dir[nid];
> +	if (!root || in_interrupt() || (order < PCGRP_SHIFT))
> +		return;
> +
> +	pfn = page_to_pfn(page);
> +	end_pfn = pfn + (1 << order);
> +
> +	while (pfn != end_pfn) {
> +		idx = pfn >> PCGRP_SHIFT;
> +		/* Is this pfn has entry ? */
> +		rcu_read_lock();
> +		head = radix_tree_lookup(&root->root_node, idx);
> +		rcu_read_unlock();
> +		if (!head) {
> +			pfn += (1 << PCGRP_SHIFT);

pfn += PCGRP_SIZE;

> +			continue;
> +		}
> +		/* It's guaranteed that no one access to this pfn/idx
> +		   because there is no reference to this page. */
> +		hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
> +		for_each_online_cpu(cpu) {
> +			pcp = &per_cpu(pcpu_page_cgroup_cache, cpu);
> +			smp_rmb();
> +			if (pcp->ents[hnum].idx == idx)
> +				pcp->ents[hnum].base = NULL;
> +		}
> +		if (spin_trylock(&root->tree_lock)) {
> +			/* radix tree is freed by RCU. so they will not call
> +			   free_pages() right now.*/
> +			radix_tree_delete(&root->root_node, idx);
> +			spin_unlock(&root->tree_lock);
> +			/* We can free this in lazy fashion .*/
> +			free_page_cgroup(head);
> +		}
> +		pfn += (1 << PCGRP_SHIFT);

ditto

> +	}
> +}
> +
>  __init int page_cgroup_init(void)
>  {
>  	int nid;
> Index: mm-2.6.25-rc5-mm1/mm/page_alloc.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/page_alloc.c
> +++ mm-2.6.25-rc5-mm1/mm/page_alloc.c
> @@ -45,6 +45,7 @@
>  #include <linux/fault-inject.h>
>  #include <linux/page-isolation.h>
>  #include <linux/memcontrol.h>
> +#include <linux/page_cgroup.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -463,6 +464,8 @@ static inline void __free_one_page(struc
>  		order++;
>  	}
>  	set_page_order(page, order);
> +	if (order >= PCGRP_SHRINK_ORDER)
> +		try_to_shrink_page_cgroup(page, order);
>  	list_add(&page->lru,
>  		&zone->free_area[order].free_list[migratetype]);
>  	zone->free_area[order].nr_free++;
> 
> --

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/7] memcg: freeing page_cgroup at suitable chance
  2008-03-17  3:10   ` Li Zefan
@ 2008-03-18  1:30     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-18  1:30 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Mon, 17 Mar 2008 12:10:40 +0900
Li Zefan <lizf@cn.fujitsu.com> wrote:

> > +void try_to_shrink_page_cgroup(struct page *page, int order);
> > +
> 
> extern void
> 
ok.

> >  #else
> >  
> > -static struct page_cgroup *
> > +static inline struct page_cgroup *
> >  get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> >  {
> >  	return NULL;
> >  }
> > +static inline void try_to_shrink_page_cgroup(struct page *page, int order)
> > +{
> > +	return;
> > +}
> > +#define PCGRP_SHRINK_ORDER	(MAX_ORDER)
> >  #endif
> >  #endif
> > Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> > ===================================================================
> > --- mm-2.6.25-rc5-mm1.orig/mm/page_cgroup.c
> > +++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> > @@ -12,6 +12,7 @@
> >   */
> >  
> >  #include <linux/mm.h>
> > +#include <linux/mmzone.h>
> >  #include <linux/slab.h>
> >  #include <linux/radix-tree.h>
> >  #include <linux/memcontrol.h>
> > @@ -80,6 +81,7 @@ static void save_result(struct page_cgro
> >  	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> >  	pcp->ents[hash].idx = idx;
> >  	pcp->ents[hash].base = base;
> > +	smp_wmb();
> 
> Whenever you add a memory barrier, you should comment on it.
> 
ok.

> >  	preempt_enable();
> >  }
> >  
> > @@ -156,6 +158,58 @@ out:
> >  	return pc;
> >  }
> >  
> > +/* Must be called under zone->lock */
> > +void try_to_shrink_page_cgroup(struct page *page, int order)
> > +{
> > +	unsigned long pfn = page_to_pfn(page);
> > +	int nid = page_to_nid(page);
> > +	int idx = pfn >> PCGRP_SHIFT;
> > +	int hnum = (PAGE_CGROUP_NR_CACHE - 1);
> > +	struct page_cgroup_cache *pcp;
> > +	struct page_cgroup_head *head;
> > +	struct page_cgroup_root *root;
> > +	unsigned long end_pfn;
> > +	int cpu;
> > +
> > +
> 
> redundant empty line
> 
> > +	root = root_dir[nid];
> > +	if (!root || in_interrupt() || (order < PCGRP_SHIFT))
> > +		return;
> > +
> > +	pfn = page_to_pfn(page);
> > +	end_pfn = pfn + (1 << order);
> > +
> > +	while (pfn != end_pfn) {
> > +		idx = pfn >> PCGRP_SHIFT;
> > +		/* Is this pfn has entry ? */
> > +		rcu_read_lock();
> > +		head = radix_tree_lookup(&root->root_node, idx);
> > +		rcu_read_unlock();
> > +		if (!head) {
> > +			pfn += (1 << PCGRP_SHIFT);
> 
> pfn += PCGRP_SIZE;
> 
ok.

> > +			continue;
> > +		}
> > +		/* It's guaranteed that no one access to this pfn/idx
> > +		   because there is no reference to this page. */
> > +		hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
> > +		for_each_online_cpu(cpu) {
> > +			pcp = &per_cpu(pcpu_page_cgroup_cache, cpu);
> > +			smp_rmb();
> > +			if (pcp->ents[hnum].idx == idx)
> > +				pcp->ents[hnum].base = NULL;
> > +		}
> > +		if (spin_trylock(&root->tree_lock)) {
> > +			/* radix tree is freed by RCU. so they will not call
> > +			   free_pages() right now.*/
> > +			radix_tree_delete(&root->root_node, idx);
> > +			spin_unlock(&root->tree_lock);
> > +			/* We can free this in lazy fashion .*/
> > +			free_page_cgroup(head);
> > +		}
> > +		pfn += (1 << PCGRP_SHIFT);
> 
> ditto
> 
ok.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/7] memcg: freeing page_cgroup at suitable chance
  2008-03-14 10:22 ` [PATCH 7/7] memcg: freeing page_cgroup at suitable chance KAMEZAWA Hiroyuki
  2008-03-17  3:10   ` Li Zefan
@ 2008-03-19 21:33   ` Peter Zijlstra
  2008-03-20  5:07     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2008-03-19 21:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Fri, 2008-03-14 at 19:22 +0900, KAMEZAWA Hiroyuki wrote:
> This patch is for freeing page_cgroup if a chunk of pages are freed.
> 
> How this works 
>  * when the order of free page reaches PCGRP_SHRINK_ORDER, pcgrp is freed.
>    This will be done by RCU.
> 
> I think this works well because
>    - unnecessary freeing will not occur in busy servers.

So we'll OOM instead?

>    - page_cgroup will be removed at necessary point (allocating Hugepage,etc..)
>    - If tons of pages are freed (ex. big file is removed), page_cgroup will
>      be removed.
> 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsuc.com>
> 
> 
>  include/linux/page_cgroup.h |   15 +++++++++++-
>  mm/page_alloc.c             |    3 ++
>  mm/page_cgroup.c            |   54 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 71 insertions(+), 1 deletion(-)
> 
> Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/include/linux/page_cgroup.h
> +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> @@ -39,6 +39,12 @@ DECLARE_PER_CPU(struct page_cgroup_cache
>  #define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
>  #define PCGRP_SIZE	(1 << PCGRP_SHIFT)
>  
> +#if PCGRP_SHIFT + 3 >= MAX_ORDER
> +#define PCGRP_SHRINK_ORDER	(MAX_ORDER - 1)
> +#else
> +#define PCGRP_SHRINK_ORDER	(PCGRP_SHIFT + 3)
> +#endif
> +
>  /*
>   * Lookup and return page_cgroup struct.
>   * returns NULL when
> @@ -70,12 +76,19 @@ get_page_cgroup(struct page *page, gfp_t
>  	return (ret)? ret : __get_page_cgroup(page, gfpmask, allocate);
>  }
>  
> +void try_to_shrink_page_cgroup(struct page *page, int order);
> +
>  #else
>  
> -static struct page_cgroup *
> +static inline struct page_cgroup *
>  get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
>  {
>  	return NULL;
>  }
> +static inline void try_to_shrink_page_cgroup(struct page *page, int order)
> +{
> +	return;
> +}
> +#define PCGRP_SHRINK_ORDER	(MAX_ORDER)
>  #endif
>  #endif
> Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/page_cgroup.c
> +++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> @@ -12,6 +12,7 @@
>   */
>  
>  #include <linux/mm.h>
> +#include <linux/mmzone.h>
>  #include <linux/slab.h>
>  #include <linux/radix-tree.h>
>  #include <linux/memcontrol.h>
> @@ -80,6 +81,7 @@ static void save_result(struct page_cgro
>  	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
>  	pcp->ents[hash].idx = idx;
>  	pcp->ents[hash].base = base;
> +	smp_wmb();

Lacks a comments outlining the race and a pointer to the matching
barrier.

>  	preempt_enable();
>  }
>  
> @@ -156,6 +158,58 @@ out:
>  	return pc;
>  }
>  
> +/* Must be called under zone->lock */
> +void try_to_shrink_page_cgroup(struct page *page, int order)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	int nid = page_to_nid(page);
> +	int idx = pfn >> PCGRP_SHIFT;
> +	int hnum = (PAGE_CGROUP_NR_CACHE - 1);
> +	struct page_cgroup_cache *pcp;
> +	struct page_cgroup_head *head;
> +	struct page_cgroup_root *root;
> +	unsigned long end_pfn;
> +	int cpu;
> +
> +
> +	root = root_dir[nid];
> +	if (!root || in_interrupt() || (order < PCGRP_SHIFT))
> +		return;
> +
> +	pfn = page_to_pfn(page);
> +	end_pfn = pfn + (1 << order);
> +
> +	while (pfn != end_pfn) {
> +		idx = pfn >> PCGRP_SHIFT;
> +		/* Is this pfn has entry ? */
> +		rcu_read_lock();
> +		head = radix_tree_lookup(&root->root_node, idx);
> +		rcu_read_unlock();

What stops head from being freed here

> +		if (!head) {
> +			pfn += (1 << PCGRP_SHIFT);
> +			continue;
> +		}
> +		/* It's guaranteed that no one access to this pfn/idx
> +		   because there is no reference to this page. */
> +		hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
> +		for_each_online_cpu(cpu) {
> +			pcp = &per_cpu(pcpu_page_cgroup_cache, cpu);
> +			smp_rmb();

Another unadorned barrier - presumably the pair for the one above.

> +			if (pcp->ents[hnum].idx == idx)
> +				pcp->ents[hnum].base = NULL;
> +		}

This is rather expensive... but I can't seem to come up with another way
around this. However, would it make sense to place this after, and make
it conditional on the following condition; so that we'll not iterate all
cpus only to find we couldn't free the radix tree item.

> +		if (spin_trylock(&root->tree_lock)) {
> +			/* radix tree is freed by RCU. so they will not call
> +			   free_pages() right now.*/
> +			radix_tree_delete(&root->root_node, idx);
> +			spin_unlock(&root->tree_lock);
> +			/* We can free this in lazy fashion .*/
> +			free_page_cgroup(head);

No RCU based freeing? I'd expected a call_rcu(), otherwise we race with
lookups.

> +		}
> +		pfn += (1 << PCGRP_SHIFT);
> +	}
> +}
> +
>  __init int page_cgroup_init(void)
>  {
>  	int nid;
> Index: mm-2.6.25-rc5-mm1/mm/page_alloc.c
> ===================================================================
> --- mm-2.6.25-rc5-mm1.orig/mm/page_alloc.c
> +++ mm-2.6.25-rc5-mm1/mm/page_alloc.c
> @@ -45,6 +45,7 @@
>  #include <linux/fault-inject.h>
>  #include <linux/page-isolation.h>
>  #include <linux/memcontrol.h>
> +#include <linux/page_cgroup.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -463,6 +464,8 @@ static inline void __free_one_page(struc
>  		order++;
>  	}
>  	set_page_order(page, order);
> +	if (order >= PCGRP_SHRINK_ORDER)
> +		try_to_shrink_page_cgroup(page, order);

So we only shrink if the buddy managed to coalesce the free pages into a
high enough order free page, under a high enough load this might never
happen.

This worries me somewhat, can you at least outline the worst case upper
bound of memory consumption so we can check if this is acceptable?

Also, this is the very hart of the buddy system, doesn't this regress
performance of the page-allocator under certain loads?

>  	list_add(&page->lru,
>  		&zone->free_area[order].free_list[migratetype]);
>  	zone->free_area[order].nr_free++;
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/7] memcg: freeing page_cgroup at suitable chance
  2008-03-19 21:33   ` Peter Zijlstra
@ 2008-03-20  5:07     ` KAMEZAWA Hiroyuki
  2008-03-20  7:55       ` Peter Zijlstra
                         ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-20  5:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Wed, 19 Mar 2008 22:33:19 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2008-03-14 at 19:22 +0900, KAMEZAWA Hiroyuki wrote:
> > This patch is for freeing page_cgroup if a chunk of pages are freed.
> > 
> > How this works 
> >  * when the order of free page reaches PCGRP_SHRINK_ORDER, pcgrp is freed.
> >    This will be done by RCU.
> > 
> > I think this works well because
> >    - unnecessary freeing will not occur in busy servers.
> 
> So we'll OOM instead?
> 
Hmm. I think page_cgroup will not be able to be freed under OOM because
pages, which are accounted, are not able to be freed.
"Free If Unnecessary" is what we can here.

My purpose is just for making wise use of memory a bit more.

I'm now considering to adjust page_cgroup's chunk order to be smaller than
page-migration-type. Then, we can guarantee that

We don't allocate page_cgroup against
  - kernel pages (pages in not-movable migrate type)
  - huge pages



> >    - page_cgroup will be removed at necessary point (allocating Hugepage,etc..)
> >    - If tons of pages are freed (ex. big file is removed), page_cgroup will
> >      be removed.
> > 
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsuc.com>
> > 
> > 
> >  include/linux/page_cgroup.h |   15 +++++++++++-
> >  mm/page_alloc.c             |    3 ++
> >  mm/page_cgroup.c            |   54 ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 71 insertions(+), 1 deletion(-)
> > 
> > Index: mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> > ===================================================================
> > --- mm-2.6.25-rc5-mm1.orig/include/linux/page_cgroup.h
> > +++ mm-2.6.25-rc5-mm1/include/linux/page_cgroup.h
> > @@ -39,6 +39,12 @@ DECLARE_PER_CPU(struct page_cgroup_cache
> >  #define PCGRP_SHIFT	(CONFIG_CGROUP_PAGE_CGROUP_ORDER)
> >  #define PCGRP_SIZE	(1 << PCGRP_SHIFT)
> >  
> > +#if PCGRP_SHIFT + 3 >= MAX_ORDER
> > +#define PCGRP_SHRINK_ORDER	(MAX_ORDER - 1)
> > +#else
> > +#define PCGRP_SHRINK_ORDER	(PCGRP_SHIFT + 3)
> > +#endif
> > +
> >  /*
> >   * Lookup and return page_cgroup struct.
> >   * returns NULL when
> > @@ -70,12 +76,19 @@ get_page_cgroup(struct page *page, gfp_t
> >  	return (ret)? ret : __get_page_cgroup(page, gfpmask, allocate);
> >  }
> >  
> > +void try_to_shrink_page_cgroup(struct page *page, int order);
> > +
> >  #else
> >  
> > -static struct page_cgroup *
> > +static inline struct page_cgroup *
> >  get_page_cgroup(struct page *page, gfp_t gfpmask, bool allocate)
> >  {
> >  	return NULL;
> >  }
> > +static inline void try_to_shrink_page_cgroup(struct page *page, int order)
> > +{
> > +	return;
> > +}
> > +#define PCGRP_SHRINK_ORDER	(MAX_ORDER)
> >  #endif
> >  #endif
> > Index: mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> > ===================================================================
> > --- mm-2.6.25-rc5-mm1.orig/mm/page_cgroup.c
> > +++ mm-2.6.25-rc5-mm1/mm/page_cgroup.c
> > @@ -12,6 +12,7 @@
> >   */
> >  
> >  #include <linux/mm.h>
> > +#include <linux/mmzone.h>
> >  #include <linux/slab.h>
> >  #include <linux/radix-tree.h>
> >  #include <linux/memcontrol.h>
> > @@ -80,6 +81,7 @@ static void save_result(struct page_cgro
> >  	pcp = &__get_cpu_var(pcpu_page_cgroup_cache);
> >  	pcp->ents[hash].idx = idx;
> >  	pcp->ents[hash].base = base;
> > +	smp_wmb();
> 
> Lacks a comments outlining the race and a pointer to the matching
> barrier.
> 
ok, will fix.


> >  	preempt_enable();
> >  }
> >  
> > @@ -156,6 +158,58 @@ out:
> >  	return pc;
> >  }
> >  
> > +/* Must be called under zone->lock */
> > +void try_to_shrink_page_cgroup(struct page *page, int order)
> > +{
> > +	unsigned long pfn = page_to_pfn(page);
> > +	int nid = page_to_nid(page);
> > +	int idx = pfn >> PCGRP_SHIFT;
> > +	int hnum = (PAGE_CGROUP_NR_CACHE - 1);
> > +	struct page_cgroup_cache *pcp;
> > +	struct page_cgroup_head *head;
> > +	struct page_cgroup_root *root;
> > +	unsigned long end_pfn;
> > +	int cpu;
> > +
> > +
> > +	root = root_dir[nid];
> > +	if (!root || in_interrupt() || (order < PCGRP_SHIFT))
> > +		return;
> > +
> > +	pfn = page_to_pfn(page);
> > +	end_pfn = pfn + (1 << order);
> > +
> > +	while (pfn != end_pfn) {
> > +		idx = pfn >> PCGRP_SHIFT;
> > +		/* Is this pfn has entry ? */
> > +		rcu_read_lock();
> > +		head = radix_tree_lookup(&root->root_node, idx);
> > +		rcu_read_unlock();
> 
> What stops head from being freed here
> 
We have zone->lock here.
I'll add comments.

> > +		if (!head) {
> > +			pfn += (1 << PCGRP_SHIFT);
> > +			continue;
> > +		}
> > +		/* It's guaranteed that no one access to this pfn/idx
> > +		   because there is no reference to this page. */
> > +		hnum = (idx) & (PAGE_CGROUP_NR_CACHE - 1);
> > +		for_each_online_cpu(cpu) {
> > +			pcp = &per_cpu(pcpu_page_cgroup_cache, cpu);
> > +			smp_rmb();
> 
> Another unadorned barrier - presumably the pair for the one above.
> 
will fix.

> > +			if (pcp->ents[hnum].idx == idx)
> > +				pcp->ents[hnum].base = NULL;
> > +		}
> 
> This is rather expensive... but I can't seem to come up with another way
> around this. However, would it make sense to place this after, and make
> it conditional on the following condition; so that we'll not iterate all
> cpus only to find we couldn't free the radix tree item.
> 
ok. new version has some improvement on this.


> > +		if (spin_trylock(&root->tree_lock)) {
> > +			/* radix tree is freed by RCU. so they will not call
> > +			   free_pages() right now.*/
> > +			radix_tree_delete(&root->root_node, idx);
> > +			spin_unlock(&root->tree_lock);
> > +			/* We can free this in lazy fashion .*/
> > +			free_page_cgroup(head);
> 
> No RCU based freeing? I'd expected a call_rcu(), otherwise we race with
> lookups.
> 
SLAB itself is SLAB_DESTROY_BY_RCU. I'll add comments here.


> > +		}
> > +		pfn += (1 << PCGRP_SHIFT);
> > +	}
> > +}
> > +
> >  __init int page_cgroup_init(void)
> >  {
> >  	int nid;
> > Index: mm-2.6.25-rc5-mm1/mm/page_alloc.c
> > ===================================================================
> > --- mm-2.6.25-rc5-mm1.orig/mm/page_alloc.c
> > +++ mm-2.6.25-rc5-mm1/mm/page_alloc.c
> > @@ -45,6 +45,7 @@
> >  #include <linux/fault-inject.h>
> >  #include <linux/page-isolation.h>
> >  #include <linux/memcontrol.h>
> > +#include <linux/page_cgroup.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include <asm/div64.h>
> > @@ -463,6 +464,8 @@ static inline void __free_one_page(struc
> >  		order++;
> >  	}
> >  	set_page_order(page, order);
> > +	if (order >= PCGRP_SHRINK_ORDER)
> > +		try_to_shrink_page_cgroup(page, order);
> 
> So we only shrink if the buddy managed to coalesce the free pages into a
> high enough order free page, under a high enough load this might never
> happen.
> 
yes.

> This worries me somewhat, can you at least outline the worst case upper
> bound of memory consumption so we can check if this is acceptable?
> 
I'm now considering to make this to be not configurable and to change
this order to be migrate_type_order.

> Also, this is the very hart of the buddy system, doesn't this regress
> performance of the page-allocator under certain loads?
> 
I'm afraid of it, too. Maybe "removing tons of pages at once" workload
will be affected. But current memory resource controller calls tons of
kfree(). I doubt we can see the cost of this behavior.

I'm looking for better knobs to control this behavior.
(If much amounts of memory is free, we don't have to free this immediately.)

Thanks,
-Kame





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/7] memcg: freeing page_cgroup at suitable chance
  2008-03-20  5:07     ` KAMEZAWA Hiroyuki
@ 2008-03-20  7:55       ` Peter Zijlstra
  2008-03-20 14:49       ` kamezawa.hiroyu
  2008-03-20 16:04       ` kamezawa.hiroyu
  2 siblings, 0 replies; 53+ messages in thread
From: Peter Zijlstra @ 2008-03-20  7:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com, xemul,
	hugh@veritas.com

On Thu, 2008-03-20 at 14:07 +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 19 Mar 2008 22:33:19 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > > +		if (spin_trylock(&root->tree_lock)) {
> > > +			/* radix tree is freed by RCU. so they will not call
> > > +			   free_pages() right now.*/
> > > +			radix_tree_delete(&root->root_node, idx);
> > > +			spin_unlock(&root->tree_lock);
> > > +			/* We can free this in lazy fashion .*/
> > > +			free_page_cgroup(head);
> > 
> > No RCU based freeing? I'd expected a call_rcu(), otherwise we race with
> > lookups.
> > 
> SLAB itself is SLAB_DESTROY_BY_RCU. I'll add comments here.

SLAB_DESTROYED_BY_RCU is not enough, that will just ensure the slab does
not get invalid, but it does not guarantee the objects will not be
re-used. So you still have a race here, the lookup can see another
object than it went for.

Consider:


	A				B

rcu_read_lock()
				spin_lock(&tree_lock)
				obj = radix_tree_remove(&tree, idx)
				spin_unlock(&tree_lock)

				kmem_cache_free(s, obj)

obj = radix_tree_lookup(&tree, idx)
rcu_read_unlock()

				obj2 = kmem_cache_alloc(s)
				spin_lock(&tree_lock)
				radix_tree_insert(&tree_lock, idx2)
				spin_unlock(&tree_lock)


return obj->data


If B's obj2 == obj (possible), then A will return the object for idx2,
while it asked for idx.

So A needs an object validate and retry loop, or you need to RCU-free
objects and extend the rcu_read_lock() range to cover obj's usage.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Re: [PATCH 7/7] memcg: freeing page_cgroup at suitable chance
  2008-03-20  5:07     ` KAMEZAWA Hiroyuki
  2008-03-20  7:55       ` Peter Zijlstra
@ 2008-03-20 14:49       ` kamezawa.hiroyu
  2008-03-20 16:04       ` kamezawa.hiroyu
  2 siblings, 0 replies; 53+ messages in thread
From: kamezawa.hiroyu @ 2008-03-20 14:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: KAMEZAWA Hiroyuki, linux-mm, balbir, xemul, hugh

>On Thu, 2008-03-20 at 14:07 +0900, KAMEZAWA Hiroyuki wrote:
>> On Wed, 19 Mar 2008 22:33:19 +0100
>> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> 
>> > > +		if (spin_trylock(&root->tree_lock)) {
>> > > +			/* radix tree is freed by RCU. so they will not call
>> > > +			   free_pages() right now.*/
>> > > +			radix_tree_delete(&root->root_node, idx);
>> > > +			spin_unlock(&root->tree_lock);
>> > > +			/* We can free this in lazy fashion .*/
>> > > +			free_page_cgroup(head);
>> > 
>> > No RCU based freeing? I'd expected a call_rcu(), otherwise we race with
>> > lookups.
>> > 
>> SLAB itself is SLAB_DESTROY_BY_RCU. I'll add comments here.
>
>SLAB_DESTROYED_BY_RCU is not enough, that will just ensure the slab does
>not get invalid, but it does not guarantee the objects will not be
>re-used. So you still have a race here, the lookup can see another
>object than it went for.
>
>Consider:
>
>
>	A				B
>
>rcu_read_lock()
>				spin_lock(&tree_lock)
>				obj = radix_tree_remove(&tree, idx)
>				spin_unlock(&tree_lock)
>
>				kmem_cache_free(s, obj)
>
>obj = radix_tree_lookup(&tree, idx)
>rcu_read_unlock()
>
>				obj2 = kmem_cache_alloc(s)
>				spin_lock(&tree_lock)
>				radix_tree_insert(&tree_lock, idx2)
>				spin_unlock(&tree_lock)
>
>
>return obj->data
>
>
>If B's obj2 == obj (possible), then A will return the object for idx2,
>while it asked for idx.
>
>So A needs an object validate and retry loop, or you need to RCU-free
>objects and extend the rcu_read_lock() range to cover obj's usage.
>
ok. thank you for pointing out and explaining kindly.

I'll fix it.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Re: Re: [PATCH 7/7] memcg: freeing page_cgroup at suitable chance
  2008-03-20  5:07     ` KAMEZAWA Hiroyuki
  2008-03-20  7:55       ` Peter Zijlstra
  2008-03-20 14:49       ` kamezawa.hiroyu
@ 2008-03-20 16:04       ` kamezawa.hiroyu
  2008-03-20 16:09         ` Peter Zijlstra
  2008-03-20 16:15         ` kamezawa.hiroyu
  2 siblings, 2 replies; 53+ messages in thread
From: kamezawa.hiroyu @ 2008-03-20 16:04 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: Peter Zijlstra, linux-mm, balbir, xemul, hugh

>>On Thu, 2008-03-20 at 14:07 +0900, KAMEZAWA Hiroyuki wrote:
>>> On Wed, 19 Mar 2008 22:33:19 +0100
>>> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>>> 
>>> > > +		if (spin_trylock(&root->tree_lock)) {
>>> > > +			/* radix tree is freed by RCU. so they will not call
>>> > > +			   free_pages() right now.*/
>>> > > +			radix_tree_delete(&root->root_node, idx);
>>> > > +			spin_unlock(&root->tree_lock);
>>> > > +			/* We can free this in lazy fashion .*/
>>> > > +			free_page_cgroup(head);
>>> > 
>>> > No RCU based freeing? I'd expected a call_rcu(), otherwise we race with
>>> > lookups.
>>> > 
>>> SLAB itself is SLAB_DESTROY_BY_RCU. I'll add comments here.
>>
>>SLAB_DESTROYED_BY_RCU is not enough, that will just ensure the slab does
>>not get invalid, but it does not guarantee the objects will not be
>>re-used. So you still have a race here, the lookup can see another
>>object than it went for.
>>
>>Consider:
>>
>>
>>	A				B
>>
>>rcu_read_lock()
>>				spin_lock(&tree_lock)
>>				obj = radix_tree_remove(&tree, idx)
>>				spin_unlock(&tree_lock)
>>
>>				kmem_cache_free(s, obj)
>>
>>obj = radix_tree_lookup(&tree, idx)
>>rcu_read_unlock()
>>
>>				obj2 = kmem_cache_alloc(s)
>>				spin_lock(&tree_lock)
>>				radix_tree_insert(&tree_lock, idx2)
>>				spin_unlock(&tree_lock)
>>
>>
>>return obj->data
>>
>>
>>If B's obj2 == obj (possible), then A will return the object for idx2,
>>while it asked for idx.
>>
>>So A needs an object validate and retry loop, or you need to RCU-free
>>objects and extend the rcu_read_lock() range to cover obj's usage.
>>
>ok. thank you for pointing out and explaining kindly.
>
Hmm, I have to add texts for myself...

I assume that page-cgroup-lookup against *free* pages never occur.

When we free entries of [pfn, pfn + 1 << order), [pfn, pfn + 1 << order)
is in freelist and we have zone->lock. Lookup against [pfn, pfn + 1 << order) 
cannot happen against freed pages.
Then, looking up and freeing an idx cannot happen at the same time.

radix_tree_lookup() can look up wrong entry in this case ?


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Re: Re: [PATCH 7/7] memcg: freeing page_cgroup at suitable chance
  2008-03-20 16:04       ` kamezawa.hiroyu
@ 2008-03-20 16:09         ` Peter Zijlstra
  2008-03-20 16:15         ` kamezawa.hiroyu
  1 sibling, 0 replies; 53+ messages in thread
From: Peter Zijlstra @ 2008-03-20 16:09 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: linux-mm, balbir, xemul, hugh

On Fri, 2008-03-21 at 01:04 +0900, kamezawa.hiroyu@jp.fujitsu.com wrote:
> >>On Thu, 2008-03-20 at 14:07 +0900, KAMEZAWA Hiroyuki wrote:
> >>> On Wed, 19 Mar 2008 22:33:19 +0100
> >>> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >>> 
> >>> > > +		if (spin_trylock(&root->tree_lock)) {
> >>> > > +			/* radix tree is freed by RCU. so they will not call
> >>> > > +			   free_pages() right now.*/
> >>> > > +			radix_tree_delete(&root->root_node, idx);
> >>> > > +			spin_unlock(&root->tree_lock);
> >>> > > +			/* We can free this in lazy fashion .*/
> >>> > > +			free_page_cgroup(head);
> >>> > 
> >>> > No RCU based freeing? I'd expected a call_rcu(), otherwise we race with
> >>> > lookups.
> >>> > 
> >>> SLAB itself is SLAB_DESTROY_BY_RCU. I'll add comments here.
> >>
> >>SLAB_DESTROYED_BY_RCU is not enough, that will just ensure the slab does
> >>not get invalid, but it does not guarantee the objects will not be
> >>re-used. So you still have a race here, the lookup can see another
> >>object than it went for.
> >>
> >>Consider:
> >>
> >>
> >>	A				B
> >>
> >>rcu_read_lock()

> >>obj = radix_tree_lookup(&tree, idx)

> >>				spin_lock(&tree_lock)
> >>				obj = radix_tree_remove(&tree, idx)
> >>				spin_unlock(&tree_lock)
> >>
> >>				kmem_cache_free(s, obj)
> >>

> >>rcu_read_unlock()
> >>
> >>				obj2 = kmem_cache_alloc(s)
> >>				spin_lock(&tree_lock)
> >>				radix_tree_insert(&tree_lock, idx2)
> >>				spin_unlock(&tree_lock)
> >>
> >>
> >>return obj->data
> >>
> >>
> >>If B's obj2 == obj (possible), then A will return the object for idx2,
> >>while it asked for idx.
> >>
> >>So A needs an object validate and retry loop, or you need to RCU-free
> >>objects and extend the rcu_read_lock() range to cover obj's usage.
> >>
> >ok. thank you for pointing out and explaining kindly.
> >
> Hmm, I have to add texts for myself...
> 
> I assume that page-cgroup-lookup against *free* pages never occur.
> 
> When we free entries of [pfn, pfn + 1 << order), [pfn, pfn + 1 << order)
> is in freelist and we have zone->lock. Lookup against [pfn, pfn + 1 << order) 
> cannot happen against freed pages.
> Then, looking up and freeing an idx cannot happen at the same time.
> 
> radix_tree_lookup() can look up wrong entry in this case ?
 
You're right, my bad.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Re: Re: Re: [PATCH 7/7] memcg: freeing page_cgroup at suitable chance
  2008-03-20 16:04       ` kamezawa.hiroyu
  2008-03-20 16:09         ` Peter Zijlstra
@ 2008-03-20 16:15         ` kamezawa.hiroyu
  1 sibling, 0 replies; 53+ messages in thread
From: kamezawa.hiroyu @ 2008-03-20 16:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kamezawa.hiroyu, linux-mm, balbir, xemul, hugh

>> When we free entries of [pfn, pfn + 1 << order), [pfn, pfn + 1 << order)
>> is in freelist and we have zone->lock. Lookup against [pfn, pfn + 1 << orde
r) 
>> cannot happen against freed pages.
>> Then, looking up and freeing an idx cannot happen at the same time.
>> 
>> radix_tree_lookup() can look up wrong entry in this case ?
> 
>You're right, my bad.
>
your text was very much help for understanding my own codes again...
I'll add enough texts. It seems this use of radix-tree is tricky.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 0/7] memcg: radix-tree page_cgroup
  2008-03-14  9:59 [PATCH 0/7] memcg: radix-tree page_cgroup KAMEZAWA Hiroyuki
                   ` (6 preceding siblings ...)
  2008-03-14 10:22 ` [PATCH 7/7] memcg: freeing page_cgroup at suitable chance KAMEZAWA Hiroyuki
@ 2008-03-15  6:15 ` Balbir Singh
  7 siblings, 0 replies; 53+ messages in thread
From: Balbir Singh @ 2008-03-15  6:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm@kvack.org, xemul, hugh@veritas.com

KAMEZAWA Hiroyuki wrote:
> This is a patch set for implemening page_cgroup under radix-tree.
> against 2.6.25-rc5-mm1.

Hi, KAMEZAWA-San,

I am building and applying all the patches one-by-one (just started). I'll get
back soon. Thanks for looking into this

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2008-03-20 16:15 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-03-14  9:59 [PATCH 0/7] memcg: radix-tree page_cgroup KAMEZAWA Hiroyuki
2008-03-14 10:03 ` [PATCH 1/7] re-define page_cgroup KAMEZAWA Hiroyuki
2008-03-16 14:15   ` Balbir Singh
2008-03-18  1:10     ` KAMEZAWA Hiroyuki
2008-03-17  0:21   ` Li Zefan
2008-03-18  1:12     ` KAMEZAWA Hiroyuki
2008-03-17  2:07   ` Li Zefan
2008-03-18  1:11     ` KAMEZAWA Hiroyuki
2008-03-14 10:06 ` [PATCH 2/7] charge/uncharge KAMEZAWA Hiroyuki
2008-03-17  1:46   ` Balbir Singh
2008-03-18  1:14     ` KAMEZAWA Hiroyuki
2008-03-17  2:26   ` Li Zefan
2008-03-18  1:15     ` KAMEZAWA Hiroyuki
2008-03-14 10:07 ` [PATCH 3/7] memcg: move_lists KAMEZAWA Hiroyuki
2008-03-18 16:44   ` Balbir Singh
2008-03-19  2:34     ` KAMEZAWA Hiroyuki
2008-03-14 10:15 ` [PATCH 4/7] memcg: page migration KAMEZAWA Hiroyuki
2008-03-17  2:36   ` Li Zefan
2008-03-18  1:17     ` KAMEZAWA Hiroyuki
2008-03-18 18:11   ` Balbir Singh
2008-03-19  2:44     ` KAMEZAWA Hiroyuki
2008-03-14 10:17 ` [PATCH 5/7] radix-tree page cgroup KAMEZAWA Hiroyuki
2008-03-17  2:56   ` Li Zefan
2008-03-17  3:26     ` Li Zefan
2008-03-18  1:18       ` KAMEZAWA Hiroyuki
2008-03-18  1:23     ` KAMEZAWA Hiroyuki
2008-03-19  2:05   ` Balbir Singh
2008-03-19  2:51     ` KAMEZAWA Hiroyuki
2008-03-19  3:14   ` Balbir Singh
2008-03-19  3:24     ` KAMEZAWA Hiroyuki
2008-03-19 21:11   ` Peter Zijlstra
2008-03-20  4:45     ` KAMEZAWA Hiroyuki
2008-03-20  5:09       ` KAMEZAWA Hiroyuki
2008-03-14 10:18 ` [PATCH 6/7] memcg: speed up by percpu KAMEZAWA Hiroyuki
2008-03-17  3:03   ` Li Zefan
2008-03-18  1:25     ` KAMEZAWA Hiroyuki
2008-03-18 23:55       ` Li Zefan
2008-03-19  2:51         ` KAMEZAWA Hiroyuki
2008-03-19 21:19   ` Peter Zijlstra
2008-03-19 21:41     ` Peter Zijlstra
2008-03-20  9:08       ` Andy Whitcroft
2008-03-20  4:46     ` KAMEZAWA Hiroyuki
2008-03-14 10:22 ` [PATCH 7/7] memcg: freeing page_cgroup at suitable chance KAMEZAWA Hiroyuki
2008-03-17  3:10   ` Li Zefan
2008-03-18  1:30     ` KAMEZAWA Hiroyuki
2008-03-19 21:33   ` Peter Zijlstra
2008-03-20  5:07     ` KAMEZAWA Hiroyuki
2008-03-20  7:55       ` Peter Zijlstra
2008-03-20 14:49       ` kamezawa.hiroyu
2008-03-20 16:04       ` kamezawa.hiroyu
2008-03-20 16:09         ` Peter Zijlstra
2008-03-20 16:15         ` kamezawa.hiroyu
2008-03-15  6:15 ` [PATCH 0/7] memcg: radix-tree page_cgroup Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).