From: Li Zefan <lizf@cn.fujitsu.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
"balbir@linux.vnet.ibm.com" <balbir@linux.vnet.ibm.com>,
a.p.zijlstra@chello.nl
Subject: Re: [-mm] [PATCH 1/4] memcg : radix-tree page_cgroup
Date: Fri, 28 Mar 2008 08:55:01 +0800 [thread overview]
Message-ID: <47EC41E5.7090901@cn.fujitsu.com> (raw)
In-Reply-To: <20080327174757.6ea57ea4.kamezawa.hiroyu@jp.fujitsu.com>
KAMEZAWA Hiroyuki wrote:
> This patch implements radixt-tree based page cgroup.
>
> This patch does
> * add radix-tree based page_cgroup look up subsystem.page_cgroup_init
> * remove bit_spin_lock used by page_cgroup.
>
> changes:
>
> Before patch
> * struct page had pointer to page_cgroup. Then, relationship between objects
> was pfn <-> struct page <-> struct page_cgroup
> * (spin) lock for page_cgroup was in struct page.
> * page_cgroup->refcnt is incremented before charge is done.
> * page migration does complicated page_cgroup migration under locks.
>
> After patch
> * struct page has no pointer to page_cgroup. Relationship between objects
> is struct page <-> pfn <-> struct page_cgroup -> struct page,
> * page_cgroup has its own spin lock.
> * page_cgroup->refcnt is incremented after charge is done.
> * page migration accounts a new page before migration. By this, we can
> avoid complicated locks.
>
> tested on ia64/NUMA, x86_64/SMP.
>
> Changelog v1 -> v2:
> * create a folded patch. maybe good for bysect.
> * removed special handling codes for new pages under migration
> Added PG_LRU check to force_empty.
> * reflected comments.
> * Added comments in the head of page_cgroup.c
> * order of page_cgroup is automatically calculated.
> * fixed handling of root_node[] entries in page_cgroup_init().
> * rewrite init_page_cgroup_head() to do minimum work.
> * fixed N_NORMAL_MEMORY handling.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
At a first glance, some typos and small things, see below ;)
>
> include/linux/memcontrol.h | 17 --
> include/linux/mm_types.h | 3
> include/linux/page_cgroup.h | 56 +++++++
> mm/Makefile | 2
> mm/memcontrol.c | 332 ++++++++++++++++----------------------------
> mm/migrate.c | 22 +-
> mm/page_alloc.c | 8 -
> mm/page_cgroup.c | 260 ++++++++++++++++++++++++++++++++++
> 8 files changed, 463 insertions(+), 237 deletions(-)
>
> Index: linux-2.6.25-rc5-mm1-k/include/linux/page_cgroup.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.25-rc5-mm1-k/include/linux/page_cgroup.h
> @@ -0,0 +1,56 @@
> +#ifndef __LINUX_PAGE_CGROUP_H
> +#define __LINUX_PAGE_CGROUP_H
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * page_cgroup is yet another mem_map structure for accounting usage.
Not fix yes:
extra space ^^
> + * but, unlike mem_map, allocated on demand for accounted pages.
> + * see also memcontrol.h
> + * In nature, this consumes much amount of memory.
> + */
> +
> +struct mem_cgroup;
> +
> +struct page_cgroup {
> + spinlock_t lock; /* lock for all members */
> + int refcnt; /* reference count */
> + struct mem_cgroup *mem_cgroup; /* current cgroup subsys */
> + struct list_head lru; /* for per cgroup LRU */
> + int flags; /* See below */
> + struct page *page; /* the page this accounts for*/
> +};
> +
> +/* flags */
> +#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache. */
> +#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* is on active list */
> +
> +/*
> + * look up page_cgroup. returns NULL if not exists.
> + */
> +extern struct page_cgroup *get_page_cgroup(struct page *page);
> +
> +
> +/*
> + * look up page_cgroup, allocate new one if it doesn't exist.
> + * Return value is
> + * 1. page_cgroup, at success.
> + * 2. -EXXXXX, at failure.
> + * 3. NULL, at boot.
> + */
> +extern struct page_cgroup *
> +get_alloc_page_cgroup(struct page *page, gfp_t gfpmask);
> +
> +#else
> +
> +static inline struct page_cgroup *get_page_cgroup(struct page *page)
> +{
> + return NULL;
> +}
> +
> +static inline struct page_cgroup *
> +get_alloc_page_cgroup(struct page *page, gfp_t gfpmask)
> +{
> + return NULL;
> +}
> +#endif
> +#endif
> Index: linux-2.6.25-rc5-mm1-k/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.25-rc5-mm1-k.orig/mm/memcontrol.c
> +++ linux-2.6.25-rc5-mm1-k/mm/memcontrol.c
> @@ -30,6 +30,7 @@
> #include <linux/spinlock.h>
> #include <linux/fs.h>
> #include <linux/seq_file.h>
> +#include <linux/page_cgroup.h>
>
> #include <asm/uaccess.h>
>
> @@ -92,7 +93,7 @@ struct mem_cgroup_per_zone {
> /*
> * spin_lock to protect the per cgroup LRU
> */
> - spinlock_t lru_lock;
> + spinlock_t lru_lock; /* irq should be off. */
> struct list_head active_list;
> struct list_head inactive_list;
> unsigned long count[NR_MEM_CGROUP_ZSTAT];
> @@ -139,33 +140,6 @@ struct mem_cgroup {
> };
> static struct mem_cgroup init_mem_cgroup;
>
> -/*
> - * We use the lower bit of the page->page_cgroup pointer as a bit spin
> - * lock. We need to ensure that page->page_cgroup is at least two
> - * byte aligned (based on comments from Nick Piggin). But since
> - * bit_spin_lock doesn't actually set that lock bit in a non-debug
> - * uniprocessor kernel, we should avoid setting it here too.
> - */
> -#define PAGE_CGROUP_LOCK_BIT 0x0
> -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> -#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
> -#else
> -#define PAGE_CGROUP_LOCK 0x0
> -#endif
> -
> -/*
> - * A page_cgroup page is associated with every page descriptor. The
> - * page_cgroup helps us identify information about the cgroup
> - */
> -struct page_cgroup {
> - struct list_head lru; /* per cgroup LRU list */
> - struct page *page;
> - struct mem_cgroup *mem_cgroup;
> - int ref_cnt; /* cached, mapped, migrating */
> - int flags;
> -};
> -#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
> -#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
>
> static int page_cgroup_nid(struct page_cgroup *pc)
> {
> @@ -256,37 +230,6 @@ void mm_free_cgroup(struct mm_struct *mm
> css_put(&mm->mem_cgroup->css);
> }
>
> -static inline int page_cgroup_locked(struct page *page)
> -{
> - return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
> -{
> - VM_BUG_ON(!page_cgroup_locked(page));
> - page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
> -}
> -
> -struct page_cgroup *page_get_page_cgroup(struct page *page)
> -{
> - return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
> -}
> -
> -static void lock_page_cgroup(struct page *page)
> -{
> - bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static int try_lock_page_cgroup(struct page *page)
> -{
> - return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static void unlock_page_cgroup(struct page *page)
> -{
> - bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> static void __mem_cgroup_remove_list(struct page_cgroup *pc)
> {
> int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
> @@ -356,6 +299,10 @@ void mem_cgroup_move_lists(struct page *
> struct mem_cgroup_per_zone *mz;
> unsigned long flags;
>
> + /* This GFP will be ignored..*/
> + pc = get_page_cgroup(page);
> + if (!pc)
> + return;
> /*
> * We cannot lock_page_cgroup while holding zone's lru_lock,
> * because other holders of lock_page_cgroup can be interrupted
> @@ -363,17 +310,15 @@ void mem_cgroup_move_lists(struct page *
> * safely get to page_cgroup without it, so just try_lock it:
> * mem_cgroup_isolate_pages allows for page left on wrong list.
> */
> - if (!try_lock_page_cgroup(page))
> + if (!spin_trylock_irqsave(&pc->lock, flags))
> return;
> -
> - pc = page_get_page_cgroup(page);
> - if (pc) {
> + if (pc->refcnt) {
> mz = page_cgroup_zoneinfo(pc);
> - spin_lock_irqsave(&mz->lru_lock, flags);
> + spin_lock(&mz->lru_lock);
> __mem_cgroup_move_lists(pc, active);
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> + spin_unlock(&mz->lru_lock);
> }
> - unlock_page_cgroup(page);
> + spin_unlock_irqrestore(&pc->lock, flags);
> }
>
> /*
> @@ -525,7 +470,8 @@ unsigned long mem_cgroup_isolate_pages(u
> * < 0 if the cgroup is over its limit
> */
> static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> - gfp_t gfp_mask, enum charge_type ctype)
> + gfp_t gfp_mask, enum charge_type ctype,
> + struct mem_cgroup *memcg)
> {
> struct mem_cgroup *mem;
> struct page_cgroup *pc;
> @@ -536,33 +482,23 @@ static int mem_cgroup_charge_common(stru
> if (mem_cgroup_subsys.disabled)
> return 0;
>
> + pc = get_alloc_page_cgroup(page, gfp_mask);
> + /* Before kamalloc initialization, get_page_cgroup can retrun NULL */
kmalloc return
> + if (unlikely(!pc || IS_ERR(pc)))
> + return PTR_ERR(pc);
> +
> + spin_lock_irqsave(&pc->lock, flags);
> /*
> - * Should page_cgroup's go to their own slab?
> - * One could optimize the performance of the charging routine
> - * by saving a bit in the page_flags and using it as a lock
> - * to see if the cgroup page already has a page_cgroup associated
> - * with it
> - */
> -retry:
> - lock_page_cgroup(page);
> - pc = page_get_page_cgroup(page);
> - /*
> - * The page_cgroup exists and
> - * the page has already been accounted.
> + * Has the page already been accounted ?
> */
> - if (pc) {
> - VM_BUG_ON(pc->page != page);
> - VM_BUG_ON(pc->ref_cnt <= 0);
> -
> - pc->ref_cnt++;
> - unlock_page_cgroup(page);
> - goto done;
> + if (pc->refcnt > 0) {
> + pc->refcnt++;
> + spin_unlock_irqrestore(&pc->lock, flags);
> + goto success;
> }
> - unlock_page_cgroup(page);
> + spin_unlock_irqrestore(&pc->lock, flags);
>
> - pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
> - if (pc == NULL)
> - goto err;
> + /* Note: *new* pc's refcnt is still 0 here. */
>
> /*
> * We always charge the cgroup the mm_struct belongs to.
> @@ -570,20 +506,24 @@ retry:
> * thread group leader migrates. It's possible that mm is not
> * set, if so charge the init_mm (happens for pagecache usage).
> */
> - if (!mm)
> - mm = &init_mm;
> -
> - rcu_read_lock();
> - mem = rcu_dereference(mm->mem_cgroup);
> - /*
> - * For every charge from the cgroup, increment reference count
> - */
> - css_get(&mem->css);
> - rcu_read_unlock();
> + if (memcg) {
> + mem = memcg;
> + css_get(&mem->css);
> + } else {
> + if (!mm)
> + mm = &init_mm;
> + rcu_read_lock();
> + mem = rcu_dereference(mm->mem_cgroup);
> + /*
> + * For every charge from the cgroup, increment reference count
> + */
> + css_get(&mem->css);
> + rcu_read_unlock();
> + }
>
> while (res_counter_charge(&mem->res, PAGE_SIZE)) {
> if (!(gfp_mask & __GFP_WAIT))
> - goto out;
> + goto nomem;
>
> if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> continue;
> @@ -600,52 +540,51 @@ retry:
>
> if (!nr_retries--) {
> mem_cgroup_out_of_memory(mem, gfp_mask);
> - goto out;
> + goto nomem;
> }
> congestion_wait(WRITE, HZ/10);
> }
> -
> - pc->ref_cnt = 1;
> - pc->mem_cgroup = mem;
> - pc->page = page;
> - pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> - if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
> - pc->flags |= PAGE_CGROUP_FLAG_CACHE;
> -
> - lock_page_cgroup(page);
> - if (page_get_page_cgroup(page)) {
> - unlock_page_cgroup(page);
> - /*
> - * Another charge has been added to this page already.
> - * We take lock_page_cgroup(page) again and read
> - * page->cgroup, increment refcnt.... just retry is OK.
> - */
> + /*
> + * We have to acquire 2 spinlocks.
> + */
> + spin_lock_irqsave(&pc->lock, flags);
> + /* Is anyone charged ? */
> + if (unlikely(pc->refcnt)) {
> + /* Someone charged this page while we released the lock */
> + pc->refcnt++;
> + spin_unlock_irqrestore(&pc->lock, flags);
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> css_put(&mem->css);
> - kfree(pc);
> - goto retry;
> + goto success;
> }
> - page_assign_page_cgroup(page, pc);
> + /* Anyone doesn't touch this. */
> + VM_BUG_ON(pc->mem_cgroup);
> +
> + if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
> + pc->flags = PAGE_CGROUP_FLAG_ACTIVE | PAGE_CGROUP_FLAG_CACHE;
> + else
> + pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> + pc->refcnt = 1;
> + pc->mem_cgroup = mem;
>
> mz = page_cgroup_zoneinfo(pc);
> - spin_lock_irqsave(&mz->lru_lock, flags);
> +
> + spin_lock(&mz->lru_lock);
> __mem_cgroup_add_list(pc);
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> + spin_unlock(&mz->lru_lock);
> + spin_unlock_irqrestore(&pc->lock, flags);
>
> - unlock_page_cgroup(page);
> -done:
> +success:
> return 0;
> -out:
> +nomem:
> css_put(&mem->css);
> - kfree(pc);
> -err:
> return -ENOMEM;
> }
>
> int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> {
> return mem_cgroup_charge_common(page, mm, gfp_mask,
> - MEM_CGROUP_CHARGE_TYPE_MAPPED);
> + MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> }
>
> int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> @@ -654,7 +593,7 @@ int mem_cgroup_cache_charge(struct page
> if (!mm)
> mm = &init_mm;
> return mem_cgroup_charge_common(page, mm, gfp_mask,
> - MEM_CGROUP_CHARGE_TYPE_CACHE);
> + MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
> }
>
> /*
> @@ -664,105 +603,87 @@ int mem_cgroup_cache_charge(struct page
> void mem_cgroup_uncharge_page(struct page *page)
> {
> struct page_cgroup *pc;
> - struct mem_cgroup *mem;
> - struct mem_cgroup_per_zone *mz;
> - unsigned long flags;
>
> if (mem_cgroup_subsys.disabled)
> return;
> -
> /*
> * Check if our page_cgroup is valid
> */
> - lock_page_cgroup(page);
> - pc = page_get_page_cgroup(page);
> - if (!pc)
> - goto unlock;
> -
> - VM_BUG_ON(pc->page != page);
> - VM_BUG_ON(pc->ref_cnt <= 0);
> + pc = get_page_cgroup(page);
> + if (likely(pc)) {
> + unsigned long flags;
> + struct mem_cgroup *mem;
> + struct mem_cgroup_per_zone *mz;
>
> - if (--(pc->ref_cnt) == 0) {
> + spin_lock_irqsave(&pc->lock, flags);
> + if (!pc->refcnt || --pc->refcnt > 0) {
> + spin_unlock_irqrestore(&pc->lock, flags);
> + return;
> + }
> + mem = pc->mem_cgroup;
> mz = page_cgroup_zoneinfo(pc);
> - spin_lock_irqsave(&mz->lru_lock, flags);
> + spin_lock(&mz->lru_lock);
> __mem_cgroup_remove_list(pc);
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> - page_assign_page_cgroup(page, NULL);
> - unlock_page_cgroup(page);
> + spin_unlock(&mz->lru_lock);
> + pc->flags = 0;
> + pc->mem_cgroup = 0;
> + spin_unlock_irqrestore(&pc->lock, flags);
>
> - mem = pc->mem_cgroup;
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> css_put(&mem->css);
> -
> - kfree(pc);
> - return;
> }
> -
> -unlock:
> - unlock_page_cgroup(page);
> }
>
> /*
> - * Returns non-zero if a page (under migration) has valid page_cgroup member.
> - * Refcnt of page_cgroup is incremented.
> + * Pre-charge against newpage while moving a page.
> + * This function is called before taking page locks.
> */
> -int mem_cgroup_prepare_migration(struct page *page)
> +int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> {
> struct page_cgroup *pc;
> + struct mem_cgroup *mem = NULL;
> + int ret = 0;
> + enum charge_type type = 0;
don't initialize an enum variable with integer value.
> + unsigned long flags;
>
> if (mem_cgroup_subsys.disabled)
> - return 0;
> + return ret;
>
> - lock_page_cgroup(page);
> - pc = page_get_page_cgroup(page);
> - if (pc)
> - pc->ref_cnt++;
> - unlock_page_cgroup(page);
> - return pc != NULL;
> -}
> + /* check newpage isn't under memory resource control */
> + pc = get_page_cgroup(newpage);
> + VM_BUG_ON(pc && pc->refcnt);
>
> -void mem_cgroup_end_migration(struct page *page)
> -{
> - mem_cgroup_uncharge_page(page);
> -}
> + pc = get_page_cgroup(page);
>
> + if (pc) {
> + spin_lock_irqsave(&pc->lock, flags);
> + if (pc->refcnt) {
> + mem = pc->mem_cgroup;
> + css_get(&mem->css);
> + if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
> + type = MEM_CGROUP_CHARGE_TYPE_CACHE;
> + else
> + type = MEM_CGROUP_CHARGE_TYPE_MAPPED;
> + }
> + spin_unlock_irqrestore(&pc->lock, flags);
> + if (mem) {
> + ret = mem_cgroup_charge_common(newpage, NULL,
> + GFP_KERNEL, type, mem);
> + css_put(&mem->css);
> + }
> + }
> + return ret;
> +}
> /*
> - * We know both *page* and *newpage* are now not-on-LRU and PG_locked.
> - * And no race with uncharge() routines because page_cgroup for *page*
> - * has extra one reference by mem_cgroup_prepare_migration.
> + * At the end of migration, we'll push newpage to LRU and
> + * drop one refcnt which added at prepare_migration.
> */
> -void mem_cgroup_page_migration(struct page *page, struct page *newpage)
> +void mem_cgroup_end_migration(struct page *newpage)
> {
> - struct page_cgroup *pc;
> - struct mem_cgroup_per_zone *mz;
> - unsigned long flags;
> -
> - lock_page_cgroup(page);
> - pc = page_get_page_cgroup(page);
> - if (!pc) {
> - unlock_page_cgroup(page);
> + if (mem_cgroup_subsys.disabled)
> return;
> - }
>
> - mz = page_cgroup_zoneinfo(pc);
> - spin_lock_irqsave(&mz->lru_lock, flags);
> - __mem_cgroup_remove_list(pc);
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> - page_assign_page_cgroup(page, NULL);
> - unlock_page_cgroup(page);
> -
> - pc->page = newpage;
> - lock_page_cgroup(newpage);
> - page_assign_page_cgroup(newpage, pc);
> -
> - mz = page_cgroup_zoneinfo(pc);
> - spin_lock_irqsave(&mz->lru_lock, flags);
> - __mem_cgroup_add_list(pc);
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> - unlock_page_cgroup(newpage);
> + mem_cgroup_uncharge_page(newpage);
> }
>
> /*
> @@ -790,10 +711,13 @@ static void mem_cgroup_force_empty_list(
> while (!list_empty(list)) {
> pc = list_entry(list->prev, struct page_cgroup, lru);
> page = pc->page;
> - get_page(page);
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> - mem_cgroup_uncharge_page(page);
> - put_page(page);
> + if (PageLRU(page)) {
> + get_page(page);
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> + mem_cgroup_uncharge_page(page);
> + put_page(page);
> + } else
> + count = 0;
> if (--count <= 0) {
> count = FORCE_UNCHARGE_BATCH;
> cond_resched();
> Index: linux-2.6.25-rc5-mm1-k/include/linux/memcontrol.h
> ===================================================================
> --- linux-2.6.25-rc5-mm1-k.orig/include/linux/memcontrol.h
> +++ linux-2.6.25-rc5-mm1-k/include/linux/memcontrol.h
> @@ -19,6 +19,7 @@
>
> #ifndef _LINUX_MEMCONTROL_H
> #define _LINUX_MEMCONTROL_H
> +#include <linux/page_cgroup.h>
>
> struct mem_cgroup;
> struct page_cgroup;
> @@ -30,9 +31,6 @@ struct mm_struct;
> extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
> extern void mm_free_cgroup(struct mm_struct *mm);
>
> -#define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0)
> -
> -extern struct page_cgroup *page_get_page_cgroup(struct page *page);
> extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> @@ -51,9 +49,8 @@ int task_in_mem_cgroup(struct task_struc
> #define mm_match_cgroup(mm, cgroup) \
> ((cgroup) == rcu_dereference((mm)->mem_cgroup))
>
> -extern int mem_cgroup_prepare_migration(struct page *page);
> -extern void mem_cgroup_end_migration(struct page *page);
> -extern void mem_cgroup_page_migration(struct page *page, struct page *newpage);
> +extern int mem_cgroup_prepare_migration(struct page *, struct page *);
> +extern void mem_cgroup_end_migration(struct page *);
>
> /*
> * For memory reclaim.
> @@ -82,14 +79,6 @@ static inline void mm_free_cgroup(struct
> {
> }
>
> -static inline void page_reset_bad_cgroup(struct page *page)
> -{
> -}
> -
> -static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
> -{
> - return NULL;
> -}
>
> static inline int mem_cgroup_charge(struct page *page,
> struct mm_struct *mm, gfp_t gfp_mask)
> @@ -122,7 +111,8 @@ static inline int task_in_mem_cgroup(str
> return 1;
> }
>
> -static inline int mem_cgroup_prepare_migration(struct page *page)
> +static inline int
> +mem_cgroup_prepare_migration(struct page *page , struct page *newpage)
> {
> return 0;
> }
> Index: linux-2.6.25-rc5-mm1-k/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.25-rc5-mm1-k.orig/mm/page_alloc.c
> +++ linux-2.6.25-rc5-mm1-k/mm/page_alloc.c
> @@ -222,17 +222,11 @@ static inline int bad_range(struct zone
>
> static void bad_page(struct page *page)
> {
> - void *pc = page_get_page_cgroup(page);
> -
> printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
> "page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
> current->comm, page, (int)(2*sizeof(unsigned long)),
> (unsigned long)page->flags, page->mapping,
> page_mapcount(page), page_count(page));
> - if (pc) {
> - printk(KERN_EMERG "cgroup:%p\n", pc);
> - page_reset_bad_cgroup(page);
> - }
> printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
> KERN_EMERG "Backtrace:\n");
> dump_stack();
> @@ -478,7 +472,6 @@ static inline int free_pages_check(struc
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> - (page_get_page_cgroup(page) != NULL) |
> (page_count(page) != 0) |
> (page->flags & (
> 1 << PG_lru |
> @@ -628,7 +621,6 @@ static int prep_new_page(struct page *pa
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> - (page_get_page_cgroup(page) != NULL) |
> (page_count(page) != 0) |
> (page->flags & (
> 1 << PG_lru |
> Index: linux-2.6.25-rc5-mm1-k/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.25-rc5-mm1-k.orig/include/linux/mm_types.h
> +++ linux-2.6.25-rc5-mm1-k/include/linux/mm_types.h
> @@ -88,9 +88,6 @@ struct page {
> void *virtual; /* Kernel virtual address (NULL if
> not kmapped, ie. highmem) */
> #endif /* WANT_PAGE_VIRTUAL */
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> - unsigned long page_cgroup;
> -#endif
> #ifdef CONFIG_PAGE_OWNER
> int order;
> unsigned int gfp_mask;
> Index: linux-2.6.25-rc5-mm1-k/mm/migrate.c
> ===================================================================
> --- linux-2.6.25-rc5-mm1-k.orig/mm/migrate.c
> +++ linux-2.6.25-rc5-mm1-k/mm/migrate.c
> @@ -358,6 +358,12 @@ static int migrate_page_move_mapping(str
>
> write_unlock_irq(&mapping->tree_lock);
>
> + /* by mem_cgroup_prepare_migration, newpage is already
> + assigned to valid cgroup. and current->mm and GFP_ATOMIC
> + will not be used...*/
> + mem_cgroup_uncharge_page(page);
> + mem_cgroup_cache_charge(newpage, current->mm, GFP_ATOMIC);
> +
> return 0;
> }
>
> @@ -603,7 +609,6 @@ static int move_to_new_page(struct page
> rc = fallback_migrate_page(mapping, newpage, page);
>
> if (!rc) {
> - mem_cgroup_page_migration(page, newpage);
> remove_migration_ptes(page, newpage);
> } else
> newpage->mapping = NULL;
> @@ -633,6 +638,12 @@ static int unmap_and_move(new_page_t get
> /* page was freed from under us. So we are done. */
> goto move_newpage;
>
> + charge = mem_cgroup_prepare_migration(page, newpage);
> + if (charge == -ENOMEM) {
> + rc = -ENOMEM;
> + goto move_newpage;
> + }
> +
> rc = -EAGAIN;
> if (TestSetPageLocked(page)) {
> if (!force)
> @@ -684,19 +695,14 @@ static int unmap_and_move(new_page_t get
> goto rcu_unlock;
> }
>
> - charge = mem_cgroup_prepare_migration(page);
> /* Establish migration ptes or remove ptes */
> try_to_unmap(page, 1);
>
> if (!page_mapped(page))
> rc = move_to_new_page(newpage, page);
>
> - if (rc) {
> + if (rc)
> remove_migration_ptes(page, page);
> - if (charge)
> - mem_cgroup_end_migration(page);
> - } else if (charge)
> - mem_cgroup_end_migration(newpage);
> rcu_unlock:
> if (rcu_locked)
> rcu_read_unlock();
> @@ -717,6 +723,8 @@ unlock:
> }
>
> move_newpage:
> + if (!charge)
> + mem_cgroup_end_migration(newpage);
> /*
> * Move the new page to the LRU. If migration was not successful
> * then this will free the page.
> Index: linux-2.6.25-rc5-mm1-k/mm/page_cgroup.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.25-rc5-mm1-k/mm/page_cgroup.c
> @@ -0,0 +1,260 @@
> +/*
> + * per-page accounting subsystem infrastructure. - linux/mm/page_cgroup.c
> + *
> + * (C) 2008 FUJITSU, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * page_cgroup is yet another mem_map under memory resoruce controller.
> + * It containes information which cannot be stored in usual mem_map.
> + * This allows us to keep 'struct page' small when a user doesn't activate
> + * memory resource controller.
> + *
> + * We can translate : struct page <-> pfn -> page_cgroup -> struct page.
> + *
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/radix-tree.h>
> +#include <linux/memcontrol.h>
> +#include <linux/err.h>
> +
> +static int page_cgroup_order __read_mostly;
> +static int page_cgroup_head_size __read_mostly;
> +
> +#define PCGRP_SHIFT (page_cgroup_order)
> +#define PCGRP_SIZE (1 << PCGRP_SHIFT)
> +#define PCGRP_MASK (PCGRP_SIZE - 1)
> +
> +struct page_cgroup_head {
> + struct page_cgroup pc[0];
> +};
> +
> +struct page_cgroup_root {
> + spinlock_t tree_lock;
> + struct radix_tree_root root_node;
> +};
> +
> +/*
> + * Calculate page_cgroup order to be not larger than order-2 page allocation.
> + */
> +static void calc_page_cgroup_order(void)
> +{
> + int order = pageblock_order;
> + unsigned long size = sizeof(struct page_cgroup) << order;
> +
> + while (size > PAGE_SIZE * 2) {
> + order -= 1;
> + size = sizeof(struct page_cgroup) << order;
> + }
> +
> + page_cgroup_order = order;
> + page_cgroup_head_size = sizeof(struct page_cgroup_head) +
> + (sizeof(struct page_cgroup) << order);
> +}
> +
> +static struct page_cgroup_root __initdata *tmp_root_dir[MAX_NUMNODES];
> +static struct page_cgroup_root *root_node[MAX_NUMNODES] __read_mostly;
> +
> +static void
> +init_page_cgroup_head(struct page_cgroup_head *head, unsigned long pfn)
> +{
> + struct page *page;
> + struct page_cgroup *pc;
> + int i;
> +
> + for (i = 0, page = pfn_to_page(pfn), pc = &head->pc[0];
> + i < PCGRP_SIZE; i++, page++, pc++) {
indentation:
i < PCGRP_SIZE; i++, page++, pc++) {
> + pc->refcnt = 0;
> + pc->page = page;
> + spin_lock_init(&pc->lock);
> + }
> +}
> +
> +
> +struct kmem_cache *page_cgroup_cachep;
> +
> +static struct page_cgroup_head *
> +alloc_page_cgroup_head(unsigned long pfn, int nid, gfp_t mask)
> +{
> + struct page_cgroup_head *head;
> +
> + if (!node_state(nid, N_NORMAL_MEMORY))
> + nid = -1;
> + head = kmem_cache_alloc_node(page_cgroup_cachep, mask, nid);
> + if (head)
> + init_page_cgroup_head(head, pfn);
> +
> + return head;
> +}
> +
> +void free_page_cgroup(struct page_cgroup_head *head)
> +{
> + kmem_cache_free(page_cgroup_cachep, head);
> +}
> +
> +static struct page_cgroup_root *pcgroup_get_root(struct page *page)
> +{
> + int nid;
> +
> + VM_BUG_ON(!page);
> +
> + nid = page_to_nid(page);
> +
> + return root_node[nid];
> +}
> +
> +/**
> + * get_page_cgroup - look up a page_cgroup for a page
> + * @page: the page whose page_cgroup is looked up.
> + *
> + * This just does lookup.
> + */
> +struct page_cgroup *get_page_cgroup(struct page *page)
> +{
> + struct page_cgroup_head *head;
> + struct page_cgroup_root *root;
> + struct page_cgroup *ret = NULL;
> + unsigned long pfn, idx;
> +
> + /*
> + * NULL can be returned before initialization
> + */
> + root = pcgroup_get_root(page);
> + if (unlikely(!root))
> + return ret;
> +
> + pfn = page_to_pfn(page);
> + idx = pfn >> PCGRP_SHIFT;
> + /*
> + * We don't need lock here because no one deletes this head.
> + * (Freeing routtine will be added later.)
> + */
> + rcu_read_lock();
> + head = radix_tree_lookup(&root->root_node, idx);
> + rcu_read_unlock();
> +
> + if (likely(head))
> + ret = &head->pc[pfn & PCGRP_MASK];
> +
> + return ret;
> +}
> +
> +/**
> + * get_alloc_page_cgroup - look up or allocate a page_cgroup for a page
> + * @page: the page whose page_cgroup is looked up.
> + * @gfpmask: the gfpmask which will be used for page allocatiopn.
> + *
> + * look up and allocate if not found.
> + */
> +
> +struct page_cgroup *
> +get_alloc_page_cgroup(struct page *page, gfp_t gfpmask)
> +{
> + struct page_cgroup_root *root;
> + struct page_cgroup_head *head;
> + struct page_cgroup *pc;
> + unsigned long pfn, idx;
> + int nid;
> + unsigned long base_pfn, flags;
> + int error = 0;
> +
> + might_sleep_if(gfpmask & __GFP_WAIT);
> +
> +retry:
> + pc = get_page_cgroup(page);
> + if (pc)
> + return pc;
> + /*
> + * NULL can be returned before initialization.
> + */
> + root = pcgroup_get_root(page);
> + if (unlikely(!root))
> + return NULL;
> +
> + pfn = page_to_pfn(page);
> + idx = pfn >> PCGRP_SHIFT;
> + nid = page_to_nid(page);
> + base_pfn = idx << PCGRP_SHIFT;
> +
> + gfpmask = gfpmask & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> +
> + head = alloc_page_cgroup_head(base_pfn, nid, gfpmask);
> + if (!head)
> + return ERR_PTR(-ENOMEM);
> +
> + pc = &head->pc[pfn & PCGRP_MASK];
> +
> + error = radix_tree_preload(gfpmask);
> + if (error)
> + goto out;
> + spin_lock_irqsave(&root->tree_lock, flags);
> + error = radix_tree_insert(&root->root_node, idx, head);
> + spin_unlock_irqrestore(&root->tree_lock, flags);
> + radix_tree_preload_end();
> +out:
> + if (error) {
> + free_page_cgroup(head);
> + if (error == -EEXIST)
> + goto retry;
> + pc = ERR_PTR(error);
> + }
> + return pc;
> +}
> +
> +__init int page_cgroup_init(void)
static int __init
> +{
> + int tmp, nid;
> + struct page_cgroup_root *root;
> +
> + calc_page_cgroup_order();
> +
> + page_cgroup_cachep = kmem_cache_create("page_cgroup",
> + page_cgroup_head_size, 0,
> + SLAB_PANIC | SLAB_DESTROY_BY_RCU, NULL);
> +
> + if (!page_cgroup_cachep) {
> + printk(KERN_ERR "page accouning setup failure\n");
> + printk(KERN_ERR "can't initialize slab memory\n");
why not:
printk(KERN_ERR "page accouning setup failure, can't initialize slab memory\n");
otherwise the user may think 2 bad things happened
> + return -ENOMEM;
> + }
> +
> + for_each_node(nid) {
> + tmp = nid;
> + if (!node_state(nid, N_NORMAL_MEMORY))
> + tmp = -1;
> + root = kmalloc_node(sizeof(struct page_cgroup_root),
> + GFP_KERNEL, tmp);
> + if (!root)
> + goto unroll;
> + INIT_RADIX_TREE(&root->root_node, GFP_ATOMIC);
> + spin_lock_init(&root->tree_lock);
> + tmp_root_dir[nid] = root;
> + }
> + /*
> + * By filling node_root[], this tree turns to be visible.
> + * Because we have to finish initialization of the tree before
> + * we make it visible, memory barrier is necessary.
> + */
> + smp_wmb();
> + for_each_node(nid)
> + root_node[nid] = tmp_root_dir[nid];
> +
> + printk(KERN_INFO "Page Accouintg is activated\n");
Accounting
> + return 0;
> +unroll:
> + for_each_node(nid)
> + kfree(tmp_root_dir[nid]);
> +
> + return -ENOMEM;
> +}
> +late_initcall(page_cgroup_init);
> Index: linux-2.6.25-rc5-mm1-k/mm/Makefile
> ===================================================================
> --- linux-2.6.25-rc5-mm1-k.orig/mm/Makefile
> +++ linux-2.6.25-rc5-mm1-k/mm/Makefile
> @@ -32,5 +32,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
> obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_SMP) += allocpercpu.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2008-03-28 0:55 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-03-27 8:44 [-mm] [PATCH 0/4] memcg : radix-tree page_cgroup v2 KAMEZAWA Hiroyuki
2008-03-27 8:47 ` [-mm] [PATCH 1/4] memcg : radix-tree page_cgroup KAMEZAWA Hiroyuki
2008-03-28 0:55 ` Li Zefan [this message]
2008-03-28 1:13 ` KAMEZAWA Hiroyuki
2008-03-27 8:49 ` [-mm] [PATCH 2/4] memcg: boost by percpu KAMEZAWA Hiroyuki
2008-03-27 8:50 ` [-mm] [PATCH 3/4] memcg : shirink page cgroup KAMEZAWA Hiroyuki
2008-03-27 8:51 ` [-mm] [PATCH 4/4] memcg : radix-tree page_cgroup v2 KAMEZAWA Hiroyuki
2008-03-27 8:53 ` KAMEZAWA Hiroyuki
2008-03-27 9:12 ` [-mm] [PATCH 0/4] " KOSAKI Motohiro
2008-03-27 9:34 ` KAMEZAWA Hiroyuki
2008-03-27 9:53 ` Balbir Singh
2008-03-27 10:19 ` KAMEZAWA Hiroyuki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=47EC41E5.7090901@cn.fujitsu.com \
--to=lizf@cn.fujitsu.com \
--cc=a.p.zijlstra@chello.nl \
--cc=balbir@linux.vnet.ibm.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.