[PATCH] memory cgroup enhancements [0/5] intro

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] memory cgroup enhancements [0/5] intro
@ 2007-10-16 10:19 KAMEZAWA Hiroyuki
  2007-10-16 10:23 ` [PATCH] memory cgroup enhancements [1/5] force_empty for memory cgroup KAMEZAWA Hiroyuki
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-16 10:19 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: containers@lists.osdl.org, balbir@linux.vnet.ibm.com,
	yamamoto@valinux.co.jp, kamezawa.hiroyu@jp.fujitsu.com

This patch set adds
 - force_empty interface, which drops all charges in memory cgroup.
   This enables rmdir() against unused memory cgroup.
 - the memory cgroup statistics accounting.

Based on 2.6.23-mm1 + http://lkml.org/lkml/2007/10/12/53

Changes from previous version is
 - merged comments.
 - based on 2.6.23-mm1
 - removed Charge/Uncharge counter.

[1/5] ... force_empty patch
[2/5] ... remember page is charged as page-cache patch
[3/5] ... remember page is on which list patch
[4/5] ... memory cgroup statistics patch
[5/5] ... show statistics patch

tested on x86-64/fake-NUMA + CONFIG_PREEMPT=y/n (for testing preempt_disable())

Any comments are welcome.

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] memory cgroup enhancements [1/5]  force_empty for memory cgroup
  2007-10-16 10:19 [PATCH] memory cgroup enhancements [0/5] intro KAMEZAWA Hiroyuki
@ 2007-10-16 10:23 ` KAMEZAWA Hiroyuki
  2007-10-17  4:17   ` David Rientjes
  2007-10-16 10:25 ` [PATCH] memory cgroup enhancements [2/5] remember charge as cache KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-16 10:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

This patch adds an interface "memory.force_empty".
Any write to this file will drop all charges in this cgroup if
there is no task under.

%echo 1 > /....../memory.force_empty

will drop all charges of memory cgroup if cgroup's tasks is empty.

This is useful to invoke rmdir() against memory cgroup successfully.

Tested and worked well on x86_64/fake-NUMA system.

Changelog v3 -> v4:
  - adjusted to 2.6.23-mm1
  - fixed typo
  - changes buf[2]="0" to static const
Changelog v2 -> v3:
  - changed the name from force_reclaim to force_empty.
Changelog v1 -> v2:
  - added a new interface force_reclaim.
  - changes spin_lock to spin_lock_irqsave().


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 mm/memcontrol.c |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 95 insertions(+), 7 deletions(-)

Index: devel-2.6.23-mm1/mm/memcontrol.c
===================================================================
--- devel-2.6.23-mm1.orig/mm/memcontrol.c
+++ devel-2.6.23-mm1/mm/memcontrol.c
@@ -480,6 +480,7 @@ void mem_cgroup_uncharge(struct page_cgr
 		page = pc->page;
 		/*
 		 * get page->cgroup and clear it under lock.
+		 * force_empty can drop page->cgroup without checking refcnt.
 		 */
 		if (clear_page_cgroup(page, pc) == pc) {
 			mem = pc->mem_cgroup;
@@ -489,13 +490,6 @@ void mem_cgroup_uncharge(struct page_cgr
 			list_del_init(&pc->lru);
 			spin_unlock_irqrestore(&mem->lru_lock, flags);
 			kfree(pc);
-		} else {
-			/*
-			 * Note:This will be removed when force-empty patch is
-			 * applied. just show warning here.
-			 */
-			printk(KERN_ERR "Race in mem_cgroup_uncharge() ?");
-			dump_stack();
 		}
 	}
 }
@@ -543,6 +537,70 @@ retry:
 	return;
 }
 
+/*
+ * This routine traverse page_cgroup in given list and drop them all.
+ * This routine ignores page_cgroup->ref_cnt.
+ * *And* this routine doesn't relcaim page itself, just removes page_cgroup.
+ */
+static void
+mem_cgroup_force_empty_list(struct mem_cgroup *mem, struct list_head *list)
+{
+	struct page_cgroup *pc;
+	struct page *page;
+	int count = SWAP_CLUSTER_MAX;
+	unsigned long flags;
+
+	spin_lock_irqsave(&mem->lru_lock, flags);
+
+	while (!list_empty(list)) {
+		pc = list_entry(list->prev, struct page_cgroup, lru);
+		page = pc->page;
+		/* Avoid race with charge */
+		atomic_set(&pc->ref_cnt, 0);
+		if (clear_page_cgroup(page, pc) == pc) {
+			css_put(&mem->css);
+			res_counter_uncharge(&mem->res, PAGE_SIZE);
+			list_del_init(&pc->lru);
+			kfree(pc);
+		} else
+			count = 1; /* being uncharged ? ...do relax */
+
+		if (--count == 0) {
+			spin_unlock_irqrestore(&mem->lru_lock, flags);
+			cond_resched();
+			spin_lock_irqsave(&mem->lru_lock, flags);
+			count = SWAP_CLUSTER_MAX;
+		}
+	}
+	spin_unlock_irqrestore(&mem->lru_lock, flags);
+}
+
+/*
+ * make mem_cgroup's charge to be 0 if there is no binded task.
+ * This enables deleting this mem_cgroup.
+ */
+
+int mem_cgroup_force_empty(struct mem_cgroup *mem)
+{
+	int ret = -EBUSY;
+	css_get(&mem->css);
+	while (!list_empty(&mem->active_list) ||
+	       !list_empty(&mem->inactive_list)) {
+		if (atomic_read(&mem->css.cgroup->count) > 0)
+			goto out;
+		/* drop all page_cgroup in active_list */
+		mem_cgroup_force_empty_list(mem, &mem->active_list);
+		/* drop all page_cgroup in inactive_list */
+		mem_cgroup_force_empty_list(mem, &mem->inactive_list);
+	}
+	ret = 0;
+out:
+	css_put(&mem->css);
+	return ret;
+}
+
+
+
 int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
 {
 	*tmp = memparse(buf, &buf);
@@ -628,6 +686,31 @@ static ssize_t mem_control_type_read(str
 			ppos, buf, s - buf);
 }
 
+
+static ssize_t mem_force_empty_write(struct cgroup *cont,
+				struct cftype *cft, struct file *file,
+				const char __user *userbuf,
+				size_t nbytes, loff_t *ppos)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	int ret;
+	ret = mem_cgroup_force_empty(mem);
+	if (!ret)
+		ret = nbytes;
+	return ret;
+}
+
+static ssize_t mem_force_empty_read(struct cgroup *cont,
+				struct cftype *cft,
+				struct file *file, char __user *userbuf,
+				size_t nbytes, loff_t *ppos)
+{
+	static const char buf[2] = "0";
+	return simple_read_from_buffer((void __user *)userbuf, nbytes,
+			ppos, buf, strlen(buf));
+}
+
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -650,6 +733,11 @@ static struct cftype mem_cgroup_files[] 
 		.write = mem_control_type_write,
 		.read = mem_control_type_read,
 	},
+	{
+		.name = "force_empty",
+		.write = mem_force_empty_write,
+		.read = mem_force_empty_read,
+	},
 };
 
 static struct mem_cgroup init_mem_cgroup;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] memory cgroup enhancements [2/5] remember charge as cache
  2007-10-16 10:19 [PATCH] memory cgroup enhancements [0/5] intro KAMEZAWA Hiroyuki
  2007-10-16 10:23 ` [PATCH] memory cgroup enhancements [1/5] force_empty for memory cgroup KAMEZAWA Hiroyuki
@ 2007-10-16 10:25 ` KAMEZAWA Hiroyuki
  2007-10-16 10:26 ` [PATCH] memory cgroup enhancements [3/5] record pc is on active list KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-16 10:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

Add PCGF_PAGECACHE flag to page_cgroup to remember "this page is
charged as page-cache."
This is very useful for implementing precise accounting in memory cgroup.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

 mm/memcontrol.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

Index: devel-2.6.23-mm1/mm/memcontrol.c
===================================================================
--- devel-2.6.23-mm1.orig/mm/memcontrol.c
+++ devel-2.6.23-mm1/mm/memcontrol.c
@@ -83,6 +83,8 @@ struct page_cgroup {
 	struct mem_cgroup *mem_cgroup;
 	atomic_t ref_cnt;		/* Helpful when pages move b/w  */
 					/* mapped and cached states     */
+	int	 flags;
+#define PCGF_PAGECACHE		(0x1)	/* charged as page-cache */
 };
 
 enum {
@@ -315,8 +317,8 @@ unsigned long mem_cgroup_isolate_pages(u
  * 0 if the charge was successful
  * < 0 if the cgroup is over its limit
  */
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
+static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
+				gfp_t gfp_mask, int is_cache)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
@@ -418,6 +420,10 @@ noreclaim:
 	atomic_set(&pc->ref_cnt, 1);
 	pc->mem_cgroup = mem;
 	pc->page = page;
+	if (is_cache)
+		pc->flags = PCGF_PAGECACHE;
+	else
+		pc->flags = 0;
 	if (page_cgroup_assign_new_page_cgroup(page, pc)) {
 		/*
 		 * an another charge is added to this page already.
@@ -442,6 +448,12 @@ err:
 	return -ENOMEM;
 }
 
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
+			gfp_t gfp_mask)
+{
+	return mem_cgroup_charge_common(page, mm, gfp_mask, 0);
+}
+
 /*
  * See if the cached pages should be charged at all?
  */
@@ -454,7 +466,7 @@ int mem_cgroup_cache_charge(struct page 
 
 	mem = rcu_dereference(mm->mem_cgroup);
 	if (mem->control_type == MEM_CGROUP_TYPE_ALL)
-		return mem_cgroup_charge(page, mm, gfp_mask);
+		return mem_cgroup_charge_common(page, mm, gfp_mask, 1);
 	else
 		return 0;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] memory cgroup enhancements [3/5] record pc is on active list
  2007-10-16 10:19 [PATCH] memory cgroup enhancements [0/5] intro KAMEZAWA Hiroyuki
  2007-10-16 10:23 ` [PATCH] memory cgroup enhancements [1/5] force_empty for memory cgroup KAMEZAWA Hiroyuki
  2007-10-16 10:25 ` [PATCH] memory cgroup enhancements [2/5] remember charge as cache KAMEZAWA Hiroyuki
@ 2007-10-16 10:26 ` KAMEZAWA Hiroyuki
  2007-10-17  4:17   ` David Rientjes
  2007-10-16 10:27 ` [PATCH] memory cgroup enhancements [4/5] memory cgroup statistics KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-16 10:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

Remember page_cgroup is on active_list or not in page_cgroup->flags.

Against 2.6.23-mm1.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

 mm/memcontrol.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: devel-2.6.23-mm1/mm/memcontrol.c
===================================================================
--- devel-2.6.23-mm1.orig/mm/memcontrol.c
+++ devel-2.6.23-mm1/mm/memcontrol.c
@@ -85,6 +85,7 @@ struct page_cgroup {
 					/* mapped and cached states     */
 	int	 flags;
 #define PCGF_PAGECACHE		(0x1)	/* charged as page-cache */
+#define PCGF_ACTIVE		(0x2)	/* this is on cgroup's active list */
 };
 
 enum {
@@ -208,10 +209,13 @@ clear_page_cgroup(struct page *page, str
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
-	if (active)
+	if (active) {
+		pc->flags |= PCGF_ACTIVE;
 		list_move(&pc->lru, &pc->mem_cgroup->active_list);
-	else
+	} else {
+		pc->flags &= ~PCGF_ACTIVE;
 		list_move(&pc->lru, &pc->mem_cgroup->inactive_list);
+	}
 }
 
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
@@ -421,9 +425,9 @@ noreclaim:
 	pc->mem_cgroup = mem;
 	pc->page = page;
 	if (is_cache)
-		pc->flags = PCGF_PAGECACHE;
+		pc->flags = PCGF_PAGECACHE | PCGF_ACTIVE;
 	else
-		pc->flags = 0;
+		pc->flags = PCGF_ACTIVE;
 	if (page_cgroup_assign_new_page_cgroup(page, pc)) {
 		/*
 		 * an another charge is added to this page already.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] memory cgroup enhancements [4/5] memory cgroup statistics
  2007-10-16 10:19 [PATCH] memory cgroup enhancements [0/5] intro KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2007-10-16 10:26 ` [PATCH] memory cgroup enhancements [3/5] record pc is on active list KAMEZAWA Hiroyuki
@ 2007-10-16 10:27 ` KAMEZAWA Hiroyuki
  2007-10-16 10:28 ` [PATCH] memory cgroup enhancements [5/5] show statistics by memory.stat file per cgroup KAMEZAWA Hiroyuki
  2007-10-16 18:20 ` [PATCH] memory cgroup enhancements [0/5] intro Balbir Singh
  5 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-16 10:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

Add statistics account infrastructure for memory controller.

Changelog v1 -> v2
 - Removed Charge/Uncharge counter
 - reflected comments.
   - changes __move_lists() args.
   - changes __mem_cgroup_stat_add() name, comment and added VM_BUG_ON

Changes from original:
 - divided into 2 patch (account and show info)
 - changed from u64 to s64
 - added mem_cgroup_stat_add() and batched statistics modification logic.
 - removed stat init code because mem_cgroup is allocated by kzalloc().


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

 mm/memcontrol.c |  120 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 113 insertions(+), 7 deletions(-)

Index: devel-2.6.23-mm1/mm/memcontrol.c
===================================================================
--- devel-2.6.23-mm1.orig/mm/memcontrol.c
+++ devel-2.6.23-mm1/mm/memcontrol.c
@@ -35,6 +35,64 @@ struct cgroup_subsys mem_cgroup_subsys;
 static const int MEM_CGROUP_RECLAIM_RETRIES = 5;
 
 /*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_PAGECACHE, /* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as rss */
+
+	/*
+	 * usage = charge - uncharge.
+	 */
+	MEM_CGROUP_STAT_ACTIVE,	   /* # of pages in active list */
+	MEM_CGROUP_STAT_INACTIVE,  /* # of pages on inactive list */
+
+	MEM_CGROUP_STAT_NSTATS,
+};
+
+struct mem_cgroup_stat_cpu {
+	s64 count[MEM_CGROUP_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct mem_cgroup_stat {
+	struct mem_cgroup_stat_cpu cpustat[NR_CPUS];
+};
+
+/*
+ * For batching....mem_cgroup_charge_statistics()(see below).
+ * MUST be called under preempt_disable().
+ */
+static inline void __mem_cgroup_stat_add(struct mem_cgroup_stat *stat,
+                enum mem_cgroup_stat_index idx, int val)
+{
+	int cpu = smp_processor_id();
+#ifdef CONFIG_PREEMPT
+	VM_BUG_ON(preempt_count() == 0);
+#endif
+	stat->cpustat[cpu].count[idx] += val;
+}
+
+static inline void mem_cgroup_stat_inc(struct mem_cgroup_stat *stat,
+		enum mem_cgroup_stat_index idx)
+{
+	preempt_disable();
+	__mem_cgroup_stat_add(stat, idx, 1);
+	preempt_enable();
+}
+
+static inline void mem_cgroup_stat_dec(struct mem_cgroup_stat *stat,
+		enum mem_cgroup_stat_index idx)
+{
+	preempt_disable();
+	__mem_cgroup_stat_add(stat, idx, -1);
+	preempt_enable();
+}
+
+
+/*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
  * statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -63,6 +121,10 @@ struct mem_cgroup {
 	 */
 	spinlock_t lru_lock;
 	unsigned long control_type;	/* control RSS or RSS+Pagecache */
+	/*
+	 * statistics.
+	 */
+	struct mem_cgroup_stat stat;
 };
 
 /*
@@ -96,6 +158,33 @@ enum {
 	MEM_CGROUP_TYPE_MAX,
 };
 
+/*
+ * Batched statistics modification.
+ * We have to modify several values at charge/uncharge..
+ */
+static inline void
+mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags, int charge)
+{
+	int val = (charge)? 1 : -1;
+	struct mem_cgroup_stat *stat = &mem->stat;
+	preempt_disable();
+
+	if (flags & PCGF_PAGECACHE)
+		__mem_cgroup_stat_add(stat, MEM_CGROUP_STAT_PAGECACHE, val);
+	else
+		__mem_cgroup_stat_add(stat, MEM_CGROUP_STAT_RSS, val);
+
+	if (flags & PCGF_ACTIVE)
+		__mem_cgroup_stat_add(stat, MEM_CGROUP_STAT_ACTIVE, val);
+	else
+		__mem_cgroup_stat_add(stat, MEM_CGROUP_STAT_INACTIVE, val);
+
+	preempt_enable();
+}
+
+
+
+
 static struct mem_cgroup init_mem_cgroup;
 
 static inline
@@ -209,12 +298,27 @@ clear_page_cgroup(struct page *page, str
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
+	int moved = 0;
+	struct mem_cgroup *mem = pc->mem_cgroup;
+
+	if (active && (pc->flags & PCGF_ACTIVE) == 0)
+		moved = 1; /* Move from inactive to active */
+	else if (!active && (pc->flags & PCGF_ACTIVE))
+		moved = -1; /* Move from active to inactive */
+
+	if (moved) {
+		struct mem_cgroup_stat *stat = &mem->stat;
+		preempt_disable();
+		__mem_cgroup_stat_add(stat, MEM_CGROUP_STAT_ACTIVE, moved);
+		__mem_cgroup_stat_add(stat, MEM_CGROUP_STAT_INACTIVE, -moved);
+		preempt_enable();
+	}
 	if (active) {
 		pc->flags |= PCGF_ACTIVE;
-		list_move(&pc->lru, &pc->mem_cgroup->active_list);
+		list_move(&pc->lru, &mem->active_list);
 	} else {
 		pc->flags &= ~PCGF_ACTIVE;
-		list_move(&pc->lru, &pc->mem_cgroup->inactive_list);
+		list_move(&pc->lru, &mem->inactive_list);
 	}
 }
 
@@ -233,15 +337,12 @@ int task_in_mem_cgroup(struct task_struc
  */
 void mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
-	struct mem_cgroup *mem;
 	if (!pc)
 		return;
 
-	mem = pc->mem_cgroup;
-
-	spin_lock(&mem->lru_lock);
+	spin_lock(&pc->mem_cgroup->lru_lock);
 	__mem_cgroup_move_lists(pc, active);
-	spin_unlock(&mem->lru_lock);
+	spin_unlock(&pc->mem_cgroup->lru_lock);
 }
 
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
@@ -440,6 +541,9 @@ noreclaim:
 		goto retry;
 	}
 
+	/* Update statistics vector */
+	mem_cgroup_charge_statistics(mem, pc->flags, true);
+
 	spin_lock_irqsave(&mem->lru_lock, flags);
 	list_add(&pc->lru, &mem->active_list);
 	spin_unlock_irqrestore(&mem->lru_lock, flags);
@@ -505,6 +609,7 @@ void mem_cgroup_uncharge(struct page_cgr
 			spin_lock_irqsave(&mem->lru_lock, flags);
 			list_del_init(&pc->lru);
 			spin_unlock_irqrestore(&mem->lru_lock, flags);
+			mem_cgroup_charge_statistics(mem, pc->flags, false);
 			kfree(pc);
 		}
 	}
@@ -577,6 +682,7 @@ mem_cgroup_force_empty_list(struct mem_c
 			css_put(&mem->css);
 			res_counter_uncharge(&mem->res, PAGE_SIZE);
 			list_del_init(&pc->lru);
+			mem_cgroup_charge_statistics(mem, pc->flags, false);
 			kfree(pc);
 		} else
 			count = 1; /* being uncharged ? ...do relax */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] memory cgroup enhancements [5/5] show statistics by memory.stat file per cgroup
  2007-10-16 10:19 [PATCH] memory cgroup enhancements [0/5] intro KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2007-10-16 10:27 ` [PATCH] memory cgroup enhancements [4/5] memory cgroup statistics KAMEZAWA Hiroyuki
@ 2007-10-16 10:28 ` KAMEZAWA Hiroyuki
  2007-10-16 18:20 ` [PATCH] memory cgroup enhancements [0/5] intro Balbir Singh
  5 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-16 10:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

Show accounted information of memory cgroup by memory.stat file

Changelog v1->v2
 - dropped Charge/Uncharge entry.

Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

Index: devel-2.6.23-mm1/mm/memcontrol.c
===================================================================
--- devel-2.6.23-mm1.orig/mm/memcontrol.c
+++ devel-2.6.23-mm1/mm/memcontrol.c
@@ -28,6 +28,7 @@
 #include <linux/swap.h>
 #include <linux/spinlock.h>
 #include <linux/fs.h>
+#include <linux/seq_file.h>
 
 #include <asm/uaccess.h>
 
@@ -833,6 +834,53 @@ static ssize_t mem_force_empty_read(stru
 }
 
 
+static const struct mem_cgroup_stat_desc {
+	const char *msg;
+	u64 unit;
+} mem_cgroup_stat_desc[] = {
+	[MEM_CGROUP_STAT_PAGECACHE] = { "page_cache", PAGE_SIZE, },
+	[MEM_CGROUP_STAT_RSS] = { "rss", PAGE_SIZE, },
+	[MEM_CGROUP_STAT_ACTIVE] = { "active", PAGE_SIZE, },
+	[MEM_CGROUP_STAT_INACTIVE] = { "inactive", PAGE_SIZE, },
+};
+
+static int mem_control_stat_show(struct seq_file *m, void *arg)
+{
+	struct cgroup *cont = m->private;
+	struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont);
+	struct mem_cgroup_stat *stat = &mem_cont->stat;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(stat->cpustat[0].count); i++) {
+		unsigned int cpu;
+		s64 val;
+
+		val = 0;
+		for (cpu = 0; cpu < NR_CPUS; cpu++)
+			val += stat->cpustat[cpu].count[i];
+		val *= mem_cgroup_stat_desc[i].unit;
+		seq_printf(m, "%s %lld\n", mem_cgroup_stat_desc[i].msg, val);
+	}
+	return 0;
+}
+
+static const struct file_operations mem_control_stat_file_operations = {
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static int mem_control_stat_open(struct inode *unused, struct file *file)
+{
+	/* XXX __d_cont */
+	struct cgroup *cont = file->f_dentry->d_parent->d_fsdata;
+
+	file->f_op = &mem_control_stat_file_operations;
+	return single_open(file, mem_control_stat_show, cont);
+}
+
+
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -860,6 +908,10 @@ static struct cftype mem_cgroup_files[] 
 		.write = mem_force_empty_write,
 		.read = mem_force_empty_read,
 	},
+	{
+		.name = "stat",
+		.open = mem_control_stat_open,
+	},
 };
 
 static struct mem_cgroup init_mem_cgroup;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [0/5] intro
  2007-10-16 10:19 [PATCH] memory cgroup enhancements [0/5] intro KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2007-10-16 10:28 ` [PATCH] memory cgroup enhancements [5/5] show statistics by memory.stat file per cgroup KAMEZAWA Hiroyuki
@ 2007-10-16 18:20 ` Balbir Singh
  2007-10-16 18:28   ` Andrew Morton
  5 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2007-10-16 18:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Andrew Morton
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	yamamoto@valinux.co.jp

KAMEZAWA Hiroyuki wrote:
> This patch set adds
>  - force_empty interface, which drops all charges in memory cgroup.
>    This enables rmdir() against unused memory cgroup.
>  - the memory cgroup statistics accounting.
> 
> Based on 2.6.23-mm1 + http://lkml.org/lkml/2007/10/12/53
> 
> Changes from previous version is
>  - merged comments.
>  - based on 2.6.23-mm1
>  - removed Charge/Uncharge counter.
> 
> [1/5] ... force_empty patch
> [2/5] ... remember page is charged as page-cache patch
> [3/5] ... remember page is on which list patch
> [4/5] ... memory cgroup statistics patch
> [5/5] ... show statistics patch
> 
> tested on x86-64/fake-NUMA + CONFIG_PREEMPT=y/n (for testing preempt_disable())
> 
> Any comments are welcome.
> 

Hi, KAMEZAWA-San,

I would prefer these patches to go in once the fixes that you've posted
earlier have gone in (the migration fix series). I am yet to test the
migration fix per-se, but the series seemed quite fine to me. Andrew
could you please pick it up.

> -Kame
> 


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [0/5] intro
  2007-10-16 18:20 ` [PATCH] memory cgroup enhancements [0/5] intro Balbir Singh
@ 2007-10-16 18:28   ` Andrew Morton
  2007-10-16 18:40     ` Balbir Singh
  2007-10-17  5:19     ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 29+ messages in thread
From: Andrew Morton @ 2007-10-16 18:28 UTC (permalink / raw)
  To: balbir; +Cc: kamezawa.hiroyu, linux-mm, containers, yamamoto

On Tue, 16 Oct 2007 23:50:28 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > [1/5] ... force_empty patch
> > [2/5] ... remember page is charged as page-cache patch
> > [3/5] ... remember page is on which list patch
> > [4/5] ... memory cgroup statistics patch
> > [5/5] ... show statistics patch
> > 
> > tested on x86-64/fake-NUMA + CONFIG_PREEMPT=y/n (for testing preempt_disable())
> > 
> > Any comments are welcome.
> > 
> 
> Hi, KAMEZAWA-San,
> 
> I would prefer these patches to go in once the fixes that you've posted
> earlier have gone in (the migration fix series). I am yet to test the
> migration fix per-se, but the series seemed quite fine to me. Andrew
> could you please pick it up.

It's in my backlog somewhere.  I'm not paying much attention to things
which don't look like 2.6.24 material at present...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [0/5] intro
  2007-10-16 18:28   ` Andrew Morton
@ 2007-10-16 18:40     ` Balbir Singh
  2007-10-17  5:19     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 29+ messages in thread
From: Balbir Singh @ 2007-10-16 18:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kamezawa.hiroyu, linux-mm, containers, yamamoto

Andrew Morton wrote:
> On Tue, 16 Oct 2007 23:50:28 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>>> [1/5] ... force_empty patch
>>> [2/5] ... remember page is charged as page-cache patch
>>> [3/5] ... remember page is on which list patch
>>> [4/5] ... memory cgroup statistics patch
>>> [5/5] ... show statistics patch
>>>
>>> tested on x86-64/fake-NUMA + CONFIG_PREEMPT=y/n (for testing preempt_disable())
>>>
>>> Any comments are welcome.
>>>
>> Hi, KAMEZAWA-San,
>>
>> I would prefer these patches to go in once the fixes that you've posted
>> earlier have gone in (the migration fix series). I am yet to test the
>> migration fix per-se, but the series seemed quite fine to me. Andrew
>> could you please pick it up.
> 
> It's in my backlog somewhere.  I'm not paying much attention to things
> which don't look like 2.6.24 material at present...

Aah.. That makes sense.. Thanks for clarifying.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [1/5]  force_empty for memory cgroup
  2007-10-16 10:23 ` [PATCH] memory cgroup enhancements [1/5] force_empty for memory cgroup KAMEZAWA Hiroyuki
@ 2007-10-17  4:17   ` David Rientjes
  2007-10-17  5:05     ` Balbir Singh
  2007-10-17  5:16     ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 29+ messages in thread
From: David Rientjes @ 2007-10-17  4:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

On Tue, 16 Oct 2007, KAMEZAWA Hiroyuki wrote:

> This patch adds an interface "memory.force_empty".
> Any write to this file will drop all charges in this cgroup if
> there is no task under.
> 
> %echo 1 > /....../memory.force_empty
> 
> will drop all charges of memory cgroup if cgroup's tasks is empty.
> 
> This is useful to invoke rmdir() against memory cgroup successfully.
> 

If there's no tasks in the cgroup, then how can there be any charges 
against its memory controller?  Is the memory not being uncharged when a 
task exits or is moved to another cgroup?

If the only use of this is for rmdir, why not just make it part of the 
rmdir operation on the memory cgroup if there are no tasks by default?

> Tested and worked well on x86_64/fake-NUMA system.
> 
> Changelog v3 -> v4:
>   - adjusted to 2.6.23-mm1
>   - fixed typo
>   - changes buf[2]="0" to static const
> Changelog v2 -> v3:
>   - changed the name from force_reclaim to force_empty.
> Changelog v1 -> v2:
>   - added a new interface force_reclaim.
>   - changes spin_lock to spin_lock_irqsave().
> 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> 
>  mm/memcontrol.c |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 95 insertions(+), 7 deletions(-)
> 
> Index: devel-2.6.23-mm1/mm/memcontrol.c
> ===================================================================
> --- devel-2.6.23-mm1.orig/mm/memcontrol.c
> +++ devel-2.6.23-mm1/mm/memcontrol.c
> @@ -480,6 +480,7 @@ void mem_cgroup_uncharge(struct page_cgr
>  		page = pc->page;
>  		/*
>  		 * get page->cgroup and clear it under lock.
> +		 * force_empty can drop page->cgroup without checking refcnt.
>  		 */
>  		if (clear_page_cgroup(page, pc) == pc) {
>  			mem = pc->mem_cgroup;
> @@ -489,13 +490,6 @@ void mem_cgroup_uncharge(struct page_cgr
>  			list_del_init(&pc->lru);
>  			spin_unlock_irqrestore(&mem->lru_lock, flags);
>  			kfree(pc);
> -		} else {
> -			/*
> -			 * Note:This will be removed when force-empty patch is
> -			 * applied. just show warning here.
> -			 */
> -			printk(KERN_ERR "Race in mem_cgroup_uncharge() ?");
> -			dump_stack();

Unrelated change?

>  		}
>  	}
>  }
> @@ -543,6 +537,70 @@ retry:
>  	return;
>  }
>  
> +/*
> + * This routine traverse page_cgroup in given list and drop them all.
> + * This routine ignores page_cgroup->ref_cnt.
> + * *And* this routine doesn't relcaim page itself, just removes page_cgroup.

Reclaim?

> + */
> +static void
> +mem_cgroup_force_empty_list(struct mem_cgroup *mem, struct list_head *list)
> +{
> +	struct page_cgroup *pc;
> +	struct page *page;
> +	int count = SWAP_CLUSTER_MAX;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&mem->lru_lock, flags);
> +
> +	while (!list_empty(list)) {
> +		pc = list_entry(list->prev, struct page_cgroup, lru);
> +		page = pc->page;
> +		/* Avoid race with charge */
> +		atomic_set(&pc->ref_cnt, 0);

Don't you need {lock,unlock}_page_cgroup(page) here to avoid a true race 
with mem_cgroup_charge() right after you set pc->ref_cnt to 0 here?

> +		if (clear_page_cgroup(page, pc) == pc) {
> +			css_put(&mem->css);
> +			res_counter_uncharge(&mem->res, PAGE_SIZE);
> +			list_del_init(&pc->lru);
> +			kfree(pc);
> +		} else
> +			count = 1; /* being uncharged ? ...do relax */
> +
> +		if (--count == 0) {
> +			spin_unlock_irqrestore(&mem->lru_lock, flags);
> +			cond_resched();
> +			spin_lock_irqsave(&mem->lru_lock, flags);

Instead of this hack, it's probably easier to just goto a label placed 
before spin_lock_irqsave() at the top of the function.

> +			count = SWAP_CLUSTER_MAX;
> +		}
> +	}
> +	spin_unlock_irqrestore(&mem->lru_lock, flags);
> +}
> +
> +/*
> + * make mem_cgroup's charge to be 0 if there is no binded task.
> + * This enables deleting this mem_cgroup.
> + */
> +
> +int mem_cgroup_force_empty(struct mem_cgroup *mem)
> +{
> +	int ret = -EBUSY;
> +	css_get(&mem->css);
> +	while (!list_empty(&mem->active_list) ||
> +	       !list_empty(&mem->inactive_list)) {
> +		if (atomic_read(&mem->css.cgroup->count) > 0)
> +			goto out;
> +		/* drop all page_cgroup in active_list */
> +		mem_cgroup_force_empty_list(mem, &mem->active_list);
> +		/* drop all page_cgroup in inactive_list */
> +		mem_cgroup_force_empty_list(mem, &mem->inactive_list);
> +	}

This implementation as a while loop looks very suspect since 
mem_cgroup_force_empty_list() uses while (!list_empty(list)) as well.  
Perhaps it's just easier here as

	if (list_empty(&mem->active_list) && list_empty(&mem->inactive_list))
		return 0;

> +	ret = 0;
> +out:
> +	css_put(&mem->css);
> +	return ret;
> +}
> +
> +
> +
>  int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
>  {
>  	*tmp = memparse(buf, &buf);
> @@ -628,6 +686,31 @@ static ssize_t mem_control_type_read(str
>  			ppos, buf, s - buf);
>  }
>  
> +
> +static ssize_t mem_force_empty_write(struct cgroup *cont,
> +				struct cftype *cft, struct file *file,
> +				const char __user *userbuf,
> +				size_t nbytes, loff_t *ppos)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +	int ret;
> +	ret = mem_cgroup_force_empty(mem);
> +	if (!ret)
> +		ret = nbytes;
> +	return ret;
> +}
> +
> +static ssize_t mem_force_empty_read(struct cgroup *cont,
> +				struct cftype *cft,
> +				struct file *file, char __user *userbuf,
> +				size_t nbytes, loff_t *ppos)
> +{
> +	static const char buf[2] = "0";
> +	return simple_read_from_buffer((void __user *)userbuf, nbytes,
> +			ppos, buf, strlen(buf));

Reading memory.force_empty is pretty useless, so why allow it to be read 
at all?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [3/5] record pc is on active list
  2007-10-16 10:26 ` [PATCH] memory cgroup enhancements [3/5] record pc is on active list KAMEZAWA Hiroyuki
@ 2007-10-17  4:17   ` David Rientjes
  2007-10-17  5:16     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2007-10-17  4:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

On Tue, 16 Oct 2007, KAMEZAWA Hiroyuki wrote:

> Remember page_cgroup is on active_list or not in page_cgroup->flags.
> 
> Against 2.6.23-mm1.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
> 
>  mm/memcontrol.c |   12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> Index: devel-2.6.23-mm1/mm/memcontrol.c
> ===================================================================
> --- devel-2.6.23-mm1.orig/mm/memcontrol.c
> +++ devel-2.6.23-mm1/mm/memcontrol.c
> @@ -85,6 +85,7 @@ struct page_cgroup {
>  					/* mapped and cached states     */
>  	int	 flags;
>  #define PCGF_PAGECACHE		(0x1)	/* charged as page-cache */
> +#define PCGF_ACTIVE		(0x2)	/* this is on cgroup's active list */

Please move these flag #defines out of the struct definition.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [1/5]  force_empty for memory cgroup
  2007-10-17  4:17   ` David Rientjes
@ 2007-10-17  5:05     ` Balbir Singh
  2007-10-17  5:26       ` KAMEZAWA Hiroyuki
  2007-10-17  5:16     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2007-10-17  5:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, containers@lists.osdl.org,
	yamamoto@valinux.co.jp

David Rientjes wrote:
> On Tue, 16 Oct 2007, KAMEZAWA Hiroyuki wrote:
> 
>> This patch adds an interface "memory.force_empty".
>> Any write to this file will drop all charges in this cgroup if
>> there is no task under.
>>
>> %echo 1 > /....../memory.force_empty
>>
>> will drop all charges of memory cgroup if cgroup's tasks is empty.
>>
>> This is useful to invoke rmdir() against memory cgroup successfully.
>>
> 
> If there's no tasks in the cgroup, then how can there be any charges 
> against its memory controller?  Is the memory not being uncharged when a 
> task exits or is moved to another cgroup?
> 

David,

Since we account even for page and swap cache, there might be pages left
over even after all tasks are gone.

> If the only use of this is for rmdir, why not just make it part of the 
> rmdir operation on the memory cgroup if there are no tasks by default?
> 

That's a good idea, but sometimes an administrator might want to force
a cgroup empty and start fresh without necessary deleting the cgroup.

>> Tested and worked well on x86_64/fake-NUMA system.
>>
>> Changelog v3 -> v4:
>>   - adjusted to 2.6.23-mm1
>>   - fixed typo
>>   - changes buf[2]="0" to static const
>> Changelog v2 -> v3:
>>   - changed the name from force_reclaim to force_empty.
>> Changelog v1 -> v2:
>>   - added a new interface force_reclaim.
>>   - changes spin_lock to spin_lock_irqsave().
>>
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>>
>>  mm/memcontrol.c |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 95 insertions(+), 7 deletions(-)
>>
>> Index: devel-2.6.23-mm1/mm/memcontrol.c
>> ===================================================================
>> --- devel-2.6.23-mm1.orig/mm/memcontrol.c
>> +++ devel-2.6.23-mm1/mm/memcontrol.c
>> @@ -480,6 +480,7 @@ void mem_cgroup_uncharge(struct page_cgr
>>  		page = pc->page;
>>  		/*
>>  		 * get page->cgroup and clear it under lock.
>> +		 * force_empty can drop page->cgroup without checking refcnt.
>>  		 */
>>  		if (clear_page_cgroup(page, pc) == pc) {
>>  			mem = pc->mem_cgroup;
>> @@ -489,13 +490,6 @@ void mem_cgroup_uncharge(struct page_cgr
>>  			list_del_init(&pc->lru);
>>  			spin_unlock_irqrestore(&mem->lru_lock, flags);
>>  			kfree(pc);
>> -		} else {
>> -			/*
>> -			 * Note:This will be removed when force-empty patch is
>> -			 * applied. just show warning here.
>> -			 */
>> -			printk(KERN_ERR "Race in mem_cgroup_uncharge() ?");
>> -			dump_stack();
> 
> Unrelated change?
> 

Yes

>>  		}
>>  	}
>>  }
>> @@ -543,6 +537,70 @@ retry:
>>  	return;
>>  }
>>  
>> +/*
>> + * This routine traverse page_cgroup in given list and drop them all.
>> + * This routine ignores page_cgroup->ref_cnt.
>> + * *And* this routine doesn't relcaim page itself, just removes page_cgroup.
> 
> Reclaim?
> 
>> + */
>> +static void
>> +mem_cgroup_force_empty_list(struct mem_cgroup *mem, struct list_head *list)
>> +{
>> +	struct page_cgroup *pc;
>> +	struct page *page;
>> +	int count = SWAP_CLUSTER_MAX;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&mem->lru_lock, flags);
>> +
>> +	while (!list_empty(list)) {
>> +		pc = list_entry(list->prev, struct page_cgroup, lru);
>> +		page = pc->page;
>> +		/* Avoid race with charge */
>> +		atomic_set(&pc->ref_cnt, 0);
> 
> Don't you need {lock,unlock}_page_cgroup(page) here to avoid a true race 
> with mem_cgroup_charge() right after you set pc->ref_cnt to 0 here?
> 

I think clear_page_cgroup() below checks for a possible race.

>> +		if (clear_page_cgroup(page, pc) == pc) {
>> +			css_put(&mem->css);
>> +			res_counter_uncharge(&mem->res, PAGE_SIZE);
>> +			list_del_init(&pc->lru);
>> +			kfree(pc);
>> +		} else
>> +			count = 1; /* being uncharged ? ...do relax */
>> +
>> +		if (--count == 0) {
>> +			spin_unlock_irqrestore(&mem->lru_lock, flags);
>> +			cond_resched();
>> +			spin_lock_irqsave(&mem->lru_lock, flags);
> 
> Instead of this hack, it's probably easier to just goto a label placed 
> before spin_lock_irqsave() at the top of the function.
> 

I am not convinced of this hack either, specially the statement of
setting count to SWAP_CLUSTER_MAX.

>> +			count = SWAP_CLUSTER_MAX;
>> +		}
>> +	}
>> +	spin_unlock_irqrestore(&mem->lru_lock, flags);
>> +}
>> +
>> +/*
>> + * make mem_cgroup's charge to be 0 if there is no binded task.
>> + * This enables deleting this mem_cgroup.
>> + */
>> +
>> +int mem_cgroup_force_empty(struct mem_cgroup *mem)
>> +{
>> +	int ret = -EBUSY;
>> +	css_get(&mem->css);
>> +	while (!list_empty(&mem->active_list) ||
>> +	       !list_empty(&mem->inactive_list)) {
>> +		if (atomic_read(&mem->css.cgroup->count) > 0)
>> +			goto out;
>> +		/* drop all page_cgroup in active_list */
>> +		mem_cgroup_force_empty_list(mem, &mem->active_list);
>> +		/* drop all page_cgroup in inactive_list */
>> +		mem_cgroup_force_empty_list(mem, &mem->inactive_list);
>> +	}
> 
> This implementation as a while loop looks very suspect since 
> mem_cgroup_force_empty_list() uses while (!list_empty(list)) as well.  
> Perhaps it's just easier here as
> 
> 	if (list_empty(&mem->active_list) && list_empty(&mem->inactive_list))
> 		return 0;
> 

Do we VM_BUG_ON() in case the lists are not empty after calling
mem_cgroup_force_empty_list()

>> +	ret = 0;
>> +out:
>> +	css_put(&mem->css);
>> +	return ret;
>> +}
>> +
>> +
>> +
>>  int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
>>  {
>>  	*tmp = memparse(buf, &buf);
>> @@ -628,6 +686,31 @@ static ssize_t mem_control_type_read(str
>>  			ppos, buf, s - buf);
>>  }
>>  
>> +
>> +static ssize_t mem_force_empty_write(struct cgroup *cont,
>> +				struct cftype *cft, struct file *file,
>> +				const char __user *userbuf,
>> +				size_t nbytes, loff_t *ppos)
>> +{
>> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>> +	int ret;
>> +	ret = mem_cgroup_force_empty(mem);
>> +	if (!ret)
>> +		ret = nbytes;
>> +	return ret;
>> +}
>> +
>> +static ssize_t mem_force_empty_read(struct cgroup *cont,
>> +				struct cftype *cft,
>> +				struct file *file, char __user *userbuf,
>> +				size_t nbytes, loff_t *ppos)
>> +{
>> +	static const char buf[2] = "0";
>> +	return simple_read_from_buffer((void __user *)userbuf, nbytes,
>> +			ppos, buf, strlen(buf));
> 
> Reading memory.force_empty is pretty useless, so why allow it to be read 
> at all?

I agree, this is not required. I wonder if we could set permissions at
group level to mark this file as *write only*. We could use the new
read_uint and write_uint callbacks for reading/writing integers.


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [1/5]  force_empty for memory cgroup
  2007-10-17  4:17   ` David Rientjes
  2007-10-17  5:05     ` Balbir Singh
@ 2007-10-17  5:16     ` KAMEZAWA Hiroyuki
  2007-10-17  5:38       ` David Rientjes
  1 sibling, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-17  5:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

On Tue, 16 Oct 2007 21:17:06 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> > This is useful to invoke rmdir() against memory cgroup successfully.
> > 
> 
> If there's no tasks in the cgroup, then how can there be any charges 
> against its memory controller?  Is the memory not being uncharged when a 
> task exits or is moved to another cgroup?
Page Caches are charged as page cache. (even if not mmmaped)
So, empty cgroup can maintain some amount of charges as page cache.

> 
> If the only use of this is for rmdir, why not just make it part of the 
> rmdir operation on the memory cgroup if there are no tasks by default?
> 
This patch is just a step for that. I myself think your way is better.
If we can make a concensus and good code, we can do this in automatic way
by rmdir().
BTW, personally, I uses this *force_empty* for test memory cgroup.
(like drop_caches)


> > Tested and worked well on x86_64/fake-NUMA system.
> > 
> > Changelog v3 -> v4:
> >   - adjusted to 2.6.23-mm1
> >   - fixed typo
> >   - changes buf[2]="0" to static const
> > Changelog v2 -> v3:
> >   - changed the name from force_reclaim to force_empty.
> > Changelog v1 -> v2:
> >   - added a new interface force_reclaim.
> >   - changes spin_lock to spin_lock_irqsave().
> > 
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > 
> >  mm/memcontrol.c |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 95 insertions(+), 7 deletions(-)
> > 
> > Index: devel-2.6.23-mm1/mm/memcontrol.c
> > ===================================================================
> > --- devel-2.6.23-mm1.orig/mm/memcontrol.c
> > +++ devel-2.6.23-mm1/mm/memcontrol.c
> > @@ -480,6 +480,7 @@ void mem_cgroup_uncharge(struct page_cgr
> >  		page = pc->page;
> >  		/*
> >  		 * get page->cgroup and clear it under lock.
> > +		 * force_empty can drop page->cgroup without checking refcnt.
> >  		 */
> >  		if (clear_page_cgroup(page, pc) == pc) {
> >  			mem = pc->mem_cgroup;
> > @@ -489,13 +490,6 @@ void mem_cgroup_uncharge(struct page_cgr
> >  			list_del_init(&pc->lru);
> >  			spin_unlock_irqrestore(&mem->lru_lock, flags);
> >  			kfree(pc);
> > -		} else {
> > -			/*
> > -			 * Note:This will be removed when force-empty patch is
> > -			 * applied. just show warning here.
> > -			 */
> > -			printk(KERN_ERR "Race in mem_cgroup_uncharge() ?");
> > -			dump_stack();
> 
> Unrelated change?
> 
No. force_reclaim causes this race. but not BUG. so printk is removed.


> >  		}
> >  	}
> >  }
> > @@ -543,6 +537,70 @@ retry:
> >  	return;
> >  }
> >  
> > +/*
> > + * This routine traverse page_cgroup in given list and drop them all.
> > + * This routine ignores page_cgroup->ref_cnt.
> > + * *And* this routine doesn't relcaim page itself, just removes page_cgroup.
> 
> Reclaim?
> 
yes..

> > + */
> > +static void
> > +mem_cgroup_force_empty_list(struct mem_cgroup *mem, struct list_head *list)
> > +{
> > +	struct page_cgroup *pc;
> > +	struct page *page;
> > +	int count = SWAP_CLUSTER_MAX;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&mem->lru_lock, flags);
> > +
> > +	while (!list_empty(list)) {
> > +		pc = list_entry(list->prev, struct page_cgroup, lru);
> > +		page = pc->page;
> > +		/* Avoid race with charge */
> > +		atomic_set(&pc->ref_cnt, 0);
> 
> Don't you need {lock,unlock}_page_cgroup(page) here to avoid a true race 
> with mem_cgroup_charge() right after you set pc->ref_cnt to 0 here?
> 
My patch (already posted) add check like this in mem_cgroup_charge
--
retry:
        lock_page_cgroup(page);
        pc = page_get_page_cgroup(page);
        /*
         * The page_cgroup exists and the page has already been accounted
         */
        if (pc) {
                if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
                        /* this page is under being uncharged ? */
                        unlock_page_cgroup(page);
                        cpu_relax();
                        goto retry;
                } else {
                        unlock_page_cgroup(page);
                        goto done;
                }
        }
--
and clear_page_cgroup() takes a lock.
Then,
   - If atoimc_set(&pc->ref_cnt, 0) goes before mem_cgroup_charge()'s
     atomic_inc_not_zero(&pc->ref_cnt), there is no race.
   - If atomic_inc_not_zero(&pc->ref_cnt) goes before atomic_set(),
     just ignore increment and drop this charge.


> > +		if (clear_page_cgroup(page, pc) == pc) {
> > +			css_put(&mem->css);
> > +			res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +			list_del_init(&pc->lru);
> > +			kfree(pc);
> > +		} else
> > +			count = 1; /* being uncharged ? ...do relax */
> > +
> > +		if (--count == 0) {
> > +			spin_unlock_irqrestore(&mem->lru_lock, flags);
> > +			cond_resched();
> > +			spin_lock_irqsave(&mem->lru_lock, flags);
> 
> Instead of this hack, it's probably easier to just goto a label placed 
> before spin_lock_irqsave() at the top of the function.
> 
Hmm, ok. I'll consider this again.


> > +			count = SWAP_CLUSTER_MAX;
> > +		}
> > +	}
> > +	spin_unlock_irqrestore(&mem->lru_lock, flags);
> > +}
> > +
> > +/*
> > + * make mem_cgroup's charge to be 0 if there is no binded task.
> > + * This enables deleting this mem_cgroup.
> > + */
> > +
> > +int mem_cgroup_force_empty(struct mem_cgroup *mem)
> > +{
> > +	int ret = -EBUSY;
> > +	css_get(&mem->css);
> > +	while (!list_empty(&mem->active_list) ||
> > +	       !list_empty(&mem->inactive_list)) {
> > +		if (atomic_read(&mem->css.cgroup->count) > 0)
> > +			goto out;
> > +		/* drop all page_cgroup in active_list */
> > +		mem_cgroup_force_empty_list(mem, &mem->active_list);
> > +		/* drop all page_cgroup in inactive_list */
> > +		mem_cgroup_force_empty_list(mem, &mem->inactive_list);
> > +	}
> 
> This implementation as a while loop looks very suspect since 
> mem_cgroup_force_empty_list() uses while (!list_empty(list)) as well.  
> Perhaps it's just easier here as
> 
> 	if (list_empty(&mem->active_list) && list_empty(&mem->inactive_list))
> 		return 0;
> 
will fix.


> > +	ret = 0;
> > +out:
> > +	css_put(&mem->css);
> > +	return ret;
> > +}
> > +
> > +
> > +
> >  int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
> >  {
> >  	*tmp = memparse(buf, &buf);
> > @@ -628,6 +686,31 @@ static ssize_t mem_control_type_read(str
> >  			ppos, buf, s - buf);
> >  }
> >  
> > +
> > +static ssize_t mem_force_empty_write(struct cgroup *cont,
> > +				struct cftype *cft, struct file *file,
> > +				const char __user *userbuf,
> > +				size_t nbytes, loff_t *ppos)
> > +{
> > +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> > +	int ret;
> > +	ret = mem_cgroup_force_empty(mem);
> > +	if (!ret)
> > +		ret = nbytes;
> > +	return ret;
> > +}
> > +
> > +static ssize_t mem_force_empty_read(struct cgroup *cont,
> > +				struct cftype *cft,
> > +				struct file *file, char __user *userbuf,
> > +				size_t nbytes, loff_t *ppos)
> > +{
> > +	static const char buf[2] = "0";
> > +	return simple_read_from_buffer((void __user *)userbuf, nbytes,
> > +			ppos, buf, strlen(buf));
> 
> Reading memory.force_empty is pretty useless, so why allow it to be read 
> at all?

Ah, yes. How can I make this as read-only ? just remove .read ?

Thanks,
-Kame 





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [3/5] record pc is on active list
  2007-10-17  4:17   ` David Rientjes
@ 2007-10-17  5:16     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-17  5:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

On Tue, 16 Oct 2007 21:17:24 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 16 Oct 2007, KAMEZAWA Hiroyuki wrote:
> 
> > Remember page_cgroup is on active_list or not in page_cgroup->flags.
> > 
> > Against 2.6.23-mm1.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
> > 
> >  mm/memcontrol.c |   12 ++++++++----
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> > 
> > Index: devel-2.6.23-mm1/mm/memcontrol.c
> > ===================================================================
> > --- devel-2.6.23-mm1.orig/mm/memcontrol.c
> > +++ devel-2.6.23-mm1/mm/memcontrol.c
> > @@ -85,6 +85,7 @@ struct page_cgroup {
> >  					/* mapped and cached states     */
> >  	int	 flags;
> >  #define PCGF_PAGECACHE		(0x1)	/* charged as page-cache */
> > +#define PCGF_ACTIVE		(0x2)	/* this is on cgroup's active list */
> 
> Please move these flag #defines out of the struct definition.
> 
ok

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [0/5] intro
  2007-10-16 18:28   ` Andrew Morton
  2007-10-16 18:40     ` Balbir Singh
@ 2007-10-17  5:19     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-17  5:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: balbir, linux-mm, containers, yamamoto

On Tue, 16 Oct 2007 11:28:43 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > I would prefer these patches to go in once the fixes that you've posted
> > earlier have gone in (the migration fix series). I am yet to test the
> > migration fix per-se, but the series seemed quite fine to me. Andrew
> > could you please pick it up.
> 
> It's in my backlog somewhere.  I'm not paying much attention to things
> which don't look like 2.6.24 material at present...
> 
Ah, ok. I'll wait and keep this set as RFC for a while.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [1/5]  force_empty for memory cgroup
  2007-10-17  5:05     ` Balbir Singh
@ 2007-10-17  5:26       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-17  5:26 UTC (permalink / raw)
  To: balbir
  Cc: David Rientjes, linux-mm@kvack.org, containers@lists.osdl.org,
	yamamoto@valinux.co.jp

On Wed, 17 Oct 2007 10:35:58 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > If the only use of this is for rmdir, why not just make it part of the 
> > rmdir operation on the memory cgroup if there are no tasks by default?
> > 
> 
> That's a good idea, but sometimes an administrator might want to force
> a cgroup empty and start fresh without necessary deleting the cgroup.
> 
I'll make a "automatic force_empty at rmdir()" patch as another patch depends
on this. If we make concensus that "force_empty interface is redundant", I'll
remove it later.


> I am not convinced of this hack either, specially the statement of
> setting count to SWAP_CLUSTER_MAX.
> 
Just because I think there should be "unlock and rest" in this busy loop,
I need some number. Should I define other number ?
as
#define FORCE_RECALIM_BATCH	(128)


> >> +		/* drop all page_cgroup in inactive_list */
> >> +		mem_cgroup_force_empty_list(mem, &mem->inactive_list);
> >> +	}
> > 
> > This implementation as a while loop looks very suspect since 
> > mem_cgroup_force_empty_list() uses while (!list_empty(list)) as well.  
> > Perhaps it's just easier here as
> > 
> > 	if (list_empty(&mem->active_list) && list_empty(&mem->inactive_list))
> > 		return 0;
> > 
> 
> Do we VM_BUG_ON() in case the lists are not empty after calling
> mem_cgroup_force_empty_list()
>
Okay, I will add.
 

> > Reading memory.force_empty is pretty useless, so why allow it to be read 
> > at all?
> 
> I agree, this is not required. I wonder if we could set permissions at
> group level to mark this file as *write only*. We could use the new
> read_uint and write_uint callbacks for reading/writing integers.
> 
ok, will remove.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [1/5]  force_empty for memory cgroup
  2007-10-17  5:16     ` KAMEZAWA Hiroyuki
@ 2007-10-17  5:38       ` David Rientjes
  2007-10-17  5:50         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2007-10-17  5:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

On Wed, 17 Oct 2007, KAMEZAWA Hiroyuki wrote:

> > > +static ssize_t mem_force_empty_read(struct cgroup *cont,
> > > +				struct cftype *cft,
> > > +				struct file *file, char __user *userbuf,
> > > +				size_t nbytes, loff_t *ppos)
> > > +{
> > > +	static const char buf[2] = "0";
> > > +	return simple_read_from_buffer((void __user *)userbuf, nbytes,
> > > +			ppos, buf, strlen(buf));
> > 
> > Reading memory.force_empty is pretty useless, so why allow it to be read 
> > at all?
> 
> Ah, yes. How can I make this as read-only ? just remove .read ?
> 
> Thanks,

You mean make it write-only?  Typically it would be as easy as only 
specifying a mode of S_IWUSR so that it can only be written to, the 
S_IFREG is already provided by cgroup_add_file().

Unfortunately, cgroups do not appear to allow that.  It hardcodes
the permissions of 0644 | S_IFREG into the cgroup_create_file() call from 
cgroup_add_file(), which is a bug.  Cgroup files should be able to be 
marked as read-only or write-only depending on their semantics.

So until that bug gets fixed and you're allowed to pass your own file 
modes to cgroup_add_files(), you'll have to provide the read function.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] memory cgroup enhancements [1/5]  force_empty for memory cgroup
  2007-10-17  5:38       ` David Rientjes
@ 2007-10-17  5:50         ` KAMEZAWA Hiroyuki
  2007-10-17  7:09           ` about page migration on UMA Jacky(GuangXiang  Lee)
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-17  5:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm@kvack.org, containers@lists.osdl.org,
	balbir@linux.vnet.ibm.com, yamamoto@valinux.co.jp

On Tue, 16 Oct 2007 22:38:18 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> > Thanks,
> 
> You mean make it write-only?  Typically it would be as easy as only 
> specifying a mode of S_IWUSR so that it can only be written to, the 
> S_IFREG is already provided by cgroup_add_file().
> 
> Unfortunately, cgroups do not appear to allow that.  It hardcodes
> the permissions of 0644 | S_IFREG into the cgroup_create_file() call from 
> cgroup_add_file(), which is a bug.  Cgroup files should be able to be 
> marked as read-only or write-only depending on their semantics.
> 
> So until that bug gets fixed and you're allowed to pass your own file 
> modes to cgroup_add_files(), you'll have to provide the read function.
> 
Hmm. it seems I have to read current cgroup code more.
Thank you for advice.

Regards,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: about page migration on UMA
  2007-10-17  7:09           ` about page migration on UMA Jacky(GuangXiang  Lee)
@ 2007-10-17  6:44             ` KAMEZAWA Hiroyuki
  2007-10-19  1:26             ` Christoph Lameter
  1 sibling, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-10-17  6:44 UTC (permalink / raw)
  To: Jacky(GuangXiang  Lee); +Cc: climeter, linux-mm

On Wed, 17 Oct 2007 15:09:19 +0800
"Jacky(GuangXiang  Lee)" <gxli@arca.com.cn> wrote:

> seems page migration is used mostly for NUMA platform to improve
> performance.
> But in a UMA architecture, Is it possible to use page migration to move
> pages ?
> 

Currently no usage. but someone  (Mel ?) will use migration code to do 
page defragmentation. In that case, migration will be used.
For memory hot remove, we will need to enable migration on UMA.
(but I don't hear requests to do that...)

But, sys_move_pages() will not work on UMA.

Thanks,
-Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* about page migration on UMA
  2007-10-17  5:50         ` KAMEZAWA Hiroyuki
@ 2007-10-17  7:09           ` Jacky(GuangXiang  Lee)
  2007-10-17  6:44             ` KAMEZAWA Hiroyuki
  2007-10-19  1:26             ` Christoph Lameter
  0 siblings, 2 replies; 29+ messages in thread
From: Jacky(GuangXiang  Lee) @ 2007-10-17  7:09 UTC (permalink / raw)
  To: climeter; +Cc: linux-mm

seems page migration is used mostly for NUMA platform to improve
performance.
But in a UMA architecture, Is it possible to use page migration to move
pages ?

----- Original Message ----- 
From: "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
To: "David Rientjes" <rientjes@google.com>
Cc: <linux-mm@kvack.org>; <containers@lists.osdl.org>;
<balbir@linux.vnet.ibm.com>; <yamamoto@valinux.co.jp>
Sent: Wednesday, October 17, 2007 1:50 PM
Subject: Re: [PATCH] memory cgroup enhancements [1/5] force_empty for memory
cgroup


> On Tue, 16 Oct 2007 22:38:18 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> > > Thanks,
> >
> > You mean make it write-only?  Typically it would be as easy as only
> > specifying a mode of S_IWUSR so that it can only be written to, the
> > S_IFREG is already provided by cgroup_add_file().
> >
> > Unfortunately, cgroups do not appear to allow that.  It hardcodes
> > the permissions of 0644 | S_IFREG into the cgroup_create_file() call
from
> > cgroup_add_file(), which is a bug.  Cgroup files should be able to be
> > marked as read-only or write-only depending on their semantics.
> >
> > So until that bug gets fixed and you're allowed to pass your own file
> > modes to cgroup_add_files(), you'll have to provide the read function.
> >
> Hmm. it seems I have to read current cgroup code more.
> Thank you for advice.
>
> Regards,
> -Kame
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: about page migration on UMA
  2007-10-17  7:09           ` about page migration on UMA Jacky(GuangXiang  Lee)
  2007-10-17  6:44             ` KAMEZAWA Hiroyuki
@ 2007-10-19  1:26             ` Christoph Lameter
  2007-11-09 19:31               ` Jared Hulbert
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2007-10-19  1:26 UTC (permalink / raw)
  To: Jacky(GuangXiang  Lee); +Cc: climeter, linux-mm

On Wed, 17 Oct 2007, Jacky(GuangXiang  Lee) wrote:

> seems page migration is used mostly for NUMA platform to improve
> performance.
> But in a UMA architecture, Is it possible to use page migration to move
> pages ?

Yes. Just one up with a usage for it. The page migration mechanism itself 
is not NUMA dependent.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: about page migration on UMA
  2007-10-19  1:26             ` Christoph Lameter
@ 2007-11-09 19:31               ` Jared Hulbert
  2007-11-09 19:36                 ` Christoph Lameter
  2007-11-14  4:31                 ` Jacky(GuangXiang Lee)
  0 siblings, 2 replies; 29+ messages in thread
From: Jared Hulbert @ 2007-11-09 19:31 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Jacky(GuangXiang Lee), climeter, linux-mm

On 10/18/07, Christoph Lameter <clameter@sgi.com> wrote:
> On Wed, 17 Oct 2007, Jacky(GuangXiang  Lee) wrote:
>
> > seems page migration is used mostly for NUMA platform to improve
> > performance.
> > But in a UMA architecture, Is it possible to use page migration to move
> > pages ?
>
> Yes. Just one up with a usage for it. The page migration mechanism itself
> is not NUMA dependent.

For extreme low power systems it would be possible to shut down banks
in SDRAM chips that were not full thereby saving power.  That would
require some defraging and migration to empty them prior to powering
down those banks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: about page migration on UMA
  2007-11-09 19:31               ` Jared Hulbert
@ 2007-11-09 19:36                 ` Christoph Lameter
  2007-11-09 19:54                   ` Jared Hulbert
  2007-11-14  4:31                 ` Jacky(GuangXiang Lee)
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2007-11-09 19:36 UTC (permalink / raw)
  To: Jared Hulbert; +Cc: Jacky(GuangXiang Lee), linux-mm

On Fri, 9 Nov 2007, Jared Hulbert wrote:

> For extreme low power systems it would be possible to shut down banks
> in SDRAM chips that were not full thereby saving power.  That would
> require some defraging and migration to empty them prior to powering
> down those banks.

Yes we have discussed ideas like that a couple of time. If you do have the 
time then please make this work. You have my full support.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: about page migration on UMA
  2007-11-09 19:36                 ` Christoph Lameter
@ 2007-11-09 19:54                   ` Jared Hulbert
  2007-11-09 19:58                     ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Jared Hulbert @ 2007-11-09 19:54 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Jacky(GuangXiang Lee), linux-mm

On 11/9/07, Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 9 Nov 2007, Jared Hulbert wrote:
>
> > For extreme low power systems it would be possible to shut down banks
> > in SDRAM chips that were not full thereby saving power.  That would
> > require some defraging and migration to empty them prior to powering
> > down those banks.
>
> Yes we have discussed ideas like that a couple of time. If you do have the
> time then please make this work. You have my full support.

So I would like to make this migration controlled by userspace.  Are
there mechanisms to allow that today?  If you give me a starting point
I'll look into something like this.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: about page migration on UMA
  2007-11-09 19:54                   ` Jared Hulbert
@ 2007-11-09 19:58                     ` Christoph Lameter
  2007-11-09 21:30                       ` Dave Hansen
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2007-11-09 19:58 UTC (permalink / raw)
  To: Jared Hulbert
  Cc: Jacky(GuangXiang Lee), linux-mm, Yasunori Goto, KAMEZAWA Hiroyuki

On Fri, 9 Nov 2007, Jared Hulbert wrote:

> On 11/9/07, Christoph Lameter <clameter@sgi.com> wrote:
> > On Fri, 9 Nov 2007, Jared Hulbert wrote:
> >
> > > For extreme low power systems it would be possible to shut down banks
> > > in SDRAM chips that were not full thereby saving power.  That would
> > > require some defraging and migration to empty them prior to powering
> > > down those banks.
> >
> > Yes we have discussed ideas like that a couple of time. If you do have the
> > time then please make this work. You have my full support.
> 
> So I would like to make this migration controlled by userspace.  Are
> there mechanisms to allow that today?  If you give me a starting point
> I'll look into something like this.

Well one idea is to generate a sysfs file that can take a physical memory 
range? echo the range to the sysfs file. The kernel can then try to 
vacate the memory range.

I am ccing the memory hotplug developers. They may be able to help you 
better.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: about page migration on UMA
  2007-11-09 19:58                     ` Christoph Lameter
@ 2007-11-09 21:30                       ` Dave Hansen
  2007-11-12  0:50                         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Hansen @ 2007-11-09 21:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jared Hulbert, Jacky(GuangXiang Lee), linux-mm, Yasunori Goto,
	KAMEZAWA Hiroyuki

On Fri, 2007-11-09 at 11:58 -0800, Christoph Lameter wrote:
> Well one idea is to generate a sysfs file that can take a physical memory 
> range? echo the range to the sysfs file. The kernel can then try to 
> vacate the memory range.

When memory hotplug is on, you should see memory broken up into
"sections", and exported in sysfs today:

	/sys/devices/system/memory

You can on/offline memory from there, and it should be pretty easy to
figure out the physical addresses with the phys_index file.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: about page migration on UMA
  2007-11-09 21:30                       ` Dave Hansen
@ 2007-11-12  0:50                         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-11-12  0:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christoph Lameter, Jared Hulbert, Jacky(GuangXiang Lee), linux-mm,
	Yasunori Goto

On Fri, 09 Nov 2007 13:30:50 -0800
Dave Hansen <haveblue@us.ibm.com> wrote:

> On Fri, 2007-11-09 at 11:58 -0800, Christoph Lameter wrote:
> > Well one idea is to generate a sysfs file that can take a physical memory 
> > range? echo the range to the sysfs file. The kernel can then try to 
> > vacate the memory range.
> 
> When memory hotplug is on, you should see memory broken up into
> "sections", and exported in sysfs today:
> 
> 	/sys/devices/system/memory
> 
> You can on/offline memory from there, and it should be pretty easy to
> figure out the physical addresses with the phys_index file.
> 

yes.

Using current memory-hotplug interface is the easiest way.

What you have to do in your arch is...

 1. support SPARSEMEM and define SECTION_SIZE to be suitable
 2. add Memory Hotplug config. see mm/Kconfig
 3. use ZONE_MOVABLE with boot option.
 4. maybe you have to find that you have to define your own walk_memory_resource()
    or not.

Maybe reading  Badari's patches in these days for ppc64 support will help you.
Contiguous memory-size memory can be offlined.

One problem I don't understand is memory hotplug cannot be used if
CONFIG_HIBERNATION is on. If this is problem to you, you may need
some work.

Regards,
 -Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: about page migration on UMA
  2007-11-09 19:31               ` Jared Hulbert
  2007-11-09 19:36                 ` Christoph Lameter
@ 2007-11-14  4:31                 ` Jacky(GuangXiang Lee)
  2007-11-14 17:05                   ` Jared Hulbert
  1 sibling, 1 reply; 29+ messages in thread
From: Jacky(GuangXiang Lee) @ 2007-11-14  4:31 UTC (permalink / raw)
  To: Jared Hulbert; +Cc: Christoph Lameter, climeter, linux-mm


Jared Hulbert a??e??:
> On 10/18/07, Christoph Lameter <clameter@sgi.com> wrote:
>   
>> On Wed, 17 Oct 2007, Jacky(GuangXiang  Lee) wrote:
>>
>>     
>>> seems page migration is used mostly for NUMA platform to improve
>>> performance.
>>> But in a UMA architecture, Is it possible to use page migration to move
>>> pages ?
>>>       
>> Yes. Just one up with a usage for it. The page migration mechanism itself
>> is not NUMA dependent.
>>     
>
> For extreme low power systems it would be possible to shut down banks
> in SDRAM chips that were not full thereby saving power.  That would
> require some defraging and migration to empty them prior to powering
> down those banks.
>
>   
what is the way to shut down banks in SDRAM chips?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: about page migration on UMA
  2007-11-14  4:31                 ` Jacky(GuangXiang Lee)
@ 2007-11-14 17:05                   ` Jared Hulbert
  0 siblings, 0 replies; 29+ messages in thread
From: Jared Hulbert @ 2007-11-14 17:05 UTC (permalink / raw)
  To: Jacky(GuangXiang Lee); +Cc: Christoph Lameter, climeter, linux-mm

> what is the way to shut down banks in SDRAM chips?

Well there's partial array self-refresh modes.  However at second
glance that doesn't actually lose data, it just requires a 'wake up'
period to access the partial array.  I had been under the impression
they were lossy modes.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2007-11-14 17:05 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-16 10:19 [PATCH] memory cgroup enhancements [0/5] intro KAMEZAWA Hiroyuki
2007-10-16 10:23 ` [PATCH] memory cgroup enhancements [1/5] force_empty for memory cgroup KAMEZAWA Hiroyuki
2007-10-17  4:17   ` David Rientjes
2007-10-17  5:05     ` Balbir Singh
2007-10-17  5:26       ` KAMEZAWA Hiroyuki
2007-10-17  5:16     ` KAMEZAWA Hiroyuki
2007-10-17  5:38       ` David Rientjes
2007-10-17  5:50         ` KAMEZAWA Hiroyuki
2007-10-17  7:09           ` about page migration on UMA Jacky(GuangXiang  Lee)
2007-10-17  6:44             ` KAMEZAWA Hiroyuki
2007-10-19  1:26             ` Christoph Lameter
2007-11-09 19:31               ` Jared Hulbert
2007-11-09 19:36                 ` Christoph Lameter
2007-11-09 19:54                   ` Jared Hulbert
2007-11-09 19:58                     ` Christoph Lameter
2007-11-09 21:30                       ` Dave Hansen
2007-11-12  0:50                         ` KAMEZAWA Hiroyuki
2007-11-14  4:31                 ` Jacky(GuangXiang Lee)
2007-11-14 17:05                   ` Jared Hulbert
2007-10-16 10:25 ` [PATCH] memory cgroup enhancements [2/5] remember charge as cache KAMEZAWA Hiroyuki
2007-10-16 10:26 ` [PATCH] memory cgroup enhancements [3/5] record pc is on active list KAMEZAWA Hiroyuki
2007-10-17  4:17   ` David Rientjes
2007-10-17  5:16     ` KAMEZAWA Hiroyuki
2007-10-16 10:27 ` [PATCH] memory cgroup enhancements [4/5] memory cgroup statistics KAMEZAWA Hiroyuki
2007-10-16 10:28 ` [PATCH] memory cgroup enhancements [5/5] show statistics by memory.stat file per cgroup KAMEZAWA Hiroyuki
2007-10-16 18:20 ` [PATCH] memory cgroup enhancements [0/5] intro Balbir Singh
2007-10-16 18:28   ` Andrew Morton
2007-10-16 18:40     ` Balbir Singh
2007-10-17  5:19     ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).