[PATCH 0/5] memcg: per cgroup background reclaim

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/5] memcg: per cgroup background reclaim
@ 2011-01-13 22:00 Ying Han
  2011-01-13 22:00 ` [PATCH 1/5] Add kswapd descriptor Ying Han
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Ying Han @ 2011-01-13 22:00 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo
  Cc: linux-mm

The current implementation of memcg only supports direct reclaim and this
patchset adds the support for background reclaim. Per cgroup background
reclaim is needed which spreads out the memory pressure over longer period
of time and smoothes out the system performance.

I run through some simple tests which reads/writes a large file and makes sure
it triggers per cgroup kswapd on the low_wmark. I compared at pg_steal/pg_scan
ratio w/o background reclaim. Also the running time is measured in this
patchset.

Step1: Create a cgroup with 500M memory_limit.
$ mount -t cgroup -o cpuset,memory cpuset /dev/cgroup
$ mkdir /dev/cgroup/A
$ echo 0 >/dev/cgroup/A/cpuset.cpus
$ echo 0 >/dev/cgroup/A/cpuset.mems
$ echo 500m >/dev/cgroup/A/memory.limit_in_bytes
$ echo $$ >/dev/cgroup/A/tasks

Step2: Check the wmarks.
$ cat /dev/cgroup/A/memory.reclaim_wmarks
low_wmark 3663360
high_wmark 4396032

Step3: Dirty the pages by creating a 20g file on hard drive.
$ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1

Checked the memory.stat w/o background reclaim. It used to be all the pages are
reclaimed from direct reclaim, and now most of the pages  are reclaimed at
background. (note: writing '0' to min_free_kbytes disables per cgroup kswapd)

Only direct reclaim                       With background reclaim:
pgpgin 5184374                            pgpgin 5248437
pgpgout 5056385                           pgpgout 5121659
kswapd_steal 0                            kswapd_steal 5121516
pg_pgsteal 5056363                        pg_pgsteal 32
kswapd_pgscan 0                           kswapd_pgscan 5121569
pg_scan 5056416                           pg_scan 32
pgrefill 297632                           pgrefill 312512
pgoutrun 0                                pgoutrun 107525
allocstall 158009                         allocstall 1

real 21m6.864s                            real 24m56.735s
user 0m2.047s                             user 0m2.331s
sys 6m2.572s                              sys  7m29.048s

Step4: Cleanup
$ echo $$ >/dev/cgroup/tasks
$ echo 1 > /dev/cgroup/A/memory.force_empty

Step5: Read the 20g file into the pagecache.
$ cat /export/hdc3/dd/tf0 > /dev/zero;

Checked the memory.stat w/o background reclaim. Most of clean pages are
reclaimed at background instead of direct reclaim.

Only direct reclaim                       With background reclaim:
pgpgin 5184066                            pgpgin 5184081
pgpgout 5056093                           pgpgout 5057185
kswapd_steal 0                            kswapd_steal 4960805
pg_pgsteal 5056063                        pg_pgsteal 96348
kswapd_pgscan 0                           kswapd_pgscan 4960809
pg_scan 5056064                           pg_scan 96348
pgrefill 0                                pgrefill 0
pgoutrun 0                                pgoutrun 54904
allocstall 158001                         allocstall 3010

real 3m13.034s                            real 3m13.074s
user 0m0.221s                             user 0m0.158s
sys  0m22.793s                            sys  0m38.603s

TODO:
1. Keep debugging the crash isse on NUMA machine.
2. Generate more test cases and look into reducing the lock contention.

Ying Han (5):
  Add kswapd descriptor.
  Add per cgroup reclaim watermarks.
  New APIs to adjust per cgroup wmarks.
  Per cgroup background reclaim.
  Add more per memcg stats.

 Documentation/cgroups/memory.txt |   14 ++
 include/linux/memcontrol.h       |  102 +++++++++
 include/linux/mmzone.h           |    3 +-
 include/linux/res_counter.h      |   83 ++++++++
 include/linux/swap.h             |   12 +-
 kernel/res_counter.c             |    6 +
 mm/memcontrol.c                  |  390 +++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c                  |    1 -
 mm/vmscan.c                      |  418 ++++++++++++++++++++++++++++++++++----
 9 files changed, 979 insertions(+), 50 deletions(-)

-- 
1.7.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/5] Add kswapd descriptor.
  2011-01-13 22:00 [PATCH 0/5] memcg: per cgroup background reclaim Ying Han
@ 2011-01-13 22:00 ` Ying Han
  2011-01-13 22:00 ` [PATCH 2/5] Add per cgroup reclaim watermarks Ying Han
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Ying Han @ 2011-01-13 22:00 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo
  Cc: linux-mm

There is a kswapd kernel thread for each memory node. We add a different kswapd
for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
field of a kswapd descriptor. The kswapd descriptor stores information of node
or cgroup and it allows the global and per cgroup background reclaim to share
common reclaim algorithms.

Changelog v2...v1:
1. dynamic allocate kswapd descriptor and initialize the wait_queue_head of pgdat
at kswapd_run.

2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup kswapd
descriptor.

TODO:
1. move the struct mem_cgroup *kswapd_mem in kswapd sruct to later patch.
2. rename thr in kswapd_run to something else.
3. split this into two patches and the first one just add the kswapd descriptor
definition.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/mmzone.h |    3 +-
 include/linux/swap.h   |    8 +++
 mm/memcontrol.c        |    2 +
 mm/page_alloc.c        |    1 -
 mm/vmscan.c            |  118 ++++++++++++++++++++++++++++++++++++------------
 5 files changed, 100 insertions(+), 32 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..d9e70e6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -636,8 +636,7 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
-	wait_queue_head_t kswapd_wait;
-	struct task_struct *kswapd;
+	wait_queue_head_t *kswapd_wait;
 	int kswapd_max_order;
 } pg_data_t;
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index eba53e7..52122fa 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -26,6 +26,14 @@ static inline int current_is_kswapd(void)
 	return current->flags & PF_KSWAPD;
 }
 
+struct kswapd {
+	struct task_struct *kswapd_task;
+	wait_queue_head_t kswapd_wait;
+	struct mem_cgroup *kswapd_mem;
+	pg_data_t *kswapd_pgdat;
+};
+
+int kswapd(void *p);
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 73ccdfc..f6e0987 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -288,6 +288,8 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+	wait_queue_head_t *kswapd_wait;
 };
 
 /* Stuffs for move charges at task migration. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62b7280..0b30939 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4092,7 +4092,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
-	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8cc90d5..a53d91d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2115,12 +2115,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 
 	return nr_reclaimed;
 }
+
 #endif
 
+DEFINE_SPINLOCK(kswapds_spinlock);
+#define is_node_kswapd(kswapd_p) (!(kswapd_p)->kswapd_mem)
+
 /* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static int sleeping_prematurely(struct kswapd *kswapd, int order,
+				long remaining)
 {
 	int i;
+	pg_data_t *pgdat = kswapd->kswapd_pgdat;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
@@ -2377,21 +2383,27 @@ out:
  * If there are applications that are active memory-allocators
  * (most normal use), this basically shouldn't matter.
  */
-static int kswapd(void *p)
+int kswapd(void *p)
 {
 	unsigned long order;
-	pg_data_t *pgdat = (pg_data_t*)p;
+	struct kswapd *kswapd_p = (struct kswapd *)p;
+	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 	DEFINE_WAIT(wait);
 	struct reclaim_state reclaim_state = {
 		.reclaimed_slab = 0,
 	};
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	const struct cpumask *cpumask;
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
-	if (!cpumask_empty(cpumask))
-		set_cpus_allowed_ptr(tsk, cpumask);
+	if (is_node_kswapd(kswapd_p)) {
+		BUG_ON(pgdat->kswapd_wait != wait_h);
+		cpumask = cpumask_of_node(pgdat->node_id);
+		if (!cpumask_empty(cpumask))
+			set_cpus_allowed_ptr(tsk, cpumask);
+	}
 	current->reclaim_state = &reclaim_state;
 
 	/*
@@ -2414,9 +2426,13 @@ static int kswapd(void *p)
 		unsigned long new_order;
 		int ret;
 
-		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
-		new_order = pgdat->kswapd_max_order;
-		pgdat->kswapd_max_order = 0;
+		prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
+		if (is_node_kswapd(kswapd_p)) {
+			new_order = pgdat->kswapd_max_order;
+			pgdat->kswapd_max_order = 0;
+		} else
+			new_order = 0;
+
 		if (order < new_order) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
@@ -2428,10 +2444,12 @@ static int kswapd(void *p)
 				long remaining = 0;
 
 				/* Try to sleep for a short interval */
-				if (!sleeping_prematurely(pgdat, order, remaining)) {
+				if (!sleeping_prematurely(kswapd_p, order,
+							remaining)) {
 					remaining = schedule_timeout(HZ/10);
-					finish_wait(&pgdat->kswapd_wait, &wait);
-					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+					finish_wait(wait_h, &wait);
+					prepare_to_wait(wait_h, &wait,
+							TASK_INTERRUPTIBLE);
 				}
 
 				/*
@@ -2439,13 +2457,19 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining)) {
-					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
-					set_pgdat_percpu_threshold(pgdat,
-						calculate_normal_threshold);
+				if (!sleeping_prematurely(kswapd_p, order,
+							remaining)) {
+					if (is_node_kswapd(kswapd_p)) {
+						trace_mm_vmscan_kswapd_sleep(
+								pgdat->node_id);
+						set_pgdat_percpu_threshold(pgdat,
+							calculate_normal_threshold);
+					}
 					schedule();
-					set_pgdat_percpu_threshold(pgdat,
-						calculate_pressure_threshold);
+					if (is_node_kswapd(kswapd_p)) {
+						set_pgdat_percpu_threshold(pgdat,
+							calculate_pressure_threshold);
+					}
 				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2454,9 +2478,10 @@ static int kswapd(void *p)
 				}
 			}
 
-			order = pgdat->kswapd_max_order;
+			if (is_node_kswapd(kswapd_p))
+				order = pgdat->kswapd_max_order;
 		}
-		finish_wait(&pgdat->kswapd_wait, &wait);
+		finish_wait(wait_h, &wait);
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -2489,13 +2514,13 @@ void wakeup_kswapd(struct zone *zone, int order)
 	pgdat = zone->zone_pgdat;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
-	if (!waitqueue_active(&pgdat->kswapd_wait))
+	if (!waitqueue_active(pgdat->kswapd_wait))
 		return;
 	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-	wake_up_interruptible(&pgdat->kswapd_wait);
+	wake_up_interruptible(pgdat->kswapd_wait);
 }
 
 /*
@@ -2587,12 +2612,23 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 		for_each_node_state(nid, N_HIGH_MEMORY) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 			const struct cpumask *mask;
+			struct kswapd *kswapd_p;
+			struct task_struct *thr;
+			wait_queue_head_t *wait;
 
 			mask = cpumask_of_node(pgdat->node_id);
 
+			spin_lock(&kswapds_spinlock);
+			wait = pgdat->kswapd_wait;
+			kswapd_p = container_of(wait, struct kswapd,
+						kswapd_wait);
+			thr = kswapd_p->kswapd_task;
+			spin_unlock(&kswapds_spinlock);
+
 			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
 				/* One of our CPUs online: restore mask */
-				set_cpus_allowed_ptr(pgdat->kswapd, mask);
+				if (thr)
+					set_cpus_allowed_ptr(thr, mask);
 		}
 	}
 	return NOTIFY_OK;
@@ -2605,18 +2641,28 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 int kswapd_run(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
+	struct task_struct *thr;
+	struct kswapd *kswapd_p;
 	int ret = 0;
 
-	if (pgdat->kswapd)
+	if (pgdat->kswapd_wait)
 		return 0;
 
-	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
-	if (IS_ERR(pgdat->kswapd)) {
+	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
+	if (!kswapd_p)
+		return -ENOMEM;
+
+	init_waitqueue_head(&kswapd_p->kswapd_wait);
+	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_pgdat = pgdat;
+	thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (IS_ERR(thr)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
 		printk("Failed to start kswapd on node %d\n",nid);
 		ret = -1;
 	}
+	kswapd_p->kswapd_task = thr;
 	return ret;
 }
 
@@ -2625,10 +2671,24 @@ int kswapd_run(int nid)
  */
 void kswapd_stop(int nid)
 {
-	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
+	struct task_struct *thr = NULL;
+	struct kswapd *kswapd_p = NULL;
+	wait_queue_head_t *wait;
+
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	spin_lock(&kswapds_spinlock);
+	wait = pgdat->kswapd_wait;
+	if (wait) {
+		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
+		thr = kswapd_p->kswapd_task;
+	}
+	spin_unlock(&kswapds_spinlock);
+
+	if (thr)
+		kthread_stop(thr);
 
-	if (kswapd)
-		kthread_stop(kswapd);
+	kfree(kswapd_p);
 }
 
 static int __init kswapd_init(void)
-- 
1.7.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/5] Add per cgroup reclaim watermarks.
  2011-01-13 22:00 [PATCH 0/5] memcg: per cgroup background reclaim Ying Han
  2011-01-13 22:00 ` [PATCH 1/5] Add kswapd descriptor Ying Han
@ 2011-01-13 22:00 ` Ying Han
  2011-01-14  0:11   ` KAMEZAWA Hiroyuki
  2011-01-13 22:00 ` [PATCH 3/5] New APIs to adjust per cgroup wmarks Ying Han
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Ying Han @ 2011-01-13 22:00 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo
  Cc: linux-mm

The per cgroup kswapd is invoked when the cgroup's free memory (limit - usage)
is less than a threshold--low_wmark. Then the kswapd thread starts to reclaim
pages in a priority loop similar to global algorithm. The kswapd is done if the
free memory is above a threshold--high_wmark.

The per cgroup background reclaim is based on the per cgroup LRU and also adds
per cgroup watermarks. There are two watermarks including "low_wmark" and
"high_wmark", and they are calculated based on the limit_in_bytes(hard_limit)
for each cgroup. Each time the hard_limit is changed, the corresponding wmarks
are re-calculated. Since memory controller charges only user pages, there is
no need for a "min_wmark". The current calculation of wmarks is a function of
"memory.min_free_kbytes" which could be adjusted by writing different values
into the new api. This is added mainly for debugging purpose.

Change log v2...v1:
1. Remove the res_counter_charge on wmark due to performance concern.
2. Move the new APIs min_free_kbytes, reclaim_wmarks into seperate commit.
3. Calculate the min_free_kbytes automatically based on the limit_in_bytes.
4. make the wmark to be consistant with core VM which checks the free pages
instead of usage.
5. changed wmark to be boolean

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h  |    1 +
 include/linux/res_counter.h |   83 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    6 +++
 mm/memcontrol.c             |   73 +++++++++++++++++++++++++++++++++++++
 4 files changed, 163 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3433784..80a605f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -93,6 +93,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index fcb9884..10b7e59 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -39,6 +39,15 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the limit that reclaim triggers. it is the free count
+	 * (limit - usage)
+	 */
+	unsigned long long low_wmark_limit;
+	/*
+	 * the limit that reclaim stops. it is the free count
+	 */
+	unsigned long long high_wmark_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -55,6 +64,9 @@ struct res_counter {
 
 #define RESOURCE_MAX (unsigned long long)LLONG_MAX
 
+#define CHARGE_WMARK_LOW	0x02
+#define CHARGE_WMARK_HIGH	0x04
+
 /**
  * Helpers to interact with userspace
  * res_counter_read_u64() - returns the value of the specified member.
@@ -92,6 +104,8 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_WMARK_LIMIT,
+	RES_HIGH_WMARK_LIMIT
 };
 
 /*
@@ -145,6 +159,28 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
 	return false;
 }
 
+static inline bool
+res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	unsigned long long free = cnt->limit - cnt->usage;
+
+	if (free <= cnt->high_wmark_limit)
+		return false;
+
+	return true;
+}
+
+static inline bool
+res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
+{
+	unsigned long long free = cnt->limit - cnt->usage;
+
+	if (free <= cnt->low_wmark_limit)
+		return false;
+
+	return true;
+}
+
 /**
  * Get the difference between the usage and the soft limit
  * @cnt: The counter
@@ -193,6 +229,30 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
 	return ret;
 }
 
+static inline bool
+res_counter_check_under_low_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_low_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
+static inline bool
+res_counter_check_under_high_wmark_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_high_wmark_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -238,4 +298,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_high_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->high_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
+static inline int
+res_counter_set_low_wmark_limit(struct res_counter *cnt,
+				unsigned long long wmark_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_wmark_limit = wmark_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index c7eaa37..f68bd63 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 	spin_lock_init(&counter->lock);
 	counter->limit = RESOURCE_MAX;
 	counter->soft_limit = RESOURCE_MAX;
+	counter->low_wmark_limit = 0;
+	counter->high_wmark_limit = 0;
 	counter->parent = parent;
 }
 
@@ -103,6 +105,10 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_WMARK_LIMIT:
+		return &counter->low_wmark_limit;
+	case RES_HIGH_WMARK_LIMIT:
+		return &counter->high_wmark_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f6e0987..5508d94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -290,6 +290,7 @@ struct mem_cgroup {
 	spinlock_t pcp_counter_lock;
 
 	wait_queue_head_t *kswapd_wait;
+	unsigned long min_free_kbytes;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -378,6 +379,7 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
+static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
 
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -818,6 +820,45 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+void setup_per_memcg_wmarks(struct mem_cgroup *mem)
+{
+	unsigned long min_free_kbytes;
+
+	min_free_kbytes = get_min_free_kbytes(mem);
+	if (min_free_kbytes == 0) {
+		u64 limit;
+
+		limit = mem_cgroup_get_limit(mem);
+		res_counter_set_low_wmark_limit(&mem->res, limit);
+		res_counter_set_high_wmark_limit(&mem->res, limit);
+	} else {
+		unsigned long long low_wmark, high_wmark;
+		unsigned long long tmp = min_free_kbytes << 10;
+
+		low_wmark = tmp + (tmp >> 2);
+		high_wmark = tmp + (tmp >> 1);
+		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
+		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
+	}
+}
+
+void init_per_memcg_wmarks(struct mem_cgroup *mem)
+{
+	unsigned long min_free_kbytes;
+	u64 limit_kbytes;
+
+	limit_kbytes = mem_cgroup_get_limit(mem) >> 10;
+	min_free_kbytes = int_sqrt(limit_kbytes * 16);
+	if (min_free_kbytes < 128)
+		min_free_kbytes = 128;
+	if (min_free_kbytes > 65536)
+		min_free_kbytes = 65536;
+
+	mem->min_free_kbytes = min_free_kbytes;
+	setup_per_memcg_wmarks(mem);
+
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -1403,6 +1444,22 @@ unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
 	return value;
 }
 
+static unsigned long get_min_free_kbytes(struct mem_cgroup *memcg)
+{
+	struct cgroup *cgrp = memcg->css.cgroup;
+	unsigned long min_free_kbytes;
+
+	/* root ? */
+	if (cgrp == NULL || cgrp->parent == NULL)
+		return 0;
+
+	spin_lock(&memcg->reclaim_param_lock);
+	min_free_kbytes = memcg->min_free_kbytes;
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	return min_free_kbytes;
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
@@ -3310,6 +3367,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		init_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3369,6 +3427,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 			else
 				memcg->memsw_is_minimum = false;
 		}
+		init_per_memcg_wmarks(memcg);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -4744,6 +4803,19 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
+				int charge_flags)
+{
+	long ret = 0;
+
+	if (charge_flags & CHARGE_WMARK_LOW)
+		ret = res_counter_check_under_low_wmark_limit(&mem->res);
+	if (charge_flags & CHARGE_WMARK_HIGH)
+		ret = res_counter_check_under_high_wmark_limit(&mem->res);
+
+	return ret;
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4834,6 +4906,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
+	mem->min_free_kbytes = 0;
 	mutex_init(&mem->thresholds_lock);
 	return &mem->css;
 free_out:
-- 
1.7.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] Add per cgroup reclaim watermarks.
  2011-01-13 22:00 ` [PATCH 2/5] Add per cgroup reclaim watermarks Ying Han
@ 2011-01-14  0:11   ` KAMEZAWA Hiroyuki
  2011-01-18 20:02     ` Ying Han
  0 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-14  0:11 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Thu, 13 Jan 2011 14:00:32 -0800
Ying Han <yinghan@google.com> wrote:

> The per cgroup kswapd is invoked when the cgroup's free memory (limit - usage)
> is less than a threshold--low_wmark. Then the kswapd thread starts to reclaim
> pages in a priority loop similar to global algorithm. The kswapd is done if the
> free memory is above a threshold--high_wmark.
> 
> The per cgroup background reclaim is based on the per cgroup LRU and also adds
> per cgroup watermarks. There are two watermarks including "low_wmark" and
> "high_wmark", and they are calculated based on the limit_in_bytes(hard_limit)
> for each cgroup. Each time the hard_limit is changed, the corresponding wmarks
> are re-calculated. Since memory controller charges only user pages, there is
> no need for a "min_wmark". The current calculation of wmarks is a function of
> "memory.min_free_kbytes" which could be adjusted by writing different values
> into the new api. This is added mainly for debugging purpose.
> 
> Change log v2...v1:
> 1. Remove the res_counter_charge on wmark due to performance concern.
> 2. Move the new APIs min_free_kbytes, reclaim_wmarks into seperate commit.
> 3. Calculate the min_free_kbytes automatically based on the limit_in_bytes.
> 4. make the wmark to be consistant with core VM which checks the free pages
> instead of usage.
> 5. changed wmark to be boolean
> 
> Signed-off-by: Ying Han <yinghan@google.com>

Hmm, I don't think using the same algorithm as min_free_kbytes is good.

Why it's bad to have 2 interfaces as low_wmark and high_wmark ? 

And in this patch, min_free_kbytes can be [256...65536]...I think this
'256' is not good because it should be able to be set to '0'.

IIUC, in enterprise systems, there are users who want to keep a fixed amount
of free memory always. This interface will not allow such use case.

I think we should have 2 interfaces as low_wmark and high_wmark. But as default
value, the same value as to the alogorithm with min_free_kbytes will make sense.

BTW, please divide res_counter part and memcg part in the next post.

Please explain your handling of 'hierarchy' in description.

Thanks,
-Kame


> ---
>  include/linux/memcontrol.h  |    1 +
>  include/linux/res_counter.h |   83 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/res_counter.c        |    6 +++
>  mm/memcontrol.c             |   73 +++++++++++++++++++++++++++++++++++++
>  4 files changed, 163 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3433784..80a605f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -93,6 +93,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
>  
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index fcb9884..10b7e59 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -39,6 +39,15 @@ struct res_counter {
>  	 */
>  	unsigned long long soft_limit;
>  	/*
> +	 * the limit that reclaim triggers. it is the free count
> +	 * (limit - usage)
> +	 */
> +	unsigned long long low_wmark_limit;
> +	/*
> +	 * the limit that reclaim stops. it is the free count
> +	 */
> +	unsigned long long high_wmark_limit;
> +	/*
>  	 * the number of unsuccessful attempts to consume the resource
>  	 */
>  	unsigned long long failcnt;
> @@ -55,6 +64,9 @@ struct res_counter {
>  
>  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
>  
> +#define CHARGE_WMARK_LOW	0x02
> +#define CHARGE_WMARK_HIGH	0x04
> +
>  /**
>   * Helpers to interact with userspace
>   * res_counter_read_u64() - returns the value of the specified member.
> @@ -92,6 +104,8 @@ enum {
>  	RES_LIMIT,
>  	RES_FAILCNT,
>  	RES_SOFT_LIMIT,
> +	RES_LOW_WMARK_LIMIT,
> +	RES_HIGH_WMARK_LIMIT
>  };
>  
>  /*
> @@ -145,6 +159,28 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
>  	return false;
>  }
>  
> +static inline bool
> +res_counter_high_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	unsigned long long free = cnt->limit - cnt->usage;
> +
> +	if (free <= cnt->high_wmark_limit)
> +		return false;
> +
> +	return true;
> +}
> +
> +static inline bool
> +res_counter_low_wmark_limit_check_locked(struct res_counter *cnt)
> +{
> +	unsigned long long free = cnt->limit - cnt->usage;
> +
> +	if (free <= cnt->low_wmark_limit)
> +		return false;
> +
> +	return true;
> +}
> +
>  /**
>   * Get the difference between the usage and the soft limit
>   * @cnt: The counter
> @@ -193,6 +229,30 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
>  	return ret;
>  }
>  
> +static inline bool
> +res_counter_check_under_low_wmark_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_low_wmark_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
> +static inline bool
> +res_counter_check_under_high_wmark_limit(struct res_counter *cnt)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = res_counter_high_wmark_limit_check_locked(cnt);
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
>  static inline void res_counter_reset_max(struct res_counter *cnt)
>  {
>  	unsigned long flags;
> @@ -238,4 +298,27 @@ res_counter_set_soft_limit(struct res_counter *cnt,
>  	return 0;
>  }
>  
> +static inline int
> +res_counter_set_high_wmark_limit(struct res_counter *cnt,
> +				unsigned long long wmark_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->high_wmark_limit = wmark_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return 0;
> +}
> +
> +static inline int
> +res_counter_set_low_wmark_limit(struct res_counter *cnt,
> +				unsigned long long wmark_limit)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	cnt->low_wmark_limit = wmark_limit;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return 0;
> +}
>  #endif
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index c7eaa37..f68bd63 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -19,6 +19,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
>  	spin_lock_init(&counter->lock);
>  	counter->limit = RESOURCE_MAX;
>  	counter->soft_limit = RESOURCE_MAX;
> +	counter->low_wmark_limit = 0;
> +	counter->high_wmark_limit = 0;
>  	counter->parent = parent;
>  }
>  
> @@ -103,6 +105,10 @@ res_counter_member(struct res_counter *counter, int member)
>  		return &counter->failcnt;
>  	case RES_SOFT_LIMIT:
>  		return &counter->soft_limit;
> +	case RES_LOW_WMARK_LIMIT:
> +		return &counter->low_wmark_limit;
> +	case RES_HIGH_WMARK_LIMIT:
> +		return &counter->high_wmark_limit;
>  	};
>  
>  	BUG();
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f6e0987..5508d94 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -290,6 +290,7 @@ struct mem_cgroup {
>  	spinlock_t pcp_counter_lock;
>  
>  	wait_queue_head_t *kswapd_wait;
> +	unsigned long min_free_kbytes;
>  };
>  
>  /* Stuffs for move charges at task migration. */
> @@ -378,6 +379,7 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  static void drain_all_stock_async(void);
> +static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
>  
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -818,6 +820,45 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
>  	return (mem == root_mem_cgroup);
>  }
>  
> +void setup_per_memcg_wmarks(struct mem_cgroup *mem)
> +{
> +	unsigned long min_free_kbytes;
> +
> +	min_free_kbytes = get_min_free_kbytes(mem);
> +	if (min_free_kbytes == 0) {
> +		u64 limit;
> +
> +		limit = mem_cgroup_get_limit(mem);
> +		res_counter_set_low_wmark_limit(&mem->res, limit);
> +		res_counter_set_high_wmark_limit(&mem->res, limit);
> +	} else {
> +		unsigned long long low_wmark, high_wmark;
> +		unsigned long long tmp = min_free_kbytes << 10;
> +
> +		low_wmark = tmp + (tmp >> 2);
> +		high_wmark = tmp + (tmp >> 1);
> +		res_counter_set_low_wmark_limit(&mem->res, low_wmark);
> +		res_counter_set_high_wmark_limit(&mem->res, high_wmark);
> +	}
> +}
> +
> +void init_per_memcg_wmarks(struct mem_cgroup *mem)
> +{
> +	unsigned long min_free_kbytes;
> +	u64 limit_kbytes;
> +
> +	limit_kbytes = mem_cgroup_get_limit(mem) >> 10;
> +	min_free_kbytes = int_sqrt(limit_kbytes * 16);
> +	if (min_free_kbytes < 128)
> +		min_free_kbytes = 128;
> +	if (min_free_kbytes > 65536)
> +		min_free_kbytes = 65536;
> +
> +	mem->min_free_kbytes = min_free_kbytes;
> +	setup_per_memcg_wmarks(mem);
> +
> +}
> +
>  /*
>   * Following LRU functions are allowed to be used without PCG_LOCK.
>   * Operations are called by routine of global LRU independently from memcg.
> @@ -1403,6 +1444,22 @@ unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
>  	return value;
>  }
>  
> +static unsigned long get_min_free_kbytes(struct mem_cgroup *memcg)
> +{
> +	struct cgroup *cgrp = memcg->css.cgroup;
> +	unsigned long min_free_kbytes;
> +
> +	/* root ? */
> +	if (cgrp == NULL || cgrp->parent == NULL)
> +		return 0;
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	min_free_kbytes = memcg->min_free_kbytes;
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return min_free_kbytes;
> +}
> +
>  static void mem_cgroup_start_move(struct mem_cgroup *mem)
>  {
>  	int cpu;
> @@ -3310,6 +3367,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  			else
>  				memcg->memsw_is_minimum = false;
>  		}
> +		init_per_memcg_wmarks(memcg);
>  		mutex_unlock(&set_limit_mutex);
>  
>  		if (!ret)
> @@ -3369,6 +3427,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  			else
>  				memcg->memsw_is_minimum = false;
>  		}
> +		init_per_memcg_wmarks(memcg);
>  		mutex_unlock(&set_limit_mutex);
>  
>  		if (!ret)
> @@ -4744,6 +4803,19 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>  
> +int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> +				int charge_flags)
> +{
> +	long ret = 0;
> +
> +	if (charge_flags & CHARGE_WMARK_LOW)
> +		ret = res_counter_check_under_low_wmark_limit(&mem->res);
> +	if (charge_flags & CHARGE_WMARK_HIGH)
> +		ret = res_counter_check_under_high_wmark_limit(&mem->res);
> +
> +	return ret;
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>  	struct mem_cgroup_tree_per_node *rtpn;
> @@ -4834,6 +4906,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
> +	mem->min_free_kbytes = 0;
>  	mutex_init(&mem->thresholds_lock);
>  	return &mem->css;
>  free_out:
> -- 
> 1.7.3.1
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] Add per cgroup reclaim watermarks.
  2011-01-14  0:11   ` KAMEZAWA Hiroyuki
@ 2011-01-18 20:02     ` Ying Han
  2011-01-18 20:36       ` David Rientjes
  2011-01-19  0:44       ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 17+ messages in thread
From: Ying Han @ 2011-01-18 20:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Thu, Jan 13, 2011 at 4:11 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 13 Jan 2011 14:00:32 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> The per cgroup kswapd is invoked when the cgroup's free memory (limit - usage)
>> is less than a threshold--low_wmark. Then the kswapd thread starts to reclaim
>> pages in a priority loop similar to global algorithm. The kswapd is done if the
>> free memory is above a threshold--high_wmark.
>>
>> The per cgroup background reclaim is based on the per cgroup LRU and also adds
>> per cgroup watermarks. There are two watermarks including "low_wmark" and
>> "high_wmark", and they are calculated based on the limit_in_bytes(hard_limit)
>> for each cgroup. Each time the hard_limit is changed, the corresponding wmarks
>> are re-calculated. Since memory controller charges only user pages, there is
>> no need for a "min_wmark". The current calculation of wmarks is a function of
>> "memory.min_free_kbytes" which could be adjusted by writing different values
>> into the new api. This is added mainly for debugging purpose.
>>
>> Change log v2...v1:
>> 1. Remove the res_counter_charge on wmark due to performance concern.
>> 2. Move the new APIs min_free_kbytes, reclaim_wmarks into seperate commit.
>> 3. Calculate the min_free_kbytes automatically based on the limit_in_bytes.
>> 4. make the wmark to be consistant with core VM which checks the free pages
>> instead of usage.
>> 5. changed wmark to be boolean
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>

Thank you  KAMEZAWA for your comments.

> Hmm, I don't think using the same algorithm as min_free_kbytes is good.
>
> Why it's bad to have 2 interfaces as low_wmark and high_wmark ?


>
> And in this patch, min_free_kbytes can be [256...65536]...I think this
> '256' is not good because it should be able to be set to '0'.
>
> IIUC, in enterprise systems, there are users who want to keep a fixed amount
> of free memory always. This interface will not allow such use case.

>
> I think we should have 2 interfaces as low_wmark and high_wmark. But as default
> value, the same value as to the alogorithm with min_free_kbytes will make sense.

I agree that "min_free_kbytes" concept doesn't apply well since there
is no notion of "reserved pool" in memcg. I borrowed it at the
beginning is to add a tunable to the per-memcg watermarks besides the
hard_limit. I read the
patch posted from Satoru Moriya "Tunable watermarks", and introducing
the per-memcg-per-watermark tunable
sounds good to me. Might consider adding it to the next post.

>
> BTW, please divide res_counter part and memcg part in the next post.

Will do.

>
> Please explain your handling of 'hierarchy' in description.
I haven't thought through the 'hierarchy' handling in this patchset
which I will probably put more thoughts in the following
posts. Do you have recommendations on handing the 'hierarchy' ?

--Ying

>
> Thanks,
> -Kame
>
>
>> ---
>>  include/linux/memcontrol.h  |    1 +
>>  include/linux/res_counter.h |   83 +++++++++++++++++++++++++++++++++++++++++++
>>  kernel/res_counter.c        |    6 +++
>>  mm/memcontrol.c             |   73 +++++++++++++++++++++++++++++++++++++
>>  4 files changed, 163 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 3433784..80a605f 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -93,6 +93,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>>
>>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>> +extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
>>
>>  static inline
>>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
>> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
>> index fcb9884..10b7e59 100644
>> --- a/include/linux/res_counter.h
>> +++ b/include/linux/res_counter.h
>> @@ -39,6 +39,15 @@ struct res_counter {
>>        */
>>       unsigned long long soft_limit;
>>       /*
>> +      * the limit that reclaim triggers. it is the free count
>> +      * (limit - usage)
>> +      */
>> +     unsigned long long low_wmark_limit;
>> +     /*
>> +      * the limit that reclaim stops. it is the free count
>> +      */
>> +     unsigned long long high_wmark_limit;
>> +     /*
>>        * the number of unsuccessful attempts to consume the resource
>>        */
>>       unsigned long long failcnt;
>> @@ -55,6 +64,9 @@ struct res_counter {
>>
>>  #define RESOURCE_MAX (unsigned long long)LLONG_MAX
>>
>> +#define CHARGE_WMARK_LOW     0x02
>> +#define CHARGE_WMARK_HIGH    0x04
>> +
>>  /**
>>   * Helpers to interact with userspace
>>   * res_counter_read_u64() - returns the value of the specified member

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] Add per cgroup reclaim watermarks.
  2011-01-18 20:02     ` Ying Han
@ 2011-01-18 20:36       ` David Rientjes
  2011-01-18 21:10         ` Ying Han
  2011-01-19  0:44       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 17+ messages in thread
From: David Rientjes @ 2011-01-18 20:36 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, 18 Jan 2011, Ying Han wrote:

> I agree that "min_free_kbytes" concept doesn't apply well since there
> is no notion of "reserved pool" in memcg. I borrowed it at the
> beginning is to add a tunable to the per-memcg watermarks besides the
> hard_limit.

You may want to add a small amount of memory that a memcg may allocate 
from in oom conditions, however: memory reserves are allocated per-zone 
and if the entire system is oom and that includes several dozen memcgs, 
for example, they could all be contending for the same memory reserves.  
It would be much easier to deplete all reserves since you would have 
several tasks allowed to allocate from this pool: that's not possible 
without memcg since the oom killer is serialized on zones and does not 
kill a task if another oom killed task is already detected in the 
tasklist.

I think it would be very trivial to DoS the entire machine in this way: 
set up a thousand memcgs with tasks that have core_state, for example, and 
trigger them to all allocate anonymous memory up to their hard limit so 
they oom at the same time.  The machine should livelock with all zones 
having 0 pages free.

> I read the
> patch posted from Satoru Moriya "Tunable watermarks", and introducing
> the per-memcg-per-watermark tunable
> sounds good to me. Might consider adding it to the next post.
> 

Those tunable watermarks were nacked for a reason: they are internal to 
the VM and should be set to sane values by the kernel with no intevention 
needed by userspace.  You'd need to show why a memcg would need a user to 
tune its watermarks to trigger background reclaim and why that's not 
possible by the kernel and how this is a special case in comparsion to the 
per-zone watermarks used by the VM.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] Add per cgroup reclaim watermarks.
  2011-01-18 20:36       ` David Rientjes
@ 2011-01-18 21:10         ` Ying Han
  2011-01-19  0:56           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 17+ messages in thread
From: Ying Han @ 2011-01-18 21:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, Jan 18, 2011 at 12:36 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 18 Jan 2011, Ying Han wrote:
>
>> I agree that "min_free_kbytes" concept doesn't apply well since there
>> is no notion of "reserved pool" in memcg. I borrowed it at the
>> beginning is to add a tunable to the per-memcg watermarks besides the
>> hard_limit.
>
> You may want to add a small amount of memory that a memcg may allocate
> from in oom conditions, however: memory reserves are allocated per-zone
> and if the entire system is oom and that includes several dozen memcgs,
> for example, they could all be contending for the same memory reserves.
> It would be much easier to deplete all reserves since you would have
> several tasks allowed to allocate from this pool: that's not possible
> without memcg since the oom killer is serialized on zones and does not
> kill a task if another oom killed task is already detected in the
> tasklist.

so something like per-memcg min_wmark which also needs to be reserved upfront?

> I think it would be very trivial to DoS the entire machine in this way:
> set up a thousand memcgs with tasks that have core_state, for example, and
> trigger them to all allocate anonymous memory up to their hard limit so
> they oom at the same time.  The machine should livelock with all zones
> having 0 pages free.
>
>> I read the
>> patch posted from Satoru Moriya "Tunable watermarks", and introducing
>> the per-memcg-per-watermark tunable
>> sounds good to me. Might consider adding it to the next post.
>>
>
> Those tunable watermarks were nacked for a reason: they are internal to
> the VM and should be set to sane values by the kernel with no intevention
> needed by userspace.  You'd need to show why a memcg would need a user to
> tune its watermarks to trigger background reclaim and why that's not
> possible by the kernel and how this is a special case in comparsion to the
> per-zone watermarks used by the VM.

KAMEZAWA gave an example on his early post, which some enterprise user
like to keep fixed amount of free pages
regardless of the hard_limit.

Since setting the wmarks has impact on the reclaim behavior of each
memcg,  adding this flexibility helps the system where it like to
treat memcg differently based on the priority.

--Ying


>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] Add per cgroup reclaim watermarks.
  2011-01-18 21:10         ` Ying Han
@ 2011-01-19  0:56           ` KAMEZAWA Hiroyuki
  2011-01-19  2:38             ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-19  0:56 UTC (permalink / raw)
  To: Ying Han
  Cc: David Rientjes, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, 18 Jan 2011 13:10:39 -0800
Ying Han <yinghan@google.com> wrote:

> On Tue, Jan 18, 2011 at 12:36 PM, David Rientjes <rientjes@google.com> wrote:
> > On Tue, 18 Jan 2011, Ying Han wrote:
> >
> >> I agree that "min_free_kbytes" concept doesn't apply well since there
> >> is no notion of "reserved pool" in memcg. I borrowed it at the
> >> beginning is to add a tunable to the per-memcg watermarks besides the
> >> hard_limit.
> >
> > You may want to add a small amount of memory that a memcg may allocate
> > from in oom conditions, however: memory reserves are allocated per-zone
> > and if the entire system is oom and that includes several dozen memcgs,
> > for example, they could all be contending for the same memory reserves.
> > It would be much easier to deplete all reserves since you would have
> > several tasks allowed to allocate from this pool: that's not possible
> > without memcg since the oom killer is serialized on zones and does not
> > kill a task if another oom killed task is already detected in the
> > tasklist.
> 
> so something like per-memcg min_wmark which also needs to be reserved upfront?
> 

I think the variable name 'min_free_kbytes' is the source of confusion...
It's just a watermark to trigger background reclaim. It's not reservation.


> > I think it would be very trivial to DoS the entire machine in this way:
> > set up a thousand memcgs with tasks that have core_state, for example, and
> > trigger them to all allocate anonymous memory up to their hard limit so
> > they oom at the same time. A The machine should livelock with all zones
> > having 0 pages free.
> >
> >> I read the
> >> patch posted from Satoru Moriya "Tunable watermarks", and introducing
> >> the per-memcg-per-watermark tunable
> >> sounds good to me. Might consider adding it to the next post.
> >>
> >
> > Those tunable watermarks were nacked for a reason: they are internal to
> > the VM and should be set to sane values by the kernel with no intevention
> > needed by userspace. A You'd need to show why a memcg would need a user to
> > tune its watermarks to trigger background reclaim and why that's not
> > possible by the kernel and how this is a special case in comparsion to the
> > per-zone watermarks used by the VM.
> 
> KAMEZAWA gave an example on his early post, which some enterprise user
> like to keep fixed amount of free pages
> regardless of the hard_limit.
> 
> Since setting the wmarks has impact on the reclaim behavior of each
> memcg,  adding this flexibility helps the system where it like to
> treat memcg differently based on the priority.
> 

Please add some tricks to throttle the usage of cpu by kswapd-for-memcg
even when the user sets some bad value. And the total number of threads/workers
for all memcg should be throttled, too. (I think this parameter can be 
sysctl or root cgroup parameter.)

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] Add per cgroup reclaim watermarks.
  2011-01-19  0:56           ` KAMEZAWA Hiroyuki
@ 2011-01-19  2:38             ` David Rientjes
  2011-01-19  2:47               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2011-01-19  2:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Wed, 19 Jan 2011, KAMEZAWA Hiroyuki wrote:

> > so something like per-memcg min_wmark which also needs to be reserved upfront?
> > 
> 
> I think the variable name 'min_free_kbytes' is the source of confusion...
> It's just a watermark to trigger background reclaim. It's not reservation.
> 

min_free_kbytes alters the min watermark of zones, meaning it can increase 
or decrease the amount of memory that is reserved for GFP_ATOMIC 
allocations, those in irq context, etc.  Since oom killed tasks don't 
allocate from any watermark, it also can increase or decrease the amount 
of memory available to oom killed tasks.  In that case, it _is_ a 
reservation of memory.

The issue is that it's done per-zone and if you're contending for those 
memory reserves that some oom killed tasks need to exit and free their 
memory, then it may deplete all memory in the DoS scenario I described.

> > KAMEZAWA gave an example on his early post, which some enterprise user
> > like to keep fixed amount of free pages
> > regardless of the hard_limit.
> > 
> > Since setting the wmarks has impact on the reclaim behavior of each
> > memcg,  adding this flexibility helps the system where it like to
> > treat memcg differently based on the priority.
> > 
> 
> Please add some tricks to throttle the usage of cpu by kswapd-for-memcg
> even when the user sets some bad value. And the total number of threads/workers
> for all memcg should be throttled, too. (I think this parameter can be 
> sysctl or root cgroup parameter.)
> 

I think that you probably want to add a min_free_kbytes for each memcg 
(and users who choose not to pre-reserve memory for things like oom killed 
tasks in that cgroup may set it to 0) and then have all other watermarks 
based off that setting just like the VM currently does whenever the global 
min_free_kbytes changes.

And I agree with your point that some cpu throttling will be needed to not 
be harmful to other cgroups whenever one memcg continuously hits its low 
watermark.  I'd suggest a global sysctl for that purpose to avoid certain 
memcg's impacting the preformance of others when under continuous reclaim 
to make sure everyone's on the same playing field.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] Add per cgroup reclaim watermarks.
  2011-01-19  2:38             ` David Rientjes
@ 2011-01-19  2:47               ` KAMEZAWA Hiroyuki
  2011-01-19 10:03                 ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-19  2:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ying Han, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Tue, 18 Jan 2011 18:38:42 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 19 Jan 2011, KAMEZAWA Hiroyuki wrote:
> 
> > > so something like per-memcg min_wmark which also needs to be reserved upfront?
> > > 
> > 
> > I think the variable name 'min_free_kbytes' is the source of confusion...
> > It's just a watermark to trigger background reclaim. It's not reservation.
> > 
> 
> min_free_kbytes alters the min watermark of zones, meaning it can increase 
> or decrease the amount of memory that is reserved for GFP_ATOMIC 
> allocations, those in irq context, etc.  Since oom killed tasks don't 
> allocate from any watermark, it also can increase or decrease the amount 
> of memory available to oom killed tasks.  In that case, it _is_ a 
> reservation of memory.
> 
I know.

THIS PATCH's min_free_kbytes is not the same to ZONE's one. It's just a
trigger. This patch's one is not used to limit charge() or for handling
gfp_mask.
(We can assume it's always GFP_HIGHUSER_MOVABLE or GFP_USER in some cases.)

So, I wrote the name of 'min_free_kbytes' in _this_ patch is a source of
confusion. I don't recommend to use such name in _this_ patch.

Thanks,
-Kame



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] Add per cgroup reclaim watermarks.
  2011-01-19  2:47               ` KAMEZAWA Hiroyuki
@ 2011-01-19 10:03                 ` David Rientjes
  0 siblings, 0 replies; 17+ messages in thread
From: David Rientjes @ 2011-01-19 10:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, Balbir Singh, Daisuke Nishimura, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo, linux-mm

On Wed, 19 Jan 2011, KAMEZAWA Hiroyuki wrote:

> I know.
> 
> THIS PATCH's min_free_kbytes is not the same to ZONE's one. It's just a
> trigger. This patch's one is not used to limit charge() or for handling
> gfp_mask.
> (We can assume it's always GFP_HIGHUSER_MOVABLE or GFP_USER in some cases.)
> 
> So, I wrote the name of 'min_free_kbytes' in _this_ patch is a source of
> confusion. I don't recommend to use such name in _this_ patch.
> 

Agree with respect to memcg min_free_kbytes.  I think it would be 
preferrable, however, to have a single tunable for which oom killed tasks 
may access a privileged pool of memory to avoid the aforementioned DoS and 
base all other watermarks off that value just like it happens for the 
global case.  Your point about throttling cpu for background reclaim is 
also a good one: I think we should be able to control the aggressiveness 
of memcg background reclaim with an additional property of memcg where a 
child memcg cannot be more aggressive than a parent, but I think the 
watermark should be internal to the subsystem itself and, perhaps, based 
on the user tunable that determines how much memory is accessible by only 
oom killed tasks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] Add per cgroup reclaim watermarks.
  2011-01-18 20:02     ` Ying Han
  2011-01-18 20:36       ` David Rientjes
@ 2011-01-19  0:44       ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-19  0:44 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Tue, 18 Jan 2011 12:02:51 -0800
Ying Han <yinghan@google.com> wrote:

> On Thu, Jan 13, 2011 at 4:11 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> >
> > Please explain your handling of 'hierarchy' in description.
> I haven't thought through the 'hierarchy' handling in this patchset
> which I will probably put more thoughts in the following
> posts. Do you have recommendations on handing the 'hierarchy' ?
> 

For example, assume a Hierarchy like following.

 A
  \
   B 

B's usage is accoutned into A, too. So, it's difficult to determine when
A's kswapd should run if
 - A's kswapd runs only against 'A'
 - A's kswapd just see information of A's LRU
 - B has its own kswapd...this means A has 2 kswapd.
.....


What I think are 2 options.

(1) having one kswapd per hierarchy, IOW, B will never have hierarchy.
or
(2) having kswapd per cgroup but it shares mutex. Parent's kswapd will
   never run if one of children's run.

(1) sounds slow and handling of children's watermark will be serialized.
(2) sounds we may have too much worker.

I like something between (1) and (2) ;)   sqrt(num_of_cgroup) of kswapd
is good ?

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 3/5] New APIs to adjust per cgroup wmarks.
  2011-01-13 22:00 [PATCH 0/5] memcg: per cgroup background reclaim Ying Han
  2011-01-13 22:00 ` [PATCH 1/5] Add kswapd descriptor Ying Han
  2011-01-13 22:00 ` [PATCH 2/5] Add per cgroup reclaim watermarks Ying Han
@ 2011-01-13 22:00 ` Ying Han
  2011-01-13 22:00 ` [PATCH 4/5] Per cgroup background reclaim Ying Han
  2011-01-13 22:00 ` [PATCH 5/5] Add more per memcg stats Ying Han
  4 siblings, 0 replies; 17+ messages in thread
From: Ying Han @ 2011-01-13 22:00 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo
  Cc: linux-mm

Add min_free_kbytes and reclaim_wmarks APIs per memory cgroup.
The first one is to adjust the internal low/high wmark calculation
and the second one is to export the wmarks.

$ echo 1024 >/dev/cgroup/A/memory.min_free_kbytes

$ cat /dev/cgroup/A/memory.reclaim_wmarks
low_wmark 98304000
high_wmark 81920000

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5508d94..6ef26a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4122,6 +4122,33 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 mem_cgroup_min_free_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return get_min_free_kbytes(memcg);
+}
+
+static int mem_cgroup_min_free_write(struct cgroup *cgrp, struct cftype *cfg,
+				     u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *parent;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+
+	parent = mem_cgroup_from_cont(cgrp->parent);
+
+	spin_lock(&memcg->reclaim_param_lock);
+	memcg->min_free_kbytes = val;
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	setup_per_memcg_wmarks(memcg);
+	return 0;
+
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4413,6 +4440,21 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 	mutex_unlock(&memcg_oom_mutex);
 }
 
+static int mem_cgroup_wmark_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	u64 low_wmark, high_wmark;
+
+	low_wmark = res_counter_read_u64(&mem->res, RES_LOW_WMARK_LIMIT);
+	high_wmark = res_counter_read_u64(&mem->res, RES_HIGH_WMARK_LIMIT);
+
+	cb->fill(cb, "low_wmark", low_wmark);
+	cb->fill(cb, "high_wmark", high_wmark);
+
+	return 0;
+}
+
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 	struct cftype *cft,  struct cgroup_map_cb *cb)
 {
@@ -4623,6 +4665,15 @@ static struct cftype mem_cgroup_files[] = {
 		.write_string = mem_cgroup_dirty_write_string,
 		.private = MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
 	},
+	{
+		.name = "min_free_kbytes",
+		.write_u64 = mem_cgroup_min_free_write,
+		.read_u64 = mem_cgroup_min_free_read,
+	},
+	{
+		.name = "reclaim_wmarks",
+		.read_map = mem_cgroup_wmark_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/5] Per cgroup background reclaim.
  2011-01-13 22:00 [PATCH 0/5] memcg: per cgroup background reclaim Ying Han
                   ` (2 preceding siblings ...)
  2011-01-13 22:00 ` [PATCH 3/5] New APIs to adjust per cgroup wmarks Ying Han
@ 2011-01-13 22:00 ` Ying Han
  2011-01-14  0:52   ` KAMEZAWA Hiroyuki
  2011-01-13 22:00 ` [PATCH 5/5] Add more per memcg stats Ying Han
  4 siblings, 1 reply; 17+ messages in thread
From: Ying Han @ 2011-01-13 22:00 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo
  Cc: linux-mm

The current implementation of memcg only supports direct reclaim and this
patch adds the support for background reclaim. Per cgroup background reclaim
is needed which spreads out the memory pressure over longer period of time
and smoothes out the system performance.

There is a kswapd kernel thread for each memory node. We add a different kswapd
for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
field of a kswapd descriptor.

The kswapd() function now is shared between global and per cgroup kswapd thread.
It is passed in with the kswapd descriptor which contains the information of
either node or cgroup. Then the new function balance_mem_cgroup_pgdat is invoked
if it is per cgroup kswapd thread. The balance_mem_cgroup_pgdat performs a
priority loop similar to global reclaim. In each iteration it invokes
balance_pgdat_node for all nodes on the system, which is a new function performs
background reclaim per node. A fairness mechanism is implemented to remember the
last node it was reclaiming from and always start at the next one. After reclaiming
each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
returns true. A per memcg zone will be marked as "unreclaimable" if the scanning
rate is much greater than the reclaiming rate on the per cgroup LRU. The bit is
cleared when there is a page charged to the cgroup being freed. Kswapd breaks the
priority loop if all the zones are marked as "unreclaimable".

Change log v2...v1:
1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
2. remove checking the wmark from per-page charging. now it checks the wmark
periodically based on the event counter.
3. move the per-cgroup per-zone clear_unreclaimable into uncharge stage.
4. shared the kswapd_run/kswapd_stop for per-cgroup and global background
reclaim.
5. name the per-cgroup memcg as "memcg-id" (css->id). And the global kswapd
keeps the same name.
6. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
after freeing.
7. add the fairness in zonelist where memcg remember the last zone reclaimed
from.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |   37 ++++++
 include/linux/swap.h       |    4 +-
 mm/memcontrol.c            |  192 ++++++++++++++++++++++++++++-
 mm/vmscan.c                |  298 ++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 504 insertions(+), 27 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 80a605f..69c6e41 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,6 +25,7 @@ struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+struct kswapd;
 
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_page_stat_item {
@@ -94,6 +95,12 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
+extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
+				  struct kswapd *kswapd_p);
+extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
+extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
+extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
+					nodemask_t *nodes);
 
 static inline
 int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
@@ -166,6 +173,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 
+void mem_cgroup_clear_unreclaimable(struct page_cgroup *pc);
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+					unsigned long nr_scanned);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -361,6 +374,25 @@ static inline unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
 	return -ENOSYS;
 }
 
+static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
+						struct zone *zone,
+						unsigned long nr_scanned)
+{
+}
+
+static inline void mem_cgroup_clear_unreclaimable(struct page *page,
+							struct zone *zone)
+{
+}
+static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
+		struct zone *zone)
+{
+}
+static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
+						struct zone *zone)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask)
@@ -374,6 +406,11 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
 	return 0;
 }
 
+static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
+								int zid)
+{
+	return false;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 52122fa..b6b5cbb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -292,8 +292,8 @@ static inline void scan_unevictable_unregister_node(struct node *node)
 }
 #endif
 
-extern int kswapd_run(int nid);
-extern void kswapd_stop(int nid);
+extern int kswapd_run(int nid, struct mem_cgroup *mem);
+extern void kswapd_stop(int nid, struct mem_cgroup *mem);
 
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ef26a7..e716ece 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -48,6 +48,8 @@
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/kthread.h>
+
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -75,6 +77,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
  */
 #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
 #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
+#define WMARK_EVENTS_THRESH (10) /* once in 1024 */
 
 /*
  * Statistics for memory cgroup.
@@ -131,7 +134,10 @@ struct mem_cgroup_per_zone {
 	bool			on_tree;
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
+	unsigned long		pages_scanned;	/* since last reclaim */
+	int			all_unreclaimable;	/* All pages pinned */
 };
+
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
 
@@ -289,8 +295,16 @@ struct mem_cgroup {
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
 
+	/*
+	 * per cgroup background reclaim.
+	 */
 	wait_queue_head_t *kswapd_wait;
 	unsigned long min_free_kbytes;
+
+	/* While doing per cgroup background reclaim, we cache the
+	 * last node we reclaimed from
+	 */
+	int last_scanned_node;
 };
 
 /* Stuffs for move charges at task migration. */
@@ -380,6 +394,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
 static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
+static void wake_memcg_kswapd(struct mem_cgroup *mem);
 
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -568,6 +583,12 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
+static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
+{
+	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
+		wake_memcg_kswapd(mem);
+}
+
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -692,6 +713,8 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 		mem_cgroup_threshold(mem);
 		if (unlikely(__memcg_event_check(mem, SOFTLIMIT_EVENTS_THRESH)))
 			mem_cgroup_update_tree(mem, page);
+		if (unlikely(__memcg_event_check(mem, WMARK_EVENTS_THRESH)))
+			mem_cgroup_check_wmark(mem);
 	}
 }
 
@@ -1121,6 +1144,95 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return &mz->reclaim_stat;
 }
 
+static unsigned long mem_cgroup_zone_reclaimable_pages(
+						struct mem_cgroup_per_zone *mz)
+{
+	int nr;
+	nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
+		MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
+
+	if (nr_swap_pages > 0)
+		nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
+			MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+
+	return nr;
+}
+
+void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
+						unsigned long nr_scanned)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->pages_scanned += nr_scanned;
+}
+
+bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!mem)
+		return 0;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->pages_scanned <
+				mem_cgroup_zone_reclaimable_pages(mz) * 6;
+	return 0;
+}
+
+bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return false;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		return mz->all_unreclaimable;
+
+	return false;
+}
+
+void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	if (!mem)
+		return;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (mz)
+		mz->all_unreclaimable = 1;
+}
+
+void mem_cgroup_clear_unreclaimable(struct page_cgroup *pc)
+{
+	struct mem_cgroup_per_zone *mz = NULL;
+
+	if (!pc)
+		return;
+
+	mz = page_cgroup_zoneinfo(pc);
+	if (mz) {
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = 0;
+	}
+
+	return;
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -1773,6 +1885,34 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 }
 
 /*
+ * Visit the first node after the last_scanned_node of @mem and use that to
+ * reclaim free pages from.
+ */
+int
+mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t *nodes)
+{
+	int next_nid;
+	int last_scanned;
+
+	last_scanned = mem->last_scanned_node;
+
+	/* Initial stage and start from node0 */
+	if (last_scanned == -1)
+		next_nid = 0;
+	else
+		next_nid = next_node(last_scanned, *nodes);
+
+	if (next_nid == MAX_NUMNODES)
+		next_nid = first_node(*nodes);
+
+	spin_lock(&mem->reclaim_param_lock);
+	mem->last_scanned_node = next_nid;
+	spin_unlock(&mem->reclaim_param_lock);
+
+	return next_nid;
+}
+
+/*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
@@ -2955,6 +3095,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	 * special functions.
 	 */
 
+	mem_cgroup_clear_unreclaimable(pc);
 	unlock_page_cgroup(pc);
 	/*
 	 * even after unlock, we have mem->res.usage here and this memcg
@@ -3377,7 +3518,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 						MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
-  		if (curusage >= oldusage)
+		if (curusage >= oldusage)
 			retry_count--;
 		else
 			oldusage = curusage;
@@ -3385,6 +3526,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 	if (!ret && enlarge)
 		memcg_oom_recover(memcg);
 
+	if (!mem_cgroup_is_root(memcg) && !memcg->kswapd_wait)
+		kswapd_run(0, memcg);
+
 	return ret;
 }
 
@@ -4747,6 +4891,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = mem;
+		mz->pages_scanned = 0;
+		mz->all_unreclaimable = 0;
 	}
 	return 0;
 }
@@ -4799,6 +4945,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
+	kswapd_stop(0, mem);
 	mem_cgroup_remove_from_trees(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
@@ -4867,6 +5014,48 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
 	return ret;
 }
 
+int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
+{
+	if (!mem || !kswapd_p)
+		return 0;
+
+	mem->kswapd_wait = &kswapd_p->kswapd_wait;
+	kswapd_p->kswapd_mem = mem;
+
+	return css_id(&mem->css);
+}
+
+wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return NULL;
+
+	return mem->kswapd_wait;
+}
+
+int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return -1;
+
+	return mem->last_scanned_node;
+}
+
+static void wake_memcg_kswapd(struct mem_cgroup *mem)
+{
+	wait_queue_head_t *wait;
+
+	if (!mem)
+		return;
+
+	wait = mem->kswapd_wait;
+
+	if (!waitqueue_active(wait))
+		return;
+
+	wake_up_interruptible(wait);
+}
+
 static int mem_cgroup_soft_limit_tree_init(void)
 {
 	struct mem_cgroup_tree_per_node *rtpn;
@@ -4942,6 +5131,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->last_scanned_node = -1;
 	spin_lock_init(&mem->reclaim_param_lock);
 	INIT_LIST_HEAD(&mem->oom_notify);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a53d91d..34f6165 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -46,6 +46,8 @@
 
 #include <linux/swapops.h>
 
+#include <linux/res_counter.h>
+
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -98,6 +100,8 @@ struct scan_control {
 	 * are scanned.
 	 */
 	nodemask_t	*nodemask;
+
+	int priority;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -1385,6 +1389,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 					ISOLATE_INACTIVE : ISOLATE_BOTH,
 			zone, sc->mem_cgroup,
 			0, file);
+
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
+
 		/*
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
@@ -1504,6 +1511,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
 	}
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
@@ -2127,11 +2135,19 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
 {
 	int i;
 	pg_data_t *pgdat = kswapd->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd->kswapd_mem;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return 1;
 
+	/* If after HZ/10, the cgroup is below the high wmark, it's premature */
+	if (mem) {
+		if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
+			return 1;
+		return 0;
+	}
+
 	/* If after HZ/10, a zone is below the high mark, it's premature */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
@@ -2370,6 +2386,212 @@ out:
 	return sc.nr_reclaimed;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * The function is used for per-memcg LRU. It scanns all the zones of the
+ * node and returns the nr_scanned and nr_reclaimed.
+ */
+static void balance_pgdat_node(pg_data_t *pgdat, int order,
+					struct scan_control *sc)
+{
+	int i, end_zone;
+	unsigned long total_scanned;
+	struct mem_cgroup *mem_cont = sc->mem_cgroup;
+	int priority = sc->priority;
+	int nid = pgdat->node_id;
+
+	/*
+	 * Scan in the highmem->dma direction for the highest
+	 * zone which needs scanning
+	 */
+	for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
+				priority != DEF_PRIORITY)
+			continue;
+		/*
+		 * Do some background aging of the anon list, to give
+		 * pages a chance to be referenced before reclaiming.
+		 */
+		if (inactive_anon_is_low(zone, sc))
+			shrink_active_list(SWAP_CLUSTER_MAX, zone,
+							sc, priority, 0);
+
+		end_zone = i;
+		goto scan;
+	}
+	return;
+
+scan:
+	total_scanned = 0;
+	/*
+	 * Now scan the zone in the dma->highmem direction, stopping
+	 * at the last zone which needs scanning.
+	 *
+	 * We do this because the page allocator works in the opposite
+	 * direction.  This prevents the page allocator from allocating
+	 * pages behind kswapd's direction of progress, which would
+	 * cause too much scanning of the lower zones.
+	 */
+	for (i = 0; i <= end_zone; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
+			priority != DEF_PRIORITY)
+			continue;
+
+		sc->nr_scanned = 0;
+		shrink_zone(priority, zone, sc);
+		total_scanned += sc->nr_scanned;
+
+		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
+			continue;
+
+		if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
+			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
+
+		/*
+		 * If we've done a decent amount of scanning and
+		 * the reclaim ratio is low, start doing writepage
+		 * even in laptop mode
+		 */
+		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
+			sc->may_writepage = 1;
+		}
+	}
+
+	sc->nr_scanned = total_scanned;
+	return;
+}
+
+/*
+ * Per cgroup background reclaim.
+ * TODO: Take off the order since memcg always do order 0
+ */
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+					      int order)
+{
+	int i, nid;
+	int start_node;
+	int priority;
+	int wmark_ok;
+	int loop = 0;
+	pg_data_t *pgdat;
+	nodemask_t do_nodes;
+	unsigned long total_scanned = 0;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = ULONG_MAX,
+		.swappiness = vm_swappiness,
+		.order = order,
+		.mem_cgroup = mem_cont,
+	};
+
+loop_again:
+	do_nodes = NODE_MASK_NONE;
+	sc.may_writepage = !laptop_mode;
+	sc.nr_reclaimed = 0;
+	total_scanned = 0;
+
+	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		sc.priority = priority;
+		wmark_ok = 0;
+		loop = 0;
+
+		/* The swap token gets in the way of swapout... */
+		if (!priority)
+			disable_swap_token();
+
+		if (priority == DEF_PRIORITY)
+			do_nodes = node_states[N_ONLINE];
+
+		while (1) {
+			nid = mem_cgroup_select_victim_node(mem_cont,
+							&do_nodes);
+
+			/* Indicate we have cycled the nodelist once
+			 * TODO: we might add MAX_RECLAIM_LOOP for preventing
+			 * kswapd burning cpu cycles.
+			 */
+			if (loop == 0) {
+				start_node = nid;
+				loop++;
+			} else if (nid == start_node)
+				break;
+
+			pgdat = NODE_DATA(nid);
+			balance_pgdat_node(pgdat, order, &sc);
+			total_scanned += sc.nr_scanned;
+
+			/* Set the node which has at least
+			 * one reclaimable zone
+			 */
+			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+				struct zone *zone = pgdat->node_zones + i;
+
+				if (!populated_zone(zone))
+					continue;
+
+				if (!mem_cgroup_mz_unreclaimable(mem_cont,
+								zone))
+					break;
+			}
+			if (i < 0)
+				node_clear(nid, do_nodes);
+
+			if (mem_cgroup_watermark_ok(mem_cont,
+							CHARGE_WMARK_HIGH)) {
+				wmark_ok = 1;
+				goto out;
+			}
+
+			if (nodes_empty(do_nodes)) {
+				wmark_ok = 1;
+				goto out;
+			}
+		}
+
+		/* All the nodes are unreclaimable, kswapd is done */
+		if (nodes_empty(do_nodes)) {
+			wmark_ok = 1;
+			goto out;
+		}
+
+		if (total_scanned && priority < DEF_PRIORITY - 2)
+			congestion_wait(WRITE, HZ/10);
+
+		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
+			break;
+	}
+out:
+	if (!wmark_ok) {
+		cond_resched();
+
+		try_to_freeze();
+
+		goto loop_again;
+	}
+
+	return sc.nr_reclaimed;
+}
+#else
+static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
+							int order)
+{
+	return 0;
+}
+#endif
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2388,6 +2610,7 @@ int kswapd(void *p)
 	unsigned long order;
 	struct kswapd *kswapd_p = (struct kswapd *)p;
 	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
+	struct mem_cgroup *mem = kswapd_p->kswapd_mem;
 	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
 	struct task_struct *tsk = current;
 	DEFINE_WAIT(wait);
@@ -2430,8 +2653,10 @@ int kswapd(void *p)
 		if (is_node_kswapd(kswapd_p)) {
 			new_order = pgdat->kswapd_max_order;
 			pgdat->kswapd_max_order = 0;
-		} else
+		} else {
+			/* mem cgroup does order 0 charging always */
 			new_order = 0;
+		}
 
 		if (order < new_order) {
 			/*
@@ -2492,8 +2717,12 @@ int kswapd(void *p)
 		 * after returning from the refrigerator
 		 */
 		if (!ret) {
-			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balance_pgdat(pgdat, order);
+			if (is_node_kswapd(kswapd_p)) {
+				trace_mm_vmscan_kswapd_wake(pgdat->node_id,
+								order);
+				balance_pgdat(pgdat, order);
+			} else
+				balance_mem_cgroup_pgdat(mem, order);
 		}
 	}
 	return 0;
@@ -2635,60 +2864,81 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
 }
 
 /*
- * This kswapd start function will be called by init and node-hot-add.
- * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
+ * This kswapd start function will be called by init, node-hot-add and memcg
+ * limiting. On node-hot-add, kswapd will moved to proper cpus if cpus are
+ * hot-added.
  */
-int kswapd_run(int nid)
+int kswapd_run(int nid, struct mem_cgroup *mem)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
 	struct task_struct *thr;
+	pg_data_t *pgdat = NULL;
 	struct kswapd *kswapd_p;
+	static char name[TASK_COMM_LEN];
+	int memcg_id;
 	int ret = 0;
 
-	if (pgdat->kswapd_wait)
-		return 0;
+	if (!mem) {
+		pgdat = NODE_DATA(nid);
+		if (pgdat->kswapd_wait)
+			return ret;
+	}
 
 	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
 	if (!kswapd_p)
 		return -ENOMEM;
 
 	init_waitqueue_head(&kswapd_p->kswapd_wait);
-	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
-	kswapd_p->kswapd_pgdat = pgdat;
-	thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
+	if (!mem) {
+		pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
+		kswapd_p->kswapd_pgdat = pgdat;
+		snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
+	} else {
+		memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
+		if (!memcg_id) {
+			kfree(kswapd_p);
+			return ret;
+		}
+		snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
+	}
+
+	thr = kthread_run(kswapd, kswapd_p, name);
 	if (IS_ERR(thr)) {
 		/* failure at boot is fatal */
 		BUG_ON(system_state == SYSTEM_BOOTING);
-		printk("Failed to start kswapd on node %d\n",nid);
 		ret = -1;
-	}
-	kswapd_p->kswapd_task = thr;
+	} else
+		kswapd_p->kswapd_task = thr;
 	return ret;
 }
 
 /*
  * Called by memory hotplug when all memory in a node is offlined.
+ * Also called by memcg when the cgroup is deleted.
  */
-void kswapd_stop(int nid)
+void kswapd_stop(int nid, struct mem_cgroup *mem)
 {
 	struct task_struct *thr = NULL;
 	struct kswapd *kswapd_p = NULL;
 	wait_queue_head_t *wait;
 
-	pg_data_t *pgdat = NODE_DATA(nid);
-
 	spin_lock(&kswapds_spinlock);
-	wait = pgdat->kswapd_wait;
+	if (!mem) {
+		pg_data_t *pgdat = NODE_DATA(nid);
+		wait = pgdat->kswapd_wait;
+	} else
+		wait = mem_cgroup_kswapd_wait(mem);
+
 	if (wait) {
 		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
 		thr = kswapd_p->kswapd_task;
 	}
 	spin_unlock(&kswapds_spinlock);
 
-	if (thr)
-		kthread_stop(thr);
-
-	kfree(kswapd_p);
+	if (kswapd_p) {
+		if (thr)
+			kthread_stop(thr);
+		kfree(kswapd_p);
+	}
 }
 
 static int __init kswapd_init(void)
@@ -2697,7 +2947,7 @@ static int __init kswapd_init(void)
 
 	swap_setup();
 	for_each_node_state(nid, N_HIGH_MEMORY)
- 		kswapd_run(nid);
+		kswapd_run(nid, NULL);
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;
 }
-- 
1.7.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/5] Per cgroup background reclaim.
  2011-01-13 22:00 ` [PATCH 4/5] Per cgroup background reclaim Ying Han
@ 2011-01-14  0:52   ` KAMEZAWA Hiroyuki
  2011-01-19  2:12     ` Ying Han
  0 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-14  0:52 UTC (permalink / raw)
  To: Ying Han
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Thu, 13 Jan 2011 14:00:34 -0800
Ying Han <yinghan@google.com> wrote:

> The current implementation of memcg only supports direct reclaim and this
> patch adds the support for background reclaim. Per cgroup background reclaim
> is needed which spreads out the memory pressure over longer period of time
> and smoothes out the system performance.
> 
> There is a kswapd kernel thread for each memory node. We add a different kswapd
> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
> field of a kswapd descriptor.
> 
> The kswapd() function now is shared between global and per cgroup kswapd thread.
> It is passed in with the kswapd descriptor which contains the information of
> either node or cgroup. Then the new function balance_mem_cgroup_pgdat is invoked
> if it is per cgroup kswapd thread. The balance_mem_cgroup_pgdat performs a
> priority loop similar to global reclaim. In each iteration it invokes
> balance_pgdat_node for all nodes on the system, which is a new function performs
> background reclaim per node. A fairness mechanism is implemented to remember the
> last node it was reclaiming from and always start at the next one. After reclaiming
> each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
> returns true. A per memcg zone will be marked as "unreclaimable" if the scanning
> rate is much greater than the reclaiming rate on the per cgroup LRU. The bit is
> cleared when there is a page charged to the cgroup being freed. Kswapd breaks the
> priority loop if all the zones are marked as "unreclaimable".
> 
> Change log v2...v1:
> 1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
> 2. remove checking the wmark from per-page charging. now it checks the wmark
> periodically based on the event counter.
> 3. move the per-cgroup per-zone clear_unreclaimable into uncharge stage.
> 4. shared the kswapd_run/kswapd_stop for per-cgroup and global background
> reclaim.
> 5. name the per-cgroup memcg as "memcg-id" (css->id). And the global kswapd
> keeps the same name.
> 6. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
> after freeing.
> 7. add the fairness in zonelist where memcg remember the last zone reclaimed
> from.
> 
> Signed-off-by: Ying Han <yinghan@google.com>

Hmm...at first, I like using workqueue rather than using a thread per memcg.




> ---
>  include/linux/memcontrol.h |   37 ++++++
>  include/linux/swap.h       |    4 +-
>  mm/memcontrol.c            |  192 ++++++++++++++++++++++++++++-
>  mm/vmscan.c                |  298 ++++++++++++++++++++++++++++++++++++++++----
>  4 files changed, 504 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 80a605f..69c6e41 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -25,6 +25,7 @@ struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
> +struct kswapd;
>  
>  /* Stats that can be updated by kernel. */
>  enum mem_cgroup_page_stat_item {
> @@ -94,6 +95,12 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
> +extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
> +				  struct kswapd *kswapd_p);
> +extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
> +extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
> +extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
> +					nodemask_t *nodes);
>  
>  static inline
>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> @@ -166,6 +173,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>  
> +void mem_cgroup_clear_unreclaimable(struct page_cgroup *pc);
> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> +					unsigned long nr_scanned);
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
>  
> @@ -361,6 +374,25 @@ static inline unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
>  	return -ENOSYS;
>  }
>  
> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
> +						struct zone *zone,
> +						unsigned long nr_scanned)
> +{
> +}
> +
> +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
> +							struct zone *zone)
> +{
> +}
> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
> +		struct zone *zone)
> +{
> +}
> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
> +						struct zone *zone)
> +{
> +}
> +
>  static inline
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  					    gfp_t gfp_mask)
> @@ -374,6 +406,11 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
>  	return 0;
>  }
>  
> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
> +								int zid)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 52122fa..b6b5cbb 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -292,8 +292,8 @@ static inline void scan_unevictable_unregister_node(struct node *node)
>  }
>  #endif
>  
> -extern int kswapd_run(int nid);
> -extern void kswapd_stop(int nid);
> +extern int kswapd_run(int nid, struct mem_cgroup *mem);
> +extern void kswapd_stop(int nid, struct mem_cgroup *mem);
>  
>  #ifdef CONFIG_MMU
>  /* linux/mm/shmem.c */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6ef26a7..e716ece 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -48,6 +48,8 @@
>  #include <linux/page_cgroup.h>
>  #include <linux/cpu.h>
>  #include <linux/oom.h>
> +#include <linux/kthread.h>
> +
>  #include "internal.h"
>  
>  #include <asm/uaccess.h>
> @@ -75,6 +77,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>   */
>  #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
>  #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
> +#define WMARK_EVENTS_THRESH (10) /* once in 1024 */
>  
>  /*
>   * Statistics for memory cgroup.
> @@ -131,7 +134,10 @@ struct mem_cgroup_per_zone {
>  	bool			on_tree;
>  	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
>  						/* use container_of	   */
> +	unsigned long		pages_scanned;	/* since last reclaim */
> +	int			all_unreclaimable;	/* All pages pinned */
>  };
> +
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
>  
> @@ -289,8 +295,16 @@ struct mem_cgroup {
>  	struct mem_cgroup_stat_cpu nocpu_base;
>  	spinlock_t pcp_counter_lock;
>  
> +	/*
> +	 * per cgroup background reclaim.
> +	 */
>  	wait_queue_head_t *kswapd_wait;
>  	unsigned long min_free_kbytes;
> +
> +	/* While doing per cgroup background reclaim, we cache the
> +	 * last node we reclaimed from
> +	 */
> +	int last_scanned_node;
>  };
>  
>  /* Stuffs for move charges at task migration. */
> @@ -380,6 +394,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  static void drain_all_stock_async(void);
>  static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
> +static void wake_memcg_kswapd(struct mem_cgroup *mem);
>  
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -568,6 +583,12 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
>  	return mz;
>  }
>  
> +static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
> +{
> +	if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
> +		wake_memcg_kswapd(mem);
> +}
> +

Low for trigger, High for stop ? 


>  /*
>   * Implementation Note: reading percpu statistics for memcg.
>   *
> @@ -692,6 +713,8 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
>  		mem_cgroup_threshold(mem);
>  		if (unlikely(__memcg_event_check(mem, SOFTLIMIT_EVENTS_THRESH)))
>  			mem_cgroup_update_tree(mem, page);
> +		if (unlikely(__memcg_event_check(mem, WMARK_EVENTS_THRESH)))
> +			mem_cgroup_check_wmark(mem);
>  	}

This is nice. 


>  }
>  
> @@ -1121,6 +1144,95 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>  	return &mz->reclaim_stat;
>  }
>  
> +static unsigned long mem_cgroup_zone_reclaimable_pages(
> +						struct mem_cgroup_per_zone *mz)
> +{
> +	int nr;
> +	nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
> +		MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
> +
> +	if (nr_swap_pages > 0)
> +		nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
> +			MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
> +
> +	return nr;
> +}
> +
> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> +						unsigned long nr_scanned)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		mz->pages_scanned += nr_scanned;
> +}
> +
> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +
> +	if (!mem)
> +		return 0;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		return mz->pages_scanned <
> +				mem_cgroup_zone_reclaimable_pages(mz) * 6;
> +	return 0;
> +}

Where does this "*6" come from ? please add comment. Or add macro in header
file and share the value with original.



> +
> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return false;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		return mz->all_unreclaimable;
> +
> +	return false;
> +}

I think you should check whether this zone has any page.
If no pages in this zone, you can't reclaim any.


> +
> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	if (!mem)
> +		return;
> +
> +	mz = mem_cgroup_zoneinfo(mem, nid, zid);
> +	if (mz)
> +		mz->all_unreclaimable = 1;
> +}

I like boolean for this kind ot true/false value.



> +
> +void mem_cgroup_clear_unreclaimable(struct page_cgroup *pc)
> +{
> +	struct mem_cgroup_per_zone *mz = NULL;
> +
> +	if (!pc)
> +		return;
> +
> +	mz = page_cgroup_zoneinfo(pc);
> +	if (mz) {
> +		mz->pages_scanned = 0;
> +		mz->all_unreclaimable = 0;
> +	}
> +
> +	return;
> +}
> +
>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -1773,6 +1885,34 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  }
>  
>  /*
> + * Visit the first node after the last_scanned_node of @mem and use that to
> + * reclaim free pages from.
> + */
> +int
> +mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t *nodes)
> +{
> +	int next_nid;
> +	int last_scanned;
> +
> +	last_scanned = mem->last_scanned_node;
> +
> +	/* Initial stage and start from node0 */
> +	if (last_scanned == -1)
> +		next_nid = 0;
> +	else
> +		next_nid = next_node(last_scanned, *nodes);
> +
> +	if (next_nid == MAX_NUMNODES)
> +		next_nid = first_node(*nodes);
> +
> +	spin_lock(&mem->reclaim_param_lock);
> +	mem->last_scanned_node = next_nid;
> +	spin_unlock(&mem->reclaim_param_lock);
> +

Is this 'lock' required ?

> +	return next_nid;
> +}
> +
> +/*
>   * Check OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
>   */
> @@ -2955,6 +3095,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  	 * special functions.
>  	 */
>  
> +	mem_cgroup_clear_unreclaimable(pc);
>  	unlock_page_cgroup(pc);

This kind of hook is not good....Can't you do this 'clear' by kswapd in
lazy way ?



>  	/*
>  	 * even after unlock, we have mem->res.usage here and this memcg
> @@ -3377,7 +3518,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  						MEM_CGROUP_RECLAIM_SHRINK);
>  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>  		/* Usage is reduced ? */
> -  		if (curusage >= oldusage)
> +		if (curusage >= oldusage)
>  			retry_count--;
>  		else
>  			oldusage = curusage;

?


> @@ -3385,6 +3526,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  	if (!ret && enlarge)
>  		memcg_oom_recover(memcg);
>  
> +	if (!mem_cgroup_is_root(memcg) && !memcg->kswapd_wait)
> +		kswapd_run(0, memcg);
> +
>  	return ret;
>  }

Hmm, this creates a thread when limit is set....So, tons of threads can be
created. Can't we do this by work_queue ?
Then, the number of threads will be scaled automatically.


>  
> @@ -4747,6 +4891,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>  		mz->usage_in_excess = 0;
>  		mz->on_tree = false;
>  		mz->mem = mem;
> +		mz->pages_scanned = 0;
> +		mz->all_unreclaimable = 0;
>  	}
>  	return 0;
>  }
> @@ -4799,6 +4945,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
>  {
>  	int node;
>  
> +	kswapd_stop(0, mem);
>  	mem_cgroup_remove_from_trees(mem);
>  	free_css_id(&mem_cgroup_subsys, &mem->css);
>  
> @@ -4867,6 +5014,48 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
>  	return ret;
>  }
>  
> +int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
> +{
> +	if (!mem || !kswapd_p)
> +		return 0;
> +
> +	mem->kswapd_wait = &kswapd_p->kswapd_wait;
> +	kswapd_p->kswapd_mem = mem;
> +
> +	return css_id(&mem->css);
> +}
> +
> +wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
> +{
> +	if (!mem)
> +		return NULL;
> +
> +	return mem->kswapd_wait;
> +}
> +
> +int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
> +{
> +	if (!mem)
> +		return -1;
> +
> +	return mem->last_scanned_node;
> +}
> +
> +static void wake_memcg_kswapd(struct mem_cgroup *mem)
> +{
> +	wait_queue_head_t *wait;
> +
> +	if (!mem)
> +		return;
> +
> +	wait = mem->kswapd_wait;
> +
> +	if (!waitqueue_active(wait))
> +		return;
> +
> +	wake_up_interruptible(wait);
> +}
> +
>  static int mem_cgroup_soft_limit_tree_init(void)
>  {
>  	struct mem_cgroup_tree_per_node *rtpn;
> @@ -4942,6 +5131,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  		res_counter_init(&mem->memsw, NULL);
>  	}
>  	mem->last_scanned_child = 0;
> +	mem->last_scanned_node = -1;

If we always start from 0 at the first run, I think this can be 0 at default.

>  	spin_lock_init(&mem->reclaim_param_lock);
>  	INIT_LIST_HEAD(&mem->oom_notify);
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a53d91d..34f6165 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -46,6 +46,8 @@
>  
>  #include <linux/swapops.h>
>  
> +#include <linux/res_counter.h>
> +
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -98,6 +100,8 @@ struct scan_control {
>  	 * are scanned.
>  	 */
>  	nodemask_t	*nodemask;
> +
> +	int priority;
>  };
>  
>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> @@ -1385,6 +1389,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  					ISOLATE_INACTIVE : ISOLATE_BOTH,
>  			zone, sc->mem_cgroup,
>  			0, file);
> +
> +		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
> +
>  		/*
>  		 * mem_cgroup_isolate_pages() keeps track of
>  		 * scanned pages on its own.
> @@ -1504,6 +1511,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>  		 * mem_cgroup_isolate_pages() keeps track of
>  		 * scanned pages on its own.
>  		 */
> +		mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
>  	}
>  
>  	reclaim_stat->recent_scanned[file] += nr_taken;
> @@ -2127,11 +2135,19 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
>  {
>  	int i;
>  	pg_data_t *pgdat = kswapd->kswapd_pgdat;
> +	struct mem_cgroup *mem = kswapd->kswapd_mem;
>  
>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
>  		return 1;
>  
> +	/* If after HZ/10, the cgroup is below the high wmark, it's premature */
> +	if (mem) {
> +		if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
> +			return 1;
> +		return 0;
> +	}
> +
>  	/* If after HZ/10, a zone is below the high mark, it's premature */
>  	for (i = 0; i < pgdat->nr_zones; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
> @@ -2370,6 +2386,212 @@ out:
>  	return sc.nr_reclaimed;
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * The function is used for per-memcg LRU. It scanns all the zones of the
> + * node and returns the nr_scanned and nr_reclaimed.
> + */
> +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> +					struct scan_control *sc)
> +{
> +	int i, end_zone;
> +	unsigned long total_scanned;
> +	struct mem_cgroup *mem_cont = sc->mem_cgroup;
> +	int priority = sc->priority;
> +	int nid = pgdat->node_id;
> +
> +	/*
> +	 * Scan in the highmem->dma direction for the highest
> +	 * zone which needs scanning
> +	 */
> +	for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> +				priority != DEF_PRIORITY)
> +			continue;
> +		/*
> +		 * Do some background aging of the anon list, to give
> +		 * pages a chance to be referenced before reclaiming.
> +		 */
> +		if (inactive_anon_is_low(zone, sc))
> +			shrink_active_list(SWAP_CLUSTER_MAX, zone,
> +							sc, priority, 0);

I think you can check per-zone memory usage here and compare it with
the value in previous run which set mz->all_unreclaimable.

If current_zone_usage < mz->usage_in_previous_run, you can clear
all_unreclaimable without hooks.

But please note that 'uncharge' doesn't mean pages turned to be reclaimable.
I'm not sure there are better hint or not.



> +
> +		end_zone = i;
> +		goto scan;
> +	}
> +	return;
> +
> +scan:
> +	total_scanned = 0;
> +	/*
> +	 * Now scan the zone in the dma->highmem direction, stopping
> +	 * at the last zone which needs scanning.
> +	 *
> +	 * We do this because the page allocator works in the opposite
> +	 * direction.  This prevents the page allocator from allocating
> +	 * pages behind kswapd's direction of progress, which would
> +	 * cause too much scanning of the lower zones.
> +	 */
> +	for (i = 0; i <= end_zone; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
> +			priority != DEF_PRIORITY)
> +			continue;
> +
> +		sc->nr_scanned = 0;
> +		shrink_zone(priority, zone, sc);
> +		total_scanned += sc->nr_scanned;
> +
> +		if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
> +			continue;
> +
> +		if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
> +			mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
> +
> +		/*
> +		 * If we've done a decent amount of scanning and
> +		 * the reclaim ratio is low, start doing writepage
> +		 * even in laptop mode
> +		 */
> +		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> +		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> +			sc->may_writepage = 1;
> +		}
> +	}
> +
> +	sc->nr_scanned = total_scanned;
> +	return;
> +}
> +
> +/*
> + * Per cgroup background reclaim.
> + * TODO: Take off the order since memcg always do order 0
> + */
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +					      int order)
> +{
> +	int i, nid;
> +	int start_node;
> +	int priority;
> +	int wmark_ok;
> +	int loop = 0;
> +	pg_data_t *pgdat;
> +	nodemask_t do_nodes;
> +	unsigned long total_scanned = 0;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		.nr_to_reclaim = ULONG_MAX,
> +		.swappiness = vm_swappiness,
> +		.order = order,
> +		.mem_cgroup = mem_cont,
> +	};
> +
> +loop_again:
> +	do_nodes = NODE_MASK_NONE;
> +	sc.may_writepage = !laptop_mode;
> +	sc.nr_reclaimed = 0;
> +	total_scanned = 0;
> +
> +	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> +		sc.priority = priority;
> +		wmark_ok = 0;
> +		loop = 0;
> +
> +		/* The swap token gets in the way of swapout... */
> +		if (!priority)
> +			disable_swap_token();
> +
> +		if (priority == DEF_PRIORITY)
> +			do_nodes = node_states[N_ONLINE];
> +
> +		while (1) {
> +			nid = mem_cgroup_select_victim_node(mem_cont,
> +							&do_nodes);
> +
> +			/* Indicate we have cycled the nodelist once
> +			 * TODO: we might add MAX_RECLAIM_LOOP for preventing
> +			 * kswapd burning cpu cycles.
> +			 */
> +			if (loop == 0) {
> +				start_node = nid;
> +				loop++;
> +			} else if (nid == start_node)
> +				break;
> +
> +			pgdat = NODE_DATA(nid);
> +			balance_pgdat_node(pgdat, order, &sc);
> +			total_scanned += sc.nr_scanned;
> +
> +			/* Set the node which has at least
> +			 * one reclaimable zone
> +			 */
> +			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +				struct zone *zone = pgdat->node_zones + i;
> +
> +				if (!populated_zone(zone))
> +					continue;
> +
> +				if (!mem_cgroup_mz_unreclaimable(mem_cont,
> +								zone))
> +					break;
> +			}
> +			if (i < 0)
> +				node_clear(nid, do_nodes);
> +
> +			if (mem_cgroup_watermark_ok(mem_cont,
> +							CHARGE_WMARK_HIGH)) {
> +				wmark_ok = 1;
> +				goto out;
> +			}
> +
> +			if (nodes_empty(do_nodes)) {
> +				wmark_ok = 1;
> +				goto out;
> +			}
> +		}
> +
> +		/* All the nodes are unreclaimable, kswapd is done */
> +		if (nodes_empty(do_nodes)) {
> +			wmark_ok = 1;
> +			goto out;
> +		}
> +
> +		if (total_scanned && priority < DEF_PRIORITY - 2)
> +			congestion_wait(WRITE, HZ/10);
> +
> +		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> +			break;
> +	}
> +out:
> +	if (!wmark_ok) {
> +		cond_resched();
> +
> +		try_to_freeze();
> +
> +		goto loop_again;
> +	}
> +
> +	return sc.nr_reclaimed;
> +}
> +#else
> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> +							int order)
> +{
> +	return 0;
> +}
> +#endif
> +
>  /*
>   * The background pageout daemon, started as a kernel thread
>   * from the init process.
> @@ -2388,6 +2610,7 @@ int kswapd(void *p)
>  	unsigned long order;
>  	struct kswapd *kswapd_p = (struct kswapd *)p;
>  	pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> +	struct mem_cgroup *mem = kswapd_p->kswapd_mem;
>  	wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>  	struct task_struct *tsk = current;
>  	DEFINE_WAIT(wait);
> @@ -2430,8 +2653,10 @@ int kswapd(void *p)
>  		if (is_node_kswapd(kswapd_p)) {
>  			new_order = pgdat->kswapd_max_order;
>  			pgdat->kswapd_max_order = 0;
> -		} else
> +		} else {
> +			/* mem cgroup does order 0 charging always */
>  			new_order = 0;
> +		}
>  
>  		if (order < new_order) {
>  			/*
> @@ -2492,8 +2717,12 @@ int kswapd(void *p)
>  		 * after returning from the refrigerator
>  		 */
>  		if (!ret) {
> -			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> -			balance_pgdat(pgdat, order);
> +			if (is_node_kswapd(kswapd_p)) {
> +				trace_mm_vmscan_kswapd_wake(pgdat->node_id,
> +								order);
> +				balance_pgdat(pgdat, order);
> +			} else
> +				balance_mem_cgroup_pgdat(mem, order);
>  		}
>  	}
>  	return 0;
> @@ -2635,60 +2864,81 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
>  }
>  
>  /*
> - * This kswapd start function will be called by init and node-hot-add.
> - * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
> + * This kswapd start function will be called by init, node-hot-add and memcg
> + * limiting. On node-hot-add, kswapd will moved to proper cpus if cpus are
> + * hot-added.
>   */
> -int kswapd_run(int nid)
> +int kswapd_run(int nid, struct mem_cgroup *mem)
>  {
> -	pg_data_t *pgdat = NODE_DATA(nid);
>  	struct task_struct *thr;
> +	pg_data_t *pgdat = NULL;
>  	struct kswapd *kswapd_p;
> +	static char name[TASK_COMM_LEN];
> +	int memcg_id;
>  	int ret = 0;
>  
> -	if (pgdat->kswapd_wait)
> -		return 0;
> +	if (!mem) {
> +		pgdat = NODE_DATA(nid);
> +		if (pgdat->kswapd_wait)
> +			return ret;
> +	}
>  
>  	kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
>  	if (!kswapd_p)
>  		return -ENOMEM;
>  
>  	init_waitqueue_head(&kswapd_p->kswapd_wait);
> -	pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> -	kswapd_p->kswapd_pgdat = pgdat;
> -	thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
> +	if (!mem) {
> +		pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> +		kswapd_p->kswapd_pgdat = pgdat;
> +		snprintf(name, TASK_COMM_LEN, "kswapd_%d", nid);
> +	} else {
> +		memcg_id = mem_cgroup_init_kswapd(mem, kswapd_p);
> +		if (!memcg_id) {
> +			kfree(kswapd_p);
> +			return ret;
> +		}
> +		snprintf(name, TASK_COMM_LEN, "memcg_%d", memcg_id);
> +	}

This naming is good and fits TASK_COMM_LEN.


Thank you for your effort.
-Kame


> +
> +	thr = kthread_run(kswapd, kswapd_p, name);
>  	if (IS_ERR(thr)) {
>  		/* failure at boot is fatal */
>  		BUG_ON(system_state == SYSTEM_BOOTING);
> -		printk("Failed to start kswapd on node %d\n",nid);
>  		ret = -1;
> -	}
> -	kswapd_p->kswapd_task = thr;
> +	} else
> +		kswapd_p->kswapd_task = thr;
>  	return ret;
>  }
>  
>  /*
>   * Called by memory hotplug when all memory in a node is offlined.
> + * Also called by memcg when the cgroup is deleted.
>   */
> -void kswapd_stop(int nid)
> +void kswapd_stop(int nid, struct mem_cgroup *mem)
>  {
>  	struct task_struct *thr = NULL;
>  	struct kswapd *kswapd_p = NULL;
>  	wait_queue_head_t *wait;
>  
> -	pg_data_t *pgdat = NODE_DATA(nid);
> -
>  	spin_lock(&kswapds_spinlock);
> -	wait = pgdat->kswapd_wait;
> +	if (!mem) {
> +		pg_data_t *pgdat = NODE_DATA(nid);
> +		wait = pgdat->kswapd_wait;
> +	} else
> +		wait = mem_cgroup_kswapd_wait(mem);
> +
>  	if (wait) {
>  		kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
>  		thr = kswapd_p->kswapd_task;
>  	}
>  	spin_unlock(&kswapds_spinlock);
>  
> -	if (thr)
> -		kthread_stop(thr);
> -
> -	kfree(kswapd_p);
> +	if (kswapd_p) {
> +		if (thr)
> +			kthread_stop(thr);
> +		kfree(kswapd_p);
> +	}
>  }
>  
>  static int __init kswapd_init(void)
> @@ -2697,7 +2947,7 @@ static int __init kswapd_init(void)
>  
>  	swap_setup();
>  	for_each_node_state(nid, N_HIGH_MEMORY)
> - 		kswapd_run(nid);
> +		kswapd_run(nid, NULL);
>  	hotcpu_notifier(cpu_callback, 0);
>  	return 0;
>  }
> -- 
> 1.7.3.1
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/5] Per cgroup background reclaim.
  2011-01-14  0:52   ` KAMEZAWA Hiroyuki
@ 2011-01-19  2:12     ` Ying Han
  0 siblings, 0 replies; 17+ messages in thread
From: Ying Han @ 2011-01-19  2:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, Andrew Morton, Mel Gorman,
	Johannes Weiner, Christoph Lameter, Wu Fengguang, Andi Kleen,
	Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Tejun Heo, linux-mm

On Thu, Jan 13, 2011 at 4:52 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 13 Jan 2011 14:00:34 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> The current implementation of memcg only supports direct reclaim and this
>> patch adds the support for background reclaim. Per cgroup background reclaim
>> is needed which spreads out the memory pressure over longer period of time
>> and smoothes out the system performance.
>>
>> There is a kswapd kernel thread for each memory node. We add a different kswapd
>> for each cgroup. The kswapd is sleeping in the wait queue headed at kswapd_wait
>> field of a kswapd descriptor.
>>
>> The kswapd() function now is shared between global and per cgroup kswapd thread.
>> It is passed in with the kswapd descriptor which contains the information of
>> either node or cgroup. Then the new function balance_mem_cgroup_pgdat is invoked
>> if it is per cgroup kswapd thread. The balance_mem_cgroup_pgdat performs a
>> priority loop similar to global reclaim. In each iteration it invokes
>> balance_pgdat_node for all nodes on the system, which is a new function performs
>> background reclaim per node. A fairness mechanism is implemented to remember the
>> last node it was reclaiming from and always start at the next one. After reclaiming
>> each node, it checks mem_cgroup_watermark_ok() and breaks the priority loop if
>> returns true. A per memcg zone will be marked as "unreclaimable" if the scanning
>> rate is much greater than the reclaiming rate on the per cgroup LRU. The bit is
>> cleared when there is a page charged to the cgroup being freed. Kswapd breaks the
>> priority loop if all the zones are marked as "unreclaimable".
>>
>> Change log v2...v1:
>> 1. start/stop the per-cgroup kswapd at create/delete cgroup stage.
>> 2. remove checking the wmark from per-page charging. now it checks the wmark
>> periodically based on the event counter.
>> 3. move the per-cgroup per-zone clear_unreclaimable into uncharge stage.
>> 4. shared the kswapd_run/kswapd_stop for per-cgroup and global background
>> reclaim.
>> 5. name the per-cgroup memcg as "memcg-id" (css->id). And the global kswapd
>> keeps the same name.
>> 6. fix a race on kswapd_stop while the per-memcg-per-zone info could be accessed
>> after freeing.
>> 7. add the fairness in zonelist where memcg remember the last zone reclaimed
>> from.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>

Thank you for your comments ~

> Hmm...at first, I like using workqueue rather than using a thread per memcg.

I plan this as part of performance optimization effort which mainly
focus on reducing
the lock contention between the threads.

>
>
>
>
>> ---
>>  include/linux/memcontrol.h |   37 ++++++
>>  include/linux/swap.h       |    4 +-
>>  mm/memcontrol.c            |  192 ++++++++++++++++++++++++++++-
>>  mm/vmscan.c                |  298 ++++++++++++++++++++++++++++++++++++++++----
>>  4 files changed, 504 insertions(+), 27 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 80a605f..69c6e41 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -25,6 +25,7 @@ struct mem_cgroup;
>>  struct page_cgroup;
>>  struct page;
>>  struct mm_struct;
>> +struct kswapd;
>>
>>  /* Stats that can be updated by kernel. */
>>  enum mem_cgroup_page_stat_item {
>> @@ -94,6 +95,12 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>>  extern int mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags);
>> +extern int mem_cgroup_init_kswapd(struct mem_cgroup *mem,
>> +                               struct kswapd *kswapd_p);
>> +extern wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem);
>> +extern int mem_cgroup_last_scanned_node(struct mem_cgroup *mem);
>> +extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
>> +                                     nodemask_t *nodes);
>>
>>  static inline
>>  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
>> @@ -166,6 +173,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>                                               gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>>
>> +void mem_cgroup_clear_unreclaimable(struct page_cgroup *pc);
>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>> +                                     unsigned long nr_scanned);
>>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>>  struct mem_cgroup;
>>
>> @@ -361,6 +374,25 @@ static inline unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
>>       return -ENOSYS;
>>  }
>>
>> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
>> +                                             struct zone *zone,
>> +                                             unsigned long nr_scanned)
>> +{
>> +}
>> +
>> +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
>> +                                                     struct zone *zone)
>> +{
>> +}
>> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
>> +             struct zone *zone)
>> +{
>> +}
>> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
>> +                                             struct zone *zone)
>> +{
>> +}
>> +
>>  static inline
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>                                           gfp_t gfp_mask)
>> @@ -374,6 +406,11 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
>>       return 0;
>>  }
>>
>> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
>> +                                                             int zid)
>> +{
>> +     return false;
>> +}
>>  #endif /* CONFIG_CGROUP_MEM_CONT */
>>
>>  #endif /* _LINUX_MEMCONTROL_H */
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 52122fa..b6b5cbb 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -292,8 +292,8 @@ static inline void scan_unevictable_unregister_node(struct node *node)
>>  }
>>  #endif
>>
>> -extern int kswapd_run(int nid);
>> -extern void kswapd_stop(int nid);
>> +extern int kswapd_run(int nid, struct mem_cgroup *mem);
>> +extern void kswapd_stop(int nid, struct mem_cgroup *mem);
>>
>>  #ifdef CONFIG_MMU
>>  /* linux/mm/shmem.c */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6ef26a7..e716ece 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -48,6 +48,8 @@
>>  #include <linux/page_cgroup.h>
>>  #include <linux/cpu.h>
>>  #include <linux/oom.h>
>> +#include <linux/kthread.h>
>> +
>>  #include "internal.h"
>>
>>  #include <asm/uaccess.h>
>> @@ -75,6 +77,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>>   */
>>  #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
>>  #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
>> +#define WMARK_EVENTS_THRESH (10) /* once in 1024 */
>>
>>  /*
>>   * Statistics for memory cgroup.
>> @@ -131,7 +134,10 @@ struct mem_cgroup_per_zone {
>>       bool                    on_tree;
>>       struct mem_cgroup       *mem;           /* Back pointer, we cannot */
>>                                               /* use container_of        */
>> +     unsigned long           pages_scanned;  /* since last reclaim */
>> +     int                     all_unreclaimable;      /* All pages pinned */
>>  };
>> +
>>  /* Macro for accessing counter */
>>  #define MEM_CGROUP_ZSTAT(mz, idx)    ((mz)->count[(idx)])
>>
>> @@ -289,8 +295,16 @@ struct mem_cgroup {
>>       struct mem_cgroup_stat_cpu nocpu_base;
>>       spinlock_t pcp_counter_lock;
>>
>> +     /*
>> +      * per cgroup background reclaim.
>> +      */
>>       wait_queue_head_t *kswapd_wait;
>>       unsigned long min_free_kbytes;
>> +
>> +     /* While doing per cgroup background reclaim, we cache the
>> +      * last node we reclaimed from
>> +      */
>> +     int last_scanned_node;
>>  };
>>
>>  /* Stuffs for move charges at task migration. */
>> @@ -380,6 +394,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
>>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>>  static void drain_all_stock_async(void);
>>  static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
>> +static void wake_memcg_kswapd(struct mem_cgroup *mem);
>>
>>  static struct mem_cgroup_per_zone *
>>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
>> @@ -568,6 +583,12 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
>>       return mz;
>>  }
>>
>> +static void mem_cgroup_check_wmark(struct mem_cgroup *mem)
>> +{
>> +     if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_LOW))
>> +             wake_memcg_kswapd(mem);
>> +}
>> +
>
> Low for trigger, High for stop ?
>
>
>>  /*
>>   * Implementation Note: reading percpu statistics for memcg.
>>   *
>> @@ -692,6 +713,8 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
>>               mem_cgroup_threshold(mem);
>>               if (unlikely(__memcg_event_check(mem, SOFTLIMIT_EVENTS_THRESH)))
>>                       mem_cgroup_update_tree(mem, page);
>> +             if (unlikely(__memcg_event_check(mem, WMARK_EVENTS_THRESH)))
>> +                     mem_cgroup_check_wmark(mem);
>>       }
>
> This is nice.

>
>
>>  }
>>
>> @@ -1121,6 +1144,95 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>>       return &mz->reclaim_stat;
>>  }
>>
>> +static unsigned long mem_cgroup_zone_reclaimable_pages(
>> +                                             struct mem_cgroup_per_zone *mz)
>> +{
>> +     int nr;
>> +     nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
>> +             MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
>> +
>> +     if (nr_swap_pages > 0)
>> +             nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
>> +                     MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
>> +
>> +     return nr;
>> +}
>> +
>> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
>> +                                             unsigned long nr_scanned)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             mz->pages_scanned += nr_scanned;
>> +}
>> +
>> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +
>> +     if (!mem)
>> +             return 0;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             return mz->pages_scanned <
>> +                             mem_cgroup_zone_reclaimable_pages(mz) * 6;
>> +     return 0;
>> +}
>
> Where does this "*6" come from ? please add comment. Or add macro in header
> file and share the value with original.

This will be changed on next post, and I plan to define a macro of the
magic number
shared between per-zone & per-memcg reclaim.

>
>
>
>> +
>> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return false;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             return mz->all_unreclaimable;
>> +
>> +     return false;
>> +}
>
> I think you should check whether this zone has any page.
> If no pages in this zone, you can't reclaim any.

I think that has been covered in the mem_cgroup_zone_reclaimable():
>-------if (mz)
>------->-------return mz->pages_scanned <
>------->------->------->-------mem_cgroup_zone_reclaimable_pages(mz) *
>------->------->------->-------ZONE_RECLAIMABLE_RATE;

In this case, the mem_cgroup_zone_reclaimable_pages(mz) == 0 and the
function returns false.

>
>
>> +
>> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +     int nid = zone_to_nid(zone);
>> +     int zid = zone_idx(zone);
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     mz = mem_cgroup_zoneinfo(mem, nid, zid);
>> +     if (mz)
>> +             mz->all_unreclaimable = 1;
>> +}
>
> I like boolean for this kind ot true/false value.

Changed in the next post.

>
>
>
>> +
>> +void mem_cgroup_clear_unreclaimable(struct page_cgroup *pc)
>> +{
>> +     struct mem_cgroup_per_zone *mz = NULL;
>> +
>> +     if (!pc)
>> +             return;
>> +
>> +     mz = page_cgroup_zoneinfo(pc);
>> +     if (mz) {
>> +             mz->pages_scanned = 0;
>> +             mz->all_unreclaimable = 0;
>> +     }
>> +
>> +     return;
>> +}
>> +
>>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>                                       struct list_head *dst,
>>                                       unsigned long *scanned, int order,
>> @@ -1773,6 +1885,34 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>>  }
>>
>>  /*
>> + * Visit the first node after the last_scanned_node of @mem and use that to
>> + * reclaim free pages from.
>> + */
>> +int
>> +mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t *nodes)
>> +{
>> +     int next_nid;
>> +     int last_scanned;
>> +
>> +     last_scanned = mem->last_scanned_node;
>> +
>> +     /* Initial stage and start from node0 */
>> +     if (last_scanned == -1)
>> +             next_nid = 0;
>> +     else
>> +             next_nid = next_node(last_scanned, *nodes);
>> +
>> +     if (next_nid == MAX_NUMNODES)
>> +             next_nid = first_node(*nodes);
>> +
>> +     spin_lock(&mem->reclaim_param_lock);
>> +     mem->last_scanned_node = next_nid;
>> +     spin_unlock(&mem->reclaim_param_lock);
>> +
>
> Is this 'lock' required ?

Changed in the next post.

>
>> +     return next_nid;
>> +}
>> +
>> +/*
>>   * Check OOM-Killer is already running under our hierarchy.
>>   * If someone is running, return false.
>>   */
>> @@ -2955,6 +3095,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>>        * special functions.
>>        */
>>
>> +     mem_cgroup_clear_unreclaimable(pc);
>>       unlock_page_cgroup(pc);
>
> This kind of hook is not good....Can't you do this 'clear' by kswapd in
> lazy way ?

I can look into that.

>
>
>
>>       /*
>>        * even after unlock, we have mem->res.usage here and this memcg
>> @@ -3377,7 +3518,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>>                                               MEM_CGROUP_RECLAIM_SHRINK);
>>               curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>>               /* Usage is reduced ? */
>> -             if (curusage >= oldusage)
>> +             if (curusage >= oldusage)
>>                       retry_count--;
>>               else
>>                       oldusage = curusage;
>
> ?
>
>
>> @@ -3385,6 +3526,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>>       if (!ret && enlarge)
>>               memcg_oom_recover(memcg);
>>
>> +     if (!mem_cgroup_is_root(memcg) && !memcg->kswapd_wait)
>> +             kswapd_run(0, memcg);
>> +
>>       return ret;
>>  }
>
> Hmm, this creates a thread when limit is set....So, tons of threads can be
> created. Can't we do this by work_queue ?
> Then, the number of threads will be scaled automatically.

Some effort I plan to do next for reducing the lock contention.

>
>
>>
>> @@ -4747,6 +4891,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>>               mz->usage_in_excess = 0;
>>               mz->on_tree = false;
>>               mz->mem = mem;
>> +             mz->pages_scanned = 0;
>> +             mz->all_unreclaimable = 0;
>>       }
>>       return 0;
>>  }
>> @@ -4799,6 +4945,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
>>  {
>>       int node;
>>
>> +     kswapd_stop(0, mem);
>>       mem_cgroup_remove_from_trees(mem);
>>       free_css_id(&mem_cgroup_subsys, &mem->css);
>>
>> @@ -4867,6 +5014,48 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
>>       return ret;
>>  }
>>
>> +int mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
>> +{
>> +     if (!mem || !kswapd_p)
>> +             return 0;
>> +
>> +     mem->kswapd_wait = &kswapd_p->kswapd_wait;
>> +     kswapd_p->kswapd_mem = mem;
>> +
>> +     return css_id(&mem->css);
>> +}
>> +
>> +wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
>> +{
>> +     if (!mem)
>> +             return NULL;
>> +
>> +     return mem->kswapd_wait;
>> +}
>> +
>> +int mem_cgroup_last_scanned_node(struct mem_cgroup *mem)
>> +{
>> +     if (!mem)
>> +             return -1;
>> +
>> +     return mem->last_scanned_node;
>> +}
>> +
>> +static void wake_memcg_kswapd(struct mem_cgroup *mem)
>> +{
>> +     wait_queue_head_t *wait;
>> +
>> +     if (!mem)
>> +             return;
>> +
>> +     wait = mem->kswapd_wait;
>> +
>> +     if (!waitqueue_active(wait))
>> +             return;
>> +
>> +     wake_up_interruptible(wait);
>> +}
>> +
>>  static int mem_cgroup_soft_limit_tree_init(void)
>>  {
>>       struct mem_cgroup_tree_per_node *rtpn;
>> @@ -4942,6 +5131,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>>               res_counter_init(&mem->memsw, NULL);
>>       }
>>       mem->last_scanned_child = 0;
>> +     mem->last_scanned_node = -1;
>
> If we always start from 0 at the first run, I think this can be 0 at default.
>
>>       spin_lock_init(&mem->reclaim_param_lock);
>>       INIT_LIST_HEAD(&mem->oom_notify);
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index a53d91d..34f6165 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -46,6 +46,8 @@
>>
>>  #include <linux/swapops.h>
>>
>> +#include <linux/res_counter.h>
>> +
>>  #include "internal.h"
>>
>>  #define CREATE_TRACE_POINTS
>> @@ -98,6 +100,8 @@ struct scan_control {
>>        * are scanned.
>>        */
>>       nodemask_t      *nodemask;
>> +
>> +     int priority;
>>  };
>>
>>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
>> @@ -1385,6 +1389,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>>                                       ISOLATE_INACTIVE : ISOLATE_BOTH,
>>                       zone, sc->mem_cgroup,
>>                       0, file);
>> +
>> +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, nr_scanned);
>> +
>>               /*
>>                * mem_cgroup_isolate_pages() keeps track of
>>                * scanned pages on its own.
>> @@ -1504,6 +1511,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>>                * mem_cgroup_isolate_pages() keeps track of
>>                * scanned pages on its own.
>>                */
>> +             mem_cgroup_mz_pages_scanned(sc->mem_cgroup, zone, pgscanned);
>>       }
>>
>>       reclaim_stat->recent_scanned[file] += nr_taken;
>> @@ -2127,11 +2135,19 @@ static int sleeping_prematurely(struct kswapd *kswapd, int order,
>>  {
>>       int i;
>>       pg_data_t *pgdat = kswapd->kswapd_pgdat;
>> +     struct mem_cgroup *mem = kswapd->kswapd_mem;
>>
>>       /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>>       if (remaining)
>>               return 1;
>>
>> +     /* If after HZ/10, the cgroup is below the high wmark, it's premature */
>> +     if (mem) {
>> +             if (!mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
>> +                     return 1;
>> +             return 0;
>> +     }
>> +
>>       /* If after HZ/10, a zone is below the high mark, it's premature */
>>       for (i = 0; i < pgdat->nr_zones; i++) {
>>               struct zone *zone = pgdat->node_zones + i;
>> @@ -2370,6 +2386,212 @@ out:
>>       return sc.nr_reclaimed;
>>  }
>>
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>> +/*
>> + * The function is used for per-memcg LRU. It scanns all the zones of the
>> + * node and returns the nr_scanned and nr_reclaimed.
>> + */
>> +static void balance_pgdat_node(pg_data_t *pgdat, int order,
>> +                                     struct scan_control *sc)
>> +{
>> +     int i, end_zone;
>> +     unsigned long total_scanned;
>> +     struct mem_cgroup *mem_cont = sc->mem_cgroup;
>> +     int priority = sc->priority;
>> +     int nid = pgdat->node_id;
>> +
>> +     /*
>> +      * Scan in the highmem->dma direction for the highest
>> +      * zone which needs scanning
>> +      */
>> +     for (i = pgdat->nr_zones - 1; i >= 0; i--) {
>> +             struct zone *zone = pgdat->node_zones + i;
>> +
>> +             if (!populated_zone(zone))
>> +                     continue;
>> +
>> +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
>> +                             priority != DEF_PRIORITY)
>> +                     continue;
>> +             /*
>> +              * Do some background aging of the anon list, to give
>> +              * pages a chance to be referenced before reclaiming.
>> +              */
>> +             if (inactive_anon_is_low(zone, sc))
>> +                     shrink_active_list(SWAP_CLUSTER_MAX, zone,
>> +                                                     sc, priority, 0);
>
> I think you can check per-zone memory usage here and compare it with
> the value in previous run which set mz->all_unreclaimable.
>
> If current_zone_usage < mz->usage_in_previous_run, you can clear
> all_unreclaimable without hooks.
>
> But please note that 'uncharge' doesn't mean pages turned to be reclaimable.
> I'm not sure there are better hint or not.
>
>
>
>> +
>> +             end_zone = i;
>> +             goto scan;
>> +     }
>> +     return;
>> +
>> +scan:
>> +     total_scanned = 0;
>> +     /*
>> +      * Now scan the zone in the dma->highmem direction, stopping
>> +      * at the last zone which needs scanning.
>> +      *
>> +      * We do this because the page allocator works in the opposite
>> +      * direction.  This prevents the page allocator from allocating
>> +      * pages behind kswapd's direction of progress, which would
>> +      * cause too much scanning of the lower zones.
>> +      */
>> +     for (i = 0; i <= end_zone; i++) {
>> +             struct zone *zone = pgdat->node_zones + i;
>> +
>> +             if (!populated_zone(zone))
>> +                     continue;
>> +
>> +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone) &&
>> +                     priority != DEF_PRIORITY)
>> +                     continue;
>> +
>> +             sc->nr_scanned = 0;
>> +             shrink_zone(priority, zone, sc);
>> +             total_scanned += sc->nr_scanned;
>> +
>> +             if (mem_cgroup_mz_unreclaimable(mem_cont, zone))
>> +                     continue;
>> +
>> +             if (!mem_cgroup_zone_reclaimable(mem_cont, nid, i))
>> +                     mem_cgroup_mz_set_unreclaimable(mem_cont, zone);
>> +
>> +             /*
>> +              * If we've done a decent amount of scanning and
>> +              * the reclaim ratio is low, start doing writepage
>> +              * even in laptop mode
>> +              */
>> +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
>> +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
>> +                     sc->may_writepage = 1;
>> +             }
>> +     }
>> +
>> +     sc->nr_scanned = total_scanned;
>> +     return;
>> +}
>> +
>> +/*
>> + * Per cgroup background reclaim.
>> + * TODO: Take off the order since memcg always do order 0
>> + */
>> +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
>> +                                           int order)
>> +{
>> +     int i, nid;
>> +     int start_node;
>> +     int priority;
>> +     int wmark_ok;
>> +     int loop = 0;
>> +     pg_data_t *pgdat;
>> +     nodemask_t do_nodes;
>> +     unsigned long total_scanned = 0;
>> +     struct scan_control sc = {
>> +             .gfp_mask = GFP_KERNEL,
>> +             .may_unmap = 1,
>> +             .may_swap = 1,
>> +             .nr_to_reclaim = ULONG_MAX,
>> +             .swappiness = vm_swappiness,
>> +             .order = order,
>> +             .mem_cgroup = mem_cont,
>> +     };
>> +
>> +loop_again:
>> +     do_nodes = NODE_MASK_NONE;
>> +     sc.may_writepage = !laptop_mode;
>> +     sc.nr_reclaimed = 0;
>> +     total_scanned = 0;
>> +
>> +     for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>> +             sc.priority = priority;
>> +             wmark_ok = 0;
>> +             loop = 0;
>> +
>> +             /* The swap token gets in the way of swapout... */
>> +             if (!priority)
>> +                     disable_swap_token();
>> +
>> +             if (priority == DEF_PRIORITY)
>> +                     do_nodes = node_states[N_ONLINE];
>> +
>> +             while (1) {
>> +                     nid = mem_cgroup_select_victim_node(mem_cont,
>> +                                                     &do_nodes);
>> +
>> +                     /* Indicate we have cycled the nodelist once
>> +                      * TODO: we might add MAX_RECLAIM_LOOP for preventing
>> +                      * kswapd burning cpu cycles

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 5/5] Add more per memcg stats.
  2011-01-13 22:00 [PATCH 0/5] memcg: per cgroup background reclaim Ying Han
                   ` (3 preceding siblings ...)
  2011-01-13 22:00 ` [PATCH 4/5] Per cgroup background reclaim Ying Han
@ 2011-01-13 22:00 ` Ying Han
  4 siblings, 0 replies; 17+ messages in thread
From: Ying Han @ 2011-01-13 22:00 UTC (permalink / raw)
  To: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Andrew Morton,
	Mel Gorman, Johannes Weiner, Christoph Lameter, Wu Fengguang,
	Andi Kleen, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Tejun Heo
  Cc: linux-mm

A bunch of statistics are added in memory.stat to monitor per cgroup
kswapd performance.

$cat /dev/cgroup/yinghan/memory.stat
kswapd_steal 12588994
pg_pgsteal 0
kswapd_pgscan 18629519
pg_scan 0
pgrefill 2893517
pgoutrun 5342267948
allocstall 0

Change log v2...v1:
1. change the stats using events instead of stats.
2. add the stats in the Documentation

Signed-off-by: Ying Han <yinghan@google.com>
---
 Documentation/cgroups/memory.txt |   14 +++++++
 include/linux/memcontrol.h       |   64 +++++++++++++++++++++++++++++++++
 mm/memcontrol.c                  |   72 ++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                      |   28 ++++++++++++--
 4 files changed, 174 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index bac328c..ee54684 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,13 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+kswapd_steal	- # of pages reclaimed from kswapd
+pg_pgsteal	- # of pages reclaimed from direct reclaim
+kswapd_pgscan	- # of pages scanned from kswapd
+pg_scan		- # of pages scanned frm direct reclaim
+pgrefill	- # of pages scanned on active list
+pgoutrun	- # of times triggering kswapd
+allocstall	- # of times triggering direct reclaim
 dirty		- # of bytes that are waiting to get written back to the disk.
 writeback	- # of bytes that are actively being written back to the disk.
 nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
@@ -410,6 +417,13 @@ total_mapped_file	- sum of all children's "cache"
 total_pgpgin		- sum of all children's "pgpgin"
 total_pgpgout		- sum of all children's "pgpgout"
 total_swap		- sum of all children's "swap"
+total_kswapd_steal	- sum of all children's "kswapd_steal"
+total_pg_pgsteal	- sum of all children's "pg_pgsteal"
+total_kswapd_pgscan	- sum of all children's "kswapd_pgscan"
+total_pg_scan		- sum of all children's "pg_scan"
+total_pgrefill		- sum of all children's "pgrefill"
+total_pgoutrun		- sum of all children's "pgoutrun"
+total_allocstall	- sum of all children's "allocstall"
 total_dirty		- sum of all children's "dirty"
 total_writeback		- sum of all children's "writeback"
 total_nfs_unstable	- sum of all children's "nfs_unstable"
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69c6e41..9e7d93e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -173,6 +173,15 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 
+/* background reclaim stats */
+void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_steal(struct mem_cgroup *memcg, int val);
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pgrefill(struct mem_cgroup *memcg, int val);
+void mem_cgroup_pg_outrun(struct mem_cgroup *memcg, int val);
+void mem_cgroup_alloc_stall(struct mem_cgroup *memcg, int val);
+
 void mem_cgroup_clear_unreclaimable(struct page_cgroup *pc);
 bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
 bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
@@ -260,6 +269,23 @@ static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	return NULL;
 }
 
+static inline int
+mem_cgroup_watermark_ok(struct mem_cgroup *mem, int charge_flags)
+{
+	return 0;
+}
+
+static inline int
+mem_cgroup_init_kswapd(struct mem_cgroup *mem, struct kswapd *kswapd_p)
+{
+	return 0;
+}
+
+static inline wait_queue_head_t *mem_cgroup_kswapd_wait(struct mem_cgroup *mem)
+{
+	return NULL;
+}
+
 static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
 {
 	return 1;
@@ -391,6 +417,7 @@ static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
 static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
 						struct zone *zone)
 {
+	return false;
 }
 
 static inline
@@ -411,6 +438,43 @@ static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
 {
 	return false;
 }
+
+/* background reclaim stats */
+static inline void mem_cgroup_kswapd_steal(struct mem_cgroup *memcg,
+					   int val)
+{
+}
+
+static inline void mem_cgroup_pg_steal(struct mem_cgroup *memcg,
+				       int val)
+{
+}
+
+static inline void mem_cgroup_kswapd_pgscan(struct mem_cgroup *memcg,
+					    int val)
+{
+}
+
+static inline void mem_cgroup_pg_pgscan(struct mem_cgroup *memcg,
+					int val)
+{
+}
+
+static inline void mem_cgroup_pgrefill(struct mem_cgroup *memcg,
+				       int val)
+{
+}
+
+static inline void mem_cgroup_pg_outrun(struct mem_cgroup *memcg,
+					int val)
+{
+}
+
+static inline void mem_cgroup_alloc_stall(struct mem_cgroup *memcg,
+					  int val)
+{
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e716ece..c101b51 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -102,6 +102,13 @@ enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
+	MEM_CGROUP_EVENTS_KSWAPD_STEAL, /* # of pages reclaimed from kswapd */
+	MEM_CGROUP_EVENTS_PG_PGSTEAL, /* # of pages reclaimed from ttfp */
+	MEM_CGROUP_EVENTS_KSWAPD_PGSCAN, /* # of pages scanned from kswapd */
+	MEM_CGROUP_EVENTS_PG_PGSCAN, /* # of pages scanned from ttfp */
+	MEM_CGROUP_EVENTS_PGREFILL, /* # of pages scanned on active list */
+	MEM_CGROUP_EVENTS_PGOUTRUN,
+	MEM_CGROUP_EVENTS_ALLOCSTALL,
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 
@@ -640,6 +647,41 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
 }
 
+void mem_cgroup_kswapd_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_KSWAPD_STEAL], val);
+}
+
+void mem_cgroup_pg_steal(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PG_PGSTEAL], val);
+}
+
+void mem_cgroup_kswapd_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_KSWAPD_PGSCAN], val);
+}
+
+void mem_cgroup_pg_pgscan(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PG_PGSCAN], val);
+}
+
+void mem_cgroup_pgrefill(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGREFILL], val);
+}
+
+void mem_cgroup_pg_outrun(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGOUTRUN], val);
+}
+
+void mem_cgroup_alloc_stall(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_ALLOCSTALL], val);
+}
+
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
 					    enum mem_cgroup_events_index idx)
 {
@@ -4076,6 +4118,13 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_KSWAPD_STEAL,
+	MCS_PG_PGSTEAL,
+	MCS_KSWAPD_PGSCAN,
+	MCS_PG_PGSCAN,
+	MCS_PGREFILL,
+	MCS_PGOUTRUN,
+	MCS_ALLOCSTALL,
 	MCS_FILE_DIRTY,
 	MCS_WRITEBACK,
 	MCS_UNSTABLE_NFS,
@@ -4101,6 +4150,13 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"kswapd_steal", "total_kswapd_steal"},
+	{"pg_pgsteal", "total_pg_pgsteal"},
+	{"kswapd_pgscan", "total_kswapd_pgscan"},
+	{"pg_scan", "total_pg_scan"},
+	{"pgrefill", "total_pgrefill"},
+	{"pgoutrun", "total_pgoutrun"},
+	{"allocstall", "total_allocstall"},
 	{"dirty", "total_dirty"},
 	{"writeback", "total_writeback"},
 	{"nfs_unstable", "total_nfs_unstable"},
@@ -4133,6 +4189,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
 
+	/* kswapd stat */
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_KSWAPD_STEAL);
+	s->stat[MCS_KSWAPD_STEAL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PG_PGSTEAL);
+	s->stat[MCS_PG_PGSTEAL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN);
+	s->stat[MCS_KSWAPD_PGSCAN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PG_PGSCAN);
+	s->stat[MCS_PG_PGSCAN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGREFILL);
+	s->stat[MCS_PGREFILL] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGOUTRUN);
+	s->stat[MCS_PGOUTRUN] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_ALLOCSTALL);
+	s->stat[MCS_ALLOCSTALL] += val;
+
 	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
 	s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
 	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34f6165..11784cc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1396,6 +1396,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		 * mem_cgroup_isolate_pages() keeps track of
 		 * scanned pages on its own.
 		 */
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_pgscan(sc->mem_cgroup, nr_scanned);
+		else
+			mem_cgroup_pg_pgscan(sc->mem_cgroup, nr_scanned);
 	}
 
 	if (nr_taken == 0) {
@@ -1416,9 +1420,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	}
 
 	local_irq_disable();
-	if (current_is_kswapd())
-		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
-	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	if (scanning_global_lru(sc)) {
+		if (current_is_kswapd())
+			__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+		__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	} else {
+		if (current_is_kswapd())
+			mem_cgroup_kswapd_steal(sc->mem_cgroup, nr_reclaimed);
+		else
+			mem_cgroup_pg_steal(sc->mem_cgroup, nr_reclaimed);
+	}
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
@@ -1516,7 +1527,12 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
-	__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	if (scanning_global_lru(sc))
+		__count_zone_vm_events(PGREFILL, zone, pgscanned);
+	else
+		mem_cgroup_pgrefill(sc->mem_cgroup, pgscanned);
+
+
 	if (file)
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -nr_taken);
 	else
@@ -1959,6 +1975,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 	if (scanning_global_lru(sc))
 		count_vm_event(ALLOCSTALL);
+	else
+		mem_cgroup_alloc_stall(sc->mem_cgroup, 1);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
@@ -2503,6 +2521,8 @@ loop_again:
 	sc.nr_reclaimed = 0;
 	total_scanned = 0;
 
+	mem_cgroup_pg_outrun(mem_cont, 1);
+
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc.priority = priority;
 		wmark_ok = 0;
-- 
1.7.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-01-19 10:04 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-01-13 22:00 [PATCH 0/5] memcg: per cgroup background reclaim Ying Han
2011-01-13 22:00 ` [PATCH 1/5] Add kswapd descriptor Ying Han
2011-01-13 22:00 ` [PATCH 2/5] Add per cgroup reclaim watermarks Ying Han
2011-01-14  0:11   ` KAMEZAWA Hiroyuki
2011-01-18 20:02     ` Ying Han
2011-01-18 20:36       ` David Rientjes
2011-01-18 21:10         ` Ying Han
2011-01-19  0:56           ` KAMEZAWA Hiroyuki
2011-01-19  2:38             ` David Rientjes
2011-01-19  2:47               ` KAMEZAWA Hiroyuki
2011-01-19 10:03                 ` David Rientjes
2011-01-19  0:44       ` KAMEZAWA Hiroyuki
2011-01-13 22:00 ` [PATCH 3/5] New APIs to adjust per cgroup wmarks Ying Han
2011-01-13 22:00 ` [PATCH 4/5] Per cgroup background reclaim Ying Han
2011-01-14  0:52   ` KAMEZAWA Hiroyuki
2011-01-19  2:12     ` Ying Han
2011-01-13 22:00 ` [PATCH 5/5] Add more per memcg stats Ying Han

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).