* [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
@ 2009-02-16 11:08 Balbir Singh
  2009-02-16 11:08 ` [RFC][PATCH 1/4] Memory controller soft limit documentation (v2) Balbir Singh
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Balbir Singh @ 2009-02-16 11:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki


From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v2...v1
1. Soft limits now support hierarchies
2. Use spinlocks instead of mutexes for synchronization of the RB tree

Here is v2 of the new soft limit implementation. Soft limits are a new feature
for the memory resource controller; something similar has existed in the
group scheduler in the form of shares, though the CPU controller's
interpretation of shares is very different. We'll compare shares and soft
limits below.

Soft limits are most useful in environments where the administrator wants to
overcommit the system, such that the limits become active only on memory
contention. The current soft limit implementation provides a
soft_limit_in_bytes interface for the memory controller, but not for the
memory+swap controller. The implementation maintains an RB-tree of groups
that exceed their soft limit and starts reclaiming from the group that
exceeds this limit by the largest amount.

This is an RFC implementation and is not meant for inclusion.

TODOs

1. The current implementation maintains the delta from the soft limit
   and pushes groups back to their soft limits; a ratio of delta/soft_limit
   would be more useful (see the sketch after this list)
2. It would be nice to have a more targeted reclaim interface (in terms of
   the number of pages to reclaim), so that groups are pushed back close to
   their soft limits.
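
For item 1, a minimal sketch of what a ratio-based key could look like; the
helper name, the x1000 scaling, and the use of div64_u64() are illustrative
assumptions, not part of this series:

/*
 * Hypothetical: key the RB-tree on the relative excess
 * (delta/soft_limit, scaled by 1000) instead of the absolute delta.
 */
static inline unsigned long long
res_counter_soft_limit_excess_ratio(struct res_counter *cnt)
{
	unsigned long long excess = res_counter_soft_limit_excess(cnt);

	if (!cnt->soft_limit)	/* avoid division by zero */
		return excess ? ULLONG_MAX : 0;
	return div64_u64(excess * 1000, cnt->soft_limit);
}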

Tests
-----

I've run two memory-intensive workloads with differing soft limits and
seen that they are pushed back to their soft limits on contention. Their usage
was their soft limit plus the additional memory that they were able to grab
on the system.

Please review and comment.

Series
------

memcg-soft-limit-documentation.patch
memcg-add-soft-limit-interface.patch
memcg-organize-over-soft-limit-groups.patch
memcg-soft-limit-reclaim-on-contention.patch
---

 0 files changed, 0 insertions(+), 0 deletions(-)



-- 
	Balbir


* [RFC][PATCH 1/4] Memory controller soft limit documentation (v2)
  2009-02-16 11:08 [RFC][PATCH 0/4] Memory controller soft limit patches (v2) Balbir Singh
@ 2009-02-16 11:08 ` Balbir Singh
  2009-02-16 11:08 ` [RFC][PATCH 2/4] Memory controller soft limit interface (v2) Balbir Singh
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-02-16 11:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

Add documentation for soft limit feature support.

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/cgroups/memory.txt |   28 +++++++++++++++++++++++++++-
 1 files changed, 27 insertions(+), 1 deletions(-)


diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index a98a7fe..6e6ef41 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -360,7 +360,33 @@ cgroups created below it.
 
 NOTE2: This feature can be enabled/disabled per subtree.
 
-7. TODO
+7. Soft limits
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much of the memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention (through do_try_to_free_pages(),
+while allocating), control groups are pushed back to their soft limits if
+possible. If the soft limit of each control group is very high, they are
+pushed back as much as possible to make sure that one control group does not
+starve the others.
+
+7.1 Interface
+
+Soft limits can be set up by using the following commands (in this example we
+assume a soft limit of 256 megabytes):
+
+# echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use
+
+# echo 1G > memory.soft_limit_in_bytes
+
+
+8. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first

-- 
	Balbir


* [RFC][PATCH 2/4] Memory controller soft limit interface (v2)
  2009-02-16 11:08 [RFC][PATCH 0/4] Memory controller soft limit patches (v2) Balbir Singh
  2009-02-16 11:08 ` [RFC][PATCH 1/4] Memory controller soft limit documentation (v2) Balbir Singh
@ 2009-02-16 11:08 ` Balbir Singh
  2009-02-16 11:09 ` [RFC][PATCH 3/4] Memory controller soft limit organize cgroups (v2) Balbir Singh
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-02-16 11:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki


From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v2...v1
1. Add support for res_counter_check_soft_limit_locked. This is used
   by the hierarchy code.

Add an interface to allow get/set of soft limits. Soft limits for the memory
plus swap controller (memsw) are currently not supported. Resource counters
have been enhanced to support soft limits and a new type RES_SOFT_LIMIT has
been added. Unlike hard limits, soft limits can be set directly and do not
need any reclaim or checks before being set to a new value.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |   47 +++++++++++++++++++++++++++++++++++++++++++
 kernel/res_counter.c        |    3 +++
 mm/memcontrol.c             |   20 ++++++++++++++++++
 3 files changed, 70 insertions(+), 0 deletions(-)


diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 4c5bcf6..b5f14fa 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -35,6 +35,10 @@ struct res_counter {
 	 */
 	unsigned long long limit;
 	/*
+	 * the limit that usage may exceed
+	 */
+	unsigned long long soft_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -85,6 +89,7 @@ enum {
 	RES_MAX_USAGE,
 	RES_LIMIT,
 	RES_FAILCNT,
+	RES_SOFT_LIMIT,
 };
 
 /*
@@ -130,6 +135,36 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 	return false;
 }
 
+static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->soft_limit)
+		return true;
+
+	return false;
+}
+
+/**
+ * Get the difference between the usage and the soft limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to the soft limit;
+ * otherwise, returns the difference between usage and the soft limit.
+ */
+static inline unsigned long long
+res_counter_soft_limit_excess(struct res_counter *cnt)
+{
+	unsigned long long excess;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage <= cnt->soft_limit)
+		excess = 0;
+	else
+		excess = cnt->usage - cnt->soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return excess;
+}
+
 /*
 * Helper function to detect if the cgroup is within its limit or
  * not. It's currently called from cgroup_rss_prepare()
@@ -178,4 +213,16 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
 	return ret;
 }
 
+static inline int
+res_counter_set_soft_limit(struct res_counter *cnt,
+				unsigned long long soft_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->soft_limit = soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index bf8e753..4e6dafe 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
 	counter->limit = (unsigned long long)LLONG_MAX;
+	counter->soft_limit = (unsigned long long)LLONG_MAX;
 	counter->parent = parent;
 }
 
@@ -101,6 +102,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->limit;
 	case RES_FAILCNT:
 		return &counter->failcnt;
+	case RES_SOFT_LIMIT:
+		return &counter->soft_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7bb14fd..75a7b1a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1939,6 +1939,20 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
+	case RES_SOFT_LIMIT:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		/*
+		 * For memsw, soft limits are hard to implement in terms
+		 * of semantics; for now, we support soft limits for the
+		 * memory controller, without swap
+		 */
+		if (type == _MEM)
+			ret = res_counter_set_soft_limit(&memcg->res, val);
+		else
+			ret = -EINVAL;
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -2188,6 +2202,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read,
 	},
 	{
+		.name = "soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
 		.name = "failcnt",
 		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
 		.trigger = mem_cgroup_reset,

-- 
	Balbir


* [RFC][PATCH 3/4] Memory controller soft limit organize cgroups (v2)
  2009-02-16 11:08 [RFC][PATCH 0/4] Memory controller soft limit patches (v2) Balbir Singh
  2009-02-16 11:08 ` [RFC][PATCH 1/4] Memory controller soft limit documentation (v2) Balbir Singh
  2009-02-16 11:08 ` [RFC][PATCH 2/4] Memory controller soft limit interface (v2) Balbir Singh
@ 2009-02-16 11:09 ` Balbir Singh
  2009-02-17  1:00   ` KOSAKI Motohiro
  2009-02-16 11:09 ` [RFC][PATCH 4/4] Memory controller soft limit reclaim on contention (v2) Balbir Singh
  2009-02-17  0:05 ` [RFC][PATCH 0/4] Memory controller soft limit patches (v2) KAMEZAWA Hiroyuki
  4 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-02-16 11:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki


From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v2...v1
1. Add support for hierarchies
2. The res_counter that is highest in the hierarchy is returned when the
   soft limit is exceeded. Since we do hierarchical reclaim and add all
   groups exceeding their soft limits, this approach seems to work well
   in practice.

This patch introduces an RB-tree for storing memory cgroups that are over their
soft limit. The overall goals are to

1. Add a memory cgroup to the RB-tree when its soft limit is exceeded.
   We are careful about updates; updates take place only after a particular
   time interval has passed.
2. Remove the node from the RB-tree when usage goes back below the soft
   limit.

The next set of patches will exploit the RB-tree to get the group that is
over its soft limit by the largest amount and reclaim from it when we
face memory contention.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |    3 +
 kernel/res_counter.c        |   12 +++++
 mm/memcontrol.c             |  104 +++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 113 insertions(+), 6 deletions(-)


diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index b5f14fa..e526ab6 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+		unsigned long val, struct res_counter **limit_fail_at,
+		struct res_counter **soft_limit_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4e6dafe..08b7614 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 }
 
 int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+			struct res_counter **limit_fail_at,
+			struct res_counter **soft_limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
 	struct res_counter *c, *u;
 
 	*limit_fail_at = NULL;
+	if (soft_limit_fail_at)
+		*soft_limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
+		/*
+		 * With soft limits, we return the highest ancestor
+		 * that exceeds its soft limit
+		 */
+		if (soft_limit_fail_at &&
+			!res_counter_soft_limit_check_locked(c))
+			*soft_limit_fail_at = c;
 		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75a7b1a..a2617ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -29,6 +29,7 @@
 #include <linux/rcupdate.h>
 #include <linux/limits.h>
 #include <linux/mutex.h>
+#include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
@@ -129,6 +130,14 @@ struct mem_cgroup_lru_info {
 };
 
 /*
+ * Cgroups above their limits are maintained in an RB-Tree, independent of
+ * their hierarchy representation
+ */
+
+static struct rb_root mem_cgroup_soft_limit_exceeded_groups;
+static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
+
+/*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
  * statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -176,12 +185,18 @@ struct mem_cgroup {
 
 	unsigned int	swappiness;
 
+	struct rb_node mem_cgroup_node;
+	unsigned long long usage_in_excess;
+	unsigned long last_tree_update;
+
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
 	struct mem_cgroup_stat stat;
 };
 
+#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ)
+
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -214,6 +229,41 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+	struct rb_node **p = &mem_cgroup_soft_limit_exceeded_groups.rb_node;
+	struct rb_node *parent = NULL;
+	struct mem_cgroup *mem_node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	while (*p) {
+		parent = *p;
+		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
+		if (mem->usage_in_excess < mem_node->usage_in_excess)
+			p = &(*p)->rb_left;
+		/*
+		 * Mem cgroups can be over their soft limit by the
+		 * same amount; ties are inserted to the right
+		 */
+		else if (mem->usage_in_excess >= mem_node->usage_in_excess)
+			p = &(*p)->rb_right;
+	}
+	rb_link_node(&mem->mem_cgroup_node, parent, p);
+	rb_insert_color(&mem->mem_cgroup_node,
+			&mem_cgroup_soft_limit_exceeded_groups);
+	mem->last_tree_update = jiffies;
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
+static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -897,6 +947,39 @@ static void record_last_oom(struct mem_cgroup *mem)
 	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
 }
 
+static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
+						bool time_check)
+{
+	unsigned long long prev_usage_in_excess, new_usage_in_excess;
+	bool updated_tree = false;
+	unsigned long next_update = 0;
+	unsigned long flags;
+
+	mem_cgroup_get(mem);
+	prev_usage_in_excess = mem->usage_in_excess;
+	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+
+	if (time_check)
+		next_update = mem->last_tree_update +
+				MEM_CGROUP_TREE_UPDATE_INTERVAL;
+	if (new_usage_in_excess && time_after(jiffies, next_update)) {
+		if (prev_usage_in_excess)
+			mem_cgroup_remove_exceeded(mem);
+		mem_cgroup_insert_exceeded(mem);
+		updated_tree = true;
+	} else if (prev_usage_in_excess && !new_usage_in_excess) {
+		mem_cgroup_remove_exceeded(mem);
+		updated_tree = true;
+	}
+
+	if (updated_tree) {
+		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+		mem->last_tree_update = jiffies;
+		mem->usage_in_excess = new_usage_in_excess;
+		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+	}
+	mem_cgroup_put(mem);
+}
 
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
@@ -906,9 +989,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			gfp_t gfp_mask, struct mem_cgroup **memcg,
 			bool oom)
 {
-	struct mem_cgroup *mem, *mem_over_limit;
+	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res;
+	struct res_counter *fail_res, *soft_fail_res = NULL;
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
@@ -938,12 +1021,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		int ret;
 		bool noswap = false;
 
-		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
+						&soft_fail_res);
 		if (likely(!ret)) {
 			if (!do_swap_account)
 				break;
 			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
-							&fail_res);
+							&fail_res, NULL);
 			if (likely(!ret))
 				break;
 			/* mem+swap counter fails */
@@ -985,6 +1069,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			goto nomem;
 		}
 	}
+
+	if (soft_fail_res) {
+		mem_over_soft_limit =
+			mem_cgroup_from_res_counter(soft_fail_res, res);
+		mem_cgroup_check_and_update_tree(mem_over_soft_limit, true);
+	}
+	mem_cgroup_check_and_update_tree(mem, true);
 	return 0;
 nomem:
 	css_put(&mem->css);
@@ -1422,6 +1513,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	mz = page_cgroup_zoneinfo(pc);
 	unlock_page_cgroup(pc);
 
+	mem_cgroup_check_and_update_tree(mem, true);
 	/* at swapout, this memcg will be accessed to record to swap */
 	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		css_put(&mem->css);
@@ -2346,6 +2438,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
+	mem_cgroup_check_and_update_tree(mem, false);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -2412,6 +2505,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
 		parent = NULL;
+		mem_cgroup_soft_limit_exceeded_groups = RB_ROOT;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
@@ -2432,6 +2526,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->usage_in_excess = 0;
+	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
 	spin_lock_init(&mem->reclaim_param_lock);
 
 	if (parent)

-- 
	Balbir


* [RFC][PATCH 4/4] Memory controller soft limit reclaim on contention (v2)
  2009-02-16 11:08 [RFC][PATCH 0/4] Memory controller soft limit patches (v2) Balbir Singh
                   ` (2 preceding siblings ...)
  2009-02-16 11:09 ` [RFC][PATCH 3/4] Memory controller soft limit organize cgroups (v2) Balbir Singh
@ 2009-02-16 11:09 ` Balbir Singh
  2009-02-17  1:20   ` KOSAKI Motohiro
  2009-02-17  0:05 ` [RFC][PATCH 0/4] Memory controller soft limit patches (v2) KAMEZAWA Hiroyuki
  4 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-02-16 11:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao, Paul Menage, lizf,
	linux-kernel, KOSAKI Motohiro, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Balbir Singh, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki


From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v2...v1
1. Added support for hierarchical soft limits

This patch allows reclaim from memory cgroups on contention (via the
__alloc_pages_internal() path). If an order greater than 0 is specified, we
fall back on try_to_free_pages() anyway.

Memory cgroup soft limit reclaim finds the group that exceeds its soft limit
by the largest amount, reclaims pages from it, and then reinserts the
cgroup into its correct place in the RB-tree.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/memcontrol.h |    1 
 mm/memcontrol.c            |  105 +++++++++++++++++++++++++++++++++++++++-----
 mm/page_alloc.c            |   10 ++++
 3 files changed, 104 insertions(+), 12 deletions(-)


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..a50f73e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,6 +116,7 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
+extern unsigned long mem_cgroup_soft_limit_reclaim(gfp_t gfp_mask);
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2617ac..dd835d3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -188,6 +188,7 @@ struct mem_cgroup {
 	struct rb_node mem_cgroup_node;
 	unsigned long long usage_in_excess;
 	unsigned long last_tree_update;
+	bool on_tree;
 
 	/*
 	 * statistics. This must be placed at the end of memcg.
@@ -195,7 +196,7 @@ struct mem_cgroup {
 	struct mem_cgroup_stat stat;
 };
 
-#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ)
+#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
 
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -229,14 +230,15 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
-static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
 {
 	struct rb_node **p = &mem_cgroup_soft_limit_exceeded_groups.rb_node;
 	struct rb_node *parent = NULL;
 	struct mem_cgroup *mem_node;
-	unsigned long flags;
 
-	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	if (mem->on_tree)
+		return;
+
 	while (*p) {
 		parent = *p;
 		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
@@ -253,6 +255,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
 	rb_insert_color(&mem->mem_cgroup_node,
 			&mem_cgroup_soft_limit_exceeded_groups);
 	mem->last_tree_update = jiffies;
+	mem->on_tree = true;
+}
+
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+	if (!mem->on_tree)
+		return;
+	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
+	mem->on_tree = false;
+}
+
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	__mem_cgroup_insert_exceeded(mem);
 	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
 }
 
@@ -260,10 +279,34 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
 {
 	unsigned long flags;
 	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
-	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
+	__mem_cgroup_remove_exceeded(mem);
 	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
 }
 
+static struct mem_cgroup *mem_cgroup_get_largest_soft_limit_exceeding_node(void)
+{
+	struct rb_node *rightmost = NULL;
+	struct mem_cgroup *mem = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+	rightmost = rb_last(&mem_cgroup_soft_limit_exceeded_groups);
+	if (!rightmost)
+		goto done;		/* Nothing to reclaim from */
+
+	mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
+	mem_cgroup_get(mem);
+	/*
+	 * Remove the node now, but someone else can add it back;
+	 * we will add it back at the end of reclaim to its correct
+	 * position in the tree.
+	 */
+	__mem_cgroup_remove_exceeded(mem);
+done:
+	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+	return mem;
+}
+
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -886,7 +929,8 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
 * If shrink==true, to avoid freeing too much, this returns immediately.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
-				   gfp_t gfp_mask, bool noswap, bool shrink)
+				   gfp_t gfp_mask, bool noswap, bool shrink,
+				   bool check_soft)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
@@ -913,7 +957,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 		if (shrink)
 			return ret;
 		total += ret;
-		if (mem_cgroup_check_under_limit(root_mem))
+
+		if (check_soft) {
+			if (res_counter_soft_limit_excess(&root_mem->res))
+				return 1 + total;
+		} else if (mem_cgroup_check_under_limit(root_mem))
 			return 1 + total;
 	}
 	return total;
@@ -1044,7 +1092,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			goto nomem;
 
 		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
-							noswap, false);
+							noswap, false, false);
 		if (ret)
 			continue;
 
@@ -1686,7 +1734,7 @@ int mem_cgroup_shrink_usage(struct page *page,
 
 	do {
 		progress = mem_cgroup_hierarchical_reclaim(mem,
-					gfp_mask, true, false);
+					gfp_mask, true, false, false);
 		progress += mem_cgroup_check_under_limit(mem);
 	} while (!progress && --retry);
 
@@ -1741,7 +1789,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			break;
 
 		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
-						   false, true);
+						   false, true, false);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -1789,7 +1837,8 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
+		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true,
+						false);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -1940,6 +1989,38 @@ try_to_free:
 	goto out;
 }
 
+unsigned long mem_cgroup_soft_limit_reclaim(gfp_t gfp_mask)
+{
+	unsigned long nr_reclaimed = 0;
+	struct mem_cgroup *mem;
+	unsigned long flags;
+
+	do {
+		mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
+		if (!mem)
+			break;
+		if (mem_cgroup_is_obsolete(mem)) {
+			mem_cgroup_put(mem);
+			continue;
+		}
+		nr_reclaimed +=
+			mem_cgroup_hierarchical_reclaim(mem, gfp_mask, false,
+							false, true);
+		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+		/*
+		 * We need to remove and reinsert the node in its correct
+		 * position
+		 */
+		__mem_cgroup_remove_exceeded(mem);
+		if (mem->usage_in_excess)
+			__mem_cgroup_insert_exceeded(mem);
+		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+		mem_cgroup_put(mem);
+	} while (!nr_reclaimed);
+	return nr_reclaimed;
+}
+
 int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
 {
 	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), true);
@@ -2528,6 +2609,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	mem->usage_in_excess = 0;
 	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
+	mem->on_tree = false;
+
 	spin_lock_init(&mem->reclaim_param_lock);
 
 	if (parent)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7be9386..c50e29b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1579,7 +1579,15 @@ nofail_alloc:
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
+	did_some_progress = mem_cgroup_soft_limit_reclaim(gfp_mask);
+	/*
+	 * If we made no progress or need higher order allocations
+	 * try_to_free_pages() is still our best bet, since mem_cgroup
+	 * reclaim does not handle freeing pages greater than order 0
+	 */
+	if (!did_some_progress || order)
+		did_some_progress = try_to_free_pages(zonelist, order,
+							gfp_mask);
 
 	p->reclaim_state = NULL;
 	p->flags &= ~PF_MEMALLOC;

-- 
	Balbir


* Re: [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
  2009-02-16 11:08 [RFC][PATCH 0/4] Memory controller soft limit patches (v2) Balbir Singh
                   ` (3 preceding siblings ...)
  2009-02-16 11:09 ` [RFC][PATCH 4/4] Memory controller soft limit reclaim on contention (v2) Balbir Singh
@ 2009-02-17  0:05 ` KAMEZAWA Hiroyuki
  2009-02-17  3:05   ` Balbir Singh
  4 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-17  0:05 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Mon, 16 Feb 2009 16:38:44 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Changelog v2...v1
> 1. Soft limits now support hierarchies
> 2. Use spinlocks instead of mutexes for synchronization of the RB tree
> 
> Here is v2 of the new soft limit implementation. Soft limits is a new feature
> for the memory resource controller, something similar has existed in the
> group scheduler in the form of shares. The CPU controllers interpretation
> of shares is very different though. We'll compare shares and soft limits
> below.
> 
> Soft limits are the most useful feature to have for environments where
> the administrator wants to overcommit the system, such that only on memory
> contention do the limits become active. The current soft limits implementation
> provides a soft_limit_in_bytes interface for the memory controller and not
> for memory+swap controller. The implementation maintains an RB-Tree of groups
> that exceed their soft limit and starts reclaiming from the group that
> exceeds this limit by the maximum amount.
> 
> This is an RFC implementation and is not meant for inclusion
> 

Some thoughts after reading the patches:

1. As I pointed out, the cpuset/mempolicy case is not handled yet.
2. I don't like changing the usual direct-memory-reclaim path. It will be an
   obstacle for VM maintainers trying to improve memory reclaim. memcg's LRU is
   designed for shrinking memory usage, not for avoiding memory shortage; IOW,
   it's a slow routine for reclaiming memory under memory shortage.
3. After this patch, res_counter is no longer a general-purpose res_counter...
   It seems to have too many unnecessary accessories for general-purpose use.
4. Please use css_tryget() rather than mem_cgroup_get() (a sketch follows
   this list).
5. Please remove the mem_cgroup from the tree at force_empty or rmdir.
   Is just making memcg->on_tree = false enough? I'm in doubt.
6. What happens when the largest-soft-limit memcg has tons of Anon on a
   swapless system and memory reclaim cannot make enough progress?
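
A minimal sketch of point 4; the wrapper name is hypothetical, but
css_tryget() is the existing API that takes a reference only while the
cgroup is still alive:

/*
 * Sketch: pin a memcg found on the soft limit tree, failing if the
 * cgroup is already being destroyed, instead of unconditionally
 * taking a reference with mem_cgroup_get().
 */
static struct mem_cgroup *soft_limit_tryget(struct mem_cgroup *mem)
{
	if (!mem || !css_tryget(&mem->css))
		return NULL;	/* cgroup is going away */
	return mem;
}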

Thanks,
-Kame



* Re: [RFC][PATCH 3/4] Memory controller soft limit organize cgroups (v2)
  2009-02-16 11:09 ` [RFC][PATCH 3/4] Memory controller soft limit organize cgroups (v2) Balbir Singh
@ 2009-02-17  1:00   ` KOSAKI Motohiro
  2009-02-17  3:24     ` Balbir Singh
  0 siblings, 1 reply; 19+ messages in thread
From: KOSAKI Motohiro @ 2009-02-17  1:00 UTC (permalink / raw)
  To: Balbir Singh
  Cc: kosaki.motohiro, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

>  /*
> + * Cgroups above their limits are maintained in a RB-Tree, independent of
> + * their hierarchy representation
> + */
> +
> +static struct rb_root mem_cgroup_soft_limit_exceeded_groups;

A 37-character variable name seems too long.


> +static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
> +
> +/*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
>   * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> @@ -176,12 +185,18 @@ struct mem_cgroup {
>  
>  	unsigned int	swappiness;
>  
> +	struct rb_node mem_cgroup_node;
> +	unsigned long long usage_in_excess;
> +	unsigned long last_tree_update;
> +

These fields have no comments.

What unit do usage_in_excess and last_tree_update use? "unsigned long"
doesn't tell me anything.


>  	/*
>  	 * statistics. This must be placed at the end of memcg.
>  	 */
>  	struct mem_cgroup_stat stat;
>  };
>  
> +#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ)
> +

In general, the memory subsystem shouldn't have timer-based things,
because we expect machines 100x faster after 10 years, and by then the
proper timeout value will have changed.

Can we make proper statistics instead?


>  enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
>  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> @@ -214,6 +229,41 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  
> +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> +{
> +	struct rb_node **p = &mem_cgroup_soft_limit_exceeded_groups.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct mem_cgroup *mem_node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +	while (*p) {
> +		parent = *p;
> +		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> +		if (mem->usage_in_excess < mem_node->usage_in_excess)
> +			p = &(*p)->rb_left;
> +		/*
> +		 * We can't avoid mem cgroups that are over their soft
> +		 * limit by the same amount
> +		 */
> +		else if (mem->usage_in_excess >= mem_node->usage_in_excess)
> +			p = &(*p)->rb_right;
> +	}
> +	rb_link_node(&mem->mem_cgroup_node, parent, p);
> +	rb_insert_color(&mem->mem_cgroup_node,
> +			&mem_cgroup_soft_limit_exceeded_groups);
> +	mem->last_tree_update = jiffies;
> +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +}

I think this function is called from the page fault hotpath, right?
If so, you are inserting a global lock into the hotpath!
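
One way to reduce that contention (a hypothetical sketch, not something
posted in this thread) would be to split the global tree per NUMA node,
each with its own lock:

/* Hypothetical: one RB-tree and lock per node instead of one global pair */
struct soft_limit_tree {
	struct rb_root root;
	spinlock_t lock;
};
static struct soft_limit_tree soft_limit_trees[MAX_NUMNODES];

static struct soft_limit_tree *soft_limit_tree_from_page(struct page *page)
{
	return &soft_limit_trees[page_to_nid(page)];
}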


> +
> +static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> +{
> +	unsigned long flags;
> +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
> +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +}
> +
>  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>  					 struct page_cgroup *pc,
>  					 bool charge)
> @@ -897,6 +947,39 @@ static void record_last_oom(struct mem_cgroup *mem)
>  	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
>  }
>  
> +static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
> +						bool time_check)
> +{
> +	unsigned long long prev_usage_in_excess, new_usage_in_excess;
> +	bool updated_tree = false;
> +	unsigned long next_update = 0;
> +	unsigned long flags;
> +
> +	mem_cgroup_get(mem);
> +	prev_usage_in_excess = mem->usage_in_excess;
> +	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> +
> +	if (time_check)
> +		next_update = mem->last_tree_update +
> +				MEM_CGROUP_TREE_UPDATE_INTERVAL;
> +	if (new_usage_in_excess && time_after(jiffies, next_update)) {

Incorrect time_after() usage: jiffies can wrap around, so
time_after(jiffies, 0) is not guaranteed to return true.
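
A sketch of one wrap-safe variant, assuming the intent is to skip the
interval check until the first real tree update:

/*
 * Sketch: never compare against a next_update of 0, so the result no
 * longer depends on time_after(jiffies, 0).
 */
if (time_check && mem->last_tree_update)
	next_update = mem->last_tree_update +
			MEM_CGROUP_TREE_UPDATE_INTERVAL;
if (new_usage_in_excess &&
    (!next_update || time_after(jiffies, next_update))) {
	if (prev_usage_in_excess)
		mem_cgroup_remove_exceeded(mem);
	mem_cgroup_insert_exceeded(mem);
	updated_tree = true;
}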

> +		if (prev_usage_in_excess)
> +			mem_cgroup_remove_exceeded(mem);
> +		mem_cgroup_insert_exceeded(mem);
> +		updated_tree = true;
> +	} else if (prev_usage_in_excess && !new_usage_in_excess) {
> +		mem_cgroup_remove_exceeded(mem);
> +		updated_tree = true;
> +	}
> +
> +	if (updated_tree) {
> +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +		mem->last_tree_update = jiffies;
> +		mem->usage_in_excess = new_usage_in_excess;
> +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +	}
> +	mem_cgroup_put(mem);
> +}
>  
>  /*
>   * Unlike exported interface, "oom" parameter is added. if oom==true,
> @@ -906,9 +989,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  			gfp_t gfp_mask, struct mem_cgroup **memcg,
>  			bool oom)
>  {
> -	struct mem_cgroup *mem, *mem_over_limit;
> +	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
>  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> -	struct res_counter *fail_res;
> +	struct res_counter *fail_res, *soft_fail_res = NULL;
>  
>  	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
>  		/* Don't account this! */
> @@ -938,12 +1021,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		int ret;
>  		bool noswap = false;
>  
> -		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> +		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> +						&soft_fail_res);
>  		if (likely(!ret)) {
>  			if (!do_swap_account)
>  				break;
>  			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> -							&fail_res);
> +							&fail_res, NULL);
>  			if (likely(!ret))
>  				break;
>  			/* mem+swap counter fails */
> @@ -985,6 +1069,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  			goto nomem;
>  		}
>  	}
> +
> +	if (soft_fail_res) {
> +		mem_over_soft_limit =
> +			mem_cgroup_from_res_counter(soft_fail_res, res);
> +		mem_cgroup_check_and_update_tree(mem_over_soft_limit, true);
> +	}
> +	mem_cgroup_check_and_update_tree(mem, true);
>  	return 0;
>  nomem:
>  	css_put(&mem->css);
> @@ -1422,6 +1513,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  	mz = page_cgroup_zoneinfo(pc);
>  	unlock_page_cgroup(pc);
>  
> +	mem_cgroup_check_and_update_tree(mem, true);
>  	/* at swapout, this memcg will be accessed to record to swap */
>  	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>  		css_put(&mem->css);
> @@ -2346,6 +2438,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
>  {
>  	int node;
>  
> +	mem_cgroup_check_and_update_tree(mem, false);
>  	free_css_id(&mem_cgroup_subsys, &mem->css);
>  
>  	for_each_node_state(node, N_POSSIBLE)
> @@ -2412,6 +2505,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
>  		parent = NULL;
> +		mem_cgroup_soft_limit_exceeded_groups = RB_ROOT;
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
>  		mem->use_hierarchy = parent->use_hierarchy;
> @@ -2432,6 +2526,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  		res_counter_init(&mem->memsw, NULL);
>  	}
>  	mem->last_scanned_child = 0;
> +	mem->usage_in_excess = 0;
> +	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
>  	spin_lock_init(&mem->reclaim_param_lock);
>  
>  	if (parent)
> 
> -- 
> 	Balbir

* Re: [RFC][PATCH 4/4] Memory controller soft limit reclaim on contention (v2)
  2009-02-16 11:09 ` [RFC][PATCH 4/4] Memory controller soft limit reclaim on contention (v2) Balbir Singh
@ 2009-02-17  1:20   ` KOSAKI Motohiro
  2009-02-17  3:12     ` Balbir Singh
  0 siblings, 1 reply; 19+ messages in thread
From: KOSAKI Motohiro @ 2009-02-17  1:20 UTC (permalink / raw)
  To: Balbir Singh
  Cc: kosaki.motohiro, linux-mm, Sudhir Kumar, YAMAMOTO Takashi,
	Bharata B Rao, Paul Menage, lizf, linux-kernel, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki

> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Changelog v2...v1
> 1. Added support for hierarchical soft limits
> 
> This patch allows reclaim from memory cgroups on contention (via the
> __alloc_pages_internal() path). If a order greater than 0 is specified, we
> anyway fall back on try_to_free_pages().
> 
> memory cgroup soft limit reclaim finds the group that exceeds its soft limit
> by the largest amount and reclaims pages from it and then reinserts the
> cgroup into its correct place in the rbtree.
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  include/linux/memcontrol.h |    1 
>  mm/memcontrol.c            |  105 +++++++++++++++++++++++++++++++++++++++-----
>  mm/page_alloc.c            |   10 ++++
>  3 files changed, 104 insertions(+), 12 deletions(-)
> 
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..a50f73e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -116,6 +116,7 @@ static inline bool mem_cgroup_disabled(void)
>  }
>  
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> +extern unsigned long mem_cgroup_soft_limit_reclaim(gfp_t gfp_mask);
>  
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a2617ac..dd835d3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -188,6 +188,7 @@ struct mem_cgroup {
>  	struct rb_node mem_cgroup_node;
>  	unsigned long long usage_in_excess;
>  	unsigned long last_tree_update;
> +	bool on_tree;
>  
>  	/*
>  	 * statistics. This must be placed at the end of memcg.
> @@ -195,7 +196,7 @@ struct mem_cgroup {
>  	struct mem_cgroup_stat stat;
>  };
>  
> -#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ)
> +#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)

??
Wouldn't it be more proper to move this change into [3/4]?


>  
>  enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> @@ -229,14 +230,15 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  
> -static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> +static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
>  {
>  	struct rb_node **p = &mem_cgroup_soft_limit_exceeded_groups.rb_node;
>  	struct rb_node *parent = NULL;
>  	struct mem_cgroup *mem_node;
> -	unsigned long flags;
>  
> -	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +	if (mem->on_tree)
> +		return;
> +
>  	while (*p) {
>  		parent = *p;
>  		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> @@ -253,6 +255,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
>  	rb_insert_color(&mem->mem_cgroup_node,
>  			&mem_cgroup_soft_limit_exceeded_groups);
>  	mem->last_tree_update = jiffies;
> +	mem->on_tree = true;
> +}
> +
> +static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> +{
> +	if (!mem->on_tree)
> +		return;
> +	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
> +	mem->on_tree = false;
> +}
> +
> +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +	__mem_cgroup_insert_exceeded(mem);
>  	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
>  }
>  
> @@ -260,10 +279,34 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
>  {
>  	unsigned long flags;
>  	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> -	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
> +	__mem_cgroup_remove_exceeded(mem);
>  	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
>  }
>  
> +static struct mem_cgroup *mem_cgroup_get_largest_soft_limit_exceeding_node(void)
> +{
> +	struct rb_node *rightmost = NULL;
> +	struct mem_cgroup *mem = NULL;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +	rightmost = rb_last(&mem_cgroup_soft_limit_exceeded_groups);
> +	if (!rightmost)
> +		goto done;		/* Nothing to reclaim from */
> +
> +	mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
> +	mem_cgroup_get(mem);
> +	/*
> +	 * Remove the node now but someone else can add it back,
> +	 * we will to add it back at the end of reclaim to its correct
> +	 * position in the tree.
> +	 */
> +	__mem_cgroup_remove_exceeded(mem);
> +done:
> +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +	return mem;
> +}
> +

Do you remember the discussion about zone reclaim balancing in the
"reclaim bail out" thread about 2 months ago?

The "largest exceeding" policy seems to have similar problems.
If the largest exceeding group is also the most active one:

  1. the largest group does its activity and charges memory over the soft limit.
  2. we reclaim memory from the largest group.
  3. goto 1.

Then the system can become livelocked.

I think per-group priority is not a good idea.



>  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>  					 struct page_cgroup *pc,
>  					 bool charge)
> @@ -886,7 +929,8 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
>   * If shrink==true, for avoiding to free too much, this returns immedieately.
>   */
>  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> -				   gfp_t gfp_mask, bool noswap, bool shrink)
> +				   gfp_t gfp_mask, bool noswap, bool shrink,
> +				   bool check_soft)

Now we have three boolean arguments.
Can we convert them into an "int flags" argument?

I don't think mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, false, true,
false) is self-describing; a sketch follows.
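
A minimal sketch of that conversion; the flag names are hypothetical:

#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << 0)	/* was: bool noswap */
#define MEM_CGROUP_RECLAIM_SHRINK	(1 << 1)	/* was: bool shrink */
#define MEM_CGROUP_RECLAIM_SOFT		(1 << 2)	/* was: bool check_soft */

static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
					   gfp_t gfp_mask,
					   unsigned long flags);

/* a call site then documents itself: */
mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, MEM_CGROUP_RECLAIM_SHRINK);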


>  {
>  	struct mem_cgroup *victim;
>  	int ret, total = 0;
> @@ -913,7 +957,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  		if (shrink)
>  			return ret;
>  		total += ret;
> -		if (mem_cgroup_check_under_limit(root_mem))
> +
> +		if (check_soft) {
> +			if (res_counter_soft_limit_excess(&root_mem->res))
> +				return 1 + total;
> +		} else if (mem_cgroup_check_under_limit(root_mem))
>  			return 1 + total;

I don't understand what the "1 +" means.


>  	}
>  	return total;
> @@ -1044,7 +1092,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  			goto nomem;
>  
>  		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> -							noswap, false);
> +							noswap, false, false);
>  		if (ret)
>  			continue;
>  
> @@ -1686,7 +1734,7 @@ int mem_cgroup_shrink_usage(struct page *page,
>  
>  	do {
>  		progress = mem_cgroup_hierarchical_reclaim(mem,
> -					gfp_mask, true, false);
> +					gfp_mask, true, false, false);
>  		progress += mem_cgroup_check_under_limit(mem);
>  	} while (!progress && --retry);
>  
> @@ -1741,7 +1789,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  			break;
>  
>  		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> -						   false, true);
> +						   false, true, false);
>  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>  		/* Usage is reduced ? */
>    		if (curusage >= oldusage)
> @@ -1789,7 +1837,8 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  		if (!ret)
>  			break;
>  
> -		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
> +		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true,
> +						false);
>  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
>  		/* Usage is reduced ? */
>  		if (curusage >= oldusage)
> @@ -1940,6 +1989,38 @@ try_to_free:
>  	goto out;
>  }
>  
> +unsigned long mem_cgroup_soft_limit_reclaim(gfp_t gfp_mask)
> +{
> +	unsigned long nr_reclaimed = 0;
> +	struct mem_cgroup *mem;
> +	unsigned long flags;
> +
> +	do {
> +		mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
> +		if (!mem)
> +			break;
> +		if (mem_cgroup_is_obsolete(mem)) {
> +			mem_cgroup_put(mem);
> +			continue;
> +		}
> +		nr_reclaimed +=
> +			mem_cgroup_hierarchical_reclaim(mem, gfp_mask, false,
> +							false, true);
> +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> +		/*
> +		 * We need to remove and reinsert the node in its correct
> +		 * position
> +		 */
> +		__mem_cgroup_remove_exceeded(mem);
> +		if (mem->usage_in_excess)
> +			__mem_cgroup_insert_exceeded(mem);
> +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +		mem_cgroup_put(mem);
> +	} while (!nr_reclaimed);
> +	return nr_reclaimed;
> +}

This function is called from the reclaim hotpath, but it grabs a global
spinlock... I don't like this.


> +
>  int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
>  {
>  	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), true);
> @@ -2528,6 +2609,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	mem->usage_in_excess = 0;
>  	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
> +	mem->on_tree = false;
> +
>  	spin_lock_init(&mem->reclaim_param_lock);
>  
>  	if (parent)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7be9386..c50e29b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1579,7 +1579,15 @@ nofail_alloc:
>  	reclaim_state.reclaimed_slab = 0;
>  	p->reclaim_state = &reclaim_state;
>  
> -	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> +	did_some_progress = mem_cgroup_soft_limit_reclaim(gfp_mask);
> +	/*
> +	 * If we made no progress or need higher order allocations
> +	 * try_to_free_pages() is still our best bet, since mem_cgroup
> +	 * reclaim does not handle freeing pages greater than order 0
> +	 */
> +	if (!did_some_progress || order)
> +		did_some_progress = try_to_free_pages(zonelist, order,
> +							gfp_mask);
>  
>  	p->reclaim_state = NULL;
>  	p->flags &= ~PF_MEMALLOC;

This is a very frequently called place, so we want:
  - no performance regression when no memory cgroup is in use.
  - when memcg is in use but no soft limit is set, a performance degradation
    smaller than 1%.

Do you have any performance numbers?




* Re: [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
  2009-02-17  0:05 ` [RFC][PATCH 0/4] Memory controller soft limit patches (v2) KAMEZAWA Hiroyuki
@ 2009-02-17  3:05   ` Balbir Singh
  2009-02-17  4:03     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-02-17  3:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 09:05:23]:

> On Mon, 16 Feb 2009 16:38:44 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > Changelog v2...v1
> > 1. Soft limits now support hierarchies
> > 2. Use spinlocks instead of mutexes for synchronization of the RB tree
> > 
> > Here is v2 of the new soft limit implementation. Soft limits is a new feature
> > for the memory resource controller, something similar has existed in the
> > group scheduler in the form of shares. The CPU controllers interpretation
> > of shares is very different though. We'll compare shares and soft limits
> > below.
> > 
> > Soft limits are the most useful feature to have for environments where
> > the administrator wants to overcommit the system, such that only on memory
> > contention do the limits become active. The current soft limits implementation
> > provides a soft_limit_in_bytes interface for the memory controller and not
> > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > that exceed their soft limit and starts reclaiming from the group that
> > exceeds this limit by the maximum amount.
> > 
> > This is an RFC implementation and is not meant for inclusion
> > 
> 
> some thoughts after reading patch.
> 
> 1. As I pointed out, cpuset/mempolicy case is not handled yet.

That should be easy to do with the zonelists passed from the reclaim path.
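
For example, the entry point could take the allocation's zonelist directly
(a hypothetical signature, sketching this exchange):

/* reclaim would then respect the caller's cpuset/mempolicy constraints */
extern unsigned long
mem_cgroup_soft_limit_reclaim(struct zonelist *zonelist, gfp_t gfp_mask);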

> 2. I don't like to change usual direct-memory-reclaim path. It will be obstacles
>    for VM-maintaners to improve memory reclaim. memcg's LRU is designed for
>    shrinking memory usage and not for avoiding memory shortage. IOW, it's slow routine
>    for reclaiming memory for memory shortage.

I don't think I agree here. Direct reclaim is the first indication of
shortage, and if order-0 pages are short, memcgs above their soft limit
can be targeted first.

> 3. After this patch, res_counter is no longer for general purpose res_counter...
>    It seems to have too many unnecessary accessories for general purpose.  

Why not? Soft limits are a feature of any controller. The return of the
highest ancestor might be the only policy we impose right now, but as
new controllers start using res_counter, we can clearly add a policy
callback.

> 4. please use css_tryget() rather than mem_cgroup_get().

OK, will do

> 5. please remove mem_cgroup from tree at force_empty or rmdir.
>    Just making  memcg->on_tree=false is enough ? I'm in doubt.

force_empty will cause an uncharge and we handle it there, but I can add
an explicit call there as well.

> 6. What happens when the-largest-soft-limit-memcg has tons on Anon on swapless
>    system and memory reclaim cannot make enough progress ?

The same thing that would happen with regular reclaim: one needs to
decide whether to OOM or not from this context for memcgs.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 4/4] Memory controller soft limit reclaim on contention (v2)
  2009-02-17  1:20   ` KOSAKI Motohiro
@ 2009-02-17  3:12     ` Balbir Singh
  0 siblings, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-02-17  3:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-02-17 10:20:44]:

> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > Changelog v2...v1
> > 1. Added support for hierarchical soft limits
> > 
> > This patch allows reclaim from memory cgroups on contention (via the
> > __alloc_pages_internal() path). If a order greater than 0 is specified, we
> > anyway fall back on try_to_free_pages().
> > 
> > memory cgroup soft limit reclaim finds the group that exceeds its soft limit
> > by the largest amount and reclaims pages from it and then reinserts the
> > cgroup into its correct place in the rbtree.
> > 
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > ---
> > 
> >  include/linux/memcontrol.h |    1 
> >  mm/memcontrol.c            |  105 +++++++++++++++++++++++++++++++++++++++-----
> >  mm/page_alloc.c            |   10 ++++
> >  3 files changed, 104 insertions(+), 12 deletions(-)
> > 
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 18146c9..a50f73e 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -116,6 +116,7 @@ static inline bool mem_cgroup_disabled(void)
> >  }
> >  
> >  extern bool mem_cgroup_oom_called(struct task_struct *task);
> > +extern unsigned long mem_cgroup_soft_limit_reclaim(gfp_t gfp_mask);
> >  
> >  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> >  struct mem_cgroup;
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a2617ac..dd835d3 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -188,6 +188,7 @@ struct mem_cgroup {
> >  	struct rb_node mem_cgroup_node;
> >  	unsigned long long usage_in_excess;
> >  	unsigned long last_tree_update;
> > +	bool on_tree;
> >  
> >  	/*
> >  	 * statistics. This must be placed at the end of memcg.
> > @@ -195,7 +196,7 @@ struct mem_cgroup {
> >  	struct mem_cgroup_stat stat;
> >  };
> >  
> > -#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ)
> > +#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ/4)
> 
> ??
> Isn't it more proper to move this to patch [3/4]?
>

Yes, I can move it there. Thanks!
 
> 
> >  
> >  enum charge_type {
> >  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> > @@ -229,14 +230,15 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
> >  static void mem_cgroup_put(struct mem_cgroup *mem);
> >  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> >  
> > -static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > +static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> >  {
> >  	struct rb_node **p = &mem_cgroup_soft_limit_exceeded_groups.rb_node;
> >  	struct rb_node *parent = NULL;
> >  	struct mem_cgroup *mem_node;
> > -	unsigned long flags;
> >  
> > -	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +	if (mem->on_tree)
> > +		return;
> > +
> >  	while (*p) {
> >  		parent = *p;
> >  		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> > @@ -253,6 +255,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> >  	rb_insert_color(&mem->mem_cgroup_node,
> >  			&mem_cgroup_soft_limit_exceeded_groups);
> >  	mem->last_tree_update = jiffies;
> > +	mem->on_tree = true;
> > +}
> > +
> > +static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> > +{
> > +	if (!mem->on_tree)
> > +		return;
> > +	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
> > +	mem->on_tree = false;
> > +}
> > +
> > +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +	__mem_cgroup_insert_exceeded(mem);
> >  	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> >  }
> >  
> > @@ -260,10 +279,34 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> >  {
> >  	unsigned long flags;
> >  	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > -	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
> > +	__mem_cgroup_remove_exceeded(mem);
> >  	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> >  }
> >  
> > +static struct mem_cgroup *mem_cgroup_get_largest_soft_limit_exceeding_node(void)
> > +{
> > +	struct rb_node *rightmost = NULL;
> > +	struct mem_cgroup *mem = NULL;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +	rightmost = rb_last(&mem_cgroup_soft_limit_exceeded_groups);
> > +	if (!rightmost)
> > +		goto done;		/* Nothing to reclaim from */
> > +
> > +	mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
> > +	mem_cgroup_get(mem);
> > +	/*
> > +	 * Remove the node now but someone else can add it back,
> > +	 * we will to add it back at the end of reclaim to its correct
> > +	 * we will add it back at the end of reclaim to its correct
> > +	 */
> > +	__mem_cgroup_remove_exceeded(mem);
> > +done:
> > +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +	return mem;
> > +}
> > +
> 
> Do you remember we discussed the zone reclaim balancing thing
> in the "reclaim bail out" thread about 2 months ago?
>
> The "largest exceeding" policy seems to have similar problems.
> If the largest-exceeding group is also the most active:
> 
>   1. the largest group does its work and charges memory over its soft limit.
>   2. memory is reclaimed from the largest group.
>   3. goto 1.
> 
> then the system can livelock.
>

What alternative do you recommend? Reclaiming memory from an
overactive group that is over its assigned usage is not a livelock
scenario; do you see it that way?
 
> I think per-group priority is not good idea.
>

Not sure I understand this part.
 
> 
> 
> >  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> >  					 struct page_cgroup *pc,
> >  					 bool charge)
> > @@ -886,7 +929,8 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> >   * If shrink==true, for avoiding to free too much, this returns immedieately.
> >   */
> >  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > -				   gfp_t gfp_mask, bool noswap, bool shrink)
> > +				   gfp_t gfp_mask, bool noswap, bool shrink,
> > +				   bool check_soft)
> 
> Now we have three boolean arguments.
> So, can we convert them into an "int flags" argument?
> 
> I don't think mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, false, true, false) is
> self-describing or good-looking.
> 
> 
> >  {
> >  	struct mem_cgroup *victim;
> >  	int ret, total = 0;
> > @@ -913,7 +957,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> >  		if (shrink)
> >  			return ret;
> >  		total += ret;
> > -		if (mem_cgroup_check_under_limit(root_mem))
> > +
> > +		if (check_soft) {
> > +			if (res_counter_soft_limit_excess(&root_mem->res))
> > +				return 1 + total;
> > +		} else if (mem_cgroup_check_under_limit(root_mem))
> >  			return 1 + total;
> 
> I don't understand what the "1 +" means.
> 
> 
> >  	}
> >  	return total;
> > @@ -1044,7 +1092,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  			goto nomem;
> >  
> >  		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> > -							noswap, false);
> > +							noswap, false, false);
> >  		if (ret)
> >  			continue;
> >  
> > @@ -1686,7 +1734,7 @@ int mem_cgroup_shrink_usage(struct page *page,
> >  
> >  	do {
> >  		progress = mem_cgroup_hierarchical_reclaim(mem,
> > -					gfp_mask, true, false);
> > +					gfp_mask, true, false, false);
> >  		progress += mem_cgroup_check_under_limit(mem);
> >  	} while (!progress && --retry);
> >  
> > @@ -1741,7 +1789,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> >  			break;
> >  
> >  		progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> > -						   false, true);
> > +						   false, true, false);
> >  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> >  		/* Usage is reduced ? */
> >    		if (curusage >= oldusage)
> > @@ -1789,7 +1837,8 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> >  		if (!ret)
> >  			break;
> >  
> > -		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
> > +		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true,
> > +						false);
> >  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> >  		/* Usage is reduced ? */
> >  		if (curusage >= oldusage)
> > @@ -1940,6 +1989,38 @@ try_to_free:
> >  	goto out;
> >  }
> >  
> > +unsigned long mem_cgroup_soft_limit_reclaim(gfp_t gfp_mask)
> > +{
> > +	unsigned long nr_reclaimed = 0;
> > +	struct mem_cgroup *mem;
> > +	unsigned long flags;
> > +
> > +	do {
> > +		mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
> > +		if (!mem)
> > +			break;
> > +		if (mem_cgroup_is_obsolete(mem)) {
> > +			mem_cgroup_put(mem);
> > +			continue;
> > +		}
> > +		nr_reclaimed +=
> > +			mem_cgroup_hierarchical_reclaim(mem, gfp_mask, false,
> > +							false, true);
> > +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +		mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > +		/*
> > +		 * We need to remove and reinsert the node in its correct
> > +		 * position
> > +		 */
> > +		__mem_cgroup_remove_exceeded(mem);
> > +		if (mem->usage_in_excess)
> > +			__mem_cgroup_insert_exceeded(mem);
> > +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +		mem_cgroup_put(mem);
> > +	} while (!nr_reclaimed);
> > +	return nr_reclaimed;
> > +}
> 
> This function is called from the reclaim hot path, but it grabs a global spinlock.
> I don't like this.
> 
> 
> > +
> >  int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
> >  {
> >  	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), true);
> > @@ -2528,6 +2609,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  	mem->last_scanned_child = 0;
> >  	mem->usage_in_excess = 0;
> >  	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
> > +	mem->on_tree = false;
> > +
> >  	spin_lock_init(&mem->reclaim_param_lock);
> >  
> >  	if (parent)
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7be9386..c50e29b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1579,7 +1579,15 @@ nofail_alloc:
> >  	reclaim_state.reclaimed_slab = 0;
> >  	p->reclaim_state = &reclaim_state;
> >  
> > -	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > +	did_some_progress = mem_cgroup_soft_limit_reclaim(gfp_mask);
> > +	/*
> > +	 * If we made no progress or need higher order allocations
> > +	 * try_to_free_pages() is still our best bet, since mem_cgroup
> > +	 * reclaim does not handle freeing pages greater than order 0
> > +	 */
> > +	if (!did_some_progress || order)
> > +		did_some_progress = try_to_free_pages(zonelist, order,
> > +							gfp_mask);
> >  
> >  	p->reclaim_state = NULL;
> >  	p->flags &= ~PF_MEMALLOC;
> 
> This is a very frequently called place, so we want:
>   - if no memory cgroup is in use, no performance regression.
>   - if memcg is in use but no soft limit is set, performance degradation smaller than 1%.
> 
> Do you have any performance numbers?
> 
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 3/4] Memory controller soft limit organize cgroups (v2)
  2009-02-17  1:00   ` KOSAKI Motohiro
@ 2009-02-17  3:24     ` Balbir Singh
  0 siblings, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-02-17  3:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-02-17 10:00:00]:

> >  /*
> > + * Cgroups above their limits are maintained in a RB-Tree, independent of
> > + * their hierarchy representation
> > + */
> > +
> > +static struct rb_root mem_cgroup_soft_limit_exceeded_groups;
> 
> A 37-character variable name seems too long.
>

OK, I'll shorten it
 
> 
> > +static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
> > +
> > +/*
> >   * The memory controller data structure. The memory controller controls both
> >   * page cache and RSS per cgroup. We would eventually like to provide
> >   * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> > @@ -176,12 +185,18 @@ struct mem_cgroup {
> >  
> >  	unsigned int	swappiness;
> >  
> > +	struct rb_node mem_cgroup_node;
> > +	unsigned long long usage_in_excess;
> > +	unsigned long last_tree_update;
> > +
> 
> These new fields have no comments.
> 

I'll add them; the names are descriptive, but comments always help.

> What units do usage_in_excess and last_tree_update use? "unsigned long"
> doesn't tell me anything.
> 
> 
> >  	/*
> >  	 * statistics. This must be placed at the end of memcg.
> >  	 */
> >  	struct mem_cgroup_stat stat;
> >  };
> >  
> > +#define	MEM_CGROUP_TREE_UPDATE_INTERVAL		(HZ)
> > +
> 
> In general, the memory subsystem is considered to be something that shouldn't
> have timer-based behaviour, because we expect machines 100x faster in 10 years,
> and by then the proper timeout value will have changed.
> 

Right now, I don't want to overwhelm the system by updating the tree
every time a page is added or removed, so I use an interval to decide
whether we should update the tree. I am not using any timers per se.
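The check is just a jiffies comparison on the charge/uncharge path,
roughly (sketch; update_tree mirrors the updated_tree local in the patch):

        /*
         * Reconsider this memcg's tree position at most once per
         * MEM_CGROUP_TREE_UPDATE_INTERVAL. No timer ever fires; the
         * comparison piggybacks on charge/uncharge.
         */
        if (time_after(jiffies, mem->last_tree_update +
                                MEM_CGROUP_TREE_UPDATE_INTERVAL))
                update_tree = true;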

> Can we make proper statistics, instead?
> 

I am not sure I understand your proposal fully

> 
> >  enum charge_type {
> >  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> >  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> > @@ -214,6 +229,41 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
> >  static void mem_cgroup_put(struct mem_cgroup *mem);
> >  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> >  
> > +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > +{
> > +	struct rb_node **p = &mem_cgroup_soft_limit_exceeded_groups.rb_node;
> > +	struct rb_node *parent = NULL;
> > +	struct mem_cgroup *mem_node;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +	while (*p) {
> > +		parent = *p;
> > +		mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> > +		if (mem->usage_in_excess < mem_node->usage_in_excess)
> > +			p = &(*p)->rb_left;
> > +		/*
> > +		 * We can't avoid mem cgroups that are over their soft
> > +		 * limit by the same amount
> > +		 */
> > +		else if (mem->usage_in_excess >= mem_node->usage_in_excess)
> > +			p = &(*p)->rb_right;
> > +	}
> > +	rb_link_node(&mem->mem_cgroup_node, parent, p);
> > +	rb_insert_color(&mem->mem_cgroup_node,
> > +			&mem_cgroup_soft_limit_exceeded_groups);
> > +	mem->last_tree_update = jiffies;
> > +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +}
> 
> I think this function is called from the page fault hot path, right?
> If so, you are inserting a global lock into the hot path!
>

Page fault hot path - the hooks are at the place where we would have
called try_to_free_pages(). We already have global per-zone locks, and
this lock is held only for the short time needed to find and pick the
correct memcg for reclaim.
 
> 
> > +
> > +static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> > +{
> > +	unsigned long flags;
> > +	spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +	rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
> > +	spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +}
> > +
> >  static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> >  					 struct page_cgroup *pc,
> >  					 bool charge)
> > @@ -897,6 +947,39 @@ static void record_last_oom(struct mem_cgroup *mem)
> >  	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> >  }
> >  
> > +static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
> > +						bool time_check)
> > +{
> > +	unsigned long long prev_usage_in_excess, new_usage_in_excess;
> > +	bool updated_tree = false;
> > +	unsigned long next_update = 0;
> > +	unsigned long flags;
> > +
> > +	mem_cgroup_get(mem);
> > +	prev_usage_in_excess = mem->usage_in_excess;
> > +	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > +
> > +	if (time_check)
> > +		next_update = mem->last_tree_update +
> > +				MEM_CGROUP_TREE_UPDATE_INTERVAL;
> > +	if (new_usage_in_excess && time_after(jiffies, next_update)) {
> 
> Incorrect time_after() usage. jiffies can wrap around, so
> time_after(jiffies, 0) doesn't guarantee returning true.

Not sure I completely understand your comment. Is the time_before()
usage in mem_cgroup_oom_called() broken as well?
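If the initial value is the problem, one fix would be to seed the
timestamp from jiffies at cgroup creation instead of 0 (sketch):

        /*
         * Comparing against 0 is unreliable once jiffies wraps, so
         * start the clock from the current value instead.
         */
        mem->last_tree_update = jiffies;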

> 
> > +		if (prev_usage_in_excess)
> > +			mem_cgroup_remove_exceeded(mem);
> > +		mem_cgroup_insert_exceeded(mem);
> > +		updated_tree = true;
> > +	} else if (prev_usage_in_excess && !new_usage_in_excess) {
> > +		mem_cgroup_remove_exceeded(mem);
> > +		updated_tree = true;
> > +	}
> > +
> > +	if (updated_tree) {
> > +		spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +		mem->last_tree_update = jiffies;
> > +		mem->usage_in_excess = new_usage_in_excess;
> > +		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +	}
> > +	mem_cgroup_put(mem);
> > +}
> >  
> >  /*
> >   * Unlike exported interface, "oom" parameter is added. if oom==true,
> > @@ -906,9 +989,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  			gfp_t gfp_mask, struct mem_cgroup **memcg,
> >  			bool oom)
> >  {
> > -	struct mem_cgroup *mem, *mem_over_limit;
> > +	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
> >  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > -	struct res_counter *fail_res;
> > +	struct res_counter *fail_res, *soft_fail_res = NULL;
> >  
> >  	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> >  		/* Don't account this! */
> > @@ -938,12 +1021,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  		int ret;
> >  		bool noswap = false;
> >  
> > -		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> > +		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> > +						&soft_fail_res);
> >  		if (likely(!ret)) {
> >  			if (!do_swap_account)
> >  				break;
> >  			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> > -							&fail_res);
> > +							&fail_res, NULL);
> >  			if (likely(!ret))
> >  				break;
> >  			/* mem+swap counter fails */
> > @@ -985,6 +1069,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  			goto nomem;
> >  		}
> >  	}
> > +
> > +	if (soft_fail_res) {
> > +		mem_over_soft_limit =
> > +			mem_cgroup_from_res_counter(soft_fail_res, res);
> > +		mem_cgroup_check_and_update_tree(mem_over_soft_limit, true);
> > +	}
> > +	mem_cgroup_check_and_update_tree(mem, true);
> >  	return 0;
> >  nomem:
> >  	css_put(&mem->css);
> > @@ -1422,6 +1513,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> >  	mz = page_cgroup_zoneinfo(pc);
> >  	unlock_page_cgroup(pc);
> >  
> > +	mem_cgroup_check_and_update_tree(mem, true);
> >  	/* at swapout, this memcg will be accessed to record to swap */
> >  	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> >  		css_put(&mem->css);
> > @@ -2346,6 +2438,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
> >  {
> >  	int node;
> >  
> > +	mem_cgroup_check_and_update_tree(mem, false);
> >  	free_css_id(&mem_cgroup_subsys, &mem->css);
> >  
> >  	for_each_node_state(node, N_POSSIBLE)
> > @@ -2412,6 +2505,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  	if (cont->parent == NULL) {
> >  		enable_swap_cgroup();
> >  		parent = NULL;
> > +		mem_cgroup_soft_limit_exceeded_groups = RB_ROOT;
> >  	} else {
> >  		parent = mem_cgroup_from_cont(cont->parent);
> >  		mem->use_hierarchy = parent->use_hierarchy;
> > @@ -2432,6 +2526,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  		res_counter_init(&mem->memsw, NULL);
> >  	}
> >  	mem->last_scanned_child = 0;
> > +	mem->usage_in_excess = 0;
> > +	mem->last_tree_update = 0;	/* Yes, time begins at 0 here */
> >  	spin_lock_init(&mem->reclaim_param_lock);
> >  
> >  	if (parent)
> > 
>

Thanks for the review! 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
  2009-02-17  3:05   ` Balbir Singh
@ 2009-02-17  4:03     ` KAMEZAWA Hiroyuki
  2009-02-17  4:20       ` KAMEZAWA Hiroyuki
  2009-02-17  4:41       ` Balbir Singh
  0 siblings, 2 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-17  4:03 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Tue, 17 Feb 2009 08:35:26 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 09:05:23]:
> 
> > On Mon, 16 Feb 2009 16:38:44 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > 
> > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > 
> > > Changelog v2...v1
> > > 1. Soft limits now support hierarchies
> > > 2. Use spinlocks instead of mutexes for synchronization of the RB tree
> > > 
> > > Here is v2 of the new soft limit implementation. Soft limits is a new feature
> > > for the memory resource controller, something similar has existed in the
> > > group scheduler in the form of shares. The CPU controllers interpretation
> > > of shares is very different though. We'll compare shares and soft limits
> > > below.
> > > 
> > > Soft limits are the most useful feature to have for environments where
> > > the administrator wants to overcommit the system, such that only on memory
> > > contention do the limits become active. The current soft limits implementation
> > > provides a soft_limit_in_bytes interface for the memory controller and not
> > > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > > that exceed their soft limit and starts reclaiming from the group that
> > > exceeds this limit by the maximum amount.
> > > 
> > > This is an RFC implementation and is not meant for inclusion
> > > 
> > 
> > some thoughts after reading patch.
> > 
> > 1. As I pointed out, cpuset/mempolicy case is not handled yet.
> 
> That should be esy to do with zonelists passed from reclaim path
>
Please do.

> 
> > 2. I don't like to change usual direct-memory-reclaim path. It will be obstacles
> >    for VM-maintaners to improve memory reclaim. memcg's LRU is designed for
> >    shrinking memory usage and not for avoiding memory shortage. IOW, it's slow routine
> >    for reclaiming memory for memory shortage.
> 
> I don't think I agree here. Direct reclaim is the first indication of
> shortage and if order 0 pages are short, memcg's above their soft
> limit can be targetted first.
> 
My "slow" means "the overhead seems to be big". The latency will increase.

About order-0:
In patch 4/4,

+	did_some_progress = mem_cgroup_soft_limit_reclaim(gfp_mask);

should be

	if (!order)
		did_some_progress = mem_cgroup_soft_limit_reclaim(gfp_mask);
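i.e. the call site would become something like this (sketch):

        did_some_progress = 0;
        if (!order)     /* soft-limit reclaim only frees order-0 pages */
                did_some_progress = mem_cgroup_soft_limit_reclaim(gfp_mask);
        if (!did_some_progress)
                did_some_progress = try_to_free_pages(zonelist, order,
                                                        gfp_mask);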


I don't want to add any new big burden to the kernel hackers working on
memory management; they work hard to improve memory reclaim. This patch
will change the behavior.

BTW, in the typical bad case, several threads on different CPUs go into
memory reclaim at once; all of them will visit this memcg soft-limit tree
at once and the soft limit will not work as desired anyway.
You can't avoid this problem in the alloc_page() hot path.

> > 3. After this patch, res_counter is no longer for general purpose res_counter...
> >    It seems to have too many unnecessary accessories for general purpose.  
> 
> Why not? Soft limits are a feature of any controller. The return of
> highest ancestor might be the only policy we impose right now. But as
> new controllers start using res_counter, we can clearly add a policy
> callback.
> 
I think you forget that the memory cgroup is the only controller in which the
kernel can reduce resource usage without any harm to users.
A soft limit is nonsense for general resources, I think.

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
  2009-02-17  4:03     ` KAMEZAWA Hiroyuki
@ 2009-02-17  4:20       ` KAMEZAWA Hiroyuki
  2009-02-17  4:42         ` Balbir Singh
  2009-02-17  4:41       ` Balbir Singh
  1 sibling, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-17  4:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Tue, 17 Feb 2009 13:03:52 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > > 2. I don't like to change usual direct-memory-reclaim path. It will be obstacles
> > >    for VM-maintaners to improve memory reclaim. memcg's LRU is designed for
> > >    shrinking memory usage and not for avoiding memory shortage. IOW, it's slow routine
> > >    for reclaiming memory for memory shortage.
> > 
> > I don't think I agree here. Direct reclaim is the first indication of
> > shortage and if order 0 pages are short, memcg's above their soft
> > limit can be targetted first.
> > 
> My "slow" means "the overhead seems to be big". The latency will increase.
> 
> About 0-order
> In patch 4/4
> +	did_some_progress = mem_cgroup_soft_limit_reclaim(gfp_mask);
> +	/*
> should be
>         if (!order)
>             did_some_progress = mem....
The above is wrong.

if (!order && (gfp_mask & __GFP_MOVABLE)) ... Hmm, but this is not correct either.
I have no good idea for avoiding the unnecessary work.

BTW, why don't you call soft_limit_reclaim from kswapd's path?

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
  2009-02-17  4:03     ` KAMEZAWA Hiroyuki
  2009-02-17  4:20       ` KAMEZAWA Hiroyuki
@ 2009-02-17  4:41       ` Balbir Singh
  2009-02-17  5:10         ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-02-17  4:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 13:03:52]:

> On Tue, 17 Feb 2009 08:35:26 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 09:05:23]:
> > 
> > > On Mon, 16 Feb 2009 16:38:44 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > 
> > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > 
> > > > Changelog v2...v1
> > > > 1. Soft limits now support hierarchies
> > > > 2. Use spinlocks instead of mutexes for synchronization of the RB tree
> > > > 
> > > > Here is v2 of the new soft limit implementation. Soft limits is a new feature
> > > > for the memory resource controller, something similar has existed in the
> > > > group scheduler in the form of shares. The CPU controllers interpretation
> > > > of shares is very different though. We'll compare shares and soft limits
> > > > below.
> > > > 
> > > > Soft limits are the most useful feature to have for environments where
> > > > the administrator wants to overcommit the system, such that only on memory
> > > > contention do the limits become active. The current soft limits implementation
> > > > provides a soft_limit_in_bytes interface for the memory controller and not
> > > > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > > > that exceed their soft limit and starts reclaiming from the group that
> > > > exceeds this limit by the maximum amount.
> > > > 
> > > > This is an RFC implementation and is not meant for inclusion
> > > > 
> > > 
> > > some thoughts after reading patch.
> > > 
> > > 1. As I pointed out, cpuset/mempolicy case is not handled yet.
> > 
> > That should be esy to do with zonelists passed from reclaim path
> >
> plz do.
> 
> > 
> > > 2. I don't like to change usual direct-memory-reclaim path. It will be obstacles
> > >    for VM-maintaners to improve memory reclaim. memcg's LRU is designed for
> > >    shrinking memory usage and not for avoiding memory shortage. IOW, it's slow routine
> > >    for reclaiming memory for memory shortage.
> > 
> > I don't think I agree here. Direct reclaim is the first indication of
> > shortage and if order 0 pages are short, memcg's above their soft
> > limit can be targetted first.
> > 
> My "slow" means "the overhead seems to be big". The latency will increase.
> 
> About 0-order
> In patch 4/4
> +	did_some_progress = mem_cgroup_soft_limit_reclaim(gfp_mask);
> +	/*
> should be
>         if (!order)
>             did_some_progress = mem....
> 

OK, will do

> 
> I don't want to add any new big burden to kernel hackers of memory management,
> they work hard to improve memory reclaim. This patch will change the behavior.
> 

I don't think I agree; this approach observes that before doing global
reclaim, there are several groups that are using more than their
share of memory, so it makes sense to reclaim from them first.


> BTW, in typical bad case, several threads on cpus goes into memory recalim at once and
> all thread will visit this memcg's soft-limit tree at once and soft-limit will
> not work as desired anyway.
> You can't avoid this problem at alloc_page() hot-path.

Even if all threads go into soft-reclaim at once, the tree will become
empty after a point and we will just return saying there are no more
memcgs to reclaim from (we remove the memcg from the tree while
reclaiming); those threads will then go into regular reclaim if there
is still memory pressure.

> 
> > > 3. After this patch, res_counter is no longer for general purpose res_counter...
> > >    It seems to have too many unnecessary accessories for general purpose.  
> > 
> > Why not? Soft limits are a feature of any controller. The return of
> > highest ancestor might be the only policy we impose right now. But as
> > new controllers start using res_counter, we can clearly add a policy
> > callback.
> > 
> I think you forget that memroy cgroups is an only controller in which the kernel
> can reduce the usage of resource without any harmful to users.
> soft-limit is nonsense for general resources, I think.
> 

Really? Even for CPUs? A soft limit is a form of shares (please don't
confuse it with cpu.shares). Soft limits are used as a way of implementing
work-conserving controllers.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
  2009-02-17  4:20       ` KAMEZAWA Hiroyuki
@ 2009-02-17  4:42         ` Balbir Singh
  0 siblings, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-02-17  4:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 13:20:39]:

> On Tue, 17 Feb 2009 13:03:52 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > > > 2. I don't like to change usual direct-memory-reclaim path. It will be obstacles
> > > >    for VM-maintaners to improve memory reclaim. memcg's LRU is designed for
> > > >    shrinking memory usage and not for avoiding memory shortage. IOW, it's slow routine
> > > >    for reclaiming memory for memory shortage.
> > > 
> > > I don't think I agree here. Direct reclaim is the first indication of
> > > shortage and if order 0 pages are short, memcg's above their soft
> > > limit can be targetted first.
> > > 
> > My "slow" means "the overhead seems to be big". The latency will increase.
> > 
> > About 0-order
> > In patch 4/4
> > +	did_some_progress = mem_cgroup_soft_limit_reclaim(gfp_mask);
> > +	/*
> > should be
> >         if (!order)
> >             did_some_progress = mem....
> above is wrong.
> 
> if (!order && (gfp_mask & GFP_MOVABLE)) ....Hmm, but this is not correct.
> I have no good idea to avoid unnecessary works.
> 
> BTW,  why don't you call soft_limit_reclaim from kswapd's path ?
>

I think it has to be both the kswapd and pdflush paths; I can consider that
option as well. That needs more thought on the design.
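For kswapd the hook would sit in balance_pgdat(), roughly like this
(sketch only; the interface is not settled):

        unsigned long nr_soft_reclaimed;

        /*
         * Before kswapd falls back to scanning whole zones, push the
         * worst soft-limit offender back toward its limit.
         */
        nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(GFP_KERNEL);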
 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
  2009-02-17  4:41       ` Balbir Singh
@ 2009-02-17  5:10         ` KAMEZAWA Hiroyuki
  2009-02-17  5:39           ` Balbir Singh
  0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-17  5:10 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Tue, 17 Feb 2009 10:11:10 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 13:03:52]:
> 
> > On Tue, 17 Feb 2009 08:35:26 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > I don't want to add any new big burden to kernel hackers of memory management,
> > they work hard to improve memory reclaim. This patch will change the behavior.
> > 
> 
> I don't think I agree, this approach suggests that before doing global
> reclaim, there are several groups that are using more than their
> share of memory, so it makes sense to reclaim from them first.
> 

> 
> > BTW, in typical bad case, several threads on cpus goes into memory recalim at once and
> > all thread will visit this memcg's soft-limit tree at once and soft-limit will
> > not work as desired anyway.
> > You can't avoid this problem at alloc_page() hot-path.
> 
> Even if all threads go into soft-reclaim at once, the tree will become
> empty after a point and we will just return saying there are no more
> memcg's to reclaim from (we remove the memcg from the tree when
> reclaiming), then those threads will go into regular reclaim if there
> is still memory pressure.

Yes, the largest-excess group will be removed. So it seems that it doesn't work
as designed. Is the rbtree considered just a hint? If so, an rbtree seems to be
overkill.

just a question:
Assume memcg under hierarchy.
   ../group_A/                 usage=1G, soft_limit=900M  hierarchy=1
              01/              usage=200M, soft_limit=100M
              02/              usage=300M, soft_limit=200M
              03/              usage=500M, soft_limit=300M  <==== 200M over.
                 004/          usage=200M, soft_limit=100M
                 005/          usage=300M, soft_limit=200M

At memory shortage, group 03's memory will be reclaimed
  - reclaim memory from 03, 03/004, 03/005

When 100M of group 03's memory is reclaimed, group_A's memory is reclaimed at the
same time, implicitly. Doesn't this break your rb-tree?

I recommend that the soft limit be applied only to the node at the top of the
hierarchy.
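The implicit update comes from the uncharge path (illustration,
assuming hierarchical res_counters as in this series):

        /*
         * res_counter_uncharge() walks counter->parent, so freeing a
         * page charged to 03/004 also lowers the usage of 03 and
         * group_A. Their usage_in_excess changes without any rb-tree
         * update, leaving the tree's sort order stale.
         */
        res_counter_uncharge(&mem->res, PAGE_SIZE);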
 
> > 
> > > > 3. After this patch, res_counter is no longer for general purpose res_counter...
> > > >    It seems to have too many unnecessary accessories for general purpose.  
> > > 
> > > Why not? Soft limits are a feature of any controller. The return of
> > > highest ancestor might be the only policy we impose right now. But as
> > > new controllers start using res_counter, we can clearly add a policy
> > > callback.
> > > 
> > I think you forget that memroy cgroups is an only controller in which the kernel
> > can reduce the usage of resource without any harmful to users.
> > soft-limit is nonsense for general resources, I think.
> > 
> 
> Really? Even for CPUs? soft-limit is a form of shares (please don't
> confuse with cpu.shares). Soft limits is used as a way of implementing
> work conserving controllers.
> 

I don't think the CPU controller needs this. It works with shares and no hard limit.

Thanks,
-Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
  2009-02-17  5:10         ` KAMEZAWA Hiroyuki
@ 2009-02-17  5:39           ` Balbir Singh
  2009-02-17  6:36             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-02-17  5:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 14:10:39]:

> On Tue, 17 Feb 2009 10:11:10 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 13:03:52]:
> > 
> > > On Tue, 17 Feb 2009 08:35:26 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > I don't want to add any new big burden to kernel hackers of memory management,
> > > they work hard to improve memory reclaim. This patch will change the behavior.
> > > 
> > 
> > I don't think I agree, this approach suggests that before doing global
> > reclaim, there are several groups that are using more than their
> > share of memory, so it makes sense to reclaim from them first.
> > 
> 
> > 
> > > BTW, in typical bad case, several threads on cpus goes into memory recalim at once and
> > > all thread will visit this memcg's soft-limit tree at once and soft-limit will
> > > not work as desired anyway.
> > > You can't avoid this problem at alloc_page() hot-path.
> > 
> > Even if all threads go into soft-reclaim at once, the tree will become
> > empty after a point and we will just return saying there are no more
> > memcg's to reclaim from (we remove the memcg from the tree when
> > reclaiming), then those threads will go into regular reclaim if there
> > is still memory pressure.
> 
> Yes. the largest-excess group will be removed. So, it seems that it doesn't work
> as designed. rbtree is considered as just a hint ? If so, rbtree seems to be
> overkill.
> 
> just a question:
> Assume memcg under hierarchy.
>    ../group_A/                 usage=1G, soft_limit=900M  hierarchy=1
>               01/              usage=200M, soft_limit=100M
>               02/              usage=300M, soft_limit=200M
>               03/              usage=500M, soft_limit=300M  <==== 200M over.
>                  004/          usage=200M, soft_limit=100M
>                  005/          usage=300M, soft_limit=200M
> 
> At memory shortage, group 03's memory will be reclaimed 
>   - reclaim memory from 03, 03/004, 03/005
> 
> When 100M of group 03' memory is reclaimed, group_A 's memory is reclaimd at the
> same time, implicitly. Doesn't this break your rb-tree ?
> 
> I recommend you that soft-limit can be only applied to the node which is top of
> hierarchy.

Yes, that can be done, but the reason for putting both was to target
the right memcg early.

> 
> > > 
> > > > > 3. After this patch, res_counter is no longer for general purpose res_counter...
> > > > >    It seems to have too many unnecessary accessories for general purpose.  
> > > > 
> > > > Why not? Soft limits are a feature of any controller. The return of
> > > > highest ancestor might be the only policy we impose right now. But as
> > > > new controllers start using res_counter, we can clearly add a policy
> > > > callback.
> > > > 
> > > I think you forget that memroy cgroups is an only controller in which the kernel
> > > can reduce the usage of resource without any harmful to users.
> > > soft-limit is nonsense for general resources, I think.
> > > 
> > 
> > Really? Even for CPUs? soft-limit is a form of shares (please don't
> > confuse with cpu.shares). Soft limits is used as a way of implementing
> > work conserving controllers.
> > 
> 
> I don't think cpu needs this. It works under share and no hardlimit.
>

Forget CPUs for now. The concept of soft limits is applicable to all
resource controllers. Look at check_thread_timers(); you'll see CPU soft
limits for rlimit. Soft limits allow overcommit as long as there is no
contention on the resource.
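For example, RLIMIT_CPU already has exactly this soft/hard split
(userspace view; error handling omitted):

        #include <sys/resource.h>

        struct rlimit rl = {
                .rlim_cur = 10,  /* soft limit: SIGXCPU after 10s of CPU time */
                .rlim_max = 20,  /* hard limit: SIGKILL after 20s */
        };
        setrlimit(RLIMIT_CPU, &rl);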

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
  2009-02-17  5:39           ` Balbir Singh
@ 2009-02-17  6:36             ` KAMEZAWA Hiroyuki
  2009-02-17  6:43               ` Balbir Singh
  0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-17  6:36 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

On Tue, 17 Feb 2009 11:09:03 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 14:10:39]:
> 
> > On Tue, 17 Feb 2009 10:11:10 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 13:03:52]:
> > > 
> > > > On Tue, 17 Feb 2009 08:35:26 +0530
> > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > I don't want to add any new big burden to kernel hackers of memory management,
> > > > they work hard to improve memory reclaim. This patch will change the behavior.
> > > > 
> > > 
> > > I don't think I agree, this approach suggests that before doing global
> > > reclaim, there are several groups that are using more than their
> > > share of memory, so it makes sense to reclaim from them first.
> > > 
> > 
> > > 
> > > > BTW, in typical bad case, several threads on cpus goes into memory recalim at once and
> > > > all thread will visit this memcg's soft-limit tree at once and soft-limit will
> > > > not work as desired anyway.
> > > > You can't avoid this problem at alloc_page() hot-path.
> > > 
> > > Even if all threads go into soft-reclaim at once, the tree will become
> > > empty after a point and we will just return saying there are no more
> > > memcg's to reclaim from (we remove the memcg from the tree when
> > > reclaiming), then those threads will go into regular reclaim if there
> > > is still memory pressure.
> > 
> > Yes. the largest-excess group will be removed. So, it seems that it doesn't work
> > as designed. rbtree is considered as just a hint ? If so, rbtree seems to be
> > overkill.
> > 
> > just a question:
> > Assume memcg under hierarchy.
> >    ../group_A/                 usage=1G, soft_limit=900M  hierarchy=1
> >               01/              usage=200M, soft_limit=100M
> >               02/              usage=300M, soft_limit=200M
> >               03/              usage=500M, soft_limit=300M  <==== 200M over.
> >                  004/          usage=200M, soft_limit=100M
> >                  005/          usage=300M, soft_limit=200M
> > 
> > At memory shortage, group 03's memory will be reclaimed 
> >   - reclaim memory from 03, 03/004, 03/005
> > 
> > When 100M of group 03' memory is reclaimed, group_A 's memory is reclaimd at the
> > same time, implicitly. Doesn't this break your rb-tree ?
> > 
> > I recommend you that soft-limit can be only applied to the node which is top of
> > hierarchy.
> 
> Yes, that can be done, but the reason for putting both was to target
> the right memcg early.
> 
My point is that sorting by the rb-tree is broken in the above case.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 0/4] Memory controller soft limit patches (v2)
  2009-02-17  6:36             ` KAMEZAWA Hiroyuki
@ 2009-02-17  6:43               ` Balbir Singh
  0 siblings, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-02-17  6:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Bharata B Rao,
	Paul Menage, lizf, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Pavel Emelianov, Dhaval Giani, Rik van Riel, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 15:36:58]:

> On Tue, 17 Feb 2009 11:09:03 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 14:10:39]:
> > 
> > > On Tue, 17 Feb 2009 10:11:10 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-02-17 13:03:52]:
> > > > 
> > > > > On Tue, 17 Feb 2009 08:35:26 +0530
> > > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > > I don't want to add any new big burden to kernel hackers of memory management,
> > > > > they work hard to improve memory reclaim. This patch will change the behavior.
> > > > > 
> > > > 
> > > > I don't think I agree, this approach suggests that before doing global
> > > > reclaim, there are several groups that are using more than their
> > > > share of memory, so it makes sense to reclaim from them first.
> > > > 
> > > 
> > > > 
> > > > > BTW, in typical bad case, several threads on cpus goes into memory recalim at once and
> > > > > all thread will visit this memcg's soft-limit tree at once and soft-limit will
> > > > > not work as desired anyway.
> > > > > You can't avoid this problem at alloc_page() hot-path.
> > > > 
> > > > Even if all threads go into soft-reclaim at once, the tree will become
> > > > empty after a point and we will just return saying there are no more
> > > > memcg's to reclaim from (we remove the memcg from the tree when
> > > > reclaiming), then those threads will go into regular reclaim if there
> > > > is still memory pressure.
> > > 
> > > Yes. the largest-excess group will be removed. So, it seems that it doesn't work
> > > as designed. rbtree is considered as just a hint ? If so, rbtree seems to be
> > > overkill.
> > > 
> > > just a question:
> > > Assume memcg under hierarchy.
> > >    ../group_A/                 usage=1G, soft_limit=900M  hierarchy=1
> > >               01/              usage=200M, soft_limit=100M
> > >               02/              usage=300M, soft_limit=200M
> > >               03/              usage=500M, soft_limit=300M  <==== 200M over.
> > >                  004/          usage=200M, soft_limit=100M
> > >                  005/          usage=300M, soft_limit=200M
> > > 
> > > At memory shortage, group 03's memory will be reclaimed 
> > >   - reclaim memory from 03, 03/004, 03/005
> > > 
> > > When 100M of group 03' memory is reclaimed, group_A 's memory is reclaimd at the
> > > same time, implicitly. Doesn't this break your rb-tree ?
> > > 
> > > I recommend you that soft-limit can be only applied to the node which is top of
> > > hierarchy.
> > 
> > Yes, that can be done, but the reason for putting both was to target
> > the right memcg early.
> > 
> My point is  that sort by rb-tree is broken in above case.
>

OK, I'll explore, experiment, and think about adding just the root.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-02-17  6:44 UTC | newest]

Thread overview: 19+ messages
2009-02-16 11:08 [RFC][PATCH 0/4] Memory controller soft limit patches (v2) Balbir Singh
2009-02-16 11:08 ` [RFC][PATCH 1/4] Memory controller soft limit documentation (v2) Balbir Singh
2009-02-16 11:08 ` [RFC][PATCH 2/4] Memory controller soft limit interface (v2) Balbir Singh
2009-02-16 11:09 ` [RFC][PATCH 3/4] Memory controller soft limit organize cgroups (v2) Balbir Singh
2009-02-17  1:00   ` KOSAKI Motohiro
2009-02-17  3:24     ` Balbir Singh
2009-02-16 11:09 ` [RFC][PATCH 4/4] Memory controller soft limit reclaim on contention (v2) Balbir Singh
2009-02-17  1:20   ` KOSAKI Motohiro
2009-02-17  3:12     ` Balbir Singh
2009-02-17  0:05 ` [RFC][PATCH 0/4] Memory controller soft limit patches (v2) KAMEZAWA Hiroyuki
2009-02-17  3:05   ` Balbir Singh
2009-02-17  4:03     ` KAMEZAWA Hiroyuki
2009-02-17  4:20       ` KAMEZAWA Hiroyuki
2009-02-17  4:42         ` Balbir Singh
2009-02-17  4:41       ` Balbir Singh
2009-02-17  5:10         ` KAMEZAWA Hiroyuki
2009-02-17  5:39           ` Balbir Singh
2009-02-17  6:36             ` KAMEZAWA Hiroyuki
2009-02-17  6:43               ` Balbir Singh
