* [PATCH 0/4] Memory controller soft limit patches (v6)
@ 2009-03-14 17:30 Balbir Singh
2009-03-14 17:30 ` [PATCH 1/4] Memory controller soft limit documentation (v6) Balbir Singh
` (3 more replies)
0 siblings, 4 replies; 29+ messages in thread
From: Balbir Singh @ 2009-03-14 17:30 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
From: Balbir Singh <balbir@linux.vnet.ibm.com>
New Feature: Soft limits for memory resource controller.
Changelog v6...v5
1. If the number of reclaimed pages is zero, select the next mem cgroup
for reclamation
2. Fixed a bug where the key was being updated after insertion into the tree
3. Fixed a build issue when CONFIG_MEM_RES_CTLR is not enabled
Changelog v5...v4
1. Several changes to the reclaim logic; please see patch 4 (reclaim on
contention). I've experimented with several possibilities for reclaim
and chose to come back to this due to the excellent behaviour seen while
testing the patchset.
2. Reduced the overhead of soft limits on resource counters very significantly.
Reaim benchmark now shows almost no drop in performance.
Changelog v4...v3
1. Adopted suggestions from Kamezawa to do a per-zone-per-node reclaim
while doing soft limit reclaim. We don't record priorities while
doing soft reclaim
2. Some of the overheads associated with soft limits (like calculating
excess each time) is eliminated
3. The time_after(jiffies, 0) bug has been fixed
4. Tasks are throttled if the mem cgroup they belong to is being soft reclaimed
and at the same time tasks are increasing the memory footprint and causing
the mem cgroup to exceed its soft limit.
Changelog v3...v2
1. Implemented several review comments from Kosaki-San and Kamezawa-San
Please see individual changelogs for changes
Changelog v2...v1
1. Soft limits now support hierarchies
2. Use spinlocks instead of mutexes for synchronization of the RB tree
Here is v6 of the new soft limit implementation. Soft limits are a new feature
for the memory resource controller; something similar has existed in the
group scheduler in the form of shares, although the CPU controller's
interpretation of shares is very different.
Soft limits are most useful in environments where the administrator wants to
overcommit the system, such that the limits become active only under memory
contention. The current soft limits implementation provides a
soft_limit_in_bytes interface for the memory controller, but not for the
memory+swap controller. The implementation maintains an RB-Tree of groups
that exceed their soft limit and starts reclaiming from the group that
exceeds this limit by the largest amount.
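For reference, the core of the implementation boils down to something like the
following condensed sketch (paraphrasing patches 3 and 4 below; locking,
hierarchy handling and the update-interval throttling are omitted here):

#include <linux/rbtree.h>

/*
 * Condensed sketch of the soft limit tree in patches 3 and 4; the
 * struct mem_cgroup fields referenced below are introduced there.
 */
static struct rb_root soft_limit_tree = RB_ROOT;

static void insert_exceeded(struct mem_cgroup *mem)
{
	struct rb_node **p = &soft_limit_tree.rb_node, *parent = NULL;

	mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
	while (*p) {
		struct mem_cgroup *node;

		parent = *p;
		node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
		if (mem->usage_in_excess < node->usage_in_excess)
			p = &(*p)->rb_left;
		else	/* ties go right */
			p = &(*p)->rb_right;
	}
	rb_link_node(&mem->mem_cgroup_node, parent, p);
	rb_insert_color(&mem->mem_cgroup_node, &soft_limit_tree);
}

/*
 * Reclaim starts from the rightmost node, i.e. the group that exceeds
 * its soft limit by the largest amount.
 */
static struct mem_cgroup *largest_excess(void)
{
	struct rb_node *right = rb_last(&soft_limit_tree);

	return right ? rb_entry(right, struct mem_cgroup, mem_cgroup_node) : NULL;
}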
Kamezawa-San has another patchset for soft limits, but I don't like its reclaim
logic of watermark-based balancing of zones for global memory cgroup limits.
I also don't like the data structures: a list does not scale well. Kamezawa's
objection to this patch is the cost of sorting, which is really negligible,
since the updates happen at a fixed interval (currently four times a second).
I do, however, like the priority feature in Kamezawa's patchset; it can be
adopted incrementally on top of this series.
Some reclaim aspects deserve more discussion. Kosaki-San suggested a double
loop for reclaim. I need to try that logic, although it is not very different
from what I currently have. I also need to test Kamezawa's approach and
report comparative results.
TODOs
1. The current implementation keys the tree on the delta from the soft limit
and pushes groups back to their soft limits; a ratio of delta/soft_limit
might be a more useful key (see the sketch below)
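For illustration, the difference between the two keys is roughly the following
(excess_key() mirrors what this series uses; ratio_key() is only a sketch of
the TODO above, and both helper names are placeholders, not part of the
patches):

#include <linux/math64.h>
#include <linux/res_counter.h>

static u64 excess_key(struct res_counter *cnt)
{
	/* absolute overage: usage - soft_limit, the key used today */
	return res_counter_soft_limit_excess(cnt);
}

static u64 ratio_key(struct res_counter *cnt)
{
	u64 excess = res_counter_soft_limit_excess(cnt);
	u64 soft = cnt->soft_limit;	/* unlocked read; fine for a sketch */

	/* overage as a percentage of the soft limit */
	return soft ? div64_u64(excess * 100, soft) : excess;
}

A ratio-based key would rank a group 100MB over a 100MB soft limit ahead of a
group 200MB over a 2GB soft limit, which may match administrator expectations
better.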
Tests
-----
I've run two memory-intensive workloads with differing soft limits and seen
that they are pushed back to their soft limits on contention. Their usage
was their soft limit plus additional memory that they were able to grab
on the system. Soft limit reclaim can take a while before the expected
results are visible.
The other tests I've run are
1. Deletion of groups while soft limit is in progress in the hierarchy
2. Setting the soft limit to zero and running other groups with non-zero
soft limits.
3. Setting the soft limit to zero and testing if the mem cgroup is able
to use available memory
Please review, comment.
Series
------
memcg-soft-limit-documentation.patch
memcg-add-soft-limit-interface.patch
memcg-organize-over-soft-limit-groups.patch
memcg-soft-limit-reclaim-on-contention.patch
--
Balbir
* [PATCH 1/4] Memory controller soft limit documentation (v6)
2009-03-14 17:30 [PATCH 0/4] Memory controller soft limit patches (v6) Balbir Singh
@ 2009-03-14 17:30 ` Balbir Singh
2009-03-14 17:30 ` [PATCH 2/4] Memory controller soft limit interface (v6) Balbir Singh
` (2 subsequent siblings)
3 siblings, 0 replies; 29+ messages in thread
From: Balbir Singh @ 2009-03-14 17:30 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
Feature: Add documentation for soft limits
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
Documentation/cgroups/memory.txt | 31 ++++++++++++++++++++++++++++++-
1 files changed, 30 insertions(+), 1 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index a98a7fe..c5f73d9 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -360,7 +360,36 @@ cgroups created below it.
NOTE2: This feature can be enabled/disabled per subtree.
-7. TODO
+7. Soft limits
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much of the memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention or low memory, control groups
+are pushed back to their soft limits. If the soft limit of each control
+group is very high, they are all pushed back as much as possible to make
+sure that one control group does not starve the others of memory.
+
+7.1 Interface
+
+Soft limits can be set up using the following commands (in this example we
+assume a soft limit of 256 megabytes):
+
+# echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use
+
+# echo 1G > memory.soft_limit_in_bytes
+
+NOTE1: Soft limits take effect over a long period of time, since they involve
+ reclaiming memory for balancing between memory cgroups
+NOTE2: It is recommended that the soft limit always be set below the hard
+ limit, otherwise the hard limit will take precedence.
+
+8. TODO
1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
--
Balbir
* [PATCH 2/4] Memory controller soft limit interface (v6)
2009-03-14 17:30 [PATCH 0/4] Memory controller soft limit patches (v6) Balbir Singh
2009-03-14 17:30 ` [PATCH 1/4] Memory controller soft limit documentation (v6) Balbir Singh
@ 2009-03-14 17:30 ` Balbir Singh
2009-03-14 17:31 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v6) Balbir Singh
2009-03-14 17:31 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v6) Balbir Singh
3 siblings, 0 replies; 29+ messages in thread
From: Balbir Singh @ 2009-03-14 17:30 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
Feature: Add soft limits interface to resource counters
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Changelog v2...v1
1. Add support for res_counter_check_soft_limit_locked. This is used
by the hierarchy code.
Add an interface to allow get/set of soft limits. Soft limits for the memory
plus swap controller (memsw) are currently not supported. Resource counters
have been enhanced to support soft limits and a new type, RES_SOFT_LIMIT, has
been added. Unlike hard limits, soft limits can be set directly and do not
need any reclaim or checks before being set to a new value.
Kamezawa-San raised a question as to whether soft limits should belong
to res_counter. Since all resources understand the basic concepts of
hard and soft limits, it is justified to add soft limits here. Soft limits
are a generic resource usage feature, even file system quotas support
soft limits.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
include/linux/res_counter.h | 58 +++++++++++++++++++++++++++++++++++++++++++
kernel/res_counter.c | 3 ++
mm/memcontrol.c | 20 +++++++++++++++
3 files changed, 81 insertions(+), 0 deletions(-)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 4c5bcf6..5c821fd 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -35,6 +35,10 @@ struct res_counter {
*/
unsigned long long limit;
/*
+ * the limit that usage can exceed
+ */
+ unsigned long long soft_limit;
+ /*
* the number of unsuccessful attempts to consume the resource
*/
unsigned long long failcnt;
@@ -85,6 +89,7 @@ enum {
RES_MAX_USAGE,
RES_LIMIT,
RES_FAILCNT,
+ RES_SOFT_LIMIT,
};
/*
@@ -130,6 +135,36 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
return false;
}
+static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
+{
+ if (cnt->usage < cnt->soft_limit)
+ return true;
+
+ return false;
+}
+
+/**
+ * Get the difference between the usage and the soft limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to the soft limit;
+ * otherwise, returns the difference between usage and the soft limit.
+ */
+static inline unsigned long long
+res_counter_soft_limit_excess(struct res_counter *cnt)
+{
+ unsigned long long excess;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (cnt->usage <= cnt->soft_limit)
+ excess = 0;
+ else
+ excess = cnt->usage - cnt->soft_limit;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return excess;
+}
+
/*
* Helper function to detect if the cgroup is within it's limit or
* not. It's currently called from cgroup_rss_prepare()
@@ -145,6 +180,17 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
return ret;
}
+static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
+{
+ bool ret;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ ret = res_counter_soft_limit_check_locked(cnt);
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return ret;
+}
+
static inline void res_counter_reset_max(struct res_counter *cnt)
{
unsigned long flags;
@@ -178,4 +224,16 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
return ret;
}
+static inline int
+res_counter_set_soft_limit(struct res_counter *cnt,
+ unsigned long long soft_limit)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->soft_limit = soft_limit;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
#endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index bf8e753..4e6dafe 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
{
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
+ counter->soft_limit = (unsigned long long)LLONG_MAX;
counter->parent = parent;
}
@@ -101,6 +102,8 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->limit;
case RES_FAILCNT:
return &counter->failcnt;
+ case RES_SOFT_LIMIT:
+ return &counter->soft_limit;
};
BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5de6be9..70bc992 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2002,6 +2002,20 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
else
ret = mem_cgroup_resize_memsw_limit(memcg, val);
break;
+ case RES_SOFT_LIMIT:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ /*
+ * Soft limits for memsw are hard to define semantically;
+ * for now, we support soft limits only for the memory
+ * controller (without swap)
+ */
+ if (type == _MEM)
+ ret = res_counter_set_soft_limit(&memcg->res, val);
+ else
+ ret = -EINVAL;
+ break;
default:
ret = -EINVAL; /* should be BUG() ? */
break;
@@ -2251,6 +2265,12 @@ static struct cftype mem_cgroup_files[] = {
.read_u64 = mem_cgroup_read,
},
{
+ .name = "soft_limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
.name = "failcnt",
.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
.trigger = mem_cgroup_reset,
--
Balbir
* [PATCH 3/4] Memory controller soft limit organize cgroups (v6)
2009-03-14 17:30 [PATCH 0/4] Memory controller soft limit patches (v6) Balbir Singh
2009-03-14 17:30 ` [PATCH 1/4] Memory controller soft limit documentation (v6) Balbir Singh
2009-03-14 17:30 ` [PATCH 2/4] Memory controller soft limit interface (v6) Balbir Singh
@ 2009-03-14 17:31 ` Balbir Singh
2009-03-16 0:21 ` KAMEZAWA Hiroyuki
2009-03-14 17:31 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v6) Balbir Singh
3 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2009-03-14 17:31 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
Feature: Organize cgroups over soft limit in a RB-Tree
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Changelog v6...v5
1. Update the key before inserting into RB tree. Without the current change
it could take an additional iteration to get the key correct.
Changelog v5...v4
1. res_counter_uncharge has an additional parameter to indicate if the
counter was over its soft limit, before uncharge.
Changelog v4...v3
1. Optimizations to ensure we don't unnecessarily read res_counter values
2. Fixed a bug in usage of time_after()
Changelog v3...v2
1. Add only the ancestor to the RB-Tree
2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
Changelog v2...v1
1. Add support for hierarchies
2. The res_counter that is highest in the hierarchy is returned on soft
limit being exceeded. Since we do hierarchical reclaim and add all
groups exceeding their soft limits, this approach seems to work well
in practice.
This patch introduces an RB-Tree for storing memory cgroups that are over their
soft limit. The overall goal is to
1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
We are careful about updates; they take place only after a particular
time interval has passed
2. We remove the node from the RB-Tree when the usage goes below the soft
limit
The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it when we
face memory contention.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
include/linux/res_counter.h | 6 +-
kernel/res_counter.c | 18 +++++
mm/memcontrol.c | 141 ++++++++++++++++++++++++++++++++++++++-----
3 files changed, 143 insertions(+), 22 deletions(-)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 5c821fd..5bbf8b1 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
int __must_check res_counter_charge_locked(struct res_counter *counter,
unsigned long val);
int __must_check res_counter_charge(struct res_counter *counter,
- unsigned long val, struct res_counter **limit_fail_at);
+ unsigned long val, struct res_counter **limit_fail_at,
+ struct res_counter **soft_limit_at);
/*
* uncharge - tell that some portion of the resource is released
@@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
*/
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+ bool *was_soft_limit_excess);
static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
{
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4e6dafe..51ec438 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
}
int res_counter_charge(struct res_counter *counter, unsigned long val,
- struct res_counter **limit_fail_at)
+ struct res_counter **limit_fail_at,
+ struct res_counter **soft_limit_fail_at)
{
int ret;
unsigned long flags;
struct res_counter *c, *u;
*limit_fail_at = NULL;
+ if (soft_limit_fail_at)
+ *soft_limit_fail_at = NULL;
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
ret = res_counter_charge_locked(c, val);
+ /*
+ * With soft limits, we return the highest ancestor
+ * that exceeds its soft limit
+ */
+ if (soft_limit_fail_at &&
+ !res_counter_soft_limit_check_locked(c))
+ *soft_limit_fail_at = c;
spin_unlock(&c->lock);
if (ret < 0) {
*limit_fail_at = c;
@@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
counter->usage -= val;
}
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+ bool *was_soft_limit_excess)
{
unsigned long flags;
struct res_counter *c;
@@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
+ if (c == counter && was_soft_limit_excess)
+ *was_soft_limit_excess =
+ !res_counter_soft_limit_check_locked(c);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 70bc992..200d44a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -29,6 +29,7 @@
#include <linux/rcupdate.h>
#include <linux/limits.h>
#include <linux/mutex.h>
+#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/spinlock.h>
@@ -129,6 +130,14 @@ struct mem_cgroup_lru_info {
};
/*
+ * Cgroups above their limits are maintained in a RB-Tree, independent of
+ * their hierarchy representation
+ */
+
+static struct rb_root mem_cgroup_soft_limit_tree;
+static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
+
+/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
* statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -176,12 +185,20 @@ struct mem_cgroup {
unsigned int swappiness;
+ struct rb_node mem_cgroup_node; /* RB tree node */
+ unsigned long long usage_in_excess; /* Set to the value by which */
+ /* the soft limit is exceeded*/
+ unsigned long last_tree_update; /* Last time the tree was */
+ /* updated in jiffies */
+
/*
* statistics. This must be placed at the end of memcg.
*/
struct mem_cgroup_stat stat;
};
+#define MEM_CGROUP_TREE_UPDATE_INTERVAL (HZ/4)
+
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -214,6 +231,42 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+ struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
+ struct rb_node *parent = NULL;
+ struct mem_cgroup *mem_node;
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ while (*p) {
+ parent = *p;
+ mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
+ if (mem->usage_in_excess < mem_node->usage_in_excess)
+ p = &(*p)->rb_left;
+ /*
+ * Mem cgroups may be over their soft limit by the
+ * same amount; such ties go to the right subtree
+ */
+ else if (mem->usage_in_excess >= mem_node->usage_in_excess)
+ p = &(*p)->rb_right;
+ }
+ rb_link_node(&mem->mem_cgroup_node, parent, p);
+ rb_insert_color(&mem->mem_cgroup_node,
+ &mem_cgroup_soft_limit_tree);
+ mem->last_tree_update = jiffies;
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
+static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+ unsigned long flags;
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -897,6 +950,39 @@ static void record_last_oom(struct mem_cgroup *mem)
mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
}
+static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
+ bool time_check)
+{
+ unsigned long long prev_usage_in_excess, new_usage_in_excess;
+ bool updated_tree = false;
+ unsigned long next_update = 0;
+ unsigned long flags;
+
+ prev_usage_in_excess = mem->usage_in_excess;
+
+ if (time_check)
+ next_update = mem->last_tree_update +
+ MEM_CGROUP_TREE_UPDATE_INTERVAL;
+
+ if (!time_check || time_after(jiffies, next_update)) {
+ new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ if (prev_usage_in_excess) {
+ mem_cgroup_remove_exceeded(mem);
+ updated_tree = true;
+ }
+ if (!new_usage_in_excess)
+ goto done;
+ mem_cgroup_insert_exceeded(mem);
+ }
+
+done:
+ if (updated_tree) {
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ mem->last_tree_update = jiffies;
+ mem->usage_in_excess = new_usage_in_excess;
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+ }
+}
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
@@ -906,9 +992,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **memcg,
bool oom)
{
- struct mem_cgroup *mem, *mem_over_limit;
+ struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct res_counter *fail_res;
+ struct res_counter *fail_res, *soft_fail_res = NULL;
if (unlikely(test_thread_flag(TIF_MEMDIE))) {
/* Don't account this! */
@@ -938,16 +1024,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
int ret;
bool noswap = false;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+ ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
+ &soft_fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
- &fail_res);
+ &fail_res, NULL);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
noswap = true;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -985,6 +1072,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
goto nomem;
}
}
+
+ /*
+ * Insert just the ancestor, we should trickle down to the correct
+ * cgroup for reclaim, since the other nodes will be below their
+ * soft limit
+ */
+ if (soft_fail_res) {
+ mem_over_soft_limit =
+ mem_cgroup_from_res_counter(soft_fail_res, res);
+ mem_cgroup_check_and_update_tree(mem_over_soft_limit, true);
+ }
return 0;
nomem:
css_put(&mem->css);
@@ -1061,9 +1159,9 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
lock_page_cgroup(pc);
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
css_put(&mem->css);
return;
}
@@ -1116,10 +1214,10 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
if (pc->mem_cgroup != from)
goto out;
- res_counter_uncharge(&from->res, PAGE_SIZE);
+ res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
mem_cgroup_charge_statistics(from, pc, false);
if (do_swap_account)
- res_counter_uncharge(&from->memsw, PAGE_SIZE);
+ res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
css_put(&from->css);
css_get(&to->css);
@@ -1183,9 +1281,9 @@ uncharge:
/* drop extra refcnt by try_charge() */
css_put(&parent->css);
/* uncharge if move fails */
- res_counter_uncharge(&parent->res, PAGE_SIZE);
+ res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
return ret;
}
@@ -1314,7 +1412,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
* Recorded ID can be obsolete. We avoid calling
* css_tryget()
*/
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
mem_cgroup_put(mem);
}
rcu_read_unlock();
@@ -1393,7 +1491,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
* This recorded memcg can be obsolete one. So, avoid
* calling css_tryget
*/
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
mem_cgroup_put(memcg);
}
rcu_read_unlock();
@@ -1408,9 +1506,9 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
return;
if (!mem)
return;
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
css_put(&mem->css);
}
@@ -1424,6 +1522,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
+ bool soft_limit_excess = false;
if (mem_cgroup_disabled())
return NULL;
@@ -1461,9 +1560,9 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
break;
}
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
mem_cgroup_charge_statistics(mem, pc, false);
ClearPageCgroupUsed(pc);
@@ -1477,6 +1576,8 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
mz = page_cgroup_zoneinfo(pc);
unlock_page_cgroup(pc);
+ if (soft_limit_excess)
+ mem_cgroup_check_and_update_tree(mem, true);
/* at swapout, this memcg will be accessed to record to swap */
if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
css_put(&mem->css);
@@ -1545,7 +1646,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
* We uncharge this because swap is freed.
* This memcg can be obsolete one. We avoid calling css_tryget
*/
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
mem_cgroup_put(memcg);
}
rcu_read_unlock();
@@ -2409,6 +2510,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
{
int node;
+ mem_cgroup_check_and_update_tree(mem, false);
free_css_id(&mem_cgroup_subsys, &mem->css);
for_each_node_state(node, N_POSSIBLE)
@@ -2475,6 +2577,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
if (cont->parent == NULL) {
enable_swap_cgroup();
parent = NULL;
+ mem_cgroup_soft_limit_tree = RB_ROOT;
} else {
parent = mem_cgroup_from_cont(cont->parent);
mem->use_hierarchy = parent->use_hierarchy;
@@ -2495,6 +2598,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
res_counter_init(&mem->memsw, NULL);
}
mem->last_scanned_child = 0;
+ mem->usage_in_excess = 0;
+ mem->last_tree_update = 0; /* Yes, time begins at 0 here */
spin_lock_init(&mem->reclaim_param_lock);
if (parent)
--
Balbir
* [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-14 17:30 [PATCH 0/4] Memory controller soft limit patches (v6) Balbir Singh
` (2 preceding siblings ...)
2009-03-14 17:31 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v6) Balbir Singh
@ 2009-03-14 17:31 ` Balbir Singh
2009-03-16 0:52 ` KAMEZAWA Hiroyuki
3 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2009-03-14 17:31 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
Feature: Implement reclaim from groups over their soft limit
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Changelog v6...v5
1. Reclaim arguments to hierarchical reclaim have been merged into one
parameter called reclaim_options.
2. Check if we failed to reclaim from one cgroup during soft reclaim; if
so, move on to the next one. This can be very useful if the zonelist
passed to soft limit reclaim has no allocations from the selected
memory cgroup
3. Coding style cleanups
Changelog v5...v4
1. Throttling is removed; earlier we throttled tasks over their soft limit
2. Reclaim has been moved back to __alloc_pages_internal; several experiments
and tests showed that it was the best place to reclaim memory. kswapd has
a different goal that does not work with a single soft limit for the memory
cgroup.
3. Soft limit reclaim is more targeted and the number of pages reclaimed
depends on the amount by which the soft limit is exceeded.
Changelog v4...v3
1. soft_reclaim is now called from balance_pgdat
2. soft_reclaim is aware of nodes and zones
3. A mem_cgroup will be throttled if it is undergoing soft limit reclaim
and at the same time trying to allocate pages and exceed its soft limit.
4. A new mem_cgroup_shrink_zone() routine has been added to shrink zones
particular to a mem cgroup.
Changelog v3...v2
1. Convert several arguments to hierarchical reclaim to flags, thereby
consolidating them
2. The reclaim for soft limits is now triggered from kswapd
3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument
Changelog v2...v1
1. Added support for hierarchical soft limits
This patch allows reclaim from memory cgroups on contention (via the
direct reclaim path).
Memory cgroup soft limit reclaim finds the group that exceeds its soft limit
by the largest number of pages, reclaims pages from it, and then reinserts the
cgroup into its correct place in the RB-tree.
Reclaim arguments to hierarchical reclaim have been merged into one parameter
called reclaim_options.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
include/linux/memcontrol.h | 8 ++
include/linux/swap.h | 1
mm/memcontrol.c | 205 ++++++++++++++++++++++++++++++++++++++++----
mm/page_alloc.c | 9 ++
mm/vmscan.c | 5 +
5 files changed, 205 insertions(+), 23 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..b99d9c5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,7 +116,8 @@ static inline bool mem_cgroup_disabled(void)
}
extern bool mem_cgroup_oom_called(struct task_struct *task);
-
+unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl,
+ gfp_t gfp_mask);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
@@ -264,6 +265,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}
+static inline
+unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
+{
+ return 0;
+}
#endif /* CONFIG_CGROUP_MEM_CONT */
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 989eb53..c128337 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,7 @@ static inline void lru_cache_add_active_file(struct page *page)
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
+ struct zonelist *zl,
gfp_t gfp_mask, bool noswap,
unsigned int swappiness);
extern int __isolate_lru_page(struct page *page, int mode, int file);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 200d44a..980bd18 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -191,6 +191,7 @@ struct mem_cgroup {
unsigned long last_tree_update; /* Last time the tree was */
/* updated in jiffies */
+ bool on_tree; /* Is the node on tree? */
/*
* statistics. This must be placed at the end of memcg.
*/
@@ -227,18 +228,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val) ((val) & 0xffff)
+/*
+ * Flags used for hierarchical reclaim
+ */
+#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
+#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
+#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
+#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+#define MEM_CGROUP_RECLAIM_SOFT_BIT 0x2
+#define MEM_CGROUP_RECLAIM_SOFT (1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
+
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
-static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
{
struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
struct rb_node *parent = NULL;
struct mem_cgroup *mem_node;
- unsigned long flags;
- spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ if (mem->on_tree)
+ return;
+
mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
while (*p) {
parent = *p;
@@ -256,6 +268,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
rb_insert_color(&mem->mem_cgroup_node,
&mem_cgroup_soft_limit_tree);
mem->last_tree_update = jiffies;
+ mem->on_tree = true;
+}
+
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+ if (!mem->on_tree)
+ return;
+ rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+ mem->on_tree = false;
+}
+
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ __mem_cgroup_insert_exceeded(mem);
spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
}
@@ -263,8 +292,53 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
{
unsigned long flags;
spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
- rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+ __mem_cgroup_remove_exceeded(mem);
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
+unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
+{
+ unsigned long flags;
+ unsigned long long excess;
+
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ excess = mem->usage_in_excess >> PAGE_SHIFT;
spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+ return (excess > ULONG_MAX) ? ULONG_MAX : excess;
+}
+
+static struct mem_cgroup *__mem_cgroup_largest_soft_limit_node(void)
+{
+ struct rb_node *rightmost = NULL;
+ struct mem_cgroup *mem = NULL;
+
+retry:
+ rightmost = rb_last(&mem_cgroup_soft_limit_tree);
+ if (!rightmost)
+ goto done; /* Nothing to reclaim from */
+
+ mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
+ /*
+ * Remove the node now but someone else can add it back,
+ * we will add it back at the end of reclaim to its correct
+ * position in the tree.
+ */
+ __mem_cgroup_remove_exceeded(mem);
+ if (!css_tryget(&mem->css) || !res_counter_soft_limit_excess(&mem->res))
+ goto retry;
+done:
+ return mem;
+}
+
+static struct mem_cgroup *mem_cgroup_largest_soft_limit_node(void)
+{
+ struct mem_cgroup *mem;
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ mem = __mem_cgroup_largest_soft_limit_node();
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+ return mem;
}
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
@@ -889,14 +963,42 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
* If shrink==true, for avoiding to free too much, this returns immedieately.
*/
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
- gfp_t gfp_mask, bool noswap, bool shrink)
+ struct zonelist *zl,
+ gfp_t gfp_mask,
+ unsigned long reclaim_options)
{
struct mem_cgroup *victim;
int ret, total = 0;
int loop = 0;
+ bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
+ bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
+ bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
+ unsigned long excess = mem_cgroup_get_excess(root_mem);
- while (loop < 2) {
+ while (1) {
+ if (loop >= 2) {
+ if (!check_soft)
+ break;
+ /*
+ * We want to do more targeted reclaim; excess >> 4
+ * is not so large that we reclaim too much, nor so
+ * small that we keep coming back to reclaim from
+ * this cgroup
+ */
+ if (total >= (excess >> 4))
+ break;
+ }
victim = mem_cgroup_select_victim(root_mem);
+ /*
+ * In the first loop, don't reclaim from victims below
+ * their soft limit
+ */
+ if (!loop && res_counter_check_under_soft_limit(&victim->res)) {
+ if (victim == root_mem)
+ loop++;
+ css_put(&victim->css);
+ continue;
+ }
if (victim == root_mem)
loop++;
if (!mem_cgroup_local_usage(&victim->stat)) {
@@ -905,8 +1007,9 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
continue;
}
/* we use swappiness of local cgroup */
- ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
- get_swappiness(victim));
+ ret = try_to_free_mem_cgroup_pages(victim, zl, gfp_mask,
+ noswap,
+ get_swappiness(victim));
css_put(&victim->css);
/*
* At shrinking usage, we can't check we should stop here or
@@ -916,7 +1019,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
if (shrink)
return ret;
total += ret;
- if (mem_cgroup_check_under_limit(root_mem))
+ if (check_soft) {
+ if (res_counter_check_under_soft_limit(&root_mem->res))
+ return total;
+ } else if (mem_cgroup_check_under_limit(root_mem))
return 1 + total;
}
return total;
@@ -1022,7 +1128,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
while (1) {
int ret;
- bool noswap = false;
+ unsigned long flags = 0;
ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
&soft_fail_res);
@@ -1035,7 +1141,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
break;
/* mem+swap counter fails */
res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
- noswap = true;
+ flags = MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
} else
@@ -1046,8 +1152,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
- ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
- noswap, false);
+ ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
+ gfp_mask, flags);
if (ret)
continue;
@@ -1757,8 +1863,8 @@ int mem_cgroup_shrink_usage(struct page *page,
return 0;
do {
- progress = mem_cgroup_hierarchical_reclaim(mem,
- gfp_mask, true, false);
+ progress = mem_cgroup_hierarchical_reclaim(mem, NULL,
+ gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
progress += mem_cgroup_check_under_limit(mem);
} while (!progress && --retry);
@@ -1812,8 +1918,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
- false, true);
+ progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
+ GFP_KERNEL,
+ MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -1861,7 +1968,9 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
+ mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
+ MEM_CGROUP_RECLAIM_NOSWAP |
+ MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -1872,6 +1981,62 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
return ret;
}
+unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
+{
+ unsigned long nr_reclaimed = 0;
+ struct mem_cgroup *mem, *next_mem = NULL;
+ unsigned long flags;
+ unsigned long reclaimed;
+
+ /*
+ * This loop can run a while, especially if mem_cgroups continuously
+ * keep exceeding their soft limit and putting the system under
+ * pressure
+ */
+ do {
+ if (next_mem)
+ mem = next_mem;
+ else
+ mem = mem_cgroup_largest_soft_limit_node();
+ if (!mem)
+ break;
+
+ reclaimed = mem_cgroup_hierarchical_reclaim(mem, zl,
+ gfp_mask,
+ MEM_CGROUP_RECLAIM_SOFT);
+ nr_reclaimed += reclaimed;
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+
+ /*
+ * If we failed to reclaim anything from this memory cgroup
+ * it is time to move on to the next cgroup
+ */
+ next_mem = NULL;
+ if (!reclaimed) {
+ do {
+ /*
+ * By the time we get the soft_limit lock
+ * again, someone might have added the
+ * group back on the RB tree. Iterate to
+ * make sure we get a different mem.
+ * mem_cgroup_largest_soft_limit_node returns
+ * NULL if no other cgroup is present on
+ * the tree
+ */
+ next_mem =
+ __mem_cgroup_largest_soft_limit_node();
+ } while (next_mem == mem);
+ }
+ mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ __mem_cgroup_remove_exceeded(mem);
+ if (mem->usage_in_excess)
+ __mem_cgroup_insert_exceeded(mem);
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+ css_put(&mem->css);
+ } while (!nr_reclaimed);
+ return nr_reclaimed;
+}
+
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -1995,7 +2160,7 @@ try_to_free:
ret = -EINTR;
goto out;
}
- progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
+ progress = try_to_free_mem_cgroup_pages(mem, NULL, GFP_KERNEL,
false, get_swappiness(mem));
if (!progress) {
nr_retries--;
@@ -2600,6 +2765,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mem->last_scanned_child = 0;
mem->usage_in_excess = 0;
mem->last_tree_update = 0; /* Yes, time begins at 0 here */
+ mem->on_tree = false;
+
spin_lock_init(&mem->reclaim_param_lock);
if (parent)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8fd1e2..5e1a6ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1598,7 +1598,14 @@ nofail_alloc:
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
+ /*
+ * Try to free up some pages from the memory controller's soft
+ * limit queue.
+ */
+ did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
+ if (order || !did_some_progress)
+ did_some_progress += try_to_free_pages(zonelist, order,
+ gfp_mask);
p->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 15f7737..13001d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
+ struct zonelist *zonelist,
gfp_t gfp_mask,
bool noswap,
unsigned int swappiness)
@@ -1721,14 +1722,14 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.mem_cgroup = mem_cont,
.isolate_pages = mem_cgroup_isolate_pages,
};
- struct zonelist *zonelist;
if (noswap)
sc.may_unmap = 0;
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
- zonelist = NODE_DATA(numa_node_id())->node_zonelists;
+ if (!zonelist)
+ zonelist = NODE_DATA(numa_node_id())->node_zonelists;
return do_try_to_free_pages(zonelist, &sc);
}
#endif
--
Balbir
* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v6)
2009-03-14 17:31 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v6) Balbir Singh
@ 2009-03-16 0:21 ` KAMEZAWA Hiroyuki
2009-03-16 8:47 ` Balbir Singh
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-16 0:21 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Sat, 14 Mar 2009 23:01:02 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Feature: Organize cgroups over soft limit in a RB-Tree
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Changelog v6...v5
> 1. Update the key before inserting into RB tree. Without the current change
> it could take an additional iteration to get the key correct.
>
> Changelog v5...v4
> 1. res_counter_uncharge has an additional parameter to indicate if the
> counter was over its soft limit, before uncharge.
>
> Changelog v4...v3
> 1. Optimizations to ensure we don't uncessarily get res_counter values
> 2. Fixed a bug in usage of time_after()
>
> Changelog v3...v2
> 1. Add only the ancestor to the RB-Tree
> 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
>
> Changelog v2...v1
> 1. Add support for hierarchies
> 2. The res_counter that is highest in the hierarchy is returned on soft
> limit being exceeded. Since we do hierarchical reclaim and add all
> groups exceeding their soft limits, this approach seems to work well
> in practice.
>
> This patch introduces a RB-Tree for storing memory cgroups that are over their
> soft limit. The overall goal is to
>
> 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> We are careful about updates, updates take place only after a particular
> time interval has passed
> 2. We remove the node from the RB-Tree when the usage goes below the soft
> limit
>
> The next set of patches will exploit the RB-Tree to get the group that is
> over its soft limit by the largest amount and reclaim from it, when we
> face memory contention.
>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
>
> include/linux/res_counter.h | 6 +-
> kernel/res_counter.c | 18 +++++
> mm/memcontrol.c | 141 ++++++++++++++++++++++++++++++++++++++-----
> 3 files changed, 143 insertions(+), 22 deletions(-)
>
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 5c821fd..5bbf8b1 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> int __must_check res_counter_charge_locked(struct res_counter *counter,
> unsigned long val);
> int __must_check res_counter_charge(struct res_counter *counter,
> - unsigned long val, struct res_counter **limit_fail_at);
> + unsigned long val, struct res_counter **limit_fail_at,
> + struct res_counter **soft_limit_at);
>
> /*
> * uncharge - tell that some portion of the resource is released
> @@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
> */
>
> void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
> -void res_counter_uncharge(struct res_counter *counter, unsigned long val);
> +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> + bool *was_soft_limit_excess);
>
> static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> {
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 4e6dafe..51ec438 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> }
>
> int res_counter_charge(struct res_counter *counter, unsigned long val,
> - struct res_counter **limit_fail_at)
> + struct res_counter **limit_fail_at,
> + struct res_counter **soft_limit_fail_at)
> {
> int ret;
> unsigned long flags;
> struct res_counter *c, *u;
>
> *limit_fail_at = NULL;
> + if (soft_limit_fail_at)
> + *soft_limit_fail_at = NULL;
> local_irq_save(flags);
> for (c = counter; c != NULL; c = c->parent) {
> spin_lock(&c->lock);
> ret = res_counter_charge_locked(c, val);
> + /*
> + * With soft limits, we return the highest ancestor
> + * that exceeds its soft limit
> + */
> + if (soft_limit_fail_at &&
> + !res_counter_soft_limit_check_locked(c))
> + *soft_limit_fail_at = c;
> spin_unlock(&c->lock);
> if (ret < 0) {
> *limit_fail_at = c;
> @@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
> counter->usage -= val;
> }
>
> -void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> + bool *was_soft_limit_excess)
> {
> unsigned long flags;
> struct res_counter *c;
> @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> local_irq_save(flags);
> for (c = counter; c != NULL; c = c->parent) {
> spin_lock(&c->lock);
> + if (c == counter && was_soft_limit_excess)
> + *was_soft_limit_excess =
> + !res_counter_soft_limit_check_locked(c);
> res_counter_uncharge_locked(c, val);
> spin_unlock(&c->lock);
> }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 70bc992..200d44a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -29,6 +29,7 @@
> #include <linux/rcupdate.h>
> #include <linux/limits.h>
> #include <linux/mutex.h>
> +#include <linux/rbtree.h>
> #include <linux/slab.h>
> #include <linux/swap.h>
> #include <linux/spinlock.h>
> @@ -129,6 +130,14 @@ struct mem_cgroup_lru_info {
> };
>
> /*
> + * Cgroups above their limits are maintained in a RB-Tree, independent of
> + * their hierarchy representation
> + */
> +
> +static struct rb_root mem_cgroup_soft_limit_tree;
> +static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
> +
> +/*
> * The memory controller data structure. The memory controller controls both
> * page cache and RSS per cgroup. We would eventually like to provide
> * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> @@ -176,12 +185,20 @@ struct mem_cgroup {
>
> unsigned int swappiness;
>
> + struct rb_node mem_cgroup_node; /* RB tree node */
> + unsigned long long usage_in_excess; /* Set to the value by which */
> + /* the soft limit is exceeded*/
> + unsigned long last_tree_update; /* Last time the tree was */
> + /* updated in jiffies */
> +
> /*
> * statistics. This must be placed at the end of memcg.
> */
> struct mem_cgroup_stat stat;
> };
>
> +#define MEM_CGROUP_TREE_UPDATE_INTERVAL (HZ/4)
> +
> enum charge_type {
> MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> MEM_CGROUP_CHARGE_TYPE_MAPPED,
> @@ -214,6 +231,42 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
> static void mem_cgroup_put(struct mem_cgroup *mem);
> static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>
> +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> +{
> + struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
> + struct rb_node *parent = NULL;
> + struct mem_cgroup *mem_node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> + mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> + while (*p) {
> + parent = *p;
> + mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> + if (mem->usage_in_excess < mem_node->usage_in_excess)
> + p = &(*p)->rb_left;
> + /*
> + * We can't avoid mem cgroups that are over their soft
> + * limit by the same amount
> + */
> + else if (mem->usage_in_excess >= mem_node->usage_in_excess)
> + p = &(*p)->rb_right;
> + }
> + rb_link_node(&mem->mem_cgroup_node, parent, p);
> + rb_insert_color(&mem->mem_cgroup_node,
> + &mem_cgroup_soft_limit_tree);
> + mem->last_tree_update = jiffies;
> + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +}
> +
> +static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> +{
> + unsigned long flags;
> + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> + rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +}
> +
> static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> struct page_cgroup *pc,
> bool charge)
> @@ -897,6 +950,39 @@ static void record_last_oom(struct mem_cgroup *mem)
> mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> }
>
> +static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
> + bool time_check)
> +{
> + unsigned long long prev_usage_in_excess, new_usage_in_excess;
> + bool updated_tree = false;
> + unsigned long next_update = 0;
> + unsigned long flags;
> +
> + prev_usage_in_excess = mem->usage_in_excess;
> +
> + if (time_check)
> + next_update = mem->last_tree_update +
> + MEM_CGROUP_TREE_UPDATE_INTERVAL;
> +
> + if (!time_check || time_after(jiffies, next_update)) {
> + new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> + if (prev_usage_in_excess) {
> + mem_cgroup_remove_exceeded(mem);
> + updated_tree = true;
> + }
> + if (!new_usage_in_excess)
> + goto done;
> + mem_cgroup_insert_exceeded(mem);
> + }
> +
> +done:
> + if (updated_tree) {
> + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> + mem->last_tree_update = jiffies;
> + mem->usage_in_excess = new_usage_in_excess;
> + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> + }
> +}
>
> /*
> * Unlike exported interface, "oom" parameter is added. if oom==true,
> @@ -906,9 +992,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> gfp_t gfp_mask, struct mem_cgroup **memcg,
> bool oom)
> {
> - struct mem_cgroup *mem, *mem_over_limit;
> + struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> - struct res_counter *fail_res;
> + struct res_counter *fail_res, *soft_fail_res = NULL;
>
> if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> /* Don't account this! */
> @@ -938,16 +1024,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> int ret;
> bool noswap = false;
>
> - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> + ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> + &soft_fail_res);
As I pointed out, if this value is finally used only once per HZ/X, checking
this *always* is overkill. Please remove the soft limit check from here.
Maybe code like this is good.
==
if (need_softlimit_check(mem)) {
	softlimit_res = res_counter_check_under_softlimit(&mem->res);
	if (softlimit_res) {
		struct mem_cgroup *mem = mem_cgroup_from_cont(softlimit_res);
		update_tree()....
	}
}
==
*And* what is important here is "need_softlimit_check(mem)".
As Andrew said, there may be something more reasonable than using the tick.
So, adding "mem_cgroup_need_softlimit_check(mem)" and improving what it checks
makes sense for development.
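For illustration, one possible shape of such a helper (the jiffies-based
condition below is only a placeholder; the real heuristic is exactly what
needs discussion):
==
static bool mem_cgroup_need_softlimit_check(struct mem_cgroup *mem)
{
	/* placeholder heuristic: at most one tree update per interval */
	return time_after(jiffies, mem->last_tree_update +
				   MEM_CGROUP_TREE_UPDATE_INTERVAL);
}
==
Both the charge and the uncharge paths could then call this before touching
any soft limit state on the res_counter.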
> @@ -1461,9 +1560,9 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> break;
> }
>
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> + res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
> if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
> mem_cgroup_charge_statistics(mem, pc, false);
>
here, too.
Could you add a "mem_cgroup_need_softlimit_check(mem)" function here?
It will make the code cleaner, I think.
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-14 17:31 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v6) Balbir Singh
@ 2009-03-16 0:52 ` KAMEZAWA Hiroyuki
2009-03-16 8:35 ` Balbir Singh
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-16 0:52 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Sat, 14 Mar 2009 23:01:11 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> include/linux/memcontrol.h | 8 ++
> include/linux/swap.h | 1
> mm/memcontrol.c | 205 ++++++++++++++++++++++++++++++++++++++++----
> mm/page_alloc.c | 9 ++
> mm/vmscan.c | 5 +
> 5 files changed, 205 insertions(+), 23 deletions(-)
>
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..b99d9c5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -116,7 +116,8 @@ static inline bool mem_cgroup_disabled(void)
> }
>
> extern bool mem_cgroup_oom_called(struct task_struct *task);
> -
> +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl,
> + gfp_t gfp_mask);
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> @@ -264,6 +265,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> {
> }
>
> +static inline
> +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
> +{
> + return 0;
> +}
> #endif /* CONFIG_CGROUP_MEM_CONT */
>
> #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 989eb53..c128337 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -215,6 +215,7 @@ static inline void lru_cache_add_active_file(struct page *page)
> extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask);
> extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> + struct zonelist *zl,
> gfp_t gfp_mask, bool noswap,
> unsigned int swappiness);
> extern int __isolate_lru_page(struct page *page, int mode, int file);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 200d44a..980bd18 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -191,6 +191,7 @@ struct mem_cgroup {
> unsigned long last_tree_update; /* Last time the tree was */
> /* updated in jiffies */
>
> + bool on_tree; /* Is the node on tree? */
> /*
> * statistics. This must be placed at the end of memcg.
> */
> @@ -227,18 +228,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> #define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
> #define MEMFILE_ATTR(val) ((val) & 0xffff)
>
> +/*
> + * Bits used for hierarchical reclaim bits
> + */
> +#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
> +#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> +#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
> +#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> +#define MEM_CGROUP_RECLAIM_SOFT_BIT 0x2
> +#define MEM_CGROUP_RECLAIM_SOFT (1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
> +
Could you split this clean-up part out into a separate patch?
> static void mem_cgroup_get(struct mem_cgroup *mem);
> static void mem_cgroup_put(struct mem_cgroup *mem);
> static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>
> -static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> +static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> {
> struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
> struct rb_node *parent = NULL;
> struct mem_cgroup *mem_node;
> - unsigned long flags;
>
> - spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> + if (mem->on_tree)
> + return;
> +
> mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> while (*p) {
> parent = *p;
> @@ -256,6 +268,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> rb_insert_color(&mem->mem_cgroup_node,
> &mem_cgroup_soft_limit_tree);
> mem->last_tree_update = jiffies;
> + mem->on_tree = true;
> +}
> +
> +static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> +{
> + if (!mem->on_tree)
> + return;
> + rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> + mem->on_tree = false;
> +}
> +
> +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> + __mem_cgroup_insert_exceeded(mem);
> spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> }
>
> @@ -263,8 +292,53 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> {
> unsigned long flags;
> spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> - rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> + __mem_cgroup_remove_exceeded(mem);
> + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> +}
> +
> +unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
> +{
> + unsigned long flags;
> + unsigned long long excess;
> +
> + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> + excess = mem->usage_in_excess >> PAGE_SHIFT;
> spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> + return (excess > ULONG_MAX) ? ULONG_MAX : excess;
> +}
> +
> +static struct mem_cgroup *__mem_cgroup_largest_soft_limit_node(void)
> +{
> + struct rb_node *rightmost = NULL;
> + struct mem_cgroup *mem = NULL;
> +
> +retry:
> + rightmost = rb_last(&mem_cgroup_soft_limit_tree);
> + if (!rightmost)
> + goto done; /* Nothing to reclaim from */
> +
> + mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
> + /*
> + * Remove the node now but someone else can add it back,
> + * we will to add it back at the end of reclaim to its correct
> + * position in the tree.
> + */
> + __mem_cgroup_remove_exceeded(mem);
> + if (!css_tryget(&mem->css) || !res_counter_soft_limit_excess(&mem->res))
> + goto retry;
> +done:
> + return mem;
> +}
> +
> +static struct mem_cgroup *mem_cgroup_largest_soft_limit_node(void)
> +{
> + struct mem_cgroup *mem;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> + mem = __mem_cgroup_largest_soft_limit_node();
> + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> + return mem;
> }
>
Can you think of a way to avoid this global lock? (As Kosaki said.)
IIUC, the CPU scheduler's RB tree and the hrtimer one you mentioned are per-cpu.
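Just to illustrate the direction I mean (untested sketch; all names below are made up, and it would of course also require tracking the excess per node): one tree and one lock per node would at least remove the single global point of contention.
==
/* sketch only: one soft limit tree (and lock) per node */
struct mem_cgroup_soft_limit_tree {
	struct rb_root	rb_root;
	spinlock_t	lock;	/* initialised at boot, not shown here */
};

static struct mem_cgroup_soft_limit_tree soft_limit_tree[MAX_NUMNODES];

static struct mem_cgroup_soft_limit_tree *soft_limit_tree_node(int nid)
{
	return &soft_limit_tree[nid];
}
==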
> static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> @@ -889,14 +963,42 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> * If shrink==true, for avoiding to free too much, this returns immedieately.
> */
> static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> - gfp_t gfp_mask, bool noswap, bool shrink)
> + struct zonelist *zl,
> + gfp_t gfp_mask,
> + unsigned long reclaim_options)
> {
> struct mem_cgroup *victim;
> int ret, total = 0;
> int loop = 0;
> + bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> + bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> + bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> + unsigned long excess = mem_cgroup_get_excess(root_mem);
>
> - while (loop < 2) {
> + while (1) {
> + if (loop >= 2) {
> + if (!check_soft)
> + break;
> + /*
> + * We want to do more targetted reclaim. excess >> 4
> + * >> 4 is not to excessive so as to reclaim too
> + * much, nor too less that we keep coming back
> + * to reclaim from this cgroup
> + */
> + if (total >= (excess >> 4))
> + break;
> + }
I wonder whether this means that, in a very bad case, the thread can never exit this loop...
right?
> victim = mem_cgroup_select_victim(root_mem);
> + /*
> + * In the first loop, don't reclaim from victims below
> + * their soft limit
> + */
> + if (!loop && res_counter_check_under_soft_limit(&victim->res)) {
> + if (victim == root_mem)
> + loop++;
> + css_put(&victim->css);
> + continue;
> + }
> if (victim == root_mem)
> loop++;
> if (!mem_cgroup_local_usage(&victim->stat)) {
> @@ -905,8 +1007,9 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> continue;
> }
> /* we use swappiness of local cgroup */
> - ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
> - get_swappiness(victim));
> + ret = try_to_free_mem_cgroup_pages(victim, zl, gfp_mask,
> + noswap,
> + get_swappiness(victim));
> css_put(&victim->css);
> /*
> * At shrinking usage, we can't check we should stop here or
> @@ -916,7 +1019,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> if (shrink)
> return ret;
> total += ret;
> - if (mem_cgroup_check_under_limit(root_mem))
> + if (check_soft) {
> + if (res_counter_check_under_soft_limit(&root_mem->res))
> + return total;
> + } else if (mem_cgroup_check_under_limit(root_mem))
> return 1 + total;
> }
> return total;
> @@ -1022,7 +1128,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>
> while (1) {
> int ret;
> - bool noswap = false;
> + unsigned long flags = 0;
>
> ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> &soft_fail_res);
> @@ -1035,7 +1141,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> break;
> /* mem+swap counter fails */
> res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
> - noswap = true;
> + flags = MEM_CGROUP_RECLAIM_NOSWAP;
> mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> memsw);
> } else
> @@ -1046,8 +1152,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> - ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> - noswap, false);
> + ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> + gfp_mask, flags);
> if (ret)
> continue;
>
> @@ -1757,8 +1863,8 @@ int mem_cgroup_shrink_usage(struct page *page,
> return 0;
>
> do {
> - progress = mem_cgroup_hierarchical_reclaim(mem,
> - gfp_mask, true, false);
> + progress = mem_cgroup_hierarchical_reclaim(mem, NULL,
> + gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
> progress += mem_cgroup_check_under_limit(mem);
> } while (!progress && --retry);
>
> @@ -1812,8 +1918,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> if (!ret)
> break;
>
> - progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> - false, true);
> + progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
> + GFP_KERNEL,
> + MEM_CGROUP_RECLAIM_SHRINK);
> curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -1861,7 +1968,9 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
> + mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> + MEM_CGROUP_RECLAIM_NOSWAP |
> + MEM_CGROUP_RECLAIM_SHRINK);
> curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -1872,6 +1981,62 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> return ret;
> }
>
> +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
> +{
> + unsigned long nr_reclaimed = 0;
> + struct mem_cgroup *mem, *next_mem = NULL;
> + unsigned long flags;
> + unsigned long reclaimed;
> +
> + /*
> + * This loop can run a while, specially if mem_cgroup's continuously
> + * keep exceeding their soft limit and putting the system under
> + * pressure
> + */
> + do {
> + if (next_mem)
> + mem = next_mem;
> + else
> + mem = mem_cgroup_largest_soft_limit_node();
> + if (!mem)
> + break;
> +
> + reclaimed = mem_cgroup_hierarchical_reclaim(mem, zl,
> + gfp_mask,
> + MEM_CGROUP_RECLAIM_SOFT);
> + nr_reclaimed += reclaimed;
> + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> +
> + /*
> + * If we failed to reclaim anything from this memory cgroup
> + * it is time to move on to the next cgroup
> + */
> + next_mem = NULL;
> + if (!reclaimed) {
> + do {
> + /*
> + * By the time we get the soft_limit lock
> + * again, someone might have aded the
> + * group back on the RB tree. Iterate to
> + * make sure we get a different mem.
> + * mem_cgroup_largest_soft_limit_node returns
> + * NULL if no other cgroup is present on
> + * the tree
> + */
Do we have to allow "someone will push back" case ?
> + next_mem =
> + __mem_cgroup_largest_soft_limit_node();
> + } while (next_mem == mem);
> + }
> + mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> + __mem_cgroup_remove_exceeded(mem);
> + if (mem->usage_in_excess)
> + __mem_cgroup_insert_exceeded(mem);
If next_mem == NULL here (which means "mem" is the only mem_cgroup that exceeds its soft limit),
mem will be found again even if !reclaimed.
Please check.
> + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> + css_put(&mem->css);
> + } while (!nr_reclaimed);
> + return nr_reclaimed;
> +}
> +
> /*
> * This routine traverse page_cgroup in given list and drop them all.
> * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> @@ -1995,7 +2160,7 @@ try_to_free:
> ret = -EINTR;
> goto out;
> }
> - progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> + progress = try_to_free_mem_cgroup_pages(mem, NULL, GFP_KERNEL,
> false, get_swappiness(mem));
> if (!progress) {
> nr_retries--;
> @@ -2600,6 +2765,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> mem->last_scanned_child = 0;
> mem->usage_in_excess = 0;
> mem->last_tree_update = 0; /* Yes, time begins at 0 here */
> + mem->on_tree = false;
> +
> spin_lock_init(&mem->reclaim_param_lock);
>
> if (parent)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f8fd1e2..5e1a6ca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1598,7 +1598,14 @@ nofail_alloc:
> reclaim_state.reclaimed_slab = 0;
> p->reclaim_state = &reclaim_state;
>
> - did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> + /*
> + * Try to free up some pages from the memory controllers soft
> + * limit queue.
> + */
> + did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> + if (order || !did_some_progress)
> + did_some_progress += try_to_free_pages(zonelist, order,
> + gfp_mask);
I'm not sure but do we have to call try_to_free()...twice ?
if (order)
did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
if (!order || did_some_progrees)
did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
IIRC, the reason Kosaki said "don't check order" is because this is also called in the kswapd() case.
BTW, mem_cgroup_soft_limit_reclaim() can do enough job even under
(gfp_mask & (__GFP_IO|__GFP_FS)) == 0 case ?
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-16 0:52 ` KAMEZAWA Hiroyuki
@ 2009-03-16 8:35 ` Balbir Singh
2009-03-16 8:49 ` KAMEZAWA Hiroyuki
2009-03-18 0:07 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 29+ messages in thread
From: Balbir Singh @ 2009-03-16 8:35 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16 09:52:58]:
> On Sat, 14 Mar 2009 23:01:11 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> >
> > include/linux/memcontrol.h | 8 ++
> > include/linux/swap.h | 1
> > mm/memcontrol.c | 205 ++++++++++++++++++++++++++++++++++++++++----
> > mm/page_alloc.c | 9 ++
> > mm/vmscan.c | 5 +
> > 5 files changed, 205 insertions(+), 23 deletions(-)
> >
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 18146c9..b99d9c5 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -116,7 +116,8 @@ static inline bool mem_cgroup_disabled(void)
> > }
> >
> > extern bool mem_cgroup_oom_called(struct task_struct *task);
> > -
> > +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl,
> > + gfp_t gfp_mask);
> > #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> > struct mem_cgroup;
> >
> > @@ -264,6 +265,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> > {
> > }
> >
> > +static inline
> > +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
> > +{
> > + return 0;
> > +}
> > #endif /* CONFIG_CGROUP_MEM_CONT */
> >
> > #endif /* _LINUX_MEMCONTROL_H */
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 989eb53..c128337 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -215,6 +215,7 @@ static inline void lru_cache_add_active_file(struct page *page)
> > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > gfp_t gfp_mask);
> > extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > + struct zonelist *zl,
> > gfp_t gfp_mask, bool noswap,
> > unsigned int swappiness);
> > extern int __isolate_lru_page(struct page *page, int mode, int file);
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 200d44a..980bd18 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -191,6 +191,7 @@ struct mem_cgroup {
> > unsigned long last_tree_update; /* Last time the tree was */
> > /* updated in jiffies */
> >
> > + bool on_tree; /* Is the node on tree? */
> > /*
> > * statistics. This must be placed at the end of memcg.
> > */
> > @@ -227,18 +228,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> > #define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
> > #define MEMFILE_ATTR(val) ((val) & 0xffff)
> >
> > +/*
> > + * Bits used for hierarchical reclaim bits
> > + */
> > +#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
> > +#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> > +#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
> > +#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> > +#define MEM_CGROUP_RECLAIM_SOFT_BIT 0x2
> > +#define MEM_CGROUP_RECLAIM_SOFT (1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
> > +
> Could you split this clean-up part out into a separate patch?
>
OK, sure, I'll do that.
>
> > static void mem_cgroup_get(struct mem_cgroup *mem);
> > static void mem_cgroup_put(struct mem_cgroup *mem);
> > static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> >
> > -static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > +static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > {
> > struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
> > struct rb_node *parent = NULL;
> > struct mem_cgroup *mem_node;
> > - unsigned long flags;
> >
> > - spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > + if (mem->on_tree)
> > + return;
> > +
> > mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > while (*p) {
> > parent = *p;
> > @@ -256,6 +268,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > rb_insert_color(&mem->mem_cgroup_node,
> > &mem_cgroup_soft_limit_tree);
> > mem->last_tree_update = jiffies;
> > + mem->on_tree = true;
> > +}
> > +
> > +static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> > +{
> > + if (!mem->on_tree)
> > + return;
> > + rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> > + mem->on_tree = false;
> > +}
> > +
> > +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > + __mem_cgroup_insert_exceeded(mem);
> > spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > }
> >
> > @@ -263,8 +292,53 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> > {
> > unsigned long flags;
> > spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > - rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> > + __mem_cgroup_remove_exceeded(mem);
> > + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +}
> > +
> > +unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
> > +{
> > + unsigned long flags;
> > + unsigned long long excess;
> > +
> > + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > + excess = mem->usage_in_excess >> PAGE_SHIFT;
> > spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > + return (excess > ULONG_MAX) ? ULONG_MAX : excess;
> > +}
> > +
> > +static struct mem_cgroup *__mem_cgroup_largest_soft_limit_node(void)
> > +{
> > + struct rb_node *rightmost = NULL;
> > + struct mem_cgroup *mem = NULL;
> > +
> > +retry:
> > + rightmost = rb_last(&mem_cgroup_soft_limit_tree);
> > + if (!rightmost)
> > + goto done; /* Nothing to reclaim from */
> > +
> > + mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
> > + /*
> > + * Remove the node now but someone else can add it back,
> > + * we will to add it back at the end of reclaim to its correct
> > + * position in the tree.
> > + */
> > + __mem_cgroup_remove_exceeded(mem);
> > + if (!css_tryget(&mem->css) || !res_counter_soft_limit_excess(&mem->res))
> > + goto retry;
> > +done:
> > + return mem;
> > +}
> > +
> > +static struct mem_cgroup *mem_cgroup_largest_soft_limit_node(void)
> > +{
> > + struct mem_cgroup *mem;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > + mem = __mem_cgroup_largest_soft_limit_node();
> > + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > + return mem;
> > }
> >
> Can you think of a way to avoid this global lock? (As Kosaki said.)
> IIUC, the CPU scheduler's RB tree and the hrtimer one you mentioned are per-cpu.
>
I thought about it, but since the data structure is global, we need a
global lock. I've not yet seen a lot of contention on the lock. I'll
think further about how to split up the lock, but I don't see a good way
right now.
>
> > static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> > @@ -889,14 +963,42 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> > * If shrink==true, for avoiding to free too much, this returns immedieately.
> > */
> > static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > - gfp_t gfp_mask, bool noswap, bool shrink)
> > + struct zonelist *zl,
> > + gfp_t gfp_mask,
> > + unsigned long reclaim_options)
> > {
> > struct mem_cgroup *victim;
> > int ret, total = 0;
> > int loop = 0;
> > + bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> > + bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> > + bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> > + unsigned long excess = mem_cgroup_get_excess(root_mem);
> >
> > - while (loop < 2) {
> > + while (1) {
> > + if (loop >= 2) {
> > + if (!check_soft)
> > + break;
> > + /*
> > + * We want to do more targetted reclaim. excess >> 4
> > + * >> 4 is not to excessive so as to reclaim too
> > + * much, nor too less that we keep coming back
> > + * to reclaim from this cgroup
> > + */
> > + if (total >= (excess >> 4))
> > + break;
> > + }
>
> > I wonder whether this means that, in a very bad case, the thread can never exit this loop...
> > right?
Potentially. When we do force empty, we actually reclaim all pages in a loop.
Do you want to see additional checks here?
> > victim = mem_cgroup_select_victim(root_mem);
> > + /*
> > + * In the first loop, don't reclaim from victims below
> > + * their soft limit
> > + */
> > + if (!loop && res_counter_check_under_soft_limit(&victim->res)) {
> > + if (victim == root_mem)
> > + loop++;
> > + css_put(&victim->css);
> > + continue;
> > + }
> > if (victim == root_mem)
> > loop++;
> > if (!mem_cgroup_local_usage(&victim->stat)) {
> > @@ -905,8 +1007,9 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > continue;
> > }
> > /* we use swappiness of local cgroup */
> > - ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
> > - get_swappiness(victim));
> > + ret = try_to_free_mem_cgroup_pages(victim, zl, gfp_mask,
> > + noswap,
> > + get_swappiness(victim));
> > css_put(&victim->css);
> > /*
> > * At shrinking usage, we can't check we should stop here or
> > @@ -916,7 +1019,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > if (shrink)
> > return ret;
> > total += ret;
> > - if (mem_cgroup_check_under_limit(root_mem))
> > + if (check_soft) {
> > + if (res_counter_check_under_soft_limit(&root_mem->res))
> > + return total;
> > + } else if (mem_cgroup_check_under_limit(root_mem))
> > return 1 + total;
> > }
> > return total;
> > @@ -1022,7 +1128,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >
> > while (1) {
> > int ret;
> > - bool noswap = false;
> > + unsigned long flags = 0;
> >
> > ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> > &soft_fail_res);
> > @@ -1035,7 +1141,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > break;
> > /* mem+swap counter fails */
> > res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
> > - noswap = true;
> > + flags = MEM_CGROUP_RECLAIM_NOSWAP;
> > mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> > memsw);
> > } else
> > @@ -1046,8 +1152,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > if (!(gfp_mask & __GFP_WAIT))
> > goto nomem;
> >
> > - ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> > - noswap, false);
> > + ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > + gfp_mask, flags);
> > if (ret)
> > continue;
> >
> > @@ -1757,8 +1863,8 @@ int mem_cgroup_shrink_usage(struct page *page,
> > return 0;
> >
> > do {
> > - progress = mem_cgroup_hierarchical_reclaim(mem,
> > - gfp_mask, true, false);
> > + progress = mem_cgroup_hierarchical_reclaim(mem, NULL,
> > + gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
> > progress += mem_cgroup_check_under_limit(mem);
> > } while (!progress && --retry);
> >
> > @@ -1812,8 +1918,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> > if (!ret)
> > break;
> >
> > - progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
> > - false, true);
> > + progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
> > + GFP_KERNEL,
> > + MEM_CGROUP_RECLAIM_SHRINK);
> > curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> > /* Usage is reduced ? */
> > if (curusage >= oldusage)
> > @@ -1861,7 +1968,9 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> > if (!ret)
> > break;
> >
> > - mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
> > + mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> > + MEM_CGROUP_RECLAIM_NOSWAP |
> > + MEM_CGROUP_RECLAIM_SHRINK);
> > curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> > /* Usage is reduced ? */
> > if (curusage >= oldusage)
> > @@ -1872,6 +1981,62 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> > return ret;
> > }
> >
> > +unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
> > +{
> > + unsigned long nr_reclaimed = 0;
> > + struct mem_cgroup *mem, *next_mem = NULL;
> > + unsigned long flags;
> > + unsigned long reclaimed;
> > +
> > + /*
> > + * This loop can run a while, specially if mem_cgroup's continuously
> > + * keep exceeding their soft limit and putting the system under
> > + * pressure
> > + */
> > + do {
> > + if (next_mem)
> > + mem = next_mem;
> > + else
> > + mem = mem_cgroup_largest_soft_limit_node();
> > + if (!mem)
> > + break;
> > +
> > + reclaimed = mem_cgroup_hierarchical_reclaim(mem, zl,
> > + gfp_mask,
> > + MEM_CGROUP_RECLAIM_SOFT);
> > + nr_reclaimed += reclaimed;
> > + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > +
> > + /*
> > + * If we failed to reclaim anything from this memory cgroup
> > + * it is time to move on to the next cgroup
> > + */
> > + next_mem = NULL;
> > + if (!reclaimed) {
> > + do {
> > + /*
> > + * By the time we get the soft_limit lock
> > + * again, someone might have aded the
> > + * group back on the RB tree. Iterate to
> > + * make sure we get a different mem.
> > + * mem_cgroup_largest_soft_limit_node returns
> > + * NULL if no other cgroup is present on
> > + * the tree
> > + */
> Do we have to allow "someone will push back" case ?
>
I'm not sure I understand your comment completely. When you say push back,
are you referring to someone else adding the node back to the RB-Tree?
If so, yes, that is quite possible and I've seen it happen.
> > + next_mem =
> > + __mem_cgroup_largest_soft_limit_node();
> > + } while (next_mem == mem);
> > + }
> > + mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > + __mem_cgroup_remove_exceeded(mem);
> > + if (mem->usage_in_excess)
> > + __mem_cgroup_insert_exceeded(mem);
>
> If next_mem == NULL here (which means "mem" is the only mem_cgroup that exceeds its soft limit),
> mem will be found again even if !reclaimed.
> Please check.
Yes, we need to add an "if (!next_mem) break;" there. Thanks!
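Concretely, once the tree update is done and the lock is dropped, something like this (a sketch of the fix only; the rest of the loop stays as posted):
==
		spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
		css_put(&mem->css);
		/*
		 * Nothing was reclaimed from this group and the tree has
		 * no other candidate: stop rather than picking the same
		 * group again on the next iteration.
		 */
		if (!reclaimed && !next_mem)
			break;
	} while (!nr_reclaimed);
==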
>
> > + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > + css_put(&mem->css);
> > + } while (!nr_reclaimed);
> > + return nr_reclaimed;
> > +}
> > +
> > /*
> > * This routine traverse page_cgroup in given list and drop them all.
> > * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> > @@ -1995,7 +2160,7 @@ try_to_free:
> > ret = -EINTR;
> > goto out;
> > }
> > - progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> > + progress = try_to_free_mem_cgroup_pages(mem, NULL, GFP_KERNEL,
> > false, get_swappiness(mem));
> > if (!progress) {
> > nr_retries--;
> > @@ -2600,6 +2765,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> > mem->last_scanned_child = 0;
> > mem->usage_in_excess = 0;
> > mem->last_tree_update = 0; /* Yes, time begins at 0 here */
> > + mem->on_tree = false;
> > +
> > spin_lock_init(&mem->reclaim_param_lock);
> >
> > if (parent)
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index f8fd1e2..5e1a6ca 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1598,7 +1598,14 @@ nofail_alloc:
> > reclaim_state.reclaimed_slab = 0;
> > p->reclaim_state = &reclaim_state;
> >
> > - did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > + /*
> > + * Try to free up some pages from the memory controllers soft
> > + * limit queue.
> > + */
> > + did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> > + if (order || !did_some_progress)
> > + did_some_progress += try_to_free_pages(zonelist, order,
> > + gfp_mask);
> I'm not sure but do we have to call try_to_free()...twice ?
We call it twice, once for the memory controller and once for normal
reclaim (try_to_free_mem_cgroup_pages() and try_to_free_pages()), is
that an issue?
>
> if (order)
> did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> if (!order || did_some_progrees)
> did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
>
I don't understand the code snippet above.
> IIRC, the reason Kosaki said "don't check order" is because this is also called in the kswapd() case.
>
> BTW, mem_cgroup_soft_limit_reclaim() can do enough job even under
> (gfp_mask & (__GFP_IO|__GFP_FS)) == 0 case ?
>
What about clean page cache? Anyway, we pass the gfp_mask, so the reclaimer
knows what pages to reclaim from, so it should return quickly if it
can't reclaim. Am I missing something?
--
Balbir
* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v6)
2009-03-16 0:21 ` KAMEZAWA Hiroyuki
@ 2009-03-16 8:47 ` Balbir Singh
2009-03-16 8:57 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2009-03-16 8:47 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16 09:21:26]:
> On Sat, 14 Mar 2009 23:01:02 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > Feature: Organize cgroups over soft limit in a RB-Tree
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >
> > Changelog v6...v5
> > 1. Update the key before inserting into RB tree. Without the current change
> > it could take an additional iteration to get the key correct.
> >
> > Changelog v5...v4
> > 1. res_counter_uncharge has an additional parameter to indicate if the
> > counter was over its soft limit, before uncharge.
> >
> > Changelog v4...v3
> > 1. Optimizations to ensure we don't uncessarily get res_counter values
> > 2. Fixed a bug in usage of time_after()
> >
> > Changelog v3...v2
> > 1. Add only the ancestor to the RB-Tree
> > 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
> >
> > Changelog v2...v1
> > 1. Add support for hierarchies
> > 2. The res_counter that is highest in the hierarchy is returned on soft
> > limit being exceeded. Since we do hierarchical reclaim and add all
> > groups exceeding their soft limits, this approach seems to work well
> > in practice.
> >
> > This patch introduces a RB-Tree for storing memory cgroups that are over their
> > soft limit. The overall goal is to
> >
> > 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> > We are careful about updates, updates take place only after a particular
> > time interval has passed
> > 2. We remove the node from the RB-Tree when the usage goes below the soft
> > limit
> >
> > The next set of patches will exploit the RB-Tree to get the group that is
> > over its soft limit by the largest amount and reclaim from it, when we
> > face memory contention.
> >
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > ---
> >
> > include/linux/res_counter.h | 6 +-
> > kernel/res_counter.c | 18 +++++
> > mm/memcontrol.c | 141 ++++++++++++++++++++++++++++++++++++++-----
> > 3 files changed, 143 insertions(+), 22 deletions(-)
> >
> >
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index 5c821fd..5bbf8b1 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> > int __must_check res_counter_charge_locked(struct res_counter *counter,
> > unsigned long val);
> > int __must_check res_counter_charge(struct res_counter *counter,
> > - unsigned long val, struct res_counter **limit_fail_at);
> > + unsigned long val, struct res_counter **limit_fail_at,
> > + struct res_counter **soft_limit_at);
> >
> > /*
> > * uncharge - tell that some portion of the resource is released
> > @@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
> > */
> >
> > void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
> > -void res_counter_uncharge(struct res_counter *counter, unsigned long val);
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > + bool *was_soft_limit_excess);
> >
> > static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> > {
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 4e6dafe..51ec438 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> > }
> >
> > int res_counter_charge(struct res_counter *counter, unsigned long val,
> > - struct res_counter **limit_fail_at)
> > + struct res_counter **limit_fail_at,
> > + struct res_counter **soft_limit_fail_at)
> > {
> > int ret;
> > unsigned long flags;
> > struct res_counter *c, *u;
> >
> > *limit_fail_at = NULL;
> > + if (soft_limit_fail_at)
> > + *soft_limit_fail_at = NULL;
> > local_irq_save(flags);
> > for (c = counter; c != NULL; c = c->parent) {
> > spin_lock(&c->lock);
> > ret = res_counter_charge_locked(c, val);
> > + /*
> > + * With soft limits, we return the highest ancestor
> > + * that exceeds its soft limit
> > + */
> > + if (soft_limit_fail_at &&
> > + !res_counter_soft_limit_check_locked(c))
> > + *soft_limit_fail_at = c;
> > spin_unlock(&c->lock);
> > if (ret < 0) {
> > *limit_fail_at = c;
> > @@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
> > counter->usage -= val;
> > }
> >
> > -void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > + bool *was_soft_limit_excess)
> > {
> > unsigned long flags;
> > struct res_counter *c;
> > @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > local_irq_save(flags);
> > for (c = counter; c != NULL; c = c->parent) {
> > spin_lock(&c->lock);
> > + if (c == counter && was_soft_limit_excess)
> > + *was_soft_limit_excess =
> > + !res_counter_soft_limit_check_locked(c);
> > res_counter_uncharge_locked(c, val);
> > spin_unlock(&c->lock);
> > }
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 70bc992..200d44a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -29,6 +29,7 @@
> > #include <linux/rcupdate.h>
> > #include <linux/limits.h>
> > #include <linux/mutex.h>
> > +#include <linux/rbtree.h>
> > #include <linux/slab.h>
> > #include <linux/swap.h>
> > #include <linux/spinlock.h>
> > @@ -129,6 +130,14 @@ struct mem_cgroup_lru_info {
> > };
> >
> > /*
> > + * Cgroups above their limits are maintained in a RB-Tree, independent of
> > + * their hierarchy representation
> > + */
> > +
> > +static struct rb_root mem_cgroup_soft_limit_tree;
> > +static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
> > +
> > +/*
> > * The memory controller data structure. The memory controller controls both
> > * page cache and RSS per cgroup. We would eventually like to provide
> > * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> > @@ -176,12 +185,20 @@ struct mem_cgroup {
> >
> > unsigned int swappiness;
> >
> > + struct rb_node mem_cgroup_node; /* RB tree node */
> > + unsigned long long usage_in_excess; /* Set to the value by which */
> > + /* the soft limit is exceeded*/
> > + unsigned long last_tree_update; /* Last time the tree was */
> > + /* updated in jiffies */
> > +
> > /*
> > * statistics. This must be placed at the end of memcg.
> > */
> > struct mem_cgroup_stat stat;
> > };
> >
> > +#define MEM_CGROUP_TREE_UPDATE_INTERVAL (HZ/4)
> > +
> > enum charge_type {
> > MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> > MEM_CGROUP_CHARGE_TYPE_MAPPED,
> > @@ -214,6 +231,42 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
> > static void mem_cgroup_put(struct mem_cgroup *mem);
> > static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> >
> > +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> > +{
> > + struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
> > + struct rb_node *parent = NULL;
> > + struct mem_cgroup *mem_node;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > + mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > + while (*p) {
> > + parent = *p;
> > + mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> > + if (mem->usage_in_excess < mem_node->usage_in_excess)
> > + p = &(*p)->rb_left;
> > + /*
> > + * We can't avoid mem cgroups that are over their soft
> > + * limit by the same amount
> > + */
> > + else if (mem->usage_in_excess >= mem_node->usage_in_excess)
> > + p = &(*p)->rb_right;
> > + }
> > + rb_link_node(&mem->mem_cgroup_node, parent, p);
> > + rb_insert_color(&mem->mem_cgroup_node,
> > + &mem_cgroup_soft_limit_tree);
> > + mem->last_tree_update = jiffies;
> > + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +}
> > +
> > +static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> > +{
> > + unsigned long flags;
> > + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > + rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
> > + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > +}
> > +
> > static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> > struct page_cgroup *pc,
> > bool charge)
> > @@ -897,6 +950,39 @@ static void record_last_oom(struct mem_cgroup *mem)
> > mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> > }
> >
> > +static void mem_cgroup_check_and_update_tree(struct mem_cgroup *mem,
> > + bool time_check)
> > +{
> > + unsigned long long prev_usage_in_excess, new_usage_in_excess;
> > + bool updated_tree = false;
> > + unsigned long next_update = 0;
> > + unsigned long flags;
> > +
> > + prev_usage_in_excess = mem->usage_in_excess;
> > +
> > + if (time_check)
> > + next_update = mem->last_tree_update +
> > + MEM_CGROUP_TREE_UPDATE_INTERVAL;
> > +
> > + if (!time_check || time_after(jiffies, next_update)) {
> > + new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > + if (prev_usage_in_excess) {
> > + mem_cgroup_remove_exceeded(mem);
> > + updated_tree = true;
> > + }
> > + if (!new_usage_in_excess)
> > + goto done;
> > + mem_cgroup_insert_exceeded(mem);
> > + }
> > +
> > +done:
> > + if (updated_tree) {
> > + spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
> > + mem->last_tree_update = jiffies;
> > + mem->usage_in_excess = new_usage_in_excess;
> > + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > + }
> > +}
> >
> > /*
> > * Unlike exported interface, "oom" parameter is added. if oom==true,
> > @@ -906,9 +992,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > gfp_t gfp_mask, struct mem_cgroup **memcg,
> > bool oom)
> > {
> > - struct mem_cgroup *mem, *mem_over_limit;
> > + struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
> > int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > - struct res_counter *fail_res;
> > + struct res_counter *fail_res, *soft_fail_res = NULL;
> >
> > if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> > /* Don't account this! */
> > @@ -938,16 +1024,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > int ret;
> > bool noswap = false;
> >
> > - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> > + ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> > + &soft_fail_res);
>
> As I pointed out, if this value is finally used only once per HZ/X, checking
> this *always* is overkill. Please remove the soft limit check from here.
>
> Maybe code like this is good.
> ==
> if (need_softlimit_check(mem)) {
> softlimit_res = res_counter_check_under_softlimit(&mem->res);
> if (softlimit_res) {
> struct mem_cgroup *mem = mem_cgroup_from_cont(softlimit_res);
> update_tree()....
> }
> }
> ==
Is an additional if the problem? We do all the checks under a lock we
already hold. I ran aim9, new_dbase, dbase, compute and shared tests
to make sure that there is no degradation. I've not seen anything
noticeable so far.
>
> *And* what is important here is "need_softlimit_check(mem)".
> As Andrew said, there may be something more reasonable than using the tick.
> So, adding "mem_cgroup_need_softlimit_check(mem)" and improving what it checks
> makes sense for further development.
>
OK, that is a good abstraction, but scanning as a metric does not guarantee
anything. It is harder to come up with better heuristics with scan
rate than to come up with something time based. I am open to
suggestions for something reliable though.
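For example, an event-count based trigger would be one option: bump a counter on every charge/uncharge and only update the tree once the counter crosses a threshold (the field and threshold below are made up for illustration; a real version would probably want this per-cpu):
==
/* sketch only: event-count based trigger instead of jiffies */
#define MEM_CGROUP_SOFTLIMIT_EVENTS_THRESH	1024

static bool mem_cgroup_need_softlimit_check(struct mem_cgroup *mem)
{
	/* softlimit_event_count is a hypothetical new field in struct mem_cgroup */
	if (++mem->softlimit_event_count < MEM_CGROUP_SOFTLIMIT_EVENTS_THRESH)
		return false;
	mem->softlimit_event_count = 0;
	return true;
}
==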
>
> > @@ -1461,9 +1560,9 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> > break;
> > }
> >
> > - res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
> > if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> > - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > + res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
> > mem_cgroup_charge_statistics(mem, pc, false);
> >
> here, too.
>
> Could you add a "mem_cgroup_need_softlimit_check(mem)" function here?
> It will make the code cleaner, I think.
>
OK, as an abstraction, sure.
--
Balbir
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-16 8:35 ` Balbir Singh
@ 2009-03-16 8:49 ` KAMEZAWA Hiroyuki
2009-03-16 9:03 ` KAMEZAWA Hiroyuki
2009-03-18 0:07 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-16 8:49 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 16 Mar 2009 14:05:12 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > > static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> > > @@ -889,14 +963,42 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> > > * If shrink==true, for avoiding to free too much, this returns immedieately.
> > > */
> > > static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > > - gfp_t gfp_mask, bool noswap, bool shrink)
> > > + struct zonelist *zl,
> > > + gfp_t gfp_mask,
> > > + unsigned long reclaim_options)
> > > {
> > > struct mem_cgroup *victim;
> > > int ret, total = 0;
> > > int loop = 0;
> > > + bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> > > + bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> > > + bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> > > + unsigned long excess = mem_cgroup_get_excess(root_mem);
> > >
> > > - while (loop < 2) {
> > > + while (1) {
> > > + if (loop >= 2) {
> > > + if (!check_soft)
> > > + break;
> > > + /*
> > > + * We want to do more targetted reclaim. excess >> 4
> > > + * >> 4 is not to excessive so as to reclaim too
> > > + * much, nor too less that we keep coming back
> > > + * to reclaim from this cgroup
> > > + */
> > > + if (total >= (excess >> 4))
> > > + break;
> > > + }
> >
> > I wonder whether this means that, in a very bad case, the thread can never exit this loop...
> > right?
>
> Potentially. When we do force empty, we actually reclaim all pages in a loop.
> Do you want to see additional checks here?
Please fix it. In user environments, once-in-1000-years trouble can happen,
in my experience....
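Even a simple cap on the number of loops would do, e.g. (only a sketch; the constant name is made up and the value would need tuning):
==
	if (loop >= 2) {
		if (!check_soft)
			break;
		/* never spin forever, even if the reclaim target is missed */
		if (loop > MEM_CGROUP_MAX_SOFT_RECLAIM_LOOPS)
			break;
		if (total >= (excess >> 4))
			break;
	}
==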
> > > + if (!reclaimed) {
> > > + do {
> > > + /*
> > > + * By the time we get the soft_limit lock
> > > + * again, someone might have aded the
> > > + * group back on the RB tree. Iterate to
> > > + * make sure we get a different mem.
> > > + * mem_cgroup_largest_soft_limit_node returns
> > > + * NULL if no other cgroup is present on
> > > + * the tree
> > > + */
> > Do we have to allow "someone will push back" case ?
> >
>
> I'm not sure I understand your comment completely. When you say push back,
> are you referring to someone else adding the node back to the RB-Tree?
yes.
> If so, yes, that is quite possible and I've seen it happen.
>
Hmm. So, this can result in several threads starting reclaim on the same memcg.
Can't we guarantee that this "selected" victim stays off the tree while
someone is reclaiming memory from it?
> > > + next_mem =
> > > + __mem_cgroup_largest_soft_limit_node();
> > > + } while (next_mem == mem);
> > > + }
> > > + mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > > + __mem_cgroup_remove_exceeded(mem);
> > > + if (mem->usage_in_excess)
> > > + __mem_cgroup_insert_exceeded(mem);
> >
> > If next_mem == NULL here (which means "mem" is the only mem_cgroup that exceeds its soft limit),
> > mem will be found again even if !reclaimed.
> > Please check.
>
> Yes, we need to add an "if (!next_mem) break;" there. Thanks!
>
> >
> > > + spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
> > > + css_put(&mem->css);
> > > + } while (!nr_reclaimed);
> > > + return nr_reclaimed;
> > > +}
> > > +
> > > /*
> > > * This routine traverse page_cgroup in given list and drop them all.
> > > * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> > > @@ -1995,7 +2160,7 @@ try_to_free:
> > > ret = -EINTR;
> > > goto out;
> > > }
> > > - progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> > > + progress = try_to_free_mem_cgroup_pages(mem, NULL, GFP_KERNEL,
> > > false, get_swappiness(mem));
> > > if (!progress) {
> > > nr_retries--;
> > > @@ -2600,6 +2765,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> > > mem->last_scanned_child = 0;
> > > mem->usage_in_excess = 0;
> > > mem->last_tree_update = 0; /* Yes, time begins at 0 here */
> > > + mem->on_tree = false;
> > > +
> > > spin_lock_init(&mem->reclaim_param_lock);
> > >
> > > if (parent)
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index f8fd1e2..5e1a6ca 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1598,7 +1598,14 @@ nofail_alloc:
> > > reclaim_state.reclaimed_slab = 0;
> > > p->reclaim_state = &reclaim_state;
> > >
> > > - did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > > + /*
> > > + * Try to free up some pages from the memory controllers soft
> > > + * limit queue.
> > > + */
> > > + did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> > > + if (order || !did_some_progress)
> > > + did_some_progress += try_to_free_pages(zonelist, order,
> > > + gfp_mask);
> > I'm not sure but do we have to call try_to_free()...twice ?
>
> We call it twice, once for the memory controller and once for normal
> reclaim (try_to_free_mem_cgroup_pages() and try_to_free_pages()), is
> that an issue?
>
Maybe "HugePage Allocation" benchmark is necessary to check "calling twice"
is bad or not. But, in general, calling twice is not very good, I think.
> >
> > if (order)
> > did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> > if (!order || did_some_progrees)
> > did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> >
>
> I don't understand the code snippet above.
>
Sorry. Ignore above.
> > IIRC, the reason Kosaki said "don't check order" is because this is also called in the kswapd() case.
> >
> > BTW, mem_cgroup_soft_limit_reclaim() can do enough job even under
> > (gfp_mask & (__GFP_IO|__GFP_FS)) == 0 case ?
> >
>
> What about clean page cache? Anyway, we pass the gfp_mask, so the reclaimer
> knows what pages to reclaim from, so it should return quickly if it
> can't reclaim. Am I missing something?
>
My point is, if sc->mem_cgroup is not NULL, we have to be careful that
important routines may not be called.
For example, shrink_slab() is not called, and it must be called.
For example, we may have to add an
sc->call_shrink_slab
flag and set it "true" at soft limit reclaim.
In other words, we need some changes in vmscan.c. We should take a careful look at
whether each routine should be called or not.
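Very roughly, something like this (a sketch only; call_shrink_slab is a new flag we would add to struct scan_control, it does not exist today):
==
	/* new member in struct scan_control (sketch) */
	int call_shrink_slab;	/* call shrink_slab() even for targeted reclaim */

	/* in do_try_to_free_pages(), roughly where global reclaim
	 * already calls shrink_slab() (sketch) */
	if (!sc->mem_cgroup || sc->call_shrink_slab)
		shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
==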
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 3/4] Memory controller soft limit organize cgroups (v6)
2009-03-16 8:47 ` Balbir Singh
@ 2009-03-16 8:57 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-16 8:57 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 16 Mar 2009 14:17:35 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16 09:21:26]:
> > Maybe code like this is good.
> > ==
> > if (need_softlimit_check(mem)) {
> > softlimit_res = res_counter_check_under_softlimit(&mem->res);
> > if (softlimit_res) {
> > struct mem_cgroup *mem = mem_cgroup_from_cont(softlimit_res);
> > update_tree()....
> > }
> > }
> > ==
>
> Is an additional if the problem?
My point is that "check a condition whose result is not always used" is ugly.
> We do all the checks under a lock we
> already hold. I ran aim9, new_dbase, dbase, compute and shared tests
> to make sure that there is no degradation. I've not seen anything
> noticeable so far.
>
Yeah, maybe. How about the unix-bench exec test?
(It's one of the worst applications for memcg ;)
> >
> > *And* what is important here is "need_softlimit_check(mem)".
> > As Andrew said, there may be something more reasonable than using the tick.
> > So, adding "mem_cgroup_need_softlimit_check(mem)" and improving what it checks
> > makes sense for further development.
> >
>
> OK, that is a good abstraction, but scanning as a metric does not guarantee
> anything. It is harder to come up with better heuristics with scan
> rate than to come up with something time based. I am open to
> suggestions for something reliable though.
>
I think making this easy to modify will help people other than us.
The memory-management algorithm is very difficult, but people tend to try
their own new logic to improve overall performance.
Refactoring to make "modification of the algorithm" easy makes sense for Linux and OSS.
We're not the only people who modify memcontrol.c.
So, I think some abstraction is good.
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-16 8:49 ` KAMEZAWA Hiroyuki
@ 2009-03-16 9:03 ` KAMEZAWA Hiroyuki
2009-03-16 9:10 ` Balbir Singh
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-16 9:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: balbir, linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro,
Rik van Riel, Andrew Morton
On Mon, 16 Mar 2009 17:49:43 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 16 Mar 2009 14:05:12 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> For example, shrink_slab() is not called, and it must be called.
>
> For example, we may have to add an
> sc->call_shrink_slab
> flag and set it "true" at soft limit reclaim.
>
At least, this check will be necessary in v7, I think.
shrink_slab() should be called.
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-16 9:03 ` KAMEZAWA Hiroyuki
@ 2009-03-16 9:10 ` Balbir Singh
2009-03-16 11:10 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2009-03-16 9:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16 18:03:08]:
> On Mon, 16 Mar 2009 17:49:43 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Mon, 16 Mar 2009 14:05:12 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > For example, shrink_slab() is not called, and it must be called.
> >
> > For example, we may have to add an
> > sc->call_shrink_slab
> > flag and set it "true" at soft limit reclaim.
> >
> At least, this check will be necessary in v7, I think.
> shrink_slab() should be called.
Why do you think so? So here is the design
1. If a cgroup was using over its soft limit, we believe that this
cgroup created overall memory contention and caused the page
reclaimer to get activated. If we can solve the situation by
reclaiming from this cgroup, why do we need to invoke shrink_slab?
If the concern is that we are not following the traditional reclaim,
soft limit reclaim can be followed by unconditional reclaim, but I
believe this is not necessary. Remember, we wake up kswapd that will
call shrink_slab if needed.
--
Balbir
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-16 9:10 ` Balbir Singh
@ 2009-03-16 11:10 ` KAMEZAWA Hiroyuki
2009-03-16 11:38 ` Balbir Singh
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-16 11:10 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, lizf,
KOSAKI Motohiro, Rik van Riel, Andrew Morton
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16
> 18:03:08]:
>
>> On Mon, 16 Mar 2009 17:49:43 +0900
>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> > On Mon, 16 Mar 2009 14:05:12 +0530
>> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>
>> > For example, shrink_slab() is not called, and it must be called.
>> >
>> > For example, we may have to add an
>> > sc->call_shrink_slab
>> > flag and set it "true" at soft limit reclaim.
>> >
>> At least, this check will be necessary in v7, I think.
>> shrink_slab() should be called.
>
> Why do you think so? So here is the design
>
> 1. If a cgroup was using over its soft limit, we believe that this
> cgroup created overall memory contention and caused the page
> reclaimer to get activated.
This assumption is wrong, see below.
> If we can solve the situation by
> reclaiming from this cgroup, why do we need to invoke shrink_slab?
>
No,
IIUC, in big server, inode, dentry cache etc....can occupy Gigabytes
of memory even if 99% of them are not used.
By shrink_slab(), we can reclaim unused but cached slabs and make
the kernel more healthy.
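For context, the coupling being asked for resembles what global reclaim already does, pairing page reclaim with a proportional shrink_slab() pass. The sketch below assumes the shrink_slab(scanned, gfp_mask, lru_pages) form of that era; the surrounding function and the soft limit reclaim call are invented for illustration.

    /*
     * Sketch only: after reclaiming pages on behalf of soft limit reclaim,
     * also ask the registered shrinkers (dentry, inode, ...) to give back
     * unused objects, scaled by how much LRU scanning was done.
     */
    static void soft_limit_shrink_slab_sketch(gfp_t gfp_mask)
    {
            unsigned long nr_scanned, lru_pages;

            /* hypothetical: pages scanned while doing soft limit reclaim */
            nr_scanned = soft_limit_reclaim_pages();

            lru_pages = global_page_state(NR_ACTIVE_ANON) +
                        global_page_state(NR_INACTIVE_ANON) +
                        global_page_state(NR_ACTIVE_FILE) +
                        global_page_state(NR_INACTIVE_FILE);

            shrink_slab(nr_scanned, gfp_mask, lru_pages);
    }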
> If the concern is that we are not following the traditional reclaim,
> soft limit reclaim can be followed by unconditional reclaim, but I
> believe this is not necessary. Remember, we wake up kswapd that will
> call shrink_slab if needed.
kswapd doesn't call shrink_slab() when zone->free is enough.
(when direct reclaim did a good job.)
Anyway, we'll have to add softlimit hook to kswapd.
I think you read Kosaki's e-mail to you.
==
in global reclaim view, foreground reclaim and background reclaim's
reclaim rate is about 1:9 typically.
==
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-16 11:10 ` KAMEZAWA Hiroyuki
@ 2009-03-16 11:38 ` Balbir Singh
2009-03-16 11:58 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2009-03-16 11:38 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16 20:10:41]:
> Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16
> > 18:03:08]:
> >
> >> On Mon, 16 Mar 2009 17:49:43 +0900
> >> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >>
> >> > On Mon, 16 Mar 2009 14:05:12 +0530
> >> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >>
> >> > For example, shrink_slab() is not called, and it must be called.
> >> >
> >> > For example, we may have to add
> >> > sc->call_shrink_slab
> >> > flag and set it "true" at soft limit reclaim.
> >> >
> >> At least, this check will be necessary in v7, I think.
> >> shrink_slab() should be called.
> >
> > Why do you think so? So here is the design
> >
> > 1. If a cgroup was using over its soft limit, we believe that this
> > cgroup created overall memory contention and caused the page
> > reclaimer to get activated.
> This assumption is wrong, see below.
>
> > If we can solve the situation by
> > reclaiming from this cgroup, why do we need to invoke shrink_slab?
> >
> No,
> IIUC, in big server, inode, dentry cache etc....can occupy Gigabytes
> of memory even if 99% of them are not used.
>
> By shrink_slab(), we can reclaim unused but cached slabs and make
> the kernel more healthy.
>
But that is not the job of the soft limit reclaimer.. Yes if no groups
are over their soft limit, the regular action will take place.
>
> > If the concern is that we are not following the traditional reclaim,
> > soft limit reclaim can be followed by unconditional reclaim, but I
> > believe this is not necessary. Remember, we wake up kswapd that will
> > call shrink_slab if needed.
> kswapd doesn't call shrink_slab() when zone->free is enough.
> (when direct reclaim did a good job.)
>
If zone->free is high why do we need shrink_slab()? The other way
of asking it is, why does the soft limit reclaimer need to call
shrink_slab(), when its job is to reclaim memory from cgroups above
their soft limits.
> Anyway, we'll have to add softlimit hook to kswapd.
> I think you read Kosaki's e-mail to you.
> ==
> in global reclaim view, foreground reclaim and background reclaim's
> reclaim rate is about 1:9 typically.
> ==
I think not. Please don't interpret soft limits as watermarks; I
think that is where the basic disagreement lies. Keeping zones under
their watermarks is different from soft limits, where a cgroup gets
pushed back when it is causing the memory allocator to go into reclaim.
--
Balbir
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-16 11:38 ` Balbir Singh
@ 2009-03-16 11:58 ` KAMEZAWA Hiroyuki
2009-03-16 12:19 ` Balbir Singh
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-16 11:58 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, lizf,
KOSAKI Motohiro, Rik van Riel, Andrew Morton
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16
> 20:10:41]:
>> >> At least, this check will be necessary in v7, I think.
>> >> shrink_slab() should be called.
>> >
>> > Why do you think so? So here is the design
>> >
>> > 1. If a cgroup was using over its soft limit, we believe that this
>> > cgroup created overall memory contention and caused the page
>> > reclaimer to get activated.
>> This assumption is wrong, see below.
>>
>> > If we can solve the situation by
>> > reclaiming from this cgroup, why do we need to invoke shrink_slab?
>> >
>> No,
>> IIUC, in big server, inode, dentry cache etc....can occupy Gigabytes
>> of memory even if 99% of them are not used.
>>
>> By shrink_slab(), we can reclaim unused but cached slabs and make
>> the kernel more healthy.
>>
>
> But that is not the job of the soft limit reclaimer.. Yes if no groups
> are over their soft limit, the regular action will take place.
>
Oh, yes, it's not the job of memcg, but it is the job of memory management.
>>
>> > If the concern is that we are not following the traditional reclaim,
>> > soft limit reclaim can be followed by unconditional reclaim, but I
>> > believe this is not necessary. Remember, we wake up kswapd that will
>> > call shrink_slab if needed.
>> kswapd doesn't call shrink_slab() when zone->free is enough.
>> (when direct reclaim did a good job.)
>>
>
> If zone->free is high why do we need shrink_slab()? The other way
> of asking it is, why does the soft limit reclaimer need to call
> shrink_slab(), when its job is to reclaim memory from cgroups above
> their soft limits.
>
Why do you consider that softlimit is called more than necessary
if shrink_slab() is never called ?
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-16 11:58 ` KAMEZAWA Hiroyuki
@ 2009-03-16 12:19 ` Balbir Singh
2009-03-17 3:47 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2009-03-16 12:19 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16 20:58:30]:
> Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16
> > 20:10:41]:
> >> >> At least, this check will be necessary in v7, I think.
> >> >> shrink_slab() should be called.
> >> >
> >> > Why do you think so? So here is the design
> >> >
> >> > 1. If a cgroup was using over its soft limit, we believe that this
> >> > cgroup created overall memory contention and caused the page
> >> > reclaimer to get activated.
> >> This assumption is wrong, see below.
> >>
> >> > If we can solve the situation by
> >> > reclaiming from this cgroup, why do we need to invoke shrink_slab?
> >> >
> >> No,
> >> IIUC, in big server, inode, dentry cache etc....can occupy Gigabytes
> >> of memory even if 99% of them are not used.
> >>
> >> By shrink_slab(), we can reclaim unused but cached slabs and make
> >> the kernel more healthy.
> >>
> >
> > But that is not the job of the soft limit reclaimer.. Yes if no groups
> > are over their soft limit, the regular action will take place.
> >
> Oh, yes, it's not the job of memcg, but it is the job of memory management.
>
>
> >>
> >> > If the concern is that we are not following the traditional reclaim,
> >> > soft limit reclaim can be followed by unconditional reclaim, but I
> >> > believe this is not necessary. Remember, we wake up kswapd that will
> >> > call shrink_slab if needed.
> >> kswapd doesn't call shrink_slab() when zone->free is enough.
> >> (when direct reclaim did a good job.)
> >>
> >
> > If zone->free is high why do we need shrink_slab()? The other way
> > of asking it is, why does the soft limit reclaimer need to call
> > shrink_slab(), when its job is to reclaim memory from cgroups above
> > their soft limits.
> >
> Why do you consider that softlimit is called more than necessary
> if shrink_slab() is never called ?
A runaway application can do that. Like I mentioned with the tests I
did for your patches. Soft limits were at 1G/2G and the applications
(two) tried to touch all the memory in the system. The point is that
shrink_slab() will be called if the normal system experiences
watermark issues, soft limits will tackle/control cgroups running out
of their soft limits and causing memory contention to take place.
--
Balbir
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-16 12:19 ` Balbir Singh
@ 2009-03-17 3:47 ` KAMEZAWA Hiroyuki
2009-03-17 4:40 ` Balbir Singh
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17 3:47 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 16 Mar 2009 17:49:15 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16 20:58:30]:
> A runaway application can do that. Like I mentioned with the tests I
> did for your patches. Soft limits were at 1G/2G and the applications
> (two) tried to touch all the memory in the system. The point is that
> shrink_slab() will be called if the normal system experiences
> watermark issues, soft limits will tackle/control cgroups running out
> of their soft limits and causing memory contention to take place.
>
Ok, then, how about this ?
Because our target is "softlimit" and not "memory shortage"
- don't call soft limit from alloc_pages().
- don't call soft limit from kswapd and others.
Instead of this, add sysctl like this.
- vm.softlimit_ratio
If vm.softlimit_ratio = 99%,
when sum of all usage of memcg is over 99% of system memory,
softlimit runs and reclaim memory until the whole usage will be below 99%.
(or some other trigger can be considered.)
Then,
- We don't have to take care of misc. complicated aspects of memory reclaiming
We reclaim memory based on our own logic, then, no influence to global LRU.
I think this approach will hide the all corner case and make merging softlimit
to mainline much easier. If you use this approach, RB-tree is the best one
to go with (and we don't have to care zone's status.)
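As a very rough illustration of the trigger being proposed, something like the sketch below. None of this is existing code except totalram_pages (the kernel's total-RAM page count); the sysctl variable, the usage helper and the reclaim call are all assumptions.

    unsigned int sysctl_softlimit_ratio = 99;   /* would back vm.softlimit_ratio */

    static void softlimit_ratio_check_sketch(void)
    {
            /* threshold = softlimit_ratio percent of total system memory */
            unsigned long threshold = totalram_pages / 100 * sysctl_softlimit_ratio;

            /* total_memcg_usage_pages() is a hypothetical helper */
            while (total_memcg_usage_pages() > threshold) {
                    /* reclaim from the largest soft limit offender (RB-tree walk) */
                    if (!soft_limit_reclaim_from_largest_offender())
                            break;  /* nothing more could be reclaimed */
            }
    }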
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-17 3:47 ` KAMEZAWA Hiroyuki
@ 2009-03-17 4:40 ` Balbir Singh
2009-03-17 4:47 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2009-03-17 4:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 12:47:40]:
> On Mon, 16 Mar 2009 17:49:15 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-16 20:58:30]:
>
> > A runaway application can do that. Like I mentioned with the tests I
> > did for your patches. Soft limits were at 1G/2G and the applications
> > (two) tried to touch all the memory in the system. The point is that
> > shrink_slab() will be called if the normal system experiences
> > watermark issues, soft limits will tackle/control cgroups running out
> > of their soft limits and causing memory contention to take place.
> >
> Ok, then, how about this ?
>
> Because our target is "softlimit" and not "memory shortage"
> - don't call soft limit from alloc_pages().
> - don't call soft limit from kswapd and others.
>
> Instead of this, add sysctl like this.
>
> - vm.softlimit_ratio
>
> If vm.softlimit_ratio = 99%,
> when sum of all usage of memcg is over 99% of system memory,
> softlimit runs and reclaim memory until the whole usage will be below 99%.
> (or some other trigger can be considered.)
>
> Then,
> - We don't have to take care of misc. complicated aspects of memory reclaiming
> We reclaim memory based on our own logic, then, no influence to global LRU.
>
> I think this approach will hide the all corner case and make merging softlimit
> to mainline much easier. If you use this approach, RB-tree is the best one
> to go with (and we don't have to care zone's status.)
I like the idea in general, but I have concerns about
1. Tracking all cgroup memory, it can quickly get expensive (tracking
to check for vm.soft_limit_ratio and for usage)
2. Finding a good default for the sysctl (might not be so hard)
Even today our influence on global LRU is very limited, only when we
come under reclaim, we do an additional step of seeing if we can get
memory from soft limit groups first.
(1) is a real concern.
--
Balbir
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-17 4:40 ` Balbir Singh
@ 2009-03-17 4:47 ` KAMEZAWA Hiroyuki
2009-03-17 4:58 ` Balbir Singh
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17 4:47 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Tue, 17 Mar 2009 10:10:16 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > - vm.softlimit_ratio
> >
> > If vm.softlimit_ratio = 99%,
> > when sum of all usage of memcg is over 99% of system memory,
> > softlimit runs and reclaim memory until the whole usage will be below 99%.
> > (or some other trigger can be considered.)
> >
> > Then,
> > - We don't have to take care of misc. complicated aspects of memory reclaiming
> > We reclaim memory based on our own logic, then, no influence to global LRU.
> >
> > I think this approach will hide the all corner case and make merging softlimit
> > to mainline much easier. If you use this approach, RB-tree is the best one
> > to go with (and we don't have to care zone's status.)
>
> I like the idea in general, but I have concerns about
>
> 1. Tracking all cgroup memory, it can quickly get expensive (tracking
> to check for vm.soft_limit_ratio and for usage)
Not so expensive, because we already track them all via the default cgroup.
Then, what we need is a "fast" counter.
Maybe the percpu counter (lib/percpu_counter.c) gives us enough code for counting.
Checking the value ratio is ... how about "once per 1000 increments per cpu" or so?
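A sketch of what such a counter could look like, assuming the charge/uncharge paths simply bump one global percpu_counter from lib/percpu_counter.c. The hook points and names are invented; percpu_counter_init(&total_memcg_pages, 0) would still be needed at init time.

    #include <linux/percpu_counter.h>

    /* one global, approximate count of pages charged to any memcg */
    static struct percpu_counter total_memcg_pages;

    /* called from the (hypothetical) charge/uncharge paths */
    static inline void memcg_usage_add_sketch(long nr_pages)
    {
            percpu_counter_add(&total_memcg_pages, nr_pages);
    }

    static inline unsigned long memcg_usage_read_sketch(void)
    {
            /* cheap approximate read; no global lock is taken */
            return percpu_counter_read_positive(&total_memcg_pages);
    }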
> 2. Finding a good default for the sysctl (might not be so hard)
>
I think some parameter like a high-low watermark is good, and we can find
a good value such as
- low watermark .... max_memory - (sum of all zone->high) * 16
- high watermark .... max_memory - (sum of all zone->high) * 8
(just an example, but not so bad.)
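Expressed as code, the suggested defaults would amount to something like the sketch below; zone_pages_high_sum() is a hypothetical helper summing the per-zone high watermark over all zones, and totalram_pages stands in for "max_memory".

    static void softlimit_default_thresholds_sketch(unsigned long *low,
                                                    unsigned long *high)
    {
            unsigned long zone_high = zone_pages_high_sum();    /* hypothetical */

            *low  = totalram_pages - zone_high * 16;    /* reclaim back down to this */
            *high = totalram_pages - zone_high * 8;     /* start reclaiming above this */
    }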
> Even today our influence on global LRU is very limited, only when we
> come under reclaim, we do an additional step of seeing if we can get
> memory from soft limit groups first.
>
> (1) is a real concern.
Maybe yes. But all memcg will call the "charge"/"uncharge" code, so the problem is
just the "counter". I think a percpu counter works well enough.
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-17 4:47 ` KAMEZAWA Hiroyuki
@ 2009-03-17 4:58 ` Balbir Singh
2009-03-17 5:17 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2009-03-17 4:58 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 13:47:27]:
> On Tue, 17 Mar 2009 10:10:16 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > > - vm.softlimit_ratio
> > >
> > > If vm.softlimit_ratio = 99%,
> > > when sum of all usage of memcg is over 99% of system memory,
> > > softlimit runs and reclaim memory until the whole usage will be below 99%.
> > > (or some other trigger can be considered.)
> > >
> > > Then,
> > > - We don't have to take care of misc. complicated aspects of memory reclaiming
> > > We reclaim memory based on our own logic, then, no influence to global LRU.
> > >
> > > I think this approach will hide the all corner case and make merging softlimit
> > > to mainline much easier. If you use this approach, RB-tree is the best one
> > > to go with (and we don't have to care zone's status.)
> >
> > I like the idea in general, but I have concerns about
> >
> > 1. Tracking all cgroup memory, it can quickly get expensive (tracking
> > to check for vm.soft_limit_ratio and for usage)
>
> Not so expensive, because we already track them all via the default cgroup.
> Then, what we need is a "fast" counter.
> Maybe the percpu counter (lib/percpu_counter.c) gives us enough code for counting.
>
> Checking the value ratio is ... how about "once per 1000 increments per cpu" or so?
That is not true..we don't track them to default cgroup unless
memory.use_hierarchy is enabled in the root cgroup. To do what you
suggest, we have to iterate through all mem cgroups, which is not
desirable at all.
>
> > 2. Finding a good default for the sysctl (might not be so hard)
> >
> I think some parameter like a high-low watermark is good, and we can find
> a good value such as
> - low watermark .... max_memory - (sum of all zone->high) * 16
> - high watermark .... max_memory - (sum of all zone->high) * 8
> (just an example, but not so bad.)
>
OK..
[offtopic] I liked the per-mem cgroup watermark patches as well. I
think we should look at them later on, after soft limits and some other items.
> > Even today our influence on global LRU is very limited, only when we
> > come under reclaim, we do an additional step of seeing if we can get
> > memory from soft limit groups first.
> >
> > (1) is a real concern.
>
> Maybe yes. But all memcg will call the "charge"/"uncharge" code, so the problem is
> just the "counter". I think a percpu counter works well enough.
>
This scheme adds more overhead due to (1), we'll need a global counter
and need to protect it, which will serialize all res_counters.
--
Balbir
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-17 4:58 ` Balbir Singh
@ 2009-03-17 5:17 ` KAMEZAWA Hiroyuki
2009-03-17 5:55 ` Balbir Singh
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17 5:17 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Tue, 17 Mar 2009 10:28:50 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 13:47:27]:
>
> > On Tue, 17 Mar 2009 10:10:16 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > > > - vm.softlimit_ratio
> > > >
> > > > If vm.softlimit_ratio = 99%,
> > > > when sum of all usage of memcg is over 99% of system memory,
> > > > softlimit runs and reclaim memory until the whole usage will be below 99%.
> > > > (or some other trigger can be considered.)
> > > >
> > > > Then,
> > > > - We don't have to take care of misc. complicated aspects of memory reclaiming
> > > > We reclaim memory based on our own logic, then, no influence to global LRU.
> > > >
> > > > I think this approach will hide the all corner case and make merging softlimit
> > > > to mainline much easier. If you use this approach, RB-tree is the best one
> > > > to go with (and we don't have to care zone's status.)
> > >
> > > I like the idea in general, but I have concerns about
> > >
> > > 1. Tracking all cgroup memory, it can quickly get expensive (tracking
> > > to check for vm.soft_limit_ratio and for usage)
> >
> > Not so expensive, because we already track them all via the default cgroup.
> > Then, what we need is a "fast" counter.
> > Maybe the percpu counter (lib/percpu_counter.c) gives us enough code for counting.
> >
> > Checking the value ratio is ... how about "once per 1000 increments per cpu" or so?
>
> That is not true..we don't track them to default cgroup unless
> memory.use_hierarchy is enabled in the root cgroup.
What I want to say is "the task which is not attached to user's cgroup is
also under the default cgroup, so we don't need an additional hook"
Not talking about hierarchy.
> To do what you
> suggest, we have to iterate through all mem cgroups, which is not
> desirable at all.
I don't say percpu counter should be within struct mem_cgroup.
DEFINE_PER_CPU(unsigned long, overall_usage);
is enough, for example. (just an example.)
Or we already have vmstat (see vmstat.h:: global_page_state())
== si_meminfo()==
val->freeram = global_page_state(NR_FREE_PAGES);
==
It seems we have what we need already (but usage from this includes usage that
comes from slab, page tables, etc.).
> >
> > > 2. Finding a good default for the sysctl (might not be so hard)
> > >
> > I think some parameter like a high-low watermark is good, and we can find
> > a good value such as
> > - low watermark .... max_memory - (sum of all zone->high) * 16
> > - high watermark .... max_memory - (sum of all zone->high) * 8
> > (just an example, but not so bad.)
> >
>
> OK..
>
> [offtopic] I liked the per-mem cgroup watermark patches as well. I
> think we should look at them later on, after soft limits and some other items.
>
I'm glad if we can reuse the logic added by softlimit for per-memcg watermarks.
> > > Even today our influence on global LRU is very limited, only when we
> > > come under reclaim, we do an additional step of seeing if we can get
> > > memory from soft limit groups first.
> > >
> > > (1) is a real concern.
> >
> > Maybe yes. But all memcg will call the "charge"/"uncharge" code, so the problem is
> > just the "counter". I think a percpu counter works well enough.
> >
>
> This scheme adds more overhead due to (1), we'll need a global counter
> and need to protect it, which will serialize all res_counters.
>
It's not necessary. For example, reading vmstat doesn't need a global lock.
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-17 5:17 ` KAMEZAWA Hiroyuki
@ 2009-03-17 5:55 ` Balbir Singh
2009-03-17 6:00 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2009-03-17 5:55 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 14:17:14]:
> On Tue, 17 Mar 2009 10:28:50 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 13:47:27]:
> >
> > > On Tue, 17 Mar 2009 10:10:16 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > >
> > > > > - vm.softlimit_ratio
> > > > >
> > > > > If vm.softlimit_ratio = 99%,
> > > > > when sum of all usage of memcg is over 99% of system memory,
> > > > > softlimit runs and reclaim memory until the whole usage will be below 99%.
> > > > > (or some other trigger can be considered.)
> > > > >
> > > > > Then,
> > > > > - We don't have to take care of misc. complicated aspects of memory reclaiming
> > > > > We reclaim memory based on our own logic, then, no influence to global LRU.
> > > > >
> > > > > I think this approach will hide the all corner case and make merging softlimit
> > > > > to mainline much easier. If you use this approach, RB-tree is the best one
> > > > > to go with (and we don't have to care zone's status.)
> > > >
> > > > I like the idea in general, but I have concerns about
> > > >
> > > > 1. Tracking all cgroup memory, it can quickly get expensive (tracking
> > > > to check for vm.soft_limit_ratio and for usage)
> > >
> > > Not so expensive, because we already track them all via the default cgroup.
> > > Then, what we need is a "fast" counter.
> > > Maybe the percpu counter (lib/percpu_counter.c) gives us enough code for counting.
> > >
> > > Checking the value ratio is ... how about "once per 1000 increments per cpu" or so?
> >
> > That is not true..we don't track them to default cgroup unless
> > memory.use_hierarchy is enabled in the root cgroup.
> What I want to say is "the task which is not attached to user's cgroup is
> also under the default cgroup, so we don't need an additional hook"
> Not talking about hierarchy.
>
Since all the user pages are tracked in one or the other cgroup, the
total accounting is equal to total_lru_pages across all zones/nodes.
Your suggestion boils down to if total_lru_pages reaches a threshold,
do soft limit reclaim, instead of doing reclaim when there is
contention.. right?
> > To do what you
> > suggest, we have to iterate through all mem cgroups, which is not
> > desirable at all.
>
> I don't say percpu counter should be within struct mem_cgroup.
> DEFINE_PER_CPU(unsigned long, overall_usage);
> is enough, for example. (just an example.)
>
> Or we already have vmstat (see vmstat.h:: global_page_state())
>
> == si_meminfo()==
> val->freeram = global_page_state(NR_FREE_PAGES);
> ==
>
> It seems we have what we need already (but usage from this includes usage that
> comes from slab, page tables, etc.).
>
>
> > >
> > > > 2. Finding a good default for the sysctl (might not be so hard)
> > > >
> > > I think some parameter like a high-low watermark is good, and we can find
> > > a good value such as
> > > - low watermark .... max_memory - (sum of all zone->high) * 16
> > > - high watermark .... max_memory - (sum of all zone->high) * 8
> > > (just an example, but not so bad.)
> > >
> >
> > OK..
> >
> > [offtopic] I liked the per-mem cgroup watermark patches as well. I
> > think we should look at them later on, after soft limits and some other items.
> >
> I'm glad if we can reuse logics added by softlimit by per memcg watermarks.
>
> > > > Even today our influence on global LRU is very limited, only when we
> > > > come under reclaim, we do an additional step of seeing if we can get
> > > > memory from soft limit groups first.
> > > >
> > > > (1) is a real concern.
> > >
> > > Maybe yes. But all memcg will call the "charge"/"uncharge" code, so the problem is
> > > just the "counter". I think a percpu counter works well enough.
> > >
> >
> > This scheme adds more overhead due to (1), we'll need a global counter
> > and need to protect it, which will serialize all res_counters.
> >
> It's not necessary. For example, reading vmstat doesn't need a global lock.
>
It uses atomic values for accounting.
--
Balbir
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-17 5:55 ` Balbir Singh
@ 2009-03-17 6:00 ` KAMEZAWA Hiroyuki
2009-03-17 6:22 ` Balbir Singh
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17 6:00 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Tue, 17 Mar 2009 11:25:06 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 14:17:14]:
> > > That is not true..we don't track them to default cgroup unless
> > > memory.use_hierarchy is enabled in the root cgroup.
> > What I want to say is "the task which is not attached to user's cgroup is
> > also under the default cgroup, so we don't need an additional hook"
> > Not talking about hierarchy.
> >
>
> Since all the user pages are tracked in one or the other cgroup, the
> total accounting is equal to total_lru_pages across all zones/nodes.
> Your suggestion boils down to if total_lru_pages reaches a threshold,
> do soft limit reclaim, instead of doing reclaim when there is
> contention.. right?
>
Yes.
> > It's not necessary. For example, reading vmstat doesn't need a global lock.
> >
>
> It uses atomic values for accounting.
>
Ah, my point is that "when it comes to usage of the global LRU,
accounting pages is already done somewhere; we can reuse it."
I don't mean "add some new counter".
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-17 6:00 ` KAMEZAWA Hiroyuki
@ 2009-03-17 6:22 ` Balbir Singh
2009-03-17 6:30 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2009-03-17 6:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 15:00:58]:
> On Tue, 17 Mar 2009 11:25:06 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 14:17:14]:
> > > > That is not true..we don't track them to default cgroup unless
> > > > memory.use_hierarchy is enabled in the root cgroup.
> > > What I want to say is "the task which is not attached to user's cgroup is
> > > also under the default cgroup, so we don't need an additional hook"
> > > Not talking about hierarchy.
> > >
> >
> > Since all the user pages are tracked in one or the other cgroup, the
> > total accounting is equal to total_lru_pages across all zones/nodes.
> > Your suggestion boils down to if total_lru_pages reaches a threshold,
> > do soft limit reclaim, instead of doing reclaim when there is
> > contention.. right?
> >
> Yes.
>
May I suggest that we first do the reclaim on contention and then
later on enhance it to add sysctl vm.soft_limit_ratio. I am not
proposing the soft limit patches for 2.6.30, but I would like to get
them in -mm for wider testing. If in that process the sysctl seems
more useful and applicable, we can consider adding it. Adding it right
now makes the reclaim logic more complex, having to check if we hit
the vm ratio quite often. Do you agree?
>
> > > It's not necessary. For example, reading vmstat doesn't need a global lock.
> > >
> >
> > It uses atomic values for accounting.
> >
> Ah, my point is that "when it comes to usage of the global LRU,
> accounting pages is already done somewhere; we can reuse it."
> I don't mean "add some new counter".
>
OK.
--
Balbir
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-17 6:22 ` Balbir Singh
@ 2009-03-17 6:30 ` KAMEZAWA Hiroyuki
2009-03-17 6:59 ` Balbir Singh
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17 6:30 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Tue, 17 Mar 2009 11:52:05 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 15:00:58]:
>
> > On Tue, 17 Mar 2009 11:25:06 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 14:17:14]:
> > > > > That is not true..we don't track them to default cgroup unless
> > > > > memory.use_hierarchy is enabled in the root cgroup.
> > > > What I want to say is "the task which is not attached to user's cgroup is
> > > > also under the default cgroup, so we don't need an additional hook"
> > > > Not talking about hierarchy.
> > > >
> > >
> > > Since all the user pages are tracked in one or the other cgroup, the
> > > total accounting is equal to total_lru_pages across all zones/nodes.
> > > Your suggestion boils down to if total_lru_pages reaches a threshold,
> > > do soft limit reclaim, instead of doing reclaim when there is
> > > contention.. right?
> > >
> > Yes.
> >
>
> May I suggest that we first do the reclaim on contention and then
> later on enhance it to add sysctl vm.soft_limit_ratio. I am not
> proposing the soft limit patches for 2.6.30, but I would like to get
> them in -mm for wider testing. If in that process the sysctl seems
> more useful and applicable, we can consider adding it. Adding it right
> now makes the reclaim logic more complex, having to check if we hit
> the vm ratio quite often. Do you agree?
>
If you can fix the zone issues and can answer all of Kosaki's requests.
But you said "this all is not for memory shortage but for softlimit".
It seems strange to me to modify the memory reclaim path.
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-17 6:30 ` KAMEZAWA Hiroyuki
@ 2009-03-17 6:59 ` Balbir Singh
0 siblings, 0 replies; 29+ messages in thread
From: Balbir Singh @ 2009-03-17 6:59 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 15:30:46]:
> On Tue, 17 Mar 2009 11:52:05 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 15:00:58]:
> >
> > > On Tue, 17 Mar 2009 11:25:06 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > >
> > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-17 14:17:14]:
> > > > > > That is not true..we don't track them to default cgroup unless
> > > > > > memory.use_hierarchy is enabled in the root cgroup.
> > > > > What I want to say is "the task which is not attached to user's cgroup is
> > > > > also under the default cgroup, so we don't need an additional hook"
> > > > > Not talking about hierarchy.
> > > > >
> > > >
> > > > Since all the user pages are tracked in one or the other cgroup, the
> > > > total accounting is equal to total_lru_pages across all zones/nodes.
> > > > Your suggestion boils down to if total_lru_pages reaches a threshold,
> > > > do soft limit reclaim, instead of doing reclaim when there is
> > > > contention.. right?
> > > >
> > > Yes.
> > >
> >
> > May I suggest that we first do the reclaim on contention and then
> > later on enhance it to add sysctl vm.soft_limit_ratio. I am not
> > proposing the soft limit patches for 2.6.30, but I would like to get
> > them in -mm for wider testing. If in that process the sysctl seems
> > more useful and applicable, we can consider adding it. Adding it right
> > now makes the reclaim logic more complex, having to check if we hit
> > the vm ratio quite often. Do you agree?
> >
> If you can fix the zone issues and can answer all of Kosaki's requests.
> But you said "this all is not for memory shortage but for softlimit".
> It seems strange to me to modify the memory reclaim path.
>
Kame, the reason for modifying those paths is just to invoke the soft
limit reclaimer. At some point we need to make a decision on the soft
limit reclaim and where to invoke it from.
--
Balbir
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-16 8:35 ` Balbir Singh
2009-03-16 8:49 ` KAMEZAWA Hiroyuki
@ 2009-03-18 0:07 ` KAMEZAWA Hiroyuki
2009-03-18 4:14 ` Balbir Singh
1 sibling, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-18 0:07 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 16 Mar 2009 14:05:12 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > + next_mem =
> > > + __mem_cgroup_largest_soft_limit_node();
> > > + } while (next_mem == mem);
> > > + }
> > > + mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > > + __mem_cgroup_remove_exceeded(mem);
> > > + if (mem->usage_in_excess)
> > > + __mem_cgroup_insert_exceeded(mem);
> >
> > If next_mem == NULL here (meaning "mem" is the only mem_cgroup which exceeds its soft limit),
> > mem will be found again even if !reclaimed.
> > plz check.
>
> > Yes, we need to add "if (!next_mem) break;". Thanks!
>
Plz be sure that there can be the following case:
1. several memcgs are over their soft limit.
2. almost all memory usage comes from anon or tmpfs/shmem.
3. Swapless system
or
Most of memory is mlocked.
Thanks,
-Kame
* Re: [PATCH 4/4] Memory controller soft limit reclaim on contention (v6)
2009-03-18 0:07 ` KAMEZAWA Hiroyuki
@ 2009-03-18 4:14 ` Balbir Singh
0 siblings, 0 replies; 29+ messages in thread
From: Balbir Singh @ 2009-03-18 4:14 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-18 09:07:47]:
> On Mon, 16 Mar 2009 14:05:12 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > + next_mem =
> > > > + __mem_cgroup_largest_soft_limit_node();
> > > > + } while (next_mem == mem);
> > > > + }
> > > > + mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > > > + __mem_cgroup_remove_exceeded(mem);
> > > > + if (mem->usage_in_excess)
> > > > + __mem_cgroup_insert_exceeded(mem);
> > >
> > > If next_mem == NULL here, (means "mem" is an only mem_cgroup which excess softlimit.)
> > > mem will be found again even if !reclaimed.
> > > plz check.
> >
> > Yes, We need to add a if (!next_mem) break; Thanks!
> >
> Plz be sure that there can be the following case:
>
> 1. several memcgs are over their soft limit.
> 2. almost all memory usage comes from anon or tmpfs/shmem.
> 3. Swapless system
> or
> Most of memory is mlocked.
>
Good point, will test with those as well.
--
Balbir
Thread overview: 29+ messages
2009-03-14 17:30 [PATCH 0/4] Memory controller soft limit patches (v6) Balbir Singh
2009-03-14 17:30 ` [PATCH 1/4] Memory controller soft limit documentation (v6) Balbir Singh
2009-03-14 17:30 ` [PATCH 2/4] Memory controller soft limit interface (v6) Balbir Singh
2009-03-14 17:31 ` [PATCH 3/4] Memory controller soft limit organize cgroups (v6) Balbir Singh
2009-03-16 0:21 ` KAMEZAWA Hiroyuki
2009-03-16 8:47 ` Balbir Singh
2009-03-16 8:57 ` KAMEZAWA Hiroyuki
2009-03-14 17:31 ` [PATCH 4/4] Memory controller soft limit reclaim on contention (v6) Balbir Singh
2009-03-16 0:52 ` KAMEZAWA Hiroyuki
2009-03-16 8:35 ` Balbir Singh
2009-03-16 8:49 ` KAMEZAWA Hiroyuki
2009-03-16 9:03 ` KAMEZAWA Hiroyuki
2009-03-16 9:10 ` Balbir Singh
2009-03-16 11:10 ` KAMEZAWA Hiroyuki
2009-03-16 11:38 ` Balbir Singh
2009-03-16 11:58 ` KAMEZAWA Hiroyuki
2009-03-16 12:19 ` Balbir Singh
2009-03-17 3:47 ` KAMEZAWA Hiroyuki
2009-03-17 4:40 ` Balbir Singh
2009-03-17 4:47 ` KAMEZAWA Hiroyuki
2009-03-17 4:58 ` Balbir Singh
2009-03-17 5:17 ` KAMEZAWA Hiroyuki
2009-03-17 5:55 ` Balbir Singh
2009-03-17 6:00 ` KAMEZAWA Hiroyuki
2009-03-17 6:22 ` Balbir Singh
2009-03-17 6:30 ` KAMEZAWA Hiroyuki
2009-03-17 6:59 ` Balbir Singh
2009-03-18 0:07 ` KAMEZAWA Hiroyuki
2009-03-18 4:14 ` Balbir Singh