* [PATCH 0/5] Memory controller soft limit patches (v7)
@ 2009-03-19 16:57 Balbir Singh
From: Balbir Singh @ 2009-03-19 16:57 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
From: Balbir Singh <balbir@linux.vnet.ibm.com>
New Feature: Soft limits for memory resource controller.
Changelog v7...v6
1. Added checks in reclaim path to make sure we don't infinitely loop
2. Refactored reclaim options into a new patch
3. Tested several scenarios, see tests below
Changelog v6...v5
1. If the number of reclaimed pages is zero, select the next mem cgroup
for reclamation
2. Fixed a bug, where key was being updated after insertion into the tree
3. Fixed a build issue, when CONFIG_MEM_RES_CTLR is not enabled
Changelog v5...v4
1. Several changes to the reclaim logic; please see patch 4 (reclaim on
contention). I've experimented with several possibilities for reclaim
and chose to come back to this due to the excellent behaviour seen while
testing the patchset.
2. Reduced the overhead of soft limits on resource counters very significantly.
Reaim benchmark now shows almost no drop in performance.
Changelog v4...v3
1. Adopted suggestions from Kamezawa to do a per-zone-per-node reclaim
while doing soft limit reclaim. We don't record priorities while
doing soft reclaim
2. Some of the overheads associated with soft limits (like calculating
excess each time) are eliminated
3. The time_after(jiffies, 0) bug has been fixed
4. Tasks are throttled if the mem cgroup they belong to is being soft
reclaimed while, at the same time, the tasks are increasing their memory
footprint and causing the mem cgroup to exceed its soft limit.
Changelog v3...v2
1. Implemented several review comments from Kosaki-San and Kamezawa-San
Please see individual changelogs for changes
Changelog v2...v1
1. Soft limits now support hierarchies
2. Use spinlocks instead of mutexes for synchronization of the RB tree
Here is v7 of the new soft limit implementation. Soft limits are a new feature
for the memory resource controller; something similar has existed in the
group scheduler in the form of shares, though the CPU controller's
interpretation of shares is very different.
Soft limits are most useful in environments where the administrator wants to
overcommit the system, so that the limits become active only under memory
contention. The current soft limits implementation provides a
soft_limit_in_bytes interface for the memory controller, but not for the
memory+swap controller. The implementation maintains an RB-tree of groups
that exceed their soft limit and starts reclaiming from the group that
exceeds its limit by the largest amount.
So far I have the best test results with this patchset; I've experimented
with several approaches and methods. I might be a little delayed in
responding, since I may have only intermittent access to the internet for
the next few days.
TODOs
1. The current implementation maintains the delta from the soft limit
and pushes groups back to their soft limits; a ratio of delta/soft_limit
might be more useful
Tests
-----
I've run two memory-intensive workloads with differing soft limits and
seen that they are pushed back to their soft limits under contention. Their
usage settled at their soft limit plus any additional memory that they were
able to grab on the system. Soft limits can take a while before we see the
expected results.
The other tests I've run are
1. Deletion of groups while soft limit reclaim is in progress in the hierarchy
2. Setting the soft limit to zero and running other groups with non-zero
soft limits.
3. Setting the soft limit to zero and testing if the mem cgroup is able
to use available memory
4. Tested the patches with hierarchy enabled
5. Tested with swapoff -a, to make sure we don't go into an infinite loop
Please review, comment.
Series
------
memcg-soft-limits-documentation.patch
memcg-soft-limits-interface.patch
memcg-soft-limits-organize.patch
memcg-soft-limits-refactor-reclaim-bits.patch
memcg-soft-limits-reclaim-on-contention.patch
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 1/5] Memory controller soft limit documentation (v7)
@ 2009-03-19 16:57 ` Balbir Singh
From: Balbir Singh @ 2009-03-19 16:57 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
Feature: Add documentation for soft limits
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
Documentation/cgroups/memory.txt | 31 ++++++++++++++++++++++++++++++-
1 files changed, 30 insertions(+), 1 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index a98a7fe..c5f73d9 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -360,7 +360,36 @@ cgroups created below it.
NOTE2: This feature can be enabled/disabled per subtree.
-7. TODO
+7. Soft limits
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much of the memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention or low memory, control groups
+are pushed back to their soft limits. If the soft limit of each control
+group is very high, they are pushed back as much as possible to make
+sure that one control group does not starve the others of memory.
+
+7.1 Interface
+
+Soft limits can be set up by using the following commands (in this example we
+assume a soft limit of 256 megabytes)
+
+# echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use
+
+# echo 1G > memory.soft_limit_in_bytes
+
+NOTE1: Soft limits take effect over a long period of time, since they involve
+ reclaiming memory for balancing between memory cgroups
+NOTE2: It is recommended always to set the soft limit below the hard limit,
+       otherwise the hard limit will take precedence.
+
+8. TODO
1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
--
Balbir
* [PATCH 2/5] Memory controller soft limit interface (v7)
@ 2009-03-19 16:57 ` Balbir Singh
From: Balbir Singh @ 2009-03-19 16:57 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
Feature: Add soft limits interface to resource counters
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Changelog v2...v1
1. Add support for res_counter_check_soft_limit_locked. This is used
by the hierarchy code.
Add an interface to allow get/set of soft limits. Soft limits for the memory
plus swap controller (memsw) are currently not supported. Resource counters
have been enhanced to support soft limits, and a new type, RES_SOFT_LIMIT, has
been added. Unlike hard limits, soft limits can be directly set and do not
need any reclaim or checks before being set to a new value.
Kamezawa-San raised a question as to whether the soft limit should belong
to res_counter. Since all resources understand the basic concepts of
hard and soft limits, it is justified to add soft limits here. Soft limits
are a generic resource usage feature; even file system quotas support
soft limits.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
include/linux/res_counter.h | 58 +++++++++++++++++++++++++++++++++++++++++++
kernel/res_counter.c | 3 ++
mm/memcontrol.c | 20 +++++++++++++++
3 files changed, 81 insertions(+), 0 deletions(-)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 4c5bcf6..5c821fd 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -35,6 +35,10 @@ struct res_counter {
*/
unsigned long long limit;
/*
+ * the limit that usage can exceed
+ */
+ unsigned long long soft_limit;
+ /*
* the number of unsuccessful attempts to consume the resource
*/
unsigned long long failcnt;
@@ -85,6 +89,7 @@ enum {
RES_MAX_USAGE,
RES_LIMIT,
RES_FAILCNT,
+ RES_SOFT_LIMIT,
};
/*
@@ -130,6 +135,36 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
return false;
}
+static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
+{
+ if (cnt->usage < cnt->soft_limit)
+ return true;
+
+ return false;
+}
+
+/**
+ * Get the difference between the usage and the soft limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to soft limit
+ * The difference between usage and soft limit, otherwise.
+ */
+static inline unsigned long long
+res_counter_soft_limit_excess(struct res_counter *cnt)
+{
+ unsigned long long excess;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (cnt->usage <= cnt->soft_limit)
+ excess = 0;
+ else
+ excess = cnt->usage - cnt->soft_limit;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return excess;
+}
+
/*
* Helper function to detect if the cgroup is within it's limit or
* not. It's currently called from cgroup_rss_prepare()
@@ -145,6 +180,17 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
return ret;
}
+static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
+{
+ bool ret;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ ret = res_counter_soft_limit_check_locked(cnt);
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return ret;
+}
+
static inline void res_counter_reset_max(struct res_counter *cnt)
{
unsigned long flags;
@@ -178,4 +224,16 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
return ret;
}
+static inline int
+res_counter_set_soft_limit(struct res_counter *cnt,
+ unsigned long long soft_limit)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->soft_limit = soft_limit;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
#endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index bf8e753..4e6dafe 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
{
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
+ counter->soft_limit = (unsigned long long)LLONG_MAX;
counter->parent = parent;
}
@@ -101,6 +102,8 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->limit;
case RES_FAILCNT:
return &counter->failcnt;
+ case RES_SOFT_LIMIT:
+ return &counter->soft_limit;
};
BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5de6be9..70bc992 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2002,6 +2002,20 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
else
ret = mem_cgroup_resize_memsw_limit(memcg, val);
break;
+ case RES_SOFT_LIMIT:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ /*
+ * For memsw, soft limits are hard to define semantically;
+ * for now, we support soft limits only for memory
+ * control without swap
+ */
+ if (type == _MEM)
+ ret = res_counter_set_soft_limit(&memcg->res, val);
+ else
+ ret = -EINVAL;
+ break;
default:
ret = -EINVAL; /* should be BUG() ? */
break;
@@ -2251,6 +2265,12 @@ static struct cftype mem_cgroup_files[] = {
.read_u64 = mem_cgroup_read,
},
{
+ .name = "soft_limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
.name = "failcnt",
.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
.trigger = mem_cgroup_reset,
--
Balbir
* [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
@ 2009-03-19 16:57 ` Balbir Singh
From: Balbir Singh @ 2009-03-19 16:57 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
Feature: Organize cgroups over soft limit in a RB-Tree
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Changelog v7...v6
1. Refactor the check and update logic. The goal is to allow the
check logic to be modular, so that it can be revisited in the future
if something more appropriate is found to be useful.
Changelog v6...v5
1. Update the key before inserting into RB tree. Without the current change
it could take an additional iteration to get the key correct.
Changelog v5...v4
1. res_counter_uncharge has an additional parameter to indicate if the
counter was over its soft limit, before uncharge.
Changelog v4...v3
1. Optimizations to ensure we don't unnecessarily get res_counter values
2. Fixed a bug in usage of time_after()
Changelog v3...v2
1. Add only the ancestor to the RB-Tree
2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
Changelog v2...v1
1. Add support for hierarchies
2. The res_counter that is highest in the hierarchy is returned on soft
limit being exceeded. Since we do hierarchical reclaim and add all
groups exceeding their soft limits, this approach seems to work well
in practice.
This patch introduces an RB-tree for storing memory cgroups that are over their
soft limit. The overall goals are to
1. Add a memory cgroup to the RB-tree when its soft limit is exceeded.
We are careful about updates; updates take place only after a particular
time interval has passed
2. Remove the node from the RB-tree when the usage goes below the soft
limit
The next set of patches will exploit the RB-tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
include/linux/res_counter.h | 6 +-
kernel/res_counter.c | 18 +++++
mm/memcontrol.c | 149 ++++++++++++++++++++++++++++++++++++++-----
3 files changed, 151 insertions(+), 22 deletions(-)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 5c821fd..5bbf8b1 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
int __must_check res_counter_charge_locked(struct res_counter *counter,
unsigned long val);
int __must_check res_counter_charge(struct res_counter *counter,
- unsigned long val, struct res_counter **limit_fail_at);
+ unsigned long val, struct res_counter **limit_fail_at,
+ struct res_counter **soft_limit_at);
/*
* uncharge - tell that some portion of the resource is released
@@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
*/
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+ bool *was_soft_limit_excess);
static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
{
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4e6dafe..51ec438 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
}
int res_counter_charge(struct res_counter *counter, unsigned long val,
- struct res_counter **limit_fail_at)
+ struct res_counter **limit_fail_at,
+ struct res_counter **soft_limit_fail_at)
{
int ret;
unsigned long flags;
struct res_counter *c, *u;
*limit_fail_at = NULL;
+ if (soft_limit_fail_at)
+ *soft_limit_fail_at = NULL;
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
ret = res_counter_charge_locked(c, val);
+ /*
+ * With soft limits, we return the highest ancestor
+ * that exceeds its soft limit
+ */
+ if (soft_limit_fail_at &&
+ !res_counter_soft_limit_check_locked(c))
+ *soft_limit_fail_at = c;
spin_unlock(&c->lock);
if (ret < 0) {
*limit_fail_at = c;
@@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
counter->usage -= val;
}
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+ bool *was_soft_limit_excess)
{
unsigned long flags;
struct res_counter *c;
@@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
+ if (c == counter && was_soft_limit_excess)
+ *was_soft_limit_excess =
+ !res_counter_soft_limit_check_locked(c);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 70bc992..f5b61b8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -29,6 +29,7 @@
#include <linux/rcupdate.h>
#include <linux/limits.h>
#include <linux/mutex.h>
+#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/spinlock.h>
@@ -129,6 +130,14 @@ struct mem_cgroup_lru_info {
};
/*
+ * Cgroups above their limits are maintained in a RB-Tree, independent of
+ * their hierarchy representation
+ */
+
+static struct rb_root mem_cgroup_soft_limit_tree;
+static DEFINE_SPINLOCK(memcg_soft_limit_tree_lock);
+
+/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
* statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -176,12 +185,20 @@ struct mem_cgroup {
unsigned int swappiness;
+ struct rb_node mem_cgroup_node; /* RB tree node */
+ unsigned long long usage_in_excess; /* Set to the value by which */
+ /* the soft limit is exceeded*/
+ unsigned long last_tree_update; /* Last time the tree was */
+ /* updated in jiffies */
+
/*
* statistics. This must be placed at the end of memcg.
*/
struct mem_cgroup_stat stat;
};
+#define MEM_CGROUP_TREE_UPDATE_INTERVAL (HZ/4)
+
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -214,6 +231,42 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+ struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
+ struct rb_node *parent = NULL;
+ struct mem_cgroup *mem_node;
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ while (*p) {
+ parent = *p;
+ mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
+ if (mem->usage_in_excess < mem_node->usage_in_excess)
+ p = &(*p)->rb_left;
+ /*
+ * We can't avoid mem cgroups that are over their soft
+ * limit by the same amount
+ */
+ else if (mem->usage_in_excess >= mem_node->usage_in_excess)
+ p = &(*p)->rb_right;
+ }
+ rb_link_node(&mem->mem_cgroup_node, parent, p);
+ rb_insert_color(&mem->mem_cgroup_node,
+ &mem_cgroup_soft_limit_tree);
+ mem->last_tree_update = jiffies;
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
+static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+ unsigned long flags;
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -897,6 +950,46 @@ static void record_last_oom(struct mem_cgroup *mem)
mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
}
+static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem,
+ bool over_soft_limit)
+{
+ unsigned long next_update;
+
+ if (!over_soft_limit)
+ return false;
+
+ next_update = mem->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
+ if (time_after(jiffies, next_update))
+ return true;
+
+ return false;
+}
+
+static void mem_cgroup_update_tree(struct mem_cgroup *mem)
+{
+ unsigned long long prev_usage_in_excess, new_usage_in_excess;
+ bool updated_tree = false;
+ unsigned long flags;
+
+ prev_usage_in_excess = mem->usage_in_excess;
+
+ new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ if (prev_usage_in_excess) {
+ mem_cgroup_remove_exceeded(mem);
+ updated_tree = true;
+ }
+ if (!new_usage_in_excess)
+ goto done;
+ mem_cgroup_insert_exceeded(mem);
+
+done:
+ if (updated_tree) {
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ mem->last_tree_update = jiffies;
+ mem->usage_in_excess = new_usage_in_excess;
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+ }
+}
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
@@ -906,9 +999,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **memcg,
bool oom)
{
- struct mem_cgroup *mem, *mem_over_limit;
+ struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct res_counter *fail_res;
+ struct res_counter *fail_res, *soft_fail_res = NULL;
if (unlikely(test_thread_flag(TIF_MEMDIE))) {
/* Don't account this! */
@@ -938,16 +1031,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
int ret;
bool noswap = false;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+ ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
+ &soft_fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
- &fail_res);
+ &fail_res, NULL);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
noswap = true;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -985,6 +1079,18 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
goto nomem;
}
}
+
+ /*
+ * Insert just the ancestor, we should trickle down to the correct
+ * cgroup for reclaim, since the other nodes will be below their
+ * soft limit
+ */
+ if (soft_fail_res) {
+ mem_over_soft_limit =
+ mem_cgroup_from_res_counter(soft_fail_res, res);
+ if (mem_cgroup_soft_limit_check(mem_over_soft_limit, true))
+ mem_cgroup_update_tree(mem_over_soft_limit);
+ }
return 0;
nomem:
css_put(&mem->css);
@@ -1061,9 +1167,9 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
lock_page_cgroup(pc);
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
css_put(&mem->css);
return;
}
@@ -1116,10 +1222,10 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
if (pc->mem_cgroup != from)
goto out;
- res_counter_uncharge(&from->res, PAGE_SIZE);
+ res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
mem_cgroup_charge_statistics(from, pc, false);
if (do_swap_account)
- res_counter_uncharge(&from->memsw, PAGE_SIZE);
+ res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
css_put(&from->css);
css_get(&to->css);
@@ -1183,9 +1289,9 @@ uncharge:
/* drop extra refcnt by try_charge() */
css_put(&parent->css);
/* uncharge if move fails */
- res_counter_uncharge(&parent->res, PAGE_SIZE);
+ res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
return ret;
}
@@ -1314,7 +1420,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
* Recorded ID can be obsolete. We avoid calling
* css_tryget()
*/
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
mem_cgroup_put(mem);
}
rcu_read_unlock();
@@ -1393,7 +1499,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
* This recorded memcg can be obsolete one. So, avoid
* calling css_tryget
*/
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
mem_cgroup_put(memcg);
}
rcu_read_unlock();
@@ -1408,9 +1514,9 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
return;
if (!mem)
return;
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
css_put(&mem->css);
}
@@ -1424,6 +1530,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
+ bool soft_limit_excess = false;
if (mem_cgroup_disabled())
return NULL;
@@ -1461,9 +1568,9 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
break;
}
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
mem_cgroup_charge_statistics(mem, pc, false);
ClearPageCgroupUsed(pc);
@@ -1477,6 +1584,8 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
mz = page_cgroup_zoneinfo(pc);
unlock_page_cgroup(pc);
+ if (mem_cgroup_soft_limit_check(mem, soft_limit_excess))
+ mem_cgroup_update_tree(mem);
/* at swapout, this memcg will be accessed to record to swap */
if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
css_put(&mem->css);
@@ -1545,7 +1654,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
* We uncharge this because swap is freed.
* This memcg can be obsolete one. We avoid calling css_tryget
*/
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
mem_cgroup_put(memcg);
}
rcu_read_unlock();
@@ -2409,6 +2518,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
{
int node;
+ mem_cgroup_update_tree(mem);
free_css_id(&mem_cgroup_subsys, &mem->css);
for_each_node_state(node, N_POSSIBLE)
@@ -2475,6 +2585,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
if (cont->parent == NULL) {
enable_swap_cgroup();
parent = NULL;
+ mem_cgroup_soft_limit_tree = RB_ROOT;
} else {
parent = mem_cgroup_from_cont(cont->parent);
mem->use_hierarchy = parent->use_hierarchy;
@@ -2495,6 +2606,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
res_counter_init(&mem->memsw, NULL);
}
mem->last_scanned_child = 0;
+ mem->usage_in_excess = 0;
+ mem->last_tree_update = 0; /* Yes, time begins at 0 here */
spin_lock_init(&mem->reclaim_param_lock);
if (parent)
--
Balbir
* [PATCH 4/5] Memory controller soft limit refactor reclaim flags (v7)
@ 2009-03-19 16:57 ` Balbir Singh
From: Balbir Singh @ 2009-03-19 16:57 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
Impact: Refactor mem_cgroup_hierarchical_reclaim()
From: Balbir Singh <balbir@linux.vnet.ibm.com>
This patch refactors the arguments passed to
mem_cgroup_hierarchical_reclaim() into flags, so that new parameters don't
have to be added to its signature as we make the reclaim routine more
flexible.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
mm/memcontrol.c | 27 ++++++++++++++++++++-------
1 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f5b61b8..992aac8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -227,6 +227,14 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val) ((val) & 0xffff)
+/*
+ * Reclaim flags for mem_cgroup_hierarchical_reclaim
+ */
+#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
+#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
+#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
+#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
@@ -889,11 +897,14 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
* If shrink==true, for avoiding to free too much, this returns immedieately.
*/
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
- gfp_t gfp_mask, bool noswap, bool shrink)
+ gfp_t gfp_mask,
+ unsigned long reclaim_options)
{
struct mem_cgroup *victim;
int ret, total = 0;
int loop = 0;
+ bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
+ bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
while (loop < 2) {
victim = mem_cgroup_select_victim(root_mem);
@@ -1029,7 +1040,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
while (1) {
int ret;
- bool noswap = false;
+ unsigned long flags = 0;
ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
&soft_fail_res);
@@ -1042,7 +1053,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
break;
/* mem+swap counter fails */
res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
- noswap = true;
+ flags |= MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
} else
@@ -1054,7 +1065,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
goto nomem;
ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
- noswap, false);
+ flags);
if (ret)
continue;
@@ -1766,7 +1777,7 @@ int mem_cgroup_shrink_usage(struct page *page,
do {
progress = mem_cgroup_hierarchical_reclaim(mem,
- gfp_mask, true, false);
+ gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
progress += mem_cgroup_check_under_limit(mem);
} while (!progress && --retry);
@@ -1821,7 +1832,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
break;
progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
- false, true);
+ MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -1869,7 +1880,9 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
+ mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
+ MEM_CGROUP_RECLAIM_NOSWAP |
+ MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
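The refactor above folds the two boolean parameters into a single options word. A standalone sketch of the encode/decode pattern (userspace C; decode_reclaim_options is a name made up for this sketch, the flag values mirror the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Same encoding as the patch: one bit per reclaim option. */
#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)

/* Decode the options word back into the two booleans that the old
 * signature passed explicitly. */
static void decode_reclaim_options(unsigned long options, bool *noswap,
				   bool *shrink)
{
	*noswap = options & MEM_CGROUP_RECLAIM_NOSWAP;
	*shrink = options & MEM_CGROUP_RECLAIM_SHRINK;
}
```

New options can be added as further bits without touching every caller, which is the stated point of the refactor.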
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH 5/5] Memory controller soft limit reclaim on contention (v7)
2009-03-19 16:57 [PATCH 0/5] Memory controller soft limit patches (v7) Balbir Singh
` (3 preceding siblings ...)
2009-03-19 16:57 ` [PATCH 4/5] Memory controller soft limit refactor reclaim flags (v7) Balbir Singh
@ 2009-03-19 16:57 ` Balbir Singh
2009-03-20 4:06 ` KAMEZAWA Hiroyuki
2009-03-23 3:50 ` [PATCH 0/5] Memory controller soft limit patches (v7) KAMEZAWA Hiroyuki
` (2 subsequent siblings)
7 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-19 16:57 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Balbir Singh,
Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki
Feature: Implement reclaim from groups over their soft limit
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Changelog v7...v6
1. Refactored out reclaim_options patch into a separate patch
2. Added additional checks to mem_cgroup_hierarchical_reclaim() for the
case where all swap is turned off
Changelog v6...v5
1. Reclaim arguments to hierarchical reclaim have been merged into one
parameter called reclaim_options.
2. Check if we failed to reclaim from one cgroup during soft reclaim; if
so, move on to the next one. This can be very useful if the zonelist
passed to soft limit reclaim has no allocations from the selected
memory cgroup
3. Coding style cleanups
Changelog v5...v4
1. Throttling has been removed; earlier we throttled tasks over their soft limit
2. Reclaim has been moved back to __alloc_pages_internal; several experiments
and tests showed that it was the best place to reclaim memory. kswapd has
a different goal that does not work well with a single soft limit for the
memory cgroup.
3. Soft limit reclaim is more targeted and the number of pages reclaimed
depends on the amount by which the soft limit is exceeded.
Changelog v4...v3
1. soft_reclaim is now called from balance_pgdat
2. soft_reclaim is aware of nodes and zones
3. A mem_cgroup will be throttled if it is undergoing soft limit reclaim
and at the same time trying to allocate pages and exceed its soft limit.
4. A new mem_cgroup_shrink_zone() routine has been added to shrink zones
particular to a mem cgroup.
Changelog v3...v2
1. Convert several arguments to hierarchical reclaim to flags, thereby
consolidating them
2. The reclaim for soft limits is now triggered from kswapd
3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument
Changelog v2...v1
1. Added support for hierarchical soft limits
This patch allows reclaim from memory cgroups on contention (via the
direct reclaim path).
Memory cgroup soft limit reclaim finds the group that exceeds its soft limit
by the largest number of pages, reclaims pages from it, and then reinserts the
cgroup into its correct place in the RB tree.
Added additional checks to mem_cgroup_hierarchical_reclaim() to detect
long loops in case all swap is turned off. The code has been refactored
and the loop check (loop < 2) has been enhanced for soft limits. For soft
limits, we try to do more targeted reclaim. Instead of bailing out after
two loops, the routine now reclaims memory proportional to the size by
which the soft limit is exceeded. The proportion has been empirically
determined.
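The bail-out rule described above can be modelled as a pure function (a sketch; soft_limit_reclaim_done is a name invented here, the constant mirrors MEM_CGROUP_MAX_RECLAIM_LOOPS from the patch):

```c
#include <assert.h>
#include <stdbool.h>

#define MEM_CGROUP_MAX_RECLAIM_LOOPS 10000

/* "excess" is the number of pages the cgroup is over its soft limit.
 * Reclaim stops once roughly a quarter of the excess (excess >> 2) has
 * been reclaimed, once nothing at all could be reclaimed, or after a
 * safety cap on iterations. */
static bool soft_limit_reclaim_done(unsigned long total_reclaimed,
				    unsigned long excess, int loop)
{
	if (!total_reclaimed)	/* no reclaimable pages in this hierarchy */
		return true;
	return total_reclaimed >= (excess >> 2) ||
	       loop > MEM_CGROUP_MAX_RECLAIM_LOOPS;
}
```

This is why the fixed (loop < 2) check could be relaxed: the amount reclaimed scales with how far the group is over its soft limit, while the loop cap still bounds the worst case.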
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
include/linux/memcontrol.h | 8 ++
include/linux/swap.h | 1
mm/memcontrol.c | 202 +++++++++++++++++++++++++++++++++++++++++---
mm/page_alloc.c | 9 ++
mm/vmscan.c | 5 +
5 files changed, 206 insertions(+), 19 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..faeb358 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,6 +116,8 @@ static inline bool mem_cgroup_disabled(void)
}
extern bool mem_cgroup_oom_called(struct task_struct *task);
+unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl,
+ gfp_t gfp_mask);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
@@ -264,6 +266,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}
+static inline
+unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
+{
+ return 0;
+}
+
#endif /* CONFIG_CGROUP_MEM_CONT */
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 989eb53..c128337 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,7 @@ static inline void lru_cache_add_active_file(struct page *page)
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
+ struct zonelist *zl,
gfp_t gfp_mask, bool noswap,
unsigned int swappiness);
extern int __isolate_lru_page(struct page *page, int mode, int file);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 992aac8..aeab794 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -191,6 +191,7 @@ struct mem_cgroup {
unsigned long last_tree_update; /* Last time the tree was */
/* updated in jiffies */
+ bool on_tree; /* Is the node on tree? */
/*
* statistics. This must be placed at the end of memcg.
*/
@@ -199,6 +200,13 @@ struct mem_cgroup {
#define MEM_CGROUP_TREE_UPDATE_INTERVAL (HZ/4)
+/*
+ * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
+ * limit reclaim to prevent infinite loops, if they ever occur.
+ */
+#define MEM_CGROUP_MAX_RECLAIM_LOOPS (10000)
+#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
+
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -234,19 +242,22 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+#define MEM_CGROUP_RECLAIM_SOFT_BIT 0x2
+#define MEM_CGROUP_RECLAIM_SOFT (1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
-static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
{
struct rb_node **p = &mem_cgroup_soft_limit_tree.rb_node;
struct rb_node *parent = NULL;
struct mem_cgroup *mem_node;
- unsigned long flags;
- spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ if (mem->on_tree)
+ return;
+
mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
while (*p) {
parent = *p;
@@ -264,6 +275,23 @@ static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
rb_insert_color(&mem->mem_cgroup_node,
&mem_cgroup_soft_limit_tree);
mem->last_tree_update = jiffies;
+ mem->on_tree = true;
+}
+
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+ if (!mem->on_tree)
+ return;
+ rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+ mem->on_tree = false;
+}
+
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ __mem_cgroup_insert_exceeded(mem);
spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
}
@@ -271,7 +299,53 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
{
unsigned long flags;
spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
- rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_tree);
+ __mem_cgroup_remove_exceeded(mem);
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+}
+
+unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
+{
+ unsigned long flags;
+ unsigned long long excess;
+
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ excess = mem->usage_in_excess >> PAGE_SHIFT;
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+ return (excess > ULONG_MAX) ? ULONG_MAX : excess;
+}
+
+static struct mem_cgroup *__mem_cgroup_largest_soft_limit_node(void)
+{
+ struct rb_node *rightmost = NULL;
+ struct mem_cgroup *mem = NULL;
+
+retry:
+ rightmost = rb_last(&mem_cgroup_soft_limit_tree);
+ if (!rightmost)
+ goto done; /* Nothing to reclaim from */
+
+ mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
+ /*
+ * Remove the node now but someone else can add it back,
+ * we will add it back at the end of reclaim to its correct
+ * position in the tree.
+ */
+ __mem_cgroup_remove_exceeded(mem);
+ if (!css_tryget(&mem->css) || !res_counter_soft_limit_excess(&mem->res))
+ goto retry;
+done:
+ return mem;
+}
+
+static struct mem_cgroup *mem_cgroup_largest_soft_limit_node(void)
+{
+ struct mem_cgroup *mem;
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+ mem = __mem_cgroup_largest_soft_limit_node();
spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+ return mem;
}
@@ -897,6 +971,7 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
* If shrink==true, to avoid freeing too much, this returns immediately.
*/
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
+ struct zonelist *zl,
gfp_t gfp_mask,
unsigned long reclaim_options)
{
@@ -905,19 +980,41 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
int loop = 0;
bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
+ bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
+ unsigned long excess = mem_cgroup_get_excess(root_mem);
- while (loop < 2) {
+ while (1) {
victim = mem_cgroup_select_victim(root_mem);
- if (victim == root_mem)
+ if (victim == root_mem) {
loop++;
+ if (loop >= 2) {
+ /*
+ * If we have not been able to reclaim
+ * anything, it might be because there are
+ * no reclaimable pages under this hierarchy
+ */
+ if (!check_soft || !total)
+ break;
+ /*
+ * We want to do more targeted reclaim.
+ * excess >> 2 is neither so excessive that
+ * we reclaim too much, nor so little that
+ * we keep coming back to reclaim from this
+ * cgroup
+ if (total >= (excess >> 2) ||
+ (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
+ break;
+ }
+ }
if (!mem_cgroup_local_usage(&victim->stat)) {
/* this cgroup's local usage == 0 */
css_put(&victim->css);
continue;
}
/* we use swappiness of local cgroup */
- ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
- get_swappiness(victim));
+ ret = try_to_free_mem_cgroup_pages(victim, zl, gfp_mask,
+ noswap,
+ get_swappiness(victim));
css_put(&victim->css);
/*
* At shrinking usage, we can't check we should stop here or
@@ -927,7 +1024,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
if (shrink)
return ret;
total += ret;
- if (mem_cgroup_check_under_limit(root_mem))
+ if (check_soft) {
+ if (res_counter_check_under_soft_limit(&root_mem->res))
+ return total;
+ } else if (mem_cgroup_check_under_limit(root_mem))
return 1 + total;
}
return total;
@@ -1064,8 +1164,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
- ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
- flags);
+ ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
+ gfp_mask, flags);
if (ret)
continue;
@@ -1776,7 +1876,7 @@ int mem_cgroup_shrink_usage(struct page *page,
return 0;
do {
- progress = mem_cgroup_hierarchical_reclaim(mem,
+ progress = mem_cgroup_hierarchical_reclaim(mem, NULL,
gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
progress += mem_cgroup_check_under_limit(mem);
} while (!progress && --retry);
@@ -1831,8 +1931,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_SHRINK);
+ progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
+ GFP_KERNEL,
+ MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -1880,7 +1981,7 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
+ mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
MEM_CGROUP_RECLAIM_NOSWAP |
MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
@@ -1893,6 +1994,73 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
return ret;
}
+unsigned long mem_cgroup_soft_limit_reclaim(struct zonelist *zl, gfp_t gfp_mask)
+{
+ unsigned long nr_reclaimed = 0;
+ struct mem_cgroup *mem, *next_mem = NULL;
+ unsigned long flags;
+ unsigned long reclaimed;
+ int loop = 0;
+
+ /*
+ * This loop can run for a while, especially if mem_cgroups
+ * continuously keep exceeding their soft limits and putting
+ * the system under pressure
+ */
+ do {
+ if (next_mem)
+ mem = next_mem;
+ else
+ mem = mem_cgroup_largest_soft_limit_node();
+ if (!mem)
+ break;
+
+ reclaimed = mem_cgroup_hierarchical_reclaim(mem, zl,
+ gfp_mask,
+ MEM_CGROUP_RECLAIM_SOFT);
+ nr_reclaimed += reclaimed;
+ spin_lock_irqsave(&memcg_soft_limit_tree_lock, flags);
+
+ /*
+ * If we failed to reclaim anything from this memory cgroup
+ * it is time to move on to the next cgroup
+ */
+ next_mem = NULL;
+ if (!reclaimed) {
+ do {
+ /*
+ * By the time we get the soft_limit lock
+ * again, someone might have added the
+ * group back on the RB tree. Iterate to
+ * make sure we get a different mem.
+ * mem_cgroup_largest_soft_limit_node returns
+ * NULL if no other cgroup is present on
+ * the tree
+ */
+ next_mem =
+ __mem_cgroup_largest_soft_limit_node();
+ } while (next_mem == mem);
+ }
+ mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ __mem_cgroup_remove_exceeded(mem);
+ if (mem->usage_in_excess)
+ __mem_cgroup_insert_exceeded(mem);
+ spin_unlock_irqrestore(&memcg_soft_limit_tree_lock, flags);
+ css_put(&mem->css);
+ loop++;
+ /*
+ * Could not reclaim anything and there are no more
+ * mem cgroups to try or we seem to be looping without
+ * reclaiming anything.
+ */
+ if (!nr_reclaimed &&
+ (next_mem == NULL ||
+ loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
+ break;
+ } while (!nr_reclaimed);
+ return nr_reclaimed;
+}
+
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -2016,7 +2184,7 @@ try_to_free:
ret = -EINTR;
goto out;
}
- progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
+ progress = try_to_free_mem_cgroup_pages(mem, NULL, GFP_KERNEL,
false, get_swappiness(mem));
if (!progress) {
nr_retries--;
@@ -2621,6 +2789,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mem->last_scanned_child = 0;
mem->usage_in_excess = 0;
mem->last_tree_update = 0; /* Yes, time begins at 0 here */
+ mem->on_tree = false;
+
spin_lock_init(&mem->reclaim_param_lock);
if (parent)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8fd1e2..5e1a6ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1598,7 +1598,14 @@ nofail_alloc:
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
+ /*
+ * Try to free up some pages from the memory controller's soft
+ * limit queue.
+ */
+ did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
+ if (order || !did_some_progress)
+ did_some_progress += try_to_free_pages(zonelist, order,
+ gfp_mask);
p->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5b560f9..0acd19d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
+ struct zonelist *zonelist,
gfp_t gfp_mask,
bool noswap,
unsigned int swappiness)
@@ -1721,14 +1722,14 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.mem_cgroup = mem_cont,
.isolate_pages = mem_cgroup_isolate_pages,
};
- struct zonelist *zonelist;
if (noswap)
sc.may_unmap = 0;
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
- zonelist = NODE_DATA(numa_node_id())->node_zonelists;
+ if (!zonelist)
+ zonelist = NODE_DATA(numa_node_id())->node_zonelists;
return do_try_to_free_pages(zonelist, &sc);
}
#endif
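The victim selection described above (rb_last() on a tree keyed by usage_in_excess) can be illustrated with a toy model; a plain array stands in for the RB tree, and all names here (toy_memcg, largest_soft_limit_victim) are invented for the sketch:

```c
#include <assert.h>
#include <stddef.h>

/* The selection rule is the same as the patch's: reclaim first from
 * the group that exceeds its soft limit by the most pages. */
struct toy_memcg {
	unsigned long usage;
	unsigned long soft_limit;
};

static unsigned long usage_in_excess(const struct toy_memcg *m)
{
	return m->usage > m->soft_limit ? m->usage - m->soft_limit : 0;
}

static const struct toy_memcg *
largest_soft_limit_victim(const struct toy_memcg *groups, size_t n)
{
	const struct toy_memcg *victim = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (usage_in_excess(&groups[i]) > 0 &&
		    (!victim ||
		     usage_in_excess(&groups[i]) > usage_in_excess(victim)))
			victim = &groups[i];
	return victim;	/* NULL when no group exceeds its soft limit */
}
```

The RB tree makes this selection O(1) via the rightmost node instead of the linear scan shown here, at the cost of reinserting the victim after each reclaim pass.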
--
Balbir
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-19 16:57 ` [PATCH 3/5] Memory controller soft limit organize cgroups (v7) Balbir Singh
@ 2009-03-20 3:46 ` KAMEZAWA Hiroyuki
2009-03-22 14:21 ` Balbir Singh
2009-03-25 4:59 ` KAMEZAWA Hiroyuki
2009-03-25 5:07 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-20 3:46 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Thu, 19 Mar 2009 22:27:35 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Feature: Organize cgroups over soft limit in a RB-Tree
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Changelog v7...v6
> 1. Refactor the check and update logic. The goal is to allow the
> check logic to be modular, so that it can be revisited in the future
> if something more appropriate is found to be useful.
>
One of my motivations for this was "reducing ifs" in res_counter charge...
But please see the comment below.
> Changelog v6...v5
> 1. Update the key before inserting into RB tree. Without the current change
> it could take an additional iteration to get the key correct.
>
> Changelog v5...v4
> 1. res_counter_uncharge has an additional parameter to indicate if the
> counter was over its soft limit, before uncharge.
>
> Changelog v4...v3
> 1. Optimizations to ensure we don't unnecessarily get res_counter values
> 2. Fixed a bug in usage of time_after()
>
> Changelog v3...v2
> 1. Add only the ancestor to the RB-Tree
> 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
>
> Changelog v2...v1
> 1. Add support for hierarchies
> 2. The res_counter that is highest in the hierarchy is returned on soft
> limit being exceeded. Since we do hierarchical reclaim and add all
> groups exceeding their soft limits, this approach seems to work well
> in practice.
>
> This patch introduces a RB-Tree for storing memory cgroups that are over their
> soft limit. The overall goal is to
>
> 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> We are careful about updates, updates take place only after a particular
> time interval has passed
> 2. We remove the node from the RB-Tree when the usage goes below the soft
> limit
>
> The next set of patches will exploit the RB-Tree to get the group that is
> over its soft limit by the largest amount and reclaim from it, when we
> face memory contention.
>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
>
> include/linux/res_counter.h | 6 +-
> kernel/res_counter.c | 18 +++++
> mm/memcontrol.c | 149 ++++++++++++++++++++++++++++++++++++++-----
> 3 files changed, 151 insertions(+), 22 deletions(-)
>
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 5c821fd..5bbf8b1 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> int __must_check res_counter_charge_locked(struct res_counter *counter,
> unsigned long val);
> int __must_check res_counter_charge(struct res_counter *counter,
> - unsigned long val, struct res_counter **limit_fail_at);
> + unsigned long val, struct res_counter **limit_fail_at,
> + struct res_counter **soft_limit_at);
>
> /*
> * uncharge - tell that some portion of the resource is released
> @@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
> */
>
> void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
> -void res_counter_uncharge(struct res_counter *counter, unsigned long val);
> +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> + bool *was_soft_limit_excess);
>
> static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> {
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 4e6dafe..51ec438 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> }
>
> int res_counter_charge(struct res_counter *counter, unsigned long val,
> - struct res_counter **limit_fail_at)
> + struct res_counter **limit_fail_at,
> + struct res_counter **soft_limit_fail_at)
> {
> int ret;
> unsigned long flags;
> struct res_counter *c, *u;
>
> *limit_fail_at = NULL;
> + if (soft_limit_fail_at)
> + *soft_limit_fail_at = NULL;
> local_irq_save(flags);
> for (c = counter; c != NULL; c = c->parent) {
> spin_lock(&c->lock);
> ret = res_counter_charge_locked(c, val);
> + /*
> + * With soft limits, we return the highest ancestor
> + * that exceeds its soft limit
> + */
> + if (soft_limit_fail_at &&
> + !res_counter_soft_limit_check_locked(c))
> + *soft_limit_fail_at = c;
Is this the correct way to go? In the following situation,
A/ softlimit=1G usage=1.2G
B1/ sfotlimit=400M usage=1G
C/
B2/ softlimit=400M usage=200M
"A" will be victim and both of B1 and B2 will be reclaim target, right ?
and I wonder we don't need *softlimit_failed_at*... here.
<snip>
> +static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem,
> + bool over_soft_limit)
> +{
> + unsigned long next_update;
> +
> + if (!over_soft_limit)
> + return false;
> +
> + next_update = mem->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
> + if (time_after(jiffies, next_update))
> + return true;
> +
> + return false;
> +}
If I were to write it, this function would be:
static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem,
struct res_counter **failed_at)
{
unsigned long next_update;
struct res_counter *c;
next_update = mem->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
if (!time_after(jiffies, next_update))
return false;
/* check softlimit */
for (c = &mem->res; c; c = c->parent) {
if (!res_counter_check_under_soft_limit(c)) {
*failed_at = c;
return true;
}
}
return false;
}
/*
* Insert just the ancestor, we should trickle down to the correct
* cgroup for reclaim, since the other nodes will be below their
* soft limit
*/
if (mem_cgroup_soft_limit_check(mem, &soft_fail_res)) {
mem_over_soft_limit =
mem_cgroup_from_res_counter(soft_fail_res, res);
mem_cgroup_update_tree(mem_over_soft_limit);
}
Then, we really do softlimit check once in interval.
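The once-per-interval check above relies on the kernel's time_after(), which uses signed subtraction so comparisons keep working across jiffies wraparound. A plain-C model (time_after_ul and tree_update_due are names invented for this sketch; TREE_UPDATE_INTERVAL stands in for HZ/4):

```c
#include <assert.h>
#include <stdbool.h>

#define TREE_UPDATE_INTERVAL 25

/* true when a is later than b, wraparound-safe */
static bool time_after_ul(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Only act when a full interval has passed since the last update. */
static bool tree_update_due(unsigned long now, unsigned long last_update)
{
	return time_after_ul(now, last_update + TREE_UPDATE_INTERVAL);
}
```

Gating the tree update this way bounds the RB tree manipulation cost no matter how often charges push a group over its soft limit.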
Thanks,
-Kame
* Re: [PATCH 4/5] Memory controller soft limit refactor reclaim flags (v7)
2009-03-19 16:57 ` [PATCH 4/5] Memory controller soft limit refactor reclaim flags (v7) Balbir Singh
@ 2009-03-20 3:47 ` KAMEZAWA Hiroyuki
2009-03-22 14:21 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-20 3:47 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Thu, 19 Mar 2009 22:27:44 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Impact: Refactor mem_cgroup_hierarchical_reclaim()
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> This patch refactors the arguments passed to
> mem_cgroup_hierarchical_reclaim() into flags, so that new parameters don't
> have to be passed as we make the reclaim routine more flexible
>
seems nice :)
Thanks,
-Kame
* Re: [PATCH 5/5] Memory controller soft limit reclaim on contention (v7)
2009-03-19 16:57 ` [PATCH 5/5] Memory controller soft limit reclaim on contention (v7) Balbir Singh
@ 2009-03-20 4:06 ` KAMEZAWA Hiroyuki
2009-03-22 14:27 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-20 4:06 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Thu, 19 Mar 2009 22:27:52 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Feature: Implement reclaim from groups over their soft limit
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Changelog v7...v6
> 1. Refactored out reclaim_options patch into a separate patch
> 2. Added additional checks for all swap off condition in
> mem_cgroup_hierarchical_reclaim()
> - did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> + /*
> + * Try to free up some pages from the memory controllers soft
> + * limit queue.
> + */
> + did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> + if (order || !did_some_progress)
> + did_some_progress += try_to_free_pages(zonelist, order,
> + gfp_mask);
>
Anyway, my biggest concern is here, as always.
With this, if (order > 1), try_to_free_pages() is called twice.
Hmm...how about
if (!pages_reclaimed && !(gfp_mask & __GFP_NORETRY)) { /* first loop or noretry */
did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
if (!did_some_progress)
did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
}else
did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
maybe a bit more conservative.
And I wonder whether "nodemask" should be checked or not;
soft limit reclaim doesn't seem to work well with nodemask...
Thanks,
-Kame
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-20 3:46 ` KAMEZAWA Hiroyuki
@ 2009-03-22 14:21 ` Balbir Singh
2009-03-22 23:53 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-22 14:21 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-20 12:46:39]:
> On Thu, 19 Mar 2009 22:27:35 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > Feature: Organize cgroups over soft limit in a RB-Tree
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >
> > Changelog v7...v6
> > 1. Refactor the check and update logic. The goal is to allow the
> > check logic to be modular, so that it can be revisited in the future
> > if something more appropriate is found to be useful.
> >
> One of my motivation to this was "reducing if" in res_counter charege...
> But ..plz see comment.
>
> > Changelog v6...v5
> > 1. Update the key before inserting into RB tree. Without the current change
> > it could take an additional iteration to get the key correct.
> >
> > Changelog v5...v4
> > 1. res_counter_uncharge has an additional parameter to indicate if the
> > counter was over its soft limit, before uncharge.
> >
> > Changelog v4...v3
> > 1. Optimizations to ensure we don't unnecessarily get res_counter values
> > 2. Fixed a bug in usage of time_after()
> >
> > Changelog v3...v2
> > 1. Add only the ancestor to the RB-Tree
> > 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
> >
> > Changelog v2...v1
> > 1. Add support for hierarchies
> > 2. The res_counter that is highest in the hierarchy is returned on soft
> > limit being exceeded. Since we do hierarchical reclaim and add all
> > groups exceeding their soft limits, this approach seems to work well
> > in practice.
> >
> > This patch introduces a RB-Tree for storing memory cgroups that are over their
> > soft limit. The overall goal is to
> >
> > 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> > We are careful about updates, updates take place only after a particular
> > time interval has passed
> > 2. We remove the node from the RB-Tree when the usage goes below the soft
> > limit
> >
> > The next set of patches will exploit the RB-Tree to get the group that is
> > over its soft limit by the largest amount and reclaim from it, when we
> > face memory contention.
> >
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > ---
> >
> > include/linux/res_counter.h | 6 +-
> > kernel/res_counter.c | 18 +++++
> > mm/memcontrol.c | 149 ++++++++++++++++++++++++++++++++++++++-----
> > 3 files changed, 151 insertions(+), 22 deletions(-)
> >
> >
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index 5c821fd..5bbf8b1 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> > int __must_check res_counter_charge_locked(struct res_counter *counter,
> > unsigned long val);
> > int __must_check res_counter_charge(struct res_counter *counter,
> > - unsigned long val, struct res_counter **limit_fail_at);
> > + unsigned long val, struct res_counter **limit_fail_at,
> > + struct res_counter **soft_limit_at);
> >
> > /*
> > * uncharge - tell that some portion of the resource is released
> > @@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
> > */
> >
> > void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
> > -void res_counter_uncharge(struct res_counter *counter, unsigned long val);
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > + bool *was_soft_limit_excess);
> >
> > static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> > {
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 4e6dafe..51ec438 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> > }
> >
> > int res_counter_charge(struct res_counter *counter, unsigned long val,
> > - struct res_counter **limit_fail_at)
> > + struct res_counter **limit_fail_at,
> > + struct res_counter **soft_limit_fail_at)
> > {
> > int ret;
> > unsigned long flags;
> > struct res_counter *c, *u;
> >
> > *limit_fail_at = NULL;
> > + if (soft_limit_fail_at)
> > + *soft_limit_fail_at = NULL;
> > local_irq_save(flags);
> > for (c = counter; c != NULL; c = c->parent) {
> > spin_lock(&c->lock);
> > ret = res_counter_charge_locked(c, val);
> > + /*
> > + * With soft limits, we return the highest ancestor
> > + * that exceeds its soft limit
> > + */
> > + if (soft_limit_fail_at &&
> > + !res_counter_soft_limit_check_locked(c))
> > + *soft_limit_fail_at = c;
>
> Is this correct way to go ? In following situation,
>
> A/ softlimit=1G usage=1.2G
> B1/ softlimit=400M usage=1G
> C/
> B2/ softlimit=400M usage=200M
>
> "A" will be victim and both of B1 and B2 will be reclaim target, right ?
>
Yes, you remember we discussed adding the oldest ancestor in an older
version. It was your suggestion to add the highest ancestor, have you
changed your mind?
> And I wonder whether we really need *soft_limit_fail_at*... here.
>
Not sure I get your point, could you please clarify this?
> <snip>
>
>
> > +static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem,
> > + bool over_soft_limit)
> > +{
> > + unsigned long next_update;
> > +
> > + if (!over_soft_limit)
> > + return false;
> > +
> > + next_update = mem->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
> > + if (time_after(jiffies, next_update))
> > + return true;
> > +
> > + return false;
> > +}
>
> If I were to write it, this function would be:
>
> static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem,
> 					struct res_counter **failed_at)
> {
> 	unsigned long next_update;
> 	struct res_counter *c;
>
> 	*failed_at = NULL;
> 	next_update = mem->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
> 	if (!time_after(jiffies, next_update))
> 		return false;
> 	/* check softlimit, at most once per interval */
> 	for (c = &mem->res; c; c = c->parent) {
> 		if (!res_counter_check_under_soft_limit(c))
> 			*failed_at = c;
> 	}
> 	return *failed_at != NULL;
> }
>
>
> /*
> * Insert just the ancestor, we should trickle down to the correct
> * cgroup for reclaim, since the other nodes will be below their
> * soft limit
> */
> if (mem_cgroup_soft_limit_check(mem, &soft_fail_res)) {
> mem_over_soft_limit =
> mem_cgroup_from_res_counter(soft_fail_res, res);
> mem_cgroup_update_tree(mem_over_soft_limit);
> }
>
> Then, we really do softlimit check once in interval.
OK, so the trade-off is - every once per interval,
I need to walk up res_counters all over again, hold all locks and
check. Like I mentioned earlier, with the current approach I've
reduced the overhead significantly for non-users. Earlier I was seeing
a small loss in output with reaim, but since I changed
res_counter_uncharge to track soft limits, that difference is negligible
now.
The issue I see with this approach is that if soft-limits were
not enabled, even then we would need to walk up the hierarchy and do
tests, whereas embedding it in res_counter_charge, one simple check
tells us we don't have more to do.
--
Balbir
* Re: [PATCH 4/5] Memory controller soft limit refactor reclaim flags (v7)
2009-03-20 3:47 ` KAMEZAWA Hiroyuki
@ 2009-03-22 14:21 ` Balbir Singh
0 siblings, 0 replies; 54+ messages in thread
From: Balbir Singh @ 2009-03-22 14:21 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-20 12:47:17]:
> On Thu, 19 Mar 2009 22:27:44 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > Impact: Refactor mem_cgroup_hierarchical_reclaim()
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >
> > This patch refactors the arguments passed to
> > mem_cgroup_hierarchical_reclaim() into flags, so that new parameters don't
> > have to be passed as we make the reclaim routine more flexible
> >
> seems nice :)
>
>
Thanks!
--
Balbir
* Re: [PATCH 5/5] Memory controller soft limit reclaim on contention (v7)
2009-03-20 4:06 ` KAMEZAWA Hiroyuki
@ 2009-03-22 14:27 ` Balbir Singh
2009-03-23 0:02 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-22 14:27 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-20 13:06:30]:
> On Thu, 19 Mar 2009 22:27:52 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > Feature: Implement reclaim from groups over their soft limit
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >
> > Changelog v7...v6
> > 1. Refactored out reclaim_options patch into a separate patch
> > 2. Added additional checks for all swap off condition in
> > mem_cgroup_hierarchical_reclaim()
>
> > - did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > + /*
> > + * Try to free up some pages from the memory controllers soft
> > + * limit queue.
> > + */
> > + did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> > + if (order || !did_some_progress)
> > + did_some_progress += try_to_free_pages(zonelist, order,
> > + gfp_mask);
> >
>
> Anyway, my biggest concern is here, always.
>
> By this.
> if (order > 1), try_to_free_pages() is called twice.
try_to_free_mem_cgroup_pages and try_to_free_pages() are called
> Hmm...how about
>
> if (!pages_reclaimed && !(gfp_mask & __GFP_NORETRY)) { /* first loop or noretry */
> did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
OK, I see what you mean... but mem_cgroup_soft_limit_reclaim() is
really a low-overhead call, which will bail out very quickly if
nothing is over its soft limit.
Even if we retry, we do a simple check for soft-limit-reclaim, if
there is really something to be reclaimed, we reclaim from there
first.
> if (!did_some_progress)
> did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> } else
> did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
>
>
> maybe a bit more conservative.
>
>
> And I wonder whether "nodemask" should be checked or not...
> Softlimit reclaim doesn't seem to work well with nodemask...
Doesn't the zonelist take care of nodemask?
--
Balbir
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-22 14:21 ` Balbir Singh
@ 2009-03-22 23:53 ` KAMEZAWA Hiroyuki
2009-03-23 3:34 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-22 23:53 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Sun, 22 Mar 2009 19:51:05 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > if (mem_cgroup_soft_limit_check(mem, &soft_fail_res)) {
> > mem_over_soft_limit =
> > mem_cgroup_from_res_counter(soft_fail_res, res);
> > mem_cgroup_update_tree(mem_over_soft_limit);
> > }
> >
> > Then, we really do softlimit check once in interval.
>
> OK, so the trade-off is - every once per interval,
> I need to walk up res_counters all over again, hold all locks and
> check. Like I mentioned earlier, with the current approach I've
> reduced the overhead significantly for non-users. Earlier I was seeing
> a small loss in output with reaim, but since I changed
> res_counter_uncharge to track soft limits, that difference is negligible
> now.
>
> The issue I see with this approach is that if soft-limits were
> not enabled, even then we would need to walk up the hierarchy and do
> tests, whereas embedding it in res_counter_charge, one simple check
> tells us we don't have more to do.
>
Not at all.
just check softlimit is enabled or not in mem_cgroup_soft_limit_check() by some flag.
Thanks,
-Kame
* Re: [PATCH 5/5] Memory controller soft limit reclaim on contention (v7)
2009-03-22 14:27 ` Balbir Singh
@ 2009-03-23 0:02 ` KAMEZAWA Hiroyuki
2009-03-23 4:12 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 0:02 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Sun, 22 Mar 2009 19:57:48 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-20 13:06:30]:
>
> > On Thu, 19 Mar 2009 22:27:52 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > > Feature: Implement reclaim from groups over their soft limit
> > >
> > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > >
> > > Changelog v7...v6
> > > 1. Refactored out reclaim_options patch into a separate patch
> > > 2. Added additional checks for all swap off condition in
> > > mem_cgroup_hierarchical_reclaim()
> >
> > > - did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > > + /*
> > > + * Try to free up some pages from the memory controllers soft
> > > + * limit queue.
> > > + */
> > > + did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> > > + if (order || !did_some_progress)
> > > + did_some_progress += try_to_free_pages(zonelist, order,
> > > + gfp_mask);
> > >
> >
> > Anyway, my biggest concern is here, always.
> >
> > By this.
> > if (order > 1), try_to_free_pages() is called twice.
>
> try_to_free_mem_cgroup_pages and try_to_free_pages() are called
>
> > Hmm...how about
>
>
> >
> > if (!pages_reclaimed && !(gfp_mask & __GFP_NORETRY)) { /* first loop or noretry */
> > did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
>
> OK, I see what you mean... but mem_cgroup_soft_limit_reclaim() is
> really a low-overhead call, which will bail out very quickly if
> nothing is over its soft limit.
My point is the "if something is over its soft limit" case. Memory is reclaimed twice.
My above code tries to avoid calling memory reclaim twice.
Even if order > 0, mem_cgroup_try_to_free_pages() may be able to recover
the situation. Maybe it's better to allow lumpy reclaim even when
!scanning_global_lru().
> Even if we retry, we do a simple check for soft-limit-reclaim, if
> there is really something to be reclaimed, we reclaim from there
> first.
>
That means you reclaim memory twice ;)
AFAIK,
- fork() -> task_struct/stack
- page tables in x86 PAE mode
require order-1 pages very frequently, and this "call twice" approach will kill
the application performance very effectively.
> > if (!did_some_progress)
> > did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > } else
> > did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> >
> >
> > maybe a bit more conservative.
> >
> >
> > And I wonder whether "nodemask" should be checked or not...
> > Softlimit reclaim doesn't seem to work well with nodemask...
>
> Doesn't the zonelist take care of nodemask?
>
Not sure, but I think there is no check. Hmm, a BUG in vmscan.c?
Thanks,
-Kame
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-22 23:53 ` KAMEZAWA Hiroyuki
@ 2009-03-23 3:34 ` Balbir Singh
2009-03-23 3:38 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 3:34 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 08:53:14]:
> On Sun, 22 Mar 2009 19:51:05 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > > if (mem_cgroup_soft_limit_check(mem, &soft_fail_res)) {
> > > mem_over_soft_limit =
> > > mem_cgroup_from_res_counter(soft_fail_res, res);
> > > mem_cgroup_update_tree(mem_over_soft_limit);
> > > }
> > >
> > > Then, we really do softlimit check once in interval.
> >
> > OK, so the trade-off is - every once per interval,
> > I need to walk up res_counters all over again, hold all locks and
> > check. Like I mentioned earlier, with the current approach I've
> > reduced the overhead significantly for non-users. Earlier I was seeing
> > a small loss in output with reaim, but since I changed
> > res_counter_uncharge to track soft limits, that difference is negligible
> > now.
> >
> > The issue I see with this approach is that if soft-limits were
> > not enabled, even then we would need to walk up the hierarchy and do
> > tests, whereas embedding it in res_counter_charge, one simple check
> > tells us we don't have more to do.
> >
> Not at all.
>
> just check softlimit is enabled or not in mem_cgroup_soft_limit_check() by some flag.
>
So far, we don't use flags, the default soft limit is LONGLONG_MAX, if
hierarchy is enabled, we need to check all the way up. The only way we
check over limit is via a comparison. Are you suggesting we cache the
value or save a special flag whenever the soft limit is set to
anything other than LONGLONG_MAX? It is an indication that we are
using soft limits, but we still need to see if we exceed it.
Why are we trying to over-optimize this path? Like I mentioned
earlier, the degradation is down to the order of noise. Knuth's
lesson, re-learnt several times, is that "premature optimization is
the root of all evil". If we find an issue with performance, we can
definitely go down the road you are suggesting.
--
Balbir
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-23 3:34 ` Balbir Singh
@ 2009-03-23 3:38 ` KAMEZAWA Hiroyuki
2009-03-23 4:15 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 3:38 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 23 Mar 2009 09:04:04 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 08:53:14]:
>
> > On Sun, 22 Mar 2009 19:51:05 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > > > if (mem_cgroup_soft_limit_check(mem, &soft_fail_res)) {
> > > > mem_over_soft_limit =
> > > > mem_cgroup_from_res_counter(soft_fail_res, res);
> > > > mem_cgroup_update_tree(mem_over_soft_limit);
> > > > }
> > > >
> > > > Then, we really do softlimit check once in interval.
> > >
> > > OK, so the trade-off is - every once per interval,
> > > I need to walk up res_counters all over again, hold all locks and
> > > check. Like I mentioned earlier, with the current approach I've
> > > reduced the overhead significantly for non-users. Earlier I was seeing
> > > a small loss in output with reaim, but since I changed
> > > res_counter_uncharge to track soft limits, that difference is negligible
> > > now.
> > >
> > > The issue I see with this approach is that if soft-limits were
> > > not enabled, even then we would need to walk up the hierarchy and do
> > > > tests, whereas embedding it in res_counter_charge, one simple check
> > > tells us we don't have more to do.
> > >
> > Not at all.
> >
> > just check softlimit is enabled or not in mem_cgroup_soft_limit_check() by some flag.
> >
>
> So far, we don't use flags, the default soft limit is LONGLONG_MAX, if
> hierarchy is enabled, we need to check all the way up. The only way we
> check over limit is via a comparison. Are you suggesting we cache the
> value or save a special flag whenever the soft limit is set to
> anything other than LONGLONG_MAX? It is an indication that we are
> using soft limits, but we still need to see if we exceed it.
>
Hmm, OK, then what we have to do here is
"children's softlimit should not be greater than the parent's"
or
"if there is no softlimit, make last_tree_update big enough (jiffies + 1 year)".
This will reduce the checks.
> > Why are we trying to over-optimize this path? Like I mentioned
> > earlier, the degradation is down to the order of noise. Knuth's
> > lesson, re-learnt several times, is that "premature optimization is
> > the root of all evil". If we find an issue with performance, we can
> > definitely go down the road you are suggesting.
>
I just don't like "check always even if unnecessary"
Thanks,
-Kame
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-19 16:57 [PATCH 0/5] Memory controller soft limit patches (v7) Balbir Singh
` (4 preceding siblings ...)
2009-03-19 16:57 ` [PATCH 5/5] Memory controller soft limit reclaim on contention (v7) Balbir Singh
@ 2009-03-23 3:50 ` KAMEZAWA Hiroyuki
2009-03-23 5:22 ` Balbir Singh
2009-03-23 8:31 ` KAMEZAWA Hiroyuki
2009-03-24 17:34 ` Balbir Singh
7 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 3:50 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Thu, 19 Mar 2009 22:27:13 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> New Feature: Soft limits for memory resource controller.
>
> Changelog v7...v6
> 1. Added checks in reclaim path to make sure we don't infinitely loop
> 2. Refactored reclaim options into a new patch
> 3. Tested several scenarios, see tests below
>
> Changelog v6...v5
> 1. If the number of reclaimed pages are zero, select the next mem cgroup
> for reclamation
> 2. Fixed a bug, where key was being updated after insertion into the tree
> 3. Fixed a build issue, when CONFIG_MEM_RES_CTLR is not enabled
>
> Changelog v5...v4
> 1. Several changes to the reclaim logic, please see the patch 4 (reclaim on
> contention). I've experimented with several possibilities for reclaim
> and chose to come back to this due to the excellent behaviour seen while
> testing the patchset.
> 2. Reduced the overhead of soft limits on resource counters very significantly.
> Reaim benchmark now shows almost no drop in performance.
>
> Changelog v4...v3
> 1. Adopted suggestions from Kamezawa to do a per-zone-per-node reclaim
> while doing soft limit reclaim. We don't record priorities while
> doing soft reclaim
> 2. Some of the overheads associated with soft limits (like calculating
> excess each time) is eliminated
> 3. The time_after(jiffies, 0) bug has been fixed
> 4. Tasks are throttled if the mem cgroup they belong to is being soft reclaimed
> and at the same time tasks are increasing the memory footprint and causing
> the mem cgroup to exceed its soft limit.
>
> Changelog v3...v2
> 1. Implemented several review comments from Kosaki-San and Kamezawa-San
> Please see individual changelogs for changes
>
> Changelog v2...v1
> 1. Soft limits now support hierarchies
> 2. Use spinlocks instead of mutexes for synchronization of the RB tree
>
> Here is v7 of the new soft limit implementation. Soft limits is a new feature
> for the memory resource controller, something similar has existed in the
> group scheduler in the form of shares. The CPU controllers interpretation
> of shares is very different though.
>
> Soft limits are the most useful feature to have for environments where
> the administrator wants to overcommit the system, such that only on memory
> contention do the limits become active. The current soft limits implementation
> provides a soft_limit_in_bytes interface for the memory controller and not
> for memory+swap controller. The implementation maintains an RB-Tree of groups
> that exceed their soft limit and starts reclaiming from the group that
> exceeds this limit by the maximum amount.
>
> So far I have the best test results with this patchset. I've experimented with
> several approaches and methods. I might be a little delayed in responding,
> I might have intermittent access to the internet for the next few days.
>
> TODOs
>
> 1. The current implementation maintains the delta from the soft limit
> and pushes back groups to their soft limits, a ratio of delta/soft_limit
> might be more useful
>
>
> Tests
> -----
>
> I've run two memory intensive workloads with differing soft limits and
> seen that they are pushed back to their soft limit on contention. Their usage
> was their soft limit plus additional memory that they were able to grab
> on the system. Soft limit can take a while before we see the expected
> results.
>
> The other tests I've run are
> 1. Deletion of groups while soft limit is in progress in the hierarchy
> 2. Setting the soft limit to zero and running other groups with non-zero
> soft limits.
> 3. Setting the soft limit to zero and testing if the mem cgroup is able
> to use available memory
> 4. Tested the patches with hierarchy enabled
> 5. Tested with swapoff -a, to make sure we don't go into an infinite loop
>
> Please review, comment.
>
Please add text to explain the behavior: what happens in the following situation
/group_A .....softlimit=100M usage=ANON=1G,FILE=1M
/group_B .....softlimit=200M usage=ANON=1G,FILE=1M
/group_C .....softlimit=300M
on a swap-available/swap-less/swap-full system,
and run "dd" or "cp" of big files under group_C.
Thanks,
-Kame
* Re: [PATCH 5/5] Memory controller soft limit reclaim on contention (v7)
2009-03-23 0:02 ` KAMEZAWA Hiroyuki
@ 2009-03-23 4:12 ` Balbir Singh
2009-03-23 4:20 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 4:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 09:02:05]:
> On Sun, 22 Mar 2009 19:57:48 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-20 13:06:30]:
> >
> > > On Thu, 19 Mar 2009 22:27:52 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > >
> > > > Feature: Implement reclaim from groups over their soft limit
> > > >
> > > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > >
> > > > Changelog v7...v6
> > > > 1. Refactored out reclaim_options patch into a separate patch
> > > > 2. Added additional checks for all swap off condition in
> > > > mem_cgroup_hierarchical_reclaim()
> > >
> > > > - did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > > > + /*
> > > > + * Try to free up some pages from the memory controllers soft
> > > > + * limit queue.
> > > > + */
> > > > + did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> > > > + if (order || !did_some_progress)
> > > > + did_some_progress += try_to_free_pages(zonelist, order,
> > > > + gfp_mask);
> > > >
> > >
> > > Anyway, my biggest concern is here, always.
> > >
> > > By this.
> > > if (order > 1), try_to_free_pages() is called twice.
> >
> > try_to_free_mem_cgroup_pages and try_to_free_pages() are called
> >
> > > Hmm...how about
> >
> >
> > >
> > > if (!pages_reclaimed && !(gfp_mask & __GFP_NORETRY)) { /* first loop or noretry */
> > > did_some_progress = mem_cgroup_soft_limit_reclaim(zonelist, gfp_mask);
> >
> > OK, I see what you mean... but mem_cgroup_soft_limit_reclaim() is
> > really a low-overhead call, which will bail out very quickly if
> > nothing is over its soft limit.
>
> My point is the "if something is over its soft limit" case. Memory is reclaimed twice.
> My above code tries to avoid calling memory reclaim twice.
>
Twice if order > 0 or if soft limit reclaim fails or there is nothing
to soft limit reclaim.
> Even if order > 0, mem_cgroup_try_to_free_pages() may be able to recover
> the situation. Maybe it's better to allow lumpy reclaim even when
> !scanning_global_lru().
>
If order > 0, we let the global reclaim handler reclaim (scan the global
LRU). I think the chance of success is higher through that path;
having said that, I have not experimented with trying to allow
lumpy reclaim from memory cgroup LRUs. I think that should be a
separate effort from this one.
>
> > Even if we retry, we do a simple check for soft-limit-reclaim, if
> > there is really something to be reclaimed, we reclaim from there
> > first.
> >
> That means you reclaim memory twice ;)
> AFAIK,
> - fork() -> task_struct/stack
> - page tables in x86 PAE mode
> require order-1 pages very frequently, and this "call twice" approach will kill
> the application performance very effectively.
Yes, it would, if this were the only way to allocate pages. But look at
reality: with kswapd running in the background, how frequently do you
expect to hit the reclaim path? Could you clarify what you mean by
order-1 (2^1)? If so, soft limit reclaim is not invoked and it should
not hurt performance. What am I missing?
>
> > > if (!did_some_progress)
> > > did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > > } else
> > > did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > >
> > >
> > > maybe a bit more conservative.
> > >
> > >
> > > And I wonder whether "nodemask" should be checked or not...
> > > Softlimit reclaim doesn't seem to work well with nodemask...
> >
> > Doesn't the zonelist take care of nodemask?
> >
>
> Not sure, but I think there is no check. Hmm, a BUG in vmscan.c?
>
The zonelist is built using policy_zonelist(), which handles nodemask as
well. That should keep the zonelist and nodemask in sync... no?
--
Balbir
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-23 3:38 ` KAMEZAWA Hiroyuki
@ 2009-03-23 4:15 ` Balbir Singh
2009-03-23 4:23 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 4:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 12:38:41]:
> On Mon, 23 Mar 2009 09:04:04 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 08:53:14]:
> >
> > > On Sun, 22 Mar 2009 19:51:05 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > >
> > > > > if (mem_cgroup_soft_limit_check(mem, &soft_fail_res)) {
> > > > > mem_over_soft_limit =
> > > > > mem_cgroup_from_res_counter(soft_fail_res, res);
> > > > > mem_cgroup_update_tree(mem_over_soft_limit);
> > > > > }
> > > > >
> > > > > Then, we really do softlimit check once in interval.
> > > >
> > > > OK, so the trade-off is - every once per interval,
> > > > I need to walk up res_counters all over again, hold all locks and
> > > > check. Like I mentioned earlier, with the current approach I've
> > > > reduced the overhead significantly for non-users. Earlier I was seeing
> > > > a small loss in output with reaim, but since I changed
> > > > res_counter_uncharge to track soft limits, that difference is negligible
> > > > now.
> > > >
> > > > The issue I see with this approach is that if soft-limits were
> > > > not enabled, even then we would need to walk up the hierarchy and do
> > > > tests, where as embedding it in res_counter_charge, one simple check
> > > > tells us we don't have more to do.
> > > >
> > > Not at all.
> > >
> > > just check softlimit is enabled or not in mem_cgroup_soft_limit_check() by some flag.
> > >
> >
> > So far, we don't use flags, the default soft limit is LONGLONG_MAX, if
> > hierarchy is enabled, we need to check all the way up. The only way we
> > check over limit is via a comparison. Are you suggesting we cache the
> > value or save a special flag whenever the soft limit is set to
> > anything other than LONGLONG_MAX? It is an indication that we are
> > using soft limits, but we still need to see if we exceed it.
> >
>
> Hmm ok, then, what we have to do here is
> "children's softlimit should not be greater than parent's".
> or
> "if no softlimit, make last_tree_update to be enough big (jiffies + 1year)"
> This will reduce the check.
>
No... That breaks the hierarchy and changes limit behaviour. Today a child's
hard limit can be greater than its parent's; if so, we bottleneck at the
parent and catch it there. I am not changing semantics.
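Balbir's point above (a child's limit may legally exceed its parent's; charging simply bottlenecks at the parent) can be sketched with a toy model. This is illustrative C only, not the kernel's res_counter code; all names are made up:

```c
#include <stddef.h>

/* Toy model of hierarchical charging: a charge walks from the child up
 * to the root and fails at whichever ancestor would exceed its own hard
 * limit. A child limit larger than the parent's is legal; the parent
 * simply becomes the bottleneck. Illustrative only, not kernel code. */
struct counter {
	unsigned long usage;
	unsigned long limit;
	struct counter *parent;
};

/* Returns 0 on success, -1 if some level would exceed its hard limit;
 * rolls back any partial charges on failure. */
int counter_charge(struct counter *c, unsigned long val)
{
	struct counter *cur, *undo;

	for (cur = c; cur; cur = cur->parent) {
		if (cur->usage + val > cur->limit)
			goto rollback;
		cur->usage += val;
	}
	return 0;

rollback:
	for (undo = c; undo != cur; undo = undo->parent)
		undo->usage -= val;
	return -1;
}
```

With a child limit of 1000 and a parent limit of 100, charges succeed until the parent's 100 is hit, mirroring the bottleneck-at-parent behaviour described above.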
> > Why are we trying to over optimize this path? Like I mentioned
> > earlier, the degradation is down to the order of noise. Knuth,
> > re-learnt several times that "premature optimization is the root of
> > all evil". If we find an issue with performance, we can definitely go
> > down the road you are suggesting.
> >
>
> I just don't like "check always even if unnecessary"
>
We do that even for hard limits today. The price (if any) is paid on
enabling those features. My tests don't show the overhead. If we do
see it in the future, we can revisit.
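The comparison-based check being discussed (the soft limit defaults to LONGLONG_MAX, so an unset group never compares as over-limit and no enable flag is needed) can be sketched roughly like this. Types and names are hypothetical stand-ins, not the actual memcg code; ULLONG_MAX plays the role of the thread's LONGLONG_MAX default:

```c
#include <limits.h>
#include <stddef.h>

/* Illustrative sketch: soft_limit defaults to ULLONG_MAX, so an
 * unconfigured group can never compare as over-limit -- the comparison
 * itself is the "is this enabled?" check. The hierarchy walk is still
 * paid every time, which is exactly the overhead under debate. */
struct memcg {
	unsigned long long usage;
	unsigned long long soft_limit;	/* ULLONG_MAX == unset */
	struct memcg *parent;
};

/* Walk up the hierarchy; return the topmost group over its soft
 * limit, or NULL when nothing exceeds. */
struct memcg *soft_limit_check(struct memcg *m)
{
	struct memcg *over = NULL;

	for (; m; m = m->parent)
		if (m->usage > m->soft_limit)
			over = m;
	return over;
}
```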
--
Balbir
* Re: [PATCH 5/5] Memory controller soft limit reclaim on contention (v7)
2009-03-23 4:12 ` Balbir Singh
@ 2009-03-23 4:20 ` KAMEZAWA Hiroyuki
2009-03-23 8:28 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 4:20 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 23 Mar 2009 09:42:53 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > Even if order > 0, mem_cgroup_try_to_free_pages() may be able to recover
> > the situation. Maybe it's better to allow lumpty-reclaim even when
> > !scanning_global_lru().
> >
>
> if order > 0, we let the global reclaim handler reclaim (scan global
> LRU). I think the chance of success is higher through that path,
> having said that I have not experimented with trying to allow
> lumpy-reclaim from memory cgroup LRU's. I think that should be a
> separate effort from this one.
>
But ignoring that will double the cost....
> >
> > > Even if we retry, we do a simple check for soft-limit-reclaim, if
> > > there is really something to be reclaimed, we reclaim from there
> > > first.
> > >
> > That means you reclaim memory twice ;)
> > AFAIK,
> > - fork() -> task_struct/stack
> > page table in x86 PAE mode
> > requires order-1 pages very frequently and this "call twice" approach will kill
> > the application peformance very effectively.
>
> Yes, it would if this was the only way to allocate pages. But look at
> reality, with kswapd running in the background, how frequently do you
> expect to hit the reclaim path. Could you clarify what you mean by
> order-1 (2^1), if so soft limit reclaim is not invoked and it should
> not hurt performance. What am I missing?
>
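Balbir's quoted point above (soft limit reclaim is only invoked for order-0 allocations; higher orders fall through to global reclaim) might be sketched as follows. The helper functions are hypothetical toy stand-ins, not the actual patch:

```c
/* Toy stand-ins so the sketch is self-contained; the real code would
 * call the memcg soft-limit reclaim and try_to_free_pages(). */
static unsigned long soft_limit_reclaim(void) { return 32; }
static unsigned long global_reclaim(int order) { return 8UL << order; }

/* Sketch of the gating: soft-limit reclaim is only consulted for
 * order-0 allocations; higher-order requests (and a fruitless soft
 * reclaim) fall through to global reclaim. */
unsigned long reclaim_pages(int order, int groups_over_soft_limit)
{
	unsigned long freed = 0;

	if (order == 0 && groups_over_soft_limit)
		freed = soft_limit_reclaim();	/* push offenders back first */
	if (freed == 0)
		freed = global_reclaim(order);	/* fall back to global LRU */
	return freed;
}
```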
Hmm, maybe running hackbench under memory pressure will tell the answer.
Anyway, please get an Ack from the people responsible for memory management:
Rik or Mel or Christoph or Nick or someone.
> >
> > > > if (!did_some_progress)
> > > > did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > > > }else
> > > > did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > > >
> > > >
> > > > maybe a bit more concervative.
> > > >
> > > >
> > > > And I wonder "nodemask" should be checked or not..
> > > > softlimit reclaim doesn't seem to work well with nodemask...
> > >
> > > Doesn't the zonelist take care of nodemask?
> > >
> >
> > Not sure, but I think, no check. hmm BUG in vmscan.c ?
> >
>
> The zonelist is built using policy_zonelist, that handles nodemask as
> well. That should keep the zonelist and nodemask in sync.. no?
>
I already sent a patch.
Thanks,
-Kame
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-23 4:15 ` Balbir Singh
@ 2009-03-23 4:23 ` KAMEZAWA Hiroyuki
2009-03-23 8:22 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 4:23 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 23 Mar 2009 09:45:59 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 12:38:41]:
>
> > On Mon, 23 Mar 2009 09:04:04 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 08:53:14]:
> > >
> > > > On Sun, 22 Mar 2009 19:51:05 +0530
> > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > >
> > > > > > if (mem_cgroup_soft_limit_check(mem, &soft_fail_res)) {
> > > > > > mem_over_soft_limit =
> > > > > > mem_cgroup_from_res_counter(soft_fail_res, res);
> > > > > > mem_cgroup_update_tree(mem_over_soft_limit);
> > > > > > }
> > > > > >
> > > > > > Then, we really do softlimit check once in interval.
> > > > >
> > > > > OK, so the trade-off is - every once per interval,
> > > > > I need to walk up res_counters all over again, hold all locks and
> > > > > check. Like I mentioned earlier, with the current approach I've
> > > > > reduced the overhead significantly for non-users. Earlier I was seeing
> > > > > a small loss in output with reaim, but since I changed
> > > > > res_counter_uncharge to track soft limits, that difference is negligible
> > > > > now.
> > > > >
> > > > > The issue I see with this approach is that if soft-limits were
> > > > > not enabled, even then we would need to walk up the hierarchy and do
> > > > > tests, where as embedding it in res_counter_charge, one simple check
> > > > > tells us we don't have more to do.
> > > > >
> > > > Not at all.
> > > >
> > > > just check softlimit is enabled or not in mem_cgroup_soft_limit_check() by some flag.
> > > >
> > >
> > > So far, we don't use flags, the default soft limit is LONGLONG_MAX, if
> > > hierarchy is enabled, we need to check all the way up. The only way we
> > > check over limit is via a comparison. Are you suggesting we cache the
> > > value or save a special flag whenever the soft limit is set to
> > > anything other than LONGLONG_MAX? It is an indication that we are
> > > using soft limits, but we still need to see if we exceed it.
> > >
> >
> > Hmm ok, then, what we have to do here is
> > "children's softlimit should not be greater than parent's".
> > or
> > "if no softlimit, make last_tree_update to be enough big (jiffies + 1year)"
> > This will reduce the check.
> >
>
> No... That breaks hierarchy and changes limit behaviour. Today a hard
> limit can be greater than parent, if so we bottle-neck at the parent
> and catch it. I am not changing semantics.
>
> > > Why are we trying to over optimize this path? Like I mentioned
> > > earlier, the degradation is down to the order of noise. Knuth,
> > > re-learnt several times that "premature optimization is the root of
> > > all evil". If we find an issue with performance, we can definitely go
> > > down the road you are suggesting.
> > >
> >
> > I just don't like "check always even if unnecessary"
> >
>
> We do that even for hard limits today. The price (if any) is paid on
> enabling those features. My tests don't show the overhead. If we do
> see them in the future, we can revisit.
>
OK, then please don't expect an Ack from me.
Thanks,
-Kame
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 3:50 ` [PATCH 0/5] Memory controller soft limit patches (v7) KAMEZAWA Hiroyuki
@ 2009-03-23 5:22 ` Balbir Singh
2009-03-23 5:31 ` KAMEZAWA Hiroyuki
2009-03-23 6:12 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 5:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 12:50:05]:
> On Thu, 19 Mar 2009 22:27:13 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >
> > New Feature: Soft limits for memory resource controller.
> >
> > Changelog v7...v6
> > 1. Added checks in reclaim path to make sure we don't infinitely loop
> > 2. Refactored reclaim options into a new patch
> > 3. Tested several scenarios, see tests below
> >
> > Changelog v6...v5
> > 1. If the number of reclaimed pages are zero, select the next mem cgroup
> > for reclamation
> > 2. Fixed a bug, where key was being updated after insertion into the tree
> > 3. Fixed a build issue, when CONFIG_MEM_RES_CTLR is not enabled
> >
> > Changelog v5...v4
> > 1. Several changes to the reclaim logic, please see the patch 4 (reclaim on
> > contention). I've experimented with several possibilities for reclaim
> > and chose to come back to this due to the excellent behaviour seen while
> > testing the patchset.
> > 2. Reduced the overhead of soft limits on resource counters very significantly.
> > Reaim benchmark now shows almost no drop in performance.
> >
> > Changelog v4...v3
> > 1. Adopted suggestions from Kamezawa to do a per-zone-per-node reclaim
> > while doing soft limit reclaim. We don't record priorities while
> > doing soft reclaim
> > 2. Some of the overheads associated with soft limits (like calculating
> > excess each time) is eliminated
> > 3. The time_after(jiffies, 0) bug has been fixed
> > 4. Tasks are throttled if the mem cgroup they belong to is being soft reclaimed
> > and at the same time tasks are increasing the memory footprint and causing
> > the mem cgroup to exceed its soft limit.
> >
> > Changelog v3...v2
> > 1. Implemented several review comments from Kosaki-San and Kamezawa-San
> > Please see individual changelogs for changes
> >
> > Changelog v2...v1
> > 1. Soft limits now support hierarchies
> > 2. Use spinlocks instead of mutexes for synchronization of the RB tree
> >
> > Here is v7 of the new soft limit implementation. Soft limits is a new feature
> > for the memory resource controller, something similar has existed in the
> > group scheduler in the form of shares. The CPU controllers interpretation
> > of shares is very different though.
> >
> > Soft limits are the most useful feature to have for environments where
> > the administrator wants to overcommit the system, such that only on memory
> > contention do the limits become active. The current soft limits implementation
> > provides a soft_limit_in_bytes interface for the memory controller and not
> > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > that exceed their soft limit and starts reclaiming from the group that
> > exceeds this limit by the maximum amount.
> >
> > So far I have the best test results with this patchset. I've experimented with
> > several approaches and methods. I might be a little delayed in responding,
> > I might have intermittent access to the internet for the next few days.
> >
> > TODOs
> >
> > 1. The current implementation maintains the delta from the soft limit
> > and pushes back groups to their soft limits, a ratio of delta/soft_limit
> > might be more useful
> >
> >
> > Tests
> > -----
> >
> > I've run two memory intensive workloads with differing soft limits and
> > seen that they are pushed back to their soft limit on contention. Their usage
> > was their soft limit plus additional memory that they were able to grab
> > on the system. Soft limit can take a while before we see the expected
> > results.
> >
> > The other tests I've run are
> > 1. Deletion of groups while soft limit is in progress in the hierarchy
> > 2. Setting the soft limit to zero and running other groups with non-zero
> > soft limits.
> > 3. Setting the soft limit to zero and testing if the mem cgroup is able
> > to use available memory
> > 4. Tested the patches with hierarchy enabled
> > 5. Tested with swapoff -a, to make sure we don't go into an infinite loop
> >
> > Please review, comment.
> >
>
> please add text to explain the behaior, what happens in the following situation.
>
>
> /group_A .....softlimit=100M usage=ANON=1G,FILE=1M
> /group_B .....softlimit=200M usage=ANON=1G,FILE=1M
> /group_C .....softlimit=300M
> on swap-available/swap-less/swap-full system.
>
> And Run run "dd" or "cp" of big files under group_C.
That depends on the memory on the system; on my system with 4GB, things
run just fine.
I tried the following
/group_A soft_limit=100M, needed memory=3200M (allocate and touch)
/group_B soft_limit=200M, needed memory=3200M
/group_C soft_limit=300M, needed memory=1024M (dd in a while loop)
group_B and group_A had a difference of 200M in their allocations on
average. group_C touched 800M as maximum usage in bytes and around
500M on average.
With swap turned off
group_C was hit the most with a lot of reclaim taking place on it.
group_A was OOM killed and immediately after group_B got all the
memory it needed and completed successfully.
I have one large swap partition, so I could not test the partial-swap
scenario.
--
Balbir
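For reference, the selection policy from the cover letter quoted above (reclaim from the group exceeding its soft limit by the maximum amount) can be sketched as below. The real patches keep candidates sorted in an RB-tree; a linear scan over a toy struct expresses the same policy, and all names are illustrative:

```c
#include <stddef.h>

/* Toy version of the selection policy: among groups over their soft
 * limit, pick the one exceeding it by the largest amount. */
struct group {
	const char *name;
	unsigned long usage;
	unsigned long soft_limit;
};

const struct group *pick_victim(const struct group *g, size_t n)
{
	const struct group *victim = NULL;
	unsigned long worst = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (g[i].usage <= g[i].soft_limit)
			continue;	/* within its soft limit: skip */
		if (g[i].usage - g[i].soft_limit > worst) {
			worst = g[i].usage - g[i].soft_limit;
			victim = &g[i];
		}
	}
	return victim;			/* NULL if nobody is over */
}
```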
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 5:22 ` Balbir Singh
@ 2009-03-23 5:31 ` KAMEZAWA Hiroyuki
2009-03-23 6:12 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 5:31 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 23 Mar 2009 10:52:47 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > please add text to explain the behaior, what happens in the following situation.
> >
> >
> > /group_A .....softlimit=100M usage=ANON=1G,FILE=1M
> > /group_B .....softlimit=200M usage=ANON=1G,FILE=1M
> > /group_C .....softlimit=300M
> > on swap-available/swap-less/swap-full system.
> >
> > And Run run "dd" or "cp" of big files under group_C.
>
> That depends on the memory on the system, on my system with 4G, things
> run just fine.
>
fine ?
> I tried the following
>
> /group_A soft_limit=100M, needed memory=3200M (allocate and touch)
> /group_B soft_limit=200M, needed memory=3200M
> /group_C soft_limit=300M, needed memory=1024M (dd in a while loop)
>
> group_B and group_A had a difference of 200M in their allocations on
> average. group_C touched 800M as maximum usage in bytes and around
> 500M on the average.
>
> With swap turned off
>
> group_C was hit the most with a lot of reclaim taking place on it.
> group_A was OOM killed and immediately after group_B got all the
> memory it needed and completed successfully.
Hmm? An OOM kill seems to happen even without "dd" in group C...
Thanks,
-Kame
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 5:22 ` Balbir Singh
2009-03-23 5:31 ` KAMEZAWA Hiroyuki
@ 2009-03-23 6:12 ` KAMEZAWA Hiroyuki
2009-03-23 6:17 ` KAMEZAWA Hiroyuki
2009-03-23 9:41 ` Balbir Singh
1 sibling, 2 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 6:12 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 23 Mar 2009 10:52:47 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> I have one large swap partition, so I could not test the partial-swap
> scenario.
>
Please go ahead as you like; there seems to be no landing point now, and I'd
like to see what I can later. I'll send no more ACKs nor NACKs.
But please get an Ack from someone responsible for global memory reclaim,
especially for the hooks in try_to_free_pages().
And please make it clear in the documentation that:
- Depending on the system, this may increase swap usage.
- Depending on the system, this may not work as the user expects a hard limit to.
Considering the corner cases, this is a very complicated, difficult-to-use feature.
-Kame
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 6:12 ` KAMEZAWA Hiroyuki
@ 2009-03-23 6:17 ` KAMEZAWA Hiroyuki
2009-03-23 6:35 ` KOSAKI Motohiro
2009-03-23 8:35 ` Balbir Singh
2009-03-23 9:41 ` Balbir Singh
1 sibling, 2 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 6:17 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: balbir, linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro,
Rik van Riel, Andrew Morton
On Mon, 23 Mar 2009 15:12:45 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 23 Mar 2009 10:52:47 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > I have one large swap partition, so I could not test the partial-swap
> > scenario.
> >
> plz go ahead as you like, Seems no landing point now and I'd like to see
> what I can, later. I'll send no ACK nor NACK, more.
>
But I dislike the whole concept entirely.
Thanks,
-Kame
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 6:17 ` KAMEZAWA Hiroyuki
@ 2009-03-23 6:35 ` KOSAKI Motohiro
2009-03-23 8:24 ` Balbir Singh
2009-03-23 8:35 ` Balbir Singh
1 sibling, 1 reply; 54+ messages in thread
From: KOSAKI Motohiro @ 2009-03-23 6:35 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: kosaki.motohiro, balbir, linux-mm, YAMAMOTO Takashi, lizf,
Rik van Riel, Andrew Morton
> On Mon, 23 Mar 2009 15:12:45 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Mon, 23 Mar 2009 10:52:47 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > I have one large swap partition, so I could not test the partial-swap
> > > scenario.
> > >
> > plz go ahead as you like, Seems no landing point now and I'd like to see
> > what I can, later. I'll send no ACK nor NACK, more.
> >
> But I dislike the whole concept, at all.
Kamezawa-san, this implementation sucks, but I think the softlimit concept
itself doesn't.
So, I would suggest discussing this feature based on your
"memcg softlimit (Another one) v4" patch. I expect I can ack it after a few spins.
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-23 4:23 ` KAMEZAWA Hiroyuki
@ 2009-03-23 8:22 ` Balbir Singh
2009-03-23 8:47 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 8:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 13:23:08]:
> On Mon, 23 Mar 2009 09:45:59 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 12:38:41]:
> >
> > > On Mon, 23 Mar 2009 09:04:04 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > >
> > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 08:53:14]:
> > > >
> > > > > On Sun, 22 Mar 2009 19:51:05 +0530
> > > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > >
> > > > > > > if (mem_cgroup_soft_limit_check(mem, &soft_fail_res)) {
> > > > > > > mem_over_soft_limit =
> > > > > > > mem_cgroup_from_res_counter(soft_fail_res, res);
> > > > > > > mem_cgroup_update_tree(mem_over_soft_limit);
> > > > > > > }
> > > > > > >
> > > > > > > Then, we really do softlimit check once in interval.
> > > > > >
> > > > > > OK, so the trade-off is - every once per interval,
> > > > > > I need to walk up res_counters all over again, hold all locks and
> > > > > > check. Like I mentioned earlier, with the current approach I've
> > > > > > reduced the overhead significantly for non-users. Earlier I was seeing
> > > > > > a small loss in output with reaim, but since I changed
> > > > > > res_counter_uncharge to track soft limits, that difference is negligible
> > > > > > now.
> > > > > >
> > > > > > The issue I see with this approach is that if soft-limits were
> > > > > > not enabled, even then we would need to walk up the hierarchy and do
> > > > > > tests, where as embedding it in res_counter_charge, one simple check
> > > > > > tells us we don't have more to do.
> > > > > >
> > > > > Not at all.
> > > > >
> > > > > just check softlimit is enabled or not in mem_cgroup_soft_limit_check() by some flag.
> > > > >
> > > >
> > > > So far, we don't use flags, the default soft limit is LONGLONG_MAX, if
> > > > hierarchy is enabled, we need to check all the way up. The only way we
> > > > check over limit is via a comparison. Are you suggesting we cache the
> > > > value or save a special flag whenever the soft limit is set to
> > > > anything other than LONGLONG_MAX? It is an indication that we are
> > > > using soft limits, but we still need to see if we exceed it.
> > > >
> > >
> > > Hmm ok, then, what we have to do here is
> > > "children's softlimit should not be greater than parent's".
> > > or
> > > "if no softlimit, make last_tree_update to be enough big (jiffies + 1year)"
> > > This will reduce the check.
> > >
> >
> > No... That breaks hierarchy and changes limit behaviour. Today a hard
> > limit can be greater than parent, if so we bottle-neck at the parent
> > and catch it. I am not changing semantics.
> >
> > > > Why are we trying to over optimize this path? Like I mentioned
> > > > earlier, the degradation is down to the order of noise. Knuth,
> > > > re-learnt several times that "premature optimization is the root of
> > > > all evil". If we find an issue with performance, we can definitely go
> > > > down the road you are suggesting.
> > > >
> > >
> > > I just don't like "check always even if unnecessary"
> > >
> >
> > We do that even for hard limits today. The price (if any) is paid on
> > enabling those features. My tests don't show the overhead. If we do
> > see them in the future, we can revisit.
> >
> ok, plz don't expext Ack from me.
We can agree to disagree; as long as you don't NACK the patches
without any testing, I don't see why we can't go ahead. The current
design for soft limits is very much in line with our hierarchical hard
limit design.
I don't see why you are harping on something that you only think might
be a problem, and want to over-optimize even without tests. Fix
something when you can see the problem; on my system I don't see it. I
am willing to consider alternatives or moving away from the current
coding style *iff* it needs to be redone for better performance.
What I am proposing is that we do iterative development, get the
functionality right and then if needed tune for performance.
--
Balbir
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 6:35 ` KOSAKI Motohiro
@ 2009-03-23 8:24 ` Balbir Singh
2009-03-23 9:12 ` KOSAKI Motohiro
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 8:24 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
Andrew Morton
* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-23 15:35:50]:
> > On Mon, 23 Mar 2009 15:12:45 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Mon, 23 Mar 2009 10:52:47 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > I have one large swap partition, so I could not test the partial-swap
> > > > scenario.
> > > >
> > > plz go ahead as you like, Seems no landing point now and I'd like to see
> > > what I can, later. I'll send no ACK nor NACK, more.
> > >
> > But I dislike the whole concept, at all.
>
>
> Kamezawa-san, This implementation is suck. but I think softlimit concept
> itself isn't suck.
>
Just because of the reclaim factor? Feel free to improve it
iteratively. Like I said to Kamezawa, don't over-optimize in the first
iteration. Premature optimization is the root of all evil.
> So, I would suggested discuss this feature based on your
> "memcg softlimit (Another one) v4" patch. I exept I can ack it after few spin.
Kame's implementation sucked quite badly; please see my posted test
results. Basic, bare-minimum functionality did not work.
--
Balbir
* Re: [PATCH 5/5] Memory controller soft limit reclaim on contention (v7)
2009-03-23 4:20 ` KAMEZAWA Hiroyuki
@ 2009-03-23 8:28 ` Balbir Singh
2009-03-23 8:30 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 8:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 13:20:45]:
> On Mon, 23 Mar 2009 09:42:53 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > > Even if order > 0, mem_cgroup_try_to_free_pages() may be able to recover
> > > the situation. Maybe it's better to allow lumpty-reclaim even when
> > > !scanning_global_lru().
> > >
> >
> > if order > 0, we let the global reclaim handler reclaim (scan global
> > LRU). I think the chance of success is higher through that path,
> > having said that I have not experimented with trying to allow
> > lumpy-reclaim from memory cgroup LRU's. I think that should be a
> > separate effort from this one.
> >
>
> But ignoring that will make the cost twice....
>
OK, let's fix it, but as a separate effort and with data that shows
us the same.
> > >
> > > > Even if we retry, we do a simple check for soft-limit-reclaim, if
> > > > there is really something to be reclaimed, we reclaim from there
> > > > first.
> > > >
> > > That means you reclaim memory twice ;)
> > > AFAIK,
> > > - fork() -> task_struct/stack
> > > page table in x86 PAE mode
> > > requires order-1 pages very frequently and this "call twice" approach will kill
> > > the application peformance very effectively.
> >
> > Yes, it would if this was the only way to allocate pages. But look at
> > reality, with kswapd running in the background, how frequently do you
> > expect to hit the reclaim path. Could you clarify what you mean by
> > order-1 (2^1), if so soft limit reclaim is not invoked and it should
> > not hurt performance. What am I missing?
> >
> Hmm, maybe running hackbench under memory pressure will tell the answer.
> Anyway, plz get Ack from people for memory management.
> Rik or Mel or Christoph or Nick or someone.
>
Rik is on the cc, and so is linux-mm. I hope they'll look at it.
>
> > >
> > > > > if (!did_some_progress)
> > > > > did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > > > > }else
> > > > > did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
> > > > >
> > > > >
> > > > > maybe a bit more concervative.
> > > > >
> > > > >
> > > > > And I wonder "nodemask" should be checked or not..
> > > > > softlimit reclaim doesn't seem to work well with nodemask...
> > > >
> > > > Doesn't the zonelist take care of nodemask?
> > > >
> > >
> > > Not sure, but I think, no check. hmm BUG in vmscan.c ?
> > >
> >
> > The zonelist is built using policy_zonelist, that handles nodemask as
> > well. That should keep the zonelist and nodemask in sync.. no?
> >
>
> I already sent a patch.
I've seen it; the basic assumption of the patch is that
policy_zonelist() and for_each_zone_zonelist_nodemask(), where the nodemask
is derived from policy_nodemask(), give different results.. correct?
--
Balbir
* Re: [PATCH 5/5] Memory controller soft limit reclaim on contention (v7)
2009-03-23 8:28 ` Balbir Singh
@ 2009-03-23 8:30 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 8:30 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 23 Mar 2009 13:58:22 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> I've seen it, the basic assumption of the patch is that
>
> policy_zonelist() and for_each_zone_zonelist_nodemask() where nodemask
> is derived from policy_nodemask() give different results.. correct?
>
The basic thinking is that there is alloc_pages_nodemask(), but try_to_free_pages()
ignores the nodemask. So either removing alloc_pages_nodemask() or taking care
of the nodemask in try_to_free_pages() is necessary.
How the nodemask/zonelist is built is out of scope here.
Thanks,
-Kame
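Kame's point (a reclaim pass should honour the allocation's nodemask, the way for_each_zone_zonelist_nodemask() does on the allocation side) can be illustrated with a toy model. The bitmask and zone layout here are stand-ins, not kernel structures:

```c
#include <stddef.h>

/* Toy illustration: a nodemask-aware reclaim pass skips zones that
 * live on excluded nodes, so reclaim effort lands only where the
 * allocation could actually be satisfied. */
struct zone {
	int node;
	unsigned long reclaimable;
};

unsigned long reclaim_with_nodemask(const struct zone *zones, size_t n,
				    unsigned long nodemask)
{
	unsigned long freed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!(nodemask & (1UL << zones[i].node)))
			continue;	/* node excluded: skip its zones */
		freed += zones[i].reclaimable;
	}
	return freed;
}
```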
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-19 16:57 [PATCH 0/5] Memory controller soft limit patches (v7) Balbir Singh
` (5 preceding siblings ...)
2009-03-23 3:50 ` [PATCH 0/5] Memory controller soft limit patches (v7) KAMEZAWA Hiroyuki
@ 2009-03-23 8:31 ` KAMEZAWA Hiroyuki
2009-03-24 17:34 ` Balbir Singh
7 siblings, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 8:31 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Thu, 19 Mar 2009 22:27:13 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> New Feature: Soft limits for memory resource controller.
>
Hmm, did you test against a _slow_ workload?
I tried this. My system has 2 CPUs and 1.6GB of memory.
I create 2 cgroups, A and B, allocate 1GB of anonymous memory in B,
and run a kernel build in group A (%make -j 4).
0. echo 3 > /proc/sys/vm/drop_caches
1. set A's soft limit to 300M.
2. allocate 1GB of memory in group B.
3. run %make -j 4 in group A.
=> 250MB of memory are swapped out from B.
0. echo 3 > /proc/sys/vm/drop_caches
1. set A's soft limit to 10GB (never reached).
2. allocate 1GB of memory in group B.
3. run %make -j 4 in group A.
=> 350MB of memory are swapped out from B.
Is this acceptable for your purpose? 250MB of swap-out is too much...
(the numbers may vary depending on the environment.)
Thanks,
-Kame
> Changelog v7...v6
> 1. Added checks in reclaim path to make sure we don't infinitely loop
> 2. Refactored reclaim options into a new patch
> 3. Tested several scenarios, see tests below
>
> Changelog v6...v5
> 1. If the number of reclaimed pages is zero, select the next mem cgroup
> for reclamation
> 2. Fixed a bug, where key was being updated after insertion into the tree
> 3. Fixed a build issue, when CONFIG_MEM_RES_CTLR is not enabled
>
> Changelog v5...v4
> 1. Several changes to the reclaim logic, please see the patch 4 (reclaim on
> contention). I've experimented with several possibilities for reclaim
> and chose to come back to this due to the excellent behaviour seen while
> testing the patchset.
> 2. Reduced the overhead of soft limits on resource counters very significantly.
> Reaim benchmark now shows almost no drop in performance.
>
> Changelog v4...v3
> 1. Adopted suggestions from Kamezawa to do a per-zone-per-node reclaim
> while doing soft limit reclaim. We don't record priorities while
> doing soft reclaim
> 2. Some of the overheads associated with soft limits (like calculating
> excess each time) are eliminated
> 3. The time_after(jiffies, 0) bug has been fixed
> 4. Tasks are throttled if the mem cgroup they belong to is being soft reclaimed
> and at the same time tasks are increasing the memory footprint and causing
> the mem cgroup to exceed its soft limit.
>
> Changelog v3...v2
> 1. Implemented several review comments from Kosaki-San and Kamezawa-San
> Please see individual changelogs for changes
>
> Changelog v2...v1
> 1. Soft limits now support hierarchies
> 2. Use spinlocks instead of mutexes for synchronization of the RB tree
>
> Here is v7 of the new soft limit implementation. Soft limits are a new feature
> for the memory resource controller; something similar has existed in the
> group scheduler in the form of shares. The CPU controller's interpretation
> of shares is very different, though.
>
> Soft limits are the most useful feature to have for environments where
> the administrator wants to overcommit the system, such that only on memory
> contention do the limits become active. The current soft limits implementation
> provides a soft_limit_in_bytes interface for the memory controller and not
> for memory+swap controller. The implementation maintains an RB-Tree of groups
> that exceed their soft limit and starts reclaiming from the group that
> exceeds this limit by the maximum amount.
>
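The RB-tree policy just described — track groups that exceed their soft limit and reclaim first from the one exceeding it by the largest amount — can be sketched as a toy model (plain Python, not the kernel code; the kernel keeps a per-zone RB-tree keyed on the excess, modeled here with a list and max()):

```python
# Toy model of soft-limit victim selection: reclaim starts from the
# group that exceeds its soft limit by the maximum amount.

class Group:
    def __init__(self, name, usage, soft_limit):
        self.name = name
        self.usage = usage
        self.soft_limit = soft_limit

    @property
    def excess(self):
        # How far usage is over the soft limit (0 if under it).
        return max(0, self.usage - self.soft_limit)

def pick_victim(groups):
    # The kernel would read the rightmost RB-tree node; a max() over the
    # excess values expresses the same selection rule.
    over = [g for g in groups if g.excess > 0]
    return max(over, key=lambda g: g.excess) if over else None

groups = [
    Group("A", usage=400, soft_limit=300),  # 100 over
    Group("B", usage=900, soft_limit=200),  # 700 over -> reclaimed first
    Group("C", usage=100, soft_limit=500),  # under its soft limit
]
print(pick_victim(groups).name)  # B
```

When no group is over its soft limit, there is no victim and soft-limit reclaim has nothing to do.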
> So far I have the best test results with this patchset. I've experimented with
> several approaches and methods. I might be a little delayed in responding,
> I might have intermittent access to the internet for the next few days.
>
> TODOs
>
> 1. The current implementation maintains the delta from the soft limit
> and pushes back groups to their soft limits; a ratio of delta/soft_limit
> might be more useful
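The TODO above can be made concrete: ranking groups by absolute excess versus by the excess/soft_limit ratio can pick different reclaim victims. A small illustration with made-up numbers:

```python
# (usage, soft_limit) in MB -- illustrative values only.
groups = {
    "big":   (10340, 10240),  # 100MB over a 10GB soft limit (~1% over)
    "small": (150, 100),      # 50MB over a 100MB soft limit (50% over)
}

def delta(name):
    usage, soft = groups[name]
    return usage - soft

def ratio(name):
    usage, soft = groups[name]
    return (usage - soft) / soft

print(max(groups, key=delta))  # "big": larger absolute excess
print(max(groups, key=ratio))  # "small": relatively much further over
```

The delta metric would push back the large group first even though, relative to its configured limit, it is barely over; the ratio metric targets the group that is proportionally the worst offender.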
>
>
> Tests
> -----
>
> I've run two memory-intensive workloads with differing soft limits and
> seen that they are pushed back to their soft limit on contention. Their usage
> was their soft limit plus additional memory that they were able to grab
> on the system. Soft limits can take a while before we see the expected
> results.
>
> The other tests I've run are
> 1. Deletion of groups while soft limit is in progress in the hierarchy
> 2. Setting the soft limit to zero and running other groups with non-zero
> soft limits.
> 3. Setting the soft limit to zero and testing if the mem cgroup is able
> to use available memory
> 4. Tested the patches with hierarchy enabled
> 5. Tested with swapoff -a, to make sure we don't go into an infinite loop
>
> Please review, comment.
>
> Series
> ------
>
>
> memcg-soft-limits-documentation.patch
> memcg-soft-limits-interface.patch
> memcg-soft-limits-organize.patch
> memcg-soft-limits-refactor-reclaim-bits
> memcg-soft-limits-reclaim-on-contention.patch
>
>
> --
> Balbir
>
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 6:17 ` KAMEZAWA Hiroyuki
2009-03-23 6:35 ` KOSAKI Motohiro
@ 2009-03-23 8:35 ` Balbir Singh
2009-03-23 8:52 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 8:35 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 15:17:03]:
> On Mon, 23 Mar 2009 15:12:45 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Mon, 23 Mar 2009 10:52:47 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > I have one large swap partition, so I could not test the partial-swap
> > > scenario.
> > >
> > Please go ahead as you like. There seems to be no landing point now, and I'd like to see
> > what I can later. I'll send no more ACKs or NACKs.
> >
> But I dislike the whole concept, at all.
>
Kame, if you dislike it please don't enable
memory.soft_limit_in_bytes. After having sent several revisions of
your own patchset and helping me with review of several revisions, your
sudden dislike comes as a surprise.
Please NOTE: I am not saying we'll never see any of the reclaim
changes you are suggesting; all I am saying is let's do enough testing to
prove it is needed. Let's get the functionality right and then optimize
if we have to.
--
Balbir
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-23 8:22 ` Balbir Singh
@ 2009-03-23 8:47 ` KAMEZAWA Hiroyuki
2009-03-23 9:30 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 8:47 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 23 Mar 2009 13:52:44 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> I don't see why you are harping about something that you might think
> is a problem and want to over-optimize even without tests. Fix
> something when you can see the problem, on my system I don't see it. I
> am willing to consider alternatives or moving away from the current
> coding style *iff* it needs to be redone for better performance.
>
It's usually true that "to optimize a system, don't do anything unnecessary".
And the patch increases the size of res_counter_charge() from 236 bytes to
295 bytes with my compiler.
And this is called at every charge even when the check is unnecessary
(i.e. the _real_ check itself is done only once per HZ or so).
Thanks
-Kame
> What I am proposing is that we do iterative development, get the
> functionality right and then if needed tune for performance.
>
>
>
> --
> Balbir
>
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 8:35 ` Balbir Singh
@ 2009-03-23 8:52 ` KAMEZAWA Hiroyuki
2009-03-23 9:46 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23 8:52 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Mon, 23 Mar 2009 14:05:06 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 15:17:03]:
> Kame, if you dislike it please don't enable
> memory.soft_limit_in_bytes. After having sent several revisions of
> your own patchset and helping me with review of several revisions, your
> sudden dislike comes as a surprise.
I don't think that
- we need hooks in mem_cgroup_charge/uncharge,
- an RB-tree is good, or
- not taking care of kswapd is enough,
and memcg should be independent from global memory reclaim as much as possible.
> Please NOTE: I am not saying we'll never see any of the reclaim
> changes you are suggesting, all I am saying is lets do enough test to
> prove it is needed. Lets get the functionality right and then optimize
> if we have to.
>
But this itself is a problem for me.
When we added
- hierarchy
- swap handling
- etc.
almost all bug reports came from Nishimura and Li Zefan, not from *us*.
Thanks,
-Kame
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 8:24 ` Balbir Singh
@ 2009-03-23 9:12 ` KOSAKI Motohiro
2009-03-23 9:23 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KOSAKI Motohiro @ 2009-03-23 9:12 UTC (permalink / raw)
To: balbir
Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi,
lizf, Rik van Riel, Andrew Morton
> > Kamezawa-san, this implementation sucks, but I think the soft limit concept
> > itself doesn't.
>
> Just because of the reclaim factor? Feel free to improve it
> iteratively. Like I said to Kamezawa, don't over-optimize in the first
> iteration. Premature optimization is the root of all evil.
Agreed.
That's why I've nacked premature optimization code every time.
> > So, I would suggest discussing this feature based on your
> > "memcg softlimit (Another one) v4" patch. I expect I can ack it after a few spins.
>
> Kame's implementation sucked quite badly, please see my posted test
> results. Basic, bare minimum functionality did not work.
Yes, I see,
but I think it can be fixed; the basic design of the patch is sane IMHO.
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 9:12 ` KOSAKI Motohiro
@ 2009-03-23 9:23 ` Balbir Singh
0 siblings, 0 replies; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 9:23 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, lizf, Rik van Riel,
Andrew Morton
* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-03-23 18:12:54]:
> > > Kamezawa-san, this implementation sucks, but I think the soft limit concept
> > > itself doesn't.
> >
> > Just because of the reclaim factor? Feel free to improve it
> > iteratively. Like I said to Kamezawa, don't over-optimize in the first
> > iteration. Premature optimization is the root of all evil.
>
> Agreed.
> Then, I nacked premature optimization code everytime.
>
>
> > > So, I would suggest discussing this feature based on your
> > > "memcg softlimit (Another one) v4" patch. I expect I can ack it after a few spins.
> >
> > Kame's implementation sucked quite badly, please see my posted test
> > results. Basic, bare minimum functionality did not work.
>
> Yes. I see.
> but I think it can be fixed. the basic design of the patch is sane IMHO.
>
I have the following major objections to the design:
1. The use of lists as a data structure; it will not scale well.
2. The use of zone watermarks to implement global soft limits.
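The scalability concern in objection (1) is that a plain list forces an O(n) scan on every reclaim pass to find the worst offender, while an ordered structure keeps that lookup cheap. An illustrative Python sketch (using a heap as a stand-in for the RB-tree in these patches):

```python
import heapq

groups = [("A", 700), ("B", 50), ("C", 300)]  # (name, bytes over soft limit)

# List approach: every reclaim pass scans all n entries -- O(n).
worst_scan = max(groups, key=lambda g: g[1])[0]

# Ordered approach: a max-heap (negated keys) pops the worst offender in
# O(log n), analogous to reading the rightmost node of an RB-tree.
heap = [(-excess, name) for name, excess in groups]
heapq.heapify(heap)
worst_heap = heapq.heappop(heap)[1]

print(worst_scan, worst_heap)  # both select "A"
```

Both approaches agree on the victim; the difference only shows up as the number of tracked groups grows and the per-charge/per-reclaim cost of the list scan starts to dominate.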
--
Balbir
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-23 8:47 ` KAMEZAWA Hiroyuki
@ 2009-03-23 9:30 ` Balbir Singh
0 siblings, 0 replies; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 9:30 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 17:47:43]:
> On Mon, 23 Mar 2009 13:52:44 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > I don't see why you are harping about something that you might think
> > is a problem and want to over-optimize even without tests. Fix
> > something when you can see the problem, on my system I don't see it. I
> > am willing to consider alternatives or moving away from the current
> > coding style *iff* it needs to be redone for better performance.
> >
>
> It's usually true that "For optimize system, don't do anything unnecessary".
> And the patch increase size of res_counter_charge from 236bytes to 295bytes.
> on my compliler.
>
New features do come with a cost, but I expected to add 8 bytes, not 59.
Something is wrong; could you help me with the offsets and what you
see on your system?
> And this is called at every charge if the check is unnecessary.
> (i.e. the _real_ check itself is done once in a HZ/?)
>
So your suggestion is to:
1. Add a flag to indicate whether soft limits are enabled (a new atomic
field or a field protected by a lock).
2. Every HZ/?, walk up the entire tree, holding all locks, and check whether
the soft limit is exceeded.
3. Or restrict a child from having a soft limit greater than its parent's,
and break the design.
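Options (1) and (2) above amount to throttling an expensive check so it runs at most once per interval, the way a time_after(jiffies, next_check) guard would in the kernel. A minimal sketch of that pattern (toy Python with an explicit jiffies counter, not the actual res_counter code):

```python
class SoftLimitChecker:
    """Run the expensive tree walk at most once per `interval` jiffies."""

    def __init__(self, interval):
        self.interval = interval
        self.next_check = 0
        self.walks = 0

    def charge(self, jiffies):
        # Called on every charge; the real walk happens rarely.
        if jiffies >= self.next_check:
            self.walks += 1  # the expensive soft-limit tree walk goes here
            self.next_check = jiffies + self.interval

checker = SoftLimitChecker(interval=100)
for jiffies in range(1000):
    checker.charge(jiffies)
print(checker.walks)  # 10 walks for 1000 charges
```

The per-charge cost becomes a single comparison; the expensive work amortizes to once per interval regardless of charge rate.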
--
Balbir
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 6:12 ` KAMEZAWA Hiroyuki
2009-03-23 6:17 ` KAMEZAWA Hiroyuki
@ 2009-03-23 9:41 ` Balbir Singh
1 sibling, 0 replies; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 9:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 15:12:45]:
> On Mon, 23 Mar 2009 10:52:47 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > I have one large swap partition, so I could not test the partial-swap
> > scenario.
> >
> Please go ahead as you like. There seems to be no landing point now, and I'd
> like to see what I can later. I'll send no more ACKs or NACKs.
>
> But please get an ack from someone responsible for global memory reclaim,
> especially for the hooks in try_to_free_pages().
>
> And please make it clear in the documentation that
> - depending on the system, this may increase swap usage.
> - depending on the system, this may not work as the user expects a hard limit to.
>
The documentation mentions that soft limits take a long time before
coming into effect. The use of the word "soft" over "hard", and the
usage of this terminology in resource management, clearly implies what
you say in point (2).
> Considering corner cases, this is a very complicated, difficult-to-use feature.
>
> -Kame
>
>
--
Balbir
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-23 8:52 ` KAMEZAWA Hiroyuki
@ 2009-03-23 9:46 ` Balbir Singh
0 siblings, 0 replies; 54+ messages in thread
From: Balbir Singh @ 2009-03-23 9:46 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 17:52:23]:
> On Mon, 23 Mar 2009 14:05:06 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-23 15:17:03]:
> > Kame, if you dislike it please don't enable
> > memory.soft_limit_in_bytes. After having sent several revisions of
> > your own patchset and helping me with review of several revisions, your
> > sudden dislike comes as a surprise.
>
> I don't think that
> - we need hooks in mem_cgroup_charge/uncharge,
> - an RB-tree is good, or
> - not taking care of kswapd is enough,
>
> and memcg should be independent from global memory reclaim as much as possible.
>
> > Please NOTE: I am not saying we'll never see any of the reclaim
> > changes you are suggesting, all I am saying is lets do enough test to
> > prove it is needed. Lets get the functionality right and then optimize
> > if we have to.
> >
>
> But this itself is a problem for me.
>
> When we added
> - hierarchy
> - swap handling
> - etc...
>
> Almost all bug reports are from Nishimura and Li Zefan, not from *us*.
>
As long as we fix them, I don't care who reports bugs. I've been
testing the patches I have with various configurations, but not hard
enough at times. The advantage of -mm is that we get enough testing
through the contribution of folks like Li and Nishimura (which is very
much appreciated). I am not asking for these patches to go into
mainline, but for wider testing in -mm.
--
Balbir
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-19 16:57 [PATCH 0/5] Memory controller soft limit patches (v7) Balbir Singh
` (6 preceding siblings ...)
2009-03-23 8:31 ` KAMEZAWA Hiroyuki
@ 2009-03-24 17:34 ` Balbir Singh
2009-03-24 23:55 ` KAMEZAWA Hiroyuki
7 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-24 17:34 UTC (permalink / raw)
To: linux-mm
Cc: YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton, KAMEZAWA Hiroyuki
* Balbir Singh <balbir@linux.vnet.ibm.com> [2009-03-19 22:27:13]:
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> New Feature: Soft limits for memory resource controller.
>
> Changelog v7...v6
> 1. Added checks in reclaim path to make sure we don't infinitely loop
> 2. Refactored reclaim options into a new patch
> 3. Tested several scenarios, see tests below
>
> Changelog v6...v5
> 1. If the number of reclaimed pages is zero, select the next mem cgroup
> for reclamation
> 2. Fixed a bug, where key was being updated after insertion into the tree
> 3. Fixed a build issue, when CONFIG_MEM_RES_CTLR is not enabled
>
> Changelog v5...v4
> 1. Several changes to the reclaim logic, please see the patch 4 (reclaim on
> contention). I've experimented with several possibilities for reclaim
> and chose to come back to this due to the excellent behaviour seen while
> testing the patchset.
> 2. Reduced the overhead of soft limits on resource counters very significantly.
> Reaim benchmark now shows almost no drop in performance.
>
> Changelog v4...v3
> 1. Adopted suggestions from Kamezawa to do a per-zone-per-node reclaim
> while doing soft limit reclaim. We don't record priorities while
> doing soft reclaim
> 2. Some of the overheads associated with soft limits (like calculating
> excess each time) are eliminated
> 3. The time_after(jiffies, 0) bug has been fixed
> 4. Tasks are throttled if the mem cgroup they belong to is being soft reclaimed
> and at the same time tasks are increasing the memory footprint and causing
> the mem cgroup to exceed its soft limit.
>
> Changelog v3...v2
> 1. Implemented several review comments from Kosaki-San and Kamezawa-San
> Please see individual changelogs for changes
>
> Changelog v2...v1
> 1. Soft limits now support hierarchies
> 2. Use spinlocks instead of mutexes for synchronization of the RB tree
>
> Here is v7 of the new soft limit implementation. Soft limits are a new feature
> for the memory resource controller; something similar has existed in the
> group scheduler in the form of shares. The CPU controller's interpretation
> of shares is very different, though.
>
> Soft limits are the most useful feature to have for environments where
> the administrator wants to overcommit the system, such that only on memory
> contention do the limits become active. The current soft limits implementation
> provides a soft_limit_in_bytes interface for the memory controller and not
> for memory+swap controller. The implementation maintains an RB-Tree of groups
> that exceed their soft limit and starts reclaiming from the group that
> exceeds this limit by the maximum amount.
>
> So far I have the best test results with this patchset. I've experimented with
> several approaches and methods. I might be a little delayed in responding,
> I might have intermittent access to the internet for the next few days.
>
> TODOs
>
> 1. The current implementation maintains the delta from the soft limit
> and pushes back groups to their soft limits, a ratio of delta/soft_limit
> might be more useful
>
>
> Tests
> -----
>
> I've run two memory intensive workloads with differing soft limits and
> seen that they are pushed back to their soft limit on contention. Their usage
> was their soft limit plus additional memory that they were able to grab
> on the system. Soft limit can take a while before we see the expected
> results.
>
> The other tests I've run are
> 1. Deletion of groups while soft limit is in progress in the hierarchy
> 2. Setting the soft limit to zero and running other groups with non-zero
> soft limits.
> 3. Setting the soft limit to zero and testing if the mem cgroup is able
> to use available memory
> 4. Tested the patches with hierarchy enabled
> 5. Tested with swapoff -a, to make sure we don't go into an infinite loop
>
I've run lmbench with the soft limit patches and the results show no
major overhead, though there are some outliers and unexpected results.
The outliers are in context-switch 16p/64K and in communication
latencies, plus some unexpected results where the soft limit changes appear
to improve performance (I consider these to be within the range of noise).
L M B E N C H 2 . 0 S U M M A R Y
------------------------------------
Basic system parameters
----------------------------------------------------
Host OS Description Mhz
--------- ------------- ----------------------- ----
nosoftlim Linux 2.6.29- x86_64-linux-gnu 2131
softlimit Linux 2.6.29- x86_64-linux-gnu 2131
Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host OS Mhz null null open selct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ----
---- ----
nosoftlim Linux 2.6.29- 2131 0.67 1.33 29.9 36.8 6.484 1.12 12.1 508. 1708 6281
softlimit Linux 2.6.29- 2131 0.66 1.31 29.8 36.8 6.486 1.11 12.3 483. 1697 6241
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ --------------
nosoftlim Linux 2.6.29- 2.190 9.2300 3.1900 9.7400 10.8 7.93000 4.36000
softlimit Linux 2.6.29- 0.970 4.8200 3.1300 8.8900 10.3 8.82000 10.7
*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
nosoftlim Linux 2.6.29- 2.190 22.0 58.5 53.3 68.7 61.7 64.9 210.
softlimit Linux 2.6.29- 0.970 20.3 55.3 54.0 53.8 79.7 64.5 211.
File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
--------- ------------- ------ ------ ------ ------ ------- ----- -----
nosoftlim Linux 2.6.29- 51.6 48.6 153.6 87.4 20.2K 7.00000
softlimit Linux 2.6.29- 51.6 48.2 137.8 83.9 20.2K 6.00000
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem
Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
nosoftlim Linux 2.6.29- 1367 778. 803. 2058.5 4659.4 1303.9 1303.5 4664 1422.
softlimit Linux 2.6.29- 1314 823. 812. 2061.3 4659.9 1290.2 1280.9 4662 1422.
Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
---------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Guesses
--------- ------------- ---- ----- ------ -------- -------
nosoftlim Linux 2.6.29- 2131 1.875 6.5990 76.8
softlimit Linux 2.6.29- 2131 1.875 6.5980 76.8
Earlier, I ran reaim and saw no regression there as well.
--
Balbir
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-24 17:34 ` Balbir Singh
@ 2009-03-24 23:55 ` KAMEZAWA Hiroyuki
2009-03-25 3:42 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-24 23:55 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Tue, 24 Mar 2009 23:04:14 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> I've run lmbench with the soft limit patches and the results show no
> major overhead, there are some outliers and unexpected results.
>
> The outliers are at context-switch 16p/64K, in communicating
> latencies and some unexpected results where the softlimit changes help improve
> performance (I consider these to be in the range of noise).
>
OK, seems there are no regressions, but what is the soft limit value?
I think this result is of course the softlimit=0 case value... right?
-Kame
> L M B E N C H 2 . 0 S U M M A R Y
> ------------------------------------
>
>
> Basic system parameters
> ----------------------------------------------------
> Host OS Description Mhz
>
> --------- ------------- ----------------------- ----
> nosoftlim Linux 2.6.29- x86_64-linux-gnu 2131
> softlimit Linux 2.6.29- x86_64-linux-gnu 2131
>
> Processor, Processes - times in microseconds - smaller is better
> ----------------------------------------------------------------
> Host OS Mhz null null open selct sig sig fork exec sh
> call I/O stat clos TCP inst hndl proc proc proc
> --------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ----
> ---- ----
> nosoftlim Linux 2.6.29- 2131 0.67 1.33 29.9 36.8 6.484 1.12 12.1 508. 1708 6281
> softlimit Linux 2.6.29- 2131 0.66 1.31 29.8 36.8 6.486 1.11 12.3 483. 1697 6241
>
> Context switching - times in microseconds - smaller is better
> -------------------------------------------------------------
> Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
> ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
> --------- ------------- ----- ------ ------ ------ ------ --------------
> nosoftlim Linux 2.6.29- 2.190 9.2300 3.1900 9.7400 10.8 7.93000 4.36000
> softlimit Linux 2.6.29- 0.970 4.8200 3.1300 8.8900 10.3 8.82000 10.7
>
> *Local* Communication latencies in microseconds - smaller is better
> -------------------------------------------------------------------
> Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
> ctxsw UNIX UDP TCP conn
> --------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
> nosoftlim Linux 2.6.29- 2.190 22.0 58.5 53.3 68.7 61.7 64.9 210.
> softlimit Linux 2.6.29- 0.970 20.3 55.3 54.0 53.8 79.7 64.5 211.
>
> File & VM system latencies in microseconds - smaller is better
> --------------------------------------------------------------
> Host OS 0K File 10K File Mmap Prot Page
> Create Delete Create Delete Latency Fault Fault
> --------- ------------- ------ ------ ------ ------ ------- ----- -----
> nosoftlim Linux 2.6.29- 51.6 48.6 153.6 87.4 20.2K 7.00000
> softlimit Linux 2.6.29- 51.6 48.2 137.8 83.9 20.2K 6.00000
>
> *Local* Communication bandwidths in MB/s - bigger is better
> -----------------------------------------------------------
> Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem
> Mem
> UNIX reread reread (libc) (hand) read write
> --------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
> nosoftlim Linux 2.6.29- 1367 778. 803. 2058.5 4659.4 1303.9 1303.5 4664 1422.
> softlimit Linux 2.6.29- 1314 823. 812. 2061.3 4659.9 1290.2 1280.9 4662 1422.
>
> Memory latencies in nanoseconds - smaller is better
> (WARNING - may not be correct, check graphs)
> ---------------------------------------------------
> Host OS Mhz L1 $ L2 $ Main mem Guesses
> --------- ------------- ---- ----- ------ -------- -------
> nosoftlim Linux 2.6.29- 2131 1.875 6.5990 76.8
> softlimit Linux 2.6.29- 2131 1.875 6.5980 76.8
>
> Earlier, I ran reaim and saw no regression there as well.
>
> --
> Balbir
>
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-24 23:55 ` KAMEZAWA Hiroyuki
@ 2009-03-25 3:42 ` KAMEZAWA Hiroyuki
2009-03-25 4:02 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-25 3:42 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: balbir, linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro,
Rik van Riel, Andrew Morton
On Wed, 25 Mar 2009 08:55:05 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 24 Mar 2009 23:04:14 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > I've run lmbench with the soft limit patches and the results show no
> > major overhead, there are some outliers and unexpected results.
> >
> > The outliers are at context-switch 16p/64K, in communicating
> > latencies and some unexpected results where the softlimit changes help improve
> > performance (I consider these to be in the range of noise).
> >
>
> OK, seems there are no regressions. But what is the soft limit value?
> I think this result is of course the softlimit=0 case value... right?
>
I'll raise no more complaints about these hooks, even though I don't like them.
But res_counter_charge() looks like a decorated chocolate cake as a _counter_ ;)
Thanks,
-Kame
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-25 3:42 ` KAMEZAWA Hiroyuki
@ 2009-03-25 4:02 ` Balbir Singh
2009-03-25 4:05 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-25 4:02 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-25 12:42:02]:
> On Wed, 25 Mar 2009 08:55:05 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Tue, 24 Mar 2009 23:04:14 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > I've run lmbench with the soft limit patches and the results show no
> > > major overhead, there are some outliers and unexpected results.
> > >
> > > The outliers are at context-switch 16p/64K, in communicating
> > > latencies and some unexpected results where the softlimit changes help improve
> > > performance (I consider these to be in the range of noise).
> > >
> >
> > OK, seems there are no regressions. But what is the soft limit value?
> > I think this result is of course the softlimit=0 case value... right?
> >
>
Yes, this result is for the case where the soft limit is at its default
value (LONGLONG_MAX).
> I'll raise no more complaints about these hooks, even though I don't like them.
> But res_counter_charge() looks like a decorated chocolate cake as a _counter_ ;)
>
res_counters are split out for modularity; the advantage is
that we can optimize or change res_counters without affecting the memcg
code. I am glad you can see that there is no overhead as a result of
these hooks.
--
Balbir
* Re: [PATCH 0/5] Memory controller soft limit patches (v7)
2009-03-25 4:02 ` Balbir Singh
@ 2009-03-25 4:05 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-25 4:05 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Wed, 25 Mar 2009 09:32:46 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > I'll raise no more complaints about these hooks, even though I don't like them.
> > But res_counter_charge() looks like a decorated chocolate cake as a _counter_ ;)
> >
>
> res_counters are split out for modularity reasons, the advantage is
> that we can optimize/change res_counters without affecting the memcg
> code. I am glad you can see that there is no overhead as a result of
> these hooks.
>
I'll use your code in my own set. Anyway, it's the merge window now, so we'll
have enough time to think of cool stuff.
Thanks,
-Kame
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-19 16:57 ` [PATCH 3/5] Memory controller soft limit organize cgroups (v7) Balbir Singh
2009-03-20 3:46 ` KAMEZAWA Hiroyuki
@ 2009-03-25 4:59 ` KAMEZAWA Hiroyuki
2009-03-25 5:29 ` Balbir Singh
2009-03-25 5:07 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-25 4:59 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Thu, 19 Mar 2009 22:27:35 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Feature: Organize cgroups over soft limit in a RB-Tree
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Changelog v7...v6
> 1. Refactor the check and update logic. The goal is to allow the
> check logic to be modular, so that it can be revisited in the future
> if something more appropriate is found to be useful.
>
> Changelog v6...v5
> 1. Update the key before inserting into RB tree. Without the current change
> it could take an additional iteration to get the key correct.
>
> Changelog v5...v4
> 1. res_counter_uncharge has an additional parameter to indicate if the
> counter was over its soft limit, before uncharge.
>
> Changelog v4...v3
> 1. Optimizations to ensure we don't unnecessarily get res_counter values
> 2. Fixed a bug in usage of time_after()
>
> Changelog v3...v2
> 1. Add only the ancestor to the RB-Tree
> 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
>
> Changelog v2...v1
> 1. Add support for hierarchies
> 2. The res_counter that is highest in the hierarchy is returned on soft
> limit being exceeded. Since we do hierarchical reclaim and add all
> groups exceeding their soft limits, this approach seems to work well
> in practice.
>
> This patch introduces a RB-Tree for storing memory cgroups that are over their
> soft limit. The overall goal is to
>
> 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> We are careful about updates, updates take place only after a particular
> time interval has passed
> 2. We remove the node from the RB-Tree when the usage goes below the soft
> limit
>
> The next set of patches will exploit the RB-Tree to get the group that is
> over its soft limit by the largest amount and reclaim from it, when we
> face memory contention.
>
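The add-when-over, remove-when-under bookkeeping described above can be checked with a minimal user-space sketch. A plain sorted array stands in for the kernel's RB-tree here, and every name below (soft_group, tree_update, tree_pick_victim) is a hypothetical simplification, not the patch's actual API:

```c
#include <stddef.h>

#define MAX_GROUPS 16	/* capacity checks omitted for brevity */

struct soft_group {
	const char *name;
	unsigned long usage;
	unsigned long soft_limit;
};

static struct soft_group *tree[MAX_GROUPS];
static int nr_groups;

static unsigned long excess(const struct soft_group *g)
{
	return g->usage > g->soft_limit ? g->usage - g->soft_limit : 0;
}

/*
 * Keep the array sorted by excess, descending. A group is present only
 * while its usage exceeds its soft limit, mirroring the patch's goals
 * 1 and 2 above.
 */
static void tree_update(struct soft_group *g)
{
	int i, j;

	/* Drop any stale entry first (usage may have changed). */
	for (i = 0; i < nr_groups; i++) {
		if (tree[i] == g) {
			for (j = i; j < nr_groups - 1; j++)
				tree[j] = tree[j + 1];
			nr_groups--;
			break;
		}
	}
	if (!excess(g))
		return;		/* below soft limit: stays off the tree */
	for (i = 0; i < nr_groups && excess(tree[i]) > excess(g); i++)
		;
	for (j = nr_groups; j > i; j--)
		tree[j] = tree[j - 1];
	tree[i] = g;
	nr_groups++;
}

/* Reclaim picks the group exceeding its soft limit by the largest amount. */
static struct soft_group *tree_pick_victim(void)
{
	return nr_groups ? tree[0] : NULL;
}
```

Once a group's usage drops back below its soft limit, a tree_update() on it removes it from the structure, so reclaim always sees the current worst offender first.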
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
>
> include/linux/res_counter.h | 6 +-
> kernel/res_counter.c | 18 +++++
> mm/memcontrol.c | 149 ++++++++++++++++++++++++++++++++++++++-----
> 3 files changed, 151 insertions(+), 22 deletions(-)
>
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 5c821fd..5bbf8b1 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> int __must_check res_counter_charge_locked(struct res_counter *counter,
> unsigned long val);
> int __must_check res_counter_charge(struct res_counter *counter,
> - unsigned long val, struct res_counter **limit_fail_at);
> + unsigned long val, struct res_counter **limit_fail_at,
> + struct res_counter **soft_limit_at);
>
> /*
> * uncharge - tell that some portion of the resource is released
> @@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
> */
>
> void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
> -void res_counter_uncharge(struct res_counter *counter, unsigned long val);
> +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> + bool *was_soft_limit_excess);
>
> static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> {
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 4e6dafe..51ec438 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> }
>
> int res_counter_charge(struct res_counter *counter, unsigned long val,
> - struct res_counter **limit_fail_at)
> + struct res_counter **limit_fail_at,
> + struct res_counter **soft_limit_fail_at)
> {
> int ret;
> unsigned long flags;
> struct res_counter *c, *u;
>
> *limit_fail_at = NULL;
> + if (soft_limit_fail_at)
> + *soft_limit_fail_at = NULL;
> local_irq_save(flags);
> for (c = counter; c != NULL; c = c->parent) {
> spin_lock(&c->lock);
> ret = res_counter_charge_locked(c, val);
> + /*
> + * With soft limits, we return the highest ancestor
> + * that exceeds its soft limit
> + */
> + if (soft_limit_fail_at &&
> + !res_counter_soft_limit_check_locked(c))
> + *soft_limit_fail_at = c;
> spin_unlock(&c->lock);
I'm not sure whether this works as intended. Could you clarify? (see below)
In following hierarchy,
A/ soft_limit=1G, usage=1.2G.
B soft_limit=200M, usage=1G
C soft_limit=800M, usage=200M
This function returns only "A".
And memory will be reclaimed from B and C, at first.
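Kame's example can be checked with a small user-space sketch of the charge walk; the types and the charge() helper below are simplified stand-ins for the kernel's res_counter code, not the actual implementation. Because the loop walks from the charged counter up to the root, the last over-limit counter seen is the highest over-limit ancestor:

```c
#include <stddef.h>

struct res_counter {
	unsigned long usage;
	unsigned long soft_limit;
	struct res_counter *parent;
};

/*
 * Walk from the charged counter up to the root. Parents are visited
 * after children, so the counter left in *soft_fail_at is the highest
 * ancestor found over its soft limit.
 */
static void charge(struct res_counter *counter, unsigned long val,
		   struct res_counter **soft_fail_at)
{
	struct res_counter *c;

	*soft_fail_at = NULL;
	for (c = counter; c != NULL; c = c->parent) {
		c->usage += val;
		if (c->usage > c->soft_limit)
			*soft_fail_at = c;	/* overwritten at higher levels */
	}
}
```

With A (soft limit 1024M, usage 1100M) as parent of C (soft limit 800M, usage 200M), charging 100M through C leaves only A in *soft_fail_at, matching the behaviour described above: B, though over its own soft limit, is not on the charged path and is never reported.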
> if (ret < 0) {
> *limit_fail_at = c;
> @@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
> counter->usage -= val;
> }
>
> -void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> + bool *was_soft_limit_excess)
> {
> unsigned long flags;
> struct res_counter *c;
> @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> local_irq_save(flags);
> for (c = counter; c != NULL; c = c->parent) {
> spin_lock(&c->lock);
> + if (c == counter && was_soft_limit_excess)
> + *was_soft_limit_excess =
> + !res_counter_soft_limit_check_locked(c);
> res_counter_uncharge_locked(c, val);
> spin_unlock(&c->lock);
> }
Does this work as intended ?
Assume following hierarchy
A/ softlimit=1G usage=300M
B/ softlimit=200M usage=300M.
C/ softlimit=800M usage=0M
*was_soft_limit_excess will be false and there will be no tree update, forever.
Hmm ?
Thanks,
-Kame
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-19 16:57 ` [PATCH 3/5] Memory controller soft limit organize cgroups (v7) Balbir Singh
2009-03-20 3:46 ` KAMEZAWA Hiroyuki
2009-03-25 4:59 ` KAMEZAWA Hiroyuki
@ 2009-03-25 5:07 ` KAMEZAWA Hiroyuki
2009-03-25 5:18 ` Balbir Singh
2 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-25 5:07 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Thu, 19 Mar 2009 22:27:35 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> @@ -938,16 +1031,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> int ret;
> bool noswap = false;
>
Logically, please add
soft_fail_res = NULL here.
> - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> + ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> + &soft_fail_res);
-Kame
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-25 5:07 ` KAMEZAWA Hiroyuki
@ 2009-03-25 5:18 ` Balbir Singh
2009-03-25 5:22 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-25 5:18 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-25 14:07:52]:
> On Thu, 19 Mar 2009 22:27:35 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > @@ -938,16 +1031,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > int ret;
> > bool noswap = false;
> >
> Logically, please add
> soft_fail_res = NULL here.
>
As an optimization? OK, done!
>
> > - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> > + ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> > + &soft_fail_res);
>
> -Kame
>
>
--
Balbir
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-25 5:18 ` Balbir Singh
@ 2009-03-25 5:22 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-25 5:22 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Wed, 25 Mar 2009 10:48:17 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-25 14:07:52]:
>
> > On Thu, 19 Mar 2009 22:27:35 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > @@ -938,16 +1031,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > int ret;
> > > bool noswap = false;
> > >
> > Logically, please add
> > soft_fail_res = NULL here.
> >
>
> As an optimization? OK, done!
>
Ah, sorry... I missed that the pointer is automatically initialized to NULL in
res_counter_charge().
Please ignore it.
-Kame
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-25 4:59 ` KAMEZAWA Hiroyuki
@ 2009-03-25 5:29 ` Balbir Singh
2009-03-25 5:39 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-25 5:29 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-25 13:59:00]:
> On Thu, 19 Mar 2009 22:27:35 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > Feature: Organize cgroups over soft limit in a RB-Tree
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >
> > Changelog v7...v6
> > 1. Refactor the check and update logic. The goal is to allow the
> > check logic to be modular, so that it can be revisited in the future
> > if something more appropriate is found to be useful.
> >
> > Changelog v6...v5
> > 1. Update the key before inserting into RB tree. Without the current change
> > it could take an additional iteration to get the key correct.
> >
> > Changelog v5...v4
> > 1. res_counter_uncharge has an additional parameter to indicate if the
> > counter was over its soft limit, before uncharge.
> >
> > Changelog v4...v3
> > 1. Optimizations to ensure we don't unnecessarily get res_counter values
> > 2. Fixed a bug in usage of time_after()
> >
> > Changelog v3...v2
> > 1. Add only the ancestor to the RB-Tree
> > 2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
> >
> > Changelog v2...v1
> > 1. Add support for hierarchies
> > 2. The res_counter that is highest in the hierarchy is returned on soft
> > limit being exceeded. Since we do hierarchical reclaim and add all
> > groups exceeding their soft limits, this approach seems to work well
> > in practice.
> >
> > This patch introduces a RB-Tree for storing memory cgroups that are over their
> > soft limit. The overall goal is to
> >
> > 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> > We are careful about updates, updates take place only after a particular
> > time interval has passed
> > 2. We remove the node from the RB-Tree when the usage goes below the soft
> > limit
> >
> > The next set of patches will exploit the RB-Tree to get the group that is
> > over its soft limit by the largest amount and reclaim from it, when we
> > face memory contention.
> >
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > ---
> >
> > include/linux/res_counter.h | 6 +-
> > kernel/res_counter.c | 18 +++++
> > mm/memcontrol.c | 149 ++++++++++++++++++++++++++++++++++++++-----
> > 3 files changed, 151 insertions(+), 22 deletions(-)
> >
> >
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index 5c821fd..5bbf8b1 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -112,7 +112,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> > int __must_check res_counter_charge_locked(struct res_counter *counter,
> > unsigned long val);
> > int __must_check res_counter_charge(struct res_counter *counter,
> > - unsigned long val, struct res_counter **limit_fail_at);
> > + unsigned long val, struct res_counter **limit_fail_at,
> > + struct res_counter **soft_limit_at);
> >
> > /*
> > * uncharge - tell that some portion of the resource is released
> > @@ -125,7 +126,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
> > */
> >
> > void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
> > -void res_counter_uncharge(struct res_counter *counter, unsigned long val);
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > + bool *was_soft_limit_excess);
> >
> > static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> > {
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 4e6dafe..51ec438 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> > }
> >
> > int res_counter_charge(struct res_counter *counter, unsigned long val,
> > - struct res_counter **limit_fail_at)
> > + struct res_counter **limit_fail_at,
> > + struct res_counter **soft_limit_fail_at)
> > {
> > int ret;
> > unsigned long flags;
> > struct res_counter *c, *u;
> >
> > *limit_fail_at = NULL;
> > + if (soft_limit_fail_at)
> > + *soft_limit_fail_at = NULL;
> > local_irq_save(flags);
> > for (c = counter; c != NULL; c = c->parent) {
> > spin_lock(&c->lock);
> > ret = res_counter_charge_locked(c, val);
> > + /*
> > + * With soft limits, we return the highest ancestor
> > + * that exceeds its soft limit
> > + */
> > + if (soft_limit_fail_at &&
> > + !res_counter_soft_limit_check_locked(c))
> > + *soft_limit_fail_at = c;
> > spin_unlock(&c->lock);
>
> I'm not sure whether this works as intended. Could you clarify? (see below)
>
> In following hierarchy,
>
> A/ soft_limit=1G, usage=1.2G.
> B soft_limit=200M, usage=1G
> C soft_limit=800M, usage=200M
>
> This function returns only "A".
> And memory will be reclaimed from B and C, at first.
>
Yes, A will be put on the RB-Tree; earlier we were putting both A and
B, but you were opposed to it (I can't remember why). Memory will be
reclaimed from A, B and C through hierarchical reclaim.
>
>
> > if (ret < 0) {
> > *limit_fail_at = c;
> > @@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
> > counter->usage -= val;
> > }
> >
> > -void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > + bool *was_soft_limit_excess)
> > {
> > unsigned long flags;
> > struct res_counter *c;
> > @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > local_irq_save(flags);
> > for (c = counter; c != NULL; c = c->parent) {
> > spin_lock(&c->lock);
> > + if (c == counter && was_soft_limit_excess)
> > + *was_soft_limit_excess =
> > + !res_counter_soft_limit_check_locked(c);
> > res_counter_uncharge_locked(c, val);
> > spin_unlock(&c->lock);
> > }
> Does this work as intended ?
> Assume following hierarchy
>
> A/ softlimit=1G usage=300M
> B/ softlimit=200M usage=300M.
> C/ softlimit=800M usage=0M
>
> *was_soft_limit_excess will be false and no tree update, forever.
>
No, was_soft_limit_excess checks the soft limit before the uncharge to
see whether we were over the soft limit. When a page gets uncharged from B,
since B is over its soft limit and on the tree, we will update the tree. Why
do you say that was_soft_limit_excess will return false?
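The before-uncharge semantics being argued here can be illustrated with a small user-space sketch. The simplified types and the uncharge() helper are hypothetical stand-ins for the kernel code, but they capture the point: the flag reflects the counter's state before usage is dropped.

```c
#include <stddef.h>

struct res_counter {
	unsigned long usage;
	unsigned long soft_limit;
	struct res_counter *parent;
};

/*
 * Sample the soft-limit state of the counter being uncharged *before*
 * dropping its usage, so the caller can tell whether the group was
 * over its soft limit (and hence on the tree) at uncharge time.
 */
static void uncharge(struct res_counter *counter, unsigned long val,
		     int *was_soft_limit_excess)
{
	struct res_counter *c;

	for (c = counter; c != NULL; c = c->parent) {
		if (c == counter && was_soft_limit_excess)
			*was_soft_limit_excess = c->usage > c->soft_limit;
		c->usage -= val;
	}
}
```

For B (soft limit 200M, usage 300M), the first uncharge reports that B was over its soft limit, which is exactly the condition the patch uses to trigger a tree update; once usage falls below the limit, subsequent uncharges report false.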
--
Balbir
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-25 5:29 ` Balbir Singh
@ 2009-03-25 5:39 ` KAMEZAWA Hiroyuki
2009-03-25 5:53 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-25 5:39 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Wed, 25 Mar 2009 10:59:47 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-25 13:59:00]:
> > > @@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
> > > counter->usage -= val;
> > > }
> > >
> > > -void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > > + bool *was_soft_limit_excess)
> > > {
> > > unsigned long flags;
> > > struct res_counter *c;
> > > @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > > local_irq_save(flags);
> > > for (c = counter; c != NULL; c = c->parent) {
> > > spin_lock(&c->lock);
> > > + if (c == counter && was_soft_limit_excess)
> > > + *was_soft_limit_excess =
> > > + !res_counter_soft_limit_check_locked(c);
> > > res_counter_uncharge_locked(c, val);
> > > spin_unlock(&c->lock);
> > > }
> > Does this work as intended ?
> > Assume following hierarchy
> >
> > A/ softlimit=1G usage=300M
> > B/ softlimit=200M usage=300M.
> > C/ softlimit=800M usage=0M
> >
> > *was_soft_limit_excess will be false and no tree update, forever.
> >
>
> No.. was_soft_limit_excess checks the soft limit before uncharge to
> see if we were over soft limit, when a page gets uncharged from B,
> since B is over soft limit and on tree, we will update the tree. Why
> do you say that was_soft_limit_excess will return false?
>
My eyes tend to be buggy. OK, let me change the question.
==
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+ bool *was_soft_limit_excess)
{
unsigned long flags;
struct res_counter *c;
@@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
+ if (c == counter && was_soft_limit_excess)
+ *was_soft_limit_excess =
+ !res_counter_soft_limit_check_locked(c);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
}
==
Why is checking just the "c == counter" case enough?
Thanks,
-Kame
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-25 5:39 ` KAMEZAWA Hiroyuki
@ 2009-03-25 5:53 ` Balbir Singh
2009-03-25 6:01 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-25 5:53 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-25 14:39:53]:
> On Wed, 25 Mar 2009 10:59:47 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-25 13:59:00]:
>
> > > > @@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
> > > > counter->usage -= val;
> > > > }
> > > >
> > > > -void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > > > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > > > + bool *was_soft_limit_excess)
> > > > {
> > > > unsigned long flags;
> > > > struct res_counter *c;
> > > > @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > > > local_irq_save(flags);
> > > > for (c = counter; c != NULL; c = c->parent) {
> > > > spin_lock(&c->lock);
> > > > + if (c == counter && was_soft_limit_excess)
> > > > + *was_soft_limit_excess =
> > > > + !res_counter_soft_limit_check_locked(c);
> > > > res_counter_uncharge_locked(c, val);
> > > > spin_unlock(&c->lock);
> > > > }
> > > Does this work as intended ?
> > > Assume following hierarchy
> > >
> > > A/ softlimit=1G usage=300M
> > > B/ softlimit=200M usage=300M.
> > > C/ softlimit=800M usage=0M
> > >
> > > *was_soft_limit_excess will be false and no tree update, forever.
> > >
> >
> > No.. was_soft_limit_excess checks the soft limit before uncharge to
> > see if we were over soft limit, when a page gets uncharged from B,
> > since B is over soft limit and on tree, we will update the tree. Why
> > do you say that was_soft_limit_excess will return false?
> >
> my eyes tend to be buggy. ok, change the question.
>
No problem, we've all been there :)
> ==
> +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> + bool *was_soft_limit_excess)
> {
> unsigned long flags;
> struct res_counter *c;
> @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> local_irq_save(flags);
> for (c = counter; c != NULL; c = c->parent) {
> spin_lock(&c->lock);
> + if (c == counter && was_soft_limit_excess)
> + *was_soft_limit_excess =
> + !res_counter_soft_limit_check_locked(c);
> res_counter_uncharge_locked(c, val);
> spin_unlock(&c->lock);
> }
> ==
> Why is checking just the "c == counter" case enough?
>
This is a very good question; I think this check might not be
necessary and could also be potentially buggy.
> Thanks,
> -Kame
>
>
>
>
>
--
Balbir
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-25 5:53 ` Balbir Singh
@ 2009-03-25 6:01 ` KAMEZAWA Hiroyuki
2009-03-25 6:21 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-25 6:01 UTC (permalink / raw)
To: balbir
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
On Wed, 25 Mar 2009 11:23:54 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > ==
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > + bool *was_soft_limit_excess)
> > {
> > unsigned long flags;
> > struct res_counter *c;
> > @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > local_irq_save(flags);
> > for (c = counter; c != NULL; c = c->parent) {
> > spin_lock(&c->lock);
> > + if (c == counter && was_soft_limit_excess)
> > + *was_soft_limit_excess =
> > + !res_counter_soft_limit_check_locked(c);
> > res_counter_uncharge_locked(c, val);
> > spin_unlock(&c->lock);
> > }
> > ==
> > Why is checking just the "c == counter" case enough?
> >
>
> This is a very good question, I think this check might not be
> necessary and can also be potentially buggy.
>
I feel so, but I can't think of a good cleanup.
Can't we remove this check at uncharge? Anyway, the status can be updated at:
- charge()
- reclaim
I'll pursue this approach in my own set...
Thanks,
-Kame
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-25 6:01 ` KAMEZAWA Hiroyuki
@ 2009-03-25 6:21 ` Balbir Singh
2009-03-25 6:38 ` Balbir Singh
0 siblings, 1 reply; 54+ messages in thread
From: Balbir Singh @ 2009-03-25 6:21 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-25 15:01:09]:
> On Wed, 25 Mar 2009 11:23:54 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > > ==
> > > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > > + bool *was_soft_limit_excess)
> > > {
> > > unsigned long flags;
> > > struct res_counter *c;
> > > @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > > local_irq_save(flags);
> > > for (c = counter; c != NULL; c = c->parent) {
> > > spin_lock(&c->lock);
> > > + if (c == counter && was_soft_limit_excess)
> > > + *was_soft_limit_excess =
> > > + !res_counter_soft_limit_check_locked(c);
> > > res_counter_uncharge_locked(c, val);
> > > spin_unlock(&c->lock);
> > > }
> > > ==
> > > Why is checking just the "c == counter" case enough?
> > >
> >
> > This is a very good question, I think this check might not be
> > necessary and can also be potentially buggy.
> >
> I feel so, but I can't think of a good cleanup.
>
> Can't we remove this check at uncharge? Anyway, the status can be updated at:
> - charge().
> - reclaim
>
> I'll seek this way in mine...
The check can be removed; let me do that and re-run the overhead
tests.
--
Balbir
* Re: [PATCH 3/5] Memory controller soft limit organize cgroups (v7)
2009-03-25 6:21 ` Balbir Singh
@ 2009-03-25 6:38 ` Balbir Singh
0 siblings, 0 replies; 54+ messages in thread
From: Balbir Singh @ 2009-03-25 6:38 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, YAMAMOTO Takashi, lizf, KOSAKI Motohiro, Rik van Riel,
Andrew Morton
* Balbir Singh <balbir@linux.vnet.ibm.com> [2009-03-25 11:51:40]:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-03-25 15:01:09]:
>
> > On Wed, 25 Mar 2009 11:23:54 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > > > ==
> > > > +void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > > > + bool *was_soft_limit_excess)
> > > > {
> > > > unsigned long flags;
> > > > struct res_counter *c;
> > > > @@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > > > local_irq_save(flags);
> > > > for (c = counter; c != NULL; c = c->parent) {
> > > > spin_lock(&c->lock);
> > > > + if (c == counter && was_soft_limit_excess)
> > > > + *was_soft_limit_excess =
> > > > + !res_counter_soft_limit_check_locked(c);
> > > > res_counter_uncharge_locked(c, val);
> > > > spin_unlock(&c->lock);
> > > > }
> > > > ==
> > > > Why is checking just the "c == counter" case enough?
> > > >
> > >
> > > This is a very good question, I think this check might not be
> > > necessary and can also be potentially buggy.
> > >
> > I feel so, but I can't think of a good cleanup.
> >
> > Can't we remove this check at uncharge? Anyway, the status can be updated at:
> > - charge().
> > - reclaim
> >
> > I'll seek this way in mine...
>
> The check can be removed, let me do that and re-run the overhead
> tests.
OK, no impact since I don't have soft limits or hierarchy enabled in
the tests for overhead (I am testing overhead for non-users of the
feature).
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Thread overview: 54+ messages
2009-03-19 16:57 [PATCH 0/5] Memory controller soft limit patches (v7) Balbir Singh
2009-03-19 16:57 ` [PATCH 1/5] Memory controller soft limit documentation (v7) Balbir Singh
2009-03-19 16:57 ` [PATCH 2/5] Memory controller soft limit interface (v7) Balbir Singh
2009-03-19 16:57 ` [PATCH 3/5] Memory controller soft limit organize cgroups (v7) Balbir Singh
2009-03-20 3:46 ` KAMEZAWA Hiroyuki
2009-03-22 14:21 ` Balbir Singh
2009-03-22 23:53 ` KAMEZAWA Hiroyuki
2009-03-23 3:34 ` Balbir Singh
2009-03-23 3:38 ` KAMEZAWA Hiroyuki
2009-03-23 4:15 ` Balbir Singh
2009-03-23 4:23 ` KAMEZAWA Hiroyuki
2009-03-23 8:22 ` Balbir Singh
2009-03-23 8:47 ` KAMEZAWA Hiroyuki
2009-03-23 9:30 ` Balbir Singh
2009-03-25 4:59 ` KAMEZAWA Hiroyuki
2009-03-25 5:29 ` Balbir Singh
2009-03-25 5:39 ` KAMEZAWA Hiroyuki
2009-03-25 5:53 ` Balbir Singh
2009-03-25 6:01 ` KAMEZAWA Hiroyuki
2009-03-25 6:21 ` Balbir Singh
2009-03-25 6:38 ` Balbir Singh
2009-03-25 5:07 ` KAMEZAWA Hiroyuki
2009-03-25 5:18 ` Balbir Singh
2009-03-25 5:22 ` KAMEZAWA Hiroyuki
2009-03-19 16:57 ` [PATCH 4/5] Memory controller soft limit refactor reclaim flags (v7) Balbir Singh
2009-03-20 3:47 ` KAMEZAWA Hiroyuki
2009-03-22 14:21 ` Balbir Singh
2009-03-19 16:57 ` [PATCH 5/5] Memory controller soft limit reclaim on contention (v7) Balbir Singh
2009-03-20 4:06 ` KAMEZAWA Hiroyuki
2009-03-22 14:27 ` Balbir Singh
2009-03-23 0:02 ` KAMEZAWA Hiroyuki
2009-03-23 4:12 ` Balbir Singh
2009-03-23 4:20 ` KAMEZAWA Hiroyuki
2009-03-23 8:28 ` Balbir Singh
2009-03-23 8:30 ` KAMEZAWA Hiroyuki
2009-03-23 3:50 ` [PATCH 0/5] Memory controller soft limit patches (v7) KAMEZAWA Hiroyuki
2009-03-23 5:22 ` Balbir Singh
2009-03-23 5:31 ` KAMEZAWA Hiroyuki
2009-03-23 6:12 ` KAMEZAWA Hiroyuki
2009-03-23 6:17 ` KAMEZAWA Hiroyuki
2009-03-23 6:35 ` KOSAKI Motohiro
2009-03-23 8:24 ` Balbir Singh
2009-03-23 9:12 ` KOSAKI Motohiro
2009-03-23 9:23 ` Balbir Singh
2009-03-23 8:35 ` Balbir Singh
2009-03-23 8:52 ` KAMEZAWA Hiroyuki
2009-03-23 9:46 ` Balbir Singh
2009-03-23 9:41 ` Balbir Singh
2009-03-23 8:31 ` KAMEZAWA Hiroyuki
2009-03-24 17:34 ` Balbir Singh
2009-03-24 23:55 ` KAMEZAWA Hiroyuki
2009-03-25 3:42 ` KAMEZAWA Hiroyuki
2009-03-25 4:02 ` Balbir Singh
2009-03-25 4:05 ` KAMEZAWA Hiroyuki