linux-mm.kvack.org archive mirror
* [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention
@ 2011-05-12 18:47 Ying Han
  2011-05-12 18:47 ` [RFC PATCH 1/4] Disable "organizing cgroups over soft limit in a RB-Tree" Ying Han
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Ying Han @ 2011-05-12 18:47 UTC (permalink / raw)
  To: Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai
  Cc: linux-mm

This is the patchset I prepared after the LSF proposal. The patchset itself is
only the first step toward improving memcg soft_limit reclaim; I list the
TODOs at the end.

Here is the proposal I sent out after lots of hallway discussions with Rik,
Johannes, Michal and Kamezawa. Johannes has already posted an implementation,
and I will read his patchset after posting this one. Sorry it took me a while
to post the implementation after the proposal.

This patchset is based on mmotm-2011-04-14-15-08.

What is "soft_limit"?
The "soft_limit was introduced in memcg to support over-committing the memory
resource on the host. Each cgroup can be configured with "hard_limit", where it
will be throttled or OOM killed by going over the limit. However, the allocation
can go above the "soft_limit" as long as there is no memory contention. The
"soft_limit" is the kernel mechanism for re-distributing spare memory resource
among cgroups.
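
For illustration, here is a small userspace snippet that configures a cgroup
the way the tests later in this series do (hard_limit 20g, soft_limit 2g). It
is only a sketch against the cgroup v1 memory controller files; the mount
point /dev/cgroup/memory and the group name "A" are assumptions, not part of
this patchset.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int write_limit(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t len = (ssize_t)strlen(val);

	if (fd < 0)
		return -1;
	if (write(fd, val, len) != len) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* Hard limit: going over it throttles the group or triggers the OOM killer. */
	write_limit("/dev/cgroup/memory/A/memory.limit_in_bytes", "20g");
	/* Soft limit: may be exceeded as long as there is no memory contention. */
	write_limit("/dev/cgroup/memory/A/memory.soft_limit_in_bytes", "2g");
	return 0;
}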

What is the problem?
Right now, soft_limit reclaim happens in global background reclaim and acts as
a best-effort pass before the global LRU scanning. However, the global LRU
reclaim breaks isolation badly, and we eventually need to eliminate the double
LRU. Moving in that direction, the first step is efficient targeted reclaim.

What do we have now?
The current implementation of soft_limit reclaim is based on a per-zone RB
tree, where only the cgroup that exceeds its soft_limit the most is selected
for reclaim.
1. It takes no account of how many pages this cgroup actually has allocated on
the zone. The RB tree is indexed by the cgroup's (usage - soft_limit).
2. It makes less sense to reclaim from only one cgroup than to reclaim from all
cgroups based on a calculated proportion. The latter is required for fairness.
3. The target of soft_limit reclaim is to bring one cgroup's usage under its
soft_limit, whereas the target of global memory pressure is to reclaim pages
until the zone is above its high_wmark.

Proposed design:
1. soft_limit reclaim is triggered under global memory pressure, in both
background and direct reclaim.
2. the target of soft_limit reclaim is made consistent with global reclaim:
we check the zone's watermarks instead.
3. round-robin across the cgroups that have memory allocated on the zone and
also exceed their configured soft_limit.
4. the change should be a no-op when memcg is not configured.
5. retain the ability to balance zones without the global LRU reclaim.

More details:
Build a per-zone memcg list which links the mem_cgroup_per_zone of every memcg
that exceeds its soft_limit and has memory allocated on the zone. A simplified
C sketch of the resulting structure and reclaim loop follows this list.
1. a new cgroup is examined and inserted once per 1024 increments of
mem_cgroup_commit_charge().
2. under global memory pressure, we iterate the list and try to reclaim a
target number of pages from each cgroup.
3. the target number is per-cgroup and is calculated based on the per-memcg lru
fraction and the soft_limit excess. We could borrow the existing
get_scan_count() and add the soft_limit factor on top of it.
4. move the cgroup to the tail once the target number of pages has been
reclaimed.
5. remove the cgroup from the list if its usage drops below the soft_limit.
6. after reclaiming from each cgroup, check the zone watermark. If the free
pages go above high_wmark + balance_gap, break out of the reclaim loop.
7. reclaim strategies should be consistent with global reclaim; for example, we
want to scan each cgroup's file LRU first and the anon LRU on the next
iteration.
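
To make the design concrete, here is a simplified sketch of the per-zone list
and the round-robin walk described above. It deliberately omits the locking,
css reference counting and per-memcg scan target that the actual patches (2/4
and 3/4) handle, and soft_limit_reclaim_sketch() is not a function in this
series; treat it as pseudocode with C syntax rather than the real
implementation.

struct mem_cgroup_list_per_zone {
	struct list_head list;	/* mem_cgroup_per_zone of memcgs over soft_limit */
	spinlock_t lock;
};

static unsigned long soft_limit_reclaim_sketch(struct zone *zone, gfp_t gfp_mask,
					struct mem_cgroup_list_per_zone *mclz)
{
	unsigned long nr_reclaimed = 0;

	while (!list_empty(&mclz->list)) {
		/* Take the oldest entry from the tail: round-robin order. */
		struct mem_cgroup_per_zone *mz =
			list_entry(mclz->list.prev, struct mem_cgroup_per_zone,
					soft_limit_list);
		unsigned long nr_scanned = 0;

		list_del(&mz->soft_limit_list);
		nr_reclaimed += mem_cgroup_hierarchical_reclaim(mz->mem, zone,
						gfp_mask,
						MEM_CGROUP_RECLAIM_SOFT,
						&nr_scanned);

		/* Rotate to the head if the memcg is still over its soft_limit. */
		if (res_counter_soft_limit_excess(&mz->mem->res))
			list_add(&mz->soft_limit_list, &mclz->list);

		/* Stop as soon as the zone is balanced again. */
		if (zone_watermark_ok_safe(zone, 0, high_wmark_pages(zone), 0, 0))
			break;
	}
	return nr_reclaimed;
}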


Action Items:
0. revert some of the changes in the current soft_limit reclaim [DONE]
note: covered in this patchset.

1. implement the soft_limit reclaim described above [DONE]
note: covered in this patchset.

TODO:
a) there was a question on how to do zone balancing w/o the global LRU. This
could be solved by building another per-zone cgroup list, where we also link
cgroups that are under their soft_limit. We won't scan that list unless the
first list is exhausted and the free pages are still under the high_wmark.

b) one of the tricky parts is to calculate the target nr_to_scan for each
cgroup, especially when combining the current heuristics with the soft_limit
excess. It depends how much weight we put on the latter. One way is to make
the ratio user configurable (a hypothetical sketch follows the action items
below).

c) soft_limit reclaim does not support high-order reclaim, which means it
won't do lumpy reclaim. That is OK with memory compaction enabled.

2. add the soft_limit reclaim into global direct reclaim [DONE and merged in mmotm]

3. eliminate the global LRU and remove the lru field in page_cgroup [TODO]
note: Johannes's patchset might already cover that; I will read it.

4. separate out the zone->lru lock and make a per-memcg-per-zone lock [TODO]
note: We posted a patch before LSF but decided to hold it until the previous
items are done. I will also read Johannes's patchset in case it covers this too.
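
For TODO b) above, one hypothetical way to fold the soft_limit excess into the
per-memcg scan target is sketched below. Nothing like soft_limit_scan_target()
or a soft_limit weight knob exists in this patchset; the scaling is only one
possible scheme and the percentages are purely illustrative.

/*
 * Scale the baseline target from get_scan_count() by how far the memcg
 * is over its soft_limit.  weight is a (hypothetical) tunable in percent:
 * weight == 0 ignores the excess; weight == 100 means a memcg that is
 * 50% over its soft_limit scans 1.5x the baseline.
 */
static unsigned long soft_limit_scan_target(unsigned long baseline,
					    unsigned long long excess,
					    unsigned long long usage,
					    unsigned int weight)
{
	unsigned long long excess_pct;

	if (!usage || !excess)
		return baseline;
	excess_pct = div64_u64(excess * 100, usage);

	return baseline + baseline * excess_pct * weight / (100 * 100);
}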

Ying Han (4):
  Disable "organizing cgroups over soft limit in a RB-Tree"
  Organize memcgs over soft limit in round-robin.
  Implementation of soft_limit reclaim in round-robin.
  Add some debugging stats

 include/linux/memcontrol.h    |   17 ++-
 include/linux/vm_event_item.h |    1 +
 mm/memcontrol.c               |  379 ++++++++++++++++++++---------------------
 mm/vmscan.c                   |   28 ++--
 mm/vmstat.c                   |    2 +
 5 files changed, 220 insertions(+), 207 deletions(-)

-- 
1.7.3.1



* [RFC PATCH 1/4] Disable "organizing cgroups over soft limit in a RB-Tree"
  2011-05-12 18:47 [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention Ying Han
@ 2011-05-12 18:47 ` Ying Han
  2011-05-12 18:47 ` [RFC PATCH 2/4] Organize memcgs over soft limit in round-robin Ying Han
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Ying Han @ 2011-05-12 18:47 UTC (permalink / raw)
  To: Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai
  Cc: linux-mm

The current implementation of soft_limit reclaim is based on a per-zone RB
tree, where only the cgroup that exceeds its soft_limit the most is selected
for reclaim.

The problems are:

1. It takes no account of how many pages this cgroup actually has allocated on
the zone. The RB tree is indexed by the cgroup's (usage - soft_limit).

2. It makes less sense to reclaim from only one cgroup than to reclaim from all
cgroups based on a calculated proportion. The latter is required for fairness.

3. The target of soft_limit reclaim is to bring one cgroup's usage under its
soft_limit, whereas the target of global memory pressure is to reclaim pages
until the zone is above its high_wmark.

So the current soft_limit reclaim is far from fulfilling the efficiency
requirement. From the discussion in the LSF MM session, we agreed to switch to
organizing the memcgs in a round-robin fashion. This patch reverts the current
RB-Tree implementation.

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |  304 +------------------------------------------------------
 1 files changed, 1 insertions(+), 303 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index da1fb2b..9da3ecf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -34,7 +34,6 @@
 #include <linux/rcupdate.h>
 #include <linux/limits.h>
 #include <linux/mutex.h>
-#include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
@@ -137,10 +136,6 @@ struct mem_cgroup_per_zone {
 	unsigned long		count[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
-	struct rb_node		tree_node;	/* RB tree node */
-	unsigned long long	usage_in_excess;/* Set to the value by which */
-						/* the soft limit is exceeded*/
-	bool			on_tree;
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
 };
@@ -155,26 +150,6 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_zone {
-	struct rb_root rb_root;
-	spinlock_t lock;
-};
-
-struct mem_cgroup_tree_per_node {
-	struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_tree {
-	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
 struct mem_cgroup_threshold {
 	struct eventfd_ctx *eventfd;
 	u64 threshold;
@@ -384,164 +359,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *mem, struct page *page)
 	return mem_cgroup_zoneinfo(mem, nid, zid);
 }
 
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
-{
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
-{
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz,
-				unsigned long long new_usage_in_excess)
-{
-	struct rb_node **p = &mctz->rb_root.rb_node;
-	struct rb_node *parent = NULL;
-	struct mem_cgroup_per_zone *mz_node;
-
-	if (mz->on_tree)
-		return;
-
-	mz->usage_in_excess = new_usage_in_excess;
-	if (!mz->usage_in_excess)
-		return;
-	while (*p) {
-		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
-					tree_node);
-		if (mz->usage_in_excess < mz_node->usage_in_excess)
-			p = &(*p)->rb_left;
-		/*
-		 * We can't avoid mem cgroups that are over their soft
-		 * limit by the same amount
-		 */
-		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
-			p = &(*p)->rb_right;
-	}
-	rb_link_node(&mz->tree_node, parent, p);
-	rb_insert_color(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = true;
-}
-
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	if (!mz->on_tree)
-		return;
-	rb_erase(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = false;
-}
-
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	spin_lock(&mctz->lock);
-	__mem_cgroup_remove_exceeded(mem, mz, mctz);
-	spin_unlock(&mctz->lock);
-}
-
-
-static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
-{
-	unsigned long long excess;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-	mctz = soft_limit_tree_from_page(page);
-
-	/*
-	 * Necessary to update all ancestors when hierarchy is used.
-	 * because their event counter is not touched.
-	 */
-	for (; mem; mem = parent_mem_cgroup(mem)) {
-		mz = mem_cgroup_zoneinfo(mem, nid, zid);
-		excess = res_counter_soft_limit_excess(&mem->res);
-		/*
-		 * We have to update the tree if mz is on RB-tree or
-		 * mem is over its softlimit.
-		 */
-		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
-			/* if on-tree, remove it */
-			if (mz->on_tree)
-				__mem_cgroup_remove_exceeded(mem, mz, mctz);
-			/*
-			 * Insert again. mz->usage_in_excess will be updated.
-			 * If excess is 0, no tree ops.
-			 */
-			__mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
-			spin_unlock(&mctz->lock);
-		}
-	}
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
-{
-	int node, zone;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-
-	for_each_node_state(node, N_POSSIBLE) {
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			mz = mem_cgroup_zoneinfo(mem, node, zone);
-			mctz = soft_limit_tree_node_zone(node, zone);
-			mem_cgroup_remove_exceeded(mem, mz, mctz);
-		}
-	}
-}
-
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
-	struct rb_node *rightmost = NULL;
-	struct mem_cgroup_per_zone *mz;
-
-retry:
-	mz = NULL;
-	rightmost = rb_last(&mctz->rb_root);
-	if (!rightmost)
-		goto done;		/* Nothing to reclaim from */
-
-	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
-	/*
-	 * Remove the node now but someone else can add it back,
-	 * we will to add it back at the end of reclaim to its correct
-	 * position in the tree.
-	 */
-	__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
-	if (!res_counter_soft_limit_excess(&mz->mem->res) ||
-		!css_tryget(&mz->mem->css))
-		goto retry;
-done:
-	return mz;
-}
-
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
-	struct mem_cgroup_per_zone *mz;
-
-	spin_lock(&mctz->lock);
-	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
-	return mz;
-}
-
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -727,7 +544,6 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 		__mem_cgroup_target_update(mem, MEM_CGROUP_TARGET_THRESH);
 		if (unlikely(__memcg_event_check(mem,
 			MEM_CGROUP_TARGET_SOFTLIMIT))){
-			mem_cgroup_update_tree(mem, page);
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
@@ -3373,95 +3189,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask,
 					    unsigned long *total_scanned)
 {
-	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
-	unsigned long reclaimed;
-	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
-	unsigned long nr_scanned;
-
-	if (order > 0)
-		return 0;
-
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
-	/*
-	 * This loop can run a while, specially if mem_cgroup's continuously
-	 * keep exceeding their soft limit and putting the system under
-	 * pressure
-	 */
-	do {
-		if (next_mz)
-			mz = next_mz;
-		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
-		if (!mz)
-			break;
-
-		nr_scanned = 0;
-		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
-						gfp_mask,
-						MEM_CGROUP_RECLAIM_SOFT,
-						&nr_scanned);
-		nr_reclaimed += reclaimed;
-		*total_scanned += nr_scanned;
-
-		spin_lock(&mctz->lock);
-
-		/*
-		 * If we failed to reclaim anything from this memory cgroup
-		 * it is time to move on to the next cgroup
-		 */
-		next_mz = NULL;
-		if (!reclaimed) {
-			do {
-				/*
-				 * Loop until we find yet another one.
-				 *
-				 * By the time we get the soft_limit lock
-				 * again, someone might have aded the
-				 * group back on the RB tree. Iterate to
-				 * make sure we get a different mem.
-				 * mem_cgroup_largest_soft_limit_node returns
-				 * NULL if no other cgroup is present on
-				 * the tree
-				 */
-				next_mz =
-				__mem_cgroup_largest_soft_limit_node(mctz);
-				if (next_mz == mz)
-					css_put(&next_mz->mem->css);
-				else /* next_mz == NULL or other memcg */
-					break;
-			} while (1);
-		}
-		__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->mem->res);
-		/*
-		 * One school of thought says that we should not add
-		 * back the node to the tree if reclaim returns 0.
-		 * But our reclaim could return 0, simply because due
-		 * to priority we are exposing a smaller subset of
-		 * memory to reclaim from. Consider this as a longer
-		 * term TODO.
-		 */
-		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
-		spin_unlock(&mctz->lock);
-		css_put(&mz->mem->css);
-		loop++;
-		/*
-		 * Could not reclaim anything and there are no more
-		 * mem cgroups to try or we seem to be looping without
-		 * reclaiming anything.
-		 */
-		if (!nr_reclaimed &&
-			(next_mz == NULL ||
-			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
-			break;
-	} while (!nr_reclaimed);
-	if (next_mz)
-		css_put(&next_mz->mem->css);
-	return nr_reclaimed;
+	return 0;
 }
 
 /*
@@ -4525,8 +4253,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
 			INIT_LIST_HEAD(&mz->lists[l]);
-		mz->usage_in_excess = 0;
-		mz->on_tree = false;
 		mz->mem = mem;
 	}
 	return 0;
@@ -4580,7 +4306,6 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
-	mem_cgroup_remove_from_trees(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -4635,31 +4360,6 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
-static int mem_cgroup_soft_limit_tree_init(void)
-{
-	struct mem_cgroup_tree_per_node *rtpn;
-	struct mem_cgroup_tree_per_zone *rtpz;
-	int tmp, node, zone;
-
-	for_each_node_state(node, N_POSSIBLE) {
-		tmp = node;
-		if (!node_state(node, N_NORMAL_MEMORY))
-			tmp = -1;
-		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
-		if (!rtpn)
-			return 1;
-
-		soft_limit_tree.rb_tree_per_node[node] = rtpn;
-
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			rtpz = &rtpn->rb_tree_per_zone[zone];
-			rtpz->rb_root = RB_ROOT;
-			spin_lock_init(&rtpz->lock);
-		}
-	}
-	return 0;
-}
-
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -4681,8 +4381,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		enable_swap_cgroup();
 		parent = NULL;
 		root_mem_cgroup = mem;
-		if (mem_cgroup_soft_limit_tree_init())
-			goto free_out;
 		for_each_possible_cpu(cpu) {
 			struct memcg_stock_pcp *stock =
 						&per_cpu(memcg_stock, cpu);
-- 
1.7.3.1



* [RFC PATCH 2/4] Organize memcgs over soft limit in round-robin.
  2011-05-12 18:47 [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention Ying Han
  2011-05-12 18:47 ` [RFC PATCH 1/4] Disable "organizing cgroups over soft limit in a RB-Tree" Ying Han
@ 2011-05-12 18:47 ` Ying Han
  2011-05-12 18:47 ` [RFC PATCH 3/4] Implementation of soft_limit reclaim " Ying Han
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Ying Han @ 2011-05-12 18:47 UTC (permalink / raw)
  To: Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai
  Cc: linux-mm

Based on the discussion at LSF, we came up with a design where all the memcgs
are kept on a linked list and reclaim happens in a round-robin fashion.

We build a per-zone memcg list which links the mem_cgroup_per_zone of every
memcg that exceeds its soft_limit and has memory allocated on the zone.

1. a new memcg is examined and inserted once per 1024 increments of
mem_cgroup_commit_charge().

2. under global memory pressure, we iterate the list and try to reclaim a
target number of pages from each memcg.

3. move the memcg to the tail after finishing the reclaim.

4. remove the memcg from the list if its usage drops below the soft_limit.

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/memcontrol.c |  159 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 159 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9da3ecf..1360de6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -136,6 +136,9 @@ struct mem_cgroup_per_zone {
 	unsigned long		count[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
+	struct list_head	soft_limit_list;
+	unsigned long long	usage_in_excess;
+	bool			on_list;
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
 };
@@ -150,6 +153,25 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+/*
+ * Cgroups above their limits are maintained in a linked list, independent of
+ * their hierarchy representation
+ */
+struct mem_cgroup_list_per_zone {
+	struct list_head list;
+	spinlock_t lock;
+};
+
+struct mem_cgroup_list_per_node {
+	struct mem_cgroup_list_per_zone list_per_zone[MAX_NR_ZONES];
+};
+
+struct mem_cgroup_list {
+	struct mem_cgroup_list_per_node *list_per_node[MAX_NUMNODES];
+};
+
+static struct mem_cgroup_list soft_limit_list __read_mostly;
+
 struct mem_cgroup_threshold {
 	struct eventfd_ctx *eventfd;
 	u64 threshold;
@@ -359,6 +381,112 @@ page_cgroup_zoneinfo(struct mem_cgroup *mem, struct page *page)
 	return mem_cgroup_zoneinfo(mem, nid, zid);
 }
 
+static struct mem_cgroup_list_per_zone *
+soft_limit_list_node_zone(int nid, int zid)
+{
+	return &soft_limit_list.list_per_node[nid]->list_per_zone[zid];
+}
+
+static struct mem_cgroup_list_per_zone *
+soft_limit_list_from_page(struct page *page)
+{
+	int nid = page_to_nid(page);
+	int zid = page_zonenum(page);
+
+	return &soft_limit_list.list_per_node[nid]->list_per_zone[zid];
+}
+
+static void
+__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_list_per_zone *mclz,
+				unsigned long long new_usage_in_excess)
+{
+	if (mz->on_list)
+		return;
+
+	mz->usage_in_excess = new_usage_in_excess;
+	if (!mz->usage_in_excess)
+		return;
+
+	list_add(&mz->soft_limit_list, &mclz->list);
+	mz->on_list = true;
+}
+
+static void
+mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_list_per_zone *mclz,
+				unsigned long long new_usage_in_excess)
+{
+	spin_lock(&mclz->lock);
+	__mem_cgroup_insert_exceeded(mem, mz, mclz, new_usage_in_excess);
+	spin_unlock(&mclz->lock);
+}
+
+static void
+__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_list_per_zone *mclz)
+{
+	if (!mz->on_list)
+		return;
+
+	if (list_empty(&mclz->list))
+		return;
+
+	list_del(&mz->soft_limit_list);
+	mz->on_list = false;
+}
+
+static void
+mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_list_per_zone *mclz)
+{
+
+	spin_lock(&mclz->lock);
+	__mem_cgroup_remove_exceeded(mem, mz, mclz);
+	spin_unlock(&mclz->lock);
+}
+
+static void
+mem_cgroup_update_list(struct mem_cgroup *mem, struct page *page)
+{
+	unsigned long long excess;
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_list_per_zone *mclz;
+	int nid = page_to_nid(page);
+	int zid = page_zonenum(page);
+	mclz = soft_limit_list_from_page(page);
+
+	for (; mem; mem = parent_mem_cgroup(mem)) {
+		mz = mem_cgroup_zoneinfo(mem, nid, zid);
+		excess = res_counter_soft_limit_excess(&mem->res);
+
+		if (excess)
+			mem_cgroup_insert_exceeded(mem, mz, mclz, excess);
+		else
+			mem_cgroup_remove_exceeded(mem, mz, mclz);
+	}
+}
+
+static void
+mem_cgroup_remove_from_lists(struct mem_cgroup *mem)
+{
+	int node, zone;
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_list_per_zone *mclz;
+
+	for_each_node_state(node, N_POSSIBLE) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			mz = mem_cgroup_zoneinfo(mem, node, zone);
+			mclz = soft_limit_list_node_zone(node, zone);
+			mem_cgroup_remove_exceeded(mem, mz, mclz);
+		}
+	}
+}
+
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -544,6 +672,7 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 		__mem_cgroup_target_update(mem, MEM_CGROUP_TARGET_THRESH);
 		if (unlikely(__memcg_event_check(mem,
 			MEM_CGROUP_TARGET_SOFTLIMIT))){
+			mem_cgroup_update_list(mem, page);
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
@@ -4253,6 +4382,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
 			INIT_LIST_HEAD(&mz->lists[l]);
+		mz->usage_in_excess = 0;
+		mz->on_list = false;
 		mz->mem = mem;
 	}
 	return 0;
@@ -4306,6 +4437,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
+	mem_cgroup_remove_from_lists(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -4360,6 +4492,31 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+static int mem_cgroup_soft_limit_list_init(void)
+{
+	struct mem_cgroup_list_per_node *rlpn;
+	struct mem_cgroup_list_per_zone *rlpz;
+	int tmp, node, zone;
+
+	for_each_node_state(node, N_POSSIBLE) {
+		tmp = node;
+		if (!node_state(node, N_NORMAL_MEMORY))
+			tmp = -1;
+		rlpn = kzalloc_node(sizeof(*rlpn), GFP_KERNEL, tmp);
+		if (!rlpn)
+			return 1;
+
+		soft_limit_list.list_per_node[node] = rlpn;
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			rlpz = &rlpn->list_per_zone[zone];
+			INIT_LIST_HEAD(&rlpz->list);
+			spin_lock_init(&rlpz->lock);
+		}
+	}
+	return 0;
+}
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -4381,6 +4538,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		enable_swap_cgroup();
 		parent = NULL;
 		root_mem_cgroup = mem;
+		if (mem_cgroup_soft_limit_list_init())
+			goto free_out;
 		for_each_possible_cpu(cpu) {
 			struct memcg_stock_pcp *stock =
 						&per_cpu(memcg_stock, cpu);
-- 
1.7.3.1



* [RFC PATCH 3/4] Implementation of soft_limit reclaim in round-robin.
  2011-05-12 18:47 [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention Ying Han
  2011-05-12 18:47 ` [RFC PATCH 1/4] Disable "organizing cgroups over soft limit in a RB-Tree" Ying Han
  2011-05-12 18:47 ` [RFC PATCH 2/4] Organize memcgs over soft limit in round-robin Ying Han
@ 2011-05-12 18:47 ` Ying Han
  2011-05-12 18:47 ` [RFC PATCH 4/4] Add some debugging stats Ying Han
  2011-05-13  0:40 ` [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention Rik van Riel
  4 siblings, 0 replies; 7+ messages in thread
From: Ying Han @ 2011-05-12 18:47 UTC (permalink / raw)
  To: Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai
  Cc: linux-mm

This patch re-implements the soft_limit reclaim function so that it picks the
next memcg to reclaim from in a round-robin fashion.

For each memcg, we do hierarchical reclaim and check zone_wmark_ok() after
each iteration. There is a per-memcg rate limit on how many pages to scan,
based on how much the memcg exceeds its soft_limit.
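
As a concrete example of that rate limit (excess, computed at the top of
mem_cgroup_hierarchical_reclaim(), is the soft_limit excess in pages): a memcg
that is 1 GiB over its soft_limit, i.e. 262144 4 KiB pages, gets at most
excess >> 2 = 65536 pages (256 MiB) reclaimed in one soft_limit pass before we
move on to the next memcg. The check added by this patch is:

	if (!res_counter_soft_limit_excess(&root_mem->res) ||
	    total >= (excess >> 2))	/* e.g. 262144 >> 2 == 65536 pages */
		return total;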

This patch is a first step in switching from RB-tree based reclaim to
linked-list based reclaim; further improvement of the per-memcg soft_limit
reclaim algorithm is needed next.

Some test results:
Test 1:
Here I have three memcgs, each doing a read of a 20g file on a 32g system (no
swap). Meanwhile I have a program under root with 18g of anon pages pinned.
The limits are listed below as container (hard_limit, soft_limit).

root: 18g anon pages w/o swap

A (20g, 2g):
soft_kswapd_steal 4265600
soft_kswapd_scan 4265600

B (20g, 2g):
soft_kswapd_steal 4265600
soft_kswapd_scan 4265600

C: (20g, 2g)
soft_kswapd_steal 4083904
soft_kswapd_scan 4083904

vmstat:
kswapd_steal 12617255

99.9% steal

These two stats show how often zone_wmark_ok() is satisfied by the soft_limit
reclaim pass vs. the per-zone reclaim:

kswapd_zone_wmark_ok 1974
kswapd_soft_limit_zone_wmark_ok 1969


Test 2:
Here are the same memcgs, but each doing a 20g file write.

root: 18g anon pages w/o swap

A (20g, 2g):
soft_kswapd_steal 4718336
soft_kswapd_scan 4718336

B (20g, 2g):
soft_kswapd_steal 4710304
soft_kswapd_scan 4710304

C (20g, 3g);
soft_kswapd_steal 2933406
soft_kswapd_scan 5797460

kswapd_steal 15958486
77%

kswapd_zone_wmark_ok 2517
kswapd_soft_limit_zone_wmark_ok 2405

TODO:
1. We would like to do better targeted reclaim by calculating the target
nr_to_scan per-memcg, especially when combining the current heuristics with
the soft_limit excess. How much weight should we put on the soft_limit excess,
or do we want to make it configurable?

2. As decided at LSF, we also need a second per-zone list of memcgs that are
under their soft_limit. This is needed to do zone balancing w/o the global
LRU. We shouldn't scan the second list unless the first list is exhausted.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h |    3 +-
 mm/memcontrol.c            |  119 ++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |   25 +++++-----
 3 files changed, 131 insertions(+), 16 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6a0cffd..c7fcb26 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -145,7 +145,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-						gfp_t gfp_mask,
+						gfp_t gfp_mask, int end_zone,
+						unsigned long balance_gap,
 						unsigned long *total_scanned);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1360de6..b87ccc8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1093,6 +1093,19 @@ unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 	return MEM_CGROUP_ZSTAT(mz, lru);
 }
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup_per_zone *mz)
+{
+	unsigned long total = 0;
+
+	if (nr_swap_pages) {
+		total += MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+		total += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON);
+	}
+	total +=  MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
+	total +=  MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE);
+	return total;
+}
+
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone)
 {
@@ -1528,7 +1541,14 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 			return ret;
 		total += ret;
 		if (check_soft) {
-			if (!res_counter_soft_limit_excess(&root_mem->res))
+			/*
+			 * We want to be fair to each memcg in soft_limit
+			 * reclaim based on its excess. excess >> 2 is not too
+			 * big, so we do not reclaim too much, nor too small,
+			 * so we do not keep coming back to this cgroup.
+			 */
+			if (!res_counter_soft_limit_excess(&root_mem->res) ||
+			    total >= (excess >> 2))
 				return total;
 		} else if (mem_cgroup_margin(root_mem))
 			return 1 + total;
@@ -3314,11 +3334,104 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
+static struct mem_cgroup_per_zone *
+__mem_cgroup_next_soft_limit_node(struct mem_cgroup_list_per_zone *mclz)
+{
+	struct mem_cgroup_per_zone *mz;
+
+retry:
+	mz = NULL;
+	if (list_empty(&mclz->list))
+		goto done;
+
+	mz = list_entry(mclz->list.prev, struct mem_cgroup_per_zone,
+			soft_limit_list);
+
+	__mem_cgroup_remove_exceeded(mz->mem, mz, mclz);
+	if (!res_counter_soft_limit_excess(&mz->mem->res) ||
+		!mem_cgroup_zone_reclaimable_pages(mz) ||
+		!css_tryget(&mz->mem->css))
+		goto retry;
+done:
+	return mz;
+}
+
+static struct mem_cgroup_per_zone *
+mem_cgroup_next_soft_limit_node(struct mem_cgroup_list_per_zone *mclz)
+{
+	struct mem_cgroup_per_zone *mz;
+
+	spin_lock(&mclz->lock);
+	mz = __mem_cgroup_next_soft_limit_node(mclz);
+	spin_unlock(&mclz->lock);
+	return mz;
+}
+
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
+					    gfp_t gfp_mask, int end_zone,
+					    unsigned long balance_gap,
 					    unsigned long *total_scanned)
 {
-	return 0;
+	unsigned long nr_reclaimed = 0;
+	unsigned long reclaimed;
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_list_per_zone *mclz;
+	unsigned long long excess;
+	unsigned long nr_scanned;
+	int loop = 0;
+
+	/*
+	 * memcg reclaim doesn't support lumpy.
+	 */
+	if (order > 0)
+		return 0;
+
+	mclz = soft_limit_list_node_zone(zone_to_nid(zone), zone_idx(zone));
+	/*
+	 * Start from the head of list.
+	 */
+	while (!list_empty(&mclz->list)) {
+		mz = mem_cgroup_next_soft_limit_node(mclz);
+		if (!mz)
+			break;
+
+		nr_scanned = 0;
+		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
+							gfp_mask,
+							MEM_CGROUP_RECLAIM_SOFT,
+							&nr_scanned);
+		nr_reclaimed += reclaimed;
+		*total_scanned += nr_scanned;
+
+		spin_lock(&mclz->lock);
+
+		__mem_cgroup_remove_exceeded(mz->mem, mz, mclz);
+		/*
+		 * Add it back to the list even if the number reclaimed
+		 * is zero, as long as the memcg is still above its
+		 * soft_limit. Lots of pages could suddenly become
+		 * reclaimable.
+		 */
+		excess = res_counter_soft_limit_excess(&mz->mem->res);
+		__mem_cgroup_insert_exceeded(mz->mem, mz, mclz, excess);
+
+		spin_unlock(&mclz->lock);
+		css_put(&mz->mem->css);
+		loop++;
+
+		if (zone_watermark_ok_safe(zone, order,
+				high_wmark_pages(zone) + balance_gap,
+				end_zone, 0)) {
+			break;
+		}
+
+		if (loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS ||
+			*total_scanned > nr_reclaimed + nr_reclaimed / 2)
+			break;
+
+	}
+
+	return nr_reclaimed;
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 96789e0..9d79070 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2421,18 +2421,6 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			sc.nr_scanned = 0;
-
-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-			total_scanned += nr_soft_scanned;
-
 			/*
 			 * We put equal pressure on every zone, unless
 			 * one zone has way too many pages free
@@ -2445,6 +2433,19 @@ loop_again:
 				(zone->present_pages +
 					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
 				KSWAPD_ZONE_BALANCE_GAP_RATIO);
+			sc.nr_scanned = 0;
+
+			nr_soft_scanned = 0;
+			/*
+			 * Call soft limit reclaim before calling shrink_zone.
+			 */
+			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
+							order, sc.gfp_mask,
+							end_zone, balance_gap,
+							&nr_soft_scanned);
+			sc.nr_reclaimed += nr_soft_reclaimed;
+			total_scanned += nr_soft_scanned;
+
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone) + balance_gap,
 					end_zone, 0))
-- 
1.7.3.1



* [RFC PATCH 4/4] Add some debugging stats
  2011-05-12 18:47 [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention Ying Han
                   ` (2 preceding siblings ...)
  2011-05-12 18:47 ` [RFC PATCH 3/4] Implementation of soft_limit reclaim " Ying Han
@ 2011-05-12 18:47 ` Ying Han
  2011-05-13  0:40 ` [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention Rik van Riel
  4 siblings, 0 replies; 7+ messages in thread
From: Ying Han @ 2011-05-12 18:47 UTC (permalink / raw)
  To: Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai
  Cc: linux-mm

This patch is not intended to be merged; it only adds debugging stats.

It adds per-memcg counters for insertions into and removals from the list, and
global counters for how often zone_wmark_ok() is satisfied from soft_limit
reclaim.

Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h    |   14 ++++++++++++++
 include/linux/vm_event_item.h |    1 +
 mm/memcontrol.c               |   23 +++++++++++++++++++++++
 mm/vmscan.c                   |    3 ++-
 mm/vmstat.c                   |    2 ++
 5 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c7fcb26..d97aa1c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -121,6 +121,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 extern int do_swap_account;
 #endif
 
+/* background reclaim stats */
+void mem_cgroup_list_insert(struct mem_cgroup *memcg, int val);
+void mem_cgroup_list_remove(struct mem_cgroup *memcg, int val);
+
 static inline bool mem_cgroup_disabled(void)
 {
 	if (mem_cgroup_subsys.disabled)
@@ -363,6 +367,16 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void mem_cgroup_list_insert(struct mem_cgroup *memcg,
+					  int val)
+{
+}
+
+static inline void mem_cgroup_list_remove(struct mem_cgroup *memcg,
+					  int val)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03b90cdc..f226bfd 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -35,6 +35,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
+		KSWAPD_ZONE_WMARK_OK, KSWAPD_SOFT_LIMIT_ZONE_WMARK_OK,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 #ifdef CONFIG_COMPACTION
 		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b87ccc8..bd7c481 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -103,6 +103,8 @@ enum mem_cgroup_events_index {
 					/* soft reclaim in direct reclaim */
 	MEM_CGROUP_EVENTS_SOFT_DIRECT_SCAN, /* # of pages scanned from */
 					/* soft reclaim in direct reclaim */
+	MEM_CGROUP_EVENTS_LIST_INSERT,
+	MEM_CGROUP_EVENTS_LIST_REMOVE,
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 /*
@@ -411,6 +413,7 @@ __mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
 
 	list_add(&mz->soft_limit_list, &mclz->list);
 	mz->on_list = true;
+	mem_cgroup_list_insert(mem, 1);
 }
 
 static void
@@ -437,6 +440,7 @@ __mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
 
 	list_del(&mz->soft_limit_list);
 	mz->on_list = false;
+	mem_cgroup_list_remove(mem, 1);
 }
 
 static void
@@ -550,6 +554,16 @@ void mem_cgroup_pgmajfault(struct mem_cgroup *mem, int val)
 	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT], val);
 }
 
+void mem_cgroup_list_insert(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_LIST_INSERT], val);
+}
+
+void mem_cgroup_list_remove(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_LIST_REMOVE], val);
+}
+
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
 					    enum mem_cgroup_events_index idx)
 {
@@ -3422,6 +3436,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 		if (zone_watermark_ok_safe(zone, order,
 				high_wmark_pages(zone) + balance_gap,
 				end_zone, 0)) {
+			count_vm_events(KSWAPD_SOFT_LIMIT_ZONE_WMARK_OK, 1);
 			break;
 		}
 
@@ -3838,6 +3853,8 @@ enum {
 	MCS_SOFT_KSWAPD_SCAN,
 	MCS_SOFT_DIRECT_STEAL,
 	MCS_SOFT_DIRECT_SCAN,
+	MCS_LIST_INSERT,
+	MCS_LIST_REMOVE,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3866,6 +3883,8 @@ struct {
 	{"soft_kswapd_scan", "total_soft_kswapd_scan"},
 	{"soft_direct_steal", "total_soft_direct_steal"},
 	{"soft_direct_scan", "total_soft_direct_scan"},
+	{"list_insert", "total_list_insert"},
+	{"list_remove", "total_list_remove"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3906,6 +3925,10 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 	s->stat[MCS_PGFAULT] += val;
 	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGMAJFAULT);
 	s->stat[MCS_PGMAJFAULT] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_LIST_INSERT);
+	s->stat[MCS_LIST_INSERT] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_LIST_REMOVE);
+	s->stat[MCS_LIST_REMOVE] += val;
 
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9d79070..fc3da68 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2492,11 +2492,12 @@ loop_again:
 				zone_clear_flag(zone, ZONE_CONGESTED);
 				if (i <= *classzone_idx)
 					balanced += zone->present_pages;
+				count_vm_events(KSWAPD_ZONE_WMARK_OK, 1);
 			}
-
 		}
 		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
 			break;		/* kswapd: all done */
+
 		/*
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
 		 * another pass across the zones.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a2b7344..2b3a7e5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -922,6 +922,8 @@ const char * const vmstat_text[] = {
 	"kswapd_low_wmark_hit_quickly",
 	"kswapd_high_wmark_hit_quickly",
 	"kswapd_skip_congestion_wait",
+	"kswapd_zone_wmark_ok",
+	"kswapd_soft_limit_zone_wmark_ok",
 	"pageoutrun",
 	"allocstall",
 
-- 
1.7.3.1



* Re: [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention
  2011-05-12 18:47 [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention Ying Han
                   ` (3 preceding siblings ...)
  2011-05-12 18:47 ` [RFC PATCH 4/4] Add some debugging stats Ying Han
@ 2011-05-13  0:40 ` Rik van Riel
  2011-05-13  0:54   ` Ying Han
  4 siblings, 1 reply; 7+ messages in thread
From: Rik van Riel @ 2011-05-13  0:40 UTC (permalink / raw)
  To: Ying Han
  Cc: Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm,
	Michel Lespinasse

On 05/12/2011 02:47 PM, Ying Han wrote:

> TODO:
> a) there was a question on how to do zone balancing w/o global LRU. This could be
> solved by building another cgroup list per-zone, where we also link cgroups under
> their soft_limit. We won't scan the list unless the first list being exhausted and
> the free pages is still under the high_wmark.

> b). one of the tricky part is to calculate the target nr_to_scan for each cgroup,
> especially combining the current heuristics with soft_limit exceeds. it depends how
> much weight we need to put on the second. One way is to make the ratio to be user
> configurable.

Johannes addresses these in his patch series.

> Ying Han (4):
>    Disable "organizing cgroups over soft limit in a RB-Tree"
>    Organize memcgs over soft limit in round-robin.
>    Implementation of soft_limit reclaim in round-robin.
>    Add some debugging stats

Looks like you also have some things Johannes doesn't have.

It may be good for the two patch series you have to get
merged into one series, before stuff gets merged upstream.

-- 
All rights reversed



* Re: [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention
  2011-05-13  0:40 ` [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention Rik van Riel
@ 2011-05-13  0:54   ` Ying Han
  0 siblings, 0 replies; 7+ messages in thread
From: Ying Han @ 2011-05-13  0:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
	Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm,
	Michel Lespinasse

On Thu, May 12, 2011 at 5:40 PM, Rik van Riel <riel@redhat.com> wrote:

> On 05/12/2011 02:47 PM, Ying Han wrote:
>
>  TODO:
>> a) there was a question on how to do zone balancing w/o global LRU. This
>> could be
>> solved by building another cgroup list per-zone, where we also link
>> cgroups under
>> their soft_limit. We won't scan the list unless the first list being
>> exhausted and
>> the free pages is still under the high_wmark.
>>
>
>  b). one of the tricky part is to calculate the target nr_to_scan for each
>> cgroup,
>> especially combining the current heuristics with soft_limit exceeds. it
>> depends how
>> much weight we need to put on the second. One way is to make the ratio to
>> be user
>> configurable.
>>
>
> Johannes addresses these in his patch series.


That would be great. I am reading through his patches and apparently have not
gotten there yet :)

>
>
>  Ying Han (4):
>>   Disable "organizing cgroups over soft limit in a RB-Tree"
>>   Organize memcgs over soft limit in round-robin.
>>   Implementation of soft_limit reclaim in round-robin.
>>   Add some debugging stats
>>
>
> Looks like you also have some things Johannes doesn't have.
>
> It may be good for the two patch series you have to get
> merged into one series, before stuff gets merged upstream.
>
Yes, that is my motivation for posting the patches here :)

--Ying

> --
> All rights reversed
>


Thread overview: 7+ messages (newest: 2011-05-13  0:55 UTC)
2011-05-12 18:47 [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention Ying Han
2011-05-12 18:47 ` [RFC PATCH 1/4] Disable "organizing cgroups over soft limit in a RB-Tree" Ying Han
2011-05-12 18:47 ` [RFC PATCH 2/4] Organize memcgs over soft limit in round-robin Ying Han
2011-05-12 18:47 ` [RFC PATCH 3/4] Implementation of soft_limit reclaim " Ying Han
2011-05-12 18:47 ` [RFC PATCH 4/4] Add some debugging stats Ying Han
2011-05-13  0:40 ` [RFC PATCH 0/4] memcg: revisit soft_limit reclaim on contention Rik van Riel
2011-05-13  0:54   ` Ying Han
