* [RFC 0/3] soft reclaim rework @ 2013-04-09 12:13 Michal Hocko

From: Michal Hocko @ 2013-04-09 12:13 UTC
To: linux-mm
Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa

Hi all,
it has been a long time since I promised my take on the $subject, but I
kept getting preempted by other tasks. I have finally gotten to it. This
is just a first attempt - there are still some TODOs - but I wanted to
post it early to get feedback.

The basic idea is quite simple. The first step pulls soft reclaim into
shrink_zone and gets rid of the previous soft reclaim infrastructure.
shrink_zone is now done in two passes. First it tries to do the soft
limit reclaim, and it falls back to reclaim-all mode if no group is over
the limit or no pages have been scanned. The second pass happens at the
same priority, so the only time we waste is the memcg tree walk, which
shouldn't be a big deal. There is certainly room for improvement in that
direction, but let's keep it simple for now. As a bonus we get rid of a
_lot_ of code, and soft reclaim no longer stands out the way it did
before.

The second step is somewhat more controversial. It redefines the meaning
of the default soft limit value. I have not chosen 0, as we discussed
previously, because I want to preserve the hierarchical property of the
soft limit (if a parent up the hierarchy is over its limit then its
children are over as well). So I have kept the default untouched -
unlimited - but slightly changed its meaning: I interpret it as "the
user doesn't care about the soft limit". More precisely, the value is
ignored unless it has been specified by the user, so such groups are
eligible for soft reclaim even though they have not reached the limit.
Such groups do not force their children to be reclaimed, of course.

I guess the only use case where this wouldn't work as expected is when
somebody creates a group and sets its soft limit to a small value (e.g.
0) just to protect all the other groups from being reclaimed. With the
new scheme all groups would be reclaimed, while the previous
implementation could end up reclaiming only the "special" group. The
same configuration can be achieved trivially under the new scheme as
well, so I think we should be safe. Or does this sound like a big
problem?

Finally, the third step integrates soft limit reclaim into targeted
reclaim. The patch is a trivial one-liner.

I haven't gotten to test this properly yet. I have tested only 2
workloads:

1) 1GB RAM + 128MB swap in a kvm guest (4GB RAM on the host)
   - 2 memcgs (directly under root)
   - A has a 500MB soft limit and an unlimited hard limit
   - B has both hard and soft limits unlimited (default values)
   - one dd if=/dev/zero of=storage/$file bs=1024 count=1228800 per group

2) same setup
   - tar -xf linux source tree + make -j2 vmlinux

Results

1) I've checked memory.usage_in_bytes:

                            Group A      Group B
   Base (-mm tree)  median  446498816    448659456
   Patched          median  524314624    377921536

So, as expected, A got more room at the expense of B and it is nicely
over its soft limit. I wanted to compare the reclaim performance as
well, but we do not account scanned and reclaimed pages during the old
soft reclaim (global_reclaim prevents that). I am planning to look at
that, though.
Anyway, it doesn't look like we are scanning/reclaiming more with the
patched kernel:

   Base:    pgscan_kswapd_dma32 394382    pgsteal_kswapd_dma32 394372
   Patched: pgscan_kswapd_dma32 394501    pgsteal_kswapd_dma32 394491

So I would assume that it was the (unaccounted) soft limit reclaim in
the base kernel that scanned more in the end. The total runtime was
slightly smaller for the patched version:

                        Group A     Group B
   Base    total time   480.087 s   480.067 s
   Patched total time   474.853 s   474.736 s

But this could be an artifact of guest scheduling, or related to host
activity, so I wouldn't draw any conclusions from it.

2) The kbuild test showed more or less the same results.
usage_in_bytes:

                        Group A     Group B
   Base     median      394817536   395634688
   Patched  median      483481600   302131200

A is kept closer to its soft limit again. There is some fluctuation
around the limit because kbuild creates a lot of short-lived processes.

   Base:    pgscan_kswapd_dma32 1648718   pgsteal_kswapd_dma32 1510749
   Patched: pgscan_kswapd_dma32 2042065   pgsteal_kswapd_dma32 1667745

The differences are much bigger now, so it would be interesting to see
how much has been scanned/reclaimed during soft reclaim in the base
kernel. I haven't included total runtime statistics here because they
seemed even more random due to guest/host interaction.

Any comments are welcome, of course.

Michal Hocko (3):
      memcg: integrate soft reclaim tighter with zone shrinking code
      memcg: Ignore soft limit until it is explicitly specified
      vmscan, memcg: Do softlimit reclaim also for targeted reclaim

Incomplete diffstat (without the node-zone soft limit tree removal
etc.), so more deletions are to come:

 include/linux/memcontrol.h |   10 +--
 mm/memcontrol.c            |  175 +++++++++----------------------------
 mm/vmscan.c                |   67 ++++++++++-------
 3 files changed, 78 insertions(+), 174 deletions(-)
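To make the redefined default concrete, the eligibility rule for the
first reclaim pass boils down to something like the following. This is a
minimal standalone model for illustration only - struct memcg and the
soft_limit_set flag are made-up stand-ins (the latter for the "has the
user explicitly written the soft limit" state from the second step), not
the kernel's types:

#include <stdbool.h>
#include <stdio.h>

/* Simplified model of a memcg, for illustration only. */
struct memcg {
	struct memcg *parent;
	unsigned long usage;
	unsigned long soft_limit;
	bool soft_limit_set;	/* has the user ever written the limit? */
};

static bool over_limit(const struct memcg *g)
{
	return g->soft_limit_set && g->usage > g->soft_limit;
}

static bool soft_reclaim_eligible(const struct memcg *g)
{
	/*
	 * An unset limit means "the user doesn't care": the group is
	 * fair game for the soft reclaim pass...
	 */
	if (!g->soft_limit_set || over_limit(g))
		return true;
	/*
	 * ...but unset-ness does not propagate down; only a parent
	 * which is really over its (explicitly set) limit drags the
	 * whole subtree into the first pass.
	 */
	for (g = g->parent; g; g = g->parent)
		if (over_limit(g))
			return true;
	return false;
}

int main(void)
{
	struct memcg root = { 0 };	/* soft limit left unset */
	struct memcg a = { &root, 600UL << 20, 500UL << 20, true };
	struct memcg b = { &root, 300UL << 20, 500UL << 20, true };

	printf("A eligible: %d\n", soft_reclaim_eligible(&a));	/* 1 */
	printf("B eligible: %d\n", soft_reclaim_eligible(&b));	/* 0 */
	return 0;
}

With a setup like workload 1 above, the group over its explicitly set
soft limit (A) is eligible for the first pass while its sibling under
the limit (B) is not, and the unset limit on the root does not drag B in.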
* [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-04-09 12:13 Michal Hocko

From: Michal Hocko @ 2013-04-09 12:13 UTC
To: linux-mm
Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa

Memcg soft reclaim has traditionally been triggered from the global
reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
then picked up the group which exceeded its soft limit the most and
reclaimed it with priority 0 to reclaim at least SWAP_CLUSTER_MAX pages.

The infrastructure requires per-node-zone trees which hold the
over-limit groups and keep them up-to-date (via memcg_check_events),
which is not cost free. Although this overhead hasn't turned out to be a
bottleneck, the implementation is suboptimal, because
mem_cgroup_update_tree has no idea which zones consumed memory over the
limit, so we could easily end up with a group on a node-zone tree which
holds only a few pages from that node-zone.

This patch doesn't try to fix the node-zone tree management, because
integrating soft reclaim into zone shrinking sounds much easier and more
appropriate, for several reasons. First of all, priority 0 reclaim was a
crude hack which might lead to big stalls if the group's LRUs are big
and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim
should also be applicable to targeted reclaim, which is awkward right
now without additional hacks. Last but not least, the whole
infrastructure eats a lot of code[1].

After this patch shrink_zone is done in two passes. First it tries to do
the soft reclaim if appropriate (only for global reclaim for now, to
stay compatible with the current state) and falls back to ignoring the
soft limit if no group is eligible for soft reclaim or nothing has been
scanned during the first pass. Only groups which are over their soft
limit, or which have a parent up the hierarchy over its limit, are
considered eligible during the first pass.

TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and
co., but maybe it would be easier for review to remove that code in a
separate patch...
---
[1] TODO: put size vmlinux before/after whole clean-up

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |   10 +--
 mm/memcontrol.c            |  161 ++++++--------------------------------------
 mm/vmscan.c                |   67 +++++++++++-------
 3 files changed, 64 insertions(+), 174 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..1833c95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned);
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg);
 
 void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
@@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
 {
-	return 0;
+	return false;
 }
 
 static inline void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f608546..33424d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2060,57 +2060,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
 }
 #endif
 
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
-				   struct zone *zone,
-				   gfp_t gfp_mask,
-				   unsigned long *total_scanned)
-{
-	struct mem_cgroup *victim = NULL;
-	int total = 0;
-	int loop = 0;
-	unsigned long excess;
-	unsigned long nr_scanned;
-	struct mem_cgroup_reclaim_cookie reclaim = {
-		.zone = zone,
-		.priority = 0,
-	};
+/*
+ * A group is eligible for the soft limit reclaim if it is
+ *	a) is over its soft limit
+ *	b) any parent up the hierarchy is over its soft limit
+ */
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
 
-	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
-
-	while (1) {
-		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
-		if (!victim) {
-			loop++;
-			if (loop >= 2) {
-				/*
-				 * If we have not been able to reclaim
-				 * anything, it might because there are
-				 * no reclaimable pages under this hierarchy
-				 */
-				if (!total)
-					break;
-				/*
-				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
-				 * reclaim too much, nor too less that we keep
-				 * coming back to reclaim from this cgroup
-				 */
-				if (total >= (excess >> 2) ||
-					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
-					break;
-			}
-			continue;
-		}
-		if (!mem_cgroup_reclaimable(victim, false))
-			continue;
-		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-						     zone, &nr_scanned);
-		*total_scanned += nr_scanned;
-		if (!res_counter_soft_limit_excess(&root_memcg->res))
-			break;
+	if (res_counter_soft_limit_excess(&memcg->res))
+		return true;
+
+	/*
+	 * If any parent up the hierarchy is over its soft limit then we
+	 * have to obey and reclaim from this group as well.
+	 */
+	while((parent = parent_mem_cgroup(parent))) {
+		if (res_counter_soft_limit_excess(&parent->res))
+			return true;
 	}
-	mem_cgroup_iter_break(root_memcg, victim);
-	return total;
+
+	return false;
 }
 
 /*
@@ -4724,98 +4695,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
-{
-	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
-	unsigned long reclaimed;
-	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
-	unsigned long nr_scanned;
-
-	if (order > 0)
-		return 0;
-
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
-	/*
-	 * This loop can run a while, specially if mem_cgroup's continuously
-	 * keep exceeding their soft limit and putting the system under
-	 * pressure
-	 */
-	do {
-		if (next_mz)
-			mz = next_mz;
-		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
-		if (!mz)
-			break;
-
-		nr_scanned = 0;
-		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone,
-						    gfp_mask, &nr_scanned);
-		nr_reclaimed += reclaimed;
-		*total_scanned += nr_scanned;
-		spin_lock(&mctz->lock);
-
-		/*
-		 * If we failed to reclaim anything from this memory cgroup
-		 * it is time to move on to the next cgroup
-		 */
-		next_mz = NULL;
-		if (!reclaimed) {
-			do {
-				/*
-				 * Loop until we find yet another one.
-				 *
-				 * By the time we get the soft_limit lock
-				 * again, someone might have aded the
-				 * group back on the RB tree. Iterate to
-				 * make sure we get a different mem.
-				 * mem_cgroup_largest_soft_limit_node returns
-				 * NULL if no other cgroup is present on
-				 * the tree
-				 */
-				next_mz =
-				__mem_cgroup_largest_soft_limit_node(mctz);
-				if (next_mz == mz)
-					css_put(&next_mz->memcg->css);
-				else /* next_mz == NULL or other memcg */
-					break;
-			} while (1);
-		}
-		__mem_cgroup_remove_exceeded(mz->memcg, mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->memcg->res);
-		/*
-		 * One school of thought says that we should not add
-		 * back the node to the tree if reclaim returns 0.
-		 * But our reclaim could return 0, simply because due
-		 * to priority we are exposing a smaller subset of
-		 * memory to reclaim from. Consider this as a longer
-		 * term TODO.
-		 */
-		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess);
-		spin_unlock(&mctz->lock);
-		css_put(&mz->memcg->css);
-		loop++;
-		/*
-		 * Could not reclaim anything and there are no more
-		 * mem cgroups to try or we seem to be looping without
-		 * reclaiming anything.
-		 */
-		if (!nr_reclaimed &&
-			(next_mz == NULL ||
-			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
-			break;
-	} while (!nr_reclaimed);
-	if (next_mz)
-		css_put(&next_mz->memcg->css);
-	return nr_reclaimed;
-}
-
 /**
  * mem_cgroup_force_empty_list - clears LRU of a group
  * @memcg: group to clear
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df78d17..ae3a387 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -138,11 +138,21 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+	return global_reclaim(sc);
+}
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+	return false;
+}
 #endif
 
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -1942,9 +1952,11 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
+static unsigned
+__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 {
 	unsigned long nr_reclaimed, nr_scanned;
+	unsigned nr_shrunk = 0;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		do {
 			struct lruvec *lruvec;
 
+			if (soft_reclaim &&
+					!mem_cgroup_soft_reclaim_eligible(memcg)) {
+				memcg = mem_cgroup_iter(root, memcg, &reclaim);
+				continue;
+			}
+
+			nr_shrunk++;
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
 			shrink_lruvec(lruvec, sc);
@@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		} while (memcg);
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
+
+	return nr_shrunk;
+}
+
+
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
+{
+	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
+	unsigned long nr_scanned = sc->nr_scanned;
+	unsigned nr_shrunk;
+
+	nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim);
+
+	/*
+	 * No group is over the soft limit or those that are do not have
+	 * pages in the zone we are reclaiming so we have to reclaim everybody
+	 */
+	if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) {
+		__shrink_zone(zone, sc, false);
+		return;
+	}
 }
 
 /* Returns true if compaction should go ahead for a high-order request */
@@ -2047,8 +2087,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
-	unsigned long nr_soft_reclaimed;
-	unsigned long nr_soft_scanned;
 	bool aborted_reclaim = false;
 
 	/*
@@ -2088,18 +2126,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 				continue;
 			}
 		}
-		/*
-		 * This steals pages from memory cgroups over softlimit
-		 * and returns the number of reclaimed pages and
-		 * scanned pages. This works for global memory pressure
-		 * and balancing, not for a memcg's limit.
-		 */
-		nr_soft_scanned = 0;
-		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-						sc->order, sc->gfp_mask,
-						&nr_soft_scanned);
-		sc->nr_reclaimed += nr_soft_reclaimed;
-		sc->nr_scanned += nr_soft_scanned;
 		/* need some check for avoid more shrink_zone() */
 	}
 
@@ -2620,8 +2646,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	struct reclaim_state *reclaim_state = current->reclaim_state;
-	unsigned long nr_soft_reclaimed;
-	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
@@ -2720,15 +2744,6 @@ loop_again:
 
 			sc.nr_scanned = 0;
 
-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-
 			/*
 			 * We put equal pressure on every zone, unless
 			 * one zone has way too many pages free
-- 
1.7.10.4
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko @ 2013-04-09 13:08 ` Johannes Weiner 2013-04-09 13:31 ` Michal Hocko 2013-04-09 13:57 ` Glauber Costa ` (2 subsequent siblings) 3 siblings, 1 reply; 27+ messages in thread From: Johannes Weiner @ 2013-04-09 13:08 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote: > Memcg soft reclaim has been traditionally triggered from the global > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim > then picked up a group which exceeds the soft limit the most and > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. > > The infrastructure requires per-node-zone trees which hold over-limit > groups and keep them up-to-date (via memcg_check_events) which is not > cost free. Although this overhead hasn't turned out to be a bottle neck > the implementation is suboptimal because mem_cgroup_update_tree has no > idea which zones consumed memory over the limit so we could easily end > up having a group on a node-zone tree having only few pages from that > node-zone. > > This patch doesn't try to fix node-zone trees management because it > seems that integrating soft reclaim into zone shrinking sounds much > easier and more appropriate for several reasons. > First of all 0 priority reclaim was a crude hack which might lead to > big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot > of dirty/writeback pages). > Soft reclaim should be applicable also to the targeted reclaim which is > awkward right now without additional hacks. > Last but not least the whole infrastructure eats a lot of code[1]. > > After this patch shrink_zone is done in 2. First it tries to do the > soft reclaim if appropriate (only for global reclaim for now to keep > compatible with the current state) and fall back to ignoring soft limit > if no group is eligible to soft reclaim or nothing has been scanned > during the first pass. Only groups which are over their soft limit or > any of their parent up the hierarchy is over the limit are considered > eligible during the first pass. > > TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and co. > but maybe it would be easier for review to remove that code in a separate > patch... It should be in this series, though, for the diffstat :-) > --- > [1] TODO: put size vmlinux before/after whole clean-up Yes! > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > } while (memcg); > } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, > sc->nr_scanned - nr_scanned, sc)); > + > + return nr_shrunk; > +} > + > + > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > +{ > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > + unsigned long nr_scanned = sc->nr_scanned; > + unsigned nr_shrunk; > + > + nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim); > + > + /* > + * No group is over the soft limit or those that are do not have > + * pages in the zone we are reclaiming so we have to reclaim everybody > + */ > + if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) { If no pages were scanned you are doing a second pass regardless of nr_shrunk. 
If pages were scanned, nr_shrunk must have been increased
as well. So I think you can remove all the nr_shrunk counting and
just check for scanned pages, no?
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-04-09 13:31 Michal Hocko

From: Michal Hocko @ 2013-04-09 13:31 UTC
To: Johannes Weiner
Cc: linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa

On Tue 09-04-13 09:08:33, Johannes Weiner wrote:
> On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote:
[...]
> > TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and co.
> > but maybe it would be easier for review to remove that code in a separate
> > patch...
> 
> It should be in this series, though, for the diffstat :-)

Sure thing, I just wanted to avoid pointless work during rebasing while
this is still changing its shape, like all such big changes.

> > ---
> > [1] TODO: put size vmlinux before/after whole clean-up
> 
> Yes!
> 
> > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > 		} while (memcg);
> > 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
> > 					 sc->nr_scanned - nr_scanned, sc));
> > +
> > +	return nr_shrunk;
> > +}
> > +
> > +
> > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +{
> > +	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
> > +	unsigned long nr_scanned = sc->nr_scanned;
> > +	unsigned nr_shrunk;
> > +
> > +	nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim);
> > +
> > +	/*
> > +	 * No group is over the soft limit or those that are do not have
> > +	 * pages in the zone we are reclaiming so we have to reclaim everybody
> > +	 */
> > +	if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) {
> 
> If no pages were scanned you are doing a second pass regardless of
> nr_shrunk. If pages were scanned, nr_shrunk must have been increased
> as well. So I think you can remove all the nr_shrunk counting and
> just check for scanned pages, no?

Yes, you are right. I started with the nr_shrunk part only and then
realized that no scanning could be a problem, so I just added the check.
I haven't optimized it yet. I will remove the nr_shrunk part in later
versions.

Thanks
-- 
Michal Hocko
SUSE Labs
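With the nr_shrunk counting removed as agreed above, shrink_zone would
reduce to roughly the following (an illustrative sketch, not a posted
revision; __shrink_zone would simply return void):

static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
	unsigned long nr_scanned = sc->nr_scanned;

	/* First pass: only groups eligible for the soft limit reclaim. */
	__shrink_zone(zone, sc, do_soft_reclaim);

	/*
	 * Nothing was scanned in the first pass - either no group is
	 * over its soft limit or the eligible groups have no pages on
	 * this zone - so repeat the pass over everybody.
	 */
	if (do_soft_reclaim && sc->nr_scanned == nr_scanned)
		__shrink_zone(zone, sc, false);
}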
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko 2013-04-09 13:08 ` Johannes Weiner @ 2013-04-09 13:57 ` Glauber Costa 2013-04-09 14:22 ` Michal Hocko 2013-04-09 16:45 ` Kamezawa Hiroyuki 2013-04-14 0:42 ` Mel Gorman 3 siblings, 1 reply; 27+ messages in thread From: Glauber Costa @ 2013-04-09 13:57 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman On 04/09/2013 04:13 PM, Michal Hocko wrote: > Memcg soft reclaim has been traditionally triggered from the global > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim > then picked up a group which exceeds the soft limit the most and > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. > > The infrastructure requires per-node-zone trees which hold over-limit > groups and keep them up-to-date (via memcg_check_events) which is not > cost free. Although this overhead hasn't turned out to be a bottle neck > the implementation is suboptimal because mem_cgroup_update_tree has no > idea which zones consumed memory over the limit so we could easily end > up having a group on a node-zone tree having only few pages from that > node-zone. > > This patch doesn't try to fix node-zone trees management because it > seems that integrating soft reclaim into zone shrinking sounds much > easier and more appropriate for several reasons. > First of all 0 priority reclaim was a crude hack which might lead to > big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot > of dirty/writeback pages). > Soft reclaim should be applicable also to the targeted reclaim which is > awkward right now without additional hacks. > Last but not least the whole infrastructure eats a lot of code[1]. > > After this patch shrink_zone is done in 2. First it tries to do the > soft reclaim if appropriate (only for global reclaim for now to keep > compatible with the current state) and fall back to ignoring soft limit > if no group is eligible to soft reclaim or nothing has been scanned > during the first pass. Only groups which are over their soft limit or > any of their parent up the hierarchy is over the limit are considered > eligible during the first pass. > > TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and co. > but maybe it would be easier for review to remove that code in a separate > patch... > Well, the concept is obviously headed right. Code comments: > +/* > + * A group is eligible for the soft limit reclaim if it is > + * a) is over its soft limit > + * b) any parent up the hierarchy is over its soft limit > + */ > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *parent = memcg; > > - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; > - > - while (1) { > - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); > - if (!victim) { > - loop++; > - if (loop >= 2) { > - /* > - * If we have not been able to reclaim > - * anything, it might because there are > - * no reclaimable pages under this hierarchy > - */ > - if (!total) > - break; > - /* > - * We want to do more targeted reclaim. 
> - * excess >> 2 is not to excessive so as to > - * reclaim too much, nor too less that we keep > - * coming back to reclaim from this cgroup > - */ > - if (total >= (excess >> 2) || > - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) > - break; > - } > - continue; > - } > - if (!mem_cgroup_reclaimable(victim, false)) > - continue; > - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, > - zone, &nr_scanned); > - *total_scanned += nr_scanned; > - if (!res_counter_soft_limit_excess(&root_memcg->res)) > - break; > + if (res_counter_soft_limit_excess(&memcg->res)) > + return true; > + > + /* > + * If any parent up the hierarchy is over its soft limit then we > + * have to obey and reclaim from this group as well. > + */ > + while((parent = parent_mem_cgroup(parent))) { > + if (res_counter_soft_limit_excess(&parent->res)) > + return true; > } > - mem_cgroup_iter_break(root_memcg, victim); > - return total; > + > + return false; > } > good work. There is a confusion with parent here, but I believe Johnny had already noted it. > > -static void shrink_zone(struct zone *zone, struct scan_control *sc) > +static unsigned > +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) > { > unsigned long nr_reclaimed, nr_scanned; > + unsigned nr_shrunk = 0; > > do { > struct mem_cgroup *root = sc->target_mem_cgroup; > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > do { > struct lruvec *lruvec; > > + if (soft_reclaim && > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > + continue; > + } > + > + nr_shrunk++; > lruvec = mem_cgroup_zone_lruvec(zone, memcg); > > shrink_lruvec(lruvec, sc); > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > } while (memcg); > } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, > sc->nr_scanned - nr_scanned, sc)); > + > + return nr_shrunk; > +} > + > + > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > +{ > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > + unsigned long nr_scanned = sc->nr_scanned; > + unsigned nr_shrunk; > + > + nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim); > + > + /* > + * No group is over the soft limit or those that are do not have > + * pages in the zone we are reclaiming so we have to reclaim everybody > + */ > + if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) { > + __shrink_zone(zone, sc, false); > + return; > + } > } If I read this correctly, you stop shrinking when you reach a group in which you manage to shrink some pages. Is it really what we want? We have no guarantee that we're now under the soft limit, so shouldn't we keep shrinking downwards until every parent of ours is within limits ? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-04-09 14:22 Michal Hocko

From: Michal Hocko @ 2013-04-09 14:22 UTC
To: Glauber Costa
Cc: linux-mm, Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman

On Tue 09-04-13 17:57:54, Glauber Costa wrote:
> On 04/09/2013 04:13 PM, Michal Hocko wrote:
[...]
> > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +{
> > +	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
> > +	unsigned long nr_scanned = sc->nr_scanned;
> > +	unsigned nr_shrunk;
> > +
> > +	nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim);
> > +
> > +	/*
> > +	 * No group is over the soft limit or those that are do not have
> > +	 * pages in the zone we are reclaiming so we have to reclaim everybody
> > +	 */
> > +	if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) {
> > +		__shrink_zone(zone, sc, false);
> > +		return;
> > +	}
> > }
> 
> If I read this correctly, you stop shrinking when you reach a group in
> which you manage to shrink some pages. Is it really what we want?

Well, this is what we do during standard reclaim: __shrink_zone either
walks all children of the target_memcg or reclaims enough pages.

> We have no guarantee that we're now under the soft limit, so shouldn't
> we keep shrinking downwards until every parent of ours is within limits ?

I do not think we should reclaim until we are under the soft limit,
because our primary target is different - balancing zones resp. getting
under the hard limit. The soft limit just helps us point at victims (and
newly also to protect high class citizens). So the second round is just
a way to reclaim at least something if there is nobody eligible for the
soft limit part of the game.

I can see some harder conditions for the fallback (e.g. only falling
back after a certain priority, roughly as sketched below), but let's
keep this simple for now and do additional parts on top.

Thanks
-- 
Michal Hocko
SUSE Labs
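Such a priority-gated fallback might look like this inside shrink_zone
(purely illustrative; the DEF_PRIORITY - 2 threshold is an arbitrary
stand-in, not something from the posted series):

	/*
	 * Only give up on the soft limit restricted pass and reclaim
	 * from everybody once a couple of priority rounds have failed
	 * to make progress.  The threshold is arbitrary here.
	 */
	if (do_soft_reclaim && sc->nr_scanned == nr_scanned &&
	    sc->priority <= DEF_PRIORITY - 2)
		__shrink_zone(zone, sc, false);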
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko 2013-04-09 13:08 ` Johannes Weiner 2013-04-09 13:57 ` Glauber Costa @ 2013-04-09 16:45 ` Kamezawa Hiroyuki 2013-04-09 17:05 ` Michal Hocko 2013-04-14 0:42 ` Mel Gorman 3 siblings, 1 reply; 27+ messages in thread From: Kamezawa Hiroyuki @ 2013-04-09 16:45 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Ying Han, Johannes Weiner, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa (2013/04/09 21:13), Michal Hocko wrote: > Memcg soft reclaim has been traditionally triggered from the global > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim > then picked up a group which exceeds the soft limit the most and > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. > > The infrastructure requires per-node-zone trees which hold over-limit > groups and keep them up-to-date (via memcg_check_events) which is not > cost free. Although this overhead hasn't turned out to be a bottle neck > the implementation is suboptimal because mem_cgroup_update_tree has no > idea which zones consumed memory over the limit so we could easily end > up having a group on a node-zone tree having only few pages from that > node-zone. > > This patch doesn't try to fix node-zone trees management because it > seems that integrating soft reclaim into zone shrinking sounds much > easier and more appropriate for several reasons. > First of all 0 priority reclaim was a crude hack which might lead to > big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot > of dirty/writeback pages). > Soft reclaim should be applicable also to the targeted reclaim which is > awkward right now without additional hacks. > Last but not least the whole infrastructure eats a lot of code[1]. > > After this patch shrink_zone is done in 2. First it tries to do the > soft reclaim if appropriate (only for global reclaim for now to keep > compatible with the current state) and fall back to ignoring soft limit > if no group is eligible to soft reclaim or nothing has been scanned > during the first pass. Only groups which are over their soft limit or > any of their parent up the hierarchy is over the limit are considered > eligible during the first pass. > > TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and co. > but maybe it would be easier for review to remove that code in a separate > patch... > If we don't make prioritization based on excessed usage against soft-limit and visit all memcgs, dropping per-zone-tree makes sense. (*)I don't like current prioitization. 
> --- > [1] TODO: put size vmlinux before/after whole clean-up > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > include/linux/memcontrol.h | 10 +-- > mm/memcontrol.c | 161 ++++++-------------------------------------- > mm/vmscan.c | 67 +++++++++++------- > 3 files changed, 64 insertions(+), 174 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index d6183f0..1833c95 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, > mem_cgroup_update_page_stat(page, idx, -1); > } > > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned); > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); > > void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); > static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, > @@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, > } > > static inline > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > { > - return 0; > + return false; > } > > static inline void mem_cgroup_split_huge_fixup(struct page *head) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index f608546..33424d8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2060,57 +2060,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) > } > #endif > > -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, > - struct zone *zone, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > -{ > - struct mem_cgroup *victim = NULL; > - int total = 0; > - int loop = 0; > - unsigned long excess; > - unsigned long nr_scanned; > - struct mem_cgroup_reclaim_cookie reclaim = { > - .zone = zone, > - .priority = 0, > - }; > +/* > + * A group is eligible for the soft limit reclaim if it is > + * a) is over its soft limit > + * b) any parent up the hierarchy is over its soft limit > + */ > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *parent = memcg; > > - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; > - > - while (1) { > - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); > - if (!victim) { > - loop++; > - if (loop >= 2) { > - /* > - * If we have not been able to reclaim > - * anything, it might because there are > - * no reclaimable pages under this hierarchy > - */ > - if (!total) > - break; > - /* > - * We want to do more targeted reclaim. > - * excess >> 2 is not to excessive so as to > - * reclaim too much, nor too less that we keep > - * coming back to reclaim from this cgroup > - */ > - if (total >= (excess >> 2) || > - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) > - break; > - } > - continue; > - } > - if (!mem_cgroup_reclaimable(victim, false)) > - continue; > - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, > - zone, &nr_scanned); > - *total_scanned += nr_scanned; > - if (!res_counter_soft_limit_excess(&root_memcg->res)) > - break; > + if (res_counter_soft_limit_excess(&memcg->res)) > + return true; > + > + /* > + * If any parent up the hierarchy is over its soft limit then we > + * have to obey and reclaim from this group as well. 
> + */ > + while((parent = parent_mem_cgroup(parent))) { > + if (res_counter_soft_limit_excess(&parent->res)) > + return true; > } > - mem_cgroup_iter_break(root_memcg, victim); > - return total; > + > + return false; > } > > /* > @@ -4724,98 +4695,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, > return ret; > } > > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > -{ > - unsigned long nr_reclaimed = 0; > - struct mem_cgroup_per_zone *mz, *next_mz = NULL; > - unsigned long reclaimed; > - int loop = 0; > - struct mem_cgroup_tree_per_zone *mctz; > - unsigned long long excess; > - unsigned long nr_scanned; > - > - if (order > 0) > - return 0; > - > - mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); > - /* > - * This loop can run a while, specially if mem_cgroup's continuously > - * keep exceeding their soft limit and putting the system under > - * pressure > - */ > - do { > - if (next_mz) > - mz = next_mz; > - else > - mz = mem_cgroup_largest_soft_limit_node(mctz); > - if (!mz) > - break; > - > - nr_scanned = 0; > - reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone, > - gfp_mask, &nr_scanned); > - nr_reclaimed += reclaimed; > - *total_scanned += nr_scanned; > - spin_lock(&mctz->lock); > - > - /* > - * If we failed to reclaim anything from this memory cgroup > - * it is time to move on to the next cgroup > - */ > - next_mz = NULL; > - if (!reclaimed) { > - do { > - /* > - * Loop until we find yet another one. > - * > - * By the time we get the soft_limit lock > - * again, someone might have aded the > - * group back on the RB tree. Iterate to > - * make sure we get a different mem. > - * mem_cgroup_largest_soft_limit_node returns > - * NULL if no other cgroup is present on > - * the tree > - */ > - next_mz = > - __mem_cgroup_largest_soft_limit_node(mctz); > - if (next_mz == mz) > - css_put(&next_mz->memcg->css); > - else /* next_mz == NULL or other memcg */ > - break; > - } while (1); > - } > - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); > - excess = res_counter_soft_limit_excess(&mz->memcg->res); > - /* > - * One school of thought says that we should not add > - * back the node to the tree if reclaim returns 0. > - * But our reclaim could return 0, simply because due > - * to priority we are exposing a smaller subset of > - * memory to reclaim from. Consider this as a longer > - * term TODO. > - */ > - /* If excess == 0, no tree ops */ > - __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess); > - spin_unlock(&mctz->lock); > - css_put(&mz->memcg->css); > - loop++; > - /* > - * Could not reclaim anything and there are no more > - * mem cgroups to try or we seem to be looping without > - * reclaiming anything. 
> - */ > - if (!nr_reclaimed && > - (next_mz == NULL || > - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) > - break; > - } while (!nr_reclaimed); > - if (next_mz) > - css_put(&next_mz->memcg->css); > - return nr_reclaimed; > -} > - > /** > * mem_cgroup_force_empty_list - clears LRU of a group > * @memcg: group to clear > diff --git a/mm/vmscan.c b/mm/vmscan.c > index df78d17..ae3a387 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -138,11 +138,21 @@ static bool global_reclaim(struct scan_control *sc) > { > return !sc->target_mem_cgroup; > } > + > +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > +{ > + return global_reclaim(sc); > +} > #else > static bool global_reclaim(struct scan_control *sc) > { > return true; > } > + > +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > +{ > + return false; > +} > #endif > > static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru) > @@ -1942,9 +1952,11 @@ static inline bool should_continue_reclaim(struct zone *zone, > } > } > > -static void shrink_zone(struct zone *zone, struct scan_control *sc) > +static unsigned > +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) > { > unsigned long nr_reclaimed, nr_scanned; > + unsigned nr_shrunk = 0; What does this number mean ? > > do { > struct mem_cgroup *root = sc->target_mem_cgroup; > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > do { > struct lruvec *lruvec; > > + if (soft_reclaim && > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > + continue; > + } > + > + nr_shrunk++; > lruvec = mem_cgroup_zone_lruvec(zone, memcg); nr_shrunk will be updated even if the memcg has no pages to be reclaimed...right ? > > shrink_lruvec(lruvec, sc); > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > } while (memcg); > } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, > sc->nr_scanned - nr_scanned, sc)); > + > + return nr_shrunk; > +} > + > + > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > +{ > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > + unsigned long nr_scanned = sc->nr_scanned; > + unsigned nr_shrunk; > + > + nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim); > + > + /* > + * No group is over the soft limit or those that are do not have > + * pages in the zone we are reclaiming so we have to reclaim everybody > + */ > + if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) { > + __shrink_zone(zone, sc, false); > + return; > + } Hmm...so...nr_shrunk is working as a bool value. Isn't it better to call __shrink_zone(...,false) if above shrink_zone(...,true) couldn't make good progress ? memory-disk ping-pong will happen in bad case. I think....in the 1st run, you can count amount of pages, which are candidates to be reclaimed. Then, you can compare the amounts of reclaim target and the priority and size of the target (amounts of reclaimable memory on the target zonelist), make a decision to fallback to full global reclaim or not. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-04-09 17:05 Michal Hocko

From: Michal Hocko @ 2013-04-09 17:05 UTC
To: Kamezawa Hiroyuki
Cc: linux-mm, Ying Han, Johannes Weiner, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa

On Wed 10-04-13 01:45:19, KAMEZAWA Hiroyuki wrote:
> (2013/04/09 21:13), Michal Hocko wrote:
[...]
> > @@ -1942,9 +1952,11 @@ static inline bool should_continue_reclaim(struct zone *zone,
> > 	}
> > }
> >
> > -static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +static unsigned
> > +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
> > {
> > 	unsigned long nr_reclaimed, nr_scanned;
> > +	unsigned nr_shrunk = 0;
>
> What does this number mean ?

The number of groups that we have called shrink_lruvec for.

> > 	do {
> > 		struct mem_cgroup *root = sc->target_mem_cgroup;
> > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > 		do {
> > 			struct lruvec *lruvec;
> >
> > +			if (soft_reclaim &&
> > +					!mem_cgroup_soft_reclaim_eligible(memcg)) {
> > +				memcg = mem_cgroup_iter(root, memcg, &reclaim);
> > +				continue;
> > +			}
> > +
> > +			nr_shrunk++;
> > 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>
> nr_shrunk will be updated even if the memcg has no pages to be reclaimed...right ?

Yes.

> > 			shrink_lruvec(lruvec, sc);
> > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > 		} while (memcg);
> > 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
> > 					 sc->nr_scanned - nr_scanned, sc));
> > +
> > +	return nr_shrunk;
> > +}
> > +
> > +
> > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +{
> > +	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
> > +	unsigned long nr_scanned = sc->nr_scanned;
> > +	unsigned nr_shrunk;
> > +
> > +	nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim);
> > +
> > +	/*
> > +	 * No group is over the soft limit or those that are do not have
> > +	 * pages in the zone we are reclaiming so we have to reclaim everybody
> > +	 */
> > +	if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) {
> > +		__shrink_zone(zone, sc, false);
> > +		return;
> > +	}
>
> Hmm...so...nr_shrunk is working as a bool value. Isn't it better to call
> __shrink_zone(...,false) if above shrink_zone(...,true) couldn't make
> good progress ?

Yes, that was an attempt at exactly that, and as Johannes already
pointed out, the nr_shrunk counting is superseded by the nr_scanned
check.

> memory-disk ping-pong will happen in bad case.

I am not sure what you mean by this.

> I think....in the 1st run, you can count amount of pages, which are
> candidates to be reclaimed. Then, you can compare the amounts of
> reclaim target and the priority and size of the target (amounts of
> reclaimable memory on the target zonelist), make a decision to fallback to
> full global reclaim or not.

I would like to keep the logic as simple as possible. The nr_scanned
progress check is a protection against needlessly raising the reclaim
priority and should be sufficient for a start. There is still the
possibility that small groups over their soft limit won't have any pages
to reclaim because those pages are dirty, but such pages should be
flushed during global reclaim, and we wait for them during targeted
reclaim. I agree that maybe we also need a priority check here, though.
Will think about it.
> 
> Thanks,
> -Kame

-- 
Michal Hocko
SUSE Labs
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-04-14 0:42 Mel Gorman

From: Mel Gorman @ 2013-04-14 0:42 UTC
To: Michal Hocko
Cc: linux-mm, Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Glauber Costa

On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote:
> Memcg soft reclaim has been traditionally triggered from the global
> reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
> then picked up a group which exceeds the soft limit the most and
> reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
> 

I didn't realise it scanned at priority 0, or else I forgot! Priority 0
scanning means memcg soft reclaim currently scans anon and file equally,
with the full LRU of each type considered as scan candidates.
Consequently, it will reclaim SWAP_CLUSTER_MAX from each evictable LRU
before stopping as sc->nr_to_reclaim pages have been scanned. It's only
partially related to your series, but of course this is very blunt
behaviour for memcg reclaim. In an ideal world of infinite free time it
might be worth checking what happens if that thing scans at priority 1,
or at least keeping an eye on what happens to the priority when/if you
replace mem_cgroup_shrink_node_zone.

> The infrastructure requires per-node-zone trees which hold over-limit
> groups and keep them up-to-date (via memcg_check_events) which is not
> cost free. Although this overhead hasn't turned out to be a bottle neck
> the implementation is suboptimal because mem_cgroup_update_tree has no
> idea which zones consumed memory over the limit so we could easily end
> up having a group on a node-zone tree having only few pages from that
> node-zone.
> 
> This patch doesn't try to fix node-zone trees management because it
> seems that integrating soft reclaim into zone shrinking sounds much
> easier and more appropriate for several reasons.
> First of all 0 priority reclaim was a crude hack which might lead to
> big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
> of dirty/writeback pages).

Scanning at priority 1 would still be vulnerable to this, but it might
avoid some of the stalls if anon/file balancing is treated properly.

> Soft reclaim should be applicable also to the targeted reclaim which is
> awkward right now without additional hacks.
> Last but not least the whole infrastructure eats a lot of code[1].
> 
> After this patch shrink_zone is done in 2. First it tries to do the

Done in 2 what? Passes I think.

> soft reclaim if appropriate (only for global reclaim for now to keep
> compatible with the current state) and fall back to ignoring soft limit
> if no group is eligible to soft reclaim or nothing has been scanned
> during the first pass. Only groups which are over their soft limit or
> any of their parent up the hierarchy is over the limit are considered
> eligible during the first pass.
> 
> TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and co.
> but maybe it would be easier for review to remove that code in a separate
> patch...
> > --- > [1] TODO: put size vmlinux before/after whole clean-up > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > include/linux/memcontrol.h | 10 +-- > mm/memcontrol.c | 161 ++++++-------------------------------------- > mm/vmscan.c | 67 +++++++++++------- > 3 files changed, 64 insertions(+), 174 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index d6183f0..1833c95 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, > mem_cgroup_update_page_stat(page, idx, -1); > } > > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned); > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); > > void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); > static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, > @@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, > } > > static inline > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > { > - return 0; > + return false; > } > > static inline void mem_cgroup_split_huge_fixup(struct page *head) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index f608546..33424d8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2060,57 +2060,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) > } > #endif > > -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, > - struct zone *zone, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > -{ > - struct mem_cgroup *victim = NULL; > - int total = 0; > - int loop = 0; > - unsigned long excess; > - unsigned long nr_scanned; > - struct mem_cgroup_reclaim_cookie reclaim = { > - .zone = zone, > - .priority = 0, > - }; > +/* > + * A group is eligible for the soft limit reclaim if it is > + * a) is over its soft limit > + * b) any parent up the hierarchy is over its soft limit > + */ > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *parent = memcg; > > - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; > - > - while (1) { > - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); > - if (!victim) { > - loop++; > - if (loop >= 2) { > - /* > - * If we have not been able to reclaim > - * anything, it might because there are > - * no reclaimable pages under this hierarchy > - */ > - if (!total) > - break; > - /* > - * We want to do more targeted reclaim. > - * excess >> 2 is not to excessive so as to > - * reclaim too much, nor too less that we keep > - * coming back to reclaim from this cgroup > - */ > - if (total >= (excess >> 2) || > - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) > - break; > - } > - continue; > - } > - if (!mem_cgroup_reclaimable(victim, false)) > - continue; > - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, > - zone, &nr_scanned); > - *total_scanned += nr_scanned; > - if (!res_counter_soft_limit_excess(&root_memcg->res)) > - break; > + if (res_counter_soft_limit_excess(&memcg->res)) > + return true; > + > + /* > + * If any parent up the hierarchy is over its soft limit then we > + * have to obey and reclaim from this group as well. 
> + */ > + while((parent = parent_mem_cgroup(parent))) { > + if (res_counter_soft_limit_excess(&parent->res)) > + return true; Remove the initial if with this? /* * If the target memcg or any of its parents are over their soft limit * then we have to obey and reclaim from this group as well */ do { if (res_counter_soft_limit_excess(&memcg->res)) return true; while ((memcg = parent_mem_cgroup(memcg)); > } > - mem_cgroup_iter_break(root_memcg, victim); > - return total; > + > + return false; > } > > /* > @@ -4724,98 +4695,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, > return ret; > } > > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > -{ > - unsigned long nr_reclaimed = 0; > - struct mem_cgroup_per_zone *mz, *next_mz = NULL; > - unsigned long reclaimed; > - int loop = 0; > - struct mem_cgroup_tree_per_zone *mctz; > - unsigned long long excess; > - unsigned long nr_scanned; > - > - if (order > 0) > - return 0; > - > - mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); > - /* > - * This loop can run a while, specially if mem_cgroup's continuously > - * keep exceeding their soft limit and putting the system under > - * pressure > - */ > - do { > - if (next_mz) > - mz = next_mz; > - else > - mz = mem_cgroup_largest_soft_limit_node(mctz); > - if (!mz) > - break; > - > - nr_scanned = 0; > - reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone, > - gfp_mask, &nr_scanned); > - nr_reclaimed += reclaimed; > - *total_scanned += nr_scanned; > - spin_lock(&mctz->lock); > - > - /* > - * If we failed to reclaim anything from this memory cgroup > - * it is time to move on to the next cgroup > - */ > - next_mz = NULL; > - if (!reclaimed) { > - do { > - /* > - * Loop until we find yet another one. > - * > - * By the time we get the soft_limit lock > - * again, someone might have aded the > - * group back on the RB tree. Iterate to > - * make sure we get a different mem. > - * mem_cgroup_largest_soft_limit_node returns > - * NULL if no other cgroup is present on > - * the tree > - */ > - next_mz = > - __mem_cgroup_largest_soft_limit_node(mctz); > - if (next_mz == mz) > - css_put(&next_mz->memcg->css); > - else /* next_mz == NULL or other memcg */ > - break; > - } while (1); > - } > - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); > - excess = res_counter_soft_limit_excess(&mz->memcg->res); > - /* > - * One school of thought says that we should not add > - * back the node to the tree if reclaim returns 0. > - * But our reclaim could return 0, simply because due > - * to priority we are exposing a smaller subset of > - * memory to reclaim from. Consider this as a longer > - * term TODO. > - */ > - /* If excess == 0, no tree ops */ > - __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess); > - spin_unlock(&mctz->lock); > - css_put(&mz->memcg->css); > - loop++; > - /* > - * Could not reclaim anything and there are no more > - * mem cgroups to try or we seem to be looping without > - * reclaiming anything. 
> - */ > - if (!nr_reclaimed && > - (next_mz == NULL || > - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) > - break; > - } while (!nr_reclaimed); > - if (next_mz) > - css_put(&next_mz->memcg->css); > - return nr_reclaimed; > -} > - > /** > * mem_cgroup_force_empty_list - clears LRU of a group > * @memcg: group to clear > diff --git a/mm/vmscan.c b/mm/vmscan.c > index df78d17..ae3a387 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -138,11 +138,21 @@ static bool global_reclaim(struct scan_control *sc) > { > return !sc->target_mem_cgroup; > } > + > +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > +{ > + return global_reclaim(sc); > +} > #else > static bool global_reclaim(struct scan_control *sc) > { > return true; > } > + > +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > +{ > + return false; > +} > #endif > > static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru) > @@ -1942,9 +1952,11 @@ static inline bool should_continue_reclaim(struct zone *zone, > } > } > > -static void shrink_zone(struct zone *zone, struct scan_control *sc) > +static unsigned > +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) > { > unsigned long nr_reclaimed, nr_scanned; > + unsigned nr_shrunk = 0; > > do { > struct mem_cgroup *root = sc->target_mem_cgroup; > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > do { > struct lruvec *lruvec; > > + if (soft_reclaim && > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > + continue; > + } > + Calling mem_cgroup_soft_reclaim_eligible means we do multiple searches of the hierarchy while ascending the hierarchy. It's a stretch but it may be a problem for very deep hierarchies. Would it be worth having mem_cgroup_soft_reclaim_eligible return what the highest parent over its soft limit was and stop the iterator when the highest parent is reached? I think this would avoid calling mem_cgroup_soft_reclaim_eligible multiple times. > + nr_shrunk++; > lruvec = mem_cgroup_zone_lruvec(zone, memcg); > > shrink_lruvec(lruvec, sc); > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > } while (memcg); > } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, > sc->nr_scanned - nr_scanned, sc)); > + > + return nr_shrunk; > +} > + > + > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > +{ > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > + unsigned long nr_scanned = sc->nr_scanned; > + unsigned nr_shrunk; > + > + nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim); > + The two pass thing is explained in the changelog very well but adding comments on it here would not hurt. Otherwise this patch looks like a great idea and memcg soft reclaim looks a lot less like it's stuck on the side. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
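For readers following the two-pass discussion: a minimal sketch of how the reworked shrink_zone fits together, reconstructed from the hunks quoted above. The exact fallback condition is an assumption here, since the tail of the new shrink_zone is not quoted in this message:

static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
        bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
        unsigned nr_shrunk;

        /* First pass: visit only groups eligible for soft reclaim. */
        nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim);

        /*
         * Fallback pass at the same priority: no group was over its
         * soft limit (or none of the eligible ones had pages in this
         * zone), so reclaim from all groups.
         */
        if (do_soft_reclaim && !nr_shrunk)
                __shrink_zone(zone, sc, false);
}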
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-14 0:42 ` Mel Gorman @ 2013-04-14 14:34 ` Michal Hocko 2013-04-14 14:55 ` Johannes Weiner 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2013-04-14 14:34 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Glauber Costa On Sun 14-04-13 01:42:52, Mel Gorman wrote: > On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote: > > Memcg soft reclaim has been traditionally triggered from the global > > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim > > then picked up a group which exceeds the soft limit the most and > > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. > > > > I didn't realise it scanned at priority 0 or else I forgot! Priority 0 > scanning means memcg soft reclaim currently scans anon and file equally > with the full LRU of ecah type considered as scan candidates. Consequently, > it will reclaim SWAP_CLUSTER_MAX from each evictable LRU before stopping as > sc->nr_to_reclaim pages have been scanned. It's only partially related to > your series of course this is very blunt behaviour for memcg reclaim. In an > ideal world of infinite free time it might be worth checking what happens > if that thing scans at priority 1 or at least keep an eye on what happens > priority when/if you replace mem_cgroup_shrink_node_zone I do not think experimenting with prio 1 would make any difference. We would still reclaim half of the LRUs and bail out if at least SWAP_CLUSTER_MAX pages have been reclaimed after visiting all reclaimable LRUs. The whole point of the series is to not do anything special for the soft reclaim priority-wise. [...] > > Soft reclaim should be applicable also to the targeted reclaim which is > > awkward right now without additional hacks. > > Last but not least the whole infrastructure eats a lot of code[1]. > > > > After this patch shrink_zone is done in 2. First it tries to do the > > Done in 2 what? Passes I think. Yes. Fixed. [...] > > + if (res_counter_soft_limit_excess(&memcg->res)) > > + return true; > > + > > + /* > > + * If any parent up the hierarchy is over its soft limit then we > > + * have to obey and reclaim from this group as well. > > + */ > > + while((parent = parent_mem_cgroup(parent))) { > > + if (res_counter_soft_limit_excess(&parent->res)) > > + return true; > > Remove the initial if with this? > /* > * If the target memcg or any of its parents are over their soft limit > * then we have to obey and reclaim from this group as well > */ > do { > if (res_counter_soft_limit_excess(&memcg->res)) > return true; > } while ((memcg = parent_mem_cgroup(memcg))); A later patch changes this behavior: there we treat the current memcg and its parents slightly differently based on whether the limit has been set by the user or is the default unlimited value. [...]
> > -static void shrink_zone(struct zone *zone, struct scan_control *sc) > > +static unsigned > > +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) > > { > > unsigned long nr_reclaimed, nr_scanned; > > + unsigned nr_shrunk = 0; > > > > do { > > struct mem_cgroup *root = sc->target_mem_cgroup; > > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > > do { > > struct lruvec *lruvec; > > > > + if (soft_reclaim && > > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > > + continue; > > + } > > + > > Calling mem_cgroup_soft_reclaim_eligible means we do multiple searches > of the hierarchy while ascending the hierarchy. It's a stretch but it > may be a problem for very deep hierarchies. I think it shouldn't be a problem for hundreds of memcgs and I am quite sceptical about such configurations for other reasons (e.g. charging overhead). And we are in the reclaim path so this is hardly a hot path (unlike charging). So while this might turn out to be a real problem, we would need to fix those other parts with higher priority first. > Would it be worth having mem_cgroup_soft_reclaim_eligible return what > the highest parent over its soft limit was and stop the iterator when > the highest parent is reached? I think this would avoid calling > mem_cgroup_soft_reclaim_eligible multiple times. This is basically what the original implementation did and I think it is not the right way to go. First, why should we care which group exceeds its limit the most? We should treat them equally if there is no special reason not to do so. And I do not see such a special reason. Besides that, keeping an excess-sorted data structure of memcgs turned out to require quite a lot of code. Note that the later patch integrates soft reclaim into targeted reclaim which would mean that we would have to keep such a list/tree per memcg. > > + nr_shrunk++; > > lruvec = mem_cgroup_zone_lruvec(zone, memcg); > > > > shrink_lruvec(lruvec, sc); > > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > > } while (memcg); > > } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, > > sc->nr_scanned - nr_scanned, sc)); > > + > > + return nr_shrunk; > > +} > > + > > + > > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > > +{ > > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > > + unsigned long nr_scanned = sc->nr_scanned; > > + unsigned nr_shrunk; > > + > > + nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim); > > + > > The two pass thing is explained in the changelog very well but adding > comments on it here would not hurt. What about merging the comment that is already there with this? /* * If memcg is enabled we try to reclaim only over-soft limit groups in * the first pass and only fallback to all groups reclaim if no group is * over the soft limit or those that are over it do not have pages in the * zone we are reclaiming, so we have to reclaim everybody. * This will guarantee that groups that are below their soft limit are * not touched unless the memory pressure cannot be handled otherwise * and so the soft limit can be used for the working set preservation. */ > > Otherwise this patch looks like a great idea and memcg soft reclaim looks > a lot less like it's stuck on the side. Thanks for the review, Mel! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-14 14:34 ` Michal Hocko @ 2013-04-14 14:55 ` Johannes Weiner 2013-04-14 15:04 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Johannes Weiner @ 2013-04-14 14:55 UTC (permalink / raw) To: Michal Hocko Cc: Mel Gorman, linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Glauber Costa On Sun, Apr 14, 2013 at 07:34:20AM -0700, Michal Hocko wrote: > On Sun 14-04-13 01:42:52, Mel Gorman wrote: > > On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote: > > > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > > > do { > > > struct lruvec *lruvec; > > > > > > + if (soft_reclaim && > > > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > > > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > > > + continue; > > > + } > > > + > > > > Calling mem_cgroup_soft_reclaim_eligible means we do multiple searches > > of the hierarchy while ascending the hierarchy. It's a stretch but it > > may be a problem for very deep hierarchies. > > I think it shouldn't be a problem for hundreds of memcgs and I am quite > sceptical about such configurations for other reasons (e.g. charging > overhead). And we are in the reclaim path so this is hardly a hot path > (unlike the chargin). So while this might turn out to be a real problem > we would need to fix other parts as well with higher priority. > > > Would it be worth having mem_cgroup_soft_reclaim_eligible return what > > the highest parent over its soft limit was and stop the iterator when > > the highest parent is reached? I think this would avoid calling > > mem_cgroup_soft_reclaim_eligible multiple times. > > This is basically what the original implementation did and I think it is > not the right way to go. First why should we care who is the most > exceeding group. We should treat them equally if the there is no special > reason to not do so. And I do not see such a special reason. Besides > that keeping a exceed sorted data structure of memcgs turned out quite a > lot of code. Note that the later patch integrate soft reclaim into > targeted reclaim which would mean that we would have to keep such a > list/tree per memcg. I think what Mel suggests is not to return the highest excessor, but return the highest parent in the hierarchy that is in excess. Once you have this parent, you know that all children are in excess, without looking them up individually. However, that parent is not necessarily the root of the hierarchy that is being reclaimed and you might have multiple of such sub-hierarchies in excess. To handle all the corner cases, I'd expect the relationship checking to get really complicated. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
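To make the discussed alternative concrete, a sketch of the helper Mel and Johannes are talking about (the function name is made up for illustration and the cooperation with mem_cgroup_iter is left out):

/*
 * Return the highest ancestor of @memcg (possibly @memcg itself)
 * that is over its soft limit, or NULL. Everything underneath the
 * returned group is known to be eligible without a per-group
 * ancestor walk.
 */
static struct mem_cgroup *soft_reclaim_root(struct mem_cgroup *memcg)
{
        struct mem_cgroup *highest = NULL;

        for (; memcg; memcg = parent_mem_cgroup(memcg))
                if (res_counter_soft_limit_excess(&memcg->res))
                        highest = memcg;

        return highest;
}

As noted above, the caller would still have to work out how the returned group relates to the hierarchy currently being reclaimed, which is exactly where the relationship checking gets complicated.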
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-14 14:55 ` Johannes Weiner @ 2013-04-14 15:04 ` Michal Hocko 2013-04-14 15:11 ` Michal Hocko 2013-04-14 18:03 ` Rik van Riel 0 siblings, 2 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-14 15:04 UTC (permalink / raw) To: Johannes Weiner Cc: Mel Gorman, linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Glauber Costa On Sun 14-04-13 10:55:32, Johannes Weiner wrote: > On Sun, Apr 14, 2013 at 07:34:20AM -0700, Michal Hocko wrote: > > On Sun 14-04-13 01:42:52, Mel Gorman wrote: > > > On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote: > > > > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > > > > do { > > > > struct lruvec *lruvec; > > > > > > > > + if (soft_reclaim && > > > > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > > > > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > > > > + continue; > > > > + } > > > > + > > > > > > Calling mem_cgroup_soft_reclaim_eligible means we do multiple searches > > > of the hierarchy while ascending the hierarchy. It's a stretch but it > > > may be a problem for very deep hierarchies. > > > > I think it shouldn't be a problem for hundreds of memcgs and I am quite > > sceptical about such configurations for other reasons (e.g. charging > > overhead). And we are in the reclaim path so this is hardly a hot path > > (unlike the chargin). So while this might turn out to be a real problem > > we would need to fix other parts as well with higher priority. > > > > > Would it be worth having mem_cgroup_soft_reclaim_eligible return what > > > the highest parent over its soft limit was and stop the iterator when > > > the highest parent is reached? I think this would avoid calling > > > mem_cgroup_soft_reclaim_eligible multiple times. > > > > This is basically what the original implementation did and I think it is > > not the right way to go. First why should we care who is the most > > exceeding group. We should treat them equally if the there is no special > > reason to not do so. And I do not see such a special reason. Besides > > that keeping a exceed sorted data structure of memcgs turned out quite a > > lot of code. Note that the later patch integrate soft reclaim into > > targeted reclaim which would mean that we would have to keep such a > > list/tree per memcg. > > I think what Mel suggests is not to return the highest excessor, but > return the highest parent in the hierarchy that is in excess. Once > you have this parent, you know that all children are in excess, > without looking them up individually. OK, I see it now. > However, that parent is not necessarily the root of the hierarchy that > is being reclaimed and you might have multiple of such sub-hierarchies > in excess. To handle all the corner cases, I'd expect the > relationship checking to get really complicated. We could always return the leftmost and get to others as the iteration continues. I will try to think about it some more. I do not think we would save a lot but it looks like a neat idea. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-14 15:04 ` Michal Hocko @ 2013-04-14 15:11 ` Michal Hocko 2013-04-14 18:03 ` Rik van Riel 1 sibling, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-14 15:11 UTC (permalink / raw) To: Johannes Weiner Cc: Mel Gorman, linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Glauber Costa On Sun 14-04-13 08:04:55, Michal Hocko wrote: > On Sun 14-04-13 10:55:32, Johannes Weiner wrote: > > However, that parent is not necessarily the root of the hierarchy that > > is being reclaimed and you might have multiple of such sub-hierarchies > > in excess. To handle all the corner cases, I'd expect the > > relationship checking to get really complicated. > > We could always return the leftmost and get to others as the iteration > continues. I will try to think about it some more. I do not think we > would save a lot but it looks like a neat idea. Hmm, scratch that. Leftmost doesn't make much sense as we are going bottom up... -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-14 15:04 ` Michal Hocko 2013-04-14 15:11 ` Michal Hocko @ 2013-04-14 18:03 ` Rik van Riel 1 sibling, 0 replies; 27+ messages in thread From: Rik van Riel @ 2013-04-14 18:03 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Mel Gorman, linux-mm, Ying Han, KAMEZAWA Hiroyuki, Hugh Dickins, Glauber Costa On 04/14/2013 11:04 AM, Michal Hocko wrote: > On Sun 14-04-13 10:55:32, Johannes Weiner wrote: >> I think what Mel suggests is not to return the highest excessor, but >> return the highest parent in the hierarchy that is in excess. Once >> you have this parent, you know that all children are in excess, >> without looking them up individually. > > OK, I see it now. > >> However, that parent is not necessarily the root of the hierarchy that >> is being reclaimed and you might have multiple of such sub-hierarchies >> in excess. To handle all the corner cases, I'd expect the >> relationship checking to get really complicated. > > We could always return the leftmost and get to others as the iteration > continues. I will try to think about it some more. I do not think we > would save a lot but it looks like a neat idea. We should probably gather around a whiteboard this week in San Francisco, and figure out what exactly we want the code to do, before figuring out the most efficient way to do it. -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified 2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko 2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko @ 2013-04-09 12:13 ` Michal Hocko 2013-04-09 13:24 ` Johannes Weiner 2013-04-09 17:10 ` Kamezawa Hiroyuki 2013-04-09 12:13 ` [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko ` (3 subsequent siblings) 5 siblings, 2 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-09 12:13 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa The soft limit has been traditionally initialized to RESOURCE_MAX which means that the group is soft unlimited by default. This was working more or less satisfactorily so far because the soft limit has been interpreted as a tool to hint memory reclaim which groups to reclaim first to free some memory, so groups basically opted in to being reclaimed more. While this feature might be really helpful, it would be even nicer if the soft reclaim could be used as a certain working set protection - only groups over their soft limit are reclaimed as far as the reclaim is able to free memory. In order to accomplish this behavior we have to reconsider the default soft limit value because with the current default all groups would become soft unreclaimable and so the reclaim would have to fall back to ignoring soft reclaim altogether, harming those groups that set up a limit as a protection against the reclaim. Changing the default soft limit to 0 wouldn't work either because all groups would become soft reclaimable as the parent's limit would overwrite all its children down the hierarchy. This patch doesn't change the default soft limit value. Rather than that it distinguishes groups with the limit set by the user by a per group flag. All groups are considered soft reclaimable regardless of their limit until a limit is set. The default limit doesn't enforce reclaim down the hierarchy. TODO: How do we present default unlimited vs. RESOURCE_MAX set by the user? One possible way could be returning -1 for RESOURCE_MAX && !soft_limited but this is a change in user interface. Although nothing explicitly says the value has to be greater than 0, I can imagine this could be a PITA to use. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 33424d8..043d760 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -292,6 +292,10 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? */ bool use_hierarchy; + /* + * Is the group soft limited?
+ */ + bool soft_limited; unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */ bool oom_lock; @@ -2062,14 +2066,15 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) /* * A group is eligible for the soft limit reclaim if it is - * a) is over its soft limit - * b) any parent up the hierarchy is over its soft limit + * a) doesn't have any soft limit set + * b) is over its soft limit + * c) any parent up the hierarchy is over its soft limit */ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) { struct mem_cgroup *parent = memcg; - if (res_counter_soft_limit_excess(&memcg->res)) + if (!memcg->soft_limited || res_counter_soft_limit_excess(&memcg->res)) return true; /* @@ -2077,7 +2082,8 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) * have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { - if (res_counter_soft_limit_excess(&parent->res)) + if (memcg->soft_limited && + res_counter_soft_limit_excess(&parent->res)) return true; } @@ -5237,6 +5243,14 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, ret = res_counter_set_soft_limit(&memcg->res, val); else ret = -EINVAL; + + /* + * We could disable soft_limited when we get RESOURCE_MAX but + * then we have a little problem to distinguish the default + * unlimited and limitted but never soft reclaimed groups. + */ + if (!ret) + memcg->soft_limited = true; break; default: ret = -EINVAL; /* should be BUG() ? */ -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
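To spell out the intended semantics on a hypothetical setup: a freshly created group has soft_limited == false and is therefore always eligible for soft reclaim even though res_counter_soft_limit_excess() returns 0 for it, while a group whose admin has written any value, including RESOURCE_MAX, has the flag set, so its eligibility depends purely on whether it (or a soft-limited parent) actually exceeds the configured limit.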
* Re: [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified 2013-04-09 12:13 ` [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified Michal Hocko @ 2013-04-09 13:24 ` Johannes Weiner 2013-04-09 13:42 ` Michal Hocko 2013-04-09 17:10 ` Kamezawa Hiroyuki 1 sibling, 1 reply; 27+ messages in thread From: Johannes Weiner @ 2013-04-09 13:24 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue, Apr 09, 2013 at 02:13:14PM +0200, Michal Hocko wrote: > The soft limit has been traditionally initialized to RESOURCE_MAX > which means that the group is soft unlimited by default. This was > working more or less satisfactorily so far because the soft limit has > been interpreted as a tool to hint memory reclaim which groups to > reclaim first to free some memory so groups basically opted in for being > reclaimed more. > > While this feature might be really helpful it would be even nicer if > the soft reclaim could be used as a certain working set protection - > only groups over their soft limit are reclaimed as far as the reclaim > is able to free memory. In order to accomplish this behavior we have to > reconsider the default soft limit value because with the current default > all groups would become soft unreclaimable and so the reclaim would have > to fall back to ignoring soft reclaim altogether harming those groups > that set up a limit as a protection against the reclaim. Changing the > default soft limit to 0 wouldn't work either because all groups would > become soft reclaimable as the parent's limit would overwrite all its > children down the hierarchy. > > This patch doesn't change the default soft limit value. Rather than that > it distinguishes groups with the limit set by user by a per group flag. > All groups are considered soft reclaimable regardless their limit until > a limit is set. The default limit doesn't enforce reclaim down the > hierarchy. > > TODO: How do we present default unlimited vs. RESOURCE_MAX set by the > user? One possible way could be returning -1 for RESOURCE_MAX && !soft_limited > but this is a change in user interface. Although nothing explicitly says > the value has to be greater > 0 I can imagine this could be PITA to use. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > mm/memcontrol.c | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 33424d8..043d760 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -292,6 +292,10 @@ struct mem_cgroup { > * Should the accounting and control be hierarchical, per subtree? > */ > bool use_hierarchy; > + /* > + * Is the group soft limited? 
> + */ > + bool soft_limited; > unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */ > > bool oom_lock; > @@ -2062,14 +2066,15 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) > > /* > * A group is eligible for the soft limit reclaim if it is > - * a) is over its soft limit > - * b) any parent up the hierarchy is over its soft limit > + * a) doesn't have any soft limit set > + * b) is over its soft limit > + * c) any parent up the hierarchy is over its soft limit > */ > bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > { > struct mem_cgroup *parent = memcg; > > - if (res_counter_soft_limit_excess(&memcg->res)) > + if (!memcg->soft_limited || res_counter_soft_limit_excess(&memcg->res)) > return true; With the very similar condition in the hierarchy walk down there, this was more confusing than I would have expected it to be. Would you mind splitting this check and putting the comments directly over the individual checks? /* No specific soft limit set, eligible for soft reclaim */ if (!memcg->soft_limited) return true; /* Soft limit exceeded, eligible for soft reclaim */ if (res_counter_soft_limit_excess(&memcg->res)) return true; /* Parental limit exceeded, eligible for... soft reclaim! */ ... > @@ -2077,7 +2082,8 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > * have to obey and reclaim from this group as well. > */ > while((parent = parent_mem_cgroup(parent))) { > - if (res_counter_soft_limit_excess(&parent->res)) > + if (memcg->soft_limited && > + res_counter_soft_limit_excess(&parent->res)) > return true; Should this be parent->soft_limited instead of memcg->soft_limited? > @@ -5237,6 +5243,14 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, > ret = res_counter_set_soft_limit(&memcg->res, val); > else > ret = -EINVAL; > + > + /* > + * We could disable soft_limited when we get RESOURCE_MAX but > + * then we have a little problem to distinguish the default > + * unlimited and limitted but never soft reclaimed groups. > + */ > + if (!ret) > + memcg->soft_limited = true; It's neither reversible nor distinguishable from userspace, so it would be good to either find a value or just make the soft_limited knob explicit and accessible from userspace. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified 2013-04-09 13:24 ` Johannes Weiner @ 2013-04-09 13:42 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-09 13:42 UTC (permalink / raw) To: Johannes Weiner Cc: linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue 09-04-13 09:24:06, Johannes Weiner wrote: > On Tue, Apr 09, 2013 at 02:13:14PM +0200, Michal Hocko wrote: [...] > > @@ -2062,14 +2066,15 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) > > > > /* > > * A group is eligible for the soft limit reclaim if it is > > - * a) is over its soft limit > > - * b) any parent up the hierarchy is over its soft limit > > + * a) doesn't have any soft limit set > > + * b) is over its soft limit > > + * c) any parent up the hierarchy is over its soft limit > > */ > > bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > > { > > struct mem_cgroup *parent = memcg; > > > > - if (res_counter_soft_limit_excess(&memcg->res)) > > + if (!memcg->soft_limited || res_counter_soft_limit_excess(&memcg->res)) > > return true; > > With the very similar condition in the hierarchy walk down there, this > was more confusing than I would have expected it to be. > > Would you mind splitting this check and putting the comments directly > over the individual checks? > > /* No specific soft limit set, eligible for soft reclaim */ > if (!memcg->soft_limited) > return true; > > /* Soft limit exceeded, eligible for soft reclaim */ > if (res_counter_soft_limit_excess(&memcg->res)) > return true; > > /* Parental limit exceeded, eligible for... soft reclaim! */ Sure thing. > ... > > > @@ -2077,7 +2082,8 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > > * have to obey and reclaim from this group as well. > > */ > > while((parent = parent_mem_cgroup(parent))) { > > - if (res_counter_soft_limit_excess(&parent->res)) > > + if (memcg->soft_limited && > > + res_counter_soft_limit_excess(&parent->res)) > > return true; > > Should this be parent->soft_limited instead of memcg->soft_limited? Yes. I haven't tested with deeper hierarchies yet... Thanks for catching this. > > > @@ -5237,6 +5243,14 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, > > ret = res_counter_set_soft_limit(&memcg->res, val); > > else > > ret = -EINVAL; > > + > > + /* > > + * We could disable soft_limited when we get RESOURCE_MAX but > > + * then we have a little problem to distinguish the default > > + * unlimited and limitted but never soft reclaimed groups. > > + */ > > + if (!ret) > > + memcg->soft_limited = true; > > It's neither reversible nor distinguishable from userspace, so it > would be good to either find a value or just make the soft_limited > knob explicit and accessible from userspace. I can export the knob but I would like to avoid that if possible. So far it seems it would be hard to keep backward compatibility. I hoped somebody would come up with something clever ;) One possible way would be returning -1 if soft_limited == false. Users who parse the value as a u64 would see the same value in the end so they shouldn't break, and those that are _really_ interested can check the string value as well. What do you think? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
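A minimal sketch of the read side of the -1 idea floated above (the helper name is hypothetical and the cftype plumbing is omitted; this is not part of the posted series):

static int memcg_print_soft_limit(struct mem_cgroup *memcg, char *buf)
{
        /*
         * "-1" parsed back as a u64 (e.g. via strtoull) yields
         * ULLONG_MAX, i.e. RESOURCE_MAX, so consumers treating the
         * value as a number would keep seeing "unlimited".
         */
        if (!memcg->soft_limited)
                return sprintf(buf, "-1\n");

        return sprintf(buf, "%llu\n",
                       res_counter_read_u64(&memcg->res, RES_SOFT_LIMIT));
}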
* Re: [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified 2013-04-09 12:13 ` [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified Michal Hocko 2013-04-09 13:24 ` Johannes Weiner @ 2013-04-09 17:10 ` Kamezawa Hiroyuki 2013-04-09 17:22 ` Michal Hocko 1 sibling, 1 reply; 27+ messages in thread From: Kamezawa Hiroyuki @ 2013-04-09 17:10 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Ying Han, Johannes Weiner, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa (2013/04/09 21:13), Michal Hocko wrote: > The soft limit has been traditionally initialized to RESOURCE_MAX > which means that the group is soft unlimited by default. This was > working more or less satisfactorily so far because the soft limit has > been interpreted as a tool to hint memory reclaim which groups to > reclaim first to free some memory so groups basically opted in for being > reclaimed more. > > While this feature might be really helpful it would be even nicer if > the soft reclaim could be used as a certain working set protection - > only groups over their soft limit are reclaimed as far as the reclaim > is able to free memory. In order to accomplish this behavior we have to > reconsider the default soft limit value because with the current default > all groups would become soft unreclaimable and so the reclaim would have > to fall back to ignoring soft reclaim altogether harming those groups > that set up a limit as a protection against the reclaim. Changing the > default soft limit to 0 wouldn't work either because all groups would > become soft reclaimable as the parent's limit would overwrite all its > children down the hierarchy. > > This patch doesn't change the default soft limit value. Rather than that > it distinguishes groups with the limit set by user by a per group flag. > All groups are considered soft reclaimable regardless their limit until > a limit is set. The default limit doesn't enforce reclaim down the > hierarchy. > > TODO: How do we present default unlimited vs. RESOURCE_MAX set by the > user? One possible way could be returning -1 for RESOURCE_MAX && !soft_limited > but this is a change in user interface. Although nothing explicitly says > the value has to be greater > 0 I can imagine this could be PITA to use. > Hmm.. Now, if a user sets a soft limit on a memcg, it will be a victim. All other cgroups, which have the default value, will be the 2nd choice for memory reclaim. When a user sets RESOURCE_MAX, it will be the 2nd choice, too. In this case, the soft limit is for creating victims. You want another configuration where all cgroups with the default value are the 1st choice and memcgs which have some soft-limit value set are protected. In this case, the soft limit is for protection, i.e. an opposite policy. How about allowing users to set the root memcg's soft limit (to 0?) and allowing the protection policy to be chosen before creating children memcgs? (I think you could make this default policy a CONFIG option or similar...) Users could then choose the global soft-limit policy. Complicated? Thanks, -Kame > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > mm/memcontrol.c | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 33424d8..043d760 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -292,6 +292,10 @@ struct mem_cgroup { > * Should the accounting and control be hierarchical, per subtree? > */ > bool use_hierarchy; > + /* > + * Is the group soft limited?
> + */ > + bool soft_limited; > unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */ > > bool oom_lock; > @@ -2062,14 +2066,15 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) > > /* > * A group is eligible for the soft limit reclaim if it is > - * a) is over its soft limit > - * b) any parent up the hierarchy is over its soft limit > + * a) doesn't have any soft limit set > + * b) is over its soft limit > + * c) any parent up the hierarchy is over its soft limit > */ > bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > { > struct mem_cgroup *parent = memcg; > > - if (res_counter_soft_limit_excess(&memcg->res)) > + if (!memcg->soft_limited || res_counter_soft_limit_excess(&memcg->res)) > return true; > > /* > @@ -2077,7 +2082,8 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > * have to obey and reclaim from this group as well. > */ > while((parent = parent_mem_cgroup(parent))) { > - if (res_counter_soft_limit_excess(&parent->res)) > + if (memcg->soft_limited && > + res_counter_soft_limit_excess(&parent->res)) > return true; > } > > @@ -5237,6 +5243,14 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, > ret = res_counter_set_soft_limit(&memcg->res, val); > else > ret = -EINVAL; > + > + /* > + * We could disable soft_limited when we get RESOURCE_MAX but > + * then we have a little problem to distinguish the default > + * unlimited and limitted but never soft reclaimed groups. > + */ > + if (!ret) > + memcg->soft_limited = true; > break; > default: > ret = -EINVAL; /* should be BUG() ? */ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified 2013-04-09 17:10 ` Kamezawa Hiroyuki @ 2013-04-09 17:22 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-09 17:22 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: linux-mm, Ying Han, Johannes Weiner, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Wed 10-04-13 02:10:44, KAMEZAWA Hiroyuki wrote: > (2013/04/09 21:13), Michal Hocko wrote: > > The soft limit has been traditionally initialized to RESOURCE_MAX > > which means that the group is soft unlimited by default. This was > > working more or less satisfactorily so far because the soft limit has > > been interpreted as a tool to hint memory reclaim which groups to > > reclaim first to free some memory so groups basically opted in for being > > reclaimed more. > > > > While this feature might be really helpful it would be even nicer if > > the soft reclaim could be used as a certain working set protection - > > only groups over their soft limit are reclaimed as far as the reclaim > > is able to free memory. In order to accomplish this behavior we have to > > reconsider the default soft limit value because with the current default > > all groups would become soft unreclaimable and so the reclaim would have > > to fall back to ignoring soft reclaim altogether harming those groups > > that set up a limit as a protection against the reclaim. Changing the > > default soft limit to 0 wouldn't work either because all groups would > > become soft reclaimable as the parent's limit would overwrite all its > > children down the hierarchy. > > > > This patch doesn't change the default soft limit value. Rather than that > > it distinguishes groups with the limit set by user by a per group flag. > > All groups are considered soft reclaimable regardless their limit until > > a limit is set. The default limit doesn't enforce reclaim down the > > hierarchy. > > > > TODO: How do we present default unlimited vs. RESOURCE_MAX set by the > > user? One possible way could be returning -1 for RESOURCE_MAX && !soft_limited > > but this is a change in user interface. Although nothing explicitly says > > the value has to be greater > 0 I can imagine this could be PITA to use. > > > > Hmm.. > > Now, if a user sets soft_limit to a memcg, it will be a victim. All other > cgroups, which has default value, will be 2nd choice for memory reclaim. Not really. All those with the default value will be the 1st choice along with those that are over the limit. Just to make sure we are on the same page, this is what I have currently after Johannes' feedback: bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) { struct mem_cgroup *parent = memcg; /* No specific soft limit set, eligible for soft reclaim */ if (!memcg->soft_limited) return true; /* Soft limit exceeded, eligible for soft reclaim */ if (res_counter_soft_limit_excess(&memcg->res)) return true; /* * If any parent up the hierarchy is over its soft limit then we * have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { if (parent->soft_limited && res_counter_soft_limit_excess(&parent->res)) return true; } return false; } Does this make more sense to you? > When user sets RESOURCE_MAX, it will be 2nd choice, too. No, it will never be soft reclaimed because it would have memcg->soft_limited == true. > In this case, soft-limit is for creating victims.
> > You want the another configuration that all cgroup must be 1st choice > with the default value and protect memcg which has some soft-limit value. > In this case, soft-limit is for protection. Why should we distinguish the default setting from over-the-limit groups? > i.e. an opposite policy. > > How about allowing users to set root memcg's soft-limit (to be 0 ?) This is not forbidden AFAICS in mem_cgroup_write for RES_SOFT_LIMIT. The 0 @ root is not good, as I tried to explain in the changelog, because this would put hierarchical pressure on all children so their limits would basically be ignored. > and allow the new choice of protection before creating children > memcgs? (I think you can make this default policy as CONFIG option or > some...) Users can choice global soft-limit policy. Complicated ? Yes, and I do not understand why a CONFIG option is needed. [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko 2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko 2013-04-09 12:13 ` [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified Michal Hocko @ 2013-04-09 12:13 ` Michal Hocko 2013-04-22 2:14 ` Michal Hocko 2013-04-09 15:37 ` [RFC 0/3] soft reclaim rework Michal Hocko ` (2 subsequent siblings) 5 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2013-04-09 12:13 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa Soft reclaim has been done only for the global reclaim (both background and direct). Since "memcg: integrate soft reclaim tighter with zone shrinking code" there is no reason for this limitation anymore as the soft limit reclaim doesn't use any special code paths and it is a part of the zone shrinking code which is used by both global and targeted reclaims. From a semantic point of view it is even natural to consider the soft limit before touching all groups in a hierarchy that is hitting its hard limit, because the soft limit tells us where to push back when there is memory pressure. It is not important whether the pressure comes from the limit or imbalanced zones. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index ae3a387..cf729ca 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -141,7 +141,7 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return global_reclaim(sc); + return true; } #else static bool global_reclaim(struct scan_control *sc) -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-04-09 12:13 ` [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko @ 2013-04-22 2:14 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-22 2:14 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue 09-04-13 14:13:15, Michal Hocko wrote: > Soft reclaim has been done only for the global reclaim (both background > and direct). Since "memcg: integrate soft reclaim tighter with zone > shrinking code" there is no reason for this limitation anymore as the > soft limit reclaim doesn't use any special code paths and it is a > part of the zone shrinking code which is used by both global and > targeted reclaims. > > From semantic point of view it is even natural to consider soft limit > before touching all groups in the hierarchy tree which is touching the > hard limit because soft limit tells us where to push back when there is > a memory pressure. It is not important whether the pressure comes from > the limit or imbalanced zones. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > mm/vmscan.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index ae3a387..cf729ca 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -141,7 +141,7 @@ static bool global_reclaim(struct scan_control *sc) > > static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > { > - return global_reclaim(sc); > + return true; > } > #else > static bool global_reclaim(struct scan_control *sc) This patch is not complete. We also need to update mem_cgroup_soft_reclaim_eligible because we should ignore parents that are above the current reclaim pressure. Say we have

A (over soft limit)
 \
  B (below s.l., hit the hard limit)
 / \
C   D (below s.l.)

B is now the source of the outside memory pressure for D, but we shouldn't soft reclaim D because it is behaving well within the B subtree. mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the hierarchy at B (the root of the memory pressure).
--- diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1833c95..80ed1b6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -179,7 +179,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -356,7 +357,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { return false; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index be86815..19b4cb7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1845,12 +1845,14 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) #endif /* - * A group is eligible for the soft limit reclaim if it is + * A group is eligible for the soft limit reclaim under given root hierarchy + * if it is * a) doesn't have any soft limit set * b) is over its soft limit * c) any parent up the hierarchy is over its soft limit */ -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; @@ -1863,13 +1865,15 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) return true; /* - * If any parent up the hierarchy is over its soft limit then we - * have to obey and reclaim from this group as well. + * If any parent up to the root in the hierarchy is over its soft limit + * then we have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { if (parent->soft_limited && res_counter_soft_limit_excess(&parent->res)) return true; + if (parent == root) + break; } return false; diff --git a/mm/vmscan.c b/mm/vmscan.c index 1fe9f81..471bf94 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1973,7 +1973,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) struct lruvec *lruvec; if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg)) { + !mem_cgroup_soft_reclaim_eligible(memcg, root)) { memcg = mem_cgroup_iter(root, memcg, &reclaim); continue; } -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
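Tracing the fixed walk on the A/B/C/D layout above: in a targeted reclaim rooted at B (triggered by B's hard limit), the check for D climbs to B, finds it not in excess, hits parent == root and stops, so D is not soft reclaimed even though A is over its soft limit. In a global reclaim (root == NULL) the same walk continues past B up to A, and D stays eligible as before.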
* Re: [RFC 0/3] soft reclaim rework 2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko ` (2 preceding siblings ...) 2013-04-09 12:13 ` [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko @ 2013-04-09 15:37 ` Michal Hocko 2013-04-09 15:50 ` Michal Hocko 2013-04-11 8:43 ` Michal Hocko 2013-04-17 22:52 ` Ying Han 5 siblings, 6 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-09 15:37 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue 09-04-13 14:13:12, Michal Hocko wrote: [...] > 2) kbuild test showed more or less the same results > usage_in_bytes > Base > Group A Group B > Median 394817536 395634688 > > Patches applied > median 483481600 302131200 > > A is kept closer to the soft limit again. There is some fluctuation > around the limit because kbuild creates a lot of short lived processes. > Base: pgscan_kswapd_dma32 1648718 pgsteal_kswapd_dma32 1510749 > Patched: pgscan_kswapd_dma32 2042065 pgsteal_kswapd_dma32 1667745 OK, so I have patched the base version with the patch below which uncovers soft reclaim scanning and reclaim and guess what:
Base: pgscan_kswapd_dma32 3710092 pgsteal_kswapd_dma32 3225191
Patched: pgscan_kswapd_dma32 1846700 pgsteal_kswapd_dma32 1442232
Base: pgscan_direct_dma32 2417683 pgsteal_direct_dma32 459702
Patched: pgscan_direct_dma32 1839331 pgsteal_direct_dma32 244338
The numbers are obviously timing dependent (~10% wrt. the previous run for the patched kernel) but the roughly halved numbers wrt. the base kernel seem real; we just haven't seen it previously because it wasn't accounted. I guess this can be attributed to prio-0 soft reclaim behavior and a lot of dirty pages on the LRU. > The differences are much bigger now so it would be interesting how much > has been scanned/reclaimed during soft reclaim in the base kernel. --- ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/3] soft reclaim rework 2013-04-09 15:37 ` [RFC 0/3] soft reclaim rework Michal Hocko @ 2013-04-09 15:50 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-09 15:50 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue 09-04-13 17:37:42, Michal Hocko wrote: > On Tue 09-04-13 14:13:12, Michal Hocko wrote: > [...] > > 2) kbuild test showed more or less the same results > > usage_in_bytes > > Base > > Group A Group B > > Median 394817536 395634688 > > > > Patches applied > > median 483481600 302131200 > > > > A is kept closer to the soft limit again. There is some fluctuation > > around the limit because kbuild creates a lot of short lived processes. > > Base: pgscan_kswapd_dma32 1648718 pgsteal_kswapd_dma32 1510749 > > Patched: pgscan_kswapd_dma32 2042065 pgsteal_kswapd_dma32 1667745 > > OK, so I have patched the base version with the patch bellow which > uncovers soft reclaim scanning and reclaim and guess what: > Base: pgscan_kswapd_dma32 3710092 pgsteal_kswapd_dma32 3225191 > Patched: pgscan_kswapd_dma32 1846700 pgsteal_kswapd_dma32 1442232 > Base: pgscan_direct_dma32 2417683 pgsteal_direct_dma32 459702 > Patched: pgscan_direct_dma32 1839331 pgsteal_direct_dma32 244338 Dohh, a dwarf sneaked in and broke my numbers for the base kernel. I am rerunning the test. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/3] soft reclaim rework 2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko ` (3 preceding siblings ...) 2013-04-09 15:37 ` [RFC 0/3] soft reclaim rework Michal Hocko @ 2013-04-11 8:43 ` Michal Hocko 2013-04-11 9:07 ` Michal Hocko 2013-04-11 13:04 ` Michal Hocko 2013-04-17 22:52 ` Ying Han 5 siblings, 2 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-11 8:43 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa Hi, I have retested the kbuild test on bare HW (8 CPUs, 1GB RAM limited by mem=1G, 2GB swap partition). There are 2 groups (A, B) without any hard limit and group A has a soft limit set to 700M (to have 70% of the available memory). The build starts after a fresh boot by extracting the sources and running make -j4 vmlinux. Each group works on a separate source tree. I have repeated the test 3 times. First some data as returned by /usr/bin/time -v:

* Patched:
A:
User time (seconds): 1133.06
User time (seconds): 1132.84
User time (seconds): 1135.37
Avg: 1133.76
System time (seconds): 258.02
System time (seconds): 259.33
System time (seconds): 258.83
Avg: 258.73
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:57.55
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:55.68
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:50.96
Avg: 08:54.73

B:
User time (seconds): 1149.22
User time (seconds): 1153.98
User time (seconds): 1150.37
Avg: 1151.19 (101.5% of A)
System time (seconds): 262.13
System time (seconds): 263.31
System time (seconds): 260.84
Avg: 262.09 (101.3% of A)
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:13.37
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:17.15
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:05.23
Avg: 10:11.92 (114.4% of A)

* Base:
A:
User time (seconds): 1132.58
User time (seconds): 1140.63
User time (seconds): 1135.68
avg: 1136.30 (100.2% of A - patched)
System time (seconds): 264.88
System time (seconds): 263.54
System time (seconds): 261.99
avg: 263.47 (101.8% of A - patched)
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:48.54
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:50.44
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:44.28
avg: 09:47.75 (109.9% of A - patched)

B:
User time (seconds): 1138.32
User time (seconds): 1135.70
User time (seconds): 1136.80
avg: 1136.94 (100.3% of A - patched)
System time (seconds): 261.56
System time (seconds): 262.10
System time (seconds): 262.24
avg: 261.97 (101.3% of A - patched)
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:39.17
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:46.95
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:44.73
avg: 09:43.62 (109.1% of A - patched)

With the patched kernel the soft limit helped to protect A's working set, so A was faster (14% in total time) than B, which ran without any limits. The unpatched kernel treated them more or less equally regardless of the soft limit setting. If we compare the patched and base kernel numbers then the overall situation improved slightly with the patched kernel (A+B elapsed time is 2% smaller), which was quite surprising for me. Maybe a side effect of priority-0 soft reclaim in the base kernel. As the variance between runs wasn't very high, I have focused on the first run for the memory usage and reclaim statistics comparisons between the base and patched kernels.
The reclaim statistics for the first run:

* Patched:
pgscan_direct_dma32   252408
pgscan_kswapd_dma32   988928
pgsteal_direct_dma32   63565
pgsteal_kswapd_dma32  905223

* Base:
pgscan_direct_dma32    97310 (38% of patched)
pgscan_kswapd_dma32  1702971 (172%)
pgsteal_direct_dma32   83377 (131%)
pgsteal_kswapd_dma32 1534616 (169.5%)

So it seems that the patched kernel scanned much more during direct
reclaim yet reclaimed less there. This is most probably because there
is bigger pressure on B's LRUs and we encounter more dirty pages, so
more pages are scanned in the end. In sum, though, the patched kernel
scanned and reclaimed less (the base kernel scanned 45% resp. reclaimed
67% more).

You can find some graphs at:
- http://labs.suse.cz/mhocko/soft_limit_rework/base-usage.png
- http://labs.suse.cz/mhocko/soft_limit_rework/patched-usage.png

Per group charges over time.

- http://labs.suse.cz/mhocko/soft_limit_rework/base-usage-histogram.png
- http://labs.suse.cz/mhocko/soft_limit_rework/patched-usage-histogram.png

Same here but in histogram form to see the main tendencies.

- http://labs.suse.cz/mhocko/soft_limit_rework/pgscan.png
- http://labs.suse.cz/mhocko/soft_limit_rework/pgsteal.png

Scanning and reclaiming activity comparison between the base and the
patched kernel.
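For reference, the counters above come straight from /proc/vmstat; a
minimal snapshot/diff along these lines collects them per run (a
sketch, not the exact script used here):

    # snapshot the four reclaim counters before and after a run...
    grep -E 'pg(scan|steal)_(kswapd|direct)_dma32' /proc/vmstat > before
    # (run the workload)
    grep -E 'pg(scan|steal)_(kswapd|direct)_dma32' /proc/vmstat > after

    # ...and print the per-counter deltas; both files have the same
    # lines in the same order, so a line-wise paste is enough
    paste before after | awk '{ printf "%s %d\n", $1, $4 - $2 }'
-- 
Michal Hocko
SUSE Labs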
* Re: [RFC 0/3] soft reclaim rework
  2013-04-11  8:43 ` Michal Hocko
@ 2013-04-11  9:07   ` Michal Hocko
  2013-04-11 13:04   ` Michal Hocko
  1 sibling, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2013-04-11 9:07 UTC
To: linux-mm
Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel,
    Hugh Dickins, Mel Gorman, Glauber Costa

On Thu 11-04-13 10:43:46, Michal Hocko wrote:
> Hi,
> I have rerun the kbuild test on bare HW (8 CPUs, 1GB RAM limited by
> mem=1G, 2GB swap partition).
[...]
I have moved the graphs to
http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/700-softlimit/kbuild
because I am doing tests with other soft limits and also other types of
tests. Sorry about that.

> You can find some graphs at:
> - http://labs.suse.cz/mhocko/soft_limit_rework/base-usage.png
> - http://labs.suse.cz/mhocko/soft_limit_rework/patched-usage.png
>
> Per group charges over time.
>
> - http://labs.suse.cz/mhocko/soft_limit_rework/base-usage-histogram.png
> - http://labs.suse.cz/mhocko/soft_limit_rework/patched-usage-histogram.png
>
> Same here but in histogram form to see the main tendencies.
>
> - http://labs.suse.cz/mhocko/soft_limit_rework/pgscan.png
> - http://labs.suse.cz/mhocko/soft_limit_rework/pgsteal.png
>
> Scanning and reclaiming activity comparison between the base and the
> patched kernel.
-- 
Michal Hocko
SUSE Labs
* Re: [RFC 0/3] soft reclaim rework
  2013-04-11  8:43 ` Michal Hocko
  2013-04-11  9:07   ` Michal Hocko
@ 2013-04-11 13:04   ` Michal Hocko
  1 sibling, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2013-04-11 13:04 UTC
To: linux-mm
Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel,
    Hugh Dickins, Mel Gorman, Glauber Costa

On Thu 11-04-13 10:43:46, Michal Hocko wrote:
> Hi,
> I have rerun the kbuild test on bare HW (8 CPUs, 1GB RAM limited by
> mem=1G, 2GB swap partition). There are 2 groups (A, B) without any
> hard limit and group A has its soft limit set to 700M (to have 70% of
> available memory). The build starts after a fresh boot by extracting
> the sources and running make -j4 vmlinux.
> Each group works on a separate source tree. I have repeated the test
> 3 times:

[Cutting the previous results and keeping only averages for overview]

> * Patched:
> A:
> User time (seconds): Avg: 1133.76
> System time (seconds): Avg: 258.73
> Elapsed (wall clock) time (h:mm:ss or m:ss): Avg: 08:54.73
>
> B:
> User time (seconds): Avg: 1151.19 (101.5% of A)
> System time (seconds): Avg: 262.09 (101.3% of A)
> Elapsed (wall clock) time (h:mm:ss or m:ss): Avg: 10:11.92 (114.4% of A)
>
> * Base:
> A:
> User time (seconds): avg: 1136.30 (100.2% of A - patched)
> System time (seconds): avg: 263.47 (101.8% of A - patched)
> Elapsed (wall clock) time (h:mm:ss or m:ss): avg: 09:47.75 (109.9% of A - patched)
>
> B:
> User time (seconds): avg: 1136.94 (100.2% of A - patched)
> System time (seconds): avg: 261.97 (100% of A - patched)
> Elapsed (wall clock) time (h:mm:ss or m:ss): avg: 09:43.62 (109.1% of A - patched)

Same test again with a 300M soft limit for A instead.

* Patched:
A:
User time (seconds): 1143.68, 1137.85, 1137.47 avg: 1139.67
System time (seconds): 264.73, 265.50, 262.44 avg: 264.22
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:54.07, 9:48.23, 9:39.35 avg: 09:47.22

B:
User time (seconds): 1139.10, 1135.94, 1138.13 avg: 1137.72 (99.8% of A)
System time (seconds): 260.94, 262.37, 263.56 avg: 262.29 (99.2% of A)
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:53.04, 9:48.17, 9:51.34 avg: 09:50.85 (100.6% of A)

Both groups are comparable now as both of them are reclaimed (see below
for the reclaim statistics). Both are about 1 min slower (in elapsed
time) than A was with the 700M soft limit.

* Base:
A:
User time (seconds): 1148.50, 1145.96, 1144.60 avg: 1146.35 (100.5% of A patched)
System time (seconds): 265.00, 262.31, 264.98 avg: 264.10 (100% of A patched)
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:44.57, 10:14.74, 10:32.28 avg: 10:30.53 (107.4% of A patched)

B:
User time (seconds): 1137.01, 1131.44, 1136.86 avg: 1135.10 (99.6% of A patched)
System time (seconds): 259.72, 259.05, 262.62 avg: 260.46 (98.6% of A patched)
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:33.82, 9:25.39, 9:38.35 avg: 09:32.52 (97.5% of A patched)

A is hammered by soft reclaim much more than with the 700M soft limit,
which is expected. If we sum the A+B elapsed times, though, then the
workload is faster by ~2% with the patched kernel (same as with the
700M limit). This confirms that the soft limit is too harsh with the
base kernel.

Just for completeness, if we compare A+B to the 700M soft limited runs
then we get a ~3% slowdown for both the patched and unpatched kernels
with the smaller soft limit.
On the reclaim side (the 700M numbers are quoted for comparison):

> * Patched:
> pgscan_direct_dma32   252408
> pgscan_kswapd_dma32   988928
> pgsteal_direct_dma32   63565
> pgsteal_kswapd_dma32  905223
>
> * Base:
> pgscan_direct_dma32    97310 (38% of patched)
> pgscan_kswapd_dma32  1702971 (172%)
> pgsteal_direct_dma32   83377 (131%)
> pgsteal_kswapd_dma32 1534616 (169.5%)

* Patched:
pgscan_direct_dma32   153455 (60.8% of Patched, 700M limit)
pgscan_kswapd_dma32  1670779 (168.9% of Patched, 700M limit)
pgsteal_direct_dma32  109624 (172.5% of Patched, 700M limit)
pgsteal_kswapd_dma32 1512120 (167% of Patched, 700M limit)

* Base:
pgscan_direct_dma32   492381 (320% of patched)
pgscan_kswapd_dma32  1373732 (82.2% of patched)
pgsteal_direct_dma32  339563 (309.8% of patched)
pgsteal_kswapd_dma32 1108240 (73.3% of patched)

And this shows it nicely. The base kernel scans and reclaims 3 times
more in the direct reclaim context while it scans ~20% resp. reclaims
~30% less in the background. Compared with the 700M soft limit, the
patched kernel scans and reclaims ~70% more in the kswapd context, but
its direct reclaim is reduced, which is nice.

Same graphs as for the 700M limit:
http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/base-usage.png
http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/patched-usage.png

Charges over time. We can see that the patched kernel behaves much more
fairly to both groups than the base kernel.

http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/base-usage-histogram.png
http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/patched-usage-histogram.png

The same can be seen in the histograms.

http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/pgscan.png
http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/pgsteal.png

And the scanning/reclaiming data over time.
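The per-group charge data behind graphs like these can be collected by
simply sampling each group's usage file during a run; a minimal sketch
(cgroup v1 paths as in the setup, sampling interval arbitrary):

    # sample both groups' charges once a second until interrupted
    while sleep 1; do
        printf '%s %s %s\n' "$(date +%s)" \
            "$(cat /sys/fs/cgroup/memory/A/memory.usage_in_bytes)" \
            "$(cat /sys/fs/cgroup/memory/B/memory.usage_in_bytes)"
    done > usage.log
-- 
Michal Hocko
SUSE Labs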
* Re: [RFC 0/3] soft reclaim rework
  2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko
                   ` (4 preceding siblings ...)
  2013-04-11  8:43 ` Michal Hocko
@ 2013-04-17 22:52 ` Ying Han
  5 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2013-04-17 22:52 UTC
To: Michal Hocko
Cc: linux-mm@kvack.org, Johannes Weiner, KAMEZAWA Hiroyuki,
    Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa

On Tue, Apr 9, 2013 at 5:13 AM, Michal Hocko <mhocko@suse.cz> wrote:
> Hi all,
> It's been a long when I promised my take on the $subject but I got
> permanently preempted by other tasks. I finally got it, fortunately.

Hi Michal,

This has been on my list for a while and I never got a chance to get to
it. The per-memcg soft limit reclaim is one of the key features google
uses today, and thank you for putting in the effort to move this
forward. I haven't read the patch in detail, but since we chatted about
this for a few iterations it should look familiar.

> This is just a first attempt. There are still some todos but I wanted
> to post it soon to get a feedback.
[...]
> As a bonus we will get rid of a _lot_ of code by this and soft reclaim
> will not stand out like before.

Yes, that is the part that should have given us enough motivation to
merge this effort a long time ago. However, we had difficulties
agreeing on the 5% of the code (mainly the soft limit policy), which
prevented cleaning up the other 95%. I take the blame.

> The second step is somehow more controversial. I am redefining meaning
> of the default soft limit value. I've not chosen 0 as we discussed
> previously because I want to preserve hierarchical property of the
> soft limit (if a parent up the hierarchy is over its limit then
> children are over as well)

This is the 5% on which we keep disagreeing with each other. The
internal patch I am carrying has a different interpretation of
"hierarchical soft limit reclaim". However, I am more inclined to
accept that difference this time. At least that will get us moving
forward to clean up the code first. Then we can revisit the exact
policy of that 5% if it doesn't fit other use cases (besides google).
I am happy to backport this part into our kernel later and then only
carry that 5% of change internally.

To give more background on what I mean by a different interpretation
of "hierarchical", I have a write-up from some time back which is
attached to this thread. This is purely a note for later, and as I
mentioned I will go ahead and review the patch and forget about that
difference at this step.

> so I have kept the default untouched - unlimited - but I
> have slightly changed the meaning of this value. I interpret it as
> "user doesn't care about soft limit". More precisely the value is
> ignored unless it has been specified by user so such groups are
> eligible for soft reclaim even though they do not reach the limit.
> Such groups do not force their children to be reclaimed of course.
>
> I guess the only possible use case where this wouldn't work as
> expected is when somebody creates a group and set its soft limit to
> a small value (e.g. 0) just to protect all other groups from being
> reclaimed. With a new scheme all groups would be reclaimed while the
> previous implementation could end up reclaiming only the "special"
> group. This configuration can be achieved by the new scheme trivially
> so I think we should be safe. Or does this sound like a big problem?
>
> Finally the third step is soft limit reclaim integration into targeted
> reclaim. The patch is trivial one liner.

I will go through the patches in detail in the next day or so.

Thanks

--Ying

[...]
[-- Attachment #2: SoftlimitReclaimInMemcg.pdf, Type: application/pdf, Size: 416555 bytes --]
end of thread, other threads: [~2013-04-22 2:14 UTC | newest]

Thread overview: 27+ messages:
2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko
2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko
2013-04-09 13:08   ` Johannes Weiner
2013-04-09 13:31     ` Michal Hocko
2013-04-09 13:57   ` Glauber Costa
2013-04-09 14:22     ` Michal Hocko
2013-04-09 16:45   ` Kamezawa Hiroyuki
2013-04-09 17:05     ` Michal Hocko
2013-04-14  0:42   ` Mel Gorman
2013-04-14 14:34     ` Michal Hocko
2013-04-14 14:55       ` Johannes Weiner
2013-04-14 15:04         ` Michal Hocko
2013-04-14 15:11           ` Michal Hocko
2013-04-14 18:03   ` Rik van Riel
2013-04-09 12:13 ` [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified Michal Hocko
2013-04-09 13:24   ` Johannes Weiner
2013-04-09 13:42     ` Michal Hocko
2013-04-09 17:10   ` Kamezawa Hiroyuki
2013-04-09 17:22     ` Michal Hocko
2013-04-09 12:13 ` [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko
2013-04-22  2:14   ` Michal Hocko
2013-04-09 15:37 ` [RFC 0/3] soft reclaim rework Michal Hocko
2013-04-09 15:50   ` Michal Hocko
2013-04-11  8:43 ` Michal Hocko
2013-04-11  9:07   ` Michal Hocko
2013-04-11 13:04   ` Michal Hocko
2013-04-17 22:52 ` Ying Han