* [PATCH v5] Soft limit rework
@ 2013-06-18 12:09 Michal Hocko
2013-06-18 12:09 ` [PATCH v5 1/8] memcg, vmscan: integrate soft reclaim tighter with zone shrinking code Michal Hocko
` (10 more replies)
0 siblings, 11 replies; 30+ messages in thread
From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo,
Balbir Singh, Glauber Costa
Hi,
This is the fifth version of the patchset.
Summary of versions:
The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973
(lkml wasn't CCed at the time so I cannot find it in lwn.net
archives). There were no major objections.
The second version has been posted here http://lwn.net/Articles/548191/
as a part of a longer and spicier thread which started after LSF here:
https://lwn.net/Articles/548192/
Version number 3 has been posted here http://lwn.net/Articles/550409/
Johannes was worried about setups with thousands of memcgs and the
tree walk overhead of the soft reclaim pass when nobody is in excess.
Version number 4 has been posted here http://lwn.net/Articles/552703/
It sparked a heated discussion about the memcg iterator predicate which
ended with the conclusion that the predicate based iteration is "the shortest path to
implementing subtree skip given how the iterator is put together
currently and the series as a whole reduces significant amount of
complexity, so it is an acceptable tradeoff to proceed with this
implementation with later restructuring of the iterator."
(http://thread.gmane.org/gmane.linux.kernel.mm/101162/focus=101560)
Changes between RFC (aka V1) -> V2
As there were no major objections, there were only some minor cleanups
since the last version, and I have moved "memcg: Ignore soft limit until
it is explicitly specified" to the end of the series.
Changes between V2 -> V3
No changes in the code since the last version. I have just rebased the
series on top of the current mmotm tree. The most controversial part
has been dropped (the last patch "memcg: Ignore soft limit until it is
explicitly specified") so there are no semantical changes to the soft
limit behavior. This makes this work mostly a code clean up and code
reorganization. Nevertheless, this is enough to make the soft limit work
more efficiently according to my testing and groups above the soft limit
are reclaimed much less as a result.
Changes between V3->V4
Added some Reviewed-bys, but the biggest change comes from Johannes'
concern about the tree traversal overhead with a huge number of memcgs
(http://thread.gmane.org/gmane.linux.kernel.cgroups/7307/focus=100326).
This version addresses the problem by augmenting the memcg tree with
the number of over-soft-limit children at each level of the hierarchy.
See more below.
Changes between V4->V5
Rebased on top of the mmotm tree (without the slab shrinkers patchset
because there are issues with that patchset) and retested as there were
many kswapd changes (results are more or less consistent; more on that
below). There were only doc updates, no code changes.
Please let me know if this has any chance to get merged into 3.11. I do
not want to push it too hard, but I think this work is basically ready
and waiting longer doesn't help. I can live with the 3.12 merge window as well
if 3.11 sounds too early though.
The basic idea is quite simple. Pull soft reclaim into shrink_zone in
the first step and get rid of the previous soft reclaim infrastructure.
shrink_zone is done in two passes now. First it tries to do the soft
limit reclaim and falls back to reclaim-all mode if no group is over
the limit or no pages have been scanned. The second pass happens at the
same priority, so the only time we waste is the memcg tree walk, which
has been updated in the third step to have only negligible overhead.
As a bonus we get rid of a _lot_ of code and soft reclaim no longer
stands out like before, when it was not integrated into the zone
shrinking code and reclaimed at priority 0 (the testing results show
that some workloads suffer from such an aggressive reclaim). The
cleanup is in a separate patch because I felt it would be easier to
review that way.
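To illustrate the two-pass structure, here is a condensed sketch of what
shrink_zone ends up looking like after the first patch (taken from the
diff further down in this thread, slightly trimmed):

	static void shrink_zone(struct zone *zone, struct scan_control *sc)
	{
		bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
		unsigned long nr_scanned = sc->nr_scanned;

		/* first pass: only groups eligible for soft reclaim */
		__shrink_zone(zone, sc, do_soft_reclaim);

		/*
		 * Nothing was scanned in the first pass (no group is over
		 * the soft limit or those that are have no pages in this
		 * zone) so fall back to reclaiming everybody at the same
		 * priority.
		 */
		if (do_soft_reclaim && sc->nr_scanned == nr_scanned)
			__shrink_zone(zone, sc, false);
	}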
The second step is soft limit reclaim integration into targeted
reclaim. It should be rather straightforward. The soft limit has been
used only for the global reclaim so far, but it makes sense for any
kind of pressure coming from up the hierarchy, including targeted
reclaim.
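The resulting eligibility check (condensed from patch 3, which adds the
reclaim root parameter) simply walks the hierarchy up to the root of the
current reclaim:

	/*
	 * A group is eligible for the soft limit reclaim under the given
	 * root hierarchy if it or any parent up to that root is over its
	 * soft limit.
	 */
	bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
			struct mem_cgroup *root)
	{
		struct mem_cgroup *parent = memcg;

		if (res_counter_soft_limit_excess(&memcg->res))
			return true;

		while ((parent = parent_mem_cgroup(parent))) {
			if (res_counter_soft_limit_excess(&parent->res))
				return true;
			if (parent == root)
				break;
		}
		return false;
	}

(Patch 4 later changes the return value to a VISIT/SKIP/SKIP_TREE filter
result so the function can be used directly as an iterator predicate.)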
The third step (patches 4-8) addresses the tree walk overhead by
enhancing memcg iterators to enable skipping whole subtrees and by
tracking the number of over-soft-limit children at each level of the
hierarchy. This information is updated the same way the old soft limit
tree was updated (from memcg_check_events) so we shouldn't see any
additional overhead. In fact mem_cgroup_update_soft_limit is much
simpler than the tree manipulation done previously.
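A rough sketch of that bookkeeping, based on the changelog of patch 5
(the names children_in_excess and soft_contributed come from there; the
exact data types and locking are simplified assumptions, not the actual
patch):

	/*
	 * Sketch only: called from the memcg_check_events() path when the
	 * group's soft limit state may have changed.
	 */
	static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg)
	{
		bool in_excess = res_counter_soft_limit_excess(&memcg->res) > 0;
		struct mem_cgroup *parent;

		/* account only the transition over/under the limit */
		if (in_excess == memcg->soft_contributed)
			return;
		memcg->soft_contributed = in_excess;

		/* propagate the change to all parents up the hierarchy */
		for (parent = parent_mem_cgroup(memcg); parent;
				parent = parent_mem_cgroup(parent)) {
			if (in_excess)
				atomic_inc(&parent->children_in_excess);
			else
				atomic_dec(&parent->children_in_excess);
		}
	}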
__shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
mem_cgroup_iter, so the decision whether a particular group should be
visited is made at the iterator level, which also allows us to skip a
whole subtree (if there is no child in excess). This reduces the tree
walk overhead considerably.
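In code this looks roughly like the following (the loop is condensed
from patch 4; the SKIP_TREE decision based on children_in_excess is
what patch 5 adds, summarized here from its changelog):

	mem_cgroup_iter_filter filter = soft_reclaim ?
		mem_cgroup_soft_reclaim_eligible : NULL;

	while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) {
		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

		shrink_lruvec(lruvec, sc);
		/* ... */
	}

with the predicate returning one of

	VISIT     - the group (or a parent up to root) is over its soft limit
	SKIP      - not over the limit but some child is, keep descending
	SKIP_TREE - no child in excess either, skip the whole subtree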
My primary test case was a parallel kernel build with 2 groups (make
running with -j4 with a distribution .config in a separate cgroup
without any hard limit) on an 8 CPU machine booted with 1GB memory. I
was mostly interested in 2 setups: default - no soft limit set - and
0 soft limit set for both groups.
The first one should tell us whether the rework regresses the default
behavior while the second one should show us improvements in an extreme
case where both workloads are always over the soft limit.
/usr/bin/time -v has been used to collect the statistics and each
configuration had 3 runs after a fresh boot without any other load on the
system.
base is mmotm-2013-05-09-15-57
baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots
without the slab shrinkers patchset.
reworkrebase is all 8 patches applied on top of baserebase.
* No-limit
User
base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6
baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6
reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6
System
base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6
baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6
reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6
Elapsed
base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6
baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6
reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6
Elapsed time regressed by 13% wrt. base, but it seems that this came
from baserebase, which regressed by the same amount.
* 0-limit
User
base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6
baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6
reworkrebase: min: 1169.88 [98.5%] max: 1177.84 [98.3%] avg: 1173.50 [98.3%] std: 2.79 runs: 6
System
base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6
baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6
reworkrebase: min: 235.19 [94.7%] max: 237.43 [94.2%] avg: 236.35 [94.5%] std: 0.86 runs: 6
Elapsed
base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6
baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6
reworkrebase: min: 667.54 [87.9%] max: 718.54 [89.2%] avg: 695.61 [88.6%] std: 17.16 runs: 6
System time is slightly better but I wouldn't consider it relevant.
Elapsed time is more interesting though. baserebase regresses by 16%
again, which is on par with the no-limit configuration.
With the patchset applied we are 11% better on average wrt. the old
base, but it is important to realize that this is still 76.3% wrt.
baserebase, so the effect of the series is comparable to the previous
version, albeit the absolute result is worse.
Page fault statistics tell us at least part of the story:
Minor
base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6
baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6
reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6
Major
base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6
baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6
reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6
While the minor faults are within the noise, the major faults are
reduced considerably. This looks like an aggressive pageout during the
reclaim, and that pageout presumably affects the working set. Please
note that baserebase has an even higher number of major page faults
than the older mmotm tree.
While this looks like a nice win, it is fair to say that there are some
workloads that actually benefit from reclaim at priority 0 (from
background reclaim). E.g. an aggressive streaming IO would like to get
rid of as many pages as possible and not block on the pages under
writeback. This can lead to a higher System time, but the Elapsed time
I got was generally comparable.
The following results are from a 2-group configuration on an 8GB machine
(A running streaming IO with 4*TotalMem size and a 0 soft limit, B
running a mem_eater which consumes TotalMem-1G without any limit).
System
base: min: 124.88 max: 136.97 avg: 130.77 std: 4.94 runs: 3
baserebase: min: 102.51 [82.1%] max: 108.84 [79.5%] avg: 104.81 [80.1%] std: 2.86 runs: 3
reworkrebase: min: 108.29 [86.7%] max: 121.70 [88.9%] avg: 114.60 [87.6%] std: 5.50 runs: 3
Elapsed
base: min: 398.86 max: 412.81 avg: 407.62 std: 6.23 runs: 3
baserebase: min: 480.92 [120.6%] max: 497.56 [120.5%] avg: 491.46 [120.6%] std: 7.48 runs: 3
reworkrebase: min: 397.19 [99.6%] max: 462.57 [112.1%] avg: 436.13 [107.0%] std: 28.12 runs: 3
baserebase regresses again by 20% and the series is worse by 7%, but it
is still at 89% wrt. baserebase, so it looks good to me.
So to wrap this up: the series is still doing well and improves the soft
limit reclaim.
The testing results for a bunch of cgroups with both stream IO and
kbuild loads can be found in "memcg: track children in soft limit excess
to improve soft limit".
The series has seen quite some testing and I guess it is in a state to
be merged into mmotm and hopefully get into 3.11. I would like to hear
back from Johannes and Kamezawa about this timing though.
Shortlog says:
Michal Hocko (8):
memcg, vmscan: integrate soft reclaim tighter with zone shrinking code
memcg: Get rid of soft-limit tree infrastructure
vmscan, memcg: Do softlimit reclaim also for targeted reclaim
memcg: enhance memcg iterator to support predicates
memcg: track children in soft limit excess to improve soft limit
memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything
memcg: Track all children over limit in the root
memcg, vmscan: do not fall into reclaim-all pass too quickly
And the diffstat shows that we still got rid of a lot of code:
include/linux/memcontrol.h | 54 ++++-
mm/memcontrol.c | 565 +++++++++++++--------------------------------
mm/vmscan.c | 83 ++++---
3 files changed, 254 insertions(+), 448 deletions(-)
^ permalink raw reply [flat|nested] 30+ messages in thread* [PATCH v5 1/8] memcg, vmscan: integrate soft reclaim tighter with zone shrinking code 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 2/8] memcg: Get rid of soft-limit tree infrastructure Michal Hocko ` (9 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Memcg soft reclaim has been traditionally triggered from the global reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim then picked up a group which exceeds the soft limit the most and reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. The infrastructure requires per-node-zone trees which hold over-limit groups and keep them up-to-date (via memcg_check_events) which is not cost free. Although this overhead hasn't turned out to be a bottle neck the implementation is suboptimal because mem_cgroup_update_tree has no idea which zones consumed memory over the limit so we could easily end up having a group on a node-zone tree having only few pages from that node-zone. This patch doesn't try to fix node-zone trees management because it seems that integrating soft reclaim into zone shrinking sounds much easier and more appropriate for several reasons. First of all 0 priority reclaim was a crude hack which might lead to big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim should be applicable also to the targeted reclaim which is awkward right now without additional hacks. Last but not least the whole infrastructure eats quite some code. After this patch shrink_zone is done in 2 passes. First it tries to do the soft reclaim if appropriate (only for global reclaim for now to keep compatible with the original state) and fall back to ignoring soft limit if no group is eligible to soft reclaim or nothing has been scanned during the first pass. Only groups which are over their soft limit or any of their parents up the hierarchy is over the limit are considered eligible during the first pass. Soft limit tree which is not necessary anymore will be removed in the follow up patch to make this patch smaller and easier to review. 
Changes since v2 - Do not try soft reclaim pass if mem_cgroup_disabled Changes since v1 - __shrink_zone doesn't return the number of shrunk groups as nr_scanned test covers both no groups scanned and no pages from the required zone as pointed by Johannes Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Tejun Heo <tj@kernel.org> --- include/linux/memcontrol.h | 10 +-- mm/memcontrol.c | 161 ++++++--------------------------------------- mm/vmscan.c | 62 +++++++++-------- 3 files changed, 59 insertions(+), 174 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 7b4d9d7..d495b9e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -180,9 +180,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -359,11 +357,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) { - return 0; + return false; } static inline void mem_cgroup_split_huge_fixup(struct page *head) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 96e6168..ea10f73 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2101,57 +2101,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) } #endif -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, - struct zone *zone, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - struct mem_cgroup *victim = NULL; - int total = 0; - int loop = 0; - unsigned long excess; - unsigned long nr_scanned; - struct mem_cgroup_reclaim_cookie reclaim = { - .zone = zone, - .priority = 0, - }; +/* + * A group is eligible for the soft limit reclaim if it is + * a) is over its soft limit + * b) any parent up the hierarchy is over its soft limit + */ +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +{ + struct mem_cgroup *parent = memcg; - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; - - while (1) { - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); - if (!victim) { - loop++; - if (loop >= 2) { - /* - * If we have not been able to reclaim - * anything, it might because there are - * no reclaimable pages under this hierarchy - */ - if (!total) - break; - /* - * We want to do more targeted reclaim. - * excess >> 2 is not to excessive so as to - * reclaim too much, nor too less that we keep - * coming back to reclaim from this cgroup - */ - if (total >= (excess >> 2) || - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) - break; - } - continue; - } - if (!mem_cgroup_reclaimable(victim, false)) - continue; - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, - zone, &nr_scanned); - *total_scanned += nr_scanned; - if (!res_counter_soft_limit_excess(&root_memcg->res)) - break; + if (res_counter_soft_limit_excess(&memcg->res)) + return true; + + /* + * If any parent up the hierarchy is over its soft limit then we + * have to obey and reclaim from this group as well. 
+ */ + while((parent = parent_mem_cgroup(parent))) { + if (res_counter_soft_limit_excess(&parent->res)) + return true; } - mem_cgroup_iter_break(root_memcg, victim); - return total; + + return false; } /* @@ -4777,98 +4748,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, return ret; } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - unsigned long nr_reclaimed = 0; - struct mem_cgroup_per_zone *mz, *next_mz = NULL; - unsigned long reclaimed; - int loop = 0; - struct mem_cgroup_tree_per_zone *mctz; - unsigned long long excess; - unsigned long nr_scanned; - - if (order > 0) - return 0; - - mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); - /* - * This loop can run a while, specially if mem_cgroup's continuously - * keep exceeding their soft limit and putting the system under - * pressure - */ - do { - if (next_mz) - mz = next_mz; - else - mz = mem_cgroup_largest_soft_limit_node(mctz); - if (!mz) - break; - - nr_scanned = 0; - reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone, - gfp_mask, &nr_scanned); - nr_reclaimed += reclaimed; - *total_scanned += nr_scanned; - spin_lock(&mctz->lock); - - /* - * If we failed to reclaim anything from this memory cgroup - * it is time to move on to the next cgroup - */ - next_mz = NULL; - if (!reclaimed) { - do { - /* - * Loop until we find yet another one. - * - * By the time we get the soft_limit lock - * again, someone might have aded the - * group back on the RB tree. Iterate to - * make sure we get a different mem. - * mem_cgroup_largest_soft_limit_node returns - * NULL if no other cgroup is present on - * the tree - */ - next_mz = - __mem_cgroup_largest_soft_limit_node(mctz); - if (next_mz == mz) - css_put(&next_mz->memcg->css); - else /* next_mz == NULL or other memcg */ - break; - } while (1); - } - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); - excess = res_counter_soft_limit_excess(&mz->memcg->res); - /* - * One school of thought says that we should not add - * back the node to the tree if reclaim returns 0. - * But our reclaim could return 0, simply because due - * to priority we are exposing a smaller subset of - * memory to reclaim from. Consider this as a longer - * term TODO. - */ - /* If excess == 0, no tree ops */ - __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess); - spin_unlock(&mctz->lock); - css_put(&mz->memcg->css); - loop++; - /* - * Could not reclaim anything and there are no more - * mem cgroups to try or we seem to be looping without - * reclaiming anything. 
- */ - if (!nr_reclaimed && - (next_mz == NULL || - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) - break; - } while (!nr_reclaimed); - if (next_mz) - css_put(&next_mz->memcg->css); - return nr_reclaimed; -} - /** * mem_cgroup_force_empty_list - clears LRU of a group * @memcg: group to clear diff --git a/mm/vmscan.c b/mm/vmscan.c index b96faea..d8823f0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -139,11 +139,21 @@ static bool global_reclaim(struct scan_control *sc) { return !sc->target_mem_cgroup; } + +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) +{ + return !mem_cgroup_disabled() && global_reclaim(sc); +} #else static bool global_reclaim(struct scan_control *sc) { return true; } + +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) +{ + return false; +} #endif static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru) @@ -2124,7 +2134,8 @@ static inline bool should_continue_reclaim(struct zone *zone, } } -static void shrink_zone(struct zone *zone, struct scan_control *sc) +static void +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) { unsigned long nr_reclaimed, nr_scanned; @@ -2143,6 +2154,12 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) do { struct lruvec *lruvec; + if (soft_reclaim && + !mem_cgroup_soft_reclaim_eligible(memcg)) { + memcg = mem_cgroup_iter(root, memcg, &reclaim); + continue; + } + lruvec = mem_cgroup_zone_lruvec(zone, memcg); shrink_lruvec(lruvec, sc); @@ -2173,6 +2190,24 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) sc->nr_scanned - nr_scanned, sc)); } + +static void shrink_zone(struct zone *zone, struct scan_control *sc) +{ + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); + unsigned long nr_scanned = sc->nr_scanned; + + __shrink_zone(zone, sc, do_soft_reclaim); + + /* + * No group is over the soft limit or those that are do not have + * pages in the zone we are reclaiming so we have to reclaim everybody + */ + if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) { + __shrink_zone(zone, sc, false); + return; + } +} + /* Returns true if compaction should go ahead for a high-order request */ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) { @@ -2234,8 +2269,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) { struct zoneref *z; struct zone *zone; - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; bool aborted_reclaim = false; /* @@ -2275,18 +2308,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) continue; } } - /* - * This steals pages from memory cgroups over softlimit - * and returns the number of reclaimed pages and - * scanned pages. This works for global memory pressure - * and balancing, not for a memcg's limit. - */ - nr_soft_scanned = 0; - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - sc->order, sc->gfp_mask, - &nr_soft_scanned); - sc->nr_reclaimed += nr_soft_reclaimed; - sc->nr_scanned += nr_soft_scanned; /* need some check for avoid more shrink_zone() */ } @@ -2881,8 +2902,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, { int i; int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .priority = DEF_PRIORITY, @@ -2997,15 +3016,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, sc.nr_scanned = 0; - nr_soft_scanned = 0; - /* - * Call soft limit reclaim before calling shrink_zone. - */ - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - order, sc.gfp_mask, - &nr_soft_scanned); - sc.nr_reclaimed += nr_soft_reclaimed; - /* * There should be no need to raise the scanning * priority if enough pages are already being scanned -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 2/8] memcg: Get rid of soft-limit tree infrastructure 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko 2013-06-18 12:09 ` [PATCH v5 1/8] memcg, vmscan: integrate soft reclaim tighter with zone shrinking code Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 3/8] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko ` (8 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Now that the soft limit is integrated to the reclaim directly the whole soft-limit tree infrastructure is not needed anymore. Rip it out. Changes v1->v2 - mem_cgroup_reclaimable is no longer used - test_mem_cgroup_node_reclaimable is not used outside NUMA code Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Tejun Heo <tj@kernel.org> --- mm/memcontrol.c | 265 +------------------------------------------------------- 1 file changed, 2 insertions(+), 263 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ea10f73..cdc8aed 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -39,7 +39,6 @@ #include <linux/limits.h> #include <linux/export.h> #include <linux/mutex.h> -#include <linux/rbtree.h> #include <linux/slab.h> #include <linux/swap.h> #include <linux/swapops.h> @@ -137,7 +136,6 @@ static const char * const mem_cgroup_lru_names[] = { */ enum mem_cgroup_events_target { MEM_CGROUP_TARGET_THRESH, - MEM_CGROUP_TARGET_SOFTLIMIT, MEM_CGROUP_TARGET_NUMAINFO, MEM_CGROUP_NTARGETS, }; @@ -173,10 +171,6 @@ struct mem_cgroup_per_zone { struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1]; - struct rb_node tree_node; /* RB tree node */ - unsigned long long usage_in_excess;/* Set to the value by which */ - /* the soft limit is exceeded*/ - bool on_tree; struct mem_cgroup *memcg; /* Back pointer, we cannot */ /* use container_of */ }; @@ -185,26 +179,6 @@ struct mem_cgroup_per_node { struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; }; -/* - * Cgroups above their limits are maintained in a RB-Tree, independent of - * their hierarchy representation - */ - -struct mem_cgroup_tree_per_zone { - struct rb_root rb_root; - spinlock_t lock; -}; - -struct mem_cgroup_tree_per_node { - struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES]; -}; - -struct mem_cgroup_tree { - struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES]; -}; - -static struct mem_cgroup_tree soft_limit_tree __read_mostly; - struct mem_cgroup_threshold { struct eventfd_ctx *eventfd; u64 threshold; @@ -523,7 +497,6 @@ static bool move_file(void) * limit reclaim to prevent infinite loops, if they ever occur. 
*/ #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100 -#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2 enum charge_type { MEM_CGROUP_CHARGE_TYPE_CACHE = 0, @@ -754,164 +727,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page) return mem_cgroup_zoneinfo(memcg, nid, zid); } -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_node_zone(int nid, int zid) -{ - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_from_page(struct page *page) -{ - int nid = page_to_nid(page); - int zid = page_zonenum(page); - - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static void -__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz, - unsigned long long new_usage_in_excess) -{ - struct rb_node **p = &mctz->rb_root.rb_node; - struct rb_node *parent = NULL; - struct mem_cgroup_per_zone *mz_node; - - if (mz->on_tree) - return; - - mz->usage_in_excess = new_usage_in_excess; - if (!mz->usage_in_excess) - return; - while (*p) { - parent = *p; - mz_node = rb_entry(parent, struct mem_cgroup_per_zone, - tree_node); - if (mz->usage_in_excess < mz_node->usage_in_excess) - p = &(*p)->rb_left; - /* - * We can't avoid mem cgroups that are over their soft - * limit by the same amount - */ - else if (mz->usage_in_excess >= mz_node->usage_in_excess) - p = &(*p)->rb_right; - } - rb_link_node(&mz->tree_node, parent, p); - rb_insert_color(&mz->tree_node, &mctz->rb_root); - mz->on_tree = true; -} - -static void -__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - if (!mz->on_tree) - return; - rb_erase(&mz->tree_node, &mctz->rb_root); - mz->on_tree = false; -} - -static void -mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - spin_lock(&mctz->lock); - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - spin_unlock(&mctz->lock); -} - - -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) -{ - unsigned long long excess; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - int nid = page_to_nid(page); - int zid = page_zonenum(page); - mctz = soft_limit_tree_from_page(page); - - /* - * Necessary to update all ancestors when hierarchy is used. - * because their event counter is not touched. - */ - for (; memcg; memcg = parent_mem_cgroup(memcg)) { - mz = mem_cgroup_zoneinfo(memcg, nid, zid); - excess = res_counter_soft_limit_excess(&memcg->res); - /* - * We have to update the tree if mz is on RB-tree or - * mem is over its softlimit. - */ - if (excess || mz->on_tree) { - spin_lock(&mctz->lock); - /* if on-tree, remove it */ - if (mz->on_tree) - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - /* - * Insert again. mz->usage_in_excess will be updated. - * If excess is 0, no tree ops. 
- */ - __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess); - spin_unlock(&mctz->lock); - } - } -} - -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) -{ - int node, zone; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - - for_each_node(node) { - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - mz = mem_cgroup_zoneinfo(memcg, node, zone); - mctz = soft_limit_tree_node_zone(node, zone); - mem_cgroup_remove_exceeded(memcg, mz, mctz); - } - } -} - -static struct mem_cgroup_per_zone * -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct rb_node *rightmost = NULL; - struct mem_cgroup_per_zone *mz; - -retry: - mz = NULL; - rightmost = rb_last(&mctz->rb_root); - if (!rightmost) - goto done; /* Nothing to reclaim from */ - - mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node); - /* - * Remove the node now but someone else can add it back, - * we will to add it back at the end of reclaim to its correct - * position in the tree. - */ - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); - if (!res_counter_soft_limit_excess(&mz->memcg->res) || - !css_tryget(&mz->memcg->css)) - goto retry; -done: - return mz; -} - -static struct mem_cgroup_per_zone * -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct mem_cgroup_per_zone *mz; - - spin_lock(&mctz->lock); - mz = __mem_cgroup_largest_soft_limit_node(mctz); - spin_unlock(&mctz->lock); - return mz; -} - /* * Implementation Note: reading percpu statistics for memcg. * @@ -1065,9 +880,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, case MEM_CGROUP_TARGET_THRESH: next = val + THRESHOLDS_EVENTS_TARGET; break; - case MEM_CGROUP_TARGET_SOFTLIMIT: - next = val + SOFTLIMIT_EVENTS_TARGET; - break; case MEM_CGROUP_TARGET_NUMAINFO: next = val + NUMAINFO_EVENTS_TARGET; break; @@ -1090,11 +902,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) /* threshold event is triggered in finer grain than soft limit */ if (unlikely(mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { - bool do_softlimit; bool do_numainfo __maybe_unused; - do_softlimit = mem_cgroup_event_ratelimit(memcg, - MEM_CGROUP_TARGET_SOFTLIMIT); #if MAX_NUMNODES > 1 do_numainfo = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_NUMAINFO); @@ -1102,8 +911,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) preempt_enable(); mem_cgroup_threshold(memcg); - if (unlikely(do_softlimit)) - mem_cgroup_update_tree(memcg, page); #if MAX_NUMNODES > 1 if (unlikely(do_numainfo)) atomic_inc(&memcg->numainfo_events); @@ -1964,6 +1771,7 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg, return total; } +#if MAX_NUMNODES > 1 /** * test_mem_cgroup_node_reclaimable * @memcg: the target memcg @@ -1986,7 +1794,6 @@ static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg, return false; } -#if MAX_NUMNODES > 1 /* * Always updating the nodemask is not very good - even if we have an empty @@ -2054,51 +1861,12 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) return node; } -/* - * Check all nodes whether it contains reclaimable pages or not. - * For quick scan, we make use of scan_nodes. This will allow us to skip - * unused nodes. But scan_nodes is lazily updated and may not cotain - * enough new information. We need to do double check. 
- */ -static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) -{ - int nid; - - /* - * quick check...making use of scan_node. - * We can skip unused nodes. - */ - if (!nodes_empty(memcg->scan_nodes)) { - for (nid = first_node(memcg->scan_nodes); - nid < MAX_NUMNODES; - nid = next_node(nid, memcg->scan_nodes)) { - - if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap)) - return true; - } - } - /* - * Check rest of nodes. - */ - for_each_node_state(nid, N_MEMORY) { - if (node_isset(nid, memcg->scan_nodes)) - continue; - if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap)) - return true; - } - return false; -} - #else int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) { return 0; } -static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) -{ - return test_mem_cgroup_node_reclaimable(memcg, 0, noswap); -} #endif /* @@ -2973,9 +2741,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg, unlock_page_cgroup(pc); /* - * "charge_statistics" updated event counter. Then, check it. - * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree. - * if they exceeds softlimit. + * "charge_statistics" updated event counter. */ memcg_check_events(memcg, page); } @@ -6112,8 +5878,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) for (zone = 0; zone < MAX_NR_ZONES; zone++) { mz = &pn->zoneinfo[zone]; lruvec_init(&mz->lruvec); - mz->usage_in_excess = 0; - mz->on_tree = false; mz->memcg = memcg; } memcg->nodeinfo[node] = pn; @@ -6169,7 +5933,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) int node; size_t size = memcg_size(); - mem_cgroup_remove_from_trees(memcg); free_css_id(&mem_cgroup_subsys, &memcg->css); for_each_node(node) @@ -6251,29 +6014,6 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) } EXPORT_SYMBOL(parent_mem_cgroup); -static void __init mem_cgroup_soft_limit_tree_init(void) -{ - struct mem_cgroup_tree_per_node *rtpn; - struct mem_cgroup_tree_per_zone *rtpz; - int tmp, node, zone; - - for_each_node(node) { - tmp = node; - if (!node_state(node, N_NORMAL_MEMORY)) - tmp = -1; - rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp); - BUG_ON(!rtpn); - - soft_limit_tree.rb_tree_per_node[node] = rtpn; - - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - rtpz = &rtpn->rb_tree_per_zone[zone]; - rtpz->rb_root = RB_ROOT; - spin_lock_init(&rtpz->lock); - } - } -} - static struct cgroup_subsys_state * __ref mem_cgroup_css_alloc(struct cgroup *cont) { @@ -7066,7 +6806,6 @@ static int __init mem_cgroup_init(void) { hotcpu_notifier(memcg_cpu_hotplug_callback, 0); enable_swap_cgroup(); - mem_cgroup_soft_limit_tree_init(); memcg_stock_init(); return 0; } -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 3/8] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko 2013-06-18 12:09 ` [PATCH v5 1/8] memcg, vmscan: integrate soft reclaim tighter with zone shrinking code Michal Hocko 2013-06-18 12:09 ` [PATCH v5 2/8] memcg: Get rid of soft-limit tree infrastructure Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 4/8] memcg: enhance memcg iterator to support predicates Michal Hocko ` (7 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Soft reclaim has been done only for the global reclaim (both background and direct). Since "memcg: integrate soft reclaim tighter with zone shrinking code" there is no reason for this limitation anymore as the soft limit reclaim doesn't use any special code paths and it is a part of the zone shrinking code which is used by both global and targeted reclaims. From semantic point of view it is even natural to consider soft limit before touching all groups in the hierarchy tree which is touching the hard limit because soft limit tells us where to push back when there is a memory pressure. It is not important whether the pressure comes from the limit or imbalanced zones. This patch simply enables soft reclaim unconditionally in mem_cgroup_should_soft_reclaim so it is enabled for both global and targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn about the root of the reclaim to know where to stop checking soft limit state of parents up the hierarchy. Say we have A (over soft limit) \ B (below s.l., hit the hard limit) / \ C D (below s.l.) B is the source of the outside memory pressure now for D but we shouldn't soft reclaim it because it is behaving well under B subtree and we can still reclaim from C (pressumably it is over the limit). mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the hierarchy at B (root of the memory pressure). 
Changes since v1 - add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Tejun Heo <tj@kernel.org> --- include/linux/memcontrol.h | 6 ++++-- mm/memcontrol.c | 14 +++++++++----- mm/vmscan.c | 4 ++-- 3 files changed, 15 insertions(+), 9 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d495b9e..065ecef 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -180,7 +180,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -357,7 +358,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { return false; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cdc8aed..90217f3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1870,11 +1870,13 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) #endif /* - * A group is eligible for the soft limit reclaim if it is - * a) is over its soft limit + * A group is eligible for the soft limit reclaim under the given root + * hierarchy if + * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit */ -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; @@ -1882,12 +1884,14 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) return true; /* - * If any parent up the hierarchy is over its soft limit then we - * have to obey and reclaim from this group as well. + * If any parent up to the root in the hierarchy is over its soft limit + * then we have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { if (res_counter_soft_limit_excess(&parent->res)) return true; + if (parent == root) + break; } return false; diff --git a/mm/vmscan.c b/mm/vmscan.c index d8823f0..13d746d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -142,7 +142,7 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return !mem_cgroup_disabled() && global_reclaim(sc); + return !mem_cgroup_disabled(); } #else static bool global_reclaim(struct scan_control *sc) @@ -2155,7 +2155,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) struct lruvec *lruvec; if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg)) { + !mem_cgroup_soft_reclaim_eligible(memcg, root)) { memcg = mem_cgroup_iter(root, memcg, &reclaim); continue; } -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 4/8] memcg: enhance memcg iterator to support predicates 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (2 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 3/8] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 5/8] memcg: track children in soft limit excess to improve soft limit Michal Hocko ` (6 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa The caller of the iterator might know that some nodes or even subtrees should be skipped but there is no way to tell iterators about that so the only choice left is to let iterators to visit each node and do the selection outside of the iterating code. This, however, doesn't scale well with hierarchies with many groups where only few groups are interesting. This patch adds mem_cgroup_iter_cond variant of the iterator with a callback which gets called for every visited node. There are three possible ways how the callback can influence the walk. Either the node is visited, it is skipped but the tree walk continues down the tree or the whole subtree of the current group is skipped. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 48 +++++++++++++++++++++++++---- mm/memcontrol.c | 77 ++++++++++++++++++++++++++++++++++++---------- mm/vmscan.c | 16 +++------- 3 files changed, 108 insertions(+), 33 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 065ecef..1276be3 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -41,6 +41,23 @@ struct mem_cgroup_reclaim_cookie { unsigned int generation; }; +enum mem_cgroup_filter_t { + VISIT, /* visit current node */ + SKIP, /* skip the current node and continue traversal */ + SKIP_TREE, /* skip the whole subtree and continue traversal */ +}; + +/* + * mem_cgroup_filter_t predicate might instruct mem_cgroup_iter_cond how to + * iterate through the hierarchy tree. Each tree element is checked by the + * predicate before it is returned by the iterator. If a filter returns + * SKIP or SKIP_TREE then the iterator code continues traversal (with the + * next node down the hierarchy or the next node that doesn't belong under the + * memcg's subtree). 
+ */ +typedef enum mem_cgroup_filter_t +(*mem_cgroup_iter_filter)(struct mem_cgroup *memcg, struct mem_cgroup *root); + #ifdef CONFIG_MEMCG /* * All "charge" functions with gfp_mask should use GFP_KERNEL or @@ -108,9 +125,18 @@ mem_cgroup_prepare_migration(struct page *page, struct page *newpage, extern void mem_cgroup_end_migration(struct mem_cgroup *memcg, struct page *oldpage, struct page *newpage, bool migration_ok); -struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *, - struct mem_cgroup *, - struct mem_cgroup_reclaim_cookie *); +struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root, + struct mem_cgroup *prev, + struct mem_cgroup_reclaim_cookie *reclaim, + mem_cgroup_iter_filter cond); + +static inline struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, + struct mem_cgroup *prev, + struct mem_cgroup_reclaim_cookie *reclaim) +{ + return mem_cgroup_iter_cond(root, prev, reclaim, NULL); +} + void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *); /* @@ -180,7 +206,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, +enum mem_cgroup_filter_t +mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); @@ -295,6 +322,14 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg, struct page *oldpage, struct page *newpage, bool migration_ok) { } +static inline struct mem_cgroup * +mem_cgroup_iter_cond(struct mem_cgroup *root, + struct mem_cgroup *prev, + struct mem_cgroup_reclaim_cookie *reclaim, + mem_cgroup_iter_filter cond) +{ + return NULL; +} static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, @@ -358,10 +393,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, +enum mem_cgroup_filter_t +mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root) { - return false; + return VISIT; } static inline void mem_cgroup_split_huge_fixup(struct page *head) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 90217f3..1364ca5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -959,6 +959,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm) return memcg; } +static enum mem_cgroup_filter_t +mem_cgroup_filter(struct mem_cgroup *memcg, struct mem_cgroup *root, + mem_cgroup_iter_filter cond) +{ + if (!cond) + return VISIT; + return cond(memcg, root); +} + /* * Returns a next (in a pre-order walk) alive memcg (with elevated css * ref. count) or NULL if the whole root's subtree has been visited. @@ -966,7 +975,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm) * helper function to be used by mem_cgroup_iter */ static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root, - struct mem_cgroup *last_visited) + struct mem_cgroup *last_visited, mem_cgroup_iter_filter cond) { struct cgroup *prev_cgroup, *next_cgroup; @@ -974,10 +983,18 @@ static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root, * Root is not visited by cgroup iterators so it needs an * explicit visit. */ - if (!last_visited) - return root; + if (!last_visited) { + switch(mem_cgroup_filter(root, root, cond)) { + case VISIT: + return root; + case SKIP: + break; + case SKIP_TREE: + return NULL; + } + } - prev_cgroup = (last_visited == root) ? 
NULL + prev_cgroup = (last_visited == root || !last_visited) ? NULL : last_visited->css.cgroup; skip_node: next_cgroup = cgroup_next_descendant_pre( @@ -993,11 +1010,30 @@ skip_node: if (next_cgroup) { struct mem_cgroup *mem = mem_cgroup_from_cont( next_cgroup); - if (css_tryget(&mem->css)) - return mem; - else { + + switch (mem_cgroup_filter(mem, root, cond)) { + case SKIP: prev_cgroup = next_cgroup; goto skip_node; + case SKIP_TREE: + /* + * cgroup_rightmost_descendant is not an optimal way to + * skip through a subtree (especially for imbalanced + * trees leaning to right) but that's what we have right + * now. More effective solution would be traversing + * right-up for first non-NULL without calling + * cgroup_next_descendant_pre afterwards. + */ + prev_cgroup = cgroup_rightmost_descendant(next_cgroup); + goto skip_node; + case VISIT: + if (css_tryget(&mem->css)) + return mem; + else { + prev_cgroup = next_cgroup; + goto skip_node; + } + break; } } @@ -1061,6 +1097,7 @@ static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter, * @root: hierarchy root * @prev: previously returned memcg, NULL on first invocation * @reclaim: cookie for shared reclaim walks, NULL for full walks + * @cond: filter for visited nodes, NULL for no filter * * Returns references to children of the hierarchy below @root, or * @root itself, or %NULL after a full round-trip. @@ -1073,9 +1110,10 @@ static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter, * divide up the memcgs in the hierarchy among all concurrent * reclaimers operating on the same zone and priority. */ -struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, +struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root, struct mem_cgroup *prev, - struct mem_cgroup_reclaim_cookie *reclaim) + struct mem_cgroup_reclaim_cookie *reclaim, + mem_cgroup_iter_filter cond) { struct mem_cgroup *memcg = NULL; struct mem_cgroup *last_visited = NULL; @@ -1092,7 +1130,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (!root->use_hierarchy && root != root_mem_cgroup) { if (prev) goto out_css_put; - return root; + if (mem_cgroup_filter(root, root, cond) == VISIT) + return root; + return NULL; } rcu_read_lock(); @@ -1116,7 +1156,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, last_visited = mem_cgroup_iter_load(iter, root, &seq); } - memcg = __mem_cgroup_iter_next(root, last_visited); + memcg = __mem_cgroup_iter_next(root, last_visited, cond); if (reclaim) { mem_cgroup_iter_update(iter, last_visited, memcg, seq); @@ -1127,7 +1167,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, reclaim->generation = iter->generation; } - if (prev && !memcg) + /* + * We have finished the whole tree walk or no group has been + * visited because filter told us to skip the root node. 
+ */ + if (!memcg && (prev || (cond && !last_visited))) goto out_unlock; } out_unlock: @@ -1875,13 +1919,14 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit */ -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, +enum mem_cgroup_filter_t +mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; if (res_counter_soft_limit_excess(&memcg->res)) - return true; + return VISIT; /* * If any parent up to the root in the hierarchy is over its soft limit @@ -1889,12 +1934,12 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, */ while((parent = parent_mem_cgroup(parent))) { if (res_counter_soft_limit_excess(&parent->res)) - return true; + return VISIT; if (parent == root) break; } - return false; + return SKIP; } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 13d746d..79d59ec 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2145,21 +2145,16 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) .zone = zone, .priority = sc->priority, }; - struct mem_cgroup *memcg; + struct mem_cgroup *memcg = NULL; + mem_cgroup_iter_filter filter = (soft_reclaim) ? + mem_cgroup_soft_reclaim_eligible : NULL; nr_reclaimed = sc->nr_reclaimed; nr_scanned = sc->nr_scanned; - memcg = mem_cgroup_iter(root, NULL, &reclaim); - do { + while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) { struct lruvec *lruvec; - if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg, root)) { - memcg = mem_cgroup_iter(root, memcg, &reclaim); - continue; - } - lruvec = mem_cgroup_zone_lruvec(zone, memcg); shrink_lruvec(lruvec, sc); @@ -2179,8 +2174,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) mem_cgroup_iter_break(root, memcg); break; } - memcg = mem_cgroup_iter(root, memcg, &reclaim); - } while (memcg); + } vmpressure(sc->gfp_mask, sc->target_mem_cgroup, sc->nr_scanned - nr_scanned, -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 5/8] memcg: track children in soft limit excess to improve soft limit 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (3 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 4/8] memcg: enhance memcg iterator to support predicates Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything Michal Hocko ` (5 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Soft limit reclaim has to check the whole reclaim hierarchy while doing the first pass of the reclaim. This leads to a higher system time which can be visible especially when there are many groups in the hierarchy. This patch adds a per-memcg counter of children in excess. It also restores MEM_CGROUP_TARGET_SOFTLIMIT into mem_cgroup_event_ratelimit for a proper batching. If a group crosses soft limit for the first time it increases parent's children_in_excess up the hierarchy. The similarly if a group gets below the limit it will decrease the counter. The transition phase is recorded in soft_contributed flag. mem_cgroup_soft_reclaim_eligible then uses this information to better decide whether to skip the node or the whole subtree. The rule is simple. Skip the node with a children in excess or skip the whole subtree otherwise. This has been tested by a stream IO (dd if=/dev/zero of=file with 4*MemTotal size) which is quite sensitive to overhead during reclaim. The load is running in a group with soft limit set to 0 and without any limit. Apart from that there was a hierarchy with ~500, 2k and 8k groups (two groups on each level) without any pages in them. 
base denotes to the kernel on which the whole series is based on, rework is the kernel before this patch and reworkoptim is with this patch applied: * Run with soft limit set to 0 Elapsed 0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3 0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3 0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3 System 0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3 0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3 0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3 Elapsed 0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3 0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3 0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3 System 2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3 2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3 2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3 Elapsed 2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3 2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3 2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3 System 8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3 8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3 8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3 Elapsed 8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3 8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3 8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3 System time is increased by 30-40% but it is reduced a lot comparing to kernel without this patch. The higher time can be explained by the fact that the original soft reclaim scanned at priority 0 so it was much more effective for this workload (which is basically touch once and writeback). The Elapsed time looks better though (~20%). 
* Run with no soft limit set System 0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3 0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3 0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3 Elapsed 0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3 0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3 0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3 System 0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3 0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3 0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3 Elapsed 0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3 0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3 0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3 System 2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3 2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3 2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3 Elapsed 2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3 2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 1.55 runs: 3 2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3 System 8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3 8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3 8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3 Elapsed 8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3 8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3 8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3 Both System and Elapsed are in stdev with the base kernel for all configurations except for 8k where both System and Elapsed are up by 35%. I do not have a good explanation for this because there is no soft reclaim pass going on as no group is above the limit which is checked in mem_cgroup_should_soft_reclaim. Then I have tested kernel build with the same configuration to see the behavior with a more general behavior. 
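The kernel build load used below (two groups, make -j4 in each,
statistics collected with /usr/bin/time -v) can be wrapped in a small
helper along the lines of the following sketch; the script name and the
source tree paths are made up for the example:

  #!/bin/sh
  # one_build.sh <group> <source tree> -- sketch only, paths are assumptions
  CG=/sys/fs/cgroup/memory
  mkdir -p "$CG/$1"
  echo 0 > "$CG/$1/memory.soft_limit_in_bytes"   # the 0-limit configuration; drop for no-limit
  echo $$ > "$CG/$1/tasks"                       # move this shell and its children into the group
  cd "$2" || exit 1
  exec /usr/bin/time -v make -s -j4

Two instances would then be started in parallel, e.g.
"./one_build.sh kbuild-A ~/linux-A & ./one_build.sh kbuild-B ~/linux-B & wait".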
* Soft limit set to 0 for the build System 0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3 0-0-limit/rework min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3 0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3 Elapsed 0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3 0-0-limit/rework min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3 0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3 System 0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3 0.5k-0-limit/rework min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3 0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3 Elapsed 0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3 0.5k-0-limit/rework min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3 0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3 System 2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3 2k-0-limit/rework min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3 2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3 Elapsed 2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3 2k-0-limit/rework min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3 2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3 System 8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3 8k-0-limit/rework min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3 8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3 Elapsed 8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3 8k-0-limit/rework min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3 8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3 Apart from 8k the system time is comparable with the base kernel while Elapsed is up to 20% better with all configurations. 
* No soft limit set System 0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3 0-no-limit/rework min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3 0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3 Elapsed 0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3 0-no-limit/rework min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3 0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3 System 0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3 0.5k-no-limit/rework min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3 0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3 Elapsed 0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3 0.5k-no-limit/rework min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3 0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3 System 2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3 2k-no-limit/rework min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3 2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3 Elapsed 2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3 2k-no-limit/rework min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 [104.0%] std: 2.72 runs: 3 2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3 System 8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3 8k-no-limit/rework min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3 8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3 Elapsed 8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3 8k-no-limit/rework min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3 8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3 and these numbers look good as well. System time is around 100% (suprisingly better for the 8k case) and Elapsed is copies that trend. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 71 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1364ca5..2a06d5a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -136,6 +136,7 @@ static const char * const mem_cgroup_lru_names[] = { */ enum mem_cgroup_events_target { MEM_CGROUP_TARGET_THRESH, + MEM_CGROUP_TARGET_SOFTLIMIT, MEM_CGROUP_TARGET_NUMAINFO, MEM_CGROUP_NTARGETS, }; @@ -350,6 +351,22 @@ struct mem_cgroup { atomic_t numainfo_events; atomic_t numainfo_updating; #endif + /* + * Protects soft_contributed transitions. + * See mem_cgroup_update_soft_limit + */ + spinlock_t soft_lock; + + /* + * If true then this group has increased parents' children_in_excess + * when it got over the soft limit. + * When a group falls bellow the soft limit, parents' children_in_excess + * is decreased and soft_contributed changed to false. 
+ */ + bool soft_contributed; + + /* Number of children that are in soft limit excess */ + atomic_t children_in_excess; struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ @@ -880,6 +897,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, case MEM_CGROUP_TARGET_THRESH: next = val + THRESHOLDS_EVENTS_TARGET; break; + case MEM_CGROUP_TARGET_SOFTLIMIT: + next = val + SOFTLIMIT_EVENTS_TARGET; + break; case MEM_CGROUP_TARGET_NUMAINFO: next = val + NUMAINFO_EVENTS_TARGET; break; @@ -893,6 +913,42 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, } /* + * Called from rate-limitted memcg_check_events when enough + * MEM_CGROUP_TARGET_SOFTLIMIT events are accumulated and it makes sure + * that all the parents up the hierarchy will be noticed that this group + * is in excess or that it is not in excess anymore. mmecg->soft_contributed + * makes the transition a single action whenever the state flips from one to + * other. + */ +static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg) +{ + unsigned long long excess = res_counter_soft_limit_excess(&memcg->res); + struct mem_cgroup *parent = memcg; + int delta = 0; + + spin_lock(&memcg->soft_lock); + if (excess) { + if (!memcg->soft_contributed) { + delta = 1; + memcg->soft_contributed = true; + } + } else { + if (memcg->soft_contributed) { + delta = -1; + memcg->soft_contributed = false; + } + } + + /* + * Necessary to update all ancestors when hierarchy is used + * because their event counter is not touched. + */ + while (delta && (parent = parent_mem_cgroup(parent))) + atomic_add(delta, &parent->children_in_excess); + spin_unlock(&memcg->soft_lock); +} + +/* * Check events in order. * */ @@ -902,8 +958,11 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) /* threshold event is triggered in finer grain than soft limit */ if (unlikely(mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { + bool do_softlimit; bool do_numainfo __maybe_unused; + do_softlimit = mem_cgroup_event_ratelimit(memcg, + MEM_CGROUP_TARGET_SOFTLIMIT); #if MAX_NUMNODES > 1 do_numainfo = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_NUMAINFO); @@ -911,6 +970,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) preempt_enable(); mem_cgroup_threshold(memcg); + if (unlikely(do_softlimit)) + mem_cgroup_update_soft_limit(memcg); #if MAX_NUMNODES > 1 if (unlikely(do_numainfo)) atomic_inc(&memcg->numainfo_events); @@ -1918,6 +1979,9 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) * hierarchy if * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit + * + * If the given group doesn't have any children over the limit then it + * doesn't make any sense to iterate its subtree. 
*/ enum mem_cgroup_filter_t mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, @@ -1939,6 +2003,8 @@ mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, break; } + if (!atomic_read(&memcg->children_in_excess)) + return SKIP_TREE; return SKIP; } @@ -6093,6 +6159,7 @@ mem_cgroup_css_alloc(struct cgroup *cont) mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); vmpressure_init(&memcg->vmpressure); + spin_lock_init(&memcg->soft_lock); return &memcg->css; @@ -6182,6 +6249,10 @@ static void mem_cgroup_css_offline(struct cgroup *cont) mem_cgroup_invalidate_reclaim_iterators(memcg); mem_cgroup_reparent_charges(memcg); + if (memcg->soft_contributed) { + while ((memcg = parent_mem_cgroup(memcg))) + atomic_dec(&memcg->children_in_excess); + } mem_cgroup_destroy_all_caches(memcg); } -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
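For intuition about what "in excess" means here: the children_in_excess
counter added by the patch stays internal to the kernel, but the
per-group excess it aggregates is simply usage above the soft limit,
which can be eyeballed from userspace with the existing memcg v1 files
(the group path below is just an example):

  G=/sys/fs/cgroup/memory/streamer        # example group
  usage=$(cat $G/memory.usage_in_bytes)
  limit=$(cat $G/memory.soft_limit_in_bytes)
  if [ "$usage" -gt "$limit" ]; then
          echo "$G is $((usage - limit)) bytes over its soft limit"
  else
          echo "$G is within its soft limit"
  fi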
* [PATCH v5 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (4 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 5/8] memcg: track children in soft limit excess to improve soft limit Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 7/8] memcg: Track all children over limit in the root Michal Hocko ` (4 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa mem_cgroup_should_soft_reclaim controls whether soft reclaim pass is done and it always says yes currently. Memcg iterators are clever to skip nodes that are not soft reclaimable quite efficiently but mem_cgroup_should_soft_reclaim can be more clever and do not start the soft reclaim pass at all if it knows that nothing would be scanned anyway. In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for the target group of the reclaim and allow the pass only if the whole subtree wouldn't be skipped. Changes since v1 - do not export mem_cgroup_root and teach mem_cgroup_soft_reclaim_eligible to handle NULL memcg as mem_cgroup_root Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 6 +++++- mm/vmscan.c | 4 +++- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2a06d5a..b2e44d3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1987,7 +1987,11 @@ enum mem_cgroup_filter_t mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root) { - struct mem_cgroup *parent = memcg; + struct mem_cgroup *parent; + + if (!memcg) + memcg = root_mem_cgroup; + parent = memcg; if (res_counter_soft_limit_excess(&memcg->res)) return VISIT; diff --git a/mm/vmscan.c b/mm/vmscan.c index 79d59ec..56302da 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -142,7 +142,9 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return !mem_cgroup_disabled(); + struct mem_cgroup *root = sc->target_mem_cgroup; + return !mem_cgroup_disabled() && + mem_cgroup_soft_reclaim_eligible(root, root) != SKIP_TREE; } #else static bool global_reclaim(struct scan_control *sc) -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 7/8] memcg: Track all children over limit in the root 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (5 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly Michal Hocko ` (3 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Children in soft limit excess are currently tracked up the hierarchy in memcg->children_in_excess. Nevertheless there still might exist tons of groups that are not in hierarchy relation to the root cgroup (e.g. all first level groups if root_mem_cgroup->use_hierarchy == false). As the whole tree walk has to be done when the iteration starts at root_mem_cgroup the iterator should be able to skip the walk if there is no child above the limit without iterating them. This can be done easily if the root tracks all children rather than only hierarchical children. This is done by this patch which updates root_mem_cgroup children_in_excess if root_mem_cgroup->use_hierarchy == false so the root knows about all children in excess. Please note that this is not an issue for inner memcgs which have use_hierarchy == false because then only the single group is visited so no special optimization is necessary. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b2e44d3..d55e2c8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -942,9 +942,15 @@ static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg) /* * Necessary to update all ancestors when hierarchy is used * because their event counter is not touched. + * We track children even outside the hierarchy for the root + * cgroup because tree walk starting at root should visit + * all cgroups and we want to prevent from pointless tree + * walk if no children is below the limit. */ while (delta && (parent = parent_mem_cgroup(parent))) atomic_add(delta, &parent->children_in_excess); + if (memcg != root_mem_cgroup && !root_mem_cgroup->use_hierarchy) + atomic_add(delta, &root_mem_cgroup->children_in_excess); spin_unlock(&memcg->soft_lock); } @@ -6256,6 +6262,9 @@ static void mem_cgroup_css_offline(struct cgroup *cont) if (memcg->soft_contributed) { while ((memcg = parent_mem_cgroup(memcg))) atomic_dec(&memcg->children_in_excess); + + if (memcg != root_mem_cgroup && !root_mem_cgroup->use_hierarchy) + atomic_dec(&root_mem_cgroup->children_in_excess); } mem_cgroup_destroy_all_caches(memcg); } -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (6 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 7/8] memcg: Track all children over limit in the root Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 19:01 ` [PATCH v5] Soft limit rework Johannes Weiner ` (2 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa shrink_zone starts with soft reclaim pass first and then falls back to regular reclaim if nothing has been scanned. This behavior is natural but there is a catch. Memcg iterators, when used with the reclaim cookie, are designed to help to prevent from over reclaim by interleaving reclaimers (per node-zone-priority) so the tree walk might miss many (even all) nodes in the hierarchy e.g. when there are direct reclaimers racing with each other or with kswapd in the global case or multiple allocators reaching the limit for the target reclaim case. To make it even more complicated, targeted reclaim doesn't do the whole tree walk because it stops reclaiming once it reclaims sufficient pages. As a result groups over the limit might be missed, thus nothing is scanned, and reclaim would fall back to the reclaim all mode. This patch checks for the incomplete tree walk in shrink_zone. If no group has been visited and the hierarchy is soft reclaimable then we must have missed some groups, in which case the __shrink_zone is called again. This doesn't guarantee there will be some progress of course because the current reclaimer might be still racing with others but it would at least give a chance to start the walk without a big risk of reclaim latencies. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/vmscan.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 56302da..8cbc8e5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2136,10 +2136,11 @@ static inline bool should_continue_reclaim(struct zone *zone, } } -static void +static int __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) { unsigned long nr_reclaimed, nr_scanned; + int groups_scanned = 0; do { struct mem_cgroup *root = sc->target_mem_cgroup; @@ -2157,6 +2158,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) { struct lruvec *lruvec; + groups_scanned++; lruvec = mem_cgroup_zone_lruvec(zone, memcg); shrink_lruvec(lruvec, sc); @@ -2184,6 +2186,8 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, sc->nr_scanned - nr_scanned, sc)); + + return groups_scanned; } @@ -2191,8 +2195,19 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) { bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); unsigned long nr_scanned = sc->nr_scanned; + int scanned_groups; - __shrink_zone(zone, sc, do_soft_reclaim); + scanned_groups = __shrink_zone(zone, sc, do_soft_reclaim); + /* + * memcg iterator might race with other reclaimer or start from + * a incomplete tree walk so the tree walk in __shrink_zone + * might have missed groups that are above the soft limit. 
Try + * another loop to catch up with others. Do it just once to + * prevent from reclaim latencies when other reclaimers always + * preempt this one. + */ + if (do_soft_reclaim && !scanned_groups) + __shrink_zone(zone, sc, do_soft_reclaim); /* * No group is over the soft limit or those that are do not have -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (7 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly Michal Hocko @ 2013-06-18 19:01 ` Johannes Weiner 2013-06-19 10:20 ` Michal Hocko 2013-06-20 11:12 ` Mel Gorman [not found] ` <1371557387-22434-1-git-send-email-mhocko-AlSwsSmVLrQ@public.gmane.org> 10 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-06-18 19:01 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > My primary test case was a parallel kernel build with 2 groups (make > is running with -j4 with a distribution .config in a separate cgroup > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > was mostly interested in 2 setups. Default - no soft limit set and - and > 0 soft limit set to both groups. > The first one should tell us whether the rework regresses the default > behavior while the second one should show us improvements in an extreme > case where both workloads are always over the soft limit. The most interesting test case would be how it behaves if some groups are over the soft limits while others are not. I would expect this to be the most common situation for when soft limits are used. On the other hand, setting all soft limits to 0 makes every reclaim invocation do almost only soft reclaim. What's the point of that? > /usr/bin/time -v has been used to collect the statistics and each > configuration had 3 runs after fresh boot without any other load on the > system. > > base is mmotm-2013-05-09-15-57 > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > without slab shrinkers patchset. > reworkrebase all patches 8 applied on top of baserebase > > * No-limit > User > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > System > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > Elapsed > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > Elapsed time regressed by 13% wrt. base but it seems that this came from > baserebase which regressed by the same amount. Which mmots does this refer to? We should probably look into a regression this size... 
> * 0-limit > User > base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6 > baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6 > reworkrebase: min: 1169.88 [98.5%] max: 1177.84 [98.3%] avg: 1173.50 [98.3%] std: 2.79 runs: 6 > System > base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6 > baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6 > reworkrebase: min: 235.19 [94.7%] max: 237.43 [94.2%] avg: 236.35 [94.5%] std: 0.86 runs: 6 > Elapsed > base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6 > baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6 > reworkrebase: min: 667.54 [87.9%] max: 718.54 [89.2%] avg: 695.61 [88.6%] std: 17.16 runs: 6 > > System time is slightly better but I wouldn't consider it relevant. > > Elapsed time is more interesting though. baserebase regresses by 16% > again which is in par with no-limit configuration. > > With the patchset applied we are 11% better in average wrt. to the > old base but it is important to realize that this is still 76.3% wrt. > baserebase so the effect of the series is comparable to the previous > version. Albeit the whole result is worse. > > Page fault statistics tell us at least part of the story: > Minor > base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6 > baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6 > reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6 > Major > base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6 > baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6 > reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6 The changes are big but the test makes no sense to me. > While the minor faults are within the noise the major faults are reduced > considerably. This looks like an aggressive pageout during the reclaim > and that pageout affects the working set presumably. Please note that > baserebase has even hight number of major page faults than the older > mmotm trree. > > While this looks as a nice win it is fair to say that there are some > workloads that actually benefit from reclaim at 0 priority (from > background reclaim). E.g. an aggressive streaming IO would like to get > rid of as many pages as possible and do not block on the pages under > writeback. This can lead to a higher System time but I generally got > Elapsed which was comparable. > > The following results are from 2 groups configuration on a 8GB machine > (A running stream IO with 4*TotalMem with 0 soft limit, B runnning a > mem_eater which consumes TotalMem-1G without any limit). > System > base: min: 124.88 max: 136.97 avg: 130.77 std: 4.94 runs: 3 > baserebase: min: 102.51 [82.1%] max: 108.84 [79.5%] avg: 104.81 [80.1%] std: 2.86 runs: 3 > reworkrebase: min: 108.29 [86.7%] max: 121.70 [88.9%] avg: 114.60 [87.6%] std: 5.50 runs: 3 > Elapsed > base: min: 398.86 max: 412.81 avg: 407.62 std: 6.23 runs: 3 > baserebase: min: 480.92 [120.6%] max: 497.56 [120.5%] avg: 491.46 [120.6%] std: 7.48 runs: 3 > reworkrebase: min: 397.19 [99.6%] max: 462.57 [112.1%] avg: 436.13 [107.0%] std: 28.12 runs: 3 Do you have individual runtimes for both the streamer and "mem_eater"? Is mem_eater's memory reclaimable? 
* Re: [PATCH v5] Soft limit rework 2013-06-18 19:01 ` [PATCH v5] Soft limit rework Johannes Weiner @ 2013-06-19 10:20 ` Michal Hocko 0 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-19 10:20 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue 18-06-13 15:01:21, Johannes Weiner wrote: > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > My primary test case was a parallel kernel build with 2 groups (make > > is running with -j4 with a distribution .config in a separate cgroup > > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > > was mostly interested in 2 setups. Default - no soft limit set and - and > > 0 soft limit set to both groups. > > The first one should tell us whether the rework regresses the default > > behavior while the second one should show us improvements in an extreme > > case where both workloads are always over the soft limit. > > The most interesting test case would be how it behaves if some groups > are over the soft limits while others are not. I would expect this to > be the most common situation for when soft limits are used. The stream IO test case does this (0 soft limit for dd and no limit for mem_eater). Anyway, I have tested this configuration with kbuild test (two groups, one 0 s.l. the second with no soft limit). Here are the commulative results (which do not distinguish results from the two groups. stdev tells us that the differences between the two groups are not significant): User baserebase: min: 1001.74 max: 1006.70 avg: 1004.54 std: 1.57 runs: 6 reworkrebase: min: 1001.11 [99.9%] max: 1005.35 [99.9%] avg: 1003.31 [99.9%] std: 1.58 runs: 6 System baserebase: min: 143.68 max: 146.39 avg: 144.95 std: 0.96 runs: 6 reworkrebase: min: 144.68 [100.7%] max: 146.80 [100.3%] avg: 145.43 [100.3%] std: 0.71 runs: 6 Elapsed baserebase: min: 403.31 max: 411.35 avg: 406.94 std: 2.50 runs: 6 reworkrebase: min: 391.70 [97.1%] max: 406.52 [98.8%] avg: 401.32 [98.6%] std: 4.95 runs: 6 Results are within noise. > On the other hand, setting all soft limits to 0 makes every reclaim > invocation do almost only soft reclaim. What's the point of that? The main point was to isolate the soft recalim as much as possible so that we can compare effects of the previous and the new implementations. The no-limit configuration on the other hand shows that the change didn't regress for the default configuration when no soft limit is used. Seeing some improvement is nice but it is not the primary motivation of this series. My primary interest was to not regress in common workloads. As I said there are some workloads which could benefit from prio-0 reclaim used for the soft reclaim previously but as the figures show this is not the case for the best and most loved kbuild test. > > /usr/bin/time -v has been used to collect the statistics and each > > configuration had 3 runs after fresh boot without any other load on the > > system. > > > > base is mmotm-2013-05-09-15-57 > > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > > without slab shrinkers patchset. 
> > reworkrebase all patches 8 applied on top of baserebase > > > > * No-limit > > User > > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > > System > > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > > Elapsed > > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > > > Elapsed time regressed by 13% wrt. base but it seems that this came from > > baserebase which regressed by the same amount. > > Which mmots does this refer to? We should probably look into a > regression this size... It was mmotm-2013-06-05-17-24-63 with some patches on top without slab shrinkers which failed to finish this test case. I plan to report this separately with more specific information in a separate thread. > > * 0-limit > > User > > base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6 > > baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6 > > reworkrebase: min: 1169.88 [98.5%] max: 1177.84 [98.3%] avg: 1173.50 [98.3%] std: 2.79 runs: 6 > > System > > base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6 > > baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6 > > reworkrebase: min: 235.19 [94.7%] max: 237.43 [94.2%] avg: 236.35 [94.5%] std: 0.86 runs: 6 > > Elapsed > > base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6 > > baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6 > > reworkrebase: min: 667.54 [87.9%] max: 718.54 [89.2%] avg: 695.61 [88.6%] std: 17.16 runs: 6 > > > > System time is slightly better but I wouldn't consider it relevant. > > > > Elapsed time is more interesting though. baserebase regresses by 16% > > again which is in par with no-limit configuration. > > > > With the patchset applied we are 11% better in average wrt. to the > > old base but it is important to realize that this is still 76.3% wrt. > > baserebase so the effect of the series is comparable to the previous > > version. Albeit the whole result is worse. > > > > Page fault statistics tell us at least part of the story: > > Minor > > base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6 > > baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6 > > reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6 > > Major > > base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6 > > baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6 > > reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6 > > The changes are big but the test makes no sense to me. As mentioned above this is a simple regression test when the soft reclaim is exercised as much as possible. 
For this particular test case it is clear that the previous implementation was too disruptive by doing prio-0 reclaim which evicted part of the working set. Do you have any other ideas how to compare the two implementations? > > While the minor faults are within the noise the major faults are reduced > > considerably. This looks like an aggressive pageout during the reclaim > > and that pageout affects the working set presumably. Please note that > > baserebase has even hight number of major page faults than the older > > mmotm trree. > > > > While this looks as a nice win it is fair to say that there are some > > workloads that actually benefit from reclaim at 0 priority (from > > background reclaim). E.g. an aggressive streaming IO would like to get > > rid of as many pages as possible and do not block on the pages under > > writeback. This can lead to a higher System time but I generally got > > Elapsed which was comparable. > > > > The following results are from 2 groups configuration on a 8GB machine > > (A running stream IO with 4*TotalMem with 0 soft limit, B runnning a > > mem_eater which consumes TotalMem-1G without any limit). > > System > > base: min: 124.88 max: 136.97 avg: 130.77 std: 4.94 runs: 3 > > baserebase: min: 102.51 [82.1%] max: 108.84 [79.5%] avg: 104.81 [80.1%] std: 2.86 runs: 3 > > reworkrebase: min: 108.29 [86.7%] max: 121.70 [88.9%] avg: 114.60 [87.6%] std: 5.50 runs: 3 > > Elapsed > > base: min: 398.86 max: 412.81 avg: 407.62 std: 6.23 runs: 3 > > baserebase: min: 480.92 [120.6%] max: 497.56 [120.5%] avg: 491.46 [120.6%] std: 7.48 runs: 3 > > reworkrebase: min: 397.19 [99.6%] max: 462.57 [112.1%] avg: 436.13 [107.0%] std: 28.12 runs: 3 > > Do you have individual runtimes for both the streamer and "mem_eater"? mem_eater is a simple anon mmap MAP_POPULATE which then sits on the memory without touching it until it is notified when it dies (this happens after streamer is done). It simulates a big memory resident while there is a big memory pressure. Measuring the time for the streamer makes sense on the other hand because it depends on the reclaim effectiveness. The numbers show that we regress a bit here which is caused by getting away from prio-0 reclaim which manages to reclaim touched once pages more effectively. > Is mem_eater's memory reclaimable? yes but the fact that it is soft unlimited helps it stay resident. When we look at usage_in_bytes for B (mem eater) for all 3 runs: base median: 7399763968 max: 7605112832 avg: 7314214879 rework median: 7736799232 [104%] max: 7832416256 [102%] avg: 7581929848 [104%] (*) baserebase median: 7788097536 [105%] max: 7841947648 [103%] avg: 7717106374 [105%] reworkrebase median: 7722799104 [104%] max: 7814995968 [102%] avg: 7614082984 [104%] (*) - this is the patch series before rebase We do not reclaim that much in the base kernel for baserebase but the series helps to prevent from reclaim a bit (~4% more memory for mem_eater both in average and in median). I think we can do even better if the streamer preferred zones that are not full from mem_eater and didn't trigger reclaim on those so that we wouldn't fall back to all-reclaim mode so often. Does this make more sense to you now or there are things that I should retest to better describe/measure the two soft reclaim implementations? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
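A sketch of how the streamer/mem_eater pair described above could be
driven follows. "mem_eater" is a stand-in name for the small helper
just described (an anonymous MAP_POPULATE mapping that waits until it
is told to exit); it is not a tool shipped with the series, and the
mount point and scratch file path are likewise assumptions:

  CG=/sys/fs/cgroup/memory
  mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)

  mkdir -p $CG/A $CG/B
  echo 0 > $CG/A/memory.soft_limit_in_bytes        # A: streamer, always in excess
                                                   # B: mem_eater, default (no) soft limit

  # B pins TotalMem - 1G of anonymous memory and sleeps
  sh -c "echo \$\$ > $CG/B/tasks; exec ./mem_eater $(( (mem_kb - 1048576) * 1024 ))" &
  eater=$!

  # A streams 4*TotalMem through the page cache
  sh -c "echo \$\$ > $CG/A/tasks; exec /usr/bin/time -v dd if=/dev/zero of=/mnt/scratch/file bs=1M count=$(( mem_kb * 4 / 1024 ))"

  kill $eater                                      # tell the eater to exit once the streamer is done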
* Re: [PATCH v5] Soft limit rework 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (8 preceding siblings ...) 2013-06-18 19:01 ` [PATCH v5] Soft limit rework Johannes Weiner @ 2013-06-20 11:12 ` Mel Gorman [not found] ` <20130620111206.GA14809-l3A5Bk7waGM@public.gmane.org> [not found] ` <1371557387-22434-1-git-send-email-mhocko-AlSwsSmVLrQ@public.gmane.org> 10 siblings, 1 reply; 30+ messages in thread From: Mel Gorman @ 2013-06-20 11:12 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > base is mmotm-2013-05-09-15-57 > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > without slab shrinkers patchset. > reworkrebase all patches 8 applied on top of baserebase > > * No-limit > User > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > System > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > Elapsed > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > Elapsed time regressed by 13% wrt. base but it seems that this came from > baserebase which regressed by the same amount. > boo-urns > Page fault statistics tell us at least part of the story: > Minor > base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6 > baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6 > reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6 > Major > base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6 > baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6 > reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6 Can you try this monolithic patch please? diff --git a/mm/vmscan.c b/mm/vmscan.c index fe73724..f677780 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1477,25 +1477,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, * as there is no guarantee the dirtying process is throttled in the * same way balance_dirty_pages() manages. * - * This scales the number of dirty pages that must be under writeback - * before a zone gets flagged ZONE_WRITEBACK. It is a simple backoff - * function that has the most effect in the range DEF_PRIORITY to - * DEF_PRIORITY-2 which is the priority reclaim is considered to be - * in trouble and reclaim is considered to be in trouble. - * - * DEF_PRIORITY 100% isolated pages must be PageWriteback to throttle - * DEF_PRIORITY-1 50% must be PageWriteback - * DEF_PRIORITY-2 25% must be PageWriteback, kswapd in trouble - * ... 
- * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any - * isolated page is PageWriteback - * * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number * of pages under pages flagged for immediate reclaim and stall if any * are encountered in the nr_immediate check below. */ - if (nr_writeback && nr_writeback >= - (nr_taken >> (DEF_PRIORITY - sc->priority))) + if (nr_writeback && nr_writeback == nr_taken) zone_set_flag(zone, ZONE_WRITEBACK); /* @@ -2382,12 +2368,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, struct zone *zone; unsigned long writeback_threshold; bool aborted_reclaim; + int min_scan_priority = 1; delayacct_freepages_start(); if (global_reclaim(sc)) count_vm_event(ALLOCSTALL); +rescan: do { vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup, sc->priority); @@ -2442,7 +2430,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, WB_REASON_TRY_TO_FREE_PAGES); sc->may_writepage = 1; } - } while (--sc->priority >= 0); + } while (--sc->priority >= min_scan_priority); out: delayacct_freepages_end(); @@ -2466,6 +2454,12 @@ out: if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc)) return 1; + /* If the page allocator is going to consider OOM, rescan at priority 0 */ + if (min_scan_priority) { + min_scan_priority = 0; + goto rescan; + } + return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework [not found] ` <20130620111206.GA14809-l3A5Bk7waGM@public.gmane.org> @ 2013-06-21 14:06 ` Michal Hocko [not found] ` <20130621140627.GI12424-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2013-06-25 15:49 ` Michal Hocko 1 sibling, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-06-21 14:06 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Thu 20-06-13 12:12:06, Mel Gorman wrote: > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > base is mmotm-2013-05-09-15-57 > > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > > without slab shrinkers patchset. > > reworkrebase all patches 8 applied on top of baserebase > > > > * No-limit > > User > > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > > System > > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > > Elapsed > > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > > > Elapsed time regressed by 13% wrt. base but it seems that this came from > > baserebase which regressed by the same amount. > > > > boo-urns > > > Page fault statistics tell us at least part of the story: > > Minor > > base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6 > > baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6 > > reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6 > > Major > > base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6 > > baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6 > > reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6 > > Can you try this monolithic patch please? Wow, this looks much better! 
* 0-limit User base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6 baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6 mel: min: 993.25 [83.6%] max: 997.40 [83.2%] avg: 995.81 [83.4%] std: 1.43 runs: 6 System base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6 baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6 mel: min: 145.36 [58.5%] max: 148.69 [59.0%] avg: 147.66 [59.0%] std: 1.17 runs: 6 Elapsed base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6 baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6 mel: min: 367.99 [48.5%] max: 381.67 [47.4%] avg: 371.84 [47.4%] std: 4.72 runs: 6 * no-limit User base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 mel: min: 993.46 [85.3%] max: 995.07 [85.1%] avg: 994.26 [85.1%] std: 0.56 runs: 6 System base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 mel: min: 148.55 [61.2%] max: 151.80 [61.9%] avg: 149.80 [61.4%] std: 1.07 runs: 6 Elapsed base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 mel: min: 366.85 [61.5%] max: 375.98 [60.6%] avg: 372.25 [61.5%] std: 3.18 runs: 6 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework [not found] ` <20130621140627.GI12424-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-06-21 14:09 ` Michal Hocko 2013-06-21 15:04 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-06-21 14:09 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri 21-06-13 16:06:27, Michal Hocko wrote: [...] > > Can you try this monolithic patch please? > > Wow, this looks much better! Damn it! Scratch that. I have made a mistake in configuration so this all has been 0-no-limit in fact. Sorry about that. It's only now that I've noticed that so I am retesting. Hopefully it will be done before I leave today. I will post it on Monday otherwise. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-06-21 14:09 ` Michal Hocko @ 2013-06-21 15:04 ` Michal Hocko [not found] ` <20130621150430.GL12424-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-06-21 15:04 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri 21-06-13 16:09:38, Michal Hocko wrote: > On Fri 21-06-13 16:06:27, Michal Hocko wrote: > [...] > > > Can you try this monolithic patch please? > > > > Wow, this looks much better! > > Damn it! Scratch that. I have made a mistake in configuration so this > all has been 0-no-limit in fact. Sorry about that. It's only now that > I've noticed that so I am retesting. Hopefully it will be done before I > leave today. I will post it on Monday otherwise. And for real now and it is still Wow! * 0-limit er base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6 baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6 mel: min: 989.99 [83.3%] max: 993.35 [82.9%] avg: 991.71 [83.1%] std: 1.18 runs: 6 System base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6 baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6 mel: min: 144.75 [58.3%] max: 148.27 [58.8%] avg: 146.97 [58.7%] std: 1.41 runs: 6 Elapsed base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6 baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6 mel: min: 365.99 [48.2%] max: 376.12 [46.7%] avg: 369.82 [47.1%] std: 4.04 runs: 6 * no-limit User base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 mel: min: 990.96 [85.1%] max: 994.18 [85.0%] avg: 992.79 [85.0%] std: 1.15 runs: 6 System base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 mel: min: 148.40 [61.2%] max: 150.77 [61.4%] avg: 149.33 [61.2%] std: 0.97 runs: 6 Elapsed base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 mel: min: 365.27 [61.2%] max: 371.40 [59.9%] avg: 369.43 [61.0%] std: 2.26 runs: 6 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework [not found] ` <20130621150430.GL12424-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-06-21 15:09 ` Michal Hocko 2013-06-21 16:34 ` Tejun Heo 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-06-21 15:09 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri 21-06-13 17:04:30, Michal Hocko wrote: > On Fri 21-06-13 16:09:38, Michal Hocko wrote: > > On Fri 21-06-13 16:06:27, Michal Hocko wrote: > > [...] > > > > Can you try this monolithic patch please? > > > > > > Wow, this looks much better! > > > > Damn it! Scratch that. I have made a mistake in configuration so this > > all has been 0-no-limit in fact. Sorry about that. It's only now that > > I've noticed that so I am retesting. Hopefully it will be done before I > > leave today. I will post it on Monday otherwise. > > And for real now and it is still Wow! And I am total idiot. The machine was not booted with mem=1G so the figures are completely useless. It is soooo Friday. I will start everything again on Monday with a clean head. Sorry about all the noise. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-06-21 15:09 ` Michal Hocko @ 2013-06-21 16:34 ` Tejun Heo 0 siblings, 0 replies; 30+ messages in thread From: Tejun Heo @ 2013-06-21 16:34 UTC (permalink / raw) To: Michal Hocko Cc: Mel Gorman, Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Balbir Singh, Glauber Costa On Fri, Jun 21, 2013 at 05:09:06PM +0200, Michal Hocko wrote: > And I am total idiot. The machine was not booted with mem=1G so the > figures are completely useless. > > It is soooo Friday. I will start everything again on Monday with a clean > head. > > Sorry about all the noise. Oh, don't be. It was the most fun thread of the whole week. Enjoy the weekend. :) -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework [not found] ` <20130620111206.GA14809-l3A5Bk7waGM@public.gmane.org> 2013-06-21 14:06 ` Michal Hocko @ 2013-06-25 15:49 ` Michal Hocko 1 sibling, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-25 15:49 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Thu 20-06-13 12:12:06, Mel Gorman wrote: > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > base is mmotm-2013-05-09-15-57 > > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > > without slab shrinkers patchset. > > reworkrebase all patches 8 applied on top of baserebase > > > > * No-limit > > User > > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > > System > > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > > Elapsed > > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > > > Elapsed time regressed by 13% wrt. base but it seems that this came from > > baserebase which regressed by the same amount. > > > > boo-urns > > > Page fault statistics tell us at least part of the story: > > Minor > > base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6 > > baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6 > > reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6 > > Major > > base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6 > > baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6 > > reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6 > > Can you try this monolithic patch please? Now that I've tested the patch properly, finally, I have to say that the situation has improved. 0-limit case has a bigger variation with 15% regression in the worst case and down to 4% regression in the best case (which gets us to 7% on average). no-limit configuration behaves better. The worst case is not so bad and we are down to 3% regression in average. 
* 0-limit
User
base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6
baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6
mel: min: 1183.62 [99.6%] max: 1208.19 [100.8%] avg: 1194.00 [100.0%] std: 8.71 runs: 6
System
base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6
baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6
mel: min: 249.41 [100.4%] max: 253.34 [100.5%] avg: 251.28 [100.4%] std: 1.20 runs: 6
Elapsed
base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6
baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6
mel: min: 789.62 [104.0%] max: 928.68 [115.3%] avg: 845.62 [107.7%] std: 59.31 runs: 6

* no-limit
User
base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6
baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6
mel: min: 1174.58 [100.8%] max: 1176.46 [100.6%] avg: 1175.53 [100.6%] std: 0.59 runs: 6
System
base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6
baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6
mel: min: 244.79 [100.9%] max: 246.86 [100.6%] avg: 246.30 [101.0%] std: 0.76 runs: 6
Elapsed
base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6
baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6
mel: min: 618.77 [103.7%] max: 637.84 [102.9%] avg: 627.92 [103.7%] std: 6.64 runs: 6
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 30+ messages in thread
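For readers following the tables: the bracketed percentages read as each kernel's figure relative to the corresponding base figure (e.g. the no-limit Elapsed avg of 627.92 for "mel" against the base avg of 605.52 gives the quoted 103.7%). The small program below is only an illustration of that arithmetic, with made-up run values; it is not the /usr/bin/time based harness that produced the numbers, it prints the percentage for the avg column only, and whether std was computed over n or n-1 runs is a guess.

/*
 * Illustration only: min/max/avg/std summaries plus the "[x%]"
 * figure relative to the base kernel, in the style of the tables
 * above.  Run values are invented.
 */
#include <math.h>
#include <stdio.h>

static double average(const double *v, int n)
{
	double sum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / n;
}

static void summarize(const char *name, const double *v, int n, double base_avg)
{
	double min = v[0], max = v[0], avg = average(v, n), var = 0.0;
	int i;

	for (i = 0; i < n; i++) {
		if (v[i] < min) min = v[i];
		if (v[i] > max) max = v[i];
		var += (v[i] - avg) * (v[i] - avg);
	}
	printf("%-14s min: %.2f max: %.2f avg: %.2f", name, min, max, avg);
	if (base_avg > 0.0)
		printf(" [%.1f%%]", 100.0 * avg / base_avg);
	/* population standard deviation; the harness may have used n-1 */
	printf(" std: %.2f runs: %d\n", sqrt(var / n), n);
}

int main(void)
{
	double base[]   = { 596.81, 600.10, 603.40, 606.00, 610.80, 620.04 };
	double rework[] = { 664.05, 672.30, 688.90, 693.40, 696.00, 701.06 };

	summarize("base:", base, 6, 0.0);
	summarize("reworkrebase:", rework, 6, average(base, 6));
	return 0;
}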
* Re: [PATCH v5] Soft limit rework [not found] ` <1371557387-22434-1-git-send-email-mhocko-AlSwsSmVLrQ@public.gmane.org> @ 2013-08-19 16:35 ` Johannes Weiner [not found] ` <20130819163512.GB712-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 0 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-08-19 16:35 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > Hi, > > This is the fifth version of the patchset. > > Summary of versions: > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > (lkml wasn't CCed at the time so I cannot find it in lwn.net > archives). There were no major objections. Except there are. > My primary test case was a parallel kernel build with 2 groups (make > is running with -j4 with a distribution .config in a separate cgroup > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > was mostly interested in 2 setups. Default - no soft limit set and - and > 0 soft limit set to both groups. > The first one should tell us whether the rework regresses the default > behavior while the second one should show us improvements in an extreme > case where both workloads are always over the soft limit. Two kernel builds with 1G of memory means that reclaim is purely trimming the cache every once in a while. Changes in memory pressure are not measurable up to a certain point, because whether you trim old cache or not does not affect the build jobs. Also you tested the no-softlimit case and an extreme soft limit case. Where are the common soft limit cases? > /usr/bin/time -v has been used to collect the statistics and each > configuration had 3 runs after fresh boot without any other load on the > system. > > base is mmotm-2013-05-09-15-57 > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > without slab shrinkers patchset. > reworkrebase all patches 8 applied on top of baserebase > > * No-limit > User > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > System > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > Elapsed > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > Elapsed time regressed by 13% wrt. base but it seems that this came from > baserebase which regressed by the same amount. 
> > * 0-limit > User > base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6 > baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6 > reworkrebase: min: 1169.88 [98.5%] max: 1177.84 [98.3%] avg: 1173.50 [98.3%] std: 2.79 runs: 6 > System > base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6 > baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6 > reworkrebase: min: 235.19 [94.7%] max: 237.43 [94.2%] avg: 236.35 [94.5%] std: 0.86 runs: 6 > Elapsed > base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6 > baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6 > reworkrebase: min: 667.54 [87.9%] max: 718.54 [89.2%] avg: 695.61 [88.6%] std: 17.16 runs: 6 You set one group unlimited and one group to limit 0. I would expect all memory pressure that occurs to be applied to the 0-limit group. Is this happening? If so, wouldn't it be expected that its working set is thrashed? It could be that the patched kernel just shifts some of the pressure to the No-limit group, which would arguably be a regression. But as I said, this is not measurable in realtime. > While the minor faults are within the noise the major faults are reduced > considerably. This looks like an aggressive pageout during the reclaim > and that pageout affects the working set presumably. Please note that > baserebase has even hight number of major page faults than the older > mmotm trree. > > While this looks as a nice win it is fair to say that there are some > workloads that actually benefit from reclaim at 0 priority (from > background reclaim). E.g. an aggressive streaming IO would like to get > rid of as many pages as possible and do not block on the pages under > writeback. This can lead to a higher System time but I generally got > Elapsed which was comparable. > > The following results are from 2 groups configuration on a 8GB machine > (A running stream IO with 4*TotalMem with 0 soft limit, B runnning a > mem_eater which consumes TotalMem-1G without any limit). > System > base: min: 124.88 max: 136.97 avg: 130.77 std: 4.94 runs: 3 > baserebase: min: 102.51 [82.1%] max: 108.84 [79.5%] avg: 104.81 [80.1%] std: 2.86 runs: 3 > reworkrebase: min: 108.29 [86.7%] max: 121.70 [88.9%] avg: 114.60 [87.6%] std: 5.50 runs: 3 > Elapsed > base: min: 398.86 max: 412.81 avg: 407.62 std: 6.23 runs: 3 > baserebase: min: 480.92 [120.6%] max: 497.56 [120.5%] avg: 491.46 [120.6%] std: 7.48 runs: 3 > reworkrebase: min: 397.19 [99.6%] max: 462.57 [112.1%] avg: 436.13 [107.0%] std: 28.12 runs: 3 > > baserebase regresses again by 20% and the series is worse by 7% but it > is still at 89% wrt baserebase so it looks good to me. > > So to wrap this up. The series is still doing good and improves the soft > limit. The soft limit tree is a bunch of isolated code that's completely straight-forward. This is replaced by convoluted memcg iterators, convoluted lruvec shrinkers, spreading even more memcg callbacks with questionable semantics into already complicated generic reclaim code. This series considerably worsens readability and maintainability of both the generic reclaim code as well as the memcg counterpart of it. The point of naturalizing the memcg code is to reduce data structures and redundancy and to break open opaque interfaces like "do soft reclaim and report back". But you didn't actually reduce complexity, you added even more opaque callbacks (should_soft_reclaim? soft_reclaim_eligible?). 
You didn't integrate soft limit into generic reclaim code, you just made the soft limit API more complicated. And, as I mentioned repeatedly in previous submissions, your benchmark numbers don't actually say anything useful about this change. I'm against merging this upstream at this point. ^ permalink raw reply [flat|nested] 30+ messages in thread
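The minor/major page fault figures quoted in this exchange come from /usr/bin/time -v, as the original posting notes. For anyone wanting to reproduce just that part of the measurement, getrusage(2) exposes the same counters; the sketch below is a rough equivalent with minimal error handling, not the tool actually used here.

/*
 * Rough equivalent of the page fault counters reported by
 * /usr/bin/time -v: run a command and print its minor/major faults.
 */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct rusage ru;
	int status;
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}

	pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {
		execvp(argv[1], &argv[1]);
		perror("execvp");
		_exit(127);
	}

	if (waitpid(pid, &status, 0) < 0 ||
	    getrusage(RUSAGE_CHILDREN, &ru) < 0) {
		perror("waitpid/getrusage");
		return 1;
	}

	/* minor: serviced without I/O; major: required reading from disk */
	printf("minor faults: %ld\n", ru.ru_minflt);
	printf("major faults: %ld\n", ru.ru_majflt);
	return 0;
}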
* Re: [PATCH v5] Soft limit rework [not found] ` <20130819163512.GB712-druUgvl0LCNAfugRpC6u6w@public.gmane.org> @ 2013-08-20 9:14 ` Michal Hocko 2013-08-20 14:13 ` Johannes Weiner 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-08-20 9:14 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > Hi, > > > > This is the fifth version of the patchset. > > > > Summary of versions: > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > archives). There were no major objections. > > Except there are. Good to know that late... It would have been much more helpful to have such a principal feedback few months ago (this work is here since early Jun). > > My primary test case was a parallel kernel build with 2 groups (make > > is running with -j4 with a distribution .config in a separate cgroup > > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > > was mostly interested in 2 setups. Default - no soft limit set and - and > > 0 soft limit set to both groups. > > The first one should tell us whether the rework regresses the default > > behavior while the second one should show us improvements in an extreme > > case where both workloads are always over the soft limit. > > Two kernel builds with 1G of memory means that reclaim is purely > trimming the cache every once in a while. Changes in memory pressure > are not measurable up to a certain point, because whether you trim old > cache or not does not affect the build jobs. > > Also you tested the no-softlimit case and an extreme soft limit case. > Where are the common soft limit cases? v5.1 had some more tests. I have added soft limitted stream IO resp. kbuild vs unlimitted mem_eater loads. Have you checked those? [...] > > So to wrap this up. The series is still doing good and improves the soft > > limit. > > The soft limit tree is a bunch of isolated code that's completely > straight-forward. This is replaced by convoluted memcg iterators, > convoluted lruvec shrinkers, spreading even more memcg callbacks with > questionable semantics into already complicated generic reclaim code. I was trying to keep the convolution into vmscan as small as possible. Maybe it can get reduced even more. I will think about it. Predicate for memcg iterator has been added to address your concern about a potential regression with too many groups. And that looked like the least convoluting solution. > This series considerably worsens readability and maintainability of > both the generic reclaim code as well as the memcg counterpart of it. I am really surprised that you are coming with this concerns that late. This code has been posted quite some ago, hasn't it? We have even had that "calm" discussion with Tejun about predicates and you were silent at the time. > The point of naturalizing the memcg code is to reduce data structures > and redundancy and to break open opaque interfaces like "do soft > reclaim and report back". But you didn't actually reduce complexity, > you added even more opaque callbacks (should_soft_reclaim? > soft_reclaim_eligible?). 
You didn't integrate soft limit into generic > reclaim code, you just made the soft limit API more complicated. I can certainly think about simplifications. But it would be nicer if you were more specific on the "more complicated" part. The soft reclaim is a natural part of the reclaim now. Which I find as an improvement. "Do some memcg magic and get back was" a bad idea IMO. Hiding the soft limit decisions into the iterators as a searching criteria doesn't sound as a totally bad idea to me. Soft limit is an additional criteria who to reclaim, isn't it? Well, I could have open coded it but that would mean a more code into vmscan or getting back to "call some memcg magic and get back to me". > And, as I mentioned repeatedly in previous submissions, your benchmark > numbers don't actually say anything useful about this change. I would really welcome suggestions for improvements. I have tried "The most interesting test case would be how it behaves if some groups are over the soft limits while others are not." with v5.1 where I had memeater unlimited and kbuild resp. stream IO being limited. > I'm against merging this upstream at this point. Can we at least find some middle ground here? The way how the current soft limit is done is a disaster. Ditching the whole series sounds like a step back to me. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
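To make the predicate discussion above more concrete, here is a toy userspace model of the kind of walk being argued about: a per-group predicate decides whether to reclaim the group, descend past it, or skip its whole subtree (the SKIP_TREE case). Every type, name and eligibility rule below is a simplified illustration; the real iterator and soft_reclaim_eligible() logic live in mm/memcontrol.c and mm/vmscan.c and are more involved.

/*
 * Toy model of a predicate-driven hierarchy walk.  A group is
 * reclaimed if it (or an ancestor) exceeds its soft limit; otherwise
 * we descend only where some descendant is in excess and skip whole
 * subtrees with nothing to do.
 */
#include <stdbool.h>
#include <stdio.h>

enum visit { VISIT, SKIP, SKIP_TREE };

struct group {
	const char *name;
	long usage;
	long soft_limit;
	int excess_below;		/* descendants over their soft limit */
	struct group *children[4];
};

static enum visit predicate(const struct group *g, bool ancestor_over)
{
	if (ancestor_over || g->usage > g->soft_limit)
		return VISIT;
	return g->excess_below ? SKIP : SKIP_TREE;
}

static void walk(struct group *g, bool ancestor_over)
{
	enum visit v = predicate(g, ancestor_over);
	int i;

	if (v == SKIP_TREE) {
		printf("skip whole subtree under %s\n", g->name);
		return;
	}
	if (v == VISIT)
		printf("soft reclaim %s\n", g->name);

	for (i = 0; i < 4 && g->children[i]; i++)
		walk(g->children[i], v == VISIT);
}

int main(void)
{
	struct group a1 = { "A/a1", 80, 50, 0, { 0 } };	/* in excess */
	struct group a2 = { "A/a2", 10, 50, 0, { 0 } };
	struct group a  = { "A", 120, 200, 1, { &a1, &a2, 0 } };
	struct group b  = { "B", 300, 1000, 0, { 0 } };	/* nothing in excess */

	walk(&a, false);
	walk(&b, false);
	return 0;
}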
* Re: [PATCH v5] Soft limit rework 2013-08-20 9:14 ` Michal Hocko @ 2013-08-20 14:13 ` Johannes Weiner 2013-08-22 10:58 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-08-20 14:13 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue, Aug 20, 2013 at 11:14:14AM +0200, Michal Hocko wrote: > On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > > Hi, > > > > > > This is the fifth version of the patchset. > > > > > > Summary of versions: > > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > > archives). There were no major objections. > > > > Except there are. > > Good to know that late... It would have been much more helpful to have > such a principal feedback few months ago (this work is here since early > Jun). I NAKed this "integrate into reclaim but still make soft reclaim an extra pass" idea the first time Ying submitted it. And no, the soft limit discussions have been going on since I first submitted a rewrite as part of the reclaim integration series in mid-2011. We have been going back and forth on whether soft limits should behave as guarantees and *protect* from reclaim. But I thought we had since agreed that implementing guarantees should be a separate effort and as far as I'm concerned there is no justification for all this added code just to prevent reclaiming non-excess when there is excess around. > > > My primary test case was a parallel kernel build with 2 groups (make > > > is running with -j4 with a distribution .config in a separate cgroup > > > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > > > was mostly interested in 2 setups. Default - no soft limit set and - and > > > 0 soft limit set to both groups. > > > The first one should tell us whether the rework regresses the default > > > behavior while the second one should show us improvements in an extreme > > > case where both workloads are always over the soft limit. > > > > Two kernel builds with 1G of memory means that reclaim is purely > > trimming the cache every once in a while. Changes in memory pressure > > are not measurable up to a certain point, because whether you trim old > > cache or not does not affect the build jobs. > > > > Also you tested the no-softlimit case and an extreme soft limit case. > > Where are the common soft limit cases? > > v5.1 had some more tests. I have added soft limitted stream IO resp. kbuild vs > unlimitted mem_eater loads. Have you checked those? Another 0-limit test? If you declare every single page a cgroup has as being in excess of what it should have, how does reclaiming them aggressively constitute bad behavior? > [...] > > > So to wrap this up. The series is still doing good and improves the soft > > > limit. > > > > The soft limit tree is a bunch of isolated code that's completely > > straight-forward. This is replaced by convoluted memcg iterators, > > convoluted lruvec shrinkers, spreading even more memcg callbacks with > > questionable semantics into already complicated generic reclaim code. > > I was trying to keep the convolution into vmscan as small as possible. > Maybe it can get reduced even more. I will think about it. 
> > Predicate for memcg iterator has been added to address your concern > about a potential regression with too many groups. And that looked like > the least convoluting solution. I suggested improving the 1st pass by propagating soft limit excess through the res counters. They are the hierarchical book keepers. I even send you patches to do this. I don't understand why you propagate soft limit excess in struct mem_cgroup and use this ridiculous check_events thing to update them. We already have the per-cpu charge caches (stock) and uncharge caches to batch res_counter updates. Anyway, this would be a replacement for the current tree structure to quickly find memcgs in excess. It has nothing to do with integrating it into vmscan.c. > > This series considerably worsens readability and maintainability of > > both the generic reclaim code as well as the memcg counterpart of it. > > I am really surprised that you are coming with this concerns that late. > This code has been posted quite some ago, hasn't it? We have even had > that "calm" discussion with Tejun about predicates and you were silent > at the time. I apologize that I was not more responsive in previous submissions, this is mainly because it was hard to context switch while I was focussed on the allocator fairness / thrash detection patches. But you are surprised that I still object to something that I have been objecting to from day one? What exactly has changed? > > The point of naturalizing the memcg code is to reduce data structures > > and redundancy and to break open opaque interfaces like "do soft > > reclaim and report back". But you didn't actually reduce complexity, > > you added even more opaque callbacks (should_soft_reclaim? > > soft_reclaim_eligible?). You didn't integrate soft limit into generic > > reclaim code, you just made the soft limit API more complicated. > > I can certainly think about simplifications. But it would be nicer if > you were more specific on the "more complicated" part. The soft reclaim > is a natural part of the reclaim now. Which I find as an improvement. > "Do some memcg magic and get back was" a bad idea IMO. It depends. Doing memcg reclaim in generic code made sense because the global data structure went and the generic code absolutely had to know about memcg iteration. Soft limit reclaim is something else. If it were just a modifier of the existing memcg reclaim loop, I would still put it in vmscan.c for simplicity reasons. But your separate soft limit reclaim pass adds quite some complication to the generic code and there is no good reason for it. The naming does not help to make it very natural to vmscan.c, either. __shrink_zone() should be shrink_lruvecs(), mem_cgroup_soft_reclaim_eligible() should be mem_cgroup_soft_limit_exceeded() or something like that. SKIP_TREE should not be a kernel-wide name for anything. Also note that we have since moved on to make lruvec the shared structure and try to steer away from using memcgs outside of memcontrol.c, so I'm not really fond of tying more generic code to struct mem_cgroup. > Hiding the soft limit decisions into the iterators as a searching > criteria doesn't sound as a totally bad idea to me. Soft limit is an > additional criteria who to reclaim, isn't it? Arguably it's a criteria of /how/ to reclaim any given memcg. Making it about pre-selecting memcgs is a totally different beast. > Well, I could have open coded it but that would mean a more code into > vmscan or getting back to "call some memcg magic and get back to me". 
Either way sounds preferable to having vmscan.c do a lot of things it does not understand. > > And, as I mentioned repeatedly in previous submissions, your benchmark > > numbers don't actually say anything useful about this change. > > I would really welcome suggestions for improvements. I have tried "The > most interesting test case would be how it behaves if some groups are > over the soft limits while others are not." with v5.1 where I had > memeater unlimited and kbuild resp. stream IO being limited. The workload seems better but the limit configuration is still very dubious. I just don't understand what your benchmarks are trying to express. If you set a soft limit of 0 you declare all memory is optional but then you go ahead and complain that the workload in there is thrashing. WTF? Kernelbuilds are not really good candidates for soft limit usage in the first place, because they have a small set of heavily used pages and a lot of used-once cache. The core workingset is anything but optional and you could just set the hard limit to clip all that needless cache because it does not change performance whether it's available or not. Same exact thing goes for IO streamers and their subset of dirty/writeback pages. > > I'm against merging this upstream at this point. > Can we at least find some middle ground here? The way how the current > soft limit is done is a disaster. Ditching the whole series sounds like > a step back to me. I'm worried about memcg madness spreading. Memcg code is nowhere near kernel core code standards and I'm really reluctant to have it spread into something as delicate as vmscan.c for no good reason at all. If you claim the soft limit implementation is a disaster, I would have expected a real life workload to back this up but all I've seen are synthetic tests with no apparent significance to reality. So, no, I don't buy the "let's just do SOMETHING here" argument. Let's figure out what the hell is wrong, back it up, find a solution and verify it against the test case. Memcg/cgroup code is full of overengineered solutions to problems that don't exist. It's insane and it has to stop. ^ permalink raw reply [flat|nested] 30+ messages in thread
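For what it is worth, the res_counter suggestion referenced above boils down to this: charges already propagate up the parent chain, so soft-limit excess can be derived (or cached) at every level from the counters themselves instead of being tracked in a separate tree. The following is a simplified userspace model with made-up types, not the kernel's res_counter API.

/*
 * Simplified model of hierarchical charging: usage propagates to all
 * ancestors, so soft-limit excess can be computed at any level from
 * the counters alone.
 */
#include <stdio.h>

struct counter {
	const char *name;
	long usage;
	long soft_limit;
	struct counter *parent;
};

/* charge propagates up the chain, as hierarchical charging does */
static void charge(struct counter *c, long nr_bytes)
{
	for (; c; c = c->parent)
		c->usage += nr_bytes;
}

/* how far over its soft limit a group is; 0 if below */
static long soft_limit_excess(const struct counter *c)
{
	return c->usage > c->soft_limit ? c->usage - c->soft_limit : 0;
}

int main(void)
{
	struct counter root = { "root", 0, 1 << 30, NULL };
	struct counter a    = { "A",    0, 64 << 20, &root };
	struct counter a1   = { "A/a1", 0, 16 << 20, &a };

	charge(&a1, 100 << 20);	/* 100M charged to the leaf */

	printf("%-5s excess: %ld MB\n", a1.name, soft_limit_excess(&a1) >> 20);
	printf("%-5s excess: %ld MB\n", a.name, soft_limit_excess(&a) >> 20);
	printf("%-5s excess: %ld MB\n", root.name, soft_limit_excess(&root) >> 20);
	return 0;
}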
* Re: [PATCH v5] Soft limit rework 2013-08-20 14:13 ` Johannes Weiner @ 2013-08-22 10:58 ` Michal Hocko 2013-09-03 16:15 ` Johannes Weiner 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-08-22 10:58 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa [I am mostly offline for the whole week with very limitted internet access so it will get longer for me to respond to emails. Sorry about that] On Tue 20-08-13 10:13:39, Johannes Weiner wrote: > On Tue, Aug 20, 2013 at 11:14:14AM +0200, Michal Hocko wrote: > > On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > > > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > > > Hi, > > > > > > > > This is the fifth version of the patchset. > > > > > > > > Summary of versions: > > > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > > > archives). There were no major objections. > > > > > > Except there are. > > > > Good to know that late... It would have been much more helpful to have > > such a principal feedback few months ago (this work is here since early > > Jun). > > I NAKed this "integrate into reclaim but still make soft reclaim an > extra pass" idea the first time Ying submitted it. There was a general agreement about this approach at LSF AFAIR. So it is really surprising hearing that from you _now_ I do not remember the details of your previous NAK but I do remember you liked the series I posted before the conference. I am willing to change details (e.g. move big vmscan part into memcg) but I do insist on having soft reclaim integrated to the regular reclaim. > And no, the soft limit discussions have been going on since I first > submitted a rewrite as part of the reclaim integration series in > mid-2011. Which makes the whole story even more sad :/ > We have been going back and forth on whether soft limits should behave > as guarantees and *protect* from reclaim. But I thought we had since > agreed that implementing guarantees should be a separate effort and as > far as I'm concerned there is no justification for all this added code > just to prevent reclaiming non-excess when there is excess around. The patchset is absolutely not about guarantees anymore. It is more about making the current soft limit reclaim more natural. E.g. not doing prio-0 scans. The side effect that the isolation works better in the result is just a bonus. And I have to admit I like that bonus. > > > > My primary test case was a parallel kernel build with 2 groups (make > > > > is running with -j4 with a distribution .config in a separate cgroup > > > > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > > > > was mostly interested in 2 setups. Default - no soft limit set and - and > > > > 0 soft limit set to both groups. > > > > The first one should tell us whether the rework regresses the default > > > > behavior while the second one should show us improvements in an extreme > > > > case where both workloads are always over the soft limit. > > > > > > Two kernel builds with 1G of memory means that reclaim is purely > > > trimming the cache every once in a while. Changes in memory pressure > > > are not measurable up to a certain point, because whether you trim old > > > cache or not does not affect the build jobs. 
> > > > > > Also you tested the no-softlimit case and an extreme soft limit case. > > > Where are the common soft limit cases? > > > > v5.1 had some more tests. I have added soft limitted stream IO resp. kbuild vs > > unlimitted mem_eater loads. Have you checked those? > > Another 0-limit test? If you declare every single page a cgroup has > as being in excess of what it should have, how does reclaiming them > aggressively constitute bad behavior? My testing simply tries to compare the behavior before and after the patchset. And points out that the current soft reclaim is really aggressive and it doesn't have to act like that. > > [...] > > > > So to wrap this up. The series is still doing good and improves the soft > > > > limit. > > > > > > The soft limit tree is a bunch of isolated code that's completely > > > straight-forward. This is replaced by convoluted memcg iterators, > > > convoluted lruvec shrinkers, spreading even more memcg callbacks with > > > questionable semantics into already complicated generic reclaim code. > > > > I was trying to keep the convolution into vmscan as small as possible. > > Maybe it can get reduced even more. I will think about it. > > > > Predicate for memcg iterator has been added to address your concern > > about a potential regression with too many groups. And that looked like > > the least convoluting solution. > > I suggested improving the 1st pass by propagating soft limit excess > through the res counters. They are the hierarchical book keepers. I > even send you patches to do this. I don't understand why you > propagate soft limit excess in struct mem_cgroup and use this > ridiculous check_events thing to update them. We already have the > per-cpu charge caches (stock) and uncharge caches to batch res_counter > updates. > > Anyway, this would be a replacement for the current tree structure to > quickly find memcgs in excess. Yes and I do not insist on having this per-memcg counter. I took this way to keep this memcg specific thing withing memcg code and do not pollute generic cgroup code which doesn't care about this property at all. > It has nothing to do with integrating it into vmscan.c. agreed > > > This series considerably worsens readability and maintainability of > > > both the generic reclaim code as well as the memcg counterpart of it. > > > > I am really surprised that you are coming with this concerns that late. > > This code has been posted quite some ago, hasn't it? We have even had > > that "calm" discussion with Tejun about predicates and you were silent > > at the time. > > I apologize that I was not more responsive in previous submissions, > this is mainly because it was hard to context switch while I was > focussed on the allocator fairness / thrash detection patches. Come on, Johannes! We were sitting in the same room at LSF when there was a general agreement about the patchset because "It is a general improvement and further changes might happen on top of it". You didn't say a word you didn't like it and consider the integration as a bad idea. > But you are surprised that I still object to something that I have > been objecting to from day one? What exactly has changed? > > > > The point of naturalizing the memcg code is to reduce data structures > > > and redundancy and to break open opaque interfaces like "do soft > > > reclaim and report back". But you didn't actually reduce complexity, > > > you added even more opaque callbacks (should_soft_reclaim? > > > soft_reclaim_eligible?). 
You didn't integrate soft limit into generic > > > reclaim code, you just made the soft limit API more complicated. > > > > I can certainly think about simplifications. But it would be nicer if > > you were more specific on the "more complicated" part. The soft reclaim > > is a natural part of the reclaim now. Which I find as an improvement. > > "Do some memcg magic and get back was" a bad idea IMO. > > It depends. Doing memcg reclaim in generic code made sense because > the global data structure went and the generic code absolutely had to > know about memcg iteration. > > Soft limit reclaim is something else. If it were just a modifier of > the existing memcg reclaim loop, I would still put it in vmscan.c for > simplicity reasons. > > But your separate soft limit reclaim pass adds quite some complication > to the generic code and there is no good reason for it. Does it? The whole decision is hidden within the predicate so the main loop doesn't have to care at all. If we ever go with another limit for the guarantee it would simply use a different predicate to select the right group. > The naming does not help to make it very natural to vmscan.c, either. > __shrink_zone() should be shrink_lruvecs(), > mem_cgroup_soft_reclaim_eligible() should be > mem_cgroup_soft_limit_exceeded() or something like that. I am certainly open to change naming. > SKIP_TREE should not be a kernel-wide name for anything. That would require to move the looping code inside memcg. It is question whether that would help readability. I do not insist on doing the memcg loop inside vmscan, though. > Also note that we have > since moved on to make lruvec the shared structure and try to steer > away from using memcgs outside of memcontrol.c, so I'm not really fond > of tying more generic code to struct mem_cgroup. > > > Hiding the soft limit decisions into the iterators as a searching > > criteria doesn't sound as a totally bad idea to me. Soft limit is an > > additional criteria who to reclaim, isn't it? > > Arguably it's a criteria of /how/ to reclaim any given memcg. Making > it about pre-selecting memcgs is a totally different beast. > > > Well, I could have open coded it but that would mean a more code into > > vmscan or getting back to "call some memcg magic and get back to me". > > Either way sounds preferrable to having vmscan.c do a lot of things it > does not understand. > > > > And, as I mentioned repeatedly in previous submissions, your benchmark > > > numbers don't actually say anything useful about this change. > > > > I would really welcome suggestions for improvements. I have tried "The > > most interesting test case would be how it behaves if some groups are > > over the soft limits while others are not." with v5.1 where I had > > memeater unlimited and kbuild resp. stream IO being limited. > > The workload seems better but the limit configuration is still very > dubious. > > I just don't understand what your benchmarks are trying to express. > If you set a soft limit of 0 you declare all memory is optional but > then you go ahead and complain that the workload in there is > thrashing. WTF? As I've already said the primary motivation was to check before and after state and as you can see we can do better... Although you might find soft_limit=0 setting artificial this setting is the easiest way to tell that such a group is the number one candidate for reclaim. We have users who would like to do backups or some temporary actions to not disturb the main workload. So stream-io resp. kbuild vs. 
mem_eater is simulating such a use case. > Kernelbuilds are not really good candidates for soft limit usage in > the first place, because they have a small set of heavily used pages > and a lot of used-once cache. It still produces sufficient memory pressure to disturb other groups. > The core workingset is anything but > optional and you could just set the hard limit to clip all that > needless cache because it does not change performance whether it's > available or not. Same exact thing goes for IO streamers and their > subset of dirty/writeback pages. > > > I'm against merging this upstream at this point. > > Can we at least find some middle ground here? The way how the current > > soft limit is done is a disaster. Ditching the whole series sounds like > > a step back to me. > I'm worried about memcg madness spreading. Memcg code is nowhere near > kernel core code standards and I'm really reluctant to have it spread > into something as delicate as vmscan.c for no good reason at all. > If you claim the soft limit implementation is a disaster, I would have > expected a real life workload to back this up but all I've seen are > synthetic tests with no apparent significance to reality. > So, no, I don't buy the "let's just do SOMETHING here" argument. > Let's figure out what the hell is wrong, back it up, find a solution > and verify it against the test case. It is basically the prio-0 thing that is the disaster. And I was quite explicit about that. > Memcg/cgroup code is full of overengineered solutions to problems that > don't exist. It's insane and it has to stop. This code is getting rid of a big part of that over-engineering! I do agree there is a lot of room for improvement but, hey, that is always the case. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
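As a concrete reference for the kind of setup being debated (a soft-limited backup/streamer group next to an unconstrained main workload): the knob involved is memory.soft_limit_in_bytes in the v1 memory controller. The snippet below only writes the cgroupfs files; the mount point, group names and the 256M value are made-up examples, and whether a very low or zero soft limit is a sensible configuration is exactly what is in dispute in this thread.

/*
 * Example configuration: a soft-limited "backup" group next to an
 * unconstrained "main" group.  Assumes the v1 memory controller is
 * mounted at /sys/fs/cgroup/memory.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	char buf[32];

	/* creating a directory in cgroupfs creates the group; EEXIST is fine */
	mkdir("/sys/fs/cgroup/memory/main", 0755);
	mkdir("/sys/fs/cgroup/memory/backup", 0755);

	/* under global pressure the backup group is reclaimed preferentially */
	write_file("/sys/fs/cgroup/memory/backup/memory.soft_limit_in_bytes",
		   "268435456\n");		/* 256M */

	/* move the current task (and its future children) into the group */
	snprintf(buf, sizeof(buf), "%d\n", (int)getpid());
	return write_file("/sys/fs/cgroup/memory/backup/tasks", buf) ? 1 : 0;
}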
* Re: [PATCH v5] Soft limit rework 2013-08-22 10:58 ` Michal Hocko @ 2013-09-03 16:15 ` Johannes Weiner 2013-09-04 16:38 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-09-03 16:15 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Hi Michal, On Thu, Aug 22, 2013 at 12:58:56PM +0200, Michal Hocko wrote: > [I am mostly offline for the whole week with very limitted internet > access so it will get longer for me to respond to emails. Sorry about > that] Same deal for me, just got back. Sorry for the delays. > On Tue 20-08-13 10:13:39, Johannes Weiner wrote: > > On Tue, Aug 20, 2013 at 11:14:14AM +0200, Michal Hocko wrote: > > > On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > > > > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > > > > Hi, > > > > > > > > > > This is the fifth version of the patchset. > > > > > > > > > > Summary of versions: > > > > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > > > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > > > > archives). There were no major objections. > > > > > > > > Except there are. > > > > > > Good to know that late... It would have been much more helpful to have > > > such a principal feedback few months ago (this work is here since early > > > Jun). > > > > I NAKed this "integrate into reclaim but still make soft reclaim an > > extra pass" idea the first time Ying submitted it. > > There was a general agreement about this approach at LSF AFAIR. So it is > really surprising hearing that from you _now_ > > I do not remember the details of your previous NAK but I do remember you > liked the series I posted before the conference. I am willing to change > details (e.g. move big vmscan part into memcg) but I do insist on having > soft reclaim integrated to the regular reclaim. I was okay with the series back at LSF. But this was under the assumption that we are moving towards turning softlimits into guarantees. It was only after LSF that we agreed that conflating these two things is not a good idea and that we are better off implementing guarantees separately. On the contrary, I am surprised that you are still pushing this series in light of the discussions after LSF. Google was a big pusher of the soft limit changes, but their requirements were always hard guarantees. We have since agreed that this should be done separately. What is the use case for soft limits at this point? > > And no, the soft limit discussions have been going on since I first > > submitted a rewrite as part of the reclaim integration series in > > mid-2011. > > Which makes the whole story even more sad :/ > > > We have been going back and forth on whether soft limits should behave > > as guarantees and *protect* from reclaim. But I thought we had since > > agreed that implementing guarantees should be a separate effort and as > > far as I'm concerned there is no justification for all this added code > > just to prevent reclaiming non-excess when there is excess around. > > The patchset is absolutely not about guarantees anymore. It is more > about making the current soft limit reclaim more natural. E.g. not doing > prio-0 scans. With guarantees out of the window there is absolutely no justification for this added level of complexity. The tree code is verbose but dead simple. 
And it's contained. You have not shown that prio-0 scans are a problem. Or even argued why you need new memcg iteration code to fix the priority level. > The side effect that the isolation works better in the result is just a > bonus. And I have to admit I like that bonus. What isolation? > > > > > My primary test case was a parallel kernel build with 2 groups (make > > > > > is running with -j4 with a distribution .config in a separate cgroup > > > > > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > > > > > was mostly interested in 2 setups. Default - no soft limit set and - and > > > > > 0 soft limit set to both groups. > > > > > The first one should tell us whether the rework regresses the default > > > > > behavior while the second one should show us improvements in an extreme > > > > > case where both workloads are always over the soft limit. > > > > > > > > Two kernel builds with 1G of memory means that reclaim is purely > > > > trimming the cache every once in a while. Changes in memory pressure > > > > are not measurable up to a certain point, because whether you trim old > > > > cache or not does not affect the build jobs. > > > > > > > > Also you tested the no-softlimit case and an extreme soft limit case. > > > > Where are the common soft limit cases? > > > > > > v5.1 had some more tests. I have added soft limitted stream IO resp. kbuild vs > > > unlimitted mem_eater loads. Have you checked those? > > > > Another 0-limit test? If you declare every single page a cgroup has > > as being in excess of what it should have, how does reclaiming them > > aggressively constitute bad behavior? > > My testing simply tries to compare the behavior before and after the > patchset. And points out that the current soft reclaim is really > aggressive and it doesn't have to act like that. How can you get a meaningful result out of a meaningless setup? You can't tell if it's apropriately aggressive, not aggressive enough, or too aggressive if your test case does not mean anything. There are no sane expectations that can be met. > > > > > So to wrap this up. The series is still doing good and improves the soft > > > > > limit. > > > > > > > > The soft limit tree is a bunch of isolated code that's completely > > > > straight-forward. This is replaced by convoluted memcg iterators, > > > > convoluted lruvec shrinkers, spreading even more memcg callbacks with > > > > questionable semantics into already complicated generic reclaim code. > > > > > > I was trying to keep the convolution into vmscan as small as possible. > > > Maybe it can get reduced even more. I will think about it. > > > > > > Predicate for memcg iterator has been added to address your concern > > > about a potential regression with too many groups. And that looked like > > > the least convoluting solution. > > > > I suggested improving the 1st pass by propagating soft limit excess > > through the res counters. They are the hierarchical book keepers. I > > even send you patches to do this. I don't understand why you > > propagate soft limit excess in struct mem_cgroup and use this > > ridiculous check_events thing to update them. We already have the > > per-cpu charge caches (stock) and uncharge caches to batch res_counter > > updates. > > > > Anyway, this would be a replacement for the current tree structure to > > quickly find memcgs in excess. > > Yes and I do not insist on having this per-memcg counter. 
I took this > way to keep this memcg specific thing withing memcg code and do not > pollute generic cgroup code which doesn't care about this property at > all. res_counters hierarchically track usage, limit, and soft limit. Sounds like the perfect fit to track hierarchical soft limit excess to me. > > > > This series considerably worsens readability and maintainability of > > > > both the generic reclaim code as well as the memcg counterpart of it. > > > > > > I am really surprised that you are coming with this concerns that late. > > > This code has been posted quite some ago, hasn't it? We have even had > > > that "calm" discussion with Tejun about predicates and you were silent > > > at the time. > > > > I apologize that I was not more responsive in previous submissions, > > this is mainly because it was hard to context switch while I was > > focussed on the allocator fairness / thrash detection patches. > > Come on, Johannes! We were sitting in the same room at LSF when there > was a general agreement about the patchset because "It is a general > improvement and further changes might happen on top of it". You didn't > say a word you didn't like it and consider the integration as a bad > idea. It was a step towards turning soft limits into guarantees. What other use case was there? Then it might have been a good first step. I don't think it is a general improvement of the code and I don't remember saying that. > > > > The point of naturalizing the memcg code is to reduce data structures > > > > and redundancy and to break open opaque interfaces like "do soft > > > > reclaim and report back". But you didn't actually reduce complexity, > > > > you added even more opaque callbacks (should_soft_reclaim? > > > > soft_reclaim_eligible?). You didn't integrate soft limit into generic > > > > reclaim code, you just made the soft limit API more complicated. > > > > > > I can certainly think about simplifications. But it would be nicer if > > > you were more specific on the "more complicated" part. The soft reclaim > > > is a natural part of the reclaim now. Which I find as an improvement. > > > "Do some memcg magic and get back was" a bad idea IMO. > > > > It depends. Doing memcg reclaim in generic code made sense because > > the global data structure went and the generic code absolutely had to > > know about memcg iteration. > > > > Soft limit reclaim is something else. If it were just a modifier of > > the existing memcg reclaim loop, I would still put it in vmscan.c for > > simplicity reasons. > > > > But your separate soft limit reclaim pass adds quite some complication > > to the generic code and there is no good reason for it. > > Does it? The whole decision is hidden within the predicate so the main > loop doesn't have to care at all. If we ever go with another limit for > the guarantee it would simply use a different predicate to select the > right group. This is just handwaving. You replaced a simple function call in kswapd with an extra lruvec pass and an enhanced looping construct. It's all extra protocol that vmscan.c has to follow without actually improving the understandability of the code. You still have to dig into all those predicates to know what's going on. And at the risk of repeating myself, you haven't proven that it's worth the trouble. > > > > And, as I mentioned repeatedly in previous submissions, your benchmark > > > > numbers don't actually say anything useful about this change. > > > > > > I would really welcome suggestions for improvements. 
I have tried "The > > > most interesting test case would be how it behaves if some groups are > > > over the soft limits while others are not." with v5.1 where I had > > > memeater unlimited and kbuild resp. stream IO being limited. > > > > The workload seems better but the limit configuration is still very > > dubious. > > > > I just don't understand what your benchmarks are trying to express. > > If you set a soft limit of 0 you declare all memory is optional but > > then you go ahead and complain that the workload in there is > > thrashing. WTF? > > As I've already said the primary motivation was to check before and > after state and as you can see we can do better... Although you might > find soft_limit=0 setting artificial this setting is the easiest way to > tell that such a group is the number one candidate for reclaim. That would test for correctness, but then you draw conclusions about performance. Correctness-wise, the unpatched kernel seems to prefer reclaiming the 0-limit group as expected. You can't draw reasonable conclusions on performance and aggressiveness from such an unrealistic setup. > We have users who would like to do backups or some temporary actions to > not disturb the main workload. So stream-io resp. kbuild vs. mem_eater > is simulating such a use case. It's not obvious why you are not using hard limits instead. Again, streamers and kbuild don't really benefit from the extra cache, so there is no reason to let it expand into available memory and retract on pressure. These tests are just wildly inappropriate for soft limits, it's ridiculous that you insist that their outcome says anything at all. > > Kernelbuilds are not really good candidates for soft limit usage in > > the first place, because they have a small set of heavily used pages > > and a lot of used-once cache. > > It still produces the sufficient memory pressure to disturb other > groups. See "hard limit" above. > > The core workingset is anything but > > optional and you could just set the hard limit to clip all that > > needless cache because it does not change performance whether it's > > available or not. Same exact thing goes for IO streamers and their > > subset of dirty/writeback pages. > > > > > > I'm against merging this upstream at this point. > > > > > > Can we at least find some middle ground here? The way how the current > > > soft limit is done is a disaster. Ditching the whole series sounds like > > > a step back to me. > > > > I'm worried about memcg madness spreading. Memcg code is nowhere near > > kernel core code standards and I'm really reluctant to have it spread > > into something as delicate as vmscan.c for no good reason at all. > > > > If you claim the soft limit implementation is a disaster, I would have > > expected a real life workload to back this up but all I've seen are > > synthetic tests with no apparent significance to reality. > > > > So, no, I don't buy the "let's just do SOMETHING here" argument. > > > > Let's figure out what the hell is wrong, back it up, find a solution > > and verify it against the test case. > > It is basically prio-0 thing that is the disaster. And I was quite > explicit about that. And I keep telling you that you have proven neither the problem nor the solution. Your choice of tests suggests that you either don't understand what you are testing or that you don't understand what soft limits mean. There is no apparent point to this series and it adds complexity to vmscan.c. It has no business getting merged. 
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-09-03 16:15 ` Johannes Weiner @ 2013-09-04 16:38 ` Michal Hocko 2013-09-06 19:23 ` Johannes Weiner 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-09-04 16:38 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue 03-09-13 12:15:50, Johannes Weiner wrote: > > On Tue 20-08-13 10:13:39, Johannes Weiner wrote: > > > On Tue, Aug 20, 2013 at 11:14:14AM +0200, Michal Hocko wrote: > > > > On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > > > > > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > > > > > Hi, > > > > > > > > > > > > This is the fifth version of the patchset. > > > > > > > > > > > > Summary of versions: > > > > > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > > > > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > > > > > archives). There were no major objections. > > > > > > > > > > Except there are. > > > > > > > > Good to know that late... It would have been much more helpful to have > > > > such a principal feedback few months ago (this work is here since early > > > > Jun). > > > > > > I NAKed this "integrate into reclaim but still make soft reclaim an > > > extra pass" idea the first time Ying submitted it. > > > > There was a general agreement about this approach at LSF AFAIR. So it is > > really surprising hearing that from you _now_ > > > > I do not remember the details of your previous NAK but I do remember you > > liked the series I posted before the conference. I am willing to change > > details (e.g. move big vmscan part into memcg) but I do insist on having > > soft reclaim integrated to the regular reclaim. > > I was okay with the series back at LSF. > > But this was under the assumption that we are moving towards turning > softlimits into guarantees. It was only after LSF that we agreed that > conflating these two things is not a good idea and that we are better > off implementing guarantees separately. > > On the contrary, I am surprised that you are still pushing this series > in light of the discussions after LSF. > > Google was a big pusher of the soft limit changes, but their > requirements were always hard guarantees. We have since agreed that > this should be done separately. Yes, and this series is _not_ about guarantees. It is primarily about fitting the current soft reclaim into the regular reclaim more naturally. More on that bellow. > What is the use case for soft limits at this point? To handle overcommit situations more gracefully. As the documentation states: " 7. Soft limits Soft limits allow for greater sharing of memory. The idea behind soft limits is to allow control groups to use as much of the memory as needed, provided a. There is no memory contention b. They do not exceed their hard limit When the system detects memory contention or low memory, control groups are pushed back to their soft limits. If the soft limit of each control group is very high, they are pushed back as much as possible to make sure that one control group does not starve the others of memory. Please note that soft limits is a best-effort feature; it comes with no guarantees, but it does its best to make sure that when memory is heavily contended for, memory is allocated based on the soft limit hints/setup. 
Currently soft limit based reclaim is set up such that it gets invoked from balance_pgdat (kswapd). " Except for the last sentence the same holds for the integrated implementation as well. With the patchset we are doing the soft reclaim also for the targeted reclaim which was simply not possible previously because of the data structures limitations. And doing soft reclaim from target reclaim makes a lot of sense to me because whether we have a global or hierarchical memory pressure doesn't make any difference that some groups are set up to sacrifice their memory to help to release the pressure. > > > And no, the soft limit discussions have been going on since I first > > > submitted a rewrite as part of the reclaim integration series in > > > mid-2011. > > > > Which makes the whole story even more sad :/ > > > > > We have been going back and forth on whether soft limits should behave > > > as guarantees and *protect* from reclaim. But I thought we had since > > > agreed that implementing guarantees should be a separate effort and as > > > far as I'm concerned there is no justification for all this added code > > > just to prevent reclaiming non-excess when there is excess around. > > > > The patchset is absolutely not about guarantees anymore. It is more > > about making the current soft limit reclaim more natural. E.g. not doing > > prio-0 scans. > > With guarantees out of the window there is absolutely no justification > for this added level of complexity. The tree code is verbose but dead > simple. And it's contained. > > You have not shown that prio-0 scans are a problem. OK, I thought this was self evident but let me be more specific. The scan the world is almost always a problem. We are no longer doing proportional anon/file reclaim (swappiness is ignored). This is wrong from at least two points of view. Firstly it makes the reclaim decisions different a lot for groups that are under the soft limit and those that are over. Secondly, and more importantly, this might lead to a pre-mature swapping, especially when there is a lot of IO going on. The global reclaim suffers from the very same problem and that is why we try to prevent from prio-0 reclaim as much as possible and use it only as a last resort. > Or even argued why you need new memcg iteration code to fix the > priority level. The new iterator code is there to make the iteration more effective when selecting interesting groups. Whether the skipping logic should be inside the iterator or open coded outside is an implementation detail IMO. I am considering the callback approach better because it is reusable as the skipping part is implemented only once at the place with the same code which has to deal with all other details already. And the caller only cares to tell which (sub)trees he is interested in. > > The side effect that the isolation works better in the result is just a > > bonus. And I have to admit I like that bonus. > > What isolation? Reclaiming groups over the soft limit might be sufficient to release the memory pressure and so other groups might be saved from being reclaimed. Isolation might be a strong word for that as that would require a certain guarantee but working in a best-effort mode can work much better than what we have right now. If we do a fair reclaim on the whole (sub)tree rather than hammer the biggest offender we might end up reclaiming from other groups less. 
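The prio-0 objection above is easiest to see from the scan-target arithmetic: one reclaim pass scans roughly lru_size >> priority pages per LRU list, so the default priority of 12 touches only a small fraction of each list while a forced priority-0 pass scans everything, which is what makes it so disruptive and is the context for the swappiness point made above. The program below is a standalone illustration of that shift only, not the kernel's get_scan_count(); it leaves out the proportional anon/file balancing entirely.

/*
 * Illustration of how reclaim priority maps to scan pressure:
 * one pass scans roughly lru_size >> priority pages per LRU list,
 * so priority 0 means scanning the entire list.
 */
#include <stdio.h>

#define DEF_PRIORITY	12	/* where reclaim normally starts */

static unsigned long scan_target(unsigned long lru_size, int priority)
{
	unsigned long nr = lru_size >> priority;

	/* in this toy model, never return 0 for a non-empty list */
	return nr ? nr : (lru_size ? 1 : 0);
}

int main(void)
{
	unsigned long lru_size = 262144;	/* 1GB of 4k pages on one LRU */
	int prio;

	for (prio = DEF_PRIORITY; prio >= 0; prio--)
		printf("priority %2d -> scan %8lu of %lu pages\n",
		       prio, scan_target(lru_size, prio), lru_size);
	return 0;
}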
I even think that doing the fair soft reclaim is a better approach from the conceptual point of view because it is less prone to corner cases when one group is hammered over and over again without actually helping to relief the memory pressure for which is the soft limit intended in the first place. [...] > > > > > This series considerably worsens readability and maintainability of > > > > > both the generic reclaim code as well as the memcg counterpart of it. > > > > > > > > I am really surprised that you are coming with this concerns that late. > > > > This code has been posted quite some ago, hasn't it? We have even had > > > > that "calm" discussion with Tejun about predicates and you were silent > > > > at the time. > > > > > > I apologize that I was not more responsive in previous submissions, > > > this is mainly because it was hard to context switch while I was > > > focussed on the allocator fairness / thrash detection patches. > > > > Come on, Johannes! We were sitting in the same room at LSF when there > > was a general agreement about the patchset because "It is a general > > improvement and further changes might happen on top of it". You didn't > > say a word you didn't like it and consider the integration as a bad > > idea. > > It was a step towards turning soft limits into guarantees. What other > use case was there? And it is still a preparatory step for further improvements towards guarantees. If you look at the series a new (min limit or whatever you call it) knob is almost trivial to implement. > Then it might have been a good first step. I don't think it is a > general improvement of the code and I don't remember saying that. > > > > > > The point of naturalizing the memcg code is to reduce data structures > > > > > and redundancy and to break open opaque interfaces like "do soft > > > > > reclaim and report back". But you didn't actually reduce complexity, > > > > > you added even more opaque callbacks (should_soft_reclaim? > > > > > soft_reclaim_eligible?). You didn't integrate soft limit into generic > > > > > reclaim code, you just made the soft limit API more complicated. > > > > > > > > I can certainly think about simplifications. But it would be nicer if > > > > you were more specific on the "more complicated" part. The soft reclaim > > > > is a natural part of the reclaim now. Which I find as an improvement. > > > > "Do some memcg magic and get back was" a bad idea IMO. > > > > > > It depends. Doing memcg reclaim in generic code made sense because > > > the global data structure went and the generic code absolutely had to > > > know about memcg iteration. > > > > > > Soft limit reclaim is something else. If it were just a modifier of > > > the existing memcg reclaim loop, I would still put it in vmscan.c for > > > simplicity reasons. > > > > > > But your separate soft limit reclaim pass adds quite some complication > > > to the generic code and there is no good reason for it. > > > > Does it? The whole decision is hidden within the predicate so the main > > loop doesn't have to care at all. If we ever go with another limit for > > the guarantee it would simply use a different predicate to select the > > right group. > > This is just handwaving. You replaced a simple function call in > kswapd That simple call from kswapd is not that simple at all in fact. It hides a lot of memcg specific code which is far from being trivial. 
Even worse, that memcg specific code gets back into the reclaim code with different reclaim parameters than those used in the context it has been called from. > with an extra lruvec pass and an enhanced looping construct. > It's all extra protocol that vmscan.c has to follow without actually > improving the understandability of the code. You still have to dig > into all those predicates to know what's going on. Compared to the memcg per-node-zone excess trees, their maintenance, and the code needed to achieve at least some fairness during reclaim, the above is a) much less code to read/maintain/understand and b) it achieves the same result without the corner cases mentioned previously. I was trying to taint vmscan.c as little as possible. I have a patch which moves most of the memcg specific parts back to memcontrol.c but I cannot say I like it because it duplicates some code and has to expose scan_control and shrink_lruvec outside vmscan.c. > And at the risk of repeating myself, you haven't proven that it's > worth the trouble. > > > > > And, as I mentioned repeatedly in previous submissions, your benchmark > > > > > numbers don't actually say anything useful about this change. > > > > > > > > I would really welcome suggestions for improvements. I have tried "The > > > > most interesting test case would be how it behaves if some groups are > > > > over the soft limits while others are not." with v5.1 where I had > > > > memeater unlimited and kbuild resp. stream IO being limited. > > > > > > The workload seems better but the limit configuration is still very > > > dubious. > > > > > > I just don't understand what your benchmarks are trying to express. > > > If you set a soft limit of 0 you declare all memory is optional but > > > then you go ahead and complain that the workload in there is > > > thrashing. WTF? > > > > As I've already said the primary motivation was to check before and > > after state and as you can see we can do better... Although you might > > find soft_limit=0 setting artificial this setting is the easiest way to > > tell that such a group is the number one candidate for reclaim. > > That would test for correctness, but then you draw conclusions about > performance. The series is not about performance improvements. I am sorry if the testing results made a different impression. The primary motivation was to get rid of a big memcg specific reclaim path which doesn't handle fairness well and uses a hackish prio-0 reclaim which is too disruptive. > Correctness-wise, the unpatched kernel seems to prefer reclaiming the > 0-limit group as expected. It does, and even a simple configuration shows that the current soft reclaim is too disruptive and reclaims much more than necessary. > You can't draw reasonable conclusions on performance and > aggressiveness from such an unrealistic setup. > > > We have users who would like to do backups or some temporary actions to > > not disturb the main workload. So stream-io resp. kbuild vs. mem_eater > > is simulating such a use case. > > It's not obvious why you are not using hard limits instead. Because although the load in the soft limited groups is willing to sacrifice its memory, it would still prefer as much memory as possible to finish as soon as possible. Also, setting the hard limit might be really non-trivial, especially when you have several such loads, because you have to be careful that the cumulative usage doesn't cause global reclaim. 
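[Editor's note] To make the "taint vmscan.c as little as possible" point above concrete, here is a minimal sketch of the kind of two-pass loop the integrated approach implies. It reuses the simplified names from the previous sketch; next_group() and shrink_group() are hypothetical placeholders for the hierarchy walk and the normal proportional LRU scan, not the actual vmscan.c helpers.

/*
 * Simplified sketch only -- hypothetical helpers, not the series' code.
 * The first pass reclaims just the groups the predicate marks eligible;
 * only if that frees nothing does the loop fall back to all groups, so
 * the generic code never deals with soft limit details directly.
 */
struct scan_ctl {
        unsigned long nr_reclaimed;
        int priority;
};

enum filter_result { VISIT, SKIP, SKIP_TREE };  /* as in the sketch above */
struct cgroup_node;

enum filter_result soft_reclaim_eligible(struct cgroup_node *group,
                                         struct cgroup_node *root);
/* Hypothetical pre-order walk over the hierarchy under root. */
struct cgroup_node *next_group(struct cgroup_node *root, struct cgroup_node *prev);
/* Hypothetical stand-in for the normal proportional LRU scan. */
void shrink_group(struct cgroup_node *group, struct scan_ctl *sc);

static void shrink_zone_groups(struct cgroup_node *root, struct scan_ctl *sc)
{
        unsigned long before = sc->nr_reclaimed;
        struct cgroup_node *group;

        /* Soft limit pass: scan only groups in excess (or under an
         * ancestor in excess), at the caller's priority. */
        for (group = next_group(root, NULL); group; group = next_group(root, group))
                if (soft_reclaim_eligible(group, root) == VISIT)
                        shrink_group(group, sc);

        if (sc->nr_reclaimed > before)
                return;         /* pushing back the offenders was enough */

        /* Nothing in excess, or it was not enough: ordinary pass. */
        for (group = next_group(root, NULL); group; group = next_group(root, group))
                shrink_group(group, sc);
}

For brevity the eligibility check is open coded in the caller here; with the callback variant discussed above the filter would be passed to the iterator itself, which is what allows whole subtrees to be skipped rather than visited and rejected one group at a time.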
> Again, streamers and kbuild don't really benefit from the extra cache, > so there is no reason to let it expand into available memory and > retract on pressure. > These tests are just wildly inappropriate for soft limits, it's > ridiculous that you insist that their outcome says anything at all. I do realize that my testing load is simplified a lot, but to be honest, whatever load I come up with might still be dismissed as too simplified or artificial. We simply do not have any reference benchmark. I was merely interested in well-known and understood loads, and I compared before and after results for a basic overview of the changes. The series was not aimed at performance improvement in the first place. And to be honest I wasn't expecting such big differences, especially for stream IO, which should be an easy "drop everything behind" case as it is a use-once load. As you can see, though, the previous soft limit implementation was hammering even that load too much. You can see that quite nicely on this graph [1]. It is similar with the kbuild test case, where the system swapped out many more pages (by a factor of 145) and major page faults increased as well (by a factor of 85). You can argue that my testing loads are not typical soft limit configurations, and I might even agree, but they clearly show that the way we currently do soft reclaim is way too disruptive, and that is the first thing to fix if we want to move on in that area. --- [1] http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-09-04 16:38 ` Michal Hocko @ 2013-09-06 19:23 ` Johannes Weiner 2013-09-13 14:49 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-09-06 19:23 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Wed, Sep 04, 2013 at 06:38:23PM +0200, Michal Hocko wrote: > On Tue 03-09-13 12:15:50, Johannes Weiner wrote: > > > On Tue 20-08-13 10:13:39, Johannes Weiner wrote: > > > > On Tue, Aug 20, 2013 at 11:14:14AM +0200, Michal Hocko wrote: > > > > > On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > > > > > > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > > > > > > Hi, > > > > > > > > > > > > > > This is the fifth version of the patchset. > > > > > > > > > > > > > > Summary of versions: > > > > > > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > > > > > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > > > > > > archives). There were no major objections. > > > > > > > > > > > > Except there are. > > > > > > > > > > Good to know that late... It would have been much more helpful to have > > > > > such a principal feedback few months ago (this work is here since early > > > > > Jun). > > > > > > > > I NAKed this "integrate into reclaim but still make soft reclaim an > > > > extra pass" idea the first time Ying submitted it. > > > > > > There was a general agreement about this approach at LSF AFAIR. So it is > > > really surprising hearing that from you _now_ > > > > > > I do not remember the details of your previous NAK but I do remember you > > > liked the series I posted before the conference. I am willing to change > > > details (e.g. move big vmscan part into memcg) but I do insist on having > > > soft reclaim integrated to the regular reclaim. > > > > I was okay with the series back at LSF. > > > > But this was under the assumption that we are moving towards turning > > softlimits into guarantees. It was only after LSF that we agreed that > > conflating these two things is not a good idea and that we are better > > off implementing guarantees separately. > > > > On the contrary, I am surprised that you are still pushing this series > > in light of the discussions after LSF. > > > > Google was a big pusher of the soft limit changes, but their > > requirements were always hard guarantees. We have since agreed that > > this should be done separately. > > Yes, and this series is _not_ about guarantees. It is primarily > about fitting the current soft reclaim into the regular reclaim more > naturally. More on that bellow. > > > What is the use case for soft limits at this point? > > To handle overcommit situations more gracefully. As the documentation > states: > " > 7. Soft limits > > Soft limits allow for greater sharing of memory. The idea behind soft limits > is to allow control groups to use as much of the memory as needed, provided > > a. There is no memory contention > b. They do not exceed their hard limit > > When the system detects memory contention or low memory, control groups > are pushed back to their soft limits. If the soft limit of each control > group is very high, they are pushed back as much as possible to make > sure that one control group does not starve the others of memory. 
> > Please note that soft limits is a best-effort feature; it comes with > no guarantees, but it does its best to make sure that when memory is > heavily contended for, memory is allocated based on the soft limit > hints/setup. Currently soft limit based reclaim is set up such that > it gets invoked from balance_pgdat (kswapd). > " > > Except for the last sentence the same holds for the integrated > implementation as well. With the patchset we are doing the soft reclaim > also for the targeted reclaim which was simply not possible previously > because of the data structures limitations. And doing soft reclaim from > target reclaim makes a lot of sense to me because whether we have a > global or hierarchical memory pressure doesn't make any difference that > some groups are set up to sacrifice their memory to help to release the > pressure. The issue I have with this is that the semantics of the soft limit are so backwards that we should strive to get this stuff right conceptually before integrating this better into the VM. We have a big user that asks for guarantees, which are comparable but the invert opposite of this. Instead of specifying what is optional in one group, you specify what is essential in the other group. And the default is to guarantee nothing instead of everything like soft limits are currently defined. We even tried to invert the default soft limit setting in the past, which went nowhere because we can't do these subtle semantic changes on an existing interface. I would really like to deprecate soft limits and introduce something new that has the proper semantics we want from the get-go. Its implementation could very much look like your code, so we can easily reuse that. But the interface and its semantics should come first. > > > > And no, the soft limit discussions have been going on since I first > > > > submitted a rewrite as part of the reclaim integration series in > > > > mid-2011. > > > > > > Which makes the whole story even more sad :/ > > > > > > > We have been going back and forth on whether soft limits should behave > > > > as guarantees and *protect* from reclaim. But I thought we had since > > > > agreed that implementing guarantees should be a separate effort and as > > > > far as I'm concerned there is no justification for all this added code > > > > just to prevent reclaiming non-excess when there is excess around. > > > > > > The patchset is absolutely not about guarantees anymore. It is more > > > about making the current soft limit reclaim more natural. E.g. not doing > > > prio-0 scans. > > > > With guarantees out of the window there is absolutely no justification > > for this added level of complexity. The tree code is verbose but dead > > simple. And it's contained. > > > > You have not shown that prio-0 scans are a problem. > > OK, I thought this was self evident but let me be more specific. > > The scan the world is almost always a problem. We are no longer doing > proportional anon/file reclaim (swappiness is ignored). This is wrong > from at least two points of view. Firstly it makes the reclaim decisions > different a lot for groups that are under the soft limit and those > that are over. Secondly, and more importantly, this might lead to a > pre-mature swapping, especially when there is a lot of IO going on. > > The global reclaim suffers from the very same problem and that is why > we try to prevent from prio-0 reclaim as much as possible and use it > only as a last resort. 
I know that and I can see that this should probably be fixed, but there is no quantification for this. We have no per-memcg reclaim statistics and your test cases were not useful in determining what's going on reclaim-wise. > > Or even argued why you need new memcg iteration code to fix the > > priority level. > > The new iterator code is there to make the iteration more effective when > selecting interesting groups. Whether the skipping logic should be > inside the iterator or open coded outside is an implementation detail > IMO. > I am considering the callback approach better because it is reusable as > the skipping part is implemented only once at the place with the same > code which has to deal with all other details already. And the caller > only cares to tell which (sub)trees he is interested in. I was questioning the selective iterators as such, rather than the specific implementation. This whole problem of needing better iterators comes from the fact that soft limits default to ~0UL, which means that by default there are no groups eligible for reclaim in the first pass. If you implement the separate guarantee / lower bound functionality with a default of no-guarantee, the common case is that everybody is eligible for reclaim in the first pass. And suddenly, the selective iterators are merely a performance optimization for the guarantee feature as opposed to a fundamental requirement for the common case. > > > The side effect that the isolation works better in the result is just a > > > bonus. And I have to admit I like that bonus. > > > > What isolation? > > Reclaiming groups over the soft limit might be sufficient to release > the memory pressure and so other groups might be saved from being > reclaimed. Isolation might be a strong word for that as that would > require a certain guarantee but working in a best-effort mode can work > much better than what we have right now. > > If we do a fair reclaim on the whole (sub)tree rather than hammer the > biggest offender we might end up reclaiming from other groups less. > > I even think that doing the fair soft reclaim is a better approach from > the conceptual point of view because it is less prone to corner cases > when one group is hammered over and over again without actually helping > to relief the memory pressure for which is the soft limit intended in > the first place. Fully agreed that we could do a better job at being selective. And as per above, this should be *the* single argument for the selective iterators and not just a side show. At this point, please step back and think about how many aspects of the soft limit semantics and implementation you are trying to fix with this series. My conclusion is that the interface is basically unsalvageable, based on our disagreements over the semantics and the implementation. Not just in this thread, but over many threads, conferences in person and on the phone. The fact is, soft limits do very little of what we actually want and we can't turn an existing interface upside down. There is no point in contorting ourselves, the documentation, and the code to achieve something reasonable while preserving something nobody cares about. > > > > > > This series considerably worsens readability and maintainability of > > > > > > both the generic reclaim code as well as the memcg counterpart of it. > > > > > > > > > > I am really surprised that you are coming with this concerns that late. > > > > > This code has been posted quite some ago, hasn't it? 
We have even had > > > > > that "calm" discussion with Tejun about predicates and you were silent > > > > > at the time. > > > > > > > > I apologize that I was not more responsive in previous submissions, > > > > this is mainly because it was hard to context switch while I was > > > > focussed on the allocator fairness / thrash detection patches. > > > > > > Come on, Johannes! We were sitting in the same room at LSF when there > > > was a general agreement about the patchset because "It is a general > > > improvement and further changes might happen on top of it". You didn't > > > say a word you didn't like it and consider the integration as a bad > > > idea. > > > > It was a step towards turning soft limits into guarantees. What other > > use case was there? > > And it is still a preparatory step for further improvements towards > guarantees. If you look at the series a new (min limit or whatever you > call it) knob is almost trivial to implement. So let's just do the right thing and forget about all these detours, shall we? > > Then it might have been a good first step. I don't think it is a > > general improvement of the code and I don't remember saying that. > > > > > > > > The point of naturalizing the memcg code is to reduce data structures > > > > > > and redundancy and to break open opaque interfaces like "do soft > > > > > > reclaim and report back". But you didn't actually reduce complexity, > > > > > > you added even more opaque callbacks (should_soft_reclaim? > > > > > > soft_reclaim_eligible?). You didn't integrate soft limit into generic > > > > > > reclaim code, you just made the soft limit API more complicated. > > > > > > > > > > I can certainly think about simplifications. But it would be nicer if > > > > > you were more specific on the "more complicated" part. The soft reclaim > > > > > is a natural part of the reclaim now. Which I find as an improvement. > > > > > "Do some memcg magic and get back was" a bad idea IMO. > > > > > > > > It depends. Doing memcg reclaim in generic code made sense because > > > > the global data structure went and the generic code absolutely had to > > > > know about memcg iteration. > > > > > > > > Soft limit reclaim is something else. If it were just a modifier of > > > > the existing memcg reclaim loop, I would still put it in vmscan.c for > > > > simplicity reasons. > > > > > > > > But your separate soft limit reclaim pass adds quite some complication > > > > to the generic code and there is no good reason for it. > > > > > > Does it? The whole decision is hidden within the predicate so the main > > > loop doesn't have to care at all. If we ever go with another limit for > > > the guarantee it would simply use a different predicate to select the > > > right group. > > > > This is just handwaving. You replaced a simple function call in > > kswapd > > That simple call from kswapd is not that simple at all in fact. It hides > a lot of memcg specific code which is far from being trivial. Even worse > that memcg specific code gets back to the reclaim code with different > reclaim parameters than those used from the context it has been called > from. It does not matter to understanding generic reclaim code, though, and acts more like the shrinkers. We send it off to get memory and it comes back with results. > > with an extra lruvec pass and an enhanced looping construct. > > It's all extra protocol that vmscan.c has to follow without actually > > improving the understandability of the code. 
You still have to dig > > into all those predicates to know what's going on. > > Comparing that to the memcg per-node-zone excess trees, their > maintenance and a code to achieve at least some fairness during reclaim > the above is a) much less code to read/maintain/understand b) it > achieves the same result without corner cases mentioned previously. > > I was trying to taint vmscan.c as little as possible. I have a patch > which moves most of the memcg specific parts back to memcontrol.c but I > cannot say I like it because it duplicates some code and has to expose > scan_control and shrink_lruvec outside vmscan.c. I'm not against complexity if it's appropriate, I just pointed out that you add it with rather weak justification. > > And at the risk of repeating myself, you haven't proven that it's > > worth the trouble. > > > > > > > > And, as I mentioned repeatedly in previous submissions, your benchmark > > > > > > numbers don't actually say anything useful about this change. > > > > > > > > > > I would really welcome suggestions for improvements. I have tried "The > > > > > most interesting test case would be how it behaves if some groups are > > > > > over the soft limits while others are not." with v5.1 where I had > > > > > memeater unlimited and kbuild resp. stream IO being limited. > > > > > > > > The workload seems better but the limit configuration is still very > > > > dubious. > > > > > > > > I just don't understand what your benchmarks are trying to express. > > > > If you set a soft limit of 0 you declare all memory is optional but > > > > then you go ahead and complain that the workload in there is > > > > thrashing. WTF? > > > > > > As I've already said the primary motivation was to check before and > > > after state and as you can see we can do better... Although you might > > > find soft_limit=0 setting artificial this setting is the easiest way to > > > tell that such a group is the number one candidate for reclaim. > > > > That would test for correctness, but then you draw conclusions about > > performance. > > The series is not about performance improvements. I am sorry if the > testing results made a different impression. The primary motivation was > to get rid of a big memcg specific reclaim path which doesn't handle > fairness well and uses hackish prio-0 reclaim which is too disruptive. > > > Correctness-wise, the unpatched kernel seems to prefer reclaiming the > > 0-limit group as expected. > > It does and as even simple configuration shows that the current soft > reclaim is too disturbing and reclaiming much more than necessary. Soft limit is about balancing reclaim pressure and I already pointed out that your control group has so much limit slack that you can't tell if the main group is performing better because of reclaim aggressiveness (good) or because the memory is just taken from your control group (bad). Please either say why I'm wrong or stop asserting points that have been refuted. > > You can't draw reasonable conclusions on performance and > > aggressiveness from such an unrealistic setup. > > > > > We have users who would like to do backups or some temporary actions to > > > not disturb the main workload. So stream-io resp. kbuild vs. mem_eater > > > is simulating such a use case. > > > > It's not obvious why you are not using hard limits instead. > > Because although the load in the soft limited groups is willing to > sacrifice its memory it would still prefer as much much memory as > possible to finish as soon as possible. 
Again, both streamers and kbuild do not benefit from the extra memory; I'm not sure why you keep repeating this nonsense over and over. Find a workload whose performance scales with the amount of available memory and whose pressure-performance graph is not DONT CARE DONT CARE DONT CARE FALLS OFF CLIFF. Using a hard limit workload to argue soft limits is pointless. > And also setting the hard limit might be really non-trivial, especially > when you have more such loads because you have to be careful so the > cumulative usage doesn't cause the global reclaim. > > > Again, streamers and kbuild don't really benefit from the extra cache, > > so there is no reason to let it expand into available memory and > > retract on pressure. > > These tests are just wildly inappropriate for soft limits, it's > > ridiculous that you insist that their outcome says anything at all. > > I do realize that my testing load is simplified a lot, but to be honest, > what ever load I come up with, it still might be argued against as too > simplified or artificial. We simply do not have any etalon. Except they are not simplified. They are in no way indicative. > I was merely interested in well known and understood loads and compared > before and after results for a basic overview of the changes. The series > was not aimed as performance improvement in the first place. You really need to decide what you are trying to sell. > And to be honest I wasn't expecting such a big differences, especially > for stream IO which should be easy "drop everything behind" as it > is use-once load. As you can see, though, the previous soft limit > implementation was hammering even on that load too much. You can see > that on this graph quite nicely [1]. > Similar with the kbuild test case where the system swapped out much more > pages (by factor 145) and the Major page faults increased as well (by > factor 85). > You can argue that my testing loads are not typical soft limit > configurations and I even might agree but they clearly show that the way > how we do the soft reclaim currently is way too disruptive and that is > the first thing to fix when we want to move on in that area. So we should pile all this complexity into reclaim code based on the performance of misconfigurations? Questionable methods aside, I don't see this going anywhere. Your patches don't make soft limits better, they're just trying to put lipstick on the pig. It would be best to use the opportunity from the cgroup rework and get this stuff right. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-09-06 19:23 ` Johannes Weiner @ 2013-09-13 14:49 ` Michal Hocko [not found] ` <20130913144953.GA23857-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-09-13 14:49 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri 06-09-13 15:23:11, Johannes Weiner wrote: > On Wed, Sep 04, 2013 at 06:38:23PM +0200, Michal Hocko wrote: [...] > > To handle overcommit situations more gracefully. As the documentation > > states: > > " > > 7. Soft limits > > > > Soft limits allow for greater sharing of memory. The idea behind soft limits > > is to allow control groups to use as much of the memory as needed, provided > > > > a. There is no memory contention > > b. They do not exceed their hard limit > > > > When the system detects memory contention or low memory, control groups > > are pushed back to their soft limits. If the soft limit of each control > > group is very high, they are pushed back as much as possible to make > > sure that one control group does not starve the others of memory. > > > > Please note that soft limits is a best-effort feature; it comes with > > no guarantees, but it does its best to make sure that when memory is > > heavily contended for, memory is allocated based on the soft limit > > hints/setup. Currently soft limit based reclaim is set up such that > > it gets invoked from balance_pgdat (kswapd). > > " > > > > Except for the last sentence the same holds for the integrated > > implementation as well. With the patchset we are doing the soft reclaim > > also for the targeted reclaim which was simply not possible previously > > because of the data structures limitations. And doing soft reclaim from > > target reclaim makes a lot of sense to me because whether we have a > > global or hierarchical memory pressure doesn't make any difference that > > some groups are set up to sacrifice their memory to help to release the > > pressure. > > The issue I have with this is that the semantics of the soft limit are > so backwards that we should strive to get this stuff right > conceptually before integrating this better into the VM. > > We have a big user that asks for guarantees, which are comparable but > the invert opposite of this. Instead of specifying what is optional > in one group, you specify what is essential in the other group. And > the default is to guarantee nothing instead of everything like soft > limits are currently defined. > > We even tried to invert the default soft limit setting in the past, > which went nowhere because we can't do these subtle semantic changes > on an existing interface. > > I would really like to deprecate soft limits and introduce something > new that has the proper semantics we want from the get-go. Its > implementation could very much look like your code, so we can easily > reuse that. But the interface and its semantics should come first. I am open to discussin such a change I just do not see any reason to have a crippled soft reclaim implementation for the mean time. Especially when it doesn't look like such a new interface is easy to agree on. [...] > > > You have not shown that prio-0 scans are a problem. > > > > OK, I thought this was self evident but let me be more specific. > > > > The scan the world is almost always a problem. 
We are no longer doing > > proportional anon/file reclaim (swappiness is ignored). This is wrong > > from at least two points of view. Firstly it makes the reclaim decisions > > different a lot for groups that are under the soft limit and those > > that are over. Secondly, and more importantly, this might lead to a > > pre-mature swapping, especially when there is a lot of IO going on. > > > > The global reclaim suffers from the very same problem and that is why > > we try to prevent from prio-0 reclaim as much as possible and use it > > only as a last resort. > > I know that and I can see that this should probably be fixed, but > there is no quantification for this. We have no per-memcg reclaim > statistics Not having statistics is a separate issue. It makes the situation worse but that is not a new thing. The old implementation is even worse because the soft reclaim activity is basically hidden from the global reclaim counters. So a lot of pages might get scanned and we will have no way to find out. That part is inherently fixed by the series because of the integration. > and your test cases were not useful in determining what's going on > reclaim-wise. I will instrument the kernel for the next round of tests, which will hopefully be more descriptive. [...] > > That simple call from kswapd is not that simple at all in fact. It hides > > a lot of memcg specific code which is far from being trivial. Even worse > > that memcg specific code gets back to the reclaim code with different > > reclaim parameters than those used from the context it has been called > > from. > > It does not matter to understanding generic reclaim code, though, and > acts more like the shrinkers. We send it off to get memory and it > comes back with results. The shrinker interface is just bad. It might work for dentries and inodes but it failed in many other subsystems where it ended up in do-something mode. Soft reclaim is yet another example where we are doing an artificial scan-the-world reclaim to hammer somebody. Fairness is basically impossible to guarantee and there are corner cases which are just waiting to explode. [...] > Soft limit is about balancing reclaim pressure and I already pointed > out that your control group has so much limit slack that you can't > tell if the main group is performing better because of reclaim > aggressiveness (good) or because the memory is just taken from your > control group (bad). > > Please either say why I'm wrong or stop asserting points that have > been refuted. I will work on improving my testing setup. I will come back with results early next week hopefully. [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework [not found] ` <20130913144953.GA23857-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-09-13 16:17 ` Johannes Weiner 2013-09-16 16:44 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-09-13 16:17 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri, Sep 13, 2013 at 04:49:53PM +0200, Michal Hocko wrote: > On Fri 06-09-13 15:23:11, Johannes Weiner wrote: > > On Wed, Sep 04, 2013 at 06:38:23PM +0200, Michal Hocko wrote: > [...] > > > To handle overcommit situations more gracefully. As the documentation > > > states: > > > " > > > 7. Soft limits > > > > > > Soft limits allow for greater sharing of memory. The idea behind soft limits > > > is to allow control groups to use as much of the memory as needed, provided > > > > > > a. There is no memory contention > > > b. They do not exceed their hard limit > > > > > > When the system detects memory contention or low memory, control groups > > > are pushed back to their soft limits. If the soft limit of each control > > > group is very high, they are pushed back as much as possible to make > > > sure that one control group does not starve the others of memory. > > > > > > Please note that soft limits is a best-effort feature; it comes with > > > no guarantees, but it does its best to make sure that when memory is > > > heavily contended for, memory is allocated based on the soft limit > > > hints/setup. Currently soft limit based reclaim is set up such that > > > it gets invoked from balance_pgdat (kswapd). > > > " > > > > > > Except for the last sentence the same holds for the integrated > > > implementation as well. With the patchset we are doing the soft reclaim > > > also for the targeted reclaim which was simply not possible previously > > > because of the data structures limitations. And doing soft reclaim from > > > target reclaim makes a lot of sense to me because whether we have a > > > global or hierarchical memory pressure doesn't make any difference that > > > some groups are set up to sacrifice their memory to help to release the > > > pressure. > > > > The issue I have with this is that the semantics of the soft limit are > > so backwards that we should strive to get this stuff right > > conceptually before integrating this better into the VM. > > > > We have a big user that asks for guarantees, which are comparable but > > the invert opposite of this. Instead of specifying what is optional > > in one group, you specify what is essential in the other group. And > > the default is to guarantee nothing instead of everything like soft > > limits are currently defined. > > > > We even tried to invert the default soft limit setting in the past, > > which went nowhere because we can't do these subtle semantic changes > > on an existing interface. > > > > I would really like to deprecate soft limits and introduce something > > new that has the proper semantics we want from the get-go. Its > > implementation could very much look like your code, so we can easily > > reuse that. But the interface and its semantics should come first. > > I am open to discussin such a change I just do not see any reason to > have a crippled soft reclaim implementation for the mean time. 
> Especially when it doesn't look like such a new interface is easy to > agree on. We had a crippled soft limit implementation from the time it was merged; it never worked better than now. You seem to think that this is an argument *for* finally fixing it. I disagree. We should absolutely *avoid* steering people toward it now, when the long term plan is already to get rid of it. There is a consensus that cgroups and the controllers were merged before they were ready and we are now struggling heavily to iron out the design mistakes with the minimum amount of disruption we can get away with. We are also at this time coordinating with all the other controllers and the cgroup core to do exactly that, where Tejun is providing us with tools to revamp the problematic interfaces. And we agree that soft limits were such a design mistake that should be ironed out. So for the love of everything we hold dear, why would you think that NOW is a good time to fix the implementation and get people to use it? > > > > You have not shown that prio-0 scans are a problem. > > > > > > OK, I thought this was self evident but let me be more specific. > > > > > > The scan the world is almost always a problem. We are no longer doing > > > proportional anon/file reclaim (swappiness is ignored). This is wrong > > > from at least two points of view. Firstly it makes the reclaim decisions > > > different a lot for groups that are under the soft limit and those > > > that are over. Secondly, and more importantly, this might lead to a > > > pre-mature swapping, especially when there is a lot of IO going on. > > > > > > The global reclaim suffers from the very same problem and that is why > > > we try to prevent from prio-0 reclaim as much as possible and use it > > > only as a last resort. > > > > I know that and I can see that this should probably be fixed, but > > there is no quantification for this. We have no per-memcg reclaim > > statistics > > Not having statistic is a separate issue. It makes the situation worse > but that is not a new thing. The old implementation is even worse > because the soft reclaim activity is basically hidden from global > reclaim counters. So a lot of pages might get scanned and we will have > no way to find out. That part is inherently fixed by the series because > of the integration. Because it's in the *global* reclaim counters? That's great but it does not address the problem at all. This is about pressure balance between groups and you don't have any numbers for that. All I'm saying is that before changing how the pressure is balanced we need to know per-memcg statistics to quantify it and get an insight into what we are actually doing. You respond with a wall of text but you don't address the problem at all. And before doing all that, we should get the user-visible interface right, which we all agreed is broken. > > > That simple call from kswapd is not that simple at all in fact. It hides > > > a lot of memcg specific code which is far from being trivial. Even worse > > > that memcg specific code gets back to the reclaim code with different > > > reclaim parameters than those used from the context it has been called > > > from. > > > > It does not matter to understanding generic reclaim code, though, and > > acts more like the shrinkers. We send it off to get memory and it > > comes back with results. > > Shrinker interface is just too bad. It might work for dentries and > inodes but it failed in many other subsystems where it ended up in > do-something mode. 
Soft reclaim is yet another example where we are > doing an artificial scan-the-world reclaim to hammer somebody. Fairness > is basically impossible to guarantee and there are corner cases which > are just waiting to explode. Every time you reply you are just attacking bits of my argument in a way that is completely irrelevant to the discussion. What is the overall objective that you are trying to defend? I said that you are making the interface more complex because the current interface is leaving the complexity encapsulated in memcg code. It does not matter one bit that some shrinkers are set up incorrectly, that entirely misses the point. Michal, it's completely unobvious what your longterm goals are for soft limits and guarantees. And without that it's hard to comprehend how and if the patches you are sending push into the right direction. Every time I try to discuss the bigger picture you derail it with details about how the implementation is broken. It's frustrating. This series is a grab bag of fixes that drag a lot of complexity from memcg code into generic reclaim, to repair the age old implementation of a user-visible interface that we already agree sucks balls and should be deprecated. The fact that you did not even demonstrate that the repair itself was successful is a secondary issue at this point, but it certainly didn't help your case. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-09-13 16:17 ` Johannes Weiner @ 2013-09-16 16:44 ` Michal Hocko [not found] ` <20130916164405.GG3674-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-09-16 16:44 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri 13-09-13 12:17:09, Johannes Weiner wrote: > On Fri, Sep 13, 2013 at 04:49:53PM +0200, Michal Hocko wrote: > > On Fri 06-09-13 15:23:11, Johannes Weiner wrote: > [...] > > > > I would really like to deprecate soft limits and introduce something > > > > new that has the proper semantics we want from the get-go. Its > > > > implementation could very much look like your code, so we can easily > > > > reuse that. But the interface and its semantics should come first. > > > > > > I am open to discussin such a change I just do not see any reason to > > > have a crippled soft reclaim implementation for the mean time. > > > Especially when it doesn't look like such a new interface is easy to > > > agree on. > > > > We had a crippled soft limit implementation from the time it was > > merged, it never worked better than now. > > > > You seem to think that this is an argument *for* finally fixing it. I > > disagree. We should absolutely *avoid* steering people toward it now, > > when the long term plan is already to get rid of it. It is not just about fixing it. It is also about getting rid of the bloat that the previous implementation depended on. Not only in LoC but also in the resulting binary:
$ size mm/built-in.o
base:
   text    data     bss     dec     hex filename
 534283  233703  163456  931442   e3672 mm/built-in.o
rework:
   text    data     bss     dec     hex filename
 532866  229591  163456  925913   e20d9 mm/built-in.o
I would like to get rid of as much of the special code as possible. Especially this one, which I personally hate because it is a crude hack that shouldn't have existed. > There is a concensus that cgroups and the controllers were merged > before they were ready and we are now struggling heavily to iron out > the design mistakes with the minimum amount of disruption we can get > away with. > > We are also at this time coordinating with all the other controllers > and the cgroup core to do exactly that, where Tejun is providing us > with tools to revamp the problematic interfaces. > > And we agree that soft limits were such a design mistake that should > be ironed out. > > So for the love of everything we hold dear, why would you think that > NOW is a good time to fix the implemantion and get people to use it? There are users who already use this feature and it will take some (read: a lot of) time to move them to something else. And that something else still doesn't exist, and I suspect it will take some time to push it into a proper shape (and to be sure we do not screw it up this time). So while I agree that we need something more (semantically) reasonable, there is no need to keep this crippled implementation around, especially when it is a non-trivial amount of code. > > > > > You have not shown that prio-0 scans are a problem. > > > > > > > > OK, I thought this was self evident but let me be more specific. > > > > > > > > The scan the world is almost always a problem. We are no longer doing > > > > proportional anon/file reclaim (swappiness is ignored). This is wrong > > > > from at least two points of view. 
Firstly it makes the reclaim decisions > > > > different a lot for groups that are under the soft limit and those > > > > that are over. Secondly, and more importantly, this might lead to a > > > > pre-mature swapping, especially when there is a lot of IO going on. > > > > > > > > The global reclaim suffers from the very same problem and that is why > > > > we try to prevent from prio-0 reclaim as much as possible and use it > > > > only as a last resort. > > > > > > I know that and I can see that this should probably be fixed, but > > > there is no quantification for this. We have no per-memcg reclaim > > > statistics > > > > Not having statistic is a separate issue. It makes the situation worse > > but that is not a new thing. The old implementation is even worse > > because the soft reclaim activity is basically hidden from global > > reclaim counters. So a lot of pages might get scanned and we will have > > no way to find out. That part is inherently fixed by the series because > > of the integration. > > Because it's in the *global* reclaim counters? That's great but it > does not address the problem at all. This is about pressure balance > between groups and you don't have any numbers for that. yes, but the point was that if somebody uses soft reclaim currently you would miss a big part of reclaim activity because soft reclaim is not accounted even in the global counters. So you can see a long stall during direct reclaim while the counters look all good. > All I'm saying is that before changing how the pressure is balanced we > need to know per-memcg statistics to quantify it and get an insight > into what we are actually doing. We can still see the indirect effects of the reclaim. E.g. performance dropdown due to re-faults/swapping. My previous tests cared only about the restricted group which, I admit, is very superficial. I am working on a setup where I am measuring both. > You respond with a wall of text but you don't address the problem at > all. > > And before doing all that, we should get the user-visible interface > right, which we all agreed is broken. > > > > > That simple call from kswapd is not that simple at all in fact. It hides > > > > a lot of memcg specific code which is far from being trivial. Even worse > > > > that memcg specific code gets back to the reclaim code with different > > > > reclaim parameters than those used from the context it has been called > > > > from. > > > > > > It does not matter to understanding generic reclaim code, though, and > > > acts more like the shrinkers. We send it off to get memory and it > > > comes back with results. > > > > Shrinker interface is just too bad. It might work for dentries and > > inodes but it failed in many other subsystems where it ended up in > > do-something mode. Soft reclaim is yet another example where we are > > doing an artificial scan-the-world reclaim to hammer somebody. Fairness > > is basically impossible to guarantee and there are corner cases which > > are just waiting to explode. > > Every time you reply you are just attacking bits of my argument in a > way that is completely irrelevant to the discussion. What is the > overall objective that you are trying to defend? > > I said that you are making the interface more complex because the > current interface is leaving the complexity encapsulated in memcg > code. It does not matter one bit that some shrinkers are set up > incorrectly, that entirely misses the point. My point was that this is another example of those shrinkers that do a bad job. 
> Michal, it's completely unobvious what your longterm goals are for > soft limits and guarantees. I've already said that I am open to discussing a new interface with better semantics that provides guarantees. Once we settle on that one we can deprecate the soft limit and, after a long time, ditch it completely. > And without that it's hard to comprehend how and if the patches you > are sending push into the right direction. Every time I try to > discuss the bigger picture you derail it with details about how the > implementation is broken. It's frustrating. I am sorry to hear that. I thought that the high-level things were clear. Soft limit sucks and we need something better. That is clear. I am just arguing that waiting for that something better shouldn't stop us from working on the current interface, as this series removes code and I believe it eventually even helps loads that currently rely on the soft limit. > This series is a grab bag of fixes that drag a lot of complexity from > memcg code into generic reclaim, to repair the age old implementation > of a user-visible interface that we already agree sucks balls and > should be deprecated. The fact that you did not even demonstrate that > the repair itself was successful is a secondary issue at this point, > but it certainly didn't help your case. I will follow up with the testing results later. I hope I manage to have them before I leave on vacation. The question, though, is whether even results supporting my claims about enhancements would make any difference. To be honest I thought that the integration would be a non-controversial topic even without the performance improvements that could be expected from removing prio-0 reclaim, which is pure evil. Also the natural fairness of the soft reclaim sounds like a good thing. Anyway. Your wording that nothing should be done about the soft reclaim seems to be quite clear though. If this position is really firm then go ahead and NACK the series _explicitly_ so that Andrew or you can send a revert request to Linus. I would really like to not waste a lot of time on testing right now when it wouldn't lead to anything. Thanks -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework
@ 2013-09-17 19:56 Johannes Weiner
  2013-09-17 20:57 ` Andrew Morton
  0 siblings, 1 reply; 30+ messages in thread
From: Johannes Weiner @ 2013-09-17 19:56 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel,
Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa

On Mon, Sep 16, 2013 at 06:44:05PM +0200, Michal Hocko wrote:
> On Fri 13-09-13 12:17:09, Johannes Weiner wrote:
> > On Fri, Sep 13, 2013 at 04:49:53PM +0200, Michal Hocko wrote:
> > > On Fri 06-09-13 15:23:11, Johannes Weiner wrote:
> [...]
> > > > I would really like to deprecate soft limits and introduce something
> > > > new that has the proper semantics we want from the get-go. Its
> > > > implementation could very much look like your code, so we can easily
> > > > reuse that. But the interface and its semantics should come first.
> > >
> > > I am open to discussing such a change, I just do not see any reason to
> > > have a crippled soft reclaim implementation in the meantime,
> > > especially when it doesn't look like such a new interface is easy to
> > > agree on.
> >
> > We had a crippled soft limit implementation from the time it was
> > merged; it never worked better than now.
> >
> > You seem to think that this is an argument *for* finally fixing it. I
> > disagree. We should absolutely *avoid* steering people toward it now,
> > when the long term plan is already to get rid of it.
>
> It is not just about fixing it. It is also about getting rid of the
> bloat that the previous implementation depended on. Not only LoC but
> also the resulting binary:
>
> $ size mm/built-in.o
> base:
>    text    data    bss     dec     hex   filename
>  534283  233703  163456  931442   e3672  mm/built-in.o
> rework:
>    text    data    bss     dec     hex   filename
>  532866  229591  163456  925913   e20d9  mm/built-in.o
>
> I would like to get rid of as much of the special code as possible,
> especially this one, which I hate personally because it is a crude hack
> that shouldn't have existed.

The interface is, not the implementation.

> > There is a consensus that cgroups and the controllers were merged
> > before they were ready and we are now struggling heavily to iron out
> > the design mistakes with the minimum amount of disruption we can get
> > away with.
> >
> > We are also at this time coordinating with all the other controllers
> > and the cgroup core to do exactly that, where Tejun is providing us
> > with tools to revamp the problematic interfaces.
> >
> > And we agree that soft limits were such a design mistake that should
> > be ironed out.
> >
> > So for the love of everything we hold dear, why would you think that
> > NOW is a good time to fix the implementation and get people to use it?
>
> There are users who already use this feature and it will take some (read:
> a lot of) time to move them to something else. And that something else
> still doesn't exist, and I suspect it will take some time to push it into
> a proper shape (and be sure we do not screw it up this time).

Yet you could not name a single sensible use case. So no, we have no
known users, except maybe Google, who actually want guarantees.

> So while I agree that we need something more (semantically) reasonable,
> there is no need to keep this crippled implementation around, especially
> when it is a non-trivial amount of code.

You traded a non-trivial amount of code for non-trivial code. And you
moved the cost from optional memcg code to essential memory management
code. All for a feature that needs to be redesigned and has a
questionable userbase.

> > > > > > You have not shown that prio-0 scans are a problem.
> > > > >
> > > > > OK, I thought this was self evident but let me be more specific.
> > > > >
> > > > > Scan-the-world is almost always a problem. We are no longer doing
> > > > > proportional anon/file reclaim (swappiness is ignored). This is wrong
> > > > > from at least two points of view. Firstly, it makes the reclaim
> > > > > decisions a lot different for groups that are under the soft limit
> > > > > and those that are over. Secondly, and more importantly, this might
> > > > > lead to premature swapping, especially when there is a lot of IO
> > > > > going on.
> > > > >
> > > > > Global reclaim suffers from the very same problem and that is why
> > > > > we try to prevent prio-0 reclaim as much as possible and use it
> > > > > only as a last resort.
> > > >
> > > > I know that and I can see that this should probably be fixed, but
> > > > there is no quantification for this. We have no per-memcg reclaim
> > > > statistics
> > >
> > > Not having statistics is a separate issue. It makes the situation worse
> > > but that is not a new thing. The old implementation is even worse
> > > because the soft reclaim activity is basically hidden from the global
> > > reclaim counters. So a lot of pages might get scanned and we will have
> > > no way to find out. That part is inherently fixed by the series because
> > > of the integration.
> >
> > Because it's in the *global* reclaim counters? That's great but it
> > does not address the problem at all. This is about pressure balance
> > between groups and you don't have any numbers for that.
>
> Yes, but the point was that if somebody uses soft reclaim currently you
> would miss a big part of the reclaim activity because soft reclaim is not
> accounted even in the global counters. So you can see a long stall
> during direct reclaim while the counters all look good.

This is not going anywhere :(

> > This series is a grab bag of fixes that drag a lot of complexity from
> > memcg code into generic reclaim, to repair the age-old implementation
> > of a user-visible interface that we already agree sucks balls and
> > should be deprecated. The fact that you did not even demonstrate that
> > the repair itself was successful is a secondary issue at this point,
> > but it certainly didn't help your case.
>
> I will follow up with the testing results later. I hope I manage to have
> them before I leave on vacation.
>
> The question, though, is whether even results supporting my claims about
> the enhancements would make any difference. To be honest, I thought that
> the integration would be a non-controversial topic even without the
> performance improvements which could have been expected from removing
> prio-0 reclaim, which is pure evil. Also, the natural fairness of the
> soft reclaim sounds like a good thing.
>
> Anyway. Your wording that nothing should be done about the soft reclaim
> seems to be quite clear though. If this position is really firm then go
> ahead and NACK the series _explicitly_ so that Andrew or you can send a
> revert request to Linus. I would really like not to waste a lot of time
> on testing right now when it wouldn't lead to anything.

Nacked-by: Johannes Weiner <hannes@cmpxchg.org>
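To make the "prio-0" point above concrete: the per-LRU scan target during reclaim is roughly lru_size >> priority, so the priority-0 passes forced by the old soft limit reclaim path make the entire list eligible in a single pass and leave no room for the proportional anon/file balancing that swappiness normally provides. The standalone C sketch below is purely illustrative and is not kernel code; the real get_scan_count() applies swappiness and further heuristics on top of this shift, and the LRU size used here is a made-up number.

#include <stdio.h>

/*
 * Illustrative only: reclaim scales the per-LRU scan target by the
 * reclaim priority, roughly scan = lru_size >> priority.  At DEF_PRIORITY
 * only a small slice of each list is scanned per pass; at priority 0 the
 * whole list becomes the target -- the "scan the world" case discussed
 * above.
 */
#define DEF_PRIORITY 12

static unsigned long scan_target(unsigned long lru_size, int priority)
{
	return lru_size >> priority;
}

int main(void)
{
	unsigned long lru_size = 1UL << 20;	/* hypothetical LRU of 1M pages */
	int prio;

	for (prio = DEF_PRIORITY; prio >= 0; prio--)
		printf("priority %2d -> scan %7lu of %lu pages\n",
		       prio, scan_target(lru_size, prio), lru_size);
	return 0;
}

Compiled with any C compiler, this prints the scan target growing from a 1/4096 slice of the list at priority 12 to all 1048576 pages at priority 0, which is why removing the prio-0 pass is cited above as the main expected source of improvement.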
* Re: [PATCH v5] Soft limit rework
  2013-09-17 19:56 ` Johannes Weiner
@ 2013-09-17 20:57   ` Andrew Morton
  0 siblings, 0 replies; 30+ messages in thread
From: Andrew Morton @ 2013-09-17 20:57 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel,
Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa

On Tue, 17 Sep 2013 15:56:15 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> > Anyway. Your wording that nothing should be done about the soft reclaim
> > seems to be quite clear though. If this position is really firm then go
> > ahead and NACK the series _explicitly_ so that Andrew or you can send a
> > revert request to Linus. I would really like not to waste a lot of time
> > on testing right now when it wouldn't lead to anything.
>
> Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

OK, I queued a bunch of reverts for when Linus gets back to his desk.

It's all my fault for getting distracted for a week (in the middle of
the dang merge window) and not noticing that we haven't yet settled on
a direction. Sorry about that.
Thread overview: 30+ messages
2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko
2013-06-18 12:09 ` [PATCH v5 1/8] memcg, vmscan: integrate soft reclaim tighter with zone shrinking code Michal Hocko
2013-06-18 12:09 ` [PATCH v5 2/8] memcg: Get rid of soft-limit tree infrastructure Michal Hocko
2013-06-18 12:09 ` [PATCH v5 3/8] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko
2013-06-18 12:09 ` [PATCH v5 4/8] memcg: enhance memcg iterator to support predicates Michal Hocko
2013-06-18 12:09 ` [PATCH v5 5/8] memcg: track children in soft limit excess to improve soft limit Michal Hocko
2013-06-18 12:09 ` [PATCH v5 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything Michal Hocko
2013-06-18 12:09 ` [PATCH v5 7/8] memcg: Track all children over limit in the root Michal Hocko
2013-06-18 12:09 ` [PATCH v5 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly Michal Hocko
2013-06-18 19:01 ` [PATCH v5] Soft limit rework Johannes Weiner
2013-06-19 10:20 ` Michal Hocko
2013-06-20 11:12 ` Mel Gorman
2013-06-21 14:06 ` Michal Hocko
2013-06-21 14:09 ` Michal Hocko
2013-06-21 15:04 ` Michal Hocko
2013-06-21 15:09 ` Michal Hocko
2013-06-21 16:34 ` Tejun Heo
2013-06-25 15:49 ` Michal Hocko
2013-08-19 16:35 ` Johannes Weiner
2013-08-20 9:14 ` Michal Hocko
2013-08-20 14:13 ` Johannes Weiner
2013-08-22 10:58 ` Michal Hocko
2013-09-03 16:15 ` Johannes Weiner
2013-09-04 16:38 ` Michal Hocko
2013-09-06 19:23 ` Johannes Weiner
2013-09-13 14:49 ` Michal Hocko
2013-09-13 16:17 ` Johannes Weiner
2013-09-16 16:44 ` Michal Hocko
2013-09-17 19:56 ` Johannes Weiner
2013-09-17 20:57 ` Andrew Morton