* [PATCH v5] Soft limit rework
@ 2013-06-18 12:09 Michal Hocko
2013-06-18 12:09 ` [PATCH v5 1/8] memcg, vmscan: integrate soft reclaim tighter with zone shrinking code Michal Hocko
` (10 more replies)
0 siblings, 11 replies; 30+ messages in thread
From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo,
Balbir Singh, Glauber Costa
Hi,
This is the fifth version of the patchset.
Summary of versions:
The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973
(lkml wasn't CCed at the time so I cannot find it in lwn.net
archives). There were no major objections.
The second version has been posted here http://lwn.net/Articles/548191/
as a part of a longer and spicier thread which started after LSF here:
https://lwn.net/Articles/548192/
Version number 3 has been posted here http://lwn.net/Articles/550409/
Johannes was worried about setups with thousands of memcgs and the
tree walk overhead of the soft reclaim pass when nobody is in excess.
Version number 4 has been posted here http://lwn.net/Articles/552703/
It sparked a heated discussion about the memcg iterator predicate which
ended with the conclusion that the predicate based iteration is "the shortest path to
implementing subtree skip given how the iterator is put together
currently and the series as a whole reduces significant amount of
complexity, so it is an acceptable tradeoff to proceed with this
implementation with later restructuring of the iterator."
(http://thread.gmane.org/gmane.linux.kernel.mm/101162/focus=101560)
Changes between RFC (aka V1) -> V2
As there were no major objections, there were only some minor cleanups
since the last version, and I have moved "memcg: Ignore soft limit until
it is explicitly specified" to the end of the series.
Changes between V2 -> V3
No changes in the code since the last version. I have just rebased the
series on top of the current mmotm tree. The most controversial part
has been dropped (the last patch "memcg: Ignore soft limit until it is
explicitly specified") so there are no semantical changes to the soft
limit behavior. This makes this work mostly a code clean up and code
reorganization. Nevertheless, this is enough to make the soft limit work
more efficiently according to my testing and groups above the soft limit
are reclaimed much less as a result.
Changes between V3->V4
Added some Reviewed-bys, but the biggest change comes from Johannes'
concern about the tree traversal overhead with a huge number of memcgs
(http://thread.gmane.org/gmane.linux.kernel.cgroups/7307/focus=100326).
This version addresses the problem by augmenting the memcg tree with
the number of over-soft-limit children at each level of the hierarchy.
See more below.
Changes between V4->V5
Rebased on top of the mmotm tree (without the slab shrinkers patchset
because there are issues with that patchset) and retested as there were
many kswapd changes (results are more or less consistent; more on that
below). There were only doc updates, no code changes.
Please let me know if this has any chance to get merged into 3.11. I do
not want to push it too hard, but I think this work is basically ready
and waiting longer doesn't help. I can live with the 3.12 merge window as well
if 3.11 sounds too early though.
The basic idea is quite simple. Pull soft reclaim into shrink_zone in
the first step and get rid of the previous soft reclaim infrastructure.
shrink_zone is done in two passes now. First it tries to do the soft
limit reclaim and falls back to reclaim-all mode if no group is over
the limit or no pages have been scanned. The second pass happens at the
same priority, so the only time we waste is the memcg tree walk, which
has been updated in the third step to have only negligible overhead.
As a bonus we get rid of a _lot_ of code and soft reclaim no longer
stands out like before, when it was not integrated into the zone
shrinking code and reclaimed at priority 0 (the testing results show
that some workloads suffer from such an aggressive reclaim). The
cleanup is in a separate patch because I felt it would be easier to
review that way.
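To illustrate the two-pass structure, here is a condensed sketch of what
shrink_zone ends up looking like after the first patch (taken from the
diff further down in this thread, slightly trimmed):

	static void shrink_zone(struct zone *zone, struct scan_control *sc)
	{
		bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
		unsigned long nr_scanned = sc->nr_scanned;

		/* first pass: only groups eligible for soft reclaim */
		__shrink_zone(zone, sc, do_soft_reclaim);

		/*
		 * Nothing was scanned in the first pass (no group is over
		 * the soft limit or those that are have no pages in this
		 * zone) so fall back to reclaiming everybody at the same
		 * priority.
		 */
		if (do_soft_reclaim && sc->nr_scanned == nr_scanned)
			__shrink_zone(zone, sc, false);
	}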
The second step is soft limit reclaim integration into targeted
reclaim. It should be rather straightforward. The soft limit has been
used only for the global reclaim so far, but it makes sense for any
kind of pressure coming from up the hierarchy, including targeted
reclaim.
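The resulting eligibility check (condensed from patch 3, which adds the
reclaim root parameter) simply walks the hierarchy up to the root of the
current reclaim:

	/*
	 * A group is eligible for the soft limit reclaim under the given
	 * root hierarchy if it or any parent up to that root is over its
	 * soft limit.
	 */
	bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
			struct mem_cgroup *root)
	{
		struct mem_cgroup *parent = memcg;

		if (res_counter_soft_limit_excess(&memcg->res))
			return true;

		while ((parent = parent_mem_cgroup(parent))) {
			if (res_counter_soft_limit_excess(&parent->res))
				return true;
			if (parent == root)
				break;
		}
		return false;
	}

(Patch 4 later changes the return value to a VISIT/SKIP/SKIP_TREE filter
result so the function can be used directly as an iterator predicate.)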
The third step (patches 4-8) addresses the tree walk overhead by
enhancing memcg iterators to enable skipping whole subtrees and by
tracking the number of over-soft-limit children at each level of the
hierarchy. This information is updated the same way the old soft limit
tree was updated (from memcg_check_events) so we shouldn't see any
additional overhead. In fact mem_cgroup_update_soft_limit is much
simpler than the tree manipulation done previously.
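A rough sketch of that bookkeeping, based on the changelog of patch 5
(the names children_in_excess and soft_contributed come from there; the
exact data types and locking are simplified assumptions, not the actual
patch):

	/*
	 * Sketch only: called from the memcg_check_events() path when the
	 * group's soft limit state may have changed.
	 */
	static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg)
	{
		bool in_excess = res_counter_soft_limit_excess(&memcg->res) > 0;
		struct mem_cgroup *parent;

		/* account only the transition over/under the limit */
		if (in_excess == memcg->soft_contributed)
			return;
		memcg->soft_contributed = in_excess;

		/* propagate the change to all parents up the hierarchy */
		for (parent = parent_mem_cgroup(memcg); parent;
				parent = parent_mem_cgroup(parent)) {
			if (in_excess)
				atomic_inc(&parent->children_in_excess);
			else
				atomic_dec(&parent->children_in_excess);
		}
	}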
__shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
mem_cgroup_iter, so the decision whether a particular group should be
visited is made at the iterator level, which also allows us to skip a
whole subtree (if there is no child in excess). This reduces the tree
walk overhead considerably.
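In code this looks roughly like the following (the loop is condensed
from patch 4; the SKIP_TREE decision based on children_in_excess is
what patch 5 adds, summarized here from its changelog):

	mem_cgroup_iter_filter filter = soft_reclaim ?
		mem_cgroup_soft_reclaim_eligible : NULL;

	while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) {
		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

		shrink_lruvec(lruvec, sc);
		/* ... */
	}

with the predicate returning one of

	VISIT     - the group (or a parent up to root) is over its soft limit
	SKIP      - not over the limit but some child is, keep descending
	SKIP_TREE - no child in excess either, skip the whole subtree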
My primary test case was a parallel kernel build with 2 groups (make
running with -j4 with a distribution .config in a separate cgroup
without any hard limit) on an 8 CPU machine booted with 1GB memory. I
was mostly interested in 2 setups: default - no soft limit set - and
0 soft limit set for both groups.
The first one should tell us whether the rework regresses the default
behavior while the second one should show us improvements in an extreme
case where both workloads are always over the soft limit.
/usr/bin/time -v has been used to collect the statistics and each
configuration had 3 runs after a fresh boot without any other load on the
system.
base is mmotm-2013-05-09-15-57
baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots
without the slab shrinkers patchset.
reworkrebase is all 8 patches applied on top of baserebase.
* No-limit
User
base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6
baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6
reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6
System
base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6
baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6
reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6
Elapsed
base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6
baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6
reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6
Elapsed time regressed by 13% wrt. base, but it seems that this came
from baserebase, which regressed by the same amount.
* 0-limit
User
base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6
baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6
reworkrebase: min: 1169.88 [98.5%] max: 1177.84 [98.3%] avg: 1173.50 [98.3%] std: 2.79 runs: 6
System
base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6
baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6
reworkrebase: min: 235.19 [94.7%] max: 237.43 [94.2%] avg: 236.35 [94.5%] std: 0.86 runs: 6
Elapsed
base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6
baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6
reworkrebase: min: 667.54 [87.9%] max: 718.54 [89.2%] avg: 695.61 [88.6%] std: 17.16 runs: 6
System time is slightly better but I wouldn't consider it relevant.
Elapsed time is more interesting though. baserebase regresses by 16%
again, which is on par with the no-limit configuration.
With the patchset applied we are 11% better on average wrt. the old
base, but it is important to realize that this is still 76.3% wrt.
baserebase, so the effect of the series is comparable to the previous
version, albeit the absolute result is worse.
Page fault statistics tell us at least part of the story:
Minor
base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6
baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6
reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6
Major
base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6
baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6
reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6
While the minor faults are within the noise, the major faults are
reduced considerably. This looks like an aggressive pageout during the
reclaim, and that pageout presumably affects the working set. Please
note that baserebase has an even higher number of major page faults
than the older mmotm tree.
While this looks like a nice win, it is fair to say that there are some
workloads that actually benefit from reclaim at priority 0 (from
background reclaim). E.g. an aggressive streaming IO would like to get
rid of as many pages as possible and not block on the pages under
writeback. This can lead to a higher System time, but the Elapsed time
I got was generally comparable.
The following results are from a 2-group configuration on an 8GB machine
(A running streaming IO with 4*TotalMem size and a 0 soft limit, B
running a mem_eater which consumes TotalMem-1G without any limit).
System
base: min: 124.88 max: 136.97 avg: 130.77 std: 4.94 runs: 3
baserebase: min: 102.51 [82.1%] max: 108.84 [79.5%] avg: 104.81 [80.1%] std: 2.86 runs: 3
reworkrebase: min: 108.29 [86.7%] max: 121.70 [88.9%] avg: 114.60 [87.6%] std: 5.50 runs: 3
Elapsed
base: min: 398.86 max: 412.81 avg: 407.62 std: 6.23 runs: 3
baserebase: min: 480.92 [120.6%] max: 497.56 [120.5%] avg: 491.46 [120.6%] std: 7.48 runs: 3
reworkrebase: min: 397.19 [99.6%] max: 462.57 [112.1%] avg: 436.13 [107.0%] std: 28.12 runs: 3
baserebase regresses again by 20% and the series is worse by 7%, but it
is still at 89% wrt. baserebase, so it looks good to me.
So to wrap this up: the series is still doing well and improves the soft
limit reclaim.
The testing results for a bunch of cgroups with both stream IO and
kbuild loads can be found in "memcg: track children in soft limit excess
to improve soft limit".
The series has seen quite some testing and I guess it is in a state to
be merged into mmotm and hopefully get into 3.11. I would like to hear
back from Johannes and Kamezawa about this timing though.
Shortlog says:
Michal Hocko (8):
memcg, vmscan: integrate soft reclaim tighter with zone shrinking code
memcg: Get rid of soft-limit tree infrastructure
vmscan, memcg: Do softlimit reclaim also for targeted reclaim
memcg: enhance memcg iterator to support predicates
memcg: track children in soft limit excess to improve soft limit
memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything
memcg: Track all children over limit in the root
memcg, vmscan: do not fall into reclaim-all pass too quickly
And the diffstat shows that we still got rid of a lot of code:
include/linux/memcontrol.h | 54 ++++-
mm/memcontrol.c | 565 +++++++++++++--------------------------------
mm/vmscan.c | 83 ++++---
3 files changed, 254 insertions(+), 448 deletions(-)
^ permalink raw reply [flat|nested] 30+ messages in thread* [PATCH v5 1/8] memcg, vmscan: integrate soft reclaim tighter with zone shrinking code 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 2/8] memcg: Get rid of soft-limit tree infrastructure Michal Hocko ` (9 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Memcg soft reclaim has been traditionally triggered from the global reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim then picked up a group which exceeds the soft limit the most and reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. The infrastructure requires per-node-zone trees which hold over-limit groups and keep them up-to-date (via memcg_check_events) which is not cost free. Although this overhead hasn't turned out to be a bottle neck the implementation is suboptimal because mem_cgroup_update_tree has no idea which zones consumed memory over the limit so we could easily end up having a group on a node-zone tree having only few pages from that node-zone. This patch doesn't try to fix node-zone trees management because it seems that integrating soft reclaim into zone shrinking sounds much easier and more appropriate for several reasons. First of all 0 priority reclaim was a crude hack which might lead to big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim should be applicable also to the targeted reclaim which is awkward right now without additional hacks. Last but not least the whole infrastructure eats quite some code. After this patch shrink_zone is done in 2 passes. First it tries to do the soft reclaim if appropriate (only for global reclaim for now to keep compatible with the original state) and fall back to ignoring soft limit if no group is eligible to soft reclaim or nothing has been scanned during the first pass. Only groups which are over their soft limit or any of their parents up the hierarchy is over the limit are considered eligible during the first pass. Soft limit tree which is not necessary anymore will be removed in the follow up patch to make this patch smaller and easier to review. 
Changes since v2 - Do not try soft reclaim pass if mem_cgroup_disabled Changes since v1 - __shrink_zone doesn't return the number of shrunk groups as nr_scanned test covers both no groups scanned and no pages from the required zone as pointed by Johannes Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Tejun Heo <tj@kernel.org> --- include/linux/memcontrol.h | 10 +-- mm/memcontrol.c | 161 ++++++--------------------------------------- mm/vmscan.c | 62 +++++++++-------- 3 files changed, 59 insertions(+), 174 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 7b4d9d7..d495b9e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -180,9 +180,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -359,11 +357,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) { - return 0; + return false; } static inline void mem_cgroup_split_huge_fixup(struct page *head) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 96e6168..ea10f73 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2101,57 +2101,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) } #endif -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, - struct zone *zone, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - struct mem_cgroup *victim = NULL; - int total = 0; - int loop = 0; - unsigned long excess; - unsigned long nr_scanned; - struct mem_cgroup_reclaim_cookie reclaim = { - .zone = zone, - .priority = 0, - }; +/* + * A group is eligible for the soft limit reclaim if it is + * a) is over its soft limit + * b) any parent up the hierarchy is over its soft limit + */ +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +{ + struct mem_cgroup *parent = memcg; - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; - - while (1) { - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); - if (!victim) { - loop++; - if (loop >= 2) { - /* - * If we have not been able to reclaim - * anything, it might because there are - * no reclaimable pages under this hierarchy - */ - if (!total) - break; - /* - * We want to do more targeted reclaim. - * excess >> 2 is not to excessive so as to - * reclaim too much, nor too less that we keep - * coming back to reclaim from this cgroup - */ - if (total >= (excess >> 2) || - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) - break; - } - continue; - } - if (!mem_cgroup_reclaimable(victim, false)) - continue; - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, - zone, &nr_scanned); - *total_scanned += nr_scanned; - if (!res_counter_soft_limit_excess(&root_memcg->res)) - break; + if (res_counter_soft_limit_excess(&memcg->res)) + return true; + + /* + * If any parent up the hierarchy is over its soft limit then we + * have to obey and reclaim from this group as well. 
+ */ + while((parent = parent_mem_cgroup(parent))) { + if (res_counter_soft_limit_excess(&parent->res)) + return true; } - mem_cgroup_iter_break(root_memcg, victim); - return total; + + return false; } /* @@ -4777,98 +4748,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, return ret; } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - unsigned long nr_reclaimed = 0; - struct mem_cgroup_per_zone *mz, *next_mz = NULL; - unsigned long reclaimed; - int loop = 0; - struct mem_cgroup_tree_per_zone *mctz; - unsigned long long excess; - unsigned long nr_scanned; - - if (order > 0) - return 0; - - mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); - /* - * This loop can run a while, specially if mem_cgroup's continuously - * keep exceeding their soft limit and putting the system under - * pressure - */ - do { - if (next_mz) - mz = next_mz; - else - mz = mem_cgroup_largest_soft_limit_node(mctz); - if (!mz) - break; - - nr_scanned = 0; - reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone, - gfp_mask, &nr_scanned); - nr_reclaimed += reclaimed; - *total_scanned += nr_scanned; - spin_lock(&mctz->lock); - - /* - * If we failed to reclaim anything from this memory cgroup - * it is time to move on to the next cgroup - */ - next_mz = NULL; - if (!reclaimed) { - do { - /* - * Loop until we find yet another one. - * - * By the time we get the soft_limit lock - * again, someone might have aded the - * group back on the RB tree. Iterate to - * make sure we get a different mem. - * mem_cgroup_largest_soft_limit_node returns - * NULL if no other cgroup is present on - * the tree - */ - next_mz = - __mem_cgroup_largest_soft_limit_node(mctz); - if (next_mz == mz) - css_put(&next_mz->memcg->css); - else /* next_mz == NULL or other memcg */ - break; - } while (1); - } - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); - excess = res_counter_soft_limit_excess(&mz->memcg->res); - /* - * One school of thought says that we should not add - * back the node to the tree if reclaim returns 0. - * But our reclaim could return 0, simply because due - * to priority we are exposing a smaller subset of - * memory to reclaim from. Consider this as a longer - * term TODO. - */ - /* If excess == 0, no tree ops */ - __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess); - spin_unlock(&mctz->lock); - css_put(&mz->memcg->css); - loop++; - /* - * Could not reclaim anything and there are no more - * mem cgroups to try or we seem to be looping without - * reclaiming anything. 
- */ - if (!nr_reclaimed && - (next_mz == NULL || - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) - break; - } while (!nr_reclaimed); - if (next_mz) - css_put(&next_mz->memcg->css); - return nr_reclaimed; -} - /** * mem_cgroup_force_empty_list - clears LRU of a group * @memcg: group to clear diff --git a/mm/vmscan.c b/mm/vmscan.c index b96faea..d8823f0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -139,11 +139,21 @@ static bool global_reclaim(struct scan_control *sc) { return !sc->target_mem_cgroup; } + +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) +{ + return !mem_cgroup_disabled() && global_reclaim(sc); +} #else static bool global_reclaim(struct scan_control *sc) { return true; } + +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) +{ + return false; +} #endif static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru) @@ -2124,7 +2134,8 @@ static inline bool should_continue_reclaim(struct zone *zone, } } -static void shrink_zone(struct zone *zone, struct scan_control *sc) +static void +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) { unsigned long nr_reclaimed, nr_scanned; @@ -2143,6 +2154,12 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) do { struct lruvec *lruvec; + if (soft_reclaim && + !mem_cgroup_soft_reclaim_eligible(memcg)) { + memcg = mem_cgroup_iter(root, memcg, &reclaim); + continue; + } + lruvec = mem_cgroup_zone_lruvec(zone, memcg); shrink_lruvec(lruvec, sc); @@ -2173,6 +2190,24 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) sc->nr_scanned - nr_scanned, sc)); } + +static void shrink_zone(struct zone *zone, struct scan_control *sc) +{ + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); + unsigned long nr_scanned = sc->nr_scanned; + + __shrink_zone(zone, sc, do_soft_reclaim); + + /* + * No group is over the soft limit or those that are do not have + * pages in the zone we are reclaiming so we have to reclaim everybody + */ + if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) { + __shrink_zone(zone, sc, false); + return; + } +} + /* Returns true if compaction should go ahead for a high-order request */ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) { @@ -2234,8 +2269,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) { struct zoneref *z; struct zone *zone; - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; bool aborted_reclaim = false; /* @@ -2275,18 +2308,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) continue; } } - /* - * This steals pages from memory cgroups over softlimit - * and returns the number of reclaimed pages and - * scanned pages. This works for global memory pressure - * and balancing, not for a memcg's limit. - */ - nr_soft_scanned = 0; - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - sc->order, sc->gfp_mask, - &nr_soft_scanned); - sc->nr_reclaimed += nr_soft_reclaimed; - sc->nr_scanned += nr_soft_scanned; /* need some check for avoid more shrink_zone() */ } @@ -2881,8 +2902,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, { int i; int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .priority = DEF_PRIORITY, @@ -2997,15 +3016,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, sc.nr_scanned = 0; - nr_soft_scanned = 0; - /* - * Call soft limit reclaim before calling shrink_zone. - */ - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - order, sc.gfp_mask, - &nr_soft_scanned); - sc.nr_reclaimed += nr_soft_reclaimed; - /* * There should be no need to raise the scanning * priority if enough pages are already being scanned -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 2/8] memcg: Get rid of soft-limit tree infrastructure 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko 2013-06-18 12:09 ` [PATCH v5 1/8] memcg, vmscan: integrate soft reclaim tighter with zone shrinking code Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 3/8] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko ` (8 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Now that the soft limit is integrated to the reclaim directly the whole soft-limit tree infrastructure is not needed anymore. Rip it out. Changes v1->v2 - mem_cgroup_reclaimable is no longer used - test_mem_cgroup_node_reclaimable is not used outside NUMA code Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Tejun Heo <tj@kernel.org> --- mm/memcontrol.c | 265 +------------------------------------------------------- 1 file changed, 2 insertions(+), 263 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ea10f73..cdc8aed 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -39,7 +39,6 @@ #include <linux/limits.h> #include <linux/export.h> #include <linux/mutex.h> -#include <linux/rbtree.h> #include <linux/slab.h> #include <linux/swap.h> #include <linux/swapops.h> @@ -137,7 +136,6 @@ static const char * const mem_cgroup_lru_names[] = { */ enum mem_cgroup_events_target { MEM_CGROUP_TARGET_THRESH, - MEM_CGROUP_TARGET_SOFTLIMIT, MEM_CGROUP_TARGET_NUMAINFO, MEM_CGROUP_NTARGETS, }; @@ -173,10 +171,6 @@ struct mem_cgroup_per_zone { struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1]; - struct rb_node tree_node; /* RB tree node */ - unsigned long long usage_in_excess;/* Set to the value by which */ - /* the soft limit is exceeded*/ - bool on_tree; struct mem_cgroup *memcg; /* Back pointer, we cannot */ /* use container_of */ }; @@ -185,26 +179,6 @@ struct mem_cgroup_per_node { struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; }; -/* - * Cgroups above their limits are maintained in a RB-Tree, independent of - * their hierarchy representation - */ - -struct mem_cgroup_tree_per_zone { - struct rb_root rb_root; - spinlock_t lock; -}; - -struct mem_cgroup_tree_per_node { - struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES]; -}; - -struct mem_cgroup_tree { - struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES]; -}; - -static struct mem_cgroup_tree soft_limit_tree __read_mostly; - struct mem_cgroup_threshold { struct eventfd_ctx *eventfd; u64 threshold; @@ -523,7 +497,6 @@ static bool move_file(void) * limit reclaim to prevent infinite loops, if they ever occur. 
*/ #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100 -#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2 enum charge_type { MEM_CGROUP_CHARGE_TYPE_CACHE = 0, @@ -754,164 +727,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page) return mem_cgroup_zoneinfo(memcg, nid, zid); } -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_node_zone(int nid, int zid) -{ - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_from_page(struct page *page) -{ - int nid = page_to_nid(page); - int zid = page_zonenum(page); - - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static void -__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz, - unsigned long long new_usage_in_excess) -{ - struct rb_node **p = &mctz->rb_root.rb_node; - struct rb_node *parent = NULL; - struct mem_cgroup_per_zone *mz_node; - - if (mz->on_tree) - return; - - mz->usage_in_excess = new_usage_in_excess; - if (!mz->usage_in_excess) - return; - while (*p) { - parent = *p; - mz_node = rb_entry(parent, struct mem_cgroup_per_zone, - tree_node); - if (mz->usage_in_excess < mz_node->usage_in_excess) - p = &(*p)->rb_left; - /* - * We can't avoid mem cgroups that are over their soft - * limit by the same amount - */ - else if (mz->usage_in_excess >= mz_node->usage_in_excess) - p = &(*p)->rb_right; - } - rb_link_node(&mz->tree_node, parent, p); - rb_insert_color(&mz->tree_node, &mctz->rb_root); - mz->on_tree = true; -} - -static void -__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - if (!mz->on_tree) - return; - rb_erase(&mz->tree_node, &mctz->rb_root); - mz->on_tree = false; -} - -static void -mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - spin_lock(&mctz->lock); - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - spin_unlock(&mctz->lock); -} - - -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) -{ - unsigned long long excess; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - int nid = page_to_nid(page); - int zid = page_zonenum(page); - mctz = soft_limit_tree_from_page(page); - - /* - * Necessary to update all ancestors when hierarchy is used. - * because their event counter is not touched. - */ - for (; memcg; memcg = parent_mem_cgroup(memcg)) { - mz = mem_cgroup_zoneinfo(memcg, nid, zid); - excess = res_counter_soft_limit_excess(&memcg->res); - /* - * We have to update the tree if mz is on RB-tree or - * mem is over its softlimit. - */ - if (excess || mz->on_tree) { - spin_lock(&mctz->lock); - /* if on-tree, remove it */ - if (mz->on_tree) - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - /* - * Insert again. mz->usage_in_excess will be updated. - * If excess is 0, no tree ops. 
- */ - __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess); - spin_unlock(&mctz->lock); - } - } -} - -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) -{ - int node, zone; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - - for_each_node(node) { - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - mz = mem_cgroup_zoneinfo(memcg, node, zone); - mctz = soft_limit_tree_node_zone(node, zone); - mem_cgroup_remove_exceeded(memcg, mz, mctz); - } - } -} - -static struct mem_cgroup_per_zone * -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct rb_node *rightmost = NULL; - struct mem_cgroup_per_zone *mz; - -retry: - mz = NULL; - rightmost = rb_last(&mctz->rb_root); - if (!rightmost) - goto done; /* Nothing to reclaim from */ - - mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node); - /* - * Remove the node now but someone else can add it back, - * we will to add it back at the end of reclaim to its correct - * position in the tree. - */ - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); - if (!res_counter_soft_limit_excess(&mz->memcg->res) || - !css_tryget(&mz->memcg->css)) - goto retry; -done: - return mz; -} - -static struct mem_cgroup_per_zone * -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct mem_cgroup_per_zone *mz; - - spin_lock(&mctz->lock); - mz = __mem_cgroup_largest_soft_limit_node(mctz); - spin_unlock(&mctz->lock); - return mz; -} - /* * Implementation Note: reading percpu statistics for memcg. * @@ -1065,9 +880,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, case MEM_CGROUP_TARGET_THRESH: next = val + THRESHOLDS_EVENTS_TARGET; break; - case MEM_CGROUP_TARGET_SOFTLIMIT: - next = val + SOFTLIMIT_EVENTS_TARGET; - break; case MEM_CGROUP_TARGET_NUMAINFO: next = val + NUMAINFO_EVENTS_TARGET; break; @@ -1090,11 +902,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) /* threshold event is triggered in finer grain than soft limit */ if (unlikely(mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { - bool do_softlimit; bool do_numainfo __maybe_unused; - do_softlimit = mem_cgroup_event_ratelimit(memcg, - MEM_CGROUP_TARGET_SOFTLIMIT); #if MAX_NUMNODES > 1 do_numainfo = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_NUMAINFO); @@ -1102,8 +911,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) preempt_enable(); mem_cgroup_threshold(memcg); - if (unlikely(do_softlimit)) - mem_cgroup_update_tree(memcg, page); #if MAX_NUMNODES > 1 if (unlikely(do_numainfo)) atomic_inc(&memcg->numainfo_events); @@ -1964,6 +1771,7 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg, return total; } +#if MAX_NUMNODES > 1 /** * test_mem_cgroup_node_reclaimable * @memcg: the target memcg @@ -1986,7 +1794,6 @@ static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg, return false; } -#if MAX_NUMNODES > 1 /* * Always updating the nodemask is not very good - even if we have an empty @@ -2054,51 +1861,12 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) return node; } -/* - * Check all nodes whether it contains reclaimable pages or not. - * For quick scan, we make use of scan_nodes. This will allow us to skip - * unused nodes. But scan_nodes is lazily updated and may not cotain - * enough new information. We need to do double check. 
- */ -static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) -{ - int nid; - - /* - * quick check...making use of scan_node. - * We can skip unused nodes. - */ - if (!nodes_empty(memcg->scan_nodes)) { - for (nid = first_node(memcg->scan_nodes); - nid < MAX_NUMNODES; - nid = next_node(nid, memcg->scan_nodes)) { - - if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap)) - return true; - } - } - /* - * Check rest of nodes. - */ - for_each_node_state(nid, N_MEMORY) { - if (node_isset(nid, memcg->scan_nodes)) - continue; - if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap)) - return true; - } - return false; -} - #else int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) { return 0; } -static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) -{ - return test_mem_cgroup_node_reclaimable(memcg, 0, noswap); -} #endif /* @@ -2973,9 +2741,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg, unlock_page_cgroup(pc); /* - * "charge_statistics" updated event counter. Then, check it. - * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree. - * if they exceeds softlimit. + * "charge_statistics" updated event counter. */ memcg_check_events(memcg, page); } @@ -6112,8 +5878,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) for (zone = 0; zone < MAX_NR_ZONES; zone++) { mz = &pn->zoneinfo[zone]; lruvec_init(&mz->lruvec); - mz->usage_in_excess = 0; - mz->on_tree = false; mz->memcg = memcg; } memcg->nodeinfo[node] = pn; @@ -6169,7 +5933,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) int node; size_t size = memcg_size(); - mem_cgroup_remove_from_trees(memcg); free_css_id(&mem_cgroup_subsys, &memcg->css); for_each_node(node) @@ -6251,29 +6014,6 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) } EXPORT_SYMBOL(parent_mem_cgroup); -static void __init mem_cgroup_soft_limit_tree_init(void) -{ - struct mem_cgroup_tree_per_node *rtpn; - struct mem_cgroup_tree_per_zone *rtpz; - int tmp, node, zone; - - for_each_node(node) { - tmp = node; - if (!node_state(node, N_NORMAL_MEMORY)) - tmp = -1; - rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp); - BUG_ON(!rtpn); - - soft_limit_tree.rb_tree_per_node[node] = rtpn; - - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - rtpz = &rtpn->rb_tree_per_zone[zone]; - rtpz->rb_root = RB_ROOT; - spin_lock_init(&rtpz->lock); - } - } -} - static struct cgroup_subsys_state * __ref mem_cgroup_css_alloc(struct cgroup *cont) { @@ -7066,7 +6806,6 @@ static int __init mem_cgroup_init(void) { hotcpu_notifier(memcg_cpu_hotplug_callback, 0); enable_swap_cgroup(); - mem_cgroup_soft_limit_tree_init(); memcg_stock_init(); return 0; } -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 3/8] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko 2013-06-18 12:09 ` [PATCH v5 1/8] memcg, vmscan: integrate soft reclaim tighter with zone shrinking code Michal Hocko 2013-06-18 12:09 ` [PATCH v5 2/8] memcg: Get rid of soft-limit tree infrastructure Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 4/8] memcg: enhance memcg iterator to support predicates Michal Hocko ` (7 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Soft reclaim has been done only for the global reclaim (both background and direct). Since "memcg: integrate soft reclaim tighter with zone shrinking code" there is no reason for this limitation anymore as the soft limit reclaim doesn't use any special code paths and it is a part of the zone shrinking code which is used by both global and targeted reclaims. From semantic point of view it is even natural to consider soft limit before touching all groups in the hierarchy tree which is touching the hard limit because soft limit tells us where to push back when there is a memory pressure. It is not important whether the pressure comes from the limit or imbalanced zones. This patch simply enables soft reclaim unconditionally in mem_cgroup_should_soft_reclaim so it is enabled for both global and targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn about the root of the reclaim to know where to stop checking soft limit state of parents up the hierarchy. Say we have A (over soft limit) \ B (below s.l., hit the hard limit) / \ C D (below s.l.) B is the source of the outside memory pressure now for D but we shouldn't soft reclaim it because it is behaving well under B subtree and we can still reclaim from C (pressumably it is over the limit). mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the hierarchy at B (root of the memory pressure). 
Changes since v1 - add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Tejun Heo <tj@kernel.org> --- include/linux/memcontrol.h | 6 ++++-- mm/memcontrol.c | 14 +++++++++----- mm/vmscan.c | 4 ++-- 3 files changed, 15 insertions(+), 9 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d495b9e..065ecef 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -180,7 +180,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -357,7 +358,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { return false; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cdc8aed..90217f3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1870,11 +1870,13 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) #endif /* - * A group is eligible for the soft limit reclaim if it is - * a) is over its soft limit + * A group is eligible for the soft limit reclaim under the given root + * hierarchy if + * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit */ -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; @@ -1882,12 +1884,14 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) return true; /* - * If any parent up the hierarchy is over its soft limit then we - * have to obey and reclaim from this group as well. + * If any parent up to the root in the hierarchy is over its soft limit + * then we have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { if (res_counter_soft_limit_excess(&parent->res)) return true; + if (parent == root) + break; } return false; diff --git a/mm/vmscan.c b/mm/vmscan.c index d8823f0..13d746d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -142,7 +142,7 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return !mem_cgroup_disabled() && global_reclaim(sc); + return !mem_cgroup_disabled(); } #else static bool global_reclaim(struct scan_control *sc) @@ -2155,7 +2155,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) struct lruvec *lruvec; if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg)) { + !mem_cgroup_soft_reclaim_eligible(memcg, root)) { memcg = mem_cgroup_iter(root, memcg, &reclaim); continue; } -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 4/8] memcg: enhance memcg iterator to support predicates 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (2 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 3/8] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 5/8] memcg: track children in soft limit excess to improve soft limit Michal Hocko ` (6 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa The caller of the iterator might know that some nodes or even subtrees should be skipped but there is no way to tell iterators about that so the only choice left is to let iterators to visit each node and do the selection outside of the iterating code. This, however, doesn't scale well with hierarchies with many groups where only few groups are interesting. This patch adds mem_cgroup_iter_cond variant of the iterator with a callback which gets called for every visited node. There are three possible ways how the callback can influence the walk. Either the node is visited, it is skipped but the tree walk continues down the tree or the whole subtree of the current group is skipped. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 48 +++++++++++++++++++++++++---- mm/memcontrol.c | 77 ++++++++++++++++++++++++++++++++++++---------- mm/vmscan.c | 16 +++------- 3 files changed, 108 insertions(+), 33 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 065ecef..1276be3 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -41,6 +41,23 @@ struct mem_cgroup_reclaim_cookie { unsigned int generation; }; +enum mem_cgroup_filter_t { + VISIT, /* visit current node */ + SKIP, /* skip the current node and continue traversal */ + SKIP_TREE, /* skip the whole subtree and continue traversal */ +}; + +/* + * mem_cgroup_filter_t predicate might instruct mem_cgroup_iter_cond how to + * iterate through the hierarchy tree. Each tree element is checked by the + * predicate before it is returned by the iterator. If a filter returns + * SKIP or SKIP_TREE then the iterator code continues traversal (with the + * next node down the hierarchy or the next node that doesn't belong under the + * memcg's subtree). 
+ */ +typedef enum mem_cgroup_filter_t +(*mem_cgroup_iter_filter)(struct mem_cgroup *memcg, struct mem_cgroup *root); + #ifdef CONFIG_MEMCG /* * All "charge" functions with gfp_mask should use GFP_KERNEL or @@ -108,9 +125,18 @@ mem_cgroup_prepare_migration(struct page *page, struct page *newpage, extern void mem_cgroup_end_migration(struct mem_cgroup *memcg, struct page *oldpage, struct page *newpage, bool migration_ok); -struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *, - struct mem_cgroup *, - struct mem_cgroup_reclaim_cookie *); +struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root, + struct mem_cgroup *prev, + struct mem_cgroup_reclaim_cookie *reclaim, + mem_cgroup_iter_filter cond); + +static inline struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, + struct mem_cgroup *prev, + struct mem_cgroup_reclaim_cookie *reclaim) +{ + return mem_cgroup_iter_cond(root, prev, reclaim, NULL); +} + void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *); /* @@ -180,7 +206,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, +enum mem_cgroup_filter_t +mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); @@ -295,6 +322,14 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg, struct page *oldpage, struct page *newpage, bool migration_ok) { } +static inline struct mem_cgroup * +mem_cgroup_iter_cond(struct mem_cgroup *root, + struct mem_cgroup *prev, + struct mem_cgroup_reclaim_cookie *reclaim, + mem_cgroup_iter_filter cond) +{ + return NULL; +} static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, @@ -358,10 +393,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, +enum mem_cgroup_filter_t +mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root) { - return false; + return VISIT; } static inline void mem_cgroup_split_huge_fixup(struct page *head) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 90217f3..1364ca5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -959,6 +959,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm) return memcg; } +static enum mem_cgroup_filter_t +mem_cgroup_filter(struct mem_cgroup *memcg, struct mem_cgroup *root, + mem_cgroup_iter_filter cond) +{ + if (!cond) + return VISIT; + return cond(memcg, root); +} + /* * Returns a next (in a pre-order walk) alive memcg (with elevated css * ref. count) or NULL if the whole root's subtree has been visited. @@ -966,7 +975,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm) * helper function to be used by mem_cgroup_iter */ static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root, - struct mem_cgroup *last_visited) + struct mem_cgroup *last_visited, mem_cgroup_iter_filter cond) { struct cgroup *prev_cgroup, *next_cgroup; @@ -974,10 +983,18 @@ static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root, * Root is not visited by cgroup iterators so it needs an * explicit visit. */ - if (!last_visited) - return root; + if (!last_visited) { + switch(mem_cgroup_filter(root, root, cond)) { + case VISIT: + return root; + case SKIP: + break; + case SKIP_TREE: + return NULL; + } + } - prev_cgroup = (last_visited == root) ? 
NULL + prev_cgroup = (last_visited == root || !last_visited) ? NULL : last_visited->css.cgroup; skip_node: next_cgroup = cgroup_next_descendant_pre( @@ -993,11 +1010,30 @@ skip_node: if (next_cgroup) { struct mem_cgroup *mem = mem_cgroup_from_cont( next_cgroup); - if (css_tryget(&mem->css)) - return mem; - else { + + switch (mem_cgroup_filter(mem, root, cond)) { + case SKIP: prev_cgroup = next_cgroup; goto skip_node; + case SKIP_TREE: + /* + * cgroup_rightmost_descendant is not an optimal way to + * skip through a subtree (especially for imbalanced + * trees leaning to right) but that's what we have right + * now. More effective solution would be traversing + * right-up for first non-NULL without calling + * cgroup_next_descendant_pre afterwards. + */ + prev_cgroup = cgroup_rightmost_descendant(next_cgroup); + goto skip_node; + case VISIT: + if (css_tryget(&mem->css)) + return mem; + else { + prev_cgroup = next_cgroup; + goto skip_node; + } + break; } } @@ -1061,6 +1097,7 @@ static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter, * @root: hierarchy root * @prev: previously returned memcg, NULL on first invocation * @reclaim: cookie for shared reclaim walks, NULL for full walks + * @cond: filter for visited nodes, NULL for no filter * * Returns references to children of the hierarchy below @root, or * @root itself, or %NULL after a full round-trip. @@ -1073,9 +1110,10 @@ static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter, * divide up the memcgs in the hierarchy among all concurrent * reclaimers operating on the same zone and priority. */ -struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, +struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root, struct mem_cgroup *prev, - struct mem_cgroup_reclaim_cookie *reclaim) + struct mem_cgroup_reclaim_cookie *reclaim, + mem_cgroup_iter_filter cond) { struct mem_cgroup *memcg = NULL; struct mem_cgroup *last_visited = NULL; @@ -1092,7 +1130,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (!root->use_hierarchy && root != root_mem_cgroup) { if (prev) goto out_css_put; - return root; + if (mem_cgroup_filter(root, root, cond) == VISIT) + return root; + return NULL; } rcu_read_lock(); @@ -1116,7 +1156,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, last_visited = mem_cgroup_iter_load(iter, root, &seq); } - memcg = __mem_cgroup_iter_next(root, last_visited); + memcg = __mem_cgroup_iter_next(root, last_visited, cond); if (reclaim) { mem_cgroup_iter_update(iter, last_visited, memcg, seq); @@ -1127,7 +1167,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, reclaim->generation = iter->generation; } - if (prev && !memcg) + /* + * We have finished the whole tree walk or no group has been + * visited because filter told us to skip the root node. 
+ */ + if (!memcg && (prev || (cond && !last_visited))) goto out_unlock; } out_unlock: @@ -1875,13 +1919,14 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit */ -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, +enum mem_cgroup_filter_t +mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; if (res_counter_soft_limit_excess(&memcg->res)) - return true; + return VISIT; /* * If any parent up to the root in the hierarchy is over its soft limit @@ -1889,12 +1934,12 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, */ while((parent = parent_mem_cgroup(parent))) { if (res_counter_soft_limit_excess(&parent->res)) - return true; + return VISIT; if (parent == root) break; } - return false; + return SKIP; } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 13d746d..79d59ec 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2145,21 +2145,16 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) .zone = zone, .priority = sc->priority, }; - struct mem_cgroup *memcg; + struct mem_cgroup *memcg = NULL; + mem_cgroup_iter_filter filter = (soft_reclaim) ? + mem_cgroup_soft_reclaim_eligible : NULL; nr_reclaimed = sc->nr_reclaimed; nr_scanned = sc->nr_scanned; - memcg = mem_cgroup_iter(root, NULL, &reclaim); - do { + while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) { struct lruvec *lruvec; - if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg, root)) { - memcg = mem_cgroup_iter(root, memcg, &reclaim); - continue; - } - lruvec = mem_cgroup_zone_lruvec(zone, memcg); shrink_lruvec(lruvec, sc); @@ -2179,8 +2174,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) mem_cgroup_iter_break(root, memcg); break; } - memcg = mem_cgroup_iter(root, memcg, &reclaim); - } while (memcg); + } vmpressure(sc->gfp_mask, sc->target_mem_cgroup, sc->nr_scanned - nr_scanned, -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 5/8] memcg: track children in soft limit excess to improve soft limit 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (3 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 4/8] memcg: enhance memcg iterator to support predicates Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything Michal Hocko ` (5 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Soft limit reclaim has to check the whole reclaim hierarchy while doing the first pass of the reclaim. This leads to a higher system time which can be visible especially when there are many groups in the hierarchy. This patch adds a per-memcg counter of children in excess. It also restores MEM_CGROUP_TARGET_SOFTLIMIT into mem_cgroup_event_ratelimit for a proper batching. If a group crosses soft limit for the first time it increases parent's children_in_excess up the hierarchy. The similarly if a group gets below the limit it will decrease the counter. The transition phase is recorded in soft_contributed flag. mem_cgroup_soft_reclaim_eligible then uses this information to better decide whether to skip the node or the whole subtree. The rule is simple. Skip the node with a children in excess or skip the whole subtree otherwise. This has been tested by a stream IO (dd if=/dev/zero of=file with 4*MemTotal size) which is quite sensitive to overhead during reclaim. The load is running in a group with soft limit set to 0 and without any limit. Apart from that there was a hierarchy with ~500, 2k and 8k groups (two groups on each level) without any pages in them. 
base denotes to the kernel on which the whole series is based on, rework is the kernel before this patch and reworkoptim is with this patch applied: * Run with soft limit set to 0 Elapsed 0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3 0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3 0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3 System 0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3 0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3 0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3 Elapsed 0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3 0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3 0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3 System 2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3 2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3 2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3 Elapsed 2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3 2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3 2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3 System 8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3 8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3 8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3 Elapsed 8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3 8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3 8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3 System time is increased by 30-40% but it is reduced a lot comparing to kernel without this patch. The higher time can be explained by the fact that the original soft reclaim scanned at priority 0 so it was much more effective for this workload (which is basically touch once and writeback). The Elapsed time looks better though (~20%). 
* Run with no soft limit set System 0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3 0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3 0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3 Elapsed 0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3 0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3 0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3 System 0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3 0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3 0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3 Elapsed 0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3 0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3 0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3 System 2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3 2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3 2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3 Elapsed 2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3 2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 1.55 runs: 3 2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3 System 8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3 8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3 8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3 Elapsed 8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3 8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3 8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3 Both System and Elapsed are in stdev with the base kernel for all configurations except for 8k where both System and Elapsed are up by 35%. I do not have a good explanation for this because there is no soft reclaim pass going on as no group is above the limit which is checked in mem_cgroup_should_soft_reclaim. Then I have tested kernel build with the same configuration to see the behavior with a more general behavior. 
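The kernel build load used below (two groups, make -j4 in each,
statistics collected with /usr/bin/time -v) can be wrapped in a small
helper along the lines of the following sketch; the script name and the
source tree paths are made up for the example:

  #!/bin/sh
  # one_build.sh <group> <source tree> -- sketch only, paths are assumptions
  CG=/sys/fs/cgroup/memory
  mkdir -p "$CG/$1"
  echo 0 > "$CG/$1/memory.soft_limit_in_bytes"   # the 0-limit configuration; drop for no-limit
  echo $$ > "$CG/$1/tasks"                       # move this shell and its children into the group
  cd "$2" || exit 1
  exec /usr/bin/time -v make -s -j4

Two instances would then be started in parallel, e.g.
"./one_build.sh kbuild-A ~/linux-A & ./one_build.sh kbuild-B ~/linux-B & wait".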
* Soft limit set to 0 for the build System 0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3 0-0-limit/rework min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3 0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3 Elapsed 0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3 0-0-limit/rework min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3 0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3 System 0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3 0.5k-0-limit/rework min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3 0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3 Elapsed 0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3 0.5k-0-limit/rework min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3 0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3 System 2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3 2k-0-limit/rework min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3 2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3 Elapsed 2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3 2k-0-limit/rework min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3 2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3 System 8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3 8k-0-limit/rework min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3 8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3 Elapsed 8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3 8k-0-limit/rework min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3 8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3 Apart from 8k the system time is comparable with the base kernel while Elapsed is up to 20% better with all configurations. 
* No soft limit set System 0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3 0-no-limit/rework min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3 0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3 Elapsed 0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3 0-no-limit/rework min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3 0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3 System 0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3 0.5k-no-limit/rework min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3 0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3 Elapsed 0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3 0.5k-no-limit/rework min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3 0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3 System 2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3 2k-no-limit/rework min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3 2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3 Elapsed 2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3 2k-no-limit/rework min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 [104.0%] std: 2.72 runs: 3 2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3 System 8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3 8k-no-limit/rework min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3 8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3 Elapsed 8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3 8k-no-limit/rework min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3 8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3 and these numbers look good as well. System time is around 100% (suprisingly better for the 8k case) and Elapsed is copies that trend. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 71 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1364ca5..2a06d5a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -136,6 +136,7 @@ static const char * const mem_cgroup_lru_names[] = { */ enum mem_cgroup_events_target { MEM_CGROUP_TARGET_THRESH, + MEM_CGROUP_TARGET_SOFTLIMIT, MEM_CGROUP_TARGET_NUMAINFO, MEM_CGROUP_NTARGETS, }; @@ -350,6 +351,22 @@ struct mem_cgroup { atomic_t numainfo_events; atomic_t numainfo_updating; #endif + /* + * Protects soft_contributed transitions. + * See mem_cgroup_update_soft_limit + */ + spinlock_t soft_lock; + + /* + * If true then this group has increased parents' children_in_excess + * when it got over the soft limit. + * When a group falls bellow the soft limit, parents' children_in_excess + * is decreased and soft_contributed changed to false. 
+ */ + bool soft_contributed; + + /* Number of children that are in soft limit excess */ + atomic_t children_in_excess; struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ @@ -880,6 +897,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, case MEM_CGROUP_TARGET_THRESH: next = val + THRESHOLDS_EVENTS_TARGET; break; + case MEM_CGROUP_TARGET_SOFTLIMIT: + next = val + SOFTLIMIT_EVENTS_TARGET; + break; case MEM_CGROUP_TARGET_NUMAINFO: next = val + NUMAINFO_EVENTS_TARGET; break; @@ -893,6 +913,42 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, } /* + * Called from rate-limitted memcg_check_events when enough + * MEM_CGROUP_TARGET_SOFTLIMIT events are accumulated and it makes sure + * that all the parents up the hierarchy will be noticed that this group + * is in excess or that it is not in excess anymore. mmecg->soft_contributed + * makes the transition a single action whenever the state flips from one to + * other. + */ +static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg) +{ + unsigned long long excess = res_counter_soft_limit_excess(&memcg->res); + struct mem_cgroup *parent = memcg; + int delta = 0; + + spin_lock(&memcg->soft_lock); + if (excess) { + if (!memcg->soft_contributed) { + delta = 1; + memcg->soft_contributed = true; + } + } else { + if (memcg->soft_contributed) { + delta = -1; + memcg->soft_contributed = false; + } + } + + /* + * Necessary to update all ancestors when hierarchy is used + * because their event counter is not touched. + */ + while (delta && (parent = parent_mem_cgroup(parent))) + atomic_add(delta, &parent->children_in_excess); + spin_unlock(&memcg->soft_lock); +} + +/* * Check events in order. * */ @@ -902,8 +958,11 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) /* threshold event is triggered in finer grain than soft limit */ if (unlikely(mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { + bool do_softlimit; bool do_numainfo __maybe_unused; + do_softlimit = mem_cgroup_event_ratelimit(memcg, + MEM_CGROUP_TARGET_SOFTLIMIT); #if MAX_NUMNODES > 1 do_numainfo = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_NUMAINFO); @@ -911,6 +970,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) preempt_enable(); mem_cgroup_threshold(memcg); + if (unlikely(do_softlimit)) + mem_cgroup_update_soft_limit(memcg); #if MAX_NUMNODES > 1 if (unlikely(do_numainfo)) atomic_inc(&memcg->numainfo_events); @@ -1918,6 +1979,9 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) * hierarchy if * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit + * + * If the given group doesn't have any children over the limit then it + * doesn't make any sense to iterate its subtree. 
*/ enum mem_cgroup_filter_t mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, @@ -1939,6 +2003,8 @@ mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, break; } + if (!atomic_read(&memcg->children_in_excess)) + return SKIP_TREE; return SKIP; } @@ -6093,6 +6159,7 @@ mem_cgroup_css_alloc(struct cgroup *cont) mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); vmpressure_init(&memcg->vmpressure); + spin_lock_init(&memcg->soft_lock); return &memcg->css; @@ -6182,6 +6249,10 @@ static void mem_cgroup_css_offline(struct cgroup *cont) mem_cgroup_invalidate_reclaim_iterators(memcg); mem_cgroup_reparent_charges(memcg); + if (memcg->soft_contributed) { + while ((memcg = parent_mem_cgroup(memcg))) + atomic_dec(&memcg->children_in_excess); + } mem_cgroup_destroy_all_caches(memcg); } -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
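For intuition about what "in excess" means here: the children_in_excess
counter added by the patch stays internal to the kernel, but the
per-group excess it aggregates is simply usage above the soft limit,
which can be eyeballed from userspace with the existing memcg v1 files
(the group path below is just an example):

  G=/sys/fs/cgroup/memory/streamer        # example group
  usage=$(cat $G/memory.usage_in_bytes)
  limit=$(cat $G/memory.soft_limit_in_bytes)
  if [ "$usage" -gt "$limit" ]; then
          echo "$G is $((usage - limit)) bytes over its soft limit"
  else
          echo "$G is within its soft limit"
  fi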
* [PATCH v5 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (4 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 5/8] memcg: track children in soft limit excess to improve soft limit Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 7/8] memcg: Track all children over limit in the root Michal Hocko ` (4 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa mem_cgroup_should_soft_reclaim controls whether soft reclaim pass is done and it always says yes currently. Memcg iterators are clever to skip nodes that are not soft reclaimable quite efficiently but mem_cgroup_should_soft_reclaim can be more clever and do not start the soft reclaim pass at all if it knows that nothing would be scanned anyway. In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for the target group of the reclaim and allow the pass only if the whole subtree wouldn't be skipped. Changes since v1 - do not export mem_cgroup_root and teach mem_cgroup_soft_reclaim_eligible to handle NULL memcg as mem_cgroup_root Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 6 +++++- mm/vmscan.c | 4 +++- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2a06d5a..b2e44d3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1987,7 +1987,11 @@ enum mem_cgroup_filter_t mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root) { - struct mem_cgroup *parent = memcg; + struct mem_cgroup *parent; + + if (!memcg) + memcg = root_mem_cgroup; + parent = memcg; if (res_counter_soft_limit_excess(&memcg->res)) return VISIT; diff --git a/mm/vmscan.c b/mm/vmscan.c index 79d59ec..56302da 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -142,7 +142,9 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return !mem_cgroup_disabled(); + struct mem_cgroup *root = sc->target_mem_cgroup; + return !mem_cgroup_disabled() && + mem_cgroup_soft_reclaim_eligible(root, root) != SKIP_TREE; } #else static bool global_reclaim(struct scan_control *sc) -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 7/8] memcg: Track all children over limit in the root 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (5 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 12:09 ` [PATCH v5 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly Michal Hocko ` (3 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Children in soft limit excess are currently tracked up the hierarchy in memcg->children_in_excess. Nevertheless there still might exist tons of groups that are not in hierarchy relation to the root cgroup (e.g. all first level groups if root_mem_cgroup->use_hierarchy == false). As the whole tree walk has to be done when the iteration starts at root_mem_cgroup the iterator should be able to skip the walk if there is no child above the limit without iterating them. This can be done easily if the root tracks all children rather than only hierarchical children. This is done by this patch which updates root_mem_cgroup children_in_excess if root_mem_cgroup->use_hierarchy == false so the root knows about all children in excess. Please note that this is not an issue for inner memcgs which have use_hierarchy == false because then only the single group is visited so no special optimization is necessary. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b2e44d3..d55e2c8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -942,9 +942,15 @@ static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg) /* * Necessary to update all ancestors when hierarchy is used * because their event counter is not touched. + * We track children even outside the hierarchy for the root + * cgroup because tree walk starting at root should visit + * all cgroups and we want to prevent from pointless tree + * walk if no children is below the limit. */ while (delta && (parent = parent_mem_cgroup(parent))) atomic_add(delta, &parent->children_in_excess); + if (memcg != root_mem_cgroup && !root_mem_cgroup->use_hierarchy) + atomic_add(delta, &root_mem_cgroup->children_in_excess); spin_unlock(&memcg->soft_lock); } @@ -6256,6 +6262,9 @@ static void mem_cgroup_css_offline(struct cgroup *cont) if (memcg->soft_contributed) { while ((memcg = parent_mem_cgroup(memcg))) atomic_dec(&memcg->children_in_excess); + + if (memcg != root_mem_cgroup && !root_mem_cgroup->use_hierarchy) + atomic_dec(&root_mem_cgroup->children_in_excess); } mem_cgroup_destroy_all_caches(memcg); } -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v5 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (6 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 7/8] memcg: Track all children over limit in the root Michal Hocko @ 2013-06-18 12:09 ` Michal Hocko 2013-06-18 19:01 ` [PATCH v5] Soft limit rework Johannes Weiner ` (2 subsequent siblings) 10 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-18 12:09 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa shrink_zone starts with soft reclaim pass first and then falls back to regular reclaim if nothing has been scanned. This behavior is natural but there is a catch. Memcg iterators, when used with the reclaim cookie, are designed to help to prevent from over reclaim by interleaving reclaimers (per node-zone-priority) so the tree walk might miss many (even all) nodes in the hierarchy e.g. when there are direct reclaimers racing with each other or with kswapd in the global case or multiple allocators reaching the limit for the target reclaim case. To make it even more complicated, targeted reclaim doesn't do the whole tree walk because it stops reclaiming once it reclaims sufficient pages. As a result groups over the limit might be missed, thus nothing is scanned, and reclaim would fall back to the reclaim all mode. This patch checks for the incomplete tree walk in shrink_zone. If no group has been visited and the hierarchy is soft reclaimable then we must have missed some groups, in which case the __shrink_zone is called again. This doesn't guarantee there will be some progress of course because the current reclaimer might be still racing with others but it would at least give a chance to start the walk without a big risk of reclaim latencies. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/vmscan.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 56302da..8cbc8e5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2136,10 +2136,11 @@ static inline bool should_continue_reclaim(struct zone *zone, } } -static void +static int __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) { unsigned long nr_reclaimed, nr_scanned; + int groups_scanned = 0; do { struct mem_cgroup *root = sc->target_mem_cgroup; @@ -2157,6 +2158,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) { struct lruvec *lruvec; + groups_scanned++; lruvec = mem_cgroup_zone_lruvec(zone, memcg); shrink_lruvec(lruvec, sc); @@ -2184,6 +2186,8 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, sc->nr_scanned - nr_scanned, sc)); + + return groups_scanned; } @@ -2191,8 +2195,19 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) { bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); unsigned long nr_scanned = sc->nr_scanned; + int scanned_groups; - __shrink_zone(zone, sc, do_soft_reclaim); + scanned_groups = __shrink_zone(zone, sc, do_soft_reclaim); + /* + * memcg iterator might race with other reclaimer or start from + * a incomplete tree walk so the tree walk in __shrink_zone + * might have missed groups that are above the soft limit. 
Try + * another loop to catch up with others. Do it just once to + * prevent from reclaim latencies when other reclaimers always + * preempt this one. + */ + if (do_soft_reclaim && !scanned_groups) + __shrink_zone(zone, sc, do_soft_reclaim); /* * No group is over the soft limit or those that are do not have -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (7 preceding siblings ...) 2013-06-18 12:09 ` [PATCH v5 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly Michal Hocko @ 2013-06-18 19:01 ` Johannes Weiner 2013-06-19 10:20 ` Michal Hocko 2013-06-20 11:12 ` Mel Gorman [not found] ` <1371557387-22434-1-git-send-email-mhocko-AlSwsSmVLrQ@public.gmane.org> 10 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-06-18 19:01 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > My primary test case was a parallel kernel build with 2 groups (make > is running with -j4 with a distribution .config in a separate cgroup > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > was mostly interested in 2 setups. Default - no soft limit set and - and > 0 soft limit set to both groups. > The first one should tell us whether the rework regresses the default > behavior while the second one should show us improvements in an extreme > case where both workloads are always over the soft limit. The most interesting test case would be how it behaves if some groups are over the soft limits while others are not. I would expect this to be the most common situation for when soft limits are used. On the other hand, setting all soft limits to 0 makes every reclaim invocation do almost only soft reclaim. What's the point of that? > /usr/bin/time -v has been used to collect the statistics and each > configuration had 3 runs after fresh boot without any other load on the > system. > > base is mmotm-2013-05-09-15-57 > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > without slab shrinkers patchset. > reworkrebase all patches 8 applied on top of baserebase > > * No-limit > User > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > System > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > Elapsed > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > Elapsed time regressed by 13% wrt. base but it seems that this came from > baserebase which regressed by the same amount. Which mmots does this refer to? We should probably look into a regression this size... 
> * 0-limit > User > base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6 > baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6 > reworkrebase: min: 1169.88 [98.5%] max: 1177.84 [98.3%] avg: 1173.50 [98.3%] std: 2.79 runs: 6 > System > base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6 > baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6 > reworkrebase: min: 235.19 [94.7%] max: 237.43 [94.2%] avg: 236.35 [94.5%] std: 0.86 runs: 6 > Elapsed > base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6 > baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6 > reworkrebase: min: 667.54 [87.9%] max: 718.54 [89.2%] avg: 695.61 [88.6%] std: 17.16 runs: 6 > > System time is slightly better but I wouldn't consider it relevant. > > Elapsed time is more interesting though. baserebase regresses by 16% > again which is in par with no-limit configuration. > > With the patchset applied we are 11% better in average wrt. to the > old base but it is important to realize that this is still 76.3% wrt. > baserebase so the effect of the series is comparable to the previous > version. Albeit the whole result is worse. > > Page fault statistics tell us at least part of the story: > Minor > base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6 > baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6 > reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6 > Major > base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6 > baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6 > reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6 The changes are big but the test makes no sense to me. > While the minor faults are within the noise the major faults are reduced > considerably. This looks like an aggressive pageout during the reclaim > and that pageout affects the working set presumably. Please note that > baserebase has even hight number of major page faults than the older > mmotm trree. > > While this looks as a nice win it is fair to say that there are some > workloads that actually benefit from reclaim at 0 priority (from > background reclaim). E.g. an aggressive streaming IO would like to get > rid of as many pages as possible and do not block on the pages under > writeback. This can lead to a higher System time but I generally got > Elapsed which was comparable. > > The following results are from 2 groups configuration on a 8GB machine > (A running stream IO with 4*TotalMem with 0 soft limit, B runnning a > mem_eater which consumes TotalMem-1G without any limit). > System > base: min: 124.88 max: 136.97 avg: 130.77 std: 4.94 runs: 3 > baserebase: min: 102.51 [82.1%] max: 108.84 [79.5%] avg: 104.81 [80.1%] std: 2.86 runs: 3 > reworkrebase: min: 108.29 [86.7%] max: 121.70 [88.9%] avg: 114.60 [87.6%] std: 5.50 runs: 3 > Elapsed > base: min: 398.86 max: 412.81 avg: 407.62 std: 6.23 runs: 3 > baserebase: min: 480.92 [120.6%] max: 497.56 [120.5%] avg: 491.46 [120.6%] std: 7.48 runs: 3 > reworkrebase: min: 397.19 [99.6%] max: 462.57 [112.1%] avg: 436.13 [107.0%] std: 28.12 runs: 3 Do you have individual runtimes for both the streamer and "mem_eater"? Is mem_eater's memory reclaimable? 
* Re: [PATCH v5] Soft limit rework 2013-06-18 19:01 ` [PATCH v5] Soft limit rework Johannes Weiner @ 2013-06-19 10:20 ` Michal Hocko 0 siblings, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-19 10:20 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue 18-06-13 15:01:21, Johannes Weiner wrote: > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > My primary test case was a parallel kernel build with 2 groups (make > > is running with -j4 with a distribution .config in a separate cgroup > > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > > was mostly interested in 2 setups. Default - no soft limit set and - and > > 0 soft limit set to both groups. > > The first one should tell us whether the rework regresses the default > > behavior while the second one should show us improvements in an extreme > > case where both workloads are always over the soft limit. > > The most interesting test case would be how it behaves if some groups > are over the soft limits while others are not. I would expect this to > be the most common situation for when soft limits are used. The stream IO test case does this (0 soft limit for dd and no limit for mem_eater). Anyway, I have tested this configuration with kbuild test (two groups, one 0 s.l. the second with no soft limit). Here are the commulative results (which do not distinguish results from the two groups. stdev tells us that the differences between the two groups are not significant): User baserebase: min: 1001.74 max: 1006.70 avg: 1004.54 std: 1.57 runs: 6 reworkrebase: min: 1001.11 [99.9%] max: 1005.35 [99.9%] avg: 1003.31 [99.9%] std: 1.58 runs: 6 System baserebase: min: 143.68 max: 146.39 avg: 144.95 std: 0.96 runs: 6 reworkrebase: min: 144.68 [100.7%] max: 146.80 [100.3%] avg: 145.43 [100.3%] std: 0.71 runs: 6 Elapsed baserebase: min: 403.31 max: 411.35 avg: 406.94 std: 2.50 runs: 6 reworkrebase: min: 391.70 [97.1%] max: 406.52 [98.8%] avg: 401.32 [98.6%] std: 4.95 runs: 6 Results are within noise. > On the other hand, setting all soft limits to 0 makes every reclaim > invocation do almost only soft reclaim. What's the point of that? The main point was to isolate the soft recalim as much as possible so that we can compare effects of the previous and the new implementations. The no-limit configuration on the other hand shows that the change didn't regress for the default configuration when no soft limit is used. Seeing some improvement is nice but it is not the primary motivation of this series. My primary interest was to not regress in common workloads. As I said there are some workloads which could benefit from prio-0 reclaim used for the soft reclaim previously but as the figures show this is not the case for the best and most loved kbuild test. > > /usr/bin/time -v has been used to collect the statistics and each > > configuration had 3 runs after fresh boot without any other load on the > > system. > > > > base is mmotm-2013-05-09-15-57 > > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > > without slab shrinkers patchset. 
> > reworkrebase all patches 8 applied on top of baserebase > > > > * No-limit > > User > > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > > System > > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > > Elapsed > > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > > > Elapsed time regressed by 13% wrt. base but it seems that this came from > > baserebase which regressed by the same amount. > > Which mmots does this refer to? We should probably look into a > regression this size... It was mmotm-2013-06-05-17-24-63 with some patches on top without slab shrinkers which failed to finish this test case. I plan to report this separately with more specific information in a separate thread. > > * 0-limit > > User > > base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6 > > baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6 > > reworkrebase: min: 1169.88 [98.5%] max: 1177.84 [98.3%] avg: 1173.50 [98.3%] std: 2.79 runs: 6 > > System > > base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6 > > baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6 > > reworkrebase: min: 235.19 [94.7%] max: 237.43 [94.2%] avg: 236.35 [94.5%] std: 0.86 runs: 6 > > Elapsed > > base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6 > > baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6 > > reworkrebase: min: 667.54 [87.9%] max: 718.54 [89.2%] avg: 695.61 [88.6%] std: 17.16 runs: 6 > > > > System time is slightly better but I wouldn't consider it relevant. > > > > Elapsed time is more interesting though. baserebase regresses by 16% > > again which is in par with no-limit configuration. > > > > With the patchset applied we are 11% better in average wrt. to the > > old base but it is important to realize that this is still 76.3% wrt. > > baserebase so the effect of the series is comparable to the previous > > version. Albeit the whole result is worse. > > > > Page fault statistics tell us at least part of the story: > > Minor > > base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6 > > baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6 > > reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6 > > Major > > base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6 > > baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6 > > reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6 > > The changes are big but the test makes no sense to me. As mentioned above this is a simple regression test when the soft reclaim is exercised as much as possible. 
For this particular test case it is clear that the previous implementation was too disruptive by doing prio-0 reclaim which evicted part of the working set. Do you have any other ideas how to compare the two implementations? > > While the minor faults are within the noise the major faults are reduced > > considerably. This looks like an aggressive pageout during the reclaim > > and that pageout affects the working set presumably. Please note that > > baserebase has even hight number of major page faults than the older > > mmotm trree. > > > > While this looks as a nice win it is fair to say that there are some > > workloads that actually benefit from reclaim at 0 priority (from > > background reclaim). E.g. an aggressive streaming IO would like to get > > rid of as many pages as possible and do not block on the pages under > > writeback. This can lead to a higher System time but I generally got > > Elapsed which was comparable. > > > > The following results are from 2 groups configuration on a 8GB machine > > (A running stream IO with 4*TotalMem with 0 soft limit, B runnning a > > mem_eater which consumes TotalMem-1G without any limit). > > System > > base: min: 124.88 max: 136.97 avg: 130.77 std: 4.94 runs: 3 > > baserebase: min: 102.51 [82.1%] max: 108.84 [79.5%] avg: 104.81 [80.1%] std: 2.86 runs: 3 > > reworkrebase: min: 108.29 [86.7%] max: 121.70 [88.9%] avg: 114.60 [87.6%] std: 5.50 runs: 3 > > Elapsed > > base: min: 398.86 max: 412.81 avg: 407.62 std: 6.23 runs: 3 > > baserebase: min: 480.92 [120.6%] max: 497.56 [120.5%] avg: 491.46 [120.6%] std: 7.48 runs: 3 > > reworkrebase: min: 397.19 [99.6%] max: 462.57 [112.1%] avg: 436.13 [107.0%] std: 28.12 runs: 3 > > Do you have individual runtimes for both the streamer and "mem_eater"? mem_eater is a simple anon mmap MAP_POPULATE which then sits on the memory without touching it until it is notified when it dies (this happens after streamer is done). It simulates a big memory resident while there is a big memory pressure. Measuring the time for the streamer makes sense on the other hand because it depends on the reclaim effectiveness. The numbers show that we regress a bit here which is caused by getting away from prio-0 reclaim which manages to reclaim touched once pages more effectively. > Is mem_eater's memory reclaimable? yes but the fact that it is soft unlimited helps it stay resident. When we look at usage_in_bytes for B (mem eater) for all 3 runs: base median: 7399763968 max: 7605112832 avg: 7314214879 rework median: 7736799232 [104%] max: 7832416256 [102%] avg: 7581929848 [104%] (*) baserebase median: 7788097536 [105%] max: 7841947648 [103%] avg: 7717106374 [105%] reworkrebase median: 7722799104 [104%] max: 7814995968 [102%] avg: 7614082984 [104%] (*) - this is the patch series before rebase We do not reclaim that much in the base kernel for baserebase but the series helps to prevent from reclaim a bit (~4% more memory for mem_eater both in average and in median). I think we can do even better if the streamer preferred zones that are not full from mem_eater and didn't trigger reclaim on those so that we wouldn't fall back to all-reclaim mode so often. Does this make more sense to you now or there are things that I should retest to better describe/measure the two soft reclaim implementations? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
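A sketch of how the streamer/mem_eater pair described above could be
driven follows. "mem_eater" is a stand-in name for the small helper
just described (an anonymous MAP_POPULATE mapping that waits until it
is told to exit); it is not a tool shipped with the series, and the
mount point and scratch file path are likewise assumptions:

  CG=/sys/fs/cgroup/memory
  mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)

  mkdir -p $CG/A $CG/B
  echo 0 > $CG/A/memory.soft_limit_in_bytes        # A: streamer, always in excess
                                                   # B: mem_eater, default (no) soft limit

  # B pins TotalMem - 1G of anonymous memory and sleeps
  sh -c "echo \$\$ > $CG/B/tasks; exec ./mem_eater $(( (mem_kb - 1048576) * 1024 ))" &
  eater=$!

  # A streams 4*TotalMem through the page cache
  sh -c "echo \$\$ > $CG/A/tasks; exec /usr/bin/time -v dd if=/dev/zero of=/mnt/scratch/file bs=1M count=$(( mem_kb * 4 / 1024 ))"

  kill $eater                                      # tell the eater to exit once the streamer is done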
* Re: [PATCH v5] Soft limit rework 2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko ` (8 preceding siblings ...) 2013-06-18 19:01 ` [PATCH v5] Soft limit rework Johannes Weiner @ 2013-06-20 11:12 ` Mel Gorman [not found] ` <20130620111206.GA14809-l3A5Bk7waGM@public.gmane.org> [not found] ` <1371557387-22434-1-git-send-email-mhocko-AlSwsSmVLrQ@public.gmane.org> 10 siblings, 1 reply; 30+ messages in thread From: Mel Gorman @ 2013-06-20 11:12 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > base is mmotm-2013-05-09-15-57 > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > without slab shrinkers patchset. > reworkrebase all patches 8 applied on top of baserebase > > * No-limit > User > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > System > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > Elapsed > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > Elapsed time regressed by 13% wrt. base but it seems that this came from > baserebase which regressed by the same amount. > boo-urns > Page fault statistics tell us at least part of the story: > Minor > base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6 > baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6 > reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6 > Major > base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6 > baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6 > reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6 Can you try this monolithic patch please? diff --git a/mm/vmscan.c b/mm/vmscan.c index fe73724..f677780 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1477,25 +1477,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, * as there is no guarantee the dirtying process is throttled in the * same way balance_dirty_pages() manages. * - * This scales the number of dirty pages that must be under writeback - * before a zone gets flagged ZONE_WRITEBACK. It is a simple backoff - * function that has the most effect in the range DEF_PRIORITY to - * DEF_PRIORITY-2 which is the priority reclaim is considered to be - * in trouble and reclaim is considered to be in trouble. - * - * DEF_PRIORITY 100% isolated pages must be PageWriteback to throttle - * DEF_PRIORITY-1 50% must be PageWriteback - * DEF_PRIORITY-2 25% must be PageWriteback, kswapd in trouble - * ... 
- * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any - * isolated page is PageWriteback - * * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number * of pages under pages flagged for immediate reclaim and stall if any * are encountered in the nr_immediate check below. */ - if (nr_writeback && nr_writeback >= - (nr_taken >> (DEF_PRIORITY - sc->priority))) + if (nr_writeback && nr_writeback == nr_taken) zone_set_flag(zone, ZONE_WRITEBACK); /* @@ -2382,12 +2368,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, struct zone *zone; unsigned long writeback_threshold; bool aborted_reclaim; + int min_scan_priority = 1; delayacct_freepages_start(); if (global_reclaim(sc)) count_vm_event(ALLOCSTALL); +rescan: do { vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup, sc->priority); @@ -2442,7 +2430,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, WB_REASON_TRY_TO_FREE_PAGES); sc->may_writepage = 1; } - } while (--sc->priority >= 0); + } while (--sc->priority >= min_scan_priority); out: delayacct_freepages_end(); @@ -2466,6 +2454,12 @@ out: if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc)) return 1; + /* If the page allocator is going to consider OOM, rescan at priority 0 */ + if (min_scan_priority) { + min_scan_priority = 0; + goto rescan; + } + return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework [not found] ` <20130620111206.GA14809-l3A5Bk7waGM@public.gmane.org> @ 2013-06-21 14:06 ` Michal Hocko [not found] ` <20130621140627.GI12424-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2013-06-25 15:49 ` Michal Hocko 1 sibling, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-06-21 14:06 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Thu 20-06-13 12:12:06, Mel Gorman wrote: > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > base is mmotm-2013-05-09-15-57 > > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > > without slab shrinkers patchset. > > reworkrebase all patches 8 applied on top of baserebase > > > > * No-limit > > User > > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > > System > > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > > Elapsed > > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > > > Elapsed time regressed by 13% wrt. base but it seems that this came from > > baserebase which regressed by the same amount. > > > > boo-urns > > > Page fault statistics tell us at least part of the story: > > Minor > > base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6 > > baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6 > > reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6 > > Major > > base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6 > > baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6 > > reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6 > > Can you try this monolithic patch please? Wow, this looks much better! 
* 0-limit User base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6 baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6 mel: min: 993.25 [83.6%] max: 997.40 [83.2%] avg: 995.81 [83.4%] std: 1.43 runs: 6 System base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6 baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6 mel: min: 145.36 [58.5%] max: 148.69 [59.0%] avg: 147.66 [59.0%] std: 1.17 runs: 6 Elapsed base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6 baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6 mel: min: 367.99 [48.5%] max: 381.67 [47.4%] avg: 371.84 [47.4%] std: 4.72 runs: 6 * no-limit User base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 mel: min: 993.46 [85.3%] max: 995.07 [85.1%] avg: 994.26 [85.1%] std: 0.56 runs: 6 System base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 mel: min: 148.55 [61.2%] max: 151.80 [61.9%] avg: 149.80 [61.4%] std: 1.07 runs: 6 Elapsed base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 mel: min: 366.85 [61.5%] max: 375.98 [60.6%] avg: 372.25 [61.5%] std: 3.18 runs: 6 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework [not found] ` <20130621140627.GI12424-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-06-21 14:09 ` Michal Hocko 2013-06-21 15:04 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-06-21 14:09 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri 21-06-13 16:06:27, Michal Hocko wrote: [...] > > Can you try this monolithic patch please? > > Wow, this looks much better! Damn it! Scratch that. I have made a mistake in configuration so this all has been 0-no-limit in fact. Sorry about that. It's only now that I've noticed that so I am retesting. Hopefully it will be done before I leave today. I will post it on Monday otherwise. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-06-21 14:09 ` Michal Hocko @ 2013-06-21 15:04 ` Michal Hocko [not found] ` <20130621150430.GL12424-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-06-21 15:04 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri 21-06-13 16:09:38, Michal Hocko wrote: > On Fri 21-06-13 16:06:27, Michal Hocko wrote: > [...] > > > Can you try this monolithic patch please? > > > > Wow, this looks much better! > > Damn it! Scratch that. I have made a mistake in configuration so this > all has been 0-no-limit in fact. Sorry about that. It's only now that > I've noticed that so I am retesting. Hopefully it will be done before I > leave today. I will post it on Monday otherwise. And for real now and it is still Wow! * 0-limit er base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6 baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6 mel: min: 989.99 [83.3%] max: 993.35 [82.9%] avg: 991.71 [83.1%] std: 1.18 runs: 6 System base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6 baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6 mel: min: 144.75 [58.3%] max: 148.27 [58.8%] avg: 146.97 [58.7%] std: 1.41 runs: 6 Elapsed base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6 baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6 mel: min: 365.99 [48.2%] max: 376.12 [46.7%] avg: 369.82 [47.1%] std: 4.04 runs: 6 * no-limit User base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 mel: min: 990.96 [85.1%] max: 994.18 [85.0%] avg: 992.79 [85.0%] std: 1.15 runs: 6 System base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 mel: min: 148.40 [61.2%] max: 150.77 [61.4%] avg: 149.33 [61.2%] std: 0.97 runs: 6 Elapsed base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 mel: min: 365.27 [61.2%] max: 371.40 [59.9%] avg: 369.43 [61.0%] std: 2.26 runs: 6 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework [not found] ` <20130621150430.GL12424-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-06-21 15:09 ` Michal Hocko 2013-06-21 16:34 ` Tejun Heo 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-06-21 15:09 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri 21-06-13 17:04:30, Michal Hocko wrote: > On Fri 21-06-13 16:09:38, Michal Hocko wrote: > > On Fri 21-06-13 16:06:27, Michal Hocko wrote: > > [...] > > > > Can you try this monolithic patch please? > > > > > > Wow, this looks much better! > > > > Damn it! Scratch that. I have made a mistake in configuration so this > > all has been 0-no-limit in fact. Sorry about that. It's only now that > > I've noticed that so I am retesting. Hopefully it will be done before I > > leave today. I will post it on Monday otherwise. > > And for real now and it is still Wow! And I am total idiot. The machine was not booted with mem=1G so the figures are completely useless. It is soooo Friday. I will start everything again on Monday with a clean head. Sorry about all the noise. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-06-21 15:09 ` Michal Hocko @ 2013-06-21 16:34 ` Tejun Heo 0 siblings, 0 replies; 30+ messages in thread From: Tejun Heo @ 2013-06-21 16:34 UTC (permalink / raw) To: Michal Hocko Cc: Mel Gorman, Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Balbir Singh, Glauber Costa On Fri, Jun 21, 2013 at 05:09:06PM +0200, Michal Hocko wrote: > And I am total idiot. The machine was not booted with mem=1G so the > figures are completely useless. > > It is soooo Friday. I will start everything again on Monday with a clean > head. > > Sorry about all the noise. Oh, don't be. It was the most fun thread of the whole week. Enjoy the weekend. :) -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework [not found] ` <20130620111206.GA14809-l3A5Bk7waGM@public.gmane.org> 2013-06-21 14:06 ` Michal Hocko @ 2013-06-25 15:49 ` Michal Hocko 1 sibling, 0 replies; 30+ messages in thread From: Michal Hocko @ 2013-06-25 15:49 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Thu 20-06-13 12:12:06, Mel Gorman wrote: > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > base is mmotm-2013-05-09-15-57 > > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > > without slab shrinkers patchset. > > reworkrebase all patches 8 applied on top of baserebase > > > > * No-limit > > User > > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > > System > > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > > Elapsed > > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > > > Elapsed time regressed by 13% wrt. base but it seems that this came from > > baserebase which regressed by the same amount. > > > > boo-urns > > > Page fault statistics tell us at least part of the story: > > Minor > > base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6 > > baserebase: min: 35852414.00 [99.8%] max: 35899605.00 [99.6%] avg: 35874906.83 [99.7%] std: 18722.59 runs: 6 > > reworkrebase: min: 35538346.00 [98.9%] max: 35584907.00 [98.8%] avg: 35562362.17 [98.8%] std: 18921.74 runs: 6 > > Major > > base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6 > > baserebase: min: 34224.00 [134.8%] max: 45674.00 [137.9%] avg: 41556.83 [138.7%] std: 3595.39 runs: 6 > > reworkrebase: min: 277.00 [1.1%] max: 480.00 [1.4%] avg: 384.67 [1.3%] std: 74.67 runs: 6 > > Can you try this monolithic patch please? Now that I've tested the patch properly, finally, I have to say that the situation has improved. 0-limit case has a bigger variation with 15% regression in the worst case and down to 4% regression in the best case (which gets us to 7% on average). no-limit configuration behaves better. The worst case is not so bad and we are down to 3% regression in average. 
* 0-limit
User
base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6
baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6
mel: min: 1183.62 [99.6%] max: 1208.19 [100.8%] avg: 1194.00 [100.0%] std: 8.71 runs: 6
System
base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6
baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6
mel: min: 249.41 [100.4%] max: 253.34 [100.5%] avg: 251.28 [100.4%] std: 1.20 runs: 6
Elapsed
base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6
baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6
mel: min: 789.62 [104.0%] max: 928.68 [115.3%] avg: 845.62 [107.7%] std: 59.31 runs: 6

* no-limit
User
base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6
baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6
mel: min: 1174.58 [100.8%] max: 1176.46 [100.6%] avg: 1175.53 [100.6%] std: 0.59 runs: 6
System
base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6
baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6
mel: min: 244.79 [100.9%] max: 246.86 [100.6%] avg: 246.30 [101.0%] std: 0.76 runs: 6
Elapsed
base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6
baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6
mel: min: 618.77 [103.7%] max: 637.84 [102.9%] avg: 627.92 [103.7%] std: 6.64 runs: 6
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 30+ messages in thread
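For readers following the tables: the bracketed percentages read as each kernel's figure relative to the corresponding base figure (e.g. the no-limit Elapsed avg of 627.92 for "mel" against the base avg of 605.52 gives the quoted 103.7%). The small program below is only an illustration of that arithmetic, with made-up run values; it is not the /usr/bin/time based harness that produced the numbers, it prints the percentage for the avg column only, and whether std was computed over n or n-1 runs is a guess.

/*
 * Illustration only: min/max/avg/std summaries plus the "[x%]"
 * figure relative to the base kernel, in the style of the tables
 * above.  Run values are invented.
 */
#include <math.h>
#include <stdio.h>

static double average(const double *v, int n)
{
	double sum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / n;
}

static void summarize(const char *name, const double *v, int n, double base_avg)
{
	double min = v[0], max = v[0], avg = average(v, n), var = 0.0;
	int i;

	for (i = 0; i < n; i++) {
		if (v[i] < min) min = v[i];
		if (v[i] > max) max = v[i];
		var += (v[i] - avg) * (v[i] - avg);
	}
	printf("%-14s min: %.2f max: %.2f avg: %.2f", name, min, max, avg);
	if (base_avg > 0.0)
		printf(" [%.1f%%]", 100.0 * avg / base_avg);
	/* population standard deviation; the harness may have used n-1 */
	printf(" std: %.2f runs: %d\n", sqrt(var / n), n);
}

int main(void)
{
	double base[]   = { 596.81, 600.10, 603.40, 606.00, 610.80, 620.04 };
	double rework[] = { 664.05, 672.30, 688.90, 693.40, 696.00, 701.06 };

	summarize("base:", base, 6, 0.0);
	summarize("reworkrebase:", rework, 6, average(base, 6));
	return 0;
}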
* Re: [PATCH v5] Soft limit rework [not found] ` <1371557387-22434-1-git-send-email-mhocko-AlSwsSmVLrQ@public.gmane.org> @ 2013-08-19 16:35 ` Johannes Weiner [not found] ` <20130819163512.GB712-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 0 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-08-19 16:35 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > Hi, > > This is the fifth version of the patchset. > > Summary of versions: > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > (lkml wasn't CCed at the time so I cannot find it in lwn.net > archives). There were no major objections. Except there are. > My primary test case was a parallel kernel build with 2 groups (make > is running with -j4 with a distribution .config in a separate cgroup > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > was mostly interested in 2 setups. Default - no soft limit set and - and > 0 soft limit set to both groups. > The first one should tell us whether the rework regresses the default > behavior while the second one should show us improvements in an extreme > case where both workloads are always over the soft limit. Two kernel builds with 1G of memory means that reclaim is purely trimming the cache every once in a while. Changes in memory pressure are not measurable up to a certain point, because whether you trim old cache or not does not affect the build jobs. Also you tested the no-softlimit case and an extreme soft limit case. Where are the common soft limit cases? > /usr/bin/time -v has been used to collect the statistics and each > configuration had 3 runs after fresh boot without any other load on the > system. > > base is mmotm-2013-05-09-15-57 > baserebase is mmotm-2013-06-05-17-24-63 + patches from the current mmots > without slab shrinkers patchset. > reworkrebase all patches 8 applied on top of baserebase > > * No-limit > User > base: min: 1164.94 max: 1169.75 avg: 1168.31 std: 1.57 runs: 6 > baserebase: min: 1169.46 [100.4%] max: 1176.07 [100.5%] avg: 1172.49 [100.4%] std: 2.38 runs: 6 > reworkrebase: min: 1172.58 [100.7%] max: 1177.43 [100.7%] avg: 1175.53 [100.6%] std: 1.91 runs: 6 > System > base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6 > baserebase: min: 235.36 [97.0%] max: 238.52 [97.2%] avg: 236.70 [97.0%] std: 1.04 runs: 6 > reworkrebase: min: 236.21 [97.4%] max: 239.46 [97.6%] avg: 237.55 [97.4%] std: 1.05 runs: 6 > Elapsed > base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6 > baserebase: min: 666.45 [111.7%] max: 710.89 [114.7%] avg: 690.62 [114.1%] std: 13.85 runs: 6 > reworkrebase: min: 664.05 [111.3%] max: 701.06 [113.1%] avg: 689.29 [113.8%] std: 12.36 runs: 6 > > Elapsed time regressed by 13% wrt. base but it seems that this came from > baserebase which regressed by the same amount. 
> > * 0-limit > User > base: min: 1188.28 max: 1198.54 avg: 1194.10 std: 3.31 runs: 6 > baserebase: min: 1186.17 [99.8%] max: 1196.46 [99.8%] avg: 1189.75 [99.6%] std: 3.41 runs: 6 > reworkrebase: min: 1169.88 [98.5%] max: 1177.84 [98.3%] avg: 1173.50 [98.3%] std: 2.79 runs: 6 > System > base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6 > baserebase: min: 240.77 [96.9%] max: 246.74 [97.9%] avg: 243.63 [97.4%] std: 2.23 runs: 6 > reworkrebase: min: 235.19 [94.7%] max: 237.43 [94.2%] avg: 236.35 [94.5%] std: 0.86 runs: 6 > Elapsed > base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6 > baserebase: min: 881.69 [116.1%] max: 938.14 [116.5%] avg: 911.68 [116.2%] std: 19.58 runs: 6 > reworkrebase: min: 667.54 [87.9%] max: 718.54 [89.2%] avg: 695.61 [88.6%] std: 17.16 runs: 6 You set one group unlimited and one group to limit 0. I would expect all memory pressure that occurs to be applied to the 0-limit group. Is this happening? If so, wouldn't it be expected that its working set is thrashed? It could be that the patched kernel just shifts some of the pressure to the No-limit group, which would arguably be a regression. But as I said, this is not measurable in realtime. > While the minor faults are within the noise the major faults are reduced > considerably. This looks like an aggressive pageout during the reclaim > and that pageout affects the working set presumably. Please note that > baserebase has even hight number of major page faults than the older > mmotm trree. > > While this looks as a nice win it is fair to say that there are some > workloads that actually benefit from reclaim at 0 priority (from > background reclaim). E.g. an aggressive streaming IO would like to get > rid of as many pages as possible and do not block on the pages under > writeback. This can lead to a higher System time but I generally got > Elapsed which was comparable. > > The following results are from 2 groups configuration on a 8GB machine > (A running stream IO with 4*TotalMem with 0 soft limit, B runnning a > mem_eater which consumes TotalMem-1G without any limit). > System > base: min: 124.88 max: 136.97 avg: 130.77 std: 4.94 runs: 3 > baserebase: min: 102.51 [82.1%] max: 108.84 [79.5%] avg: 104.81 [80.1%] std: 2.86 runs: 3 > reworkrebase: min: 108.29 [86.7%] max: 121.70 [88.9%] avg: 114.60 [87.6%] std: 5.50 runs: 3 > Elapsed > base: min: 398.86 max: 412.81 avg: 407.62 std: 6.23 runs: 3 > baserebase: min: 480.92 [120.6%] max: 497.56 [120.5%] avg: 491.46 [120.6%] std: 7.48 runs: 3 > reworkrebase: min: 397.19 [99.6%] max: 462.57 [112.1%] avg: 436.13 [107.0%] std: 28.12 runs: 3 > > baserebase regresses again by 20% and the series is worse by 7% but it > is still at 89% wrt baserebase so it looks good to me. > > So to wrap this up. The series is still doing good and improves the soft > limit. The soft limit tree is a bunch of isolated code that's completely straight-forward. This is replaced by convoluted memcg iterators, convoluted lruvec shrinkers, spreading even more memcg callbacks with questionable semantics into already complicated generic reclaim code. This series considerably worsens readability and maintainability of both the generic reclaim code as well as the memcg counterpart of it. The point of naturalizing the memcg code is to reduce data structures and redundancy and to break open opaque interfaces like "do soft reclaim and report back". But you didn't actually reduce complexity, you added even more opaque callbacks (should_soft_reclaim? soft_reclaim_eligible?). 
You didn't integrate soft limit into generic reclaim code, you just made the soft limit API more complicated. And, as I mentioned repeatedly in previous submissions, your benchmark numbers don't actually say anything useful about this change. I'm against merging this upstream at this point. ^ permalink raw reply [flat|nested] 30+ messages in thread
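The minor/major page fault figures quoted in this exchange come from /usr/bin/time -v, as the original posting notes. For anyone wanting to reproduce just that part of the measurement, getrusage(2) exposes the same counters; the sketch below is a rough equivalent with minimal error handling, not the tool actually used here.

/*
 * Rough equivalent of the page fault counters reported by
 * /usr/bin/time -v: run a command and print its minor/major faults.
 */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct rusage ru;
	int status;
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}

	pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {
		execvp(argv[1], &argv[1]);
		perror("execvp");
		_exit(127);
	}

	if (waitpid(pid, &status, 0) < 0 ||
	    getrusage(RUSAGE_CHILDREN, &ru) < 0) {
		perror("waitpid/getrusage");
		return 1;
	}

	/* minor: serviced without I/O; major: required reading from disk */
	printf("minor faults: %ld\n", ru.ru_minflt);
	printf("major faults: %ld\n", ru.ru_majflt);
	return 0;
}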
* Re: [PATCH v5] Soft limit rework [not found] ` <20130819163512.GB712-druUgvl0LCNAfugRpC6u6w@public.gmane.org> @ 2013-08-20 9:14 ` Michal Hocko 2013-08-20 14:13 ` Johannes Weiner 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-08-20 9:14 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > Hi, > > > > This is the fifth version of the patchset. > > > > Summary of versions: > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > archives). There were no major objections. > > Except there are. Good to know that late... It would have been much more helpful to have such a principal feedback few months ago (this work is here since early Jun). > > My primary test case was a parallel kernel build with 2 groups (make > > is running with -j4 with a distribution .config in a separate cgroup > > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > > was mostly interested in 2 setups. Default - no soft limit set and - and > > 0 soft limit set to both groups. > > The first one should tell us whether the rework regresses the default > > behavior while the second one should show us improvements in an extreme > > case where both workloads are always over the soft limit. > > Two kernel builds with 1G of memory means that reclaim is purely > trimming the cache every once in a while. Changes in memory pressure > are not measurable up to a certain point, because whether you trim old > cache or not does not affect the build jobs. > > Also you tested the no-softlimit case and an extreme soft limit case. > Where are the common soft limit cases? v5.1 had some more tests. I have added soft limitted stream IO resp. kbuild vs unlimitted mem_eater loads. Have you checked those? [...] > > So to wrap this up. The series is still doing good and improves the soft > > limit. > > The soft limit tree is a bunch of isolated code that's completely > straight-forward. This is replaced by convoluted memcg iterators, > convoluted lruvec shrinkers, spreading even more memcg callbacks with > questionable semantics into already complicated generic reclaim code. I was trying to keep the convolution into vmscan as small as possible. Maybe it can get reduced even more. I will think about it. Predicate for memcg iterator has been added to address your concern about a potential regression with too many groups. And that looked like the least convoluting solution. > This series considerably worsens readability and maintainability of > both the generic reclaim code as well as the memcg counterpart of it. I am really surprised that you are coming with this concerns that late. This code has been posted quite some ago, hasn't it? We have even had that "calm" discussion with Tejun about predicates and you were silent at the time. > The point of naturalizing the memcg code is to reduce data structures > and redundancy and to break open opaque interfaces like "do soft > reclaim and report back". But you didn't actually reduce complexity, > you added even more opaque callbacks (should_soft_reclaim? > soft_reclaim_eligible?). 
You didn't integrate soft limit into generic > reclaim code, you just made the soft limit API more complicated. I can certainly think about simplifications. But it would be nicer if you were more specific on the "more complicated" part. The soft reclaim is a natural part of the reclaim now. Which I find as an improvement. "Do some memcg magic and get back was" a bad idea IMO. Hiding the soft limit decisions into the iterators as a searching criteria doesn't sound as a totally bad idea to me. Soft limit is an additional criteria who to reclaim, isn't it? Well, I could have open coded it but that would mean a more code into vmscan or getting back to "call some memcg magic and get back to me". > And, as I mentioned repeatedly in previous submissions, your benchmark > numbers don't actually say anything useful about this change. I would really welcome suggestions for improvements. I have tried "The most interesting test case would be how it behaves if some groups are over the soft limits while others are not." with v5.1 where I had memeater unlimited and kbuild resp. stream IO being limited. > I'm against merging this upstream at this point. Can we at least find some middle ground here? The way how the current soft limit is done is a disaster. Ditching the whole series sounds like a step back to me. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
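To make the predicate discussion above more concrete, here is a toy userspace model of the kind of walk being argued about: a per-group predicate decides whether to reclaim the group, descend past it, or skip its whole subtree (the SKIP_TREE case). Every type, name and eligibility rule below is a simplified illustration; the real iterator and soft_reclaim_eligible() logic live in mm/memcontrol.c and mm/vmscan.c and are more involved.

/*
 * Toy model of a predicate-driven hierarchy walk.  A group is
 * reclaimed if it (or an ancestor) exceeds its soft limit; otherwise
 * we descend only where some descendant is in excess and skip whole
 * subtrees with nothing to do.
 */
#include <stdbool.h>
#include <stdio.h>

enum visit { VISIT, SKIP, SKIP_TREE };

struct group {
	const char *name;
	long usage;
	long soft_limit;
	int excess_below;		/* descendants over their soft limit */
	struct group *children[4];
};

static enum visit predicate(const struct group *g, bool ancestor_over)
{
	if (ancestor_over || g->usage > g->soft_limit)
		return VISIT;
	return g->excess_below ? SKIP : SKIP_TREE;
}

static void walk(struct group *g, bool ancestor_over)
{
	enum visit v = predicate(g, ancestor_over);
	int i;

	if (v == SKIP_TREE) {
		printf("skip whole subtree under %s\n", g->name);
		return;
	}
	if (v == VISIT)
		printf("soft reclaim %s\n", g->name);

	for (i = 0; i < 4 && g->children[i]; i++)
		walk(g->children[i], v == VISIT);
}

int main(void)
{
	struct group a1 = { "A/a1", 80, 50, 0, { 0 } };	/* in excess */
	struct group a2 = { "A/a2", 10, 50, 0, { 0 } };
	struct group a  = { "A", 120, 200, 1, { &a1, &a2, 0 } };
	struct group b  = { "B", 300, 1000, 0, { 0 } };	/* nothing in excess */

	walk(&a, false);
	walk(&b, false);
	return 0;
}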
* Re: [PATCH v5] Soft limit rework 2013-08-20 9:14 ` Michal Hocko @ 2013-08-20 14:13 ` Johannes Weiner 2013-08-22 10:58 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-08-20 14:13 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue, Aug 20, 2013 at 11:14:14AM +0200, Michal Hocko wrote: > On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > > Hi, > > > > > > This is the fifth version of the patchset. > > > > > > Summary of versions: > > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > > archives). There were no major objections. > > > > Except there are. > > Good to know that late... It would have been much more helpful to have > such a principal feedback few months ago (this work is here since early > Jun). I NAKed this "integrate into reclaim but still make soft reclaim an extra pass" idea the first time Ying submitted it. And no, the soft limit discussions have been going on since I first submitted a rewrite as part of the reclaim integration series in mid-2011. We have been going back and forth on whether soft limits should behave as guarantees and *protect* from reclaim. But I thought we had since agreed that implementing guarantees should be a separate effort and as far as I'm concerned there is no justification for all this added code just to prevent reclaiming non-excess when there is excess around. > > > My primary test case was a parallel kernel build with 2 groups (make > > > is running with -j4 with a distribution .config in a separate cgroup > > > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > > > was mostly interested in 2 setups. Default - no soft limit set and - and > > > 0 soft limit set to both groups. > > > The first one should tell us whether the rework regresses the default > > > behavior while the second one should show us improvements in an extreme > > > case where both workloads are always over the soft limit. > > > > Two kernel builds with 1G of memory means that reclaim is purely > > trimming the cache every once in a while. Changes in memory pressure > > are not measurable up to a certain point, because whether you trim old > > cache or not does not affect the build jobs. > > > > Also you tested the no-softlimit case and an extreme soft limit case. > > Where are the common soft limit cases? > > v5.1 had some more tests. I have added soft limitted stream IO resp. kbuild vs > unlimitted mem_eater loads. Have you checked those? Another 0-limit test? If you declare every single page a cgroup has as being in excess of what it should have, how does reclaiming them aggressively constitute bad behavior? > [...] > > > So to wrap this up. The series is still doing good and improves the soft > > > limit. > > > > The soft limit tree is a bunch of isolated code that's completely > > straight-forward. This is replaced by convoluted memcg iterators, > > convoluted lruvec shrinkers, spreading even more memcg callbacks with > > questionable semantics into already complicated generic reclaim code. > > I was trying to keep the convolution into vmscan as small as possible. > Maybe it can get reduced even more. I will think about it. 
> > Predicate for memcg iterator has been added to address your concern > about a potential regression with too many groups. And that looked like > the least convoluting solution. I suggested improving the 1st pass by propagating soft limit excess through the res counters. They are the hierarchical book keepers. I even send you patches to do this. I don't understand why you propagate soft limit excess in struct mem_cgroup and use this ridiculous check_events thing to update them. We already have the per-cpu charge caches (stock) and uncharge caches to batch res_counter updates. Anyway, this would be a replacement for the current tree structure to quickly find memcgs in excess. It has nothing to do with integrating it into vmscan.c. > > This series considerably worsens readability and maintainability of > > both the generic reclaim code as well as the memcg counterpart of it. > > I am really surprised that you are coming with this concerns that late. > This code has been posted quite some ago, hasn't it? We have even had > that "calm" discussion with Tejun about predicates and you were silent > at the time. I apologize that I was not more responsive in previous submissions, this is mainly because it was hard to context switch while I was focussed on the allocator fairness / thrash detection patches. But you are surprised that I still object to something that I have been objecting to from day one? What exactly has changed? > > The point of naturalizing the memcg code is to reduce data structures > > and redundancy and to break open opaque interfaces like "do soft > > reclaim and report back". But you didn't actually reduce complexity, > > you added even more opaque callbacks (should_soft_reclaim? > > soft_reclaim_eligible?). You didn't integrate soft limit into generic > > reclaim code, you just made the soft limit API more complicated. > > I can certainly think about simplifications. But it would be nicer if > you were more specific on the "more complicated" part. The soft reclaim > is a natural part of the reclaim now. Which I find as an improvement. > "Do some memcg magic and get back was" a bad idea IMO. It depends. Doing memcg reclaim in generic code made sense because the global data structure went and the generic code absolutely had to know about memcg iteration. Soft limit reclaim is something else. If it were just a modifier of the existing memcg reclaim loop, I would still put it in vmscan.c for simplicity reasons. But your separate soft limit reclaim pass adds quite some complication to the generic code and there is no good reason for it. The naming does not help to make it very natural to vmscan.c, either. __shrink_zone() should be shrink_lruvecs(), mem_cgroup_soft_reclaim_eligible() should be mem_cgroup_soft_limit_exceeded() or something like that. SKIP_TREE should not be a kernel-wide name for anything. Also note that we have since moved on to make lruvec the shared structure and try to steer away from using memcgs outside of memcontrol.c, so I'm not really fond of tying more generic code to struct mem_cgroup. > Hiding the soft limit decisions into the iterators as a searching > criteria doesn't sound as a totally bad idea to me. Soft limit is an > additional criteria who to reclaim, isn't it? Arguably it's a criteria of /how/ to reclaim any given memcg. Making it about pre-selecting memcgs is a totally different beast. > Well, I could have open coded it but that would mean a more code into > vmscan or getting back to "call some memcg magic and get back to me". 
Either way sounds preferable to having vmscan.c do a lot of things it does not understand. > > And, as I mentioned repeatedly in previous submissions, your benchmark > > numbers don't actually say anything useful about this change. > > I would really welcome suggestions for improvements. I have tried "The > most interesting test case would be how it behaves if some groups are > over the soft limits while others are not." with v5.1 where I had > memeater unlimited and kbuild resp. stream IO being limited. The workload seems better but the limit configuration is still very dubious. I just don't understand what your benchmarks are trying to express. If you set a soft limit of 0 you declare all memory is optional but then you go ahead and complain that the workload in there is thrashing. WTF? Kernelbuilds are not really good candidates for soft limit usage in the first place, because they have a small set of heavily used pages and a lot of used-once cache. The core workingset is anything but optional and you could just set the hard limit to clip all that needless cache because it does not change performance whether it's available or not. Same exact thing goes for IO streamers and their subset of dirty/writeback pages. > > I'm against merging this upstream at this point. > Can we at least find some middle ground here? The way how the current > soft limit is done is a disaster. Ditching the whole series sounds like > a step back to me. I'm worried about memcg madness spreading. Memcg code is nowhere near kernel core code standards and I'm really reluctant to have it spread into something as delicate as vmscan.c for no good reason at all. If you claim the soft limit implementation is a disaster, I would have expected a real life workload to back this up but all I've seen are synthetic tests with no apparent significance to reality. So, no, I don't buy the "let's just do SOMETHING here" argument. Let's figure out what the hell is wrong, back it up, find a solution and verify it against the test case. Memcg/cgroup code is full of overengineered solutions to problems that don't exist. It's insane and it has to stop. ^ permalink raw reply [flat|nested] 30+ messages in thread
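For what it is worth, the res_counter suggestion referenced above boils down to this: charges already propagate up the parent chain, so soft-limit excess can be derived (or cached) at every level from the counters themselves instead of being tracked in a separate tree. The following is a simplified userspace model with made-up types, not the kernel's res_counter API.

/*
 * Simplified model of hierarchical charging: usage propagates to all
 * ancestors, so soft-limit excess can be computed at any level from
 * the counters alone.
 */
#include <stdio.h>

struct counter {
	const char *name;
	long usage;
	long soft_limit;
	struct counter *parent;
};

/* charge propagates up the chain, as hierarchical charging does */
static void charge(struct counter *c, long nr_bytes)
{
	for (; c; c = c->parent)
		c->usage += nr_bytes;
}

/* how far over its soft limit a group is; 0 if below */
static long soft_limit_excess(const struct counter *c)
{
	return c->usage > c->soft_limit ? c->usage - c->soft_limit : 0;
}

int main(void)
{
	struct counter root = { "root", 0, 1 << 30, NULL };
	struct counter a    = { "A",    0, 64 << 20, &root };
	struct counter a1   = { "A/a1", 0, 16 << 20, &a };

	charge(&a1, 100 << 20);	/* 100M charged to the leaf */

	printf("%-5s excess: %ld MB\n", a1.name, soft_limit_excess(&a1) >> 20);
	printf("%-5s excess: %ld MB\n", a.name, soft_limit_excess(&a) >> 20);
	printf("%-5s excess: %ld MB\n", root.name, soft_limit_excess(&root) >> 20);
	return 0;
}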
* Re: [PATCH v5] Soft limit rework 2013-08-20 14:13 ` Johannes Weiner @ 2013-08-22 10:58 ` Michal Hocko 2013-09-03 16:15 ` Johannes Weiner 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-08-22 10:58 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa [I am mostly offline for the whole week with very limitted internet access so it will get longer for me to respond to emails. Sorry about that] On Tue 20-08-13 10:13:39, Johannes Weiner wrote: > On Tue, Aug 20, 2013 at 11:14:14AM +0200, Michal Hocko wrote: > > On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > > > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > > > Hi, > > > > > > > > This is the fifth version of the patchset. > > > > > > > > Summary of versions: > > > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > > > archives). There were no major objections. > > > > > > Except there are. > > > > Good to know that late... It would have been much more helpful to have > > such a principal feedback few months ago (this work is here since early > > Jun). > > I NAKed this "integrate into reclaim but still make soft reclaim an > extra pass" idea the first time Ying submitted it. There was a general agreement about this approach at LSF AFAIR. So it is really surprising hearing that from you _now_ I do not remember the details of your previous NAK but I do remember you liked the series I posted before the conference. I am willing to change details (e.g. move big vmscan part into memcg) but I do insist on having soft reclaim integrated to the regular reclaim. > And no, the soft limit discussions have been going on since I first > submitted a rewrite as part of the reclaim integration series in > mid-2011. Which makes the whole story even more sad :/ > We have been going back and forth on whether soft limits should behave > as guarantees and *protect* from reclaim. But I thought we had since > agreed that implementing guarantees should be a separate effort and as > far as I'm concerned there is no justification for all this added code > just to prevent reclaiming non-excess when there is excess around. The patchset is absolutely not about guarantees anymore. It is more about making the current soft limit reclaim more natural. E.g. not doing prio-0 scans. The side effect that the isolation works better in the result is just a bonus. And I have to admit I like that bonus. > > > > My primary test case was a parallel kernel build with 2 groups (make > > > > is running with -j4 with a distribution .config in a separate cgroup > > > > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > > > > was mostly interested in 2 setups. Default - no soft limit set and - and > > > > 0 soft limit set to both groups. > > > > The first one should tell us whether the rework regresses the default > > > > behavior while the second one should show us improvements in an extreme > > > > case where both workloads are always over the soft limit. > > > > > > Two kernel builds with 1G of memory means that reclaim is purely > > > trimming the cache every once in a while. Changes in memory pressure > > > are not measurable up to a certain point, because whether you trim old > > > cache or not does not affect the build jobs. 
> > > > > > Also you tested the no-softlimit case and an extreme soft limit case. > > > Where are the common soft limit cases? > > > > v5.1 had some more tests. I have added soft limitted stream IO resp. kbuild vs > > unlimitted mem_eater loads. Have you checked those? > > Another 0-limit test? If you declare every single page a cgroup has > as being in excess of what it should have, how does reclaiming them > aggressively constitute bad behavior? My testing simply tries to compare the behavior before and after the patchset. And points out that the current soft reclaim is really aggressive and it doesn't have to act like that. > > [...] > > > > So to wrap this up. The series is still doing good and improves the soft > > > > limit. > > > > > > The soft limit tree is a bunch of isolated code that's completely > > > straight-forward. This is replaced by convoluted memcg iterators, > > > convoluted lruvec shrinkers, spreading even more memcg callbacks with > > > questionable semantics into already complicated generic reclaim code. > > > > I was trying to keep the convolution into vmscan as small as possible. > > Maybe it can get reduced even more. I will think about it. > > > > Predicate for memcg iterator has been added to address your concern > > about a potential regression with too many groups. And that looked like > > the least convoluting solution. > > I suggested improving the 1st pass by propagating soft limit excess > through the res counters. They are the hierarchical book keepers. I > even send you patches to do this. I don't understand why you > propagate soft limit excess in struct mem_cgroup and use this > ridiculous check_events thing to update them. We already have the > per-cpu charge caches (stock) and uncharge caches to batch res_counter > updates. > > Anyway, this would be a replacement for the current tree structure to > quickly find memcgs in excess. Yes and I do not insist on having this per-memcg counter. I took this way to keep this memcg specific thing withing memcg code and do not pollute generic cgroup code which doesn't care about this property at all. > It has nothing to do with integrating it into vmscan.c. agreed > > > This series considerably worsens readability and maintainability of > > > both the generic reclaim code as well as the memcg counterpart of it. > > > > I am really surprised that you are coming with this concerns that late. > > This code has been posted quite some ago, hasn't it? We have even had > > that "calm" discussion with Tejun about predicates and you were silent > > at the time. > > I apologize that I was not more responsive in previous submissions, > this is mainly because it was hard to context switch while I was > focussed on the allocator fairness / thrash detection patches. Come on, Johannes! We were sitting in the same room at LSF when there was a general agreement about the patchset because "It is a general improvement and further changes might happen on top of it". You didn't say a word you didn't like it and consider the integration as a bad idea. > But you are surprised that I still object to something that I have > been objecting to from day one? What exactly has changed? > > > > The point of naturalizing the memcg code is to reduce data structures > > > and redundancy and to break open opaque interfaces like "do soft > > > reclaim and report back". But you didn't actually reduce complexity, > > > you added even more opaque callbacks (should_soft_reclaim? > > > soft_reclaim_eligible?). 
You didn't integrate soft limit into generic > > > reclaim code, you just made the soft limit API more complicated. > > > > I can certainly think about simplifications. But it would be nicer if > > you were more specific on the "more complicated" part. The soft reclaim > > is a natural part of the reclaim now. Which I find as an improvement. > > "Do some memcg magic and get back was" a bad idea IMO. > > It depends. Doing memcg reclaim in generic code made sense because > the global data structure went and the generic code absolutely had to > know about memcg iteration. > > Soft limit reclaim is something else. If it were just a modifier of > the existing memcg reclaim loop, I would still put it in vmscan.c for > simplicity reasons. > > But your separate soft limit reclaim pass adds quite some complication > to the generic code and there is no good reason for it. Does it? The whole decision is hidden within the predicate so the main loop doesn't have to care at all. If we ever go with another limit for the guarantee it would simply use a different predicate to select the right group. > The naming does not help to make it very natural to vmscan.c, either. > __shrink_zone() should be shrink_lruvecs(), > mem_cgroup_soft_reclaim_eligible() should be > mem_cgroup_soft_limit_exceeded() or something like that. I am certainly open to change naming. > SKIP_TREE should not be a kernel-wide name for anything. That would require to move the looping code inside memcg. It is question whether that would help readability. I do not insist on doing the memcg loop inside vmscan, though. > Also note that we have > since moved on to make lruvec the shared structure and try to steer > away from using memcgs outside of memcontrol.c, so I'm not really fond > of tying more generic code to struct mem_cgroup. > > > Hiding the soft limit decisions into the iterators as a searching > > criteria doesn't sound as a totally bad idea to me. Soft limit is an > > additional criteria who to reclaim, isn't it? > > Arguably it's a criteria of /how/ to reclaim any given memcg. Making > it about pre-selecting memcgs is a totally different beast. > > > Well, I could have open coded it but that would mean a more code into > > vmscan or getting back to "call some memcg magic and get back to me". > > Either way sounds preferrable to having vmscan.c do a lot of things it > does not understand. > > > > And, as I mentioned repeatedly in previous submissions, your benchmark > > > numbers don't actually say anything useful about this change. > > > > I would really welcome suggestions for improvements. I have tried "The > > most interesting test case would be how it behaves if some groups are > > over the soft limits while others are not." with v5.1 where I had > > memeater unlimited and kbuild resp. stream IO being limited. > > The workload seems better but the limit configuration is still very > dubious. > > I just don't understand what your benchmarks are trying to express. > If you set a soft limit of 0 you declare all memory is optional but > then you go ahead and complain that the workload in there is > thrashing. WTF? As I've already said the primary motivation was to check before and after state and as you can see we can do better... Although you might find soft_limit=0 setting artificial this setting is the easiest way to tell that such a group is the number one candidate for reclaim. We have users who would like to do backups or some temporary actions to not disturb the main workload. So stream-io resp. kbuild vs. 
mem_eater is simulating such a use case. > Kernelbuilds are not really good candidates for soft limit usage in > the first place, because they have a small set of heavily used pages > and a lot of used-once cache. It still produces sufficient memory pressure to disturb other groups. > The core workingset is anything but > optional and you could just set the hard limit to clip all that > needless cache because it does not change performance whether it's > available or not. Same exact thing goes for IO streamers and their > subset of dirty/writeback pages. > > > I'm against merging this upstream at this point. > > Can we at least find some middle ground here? The way how the current > > soft limit is done is a disaster. Ditching the whole series sounds like > > a step back to me. > I'm worried about memcg madness spreading. Memcg code is nowhere near > kernel core code standards and I'm really reluctant to have it spread > into something as delicate as vmscan.c for no good reason at all. > If you claim the soft limit implementation is a disaster, I would have > expected a real life workload to back this up but all I've seen are > synthetic tests with no apparent significance to reality. > So, no, I don't buy the "let's just do SOMETHING here" argument. > Let's figure out what the hell is wrong, back it up, find a solution > and verify it against the test case. It is basically the prio-0 thing that is the disaster. And I was quite explicit about that. > Memcg/cgroup code is full of overengineered solutions to problems that > don't exist. It's insane and it has to stop. This code is getting rid of a big part of that over-engineering! I do agree there is a lot of room for improvement but, hey, that is always the case. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
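As a concrete reference for the kind of setup being debated (a soft-limited backup/streamer group next to an unconstrained main workload): the knob involved is memory.soft_limit_in_bytes in the v1 memory controller. The snippet below only writes the cgroupfs files; the mount point, group names and the 256M value are made-up examples, and whether a very low or zero soft limit is a sensible configuration is exactly what is in dispute in this thread.

/*
 * Example configuration: a soft-limited "backup" group next to an
 * unconstrained "main" group.  Assumes the v1 memory controller is
 * mounted at /sys/fs/cgroup/memory.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	char buf[32];

	/* creating a directory in cgroupfs creates the group; EEXIST is fine */
	mkdir("/sys/fs/cgroup/memory/main", 0755);
	mkdir("/sys/fs/cgroup/memory/backup", 0755);

	/* under global pressure the backup group is reclaimed preferentially */
	write_file("/sys/fs/cgroup/memory/backup/memory.soft_limit_in_bytes",
		   "268435456\n");		/* 256M */

	/* move the current task (and its future children) into the group */
	snprintf(buf, sizeof(buf), "%d\n", (int)getpid());
	return write_file("/sys/fs/cgroup/memory/backup/tasks", buf) ? 1 : 0;
}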
* Re: [PATCH v5] Soft limit rework 2013-08-22 10:58 ` Michal Hocko @ 2013-09-03 16:15 ` Johannes Weiner 2013-09-04 16:38 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-09-03 16:15 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa Hi Michal, On Thu, Aug 22, 2013 at 12:58:56PM +0200, Michal Hocko wrote: > [I am mostly offline for the whole week with very limitted internet > access so it will get longer for me to respond to emails. Sorry about > that] Same deal for me, just got back. Sorry for the delays. > On Tue 20-08-13 10:13:39, Johannes Weiner wrote: > > On Tue, Aug 20, 2013 at 11:14:14AM +0200, Michal Hocko wrote: > > > On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > > > > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > > > > Hi, > > > > > > > > > > This is the fifth version of the patchset. > > > > > > > > > > Summary of versions: > > > > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > > > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > > > > archives). There were no major objections. > > > > > > > > Except there are. > > > > > > Good to know that late... It would have been much more helpful to have > > > such a principal feedback few months ago (this work is here since early > > > Jun). > > > > I NAKed this "integrate into reclaim but still make soft reclaim an > > extra pass" idea the first time Ying submitted it. > > There was a general agreement about this approach at LSF AFAIR. So it is > really surprising hearing that from you _now_ > > I do not remember the details of your previous NAK but I do remember you > liked the series I posted before the conference. I am willing to change > details (e.g. move big vmscan part into memcg) but I do insist on having > soft reclaim integrated to the regular reclaim. I was okay with the series back at LSF. But this was under the assumption that we are moving towards turning softlimits into guarantees. It was only after LSF that we agreed that conflating these two things is not a good idea and that we are better off implementing guarantees separately. On the contrary, I am surprised that you are still pushing this series in light of the discussions after LSF. Google was a big pusher of the soft limit changes, but their requirements were always hard guarantees. We have since agreed that this should be done separately. What is the use case for soft limits at this point? > > And no, the soft limit discussions have been going on since I first > > submitted a rewrite as part of the reclaim integration series in > > mid-2011. > > Which makes the whole story even more sad :/ > > > We have been going back and forth on whether soft limits should behave > > as guarantees and *protect* from reclaim. But I thought we had since > > agreed that implementing guarantees should be a separate effort and as > > far as I'm concerned there is no justification for all this added code > > just to prevent reclaiming non-excess when there is excess around. > > The patchset is absolutely not about guarantees anymore. It is more > about making the current soft limit reclaim more natural. E.g. not doing > prio-0 scans. With guarantees out of the window there is absolutely no justification for this added level of complexity. The tree code is verbose but dead simple. 
And it's contained. You have not shown that prio-0 scans are a problem. Or even argued why you need new memcg iteration code to fix the priority level. > The side effect that the isolation works better in the result is just a > bonus. And I have to admit I like that bonus. What isolation? > > > > > My primary test case was a parallel kernel build with 2 groups (make > > > > > is running with -j4 with a distribution .config in a separate cgroup > > > > > without any hard limit) on a 8 CPU machine booted with 1GB memory. I > > > > > was mostly interested in 2 setups. Default - no soft limit set and - and > > > > > 0 soft limit set to both groups. > > > > > The first one should tell us whether the rework regresses the default > > > > > behavior while the second one should show us improvements in an extreme > > > > > case where both workloads are always over the soft limit. > > > > > > > > Two kernel builds with 1G of memory means that reclaim is purely > > > > trimming the cache every once in a while. Changes in memory pressure > > > > are not measurable up to a certain point, because whether you trim old > > > > cache or not does not affect the build jobs. > > > > > > > > Also you tested the no-softlimit case and an extreme soft limit case. > > > > Where are the common soft limit cases? > > > > > > v5.1 had some more tests. I have added soft limitted stream IO resp. kbuild vs > > > unlimitted mem_eater loads. Have you checked those? > > > > Another 0-limit test? If you declare every single page a cgroup has > > as being in excess of what it should have, how does reclaiming them > > aggressively constitute bad behavior? > > My testing simply tries to compare the behavior before and after the > patchset. And points out that the current soft reclaim is really > aggressive and it doesn't have to act like that. How can you get a meaningful result out of a meaningless setup? You can't tell if it's apropriately aggressive, not aggressive enough, or too aggressive if your test case does not mean anything. There are no sane expectations that can be met. > > > > > So to wrap this up. The series is still doing good and improves the soft > > > > > limit. > > > > > > > > The soft limit tree is a bunch of isolated code that's completely > > > > straight-forward. This is replaced by convoluted memcg iterators, > > > > convoluted lruvec shrinkers, spreading even more memcg callbacks with > > > > questionable semantics into already complicated generic reclaim code. > > > > > > I was trying to keep the convolution into vmscan as small as possible. > > > Maybe it can get reduced even more. I will think about it. > > > > > > Predicate for memcg iterator has been added to address your concern > > > about a potential regression with too many groups. And that looked like > > > the least convoluting solution. > > > > I suggested improving the 1st pass by propagating soft limit excess > > through the res counters. They are the hierarchical book keepers. I > > even send you patches to do this. I don't understand why you > > propagate soft limit excess in struct mem_cgroup and use this > > ridiculous check_events thing to update them. We already have the > > per-cpu charge caches (stock) and uncharge caches to batch res_counter > > updates. > > > > Anyway, this would be a replacement for the current tree structure to > > quickly find memcgs in excess. > > Yes and I do not insist on having this per-memcg counter. 
I took this > way to keep this memcg specific thing withing memcg code and do not > pollute generic cgroup code which doesn't care about this property at > all. res_counters hierarchically track usage, limit, and soft limit. Sounds like the perfect fit to track hierarchical soft limit excess to me. > > > > This series considerably worsens readability and maintainability of > > > > both the generic reclaim code as well as the memcg counterpart of it. > > > > > > I am really surprised that you are coming with this concerns that late. > > > This code has been posted quite some ago, hasn't it? We have even had > > > that "calm" discussion with Tejun about predicates and you were silent > > > at the time. > > > > I apologize that I was not more responsive in previous submissions, > > this is mainly because it was hard to context switch while I was > > focussed on the allocator fairness / thrash detection patches. > > Come on, Johannes! We were sitting in the same room at LSF when there > was a general agreement about the patchset because "It is a general > improvement and further changes might happen on top of it". You didn't > say a word you didn't like it and consider the integration as a bad > idea. It was a step towards turning soft limits into guarantees. What other use case was there? Then it might have been a good first step. I don't think it is a general improvement of the code and I don't remember saying that. > > > > The point of naturalizing the memcg code is to reduce data structures > > > > and redundancy and to break open opaque interfaces like "do soft > > > > reclaim and report back". But you didn't actually reduce complexity, > > > > you added even more opaque callbacks (should_soft_reclaim? > > > > soft_reclaim_eligible?). You didn't integrate soft limit into generic > > > > reclaim code, you just made the soft limit API more complicated. > > > > > > I can certainly think about simplifications. But it would be nicer if > > > you were more specific on the "more complicated" part. The soft reclaim > > > is a natural part of the reclaim now. Which I find as an improvement. > > > "Do some memcg magic and get back was" a bad idea IMO. > > > > It depends. Doing memcg reclaim in generic code made sense because > > the global data structure went and the generic code absolutely had to > > know about memcg iteration. > > > > Soft limit reclaim is something else. If it were just a modifier of > > the existing memcg reclaim loop, I would still put it in vmscan.c for > > simplicity reasons. > > > > But your separate soft limit reclaim pass adds quite some complication > > to the generic code and there is no good reason for it. > > Does it? The whole decision is hidden within the predicate so the main > loop doesn't have to care at all. If we ever go with another limit for > the guarantee it would simply use a different predicate to select the > right group. This is just handwaving. You replaced a simple function call in kswapd with an extra lruvec pass and an enhanced looping construct. It's all extra protocol that vmscan.c has to follow without actually improving the understandability of the code. You still have to dig into all those predicates to know what's going on. And at the risk of repeating myself, you haven't proven that it's worth the trouble. > > > > And, as I mentioned repeatedly in previous submissions, your benchmark > > > > numbers don't actually say anything useful about this change. > > > > > > I would really welcome suggestions for improvements. 
I have tried "The > > > most interesting test case would be how it behaves if some groups are > > > over the soft limits while others are not." with v5.1 where I had > > > memeater unlimited and kbuild resp. stream IO being limited. > > > > The workload seems better but the limit configuration is still very > > dubious. > > > > I just don't understand what your benchmarks are trying to express. > > If you set a soft limit of 0 you declare all memory is optional but > > then you go ahead and complain that the workload in there is > > thrashing. WTF? > > As I've already said the primary motivation was to check before and > after state and as you can see we can do better... Although you might > find soft_limit=0 setting artificial this setting is the easiest way to > tell that such a group is the number one candidate for reclaim. That would test for correctness, but then you draw conclusions about performance. Correctness-wise, the unpatched kernel seems to prefer reclaiming the 0-limit group as expected. You can't draw reasonable conclusions on performance and aggressiveness from such an unrealistic setup. > We have users who would like to do backups or some temporary actions to > not disturb the main workload. So stream-io resp. kbuild vs. mem_eater > is simulating such a use case. It's not obvious why you are not using hard limits instead. Again, streamers and kbuild don't really benefit from the extra cache, so there is no reason to let it expand into available memory and retract on pressure. These tests are just wildly inappropriate for soft limits, it's ridiculous that you insist that their outcome says anything at all. > > Kernelbuilds are not really good candidates for soft limit usage in > > the first place, because they have a small set of heavily used pages > > and a lot of used-once cache. > > It still produces the sufficient memory pressure to disturb other > groups. See "hard limit" above. > > The core workingset is anything but > > optional and you could just set the hard limit to clip all that > > needless cache because it does not change performance whether it's > > available or not. Same exact thing goes for IO streamers and their > > subset of dirty/writeback pages. > > > > > > I'm against merging this upstream at this point. > > > > > > Can we at least find some middle ground here? The way how the current > > > soft limit is done is a disaster. Ditching the whole series sounds like > > > a step back to me. > > > > I'm worried about memcg madness spreading. Memcg code is nowhere near > > kernel core code standards and I'm really reluctant to have it spread > > into something as delicate as vmscan.c for no good reason at all. > > > > If you claim the soft limit implementation is a disaster, I would have > > expected a real life workload to back this up but all I've seen are > > synthetic tests with no apparent significance to reality. > > > > So, no, I don't buy the "let's just do SOMETHING here" argument. > > > > Let's figure out what the hell is wrong, back it up, find a solution > > and verify it against the test case. > > It is basically prio-0 thing that is the disaster. And I was quite > explicit about that. And I keep telling you that you have proven neither the problem nor the solution. Your choice of tests suggests that you either don't understand what you are testing or that you don't understand what soft limits mean. There is no apparent point to this series and it adds complexity to vmscan.c. It has no business getting merged. 
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-09-03 16:15 ` Johannes Weiner @ 2013-09-04 16:38 ` Michal Hocko 2013-09-06 19:23 ` Johannes Weiner 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-09-04 16:38 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Tue 03-09-13 12:15:50, Johannes Weiner wrote: > > On Tue 20-08-13 10:13:39, Johannes Weiner wrote: > > > On Tue, Aug 20, 2013 at 11:14:14AM +0200, Michal Hocko wrote: > > > > On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > > > > > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > > > > > Hi, > > > > > > > > > > > > This is the fifth version of the patchset. > > > > > > > > > > > > Summary of versions: > > > > > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > > > > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > > > > > archives). There were no major objections. > > > > > > > > > > Except there are. > > > > > > > > Good to know that late... It would have been much more helpful to have > > > > such a principal feedback few months ago (this work is here since early > > > > Jun). > > > > > > I NAKed this "integrate into reclaim but still make soft reclaim an > > > extra pass" idea the first time Ying submitted it. > > > > There was a general agreement about this approach at LSF AFAIR. So it is > > really surprising hearing that from you _now_ > > > > I do not remember the details of your previous NAK but I do remember you > > liked the series I posted before the conference. I am willing to change > > details (e.g. move big vmscan part into memcg) but I do insist on having > > soft reclaim integrated to the regular reclaim. > > I was okay with the series back at LSF. > > But this was under the assumption that we are moving towards turning > softlimits into guarantees. It was only after LSF that we agreed that > conflating these two things is not a good idea and that we are better > off implementing guarantees separately. > > On the contrary, I am surprised that you are still pushing this series > in light of the discussions after LSF. > > Google was a big pusher of the soft limit changes, but their > requirements were always hard guarantees. We have since agreed that > this should be done separately. Yes, and this series is _not_ about guarantees. It is primarily about fitting the current soft reclaim into the regular reclaim more naturally. More on that bellow. > What is the use case for soft limits at this point? To handle overcommit situations more gracefully. As the documentation states: " 7. Soft limits Soft limits allow for greater sharing of memory. The idea behind soft limits is to allow control groups to use as much of the memory as needed, provided a. There is no memory contention b. They do not exceed their hard limit When the system detects memory contention or low memory, control groups are pushed back to their soft limits. If the soft limit of each control group is very high, they are pushed back as much as possible to make sure that one control group does not starve the others of memory. Please note that soft limits is a best-effort feature; it comes with no guarantees, but it does its best to make sure that when memory is heavily contended for, memory is allocated based on the soft limit hints/setup. 
Currently soft limit based reclaim is set up such that it gets invoked from balance_pgdat (kswapd). " Except for the last sentence the same holds for the integrated implementation as well. With the patchset we are doing the soft reclaim also for the targeted reclaim which was simply not possible previously because of the data structures limitations. And doing soft reclaim from target reclaim makes a lot of sense to me because whether we have a global or hierarchical memory pressure doesn't make any difference that some groups are set up to sacrifice their memory to help to release the pressure. > > > And no, the soft limit discussions have been going on since I first > > > submitted a rewrite as part of the reclaim integration series in > > > mid-2011. > > > > Which makes the whole story even more sad :/ > > > > > We have been going back and forth on whether soft limits should behave > > > as guarantees and *protect* from reclaim. But I thought we had since > > > agreed that implementing guarantees should be a separate effort and as > > > far as I'm concerned there is no justification for all this added code > > > just to prevent reclaiming non-excess when there is excess around. > > > > The patchset is absolutely not about guarantees anymore. It is more > > about making the current soft limit reclaim more natural. E.g. not doing > > prio-0 scans. > > With guarantees out of the window there is absolutely no justification > for this added level of complexity. The tree code is verbose but dead > simple. And it's contained. > > You have not shown that prio-0 scans are a problem. OK, I thought this was self evident but let me be more specific. The scan the world is almost always a problem. We are no longer doing proportional anon/file reclaim (swappiness is ignored). This is wrong from at least two points of view. Firstly it makes the reclaim decisions different a lot for groups that are under the soft limit and those that are over. Secondly, and more importantly, this might lead to a pre-mature swapping, especially when there is a lot of IO going on. The global reclaim suffers from the very same problem and that is why we try to prevent from prio-0 reclaim as much as possible and use it only as a last resort. > Or even argued why you need new memcg iteration code to fix the > priority level. The new iterator code is there to make the iteration more effective when selecting interesting groups. Whether the skipping logic should be inside the iterator or open coded outside is an implementation detail IMO. I am considering the callback approach better because it is reusable as the skipping part is implemented only once at the place with the same code which has to deal with all other details already. And the caller only cares to tell which (sub)trees he is interested in. > > The side effect that the isolation works better in the result is just a > > bonus. And I have to admit I like that bonus. > > What isolation? Reclaiming groups over the soft limit might be sufficient to release the memory pressure and so other groups might be saved from being reclaimed. Isolation might be a strong word for that as that would require a certain guarantee but working in a best-effort mode can work much better than what we have right now. If we do a fair reclaim on the whole (sub)tree rather than hammer the biggest offender we might end up reclaiming from other groups less. 
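The prio-0 objection above is easiest to see from the scan-target arithmetic: one reclaim pass scans roughly lru_size >> priority pages per LRU list, so the default priority of 12 touches only a small fraction of each list while a forced priority-0 pass scans everything, which is what makes it so disruptive and is the context for the swappiness point made above. The program below is a standalone illustration of that shift only, not the kernel's get_scan_count(); it leaves out the proportional anon/file balancing entirely.

/*
 * Illustration of how reclaim priority maps to scan pressure:
 * one pass scans roughly lru_size >> priority pages per LRU list,
 * so priority 0 means scanning the entire list.
 */
#include <stdio.h>

#define DEF_PRIORITY	12	/* where reclaim normally starts */

static unsigned long scan_target(unsigned long lru_size, int priority)
{
	unsigned long nr = lru_size >> priority;

	/* in this toy model, never return 0 for a non-empty list */
	return nr ? nr : (lru_size ? 1 : 0);
}

int main(void)
{
	unsigned long lru_size = 262144;	/* 1GB of 4k pages on one LRU */
	int prio;

	for (prio = DEF_PRIORITY; prio >= 0; prio--)
		printf("priority %2d -> scan %8lu of %lu pages\n",
		       prio, scan_target(lru_size, prio), lru_size);
	return 0;
}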
I even think that doing the fair soft reclaim is a better approach from the conceptual point of view because it is less prone to corner cases when one group is hammered over and over again without actually helping to relief the memory pressure for which is the soft limit intended in the first place. [...] > > > > > This series considerably worsens readability and maintainability of > > > > > both the generic reclaim code as well as the memcg counterpart of it. > > > > > > > > I am really surprised that you are coming with this concerns that late. > > > > This code has been posted quite some ago, hasn't it? We have even had > > > > that "calm" discussion with Tejun about predicates and you were silent > > > > at the time. > > > > > > I apologize that I was not more responsive in previous submissions, > > > this is mainly because it was hard to context switch while I was > > > focussed on the allocator fairness / thrash detection patches. > > > > Come on, Johannes! We were sitting in the same room at LSF when there > > was a general agreement about the patchset because "It is a general > > improvement and further changes might happen on top of it". You didn't > > say a word you didn't like it and consider the integration as a bad > > idea. > > It was a step towards turning soft limits into guarantees. What other > use case was there? And it is still a preparatory step for further improvements towards guarantees. If you look at the series a new (min limit or whatever you call it) knob is almost trivial to implement. > Then it might have been a good first step. I don't think it is a > general improvement of the code and I don't remember saying that. > > > > > > The point of naturalizing the memcg code is to reduce data structures > > > > > and redundancy and to break open opaque interfaces like "do soft > > > > > reclaim and report back". But you didn't actually reduce complexity, > > > > > you added even more opaque callbacks (should_soft_reclaim? > > > > > soft_reclaim_eligible?). You didn't integrate soft limit into generic > > > > > reclaim code, you just made the soft limit API more complicated. > > > > > > > > I can certainly think about simplifications. But it would be nicer if > > > > you were more specific on the "more complicated" part. The soft reclaim > > > > is a natural part of the reclaim now. Which I find as an improvement. > > > > "Do some memcg magic and get back was" a bad idea IMO. > > > > > > It depends. Doing memcg reclaim in generic code made sense because > > > the global data structure went and the generic code absolutely had to > > > know about memcg iteration. > > > > > > Soft limit reclaim is something else. If it were just a modifier of > > > the existing memcg reclaim loop, I would still put it in vmscan.c for > > > simplicity reasons. > > > > > > But your separate soft limit reclaim pass adds quite some complication > > > to the generic code and there is no good reason for it. > > > > Does it? The whole decision is hidden within the predicate so the main > > loop doesn't have to care at all. If we ever go with another limit for > > the guarantee it would simply use a different predicate to select the > > right group. > > This is just handwaving. You replaced a simple function call in > kswapd That simple call from kswapd is not that simple at all in fact. It hides a lot of memcg specific code which is far from being trivial. 
Even worse, that memcg specific code gets back into the reclaim code with different reclaim parameters than those used in the context it has been called from. > with an extra lruvec pass and an enhanced looping construct. > It's all extra protocol that vmscan.c has to follow without actually > improving the understandability of the code. You still have to dig > into all those predicates to know what's going on. Compared to the memcg per-node-zone excess trees, their maintenance, and the code needed to achieve at least some fairness during reclaim, the above is a) much less code to read/maintain/understand and b) it achieves the same result without the corner cases mentioned previously. I was trying to taint vmscan.c as little as possible. I have a patch which moves most of the memcg specific parts back to memcontrol.c but I cannot say I like it because it duplicates some code and has to expose scan_control and shrink_lruvec outside vmscan.c. > And at the risk of repeating myself, you haven't proven that it's > worth the trouble. > > > > > And, as I mentioned repeatedly in previous submissions, your benchmark > > > > > numbers don't actually say anything useful about this change. > > > > > > > > I would really welcome suggestions for improvements. I have tried "The > > > > most interesting test case would be how it behaves if some groups are > > > > over the soft limits while others are not." with v5.1 where I had > > > > memeater unlimited and kbuild resp. stream IO being limited. > > > > > > The workload seems better but the limit configuration is still very > > > dubious. > > > > > > I just don't understand what your benchmarks are trying to express. > > > If you set a soft limit of 0 you declare all memory is optional but > > > then you go ahead and complain that the workload in there is > > > thrashing. WTF? > > > > As I've already said the primary motivation was to check before and > > after state and as you can see we can do better... Although you might > > find soft_limit=0 setting artificial this setting is the easiest way to > > tell that such a group is the number one candidate for reclaim. > > That would test for correctness, but then you draw conclusions about > performance. The series is not about performance improvements. I am sorry if the testing results made a different impression. The primary motivation was to get rid of a big memcg specific reclaim path which doesn't handle fairness well and uses a hackish prio-0 reclaim which is too disruptive. > Correctness-wise, the unpatched kernel seems to prefer reclaiming the > 0-limit group as expected. It does, and even a simple configuration shows that the current soft reclaim is too disruptive and reclaims much more than necessary. > You can't draw reasonable conclusions on performance and > aggressiveness from such an unrealistic setup. > > > We have users who would like to do backups or some temporary actions to > > not disturb the main workload. So stream-io resp. kbuild vs. mem_eater > > is simulating such a use case. > > It's not obvious why you are not using hard limits instead. Because although the load in the soft limited groups is willing to sacrifice its memory, it would still prefer as much memory as possible to finish as soon as possible. Also, setting the hard limit might be really non-trivial, especially when you have several such loads, because you have to be careful that the cumulative usage doesn't cause global reclaim. 
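[Editor's note] To make the "taint vmscan.c as little as possible" point above concrete, here is a minimal sketch of the kind of two-pass loop the integrated approach implies. It reuses the simplified names from the previous sketch; next_group() and shrink_group() are hypothetical placeholders for the hierarchy walk and the normal proportional LRU scan, not the actual vmscan.c helpers.

/*
 * Simplified sketch only -- hypothetical helpers, not the series' code.
 * The first pass reclaims just the groups the predicate marks eligible;
 * only if that frees nothing does the loop fall back to all groups, so
 * the generic code never deals with soft limit details directly.
 */
struct scan_ctl {
        unsigned long nr_reclaimed;
        int priority;
};

enum filter_result { VISIT, SKIP, SKIP_TREE };  /* as in the sketch above */
struct cgroup_node;

enum filter_result soft_reclaim_eligible(struct cgroup_node *group,
                                         struct cgroup_node *root);
/* Hypothetical pre-order walk over the hierarchy under root. */
struct cgroup_node *next_group(struct cgroup_node *root, struct cgroup_node *prev);
/* Hypothetical stand-in for the normal proportional LRU scan. */
void shrink_group(struct cgroup_node *group, struct scan_ctl *sc);

static void shrink_zone_groups(struct cgroup_node *root, struct scan_ctl *sc)
{
        unsigned long before = sc->nr_reclaimed;
        struct cgroup_node *group;

        /* Soft limit pass: scan only groups in excess (or under an
         * ancestor in excess), at the caller's priority. */
        for (group = next_group(root, NULL); group; group = next_group(root, group))
                if (soft_reclaim_eligible(group, root) == VISIT)
                        shrink_group(group, sc);

        if (sc->nr_reclaimed > before)
                return;         /* pushing back the offenders was enough */

        /* Nothing in excess, or it was not enough: ordinary pass. */
        for (group = next_group(root, NULL); group; group = next_group(root, group))
                shrink_group(group, sc);
}

For brevity the eligibility check is open coded in the caller here; with the callback variant discussed above the filter would be passed to the iterator itself, which is what allows whole subtrees to be skipped rather than visited and rejected one group at a time.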
> Again, streamers and kbuild don't really benefit from the extra cache, > so there is no reason to let it expand into available memory and > retract on pressure. > These tests are just wildly inappropriate for soft limits, it's > ridiculous that you insist that their outcome says anything at all. I do realize that my testing load is simplified a lot, but to be honest, whatever load I come up with might still be dismissed as too simplified or artificial. We simply do not have any reference benchmark. I was merely interested in well-known and understood loads, and I compared before and after results for a basic overview of the changes. The series was not aimed at performance improvement in the first place. And to be honest I wasn't expecting such big differences, especially for stream IO, which should be an easy "drop everything behind" case as it is a use-once load. As you can see, though, the previous soft limit implementation was hammering even that load too much. You can see that quite nicely on this graph [1]. It is similar with the kbuild test case, where the system swapped out many more pages (by a factor of 145) and major page faults increased as well (by a factor of 85). You can argue that my testing loads are not typical soft limit configurations, and I might even agree, but they clearly show that the way we currently do soft reclaim is way too disruptive, and that is the first thing to fix if we want to move on in that area. --- [1] http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-09-04 16:38 ` Michal Hocko @ 2013-09-06 19:23 ` Johannes Weiner 2013-09-13 14:49 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-09-06 19:23 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Wed, Sep 04, 2013 at 06:38:23PM +0200, Michal Hocko wrote: > On Tue 03-09-13 12:15:50, Johannes Weiner wrote: > > > On Tue 20-08-13 10:13:39, Johannes Weiner wrote: > > > > On Tue, Aug 20, 2013 at 11:14:14AM +0200, Michal Hocko wrote: > > > > > On Mon 19-08-13 12:35:12, Johannes Weiner wrote: > > > > > > On Tue, Jun 18, 2013 at 02:09:39PM +0200, Michal Hocko wrote: > > > > > > > Hi, > > > > > > > > > > > > > > This is the fifth version of the patchset. > > > > > > > > > > > > > > Summary of versions: > > > > > > > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 > > > > > > > (lkml wasn't CCed at the time so I cannot find it in lwn.net > > > > > > > archives). There were no major objections. > > > > > > > > > > > > Except there are. > > > > > > > > > > Good to know that late... It would have been much more helpful to have > > > > > such a principal feedback few months ago (this work is here since early > > > > > Jun). > > > > > > > > I NAKed this "integrate into reclaim but still make soft reclaim an > > > > extra pass" idea the first time Ying submitted it. > > > > > > There was a general agreement about this approach at LSF AFAIR. So it is > > > really surprising hearing that from you _now_ > > > > > > I do not remember the details of your previous NAK but I do remember you > > > liked the series I posted before the conference. I am willing to change > > > details (e.g. move big vmscan part into memcg) but I do insist on having > > > soft reclaim integrated to the regular reclaim. > > > > I was okay with the series back at LSF. > > > > But this was under the assumption that we are moving towards turning > > softlimits into guarantees. It was only after LSF that we agreed that > > conflating these two things is not a good idea and that we are better > > off implementing guarantees separately. > > > > On the contrary, I am surprised that you are still pushing this series > > in light of the discussions after LSF. > > > > Google was a big pusher of the soft limit changes, but their > > requirements were always hard guarantees. We have since agreed that > > this should be done separately. > > Yes, and this series is _not_ about guarantees. It is primarily > about fitting the current soft reclaim into the regular reclaim more > naturally. More on that bellow. > > > What is the use case for soft limits at this point? > > To handle overcommit situations more gracefully. As the documentation > states: > " > 7. Soft limits > > Soft limits allow for greater sharing of memory. The idea behind soft limits > is to allow control groups to use as much of the memory as needed, provided > > a. There is no memory contention > b. They do not exceed their hard limit > > When the system detects memory contention or low memory, control groups > are pushed back to their soft limits. If the soft limit of each control > group is very high, they are pushed back as much as possible to make > sure that one control group does not starve the others of memory. 
> > Please note that soft limits is a best-effort feature; it comes with > no guarantees, but it does its best to make sure that when memory is > heavily contended for, memory is allocated based on the soft limit > hints/setup. Currently soft limit based reclaim is set up such that > it gets invoked from balance_pgdat (kswapd). > " > > Except for the last sentence the same holds for the integrated > implementation as well. With the patchset we are doing the soft reclaim > also for the targeted reclaim which was simply not possible previously > because of the data structures limitations. And doing soft reclaim from > target reclaim makes a lot of sense to me because whether we have a > global or hierarchical memory pressure doesn't make any difference that > some groups are set up to sacrifice their memory to help to release the > pressure. The issue I have with this is that the semantics of the soft limit are so backwards that we should strive to get this stuff right conceptually before integrating this better into the VM. We have a big user that asks for guarantees, which are comparable but the invert opposite of this. Instead of specifying what is optional in one group, you specify what is essential in the other group. And the default is to guarantee nothing instead of everything like soft limits are currently defined. We even tried to invert the default soft limit setting in the past, which went nowhere because we can't do these subtle semantic changes on an existing interface. I would really like to deprecate soft limits and introduce something new that has the proper semantics we want from the get-go. Its implementation could very much look like your code, so we can easily reuse that. But the interface and its semantics should come first. > > > > And no, the soft limit discussions have been going on since I first > > > > submitted a rewrite as part of the reclaim integration series in > > > > mid-2011. > > > > > > Which makes the whole story even more sad :/ > > > > > > > We have been going back and forth on whether soft limits should behave > > > > as guarantees and *protect* from reclaim. But I thought we had since > > > > agreed that implementing guarantees should be a separate effort and as > > > > far as I'm concerned there is no justification for all this added code > > > > just to prevent reclaiming non-excess when there is excess around. > > > > > > The patchset is absolutely not about guarantees anymore. It is more > > > about making the current soft limit reclaim more natural. E.g. not doing > > > prio-0 scans. > > > > With guarantees out of the window there is absolutely no justification > > for this added level of complexity. The tree code is verbose but dead > > simple. And it's contained. > > > > You have not shown that prio-0 scans are a problem. > > OK, I thought this was self evident but let me be more specific. > > The scan the world is almost always a problem. We are no longer doing > proportional anon/file reclaim (swappiness is ignored). This is wrong > from at least two points of view. Firstly it makes the reclaim decisions > different a lot for groups that are under the soft limit and those > that are over. Secondly, and more importantly, this might lead to a > pre-mature swapping, especially when there is a lot of IO going on. > > The global reclaim suffers from the very same problem and that is why > we try to prevent from prio-0 reclaim as much as possible and use it > only as a last resort. 
I know that and I can see that this should probably be fixed, but there is no quantification for this. We have no per-memcg reclaim statistics and your test cases were not useful in determining what's going on reclaim-wise. > > Or even argued why you need new memcg iteration code to fix the > > priority level. > > The new iterator code is there to make the iteration more effective when > selecting interesting groups. Whether the skipping logic should be > inside the iterator or open coded outside is an implementation detail > IMO. > I am considering the callback approach better because it is reusable as > the skipping part is implemented only once at the place with the same > code which has to deal with all other details already. And the caller > only cares to tell which (sub)trees he is interested in. I was questioning the selective iterators as such, rather than the specific implementation. This whole problem of needing better iterators comes from the fact that soft limits default to ~0UL, which means that by default there are no groups eligible for reclaim in the first pass. If you implement the separate guarantee / lower bound functionality with a default of no-guarantee, the common case is that everybody is eligible for reclaim in the first pass. And suddenly, the selective iterators are merely a performance optimization for the guarantee feature as opposed to a fundamental requirement for the common case. > > > The side effect that the isolation works better in the result is just a > > > bonus. And I have to admit I like that bonus. > > > > What isolation? > > Reclaiming groups over the soft limit might be sufficient to release > the memory pressure and so other groups might be saved from being > reclaimed. Isolation might be a strong word for that as that would > require a certain guarantee but working in a best-effort mode can work > much better than what we have right now. > > If we do a fair reclaim on the whole (sub)tree rather than hammer the > biggest offender we might end up reclaiming from other groups less. > > I even think that doing the fair soft reclaim is a better approach from > the conceptual point of view because it is less prone to corner cases > when one group is hammered over and over again without actually helping > to relief the memory pressure for which is the soft limit intended in > the first place. Fully agreed that we could do a better job at being selective. And as per above, this should be *the* single argument for the selective iterators and not just a side show. At this point, please step back and think about how many aspects of the soft limit semantics and implementation you are trying to fix with this series. My conclusion is that the interface is basically unsalvageable, based on our disagreements over the semantics and the implementation. Not just in this thread, but over many threads, conferences in person and on the phone. The fact is, soft limits do very little of what we actually want and we can't turn an existing interface upside down. There is no point in contorting ourselves, the documentation, and the code to achieve something reasonable while preserving something nobody cares about. > > > > > > This series considerably worsens readability and maintainability of > > > > > > both the generic reclaim code as well as the memcg counterpart of it. > > > > > > > > > > I am really surprised that you are coming with this concerns that late. > > > > > This code has been posted quite some ago, hasn't it? 
We have even had > > > > > that "calm" discussion with Tejun about predicates and you were silent > > > > > at the time. > > > > > > > > I apologize that I was not more responsive in previous submissions, > > > > this is mainly because it was hard to context switch while I was > > > > focussed on the allocator fairness / thrash detection patches. > > > > > > Come on, Johannes! We were sitting in the same room at LSF when there > > > was a general agreement about the patchset because "It is a general > > > improvement and further changes might happen on top of it". You didn't > > > say a word you didn't like it and consider the integration as a bad > > > idea. > > > > It was a step towards turning soft limits into guarantees. What other > > use case was there? > > And it is still a preparatory step for further improvements towards > guarantees. If you look at the series a new (min limit or whatever you > call it) knob is almost trivial to implement. So let's just do the right thing and forget about all these detours, shall we? > > Then it might have been a good first step. I don't think it is a > > general improvement of the code and I don't remember saying that. > > > > > > > > The point of naturalizing the memcg code is to reduce data structures > > > > > > and redundancy and to break open opaque interfaces like "do soft > > > > > > reclaim and report back". But you didn't actually reduce complexity, > > > > > > you added even more opaque callbacks (should_soft_reclaim? > > > > > > soft_reclaim_eligible?). You didn't integrate soft limit into generic > > > > > > reclaim code, you just made the soft limit API more complicated. > > > > > > > > > > I can certainly think about simplifications. But it would be nicer if > > > > > you were more specific on the "more complicated" part. The soft reclaim > > > > > is a natural part of the reclaim now. Which I find as an improvement. > > > > > "Do some memcg magic and get back was" a bad idea IMO. > > > > > > > > It depends. Doing memcg reclaim in generic code made sense because > > > > the global data structure went and the generic code absolutely had to > > > > know about memcg iteration. > > > > > > > > Soft limit reclaim is something else. If it were just a modifier of > > > > the existing memcg reclaim loop, I would still put it in vmscan.c for > > > > simplicity reasons. > > > > > > > > But your separate soft limit reclaim pass adds quite some complication > > > > to the generic code and there is no good reason for it. > > > > > > Does it? The whole decision is hidden within the predicate so the main > > > loop doesn't have to care at all. If we ever go with another limit for > > > the guarantee it would simply use a different predicate to select the > > > right group. > > > > This is just handwaving. You replaced a simple function call in > > kswapd > > That simple call from kswapd is not that simple at all in fact. It hides > a lot of memcg specific code which is far from being trivial. Even worse > that memcg specific code gets back to the reclaim code with different > reclaim parameters than those used from the context it has been called > from. It does not matter to understanding generic reclaim code, though, and acts more like the shrinkers. We send it off to get memory and it comes back with results. > > with an extra lruvec pass and an enhanced looping construct. > > It's all extra protocol that vmscan.c has to follow without actually > > improving the understandability of the code. 
You still have to dig > > into all those predicates to know what's going on. > > Comparing that to the memcg per-node-zone excess trees, their > maintenance and a code to achieve at least some fairness during reclaim > the above is a) much less code to read/maintain/understand b) it > achieves the same result without corner cases mentioned previously. > > I was trying to taint vmscan.c as little as possible. I have a patch > which moves most of the memcg specific parts back to memcontrol.c but I > cannot say I like it because it duplicates some code and has to expose > scan_control and shrink_lruvec outside vmscan.c. I'm not against complexity if it's appropriate, I just pointed out that you add it with rather weak justification. > > And at the risk of repeating myself, you haven't proven that it's > > worth the trouble. > > > > > > > > And, as I mentioned repeatedly in previous submissions, your benchmark > > > > > > numbers don't actually say anything useful about this change. > > > > > > > > > > I would really welcome suggestions for improvements. I have tried "The > > > > > most interesting test case would be how it behaves if some groups are > > > > > over the soft limits while others are not." with v5.1 where I had > > > > > memeater unlimited and kbuild resp. stream IO being limited. > > > > > > > > The workload seems better but the limit configuration is still very > > > > dubious. > > > > > > > > I just don't understand what your benchmarks are trying to express. > > > > If you set a soft limit of 0 you declare all memory is optional but > > > > then you go ahead and complain that the workload in there is > > > > thrashing. WTF? > > > > > > As I've already said the primary motivation was to check before and > > > after state and as you can see we can do better... Although you might > > > find soft_limit=0 setting artificial this setting is the easiest way to > > > tell that such a group is the number one candidate for reclaim. > > > > That would test for correctness, but then you draw conclusions about > > performance. > > The series is not about performance improvements. I am sorry if the > testing results made a different impression. The primary motivation was > to get rid of a big memcg specific reclaim path which doesn't handle > fairness well and uses hackish prio-0 reclaim which is too disruptive. > > > Correctness-wise, the unpatched kernel seems to prefer reclaiming the > > 0-limit group as expected. > > It does and as even simple configuration shows that the current soft > reclaim is too disturbing and reclaiming much more than necessary. Soft limit is about balancing reclaim pressure and I already pointed out that your control group has so much limit slack that you can't tell if the main group is performing better because of reclaim aggressiveness (good) or because the memory is just taken from your control group (bad). Please either say why I'm wrong or stop asserting points that have been refuted. > > You can't draw reasonable conclusions on performance and > > aggressiveness from such an unrealistic setup. > > > > > We have users who would like to do backups or some temporary actions to > > > not disturb the main workload. So stream-io resp. kbuild vs. mem_eater > > > is simulating such a use case. > > > > It's not obvious why you are not using hard limits instead. > > Because although the load in the soft limited groups is willing to > sacrifice its memory it would still prefer as much much memory as > possible to finish as soon as possible. 
Again, both streamers and kbuild do not benefit from the extra memory; I'm not sure why you keep repeating this nonsense over and over. Find a workload whose performance scales with the amount of available memory and whose pressure-performance graph is not DONT CARE DONT CARE DONT CARE FALLS OFF CLIFF. Using a hard limit workload to argue soft limits is pointless. > And also setting the hard limit might be really non-trivial, especially > when you have more such loads because you have to be careful so the > cumulative usage doesn't cause the global reclaim. > > > Again, streamers and kbuild don't really benefit from the extra cache, > > so there is no reason to let it expand into available memory and > > retract on pressure. > > These tests are just wildly inappropriate for soft limits, it's > > ridiculous that you insist that their outcome says anything at all. > > I do realize that my testing load is simplified a lot, but to be honest, > what ever load I come up with, it still might be argued against as too > simplified or artificial. We simply do not have any etalon. Except they are not simplified. They are in no way indicative. > I was merely interested in well known and understood loads and compared > before and after results for a basic overview of the changes. The series > was not aimed as performance improvement in the first place. You really need to decide what you are trying to sell. > And to be honest I wasn't expecting such a big differences, especially > for stream IO which should be easy "drop everything behind" as it > is use-once load. As you can see, though, the previous soft limit > implementation was hammering even on that load too much. You can see > that on this graph quite nicely [1]. > Similar with the kbuild test case where the system swapped out much more > pages (by factor 145) and the Major page faults increased as well (by > factor 85). > You can argue that my testing loads are not typical soft limit > configurations and I even might agree but they clearly show that the way > how we do the soft reclaim currently is way too disruptive and that is > the first thing to fix when we want to move on in that area. So we should pile all this complexity into reclaim code based on the performance of misconfigurations? Questionable methods aside, I don't see this going anywhere. Your patches don't make soft limits better, they're just trying to put lipstick on the pig. It would be best to use the opportunity from the cgroup rework and get this stuff right. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-09-06 19:23 ` Johannes Weiner @ 2013-09-13 14:49 ` Michal Hocko [not found] ` <20130913144953.GA23857-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-09-13 14:49 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri 06-09-13 15:23:11, Johannes Weiner wrote: > On Wed, Sep 04, 2013 at 06:38:23PM +0200, Michal Hocko wrote: [...] > > To handle overcommit situations more gracefully. As the documentation > > states: > > " > > 7. Soft limits > > > > Soft limits allow for greater sharing of memory. The idea behind soft limits > > is to allow control groups to use as much of the memory as needed, provided > > > > a. There is no memory contention > > b. They do not exceed their hard limit > > > > When the system detects memory contention or low memory, control groups > > are pushed back to their soft limits. If the soft limit of each control > > group is very high, they are pushed back as much as possible to make > > sure that one control group does not starve the others of memory. > > > > Please note that soft limits is a best-effort feature; it comes with > > no guarantees, but it does its best to make sure that when memory is > > heavily contended for, memory is allocated based on the soft limit > > hints/setup. Currently soft limit based reclaim is set up such that > > it gets invoked from balance_pgdat (kswapd). > > " > > > > Except for the last sentence the same holds for the integrated > > implementation as well. With the patchset we are doing the soft reclaim > > also for the targeted reclaim which was simply not possible previously > > because of the data structures limitations. And doing soft reclaim from > > target reclaim makes a lot of sense to me because whether we have a > > global or hierarchical memory pressure doesn't make any difference that > > some groups are set up to sacrifice their memory to help to release the > > pressure. > > The issue I have with this is that the semantics of the soft limit are > so backwards that we should strive to get this stuff right > conceptually before integrating this better into the VM. > > We have a big user that asks for guarantees, which are comparable but > the invert opposite of this. Instead of specifying what is optional > in one group, you specify what is essential in the other group. And > the default is to guarantee nothing instead of everything like soft > limits are currently defined. > > We even tried to invert the default soft limit setting in the past, > which went nowhere because we can't do these subtle semantic changes > on an existing interface. > > I would really like to deprecate soft limits and introduce something > new that has the proper semantics we want from the get-go. Its > implementation could very much look like your code, so we can easily > reuse that. But the interface and its semantics should come first. I am open to discussin such a change I just do not see any reason to have a crippled soft reclaim implementation for the mean time. Especially when it doesn't look like such a new interface is easy to agree on. [...] > > > You have not shown that prio-0 scans are a problem. > > > > OK, I thought this was self evident but let me be more specific. > > > > The scan the world is almost always a problem. 
We are no longer doing > > proportional anon/file reclaim (swappiness is ignored). This is wrong > > from at least two points of view. Firstly it makes the reclaim decisions > > different a lot for groups that are under the soft limit and those > > that are over. Secondly, and more importantly, this might lead to a > > pre-mature swapping, especially when there is a lot of IO going on. > > > > The global reclaim suffers from the very same problem and that is why > > we try to prevent from prio-0 reclaim as much as possible and use it > > only as a last resort. > > I know that and I can see that this should probably be fixed, but > there is no quantification for this. We have no per-memcg reclaim > statistics Not having statistics is a separate issue. It makes the situation worse but that is not a new thing. The old implementation is even worse because the soft reclaim activity is basically hidden from the global reclaim counters. So a lot of pages might get scanned and we will have no way to find out. That part is inherently fixed by the series because of the integration. > and your test cases were not useful in determining what's going on > reclaim-wise. I will instrument the kernel for the next round of tests, which will hopefully be more descriptive. [...] > > That simple call from kswapd is not that simple at all in fact. It hides > > a lot of memcg specific code which is far from being trivial. Even worse > > that memcg specific code gets back to the reclaim code with different > > reclaim parameters than those used from the context it has been called > > from. > > It does not matter to understanding generic reclaim code, though, and > acts more like the shrinkers. We send it off to get memory and it > comes back with results. The shrinker interface is just bad. It might work for dentries and inodes but it failed in many other subsystems where it ended up in do-something mode. Soft reclaim is yet another example where we are doing an artificial scan-the-world reclaim to hammer somebody. Fairness is basically impossible to guarantee and there are corner cases which are just waiting to explode. [...] > Soft limit is about balancing reclaim pressure and I already pointed > out that your control group has so much limit slack that you can't > tell if the main group is performing better because of reclaim > aggressiveness (good) or because the memory is just taken from your > control group (bad). > > Please either say why I'm wrong or stop asserting points that have > been refuted. I will work on improving my testing setup. I will come back with results early next week hopefully. [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework [not found] ` <20130913144953.GA23857-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-09-13 16:17 ` Johannes Weiner 2013-09-16 16:44 ` Michal Hocko 0 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2013-09-13 16:17 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri, Sep 13, 2013 at 04:49:53PM +0200, Michal Hocko wrote: > On Fri 06-09-13 15:23:11, Johannes Weiner wrote: > > On Wed, Sep 04, 2013 at 06:38:23PM +0200, Michal Hocko wrote: > [...] > > > To handle overcommit situations more gracefully. As the documentation > > > states: > > > " > > > 7. Soft limits > > > > > > Soft limits allow for greater sharing of memory. The idea behind soft limits > > > is to allow control groups to use as much of the memory as needed, provided > > > > > > a. There is no memory contention > > > b. They do not exceed their hard limit > > > > > > When the system detects memory contention or low memory, control groups > > > are pushed back to their soft limits. If the soft limit of each control > > > group is very high, they are pushed back as much as possible to make > > > sure that one control group does not starve the others of memory. > > > > > > Please note that soft limits is a best-effort feature; it comes with > > > no guarantees, but it does its best to make sure that when memory is > > > heavily contended for, memory is allocated based on the soft limit > > > hints/setup. Currently soft limit based reclaim is set up such that > > > it gets invoked from balance_pgdat (kswapd). > > > " > > > > > > Except for the last sentence the same holds for the integrated > > > implementation as well. With the patchset we are doing the soft reclaim > > > also for the targeted reclaim which was simply not possible previously > > > because of the data structures limitations. And doing soft reclaim from > > > target reclaim makes a lot of sense to me because whether we have a > > > global or hierarchical memory pressure doesn't make any difference that > > > some groups are set up to sacrifice their memory to help to release the > > > pressure. > > > > The issue I have with this is that the semantics of the soft limit are > > so backwards that we should strive to get this stuff right > > conceptually before integrating this better into the VM. > > > > We have a big user that asks for guarantees, which are comparable but > > the invert opposite of this. Instead of specifying what is optional > > in one group, you specify what is essential in the other group. And > > the default is to guarantee nothing instead of everything like soft > > limits are currently defined. > > > > We even tried to invert the default soft limit setting in the past, > > which went nowhere because we can't do these subtle semantic changes > > on an existing interface. > > > > I would really like to deprecate soft limits and introduce something > > new that has the proper semantics we want from the get-go. Its > > implementation could very much look like your code, so we can easily > > reuse that. But the interface and its semantics should come first. > > I am open to discussin such a change I just do not see any reason to > have a crippled soft reclaim implementation for the mean time. 
> Especially when it doesn't look like such a new interface is easy to > agree on. We had a crippled soft limit implementation from the time it was merged; it never worked better than now. You seem to think that this is an argument *for* finally fixing it. I disagree. We should absolutely *avoid* steering people toward it now, when the long term plan is already to get rid of it. There is a consensus that cgroups and the controllers were merged before they were ready and we are now struggling heavily to iron out the design mistakes with the minimum amount of disruption we can get away with. We are also at this time coordinating with all the other controllers and the cgroup core to do exactly that, where Tejun is providing us with tools to revamp the problematic interfaces. And we agree that soft limits were such a design mistake that should be ironed out. So for the love of everything we hold dear, why would you think that NOW is a good time to fix the implementation and get people to use it? > > > > You have not shown that prio-0 scans are a problem. > > > > > > OK, I thought this was self evident but let me be more specific. > > > > > > The scan the world is almost always a problem. We are no longer doing > > > proportional anon/file reclaim (swappiness is ignored). This is wrong > > > from at least two points of view. Firstly it makes the reclaim decisions > > > different a lot for groups that are under the soft limit and those > > > that are over. Secondly, and more importantly, this might lead to a > > > pre-mature swapping, especially when there is a lot of IO going on. > > > > > > The global reclaim suffers from the very same problem and that is why > > > we try to prevent from prio-0 reclaim as much as possible and use it > > > only as a last resort. > > > > I know that and I can see that this should probably be fixed, but > > there is no quantification for this. We have no per-memcg reclaim > > statistics > > Not having statistic is a separate issue. It makes the situation worse > but that is not a new thing. The old implementation is even worse > because the soft reclaim activity is basically hidden from global > reclaim counters. So a lot of pages might get scanned and we will have > no way to find out. That part is inherently fixed by the series because > of the integration. Because it's in the *global* reclaim counters? That's great but it does not address the problem at all. This is about pressure balance between groups and you don't have any numbers for that. All I'm saying is that before changing how the pressure is balanced we need to know per-memcg statistics to quantify it and get an insight into what we are actually doing. You respond with a wall of text but you don't address the problem at all. And before doing all that, we should get the user-visible interface right, which we all agreed is broken. > > > That simple call from kswapd is not that simple at all in fact. It hides > > > a lot of memcg specific code which is far from being trivial. Even worse > > > that memcg specific code gets back to the reclaim code with different > > > reclaim parameters than those used from the context it has been called > > > from. > > > > It does not matter to understanding generic reclaim code, though, and > > acts more like the shrinkers. We send it off to get memory and it > > comes back with results. > > Shrinker interface is just too bad. It might work for dentries and > inodes but it failed in many other subsystems where it ended up in > do-something mode. 
Soft reclaim is yet another example where we are > doing an artificial scan-the-world reclaim to hammer somebody. Fairness > is basically impossible to guarantee and there are corner cases which > are just waiting to explode. Every time you reply you are just attacking bits of my argument in a way that is completely irrelevant to the discussion. What is the overall objective that you are trying to defend? I said that you are making the interface more complex because the current interface is leaving the complexity encapsulated in memcg code. It does not matter one bit that some shrinkers are set up incorrectly, that entirely misses the point. Michal, it's completely unobvious what your longterm goals are for soft limits and guarantees. And without that it's hard to comprehend how and if the patches you are sending push into the right direction. Every time I try to discuss the bigger picture you derail it with details about how the implementation is broken. It's frustrating. This series is a grab bag of fixes that drag a lot of complexity from memcg code into generic reclaim, to repair the age old implementation of a user-visible interface that we already agree sucks balls and should be deprecated. The fact that you did not even demonstrate that the repair itself was successful is a secondary issue at this point, but it certainly didn't help your case. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework 2013-09-13 16:17 ` Johannes Weiner @ 2013-09-16 16:44 ` Michal Hocko [not found] ` <20130916164405.GG3674-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 30+ messages in thread From: Michal Hocko @ 2013-09-16 16:44 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa On Fri 13-09-13 12:17:09, Johannes Weiner wrote: > On Fri, Sep 13, 2013 at 04:49:53PM +0200, Michal Hocko wrote: > > On Fri 06-09-13 15:23:11, Johannes Weiner wrote: > [...] > > > > I would really like to deprecate soft limits and introduce something > > > > new that has the proper semantics we want from the get-go. Its > > > > implementation could very much look like your code, so we can easily > > > > reuse that. But the interface and its semantics should come first. > > > > > > I am open to discussin such a change I just do not see any reason to > > > have a crippled soft reclaim implementation for the mean time. > > > Especially when it doesn't look like such a new interface is easy to > > > agree on. > > > > We had a crippled soft limit implementation from the time it was > > merged, it never worked better than now. > > > > You seem to think that this is an argument *for* finally fixing it. I > > disagree. We should absolutely *avoid* steering people toward it now, > > when the long term plan is already to get rid of it. It is not just about fixing it. It is also about getting rid of the bloat that the previous implementation depended on. Not only in LoC but also in the resulting binary:
$ size mm/built-in.o
base:
   text    data     bss     dec     hex filename
 534283  233703  163456  931442   e3672 mm/built-in.o
rework:
   text    data     bss     dec     hex filename
 532866  229591  163456  925913   e20d9 mm/built-in.o
I would like to get rid of as much of the special code as possible. Especially this one, which I personally hate because it is a crude hack that shouldn't have existed. > There is a concensus that cgroups and the controllers were merged > before they were ready and we are now struggling heavily to iron out > the design mistakes with the minimum amount of disruption we can get > away with. > > We are also at this time coordinating with all the other controllers > and the cgroup core to do exactly that, where Tejun is providing us > with tools to revamp the problematic interfaces. > > And we agree that soft limits were such a design mistake that should > be ironed out. > > So for the love of everything we hold dear, why would you think that > NOW is a good time to fix the implemantion and get people to use it? There are users who already use this feature and it will take some (read: a lot of) time to move them to something else. And that something else still doesn't exist, and I suspect it will take some time to push it into a proper shape (and to be sure we do not screw it up this time). So while I agree that we need something more (semantically) reasonable, there is no need to keep this crippled implementation around, especially when it is a non-trivial amount of code. > > > > > You have not shown that prio-0 scans are a problem. > > > > > > > > OK, I thought this was self evident but let me be more specific. > > > > > > > > The scan the world is almost always a problem. We are no longer doing > > > > proportional anon/file reclaim (swappiness is ignored). This is wrong > > > > from at least two points of view. 
Firstly it makes the reclaim decisions > > > > different a lot for groups that are under the soft limit and those > > > > that are over. Secondly, and more importantly, this might lead to a > > > > pre-mature swapping, especially when there is a lot of IO going on. > > > > > > > > The global reclaim suffers from the very same problem and that is why > > > > we try to prevent from prio-0 reclaim as much as possible and use it > > > > only as a last resort. > > > > > > I know that and I can see that this should probably be fixed, but > > > there is no quantification for this. We have no per-memcg reclaim > > > statistics > > > > Not having statistic is a separate issue. It makes the situation worse > > but that is not a new thing. The old implementation is even worse > > because the soft reclaim activity is basically hidden from global > > reclaim counters. So a lot of pages might get scanned and we will have > > no way to find out. That part is inherently fixed by the series because > > of the integration. > > Because it's in the *global* reclaim counters? That's great but it > does not address the problem at all. This is about pressure balance > between groups and you don't have any numbers for that. yes, but the point was that if somebody uses soft reclaim currently you would miss a big part of reclaim activity because soft reclaim is not accounted even in the global counters. So you can see a long stall during direct reclaim while the counters look all good. > All I'm saying is that before changing how the pressure is balanced we > need to know per-memcg statistics to quantify it and get an insight > into what we are actually doing. We can still see the indirect effects of the reclaim. E.g. performance dropdown due to re-faults/swapping. My previous tests cared only about the restricted group which, I admit, is very superficial. I am working on a setup where I am measuring both. > You respond with a wall of text but you don't address the problem at > all. > > And before doing all that, we should get the user-visible interface > right, which we all agreed is broken. > > > > > That simple call from kswapd is not that simple at all in fact. It hides > > > > a lot of memcg specific code which is far from being trivial. Even worse > > > > that memcg specific code gets back to the reclaim code with different > > > > reclaim parameters than those used from the context it has been called > > > > from. > > > > > > It does not matter to understanding generic reclaim code, though, and > > > acts more like the shrinkers. We send it off to get memory and it > > > comes back with results. > > > > Shrinker interface is just too bad. It might work for dentries and > > inodes but it failed in many other subsystems where it ended up in > > do-something mode. Soft reclaim is yet another example where we are > > doing an artificial scan-the-world reclaim to hammer somebody. Fairness > > is basically impossible to guarantee and there are corner cases which > > are just waiting to explode. > > Every time you reply you are just attacking bits of my argument in a > way that is completely irrelevant to the discussion. What is the > overall objective that you are trying to defend? > > I said that you are making the interface more complex because the > current interface is leaving the complexity encapsulated in memcg > code. It does not matter one bit that some shrinkers are set up > incorrectly, that entirely misses the point. My point was that this is another example of those shrinkers that do a bad job. 
> Michal, it's completely unobvious what your longterm goals are for > soft limits and guarantees. I've already said that I am open to discussing a new interface with better semantics that provides guarantees. Once we settle on that one we can deprecate the soft limit and, after a long time, ditch it completely. > And without that it's hard to comprehend how and if the patches you > are sending push into the right direction. Every time I try to > discuss the bigger picture you derail it with details about how the > implementation is broken. It's frustrating. I am sorry to hear that. I thought that the high-level things were clear. Soft limit sucks and we need something better. That is clear. I am just arguing that waiting for that something better shouldn't stop us from working on the current interface, as this series removes code and I believe it eventually even helps loads that currently rely on the soft limit. > This series is a grab bag of fixes that drag a lot of complexity from > memcg code into generic reclaim, to repair the age old implementation > of a user-visible interface that we already agree sucks balls and > should be deprecated. The fact that you did not even demonstrate that > the repair itself was successful is a secondary issue at this point, > but it certainly didn't help your case. I will follow up with the testing results later. I hope I manage to have them before I leave on vacation. The question, though, is whether even results supporting my claims about enhancements would make any difference. To be honest I thought that the integration would be a non-controversial topic even without the performance improvements that could be expected from removing prio-0 reclaim, which is pure evil. Also the natural fairness of the soft reclaim sounds like a good thing. Anyway. Your wording that nothing should be done about the soft reclaim seems to be quite clear though. If this position is really firm then go ahead and NACK the series _explicitly_ so that Andrew or you can send a revert request to Linus. I would really like to not waste a lot of time on testing right now when it wouldn't lead to anything. Thanks -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v5] Soft limit rework
@ 2013-09-17 19:56 Johannes Weiner
  2013-09-17 20:57 ` Andrew Morton
  0 siblings, 1 reply; 30+ messages in thread
From: Johannes Weiner @ 2013-09-17 19:56 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel,
Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa

On Mon, Sep 16, 2013 at 06:44:05PM +0200, Michal Hocko wrote:
> On Fri 13-09-13 12:17:09, Johannes Weiner wrote:
> > On Fri, Sep 13, 2013 at 04:49:53PM +0200, Michal Hocko wrote:
> > > On Fri 06-09-13 15:23:11, Johannes Weiner wrote:
> [...]
> > > > I would really like to deprecate soft limits and introduce something
> > > > new that has the proper semantics we want from the get-go. Its
> > > > implementation could very much look like your code, so we can easily
> > > > reuse that. But the interface and its semantics should come first.
> > >
> > > I am open to discussing such a change, I just do not see any reason to
> > > have a crippled soft reclaim implementation in the meantime,
> > > especially when it doesn't look like such a new interface is easy to
> > > agree on.
> >
> > We had a crippled soft limit implementation from the time it was
> > merged; it never worked better than now.
> >
> > You seem to think that this is an argument *for* finally fixing it. I
> > disagree. We should absolutely *avoid* steering people toward it now,
> > when the long term plan is already to get rid of it.
>
> It is not just about fixing it. It is also about getting rid of the
> bloat that the previous implementation depended on. Not only LoC but
> also the resulting binary:
>
> $ size mm/built-in.o
> base:
>    text    data    bss     dec     hex   filename
>  534283  233703  163456  931442   e3672  mm/built-in.o
> rework:
>    text    data    bss     dec     hex   filename
>  532866  229591  163456  925913   e20d9  mm/built-in.o
>
> I would like to get rid of as much of the special code as possible,
> especially this one, which I hate personally because it is a crude hack
> that shouldn't have existed.

The interface is, not the implementation.

> > There is a consensus that cgroups and the controllers were merged
> > before they were ready and we are now struggling heavily to iron out
> > the design mistakes with the minimum amount of disruption we can get
> > away with.
> >
> > We are also at this time coordinating with all the other controllers
> > and the cgroup core to do exactly that, where Tejun is providing us
> > with tools to revamp the problematic interfaces.
> >
> > And we agree that soft limits were such a design mistake that should
> > be ironed out.
> >
> > So for the love of everything we hold dear, why would you think that
> > NOW is a good time to fix the implementation and get people to use it?
>
> There are users who already use this feature and it will take some (read:
> a lot of) time to move them to something else. And that something else
> still doesn't exist, and I suspect it will take some time to push it into
> a proper shape (and be sure we do not screw it up this time).

Yet you could not name a single sensible use case. So no, we have no
known users, except maybe Google, who actually want guarantees.

> So while I agree that we need something more (semantically) reasonable,
> there is no need to keep this crippled implementation around, especially
> when it is a non-trivial amount of code.

You traded a non-trivial amount of code for non-trivial code. And you
moved the cost from optional memcg code to essential memory management
code. All for a feature that needs to be redesigned and has a
questionable userbase.

> > > > > > You have not shown that prio-0 scans are a problem.
> > > > >
> > > > > OK, I thought this was self evident but let me be more specific.
> > > > >
> > > > > Scan-the-world is almost always a problem. We are no longer doing
> > > > > proportional anon/file reclaim (swappiness is ignored). This is wrong
> > > > > from at least two points of view. Firstly, it makes the reclaim
> > > > > decisions a lot different for groups that are under the soft limit
> > > > > and those that are over. Secondly, and more importantly, this might
> > > > > lead to premature swapping, especially when there is a lot of IO
> > > > > going on.
> > > > >
> > > > > Global reclaim suffers from the very same problem and that is why
> > > > > we try to prevent prio-0 reclaim as much as possible and use it
> > > > > only as a last resort.
> > > >
> > > > I know that and I can see that this should probably be fixed, but
> > > > there is no quantification for this. We have no per-memcg reclaim
> > > > statistics
> > >
> > > Not having statistics is a separate issue. It makes the situation worse
> > > but that is not a new thing. The old implementation is even worse
> > > because the soft reclaim activity is basically hidden from the global
> > > reclaim counters. So a lot of pages might get scanned and we will have
> > > no way to find out. That part is inherently fixed by the series because
> > > of the integration.
> >
> > Because it's in the *global* reclaim counters? That's great but it
> > does not address the problem at all. This is about pressure balance
> > between groups and you don't have any numbers for that.
>
> Yes, but the point was that if somebody uses soft reclaim currently you
> would miss a big part of the reclaim activity because soft reclaim is not
> accounted even in the global counters. So you can see a long stall
> during direct reclaim while the counters all look good.

This is not going anywhere :(

> > This series is a grab bag of fixes that drag a lot of complexity from
> > memcg code into generic reclaim, to repair the age-old implementation
> > of a user-visible interface that we already agree sucks balls and
> > should be deprecated. The fact that you did not even demonstrate that
> > the repair itself was successful is a secondary issue at this point,
> > but it certainly didn't help your case.
>
> I will follow up with the testing results later. I hope I manage to have
> them before I leave on vacation.
>
> The question, though, is whether even results supporting my claims about
> the enhancements would make any difference. To be honest, I thought that
> the integration would be a non-controversial topic even without the
> performance improvements which could have been expected from removing
> prio-0 reclaim, which is pure evil. Also, the natural fairness of the
> soft reclaim sounds like a good thing.
>
> Anyway. Your wording that nothing should be done about the soft reclaim
> seems to be quite clear though. If this position is really firm then go
> ahead and NACK the series _explicitly_ so that Andrew or you can send a
> revert request to Linus. I would really like not to waste a lot of time
> on testing right now when it wouldn't lead to anything.

Nacked-by: Johannes Weiner <hannes@cmpxchg.org>
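To make the "prio-0" point above concrete: the per-LRU scan target during reclaim is roughly lru_size >> priority, so the priority-0 passes forced by the old soft limit reclaim path make the entire list eligible in a single pass and leave no room for the proportional anon/file balancing that swappiness normally provides. The standalone C sketch below is purely illustrative and is not kernel code; the real get_scan_count() applies swappiness and further heuristics on top of this shift, and the LRU size used here is a made-up number.

#include <stdio.h>

/*
 * Illustrative only: reclaim scales the per-LRU scan target by the
 * reclaim priority, roughly scan = lru_size >> priority.  At DEF_PRIORITY
 * only a small slice of each list is scanned per pass; at priority 0 the
 * whole list becomes the target -- the "scan the world" case discussed
 * above.
 */
#define DEF_PRIORITY 12

static unsigned long scan_target(unsigned long lru_size, int priority)
{
	return lru_size >> priority;
}

int main(void)
{
	unsigned long lru_size = 1UL << 20;	/* hypothetical LRU of 1M pages */
	int prio;

	for (prio = DEF_PRIORITY; prio >= 0; prio--)
		printf("priority %2d -> scan %7lu of %lu pages\n",
		       prio, scan_target(lru_size, prio), lru_size);
	return 0;
}

Compiled with any C compiler, this prints the scan target growing from a 1/4096 slice of the list at priority 12 to all 1048576 pages at priority 0, which is why removing the prio-0 pass is cited above as the main expected source of improvement.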
* Re: [PATCH v5] Soft limit rework
  2013-09-17 19:56 ` Johannes Weiner
@ 2013-09-17 20:57   ` Andrew Morton
  0 siblings, 0 replies; 30+ messages in thread
From: Andrew Morton @ 2013-09-17 20:57 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel,
Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
KOSAKI Motohiro, Tejun Heo, Balbir Singh, Glauber Costa

On Tue, 17 Sep 2013 15:56:15 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> > Anyway. Your wording that nothing should be done about the soft reclaim
> > seems to be quite clear though. If this position is really firm then go
> > ahead and NACK the series _explicitly_ so that Andrew or you can send a
> > revert request to Linus. I would really like not to waste a lot of time
> > on testing right now when it wouldn't lead to anything.
>
> Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

OK, I queued a bunch of reverts for when Linus gets back to his desk.

It's all my fault for getting distracted for a week (in the middle of
the dang merge window) and not noticing that we haven't yet settled on
a direction. Sorry about that.
Thread overview: 30+ messages
2013-06-18 12:09 [PATCH v5] Soft limit rework Michal Hocko
2013-06-18 12:09 ` [PATCH v5 1/8] memcg, vmscan: integrate soft reclaim tighter with zone shrinking code Michal Hocko
2013-06-18 12:09 ` [PATCH v5 2/8] memcg: Get rid of soft-limit tree infrastructure Michal Hocko
2013-06-18 12:09 ` [PATCH v5 3/8] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko
2013-06-18 12:09 ` [PATCH v5 4/8] memcg: enhance memcg iterator to support predicates Michal Hocko
2013-06-18 12:09 ` [PATCH v5 5/8] memcg: track children in soft limit excess to improve soft limit Michal Hocko
2013-06-18 12:09 ` [PATCH v5 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything Michal Hocko
2013-06-18 12:09 ` [PATCH v5 7/8] memcg: Track all children over limit in the root Michal Hocko
2013-06-18 12:09 ` [PATCH v5 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly Michal Hocko
2013-06-18 19:01 ` [PATCH v5] Soft limit rework Johannes Weiner
2013-06-19 10:20 ` Michal Hocko
2013-06-20 11:12 ` Mel Gorman
2013-06-21 14:06 ` Michal Hocko
2013-06-21 14:09 ` Michal Hocko
2013-06-21 15:04 ` Michal Hocko
2013-06-21 15:09 ` Michal Hocko
2013-06-21 16:34 ` Tejun Heo
2013-06-25 15:49 ` Michal Hocko
2013-08-19 16:35 ` Johannes Weiner
2013-08-20 9:14 ` Michal Hocko
2013-08-20 14:13 ` Johannes Weiner
2013-08-22 10:58 ` Michal Hocko
2013-09-03 16:15 ` Johannes Weiner
2013-09-04 16:38 ` Michal Hocko
2013-09-06 19:23 ` Johannes Weiner
2013-09-13 14:49 ` Michal Hocko
2013-09-13 16:17 ` Johannes Weiner
2013-09-16 16:44 ` Michal Hocko
2013-09-17 19:56 ` Johannes Weiner
2013-09-17 20:57 ` Andrew Morton