* [RFC 0/3] soft reclaim rework @ 2013-04-09 12:13 Michal Hocko

From: Michal Hocko @ 2013-04-09 12:13 UTC
To: linux-mm
Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa

Hi all,
it has been a long time since I promised my take on the $subject, but I
kept getting preempted by other tasks. I have finally gotten to it. This
is just a first attempt - there are still some TODOs - but I wanted to
post it early to get feedback.

The basic idea is quite simple. The first step pulls soft reclaim into
shrink_zone and gets rid of the previous soft reclaim infrastructure.
shrink_zone is now done in two passes. First it tries to do the soft
limit reclaim, and it falls back to reclaim-all mode if no group is over
the limit or no pages have been scanned. The second pass happens at the
same priority, so the only time we waste is the memcg tree walk, which
shouldn't be a big deal. There is certainly room for improvement in that
direction, but let's keep it simple for now. As a bonus we get rid of a
_lot_ of code, and soft reclaim no longer stands out the way it did
before.

The second step is somewhat more controversial. It redefines the meaning
of the default soft limit value. I have not chosen 0, as we discussed
previously, because I want to preserve the hierarchical property of the
soft limit (if a parent up the hierarchy is over its limit then its
children are over as well). So I have kept the default untouched -
unlimited - but slightly changed its meaning: I interpret it as "the
user doesn't care about the soft limit". More precisely, the value is
ignored unless it has been specified by the user, so such groups are
eligible for soft reclaim even though they have not reached the limit.
Such groups do not force their children to be reclaimed, of course.

I guess the only use case where this wouldn't work as expected is when
somebody creates a group and sets its soft limit to a small value (e.g.
0) just to protect all the other groups from being reclaimed. With the
new scheme all groups would be reclaimed, while the previous
implementation could end up reclaiming only the "special" group. The
same configuration can be achieved trivially under the new scheme as
well, so I think we should be safe. Or does this sound like a big
problem?

Finally, the third step integrates soft limit reclaim into targeted
reclaim. The patch is a trivial one-liner.

I haven't gotten to test this properly yet. I have tested only 2
workloads:

1) 1GB RAM + 128MB swap in a kvm guest (4GB RAM on the host)
   - 2 memcgs (directly under root)
   - A has a 500MB soft limit and an unlimited hard limit
   - B has both hard and soft limits unlimited (default values)
   - one dd if=/dev/zero of=storage/$file bs=1024 count=1228800 per group

2) same setup
   - tar -xf linux source tree + make -j2 vmlinux

Results

1) I've checked memory.usage_in_bytes:

                            Group A      Group B
   Base (-mm tree)  median  446498816    448659456
   Patched          median  524314624    377921536

So, as expected, A got more room at the expense of B and it is nicely
over its soft limit. I wanted to compare the reclaim performance as
well, but we do not account scanned and reclaimed pages during the old
soft reclaim (global_reclaim prevents that). I am planning to look at
that, though.
Anyway, it doesn't look like we are scanning/reclaiming more with the
patched kernel:

   Base:    pgscan_kswapd_dma32 394382    pgsteal_kswapd_dma32 394372
   Patched: pgscan_kswapd_dma32 394501    pgsteal_kswapd_dma32 394491

So I would assume that it was the (unaccounted) soft limit reclaim in
the base kernel that scanned more in the end. The total runtime was
slightly smaller for the patched version:

                        Group A     Group B
   Base    total time   480.087 s   480.067 s
   Patched total time   474.853 s   474.736 s

But this could be an artifact of guest scheduling, or related to host
activity, so I wouldn't draw any conclusions from it.

2) The kbuild test showed more or less the same results.
usage_in_bytes:

                        Group A     Group B
   Base     median      394817536   395634688
   Patched  median      483481600   302131200

A is kept closer to its soft limit again. There is some fluctuation
around the limit because kbuild creates a lot of short-lived processes.

   Base:    pgscan_kswapd_dma32 1648718   pgsteal_kswapd_dma32 1510749
   Patched: pgscan_kswapd_dma32 2042065   pgsteal_kswapd_dma32 1667745

The differences are much bigger now, so it would be interesting to see
how much has been scanned/reclaimed during soft reclaim in the base
kernel. I haven't included total runtime statistics here because they
seemed even more random due to guest/host interaction.

Any comments are welcome, of course.

Michal Hocko (3):
      memcg: integrate soft reclaim tighter with zone shrinking code
      memcg: Ignore soft limit until it is explicitly specified
      vmscan, memcg: Do softlimit reclaim also for targeted reclaim

Incomplete diffstat (without the node-zone soft limit tree removal
etc.), so more deletions are to come:

 include/linux/memcontrol.h |   10 +--
 mm/memcontrol.c            |  175 +++++++++----------------------------
 mm/vmscan.c                |   67 ++++++++++-------
 3 files changed, 78 insertions(+), 174 deletions(-)
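To make the redefined default concrete, the eligibility rule for the
first reclaim pass boils down to something like the following. This is a
minimal standalone model for illustration only - struct memcg and the
soft_limit_set flag are made-up stand-ins (the latter for the "has the
user explicitly written the soft limit" state from the second step), not
the kernel's types:

#include <stdbool.h>
#include <stdio.h>

/* Simplified model of a memcg, for illustration only. */
struct memcg {
	struct memcg *parent;
	unsigned long usage;
	unsigned long soft_limit;
	bool soft_limit_set;	/* has the user ever written the limit? */
};

static bool over_limit(const struct memcg *g)
{
	return g->soft_limit_set && g->usage > g->soft_limit;
}

static bool soft_reclaim_eligible(const struct memcg *g)
{
	/*
	 * An unset limit means "the user doesn't care": the group is
	 * fair game for the soft reclaim pass...
	 */
	if (!g->soft_limit_set || over_limit(g))
		return true;
	/*
	 * ...but unset-ness does not propagate down; only a parent
	 * which is really over its (explicitly set) limit drags the
	 * whole subtree into the first pass.
	 */
	for (g = g->parent; g; g = g->parent)
		if (over_limit(g))
			return true;
	return false;
}

int main(void)
{
	struct memcg root = { 0 };	/* soft limit left unset */
	struct memcg a = { &root, 600UL << 20, 500UL << 20, true };
	struct memcg b = { &root, 300UL << 20, 500UL << 20, true };

	printf("A eligible: %d\n", soft_reclaim_eligible(&a));	/* 1 */
	printf("B eligible: %d\n", soft_reclaim_eligible(&b));	/* 0 */
	return 0;
}

With a setup like workload 1 above, the group over its explicitly set
soft limit (A) is eligible for the first pass while its sibling under
the limit (B) is not, and the unset limit on the root does not drag B in.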
* [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-04-09 12:13 Michal Hocko

From: Michal Hocko @ 2013-04-09 12:13 UTC
To: linux-mm
Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa

Memcg soft reclaim has traditionally been triggered from the global
reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
then picked up the group which exceeded its soft limit the most and
reclaimed it with priority 0 to reclaim at least SWAP_CLUSTER_MAX pages.

The infrastructure requires per-node-zone trees which hold the
over-limit groups and keep them up-to-date (via memcg_check_events),
which is not cost free. Although this overhead hasn't turned out to be a
bottleneck, the implementation is suboptimal, because
mem_cgroup_update_tree has no idea which zones consumed memory over the
limit, so we could easily end up with a group on a node-zone tree which
holds only a few pages from that node-zone.

This patch doesn't try to fix the node-zone tree management, because
integrating soft reclaim into zone shrinking sounds much easier and more
appropriate, for several reasons. First of all, priority 0 reclaim was a
crude hack which might lead to big stalls if the group's LRUs are big
and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim
should also be applicable to targeted reclaim, which is awkward right
now without additional hacks. Last but not least, the whole
infrastructure eats a lot of code[1].

After this patch shrink_zone is done in two passes. First it tries to do
the soft reclaim if appropriate (only for global reclaim for now, to
stay compatible with the current state) and falls back to ignoring the
soft limit if no group is eligible for soft reclaim or nothing has been
scanned during the first pass. Only groups which are over their soft
limit, or which have a parent up the hierarchy over its limit, are
considered eligible during the first pass.

TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and
co., but maybe it would be easier for review to remove that code in a
separate patch...
---
[1] TODO: put size vmlinux before/after whole clean-up

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |   10 +--
 mm/memcontrol.c            |  161 ++++++--------------------------------------
 mm/vmscan.c                |   67 +++++++++++-------
 3 files changed, 64 insertions(+), 174 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..1833c95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned);
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg);
 
 void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
@@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
 {
-	return 0;
+	return false;
 }
 
 static inline void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f608546..33424d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2060,57 +2060,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
 }
 #endif
 
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
-				   struct zone *zone,
-				   gfp_t gfp_mask,
-				   unsigned long *total_scanned)
-{
-	struct mem_cgroup *victim = NULL;
-	int total = 0;
-	int loop = 0;
-	unsigned long excess;
-	unsigned long nr_scanned;
-	struct mem_cgroup_reclaim_cookie reclaim = {
-		.zone = zone,
-		.priority = 0,
-	};
+/*
+ * A group is eligible for the soft limit reclaim if it is
+ *	a) is over its soft limit
+ *	b) any parent up the hierarchy is over its soft limit
+ */
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
 
-	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
-
-	while (1) {
-		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
-		if (!victim) {
-			loop++;
-			if (loop >= 2) {
-				/*
-				 * If we have not been able to reclaim
-				 * anything, it might because there are
-				 * no reclaimable pages under this hierarchy
-				 */
-				if (!total)
-					break;
-				/*
-				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
-				 * reclaim too much, nor too less that we keep
-				 * coming back to reclaim from this cgroup
-				 */
-				if (total >= (excess >> 2) ||
-					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
-					break;
-			}
-			continue;
-		}
-		if (!mem_cgroup_reclaimable(victim, false))
-			continue;
-		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-						     zone, &nr_scanned);
-		*total_scanned += nr_scanned;
-		if (!res_counter_soft_limit_excess(&root_memcg->res))
-			break;
+	if (res_counter_soft_limit_excess(&memcg->res))
+		return true;
+
+	/*
+	 * If any parent up the hierarchy is over its soft limit then we
+	 * have to obey and reclaim from this group as well.
+	 */
+	while((parent = parent_mem_cgroup(parent))) {
+		if (res_counter_soft_limit_excess(&parent->res))
+			return true;
 	}
-	mem_cgroup_iter_break(root_memcg, victim);
-	return total;
+
+	return false;
 }
 
 /*
@@ -4724,98 +4695,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
-{
-	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
-	unsigned long reclaimed;
-	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
-	unsigned long nr_scanned;
-
-	if (order > 0)
-		return 0;
-
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
-	/*
-	 * This loop can run a while, specially if mem_cgroup's continuously
-	 * keep exceeding their soft limit and putting the system under
-	 * pressure
-	 */
-	do {
-		if (next_mz)
-			mz = next_mz;
-		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
-		if (!mz)
-			break;
-
-		nr_scanned = 0;
-		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone,
-						    gfp_mask, &nr_scanned);
-		nr_reclaimed += reclaimed;
-		*total_scanned += nr_scanned;
-		spin_lock(&mctz->lock);
-
-		/*
-		 * If we failed to reclaim anything from this memory cgroup
-		 * it is time to move on to the next cgroup
-		 */
-		next_mz = NULL;
-		if (!reclaimed) {
-			do {
-				/*
-				 * Loop until we find yet another one.
-				 *
-				 * By the time we get the soft_limit lock
-				 * again, someone might have aded the
-				 * group back on the RB tree. Iterate to
-				 * make sure we get a different mem.
-				 * mem_cgroup_largest_soft_limit_node returns
-				 * NULL if no other cgroup is present on
-				 * the tree
-				 */
-				next_mz =
-				__mem_cgroup_largest_soft_limit_node(mctz);
-				if (next_mz == mz)
-					css_put(&next_mz->memcg->css);
-				else /* next_mz == NULL or other memcg */
-					break;
-			} while (1);
-		}
-		__mem_cgroup_remove_exceeded(mz->memcg, mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->memcg->res);
-		/*
-		 * One school of thought says that we should not add
-		 * back the node to the tree if reclaim returns 0.
-		 * But our reclaim could return 0, simply because due
-		 * to priority we are exposing a smaller subset of
-		 * memory to reclaim from. Consider this as a longer
-		 * term TODO.
-		 */
-		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess);
-		spin_unlock(&mctz->lock);
-		css_put(&mz->memcg->css);
-		loop++;
-		/*
-		 * Could not reclaim anything and there are no more
-		 * mem cgroups to try or we seem to be looping without
-		 * reclaiming anything.
-		 */
-		if (!nr_reclaimed &&
-			(next_mz == NULL ||
-			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
-			break;
-	} while (!nr_reclaimed);
-	if (next_mz)
-		css_put(&next_mz->memcg->css);
-	return nr_reclaimed;
-}
-
 /**
  * mem_cgroup_force_empty_list - clears LRU of a group
  * @memcg: group to clear
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df78d17..ae3a387 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -138,11 +138,21 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+	return global_reclaim(sc);
+}
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+	return false;
+}
 #endif
 
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -1942,9 +1952,11 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
+static unsigned
+__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 {
 	unsigned long nr_reclaimed, nr_scanned;
+	unsigned nr_shrunk = 0;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		do {
 			struct lruvec *lruvec;
 
+			if (soft_reclaim &&
+					!mem_cgroup_soft_reclaim_eligible(memcg)) {
+				memcg = mem_cgroup_iter(root, memcg, &reclaim);
+				continue;
+			}
+
+			nr_shrunk++;
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
 			shrink_lruvec(lruvec, sc);
@@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		} while (memcg);
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
+
+	return nr_shrunk;
+}
+
+
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
+{
+	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
+	unsigned long nr_scanned = sc->nr_scanned;
+	unsigned nr_shrunk;
+
+	nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim);
+
+	/*
+	 * No group is over the soft limit or those that are do not have
+	 * pages in the zone we are reclaiming so we have to reclaim everybody
+	 */
+	if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) {
+		__shrink_zone(zone, sc, false);
+		return;
+	}
 }
 
 /* Returns true if compaction should go ahead for a high-order request */
@@ -2047,8 +2087,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
-	unsigned long nr_soft_reclaimed;
-	unsigned long nr_soft_scanned;
 	bool aborted_reclaim = false;
 
 	/*
@@ -2088,18 +2126,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 				continue;
 			}
 		}
-		/*
-		 * This steals pages from memory cgroups over softlimit
-		 * and returns the number of reclaimed pages and
-		 * scanned pages. This works for global memory pressure
-		 * and balancing, not for a memcg's limit.
-		 */
-		nr_soft_scanned = 0;
-		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-						sc->order, sc->gfp_mask,
-						&nr_soft_scanned);
-		sc->nr_reclaimed += nr_soft_reclaimed;
-		sc->nr_scanned += nr_soft_scanned;
 		/* need some check for avoid more shrink_zone() */
 	}
 
@@ -2620,8 +2646,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	struct reclaim_state *reclaim_state = current->reclaim_state;
-	unsigned long nr_soft_reclaimed;
-	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
@@ -2720,15 +2744,6 @@ loop_again:
 
 			sc.nr_scanned = 0;
 
-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-
 			/*
 			 * We put equal pressure on every zone, unless
 			 * one zone has way too many pages free
-- 
1.7.10.4
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko @ 2013-04-09 13:08 ` Johannes Weiner 2013-04-09 13:31 ` Michal Hocko 2013-04-09 13:57 ` Glauber Costa ` (2 subsequent siblings) 3 siblings, 1 reply; 27+ messages in thread From: Johannes Weiner @ 2013-04-09 13:08 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote: > Memcg soft reclaim has been traditionally triggered from the global > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim > then picked up a group which exceeds the soft limit the most and > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. > > The infrastructure requires per-node-zone trees which hold over-limit > groups and keep them up-to-date (via memcg_check_events) which is not > cost free. Although this overhead hasn't turned out to be a bottle neck > the implementation is suboptimal because mem_cgroup_update_tree has no > idea which zones consumed memory over the limit so we could easily end > up having a group on a node-zone tree having only few pages from that > node-zone. > > This patch doesn't try to fix node-zone trees management because it > seems that integrating soft reclaim into zone shrinking sounds much > easier and more appropriate for several reasons. > First of all 0 priority reclaim was a crude hack which might lead to > big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot > of dirty/writeback pages). > Soft reclaim should be applicable also to the targeted reclaim which is > awkward right now without additional hacks. > Last but not least the whole infrastructure eats a lot of code[1]. > > After this patch shrink_zone is done in 2. First it tries to do the > soft reclaim if appropriate (only for global reclaim for now to keep > compatible with the current state) and fall back to ignoring soft limit > if no group is eligible to soft reclaim or nothing has been scanned > during the first pass. Only groups which are over their soft limit or > any of their parent up the hierarchy is over the limit are considered > eligible during the first pass. > > TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and co. > but maybe it would be easier for review to remove that code in a separate > patch... It should be in this series, though, for the diffstat :-) > --- > [1] TODO: put size vmlinux before/after whole clean-up Yes! > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > } while (memcg); > } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, > sc->nr_scanned - nr_scanned, sc)); > + > + return nr_shrunk; > +} > + > + > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > +{ > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > + unsigned long nr_scanned = sc->nr_scanned; > + unsigned nr_shrunk; > + > + nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim); > + > + /* > + * No group is over the soft limit or those that are do not have > + * pages in the zone we are reclaiming so we have to reclaim everybody > + */ > + if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) { If no pages were scanned you are doing a second pass regardless of nr_shrunk. 
If pages were scanned, nr_shrunk must have been increased
as well. So I think you can remove all the nr_shrunk counting and
just check for scanned pages, no?
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-04-09 13:31 Michal Hocko

From: Michal Hocko @ 2013-04-09 13:31 UTC
To: Johannes Weiner
Cc: linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa

On Tue 09-04-13 09:08:33, Johannes Weiner wrote:
> On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote:
[...]
> > TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and co.
> > but maybe it would be easier for review to remove that code in a separate
> > patch...
> 
> It should be in this series, though, for the diffstat :-)

Sure thing, I just wanted to avoid pointless work during rebasing while
this is still changing its shape, like all such big changes.

> > ---
> > [1] TODO: put size vmlinux before/after whole clean-up
> 
> Yes!
> 
> > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > 		} while (memcg);
> > 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
> > 					 sc->nr_scanned - nr_scanned, sc));
> > +
> > +	return nr_shrunk;
> > +}
> > +
> > +
> > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +{
> > +	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
> > +	unsigned long nr_scanned = sc->nr_scanned;
> > +	unsigned nr_shrunk;
> > +
> > +	nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim);
> > +
> > +	/*
> > +	 * No group is over the soft limit or those that are do not have
> > +	 * pages in the zone we are reclaiming so we have to reclaim everybody
> > +	 */
> > +	if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) {
> 
> If no pages were scanned you are doing a second pass regardless of
> nr_shrunk. If pages were scanned, nr_shrunk must have been increased
> as well. So I think you can remove all the nr_shrunk counting and
> just check for scanned pages, no?

Yes, you are right. I started with the nr_shrunk part only and then
realized that no scanning could be a problem, so I just added the check.
I haven't optimized it yet. I will remove the nr_shrunk part in later
versions.

Thanks
-- 
Michal Hocko
SUSE Labs
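With the nr_shrunk counting removed as agreed above, shrink_zone would
reduce to roughly the following (an illustrative sketch, not a posted
revision; __shrink_zone would simply return void):

static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
	unsigned long nr_scanned = sc->nr_scanned;

	/* First pass: only groups eligible for the soft limit reclaim. */
	__shrink_zone(zone, sc, do_soft_reclaim);

	/*
	 * Nothing was scanned in the first pass - either no group is
	 * over its soft limit or the eligible groups have no pages on
	 * this zone - so repeat the pass over everybody.
	 */
	if (do_soft_reclaim && sc->nr_scanned == nr_scanned)
		__shrink_zone(zone, sc, false);
}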
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko 2013-04-09 13:08 ` Johannes Weiner @ 2013-04-09 13:57 ` Glauber Costa 2013-04-09 14:22 ` Michal Hocko 2013-04-09 16:45 ` Kamezawa Hiroyuki 2013-04-14 0:42 ` Mel Gorman 3 siblings, 1 reply; 27+ messages in thread From: Glauber Costa @ 2013-04-09 13:57 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman On 04/09/2013 04:13 PM, Michal Hocko wrote: > Memcg soft reclaim has been traditionally triggered from the global > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim > then picked up a group which exceeds the soft limit the most and > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. > > The infrastructure requires per-node-zone trees which hold over-limit > groups and keep them up-to-date (via memcg_check_events) which is not > cost free. Although this overhead hasn't turned out to be a bottle neck > the implementation is suboptimal because mem_cgroup_update_tree has no > idea which zones consumed memory over the limit so we could easily end > up having a group on a node-zone tree having only few pages from that > node-zone. > > This patch doesn't try to fix node-zone trees management because it > seems that integrating soft reclaim into zone shrinking sounds much > easier and more appropriate for several reasons. > First of all 0 priority reclaim was a crude hack which might lead to > big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot > of dirty/writeback pages). > Soft reclaim should be applicable also to the targeted reclaim which is > awkward right now without additional hacks. > Last but not least the whole infrastructure eats a lot of code[1]. > > After this patch shrink_zone is done in 2. First it tries to do the > soft reclaim if appropriate (only for global reclaim for now to keep > compatible with the current state) and fall back to ignoring soft limit > if no group is eligible to soft reclaim or nothing has been scanned > during the first pass. Only groups which are over their soft limit or > any of their parent up the hierarchy is over the limit are considered > eligible during the first pass. > > TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and co. > but maybe it would be easier for review to remove that code in a separate > patch... > Well, the concept is obviously headed right. Code comments: > +/* > + * A group is eligible for the soft limit reclaim if it is > + * a) is over its soft limit > + * b) any parent up the hierarchy is over its soft limit > + */ > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *parent = memcg; > > - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; > - > - while (1) { > - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); > - if (!victim) { > - loop++; > - if (loop >= 2) { > - /* > - * If we have not been able to reclaim > - * anything, it might because there are > - * no reclaimable pages under this hierarchy > - */ > - if (!total) > - break; > - /* > - * We want to do more targeted reclaim. 
> - * excess >> 2 is not to excessive so as to > - * reclaim too much, nor too less that we keep > - * coming back to reclaim from this cgroup > - */ > - if (total >= (excess >> 2) || > - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) > - break; > - } > - continue; > - } > - if (!mem_cgroup_reclaimable(victim, false)) > - continue; > - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, > - zone, &nr_scanned); > - *total_scanned += nr_scanned; > - if (!res_counter_soft_limit_excess(&root_memcg->res)) > - break; > + if (res_counter_soft_limit_excess(&memcg->res)) > + return true; > + > + /* > + * If any parent up the hierarchy is over its soft limit then we > + * have to obey and reclaim from this group as well. > + */ > + while((parent = parent_mem_cgroup(parent))) { > + if (res_counter_soft_limit_excess(&parent->res)) > + return true; > } > - mem_cgroup_iter_break(root_memcg, victim); > - return total; > + > + return false; > } > good work. There is a confusion with parent here, but I believe Johnny had already noted it. > > -static void shrink_zone(struct zone *zone, struct scan_control *sc) > +static unsigned > +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) > { > unsigned long nr_reclaimed, nr_scanned; > + unsigned nr_shrunk = 0; > > do { > struct mem_cgroup *root = sc->target_mem_cgroup; > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > do { > struct lruvec *lruvec; > > + if (soft_reclaim && > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > + continue; > + } > + > + nr_shrunk++; > lruvec = mem_cgroup_zone_lruvec(zone, memcg); > > shrink_lruvec(lruvec, sc); > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > } while (memcg); > } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, > sc->nr_scanned - nr_scanned, sc)); > + > + return nr_shrunk; > +} > + > + > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > +{ > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > + unsigned long nr_scanned = sc->nr_scanned; > + unsigned nr_shrunk; > + > + nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim); > + > + /* > + * No group is over the soft limit or those that are do not have > + * pages in the zone we are reclaiming so we have to reclaim everybody > + */ > + if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) { > + __shrink_zone(zone, sc, false); > + return; > + } > } If I read this correctly, you stop shrinking when you reach a group in which you manage to shrink some pages. Is it really what we want? We have no guarantee that we're now under the soft limit, so shouldn't we keep shrinking downwards until every parent of ours is within limits ? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-04-09 14:22 Michal Hocko

From: Michal Hocko @ 2013-04-09 14:22 UTC
To: Glauber Costa
Cc: linux-mm, Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman

On Tue 09-04-13 17:57:54, Glauber Costa wrote:
> On 04/09/2013 04:13 PM, Michal Hocko wrote:
[...]
> > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +{
> > +	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
> > +	unsigned long nr_scanned = sc->nr_scanned;
> > +	unsigned nr_shrunk;
> > +
> > +	nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim);
> > +
> > +	/*
> > +	 * No group is over the soft limit or those that are do not have
> > +	 * pages in the zone we are reclaiming so we have to reclaim everybody
> > +	 */
> > +	if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) {
> > +		__shrink_zone(zone, sc, false);
> > +		return;
> > +	}
> > }
> 
> If I read this correctly, you stop shrinking when you reach a group in
> which you manage to shrink some pages. Is it really what we want?

Well, this is what we do during standard reclaim: __shrink_zone either
walks all children of the target_memcg or reclaims enough pages.

> We have no guarantee that we're now under the soft limit, so shouldn't
> we keep shrinking downwards until every parent of ours is within limits ?

I do not think we should reclaim until we are under the soft limit,
because our primary target is different - balancing zones resp. getting
under the hard limit. The soft limit just helps us point at victims (and
newly also to protect high class citizens). So the second round is just
a way to reclaim at least something if there is nobody eligible for the
soft limit part of the game.

I can see some harder conditions for the fallback (e.g. only falling
back after a certain priority, roughly as sketched below), but let's
keep this simple for now and do additional parts on top.

Thanks
-- 
Michal Hocko
SUSE Labs
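Such a priority-gated fallback might look like this inside shrink_zone
(purely illustrative; the DEF_PRIORITY - 2 threshold is an arbitrary
stand-in, not something from the posted series):

	/*
	 * Only give up on the soft limit restricted pass and reclaim
	 * from everybody once a couple of priority rounds have failed
	 * to make progress.  The threshold is arbitrary here.
	 */
	if (do_soft_reclaim && sc->nr_scanned == nr_scanned &&
	    sc->priority <= DEF_PRIORITY - 2)
		__shrink_zone(zone, sc, false);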
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko 2013-04-09 13:08 ` Johannes Weiner 2013-04-09 13:57 ` Glauber Costa @ 2013-04-09 16:45 ` Kamezawa Hiroyuki 2013-04-09 17:05 ` Michal Hocko 2013-04-14 0:42 ` Mel Gorman 3 siblings, 1 reply; 27+ messages in thread From: Kamezawa Hiroyuki @ 2013-04-09 16:45 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Ying Han, Johannes Weiner, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa (2013/04/09 21:13), Michal Hocko wrote: > Memcg soft reclaim has been traditionally triggered from the global > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim > then picked up a group which exceeds the soft limit the most and > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. > > The infrastructure requires per-node-zone trees which hold over-limit > groups and keep them up-to-date (via memcg_check_events) which is not > cost free. Although this overhead hasn't turned out to be a bottle neck > the implementation is suboptimal because mem_cgroup_update_tree has no > idea which zones consumed memory over the limit so we could easily end > up having a group on a node-zone tree having only few pages from that > node-zone. > > This patch doesn't try to fix node-zone trees management because it > seems that integrating soft reclaim into zone shrinking sounds much > easier and more appropriate for several reasons. > First of all 0 priority reclaim was a crude hack which might lead to > big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot > of dirty/writeback pages). > Soft reclaim should be applicable also to the targeted reclaim which is > awkward right now without additional hacks. > Last but not least the whole infrastructure eats a lot of code[1]. > > After this patch shrink_zone is done in 2. First it tries to do the > soft reclaim if appropriate (only for global reclaim for now to keep > compatible with the current state) and fall back to ignoring soft limit > if no group is eligible to soft reclaim or nothing has been scanned > during the first pass. Only groups which are over their soft limit or > any of their parent up the hierarchy is over the limit are considered > eligible during the first pass. > > TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and co. > but maybe it would be easier for review to remove that code in a separate > patch... > If we don't make prioritization based on excessed usage against soft-limit and visit all memcgs, dropping per-zone-tree makes sense. (*)I don't like current prioitization. 
> --- > [1] TODO: put size vmlinux before/after whole clean-up > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > include/linux/memcontrol.h | 10 +-- > mm/memcontrol.c | 161 ++++++-------------------------------------- > mm/vmscan.c | 67 +++++++++++------- > 3 files changed, 64 insertions(+), 174 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index d6183f0..1833c95 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, > mem_cgroup_update_page_stat(page, idx, -1); > } > > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned); > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); > > void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); > static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, > @@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, > } > > static inline > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > { > - return 0; > + return false; > } > > static inline void mem_cgroup_split_huge_fixup(struct page *head) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index f608546..33424d8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2060,57 +2060,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) > } > #endif > > -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, > - struct zone *zone, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > -{ > - struct mem_cgroup *victim = NULL; > - int total = 0; > - int loop = 0; > - unsigned long excess; > - unsigned long nr_scanned; > - struct mem_cgroup_reclaim_cookie reclaim = { > - .zone = zone, > - .priority = 0, > - }; > +/* > + * A group is eligible for the soft limit reclaim if it is > + * a) is over its soft limit > + * b) any parent up the hierarchy is over its soft limit > + */ > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *parent = memcg; > > - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; > - > - while (1) { > - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); > - if (!victim) { > - loop++; > - if (loop >= 2) { > - /* > - * If we have not been able to reclaim > - * anything, it might because there are > - * no reclaimable pages under this hierarchy > - */ > - if (!total) > - break; > - /* > - * We want to do more targeted reclaim. > - * excess >> 2 is not to excessive so as to > - * reclaim too much, nor too less that we keep > - * coming back to reclaim from this cgroup > - */ > - if (total >= (excess >> 2) || > - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) > - break; > - } > - continue; > - } > - if (!mem_cgroup_reclaimable(victim, false)) > - continue; > - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, > - zone, &nr_scanned); > - *total_scanned += nr_scanned; > - if (!res_counter_soft_limit_excess(&root_memcg->res)) > - break; > + if (res_counter_soft_limit_excess(&memcg->res)) > + return true; > + > + /* > + * If any parent up the hierarchy is over its soft limit then we > + * have to obey and reclaim from this group as well. 
> + */ > + while((parent = parent_mem_cgroup(parent))) { > + if (res_counter_soft_limit_excess(&parent->res)) > + return true; > } > - mem_cgroup_iter_break(root_memcg, victim); > - return total; > + > + return false; > } > > /* > @@ -4724,98 +4695,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, > return ret; > } > > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > -{ > - unsigned long nr_reclaimed = 0; > - struct mem_cgroup_per_zone *mz, *next_mz = NULL; > - unsigned long reclaimed; > - int loop = 0; > - struct mem_cgroup_tree_per_zone *mctz; > - unsigned long long excess; > - unsigned long nr_scanned; > - > - if (order > 0) > - return 0; > - > - mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); > - /* > - * This loop can run a while, specially if mem_cgroup's continuously > - * keep exceeding their soft limit and putting the system under > - * pressure > - */ > - do { > - if (next_mz) > - mz = next_mz; > - else > - mz = mem_cgroup_largest_soft_limit_node(mctz); > - if (!mz) > - break; > - > - nr_scanned = 0; > - reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone, > - gfp_mask, &nr_scanned); > - nr_reclaimed += reclaimed; > - *total_scanned += nr_scanned; > - spin_lock(&mctz->lock); > - > - /* > - * If we failed to reclaim anything from this memory cgroup > - * it is time to move on to the next cgroup > - */ > - next_mz = NULL; > - if (!reclaimed) { > - do { > - /* > - * Loop until we find yet another one. > - * > - * By the time we get the soft_limit lock > - * again, someone might have aded the > - * group back on the RB tree. Iterate to > - * make sure we get a different mem. > - * mem_cgroup_largest_soft_limit_node returns > - * NULL if no other cgroup is present on > - * the tree > - */ > - next_mz = > - __mem_cgroup_largest_soft_limit_node(mctz); > - if (next_mz == mz) > - css_put(&next_mz->memcg->css); > - else /* next_mz == NULL or other memcg */ > - break; > - } while (1); > - } > - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); > - excess = res_counter_soft_limit_excess(&mz->memcg->res); > - /* > - * One school of thought says that we should not add > - * back the node to the tree if reclaim returns 0. > - * But our reclaim could return 0, simply because due > - * to priority we are exposing a smaller subset of > - * memory to reclaim from. Consider this as a longer > - * term TODO. > - */ > - /* If excess == 0, no tree ops */ > - __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess); > - spin_unlock(&mctz->lock); > - css_put(&mz->memcg->css); > - loop++; > - /* > - * Could not reclaim anything and there are no more > - * mem cgroups to try or we seem to be looping without > - * reclaiming anything. 
> - */ > - if (!nr_reclaimed && > - (next_mz == NULL || > - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) > - break; > - } while (!nr_reclaimed); > - if (next_mz) > - css_put(&next_mz->memcg->css); > - return nr_reclaimed; > -} > - > /** > * mem_cgroup_force_empty_list - clears LRU of a group > * @memcg: group to clear > diff --git a/mm/vmscan.c b/mm/vmscan.c > index df78d17..ae3a387 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -138,11 +138,21 @@ static bool global_reclaim(struct scan_control *sc) > { > return !sc->target_mem_cgroup; > } > + > +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > +{ > + return global_reclaim(sc); > +} > #else > static bool global_reclaim(struct scan_control *sc) > { > return true; > } > + > +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > +{ > + return false; > +} > #endif > > static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru) > @@ -1942,9 +1952,11 @@ static inline bool should_continue_reclaim(struct zone *zone, > } > } > > -static void shrink_zone(struct zone *zone, struct scan_control *sc) > +static unsigned > +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) > { > unsigned long nr_reclaimed, nr_scanned; > + unsigned nr_shrunk = 0; What does this number mean ? > > do { > struct mem_cgroup *root = sc->target_mem_cgroup; > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > do { > struct lruvec *lruvec; > > + if (soft_reclaim && > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > + continue; > + } > + > + nr_shrunk++; > lruvec = mem_cgroup_zone_lruvec(zone, memcg); nr_shrunk will be updated even if the memcg has no pages to be reclaimed...right ? > > shrink_lruvec(lruvec, sc); > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > } while (memcg); > } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, > sc->nr_scanned - nr_scanned, sc)); > + > + return nr_shrunk; > +} > + > + > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > +{ > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > + unsigned long nr_scanned = sc->nr_scanned; > + unsigned nr_shrunk; > + > + nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim); > + > + /* > + * No group is over the soft limit or those that are do not have > + * pages in the zone we are reclaiming so we have to reclaim everybody > + */ > + if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) { > + __shrink_zone(zone, sc, false); > + return; > + } Hmm...so...nr_shrunk is working as a bool value. Isn't it better to call __shrink_zone(...,false) if above shrink_zone(...,true) couldn't make good progress ? memory-disk ping-pong will happen in bad case. I think....in the 1st run, you can count amount of pages, which are candidates to be reclaimed. Then, you can compare the amounts of reclaim target and the priority and size of the target (amounts of reclaimable memory on the target zonelist), make a decision to fallback to full global reclaim or not. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-04-09 17:05 Michal Hocko

From: Michal Hocko @ 2013-04-09 17:05 UTC
To: Kamezawa Hiroyuki
Cc: linux-mm, Ying Han, Johannes Weiner, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa

On Wed 10-04-13 01:45:19, KAMEZAWA Hiroyuki wrote:
> (2013/04/09 21:13), Michal Hocko wrote:
[...]
> > @@ -1942,9 +1952,11 @@ static inline bool should_continue_reclaim(struct zone *zone,
> > 	}
> > }
> >
> > -static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +static unsigned
> > +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
> > {
> > 	unsigned long nr_reclaimed, nr_scanned;
> > +	unsigned nr_shrunk = 0;
>
> What does this number mean ?

The number of groups that we have called shrink_lruvec for.

> > 	do {
> > 		struct mem_cgroup *root = sc->target_mem_cgroup;
> > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > 		do {
> > 			struct lruvec *lruvec;
> >
> > +			if (soft_reclaim &&
> > +					!mem_cgroup_soft_reclaim_eligible(memcg)) {
> > +				memcg = mem_cgroup_iter(root, memcg, &reclaim);
> > +				continue;
> > +			}
> > +
> > +			nr_shrunk++;
> > 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>
> nr_shrunk will be updated even if the memcg has no pages to be reclaimed...right ?

Yes.

> > 			shrink_lruvec(lruvec, sc);
> > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > 		} while (memcg);
> > 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
> > 					 sc->nr_scanned - nr_scanned, sc));
> > +
> > +	return nr_shrunk;
> > +}
> > +
> > +
> > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +{
> > +	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
> > +	unsigned long nr_scanned = sc->nr_scanned;
> > +	unsigned nr_shrunk;
> > +
> > +	nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim);
> > +
> > +	/*
> > +	 * No group is over the soft limit or those that are do not have
> > +	 * pages in the zone we are reclaiming so we have to reclaim everybody
> > +	 */
> > +	if (do_soft_reclaim && (!nr_shrunk || sc->nr_scanned == nr_scanned)) {
> > +		__shrink_zone(zone, sc, false);
> > +		return;
> > +	}
>
> Hmm...so...nr_shrunk is working as a bool value. Isn't it better to call
> __shrink_zone(...,false) if above shrink_zone(...,true) couldn't make
> good progress ?

Yes, that was an attempt at exactly that, and as Johannes already
pointed out, the nr_shrunk counting is superseded by the nr_scanned
check.

> memory-disk ping-pong will happen in bad case.

I am not sure what you mean by this.

> I think....in the 1st run, you can count amount of pages, which are
> candidates to be reclaimed. Then, you can compare the amounts of
> reclaim target and the priority and size of the target (amounts of
> reclaimable memory on the target zonelist), make a decision to fallback to
> full global reclaim or not.

I would like to keep the logic as simple as possible. The nr_scanned
progress check is a protection against needlessly raising the reclaim
priority and should be sufficient for a start. There is still the
possibility that small groups over their soft limit won't have any pages
to reclaim because those pages are dirty, but such pages should be
flushed during global reclaim, and we wait for them during targeted
reclaim. I agree that maybe we also need a priority check here, though.
Will think about it.
> 
> Thanks,
> -Kame

-- 
Michal Hocko
SUSE Labs
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-04-14 0:42 Mel Gorman

From: Mel Gorman @ 2013-04-14 0:42 UTC
To: Michal Hocko
Cc: linux-mm, Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Glauber Costa

On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote:
> Memcg soft reclaim has been traditionally triggered from the global
> reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
> then picked up a group which exceeds the soft limit the most and
> reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
> 

I didn't realise it scanned at priority 0, or else I forgot! Priority 0
scanning means memcg soft reclaim currently scans anon and file equally,
with the full LRU of each type considered as scan candidates.
Consequently, it will reclaim SWAP_CLUSTER_MAX from each evictable LRU
before stopping as sc->nr_to_reclaim pages have been scanned. It's only
partially related to your series, but of course this is very blunt
behaviour for memcg reclaim. In an ideal world of infinite free time it
might be worth checking what happens if that thing scans at priority 1,
or at least keeping an eye on what happens to the priority when/if you
replace mem_cgroup_shrink_node_zone.

> The infrastructure requires per-node-zone trees which hold over-limit
> groups and keep them up-to-date (via memcg_check_events) which is not
> cost free. Although this overhead hasn't turned out to be a bottle neck
> the implementation is suboptimal because mem_cgroup_update_tree has no
> idea which zones consumed memory over the limit so we could easily end
> up having a group on a node-zone tree having only few pages from that
> node-zone.
> 
> This patch doesn't try to fix node-zone trees management because it
> seems that integrating soft reclaim into zone shrinking sounds much
> easier and more appropriate for several reasons.
> First of all 0 priority reclaim was a crude hack which might lead to
> big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
> of dirty/writeback pages).

Scanning at priority 1 would still be vulnerable to this, but it might
avoid some of the stalls if anon/file balancing is treated properly.

> Soft reclaim should be applicable also to the targeted reclaim which is
> awkward right now without additional hacks.
> Last but not least the whole infrastructure eats a lot of code[1].
> 
> After this patch shrink_zone is done in 2. First it tries to do the

Done in 2 what? Passes I think.

> soft reclaim if appropriate (only for global reclaim for now to keep
> compatible with the current state) and fall back to ignoring soft limit
> if no group is eligible to soft reclaim or nothing has been scanned
> during the first pass. Only groups which are over their soft limit or
> any of their parent up the hierarchy is over the limit are considered
> eligible during the first pass.
> 
> TODO: remove mem_cgroup_tree_per_zone, mem_cgroup_shrink_node_zone and co.
> but maybe it would be easier for review to remove that code in a separate
> patch...
> > --- > [1] TODO: put size vmlinux before/after whole clean-up > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > include/linux/memcontrol.h | 10 +-- > mm/memcontrol.c | 161 ++++++-------------------------------------- > mm/vmscan.c | 67 +++++++++++------- > 3 files changed, 64 insertions(+), 174 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index d6183f0..1833c95 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, > mem_cgroup_update_page_stat(page, idx, -1); > } > > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned); > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); > > void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); > static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, > @@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, > } > > static inline > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > { > - return 0; > + return false; > } > > static inline void mem_cgroup_split_huge_fixup(struct page *head) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index f608546..33424d8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2060,57 +2060,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) > } > #endif > > -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, > - struct zone *zone, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > -{ > - struct mem_cgroup *victim = NULL; > - int total = 0; > - int loop = 0; > - unsigned long excess; > - unsigned long nr_scanned; > - struct mem_cgroup_reclaim_cookie reclaim = { > - .zone = zone, > - .priority = 0, > - }; > +/* > + * A group is eligible for the soft limit reclaim if it is > + * a) is over its soft limit > + * b) any parent up the hierarchy is over its soft limit > + */ > +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *parent = memcg; > > - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; > - > - while (1) { > - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); > - if (!victim) { > - loop++; > - if (loop >= 2) { > - /* > - * If we have not been able to reclaim > - * anything, it might because there are > - * no reclaimable pages under this hierarchy > - */ > - if (!total) > - break; > - /* > - * We want to do more targeted reclaim. > - * excess >> 2 is not to excessive so as to > - * reclaim too much, nor too less that we keep > - * coming back to reclaim from this cgroup > - */ > - if (total >= (excess >> 2) || > - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) > - break; > - } > - continue; > - } > - if (!mem_cgroup_reclaimable(victim, false)) > - continue; > - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, > - zone, &nr_scanned); > - *total_scanned += nr_scanned; > - if (!res_counter_soft_limit_excess(&root_memcg->res)) > - break; > + if (res_counter_soft_limit_excess(&memcg->res)) > + return true; > + > + /* > + * If any parent up the hierarchy is over its soft limit then we > + * have to obey and reclaim from this group as well. 
> + */ > + while((parent = parent_mem_cgroup(parent))) { > + if (res_counter_soft_limit_excess(&parent->res)) > + return true; Remove the initial if with this? /* * If the target memcg or any of its parents are over their soft limit * then we have to obey and reclaim from this group as well */ do { if (res_counter_soft_limit_excess(&memcg->res)) return true; while ((memcg = parent_mem_cgroup(memcg)); > } > - mem_cgroup_iter_break(root_memcg, victim); > - return total; > + > + return false; > } > > /* > @@ -4724,98 +4695,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, > return ret; > } > > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > - gfp_t gfp_mask, > - unsigned long *total_scanned) > -{ > - unsigned long nr_reclaimed = 0; > - struct mem_cgroup_per_zone *mz, *next_mz = NULL; > - unsigned long reclaimed; > - int loop = 0; > - struct mem_cgroup_tree_per_zone *mctz; > - unsigned long long excess; > - unsigned long nr_scanned; > - > - if (order > 0) > - return 0; > - > - mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); > - /* > - * This loop can run a while, specially if mem_cgroup's continuously > - * keep exceeding their soft limit and putting the system under > - * pressure > - */ > - do { > - if (next_mz) > - mz = next_mz; > - else > - mz = mem_cgroup_largest_soft_limit_node(mctz); > - if (!mz) > - break; > - > - nr_scanned = 0; > - reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone, > - gfp_mask, &nr_scanned); > - nr_reclaimed += reclaimed; > - *total_scanned += nr_scanned; > - spin_lock(&mctz->lock); > - > - /* > - * If we failed to reclaim anything from this memory cgroup > - * it is time to move on to the next cgroup > - */ > - next_mz = NULL; > - if (!reclaimed) { > - do { > - /* > - * Loop until we find yet another one. > - * > - * By the time we get the soft_limit lock > - * again, someone might have aded the > - * group back on the RB tree. Iterate to > - * make sure we get a different mem. > - * mem_cgroup_largest_soft_limit_node returns > - * NULL if no other cgroup is present on > - * the tree > - */ > - next_mz = > - __mem_cgroup_largest_soft_limit_node(mctz); > - if (next_mz == mz) > - css_put(&next_mz->memcg->css); > - else /* next_mz == NULL or other memcg */ > - break; > - } while (1); > - } > - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); > - excess = res_counter_soft_limit_excess(&mz->memcg->res); > - /* > - * One school of thought says that we should not add > - * back the node to the tree if reclaim returns 0. > - * But our reclaim could return 0, simply because due > - * to priority we are exposing a smaller subset of > - * memory to reclaim from. Consider this as a longer > - * term TODO. > - */ > - /* If excess == 0, no tree ops */ > - __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess); > - spin_unlock(&mctz->lock); > - css_put(&mz->memcg->css); > - loop++; > - /* > - * Could not reclaim anything and there are no more > - * mem cgroups to try or we seem to be looping without > - * reclaiming anything. 
> - */ > - if (!nr_reclaimed && > - (next_mz == NULL || > - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) > - break; > - } while (!nr_reclaimed); > - if (next_mz) > - css_put(&next_mz->memcg->css); > - return nr_reclaimed; > -} > - > /** > * mem_cgroup_force_empty_list - clears LRU of a group > * @memcg: group to clear > diff --git a/mm/vmscan.c b/mm/vmscan.c > index df78d17..ae3a387 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -138,11 +138,21 @@ static bool global_reclaim(struct scan_control *sc) > { > return !sc->target_mem_cgroup; > } > + > +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > +{ > + return global_reclaim(sc); > +} > #else > static bool global_reclaim(struct scan_control *sc) > { > return true; > } > + > +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > +{ > + return false; > +} > #endif > > static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru) > @@ -1942,9 +1952,11 @@ static inline bool should_continue_reclaim(struct zone *zone, > } > } > > -static void shrink_zone(struct zone *zone, struct scan_control *sc) > +static unsigned > +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) > { > unsigned long nr_reclaimed, nr_scanned; > + unsigned nr_shrunk = 0; > > do { > struct mem_cgroup *root = sc->target_mem_cgroup; > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > do { > struct lruvec *lruvec; > > + if (soft_reclaim && > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > + continue; > + } > + Calling mem_cgroup_soft_reclaim_eligible means we do multiple searches of the hierarchy while ascending the hierarchy. It's a stretch but it may be a problem for very deep hierarchies. Would it be worth having mem_cgroup_soft_reclaim_eligible return what the highest parent over its soft limit was and stop the iterator when the highest parent is reached? I think this would avoid calling mem_cgroup_soft_reclaim_eligible multiple times. > + nr_shrunk++; > lruvec = mem_cgroup_zone_lruvec(zone, memcg); > > shrink_lruvec(lruvec, sc); > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > } while (memcg); > } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, > sc->nr_scanned - nr_scanned, sc)); > + > + return nr_shrunk; > +} > + > + > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > +{ > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > + unsigned long nr_scanned = sc->nr_scanned; > + unsigned nr_shrunk; > + > + nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim); > + The two pass thing is explained in the changelog very well but adding comments on it here would not hurt. Otherwise this patch looks like a great idea and memcg soft reclaim looks a lot less like it's stuck on the side. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
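For readers following the two-pass discussion: a minimal sketch of how the reworked shrink_zone fits together, reconstructed from the hunks quoted above. The exact fallback condition is an assumption here, since the tail of the new shrink_zone is not quoted in this message:

static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
        bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
        unsigned nr_shrunk;

        /* First pass: visit only groups eligible for soft reclaim. */
        nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim);

        /*
         * Fallback pass at the same priority: no group was over its
         * soft limit (or none of the eligible ones had pages in this
         * zone), so reclaim from all groups.
         */
        if (do_soft_reclaim && !nr_shrunk)
                __shrink_zone(zone, sc, false);
}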
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-14 0:42 ` Mel Gorman @ 2013-04-14 14:34 ` Michal Hocko 2013-04-14 14:55 ` Johannes Weiner 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2013-04-14 14:34 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Glauber Costa On Sun 14-04-13 01:42:52, Mel Gorman wrote: > On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote: > > Memcg soft reclaim has been traditionally triggered from the global > > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim > > then picked up a group which exceeds the soft limit the most and > > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. > > > > I didn't realise it scanned at priority 0 or else I forgot! Priority 0 > scanning means memcg soft reclaim currently scans anon and file equally > with the full LRU of ecah type considered as scan candidates. Consequently, > it will reclaim SWAP_CLUSTER_MAX from each evictable LRU before stopping as > sc->nr_to_reclaim pages have been scanned. It's only partially related to > your series of course this is very blunt behaviour for memcg reclaim. In an > ideal world of infinite free time it might be worth checking what happens > if that thing scans at priority 1 or at least keep an eye on what happens > priority when/if you replace mem_cgroup_shrink_node_zone I do not think experimenting with prio 1 would make any difference. We would still reclaim half of the LRUs and bail out if at least SWAP_CLUSTER_MAX pages have been reclaimed after visiting all reclaimable LRUs. The whole point of the series is to not do anything special for the soft reclaim priority-wise. [...] > > Soft reclaim should be applicable also to the targeted reclaim which is > > awkward right now without additional hacks. > > Last but not least the whole infrastructure eats a lot of code[1]. > > > > After this patch shrink_zone is done in 2. First it tries to do the > > Done in 2 what? Passes I think. Yes. Fixed. [...] > > + if (res_counter_soft_limit_excess(&memcg->res)) > > + return true; > > + > > + /* > > + * If any parent up the hierarchy is over its soft limit then we > > + * have to obey and reclaim from this group as well. > > + */ > > + while((parent = parent_mem_cgroup(parent))) { > > + if (res_counter_soft_limit_excess(&parent->res)) > > + return true; > > Remove the initial if with this? > /* > * If the target memcg or any of its parents are over their soft limit > * then we have to obey and reclaim from this group as well > */ > do { > if (res_counter_soft_limit_excess(&memcg->res)) > return true; > } while ((memcg = parent_mem_cgroup(memcg))); A later patch changes this behavior: there we treat the current memcg and its parents slightly differently based on whether the limit has been set by the user or is the default unlimited value. [...]
> > -static void shrink_zone(struct zone *zone, struct scan_control *sc) > > +static unsigned > > +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) > > { > > unsigned long nr_reclaimed, nr_scanned; > > + unsigned nr_shrunk = 0; > > > > do { > > struct mem_cgroup *root = sc->target_mem_cgroup; > > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > > do { > > struct lruvec *lruvec; > > > > + if (soft_reclaim && > > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > > + continue; > > + } > > + > > Calling mem_cgroup_soft_reclaim_eligible means we do multiple searches > of the hierarchy while ascending the hierarchy. It's a stretch but it > may be a problem for very deep hierarchies. I think it shouldn't be a problem for hundreds of memcgs and I am quite sceptical about such configurations for other reasons (e.g. charging overhead). And we are in the reclaim path so this is hardly a hot path (unlike charging). So while this might turn out to be a real problem, we would need to fix those other parts with higher priority first. > Would it be worth having mem_cgroup_soft_reclaim_eligible return what > the highest parent over its soft limit was and stop the iterator when > the highest parent is reached? I think this would avoid calling > mem_cgroup_soft_reclaim_eligible multiple times. This is basically what the original implementation did and I think it is not the right way to go. First, why should we care which group exceeds its limit the most? We should treat them equally if there is no special reason not to do so. And I do not see such a special reason. Besides that, keeping an excess-sorted data structure of memcgs turned out to require quite a lot of code. Note that the later patch integrates soft reclaim into targeted reclaim which would mean that we would have to keep such a list/tree per memcg. > > + nr_shrunk++; > > lruvec = mem_cgroup_zone_lruvec(zone, memcg); > > > > shrink_lruvec(lruvec, sc); > > @@ -1984,6 +2003,27 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > > } while (memcg); > > } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, > > sc->nr_scanned - nr_scanned, sc)); > > + > > + return nr_shrunk; > > +} > > + > > + > > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > > +{ > > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > > + unsigned long nr_scanned = sc->nr_scanned; > > + unsigned nr_shrunk; > > + > > + nr_shrunk = __shrink_zone(zone, sc, do_soft_reclaim); > > + > > The two pass thing is explained in the changelog very well but adding > comments on it here would not hurt. What about merging the comment that is already there with this? /* * If memcg is enabled we try to reclaim only over-soft limit groups in * the first pass and only fallback to all groups reclaim if no group is * over the soft limit or those that are over it do not have pages in the * zone we are reclaiming, so we have to reclaim everybody. * This will guarantee that groups that are below their soft limit are * not touched unless the memory pressure cannot be handled otherwise * and so the soft limit can be used for the working set preservation. */ > > Otherwise this patch looks like a great idea and memcg soft reclaim looks > a lot less like it's stuck on the side. Thanks for the review, Mel! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-14 14:34 ` Michal Hocko @ 2013-04-14 14:55 ` Johannes Weiner 2013-04-14 15:04 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Johannes Weiner @ 2013-04-14 14:55 UTC (permalink / raw) To: Michal Hocko Cc: Mel Gorman, linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Glauber Costa On Sun, Apr 14, 2013 at 07:34:20AM -0700, Michal Hocko wrote: > On Sun 14-04-13 01:42:52, Mel Gorman wrote: > > On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote: > > > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > > > do { > > > struct lruvec *lruvec; > > > > > > + if (soft_reclaim && > > > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > > > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > > > + continue; > > > + } > > > + > > > > Calling mem_cgroup_soft_reclaim_eligible means we do multiple searches > > of the hierarchy while ascending the hierarchy. It's a stretch but it > > may be a problem for very deep hierarchies. > > I think it shouldn't be a problem for hundreds of memcgs and I am quite > sceptical about such configurations for other reasons (e.g. charging > overhead). And we are in the reclaim path so this is hardly a hot path > (unlike the chargin). So while this might turn out to be a real problem > we would need to fix other parts as well with higher priority. > > > Would it be worth having mem_cgroup_soft_reclaim_eligible return what > > the highest parent over its soft limit was and stop the iterator when > > the highest parent is reached? I think this would avoid calling > > mem_cgroup_soft_reclaim_eligible multiple times. > > This is basically what the original implementation did and I think it is > not the right way to go. First why should we care who is the most > exceeding group. We should treat them equally if the there is no special > reason to not do so. And I do not see such a special reason. Besides > that keeping a exceed sorted data structure of memcgs turned out quite a > lot of code. Note that the later patch integrate soft reclaim into > targeted reclaim which would mean that we would have to keep such a > list/tree per memcg. I think what Mel suggests is not to return the highest excessor, but return the highest parent in the hierarchy that is in excess. Once you have this parent, you know that all children are in excess, without looking them up individually. However, that parent is not necessarily the root of the hierarchy that is being reclaimed and you might have multiple of such sub-hierarchies in excess. To handle all the corner cases, I'd expect the relationship checking to get really complicated. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
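To make the discussed alternative concrete, a sketch of the helper Mel and Johannes are talking about (the function name is made up for illustration and the cooperation with mem_cgroup_iter is left out):

/*
 * Return the highest ancestor of @memcg (possibly @memcg itself)
 * that is over its soft limit, or NULL. Everything underneath the
 * returned group is known to be eligible without a per-group
 * ancestor walk.
 */
static struct mem_cgroup *soft_reclaim_root(struct mem_cgroup *memcg)
{
        struct mem_cgroup *highest = NULL;

        for (; memcg; memcg = parent_mem_cgroup(memcg))
                if (res_counter_soft_limit_excess(&memcg->res))
                        highest = memcg;

        return highest;
}

As noted above, the caller would still have to work out how the returned group relates to the hierarchy currently being reclaimed, which is exactly where the relationship checking gets complicated.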
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-14 14:55 ` Johannes Weiner @ 2013-04-14 15:04 ` Michal Hocko 2013-04-14 15:11 ` Michal Hocko 2013-04-14 18:03 ` Rik van Riel 0 siblings, 2 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-14 15:04 UTC (permalink / raw) To: Johannes Weiner Cc: Mel Gorman, linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Glauber Costa On Sun 14-04-13 10:55:32, Johannes Weiner wrote: > On Sun, Apr 14, 2013 at 07:34:20AM -0700, Michal Hocko wrote: > > On Sun 14-04-13 01:42:52, Mel Gorman wrote: > > > On Tue, Apr 09, 2013 at 02:13:13PM +0200, Michal Hocko wrote: > > > > @@ -1961,6 +1973,13 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) > > > > do { > > > > struct lruvec *lruvec; > > > > > > > > + if (soft_reclaim && > > > > + !mem_cgroup_soft_reclaim_eligible(memcg)) { > > > > + memcg = mem_cgroup_iter(root, memcg, &reclaim); > > > > + continue; > > > > + } > > > > + > > > > > > Calling mem_cgroup_soft_reclaim_eligible means we do multiple searches > > > of the hierarchy while ascending the hierarchy. It's a stretch but it > > > may be a problem for very deep hierarchies. > > > > I think it shouldn't be a problem for hundreds of memcgs and I am quite > > sceptical about such configurations for other reasons (e.g. charging > > overhead). And we are in the reclaim path so this is hardly a hot path > > (unlike the chargin). So while this might turn out to be a real problem > > we would need to fix other parts as well with higher priority. > > > > > Would it be worth having mem_cgroup_soft_reclaim_eligible return what > > > the highest parent over its soft limit was and stop the iterator when > > > the highest parent is reached? I think this would avoid calling > > > mem_cgroup_soft_reclaim_eligible multiple times. > > > > This is basically what the original implementation did and I think it is > > not the right way to go. First why should we care who is the most > > exceeding group. We should treat them equally if the there is no special > > reason to not do so. And I do not see such a special reason. Besides > > that keeping a exceed sorted data structure of memcgs turned out quite a > > lot of code. Note that the later patch integrate soft reclaim into > > targeted reclaim which would mean that we would have to keep such a > > list/tree per memcg. > > I think what Mel suggests is not to return the highest excessor, but > return the highest parent in the hierarchy that is in excess. Once > you have this parent, you know that all children are in excess, > without looking them up individually. OK, I see it now. > However, that parent is not necessarily the root of the hierarchy that > is being reclaimed and you might have multiple of such sub-hierarchies > in excess. To handle all the corner cases, I'd expect the > relationship checking to get really complicated. We could always return the leftmost and get to others as the iteration continues. I will try to think about it some more. I do not think we would save a lot but it looks like a neat idea. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-14 15:04 ` Michal Hocko @ 2013-04-14 15:11 ` Michal Hocko 2013-04-14 18:03 ` Rik van Riel 1 sibling, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-14 15:11 UTC (permalink / raw) To: Johannes Weiner Cc: Mel Gorman, linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Glauber Costa On Sun 14-04-13 08:04:55, Michal Hocko wrote: > On Sun 14-04-13 10:55:32, Johannes Weiner wrote: > > However, that parent is not necessarily the root of the hierarchy that > > is being reclaimed and you might have multiple of such sub-hierarchies > > in excess. To handle all the corner cases, I'd expect the > > relationship checking to get really complicated. > > We could always return the leftmost and get to others as the iteration > continues. I will try to think about it some more. I do not think we > would save a lot but it looks like a neat idea. Hmm, scratch that. Leftmost doesn't make much sense as we are going bottom up... -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-04-14 15:04 ` Michal Hocko 2013-04-14 15:11 ` Michal Hocko @ 2013-04-14 18:03 ` Rik van Riel 1 sibling, 0 replies; 27+ messages in thread From: Rik van Riel @ 2013-04-14 18:03 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Mel Gorman, linux-mm, Ying Han, KAMEZAWA Hiroyuki, Hugh Dickins, Glauber Costa On 04/14/2013 11:04 AM, Michal Hocko wrote: > On Sun 14-04-13 10:55:32, Johannes Weiner wrote: >> I think what Mel suggests is not to return the highest excessor, but >> return the highest parent in the hierarchy that is in excess. Once >> you have this parent, you know that all children are in excess, >> without looking them up individually. > > OK, I see it now. > >> However, that parent is not necessarily the root of the hierarchy that >> is being reclaimed and you might have multiple of such sub-hierarchies >> in excess. To handle all the corner cases, I'd expect the >> relationship checking to get really complicated. > > We could always return the leftmost and get to others as the iteration > continues. I will try to think about it some more. I do not think we > would save a lot but it looks like a neat idea. We should probably gather around a whiteboard this week in San Francisco, and figure out what exactly we want the code to do, before figuring out the most efficient way to do it. -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified 2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko 2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko @ 2013-04-09 12:13 ` Michal Hocko 2013-04-09 13:24 ` Johannes Weiner 2013-04-09 17:10 ` Kamezawa Hiroyuki 2013-04-09 12:13 ` [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko ` (3 subsequent siblings) 5 siblings, 2 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-09 12:13 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa The soft limit has been traditionally initialized to RESOURCE_MAX which means that the group is soft unlimited by default. This was working more or less satisfactorily so far because the soft limit has been interpreted as a tool to hint memory reclaim which groups to reclaim first to free some memory, so groups basically opted in to being reclaimed more. While this feature might be really helpful, it would be even nicer if the soft reclaim could be used as a certain working set protection - only groups over their soft limit are reclaimed as far as the reclaim is able to free memory. In order to accomplish this behavior we have to reconsider the default soft limit value because with the current default all groups would become soft unreclaimable and so the reclaim would have to fall back to ignoring soft reclaim altogether, harming those groups that set up a limit as a protection against the reclaim. Changing the default soft limit to 0 wouldn't work either because all groups would become soft reclaimable as the parent's limit would overwrite all its children down the hierarchy. This patch doesn't change the default soft limit value. Rather than that it distinguishes groups with the limit set by the user by a per group flag. All groups are considered soft reclaimable regardless of their limit until a limit is set. The default limit doesn't enforce reclaim down the hierarchy. TODO: How do we present default unlimited vs. RESOURCE_MAX set by the user? One possible way could be returning -1 for RESOURCE_MAX && !soft_limited but this is a change in user interface. Although nothing explicitly says the value has to be greater than 0, I can imagine this could be a PITA to use. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 33424d8..043d760 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -292,6 +292,10 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? */ bool use_hierarchy; + /* + * Is the group soft limited?
+ */ + bool soft_limited; unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */ bool oom_lock; @@ -2062,14 +2066,15 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) /* * A group is eligible for the soft limit reclaim if it is - * a) is over its soft limit - * b) any parent up the hierarchy is over its soft limit + * a) doesn't have any soft limit set + * b) is over its soft limit + * c) any parent up the hierarchy is over its soft limit */ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) { struct mem_cgroup *parent = memcg; - if (res_counter_soft_limit_excess(&memcg->res)) + if (!memcg->soft_limited || res_counter_soft_limit_excess(&memcg->res)) return true; /* @@ -2077,7 +2082,8 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) * have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { - if (res_counter_soft_limit_excess(&parent->res)) + if (memcg->soft_limited && + res_counter_soft_limit_excess(&parent->res)) return true; } @@ -5237,6 +5243,14 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, ret = res_counter_set_soft_limit(&memcg->res, val); else ret = -EINVAL; + + /* + * We could disable soft_limited when we get RESOURCE_MAX but + * then we have a little problem to distinguish the default + * unlimited and limitted but never soft reclaimed groups. + */ + if (!ret) + memcg->soft_limited = true; break; default: ret = -EINVAL; /* should be BUG() ? */ -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
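To spell out the intended semantics on a hypothetical setup: a freshly created group has soft_limited == false and is therefore always eligible for soft reclaim even though res_counter_soft_limit_excess() returns 0 for it, while a group whose admin has written any value, including RESOURCE_MAX, has the flag set, so its eligibility depends purely on whether it (or a soft-limited parent) actually exceeds the configured limit.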
* Re: [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified 2013-04-09 12:13 ` [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified Michal Hocko @ 2013-04-09 13:24 ` Johannes Weiner 2013-04-09 13:42 ` Michal Hocko 2013-04-09 17:10 ` Kamezawa Hiroyuki 1 sibling, 1 reply; 27+ messages in thread From: Johannes Weiner @ 2013-04-09 13:24 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue, Apr 09, 2013 at 02:13:14PM +0200, Michal Hocko wrote: > The soft limit has been traditionally initialized to RESOURCE_MAX > which means that the group is soft unlimited by default. This was > working more or less satisfactorily so far because the soft limit has > been interpreted as a tool to hint memory reclaim which groups to > reclaim first to free some memory so groups basically opted in for being > reclaimed more. > > While this feature might be really helpful it would be even nicer if > the soft reclaim could be used as a certain working set protection - > only groups over their soft limit are reclaimed as far as the reclaim > is able to free memory. In order to accomplish this behavior we have to > reconsider the default soft limit value because with the current default > all groups would become soft unreclaimable and so the reclaim would have > to fall back to ignoring soft reclaim altogether harming those groups > that set up a limit as a protection against the reclaim. Changing the > default soft limit to 0 wouldn't work either because all groups would > become soft reclaimable as the parent's limit would overwrite all its > children down the hierarchy. > > This patch doesn't change the default soft limit value. Rather than that > it distinguishes groups with the limit set by user by a per group flag. > All groups are considered soft reclaimable regardless their limit until > a limit is set. The default limit doesn't enforce reclaim down the > hierarchy. > > TODO: How do we present default unlimited vs. RESOURCE_MAX set by the > user? One possible way could be returning -1 for RESOURCE_MAX && !soft_limited > but this is a change in user interface. Although nothing explicitly says > the value has to be greater > 0 I can imagine this could be PITA to use. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > mm/memcontrol.c | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 33424d8..043d760 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -292,6 +292,10 @@ struct mem_cgroup { > * Should the accounting and control be hierarchical, per subtree? > */ > bool use_hierarchy; > + /* > + * Is the group soft limited? 
> + */ > + bool soft_limited; > unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */ > > bool oom_lock; > @@ -2062,14 +2066,15 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) > > /* > * A group is eligible for the soft limit reclaim if it is > - * a) is over its soft limit > - * b) any parent up the hierarchy is over its soft limit > + * a) doesn't have any soft limit set > + * b) is over its soft limit > + * c) any parent up the hierarchy is over its soft limit > */ > bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > { > struct mem_cgroup *parent = memcg; > > - if (res_counter_soft_limit_excess(&memcg->res)) > + if (!memcg->soft_limited || res_counter_soft_limit_excess(&memcg->res)) > return true; With the very similar condition in the hierarchy walk down there, this was more confusing than I would have expected it to be. Would you mind splitting this check and putting the comments directly over the individual checks? /* No specific soft limit set, eligible for soft reclaim */ if (!memcg->soft_limited) return true; /* Soft limit exceeded, eligible for soft reclaim */ if (res_counter_soft_limit_excess(&memcg->res)) return true; /* Parental limit exceeded, eligible for... soft reclaim! */ ... > @@ -2077,7 +2082,8 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > * have to obey and reclaim from this group as well. > */ > while((parent = parent_mem_cgroup(parent))) { > - if (res_counter_soft_limit_excess(&parent->res)) > + if (memcg->soft_limited && > + res_counter_soft_limit_excess(&parent->res)) > return true; Should this be parent->soft_limited instead of memcg->soft_limited? > @@ -5237,6 +5243,14 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, > ret = res_counter_set_soft_limit(&memcg->res, val); > else > ret = -EINVAL; > + > + /* > + * We could disable soft_limited when we get RESOURCE_MAX but > + * then we have a little problem to distinguish the default > + * unlimited and limitted but never soft reclaimed groups. > + */ > + if (!ret) > + memcg->soft_limited = true; It's neither reversible nor distinguishable from userspace, so it would be good to either find a value or just make the soft_limited knob explicit and accessible from userspace. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified 2013-04-09 13:24 ` Johannes Weiner @ 2013-04-09 13:42 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-09 13:42 UTC (permalink / raw) To: Johannes Weiner Cc: linux-mm, Ying Han, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue 09-04-13 09:24:06, Johannes Weiner wrote: > On Tue, Apr 09, 2013 at 02:13:14PM +0200, Michal Hocko wrote: [...] > > @@ -2062,14 +2066,15 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) > > > > /* > > * A group is eligible for the soft limit reclaim if it is > > - * a) is over its soft limit > > - * b) any parent up the hierarchy is over its soft limit > > + * a) doesn't have any soft limit set > > + * b) is over its soft limit > > + * c) any parent up the hierarchy is over its soft limit > > */ > > bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > > { > > struct mem_cgroup *parent = memcg; > > > > - if (res_counter_soft_limit_excess(&memcg->res)) > > + if (!memcg->soft_limited || res_counter_soft_limit_excess(&memcg->res)) > > return true; > > With the very similar condition in the hierarchy walk down there, this > was more confusing than I would have expected it to be. > > Would you mind splitting this check and putting the comments directly > over the individual checks? > > /* No specific soft limit set, eligible for soft reclaim */ > if (!memcg->soft_limited) > return true; > > /* Soft limit exceeded, eligible for soft reclaim */ > if (res_counter_soft_limit_excess(&memcg->res)) > return true; > > /* Parental limit exceeded, eligible for... soft reclaim! */ Sure thing. > ... > > > @@ -2077,7 +2082,8 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > > * have to obey and reclaim from this group as well. > > */ > > while((parent = parent_mem_cgroup(parent))) { > > - if (res_counter_soft_limit_excess(&parent->res)) > > + if (memcg->soft_limited && > > + res_counter_soft_limit_excess(&parent->res)) > > return true; > > Should this be parent->soft_limited instead of memcg->soft_limited? Yes. I haven't tested with deeper hierarchies yet... Thanks for catching this. > > > @@ -5237,6 +5243,14 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, > > ret = res_counter_set_soft_limit(&memcg->res, val); > > else > > ret = -EINVAL; > > + > > + /* > > + * We could disable soft_limited when we get RESOURCE_MAX but > > + * then we have a little problem to distinguish the default > > + * unlimited and limitted but never soft reclaimed groups. > > + */ > > + if (!ret) > > + memcg->soft_limited = true; > > It's neither reversible nor distinguishable from userspace, so it > would be good to either find a value or just make the soft_limited > knob explicit and accessible from userspace. I can export the knob but I would like to avoid that if possible. So far it seems it would be hard to keep backward compatibility. I hoped somebody would come up with something clever ;) One possible way would be returning -1 if soft_limited == false. Users who parse the value as a u64 would see the same value in the end so they shouldn't break, and those that are _really_ interested can check the string value as well. What do you think? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
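A minimal sketch of the read side of the -1 idea floated above (the helper name is hypothetical and the cftype plumbing is omitted; this is not part of the posted series):

static int memcg_print_soft_limit(struct mem_cgroup *memcg, char *buf)
{
        /*
         * "-1" parsed back as a u64 (e.g. via strtoull) yields
         * ULLONG_MAX, i.e. RESOURCE_MAX, so consumers treating the
         * value as a number would keep seeing "unlimited".
         */
        if (!memcg->soft_limited)
                return sprintf(buf, "-1\n");

        return sprintf(buf, "%llu\n",
                       res_counter_read_u64(&memcg->res, RES_SOFT_LIMIT));
}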
* Re: [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified 2013-04-09 12:13 ` [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified Michal Hocko 2013-04-09 13:24 ` Johannes Weiner @ 2013-04-09 17:10 ` Kamezawa Hiroyuki 2013-04-09 17:22 ` Michal Hocko 1 sibling, 1 reply; 27+ messages in thread From: Kamezawa Hiroyuki @ 2013-04-09 17:10 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Ying Han, Johannes Weiner, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa (2013/04/09 21:13), Michal Hocko wrote: > The soft limit has been traditionally initialized to RESOURCE_MAX > which means that the group is soft unlimited by default. This was > working more or less satisfactorily so far because the soft limit has > been interpreted as a tool to hint memory reclaim which groups to > reclaim first to free some memory so groups basically opted in for being > reclaimed more. > > While this feature might be really helpful it would be even nicer if > the soft reclaim could be used as a certain working set protection - > only groups over their soft limit are reclaimed as far as the reclaim > is able to free memory. In order to accomplish this behavior we have to > reconsider the default soft limit value because with the current default > all groups would become soft unreclaimable and so the reclaim would have > to fall back to ignoring soft reclaim altogether harming those groups > that set up a limit as a protection against the reclaim. Changing the > default soft limit to 0 wouldn't work either because all groups would > become soft reclaimable as the parent's limit would overwrite all its > children down the hierarchy. > > This patch doesn't change the default soft limit value. Rather than that > it distinguishes groups with the limit set by user by a per group flag. > All groups are considered soft reclaimable regardless their limit until > a limit is set. The default limit doesn't enforce reclaim down the > hierarchy. > > TODO: How do we present default unlimited vs. RESOURCE_MAX set by the > user? One possible way could be returning -1 for RESOURCE_MAX && !soft_limited > but this is a change in user interface. Although nothing explicitly says > the value has to be greater > 0 I can imagine this could be PITA to use. > Hmm.. Now, if a user sets a soft limit on a memcg, it will be a victim. All other cgroups, which have the default value, will be the 2nd choice for memory reclaim. When a user sets RESOURCE_MAX, it will be the 2nd choice, too. In this case, the soft limit is for creating victims. You want another configuration where all cgroups with the default value are the 1st choice and memcgs which have some soft-limit value set are protected. In this case, the soft limit is for protection, i.e. an opposite policy. How about allowing users to set the root memcg's soft limit (to 0?) and allowing the protection policy to be chosen before creating children memcgs? (I think you could make this default policy a CONFIG option or similar...) Users could then choose the global soft-limit policy. Complicated? Thanks, -Kame > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > mm/memcontrol.c | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 33424d8..043d760 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -292,6 +292,10 @@ struct mem_cgroup { > * Should the accounting and control be hierarchical, per subtree? > */ > bool use_hierarchy; > + /* > + * Is the group soft limited?
> + */ > + bool soft_limited; > unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */ > > bool oom_lock; > @@ -2062,14 +2066,15 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) > > /* > * A group is eligible for the soft limit reclaim if it is > - * a) is over its soft limit > - * b) any parent up the hierarchy is over its soft limit > + * a) doesn't have any soft limit set > + * b) is over its soft limit > + * c) any parent up the hierarchy is over its soft limit > */ > bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > { > struct mem_cgroup *parent = memcg; > > - if (res_counter_soft_limit_excess(&memcg->res)) > + if (!memcg->soft_limited || res_counter_soft_limit_excess(&memcg->res)) > return true; > > /* > @@ -2077,7 +2082,8 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) > * have to obey and reclaim from this group as well. > */ > while((parent = parent_mem_cgroup(parent))) { > - if (res_counter_soft_limit_excess(&parent->res)) > + if (memcg->soft_limited && > + res_counter_soft_limit_excess(&parent->res)) > return true; > } > > @@ -5237,6 +5243,14 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, > ret = res_counter_set_soft_limit(&memcg->res, val); > else > ret = -EINVAL; > + > + /* > + * We could disable soft_limited when we get RESOURCE_MAX but > + * then we have a little problem to distinguish the default > + * unlimited and limitted but never soft reclaimed groups. > + */ > + if (!ret) > + memcg->soft_limited = true; > break; > default: > ret = -EINVAL; /* should be BUG() ? */ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified 2013-04-09 17:10 ` Kamezawa Hiroyuki @ 2013-04-09 17:22 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-09 17:22 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: linux-mm, Ying Han, Johannes Weiner, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Wed 10-04-13 02:10:44, KAMEZAWA Hiroyuki wrote: > (2013/04/09 21:13), Michal Hocko wrote: > > The soft limit has been traditionally initialized to RESOURCE_MAX > > which means that the group is soft unlimited by default. This was > > working more or less satisfactorily so far because the soft limit has > > been interpreted as a tool to hint memory reclaim which groups to > > reclaim first to free some memory so groups basically opted in for being > > reclaimed more. > > > > While this feature might be really helpful it would be even nicer if > > the soft reclaim could be used as a certain working set protection - > > only groups over their soft limit are reclaimed as far as the reclaim > > is able to free memory. In order to accomplish this behavior we have to > > reconsider the default soft limit value because with the current default > > all groups would become soft unreclaimable and so the reclaim would have > > to fall back to ignoring soft reclaim altogether harming those groups > > that set up a limit as a protection against the reclaim. Changing the > > default soft limit to 0 wouldn't work either because all groups would > > become soft reclaimable as the parent's limit would overwrite all its > > children down the hierarchy. > > > > This patch doesn't change the default soft limit value. Rather than that > > it distinguishes groups with the limit set by user by a per group flag. > > All groups are considered soft reclaimable regardless their limit until > > a limit is set. The default limit doesn't enforce reclaim down the > > hierarchy. > > > > TODO: How do we present default unlimited vs. RESOURCE_MAX set by the > > user? One possible way could be returning -1 for RESOURCE_MAX && !soft_limited > > but this is a change in user interface. Although nothing explicitly says > > the value has to be greater > 0 I can imagine this could be PITA to use. > > > > Hmm.. > > Now, if a user sets soft_limit to a memcg, it will be a victim. All other > cgroups, which has default value, will be 2nd choice for memory reclaim. Not really. All those with the default value will be the 1st choice along with those that are over the limit. Just to make sure we are on the same page, this is what I have currently after Johannes' feedback: bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) { struct mem_cgroup *parent = memcg; /* No specific soft limit set, eligible for soft reclaim */ if (!memcg->soft_limited) return true; /* Soft limit exceeded, eligible for soft reclaim */ if (res_counter_soft_limit_excess(&memcg->res)) return true; /* * If any parent up the hierarchy is over its soft limit then we * have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { if (parent->soft_limited && res_counter_soft_limit_excess(&parent->res)) return true; } return false; } Does this make more sense to you? > When user sets RESOURCE_MAX, it will be 2nd choice, too. No, it will never be soft reclaimed because it would have memcg->soft_limited == true. > In this case, soft-limit is for creating victims.
> > You want the another configuration that all cgroup must be 1st choice > with the default value and protect memcg which has some soft-limit value. > In this case, soft-limit is for protection. Why should we distinguish the default setting from over-the-limit groups? > i.e. an opposite policy. > > How about allowing users to set root memcg's soft-limit (to be 0 ?) This is not forbidden AFAICS in mem_cgroup_write for RES_SOFT_LIMIT. The 0 @ root is not good, as I tried to explain in the changelog, because this would put hierarchical pressure on all children so their limits would basically be ignored. > and allow the new choice of protection before creating children > memcgs? (I think you can make this default policy as CONFIG option or > some...) Users can choice global soft-limit policy. Complicated ? Yes, and I do not understand why a CONFIG option is needed. [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko 2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko 2013-04-09 12:13 ` [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified Michal Hocko @ 2013-04-09 12:13 ` Michal Hocko 2013-04-22 2:14 ` Michal Hocko 2013-04-09 15:37 ` [RFC 0/3] soft reclaim rework Michal Hocko ` (2 subsequent siblings) 5 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2013-04-09 12:13 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa Soft reclaim has been done only for the global reclaim (both background and direct). Since "memcg: integrate soft reclaim tighter with zone shrinking code" there is no reason for this limitation anymore as the soft limit reclaim doesn't use any special code paths and it is a part of the zone shrinking code which is used by both global and targeted reclaims. From a semantic point of view it is even natural to consider the soft limit before touching all groups in a hierarchy that is hitting its hard limit, because the soft limit tells us where to push back when there is memory pressure. It is not important whether the pressure comes from the limit or imbalanced zones. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index ae3a387..cf729ca 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -141,7 +141,7 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return global_reclaim(sc); + return true; } #else static bool global_reclaim(struct scan_control *sc) -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-04-09 12:13 ` [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko @ 2013-04-22 2:14 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-22 2:14 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue 09-04-13 14:13:15, Michal Hocko wrote: > Soft reclaim has been done only for the global reclaim (both background > and direct). Since "memcg: integrate soft reclaim tighter with zone > shrinking code" there is no reason for this limitation anymore as the > soft limit reclaim doesn't use any special code paths and it is a > part of the zone shrinking code which is used by both global and > targeted reclaims. > > From semantic point of view it is even natural to consider soft limit > before touching all groups in the hierarchy tree which is touching the > hard limit because soft limit tells us where to push back when there is > a memory pressure. It is not important whether the pressure comes from > the limit or imbalanced zones. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > mm/vmscan.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index ae3a387..cf729ca 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -141,7 +141,7 @@ static bool global_reclaim(struct scan_control *sc) > > static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > { > - return global_reclaim(sc); > + return true; > } > #else > static bool global_reclaim(struct scan_control *sc) This patch is not complete. We also need to update mem_cgroup_soft_reclaim_eligible because we should ignore parents that are above the current reclaim pressure. Say we have

A (over soft limit)
 \
  B (below s.l., hit the hard limit)
 / \
C   D (below s.l.)

B is now the source of the outside memory pressure for D, but we shouldn't soft reclaim D because it is behaving well within the B subtree. mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the hierarchy at B (the root of the memory pressure).
--- diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1833c95..80ed1b6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -179,7 +179,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -356,7 +357,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { return false; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index be86815..19b4cb7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1845,12 +1845,14 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) #endif /* - * A group is eligible for the soft limit reclaim if it is + * A group is eligible for the soft limit reclaim under given root hierarchy + * if it is * a) doesn't have any soft limit set * b) is over its soft limit * c) any parent up the hierarchy is over its soft limit */ -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; @@ -1863,13 +1865,15 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) return true; /* - * If any parent up the hierarchy is over its soft limit then we - * have to obey and reclaim from this group as well. + * If any parent up to the root in the hierarchy is over its soft limit + * then we have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { if (parent->soft_limited && res_counter_soft_limit_excess(&parent->res)) return true; + if (parent == root) + break; } return false; diff --git a/mm/vmscan.c b/mm/vmscan.c index 1fe9f81..471bf94 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1973,7 +1973,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) struct lruvec *lruvec; if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg)) { + !mem_cgroup_soft_reclaim_eligible(memcg, root)) { memcg = mem_cgroup_iter(root, memcg, &reclaim); continue; } -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
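Tracing the fixed walk on the A/B/C/D layout above: in a targeted reclaim rooted at B (triggered by B's hard limit), the check for D climbs to B, finds it not in excess, hits parent == root and stops, so D is not soft reclaimed even though A is over its soft limit. In a global reclaim (root == NULL) the same walk continues past B up to A, and D stays eligible as before.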
* Re: [RFC 0/3] soft reclaim rework 2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko ` (2 preceding siblings ...) 2013-04-09 12:13 ` [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko @ 2013-04-09 15:37 ` Michal Hocko 2013-04-09 15:50 ` Michal Hocko 2013-04-11 8:43 ` Michal Hocko 2013-04-17 22:52 ` Ying Han 5 siblings, 6 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-09 15:37 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue 09-04-13 14:13:12, Michal Hocko wrote: [...] > 2) kbuild test showed more or less the same results > usage_in_bytes > Base > Group A Group B > Median 394817536 395634688 > > Patches applied > median 483481600 302131200 > > A is kept closer to the soft limit again. There is some fluctuation > around the limit because kbuild creates a lot of short lived processes. > Base: pgscan_kswapd_dma32 1648718 pgsteal_kswapd_dma32 1510749 > Patched: pgscan_kswapd_dma32 2042065 pgsteal_kswapd_dma32 1667745 OK, so I have patched the base version with the patch below which uncovers soft reclaim scanning and reclaim and guess what:
Base: pgscan_kswapd_dma32 3710092 pgsteal_kswapd_dma32 3225191
Patched: pgscan_kswapd_dma32 1846700 pgsteal_kswapd_dma32 1442232
Base: pgscan_direct_dma32 2417683 pgsteal_direct_dma32 459702
Patched: pgscan_direct_dma32 1839331 pgsteal_direct_dma32 244338
The numbers are obviously timing dependent (~10% wrt. the previous run for the patched kernel) but the roughly halved numbers wrt. the base kernel seem real; we just haven't seen it previously because it wasn't accounted. I guess this can be attributed to prio-0 soft reclaim behavior and a lot of dirty pages on the LRU. > The differences are much bigger now so it would be interesting how much > has been scanned/reclaimed during soft reclaim in the base kernel. --- ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/3] soft reclaim rework 2013-04-09 15:37 ` [RFC 0/3] soft reclaim rework Michal Hocko @ 2013-04-09 15:50 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-09 15:50 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa On Tue 09-04-13 17:37:42, Michal Hocko wrote: > On Tue 09-04-13 14:13:12, Michal Hocko wrote: > [...] > > 2) kbuild test showed more or less the same results > > usage_in_bytes > > Base > > Group A Group B > > Median 394817536 395634688 > > > > Patches applied > > median 483481600 302131200 > > > > A is kept closer to the soft limit again. There is some fluctuation > > around the limit because kbuild creates a lot of short lived processes. > > Base: pgscan_kswapd_dma32 1648718 pgsteal_kswapd_dma32 1510749 > > Patched: pgscan_kswapd_dma32 2042065 pgsteal_kswapd_dma32 1667745 > > OK, so I have patched the base version with the patch bellow which > uncovers soft reclaim scanning and reclaim and guess what: > Base: pgscan_kswapd_dma32 3710092 pgsteal_kswapd_dma32 3225191 > Patched: pgscan_kswapd_dma32 1846700 pgsteal_kswapd_dma32 1442232 > Base: pgscan_direct_dma32 2417683 pgsteal_direct_dma32 459702 > Patched: pgscan_direct_dma32 1839331 pgsteal_direct_dma32 244338 Dohh, a dwarf sneaked in and broke my numbers for the base kernel. I am rerunning the test. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/3] soft reclaim rework 2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko ` (3 preceding siblings ...) 2013-04-09 15:37 ` [RFC 0/3] soft reclaim rework Michal Hocko @ 2013-04-11 8:43 ` Michal Hocko 2013-04-11 9:07 ` Michal Hocko 2013-04-11 13:04 ` Michal Hocko 2013-04-17 22:52 ` Ying Han 5 siblings, 2 replies; 27+ messages in thread From: Michal Hocko @ 2013-04-11 8:43 UTC (permalink / raw) To: linux-mm Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa Hi, I have retested the kbuild test on bare HW (8 CPUs, 1GB RAM limited by mem=1G, 2GB swap partition). There are 2 groups (A, B) without any hard limit and group A has a soft limit set to 700M (to have 70% of the available memory). The build starts after a fresh boot by extracting the sources and running make -j4 vmlinux. Each group works on a separate source tree. I have repeated the test 3 times. First some data as returned by /usr/bin/time -v:

* Patched:
A:
User time (seconds): 1133.06
User time (seconds): 1132.84
User time (seconds): 1135.37
Avg: 1133.76
System time (seconds): 258.02
System time (seconds): 259.33
System time (seconds): 258.83
Avg: 258.73
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:57.55
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:55.68
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:50.96
Avg: 08:54.73

B:
User time (seconds): 1149.22
User time (seconds): 1153.98
User time (seconds): 1150.37
Avg: 1151.19 (101.5% of A)
System time (seconds): 262.13
System time (seconds): 263.31
System time (seconds): 260.84
Avg: 262.09 (101.3% of A)
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:13.37
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:17.15
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:05.23
Avg: 10:11.92 (114.4% of A)

* Base:
A:
User time (seconds): 1132.58
User time (seconds): 1140.63
User time (seconds): 1135.68
avg: 1136.30 (100.2% of A - patched)
System time (seconds): 264.88
System time (seconds): 263.54
System time (seconds): 261.99
avg: 263.47 (101.8% of A - patched)
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:48.54
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:50.44
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:44.28
avg: 09:47.75 (109.9% of A - patched)

B:
User time (seconds): 1138.32
User time (seconds): 1135.70
User time (seconds): 1136.80
avg: 1136.94 (100.3% of A - patched)
System time (seconds): 261.56
System time (seconds): 262.10
System time (seconds): 262.24
avg: 261.97 (101.3% of A - patched)
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:39.17
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:46.95
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:44.73
avg: 09:43.62 (109.1% of A - patched)

With the patched kernel the soft limit helped to protect A's working set, so A was faster (14% in total time) than B, which ran without any limits. The unpatched kernel treated them more or less equally regardless of the soft limit setting. If we compare the patched and base kernel numbers then the overall situation improved slightly with the patched kernel (A+B elapsed time is 2% smaller), which was quite surprising for me. Maybe a side effect of priority-0 soft reclaim in the base kernel. As the variance between runs wasn't very high, I have focused on the first run for the memory usage and reclaim statistics comparisons between the base and patched kernels.
The reclaim statistics for the first run:

* Patched:
pgscan_direct_dma32   252408
pgscan_kswapd_dma32   988928
pgsteal_direct_dma32   63565
pgsteal_kswapd_dma32  905223

* Base:
pgscan_direct_dma32    97310 (38% of patched)
pgscan_kswapd_dma32  1702971 (172%)
pgsteal_direct_dma32   83377 (131%)
pgsteal_kswapd_dma32 1534616 (169.5%)

So it seems that the patched kernel scanned much more during direct
reclaim yet reclaimed less there. This is most probably because there
is bigger pressure on B's LRUs and we encounter more dirty pages, so
more pages are scanned in the end. In sum, though, the patched kernel
scanned and reclaimed less (the base kernel scanned 45% resp. reclaimed
67% more).

You can find some graphs at:
- http://labs.suse.cz/mhocko/soft_limit_rework/base-usage.png
- http://labs.suse.cz/mhocko/soft_limit_rework/patched-usage.png

Per group charges over time.

- http://labs.suse.cz/mhocko/soft_limit_rework/base-usage-histogram.png
- http://labs.suse.cz/mhocko/soft_limit_rework/patched-usage-histogram.png

Same here but in histogram form to see the main tendencies.

- http://labs.suse.cz/mhocko/soft_limit_rework/pgscan.png
- http://labs.suse.cz/mhocko/soft_limit_rework/pgsteal.png

Scanning and reclaiming activity comparison between the base and the
patched kernel.
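For reference, the counters above come straight from /proc/vmstat; a
minimal snapshot/diff along these lines collects them per run (a
sketch, not the exact script used here):

    # snapshot the four reclaim counters before and after a run...
    grep -E 'pg(scan|steal)_(kswapd|direct)_dma32' /proc/vmstat > before
    # (run the workload)
    grep -E 'pg(scan|steal)_(kswapd|direct)_dma32' /proc/vmstat > after

    # ...and print the per-counter deltas; both files have the same
    # lines in the same order, so a line-wise paste is enough
    paste before after | awk '{ printf "%s %d\n", $1, $4 - $2 }'
-- 
Michal Hocko
SUSE Labs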
* Re: [RFC 0/3] soft reclaim rework
  2013-04-11  8:43 ` Michal Hocko
@ 2013-04-11  9:07   ` Michal Hocko
  2013-04-11 13:04   ` Michal Hocko
  1 sibling, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2013-04-11 9:07 UTC
To: linux-mm
Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel,
    Hugh Dickins, Mel Gorman, Glauber Costa

On Thu 11-04-13 10:43:46, Michal Hocko wrote:
> Hi,
> I have rerun the kbuild test on bare HW (8 CPUs, 1GB RAM limited by
> mem=1G, 2GB swap partition).
[...]
I have moved the graphs to
http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/700-softlimit/kbuild
because I am doing tests with other soft limits and also other types of
tests. Sorry about that.

> You can find some graphs at:
> - http://labs.suse.cz/mhocko/soft_limit_rework/base-usage.png
> - http://labs.suse.cz/mhocko/soft_limit_rework/patched-usage.png
>
> Per group charges over time.
>
> - http://labs.suse.cz/mhocko/soft_limit_rework/base-usage-histogram.png
> - http://labs.suse.cz/mhocko/soft_limit_rework/patched-usage-histogram.png
>
> Same here but in histogram form to see the main tendencies.
>
> - http://labs.suse.cz/mhocko/soft_limit_rework/pgscan.png
> - http://labs.suse.cz/mhocko/soft_limit_rework/pgsteal.png
>
> Scanning and reclaiming activity comparison between the base and the
> patched kernel.
-- 
Michal Hocko
SUSE Labs
* Re: [RFC 0/3] soft reclaim rework
  2013-04-11  8:43 ` Michal Hocko
  2013-04-11  9:07   ` Michal Hocko
@ 2013-04-11 13:04   ` Michal Hocko
  1 sibling, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2013-04-11 13:04 UTC
To: linux-mm
Cc: Ying Han, Johannes Weiner, KAMEZAWA Hiroyuki, Rik van Riel,
    Hugh Dickins, Mel Gorman, Glauber Costa

On Thu 11-04-13 10:43:46, Michal Hocko wrote:
> Hi,
> I have rerun the kbuild test on bare HW (8 CPUs, 1GB RAM limited by
> mem=1G, 2GB swap partition). There are 2 groups (A, B) without any
> hard limit and group A has its soft limit set to 700M (to have 70% of
> available memory). The build starts after a fresh boot by extracting
> the sources and running make -j4 vmlinux.
> Each group works on a separate source tree. I have repeated the test
> 3 times:

[Cutting the previous results and keeping only averages for overview]

> * Patched:
> A:
> User time (seconds): Avg: 1133.76
> System time (seconds): Avg: 258.73
> Elapsed (wall clock) time (h:mm:ss or m:ss): Avg: 08:54.73
>
> B:
> User time (seconds): Avg: 1151.19 (101.5% of A)
> System time (seconds): Avg: 262.09 (101.3% of A)
> Elapsed (wall clock) time (h:mm:ss or m:ss): Avg: 10:11.92 (114.4% of A)
>
> * Base:
> A:
> User time (seconds): avg: 1136.30 (100.2% of A - patched)
> System time (seconds): avg: 263.47 (101.8% of A - patched)
> Elapsed (wall clock) time (h:mm:ss or m:ss): avg: 09:47.75 (109.9% of A - patched)
>
> B:
> User time (seconds): avg: 1136.94 (100.2% of A - patched)
> System time (seconds): avg: 261.97 (100% of A - patched)
> Elapsed (wall clock) time (h:mm:ss or m:ss): avg: 09:43.62 (109.1% of A - patched)

Same test again with a 300M soft limit for A instead.

* Patched:
A:
User time (seconds): 1143.68, 1137.85, 1137.47 avg: 1139.67
System time (seconds): 264.73, 265.50, 262.44 avg: 264.22
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:54.07, 9:48.23, 9:39.35 avg: 09:47.22

B:
User time (seconds): 1139.10, 1135.94, 1138.13 avg: 1137.72 (99.8% of A)
System time (seconds): 260.94, 262.37, 263.56 avg: 262.29 (99.2% of A)
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:53.04, 9:48.17, 9:51.34 avg: 09:50.85 (100.6% of A)

Both groups are comparable now as both of them are reclaimed (see below
for the reclaim statistics). Both are about 1 min slower (in elapsed
time) than A was with the 700M soft limit.

* Base:
A:
User time (seconds): 1148.50, 1145.96, 1144.60 avg: 1146.35 (100.5% of A patched)
System time (seconds): 265.00, 262.31, 264.98 avg: 264.10 (100% of A patched)
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:44.57, 10:14.74, 10:32.28 avg: 10:30.53 (107.4% of A patched)

B:
User time (seconds): 1137.01, 1131.44, 1136.86 avg: 1135.10 (99.6% of A patched)
System time (seconds): 259.72, 259.05, 262.62 avg: 260.46 (98.6% of A patched)
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:33.82, 9:25.39, 9:38.35 avg: 09:32.52 (97.5% of A patched)

A is hammered by soft reclaim much more than with the 700M soft limit,
which is expected. If we sum the A+B elapsed times, though, then the
workload is faster by ~2% with the patched kernel (same as with the
700M limit). This confirms that the soft limit is too harsh with the
base kernel.

Just for completeness, if we compare A+B to the 700M soft limited runs
then we get a ~3% slowdown for both the patched and unpatched kernels
with the smaller soft limit.
On the reclaim side (the 700M numbers are quoted for comparison):

> * Patched:
> pgscan_direct_dma32   252408
> pgscan_kswapd_dma32   988928
> pgsteal_direct_dma32   63565
> pgsteal_kswapd_dma32  905223
>
> * Base:
> pgscan_direct_dma32    97310 (38% of patched)
> pgscan_kswapd_dma32  1702971 (172%)
> pgsteal_direct_dma32   83377 (131%)
> pgsteal_kswapd_dma32 1534616 (169.5%)

* Patched:
pgscan_direct_dma32   153455 (60.8% of Patched, 700M limit)
pgscan_kswapd_dma32  1670779 (168.9% of Patched, 700M limit)
pgsteal_direct_dma32  109624 (172.5% of Patched, 700M limit)
pgsteal_kswapd_dma32 1512120 (167% of Patched, 700M limit)

* Base:
pgscan_direct_dma32   492381 (320% of patched)
pgscan_kswapd_dma32  1373732 (82.2% of patched)
pgsteal_direct_dma32  339563 (309.8% of patched)
pgsteal_kswapd_dma32 1108240 (73.3% of patched)

And this shows it nicely. The base kernel scans and reclaims 3 times
more in the direct reclaim context while it scans ~20% resp. reclaims
~30% less in the background. Compared with the 700M soft limit, the
patched kernel scans and reclaims ~70% more in the kswapd context, but
its direct reclaim is reduced, which is nice.

Same graphs as for the 700M limit:
http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/base-usage.png
http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/patched-usage.png

Charges over time. We can see that the patched kernel behaves much more
fairly to both groups than the base kernel.

http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/base-usage-histogram.png
http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/patched-usage-histogram.png

The same can be seen in the histograms.

http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/pgscan.png
http://labs.suse.cz/mhocko/soft_limit_rework/kbuild/300-softlimit/pgsteal.png

And the scanning/reclaiming data over time.
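The per-group charge data behind graphs like these can be collected by
simply sampling each group's usage file during a run; a minimal sketch
(cgroup v1 paths as in the setup, sampling interval arbitrary):

    # sample both groups' charges once a second until interrupted
    while sleep 1; do
        printf '%s %s %s\n' "$(date +%s)" \
            "$(cat /sys/fs/cgroup/memory/A/memory.usage_in_bytes)" \
            "$(cat /sys/fs/cgroup/memory/B/memory.usage_in_bytes)"
    done > usage.log
-- 
Michal Hocko
SUSE Labs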
* Re: [RFC 0/3] soft reclaim rework
  2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko
                   ` (4 preceding siblings ...)
  2013-04-11  8:43 ` Michal Hocko
@ 2013-04-17 22:52 ` Ying Han
  5 siblings, 0 replies; 27+ messages in thread
From: Ying Han @ 2013-04-17 22:52 UTC
To: Michal Hocko
Cc: linux-mm@kvack.org, Johannes Weiner, KAMEZAWA Hiroyuki,
    Rik van Riel, Hugh Dickins, Mel Gorman, Glauber Costa

On Tue, Apr 9, 2013 at 5:13 AM, Michal Hocko <mhocko@suse.cz> wrote:
> Hi all,
> It's been a long when I promised my take on the $subject but I got
> permanently preempted by other tasks. I finally got it, fortunately.

Hi Michal,

This has been on my list for a while and I never got a chance to get to
it. The per-memcg soft limit reclaim is one of the key features google
uses today, and thank you for putting in the effort to move this
forward. I haven't read the patch in detail, but since we chatted about
this for a few iterations it should look familiar.

> This is just a first attempt. There are still some todos but I wanted
> to post it soon to get a feedback.
[...]
> As a bonus we will get rid of a _lot_ of code by this and soft reclaim
> will not stand out like before.

Yes, that is the part that should have given us enough motivation to
merge this effort a long time ago. However, we had difficulties
agreeing on the 5% of the code (mainly the soft limit policy), which
prevented cleaning up the other 95%. I take the blame.

> The second step is somehow more controversial. I am redefining meaning
> of the default soft limit value. I've not chosen 0 as we discussed
> previously because I want to preserve hierarchical property of the
> soft limit (if a parent up the hierarchy is over its limit then
> children are over as well)

This is the 5% on which we keep disagreeing with each other. The
internal patch I am carrying has a different interpretation of
"hierarchical soft limit reclaim". However, I am more inclined to
accept that difference this time. At least that will get us moving
forward to clean up the code first. Then we can revisit the exact
policy of that 5% if it doesn't fit other use cases (besides google).
I am happy to backport this part into our kernel later and then only
carry that 5% of change internally.

To give more background on what I mean by a different interpretation
of "hierarchical", I have a write-up from some time back which is
attached to this thread. This is purely a note for later, and as I
mentioned I will go ahead and review the patch and forget about that
difference at this step.

> so I have kept the default untouched - unlimited - but I
> have slightly changed the meaning of this value. I interpret it as
> "user doesn't care about soft limit". More precisely the value is
> ignored unless it has been specified by user so such groups are
> eligible for soft reclaim even though they do not reach the limit.
> Such groups do not force their children to be reclaimed of course.
>
> I guess the only possible use case where this wouldn't work as
> expected is when somebody creates a group and set its soft limit to
> a small value (e.g. 0) just to protect all other groups from being
> reclaimed. With a new scheme all groups would be reclaimed while the
> previous implementation could end up reclaiming only the "special"
> group. This configuration can be achieved by the new scheme trivially
> so I think we should be safe. Or does this sound like a big problem?
>
> Finally the third step is soft limit reclaim integration into targeted
> reclaim. The patch is trivial one liner.

I will go through the patches in detail in the next day or so.

Thanks

--Ying

[...]
[-- Attachment #2: SoftlimitReclaimInMemcg.pdf, Type: application/pdf, Size: 416555 bytes --]
end of thread, other threads: [~2013-04-22 2:14 UTC | newest]

Thread overview: 27+ messages:
2013-04-09 12:13 [RFC 0/3] soft reclaim rework Michal Hocko
2013-04-09 12:13 ` [RFC 1/3] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko
2013-04-09 13:08   ` Johannes Weiner
2013-04-09 13:31     ` Michal Hocko
2013-04-09 13:57   ` Glauber Costa
2013-04-09 14:22     ` Michal Hocko
2013-04-09 16:45   ` Kamezawa Hiroyuki
2013-04-09 17:05     ` Michal Hocko
2013-04-14  0:42   ` Mel Gorman
2013-04-14 14:34     ` Michal Hocko
2013-04-14 14:55       ` Johannes Weiner
2013-04-14 15:04         ` Michal Hocko
2013-04-14 15:11           ` Michal Hocko
2013-04-14 18:03   ` Rik van Riel
2013-04-09 12:13 ` [RFC 2/3] memcg: Ignore soft limit until it is explicitly specified Michal Hocko
2013-04-09 13:24   ` Johannes Weiner
2013-04-09 13:42     ` Michal Hocko
2013-04-09 17:10   ` Kamezawa Hiroyuki
2013-04-09 17:22     ` Michal Hocko
2013-04-09 12:13 ` [RFC 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko
2013-04-22  2:14   ` Michal Hocko
2013-04-09 15:37 ` [RFC 0/3] soft reclaim rework Michal Hocko
2013-04-09 15:50   ` Michal Hocko
2013-04-11  8:43 ` Michal Hocko
2013-04-11  9:07   ` Michal Hocko
2013-04-11 13:04   ` Michal Hocko
2013-04-17 22:52 ` Ying Han