* [PATCHSET] memcg: improve high limit behavior and always enable kmemcg on dfl hier
@ 2015-08-28 15:25 Tejun Heo
[not found] ` <1440775530-18630-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
0 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
To: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg
Hello,
This patchset contains the following four patches.
0001-memcg-fix-over-high-reclaim-amount.patch
0002-memcg-flatten-task_struct-memcg_oom.patch
0003-memcg-punt-high-overage-reclaim-to-return-to-userlan.patch
0004-memcg-always-enable-kmemcg-on-the-default-hierarchy.patch
0001-0002 are simple fix and prep patches.  0003 makes memcg always
punt direct reclaim of high limit overages to the return-to-userland path.
0004 always enables kmemcg on the default hierarchy.
This patchset is on top of next/akpm 1f3bd508da15
("drivers/w1/w1_int.c: call put_device if device_register fails") and
also available in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-memcg-dfl
diffstat follows. Thanks.
include/linux/memcontrol.h | 16 +++++---
include/linux/sched.h | 14 +++----
include/linux/tracehook.h | 3 +
mm/memcontrol.c | 88 +++++++++++++++++++++++++++++++++++----------
4 files changed, 91 insertions(+), 30 deletions(-)
--
tejun
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 1/4] memcg: fix over-high reclaim amount [not found] ` <1440775530-18630-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2015-08-28 15:25 ` Tejun Heo 2015-08-28 17:06 ` Michal Hocko 2015-08-28 15:25 ` [PATCH 2/4] memcg: flatten task_struct->memcg_oom Tejun Heo ` (2 subsequent siblings) 3 siblings, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw) To: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg, Tejun Heo When memory usage is over the high limit, try_charge() performs direct reclaim; however, it uses the current charging amount @nr_pages as the reclamation target which is incorrect as we want to reclaim down to the high limit. In practice, this doesn't matter all that much because the minimum target pages that try_to_free_mem_cgroup_pages() uses is SWAP_CLUSTER_MAX which is rather large. Fix it by setting the target number of pages to the difference between the current usage and the high limit. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- mm/memcontrol.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aacc767..18ecf75 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2078,10 +2078,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, * make the charging task trim their excess contribution. */ do { - if (page_counter_read(&memcg->memory) <= memcg->high) + unsigned long usage = page_counter_read(&memcg->memory); + unsigned long high = ACCESS_ONCE(memcg->high); + + if (usage <= high) continue; mem_cgroup_events(memcg, MEMCG_HIGH, 1); - try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); + try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true); } while ((memcg = parent_mem_cgroup(memcg))); done: return ret; -- 2.4.3 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] memcg: fix over-high reclaim amount 2015-08-28 15:25 ` [PATCH 1/4] memcg: fix over-high reclaim amount Tejun Heo @ 2015-08-28 17:06 ` Michal Hocko [not found] ` <20150828170612.GA21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 40+ messages in thread From: Michal Hocko @ 2015-08-28 17:06 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team On Fri 28-08-15 11:25:27, Tejun Heo wrote: > When memory usage is over the high limit, try_charge() performs direct > reclaim; however, it uses the current charging amount @nr_pages as the > reclamation target which is incorrect as we want to reclaim down to > the high limit. In practice, this doesn't matter all that much > because the minimum target pages that try_to_free_mem_cgroup_pages() > uses is SWAP_CLUSTER_MAX which is rather large. > > Fix it by setting the target number of pages to the difference between > the current usage and the high limit. I do not think this a better behavior. If you have parallel charges to the same memcg then you can easilly over-reclaim because everybody will reclaim the maximum rather than its contribution. Sure we can fail to reclaim the target and slowly grow over high limit but that is to be expected. This is not the max limit which cannot be breached and external memory pressure/reclaim is there to mitigate that. > Signed-off-by: Tejun Heo <tj@kernel.org> > --- > mm/memcontrol.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index aacc767..18ecf75 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2078,10 +2078,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > * make the charging task trim their excess contribution. > */ > do { > - if (page_counter_read(&memcg->memory) <= memcg->high) > + unsigned long usage = page_counter_read(&memcg->memory); > + unsigned long high = ACCESS_ONCE(memcg->high); > + > + if (usage <= high) > continue; > mem_cgroup_events(memcg, MEMCG_HIGH, 1); > - try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); > + try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true); > } while ((memcg = parent_mem_cgroup(memcg))); > done: > return ret; > -- > 2.4.3 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
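To put illustrative numbers (not taken from the thread) on the over-reclaim concern: if a memcg sits 8 MB above its high limit and sixteen tasks enter try_charge() at about the same time, each of them computes the same ~8 MB target, so up to ~128 MB could be reclaimed for an 8 MB overage; per-contribution reclaim caps the combined target at the sum of the individual charges instead.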
* Re: [PATCH 1/4] memcg: fix over-high reclaim amount [not found] ` <20150828170612.GA21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2015-08-28 18:32 ` Tejun Heo [not found] ` <20150828183209.GA9423-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-08-28 18:32 UTC (permalink / raw) To: Michal Hocko Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg Hello, On Fri, Aug 28, 2015 at 07:06:13PM +0200, Michal Hocko wrote: > I do not think this a better behavior. If you have parallel charges to > the same memcg then you can easilly over-reclaim because everybody > will reclaim the maximum rather than its contribution. > > Sure we can fail to reclaim the target and slowly grow over high limit > but that is to be expected. This is not the max limit which cannot be > breached and external memory pressure/reclaim is there to mitigate that. Ah, I see, yeah, over-reclaim can happen. How about just wrapping the over-high reclaim with a per-memcg mutex? Do we gain anything by putting multiple tasks into the reclaim path? Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
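A minimal sketch of the serialization Tejun floats here, assuming a hypothetical mutex (high_reclaim_lock below) added to struct mem_cgroup; this is not part of the posted patches and it ignores the !__GFP_WAIT case, which would still have to skip or defer the locked section:

  do {
          unsigned long usage = page_counter_read(&memcg->memory);
          unsigned long high = ACCESS_ONCE(memcg->high);

          if (usage <= high)
                  continue;

          /* one reclaimer per memcg; others wait instead of piling in */
          mutex_lock(&memcg->high_reclaim_lock);
          usage = page_counter_read(&memcg->memory);
          if (usage > high) {
                  mem_cgroup_events(memcg, MEMCG_HIGH, 1);
                  try_to_free_mem_cgroup_pages(memcg, usage - high,
                                               GFP_KERNEL, true);
          }
          mutex_unlock(&memcg->high_reclaim_lock);
  } while ((memcg = parent_mem_cgroup(memcg)));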
* Re: [PATCH 1/4] memcg: fix over-high reclaim amount [not found] ` <20150828183209.GA9423-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2015-08-31 7:51 ` Michal Hocko [not found] ` <20150831075133.GA29723-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 40+ messages in thread From: Michal Hocko @ 2015-08-31 7:51 UTC (permalink / raw) To: Tejun Heo Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg On Fri 28-08-15 14:32:09, Tejun Heo wrote: > Hello, > > On Fri, Aug 28, 2015 at 07:06:13PM +0200, Michal Hocko wrote: > > I do not think this a better behavior. If you have parallel charges to > > the same memcg then you can easilly over-reclaim because everybody > > will reclaim the maximum rather than its contribution. > > > > Sure we can fail to reclaim the target and slowly grow over high limit > > but that is to be expected. This is not the max limit which cannot be > > breached and external memory pressure/reclaim is there to mitigate that. > > Ah, I see, yeah, over-reclaim can happen. How about just wrapping the > over-high reclaim with a per-memcg mutex? Do we gain anything by > putting multiple tasks into the reclaim path? The overall reclaim throughput will be higher with the parallel reclaim. Threads might still get synchronized on the zone lru lock but this is only for isolating them from the LRU. In a larger hierarchies this even might not be the case because the hierarchy iterator tries to spread the reclaim over different memcgs. So the per-memcg mutex would solve the potential over-reclaim but it will restrain the reclaim activity unnecessarily. Why is per-contribution reclaim such a big deal in the first place? If there are runaways allocation requests like GFP_NOWAIT then we should look after those. And I would argue that your delayed reclaim idea is a great fit for that. We just should track how many pages were charged over high limit in the process context and reclaim that amount on the way out from the kernel. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
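Charge-side accounting along the lines Michal describes could look roughly like the sketch below, assuming a hypothetical per-task counter (memcg_nr_pages_over_high) added to task_struct; the field name and placement are illustrative, not from the posted series:

  /* in try_charge(), after the charge has been committed */
  do {
          if (page_counter_read(&memcg->memory) > memcg->high) {
                  /* record only this task's own contribution */
                  current->memcg_nr_pages_over_high += nr_pages;
                  set_notify_resume(current);
                  break;
          }
  } while ((memcg = parent_mem_cgroup(memcg)));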
* Re: [PATCH 1/4] memcg: fix over-high reclaim amount [not found] ` <20150831075133.GA29723-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2015-08-31 13:38 ` Tejun Heo 2015-09-01 12:51 ` Michal Hocko 0 siblings, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-08-31 13:38 UTC (permalink / raw) To: Michal Hocko Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg Hello, Michal. On Mon, Aug 31, 2015 at 09:51:33AM +0200, Michal Hocko wrote: > The overall reclaim throughput will be higher with the parallel reclaim. Is reclaim throughput as determined by CPU cycle bandwidth a meaningful metric? I'm having a bit of trouble imagining that this actually would matter especially given that writeback is single threaded per bdi_writeback. Shoving in a large number of threads into the same path which walks the same data structures when there's no clear benefit doesn't usually end up buying anything. Cachelines get thrown back and forth, locks get contended, CPU cycles which could have been used for a lot of more useful things get wasted and IO pattern becomes random. We can still choose to do that but I think we should have explicit justifications (e.g. it really harms scalability otherwise). > Threads might still get synchronized on the zone lru lock but this is only > for isolating them from the LRU. In a larger hierarchies this even might > not be the case because the hierarchy iterator tries to spread the reclaim > over different memcgs. > > So the per-memcg mutex would solve the potential over-reclaim but it > will restrain the reclaim activity unnecessarily. Why is per-contribution > reclaim such a big deal in the first place? If there are runaways > allocation requests like GFP_NOWAIT then we should look after those. And > I would argue that your delayed reclaim idea is a great fit for that. We > just should track how many pages were charged over high limit in the > process context and reclaim that amount on the way out from the kernel. Per-contribution reclaim is not necessarily a "big deal" but is kinda mushy on the edges which get more pronounced with async reclaim. * It still can over reclaim to a considerable extent. The reclaim path uses mean reclaim size of 1M and when the high limit is used as the main mechanism for reclaim rather than global limit, many threads performing simultaneous 1M reclaims will happen. * We need to keep track of an additional state. What if a task performs multiple NOWAIT try_charge()'s? I guess we should add up those numbers. * But even if we do that, what does that actually mean? These numbers are arbitrary in nature. A thread may have just performed high-order allocations back to back at the point where it steps over the limit with an order-0 allocation or maybe a different thread which wasn't consuming much memory can hit it right after. @nr_pages is a convenient number that we can use on the spot which will make the consumption converge on the limit but I'm not sure this is a number that we should keep track of. Also, having a central point of control means that we can actually define policies there - e.g. if the overage is less than 10%, let tasks through as long as there's at least one reclaiming. Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] memcg: fix over-high reclaim amount 2015-08-31 13:38 ` Tejun Heo @ 2015-09-01 12:51 ` Michal Hocko [not found] ` <20150901125149.GD8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 40+ messages in thread From: Michal Hocko @ 2015-09-01 12:51 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team On Mon 31-08-15 09:38:40, Tejun Heo wrote: > Hello, Michal. > > On Mon, Aug 31, 2015 at 09:51:33AM +0200, Michal Hocko wrote: > > The overall reclaim throughput will be higher with the parallel reclaim. > > Is reclaim throughput as determined by CPU cycle bandwidth a > meaningful metric? Well, considering it has a direct effect on the latency I would consider it quite meaningful. > I'm having a bit of trouble imagining that this > actually would matter especially given that writeback is single > threaded per bdi_writeback. Sure, if the LRU contains a lot of dirty pages then the writeback will be a bottleneck. But LRUs are quite often full of the clean pagecache pages which can be reclaimed quickly and efficiently. > Shoving in a large number of threads into the same path which walks > the same data structures when there's no clear benefit doesn't usually > end up buying anything. I agree that a certain level of throttling is reasonable. We are doing some of that in the lower layers of the reclaim. It is true that some of that throttling is specific to the global reclaim and some heuristics might be applicable on the try_charge level as well but we should be careful here. Throttling has been quite tricky and there were some issues where it led to unexpected stalls in the past. > Cachelines get thrown back and forth, locks > get contended, CPU cycles which could have been used for a lot of more > useful things get wasted and IO pattern becomes random. We can still > choose to do that but I think we should have explicit justifications > (e.g. it really harms scalability otherwise). I do not remember heavy contention on the lru lock and we are not doing an IO from the direct reclaim (other than swap) so a random IO pattern shouldn't be an issue as well. > > Threads might still get synchronized on the zone lru lock but this is only > > for isolating them from the LRU. In a larger hierarchies this even might > > not be the case because the hierarchy iterator tries to spread the reclaim > > over different memcgs. > > > > So the per-memcg mutex would solve the potential over-reclaim but it > > will restrain the reclaim activity unnecessarily. Why is per-contribution > > reclaim such a big deal in the first place? If there are runaways > > allocation requests like GFP_NOWAIT then we should look after those. And > > I would argue that your delayed reclaim idea is a great fit for that. We > > just should track how many pages were charged over high limit in the > > process context and reclaim that amount on the way out from the kernel. > > Per-contribution reclaim is not necessarily a "big deal" but is kinda > mushy on the edges which get more pronounced with async reclaim. Yes, I am not saying it is a perfect solution. It has its issues as well. We have been doing this for ages though and there should be really good reasons with numbers demonstrating improvements to change it to something else. > * It still can over reclaim to a considerable extent. 
The reclaim > path uses mean reclaim size of 1M and when the high limit is used as > the main mechanism for reclaim rather than global limit, many > threads performing simultaneous 1M reclaims will happen. Yes this is something that has been changed recently and I am not sure the new SWAP_CLUSTER_MAX value fits well into the current memcg direct reclaim implementation. I didn't get to measure the effect yet, though, but maybe we will have to go back to 32 or something small for the memcg reclaim. This is just an implementation detail, though. > * We need to keep track of an additional state. What if a task > performs multiple NOWAIT try_charge()'s? I guess we should add up > those numbers. Yes, that was the idea. Just accumulate nr_pages attempts when the current > high and then attempt to reclaim them on the way out as you were suggesting. > * But even if we do that, what does that actually mean? These numbers > are arbitrary in nature. nr_pages at least reflects the request size so we, at least theoretically, throttle larger consumers more. > A thread may have just performed > high-order allocations back to back at the point where it steps over > the limit with an order-0 allocation or maybe a different thread > which wasn't consuming much memory can hit it right after. > @nr_pages is a convenient number that we can use on the spot which > will make the consumption converge on the limit but I'm not sure > this is a number that we should keep track of. I agree that this is an inherently racy environment. It really depends on who hits the limit and how good are reclaimers at doing their work when others piggy back on their work. We have the background reclaim to reduce that effect for the global case. Something similar have been discussed in the past for memcg as well but it hits its own issues as it has to scale with the potentially large number of memcgs. > Also, having a central point of control means that we can actually > define policies there - e.g. if the overage is less than 10%, let > tasks through as long as there's at least one reclaiming. I am not sure whether throttling at this level would be more beneficial than doing that down at the reclaim paths. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] memcg: fix over-high reclaim amount [not found] ` <20150901125149.GD8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2015-09-01 18:33 ` Tejun Heo 0 siblings, 0 replies; 40+ messages in thread From: Tejun Heo @ 2015-09-01 18:33 UTC (permalink / raw) To: Michal Hocko Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg Hello, On Tue, Sep 01, 2015 at 02:51:50PM +0200, Michal Hocko wrote: > > Is reclaim throughput as determined by CPU cycle bandwidth a > > meaningful metric? > > Well, considering it has a direct effect on the latency I would consider > it quite meaningful. > > > I'm having a bit of trouble imagining that this > > actually would matter especially given that writeback is single > > threaded per bdi_writeback. > > Sure, if the LRU contains a lot of dirty pages then the writeback will be > a bottleneck. But LRUs are quite often full of the clean pagecache pages > which can be reclaimed quickly and efficiently. I see. Hmmm... I can imagine the scheduling latencies from synchronization being a factor. Alright, if we decide to do this return-path reclaiming, I'll update the patch to accumulate nr_pages. Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
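If the nr_pages accumulation Tejun agrees to here is paired with the return-to-userland hook from patch 3, the reclaim side could shrink to something like the following; memcg_nr_pages_over_high is the same hypothetical per-task counter as in the earlier sketch, and the exact shape is an assumption rather than a posted patch:

  void mem_cgroup_handle_over_high(void)
  {
          unsigned int nr_pages = current->memcg_nr_pages_over_high;
          struct mem_cgroup *memcg;

          if (likely(!nr_pages))
                  return;

          /* reclaim the amount this task charged while over the high limit */
          memcg = get_mem_cgroup_from_mm(current->mm);
          try_to_free_mem_cgroup_pages(memcg, nr_pages, GFP_KERNEL, true);
          css_put(&memcg->css);
          current->memcg_nr_pages_over_high = 0;
  }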
* [PATCH 2/4] memcg: flatten task_struct->memcg_oom [not found] ` <1440775530-18630-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2015-08-28 15:25 ` [PATCH 1/4] memcg: fix over-high reclaim amount Tejun Heo @ 2015-08-28 15:25 ` Tejun Heo [not found] ` <1440775530-18630-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2015-08-28 15:25 ` [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path Tejun Heo 2015-08-28 15:25 ` [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy Tejun Heo 3 siblings, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw) To: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg, Tejun Heo task_struct->memcg_oom is a sub-struct containing fields which are used for async memcg oom handling. Most task_struct fields aren't packaged this way and it can lead to unnecessary alignment paddings. This patch flattens it. * task.memcg_oom.memcg -> task.memcg_in_oom * task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask * task.memcg_oom.order -> task.memcg_oom_order * task.memcg_oom.may_oom -> task.memcg_may_oom In addition, task.memcg_may_oom is relocated to where other bitfields are which reduces the size of task_struct. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- include/linux/memcontrol.h | 10 +++++----- include/linux/sched.h | 13 ++++++------- mm/memcontrol.c | 16 ++++++++-------- 3 files changed, 19 insertions(+), 20 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ad800e6..3d28656 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, static inline void mem_cgroup_oom_enable(void) { - WARN_ON(current->memcg_oom.may_oom); - current->memcg_oom.may_oom = 1; + WARN_ON(current->memcg_may_oom); + current->memcg_may_oom = 1; } static inline void mem_cgroup_oom_disable(void) { - WARN_ON(!current->memcg_oom.may_oom); - current->memcg_oom.may_oom = 0; + WARN_ON(!current->memcg_may_oom); + current->memcg_may_oom = 0; } static inline bool task_in_memcg_oom(struct task_struct *p) { - return p->memcg_oom.memcg; + return p->memcg_in_oom; } bool mem_cgroup_oom_synchronize(bool wait); diff --git a/include/linux/sched.h b/include/linux/sched.h index a4ab9da..ef73b54 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1451,7 +1451,9 @@ struct task_struct { unsigned sched_reset_on_fork:1; unsigned sched_contributes_to_load:1; unsigned sched_migrated:1; - +#ifdef CONFIG_MEMCG + unsigned memcg_may_oom:1; +#endif #ifdef CONFIG_MEMCG_KMEM unsigned memcg_kmem_skip_account:1; #endif @@ -1782,12 +1784,9 @@ struct task_struct { unsigned long trace_recursion; #endif /* CONFIG_TRACING */ #ifdef CONFIG_MEMCG - struct memcg_oom_info { - struct mem_cgroup *memcg; - gfp_t gfp_mask; - int order; - unsigned int may_oom:1; - } memcg_oom; + struct mem_cgroup *memcg_in_oom; + gfp_t memcg_oom_gfp_mask; + int memcg_oom_order; #endif #ifdef CONFIG_UPROBES struct uprobe_task *utask; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 18ecf75..74abb31 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1652,7 +1652,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { - if (!current->memcg_oom.may_oom) + if (!current->memcg_may_oom) 
return; /* * We are in the middle of the charge context here, so we @@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) * and when we know whether the fault was overall successful. */ css_get(&memcg->css); - current->memcg_oom.memcg = memcg; - current->memcg_oom.gfp_mask = mask; - current->memcg_oom.order = order; + current->memcg_in_oom = memcg; + current->memcg_oom_gfp_mask = mask; + current->memcg_oom_order = order; } /** @@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) */ bool mem_cgroup_oom_synchronize(bool handle) { - struct mem_cgroup *memcg = current->memcg_oom.memcg; + struct mem_cgroup *memcg = current->memcg_in_oom; struct oom_wait_info owait; bool locked; @@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool handle) if (locked && !memcg->oom_kill_disable) { mem_cgroup_unmark_under_oom(memcg); finish_wait(&memcg_oom_waitq, &owait.wait); - mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask, - current->memcg_oom.order); + mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask, + current->memcg_oom_order); } else { schedule(); mem_cgroup_unmark_under_oom(memcg); @@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool handle) memcg_oom_recover(memcg); } cleanup: - current->memcg_oom.memcg = NULL; + current->memcg_in_oom = NULL; css_put(&memcg->css); return true; } -- 2.4.3 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 2/4] memcg: flatten task_struct->memcg_oom [not found] ` <1440775530-18630-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2015-08-28 17:11 ` Michal Hocko 0 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2015-08-28 17:11 UTC (permalink / raw) To: Tejun Heo Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg On Fri 28-08-15 11:25:28, Tejun Heo wrote: > task_struct->memcg_oom is a sub-struct containing fields which are > used for async memcg oom handling. Most task_struct fields aren't > packaged this way and it can lead to unnecessary alignment paddings. > This patch flattens it. > > * task.memcg_oom.memcg -> task.memcg_in_oom > * task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask > * task.memcg_oom.order -> task.memcg_oom_order > * task.memcg_oom.may_oom -> task.memcg_may_oom > > In addition, task.memcg_may_oom is relocated to where other bitfields > are which reduces the size of task_struct. OK we will save 8B AFAICS which probably doesn't make much different for this huge structure. But we already have memcg_kmem_skip_account bit field there so another one makes sense. That alone would be sufficient to save those bytes. Regarding the struct, I do not have a strong opinion. I do not mind removing it. > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> > --- > include/linux/memcontrol.h | 10 +++++----- > include/linux/sched.h | 13 ++++++------- > mm/memcontrol.c | 16 ++++++++-------- > 3 files changed, 19 insertions(+), 20 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index ad800e6..3d28656 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > > static inline void mem_cgroup_oom_enable(void) > { > - WARN_ON(current->memcg_oom.may_oom); > - current->memcg_oom.may_oom = 1; > + WARN_ON(current->memcg_may_oom); > + current->memcg_may_oom = 1; > } > > static inline void mem_cgroup_oom_disable(void) > { > - WARN_ON(!current->memcg_oom.may_oom); > - current->memcg_oom.may_oom = 0; > + WARN_ON(!current->memcg_may_oom); > + current->memcg_may_oom = 0; > } > > static inline bool task_in_memcg_oom(struct task_struct *p) > { > - return p->memcg_oom.memcg; > + return p->memcg_in_oom; > } > > bool mem_cgroup_oom_synchronize(bool wait); > diff --git a/include/linux/sched.h b/include/linux/sched.h > index a4ab9da..ef73b54 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1451,7 +1451,9 @@ struct task_struct { > unsigned sched_reset_on_fork:1; > unsigned sched_contributes_to_load:1; > unsigned sched_migrated:1; > - > +#ifdef CONFIG_MEMCG > + unsigned memcg_may_oom:1; > +#endif > #ifdef CONFIG_MEMCG_KMEM > unsigned memcg_kmem_skip_account:1; > #endif > @@ -1782,12 +1784,9 @@ struct task_struct { > unsigned long trace_recursion; > #endif /* CONFIG_TRACING */ > #ifdef CONFIG_MEMCG > - struct memcg_oom_info { > - struct mem_cgroup *memcg; > - gfp_t gfp_mask; > - int order; > - unsigned int may_oom:1; > - } memcg_oom; > + struct mem_cgroup *memcg_in_oom; > + gfp_t memcg_oom_gfp_mask; > + int memcg_oom_order; > #endif > #ifdef CONFIG_UPROBES > struct uprobe_task *utask; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 18ecf75..74abb31 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1652,7 +1652,7 
@@ static void memcg_oom_recover(struct mem_cgroup *memcg) > > static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > { > - if (!current->memcg_oom.may_oom) > + if (!current->memcg_may_oom) > return; > /* > * We are in the middle of the charge context here, so we > @@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > * and when we know whether the fault was overall successful. > */ > css_get(&memcg->css); > - current->memcg_oom.memcg = memcg; > - current->memcg_oom.gfp_mask = mask; > - current->memcg_oom.order = order; > + current->memcg_in_oom = memcg; > + current->memcg_oom_gfp_mask = mask; > + current->memcg_oom_order = order; > } > > /** > @@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > */ > bool mem_cgroup_oom_synchronize(bool handle) > { > - struct mem_cgroup *memcg = current->memcg_oom.memcg; > + struct mem_cgroup *memcg = current->memcg_in_oom; > struct oom_wait_info owait; > bool locked; > > @@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool handle) > if (locked && !memcg->oom_kill_disable) { > mem_cgroup_unmark_under_oom(memcg); > finish_wait(&memcg_oom_waitq, &owait.wait); > - mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask, > - current->memcg_oom.order); > + mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask, > + current->memcg_oom_order); > } else { > schedule(); > mem_cgroup_unmark_under_oom(memcg); > @@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool handle) > memcg_oom_recover(memcg); > } > cleanup: > - current->memcg_oom.memcg = NULL; > + current->memcg_in_oom = NULL; > css_put(&memcg->css); > return true; > } > -- > 2.4.3 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
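The 8-byte saving Michal refers to can be reproduced with a small user-space approximation (void * standing in for struct mem_cgroup * and unsigned int for gfp_t; illustrative only, not kernel code):

  #include <stdio.h>

  /* before: the 1-bit flag needs its own storage unit inside the
   * sub-struct, which is then padded up to pointer alignment */
  struct memcg_oom_info {
          void *memcg;
          unsigned int gfp_mask;
          int order;
          unsigned int may_oom:1;
  };

  /* after: may_oom joins task_struct's existing bitfield word, so only
   * these three fields remain embedded */
  struct memcg_oom_flat {
          void *memcg_in_oom;
          unsigned int memcg_oom_gfp_mask;
          int memcg_oom_order;
  };

  int main(void)
  {
          printf("%zu -> %zu bytes\n", sizeof(struct memcg_oom_info),
                 sizeof(struct memcg_oom_flat));  /* 24 -> 16 on x86-64 */
          return 0;
  }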
* [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path [not found] ` <1440775530-18630-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2015-08-28 15:25 ` [PATCH 1/4] memcg: fix over-high reclaim amount Tejun Heo 2015-08-28 15:25 ` [PATCH 2/4] memcg: flatten task_struct->memcg_oom Tejun Heo @ 2015-08-28 15:25 ` Tejun Heo [not found] ` <1440775530-18630-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2015-08-28 15:25 ` [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy Tejun Heo 3 siblings, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw) To: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg, Tejun Heo Currently, try_charge() tries to reclaim memory directly when the high limit is breached; however, this has a couple issues. * try_charge() can be invoked from any in-kernel allocation site and reclaim path may use considerable amount of stack. This can lead to stack overflows which are extremely difficult to reproduce. * If the allocation doesn't have __GFP_WAIT, direct reclaim is skipped. If a process performs only speculative allocations, it can blow way past the high limit. This is actually easily reproducible by simply doing "find /". VFS tries speculative !__GFP_WAIT allocations first, so as long as there's memory which can be consumed without blocking, it can keep allocating memory regardless of the high limit. This patch makes try_charge() always punt the direct reclaim to the return-to-userland path. If try_charge() detects that high limit is breached, it sets current->memcg_over_high to the offending memcg and schedules execution of mem_cgroup_handle_over_high() which performs the direct reclaim from the return-to-userland path. As long as kernel doesn't have a run-away allocation spree, this should provide enough protection while making kmemcg behave more consistently. 
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- include/linux/memcontrol.h | 6 +++++ include/linux/sched.h | 1 + include/linux/tracehook.h | 3 +++ mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++++--------- 4 files changed, 64 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3d28656..8d345a7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -402,6 +402,8 @@ static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) return inactive * inactive_ratio < active; } +void mem_cgroup_handle_over_high(void); + void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); @@ -621,6 +623,10 @@ static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg) { } +static inline void mem_cgroup_handle_over_high(void) +{ +} + static inline void mem_cgroup_oom_enable(void) { } diff --git a/include/linux/sched.h b/include/linux/sched.h index ef73b54..c76b71d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1785,6 +1785,7 @@ struct task_struct { #endif /* CONFIG_TRACING */ #ifdef CONFIG_MEMCG struct mem_cgroup *memcg_in_oom; + struct mem_cgroup *memcg_over_high; /* reclaim on returning to user */ gfp_t memcg_oom_gfp_mask; int memcg_oom_order; #endif diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h index 84d4972..26c1521 100644 --- a/include/linux/tracehook.h +++ b/include/linux/tracehook.h @@ -50,6 +50,7 @@ #include <linux/ptrace.h> #include <linux/security.h> #include <linux/task_work.h> +#include <linux/memcontrol.h> struct linux_binprm; /* @@ -188,6 +189,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs) smp_mb__after_atomic(); if (unlikely(current->task_works)) task_work_run(); + + mem_cgroup_handle_over_high(); } #endif /* <linux/tracehook.h> */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 74abb31..c94b686 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -62,6 +62,7 @@ #include <linux/oom.h> #include <linux/lockdep.h> #include <linux/file.h> +#include <linux/tracehook.h> #include "internal.h" #include <net/sock.h> #include <net/ip.h> @@ -1963,6 +1964,33 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb, return NOTIFY_OK; } +/* + * Scheduled by try_charge() to be executed from the userland return path + * and reclaims memory over the high limit. + */ +void mem_cgroup_handle_over_high(void) +{ + struct mem_cgroup *memcg = current->memcg_over_high; + + if (likely(!memcg)) + return; + + do { + unsigned long usage = page_counter_read(&memcg->memory); + unsigned long high = ACCESS_ONCE(memcg->high); + + if (usage <= high) + continue; + + mem_cgroup_events(memcg, MEMCG_HIGH, 1); + try_to_free_mem_cgroup_pages(memcg, usage - high, + GFP_KERNEL, true); + } while ((memcg = parent_mem_cgroup(memcg))); + + css_put(¤t->memcg_over_high->css); + current->memcg_over_high = NULL; +} + static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages) { @@ -2071,21 +2099,27 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, css_get_many(&memcg->css, batch); if (batch > nr_pages) refill_stock(memcg, batch - nr_pages); - if (!(gfp_mask & __GFP_WAIT)) - goto done; + /* - * If the hierarchy is above the normal consumption range, - * make the charging task trim their excess contribution. + * If the hierarchy is above the normal consumption range, schedule + * direct reclaim on returning to userland. 
We can perform direct + * reclaim here if __GFP_WAIT; however, punting has the benefit of + * avoiding surprise high stack usages and it's fine to breach the + * high limit temporarily while control stays in kernel. */ - do { - unsigned long usage = page_counter_read(&memcg->memory); - unsigned long high = ACCESS_ONCE(memcg->high); + if (!current->memcg_over_high) { + struct mem_cgroup *pos = memcg; - if (usage <= high) - continue; - mem_cgroup_events(memcg, MEMCG_HIGH, 1); - try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true); - } while ((memcg = parent_mem_cgroup(memcg))); + do { + if (page_counter_read(&pos->memory) > pos->high) { + /* make user return path rescan from leaf */ + css_get(&memcg->css); + current->memcg_over_high = memcg; + set_notify_resume(current); + break; + } + } while ((pos = parent_mem_cgroup(pos))); + } done: return ret; } @@ -5053,6 +5087,13 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css, } #endif +static void mem_cgroup_exit(struct cgroup_subsys_state *css, + struct cgroup_subsys_state *old_css, + struct task_struct *task) +{ + mem_cgroup_handle_over_high(); +} + /* * Cgroup retains root cgroups across [un]mount cycles making it necessary * to verify whether we're attached to the default hierarchy on each mount @@ -5223,6 +5264,7 @@ struct cgroup_subsys memory_cgrp_subsys = { .can_attach = mem_cgroup_can_attach, .cancel_attach = mem_cgroup_cancel_attach, .attach = mem_cgroup_move_task, + .exit = mem_cgroup_exit, .bind = mem_cgroup_bind, .dfl_cftypes = memory_files, .legacy_cftypes = mem_cgroup_legacy_files, -- 2.4.3 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path [not found] ` <1440775530-18630-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2015-08-28 16:36 ` Vladimir Davydov 2015-08-28 16:48 ` Tejun Heo 2015-08-28 17:13 ` Michal Hocko 1 sibling, 1 reply; 40+ messages in thread From: Vladimir Davydov @ 2015-08-28 16:36 UTC (permalink / raw) To: Tejun Heo Cc: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg Hi Tejun, On Fri, Aug 28, 2015 at 11:25:29AM -0400, Tejun Heo wrote: > Currently, try_charge() tries to reclaim memory directly when the high > limit is breached; however, this has a couple issues. > > * try_charge() can be invoked from any in-kernel allocation site and > reclaim path may use considerable amount of stack. This can lead to > stack overflows which are extremely difficult to reproduce. IMO this paragraph does not justify this patch at all, because one will still invoke direct reclaim from try_charge() on hitting the hard limit. > > * If the allocation doesn't have __GFP_WAIT, direct reclaim is > skipped. If a process performs only speculative allocations, it can > blow way past the high limit. This is actually easily reproducible > by simply doing "find /". VFS tries speculative !__GFP_WAIT > allocations first, so as long as there's memory which can be > consumed without blocking, it can keep allocating memory regardless > of the high limit. I think there shouldn't normally occur a lot of !__GFP_WAIT allocations in a row - they should still alternate with normal __GFP_WAIT allocations. Yes, that means we can breach memory.high threshold for a short period of time, but it isn't a hard limit, so it looks perfectly fine to me. I tried to run `find /` over ext4 in a cgroup with memory.high set to 32M and kmem accounting enabled. With such a setup memory.current never got higher than 33152K, which is only 384K greater than the memory.high. Which FS did you use? Thanks, Vladimir ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path 2015-08-28 16:36 ` Vladimir Davydov @ 2015-08-28 16:48 ` Tejun Heo 2015-08-28 20:32 ` Vladimir Davydov 0 siblings, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-08-28 16:48 UTC (permalink / raw) To: Vladimir Davydov Cc: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg Hello, Vladimir. On Fri, Aug 28, 2015 at 07:36:11PM +0300, Vladimir Davydov wrote: > > * try_charge() can be invoked from any in-kernel allocation site and > > reclaim path may use considerable amount of stack. This can lead to > > stack overflows which are extremely difficult to reproduce. > > IMO this paragraph does not justify this patch at all, because one will > still invoke direct reclaim from try_charge() on hitting the hard limit. Ah... right, and we can't defer direct reclaim for hard limit. > > * If the allocation doesn't have __GFP_WAIT, direct reclaim is > > skipped. If a process performs only speculative allocations, it can > > blow way past the high limit. This is actually easily reproducible > > by simply doing "find /". VFS tries speculative !__GFP_WAIT > > allocations first, so as long as there's memory which can be > > consumed without blocking, it can keep allocating memory regardless > > of the high limit. > > I think there shouldn't normally occur a lot of !__GFP_WAIT allocations > in a row - they should still alternate with normal __GFP_WAIT > allocations. Yes, that means we can breach memory.high threshold for a > short period of time, but it isn't a hard limit, so it looks perfectly > fine to me. > > I tried to run `find /` over ext4 in a cgroup with memory.high set to > 32M and kmem accounting enabled. With such a setup memory.current never > got higher than 33152K, which is only 384K greater than the memory.high. > Which FS did you use? ext4. Here, it goes onto happily consuming hundreds of megabytes with limit set at 32M. We have quite a few places where !__GFP_WAIT allocations are performed speculatively in hot paths with fallback slow paths, so this is bound to happen somewhere. Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
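The speculative pattern Tejun mentions, in generic form (illustrative; not a quote of any particular VFS call site): a hot path first tries a non-blocking allocation and only falls back to a sleeping one on failure, so with the pre-patch behaviour only the fallback leg can ever trigger high-limit reclaim:

  obj = kmem_cache_alloc(cachep, GFP_NOWAIT | __GFP_NOWARN);
  if (!obj)
          obj = kmem_cache_alloc(cachep, GFP_KERNEL);  /* slow path, may reclaim */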
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path 2015-08-28 16:48 ` Tejun Heo @ 2015-08-28 20:32 ` Vladimir Davydov 2015-08-28 20:44 ` Tejun Heo 2015-08-30 15:52 ` Vladimir Davydov 0 siblings, 2 replies; 40+ messages in thread From: Vladimir Davydov @ 2015-08-28 20:32 UTC (permalink / raw) To: Tejun Heo, Andrew Morton Cc: hannes, mhocko, cgroups, linux-mm, kernel-team, Joonsoo Kim, Christoph Lameter, David Rientjes On Fri, Aug 28, 2015 at 12:48:19PM -0400, Tejun Heo wrote: ... > > > * If the allocation doesn't have __GFP_WAIT, direct reclaim is > > > skipped. If a process performs only speculative allocations, it can > > > blow way past the high limit. This is actually easily reproducible > > > by simply doing "find /". VFS tries speculative !__GFP_WAIT > > > allocations first, so as long as there's memory which can be > > > consumed without blocking, it can keep allocating memory regardless > > > of the high limit. > > > > I think there shouldn't normally occur a lot of !__GFP_WAIT allocations > > in a row - they should still alternate with normal __GFP_WAIT > > allocations. Yes, that means we can breach memory.high threshold for a > > short period of time, but it isn't a hard limit, so it looks perfectly > > fine to me. > > > > I tried to run `find /` over ext4 in a cgroup with memory.high set to > > 32M and kmem accounting enabled. With such a setup memory.current never > > got higher than 33152K, which is only 384K greater than the memory.high. > > Which FS did you use? > > ext4. Here, it goes onto happily consuming hundreds of megabytes with > limit set at 32M. We have quite a few places where !__GFP_WAIT > allocations are performed speculatively in hot paths with fallback > slow paths, so this is bound to happen somewhere. What kind of workload should it be then? `find` will constantly invoke d_alloc, which issues a GFP_KERNEL allocation and therefore is allowed to perform reclaim... OK, I tried to reproduce the issue on the latest mainline kernel and ... succeeded - memory.current did occasionally jump up to ~55M although memory.high was set to 32M. Hmm, strange... Started to investigate. Printed stack traces and found that we don't invoke memcg reclaim on normal GFP_KERNEL allocations! How is that? The thing is there was a commit that made SLUB (not VFS or any other kmem user, but core SLUB) try to allocate high order slab pages w/o __GFP_WAIT for performance reasons. That broke kmemcg case. Here it goes: commit 6af3142bed1f520b90f4cdb6cd10bbd16906ce9a Author: Joonsoo Kim <js1304@gmail.com> Date: Tue Aug 25 00:03:52 2015 +0000 mm/slub: don't wait for high-order page allocation I suspect your kernel has this commit included, because w/o it I haven't managed to catch anything nearly as bad as you describe: the memory.high excess reached 1-2 Mb at max, but never "hundreds of megabytes". If so, we'd better fix that instead. Actually, it's worth fixing anyway. What about the patch below? --- From: Vladimir Davydov <vdavydov@parallels.com> Date: Fri, 28 Aug 2015 23:17:19 +0300 Subject: [PATCH] mm/slub: don't bypass memcg reclaim for high-order page allocation Commit 6af3142bed1f52 ("mm/slub: don't wait for high-order page allocation") made allocate_slab() try to allocate high order slab pages w/o __GFP_WAIT in order to avoid invoking reclaim/compaction when we can fall back on low order pages. However, it broke kmemcg/memory.high logic. 
The latter works as a soft limit: an allocation won't fail if it is breached, but we call direct reclaim to compensate the excess. W/o __GFP_WAIT we can't invoke reclaimer and therefore we will just go on, exceeding memory.high more and more until a normal __GFP_WAIT allocation is issued. Since memcg reclaim never triggers compaction, we can pass __GFP_WAIT to memcg_charge_slab() even on high order page allocations w/o any performance impact. So let's fix this problem by excluding __GFP_WAIT only from alloc_pages() while still forwarding it to memcg_charge_slab() if the context allows. Fixes: 6af3142bed1f52 ("mm/slub: don't wait for high-order page allocation") Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> diff --git a/mm/slub.c b/mm/slub.c index e180f8dcd06d..1b9dbad40272 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1333,6 +1333,9 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s, if (memcg_charge_slab(s, flags, order)) return NULL; + if ((flags & __GFP_WAIT) && oo_order(oo) > oo_order(s->min)) + flags = (flags | __GFP_NOMEMALLOC) & ~__GFP_WAIT; + if (node == NUMA_NO_NODE) page = alloc_pages(flags, order); else @@ -1364,8 +1367,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) * so we fall-back to the minimum order allocation. */ alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL; - if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min)) - alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT; page = alloc_slab_page(s, alloc_gfp, node, oo); if (unlikely(!page)) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path 2015-08-28 20:32 ` Vladimir Davydov @ 2015-08-28 20:44 ` Tejun Heo 2015-08-28 22:06 ` Tejun Heo 2015-08-30 15:52 ` Vladimir Davydov 1 sibling, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-08-28 20:44 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg, Joonsoo Kim, Christoph Lameter, David Rientjes Hello, On Fri, Aug 28, 2015 at 11:32:31PM +0300, Vladimir Davydov wrote: > What kind of workload should it be then? `find` will constantly invoke > d_alloc, which issues a GFP_KERNEL allocation and therefore is allowed > to perform reclaim... > > OK, I tried to reproduce the issue on the latest mainline kernel and ... > succeeded - memory.current did occasionally jump up to ~55M although > memory.high was set to 32M. Hmm, strange... Started to investigate. > Printed stack traces and found that we don't invoke memcg reclaim on > normal GFP_KERNEL allocations! How is that? The thing is there was a > commit that made SLUB (not VFS or any other kmem user, but core SLUB) > try to allocate high order slab pages w/o __GFP_WAIT for performance > reasons. That broke kmemcg case. Here it goes: Ah, cool, so it was a bug from slub. Punting to return path still has some niceties but if we can't consistently get rid of stack consumption it's not that attractive. Let's revisit it later together with hard limit reclaim. Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path 2015-08-28 20:44 ` Tejun Heo @ 2015-08-28 22:06 ` Tejun Heo [not found] ` <20150828220632.GF11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-08-28 22:06 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, hannes, mhocko, cgroups, linux-mm, kernel-team, Joonsoo Kim, Christoph Lameter, David Rientjes On Fri, Aug 28, 2015 at 04:44:32PM -0400, Tejun Heo wrote: > Ah, cool, so it was a bug from slub. Punting to return path still has > some niceties but if we can't consistently get rid of stack > consumption it's not that attractive. Let's revisit it later together > with hard limit reclaim. So, I can't check right now but I'm pretty sure I was using SLAB on my test config, so this issue may exist there too. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path [not found] ` <20150828220632.GF11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org> @ 2015-08-29 7:59 ` Vladimir Davydov 0 siblings, 0 replies; 40+ messages in thread From: Vladimir Davydov @ 2015-08-29 7:59 UTC (permalink / raw) To: Tejun Heo Cc: Andrew Morton, hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg, Joonsoo Kim, Christoph Lameter, David Rientjes On Fri, Aug 28, 2015 at 06:06:32PM -0400, Tejun Heo wrote: > On Fri, Aug 28, 2015 at 04:44:32PM -0400, Tejun Heo wrote: > > Ah, cool, so it was a bug from slub. Punting to return path still has > > some niceties but if we can't consistently get rid of stack > > consumption it's not that attractive. Let's revisit it later together > > with hard limit reclaim. > > So, I can't check right now but I'm pretty sure I was using SLAB on my > test config, so this issue may exist there too. Yeah, SLAB is broken too. It was accidentally broken by commit 4167e9b2cf10 ("mm: remove GFP_THISNODE"), which among other things made SLAB filter out __GFP_WAIT from gfp flags when probing a NUMA node. I'll take a look what we can do with that. Thanks, Vladimir ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path 2015-08-28 20:32 ` Vladimir Davydov 2015-08-28 20:44 ` Tejun Heo @ 2015-08-30 15:52 ` Vladimir Davydov 1 sibling, 0 replies; 40+ messages in thread From: Vladimir Davydov @ 2015-08-30 15:52 UTC (permalink / raw) To: Tejun Heo, Andrew Morton Cc: hannes, mhocko, cgroups, linux-mm, kernel-team, Joonsoo Kim, Christoph Lameter, David Rientjes On Fri, Aug 28, 2015 at 11:32:31PM +0300, Vladimir Davydov wrote: ... > From: Vladimir Davydov <vdavydov@parallels.com> > Date: Fri, 28 Aug 2015 23:17:19 +0300 > Subject: [PATCH] mm/slub: don't bypass memcg reclaim for high-order page > allocation Please ignore this patch. I'll rework and resend. Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path [not found] ` <1440775530-18630-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2015-08-28 16:36 ` Vladimir Davydov @ 2015-08-28 17:13 ` Michal Hocko [not found] ` <20150828171322.GC21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 1 sibling, 1 reply; 40+ messages in thread From: Michal Hocko @ 2015-08-28 17:13 UTC (permalink / raw) To: Tejun Heo Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg On Fri 28-08-15 11:25:29, Tejun Heo wrote: > Currently, try_charge() tries to reclaim memory directly when the high > limit is breached; however, this has a couple issues. > > * try_charge() can be invoked from any in-kernel allocation site and > reclaim path may use considerable amount of stack. This can lead to > stack overflows which are extremely difficult to reproduce. This is true but I haven't seen any reports for the stack overflow for quite some time. > * If the allocation doesn't have __GFP_WAIT, direct reclaim is > skipped. If a process performs only speculative allocations, it can > blow way past the high limit. This is actually easily reproducible > by simply doing "find /". VFS tries speculative !__GFP_WAIT > allocations first, so as long as there's memory which can be > consumed without blocking, it can keep allocating memory regardless > of the high limit. It is a bit confusing that you are talking about direct reclaim but in fact mean high limit reclaim. But yeah, you are right there is no protection against GFP_NOWAIT allocations there. > This patch makes try_charge() always punt the direct reclaim to the > return-to-userland path. If try_charge() detects that high limit is > breached, it sets current->memcg_over_high to the offending memcg and > schedules execution of mem_cgroup_handle_over_high() which performs > the direct reclaim from the return-to-userland path. OK, this is certainly an attractive idea because of allocation requests with reduced reclaim capabilities. GFP_NOWAIT is not the only one. GFP_NOFS would be another. With kmem accounting they are much bigger problem than with regular page faults/page cache. And having full GFP_KERNEL reclaim context is definitely nice. I would just argue that this implementation has the same issue as the other patch in the series which performs high-usage reclaim. I think that each task should reclaim only its contribution which is trivial to account. > As long as kernel doesn't have a run-away allocation spree, this > should provide enough protection while making kmemcg behave more > consistently. 
> > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > --- > include/linux/memcontrol.h | 6 +++++ > include/linux/sched.h | 1 + > include/linux/tracehook.h | 3 +++ > mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++++--------- > 4 files changed, 64 insertions(+), 12 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 3d28656..8d345a7 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -402,6 +402,8 @@ static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) > return inactive * inactive_ratio < active; > } > > +void mem_cgroup_handle_over_high(void); > + > void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > > @@ -621,6 +623,10 @@ static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg) > { > } > > +static inline void mem_cgroup_handle_over_high(void) > +{ > +} > + > static inline void mem_cgroup_oom_enable(void) > { > } > diff --git a/include/linux/sched.h b/include/linux/sched.h > index ef73b54..c76b71d 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1785,6 +1785,7 @@ struct task_struct { > #endif /* CONFIG_TRACING */ > #ifdef CONFIG_MEMCG > struct mem_cgroup *memcg_in_oom; > + struct mem_cgroup *memcg_over_high; /* reclaim on returning to user */ > gfp_t memcg_oom_gfp_mask; > int memcg_oom_order; > #endif > diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h > index 84d4972..26c1521 100644 > --- a/include/linux/tracehook.h > +++ b/include/linux/tracehook.h > @@ -50,6 +50,7 @@ > #include <linux/ptrace.h> > #include <linux/security.h> > #include <linux/task_work.h> > +#include <linux/memcontrol.h> > struct linux_binprm; > > /* > @@ -188,6 +189,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs) > smp_mb__after_atomic(); > if (unlikely(current->task_works)) > task_work_run(); > + > + mem_cgroup_handle_over_high(); > } > > #endif /* <linux/tracehook.h> */ > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 74abb31..c94b686 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -62,6 +62,7 @@ > #include <linux/oom.h> > #include <linux/lockdep.h> > #include <linux/file.h> > +#include <linux/tracehook.h> > #include "internal.h" > #include <net/sock.h> > #include <net/ip.h> > @@ -1963,6 +1964,33 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb, > return NOTIFY_OK; > } > > +/* > + * Scheduled by try_charge() to be executed from the userland return path > + * and reclaims memory over the high limit. 
> + */ > +void mem_cgroup_handle_over_high(void) > +{ > + struct mem_cgroup *memcg = current->memcg_over_high; > + > + if (likely(!memcg)) > + return; > + > + do { > + unsigned long usage = page_counter_read(&memcg->memory); > + unsigned long high = ACCESS_ONCE(memcg->high); > + > + if (usage <= high) > + continue; > + > + mem_cgroup_events(memcg, MEMCG_HIGH, 1); > + try_to_free_mem_cgroup_pages(memcg, usage - high, > + GFP_KERNEL, true); > + } while ((memcg = parent_mem_cgroup(memcg))); > + > + css_put(&current->memcg_over_high->css); > + current->memcg_over_high = NULL; > +} > + > static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > unsigned int nr_pages) > { > @@ -2071,21 +2099,27 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > css_get_many(&memcg->css, batch); > if (batch > nr_pages) > refill_stock(memcg, batch - nr_pages); > - if (!(gfp_mask & __GFP_WAIT)) > - goto done; > + > /* > - * If the hierarchy is above the normal consumption range, > - * make the charging task trim their excess contribution. > + * If the hierarchy is above the normal consumption range, schedule > + * direct reclaim on returning to userland. We can perform direct > + * reclaim here if __GFP_WAIT; however, punting has the benefit of > + * avoiding surprise high stack usages and it's fine to breach the > + * high limit temporarily while control stays in kernel. > */ > - do { > - unsigned long usage = page_counter_read(&memcg->memory); > - unsigned long high = ACCESS_ONCE(memcg->high); > + if (!current->memcg_over_high) { > + struct mem_cgroup *pos = memcg; > > - if (usage <= high) > - continue; > - mem_cgroup_events(memcg, MEMCG_HIGH, 1); > - try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true); > - } while ((memcg = parent_mem_cgroup(memcg))); > + do { > + if (page_counter_read(&pos->memory) > pos->high) { > + /* make user return path rescan from leaf */ > + css_get(&memcg->css); > + current->memcg_over_high = memcg; > + set_notify_resume(current); > + break; > + } > + } while ((pos = parent_mem_cgroup(pos))); > + } > done: > return ret; > } > @@ -5053,6 +5087,13 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css, > } > #endif > > +static void mem_cgroup_exit(struct cgroup_subsys_state *css, > + struct cgroup_subsys_state *old_css, > + struct task_struct *task) > +{ > + mem_cgroup_handle_over_high(); > +} > + > /* > * Cgroup retains root cgroups across [un]mount cycles making it necessary > * to verify whether we're attached to the default hierarchy on each mount > @@ -5223,6 +5264,7 @@ struct cgroup_subsys memory_cgrp_subsys = { > .can_attach = mem_cgroup_can_attach, > .cancel_attach = mem_cgroup_cancel_attach, > .attach = mem_cgroup_move_task, > + .exit = mem_cgroup_exit, > .bind = mem_cgroup_bind, > .dfl_cftypes = memory_files, > .legacy_cftypes = mem_cgroup_legacy_files, > -- > 2.4.3 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
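A rough sketch of Michal's suggestion above, that each task reclaim only its own contribution, might look like the following. It is written against helpers already visible in this thread plus get_mem_cgroup_from_mm() from this era's mm/memcontrol.c; the memcg_nr_pages_over_high counter and the exact hook points are hypothetical and are not part of the posted series, and hierarchical bookkeeping is left out for brevity:

/* sketch only: hypothetical per-task accounting of over-high charges */

/* in struct task_struct, next to memcg_over_high: */
	unsigned int memcg_nr_pages_over_high;	/* pages charged past high */

/* in try_charge(), once the charge has succeeded: */
	if (page_counter_read(&memcg->memory) > memcg->high) {
		current->memcg_nr_pages_over_high += nr_pages;
		set_notify_resume(current);
	}

/* on the return-to-userland path, reclaim exactly that contribution: */
void mem_cgroup_handle_over_high(void)
{
	unsigned int nr_pages = current->memcg_nr_pages_over_high;
	struct mem_cgroup *memcg;

	if (likely(!nr_pages))
		return;

	memcg = get_mem_cgroup_from_mm(current->mm);
	try_to_free_mem_cgroup_pages(memcg, nr_pages, GFP_KERNEL, true);
	css_put(&memcg->css);
	current->memcg_nr_pages_over_high = 0;
}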
[parent not found: <20150828171322.GC21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path [not found] ` <20150828171322.GC21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2015-08-28 17:56 ` Tejun Heo 2015-08-28 20:45 ` Vladimir Davydov 1 sibling, 0 replies; 40+ messages in thread From: Tejun Heo @ 2015-08-28 17:56 UTC (permalink / raw) To: Michal Hocko Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg Hello, On Fri, Aug 28, 2015 at 07:13:22PM +0200, Michal Hocko wrote: > On Fri 28-08-15 11:25:29, Tejun Heo wrote: > > Currently, try_charge() tries to reclaim memory directly when the high > > limit is breached; however, this has a couple issues. > > > > * try_charge() can be invoked from any in-kernel allocation site and > > reclaim path may use considerable amount of stack. This can lead to > > stack overflows which are extremely difficult to reproduce. > > This is true but I haven't seen any reports for the stack overflow for > quite some time. So, this didn't really fix it but xfs had to punt things to workqueues to avoid stack overflows and IIRC it involved direct reclaim. Maybe it's too late but it probably is a good idea to punt this from the source. > I would just argue that this implementation has the same issue as the > other patch in the series which performs high-usage reclaim. I think > that each task should reclaim only its contribution which is trivial > to account. Hmm... I'll respond there. Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path [not found] ` <20150828171322.GC21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2015-08-28 17:56 ` Tejun Heo @ 2015-08-28 20:45 ` Vladimir Davydov 2015-08-28 20:53 ` Tejun Heo 1 sibling, 1 reply; 40+ messages in thread From: Vladimir Davydov @ 2015-08-28 20:45 UTC (permalink / raw) To: Michal Hocko Cc: Tejun Heo, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg On Fri, Aug 28, 2015 at 07:13:22PM +0200, Michal Hocko wrote: ... > > * If the allocation doesn't have __GFP_WAIT, direct reclaim is > > skipped. If a process performs only speculative allocations, it can > > blow way past the high limit. This is actually easily reproducible > > by simply doing "find /". VFS tries speculative !__GFP_WAIT > > allocations first, so as long as there's memory which can be > > consumed without blocking, it can keep allocating memory regardless > > of the high limit. > > It is a bit confusing that you are talking about direct reclaim but in > fact mean high limit reclaim. But yeah, you are right there is no > protection against GFP_NOWAIT allocations there. Actually, memory.high by itself *is* the protection against GFP_NOWAIT allocations, similarly to zone watermarks. W/o it we would have no other choice but fail a GFP_NOWAIT allocation on hitting memory.max. One should just set it so that memory.max - memory.high > [max sum size of !__GFP_WAIT allocations that can normally occur in a row] That being said, currently I don't see any point in making memory.high !__GFP_WAIT-safe. Thanks, Vladimir ^ permalink raw reply [flat|nested] 40+ messages in thread
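To make the headroom argument concrete, a simplified view of how a charge interacts with the two limits might read as below; this is an illustration only, not the code in this series, and memcg->memory.limit stands for the max counter of this era:

	unsigned long usage = page_counter_read(&memcg->memory);

	if (usage + nr_pages > memcg->memory.limit) {
		if (!(gfp_mask & __GFP_WAIT))
			return -ENOMEM;	/* over max and unable to reclaim */
		/* otherwise: direct reclaim against max, then retry */
	}
	/* below max the charge itself always succeeds ... */
	if (usage + nr_pages > memcg->high)
		mem_cgroup_events(memcg, MEMCG_HIGH, 1);	/* reclaimed later */

In other words, memory.max - memory.high is exactly the budget that !__GFP_WAIT charges can consume before anything fails, which is the setting Vladimir describes.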
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path 2015-08-28 20:45 ` Vladimir Davydov @ 2015-08-28 20:53 ` Tejun Heo [not found] ` <20150828205301.GB11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-08-28 20:53 UTC (permalink / raw) To: Vladimir Davydov Cc: Michal Hocko, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg Hello, Vladimir. On Fri, Aug 28, 2015 at 11:45:54PM +0300, Vladimir Davydov wrote: > Actually, memory.high by itself *is* the protection against GFP_NOWAIT > allocations, similarly to zone watermarks. W/o it we would have no other > choice but fail a GFP_NOWAIT allocation on hitting memory.max. One > should just set it so that > > memory.max - memory.high > [max sum size of !__GFP_WAIT allocations > that can normally occur in a row] While this would be true in many cases, I don't think this is the intention of the two knobs and the space between high and max can be filled up by anything which can't be reclaimed - e.g. too many dirty / writeback pages on a slow device or memlocked pages. If it were really the buffer for GFP_NOWAIT, there's no reason to even make it a separate knob and we *may* change how over-high reclaim behaves in the future, so let's please not dig ourselves into something too specific. > That being said, currently I don't see any point in making memory.high > !__GFP_WAIT-safe. Yeah, as long as the blow up can't be triggered consistently, it should be fine. Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
[parent not found: <20150828205301.GB11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>]
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path [not found] ` <20150828205301.GB11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org> @ 2015-08-28 21:07 ` Vladimir Davydov 2015-08-28 21:14 ` Tejun Heo 0 siblings, 1 reply; 40+ messages in thread From: Vladimir Davydov @ 2015-08-28 21:07 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg On Fri, Aug 28, 2015 at 04:53:01PM -0400, Tejun Heo wrote: > On Fri, Aug 28, 2015 at 11:45:54PM +0300, Vladimir Davydov wrote: > > Actually, memory.high by itself *is* the protection against GFP_NOWAIT > > allocations, similarly to zone watermarks. W/o it we would have no other > > choice but fail a GFP_NOWAIT allocation on hitting memory.max. One > > should just set it so that > > > > memory.max - memory.high > [max sum size of !__GFP_WAIT allocations > > that can normally occur in a row] > > While this would be true in many cases, I don't think this is the > intention of the two knobs and the space between high and max can be > filled up by anything which can't be reclaimed - e.g. too many dirty / > writeback pages on a slow device or memlocked pages. If it were > really the buffer for GFP_NOWAIT, there's no reason to even make it a > separate knob and we *may* change how over-high reclaim behaves in the > future, so let's please not dig ourselves into something too specific. Yep, come to think of it, you're right. One might want to use the memory.high knob as the protection, because currently it is the only way to protect kmemcg against GFP_NOWAIT failures, but it looks more like abusing it :-/ We should probably think about introducing some kind of watermarks that would trigger memcg reclaim, asynchronous or direct, on exceeding them. Thanks, Vladimir ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path 2015-08-28 21:07 ` Vladimir Davydov @ 2015-08-28 21:14 ` Tejun Heo 0 siblings, 0 replies; 40+ messages in thread From: Tejun Heo @ 2015-08-28 21:14 UTC (permalink / raw) To: Vladimir Davydov Cc: Michal Hocko, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg Hey, On Sat, Aug 29, 2015 at 12:07:04AM +0300, Vladimir Davydov wrote: > We should probably think about introducing some kind of watermarks that > would trigger memcg reclaim, asynchronous or direct, on exceeding > them. Yeah, for max + kmemcg case, we eventually should do something similar to the global case where we try to kick off async reclaim before we hit the hard wall. Ultimately, I think punting reclaims to workqueue or return-path is a good idea anyway, so maybe it can be all part of the same mechanism. Given that the high limit is the primary control mechanism on the default hierarchy, it should be fine for now. Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
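One possible shape for the async side of that idea is a per-memcg work item kicked from the charge path once usage crosses a watermark. This is only a sketch of the mechanism being discussed: the wmark_high/wmark_low fields and the wmark_work item do not exist in this series and would have to be added (and INIT_WORK()ed at memcg creation):

static void memcg_wmark_work_fn(struct work_struct *work)
{
	struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
						wmark_work);
	unsigned long usage = page_counter_read(&memcg->memory);

	if (usage > memcg->wmark_low)
		try_to_free_mem_cgroup_pages(memcg, usage - memcg->wmark_low,
					     GFP_KERNEL, true);
	css_put(&memcg->css);
}

/* charge path: cheap check, punt the actual reclaim to a workqueue */
static void memcg_check_wmark(struct mem_cgroup *memcg)
{
	if (page_counter_read(&memcg->memory) <= memcg->wmark_high)
		return;

	css_get(&memcg->css);
	if (!queue_work(system_unbound_wq, &memcg->wmark_work))
		css_put(&memcg->css);	/* already queued */
}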
* [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy [not found] ` <1440775530-18630-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (2 preceding siblings ...) 2015-08-28 15:25 ` [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path Tejun Heo @ 2015-08-28 15:25 ` Tejun Heo [not found] ` <1440775530-18630-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2015-11-05 17:30 ` Michal Hocko 3 siblings, 2 replies; 40+ messages in thread From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw) To: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg, Tejun Heo On the default hierarchy, all memory consumption will be accounted together and controlled by the same set of limits. Always enable kmemcg on the default hierarchy. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- mm/memcontrol.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c94b686..8a5dd01 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) if (ret) return ret; + /* kmem is always accounted together on the default hierarchy */ + if (cgroup_on_dfl(css->cgroup)) { + ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX); + if (ret) + return ret; + } + /* * Make sure the memcg is initialized: mem_cgroup_iter() * orders reading memcg->initialized against its callers -- 2.4.3 ^ permalink raw reply related [flat|nested] 40+ messages in thread
[parent not found: <1440775530-18630-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy [not found] ` <1440775530-18630-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2015-08-28 16:49 ` Vladimir Davydov 2015-08-28 16:56 ` Tejun Heo 2015-08-28 17:14 ` Michal Hocko 0 siblings, 2 replies; 40+ messages in thread From: Vladimir Davydov @ 2015-08-28 16:49 UTC (permalink / raw) To: Tejun Heo Cc: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote: > On the default hierarchy, all memory consumption will be accounted > together and controlled by the same set of limits. Always enable > kmemcg on the default hierarchy. IMO we should introduce a boot time knob for disabling it, because kmem accounting is still not perfect, besides some users might prefer to go w/o it for performance reasons. > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > --- > mm/memcontrol.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c94b686..8a5dd01 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > if (ret) > return ret; > > + /* kmem is always accounted together on the default hierarchy */ > + if (cgroup_on_dfl(css->cgroup)) { > + ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX); > + if (ret) > + return ret; > + } > + This is a wrong place for this. The kernel will panic on an attempt to create a sub memcg, because memcg_init_kmem already enables kmem accounting in this case. I guess we should add this hunk to memcg_propagate_kmem instead. Thanks, Vladimir ^ permalink raw reply [flat|nested] 40+ messages in thread
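For reference, the boot-time opt-out Vladimir asks for could be as small as an early parameter gating the activation; the parameter name and the flag below are purely hypothetical and are not provided by this series or by the kernel at this point:

static bool cgroup_memory_nokmem __read_mostly;

static int __init cgroup_memory_setup(char *s)
{
	if (!strcmp(s, "nokmem"))
		cgroup_memory_nokmem = true;
	return 1;
}
__setup("cgroup.memory=", cgroup_memory_setup);

/* and in the activation path of patch 4: */
	if (cgroup_on_dfl(css->cgroup) && !cgroup_memory_nokmem)
		ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);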
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-08-28 16:49 ` Vladimir Davydov @ 2015-08-28 16:56 ` Tejun Heo 2015-08-28 17:14 ` Michal Hocko 1 sibling, 0 replies; 40+ messages in thread From: Tejun Heo @ 2015-08-28 16:56 UTC (permalink / raw) To: Vladimir Davydov; +Cc: hannes, mhocko, cgroups, linux-mm, kernel-team Hello, On Fri, Aug 28, 2015 at 07:49:18PM +0300, Vladimir Davydov wrote: > On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote: > > On the default hierarchy, all memory consumption will be accounted > > together and controlled by the same set of limits. Always enable > > kmemcg on the default hierarchy. > > IMO we should introduce a boot time knob for disabling it, because kmem > accounting is still not perfect, besides some users might prefer to go > w/o it for performance reasons. Yeah, fair enough. > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index c94b686..8a5dd01 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > > if (ret) > > return ret; > > > > + /* kmem is always accounted together on the default hierarchy */ > > + if (cgroup_on_dfl(css->cgroup)) { > > + ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX); > > + if (ret) > > + return ret; > > + } > > + > > This is a wrong place for this. The kernel will panic on an attempt to > create a sub memcg, because memcg_init_kmem already enables kmem > accounting in this case. I guess we should add this hunk to > memcg_propagate_kmem instead. Yeap, bypassing "parent is active" test in memcg_propagate_kmem() seems like the right thing to do. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
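A sketch of what moving the hunk into memcg_propagate_kmem() could look like, written against the helpers of this era (memcg_activate_kmem(), memcg_kmem_is_active(), memcg_limit_mutex); exact signatures and locking may differ, and this is not the actual follow-up patch:

static int memcg_propagate_kmem(struct mem_cgroup *memcg)
{
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
	int ret = 0;

	mutex_lock(&memcg_limit_mutex);
	/*
	 * Activate kmem unconditionally on the default hierarchy; keep
	 * the "parent is active" rule on the legacy hierarchy.
	 */
	if (cgroup_on_dfl(memcg->css.cgroup) ||
	    (parent && memcg_kmem_is_active(parent)))
		ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
	mutex_unlock(&memcg_limit_mutex);
	return ret;
}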
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-08-28 16:49 ` Vladimir Davydov 2015-08-28 16:56 ` Tejun Heo @ 2015-08-28 17:14 ` Michal Hocko 2015-08-28 17:41 ` Tejun Heo 1 sibling, 1 reply; 40+ messages in thread From: Michal Hocko @ 2015-08-28 17:14 UTC (permalink / raw) To: Vladimir Davydov Cc: Tejun Heo, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg On Fri 28-08-15 19:49:18, Vladimir Davydov wrote: > On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote: > > On the default hierarchy, all memory consumption will be accounted > > together and controlled by the same set of limits. Always enable > > kmemcg on the default hierarchy. > > IMO we should introduce a boot time knob for disabling it, because kmem > accounting is still not perfect, besides some users might prefer to go > w/o it for performance reasons. I would even argue for opt-in rather than opt-out. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-08-28 17:14 ` Michal Hocko @ 2015-08-28 17:41 ` Tejun Heo 2015-09-01 12:44 ` Michal Hocko 0 siblings, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-08-28 17:41 UTC (permalink / raw) To: Michal Hocko; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team On Fri, Aug 28, 2015 at 07:14:38PM +0200, Michal Hocko wrote: > On Fri 28-08-15 19:49:18, Vladimir Davydov wrote: > > On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote: > > > On the default hierarchy, all memory consumption will be accounted > > > together and controlled by the same set of limits. Always enable > > > kmemcg on the default hierarchy. > > > > IMO we should introduce a boot time knob for disabling it, because kmem > > accounting is still not perfect, besides some users might prefer to go > > w/o it for performance reasons. > > I would even argue for opt-in rather than opt-out. Definitely not. We wanna put all memory consumptions under the same roof by default. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-08-28 17:41 ` Tejun Heo @ 2015-09-01 12:44 ` Michal Hocko [not found] ` <20150901124459.GC8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 40+ messages in thread From: Michal Hocko @ 2015-09-01 12:44 UTC (permalink / raw) To: Tejun Heo; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team On Fri 28-08-15 13:41:40, Tejun Heo wrote: > On Fri, Aug 28, 2015 at 07:14:38PM +0200, Michal Hocko wrote: > > On Fri 28-08-15 19:49:18, Vladimir Davydov wrote: > > > On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote: > > > > On the default hierarchy, all memory consumption will be accounted > > > > together and controlled by the same set of limits. Always enable > > > > kmemcg on the default hierarchy. > > > > > > IMO we should introduce a boot time knob for disabling it, because kmem > > > accounting is still not perfect, besides some users might prefer to go > > > w/o it for performance reasons. > > > > I would even argue for opt-in rather than opt-out. > > Definitely not. The runtime overhead is not negligible and I do not see why everybody should be paying that price by default. I can definitely see the reason why somebody would want to enable the kmem accounting but many users will probably never care because the kernel footprint would be in the noise wrt. user memory. > We wanna put all memory consumptions under the same roof by default. But I am not sure we will ever achieve this. E.g. hugetlb memory is way too different to be under the same charging by default IMO. Also all the random drivers calling into the page allocator directly in the user context would need to charge explicitly. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
[parent not found: <20150901124459.GC8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy [not found] ` <20150901124459.GC8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2015-09-01 18:51 ` Tejun Heo 2015-09-04 13:30 ` Michal Hocko 0 siblings, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-09-01 18:51 UTC (permalink / raw) To: Michal Hocko Cc: Vladimir Davydov, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg Hello, On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote: > The runtime overhead is not negligible and I do not see why everybody > should be paying that price by default. I can definitely see the reason why > somebody would want to enable the kmem accounting but many users will > probably never care because the kernel footprint would be in the noise > wrt. user memory. We said the same thing about hierarchy support. Sure, it's not the same but I think it's wiser to keep the architectural decisions at a higher level. I don't think kmem overhead is that high but if this actually is a problem we'd need a per-cgroup knob anyway. > > We wanna put all memory consumptions under the same roof by default. > > But I am not sure we will ever achieve this. E.g. hugetlb memory is way > too different to be under the same charging by default IMO. Also all > the random drivers calling into the page allocator directly in the user > context would need to charge explicitly. Oh I meant the big ones. I don't think we'll achieve 100% coverage either but even just catching the major ones, kmem and tcp socket buffers, should remove most ambiguities around memory consumption. Thanks. -- tejun ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-09-01 18:51 ` Tejun Heo @ 2015-09-04 13:30 ` Michal Hocko [not found] ` <20150904133038.GC8220-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2015-09-04 16:18 ` Tejun Heo 0 siblings, 2 replies; 40+ messages in thread From: Michal Hocko @ 2015-09-04 13:30 UTC (permalink / raw) To: Tejun Heo; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team On Tue 01-09-15 14:51:57, Tejun Heo wrote: > Hello, > > On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote: > > The runtime overhead is not negligible and I do not see why everybody > > should be paying that price by default. I can definitely see the reason why > > somebody would want to enable the kmem accounting but many users will > > probably never care because the kernel footprint would be in the noise > > wrt. user memory. > > We said the same thing about hierarchy support. Sure, it's not the > same but I think it's wiser to keep the architectural decisions at a > higher level. I don't think kmem overhead is that high but if this > actually is a problem we'd need a per-cgroup knob anyway. The overhead was around 4% for the basic kbuild test without ever triggering the [k]memcg limit last time I checked. This was quite some time ago and things might have changed since then. Even when this got better there will still be _some_ overhead because we have to track that memory and that is not free. The question really is whether kmem accounting is so generally useful that the overhead is acceptable and it is should be enabled by default. From my POV it is a useful mitigation of untrusted users but many loads simply do not care because they only care about a certain level of isolation. I might be wrong here of course but if the default should be switched it would deserve a better justification with some numbers so that people can see the possible drawbacks. I agree that the per-cgroup knob is better than the global one. We should also find consensus whether the legacy semantic of k < u limit should be preserved. It made sense to me at the time it was introduced but I recall that Vladimir found it not really helpful when we discussed that at LSF. I found it interesting e.g. for the rough task count limiting use case which people were asking for. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
[parent not found: <20150904133038.GC8220-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy [not found] ` <20150904133038.GC8220-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2015-09-04 15:38 ` Vladimir Davydov 2015-09-07 9:39 ` Michal Hocko 0 siblings, 1 reply; 40+ messages in thread From: Vladimir Davydov @ 2015-09-04 15:38 UTC (permalink / raw) To: Michal Hocko Cc: Tejun Heo, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote: > On Tue 01-09-15 14:51:57, Tejun Heo wrote: > > Hello, > > > > On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote: > > > The runtime overhead is not negligible and I do not see why everybody > > > should be paying that price by default. I can definitely see the reason why > > > somebody would want to enable the kmem accounting but many users will > > > probably never care because the kernel footprint would be in the noise > > > wrt. user memory. > > > > We said the same thing about hierarchy support. Sure, it's not the > > same but I think it's wiser to keep the architectural decisions at a > > higher level. I don't think kmem overhead is that high but if this > > actually is a problem we'd need a per-cgroup knob anyway. > > The overhead was around 4% for the basic kbuild test without ever > triggering the [k]memcg limit last time I checked. This was quite some > time ago and things might have changed since then. Even when this got > better there will still be _some_ overhead because we have to track that > memory and that is not free. Just like there is some overhead if a process is placed in memcg w/o kmem accounting enabled. > > The question really is whether kmem accounting is so generally useful > that the overhead is acceptable and it is should be enabled by > default. From my POV it is a useful mitigation of untrusted users but > many loads simply do not care because they only care about a certain > level of isolation. FWIW, I've seen a useful workload that generated tons of negative dentries for some reason (if my memory doesn't fail, it was nginx web server). If one starts such a workload inside a container w/o kmem accounting, it might evict useful data from other containers. So, even if a container is trusted, it might be still worth having kmem accounting enabled. > > I might be wrong here of course but if the default should be switched it > would deserve a better justification with some numbers so that people > can see the possible drawbacks. Personally, I'd prefer to have it switched on by default, because it would force people test it and report bugs and performance degradation. If one finds it really crappy, he/she should be able to disable it. > > I agree that the per-cgroup knob is better than the global one. We Not that sure :-/ > should also find consensus whether the legacy semantic of k < u limit > should be preserved. It made sense to me at the time it was introduced > but I recall that Vladimir found it not really helpful when we discussed > that at LSF. I found it interesting e.g. for the rough task count > limiting use case which people were asking for. There is the pids cgroup, which suits this purpose much better. K < U adds a lot of complexity to reclaim, while it's not clear whether we really need it. For instance, when you hit K you should reclaim kmem only, but there is kmem that is pinned by umem, e.g. radix tree nodes or buffer heads. What should we do with them? Reclaim umem on hitting kmem limit? IMO ugly. 
Thanks, Vladimir ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-09-04 15:38 ` Vladimir Davydov @ 2015-09-07 9:39 ` Michal Hocko 2015-09-07 10:01 ` Vladimir Davydov 0 siblings, 1 reply; 40+ messages in thread From: Michal Hocko @ 2015-09-07 9:39 UTC (permalink / raw) To: Vladimir Davydov Cc: Tejun Heo, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg On Fri 04-09-15 18:38:10, Vladimir Davydov wrote: > On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote: > > On Tue 01-09-15 14:51:57, Tejun Heo wrote: > > > Hello, > > > > > > On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote: > > > > The runtime overhead is not negligible and I do not see why everybody > > > > should be paying that price by default. I can definitely see the reason why > > > > somebody would want to enable the kmem accounting but many users will > > > > probably never care because the kernel footprint would be in the noise > > > > wrt. user memory. > > > > > > We said the same thing about hierarchy support. Sure, it's not the > > > same but I think it's wiser to keep the architectural decisions at a > > > higher level. I don't think kmem overhead is that high but if this > > > actually is a problem we'd need a per-cgroup knob anyway. > > > > The overhead was around 4% for the basic kbuild test without ever > > triggering the [k]memcg limit last time I checked. This was quite some > > time ago and things might have changed since then. Even when this got > > better there will still be _some_ overhead because we have to track that > > memory and that is not free. > > Just like there is some overhead if a process is placed in memcg w/o > kmem accounting enabled. Sure I am not questioning that. > > The question really is whether kmem accounting is so generally useful > > that the overhead is acceptable and it is should be enabled by > > default. From my POV it is a useful mitigation of untrusted users but > > many loads simply do not care because they only care about a certain > > level of isolation. > > FWIW, I've seen a useful workload that generated tons of negative > dentries for some reason (if my memory doesn't fail, it was nginx web > server). If one starts such a workload inside a container w/o kmem > accounting, it might evict useful data from other containers. So, even > if a container is trusted, it might be still worth having kmem > accounting enabled. OK, then this is a clear usecase for using kmem mem. I was merely pointing out that many others will not care about kmem. Is it majority? I dunno. But I haven't heard any convincing argument that those that need kmem would form a majority either. > > I might be wrong here of course but if the default should be switched it > > would deserve a better justification with some numbers so that people > > can see the possible drawbacks. > > Personally, I'd prefer to have it switched on by default, because it > would force people test it and report bugs and performance degradation. > If one finds it really crappy, he/she should be able to disable it. I do not think this is the way of introducing new functionality. You do not want to force users to debug your code and go let it disable if it is too crappy. > > I agree that the per-cgroup knob is better than the global one. We > > Not that sure :-/ Why? > > should also find consensus whether the legacy semantic of k < u limit > > should be preserved. 
It made sense to me at the time it was introduced > > but I recall that Vladimir found it not really helpful when we discussed > > that at LSF. I found it interesting e.g. for the rough task count > > limiting use case which people were asking for. > > There is the pids cgroup, which suits this purpose much better. I am not familiar with this controller. I am not following cgroup mailing list too closely but I vaguely remember somebody proposing this controller but I am not sure what is the current status. The last time this has been discussed there was a general pushback to use kmem accounting for process count restriction. > K < U adds a lot of complexity to reclaim, while it's not clear whether > we really need it. For instance, when you hit K you should reclaim kmem > only, but there is kmem that is pinned by umem, e.g. radix tree nodes or > buffer heads. What should we do with them? Reclaim umem on hitting kmem > limit? IMO ugly. kmem as a hard limit could simply reclaim slab objects and fail if it doesn't succeed. Sure many objects might be pinned by other resources which are not reclaimable but that is possible with the global case as well. I can see your argument that the configuration might be quite tricky, though. If there is a general consensus that kernel memory bound to resources which need to be controllable will get its own way of controlling then a separate K limit would indeed be not needed. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-09-07 9:39 ` Michal Hocko @ 2015-09-07 10:01 ` Vladimir Davydov 2015-09-07 11:03 ` Michal Hocko 0 siblings, 1 reply; 40+ messages in thread From: Vladimir Davydov @ 2015-09-07 10:01 UTC (permalink / raw) To: Michal Hocko; +Cc: Tejun Heo, hannes, cgroups, linux-mm, kernel-team On Mon, Sep 07, 2015 at 11:39:06AM +0200, Michal Hocko wrote: ... > > > I might be wrong here of course but if the default should be switched it > > > would deserve a better justification with some numbers so that people > > > can see the possible drawbacks. > > > > Personally, I'd prefer to have it switched on by default, because it > > would force people test it and report bugs and performance degradation. > > If one finds it really crappy, he/she should be able to disable it. > > I do not think this is the way of introducing new functionality. You do > not want to force users to debug your code and go let it disable if it > is too crappy. One must first enable CONFIG_MEMCG_KMEM, which is off by default. Anyway, we aren't talking about enabling it by default in the legacy hierarchy, only in the unified hierarchy, which must be explicitly enabled by passing __DEVEL__sane_behavior. I think that's enough. > > > > I agree that the per-cgroup knob is better than the global one. We > > > > Not that sure :-/ > > Why? I'm not sure there are use cases which need having kmem acct enabled in one cgroup and disabled in another. Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-09-07 10:01 ` Vladimir Davydov @ 2015-09-07 11:03 ` Michal Hocko 0 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2015-09-07 11:03 UTC (permalink / raw) To: Vladimir Davydov; +Cc: Tejun Heo, hannes, cgroups, linux-mm, kernel-team On Mon 07-09-15 13:01:10, Vladimir Davydov wrote: > On Mon, Sep 07, 2015 at 11:39:06AM +0200, Michal Hocko wrote: > ... > > > > I might be wrong here of course but if the default should be switched it > > > > would deserve a better justification with some numbers so that people > > > > can see the possible drawbacks. > > > > > > Personally, I'd prefer to have it switched on by default, because it > > > would force people test it and report bugs and performance degradation. > > > If one finds it really crappy, he/she should be able to disable it. > > > > I do not think this is the way of introducing new functionality. You do > > not want to force users to debug your code and go let it disable if it > > is too crappy. > > One must first enable CONFIG_KMEM, which is off by default. Yes, but let's be realistic here. People tend to use distribution kernels and so the config option will have to be enabled. > Anyway, we aren't talking about enabling it by default in the legacy > hierarchy, only in the unified hierarchy, which must be explicitly > enabled by passing __DEVEL__save_behavior. I think that's enough. But once it is set like that in default then it will stay even when the __DEVEL__ part is dropped. > > > > I agree that the per-cgroup knob is better than the global one. We > > > > > > Not that sure :-/ > > > > Why? > > I'm not sure there is use cases which need having kmem acct enabled in > one cgroup and disabled in another. What would be the downside though? It is true that all the static keys will be enabled in these configurations so even the non-enabled path would pay some overhead but that should be still close to unmeasurable (I haven't measured that so I might be wrong here). -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-09-04 13:30 ` Michal Hocko [not found] ` <20150904133038.GC8220-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2015-09-04 16:18 ` Tejun Heo [not found] ` <20150904161845.GB25329-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 1 sibling, 1 reply; 40+ messages in thread From: Tejun Heo @ 2015-09-04 16:18 UTC (permalink / raw) To: Michal Hocko; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team Hello, Michal. On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote: > The overhead was around 4% for the basic kbuild test without ever > triggering the [k]memcg limit last time I checked. This was quite some > time ago and things might have changed since then. Even when this got > better there will still be _some_ overhead because we have to track that > memory and that is not free. So, I just ran small scale tests and I don't see any meaningful difference between kmemcg disabled and enabled for kbuild workload (limit is never reached in both cases, memory is reclaimed from global pressure). The difference in kernel time usage. I'm sure there's *some* overhead buried in the noise but given the current implementation, I can't see how enabling kmem would lead to 4% overhead in kbuild tests. It isn't that kernel intensive to begin with. > The question really is whether kmem accounting is so generally useful > that the overhead is acceptable and it is should be enabled by > default. From my POV it is a useful mitigation of untrusted users but > many loads simply do not care because they only care about a certain > level of isolation. I don't think that's the right way to approach the problem. Given that the cost isn't prohibitive, no user only care about a certain level of isolation willingly. Distributing memory is what it's all about after all and memory is memory, user or kernel. We have kmem on/off situation for historical reasons and because the early implementation wasn't good enough to be enabled by default. I get that there can be special cases, temporary or otherwise, where disabling kmem is desirable but that gotta be the exception, not the norm. > I might be wrong here of course but if the default should be switched it > would deserve a better justification with some numbers so that people > can see the possible drawbacks. > > I agree that the per-cgroup knob is better than the global one. We > should also find consensus whether the legacy semantic of k < u limit > should be preserved. It made sense to me at the time it was introduced > but I recall that Vladimir found it not really helpful when we discussed > that at LSF. I found it interesting e.g. for the rough task count > limiting use case which people were asking for. Let's please not hinge major design decisions on short-sighted hacks and overhead considerations. If task count is an actual resource which needs to be regulated separatley, we should add a proper controller for it and we did. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
[parent not found: <20150904161845.GB25329-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy [not found] ` <20150904161845.GB25329-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2015-09-07 10:54 ` Michal Hocko 2015-09-08 18:50 ` Tejun Heo 0 siblings, 1 reply; 40+ messages in thread From: Michal Hocko @ 2015-09-07 10:54 UTC (permalink / raw) To: Tejun Heo Cc: Vladimir Davydov, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg On Fri 04-09-15 12:18:45, Tejun Heo wrote: > Hello, Michal. > > On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote: > > The overhead was around 4% for the basic kbuild test without ever > > triggering the [k]memcg limit last time I checked. This was quite some > > time ago and things might have changed since then. Even when this got > > better there will still be _some_ overhead because we have to track that > > memory and that is not free. > > So, I just ran small scale tests and I don't see any meaningful > difference between kmemcg disabled and enabled for kbuild workload > (limit is never reached in both cases, memory is reclaimed from global > pressure). The difference in kernel time usage. I'm sure there's > *some* overhead buried in the noise but given the current > implementation, I can't see how enabling kmem would lead to 4% > overhead in kbuild tests. It isn't that kernel intensive to begin > with. OK, I've quickly rerun my test on 32CPU machine with 64G of RAM Elapsed logs.kmem: min: 68.10 max: 69.27 avg: 68.53 std: 0.53 runs: 3 logs.no.kmem: min: 64.08 [94.1%] max: 68.42 [98.8%] avg: 66.22 [96.6%] std: 1.77 runs: 3 User logs.kmem: min: 867.68 max: 872.88 avg: 869.49 std: 2.40 runs: 3 logs.no.kmem: min: 865.99 [99.8%] max: 884.94 [101.4%] avg: 874.08 [100.5%] std: 7.98 runs: 3 System logs.kmem: min: 78.50 max: 78.85 avg: 78.63 std: 0.16 runs: 3 logs.no.kmem: min: 75.36 [96.0%] max: 80.50 [102.1%] avg: 77.91 [99.1%] std: 2.10 runs: 3 The elapsed time is still ~3% worse on average while user and system are in noise. I haven't checked where the overhead is coming from. > > The question really is whether kmem accounting is so generally useful > > that the overhead is acceptable and it is should be enabled by > > default. From my POV it is a useful mitigation of untrusted users but > > many loads simply do not care because they only care about a certain > > level of isolation. > > I don't think that's the right way to approach the problem. Given > that the cost isn't prohibitive, no user only care about a certain > level of isolation willingly. I haven't said it is prohibitive. It is simply non-zero and there is always cost/benefit that should be considered. > Distributing memory is what it's all about after all and memory is > memory, user or kernel. True except that kmem accounting doesn't cover the whole kernel memory usage. It is an opt-in mechanism for a _better_ isolation. And the question really is whether that better isolation is needed/requested by default. > We have kmem > on/off situation for historical reasons and because the early > implementation wasn't good enough to be enabled by default. I get > that there can be special cases, temporary or otherwise, where > disabling kmem is desirable but that gotta be the exception, not the > norm. The default should be the cheapest one IMHO. And our overhead is really close to 0 if no memcg accounting is enabled thanks to Johannes' page_counters. Then we have a lightweight form of accounting (only user memory) which is nicely defined.
And then we have an additional opt-in for a better isolation which involves some kernel memory as well. Why should we conflate the last two? I mean, if somebody wants an additional protection then sure, enable kmem and pay an additional overhead but why to force this on everybody who wants to use memcg? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-09-07 10:54 ` Michal Hocko @ 2015-09-08 18:50 ` Tejun Heo 0 siblings, 0 replies; 40+ messages in thread From: Tejun Heo @ 2015-09-08 18:50 UTC (permalink / raw) To: Michal Hocko; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team Hello, Michal. On Mon, Sep 07, 2015 at 12:54:37PM +0200, Michal Hocko wrote: > OK, I've quickly rerun my test on 32CPU machine with 64G of RAM > Elapsed > logs.kmem: min: 68.10 max: 69.27 avg: 68.53 std: 0.53 runs: 3 > logs.no.kmem: min: 64.08 [94.1%] max: 68.42 [98.8%] avg: 66.22 [96.6%] std: 1.77 runs: 3 > User > logs.kmem: min: 867.68 max: 872.88 avg: 869.49 std: 2.40 runs: 3 > logs.no.kmem: min: 865.99 [99.8%] max: 884.94 [101.4%] avg: 874.08 [100.5%] std: 7.98 runs: 3 > System > logs.kmem: min: 78.50 max: 78.85 avg: 78.63 std: 0.16 runs: 3 > logs.no.kmem: min: 75.36 [96.0%] max: 80.50 [102.1%] avg: 77.91 [99.1%] std: 2.10 runs: 3 > > The elapsed time is still ~3% worse in average while user and system are > in noise. I haven't checked where he overhead is coming from. Does the cgroup have memory limit configured? Unless there are measurement errors, the only way it'd take noticeably longer w/o incurring more CPU time is spending time blocked on reclaim and enabling kmem of course adds to memory pressure, which is the intended behavior. > > I don't think that's the right way to approach the problem. Given > > that the cost isn't prohibitive, no user only care about a certain > > level of isolation willingly. > > I haven't said it is prohibitive. It is simply non-zero and there is > always cost/benefit that should be considered. We do want to hunt down that 3% but locking into interface is an a lot larger and longer-term commitment. The cost sure is non-zero but I'd be surprised if we can't get that down to something generally acceptable over time given that the domain switching is a relatively low-frequency event (scheduling) and it's an area where we can actively make space to speed trade-off. > > Distributing memory is what it's all about after all and memory is > > memory, user or kernel. > > True except that kmem accounting doesn't cover the whole kernel memory > usage. It is an opt-in mechanism for a _better_ isolation. And the > question really is whether that better isolation is needed/requested by > default. It isn't perfect but kmem and socket buffers do cover most of kernel memory usage that is accountable to userland. It isn't just a matter of better or worse. The goal of cgroup is providing a "reasonable" isolation. Sure, we can decide to ignore some but that should be because the extra accuracy there doesn't matter in the scheme of things and thus paying the overhead is pointless; however, users shouldn't need to worry about the different levels of ambiguous accuracies which can't even be quantified, at least not by default. Let's please not get lost in perfectionism. Sure, it can't be perfect but that doesn't mean an attainable and clear goal doesn't exist here. > > We have kmem > > on/off situation for historical reasons and because the early > > implementation wasn't good enough to be enabled by default. I get > > that there can be special cases, temporary or otherwise, where > > disabling kmem is desirable but that gotta be the exception, not the > > norm. > > The default should be the cheapest one IMHO. And our overhead is really That is a way too simplistic and greedy decision criterion. I don't think we want to make interface decisions on that. 
Overhead considerations definitely dictate what we can and can't do and that's why I said that the cost wasn't prohibitive but there are a whole lot of other things to consider too including where we wanna be eventually years down the road. > close to 0 if no memcg accounting is enabled thanks to Johannes' > page_counters. Then we have a lightweight form of accounting (only user > memory) which is nicely defined. And then we have an additional opt-in > for a better isolation which involves some kernel memory as well. Why > should we conflate the last two? I mean, if somebody wants an additional > protection then sure, enable kmem and pay an additional overhead but why > to force this on everybody who wants to use memcg? Because it betrays the basic goal of reasonable resource isolation. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy 2015-08-28 15:25 ` [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy Tejun Heo [not found] ` <1440775530-18630-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2015-11-05 17:30 ` Michal Hocko 1 sibling, 0 replies; 40+ messages in thread From: Michal Hocko @ 2015-11-05 17:30 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team Just for the reference. This has been discussed as a part of other email thread discussed here: http://lkml.kernel.org/r/20151027122647.GG9891%40dhcp22.suse.cz I am _really_ sorry for hijacking that one - I didn't intend to do so but my remark ended up in a full discussion. If I knew it would go that way I wouldn't even mention it. On Fri 28-08-15 11:25:30, Tejun Heo wrote: > On the default hierarchy, all memory consumption will be accounted > together and controlled by the same set of limits. Always enable > kmemcg on the default hierarchy. > > Signed-off-by: Tejun Heo <tj@kernel.org> > --- > mm/memcontrol.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c94b686..8a5dd01 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > if (ret) > return ret; > > + /* kmem is always accounted together on the default hierarchy */ > + if (cgroup_on_dfl(css->cgroup)) { > + ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX); > + if (ret) > + return ret; > + } > + > /* > * Make sure the memcg is initialized: mem_cgroup_iter() > * orders reading memcg->initialized against its callers > -- > 2.4.3 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
Thread overview: 40+ messages
2015-08-28 15:25 [PATCHSET] memcg: improve high limit behavior and always enable kmemcg on dfl hier Tejun Heo
[not found] ` <1440775530-18630-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-08-28 15:25 ` [PATCH 1/4] memcg: fix over-high reclaim amount Tejun Heo
2015-08-28 17:06 ` Michal Hocko
[not found] ` <20150828170612.GA21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-08-28 18:32 ` Tejun Heo
[not found] ` <20150828183209.GA9423-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-08-31 7:51 ` Michal Hocko
[not found] ` <20150831075133.GA29723-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-08-31 13:38 ` Tejun Heo
2015-09-01 12:51 ` Michal Hocko
[not found] ` <20150901125149.GD8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-09-01 18:33 ` Tejun Heo
2015-08-28 15:25 ` [PATCH 2/4] memcg: flatten task_struct->memcg_oom Tejun Heo
[not found] ` <1440775530-18630-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-08-28 17:11 ` Michal Hocko
2015-08-28 15:25 ` [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path Tejun Heo
[not found] ` <1440775530-18630-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-08-28 16:36 ` Vladimir Davydov
2015-08-28 16:48 ` Tejun Heo
2015-08-28 20:32 ` Vladimir Davydov
2015-08-28 20:44 ` Tejun Heo
2015-08-28 22:06 ` Tejun Heo
[not found] ` <20150828220632.GF11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2015-08-29 7:59 ` Vladimir Davydov
2015-08-30 15:52 ` Vladimir Davydov
2015-08-28 17:13 ` Michal Hocko
[not found] ` <20150828171322.GC21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-08-28 17:56 ` Tejun Heo
2015-08-28 20:45 ` Vladimir Davydov
2015-08-28 20:53 ` Tejun Heo
[not found] ` <20150828205301.GB11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2015-08-28 21:07 ` Vladimir Davydov
2015-08-28 21:14 ` Tejun Heo
2015-08-28 15:25 ` [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy Tejun Heo
[not found] ` <1440775530-18630-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-08-28 16:49 ` Vladimir Davydov
2015-08-28 16:56 ` Tejun Heo
2015-08-28 17:14 ` Michal Hocko
2015-08-28 17:41 ` Tejun Heo
2015-09-01 12:44 ` Michal Hocko
[not found] ` <20150901124459.GC8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-09-01 18:51 ` Tejun Heo
2015-09-04 13:30 ` Michal Hocko
[not found] ` <20150904133038.GC8220-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-09-04 15:38 ` Vladimir Davydov
2015-09-07 9:39 ` Michal Hocko
2015-09-07 10:01 ` Vladimir Davydov
2015-09-07 11:03 ` Michal Hocko
2015-09-04 16:18 ` Tejun Heo
[not found] ` <20150904161845.GB25329-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-09-07 10:54 ` Michal Hocko
2015-09-08 18:50 ` Tejun Heo
2015-11-05 17:30 ` Michal Hocko