* Re: [PATCH 1/2] memcg: reparent charges of children before processing parent
2014-02-12 23:03 [PATCH 1/2] memcg: reparent charges of children before processing parent Hugh Dickins
@ 2014-02-13 0:25 ` Tejun Heo
2014-02-13 15:27 ` Michal Hocko
2014-03-05 9:27 ` Markus Blank-Burian
2 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2014-02-13 0:25 UTC (permalink / raw)
To: Hugh Dickins
Cc: Michal Hocko, Johannes Weiner, Filipe Brandenburger, Li Zefan,
Andrew Morton, Greg Thelen, Michel Lespinasse,
Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
linux-mm
Hello,
On Wed, Feb 12, 2014 at 03:03:31PM -0800, Hugh Dickins wrote:
> From: Filipe Brandenburger <filbranden@google.com>
>
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
>
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
>
> Further testing showed that an ordered workqueue for cgroup_destroy_wq
> is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
> stage on the way can mess up the order before reaching the workqueue.
>
> Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
> all its children (and grandchildren, in the correct order) to have their
> charges reparented first.
>
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Signed-off-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # v3.10+ (but will need extra care)
> ---
> Or, you may prefer my alternative cgroup.c approach in 2/2:
> there's no need for both. Please note that neither of these patches
> attempts to handle the unlikely case of racy charges made to child
> after its offline, but parent's offline coming before child's free:
> mem_cgroup_css_free()'s backstop call to mem_cgroup_reparent_charges()
> cannot help in that case, with or without these patches. Fixing that
> would have to be a separate effort - Michal's?
I've changed my mind several times now but I think it'd be a better
idea to stick to this patch, at least for now. This one is easier for
-stable backport and it looks like the requirements for ordering
->css_offline() might go away depending on how reparenting changes
work out.
Reviewed-by: Tejun Heo <tj@kernel.org>
Michal, Johannes, can you guys please ack this one if you guys agree?
Thanks.
--
tejun
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] memcg: reparent charges of children before processing parent
2014-02-12 23:03 [PATCH 1/2] memcg: reparent charges of children before processing parent Hugh Dickins
2014-02-13 0:25 ` Tejun Heo
@ 2014-02-13 15:27 ` Michal Hocko
2014-02-13 15:35 ` Tejun Heo
2014-03-05 9:27 ` Markus Blank-Burian
2 siblings, 1 reply; 6+ messages in thread
From: Michal Hocko @ 2014-02-13 15:27 UTC (permalink / raw)
To: Hugh Dickins
Cc: Tejun Heo, Johannes Weiner, Filipe Brandenburger, Li Zefan,
Andrew Morton, Greg Thelen, Michel Lespinasse,
Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
linux-mm
On Wed 12-02-14 15:03:31, Hugh Dickins wrote:
> From: Filipe Brandenburger <filbranden@google.com>
>
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
>
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
>
> Further testing showed that an ordered workqueue for cgroup_destroy_wq
> is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
> stage on the way can mess up the order before reaching the workqueue.
This whole code path is so complicated by different types of delayed
work that I am not wondering that we have missed that :/
> Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
> all its children (and grandchildren, in the correct order) to have their
> charges reparented first.
That is basically what I was suggesting
http://marc.info/?l=linux-mm&m=139178386407184&w=2 as #1 option. I
cannot say I would like it and I think that reparenting LRUs in
css_offline and then reparent the remaining charges from css_free is a
better solution but let's keep this for later.
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Signed-off-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # v3.10+ (but will need extra care)
OK, I guess we should go with this one because it describes both the
problem in memcg offlining and requirements from the cgroup core so the
later approach can build on top of it.
Acked-by: Michal Hocko <mhocko@suse.cz>
> ---
> Or, you may prefer my alternative cgroup.c approach in 2/2:
> there's no need for both. Please note that neither of these patches
> attempts to handle the unlikely case of racy charges made to child
> after its offline, but parent's offline coming before child's free:
> mem_cgroup_css_free()'s backstop call to mem_cgroup_reparent_charges()
> cannot help in that case, with or without these patches. Fixing that
> would have to be a separate effort - Michal's?
I plan to implement the LRU care for css_offline and charge drain for
css_free later on but I have quite some work on my plate currently so I
cannot promis it will be done right now. I hope to have it soon though.
> mm/memcontrol.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> --- 3.14-rc2/mm/memcontrol.c 2014-02-02 18:49:07.897302115 -0800
> +++ linux/mm/memcontrol.c 2014-02-11 17:48:07.604582963 -0800
> @@ -6595,6 +6595,7 @@ static void mem_cgroup_css_offline(struc
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> struct mem_cgroup_event *event, *tmp;
> + struct cgroup_subsys_state *iter;
>
> /*
> * Unregister events and notify userspace.
> @@ -6611,7 +6612,14 @@ static void mem_cgroup_css_offline(struc
> kmem_cgroup_css_offline(memcg);
>
> mem_cgroup_invalidate_reclaim_iterators(memcg);
> - mem_cgroup_reparent_charges(memcg);
> +
> + /*
> + * This requires that offlining is serialized. Right now that is
> + * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
> + */
> + css_for_each_descendant_post(iter, css)
> + mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
> +
> mem_cgroup_destroy_all_caches(memcg);
> vmpressure_cleanup(&memcg->vmpressure);
> }
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] memcg: reparent charges of children before processing parent
2014-02-13 15:27 ` Michal Hocko
@ 2014-02-13 15:35 ` Tejun Heo
0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2014-02-13 15:35 UTC (permalink / raw)
To: Michal Hocko
Cc: Hugh Dickins, Johannes Weiner, Filipe Brandenburger, Li Zefan,
Andrew Morton, Greg Thelen, Michel Lespinasse,
Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
linux-mm
Hello,
On Thu, Feb 13, 2014 at 04:27:45PM +0100, Michal Hocko wrote:
> > Further testing showed that an ordered workqueue for cgroup_destroy_wq
> > is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
> > stage on the way can mess up the order before reaching the workqueue.
>
> This whole code path is so complicated by different types of delayed
> work that I am not wondering that we have missed that :/
Yeah, I know. Good part of the complexity comes from RCU -> wq
bouncing. I wonder whether we just should bite the bullet and add
something along the line of call_rcu_work(). The other part is percpu
ref shutdown. For me that part is easier to swallow, as the benefits
are quite clear.
> > Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
> > all its children (and grandchildren, in the correct order) to have their
> > charges reparented first.
>
> That is basically what I was suggesting
> http://marc.info/?l=linux-mm&m=139178386407184&w=2 as #1 option. I
> cannot say I would like it and I think that reparenting LRUs in
> css_offline and then reparent the remaining charges from css_free is a
> better solution but let's keep this for later.
I'm kinda wishing the reparenting things works out. Even if that
involves a bit of overhead at offline, I think it'd be worthwhile to
be able to follow the same object lifetime rules as other controllers,
as long as the overhead is reasonable.
Thanks.
--
tejun
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] memcg: reparent charges of children before processing parent
2014-02-12 23:03 [PATCH 1/2] memcg: reparent charges of children before processing parent Hugh Dickins
2014-02-13 0:25 ` Tejun Heo
2014-02-13 15:27 ` Michal Hocko
@ 2014-03-05 9:27 ` Markus Blank-Burian
2014-03-06 2:53 ` Hugh Dickins
2 siblings, 1 reply; 6+ messages in thread
From: Markus Blank-Burian @ 2014-03-05 9:27 UTC (permalink / raw)
To: Hugh Dickins
Cc: Tejun Heo, Michal Hocko, Johannes Weiner, Filipe Brandenburger,
Li Zefan, Andrew Morton, Greg Thelen, Michel Lespinasse,
Shawn Bohrer, cgroups, linux-kernel, linux-mm
I wanted to give you small feedback, that this patch successfully fixes the
problem with reparent_charges on our cluster. Thank you very much for finding
and fixing this one!
On Wednesday 12 February 2014 15:03:31 Hugh Dickins wrote:
> From: Filipe Brandenburger <filbranden@google.com>
>
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
>
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
>
> Further testing showed that an ordered workqueue for cgroup_destroy_wq
> is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
> stage on the way can mess up the order before reaching the workqueue.
>
> Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
> all its children (and grandchildren, in the correct order) to have their
> charges reparented first.
>
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup
> destruction") Signed-off-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # v3.10+ (but will need extra care)
> ---
> Or, you may prefer my alternative cgroup.c approach in 2/2:
> there's no need for both. Please note that neither of these patches
> attempts to handle the unlikely case of racy charges made to child
> after its offline, but parent's offline coming before child's free:
> mem_cgroup_css_free()'s backstop call to mem_cgroup_reparent_charges()
> cannot help in that case, with or without these patches. Fixing that
> would have to be a separate effort - Michal's?
>
> mm/memcontrol.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> --- 3.14-rc2/mm/memcontrol.c 2014-02-02 18:49:07.897302115 -0800
> +++ linux/mm/memcontrol.c 2014-02-11 17:48:07.604582963 -0800
> @@ -6595,6 +6595,7 @@ static void mem_cgroup_css_offline(struc
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> struct mem_cgroup_event *event, *tmp;
> + struct cgroup_subsys_state *iter;
>
> /*
> * Unregister events and notify userspace.
> @@ -6611,7 +6612,14 @@ static void mem_cgroup_css_offline(struc
> kmem_cgroup_css_offline(memcg);
>
> mem_cgroup_invalidate_reclaim_iterators(memcg);
> - mem_cgroup_reparent_charges(memcg);
> +
> + /*
> + * This requires that offlining is serialized. Right now that is
> + * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
> + */
> + css_for_each_descendant_post(iter, css)
> + mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
> +
> mem_cgroup_destroy_all_caches(memcg);
> vmpressure_cleanup(&memcg->vmpressure);
> }
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] memcg: reparent charges of children before processing parent
2014-03-05 9:27 ` Markus Blank-Burian
@ 2014-03-06 2:53 ` Hugh Dickins
0 siblings, 0 replies; 6+ messages in thread
From: Hugh Dickins @ 2014-03-06 2:53 UTC (permalink / raw)
To: Markus Blank-Burian
Cc: Hugh Dickins, Tejun Heo, Michal Hocko, Johannes Weiner,
Filipe Brandenburger, Li Zefan, Andrew Morton, Greg Thelen,
Michel Lespinasse, Shawn Bohrer, cgroups, linux-kernel, linux-mm
On Wed, 5 Mar 2014, Markus Blank-Burian wrote:
> I wanted to give you small feedback, that this patch successfully fixes the
> problem with reparent_charges on our cluster. Thank you very much for finding
> and fixing this one!
>
That's great to hear, Markus: thanks a lot for letting us know.
All credit to Filipe, who had the idea of what might be going wrong,
and the best patch to fix it.
Hugh
>
> On Wednesday 12 February 2014 15:03:31 Hugh Dickins wrote:
> > From: Filipe Brandenburger <filbranden@google.com>
> >
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> >
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> >
> > Further testing showed that an ordered workqueue for cgroup_destroy_wq
> > is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
> > stage on the way can mess up the order before reaching the workqueue.
> >
> > Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
> > all its children (and grandchildren, in the correct order) to have their
> > charges reparented first.
> >
> > Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup
> > destruction") Signed-off-by: Filipe Brandenburger <filbranden@google.com>
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: stable@vger.kernel.org # v3.10+ (but will need extra care)
> > ---
> > Or, you may prefer my alternative cgroup.c approach in 2/2:
> > there's no need for both. Please note that neither of these patches
> > attempts to handle the unlikely case of racy charges made to child
> > after its offline, but parent's offline coming before child's free:
> > mem_cgroup_css_free()'s backstop call to mem_cgroup_reparent_charges()
> > cannot help in that case, with or without these patches. Fixing that
> > would have to be a separate effort - Michal's?
> >
> > mm/memcontrol.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > --- 3.14-rc2/mm/memcontrol.c 2014-02-02 18:49:07.897302115 -0800
> > +++ linux/mm/memcontrol.c 2014-02-11 17:48:07.604582963 -0800
> > @@ -6595,6 +6595,7 @@ static void mem_cgroup_css_offline(struc
> > {
> > struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > struct mem_cgroup_event *event, *tmp;
> > + struct cgroup_subsys_state *iter;
> >
> > /*
> > * Unregister events and notify userspace.
> > @@ -6611,7 +6612,14 @@ static void mem_cgroup_css_offline(struc
> > kmem_cgroup_css_offline(memcg);
> >
> > mem_cgroup_invalidate_reclaim_iterators(memcg);
> > - mem_cgroup_reparent_charges(memcg);
> > +
> > + /*
> > + * This requires that offlining is serialized. Right now that is
> > + * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
> > + */
> > + css_for_each_descendant_post(iter, css)
> > + mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
> > +
> > mem_cgroup_destroy_all_caches(memcg);
> > vmpressure_cleanup(&memcg->vmpressure);
> > }
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 6+ messages in thread