* [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Hugh Dickins @ 2014-02-06 23:56 UTC
To: Tejun Heo
Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
    Johannes Weiner, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding cgroup_mutex
which prevents the child from reaching its mem_cgroup_reparent_charges().

Just use an ordered workqueue for cgroup_destroy_wq.

Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Suggested-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # 3.10+
---

 kernel/cgroup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- 3.14-rc1/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
+++ linux/kernel/cgroup.c	2014-02-06 15:20:35.548904965 -0800
@@ -4845,12 +4845,12 @@ static int __init cgroup_wq_init(void)
 	/*
 	 * There isn't much point in executing destruction path in
 	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
-	 * Use 1 for @max_active.
+	 * Must be ordered to make sure parent is offlined after children.
	 *
 	 * We would prefer to do this in cgroup_init() above, but that
 	 * is called before init_workqueues(): so leave this until after.
 	 */
-	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
 	BUG_ON(!cgroup_destroy_wq);
 
 	/*
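For readers following the one-liner: @max_active on a bound workqueue is
enforced per CPU, so alloc_workqueue("cgroup_destroy", 0, 1) still allows
offline work items queued from different CPUs to run concurrently and out
of order, whereas an ordered workqueue executes at most one work item at
a time, system-wide, strictly in queueing order. A minimal sketch
contrasting the two allocations (illustrative module code, not part of
the patch):

	/* Illustrative module sketch only; not code from this thread. */
	#include <linux/module.h>
	#include <linux/workqueue.h>

	static struct workqueue_struct *example_wq;

	static int __init example_init(void)
	{
		/*
		 * Bound workqueue, max_active == 1: the limit applies per
		 * CPU, so a parent's offline work queued on CPU1 may still
		 * run before a child's offline work queued earlier on CPU0.
		 */
		example_wq = alloc_workqueue("example", 0, 1);
		if (example_wq)
			destroy_workqueue(example_wq);

		/*
		 * Ordered workqueue: unbound, one work item at a time
		 * across the whole system, executed strictly in queueing
		 * order, so a child queued before its parent always runs
		 * first.
		 */
		example_wq = alloc_ordered_workqueue("example", 0);
		if (!example_wq)
			return -ENOMEM;
		return 0;
	}

	static void __exit example_exit(void)
	{
		destroy_workqueue(example_wq);
	}

	module_init(example_init);
	module_exit(example_exit);
	MODULE_LICENSE("GPL");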
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Michal Hocko @ 2014-02-07 13:45 UTC
To: Hugh Dickins
Cc: Tejun Heo, Filipe Brandenburger, Li Zefan, Andrew Morton,
    Johannes Weiner, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

On Thu 06-02-14 15:56:01, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
>
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
>
> Just use an ordered workqueue for cgroup_destroy_wq.

Hmm, interesting. Markus has seen hangs even with mem_cgroup_css_offline
and the referenced cgroup fixes; maybe this is the right one finally.

> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.10+

Reviewed-by: Michal Hocko <mhocko@suse.cz>

e5fca243abae was marked for 3.9 stable, but I do not see it in Greg's
3.9 stable branch, so 3.10+ seems to be sufficient.

> ---
>
>  kernel/cgroup.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- 3.14-rc1/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
> +++ linux/kernel/cgroup.c	2014-02-06 15:20:35.548904965 -0800
> @@ -4845,12 +4845,12 @@ static int __init cgroup_wq_init(void)
>  	/*
>  	 * There isn't much point in executing destruction path in
>  	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
> -	 * Use 1 for @max_active.
> +	 * Must be ordered to make sure parent is offlined after children.
>  	 *
>  	 * We would prefer to do this in cgroup_init() above, but that
>  	 * is called before init_workqueues(): so leave this until after.
>  	 */
> -	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> +	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
>  	BUG_ON(!cgroup_destroy_wq);
>
>  	/*

--
Michal Hocko
SUSE Labs
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Tejun Heo @ 2014-02-07 14:04 UTC
To: Hugh Dickins
Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
    Johannes Weiner, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

Hello, Hugh.

On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
>
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
>
> Just use an ordered workqueue for cgroup_destroy_wq.

Hmmm... I'm not really comfortable with this.  This would seal shut
any possibility of increasing concurrency in that path, which is okay
now, but I find the combination of such a long-term commitment and the
non-obviousness (it's not apparent from looking at memcg code why it
wouldn't deadlock) very unappealing.  Besides, the only reason
offline() is currently called under cgroup_mutex is history.  We can
move it out of cgroup_mutex right now.

But even with offline being called outside cgroup_mutex, IIRC, the
described problem would still be able to deadlock as long as the tree
depth is deeper than the max concurrency level of the destruction
workqueue.  Sure, we can give it a large enough number, but it's
generally nasty.

One thing I don't get is why memcg has such a reverse dependency at all.
Why does the parent wait for its descendants to do something during
offline?  Shouldn't it be able to just bail and let whatever
descendant which is still busy propagate things upwards?  That's a
usual pattern we use for tree shutdowns anyway.  Would that be nasty to
implement in memcg?

Thanks.

--
tejun
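The bail-and-propagate pattern Tejun describes, reduced to a hypothetical
sketch (mem_cgroup_try_reparent() and memcg_retry_offline() are invented
names for illustration, not actual memcg functions):

	/* Hypothetical sketch of bail-and-propagate; not actual memcg code. */
	static void memcg_offline_sketch(struct mem_cgroup *memcg)
	{
		/*
		 * Try to reparent without blocking; a child that is still
		 * draining will re-trigger us from its own offline path, so
		 * bailing here cannot lose the work permanently.
		 */
		if (!mem_cgroup_try_reparent(memcg))	/* invented helper */
			return;

		/* fully drained: propagate upwards so the parent can retry */
		if (parent_mem_cgroup(memcg))
			memcg_retry_offline(parent_mem_cgroup(memcg)); /* invented */
	}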
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Michal Hocko @ 2014-02-07 14:37 UTC
To: Tejun Heo
Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
    Johannes Weiner, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

On Fri 07-02-14 09:04:02, Tejun Heo wrote:
> Hello, Hugh.
>
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> >
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> >
> > Just use an ordered workqueue for cgroup_destroy_wq.
>
> Hmmm... I'm not really comfortable with this.  This would seal shut
> any possibility of increasing concurrency in that path, which is okay
> now, but I find the combination of such a long-term commitment and the
> non-obviousness (it's not apparent from looking at memcg code why it
> wouldn't deadlock) very unappealing.  Besides, the only reason
> offline() is currently called under cgroup_mutex is history.  We can
> move it out of cgroup_mutex right now.
>
> But even with offline being called outside cgroup_mutex, IIRC, the
> described problem would still be able to deadlock as long as the tree
> depth is deeper than the max concurrency level of the destruction
> workqueue.  Sure, we can give it a large enough number, but it's
> generally nasty.
>
> One thing I don't get is why memcg has such a reverse dependency at all.
> Why does the parent wait for its descendants to do something during
> offline?

Because the parent sees the charges of its children but it doesn't see
the pages, as they are on the LRUs of those children, so it cannot reach
0 charges.  We are assuming that the offlining memcg doesn't have any
children, which sounds like a reasonable expectation to me.

> Shouldn't it be able to just bail and let whatever
> descendant which is still busy propagate things upwards?  That's a
> usual pattern we use for tree shutdowns anyway.  Would that be nasty to
> implement in memcg?

Hmm, this is a bit tricky.  We cannot use memcg iterators to reach
children because css_tryget would fail on them.  We can use cgroup
iterators instead, alright, and reparent pages from leaves, but this
all sounds like a lot of complications.

Another option would be weakening css_offline reparenting and not
insisting on having 0 charges.  We want to get rid of as many charges
as possible but do not need to have all of them gone
(http://marc.info/?l=linux-kernel&m=139161412932193&w=2).  The last
part would be reparenting to the uppermost parent which is still
online.

I guess this is implementable, but I would prefer Hugh's fix for now
and for stable.

--
Michal Hocko
SUSE Labs
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Tejun Heo @ 2014-02-07 15:13 UTC
To: Michal Hocko
Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
    Johannes Weiner, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

Hello, Michal.

On Fri, Feb 07, 2014 at 03:37:40PM +0100, Michal Hocko wrote:
> Hmm, this is a bit tricky.  We cannot use memcg iterators to reach
> children because css_tryget would fail on them.  We can use cgroup
> iterators instead, alright, and reparent pages from leaves, but this
> all sounds like a lot of complications.

Hmmm... I think we're talking past each other here.  Why would the
parent need to reach down to the children?  Just bail out if it can't
bring things down to zero, and let the child, when it finishes its own
cleaning, walk up the tree propagating changes.  ->parent is always
accessible.  Would that be complicated too?

> Another option would be weakening css_offline reparenting and not
> insisting on having 0 charges.  We want to get rid of as many charges
> as possible but do not need to have all of them gone
> (http://marc.info/?l=linux-kernel&m=139161412932193&w=2).  The last
> part would be reparenting to the uppermost parent which is still
> online.
>
> I guess this is implementable, but I would prefer Hugh's fix for now
> and for stable.

Yeah, for -stable, I think Hugh's patch is good, but I really don't
want to keep it long term.

Thanks.

--
tejun
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Michal Hocko @ 2014-02-07 15:28 UTC
To: Tejun Heo
Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
    Johannes Weiner, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

On Fri 07-02-14 10:13:41, Tejun Heo wrote:
> Hello, Michal.
>
> On Fri, Feb 07, 2014 at 03:37:40PM +0100, Michal Hocko wrote:
> > Hmm, this is a bit tricky.  We cannot use memcg iterators to reach
> > children because css_tryget would fail on them.  We can use cgroup
> > iterators instead, alright, and reparent pages from leaves, but this
> > all sounds like a lot of complications.
>
> Hmmm... I think we're talking past each other here.  Why would the
> parent need to reach down to the children?  Just bail out if it can't
> bring things down to zero, and let the child, when it finishes its own
> cleaning, walk up the tree propagating changes.  ->parent is always
> accessible.  Would that be complicated too?

This would be basically option #2 below.

> > Another option would be weakening css_offline reparenting and not
> > insisting on having 0 charges.  We want to get rid of as many charges
> > as possible but do not need to have all of them gone
> > (http://marc.info/?l=linux-kernel&m=139161412932193&w=2).  The last
> > part would be reparenting to the uppermost parent which is still
> > online.
> >
> > I guess this is implementable, but I would prefer Hugh's fix for now
> > and for stable.
>
> Yeah, for -stable, I think Hugh's patch is good, but I really don't
> want to keep it long term.

Based on our recent discussion regarding css_offline semantics, we want
to do some changes in that area.  I thought we would simply update
comments, but considering this report css_offline needs some changes as
well.  I will look at it.

The idea is to split mem_cgroup_reparent_charges into two parts: the
core one, which drains LRUs and would be called from
mem_cgroup_css_offline, and one which loops until all charges are gone,
for mem_cgroup_css_free.  mem_cgroup_move_parent will need an update as
well: it would have to go up the hierarchy to the first alive parent.

--
Michal Hocko
SUSE Labs
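Read as pseudocode, the split Michal proposes might look like the sketch
below: a hypothetical shape only, where mem_cgroup_drain_lrus() is an
invented stand-in for the LRU-draining core of
mem_cgroup_reparent_charges():

	/* Hypothetical two-phase split; helper name invented for illustration. */
	static void mem_cgroup_css_offline_sketch(struct mem_cgroup *memcg)
	{
		/* best effort: drain the LRUs once, do not insist on 0 charges */
		mem_cgroup_drain_lrus(memcg);
	}

	static void mem_cgroup_css_free_sketch(struct mem_cgroup *memcg)
	{
		/*
		 * css_free() is naturally ordered: it cannot run for a parent
		 * while a child still holds a reference, so looping here until
		 * every charge is gone terminates without the offline deadlock.
		 */
		while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0)
			mem_cgroup_drain_lrus(memcg);
	}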
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Hugh Dickins @ 2014-02-07 20:20 UTC
To: Tejun Heo
Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
    Michal Hocko, Johannes Weiner, Greg Thelen, Michel Lespinasse,
    Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel, linux-mm

Hi Tejun,

On Fri, 7 Feb 2014, Tejun Heo wrote:
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> >
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> >
> > Just use an ordered workqueue for cgroup_destroy_wq.
>
> Hmmm... I'm not really comfortable with this.  This would seal shut
> any possibility of increasing concurrency in that path, which is okay
> now, but I find the combination of such a long-term commitment and the
> non-obviousness (it's not apparent from looking at memcg code why it
> wouldn't deadlock) very unappealing.  Besides, the only reason
> offline() is currently called under cgroup_mutex is history.  We can
> move it out of cgroup_mutex right now.

Thanks for taking the patch into your tree for now, and thanks to
Michal and Hannes for supporting it.

Yes, we're not sealing a door shut with this one-liner.  My first
reaction to the deadlock was indeed, what's the cgroup_mutex for here?
And I've seen enough deadlocks on cgroup_mutex (though most from this
issue, I now believe) to welcome the idea of reducing its blanket use.

But I think there are likely to be bumps along that road (just as there
have been along the workqueue-ification road), so this ordered workqueue
appears much the safer option for now.  Please rip it out again when the
cgroup_mutex is safely removed from this path.

(I've certainly written memcg code myself that "knows" it's already
serialized by cgroup_mutex at the outer level: code that, I think,
never reached anyone else's tree, but I'm not certain of that.)

> But even with offline being called outside cgroup_mutex, IIRC, the
> described problem would still be able to deadlock as long as the tree
> depth is deeper than the max concurrency level of the destruction
> workqueue.  Sure, we can give it a large enough number, but it's
> generally nasty.

You worry me there: I certainly don't want to be introducing new
deadlocks.  You understand workqueues much better than most of us: I'm
not sure what "max concurrency level of the destruction workqueue" is,
but it sounds uncomfortably like an ordered workqueue's max_active 1.

You don't return to this concern in the following mails of the thread:
did you later decide that it actually won't be a problem?  I'll assume
so for the moment, since you took the patch, but please reassure me.

> One thing I don't get is why memcg has such a reverse dependency at all.
> Why does the parent wait for its descendants to do something during
> offline?  Shouldn't it be able to just bail and let whatever
> descendant which is still busy propagate things upwards?  That's a
> usual pattern we use for tree shutdowns anyway.  Would that be nasty to
> implement in memcg?

I've no idea how nasty it would be to change memcg around, but Michal
and Hannes appear very open to doing so.  I do think that memcg's current
expectation is very reasonable: it's perfectly normal that a rmdir cannot
succeed until the directory is empty, and to depend upon that fact; but
the use of workqueue made some things asynchronous which were not before,
which has led to some surprises.

Hugh
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Tejun Heo @ 2014-02-07 20:35 UTC
To: Hugh Dickins
Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
    Johannes Weiner, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

Hello, Hugh.

On Fri, Feb 07, 2014 at 12:20:44PM -0800, Hugh Dickins wrote:
> > But even with offline being called outside cgroup_mutex, IIRC, the
> > described problem would still be able to deadlock as long as the tree
> > depth is deeper than the max concurrency level of the destruction
> > workqueue.  Sure, we can give it a large enough number, but it's
> > generally nasty.
>
> You worry me there: I certainly don't want to be introducing new
> deadlocks.  You understand workqueues much better than most of us: I'm
> not sure what "max concurrency level of the destruction workqueue" is,
> but it sounds uncomfortably like an ordered workqueue's max_active 1.

Ooh, max_active is always a finite number.  The only reason we usually
don't worry about it is because it is normally large enough that the
existing dependency chains don't cause deadlocks.  The theoretical
problem with cgroup is that the dependency chain can grow arbitrarily
long, and multiple removals along different subhierarchies can overlap,
which means that there can be multiple long dependency chains among
work items.  The probability would be extremely low, but deadlock might
be possible even with a relatively high max_active.

Besides, the reason we reduced max_active in the first place was
because destruction work items tend to just stack up without any actual
concurrency benefits, so increasing the concurrency level seems a bit
nasty to me (but probably a lot of that traffic jam was from
cgroup_mutex, and once we take that out of the picture, it could become
fine).

> You don't return to this concern in the following mails of the thread:
> did you later decide that it actually won't be a problem?  I'll assume
> so for the moment, since you took the patch, but please reassure me.

I was just worrying about a different solution where we take the
css_offline invocation outside of cgroup_mutex and bump up max_active.
There's nothing to worry about with your patch.  Sorry about not being
clear. :)

> > One thing I don't get is why memcg has such a reverse dependency at all.
> > Why does the parent wait for its descendants to do something during
> > offline?  Shouldn't it be able to just bail and let whatever
> > descendant which is still busy propagate things upwards?  That's a
> > usual pattern we use for tree shutdowns anyway.  Would that be nasty to
> > implement in memcg?
>
> I've no idea how nasty it would be to change memcg around, but Michal
> and Hannes appear very open to doing so.  I do think that memcg's current
> expectation is very reasonable: it's perfectly normal that a rmdir cannot
> succeed until the directory is empty, and to depend upon that fact; but
> the use of workqueue made some things asynchronous which were not before,
> which has led to some surprises.

Maybe.  The thing is that ->css_offline() isn't really comparable to
rmdir.  ->css_free() is, and is fully ordered through refcnts as one
would expect.  Whether ->css_offline() should be ordered similarly, so
that the parent's offline is called iff all its children finished
offlining, I'm not sure.  Maybe it'd be something nice to have, but I
kinda wanna keep the offline hook and its usages simple and limited.
It's not where the actual destruction should happen.  It's just a
notification to get ready.

Looks like Johannes's patch is headed towards that direction - moving
destruction from ->css_offline() to ->css_free() - so if that works
out, I think we should be good for the time being.

Thanks!

--
tejun
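Tejun's finite-max_active concern, reduced to a toy example (illustrative
code, not from the thread): each level of the hierarchy is a work item
that waits on the level below, and with an effective concurrency of one,
a waiter picked up ahead of the work it waits on wedges the queue; deeper
hierarchies reproduce the same wedge at any finite max_active.

	#include <linux/completion.h>
	#include <linux/workqueue.h>

	static DECLARE_COMPLETION(child_done);

	static void parent_fn(struct work_struct *work)
	{
		/* mirrors a parent's offline waiting on its child's reparenting */
		wait_for_completion(&child_done);
	}

	static void child_fn(struct work_struct *work)
	{
		complete(&child_done);
	}

	static DECLARE_WORK(parent_work, parent_fn);
	static DECLARE_WORK(child_work, child_fn);

	static void demo(struct workqueue_struct *wq)
	{
		/*
		 * If wq's effective concurrency is 1 and parent_work is
		 * picked up first, the sole worker blocks in
		 * wait_for_completion() and child_work never runs: the
		 * queue is wedged for good.
		 */
		queue_work(wq, &parent_work);
		queue_work(wq, &child_work);
	}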
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Hugh Dickins @ 2014-02-07 21:06 UTC
To: Tejun Heo
Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
    Michal Hocko, Johannes Weiner, Greg Thelen, Michel Lespinasse,
    Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel, linux-mm

On Fri, 7 Feb 2014, Tejun Heo wrote:
> On Fri, Feb 07, 2014 at 12:20:44PM -0800, Hugh Dickins wrote:
> > You don't return to this concern in the following mails of the thread:
> > did you later decide that it actually won't be a problem?  I'll assume
> > so for the moment, since you took the patch, but please reassure me.
>
> I was just worrying about a different solution where we take the
> css_offline invocation outside of cgroup_mutex and bump up max_active.
> There's nothing to worry about with your patch.  Sorry about not being
> clear. :)

Thanks a lot for your detailed and admirably responsive explanations.
You were looking ahead, I see that now, and I'm gratefully reassured :)

Hugh
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Tejun Heo @ 2014-02-07 15:21 UTC
To: Hugh Dickins
Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
    Johannes Weiner, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
>
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
>
> Just use an ordered workqueue for cgroup_destroy_wq.
>
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.10+

Applied to cgroup/for-3.14-fixes with the comment updated to indicate
that this is temporary.

Thanks.

--
tejun
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Johannes Weiner @ 2014-02-07 16:43 UTC
To: Hugh Dickins
Cc: Tejun Heo, Filipe Brandenburger, Li Zefan, Andrew Morton,
    Michal Hocko, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
>
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
>
> Just use an ordered workqueue for cgroup_destroy_wq.
>
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.10+

I think this is a good idea for now and -stable:
Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Long-term, I think we may want to get rid of the reparenting in
css_offline() entirely and only do it in the naturally ordered
css_free() callback.  We only reparent in css_offline() because
swapout records pin the css and we don't want to hang on to
potentially gigabytes of unreclaimable (css_tryget() disabled) cache
indefinitely until the last swapout records disappear.

So I'm currently mucking around with the following patch, which drops
the css pin from swapout records and reparents them in css_free().
It's lightly tested and there might well be dragons, but I don't see
any fundamental problems with it.

What do you think?

---
 include/linux/page_cgroup.h |   8 ++++
 mm/memcontrol.c             | 101 +++++++++++++-------------------------------
 mm/page_cgroup.c            |  41 ++++++++++++++++++
 3 files changed, 78 insertions(+), 72 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..3ededda8934f 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -111,6 +111,8 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern unsigned long swap_cgroup_migrate(unsigned short old,
+					 unsigned short new);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
 #else
@@ -127,6 +129,12 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 	return 0;
 }
 
+static inline unsigned long swap_cgroup_migrate(unsigned short old,
+						unsigned short new)
+{
+	return 0;
+}
+
 static inline int
 swap_cgroup_swapon(int type, unsigned long max_pages)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53385cd4e6f0..e2a0d3986c74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -892,11 +892,9 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, long nr_pages)
 {
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_pages);
 }
 
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
@@ -4234,15 +4232,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
 	 */
 
 	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
+
 	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
+
+	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+		mem_cgroup_swap_statistics(memcg, 1);
+
 	/*
 	 * Migration does not charge the res_counter for the
 	 * replacement page, so leave it alone when phasing out the
@@ -4351,10 +4346,6 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 
 	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
 
-	/*
-	 * record memcg information, if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
 	if (do_swap_account && swapout && memcg)
 		swap_cgroup_record(ent, mem_cgroup_id(memcg));
 }
@@ -4383,8 +4374,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 		 */
 		if (!mem_cgroup_is_root(memcg))
 			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
+		mem_cgroup_swap_statistics(memcg, -1);
 	}
 	rcu_read_unlock();
 }
@@ -4412,20 +4402,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 	new_id = mem_cgroup_id(to);
 
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_swap_statistics(to, true);
-		/*
-		 * This function is only called from task migration context now.
-		 * It postpones res_counter and refcount handling till the end
-		 * of task migration(mem_cgroup_clear_mc()) for performance
-		 * improvement. But we cannot postpone css_get(to) because if
-		 * the process that has been moved to @to does swap-in, the
-		 * refcount of @to might be decreased to 0.
-		 *
-		 * We are in attach() phase, so the cgroup is guaranteed to be
-		 * alive, so we can just call css_get().
-		 */
-		css_get(&to->css);
+		mem_cgroup_swap_statistics(from, -1);
+		mem_cgroup_swap_statistics(to, 1);
 		return 0;
 	}
 	return -EINVAL;
@@ -6611,7 +6589,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	kmem_cgroup_css_offline(memcg);
 
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
-	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
@@ -6619,41 +6596,26 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	unsigned long swaps_moved;
+	struct mem_cgroup *parent;
+
 	/*
-	 * XXX: css_offline() would be where we should reparent all
-	 * memory to prepare the cgroup for destruction.  However,
-	 * memcg does not do css_tryget() and res_counter charging
-	 * under the same RCU lock region, which means that charging
-	 * could race with offlining.  Offlining only happens to
-	 * cgroups with no tasks in them but charges can show up
-	 * without any tasks from the swapin path when the target
-	 * memcg is looked up from the swapout record and not from the
-	 * current task as it usually is.  A race like this can leak
-	 * charges and put pages with stale cgroup pointers into
-	 * circulation:
-	 *
-	 * #0                        #1
-	 *                           lookup_swap_cgroup_id()
-	 *                           rcu_read_lock()
-	 *                           mem_cgroup_lookup()
-	 *                           css_tryget()
-	 *                           rcu_read_unlock()
-	 * disable css_tryget()
-	 * call_rcu()
-	 *   offline_css()
-	 *     reparent_charges()
-	 *                           res_counter_charge()
-	 *                           css_put()
-	 *                             css_free()
-	 *                           pc->mem_cgroup = dead memcg
-	 *                           add page to lru
-	 *
-	 * The bulk of the charges are still moved in offline_css() to
-	 * avoid pinning a lot of pages in case a long-term reference
-	 * like a swapout record is deferring the css_free() to long
-	 * after offlining.  But this makes sure we catch any charges
-	 * made after offlining:
+	 * Migrate all swap entries to the parent.  There are no more
+	 * references to the css, so no new swap out records can show
+	 * up.  Any concurrently faulting pages will either get this
+	 * offline cgroup and charge the faulting task, or get the
+	 * migrated parent id and charge the parent for the in-memory
+	 * page.  uncharge_swap() will balance the res_counter in the
+	 * parent either way, whether it still fixes this group's
+	 * res_counter is irrelevant at this point.
 	 */
+	parent = parent_mem_cgroup(memcg);
+	if (!parent)
+		parent = root_mem_cgroup;
+	swaps_moved = swap_cgroup_migrate(mem_cgroup_id(memcg),
+					  mem_cgroup_id(parent));
+	mem_cgroup_swap_statistics(parent, swaps_moved);
+
 	mem_cgroup_reparent_charges(memcg);
 
 	memcg_destroy_kmem(memcg);
@@ -6966,7 +6928,6 @@ static void __mem_cgroup_clear_mc(void)
 {
 	struct mem_cgroup *from = mc.from;
 	struct mem_cgroup *to = mc.to;
-	int i;
 
 	/* we must uncharge all the leftover precharges from mc.to */
 	if (mc.precharge) {
@@ -6981,16 +6942,13 @@ static void __mem_cgroup_clear_mc(void)
 		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
 		mc.moved_charge = 0;
 	}
-	/* we must fixup refcnts and charges */
+	/* we must fixup charges */
 	if (mc.moved_swap) {
 		/* uncharge swap account from the old cgroup */
 		if (!mem_cgroup_is_root(mc.from))
 			res_counter_uncharge(&mc.from->memsw,
 					     PAGE_SIZE * mc.moved_swap);
 
-		for (i = 0; i < mc.moved_swap; i++)
-			css_put(&mc.from->css);
-
 		if (!mem_cgroup_is_root(mc.to)) {
 			/*
 			 * we charged both to->res and to->memsw, so we should
@@ -6999,7 +6957,6 @@ static void __mem_cgroup_clear_mc(void)
 			res_counter_uncharge(&mc.to->res,
 					     PAGE_SIZE * mc.moved_swap);
 		}
-		/* we've already done css_get(mc.to) */
 		mc.moved_swap = 0;
 	}
 	memcg_oom_recover(from);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index cfd162882c00..ca04feaae7fe 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -459,6 +459,47 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 	return lookup_swap_cgroup(ent, NULL)->id;
 }
 
+/**
+ * swap_cgroup_migrate - migrate all entries of one id to another
+ * @old: old id
+ * @new: new id
+ *
+ * Caller has to be able to deal with swapon/swapoff racing.
+ *
+ * Returns number of migrated entries.
+ */
+unsigned long swap_cgroup_migrate(unsigned short old, unsigned short new)
+{
+	unsigned long migrated = 0;
+	unsigned int type;
+
+	for (type = 0; type < MAX_SWAPFILES; type++) {
+		struct swap_cgroup_ctrl *ctrl;
+		unsigned long flags;
+		unsigned int page;
+
+		ctrl = &swap_cgroup_ctrl[type];
+		spin_lock_irqsave(&ctrl->lock, flags);
+		for (page = 0; page < ctrl->length; page++) {
+			struct swap_cgroup *base;
+			pgoff_t offset;
+
+			base = page_address(ctrl->map[page]);
+			for (offset = 0; offset < SC_PER_PAGE; offset++) {
+				struct swap_cgroup *sc;
+
+				sc = base + offset;
+				if (sc->id == old) {
+					sc->id = new;
+					migrated++;
+				}
+			}
+		}
+		spin_unlock_irqrestore(&ctrl->lock, flags);
+	}
+	return migrated;
+}
+
 int swap_cgroup_swapon(int type, unsigned long max_pages)
 {
 	void *array;
--
1.8.5.3
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Michal Hocko @ 2014-02-10 15:46 UTC
To: Johannes Weiner
Cc: Hugh Dickins, Tejun Heo, Filipe Brandenburger, Li Zefan,
    Andrew Morton, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

On Fri 07-02-14 11:43:21, Johannes Weiner wrote:
[...]
> Long-term, I think we may want to get rid of the reparenting in
> css_offline() entirely and only do it in the naturally ordered
> css_free() callback.  We only reparent in css_offline() because
> swapout records pin the css and we don't want to hang on to
> potentially gigabytes of unreclaimable (css_tryget() disabled) cache
> indefinitely until the last swapout records disappear.
>
> So I'm currently mucking around with the following patch, which drops
> the css pin from swapout records and reparents them in css_free().
> It's lightly tested and there might well be dragons, but I don't see
> any fundamental problems with it.
>
> What do you think?

Hugh has posted something like this back in December
(http://marc.info/?l=linux-mm&m=138718299304670&w=2).

I am not entirely happy about scanning a potentially huge number of swap
records.  A trivial optimization would check the memsw counter and break
out early, but it would still leave a potentially full scan possible.

I guess we shouldn't care much about when css_free is called.  I think
we should split the reparenting into two parts: LRU reparenting without
any hard requirements, and charge reparenting which guarantees that
nothing is left behind.  The first one would be called from css_offline
and the second one from css_free.

kmem accounting pins memcg as well, btw, so taking care of swap is not
sufficient to guarantee an early css_free.

> ---
[...]

--
Michal Hocko
SUSE Labs
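The trivial early bail-out Michal mentions might look like this
hypothetical variant of Johannes's swap_cgroup_migrate(), where the
caller passes in the group's remaining swap count (e.g. read from its
MEM_CGROUP_STAT_SWAP counter); the extra @expected parameter is an
assumption for illustration, not part of the posted patch:

	/* Hypothetical variant: stop scanning once @expected entries moved. */
	unsigned long swap_cgroup_migrate(unsigned short old, unsigned short new,
					  unsigned long expected)
	{
		unsigned long migrated = 0;
		unsigned int type;

		for (type = 0; type < MAX_SWAPFILES && migrated < expected; type++) {
			struct swap_cgroup_ctrl *ctrl = &swap_cgroup_ctrl[type];
			unsigned long flags;
			unsigned int page;

			spin_lock_irqsave(&ctrl->lock, flags);
			for (page = 0; page < ctrl->length; page++) {
				struct swap_cgroup *sc = page_address(ctrl->map[page]);
				pgoff_t offset;

				for (offset = 0; offset < SC_PER_PAGE; offset++) {
					if (sc[offset].id == old) {
						sc[offset].id = new;
						/* all known entries moved: stop */
						if (++migrated == expected)
							goto unlock;
					}
				}
			}
	unlock:
			spin_unlock_irqrestore(&ctrl->lock, flags);
		}
		return migrated;
	}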
* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
From: Hugh Dickins @ 2014-02-12 22:59 UTC
To: Johannes Weiner
Cc: Hugh Dickins, Tejun Heo, Filipe Brandenburger, Li Zefan,
    Andrew Morton, Michal Hocko, Greg Thelen, Michel Lespinasse,
    Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel, linux-mm

On Fri, 7 Feb 2014, Johannes Weiner wrote:
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> >
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> >
> > Just use an ordered workqueue for cgroup_destroy_wq.
> >
> > Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> > Suggested-by: Filipe Brandenburger <filbranden@google.com>
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: stable@vger.kernel.org # 3.10+
>
> I think this is a good idea for now and -stable:
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

You might be wondering why this patch didn't reach Linus yet.  It's
because more thorough testing, by others here, found that it wasn't
always solving the problem: so I asked Tejun privately to hold off from
sending it in, until we'd worked out why not.

Most of our testing being on a v3.11-based kernel, it was perfectly
possible that the problem was merely our own, e.g. missing Tejun's
8a2b75384444 ("workqueue: fix ordered workqueues in NUMA setups").
But that turned out not to be enough to fix it either.

Then Filipe pointed out how percpu_ref_kill_and_confirm() uses
call_rcu_sched() before we ever get to put the offline on to the
workqueue: by the time we get to the workqueue, the ordering has
already been lost.

So, thanks for the Acks, but I'm afraid that this ordered workqueue
solution is just not good enough: we should simply forget that patch
and provide a different answer.  So I'm now posting a couple of
alternative solutions: 1/2 from Filipe at the memcg end, and 2/2 from
me at the cgroup end.  Each of these has stood up to better testing,
so you can choose between them, or work out a better answer.

(By the way, I have another little pair of memcg/cgroup fixes to post
shortly, nothing to do with these two: it would be less confusing if I
had some third fix to add in there, but sadly not.)

Hugh
* [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction
From: Hugh Dickins @ 2014-02-12 23:06 UTC
To: Tejun Heo, Michal Hocko
Cc: Johannes Weiner, Filipe Brandenburger, Li Zefan, Andrew Morton,
    Greg Thelen, Michel Lespinasse, Markus Blank-Burian, Shawn Bohrer,
    cgroups, linux-kernel, linux-mm

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding cgroup_mutex
which prevents the child from reaching its mem_cgroup_reparent_charges().

Further testing showed that an ordered workqueue for cgroup_destroy_wq
is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
stage on the way can mess up the order before reaching the workqueue.

Instead bring back v3.11's css kill_cnt, repurposing it to make sure
that offline_css() is not called for parent before it has been called
for all children.

Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Filipe Brandenburger <filbranden@google.com>
Cc: stable@vger.kernel.org # v3.10+ (but will need extra care)
---
This is an alternative to Filipe's 1/2: there's no need for both,
but each has its merits.  I prefer Filipe's, which is much easier to
understand: this one made more sense in v3.11, when it was just a matter
of extending the use of css_kill_cnt; but might be preferred if offlining
children before parent is thought to be a good idea generally.

 include/linux/cgroup.h |  3 +++
 kernel/cgroup.c        | 21 +++++++++++++++++++++
 2 files changed, 24 insertions(+)

--- 3.14-rc2/include/linux/cgroup.h	2014-02-02 18:49:07.033302094 -0800
+++ linux/include/linux/cgroup.h	2014-02-11 15:59:22.720393186 -0800
@@ -79,6 +79,9 @@ struct cgroup_subsys_state {
 
 	unsigned long flags;
 
+	/* ensure children are offlined before parent */
+	atomic_t kill_cnt;
+
 	/* percpu_ref killing and RCU release */
 	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
--- 3.14-rc2/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
+++ linux/kernel/cgroup.c	2014-02-11 15:57:56.000391125 -0800
@@ -175,6 +175,7 @@ static int need_forkexit_callback __read
 
 static struct cftype cgroup_base_files[];
 
+static void css_killed_ref_fn(struct percpu_ref *ref);
 static void cgroup_destroy_css_killed(struct cgroup *cgrp);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
@@ -4043,6 +4044,7 @@ static void init_css(struct cgroup_subsy
 	css->cgroup = cgrp;
 	css->ss = ss;
 	css->flags = 0;
+	atomic_set(&css->kill_cnt, 1);
 
 	if (cgrp->parent)
 		css->parent = cgroup_css(cgrp->parent, ss);
@@ -4292,6 +4294,7 @@ static void css_killed_work_fn(struct wo
 {
 	struct cgroup_subsys_state *css =
 		container_of(work, struct cgroup_subsys_state, destroy_work);
+	struct cgroup_subsys_state *parent = css->parent;
 	struct cgroup *cgrp = css->cgroup;
 
 	mutex_lock(&cgroup_mutex);
@@ -4320,6 +4323,12 @@ static void css_killed_work_fn(struct wo
 	 * destruction happens only after all css's are released.
 	 */
 	css_put(css);
+
+	/*
+	 * Put the parent's kill_cnt reference from kill_css(), and
+	 * schedule its ->css_offline() if all children are now offline.
+	 */
+	css_killed_ref_fn(&parent->refcnt);
 }
 
 /* css kill confirmation processing requires process context, bounce */
@@ -4328,6 +4337,9 @@ static void css_killed_ref_fn(struct per
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
+	if (!atomic_dec_and_test(&css->kill_cnt))
+		return;
+
 	INIT_WORK(&css->destroy_work, css_killed_work_fn);
 	queue_work(cgroup_destroy_wq, &css->destroy_work);
 }
@@ -4362,6 +4374,15 @@ static void kill_css(struct cgroup_subsy
 	 * css is confirmed to be seen as killed on all CPUs.
 	 */
 	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
+
+	/*
+	 * Make sure that ->css_offline() will not be called for parent
+	 * before it has been called for all children: this ordering
+	 * requirement is important for memcg, where parent's offline
+	 * might wait for a child's, leading to deadlock.
+	 */
+	atomic_inc(&css->parent->kill_cnt);
+	css_killed_ref_fn(&css->refcnt);
 }
 
 /**
* Re: [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction
  2014-02-12 23:06         ` [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction Hugh Dickins
@ 2014-02-13  0:28           ` Tejun Heo
  2014-02-13  0:38             ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2014-02-13 0:28 UTC (permalink / raw)
To: Hugh Dickins
Cc: Michal Hocko, Johannes Weiner, Filipe Brandenburger, Li Zefan,
    Andrew Morton, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm

Hello, Hugh.

On Wed, Feb 12, 2014 at 03:06:26PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
>
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
>
> Further testing showed that an ordered workqueue for cgroup_destroy_wq
> is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
> stage on the way can mess up the order before reaching the workqueue.
>
> Instead bring back v3.11's css kill_cnt, repurposing it to make sure
> that offline_css() is not called for parent before it has been called
> for all children.
>
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Reviewed-by: Filipe Brandenburger <filbranden@google.com>
> Cc: stable@vger.kernel.org # v3.10+ (but will need extra care)
> ---
> This is an alternative to Filipe's 1/2: there's no need for both,
> but each has its merits. I prefer Filipe's, which is much easier to
> understand: this one made more sense in v3.11, when it was just a matter
> of extending the use of css_kill_cnt; but might be preferred if offlining
> children before parent is thought to be a good idea generally.

Not that your implementation is bad or anything, but the patch itself
somehow makes me cringe a bit. It's probably just because it has to
add to the already overly complicated offline path. Guaranteeing
strict offline ordering might be a good idea, but at least for the
immediate bug fix, I agree that the memcg-specific fix seems better
suited. Let's apply that one and reconsider this one if it turns out
we do need strict offline ordering.

Thanks a lot!

-- 
tejun
* Re: [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction
  2014-02-13  0:28           ` Tejun Heo
@ 2014-02-13  0:38             ` Hugh Dickins
  0 siblings, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2014-02-13 0:38 UTC (permalink / raw)
To: Tejun Heo
Cc: Hugh Dickins, Michal Hocko, Johannes Weiner, Filipe Brandenburger,
    Li Zefan, Andrew Morton, Greg Thelen, Michel Lespinasse,
    Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel, linux-mm

On Wed, 12 Feb 2014, Tejun Heo wrote:
>
> Not that your implementation is bad or anything but the patch itself
> somehow makes me cringe a bit. It's probably just because it has to
> add to the already overly complicated offline path. Guaranteeing
> strict offline ordering might be a good idea but at least for the
> immediate bug fix, I agree that the memcg specific fix seems better
> suited. Let's apply that one and reconsider this one if it turns out
> we do need strict offline ordering.

Yes, I agree completely - thanks.

Hugh
* [PATCH] Revert "cgroup: use an ordered workqueue for cgroup destruction"
  2014-02-12 22:59       ` Hugh Dickins
  2014-02-12 23:06         ` [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction Hugh Dickins
@ 2014-02-13  0:09         ` Tejun Heo
  1 sibling, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2014-02-13 0:09 UTC (permalink / raw)
To: Hugh Dickins
Cc: Johannes Weiner, Filipe Brandenburger, Li Zefan, Andrew Morton,
    Michal Hocko, Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
    Shawn Bohrer, cgroups, linux-kernel, linux-mm
end of thread, other threads:[~2014-02-13  0:39 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2014-02-06 23:56 [PATCH] cgroup: use an ordered workqueue for cgroup destruction Hugh Dickins
2014-02-07 13:45 ` Michal Hocko
2014-02-07 14:04 ` Tejun Heo
2014-02-07 14:37   ` Michal Hocko
2014-02-07 15:13     ` Tejun Heo
2014-02-07 15:28       ` Michal Hocko
2014-02-07 20:20         ` Hugh Dickins
2014-02-07 20:35           ` Tejun Heo
2014-02-07 21:06             ` Hugh Dickins
2014-02-07 15:21 ` Tejun Heo
2014-02-07 16:43   ` Johannes Weiner
2014-02-10 15:46     ` Michal Hocko
2014-02-12 22:59       ` Hugh Dickins
2014-02-12 23:06         ` [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction Hugh Dickins
2014-02-13  0:28           ` Tejun Heo
2014-02-13  0:38             ` Hugh Dickins
2014-02-13  0:09         ` [PATCH] Revert "cgroup: use an ordered workqueue for cgroup destruction" Tejun Heo