All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Tejun Heo <tj@kernel.org>
Cc: Kazuki Yamaguchi <k@rhe.jp>,
	Niklas Cassel <niklas.cassel@axis.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [BUG] sched: leaf_cfs_rq_list use after free
Date: Wed, 16 Mar 2016 16:22:45 +0100	[thread overview]
Message-ID: <20160316152245.GY6344@twins.programming.kicks-ass.net> (raw)
In-Reply-To: <20160316142414.GA6980@mtj.duckdns.org>

On Wed, Mar 16, 2016 at 07:24:14AM -0700, Tejun Heo wrote:
> So, the problem here is that cpu is using css_offline to tear down the
> css.  perf shouldn't have the same problem as it destroys its css from
> css_free.  The distinctions among different callbacks evolved over
> time and we need to update the documentation but here are the rules.
> 
> css_alloc()
> 
> 	This one is obvious.  Alloc and init.  The css will become
> 	visible during css iteration sometime between alloc and
> 	online.
> 
> css_online()
> 
> 	The css is now guaranteed to be visible for css_for_each*()
> 	iterations.  This distinction exists because some controllers
> 	need to propagate state changes downwards requiring a new css
> 	to become visible before it inherits the current state from
> 	the parent.  Conversely, there's no reason to use this
> 	callback if there's no such requirement.
> 
> 	Ex: Freezer which propagates the target state downwards and
> 	needs a new child to inherit the current state while
> 	iteratable.

So it looks like sched uses css_online() for no particular reason
either, I've moved all that into css_alloc().

> 
> css_offline()
> 
> 	The css is going down and draining usages by refusing new
> 	css_tryget_online()'s but still guaranteed to be visible
> 	during css iterations.  Controllers which explicitly needs to
> 	put, say, cache references or need to perform cleanup
> 	operations while the css is iteratable need to use this
> 	method; otherwise, no reason to bother with it.
> 
> 	Ex: blkcg pins csses as part of lookup cache which can prevent
> 	a css from being drained and released, so blkcg uses this call
> 	back to disable caching for the css.
> 
> css_released()
> 
> 	The css is drained and not visible during iteration and will
> 	be freed after a RCU grace period.  Used by controllers which
> 	cache pointers to csses being drained.
> 
> 	Ex: memcg needs to iterate csses which are being drained and
> 	its custom iterator implementation caches RCU protected
> 	pointers to csses.  memcg uses this callback to shoot down
> 	those cached pointers.
> 
> css_free()
> 
> 	The css is drained, not visible to iterations, not used by
> 	anyone, and will be freed immediately.

None of that speaks of where Zombies live, am I to infer that Zombies
pass css_offline() but stall css_released() ?

I don't particularly care about iterating css bits, but I do need my
parent group to still exist, this is now also guaranteed for
css_release(), right? The above documentation also doesn't mention this;
in particular I require that css_release() for any group is not called
before the css_release() of any child group.

The reason I'm interested in css_release() is that there is an RCU grace
period between that and css_free(), so I can kill my call_rcu() from
css_free().

> > So something needs to fundamentally ensure that ->css changes before we
> > go offline the thing.
> 
> I could be mistaken but AFAICS there doesn't seem to be anything
> requiring bothering with the more specialized exit methods.  Given
> that no css iteration is used and everything is lock protected, the
> patch at the end of this messages should do and seems to work fine
> here.  Am I missing something?

Some additional cleanups maybe.. but yes, that seems to work.

Thanks!

---
Subject: sched: Fix/cleanup cgroup teardown/init

The cpu controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit 2e91fa7f6d45
("cgroup: keep zombies associated with their original cgroups") caused
it to explode.

The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.

The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.

Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point to us using css_online() (anymore?) so fold that
in css_alloc().

And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.

Cc: stable@vger.kernel.org
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 35 ++++++++++++++---------------------
 1 file changed, 14 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 04a04e0378fe..bc5076fd0167 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7538,7 +7538,7 @@ void set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
-static void free_sched_group(struct task_group *tg)
+static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
@@ -7564,7 +7564,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 	return tg;
 
 err:
-	free_sched_group(tg);
+	sched_free_group(tg);
 	return ERR_PTR(-ENOMEM);
 }
 
@@ -7584,17 +7584,16 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
 }
 
 /* rcu callback to free various structures associated with a task group */
-static void free_sched_group_rcu(struct rcu_head *rhp)
+static void sched_free_group_rcu(struct rcu_head *rhp)
 {
 	/* now it should be safe to free those cfs_rqs */
-	free_sched_group(container_of(rhp, struct task_group, rcu));
+	sched_free_group(container_of(rhp, struct task_group, rcu));
 }
 
-/* Destroy runqueue etc associated with a task group */
 void sched_destroy_group(struct task_group *tg)
 {
 	/* wait for possible concurrent references to cfs_rqs complete */
-	call_rcu(&tg->rcu, free_sched_group_rcu);
+	call_rcu(&tg->rcu, sched_free_group_rcu);
 }
 
 void sched_offline_group(struct task_group *tg)
@@ -8053,31 +8052,26 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
 
+	sched_online_group(tg, parent);
+
 	return &tg->css;
 }
 
-static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
+static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
-	struct task_group *parent = css_tg(css->parent);
 
-	if (parent)
-		sched_online_group(tg, parent);
-	return 0;
+	sched_offline_group(tg);
 }
 
 static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_destroy_group(tg);
-}
-
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
-{
-	struct task_group *tg = css_tg(css);
-
-	sched_offline_group(tg);
+	/*
+	 * Relies on the RCU grace period between css_released() and this.
+	 */
+	sched_free_group(tg);
 }
 
 static void cpu_cgroup_fork(struct task_struct *task)
@@ -8437,9 +8431,8 @@ static struct cftype cpu_files[] = {
 
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
+	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
-	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,

  parent reply	other threads:[~2016-03-16 15:23 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-12  9:42 [BUG] sched: leaf_cfs_rq_list use after free Kazuki Yamaguchi
2016-03-12 13:59 ` Peter Zijlstra
2016-03-14 11:20 ` Peter Zijlstra
2016-03-14 12:09   ` Peter Zijlstra
2016-03-16 14:24     ` Tejun Heo
2016-03-16 14:44       ` Tejun Heo
2016-03-16 15:22       ` Peter Zijlstra [this message]
2016-03-16 16:50         ` Tejun Heo
2016-03-16 17:04           ` Peter Zijlstra
2016-03-16 17:49             ` Tejun Heo
2016-03-17  8:29         ` Niklas Cassel
2016-03-21 11:15         ` [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init tip-bot for Peter Zijlstra
2016-04-28 18:40           ` Peter Zijlstra
2016-04-28 18:51             ` Greg Kroah-Hartman
2016-04-28 21:36               ` Peter Zijlstra
2016-05-02  3:06                 ` Greg Kroah-Hartman
  -- strict thread matches above, loose matches on Subject: below --
2016-03-04 10:41 [BUG] sched: leaf_cfs_rq_list use after free Niklas Cassel
2016-03-10 12:54 ` Peter Zijlstra
2016-03-11 17:02   ` Niklas Cassel
2016-03-11 17:28     ` Peter Zijlstra
2016-03-11 18:20   ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160316152245.GY6344@twins.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=k@rhe.jp \
    --cc=linux-kernel@vger.kernel.org \
    --cc=niklas.cassel@axis.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.