Re: [RFC][PATCH] sched: Fix race in task_group()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Stefan Bader <stefan.bader@canonical.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: mingo@kernel.org, Oleg Nesterov <oleg@redhat.com>,
	Paul Turner <pjt@google.com>, Mike Galbraith <efault@gmx.de>,
	Andrew Vagin <avagin@openvz.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [RFC][PATCH] sched: Fix race in task_group()
Date: Tue, 26 Jun 2012 19:49:47 +0200	[thread overview]
Message-ID: <4FE9F63B.1000908@canonical.com> (raw)
In-Reply-To: <1340718515.21991.83.camel@twins>


[-- Attachment #1.1: Type: text/plain, Size: 6286 bytes --]

On 26.06.2012 15:48, Peter Zijlstra wrote:
> Here's one that's actually compile tested (with the right CONFIG_foo
> enabled) and I fixed the autogroup lockdep splat.
> 
> ---
> Subject: sched: Fix race in task_group()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 22 Jun 2012 13:36:05 +0200
> 
> Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
> call task_group() too many times in set_task_rq()"), he found the reason
> to be that the multiple task_group() invocations in set_task_rq()
> returned different values.
> 
> Looking at all that I found a lack of serialization and plain wrong
> comments.
> 
> The below tries to fix it using an extra pointer which is updated under
> the appropriate scheduler locks. Its not pretty, but I can't really see
> another way given how all the cgroup stuff works.
> 
> Reported-by: Stefan Bader <stefan.bader@canonical.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/init_task.h |   12 +++++++++++-
>  include/linux/sched.h     |    5 ++++-
>  kernel/sched/core.c       |    9 ++++++++-
>  kernel/sched/sched.h      |   23 ++++++++++-------------
>  4 files changed, 33 insertions(+), 16 deletions(-)
> 
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -123,8 +123,17 @@ extern struct group_info init_groups;
>  
>  extern struct cred init_cred;
>  
> +extern struct task_group root_task_group;
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +# define INIT_CGROUP_SCHED(tsk)						\
> +	.sched_task_group = &root_task_group,
> +#else
> +# define INIT_CGROUP_SCHED(tsk)
> +#endif
> +
>  #ifdef CONFIG_PERF_EVENTS
> -# define INIT_PERF_EVENTS(tsk)					\
> +# define INIT_PERF_EVENTS(tsk)						\
>  	.perf_event_mutex = 						\
>  		 __MUTEX_INITIALIZER(tsk.perf_event_mutex),		\
>  	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
> @@ -168,6 +177,7 @@ extern struct cred init_cred;
>  	},								\
>  	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
>  	INIT_PUSHABLE_TASKS(tsk)					\
> +	INIT_CGROUP_SCHED(tsk)						\
>  	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
>  	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
>  	.real_parent	= &tsk,						\
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1246,6 +1246,9 @@ struct task_struct {
>  	const struct sched_class *sched_class;
>  	struct sched_entity se;
>  	struct sched_rt_entity rt;
> +#ifdef CONFIG_CGROUP_SCHED
> +	struct task_group *sched_task_group;
> +#endif
>  
>  #ifdef CONFIG_NUMA
>  	unsigned long	 numa_contrib;
> @@ -2749,7 +2752,7 @@ extern int sched_group_set_rt_period(str
>  extern long sched_group_rt_period(struct task_group *tg);
>  extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
>  #endif
> -#endif
> +#endif /* CONFIG_CGROUP_SCHED */
>  
>  extern int task_can_switch_user(struct user_struct *up,
>  					struct task_struct *tsk);
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1096,7 +1096,7 @@ void set_task_cpu(struct task_struct *p,
>  	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
>  	 *
>  	 * sched_move_task() holds both and thus holding either pins the cgroup,
> -	 * see set_task_rq().
> +	 * see task_group().
>  	 *
>  	 * Furthermore, all task_rq users should acquire both locks, see
>  	 * task_rq_lock().
> @@ -7712,6 +7712,7 @@ void sched_destroy_group(struct task_gro
>   */
>  void sched_move_task(struct task_struct *tsk)
>  {
> +	struct task_group *tg;
>  	int on_rq, running;
>  	unsigned long flags;
>  	struct rq *rq;
> @@ -7726,6 +7727,12 @@ void sched_move_task(struct task_struct
>  	if (unlikely(running))
>  		tsk->sched_class->put_prev_task(rq, tsk);
>  
> +	tg = container_of(task_subsys_state_check(tsk, cpu_cgroup_subsys_id,
> +				lockdep_is_held(&tsk->sighand->siglock)),
> +			  struct task_group, css);
> +	tg = autogroup_task_group(tsk, tg);
> +	tsk->sched_task_group = tg;
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	if (tsk->sched_class->task_move_group)
>  		tsk->sched_class->task_move_group(tsk, on_rq);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -554,22 +554,19 @@ extern int group_balance_cpu(struct sche
>  /*
>   * Return the group to which this tasks belongs.
>   *
> - * We use task_subsys_state_check() and extend the RCU verification with
> - * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
> - * task it moves into the cgroup. Therefore by holding either of those locks,
> - * we pin the task to the current cgroup.
> + * We cannot use task_subsys_state() and friends because the cgroup
> + * subsystem changes that value before the cgroup_subsys::attach() method
> + * is called, therefore we cannot pin it and might observe the wrong value.
> + *
> + * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
> + * core changes this before calling sched_move_task().
> + *
> + * Instead we use a 'copy' which is updated from sched_move_task() while
> + * holding both task_struct::pi_lock and rq::lock.
>   */
>  static inline struct task_group *task_group(struct task_struct *p)
>  {
> -	struct task_group *tg;
> -	struct cgroup_subsys_state *css;
> -
> -	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
> -			lockdep_is_held(&p->pi_lock) ||
> -			lockdep_is_held(&task_rq(p)->lock));
> -	tg = container_of(css, struct task_group, css);
> -
> -	return autogroup_task_group(p, tg);
> +	return p->sched_task_group;
>  }
>  
>  /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
> 

I ran this version through the testcase and no more warning indeed. Also the
crash is not happening anymore (with the backported version).
This should probably get a "Cc: stable@vger.kernel.org # 2.6.38+" into the s-o-b
area. I don't think 2.6.38..2.6.39 have a longterm support but at least that was
the time when autogroup came in. For 3.0..3.2 the patch needs a bit of tweaking
due to some conference boredom. ;) I am  attaching the backport I was using for
the test for convenience. Although ... it has to be refreshed after the original
patch has landed upstream...

Cheers,
-Stefan


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-sched-Fix-race-in-task_group.patch --]
[-- Type: text/x-diff; name="0001-sched-Fix-race-in-task_group.patch", Size: 5660 bytes --]

From d751ab1f1e532f32412d99b71a1bfea3e5282d07 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 22 Jun 2012 13:36:00 +0200
Subject: [PATCH] sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
call task_group() too many times in set_task_rq()"), he found the reason
to be that the multiple task_group() invocations in set_task_rq()
returned different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. Its not pretty, but I can't really see
another way given how all the cgroup stuff works.

Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[backported to apply to 3.0 and 3.2]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 include/linux/init_task.h |   12 +++++++++++-
 include/linux/sched.h     |    5 ++++-
 kernel/sched.c            |   32 ++++++++++++++++++--------------
 3 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 32574ee..13b2684 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -117,8 +117,17 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
+extern struct task_group root_task_group;
+
+#ifdef CONFIG_CGROUP_SCHED
+# define INIT_CGROUP_SCHED(tsk)						\
+	.sched_task_group = &root_task_group,
+#else
+# define INIT_CGROUP_SCHED(tsk)
+#endif
+
 #ifdef CONFIG_PERF_EVENTS
-# define INIT_PERF_EVENTS(tsk)					\
+# define INIT_PERF_EVENTS(tsk)						\
 	.perf_event_mutex = 						\
 		 __MUTEX_INITIALIZER(tsk.perf_event_mutex),		\
 	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
@@ -155,6 +164,7 @@ extern struct cred init_cred;
 	},								\
 	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
 	INIT_PUSHABLE_TASKS(tsk)					\
+	INIT_CGROUP_SCHED(tsk)						\
 	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
 	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
 	.real_parent	= &tsk,						\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 56de5c1..1fd9884 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1242,6 +1242,9 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_struct *sched_task_group;
+#endif
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -2646,7 +2649,7 @@ extern int sched_group_set_rt_period(struct task_group *tg,
 extern long sched_group_rt_period(struct task_group *tg);
 extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
 #endif
-#endif
+#endif /* CONFIG_CGROUP_SCHED */
 
 extern int task_can_switch_user(struct user_struct *up,
 					struct task_struct *tsk);
diff --git a/kernel/sched.c b/kernel/sched.c
index aae0c1d..b99a61e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -746,22 +746,19 @@ static inline int cpu_of(struct rq *rq)
 /*
  * Return the group to which this tasks belongs.
  *
- * We use task_subsys_state_check() and extend the RCU verification with
- * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
- * task it moves into the cgroup. Therefore by holding either of those locks,
- * we pin the task to the current cgroup.
+ * We cannot use task_subsys_state() and friends because the cgroup
+ * subsystem changes that value before the cgroup_subsys::attach() method
+ * is called, therefore we cannot pin it and might observe the wrong value.
+ *
+ * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
+ * core changes this before calling sched_move_task().
+ *
+ * Instead we use a 'copy' which is updated from sched_move_task() while
+ * holding both task_struct::pi_lock and rq::lock.
  */
 static inline struct task_group *task_group(struct task_struct *p)
 {
-	struct task_group *tg;
-	struct cgroup_subsys_state *css;
-
-	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
-			lockdep_is_held(&p->pi_lock) ||
-			lockdep_is_held(&task_rq(p)->lock));
-	tg = container_of(css, struct task_group, css);
-
-	return autogroup_task_group(p, tg);
+	return p->sched_task_group;
 }
 
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2373,7 +2370,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
 	 *
 	 * sched_move_task() holds both and thus holding either pins the cgroup,
-	 * see set_task_rq().
+	 * see task_group().
 	 *
 	 * Furthermore, all task_rq users should acquire both locks, see
 	 * task_rq_lock().
@@ -8765,6 +8762,7 @@ void sched_destroy_group(struct task_group *tg)
  */
 void sched_move_task(struct task_struct *tsk)
 {
+	struct task_group *tg;
 	int on_rq, running;
 	unsigned long flags;
 	struct rq *rq;
@@ -8779,6 +8777,12 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		tsk->sched_class->put_prev_task(rq, tsk);
 
+	tg = container_of(task_subsys_state_check(tsk, cpu_cgroup_subsys_id,
+				lockdep_is_held(&tsk->sighand->siglock)),
+			  struct task_group, css);
+	tg = autogroup_task_group(tsk, tg);
+	tsk->sched_task_group = tg;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_move_group)
 		tsk->sched_class->task_move_group(tsk, on_rq);
-- 
1.7.9.5


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 900 bytes --]

next prev parent reply	other threads:[~2012-06-26 17:50 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-22 11:36 [RFC][PATCH] sched: Fix race in task_group() Peter Zijlstra
2012-06-22 15:06 ` Stefan Bader
2012-06-22 15:15   ` Peter Zijlstra
2012-06-26 13:48   ` Peter Zijlstra
2012-06-26 17:49     ` Stefan Bader [this message]
2012-06-27 12:40       ` Hillf Danton
2012-06-27 12:51         ` Stefan Bader
2012-06-26 20:13     ` Tejun Heo
2012-06-26 21:17       ` Peter Zijlstra
2012-07-03 10:06     ` Stefan Bader
2012-07-06  6:24 ` [tip:sched/core] " tip-bot for Peter Zijlstra
2012-07-24 14:21 ` tip-bot for Peter Zijlstra
2012-10-18  8:27   ` cwillu
2012-10-18 10:23     ` Stefan Bader
2012-10-18 13:33       ` Luis Henriques
2012-10-18 20:50         ` cwillu
2012-10-19  7:40         ` Stefan Bader

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:32574ee dfblob:13b2684 dfblob:56de5c1 dfblob:1fd9884
dfblob:aae0c1d dfblob:b99a61e )
 OR (
bs:"sched: Fix race in task_group()" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FE9F63B.1000908@canonical.com \
    --to=stefan.bader@canonical.com \
    --cc=avagin@openvz.org \
    --cc=efault@gmx.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.