public inbox for linux-kernel@vger.kernel.org
* [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task
@ 2013-07-13 15:45 Kirill Tkhai
  2013-07-14  5:55 ` Mike Galbraith
  2013-07-15  6:31 ` Peter Zijlstra
  0 siblings, 2 replies; 6+ messages in thread
From: Kirill Tkhai @ 2013-07-13 15:45 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org; +Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra

This patch optimizes try_to_wake_up() for the case when the
system performs parallel wake-ups of the same task on different
cpus. It also adds statistics accounting for these situations.

We check the status of the task we want to wake up. If it is
TASK_WAKING, the task is already being manipulated by
try_to_wake_up() on another cpu, and at some moment after this
check it will be queued with its status set to TASK_RUNNING.
We can simply return early once we are sure the task will become
TASK_RUNNING in the near future (perhaps right after the check).
The benefit is that we do not spin waiting for the spinlock.

Returning early must not cause any problems, because the
scheduler already does the same thing when it queues a task on
the wake_list to be woken up on another cpu.

Parallel wake-ups are not a hypothetical situation. Here are
statistics from my 2-cpu laptop:

~# grep 'nr_wakeups_parallel.*[1-9]' -B100 -h /proc/*/sched | grep 'threads\|parallel\|wakeups ' | sed 's/(.*)//g'

rcu_sched 
se.statistics.nr_wakeups                     :                  102
se.statistics.nr_wakeups_parallel            :                    2
Xorg 
se.statistics.nr_wakeups                     :                36030
se.statistics.nr_wakeups_parallel            :                   56
gnome-terminal 
se.statistics.nr_wakeups                     :                70573
se.statistics.nr_wakeups_parallel            :                   55
rcu_preempt 
se.statistics.nr_wakeups                     :                68101
se.statistics.nr_wakeups_parallel            :                 1368

These numbers were taken shortly after boot (5-10 minutes of uptime).
Later the ratio goes down:

rcu_sched 
se.statistics.nr_wakeups                     :                  102
se.statistics.nr_wakeups_parallel            :                    2
Xorg 
se.statistics.nr_wakeups                     :                49057
se.statistics.nr_wakeups_parallel            :                   74
gnome-terminal 
se.statistics.nr_wakeups                     :              1495463
se.statistics.nr_wakeups_parallel            :                   99
rcu_preempt 
se.statistics.nr_wakeups                     :              2015010
se.statistics.nr_wakeups_parallel            :                 3738

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |   29 +++++++++++++++++++++++++----
 kernel/sched/debug.c  |    7 +++++++
 kernel/sched/stats.h  |   16 ++++++++++++++++
 4 files changed, 49 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc09d21..235a466 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -964,6 +964,7 @@ struct sched_statistics {
 	u64			nr_wakeups_affine_attempts;
 	u64			nr_wakeups_passive;
 	u64			nr_wakeups_idle;
+	atomic_t		nr_wakeups_parallel;
 };
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d06ad6..1e1475f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1336,6 +1336,11 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 
 	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
+	/*
+	 * Pairs with the TASK_WAKING check in try_to_wake_up()
+	 */
+	smp_wmb();
+
 	if (p->sched_class->task_woken)
 		p->sched_class->task_woken(rq, p);
 
@@ -1487,20 +1492,37 @@ static int
 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
-	int cpu, success = 0;
+	int cpu, success = 1;
 
+	/*
+	 * See commentary for commit 04e2f1741d235ba599037734878d72e57cb302b5.
+	 */
 	smp_wmb();
+#ifdef CONFIG_SMP
+	if (p->state == TASK_WAKING) {
+		/*
+		 * Pairs with sets of p->state: below and in ttwu_do_wakeup().
+		 */
+		smp_rmb();
+		inc_nr_parallel_wakeups(p);
+		return success;
+	}
+#endif
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	if (!(p->state & state))
+	if (!(p->state & state)) {
+		success = 0;
 		goto out;
+	}
 
-	success = 1; /* we're going to change ->state */
 	cpu = task_cpu(p);
 
 	if (p->on_rq && ttwu_remote(p, wake_flags))
 		goto stat;
 
 #ifdef CONFIG_SMP
+	p->state = TASK_WAKING;
+	smp_wmb();
+
 	/*
 	 * If the owning (remote) cpu is still in the middle of schedule() with
 	 * this task as prev, wait until its done referencing the task.
@@ -1513,7 +1535,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	smp_rmb();
 
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
 
 	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e076bdd..f18736d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -542,6 +542,13 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_idle);
 
 	{
+		int nr = get_nr_parallel_wakeups(p);
+
+		SEQ_printf(m, "%-45s:%21d\n",
+			      "se.statistics.nr_wakeups_parallel", nr);
+	}
+
+	{
 		u64 avg_atom, avg_per_cpu;
 
 		avg_atom = p->se.sum_exec_runtime;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 5aef494..dbbc6e9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -155,6 +155,22 @@ sched_info_switch(struct task_struct *prev, struct task_struct *next)
 #define sched_info_switch(t, next)		do { } while (0)
 #endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */
 
+#if defined(CONFIG_SCHEDSTATS) && defined(CONFIG_SMP)
+static inline void
+inc_nr_parallel_wakeups(struct task_struct *t)
+{
+	atomic_inc(&t->se.statistics.nr_wakeups_parallel);
+}
+static inline int
+get_nr_parallel_wakeups(struct task_struct *t)
+{
+	return atomic_read(&t->se.statistics.nr_wakeups_parallel);
+}
+#else
+#define inc_nr_parallel_wakeups(t)		do { } while (0)
+#define get_nr_parallel_wakeups(t)		(0)
+#endif /* CONFIG_SCHEDSTATS && CONFIG_SMP */
+
 /*
  * The following are functions that support scheduler-internal time accounting.
  * These functions are generally called at the timer tick.  None of this depends

* Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task
  2013-07-13 15:45 [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task Kirill Tkhai
@ 2013-07-14  5:55 ` Mike Galbraith
  2013-07-15  6:31 ` Peter Zijlstra
  1 sibling, 0 replies; 6+ messages in thread
From: Mike Galbraith @ 2013-07-14  5:55 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel@vger.kernel.org, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra

On Sat, 2013-07-13 at 19:45 +0400, Kirill Tkhai wrote: 
> This patch optimizes try_to_wake_up() for the case when the
> system performs parallel wake-ups of the same task on different
> cpus. It also adds statistics accounting for these situations.
> 
> We check the status of the task we want to wake up. If it is
> TASK_WAKING, the task is already being manipulated by
> try_to_wake_up() on another cpu, and at some moment after this
> check it will be queued with its status set to TASK_RUNNING.
> We can simply return early once we are sure the task will become
> TASK_RUNNING in the near future (perhaps right after the check).
> The benefit is that we do not spin waiting for the spinlock.

Hm, you're adding cycles to the common case to shave spin cycles in the
very rare case, then spending some to note that a collision happened.
What makes recording worth even 1 cycle?  What am I gonna do with the
knowledge that $dinkynum wakeups intersected at some random task?

-Mike



* Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task
  2013-07-13 15:45 [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task Kirill Tkhai
  2013-07-14  5:55 ` Mike Galbraith
@ 2013-07-15  6:31 ` Peter Zijlstra
  2013-07-15 14:14   ` Kirill Tkhai
  1 sibling, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2013-07-15  6:31 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: linux-kernel@vger.kernel.org, Steven Rostedt, Ingo Molnar

On Sat, Jul 13, 2013 at 07:45:49PM +0400, Kirill Tkhai wrote:
> ---
>  include/linux/sched.h |    1 +
>  kernel/sched/core.c   |   29 +++++++++++++++++++++++++----
>  kernel/sched/debug.c  |    7 +++++++
>  kernel/sched/stats.h  |   16 ++++++++++++++++
>  4 files changed, 49 insertions(+), 4 deletions(-)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index fc09d21..235a466 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -964,6 +964,7 @@ struct sched_statistics {
>  	u64			nr_wakeups_affine_attempts;
>  	u64			nr_wakeups_passive;
>  	u64			nr_wakeups_idle;
> +	atomic_t		nr_wakeups_parallel;

bad idea.. atomics are expensive and stats aren't generally _that_
important.

>  };
>  #endif
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9d06ad6..1e1475f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1336,6 +1336,11 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>  
>  	p->state = TASK_RUNNING;
>  #ifdef CONFIG_SMP
> +	/*
> +	 * Pair bracket with TASK_WAKING check it try_to_wake_up()
> +	 */
> +	smp_wmb();
> +

Like Mike said, you're making the common case more expensive, this is
not appreciated.

>  	if (p->sched_class->task_woken)
>  		p->sched_class->task_woken(rq, p);
>  
> @@ -1487,20 +1492,37 @@ static int
>  try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  {
>  	unsigned long flags;
> -	int cpu, success = 0;
> +	int cpu, success = 1;
>  
> +	/*
> +	 * See commentary for commit 04e2f1741d235ba599037734878d72e57cb302b5.

That must be the worst barrier comment ever.. If you want it there, just
copy/paste the commit log here. It is also completely unrelated to the
rest of the patch.

> +	 */
>  	smp_wmb();
> +#ifdef CONFIG_SMP
> +	if (p->state == TASK_WAKING) {
> +		/*
> +		 * Pairs with sets of p->state: below and in ttwu_do_wakeup().
> +		 */
> +		smp_rmb();
> +		inc_nr_parallel_wakeups(p);
> +		return success;

This is wrong, you didn't change p->state, therefore returning 1 is not
an option. If you want to do something like this, treat it like waking
an already running task.

Also, the barrier comments fail to explain the exact problem they're
solving.


> +	}
> +#endif
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> -	if (!(p->state & state))
> +	if (!(p->state & state)) {
> +		success = 0;
>  		goto out;
> +	}
>  
> -	success = 1; /* we're going to change ->state */
>  	cpu = task_cpu(p);
>  
>  	if (p->on_rq && ttwu_remote(p, wake_flags))
>  		goto stat;
>  
>  #ifdef CONFIG_SMP
> +	p->state = TASK_WAKING;
> +	smp_wmb();
> +

This too is broken; the loop below needs to be completed first,
otherwise we change p->state while the task is still on the CPU and it
might read the wrong p->state.

>  	/*
>  	 * If the owning (remote) cpu is still in the middle of schedule() with
>  	 * this task as prev, wait until its done referencing the task.
> @@ -1513,7 +1535,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	smp_rmb();
>  
>  	p->sched_contributes_to_load = !!task_contributes_to_load(p);
> -	p->state = TASK_WAKING;
>  
>  	if (p->sched_class->task_waking)
>  		p->sched_class->task_waking(p);

* Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task
  2013-07-15  6:31 ` Peter Zijlstra
@ 2013-07-15 14:14   ` Kirill Tkhai
  2013-07-15 20:19     ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Kirill Tkhai @ 2013-07-15 14:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel@vger.kernel.org, Steven Rostedt, Ingo Molnar

Hi, Peter,

15.07.2013, 10:32, "Peter Zijlstra" <peterz@infradead.org>:
> On Sat, Jul 13, 2013 at 07:45:49PM +0400, Kirill Tkhai wrote:
>
>>  ---
>>   include/linux/sched.h |    1 +
>>   kernel/sched/core.c   |   29 +++++++++++++++++++++++++----
>>   kernel/sched/debug.c  |    7 +++++++
>>   kernel/sched/stats.h  |   16 ++++++++++++++++
>>   4 files changed, 49 insertions(+), 4 deletions(-)
>>  diff --git a/include/linux/sched.h b/include/linux/sched.h
>>  index fc09d21..235a466 100644
>>  --- a/include/linux/sched.h
>>  +++ b/include/linux/sched.h
>>  @@ -964,6 +964,7 @@ struct sched_statistics {
>>           u64 nr_wakeups_affine_attempts;
>>           u64 nr_wakeups_passive;
>>           u64 nr_wakeups_idle;
>>  + atomic_t nr_wakeups_parallel;
>
> bad idea.. atomics are expensive and stats aren't generally _that_
> important.
>
>>   };
>>   #endif
>>
>>  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>  index 9d06ad6..1e1475f 100644
>>  --- a/kernel/sched/core.c
>>  +++ b/kernel/sched/core.c
>>  @@ -1336,6 +1336,11 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>>
>>           p->state = TASK_RUNNING;
>>   #ifdef CONFIG_SMP
>>  + /*
>>  + * Pair bracket with TASK_WAKING check it try_to_wake_up()
>>  + */
>>  + smp_wmb();
>>  +
>
> Like Mike said, you're making the common case more expensive, this is
> not appreciated.
>
>>           if (p->sched_class->task_woken)
>>                   p->sched_class->task_woken(rq, p);
>>
>>  @@ -1487,20 +1492,37 @@ static int
>>   try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>   {
>>           unsigned long flags;
>>  - int cpu, success = 0;
>>  + int cpu, success = 1;
>>
>>  + /*
>>  + * See commentary for commit 04e2f1741d235ba599037734878d72e57cb302b5.
>
> That must be the worst barrier comment ever.. If you want it there, just
> copy/paste the commit log here. It is also completely unrelated to the
> rest of the patch.
>
>>  + */
>>           smp_wmb();
>>  +#ifdef CONFIG_SMP
>>  + if (p->state == TASK_WAKING) {
>>  + /*
>>  + * Pairs with sets of p->state: below and in ttwu_do_wakeup().
>>  + */
>>  + smp_rmb();
>>  + inc_nr_parallel_wakeups(p);
>>  + return success;
>
> This is wrong, you didn't change p->state, therefore returning 1 is not
> an option. If you want to do something like this, treat it like waking
> an already running task.
>
> Also, the barrier comments fail to explain the exact problem they're
> solving.
>
>>  + }
>>  +#endif
>>           raw_spin_lock_irqsave(&p->pi_lock, flags);
>>  - if (!(p->state & state))
>>  + if (!(p->state & state)) {
>>  + success = 0;
>>                   goto out;
>>  + }
>>
>>  - success = 1; /* we're going to change ->state */
>>           cpu = task_cpu(p);
>>
>>           if (p->on_rq && ttwu_remote(p, wake_flags))
>>                   goto stat;
>>
>>   #ifdef CONFIG_SMP
>>  + p->state = TASK_WAKING;
>>  + smp_wmb();
>>  +
>
> This too is broken; the loop below needs to be completed first,
> otherwise we change p->state while the task is still on the CPU and it
> might read the wrong p->state.

This place is below the (on_rq && ttwu_remote) check, so the task is
either 'dequeued and on_cpu == 0', or it is in the middle of schedule()
on an arch that wants an unlocked context switch.

Nothing in the scheduler probes p->state between prepare_lock_switch()
and finish_lock_switch(). Archs with unlocked context switch (mips and
ia64) don't change or probe the state of the previous process during
context_switch().

The only exception is the TASK_DEAD case, but that is handled above by
the 'p->state & state' check. Nobody passes TASK_DEAD to
try_to_wake_up().

p->state might also be changed from outside the scheduler, i.e. by
ptrace. But it doesn't look like it would be affected by the earlier
TASK_WAKING set.

So, I don't see any problem with it. Peter, what exactly do you mean?

The other problems are easy to fix, all except the common-case
overhead. If it really is significant in scheduler terms, I'll take it
into account in the future.

>>           /*
>>            * If the owning (remote) cpu is still in the middle of schedule() with
>>            * this task as prev, wait until its done referencing the task.
>>  @@ -1513,7 +1535,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>           smp_rmb();
>>
>>           p->sched_contributes_to_load = !!task_contributes_to_load(p);
>>  - p->state = TASK_WAKING;
>>
>>           if (p->sched_class->task_waking)
>>                   p->sched_class->task_waking(p);

* Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task
  2013-07-15 14:14   ` Kirill Tkhai
@ 2013-07-15 20:19     ` Peter Zijlstra
  2013-07-15 20:22       ` Kirill Tkhai
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2013-07-15 20:19 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: linux-kernel@vger.kernel.org, Steven Rostedt, Ingo Molnar

On Mon, Jul 15, 2013 at 06:14:34PM +0400, Kirill Tkhai wrote:
> >>
> >>   #ifdef CONFIG_SMP
> >>  + p->state = TASK_WAKING;
> >>  + smp_wmb();
> >>  +
> >
> > This too is broken; the loop below needs to be completed first,
> > otherwise we change p->state while the task is still on the CPU and it
> > might read the wrong p->state.
> 
> This place is below the (on_rq && ttwu_remote) check, so the task is
> either 'dequeued and on_cpu == 0', or it is in the middle of schedule()
> on an arch that wants an unlocked context switch.
> 
> Nothing in the scheduler probes p->state between prepare_lock_switch()
> and finish_lock_switch(). Archs with unlocked context switch (mips and
> ia64) don't change or probe the state of the previous process during
> context_switch().

It means it's after deactivate_task(), but before context_switch(). It
so happens that
context_switch()->prepare_task_switch()->trace_sched_switch() inspects
p->state.

Even if this were not the case, touching a task that is 'live' on
another CPU is very _very_ bad practice.

* Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task
  2013-07-15 20:19     ` Peter Zijlstra
@ 2013-07-15 20:22       ` Kirill Tkhai
  0 siblings, 0 replies; 6+ messages in thread
From: Kirill Tkhai @ 2013-07-15 20:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel@vger.kernel.org, Steven Rostedt, Ingo Molnar

16.07.2013, 00:19, "Peter Zijlstra" <peterz@infradead.org>:
> On Mon, Jul 15, 2013 at 06:14:34PM +0400, Kirill Tkhai wrote:
>
>>>>    #ifdef CONFIG_SMP
>>>>   + p->state = TASK_WAKING;
>>>>   + smp_wmb();
>>>>   +
>>>  This too is broken; the loop below needs to be completed first,
>>>  otherwise we change p->state while the task is still on the CPU and it
>>>  might read the wrong p->state.
>>  This place is below the (on_rq && ttwu_remote) check, so the task is
>>  either 'dequeued and on_cpu == 0', or it is in the middle of
>>  schedule() on an arch that wants an unlocked context switch.
>>
>>  Nothing in the scheduler probes p->state between prepare_lock_switch()
>>  and finish_lock_switch(). Archs with unlocked context switch (mips and
>>  ia64) don't change or probe the state of the previous process during
>>  context_switch().
>
> It means it's after deactivate_task(), but before context_switch(). It
> so happens that
> context_switch()->prepare_task_switch()->trace_sched_switch() inspects
> p->state.
>
> Even if this were not the case, touching a task that is 'live' on
> another CPU is very _very_ bad practice.

Thanks for the explanation.

Kirill
