public inbox for linux-kernel@vger.kernel.org
From: Frank Rowand <frank.rowand@am.sony.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Chris Mason <chris.mason@oracle.com>, Ingo Molnar <mingo@elte.hu>,
	Thomas Gleixner <tglx@linutronix.de>,
	Mike Galbraith <efault@gmx.de>, Oleg Nesterov <oleg@redhat.com>,
	Paul Turner <pjt@google.com>, Jens Axboe <axboe@kernel.dk>,
	Yong Zhang <yong.zhang0@gmail.com>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC][PATCH 14/18] sched: Remove rq->lock from the first half of ttwu()
Date: Fri, 28 Jan 2011 17:05:14 -0800
Message-ID: <4D4367CA.2030303@am.sony.com>
In-Reply-To: <20110104150103.012710349@chello.nl>

On 01/04/11 06:59, Peter Zijlstra wrote:
> Currently ttwu() does two rq->lock acquisitions, once on the task's
> old rq, holding it over the p->state fiddling and load-balance pass.
> Then it drops the old rq->lock to acquire the new rq->lock.
> 
> By having serialized ttwu(), p->sched_class, p->cpus_allowed with
> p->pi_lock, we can now drop the whole first rq->lock acquisition.
> 
> The p->pi_lock serializing concurrent ttwu() calls protects p->state,
> which we will set to TASK_WAKING to bridge possible p->pi_lock to
> rq->lock gaps and serialize set_task_cpu() calls against
> task_rq_lock().
> 
> The p->pi_lock serialization of p->sched_class allows us to call
> scheduling class methods without holding the rq->lock, and the
> serialization of p->cpus_allowed allows us to do the load-balancing
> bits without races.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c      |   47 +++++++++++++++++++----------------------------
>  kernel/sched_fair.c |    3 +--
>  2 files changed, 20 insertions(+), 30 deletions(-)
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2436,69 +2436,60 @@ ttwu_post_activation(struct task_struct 
>   * Returns %true if @p was woken up, %false if it was already running
>   * or @state didn't match @p's state.
>   */
> -static int try_to_wake_up(struct task_struct *p, unsigned int state,
> -			  int wake_flags)
> +static int
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  {
> -	int cpu, orig_cpu, this_cpu, success = 0;
> +	int cpu, this_cpu, success = 0;
>  	unsigned long flags;
> -	unsigned long en_flags = ENQUEUE_WAKEUP;
>  	struct rq *rq;
>  
>  	this_cpu = get_cpu();
>  
>  	smp_wmb();
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> -	rq = __task_rq_lock(p);
>  	if (!(p->state & state))
>  		goto out;
>  
>  	cpu = task_cpu(p);
>  
> -	if (p->on_rq)
> -		goto out_running;
> +	if (p->on_rq) {
> +		rq = __task_rq_lock(p);
> +		if (p->on_rq)
> +			goto out_running;
> +		__task_rq_unlock(rq);
> +	}
>  
> -	orig_cpu = cpu;
>  #ifdef CONFIG_SMP
> -	if (unlikely(task_running(rq, p)))
> -		goto out_activate;

I think this while (p->on_cpu) can lead to a deadlock.  I'll explain at the
bottom of this email.

> +	while (p->on_cpu)
> +		cpu_relax();


>  
>  	p->sched_contributes_to_load = !!task_contributes_to_load(p);
>  	p->state = TASK_WAKING;
>  
> -	if (p->sched_class->task_waking) {
> +	if (p->sched_class->task_waking)
>  		p->sched_class->task_waking(p);
> -		en_flags |= ENQUEUE_WAKING;
> -	}
>  
>  	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> -	if (cpu != orig_cpu)
> -		set_task_cpu(p, cpu);
> -	__task_rq_unlock(rq);
> +#endif /* CONFIG_SMP */
>  
>  	rq = cpu_rq(cpu);
>  	raw_spin_lock(&rq->lock);
>  
> -	/*
> -	 * We migrated the task without holding either rq->lock, however
> -	 * since the task is not on the task list itself, nobody else
> -	 * will try and migrate the task, hence the rq should match the
> -	 * cpu we just moved it to.
> -	 */
> -	WARN_ON(task_cpu(p) != cpu);
> -	WARN_ON(p->state != TASK_WAKING);
> +#ifdef CONFIG_SMP
> +	if (cpu != task_cpu(p))
> +		set_task_cpu(p, cpu);
>  
>  	if (p->sched_contributes_to_load)
>  		rq->nr_uninterruptible--;
> +#endif
>  
> -out_activate:
> -#endif /* CONFIG_SMP */
> -	activate_task(rq, p, en_flags);
> +	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
>  out_running:
>  	ttwu_post_activation(p, rq, wake_flags);
>  	ttwu_stat(rq, p, cpu, wake_flags);
>  	success = 1;
> -out:
>  	__task_rq_unlock(rq);
> +out:
>  	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>  	put_cpu();

The deadlock can occur if __ARCH_WANT_UNLOCKED_CTXSW and
__ARCH_WANT_INTERRUPTS_ON_CTXSW are defined.

A task sets p->state = TASK_UNINTERRUPTIBLE, then calls schedule().

schedule()
   prev->on_rq = 0
   context_switch()
      prepare_task_switch()
         prepare_lock_switch()
            raw_spin_unlock_irq(&rq->lock)

At this point, a pending interrupt (on this same cpu) is handled.
The interrupt handling results in a call to try_to_wake_up() on the
current process.  The try_to_wake_up() gets into:

   while (p->on_cpu)
      cpu_relax();

and spins forever.  This is because "prev->on_cpu = 0" is not executed
until slightly after this point, at:

   finish_task_switch()
      finish_lock_switch()
         prev->on_cpu = 0
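
The ordering above can be modelled in user space. This is only a sketch of
the sequence of events, not kernel code: struct task, spin_until_off_cpu(),
simulate_schedule() and the max_spins bound are all hypothetical names made
up for illustration, and the spin is bounded so that the model terminates
where the real loop would not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two task_struct fields involved. */
struct task {
	int on_rq;
	int on_cpu;
};

/*
 * Model of "while (p->on_cpu) cpu_relax();", bounded so the simulation
 * terminates.  Returns true if on_cpu was seen clear within max_spins
 * iterations, false if the loop would still be spinning.
 */
static bool spin_until_off_cpu(const volatile struct task *p,
			       unsigned long max_spins)
{
	unsigned long i;

	for (i = 0; i < max_spins; i++) {
		if (!p->on_cpu)
			return true;
	}
	return false;
}

/*
 * Model of schedule() on a single cpu with interrupts enabled during the
 * context switch.  If irq_pending is true, the "interrupt handler" runs
 * the wakeup spin after rq->lock is dropped but before prev->on_cpu is
 * cleared.  Returns true if the wakeup spin was able to finish.
 */
static bool simulate_schedule(struct task *prev, bool irq_pending,
			      unsigned long max_spins)
{
	bool wakeup_finished = true;

	assert(prev != NULL);

	prev->on_rq = 0;	/* schedule(): prev is dequeued */

	/* prepare_lock_switch(): rq->lock dropped, interrupts enabled */
	if (irq_pending)
		wakeup_finished = spin_until_off_cpu(prev, max_spins);

	/*
	 * finish_lock_switch(): in the real kernel this is only reached
	 * once the interrupt handler returns, which never happens if the
	 * handler is the one spinning on prev->on_cpu.
	 */
	prev->on_cpu = 0;

	return wakeup_finished;
}
```

With a pending interrupt, simulate_schedule() reports that the spin did not
finish no matter how large max_spins is, because nothing can clear on_cpu
while the handler is running on the same cpu.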


One possible fix would be to get rid of __ARCH_WANT_INTERRUPTS_ON_CTXSW.
I don't suspect the reaction to that suggestion will be very positive...

Another fix might be:

   while (p->on_cpu) {
      if (p == current)
         goto out_activate;
      cpu_relax();
   }

   Then add back in the out_activate label.

I don't know if the second fix is good -- I haven't thought out how
it impacts the later patches in the series.
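
The control flow of the second fix can be sketched in user space as
follows.  This only illustrates the early exit from the spin; it says
nothing about whether the fix is safe in the kernel or how it interacts
with the rest of the series.  struct task, enum wait_result,
wait_for_off_cpu() and the max_spins bound are hypothetical names, and
current_task stands in for the kernel's "current".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the one task_struct field involved. */
struct task {
	int on_cpu;
};

enum wait_result {
	WAIT_OFF_CPU,		/* on_cpu cleared: normal wakeup path */
	WAIT_IS_CURRENT,	/* p == current: the "goto out_activate" case */
	WAIT_STILL_ON_CPU	/* bound hit: the real loop would keep spinning */
};

/*
 * Model of the proposed loop:
 *
 *    while (p->on_cpu) {
 *       if (p == current)
 *          goto out_activate;
 *       cpu_relax();
 *    }
 *
 * bounded by max_spins so that the model always terminates.
 */
static enum wait_result wait_for_off_cpu(const volatile struct task *p,
					 const volatile struct task *current_task,
					 unsigned long max_spins)
{
	unsigned long spins = 0;

	while (p->on_cpu) {
		if (p == current_task)
			return WAIT_IS_CURRENT;
		if (++spins >= max_spins)
			return WAIT_STILL_ON_CPU;
	}
	return WAIT_OFF_CPU;
}
```

A wakeup of the task that is running on the local cpu returns
WAIT_IS_CURRENT on the first iteration instead of spinning, while a wakeup
of a task that is on some other cpu still waits for on_cpu to clear.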

-Frank

