From: Frank Rowand <frank.rowand@am.sony.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Chris Mason <chris.mason@oracle.com>, Ingo Molnar <mingo@elte.hu>,
	Thomas Gleixner <tglx@linutronix.de>,
	Mike Galbraith <efault@gmx.de>, Oleg Nesterov <oleg@redhat.com>,
	Paul Turner <pjt@google.com>, Jens Axboe <axboe@kernel.dk>,
	Yong Zhang <yong.zhang0@gmail.com>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC][PATCH 14/18] sched: Remove rq->lock from the first half of ttwu()
Date: Fri, 28 Jan 2011 17:05:14 -0800
Message-ID: <4D4367CA.2030303@am.sony.com>
In-Reply-To: <20110104150103.012710349@chello.nl>

On 01/04/11 06:59, Peter Zijlstra wrote:
> Currently ttwu() does two rq->lock acquisitions, once on the task's
> old rq, holding it over the p->state fiddling and load-balance pass.
> Then it drops the old rq->lock to acquire the new rq->lock.
>
> By having serialized ttwu(), p->sched_class, p->cpus_allowed with
> p->pi_lock, we can now drop the whole first rq->lock acquisition.
>
> The p->pi_lock serializing concurrent ttwu() calls protects p->state,
> which we will set to TASK_WAKING to bridge possible p->pi_lock to
> rq->lock gaps and serialize set_task_cpu() calls against
> task_rq_lock().
>
> The p->pi_lock serialization of p->sched_class allows us to call
> scheduling class methods without holding the rq->lock, and the
> serialization of p->cpus_allowed allows us to do the load-balancing
> bits without races.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> kernel/sched.c | 47 +++++++++++++++++++----------------------------
> kernel/sched_fair.c | 3 +--
> 2 files changed, 20 insertions(+), 30 deletions(-)
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2436,69 +2436,60 @@ ttwu_post_activation(struct task_struct
>   * Returns %true if @p was woken up, %false if it was already running
>   * or @state didn't match @p's state.
>   */
> -static int try_to_wake_up(struct task_struct *p, unsigned int state,
> -			  int wake_flags)
> +static int
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  {
> -	int cpu, orig_cpu, this_cpu, success = 0;
> +	int cpu, this_cpu, success = 0;
>  	unsigned long flags;
> -	unsigned long en_flags = ENQUEUE_WAKEUP;
>  	struct rq *rq;
>
>  	this_cpu = get_cpu();
>
>  	smp_wmb();
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> -	rq = __task_rq_lock(p);
>  	if (!(p->state & state))
>  		goto out;
>
>  	cpu = task_cpu(p);
>
> -	if (p->on_rq)
> -		goto out_running;
> +	if (p->on_rq) {
> +		rq = __task_rq_lock(p);
> +		if (p->on_rq)
> +			goto out_running;
> +		__task_rq_unlock(rq);
> +	}
>
> -	orig_cpu = cpu;
>  #ifdef CONFIG_SMP
> -	if (unlikely(task_running(rq, p)))
> -		goto out_activate;

I think this while (p->on_cpu) can lead to a deadlock.  I'll explain at the
bottom of this email.

> +	while (p->on_cpu)
> +		cpu_relax();
>
>  	p->sched_contributes_to_load = !!task_contributes_to_load(p);
>  	p->state = TASK_WAKING;
>
> -	if (p->sched_class->task_waking) {
> +	if (p->sched_class->task_waking)
>  		p->sched_class->task_waking(p);
> -		en_flags |= ENQUEUE_WAKING;
> -	}
>
>  	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> -	if (cpu != orig_cpu)
> -		set_task_cpu(p, cpu);
> -	__task_rq_unlock(rq);
> +#endif /* CONFIG_SMP */
>
>  	rq = cpu_rq(cpu);
>  	raw_spin_lock(&rq->lock);
>
> -	/*
> -	 * We migrated the task without holding either rq->lock, however
> -	 * since the task is not on the task list itself, nobody else
> -	 * will try and migrate the task, hence the rq should match the
> -	 * cpu we just moved it to.
> -	 */
> -	WARN_ON(task_cpu(p) != cpu);
> -	WARN_ON(p->state != TASK_WAKING);
> +#ifdef CONFIG_SMP
> +	if (cpu != task_cpu(p))
> +		set_task_cpu(p, cpu);
>
>  	if (p->sched_contributes_to_load)
>  		rq->nr_uninterruptible--;
> +#endif
>
> -out_activate:
> -#endif /* CONFIG_SMP */
> -	activate_task(rq, p, en_flags);
> +	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
>  out_running:
>  	ttwu_post_activation(p, rq, wake_flags);
>  	ttwu_stat(rq, p, cpu, wake_flags);
>  	success = 1;
> -out:
>  	__task_rq_unlock(rq);
> +out:
>  	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>  	put_cpu();

The deadlock can occur if __ARCH_WANT_UNLOCKED_CTXSW and
__ARCH_WANT_INTERRUPTS_ON_CTXSW are defined.

A task sets p->state = TASK_UNINTERRUPTIBLE, then calls schedule():

   schedule()
      prev->on_rq = 0
      context_switch()
         prepare_task_switch()
            prepare_lock_switch()
               raw_spin_unlock_irq(&rq->lock)

At this point, a pending interrupt (on this same cpu) is handled.
The interrupt handling results in a call to try_to_wake_up() on the
current process.  The try_to_wake_up() gets into:

   while (p->on_cpu)
      cpu_relax();

and spins forever.  This is because "prev->on_cpu = 0" is not executed
until slightly after this point, at:

   finish_task_switch()
      finish_lock_switch()
         prev->on_cpu = 0
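
The ordering problem described above can be modelled in plain user-space C.
This is only an illustrative sketch, not kernel code: struct model_task,
model_wait_on_cpu() and model_schedule() are hypothetical names, the spin is
bounded so that the model terminates, and the interrupt is modelled as a
direct function call made in the window between dropping rq->lock and
clearing on_cpu.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two task_struct fields involved. */
struct model_task {
	int on_rq;	/* task is queued on a runqueue */
	int on_cpu;	/* task is still executing on a CPU */
};

/*
 * Models "while (p->on_cpu) cpu_relax();" but gives up after max_spins
 * iterations. Returns true if on_cpu was seen cleared, false if the
 * loop would still be spinning.
 */
static bool model_wait_on_cpu(const struct model_task *p,
			      unsigned long max_spins)
{
	unsigned long i;

	for (i = 0; i < max_spins; i++) {
		if (!p->on_cpu)
			return true;
	}
	return !p->on_cpu;
}

/*
 * Models schedule() on a single CPU. If irq_in_window is true, an
 * interrupt that wakes prev is handled after rq->lock is dropped and
 * before on_cpu is cleared. Returns true if the wakeup could make
 * progress, false if it would have spun forever.
 */
static bool model_schedule(struct model_task *prev, bool irq_in_window)
{
	bool progressed = true;

	assert(prev->on_cpu);	/* prev is the currently running task */

	prev->on_rq = 0;	/* schedule(): prev leaves the runqueue */

	/* prepare_lock_switch(): rq->lock dropped, interrupts enabled */
	if (irq_in_window)
		progressed = model_wait_on_cpu(prev, 1000000UL);

	/*
	 * finish_lock_switch(): only reachable once the handler returns.
	 * With an unbounded spin this line would never execute.
	 */
	prev->on_cpu = 0;
	return progressed;
}
```

Calling model_schedule(&t, true) on a task with on_cpu set reports that the
wakeup cannot make progress, because nothing else on the same CPU can clear
on_cpu while the handler is spinning.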

One possible fix would be to get rid of __ARCH_WANT_INTERRUPTS_ON_CTXSW.
I don't suspect the reaction to that suggestion will be very positive...

Another fix might be:

   while (p->on_cpu) {
      if (p == current)
         goto out_activate;
      cpu_relax();
   }

Then add back in the out_activate label.

I don't know if the second fix is good -- I haven't thought out how
it impacts the later patches in the series.
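
As a rough illustration of the second idea, here is a user-space model of a
wait loop that bails out when the task being woken is the one running the
loop. It is a sketch under assumptions, not a tested kernel change:
struct model_task, enum wait_result and model_wait_on_cpu_fixed() are
hypothetical names, the curr argument stands in for the kernel's "current",
and the spin is bounded so that the model terminates.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the one task_struct field involved. */
struct model_task {
	int on_cpu;	/* task is still executing on a CPU */
};

enum wait_result {
	WAIT_DONE,	/* on_cpu was seen cleared */
	WAIT_SELF,	/* p is the running task: "goto out_activate" */
	WAIT_STUCK	/* gave up after max_spins iterations */
};

/*
 * Models:
 *	while (p->on_cpu) {
 *		if (p == current)
 *			goto out_activate;
 *		cpu_relax();
 *	}
 */
static enum wait_result
model_wait_on_cpu_fixed(const struct model_task *p,
			const struct model_task *curr,
			unsigned long max_spins)
{
	unsigned long i;

	for (i = 0; p->on_cpu && i < max_spins; i++) {
		if (p == curr)
			return WAIT_SELF;
	}
	return p->on_cpu ? WAIT_STUCK : WAIT_DONE;
}
```

In this model a self-wakeup returns WAIT_SELF on the first iteration instead
of spinning. Whether skipping ahead to activation is safe in the real
try_to_wake_up() is exactly the open question raised above.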
-Frank