From: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
To: a.p.zijlstra@chello.nl, mingo@kernel.org
Cc: pjt@google.com, suresh.b.siddha@intel.com,
seto.hidetoshi@jp.fujitsu.com, rostedt@goodmis.org,
tglx@linutronix.de, dhillf@gmail.com, rjw@sisk.pl,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/cpu hotplug: Don't traverse sched domain tree of cpus going offline
Date: Wed, 20 Jun 2012 15:20:45 +0530 [thread overview]
Message-ID: <4FE19CF5.3060000@linux.vnet.ibm.com> (raw)
In-Reply-To: <20120602185721.5559.65562.stgit@srivatsabhat>
On 06/03/2012 12:27 AM, Srivatsa S. Bhat wrote:
> The cpu_active_mask is used to help the scheduler make the right decisions
> during CPU hotplug transitions, and thereby enable CPU hotplug operations to
> run smoothly.
>
> However there are a few places where the scheduler doesn't consult the
> cpu_active_mask and hence can trip over during CPU hotplug operations, by
> making decisions based on stale data structures.
>
> In particular, the call to partition_sched_domains() in the CPU offline path
> sets cpu_rq(dying_cpu)->sd (sched domain pointer) to NULL. However, until
> we get to that point, it remains non-NULL. And during that time period,
> the scheduler could end up traversing that non-NULL sched domain tree and
> making decisions based on that, which could lead to undesirable consequences.
>
> For example, commit 8f2f748 (CPU hotplug, cpusets, suspend: Don't touch
> cpusets during suspend/resume) caused suspend hangs at the CPU hotplug stage,
> in some configurations. And the root-cause of that hang was that the
> scheduler was traversing the sched domain trees of cpus going offline, thereby
> messing up its decisions. The analysis of that hang can be found in:
> http://marc.info/?l=linux-kernel&m=133518704022532&w=4
>
> In short, our assumption that we are good to go (even with stale sched
> domains) as long as the dying cpu was not in the cpu_active_mask, was wrong.
> Because, there are several places where the scheduler code doesn't really
> check the cpu_active_mask, but instead traverses the sched domains directly
> to do stuff.
>
> Anyway, that commit got reverted. However, that same problem could very well
> occur during regular CPU hotplug too, even now. Ideally, we would want the
> scheduler to make all its decisions during the CPU hotplug transition, by
> looking up the cpu_active_mask (which is updated very early during CPU
> Hotplug) to avoid such disasters. So, strengthen the scheduler by making it
> consult the cpu_active_mask as a validity check before traversing the sched
> domain tree.
>
> Here we are specifically fixing the 2 known instances: idle_balance() and
> find_lowest_rq().
>
> 1. kernel/sched/core.c and fair.c:
> schedule()
> __schedule()
> idle_balance()
>
> idle_balance() can end up doing:
>
> for_each_domain(dying_cpu, sd) {
> ...
> }
>
> 2. kernel/sched/core.c and rt.c:
>
> migration_call()
> sched_ttwu_pending()
> ttwu_do_activate()
> ttwu_do_wakeup()
> task_woken()
> task_woken_rt()
> push_rt_tasks()
> find_lowest_rq()
>
> find_lowest_rq() can end up doing:
>
> for_each_domain(dying_cpu, sd) {
> ...
> }
>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Any thoughts on this patch?
Regards,
Srivatsa S. Bhat
> ---
>
> kernel/sched/fair.c | 5 +++++
> kernel/sched/rt.c | 4 ++++
> 2 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 940e6d1..72cc7c6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4402,6 +4402,10 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> raw_spin_unlock(&this_rq->lock);
>
> update_shares(this_cpu);
> +
> + if (!cpu_active(this_cpu))
> + goto out;
> +
> rcu_read_lock();
> for_each_domain(this_cpu, sd) {
> unsigned long interval;
> @@ -4426,6 +4430,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> }
> rcu_read_unlock();
>
> + out:
> raw_spin_lock(&this_rq->lock);
>
> if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index c5565c3..8cbafcd 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1488,6 +1488,9 @@ static int find_lowest_rq(struct task_struct *task)
> if (!cpumask_test_cpu(this_cpu, lowest_mask))
> this_cpu = -1; /* Skip this_cpu opt if not among lowest */
>
> + if (!cpu_active(cpu))
> + goto out;
> +
> rcu_read_lock();
> for_each_domain(cpu, sd) {
> if (sd->flags & SD_WAKE_AFFINE) {
> @@ -1513,6 +1516,7 @@ static int find_lowest_rq(struct task_struct *task)
> }
> rcu_read_unlock();
>
> + out:
> /*
> * And finally, if there were no matches within the domains
> * just give the caller *something* to work with from the compatible
>
next prev parent reply other threads:[~2012-06-20 10:06 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-06-02 18:57 [PATCH] sched/cpu hotplug: Don't traverse sched domain tree of cpus going offline Srivatsa S. Bhat
2012-06-20 9:50 ` Srivatsa S. Bhat [this message]
2012-06-20 10:17 ` Peter Zijlstra
2012-07-17 10:35 ` Srivatsa S. Bhat
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4FE19CF5.3060000@linux.vnet.ibm.com \
--to=srivatsa.bhat@linux.vnet.ibm.com \
--cc=a.p.zijlstra@chello.nl \
--cc=dhillf@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=pjt@google.com \
--cc=rjw@sisk.pl \
--cc=rostedt@goodmis.org \
--cc=seto.hidetoshi@jp.fujitsu.com \
--cc=suresh.b.siddha@intel.com \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.