From: Igor Mammedov <imammedo@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: rob@landley.net, tglx@linutronix.de, mingo@redhat.com,
	hpa@zytor.com, x86@kernel.org, luto@mit.edu,
	suresh.b.siddha@intel.com, avi@redhat.com, imammedo@redhat.com,
	a.p.zijlstra@chello.nl, johnstul@us.ibm.com,
	arjan@linux.intel.com, linux-doc@vger.kernel.org
Subject: [PATCH 1/5] Fix soft-lookup in stop machine on secondary cpu bring up
Date: Wed,  9 May 2012 12:24:58 +0200
Message-ID: <1336559102-28103-2-git-send-email-imammedo@redhat.com>
In-Reply-To: <1336559102-28103-1-git-send-email-imammedo@redhat.com>

When bringing up cpuX1, it can stall in start_secondary() for more
than 5 sec before setting its bit in cpu_callin_mask. That forces
do_boot_cpu() to give up waiting and take the error return path,
printing one of the messages:
  pr_err("CPU%d: Stuck ??\n", cpuX1);
or
  pr_err("CPU%d: Not responding.\n", cpuX1);
and native_cpu_up() returns early with -EIO. However, the AP may
continue its boot process until it reaches check_tsc_sync_target(),
where it waits for the boot cpu to run cpu_up...=>check_tsc_sync_source.
That will never happen, since cpu_up has already returned with an error.
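
As a rough illustration of the sequence above, here is a toy model in
plain C (not kernel code). The function names, the tick-based timeout
and MODEL_EIO are hypothetical stand-ins; only the idea of a bounded
wait that ends in an -EIO style error, followed by an unbounded wait on
the AP side, is taken from the description.

```c
#include <stdbool.h>

/* Hypothetical error code standing in for EIO in this model. */
enum { MODEL_EIO = 5 };

/*
 * Boot cpu side: poll until the AP has called in or the timeout expires.
 * One tick stands for one poll of cpu_callin_mask (an assumption of the
 * model). Returns 0 on success, -MODEL_EIO if the AP called in too late.
 */
static int model_wait_for_callin(int ap_callin_tick, int timeout_ticks)
{
	for (int tick = 0; tick < timeout_ticks; tick++) {
		if (tick >= ap_callin_tick)
			return 0;
	}
	return -MODEL_EIO;
}

/*
 * AP side: it spins in check_tsc_sync_target() until the boot cpu runs
 * check_tsc_sync_source(), which only happens if the bring-up call on
 * the boot cpu did not bail out with an error.
 */
static bool model_ap_stuck_in_tsc_sync(int boot_cpu_result)
{
	return boot_cpu_result != 0;
}
```

In this model, an AP that calls in after the timeout makes the boot cpu
return an error, and feeding that error to model_ap_stuck_in_tsc_sync()
reports the AP as stuck, which is the state the rest of this
description starts from.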

Note that cpuX1 is marked as active in smp_callin before it gets
stuck in check_tsc_sync_target. When another cpu, cpuX2, is then
being onlined, start_secondary on it will call
  smp_callin
    -> smp_store_cpu_info
      -> identify_secondary_cpu
        -> mtrr_ap_init
          -> set_mtrr_from_inactive_cpu
            -> stop_machine_from_inactive_cpu
where it schedules stop_machine work on all ACTIVE cpus
  smdata.num_threads = num_active_cpus() + 1;
and waits until they all complete it before continuing. As noted
above, cpuX1 is marked as active but cannot execute any work, since
it has not completed initialization and is stuck in
check_tsc_sync_target. As a result, the system soft-locks up in
stop_machine_cpu_stop.
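
The counting problem can be sketched with a small stand-alone C model
(not kernel code). struct model_cpu and the model_* helpers are
hypothetical; the only line taken from the kernel is the
num_active_cpus() + 1 thread count quoted above.

```c
#include <stdbool.h>

/* Hypothetical per-cpu state for the model. */
struct model_cpu {
	bool active;       /* cpu is counted by num_active_cpus() */
	bool can_run_work; /* cpu finished bring-up and can run stopper work */
};

/* Mirrors: smdata.num_threads = num_active_cpus() + 1; */
static int model_expected_threads(const struct model_cpu *cpus, int n)
{
	int active = 0;

	for (int i = 0; i < n; i++)
		if (cpus[i].active)
			active++;
	return active + 1; /* +1 for the inactive cpu calling stop_machine */
}

/* Threads that really reach the stop_machine rendezvous. */
static int model_threads_that_ack(const struct model_cpu *cpus, int n)
{
	int acks = 1; /* the calling cpu itself */

	for (int i = 0; i < n; i++)
		if (cpus[i].active && cpus[i].can_run_work)
			acks++;
	return acks;
}

/* The rendezvous only completes when every expected thread shows up. */
static bool model_stop_machine_completes(const struct model_cpu *cpus, int n)
{
	return model_threads_that_ack(cpus, n) == model_expected_threads(cpus, n);
}
```

With one healthy cpu and one cpu that is active but stuck, the model
expects 3 threads and only 2 ever arrive, so the rendezvous never
completes. If the stuck cpu is not marked active, both counts are 2.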

Backtrace from the reproducer:

PID: 3324   TASK: ffff88007c00ae20  CPU: other cpus   COMMAND: "migration/1"
    [exception RIP: stop_machine_cpu_stop+131]
...
 #0 [ffff88007b4d7de8] cpu_stopper_thread at ffffffff810c66bd
 #1 [ffff88007b4d7ee8] kthread at ffffffff8107871e
 #2 [ffff88007b4d7f48] kernel_thread_helper at ffffffff8154af24

PID: 0      TASK: ffff88007c029710  CPU: 2   COMMAND: "swapper/2"
    [exception RIP: check_tsc_sync_target+33]
...
 #0 [ffff88007c025f30] start_secondary at ffffffff81539876

PID: 0      TASK: ffff88007c041710  CPU: 3   COMMAND: "swapper/3"
    [exception RIP: stop_machine_cpu_stop+131]
...
 #0 [ffff88007c04be50] stop_machine_from_inactive_cpu at ffffffff810c6b2f
 #1 [ffff88007c04bee0] mtrr_ap_init at ffffffff8102e963
 #2 [ffff88007c04bf10] identify_secondary_cpu at ffffffff81536799
 #3 [ffff88007c04bf20] smp_store_cpu_info at ffffffff815396d5
 #4 [ffff88007c04bf30] start_secondary at ffffffff81539800

This can be fixed by not marking the cpu being onlined as active too
early, i.e. by calling notify_cpu_starting() only after
check_tsc_sync_target() has returned.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 arch/x86/kernel/smpboot.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6e1e406..ae19d90 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -232,8 +232,6 @@ static void __cpuinit smp_callin(void)
 	set_cpu_sibling_map(raw_smp_processor_id());
 	wmb();
 
-	notify_cpu_starting(cpuid);
-
 	/*
 	 * Allow the master to continue.
 	 */
@@ -268,6 +266,8 @@ notrace static void __cpuinit start_secondary(void *unused)
 	 */
 	check_tsc_sync_target();
 
+	notify_cpu_starting(smp_processor_id());
+
 	/*
 	 * We need to hold call_lock, so there is no inconsistency
 	 * between the time smp_call_function() determines number of
-- 
1.7.1

