Re: [PATCH] smp/call: Detect stuck CSD locks

From: Ingo Molnar <mingo@kernel.org>
To: Chris J Arges <chris.j.arges@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Rafael David Tinoco <inaddy@ubuntu.com>,
	Peter Anvin <hpa@zytor.com>,
	Jiang Liu <jiang.liu@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Gema Gomez <gema.gomez-solano@canonical.com>,
	the arch/x86 maintainers <x86@kernel.org>
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks
Date: Wed, 8 Apr 2015 08:47:34 +0200
Message-ID: <20150408064734.GA26861@gmail.com>
In-Reply-To: <20150407205945.GA28212@canonical.com>


* Chris J Arges <chris.j.arges@canonical.com> wrote:

> Ingo,
> 
> Looks like sched_clock() works in this case.
> 
> Adding the dump_stack() line caused various issues such as the VM 
> oopsing on boot or the softlockup never being detected properly (and 
> thus not crashing). So the below is running with your patch and 
> 'dump_stack()' commented out.
> 
> Here is the log leading up to the soft lockup (I adjusted CSD_LOCK_TIMEOUT to 5s):
> [   22.669630] kvm [1523]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> [   38.712710] csd: Detected non-responsive CSD lock (#1) on CPU#00, waiting 5.000 secs for CPU#01
> [   38.712715] csd: Re-sending CSD lock (#1) IPI from CPU#00 to CPU#01
> [   43.712709] csd: Detected non-responsive CSD lock (#2) on CPU#00, waiting 5.000 secs for CPU#01
> [   43.712713] csd: Re-sending CSD lock (#2) IPI from CPU#00 to CPU#01
> [   48.712708] csd: Detected non-responsive CSD lock (#3) on CPU#00, waiting 5.000 secs for CPU#01
> [   48.712732] csd: Re-sending CSD lock (#3) IPI from CPU#00 to CPU#01
> [   53.712708] csd: Detected non-responsive CSD lock (#4) on CPU#00, waiting 5.000 secs for CPU#01
> [   53.712712] csd: Re-sending CSD lock (#4) IPI from CPU#00 to CPU#01
> [   58.712707] csd: Detected non-responsive CSD lock (#5) on CPU#00, waiting 5.000 secs for CPU#01
> [   58.712712] csd: Re-sending CSD lock (#5) IPI from CPU#00 to CPU#01
> [   60.080005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26]
> 
> Still we never seem to release the lock, even when resending the IPI.
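
[The detect-and-retry pattern visible in the log above can be modelled in user space. What follows is only a sketch under stated assumptions, not the patch itself: every name in it (csd_model, sim, csd_lock_wait_model, run_sim) is hypothetical, the clock is a fake that stands in for sched_clock(), the "target CPU" is a counter that may or may not release the lock when an IPI is re-sent, and the give-up limit exists only so that the simulation terminates.]

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical user-space model; none of these names come from the patch. */
#define CSD_FLAG_LOCK	0x01u
#define NSEC_PER_SEC	UINT64_C(1000000000)

struct csd_model {
	unsigned int flags;	/* CSD_FLAG_LOCK stays set until the target runs the call */
};

/* Simulation state: a fake clock and a target that may ack a re-sent IPI. */
struct sim {
	uint64_t now_ns;
	struct csd_model csd;
	int resends;		/* IPIs re-sent so far */
	int respond_after;	/* target acks on this re-send; 0 means never */
};

/* Stand-in for sched_clock(): each read advances the fake clock by 1s. */
static uint64_t sim_clock(struct sim *s)
{
	s->now_ns += NSEC_PER_SEC;
	return s->now_ns;
}

/* Stand-in for re-sending the IPI; the simulated target may release the lock. */
static void sim_resend_ipi(struct sim *s)
{
	if (++s->resends == s->respond_after)
		s->csd.flags &= ~CSD_FLAG_LOCK;
}

/*
 * Wait for the lock flag to clear. Each time timeout_ns passes without
 * progress, report it and re-send the IPI. Gives up after max_reports
 * reports and returns how many were made (0: released before any timeout).
 */
static int csd_lock_wait_model(struct sim *s, int src, int dst,
			       uint64_t timeout_ns, int max_reports)
{
	uint64_t t0 = sim_clock(s);
	int reports = 0;

	while (s->csd.flags & CSD_FLAG_LOCK) {
		uint64_t t1 = sim_clock(s);

		if (t1 - t0 < timeout_ns)
			continue;

		reports++;
		printf("csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, "
		       "waiting %" PRIu64 ".%03" PRIu64 " secs for CPU#%02d\n",
		       reports, src, timeout_ns / NSEC_PER_SEC,
		       (timeout_ns % NSEC_PER_SEC) / 1000000, dst);
		if (reports >= max_reports)
			break;

		printf("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n",
		       reports, src, dst);
		sim_resend_ipi(s);
		t0 = t1;	/* restart the timeout window after each re-send */
	}
	return reports;
}

/* Run one scenario with a 5s timeout, CPU#0 waiting for CPU#1. */
static int run_sim(int respond_after, int max_reports)
{
	struct sim s = { .csd = { .flags = CSD_FLAG_LOCK },
			 .respond_after = respond_after };

	return csd_lock_wait_model(&s, 0, 1, 5 * NSEC_PER_SEC, max_reports);
}
```

[Calling run_sim(0, 5) models the situation in the log, where the target never releases the lock however often the IPI is re-sent, while run_sim(2, 5) models a target that would have recovered on the second re-send. The log matches the first case, which suggests a lost IPI alone does not explain the hang.]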

So it would be really important to see the stack dump of CPU#0 at this 
point, and also do an APIC state dump of it.

Because from previous dumps it appeared that the 'stuck' CPU was just 
in its idle loop - which is 'impossible' as the idle loop should still 
allow APIC irqs to arrive.

This behavior can only happen if:

	- CPU#0 has irqs disabled perpetually. A dump of CPU#0 should
	  tell us where it's executing. This has a fair chance of 
	  being the case, as it was the cause of a fair number of 
	  bugs in the past, but I thought the past dumps you guys 
	  provided had excluded this possibility ... still, it merits 
	  re-examination with the debug patches applied.

	- the APIC on CPU#0 is unacked and has queued up so many IPIs 
	  that it starts rejecting them. I'm not sure that's even 
	  possible on KVM, unless part of the hardware virtualizes 
	  the APIC. One other thing that speaks against this scenario 
	  is that NMIs appear to be reaching through to CPU#0: the 
	  crash dumps and dump-on-all-cpus NMI callbacks worked fine.

	- the APIC on CPU#0 is in some weird state well outside of its 
	  Linux programming model (TPR set wrong, etc. etc.). There's 
	  literally a myriad of ways an APIC can be configured to not 
	  receive IPIs: but I've never actually seen this happen under 
	  Linux, as it needs complicated writes to specialized APIC 
	  registers, and we don't actually reconfigure the APIC in any 
	  serious fashion aside from bootup. Low likelihood but not 
	  impossible. Again, NMIs reaching through make this situation 
	  less likely.

	- CPU#0 has a bad IDT and is essentially ignoring certain 
	  IPIs. This presumes some serious but very targeted memory 
	  corruption. Lowest likelihood.

	- ... other failure modes that elude me. None of the 
	  scenarios above strikes me as particularly plausible - but 
	  something must be causing the lockup, so ...
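
[Several of the scenarios above (irqs disabled, an unacked in-service interrupt, a wrongly set TPR) reduce to the same question: would the local APIC hand a given fixed-mode vector to the CPU right now? The sketch below is a deliberately simplified model of that priority rule as I understand it from the x86 architecture manuals: the processor priority is derived from the TPR and the highest in-service vector, and a fixed interrupt is only dispensed if its priority class (bits 7:4 of the vector) is above the processor's. The struct, the function names and the example vector 0xfb are assumptions for illustration, and NMIs are left out because they are not subject to this check.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed example vector for the call-function-single IPI; illustrative only. */
#define EXAMPLE_IPI_VECTOR	0xfb

/* Simplified model of one local APIC plus its CPU; field names are made up. */
struct lapic_model {
	uint8_t tpr;		/* task priority register */
	int	highest_isr;	/* highest in-service (not yet EOI'd) vector, -1 if none */
	bool	irqs_enabled;	/* interrupt flag of the owning CPU */
};

/* Processor priority: the higher of TPR and the in-service priority class. */
static uint8_t lapic_ppr(const struct lapic_model *a)
{
	uint8_t isrv = a->highest_isr < 0 ? 0 : (uint8_t)a->highest_isr;

	if ((a->tpr >> 4) >= (isrv >> 4))
		return a->tpr;
	return isrv & 0xf0;
}

/* Would a pending fixed-mode interrupt with this vector reach the CPU now? */
static bool lapic_can_take_vector(const struct lapic_model *a, uint8_t vector)
{
	if (!a->irqs_enabled)
		return false;	/* scenario: irqs disabled perpetually */

	/* Only a strictly higher priority class than the PPR gets through. */
	return (vector >> 4) > (lapic_ppr(a) >> 4);
}
```

[With tpr = 0, nothing in service and irqs enabled, the example vector gets through. It is blocked if irqs are off, if the TPR has been raised to 0xf0 or above, or if an earlier interrupt of the same priority class was never acked. A real APIC state dump would let each of these three inputs be checked directly.]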

In any case, something got seriously messed up on CPU#0, and stays 
messed up during the lockup, and it would help a lot to figure out 
exactly what, by further examining its state.

Note, it might also be useful to dump KVM's state of the APIC of 
CPU#0, to see why _it_ isn't sending (and injecting) the lapic IRQ 
into CPU#0. By all means it should. [Maybe take a look at CPU#1 as 
well, to make sure the IPI was actually generated.]

It should be much easier to figure this out on the KVM side, which 
emulates the lapic to a large degree, than on the native hardware 
side, because there we can see 'hardware state' directly. If we are 
lucky then the KVM problem mirrors the native hardware problem.

Btw., it might also be helpful to try to turn off hardware assisted 
APIC virtualization on the KVM side, to make the APIC purely software 
emulated. If this makes the bug go away magically then this raises the 
likelihood that the bug is really hardware APIC related.

I don't know what the magic incantation is to make 'pure software 
APIC' happen on KVM and Qemu though.

Thanks,

	Ingo
