From: Chris J Arges <chris.j.arges@canonical.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Rafael David Tinoco <inaddy@ubuntu.com>,
	Peter Anvin <hpa@zytor.com>,
	Jiang Liu <jiang.liu@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Gema Gomez <gema.gomez-solano@canonical.com>,
	the arch/x86 maintainers <x86@kernel.org>
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks
Date: Sun, 12 Apr 2015 22:56:17 -0500
Message-ID: <20150413035616.GA24037@canonical.com>
In-Reply-To: <20150408064734.GA26861@gmail.com>

<snip> 
> So it would be really important to see the stack dump of CPU#0 at this 
> point, and also do an APIC state dump of it.
> 
> Because from previous dumps it appeared that the 'stuck' CPU was just 
> in its idle loop - which is 'impossible' as the idle loop should still 
> allow APIC irqs arriving.
> 
> This behavior can only happen if:
> 
> 	- CPU#0 has irqs disabled perpetually. A dump of CPU#0 should
> 	  tell us where it's executing. This has actually a fair 
> 	  chance to be the case as it actually happened in a fair 
> 	  number of bugs in the past, but I thought from past dumps 
> 	  you guys provided that this possibility was excluded ... but 
> 	  it merits re-examination with the debug patches applied.
> 
> 	- the APIC on CPU#0 is unacked and has queued up so many IPIs 
> 	  that it starts rejecting them. I'm not sure that's even 
> 	  possible on KVM, unless part of the hardware virtualizes the 
> 	  APIC. One other thing that speaks against this scenario is 
> 	  that NMIs appear to be reaching through to CPU#0: the crash 
> 	  dumps and dump-on-all-cpus NMI callbacks worked fine.
> 
> 	- the APIC on CPU#0 is in some weird state well outside of its 
> 	  Linux programming model (TPR set wrong, etc. etc.). There's 
> 	  literally a myriad of ways an APIC can be configured to not 
> 	  receive IPIs: but I've never actually seen this happen under 
> 	  Linux, as it needs complicated writes to specialized APIC 
> 	  registers, and we don't actually reconfigure the APIC in any 
> 	  serious fashion aside from bootup. Low likelihood but not 
> 	  impossible. Again, NMIs reaching through make this situation 
> 	  less likely.
> 
> 	- CPU#0 having a bad IDT and essentially ignoring certain 
> 	  IPIs. This presumes some serious but very targeted memory 
> 	  corruption. Lowest likelihood.
> 
> 	- ... other failure modes that elude me. None of the 
> 	  scenarios above strikes me as particularly plausible - but 
> 	  something must be causing the lockup, so ...
> 
> In any case, something got seriously messed up on CPU#0, and stays 
> messed up during the lockup, and it would help a lot figuring out 
> exactly what, by further examining its state.
> 
> Note, it might also be useful to dump KVM's state of the APIC of 
> CPU#0, to see why _it_ isn't sending (and injecting) the lapic IRQ 
> into CPU#0. By all means it should. [Maybe take a look at CPU#1 as 
> well, to make sure the IPI was actually generated.]
> 
> It should be much easier to figure this out on the KVM side, which 
> emulates the lapic to a large degree, than on the native hardware 
> side, so we can see 'hardware state' directly. If we are lucky then 
> the KVM problem mirrors the native hardware problem.
> 
> Btw., it might also be helpful to try to turn off hardware assisted 
> APIC virtualization on the KVM side, to make the APIC purely software 
> emulated. If this makes the bug go away magically then this raises the 
> likelihood that the bug is really hardware APIC related.
>
> I don't know what the magic incantation is to make 'pure software 
> APIC' happen on KVM and Qemu though.
> 
> Thanks,
> 
> 	Ingo
>

Ingo,

/sys/module/kvm_intel/parameters/enable_apicv on the affected hardware is not
enabled, and unfortunately my hardware doesn't have the necessary features
to enable it. So we are dealing with KVM's lapic implementation only.
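
For anyone wanting to check the same setting on their own host, here is a
minimal sketch of reading that parameter. The path is the one quoted above;
the helper names parse_bool_param and read_bool_param are hypothetical, and
the sketch assumes the parameter is exported as a kernel bool, which sysfs
prints as "Y" or "N".

```c
#include <stdio.h>

/* Hypothetical helper: interpret the text of a boolean module parameter.
 * Returns 1 for enabled, 0 for disabled, -1 if the text is not recognized. */
int parse_bool_param(const char *text)
{
    if (text == NULL)
        return -1;

    switch (text[0]) {
    case 'Y': case 'y': case '1':
        return 1;
    case 'N': case 'n': case '0':
        return 0;
    default:
        return -1;
    }
}

/* Hypothetical helper: read a parameter file and parse its first line.
 * Returns -1 if the file cannot be opened or read (for example when the
 * module is not loaded). */
int read_bool_param(const char *path)
{
    char buf[16];
    int result = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) != NULL)
        result = parse_bool_param(buf);
    fclose(f);
    return result;
}
```

Calling read_bool_param("/sys/module/kvm_intel/parameters/enable_apicv")
should return 0 on a host like the one described here, and -1 where kvm_intel
is not loaded.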

FYI, I'm working on getting better data at the moment and here is my approach:
* For the L0 kernel:
 - In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output (and print
   the addresses of various useful structures)
 - Set up crash to live-dump the kvm_lapic structures and associated
   registers for both vCPUs
* For the L1 kernel:
 - Dump a stacktrace when we detect a lockup.
 - Detect a lockup and try to not alter the state.
 - Have a reliable signal such that the L0 hypervisor can dump the lapic
   structures and registers when csd_lock_wait detects a softlockup.
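
To illustrate the "detect a lockup and try to not alter the state" idea, here
is a rough user-space sketch, not the kernel's csd_lock_wait and not the debug
patch from this thread. It assumes a flag that another thread is expected to
clear, and bounds the wait by a spin count rather than a real timeout to keep
the example small. The name wait_for_unlock and the max_spins parameter are
hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch: spin until *locked becomes 0 or until max_spins
 * polls have been made. The flag is only ever read, never written, so a
 * stuck lock is reported without disturbing the state being debugged.
 * Returns true if the flag was seen cleared, false if the wait gave up. */
bool wait_for_unlock(atomic_int *locked, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(locked, memory_order_acquire) == 0)
            return true;
    }
    return false;
}
```

A false return would be the point at which to dump a stack trace and signal
the hypervisor, leaving the flag exactly as it was found.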

Hopefully I can make progress and present meaningful results in my next update.

Thanks,
--chris j arges


