xen-devel.lists.xenproject.org archive mirror
From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Xen-devel <xen-devel@lists.xen.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
	Keir Fraser <keir@xen.org>, Jan Beulich <JBeulich@suse.com>,
	Tim Deegan <tim@xen.org>
Subject: [Patch v2] x86/crash: Indicate how well nmi_shootdown_cpus() managed to do.
Date: Wed, 25 Sep 2013 11:22:13 +0100
Message-ID: <1380104533-16110-1-git-send-email-andrew.cooper3@citrix.com>
In-Reply-To: <1380052613-3837-1-git-send-email-andrew.cooper3@citrix.com>

Having nmi_shootdown_cpus() report which pcpus failed to be shot down is a
useful debugging hint as to what possibly went wrong (especially when the
crash logs seem to indicate that an NMI timeout occurred while waiting for one
of the problematic pcpus to perform an action).

This is achieved by replacing the atomic_t count of pcpus which have yet to
report in with a cpumask.  If the 1 second timeout expires, the cpumask is
used to identify the problematic pcpus.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <JBeulich@suse.com>
CC: Tim Deegan <tim@xen.org>

---

Changes in v2:
 * Use cpumask_andnot() in preference to copy() followed by clear_cpu()
 * Use cpumask_empty() in preference to "if ( msecs )"
 * Use !cpumask_empty() in preference to "cpumask_weight(&waiting_to_crash) > 0"

We in XenServer have seen a few crashes like this recently, and having an
extra bit of debugging on the serial console or in the conring is
substantially more helpful than trying to piece the crash together after the
fact based on what information is missing.
---
 xen/arch/x86/crash.c |   19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/crash.c b/xen/arch/x86/crash.c
index 0a807d1..4495451 100644
--- a/xen/arch/x86/crash.c
+++ b/xen/arch/x86/crash.c
@@ -22,6 +22,7 @@
 #include <xen/perfc.h>
 #include <xen/kexec.h>
 #include <xen/sched.h>
+#include <xen/keyhandler.h>
 #include <public/xen.h>
 #include <asm/shared.h>
 #include <asm/hvm/support.h>
@@ -30,7 +31,7 @@
 #include <xen/iommu.h>
 #include <asm/hpet.h>
 
-static atomic_t waiting_for_crash_ipi;
+static cpumask_t waiting_to_crash;
 static unsigned int crashing_cpu;
 static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
 
@@ -65,7 +66,7 @@ void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
         __stop_this_cpu();
 
         this_cpu(crash_save_done) = 1;
-        atomic_dec(&waiting_for_crash_ipi);
+        cpumask_clear_cpu(cpu, &waiting_to_crash);
     }
 
     /* Poor mans self_nmi().  __stop_this_cpu() has reverted the LAPIC
@@ -122,7 +123,7 @@ static void nmi_shootdown_cpus(void)
     crashing_cpu = cpu;
     local_irq_count(crashing_cpu) = 0;
 
-    atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
+    cpumask_andnot(&waiting_to_crash, &cpu_online_map, cpumask_of(cpu));
 
     /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
      * invokes do_nmi_crash (above), which cause them to write state and
@@ -162,12 +163,22 @@ static void nmi_shootdown_cpus(void)
     smp_send_nmi_allbutself();
 
     msecs = 1000; /* Wait at most a second for the other cpus to stop */
-    while ( (atomic_read(&waiting_for_crash_ipi) > 0) && msecs )
+    while ( !cpumask_empty(&waiting_to_crash) && msecs )
     {
         mdelay(1);
         msecs--;
     }
 
+    /* Leave a hint of how well we did trying to shoot down the other cpus */
+    if ( cpumask_empty(&waiting_to_crash) )
+        printk("Shot down all cpus\n");
+    else
+    {
+        cpulist_scnprintf(keyhandler_scratch, sizeof keyhandler_scratch,
+                          &waiting_to_crash);
+        printk("Failed to shoot down cpus {%s}\n", keyhandler_scratch);
+    }
+
     /* Crash shutdown any IOMMU functionality as the crashdump kernel is not
      * happy when booting if interrupt/dma remapping is still enabled */
     iommu_crash_shutdown();
-- 
1.7.10.4
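
For readers outside the Xen tree, the idea in the patch (track the cpus still
expected to report in as a bitmask rather than a bare count, so stragglers can
be named after a timeout) can be sketched in plain userspace C.  This is only
an illustration under stated assumptions: a 64-bit word stands in for Xen's
cpumask_t, C11 atomics stand in for the cpumask helpers, and the names
waiting_init(), waiting_clear(), waiting_empty() and waiting_format() are
hypothetical, not Xen functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_CPUS 64u

/* Hypothetical stand-in for a cpumask: one bit per cpu still to report in. */
static _Atomic uint64_t waiting_mask;

/* Mark every online cpu except `self` as still expected to report in. */
static void waiting_init(uint64_t online, unsigned int self)
{
    assert(self < MAX_CPUS);
    atomic_store(&waiting_mask, online & ~(UINT64_C(1) << self));
}

/* Called on behalf of a cpu once it has saved its state. */
static void waiting_clear(unsigned int cpu)
{
    assert(cpu < MAX_CPUS);
    atomic_fetch_and(&waiting_mask, ~(UINT64_C(1) << cpu));
}

/* Non-zero once every expected cpu has reported in. */
static int waiting_empty(void)
{
    return atomic_load(&waiting_mask) == 0;
}

/*
 * Write the cpus still outstanding into buf as a comma-separated list.
 * Returns the number of characters written, excluding the terminator.
 */
static int waiting_format(char *buf, size_t size)
{
    uint64_t mask = atomic_load(&waiting_mask);
    size_t len = 0;

    if ( size == 0 )
        return 0;
    buf[0] = '\0';

    for ( unsigned int cpu = 0; cpu < MAX_CPUS; cpu++ )
    {
        int n;

        if ( !(mask & (UINT64_C(1) << cpu)) )
            continue;

        n = snprintf(buf + len, size - len, len ? ",%u" : "%u", cpu);
        if ( n < 0 )
            break;
        if ( (size_t)n >= size - len )
        {
            /* Output was truncated; snprintf() already terminated buf. */
            len = size - 1;
            break;
        }
        len += (size_t)n;
    }

    return (int)len;
}
```

A caller would invoke waiting_init() before signalling the other cpus, poll
waiting_empty() until a deadline, and on timeout print the result of
waiting_format().  A plain counter can only say how many cpus are missing; the
mask also says which ones.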
