All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Jan Beulich <JBeulich@suse.com>,
	xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH] x86: show remote CPU state upon fatal NMI
Date: Tue, 14 Jun 2016 16:03:34 +0100	[thread overview]
Message-ID: <57601CC6.5060500@citrix.com> (raw)
In-Reply-To: <576031C802000078000F4D9F@prv-mh.provo.novell.com>

On 14/06/16 15:33, Jan Beulich wrote:
> Quite frequently the watchdog would hit an innocent CPU, e.g. one
> trying to acquire a spin lock a remote CPU holds for extended periods
> of time, or a random CPU in TSC calbration rendezvous. In such cases
> the register and stack dump for that CPU doesn't really help in the
> analysis of the problem.
>
> To keep things reasonable on large systems, only log CS:RIP by default.
> This can be overridden via a new extension to the "nmi=" command line
> option such that full register/stack state will get dumped.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -1155,7 +1155,7 @@ Use the MWAIT idle driver (with model sp
>  of the ACPI based one.
>  
>  ### nmi
> -> `= ignore | dom0 | fatal`
> +> `= ignore | dom0 | fatal [,show-all]`
>  
>  > Default: `fatal` for a debug build, or `dom0` for a non-debug build
>  
> @@ -1163,6 +1163,9 @@ Specify what Xen should do in the event
>  `ignore` discards the error; `dom0` causes Xen to report the error to
>  dom0, while 'fatal' causes Xen to print diagnostics and then hang.
>  
> +The `show-all` modifier forces all CPUs' full state to be dumped upon
> +fatal NMIs (normally a result of the watchdog kicking in).
> +
>  ### noapic
>  
>  Instruct Xen to ignore any IOAPICs that are present in the system, and
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -84,10 +84,11 @@
>   *  dom0:   The NMI is virtualised to DOM0.
>   *  ignore: The NMI error is cleared and ignored.
>   */
> +static char __read_mostly opt_nmi[16] =
>  #ifdef NDEBUG
> -static char __read_mostly opt_nmi[10] = "dom0";
> + "dom0";
>  #else
> -static char __read_mostly opt_nmi[10] = "fatal";
> + "fatal";
>  #endif
>  string_param("nmi", opt_nmi);
>  
> @@ -525,6 +526,35 @@ void vcpu_show_execution_state(struct vc
>      vcpu_unpause(v);
>  }
>  
> +static cpumask_t nmi_show_state_mask;
> +static bool_t opt_nmi_show_all;
> +
> +static int __init get_nmi_show_all(void)
> +{
> +    const char *s = strchr(opt_nmi, ',');
> +
> +    if ( s && !strcmp(s + 1, "show-all") )
> +        opt_nmi_show_all = 1;
> +
> +    return 0;
> +}
> +presmp_initcall(get_nmi_show_all);
> +
> +static int nmi_show_execution_state(const struct cpu_user_regs *regs, int cpu)
> +{
> +    if ( !cpumask_test_cpu(cpu, &nmi_show_state_mask) )
> +        return 0;
> +
> +    if ( opt_nmi_show_all )
> +        show_execution_state(regs);
> +    else
> +        printk(XENLOG_ERR "CPU%d @ %04x:%08lx (%pS)\n", cpu, regs->cs, regs->rip,
> +               guest_mode(regs) ? _p(regs->rip) : NULL);
> +    cpumask_clear_cpu(cpu, &nmi_show_state_mask);

I would clear the mask before printing state.  Given the nature of this
handler, it liable to contend sufficiently on the console lock to induce
the further watchdog timeout.

> +
> +    return 1;
> +}
> +
>  static const char *trapstr(unsigned int trapnr)
>  {
>      static const char * const strings[] = {
> @@ -570,6 +600,15 @@ void fatal_trap(const struct cpu_user_re
>              printk("Faulting linear address: %p\n", _p(cr2));
>              show_page_walk(cr2);
>          }
> +        else if ( trapnr == TRAP_nmi )
> +        {
> +            cpumask_andnot(&nmi_show_state_mask, &cpu_online_map,
> +                           cpumask_of(smp_processor_id()));
> +            set_nmi_callback(nmi_show_execution_state);
> +            smp_send_nmi_allbutself();

This would cause far less spinlock contention as

for_each_cpu( cpu, nmi_show_state_mask )
    smp_send_nmi(cpu);

I realise this involves introducing a new smp function, but it would
substantially reduce contention on the console lock.


I would recommend moving this clause into nmi_watchdog_tick(), so that
it doesn't get involved for non-watchdog NMIs.  IOCK/SERR NMIs won't
have anything interesting to print from here.  I would also recommend
disabling the watchdog before IPI'ing.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

  reply	other threads:[~2016-06-14 15:03 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-14 14:33 [PATCH] x86: show remote CPU state upon fatal NMI Jan Beulich
2016-06-14 15:03 ` Andrew Cooper [this message]
2016-06-15  7:55   ` Jan Beulich
2016-06-15 11:03     ` Andrew Cooper
2016-06-15 12:59       ` Jan Beulich
2016-06-15 13:15         ` Andrew Cooper

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57601CC6.5060500@citrix.com \
    --to=andrew.cooper3@citrix.com \
    --cc=JBeulich@suse.com \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.