linux-arch.vger.kernel.org archive mirror
From: Keith Owens <kaos@sgi.com>
To: vgoyal@in.ibm.com
Cc: linux-arch@vger.kernel.org
Subject: Re: [patch 2.6.19-rc5 0/12] crash_stop: Summary
Date: Mon, 13 Nov 2006 13:08:34 +1100
Message-ID: <10293.1163383714@ocs3.ocs.com.au>
In-Reply-To: Your message of "Fri, 10 Nov 2006 20:45:05 CDT." <20061111014505.GA12814@in.ibm.com>

Vivek Goyal (on Fri, 10 Nov 2006 20:45:05 -0500) wrote:
>On Thu, Nov 09, 2006 at 03:04:18PM +1100, Keith Owens wrote:
>Hi Keith,
>
>> All the kernel debug style tools (kdb, kgdb, nlkd, netdump, lkcd,
>> crash, kdump etc.) have a common requirement: they need to do a crash
>> stop of the system.  This means stopping all the cpus, even if some of
>> the cpus are spinning disabled.  In addition, each cpu has to save
>> enough state to start the diagnosis of the problem.
>> 
>> * Each debug style tool has written its own code for interrupting the
>>   other cpus and for saving cpu state.
>> 
>> * Some tools try a normal IPI first then send a non-maskable interrupt
>>   after a delay.
>> 
>> * Some tools always send a NMI first, which can result in incomplete or
>>   wrong machine state if NMI arrives at the wrong time.
>> 
>
>What kind of problem can one run into if an NMI is sent directly instead
>of trying a normal IPI first?

Incomplete cpu state on the cpus that are hit with NMI.  By definition,
NMI can be delivered at any time, with the cpu in any state.  It can be
in the middle of saving the state for a previous interrupt when NMI is
delivered.  On IA64, a cpu can even be in physical mode instead of
virtual mode when INIT is delivered.  All of which means that you have
incomplete or misleading information in your dump.

Sending a normal interrupt first, waiting a short while, then sending
an NMI later maximises the chance that we get good state for every cpu.
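
As a rough illustration of that ordering, here is a small userspace model, not the crash_stop patch code.  Every name in it (sim_cpu, sim_send_ipi, sim_send_nmi, sim_crash_stop) is hypothetical, and it assumes pessimistically that any cpu which has to be stopped by NMI yields incomplete state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code. */
enum stop_method { STOP_NONE, STOP_IPI, STOP_NMI };

struct sim_cpu {
	bool irqs_enabled;           /* can this cpu take a maskable interrupt? */
	enum stop_method stopped_by; /* how the cpu was eventually stopped */
	bool state_complete;         /* was a clean state captured? */
};

/* Normal IPI: only taken when interrupts are enabled, state is clean. */
static void sim_send_ipi(struct sim_cpu *cpu)
{
	if (cpu->stopped_by == STOP_NONE && cpu->irqs_enabled) {
		cpu->stopped_by = STOP_IPI;
		cpu->state_complete = true;
	}
}

/* NMI: always taken, but assumed here to give incomplete state. */
static void sim_send_nmi(struct sim_cpu *cpu)
{
	if (cpu->stopped_by == STOP_NONE) {
		cpu->stopped_by = STOP_NMI;
		cpu->state_complete = false;
	}
}

/*
 * Stop every cpu except 'self': normal IPI first, then NMI only for
 * the cpus that did not respond.  Returns how many cpus needed NMI.
 */
static size_t sim_crash_stop(struct sim_cpu *cpus, size_t n, size_t self)
{
	size_t i, nmi_count = 0;

	assert(self < n);
	for (i = 0; i < n; i++)
		if (i != self)
			sim_send_ipi(&cpus[i]);

	/* Real code would wait a short, bounded time here. */

	for (i = 0; i < n; i++) {
		if (i == self || cpus[i].stopped_by != STOP_NONE)
			continue;
		sim_send_nmi(&cpus[i]);
		nmi_count++;
	}
	return nmi_count;
}
```

In this model only the cpus that are spinning disabled end up with state_complete == false.  Sending NMI to everybody first would put every cpu in that category.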

>On a general note, I am not sure how well suited this infrastructure
>is for crash dump needs. We are trying to follow one theme, and that
>is to run the bare minimum of code after a system crash to increase reliability.

Agreed.  But you still have to stop all the existing cpus _and_ capture
their state before switching to the second kernel.  There is no point
in switching to a new kernel if you cannot get information about all
the cpus from the failing kernel.

>Avoid taking locks, avoid relying on the crashed task's stack etc. (if possible).
>Of course that is the ideal situation and we have not achieved that state, but
>roughly that seems to be the long term goal. Looking at the patches, it
>looks like they introduce lots of code to be run after a crash and also
>use smp_processor_id(), which introduces a dependency on the stack. This poses
>a problem in stack overflow cases. Fernando from valinux introduced the
>safe_smp_processor_id() call to read the apic id from the LAPIC instead of relying
>on the thread's stack. As of today, the crash path is not safe from stack
>overflow, but we hope that someday it will be.

Good point.  I will look at converting crash_stop to use
safe_smp_processor_id.
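
The idea described above, deriving the cpu number from a hardware id rather than from anything reached through the stack, can be modelled in userspace roughly as follows.  This is only a sketch under assumptions: the table contents and the names sim_cpu_to_apicid and sim_safe_processor_id are hypothetical, and the hardware id is passed in as an argument instead of being read from the LAPIC.

```c
#include <stddef.h>

#define SIM_NR_CPUS 4

/* Hypothetical table: logical cpu number -> hardware APIC id. */
static const unsigned int sim_cpu_to_apicid[SIM_NR_CPUS] = { 0, 2, 4, 6 };

/*
 * Map a hardware id back to a logical cpu number by scanning the
 * table.  Nothing here depends on per-thread data.  Returns -1 if the
 * id is not in the table.
 */
static int sim_safe_processor_id(unsigned int hw_apicid)
{
	size_t cpu;

	for (cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		if (sim_cpu_to_apicid[cpu] == hw_apicid)
			return (int)cpu;
	return -1;
}
```

With the table above, hardware id 4 maps to logical cpu 2, and an id that is not listed gives -1 so the caller can decide how to proceed.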


Thread overview: 15+ messages
2006-11-09  4:04 [patch 2.6.19-rc5 0/12] crash_stop: Summary Keith Owens
2006-11-09  4:04 ` [patch 2.6.19-rc5 1/12] crash_stop: common header Keith Owens
2006-11-09  4:04 ` [patch 2.6.19-rc5 2/12] crash_stop: common code Keith Owens
2006-11-09  4:04 ` [patch 2.6.19-rc5 3/12] crash_stop: i386 interrupt handlers Keith Owens
2006-11-09  4:04 ` [patch 2.6.19-rc5 4/12] crash_stop: i386 specific code Keith Owens
2006-11-09  4:04 ` [patch 2.6.19-rc5 5/12] crash_stop: add DIE_NMIWATCHDOG to x86_64 Keith Owens
2006-11-09  4:04 ` [patch 2.6.19-rc5 6/12] crash_stop: x86_64 interrupt handlers Keith Owens
2006-11-09  4:04 ` [patch 2.6.19-rc5 7/12] crash_stop: x86_64 specific code Keith Owens
2006-11-09  4:05 ` [patch 2.6.19-rc5 8/12] crash_stop: ia64 interrupt handlers Keith Owens
2006-11-09  4:05 ` [patch 2.6.19-rc5 9/12] crash_stop: ia64 specific code Keith Owens
2006-11-09  4:05 ` [patch 2.6.19-rc5 10/12] crash_stop: add to config system Keith Owens
2006-11-09  4:05 ` [patch 2.6.19-rc5 11/12] crash_stop: demonstration code Keith Owens
2006-11-09  4:05 ` [patch 2.6.19-rc5 12/12] crash_stop: test code Keith Owens
2006-11-11  1:45 ` [patch 2.6.19-rc5 0/12] crash_stop: Summary Vivek Goyal
2006-11-13  2:08   ` Keith Owens [this message]