From: ebiederm@xmission.com (Eric W. Biederman)
To: prasad@linux.vnet.ibm.com
Cc: "Luck, Tony" <tony.luck@intel.com>,
Andi Kleen <andi@firstfloor.org>,
Ananth N Mavinakayanahalli <ananth@in.ibm.com>,
kexec@lists.infradead.org,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Srivatsa Vaddagiri <vatsa@in.ibm.com>,
Vivek Goyal <vgoyal@redhat.com>
Subject: Re: [RFC] Kdump and memory error handling
Date: Thu, 12 May 2011 15:22:44 -0700
Message-ID: <m1hb8zmudn.fsf@fess.ebiederm.org>
In-Reply-To: <20110504193509.GA5342@in.ibm.com> (K. Prasad's message of "Thu, 5 May 2011 01:05:09 +0530")
"K.Prasad" <prasad@linux.vnet.ibm.com> writes:
> Hi All,
> We've been trying to study and improve the kdump behaviour when
> a panic is triggered due to an unrecoverable memory error causing a
> machine check exception (MCE) followed by a kernel panic.
>
> In this context we foresee a few issues in capturing kdump and would
> like to receive comments about the ways to handle them.
>
> Probable Issues when capturing coredump through kdump following a memory
> error
> ---------------------------
> - First, a coredump of the memory from the crashing kernel isn't really
> helpful in debugging a crash that was caused by faulty memory.
> Collecting it has some of the problems illustrated below. It should
> therefore suffice to let the user know the reason for the crash
> rather than provide a complete dump of the memory.
>
> For this, a 'slim' yet crash-tool readable coredump containing:
> - message about the cause (such as crash due to unrecoverable memory error)
> in the coredump's elf-note section.
> - and no data from the memory of the 'crashing' kernel (their elf
> sections can be reduced to zero length).
> may be suitable.
>
> - Alternatively, if the kdump kernel decides to capture the coredump,
> its attempts to read the faulty memory location may lead to subsequent
> faults in the context of the kdump kernel with fatal consequences. This
> may be avoided in either of two ways:
>
> a) Pass the address of the corrupt memory location to the kdump kernel
> and skip reading that location while creating the vmcore. This needs
> an instance of 'struct mce' (from the 'crashing' kernel) to be stored
> in the ELF notes(?) section. That structure already contains the
> faulty memory address (as a physical address, which should be
> confirmed using the IA32_MCi_MISC[8:6] bits stored in the 'misc'
> field of 'struct mce').
>
> b) Use modified copy applications (such as a modified 'cp' command)
> that can map the /dev/oldmem into user-space and then initiate the
> creation of vmcore. In this method, the user-space process performing
> the copy will receive a SIGBUS while consuming the faulty memory (through
> INT18 -> do_machine_check) but it must be modified to be resilient to the
> signal, while intelligently skipping to the subsequent memory location
> for further copying. Meanwhile the data for the faulty memory location
> can be represented using 'zero-ed' data and the vmcore enhanced to
> indicate the cause of the crash as one resulting from a fatal MCE.
>
> Any thoughts/suggestions?
In practice this all works for me.

I have received several crash dumps where there was an MCE.

I admit I have my userspace configured to just grab the dmesg from the
kernel log and not do a full crash dump. So in that sense I am already
taking a slim crash dump.

But in practice with real hardware errors it is working today without
kernel changes.

Eric
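
For reference, option (b) in the quoted proposal (a copy loop that survives
SIGBUS and zero-fills the chunk it could not read) might look roughly like
the sketch below. This is not code from the thread or from any kdump tool.
The names copy_skipping_faults() and demo_truncated_mapping() are
hypothetical, and the demo provokes SIGBUS by reading past the end of a
mapped file rather than by mapping /dev/oldmem and hitting a real memory
error. Whether a kdump kernel can recover from a consumed MCE this way is
exactly the open question of the RFC, so treat this only as an illustration
of the signal-handling pattern.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;    /* where on_sigbus() resumes execution */

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(fault_env, 1);   /* abandon the chunk being copied */
}

/*
 * Copy len bytes from src to dst in chunk-sized pieces. A chunk whose read
 * raises SIGBUS is zero-filled in dst and counted rather than copied.
 * Returns the number of skipped chunks, or -1 if no handler could be set.
 */
long copy_skipping_faults(const unsigned char *src, unsigned char *dst,
                          size_t len, size_t chunk)
{
    struct sigaction sa, old;
    volatile size_t off = 0;
    volatile long skipped = 0;

    assert(chunk > 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0)
        return -1;

    while (off < len) {
        size_t n = (len - off < chunk) ? len - off : chunk;

        /* Save the signal mask too, so SIGBUS is unblocked after a jump. */
        if (sigsetjmp(fault_env, 1) == 0) {
            memcpy(dst + off, src + off, n);
        } else {
            memset(dst + off, 0, n);    /* represent bad data as zeroes */
            skipped++;
        }
        off += n;
    }
    sigaction(SIGBUS, &old, NULL);      /* restore the previous handler */
    return skipped;
}

/*
 * Demo stand-in for faulty memory: map two pages of a one-page file.
 * Reading the second page raises SIGBUS because it lies past end of file.
 * Returns the number of skipped pages, or a negative value on failure.
 */
long demo_truncated_mapping(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();
    unsigned char *src, *dst;
    long skipped;

    if (f == NULL || fputc('A', f) == EOF || fflush(f) != 0 ||
        ftruncate(fileno(f), (off_t)page) != 0)
        return -1;
    src = mmap(NULL, 2 * page, PROT_READ, MAP_SHARED, fileno(f), 0);
    dst = malloc(2 * page);
    if (src == MAP_FAILED || dst == NULL)
        return -1;
    memset(dst, 0xFF, 2 * page);

    skipped = copy_skipping_faults(src, dst, 2 * page, page);
    if (dst[0] != 'A' || dst[page] != 0 || dst[2 * page - 1] != 0)
        skipped = -2;                   /* unexpected copy result */

    munmap(src, 2 * page);
    free(dst);
    fclose(f);
    return skipped;
}
```

Calling demo_truncated_mapping() from a main() exercises the skip path once.
A real tool would also want to record which offsets were skipped, so the
vmcore can say that the zeroes are placeholders rather than actual memory
contents.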