From: Vivek Goyal <vgoyal@redhat.com>
To: "K.Prasad" <prasad@linux.vnet.ibm.com>
Cc: oomichi@mxs.nes.nec.co.jp, "Luck, Tony" <tony.luck@intel.com>,
	kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
	tachibana@mxm.nes.nec.co.jp, Andi Kleen <andi@firstfloor.org>,
	Borislav Petkov <bp@alien8.de>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	anderson@redhat.com, crash-utility@redhat.com
Subject: Re: [Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump
Date: Wed, 5 Oct 2011 11:30:40 -0400
Message-ID: <20111005153040.GC30146@redhat.com>
In-Reply-To: <20111005070728.GA2235@in.ibm.com>

On Wed, Oct 05, 2011 at 12:37:28PM +0530, K.Prasad wrote:
> On Tue, Oct 04, 2011 at 08:34:40AM +0200, Borislav Petkov wrote:
> > On Mon, Oct 03, 2011 at 05:33:36PM +0530, K.Prasad wrote:
> > > It's interesting...according to Intel's Software Developer Manual
> > > (quoting from Volume 3A, Chapter 15), the MCIP bit in IA32_MCG_STATUS
> > > MSR behaves as described below.
> > >
> > > "MCIP (machine check in progress) flag, bit 2 Indicates (when set)
> > > that a machine-check exception was generated. Software can set or clear this
> > > flag. The occurrence of a second Machine-Check Event while MCIP is set will
> > > cause the processor to enter a shutdown state."
> > >
> > > In the do_machine_check function, we enter the panic path (for
> > > unrecoverable errors) well before the IA32_MCG_STATUS MSR is reset, and
> > > this is likely to be dangerous.
> > >
> > > void do_machine_check(struct pt_regs *regs, long error_code)
> > > {
> > > 	...
> > > 	if (no_way_out && tolerant < 3)
> > > 		mce_panic("Fatal machine check on current CPU", final, msg);
> > > 	...
> > > 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> > > out:
> > >
> > > It'd be interesting to know the type of memory error (as classified by
> > > the processor) for which you're able to capture the memory dump.
> > > Maybe a dump of the various MCE status registers (and struct mce) would
> > > help us understand the behaviour on your system better.
> >
> > Well, there are MCE types for which we need to panic but we don't
> > necessarily corrupt memory. Your approach is to unconditionally avoid
> > dumping core whenever we panic while you should look at the MCE
> > signature and decide then whether to capture crashed kernel memory or
> > not.
> >
> > For example, if the MCE signature says UC DRAM error, then you can
> > be pretty sure that there is a landmine somewhere in the DRAM region
> > mapping the crashed kernel. If it is, say, a UC when doing data fills
> > from L2 to L1, that doesn't necessarily mean that DRAM is corrupted. But
> > even in the first case, you can evaluate the MCi_ADDR reported with the
> > UC DRAM error and simply skip that particular cacheline when dumping the
> > core instead of not capturing anything at all.
> >
>
> True. As I stated earlier, there are two possible outcomes
> from capturing a memory dump in such cases: it is either dangerous or
> it doesn't make sense. It is best to avoid a normal kdump in both cases,
> although the elf-note doesn't distinguish between the two.
>
> NT_NOCOREDUMP, in my opinion, is just the first step towards introducing
> a framework where different code paths that lead to panic() can
> 'opt-out' from kdump by adding an elf-note.
>
> We can modify this to add more fine-grained messages using different elf-note
> types (or use the elf-note name under the NT_NOCOREDUMP type) to
> indicate the cause/type of crash.

Which could be found by looking at the log buffers too? So it looks like
you want to put all the MCE-related info in an ELF note and don't want
the user to poke at the vmcore. (Though there is no guarantee that writing
to the MCE note location is safe.) So the assumption here would be that
reading an ELF note is safer than trying to extract the kernel log buffers.

>
> I'd like to hear further from you and the rest of the community to see if
> there's a need felt for such a change.

I feel that we are trying to solve a theoretical problem at this point
in time. You have never run into any issues; you are just reading
the documentation and then trying to add a framework. I would be a little
wary of that.

Having said that, I do think that adding a way to let user space know some
additional information about the panic is not a bad idea. For example, an
additional field in vmcoreinfo to let user space know that it was
an MCE panic.
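
To make the vmcoreinfo idea concrete, here is a rough sketch of how a
user-space tool might check for such a field. It assumes vmcoreinfo is
available as newline-separated KEY=VALUE text, and it assumes a
hypothetical PANIC_REASON=MCE line; no such key exists today, and both the
key name and the helper vmcoreinfo_is_mce_panic() are made up purely for
illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * ASSUMPTION: "PANIC_REASON" is a hypothetical vmcoreinfo key used only
 * for illustration; it is not an existing vmcoreinfo field.
 */
#define PANIC_REASON_KEY "PANIC_REASON="

/*
 * Scan newline-separated KEY=VALUE text and return 1 if the first
 * PANIC_REASON line has the value "MCE", 0 otherwise (including when
 * the key is absent).
 */
static int vmcoreinfo_is_mce_panic(const char *info)
{
	const size_t keylen = strlen(PANIC_REASON_KEY);
	const char *line = info;

	while (line && *line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		/* The key must match at the very start of the line. */
		if (len >= keylen &&
		    strncmp(line, PANIC_REASON_KEY, keylen) == 0) {
			const char *val = line + keylen;
			size_t vlen = len - keylen;

			return vlen == 3 && strncmp(val, "MCE", 3) == 0;
		}
		line = eol ? eol + 1 : NULL;
	}
	return 0;
}
```

A tool could then choose a reduced dump only when this returns 1, and
fall back to its normal behaviour whenever the key is missing.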

Thanks
Vivek