From: Vivek Goyal <vgoyal@redhat.com>
To: Dave Anderson <anderson@redhat.com>
Cc: oomichi@mxs.nes.nec.co.jp, Tony Luck <tony.luck@intel.com>,
kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
tachibana@mxm.nes.nec.co.jp, Andi Kleen <andi@firstfloor.org>,
Borislav Petkov <bp@alien8.de>,
"Eric W. Biederman" <ebiederm@xmission.com>,
"K.Prasad" <prasad@linux.vnet.ibm.com>,
crash-utility@redhat.com
Subject: Re: [Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump
Date: Wed, 5 Oct 2011 14:09:00 -0400
Message-ID: <20111005180900.GJ30146@redhat.com>
In-Reply-To: <67cddd75-90d7-49b8-83ff-c677101b37b3@zmail05.collab.prod.int.phx2.redhat.com>
On Wed, Oct 05, 2011 at 02:00:09PM -0400, Dave Anderson wrote:
>
>
> ----- Original Message -----
> > On Wed, Oct 05, 2011 at 07:20:37PM +0200, Borislav Petkov wrote:
> > > On Wed, Oct 05, 2011 at 01:10:07PM -0400, Vivek Goyal wrote:
> > > > On Wed, Oct 05, 2011 at 08:58:53AM -0700, Luck, Tony wrote:
> > > > > > > > The plan is to pass-down the list of poisoned memory pages to
> > > > > > > > the second kernel using an elf-note so that these pages are
> > > > > > > > left untouched during dump capture. I'm working on an
> > > > > > > > implementation of the same and should have patches soon.
> > > > > >
> > > > > > > I would say let us first figure out what happens while reading
> > > > > > > a poisoned page and is this a problem before working on a
> > > > > > > solution.
> > > > >
> > > > > > If the page is poisoned because of a real uncorrectable error in
> > > > > > memory (reported as SRAO machine check today, or by SRAR
> > > > > > real-soon-now), then accessing the page from the processor while
> > > > > > taking a memory dump will result in a machine check.
> > > > > >
> > > > > > Note that a large memory system that had been running for a long
> > > > > > time may have built up a small stash of these land-mine pages - and
> > > > > > we need to worry about them even in the case where the panic is not
> > > > > > machine check related (in fact especially in this case ... we are
> > > > > > in a case where we actually do want the dump to diagnose the cause
> > > > > > of the panic, and we don't want to risk losing the crash dump
> > > > > > because we aborted when touching a page that the OS had safely
> > > > > > avoided for days/weeks/months).
> > > > > >
> > > > > > So passing a list of poisoned pages from the old kernel to the new
> > > > > > kernel is a good idea - and is independent of the cause of the
> > > > > > crash (except that in the fatal machine check case due to memory
> > > > > > error the list is guaranteed to be non-empty).
> > > >
> > > > > Where is this poisoned page info stored? In struct page? If yes, then
> > > > user space can walk through it and make sure not to touch poisoned pages.
> > > > Anyway user space filtering utility "makedumpfile" walks through struct
> > > > pages to filter out the pages. It should be able to filter out
> > > > poisoned pages unconditionally. So there should be no need for kernel
> > > > to export a list of these pages.
> > >
> > > Does this utility work on a vmcore dump? If so, Tony refers to the
> > > creation of the vmcore itself from the memory used by the first
> > > kernel.
> >
> > > No, this utility can work directly on /proc/vmcore, where the first
> > > kernel's image is still in memory and not on disk.
> >
> > > If there are poisoned pages, merely accessing that portion of DRAM
> > > containing the poisoned data would cause further MCEs in the freshly
> > > booted kernel so you won't be able to finish creating the dump.
> >
> > As long as you can get to your struct page arrays, one should be able
> > to filter out poisoned pages without saving the whole dump.
>
> It's still going to require a minimal kernel change because the
> PG_hwpoison flag's bit number differs depending upon the kernel
> configuration, if it exists at all. An additional vmcoreinfo item
> probably...
>
Yes, that kind of information we can export along with other info
in vmcoreinfo.
Thanks
Vivek
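
The filtering idea discussed above can be sketched roughly as follows. This
is only an illustration under stated assumptions, not makedumpfile's actual
code: it assumes the kernel exports the flag's bit number in vmcoreinfo under
a key such as NUMBER(PG_hwpoison) (hypothetical as far as this thread is
concerned), and that the (pfn, flags) pairs have already been read from the
first kernel's struct page arrays. All function names are made up for the
example.

```python
def parse_vmcoreinfo(text):
    """Parse the KEY=VALUE lines of a vmcoreinfo note into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that carry no '='
            info[key.strip()] = value.strip()
    return info


def hwpoison_mask(info, key="NUMBER(PG_hwpoison)"):
    """Return the page-flag mask, or None if the bit was not exported.

    The key name is an assumption; the bit number depends on the kernel
    configuration and may be absent altogether.
    """
    if key not in info:
        return None
    return 1 << int(info[key])


def pfns_to_skip(page_flags, mask):
    """Return the PFNs whose flags word has the hwpoison bit set.

    page_flags is an iterable of (pfn, flags) pairs. With no mask
    available, nothing can be identified and the list is empty.
    """
    if mask is None:
        return []
    return [pfn for pfn, flags in page_flags if flags & mask]
```

A dump tool would consult such a list before reading page contents, so the
poisoned pages themselves are never touched; only the struct page metadata
is read.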