From: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
To: Dave Young <dyoung@redhat.com>
Cc: Indranil Choudhury <indranil@chelsio.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
Nirranjan Kirubaharan <nirranjan@chelsio.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
"davem@davemloft.net" <davem@davemloft.net>,
"stephen@networkplumber.org" <stephen@networkplumber.org>,
Ganesh GR <ganeshgr@chelsio.com>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"torvalds@linux-foundation.org" <torvalds@linux-foundation.org>,
"kexec@lists.infradead.org" <kexec@lists.infradead.org>,
"ebiederm@xmission.com" <ebiederm@xmission.com>
Subject: Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
Date: Wed, 18 Apr 2018 18:01:16 +0530
Message-ID: <20180418123114.GA19159@chelsio.com>
In-Reply-To: <20180418061546.GA4551@dhcp-128-65.nay.redhat.com>
On Wednesday, April 18, 2018 at 11:45:46 +0530, Dave Young wrote:
> Hi Rahul,
> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> > On production servers running variety of workloads over time, kernel
> > panic can happen sporadically after days or even months. It is
> > important to collect as much debug logs as possible to root cause
> > and fix the problem, that may not be easy to reproduce. Snapshot of
> > underlying hardware/firmware state (like register dump, firmware
> > logs, adapter memory, etc.), at the time of kernel panic will be very
> > helpful while debugging the culprit device driver.
> >
> > This series of patches add new generic framework that enable device
> > drivers to collect device specific snapshot of the hardware/firmware
> > state of the underlying device in the crash recovery kernel. In crash
> > recovery kernel, the collected logs are added as elf notes to
> > /proc/vmcore, which is copied by user space scripts for post-analysis.
> >
> > The sequence of actions done by device drivers to append their device
> > specific hardware/firmware logs to /proc/vmcore are as follows:
> >
> > 1. During probe (before hardware is initialized), device drivers
> > register to the vmcore module (via vmcore_add_device_dump()), with
> > callback function, along with buffer size and log name needed for
> > firmware/hardware log collection.
>
> I assumed the elf notes info should be prepared while kexec_[file_]load
> phase. But I did not read the old comment, not sure if it has been discussed
> or not.
>
We must not collect dumps in the crashing kernel. Adding more work to
the crash dump path risks not collecting the vmcore at all. Eric
discussed this in more detail at:
https://lkml.org/lkml/2018/3/24/319
It is safe to collect dumps in the second kernel. Each device dump
will be exported as an elf note in /proc/vmcore.
> If do this in 2nd kernel a question is driver can be loaded later than vmcore init.
Yes, drivers will add their device dumps after vmcore init.
> How to guarantee the function works if vmcore reading happens before
> the driver is loaded?
>
> Also it is possible that kdump initramfs does not contains the driver
> module.
>
> Am I missing something?
>
Yes, the driver must be in the kdump initramfs if it wants to collect
its device dump and add it to /proc/vmcore in the second kernel.
> >
> > 2. vmcore module allocates the buffer with requested size. It adds
> > an elf note and invokes the device driver's registered callback
> > function.
> >
> > 3. Device driver collects all hardware/firmware logs into the buffer
> > and returns control back to vmcore module.
> >
> > The device specific hardware/firmware logs can be seen as elf notes:
> >
> > # readelf -n /proc/vmcore
> >
> > Displaying notes found at file offset 0x00001000 with length 0x04003288:
> >   Owner                        Data size   Description
> >   VMCOREDD_cxgb4_0000:02:00.4  0x02000fd8  Unknown note type: (0x00000700)
> >   VMCOREDD_cxgb4_0000:04:00.4  0x02000fd8  Unknown note type: (0x00000700)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   VMCOREINFO                   0x0000074f  Unknown note type: (0x00000000)
> >
> > Patch 1 adds API to vmcore module to allow drivers to register callback
> > to collect the device specific hardware/firmware logs. The logs will
> > be added to /proc/vmcore as elf notes.
> >
> > Patch 2 updates read and mmap logic to append device specific hardware/
> > firmware logs as elf notes.
> >
> > Patch 3 shows a cxgb4 driver example using the API to collect
> > hardware/firmware logs in crash recovery kernel, before hardware is
> > initialized.
> >
> > Thanks,
> > Rahul
> >
> > RFC v1: https://lkml.org/lkml/2018/3/2/542
> > RFC v2: https://lkml.org/lkml/2018/3/16/326
> >
[...]
Thanks,
Rahul