From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rahul Lakkireddy
Subject: Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
Date: Wed, 18 Apr 2018 18:01:16 +0530
Message-ID: <20180418123114.GA19159@chelsio.com>
References: <20180418061546.GA4551@dhcp-128-65.nay.redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Cc: Indranil Choudhury,
 "netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org",
 Nirranjan Kirubaharan,
 "linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org",
 "viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org",
 "davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org",
 "stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org",
 Ganesh GR,
 "linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org",
 "akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org",
 "torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org",
 "kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org",
 "ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org"
To: Dave Young
Content-Disposition: inline
In-Reply-To: <20180418061546.GA4551-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
Sender: "kexec"
Errors-To: kexec-bounces+glkk-kexec=m.gmane.org-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
List-Id: netdev.vger.kernel.org

On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> Hi Rahul,
> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> > On production servers running variety of workloads over time, kernel
> > panic can happen sporadically after days or even months. It is
> > important to collect as much debug logs as possible to root cause
> > and fix the problem, that may not be easy to reproduce. Snapshot of
> > underlying hardware/firmware state (like register dump, firmware
> > logs, adapter memory, etc.), at the time of kernel panic will be very
> > helpful while debugging the culprit device driver.
> >
> > This series of patches add new generic framework that enable device
> > drivers to collect device specific snapshot of the hardware/firmware
> > state of the underlying device in the crash recovery kernel. In crash
> > recovery kernel, the collected logs are added as elf notes to
> > /proc/vmcore, which is copied by user space scripts for post-analysis.
> >
> > The sequence of actions done by device drivers to append their device
> > specific hardware/firmware logs to /proc/vmcore are as follows:
> >
> > 1. During probe (before hardware is initialized), device drivers
> > register to the vmcore module (via vmcore_add_device_dump()), with
> > callback function, along with buffer size and log name needed for
> > firmware/hardware log collection.
>
> I assumed the elf notes info should be prepared while kexec_[file_]load
> phase. But I did not read the old comment, not sure if it has been discussed
> or not.
>

We must not collect dumps in the crashing kernel. Adding more things to
the crash dump path risks not collecting the vmcore at all. Eric
discussed this in more detail at:

https://lkml.org/lkml/2018/3/24/319

It is safe to collect dumps in the second kernel. Each device dump
will be exported as an elf note in /proc/vmcore.

> If do this in 2nd kernel a question is driver can be loaded later than vmcore init.

Yes, drivers will add their device dumps after vmcore init.

> How to guarantee the function works if vmcore reading happens before
> the driver is loaded?
>
> Also it is possible that kdump initramfs does not contains the driver
> module.
>
> Am I missing something?
>

Yes, the driver must be in the initramfs if it wants to collect and add
its device dump to /proc/vmcore in the second kernel.

> >
> > 2. vmcore module allocates the buffer with requested size. It adds
> > an elf note and invokes the device driver's registered callback
> > function.
> >
> > 3. Device driver collects all hardware/firmware logs into the buffer
> > and returns control back to vmcore module.
> >
> > The device specific hardware/firmware logs can be seen as elf notes:
> >
> > # readelf -n /proc/vmcore
> >
> > Displaying notes found at file offset 0x00001000 with length 0x04003288:
> >   Owner                        Data size   Description
> >   VMCOREDD_cxgb4_0000:02:00.4  0x02000fd8  Unknown note type: (0x00000700)
> >   VMCOREDD_cxgb4_0000:04:00.4  0x02000fd8  Unknown note type: (0x00000700)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
> >   VMCOREINFO                   0x0000074f  Unknown note type: (0x00000000)
> >
> > Patch 1 adds API to vmcore module to allow drivers to register callback
> > to collect the device specific hardware/firmware logs. The logs will
> > be added to /proc/vmcore as elf notes.
> >
> > Patch 2 updates read and mmap logic to append device specific hardware/
> > firmware logs as elf notes.
> >
> > Patch 3 shows a cxgb4 driver example using the API to collect
> > hardware/firmware logs in crash recovery kernel, before hardware is
> > initialized.
> >
> > Thanks,
> > Rahul
> >
> > RFC v1: https://lkml.org/lkml/2018/3/2/542
> > RFC v2: https://lkml.org/lkml/2018/3/16/326
> > [...]

Thanks,
Rahul
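
P.S. For anyone scripting the post-analysis side, here is a minimal
sketch (not part of this series) of how a user space script might pick
the device dump notes out of the "readelf -n /proc/vmcore" output quoted
above. It assumes the "VMCOREDD_" owner prefix and the three-column
layout shown in that output; parse_notes() and device_dumps() are
hypothetical helper names used only for illustration.

```python
import re

# Owner prefix seen in the readelf output above; assumed here to mark
# every device dump note.
DEVICE_DUMP_PREFIX = "VMCOREDD_"

# One note row: owner, hex data size, free-form description.
# The banner and column-header lines do not match this pattern.
NOTE_ROW = re.compile(r"^\s*(\S+)\s+(0x[0-9a-fA-F]+)\s+(.*)$")

# Shortened copy of the output quoted in this thread.
SAMPLE = """\
Displaying notes found at file offset 0x00001000 with length 0x04003288:
  Owner                        Data size   Description
  VMCOREDD_cxgb4_0000:02:00.4  0x02000fd8  Unknown note type: (0x00000700)
  VMCOREDD_cxgb4_0000:04:00.4  0x02000fd8  Unknown note type: (0x00000700)
  CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
  VMCOREINFO                   0x0000074f  Unknown note type: (0x00000000)
"""


def parse_notes(readelf_output):
    """Return (owner, size_in_bytes, description) for each note row."""
    notes = []
    for line in readelf_output.splitlines():
        match = NOTE_ROW.match(line)
        if match:
            owner, size, description = match.groups()
            notes.append((owner, int(size, 16), description.strip()))
    return notes


def device_dumps(notes):
    """Keep only device dump notes; strip the prefix from the owner."""
    return [
        (owner[len(DEVICE_DUMP_PREFIX):], size)
        for owner, size, _ in notes
        if owner.startswith(DEVICE_DUMP_PREFIX)
    ]


for name, size in device_dumps(parse_notes(SAMPLE)):
    print(f"{name}: {size} bytes")
```

In a real script the text would come from running readelf against
/proc/vmcore in the second kernel rather than from the SAMPLE string.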