Message-ID: <4EBB9D76.80601@redhat.com>
Date: Thu, 10 Nov 2011 17:46:30 +0800
From: Cong Wang
To: Mahesh J Salgaonkar
Cc: Linux Kernel, Milton Miller, linuxppc-dev, Anton Blanchard, "Eric W. Biederman"
Subject: Re: [RFC PATCH v4 01/10] fadump: Add documentation for firmware-assisted dump.
References: <20111107095215.1997.14866.stgit@mars.in.ibm.com> <20111107095521.1997.34844.stgit@mars.in.ibm.com>
In-Reply-To: <20111107095521.1997.34844.stgit@mars.in.ibm.com>
List-Id: Linux on PowerPC Developers Mail List

On 2011-11-07 17:55, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar
>
> Documentation for firmware-assisted dump. This document is based on the
> original documentation written for phyp assisted dump by Linas Vepstas
> and Manish Ahuja, with few changes to reflect the current implementation.
>
> Change in v3:
> - Modified the documentation to reflect introdunction of fadump_registered
>   sysfs file and few minor changes.
>
> Change in v2:
> - Modified the documentation to reflect the change of fadump_region
>   file under debugfs filesystem.
>
> Signed-off-by: Mahesh Salgaonkar

Please Cc Randy Dunlap on kernel documentation patches.

I have some inline comments below.
> ---
>  Documentation/powerpc/firmware-assisted-dump.txt |  262 ++++++++++++++++++++++
>  1 files changed, 262 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/powerpc/firmware-assisted-dump.txt
>
> diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
> new file mode 100644
> index 0000000..ba6724a
> --- /dev/null
> +++ b/Documentation/powerpc/firmware-assisted-dump.txt
> @@ -0,0 +1,262 @@
> +
> +                   Firmware-Assisted Dump
> +                   ------------------------
> +                       July 2011
> +
> +The goal of firmware-assisted dump is to enable the dump of
> +a crashed system, and to do so from a fully-reset system, and
> +to minimize the total elapsed time until the system is back
> +in production use.
> +
> +As compared to kdump or other strategies, firmware-assisted
> +dump offers several strong, practical advantages:

Compared with kdump or...

> +
> +-- Unlike kdump, the system has been reset, and loaded
> +   with a fresh copy of the kernel.  In particular,
> +   PCI and I/O devices have been reinitialized and are
> +   in a clean, consistent state.
> +-- Once the dump is copied out, the memory that held the dump
> +   is immediately available to the running kernel. A further
> +   reboot isn't required.
> +
> +The above can only be accomplished by coordination with,
> +and assistance from the Power firmware. The procedure is
> +as follows:
> +
> +-- The first kernel registers the sections of memory with the
> +   Power firmware for dump preservation during OS initialization.
> +   This registered sections of memory is reserved by the first

These registered sections of memory are...

> +   kernel during early boot.
> +
> +-- When a system crashes, the Power firmware will save
> +   the low memory (boot memory of size larger of 5% of system RAM
> +   or 256MB) of RAM to a previously registered save region. It

...to the previously registered region...

> +   will also save system registers, and hardware PTE's.
> +
> +   NOTE: The term 'boot memory' means size of the low memory chunk
> +         that is required for a kernel to boot successfully when
> +         booted with restricted memory. By default, the boot memory
> +         size will be calculated to larger of 5% of system RAM or

will be the larger of...

> +         256MB. Alternatively, user can also specify boot memory
> +         size through boot parameter 'fadump_reserve_mem=' which
> +         will override the default calculated size.
> +
> +-- After the low memory (boot memory) area has been saved, the
> +   firmware will reset PCI and other hardware state.  It will
> +   *not* clear the RAM. It will then launch the bootloader, as
> +   normal.
> +
> +-- The freshly booted kernel will notice that there is a new
> +   node (ibm,dump-kernel) in the device tree, indicating that
> +   there is crash data available from a previous boot. During
> +   the early boot OS will reserve rest of the memory above
> +   boot memory size effectively booting with restricted memory
> +   size. This will make sure that the second kernel will not
> +   touch any of the dump memory area.
> +
> +-- Userspace tools will read /proc/vmcore to obtain the contents
> +   of memory, which holds the previous crashed kernel dump in ELF
> +   format. The userspace tools may copy this info to disk, or
> +   network, nas, san, iscsi, etc. as desired.

s/Userspace/User-space/

> +
> +-- Once the userspace tool is done saving dump, it will echo
> +   '1' to /sys/kernel/fadump_release_mem to release the reserved
> +   memory back to general use, except the memory required for
> +   next firmware-assisted dump registration.
> +
> +   e.g.
> +     # echo 1 > /sys/kernel/fadump_release_mem
> +
> +Please note that the firmware-assisted dump feature
> +is only available on Power6 and above systems with recent
> +firmware versions.
> +
> +Implementation details:
> +----------------------
> +
> +During boot, a check is made to see if firmware supports
> +this feature on that particular machine. If it does, then
> +we check to see if an active dump is waiting for us. If yes
> +then everything but boot memory size of RAM is reserved during
> +early boot (See Fig. 2). This area is released once we collect a
> +dump from user land scripts (kdump scripts) that are run. If

This area is released once we finish collecting the dump from
user land scripts (e.g. kdump scripts).

> +there is dump data, then the /sys/kernel/fadump_release_mem
> +file is created, and the reserved memory is held.
> +
> +If there is no waiting dump data, then only the memory required
> +to hold CPU state, HPTE region, boot memory dump and elfcore
> +header, is reserved at the top of memory (see Fig. 1). This area
> +is *not* released: this region will be kept permanently reserved,
> +so that it can act as a receptacle for a copy of the boot memory
> +content in addition to CPU state and HPTE region, in the case a
> +crash does occur.
> +
> +  o Memory Reservation during first kernel
> +
> +  Low memory                                        Top of memory
> +  0      boot memory size                                       |
> +  |           |                       |<--Reserved dump area -->|
> +  V           V                       |   Permanent Reservation V
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +  |           |                       |CPU|HPTE|  DUMP     |ELF |
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +        |                                           ^
> +        |                                           |
> +        \                                           /
> +         -------------------------------------------
> +          Boot memory content gets transferred to
> +          reserved area by firmware at the time of
> +          crash
> +                   Fig. 1
> +
> +  o Memory Reservation during second kernel after crash
> +
> +  Low memory                                        Top of memory
> +  0      boot memory size                                       |
> +  |           |<------------- Reserved dump area ----------- -->|
> +  V           V                                                 V
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +  |           |                       |CPU|HPTE|  DUMP     |ELF |
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +        |                                              |
> +        V                                              V
> +   Used by second                                    /proc/vmcore
> +   kernel to boot
> +                   Fig. 2
> +
> +Currently the dump will be copied from /proc/vmcore to a
> +a new file upon user intervention. The dump data available through
> +/proc/vmcore will be in ELF format. Hence the existing kdump
> +infrastructure (kdump scripts) to save the dump works fine
> +with minor modifications. The kdump script requires following
> +modifications:
> +-- During service kdump start if /proc/vmcore entry is not present,
> +   look for the existence of /sys/kernel/fadump_enabled and read
> +   value exported by it. If value is set to '0' then fallback to
> +   existing kexec based kdump. If value is set to '1' then check the
> +   value exported by /sys/kernel/fadump_registered. If value it set
> +   to '1' then print success otherwise register for fadump by
> +   echo'ing 1 > /sys/kernel/fadump_registered file.
> +
> +-- During service kdump start if /proc/vmcore entry is present,
> +   execute the existing routine to save the dump. Once the dump
> +   is saved, echo 1 > /sys/kernel/fadump_release_mem (if the
> +   file exists) to release the reserved memory for general use
> +   and continue without rebooting. At this point the memory
> +   reservation map will look like as shown in Fig. 1. If the file
> +   /sys/kernel/fadump_release_mem is not present then follow
> +   the existing routine to reboot into new kernel.
> +
> +-- During service kdump stop echo 0 > /sys/kernel/fadump_registered
> +   to un-register the fadump.
> +

I don't think you need to document kdump script changes in a kernel doc.

> +The tools to examine the dump will be same as the ones
> +used for kdump.
> +
> +How to enable firmware-assisted dump (fadump):
> +-------------------------------------
> +
> +1. Set config option CONFIG_FA_DUMP=y and build kernel.
> +2. Boot into linux kernel with 'fadump=1' kernel cmdline option.
> +3. Optionally, user can also set 'fadump_reserve_mem=' kernel cmdline
> +   to specify size of the memory to reserve for boot memory dump
> +   preservation.
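
A minimal sketch of how the boot memory size behind step 3 would be chosen, going only by the NOTE earlier in this document (the larger of 5% of system RAM or 256MB, unless 'fadump_reserve_mem=' overrides it). The function and parameter names are hypothetical and not taken from the patch, and any rounding or alignment the kernel may apply is not modeled:

```python
MB = 1024 * 1024


def boot_memory_size(total_ram_bytes, reserve_override=None):
    """Return the boot memory size in bytes.

    Assumption: mirrors only the rule stated in the documentation.
    reserve_override stands in for a size given via 'fadump_reserve_mem='.
    """
    if reserve_override is not None:
        # An explicit size overrides the calculated default.
        return reserve_override
    # Default: the larger of 5% of system RAM or 256MB.
    return max(total_ram_bytes * 5 // 100, 256 * MB)
```

With 4GB of RAM, 5% is below 256MB, so the 256MB floor applies; on larger systems the 5% term takes over.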
> +
> +NOTE: If firmware-assisted dump fails to reserve memory then it will
> +      fallback to existing kdump mechanism if 'crashkernel=' option
> +      is set at kernel cmdline.
> +
> +Sysfs/debugfs files:
> +------------
> +
> +Firmware-assisted dump feature uses sysfs file system to hold
> +the control files and debugfs file to display memory reserved region.
> +
> +Here is the list of files under kernel sysfs:
> +
> + /sys/kernel/fadump_enabled
> +
> +    This is used to display the fadump status.
> +    0 = fadump is disabled
> +    1 = fadump is enabled
> +
> + /sys/kernel/fadump_registered
> +
> +    This is used to display the fadump registration status as well
> +    as to control (start/stop) the fadump registration.
> +    0 = fadump is not registered.
> +    1 = fadump is registered and ready to handle system crash.
> +
> +    To register fadump echo 1 > /sys/kernel/fadump_registered and
> +    echo 0 > /sys/kernel/fadump_registered for un-register and stop the
> +    fadump. Once the fadump is un-registered, the system crash will not
> +    be handled and vmcore will not be captured.
> +
> + /sys/kernel/fadump_release_mem
> +
> +    This file is available only when fadump is active during
> +    second kernel. This is used to release the reserved memory
> +    region that are held for saving crash dump. To release the
> +    reserved memory echo 1 to it:
> +
> +    echo 1 > /sys/kernel/fadump_release_mem
> +
> +    After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
> +    file will change to reflect the new memory reservations.
> +
> +Here is the list of files under powerpc debugfs:
> +(Assuming debugfs is mounted on /sys/kernel/debug directory.)
> +
> + /sys/kernel/debug/powerpc/fadump_region
> +
> +    This file shows the reserved memory regions if fadump is
> +    enabled otherwise this file is empty. The output format
> +    is:
> +: [-] bytes, Dumped:
> +
> +    e.g.
> +      Contents when fadump is registered during first kernel
> +
> +      # cat /sys/kernel/debug/powerpc/fadump_region
> +      CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
> +      HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
> +      DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
> +
> +      Contents when fadump is active during second kernel
> +
> +      # cat /sys/kernel/debug/powerpc/fadump_region
> +      CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
> +      HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
> +      DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
> +          : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
> +
> +NOTE: Please refer to debugfs documentation on how to mount the debugfs
> +      filesystem.
> +

That is Documentation/filesystems/debugfs.txt.

> +
> +TODO:
> +-----
> + o Need to come up with the better approach to find out more
> +   accurate boot memory size that is required for a kernel to
> +   boot successfully when booted with restricted memory.
> + o The fadump implementation introduces a fadump crash info structure
> +   in the scratch area before the ELF core header. The idea of introducing
> +   this structure is to pass some important crash info data to the second
> +   kernel which will help second kernel to populate ELF core header with
> +   correct data before it gets exported through /proc/vmcore. The current
> +   design implementation does not address a possibility of introducing
> +   additional fields (in future) to this structure without affecting
> +   compatibility. Need to come up with the better approach to address this.
> +   The possible approaches are:
> +	1. Introduce version field for version tracking, bump up the version
> +	whenever a new field is added to the structure in future. The version
> +	field can be used to find out what fields are valid for the current
> +	version of the structure.
> +	2. Reserve the area of predefined size (say PAGE_SIZE) for this
> +	structure and have unused area as reserved (initialized to zero)
> +	for future field additions.
> +   The advantage of approach 1 over 2 is we don't need to reserve extra space.
> +---

Why do we keep a TODO in this doc?

Thanks!
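
For user-space tools that want to consume the fadump_region output quoted above, a minimal parsing sketch follows. It assumes every line has exactly the shape of the sample lines in the patch (a possibly empty region name, a [start-end] range, a size in bytes and a Dumped: value, all in hex). The Region type and parse_fadump_region() are hypothetical names, not part of the patch or of any existing tool:

```python
import re
from typing import List, NamedTuple


class Region(NamedTuple):
    name: str      # e.g. "CPU", "HPTE", "DUMP", or "" for the unnamed region
    start: int     # first address of the region
    end: int       # last address of the region (inclusive)
    size: int      # reserved size in bytes
    dumped: int    # number of bytes dumped so far


# Matches lines such as:
#   CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
_LINE = re.compile(
    r"^\s*(?P<name>\w*)\s*:\s*"
    r"\[(?P<start>0x[0-9a-fA-F]+)-(?P<end>0x[0-9a-fA-F]+)\]\s+"
    r"(?P<size>0x[0-9a-fA-F]+) bytes, Dumped: (?P<dumped>0x[0-9a-fA-F]+)\s*$"
)


def parse_fadump_region(text: str) -> List[Region]:
    """Parse the text of fadump_region; lines that do not match are skipped."""
    regions = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if m is None:
            continue
        regions.append(
            Region(
                name=m["name"],
                start=int(m["start"], 16),
                end=int(m["end"], 16),
                size=int(m["size"], 16),
                dumped=int(m["dumped"], 16),
            )
        )
    return regions
```

A caller would read /sys/kernel/debug/powerpc/fadump_region and pass the text in. In the sample output each size equals end - start + 1, which makes a cheap sanity check on the parsed values.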