* [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump
@ 2008-01-07 23:45 Manish Ahuja
2008-01-08 0:13 ` [PATCH 1/8] pseries: phyp dump: Docmentation Manish Ahuja
` (8 more replies)
0 siblings, 9 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-01-07 23:45 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, linasvepstas, lkessler, strosake
The following series of patches implement a basic framework
for hypervisor-assisted dump. The very first patch provides
documentation explaining what this is :-) . Yes, its supposed
to be an improvement over kdump.
The patches mostly work; a list of open issues / todo list
is included in the documentation. It also appears that
the not-yet-released firmware versions this was tested
on are still, ahem, incomplete; this work is also pending.
-- Linas & Manish
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-07 23:45 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
@ 2008-01-08 0:13 ` Manish Ahuja
2008-01-09 4:29 ` Nathan Lynch
2008-01-08 0:16 ` [PATCH 2/8] pseries: phyp dump: config file Manish Ahuja
` (7 subsequent siblings)
8 siblings, 1 reply; 71+ messages in thread
From: Manish Ahuja @ 2008-01-08 0:13 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, linasvepstas, lkessler, strosake
Basic documentation for hypervisor-assisted dump.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
----
Documentation/powerpc/phyp-assisted-dump.txt | 129 +++++++++++++++++++++++++++
1 file changed, 129 insertions(+)
Index: 2.6.24-rc5/Documentation/powerpc/phyp-assisted-dump.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ 2.6.24-rc5/Documentation/powerpc/phyp-assisted-dump.txt 2008-01-07 18:05:46.000000000 -0600
@@ -0,0 +1,129 @@
+
+ Hypervisor-Assisted Dump
+ ------------------------
+ November 2007
+
+The goal of hypervisor-assisted dump is to enable the dump of
+a crashed system, and to do so from a fully-reset system, and
+to minimize the total elapsed time until the system is back
+in production use.
+
+As compared to kdump or other strategies, hypervisor-assisted
+dump offers several strong, practical advantages:
+
+-- Unlike kdump, the system has been reset, and loaded
+ with a fresh copy of the kernel. In particular,
+ PCI and I/O devices have been reinitialized and are
+ in a clean, consistent state.
+-- As the dump is performed, the dumped memory becomes
+ immediately available to the system for normal use.
+-- After the dump is completed, no further reboots are
+ required; the system will be fully usable, and running
+ in it's normal, production mode on it normal kernel.
+
+The above can only be accomplished by coordination with,
+and assistance from the hypervisor. The procedure is
+as follows:
+
+-- When a system crashes, the hypervisor will save
+ the low 256MB of RAM to a previously registered
+ save region. It will also save system state, system
+ registers, and hardware PTE's.
+
+-- After the low 256MB area has been saved, the
+ hypervisor will reset PCI and other hardware state.
+ It will *not* clear RAM. It will then launch the
+ bootloader, as normal.
+
+-- The freshly booted kernel will notice that there
+ is a new node (ibm,dump-kernel) in the device tree,
+ indicating that there is crash data available from
+ a previous boot. It will boot into only 256MB of RAM,
+ reserving the rest of system memory.
+
+-- Userspace tools will parse /sys/kernel/release_region
+ and read /proc/vmcore to obtain the contents of memory,
+ which holds the previous crashed kernel. The userspace
+ tools may copy this info to disk, or network, nas, san,
+ iscsi, etc. as desired.
+
+ For Example: the values in /sys/kernel/release-region
+ would look something like this (address-range pairs).
+ CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /
+ DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A
+
+-- As the userspace tools complete saving a portion of
+ dump, they echo an offset and size to
+ /sys/kernel/release_region to release the reserved
+ memory back to general use.
+
+ An example of this is:
+ "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
+ which will release 256MB at the 1GB boundary.
+
+Please note that the hypervisor-assisted dump feature
+is only available on Power6-based systems with recent
+firmware versions.
+
+Implementation details:
+----------------------
+In order for this scheme to work, memory needs to be reserved
+quite early in the boot cycle. However, access to the device
+tree this early in the boot cycle is difficult, and device-tree
+access is needed to determine if there is a crash data waiting.
+To work around this problem, all but 256MB of RAM is reserved
+during early boot. A short while later in boot, a check is made
+to determine if there is dump data waiting. If there isn't,
+then the reserved memory is released to general kernel use.
+If there is dump data, then the /sys/kernel/release_region
+file is created, and the reserved memory is held.
+
+If there is no waiting dump data, then all but 256MB of the
+reserved ram will be released for general kernel use. The
+highest 256 MB of RAM will *not* be released: this region
+will be kept permanently reserved, so that it can act as
+a receptacle for a copy of the low 256MB in the case a crash
+does occur. See, however, "open issues" below, as to whether
+such a reserved region is really needed.
+
+Currently the dump will be copied from /proc/vmcore to a
+a new file upon user intervention. The starting address
+to be read and the range for each data point in provided
+in /sys/kernel/release_region.
+
+The tools to examine the dump will be same as the ones
+used for kdump.
+
+
+General notes:
+--------------
+Security: please note that there are potential security issues
+with any sort of dump mechanism. In particular, plaintext
+(unencrypted) data, and possibly passwords, may be present in
+the dump data. Userspace tools must take adequate precautions to
+preserve security.
+
+Open issues/ToDo:
+------------
+ o The various code paths that tell the hypervisor that a crash
+ occurred, vs. it simply being a normal reboot, should be
+ reviewed, and possibly clarified/fixed.
+
+ o Instead of using /sys/kernel, should there be a /sys/dump
+ instead? There is a dump_subsys being created by the s390 code,
+ perhaps the pseries code should use a similar layout as well.
+
+ o Is reserving a 256MB region really required? The goal of
+ reserving a 256MB scratch area is to make sure that no
+ important crash data is clobbered when the hypervisor
+ save low mem to the scratch area. But, if one could assure
+ that nothing important is located in some 256MB area, then
+ it would not need to be reserved. Something that can be
+ improved in subsequent versions.
+
+ o Still working the kdump team to integrate this with kdump,
+ some work remains but this would not affect the current
+ patches.
+
+ o Still need to write a shell script, to copy the dump away.
+ Currently I am parsing it manually.
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 2/8] pseries: phyp dump: config file
2008-01-07 23:45 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
2008-01-08 0:13 ` [PATCH 1/8] pseries: phyp dump: Docmentation Manish Ahuja
@ 2008-01-08 0:16 ` Manish Ahuja
2008-01-08 3:18 ` Stephen Rothwell
2008-01-08 0:21 ` [PATCH 4/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
` (6 subsequent siblings)
8 siblings, 1 reply; 71+ messages in thread
From: Manish Ahuja @ 2008-01-08 0:16 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, linasvepstas, lkessler, strosake
Add hypervisor-assisted dump to kernel config
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
-----
arch/powerpc/Kconfig | 11 +++++++++++
1 file changed, 11 insertions(+)
Index: linux-2.6.24-rc2-git4/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.24-rc2-git4.orig/arch/powerpc/Kconfig 2007-11-14 16:39:20.000000000 -0600
+++ linux-2.6.24-rc2-git4/arch/powerpc/Kconfig 2007-11-15 14:27:33.000000000 -0600
@@ -261,6 +261,17 @@ config CRASH_DUMP
Don't change this unless you know what you are doing.
+config PHYP_DUMP
+ bool "Hypervisor-assisted dump (EXPERIMENTAL)"
+ depends on PPC_PSERIES && EXPERIMENTAL
+ default y
+ help
+ Hypervisor-assisted dump is meant to be a kdump replacement
+ offering robustness and speed not possible without system
+ hypervisor assistence.
+
+ If unsure, say "Y"
+
config PPCBUG_NVRAM
bool "Enable reading PPCBUG NVRAM during boot" if PPLUS || LOPEC
default y if PPC_PREP
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 4/8] pseries: phyp dump: use sysfs to release reserved mem
2008-01-07 23:45 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
2008-01-08 0:13 ` [PATCH 1/8] pseries: phyp dump: Docmentation Manish Ahuja
2008-01-08 0:16 ` [PATCH 2/8] pseries: phyp dump: config file Manish Ahuja
@ 2008-01-08 0:21 ` Manish Ahuja
2008-01-08 3:45 ` Stephen Rothwell
2008-01-08 0:25 ` [PATCH 3/8] pseries: phyp dump: reserve-release proof-of-concept Manish Ahuja
` (5 subsequent siblings)
8 siblings, 1 reply; 71+ messages in thread
From: Manish Ahuja @ 2008-01-08 0:21 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, linasvepstas, lkessler, strosake
Check to see if there actually is data from a previously
crashed kernel waiting. If so, Allow user-sapce tools to
grab the data (by reading /proc/kcore). When user-space
finishes dumping a section, it must release that memory
by writing to sysfs. For example,
echo "0x40000000 0x10000000" > /sys/kernel/release_region
will release 256MB starting at the 1GB. The released memory
becomes free for general use.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
------
arch/powerpc/platforms/pseries/phyp_dump.c | 101 +++++++++++++++++++++++++++--
1 file changed, 96 insertions(+), 5 deletions(-)
Index: linux-2.6.24-rc3-git1/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- linux-2.6.24-rc3-git1.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2007-11-21 13:15:05.000000000 -0600
+++ linux-2.6.24-rc3-git1/arch/powerpc/platforms/pseries/phyp_dump.c 2007-11-21 13:24:30.000000000 -0600
@@ -12,17 +12,24 @@
*/
#include <linux/init.h>
+#include <linux/kobject.h>
#include <linux/mm.h>
+#include <linux/of.h>
#include <linux/pfn.h>
#include <linux/swap.h>
+#include <linux/sysfs.h>
#include <asm/page.h>
#include <asm/phyp_dump.h>
+#include <asm/rtas.h>
/* Global, used to communicate data between early boot and late boot */
static struct phyp_dump phyp_dump_global;
struct phyp_dump *phyp_dump_info = &phyp_dump_global;
+static int ibm_configure_kernel_dump;
+
+/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
* @start_pfn: starting physical frame number
@@ -52,18 +59,102 @@ release_memory_range(unsigned long start
}
}
-static int __init phyp_dump_setup(void)
+/* ------------------------------------------------- */
+/**
+ * sysfs_release_region -- sysfs interface to release memory range.
+ *
+ * Usage:
+ * "echo <start addr> <length> > /sys/kernel/release_region"
+ *
+ * Example:
+ * "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
+ *
+ * will release 256MB starting at 1GB.
+ */
+static ssize_t
+store_release_region(struct kset *kset, const char *buf, size_t count)
{
+ unsigned long start_addr, length, end_addr;
unsigned long start_pfn, nr_pages;
+ ssize_t ret;
- /* If no memory was reserved in early boot, there is nothing to do */
- if (phyp_dump_info->init_reserve_size == 0)
- return 0;
+ ret = sscanf(buf, "%lx %lx", &start_addr, &length);
+ if (ret != 2)
+ return -EINVAL;
+
+ /* Range-check - don't free any reserved memory that
+ * wasn't reserved for phyp-dump */
+ if (start_addr < phyp_dump_info->init_reserve_start)
+ start_addr = phyp_dump_info->init_reserve_start;
+
+ end_addr = phyp_dump_info->init_reserve_start +
+ phyp_dump_info->init_reserve_size;
+ if (start_addr+length > end_addr)
+ length = end_addr - start_addr;
+
+ /* Release the region of memory assed in by user */
+ start_pfn = PFN_DOWN(start_addr);
+ nr_pages = PFN_DOWN(length);
+ release_memory_range (start_pfn, nr_pages);
+
+ return count;
+}
+
+static ssize_t
+show_release_region(struct kset * kset, char *buf)
+{
+ return sprintf(buf, "ola\n");
+}
+
+static struct subsys_attribute rr = __ATTR(release_region, 0600,
+ show_release_region,
+ store_release_region);
+
+/* ------------------------------------------------- */
+
+static void release_all (void)
+{
+ unsigned long start_pfn, nr_pages;
- /* Release memory that was reserved in early boot */
+ /* Release all memory that was reserved in early boot */
start_pfn = PFN_DOWN(phyp_dump_info->init_reserve_start);
nr_pages = PFN_DOWN(phyp_dump_info->init_reserve_size);
release_memory_range(start_pfn, nr_pages);
+}
+
+static int __init phyp_dump_setup(void)
+{
+ struct device_node *rtas;
+ const int *dump_header;
+ int header_len = 0;
+ int rc;
+
+ /* If no memory was reserved in early boot, there is nothing to do */
+ if (phyp_dump_info->init_reserve_size == 0)
+ return 0;
+
+ /* Return if phyp dump not supported */
+ ibm_configure_kernel_dump = rtas_token("ibm,configure-kernel-dump");
+ if (ibm_configure_kernel_dump == RTAS_UNKNOWN_SERVICE) {
+ release_all();
+ return -ENOSYS;
+ }
+
+ /* Is there dump data waiting for us? */
+ rtas = of_find_node_by_path("/rtas");
+ dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
+ if (dump_header == NULL) {
+ release_all();
+ return 0;
+ }
+
+ /* Should we create a dump_subsys, analogous to s390/ipl.c ? */
+ rc = subsys_create_file(&kernel_subsys, &rr);
+ if (rc) {
+ printk (KERN_ERR "phyp-dump: unable to create sysfs file (%d)\n", rc);
+ release_all();
+ return 0;
+ }
return 0;
}
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 3/8] pseries: phyp dump: reserve-release proof-of-concept
2008-01-07 23:45 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (2 preceding siblings ...)
2008-01-08 0:21 ` [PATCH 4/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
@ 2008-01-08 0:25 ` Manish Ahuja
2008-01-08 3:16 ` Stephen Rothwell
2008-01-16 4:21 ` Paul Mackerras
2008-01-08 0:28 ` [PATCH 5/8] pseries: phyp dump: register dump area Manish Ahuja
` (4 subsequent siblings)
8 siblings, 2 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-01-08 0:25 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, linasvepstas, lkessler, strosake
Initial patch for reserving memory in early boot, and freeing it later.
If the previous boot had ended with a crash, the reserved memory would contain
a copy of the crashed kernel data.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
arch/powerpc/kernel/prom.c | 33 +++++++++++++
arch/powerpc/platforms/pseries/Makefile | 1
arch/powerpc/platforms/pseries/phyp_dump.c | 71 +++++++++++++++++++++++++++++
include/asm-powerpc/phyp_dump.h | 32 +++++++++++++
4 files changed, 137 insertions(+)
Index: linux-2.6.24-rc2-git4/include/asm-powerpc/phyp_dump.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.24-rc2-git4/include/asm-powerpc/phyp_dump.h 2007-11-19 17:44:21.000000000 -0600
@@ -0,0 +1,32 @@
+/*
+ * Hypervisor-assisted dump
+ *
+ * Linas Vepstas, Manish Ahuja 2007
+ * Copyright (c) 2007 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _PPC64_PHYP_DUMP_H
+#define _PPC64_PHYP_DUMP_H
+
+#ifdef CONFIG_PHYP_DUMP
+
+/* The RMR region will be saved for later dumping
+ * whenever the kernel crashes. Set this to 256MB. */
+#define PHYP_DUMP_RMR_START 0x0
+#define PHYP_DUMP_RMR_END (1UL<<28)
+
+struct phyp_dump {
+ /* Memory that is reserved during very early boot. */
+ unsigned long init_reserve_start;
+ unsigned long init_reserve_size;
+};
+
+extern struct phyp_dump *phyp_dump_info;
+
+#endif /* CONFIG_PHYP_DUMP */
+#endif /* _PPC64_PHYP_DUMP_H */
Index: linux-2.6.24-rc2-git4/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.24-rc2-git4/arch/powerpc/platforms/pseries/phyp_dump.c 2007-11-19 19:07:49.000000000 -0600
@@ -0,0 +1,71 @@
+/*
+ * Hypervisor-assisted dump
+ *
+ * Linas Vepstas, Manish Ahuja 2007
+ * Copyrhgit (c) 2007 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/pfn.h>
+#include <linux/swap.h>
+
+#include <asm/page.h>
+#include <asm/phyp_dump.h>
+
+/* Global, used to communicate data between early boot and late boot */
+static struct phyp_dump phyp_dump_global;
+struct phyp_dump *phyp_dump_info = &phyp_dump_global;
+
+/**
+ * release_memory_range -- release memory previously lmb_reserved
+ * @start_pfn: starting physical frame number
+ * @nr_pages: number of pages to free.
+ *
+ * This routine will release memory that had been previously
+ * lmb_reserved in early boot. The released memory becomes
+ * available for genreal use.
+ */
+static void
+release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
+{
+ struct page *rpage;
+ unsigned long end_pfn;
+ long i;
+
+ end_pfn = start_pfn + nr_pages;
+
+ for (i=start_pfn; i <= end_pfn; i++) {
+ rpage = pfn_to_page(i);
+ if (PageReserved(rpage)) {
+ ClearPageReserved(rpage);
+ init_page_count(rpage);
+ __free_page(rpage);
+ totalram_pages++;
+ }
+ }
+}
+
+static int __init phyp_dump_setup(void)
+{
+ unsigned long start_pfn, nr_pages;
+
+ /* If no memory was reserved in early boot, there is nothing to do */
+ if (phyp_dump_info->init_reserve_size == 0)
+ return 0;
+
+ /* Release memory that was reserved in early boot */
+ start_pfn = PFN_DOWN(phyp_dump_info->init_reserve_start);
+ nr_pages = PFN_DOWN(phyp_dump_info->init_reserve_size);
+ release_memory_range(start_pfn, nr_pages);
+
+ return 0;
+}
+
+subsys_initcall(phyp_dump_setup);
Index: linux-2.6.24-rc2-git4/arch/powerpc/platforms/pseries/Makefile
===================================================================
--- linux-2.6.24-rc2-git4.orig/arch/powerpc/platforms/pseries/Makefile 2007-11-19 17:43:52.000000000 -0600
+++ linux-2.6.24-rc2-git4/arch/powerpc/platforms/pseries/Makefile 2007-11-19 17:44:21.000000000 -0600
@@ -18,3 +18,4 @@ obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu
obj-$(CONFIG_HVC_CONSOLE) += hvconsole.o
obj-$(CONFIG_HVCS) += hvcserver.o
obj-$(CONFIG_HCALL_STATS) += hvCall_inst.o
+obj-$(CONFIG_PHYP_DUMP) += phyp_dump.o
Index: linux-2.6.24-rc2-git4/arch/powerpc/kernel/prom.c
===================================================================
--- linux-2.6.24-rc2-git4.orig/arch/powerpc/kernel/prom.c 2007-11-19 17:43:52.000000000 -0600
+++ linux-2.6.24-rc2-git4/arch/powerpc/kernel/prom.c 2007-11-19 17:44:21.000000000 -0600
@@ -51,6 +51,7 @@
#include <asm/machdep.h>
#include <asm/pSeries_reconfig.h>
#include <asm/pci-bridge.h>
+#include <asm/phyp_dump.h>
#include <asm/kexec.h>
#ifdef DEBUG
@@ -1011,6 +1012,37 @@ static void __init early_reserve_mem(voi
#endif
}
+#ifdef CONFIG_PHYP_DUMP
+
+/**
+ * reserve_crashed_mem() - reserve all not-yet-dumped mmemory
+ *
+ * This routine will reserve almost all of the memory in the
+ * system, except for a few hundred megabytes used to boot the
+ * new kernel. As the reserved memory is dumped to the dump
+ * device (by userland tools), it will be freed and made available.
+ */
+static void __init reserve_crashed_mem(void)
+{
+ unsigned long crashed_base, crashed_size;
+
+ /* Reserve *everything* above the RMR. We'll free this real soon. */
+ crashed_base = PHYP_DUMP_RMR_END;
+ crashed_size = lmb_end_of_DRAM() - crashed_base;
+
+ /* XXX crashed_ram_end is wrong, since it may be beyond
+ * the memory_limit, it will need to be adjusted. */
+ lmb_reserve(crashed_base, crashed_size);
+
+ phyp_dump_info->init_reserve_start = crashed_base;
+ phyp_dump_info->init_reserve_size = crashed_size;
+}
+
+#else
+static inline void __init reserve_crashed_mem(void) {}
+#endif /* CONFIG_PHYP_DUMP */
+
+
void __init early_init_devtree(void *params)
{
DBG(" -> early_init_devtree(%p)\n", params);
@@ -1043,6 +1075,7 @@ void __init early_init_devtree(void *par
reserve_kdump_trampoline();
reserve_crashkernel();
early_reserve_mem();
+ reserve_crashed_mem();
lmb_enforce_memory_limit(memory_limit);
lmb_analyze();
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 5/8] pseries: phyp dump: register dump area.
2008-01-07 23:45 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (3 preceding siblings ...)
2008-01-08 0:25 ` [PATCH 3/8] pseries: phyp dump: reserve-release proof-of-concept Manish Ahuja
@ 2008-01-08 0:28 ` Manish Ahuja
2008-01-08 3:59 ` Stephen Rothwell
2008-01-08 0:35 ` [PATCH 6/8] pseries: phyp dump: debugging print routines Manish Ahuja
` (3 subsequent siblings)
8 siblings, 1 reply; 71+ messages in thread
From: Manish Ahuja @ 2008-01-08 0:28 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, linasvepstas, lkessler, strosake
Set up the actual dump header, register it with the hypervisor.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
------
arch/powerpc/platforms/pseries/phyp_dump.c | 169 +++++++++++++++++++++++++++--
1 file changed, 163 insertions(+), 6 deletions(-)
Index: linux-2.6.24-rc3-git1/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- linux-2.6.24-rc3-git1.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2007-11-21 15:55:37.000000000 -0600
+++ linux-2.6.24-rc3-git1/arch/powerpc/platforms/pseries/phyp_dump.c 2007-11-21 16:06:52.000000000 -0600
@@ -30,6 +30,134 @@ struct phyp_dump *phyp_dump_info = &phyp
static int ibm_configure_kernel_dump;
/* ------------------------------------------------- */
+/* RTAS interfaces to declare the dump regions */
+
+struct dump_section {
+ u32 dump_flags;
+ u16 source_type;
+ u16 error_flags;
+ u64 source_address;
+ u64 source_length;
+ u64 length_copied;
+ u64 destination_address;
+};
+
+struct phyp_dump_header {
+ u32 version;
+ u16 num_of_sections;
+ u16 status;
+
+ u32 first_offset_section;
+ u32 dump_disk_section;
+ u64 block_num_dd;
+ u64 num_of_blocks_dd;
+ u32 offset_dd;
+ u32 maxtime_to_auto;
+ /* No dump disk path string used */
+
+ struct dump_section cpu_data;
+ struct dump_section hpte_data;
+ struct dump_section kernel_data;
+};
+
+/* The dump header *must be* in low memory, so .bss it */
+static struct phyp_dump_header phdr;
+
+#define NUM_DUMP_SECTIONS 3
+#define DUMP_HEADER_VERSION 0x1
+#define DUMP_REQUEST_FLAG 0x1
+#define DUMP_SOURCE_CPU 0x0001
+#define DUMP_SOURCE_HPTE 0x0002
+#define DUMP_SOURCE_RMO 0x0011
+
+/**
+ * init_dump_header() - initialize the header declaring a dump
+ * Returns: length of dump save area.
+ *
+ * When the hypervisor saves crashed state, it needs to put
+ * it somewhere. The dump header tells the hypervisor where
+ * the data can be saved.
+ */
+static unsigned long init_dump_header(struct phyp_dump_header *ph)
+{
+ struct device_node *rtas;
+ const unsigned int *sizes;
+ int len;
+ unsigned long cpu_state_size = 0;
+ unsigned long hpte_region_size = 0;
+ unsigned long addr_offset = 0;
+
+ /* Get the required dump region sizes */
+ rtas = of_find_node_by_path("/rtas");
+ sizes = of_get_property(rtas, "ibm,configure-kernel-dump-sizes", &len);
+ if (!sizes || len < 20)
+ return 0;
+
+ if (sizes[0] == 1)
+ cpu_state_size = *((unsigned long *) &sizes[1]);
+
+ if (sizes[3] == 2)
+ hpte_region_size = *((unsigned long *) &sizes[4]);
+
+ /* Set up the dump header */
+ ph->version = DUMP_HEADER_VERSION;
+ ph->num_of_sections = NUM_DUMP_SECTIONS;
+ ph->status = 0;
+
+ ph->first_offset_section =
+ (u32) &(((struct phyp_dump_header *) 0)->cpu_data);
+ ph->dump_disk_section = 0;
+ ph->block_num_dd = 0;
+ ph->num_of_blocks_dd = 0;
+ ph->offset_dd = 0;
+
+ ph->maxtime_to_auto = 0; /* disabled */
+
+ /* The first two sections are mandatory */
+ ph->cpu_data.dump_flags = DUMP_REQUEST_FLAG;
+ ph->cpu_data.source_type = DUMP_SOURCE_CPU;
+ ph->cpu_data.source_address = 0;
+ ph->cpu_data.source_length = cpu_state_size;
+ ph->cpu_data.destination_address = addr_offset;
+ addr_offset += cpu_state_size;
+
+ ph->hpte_data.dump_flags = DUMP_REQUEST_FLAG;
+ ph->hpte_data.source_type = DUMP_SOURCE_HPTE;
+ ph->hpte_data.source_address = 0;
+ ph->hpte_data.source_length = hpte_region_size;
+ ph->hpte_data.destination_address = addr_offset;
+ addr_offset += hpte_region_size;
+
+ /* This section describes the low kernel region */
+ ph->kernel_data.dump_flags = DUMP_REQUEST_FLAG;
+ ph->kernel_data.source_type = DUMP_SOURCE_RMO;
+ ph->kernel_data.source_address = PHYP_DUMP_RMR_START;
+ ph->kernel_data.source_length = PHYP_DUMP_RMR_END;
+ ph->kernel_data.destination_address = addr_offset;
+ addr_offset += ph->kernel_data.source_length;
+
+ return addr_offset;
+}
+
+static void register_dump_area(struct phyp_dump_header *ph, unsigned long addr)
+{
+ int rc;
+ ph->cpu_data.destination_address += addr;
+ ph->hpte_data.destination_address += addr;
+ ph->kernel_data.destination_address += addr;
+
+ do {
+ rc = rtas_call(ibm_configure_kernel_dump, 3, 1, NULL,
+ 1, ph, sizeof(struct phyp_dump_header));
+ } while (rtas_busy_delay(rc));
+
+ if (rc)
+ {
+ printk (KERN_ERR "phyp-dump: unexpected error (%d) on register\n", rc);
+ }
+}
+
+/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
* @start_pfn: starting physical frame number
@@ -125,7 +253,11 @@ static void release_all (void)
static int __init phyp_dump_setup(void)
{
struct device_node *rtas;
- const int *dump_header;
+ const struct phyp_dump_header *dump_header;
+ unsigned long dump_area_start;
+ unsigned long dump_area_length;
+ unsigned long free_area_length;
+ unsigned long start_pfn, nr_pages;
int header_len = 0;
int rc;
@@ -140,22 +272,47 @@ static int __init phyp_dump_setup(void)
return -ENOSYS;
}
- /* Is there dump data waiting for us? */
+ /* Is there dump data waiting for us? If there isn't,
+ * then register a new dump area, and release all of
+ * the rest of the reserved ram.
+ *
+ * The /rtas/ibm,kernel-dump rtas node is present only
+ * if there is dump data waiting for us.
+ */
rtas = of_find_node_by_path("/rtas");
dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
+
+ dump_area_length = init_dump_header (&phdr);
+ free_area_length = phyp_dump_info->init_reserve_size - dump_area_length;
+ dump_area_start = phyp_dump_info->init_reserve_start + free_area_length;
+ dump_area_start = dump_area_start & PAGE_MASK; /* align down */
+ free_area_length = dump_area_start - phyp_dump_info->init_reserve_start;
+
if (dump_header == NULL) {
- release_all();
- return 0;
+ register_dump_area (&phdr, dump_area_start);
+ goto release_mem;
}
+ /* Don't allow user to release the 256MB scratch area */
+ phyp_dump_info->init_reserve_size = free_area_length;
+
/* Should we create a dump_subsys, analogous to s390/ipl.c ? */
rc = subsys_create_file(&kernel_subsys, &rr);
if (rc) {
printk (KERN_ERR "phyp-dump: unable to create sysfs file (%d)\n", rc);
- release_all();
- return 0;
+ goto release_mem;
}
+ /* ToDo: re-register the dump area, for next time. */
+
+ return 0;
+
+release_mem:
+ /* release everything except the top 256 MB scratch area */
+ start_pfn = PFN_DOWN(phyp_dump_info->init_reserve_start);
+ nr_pages = PFN_DOWN(free_area_length);
+ release_memory_range(start_pfn, nr_pages);
+
return 0;
}
> documentation explaining what this is :-) . Yes, its supposed
> to be an improvement over kdump.
>
> The patches mostly work; a list of open issues / todo list
> is included in the documentation. It also appears that
> the not-yet-released firmware versions this was tested
> on are still, ahem, incomplete; this work is also pending.
>
> -- Linas & Manish
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@ozlabs.org
> https://ozlabs.org/mailman/listinfo/linuxppc-dev
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 6/8] pseries: phyp dump: debugging print routines.
2008-01-07 23:45 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (4 preceding siblings ...)
2008-01-08 0:28 ` [PATCH 5/8] pseries: phyp dump: register dump area Manish Ahuja
@ 2008-01-08 0:35 ` Manish Ahuja
2008-01-08 0:49 ` Arnd Bergmann
2008-01-08 4:03 ` Stephen Rothwell
2008-01-08 0:37 ` [PATCH 7/8] pseries: phyp dump: Unregister and print dump areas Manish Ahuja
` (2 subsequent siblings)
8 siblings, 2 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-01-08 0:35 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, linasvepstas, lkessler, strosake
Provide some basic debugging support.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
Signed-off-by: Linas Vepsts <linas@austin.ibm.com>
-----
arch/powerpc/platforms/pseries/phyp_dump.c | 53 ++++++++++++++++++++++++++++-
1 file changed, 52 insertions(+), 1 deletion(-)
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-01-01 23:24:10.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-01-01 23:24:27.000000000 -0600
@@ -2,7 +2,7 @@
* Hypervisor-assisted dump
*
* Linas Vepstas, Manish Ahuja 2007
- * Copyrhgit (c) 2007 IBM Corp.
+ * Copyright (c) 2007 IBM Corp.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
@@ -139,6 +139,51 @@ static unsigned long init_dump_header(st
return addr_offset;
}
+#ifdef DEBUG
+static void print_dump_header(const struct phyp_dump_header *ph)
+{
+ printk(KERN_INFO "dump header:\n");
+ /* setup some ph->sections required */
+ printk(KERN_INFO "version = %d\n", ph->version);
+ printk(KERN_INFO "Sections = %d\n", ph->num_of_sections);
+ printk(KERN_INFO "Status = 0x%x\n", ph->status);
+
+ /* No ph->disk, so all should be set to 0 */
+ printk(KERN_INFO "Offset to first section 0x%x\n", ph->first_offset_section);
+ printk(KERN_INFO "dump disk sections should be zero\n");
+ printk(KERN_INFO "dump disk section = %d\n",ph->dump_disk_section);
+ printk(KERN_INFO "block num = %ld\n",ph->block_num_dd);
+ printk(KERN_INFO "number of blocks = %ld\n",ph->num_of_blocks_dd);
+ printk(KERN_INFO "dump disk offset = %d\n",ph->offset_dd);
+ printk(KERN_INFO "Max auto time= %d\n",ph->maxtime_to_auto);
+
+ /*set cpu state and hpte states as well scratch pad area */
+ printk(KERN_INFO " CPU AREA \n");
+ printk(KERN_INFO "cpu dump_flags =%d\n",ph->cpu_data.dump_flags);
+ printk(KERN_INFO "cpu source_type =%d\n",ph->cpu_data.source_type);
+ printk(KERN_INFO "cpu error_flags =%d\n",ph->cpu_data.error_flags);
+ printk(KERN_INFO "cpu source_address =%lx\n",ph->cpu_data.source_address);
+ printk(KERN_INFO "cpu source_length =%lx\n",ph->cpu_data.source_length);
+ printk(KERN_INFO "cpu length_copied =%lx\n",ph->cpu_data.length_copied);
+
+ printk(KERN_INFO " HPTE AREA \n");
+ printk(KERN_INFO "HPTE dump_flags =%d\n",ph->hpte_data.dump_flags);
+ printk(KERN_INFO "HPTE source_type =%d\n",ph->hpte_data.source_type);
+ printk(KERN_INFO "HPTE error_flags =%d\n",ph->hpte_data.error_flags);
+ printk(KERN_INFO "HPTE source_address =%lx\n",ph->hpte_data.source_address);
+ printk(KERN_INFO "HPTE source_length =%lx\n",ph->hpte_data.source_length);
+ printk(KERN_INFO "HPTE length_copied =%lx\n",ph->hpte_data.length_copied);
+
+ printk(KERN_INFO " SRSD AREA \n");
+ printk(KERN_INFO "SRSD dump_flags =%d\n",ph->kernel_data.dump_flags);
+ printk(KERN_INFO "SRSD source_type =%d\n",ph->kernel_data.source_type);
+ printk(KERN_INFO "SRSD error_flags =%d\n",ph->kernel_data.error_flags);
+ printk(KERN_INFO "SRSD source_address =%lx\n",ph->kernel_data.source_address);
+ printk(KERN_INFO "SRSD source_length =%lx\n",ph->kernel_data.source_length);
+ printk(KERN_INFO "SRSD length_copied =%lx\n",ph->kernel_data.length_copied);
+}
+#endif
+
static void register_dump_area(struct phyp_dump_header *ph, unsigned long addr)
{
int rc;
@@ -154,6 +199,9 @@ static void register_dump_area(struct ph
if (rc)
{
printk (KERN_ERR "phyp-dump: unexpected error (%d) on register\n", rc);
+#ifdef DEBUG
+ print_dump_header (ph);
+#endif
}
}
@@ -271,6 +319,9 @@ static int __init phyp_dump_setup(void)
release_all();
return -ENOSYS;
}
+#ifdef DEBUG
+ print_dump_header (dump_header);
+#endif
/* Is there dump data waiting for us? If there isn't,
* then register a new dump area, and release all of
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 7/8] pseries: phyp dump: Unregister and print dump areas.
2008-01-07 23:45 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (5 preceding siblings ...)
2008-01-08 0:35 ` [PATCH 6/8] pseries: phyp dump: debugging print routines Manish Ahuja
@ 2008-01-08 0:37 ` Manish Ahuja
2008-01-08 4:25 ` Stephen Rothwell
2008-01-08 0:39 ` [PATCH 8/8] pseries: phyp dump: Tracking memory range freed Manish Ahuja
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
8 siblings, 1 reply; 71+ messages in thread
From: Manish Ahuja @ 2008-01-08 0:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, linasvepstas, lkessler, strosake
Routines to invalidate and unregister dump routines.
Unregister has not been used yet, I will release another
patch for that at a later stage with the kdump integration patches.
There is also a routine which calculates the regions to be
freed and exports that through sysfs.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
---
arch/powerpc/platforms/pseries/phyp_dump.c | 79 ++++++++++++++++++++++++++---
1 file changed, 73 insertions(+), 6 deletions(-)
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-01-07 18:06:11.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-01-07 22:57:27.000000000 -0600
@@ -69,6 +69,10 @@ static struct phyp_dump_header phdr;
#define DUMP_SOURCE_CPU 0x0001
#define DUMP_SOURCE_HPTE 0x0002
#define DUMP_SOURCE_RMO 0x0011
+#define DUMP_ERROR_FLAG 0x2000
+#define DUMP_TRIGGERED 0x4000
+#define DUMP_PERFORMED 0x8000
+
/**
* init_dump_header() - initialize the header declaring a dump
@@ -187,10 +191,15 @@ static void print_dump_header(const stru
static void register_dump_area(struct phyp_dump_header *ph, unsigned long addr)
{
int rc;
- ph->cpu_data.destination_address += addr;
- ph->hpte_data.destination_address += addr;
- ph->kernel_data.destination_address += addr;
+ /* Add addr value if not initialized before */
+ if (ph->cpu_data.destination_address == 0) {
+ ph->cpu_data.destination_address += addr;
+ ph->hpte_data.destination_address += addr;
+ ph->kernel_data.destination_address += addr;
+ }
+
+ /* ToDo Invalidate kdump and free memory range. */
do {
rc = rtas_call(ibm_configure_kernel_dump, 3, 1, NULL,
1, ph, sizeof(struct phyp_dump_header));
@@ -205,6 +214,49 @@ static void register_dump_area(struct ph
}
}
+static void invalidate_last_dump(struct phyp_dump_header *ph, unsigned long addr)
+{
+ int rc;
+
+ /* Add addr value if not initialized before */
+ if (ph->cpu_data.destination_address == 0) {
+ ph->cpu_data.destination_address = addr;
+ ph->hpte_data.destination_address += addr;
+ ph->kernel_data.destination_address += addr;
+ }
+
+ do {
+ rc = rtas_call(ibm_configure_kernel_dump, 3, 1, NULL,
+ 2, ph, sizeof(struct phyp_dump_header));
+ } while (rtas_busy_delay(rc));
+
+ if (rc)
+ {
+ printk (KERN_ERR "phyp-dump: unexpected error (%d) on invalidate\n", rc);
+#ifdef DEBUG
+ print_dump_header (ph);
+#endif
+ }
+}
+
+static void unregister_dump_area(struct phyp_dump_header *ph)
+{
+ int rc;
+
+ do {
+ rc = rtas_call(ibm_configure_kernel_dump, 3, 1, NULL,
+ 3, ph, sizeof(struct phyp_dump_header));
+ } while (rtas_busy_delay(rc));
+
+ if (rc)
+ {
+ printk (KERN_ERR "phyp-dump: unexpected error (%d) on unregister\n", rc);
+#ifdef DEBUG
+ print_dump_header (ph);
+#endif
+ }
+}
+
/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
@@ -279,7 +331,17 @@ store_release_region(struct kset *kset,
static ssize_t
show_release_region(struct kset * kset, char *buf)
{
- return sprintf(buf, "ola\n");
+ u64 second_addr_range;
+
+ /* total reserved size - start of scratch area */
+ second_addr_range = phdr.cpu_data.destination_address -
+ phyp_dump_info->init_reserve_size;
+ return sprintf(buf, "CPU:0x%lx-0x%lx: HPTE:0x%lx-0x%lx:"
+ " DUMP:0x%lx-0x%lx, 0x%lx-0x%lx:\n",
+ phdr.cpu_data.destination_address, phdr.cpu_data.length_copied,
+ phdr.hpte_data.destination_address, phdr.hpte_data.length_copied,
+ phdr.kernel_data.destination_address, phdr.kernel_data.length_copied,
+ phyp_dump_info->init_reserve_start, second_addr_range);
}
static struct subsys_attribute rr = __ATTR(release_region, 0600,
@@ -344,6 +406,13 @@ static int __init phyp_dump_setup(void)
goto release_mem;
}
+ /* re-register the dump area, if old dump was invalid */
+ if ((dump_header) && (dump_header->status & DUMP_ERROR_FLAG)) {
+ invalidate_last_dump (&phdr, dump_area_start);
+ register_dump_area (&phdr, dump_area_start);
+ goto release_mem;
+ }
+
/* Don't allow user to release the 256MB scratch area */
phyp_dump_info->init_reserve_size = free_area_length;
@@ -354,8 +423,6 @@ static int __init phyp_dump_setup(void)
goto release_mem;
}
- /* ToDo: re-register the dump area, for next time. */
-
return 0;
release_mem:
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 8/8] pseries: phyp dump: Tracking memory range freed.
2008-01-07 23:45 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (6 preceding siblings ...)
2008-01-08 0:37 ` [PATCH 7/8] pseries: phyp dump: Unregister and print dump areas Manish Ahuja
@ 2008-01-08 0:39 ` Manish Ahuja
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
8 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-01-08 0:39 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, linasvepstas, lkessler, strosake
This patch tracks the size freed. For now it does a simple
rudimentary calculation of the ranges freed. The idea is
to keep it simple at the external shell script level and
send in large chunks for now.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
-----
---
arch/powerpc/platforms/pseries/phyp_dump.c | 36 +++++++++++++++++++++++++++++
include/asm-powerpc/phyp_dump.h | 3 ++
2 files changed, 39 insertions(+)
Index: 2.6.24-rc5/include/asm-powerpc/phyp_dump.h
===================================================================
--- 2.6.24-rc5.orig/include/asm-powerpc/phyp_dump.h 2008-01-07 22:55:28.000000000 -0600
+++ 2.6.24-rc5/include/asm-powerpc/phyp_dump.h 2008-01-07 22:58:02.000000000 -0600
@@ -24,6 +24,9 @@ struct phyp_dump {
/* Memory that is reserved during very early boot. */
unsigned long init_reserve_start;
unsigned long init_reserve_size;
+ /* Scratch area memory details */
+ unsigned long scratch_reserve_start;
+ unsigned long scratch_reserve_size;
};
extern struct phyp_dump *phyp_dump_info;
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-01-07 22:57:27.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-01-07 22:58:02.000000000 -0600
@@ -287,6 +287,39 @@ release_memory_range(unsigned long start
}
}
+/**
+ * track_freed_range -- Counts the range being freed.
+ * Once the counter goes to zero, it re-registers dump for
+ * future use.
+ */
+static void
+track_freed_range(unsigned long addr, unsigned long length)
+{
+ static unsigned long scratch_area_size, reserved_area_size;
+
+ if (addr < phyp_dump_info->init_reserve_start)
+ return;
+
+ if ((addr >= phyp_dump_info->init_reserve_start) &&
+ (addr <= phyp_dump_info->init_reserve_start +
+ phyp_dump_info->init_reserve_size))
+ reserved_area_size += length;
+
+ if ((addr >= phyp_dump_info->scratch_reserve_start) &&
+ (addr <= phyp_dump_info->scratch_reserve_start +
+ phyp_dump_info->scratch_reserve_size))
+ scratch_area_size += length;
+
+ if ((reserved_area_size == phyp_dump_info->init_reserve_start) &&
+ (scratch_area_size == phyp_dump_info->scratch_reserve_size)) {
+
+ invalidate_last_dump(&phdr,
+ phyp_dump_info->scratch_reserve_start);
+ register_dump_area (&phdr,
+ phyp_dump_info->scratch_reserve_start);
+ }
+}
+
/* ------------------------------------------------- */
/**
* sysfs_release_region -- sysfs interface to release memory range.
@@ -310,6 +343,8 @@ store_release_region(struct kset *kset,
if (ret != 2)
return -EINVAL;
+ track_freed_range(start_addr, length);
+
/* Range-check - don't free any reserved memory that
* wasn't reserved for phyp-dump */
if (start_addr < phyp_dump_info->init_reserve_start)
@@ -414,6 +449,7 @@ static int __init phyp_dump_setup(void)
}
/* Don't allow user to release the 256MB scratch area */
+ /* this might be wrong */
phyp_dump_info->init_reserve_size = free_area_length;
/* Should we create a dump_subsys, analogous to s390/ipl.c ? */
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 6/8] pseries: phyp dump: debugging print routines.
2008-01-08 0:35 ` [PATCH 6/8] pseries: phyp dump: debugging print routines Manish Ahuja
@ 2008-01-08 0:49 ` Arnd Bergmann
2008-01-08 4:03 ` Stephen Rothwell
1 sibling, 0 replies; 71+ messages in thread
From: Arnd Bergmann @ 2008-01-08 0:49 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, lkessler, linasvepstas, strosake
On Tuesday 08 January 2008, Manish Ahuja wrote:
> +#ifdef DEBUG
> +static void print_dump_header(const struct phyp_dump_header *ph)
> +{
> +=A0=A0=A0=A0=A0=A0=A0printk(KERN_INFO "dump header:\n");
> ...
> +=A0=A0=A0=A0=A0=A0=A0printk(KERN_INFO "SRSD length_copied =3D%lx\n",ph->=
kernel_data.length_copied);
> +}
> +#endif
> +
If you move the #if to inside of the function,
> =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0printk (KERN_ERR "phyp-du=
mp: unexpected error (%d) on register\n", rc);
> +#ifdef DEBUG
> +=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0print_dump_header (ph);
> +#endif
> =A0=A0=A0=A0=A0=A0=A0=A0}
> =A0=A0=A0=A0=A0=A0=A0=A0}
> +#ifdef DEBUG
> +=A0=A0=A0=A0=A0=A0=A0print_dump_header (dump_header);
> +#endif
you don't need the #ifdefs here.
Arnd <><
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: reserve-release proof-of-concept
2008-01-08 0:25 ` [PATCH 3/8] pseries: phyp dump: reserve-release proof-of-concept Manish Ahuja
@ 2008-01-08 3:16 ` Stephen Rothwell
2008-01-16 4:21 ` Paul Mackerras
1 sibling, 0 replies; 71+ messages in thread
From: Stephen Rothwell @ 2008-01-08 3:16 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, lkessler, strosake
[-- Attachment #1: Type: text/plain, Size: 1771 bytes --]
Hi Manish,
Just some trivial things ...
On Mon, 07 Jan 2008 18:25:31 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
>
>
> +++ linux-2.6.24-rc2-git4/include/asm-powerpc/phyp_dump.h 2007-11-19 17:44:21.000000000 -0600
> +#ifndef _PPC64_PHYP_DUMP_H
We more usually use _ASM_POWERPC_PHYP_DUMP_H
> +#define _PPC64_PHYP_DUMP_H
> +
> +#ifdef CONFIG_PHYP_DUMP
Do these things really need protecting by this CONFIG variable? i.e. does
anything change depending on the visibility of these various symbols?
> +++ linux-2.6.24-rc2-git4/arch/powerpc/platforms/pseries/phyp_dump.c 2007-11-19 19:07:49.000000000 -0600
> @@ -0,0 +1,71 @@
> +/*
> + * Hypervisor-assisted dump
> + *
> + * Linas Vepstas, Manish Ahuja 2007
> + * Copyrhgit (c) 2007 IBM Corp.
^^^^^^^^^
typo. and you should really be using '©' in new copyright notices (or
nothing).
> +/**
> + * release_memory_range -- release memory previously lmb_reserved
> + * @start_pfn: starting physical frame number
> + * @nr_pages: number of pages to free.
> + *
> + * This routine will release memory that had been previously
> + * lmb_reserved in early boot. The released memory becomes
> + * available for genreal use.
^^^^^^^
typo.
> +release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
> +{
> + struct page *rpage;
> + unsigned long end_pfn;
> + long i;
> +
> + end_pfn = start_pfn + nr_pages;
> +
> + for (i=start_pfn; i <= end_pfn; i++) {
spaces around '='
> +static int __init phyp_dump_setup(void)
> +{
> +}
> +
> +subsys_initcall(phyp_dump_setup);
Normally we don't leave a blank line here.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 2/8] pseries: phyp dump: config file
2008-01-08 0:16 ` [PATCH 2/8] pseries: phyp dump: config file Manish Ahuja
@ 2008-01-08 3:18 ` Stephen Rothwell
0 siblings, 0 replies; 71+ messages in thread
From: Stephen Rothwell @ 2008-01-08 3:18 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, lkessler, strosake
[-- Attachment #1: Type: text/plain, Size: 445 bytes --]
On Mon, 07 Jan 2008 18:16:08 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
>
>
> Add hypervisor-assisted dump to kernel config
>
> Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
This patch should probably come last in the series so that all the code
is there before some bisecting autobuilder tries to configure it.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 4/8] pseries: phyp dump: use sysfs to release reserved mem
2008-01-08 0:21 ` [PATCH 4/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
@ 2008-01-08 3:45 ` Stephen Rothwell
2008-01-08 18:34 ` Linas Vepstas
0 siblings, 1 reply; 71+ messages in thread
From: Stephen Rothwell @ 2008-01-08 3:45 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, lkessler, strosake
[-- Attachment #1: Type: text/plain, Size: 697 bytes --]
On Mon, 07 Jan 2008 18:21:57 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
>
> +static int __init phyp_dump_setup(void)
> +{
>
> + /* Is there dump data waiting for us? */
> + rtas = of_find_node_by_path("/rtas");
> + dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
^^^^^^^^^^^
You could pass NULL here as header_len appears to be unused. Also you
need "of_node_put(rtas)" somewhere (probably just here would do).
> + if (dump_header == NULL) {
> + release_all();
> + return 0;
> + }
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 5/8] pseries: phyp dump: register dump area.
2008-01-08 0:28 ` [PATCH 5/8] pseries: phyp dump: register dump area Manish Ahuja
@ 2008-01-08 3:59 ` Stephen Rothwell
0 siblings, 0 replies; 71+ messages in thread
From: Stephen Rothwell @ 2008-01-08 3:59 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, lkessler, strosake
[-- Attachment #1: Type: text/plain, Size: 1569 bytes --]
Hi Manish,
On Mon, 07 Jan 2008 18:28:30 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
>
> +++ linux-2.6.24-rc3-git1/arch/powerpc/platforms/pseries/phyp_dump.c 2007-11-21 16:06:52.000000000 -0600
> +static unsigned long init_dump_header(struct phyp_dump_header *ph)
> +{
> + /* Get the required dump region sizes */
> + rtas = of_find_node_by_path("/rtas");
You need to of_node_put(rtas) somewhere.
> + if (sizes[0] == 1)
> + cpu_state_size = *((unsigned long *) &sizes[1]);
We normally don't put spaces after casts.
> + ph->first_offset_section =
> + (u32) &(((struct phyp_dump_header *) 0)->cpu_data);
(u32)offsetof(struct phyp_dump_header, cpu_data);
> +static void register_dump_area(struct phyp_dump_header *ph, unsigned long addr)
> +{
> + if (rc)
> + {
> + printk (KERN_ERR "phyp-dump: unexpected error (%d) on register\n", rc);
> + }
The braces are not needed.
> +
> + dump_area_length = init_dump_header (&phdr);
We don't put spaces after function names.
> + free_area_length = phyp_dump_info->init_reserve_size - dump_area_length;
> + dump_area_start = phyp_dump_info->init_reserve_start + free_area_length;
> + dump_area_start = dump_area_start & PAGE_MASK; /* align down */
> + free_area_length = dump_area_start - phyp_dump_info->init_reserve_start;
> +
> if (dump_header == NULL) {
> - release_all();
> - return 0;
> + register_dump_area (&phdr, dump_area_start);
Ditto.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 6/8] pseries: phyp dump: debugging print routines.
2008-01-08 0:35 ` [PATCH 6/8] pseries: phyp dump: debugging print routines Manish Ahuja
2008-01-08 0:49 ` Arnd Bergmann
@ 2008-01-08 4:03 ` Stephen Rothwell
1 sibling, 0 replies; 71+ messages in thread
From: Stephen Rothwell @ 2008-01-08 4:03 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, lkessler, strosake
[-- Attachment #1: Type: text/plain, Size: 359 bytes --]
Hi Manish,
Just a trivial comment.
On Mon, 07 Jan 2008 18:35:09 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
>
> + printk(KERN_INFO "dump disk section = %d\n",ph->dump_disk_section);
Please put a space after the ','. Here and later.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 7/8] pseries: phyp dump: Unregister and print dump areas.
2008-01-08 0:37 ` [PATCH 7/8] pseries: phyp dump: Unregister and print dump areas Manish Ahuja
@ 2008-01-08 4:25 ` Stephen Rothwell
2008-01-08 22:56 ` Manish Ahuja
0 siblings, 1 reply; 71+ messages in thread
From: Stephen Rothwell @ 2008-01-08 4:25 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, lkessler, strosake
[-- Attachment #1: Type: text/plain, Size: 2078 bytes --]
Hi Manish,
Just more trivial stuff.
On Mon, 07 Jan 2008 18:37:31 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
>
> static void register_dump_area(struct phyp_dump_header *ph, unsigned long addr)
> {
> int rc;
> - ph->cpu_data.destination_address += addr;
> - ph->hpte_data.destination_address += addr;
> - ph->kernel_data.destination_address += addr;
>
> + /* Add addr value if not initialized before */
> + if (ph->cpu_data.destination_address == 0) {
> + ph->cpu_data.destination_address += addr;
Could be just '=' like further down, right?
> +static void invalidate_last_dump(struct phyp_dump_header *ph, unsigned long addr)
> +{
> + if (rc)
> + {
We prefer
if (rc) {
> +static void unregister_dump_area(struct phyp_dump_header *ph)
> +{
> + if (rc)
> + {
And again.
> static ssize_t
> show_release_region(struct kset * kset, char *buf)
We normally put the function name on the same line as the type.
> {
> - return sprintf(buf, "ola\n");
> + u64 second_addr_range;
> +
> + /* total reserved size - start of scratch area */
> + second_addr_range = phdr.cpu_data.destination_address -
> + phyp_dump_info->init_reserve_size;
> + return sprintf(buf, "CPU:0x%lx-0x%lx: HPTE:0x%lx-0x%lx:"
> + " DUMP:0x%lx-0x%lx, 0x%lx-0x%lx:\n",
> + phdr.cpu_data.destination_address, phdr.cpu_data.length_copied,
> + phdr.hpte_data.destination_address, phdr.hpte_data.length_copied,
> + phdr.kernel_data.destination_address, phdr.kernel_data.length_copied,
> + phyp_dump_info->init_reserve_start, second_addr_range);
This indentation should be (probably) two tabs.
> + /* re-register the dump area, if old dump was invalid */
> + if ((dump_header) && (dump_header->status & DUMP_ERROR_FLAG)) {
^ ^
Extra parentheses.
> + invalidate_last_dump (&phdr, dump_area_start);
> + register_dump_area (&phdr, dump_area_start);
No spaces after function names.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 4/8] pseries: phyp dump: use sysfs to release reserved mem
2008-01-08 3:45 ` Stephen Rothwell
@ 2008-01-08 18:34 ` Linas Vepstas
0 siblings, 0 replies; 71+ messages in thread
From: Linas Vepstas @ 2008-01-08 18:34 UTC (permalink / raw)
To: Stephen Rothwell; +Cc: mahuja, linuxppc-dev, lkessler, strosake
On 07/01/2008, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> On Mon, 07 Jan 2008 18:21:57 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
> >
> > +static int __init phyp_dump_setup(void)
> > +{
> >
> > + /* Is there dump data waiting for us? */
> > + rtas = of_find_node_by_path("/rtas");
> > + dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
> ^^^^^^^^^^^
> You could pass NULL here as header_len appears to be unused. Also you
> need "of_node_put(rtas)" somewhere (probably just here would do).
Perhaps the routine should have been "of_get_node_by_path()" ?
In ye olden days, "finds" didn't require put, but "gets" did. I'm guessing
that this has now all been fixed up for the of_xxx routines, but I think
that pci_find_xxx still does not require a pci_put.
Why did I bother to write this email? I dunno...
--linas
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 7/8] pseries: phyp dump: Unregister and print dump areas.
2008-01-08 4:25 ` Stephen Rothwell
@ 2008-01-08 22:56 ` Manish Ahuja
0 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-01-08 22:56 UTC (permalink / raw)
To: Stephen Rothwell; +Cc: mahuja, linuxppc-dev, linasvepstas, lkessler, strosake
Stephen,
>> + /* Add addr value if not initialized before */
>> + if (ph->cpu_data.destination_address == 0) {
>> + ph->cpu_data.destination_address += addr;
>
> Could be just '=' like further down, right?
Actually the one below should be += as well. Thanks for catching it.
>> + /* total reserved size - start of scratch area */
>> + second_addr_range = phdr.cpu_data.destination_address -
>> + phyp_dump_info->init_reserve_size;
>> + return sprintf(buf, "CPU:0x%lx-0x%lx: HPTE:0x%lx-0x%lx:"
>> + " DUMP:0x%lx-0x%lx, 0x%lx-0x%lx:\n",
>> + phdr.cpu_data.destination_address, phdr.cpu_data.length_copied,
>> + phdr.hpte_data.destination_address, phdr.hpte_data.length_copied,
>> + phdr.kernel_data.destination_address, phdr.kernel_data.length_copied,
>> + phyp_dump_info->init_reserve_start, second_addr_range);
>
> This indentation should be (probably) two tabs.
I kept it one with a few spaces as otherwise it was exceeding 80, I guess, I can just have one per line
and that should take care of that.
>
>> + /* re-register the dump area, if old dump was invalid */
>> + if ((dump_header) && (dump_header->status & DUMP_ERROR_FLAG)) {
> ^ ^
> Extra parentheses.
Just for clarity.. I would prefer that, if thats okay.
>
>> + invalidate_last_dump (&phdr, dump_area_start);
>> + register_dump_area (&phdr, dump_area_start);
>
> No spaces after function names.
>
Yeah, will take that out from here and other files as well.
Thanks,
Manish
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-08 0:13 ` [PATCH 1/8] pseries: phyp dump: Docmentation Manish Ahuja
@ 2008-01-09 4:29 ` Nathan Lynch
2008-01-09 4:58 ` Michael Ellerman
2008-01-09 15:31 ` Linas Vepstas
0 siblings, 2 replies; 71+ messages in thread
From: Nathan Lynch @ 2008-01-09 4:29 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, lkessler, strosake
Manish Ahuja wrote:
> +
> + Hypervisor-Assisted Dump
> + ------------------------
> + November 2007
Date is unneeded (and, uhm, dated :)
> +The goal of hypervisor-assisted dump is to enable the dump of
> +a crashed system, and to do so from a fully-reset system, and
> +to minimize the total elapsed time until the system is back
> +in production use.
Is it actually faster than kdump?
> +As compared to kdump or other strategies, hypervisor-assisted
> +dump offers several strong, practical advantages:
> +
> +-- Unlike kdump, the system has been reset, and loaded
> + with a fresh copy of the kernel. In particular,
> + PCI and I/O devices have been reinitialized and are
> + in a clean, consistent state.
> +-- As the dump is performed, the dumped memory becomes
> + immediately available to the system for normal use.
> +-- After the dump is completed, no further reboots are
> + required; the system will be fully usable, and running
> + in it's normal, production mode on it normal kernel.
> +
> +The above can only be accomplished by coordination with,
> +and assistance from the hypervisor. The procedure is
> +as follows:
> +
> +-- When a system crashes, the hypervisor will save
> + the low 256MB of RAM to a previously registered
> + save region. It will also save system state, system
> + registers, and hardware PTE's.
> +
> +-- After the low 256MB area has been saved, the
> + hypervisor will reset PCI and other hardware state.
> + It will *not* clear RAM. It will then launch the
> + bootloader, as normal.
> +
> +-- The freshly booted kernel will notice that there
> + is a new node (ibm,dump-kernel) in the device tree,
> + indicating that there is crash data available from
> + a previous boot. It will boot into only 256MB of RAM,
> + reserving the rest of system memory.
> +
> +-- Userspace tools will parse /sys/kernel/release_region
> + and read /proc/vmcore to obtain the contents of memory,
> + which holds the previous crashed kernel. The userspace
> + tools may copy this info to disk, or network, nas, san,
> + iscsi, etc. as desired.
> +
> + For Example: the values in /sys/kernel/release-region
> + would look something like this (address-range pairs).
> + CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /
> + DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A
> +
> +-- As the userspace tools complete saving a portion of
> + dump, they echo an offset and size to
> + /sys/kernel/release_region to release the reserved
> + memory back to general use.
> +
> + An example of this is:
> + "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
> + which will release 256MB at the 1GB boundary.
This violates the "one file, one value" rule of sysfs, but nobody
really takes that seriously, I guess. In any case, consider
documenting this in Documentation/ABI.
> +
> +Please note that the hypervisor-assisted dump feature
> +is only available on Power6-based systems with recent
> +firmware versions.
This statement will of course become dated/incorrect so I recommend
removing it.
> +
> +Implementation details:
> +----------------------
> +In order for this scheme to work, memory needs to be reserved
> +quite early in the boot cycle. However, access to the device
> +tree this early in the boot cycle is difficult, and device-tree
> +access is needed to determine if there is a crash data waiting.
I don't think this bit about early device tree access is correct. By
the time your code is reserving memory (from early_init_devtree(), I
think), RTAS has been instantiated and you are able to test for the
existence of /rtas/ibm,dump-kernel.
> +To work around this problem, all but 256MB of RAM is reserved
> +during early boot. A short while later in boot, a check is made
> +to determine if there is dump data waiting. If there isn't,
> +then the reserved memory is released to general kernel use.
So I think these gymnastics are unneeded -- unless I'm
misunderstanding something, you should be able to determine very early
whether to reserve that memory.
> +If there is dump data, then the /sys/kernel/release_region
> +file is created, and the reserved memory is held.
> +
> +If there is no waiting dump data, then all but 256MB of the
> +reserved ram will be released for general kernel use. The
> +highest 256 MB of RAM will *not* be released: this region
> +will be kept permanently reserved, so that it can act as
> +a receptacle for a copy of the low 256MB in the case a crash
> +does occur. See, however, "open issues" below, as to whether
> +such a reserved region is really needed.
> +
> +Currently the dump will be copied from /proc/vmcore to a
> +a new file upon user intervention. The starting address
> +to be read and the range for each data point in provided
^is
> +in /sys/kernel/release_region.
> +
> +The tools to examine the dump will be same as the ones
> +used for kdump.
> +
> +
> +General notes:
> +--------------
> +Security: please note that there are potential security issues
> +with any sort of dump mechanism. In particular, plaintext
> +(unencrypted) data, and possibly passwords, may be present in
> +the dump data. Userspace tools must take adequate precautions to
> +preserve security.
> +
> +Open issues/ToDo:
> +------------
> + o The various code paths that tell the hypervisor that a crash
> + occurred, vs. it simply being a normal reboot, should be
> + reviewed, and possibly clarified/fixed.
> +
> + o Instead of using /sys/kernel, should there be a /sys/dump
> + instead? There is a dump_subsys being created by the s390 code,
> + perhaps the pseries code should use a similar layout as well.
Well, it seems to me that there's little reason to duplicate the s390
layout unless we can actually share code.
FWIW, I've been thinking about making a /sys/firmware/phyp hierarchy
which could contain much of the System P-specific functions (DLPAR,
lparcfg, other crud in /proc/ppc64)... seems suited to this
platform-specific dump mechanism.
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-09 4:29 ` Nathan Lynch
@ 2008-01-09 4:58 ` Michael Ellerman
2008-01-09 15:31 ` Linas Vepstas
1 sibling, 0 replies; 71+ messages in thread
From: Michael Ellerman @ 2008-01-09 4:58 UTC (permalink / raw)
To: Nathan Lynch; +Cc: lkessler, linuxppc-dev, mahuja, linasvepstas, strosake
[-- Attachment #1: Type: text/plain, Size: 4886 bytes --]
On Tue, 2008-01-08 at 22:29 -0600, Nathan Lynch wrote:
> Manish Ahuja wrote:
> > +
> > + Hypervisor-Assisted Dump
> > + ------------------------
> > + November 2007
>
> Date is unneeded (and, uhm, dated :)
>
>
> > +The goal of hypervisor-assisted dump is to enable the dump of
> > +a crashed system, and to do so from a fully-reset system, and
> > +to minimize the total elapsed time until the system is back
> > +in production use.
>
> Is it actually faster than kdump?
>
>
> > +As compared to kdump or other strategies, hypervisor-assisted
> > +dump offers several strong, practical advantages:
> > +
> > +-- Unlike kdump, the system has been reset, and loaded
> > + with a fresh copy of the kernel. In particular,
> > + PCI and I/O devices have been reinitialized and are
> > + in a clean, consistent state.
> > +-- As the dump is performed, the dumped memory becomes
> > + immediately available to the system for normal use.
> > +-- After the dump is completed, no further reboots are
> > + required; the system will be fully usable, and running
> > + in it's normal, production mode on it normal kernel.
> > +
> > +The above can only be accomplished by coordination with,
> > +and assistance from the hypervisor. The procedure is
> > +as follows:
> > +
> > +-- When a system crashes, the hypervisor will save
> > + the low 256MB of RAM to a previously registered
> > + save region. It will also save system state, system
> > + registers, and hardware PTE's.
> > +
> > +-- After the low 256MB area has been saved, the
> > + hypervisor will reset PCI and other hardware state.
> > + It will *not* clear RAM. It will then launch the
> > + bootloader, as normal.
> > +
> > +-- The freshly booted kernel will notice that there
> > + is a new node (ibm,dump-kernel) in the device tree,
> > + indicating that there is crash data available from
> > + a previous boot. It will boot into only 256MB of RAM,
> > + reserving the rest of system memory.
> > +
> > +-- Userspace tools will parse /sys/kernel/release_region
> > + and read /proc/vmcore to obtain the contents of memory,
> > + which holds the previous crashed kernel. The userspace
> > + tools may copy this info to disk, or network, nas, san,
> > + iscsi, etc. as desired.
> > +
> > + For Example: the values in /sys/kernel/release-region
> > + would look something like this (address-range pairs).
> > + CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /
> > + DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A
> > +
> > +-- As the userspace tools complete saving a portion of
> > + dump, they echo an offset and size to
> > + /sys/kernel/release_region to release the reserved
> > + memory back to general use.
> > +
> > + An example of this is:
> > + "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
> > + which will release 256MB at the 1GB boundary.
>
> This violates the "one file, one value" rule of sysfs, but nobody
> really takes that seriously, I guess. In any case, consider
> documenting this in Documentation/ABI.
>
>
> > +
> > +Please note that the hypervisor-assisted dump feature
> > +is only available on Power6-based systems with recent
> > +firmware versions.
>
> This statement will of course become dated/incorrect so I recommend
> removing it.
>
>
> > +
> > +Implementation details:
> > +----------------------
> > +In order for this scheme to work, memory needs to be reserved
> > +quite early in the boot cycle. However, access to the device
> > +tree this early in the boot cycle is difficult, and device-tree
> > +access is needed to determine if there is a crash data waiting.
>
> I don't think this bit about early device tree access is correct. By
> the time your code is reserving memory (from early_init_devtree(), I
> think), RTAS has been instantiated and you are able to test for the
> existence of /rtas/ibm,dump-kernel.
Yep it's early_init_devtree(), and yes it's fairly easy to access the
(flattened) device tree at that point.
> > +To work around this problem, all but 256MB of RAM is reserved
> > +during early boot. A short while later in boot, a check is made
> > +to determine if there is dump data waiting. If there isn't,
> > +then the reserved memory is released to general kernel use.
>
> So I think these gymnastics are unneeded -- unless I'm
> misunderstanding something, you should be able to determine very early
> whether to reserve that memory.
I agree.
cheers
--
Michael Ellerman
OzLabs, IBM Australia Development Lab
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-09 4:29 ` Nathan Lynch
2008-01-09 4:58 ` Michael Ellerman
@ 2008-01-09 15:31 ` Linas Vepstas
2008-01-09 18:44 ` Nathan Lynch
1 sibling, 1 reply; 71+ messages in thread
From: Linas Vepstas @ 2008-01-09 15:31 UTC (permalink / raw)
To: Nathan Lynch; +Cc: mahuja, linuxppc-dev, lkessler, strosake
Hi Nathan,
Thank you for looking at this.
On 08/01/2008, Nathan Lynch <ntl@pobox.com> wrote:
> Manish Ahuja wrote:
> > +
> > + Hypervisor-Assisted Dump
> > + ------------------------
> > + November 2007
>
> Date is unneeded (and, uhm, dated :)
I'll argue very very strongly against this. One of the worst things
I deal with is undated documents. Google pulls up multiple
versions of a document, and you start wondering: "which one
of these is the *right one*?" Its only after trying some suggested
sequence of commands that aren't working, and more time googling
and effort wasted, that you finallly realize that you are reading
the version 0.1 of documentation from 1999 that some idiot
did not have the decency to put a date on. (A version number
is acceptable). Documentation is often in mediocre state,
don't compound the problem be leaving off critical info.
> > +The goal of hypervisor-assisted dump is to enable the dump of
> > +a crashed system, and to do so from a fully-reset system, and
> > +to minimize the total elapsed time until the system is back
> > +in production use.
>
> Is it actually faster than kdump?
This is a basic presumption; it should blow it away, as, after all,
it requires one less reboot! As a side effect, the system is in
production *while* the dump is being taken; with kdump,
you can't go into production until after the dump is finished,
and the system has been rebooted a second time. On
systems with terabytes of RAM, the time difference can be
hours.
> > +-- As the userspace tools complete saving a portion of
> > + dump, they echo an offset and size to
> > + /sys/kernel/release_region to release the reserved
> > + memory back to general use.
> > +
> > + An example of this is:
> > + "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
> > + which will release 256MB at the 1GB boundary.
>
> This violates the "one file, one value" rule of sysfs, but nobody
> really takes that seriously, I guess. In any case, consider
> documenting this in Documentation/ABI.
This interface will almost certainly be changed, in order to allow
phyp dump to work with the kdump user-space tools. Its provisional
right now, until the details of user-space is hammered out.
> > +Please note that the hypervisor-assisted dump feature
> > +is only available on Power6-based systems with recent
> > +firmware versions.
>
> This statement will of course become dated/incorrect so I recommend
> removing it.
A provisional statement. I figure it should get taken out after
a few years.
> > +Implementation details:
> > +----------------------
> > +In order for this scheme to work, memory needs to be reserved
> > +quite early in the boot cycle. However, access to the device
> > +tree this early in the boot cycle is difficult, and device-tree
> > +access is needed to determine if there is a crash data waiting.
>
> I don't think this bit about early device tree access is correct. By
> the time your code is reserving memory (from early_init_devtree(), I
> think), RTAS has been instantiated and you are able to test for the
> existence of /rtas/ibm,dump-kernel.
If I remember right, it was still too early to look up this token directly,
so we wrote some code to crawl the flat device tree to find it. But
not only was that a lot of work, but I somehow decided that doing this
to the flat tree was wrong, as otherwise someone would surely have
written the access code. If this can be made to work, that would be
great, but we couldn't make it work at the time.
> > +To work around this problem, all but 256MB of RAM is reserved
> > +during early boot. A short while later in boot, a check is made
> > +to determine if there is dump data waiting. If there isn't,
> > +then the reserved memory is released to general kernel use.
>
> So I think these gymnastics are unneeded -- unless I'm
> misunderstanding something, you should be able to determine very early
> whether to reserve that memory.
Only if you can get at rtas, but you can't get at rtas at that point.
I think we would have been the earliest users of rtas at that point.
Possibly the order of how things are initialized could be changed.
> > +Open issues/ToDo:
> > +------------
> > + o The various code paths that tell the hypervisor that a crash
> > + occurred, vs. it simply being a normal reboot, should be
> > + reviewed, and possibly clarified/fixed.
> > +
> > + o Instead of using /sys/kernel, should there be a /sys/dump
> > + instead? There is a dump_subsys being created by the s390 code,
> > + perhaps the pseries code should use a similar layout as well.
>
> Well, it seems to me that there's little reason to duplicate the s390
> layout unless we can actually share code.
I'm thinking "user-space tools". This could probably be decided very
quickly after a short chat with the s390 architects about tools and
standardization and dump formats and the support process. In the
end, its some third-level support people who will be dissecting the
dump using god-knows-what tool (a hacked kdgb ???), so things
should get standardized as much as possible so that we don't
duplicate work.
> FWIW, I've been thinking about making a /sys/firmware/phyp hierarchy
> which could contain much of the System P-specific functions (DLPAR,
> lparcfg, other crud in /proc/ppc64)... seems suited to this
> platform-specific dump mechanism.
Works for me. I was always unclear what /sys/firmware was supposed
to contain.
--linas
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-09 15:31 ` Linas Vepstas
@ 2008-01-09 18:44 ` Nathan Lynch
2008-01-09 19:28 ` Manish Ahuja
` (2 more replies)
0 siblings, 3 replies; 71+ messages in thread
From: Nathan Lynch @ 2008-01-09 18:44 UTC (permalink / raw)
To: Linas Vepstas; +Cc: mahuja, linuxppc-dev, lkessler, strosake
Hi Linas,
Linas Vepstas wrote:
>
> On 08/01/2008, Nathan Lynch <ntl@pobox.com> wrote:
> > Manish Ahuja wrote:
> > > +
> > > +The goal of hypervisor-assisted dump is to enable the dump of
> > > +a crashed system, and to do so from a fully-reset system, and
> > > +to minimize the total elapsed time until the system is back
> > > +in production use.
> >
> > Is it actually faster than kdump?
>
> This is a basic presumption;
I used the word "actually". I already know that it is intended to be
faster. :)
> it should blow it away, as, after all,
> it requires one less reboot!
There's more than rebooting going on during system dump processing.
Depending on the system type, booting may not be where most time is
spent.
> As a side effect, the system is in
> production *while* the dump is being taken;
A dubious feature IMO. Seems that the design potentially trades
reliability of first failure data capture for availability.
E.g. system crashes, reboots, resumes processing while copying dump,
crashes again before dump procedure is complete. How is that handled,
if at all?
> with kdump,
> you can't go into production until after the dump is finished,
> and the system has been rebooted a second time. On
> systems with terabytes of RAM, the time difference can be
> hours.
The difference in time it takes to resume the normal workload may be
significant, yes. But the time it takes to get a usable dump image
would seem to be the basically the same.
Since you bring up large systems... a system with terabytes of RAM is
practically guaranteed to be a NUMA configuration with dozens of cpus.
When processing a dump on such a system, I wonder how well we fare:
can we successfully boot with (say) 128 cpus and 256MB of usable
memory? Do we have to hot-online nodes as system memory is freed up
(and does that even work)? We need to be able to restore the system
to its optimal topology when the dump is finished; if the best we can
do is a degraded configuration, the workload will suffer and the
system admin is likely to just reboot the machine again so the kernel
will have the right NUMA topology.
> > > +Implementation details:
> > > +----------------------
> > > +In order for this scheme to work, memory needs to be reserved
> > > +quite early in the boot cycle. However, access to the device
> > > +tree this early in the boot cycle is difficult, and device-tree
> > > +access is needed to determine if there is a crash data waiting.
> >
> > I don't think this bit about early device tree access is correct. By
> > the time your code is reserving memory (from early_init_devtree(), I
> > think), RTAS has been instantiated and you are able to test for the
> > existence of /rtas/ibm,dump-kernel.
>
> If I remember right, it was still too early to look up this token directly,
> so we wrote some code to crawl the flat device tree to find it. But
> not only was that a lot of work, but I somehow decided that doing this
> to the flat tree was wrong, as otherwise someone would surely have
> written the access code. If this can be made to work, that would be
> great, but we couldn't make it work at the time.
>
> > > +To work around this problem, all but 256MB of RAM is reserved
> > > +during early boot. A short while later in boot, a check is made
> > > +to determine if there is dump data waiting. If there isn't,
> > > +then the reserved memory is released to general kernel use.
> >
> > So I think these gymnastics are unneeded -- unless I'm
> > misunderstanding something, you should be able to determine very early
> > whether to reserve that memory.
>
> Only if you can get at rtas, but you can't get at rtas at that point.
Sorry, but I think you are mistaken (see Michael's earlier reply).
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-09 18:44 ` Nathan Lynch
@ 2008-01-09 19:28 ` Manish Ahuja
2008-01-09 22:59 ` Michael Ellerman
2008-01-10 2:33 ` Linas Vepstas
2 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-01-09 19:28 UTC (permalink / raw)
To: Nathan Lynch; +Cc: mahuja, linuxppc-dev, Linas Vepstas, lkessler, strosake
>
> I used the word "actually". I already know that it is intended to be
> faster. :)
>
>> it should blow it away, as, after all,
>> it requires one less reboot!
>
> There's more than rebooting going on during system dump processing.
> Depending on the system type, booting may not be where most time is
> spent.
>
>
>> As a side effect, the system is in
>> production *while* the dump is being taken;
>
> A dubious feature IMO. Seems that the design potentially trades
> reliability of first failure data capture for availability.
> E.g. system crashes, reboots, resumes processing while copying dump,
> crashes again before dump procedure is complete. How is that handled,
> if at all?
This is a simple version. The intent was not to have a complex dump taking
mechanism in version 1. Subsequent versions will see planned improvement
on the way the pages are tracked and freed.
Also it is very easily possible now, to register for another dump as soon as the
scratch area is copied to a user designated region. But for now this simple implementation
exists.
It is also possible to extend this further to only preserve pages that are
kernel pages and free the non required pages like user/data pages etc. This
would reduce the space preserved and would prevent any issues that are
caused by reserving everything in memory except for the first 256 MB.
Improvements and future versions are planned to make this efficient. But for
now the intent is to get this off the ground and handle simple cases.
>
>
>> with kdump,
>> you can't go into production until after the dump is finished,
>> and the system has been rebooted a second time. On
>> systems with terabytes of RAM, the time difference can be
>> hours.
>
> The difference in time it takes to resume the normal workload may be
> significant, yes. But the time it takes to get a usable dump image
> would seem to be the basically the same.
>
> Since you bring up large systems... a system with terabytes of RAM is
> practically guaranteed to be a NUMA configuration with dozens of cpus.
> When processing a dump on such a system, I wonder how well we fare:
> can we successfully boot with (say) 128 cpus and 256MB of usable
> memory? Do we have to hot-online nodes as system memory is freed up
> (and does that even work)? We need to be able to restore the system
> to its optimal topology when the dump is finished; if the best we can
> do is a degraded configuration, the workload will suffer and the
> system admin is likely to just reboot the machine again so the kernel
> will have the right NUMA topology.
>
>
>>>> +Implementation details:
>>>> +----------------------
>>>> +In order for this scheme to work, memory needs to be reserved
>>>> +quite early in the boot cycle. However, access to the device
>>>> +tree this early in the boot cycle is difficult, and device-tree
>>>> +access is needed to determine if there is a crash data waiting.
>>> I don't think this bit about early device tree access is correct. By
>>> the time your code is reserving memory (from early_init_devtree(), I
>>> think), RTAS has been instantiated and you are able to test for the
>>> existence of /rtas/ibm,dump-kernel.
>> If I remember right, it was still too early to look up this token directly,
>> so we wrote some code to crawl the flat device tree to find it. But
>> not only was that a lot of work, but I somehow decided that doing this
>> to the flat tree was wrong, as otherwise someone would surely have
>> written the access code. If this can be made to work, that would be
>> great, but we couldn't make it work at the time.
>>
>>>> +To work around this problem, all but 256MB of RAM is reserved
>>>> +during early boot. A short while later in boot, a check is made
>>>> +to determine if there is dump data waiting. If there isn't,
>>>> +then the reserved memory is released to general kernel use.
>>> So I think these gymnastics are unneeded -- unless I'm
>>> misunderstanding something, you should be able to determine very early
>>> whether to reserve that memory.
>> Only if you can get at rtas, but you can't get at rtas at that point.
>
> Sorry, but I think you are mistaken (see Michael's earlier reply).
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@ozlabs.org
> https://ozlabs.org/mailman/listinfo/linuxppc-dev
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-09 18:44 ` Nathan Lynch
2008-01-09 19:28 ` Manish Ahuja
@ 2008-01-09 22:59 ` Michael Ellerman
2008-01-09 23:18 ` Manish Ahuja
2008-01-10 2:47 ` Linas Vepstas
2008-01-10 2:33 ` Linas Vepstas
2 siblings, 2 replies; 71+ messages in thread
From: Michael Ellerman @ 2008-01-09 22:59 UTC (permalink / raw)
To: Nathan Lynch; +Cc: mahuja, linuxppc-dev, Linas Vepstas, lkessler, strosake
[-- Attachment #1: Type: text/plain, Size: 4146 bytes --]
On Wed, 2008-01-09 at 12:44 -0600, Nathan Lynch wrote:
> Hi Linas,
>
> Linas Vepstas wrote:
> >
> > On 08/01/2008, Nathan Lynch <ntl@pobox.com> wrote:
> > > Manish Ahuja wrote:
> > > > +
> > > > +The goal of hypervisor-assisted dump is to enable the dump of
> > > > +a crashed system, and to do so from a fully-reset system, and
> > > > +to minimize the total elapsed time until the system is back
> > > > +in production use.
> > >
> > > Is it actually faster than kdump?
> >
> > This is a basic presumption;
>
> > As a side effect, the system is in
> > production *while* the dump is being taken;
It's in "production" with 256MB of RAM? Err. Sure as the dump progresses
more RAM will be freed, but that's hardly production. I think Nathan's
right, any sysadmin who wants predictability will probably double reboot
anyway.
> > with kdump,
> > you can't go into production until after the dump is finished,
> > and the system has been rebooted a second time. On
> > systems with terabytes of RAM, the time difference can be
> > hours.
> Since you bring up large systems... a system with terabytes of RAM is
> practically guaranteed to be a NUMA configuration with dozens of cpus.
> When processing a dump on such a system, I wonder how well we fare:
> can we successfully boot with (say) 128 cpus and 256MB of usable
> memory? Do we have to hot-online nodes as system memory is freed up
> (and does that even work)? We need to be able to restore the system
> to its optimal topology when the dump is finished; if the best we can
> do is a degraded configuration, the workload will suffer and the
> system admin is likely to just reboot the machine again so the kernel
> will have the right NUMA topology.
Yeah that's a good question. Even if the hot-onlining works, there's
still kernel data structures allocated at boot which want to be
node-local. So the end result will be != a "production" boot.
> > > > +Implementation details:
> > > > +----------------------
> > > > +In order for this scheme to work, memory needs to be reserved
> > > > +quite early in the boot cycle. However, access to the device
> > > > +tree this early in the boot cycle is difficult, and device-tree
> > > > +access is needed to determine if there is a crash data waiting.
> > >
> > > I don't think this bit about early device tree access is correct. By
> > > the time your code is reserving memory (from early_init_devtree(), I
> > > think), RTAS has been instantiated and you are able to test for the
> > > existence of /rtas/ibm,dump-kernel.
> >
> > If I remember right, it was still too early to look up this token directly,
> > so we wrote some code to crawl the flat device tree to find it. But
> > not only was that a lot of work, but I somehow decided that doing this
> > to the flat tree was wrong, as otherwise someone would surely have
> > written the access code. If this can be made to work, that would be
> > great, but we couldn't make it work at the time.
> >
> > > > +To work around this problem, all but 256MB of RAM is reserved
> > > > +during early boot. A short while later in boot, a check is made
> > > > +to determine if there is dump data waiting. If there isn't,
> > > > +then the reserved memory is released to general kernel use.
> > >
> > > So I think these gymnastics are unneeded -- unless I'm
> > > misunderstanding something, you should be able to determine very early
> > > whether to reserve that memory.
> >
> > Only if you can get at rtas, but you can't get at rtas at that point.
AFAICT you don't need to get at RTAS, you just need to look at the
device tree to see if the property is present, and that is trivial.
You probably just need to add a check in early_init_dt_scan_rtas() which
sets a flag for the PHYP dump stuff, or add your own scan routine if you
need.
cheers
--
Michael Ellerman
OzLabs, IBM Australia Development Lab
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-09 22:59 ` Michael Ellerman
@ 2008-01-09 23:18 ` Manish Ahuja
2008-01-10 2:47 ` Linas Vepstas
1 sibling, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-01-09 23:18 UTC (permalink / raw)
To: michael
Cc: lkessler, linuxppc-dev, Nathan Lynch, mahuja, Linas Vepstas,
strosake
> It's in "production" with 256MB of RAM? Err. Sure as the dump progresses
> more RAM will be freed, but that's hardly production. I think Nathan's
> right, any sysadmin who wants predictability will probably double reboot
> anyway.
Thats a changeable parameter. Its something we chose for now. It by no means
is set in stone. Its not a design parameter. If you like to allocate 1GB we can.
But that is something we did for now. we expect this to be a variable value
dependent upon the size of the system. So if you have 128 GB system and you
can spare 10 gb, you should be able to have 10 GB to boot with.
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-09 18:44 ` Nathan Lynch
2008-01-09 19:28 ` Manish Ahuja
2008-01-09 22:59 ` Michael Ellerman
@ 2008-01-10 2:33 ` Linas Vepstas
2008-01-10 3:17 ` Olof Johansson
2 siblings, 1 reply; 71+ messages in thread
From: Linas Vepstas @ 2008-01-10 2:33 UTC (permalink / raw)
To: Nathan Lynch; +Cc: mahuja, linuxppc-dev, lkessler, strosake
On 09/01/2008, Nathan Lynch <ntl@pobox.com> wrote:
> Hi Linas,
>
> Linas Vepstas wrote:
> >
> > As a side effect, the system is in
> > production *while* the dump is being taken;
>
> A dubious feature IMO.
Hmm. Take it up with Ken Rozendal, this is supposed to be
one of the two main selling points of this thing.
> Seems that the design potentially trades
> reliability of first failure data capture for availability.
> E.g. system crashes, reboots, resumes processing while copying dump,
> crashes again before dump procedure is complete. How is that handled,
> if at all?
Its handled by the hypervisor. phyp maintains the copy of the
RMO of first crash, until such time that the OS declares the
dump of the RMO to be complete. So you'll always have
the RMO of the first crash.
For the rest of RAM, it will come in two parts: some portion
will have been dumped already. The rest has not yet been dumped,
and it will still be there, preserved across the second crash.
So you get both RMO and all of RAM from the first crash.
> > with kdump,
> > you can't go into production until after the dump is finished,
> > and the system has been rebooted a second time. On
> > systems with terabytes of RAM, the time difference can be
> > hours.
>
> The difference in time it takes to resume the normal workload may be
> significant, yes. But the time it takes to get a usable dump image
> would seem to be the basically the same.
Yes.
> Since you bring up large systems... a system with terabytes of RAM is
> practically guaranteed to be a NUMA configuration with dozens of cpus.
> When processing a dump on such a system, I wonder how well we fare:
> can we successfully boot with (say) 128 cpus and 256MB of usable
> memory? Do we have to hot-online nodes as system memory is freed up
> (and does that even work)? We need to be able to restore the system
> to its optimal topology when the dump is finished; if the best we can
> do is a degraded configuration, the workload will suffer and the
> system admin is likely to just reboot the machine again so the kernel
> will have the right NUMA topology.
Heh. That's the elbow-grease of this thing. The easy part is to get
the core function working. The hard part is to test these various configs,
and when they don't work, figure out what went wrong. That will take
perseverence and brains.
--linas
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-09 22:59 ` Michael Ellerman
2008-01-09 23:18 ` Manish Ahuja
@ 2008-01-10 2:47 ` Linas Vepstas
2008-01-10 3:55 ` Michael Ellerman
1 sibling, 1 reply; 71+ messages in thread
From: Linas Vepstas @ 2008-01-10 2:47 UTC (permalink / raw)
To: michael; +Cc: mahuja, linuxppc-dev, lkessler, Nathan Lynch, strosake
On 09/01/2008, Michael Ellerman <michael@ellerman.id.au> wrote:
>
> > > Only if you can get at rtas, but you can't get at rtas at that point.
>
> AFAICT you don't need to get at RTAS, you just need to look at the
> device tree to see if the property is present, and that is trivial.
>
> You probably just need to add a check in early_init_dt_scan_rtas() which
> sets a flag for the PHYP dump stuff, or add your own scan routine if you
> need.
I no longer remember the details. I do remember spending a lot of time
trying to figure out how to do this. I know I didn't want to write my own scan
routine; maybe that's what stopped me. As it happens, we also did most
of the development on a broken phyp which simply did not even have
this property, no matter what, and so that may have brain-damaged me.
I went for the "most elegant" solution, where "most elegant" is defined
as "fewest lines of code", "least effort", etc.
Manish may need some hands-on help to extract this token during
early boot. Hopefully, he'll let us know.
--linas
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-10 2:33 ` Linas Vepstas
@ 2008-01-10 3:17 ` Olof Johansson
2008-01-10 4:12 ` Linas Vepstas
0 siblings, 1 reply; 71+ messages in thread
From: Olof Johansson @ 2008-01-10 3:17 UTC (permalink / raw)
To: Linas Vepstas; +Cc: mahuja, linuxppc-dev, lkessler, Nathan Lynch, strosake
On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> Heh. That's the elbow-grease of this thing. The easy part is to get
> the core function working. The hard part is to test these various configs,
> and when they don't work, figure out what went wrong. That will take
> perseverence and brains.
This just sounds like a whole lot of extra work to get a feature that
already exists. Also, features like these seem to just get tested when the
next enterprise distro is released, so they're broken for long stretches
of time in mainline.
There's a bunch of problems like the NUMA ones, which would by far be
easiest to solve by just doing another reboot or kexec, wouldn't they?
-Olof
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-10 2:47 ` Linas Vepstas
@ 2008-01-10 3:55 ` Michael Ellerman
0 siblings, 0 replies; 71+ messages in thread
From: Michael Ellerman @ 2008-01-10 3:55 UTC (permalink / raw)
To: linasvepstas; +Cc: mahuja, linuxppc-dev, lkessler, Nathan Lynch, strosake
[-- Attachment #1: Type: text/plain, Size: 3344 bytes --]
On Wed, 2008-01-09 at 20:47 -0600, Linas Vepstas wrote:
> On 09/01/2008, Michael Ellerman <michael@ellerman.id.au> wrote:
> >
> > > > Only if you can get at rtas, but you can't get at rtas at that point.
> >
> > AFAICT you don't need to get at RTAS, you just need to look at the
> > device tree to see if the property is present, and that is trivial.
> >
> > You probably just need to add a check in early_init_dt_scan_rtas() which
> > sets a flag for the PHYP dump stuff, or add your own scan routine if you
> > need.
>
> I no longer remember the details. I do remember spending a lot of time
> trying to figure out how to do this. I know I didn't want to write my own scan
> routine; maybe that's what stopped me. As it happens, we also did most
> of the development on a broken phyp which simply did not even have
> this property, no matter what, and so that may have brain-damaged me.
Sure, the API docs for the kernel are a little lacking ;)
> I went for the "most elegant" solution, where "most elegant" is defined
> as "fewest lines of code", "least effort", etc.
>
> Manish may need some hands-on help to extract this token during
> early boot. Hopefully, he'll let us know.
It would just be something like:
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -901,6 +901,11 @@ int __init early_init_dt_scan_rtas(unsigned long node,
rtas.size = *sizep;
}
+#ifdef CONFIG_PHYP_DUMP
+ if (of_get_flat_dt_prop(node, "ibm,dump-kernel", NULL))
+ phyp_dump_is_active++;
+#endif
+
#ifdef CONFIG_UDBG_RTAS_CONSOLE
basep = of_get_flat_dt_prop(node, "put-term-char", NULL);
if (basep)
Or to do your own scan routine:
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index acc0d24..442134e 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -1022,6 +1022,7 @@ void __init early_init_devtree(void *params)
/* Some machines might need RTAS info for debugging, grab it now. */
of_scan_flat_dt(early_init_dt_scan_rtas, NULL);
#endif
+ of_scan_flat_dt(early_init_dt_scan_phyp_dump, NULL);
/* Retrieve various informations from the /chosen node of the
* device-tree, including the platform type, initrd location and
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 52e95c2..af2b6e8 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -883,6 +883,19 @@ void __init rtas_initialize(void)
#endif
}
+int __init early_init_dt_scan_phyp_dump(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+#ifdef CONFIG_PHYP_DUMP
+ if (depth != 1 || strcmp(uname, "rtas") != 0)
+ return 0;
+
+ if (of_get_flat_dt_prop(node, "ibm,dump-kernel", NULL))
+ phyp_dump_is_active++;
+#endif
+ return 1;
+}
+
int __init early_init_dt_scan_rtas(unsigned long node,
const char *uname, int depth, void *data)
{
cheers
--
Michael Ellerman
OzLabs, IBM Australia Development Lab
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-10 3:17 ` Olof Johansson
@ 2008-01-10 4:12 ` Linas Vepstas
2008-01-10 4:52 ` Michael Ellerman
2008-01-10 16:21 ` Olof Johansson
0 siblings, 2 replies; 71+ messages in thread
From: Linas Vepstas @ 2008-01-10 4:12 UTC (permalink / raw)
To: Olof Johansson; +Cc: mahuja, linuxppc-dev, lkessler, Nathan Lynch, strosake
On 09/01/2008, Olof Johansson <olof@lixom.net> wrote:
> On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
>
> > Heh. That's the elbow-grease of this thing. The easy part is to get
> > the core function working. The hard part is to test these various configs,
> > and when they don't work, figure out what went wrong. That will take
> > perseverence and brains.
>
> This just sounds like a whole lot of extra work to get a feature that
> already exists.
Well, no. kexec is horribly ill-behaved with respect to PCI. The
kexec kernel starts running with PCI devices in some random
state; maybe they're DMA'ing or who knows what. kexec tries
real hard to whack a few needed pci devices into submission
but it has been hit-n-miss, and the source of 90% of the kexec
headaches and debugging effort. Its not pretty.
If all pci-host bridges could shut-down or settle the bus, and
raise the #RST line high, and then if all BIOS'es supported
this, you'd be right. But they can't ....
--linas
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-10 4:12 ` Linas Vepstas
@ 2008-01-10 4:52 ` Michael Ellerman
2008-01-10 16:21 ` Olof Johansson
1 sibling, 0 replies; 71+ messages in thread
From: Michael Ellerman @ 2008-01-10 4:52 UTC (permalink / raw)
To: linasvepstas
Cc: linuxppc-dev, lkessler, mahuja, Nathan Lynch, Olof Johansson,
strosake
[-- Attachment #1: Type: text/plain, Size: 1242 bytes --]
On Wed, 2008-01-09 at 22:12 -0600, Linas Vepstas wrote:
> On 09/01/2008, Olof Johansson <olof@lixom.net> wrote:
> > On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> >
> > > Heh. That's the elbow-grease of this thing. The easy part is to get
> > > the core function working. The hard part is to test these various configs,
> > > and when they don't work, figure out what went wrong. That will take
> > > perseverence and brains.
> >
> > This just sounds like a whole lot of extra work to get a feature that
> > already exists.
>
> Well, no. kexec is horribly ill-behaved with respect to PCI. The
> kexec kernel starts running with PCI devices in some random
> state; maybe they're DMA'ing or who knows what. kexec tries
> real hard to whack a few needed pci devices into submission
> but it has been hit-n-miss, and the source of 90% of the kexec
> headaches and debugging effort. Its not pretty.
Isn't that what EEH and the IOMMU are for? :)
cheers
--
Michael Ellerman
OzLabs, IBM Australia Development Lab
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-10 4:12 ` Linas Vepstas
2008-01-10 4:52 ` Michael Ellerman
@ 2008-01-10 16:21 ` Olof Johansson
2008-01-10 16:34 ` Linas Vepstas
1 sibling, 1 reply; 71+ messages in thread
From: Olof Johansson @ 2008-01-10 16:21 UTC (permalink / raw)
To: Linas Vepstas; +Cc: mahuja, linuxppc-dev, lkessler, Nathan Lynch, strosake
On Wed, Jan 09, 2008 at 10:12:13PM -0600, Linas Vepstas wrote:
> On 09/01/2008, Olof Johansson <olof@lixom.net> wrote:
> > On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> >
> > > Heh. That's the elbow-grease of this thing. The easy part is to get
> > > the core function working. The hard part is to test these various configs,
> > > and when they don't work, figure out what went wrong. That will take
> > > perseverence and brains.
> >
> > This just sounds like a whole lot of extra work to get a feature that
> > already exists.
>
> Well, no. kexec is horribly ill-behaved with respect to PCI. The
> kexec kernel starts running with PCI devices in some random
> state; maybe they're DMA'ing or who knows what. kexec tries
> real hard to whack a few needed pci devices into submission
> but it has been hit-n-miss, and the source of 90% of the kexec
> headaches and debugging effort. Its not pretty.
It surprises me that this hasn't been possible to resolve with less than
architecting a completely new interface, given that the platform has
all this fancy support for isolating and resetting adapters. After all,
the exact same thing has to be done by the hypervisor before rebooting
the partition.
> If all pci-host bridges could shut-down or settle the bus, and
> raise the #RST line high, and then if all BIOS'es supported
> this, you'd be right. But they can't ....
This argument doesn't hold. We're not talking about some generic PC with
a crappy BIOS here, we're specifically talking about POWER6 PHYP. It
certainly already has ways to reset adapters in it, or EEH wouldn't
work. Actually, the whole phyp dump feature wouldn't work either, since
it's exactly what the firmware has to do under the covers as well.
-Olof
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-10 16:21 ` Olof Johansson
@ 2008-01-10 16:34 ` Linas Vepstas
2008-01-10 21:46 ` Mike Strosaker
0 siblings, 1 reply; 71+ messages in thread
From: Linas Vepstas @ 2008-01-10 16:34 UTC (permalink / raw)
To: Olof Johansson; +Cc: mahuja, linuxppc-dev, lkessler, Nathan Lynch, strosake
On 10/01/2008, Olof Johansson <olof@lixom.net> wrote:
> On Wed, Jan 09, 2008 at 10:12:13PM -0600, Linas Vepstas wrote:
> > On 09/01/2008, Olof Johansson <olof@lixom.net> wrote:
> > > On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> > >
> > > > Heh. That's the elbow-grease of this thing. The easy part is to get
> > > > the core function working. The hard part is to test these various configs,
> > > > and when they don't work, figure out what went wrong. That will take
> > > > perseverence and brains.
> > >
> > > This just sounds like a whole lot of extra work to get a feature that
> > > already exists.
> >
> > Well, no. kexec is horribly ill-behaved with respect to PCI. The
> > kexec kernel starts running with PCI devices in some random
> > state; maybe they're DMA'ing or who knows what. kexec tries
> > real hard to whack a few needed pci devices into submission
> > but it has been hit-n-miss, and the source of 90% of the kexec
> > headaches and debugging effort. Its not pretty.
>
> It surprises me that this hasn't been possible to resolve with less than
> architecting a completely new interface, given that the platform has
> all this fancy support for isolating and resetting adapters. After all,
> the exact same thing has to be done by the hypervisor before rebooting
> the partition.
OK, point taken.
-- The phyp interfaces are there for AIX, which I guess must
not have kexec-like ability. So this is a case of Linux leveraging
a feature architected for AIX.
-- There's also this idea, somewhat weak, that the crash may
have corrupted the ram where the kexec kernel sits.
For someone who is used to seeing crashes due to
null pointer deref's, this seems fairly unlikely. But perhaps
crashes in production systems are more mind-bending.
(we did have a case where a USB stick used for boot
continued to scribble on memory long after it was
supposed to be quiet and unused. This resulted in
a very hard to debug crash.)
A solution to a corrupted
kexec kernel would be to disable memory access to
where kexec sits, e.g un-mapping or making r/o the
pages where it lies. This begs the questions of "who
unhides the kexec kernel", and "what if this 'who' gets
corrupted"?
In short, the kexec kernel does not boot
exactly the same as a cold boot, and so this opens
a can of worms about "well, what's different, how do
we minimize these differences, etc." and I think that
lead AIX to punt, and say "lets just use one single,
well-known boot loader/ boot sequence instead of
inventing a new one", thus leading to the phyp design.
But that's just my guess.. :-)
--linas
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-10 16:34 ` Linas Vepstas
@ 2008-01-10 21:46 ` Mike Strosaker
2008-01-11 1:26 ` Nathan Lynch
0 siblings, 1 reply; 71+ messages in thread
From: Mike Strosaker @ 2008-01-10 21:46 UTC (permalink / raw)
To: linasvepstas
Cc: linuxppc-dev, lkessler, mahuja, Nathan Lynch, Olof Johansson,
strosake
Linas Vepstas wrote:
> On 10/01/2008, Olof Johansson <olof@lixom.net> wrote:
>
>>On Wed, Jan 09, 2008 at 10:12:13PM -0600, Linas Vepstas wrote:
>>
>>>On 09/01/2008, Olof Johansson <olof@lixom.net> wrote:
>>>
>>>>On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
>>>>
>>>>
>>>>>Heh. That's the elbow-grease of this thing. The easy part is to get
>>>>>the core function working. The hard part is to test these various configs,
>>>>>and when they don't work, figure out what went wrong. That will take
>>>>>perseverence and brains.
>>>>
>>>>This just sounds like a whole lot of extra work to get a feature that
>>>>already exists.
>>>
>>>Well, no. kexec is horribly ill-behaved with respect to PCI. The
>>>kexec kernel starts running with PCI devices in some random
>>>state; maybe they're DMA'ing or who knows what. kexec tries
>>>real hard to whack a few needed pci devices into submission
>>>but it has been hit-n-miss, and the source of 90% of the kexec
>>>headaches and debugging effort. Its not pretty.
>>
>>It surprises me that this hasn't been possible to resolve with less than
>>architecting a completely new interface, given that the platform has
>>all this fancy support for isolating and resetting adapters. After all,
>>the exact same thing has to be done by the hypervisor before rebooting
>>the partition.
>
>
> OK, point taken.
>
> -- The phyp interfaces are there for AIX, which I guess must
> not have kexec-like ability. So this is a case of Linux leveraging
> a feature architected for AIX.
Certainly AIX was in a more difficult position at the time, because they don't
have a kexec equivalent, and thus were collecting dump data with a potentially
faulty kernel. It makes sense to have something outside the partition collect or
maintain the data; ideally, some kind of service partition would extract dump
data from a failed partition, but giving one partition total access to the memory
of another is clearly risky. Both the PHYP-assistance method and the kexec
method are ways to simulate that without the risk.
At the risk of repeating what others have already said, the PHYP-assistance
method provides some advantages that the kexec method cannot:
- Availability of the system for production use before the dump data is
collected. As was mentioned before, some production systems may choose not to
operate with the limited memory initially available after the reboot, but it sure
is nice to provide the option.
- Ensuring that the devices are in a good state. PHYP doesn't expose a method
to force adapters into a frozen state, (which I agree would be useful), and I
don't know of any plans to do so. What we are starting to see is that some
drivers need modifications in order to work correctly with kdump [1]. Supporting
PHYP-assisted dump would eliminate those issues.
- The small possibility that the kexec area could have been munged by the
failing kernel, preventing it from being able to collect a dump.
The NUMA issues are daunting, but not insurmountable. Release early, release
often, n'est-ce pas?
Mike
[1] http://ozlabs.org/pipermail/linuxppc-dev/2007-November/045663.html
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-10 21:46 ` Mike Strosaker
@ 2008-01-11 1:26 ` Nathan Lynch
2008-01-11 16:57 ` Linas Vepstas
0 siblings, 1 reply; 71+ messages in thread
From: Nathan Lynch @ 2008-01-11 1:26 UTC (permalink / raw)
To: Mike Strosaker
Cc: linuxppc-dev, lkessler, mahuja, Olof Johansson, linasvepstas,
strosake
Mike Strosaker wrote:
>
> At the risk of repeating what others have already said, the PHYP-assistance
> method provides some advantages that the kexec method cannot:
> - Availability of the system for production use before the dump data is
> collected. As was mentioned before, some production systems may choose not
> to operate with the limited memory initially available after the reboot,
> but it sure is nice to provide the option.
I'm more concerned that this design encourages the user to resume a
workload *which is almost certainly known to result in a system crash*
before collection of crash data is complete. Maybe the gamble will
pay off most of the time, but I wouldn't want to be working support
when it doesn't.
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-11 1:26 ` Nathan Lynch
@ 2008-01-11 16:57 ` Linas Vepstas
2008-01-14 5:24 ` Olof Johansson
0 siblings, 1 reply; 71+ messages in thread
From: Linas Vepstas @ 2008-01-11 16:57 UTC (permalink / raw)
To: Nathan Lynch; +Cc: linuxppc-dev, lkessler, mahuja, Olof Johansson, strosake
On 10/01/2008, Nathan Lynch <ntl@pobox.com> wrote:
> Mike Strosaker wrote:
> >
> > At the risk of repeating what others have already said, the PHYP-assistance
> > method provides some advantages that the kexec method cannot:
> > - Availability of the system for production use before the dump data is
> > collected. As was mentioned before, some production systems may choose not
> > to operate with the limited memory initially available after the reboot,
> > but it sure is nice to provide the option.
>
> I'm more concerned that this design encourages the user to resume a
> workload *which is almost certainly known to result in a system crash*
> before collection of crash data is complete. Maybe the gamble will
> pay off most of the time, but I wouldn't want to be working support
> when it doesn't.
Workloads that cause crashes within hours of startup tend to be
weeded-out/discovered during pre-production test of the system
to be deployed. Since its pre-production test, dumps can be
taken in a leisurely manner. Heck, even a session at the
xmon prompt can be contemplated.
The problem is when the crash only reproduces after days or
weeks of uptime, on a production machine. Since the machine
is in production, its got to be brought back up ASAP. Since
its crashing only after days/weeks, the dump should have
plenty of time to complete. (And if it crashes quickly after
that reboot ... well, support people always welcome ways
in which a bug can be reproduced more quickly/easily).
--linas
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-11 16:57 ` Linas Vepstas
@ 2008-01-14 5:24 ` Olof Johansson
2008-01-14 15:21 ` Linas Vepstas
0 siblings, 1 reply; 71+ messages in thread
From: Olof Johansson @ 2008-01-14 5:24 UTC (permalink / raw)
To: Linas Vepstas; +Cc: lkessler, linuxppc-dev, Nathan Lynch, mahuja, strosake
On Fri, Jan 11, 2008 at 10:57:51AM -0600, Linas Vepstas wrote:
> On 10/01/2008, Nathan Lynch <ntl@pobox.com> wrote:
> > Mike Strosaker wrote:
> > >
> > > At the risk of repeating what others have already said, the PHYP-assistance
> > > method provides some advantages that the kexec method cannot:
> > > - Availability of the system for production use before the dump data is
> > > collected. As was mentioned before, some production systems may choose not
> > > to operate with the limited memory initially available after the reboot,
> > > but it sure is nice to provide the option.
> >
> > I'm more concerned that this design encourages the user to resume a
> > workload *which is almost certainly known to result in a system crash*
> > before collection of crash data is complete. Maybe the gamble will
> > pay off most of the time, but I wouldn't want to be working support
> > when it doesn't.
>
> Workloads that cause crashes within hours of startup tend to be
> weeded-out/discovered during pre-production test of the system
> to be deployed. Since its pre-production test, dumps can be
> taken in a leisurely manner. Heck, even a session at the
> xmon prompt can be contemplated.
>
> The problem is when the crash only reproduces after days or
> weeks of uptime, on a production machine. Since the machine
> is in production, its got to be brought back up ASAP. Since
> its crashing only after days/weeks, the dump should have
> plenty of time to complete. (And if it crashes quickly after
> that reboot ... well, support people always welcome ways
> in which a bug can be reproduced more quickly/easily).
How do you expect to have it in full production if you don't have all
resources available for it? It's not until the dump has finished that you
can return all memory to the production environment and use it.
This can very easily be argued in both direction, with no clear winner:
If the crash is stress-induced (say a slashdotted website), for those
cases it seems more rational to take the time, collect _good data_ even
if it takes a little longer, and then go back into production. Especially
if the alternative is to go back into production immediately, collect
about half of the data, and then crash again. Rinse and repeat.
Anyway -- I can agree that some of the arguments w.r.t robustness and
reliability of collecting dumps can be higher using this approach. It
really surprises me that there's no way to reset a device through PHYP
though. Seems like such a fundamental feature.
I think people are overly optimistic if they think it'll be possible
to do all of this reliably (as in with consistent performance) without
a second reboot though. At least without similar amounts of work being
done as it would have taken to fix kdump's reliability in the first place.
Speaking of reboots. PHYP isn't known for being quick at rebooting a
partition, it used to take in the order of minutes even on a small
machine. Has that been fixed? If not, the avoiding an extra reboot
argument hardly seems like a benefit versus kdump+kexec, which reboots
nearly instantly and without involvement from PHYP.
-Olof
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/8] pseries: phyp dump: Docmentation
2008-01-14 5:24 ` Olof Johansson
@ 2008-01-14 15:21 ` Linas Vepstas
0 siblings, 0 replies; 71+ messages in thread
From: Linas Vepstas @ 2008-01-14 15:21 UTC (permalink / raw)
To: Olof Johansson; +Cc: lkessler, linuxppc-dev, Nathan Lynch, mahuja, strosake
On 13/01/2008, Olof Johansson <olof@lixom.net> wrote:
>
> How do you expect to have it in full production if you don't have all
> resources available for it? It's not until the dump has finished that you
> can return all memory to the production environment and use it.
With the PHYP dump, each chunk of RAM is returned for
general use immediately after being dumped; so its not
an all-or-nothing proposition. Production systems don't
often hit 100% RAM use right out of the gate, they often
take hours or days to get there, so again, there should
be time to dump.
> This can very easily be argued in both direction, with no clear winner:
> If the crash is stress-induced (say a slashdotted website), for those
> cases it seems more rational to take the time, collect _good data_ even
> if it takes a little longer, and then go back into production. Especially
> if the alternative is to go back into production immediately, collect
> about half of the data, and then crash again. Rinse and repeat.
Again, the mode of operation for the phyp dump is that you'll
always have all of the data from the *first* crash, even if there
are multiple crashes. That's because the the as-yet undumped
RAM is not put back into production.
> really surprises me that there's no way to reset a device through PHYP
> though. Seems like such a fundamental feature.
I don't know who said that; that's not right. The EEH function
certainly does allow you to halt/restart PCI traffic to a particular
device and also to reset the device. So, yes, the pSeries
kexec code should call into the eeh subsystem to rationalize
the device state.
> I think people are overly optimistic if they think it'll be possible
> to do all of this reliably (as in with consistent performance) without
> a second reboot though.
The NUMA issues do concern me. But then, the whole virtualized,
fractional-cpu, tickless operation stuff sounds like a performance
tuning nightmare to begin with.
> At least without similar amounts of work being
> done as it would have taken to fix kdump's reliability in the first place.
:-)
> Speaking of reboots. PHYP isn't known for being quick at rebooting a
> partition, it used to take in the order of minutes even on a small
> machine. Has that been fixed?
Dunno. Probably not.
> If not, the avoiding an extra reboot
> argument hardly seems like a benefit versus kdump+kexec, which reboots
> nearly instantly and without involvement from PHYP.
OK, let me tell you what I'm up against right now.
I'm dealing with sporadic corruption on my home box.
About a month ago, I bought a whizzy ASUS M2NE
motherboard & an AMD64 2-core cpu, and two sticks
of RAM, 1GB per stick. I have one new hard drive,
SATA, and one old hard drive, from my old machine,
the PATA. The two disks are mirrored in a RAID-1
config. Running Ubuntu.
During install/upgrade a month ago, I noticed some of
the install files seemed to have gotten corrupted, but
that downloading them again got me a working version.
This put a serious frown on my face: maybe a bad ethernet
card or connection !?
Two weeks ago, gcc stopped working one morning, although
it worked fine the night before. I'd done nothing in the interim
but sleep. Reinstalling it made it work again. Yesterday,
something else stopped working. I found the offending
library, I compared file checksums against a known-good
version, and they were off. (!!!) Disk corruption?
Then apt-get stopped working. The /var/lib/dpkg/status file
had randomly corrupted single bytes. Its ascii, I hand
repaired it; it had maybe 10 bad bytes out of 2MB total size.
I installed tripwire. Between the first run of tripwire, and the
second, less than an hour later, it reported several dozen
files have changed checksums. Manual inspection of some
of these files against known-good versions show that, at least
this morning, that's no longer the case.
System hasn't crashed in a month, since first boot. So
what's going on? Is it possible that one of the two disks
is serving up bad data, which explains the funny checksum
behaviour? Or maybe its bad RAM, so that a fresh disk
read shows good data? If its bad ram, why doesn't the
system crash? I forced fsck last night, fsck came back
spotless.
So ... moral of the story: If phyp is doing some sort of
hardware checks and validation, that's great. I wish I could
afford a pSeries system for my home computer, because
my impression is that they are very stable, and don't do
things like data corruption. I'm such a friggin cheapskate
that I can't bear to spend many thousands instead of many
hundreds of dollars. However, I will trade a longer boot
for the dream of higher reliability.
--linas
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: reserve-release proof-of-concept
2008-01-08 0:25 ` [PATCH 3/8] pseries: phyp dump: reserve-release proof-of-concept Manish Ahuja
2008-01-08 3:16 ` Stephen Rothwell
@ 2008-01-16 4:21 ` Paul Mackerras
1 sibling, 0 replies; 71+ messages in thread
From: Paul Mackerras @ 2008-01-16 4:21 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, lkessler, strosake
Manish Ahuja writes:
> Initial patch for reserving memory in early boot, and freeing it later.
> If the previous boot had ended with a crash, the reserved memory would contain
> a copy of the crashed kernel data.
The main problem I see here is that if this option is turned on, the
kernel now has only 256MB of memory from early boot until
subsys_initcalls are done -- on any machine, and whether or not there
is actually a dump. That means, for instance, that a machine running
bare-metal (such as a G5) might not be able to allocate the hash table
for the MMU. Also, any allocations made during that time won't be
able to be node-local.
So it will be necessary to read the flattened device tree early on to
see whether or not there is a dump, so that we don't reserve most of
memory in the cases where there isn't a dump.
Paul.
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-01-22 19:12 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
@ 2008-01-22 19:33 ` Manish Ahuja
0 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-01-22 19:33 UTC (permalink / raw)
To: ppc-dev, paulus, linasvepstas; +Cc: lkessler, strosake
Check to see if there actually is data from a previously
crashed kernel waiting. If so, Allow user-sapce tools to
grab the data (by reading /proc/kcore). When user-space
finishes dumping a section, it must release that memory
by writing to sysfs. For example,
echo "0x40000000 0x10000000" > /sys/kernel/release_region
will release 256MB starting at the 1GB. The released memory
becomes free for general use.
Signed-off-by: Linas Vepstas <linasvepstas@gmail.com>
------
arch/powerpc/platforms/pseries/phyp_dump.c | 102 +++++++++++++++++++++++++++--
1 file changed, 96 insertions(+), 6 deletions(-)
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-01-18 07:37:33.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-01-18 22:43:00.000000000 -0600
@@ -12,17 +12,24 @@
*/
#include <linux/init.h>
+#include <linux/kobject.h>
#include <linux/mm.h>
+#include <linux/of.h>
#include <linux/pfn.h>
#include <linux/swap.h>
+#include <linux/sysfs.h>
#include <asm/page.h>
#include <asm/phyp_dump.h>
+#include <asm/rtas.h>
/* Global, used to communicate data between early boot and late boot */
static struct phyp_dump phyp_dump_global;
struct phyp_dump *phyp_dump_info = &phyp_dump_global;
+static int ibm_configure_kernel_dump;
+
+/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
* @start_pfn: starting physical frame number
@@ -52,20 +59,103 @@ release_memory_range(unsigned long start
}
}
-static int __init phyp_dump_setup(void)
+/* ------------------------------------------------- */
+/**
+ * sysfs_release_region -- sysfs interface to release memory range.
+ *
+ * Usage:
+ * "echo <start addr> <length> > /sys/kernel/release_region"
+ *
+ * Example:
+ * "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
+ *
+ * will release 256MB starting at 1GB.
+ */
+static ssize_t
+store_release_region(struct kset *kset, const char *buf, size_t count)
{
+ unsigned long start_addr, length, end_addr;
unsigned long start_pfn, nr_pages;
+ ssize_t ret;
- /* If no memory was reserved in early boot, there is nothing to do */
- if (phyp_dump_info->init_reserve_size == 0)
- return 0;
+ ret = sscanf(buf, "%lx %lx", &start_addr, &length);
+ if (ret != 2)
+ return -EINVAL;
+
+ /* Range-check - don't free any reserved memory that
+ * wasn't reserved for phyp-dump */
+ if (start_addr < phyp_dump_info->init_reserve_start)
+ start_addr = phyp_dump_info->init_reserve_start;
+
+ end_addr = phyp_dump_info->init_reserve_start +
+ phyp_dump_info->init_reserve_size;
+ if (start_addr+length > end_addr)
+ length = end_addr - start_addr;
+
+ /* Release the region of memory assed in by user */
+ start_pfn = PFN_DOWN(start_addr);
+ nr_pages = PFN_DOWN(length);
+ release_memory_range (start_pfn, nr_pages);
- /* Release memory that was reserved in early boot */
+ return count;
+}
+
+static ssize_t
+show_release_region(struct kset * kset, char *buf)
+{
+ return sprintf(buf, "ola\n");
+}
+
+static struct subsys_attribute rr = __ATTR(release_region, 0600,
+ show_release_region,
+ store_release_region);
+
+/* ------------------------------------------------- */
+
+static void release_all (void)
+{
+ unsigned long start_pfn, nr_pages;
+
+ /* Release all memory that was reserved in early boot */
start_pfn = PFN_DOWN(phyp_dump_info->init_reserve_start);
nr_pages = PFN_DOWN(phyp_dump_info->init_reserve_size);
release_memory_range(start_pfn, nr_pages);
+}
+
+static int __init phyp_dump_setup(void)
+{
+ struct device_node *rtas;
+ const int *dump_header;
+ int header_len = 0;
+ int rc;
+
+ /* If no memory was reserved in early boot, there is nothing to do */
+ if (phyp_dump_info->init_reserve_size == 0)
+ return 0;
+
+ /* Return if phyp dump not supported */
+ ibm_configure_kernel_dump = rtas_token("ibm,configure-kernel-dump");
+ if (ibm_configure_kernel_dump == RTAS_UNKNOWN_SERVICE) {
+ release_all();
+ return -ENOSYS;
+ }
+
+ /* Is there dump data waiting for us? */
+ rtas = of_find_node_by_path("/rtas");
+ dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
+ if (dump_header == NULL) {
+ release_all();
+ return 0;
+ }
+
+ /* Should we create a dump_subsys, analogous to s390/ipl.c ? */
+ rc = subsys_create_file(&kernel_subsys, &rr);
+ if (rc) {
+ printk (KERN_ERR "phyp-dump: unable to create sysfs file (%d)\n", rc);
+ release_all();
+ return 0;
+ }
return 0;
}
-
subsys_initcall(phyp_dump_setup);
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump
2008-01-07 23:45 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (7 preceding siblings ...)
2008-01-08 0:39 ` [PATCH 8/8] pseries: phyp dump: Tracking memory range freed Manish Ahuja
@ 2008-02-12 6:31 ` Manish Ahuja
2008-02-12 6:53 ` [PATCH 1/8] pseries: phyp dump: Docmentation Manish Ahuja
` (7 more replies)
8 siblings, 8 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 6:31 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mahuja, linasvepstas, lkessler, strosake
The following series of patches implement a basic framework
for hypervisor-assisted dump. The very first patch provides
documentation explaining what this is :-). Yes, its supposed
to be an improvement over kdump.
A list of open issues / todo list is included in the documentation.
It also appears that the not-yet-released firmware versions this was tested
on are still, ahem, incomplete; this work is also pending.
I have included most of the changes requested. Although, I did find
one or two, fixed in a later patch file rather than the first location
they appeared at.
Also it now does not block any memory on machines other than power6 boxes
which have the requisite firmware. This is from a power5 box.
from jal-lp6 a power5 machine.
.........
Phyp-dump not supported on this hardware
Using pSeries machine description
console [udbg-1] enabled
.......
Since I changed a few more things, I am reposting all the patches.
-- Manish & Linas.
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 1/8] pseries: phyp dump: Docmentation
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
@ 2008-02-12 6:53 ` Manish Ahuja
2008-02-12 7:08 ` [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept Manish Ahuja
` (6 subsequent siblings)
7 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 6:53 UTC (permalink / raw)
To: linuxppc-dev, paulus; +Cc: Manish Ahuja
Basic documentation for hypervisor-assisted dump.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
----
Documentation/powerpc/phyp-assisted-dump.txt | 127 +++++++++++++++++++++++++++
1 file changed, 127 insertions(+)
Index: 2.6.24-rc5/Documentation/powerpc/phyp-assisted-dump.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ 2.6.24-rc5/Documentation/powerpc/phyp-assisted-dump.txt 2008-02-12 06:38:25.000000000 -0600
@@ -0,0 +1,127 @@
+
+ Hypervisor-Assisted Dump
+ ------------------------
+ November 2007
+
+The goal of hypervisor-assisted dump is to enable the dump of
+a crashed system, and to do so from a fully-reset system, and
+to minimize the total elapsed time until the system is back
+in production use.
+
+As compared to kdump or other strategies, hypervisor-assisted
+dump offers several strong, practical advantages:
+
+-- Unlike kdump, the system has been reset, and loaded
+ with a fresh copy of the kernel. In particular,
+ PCI and I/O devices have been reinitialized and are
+ in a clean, consistent state.
+-- As the dump is performed, the dumped memory becomes
+ immediately available to the system for normal use.
+-- After the dump is completed, no further reboots are
+ required; the system will be fully usable, and running
+ in it's normal, production mode on it normal kernel.
+
+The above can only be accomplished by coordination with,
+and assistance from the hypervisor. The procedure is
+as follows:
+
+-- When a system crashes, the hypervisor will save
+ the low 256MB of RAM to a previously registered
+ save region. It will also save system state, system
+ registers, and hardware PTE's.
+
+-- After the low 256MB area has been saved, the
+ hypervisor will reset PCI and other hardware state.
+ It will *not* clear RAM. It will then launch the
+ bootloader, as normal.
+
+-- The freshly booted kernel will notice that there
+ is a new node (ibm,dump-kernel) in the device tree,
+ indicating that there is crash data available from
+ a previous boot. It will boot into only 256MB of RAM,
+ reserving the rest of system memory.
+
+-- Userspace tools will parse /sys/kernel/release_region
+ and read /proc/vmcore to obtain the contents of memory,
+ which holds the previous crashed kernel. The userspace
+ tools may copy this info to disk, or network, nas, san,
+ iscsi, etc. as desired.
+
+ For Example: the values in /sys/kernel/release-region
+ would look something like this (address-range pairs).
+ CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /
+ DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A
+
+-- As the userspace tools complete saving a portion of
+ dump, they echo an offset and size to
+ /sys/kernel/release_region to release the reserved
+ memory back to general use.
+
+ An example of this is:
+ "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
+ which will release 256MB at the 1GB boundary.
+
+Please note that the hypervisor-assisted dump feature
+is only available on Power6-based systems with recent
+firmware versions.
+
+Implementation details:
+----------------------
+
+During boot, a check is made to see if firmware supports
+this feature on this particular machine. If it does, then
+we check to see if a active dump is waiting for us. If yes
+then everything but 256 MB of RAM is reserved during early
+boot. This area is released once we collect a dump from user
+land scripts that are run. If there is dump data, then
+the /sys/kernel/release_region file is created, and
+the reserved memory is held.
+
+If there is no waiting dump data, then only the highest
+256MB of the ram is reserved as a scratch area. This area
+is *not* be released: this region will be kept permanently
+reserved, so that it can act as a receptacle for a copy
+of the low 256MB in the case a crash does occur. See,
+however, "open issues" below, as to whether
+such a reserved region is really needed.
+
+Currently the dump will be copied from /proc/vmcore to a
+a new file upon user intervention. The starting address
+to be read and the range for each data point in provided
+in /sys/kernel/release_region.
+
+The tools to examine the dump will be same as the ones
+used for kdump.
+
+General notes:
+--------------
+Security: please note that there are potential security issues
+with any sort of dump mechanism. In particular, plaintext
+(unencrypted) data, and possibly passwords, may be present in
+the dump data. Userspace tools must take adequate precautions to
+preserve security.
+
+Open issues/ToDo:
+------------
+ o The various code paths that tell the hypervisor that a crash
+ occurred, vs. it simply being a normal reboot, should be
+ reviewed, and possibly clarified/fixed.
+
+ o Instead of using /sys/kernel, should there be a /sys/dump
+ instead? There is a dump_subsys being created by the s390 code,
+ perhaps the pseries code should use a similar layout as well.
+
+ o Is reserving a 256MB region really required? The goal of
+ reserving a 256MB scratch area is to make sure that no
+ important crash data is clobbered when the hypervisor
+ save low mem to the scratch area. But, if one could assure
+ that nothing important is located in some 256MB area, then
+ it would not need to be reserved. Something that can be
+ improved in subsequent versions.
+
+ o Still working the kdump team to integrate this with kdump,
+ some work remains but this would not affect the current
+ patches.
+
+ o Still need to write a shell script, to copy the dump away.
+ Currently I am parsing it manually.
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
2008-02-12 6:53 ` [PATCH 1/8] pseries: phyp dump: Docmentation Manish Ahuja
@ 2008-02-12 7:08 ` Manish Ahuja
2008-02-12 8:48 ` Michael Ellerman
2008-02-14 3:46 ` Tony Breeds
2008-02-12 7:11 ` [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
` (5 subsequent siblings)
7 siblings, 2 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 7:08 UTC (permalink / raw)
To: linuxppc-dev, paulus; +Cc: mahuja, linasvepstas
Initial patch for reserving memory in early boot, and freeing it later.
If the previous boot had ended with a crash, the reserved memory would contain
a copy of the crashed kernel data.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
arch/powerpc/kernel/prom.c | 50 ++++++++++++++++++++
arch/powerpc/kernel/rtas.c | 32 +++++++++++++
arch/powerpc/platforms/pseries/Makefile | 1
arch/powerpc/platforms/pseries/phyp_dump.c | 71 +++++++++++++++++++++++++++++
include/asm-powerpc/phyp_dump.h | 38 +++++++++++++++
include/asm/rtas.h | 3 +
6 files changed, 195 insertions(+)
Index: 2.6.24-rc5/include/asm-powerpc/phyp_dump.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ 2.6.24-rc5/include/asm-powerpc/phyp_dump.h 2008-02-12 06:12:37.000000000 -0600
@@ -0,0 +1,38 @@
+/*
+ * Hypervisor-assisted dump
+ *
+ * Linas Vepstas, Manish Ahuja 2007
+ * Copyright (c) 2007 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _PPC64_PHYP_DUMP_H
+#define _PPC64_PHYP_DUMP_H
+
+#ifdef CONFIG_PHYP_DUMP
+
+/* The RMR region will be saved for later dumping
+ * whenever the kernel crashes. Set this to 256MB. */
+#define PHYP_DUMP_RMR_START 0x0
+#define PHYP_DUMP_RMR_END (1UL<<28)
+
+struct phyp_dump {
+ /* Memory that is reserved during very early boot. */
+ unsigned long init_reserve_start;
+ unsigned long init_reserve_size;
+ /* Check status during boot if dump supported, active & present*/
+ unsigned long phyp_dump_configured;
+ unsigned long phyp_dump_is_active;
+ /* store cpu & hpte size */
+ unsigned long cpu_state_size;
+ unsigned long hpte_region_size;
+};
+
+extern struct phyp_dump *phyp_dump_info;
+
+#endif /* CONFIG_PHYP_DUMP */
+#endif /* _PPC64_PHYP_DUMP_H */
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:12:37.000000000 -0600
@@ -0,0 +1,71 @@
+/*
+ * Hypervisor-assisted dump
+ *
+ * Linas Vepstas, Manish Ahuja 2007
+ * Copyrhgit (c) 2007 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/pfn.h>
+#include <linux/swap.h>
+
+#include <asm/page.h>
+#include <asm/phyp_dump.h>
+
+/* Global, used to communicate data between early boot and late boot */
+static struct phyp_dump phyp_dump_global;
+struct phyp_dump *phyp_dump_info = &phyp_dump_global;
+
+/**
+ * release_memory_range -- release memory previously lmb_reserved
+ * @start_pfn: starting physical frame number
+ * @nr_pages: number of pages to free.
+ *
+ * This routine will release memory that had been previously
+ * lmb_reserved in early boot. The released memory becomes
+ * available for genreal use.
+ */
+static void
+release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
+{
+ struct page *rpage;
+ unsigned long end_pfn;
+ long i;
+
+ end_pfn = start_pfn + nr_pages;
+
+ for (i=start_pfn; i <= end_pfn; i++) {
+ rpage = pfn_to_page(i);
+ if (PageReserved(rpage)) {
+ ClearPageReserved(rpage);
+ init_page_count(rpage);
+ __free_page(rpage);
+ totalram_pages++;
+ }
+ }
+}
+
+static int __init phyp_dump_setup(void)
+{
+ unsigned long start_pfn, nr_pages;
+
+ /* If no memory was reserved in early boot, there is nothing to do */
+ if (phyp_dump_info->init_reserve_size == 0)
+ return 0;
+
+ /* Release memory that was reserved in early boot */
+ start_pfn = PFN_DOWN(phyp_dump_info->init_reserve_start);
+ nr_pages = PFN_DOWN(phyp_dump_info->init_reserve_size);
+ release_memory_range(start_pfn, nr_pages);
+
+ return 0;
+}
+
+subsys_initcall(phyp_dump_setup);
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/Makefile
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/Makefile 2008-02-12 06:12:10.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/Makefile 2008-02-12 06:12:37.000000000 -0600
@@ -18,3 +18,4 @@ obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu
obj-$(CONFIG_HVC_CONSOLE) += hvconsole.o
obj-$(CONFIG_HVCS) += hvcserver.o
obj-$(CONFIG_HCALL_STATS) += hvCall_inst.o
+obj-$(CONFIG_PHYP_DUMP) += phyp_dump.o
Index: 2.6.24-rc5/arch/powerpc/kernel/prom.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/kernel/prom.c 2008-02-12 06:12:10.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/kernel/prom.c 2008-02-12 06:12:37.000000000 -0600
@@ -51,6 +51,7 @@
#include <asm/machdep.h>
#include <asm/pSeries_reconfig.h>
#include <asm/pci-bridge.h>
+#include <asm/phyp_dump.h>
#include <asm/kexec.h>
#ifdef DEBUG
@@ -1011,6 +1012,52 @@ static void __init early_reserve_mem(voi
#endif
}
+#ifdef CONFIG_PHYP_DUMP
+/**
+ * reserve_crashed_mem() - reserve all not-yet-dumped mmemory
+ *
+ * This routine may reserve memory regions in the kernel only
+ * if the system is supported and a dump was taken in last
+ * boot instance or if the hardware is supported and the
+ * scratch area needs to be setup. In other instances it returns
+ * without reserving anything. The memory in case of dump being
+ * active is freed when the dump is collected (by userland tools).
+ */
+static void __init reserve_crashed_mem(void)
+{
+ unsigned long base, size;
+ if (!phyp_dump_info->phyp_dump_configured) {
+ printk(KERN_ERR "Phyp-dump not supported on this hardware\n");
+ return;
+ }
+
+ if (phyp_dump_info->phyp_dump_is_active) {
+ /* Reserve *everything* above RMR.Area freed by userland tools*/
+ base = PHYP_DUMP_RMR_END;
+ size = lmb_end_of_DRAM() - base;
+
+ /* XXX crashed_ram_end is wrong, since it may be beyond
+ * the memory_limit, it will need to be adjusted. */
+ lmb_reserve(base, size);
+
+ phyp_dump_info->init_reserve_start = base;
+ phyp_dump_info->init_reserve_size = size;
+ }
+ else {
+ size = phyp_dump_info->cpu_state_size +
+ phyp_dump_info->hpte_region_size +
+ PHYP_DUMP_RMR_END;
+ base = lmb_end_of_DRAM() - size;
+ lmb_reserve(base, size);
+ phyp_dump_info->init_reserve_start = base;
+ phyp_dump_info->init_reserve_size = size;
+ }
+}
+#else
+static inline void __init reserve_crashed_mem(void) {}
+#endif /* CONFIG_PHYP_DUMP */
+
+
void __init early_init_devtree(void *params)
{
DBG(" -> early_init_devtree(%p)\n", params);
@@ -1022,6 +1069,8 @@ void __init early_init_devtree(void *par
/* Some machines might need RTAS info for debugging, grab it now. */
of_scan_flat_dt(early_init_dt_scan_rtas, NULL);
#endif
+ /* scan tree to see if dump occured during last boot */
+ of_scan_flat_dt(early_init_dt_scan_phyp_dump, NULL);
/* Retrieve various informations from the /chosen node of the
* device-tree, including the platform type, initrd location and
@@ -1043,6 +1092,7 @@ void __init early_init_devtree(void *par
reserve_kdump_trampoline();
reserve_crashkernel();
early_reserve_mem();
+ reserve_crashed_mem();
lmb_enforce_memory_limit(memory_limit);
lmb_analyze();
Index: 2.6.24-rc5/arch/powerpc/kernel/rtas.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/kernel/rtas.c 2008-02-12 06:12:10.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/kernel/rtas.c 2008-02-12 06:12:37.000000000 -0600
@@ -39,6 +39,7 @@
#include <asm/syscalls.h>
#include <asm/smp.h>
#include <asm/atomic.h>
+#include <asm/phyp_dump.h>
struct rtas_t rtas = {
.lock = SPIN_LOCK_UNLOCKED
@@ -883,6 +884,37 @@ void __init rtas_initialize(void)
#endif
}
+int __init early_init_dt_scan_phyp_dump(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+#ifdef CONFIG_PHYP_DUMP
+ const unsigned int *sizes;
+
+ phyp_dump_info->phyp_dump_configured = 0;
+ phyp_dump_info->phyp_dump_is_active = 0;
+
+ if (depth != 1 || strcmp(uname, "rtas") != 0)
+ return 0;
+
+ if (of_get_flat_dt_prop(node, "ibm,configure-kernel-dump", NULL))
+ phyp_dump_info->phyp_dump_configured++;
+
+ if (of_get_flat_dt_prop(node, "ibm,dump-kernel", NULL))
+ phyp_dump_info->phyp_dump_is_active++;
+
+ sizes = of_get_flat_dt_prop(node, "ibm,configure-kernel-dump-sizes", NULL);
+ if (!sizes)
+ return 0;
+
+ if (sizes[0] == 1)
+ phyp_dump_info->cpu_state_size = *((unsigned long *)&sizes[1]);
+
+ if (sizes[3] == 2)
+ phyp_dump_info->hpte_region_size = *((unsigned long *)&sizes[4]);
+#endif
+ return 1;
+}
+
int __init early_init_dt_scan_rtas(unsigned long node,
const char *uname, int depth, void *data)
{
Index: 2.6.24-rc5/include/asm/rtas.h
===================================================================
--- 2.6.24-rc5.orig/include/asm/rtas.h 2008-02-12 06:12:10.000000000 -0600
+++ 2.6.24-rc5/include/asm/rtas.h 2008-02-12 06:12:37.000000000 -0600
@@ -183,6 +183,9 @@ extern unsigned int rtas_busy_delay(int
extern int early_init_dt_scan_rtas(unsigned long node,
const char *uname, int depth, void *data);
+int early_init_dt_scan_phyp_dump(unsigned long node,
+ const char *uname, int depth, void *data);
+
extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal);
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
2008-02-12 6:53 ` [PATCH 1/8] pseries: phyp dump: Docmentation Manish Ahuja
2008-02-12 7:08 ` [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept Manish Ahuja
@ 2008-02-12 7:11 ` Manish Ahuja
2008-02-12 10:08 ` Stephen Rothwell
2008-02-15 1:05 ` Tony Breeds
2008-02-12 7:14 ` [PATCH 4/8] pseries: phyp dump: register dump area Manish Ahuja
` (4 subsequent siblings)
7 siblings, 2 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 7:11 UTC (permalink / raw)
To: linuxppc-dev, paulus; +Cc: mahuja, linasvepstas
Check to see if there actually is data from a previously
crashed kernel waiting. If so, Allow user-space tools to
grab the data (by reading /proc/kcore). When user-space
finishes dumping a section, it must release that memory
by writing to sysfs. For example,
echo "0x40000000 0x10000000" > /sys/kernel/release_region
will release 256MB starting at the 1GB. The released memory
becomes free for general use.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
------
arch/powerpc/platforms/pseries/phyp_dump.c | 88 +++++++++++++++++++++++++++--
1 file changed, 82 insertions(+), 6 deletions(-)
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:12:37.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:12:55.000000000 -0600
@@ -12,17 +12,24 @@
*/
#include <linux/init.h>
+#include <linux/kobject.h>
#include <linux/mm.h>
+#include <linux/of.h>
#include <linux/pfn.h>
#include <linux/swap.h>
+#include <linux/sysfs.h>
#include <asm/page.h>
#include <asm/phyp_dump.h>
+#include <asm/rtas.h>
/* Global, used to communicate data between early boot and late boot */
static struct phyp_dump phyp_dump_global;
struct phyp_dump *phyp_dump_info = &phyp_dump_global;
+static int ibm_configure_kernel_dump;
+
+/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
* @start_pfn: starting physical frame number
@@ -52,20 +59,89 @@ release_memory_range(unsigned long start
}
}
-static int __init phyp_dump_setup(void)
+/* ------------------------------------------------- */
+/**
+ * sysfs_release_region -- sysfs interface to release memory range.
+ *
+ * Usage:
+ * "echo <start addr> <length> > /sys/kernel/release_region"
+ *
+ * Example:
+ * "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
+ *
+ * will release 256MB starting at 1GB.
+ */
+static ssize_t
+store_release_region(struct kset *kset, const char *buf, size_t count)
{
+ unsigned long start_addr, length, end_addr;
unsigned long start_pfn, nr_pages;
+ ssize_t ret;
+
+ ret = sscanf(buf, "%lx %lx", &start_addr, &length);
+ if (ret != 2)
+ return -EINVAL;
+
+ /* Range-check - don't free any reserved memory that
+ * wasn't reserved for phyp-dump */
+ if (start_addr < phyp_dump_info->init_reserve_start)
+ start_addr = phyp_dump_info->init_reserve_start;
+
+ end_addr = phyp_dump_info->init_reserve_start +
+ phyp_dump_info->init_reserve_size;
+ if (start_addr+length > end_addr)
+ length = end_addr - start_addr;
+
+ /* Release the region of memory assed in by user */
+ start_pfn = PFN_DOWN(start_addr);
+ nr_pages = PFN_DOWN(length);
+ release_memory_range (start_pfn, nr_pages);
+
+ return count;
+}
+
+static ssize_t
+show_release_region(struct kset * kset, char *buf)
+{
+ return sprintf(buf, "ola\n");
+}
+
+static struct subsys_attribute rr = __ATTR(release_region, 0600,
+ show_release_region,
+ store_release_region);
+
+static int __init phyp_dump_setup(void)
+{
+ struct device_node *rtas;
+ const int *dump_header;
+ int header_len = 0;
+ int rc;
/* If no memory was reserved in early boot, there is nothing to do */
if (phyp_dump_info->init_reserve_size == 0)
return 0;
- /* Release memory that was reserved in early boot */
- start_pfn = PFN_DOWN(phyp_dump_info->init_reserve_start);
- nr_pages = PFN_DOWN(phyp_dump_info->init_reserve_size);
- release_memory_range(start_pfn, nr_pages);
+ /* Return if phyp dump not supported */
+ if (!phyp_dump_info->phyp_dump_configured) {
+ return -ENOSYS;
+ }
+
+ /* Is there dump data waiting for us? */
+ rtas = of_find_node_by_path("/rtas");
+ dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
+ if (dump_header == NULL) {
+ release_all();
+ return 0;
+ }
+
+ /* Should we create a dump_subsys, analogous to s390/ipl.c ? */
+ rc = subsys_create_file(&kernel_subsys, &rr);
+ if (rc) {
+ printk (KERN_ERR "phyp-dump: unable to create sysfs file (%d)\n", rc);
+ release_all();
+ return 0;
+ }
return 0;
}
-
subsys_initcall(phyp_dump_setup);
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 4/8] pseries: phyp dump: register dump area.
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (2 preceding siblings ...)
2008-02-12 7:11 ` [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
@ 2008-02-12 7:14 ` Manish Ahuja
2008-02-12 10:11 ` Stephen Rothwell
2008-02-12 7:16 ` [PATCH 5/8] pseries: phyp dump: debugging print routines Manish Ahuja
` (3 subsequent siblings)
7 siblings, 1 reply; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 7:14 UTC (permalink / raw)
To: linuxppc-dev, paulus; +Cc: mahuja, linasvepstas
Set up the actual dump header, register it with the hypervisor.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
------
arch/powerpc/platforms/pseries/phyp_dump.c | 136 +++++++++++++++++++++++++++--
1 file changed, 129 insertions(+), 7 deletions(-)
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:12:55.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:13:01.000000000 -0600
@@ -30,6 +30,117 @@ struct phyp_dump *phyp_dump_info = &phyp
static int ibm_configure_kernel_dump;
/* ------------------------------------------------- */
+/* RTAS interfaces to declare the dump regions */
+
+struct dump_section {
+ u32 dump_flags;
+ u16 source_type;
+ u16 error_flags;
+ u64 source_address;
+ u64 source_length;
+ u64 length_copied;
+ u64 destination_address;
+};
+
+struct phyp_dump_header {
+ u32 version;
+ u16 num_of_sections;
+ u16 status;
+
+ u32 first_offset_section;
+ u32 dump_disk_section;
+ u64 block_num_dd;
+ u64 num_of_blocks_dd;
+ u32 offset_dd;
+ u32 maxtime_to_auto;
+ /* No dump disk path string used */
+
+ struct dump_section cpu_data;
+ struct dump_section hpte_data;
+ struct dump_section kernel_data;
+};
+
+/* The dump header *must be* in low memory, so .bss it */
+static struct phyp_dump_header phdr;
+
+#define NUM_DUMP_SECTIONS 3
+#define DUMP_HEADER_VERSION 0x1
+#define DUMP_REQUEST_FLAG 0x1
+#define DUMP_SOURCE_CPU 0x0001
+#define DUMP_SOURCE_HPTE 0x0002
+#define DUMP_SOURCE_RMO 0x0011
+
+/**
+ * init_dump_header() - initialize the header declaring a dump
+ * Returns: length of dump save area.
+ *
+ * When the hypervisor saves crashed state, it needs to put
+ * it somewhere. The dump header tells the hypervisor where
+ * the data can be saved.
+ */
+static unsigned long init_dump_header(struct phyp_dump_header *ph)
+{
+ unsigned long addr_offset = 0;
+
+ /* Set up the dump header */
+ ph->version = DUMP_HEADER_VERSION;
+ ph->num_of_sections = NUM_DUMP_SECTIONS;
+ ph->status = 0;
+
+ ph->first_offset_section =
+ (u32)offsetof(struct phyp_dump_header, cpu_data);
+ ph->dump_disk_section = 0;
+ ph->block_num_dd = 0;
+ ph->num_of_blocks_dd = 0;
+ ph->offset_dd = 0;
+
+ ph->maxtime_to_auto = 0; /* disabled */
+
+ /* The first two sections are mandatory */
+ ph->cpu_data.dump_flags = DUMP_REQUEST_FLAG;
+ ph->cpu_data.source_type = DUMP_SOURCE_CPU;
+ ph->cpu_data.source_address = 0;
+ ph->cpu_data.source_length = phyp_dump_info->cpu_state_size;
+ ph->cpu_data.destination_address = addr_offset;
+ addr_offset += phyp_dump_info->cpu_state_size;
+
+ ph->hpte_data.dump_flags = DUMP_REQUEST_FLAG;
+ ph->hpte_data.source_type = DUMP_SOURCE_HPTE;
+ ph->hpte_data.source_address = 0;
+ ph->hpte_data.source_length = phyp_dump_info->hpte_region_size;
+ ph->hpte_data.destination_address = addr_offset;
+ addr_offset += phyp_dump_info->hpte_region_size;
+
+ /* This section describes the low kernel region */
+ ph->kernel_data.dump_flags = DUMP_REQUEST_FLAG;
+ ph->kernel_data.source_type = DUMP_SOURCE_RMO;
+ ph->kernel_data.source_address = PHYP_DUMP_RMR_START;
+ ph->kernel_data.source_length = PHYP_DUMP_RMR_END;
+ ph->kernel_data.destination_address = addr_offset;
+ addr_offset += ph->kernel_data.source_length;
+
+ return addr_offset;
+}
+
+static void register_dump_area(struct phyp_dump_header *ph, unsigned long addr)
+{
+ int rc;
+ ph->cpu_data.destination_address += addr;
+ ph->hpte_data.destination_address += addr;
+ ph->kernel_data.destination_address += addr;
+
+ do {
+ rc = rtas_call(ibm_configure_kernel_dump, 3, 1, NULL,
+ 1, ph, sizeof(struct phyp_dump_header));
+ } while (rtas_busy_delay(rc));
+
+ if (rc)
+ {
+ printk (KERN_ERR "phyp-dump: unexpected error (%d) on register\n", rc);
+ }
+}
+
+/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
* @start_pfn: starting physical frame number
@@ -113,7 +224,9 @@ static struct subsys_attribute rr = __AT
static int __init phyp_dump_setup(void)
{
struct device_node *rtas;
- const int *dump_header;
+ const struct phyp_dump_header *dump_header;
+ unsigned long dump_area_start;
+ unsigned long dump_area_length;
int header_len = 0;
int rc;
@@ -126,22 +239,31 @@ static int __init phyp_dump_setup(void)
return -ENOSYS;
}
- /* Is there dump data waiting for us? */
+ /* Is there dump data waiting for us? If there isn't,
+ * then register a new dump area, and release all of
+ * the rest of the reserved ram.
+ *
+ * The /rtas/ibm,kernel-dump rtas node is present only
+ * if there is dump data waiting for us.
+ */
rtas = of_find_node_by_path("/rtas");
dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
+ of_node_put(rtas);
+
+ dump_area_length = init_dump_header(&phdr);
+ dump_area_start = phyp_dump_info->init_reserve_start & PAGE_MASK; /* align down */
+
if (dump_header == NULL) {
- release_all();
+ register_dump_area(&phdr, dump_area_start);
return 0;
}
/* Should we create a dump_subsys, analogous to s390/ipl.c ? */
rc = subsys_create_file(&kernel_subsys, &rr);
- if (rc) {
+ if (rc)
printk (KERN_ERR "phyp-dump: unable to create sysfs file (%d)\n", rc);
- release_all();
- return 0;
- }
+ /* ToDo: re-register the dump area, for next time. */
return 0;
}
subsys_initcall(phyp_dump_setup);
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 5/8] pseries: phyp dump: debugging print routines.
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (3 preceding siblings ...)
2008-02-12 7:14 ` [PATCH 4/8] pseries: phyp dump: register dump area Manish Ahuja
@ 2008-02-12 7:16 ` Manish Ahuja
2008-02-12 7:18 ` [PATCH 6/8] pseries: phyp dump: Invalidate and print dump areas Manish Ahuja
` (2 subsequent siblings)
7 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 7:16 UTC (permalink / raw)
To: linuxppc-dev, paulus; +Cc: mahuja, linasvepstas
Provide some basic debugging support.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
Signed-off-by: Linas Vepsts <linas@austin.ibm.com>
-----
arch/powerpc/platforms/pseries/phyp_dump.c | 64 +++++++++++++++++++++++++++--
1 file changed, 60 insertions(+), 4 deletions(-)
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:13:01.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:13:06.000000000 -0600
@@ -2,7 +2,7 @@
* Hypervisor-assisted dump
*
* Linas Vepstas, Manish Ahuja 2007
- * Copyrhgit (c) 2007 IBM Corp.
+ * Copyright (c) 2007 IBM Corp.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
@@ -122,6 +122,61 @@ static unsigned long init_dump_header(st
return addr_offset;
}
+static void print_dump_header(const struct phyp_dump_header *ph)
+{
+#ifdef DEBUG
+ printk(KERN_INFO "dump header:\n");
+ /* setup some ph->sections required */
+ printk(KERN_INFO "version = %d\n", ph->version);
+ printk(KERN_INFO "Sections = %d\n", ph->num_of_sections);
+ printk(KERN_INFO "Status = 0x%x\n", ph->status);
+
+ /* No ph->disk, so all should be set to 0 */
+ printk(KERN_INFO "Offset to first section 0x%x\n",
+ ph->first_offset_section);
+ printk(KERN_INFO "dump disk sections should be zero\n");
+ printk(KERN_INFO "dump disk section = %d\n", ph->dump_disk_section);
+ printk(KERN_INFO "block num = %ld\n", ph->block_num_dd);
+ printk(KERN_INFO "number of blocks = %ld\n", ph->num_of_blocks_dd);
+ printk(KERN_INFO "dump disk offset = %d\n", ph->offset_dd);
+ printk(KERN_INFO "Max auto time= %d\n", ph->maxtime_to_auto);
+
+ /*set cpu state and hpte states as well scratch pad area */
+ printk(KERN_INFO " CPU AREA \n");
+ printk(KERN_INFO "cpu dump_flags =%d\n", ph->cpu_data.dump_flags);
+ printk(KERN_INFO "cpu source_type =%d\n", ph->cpu_data.source_type);
+ printk(KERN_INFO "cpu error_flags =%d\n", ph->cpu_data.error_flags);
+ printk(KERN_INFO "cpu source_address =%lx\n",
+ ph->cpu_data.source_address);
+ printk(KERN_INFO "cpu source_length =%lx\n",
+ ph->cpu_data.source_length);
+ printk(KERN_INFO "cpu length_copied =%lx\n",
+ ph->cpu_data.length_copied);
+
+ printk(KERN_INFO " HPTE AREA \n");
+ printk(KERN_INFO "HPTE dump_flags =%d\n", ph->hpte_data.dump_flags);
+ printk(KERN_INFO "HPTE source_type =%d\n", ph->hpte_data.source_type);
+ printk(KERN_INFO "HPTE error_flags =%d\n", ph->hpte_data.error_flags);
+ printk(KERN_INFO "HPTE source_address =%lx\n",
+ ph->hpte_data.source_address);
+ printk(KERN_INFO "HPTE source_length =%lx\n",
+ ph->hpte_data.source_length);
+ printk(KERN_INFO "HPTE length_copied =%lx\n",
+ ph->hpte_data.length_copied);
+
+ printk(KERN_INFO " SRSD AREA \n");
+ printk(KERN_INFO "SRSD dump_flags =%d\n", ph->kernel_data.dump_flags);
+ printk(KERN_INFO "SRSD source_type =%d\n", ph->kernel_data.source_type);
+ printk(KERN_INFO "SRSD error_flags =%d\n", ph->kernel_data.error_flags);
+ printk(KERN_INFO "SRSD source_address =%lx\n",
+ ph->kernel_data.source_address);
+ printk(KERN_INFO "SRSD source_length =%lx\n",
+ ph->kernel_data.source_length);
+ printk(KERN_INFO "SRSD length_copied =%lx\n",
+ ph->kernel_data.length_copied);
+#endif
+}
+
static void register_dump_area(struct phyp_dump_header *ph, unsigned long addr)
{
int rc;
@@ -134,9 +189,9 @@ static void register_dump_area(struct ph
1, ph, sizeof(struct phyp_dump_header));
} while (rtas_busy_delay(rc));
- if (rc)
- {
- printk (KERN_ERR "phyp-dump: unexpected error (%d) on register\n", rc);
+ if (rc) {
+ printk(KERN_ERR "phyp-dump: unexpected error (%d) on register\n", rc);
+ print_dump_header(ph);
}
}
@@ -238,6 +293,7 @@ static int __init phyp_dump_setup(void)
if (!phyp_dump_info->phyp_dump_configured) {
return -ENOSYS;
}
+ print_dump_header(dump_header);
/* Is there dump data waiting for us? If there isn't,
* then register a new dump area, and release all of
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 6/8] pseries: phyp dump: Invalidate and print dump areas.
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (4 preceding siblings ...)
2008-02-12 7:16 ` [PATCH 5/8] pseries: phyp dump: debugging print routines Manish Ahuja
@ 2008-02-12 7:18 ` Manish Ahuja
2008-02-12 10:18 ` Stephen Rothwell
2008-02-13 21:43 ` Manish Ahuja
2008-02-12 7:20 ` [PATCH 7/8] pseries: phyp dump: Tracking memory range freed Manish Ahuja
2008-02-12 7:21 ` [PATCH 8/8] pseries: phyp dump: config file Manish Ahuja
7 siblings, 2 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 7:18 UTC (permalink / raw)
To: linuxppc-dev, paulus; +Cc: mahuja, linasvepstas
Routines to
a. invalidate dump
b. Calculate region that is reserved and needs to be freed. This is
exported through sysfs interface.
Unregister has been removed for now as it wasn't being used.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
-----
---
arch/powerpc/platforms/pseries/phyp_dump.c | 85 +++++++++++++++++++++++++----
include/asm/phyp_dump.h | 6 +-
2 files changed, 79 insertions(+), 12 deletions(-)
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:13:06.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:13:17.000000000 -0600
@@ -69,6 +69,10 @@ static struct phyp_dump_header phdr;
#define DUMP_SOURCE_CPU 0x0001
#define DUMP_SOURCE_HPTE 0x0002
#define DUMP_SOURCE_RMO 0x0011
+#define DUMP_ERROR_FLAG 0x2000
+#define DUMP_TRIGGERED 0x4000
+#define DUMP_PERFORMED 0x8000
+
/**
* init_dump_header() - initialize the header declaring a dump
@@ -180,9 +184,15 @@ static void print_dump_header(const stru
static void register_dump_area(struct phyp_dump_header *ph, unsigned long addr)
{
int rc;
- ph->cpu_data.destination_address += addr;
- ph->hpte_data.destination_address += addr;
- ph->kernel_data.destination_address += addr;
+
+ /* Add addr value if not initialized before */
+ if (ph->cpu_data.destination_address == 0) {
+ ph->cpu_data.destination_address += addr;
+ ph->hpte_data.destination_address += addr;
+ ph->kernel_data.destination_address += addr;
+ }
+
+ /* ToDo Invalidate kdump and free memory range. */
do {
rc = rtas_call(ibm_configure_kernel_dump, 3, 1, NULL,
@@ -195,6 +205,30 @@ static void register_dump_area(struct ph
}
}
+static
+void invalidate_last_dump(struct phyp_dump_header *ph, unsigned long addr)
+{
+ int rc;
+
+ /* Add addr value if not initialized before */
+ if (ph->cpu_data.destination_address == 0) {
+ ph->cpu_data.destination_address += addr;
+ ph->hpte_data.destination_address += addr;
+ ph->kernel_data.destination_address += addr;
+ }
+
+ do {
+ rc = rtas_call(ibm_configure_kernel_dump, 3, 1, NULL,
+ 2, ph, sizeof(struct phyp_dump_header));
+ } while (rtas_busy_delay(rc));
+
+ if (rc) {
+ printk (KERN_ERR "phyp-dump: unexpected error (%d) "
+ "on invalidate\n", rc);
+ print_dump_header(ph);
+ }
+}
+
/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
@@ -205,8 +239,8 @@ static void register_dump_area(struct ph
* lmb_reserved in early boot. The released memory becomes
* available for genreal use.
*/
-static void
-release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
+static
+void release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
{
struct page *rpage;
unsigned long end_pfn;
@@ -237,8 +271,8 @@ release_memory_range(unsigned long start
*
* will release 256MB starting at 1GB.
*/
-static ssize_t
-store_release_region(struct kset *kset, const char *buf, size_t count)
+static
+ssize_t store_release_region(struct kset *kset, const char *buf, size_t count)
{
unsigned long start_addr, length, end_addr;
unsigned long start_pfn, nr_pages;
@@ -266,10 +300,23 @@ store_release_region(struct kset *kset,
return count;
}
-static ssize_t
-show_release_region(struct kset * kset, char *buf)
+static ssize_t show_release_region(struct kset * kset, char *buf)
{
- return sprintf(buf, "ola\n");
+ u64 second_addr_range;
+
+ /* total reserved size - start of scratch area */
+ second_addr_range = phyp_dump_info->init_reserve_size -
+ phyp_dump_info->reserved_scratch_size;
+ return sprintf(buf, "CPU:0x%lx-0x%lx: HPTE:0x%lx-0x%lx:"
+ " DUMP:0x%lx-0x%lx, 0x%lx-0x%lx:\n",
+ phdr.cpu_data.destination_address,
+ phdr.cpu_data.length_copied,
+ phdr.hpte_data.destination_address,
+ phdr.hpte_data.length_copied,
+ phdr.kernel_data.destination_address,
+ phdr.kernel_data.length_copied,
+ phyp_dump_info->init_reserve_start,
+ second_addr_range);
}
static struct subsys_attribute rr = __ATTR(release_region, 0600,
@@ -293,7 +340,6 @@ static int __init phyp_dump_setup(void)
if (!phyp_dump_info->phyp_dump_configured) {
return -ENOSYS;
}
- print_dump_header(dump_header);
/* Is there dump data waiting for us? If there isn't,
* then register a new dump area, and release all of
@@ -305,6 +351,7 @@ static int __init phyp_dump_setup(void)
rtas = of_find_node_by_path("/rtas");
dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
of_node_put(rtas);
+ print_dump_header(dump_header);
dump_area_length = init_dump_header(&phdr);
dump_area_start = phyp_dump_info->init_reserve_start & PAGE_MASK; /* align down */
@@ -314,6 +361,22 @@ static int __init phyp_dump_setup(void)
return 0;
}
+ /* re-register the dump area, if old dump was invalid */
+ if ((dump_header) && (dump_header->status & DUMP_ERROR_FLAG)) {
+ invalidate_last_dump(&phdr, dump_area_start);
+ register_dump_area(&phdr, dump_area_start);
+ return 0;
+ }
+
+ if (dump_header) {
+ phyp_dump_info->reserved_scratch_addr =
+ dump_header->cpu_data.destination_address;
+ phyp_dump_info->reserved_scratch_size =
+ dump_header->cpu_data.source_length +
+ dump_header->hpte_data.source_length +
+ dump_header->kernel_data.source_length;
+ }
+
/* Should we create a dump_subsys, analogous to s390/ipl.c ? */
rc = subsys_create_file(&kernel_subsys, &rr);
if (rc)
Index: 2.6.24-rc5/include/asm/phyp_dump.h
===================================================================
--- 2.6.24-rc5.orig/include/asm/phyp_dump.h 2008-02-12 06:12:37.000000000 -0600
+++ 2.6.24-rc5/include/asm/phyp_dump.h 2008-02-12 06:13:17.000000000 -0600
@@ -18,7 +18,8 @@
/* The RMR region will be saved for later dumping
* whenever the kernel crashes. Set this to 256MB. */
#define PHYP_DUMP_RMR_START 0x0
-#define PHYP_DUMP_RMR_END (1UL<<28)
+/*#define PHYP_DUMP_RMR_END (1UL<<28)*/
+#define PHYP_DUMP_RMR_END (1UL<<27)
struct phyp_dump {
/* Memory that is reserved during very early boot. */
@@ -30,6 +31,9 @@ struct phyp_dump {
/* store cpu & hpte size */
unsigned long cpu_state_size;
unsigned long hpte_region_size;
+ /* previous scratch area values */
+ unsigned long reserved_scratch_addr;
+ unsigned long reserved_scratch_size;
};
extern struct phyp_dump *phyp_dump_info;
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 7/8] pseries: phyp dump: Tracking memory range freed.
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (5 preceding siblings ...)
2008-02-12 7:18 ` [PATCH 6/8] pseries: phyp dump: Invalidate and print dump areas Manish Ahuja
@ 2008-02-12 7:20 ` Manish Ahuja
2008-02-12 7:21 ` [PATCH 8/8] pseries: phyp dump: config file Manish Ahuja
7 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 7:20 UTC (permalink / raw)
To: linuxppc-dev, paulus; +Cc: mahuja, linasvepstas
This patch tracks the size freed. For now it does a simple
rudimentary calculation of the ranges freed. The idea is
to keep it simple at the external shell script level and
send in large chunks for now.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
-----
---
arch/powerpc/platforms/pseries/phyp_dump.c | 35 +++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:13:17.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 06:13:21.000000000 -0600
@@ -259,6 +259,39 @@ void release_memory_range(unsigned long
}
}
+/**
+ * track_freed_range -- Counts the range being freed.
+ * Once the counter goes to zero, it re-registers dump for
+ * future use.
+ */
+static void
+track_freed_range(unsigned long addr, unsigned long length)
+{
+ static unsigned long scratch_area_size, reserved_area_size;
+
+ if (addr < phyp_dump_info->init_reserve_start)
+ return;
+
+ if ((addr >= phyp_dump_info->init_reserve_start) &&
+ (addr <= phyp_dump_info->init_reserve_start +
+ phyp_dump_info->init_reserve_size))
+ reserved_area_size += length;
+
+ if ((addr >= phyp_dump_info->reserved_scratch_addr) &&
+ (addr <= phyp_dump_info->reserved_scratch_addr +
+ phyp_dump_info->reserved_scratch_size))
+ scratch_area_size += length;
+
+ if ((reserved_area_size == phyp_dump_info->init_reserve_size) &&
+ (scratch_area_size == phyp_dump_info->reserved_scratch_size)) {
+
+ invalidate_last_dump(&phdr,
+ phyp_dump_info->reserved_scratch_addr);
+ register_dump_area (&phdr,
+ phyp_dump_info->reserved_scratch_addr);
+ }
+}
+
/* ------------------------------------------------- */
/**
* sysfs_release_region -- sysfs interface to release memory range.
@@ -282,6 +315,8 @@ ssize_t store_release_region(struct kset
if (ret != 2)
return -EINVAL;
+ track_freed_range(start_addr, length);
+
/* Range-check - don't free any reserved memory that
* wasn't reserved for phyp-dump */
if (start_addr < phyp_dump_info->init_reserve_start)
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 8/8] pseries: phyp dump: config file
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
` (6 preceding siblings ...)
2008-02-12 7:20 ` [PATCH 7/8] pseries: phyp dump: Tracking memory range freed Manish Ahuja
@ 2008-02-12 7:21 ` Manish Ahuja
7 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 7:21 UTC (permalink / raw)
To: linuxppc-dev, paulus; +Cc: mahuja, linasvepstas, lkessler, strosake
Add hypervisor-assisted dump to kernel config
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
-----
arch/powerpc/Kconfig | 11 +++++++++++
1 file changed, 11 insertions(+)
Index: 2.6.24-rc5/arch/powerpc/Kconfig
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/Kconfig 2008-02-12 06:12:08.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/Kconfig 2008-02-12 06:13:24.000000000 -0600
@@ -266,6 +266,17 @@ config CRASH_DUMP
Don't change this unless you know what you are doing.
+config PHYP_DUMP
+ bool "Hypervisor-assisted dump (EXPERIMENTAL)"
+ depends on PPC_PSERIES && EXPERIMENTAL
+ default y
+ help
+ Hypervisor-assisted dump is meant to be a kdump replacement
+ offering robustness and speed not possible without system
+ hypervisor assistence.
+
+ If unsure, say "Y"
+
config PPCBUG_NVRAM
bool "Enable reading PPCBUG NVRAM during boot" if PPLUS || LOPEC
default y if PPC_PREP
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept
2008-02-12 7:08 ` [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept Manish Ahuja
@ 2008-02-12 8:48 ` Michael Ellerman
2008-02-12 16:38 ` Manish Ahuja
2008-02-14 3:46 ` Tony Breeds
1 sibling, 1 reply; 71+ messages in thread
From: Michael Ellerman @ 2008-02-12 8:48 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]
On Tue, 2008-02-12 at 01:08 -0600, Manish Ahuja wrote:
> Initial patch for reserving memory in early boot, and freeing it later.
> If the previous boot had ended with a crash, the reserved memory would contain
> a copy of the crashed kernel data.
>
> Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
> Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
>
> ----
> arch/powerpc/kernel/prom.c | 50 ++++++++++++++++++++
> arch/powerpc/kernel/rtas.c | 32 +++++++++++++
> arch/powerpc/platforms/pseries/Makefile | 1
> arch/powerpc/platforms/pseries/phyp_dump.c | 71 +++++++++++++++++++++++++++++
> include/asm-powerpc/phyp_dump.h | 38 +++++++++++++++
> include/asm/rtas.h | 3 +
^^^^^^^^^^^^^^^^^^
asm/rtas.h doesn't exist. You need to clean your tree, and patch
asm-powerpc/rtas.h instead.
cheers
--
Michael Ellerman
OzLabs, IBM Australia Development Lab
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-02-12 7:11 ` [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
@ 2008-02-12 10:08 ` Stephen Rothwell
2008-02-12 16:40 ` Manish Ahuja
2008-02-15 1:05 ` Tony Breeds
1 sibling, 1 reply; 71+ messages in thread
From: Stephen Rothwell @ 2008-02-12 10:08 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
[-- Attachment #1: Type: text/plain, Size: 510 bytes --]
Hi Manish,
Just a small comment.
On Tue, 12 Feb 2008 01:11:58 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
>
> + /* Is there dump data waiting for us? */
> + rtas = of_find_node_by_path("/rtas");
> + dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
You need an of_node_put(rtas) here.
> + if (dump_header == NULL) {
> + release_all();
> + return 0;
> + }
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 4/8] pseries: phyp dump: register dump area.
2008-02-12 7:14 ` [PATCH 4/8] pseries: phyp dump: register dump area Manish Ahuja
@ 2008-02-12 10:11 ` Stephen Rothwell
2008-02-12 16:31 ` Manish Ahuja
0 siblings, 1 reply; 71+ messages in thread
From: Stephen Rothwell @ 2008-02-12 10:11 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
[-- Attachment #1: Type: text/plain, Size: 656 bytes --]
Hi Manish,
> - /* Is there dump data waiting for us? */
> + /* Is there dump data waiting for us? If there isn't,
> + * then register a new dump area, and release all of
> + * the rest of the reserved ram.
> + *
> + * The /rtas/ibm,kernel-dump rtas node is present only
> + * if there is dump data waiting for us.
> + */
> rtas = of_find_node_by_path("/rtas");
> dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
> + of_node_put(rtas);
Oh, here is the of_node_put() - you should move that to patch 3.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 6/8] pseries: phyp dump: Invalidate and print dump areas.
2008-02-12 7:18 ` [PATCH 6/8] pseries: phyp dump: Invalidate and print dump areas Manish Ahuja
@ 2008-02-12 10:18 ` Stephen Rothwell
2008-02-12 16:32 ` Manish Ahuja
2008-02-13 21:43 ` Manish Ahuja
1 sibling, 1 reply; 71+ messages in thread
From: Stephen Rothwell @ 2008-02-12 10:18 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
[-- Attachment #1: Type: text/plain, Size: 521 bytes --]
Hi Manish,
On Tue, 12 Feb 2008 01:18:22 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
>
> -static void
> -release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
> +static
> +void release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
Cosmetic changes like this should normally be put in separate patch as
they just distract from a real review (which this isn't :-))
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 4/8] pseries: phyp dump: register dump area.
2008-02-12 10:11 ` Stephen Rothwell
@ 2008-02-12 16:31 ` Manish Ahuja
0 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 16:31 UTC (permalink / raw)
To: Stephen Rothwell; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
For now, if we can leave this patch as is, that will be great. That move requires me
to work all remaining patches as they apply uncleanly after that.
I will bunch those two together functionally next time onwards.
Thanks,
Manish
Stephen Rothwell wrote:
> Hi Manish,
>
>> - /* Is there dump data waiting for us? */
>> + /* Is there dump data waiting for us? If there isn't,
>> + * then register a new dump area, and release all of
>> + * the rest of the reserved ram.
>> + *
>> + * The /rtas/ibm,kernel-dump rtas node is present only
>> + * if there is dump data waiting for us.
>> + */
>> rtas = of_find_node_by_path("/rtas");
>> dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
>> + of_node_put(rtas);
>
> Oh, here is the of_node_put() - you should move that to patch 3.
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 6/8] pseries: phyp dump: Invalidate and print dump areas.
2008-02-12 10:18 ` Stephen Rothwell
@ 2008-02-12 16:32 ` Manish Ahuja
0 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 16:32 UTC (permalink / raw)
To: Stephen Rothwell; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
Stephen Rothwell wrote:
> Hi Manish,
>
> On Tue, 12 Feb 2008 01:18:22 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
>> -static void
>> -release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
>> +static
>> +void release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
>
> Cosmetic changes like this should normally be put in separate patch as
> they just distract from a real review (which this isn't :-))
>
Okay, will do from future.
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept
2008-02-12 8:48 ` Michael Ellerman
@ 2008-02-12 16:38 ` Manish Ahuja
0 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 16:38 UTC (permalink / raw)
To: michael; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
Michael,
Fixed.
-Manish
----------
Initial patch for reserving memory in early boot, and freeing it later.
If the previous boot had ended with a crash, the reserved memory would contain
a copy of the crashed kernel data.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
arch/powerpc/kernel/prom.c | 50 ++++++++++++++++++++
arch/powerpc/kernel/rtas.c | 32 +++++++++++++
arch/powerpc/platforms/pseries/Makefile | 1
arch/powerpc/platforms/pseries/phyp_dump.c | 71 +++++++++++++++++++++++++++++
include/asm-powerpc/phyp_dump.h | 38 +++++++++++++++
include/asm-powerpc/rtas.h | 3 +
6 files changed, 195 insertions(+)
Index: 2.6.24-rc5/include/asm-powerpc/phyp_dump.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ 2.6.24-rc5/include/asm-powerpc/phyp_dump.h 2008-02-12 16:12:45.000000000 -0600
@@ -0,0 +1,38 @@
+/*
+ * Hypervisor-assisted dump
+ *
+ * Linas Vepstas, Manish Ahuja 2007
+ * Copyright (c) 2007 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _PPC64_PHYP_DUMP_H
+#define _PPC64_PHYP_DUMP_H
+
+#ifdef CONFIG_PHYP_DUMP
+
+/* The RMR region will be saved for later dumping
+ * whenever the kernel crashes. Set this to 256MB. */
+#define PHYP_DUMP_RMR_START 0x0
+#define PHYP_DUMP_RMR_END (1UL<<28)
+
+struct phyp_dump {
+ /* Memory that is reserved during very early boot. */
+ unsigned long init_reserve_start;
+ unsigned long init_reserve_size;
+ /* Check status during boot if dump supported, active & present*/
+ unsigned long phyp_dump_configured;
+ unsigned long phyp_dump_is_active;
+ /* store cpu & hpte size */
+ unsigned long cpu_state_size;
+ unsigned long hpte_region_size;
+};
+
+extern struct phyp_dump *phyp_dump_info;
+
+#endif /* CONFIG_PHYP_DUMP */
+#endif /* _PPC64_PHYP_DUMP_H */
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-12 16:12:45.000000000 -0600
@@ -0,0 +1,71 @@
+/*
+ * Hypervisor-assisted dump
+ *
+ * Linas Vepstas, Manish Ahuja 2007
+ * Copyrhgit (c) 2007 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/pfn.h>
+#include <linux/swap.h>
+
+#include <asm/page.h>
+#include <asm/phyp_dump.h>
+
+/* Global, used to communicate data between early boot and late boot */
+static struct phyp_dump phyp_dump_global;
+struct phyp_dump *phyp_dump_info = &phyp_dump_global;
+
+/**
+ * release_memory_range -- release memory previously lmb_reserved
+ * @start_pfn: starting physical frame number
+ * @nr_pages: number of pages to free.
+ *
+ * This routine will release memory that had been previously
+ * lmb_reserved in early boot. The released memory becomes
+ * available for genreal use.
+ */
+static void
+release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
+{
+ struct page *rpage;
+ unsigned long end_pfn;
+ long i;
+
+ end_pfn = start_pfn + nr_pages;
+
+ for (i=start_pfn; i <= end_pfn; i++) {
+ rpage = pfn_to_page(i);
+ if (PageReserved(rpage)) {
+ ClearPageReserved(rpage);
+ init_page_count(rpage);
+ __free_page(rpage);
+ totalram_pages++;
+ }
+ }
+}
+
+static int __init phyp_dump_setup(void)
+{
+ unsigned long start_pfn, nr_pages;
+
+ /* If no memory was reserved in early boot, there is nothing to do */
+ if (phyp_dump_info->init_reserve_size == 0)
+ return 0;
+
+ /* Release memory that was reserved in early boot */
+ start_pfn = PFN_DOWN(phyp_dump_info->init_reserve_start);
+ nr_pages = PFN_DOWN(phyp_dump_info->init_reserve_size);
+ release_memory_range(start_pfn, nr_pages);
+
+ return 0;
+}
+
+subsys_initcall(phyp_dump_setup);
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/Makefile
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/Makefile 2008-02-12 16:11:44.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/Makefile 2008-02-12 16:12:45.000000000 -0600
@@ -18,3 +18,4 @@ obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu
obj-$(CONFIG_HVC_CONSOLE) += hvconsole.o
obj-$(CONFIG_HVCS) += hvcserver.o
obj-$(CONFIG_HCALL_STATS) += hvCall_inst.o
+obj-$(CONFIG_PHYP_DUMP) += phyp_dump.o
Index: 2.6.24-rc5/arch/powerpc/kernel/prom.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/kernel/prom.c 2008-02-12 16:11:44.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/kernel/prom.c 2008-02-12 16:12:45.000000000 -0600
@@ -51,6 +51,7 @@
#include <asm/machdep.h>
#include <asm/pSeries_reconfig.h>
#include <asm/pci-bridge.h>
+#include <asm/phyp_dump.h>
#include <asm/kexec.h>
#ifdef DEBUG
@@ -1011,6 +1012,52 @@ static void __init early_reserve_mem(voi
#endif
}
+#ifdef CONFIG_PHYP_DUMP
+/**
+ * reserve_crashed_mem() - reserve all not-yet-dumped mmemory
+ *
+ * This routine may reserve memory regions in the kernel only
+ * if the system is supported and a dump was taken in last
+ * boot instance or if the hardware is supported and the
+ * scratch area needs to be setup. In other instances it returns
+ * without reserving anything. The memory in case of dump being
+ * active is freed when the dump is collected (by userland tools).
+ */
+static void __init reserve_crashed_mem(void)
+{
+ unsigned long base, size;
+ if (!phyp_dump_info->phyp_dump_configured) {
+ printk(KERN_ERR "Phyp-dump not supported on this hardware\n");
+ return;
+ }
+
+ if (phyp_dump_info->phyp_dump_is_active) {
+ /* Reserve *everything* above RMR.Area freed by userland tools*/
+ base = PHYP_DUMP_RMR_END;
+ size = lmb_end_of_DRAM() - base;
+
+ /* XXX crashed_ram_end is wrong, since it may be beyond
+ * the memory_limit, it will need to be adjusted. */
+ lmb_reserve(base, size);
+
+ phyp_dump_info->init_reserve_start = base;
+ phyp_dump_info->init_reserve_size = size;
+ }
+ else {
+ size = phyp_dump_info->cpu_state_size +
+ phyp_dump_info->hpte_region_size +
+ PHYP_DUMP_RMR_END;
+ base = lmb_end_of_DRAM() - size;
+ lmb_reserve(base, size);
+ phyp_dump_info->init_reserve_start = base;
+ phyp_dump_info->init_reserve_size = size;
+ }
+}
+#else
+static inline void __init reserve_crashed_mem(void) {}
+#endif /* CONFIG_PHYP_DUMP */
+
+
void __init early_init_devtree(void *params)
{
DBG(" -> early_init_devtree(%p)\n", params);
@@ -1022,6 +1069,8 @@ void __init early_init_devtree(void *par
/* Some machines might need RTAS info for debugging, grab it now. */
of_scan_flat_dt(early_init_dt_scan_rtas, NULL);
#endif
+ /* scan tree to see if dump occured during last boot */
+ of_scan_flat_dt(early_init_dt_scan_phyp_dump, NULL);
/* Retrieve various informations from the /chosen node of the
* device-tree, including the platform type, initrd location and
@@ -1043,6 +1092,7 @@ void __init early_init_devtree(void *par
reserve_kdump_trampoline();
reserve_crashkernel();
early_reserve_mem();
+ reserve_crashed_mem();
lmb_enforce_memory_limit(memory_limit);
lmb_analyze();
Index: 2.6.24-rc5/arch/powerpc/kernel/rtas.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/kernel/rtas.c 2008-02-12 16:11:44.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/kernel/rtas.c 2008-02-12 16:12:45.000000000 -0600
@@ -39,6 +39,7 @@
#include <asm/syscalls.h>
#include <asm/smp.h>
#include <asm/atomic.h>
+#include <asm/phyp_dump.h>
struct rtas_t rtas = {
.lock = SPIN_LOCK_UNLOCKED
@@ -883,6 +884,37 @@ void __init rtas_initialize(void)
#endif
}
+int __init early_init_dt_scan_phyp_dump(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+#ifdef CONFIG_PHYP_DUMP
+ const unsigned int *sizes;
+
+ phyp_dump_info->phyp_dump_configured = 0;
+ phyp_dump_info->phyp_dump_is_active = 0;
+
+ if (depth != 1 || strcmp(uname, "rtas") != 0)
+ return 0;
+
+ if (of_get_flat_dt_prop(node, "ibm,configure-kernel-dump", NULL))
+ phyp_dump_info->phyp_dump_configured++;
+
+ if (of_get_flat_dt_prop(node, "ibm,dump-kernel", NULL))
+ phyp_dump_info->phyp_dump_is_active++;
+
+ sizes = of_get_flat_dt_prop(node, "ibm,configure-kernel-dump-sizes", NULL);
+ if (!sizes)
+ return 0;
+
+ if (sizes[0] == 1)
+ phyp_dump_info->cpu_state_size = *((unsigned long *)&sizes[1]);
+
+ if (sizes[3] == 2)
+ phyp_dump_info->hpte_region_size = *((unsigned long *)&sizes[4]);
+#endif
+ return 1;
+}
+
int __init early_init_dt_scan_rtas(unsigned long node,
const char *uname, int depth, void *data)
{
Index: 2.6.24-rc5/include/asm-powerpc/rtas.h
===================================================================
--- 2.6.24-rc5.orig/include/asm-powerpc/rtas.h 2008-02-12 16:09:48.000000000 -0600
+++ 2.6.24-rc5/include/asm-powerpc/rtas.h 2008-02-12 16:12:45.000000000 -0600
@@ -183,6 +183,9 @@ extern unsigned int rtas_busy_delay(int
extern int early_init_dt_scan_rtas(unsigned long node,
const char *uname, int depth, void *data);
+int early_init_dt_scan_phyp_dump(unsigned long node,
+ const char *uname, int depth, void *data);
+
extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal);
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-02-12 10:08 ` Stephen Rothwell
@ 2008-02-12 16:40 ` Manish Ahuja
0 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-12 16:40 UTC (permalink / raw)
To: Stephen Rothwell; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
As noted, its fixed in patch 4.
If its okay for this time, I will prefer to leave it there.
-Manish
Stephen Rothwell wrote:
> Hi Manish,
>
> Just a small comment.
>
> On Tue, 12 Feb 2008 01:11:58 -0600 Manish Ahuja <ahuja@austin.ibm.com> wrote:
>> + /* Is there dump data waiting for us? */
>> + rtas = of_find_node_by_path("/rtas");
>> + dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
>
> You need an of_node_put(rtas) here.
>
>> + if (dump_header == NULL) {
>> + release_all();
>> + return 0;
>> + }
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 6/8] pseries: phyp dump: Invalidate and print dump areas.
2008-02-12 7:18 ` [PATCH 6/8] pseries: phyp dump: Invalidate and print dump areas Manish Ahuja
2008-02-12 10:18 ` Stephen Rothwell
@ 2008-02-13 21:43 ` Manish Ahuja
1 sibling, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-13 21:43 UTC (permalink / raw)
To: linuxppc-dev, paulus; +Cc: mahuja, linasvepstas
Changed asm to asm-powerpc.
Hopefully this was the last of them.
-Manish
----------------------------------------
Routines to
a. invalidate dump
b. Calculate region that is reserved and needs to be freed. This is
exported through sysfs interface.
Unregister has been removed for now as it wasn't being used.
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
-----
---
arch/powerpc/platforms/pseries/phyp_dump.c | 85 +++++++++++++++++++++++++----
include/asm-powerpc/phyp_dump.h | 3 +
2 files changed, 77 insertions(+), 11 deletions(-)
Index: 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.24-rc5.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-13 21:21:00.000000000 -0600
+++ 2.6.24-rc5/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-13 21:21:48.000000000 -0600
@@ -69,6 +69,10 @@ static struct phyp_dump_header phdr;
#define DUMP_SOURCE_CPU 0x0001
#define DUMP_SOURCE_HPTE 0x0002
#define DUMP_SOURCE_RMO 0x0011
+#define DUMP_ERROR_FLAG 0x2000
+#define DUMP_TRIGGERED 0x4000
+#define DUMP_PERFORMED 0x8000
+
/**
* init_dump_header() - initialize the header declaring a dump
@@ -180,9 +184,15 @@ static void print_dump_header(const stru
static void register_dump_area(struct phyp_dump_header *ph, unsigned long addr)
{
int rc;
- ph->cpu_data.destination_address += addr;
- ph->hpte_data.destination_address += addr;
- ph->kernel_data.destination_address += addr;
+
+ /* Add addr value if not initialized before */
+ if (ph->cpu_data.destination_address == 0) {
+ ph->cpu_data.destination_address += addr;
+ ph->hpte_data.destination_address += addr;
+ ph->kernel_data.destination_address += addr;
+ }
+
+ /* ToDo Invalidate kdump and free memory range. */
do {
rc = rtas_call(ibm_configure_kernel_dump, 3, 1, NULL,
@@ -195,6 +205,30 @@ static void register_dump_area(struct ph
}
}
+static
+void invalidate_last_dump(struct phyp_dump_header *ph, unsigned long addr)
+{
+ int rc;
+
+ /* Add addr value if not initialized before */
+ if (ph->cpu_data.destination_address == 0) {
+ ph->cpu_data.destination_address += addr;
+ ph->hpte_data.destination_address += addr;
+ ph->kernel_data.destination_address += addr;
+ }
+
+ do {
+ rc = rtas_call(ibm_configure_kernel_dump, 3, 1, NULL,
+ 2, ph, sizeof(struct phyp_dump_header));
+ } while (rtas_busy_delay(rc));
+
+ if (rc) {
+ printk (KERN_ERR "phyp-dump: unexpected error (%d) "
+ "on invalidate\n", rc);
+ print_dump_header(ph);
+ }
+}
+
/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
@@ -205,8 +239,8 @@ static void register_dump_area(struct ph
* lmb_reserved in early boot. The released memory becomes
* available for genreal use.
*/
-static void
-release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
+static
+void release_memory_range(unsigned long start_pfn, unsigned long nr_pages)
{
struct page *rpage;
unsigned long end_pfn;
@@ -237,8 +271,8 @@ release_memory_range(unsigned long start
*
* will release 256MB starting at 1GB.
*/
-static ssize_t
-store_release_region(struct kset *kset, const char *buf, size_t count)
+static
+ssize_t store_release_region(struct kset *kset, const char *buf, size_t count)
{
unsigned long start_addr, length, end_addr;
unsigned long start_pfn, nr_pages;
@@ -266,10 +300,23 @@ store_release_region(struct kset *kset,
return count;
}
-static ssize_t
-show_release_region(struct kset * kset, char *buf)
+static ssize_t show_release_region(struct kset * kset, char *buf)
{
- return sprintf(buf, "ola\n");
+ u64 second_addr_range;
+
+ /* total reserved size - start of scratch area */
+ second_addr_range = phyp_dump_info->init_reserve_size -
+ phyp_dump_info->reserved_scratch_size;
+ return sprintf(buf, "CPU:0x%lx-0x%lx: HPTE:0x%lx-0x%lx:"
+ " DUMP:0x%lx-0x%lx, 0x%lx-0x%lx:\n",
+ phdr.cpu_data.destination_address,
+ phdr.cpu_data.length_copied,
+ phdr.hpte_data.destination_address,
+ phdr.hpte_data.length_copied,
+ phdr.kernel_data.destination_address,
+ phdr.kernel_data.length_copied,
+ phyp_dump_info->init_reserve_start,
+ second_addr_range);
}
static struct subsys_attribute rr = __ATTR(release_region, 0600,
@@ -293,7 +340,6 @@ static int __init phyp_dump_setup(void)
if (!phyp_dump_info->phyp_dump_configured) {
return -ENOSYS;
}
- print_dump_header(dump_header);
/* Is there dump data waiting for us? If there isn't,
* then register a new dump area, and release all of
@@ -305,6 +351,7 @@ static int __init phyp_dump_setup(void)
rtas = of_find_node_by_path("/rtas");
dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
of_node_put(rtas);
+ print_dump_header(dump_header);
dump_area_length = init_dump_header(&phdr);
dump_area_start = phyp_dump_info->init_reserve_start & PAGE_MASK; /* align down */
@@ -314,6 +361,22 @@ static int __init phyp_dump_setup(void)
return 0;
}
+ /* re-register the dump area, if old dump was invalid */
+ if ((dump_header) && (dump_header->status & DUMP_ERROR_FLAG)) {
+ invalidate_last_dump(&phdr, dump_area_start);
+ register_dump_area(&phdr, dump_area_start);
+ return 0;
+ }
+
+ if (dump_header) {
+ phyp_dump_info->reserved_scratch_addr =
+ dump_header->cpu_data.destination_address;
+ phyp_dump_info->reserved_scratch_size =
+ dump_header->cpu_data.source_length +
+ dump_header->hpte_data.source_length +
+ dump_header->kernel_data.source_length;
+ }
+
/* Should we create a dump_subsys, analogous to s390/ipl.c ? */
rc = subsys_create_file(&kernel_subsys, &rr);
if (rc)
Index: 2.6.24-rc5/include/asm-powerpc/phyp_dump.h
===================================================================
--- 2.6.24-rc5.orig/include/asm-powerpc/phyp_dump.h 2008-02-13 21:21:00.000000000 -0600
+++ 2.6.24-rc5/include/asm-powerpc/phyp_dump.h 2008-02-13 21:22:10.000000000 -0600
@@ -30,6 +30,9 @@ struct phyp_dump {
/* store cpu & hpte size */
unsigned long cpu_state_size;
unsigned long hpte_region_size;
+ /* previous scratch area values */
+ unsigned long reserved_scratch_addr;
+ unsigned long reserved_scratch_size;
};
extern struct phyp_dump *phyp_dump_info;
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept
2008-02-12 7:08 ` [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept Manish Ahuja
2008-02-12 8:48 ` Michael Ellerman
@ 2008-02-14 3:46 ` Tony Breeds
2008-02-14 23:12 ` Olof Johansson
1 sibling, 1 reply; 71+ messages in thread
From: Tony Breeds @ 2008-02-14 3:46 UTC (permalink / raw)
To: Manish Ahuja; +Cc: linuxppc-dev
On Tue, Feb 12, 2008 at 01:08:25AM -0600, Manish Ahuja wrote:
>
> Initial patch for reserving memory in early boot, and freeing it later.
> If the previous boot had ended with a crash, the reserved memory would contain
> a copy of the crashed kernel data.
>
> Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
> Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
>
> ----
> arch/powerpc/kernel/prom.c | 50 ++++++++++++++++++++
> arch/powerpc/kernel/rtas.c | 32 +++++++++++++
> arch/powerpc/platforms/pseries/Makefile | 1
> arch/powerpc/platforms/pseries/phyp_dump.c | 71 +++++++++++++++++++++++++++++
> include/asm-powerpc/phyp_dump.h | 38 +++++++++++++++
> include/asm/rtas.h | 3 +
> 6 files changed, 195 insertions(+)
>
> Index: 2.6.24-rc5/include/asm-powerpc/phyp_dump.h
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ 2.6.24-rc5/include/asm-powerpc/phyp_dump.h 2008-02-12 06:12:37.000000000 -0600
> @@ -0,0 +1,38 @@
> +/*
> + * Hypervisor-assisted dump
> + *
> + * Linas Vepstas, Manish Ahuja 2007
> + * Copyright (c) 2007 IBM Corp.
Hi Manish,
Sorry for the minor nits but this should be:
---
* Linas Vepstas, Manish Ahuja 2008
* Copyright 2008 IBM Corp.
---
You can optionally use the '©' symbol after word 'Copyright' but you
shouldn't use '(c)' anymore.
Also in at least one place you've misspelt "Copyright"
Yours Tony
linux.conf.au http://linux.conf.au/ || http://lca2008.linux.org.au/
Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept
2008-02-14 3:46 ` Tony Breeds
@ 2008-02-14 23:12 ` Olof Johansson
2008-02-15 7:16 ` Manish Ahuja
0 siblings, 1 reply; 71+ messages in thread
From: Olof Johansson @ 2008-02-14 23:12 UTC (permalink / raw)
To: Tony Breeds; +Cc: linuxppc-dev
On Thu, Feb 14, 2008 at 02:46:21PM +1100, Tony Breeds wrote:
> Hi Manish,
> Sorry for the minor nits but this should be:
>
> ---
> * Linas Vepstas, Manish Ahuja 2008
> * Copyright 2008 IBM Corp.
> ---
>
> You can optionally use the '??' symbol after word 'Copyright' but you
> shouldn't use '(c)' anymore.
>
> Also in at least one place you've misspelt "Copyright"
If we're going to nitpick, then I'd like to point out that the whole
series needs to be run through checkpatch and at least the whitespace
issues should be taken care of.
I'm still not convinced that this is a useful feature compared to
hardening kdump, especially now that ehea can handle kexec/kdump (patch
posted the other day). But in the end it's up to Paul if he wants to
take it or not, not me.
-Olof
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-02-12 7:11 ` [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
2008-02-12 10:08 ` Stephen Rothwell
@ 2008-02-15 1:05 ` Tony Breeds
2008-02-15 7:17 ` Manish Ahuja
2008-02-15 17:30 ` Linas Vepstas
1 sibling, 2 replies; 71+ messages in thread
From: Tony Breeds @ 2008-02-15 1:05 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
On Tue, Feb 12, 2008 at 01:11:58AM -0600, Manish Ahuja wrote:
<snip>
> +static ssize_t
> +show_release_region(struct kset * kset, char *buf)
> +{
> + return sprintf(buf, "ola\n");
> +}
> +
> +static struct subsys_attribute rr = __ATTR(release_region, 0600,
> + show_release_region,
> + store_release_region);
Any reason this sysfs attribute can't be write only? The show method
doesn't seem needed.
> +static int __init phyp_dump_setup(void)
> +{
<snip>
> + /* Is there dump data waiting for us? */
> + rtas = of_find_node_by_path("/rtas");
> + dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
Hmm this isn't good. You need to check rtas != NULL.
> + if (dump_header == NULL) {
> + release_all();
> + return 0;
> + }
> +
> + /* Should we create a dump_subsys, analogous to s390/ipl.c ? */
> + rc = subsys_create_file(&kernel_subsys, &rr);
> + if (rc) {
> + printk (KERN_ERR "phyp-dump: unable to create sysfs file (%d)\n", rc);
> + release_all();
> + return 0;
> + }
>
> return 0;
> }
> -
> subsys_initcall(phyp_dump_setup);
Hmm I think this really should be a:
machine_subsys_initcall(pseries, phyp_dump_setup)
Yours Tony
linux.conf.au http://linux.conf.au/ || http://lca2008.linux.org.au/
Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept
2008-02-14 23:12 ` Olof Johansson
@ 2008-02-15 7:16 ` Manish Ahuja
0 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-15 7:16 UTC (permalink / raw)
To: Olof Johansson; +Cc: linuxppc-dev
Olof,
I will run it through checkpatch before resubmitting.
Thanks,
Manish
Olof Johansson wrote:
> On Thu, Feb 14, 2008 at 02:46:21PM +1100, Tony Breeds wrote:
>
>> Hi Manish,
>> Sorry for the minor nits but this should be:
>>
>> ---
>> * Linas Vepstas, Manish Ahuja 2008
>> * Copyright 2008 IBM Corp.
>> ---
>>
>> You can optionally use the '??' symbol after word 'Copyright' but you
>> shouldn't use '(c)' anymore.
>>
>> Also in at least one place you've misspelt "Copyright"
>
> If we're going to nitpick, then I'd like to point out that the whole
> series needs to be run through checkpatch and at least the whitespace
> issues should be taken care of.
>
> I'm still not convinced that this is a useful feature compared to
> hardening kdump, especially now that ehea can handle kexec/kdump (patch
> posted the other day). But in the end it's up to Paul if he wants to
> take it or not, not me.
>
>
> -Olof
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-02-15 1:05 ` Tony Breeds
@ 2008-02-15 7:17 ` Manish Ahuja
2008-02-15 22:32 ` Tony Breeds
2008-02-15 17:30 ` Linas Vepstas
1 sibling, 1 reply; 71+ messages in thread
From: Manish Ahuja @ 2008-02-15 7:17 UTC (permalink / raw)
To: Tony Breeds; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
Tony Breeds wrote:
> On Tue, Feb 12, 2008 at 01:11:58AM -0600, Manish Ahuja wrote:
>
> <snip>
>
>> +static ssize_t
>> +show_release_region(struct kset * kset, char *buf)
>> +{
>> + return sprintf(buf, "ola\n");
>> +}
>> +
>> +static struct subsys_attribute rr = __ATTR(release_region, 0600,
>> + show_release_region,
>> + store_release_region);
>
> Any reason this sysfs attribute can't be write only? The show method
> doesn't seem needed.
yes, its used later in the code.
>
>> +static int __init phyp_dump_setup(void)
>> +{
>
> <snip>
>
>> + /* Is there dump data waiting for us? */
>> + rtas = of_find_node_by_path("/rtas");
>> + dump_header = of_get_property(rtas, "ibm,kernel-dump", &header_len);
>
> Hmm this isn't good. You need to check rtas != NULL.
yes, will fix this as well.
>
>> + if (dump_header == NULL) {
>> + release_all();
>> + return 0;
>> + }
>> +
>> + /* Should we create a dump_subsys, analogous to s390/ipl.c ? */
>> + rc = subsys_create_file(&kernel_subsys, &rr);
>> + if (rc) {
>> + printk (KERN_ERR "phyp-dump: unable to create sysfs file (%d)\n", rc);
>> + release_all();
>> + return 0;
>> + }
>>
>> return 0;
>> }
>> -
>> subsys_initcall(phyp_dump_setup);
>
> Hmm I think this really should be a:
> machine_subsys_initcall(pseries, phyp_dump_setup)
>
> Yours Tony
>
> linux.conf.au http://linux.conf.au/ || http://lca2008.linux.org.au/
> Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-02-15 1:05 ` Tony Breeds
2008-02-15 7:17 ` Manish Ahuja
@ 2008-02-15 17:30 ` Linas Vepstas
1 sibling, 0 replies; 71+ messages in thread
From: Linas Vepstas @ 2008-02-15 17:30 UTC (permalink / raw)
To: Tony Breeds; +Cc: mahuja, linuxppc-dev, paulus
On 14/02/2008, Tony Breeds <tony@bakeyournoodle.com> wrote:
> On Tue, Feb 12, 2008 at 01:11:58AM -0600, Manish Ahuja wrote:
> > +static ssize_t
> > +show_release_region(struct kset * kset, char *buf)
> > +{
> > + return sprintf(buf, "ola\n");
> > +}
> > +
> > +static struct subsys_attribute rr = __ATTR(release_region, 0600,
> > + show_release_region,
> > + store_release_region);
>
>
> Any reason this sysfs attribute can't be write only? The show method
> doesn't seem needed.
This was supposed to be a place-holder; a later patch would add detailed
info. The goal was to have user-land tools that would operate these files
to progressively dump and release memory regions; however, until these
userland tools get written, the proper interface remains murky (e.g.
real addresses? virtual addresses? just delta's or a whole memory map?
some sort of numa flags or whatever?)
--linas
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-02-15 7:17 ` Manish Ahuja
@ 2008-02-15 22:32 ` Tony Breeds
0 siblings, 0 replies; 71+ messages in thread
From: Tony Breeds @ 2008-02-15 22:32 UTC (permalink / raw)
To: Manish Ahuja; +Cc: mahuja, linuxppc-dev, linasvepstas, paulus
On Fri, Feb 15, 2008 at 01:17:16AM -0600, Manish Ahuja wrote:
> Tony Breeds wrote:
> > Any reason this sysfs attribute can't be write only? The show method
> > doesn't seem needed.
>
> yes, its used later in the code.
I see that now, thanks. From my point of view it would make reviewing
these patches easier if each patch was a correct and simple as possible.
In this case it would have made the review easier if the sysfs attribute
was write only now and then modified to add the read side when it's
actually implemented. The same goes for fixing typosi, cosmetic changes
and reference counting.
Looking forward to a respin of this patch series.
Yours Tony
linux.conf.au http://linux.conf.au/ || http://lca2008.linux.org.au/
Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-02-18 4:53 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
@ 2008-02-18 5:38 ` Manish Ahuja
2008-02-22 0:53 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Michael Ellerman
1 sibling, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-02-18 5:38 UTC (permalink / raw)
To: ppc-dev, paulus, Linas Vepstas; +Cc: mahuja
Check to see if there actually is data from a previously
crashed kernel waiting. If so, Allow user-sapce tools to
grab the data (by reading /proc/kcore). When user-space
finishes dumping a section, it must release that memory
by writing to sysfs. For example,
echo "0x40000000 0x10000000" > /sys/kernel/release_region
will release 256MB starting at the 1GB. The released memory
becomes free for general use.
Signed-off-by: Linas Vepstas <linasvepstas@gmail.com>
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
------
arch/powerpc/platforms/pseries/phyp_dump.c | 81 +++++++++++++++++++++++++++--
1 file changed, 76 insertions(+), 5 deletions(-)
Index: 2.6.25-rc1/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.25-rc1.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-18 03:23:47.000000000 -0600
+++ 2.6.25-rc1/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-18 04:32:13.000000000 -0600
@@ -12,18 +12,23 @@
*/
#include <linux/init.h>
+#include <linux/kobject.h>
#include <linux/mm.h>
+#include <linux/of.h>
#include <linux/pfn.h>
#include <linux/swap.h>
+#include <linux/sysfs.h>
#include <asm/page.h>
#include <asm/phyp_dump.h>
#include <asm/machdep.h>
+#include <asm/rtas.h>
/* Global, used to communicate data between early boot and late boot */
static struct phyp_dump phyp_dump_global;
struct phyp_dump *phyp_dump_info = &phyp_dump_global;
+/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
* @start_pfn: starting physical frame number
@@ -53,18 +58,84 @@ release_memory_range(unsigned long start
}
}
-static int __init phyp_dump_setup(void)
+/* ------------------------------------------------- */
+/**
+ * sysfs_release_region -- sysfs interface to release memory range.
+ *
+ * Usage:
+ * "echo <start addr> <length> > /sys/kernel/release_region"
+ *
+ * Example:
+ * "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
+ *
+ * will release 256MB starting at 1GB.
+ */
+static ssize_t store_release_region(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
{
+ unsigned long start_addr, length, end_addr;
unsigned long start_pfn, nr_pages;
+ ssize_t ret;
+
+ ret = sscanf(buf, "%lx %lx", &start_addr, &length);
+ if (ret != 2)
+ return -EINVAL;
+
+ /* Range-check - don't free any reserved memory that
+ * wasn't reserved for phyp-dump */
+ if (start_addr < phyp_dump_info->init_reserve_start)
+ start_addr = phyp_dump_info->init_reserve_start;
+
+ end_addr = phyp_dump_info->init_reserve_start +
+ phyp_dump_info->init_reserve_size;
+ if (start_addr+length > end_addr)
+ length = end_addr - start_addr;
+
+ /* Release the region of memory assed in by user */
+ start_pfn = PFN_DOWN(start_addr);
+ nr_pages = PFN_DOWN(length);
+ release_memory_range(start_pfn, nr_pages);
+
+ return count;
+}
+
+static struct kobj_attribute rr = __ATTR(release_region, 0600,
+ NULL, store_release_region);
+
+static int __init phyp_dump_setup(void)
+{
+ struct device_node *rtas;
+ const int *dump_header = NULL;
+ int header_len = 0;
+ int rc;
/* If no memory was reserved in early boot, there is nothing to do */
if (phyp_dump_info->init_reserve_size == 0)
return 0;
- /* Release memory that was reserved in early boot */
- start_pfn = PFN_DOWN(phyp_dump_info->init_reserve_start);
- nr_pages = PFN_DOWN(phyp_dump_info->init_reserve_size);
- release_memory_range(start_pfn, nr_pages);
+ /* Return if phyp dump not supported */
+ if (!phyp_dump_info->phyp_dump_configured)
+ return -ENOSYS;
+
+ /* Is there dump data waiting for us? */
+ rtas = of_find_node_by_path("/rtas");
+ if (rtas) {
+ dump_header = of_get_property(rtas, "ibm,kernel-dump",
+ &header_len);
+ of_node_put(rtas);
+ }
+
+ if (dump_header == NULL)
+ return 0;
+
+ /* Should we create a dump_subsys, analogous to s390/ipl.c ? */
+ rc = sysfs_create_file(kernel_kobj, &rr.attr);
+ if (rc) {
+ printk(KERN_ERR "phyp-dump: unable to create sysfs file (%d)\n",
+ rc);
+ return 0;
+ }
return 0;
}
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-02-28 23:57 ` Manish Ahuja
@ 2008-02-29 0:27 ` Manish Ahuja
2008-03-11 6:16 ` Paul Mackerras
0 siblings, 1 reply; 71+ messages in thread
From: Manish Ahuja @ 2008-02-29 0:27 UTC (permalink / raw)
To: linuxppc-dev, paulus, michael; +Cc: mahuja, linasvepstas
Check to see if there actually is data from a previously
crashed kernel waiting. If so, Allow user-sapce tools to
grab the data (by reading /proc/kcore). When user-space
finishes dumping a section, it must release that memory
by writing to sysfs. For example,
echo "0x40000000 0x10000000" > /sys/kernel/release_region
will release 256MB starting at the 1GB. The released memory
becomes free for general use.
Signed-off-by: Linas Vepstas <linasvepstas@gmail.com>
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
------
arch/powerpc/platforms/pseries/phyp_dump.c | 82 +++++++++++++++++++++++++++--
1 file changed, 77 insertions(+), 5 deletions(-)
Index: 2.6.25-rc1/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.25-rc1.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-28 21:57:52.000000000 -0600
+++ 2.6.25-rc1/arch/powerpc/platforms/pseries/phyp_dump.c 2008-02-28 23:36:01.000000000 -0600
@@ -12,19 +12,25 @@
*/
#include <linux/init.h>
+#include <linux/kobject.h>
#include <linux/mm.h>
+#include <linux/of.h>
#include <linux/pfn.h>
#include <linux/swap.h>
+#include <linux/sysfs.h>
#include <asm/page.h>
#include <asm/phyp_dump.h>
#include <asm/machdep.h>
#include <asm/prom.h>
+#include <asm/rtas.h>
+
/* Global, used to communicate data between early boot and late boot */
static struct phyp_dump phyp_dump_global;
struct phyp_dump *phyp_dump_info = &phyp_dump_global;
+/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
* @start_pfn: starting physical frame number
@@ -54,18 +60,84 @@ release_memory_range(unsigned long start
}
}
-static int __init phyp_dump_setup(void)
+/* ------------------------------------------------- */
+/**
+ * sysfs_release_region -- sysfs interface to release memory range.
+ *
+ * Usage:
+ * "echo <start addr> <length> > /sys/kernel/release_region"
+ *
+ * Example:
+ * "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
+ *
+ * will release 256MB starting at 1GB.
+ */
+static ssize_t store_release_region(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
{
+ unsigned long start_addr, length, end_addr;
unsigned long start_pfn, nr_pages;
+ ssize_t ret;
+
+ ret = sscanf(buf, "%lx %lx", &start_addr, &length);
+ if (ret != 2)
+ return -EINVAL;
+
+ /* Range-check - don't free any reserved memory that
+ * wasn't reserved for phyp-dump */
+ if (start_addr < phyp_dump_info->init_reserve_start)
+ start_addr = phyp_dump_info->init_reserve_start;
+
+ end_addr = phyp_dump_info->init_reserve_start +
+ phyp_dump_info->init_reserve_size;
+ if (start_addr+length > end_addr)
+ length = end_addr - start_addr;
+
+ /* Release the region of memory assed in by user */
+ start_pfn = PFN_DOWN(start_addr);
+ nr_pages = PFN_DOWN(length);
+ release_memory_range(start_pfn, nr_pages);
+
+ return count;
+}
+
+static struct kobj_attribute rr = __ATTR(release_region, 0600,
+ NULL, store_release_region);
+
+static int __init phyp_dump_setup(void)
+{
+ struct device_node *rtas;
+ const int *dump_header = NULL;
+ int header_len = 0;
+ int rc;
/* If no memory was reserved in early boot, there is nothing to do */
if (phyp_dump_info->init_reserve_size == 0)
return 0;
- /* Release memory that was reserved in early boot */
- start_pfn = PFN_DOWN(phyp_dump_info->init_reserve_start);
- nr_pages = PFN_DOWN(phyp_dump_info->init_reserve_size);
- release_memory_range(start_pfn, nr_pages);
+ /* Return if phyp dump not supported */
+ if (!phyp_dump_info->phyp_dump_configured)
+ return -ENOSYS;
+
+ /* Is there dump data waiting for us? */
+ rtas = of_find_node_by_path("/rtas");
+ if (rtas) {
+ dump_header = of_get_property(rtas, "ibm,kernel-dump",
+ &header_len);
+ of_node_put(rtas);
+ }
+
+ if (dump_header == NULL)
+ return 0;
+
+ /* Should we create a dump_subsys, analogous to s390/ipl.c ? */
+ rc = sysfs_create_file(kernel_kobj, &rr.attr);
+ if (rc) {
+ printk(KERN_ERR "phyp-dump: unable to create sysfs file (%d)\n",
+ rc);
+ return 0;
+ }
return 0;
}
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-02-29 0:27 ` [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
@ 2008-03-11 6:16 ` Paul Mackerras
2008-03-11 16:44 ` Dale Farnsworth
2008-03-12 17:38 ` Linas Vepstas
0 siblings, 2 replies; 71+ messages in thread
From: Paul Mackerras @ 2008-03-11 6:16 UTC (permalink / raw)
To: Manish Ahuja; +Cc: linuxppc-dev, linasvepstas, mahuja
Manish Ahuja writes:
> Check to see if there actually is data from a previously
> crashed kernel waiting. If so, Allow user-sapce tools to
> grab the data (by reading /proc/kcore). When user-space
> finishes dumping a section, it must release that memory
> by writing to sysfs. For example,
>
> echo "0x40000000 0x10000000" > /sys/kernel/release_region
>
> will release 256MB starting at the 1GB. The released memory
> becomes free for general use.
>
> Signed-off-by: Linas Vepstas <linasvepstas@gmail.com>
> Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
>
> ------
This line needs to be exactly 3 dashes, because otherwise the tools
include the diffstat into the commit message. Putting 4 or more
dashes was an annoying habit Linas had, and it means I have to fix it
manually (usually after I have committed the patches, and then notice
that the commit message has the extra stuff in it, so I have to go
back and fix the separators, reset my tree and re-commit the patches.)
> + dump_header = of_get_property(rtas, "ibm,kernel-dump",
> + &header_len);
This is a somewhat weird-looking way of coping with too-long lines.
Please indent the second line either one more tab than the first line,
or else so that it starts just after the '(' in the first line (which
is what emacs will do by default). The same comment applies in
several other places.
Paul.
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-03-11 6:16 ` Paul Mackerras
@ 2008-03-11 16:44 ` Dale Farnsworth
2008-03-12 17:38 ` Linas Vepstas
1 sibling, 0 replies; 71+ messages in thread
From: Dale Farnsworth @ 2008-03-11 16:44 UTC (permalink / raw)
To: Linuxppc-dev
Paul wrote:
> Manish Ahuja writes:
> > + dump_header = of_get_property(rtas, "ibm,kernel-dump",
> > + &header_len);
>
> This is a somewhat weird-looking way of coping with too-long lines.
Yes, but not too surprising, since it precisely follows the recommendation
(and the example) in Chapter 2 of Documentation/CodingStyle. :)
-Dale
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-03-11 6:16 ` Paul Mackerras
2008-03-11 16:44 ` Dale Farnsworth
@ 2008-03-12 17:38 ` Linas Vepstas
1 sibling, 0 replies; 71+ messages in thread
From: Linas Vepstas @ 2008-03-12 17:38 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, mahuja
On 11/03/2008, Paul Mackerras <paulus@samba.org> wrote:
>
> > ------
>
> This line needs to be exactly 3 dashes, because otherwise the tools
> include the diffstat into the commit message. Putting 4 or more
> dashes was an annoying habit Linas had, and it means I have to fix it
> manually (usually after I have committed the patches, and then notice
> that the commit message has the extra stuff in it, so I have to go
> back and fix the separators, reset my tree and re-commit the patches.)
Sorry, I had no idea! If I didn't have enough dashes, then quilt would
sometimes wipe out the comment at the top, so paranoia made me
add lots of dashes.
--linas
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem
2008-03-21 22:42 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
@ 2008-03-21 23:39 ` Manish Ahuja
0 siblings, 0 replies; 71+ messages in thread
From: Manish Ahuja @ 2008-03-21 23:39 UTC (permalink / raw)
To: linuxppc-dev, paulus, michael; +Cc: mahuja, linasvepstas
Check to see if there actually is data from a previously
crashed kernel waiting. If so, Allow user-sapce tools to
grab the data (by reading /proc/kcore). When user-space
finishes dumping a section, it must release that memory
by writing to sysfs. For example,
echo "0x40000000 0x10000000" > /sys/kernel/release_region
will release 256MB starting at the 1GB. The released memory
becomes free for general use.
Signed-off-by: Linas Vepstas <linasvepstas@gmail.com>
Signed-off-by: Manish Ahuja <mahuja@us.ibm.com>
---
arch/powerpc/platforms/pseries/phyp_dump.c | 81 +++++++++++++++++++++++++++--
1 file changed, 76 insertions(+), 5 deletions(-)
Index: 2.6.25-rc1/arch/powerpc/platforms/pseries/phyp_dump.c
===================================================================
--- 2.6.25-rc1.orig/arch/powerpc/platforms/pseries/phyp_dump.c 2008-03-21 00:10:15.000000000 -0500
+++ 2.6.25-rc1/arch/powerpc/platforms/pseries/phyp_dump.c 2008-03-21 22:39:21.000000000 -0500
@@ -12,19 +12,24 @@
*/
#include <linux/init.h>
+#include <linux/kobject.h>
#include <linux/mm.h>
+#include <linux/of.h>
#include <linux/pfn.h>
#include <linux/swap.h>
+#include <linux/sysfs.h>
#include <asm/page.h>
#include <asm/phyp_dump.h>
#include <asm/machdep.h>
#include <asm/prom.h>
+#include <asm/rtas.h>
/* Variables, used to communicate data between early boot and late boot */
static struct phyp_dump phyp_dump_vars;
struct phyp_dump *phyp_dump_info = &phyp_dump_vars;
+/* ------------------------------------------------- */
/**
* release_memory_range -- release memory previously lmb_reserved
* @start_pfn: starting physical frame number
@@ -54,18 +59,84 @@ release_memory_range(unsigned long start
}
}
-static int __init phyp_dump_setup(void)
+/* ------------------------------------------------- */
+/**
+ * sysfs_release_region -- sysfs interface to release memory range.
+ *
+ * Usage:
+ * "echo <start addr> <length> > /sys/kernel/release_region"
+ *
+ * Example:
+ * "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
+ *
+ * will release 256MB starting at 1GB.
+ */
+static ssize_t store_release_region(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
{
+ unsigned long start_addr, length, end_addr;
unsigned long start_pfn, nr_pages;
+ ssize_t ret;
+
+ ret = sscanf(buf, "%lx %lx", &start_addr, &length);
+ if (ret != 2)
+ return -EINVAL;
+
+ /* Range-check - don't free any reserved memory that
+ * wasn't reserved for phyp-dump */
+ if (start_addr < phyp_dump_info->init_reserve_start)
+ start_addr = phyp_dump_info->init_reserve_start;
+
+ end_addr = phyp_dump_info->init_reserve_start +
+ phyp_dump_info->init_reserve_size;
+ if (start_addr+length > end_addr)
+ length = end_addr - start_addr;
+
+ /* Release the region of memory assed in by user */
+ start_pfn = PFN_DOWN(start_addr);
+ nr_pages = PFN_DOWN(length);
+ release_memory_range(start_pfn, nr_pages);
+
+ return count;
+}
+
+static struct kobj_attribute rr = __ATTR(release_region, 0600,
+ NULL, store_release_region);
+
+static int __init phyp_dump_setup(void)
+{
+ struct device_node *rtas;
+ const int *dump_header = NULL;
+ int header_len = 0;
+ int rc;
/* If no memory was reserved in early boot, there is nothing to do */
if (phyp_dump_info->init_reserve_size == 0)
return 0;
- /* Release memory that was reserved in early boot */
- start_pfn = PFN_DOWN(phyp_dump_info->init_reserve_start);
- nr_pages = PFN_DOWN(phyp_dump_info->init_reserve_size);
- release_memory_range(start_pfn, nr_pages);
+ /* Return if phyp dump not supported */
+ if (!phyp_dump_info->phyp_dump_configured)
+ return -ENOSYS;
+
+ /* Is there dump data waiting for us? */
+ rtas = of_find_node_by_path("/rtas");
+ if (rtas) {
+ dump_header = of_get_property(rtas, "ibm,kernel-dump",
+ &header_len);
+ of_node_put(rtas);
+ }
+
+ if (dump_header == NULL)
+ return 0;
+
+ /* Should we create a dump_subsys, analogous to s390/ipl.c ? */
+ rc = sysfs_create_file(kernel_kobj, &rr.attr);
+ if (rc) {
+ printk(KERN_ERR "phyp-dump: unable to create sysfs file (%d)\n",
+ rc);
+ return 0;
+ }
return 0;
}
^ permalink raw reply [flat|nested] 71+ messages in thread
end of thread, other threads:[~2008-03-21 23:39 UTC | newest]
Thread overview: 71+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-07 23:45 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
2008-01-08 0:13 ` [PATCH 1/8] pseries: phyp dump: Docmentation Manish Ahuja
2008-01-09 4:29 ` Nathan Lynch
2008-01-09 4:58 ` Michael Ellerman
2008-01-09 15:31 ` Linas Vepstas
2008-01-09 18:44 ` Nathan Lynch
2008-01-09 19:28 ` Manish Ahuja
2008-01-09 22:59 ` Michael Ellerman
2008-01-09 23:18 ` Manish Ahuja
2008-01-10 2:47 ` Linas Vepstas
2008-01-10 3:55 ` Michael Ellerman
2008-01-10 2:33 ` Linas Vepstas
2008-01-10 3:17 ` Olof Johansson
2008-01-10 4:12 ` Linas Vepstas
2008-01-10 4:52 ` Michael Ellerman
2008-01-10 16:21 ` Olof Johansson
2008-01-10 16:34 ` Linas Vepstas
2008-01-10 21:46 ` Mike Strosaker
2008-01-11 1:26 ` Nathan Lynch
2008-01-11 16:57 ` Linas Vepstas
2008-01-14 5:24 ` Olof Johansson
2008-01-14 15:21 ` Linas Vepstas
2008-01-08 0:16 ` [PATCH 2/8] pseries: phyp dump: config file Manish Ahuja
2008-01-08 3:18 ` Stephen Rothwell
2008-01-08 0:21 ` [PATCH 4/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
2008-01-08 3:45 ` Stephen Rothwell
2008-01-08 18:34 ` Linas Vepstas
2008-01-08 0:25 ` [PATCH 3/8] pseries: phyp dump: reserve-release proof-of-concept Manish Ahuja
2008-01-08 3:16 ` Stephen Rothwell
2008-01-16 4:21 ` Paul Mackerras
2008-01-08 0:28 ` [PATCH 5/8] pseries: phyp dump: register dump area Manish Ahuja
2008-01-08 3:59 ` Stephen Rothwell
2008-01-08 0:35 ` [PATCH 6/8] pseries: phyp dump: debugging print routines Manish Ahuja
2008-01-08 0:49 ` Arnd Bergmann
2008-01-08 4:03 ` Stephen Rothwell
2008-01-08 0:37 ` [PATCH 7/8] pseries: phyp dump: Unregister and print dump areas Manish Ahuja
2008-01-08 4:25 ` Stephen Rothwell
2008-01-08 22:56 ` Manish Ahuja
2008-01-08 0:39 ` [PATCH 8/8] pseries: phyp dump: Tracking memory range freed Manish Ahuja
2008-02-12 6:31 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
2008-02-12 6:53 ` [PATCH 1/8] pseries: phyp dump: Docmentation Manish Ahuja
2008-02-12 7:08 ` [PATCH 2/8] pseries: phyp dump: reserve-release proof-of-concept Manish Ahuja
2008-02-12 8:48 ` Michael Ellerman
2008-02-12 16:38 ` Manish Ahuja
2008-02-14 3:46 ` Tony Breeds
2008-02-14 23:12 ` Olof Johansson
2008-02-15 7:16 ` Manish Ahuja
2008-02-12 7:11 ` [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
2008-02-12 10:08 ` Stephen Rothwell
2008-02-12 16:40 ` Manish Ahuja
2008-02-15 1:05 ` Tony Breeds
2008-02-15 7:17 ` Manish Ahuja
2008-02-15 22:32 ` Tony Breeds
2008-02-15 17:30 ` Linas Vepstas
2008-02-12 7:14 ` [PATCH 4/8] pseries: phyp dump: register dump area Manish Ahuja
2008-02-12 10:11 ` Stephen Rothwell
2008-02-12 16:31 ` Manish Ahuja
2008-02-12 7:16 ` [PATCH 5/8] pseries: phyp dump: debugging print routines Manish Ahuja
2008-02-12 7:18 ` [PATCH 6/8] pseries: phyp dump: Invalidate and print dump areas Manish Ahuja
2008-02-12 10:18 ` Stephen Rothwell
2008-02-12 16:32 ` Manish Ahuja
2008-02-13 21:43 ` Manish Ahuja
2008-02-12 7:20 ` [PATCH 7/8] pseries: phyp dump: Tracking memory range freed Manish Ahuja
2008-02-12 7:21 ` [PATCH 8/8] pseries: phyp dump: config file Manish Ahuja
-- strict thread matches above, loose matches on Subject: below --
2008-01-22 19:12 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
2008-01-22 19:33 ` [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
2008-02-18 4:53 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
2008-02-18 5:38 ` [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
2008-02-22 0:53 ` [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Michael Ellerman
2008-02-28 23:57 ` Manish Ahuja
2008-02-29 0:27 ` [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
2008-03-11 6:16 ` Paul Mackerras
2008-03-11 16:44 ` Dale Farnsworth
2008-03-12 17:38 ` Linas Vepstas
2008-03-21 22:42 [PATCH 0/8] pseries: phyp dump: hypervisor-assisted dump Manish Ahuja
2008-03-21 23:39 ` [PATCH 3/8] pseries: phyp dump: use sysfs to release reserved mem Manish Ahuja
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).