From: Lisa Mitchell <lisa.mitchell@hp.com>
To: Cliff Wickman <cpw@sgi.com>
Cc: "kexec@lists.infradead.org" <kexec@lists.infradead.org>,
"d.hatayama@jp.fujitsu.com" <d.hatayama@jp.fujitsu.com>,
"kumagai-atsushi@mxc.nes.nec.co.jp"
<kumagai-atsushi@mxc.nes.nec.co.jp>
Subject: Re: [PATCH v2] makedumpfile: request the kernel do page scans
Date: Wed, 16 Jan 2013 05:15:29 -0700
Message-ID: <1358338529.13097.987.camel@lisamlinux.fc.hp.com>
In-Reply-To: <E1TrA0E-0001Ov-Pe@eag09.americas.sgi.com>
On Fri, 2013-01-04 at 16:20 +0000, Cliff Wickman wrote:
> From: Cliff Wickman <cpw@sgi.com>
>
> This version of the patch improves the consolidation of the mem_map table
> that is passed to the kernel. See make_kernel_mmap().
> Particularly the seemingly duplicate pfn ranges generated on an older
> (2.6.32-based, rhel6) kernel.
>
>
>
> I've been experimenting with asking the kernel to scan the page tables
> instead of reading all those page structures through /proc/vmcore.
> The results are rather dramatic.
> On a small, idle UV: about 4 sec. versus about 40 sec.
> On a 8TB UV the unnecessary page scan takes 4 minutes, vs. about 200 min
> through /proc/vmcore.
>
> This patch incorporates this scheme into version 1.5.1, so that the cyclic
> processing can use the kernel scans.
> It also uses the page_is_buddy logic to speed the finding of free pages.
> And also allows makedumpfile to work as before with a kernel that does
> not provide /proc/vmcore_pfn_lists.
>
> This patch:
> - writes requests to new kernel file /proc/vmcore_pfn_lists
> - makes request PL_REQUEST_MEMMAP to pass the crash kernel information about
> the boot kernel
> - makes requests PL_REQUEST_FREE and PL_REQUEST_EXCLUDE, asking the kernel
> to return lists of PFNs
> - adds page scan timing options -n -o and -t
> - still has a debugging option -a
>
> This patch depends on a kernel patch.
>
> Diffed against the released makedumpfile-1.5.1
>
> Signed-off-by: Cliff Wickman <cpw@sgi.com>
> ---
> dwarf_info.c | 2
> makedumpfile.c | 587 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> makedumpfile.h | 95 +++++++++
> print_info.c | 5
> 4 files changed, 665 insertions(+), 24 deletions(-)
>
>
> Index: makedumpfile-1.5.1.released/makedumpfile.h
> ===================================================================
> --- makedumpfile-1.5.1.released.orig/makedumpfile.h
> +++ makedumpfile-1.5.1.released/makedumpfile.h
> @@ -86,6 +86,8 @@ int get_mem_type(void);
> #define LSEEKED_PDESC (2)
> #define LSEEKED_PDATA (3)
>
> +#define EXTRA_MEMMAPS 100
> +
> /*
> * Xen page flags
> */
> @@ -418,7 +420,7 @@ do { \
> #define KVER_MIN_SHIFT 16
> #define KERNEL_VERSION(x,y,z) (((x) << KVER_MAJ_SHIFT) | ((y) << KVER_MIN_SHIFT) | (z))
> #define OLDEST_VERSION KERNEL_VERSION(2, 6, 15)/* linux-2.6.15 */
> -#define LATEST_VERSION KERNEL_VERSION(3, 6, 7)/* linux-3.6.7 */
> +#define LATEST_VERSION KERNEL_VERSION(3, 7, 8)/* linux-3.7.8 */
>
> /*
> * vmcoreinfo in /proc/vmcore
> @@ -794,11 +796,25 @@ typedef struct {
> } xen_crash_info_v2_t;
>
> struct mem_map_data {
> + /*
> + * pfn_start/pfn_end are the pfn's represented by this mem_map entry.
> + * mem_map is the virtual address of the array of page structures
> + * that represent these pages.
> + * paddr is the physical address of that array of structures.
> + * ending_paddr would be (pfn_end - pfn_start) * sizeof(struct page).
> + * section_vaddr is the address we get from ioremap_cache().
> + */
> unsigned long long pfn_start;
> unsigned long long pfn_end;
> - unsigned long mem_map;
> + unsigned long mem_map;
> + unsigned long long paddr; /* filled in by makedumpfile */
> + long virtual_offset; /* filled in by kernel */
> + unsigned long long ending_paddr; /* filled in by kernel */
> + unsigned long mapped_size; /* filled in by kernel */
> + void *section_vaddr; /* filled in by kernel */
> };
>
> +
> struct dump_bitmap {
> int fd;
> int no_block;
> @@ -875,6 +891,7 @@ struct DumpInfo {
> int flag_rearrange; /* flag of creating dumpfile from
> flattened format */
> int flag_split; /* splitting vmcore */
> + int flag_use_kernel_lists;
> int flag_cyclic; /* cyclic processing to keep memory consumption */
> int flag_reassemble; /* reassemble multiple dumpfiles into one */
> int flag_refiltering; /* refilter from kdump-compressed file */
> @@ -1384,6 +1401,80 @@ struct domain_list {
> unsigned int pickled_id;
> };
>
> +#define PL_REQUEST_FREE 1 /* request for a list of free pages */
> +#define PL_REQUEST_EXCLUDE 2 /* request for a list of excludable
> + pages */
> +#define PL_REQUEST_MEMMAP 3 /* request to pass in the makedumpfile
> + mem_map_data table */
> +/*
> + * limit the size of the pfn list to this many pfn_element structures
> + */
> +#define MAX_PFN_LIST 10000
> +
> +/*
> + * one element in the pfn_list
> + */
> +struct pfn_element {
> + unsigned long pfn;
> + unsigned long order;
> +};
> +
> +/*
> + * a request for finding pfn's that can be excluded from the dump
> + * they may be pages of particular types or free pages
> + */
> +struct pfn_list_request {
> + int request; /* PL_REQUEST_FREE PL_REQUEST_EXCLUDE or */
> + /* PL_REQUEST_MEMMAP */
> + int debug;
> + unsigned long paddr; /* mem_map address for PL_REQUEST_EXCLUDE */
> + unsigned long pfn_start;/* pfn represented by paddr */
> + unsigned long pgdat_paddr; /* for PL_REQUEST_FREE */
> + unsigned long pgdat_vaddr; /* for PL_REQUEST_FREE */
> + int node; /* for PL_REQUEST_FREE */
> + int exclude_bits; /* for PL_REQUEST_EXCLUDE */
> + int count; /* for PL_REQUEST_EXCLUDE */
> + void *reply_ptr; /* address of user's pfn_reply, for reply */
> + void *pfn_list_ptr; /* address of user's pfn array (*pfn_list) */
> + int map_count; /* for PL_REQUEST_MEMMAP; elements */
> + int map_size; /* for PL_REQUEST_MEMMAP; bytes in table */
> + void *map_ptr; /* for PL_REQUEST_MEMMAP; address of table */
> + long list_size; /* for PL_REQUEST_MEMMAP negotiation */
> + /* resume info: */
> + int more; /* 0 for done, 1 for "there's more" */
> + /* PL_REQUEST_EXCLUDE: */
> + int map_index; /* slot in the mem_map array of page structs */
> + /* PL_REQUEST_FREE: */
> + int zone_index; /* zone within the node's pgdat_list */
> + int freearea_index; /* free_area within the zone */
> + int type_index; /* free_list within the free_area */
> + int list_ct; /* page within the list */
> +};
> +
> +/*
> + * the reply from a pfn_list_request
> + * the list of pfn's itself is pointed to by pfn_list
> + */
> +struct pfn_reply {
> + long pfn_list_elements; /* negotiated on PL_REQUEST_MEMMAP */
> + long in_pfn_list; /* returned by PL_REQUEST_EXCLUDE and
> + PL_REQUEST_FREE */
> + /* resume info */
> + int more; /* 0 == done, 1 == there is more */
> + /* PL_REQUEST_MEMMAP: */
> + int map_index; /* slot in the mem_map array of page structs */
> + /* PL_REQUEST_FREE: */
> + int zone_index; /* zone within the node's pgdat_list */
> + int freearea_index; /* free_area within the zone */
> + int type_index; /* free_list within the free_area */
> + int list_ct; /* page within the list */
> + /* statistic counters: */
> + unsigned long long pfn_cache; /* PL_REQUEST_EXCLUDE */
> + unsigned long long pfn_cache_private; /* PL_REQUEST_EXCLUDE */
> + unsigned long long pfn_user; /* PL_REQUEST_EXCLUDE */
> + unsigned long long pfn_free; /* PL_REQUEST_FREE */
> +};
> +
> #define PAGES_PER_MAPWORD (sizeof(unsigned long) * 8)
> #define MFNS_PER_FRAME (info->page_size / sizeof(unsigned long))
>
> Index: makedumpfile-1.5.1.released/dwarf_info.c
> ===================================================================
> --- makedumpfile-1.5.1.released.orig/dwarf_info.c
> +++ makedumpfile-1.5.1.released/dwarf_info.c
> @@ -324,6 +324,8 @@ get_data_member_location(Dwarf_Die *die,
> return TRUE;
> }
>
> +int dwarf_formref(Dwarf_Attribute *, Dwarf_Off *);
> +
> static int
> get_die_type(Dwarf_Die *die, Dwarf_Die *die_type)
> {
> Index: makedumpfile-1.5.1.released/print_info.c
> ===================================================================
> --- makedumpfile-1.5.1.released.orig/print_info.c
> +++ makedumpfile-1.5.1.released/print_info.c
> @@ -244,6 +244,11 @@ print_usage(void)
> MSG(" [-f]:\n");
> MSG(" Overwrite DUMPFILE even if it already exists.\n");
> MSG("\n");
> + MSG(" [-o]:\n");
> + MSG(" Read page structures from /proc/vmcore in the scan for\n");
> + MSG(" free and excluded pages regardless of whether\n");
> + MSG(" /proc/vmcore_pfn_lists is present.\n");
> + MSG("\n");
> MSG(" [-h]:\n");
> MSG(" Show help message and LZO/snappy support status (enabled/disabled).\n");
> MSG("\n");
> Index: makedumpfile-1.5.1.released/makedumpfile.c
> ===================================================================
> --- makedumpfile-1.5.1.released.orig/makedumpfile.c
> +++ makedumpfile-1.5.1.released/makedumpfile.c
> @@ -13,6 +13,8 @@
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> * GNU General Public License for more details.
> */
> +#define _GNU_SOURCE
> +#include <stdio.h>
> #include "makedumpfile.h"
> #include "print_info.h"
> #include "dwarf_info.h"
> @@ -31,6 +33,14 @@ struct srcfile_table srcfile_table;
>
> struct vm_table vt = { 0 };
> struct DumpInfo *info = NULL;
> +int pfn_list_fd;
> +struct pfn_element *pfn_list;
> +int nflag = 0;
> +int oflag = 0;
> +int tflag = 0;
> +int aflag = 0;
> +struct timeval scan_start;
> +int max_pfn_list;
>
> char filename_stdout[] = FILENAME_STDOUT;
>
> @@ -2415,6 +2425,22 @@ get_mm_sparsemem(void)
> unsigned long long pfn_start, pfn_end;
> unsigned long section, mem_map;
> unsigned long *mem_sec = NULL;
> + unsigned long vaddr;
> + unsigned long paddr;
> + unsigned long lastvaddr;
> + unsigned long lastpaddr;
> + unsigned long diff;
> + long j;
> + int i;
> + int npfns;
> + int pagesize;
> + int num_mem_map;
> + int num_added = 0;
> + struct mem_map_data *mmd;
> + struct mem_map_data *curmmd;
> + struct mem_map_data *work1mmd;
> + struct mem_map_data *work2mmd;
> + struct mem_map_data *lastmmd;
>
> int ret = FALSE;
>
> @@ -2441,7 +2467,8 @@ get_mm_sparsemem(void)
> }
> info->num_mem_map = num_section;
> if ((info->mem_map_data = (struct mem_map_data *)
> - malloc(sizeof(struct mem_map_data)*info->num_mem_map)) == NULL) {
> + malloc(sizeof(struct mem_map_data) *
> + (EXTRA_MEMMAPS + info->num_mem_map))) == NULL) {
> ERRMSG("Can't allocate memory for the mem_map_data. %s\n",
> strerror(errno));
> goto out;
> @@ -2459,6 +2486,71 @@ get_mm_sparsemem(void)
> dump_mem_map(pfn_start, pfn_end, mem_map, section_nr);
> }
> ret = TRUE;
> +
> + /* add paddr to the table */
> + mmd = &info->mem_map_data[0];
> + num_mem_map = info->num_mem_map;
> + lastmmd = mmd + num_mem_map;
> + for (i = 0; i < num_mem_map; i++) {
> + if (mmd[i].mem_map == 0) {
> + mmd[i].paddr = 0;
> + } else {
> + mmd[i].paddr = vaddr_to_paddr(mmd[i].mem_map);
> + if (mmd[i].paddr == 0) {
> + printf("! can't translate %#lx to paddr\n",
> + mmd[i].mem_map);
> + exit(1);
> + }
> + /*
> + * When we pass a mem_map and its paddr to the kernel
> + * it will be remapped assuming the entire range
> + * of pfn's are consecutive. If they are not then
> + * we need to split the range into two.
> + */
> + pagesize = SIZE(page);
> + npfns = mmd[i].pfn_end - mmd[i].pfn_start;
> + vaddr = (unsigned long)mmd[i].mem_map;
> + paddr = vaddr_to_paddr(vaddr);
> + diff = vaddr - paddr;
> + lastvaddr = vaddr + (pagesize * (npfns-1));
> + lastpaddr = vaddr_to_paddr(lastvaddr);
> + if (lastvaddr - lastpaddr != diff) {
> + /* there is a break in vtop somewhere in this range */
> + /* we need to split it */
> + for (j = 0; j < npfns; j++) {
> + paddr = vaddr_to_paddr(vaddr);
> + if (vaddr - paddr != diff) {
> + diff = vaddr - paddr;
> + /* insert a new entry if we have room */
> + if (num_added < EXTRA_MEMMAPS) {
> + curmmd = &info->mem_map_data[i];
> + num_added++;
> + work1mmd = lastmmd - 1;
> + for (work2mmd = lastmmd;
> + work2mmd > curmmd; work2mmd--) {
> + work1mmd = work2mmd - 1;
> + *work2mmd = *work1mmd;
> + }
> + work2mmd = work1mmd + 1;
> + work2mmd->mem_map =
> + work1mmd->mem_map + (pagesize * j);
> + lastmmd++;
> + num_mem_map++;
> + info->num_mem_map++;
> + /*
> + * need only 1 split, the new
> + * one will be checked also.
> + */
> + break;
> + } else
> + printf("warn: out of EXTRA_MEMMAPS\n");
> + }
> + vaddr += pagesize;
> + }
> + }
> + }
> + }
> +
> out:
> if (mem_sec != NULL)
> free(mem_sec);
> @@ -2571,6 +2663,172 @@ initialize_bitmap_memory(void)
> return TRUE;
> }
>
> +/*
> + * construct a version of the mem_map_data table to pass to the kernel
> + */
> +void *
> +make_kernel_mmap(int *kmap_elements, int *kmap_size)
> +{
> + int i, j;
> + int elements = 0;
> + int page_structs;
> + int elem;
> + long l;
> + unsigned long base_end_pfn;
> + unsigned long end_paddr;
> + unsigned long v1;
> + unsigned long v2;
> + unsigned long end_page_pfns;
> + unsigned long hpagesize = 0x200000UL;
> + unsigned long hpageoffset = hpagesize - 1;
> + struct mem_map_data *mmdo, *mmdn;
> + struct mem_map_data *mmdbase, *mmdnext, *mmdend, *mmdwork;
> + struct mem_map_data temp_mmd;
> + struct mem_map_data *mmap;
> +
> + mmap = malloc(info->num_mem_map * sizeof(struct mem_map_data));
> + if (mmap == NULL) {
> + ERRMSG("Can't allocate memory kernel map\n");
> + return NULL;
> + }
> +
> + /* condense them down to the valid ones */
> + for (i = 0, mmdn = mmap, mmdo = &info->mem_map_data[0];
> + i < info->num_mem_map; i++, mmdo++) {
> + if (mmdo->mem_map && mmdo->paddr) {
> + *mmdn = *mmdo;
> + mmdn++;
> + elements++;
> + }
> + }
> +
> + /* make sure it is sorted by mem_map (it should be already) */
> + mmdn = mmap;
> + for (i = 0; i < elements - 1; i++) {
> + for (j = i + 1; j < elements; j++) {
> + if (mmdn[j].mem_map < mmdn[i].mem_map) {
> + temp_mmd = mmdn[j];
> + mmdn[j] = mmdn[i];
> + mmdn[i] = temp_mmd;
> + }
> + }
> + }
> +
> + if (aflag) {
> + mmdn = mmap;
> + printf("entire mem_map:\n");
> + for (i = 0; i < elements - 1; i++) {
> + l = (mmdn[i].pfn_end - mmdn[i].pfn_start) * SIZE(page);
> + printf(
> + "[%d] pfn %#llx-%llx mem_map %#lx paddr %#llx-%llx\n",
> + i, mmdn[i].pfn_start, mmdn[i].pfn_end,
> + mmdn[i].mem_map, mmdn[i].paddr,
> + mmdn[i].paddr + l);
> + }
> + }
> +
> + /*
> + * a first pass to split overlapping pfn entries like this:
> + * pfn 0x1248000-1250000 mem_map 0xffffea003ffc0000 paddr 0x10081c0000
> + * pfn 0x1248000-1250000 mem_map 0xffffea0040000030 paddr 0x1008400030
> + */
> + mmdbase = mmap;
> + mmdnext = mmap + 1;
> + mmdend = mmap + elements;
> + /* test each mmdbase/mmdnext pair */
> + while (mmdnext < mmdend) { /* mmdnext is the one after mmdbase */
> + page_structs = (mmdbase->pfn_end - mmdbase->pfn_start);
> + /* mmdwork scans from mmdnext to the end */
> + if ((mmdbase->pfn_start == mmdnext->pfn_start) &&
> + (mmdbase->pfn_end == mmdnext->pfn_end)) {
> + /* overlapping pfns, we need a fix */
> + v1 = mmdnext->mem_map - mmdbase->mem_map;
> + v2 = mmdnext->paddr - mmdbase->paddr;
> + if (v1 != (v2 & hpageoffset))
> + printf("virt to phys is wrong %#lx %#lx\n",
> + v1, v2);
> + l = mmdbase->pfn_end - mmdbase->pfn_start;
> + end_page_pfns = l - (((hpagesize -
> + (hpageoffset & mmdbase->paddr)) +
> + SIZE(page) - 1) / SIZE(page));
> + mmdbase->pfn_end -= end_page_pfns;
> + mmdnext->pfn_start = mmdbase->pfn_end;
> + } else if ((mmdbase->pfn_start == mmdnext->pfn_start) ||
> + (mmdbase->pfn_end == mmdnext->pfn_end)) {
> + printf("warning: unfixed overlap\n");
> + }
> + mmdbase++;
> + mmdnext++;
> + }
> +
> + /*
> + * consolidate those mem_map's with occupying consecutive physical
> + * addresses
> + * pages represented by these pages structs: addr of page struct
> + * pfns 0x1000000-1008000 mem_map 0xffffea0038000000 paddr 0x11f7e00000
> + * pfns 0x1008000-1010000 mem_map 0xffffea00381c0000 paddr 0x11f7fc0000
> + * pfns 0x1010000-1018000 mem_map 0xffffea0038380000 paddr 0x11f8180000
> + * 8000 increments inc's: 1c0000
> + * 8000000 of memory (128M) 8000 page structs
> + */
> + mmdbase = mmap;
> + mmdnext = mmap + 1;
> + mmdend = mmap + elements;
> + while (mmdnext < mmdend) {
> + elem = mmdend - mmdnext;
> + /* test mmdbase vs. mmdwork and onward: */
> + for (i = 0, mmdwork = mmdnext; i < elem; i++, mmdwork++) {
> + base_end_pfn = mmdbase->pfn_end;
> + if (base_end_pfn == mmdwork->pfn_start) {
> + page_structs = (mmdbase->pfn_end -
> + mmdbase->pfn_start);
> + end_paddr = (page_structs * SIZE(page)) +
> + mmdbase->paddr;
> + if (mmdwork->paddr == end_paddr) {
> + /* extend base by the work one */
> + mmdbase->pfn_end = mmdwork->pfn_end;
> + /* next is where to begin next time */
> + mmdnext = mmdwork + 1;
> + } else {
> + /* gap in address of page
> + structs; end of section */
> + mmdbase++;
> + if (mmdwork - mmdbase > 0)
> + *mmdbase = *mmdwork;
> + mmdnext = mmdwork + 1;
> + break;
> + }
> + } else {
> + /* gap in pfns; end of section */
> + mmdbase++;
> + if (mmdwork - mmdbase > 0)
> + *mmdbase = *mmdwork;
> + mmdnext = mmdwork + 1;
> + break;
> + }
> + }
> + }
> + elements = (mmdbase - mmap) + 1;
> +
> + if (aflag) {
> + printf("user mmap for kernel:\n");
> + for (i = 0, mmdwork = mmap; i < elements; i++, mmdwork++) {
> + l = mmdwork->pfn_end - mmdwork->pfn_start;
> + printf(
> + "[%d] user pfn %#llx-%llx paddr %#llx-%llx vaddr %#lx\n",
> + i, mmdwork->pfn_start, mmdwork->pfn_end,
> + mmdwork->paddr,
> + mmdwork->paddr + (l * SIZE(page)),
> + mmdwork->mem_map);
> + }
> + }
> +
> + *kmap_elements = elements;
> + *kmap_size = elements * sizeof(struct mem_map_data);
> +
> + return mmap;
> +}
> +
> int
> initial(void)
> {
> @@ -2833,7 +3091,14 @@ out:
> if (!get_value_for_old_linux())
> return FALSE;
>
> - if (info->flag_cyclic && (info->dump_level & DL_EXCLUDE_FREE))
> + /*
> + * page_is_buddy will tell us whether free pages can be identified
> + * by flags and counts in the page structure without making an extra
> + * pass through the free lists.
> + * This is applicable to using /proc/vmcore or using the kernel.
> + * force all old (-o) forms to search free lists
> + */
> + if (info->dump_level & DL_EXCLUDE_FREE)
> setup_page_is_buddy();
>
> return TRUE;
> @@ -3549,6 +3814,65 @@ out:
> return ret;
> }
>
> +/*
> + * let the kernel find excludable pages from one node
> + */
> +void
> +__exclude_free_pages_kernel(unsigned long pgdat, int node)
> +{
> + int i, j, ret, pages;
> + unsigned long pgdat_paddr;
> + struct pfn_list_request request;
> + struct pfn_reply reply;
> + struct pfn_element *pe;
> +
> + if ((pgdat_paddr = vaddr_to_paddr(pgdat)) == NOT_PADDR) {
> + ERRMSG("Can't convert virtual address(%#lx) to physical.\n",
> + pgdat);
> + return;
> + }
> +
> + /*
> + * Get the list of free pages.
> + * This may be broken up into MAX_PFN_LIST arrays of PFNs.
> + */
> + memset(&request, 0, sizeof(request));
> + request.request = PL_REQUEST_FREE;
> + request.node = node;
> + request.pgdat_paddr = pgdat_paddr;
> + request.pgdat_vaddr = pgdat;
> + request.reply_ptr = (void *)&reply;
> + request.pfn_list_ptr = (void *)pfn_list;
> + memset(&reply, 0, sizeof(reply));
> +
> + do {
> + request.more = 0;
> + if (reply.more) {
> + /* this is to be a continuation of the last request */
> + request.more = 1;
> + request.zone_index = reply.zone_index;
> + request.freearea_index = reply.freearea_index;
> + request.type_index = reply.type_index;
> + request.list_ct = reply.list_ct;
> + }
> + ret = write(pfn_list_fd, &request, sizeof(request));
> + if (ret != sizeof(request)) {
> + printf("PL_REQUEST_FREE failed\n");
> + return;
> + }
> + pfn_free += reply.pfn_free;
> +
> + for (i = 0; i < reply.in_pfn_list; i++) {
> + pe = &pfn_list[i];
> + pages = (1 << pe->order);
> + for (j = 0; j < pages; j++) {
> + clear_bit_on_2nd_bitmap_for_kernel(pe->pfn + j);
> + }
> + }
> + } while (reply.more);
> +
> + return;
> +}
>
> int
> _exclude_free_page(void)
> @@ -3568,7 +3892,24 @@ _exclude_free_page(void)
> gettimeofday(&tv_start, NULL);
>
> for (num_nodes = 1; num_nodes <= vt.numnodes; num_nodes++) {
> -
> + if (!info->flag_cyclic && info->flag_use_kernel_lists) {
> + node_zones = pgdat + OFFSET(pglist_data.node_zones);
> + if (!readmem(VADDR,
> + pgdat + OFFSET(pglist_data.nr_zones),
> + &nr_zones, sizeof(nr_zones))) {
> + ERRMSG("Can't get nr_zones.\n");
> + return FALSE;
> + }
> + print_progress(PROGRESS_FREE_PAGES, num_nodes - 1,
> + vt.numnodes);
> + /* ask the kernel to do one node */
> + __exclude_free_pages_kernel(pgdat, node);
> + goto next_pgdat;
> + }
> + /*
> + * kernel does not have the pfn_list capability
> + * use the old way
> + */
> print_progress(PROGRESS_FREE_PAGES, num_nodes - 1, vt.numnodes);
>
> node_zones = pgdat + OFFSET(pglist_data.node_zones);
> @@ -3595,6 +3936,7 @@ _exclude_free_page(void)
> if (!reset_bitmap_of_free_pages(zone))
> return FALSE;
> }
> + next_pgdat:
> if (num_nodes < vt.numnodes) {
> if ((node = next_online_node(node + 1)) < 0) {
> ERRMSG("Can't get next online node.\n");
> @@ -3612,6 +3954,8 @@ _exclude_free_page(void)
> */
> print_progress(PROGRESS_FREE_PAGES, vt.numnodes, vt.numnodes);
> print_execution_time(PROGRESS_FREE_PAGES, &tv_start);
> + if (tflag)
> + print_execution_time("Total time", &scan_start);
>
> return TRUE;
> }
> @@ -3755,7 +4099,6 @@ setup_page_is_buddy(void)
> }
> } else
> info->page_is_buddy = page_is_buddy_v2;
> -
> out:
> if (!info->page_is_buddy)
> DEBUG_MSG("Can't select page_is_buddy handler; "
> @@ -3964,10 +4307,89 @@ exclude_zero_pages(void)
> return TRUE;
> }
>
> +/*
> + * let the kernel find excludable pages from one mem_section
> + */
> +int
> +__exclude_unnecessary_pages_kernel(int mm, struct mem_map_data *mmd)
> +{
> + unsigned long long pfn_start = mmd->pfn_start;
> + unsigned long long pfn_end = mmd->pfn_end;
> + int i, j, ret, pages, flag;
> + struct pfn_list_request request;
> + struct pfn_reply reply;
> + struct pfn_element *pe;
> +
> + /*
> + * Get the list of to-be-excluded pages in this section.
> + * It may be broken up by groups of max_pfn_list size.
> + */
> + memset(&request, 0, sizeof(request));
> + request.request = PL_REQUEST_EXCLUDE;
> + request.paddr = mmd->paddr; /* phys addr of mem_map */
> + request.reply_ptr = (void *)&reply;
> + request.pfn_list_ptr = (void *)pfn_list;
> + request.exclude_bits = 0;
> + request.pfn_start = pfn_start;
> + request.count = pfn_end - pfn_start;
> + if (info->dump_level & DL_EXCLUDE_CACHE)
> + request.exclude_bits |= DL_EXCLUDE_CACHE;
> + if (info->dump_level & DL_EXCLUDE_CACHE_PRI)
> + request.exclude_bits |= DL_EXCLUDE_CACHE_PRI;
> + if (info->dump_level & DL_EXCLUDE_USER_DATA)
> + request.exclude_bits |= DL_EXCLUDE_USER_DATA;
> + /* if we try for free pages from the freelists then we don't need
> + to ask here for 'buddy' pages */
> + if (info->dump_level & DL_EXCLUDE_FREE)
> + request.exclude_bits |= DL_EXCLUDE_FREE;
> + memset(&reply, 0, sizeof(reply));
> +
> + do {
> + /* pfn represented by paddr */
> + request.more = 0;
> + if (reply.more) {
> + /* this is to be a continuation of the last request */
> + request.more = 1;
> + request.map_index = reply.map_index;
> + }
> +
> + ret = write(pfn_list_fd, &request, sizeof(request));
> + if (ret != sizeof(request))
> + return FALSE;
> +
> + pfn_cache += reply.pfn_cache;
> + pfn_cache_private += reply.pfn_cache_private;
> + pfn_user += reply.pfn_user;
> + pfn_free += reply.pfn_free;
> +
> + flag = 0;
> + for (i = 0; i < reply.in_pfn_list; i++) {
> + pe = &pfn_list[i];
> + pages = (1 << pe->order);
> + for (j = 0; j < pages; j++) {
> + if (clear_bit_on_2nd_bitmap_for_kernel(
> + pe->pfn + j) == FALSE) {
> + // printf("fail: mm %d slot %d pfn %#lx\n",
> + // mm, i, pe->pfn + j);
> + // printf("paddr %#llx pfn %#llx-%#llx mem_map %#lx\n",
> + // mmd->paddr, mmd->pfn_start, mmd->pfn_end, mmd->mem_map);
> + flag = 1;
> + break;
> + }
> + if (flag) break;
> + }
> + }
> + } while (reply.more);
> +
> + return TRUE;
> +}
> +
> int
> -__exclude_unnecessary_pages(unsigned long mem_map,
> - unsigned long long pfn_start, unsigned long long pfn_end)
> +__exclude_unnecessary_pages(int mm, struct mem_map_data *mmd)
> {
> + unsigned long long pfn_start = mmd->pfn_start;
> + unsigned long long pfn_end = mmd->pfn_end;
> + unsigned long mem_map = mmd->mem_map;
> unsigned long long pfn, pfn_mm, maddr;
> unsigned long long pfn_read_start, pfn_read_end, index_pg;
> unsigned char page_cache[SIZE(page) * PGMM_CACHED];
> @@ -3975,6 +4397,12 @@ __exclude_unnecessary_pages(unsigned lon
> unsigned int _count, _mapcount = 0;
> unsigned long flags, mapping, private = 0;
>
> + if (info->flag_use_kernel_lists) {
> + if (__exclude_unnecessary_pages_kernel(mm, mmd) == FALSE)
> + return FALSE;
> + return TRUE;
> + }
> +
> /*
> * Refresh the buffer of struct page, when changing mem_map.
> */
> @@ -4012,7 +4440,6 @@ __exclude_unnecessary_pages(unsigned lon
> pfn_mm = PGMM_CACHED - index_pg;
> else
> pfn_mm = pfn_end - pfn;
> -
> if (!readmem(VADDR, mem_map,
> page_cache + (index_pg * SIZE(page)),
> SIZE(page) * pfn_mm)) {
> @@ -4036,7 +4463,6 @@ __exclude_unnecessary_pages(unsigned lon
> * Exclude the free page managed by a buddy
> */
> if ((info->dump_level & DL_EXCLUDE_FREE)
> - && info->flag_cyclic
> && info->page_is_buddy
> && info->page_is_buddy(flags, _mapcount, private, _count)) {
> int i;
> @@ -4085,19 +4511,78 @@ __exclude_unnecessary_pages(unsigned lon
> return TRUE;
> }
>
> +/*
> + * Pass in the mem_map_data table.
> + * Must do this once, and before doing PL_REQUEST_FREE or PL_REQUEST_EXCLUDE.
> + */
> +int
> +setup_kernel_mmap()
> +{
> + int ret;
> + int kmap_elements, kmap_size;
> + long malloc_size;
> + void *kmap_addr;
> + struct pfn_list_request request;
> + struct pfn_reply reply;
> +
> + kmap_addr = make_kernel_mmap(&kmap_elements, &kmap_size);
> + if (kmap_addr == NULL)
> + return FALSE;
> + memset(&request, 0, sizeof(request));
> + request.request = PL_REQUEST_MEMMAP;
> + request.map_ptr = kmap_addr;
> + request.reply_ptr = (void *)&reply;
> + request.map_count = kmap_elements;
> + request.map_size = kmap_size;
> + request.list_size = MAX_PFN_LIST;
> +
> + ret = write(pfn_list_fd, &request, sizeof(request));
> + if (ret < 0) {
> + fprintf(stderr, "PL_REQUEST_MEMMAP returned %d\n", ret);
> + return FALSE;
> + }
> + /* the reply tells us how long the kernel's list actually is */
> + max_pfn_list = reply.pfn_list_elements;
> + if (max_pfn_list <= 0) {
> + fprintf(stderr,
> + "PL_REQUEST_MEMMAP returned max_pfn_list %d\n",
> + max_pfn_list);
> + return FALSE;
> + }
> + if (max_pfn_list < MAX_PFN_LIST) {
> + printf("length of pfn list dropped from %d to %d\n",
> + MAX_PFN_LIST, max_pfn_list);
> + }
> + free(kmap_addr);
> + /*
> + * Allocate the buffer for the PFN list (just once).
> + */
> + malloc_size = max_pfn_list * sizeof(struct pfn_element);
> + if ((pfn_list = (struct pfn_element *)malloc(malloc_size)) == NULL) {
> + ERRMSG("Can't allocate pfn_list of %ld\n", malloc_size);
> + return FALSE;
> + }
> + return TRUE;
> +}
> +
> int
> exclude_unnecessary_pages(void)
> {
> - unsigned int mm;
> - struct mem_map_data *mmd;
> - struct timeval tv_start;
> + unsigned int mm;
> + struct mem_map_data *mmd;
> + struct timeval tv_start;
>
> if (is_xen_memory() && !info->dom0_mapnr) {
> ERRMSG("Can't get max domain-0 PFN for excluding pages.\n");
> return FALSE;
> }
>
> + if (!info->flag_cyclic && info->flag_use_kernel_lists) {
> + if (setup_kernel_mmap() == FALSE)
> + return FALSE;
> + }
> gettimeofday(&tv_start, NULL);
> + gettimeofday(&scan_start, NULL);
>
> for (mm = 0; mm < info->num_mem_map; mm++) {
> print_progress(PROGRESS_UNN_PAGES, mm, info->num_mem_map);
> @@ -4106,9 +4591,9 @@ exclude_unnecessary_pages(void)
>
> if (mmd->mem_map == NOT_MEMMAP_ADDR)
> continue;
> -
> - if (!__exclude_unnecessary_pages(mmd->mem_map,
> - mmd->pfn_start, mmd->pfn_end))
> + if (mmd->paddr == 0)
> + continue;
> + if (!__exclude_unnecessary_pages(mm, mmd))
> return FALSE;
> }
>
> @@ -4139,7 +4624,11 @@ exclude_unnecessary_pages_cyclic(void)
> */
> copy_bitmap_cyclic();
>
> - if ((info->dump_level & DL_EXCLUDE_FREE) && !info->page_is_buddy)
> + /*
> + * If free pages cannot be identified with the buddy flag and/or
> + * count then we have to search free lists.
> + */
> + if ((info->dump_level & DL_EXCLUDE_FREE) && (!info->page_is_buddy))
> if (!exclude_free_page())
> return FALSE;
>
> @@ -4164,8 +4653,7 @@ exclude_unnecessary_pages_cyclic(void)
>
> if (mmd->pfn_end >= info->cyclic_start_pfn &&
> mmd->pfn_start <= info->cyclic_end_pfn) {
> - if (!__exclude_unnecessary_pages(mmd->mem_map,
> - mmd->pfn_start, mmd->pfn_end))
> + if (!__exclude_unnecessary_pages(mm, mmd))
> return FALSE;
> }
> }
> @@ -4195,7 +4683,7 @@ update_cyclic_region(unsigned long long
> if (!create_1st_bitmap_cyclic())
> return FALSE;
>
> - if (!exclude_unnecessary_pages_cyclic())
> + if (exclude_unnecessary_pages_cyclic() == FALSE)
> return FALSE;
>
> return TRUE;
> @@ -4255,7 +4743,7 @@ create_2nd_bitmap(void)
> if (info->dump_level & DL_EXCLUDE_CACHE ||
> info->dump_level & DL_EXCLUDE_CACHE_PRI ||
> info->dump_level & DL_EXCLUDE_USER_DATA) {
> - if (!exclude_unnecessary_pages()) {
> + if (exclude_unnecessary_pages() == FALSE) {
> ERRMSG("Can't exclude unnecessary pages.\n");
> return FALSE;
> }
> @@ -4263,8 +4751,10 @@ create_2nd_bitmap(void)
>
> /*
> * Exclude free pages.
> + * If free pages cannot be identified with the buddy flag and/or
> + * count then we have to search free lists.
> */
> - if (info->dump_level & DL_EXCLUDE_FREE)
> + if ((info->dump_level & DL_EXCLUDE_FREE) && (!info->page_is_buddy))
> if (!exclude_free_page())
> return FALSE;
>
> @@ -4395,6 +4885,10 @@ create_dump_bitmap(void)
> int ret = FALSE;
>
> if (info->flag_cyclic) {
> + if (info->flag_use_kernel_lists) {
> + if (setup_kernel_mmap() == FALSE)
> + goto out;
> + }
> if (!prepare_bitmap_buffer_cyclic())
> goto out;
>
> @@ -4872,6 +5366,7 @@ get_num_dumpable_cyclic(void)
> {
> unsigned long long pfn, num_dumpable=0;
>
> + gettimeofday(&scan_start, NULL);
> for (pfn = 0; pfn < info->max_mapnr; pfn++) {
> if (!update_cyclic_region(pfn))
> return FALSE;
> @@ -5201,7 +5696,7 @@ get_loads_dumpfile_cyclic(void)
> info->cyclic_end_pfn = info->pfn_cyclic;
> if (!create_1st_bitmap_cyclic())
> return FALSE;
> - if (!exclude_unnecessary_pages_cyclic())
> + if (exclude_unnecessary_pages_cyclic() == FALSE)
> return FALSE;
>
> if (!(phnum = get_phnum_memory()))
> @@ -5613,6 +6108,10 @@ write_kdump_pages(struct cache_data *cd_
> pfn_zero++;
> continue;
> }
> +
> + if (nflag)
> + continue;
> +
> /*
> * Compress the page data.
> */
> @@ -5768,6 +6267,7 @@ write_kdump_pages_cyclic(struct cache_da
> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>
> if ((num_dumped % per) == 0)
> +
> print_progress(PROGRESS_COPY, num_dumped, info->num_dumpable);
>
> /*
> @@ -5786,11 +6286,17 @@ write_kdump_pages_cyclic(struct cache_da
> */
> if ((info->dump_level & DL_EXCLUDE_ZERO)
> && is_zero_page(buf, info->page_size)) {
> + if (!nflag) {
> if (!write_cache(cd_header, pd_zero, sizeof(page_desc_t)))
> goto out;
> + }
> pfn_zero++;
> continue;
> }
> +
> + if (nflag)
> + continue;
> +
> /*
> * Compress the page data.
> */
> @@ -6208,6 +6714,8 @@ write_kdump_pages_and_bitmap_cyclic(stru
> if (!update_cyclic_region(pfn))
> return FALSE;
>
> + if (tflag)
> + print_execution_time("Total time", &scan_start);
> if (!write_kdump_pages_cyclic(cd_header, cd_page, &pd_zero, &offset_data))
> return FALSE;
>
> @@ -8231,6 +8739,22 @@ static struct option longopts[] = {
> {0, 0, 0, 0}
> };
>
> +/*
> + * test for the presence of capability in the kernel to provide lists
> + * of pfn's:
> + * /proc/vmcore_pfn_lists
> + * return 1 for present
> + * return 0 for not present
> + */
> +int
> +test_kernel_pfn_lists(void)
> +{
> + if ((pfn_list_fd = open("/proc/vmcore_pfn_lists", O_WRONLY)) < 0) {
> + return 0;
> + }
> + return 1;
> +}
> +
> int
> main(int argc, char *argv[])
> {
> @@ -8256,9 +8780,12 @@ main(int argc, char *argv[])
>
> info->block_order = DEFAULT_ORDER;
> message_level = DEFAULT_MSG_LEVEL;
> - while ((opt = getopt_long(argc, argv, "b:cDd:EFfg:hi:lMpRrsvXx:", longopts,
> + while ((opt = getopt_long(argc, argv, "ab:cDd:EFfg:hi:MnoRrstVvXx:Y", longopts,
> NULL)) != -1) {
> switch (opt) {
> + case 'a':
> + aflag = 1;
> + break;
> case 'b':
> info->block_order = atoi(optarg);
> break;
> @@ -8314,6 +8841,13 @@ main(int argc, char *argv[])
> case 'M':
> info->flag_dmesg = 1;
> break;
> + case 'n':
> + /* -n undocumented, for testing page scanning time */
> + nflag = 1;
> + break;
> + case 'o':
> + oflag = 1;
> + break;
> case 'p':
> info->flag_compress = DUMP_DH_COMPRESSED_SNAPPY;
> break;
> @@ -8329,6 +8863,9 @@ main(int argc, char *argv[])
> case 'r':
> info->flag_reassemble = 1;
> break;
> + case 't':
> + tflag = 1;
> + break;
> case 'V':
> info->vaddr_for_vtop = strtoul(optarg, NULL, 0);
> break;
> @@ -8360,6 +8897,12 @@ main(int argc, char *argv[])
> goto out;
> }
> }
> +
> + if (oflag)
> + info->flag_use_kernel_lists = 0;
> + else
> + info->flag_use_kernel_lists = test_kernel_pfn_lists();
> +
> if (flag_debug)
> message_level |= ML_PRINT_DEBUG_MSG;
>
>
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
Cliff, I tried your patch above on makedumpfile v1.5.1 (built dynamically
on the same DL980 I was running the test on), with all the RHEL 6
versions of the kernel patches you gave me from 1207, plus the kernel
patch to kexec recommended for makedumpfile v1.5.1, built on top of a
preliminary RHEL 6.4 kernel source (a higher patch level of the 2.6.32
kernel). This time I ran on a 1 TB memory system (we have lost access to
a 4 TB memory system for some time now). On this same system, regular
makedumpfile v1.5.1 worked fine to produce a dump. But makedumpfile
with the patches above could not even start the dump, and printed:
Saving vmcore-dmesg.txt
Saved vmcore-dmesg.txt
PL_REQUEST_MEMMAP returned -1
Restarting system.
This happened both with a crashkernel size of 200M, which would have
invoked cyclic buffer mode, and with a larger one, 384M, which should not
have needed cyclic mode. I neither enabled nor disabled cyclic buffer
mode on the makedumpfile command line; I only recorded memory usage, with:
core_collector makedumpfile -c --message-level 31 -d 31
debug_mem_level 2
The message comes from this check in the patched makedumpfile:
    ret = write(pfn_list_fd, &request, sizeof(request));
    if (ret < 0) {
        fprintf(stderr, "PL_REQUEST_MEMMAP returned %d\n", ret);
        return FALSE;
    }
Any idea what might have caused this? Am I missing a patch? Do I have
the wrong kernel patches? Any tips for debugging?
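One way to narrow this down: the message above prints only the return value of write(), which is always -1 on failure. The handler in the attached kernel patch returns either -EINVAL or -EFAULT depending on which check failed, so printing errno would at least separate those two cases. A minimal sketch, assuming a helper named write_request_verbose() (a hypothetical name, not part of either patch) wrapped around the same write() call:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the makedumpfile patch.
 * Writes one request to fd. On failure it formats a message that
 * includes errno, so that EINVAL and EFAULT can be told apart.
 * Returns 0 on success, or a positive errno value on failure.
 */
int
write_request_verbose(int fd, const void *req, size_t len, const char *name,
		      char *msg, size_t msglen)
{
	ssize_t ret;
	int err;

	assert(msg != NULL && msglen > 0);
	msg[0] = '\0';

	ret = write(fd, req, len);
	if (ret < 0) {
		/* save errno before any other library call can change it */
		err = errno;
		snprintf(msg, msglen, "%s returned %zd: errno %d (%s)",
			 name, ret, err, strerror(err));
		return err;
	}
	if ((size_t)ret != len) {
		/* the kernel handler returns the full count on success */
		snprintf(msg, msglen, "%s: short write, %zd of %zu bytes",
			 name, ret, len);
		return EIO;
	}
	return 0;
}
```

The caller would pass pfn_list_fd, the request, and "PL_REQUEST_MEMMAP", then print msg to stderr when the return value is nonzero. Most of the -EINVAL paths in the kernel patch also call printk(), so the crash kernel's console output is worth capturing as well.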
I am attaching the kernel patches you sent me earlier, which I applied on
top of:
https://lkml.org/lkml/2012/11/21/90
with the tweak for RHEL 2.6.32 kernels below applied on top of it:
NOTE: The patch above is for the latest kernel, so you need to fix it as
below if your kernel version is between v2.6.18 and v2.6.37:
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 511151b..56583a4 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1490,7 +1490,6 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(page, flags);
VMCOREINFO_OFFSET(page, _count);
VMCOREINFO_OFFSET(page, mapping);
- VMCOREINFO_OFFSET(page, _mapcount);
VMCOREINFO_OFFSET(page, private);
VMCOREINFO_OFFSET(page, lru);
VMCOREINFO_OFFSET(pglist_data, node_zones);
@@ -1515,8 +1514,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PG_lru);
VMCOREINFO_NUMBER(PG_private);
VMCOREINFO_NUMBER(PG_swapcache);
- VMCOREINFO_NUMBER(PG_slab);
- VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
+ VMCOREINFO_NUMBER(PG_buddy);
arch_crash_save_vmcoreinfo();
update_vmcoreinfo_note();
[-- Attachment #2: cliff_kernel_patch_1219 --]
[-- Type: message/rfc822, Size: 21764 bytes --]
From:
Subject: [PATCH] scan page tables for makedumpfile
Date: Wed, 16 Jan 2013 05:00:37 -0700
Message-ID: <1358337637.13097.972.camel@lisamlinux.fc.hp.com>
---
fs/proc/vmcore.c | 568 +++++++++++++++++++++++++++++++++++++++++++
include/linux/makedumpfile.h | 115 ++++++++
2 files changed, 683 insertions(+)
Index: linux/fs/proc/vmcore.c
===================================================================
--- linux.orig/fs/proc/vmcore.c
+++ linux/fs/proc/vmcore.c
@@ -17,8 +17,18 @@
#include <linux/init.h>
#include <linux/crash_dump.h>
#include <linux/list.h>
+#include <linux/makedumpfile.h>
+#include <linux/mmzone.h>
#include <asm/uaccess.h>
#include <asm/io.h>
+#include <asm/page.h>
+static int num_mem_map_data = 0;
+static struct mem_map_data *mem_map_data;
+static struct pfn_element *pfn_list;
+static long in_pfn_list;
+static int last_found_vaddr = 0;
+static int last_found_paddr = 0;
+static int max_pfn_list;
/* List representing chunks of contiguous memory areas and their offsets in
* vmcore file.
@@ -33,6 +43,7 @@ static size_t elfcorebuf_sz;
static u64 vmcore_size;
static struct proc_dir_entry *proc_vmcore = NULL;
+static struct proc_dir_entry *proc_vmcore_pfn_lists = NULL;
/* Reads a page from the oldmem device from given offset. */
static ssize_t read_from_oldmem(char *buf, size_t count,
@@ -160,10 +171,563 @@ static ssize_t read_vmcore(struct file *
return acc;
}
+/*
+ * Given the boot-kernel-relative virtual address of a page
+ * return its crashkernel-relative virtual address.
+ *
+ * We have a memory map named mem_map_data
+ *
+ * return 0 if it cannot be found
+ */
+unsigned long
+find_local_vaddr(unsigned long orig_vaddr)
+{
+ int i;
+ int fnd = 0;
+ struct mem_map_data *mmd, *next_mmd;
+ unsigned long paddr;
+ unsigned long local_vaddr;
+ unsigned long offset;
+
+ if (!num_mem_map_data) {
+ printk("find_local_vaddr !! num_mem_map_data is %d\n",
+ num_mem_map_data);
+ return 0;
+ }
+
+fullsearch:
+ for (i = last_found_vaddr, mmd = mem_map_data + last_found_vaddr,
+ next_mmd = mem_map_data + last_found_vaddr + 1;
+ i < num_mem_map_data; i++, mmd++, next_mmd++) {
+ if (mmd->mem_map && mmd->paddr) {
+ if (orig_vaddr >= mmd->mem_map &&
+ orig_vaddr < next_mmd->mem_map) {
+ offset = orig_vaddr - mmd->mem_map;
+ paddr = mmd->paddr + offset;
+ fnd++;
+ /* caching gives about 99% hit on first pass */
+ last_found_vaddr = i;
+ break;
+ }
+ }
+ }
+
+ if (! fnd) {
+ if (last_found_vaddr > 0) {
+ last_found_vaddr = 0;
+ goto fullsearch;
+ }
+ return 0;
+ }
+
+ /* paddr is now the physical address of the page structure */
+ /* and offset is the offset into the found section, and we have
+ a table of how those sections are ioremap_cache'd */
+ local_vaddr = (unsigned long)mmd->section_vaddr + offset;
+ return local_vaddr;
+}
+
+/*
+ * Given a paddr, return its crashkernel-relative virtual address.
+ *
+ * We have a memory map named mem_map_data
+ *
+ * return 0 if it cannot be found
+ */
+void *
+find_local_from_paddr(unsigned long paddr)
+{
+ int i;
+ struct mem_map_data *mmd;
+ unsigned long offset;
+
+ if (!num_mem_map_data) {
+ printk("find_local_from_paddr !! num_mem_map_data is %d\n",
+ num_mem_map_data);
+ return 0;
+ }
+
+fullsearch:
+ for (i = last_found_paddr, mmd = mem_map_data + last_found_paddr;
+ i < num_mem_map_data; i++, mmd++) {
+ if ((paddr >= mmd->paddr) && (paddr < mmd->ending_paddr)) {
+ offset = paddr - mmd->paddr;
+ last_found_paddr = i;
+ /* caching gives about 98% hit on first pass */
+ return (void *)(mmd->section_vaddr + offset);
+ }
+ }
+
+ if (last_found_paddr > 0) {
+ last_found_paddr = 0;
+ goto fullsearch;
+ }
+ return 0;
+}
+
+/*
+ * given an anchoring list_head, walk the list of free pages
+ * 'root' is a virtual address based on the ioremap_cache'd pointer pgp
+ * 'boot_root' is the virtual address of the list root, boot kernel relative
+ *
+ * return the number of pages found on the list
+ */
+int
+walk_freelist(struct list_head *root, int node, int zone, int order, int list,
+ int restart_list, int start_page, struct pfn_list_request *reqp,
+ struct pfn_reply *replyp, struct list_head *boot_root)
+{
+ int list_ct = 0;
+ int list_free_pages = 0;
+ int doit;
+ unsigned long start_pfn;
+ struct page *pagep;
+ struct page *local_pagep;
+ struct list_head *lhp;
+ struct list_head *local_lhp; /* crashkernel-relative */
+ struct list_head *prev;
+ struct pfn_element *pe;
+
+ /*
+ * root is the crashkernel-relative address of the anchor of the
+ * free_list.
+ */
+ prev = root;
+ if (root == NULL) {
+ printk(KERN_EMERG "root is null!!, node %d order %d\n",
+ node, order);
+ return 0;
+ }
+
+ if (root->next == boot_root)
+ /* list is empty */
+ return 0;
+
+ lhp = root->next;
+ local_lhp = (struct list_head *)find_local_vaddr((unsigned long)lhp);
+ if (!local_lhp) {
+ return 0;
+ }
+
+ while (local_lhp != boot_root) {
+ list_ct++;
+ if (lhp == NULL) {
+ printk(KERN_EMERG
+ "The free list has a null!!, node %d order %d\n",
+ node, order);
+ break;
+ }
+ if (list_ct > 1 && local_lhp->prev != prev) {
+ /* can't be compared to root, as that is local */
+ printk(KERN_EMERG "The free list is broken!!\n");
+ break;
+ }
+
+ /* we want the boot kernel's pfn that this page represents */
+ pagep = container_of((struct list_head *)lhp,
+ struct page, lru);
+ start_pfn = pagep - vmemmap;
+ local_pagep = container_of((struct list_head *)local_lhp,
+ struct page, lru);
+ doit = 1;
+ if (restart_list && list_ct < start_page)
+ doit = 0;
+ if (doit) {
+ if (in_pfn_list == max_pfn_list) {
+ /* if array would overflow, come back to
+ this page with a continuation */
+ replyp->more = 1;
+ replyp->zone_index = zone;
+ replyp->freearea_index = order;
+ replyp->type_index = list;
+ replyp->list_ct = list_ct;
+ goto list_is_full;
+ }
+ pe = &pfn_list[in_pfn_list++];
+ pe->pfn = start_pfn;
+ pe->order = order;
+ list_free_pages += (1 << order);
+ }
+ prev = lhp;
+ lhp = local_pagep->lru.next;
+ /* the local node-relative vaddr: */
+ local_lhp = (struct list_head *)
+ find_local_vaddr((unsigned long)lhp);
+ if (!local_lhp)
+ break;
+ }
+
+list_is_full:
+ return list_free_pages;
+}
+
+/*
+ * Return the pfns of free pages on this node
+ */
+int
+write_vmcore_get_free(struct pfn_list_request *reqp)
+{
+ int node;
+ int nr_zones;
+ int nr_orders = MAX_ORDER;
+ int nr_freelist = MIGRATE_TYPES;
+ int zone;
+ int order;
+ int list;
+ int start_zone = 0;
+ int start_order = 0;
+ int start_list = 0;
+ int ret;
+ int restart = 0;
+ int start_page = 0;
+ int node_free_pages = 0;
+ struct pfn_reply rep;
+ struct pglist_data *pgp;
+ struct zone *zonep;
+ struct free_area *fap;
+ struct list_head *flp;
+ struct list_head *boot_root;
+ unsigned long pgdat_paddr;
+ unsigned long pgdat_vaddr;
+ unsigned long page_aligned_pgdat;
+ unsigned long page_aligned_size;
+ void *mapped_vaddr;
+
+ node = reqp->node;
+ pgdat_paddr = reqp->pgdat_paddr;
+ pgdat_vaddr = reqp->pgdat_vaddr;
+
+ /* map this pglist_data structure within a page-aligned area */
+ page_aligned_pgdat = pgdat_paddr & ~(PAGE_SIZE - 1);
+ page_aligned_size = sizeof(struct pglist_data) +
+ (pgdat_paddr - page_aligned_pgdat);
+ page_aligned_size = ((page_aligned_size + (PAGE_SIZE - 1))
+ >> PAGE_SHIFT) << PAGE_SHIFT;
+ mapped_vaddr = ioremap_cache(page_aligned_pgdat, page_aligned_size);
+ if (!mapped_vaddr) {
+ printk("ioremap_cache of pgdat %#lx failed\n",
+ page_aligned_pgdat);
+ return -EINVAL;
+ }
+ pgp = (struct pglist_data *)(mapped_vaddr +
+ (pgdat_paddr - page_aligned_pgdat));
+ nr_zones = pgp->nr_zones;
+ memset(&rep, 0, sizeof(rep));
+
+ if (reqp->more) {
+ restart = 1;
+ start_zone = reqp->zone_index;
+ start_order = reqp->freearea_index;
+ start_list = reqp->type_index;
+ start_page = reqp->list_ct;
+ }
+
+ in_pfn_list = 0;
+ for (zone = start_zone; zone < nr_zones; zone++) {
+ zonep = &pgp->node_zones[zone];
+ for (order = start_order; order < nr_orders; order++) {
+ fap = &zonep->free_area[order];
+ /* some free_area's are all zero */
+ if (fap->nr_free) {
+ for (list = start_list; list < nr_freelist;
+ list++) {
+ flp = &fap->free_list[list];
+ boot_root = (struct list_head *)
+ (pgdat_vaddr +
+ ((unsigned long)flp -
+ (unsigned long)pgp));
+ ret = walk_freelist(flp, node, zone,
+ order, list, restart,
+ start_page, reqp, &rep,
+ boot_root);
+ node_free_pages += ret;
+ restart = 0;
+ if (rep.more)
+ goto list_full;
+ }
+ }
+ }
+ }
+list_full:
+
+ iounmap(mapped_vaddr);
+
+ /* copy the reply and the valid part of our pfn list to the user */
+ rep.pfn_free = node_free_pages; /* the total, for statistics */
+ rep.in_pfn_list = in_pfn_list;
+ if (copy_to_user(reqp->reply_ptr, &rep, sizeof(struct pfn_reply)))
+ return -EFAULT;
+ if (in_pfn_list) {
+ if (copy_to_user(reqp->pfn_list_ptr, pfn_list,
+ (in_pfn_list * sizeof(struct pfn_element))))
+ return -EFAULT;
+ }
+ return 0;
+}
+
+/*
+ * Get the mem_map_data table from makedumpfile
+ * and do the single allocate of the pfn_list.
+ */
+int
+write_vmcore_get_memmap(struct pfn_list_request *reqp)
+{
+ int i;
+ int count;
+ int size;
+ int ret = 0;
+ long pfn_list_elements;
+ long malloc_size;
+ unsigned long page_section_start;
+ unsigned long page_section_size;
+ struct mem_map_data *mmd, *dum_mmd;
+ struct pfn_reply rep;
+ void *bufptr;
+
+ rep.pfn_list_elements = 0;
+ if (num_mem_map_data) {
+ /* shouldn't have been done before, but if it was.. */
+ printk(KERN_INFO "warning: PL_REQUEST_MEMMAP is repeated\n");
+ for (i = 0, mmd = mem_map_data; i < num_mem_map_data;
+ i++, mmd++) {
+ iounmap(mmd->section_vaddr);
+ }
+ kfree(mem_map_data);
+ mem_map_data = NULL;
+ num_mem_map_data = 0;
+ kfree(pfn_list);
+ pfn_list = NULL;
+ }
+
+ count = reqp->map_count;
+ size = reqp->map_size;
+ bufptr = reqp->map_ptr;
+ if (size != (count * sizeof(struct mem_map_data))) {
+ printk("Error in mem_map_data, %d * %ld != %d\n",
+ count, sizeof(struct mem_map_data), size);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* add a dummy at the end to limit the size of the last entry */
+ size += sizeof(struct mem_map_data);
+
+ mem_map_data = kzalloc(size, GFP_KERNEL);
+ if (!mem_map_data) {
+ printk("kmalloc of mem_map_data for %d failed\n", size);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (copy_from_user(mem_map_data, bufptr, size)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ num_mem_map_data = count;
+
+ /* construct the dummy entry to limit the size of 'next_mmd->mem_map' */
+ /* (see find_local_vaddr() ) */
+ mmd = mem_map_data + (num_mem_map_data - 1);
+ page_section_size = (mmd->pfn_end - mmd->pfn_start) *
+ sizeof(struct page);
+ dum_mmd = mmd + 1;
+ *dum_mmd = *mmd;
+ dum_mmd->mem_map += page_section_size;
+
+ /* Fill in the ending address of array of page struct */
+ for (i = 0, mmd = mem_map_data; i < num_mem_map_data; i++, mmd++) {
+ mmd->ending_paddr = mmd->paddr +
+ ((mmd->pfn_end - mmd->pfn_start) * sizeof(struct page));
+ }
+
+ /* Map each section of page structures to local virtual addresses */
+ /* (these are never iounmap'd, as this is the crash kernel) */
+ for (i = 0, mmd = mem_map_data; i < num_mem_map_data; i++, mmd++) {
+ page_section_start = mmd->paddr;
+ page_section_size = (mmd->pfn_end - mmd->pfn_start) *
+ sizeof(struct page);
+ mmd->section_vaddr = ioremap_cache(page_section_start,
+ page_section_size);
+ if (!mmd->section_vaddr) {
+ printk(
+ "ioremap_cache of [%d] node %#lx for %#lx failed\n",
+ i, page_section_start, page_section_size);
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ /*
+ * allocate the array for PFN's (just once)
+ * get as much as we can, up to what the user specified, and return
+ * that count to the user
+ */
+ pfn_list_elements = reqp->list_size;
+ do {
+ malloc_size = pfn_list_elements * sizeof(struct pfn_element);
+ if ((pfn_list = kmalloc(malloc_size, GFP_KERNEL)) != NULL) {
+ rep.pfn_list_elements = pfn_list_elements;
+ max_pfn_list = pfn_list_elements;
+ goto out;
+ }
+ pfn_list_elements -= 1000;
+ } while (pfn_list == NULL && pfn_list_elements > 0);
+
+ ret = -EINVAL;
+out:
+ if (copy_to_user(reqp->reply_ptr, &rep, sizeof(struct pfn_reply)))
+ return -EFAULT;
+ return ret;
+}
+
+/*
+ * Return the pfns of to-be-excluded pages fulfilling this request.
+ * This is called for each mem_map in makedumpfile's list.
+ */
+int
+write_vmcore_get_excludes(struct pfn_list_request *reqp)
+{
+ int i;
+ int start = 0;
+ int end;
+ unsigned long paddr;
+ unsigned long pfn;
+ void *vaddr;
+ struct page *pagep;
+ struct pfn_reply rep;
+ struct pfn_element *pe;
+
+ if (!num_mem_map_data) {
+ /* sanity check */
+ printk(
+ "ERROR:PL_REQUEST_MEMMAP not done before PL_REQUEST_EXCLUDE\n");
+ return -EINVAL;
+ }
+
+ /*
+ * the request contains (besides request type and bufptr):
+ * paddr (physical address of page[0])
+ * count of pages in the block
+ * exclude bits (DL_EXCLUDE_...)
+ */
+ paddr = reqp->paddr;
+ end = reqp->count;
+ pfn = reqp->pfn_start;
+ /* find the already-mapped vaddr of this paddr */
+ vaddr = find_local_from_paddr(paddr);
+ if (!vaddr) {
+ printk("ERROR: PL_REQUEST_EXCLUDE cannot find paddr %#lx\n",
+ paddr);
+ return -EINVAL;
+ }
+ if (reqp->more) {
+ start = reqp->map_index;
+ vaddr += (reqp->map_index * sizeof(struct page));
+ pfn += reqp->map_index;
+ }
+ memset(&rep, 0, sizeof(rep));
+ in_pfn_list = 0;
+
+ for (i = start, pagep = (struct page *)vaddr; i < end;
+ i++, pagep++, pfn++) {
+ if (in_pfn_list == max_pfn_list) {
+ rep.in_pfn_list = in_pfn_list;
+ rep.more = 1;
+ rep.map_index = i;
+ break;
+ }
+ /*
+ * Exclude the free page managed by a buddy
+ */
+ if ((reqp->exclude_bits & DL_EXCLUDE_FREE)
+ && (pagep->flags & (1UL << PG_buddy))) {
+ pe = &pfn_list[in_pfn_list++];
+ pe->pfn = pfn;
+ pe->order = pagep->private;
+ rep.pfn_free += (1 << pe->order);
+ }
+ /*
+ * Exclude the cache page without the private page.
+ */
+ else if ((reqp->exclude_bits & DL_EXCLUDE_CACHE)
+ && (isLRU(pagep->flags) || isSwapCache(pagep->flags))
+ && !isPrivate(pagep->flags) && !isAnon(pagep->mapping)) {
+ pe = &pfn_list[in_pfn_list++];
+ pe->pfn = pfn;
+ pe->order = 0; /* assume 4k */
+ rep.pfn_cache++;
+ }
+ /*
+ * Exclude the cache page with the private page.
+ */
+ else if ((reqp->exclude_bits & DL_EXCLUDE_CACHE_PRI)
+ && (isLRU(pagep->flags) || isSwapCache(pagep->flags))
+ && !isAnon(pagep->mapping)) {
+ pe = &pfn_list[in_pfn_list++];
+ pe->pfn = pfn;
+ pe->order = 0; /* assume 4k */
+ rep.pfn_cache_private++;
+ }
+ /*
+ * Exclude the data page of the user process.
+ */
+ else if ((reqp->exclude_bits & DL_EXCLUDE_USER_DATA)
+ && isAnon(pagep->mapping)) {
+ pe = &pfn_list[in_pfn_list++];
+ pe->pfn = pfn;
+ pe->order = 0; /* assume 4k */
+ rep.pfn_user++;
+ }
+
+ }
+ rep.in_pfn_list = in_pfn_list;
+ if (copy_to_user(reqp->reply_ptr, &rep, sizeof(struct pfn_reply)))
+ return -EFAULT;
+ if (in_pfn_list) {
+ if (copy_to_user(reqp->pfn_list_ptr, pfn_list,
+ (in_pfn_list * sizeof(struct pfn_element))))
+ return -EFAULT;
+ }
+ return 0;
+}
+
+static ssize_t write_vmcore_pfn_lists(struct file *file,
+ const char __user *user_buf, size_t count, loff_t *ppos)
+{
+ int ret;
+ struct pfn_list_request pfn_list_request;
+
+ if (count != sizeof(struct pfn_list_request)) {
+ return -EINVAL;
+ }
+
+ if (copy_from_user(&pfn_list_request, user_buf, count))
+ return -EFAULT;
+
+ if (pfn_list_request.request == PL_REQUEST_FREE) {
+ ret = write_vmcore_get_free(&pfn_list_request);
+ } else if (pfn_list_request.request == PL_REQUEST_EXCLUDE) {
+ ret = write_vmcore_get_excludes(&pfn_list_request);
+ } else if (pfn_list_request.request == PL_REQUEST_MEMMAP) {
+ ret = write_vmcore_get_memmap(&pfn_list_request);
+ } else {
+ return -EINVAL;
+ }
+
+ if (ret)
+ return ret;
+ return count;
+}
+
static const struct file_operations proc_vmcore_operations = {
.read = read_vmcore,
};
+static const struct file_operations proc_vmcore_pfn_lists_operations = {
+ .write = write_vmcore_pfn_lists,
+};
+
static struct vmcore* __init get_new_element(void)
{
return kzalloc(sizeof(struct vmcore), GFP_KERNEL);
@@ -648,6 +1212,10 @@ static int __init vmcore_init(void)
proc_vmcore = proc_create("vmcore", S_IRUSR, NULL, &proc_vmcore_operations);
if (proc_vmcore)
proc_vmcore->size = vmcore_size;
+
+ proc_vmcore_pfn_lists = proc_create("vmcore_pfn_lists", S_IWUSR, NULL,
+ &proc_vmcore_pfn_lists_operations);
+
return 0;
}
module_init(vmcore_init)
Index: linux/include/linux/makedumpfile.h
===================================================================
--- /dev/null
+++ linux/include/linux/makedumpfile.h
@@ -0,0 +1,115 @@
+/*
+ * makedumpfile.h
+ * portions Copyright (C) 2006, 2007, 2008, 2009 NEC Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define isLRU(flags) (flags & (1UL << PG_lru))
+#define isPrivate(flags) (flags & (1UL << PG_private))
+#define isSwapCache(flags) (flags & (1UL << PG_swapcache))
+
+static inline int
+isAnon(struct address_space *mapping)
+{
+ return ((unsigned long)mapping & PAGE_MAPPING_ANON) != 0;
+}
+
+#define DL_EXCLUDE_ZERO (0x001) /* Exclude Pages filled with Zeros */
+#define DL_EXCLUDE_CACHE (0x002) /* Exclude Cache Pages
+ without Private Pages */
+#define DL_EXCLUDE_CACHE_PRI (0x004) /* Exclude Cache Pages
+ with Private Pages */
+#define DL_EXCLUDE_USER_DATA (0x008) /* Exclude UserProcessData Pages */
+#define DL_EXCLUDE_FREE (0x010) /* Exclude Free Pages */
+
+#define PL_REQUEST_FREE 1 /* request for a list of free pages */
+#define PL_REQUEST_EXCLUDE 2 /* request for a list of excludable
+ pages */
+#define PL_REQUEST_MEMMAP 3 /* request to pass in the makedumpfile
+ mem_map_data table */
+/*
+ * a request for finding pfn's that can be excluded from the dump
+ * they may be pages of particular types or free pages
+ */
+struct pfn_list_request {
+ int request; /* PL_REQUEST_FREE PL_REQUEST_EXCLUDE or */
+ /* PL_REQUEST_MEMMAP */
+ int debug;
+ unsigned long paddr; /* mem_map address for PL_REQUEST_EXCLUDE */
+ unsigned long pfn_start;/* pfn represented by paddr */
+ unsigned long pgdat_paddr; /* for PL_REQUEST_FREE */
+ unsigned long pgdat_vaddr; /* for PL_REQUEST_FREE */
+ int node; /* for PL_REQUEST_FREE */
+ int exclude_bits; /* for PL_REQUEST_EXCLUDE */
+ int count; /* for PL_REQUEST_EXCLUDE */
+ void *reply_ptr; /* address of user's pfn_reply, for reply */
+ void *pfn_list_ptr; /* address of user's pfn array (*pfn_list) */
+ int map_count; /* for PL_REQUEST_MEMMAP; elements */
+ int map_size; /* for PL_REQUEST_MEMMAP; bytes in table */
+ void *map_ptr; /* for PL_REQUEST_MEMMAP; address of table */
+ long list_size; /* for PL_REQUEST_MEMMAP negotiation */
+ /* resume info: */
+ int more; /* 0 for done, 1 for "there's more" */
+ /* PL_REQUEST_EXCLUDE: */
+ int map_index; /* slot in the mem_map array of page structs */
+ /* PL_REQUEST_FREE: */
+ int zone_index; /* zone within the node's pgdat_list */
+ int freearea_index; /* free_area within the zone */
+ int type_index; /* free_list within the free_area */
+ int list_ct; /* page within the list */
+};
+
+/*
+ * the reply from a pfn_list_request
+ * the list of pfn's itself is pointed to by pfn_list
+ */
+struct pfn_reply {
+ long pfn_list_elements; /* negotiated on PL_REQUEST_MEMMAP */
+ long in_pfn_list; /* returned by PL_REQUEST_EXCLUDE and
+ PL_REQUEST_FREE */
+ /* resume info */
+ int more; /* 0 == done, 1 == there is more */
+ /* PL_REQUEST_EXCLUDE: */
+ int map_index; /* slot in the mem_map array of page structs */
+ /* PL_REQUEST_FREE: */
+ int zone_index; /* zone within the node's pgdat_list */
+ int freearea_index; /* free_area within the zone */
+ int type_index; /* free_list within the free_area */
+ int list_ct; /* page within the list */
+ /* statistic counters: */
+ unsigned long long pfn_cache; /* PL_REQUEST_EXCLUDE */
+ unsigned long long pfn_cache_private; /* PL_REQUEST_EXCLUDE */
+ unsigned long long pfn_user; /* PL_REQUEST_EXCLUDE */
+ unsigned long long pfn_free; /* PL_REQUEST_FREE */
+};
+
+struct pfn_element {
+ unsigned long pfn;
+ unsigned long order;
+};
+
+struct mem_map_data {
+ /*
+ * pfn_start/pfn_end are the pfn's represented by this mem_map entry.
+ * mem_map is the virtual address of the array of page structures
+ * that represent these pages.
+ * paddr is the physical address of that array of structures.
+ * ending_paddr is paddr + (pfn_end - pfn_start) * sizeof(struct page).
+ * section_vaddr is the address we get from ioremap_cache().
+ */
+ unsigned long long pfn_start;
+ unsigned long long pfn_end;
+ unsigned long mem_map;
+ unsigned long long paddr; /* filled in by makedumpfile */
+ unsigned long long ending_paddr; /* filled in by kernel */
+ void *section_vaddr; /* filled in by kernel */
+};
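For reference, the user-space side of PL_REQUEST_MEMMAP could be sketched as follows. The kernel handler rejects the request unless map_size equals map_count * sizeof(struct mem_map_data), and it then adds one entry to that size before calling copy_from_user(), so it reads map_count + 1 entries from map_ptr. The sketch therefore allocates one spare entry. Whether that extra read has anything to do with the failure reported above is not established here. The helper names alloc_memmap_table() and build_memmap_request() are hypothetical; the structures are copied from the header in the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PL_REQUEST_MEMMAP 3

/* Both structures are copied from the header in the patch above. */
struct mem_map_data {
	unsigned long long pfn_start;
	unsigned long long pfn_end;
	unsigned long mem_map;
	unsigned long long paddr;
	unsigned long long ending_paddr;
	void *section_vaddr;
};

struct pfn_list_request {
	int request;
	int debug;
	unsigned long paddr;
	unsigned long pfn_start;
	unsigned long pgdat_paddr;
	unsigned long pgdat_vaddr;
	int node;
	int exclude_bits;
	int count;
	void *reply_ptr;
	void *pfn_list_ptr;
	int map_count;
	int map_size;
	void *map_ptr;
	long list_size;
	int more;
	int map_index;
	int zone_index;
	int freearea_index;
	int type_index;
	int list_ct;
};

/*
 * Hypothetical helper: allocate a zeroed table of 'count' entries plus
 * one spare trailing entry, because the kernel side copies count + 1.
 */
struct mem_map_data *
alloc_memmap_table(int count)
{
	assert(count > 0);
	return calloc((size_t)count + 1, sizeof(struct mem_map_data));
}

/*
 * Hypothetical helper: fill in a PL_REQUEST_MEMMAP request.
 * map_size counts only the 'count' real entries, which is what the
 * kernel's size check compares against.
 */
void
build_memmap_request(struct pfn_list_request *req, struct mem_map_data *table,
		     int count, long list_size, void *reply)
{
	memset(req, 0, sizeof(*req));
	req->request = PL_REQUEST_MEMMAP;
	req->map_ptr = table;
	req->map_count = count;
	req->map_size = (int)(count * sizeof(struct mem_map_data));
	req->list_size = list_size;
	req->reply_ptr = reply;
}
```

The request built this way would then be passed to write() on /proc/vmcore_pfn_lists with a length of exactly sizeof(struct pfn_list_request), since the kernel rejects any other length with -EINVAL.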
Thread overview: 10+ messages
2013-01-04 16:20 [PATCH v2] makedumpfile: request the kernel do page scans Cliff Wickman
2013-01-16 12:15 ` Lisa Mitchell [this message]
2013-01-16 12:51 ` Lisa Mitchell
2013-01-16 17:50 ` Cliff Wickman
-- strict thread matches above, loose matches on Subject: below --
2012-11-21 20:06 Cliff Wickman
2012-11-22 1:43 ` Hatayama, Daisuke
2012-11-22 14:07 ` HATAYAMA Daisuke
2013-01-07 13:39 ` Cliff Wickman
2013-01-09 15:09 ` HATAYAMA Daisuke
[not found] ` <20130111223034.GA2154@sgi.com>
2013-01-17 1:38 ` HATAYAMA Daisuke