From: Lisa Mitchell <lisa.mitchell@hp.com>
To: Cliff Wickman <cpw@sgi.com>
Cc: "kexec@lists.infradead.org" <kexec@lists.infradead.org>,
"d.hatayama@jp.fujitsu.com" <d.hatayama@jp.fujitsu.com>,
"kumagai-atsushi@mxc.nes.nec.co.jp"
<kumagai-atsushi@mxc.nes.nec.co.jp>
Subject: Re: [PATCH v2] makedumpfile: request the kernel do page scans
Date: Wed, 16 Jan 2013 05:15:29 -0700
Message-ID: <1358338529.13097.987.camel@lisamlinux.fc.hp.com>
In-Reply-To: <E1TrA0E-0001Ov-Pe@eag09.americas.sgi.com>
On Fri, 2013-01-04 at 16:20 +0000, Cliff Wickman wrote:
> From: Cliff Wickman <cpw@sgi.com>
>
> This version of the patch improves the consolidation of the mem_map table
> that is passed to the kernel (see make_kernel_mmap()), in particular the
> handling of the seemingly duplicate pfn ranges generated on an older
> (2.6.32-based, RHEL 6) kernel.
>
>
>
> I've been experimenting with asking the kernel to scan the page tables
> instead of reading all those page structures through /proc/vmcore.
> The results are rather dramatic.
> On a small, idle UV: about 4 seconds versus about 40 seconds.
> On an 8 TB UV the scan for unnecessary pages takes about 4 minutes,
> versus about 200 minutes through /proc/vmcore.
>
> This patch incorporates this scheme into version 1.5.1, so that the cyclic
> processing can use the kernel scans.
> It also uses the page_is_buddy logic to speed the finding of free pages.
> And also allows makedumpfile to work as before with a kernel that does
> not provide /proc/vmcore_pfn_lists.
>
> This patch:
> - writes requests to new kernel file /proc/vmcore_pfn_lists
> - makes request PL_REQUEST_MEMMAP to pass the crash kernel the mem_map
>   information of the boot kernel
> - makes requests PL_REQUEST_FREE and PL_REQUEST_EXCLUDE, asking the kernel
> to return lists of PFNs
> - adds page scan timing options -n, -o, and -t
> - retains a debugging option, -a
>
> This patch depends on a kernel patch.
>
> Diffed against the released makedumpfile-1.5.1
>
> Signed-off-by: Cliff Wickman <cpw@sgi.com>
> ---
> dwarf_info.c | 2
> makedumpfile.c | 587 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> makedumpfile.h | 95 +++++++++
> print_info.c | 5
> 4 files changed, 665 insertions(+), 24 deletions(-)
>
>
> Index: makedumpfile-1.5.1.released/makedumpfile.h
> ===================================================================
> --- makedumpfile-1.5.1.released.orig/makedumpfile.h
> +++ makedumpfile-1.5.1.released/makedumpfile.h
> @@ -86,6 +86,8 @@ int get_mem_type(void);
> #define LSEEKED_PDESC (2)
> #define LSEEKED_PDATA (3)
>
> +#define EXTRA_MEMMAPS 100
> +
> /*
> * Xen page flags
> */
> @@ -418,7 +420,7 @@ do { \
> #define KVER_MIN_SHIFT 16
> #define KERNEL_VERSION(x,y,z) (((x) << KVER_MAJ_SHIFT) | ((y) << KVER_MIN_SHIFT) | (z))
> #define OLDEST_VERSION KERNEL_VERSION(2, 6, 15)/* linux-2.6.15 */
> -#define LATEST_VERSION KERNEL_VERSION(3, 6, 7)/* linux-3.6.7 */
> +#define LATEST_VERSION KERNEL_VERSION(3, 7, 8)/* linux-3.7.8 */
>
> /*
> * vmcoreinfo in /proc/vmcore
> @@ -794,11 +796,25 @@ typedef struct {
> } xen_crash_info_v2_t;
>
> struct mem_map_data {
> + /*
> + * pfn_start/pfn_end are the pfn's represented by this mem_map entry.
> + * mem_map is the virtual address of the array of page structures
> + * that represent these pages.
> + * paddr is the physical address of that array of structures.
> + * ending_paddr would be paddr + (pfn_end - pfn_start) * sizeof(struct page).
> + * section_vaddr is the address we get from ioremap_cache().
> + */
> unsigned long long pfn_start;
> unsigned long long pfn_end;
> - unsigned long mem_map;
> + unsigned long mem_map;
> + unsigned long long paddr; /* filled in by makedumpfile */
> + long virtual_offset; /* filled in by kernel */
> + unsigned long long ending_paddr; /* filled in by kernel */
> + unsigned long mapped_size; /* filled in by kernel */
> + void *section_vaddr; /* filled in by kernel */
> };
>
> +
> struct dump_bitmap {
> int fd;
> int no_block;
> @@ -875,6 +891,7 @@ struct DumpInfo {
> int flag_rearrange; /* flag of creating dumpfile from
> flattened format */
> int flag_split; /* splitting vmcore */
> + int flag_use_kernel_lists;
> int flag_cyclic; /* cyclic processing to keep memory consumption */
> int flag_reassemble; /* reassemble multiple dumpfiles into one */
> int flag_refiltering; /* refilter from kdump-compressed file */
> @@ -1384,6 +1401,80 @@ struct domain_list {
> unsigned int pickled_id;
> };
>
> +#define PL_REQUEST_FREE 1 /* request for a list of free pages */
> +#define PL_REQUEST_EXCLUDE 2 /* request for a list of excludable
> + pages */
> +#define PL_REQUEST_MEMMAP 3 /* request to pass in the makedumpfile
> + mem_map_data table */
> +/*
> + * limit the size of the pfn list to this many pfn_element structures
> + */
> +#define MAX_PFN_LIST 10000
> +
> +/*
> + * one element in the pfn_list
> + */
> +struct pfn_element {
> + unsigned long pfn;
> + unsigned long order;
> +};
> +
> +/*
> + * a request for finding pfn's that can be excluded from the dump;
> + * they may be pages of particular types, or free pages
> + */
> +struct pfn_list_request {
> + int request; /* PL_REQUEST_FREE PL_REQUEST_EXCLUDE or */
> + /* PL_REQUEST_MEMMAP */
> + int debug;
> + unsigned long paddr; /* mem_map address for PL_REQUEST_EXCLUDE */
> + unsigned long pfn_start;/* pfn represented by paddr */
> + unsigned long pgdat_paddr; /* for PL_REQUEST_FREE */
> + unsigned long pgdat_vaddr; /* for PL_REQUEST_FREE */
> + int node; /* for PL_REQUEST_FREE */
> + int exclude_bits; /* for PL_REQUEST_EXCLUDE */
> + int count; /* for PL_REQUEST_EXCLUDE */
> + void *reply_ptr; /* address of user's pfn_reply, for reply */
> + void *pfn_list_ptr; /* address of user's pfn array (*pfn_list) */
> + int map_count; /* for PL_REQUEST_MEMMAP; elements */
> + int map_size; /* for PL_REQUEST_MEMMAP; bytes in table */
> + void *map_ptr; /* for PL_REQUEST_MEMMAP; address of table */
> + long list_size; /* for PL_REQUEST_MEMMAP negotiation */
> + /* resume info: */
> + int more; /* 0 for done, 1 for "there's more" */
> + /* PL_REQUEST_EXCLUDE: */
> + int map_index; /* slot in the mem_map array of page structs */
> + /* PL_REQUEST_FREE: */
> + int zone_index; /* zone within the node's pgdat_list */
> + int freearea_index; /* free_area within the zone */
> + int type_index; /* free_list within the free_area */
> + int list_ct; /* page within the list */
> +};
> +
> +/*
> + * the reply from a pfn_list_request;
> + * the list of pfn's itself is pointed to by pfn_list
> + */
> +struct pfn_reply {
> + long pfn_list_elements; /* negotiated on PL_REQUEST_MEMMAP */
> + long in_pfn_list; /* returned by PL_REQUEST_EXCLUDE and
> + PL_REQUEST_FREE */
> + /* resume info */
> + int more; /* 0 == done, 1 == there is more */
> + /* PL_REQUEST_MEMMAP: */
> + int map_index; /* slot in the mem_map array of page structs */
> + /* PL_REQUEST_FREE: */
> + int zone_index; /* zone within the node's pgdat_list */
> + int freearea_index; /* free_area within the zone */
> + int type_index; /* free_list within the free_area */
> + int list_ct; /* page within the list */
> + /* statistic counters: */
> + unsigned long long pfn_cache; /* PL_REQUEST_EXCLUDE */
> + unsigned long long pfn_cache_private; /* PL_REQUEST_EXCLUDE */
> + unsigned long long pfn_user; /* PL_REQUEST_EXCLUDE */
> + unsigned long long pfn_free; /* PL_REQUEST_FREE */
> +};
> +
> #define PAGES_PER_MAPWORD (sizeof(unsigned long) * 8)
> #define MFNS_PER_FRAME (info->page_size / sizeof(unsigned long))
>
> Index: makedumpfile-1.5.1.released/dwarf_info.c
> ===================================================================
> --- makedumpfile-1.5.1.released.orig/dwarf_info.c
> +++ makedumpfile-1.5.1.released/dwarf_info.c
> @@ -324,6 +324,8 @@ get_data_member_location(Dwarf_Die *die,
> return TRUE;
> }
>
> +int dwarf_formref(Dwarf_Attribute *, Dwarf_Off *);
> +
> static int
> get_die_type(Dwarf_Die *die, Dwarf_Die *die_type)
> {
> Index: makedumpfile-1.5.1.released/print_info.c
> ===================================================================
> --- makedumpfile-1.5.1.released.orig/print_info.c
> +++ makedumpfile-1.5.1.released/print_info.c
> @@ -244,6 +244,11 @@ print_usage(void)
> MSG(" [-f]:\n");
> MSG(" Overwrite DUMPFILE even if it already exists.\n");
> MSG("\n");
> + MSG(" [-o]:\n");
> + MSG(" Read page structures from /proc/vmcore in the scan for\n");
> + MSG(" free and excluded pages regardless of whether\n");
> + MSG(" /proc/vmcore_pfn_lists is present.\n");
> + MSG("\n");
> MSG(" [-h]:\n");
> MSG(" Show help message and LZO/snappy support status (enabled/disabled).\n");
> MSG("\n");
> Index: makedumpfile-1.5.1.released/makedumpfile.c
> ===================================================================
> --- makedumpfile-1.5.1.released.orig/makedumpfile.c
> +++ makedumpfile-1.5.1.released/makedumpfile.c
> @@ -13,6 +13,8 @@
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> * GNU General Public License for more details.
> */
> +#define _GNU_SOURCE
> +#include <stdio.h>
> #include "makedumpfile.h"
> #include "print_info.h"
> #include "dwarf_info.h"
> @@ -31,6 +33,14 @@ struct srcfile_table srcfile_table;
>
> struct vm_table vt = { 0 };
> struct DumpInfo *info = NULL;
> +int pfn_list_fd;
> +struct pfn_element *pfn_list;
> +int nflag = 0;
> +int oflag = 0;
> +int tflag = 0;
> +int aflag = 0;
> +struct timeval scan_start;
> +int max_pfn_list;
>
> char filename_stdout[] = FILENAME_STDOUT;
>
> @@ -2415,6 +2425,22 @@ get_mm_sparsemem(void)
> unsigned long long pfn_start, pfn_end;
> unsigned long section, mem_map;
> unsigned long *mem_sec = NULL;
> + unsigned long vaddr;
> + unsigned long paddr;
> + unsigned long lastvaddr;
> + unsigned long lastpaddr;
> + unsigned long diff;
> + long j;
> + int i;
> + int npfns;
> + int pagesize;
> + int num_mem_map;
> + int num_added = 0;
> + struct mem_map_data *mmd;
> + struct mem_map_data *curmmd;
> + struct mem_map_data *work1mmd;
> + struct mem_map_data *work2mmd;
> + struct mem_map_data *lastmmd;
>
> int ret = FALSE;
>
> @@ -2441,7 +2467,8 @@ get_mm_sparsemem(void)
> }
> info->num_mem_map = num_section;
> if ((info->mem_map_data = (struct mem_map_data *)
> - malloc(sizeof(struct mem_map_data)*info->num_mem_map)) == NULL) {
> + malloc(sizeof(struct mem_map_data) *
> + (EXTRA_MEMMAPS + info->num_mem_map))) == NULL) {
> ERRMSG("Can't allocate memory for the mem_map_data. %s\n",
> strerror(errno));
> goto out;
> @@ -2459,6 +2486,71 @@ get_mm_sparsemem(void)
> dump_mem_map(pfn_start, pfn_end, mem_map, section_nr);
> }
> ret = TRUE;
> +
> + /* add paddr to the table */
> + mmd = &info->mem_map_data[0];
> + num_mem_map = info->num_mem_map;
> + lastmmd = mmd + num_mem_map;
> + for (i = 0; i < num_mem_map; i++) {
> + if (mmd[i].mem_map == 0) {
> + mmd[i].paddr = 0;
> + } else {
> + mmd[i].paddr = vaddr_to_paddr(mmd[i].mem_map);
> + if (mmd[i].paddr == 0) {
> + printf("! can't translate %#lx to paddr\n",
> + mmd[i].mem_map);
> + exit(1);
> + }
> + /*
> + * When we pass a mem_map and its paddr to the kernel
> + * it will be remapped assuming the entire range
> + * of pfn's is consecutive. If it is not, then
> + * we need to split the range into two.
> + */
> + pagesize = SIZE(page);
> + npfns = mmd[i].pfn_end - mmd[i].pfn_start;
> + vaddr = (unsigned long)mmd[i].mem_map;
> + paddr = vaddr_to_paddr(vaddr);
> + diff = vaddr - paddr;
> + lastvaddr = vaddr + (pagesize * (npfns-1));
> + lastpaddr = vaddr_to_paddr(lastvaddr);
> + if (lastvaddr - lastpaddr != diff) {
> + /* there is a break in vtop somewhere in this range */
> + /* we need to split it */
> + for (j = 0; j < npfns; j++) {
> + paddr = vaddr_to_paddr(vaddr);
> + if (vaddr - paddr != diff) {
> + diff = vaddr - paddr;
> + /* insert a new entry if we have room */
> + if (num_added < EXTRA_MEMMAPS) {
> + curmmd = &info->mem_map_data[i];
> + num_added++;
> + work1mmd = lastmmd - 1;
> + for (work2mmd = lastmmd;
> + work2mmd > curmmd; work2mmd--) {
> + work1mmd = work2mmd - 1;
> + *work2mmd = *work1mmd;
> + }
> + work2mmd = work1mmd + 1;
> + work2mmd->mem_map =
> + work1mmd->mem_map + (pagesize * j);
> + lastmmd++;
> + num_mem_map++;
> + info->num_mem_map++;
> + /*
> + * need only 1 split, the new
> + * one will be checked also.
> + */
> + break;
> + } else
> + printf("warn: out of EXTRA_MEMMAPS\n");
> + }
> + vaddr += pagesize;
> + }
> + }
> + }
> + }
> +
> out:
> if (mem_sec != NULL)
> free(mem_sec);
> @@ -2571,6 +2663,172 @@ initialize_bitmap_memory(void)
> return TRUE;
> }
>
> +/*
> + * construct a version of the mem_map_data table to pass to the kernel
> + */
> +void *
> +make_kernel_mmap(int *kmap_elements, int *kmap_size)
> +{
> + int i, j;
> + int elements = 0;
> + int page_structs;
> + int elem;
> + long l;
> + unsigned long base_end_pfn;
> + unsigned long end_paddr;
> + unsigned long v1;
> + unsigned long v2;
> + unsigned long end_page_pfns;
> + unsigned long hpagesize = 0x200000UL;
> + unsigned long hpageoffset = hpagesize - 1;
> + struct mem_map_data *mmdo, *mmdn;
> + struct mem_map_data *mmdbase, *mmdnext, *mmdend, *mmdwork;
> + struct mem_map_data temp_mmd;
> + struct mem_map_data *mmap;
> +
> + mmap = malloc(info->num_mem_map * sizeof(struct mem_map_data));
> + if (mmap == NULL) {
> + ERRMSG("Can't allocate memory for kernel map\n");
> + return NULL;
> + }
> +
> + /* condense them down to the valid ones */
> + for (i = 0, mmdn = mmap, mmdo = &info->mem_map_data[0];
> + i < info->num_mem_map; i++, mmdo++) {
> + if (mmdo->mem_map && mmdo->paddr) {
> + *mmdn = *mmdo;
> + mmdn++;
> + elements++;
> + }
> + }
> +
> + /* make sure it is sorted by mem_map (it should be already) */
> + mmdn = mmap;
> + for (i = 0; i < elements - 1; i++) {
> + for (j = i + 1; j < elements; j++) {
> + if (mmdn[j].mem_map < mmdn[i].mem_map) {
> + temp_mmd = mmdn[j];
> + mmdn[j] = mmdn[i];
> + mmdn[i] = temp_mmd;
> + }
> + }
> + }
> +
> + if (aflag) {
> + mmdn = mmap;
> + printf("entire mem_map:\n");
> + for (i = 0; i < elements - 1; i++) {
> + l = (mmdn[i].pfn_end - mmdn[i].pfn_start) * SIZE(page);
> + printf(
> + "[%d] pfn %#llx-%llx mem_map %#lx paddr %#llx-%llx\n",
> + i, mmdn[i].pfn_start, mmdn[i].pfn_end,
> + mmdn[i].mem_map, mmdn[i].paddr,
> + mmdn[i].paddr + l);
> + }
> + }
> +
> + /*
> + * a first pass to split overlapping pfn entries like this:
> + * pfn 0x1248000-1250000 mem_map 0xffffea003ffc0000 paddr 0x10081c0000
> + * pfn 0x1248000-1250000 mem_map 0xffffea0040000030 paddr 0x1008400030
> + */
> + mmdbase = mmap;
> + mmdnext = mmap + 1;
> + mmdend = mmap + elements;
> + /* test each mmdbase/mmdnext pair */
> + while (mmdnext < mmdend) { /* mmdnext is the one after mmdbase */
> + page_structs = (mmdbase->pfn_end - mmdbase->pfn_start);
> + /* mmdwork scans from mmdnext to the end */
> + if ((mmdbase->pfn_start == mmdnext->pfn_start) &&
> + (mmdbase->pfn_end == mmdnext->pfn_end)) {
> + /* overlapping pfns, we need a fix */
> + v1 = mmdnext->mem_map - mmdbase->mem_map;
> + v2 = mmdnext->paddr - mmdbase->paddr;
> + if (v1 != (v2 & hpageoffset))
> + printf("virt to phys is wrong %#lx %#lx\n",
> + v1, v2);
> + l = mmdbase->pfn_end - mmdbase->pfn_start;
> + end_page_pfns = l - (((hpagesize -
> + (hpageoffset & mmdbase->paddr)) +
> + SIZE(page) - 1) / SIZE(page));
> + mmdbase->pfn_end -= end_page_pfns;
> + mmdnext->pfn_start = mmdbase->pfn_end;
> + } else if ((mmdbase->pfn_start == mmdnext->pfn_start) ||
> + (mmdbase->pfn_end == mmdnext->pfn_end)) {
> + printf("warning: unfixed overlap\n");
> + }
> + mmdbase++;
> + mmdnext++;
> + }
> +
> + /*
> + * consolidate those mem_map's occupying consecutive physical
> + * addresses
> + * pages represented by these page structs: addr of page struct
> + * pfns 0x1000000-1008000 mem_map 0xffffea0038000000 paddr 0x11f7e00000
> + * pfns 0x1008000-1010000 mem_map 0xffffea00381c0000 paddr 0x11f7fc0000
> + * pfns 0x1010000-1018000 mem_map 0xffffea0038380000 paddr 0x11f8180000
> + * (pfn increments of 0x8000; mem_map/paddr increments of 0x1c0000;
> + * each section: 0x8000000 of memory (128M), 0x8000 page structs)
> + */
> + mmdbase = mmap;
> + mmdnext = mmap + 1;
> + mmdend = mmap + elements;
> + while (mmdnext < mmdend) {
> + elem = mmdend - mmdnext;
> + /* test mmdbase vs. mmdwork and onward: */
> + for (i = 0, mmdwork = mmdnext; i < elem; i++, mmdwork++) {
> + base_end_pfn = mmdbase->pfn_end;
> + if (base_end_pfn == mmdwork->pfn_start) {
> + page_structs = (mmdbase->pfn_end -
> + mmdbase->pfn_start);
> + end_paddr = (page_structs * SIZE(page)) +
> + mmdbase->paddr;
> + if (mmdwork->paddr == end_paddr) {
> + /* extend base by the work one */
> + mmdbase->pfn_end = mmdwork->pfn_end;
> + /* next is where to begin next time */
> + mmdnext = mmdwork + 1;
> + } else {
> + /* gap in address of page
> + structs; end of section */
> + mmdbase++;
> + if (mmdwork - mmdbase > 0)
> + *mmdbase = *mmdwork;
> + mmdnext = mmdwork + 1;
> + break;
> + }
> + } else {
> + /* gap in pfns; end of section */
> + mmdbase++;
> + if (mmdwork - mmdbase > 0)
> + *mmdbase = *mmdwork;
> + mmdnext = mmdwork + 1;
> + break;
> + }
> + }
> + }
> + elements = (mmdbase - mmap) + 1;
> +
> + if (aflag) {
> + printf("user mmap for kernel:\n");
> + for (i = 0, mmdwork = mmap; i < elements; i++, mmdwork++) {
> + l = mmdwork->pfn_end - mmdwork->pfn_start;
> + printf(
> + "[%d] user pfn %#llx-%llx paddr %#llx-%llx vaddr %#lx\n",
> + i, mmdwork->pfn_start, mmdwork->pfn_end,
> + mmdwork->paddr,
> + mmdwork->paddr + (l * SIZE(page)),
> + mmdwork->mem_map);
> + }
> + }
> +
> + *kmap_elements = elements;
> + *kmap_size = elements * sizeof(struct mem_map_data);
> +
> + return mmap;
> +}
> +
> int
> initial(void)
> {
> @@ -2833,7 +3091,14 @@ out:
> if (!get_value_for_old_linux())
> return FALSE;
>
> - if (info->flag_cyclic && (info->dump_level & DL_EXCLUDE_FREE))
> + /*
> + * page_is_buddy will tell us whether free pages can be identified
> + * by flags and counts in the page structure without making an extra
> + * pass through the free lists.
> + * This is applicable to using /proc/vmcore or using the kernel.
> + * force all old (-o) forms to search free lists
> + */
> + if (info->dump_level & DL_EXCLUDE_FREE)
> setup_page_is_buddy();
>
> return TRUE;
> @@ -3549,6 +3814,65 @@ out:
> return ret;
> }
>
> +/*
> + * let the kernel find excludable pages from one node
> + */
> +void
> +__exclude_free_pages_kernel(unsigned long pgdat, int node)
> +{
> + int i, j, ret, pages;
> + unsigned long pgdat_paddr;
> + struct pfn_list_request request;
> + struct pfn_reply reply;
> + struct pfn_element *pe;
> +
> + if ((pgdat_paddr = vaddr_to_paddr(pgdat)) == NOT_PADDR) {
> + ERRMSG("Can't convert virtual address(%#lx) to physical.\n",
> + pgdat);
> + return;
> + }
> +
> + /*
> + * Get the list of free pages.
> + * This may be broken up into MAX_PFN_LIST arrays of PFNs.
> + */
> + memset(&request, 0, sizeof(request));
> + request.request = PL_REQUEST_FREE;
> + request.node = node;
> + request.pgdat_paddr = pgdat_paddr;
> + request.pgdat_vaddr = pgdat;
> + request.reply_ptr = (void *)&reply;
> + request.pfn_list_ptr = (void *)pfn_list;
> + memset(&reply, 0, sizeof(reply));
> +
> + do {
> + request.more = 0;
> + if (reply.more) {
> + /* this is to be a continuation of the last request */
> + request.more = 1;
> + request.zone_index = reply.zone_index;
> + request.freearea_index = reply.freearea_index;
> + request.type_index = reply.type_index;
> + request.list_ct = reply.list_ct;
> + }
> + ret = write(pfn_list_fd, &request, sizeof(request));
> + if (ret != sizeof(request)) {
> + printf("PL_REQUEST_FREE failed\n");
> + return;
> + }
> + pfn_free += reply.pfn_free;
> +
> + for (i = 0; i < reply.in_pfn_list; i++) {
> + pe = &pfn_list[i];
> + pages = (1 << pe->order);
> + for (j = 0; j < pages; j++) {
> + clear_bit_on_2nd_bitmap_for_kernel(pe->pfn + j);
> + }
> + }
> + } while (reply.more);
> +
> + return;
> +}
>
> int
> _exclude_free_page(void)
> @@ -3568,7 +3892,24 @@ _exclude_free_page(void)
> gettimeofday(&tv_start, NULL);
>
> for (num_nodes = 1; num_nodes <= vt.numnodes; num_nodes++) {
> -
> + if (!info->flag_cyclic && info->flag_use_kernel_lists) {
> + node_zones = pgdat + OFFSET(pglist_data.node_zones);
> + if (!readmem(VADDR,
> + pgdat + OFFSET(pglist_data.nr_zones),
> + &nr_zones, sizeof(nr_zones))) {
> + ERRMSG("Can't get nr_zones.\n");
> + return FALSE;
> + }
> + print_progress(PROGRESS_FREE_PAGES, num_nodes - 1,
> + vt.numnodes);
> + /* ask the kernel to do one node */
> + __exclude_free_pages_kernel(pgdat, node);
> + goto next_pgdat;
> + }
> + /*
> + * kernel does not have the pfn_list capability
> + * use the old way
> + */
> print_progress(PROGRESS_FREE_PAGES, num_nodes - 1, vt.numnodes);
>
> node_zones = pgdat + OFFSET(pglist_data.node_zones);
> @@ -3595,6 +3936,7 @@ _exclude_free_page(void)
> if (!reset_bitmap_of_free_pages(zone))
> return FALSE;
> }
> + next_pgdat:
> if (num_nodes < vt.numnodes) {
> if ((node = next_online_node(node + 1)) < 0) {
> ERRMSG("Can't get next online node.\n");
> @@ -3612,6 +3954,8 @@ _exclude_free_page(void)
> */
> print_progress(PROGRESS_FREE_PAGES, vt.numnodes, vt.numnodes);
> print_execution_time(PROGRESS_FREE_PAGES, &tv_start);
> + if (tflag)
> + print_execution_time("Total time", &scan_start);
>
> return TRUE;
> }
> @@ -3755,7 +4099,6 @@ setup_page_is_buddy(void)
> }
> } else
> info->page_is_buddy = page_is_buddy_v2;
> -
> out:
> if (!info->page_is_buddy)
> DEBUG_MSG("Can't select page_is_buddy handler; "
> @@ -3964,10 +4307,89 @@ exclude_zero_pages(void)
> return TRUE;
> }
>
> +/*
> + * let the kernel find excludable pages from one mem_section
> + */
> +int
> +__exclude_unnecessary_pages_kernel(int mm, struct mem_map_data *mmd)
> +{
> + unsigned long long pfn_start = mmd->pfn_start;
> + unsigned long long pfn_end = mmd->pfn_end;
> + int i, j, ret, pages, flag;
> + struct pfn_list_request request;
> + struct pfn_reply reply;
> + struct pfn_element *pe;
> +
> + /*
> + * Get the list of to-be-excluded pages in this section.
> + * It may be broken up by groups of max_pfn_list size.
> + */
> + memset(&request, 0, sizeof(request));
> + request.request = PL_REQUEST_EXCLUDE;
> + request.paddr = mmd->paddr; /* phys addr of mem_map */
> + request.reply_ptr = (void *)&reply;
> + request.pfn_list_ptr = (void *)pfn_list;
> + request.exclude_bits = 0;
> + request.pfn_start = pfn_start;
> + request.count = pfn_end - pfn_start;
> + if (info->dump_level & DL_EXCLUDE_CACHE)
> + request.exclude_bits |= DL_EXCLUDE_CACHE;
> + if (info->dump_level & DL_EXCLUDE_CACHE_PRI)
> + request.exclude_bits |= DL_EXCLUDE_CACHE_PRI;
> + if (info->dump_level & DL_EXCLUDE_USER_DATA)
> + request.exclude_bits |= DL_EXCLUDE_USER_DATA;
> + /* if we try for free pages from the freelists then we don't need
> + to ask here for 'buddy' pages */
> + if (info->dump_level & DL_EXCLUDE_FREE)
> + request.exclude_bits |= DL_EXCLUDE_FREE;
> + memset(&reply, 0, sizeof(reply));
> +
> + do {
> + /* pfn represented by paddr */
> + request.more = 0;
> + if (reply.more) {
> + /* this is to be a continuation of the last request */
> + request.more = 1;
> + request.map_index = reply.map_index;
> + }
> +
> + ret = write(pfn_list_fd, &request, sizeof(request));
> + if (ret != sizeof(request))
> + return FALSE;
> +
> + pfn_cache += reply.pfn_cache;
> + pfn_cache_private += reply.pfn_cache_private;
> + pfn_user += reply.pfn_user;
> + pfn_free += reply.pfn_free;
> +
> + flag = 0;
> + for (i = 0; i < reply.in_pfn_list; i++) {
> + pe = &pfn_list[i];
> + pages = (1 << pe->order);
> + for (j = 0; j < pages; j++) {
> + if (clear_bit_on_2nd_bitmap_for_kernel(
> + pe->pfn + j) == FALSE) {
> + // printf("fail: mm %d slot %d pfn %#lx\n",
> + // mm, i, pe->pfn + j);
> + // printf("paddr %#llx pfn %#llx-%#llx mem_map %#lx\n",
> + // mmd->paddr, mmd->pfn_start, mmd->pfn_end, mmd->mem_map);
> + flag = 1;
> + break;
> + }
> + }
> + if (flag) break;
> + }
> + } while (reply.more);
> +
> + return TRUE;
> +}
> +
> int
> -__exclude_unnecessary_pages(unsigned long mem_map,
> - unsigned long long pfn_start, unsigned long long pfn_end)
> +__exclude_unnecessary_pages(int mm, struct mem_map_data *mmd)
> {
> + unsigned long long pfn_start = mmd->pfn_start;
> + unsigned long long pfn_end = mmd->pfn_end;
> + unsigned long mem_map = mmd->mem_map;
> unsigned long long pfn, pfn_mm, maddr;
> unsigned long long pfn_read_start, pfn_read_end, index_pg;
> unsigned char page_cache[SIZE(page) * PGMM_CACHED];
> @@ -3975,6 +4397,12 @@ __exclude_unnecessary_pages(unsigned lon
> unsigned int _count, _mapcount = 0;
> unsigned long flags, mapping, private = 0;
>
> + if (info->flag_use_kernel_lists) {
> + if (__exclude_unnecessary_pages_kernel(mm, mmd) == FALSE)
> + return FALSE;
> + return TRUE;
> + }
> +
> /*
> * Refresh the buffer of struct page, when changing mem_map.
> */
> @@ -4012,7 +4440,6 @@ __exclude_unnecessary_pages(unsigned lon
> pfn_mm = PGMM_CACHED - index_pg;
> else
> pfn_mm = pfn_end - pfn;
> -
> if (!readmem(VADDR, mem_map,
> page_cache + (index_pg * SIZE(page)),
> SIZE(page) * pfn_mm)) {
> @@ -4036,7 +4463,6 @@ __exclude_unnecessary_pages(unsigned lon
> * Exclude the free page managed by a buddy
> */
> if ((info->dump_level & DL_EXCLUDE_FREE)
> - && info->flag_cyclic
> && info->page_is_buddy
> && info->page_is_buddy(flags, _mapcount, private, _count)) {
> int i;
> @@ -4085,19 +4511,78 @@ __exclude_unnecessary_pages(unsigned lon
> return TRUE;
> }
>
> +/*
> + * Pass in the mem_map_data table.
> + * Must do this once, and before doing PL_REQUEST_FREE or PL_REQUEST_EXCLUDE.
> + */
> +int
> +setup_kernel_mmap()
> +{
> + int ret;
> + int kmap_elements, kmap_size;
> + long malloc_size;
> + void *kmap_addr;
> + struct pfn_list_request request;
> + struct pfn_reply reply;
> +
> + kmap_addr = make_kernel_mmap(&kmap_elements, &kmap_size);
> + if (kmap_addr == NULL)
> + return FALSE;
> + memset(&request, 0, sizeof(request));
> + request.request = PL_REQUEST_MEMMAP;
> + request.map_ptr = kmap_addr;
> + request.reply_ptr = (void *)&reply;
> + request.map_count = kmap_elements;
> + request.map_size = kmap_size;
> + request.list_size = MAX_PFN_LIST;
> +
> + ret = write(pfn_list_fd, &request, sizeof(request));
> + if (ret < 0) {
> + fprintf(stderr, "PL_REQUEST_MEMMAP returned %d\n", ret);
> + return FALSE;
> + }
> + /* the reply tells us how long the kernel's list actually is */
> + max_pfn_list = reply.pfn_list_elements;
> + if (max_pfn_list <= 0) {
> + fprintf(stderr,
> + "PL_REQUEST_MEMMAP returned max_pfn_list %d\n",
> + max_pfn_list);
> + return FALSE;
> + }
> + if (max_pfn_list < MAX_PFN_LIST) {
> + printf("length of pfn list dropped from %d to %d\n",
> + MAX_PFN_LIST, max_pfn_list);
> + }
> + free(kmap_addr);
> + /*
> + * Allocate the buffer for the PFN list (just once).
> + */
> + malloc_size = max_pfn_list * sizeof(struct pfn_element);
> + if ((pfn_list = (struct pfn_element *)malloc(malloc_size)) == NULL) {
> + ERRMSG("Can't allocate pfn_list of %ld\n", malloc_size);
> + return FALSE;
> + }
> + return TRUE;
> +}
> +
> int
> exclude_unnecessary_pages(void)
> {
> - unsigned int mm;
> - struct mem_map_data *mmd;
> - struct timeval tv_start;
> + unsigned int mm;
> + struct mem_map_data *mmd;
> + struct timeval tv_start;
>
> if (is_xen_memory() && !info->dom0_mapnr) {
> ERRMSG("Can't get max domain-0 PFN for excluding pages.\n");
> return FALSE;
> }
>
> + if (!info->flag_cyclic && info->flag_use_kernel_lists) {
> + if (setup_kernel_mmap() == FALSE)
> + return FALSE;
> + }
> gettimeofday(&tv_start, NULL);
> + gettimeofday(&scan_start, NULL);
>
> for (mm = 0; mm < info->num_mem_map; mm++) {
> print_progress(PROGRESS_UNN_PAGES, mm, info->num_mem_map);
> @@ -4106,9 +4591,9 @@ exclude_unnecessary_pages(void)
>
> if (mmd->mem_map == NOT_MEMMAP_ADDR)
> continue;
> -
> - if (!__exclude_unnecessary_pages(mmd->mem_map,
> - mmd->pfn_start, mmd->pfn_end))
> + if (mmd->paddr == 0)
> + continue;
> + if (!__exclude_unnecessary_pages(mm, mmd))
> return FALSE;
> }
>
> @@ -4139,7 +4624,11 @@ exclude_unnecessary_pages_cyclic(void)
> */
> copy_bitmap_cyclic();
>
> - if ((info->dump_level & DL_EXCLUDE_FREE) && !info->page_is_buddy)
> + /*
> + * If free pages cannot be identified with the buddy flag and/or
> + * count then we have to search free lists.
> + */
> + if ((info->dump_level & DL_EXCLUDE_FREE) && (!info->page_is_buddy))
> if (!exclude_free_page())
> return FALSE;
>
> @@ -4164,8 +4653,7 @@ exclude_unnecessary_pages_cyclic(void)
>
> if (mmd->pfn_end >= info->cyclic_start_pfn &&
> mmd->pfn_start <= info->cyclic_end_pfn) {
> - if (!__exclude_unnecessary_pages(mmd->mem_map,
> - mmd->pfn_start, mmd->pfn_end))
> + if (!__exclude_unnecessary_pages(mm, mmd))
> return FALSE;
> }
> }
> @@ -4195,7 +4683,7 @@ update_cyclic_region(unsigned long long
> if (!create_1st_bitmap_cyclic())
> return FALSE;
>
> - if (!exclude_unnecessary_pages_cyclic())
> + if (exclude_unnecessary_pages_cyclic() == FALSE)
> return FALSE;
>
> return TRUE;
> @@ -4255,7 +4743,7 @@ create_2nd_bitmap(void)
> if (info->dump_level & DL_EXCLUDE_CACHE ||
> info->dump_level & DL_EXCLUDE_CACHE_PRI ||
> info->dump_level & DL_EXCLUDE_USER_DATA) {
> - if (!exclude_unnecessary_pages()) {
> + if (exclude_unnecessary_pages() == FALSE) {
> ERRMSG("Can't exclude unnecessary pages.\n");
> return FALSE;
> }
> @@ -4263,8 +4751,10 @@ create_2nd_bitmap(void)
>
> /*
> * Exclude free pages.
> + * If free pages cannot be identified with the buddy flag and/or
> + * count then we have to search free lists.
> */
> - if (info->dump_level & DL_EXCLUDE_FREE)
> + if ((info->dump_level & DL_EXCLUDE_FREE) && (!info->page_is_buddy))
> if (!exclude_free_page())
> return FALSE;
>
> @@ -4395,6 +4885,10 @@ create_dump_bitmap(void)
> int ret = FALSE;
>
> if (info->flag_cyclic) {
> + if (info->flag_use_kernel_lists) {
> + if (setup_kernel_mmap() == FALSE)
> + goto out;
> + }
> if (!prepare_bitmap_buffer_cyclic())
> goto out;
>
> @@ -4872,6 +5366,7 @@ get_num_dumpable_cyclic(void)
> {
> unsigned long long pfn, num_dumpable=0;
>
> + gettimeofday(&scan_start, NULL);
> for (pfn = 0; pfn < info->max_mapnr; pfn++) {
> if (!update_cyclic_region(pfn))
> return FALSE;
> @@ -5201,7 +5696,7 @@ get_loads_dumpfile_cyclic(void)
> info->cyclic_end_pfn = info->pfn_cyclic;
> if (!create_1st_bitmap_cyclic())
> return FALSE;
> - if (!exclude_unnecessary_pages_cyclic())
> + if (exclude_unnecessary_pages_cyclic() == FALSE)
> return FALSE;
>
> if (!(phnum = get_phnum_memory()))
> @@ -5613,6 +6108,10 @@ write_kdump_pages(struct cache_data *cd_
> pfn_zero++;
> continue;
> }
> +
> + if (nflag)
> + continue;
> +
> /*
> * Compress the page data.
> */
> @@ -5768,6 +6267,7 @@ write_kdump_pages_cyclic(struct cache_da
> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>
> if ((num_dumped % per) == 0)
> print_progress(PROGRESS_COPY, num_dumped, info->num_dumpable);
>
> /*
> @@ -5786,11 +6286,17 @@ write_kdump_pages_cyclic(struct cache_da
> */
> if ((info->dump_level & DL_EXCLUDE_ZERO)
> && is_zero_page(buf, info->page_size)) {
> + if (!nflag) {
> if (!write_cache(cd_header, pd_zero, sizeof(page_desc_t)))
> goto out;
> + }
> pfn_zero++;
> continue;
> }
> +
> + if (nflag)
> + continue;
> +
> /*
> * Compress the page data.
> */
> @@ -6208,6 +6714,8 @@ write_kdump_pages_and_bitmap_cyclic(stru
> if (!update_cyclic_region(pfn))
> return FALSE;
>
> + if (tflag)
> + print_execution_time("Total time", &scan_start);
> if (!write_kdump_pages_cyclic(cd_header, cd_page, &pd_zero, &offset_data))
> return FALSE;
>
> @@ -8231,6 +8739,22 @@ static struct option longopts[] = {
> {0, 0, 0, 0}
> };
>
> +/*
> + * test for the presence of the capability in the kernel to provide lists
> + * of pfn's:
> + * /proc/vmcore_pfn_lists
> + * return 1 for present
> + * return 0 for not present
> + */
> +int
> +test_kernel_pfn_lists(void)
> +{
> + if ((pfn_list_fd = open("/proc/vmcore_pfn_lists", O_WRONLY)) < 0) {
> + return 0;
> + }
> + return 1;
> +}
> +
> int
> main(int argc, char *argv[])
> {
> @@ -8256,9 +8780,12 @@ main(int argc, char *argv[])
>
> info->block_order = DEFAULT_ORDER;
> message_level = DEFAULT_MSG_LEVEL;
> - while ((opt = getopt_long(argc, argv, "b:cDd:EFfg:hi:lMpRrsvXx:", longopts,
> + while ((opt = getopt_long(argc, argv, "ab:cDd:EFfg:hi:MnoRrstVvXx:Y", longopts,
> NULL)) != -1) {
> switch (opt) {
> + case 'a':
> + aflag = 1;
> + break;
> case 'b':
> info->block_order = atoi(optarg);
> break;
> @@ -8314,6 +8841,13 @@ main(int argc, char *argv[])
> case 'M':
> info->flag_dmesg = 1;
> break;
> + case 'n':
> + /* -n undocumented, for testing page scanning time */
> + nflag = 1;
> + break;
> + case 'o':
> + oflag = 1;
> + break;
> case 'p':
> info->flag_compress = DUMP_DH_COMPRESSED_SNAPPY;
> break;
> @@ -8329,6 +8863,9 @@ main(int argc, char *argv[])
> case 'r':
> info->flag_reassemble = 1;
> break;
> + case 't':
> + tflag = 1;
> + break;
> case 'V':
> info->vaddr_for_vtop = strtoul(optarg, NULL, 0);
> break;
> @@ -8360,6 +8897,12 @@ main(int argc, char *argv[])
> goto out;
> }
> }
> +
> + if (oflag)
> + info->flag_use_kernel_lists = 0;
> + else
> + info->flag_use_kernel_lists = test_kernel_pfn_lists();
> +
> if (flag_debug)
> message_level |= ML_PRINT_DEBUG_MSG;
>
>
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
Cliff, I tried your patch above on makedumpfile v1.5.1 (built dynamically
on the same DL980 I was running the test on), with all the RHEL 6
versions of the kernel patches you gave me from 1207, plus the kernel
patch to kexec recommended for makedumpfile v1.5.1, built on top of a
preliminary RHEL 6.4 kernel source (a higher patch level of the 2.6.32
kernel). This time the test ran on a 1 TB memory system (we have lost
access to the 4 TB memory system for now).
On this same system, the unpatched makedumpfile v1.5.1 produced a dump
without trouble. But the makedumpfile with the patches above could not
even start the dump, and printed:
Saving vmcore-dmesg.txt
Saved vmcore-dmesg.txt
PL_REQUEST_MEMMAP returned -1
Restarting system.
This happened both with a crashkernel size of 200M, which would have
invoked cyclic buffer mode, and with a larger one of 384M, which should
not have needed cyclic mode. I neither enabled nor disabled cyclic
buffer mode on the makedumpfile command line; I was just recording
memory usage with:
core_collector makedumpfile -c --message-level 31 -d 31
debug_mem_level 2
The failure message comes from this code in the patch:
ret = write(pfn_list_fd, &request, sizeof(request));
if (ret < 0) {
        fprintf(stderr, "PL_REQUEST_MEMMAP returned %d\n", ret);
        return FALSE;
}
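One small thing that would help narrow this down: the fragment above discards errno, which distinguishes the kernel side's -EINVAL (bad request size, failed ioremap_cache, failed allocation) from -EFAULT (bad user pointer). A hedged sketch of a more talkative wrapper; the helper name is illustrative, not from the patch:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: issue one pfn-list request and report errno
 * detail on failure, so the kernel's -EINVAL and -EFAULT paths can
 * be told apart from a bad fd or other userspace problem. */
static int request_pfn_list(int fd, const void *req, size_t len)
{
	ssize_t ret = write(fd, req, len);

	if (ret < 0) {
		fprintf(stderr,
			"pfn-list request failed: ret %zd, %s (errno %d)\n",
			ret, strerror(errno), errno);
		return -1;
	}
	return 0;
}
```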
Any ideas what caused this? Am I missing a patch? Do I have the wrong
kernel patches? Any tips for debugging?
I am attaching the kernel patches you sent me earlier, which I used on
top of https://lkml.org/lkml/2012/11/21/90, with the tweak below for
RHEL 2.6.32 kernels applied on top of it. (The NOTE accompanying that
posting: the patch is for the latest kernel, so it needs the following
fix if your kernel version is between v2.6.18 and v2.6.37:)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 511151b..56583a4 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1490,7 +1490,6 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(page, flags);
VMCOREINFO_OFFSET(page, _count);
VMCOREINFO_OFFSET(page, mapping);
- VMCOREINFO_OFFSET(page, _mapcount);
VMCOREINFO_OFFSET(page, private);
VMCOREINFO_OFFSET(page, lru);
VMCOREINFO_OFFSET(pglist_data, node_zones);
@@ -1515,8 +1514,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PG_lru);
VMCOREINFO_NUMBER(PG_private);
VMCOREINFO_NUMBER(PG_swapcache);
- VMCOREINFO_NUMBER(PG_slab);
- VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
+ VMCOREINFO_NUMBER(PG_buddy);
arch_crash_save_vmcoreinfo();
update_vmcoreinfo_note();
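The tweak above is needed because the buddy-page marker changed across kernel versions: kernels up through 2.6.37 set a dedicated PG_buddy page flag, while later kernels dropped that flag and store a sentinel in page->_mapcount, exported as PAGE_BUDDY_MAPCOUNT_VALUE. A minimal sketch of the two checks; the bit position and sentinel value here are illustrative assumptions, not taken from any particular kernel:

```c
/* Illustrative stand-ins: the real PG_buddy bit position varies by
 * kernel config, and PAGE_BUDDY_MAPCOUNT_VALUE is defined in mm.h. */
#define PG_BUDDY_BIT 19
#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)

/* kernels <= 2.6.37: a free buddy page carries a page flag */
static int page_is_buddy_old(unsigned long flags)
{
	return (flags & (1UL << PG_BUDDY_BIT)) != 0;
}

/* kernels >= 2.6.38: PG_buddy is gone; _mapcount holds a sentinel */
static int page_is_buddy_new(int mapcount)
{
	return mapcount == PAGE_BUDDY_MAPCOUNT_VALUE;
}
```

This is why the older kernels export PG_buddy in vmcoreinfo while newer ones export _mapcount and PAGE_BUDDY_MAPCOUNT_VALUE instead.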
[-- Attachment #2: cliff_kernel_patch_1219 --]
[-- Type: message/rfc822, Size: 21764 bytes --]
From:
Subject: [PATCH] scan page tables for makedumpfile
Date: Wed, 16 Jan 2013 05:00:37 -0700
Message-ID: <1358337637.13097.972.camel@lisamlinux.fc.hp.com>
---
fs/proc/vmcore.c | 568 +++++++++++++++++++++++++++++++++++++++++++
include/linux/makedumpfile.h | 115 ++++++++
2 files changed, 683 insertions(+)
Index: linux/fs/proc/vmcore.c
===================================================================
--- linux.orig/fs/proc/vmcore.c
+++ linux/fs/proc/vmcore.c
@@ -17,8 +17,18 @@
#include <linux/init.h>
#include <linux/crash_dump.h>
#include <linux/list.h>
+#include <linux/makedumpfile.h>
+#include <linux/mmzone.h>
#include <asm/uaccess.h>
#include <asm/io.h>
+#include <asm/page.h>
+static int num_mem_map_data = 0;
+static struct mem_map_data *mem_map_data;
+static struct pfn_element *pfn_list;
+static long in_pfn_list;
+static int last_found_vaddr = 0;
+static int last_found_paddr = 0;
+static int max_pfn_list;
/* List representing chunks of contiguous memory areas and their offsets in
* vmcore file.
@@ -33,6 +43,7 @@ static size_t elfcorebuf_sz;
static u64 vmcore_size;
static struct proc_dir_entry *proc_vmcore = NULL;
+static struct proc_dir_entry *proc_vmcore_pfn_lists = NULL;
/* Reads a page from the oldmem device from given offset. */
static ssize_t read_from_oldmem(char *buf, size_t count,
@@ -160,10 +171,563 @@ static ssize_t read_vmcore(struct file *
return acc;
}
+/*
+ * Given the boot-kernel-relative virtual address of a page
+ * return its crashkernel-relative virtual address.
+ *
+ * We have a memory map named mem_map_data
+ *
+ * return 0 if it cannot be found
+ */
+unsigned long
+find_local_vaddr(unsigned long orig_vaddr)
+{
+ int i;
+ int fnd = 0;
+ struct mem_map_data *mmd, *next_mmd;
+ unsigned long paddr;
+ unsigned long local_vaddr;
+ unsigned long offset;
+
+ if (!num_mem_map_data) {
+ printk("find_local_vaddr: num_mem_map_data is %d\n",
+ num_mem_map_data);
+ return 0;
+ }
+
+fullsearch:
+ for (i = last_found_vaddr, mmd = mem_map_data + last_found_vaddr,
+ next_mmd = mem_map_data + last_found_vaddr + 1;
+ i < num_mem_map_data; i++, mmd++, next_mmd++) {
+ if (mmd->mem_map && mmd->paddr) {
+ if (orig_vaddr >= mmd->mem_map &&
+ orig_vaddr < next_mmd->mem_map) {
+ offset = orig_vaddr - mmd->mem_map;
+ paddr = mmd->paddr + offset;
+ fnd++;
+ /* caching gives about 99% hit on first pass */
+ last_found_vaddr = i;
+ break;
+ }
+ }
+ }
+
+ if (!fnd) {
+ if (last_found_vaddr > 0) {
+ last_found_vaddr = 0;
+ goto fullsearch;
+ }
+ return 0;
+ }
+
+ /* paddr is now the physical address of the page structure */
+ /* and offset is the offset into the found section, and we have
+ a table of how those sections are ioremap_cache'd */
+ local_vaddr = (unsigned long)mmd->section_vaddr + offset;
+ return local_vaddr;
+}
+
+/*
+ * Given a paddr, return its crashkernel-relative virtual address.
+ *
+ * We have a memory map named mem_map_data
+ *
+ * return 0 if it cannot be found
+ */
+void *
+find_local_from_paddr(unsigned long paddr)
+{
+ int i;
+ struct mem_map_data *mmd;
+ unsigned long offset;
+
+ if (!num_mem_map_data) {
+ printk("find_local_from_paddr: num_mem_map_data is %d\n",
+ num_mem_map_data);
+ return 0;
+ }
+
+fullsearch:
+ for (i = last_found_paddr, mmd = mem_map_data + last_found_paddr;
+ i < num_mem_map_data; i++, mmd++) {
+ if ((paddr >= mmd->paddr) && (paddr < mmd->ending_paddr)) {
+ offset = paddr - mmd->paddr;
+ last_found_paddr = i;
+ /* caching gives about 98% hit on first pass */
+ return (void *)(mmd->section_vaddr + offset);
+ }
+ }
+
+ if (last_found_paddr > 0) {
+ last_found_paddr = 0;
+ goto fullsearch;
+ }
+ return 0;
+}
+
+/*
+ * given an anchoring list_head, walk the list of free pages
+ * 'root' is a virtual address based on the ioremap_cache'd pointer pgp
+ * 'boot_root' is the virtual address of the list root, boot kernel relative
+ *
+ * return the number of pages found on the list
+ */
+int
+walk_freelist(struct list_head *root, int node, int zone, int order, int list,
+ int restart_list, int start_page, struct pfn_list_request *reqp,
+ struct pfn_reply *replyp, struct list_head *boot_root)
+{
+ int list_ct = 0;
+ int list_free_pages = 0;
+ int doit;
+ unsigned long start_pfn;
+ struct page *pagep;
+ struct page *local_pagep;
+ struct list_head *lhp;
+ struct list_head *local_lhp; /* crashkernel-relative */
+ struct list_head *prev;
+ struct pfn_element *pe;
+
+ /*
+ * root is the crashkernel-relative address of the anchor of the
+ * free_list.
+ */
+ prev = root;
+ if (root == NULL) {
+ printk(KERN_EMERG "root is null!!, node %d order %d\n",
+ node, order);
+ return 0;
+ }
+
+ if (root->next == boot_root)
+ /* list is empty */
+ return 0;
+
+ lhp = root->next;
+ local_lhp = (struct list_head *)find_local_vaddr((unsigned long)lhp);
+ if (!local_lhp) {
+ return 0;
+ }
+
+ while (local_lhp != boot_root) {
+ list_ct++;
+ if (lhp == NULL) {
+ printk(KERN_EMERG
+ "The free list has a null!!, node %d order %d\n",
+ node, order);
+ break;
+ }
+ if (list_ct > 1 && local_lhp->prev != prev) {
+ /* can't be compared to root, as that is local */
+ printk(KERN_EMERG "The free list is broken!!\n");
+ break;
+ }
+
+ /* we want the boot kernel's pfn that this page represents */
+ pagep = container_of((struct list_head *)lhp,
+ struct page, lru);
+ start_pfn = pagep - vmemmap;
+ local_pagep = container_of((struct list_head *)local_lhp,
+ struct page, lru);
+ doit = 1;
+ if (restart_list && list_ct < start_page)
+ doit = 0;
+ if (doit) {
+ if (in_pfn_list == max_pfn_list) {
+ /* if array would overflow, come back to
+ this page with a continuation */
+ replyp->more = 1;
+ replyp->zone_index = zone;
+ replyp->freearea_index = order;
+ replyp->type_index = list;
+ replyp->list_ct = list_ct;
+ goto list_is_full;
+ }
+ pe = &pfn_list[in_pfn_list++];
+ pe->pfn = start_pfn;
+ pe->order = order;
+ list_free_pages += (1 << order);
+ }
+ prev = lhp;
+ lhp = local_pagep->lru.next;
+ /* the local node-relative vaddr: */
+ local_lhp = (struct list_head *)
+ find_local_vaddr((unsigned long)lhp);
+ if (!local_lhp)
+ break;
+ }
+
+list_is_full:
+ return list_free_pages;
+}
+
+/*
+ * Return the pfns of free pages on this node
+ */
+int
+write_vmcore_get_free(struct pfn_list_request *reqp)
+{
+ int node;
+ int nr_zones;
+ int nr_orders = MAX_ORDER;
+ int nr_freelist = MIGRATE_TYPES;
+ int zone;
+ int order;
+ int list;
+ int start_zone = 0;
+ int start_order = 0;
+ int start_list = 0;
+ int ret;
+ int restart = 0;
+ int start_page = 0;
+ int node_free_pages = 0;
+ struct pfn_reply rep;
+ struct pglist_data *pgp;
+ struct zone *zonep;
+ struct free_area *fap;
+ struct list_head *flp;
+ struct list_head *boot_root;
+ unsigned long pgdat_paddr;
+ unsigned long pgdat_vaddr;
+ unsigned long page_aligned_pgdat;
+ unsigned long page_aligned_size;
+ void *mapped_vaddr;
+
+ node = reqp->node;
+ pgdat_paddr = reqp->pgdat_paddr;
+ pgdat_vaddr = reqp->pgdat_vaddr;
+
+ /* map this pglist_data structure within a page-aligned area */
+ page_aligned_pgdat = pgdat_paddr & ~(PAGE_SIZE - 1);
+ page_aligned_size = sizeof(struct pglist_data) +
+ (pgdat_paddr - page_aligned_pgdat);
+ page_aligned_size = ((page_aligned_size + (PAGE_SIZE - 1))
+ >> PAGE_SHIFT) << PAGE_SHIFT;
+ mapped_vaddr = ioremap_cache(page_aligned_pgdat, page_aligned_size);
+ if (!mapped_vaddr) {
+ printk("ioremap_cache of pgdat %#lx failed\n",
+ page_aligned_pgdat);
+ return -EINVAL;
+ }
+ pgp = (struct pglist_data *)(mapped_vaddr +
+ (pgdat_paddr - page_aligned_pgdat));
+ nr_zones = pgp->nr_zones;
+ memset(&rep, 0, sizeof(rep));
+
+ if (reqp->more) {
+ restart = 1;
+ start_zone = reqp->zone_index;
+ start_order = reqp->freearea_index;
+ start_list = reqp->type_index;
+ start_page = reqp->list_ct;
+ }
+
+ in_pfn_list = 0;
+ for (zone = start_zone; zone < nr_zones; zone++) {
+ zonep = &pgp->node_zones[zone];
+ for (order = start_order; order < nr_orders; order++) {
+ fap = &zonep->free_area[order];
+ /* some free_area's are all zero */
+ if (fap->nr_free) {
+ for (list = start_list; list < nr_freelist;
+ list++) {
+ flp = &fap->free_list[list];
+ boot_root = (struct list_head *)
+ (pgdat_vaddr +
+ ((unsigned long)flp -
+ (unsigned long)pgp));
+ ret = walk_freelist(flp, node, zone,
+ order, list, restart,
+ start_page, reqp, &rep,
+ boot_root);
+ node_free_pages += ret;
+ restart = 0;
+ if (rep.more)
+ goto list_full;
+ }
+ }
+ }
+ }
+list_full:
+
+ iounmap(mapped_vaddr);
+
+ /* copy the reply and the valid part of our pfn list to the user */
+ rep.pfn_free = node_free_pages; /* the total, for statistics */
+ rep.in_pfn_list = in_pfn_list;
+ if (copy_to_user(reqp->reply_ptr, &rep, sizeof(struct pfn_reply)))
+ return -EFAULT;
+ if (in_pfn_list) {
+ if (copy_to_user(reqp->pfn_list_ptr, pfn_list,
+ (in_pfn_list * sizeof(struct pfn_element))))
+ return -EFAULT;
+ }
+ return 0;
+}
+
+/*
+ * Get the memap_data table from makedumpfile
+ * and do the single allocate of the pfn_list.
+ */
+int
+write_vmcore_get_memmap(struct pfn_list_request *reqp)
+{
+ int i;
+ int count;
+ int size;
+ int ret = 0;
+ long pfn_list_elements;
+ long malloc_size;
+ unsigned long page_section_start;
+ unsigned long page_section_size;
+ struct mem_map_data *mmd, *dum_mmd;
+ struct pfn_reply rep;
+ void *bufptr;
+
+ rep.pfn_list_elements = 0;
+ if (num_mem_map_data) {
+ /* shouldn't have been done before, but if it was.. */
+ printk(KERN_INFO "warning: PL_REQUEST_MEMMAP is repeated\n");
+ for (i = 0, mmd = mem_map_data; i < num_mem_map_data;
+ i++, mmd++) {
+ iounmap(mmd->section_vaddr);
+ }
+ kfree(mem_map_data);
+ mem_map_data = NULL;
+ num_mem_map_data = 0;
+ kfree(pfn_list);
+ pfn_list = NULL;
+ }
+
+ count = reqp->map_count;
+ size = reqp->map_size;
+ bufptr = reqp->map_ptr;
+ if (size != (count * sizeof(struct mem_map_data))) {
+ printk("Error in mem_map_data, %d * %zu != %d\n",
+ count, sizeof(struct mem_map_data), size);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* add a dummy at the end to limit the size of the last entry */
+ size += sizeof(struct mem_map_data);
+
+ mem_map_data = kzalloc(size, GFP_KERNEL);
+ if (!mem_map_data) {
+ printk("kmalloc of mem_map_data for %d failed\n", size);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (copy_from_user(mem_map_data, bufptr, size)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ num_mem_map_data = count;
+
+ /* construct the dummy entry to limit the size of 'next_mmd->mem_map' */
+ /* (see find_local_vaddr() ) */
+ mmd = mem_map_data + (num_mem_map_data - 1);
+ page_section_size = (mmd->pfn_end - mmd->pfn_start) *
+ sizeof(struct page);
+ dum_mmd = mmd + 1;
+ *dum_mmd = *mmd;
+ dum_mmd->mem_map += page_section_size;
+
+ /* Fill in the ending address of array of page struct */
+ for (i = 0, mmd = mem_map_data; i < num_mem_map_data; i++, mmd++) {
+ mmd->ending_paddr = mmd->paddr +
+ ((mmd->pfn_end - mmd->pfn_start) * sizeof(struct page));
+ }
+
+ /* Map each section of page structures to local virtual addresses */
+ /* (these are never iounmap'd, as this is the crash kernel) */
+ for (i = 0, mmd = mem_map_data; i < num_mem_map_data; i++, mmd++) {
+ page_section_start = mmd->paddr;
+ page_section_size = (mmd->pfn_end - mmd->pfn_start) *
+ sizeof(struct page);
+ mmd->section_vaddr = ioremap_cache(page_section_start,
+ page_section_size);
+ if (!mmd->section_vaddr) {
+ printk(
+ "ioremap_cache of [%d] node %#lx for %#lx failed\n",
+ i, page_section_start, page_section_size);
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ /*
+ * allocate the array for PFN's (just once)
+ * get as much as we can, up to what the user specified, and return
+ * that count to the user
+ */
+ pfn_list_elements = reqp->list_size;
+ do {
+ malloc_size = pfn_list_elements * sizeof(struct pfn_element);
+ if ((pfn_list = kmalloc(malloc_size, GFP_KERNEL)) != NULL) {
+ rep.pfn_list_elements = pfn_list_elements;
+ max_pfn_list = pfn_list_elements;
+ goto out;
+ }
+ pfn_list_elements -= 1000;
+ } while (pfn_list == NULL && pfn_list_elements > 0);
+
+ ret = -EINVAL;
+out:
+ if (copy_to_user(reqp->reply_ptr, &rep, sizeof(struct pfn_reply)))
+ return -EFAULT;
+ return ret;
+}
+
+/*
+ * Return the pfns of to-be-excluded pages fulfilling this request.
+ * This is called for each mem_map in makedumpfile's list.
+ */
+int
+write_vmcore_get_excludes(struct pfn_list_request *reqp)
+{
+ int i;
+ int start = 0;
+ int end;
+ unsigned long paddr;
+ unsigned long pfn;
+ void *vaddr;
+ struct page *pagep;
+ struct pfn_reply rep;
+ struct pfn_element *pe;
+
+ if (!num_mem_map_data) {
+ /* sanity check */
+ printk(
+ "ERROR:PL_REQUEST_MEMMAP not done before PL_REQUEST_EXCLUDE\n");
+ return -EINVAL;
+ }
+
+ /*
+ * the request contains (besides request type and bufptr):
+ * paddr (physical address of the page[0]
+ * count of pages in the block
+ * exclude bits (DL_EXCLUDE_...)
+ */
+ paddr = reqp->paddr;
+ end = reqp->count;
+ pfn = reqp->pfn_start;
+ /* find the already-mapped vaddr of this paddr */
+ vaddr = find_local_from_paddr(paddr);
+ if (!vaddr) {
+ printk("ERROR: PL_REQUEST_EXCLUDE cannot find paddr %#lx\n",
+ paddr);
+ return -EINVAL;
+ }
+ if (reqp->more) {
+ start = reqp->map_index;
+ vaddr += (reqp->map_index * sizeof(struct page));
+ pfn += reqp->map_index;
+ }
+ memset(&rep, 0, sizeof(rep));
+ in_pfn_list = 0;
+
+ for (i = start, pagep = (struct page *)vaddr; i < end;
+ i++, pagep++, pfn++) {
+ if (in_pfn_list == max_pfn_list) {
+ rep.in_pfn_list = in_pfn_list;
+ rep.more = 1;
+ rep.map_index = i;
+ break;
+ }
+ /*
+ * Exclude the free page managed by a buddy
+ */
+ if ((reqp->exclude_bits & DL_EXCLUDE_FREE)
+ && (pagep->flags & (1UL << PG_buddy))) {
+ pe = &pfn_list[in_pfn_list++];
+ pe->pfn = pfn;
+ pe->order = pagep->private;
+ rep.pfn_free += (1 << pe->order);
+ }
+ /*
+ * Exclude the cache page without the private page.
+ */
+ else if ((reqp->exclude_bits & DL_EXCLUDE_CACHE)
+ && (isLRU(pagep->flags) || isSwapCache(pagep->flags))
+ && !isPrivate(pagep->flags) && !isAnon(pagep->mapping)) {
+ pe = &pfn_list[in_pfn_list++];
+ pe->pfn = pfn;
+ pe->order = 0; /* assume 4k */
+ rep.pfn_cache++;
+ }
+ /*
+ * Exclude the cache page with the private page.
+ */
+ else if ((reqp->exclude_bits & DL_EXCLUDE_CACHE_PRI)
+ && (isLRU(pagep->flags) || isSwapCache(pagep->flags))
+ && !isAnon(pagep->mapping)) {
+ pe = &pfn_list[in_pfn_list++];
+ pe->pfn = pfn;
+ pe->order = 0; /* assume 4k */
+ rep.pfn_cache_private++;
+ }
+ /*
+ * Exclude the data page of the user process.
+ */
+ else if ((reqp->exclude_bits & DL_EXCLUDE_USER_DATA)
+ && isAnon(pagep->mapping)) {
+ pe = &pfn_list[in_pfn_list++];
+ pe->pfn = pfn;
+ pe->order = 0; /* assume 4k */
+ rep.pfn_user++;
+ }
+
+ }
+ rep.in_pfn_list = in_pfn_list;
+ if (copy_to_user(reqp->reply_ptr, &rep, sizeof(struct pfn_reply)))
+ return -EFAULT;
+ if (in_pfn_list) {
+ if (copy_to_user(reqp->pfn_list_ptr, pfn_list,
+ (in_pfn_list * sizeof(struct pfn_element))))
+ return -EFAULT;
+ }
+ return 0;
+}
+
+static ssize_t write_vmcore_pfn_lists(struct file *file,
+ const char __user *user_buf, size_t count, loff_t *ppos)
+{
+ int ret;
+ struct pfn_list_request pfn_list_request;
+
+ if (count != sizeof(struct pfn_list_request)) {
+ return -EINVAL;
+ }
+
+ if (copy_from_user(&pfn_list_request, user_buf, count))
+ return -EFAULT;
+
+ if (pfn_list_request.request == PL_REQUEST_FREE) {
+ ret = write_vmcore_get_free(&pfn_list_request);
+ } else if (pfn_list_request.request == PL_REQUEST_EXCLUDE) {
+ ret = write_vmcore_get_excludes(&pfn_list_request);
+ } else if (pfn_list_request.request == PL_REQUEST_MEMMAP) {
+ ret = write_vmcore_get_memmap(&pfn_list_request);
+ } else {
+ return -EINVAL;
+ }
+
+ if (ret)
+ return ret;
+ return count;
+}
+
static const struct file_operations proc_vmcore_operations = {
.read = read_vmcore,
};
+static const struct file_operations proc_vmcore_pfn_lists_operations = {
+ .write = write_vmcore_pfn_lists,
+};
+
static struct vmcore* __init get_new_element(void)
{
return kzalloc(sizeof(struct vmcore), GFP_KERNEL);
@@ -648,6 +1212,10 @@ static int __init vmcore_init(void)
proc_vmcore = proc_create("vmcore", S_IRUSR, NULL, &proc_vmcore_operations);
if (proc_vmcore)
proc_vmcore->size = vmcore_size;
+
+ proc_vmcore_pfn_lists = proc_create("vmcore_pfn_lists", S_IWUSR, NULL,
+ &proc_vmcore_pfn_lists_operations);
+
return 0;
}
module_init(vmcore_init)
Index: linux/include/linux/makedumpfile.h
===================================================================
--- /dev/null
+++ linux/include/linux/makedumpfile.h
@@ -0,0 +1,115 @@
+/*
+ * makedumpfile.h
+ * portions Copyright (C) 2006, 2007, 2008, 2009 NEC Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define isLRU(flags) (flags & (1UL << PG_lru))
+#define isPrivate(flags) (flags & (1UL << PG_private))
+#define isSwapCache(flags) (flags & (1UL << PG_swapcache))
+
+static inline int
+isAnon(struct address_space *mapping)
+{
+ return ((unsigned long)mapping & PAGE_MAPPING_ANON) != 0;
+}
+
+#define DL_EXCLUDE_ZERO (0x001) /* Exclude Pages filled with Zeros */
+#define DL_EXCLUDE_CACHE (0x002) /* Exclude Cache Pages
+ without Private Pages */
+#define DL_EXCLUDE_CACHE_PRI (0x004) /* Exclude Cache Pages
+ with Private Pages */
+#define DL_EXCLUDE_USER_DATA (0x008) /* Exclude UserProcessData Pages */
+#define DL_EXCLUDE_FREE (0x010) /* Exclude Free Pages */
+
+#define PL_REQUEST_FREE 1 /* request for a list of free pages */
+#define PL_REQUEST_EXCLUDE 2 /* request for a list of excludable
+ pages */
+#define PL_REQUEST_MEMMAP 3 /* request to pass in the makedumpfile
+ mem_map_data table */
+/*
+ * a request for finding pfn's that can be excluded from the dump
+ * they may be pages of particular types or free pages
+ */
+struct pfn_list_request {
+ int request; /* PL_REQUEST_FREE PL_REQUEST_EXCLUDE or */
+ /* PL_REQUEST_MEMMAP */
+ int debug;
+ unsigned long paddr; /* mem_map address for PL_REQUEST_EXCLUDE */
+ unsigned long pfn_start;/* pfn represented by paddr */
+ unsigned long pgdat_paddr; /* for PL_REQUEST_FREE */
+ unsigned long pgdat_vaddr; /* for PL_REQUEST_FREE */
+ int node; /* for PL_REQUEST_FREE */
+ int exclude_bits; /* for PL_REQUEST_EXCLUDE */
+ int count; /* for PL_REQUEST_EXCLUDE */
+ void *reply_ptr; /* address of user's pfn_reply, for reply */
+ void *pfn_list_ptr; /* address of user's pfn array (*pfn_list) */
+ int map_count; /* for PL_REQUEST_MEMMAP; elements */
+ int map_size; /* for PL_REQUEST_MEMMAP; bytes in table */
+ void *map_ptr; /* for PL_REQUEST_MEMMAP; address of table */
+ long list_size; /* for PL_REQUEST_MEMMAP negotiation */
+ /* resume info: */
+ int more; /* 0 for done, 1 for "there's more" */
+ /* PL_REQUEST_EXCLUDE: */
+ int map_index; /* slot in the mem_map array of page structs */
+ /* PL_REQUEST_FREE: */
+ int zone_index; /* zone within the node's pgdat_list */
+ int freearea_index; /* free_area within the zone */
+ int type_index; /* free_list within the free_area */
+ int list_ct; /* page within the list */
+};
+
+/*
+ * the reply from a pfn_list_request
+ * the list of pfn's itself is pointed to by pfn_list
+ */
+struct pfn_reply {
+ long pfn_list_elements; /* negotiated on PL_REQUEST_MEMMAP */
+ long in_pfn_list; /* returned by PL_REQUEST_EXCLUDE and
+ PL_REQUEST_FREE */
+ /* resume info */
+ int more; /* 0 == done, 1 == there is more */
+ /* PL_REQUEST_MEMMAP: */
+ int map_index; /* slot in the mem_map array of page structs */
+ /* PL_REQUEST_FREE: */
+ int zone_index; /* zone within the node's pgdat_list */
+ int freearea_index; /* free_area within the zone */
+ int type_index; /* free_list within the free_area */
+ int list_ct; /* page within the list */
+ /* statistic counters: */
+ unsigned long long pfn_cache; /* PL_REQUEST_EXCLUDE */
+ unsigned long long pfn_cache_private; /* PL_REQUEST_EXCLUDE */
+ unsigned long long pfn_user; /* PL_REQUEST_EXCLUDE */
+ unsigned long long pfn_free; /* PL_REQUEST_FREE */
+};
+
+struct pfn_element {
+ unsigned long pfn;
+ unsigned long order;
+};
+
+struct mem_map_data {
+ /*
+ * pfn_start/pfn_end are the pfn's represented by this mem_map entry.
+ * mem_map is the virtual address of the array of page structures
+ * that represent these pages.
+ * paddr is the physical address of that array of structures.
+ * ending_paddr would be (pfn_end - pfn_start) * sizeof(struct page).
+ * section_vaddr is the address we get from ioremap_cache().
+ */
+ unsigned long long pfn_start;
+ unsigned long long pfn_end;
+ unsigned long mem_map;
+ unsigned long long paddr; /* filled in by makedumpfile */
+ unsigned long long ending_paddr; /* filled in by kernel */
+ void *section_vaddr; /* filled in by kernel */
+};