Kexec Archive on lore.kernel.org
* [PATCH 0/2] makedumpfile: for large memories
@ 2013-12-31 23:30 cpw
  2013-12-31 23:34 ` [PATCH 1/2] makedumpfile: raw i/o and use of root device Cliff Wickman
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: cpw @ 2013-12-31 23:30 UTC (permalink / raw)
  To: kexec; +Cc: d.hatayama, kumagai-atsushi

From: Cliff Wickman <cpw@sgi.com>

Gentlemen of kexec,

I have been working on enabling kdump on some very large systems, and
have found some solutions that I hope you will consider.

The first issue is to work within the restricted size of crashkernel memory
under 2.6.32-based kernels, such as sles11 and rhel6.

The second issue is to reduce the very large size of a dump of a big memory
system, even on an idle system.

These are my propositions:

Size of crashkernel memory
  1) raw i/o for writing the dump
  2) use root device for the bitmap file (not tmpfs)
  3) raw i/o for reading/writing the bitmaps
  
Size of dump (and hence the duration of dumping)
  4) exclude page structures for unused pages


1) Is quite easy.  The cache of pages needs to be aligned on a block
  boundary and written in block multiples, as O_DIRECT requires.

  The use of raw i/o prevents the growing of the crash kernel's page
  cache.
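A minimal sketch of those raw-i/o rules (not code from the patch itself): O_DIRECT transfers need a sector-aligned buffer and lengths that are multiples of the block size.  The 4096-byte BLOCKSIZE and 512-byte alignment here are illustrative assumptions.

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>

#define BLOCKSIZE 4096UL

/* Allocate a buffer whose address is suitable for O_DIRECT i/o. */
static void *alloc_aligned(size_t size)
{
	void *buf = NULL;

	if (posix_memalign(&buf, 512, size) != 0)
		return NULL;
	return buf;
}

/* Round a partial length up to a full block so the write is well-formed. */
static unsigned long round_up_to_block(unsigned long len)
{
	return ((len + BLOCKSIZE - 1) / BLOCKSIZE) * BLOCKSIZE;
}

/* The dump file itself would be opened with something like:
 *	fd = open(path, O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0600);
 * and every write(fd, buf, n) kept to n % BLOCKSIZE == 0. */
```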

2) Is also quite easy.  My patch finds the path to the crash
  kernel's root device by examining the dump pathname.  Otherwise,
  storing the bitmaps to a file does not conserve memory, as they
  are written to tmpfs.
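The derivation can be sketched roughly as below, assuming (as the patch does) that the dump path contains a "/var" component, e.g. "/mnt/var/crash/vmcore"; dir_from_dump_path() is an illustrative name, not a function in the patch.

```c
#include <string.h>

/* Copy the prefix of 'dumpfile' up to (not including) "/var" into 'out',
 * appending a trailing slash.  Returns 0 on success, -1 if the path has
 * no "/var" component or 'out' is too small. */
static int dir_from_dump_path(const char *dumpfile, char *out, size_t outlen)
{
	const char *p = strstr(dumpfile, "/var");
	size_t len;

	if (!p)
		return -1;
	len = (size_t)(p - dumpfile);
	if (len + 2 > outlen)
		return -1;
	memcpy(out, dumpfile, len);
	out[len] = '/';
	out[len + 1] = '\0';
	return 0;
}
```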

3) Raw i/o for the bitmaps is accomplished by caching the
  bitmap file in a similar way to the dump file.

  I find that the use of direct i/o is not significantly slower than
  writing through the kernel's page cache.

4) Excluding unused kernel page structures is very
  important for a large memory system.  The kernel otherwise includes
  3.67 million pages of page structures per TB of memory; by contrast,
  the rest of the kernel is only about 1 million pages.
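The 3.67-million figure can be checked with quick arithmetic, assuming 4 KiB pages and a 56-byte struct page (the struct size is config-dependent; 56 bytes is the size that reproduces the stated figure):

```c
#define PAGE_SIZE 4096UL
#define STRUCT_PAGE_SIZE 56UL   /* assumed; varies with kernel config */

/* Pages consumed by page structures (vmemmap) per TiB of memory:
 * (2^40 / 4096) pages * 56 bytes each, expressed in 4 KiB pages. */
static unsigned long vmemmap_pages_per_tib(void)
{
	unsigned long pages = (1UL << 40) / PAGE_SIZE;  /* 268,435,456 */

	return pages * STRUCT_PAGE_SIZE / PAGE_SIZE;    /* 3,670,016 */
}
```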

Test results are below, for systems of 1TB, 2TB, 8.8TB and 16TB.
(There are no 'old' numbers for 16TB, as the time and space
 required made such runs impractical.)

Run times were generally reduced 2-3x, and dump size reduced about 8x.

All timings were done using 512M of crashkernel memory.

   System memory size
   1TB                     unpatched    patched
     OS: rhel6.4 (does a free pages pass)
     page scan time           1.6min    1.6min
     dump copy time           2.4min     .4min
     total time               4.1min    2.0min
     dump size                 3014M      364M

     OS: rhel6.5
     page scan time            .6min     .6min
     dump copy time           2.3min     .5min
     total time               2.9min    1.1min
     dump size                 3011M      423M

     OS: sles11sp3 (3.0.93)
     page scan time            .5min     .5min
     dump copy time           2.3min     .5min
     total time               2.8min    1.0min
     dump size                 2950M      350M

   2TB
     OS: rhel6.5           (cyclicx3)
     page scan time           2.0min    1.8min
     dump copy time           8.0min    1.5min
     total time              10.0min    3.3min
     dump size                 6141M      835M

   8.8TB
     OS: rhel6.5           (cyclicx5)
     page scan time           6.6min    5.5min
     dump copy time          67.8min    6.2min
     total time              74.4min   11.7min
     dump size                 15.8G      2.7G

   16TB
     OS: rhel6.4
     page scan time                   125.3min
     dump copy time                    13.2min
     total time                       138.5min
     dump size                            4.0G

     OS: rhel6.5
     page scan time                    27.8min
     dump copy time                    13.3min
     total time                        41.1min
     dump size                            4.1G

Page scan time is greatly affected by whether or not the
kernel supports mmap of /proc/vmcore.

The choice of snappy vs. zlib compression becomes fairly irrelevant
when we can shrink the dump size dramatically.  The above
were done with snappy compression.

I am sending my two working patches.
They are kludgy in the sense that they handle only the creation of
a disk dump, and only the x86_64 architecture.
But I think they are sufficient to demonstrate the sizable
savings in time, crashkernel space, and disk space that are possible.

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


* [PATCH 1/2] makedumpfile: raw i/o and use of root device
  2013-12-31 23:30 [PATCH 0/2] makedumpfile: for large memories cpw
@ 2013-12-31 23:34 ` Cliff Wickman
  2013-12-31 23:36 ` [PATCH 2/2] makedumpfile: exclude unused vmemmap pages Cliff Wickman
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Cliff Wickman @ 2013-12-31 23:34 UTC (permalink / raw)
  To: kexec, d.hatayama, kumagai-atsushi

On Tue, Dec 31, 2013 at 05:30:01PM -0600, cpw wrote:

Use O_DIRECT (raw) i/o for the dump and for the bitmaps file, so that writing
to those files does not allocate kernel memory for page cache.

Use the root device for the bitmaps file so that kernel memory is not consumed
for tmpfs.

The pathname for the root device is derived from the path to the dump
directory.

Raw I/O requires well-formed reads and writes. Buffers are aligned on 512-byte
boundaries, lseek's are done to 4096-byte boundaries, and transfers are
multiples of 4096 bytes.
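The buffer-alignment idiom the patch relies on can be sketched as follows: over-allocate by DIRECT_ALIGN bytes, round the pointer up to the next aligned address, and keep the original pointer so it can later be passed to free().  This is a sketch of the pattern, not the patch's own helper.

```c
#include <stdlib.h>

#define DIRECT_ALIGN 512UL

/* Round a pointer up past the next DIRECT_ALIGN boundary. */
static char *align_up(char *p)
{
	return p - ((unsigned long)p % DIRECT_ALIGN) + DIRECT_ALIGN;
}

/* Allocate 'size' usable bytes aligned for O_DIRECT.  '*to_free' receives
 * the raw malloc pointer, which is what must eventually be freed. */
static char *alloc_direct_buf(size_t size, char **to_free)
{
	char *cp = malloc(size + DIRECT_ALIGN);

	if (!cp)
		return NULL;
	*to_free = cp;          /* free() this, not the aligned pointer */
	return align_up(cp);
}
```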

The kludge is in handling the boundary between the part of the file
containing the page descriptors and the last part of the file, containing
the page data.  The data for that boundary area must be assembled into a
page buffer and written with a single write.

This patch is not meant to work in conjunction with cyclic mode. Cyclic mode
is effectively disabled by it, as it is not needed when employing these
methods. The second scan of pages needed by cyclic mode is thus eliminated.

The patch adds -j and -J options to force raw i/o even if there is sufficient
crashkernel memory not to require it.  (see flag_rawdump and flag_rawbitmaps).

Signed-off-by: Cliff Wickman <cpw@sgi.com>

---
 makedumpfile.c |  501 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 makedumpfile.h |   10 +
 print_info.c   |    8 
 3 files changed, 421 insertions(+), 98 deletions(-)

Index: makedumpfile-1.5.5/makedumpfile.c
===================================================================
--- makedumpfile-1.5.5.orig/makedumpfile.c
+++ makedumpfile-1.5.5/makedumpfile.c
@@ -49,6 +49,8 @@ unsigned long long pfn_free;
 unsigned long long pfn_hwpoison;
 
 unsigned long long num_dumped;
+long blocksize;
+static int plenty_of_memory(void);
 
 int retcd = FAILED;	/* return code */
 
@@ -900,10 +902,17 @@ int
 open_dump_file(void)
 {
 	int fd;
-	int open_flags = O_RDWR|O_CREAT|O_TRUNC;
+	int open_flags;
 
+	if (info->flag_rawdump)
+		open_flags = O_RDWR|O_CREAT|O_TRUNC|O_DIRECT;
+	else
+		open_flags = O_RDWR|O_CREAT|O_TRUNC;
+
+#if 0
 	if (!info->flag_force)
 		open_flags |= O_EXCL;
+#endif
 
 	if (info->flag_flatten) {
 		fd = STDOUT_FILENO;
@@ -939,12 +948,35 @@ check_dump_file(const char *path)
 int
 open_dump_bitmap(void)
 {
-	int i, fd;
-	char *tmpname;
+	int i, fd, flags;
+	char *tmpname, *cp;
+	char prefix[100];
+	int len;
 
+	/* note that /tmp is tmpfs, so it uses crash kernel memory */
 	tmpname = getenv("TMPDIR");
-	if (!tmpname)
-		tmpname = "/tmp";
+	if (!tmpname) {
+		/* use the prefix of the dump name   e.g. /mnt//var/.... */
+		if (!strchr(info->name_dumpfile,'v')) {
+			printf("no /var found in name_dumpfile %s\n",
+				info->name_dumpfile);
+			exit(1);
+		} else {
+			cp = strchr(info->name_dumpfile,'v');
+			if (strncmp(cp-1, "/var", 4)) {
+				printf("no /var found in name_dumpfile %s\n",
+					info->name_dumpfile);
+				exit(1);
+			}
+		}
+		len = cp - info->name_dumpfile - 1;
+		strncpy(prefix, info->name_dumpfile, len);
+		if (*(prefix + len - 1) == '/')
+			len -= 1;
+		*(prefix + len) = '\0';
+		tmpname = prefix;
+		strcat(tmpname, "/");
+	}
 
 	if ((info->name_bitmap = (char *)malloc(sizeof(FILENAME_BITMAP) +
 						strlen(tmpname) + 1)) == NULL) {
@@ -953,9 +985,12 @@ open_dump_bitmap(void)
 		return FALSE;
 	}
 	strcpy(info->name_bitmap, tmpname);
-	strcat(info->name_bitmap, "/");
 	strcat(info->name_bitmap, FILENAME_BITMAP);
-	if ((fd = mkstemp(info->name_bitmap)) < 0) {
+	if (info->flag_rawbitmaps)
+		flags = O_RDWR|O_CREAT|O_TRUNC|O_DIRECT;
+	else
+		flags = O_RDWR|O_CREAT|O_TRUNC;
+	if ((fd = open(info->name_bitmap, flags)) < 0) {
 		ERRMSG("Can't open the bitmap file(%s). %s\n",
 		    info->name_bitmap, strerror(errno));
 		return FALSE;
@@ -2860,6 +2895,7 @@ initialize_bitmap_memory(void)
 	struct dump_bitmap *bmp;
 	off_t bitmap_offset;
 	off_t bitmap_len, max_sect_len;
+	char *cp;
 	unsigned long pfn;
 	int i, j;
 	long block_size;
@@ -2881,7 +2917,14 @@ initialize_bitmap_memory(void)
 	bmp->fd        = info->fd_memory;
 	bmp->file_name = info->name_memory;
 	bmp->no_block  = -1;
-	memset(bmp->buf, 0, BUFSIZE_BITMAP);
+	if ((cp = malloc(blocksize + DIRECT_ALIGN)) == NULL) {
+		ERRMSG("Can't allocate memory for the bitmap buffer. %s\n",
+		    strerror(errno));
+		exit(1);
+	}
+	bmp->buf_malloced = cp;
+	bmp->buf = cp - ((unsigned long)cp % DIRECT_ALIGN) + DIRECT_ALIGN;
+	memset(bmp->buf, 0, blocksize);
 	bmp->offset = bitmap_offset + bitmap_len / 2;
 	info->bitmap_memory = bmp;
 
@@ -2893,6 +2936,7 @@ initialize_bitmap_memory(void)
 	if (info->valid_pages == NULL) {
 		ERRMSG("Can't allocate memory for the valid_pages. %s\n",
 		    strerror(errno));
+		free(bmp->buf_malloced);
 		free(bmp);
 		return FALSE;
 	}
@@ -3075,9 +3119,9 @@ out:
 			unsigned long long free_memory;
 
 			/*
-                        * The buffer size is specified as Kbyte with
-                        * --cyclic-buffer <size> option.
-                        */
+			 * The buffer size is specified as Kbyte with
+			 * --cyclic-buffer <size> option.
+			 */
 			info->bufsize_cyclic <<= 10;
 
 			/*
@@ -3190,7 +3234,7 @@ out:
 		DEBUG_MSG("The kernel doesn't support mmap(),");
 		DEBUG_MSG("read() will be used instead.\n");
 		info->flag_usemmap = MMAP_DISABLE;
-        }
+	}
 
 	return TRUE;
 }
@@ -3198,9 +3242,18 @@ out:
 void
 initialize_bitmap(struct dump_bitmap *bitmap)
 {
+	char *cp;
+
 	bitmap->fd        = info->fd_bitmap;
 	bitmap->file_name = info->name_bitmap;
 	bitmap->no_block  = -1;
+	if ((cp = malloc(blocksize + DIRECT_ALIGN)) == NULL) {
+		ERRMSG("Can't allocate memory for the bitmap buffer. %s\n",
+		    strerror(errno));
+		exit(1);
+	}
+	bitmap->buf_malloced = cp;
+	bitmap->buf = cp - ((unsigned long)cp % DIRECT_ALIGN) + DIRECT_ALIGN;
 	memset(bitmap->buf, 0, BUFSIZE_BITMAP);
 }
 
@@ -3266,9 +3319,9 @@ set_bitmap(struct dump_bitmap *bitmap, u
 	byte = (pfn%PFN_BUFBITMAP)>>3;
 	bit  = (pfn%PFN_BUFBITMAP) & 7;
 	if (val)
-		bitmap->buf[byte] |= 1<<bit;
+		*(bitmap->buf + byte) |= 1<<bit;
 	else
-		bitmap->buf[byte] &= ~(1<<bit);
+		*(bitmap->buf + byte) &= ~(1<<bit);
 
 	return TRUE;
 }
@@ -3444,6 +3497,29 @@ read_cache(struct cache_data *cd)
 	return TRUE;
 }
 
+void
+fill_to_offset(struct cache_data *cd, int blocksize)
+{
+	off_t current;
+	long num_blocks;
+	long i;
+
+	current = lseek(cd->fd, 0, SEEK_CUR);
+	if ((cd->offset - current) % blocksize) {
+		printf("ERROR: fill area is %#lx\n", cd->offset - current);
+		exit(1);
+	}
+	if (cd->cache_size < blocksize) {
+		printf("ERROR: cache buf is only %ld\n", cd->cache_size);
+		exit(1);
+	}
+	num_blocks = (cd->offset - current) / blocksize;
+	for (i = 0; i < num_blocks; i++) {
+		write(cd->fd, cd->buf, blocksize);
+	}
+	return;
+}
+
 int
 is_bigendian(void)
 {
@@ -3513,6 +3589,14 @@ write_buffer(int fd, off_t offset, void 
 int
 write_cache(struct cache_data *cd, void *buf, size_t size)
 {
+	/* sanity check; do not overflow this buffer */
+	/* (it is of cd->cache_size + info->page_size) */
+	if (size > ((cd->cache_size - cd->buf_size) + info->page_size)) {
+		fprintf(stderr, "write_cache buffer overflow! size %#lx\n",
+			size);
+		exit(1);
+	}
+
 	memcpy(cd->buf + cd->buf_size, buf, size);
 	cd->buf_size += size;
 
@@ -3524,7 +3608,8 @@ write_cache(struct cache_data *cd, void 
 		return FALSE;
 
 	cd->buf_size -= cd->cache_size;
-	memcpy(cd->buf, cd->buf + cd->cache_size, cd->buf_size);
+	if (cd->buf_size)
+		memcpy(cd->buf, cd->buf + cd->cache_size, cd->buf_size);
 	cd->offset += cd->cache_size;
 	return TRUE;
 }
@@ -3556,6 +3641,21 @@ write_cache_zero(struct cache_data *cd, 
 	return write_cache_bufsz(cd);
 }
 
+/* flush the full cache to the file */
+int
+write_cache_flush(struct cache_data *cd)
+{
+	if (cd->buf_size == 0)
+		return TRUE;
+	if (cd->buf_size < cd->cache_size) {
+		memset(cd->buf + cd->buf_size, 0, cd->cache_size - cd->buf_size);
+	}
+	cd->buf_size = cd->cache_size;
+	if (!write_cache_bufsz(cd))
+		return FALSE;
+	return TRUE;
+}
+
 int
 read_buf_from_stdin(void *buf, int buf_size)
 {
@@ -4332,11 +4432,19 @@ create_1st_bitmap(void)
 {
 	int i;
 	unsigned int num_pt_loads = get_num_pt_loads();
- 	char buf[info->page_size];
+	char *buf;
 	unsigned long long pfn, pfn_start, pfn_end, pfn_bitmap1;
 	unsigned long long phys_start, phys_end;
 	struct timeval tv_start;
 	off_t offset_page;
+	char *cp;
+
+	if ((cp = malloc(blocksize + DIRECT_ALIGN)) == NULL) {
+		ERRMSG("Can't allocate memory for the bitmap buffer. %s\n",
+		    strerror(errno));
+		exit(1);
+	}
+	buf = cp - ((unsigned long)cp % DIRECT_ALIGN) + DIRECT_ALIGN;
 
 	if (info->flag_refiltering)
 		return copy_1st_bitmap_from_memory();
@@ -4347,7 +4455,7 @@ create_1st_bitmap(void)
 	/*
 	 * At first, clear all the bits on the 1st-bitmap.
 	 */
-	memset(buf, 0, sizeof(buf));
+	memset(buf, 0, blocksize);
 
 	if (lseek(info->bitmap1->fd, info->bitmap1->offset, SEEK_SET) < 0) {
 		ERRMSG("Can't seek the bitmap(%s). %s\n",
@@ -4796,8 +4904,16 @@ int
 copy_bitmap(void)
 {
 	off_t offset;
-	unsigned char buf[info->page_size];
- 	const off_t failed = (off_t)-1;
+	unsigned char *buf;
+	unsigned char *cp;
+	const off_t failed = (off_t)-1;
+
+	if ((cp = malloc(blocksize + DIRECT_ALIGN)) == NULL) {
+		ERRMSG("Can't allocate memory for the bitmap buffer. %s\n",
+		    strerror(errno));
+		exit(1);
+	}
+	buf = cp - ((unsigned long)cp % DIRECT_ALIGN) + DIRECT_ALIGN;
 
 	offset = 0;
 	while (offset < (info->len_bitmap / 2)) {
@@ -4807,7 +4923,7 @@ copy_bitmap(void)
 			    info->name_bitmap, strerror(errno));
 			return FALSE;
 		}
-		if (read(info->bitmap1->fd, buf, sizeof(buf)) != sizeof(buf)) {
+		if (read(info->bitmap1->fd, buf, blocksize) != blocksize) {
 			ERRMSG("Can't read the dump memory(%s). %s\n",
 			    info->name_memory, strerror(errno));
 			return FALSE;
@@ -4818,12 +4934,12 @@ copy_bitmap(void)
 			    info->name_bitmap, strerror(errno));
 			return FALSE;
 		}
-		if (write(info->bitmap2->fd, buf, sizeof(buf)) != sizeof(buf)) {
+		if (write(info->bitmap2->fd, buf, blocksize) != blocksize) {
 			ERRMSG("Can't write the bitmap(%s). %s\n",
 		    	info->name_bitmap, strerror(errno));
 			return FALSE;
 		}
-		offset += sizeof(buf);
+		offset += blocksize;
 	}
 
 	return TRUE;
@@ -5013,7 +5129,8 @@ void
 free_bitmap1_buffer(void)
 {
 	if (info->bitmap1) {
-		free(info->bitmap1);
+		if (info->bitmap1->buf_malloced)
+			free(info->bitmap1->buf_malloced);
 		info->bitmap1 = NULL;
 	}
 }
@@ -5022,7 +5139,8 @@ void
 free_bitmap2_buffer(void)
 {
 	if (info->bitmap2) {
-		free(info->bitmap2);
+		if (info->bitmap2->buf_malloced)
+			free(info->bitmap2->buf_malloced);
 		info->bitmap2 = NULL;
 	}
 }
@@ -5030,8 +5148,18 @@ free_bitmap2_buffer(void)
 void
 free_bitmap_buffer(void)
 {
-	free_bitmap1_buffer();
-	free_bitmap2_buffer();
+	if (info->bitmap1) {
+		if (info->bitmap1->buf_malloced)
+			free(info->bitmap1->buf_malloced);
+		free(info->bitmap1);
+		info->bitmap1 = NULL;
+	}
+	if (info->bitmap2) {
+		if (info->bitmap2->buf_malloced)
+			free(info->bitmap2->buf_malloced);
+		free(info->bitmap2);
+		info->bitmap2 = NULL;
+	}
 }
 
 int
@@ -5058,7 +5186,6 @@ create_dump_bitmap(void)
 	} else {
 		if (!prepare_bitmap_buffer())
 			goto out;
-
 		if (!create_1st_bitmap())
 			goto out;
 
@@ -5130,25 +5257,31 @@ get_loads_dumpfile(void)
 int
 prepare_cache_data(struct cache_data *cd)
 {
+	char *cp;
+
 	cd->fd         = info->fd_dumpfile;
 	cd->file_name  = info->name_dumpfile;
 	cd->cache_size = info->page_size << info->block_order;
 	cd->buf_size   = 0;
 	cd->buf        = NULL;
 
-	if ((cd->buf = malloc(cd->cache_size + info->page_size)) == NULL) {
+	if ((cp = malloc(cd->cache_size + info->page_size + DIRECT_ALIGN)) == NULL) {
 		ERRMSG("Can't allocate memory for the data buffer. %s\n",
 		    strerror(errno));
 		return FALSE;
 	}
+	cd->buf_malloced = cp;
+	cd->buf = cp - ((unsigned long)cp % DIRECT_ALIGN) + DIRECT_ALIGN;
 	return TRUE;
 }
 
 void
 free_cache_data(struct cache_data *cd)
 {
-	free(cd->buf);
+	if (cd->buf_malloced)
+		free(cd->buf_malloced);
 	cd->buf = NULL;
+	cd->buf_malloced = NULL;
 }
 
 int
@@ -5397,19 +5530,21 @@ out:
 }
 
 int
-write_kdump_header(void)
+write_kdump_header(struct cache_data *cd)
 {
 	int ret = FALSE;
 	size_t size;
 	off_t offset_note, offset_vmcoreinfo;
-	unsigned long size_note, size_vmcoreinfo;
+	unsigned long size_note, size_vmcoreinfo, remaining_size_note;
+	unsigned long write_size, room;
 	struct disk_dump_header *dh = info->dump_header;
 	struct kdump_sub_header kh;
-	char *buf = NULL;
+	char *buf = NULL, *cp;
 
 	if (info->flag_elf_dumpfile)
 		return FALSE;
 
+	/* uses reads of /proc/vmcore */
 	get_pt_note(&offset_note, &size_note);
 
 	/*
@@ -5426,6 +5561,7 @@ write_kdump_header(void)
 	dh->bitmap_blocks  = divideup(info->len_bitmap, dh->block_size);
 	memcpy(&dh->timestamp, &info->timestamp, sizeof(dh->timestamp));
 	memcpy(&dh->utsname, &info->system_utsname, sizeof(dh->utsname));
+	blocksize = dh->block_size;
 	if (info->flag_compress & DUMP_DH_COMPRESSED_ZLIB)
 		dh->status |= DUMP_DH_COMPRESSED_ZLIB;
 #ifdef USELZO
@@ -5438,7 +5574,7 @@ write_kdump_header(void)
 #endif
 
 	size = sizeof(struct disk_dump_header);
-	if (!write_buffer(info->fd_dumpfile, 0, dh, size, info->name_dumpfile))
+	if (!write_cache(cd, dh, size))
 		return FALSE;
 
 	/*
@@ -5494,9 +5630,21 @@ write_kdump_header(void)
 				goto out;
 		}
 
-		if (!write_buffer(info->fd_dumpfile, kh.offset_note, buf,
-		    kh.size_note, info->name_dumpfile))
-			goto out;
+		/* the note may be huge, so do this in a loop to not
+		   overflow the cache */
+		remaining_size_note = kh.size_note;
+		cp = buf;
+		do {
+			room = cd->cache_size - cd->buf_size;
+			if (remaining_size_note > room)
+				write_size = room;
+			else
+				write_size = remaining_size_note;
+			if (!write_cache(cd, cp, write_size))
+				goto out;
+			remaining_size_note -= write_size;
+			cp += write_size;
+		} while (remaining_size_note);
 
 		if (has_vmcoreinfo()) {
 			get_vmcoreinfo(&offset_vmcoreinfo, &size_vmcoreinfo);
@@ -5512,8 +5660,7 @@ write_kdump_header(void)
 			kh.size_vmcoreinfo = size_vmcoreinfo;
 		}
 	}
-	if (!write_buffer(info->fd_dumpfile, dh->block_size, &kh,
-	    size, info->name_dumpfile))
+	if (!write_cache(cd, &kh, size))
 		goto out;
 
 	info->sub_header = kh;
@@ -6110,13 +6257,15 @@ write_elf_pages_cyclic(struct cache_data
 }
 
 int
-write_kdump_pages(struct cache_data *cd_header, struct cache_data *cd_page)
+write_kdump_pages(struct cache_data *cd_descs, struct cache_data *cd_page)
 {
- 	unsigned long long pfn, per, num_dumpable;
+	unsigned long long pfn, per, num_dumpable;
 	unsigned long long start_pfn, end_pfn;
 	unsigned long size_out;
+	long prefix;
 	struct page_desc pd, pd_zero;
 	off_t offset_data = 0;
+	off_t initial_offset_data;
 	struct disk_dump_header *dh = info->dump_header;
 	unsigned char buf[info->page_size], *buf_out = NULL;
 	unsigned long len_buf_out;
@@ -6124,8 +6273,12 @@ write_kdump_pages(struct cache_data *cd_
 	struct timeval tv_start;
 	const off_t failed = (off_t)-1;
 	unsigned long len_buf_out_zlib, len_buf_out_lzo, len_buf_out_snappy;
+	int saved_bytes = 0;
+	int cpysize;
+	char *save_block1, *save_block_cur, *save_block2;
 
 	int ret = FALSE;
+	int status;
 
 	if (info->flag_elf_dumpfile)
 		return FALSE;
@@ -6166,13 +6319,41 @@ write_kdump_pages(struct cache_data *cd_
 	per = num_dumpable / 10000;
 
 	/*
-	 * Calculate the offset of the page data.
+	 * Calculate the offset of the page_desc's and page data.
 	 */
-	cd_header->offset
+	cd_descs->offset
 	    = (DISKDUMP_HEADER_BLOCKS + dh->sub_hdr_size + dh->bitmap_blocks)
 		* dh->block_size;
-	cd_page->offset = cd_header->offset + sizeof(page_desc_t)*num_dumpable;
-	offset_data  = cd_page->offset;
+	/* this is already a pagesize multiple, so well-formed for i/o */
+
+	cd_page->offset = cd_descs->offset + (sizeof(page_desc_t) * num_dumpable);
+	offset_data = cd_page->offset;
+
+	/* for i/o, round this page data offset down to a block boundary */
+	prefix = cd_page->offset % blocksize;
+	cd_page->offset -= prefix;
+	initial_offset_data = cd_page->offset;
+	cd_page->buf_size = prefix;
+	memset(cd_page->buf, 0, prefix);
+
+	fill_to_offset(cd_descs, blocksize);
+
+	if ((save_block1 = malloc(blocksize * 2)) == NULL) {
+		ERRMSG("Can't allocate memory for save block. %s\n",
+		       strerror(errno));
+		goto out;
+	}
+	/* put on block address boundary for well-rounded i/o */
+	save_block1 += (blocksize - (unsigned long)save_block1 % blocksize);
+	save_block_cur = save_block1 + prefix;
+	saved_bytes += prefix;
+	if ((save_block2 = malloc(blocksize + DIRECT_ALIGN)) == NULL) {
+		ERRMSG("Can't allocate memory for save block2. %s\n",
+		       strerror(errno));
+		goto out;
+	}
+	/* put on block address boundary for well-rounded i/o */
+	save_block2 += (DIRECT_ALIGN - (unsigned long)save_block2 % DIRECT_ALIGN);
 
 	/*
 	 * Set a fileoffset of Physical Address 0x0.
@@ -6196,6 +6377,14 @@ write_kdump_pages(struct cache_data *cd_
 		memset(buf, 0, pd_zero.size);
 		if (!write_cache(cd_page, buf, pd_zero.size))
 			goto out;
+
+		cpysize = pd_zero.size;
+		if ((saved_bytes + cpysize) > blocksize)
+			cpysize = blocksize - saved_bytes;
+		memcpy(save_block_cur, buf, cpysize);
+		saved_bytes += cpysize;
+		save_block_cur += cpysize;
+
 		offset_data  += pd_zero.size;
 	}
 	if (info->flag_split) {
@@ -6229,7 +6418,7 @@ write_kdump_pages(struct cache_data *cd_
 		 */
 		if ((info->dump_level & DL_EXCLUDE_ZERO)
 		    && is_zero_page(buf, info->page_size)) {
-			if (!write_cache(cd_header, &pd_zero, sizeof(page_desc_t)))
+			if (!write_cache(cd_descs, &pd_zero, sizeof(page_desc_t)))
 				goto out;
 			pfn_zero++;
 			continue;
@@ -6280,24 +6469,70 @@ write_kdump_pages(struct cache_data *cd_
 		/*
 		 * Write the page header.
 		 */
-		if (!write_cache(cd_header, &pd, sizeof(page_desc_t)))
+		if (!write_cache(cd_descs, &pd, sizeof(page_desc_t))) {
+			PROGRESS_MSG(
+				"makedumpfile: write error on page header; dump incomplete\n");
 			goto out;
+		}
 
 		/*
 		 * Write the page data.
 		 */
+		/* kludge: save the partial block where page desc's and data overlap */
+		/* (this is the second part of the full block (save_block) where
+		    they overlap) */
+		if (saved_bytes < blocksize) {
+			memcpy(save_block_cur, buf, pd.size);
+			saved_bytes += pd.size;
+			save_block_cur += pd.size;
+		}
 		if (!write_cache(cd_page, buf, pd.size))
 			goto out;
 	}
 
 	/*
-	 * Write the remainder.
+	 * Write the remainder (well-formed blocks)
 	 */
-	if (!write_cache_bufsz(cd_page))
+	/* adjust the cd_descs to write out only full blocks beyond the
+	   data in the buffer */
+	if (cd_descs->buf_size % blocksize) {
+		cd_descs->buf_size +=
+			(blocksize - (cd_descs->buf_size % blocksize));
+		cd_descs->cache_size = cd_descs->buf_size;
+	}
+	if (!write_cache_flush(cd_descs))
 		goto out;
-	if (!write_cache_bufsz(cd_header))
+
+	/*
+	 * kludge: the page data will overwrite the last block of the page_desc's,
+	 * so re-construct a block from:
+	 *   the last block of the page_desc's (length 'prefix') (will read into
+	 *   save_block2) and the end (4096-prefix) of the page data we saved in
+	 *   save_block1.
+	 */
+	if (!write_cache_flush(cd_page))
 		goto out;
 
+	if (lseek(cd_page->fd, initial_offset_data, SEEK_SET) == failed) {
+		printf("kludge: seek to %#lx, fd %d failed errno %d\n",
+			initial_offset_data, cd_page->fd, errno);
+		exit(1);
+	}
+	if (read(cd_page->fd, save_block2, blocksize) != blocksize) {
+		printf("kludge: read block2 failed\n");
+		exit(1);
+	}
+	/* combine the overlapping parts into save_block1 */
+	memcpy(save_block1, save_block2, prefix);
+
+	if (lseek(cd_page->fd, initial_offset_data, SEEK_SET) == failed) {
+		printf("kludge: seek to %#lx, fd %d failed errno %d\n",
+			initial_offset_data, cd_page->fd, errno);
+		exit(1);
+	}
+	status = write(cd_page->fd, save_block1, blocksize);
+	/* end of kludged block */
+
 	/*
 	 * print [100 %]
 	 */
@@ -6307,8 +6542,6 @@ write_kdump_pages(struct cache_data *cd_
 
 	ret = TRUE;
 out:
-	if (buf_out != NULL)
-		free(buf_out);
 #ifdef USELZO
 	if (wrkmem != NULL)
 		free(wrkmem);
@@ -6456,18 +6689,18 @@ write_kdump_pages_cyclic(struct cache_da
 		pd.offset     = *offset_data;
 		*offset_data  += pd.size;
 
-                /*
-                 * Write the page header.
-                 */
-                if (!write_cache(cd_header, &pd, sizeof(page_desc_t)))
-                        goto out;
-
-                /*
-                 * Write the page data.
-                 */
-                if (!write_cache(cd_page, buf, pd.size))
-                        goto out;
-        }
+		/*
+		 * Write the page header.
+		 */
+		if (!write_cache(cd_header, &pd, sizeof(page_desc_t)))
+			goto out;
+
+		/*
+		 * Write the page data.
+		 */
+		if (!write_cache(cd_page, buf, pd.size))
+			goto out;
+	}
 
 	ret = TRUE;
 out:
@@ -6704,50 +6937,48 @@ write_kdump_eraseinfo(struct cache_data 
 }
 
 int
-write_kdump_bitmap(void)
+write_kdump_bitmap(struct cache_data *cd)
 {
 	struct cache_data bm;
 	long long buf_size;
-	off_t offset;
+	long write_size;
 
 	int ret = FALSE;
 
 	if (info->flag_elf_dumpfile)
 		return FALSE;
 
+	/* set up to read bit map file in big blocks from the start */
 	bm.fd        = info->fd_bitmap;
 	bm.file_name = info->name_bitmap;
 	bm.offset    = 0;
-	bm.buf       = NULL;
-
-	if ((bm.buf = calloc(1, BUFSIZE_BITMAP)) == NULL) {
-		ERRMSG("Can't allocate memory for dump bitmap buffer. %s\n",
-		    strerror(errno));
-		goto out;
+	bm.cache_size = cd->cache_size;
+	bm.buf = cd->buf; /* use the bitmap cd */
+	/* using the dumpfile cd_bitmap buffer and fd */
+	if (lseek(cd->fd, info->offset_bitmap1, SEEK_SET) < 0) {
+		ERRMSG("Can't seek the dump file(%s). %s\n",
+		       info->name_memory, strerror(errno));
+		return FALSE;
 	}
-	offset = info->offset_bitmap1;
 	buf_size = info->len_bitmap;
 
 	while (buf_size > 0) {
-		if (buf_size >= BUFSIZE_BITMAP)
-			bm.cache_size = BUFSIZE_BITMAP;
-		else
-			bm.cache_size = buf_size;
-
 		if(!read_cache(&bm))
 			goto out;
 
-		if (!write_buffer(info->fd_dumpfile, offset,
-		    bm.buf, bm.cache_size, info->name_dumpfile))
-			goto out;
-
-		offset += bm.cache_size;
-		buf_size -= BUFSIZE_BITMAP;
+		write_size = cd->cache_size;
+		if (buf_size < cd->cache_size) {
+			write_size = buf_size;
+		}
+		if (write(cd->fd, cd->buf, write_size) != write_size) {
+			ERRMSG("Can't write a destination file. %s\n",
+				strerror(errno));
+			exit(1);
+		}
+		buf_size -= bm.cache_size;
 	}
 	ret = TRUE;
 out:
-	if (bm.buf != NULL)
-		free(bm.buf);
 
 	return ret;
 }
@@ -6756,7 +6987,7 @@ int
 write_kdump_bitmap1_cyclic(void)
 {
 	off_t offset;
-        int increment;
+	int increment;
 	int ret = FALSE;
 
 	increment = divideup(info->cyclic_end_pfn - info->cyclic_start_pfn, BITPERBYTE);
@@ -6875,14 +7106,14 @@ write_kdump_pages_and_bitmap_cyclic(stru
 			continue;
 
 		if (!update_cyclic_region(pfn))
-                        return FALSE;
+			return FALSE;
 
 		if (!write_kdump_pages_cyclic(cd_header, cd_page, &pd_zero, &offset_data))
 			return FALSE;
 
 		if (!write_kdump_bitmap2_cyclic())
 			return FALSE;
-        }
+	}
 
 	/*
 	 * Write the remainder.
@@ -7799,7 +8030,7 @@ int
 writeout_dumpfile(void)
 {
 	int ret = FALSE;
-	struct cache_data cd_header, cd_page;
+	struct cache_data cd_header, cd_page_descs, cd_page, cd_bitmap;
 
 	info->flag_nospace = FALSE;
 
@@ -7812,11 +8043,20 @@ writeout_dumpfile(void)
 	}
 	if (!prepare_cache_data(&cd_header))
 		return FALSE;
+	cd_header.offset = 0;
 
 	if (!prepare_cache_data(&cd_page)) {
 		free_cache_data(&cd_header);
 		return FALSE;
 	}
+	if (!prepare_cache_data(&cd_page_descs)) {
+		free_cache_data(&cd_header);
+		free_cache_data(&cd_page);
+		return FALSE;
+	}
+	if (!prepare_cache_data(&cd_bitmap))
+		return FALSE;
+
 	if (info->flag_elf_dumpfile) {
 		if (!write_elf_header(&cd_header))
 			goto out;
@@ -7830,20 +8070,35 @@ writeout_dumpfile(void)
 		if (!write_elf_eraseinfo(&cd_header))
 			goto out;
 	} else if (info->flag_cyclic) {
-		if (!write_kdump_header())
+		if (!write_kdump_header(&cd_header))
 			goto out;
 		if (!write_kdump_pages_and_bitmap_cyclic(&cd_header, &cd_page))
 			goto out;
 		if (!write_kdump_eraseinfo(&cd_page))
 			goto out;
 	} else {
-		if (!write_kdump_header())
+
+		/*
+		 * Use cd_header for the caching operation up to the bit map.
+		 * Use cd_bitmap for 1-block (4096) operations on the bit map.
+		 * (it fits between the file header and page_desc's, both of
+		 *  which end and start on block boundaries)
+		 * Then use cd_page_descs and cd_page for page headers and
+		 * data (and eraseinfo).
+		 * Then back to cd_header to fill in the bitmap.
+		 */
+
+		if (!write_kdump_header(&cd_header))
 			goto out;
-		if (!write_kdump_pages(&cd_header, &cd_page))
+		write_cache_flush(&cd_header);
+
+		if (!write_kdump_pages(&cd_page_descs, &cd_page))
 			goto out;
 		if (!write_kdump_eraseinfo(&cd_page))
 			goto out;
-		if (!write_kdump_bitmap())
+
+		cd_bitmap.offset = info->offset_bitmap1;
+		if (!write_kdump_bitmap(&cd_bitmap))
 			goto out;
 	}
 	if (info->flag_flatten) {
@@ -7883,7 +8138,7 @@ setup_splitting(void)
 		}
 		if (SPLITTING_END_PFN(i-1) > info->max_mapnr)
 			SPLITTING_END_PFN(i-1) = info->max_mapnr;
-        } else {
+	} else {
 		initialize_2nd_bitmap(&bitmap2);
 
 		pfn_per_dumpfile = num_dumpable / info->num_dumpfile;
@@ -8005,11 +8260,43 @@ create_dumpfile(void)
 		if (!get_elf_info(info->fd_memory, info->name_memory))
 			return FALSE;
 	}
+	blocksize = info->page_size;
+	if (!blocksize)
+		blocksize = sysconf(_SC_PAGE_SIZE);
 	if (!initial())
 		return FALSE;
 
 	print_vtop();
 
+	if (info->flag_rawdump)
+		PROGRESS_MSG("Using O_DIRECT i/o for dump.\n");
+	if (info->flag_rawbitmaps)
+		PROGRESS_MSG("Using O_DIRECT i/o for bitmap.\n");
+	if (plenty_of_memory()) {
+		PROGRESS_MSG("Plenty of memory.\n");
+		info->flag_cyclic = FALSE;
+		if (!info->flag_rawdump)
+			PROGRESS_MSG("Using page cache for bitmap file.\n");
+		if (!info->flag_rawbitmaps)
+			PROGRESS_MSG("Using page cache for dump file.\n");
+	} else {
+		/* memory is restricted; solution is direct i/o */
+		if (!info->flag_rawdump) {
+			info->flag_rawdump = 1;
+			PROGRESS_MSG(
+			"Restricted memory; switching to O_DIRECT i/o for dump.\n");
+		}
+		if (!info->flag_rawbitmaps) {
+			info->flag_rawbitmaps = 1;
+			PROGRESS_MSG(
+			"Restricted memory; switching to O_DIRECT i/o for bitmap.\n");
+		}
+	}
+
+	if (info->flag_cyclic == FALSE) {
+		PROGRESS_MSG("Using non-cyclic mode.\n");
+	}
+
 	num_retry = 0;
 retry:
 	if (info->flag_refiltering) {
@@ -8045,11 +8332,11 @@ retry:
 		 */
 		num_retry++;
 		if ((info->dump_level = get_next_dump_level(num_retry)) < 0)
- 			return FALSE;
+			return FALSE;
 		MSG("Retry to create a dumpfile by dump_level(%d).\n",
 		    info->dump_level);
 		if (!delete_dumpfile())
- 			return FALSE;
+			return FALSE;
 		goto retry;
 	}
 	print_report();
@@ -8911,6 +9198,22 @@ out:
 	return free_size;
 }
 
+/*
+ * Is there enough free memory to hold both bitmaps, and hence
+ * to do a non-cyclic dump?  Default to non-cyclic if so.
+ */
+static int
+plenty_of_memory(void)
+{
+	unsigned long free_size;
+	unsigned long needed_size;
+
+	free_size = get_free_memory_size();
+	needed_size = (info->max_mapnr * 2) / BITPERBYTE;
+	if (free_size > (needed_size + (10*1024*1024)))
+		return 1;
+	return 0;
+}
 
 /*
  * Choose the lesser value of the two below as the size of cyclic buffer.
@@ -9041,6 +9344,12 @@ main(int argc, char *argv[])
 			info->flag_read_vmcoreinfo = 1;
 			info->name_vmcoreinfo = optarg;
 			break;
+		case OPT_RAWDUMP:
+			info->flag_rawdump = 1;
+			break;
+		case OPT_RAWBITMAPS:
+			info->flag_rawbitmaps = 1;
+			break;
 		case OPT_DISKSET:
 			if (!sadump_add_diskset_info(optarg))
 				goto out;
Index: makedumpfile-1.5.5/makedumpfile.h
===================================================================
--- makedumpfile-1.5.5.orig/makedumpfile.h
+++ makedumpfile-1.5.5/makedumpfile.h
@@ -18,6 +18,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#define __USE_GNU
 #include <fcntl.h>
 #include <gelf.h>
 #include <sys/stat.h>
@@ -215,6 +216,7 @@ isAnon(unsigned long mapping)
 #define FILENAME_BITMAP		"kdump_bitmapXXXXXX"
 #define FILENAME_STDOUT		"STDOUT"
 #define MAP_REGION		(4096*1024)
+#define DIRECT_ALIGN		(512)
 
 /*
  * Minimam vmcore has 2 ProgramHeaderTables(PT_NOTE and PT_LOAD).
@@ -822,7 +824,8 @@ struct dump_bitmap {
 	int		fd;
 	int		no_block;
 	char		*file_name;
-	char		buf[BUFSIZE_BITMAP];
+	char		*buf;
+	char		*buf_malloced;
 	off_t		offset;
 };
 
@@ -830,6 +833,7 @@ struct cache_data {
 	int	fd;
 	char	*file_name;
 	char	*buf;
+	char	*buf_malloced;
 	size_t	buf_size;
 	size_t	cache_size;
 	off_t	offset;
@@ -911,6 +915,8 @@ struct DumpInfo {
 	int		flag_use_printk_log; /* did we read printk_log symbol name? */
 	int		flag_nospace;	     /* the flag of "No space on device" error */
 	int		flag_vmemmap;        /* kernel supports vmemmap address space */
+	int		flag_rawdump;        /* use raw i/o for the dump file */
+	int		flag_rawbitmaps;     /* use raw i/o for the bitmaps file */
 	unsigned long	vaddr_for_vtop;      /* virtual address for debugging */
 	long		page_size;           /* size of page */
 	long		page_shift;
@@ -1729,6 +1735,8 @@ struct elf_prstatus {
 #define OPT_GENERATE_VMCOREINFO 'g'
 #define OPT_HELP                'h'
 #define OPT_READ_VMCOREINFO     'i'
+#define OPT_RAWDUMP             'j'
+#define OPT_RAWBITMAPS          'J'
 #define OPT_COMPRESS_LZO        'l'
 #define OPT_COMPRESS_SNAPPY     'p'
 #define OPT_REARRANGE           'R'
Index: makedumpfile-1.5.5/print_info.c
===================================================================
--- makedumpfile-1.5.5.orig/print_info.c
+++ makedumpfile-1.5.5/print_info.c
@@ -48,7 +48,7 @@ print_usage(void)
 	MSG("\n");
 	MSG("Usage:\n");
 	MSG("  Creating DUMPFILE:\n");
-	MSG("  # makedumpfile    [-c|-l|-E] [-d DL] [-x VMLINUX|-i VMCOREINFO] VMCORE\n");
+	MSG("  # makedumpfile    [-c|-l|-E] [-d DL] [-j] [-J] [-x VMLINUX|-i VMCOREINFO] VMCORE\n");
 	MSG("    DUMPFILE\n");
 	MSG("\n");
 	MSG("  Creating DUMPFILE with filtered kernel data specified through filter config\n");
@@ -95,6 +95,12 @@ print_usage(void)
 	MSG("      -E option, because the ELF format does not support compressed data.\n");
 	MSG("      THIS IS ONLY FOR THE CRASH UTILITY.\n");
 	MSG("\n");
+	MSG("  [-j]:\n");
+	MSG("      Use raw (O_DIRECT) i/o on dump file to avoid expanding kernel pagecache.\n");
+	MSG("\n");
+	MSG("  [-J]:\n");
+	MSG("      Use raw (O_DIRECT) i/o on bitmap file to avoid expanding kernel pagecache.\n");
+	MSG("\n");
 	MSG("  [-d DL]:\n");
 	MSG("      Specify the type of unnecessary page for analysis.\n");
 	MSG("      Pages of the specified type are not copied to DUMPFILE. The page type\n");
-- 
Cliff Wickman
SGI
cpw@sgi.com
(651) 683-3824

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


* [PATCH 2/2] makedumpfile: exclude unused vmemmap pages
  2013-12-31 23:30 [PATCH 0/2] makedumpfile: for large memories cpw
  2013-12-31 23:34 ` [PATCH 1/2] makedumpfile: raw i/o and use of root device Cliff Wickman
@ 2013-12-31 23:36 ` Cliff Wickman
  2014-01-06  9:27 ` [PATCH 0/2] makedumpfile: for large memories Atsushi Kumagai
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Cliff Wickman @ 2013-12-31 23:36 UTC (permalink / raw)
  To: kexec, d.hatayama, kumagai-atsushi

On Tue, Dec 31, 2013 at 05:30:01PM -0600, cpw wrote:

Exclude kernel pages that contain nothing but page structures for pages
that are not being included in the dump.
These can amount to 3.67 million pages per terabyte of system memory!

The kernel's page table, starting at virtual address 0xffffea0000000000, is 
searched to find the actual pages containing the vmemmap page structures.
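As a rough illustration of where that search begins, the pgd/pud slot numbers for the vmemmap base address can be computed with the shift constants this patch adds to makedumpfile.h (PGD_SHIFT 39, PUD_SHIFT 30). This is only a sketch of the index arithmetic; the real walk also reads each table level out of the dump:

```c
#include <stdint.h>

#define VMEMMAP_START	0xffffea0000000000UL
#define PGD_SHIFT	39
#define PUD_SHIFT	30
#define PTRS_PER_PGD	512
#define PTRS_PER_PUD	512

/* slot of an address in the top-level page table (pgd) */
unsigned long pgd4_index(unsigned long addr)
{
	return (addr >> PGD_SHIFT) & (PTRS_PER_PGD - 1);
}

/* slot within the pud page that the pgd entry points to */
unsigned long pud_index(unsigned long addr)
{
	return (addr >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
}
```

For VMEMMAP_START this yields pgd slot 468 and pud slot 0, which is where find_vmemmap() starts stepping through pud and pmd pages.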

Bitmap1 is a map of dumpable (i.e. existing) pages. Bitmap2 is a map
of pages not to be excluded.
To speed the search of the bitmaps, only whole 64-bit words of 1's in
bitmap1 and 0's in bitmap2 are tested for corresponding vmemmap pages.
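A minimal sketch of that word-at-a-time test follows; the names and the run-reporting shape are illustrative, not the patch's exact code:

```c
#include <stdint.h>
#include <stddef.h>

struct run { unsigned long startpfn; unsigned long npfns; };

/*
 * Scan nwords 64-bit words of the two bitmaps.  A word qualifies when it
 * is all 1's in bitmap1 (pages exist) and all 0's in bitmap2 (pages are
 * excluded from the dump).  Record each maximal run of qualifying words
 * as a pfn range; return the number of runs found.
 */
size_t find_runs(const uint64_t *map1, const uint64_t *map2, size_t nwords,
		 struct run *runs, size_t maxruns)
{
	size_t i, nruns = 0;
	long startword = -1;

	for (i = 0; i <= nwords; i++) {
		int qualifies = i < nwords &&
				map1[i] == UINT64_MAX && map2[i] == 0;
		if (qualifies && startword < 0) {
			startword = (long)i;		/* run begins */
		} else if (!qualifies && startword >= 0) {
			if (nruns < maxruns) {		/* run ends; record it */
				runs[nruns].startpfn = (size_t)startword * 64;
				runs[nruns].npfns =
					(i - (size_t)startword) * 64;
				nruns++;
			}
			startword = -1;
		}
	}
	return nruns;
}
```

Each reported range is then handed to the equivalent of find_vmemmap_pages() to see which vmemmap pages it lets us drop.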

The list of vmemmap pfn's to be excluded is written to a small file in order
to conserve crash kernel memory.
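That small file can be modeled as a block-sized buffer of (startpfn, numpfns) pairs, flushed whenever it fills. The sketch below is a simplified model: the flush counter stands in for the patch's O_DIRECT write, and the 4096-byte block size is an assumption:

```c
#include <stddef.h>
#include <string.h>

#define BLOCKSIZE 4096		/* assumed block size for direct i/o */

struct sc_entry { unsigned long startpfn; unsigned long numpfns; };

struct save_buf {
	char	buf[BLOCKSIZE];
	size_t	pos;		/* next free offset within buf */
	int	flushes;	/* full blocks written out so far */
};

/* Append one pfn range; flush the buffer first if it is full. */
void save_entry(struct save_buf *sb, unsigned long startpfn,
		unsigned long numpfns)
{
	struct sc_entry e = { startpfn, numpfns };

	if (sb->pos + sizeof(e) > BLOCKSIZE) {
		/* real code: write(fd, buf, BLOCKSIZE) on an O_DIRECT fd */
		sb->flushes++;
		sb->pos = 0;
	}
	memcpy(sb->buf + sb->pos, &e, sizeof(e));
	sb->pos += sizeof(e);
}
```

With 16-byte entries, 256 ranges fit per 4k block, so even a long delete list costs only a handful of page-sized writes and one page of buffer memory.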

In practice, this whole procedure only takes about 10 seconds on a
16TB machine.

The effect of omitting unused page structures from the dump has only
one, minimal side effect that I can find: the crash command "kmem -f" will
fail when attempting to walk through free pages. This seems to me to be
a trivial negative when weighed against the enabling and acceleration
of dumps on large systems.

This patch includes -e and -N options to exclude or include unneeded
vmemmap pages regardless of system size (see flag_includevm and
flag_excludevm).  By default the exclusion of such pages is only
done on a system of a terabyte or more.
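For reference, a hypothetical invocation combining these options with those of patch 1/2; the paths and dump level are examples, not taken from the patch:

```shell
# -l: LZO compression, -d 31: maximum page filtering,
# -j/-J: O_DIRECT i/o for the dump and bitmap files (patch 1/2),
# -e: force exclusion of unused vmemmap page structures even below 1TB
makedumpfile -l -d 31 -j -J -e /proc/vmcore /var/crash/vmcore.dump
```

Note that init_save_control() derives the pfn scratch file's directory from the dump path and expects it to contain "/var", which the example path satisfies.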

Signed-off-by: Cliff Wickman <cpw@sgi.com>

---
 makedumpfile.c |  686 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 makedumpfile.h |   62 +++++
 print_info.c   |    9 
 3 files changed, 753 insertions(+), 4 deletions(-)

Index: makedumpfile-1.5.5/makedumpfile.c
===================================================================
--- makedumpfile-1.5.5.orig/makedumpfile.c
+++ makedumpfile-1.5.5/makedumpfile.c
@@ -31,9 +31,12 @@ struct offset_table	offset_table;
 struct array_table	array_table;
 struct number_table	number_table;
 struct srcfile_table	srcfile_table;
+struct save_control	sc;
 
 struct vm_table		vt = { 0 };
 struct DumpInfo		*info = NULL;
+struct vmap_pfns	*gvmem_pfns;
+int nr_gvmem_pfns;
 
 char filename_stdout[] = FILENAME_STDOUT;
 
@@ -3111,6 +3114,21 @@ out:
 	if (!get_max_mapnr())
 		return FALSE;
 
+	/*
+	 * Do not capture unused pages of vmemmap (page structures) for memories
+	 * of a terabyte or more, unless -N was specified to explicitly capture them.
+	 *   max. memory pfn is in info->max_mapnr; a terabyte is pfn 0x10000000
+	 */
+	if (!info->flag_includevm && info->max_mapnr >= 0x10000000UL) {
+		info->flag_excludevm = 1;
+		PROGRESS_MSG(
+			"Large memory, excluding unused vmemmap page structures.\n");
+	}
+	if (info->flag_includevm) {
+		info->flag_excludevm = 0;
+		PROGRESS_MSG("Capturing all vmemmap page structures.\n");
+	}
+
 	if (info->flag_cyclic) {
 		if (info->bufsize_cyclic == 0) {
 			if (!calculate_cyclic_buffer_size())
@@ -4945,6 +4963,340 @@ copy_bitmap(void)
 	return TRUE;
 }
 
+/*
+ * Given a range of unused pfn's, check whether we can drop the vmemmap pages
+ * that represent them.
+ *  (pfn ranges are literally start and end, not start and end+1)
+ *   see the array of vmemmap pfns and the pfns they represent: gvmem_pfns
+ * Return 1 for delete, 0 for not to delete.
+ */
+int
+find_vmemmap_pages(unsigned long startpfn, unsigned long endpfn, unsigned long *vmappfn,
+									unsigned long *nmapnpfns)
+{
+	int i;
+	long npfns_offset, vmemmap_offset, vmemmap_pfns, start_vmemmap_pfn;
+	long npages, end_vmemmap_pfn;
+	struct vmap_pfns *vmapp;
+	int pagesize = info->page_size;
+
+	for (i = 0; i < nr_gvmem_pfns; i++) {
+		vmapp = gvmem_pfns + i;
+		if ((startpfn >= vmapp->rep_pfn_start) &&
+		    (endpfn <= vmapp->rep_pfn_end)) {
+			npfns_offset = startpfn - vmapp->rep_pfn_start;
+			vmemmap_offset = npfns_offset * size_table.page;
+			// round up to a page boundary
+			if (vmemmap_offset % pagesize)
+				vmemmap_offset += (pagesize - (vmemmap_offset % pagesize));
+			vmemmap_pfns = vmemmap_offset / pagesize;
+			start_vmemmap_pfn = vmapp->vmap_pfn_start + vmemmap_pfns;
+			*vmappfn = start_vmemmap_pfn;
+
+			npfns_offset = endpfn - vmapp->rep_pfn_start;
+			vmemmap_offset = npfns_offset * size_table.page;
+			// round down to page boundary
+			vmemmap_offset -= (vmemmap_offset % pagesize);
+			vmemmap_pfns = vmemmap_offset / pagesize;
+			end_vmemmap_pfn = vmapp->vmap_pfn_start + vmemmap_pfns;
+			npages = end_vmemmap_pfn - start_vmemmap_pfn;
+			if (npages == 0)
+				return 0;
+			*nmapnpfns = npages;
+			return 1;
+		}
+	}
+	return 0;
+}
+
+/*
+ * Finalize the structure for saving pfn's to be deleted.
+ */
+void
+finalize_save_control()
+{
+	free(sc.sc_buf_malloced);
+	close(sc.sc_fd);
+	return;
+}
+
+/*
+ * Reset the structure for saving pfn's to be deleted so that it can be read.
+ */
+void
+reset_save_control()
+{
+	int i;
+	if (sc.sc_bufposition == 0)
+		return;
+
+	/* direct i/o, so have to write the whole buffer */
+	i = write(sc.sc_fd, sc.sc_buf, sc.sc_buflen);
+	if (i != sc.sc_buflen) {
+		fprintf(stderr, "reset: Can't write a page to %s\n",
+			sc.sc_filename);
+		exit(1);
+	}
+	sc.sc_filelen += sc.sc_bufposition;
+
+	if (lseek(sc.sc_fd, 0, SEEK_SET) < 0) {
+		fprintf(stderr, "Can't seek the pfn file %s.\n", sc.sc_filename);
+		exit(1);
+	}
+	sc.sc_fileposition = 0;
+	sc.sc_bufposition = sc.sc_buflen; /* trigger 1st read */
+	return;
+}
+
+/*
+ * Initialize the structure for saving pfn's to be deleted.
+ */
+void
+init_save_control()
+{
+	int flags, len;
+	char *filename, *cp;
+
+	filename = malloc(50);
+	*filename = '\0';
+	/* use the prefix of the dump name   e.g. /mnt//var/.... */
+	if (!strchr(info->name_dumpfile,'v')) {
+		printf("no /var found in name_dumpfile %s\n", info->name_dumpfile);
+		exit(1);
+	} else {
+		cp = strchr(info->name_dumpfile,'v');
+		if (strncmp(cp-1, "/var", 4)) {
+			printf("no /var found in name_dumpfile %s\n",
+				info->name_dumpfile);
+			exit(1);
+		}
+	}
+	len = cp - info->name_dumpfile - 1;
+	strncpy(filename, info->name_dumpfile, len);
+	if (*(filename + len - 1) == '/')
+		len -= 1;
+	*(filename + len) = '\0';
+
+	strcat(filename, "/makedumpfilepfns");
+	sc.sc_filename = filename;
+	flags = O_RDWR|O_CREAT|O_TRUNC|O_DIRECT;
+	if ((sc.sc_fd = open(sc.sc_filename, flags,
+							S_IRUSR|S_IWUSR)) < 0) {
+		fprintf(stderr, "Can't open the pfn file %s.\n",
+			sc.sc_filename);
+		exit(1);
+	}
+	unlink(sc.sc_filename);
+
+	sc.sc_buf_malloced = malloc(blocksize + DIRECT_ALIGN);
+	if (!sc.sc_buf_malloced) {
+		fprintf(stderr, "Can't allocate a page for pfn buf.\n");
+		exit(1);
+	}
+	/* round up to the next block boundary */
+	sc.sc_buf = sc.sc_buf_malloced -
+	   ((unsigned long)sc.sc_buf_malloced % DIRECT_ALIGN) + DIRECT_ALIGN;
+	sc.sc_buflen = blocksize;
+	sc.sc_bufposition = 0;
+	sc.sc_fileposition = 0;
+	sc.sc_filelen = 0;
+}
+
+/*
+ * Save a starting pfn and number of pfns for later delete from bitmap.
+ */
+void
+save_deletes(unsigned long startpfn, unsigned long numpfns)
+{
+	int i;
+	struct sc_entry *scp;
+
+	if (sc.sc_bufposition == sc.sc_buflen) {
+		i = write(sc.sc_fd, sc.sc_buf, sc.sc_buflen);
+		if (i != sc.sc_buflen) {
+			fprintf(stderr, "save: Can't write a page to %s\n",
+				sc.sc_filename);
+			exit(1);
+		}
+		sc.sc_filelen += sc.sc_buflen;
+		sc.sc_bufposition = 0;
+	}
+	scp = (struct sc_entry *)(sc.sc_buf + sc.sc_bufposition);
+	scp->startpfn = startpfn;
+	scp->numpfns = numpfns;
+	sc.sc_bufposition += sizeof(struct sc_entry);
+}
+
+/*
+ * Get a starting pfn and number of pfns for delete from bitmap.
+ * Return 0 for success, 1 for 'no more'
+ */
+int
+get_deletes(unsigned long *startpfn, unsigned long *numpfns)
+{
+	int i;
+	struct sc_entry *scp;
+
+	if (sc.sc_fileposition >= sc.sc_filelen) {
+		return 1;
+	}
+
+	if (sc.sc_bufposition == sc.sc_buflen) {
+		i = read(sc.sc_fd, sc.sc_buf, sc.sc_buflen);
+		if (i <= 0) {
+			fprintf(stderr, "Can't read a page from %s.\n", sc.sc_filename);
+			exit(1);
+		}
+		sc.sc_bufposition = 0;
+	}
+	scp = (struct sc_entry *)(sc.sc_buf + sc.sc_bufposition);
+	*startpfn = scp->startpfn;
+	*numpfns = scp->numpfns;
+	sc.sc_bufposition += sizeof(struct sc_entry);
+	sc.sc_fileposition += sizeof(struct sc_entry);
+	return 0;
+}
+
+/*
+ * Find the big holes in bitmap2; they represent ranges for which
+ * we do not need page structures.
+ * Bitmap1 is a map of dumpable (i.e. existing) pages.
+ * They must only be pages that exist, so they will be 0 bits
+ * in the 2nd bitmap but 1 bits in the 1st bitmap.
+ * For speed, only worry about whole words full of bits.
+ */
+void
+find_unused_vmemmap_pages(void)
+{
+	struct dump_bitmap *bitmap1 = info->bitmap1;
+	struct dump_bitmap *bitmap2 = info->bitmap2;
+	unsigned long long pfn;
+	unsigned long *lp1, *lp2, startpfn, endpfn;
+	unsigned long vmapstartpfn, vmapnumpfns;
+	int i, sz, numpages=0, did_deletes;
+	int startword, numwords, do_break=0;
+	long deleted_pages = 0;
+	off_t new_offset1, new_offset2;
+
+	/* read each block of both bitmaps */
+	for (pfn = 0; pfn < info->max_mapnr; pfn += PFN_BUFBITMAP) { /* size in bits */
+		numpages++;
+		did_deletes = 0;
+		new_offset1 = bitmap1->offset + BUFSIZE_BITMAP * (pfn / PFN_BUFBITMAP);
+		if (lseek(bitmap1->fd, new_offset1, SEEK_SET) < 0 ) {
+			ERRMSG("Can't seek the bitmap(%s). %s\n",
+				bitmap1->file_name, strerror(errno));
+			return;
+		}
+		if (read(bitmap1->fd, bitmap1->buf, BUFSIZE_BITMAP) != BUFSIZE_BITMAP) {
+			ERRMSG("Can't read the bitmap(%s). %s\n",
+				bitmap1->file_name, strerror(errno));
+			return;
+		}
+		bitmap1->no_block = pfn / PFN_BUFBITMAP;
+
+		new_offset2 = bitmap2->offset + BUFSIZE_BITMAP * (pfn / PFN_BUFBITMAP);
+		if (lseek(bitmap2->fd, new_offset2, SEEK_SET) < 0 ) {
+			ERRMSG("Can't seek the bitmap(%s). %s\n",
+				bitmap2->file_name, strerror(errno));
+			return;
+		}
+		if (read(bitmap2->fd, bitmap2->buf, BUFSIZE_BITMAP) != BUFSIZE_BITMAP) {
+			ERRMSG("Can't read the bitmap(%s). %s\n",
+				bitmap2->file_name, strerror(errno));
+			return;
+		}
+		bitmap2->no_block = pfn / PFN_BUFBITMAP;
+
+		/* process this one page of both bitmaps at a time */
+		lp1 = (unsigned long *)bitmap1->buf;
+		lp2 = (unsigned long *)bitmap2->buf;
+		/* sz is words in the block */
+		sz = BUFSIZE_BITMAP / sizeof(unsigned long);
+		startword = -1;
+		for (i = 0; i < sz; i++, lp1++, lp2++) {
+			/* for each whole word in the block */
+			/* deal in full 64-page chunks only */
+			if (*lp1 == 0xffffffffffffffffUL) {
+				if (*lp2 == 0) {
+					/* we are in a series we want */
+					if (startword == -1) {
+						/* starting a new group */
+						startword = i;
+					}
+				} else {
+					/* we hit a used page */
+					if (startword >= 0)
+						do_break = 1;
+				}
+			} else {
+				/* we hit a hole in real memory, or part of one */
+				if (startword >= 0)
+					do_break = 1;
+			}
+			if (do_break) {
+				do_break = 0;
+				if (startword >= 0) {
+					numwords = i - startword;
+					/* 64 bits represent only 64 page structs,
+					   less than one page of them (it takes
+					   at least 73 to fill a page) */
+					if (numwords > 1) {
+						startpfn = pfn +
+							(startword * BITS_PER_WORD);
+						/* pfn ranges are literally start and end,
+						   not start and end + 1 */
+						endpfn = startpfn +
+							(numwords * BITS_PER_WORD) - 1;
+						if (find_vmemmap_pages(startpfn, endpfn,
+							&vmapstartpfn, &vmapnumpfns)) {
+							save_deletes(vmapstartpfn,
+								vmapnumpfns);
+							deleted_pages += vmapnumpfns;
+							did_deletes = 1;
+						}
+					}
+				}
+				startword = -1;
+			}
+		}
+		if (startword >= 0) {
+			numwords = i - startword;
+			if (numwords > 1) {
+				startpfn = pfn + (startword * BITS_PER_WORD);
+				/* pfn ranges are literally start and end,
+				   not start and end + 1 */
+				endpfn = startpfn + (numwords * BITS_PER_WORD) - 1;
+				if (find_vmemmap_pages(startpfn, endpfn,
+							&vmapstartpfn, &vmapnumpfns)) {
+					save_deletes(vmapstartpfn, vmapnumpfns);
+					deleted_pages += vmapnumpfns;
+					did_deletes = 1;
+				}
+			}
+		}
+	}
+	PROGRESS_MSG("\nExcluded %ld unused vmemmap pages\n", deleted_pages);
+
+	return;
+}
+
+/*
+ * Retrieve the list of pfn's and delete them from bitmap2;
+ */
+void
+delete_unused_vmemmap_pages(void)
+{
+	unsigned long startpfn, numpfns, pfn, i;
+
+	while (!get_deletes(&startpfn, &numpfns)) {
+		for (i = 0, pfn = startpfn; i < numpfns; i++, pfn++) {
+			clear_bit_on_2nd_bitmap_for_kernel(pfn);
+		}
+	}
+	return;
+}
+
 int
 create_2nd_bitmap(void)
 {
@@ -5016,6 +5368,15 @@ create_2nd_bitmap(void)
 	if (!sync_2nd_bitmap())
 		return FALSE;
 
+	/* -e means exclude vmemmap page structures for unused pages */
+	if (info->flag_excludevm) {
+		init_save_control();
+		find_unused_vmemmap_pages();
+		reset_save_control();
+		delete_unused_vmemmap_pages();
+		finalize_save_control();
+	}
+
 	return TRUE;
 }
 
@@ -5222,7 +5583,7 @@ get_loads_dumpfile(void)
 			continue;
 
 		pfn_start = paddr_to_pfn(load.p_paddr);
-		pfn_end   = paddr_to_pfn(load.p_paddr + load.p_memsz);
+		pfn_end = paddr_to_pfn(load.p_paddr + load.p_memsz);
 		frac_head = page_size - (load.p_paddr % page_size);
 		frac_tail = (load.p_paddr + load.p_memsz) % page_size;
 
@@ -8248,6 +8609,315 @@ writeout_multiple_dumpfiles(void)
 	return ret;
 }
 
+/*
+ * Scan the kernel page table for the pfn's of the page structs
+ * Place them in array gvmem_pfns[nr_gvmem_pfns]
+ */
+void
+find_vmemmap()
+{
+	int i, verbose = 0;
+	int pgd_index, pud_index;
+	int start_range = 1;
+	int num_pmds=0, num_pmds_valid=0;
+	int break_in_valids, break_after_invalids;
+	int do_break, done = 0;
+	int last_valid=0, last_invalid=0;
+	int pagestructsize, structsperhpage, hugepagesize;
+	long page_structs_per_pud;
+	long num_puds, groups = 0;
+	long pgdindex, pudindex, pmdindex;
+	long vaddr, vaddr_base;
+	long rep_pfn_start = 0, rep_pfn_end = 0;
+	unsigned long init_level4_pgt;
+	unsigned long max_paddr, high_pfn;
+	unsigned long pgd_addr, pud_addr, pmd_addr;
+	unsigned long *pgdp, *pudp, *pmdp;
+	unsigned long pud_page[PTRS_PER_PUD];
+	unsigned long pmd_page[PTRS_PER_PMD];
+	unsigned long vmap_offset_start = 0, vmap_offset_end = 0;
+	unsigned long pmd, tpfn;
+	unsigned long pvaddr = 0;
+	unsigned long data_addr = 0, last_data_addr = 0, start_data_addr = 0;
+	/*
+	 * data_addr is the paddr of the page holding the page structs.
+	 * We keep lists of contiguous pages and the pfn's that their
+	 * page structs represent.
+	 *  start_data_addr and last_data_addr mark start/end of those
+	 *  contiguous areas.
+	 * An area descriptor is vmap start/end pfn and rep start/end
+	 *  of the pfn's represented by the vmap start/end.
+	 */
+	struct vmap_pfns *vmapp, *vmaphead = NULL, *cur, *tail;
+
+	init_level4_pgt = SYMBOL(init_level4_pgt);
+	if (init_level4_pgt == NOT_FOUND_SYMBOL) {
+		fprintf(stderr, "init_level4_pgt not found\n");
+		return;
+	}
+	pagestructsize = size_table.page;
+	hugepagesize = PTRS_PER_PMD * info->page_size;
+	vaddr_base = info->vmemmap_start;
+	vaddr = vaddr_base;
+	max_paddr = get_max_paddr();
+	/*
+	 * the page structures are mapped at VMEMMAP_START (info->vmemmap_start)
+	 * for max_paddr >> 12 page structures
+	 */
+	high_pfn = max_paddr >> 12;
+	pgd_index = pgd4_index(vaddr_base);
+	pud_index = pud_index(vaddr_base);
+	pgd_addr = vaddr_to_paddr(init_level4_pgt); /* address of pgd */
+	pgd_addr += pgd_index * sizeof(unsigned long);
+	page_structs_per_pud = (PTRS_PER_PUD * PTRS_PER_PMD * info->page_size) /
+									pagestructsize;
+	num_puds = (high_pfn + page_structs_per_pud - 1) / page_structs_per_pud;
+	pvaddr = VMEMMAP_START;
+	structsperhpage = hugepagesize / pagestructsize;
+
+	/* outer loop is for pud entries in the pgd */
+	for (pgdindex = 0, pgdp = (unsigned long *)pgd_addr; pgdindex < num_puds;
+								pgdindex++, pgdp++) {
+		/* read the pgd one word at a time, into pud_addr */
+		if (!readmem(PADDR, (unsigned long long)pgdp, (void *)&pud_addr,
+								sizeof(unsigned long))) {
+			ERRMSG("Can't get pgd entry for slot %d.\n", pgd_index);
+			return;
+		}
+		/* mask the pgd entry for the address of the pud page */
+		pud_addr &= PMASK;
+		/* read the entire pud page */
+		if (!readmem(PADDR, (unsigned long long)pud_addr, (void *)pud_page,
+					PTRS_PER_PUD * sizeof(unsigned long))) {
+			ERRMSG("Can't get pud entry for pgd slot %ld.\n", pgdindex);
+			return;
+		}
+		/* step thru each pmd address in the pud page */
+		/* pudp points to an entry in the pud page */
+		for (pudp = (unsigned long *)pud_page, pudindex = 0;
+					pudindex < PTRS_PER_PUD; pudindex++, pudp++) {
+			pmd_addr = *pudp & PMASK;
+			/* read the entire pmd page */
+			if (!readmem(PADDR, pmd_addr, (void *)pmd_page,
+					PTRS_PER_PMD * sizeof(unsigned long))) {
+				ERRMSG("Can't read pmd page for pud slot %ld.\n", pudindex);
+				return;
+			}
+			/* pmdp points to an entry in the pmd */
+			for (pmdp = (unsigned long *)pmd_page, pmdindex = 0;
+					pmdindex < PTRS_PER_PMD; pmdindex++, pmdp++) {
+				/* linear page position in this page table: */
+				pmd = *pmdp;
+				num_pmds++;
+				tpfn = (pvaddr - VMEMMAP_START) /
+							pagestructsize;
+				if (tpfn >= high_pfn) {
+					done = 1;
+					break;
+				}
+				/*
+				 * vmap_offset_start:
+				 * Starting logical position in the
+				 * vmemmap array for the group stays
+				 * constant until a hole in the table
+				 * or a break in contiguousness.
+				 */
+
+				/*
+				 * Ending logical position in the
+				 * vmemmap array:
+				 */
+				vmap_offset_end += hugepagesize;
+				do_break = 0;
+				break_in_valids = 0;
+				break_after_invalids = 0;
+				/*
+				 * We want breaks either when:
+				 * - we hit a hole (invalid)
+				 * - we hit a discontiguous page in a string of valids
+				 */
+				if (pmd) {
+					data_addr = (pmd & PMASK);
+					if (start_range) {
+						/* first-time kludge */
+						start_data_addr = data_addr;
+						last_data_addr = start_data_addr
+							 - hugepagesize;
+						start_range = 0;
+					}
+					if (last_invalid) {
+						/* end of a hole */
+						start_data_addr = data_addr;
+						last_data_addr = start_data_addr
+							 - hugepagesize;
+						/* trigger update of offset */
+						do_break = 1;
+					}
+					last_valid = 1;
+					last_invalid = 0;
+					/*
+					 * check for a gap in physical
+					 * contiguity in the table.
+					 */
+					/* ?? consecutive holes will have
+					   same data_addr */
+					if (data_addr !=
+						last_data_addr + hugepagesize) {
+						do_break = 1;
+						break_in_valids = 1;
+					}
+					if (verbose)
+						printf("valid: pud %ld pmd %ld pfn %#lx"
+							" pvaddr %#lx pfns %#lx-%lx"
+							" start %#lx end %#lx\n",
+							pudindex, pmdindex,
+							data_addr >> 12,
+							pvaddr, tpfn,
+					tpfn + structsperhpage - 1,
+					vmap_offset_start,
+					vmap_offset_end);
+					num_pmds_valid++;
+					if (!(pmd & PSE)) {
+						printf("vmemmap pmd not huge, abort\n");
+						exit(1);
+					}
+				} else {
+					if (last_valid) {
+						/* this is a hole after some valids */
+						do_break = 1;
+						break_in_valids = 1;
+						break_after_invalids = 0;
+					}
+					last_valid = 0;
+					last_invalid = 1;
+					/*
+					 * There are holes in this sparsely
+					 * populated table; they are 2MB gaps
+					 * represented by null pmd entries.
+					 */
+					if (verbose)
+						printf("invalid: pud %ld pmd %ld %#lx"
+							" pfns %#lx-%lx start %#lx end"
+							" %#lx\n", pudindex, pmdindex,
+							pvaddr, tpfn,
+							tpfn + structsperhpage - 1,
+							vmap_offset_start,
+							vmap_offset_end);
+				}
+				if (do_break) {
+					/* A break at the end of a hole is not
+					 * summarized; only one at the start of
+					 * a hole or a discontiguous series is.
+					 */
+					if (break_in_valids || break_after_invalids) {
+						/*
+						 * calculate the pfns
+						 * represented by the current
+						 * offset in the vmemmap.
+						 */
+						/* page struct even partly on this page */
+						rep_pfn_start = vmap_offset_start /
+							pagestructsize;
+						/* ending page struct entirely on
+ 						   this page */
+						rep_pfn_end = ((vmap_offset_end -
+							hugepagesize) / pagestructsize);
+ 						if (verbose)
+							printf("vmap pfns %#lx-%lx "
+							"represent pfns %#lx-%lx\n\n",
+							start_data_addr >> PAGESHFT,
+							last_data_addr >> PAGESHFT,
+							rep_pfn_start, rep_pfn_end);
+						groups++;
+						vmapp = (struct vmap_pfns *)malloc(
+								sizeof(struct vmap_pfns));
+						/* pfn of this 2MB page of page structs */
+						vmapp->vmap_pfn_start = start_data_addr
+									>> PTE_SHIFT;
+						vmapp->vmap_pfn_end = last_data_addr
+									>> PTE_SHIFT;
+						/* these (start/end) are literal pfns
+ 						 * on this page, not start and end+1 */
+						vmapp->rep_pfn_start = rep_pfn_start;
+						vmapp->rep_pfn_end = rep_pfn_end;
+
+						if (!vmaphead) {
+							vmaphead = vmapp;
+							vmapp->next = vmapp;
+							vmapp->prev = vmapp;
+						} else {
+							tail = vmaphead->prev;
+							vmaphead->prev = vmapp;
+							tail->next = vmapp;
+							vmapp->next = vmaphead;
+							vmapp->prev = tail;
+						}
+					}
+
+					/* update logical position at every break */
+					vmap_offset_start =
+						vmap_offset_end - hugepagesize;
+					start_data_addr = data_addr;
+				}
+
+				last_data_addr = data_addr;
+				pvaddr += hugepagesize;
+				/*
+				 * pvaddr is current virtual address
+				 *   eg 0xffffea0004200000 if
+				 *    vmap_offset_start is 4200000
+				 */
+			}
+		}
+		tpfn = (pvaddr - VMEMMAP_START) / pagestructsize;
+		if (tpfn >= high_pfn) {
+			done = 1;
+			break;
+		}
+	}
+	rep_pfn_start = vmap_offset_start / pagestructsize;
+	rep_pfn_end = (vmap_offset_end - hugepagesize) / pagestructsize;
+ 	if (verbose)
+		printf("vmap pfns %#lx-%lx represent pfns %#lx-%lx\n\n",
+			start_data_addr >> PAGESHFT, last_data_addr >> PAGESHFT,
+			rep_pfn_start, rep_pfn_end);
+	groups++;
+	vmapp = (struct vmap_pfns *)malloc(sizeof(struct vmap_pfns));
+	vmapp->vmap_pfn_start = start_data_addr >> PTE_SHIFT;
+	vmapp->vmap_pfn_end = last_data_addr >> PTE_SHIFT;
+	vmapp->rep_pfn_start = rep_pfn_start;
+	vmapp->rep_pfn_end = rep_pfn_end;
+	if (!vmaphead) {
+		vmaphead = vmapp;
+		vmapp->next = vmapp;
+		vmapp->prev = vmapp;
+	} else {
+		tail = vmaphead->prev;
+		vmaphead->prev = vmapp;
+		tail->next = vmapp;
+		vmapp->next = vmaphead;
+		vmapp->prev = tail;
+	}
+	if (verbose)
+		printf("num_pmds: %d num_pmds_valid %d\n", num_pmds, num_pmds_valid);
+
+	/* transfer the linked list to an array */
+	cur = vmaphead;
+	gvmem_pfns = (struct vmap_pfns *)malloc(sizeof(struct vmap_pfns) * groups);
+	i = 0;
+	do {
+		vmapp = gvmem_pfns + i;
+		vmapp->vmap_pfn_start = cur->vmap_pfn_start;
+		vmapp->vmap_pfn_end = cur->vmap_pfn_end;
+		vmapp->rep_pfn_start = cur->rep_pfn_start;
+		vmapp->rep_pfn_end = cur->rep_pfn_end;
+		cur = cur->next;
+		free(cur->prev);
+		i++;
+	} while (cur != vmaphead);
+	nr_gvmem_pfns = i;
+}
+
 int
 create_dumpfile(void)
 {
@@ -8268,6 +8938,10 @@ create_dumpfile(void)
 
 	print_vtop();
 
+	/* create an array of translations from pfn to vmemmap pages */
+	if (info->flag_excludevm)
+		find_vmemmap();
+
 	if (info->flag_rawdump)
 		PROGRESS_MSG("Using O_DIRECT i/o for dump.\n");
 	if (info->flag_rawbitmaps)
@@ -9300,7 +9974,7 @@ main(int argc, char *argv[])
 
 	info->block_order = DEFAULT_ORDER;
 	message_level = DEFAULT_MSG_LEVEL;
-	while ((opt = getopt_long(argc, argv, "b:cDd:EFfg:hi:lpRvXx:", longopts,
+	while ((opt = getopt_long(argc, argv, "b:cDd:eEFfg:hi:lNpRvXx:", longopts,
 	    NULL)) != -1) {
 		switch (opt) {
 		case OPT_BLOCK_ORDER:
@@ -9315,6 +9989,14 @@ main(int argc, char *argv[])
 		case OPT_DEBUG:
 			flag_debug = TRUE;
 			break;
+		case OPT_EXCLUDEVM:
+			info->flag_excludevm = 1;
+			/* exclude unused vmemmap pages */
+			break;
+		case OPT_INCLUDEVM:
+			info->flag_includevm = 1;
+			/* include unused vmemmap pages */
+			break;
 		case OPT_DUMP_LEVEL:
 			if (!parse_dump_level(optarg))
 				goto out;
Index: makedumpfile-1.5.5/makedumpfile.h
===================================================================
--- makedumpfile-1.5.5.orig/makedumpfile.h
+++ makedumpfile-1.5.5/makedumpfile.h
@@ -44,6 +44,9 @@
 #include "diskdump_mod.h"
 #include "sadump_mod.h"
 
+#define VMEMMAPSTART 0xffffea0000000000UL
+#define BITS_PER_WORD 64
+
 /*
  * Result of command
  */
@@ -477,6 +480,7 @@ do { \
 #define VMALLOC_END		(info->vmalloc_end)
 #define VMEMMAP_START		(info->vmemmap_start)
 #define VMEMMAP_END		(info->vmemmap_end)
+#define PMASK			(0x7ffffffffffff000UL)
 
 #ifdef __arm__
 #define KVBASE_MASK		(0xffff)
@@ -561,15 +565,20 @@ do { \
 #define PGDIR_SIZE		(1UL << PGDIR_SHIFT)
 #define PGDIR_MASK		(~(PGDIR_SIZE - 1))
 #define PTRS_PER_PGD		(512)
+#define PGD_SHIFT		(39)
+#define PUD_SHIFT		(30)
 #define PMD_SHIFT		(21)
 #define PMD_SIZE		(1UL << PMD_SHIFT)
 #define PMD_MASK		(~(PMD_SIZE - 1))
+#define PTRS_PER_PUD		(512)
 #define PTRS_PER_PMD		(512)
 #define PTRS_PER_PTE		(512)
 #define PTE_SHIFT		(12)
 
 #define pml4_index(address) (((address) >> PML4_SHIFT) & (PTRS_PER_PML4 - 1))
 #define pgd_index(address)  (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
+#define pgd4_index(address)  (((address) >> PGD_SHIFT) & (PTRS_PER_PGD - 1))
+#define pud_index(address)  (((address) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
 #define pmd_index(address)  (((address) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
 #define pte_index(address)  (((address) >> PTE_SHIFT) & (PTRS_PER_PTE - 1))
 
@@ -683,7 +692,6 @@ do { \
 /*
  * 4 Levels paging
  */
-#define PUD_SHIFT		(PMD_SHIFT + PTRS_PER_PTD_SHIFT)
 #define PGDIR_SHIFT_4L		(PUD_SHIFT + PTRS_PER_PTD_SHIFT)
 
 #define MASK_PUD   	((1UL << REGION_SHIFT) - 1) & (~((1UL << PUD_SHIFT) - 1))
@@ -917,6 +925,8 @@ struct DumpInfo {
 	int		flag_vmemmap;        /* kernel supports vmemmap address space */
 	int		flag_rawdump;        /* use raw i/o for the dump file */
 	int		flag_rawbitmaps;     /* use raw i/o for the bitmaps file */
+	int		flag_excludevm;      /* exclude unused vmemmap pages */
+	int		flag_includevm;      /* include unused vmemmap pages */
 	unsigned long	vaddr_for_vtop;      /* virtual address for debugging */
 	long		page_size;           /* size of page */
 	long		page_shift;
@@ -1449,6 +1459,52 @@ struct srcfile_table {
 	char	pud_t[LEN_SRCFILE];
 };
 
+/*
+ * This structure records where the vmemmap page structures reside, and which
+ * pfn's are represented by those page structures.
+ * The actual pages containing the page structures are 2MB pages, so their pfn's
+ * will all be multiples of 0x200.
+ * The page structures are 7 64-bit words in length (0x38) so they overlap the
+ * 2MB boundaries. Each page structure represents a 4k page.
+ * A 4k page is here defined to be represented on a 2MB page if its page structure
+ * 'ends' on that page (even if it began on the page before).
+ */
+struct vmap_pfns {
+	struct vmap_pfns *next;
+	struct vmap_pfns *prev;
+	/*
+	 * These (start/end) are literal pfns of 2MB pages on which the page
+	 * structures reside, not start and end+1.
+	 */
+	unsigned long vmap_pfn_start;
+	unsigned long vmap_pfn_end;
+	/*
+	 * These (start/end) are literal pfns represented on these pages, not
+	 * start and end+1.
+	 * The starting page struct is at least partly on the first page; the
+	 * ending page struct is entirely on the last page.
+	 */
+	unsigned long rep_pfn_start;
+	unsigned long rep_pfn_end;
+};
+
+/* for saving a list of pfns to a buffer, and then to a file if necessary */
+struct save_control {
+	int sc_fd;
+	char *sc_filename;
+	char *sc_buf_malloced;
+	char *sc_buf;
+	long sc_buflen; /* length of buffer never changes */
+	long sc_bufposition; /* offset of next slot for write, or next to be read */
+	long sc_filelen; /* length of valid data written */
+	long sc_fileposition; /* offset in file of next entry to be read */
+};
+/* one entry in the buffer and file */
+struct sc_entry {
+	unsigned long startpfn;
+	unsigned long numpfns;
+};
+
 extern struct symbol_table	symbol_table;
 extern struct size_table	size_table;
 extern struct offset_table	offset_table;
@@ -1595,6 +1651,8 @@ int get_xen_info_ia64(void);
 #define get_xen_basic_info_arch(X) FALSE
 #define get_xen_info_arch(X) FALSE
 #endif	/* s390x */
+#define PAGESHFT	12 /* assuming a 4k page */
+#define PSE		128 /* bit 7 */
 
 static inline int
 is_on(char *bitmap, int i)
@@ -1729,6 +1787,7 @@ struct elf_prstatus {
 #define OPT_COMPRESS_ZLIB       'c'
 #define OPT_DEBUG               'D'
 #define OPT_DUMP_LEVEL          'd'
+#define OPT_EXCLUDEVM           'e'
 #define OPT_ELF_DUMPFILE        'E'
 #define OPT_FLATTEN             'F'
 #define OPT_FORCE               'f'
@@ -1738,6 +1797,7 @@ struct elf_prstatus {
 #define OPT_RAWDUMP             'j'
 #define OPT_RAWBITMAPS          'J'
 #define OPT_COMPRESS_LZO        'l'
+#define OPT_INCLUDEVM           'N'
 #define OPT_COMPRESS_SNAPPY     'p'
 #define OPT_REARRANGE           'R'
 #define OPT_VERSION             'v'
Index: makedumpfile-1.5.5/print_info.c
===================================================================
--- makedumpfile-1.5.5.orig/print_info.c
+++ makedumpfile-1.5.5/print_info.c
@@ -48,7 +48,7 @@ print_usage(void)
 	MSG("\n");
 	MSG("Usage:\n");
 	MSG("  Creating DUMPFILE:\n");
-	MSG("  # makedumpfile    [-c|-l|-E] [-d DL] [-j] [-J] [-x VMLINUX|-i VMCOREINFO] VMCORE\n");
+	MSG("  # makedumpfile    [-c|-l|-E|-j|-J|-e|-N] [-d DL] [-x VMLINUX|-i VMCOREINFO] VMCORE\n");
 	MSG("    DUMPFILE\n");
 	MSG("\n");
 	MSG("  Creating DUMPFILE with filtered kernel data specified through filter config\n");
@@ -101,6 +101,13 @@ print_usage(void)
 	MSG("  [-J]:\n");
 	MSG("      Use raw (O_DIRECT) i/o on bitmap file to avoid expanding kernel pagecache.\n");
 	MSG("\n");
+	MSG("  [-e]:\n");
+	MSG("      Exclude page structures (vmemmap) for unused pages.\n");
+	MSG("      (this will be the default for memories over 1 terabyte)\n");
+	MSG("\n");
+	MSG("  [-N]:\n");
+	MSG("      Explicitly include all page structures (vmemmap).\n");
+	MSG("\n");
 	MSG("  [-d DL]:\n");
 	MSG("      Specify the type of unnecessary page for analysis.\n");
 	MSG("      Pages of the specified type are not copied to DUMPFILE. The page type\n");
-- 
Cliff Wickman
SGI
cpw@sgi.com
(651) 683-3824

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] makedumpfile: for large memories
  2013-12-31 23:30 [PATCH 0/2] makedumpfile: for large memories cpw
  2013-12-31 23:34 ` [PATCH 1/2] makedumpfile: raw i/o and use of root device Cliff Wickman
  2013-12-31 23:36 ` [PATCH 2/2] makedumpfile: exclude unused vmemmap pages Cliff Wickman
@ 2014-01-06  9:27 ` Atsushi Kumagai
  2014-01-09  0:25   ` Cliff Wickman
  2014-01-07 10:14 ` HATAYAMA Daisuke
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Atsushi Kumagai @ 2014-01-06  9:27 UTC (permalink / raw)
  To: cpw; +Cc: d.hatayama@jp.fujitsu.com, kexec@lists.infradead.org

Hello Cliff,

On 2014/01/01 8:30:47, kexec <kexec-bounces@lists.infradead.org> wrote:
> From: Cliff Wickman <cpw@sgi.com>
> 
> Gentlemen of kexec,
> 
> I have been working on enabling kdump on some very large systems, and
> have found some solutions that I hope you will consider.
> 
> The first issue is to work within the restricted size of crashkernel memory
> under 2.6.32-based kernels, such as sles11 and rhel6.
> 
> The second issue is to reduce the very large size of a dump of a big memory
> system, even on an idle system.
> 
> These are my propositions:
> 
> Size of crashkernel memory
>   1) raw i/o for writing the dump
>   2) use root device for the bitmap file (not tmpfs)
>   3) raw i/o for reading/writing the bitmaps
>   
> Size of dump (and hence the duration of dumping)
>   4) exclude page structures for unused pages
> 
> 
> 1) Is quite easy.  The cache of pages needs to be aligned on a block
>   boundary and written in block multiples, as required by O_DIRECT files.
> 
>   The use of raw i/o prevents the growing of the crash kernel's page
>   cache.

There is no reason to reject this idea; please re-post it as a formal patch.
If possible, I would like to know the benefit of this change on its own.

> 2) Is also quite easy.  My patch finds the path to the crash
>   kernel's root device by examining the dump pathname. Storing the bitmaps
>   to a file is otherwise not conserving memory, as they are being written
>   to tmpfs.

Users will expect the dump file to be at most the size of RAM, and they
will provision a disk that fits that estimate.
But 2) breaks this estimate, which worries me a little.

Of course, I don't reject this idea for that reason alone,
but I would like to know its definite advantage.
I suspect that the improvement shown in your benchmarks comes mostly
from 1) and 4), so could you let me know whether 2) and 3) alone
can perform much faster than the current cyclic mode?

> 3) Raw i/o for the bitmaps, is accomplished by caching the
>   bitmap file in a similar way to that of the dump file.
> 
>   I find that the use of direct i/o is not significantly slower than
>   writing through the kernel's page cache.
>
> 4) The excluding of unused kernel page structures is very
>   important for a large memory system.  The kernel otherwise includes
>   3.67 million pages of page structures per TB of memory. By contrast
>   the rest of the kernel is only about 1 million pages.

According to your and Dave's mails, 4) seems risky and unacceptable
for now. I think we need more investigation of this.


Thanks
Atsushi Kumagai

> Test results are below, for systems of 1TB, 2TB, 8.8TB and 16TB.
> (There are no 'old' numbers for 16TB as time and space requirements
>  made those effectively useless.)
> 
> Run times were generally reduced 2-3x, and dump size reduced about 8x.
> 
> All timings were done using 512M of crashkernel memory.
> 
>    System memory size
>    1TB                     unpatched    patched
>      OS: rhel6.4 (does a free pages pass)
>      page scan time           1.6min    1.6min
>      dump copy time           2.4min     .4min
>      total time               4.1min    2.0min
>      dump size                 3014M      364M
> 
>      OS: rhel6.5
>      page scan time            .6min     .6min
>      dump copy time           2.3min     .5min
>      total time               2.9min    1.1min
>      dump size                 3011M      423M
> 
>      OS: sles11sp3 (3.0.93)
>      page scan time            .5min     .5min
>      dump copy time           2.3min     .5min
>      total time               2.8min    1.0min
>      dump size                 2950M      350M
> 
>    2TB
>      OS: rhel6.5           (cyclicx3)
>      page scan time           2.0min    1.8min
>      dump copy time           8.0min    1.5min
>      total time              10.0min    3.3min
>      dump size                 6141M      835M
> 
>    8.8TB
>      OS: rhel6.5           (cyclicx5)
>      page scan time           6.6min    5.5min
>      dump copy time          67.8min    6.2min
>      total time              74.4min   11.7min
>      dump size                 15.8G      2.7G
> 
>    16TB
>      OS: rhel6.4
>      page scan time                   125.3min
>      dump copy time                    13.2min
>      total time                       138.5min
>      dump size                            4.0G
> 
>      OS: rhel6.5
>      page scan time                    27.8min
>      dump copy time                    13.3min
>      total time                        41.1min
>      dump size                            4.1G
> 
> Page scan time is greatly affected by whether or not the
> kernel supports mmap of /proc/vmcore.
> 
> The choice of snappy vs. zlib compression becomes fairly irrelevant
> when we can shrink the dump size dramatically.  The above
> were done with snappy compression.
> 
> I am sending my 2 working patches.  
> They are kludgy in the sense that they ignore all forms of
> kdump except the creation of a disk dump, and all architectures
> except x86_64.
> But I think they are sufficient to demonstrate the sizable
> time, crashkernel space and disk space savings that are possible.
> 



* Re: [PATCH 0/2] makedumpfile: for large memories
  2013-12-31 23:30 [PATCH 0/2] makedumpfile: for large memories cpw
                   ` (2 preceding siblings ...)
  2014-01-06  9:27 ` [PATCH 0/2] makedumpfile: for large memories Atsushi Kumagai
@ 2014-01-07 10:14 ` HATAYAMA Daisuke
  2014-01-10 17:58 ` [PATCH 1/2 V2] raw i/o and root device to use less memory Cliff Wickman
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: HATAYAMA Daisuke @ 2014-01-07 10:14 UTC (permalink / raw)
  To: cpw; +Cc: kumagai-atsushi, kexec

(2014/01/01 8:30), cpw wrote:
> From: Cliff Wickman <cpw@sgi.com>
> 
> Gentlemen of kexec,
> 
> I have been working on enabling kdump on some very large systems, and
> have found some solutions that I hope you will consider.
> 
> The first issue is to work within the restricted size of crashkernel memory
> under 2.6.32-based kernels, such as sles11 and rhel6.
> 
> The second issue is to reduce the very large size of a dump of a big memory
> system, even on an idle system.
> 
> These are my propositions:
> 
> Size of crashkernel memory
>    1) raw i/o for writing the dump
>    2) use root device for the bitmap file (not tmpfs)
>    3) raw i/o for reading/writing the bitmaps
>    

Thanks for 1) and 3). I have the same idea of using direct i/o but have yet
to evaluate how it improves performance. This work is very helpful to me.

For 2), I understand the merit as long as non-cyclic mode is alive, but
there are issues we should consider. The root device could be broken by
the very bug that caused the crash, which reduces reliability somewhat;
at the least, the help message should warn about this. Also, you need to
handle flattened-format mode, in which makedumpfile writes the dump data
to standard output. I think it is sufficient to disallow using this new
functionality together with the -F option.

> Size of dump (and hence the duration of dumping)
>    4) exclude page structures for unused pages
> 
> 
> 1) Is quite easy.  The cache of pages needs to be aligned on a block
>    boundary and written in block multiples, as required by O_DIRECT files.
> 
>    The use of raw i/o prevents the growing of the crash kernel's page
>    cache.
> 
> 2) Is also quite easy.  My patch finds the path to the crash
>    kernel's root device by examining the dump pathname. Storing the bitmaps
>    to a file is otherwise not conserving memory, as they are being written
>    to tmpfs.
> 
> 3) Raw i/o for the bitmaps, is accomplished by caching the
>    bitmap file in a similar way to that of the dump file.
> 
>    I find that the use of direct i/o is not significantly slower than
>    writing through the kernel's page cache.
> 
> 4) The excluding of unused kernel page structures is very
>    important for a large memory system.  The kernel otherwise includes
>    3.67 million pages of page structures per TB of memory. By contrast
>    the rest of the kernel is only about 1 million pages.
> 
> Test results are below, for systems of 1TB, 2TB, 8.8TB and 16TB.
> (There are no 'old' numbers for 16TB as time and space requirements
>   made those effectively useless.)
> 
> Run times were generally reduced 2-3x, and dump size reduced about 8x.
> 
> All timings were done using 512M of crashkernel memory.
> 
>     System memory size
>     1TB                     unpatched    patched
>       OS: rhel6.4 (does a free pages pass)
>       page scan time           1.6min    1.6min
>       dump copy time           2.4min     .4min
>       total time               4.1min    2.0min
>       dump size                 3014M      364M
> 
>       OS: rhel6.5
>       page scan time            .6min     .6min
>       dump copy time           2.3min     .5min
>       total time               2.9min    1.1min
>       dump size                 3011M      423M
> 
>       OS: sles11sp3 (3.0.93)
>       page scan time            .5min     .5min
>       dump copy time           2.3min     .5min
>       total time               2.8min    1.0min
>       dump size                 2950M      350M
> 
>     2TB
>       OS: rhel6.5           (cyclicx3)
>       page scan time           2.0min    1.8min
>       dump copy time           8.0min    1.5min
>       total time              10.0min    3.3min
>       dump size                 6141M      835M
> 
>     8.8TB
>       OS: rhel6.5           (cyclicx5)
>       page scan time           6.6min    5.5min
>       dump copy time          67.8min    6.2min
>       total time              74.4min   11.7min
>       dump size                 15.8G      2.7G
> 
>     16TB
>       OS: rhel6.4
>       page scan time                   125.3min
>       dump copy time                    13.2min
>       total time                       138.5min
>       dump size                            4.0G
> 
>       OS: rhel6.5
>       page scan time                    27.8min
>       dump copy time                    13.3min
>       total time                        41.1min
>       dump size                            4.1G
> 

Could you tell me which filesystem you use on the dump partition?
Although I don't know filesystems very well, I have heard that the
performance of direct I/O depends heavily on the filesystem.

Also, how did you measure these times? I forgot to report this,
but surprisingly, I found that the time reported in cyclic mode
differs from that in non-cyclic mode. In cyclic mode, the time for
writing data includes the page-scan time, so it looks larger than
it actually is.

> Page scan time is greatly affected by whether or not the
> kernel supports mmap of /proc/vmcore.
> 

Another idea for improving page scan time is to touch the mmapped
region directly rather than through readmem(), which would let us skip
copying page descriptors into buffers. Although I have yet to evaluate
how much this affects performance, the savings should add up to a
considerable amount in total.

> The choice of snappy vs. zlib compression becomes fairly irrelevant
> when we can shrink the dump size dramatically.  The above
> were done with snappy compression.
> 
> I am sending my 2 working patches.
> They are kludgy in the sense that they ignore all forms of
> kdump except the creation of a disk dump, and all architectures
> except x86_64.
> But I think they are sufficient to demonstrate the sizable
> time, crashkernel space and disk space savings that are possible.
> 

-- 
Thanks.
HATAYAMA, Daisuke




* Re: [PATCH 0/2] makedumpfile: for large memories
  2014-01-06  9:27 ` [PATCH 0/2] makedumpfile: for large memories Atsushi Kumagai
@ 2014-01-09  0:25   ` Cliff Wickman
  2014-01-10  7:48     ` Atsushi Kumagai
  0 siblings, 1 reply; 16+ messages in thread
From: Cliff Wickman @ 2014-01-09  0:25 UTC (permalink / raw)
  To: Atsushi Kumagai; +Cc: d.hatayama@jp.fujitsu.com, kexec@lists.infradead.org

On Mon, Jan 06, 2014 at 09:27:34AM +0000, Atsushi Kumagai wrote:
> Hello Cliff,
> 
> On 2014/01/01 8:30:47, kexec <kexec-bounces@lists.infradead.org> wrote:
> > From: Cliff Wickman <cpw@sgi.com>
> > 
> > Gentlemen of kexec,
> > 
> > I have been working on enabling kdump on some very large systems, and
> > have found some solutions that I hope you will consider.
> > 
> > The first issue is to work within the restricted size of crashkernel memory
> > under 2.6.32-based kernels, such as sles11 and rhel6.
> > 
> > The second issue is to reduce the very large size of a dump of a big memory
> > system, even on an idle system.
> > 
> > These are my propositions:
> > 
> > Size of crashkernel memory
> >   1) raw i/o for writing the dump
> >   2) use root device for the bitmap file (not tmpfs)
> >   3) raw i/o for reading/writing the bitmaps
> >   
> > Size of dump (and hence the duration of dumping)
> >   4) exclude page structures for unused pages
> > 
> > 
> > 1) Is quite easy.  The cache of pages needs to be aligned on a block
> >   boundary and written in block multiples, as required by O_DIRECT files.
> > 
> >   The use of raw i/o prevents the growing of the crash kernel's page
> >   cache.
> 
> There is no reason to reject this idea, please re-post it as a formal patch.
> If possible, I would like to know the benefit of only this.

The motivation for using raw i/o was purely to be able to conserve memory,
not for speed.
However, I haven't noticed any significant degradation in speed.
Memory is in 'very' short supply on a large machine (ironically) running
a 2.6 or 3.0 kernel.  We're constrained to the low 4GB, and the kernel is
putting other things in that memory whose size scales with memory size.
The obvious solution is cyclic mode, but that requires at least 2x the
page scans: one full scan for unnecessary pages, plus several partial
scans during the copy phase.
But it is tmpfs and kernel page cache that are using up available memory.
If we avoid those, a single page scan can work in about 350M of crashkernel
memory.
This is not a problem with 3.10+ kernels as we're not constrained to low 4G.
 
> > 2) Is also quite easy.  My patch finds the path to the crash
> >   kernel's root device by examining the dump pathname. Storing the bitmaps
> >   to a file is otherwise not conserving memory, as they are being written
> >   to tmpfs.
> 
> Users will expect that the size of dump file is the same as the size of
> RAM at most, they will prepare a disk which fit to save that.
> But 2) breaks this estimation, I worry about it a little.

The bitmap file is very small compared to the dump, and the dump should be
much smaller than RAM, particularly with 4), the exclusion of unused page
structures.
> 
> Of course, I don't reject this idea just only for that reason,
> but I would like to know the definite advantage of this.
> I suppose that the improvement showed in your benchmarks may be came
> from 1) and 4) mostly, so could you let me know that only 2) and 3)
> can perform much faster than the current cyclic mode ?

2) and 3), the handling of the bitmap, are small contributors to the
memory shortage issue.  They are a bigger issue the bigger the system.
It's just that if we consistently avoid enlarging page cache and
tmpfs we can avoid the 2nd page scan altogether.
True, my benchmarks show only .2 min. and 1.1 min. improvements
for 2TB and 8TB (2.0 vs 1.8, and 6.6 vs 5.5).
But that's an improvement, not a loss.  And we're absolutely
not going to run out of memory as the scan and copies proceed.
This is important on these old kernels with minimal memory available.
 
> > 3) Raw i/o for the bitmaps, is accomplished by caching the
> >   bitmap file in a similar way to that of the dump file.
> > 
> >   I find that the use of direct i/o is not significantly slower than
> >   writing through the kernel's page cache.
> >
> > 4) The excluding of unused kernel page structures is very
> >   important for a large memory system.  The kernel otherwise includes
> >   3.67 million pages of page structures per TB of memory. By contrast
> >   the rest of the kernel is only about 1 million pages.
> 
> According to your and Dave's mails, 4) seems risky and unacceptable
> for now. I think we need more investigation for this.

I've been working with Dave on a patch for crash.  It will warn the
user that certain kmem command options will fail.  But that is
only relevant to examinations of free memory and user memory, the
contents of which we're not capturing anyway.

Number 4), the exclusion of page structures for non-captured
pages, is really the crux of the improvement.
A Linux kernel should not be hugely bigger on a big machine than
on a small one.  Slightly bigger, yes, because of bigger slab
caches.
But in practice the dumps of big memories are huge, and all
because of page structures.
Finding the unneeded ones takes only a few seconds, but cuts
hours off the dumping process.  Without this, a customer is simply
not going to allow a very big system to be dumped.

-Cliff
> 
> 
> Thanks
> Atsushi Kumagai
> 
> > Test results are below, for systems of 1TB, 2TB, 8.8TB and 16TB.
> > (There are no 'old' numbers for 16TB as time and space requirements
> >  made those effectively useless.)
> > 
> > Run times were generally reduced 2-3x, and dump size reduced about 8x.
> > 
> > All timings were done using 512M of crashkernel memory.
> > 
> >    System memory size
> >    1TB                     unpatched    patched
> >      OS: rhel6.4 (does a free pages pass)
> >      page scan time           1.6min    1.6min
> >      dump copy time           2.4min     .4min
> >      total time               4.1min    2.0min
> >      dump size                 3014M      364M
> > 
> >      OS: rhel6.5
> >      page scan time            .6min     .6min
> >      dump copy time           2.3min     .5min
> >      total time               2.9min    1.1min
> >      dump size                 3011M      423M
> > 
> >      OS: sles11sp3 (3.0.93)
> >      page scan time            .5min     .5min
> >      dump copy time           2.3min     .5min
> >      total time               2.8min    1.0min
> >      dump size                 2950M      350M
> > 
> >    2TB
> >      OS: rhel6.5           (cyclicx3)
> >      page scan time           2.0min    1.8min
> >      dump copy time           8.0min    1.5min
> >      total time              10.0min    3.3min
> >      dump size                 6141M      835M
> > 
> >    8.8TB
> >      OS: rhel6.5           (cyclicx5)
> >      page scan time           6.6min    5.5min
> >      dump copy time          67.8min    6.2min
> >      total time              74.4min   11.7min
> >      dump size                 15.8G      2.7G
> > 
> >    16TB
> >      OS: rhel6.4
> >      page scan time                   125.3min
> >      dump copy time                    13.2min
> >      total time                       138.5min
> >      dump size                            4.0G
> > 
> >      OS: rhel6.5
> >      page scan time                    27.8min
> >      dump copy time                    13.3min
> >      total time                        41.1min
> >      dump size                            4.1G
> > 
> > Page scan time is greatly affected by whether or not the
> > kernel supports mmap of /proc/vmcore.
> > 
> > The choice of snappy vs. zlib compression becomes fairly irrelevant
> > when we can shrink the dump size dramatically.  The above
> > were done with snappy compression.
> > 
> > I am sending my 2 working patches.  
> > They are kludgy in the sense that they ignore all forms of
> > kdump except the creation of a disk dump, and all architectures
> > except x86_64.
> > But I think they are sufficient to demonstrate the sizable
> > time, crashkernel space and disk space savings that are possible.
> > 

-- 
Cliff Wickman
SGI
cpw@sgi.com
(651) 683-3824



* Re: [PATCH 0/2] makedumpfile: for large memories
  2014-01-09  0:25   ` Cliff Wickman
@ 2014-01-10  7:48     ` Atsushi Kumagai
  2014-01-10 18:23       ` Cliff Wickman
  0 siblings, 1 reply; 16+ messages in thread
From: Atsushi Kumagai @ 2014-01-10  7:48 UTC (permalink / raw)
  To: Cliff Wickman; +Cc: d.hatayama@jp.fujitsu.com, kexec@lists.infradead.org

On 2014/01/09 9:26:20, kexec <kexec-bounces@lists.infradead.org> wrote:
> On Mon, Jan 06, 2014 at 09:27:34AM +0000, Atsushi Kumagai wrote:
> > Hello Cliff,
> > 
> > On 2014/01/01 8:30:47, kexec <kexec-bounces@lists.infradead.org> wrote:
> > > From: Cliff Wickman <cpw@sgi.com>
> > > 
> > > Gentlemen of kexec,
> > > 
> > > I have been working on enabling kdump on some very large systems, and
> > > have found some solutions that I hope you will consider.
> > > 
> > > The first issue is to work within the restricted size of crashkernel memory
> > > under 2.6.32-based kernels, such as sles11 and rhel6.
> > > 
> > > The second issue is to reduce the very large size of a dump of a big memory
> > > system, even on an idle system.
> > > 
> > > These are my propositions:
> > > 
> > > Size of crashkernel memory
> > >   1) raw i/o for writing the dump
> > >   2) use root device for the bitmap file (not tmpfs)
> > >   3) raw i/o for reading/writing the bitmaps
> > >   
> > > Size of dump (and hence the duration of dumping)
> > >   4) exclude page structures for unused pages
> > > 
> > > 
> > > 1) Is quite easy.  The cache of pages needs to be aligned on a block
> > >   boundary and written in block multiples, as required by O_DIRECT files.
> > > 
> > >   The use of raw i/o prevents the growing of the crash kernel's page
> > >   cache.
> > 
> > There is no reason to reject this idea, please re-post it as a formal patch.
> > If possible, I would like to know the benefit of only this.
> 
> The motivation for using raw i/o was purely to be able to conserve memory,
> not for speed.

OK, so 1) is also aimed at removing cyclic mode, right?
I think there is no need to conserve memory with 1), since 2) is enough
to remove cyclic mode.
(To be exact, there are some cases where we have to use cyclic mode, as
 Hatayama-san said, but I won't go into that in this mail.)

> However, I haven't noticed any significant degradation in speed.
> Memory is in 'very' short supply on a large machine (ironically) and a 2.6 or 
> 3.0 kernel.  We're constrained to the low 4GB, and the kernel is putting other
> things in that memory that are related to memory size.
> The obvious solution is cyclic mode, but that requires at least 2x the page
> scans.  Once for the scan of unnecessary pages and several partial 
> scans for the copy phase.
> But it is tmpfs and kernel page cache that are using up available memory.
> If we avoid those, a single page scan can work in about 350M of crashkernel
> memory.
> This is not a problem with 3.10+ kernels as we're not constrained to low 4G.

Even if we can use the 350M fully, 5TB is the limit on system memory
size in non-cyclic mode without 2), since the bitmap file requires 64MB
per 1TB of RAM. So I can't see the importance of 1) on its own.

> > > 2) Is also quite easy.  My patch finds the path to the crash
> > >   kernel's root device by examining the dump pathname. Storing the bitmaps
> > >   to a file is otherwise not conserving memory, as they are being written
> > >   to tmpfs.
> > 
> > Users will expect that the size of dump file is the same as the size of
> > RAM at most, they will prepare a disk which fit to save that.
> > But 2) breaks this estimation, I worry about it a little.
> 
> The bit map file is very small compared to the dump. And the dump should be
> much smaller than RAM.  Particularly with 4), the excluding of unused page structures.
> > 
> > Of course, I don't reject this idea just only for that reason,
> > but I would like to know the definite advantage of this.
> > I suppose that the improvement showed in your benchmarks may be came
> > from 1) and 4) mostly, so could you let me know that only 2) and 3)
> > can perform much faster than the current cyclic mode ?
> 
> 2) and 3), the handling of the bitmap, are small contributors to the
> memory shortage issue.  They are a bigger issue the bigger the system.
> It's just that if we consistently avoid enlarging page cache and
> tmpfs we can avoid the 2nd page scan altogether.
> True, my benchmarks show only .2 min. and 1.1 min. improvements
> for 2TB and 8TB (2.0 vs 1.8, and 6.6 vs 5.5).
> But that's an improvement, not a loss.  And we're absolutely
> not going to run out of memory as the scan and copies proceed.
> This is important on these old kernels with minimal memory available.

Does just pointing TMPDIR at a disk meet that purpose?
Is it necessary to add new code?

> > > 3) Raw i/o for the bitmaps, is accomplished by caching the
> > >   bitmap file in a similar way to that of the dump file.
> > > 
> > >   I find that the use of direct i/o is not significantly slower than
> > >   writing through the kernel's page cache.
> > >
> > > 4) The excluding of unused kernel page structures is very
> > >   important for a large memory system.  The kernel otherwise includes
> > >   3.67 million pages of page structures per TB of memory. By contrast
> > >   the rest of the kernel is only about 1 million pages.
> > 
> > According to your and Dave's mails, 4) seems risky and unacceptable
> > for now. I think we need more investigation for this.
> 
> I've been working with Dave on a patch for crash.  It will warn the
> user that certain kmem command options will fail.  But that is
> only relevant to examinations of free memory and user memory, the
> contents of which we're not capturing anyway.
> 
> Number 4), the exclusion of page structures for non-captured
> pages is really the crux of the improvement.
> A linux kernel should not be hugely bigger on a big machine than
> on a small one.  Slightly bigger, yes, because of bigger slab
> caches. 
> But in practice the dumps of big memories are huge, and all
> because of page structures.
> To find the unneeded ones only takes a few seconds, but cuts
> hours off the dumping process.  Without this a customer is just
> not going to allow his very big system to be dumped.

I understand the benefit of this, but I still doubt that this feature
is really required by users; it sounds too aggressive to me.
This is a big patch, so I want to be sure that this feature will
be used in practice. Any comments are welcome.


Thanks
Atsushi Kumagai

> -Cliff
> > 
> > 
> > Thanks
> > Atsushi Kumagai
> > 
> > > Test results are below, for systems of 1TB, 2TB, 8.8TB and 16TB.
> > > (There are no 'old' numbers for 16TB as time and space requirements
> > >  made those effectively useless.)
> > > 
> > > Run times were generally reduced 2-3x, and dump size reduced about 8x.
> > > 
> > > All timings were done using 512M of crashkernel memory.
> > > 
> > >    System memory size
> > >    1TB                     unpatched    patched
> > >      OS: rhel6.4 (does a free pages pass)
> > >      page scan time           1.6min    1.6min
> > >      dump copy time           2.4min     .4min
> > >      total time               4.1min    2.0min
> > >      dump size                 3014M      364M
> > > 
> > >      OS: rhel6.5
> > >      page scan time            .6min     .6min
> > >      dump copy time           2.3min     .5min
> > >      total time               2.9min    1.1min
> > >      dump size                 3011M      423M
> > > 
> > >      OS: sles11sp3 (3.0.93)
> > >      page scan time            .5min     .5min
> > >      dump copy time           2.3min     .5min
> > >      total time               2.8min    1.0min
> > >      dump size                 2950M      350M
> > > 
> > >    2TB
> > >      OS: rhel6.5           (cyclicx3)
> > >      page scan time           2.0min    1.8min
> > >      dump copy time           8.0min    1.5min
> > >      total time              10.0min    3.3min
> > >      dump size                 6141M      835M
> > > 
> > >    8.8TB
> > >      OS: rhel6.5           (cyclicx5)
> > >      page scan time           6.6min    5.5min
> > >      dump copy time          67.8min    6.2min
> > >      total time              74.4min   11.7min
> > >      dump size                 15.8G      2.7G
> > > 
> > >    16TB
> > >      OS: rhel6.4
> > >      page scan time                   125.3min
> > >      dump copy time                    13.2min
> > >      total time                       138.5min
> > >      dump size                            4.0G
> > > 
> > >      OS: rhel6.5
> > >      page scan time                    27.8min
> > >      dump copy time                    13.3min
> > >      total time                        41.1min
> > >      dump size                            4.1G
> > > 
> > > Page scan time is greatly affected by whether or not the
> > > kernel supports mmap of /proc/vmcore.
> > > 
> > > The choice of snappy vs. zlib compression becomes fairly irrelevant
> > > when we can shrink the dump size dramatically.  The above
> > > were done with snappy compression.
> > > 
> > > I am sending my 2 working patches.  
> > > They are kludgy in the sense that they ignore all forms of
> > > kdump except the creation of a disk dump, and all architectures
> > > except x86_64.
> > > But I think they are sufficient to demonstrate the sizable
> > > time, crashkernel space and disk space savings that are possible.
> > > 
> > > _______________________________________________
> > > kexec mailing list
> > > kexec@lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/kexec
> 
> -- 
> Cliff Wickman
> SGI
> cpw@sgi.com
> (651) 683-3824
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2 V2] raw i/o and root device to use less memory
  2013-12-31 23:30 [PATCH 0/2] makedumpfile: for large memories cpw
                   ` (3 preceding siblings ...)
  2014-01-07 10:14 ` HATAYAMA Daisuke
@ 2014-01-10 17:58 ` Cliff Wickman
  2014-01-13  9:58   ` Michael Holzheu
  2014-01-10 18:00 ` [PATCH 2/2 V2] exclude unused vmemmap pages Cliff Wickman
  2014-01-14 11:33 ` [PATCH 0/2] makedumpfile: for large memories HATAYAMA Daisuke
  6 siblings, 1 reply; 16+ messages in thread
From: Cliff Wickman @ 2014-01-10 17:58 UTC (permalink / raw)
  To: kexec; +Cc: d.hatayama, kumagai-atsushi


Version 2 contains only a bug fix for version 1:
- fixes a bug in the writing of the sub-header (in write_kdump_header())


Use O_DIRECT (raw) i/o for the dump and for the bitmaps file, so that writing
to those files does not allocate kernel memory for page cache.

Use the root device for the bitmaps file so that kernel memory is not consumed
for tmpfs.

The pathname for the root device is derived from the path to the dump
directory.

Raw I/O requires well-formed reads and writes: buffers are aligned on 512-byte
boundaries, lseeks are done to 4096-byte boundaries, and transfers are
multiples of 4096 bytes.

One kludge handles the boundary between the part of the file containing
the page descriptors and the last part of the file, containing the page
data.  The data for that boundary area must be assembled into a single page
buffer and written with one write.

Signed-off-by: Cliff Wickman <cpw@sgi.com>

---
 makedumpfile.c |  545 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 makedumpfile.h |   10 -
 print_info.c   |    8 
 3 files changed, 460 insertions(+), 103 deletions(-)

Index: makedumpfile-1.5.5/makedumpfile.c
===================================================================
--- makedumpfile-1.5.5.orig/makedumpfile.c
+++ makedumpfile-1.5.5/makedumpfile.c
@@ -49,6 +49,8 @@ unsigned long long pfn_free;
 unsigned long long pfn_hwpoison;
 
 unsigned long long num_dumped;
+long blocksize;
+static int plenty_of_memory(void);
 
 int retcd = FAILED;	/* return code */
 
@@ -900,10 +902,17 @@ int
 open_dump_file(void)
 {
 	int fd;
-	int open_flags = O_RDWR|O_CREAT|O_TRUNC;
+	int open_flags;
 
+	if (info->flag_rawdump)
+		open_flags = O_RDWR|O_CREAT|O_TRUNC|O_DIRECT;
+	else
+		open_flags = O_RDWR|O_CREAT|O_TRUNC;
+
+#if 0
 	if (!info->flag_force)
 		open_flags |= O_EXCL;
+#endif
 
 	if (info->flag_flatten) {
 		fd = STDOUT_FILENO;
@@ -939,12 +948,35 @@ check_dump_file(const char *path)
 int
 open_dump_bitmap(void)
 {
-	int i, fd;
-	char *tmpname;
+	int i, fd, flags;
+	char *tmpname, *cp;
+	char prefix[100];
+	int len;
 
+	/* note that /tmp is tmpfs, so it uses crash kernel memory */
 	tmpname = getenv("TMPDIR");
-	if (!tmpname)
-		tmpname = "/tmp";
+	if (!tmpname) {
+		/* use the prefix of the dump name   e.g. /mnt//var/.... */
+		if (!strchr(info->name_dumpfile,'v')) {
+			printf("no /var found in name_dumpfile %s\n",
+				info->name_dumpfile);
+			exit(1);
+		} else {
+			cp = strchr(info->name_dumpfile,'v');
+			if (strncmp(cp-1, "/var", 4)) {
+				printf("no /var found in name_dumpfile %s\n",
+					info->name_dumpfile);
+				exit(1);
+			}
+		}
+		len = cp - info->name_dumpfile - 1;
+		strncpy(prefix, info->name_dumpfile, len);
+		if (*(prefix + len - 1) == '/')
+			len -= 1;
+		*(prefix + len) = '\0';
+		tmpname = prefix;
+		strcat(tmpname, "/");
+	}
 
 	if ((info->name_bitmap = (char *)malloc(sizeof(FILENAME_BITMAP) +
 						strlen(tmpname) + 1)) == NULL) {
@@ -953,9 +985,12 @@ open_dump_bitmap(void)
 		return FALSE;
 	}
 	strcpy(info->name_bitmap, tmpname);
-	strcat(info->name_bitmap, "/");
 	strcat(info->name_bitmap, FILENAME_BITMAP);
-	if ((fd = mkstemp(info->name_bitmap)) < 0) {
+	if (info->flag_rawbitmaps)
+		flags = O_RDWR|O_CREAT|O_TRUNC|O_DIRECT;
+	else
+		flags = O_RDWR|O_CREAT|O_TRUNC;
+	if ((fd = open(info->name_bitmap, flags)) < 0) {
 		ERRMSG("Can't open the bitmap file(%s). %s\n",
 		    info->name_bitmap, strerror(errno));
 		return FALSE;
@@ -2860,6 +2895,7 @@ initialize_bitmap_memory(void)
 	struct dump_bitmap *bmp;
 	off_t bitmap_offset;
 	off_t bitmap_len, max_sect_len;
+	char *cp;
 	unsigned long pfn;
 	int i, j;
 	long block_size;
@@ -2881,7 +2917,14 @@ initialize_bitmap_memory(void)
 	bmp->fd        = info->fd_memory;
 	bmp->file_name = info->name_memory;
 	bmp->no_block  = -1;
-	memset(bmp->buf, 0, BUFSIZE_BITMAP);
+	if ((cp = malloc(blocksize + DIRECT_ALIGN)) == NULL) {
+		ERRMSG("Can't allocate memory for the bitmap buffer. %s\n",
+		    strerror(errno));
+		exit(1);
+	}
+	bmp->buf_malloced = cp;
+	bmp->buf = cp - ((unsigned long)cp % DIRECT_ALIGN) + DIRECT_ALIGN;
+	memset(bmp->buf, 0, blocksize);
 	bmp->offset = bitmap_offset + bitmap_len / 2;
 	info->bitmap_memory = bmp;
 
@@ -2893,6 +2936,7 @@ initialize_bitmap_memory(void)
 	if (info->valid_pages == NULL) {
 		ERRMSG("Can't allocate memory for the valid_pages. %s\n",
 		    strerror(errno));
+		free(bmp->buf_malloced);
 		free(bmp);
 		return FALSE;
 	}
@@ -3075,9 +3119,9 @@ out:
 			unsigned long long free_memory;
 
 			/*
-                        * The buffer size is specified as Kbyte with
-                        * --cyclic-buffer <size> option.
-                        */
+			 * The buffer size is specified as Kbyte with
+			 * --cyclic-buffer <size> option.
+			 */
 			info->bufsize_cyclic <<= 10;
 
 			/*
@@ -3190,7 +3234,7 @@ out:
 		DEBUG_MSG("The kernel doesn't support mmap(),");
 		DEBUG_MSG("read() will be used instead.\n");
 		info->flag_usemmap = MMAP_DISABLE;
-        }
+	}
 
 	return TRUE;
 }
@@ -3198,9 +3242,18 @@ out:
 void
 initialize_bitmap(struct dump_bitmap *bitmap)
 {
+	char *cp;
+
 	bitmap->fd        = info->fd_bitmap;
 	bitmap->file_name = info->name_bitmap;
 	bitmap->no_block  = -1;
+	if ((cp = malloc(blocksize + DIRECT_ALIGN)) == NULL) {
+		ERRMSG("Can't allocate memory for the bitmap buffer. %s\n",
+		    strerror(errno));
+		exit(1);
+	}
+	bitmap->buf_malloced = cp;
+	bitmap->buf = cp - ((unsigned long)cp % DIRECT_ALIGN) + DIRECT_ALIGN;
 	memset(bitmap->buf, 0, BUFSIZE_BITMAP);
 }
 
@@ -3266,9 +3319,9 @@ set_bitmap(struct dump_bitmap *bitmap, u
 	byte = (pfn%PFN_BUFBITMAP)>>3;
 	bit  = (pfn%PFN_BUFBITMAP) & 7;
 	if (val)
-		bitmap->buf[byte] |= 1<<bit;
+		*(bitmap->buf + byte) |= 1<<bit;
 	else
-		bitmap->buf[byte] &= ~(1<<bit);
+		*(bitmap->buf + byte) &= ~(1<<bit);
 
 	return TRUE;
 }
@@ -3444,6 +3497,29 @@ read_cache(struct cache_data *cd)
 	return TRUE;
 }
 
+void
+fill_to_offset(struct cache_data *cd, int blocksize)
+{
+	off_t current;
+	long num_blocks;
+	long i;
+
+	current = lseek(cd->fd, 0, SEEK_CUR);
+	if ((cd->offset - current) % blocksize) {
+		printf("ERROR: fill area is %#lx\n", cd->offset - current);
+		exit(1);
+	}
+	if (cd->cache_size < blocksize) {
+		printf("ERROR: cache buf is only %ld\n", cd->cache_size);
+		exit(1);
+	}
+	num_blocks = (cd->offset - current) / blocksize;
+	for (i = 0; i < num_blocks; i++) {
+		write(cd->fd, cd->buf, blocksize);
+	}
+	return;
+}
+
 int
 is_bigendian(void)
 {
@@ -3511,8 +3587,30 @@ write_buffer(int fd, off_t offset, void 
 }
 
 int
+seek_cache(struct cache_data *cd, off_t offset)
+{
+	const off_t failed = (off_t)-1;
+
+	if (lseek(cd->fd, offset, SEEK_SET) == failed) {
+		ERRMSG("Can't seek the dump file(%s). %s\n",
+		    cd->file_name, strerror(errno));
+		return FALSE;
+	}
+	cd->offset = offset;
+	return TRUE;
+}
+
+int
 write_cache(struct cache_data *cd, void *buf, size_t size)
 {
+	/* sanity check; do not overflow this buffer */
+	/* (it is of cd->cache_size + info->page_size) */
+	if (size > ((cd->cache_size - cd->buf_size) + info->page_size)) {
+		fprintf(stderr, "write_cache buffer overflow! size %#lx\n",
+			size);
+		exit(1);
+	}
+
 	memcpy(cd->buf + cd->buf_size, buf, size);
 	cd->buf_size += size;
 
@@ -3524,7 +3622,8 @@ write_cache(struct cache_data *cd, void 
 		return FALSE;
 
 	cd->buf_size -= cd->cache_size;
-	memcpy(cd->buf, cd->buf + cd->cache_size, cd->buf_size);
+	if (cd->buf_size)
+		memcpy(cd->buf, cd->buf + cd->cache_size, cd->buf_size);
 	cd->offset += cd->cache_size;
 	return TRUE;
 }
@@ -3556,6 +3655,21 @@ write_cache_zero(struct cache_data *cd, 
 	return write_cache_bufsz(cd);
 }
 
+/* flush the full cache to the file */
+int
+write_cache_flush(struct cache_data *cd)
+{
+	if (cd->buf_size == 0)
+		return TRUE;
+	if (cd->buf_size < cd->cache_size) {
+		memset(cd->buf + cd->buf_size, 0, cd->cache_size - cd->buf_size);
+	}
+	cd->buf_size = cd->cache_size;
+	if (!write_cache_bufsz(cd))
+		return FALSE;
+	return TRUE;
+}
+
 int
 read_buf_from_stdin(void *buf, int buf_size)
 {
@@ -4332,11 +4446,19 @@ create_1st_bitmap(void)
 {
 	int i;
 	unsigned int num_pt_loads = get_num_pt_loads();
- 	char buf[info->page_size];
+	char *buf;
 	unsigned long long pfn, pfn_start, pfn_end, pfn_bitmap1;
 	unsigned long long phys_start, phys_end;
 	struct timeval tv_start;
 	off_t offset_page;
+	char *cp;
+
+	if ((cp = malloc(blocksize + DIRECT_ALIGN)) == NULL) {
+		ERRMSG("Can't allocate memory for the bitmap buffer. %s\n",
+		    strerror(errno));
+		exit(1);
+	}
+	buf = cp - ((unsigned long)cp % DIRECT_ALIGN) + DIRECT_ALIGN;
 
 	if (info->flag_refiltering)
 		return copy_1st_bitmap_from_memory();
@@ -4347,7 +4469,7 @@ create_1st_bitmap(void)
 	/*
 	 * At first, clear all the bits on the 1st-bitmap.
 	 */
-	memset(buf, 0, sizeof(buf));
+	memset(buf, 0, blocksize);
 
 	if (lseek(info->bitmap1->fd, info->bitmap1->offset, SEEK_SET) < 0) {
 		ERRMSG("Can't seek the bitmap(%s). %s\n",
@@ -4796,8 +4918,16 @@ int
 copy_bitmap(void)
 {
 	off_t offset;
-	unsigned char buf[info->page_size];
- 	const off_t failed = (off_t)-1;
+	unsigned char *buf;
+	unsigned char *cp;
+	const off_t failed = (off_t)-1;
+
+	if ((cp = malloc(blocksize + DIRECT_ALIGN)) == NULL) {
+		ERRMSG("Can't allocate memory for the bitmap buffer. %s\n",
+		    strerror(errno));
+		exit(1);
+	}
+	buf = cp - ((unsigned long)cp % DIRECT_ALIGN) + DIRECT_ALIGN;
 
 	offset = 0;
 	while (offset < (info->len_bitmap / 2)) {
@@ -4807,7 +4937,7 @@ copy_bitmap(void)
 			    info->name_bitmap, strerror(errno));
 			return FALSE;
 		}
-		if (read(info->bitmap1->fd, buf, sizeof(buf)) != sizeof(buf)) {
+		if (read(info->bitmap1->fd, buf, blocksize) != blocksize) {
 			ERRMSG("Can't read the dump memory(%s). %s\n",
 			    info->name_memory, strerror(errno));
 			return FALSE;
@@ -4818,12 +4948,12 @@ copy_bitmap(void)
 			    info->name_bitmap, strerror(errno));
 			return FALSE;
 		}
-		if (write(info->bitmap2->fd, buf, sizeof(buf)) != sizeof(buf)) {
+		if (write(info->bitmap2->fd, buf, blocksize) != blocksize) {
 			ERRMSG("Can't write the bitmap(%s). %s\n",
 		    	info->name_bitmap, strerror(errno));
 			return FALSE;
 		}
-		offset += sizeof(buf);
+		offset += blocksize;
 	}
 
 	return TRUE;
@@ -5013,7 +5143,8 @@ void
 free_bitmap1_buffer(void)
 {
 	if (info->bitmap1) {
-		free(info->bitmap1);
+		if (info->bitmap1->buf_malloced)
+			free(info->bitmap1->buf_malloced);
 		info->bitmap1 = NULL;
 	}
 }
@@ -5022,7 +5153,8 @@ void
 free_bitmap2_buffer(void)
 {
 	if (info->bitmap2) {
-		free(info->bitmap2);
+		if (info->bitmap2->buf_malloced)
+			free(info->bitmap2->buf_malloced);
 		info->bitmap2 = NULL;
 	}
 }
@@ -5030,8 +5162,18 @@ free_bitmap2_buffer(void)
 void
 free_bitmap_buffer(void)
 {
-	free_bitmap1_buffer();
-	free_bitmap2_buffer();
+	if (info->bitmap1) {
+		if (info->bitmap1->buf_malloced)
+			free(info->bitmap1->buf_malloced);
+		free(info->bitmap1);
+		info->bitmap1 = NULL;
+	}
+	if (info->bitmap2) {
+		if (info->bitmap2->buf_malloced)
+			free(info->bitmap2->buf_malloced);
+		free(info->bitmap2);
+		info->bitmap2 = NULL;
+	}
 }
 
 int
@@ -5058,7 +5200,6 @@ create_dump_bitmap(void)
 	} else {
 		if (!prepare_bitmap_buffer())
 			goto out;
-
 		if (!create_1st_bitmap())
 			goto out;
 
@@ -5130,25 +5271,31 @@ get_loads_dumpfile(void)
 int
 prepare_cache_data(struct cache_data *cd)
 {
+	char *cp;
+
 	cd->fd         = info->fd_dumpfile;
 	cd->file_name  = info->name_dumpfile;
 	cd->cache_size = info->page_size << info->block_order;
 	cd->buf_size   = 0;
 	cd->buf        = NULL;
 
-	if ((cd->buf = malloc(cd->cache_size + info->page_size)) == NULL) {
+	if ((cp = malloc(cd->cache_size + info->page_size + DIRECT_ALIGN)) == NULL) {
 		ERRMSG("Can't allocate memory for the data buffer. %s\n",
 		    strerror(errno));
 		return FALSE;
 	}
+	cd->buf_malloced = cp;
+	cd->buf = cp - ((unsigned long)cp % DIRECT_ALIGN) + DIRECT_ALIGN;
 	return TRUE;
 }
 
 void
 free_cache_data(struct cache_data *cd)
 {
-	free(cd->buf);
+	if (cd->buf_malloced)
+		free(cd->buf_malloced);
 	cd->buf = NULL;
+	cd->buf_malloced = NULL;
 }
 
 int
@@ -5397,19 +5544,21 @@ out:
 }
 
 int
-write_kdump_header(void)
+write_kdump_header(struct cache_data *cd)
 {
 	int ret = FALSE;
 	size_t size;
 	off_t offset_note, offset_vmcoreinfo;
-	unsigned long size_note, size_vmcoreinfo;
+	unsigned long size_note, size_vmcoreinfo, remaining_size_note;
+	unsigned long write_size, room;
 	struct disk_dump_header *dh = info->dump_header;
 	struct kdump_sub_header kh;
-	char *buf = NULL;
+	char *buf = NULL, *cp;
 
 	if (info->flag_elf_dumpfile)
 		return FALSE;
 
+	/* uses reads of /proc/vmcore */
 	get_pt_note(&offset_note, &size_note);
 
 	/*
@@ -5426,6 +5575,7 @@ write_kdump_header(void)
 	dh->bitmap_blocks  = divideup(info->len_bitmap, dh->block_size);
 	memcpy(&dh->timestamp, &info->timestamp, sizeof(dh->timestamp));
 	memcpy(&dh->utsname, &info->system_utsname, sizeof(dh->utsname));
+	blocksize = dh->block_size;
 	if (info->flag_compress & DUMP_DH_COMPRESSED_ZLIB)
 		dh->status |= DUMP_DH_COMPRESSED_ZLIB;
 #ifdef USELZO
@@ -5438,7 +5588,7 @@ write_kdump_header(void)
 #endif
 
 	size = sizeof(struct disk_dump_header);
-	if (!write_buffer(info->fd_dumpfile, 0, dh, size, info->name_dumpfile))
+	if (!write_cache(cd, dh, blocksize))
 		return FALSE;
 
 	/*
@@ -5463,7 +5613,18 @@ write_kdump_header(void)
 		kh.start_pfn_64 = info->split_start_pfn;
 		kh.end_pfn_64   = info->split_end_pfn;
 	}
-	if (has_pt_note()) {
+
+	/* position the cache to the block boundary for the subheader */
+	if (!write_cache_flush(cd))
+		goto out;
+	if (!seek_cache(cd, DISKDUMP_HEADER_BLOCKS * dh->block_size))
+		goto out;
+
+	if (!has_pt_note()) {
+		/* no notes, just the subheader */
+		if (!write_cache(cd, &kh, size))
+			goto out;
+	} else {
 		/*
 		 * Write ELF note section
 		 */
@@ -5494,27 +5655,47 @@ write_kdump_header(void)
 				goto out;
 		}
 
-		if (!write_buffer(info->fd_dumpfile, kh.offset_note, buf,
-		    kh.size_note, info->name_dumpfile))
-			goto out;
-
 		if (has_vmcoreinfo()) {
 			get_vmcoreinfo(&offset_vmcoreinfo, &size_vmcoreinfo);
 			/*
-			 * Set vmcoreinfo data
+			 * Set vmcoreinfo data information.
 			 *
 			 * NOTE: ELF note section contains vmcoreinfo data, and
 			 *       kh.offset_vmcoreinfo points the vmcoreinfo data.
+			 *
+			 * The vmcoreinfo is typically the ending portion
+			 * of the note data.
 			 */
 			kh.offset_vmcoreinfo
 			    = offset_vmcoreinfo - offset_note
 			      + kh.offset_note;
 			kh.size_vmcoreinfo = size_vmcoreinfo;
 		}
+
+		/* write the completed subheader structure kh */
+		if (!write_cache(cd, &kh, size))
+			goto out;
+
+		/*
+		 * Now the note buffer, after the subheader.
+		 * The note may be huge, so do this in a loop to not
+		 * overflow the cache.
+		 */
+		remaining_size_note = kh.size_note;
+		cp = buf;
+		do {
+			room = cd->cache_size - cd->buf_size;
+			if (remaining_size_note > room)
+				write_size = room;
+			else
+				write_size = remaining_size_note;
+			if (!write_cache(cd, cp, write_size))
+				goto out;
+			remaining_size_note -= write_size;
+			cp += write_size;
+		} while (remaining_size_note);
+
 	}
-	if (!write_buffer(info->fd_dumpfile, dh->block_size, &kh,
-	    size, info->name_dumpfile))
-		goto out;
 
 	info->sub_header = kh;
 	info->offset_bitmap1
@@ -6110,13 +6291,15 @@ write_elf_pages_cyclic(struct cache_data
 }
 
 int
-write_kdump_pages(struct cache_data *cd_header, struct cache_data *cd_page)
+write_kdump_pages(struct cache_data *cd_descs, struct cache_data *cd_page)
 {
- 	unsigned long long pfn, per, num_dumpable;
+	unsigned long long pfn, per, num_dumpable;
 	unsigned long long start_pfn, end_pfn;
 	unsigned long size_out;
+	long prefix;
 	struct page_desc pd, pd_zero;
 	off_t offset_data = 0;
+	off_t initial_offset_data;
 	struct disk_dump_header *dh = info->dump_header;
 	unsigned char buf[info->page_size], *buf_out = NULL;
 	unsigned long len_buf_out;
@@ -6124,8 +6307,12 @@ write_kdump_pages(struct cache_data *cd_
 	struct timeval tv_start;
 	const off_t failed = (off_t)-1;
 	unsigned long len_buf_out_zlib, len_buf_out_lzo, len_buf_out_snappy;
+	int saved_bytes = 0;
+	int cpysize;
+	char *save_block1, *save_block_cur, *save_block2;
 
 	int ret = FALSE;
+	int status;
 
 	if (info->flag_elf_dumpfile)
 		return FALSE;
@@ -6166,13 +6353,41 @@ write_kdump_pages(struct cache_data *cd_
 	per = num_dumpable / 10000;
 
 	/*
-	 * Calculate the offset of the page data.
+	 * Calculate the offset of the page_desc's and page data.
 	 */
-	cd_header->offset
+	cd_descs->offset
 	    = (DISKDUMP_HEADER_BLOCKS + dh->sub_hdr_size + dh->bitmap_blocks)
 		* dh->block_size;
-	cd_page->offset = cd_header->offset + sizeof(page_desc_t)*num_dumpable;
-	offset_data  = cd_page->offset;
+	/* this is already a pagesize multiple, so well-formed for i/o */
+
+	cd_page->offset = cd_descs->offset + (sizeof(page_desc_t) * num_dumpable);
+	offset_data = cd_page->offset;
+
+	/* for i/o, round this page data offset down to a block boundary */
+	prefix = cd_page->offset % blocksize;
+	cd_page->offset -= prefix;
+	initial_offset_data = cd_page->offset;
+	cd_page->buf_size = prefix;
+	memset(cd_page->buf, 0, prefix);
+
+	fill_to_offset(cd_descs, blocksize);
+
+	if ((save_block1 = malloc(blocksize * 2)) == NULL) {
+		ERRMSG("Can't allocate memory for save block. %s\n",
+		       strerror(errno));
+		goto out;
+	}
+	/* put on block address boundary for well-rounded i/o */
+	save_block1 += (blocksize - (unsigned long)save_block1 % blocksize);
+	save_block_cur = save_block1 + prefix;
+	saved_bytes += prefix;
+	if ((save_block2 = malloc(blocksize + DIRECT_ALIGN)) == NULL) {
+		ERRMSG("Can't allocate memory for save block2. %s\n",
+		       strerror(errno));
+		goto out;
+	}
+	/* put on block address boundary for well-rounded i/o */
+	save_block2 += (DIRECT_ALIGN - (unsigned long)save_block2 % DIRECT_ALIGN);
 
 	/*
 	 * Set a fileoffset of Physical Address 0x0.
@@ -6196,6 +6411,14 @@ write_kdump_pages(struct cache_data *cd_
 		memset(buf, 0, pd_zero.size);
 		if (!write_cache(cd_page, buf, pd_zero.size))
 			goto out;
+
+		cpysize = pd_zero.size;
+		if ((saved_bytes + cpysize) > blocksize)
+			cpysize = blocksize - saved_bytes;
+		memcpy(save_block_cur, buf, cpysize);
+		saved_bytes += cpysize;
+		save_block_cur += cpysize;
+
 		offset_data  += pd_zero.size;
 	}
 	if (info->flag_split) {
@@ -6229,7 +6452,7 @@ write_kdump_pages(struct cache_data *cd_
 		 */
 		if ((info->dump_level & DL_EXCLUDE_ZERO)
 		    && is_zero_page(buf, info->page_size)) {
-			if (!write_cache(cd_header, &pd_zero, sizeof(page_desc_t)))
+			if (!write_cache(cd_descs, &pd_zero, sizeof(page_desc_t)))
 				goto out;
 			pfn_zero++;
 			continue;
@@ -6280,24 +6503,70 @@ write_kdump_pages(struct cache_data *cd_
 		/*
 		 * Write the page header.
 		 */
-		if (!write_cache(cd_header, &pd, sizeof(page_desc_t)))
+		if (!write_cache(cd_descs, &pd, sizeof(page_desc_t))) {
+			PROGRESS_MSG(
+				"makedumpfile: write error on page header; dump incomplete\n");
 			goto out;
+		}
 
 		/*
 		 * Write the page data.
 		 */
+		/* kludge: save the partial block where page desc's and data overlap */
+		/* (this is the second part of the full block (save_block) where
+		    they overlap) */
+		if (saved_bytes < blocksize) {
+			memcpy(save_block_cur, buf, pd.size);
+			saved_bytes += pd.size;
+			save_block_cur += pd.size;
+		}
 		if (!write_cache(cd_page, buf, pd.size))
 			goto out;
 	}
 
 	/*
-	 * Write the remainder.
+	 * Write the remainder (well-formed blocks)
 	 */
-	if (!write_cache_bufsz(cd_page))
+	/* adjust the cd_descs to write out only full blocks beyond the
+	   data in the buffer */
+	if (cd_descs->buf_size % blocksize) {
+		cd_descs->buf_size +=
+			(blocksize - (cd_descs->buf_size % blocksize));
+		cd_descs->cache_size = cd_descs->buf_size;
+	}
+	if (!write_cache_flush(cd_descs))
 		goto out;
-	if (!write_cache_bufsz(cd_header))
+
+	/*
+	 * kludge: the page data will overwrite the last block of the page_desc's,
+	 * so re-construct a block from:
+	 *   the last block of the page_desc's (length 'prefix') (will read into
+	 *   save_block2) and the end (4096-prefix) of the page data we saved in
+	 *   save_block1.
+	 */
+	if (!write_cache_flush(cd_page))
 		goto out;
 
+	if (lseek(cd_page->fd, initial_offset_data, SEEK_SET) == failed) {
+		printf("kludge: seek to %#lx, fd %d failed errno %d\n",
+			initial_offset_data, cd_page->fd, errno);
+		exit(1);
+	}
+	if (read(cd_page->fd, save_block2, blocksize) != blocksize) {
+		printf("kludge: read block2 failed\n");
+		exit(1);
+	}
+	/* combine the overlapping parts into save_block1 */
+	memcpy(save_block1, save_block2, prefix);
+
+	if (lseek(cd_page->fd, initial_offset_data, SEEK_SET) == failed) {
+		printf("kludge: seek to %#lx, fd %d failed errno %d\n",
+			initial_offset_data, cd_page->fd, errno);
+		exit(1);
+	}
+	status = write(cd_page->fd, save_block1, blocksize);
+	/* end of kludged block */
+
 	/*
 	 * print [100 %]
 	 */
@@ -6307,8 +6576,6 @@ write_kdump_pages(struct cache_data *cd_
 
 	ret = TRUE;
 out:
-	if (buf_out != NULL)
-		free(buf_out);
 #ifdef USELZO
 	if (wrkmem != NULL)
 		free(wrkmem);
@@ -6456,18 +6723,18 @@ write_kdump_pages_cyclic(struct cache_da
 		pd.offset     = *offset_data;
 		*offset_data  += pd.size;
 
-                /*
-                 * Write the page header.
-                 */
-                if (!write_cache(cd_header, &pd, sizeof(page_desc_t)))
-                        goto out;
-
-                /*
-                 * Write the page data.
-                 */
-                if (!write_cache(cd_page, buf, pd.size))
-                        goto out;
-        }
+		/*
+		 * Write the page header.
+		 */
+		if (!write_cache(cd_header, &pd, sizeof(page_desc_t)))
+			goto out;
+
+		/*
+		 * Write the page data.
+		 */
+		if (!write_cache(cd_page, buf, pd.size))
+			goto out;
+	}
 
 	ret = TRUE;
 out:
@@ -6704,50 +6971,48 @@ write_kdump_eraseinfo(struct cache_data 
 }
 
 int
-write_kdump_bitmap(void)
+write_kdump_bitmap(struct cache_data *cd)
 {
 	struct cache_data bm;
 	long long buf_size;
-	off_t offset;
+	long write_size;
 
 	int ret = FALSE;
 
 	if (info->flag_elf_dumpfile)
 		return FALSE;
 
+	/* set up to read bit map file in big blocks from the start */
 	bm.fd        = info->fd_bitmap;
 	bm.file_name = info->name_bitmap;
 	bm.offset    = 0;
-	bm.buf       = NULL;
-
-	if ((bm.buf = calloc(1, BUFSIZE_BITMAP)) == NULL) {
-		ERRMSG("Can't allocate memory for dump bitmap buffer. %s\n",
-		    strerror(errno));
-		goto out;
+	bm.cache_size = cd->cache_size;
+	bm.buf = cd->buf; /* use the bitmap cd */
+	/* using the dumpfile cd_bitmap buffer and fd */
+	if (lseek(cd->fd, info->offset_bitmap1, SEEK_SET) < 0) {
+		ERRMSG("Can't seek the dump file(%s). %s\n",
+		       info->name_memory, strerror(errno));
+		return FALSE;
 	}
-	offset = info->offset_bitmap1;
 	buf_size = info->len_bitmap;
 
 	while (buf_size > 0) {
-		if (buf_size >= BUFSIZE_BITMAP)
-			bm.cache_size = BUFSIZE_BITMAP;
-		else
-			bm.cache_size = buf_size;
-
 		if(!read_cache(&bm))
 			goto out;
 
-		if (!write_buffer(info->fd_dumpfile, offset,
-		    bm.buf, bm.cache_size, info->name_dumpfile))
-			goto out;
-
-		offset += bm.cache_size;
-		buf_size -= BUFSIZE_BITMAP;
+		write_size = cd->cache_size;
+		if (buf_size < cd->cache_size) {
+			write_size = buf_size;
+		}
+		if (write(cd->fd, cd->buf, write_size) != write_size) {
+			ERRMSG("Can't write a destination file. %s\n",
+				strerror(errno));
+			exit(1);
+		}
+		buf_size -= bm.cache_size;
 	}
 	ret = TRUE;
 out:
-	if (bm.buf != NULL)
-		free(bm.buf);
 
 	return ret;
 }
@@ -6756,7 +7021,7 @@ int
 write_kdump_bitmap1_cyclic(void)
 {
 	off_t offset;
-        int increment;
+	int increment;
 	int ret = FALSE;
 
 	increment = divideup(info->cyclic_end_pfn - info->cyclic_start_pfn, BITPERBYTE);
@@ -6875,14 +7140,14 @@ write_kdump_pages_and_bitmap_cyclic(stru
 			continue;
 
 		if (!update_cyclic_region(pfn))
-                        return FALSE;
+			return FALSE;
 
 		if (!write_kdump_pages_cyclic(cd_header, cd_page, &pd_zero, &offset_data))
 			return FALSE;
 
 		if (!write_kdump_bitmap2_cyclic())
 			return FALSE;
-        }
+	}
 
 	/*
 	 * Write the remainder.
@@ -7799,7 +8064,7 @@ int
 writeout_dumpfile(void)
 {
 	int ret = FALSE;
-	struct cache_data cd_header, cd_page;
+	struct cache_data cd_header, cd_page_descs, cd_page, cd_bitmap;
 
 	info->flag_nospace = FALSE;
 
@@ -7812,11 +8077,20 @@ writeout_dumpfile(void)
 	}
 	if (!prepare_cache_data(&cd_header))
 		return FALSE;
+	cd_header.offset = 0;
 
 	if (!prepare_cache_data(&cd_page)) {
 		free_cache_data(&cd_header);
 		return FALSE;
 	}
+	if (!prepare_cache_data(&cd_page_descs)) {
+		free_cache_data(&cd_header);
+		free_cache_data(&cd_page);
+		return FALSE;
+	}
+	if (!prepare_cache_data(&cd_bitmap))
+		return FALSE;
+
 	if (info->flag_elf_dumpfile) {
 		if (!write_elf_header(&cd_header))
 			goto out;
@@ -7830,20 +8104,35 @@ writeout_dumpfile(void)
 		if (!write_elf_eraseinfo(&cd_header))
 			goto out;
 	} else if (info->flag_cyclic) {
-		if (!write_kdump_header())
+		if (!write_kdump_header(&cd_header))
 			goto out;
 		if (!write_kdump_pages_and_bitmap_cyclic(&cd_header, &cd_page))
 			goto out;
 		if (!write_kdump_eraseinfo(&cd_page))
 			goto out;
 	} else {
-		if (!write_kdump_header())
+
+		/*
+		 * Use cd_header for the caching operation up to the bit map.
+		 * Use cd_bitmap for 1-block (4096) operations on the bit map.
+		 * (it fits between the file header and page_desc's, both of
+		 *  which end and start on block boundaries)
+		 * Then use cd_page_descs and cd_page for page headers and
+		 * data (and eraseinfo).
+		 * Then back to cd_header to fill in the bitmap.
+		 */
+
+		if (!write_kdump_header(&cd_header))
 			goto out;
-		if (!write_kdump_pages(&cd_header, &cd_page))
+		write_cache_flush(&cd_header);
+
+		if (!write_kdump_pages(&cd_page_descs, &cd_page))
 			goto out;
 		if (!write_kdump_eraseinfo(&cd_page))
 			goto out;
-		if (!write_kdump_bitmap())
+
+		cd_bitmap.offset = info->offset_bitmap1;
+		if (!write_kdump_bitmap(&cd_bitmap))
 			goto out;
 	}
 	if (info->flag_flatten) {
@@ -7883,7 +8172,7 @@ setup_splitting(void)
 		}
 		if (SPLITTING_END_PFN(i-1) > info->max_mapnr)
 			SPLITTING_END_PFN(i-1) = info->max_mapnr;
-        } else {
+	} else {
 		initialize_2nd_bitmap(&bitmap2);
 
 		pfn_per_dumpfile = num_dumpable / info->num_dumpfile;
@@ -8005,11 +8294,43 @@ create_dumpfile(void)
 		if (!get_elf_info(info->fd_memory, info->name_memory))
 			return FALSE;
 	}
+	blocksize = info->page_size;
+	if (!blocksize)
+		blocksize = sysconf(_SC_PAGE_SIZE);
 	if (!initial())
 		return FALSE;
 
 	print_vtop();
 
+	if (info->flag_rawdump)
+		PROGRESS_MSG("Using O_DIRECT i/o for dump.\n");
+	if (info->flag_rawbitmaps)
+		PROGRESS_MSG("Using O_DIRECT i/o for bitmap.\n");
+	if (plenty_of_memory()) {
+		PROGRESS_MSG("Plenty of memory.\n");
+		info->flag_cyclic = FALSE;
+		if (!info->flag_rawdump)
+			PROGRESS_MSG("Using page cache for bitmap file.\n");
+		if (!info->flag_rawbitmaps)
+			PROGRESS_MSG("Using page cache for dump file.\n");
+	} else {
+		/* memory is restricted; solution is direct i/o */
+		if (!info->flag_rawdump) {
+			info->flag_rawdump = 1;
+			PROGRESS_MSG(
+			"Restricted memory; switching to O_DIRECT i/o for dump.\n");
+		}
+		if (!info->flag_rawbitmaps) {
+			info->flag_rawbitmaps = 1;
+			PROGRESS_MSG(
+			"Restricted memory; switching to O_DIRECT i/o for bitmap.\n");
+		}
+	}
+
+	if (info->flag_cyclic == FALSE) {
+		PROGRESS_MSG("Using non-cyclic mode.\n");
+	}
+
 	num_retry = 0;
 retry:
 	if (info->flag_refiltering) {
@@ -8045,11 +8366,11 @@ retry:
 		 */
 		num_retry++;
 		if ((info->dump_level = get_next_dump_level(num_retry)) < 0)
- 			return FALSE;
+			return FALSE;
 		MSG("Retry to create a dumpfile by dump_level(%d).\n",
 		    info->dump_level);
 		if (!delete_dumpfile())
- 			return FALSE;
+			return FALSE;
 		goto retry;
 	}
 	print_report();
@@ -8911,6 +9232,22 @@ out:
 	return free_size;
 }
 
+/*
+ * Plenty of memory to do a non-cyclic dump.
+ * Default to non-cyclic in this case.
+ */
+static int
+plenty_of_memory(void)
+{
+	unsigned long free_size;
+	unsigned long needed_size;
+
+	free_size = get_free_memory_size();
+	needed_size = (info->max_mapnr * 2) / BITPERBYTE;
+	if (free_size > (needed_size + (10*1024*1024)))
+		return 1;
+	return 0;
+}
 
 /*
  * Choose the lesser value of the two below as the size of cyclic buffer.
@@ -8997,7 +9334,7 @@ main(int argc, char *argv[])
 
 	info->block_order = DEFAULT_ORDER;
 	message_level = DEFAULT_MSG_LEVEL;
-	while ((opt = getopt_long(argc, argv, "b:cDd:EFfg:hi:lpRvXx:", longopts,
+	while ((opt = getopt_long(argc, argv, "b:cDd:EFfg:hi:jJlpRvXx:", longopts,
 	    NULL)) != -1) {
 		switch (opt) {
 		case OPT_BLOCK_ORDER:
@@ -9041,6 +9378,12 @@ main(int argc, char *argv[])
 			info->flag_read_vmcoreinfo = 1;
 			info->name_vmcoreinfo = optarg;
 			break;
+		case OPT_RAWDUMP:
+			info->flag_rawdump = 1;
+			break;
+		case OPT_RAWBITMAPS:
+			info->flag_rawbitmaps = 1;
+			break;
 		case OPT_DISKSET:
 			if (!sadump_add_diskset_info(optarg))
 				goto out;
Index: makedumpfile-1.5.5/makedumpfile.h
===================================================================
--- makedumpfile-1.5.5.orig/makedumpfile.h
+++ makedumpfile-1.5.5/makedumpfile.h
@@ -18,6 +18,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#define __USE_GNU
 #include <fcntl.h>
 #include <gelf.h>
 #include <sys/stat.h>
@@ -215,6 +216,7 @@ isAnon(unsigned long mapping)
 #define FILENAME_BITMAP		"kdump_bitmapXXXXXX"
 #define FILENAME_STDOUT		"STDOUT"
 #define MAP_REGION		(4096*1024)
+#define DIRECT_ALIGN		(512)
 
 /*
  * Minimam vmcore has 2 ProgramHeaderTables(PT_NOTE and PT_LOAD).
@@ -822,7 +824,8 @@ struct dump_bitmap {
 	int		fd;
 	int		no_block;
 	char		*file_name;
-	char		buf[BUFSIZE_BITMAP];
+	char		*buf;
+	char		*buf_malloced;
 	off_t		offset;
 };
 
@@ -830,6 +833,7 @@ struct cache_data {
 	int	fd;
 	char	*file_name;
 	char	*buf;
+	char	*buf_malloced;
 	size_t	buf_size;
 	size_t	cache_size;
 	off_t	offset;
@@ -911,6 +915,8 @@ struct DumpInfo {
 	int		flag_use_printk_log; /* did we read printk_log symbol name? */
 	int		flag_nospace;	     /* the flag of "No space on device" error */
 	int		flag_vmemmap;        /* kernel supports vmemmap address space */
+	int		flag_rawdump;        /* use raw i/o for the dump file */
+	int		flag_rawbitmaps;     /* use raw i/o for the bitmaps file */
 	unsigned long	vaddr_for_vtop;      /* virtual address for debugging */
 	long		page_size;           /* size of page */
 	long		page_shift;
@@ -1729,6 +1735,8 @@ struct elf_prstatus {
 #define OPT_GENERATE_VMCOREINFO 'g'
 #define OPT_HELP                'h'
 #define OPT_READ_VMCOREINFO     'i'
+#define OPT_RAWDUMP             'j'
+#define OPT_RAWBITMAPS          'J'
 #define OPT_COMPRESS_LZO        'l'
 #define OPT_COMPRESS_SNAPPY     'p'
 #define OPT_REARRANGE           'R'
Index: makedumpfile-1.5.5/print_info.c
===================================================================
--- makedumpfile-1.5.5.orig/print_info.c
+++ makedumpfile-1.5.5/print_info.c
@@ -48,7 +48,7 @@ print_usage(void)
 	MSG("\n");
 	MSG("Usage:\n");
 	MSG("  Creating DUMPFILE:\n");
-	MSG("  # makedumpfile    [-c|-l|-E] [-d DL] [-x VMLINUX|-i VMCOREINFO] VMCORE\n");
+	MSG("  # makedumpfile    [-c|-l|-E] [-d DL] [-j] [-J] [-x VMLINUX|-i VMCOREINFO] VMCORE\n");
 	MSG("    DUMPFILE\n");
 	MSG("\n");
 	MSG("  Creating DUMPFILE with filtered kernel data specified through filter config\n");
@@ -95,6 +95,12 @@ print_usage(void)
 	MSG("      -E option, because the ELF format does not support compressed data.\n");
 	MSG("      THIS IS ONLY FOR THE CRASH UTILITY.\n");
 	MSG("\n");
+	MSG("  [-j]:\n");
+	MSG("      Use raw (O_DIRECT) i/o on dump file to avoid expanding kernel pagecache.\n");
+	MSG("\n");
+	MSG("  [-J]:\n");
+	MSG("      Use raw (O_DIRECT) i/o on bitmap file to avoid expanding kernel pagecache.\n");
+	MSG("\n");
 	MSG("  [-d DL]:\n");
 	MSG("      Specify the type of unnecessary page for analysis.\n");
 	MSG("      Pages of the specified type are not copied to DUMPFILE. The page type\n");

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


* [PATCH 2/2 V2] exclude unused vmemmap pages
  2013-12-31 23:30 [PATCH 0/2] makedumpfile: for large memories cpw
                   ` (4 preceding siblings ...)
  2014-01-10 17:58 ` [PATCH 1/2 V2] raw i/o and root device to use less memory Cliff Wickman
@ 2014-01-10 18:00 ` Cliff Wickman
  2014-01-14 11:33 ` [PATCH 0/2] makedumpfile: for large memories HATAYAMA Daisuke
  6 siblings, 0 replies; 16+ messages in thread
From: Cliff Wickman @ 2014-01-10 18:00 UTC (permalink / raw)
  To: kexec; +Cc: d.hatayama, kumagai-atsushi

From: Cliff Wickman <cpw@sgi.com>

Version 2 provides some requested changes:
- remove the automatic exclusion of page structures for memories over 1TB; it
  will only be done by explicit request (-e)
- remove the -N option; no need to explicitly include unused vmemmap pages
  as they will be included by default
- add DUMP_DH_EXCLUDED_VMEMMAP to the dump header; to warn crash users that
  these page structures are excluded
- fix the making of the filename for pfn file (in init_save_control())
It is still tested only on a disk dump of an x86_64 system.

Exclude kernel pages that contain nothing but page structures for pages
that are not being included in the dump.
These can amount to 3.67 million pages per terabyte of system memory!

The kernel's page table, starting at virtual address 0xffffea0000000000, is 
searched to find the actual pages containing the vmemmap page structures.

Bitmap1 is a map of dumpable (i.e. existing) pages. Bitmap2 is a map
of pages not to be excluded.
To speed the search, only whole 64-bit words of 1's in bitmap1 and 0's
in bitmap2 are tested to see whether their vmemmap pages can be dropped.

The list of vmemmap pfn's to be excluded is written to a small file in order
to conserve crash kernel memory.

In practice, this whole procedure only takes about 10 seconds on a
16TB machine.

Signed-off-by: Cliff Wickman <cpw@sgi.com>

---
 diskdump_mod.h |    1 
 makedumpfile.c |  679 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 makedumpfile.h |   60 ++++-
 print_info.c   |    6 
 4 files changed, 742 insertions(+), 4 deletions(-)

Index: makedumpfile-1.5.5/makedumpfile.c
===================================================================
--- makedumpfile-1.5.5.orig/makedumpfile.c
+++ makedumpfile-1.5.5/makedumpfile.c
@@ -31,9 +31,12 @@ struct offset_table	offset_table;
 struct array_table	array_table;
 struct number_table	number_table;
 struct srcfile_table	srcfile_table;
+struct save_control	sc;
 
 struct vm_table		vt = { 0 };
 struct DumpInfo		*info = NULL;
+struct vmap_pfns	*gvmem_pfns;
+int nr_gvmem_pfns;
 
 char filename_stdout[] = FILENAME_STDOUT;
 
@@ -3111,6 +3114,10 @@ out:
 	if (!get_max_mapnr())
 		return FALSE;
 
+	if (info->flag_excludevm) {
+		PROGRESS_MSG("Excluding unused vmemmap page structures.\n");
+	}
+
 	if (info->flag_cyclic) {
 		if (info->bufsize_cyclic == 0) {
 			if (!calculate_cyclic_buffer_size())
@@ -4959,6 +4966,345 @@ copy_bitmap(void)
 	return TRUE;
 }
 
+/*
+ * Given a range of unused pfn's, check whether we can drop the vmemmap pages
+ * that represent them.
+ *  (pfn ranges are literally start and end, not start and end+1)
+ *  See gvmem_pfns, the array of vmemmap pfns and the pfns they represent.
+ * Return 1 for delete, 0 for not to delete.
+ */
+int
+find_vmemmap_pages(unsigned long startpfn, unsigned long endpfn, unsigned long *vmappfn,
+									unsigned long *nmapnpfns)
+{
+	int i;
+	long npfns_offset, vmemmap_offset, vmemmap_pfns, start_vmemmap_pfn;
+	long npages, end_vmemmap_pfn;
+	struct vmap_pfns *vmapp;
+	int pagesize = info->page_size;
+
+	for (i = 0; i < nr_gvmem_pfns; i++) {
+		vmapp = gvmem_pfns + i;
+		if ((startpfn >= vmapp->rep_pfn_start) &&
+		    (endpfn <= vmapp->rep_pfn_end)) {
+			npfns_offset = startpfn - vmapp->rep_pfn_start;
+			vmemmap_offset = npfns_offset * size_table.page;
+			/* round up to a page boundary */
+			if (vmemmap_offset % pagesize)
+				vmemmap_offset += (pagesize - (vmemmap_offset % pagesize));
+			vmemmap_pfns = vmemmap_offset / pagesize;
+			start_vmemmap_pfn = vmapp->vmap_pfn_start + vmemmap_pfns;
+			*vmappfn = start_vmemmap_pfn;
+
+			npfns_offset = endpfn - vmapp->rep_pfn_start;
+			vmemmap_offset = npfns_offset * size_table.page;
+			// round down to page boundary
+			vmemmap_offset -= (vmemmap_offset % pagesize);
+			vmemmap_pfns = vmemmap_offset / pagesize;
+			end_vmemmap_pfn = vmapp->vmap_pfn_start + vmemmap_pfns;
+			npages = end_vmemmap_pfn - start_vmemmap_pfn;
+			if (npages == 0)
+				return 0;
+			*nmapnpfns = npages;
+			return 1;
+		}
+	}
+	return 0;
+}
+
+/*
+ * Finalize the structure for saving pfn's to be deleted.
+ */
+void
+finalize_save_control()
+{
+	free(sc.sc_buf_malloced);
+	close(sc.sc_fd);
+	return;
+}
+
+/*
+ * Reset the structure for saving pfn's to be deleted so that it can be read
+ */
+void
+reset_save_control()
+{
+	int i;
+	if (sc.sc_bufposition == 0)
+		return;
+
+	/* direct i/o, so have to write the whole buffer */
+	i = write(sc.sc_fd, sc.sc_buf, sc.sc_buflen);
+	if (i != sc.sc_buflen) {
+		fprintf(stderr, "reset: Can't write a page to %s\n",
+			sc.sc_filename);
+		exit(1);
+	}
+	sc.sc_filelen += sc.sc_bufposition;
+
+	if (lseek(sc.sc_fd, 0, SEEK_SET) < 0) {
+		fprintf(stderr, "Can't seek the pfn file %s.\n", sc.sc_filename);
+		exit(1);
+	}
+	sc.sc_fileposition = 0;
+	sc.sc_bufposition = sc.sc_buflen; /* trigger 1st read */
+	return;
+}
+
+/*
+ * Initialize the structure for saving pfn's to be deleted.
+ */
+void
+init_save_control()
+{
+	int flags, len;
+	char *filename, *tmpname, *cp;
+
+	filename = malloc(50);
+	*filename = '\0';
+	tmpname = getenv("TMPDIR");
+	if (!tmpname) {
+		/* use the prefix of the dump name   e.g. /mnt//var/.... */
+		if (!strchr(info->name_dumpfile,'v')) {
+			printf("no /var found in name_dumpfile %s\n", info->name_dumpfile);
+			exit(1);
+		} else {
+			cp = strchr(info->name_dumpfile,'v');
+			if (strncmp(cp-1, "/var", 4)) {
+				printf("no /var found in name_dumpfile %s\n",
+					info->name_dumpfile);
+				exit(1);
+			}
+		}
+		len = cp - info->name_dumpfile - 1;
+		strncpy(filename, info->name_dumpfile, len);
+		if (*(filename + len - 1) == '/')
+			len -= 1;
+		*(filename + len) = '\0';
+	} else {
+		strcpy(filename, tmpname);
+	}
+
+	strcat(filename, "/makedumpfilepfns");
+	sc.sc_filename = filename;
+	flags = O_RDWR|O_CREAT|O_TRUNC|O_DIRECT;
+	if ((sc.sc_fd = open(sc.sc_filename, flags,
+							S_IRUSR|S_IWUSR)) < 0) {
+		fprintf(stderr, "Can't open the pfn file %s.\n",
+			sc.sc_filename);
+		exit(1);
+	}
+	unlink(sc.sc_filename);
+
+	sc.sc_buf_malloced = malloc(blocksize + DIRECT_ALIGN);
+	if (!sc.sc_buf_malloced) {
+		fprintf(stderr, "Can't allocate a page for pfn buf.\n");
+		exit(1);
+	}
+	/* round up to the next DIRECT_ALIGN block boundary */
+	sc.sc_buf = sc.sc_buf_malloced -
+	   ((unsigned long)sc.sc_buf_malloced % DIRECT_ALIGN) + DIRECT_ALIGN;
+	sc.sc_buflen = blocksize;
+	sc.sc_bufposition = 0;
+	sc.sc_fileposition = 0;
+	sc.sc_filelen = 0;
+}
+
+/*
+ * Save a starting pfn and number of pfns for later delete from bitmap.
+ */
+void
+save_deletes(unsigned long startpfn, unsigned long numpfns)
+{
+	int i;
+	struct sc_entry *scp;
+
+	if (sc.sc_bufposition == sc.sc_buflen) {
+		i = write(sc.sc_fd, sc.sc_buf, sc.sc_buflen);
+		if (i != sc.sc_buflen) {
+			fprintf(stderr, "save: Can't write a page to %s\n",
+				sc.sc_filename);
+			exit(1);
+		}
+		sc.sc_filelen += sc.sc_buflen;
+		sc.sc_bufposition = 0;
+	}
+	scp = (struct sc_entry *)(sc.sc_buf + sc.sc_bufposition);
+	scp->startpfn = startpfn;
+	scp->numpfns = numpfns;
+	sc.sc_bufposition += sizeof(struct sc_entry);
+}
+
+/*
+ * Get a starting pfn and number of pfns for delete from bitmap.
+ * Return 0 for success, 1 for 'no more'
+ */
+int
+get_deletes(unsigned long *startpfn, unsigned long *numpfns)
+{
+	int i;
+	struct sc_entry *scp;
+
+	if (sc.sc_fileposition >= sc.sc_filelen) {
+		return 1;
+	}
+
+	if (sc.sc_bufposition == sc.sc_buflen) {
+		i = read(sc.sc_fd, sc.sc_buf, sc.sc_buflen);
+		if (i <= 0) {
+			fprintf(stderr, "Can't read a page from %s.\n", sc.sc_filename);
+			exit(1);
+		}
+		sc.sc_bufposition = 0;
+	}
+	scp = (struct sc_entry *)(sc.sc_buf + sc.sc_bufposition);
+	*startpfn = scp->startpfn;
+	*numpfns = scp->numpfns;
+	sc.sc_bufposition += sizeof(struct sc_entry);
+	sc.sc_fileposition += sizeof(struct sc_entry);
+	return 0;
+}
+
+/*
+ * Find the big holes in bitmap2; they represent ranges for which
+ * we do not need page structures.
+ * Bitmap1 is a map of dumpable (i.e. existing) pages.
+ * The holes must cover only pages that exist, so they are 0 bits
+ * in the 2nd bitmap but 1 bits in the 1st bitmap.
+ * For speed, only whole words full of bits are considered.
+ */
+void
+find_unused_vmemmap_pages(void)
+{
+	struct dump_bitmap *bitmap1 = info->bitmap1;
+	struct dump_bitmap *bitmap2 = info->bitmap2;
+	unsigned long long pfn;
+	unsigned long *lp1, *lp2, startpfn, endpfn;
+	unsigned long vmapstartpfn, vmapnumpfns;
+	int i, sz, numpages=0, did_deletes;
+	int startword, numwords, do_break=0;
+	long deleted_pages = 0;
+	off_t new_offset1, new_offset2;
+
+	/* read each block of both bitmaps */
+	for (pfn = 0; pfn < info->max_mapnr; pfn += PFN_BUFBITMAP) { /* size in bits */
+		numpages++;
+		did_deletes = 0;
+		new_offset1 = bitmap1->offset + BUFSIZE_BITMAP * (pfn / PFN_BUFBITMAP);
+		if (lseek(bitmap1->fd, new_offset1, SEEK_SET) < 0 ) {
+			ERRMSG("Can't seek the bitmap(%s). %s\n",
+				bitmap1->file_name, strerror(errno));
+			return;
+		}
+		if (read(bitmap1->fd, bitmap1->buf, BUFSIZE_BITMAP) != BUFSIZE_BITMAP) {
+			ERRMSG("Can't read the bitmap(%s). %s\n",
+				bitmap1->file_name, strerror(errno));
+			return;
+		}
+		bitmap1->no_block = pfn / PFN_BUFBITMAP;
+
+		new_offset2 = bitmap2->offset + BUFSIZE_BITMAP * (pfn / PFN_BUFBITMAP);
+		if (lseek(bitmap2->fd, new_offset2, SEEK_SET) < 0 ) {
+			ERRMSG("Can't seek the bitmap(%s). %s\n",
+				bitmap2->file_name, strerror(errno));
+			return;
+		}
+		if (read(bitmap2->fd, bitmap2->buf, BUFSIZE_BITMAP) != BUFSIZE_BITMAP) {
+			ERRMSG("Can't read the bitmap(%s). %s\n",
+				bitmap2->file_name, strerror(errno));
+			return;
+		}
+		bitmap2->no_block = pfn / PFN_BUFBITMAP;
+
+		/* process this one page of both bitmaps at a time */
+		lp1 = (unsigned long *)bitmap1->buf;
+		lp2 = (unsigned long *)bitmap2->buf;
+		/* sz is words in the block */
+		sz = BUFSIZE_BITMAP / sizeof(unsigned long);
+		startword = -1;
+		for (i = 0; i < sz; i++, lp1++, lp2++) {
+			/* for each whole word in the block */
+			/* deal in full 64-page chunks only */
+			if (*lp1 == 0xffffffffffffffffUL) {
+				if (*lp2 == 0) {
+					/* we are in a series we want */
+					if (startword == -1) {
+						/* starting a new group */
+						startword = i;
+					}
+				} else {
+					/* we hit a used page */
+					if (startword >= 0)
+						do_break = 1;
+				}
+			} else {
+				/* we hit a hole in real memory, or part of one */
+				if (startword >= 0)
+					do_break = 1;
+			}
+			if (do_break) {
+				do_break = 0;
+				if (startword >= 0) {
+					numwords = i - startword;
+					/* 64 bits represent 64 page structs,
+					   which is less than one page of them
+					   (a page holds at least 73) */
+					if (numwords > 1) {
+						startpfn = pfn +
+							(startword * BITS_PER_WORD);
+						/* pfn ranges are literally start and end,
+						   not start and end + 1 */
+						endpfn = startpfn +
+							(numwords * BITS_PER_WORD) - 1;
+						if (find_vmemmap_pages(startpfn, endpfn,
+							&vmapstartpfn, &vmapnumpfns)) {
+							save_deletes(vmapstartpfn,
+								vmapnumpfns);
+							deleted_pages += vmapnumpfns;
+							did_deletes = 1;
+						}
+					}
+				}
+				startword = -1;
+			}
+		}
+		if (startword >= 0) {
+			numwords = i - startword;
+			if (numwords > 1) {
+				startpfn = pfn + (startword * BITS_PER_WORD);
+				/* pfn ranges are literally start and end,
+				   not start and end + 1 */
+				endpfn = startpfn + (numwords * BITS_PER_WORD) - 1;
+				if (find_vmemmap_pages(startpfn, endpfn,
+							&vmapstartpfn, &vmapnumpfns)) {
+					save_deletes(vmapstartpfn, vmapnumpfns);
+					deleted_pages += vmapnumpfns;
+					did_deletes = 1;
+				}
+			}
+		}
+	}
+	PROGRESS_MSG("\nExcluded %ld unused vmemmap pages\n", deleted_pages);
+
+	return;
+}
+
+/*
+ * Retrieve the list of pfn's and delete them from bitmap2;
+ */
+void
+delete_unused_vmemmap_pages(void)
+{
+	unsigned long startpfn, numpfns, pfn, i;
+
+	while (!get_deletes(&startpfn, &numpfns)) {
+		for (i = 0, pfn = startpfn; i < numpfns; i++, pfn++) {
+			clear_bit_on_2nd_bitmap_for_kernel(pfn);
+		}
+	}
+	return;
+}
+
 int
 create_2nd_bitmap(void)
 {
@@ -5030,6 +5376,15 @@ create_2nd_bitmap(void)
 	if (!sync_2nd_bitmap())
 		return FALSE;
 
+	/* -e means exclude vmemmap page structures for unused pages */
+	if (info->flag_excludevm) {
+		init_save_control();
+		find_unused_vmemmap_pages();
+		reset_save_control();
+		delete_unused_vmemmap_pages();
+		finalize_save_control();
+	}
+
 	return TRUE;
 }
 
@@ -5236,7 +5591,7 @@ get_loads_dumpfile(void)
 			continue;
 
 		pfn_start = paddr_to_pfn(load.p_paddr);
-		pfn_end   = paddr_to_pfn(load.p_paddr + load.p_memsz);
+		pfn_end = paddr_to_pfn(load.p_paddr + load.p_memsz);
 		frac_head = page_size - (load.p_paddr % page_size);
 		frac_tail = (load.p_paddr + load.p_memsz) % page_size;
 
@@ -5586,6 +5941,9 @@ write_kdump_header(struct cache_data *cd
 	else if (info->flag_compress & DUMP_DH_COMPRESSED_SNAPPY)
 		dh->status |= DUMP_DH_COMPRESSED_SNAPPY;
 #endif
+	if (info->flag_excludevm) {
+		dh->status |= DUMP_DH_EXCLUDED_VMEMMAP;
+	}
 
 	size = sizeof(struct disk_dump_header);
 	if (!write_cache(cd, dh, blocksize))
@@ -8282,6 +8640,315 @@ writeout_multiple_dumpfiles(void)
 	return ret;
 }
 
+/*
+ * Scan the kernel page table for the pfn's of the page structs
+ * Place them in array gvmem_pfns[nr_gvmem_pfns]
+ */
+void
+find_vmemmap()
+{
+	int i, verbose = 0;
+	int pgd_index, pud_index;
+	int start_range = 1;
+	int num_pmds=0, num_pmds_valid=0;
+	int break_in_valids, break_after_invalids;
+	int do_break, done = 0;
+	int last_valid=0, last_invalid=0;
+	int pagestructsize, structsperhpage, hugepagesize;
+	long page_structs_per_pud;
+	long num_puds, groups = 0;
+	long pgdindex, pudindex, pmdindex;
+	long vaddr, vaddr_base;
+	long rep_pfn_start = 0, rep_pfn_end = 0;
+	unsigned long init_level4_pgt;
+	unsigned long max_paddr, high_pfn;
+	unsigned long pgd_addr, pud_addr, pmd_addr;
+	unsigned long *pgdp, *pudp, *pmdp;
+	unsigned long pud_page[PTRS_PER_PUD];
+	unsigned long pmd_page[PTRS_PER_PMD];
+	unsigned long vmap_offset_start = 0, vmap_offset_end = 0;
+	unsigned long pmd, tpfn;
+	unsigned long pvaddr = 0;
+	unsigned long data_addr = 0, last_data_addr = 0, start_data_addr = 0;
+	/*
+	 * data_addr is the paddr of the page holding the page structs.
+	 * We keep lists of contiguous pages and the pfn's that their
+	 * page structs represent.
+	 *  start_data_addr and last_data_addr mark start/end of those
+	 *  contiguous areas.
+	 * An area descriptor is vmap start/end pfn and rep start/end
+	 *  of the pfn's represented by the vmap start/end.
+	 */
+	struct vmap_pfns *vmapp, *vmaphead = NULL, *cur, *tail;
+
+	init_level4_pgt = SYMBOL(init_level4_pgt);
+	if (init_level4_pgt == NOT_FOUND_SYMBOL) {
+		fprintf(stderr, "init_level4_pgt not found\n");
+		return;
+	}
+	pagestructsize = size_table.page;
+	hugepagesize = PTRS_PER_PMD * info->page_size;
+	vaddr_base = info->vmemmap_start;
+	vaddr = vaddr_base;
+	max_paddr = get_max_paddr();
+	/*
+	 * the page structures are mapped at VMEMMAP_START (info->vmemmap_start)
+	 * for max_paddr >> 12 page structures
+	 */
+	high_pfn = max_paddr >> 12;
+	pgd_index = pgd4_index(vaddr_base);
+	pud_index = pud_index(vaddr_base);
+	pgd_addr = vaddr_to_paddr(init_level4_pgt); /* address of pgd */
+	pgd_addr += pgd_index * sizeof(unsigned long);
+	page_structs_per_pud = (PTRS_PER_PUD * PTRS_PER_PMD * info->page_size) /
+									pagestructsize;
+	num_puds = (high_pfn + page_structs_per_pud - 1) / page_structs_per_pud;
+	pvaddr = VMEMMAP_START;
+	structsperhpage = hugepagesize / pagestructsize;
+
+	/* outer loop is for pud entries in the pgd */
+	for (pgdindex = 0, pgdp = (unsigned long *)pgd_addr; pgdindex < num_puds;
+								pgdindex++, pgdp++) {
+		/* read the pgd one word at a time, into pud_addr */
+		if (!readmem(PADDR, (unsigned long long)pgdp, (void *)&pud_addr,
+								sizeof(unsigned long))) {
+			ERRMSG("Can't get pgd entry for slot %d.\n", pgd_index);
+			return;
+		}
+		/* mask the pgd entry for the address of the pud page */
+		pud_addr &= PMASK;
+		/* read the entire pud page */
+		if (!readmem(PADDR, (unsigned long long)pud_addr, (void *)pud_page,
+					PTRS_PER_PUD * sizeof(unsigned long))) {
+			ERRMSG("Can't get pud entry for pgd slot %ld.\n", pgdindex);
+			return;
+		}
+		/* step thru each pmd address in the pud page */
+		/* pudp points to an entry in the pud page */
+		for (pudp = (unsigned long *)pud_page, pudindex = 0;
+					pudindex < PTRS_PER_PUD; pudindex++, pudp++) {
+			pmd_addr = *pudp & PMASK;
+			/* read the entire pmd page */
+			if (!readmem(PADDR, pmd_addr, (void *)pmd_page,
+					PTRS_PER_PMD * sizeof(unsigned long))) {
+				ERRMSG("Can't get pud entry for slot %ld.\n", pudindex);
+				return;
+			}
+			/* pmdp points to an entry in the pmd */
+			for (pmdp = (unsigned long *)pmd_page, pmdindex = 0;
+					pmdindex < PTRS_PER_PMD; pmdindex++, pmdp++) {
+				/* linear page position in this page table: */
+				pmd = *pmdp;
+				num_pmds++;
+				tpfn = (pvaddr - VMEMMAP_START) /
+							pagestructsize;
+				if (tpfn >= high_pfn) {
+					done = 1;
+					break;
+				}
+				/*
+				 * vmap_offset_start:
+				 * Starting logical position in the
+				 * vmemmap array for the group stays
+				 * constant until a hole in the table
+				 * or a break in contiguousness.
+				 */
+
+				/*
+				 * Ending logical position in the
+				 * vmemmap array:
+				 */
+				vmap_offset_end += hugepagesize;
+				do_break = 0;
+				break_in_valids = 0;
+				break_after_invalids = 0;
+				/*
+				 * We want breaks either when:
+				 * - we hit a hole (invalid)
+				 * - we hit a discontiguous page in a string of valids
+				 */
+				if (pmd) {
+					data_addr = (pmd & PMASK);
+					if (start_range) {
+						/* first-time kludge */
+						start_data_addr = data_addr;
+						last_data_addr = start_data_addr
+							 - hugepagesize;
+						start_range = 0;
+					}
+					if (last_invalid) {
+						/* end of a hole */
+						start_data_addr = data_addr;
+						last_data_addr = start_data_addr
+							 - hugepagesize;
+						/* trigger update of offset */
+						do_break = 1;
+					}
+					last_valid = 1;
+					last_invalid = 0;
+					/*
+					 * we have hit a gap in physical
+					 * contiguity in the table.
+					 */
+					/* ?? consecutive holes will have
+					   same data_addr */
+					if (data_addr !=
+						last_data_addr + hugepagesize) {
+						do_break = 1;
+						break_in_valids = 1;
+					}
+					if (verbose)
+						printf("valid: pud %ld pmd %ld pfn %#lx"
+							" pvaddr %#lx pfns %#lx-%lx"
+							" start %#lx end %#lx\n",
+							pudindex, pmdindex,
+							data_addr >> 12,
+							pvaddr, tpfn,
+					tpfn + structsperhpage - 1,
+					vmap_offset_start,
+					vmap_offset_end);
+					num_pmds_valid++;
+					if (!(pmd & PSE)) {
+						printf("vmemmap pmd not huge, abort\n");
+						exit(1);
+					}
+				} else {
+					if (last_valid) {
+						/* this a hole after some valids */
+						do_break = 1;
+						break_in_valids = 1;
+						break_after_invalids = 0;
+					}
+					last_valid = 0;
+					last_invalid = 1;
+					/*
+					 * There are holes in this sparsely
+					 * populated table; they are 2MB gaps
+					 * represented by null pmd entries.
+					 */
+					if (verbose)
+						printf("invalid: pud %ld pmd %ld %#lx"
+							" pfns %#lx-%lx start %#lx end"
+							" %#lx\n", pudindex, pmdindex,
+							pvaddr, tpfn,
+							tpfn + structsperhpage - 1,
+							vmap_offset_start,
+							vmap_offset_end);
+				}
+				if (do_break) {
+					/* The end of a hole is not summarized.
+					 * It must be the start of a hole or
+					 * hitting a discontiguous series.
+					 */
+					if (break_in_valids || break_after_invalids) {
+						/*
+						 * calculate that pfns
+						 * represented by the current
+						 * offset in the vmemmap.
+						 */
+						/* page struct even partly on this page */
+						rep_pfn_start = vmap_offset_start /
+							pagestructsize;
+						/* ending page struct entirely on
+ 						   this page */
+						rep_pfn_end = ((vmap_offset_end -
+							hugepagesize) / pagestructsize);
+ 						if (verbose)
+							printf("vmap pfns %#lx-%lx "
+							"represent pfns %#lx-%lx\n\n",
+							start_data_addr >> PAGESHFT,
+							last_data_addr >> PAGESHFT,
+							rep_pfn_start, rep_pfn_end);
+						groups++;
+						vmapp = (struct vmap_pfns *)malloc(
+								sizeof(struct vmap_pfns));
+						/* pfn of this 2MB page of page structs */
+						vmapp->vmap_pfn_start = start_data_addr
+									>> PTE_SHIFT;
+						vmapp->vmap_pfn_end = last_data_addr
+									>> PTE_SHIFT;
+						/* these (start/end) are literal pfns
+ 						 * on this page, not start and end+1 */
+						vmapp->rep_pfn_start = rep_pfn_start;
+						vmapp->rep_pfn_end = rep_pfn_end;
+
+						if (!vmaphead) {
+							vmaphead = vmapp;
+							vmapp->next = vmapp;
+							vmapp->prev = vmapp;
+						} else {
+							tail = vmaphead->prev;
+							vmaphead->prev = vmapp;
+							tail->next = vmapp;
+							vmapp->next = vmaphead;
+							vmapp->prev = tail;
+						}
+					}
+
+					/* update logical position at every break */
+					vmap_offset_start =
+						vmap_offset_end - hugepagesize;
+					start_data_addr = data_addr;
+				}
+
+				last_data_addr = data_addr;
+				pvaddr += hugepagesize;
+				/*
+				 * pvaddr is current virtual address
+				 *   eg 0xffffea0004200000 if
+				 *    vmap_offset_start is 4200000
+				 */
+			}
+		}
+		tpfn = (pvaddr - VMEMMAP_START) / pagestructsize;
+		if (tpfn >= high_pfn) {
+			done = 1;
+			break;
+		}
+	}
+	rep_pfn_start = vmap_offset_start / pagestructsize;
+	rep_pfn_end = (vmap_offset_end - hugepagesize) / pagestructsize;
+ 	if (verbose)
+		printf("vmap pfns %#lx-%lx represent pfns %#lx-%lx\n\n",
+			start_data_addr >> PAGESHFT, last_data_addr >> PAGESHFT,
+			rep_pfn_start, rep_pfn_end);
+	groups++;
+	vmapp = (struct vmap_pfns *)malloc(sizeof(struct vmap_pfns));
+	vmapp->vmap_pfn_start = start_data_addr >> PTE_SHIFT;
+	vmapp->vmap_pfn_end = last_data_addr >> PTE_SHIFT;
+	vmapp->rep_pfn_start = rep_pfn_start;
+	vmapp->rep_pfn_end = rep_pfn_end;
+	if (!vmaphead) {
+		vmaphead = vmapp;
+		vmapp->next = vmapp;
+		vmapp->prev = vmapp;
+	} else {
+		tail = vmaphead->prev;
+		vmaphead->prev = vmapp;
+		tail->next = vmapp;
+		vmapp->next = vmaphead;
+		vmapp->prev = tail;
+	}
+	if (verbose)
+		printf("num_pmds: %d num_pmds_valid %d\n", num_pmds, num_pmds_valid);
+
+	/* transfer the linked list to an array */
+	cur = vmaphead;
+	gvmem_pfns = (struct vmap_pfns *)malloc(sizeof(struct vmap_pfns) * groups);
+	i = 0;
+	do {
+		vmapp = gvmem_pfns + i;
+		vmapp->vmap_pfn_start = cur->vmap_pfn_start;
+		vmapp->vmap_pfn_end = cur->vmap_pfn_end;
+		vmapp->rep_pfn_start = cur->rep_pfn_start;
+		vmapp->rep_pfn_end = cur->rep_pfn_end;
+		cur = cur->next;
+		free(cur->prev);
+		i++;
+	} while (cur != vmaphead);
+	nr_gvmem_pfns = i;
+}
+
 int
 create_dumpfile(void)
 {
@@ -8302,6 +8969,10 @@ create_dumpfile(void)
 
 	print_vtop();
 
+	/* create an array of translations from pfn to vmemmap pages */
+	if (info->flag_excludevm)
+		find_vmemmap();
+
 	if (info->flag_rawdump)
 		PROGRESS_MSG("Using O_DIRECT i/o for dump.\n");
 	if (info->flag_rawbitmaps)
@@ -9334,7 +10005,7 @@ main(int argc, char *argv[])
 
 	info->block_order = DEFAULT_ORDER;
 	message_level = DEFAULT_MSG_LEVEL;
-	while ((opt = getopt_long(argc, argv, "b:cDd:EFfg:hi:jJlpRvXx:", longopts,
+	while ((opt = getopt_long(argc, argv, "b:cDd:eEFfg:hi:jJlpRvXx:", longopts,
 	    NULL)) != -1) {
 		switch (opt) {
 		case OPT_BLOCK_ORDER:
@@ -9349,6 +10020,10 @@ main(int argc, char *argv[])
 		case OPT_DEBUG:
 			flag_debug = TRUE;
 			break;
+		case OPT_EXCLUDEVM:
+			info->flag_excludevm = 1;
+			/* exclude unused vmemmap pages */
+			break;
 		case OPT_DUMP_LEVEL:
 			if (!parse_dump_level(optarg))
 				goto out;
Index: makedumpfile-1.5.5/makedumpfile.h
===================================================================
--- makedumpfile-1.5.5.orig/makedumpfile.h
+++ makedumpfile-1.5.5/makedumpfile.h
@@ -44,6 +44,9 @@
 #include "diskdump_mod.h"
 #include "sadump_mod.h"
 
+#define VMEMMAPSTART 0xffffea0000000000UL
+#define BITS_PER_WORD 64
+
 /*
  * Result of command
  */
@@ -477,6 +480,7 @@ do { \
 #define VMALLOC_END		(info->vmalloc_end)
 #define VMEMMAP_START		(info->vmemmap_start)
 #define VMEMMAP_END		(info->vmemmap_end)
+#define PMASK			(0x7ffffffffffff000UL)
 
 #ifdef __arm__
 #define KVBASE_MASK		(0xffff)
@@ -561,15 +565,20 @@ do { \
 #define PGDIR_SIZE		(1UL << PGDIR_SHIFT)
 #define PGDIR_MASK		(~(PGDIR_SIZE - 1))
 #define PTRS_PER_PGD		(512)
+#define PGD_SHIFT		(39)
+#define PUD_SHIFT		(30)
 #define PMD_SHIFT		(21)
 #define PMD_SIZE		(1UL << PMD_SHIFT)
 #define PMD_MASK		(~(PMD_SIZE - 1))
+#define PTRS_PER_PUD		(512)
 #define PTRS_PER_PMD		(512)
 #define PTRS_PER_PTE		(512)
 #define PTE_SHIFT		(12)
 
 #define pml4_index(address) (((address) >> PML4_SHIFT) & (PTRS_PER_PML4 - 1))
 #define pgd_index(address)  (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
+#define pgd4_index(address)  (((address) >> PGD_SHIFT) & (PTRS_PER_PGD - 1))
+#define pud_index(address)  (((address) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
 #define pmd_index(address)  (((address) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
 #define pte_index(address)  (((address) >> PTE_SHIFT) & (PTRS_PER_PTE - 1))
 
@@ -683,7 +692,6 @@ do { \
 /*
  * 4 Levels paging
  */
-#define PUD_SHIFT		(PMD_SHIFT + PTRS_PER_PTD_SHIFT)
 #define PGDIR_SHIFT_4L		(PUD_SHIFT + PTRS_PER_PTD_SHIFT)
 
 #define MASK_PUD   	((1UL << REGION_SHIFT) - 1) & (~((1UL << PUD_SHIFT) - 1))
@@ -917,6 +925,7 @@ struct DumpInfo {
 	int		flag_vmemmap;        /* kernel supports vmemmap address space */
 	int		flag_rawdump;        /* use raw i/o for the dump file */
 	int		flag_rawbitmaps;     /* use raw i/o for the bitmaps file */
+	int		flag_excludevm;      /* exclude unused vmemmap pages */
 	unsigned long	vaddr_for_vtop;      /* virtual address for debugging */
 	long		page_size;           /* size of page */
 	long		page_shift;
@@ -1449,6 +1458,52 @@ struct srcfile_table {
 	char	pud_t[LEN_SRCFILE];
 };
 
+/*
+ * This structure records where the vmemmap page structures reside, and which
+ * pfn's are represented by those page structures.
+ * The actual pages containing the page structures are 2MB pages, so their pfn's
+ * will all be multiples of 0x200.
+ * The page structures are 7 64-bit words in length (0x38) so they overlap the
+ * 2MB boundaries. Each page structure represents a 4k page.
+ * A 4k page is here defined to be represented on a 2MB page if its page structure
+ * 'ends' on that page (even if it began on the page before).
+ */
+struct vmap_pfns {
+	struct vmap_pfns *next;
+	struct vmap_pfns *prev;
+	/*
+	 * These (start/end) are literal pfns of 2MB pages on which the page
+	 * structures reside, not start and end+1.
+	 */
+	unsigned long vmap_pfn_start;
+	unsigned long vmap_pfn_end;
+	/*
+	 * These (start/end) are literal pfns represented on these pages, not
+	 * start and end+1.
+	 * The starting page struct is at least partly on the first page; the
+ 	 * ending page struct is entirely on the last page.
+ 	 */
+	unsigned long rep_pfn_start;
+	unsigned long rep_pfn_end;
+};
+
+/* for saving a list of pfns to a buffer, and then to a file if necessary */
+struct save_control {
+	int sc_fd;
+	char *sc_filename;
+	char *sc_buf_malloced;
+	char *sc_buf;
+	long sc_buflen; /* length of buffer never changes */
+	long sc_bufposition; /* offset of next slot for write, or next to be read */
+	long sc_filelen; /* length of valid data written */
+	long sc_fileposition; /* offset in file of next entry to be read */
+};
+/* one entry in the buffer and file */
+struct sc_entry {
+	unsigned long startpfn;
+	unsigned long numpfns;
+};
+
 extern struct symbol_table	symbol_table;
 extern struct size_table	size_table;
 extern struct offset_table	offset_table;
@@ -1595,6 +1650,8 @@ int get_xen_info_ia64(void);
 #define get_xen_basic_info_arch(X) FALSE
 #define get_xen_info_arch(X) FALSE
 #endif	/* s390x */
+#define PAGESHFT	12 /* assuming a 4k page */
+#define PSE		128 /* bit 7 */
 
 static inline int
 is_on(char *bitmap, int i)
@@ -1729,6 +1786,7 @@ struct elf_prstatus {
 #define OPT_COMPRESS_ZLIB       'c'
 #define OPT_DEBUG               'D'
 #define OPT_DUMP_LEVEL          'd'
+#define OPT_EXCLUDEVM           'e'
 #define OPT_ELF_DUMPFILE        'E'
 #define OPT_FLATTEN             'F'
 #define OPT_FORCE               'f'
Index: makedumpfile-1.5.5/print_info.c
===================================================================
--- makedumpfile-1.5.5.orig/print_info.c
+++ makedumpfile-1.5.5/print_info.c
@@ -48,7 +48,7 @@ print_usage(void)
 	MSG("\n");
 	MSG("Usage:\n");
 	MSG("  Creating DUMPFILE:\n");
-	MSG("  # makedumpfile    [-c|-l|-E] [-d DL] [-j] [-J] [-x VMLINUX|-i VMCOREINFO] VMCORE\n");
+	MSG("  # makedumpfile    [-c|-l|-E|-j|-J|-e|-N] [-d DL] [-x VMLINUX|-i VMCOREINFO] VMCORE\n");
 	MSG("    DUMPFILE\n");
 	MSG("\n");
 	MSG("  Creating DUMPFILE with filtered kernel data specified through filter config\n");
@@ -101,6 +101,10 @@ print_usage(void)
 	MSG("  [-J]:\n");
 	MSG("      Use raw (O_DIRECT) i/o on bitmap file to avoid expanding kernel pagecache.\n");
 	MSG("\n");
+	MSG("  [-e]:\n");
+	MSG("      Exclude page structures (vmemmap) for unused pages.\n");
+	MSG("      (the default is to capture them all, which amounts to 3.67M pages per terabyte)\n");
+	MSG("\n");
 	MSG("  [-d DL]:\n");
 	MSG("      Specify the type of unnecessary page for analysis.\n");
 	MSG("      Pages of the specified type are not copied to DUMPFILE. The page type\n");
Index: makedumpfile-1.5.5/diskdump_mod.h
===================================================================
--- makedumpfile-1.5.5.orig/diskdump_mod.h
+++ makedumpfile-1.5.5/diskdump_mod.h
@@ -95,6 +95,7 @@ struct kdump_sub_header {
 #define DUMP_DH_COMPRESSED_LZO	0x2	/* paged is compressed with lzo */
 #define DUMP_DH_COMPRESSED_SNAPPY	0x4
 					/* paged is compressed with snappy */
+#define DUMP_DH_EXCLUDED_VMEMMAP 0x8	/* unused vmemmap pages are excluded */
 
 /* descriptor of each page for vmcore */
 typedef struct page_desc {

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] makedumpfile: for large memories
  2014-01-10  7:48     ` Atsushi Kumagai
@ 2014-01-10 18:23       ` Cliff Wickman
  2014-01-14 12:59         ` HATAYAMA Daisuke
  0 siblings, 1 reply; 16+ messages in thread
From: Cliff Wickman @ 2014-01-10 18:23 UTC (permalink / raw)
  To: Atsushi Kumagai; +Cc: d.hatayama@jp.fujitsu.com, kexec@lists.infradead.org

On Fri, Jan 10, 2014 at 07:48:27AM +0000, Atsushi Kumagai wrote:
> On 2014/01/09 9:26:20, kexec <kexec-bounces@lists.infradead.org> wrote:
> > On Mon, Jan 06, 2014 at 09:27:34AM +0000, Atsushi Kumagai wrote:
> > > Hello Cliff,
> > > 
> > > On 2014/01/01 8:30:47, kexec <kexec-bounces@lists.infradead.org> wrote:
> > > > From: Cliff Wickman <cpw@sgi.com>
> > > > 
> > > > Gentlemen of kexec,
> > > > 
> > > > I have been working on enabling kdump on some very large systems, and
> > > > have found some solutions that I hope you will consider.
> > > > 
> > > > The first issue is to work within the restricted size of crashkernel memory
> > > > under 2.6.32-based kernels, such as sles11 and rhel6.
> > > > 
> > > > The second issue is to reduce the very large size of a dump of a big memory
> > > > system, even on an idle system.
> > > > 
> > > > These are my propositions:
> > > > 
> > > > Size of crashkernel memory
> > > >   1) raw i/o for writing the dump
> > > >   2) use root device for the bitmap file (not tmpfs)
> > > >   3) raw i/o for reading/writing the bitmaps
> > > >   
> > > > Size of dump (and hence the duration of dumping)
> > > >   4) exclude page structures for unused pages
> > > > 
> > > > 
> > > > 1) Is quite easy.  The cache of pages needs to be aligned on a block
> > > >   boundary and written in block multiples, as required by O_DIRECT files.
> > > > 
> > > >   The use of raw i/o prevents the growing of the crash kernel's page
> > > >   cache.

Today I posted V2 of both patches.  V2 of the first patch fixes a bug.
V2 of the second patch makes some of the changes that you and Hatayama-san
requested.  But these updates don't address all of your points.

 
> > > There is no reason to reject this idea, please re-post it as a formal patch.
> > > If possible, I would like to know the benefit of only this.
> > 
> > The motivation for using raw i/o was purely to be able to conserve memory,
> > not for speed.

> OK, 1) is also for removing cyclic mode, right ?

I did disable cyclic mode for my testing.  I wanted to prove that makedumpfile
can work in a small amount of memory without cyclic mode.
I think this is an alternative to cyclic mode, but I don't know all the
issues.  This is a proof of concept only -- I hope that you guys who have
the big picture of all the dump-capture issues can fit it in properly.

> I think there is no need to conserve memory with 1) since 2) is enough to
> remove cyclic mode.
> (To be exact, there are some cases that we have to use cyclic mode as
>  Hatayama-san said, but I don't mention that in this mail.)
> 
> > However, I haven't noticed any significant degradation in speed.
> > Memory is in 'very' short supply on a large machine (ironically) and a 2.6 or 
> > 3.0 kernel.  We're constrained to the low 4GB, and the kernel is putting other
> > things in that memory that are related to memory size.
> > The obvious solution is cyclic mode, but that requires at least 2x the page
> > scans.  Once for the scan of unnecessary pages and several partial 
> > scans for the copy phase.
> > But it is tmpfs and kernel page cache that are using up available memory.
> > If we avoid those, a single page scan can work in about 350M of crashkernel
> > memory.
> > This is not a problem with 3.10+ kernels as we're not constrained to low 4G.
> 
> Even if we can use 350M fully, 5TB is the limit system memory size
> in non-cyclic mode unless 2), since the bitmap file requires 64MB
> per 1TB RAM. So, I can't find an importance of 1).

1) raw i/o for writing the dump
2) use root device for the bitmap file (not tmpfs)
3) raw i/o for reading/writing the bitmaps
Non-raw i/o for any of these files is going to enlarge the kernel page
cache.  There doesn't seem to be any way to ask the kernel to limit
that growth.  And writing to tmpfs consumes memory.  The one file
is much larger than the other, but to be consistent and not let i/o
consume memory I think we have to do all three.
 
> > > > 2) Is also quite easy.  My patch finds the path to the crash
> > > >   kernel's root device by examining the dump pathname. Storing the bitmaps
> > > >   to a file is otherwise not conserving memory, as they are being written
> > > >   to tmpfs.
> > > 
> > > Users will expect that the size of dump file is the same as the size of
> > > RAM at most, they will prepare a disk which fit to save that.
> > > But 2) breaks this estimation, I worry about it a little.
> > 
> > The bit map file is very small compared to the dump. And the dump should be
> > much smaller than RAM.  Particularly with 4), the excluding of unused page structures.
> > > 
> > > Of course, I don't reject this idea just only for that reason,
> > > but I would like to know the definite advantage of this.
> > > I suppose that the improvement showed in your benchmarks may be came
> > > from 1) and 4) mostly, so could you let me know that only 2) and 3)
> > > can perform much faster than the current cyclic mode ?
> > 
> > 2) and 3), the handling of the bitmap, are small contributors to the
> > memory shortage issue.  They are a bigger issue the bigger the system.
> > It's just that if we consistently avoid enlarging page cache and
> > tmpfs we can avoid the 2nd page scan altogether.
> > True, my benchmarks show only .2 min. and 1.1 min. improvements
> > for 2TB and 8TB (2.0 vs 1.8, and 6.6 vs 5.5).
> > But that's an improvement, not a loss.  And we're absolutely
> > not going to run out of memory as the scan and copies proceed.
> > This is important on these old kernels with minimal memory available.
> 
> Does just changing TMPDIR to a disk meet that purpose ?
> Is it necessary to add new codes ?
Perhaps.  But the page cache is still going to grow.
> 
> > > > 3) Raw i/o for the bitmaps, is accomplished by caching the
> > > >   bitmap file in a similar way to that of the dump file.
> > > > 
> > > >   I find that the use of direct i/o is not significantly slower than
> > > >   writing through the kernel's page cache.
> > > >
> > > > 4) The excluding of unused kernel page structures is very
> > > >   important for a large memory system.  The kernel otherwise includes
> > > >   3.67 million pages of page structures per TB of memory. By contrast
> > > >   the rest of the kernel is only about 1 million pages.
> > > 
> > > According to your and Dave's mails, 4) seems risky and unacceptable
> > > for now. I think we need more investigation for this.
I think we have addressed that with Dave and a patch to crash.
> > 
> > I've been working with Dave on a patch for crash.  It will warn the
> > user that certain kmem command options will fail.  But that is
> > only relevant to examinations of free memory and user memory, the
> > contents of which we're not capturing anyway.
> > 
> > Number 4), the exclusion of page structures for non-captured
> > pages is really the crux of the improvement.
> > A linux kernel should not be hugely bigger on a big machine than
> > on a small one.  Slightly bigger, yes, because of bigger slab
> > caches. 
> > But in practice the dumps of big memories are huge, and all
> > because of page structures.
> > To find the unneeded ones only takes a few seconds, but cuts
> > hours off the dumping process.  Without this a customer is just
> > not going to allow his very big system to be dumped.
> 
> I understand the benefit of this, but I still suspect that this
> feature is really required from users, it sounds too progressive
> to me.
> This is a big patch, so I want to make sure that this feature will
> be used in practice, any comments are welcome.

Agreed, especially comments from developers of systems with big memories.
The second patch, regarding the elimination of millions of pages of page
structures representing excluded pages, is particularly important to being
able to dump a large system at all.
But shortening the scan of free and user pages is also helpful as the
memory size increases.

-Cliff
> 
> 
> Thanks
> Atsushi Kumagai
> 
> > -Cliff
> > > 
> > > 
> > > Thanks
> > > Atsushi Kumagai
> > > 
> > > > Test results are below, for systems of 1TB, 2TB, 8.8TB and 16TB.
> > > > (There are no 'old' numbers for 16TB as time and space requirements
> > > >  made those effectively useless.)
> > > > 
> > > > Run times were generally reduced 2-3x, and dump size reduced about 8x.
> > > > 
> > > > All timings were done using 512M of crashkernel memory.
> > > > 
> > > >    System memory size
> > > >    1TB                     unpatched    patched
> > > >      OS: rhel6.4 (does a free pages pass)
> > > >      page scan time           1.6min    1.6min
> > > >      dump copy time           2.4min     .4min
> > > >      total time               4.1min    2.0min
> > > >      dump size                 3014M      364M
> > > > 
> > > >      OS: rhel6.5
> > > >      page scan time            .6min     .6min
> > > >      dump copy time           2.3min     .5min
> > > >      total time               2.9min    1.1min
> > > >      dump size                 3011M      423M
> > > > 
> > > >      OS: sles11sp3 (3.0.93)
> > > >      page scan time            .5min     .5min
> > > >      dump copy time           2.3min     .5min
> > > >      total time               2.8min    1.0min
> > > >      dump size                 2950M      350M
> > > > 
> > > >    2TB
> > > >      OS: rhel6.5           (cyclicx3)
> > > >      page scan time           2.0min    1.8min
> > > >      dump copy time           8.0min    1.5min
> > > >      total time              10.0min    3.3min
> > > >      dump size                 6141M      835M
> > > > 
> > > >    8.8TB
> > > >      OS: rhel6.5           (cyclicx5)
> > > >      page scan time           6.6min    5.5min
> > > >      dump copy time          67.8min    6.2min
> > > >      total time              74.4min   11.7min
> > > >      dump size                 15.8G      2.7G
> > > > 
> > > >    16TB
> > > >      OS: rhel6.4
> > > >      page scan time                   125.3min
> > > >      dump copy time                    13.2min
> > > >      total time                       138.5min
> > > >      dump size                            4.0G
> > > > 
> > > >      OS: rhel6.5
> > > >      page scan time                    27.8min
> > > >      dump copy time                    13.3min
> > > >      total time                        41.1min
> > > >      dump size                            4.1G
> > > > 
> > > > Page scan time is greatly affected by whether or not the
> > > > kernel supports mmap of /proc/vmcore.
> > > > 
> > > > The choice of snappy vs. zlib compression becomes fairly irrelevant
> > > > when we can shrink the dump size dramatically.  The above
> > > > were done with snappy compression.
> > > > 
> > > > I am sending my 2 working patches.  
> > > > They are kludgy in the sense that they ignore all forms of
> > > > kdump except the creation of a disk dump, and all architectures
> > > > except x86_64.
> > > > But I think they are sufficient to demonstrate the sizable
> > > > time, crashkernel space and disk space savings that are possible.
> > > > 
> > 
> > -- 
> > Cliff Wickman
> > SGI
> > cpw@sgi.com
> > (651) 683-3824
> > 

-- 
Cliff Wickman
SGI
cpw@sgi.com
(651) 683-3824



* Re: [PATCH 2/2 V2] exclude unused vmemmap pages
       [not found] <mailman.19254.1389376849.1059.kexec@lists.infradead.org>
@ 2014-01-10 19:17 ` Dave Anderson
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Anderson @ 2014-01-10 19:17 UTC (permalink / raw)
  To: kexec



----- Original Message -----
>
> provides some requested changes:
> - remove the automatic exclusion of page structures for memories over 1TB; it
>   will only be done by explicit request (-e)
> - remove the -N option; no need to explicitly include unused vmemmap pages
>   as they will be included by default
> - add DUMP_DH_EXCLUDED_VMEMMAP to the dump header; to warn crash users that
>   these page structures are excluded
> - fix the making of the filename for pfn file (in init_save_control())
> but still is only tested on a disk dump of an x86_64 system.
> 
> Exclude kernel pages that contain nothing but page structures for pages
> that are not being included in the dump.
> These can amount to 3.67 million pages per terabyte of system memory!
> 
> The kernel's page table, starting at virtual address 0xffffea0000000000, is
> searched to find the actual pages containing the vmemmap page structures.
> 
> Bitmap1 is a map of dumpable (i.e existing) pages. Bitmap2 is a map
> of pages not to be excluded.
> To speed the search of bitmaps only whole 64-bit words of 1's in
> bitmap1 and 0's in bitmap2 are tested to see if they are vmemmap pages.
> 
> The list of vmemmap pfn's to be excluded is written to a small file in order
> to conserve crash kernel memory.
> 
> In practice, this whole procedure only takes about 10 seconds on a
> 16TB machine.
> 
> Signed-off-by: Cliff Wickman <cpw@sgi.com>

... [ cut ] ...

> +#define OPT_EXCLUDEVM           'e'
>  #define OPT_ELF_DUMPFILE        'E'
>  #define OPT_FLATTEN             'F'
>  #define OPT_FORCE               'f'
> Index: makedumpfile-1.5.5/print_info.c
> ===================================================================
> --- makedumpfile-1.5.5.orig/print_info.c
> +++ makedumpfile-1.5.5/print_info.c
> @@ -48,7 +48,7 @@ print_usage(void)
>  	MSG("\n");
>  	MSG("Usage:\n");
>  	MSG("  Creating DUMPFILE:\n");
> -	MSG("  # makedumpfile    [-c|-l|-E] [-d DL] [-j] [-J] [-x VMLINUX|-i VMCOREINFO] VMCORE\n");
> +	MSG("  # makedumpfile    [-c|-l|-E|-j|-J|-e|-N] [-d DL] [-x VMLINUX|-i VMCOREINFO] VMCORE\n");
>  	MSG("    DUMPFILE\n");
>  	MSG("\n");
>  	MSG("  Creating DUMPFILE with filtered kernel data specified through filter config\n");
> @@ -101,6 +101,10 @@ print_usage(void)
>  	MSG("  [-J]:\n");
>  	MSG("      Use raw (O_DIRECT) i/o on bitmap file to avoid expanding kernel pagecache.\n");
>  	MSG("\n");
> +	MSG("  [-e]:\n");
> +	MSG("      Exclude page structures (vmemmap) for unused pages.\n");
> +	MSG("      (the default is to capture them all, which amounts to 3.67M pages per terabyte)\n");
> +	MSG("\n");
>  	MSG("  [-d DL]:\n");
>  	MSG("      Specify the type of unnecessary page for analysis.\n");
>  	MSG("      Pages of the specified type are not copied to DUMPFILE. The page type\n");

Perhaps there should be a warning above concerning the potential for crash analysis
problems?  The description above makes it sound like -e is similar in nature to
(and therefore as harmless as) the -d <level> option.

It could be argued that there's no such warning for eppic/erasures, but with eppic
it's far less likely to result in crash analysis failures.  With -e, the user is
virtually guaranteed to have issues.

I realize that this is beating a dead horse, but again, this is an option
that should very rarely be used -- and if applied, the user/administrator should
be well aware of the ramifications. 

Dave





* Re: [PATCH 1/2 V2] raw i/o and root device to use less memory
  2014-01-10 17:58 ` [PATCH 1/2 V2] raw i/o and root device to use less memory Cliff Wickman
@ 2014-01-13  9:58   ` Michael Holzheu
  2014-01-13 13:30     ` Cliff Wickman
  0 siblings, 1 reply; 16+ messages in thread
From: Michael Holzheu @ 2014-01-13  9:58 UTC (permalink / raw)
  To: Cliff Wickman; +Cc: kumagai-atsushi, d.hatayama, kexec

On Fri, 10 Jan 2014 11:58:30 -0600
Cliff Wickman <cpw@sgi.com> wrote:

[snip]

> Use O_DIRECT (raw) i/o for the dump and for the bitmaps file, so that writing
> to those files does not allocate kernel memory for page cache.

Hello Cliff,

Have you tested O_DIRECT together with the new vmcore mmap interface?
IIRC when we tried O_DIRECT together with mmap for some reason it did
not work.

Michael





* Re: [PATCH 1/2 V2] raw i/o and root device to use less memory
  2014-01-13  9:58   ` Michael Holzheu
@ 2014-01-13 13:30     ` Cliff Wickman
  2014-01-13 15:02       ` Michael Holzheu
  0 siblings, 1 reply; 16+ messages in thread
From: Cliff Wickman @ 2014-01-13 13:30 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: kumagai-atsushi, d.hatayama, kexec

On Mon, Jan 13, 2014 at 10:58:31AM +0100, Michael Holzheu wrote:
> On Fri, 10 Jan 2014 11:58:30 -0600
> Cliff Wickman <cpw@sgi.com> wrote:
> 
> [snip]
> 
> > Use O_DIRECT (raw) i/o for the dump and for the bitmaps file, so that writing
> > to those files does not allocate kernel memory for page cache.
> 
> Hello Cliff,
> 
> Have you tested O_DIRECT together with the new vmcore mmap interface?
> IIRC when we tried O_DIRECT together with mmap for some reason it did
> not work.
> 
> Michael

Hi Michael,

  How new of a kernel do I need?  i.e. how 'new' of an mmap interface?
  The latest kernel I used was 3.0.93 (sle11sp3).
  And makedumpfile was 1.5.5.

-Cliff
-- 
Cliff Wickman
SGI
cpw@sgi.com
(651) 683-3824



* Re: [PATCH 1/2 V2] raw i/o and root device to use less memory
  2014-01-13 13:30     ` Cliff Wickman
@ 2014-01-13 15:02       ` Michael Holzheu
  0 siblings, 0 replies; 16+ messages in thread
From: Michael Holzheu @ 2014-01-13 15:02 UTC (permalink / raw)
  To: Cliff Wickman; +Cc: kumagai-atsushi, d.hatayama, kexec

On Mon, 13 Jan 2014 07:30:33 -0600
Cliff Wickman <cpw@sgi.com> wrote:

> On Mon, Jan 13, 2014 at 10:58:31AM +0100, Michael Holzheu wrote:
> > On Fri, 10 Jan 2014 11:58:30 -0600
> > Cliff Wickman <cpw@sgi.com> wrote:
> > 
> > [snip]
> > 
> > > Use O_DIRECT (raw) i/o for the dump and for the bitmaps file, so that writing
> > > to those files does not allocate kernel memory for page cache.
> > 
> > Hello Cliff,
> > 
> > Have you tested O_DIRECT together with the new vmcore mmap interface?
> > IIRC when we tried O_DIRECT together with mmap for some reason it did
> > not work.
> > 
> > Michael
> 
> Hi Michael,
> 
>   How new of a kernel do I need?  i.e. how 'new' of an mmap interface?
>   The latest kernel I used was 3.0.93 (sle11sp3).
>   And makedumpfile was 1.5.5.

I think sles11 SP3 does not support mmap for /proc/vmcore. Upstream you
need at least kernel 3.11. For makedumpfile version 1.5.5 should be ok.

Michael




* Re: [PATCH 0/2] makedumpfile: for large memories
  2013-12-31 23:30 [PATCH 0/2] makedumpfile: for large memories cpw
                   ` (5 preceding siblings ...)
  2014-01-10 18:00 ` [PATCH 2/2 V2] exclude unused vmemmap pages Cliff Wickman
@ 2014-01-14 11:33 ` HATAYAMA Daisuke
  6 siblings, 0 replies; 16+ messages in thread
From: HATAYAMA Daisuke @ 2014-01-14 11:33 UTC (permalink / raw)
  To: cpw; +Cc: kumagai-atsushi, kexec

(2014/01/01 8:30), cpw wrote:
> From: Cliff Wickman <cpw@sgi.com>
> 
> Gentlemen of kexec,
> 
> I have been working on enabling kdump on some very large systems, and
> have found some solutions that I hope you will consider.
> 
> The first issue is to work within the restricted size of crashkernel memory
> under 2.6.32-based kernels, such as sles11 and rhel6.
> 
> The second issue is to reduce the very large size of a dump of a big memory
> system, even on an idle system.
> 
> These are my propositions:
> 
> Size of crashkernel memory
>    1) raw i/o for writing the dump
>    2) use root device for the bitmap file (not tmpfs)
>    3) raw i/o for reading/writing the bitmaps
>    
> Size of dump (and hence the duration of dumping)
>    4) exclude page structures for unused pages
> 
> 
> 1) Is quite easy.  The cache of pages needs to be aligned on a block
>    boundary and written in block multiples, as required by O_DIRECT files.
> 
>    The use of raw i/o prevents the growing of the crash kernel's page
>    cache.
> 
> 2) Is also quite easy.  My patch finds the path to the crash
>    kernel's root device by examining the dump pathname. Storing the bitmaps
>    to a file is otherwise not conserving memory, as they are being written
>    to tmpfs.
> 
> 3) Raw i/o for the bitmaps, is accomplished by caching the
>    bitmap file in a similar way to that of the dump file.
> 
>    I find that the use of direct i/o is not significantly slower than
>    writing through the kernel's page cache.
> 
> 4) The excluding of unused kernel page structures is very
>    important for a large memory system.  The kernel otherwise includes
>    3.67 million pages of page structures per TB of memory. By contrast
>    the rest of the kernel is only about 1 million pages.
> 
> Test results are below, for systems of 1TB, 2TB, 8.8TB and 16TB.
> (There are no 'old' numbers for 16TB as time and space requirements
>   made those effectively useless.)
> 
> Run times were generally reduced 2-3x, and dump size reduced about 8x.
> 
> All timings were done using 512M of crashkernel memory.
> 
>     System memory size
>     1TB                     unpatched    patched
>       OS: rhel6.4 (does a free pages pass)
>       page scan time           1.6min    1.6min
>       dump copy time           2.4min     .4min
>       total time               4.1min    2.0min
>       dump size                 3014M      364M
> 
>       OS: rhel6.5
>       page scan time            .6min     .6min
>       dump copy time           2.3min     .5min
>       total time               2.9min    1.1min
>       dump size                 3011M      423M
> 
>       OS: sles11sp3 (3.0.93)
>       page scan time            .5min     .5min
>       dump copy time           2.3min     .5min
>       total time               2.8min    1.0min
>       dump size                 2950M      350M
> 
>     2TB
>       OS: rhel6.5           (cyclicx3)
>       page scan time           2.0min    1.8min
>       dump copy time           8.0min    1.5min
>       total time              10.0min    3.3min
>       dump size                 6141M      835M
> 
>     8.8TB
>       OS: rhel6.5           (cyclicx5)
>       page scan time           6.6min    5.5min
>       dump copy time          67.8min    6.2min
>       total time              74.4min   11.7min
>       dump size                 15.8G      2.7G
> 
>     16TB
>       OS: rhel6.4
>       page scan time                   125.3min
>       dump copy time                    13.2min
>       total time                       138.5min
>       dump size                            4.0G
> 
>       OS: rhel6.5
>       page scan time                    27.8min
>       dump copy time                    13.3min
>       total time                        41.1min
>       dump size                            4.1G
> 

Also, could you please show us results in more detail?
That is, this benchmark depends on the 3 parameters below:

- cyclic mode or non-cyclic mode
- cached I/O or direct I/O
- with or without page structure object array

Please describe the results for each parameter separately, so we can easily
understand how each parameter affects the outcome without confusion.

-- 
Thanks.
HATAYAMA, Daisuke




* Re: [PATCH 0/2] makedumpfile: for large memories
  2014-01-10 18:23       ` Cliff Wickman
@ 2014-01-14 12:59         ` HATAYAMA Daisuke
  0 siblings, 0 replies; 16+ messages in thread
From: HATAYAMA Daisuke @ 2014-01-14 12:59 UTC (permalink / raw)
  To: Cliff Wickman; +Cc: kexec@lists.infradead.org, Atsushi Kumagai

(2014/01/11 3:23), Cliff Wickman wrote:
> On Fri, Jan 10, 2014 at 07:48:27AM +0000, Atsushi Kumagai wrote:
>> On 2014/01/09 9:26:20, kexec <kexec-bounces@lists.infradead.org> wrote:
>>> On Mon, Jan 06, 2014 at 09:27:34AM +0000, Atsushi Kumagai wrote:
>>>> Hello Cliff,
>>>>
>>>> On 2014/01/01 8:30:47, kexec <kexec-bounces@lists.infradead.org> wrote:
>>>>> From: Cliff Wickman <cpw@sgi.com>
>>>>>
>>>>> Gentlemen of kexec,
>>>>>
>>>>> I have been working on enabling kdump on some very large systems, and
>>>>> have found some solutions that I hope you will consider.
>>>>>
>>>>> The first issue is to work within the restricted size of crashkernel memory
>>>>> under 2.6.32-based kernels, such as sles11 and rhel6.
>>>>>
>>>>> The second issue is to reduce the very large size of a dump of a big memory
>>>>> system, even on an idle system.
>>>>>
>>>>> These are my propositions:
>>>>>
>>>>> Size of crashkernel memory
>>>>>    1) raw i/o for writing the dump
>>>>>    2) use root device for the bitmap file (not tmpfs)
>>>>>    3) raw i/o for reading/writing the bitmaps
>>>>>
>>>>> Size of dump (and hence the duration of dumping)
>>>>>    4) exclude page structures for unused pages
>>>>>
>>>>>
>>>>> 1) Is quite easy.  The cache of pages needs to be aligned on a block
>>>>>    boundary and written in block multiples, as required by O_DIRECT files.
>>>>>
>>>>>    The use of raw i/o prevents the growing of the crash kernel's page
>>>>>    cache.
>
> Today I posted V2 of both patches.  V2 of the first patch fixes a bug.
> V2 of the second patch make some of the changes that you and Hatayama-san
> requested.  But these updates don't address all of your points.
>
>
>>>> There is no reason to reject this idea, please re-post it as a formal patch.
>>>> If possible, I would like to know the benefit of only this.
>>>
>>> The motivation for using raw i/o was purely to be able to conserve memory,
>>> not for speed.
>
>> OK, 1) is also for removing cyclic mode, right ?
>
> I did disable cyclic mode for my testing.  I wanted to prove that makedumpfile
> can work in a small memory without cyclic mode.
> I think this is an alternative to cyclic mode, but I don't know all the
> issues.  This is a proof of concept only -- I hope that you guys who have
> the big picture of all the dump-capture issues can fit it in properly.
>
>> I think there is no need to conserve memory with 1) since 2) is enough to
>> remove cyclic mode.
>> (To be exact, there are some cases that we have to use cyclic mode as
>>   Hatayama-san said, but I don't mention that in this mail.)
>>
>>> However, I haven't noticed any significant degradation in speed.
>>> Memory is in 'very' short supply on a large machine (ironically) and a 2.6 or
>>> 3.0 kernel.  We're constrained to the low 4GB, and the kernel is putting other
>>> things in that memory that are related to memory size.
>>> The obvious solution is cyclic mode, but that requires at least 2x the page
>>> scans.  Once for the scan of unnecessary pages and several partial
>>> scans for the copy phase.
>>> But it is tmpfs and kernel page cache that are using up available memory.
>>> If we avoid those, a single page scan can work in about 350M of crashkernel
>>> memory.
>>> This is not a problem with 3.10+ kernels as we're not constrained to low 4G.
>>
>> Even if we can use 350M fully, 5TB is the limit system memory size
>> in non-cyclic mode unless 2), since the bitmap file requires 64MB
>> per 1TB RAM. So, I can't find an importance of 1).
>
> 1) raw i/o for writing the dump
> 2) use root device for the bitmap file (not tmpfs)
> 3) raw i/o for reading/writing the bitmaps
> Non-raw i/o for any of these files is going to enlarge the kernel page
> cache.  There doesn't seem to be any way to ask the kernel to limit
> that growth.  And writing to tmpfs is consuming memory.  The one file
> is much larger than the other, but to be consistent and not let i/o
> consume memory I think we have to do all three.
>
>>>>> 2) Is also quite easy.  My patch finds the path to the crash
>>>>>    kernel's root device by examining the dump pathname. Storing the bitmaps
>>>>>    to a file is otherwise not conserving memory, as they are being written
>>>>>    to tmpfs.
>>>>
>>>> Users will expect that the size of dump file is the same as the size of
>>>> RAM at most, they will prepare a disk which fit to save that.
>>>> But 2) breaks this estimation, I worry about it a little.
>>>
>>> The bit map file is very small compared to the dump. And the dump should be
>>> much smaller than RAM.  Particularly with 4), the excluding of unused page structures.
>>>>
>>>> Of course, I don't reject this idea for that reason alone,
>>>> but I would like to know its definite advantage.
>>>> I suppose the improvement shown in your benchmarks came mostly
>>>> from 1) and 4), so could you show whether 2) and 3) alone
>>>> perform much faster than the current cyclic mode?
>>>
>>> 2) and 3), the handling of the bitmap, are small contributors to the
>>> memory shortage issue.  They are a bigger issue the bigger the system.
>>> It's just that if we consistently avoid enlarging page cache and
>>> tmpfs we can avoid the 2nd page scan altogether.
>>> True, my benchmarks show only 0.2 min. and 1.1 min. improvements
>>> for 2TB and 8TB (2.0 vs. 1.8, and 6.6 vs. 5.5 minutes).
>>> But that's an improvement, not a loss.  And we're absolutely
>>> not going to run out of memory as the scan and copies proceed.
>>> This is important on these old kernels with minimal memory available.
>>
>> Does just changing TMPDIR to a disk meet that purpose?
>> Is it necessary to add new code?
> Perhaps.  But page cache is going to grow.

But then what is your remaining concern? If you use a disk of sufficient
size, page cache no longer poses an OOM risk.

On the other hand, another use case for direct I/O that I came up with in
the past was to suppress the performance degradation on multiple CPUs
caused by a large number of TLB flushes. I saw this degradation in a
benchmark I ran some years ago on a kernel around 2.6.30. On those kernels,
flush_tlb_others() was implemented using 8? interrupt vectors (sorry, I no
longer remember the exact number), and with more CPUs than that,
performance no longer scaled.

But I have yet to investigate this use case even now, because I have been
addressing other issues that affect scalability, and I might see different
results on the same benchmark in today's improved environment.

For example, we have mmap() now, so we can choose a larger mapping size
than with ioremap(). This should work well to drastically reduce the
number of TLB flushes.

Also, recent kernels use smp_call_function() to invoke the TLB flush
handler on each CPU, so the flush_tlb_others() path described above no
longer exists there.

So I expect the situation is better than in the past. Still, the effect of
page cache is bigger on multiple CPUs than on a single CPU; that much is
correct. Although I have yet to benchmark the recent environment, I might
still see a distinguishable amount of degradation caused by releasing page
cache. (Conversely speaking, that makes it easy to see how page cache
affects performance on multiple CPUs.)

-- 
Thanks.
HATAYAMA, Daisuke




Thread overview: 16+ messages
2013-12-31 23:30 [PATCH 0/2] makedumpfile: for large memories cpw
2013-12-31 23:34 ` [PATCH 1/2] makedumpfile: raw i/o and use of root device Cliff Wickman
2013-12-31 23:36 ` [PATCH 2/2] makedumpfile: exclude unused vmemmap pages Cliff Wickman
2014-01-06  9:27 ` [PATCH 0/2] makedumpfile: for large memories Atsushi Kumagai
2014-01-09  0:25   ` Cliff Wickman
2014-01-10  7:48     ` Atsushi Kumagai
2014-01-10 18:23       ` Cliff Wickman
2014-01-14 12:59         ` HATAYAMA Daisuke
2014-01-07 10:14 ` HATAYAMA Daisuke
2014-01-10 17:58 ` [PATCH 1/2 V2] raw i/o and root device to use less memory Cliff Wickman
2014-01-13  9:58   ` Michael Holzheu
2014-01-13 13:30     ` Cliff Wickman
2014-01-13 15:02       ` Michael Holzheu
2014-01-10 18:00 ` [PATCH 2/2 V2] exclude unused vmemmap pages Cliff Wickman
2014-01-14 11:33 ` [PATCH 0/2] makedumpfile: for large memories HATAYAMA Daisuke
     [not found] <mailman.19254.1389376849.1059.kexec@lists.infradead.org>
2014-01-10 19:17 ` [PATCH 2/2 V2] exclude unused vmemmap pages Dave Anderson
