* [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region
@ 2013-01-10 11:59 HATAYAMA Daisuke
2013-01-10 11:59 ` [RFC PATCH v1 1/3] vmcore: Add function to merge memory mapping of vmcore HATAYAMA Daisuke
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: HATAYAMA Daisuke @ 2013-01-10 11:59 UTC (permalink / raw)
To: ebiederm, vgoyal, cpw, kumagai-atsushi, lisa.mitchell; +Cc: kexec, linux-kernel
Currently, kdump reads the 1st kernel's memory, called old memory in
the source code, using ioremap one page at a time. This causes a big
performance degradation, since page table modification and a TLB
flush happen every time a single page is read.
This issue came to light through Cliff's kernel-space filtering work.
To avoid calling ioremap, we map the whole of the 1st kernel's memory
targeted as vmcore regions in the direct mapping table. This gives a
big performance improvement; see the following simple benchmark.
Machine spec:
| CPU | Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz (4 sockets, 8 cores) (*) |
| Memory | 32 GB |
| Kernel | 3.7 vanilla and with this patch set |
(*) only 1 CPU is used in the 2nd kernel now.
Benchmark:
I executed the following commands on the 2nd kernel and recorded real
time.
$ time dd bs=$((4096 * n)) if=/proc/vmcore of=/dev/null
[3.7 vanilla]
| block size | time | performance |
| [KB] | | [MB/sec] |
|------------+-----------+-------------|
| 4 | 5m 46.97s | 93.56 |
| 8 | 4m 20.68s | 124.52 |
| 16 | 3m 37.85s | 149.01 |
[3.7 with this patch]
| block size | time | performance |
| [KB] | | [GB/sec] |
|------------+--------+-------------|
| 4 | 17.59s | 1.85 |
| 8 | 14.73s | 2.20 |
| 16 | 14.26s | 2.28 |
| 32 | 13.38s | 2.43 |
| 64 | 12.77s | 2.54 |
| 128 | 12.41s | 2.62 |
| 256 | 12.50s | 2.60 |
| 512 | 12.37s | 2.62 |
| 1024 | 12.30s | 2.65 |
| 2048 | 12.29s | 2.64 |
| 4096 | 12.32s | 2.63 |
[perf bench]
I also ran perf bench mem memcpy -o on the 2nd kernel like:
# /var/crash/perf bench mem memcpy -o -l 128MB
# Running mem/memcpy benchmark...
# Copying 128MB Bytes ...
2.854337 GB/Sec (with prefault)
Several trials stably showed around 2.85 [GB/Sec].
Notes:
* Why direct mapping region
I chose the direct mapping region because this address space is 64TB
long, enough to cover the whole of physical memory, while the
vmalloc-and-ioremap region is only 16TB. For some machines with huge
memory, the latter is already problematic.
In the near future, machines with more than 64TB of memory could
appear, but by then the direct mapping space would also be extended
to follow.
* Memory consumption issue on the 2nd kernel
The typical reserved memory size for the 2nd kernel is 512MB. But if
tera-bytes of memory are mapped with 4kB pages, the page table size
amounts to more than a gigabyte.
However, the direct mapping region is mapped using 1GB and 2MB pages.
By this, memory consumption for page tables is minimized in most
cases.
Boot debug message tells you how each map is mapped:
vmcore: [oldmem 0000000027000000-000000002708afff]
vmcore: [oldmem 0000000000100000-0000000026ffffff]
vmcore: [oldmem 0000000037000000-000000007b00cfff]
vmcore: [oldmem 0000000100000000-000000087fffffff]
[mem 0x27000000-0x2708afff] page 4k
[mem 0x00100000-0x001fffff] page 4k
[mem 0x00200000-0x26ffffff] page 2M
[mem 0x37000000-0x7affffff] page 2M
[mem 0x7b000000-0x7b00cfff] page 4k
[mem 0x100000000-0x87fffffff] page 1G
where each [oldmem <start>-<end>] is a mapped region; I have omitted
some other messages.
TODO:
* Use of init_memory_mapping
init_memory_mapping is used to map memory in the direct mapping
region both at boot time and in the memory hot-plug code. It should
be used here too, but as I explain in the patch description, I faced
some page-fault related bugs after it was called during the 2nd
kernel boot, which means the page table mapping is not done
correctly.
As a workaround, I wrote code that constructs the page tables from
scratch, just like Cliff's patch, and it apparently works well now.
But ideally we need to know why init_memory_mapping doesn't work
well. I continue to debug this; suggestions around this are very
welcome. This issue comes purely from my lack of familiarity with
this area (^^;
* Benchmark of Cliff's kernel-space filtering
He has attempted kernel-space filtering in makedumpfile for
performance improvement. I noticed the ioremap issue through this
work of his.
I now think the bad performance is mainly caused by the ioremap
issue. I don't know how much filtering performance is improved by
doing it in kernel-space; I guess the improvement is similar to that
of increasing the block size in the benchmark above.
Anyway, we first need to compare kernel-space filtering with
user-space filtering.
Note that this work is orthogonal to kernel-space filtering and can
proceed separately.
---
HATAYAMA Daisuke (3):
vmcore: read vmcore through direct mapping region
vmcore: map vmcore memory in direct mapping region
vmcore: Add function to merge memory mapping of vmcore
fs/proc/vmcore.c | 420 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 419 insertions(+), 1 deletions(-)
--
Thanks.
HATAYAMA, Daisuke
^ permalink raw reply [flat|nested] 8+ messages in thread
* [RFC PATCH v1 1/3] vmcore: Add function to merge memory mapping of vmcore
2013-01-10 11:59 [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region HATAYAMA Daisuke
@ 2013-01-10 11:59 ` HATAYAMA Daisuke
2013-01-10 11:59 ` [RFC PATCH v1 2/3] vmcore: map vmcore memory in direct mapping region HATAYAMA Daisuke
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: HATAYAMA Daisuke @ 2013-01-10 11:59 UTC (permalink / raw)
To: ebiederm, vgoyal, cpw, kumagai-atsushi, lisa.mitchell; +Cc: kexec, linux-kernel
vmcore_list holds memory map information for the 1st kernel; each
entry represents the position and size of one of the following
objects:
1) NT_PRSTATUS x the number of logical CPUs
2) VMCOREINFO
3) kernel code
4) copy of the first 640kB memory
5) System RAM entries
where in /proc/vmcore, 1) and 2) are visible as a single PT_NOTE
entry, and 5) as PT_LOAD entries.
These mappings are not mutually exclusive. For example, each of 1),
2) and 4) is always contained in one of the System RAM entries.
This patch adds a function, oldmem_merge_vmcore_list, that merges the
ranges represented by vmcore_list and builds the merged list in
oldmem_list.
The ranges represented by oldmem_list will be remapped in the direct
mapping region by the patches that follow.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
---
fs/proc/vmcore.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 83 insertions(+), 0 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 0d5071d..405b5e2 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -27,6 +27,11 @@
*/
static LIST_HEAD(vmcore_list);
+/* Remap chunks of contiguous memory represented by this list in
+ * direct mapping region.
+ */
+static LIST_HEAD(oldmem_list);
+
/* Stores the pointer to the buffer containing kernel elf core headers. */
static char *elfcorebuf;
static size_t elfcorebuf_sz;
@@ -137,6 +142,84 @@ static u64 map_offset_to_paddr(loff_t offset, struct list_head *vc_list,
return 0;
}
+static struct vmcore* __init get_new_element(void);
+
+static int
+oldmem_merge_vmcore_list_one(struct vmcore *r, struct list_head *new_list)
+{
+ unsigned long m_start, m_end, n_start, n_end;
+ struct vmcore _m, *m, *n, *new;
+
+ m = &_m;
+ m->paddr = r->paddr;
+ m->size = r->size;
+ m->offset = r->offset;
+
+retry:
+ list_for_each_entry(n, new_list, list) {
+
+ m_start = m->paddr;
+ m_end = m->paddr + m->size - 1;
+
+ n_start = n->paddr;
+ n_end = n->paddr + n->size - 1;
+
+ /* not mergeable */
+ if (((m_start < n_start) && (m_end < n_start))
+ || ((n_start < m_start) && (n_end < m_start)))
+ continue;
+
+ /* merge n to m */
+ m->paddr = min(m->paddr, n->paddr);
+ m->size = max(m_end, n_end) - min(m_start, n_start) + 1;
+ m->offset = min(m->offset, n->offset);
+
+ /* n is no longer useful, delete it */
+ list_del(&n->list);
+ kfree(n);
+
+ goto retry;
+ }
+
+ /* there's no map in new_list to merge m, create new element */
+ new = get_new_element();
+ if (!new)
+ return -ENOMEM;
+
+ new->paddr = m->paddr;
+ new->size = m->size;
+ new->offset = m->offset;
+
+ list_add_tail(&new->list, new_list);
+
+ return 0;
+}
+
+static int
+oldmem_merge_vmcore_list(struct list_head *vc_list, struct list_head *om_list)
+{
+ struct vmcore *m;
+ int ret;
+
+ list_for_each_entry(m, vc_list, list) {
+ printk("vmcore: [mem %016llx-%016llx]\n",
+ m->paddr, m->paddr + m->size - 1);
+ }
+
+ list_for_each_entry(m, vc_list, list) {
+ ret = oldmem_merge_vmcore_list_one(m, om_list);
+ if (ret < 0)
+ return ret;
+ }
+
+ list_for_each_entry(m, om_list, list) {
+ printk("vmcore: [oldmem %016llx-%016llx]\n",
+ m->paddr, m->paddr + m->size - 1);
+ }
+
+ return 0;
+}
+
/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
* [RFC PATCH v1 2/3] vmcore: map vmcore memory in direct mapping region
2013-01-10 11:59 [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region HATAYAMA Daisuke
2013-01-10 11:59 ` [RFC PATCH v1 1/3] vmcore: Add function to merge memory mapping of vmcore HATAYAMA Daisuke
@ 2013-01-10 11:59 ` HATAYAMA Daisuke
2013-01-10 11:59 ` [RFC PATCH v1 3/3] vmcore: read vmcore through " HATAYAMA Daisuke
2013-01-17 22:13 ` [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in " Vivek Goyal
3 siblings, 0 replies; 8+ messages in thread
From: HATAYAMA Daisuke @ 2013-01-10 11:59 UTC (permalink / raw)
To: ebiederm, vgoyal, cpw, kumagai-atsushi, lisa.mitchell; +Cc: kexec, linux-kernel
Map the memory regions represented by vmcore in the direct mapping
region, where as much memory as possible is mapped using 1GB or 2MB
pages to reduce the memory consumed by page tables.
I reused a large part of init_memory_mapping. In fact, I first tried
to use it directly, but I faced some page-fault related bug that
seems to be caused by this additional mapping. I have not figured out
the cause yet, so I wrote the part that makes the page tables from
scratch, like Cliff's patch.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
---
fs/proc/vmcore.c | 292 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 292 insertions(+), 0 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 405b5e2..aa14570 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,8 @@
#include <linux/list.h>
#include <asm/uaccess.h>
#include <asm/io.h>
+#include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
/* List representing chunks of contiguous memory areas and their offsets in
* vmcore file.
@@ -220,6 +222,290 @@ oldmem_merge_vmcore_list(struct list_head *vc_list, struct list_head *om_list)
return 0;
}
+enum {
+ NR_RANGE_MR = 5,
+};
+
+struct map_range {
+ unsigned long start;
+ unsigned long end;
+ unsigned page_size_mask;
+};
+
+static int save_mr(struct map_range *mr, int nr_range,
+ unsigned long start_pfn, unsigned long end_pfn,
+ unsigned long page_size_mask)
+{
+ if (start_pfn < end_pfn) {
+ if (nr_range >= NR_RANGE_MR)
+ panic("run out of range for init_memory_mapping\n");
+ mr[nr_range].start = start_pfn<<PAGE_SHIFT;
+ mr[nr_range].end = end_pfn<<PAGE_SHIFT;
+ mr[nr_range].page_size_mask = page_size_mask;
+ nr_range++;
+ }
+
+ return nr_range;
+}
+
+static int
+oldmem_align_maps_in_page_size(struct map_range *mr,
+ unsigned long start,
+ unsigned long end)
+{
+ unsigned long page_size_mask = 0;
+ unsigned long start_pfn, end_pfn;
+ unsigned long pos;
+ int use_pse, use_gbpages;
+ int i, nr_range;
+
+#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
+ /*
+ * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
+ * This will simplify cpa(), which otherwise needs to support splitting
+ * large pages into small in interrupt context, etc.
+ */
+ use_pse = use_gbpages = 0;
+#else
+ use_pse = cpu_has_pse;
+ use_gbpages = direct_gbpages;
+#endif
+
+ /* Enable PSE if available */
+ if (cpu_has_pse)
+ set_in_cr4(X86_CR4_PSE);
+
+ /* Enable PGE if available */
+ if (cpu_has_pge) {
+ set_in_cr4(X86_CR4_PGE);
+ __supported_pte_mask |= _PAGE_GLOBAL;
+ }
+
+ if (use_gbpages)
+ page_size_mask |= 1 << PG_LEVEL_1G;
+ if (use_pse)
+ page_size_mask |= 1 << PG_LEVEL_2M;
+
+ memset(mr, 0, NR_RANGE_MR * sizeof(struct map_range));
+ nr_range = 0;
+
+ /* head if not big page alignment ? */
+ start_pfn = start >> PAGE_SHIFT;
+ pos = start_pfn << PAGE_SHIFT;
+#ifdef CONFIG_X86_32
+ /*
+ * Don't use a large page for the first 2/4MB of memory
+ * because there are often fixed size MTRRs in there
+ * and overlapping MTRRs into large pages can cause
+ * slowdowns.
+ */
+ if (pos == 0)
+ end_pfn = 1<<(PMD_SHIFT - PAGE_SHIFT);
+ else
+ end_pfn = ((pos + (PMD_SIZE - 1))>>PMD_SHIFT)
+ << (PMD_SHIFT - PAGE_SHIFT);
+#else /* CONFIG_X86_64 */
+ end_pfn = ((pos + (PMD_SIZE - 1)) >> PMD_SHIFT)
+ << (PMD_SHIFT - PAGE_SHIFT);
+#endif
+ if (end_pfn > (end >> PAGE_SHIFT))
+ end_pfn = end >> PAGE_SHIFT;
+ if (start_pfn < end_pfn) {
+ nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
+ pos = end_pfn << PAGE_SHIFT;
+ }
+
+ /* big page (2M) range */
+ start_pfn = ((pos + (PMD_SIZE - 1))>>PMD_SHIFT)
+ << (PMD_SHIFT - PAGE_SHIFT);
+#ifdef CONFIG_X86_32
+ end_pfn = (end>>PMD_SHIFT) << (PMD_SHIFT - PAGE_SHIFT);
+#else /* CONFIG_X86_64 */
+ end_pfn = ((pos + (PUD_SIZE - 1))>>PUD_SHIFT)
+ << (PUD_SHIFT - PAGE_SHIFT);
+ if (end_pfn > ((end>>PMD_SHIFT)<<(PMD_SHIFT - PAGE_SHIFT)))
+ end_pfn = ((end>>PMD_SHIFT)<<(PMD_SHIFT - PAGE_SHIFT));
+#endif
+
+ if (start_pfn < end_pfn) {
+ nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
+ page_size_mask & (1<<PG_LEVEL_2M));
+ pos = end_pfn << PAGE_SHIFT;
+ }
+
+#ifdef CONFIG_X86_64
+ /* big page (1G) range */
+ start_pfn = ((pos + (PUD_SIZE - 1))>>PUD_SHIFT)
+ << (PUD_SHIFT - PAGE_SHIFT);
+ end_pfn = (end >> PUD_SHIFT) << (PUD_SHIFT - PAGE_SHIFT);
+ if (start_pfn < end_pfn) {
+ nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
+ page_size_mask &
+ ((1<<PG_LEVEL_2M)|(1<<PG_LEVEL_1G)));
+ pos = end_pfn << PAGE_SHIFT;
+ }
+
+ /* tail is not big page (1G) alignment */
+ start_pfn = ((pos + (PMD_SIZE - 1))>>PMD_SHIFT)
+ << (PMD_SHIFT - PAGE_SHIFT);
+ end_pfn = (end >> PMD_SHIFT) << (PMD_SHIFT - PAGE_SHIFT);
+ if (start_pfn < end_pfn) {
+ nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
+ page_size_mask & (1<<PG_LEVEL_2M));
+ pos = end_pfn << PAGE_SHIFT;
+ }
+#endif
+
+ /* tail is not big page (2M) alignment */
+ start_pfn = pos>>PAGE_SHIFT;
+ end_pfn = end>>PAGE_SHIFT;
+ nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
+
+ /* try to merge same page size and continuous */
+ for (i = 0; nr_range > 1 && i < nr_range - 1; i++) {
+ unsigned long old_start;
+ if (mr[i].end != mr[i+1].start ||
+ mr[i].page_size_mask != mr[i+1].page_size_mask)
+ continue;
+ /* move it */
+ old_start = mr[i].start;
+ memmove(&mr[i], &mr[i+1],
+ (nr_range - 1 - i) * sizeof(struct map_range));
+ mr[i--].start = old_start;
+ nr_range--;
+ }
+
+ return nr_range;
+}
+
+static int
+oldmem_physical_mapping_init(unsigned long start, unsigned long end,
+ unsigned long page_size_mask)
+{
+ unsigned long paddr, vaddr, hpagesize;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ pgprot_t prot;
+ unsigned long pages_1G, pages_2M, pages_4K;
+ unsigned long pages_PUD, pages_PMD, pages_PTE;
+
+ if (page_size_mask & (1 << PG_LEVEL_1G)) {
+ hpagesize = PUD_SIZE;
+ prot = PAGE_KERNEL_LARGE;
+ } else if (page_size_mask & (1 << PG_LEVEL_2M)) {
+ hpagesize = PMD_SIZE;
+ prot = PAGE_KERNEL_LARGE;
+ } else {
+ hpagesize = PAGE_SIZE;
+ prot = PAGE_KERNEL;
+ }
+
+ paddr = start;
+ vaddr = (unsigned long)__va(start);
+
+ pages_1G = 0;
+ pages_2M = 0;
+ pages_4K = 0;
+
+ pages_PUD = 0;
+ pages_PMD = 0;
+ pages_PTE = 0;
+
+ while (paddr < end) {
+ pgd = pgd_offset_k(vaddr);
+ if (!pgd_present(*pgd)) {
+ pud = pud_alloc_one(&init_mm, vaddr);
+ set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+ pages_PUD++;
+ }
+ pud = pud_offset(pgd, vaddr);
+ if (page_size_mask & (1 << PG_LEVEL_1G)) {
+ set_pud(pud, __pud(paddr | pgprot_val(prot)));
+ pages_1G++;
+ } else {
+ if (!pud_present(*pud)) {
+ pmd = pmd_alloc_one(&init_mm, vaddr);
+ set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+ pages_PMD++;
+ }
+ pmd = pmd_offset(pud, vaddr);
+ if (page_size_mask & (1 << PG_LEVEL_2M)) {
+ set_pmd(pmd, __pmd(paddr | pgprot_val(prot)));
+ pages_2M++;
+ } else {
+ if (!pmd_present(*pmd)) {
+ pte = pte_alloc_one_kernel(&init_mm, vaddr);
+ set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
+ pages_PTE++;
+ }
+ pte = pte_offset_kernel(pmd, vaddr);
+ set_pte(pte, __pte(paddr | pgprot_val(prot)));
+ pages_4K++;
+ }
+ }
+ if (end - paddr < hpagesize)
+ break;
+ paddr += hpagesize;
+ vaddr += hpagesize;
+ }
+
+ update_page_count(PG_LEVEL_1G, pages_1G);
+ update_page_count(PG_LEVEL_2M, pages_2M);
+ update_page_count(PG_LEVEL_4K, pages_4K);
+
+ printk("vmcore: PUD pages: %lu\n", pages_PUD);
+ printk("vmcore: PMD pages: %lu\n", pages_PMD);
+ printk("vmcore: PTE pages: %lu\n", pages_PTE);
+
+ __flush_tlb_all();
+
+ return 0;
+}
+
+static void init_old_memory_mapping(unsigned long start, unsigned long end)
+{
+ struct map_range mr[NR_RANGE_MR];
+ int i, ret, nr_range;
+
+ nr_range = oldmem_align_maps_in_page_size(mr, start, end);
+
+ for (i = 0; i < nr_range; i++)
+ printk(KERN_DEBUG " [mem %#010lx-%#010lx] page %s\n",
+ mr[i].start, mr[i].end - 1,
+ (mr[i].page_size_mask & (1<<PG_LEVEL_1G))?"1G":(
+ (mr[i].page_size_mask & (1<<PG_LEVEL_2M))?"2M":"4k"));
+
+ for (i = 0; i < nr_range; i++)
+ ret = oldmem_physical_mapping_init(mr[i].start,
+ mr[i].end,
+ mr[i].page_size_mask);
+
+ __flush_tlb_all();
+}
+
+static int oldmem_init(struct list_head *vc_list, struct list_head *om_list)
+{
+ struct vmcore *m;
+ int ret;
+
+ ret = oldmem_merge_vmcore_list(vc_list, om_list);
+ if (ret < 0)
+ return ret;
+
+ list_for_each_entry(m, om_list, list) {
+ unsigned long start, end;
+
+ start = (m->paddr >> PAGE_SHIFT) << PAGE_SHIFT;
+ end = ((m->paddr + m->size + PAGE_SIZE - 1) >> PAGE_SHIFT) << PAGE_SHIFT;
+
+ init_old_memory_mapping(start, end);
+ }
+
+ return 0;
+}
+
/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
@@ -777,6 +1063,12 @@ static int __init vmcore_init(void)
return rc;
}
+ rc = oldmem_init(&vmcore_list, &oldmem_list);
+ if (rc) {
+ printk(KERN_WARNING "Kdump: failed to map vmcore\n");
+ return rc;
+ }
+
proc_vmcore = proc_create("vmcore", S_IRUSR, NULL, &proc_vmcore_operations);
if (proc_vmcore)
proc_vmcore->size = vmcore_size;
* [RFC PATCH v1 3/3] vmcore: read vmcore through direct mapping region
2013-01-10 11:59 [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region HATAYAMA Daisuke
2013-01-10 11:59 ` [RFC PATCH v1 1/3] vmcore: Add function to merge memory mapping of vmcore HATAYAMA Daisuke
2013-01-10 11:59 ` [RFC PATCH v1 2/3] vmcore: map vmcore memory in direct mapping region HATAYAMA Daisuke
@ 2013-01-10 11:59 ` HATAYAMA Daisuke
2013-01-17 22:13 ` [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in " Vivek Goyal
3 siblings, 0 replies; 8+ messages in thread
From: HATAYAMA Daisuke @ 2013-01-10 11:59 UTC (permalink / raw)
To: ebiederm, vgoyal, cpw, kumagai-atsushi, lisa.mitchell; +Cc: kexec, linux-kernel
Now the regions represented by vmcore are mapped through the direct
mapping region, and we read the requested memory through the direct
mapping region instead of using ioremap.
Note that we still keep read_from_oldmem, which uses ioremap, because
we need it when reading the ELF headers to build vmcore_list during
vmcore initialization.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
---
fs/proc/vmcore.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 44 insertions(+), 1 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index aa14570..1c6259e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -123,6 +123,49 @@ static ssize_t read_from_oldmem(char *buf, size_t count,
return read;
}
+/* Reads a page from the oldmem device from given offset. */
+static ssize_t read_from_oldmem_noioremap(char *buf, size_t count,
+ u64 *ppos, int userbuf)
+{
+ unsigned long pfn, offset;
+ size_t nr_bytes;
+ ssize_t read = 0;
+
+ if (!count)
+ return 0;
+
+ offset = (unsigned long)(*ppos % PAGE_SIZE);
+ pfn = (unsigned long)(*ppos / PAGE_SIZE);
+
+ do {
+ if (count > (PAGE_SIZE - offset))
+ nr_bytes = PAGE_SIZE - offset;
+ else
+ nr_bytes = count;
+
+ /* If pfn is not ram, return zeros for sparse dump files */
+ if (pfn_is_ram(pfn) == 0)
+ memset(buf, 0, nr_bytes);
+ else {
+ void *vaddr = pfn_to_kaddr(pfn);
+
+ if (userbuf) {
+ if (copy_to_user(buf, vaddr + offset, nr_bytes))
+ return -EFAULT;
+ } else
+ memcpy(buf, vaddr + offset, nr_bytes);
+ }
+ *ppos += nr_bytes;
+ count -= nr_bytes;
+ buf += nr_bytes;
+ read += nr_bytes;
+ ++pfn;
+ offset = 0;
+ } while (count);
+
+ return read;
+}
+
/* Maps vmcore file offset to respective physical address in memroy. */
static u64 map_offset_to_paddr(loff_t offset, struct list_head *vc_list,
struct vmcore **m_ptr)
@@ -553,7 +596,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
tsz = nr_bytes;
while (buflen) {
- tmp = read_from_oldmem(buffer, tsz, &start, 1);
+ tmp = read_from_oldmem_noioremap(buffer, tsz, &start, 1);
if (tmp < 0)
return tmp;
buflen -= tsz;
* Re: [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region
2013-01-10 11:59 [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region HATAYAMA Daisuke
` (2 preceding siblings ...)
2013-01-10 11:59 ` [RFC PATCH v1 3/3] vmcore: read vmcore through " HATAYAMA Daisuke
@ 2013-01-17 22:13 ` Vivek Goyal
2013-01-18 14:06 ` HATAYAMA Daisuke
3 siblings, 1 reply; 8+ messages in thread
From: Vivek Goyal @ 2013-01-17 22:13 UTC (permalink / raw)
To: HATAYAMA Daisuke
Cc: ebiederm, cpw, kumagai-atsushi, lisa.mitchell, kexec,
linux-kernel
On Thu, Jan 10, 2013 at 08:59:34PM +0900, HATAYAMA Daisuke wrote:
> Currently, kdump reads the 1st kernel's memory, called old memory in
> the source code, using ioremap per a single page. This causes big
> performance degradation since page tables modification and tlb flush
> happen each time the single page is read.
>
> This issue turned out from Cliff's kernel-space filtering work.
>
> To avoid calling ioremap, we map a whole 1st kernel's memory targeted
> as vmcore regions in direct mapping table. By this we got big
> performance improvement. See the following simple benchmark.
>
> Machine spec:
>
> | CPU | Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz (4 sockets, 8 cores) (*) |
> | Memory | 32 GB |
> | Kernel | 3.7 vanilla and with this patch set |
>
> (*) only 1 cpu is used in the 2nd kenrel now.
>
> Benchmark:
>
> I executed the following commands on the 2nd kernel and recorded real
> time.
>
> $ time dd bs=$((4096 * n)) if=/proc/vmcore of=/dev/null
>
> [3.7 vanilla]
>
> | block size | time | performance |
> | [KB] | | [MB/sec] |
> |------------+-----------+-------------|
> | 4 | 5m 46.97s | 93.56 |
> | 8 | 4m 20.68s | 124.52 |
> | 16 | 3m 37.85s | 149.01 |
>
> [3.7 with this patch]
>
> | block size | time | performance |
> | [KB] | | [GB/sec] |
> |------------+--------+-------------|
> | 4 | 17.59s | 1.85 |
> | 8 | 14.73s | 2.20 |
> | 16 | 14.26s | 2.28 |
> | 32 | 13.38s | 2.43 |
> | 64 | 12.77s | 2.54 |
> | 128 | 12.41s | 2.62 |
> | 256 | 12.50s | 2.60 |
> | 512 | 12.37s | 2.62 |
> | 1024 | 12.30s | 2.65 |
> | 2048 | 12.29s | 2.64 |
> | 4096 | 12.32s | 2.63 |
>
These are impressive improvements. I missed the discussion on
mmap(). So why couldn't we provide an mmap() interface for
/proc/vmcore? If that works, then the application can select to
mmap/unmap bigger chunks of the file (instead of ioremap
mapping/remapping a page at a time).
And if the application controls the size of the mapping, then it can
vary the mapping size based on the available amount of free memory.
That way, if somebody reserves a smaller amount of memory, we could
still dump, but with some time penalty.
Thanks
Vivek
Thanks
Vivek
* Re: [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region
2013-01-17 22:13 ` [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in " Vivek Goyal
@ 2013-01-18 14:06 ` HATAYAMA Daisuke
2013-01-18 20:54 ` Vivek Goyal
0 siblings, 1 reply; 8+ messages in thread
From: HATAYAMA Daisuke @ 2013-01-18 14:06 UTC (permalink / raw)
To: vgoyal; +Cc: kexec, linux-kernel, lisa.mitchell, kumagai-atsushi, ebiederm,
cpw
From: Vivek Goyal <vgoyal@redhat.com>
Subject: Re: [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region
Date: Thu, 17 Jan 2013 17:13:48 -0500
> On Thu, Jan 10, 2013 at 08:59:34PM +0900, HATAYAMA Daisuke wrote:
>> Currently, kdump reads the 1st kernel's memory, called old memory in
>> the source code, using ioremap per a single page. This causes big
>> performance degradation since page tables modification and tlb flush
>> happen each time the single page is read.
>>
>> This issue turned out from Cliff's kernel-space filtering work.
>>
>> To avoid calling ioremap, we map a whole 1st kernel's memory targeted
>> as vmcore regions in direct mapping table. By this we got big
>> performance improvement. See the following simple benchmark.
>>
>> Machine spec:
>>
>> | CPU | Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz (4 sockets, 8 cores) (*) |
>> | Memory | 32 GB |
>> | Kernel | 3.7 vanilla and with this patch set |
>>
>> (*) only 1 cpu is used in the 2nd kenrel now.
>>
>> Benchmark:
>>
>> I executed the following commands on the 2nd kernel and recorded real
>> time.
>>
>> $ time dd bs=$((4096 * n)) if=/proc/vmcore of=/dev/null
>>
>> [3.7 vanilla]
>>
>> | block size | time | performance |
>> | [KB] | | [MB/sec] |
>> |------------+-----------+-------------|
>> | 4 | 5m 46.97s | 93.56 |
>> | 8 | 4m 20.68s | 124.52 |
>> | 16 | 3m 37.85s | 149.01 |
>>
>> [3.7 with this patch]
>>
>> | block size | time | performance |
>> | [KB] | | [GB/sec] |
>> |------------+--------+-------------|
>> | 4 | 17.59s | 1.85 |
>> | 8 | 14.73s | 2.20 |
>> | 16 | 14.26s | 2.28 |
>> | 32 | 13.38s | 2.43 |
>> | 64 | 12.77s | 2.54 |
>> | 128 | 12.41s | 2.62 |
>> | 256 | 12.50s | 2.60 |
>> | 512 | 12.37s | 2.62 |
>> | 1024 | 12.30s | 2.65 |
>> | 2048 | 12.29s | 2.64 |
>> | 4096 | 12.32s | 2.63 |
>>
>
> These are impressive improvements. I missed the discussion on mmap().
> So why couldn't we provide mmap() interface for /proc/vmcore. If that
> works then application can select to mmap/unmap bigger chunks of file
> (instead ioremap mapping/remapping a page at a time).
>
> And if application controls the size of mapping, then it can vary the
> size of mapping based on available amount of free memory. That way if
> somebody reserves less amount of memory, we could still dump but with
> some time penalty.
>
mmap() needs user-space page tables in addition to the kernel-space
ones, and it looks like remap_pfn_range(), which creates the
user-space page tables, supports only 4KB pages, not large pages. If
we mmap only small chunks to keep memory consumption small, then we
would again face the same issue as with ioremap. I don't know whether
hugetlbfs supports mmap with 1GB pages now.
Another idea to reduce the size of the page tables is to extend the
mapping ranges to cover the whole of memory with as many 1GB pages as
possible. For example, suppose M is the size of system memory; then
the total size of the PGD and PUD pages needed to cover M is:

  ( 1 + roundup(M, 512GB) / 512GB ) * PAGE_SIZE
    ^   ^
    |   |
    |   PUD pages
    PGD page

Ideally, a 2TB system can be covered with 20KB and a 16TB system with
132KB only. So I first want to evaluate this logic. Although I have
not confirmed it on real machines yet, I expect most memory maps on
tera-byte memory machines consist of 1GB-aligned huge chunks.
Thanks.
HATAYAMA, Daisuke
* Re: [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region
2013-01-18 14:06 ` HATAYAMA Daisuke
@ 2013-01-18 20:54 ` Vivek Goyal
2013-01-21 6:56 ` HATAYAMA Daisuke
0 siblings, 1 reply; 8+ messages in thread
From: Vivek Goyal @ 2013-01-18 20:54 UTC (permalink / raw)
To: HATAYAMA Daisuke
Cc: kexec, linux-kernel, lisa.mitchell, kumagai-atsushi, ebiederm,
cpw, Rik Van Riel
On Fri, Jan 18, 2013 at 11:06:59PM +0900, HATAYAMA Daisuke wrote:
[..]
> > These are impressive improvements. I missed the discussion on mmap().
> > So why couldn't we provide mmap() interface for /proc/vmcore. If that
> > works then application can select to mmap/unmap bigger chunks of file
> > (instead ioremap mapping/remapping a page at a time).
> >
> > And if application controls the size of mapping, then it can vary the
> > size of mapping based on available amount of free memory. That way if
> > somebody reserves less amount of memory, we could still dump but with
> > some time penalty.
> >
>
> mmap() needs user-space page table in addition to kernel-space's,
[ CC Rik van Riel]
I was chatting with Rik, and it does not look like there is any
fundamental requirement that a range of pfns being mapped in user
tables has to be mapped in kernel tables too. Did you run into a
specific issue?
> and
> it looks that remap_pfn_range() that creates the user-space page
> table, doesn't support large pages, only 4KB pages.
This indeed looks like the case. Maybe we can enhance
remap_pfn_range() to take an argument and create larger mappings.
> If mmaping small
> chunks only for small memory programming, then we would again face the
> same issue as with ioremap.
Even if it is 4KB pages, I think it will still be faster than the
current interface, because we will not be issuing that many TLB
flushes (assuming makedumpfile has been modified to map/unmap large
areas of /proc/vmcore).
Thanks
Vivek
* Re: [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region
2013-01-18 20:54 ` Vivek Goyal
@ 2013-01-21 6:56 ` HATAYAMA Daisuke
0 siblings, 0 replies; 8+ messages in thread
From: HATAYAMA Daisuke @ 2013-01-21 6:56 UTC (permalink / raw)
To: vgoyal
Cc: riel, kexec, linux-kernel, lisa.mitchell, kumagai-atsushi,
ebiederm, cpw
From: Vivek Goyal <vgoyal@redhat.com>
Subject: Re: [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region
Date: Fri, 18 Jan 2013 15:54:13 -0500
> On Fri, Jan 18, 2013 at 11:06:59PM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>> > These are impressive improvements. I missed the discussion on mmap().
>> > So why couldn't we provide mmap() interface for /proc/vmcore. If that
>> > works then application can select to mmap/unmap bigger chunks of file
>> > (instead ioremap mapping/remapping a page at a time).
>> >
>> > And if application controls the size of mapping, then it can vary the
>> > size of mapping based on available amount of free memory. That way if
>> > somebody reserves less amount of memory, we could still dump but with
>> > some time penalty.
>> >
>>
>> mmap() needs user-space page table in addition to kernel-space's,
>
> [ CC Rik van Riel]
>
> I was chatting with Rik and it does not look like that there is any
> fundamental requirement that range of pfn being mapped in user tables
> has to be mapped in kernel tables too. Did you run into specific issue.
>
No, I was simply confused about this.
>> and
>> it looks that remap_pfn_range() that creates the user-space page
>> table, doesn't support large pages, only 4KB pages.
>
> This indeed looks like the case. May be we can enahnce remap_pfn_range()
> to take an argument and create larger size mappings.
>
Adding a new argument to remap_pfn_range() would not easily be
accepted, because it changes the function's signature, and it is
exported to modules.
As init_memory_mapping does, it should internally and automatically
divide a given range of kernel address space into properly aligned
pieces and then remap them.
Also, if we extend this in the future, we need some way for userland
to know whether a given kernel can use 2MB/1GB pages for remapping;
makedumpfile needs to estimate how much memory is required for the
remapping.
>> If mmaping small
>> chunks only for small memory programming, then we would again face the
>> same issue as with ioremap.
>
> Even if it is 4KB pages, I think it will still be faster than current
> interface. Because we will not be issuing these many tlb flushes.
> (Assuming makedumpfile has been modified to map/unap large areas of
> /proc/vmcore).
>
OK, I'll go in this direction first. From my local investigation, I'm
beginning to think that my idea of mapping whole DIMM ranges in the
direct mapping region is difficult due to some memory hot-plug
issues, and that an mmap interface is more useful than keeping the
page table handling in /proc/vmcore when we process /proc/vmcore in
parallel, with each process reading a different range.
Assuming we can use 4KB pages only, if we use a 1MB buffer for page
tables we can cover about a 500MB memory region. Remapping a 1TB dump
is then done about 2000 times, while in the ioremap case it is done
268435456 times. Performance should improve very much. We should
benchmark this first.
Thanks.
HATAYAMA, Daisuke
end of thread, other threads:[~2013-01-21 6:57 UTC | newest]
Thread overview: 8+ messages
-- links below jump to the message on this page --
2013-01-10 11:59 [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region HATAYAMA Daisuke
2013-01-10 11:59 ` [RFC PATCH v1 1/3] vmcore: Add function to merge memory mapping of vmcore HATAYAMA Daisuke
2013-01-10 11:59 ` [RFC PATCH v1 2/3] vmcore: map vmcore memory in direct mapping region HATAYAMA Daisuke
2013-01-10 11:59 ` [RFC PATCH v1 3/3] vmcore: read vmcore through " HATAYAMA Daisuke
2013-01-17 22:13 ` [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in " Vivek Goyal
2013-01-18 14:06 ` HATAYAMA Daisuke
2013-01-18 20:54 ` Vivek Goyal
2013-01-21 6:56 ` HATAYAMA Daisuke