public inbox for linux-kernel@vger.kernel.org
* [PATCH 1/4] memblock: Fix big size with find_region()
       [not found] <4CAA4BD5.4020505@kernel.org>
@ 2010-10-04 21:57 ` Yinghai Lu
  2010-10-06  6:28   ` [tip:core/memblock] memblock: Fix wraparound in find_region() tip-bot for Yinghai Lu
  2010-10-04 21:57 ` [PATCH 2/4] x86, memblock: Fix crashkernel allocation Yinghai Lu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Yinghai Lu @ 2010-10-04 21:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Benjamin Herrenschmidt
  Cc: linux-kernel@vger.kernel.org, Jeremy Fitzhardinge, Vivek Goyal


When trying to find huge range for crashkernel, get

[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: at arch/x86/mm/memblock.c:248 memblock_x86_reserve_range+0x40/0x7a()
[    0.000000] Hardware name: Sun Fire x4800
[    0.000000] memblock_x86_reserve_range: wrong range [0xffffffff37000000, 0x137000000)
[    0.000000] Modules linked in:
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.36-rc5-tip-yh-01876-g1cac214-dirty #59
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff82816f7e>] ? memblock_x86_reserve_range+0x40/0x7a
[    0.000000]  [<ffffffff81078c2d>] warn_slowpath_common+0x85/0x9e
[    0.000000]  [<ffffffff81078d38>] warn_slowpath_fmt+0x6e/0x70
[    0.000000]  [<ffffffff8281e77c>] ? memblock_find_region+0x40/0x78
[    0.000000]  [<ffffffff8281eb1f>] ? memblock_find_base+0x9a/0xb9
[    0.000000]  [<ffffffff82816f7e>] memblock_x86_reserve_range+0x40/0x7a
[    0.000000]  [<ffffffff8280452c>] setup_arch+0x99d/0xb2a
[    0.000000]  [<ffffffff810a3e02>] ? trace_hardirqs_off+0xd/0xf
[    0.000000]  [<ffffffff81cec7d8>] ? _raw_spin_unlock_irqrestore+0x3d/0x4c
[    0.000000]  [<ffffffff827ffcec>] start_kernel+0xde/0x3f1
[    0.000000]  [<ffffffff827ff2d4>] x86_64_start_reservations+0xa0/0xa4
[    0.000000]  [<ffffffff827ff3de>] x86_64_start_kernel+0x106/0x10d
[    0.000000] ---[ end trace a7919e7f17c0a725 ]---
[    0.000000] Reserving 8192MB of memory at 17592186041200MB for crashkernel (System RAM: 526336MB)

Because memblock_find_region() cannot handle size > end, the unsigned
subtraction end - size wraps around and base is set to a huge number.

Check size against end up front and bail out when the request cannot fit.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 mm/memblock.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/memblock.c
===================================================================
--- linux-2.6.orig/mm/memblock.c
+++ linux-2.6/mm/memblock.c
@@ -105,13 +105,18 @@ static phys_addr_t __init memblock_find_
 	phys_addr_t base, res_base;
 	long j;
 
+	/* In case, huge size is requested */
+	if (end < size)
+		return MEMBLOCK_ERROR;
+
+	base = memblock_align_down((end - size), align);
+
 	/* Prevent allocations returning 0 as it's also used to
 	 * indicate an allocation failure
 	 */
 	if (start == 0)
 		start = PAGE_SIZE;
 
-	base = memblock_align_down((end - size), align);
 	while (start <= base) {
 		j = memblock_overlaps_region(&memblock.reserved, base, size);
 		if (j < 0)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 2/4] x86, memblock: Fix crashkernel allocation
       [not found] <4CAA4BD5.4020505@kernel.org>
  2010-10-04 21:57 ` [PATCH 1/4] memblock: Fix big size with find_region() Yinghai Lu
@ 2010-10-04 21:57 ` Yinghai Lu
  2010-10-05 21:15   ` H. Peter Anvin
  2010-10-05 22:29   ` H. Peter Anvin
  2010-10-04 21:58 ` [PATCH 3/4] x86, memblock: Remove __memblock_x86_find_in_range_size() Yinghai Lu
  2010-10-04 21:58 ` [PATCH 4/4] x86, mm, memblock, 32bit: Make add_highpages honor early reserved ranges Yinghai Lu
  3 siblings, 2 replies; 22+ messages in thread
From: Yinghai Lu @ 2010-10-04 21:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Benjamin Herrenschmidt
  Cc: linux-kernel@vger.kernel.org, Jeremy Fitzhardinge, Vivek Goyal


Cai Qian found that crashkernel is broken by the x86 memblock changes:
1. crashkernel=128M@32M always reports that the range is in use, even
   when the first kernel is small and nothing uses that range.
2. "kexec -p" always fails with:
	Could not find a free area of memory of a000 bytes...
	locate_hole failed

The root cause is that the generic memblock_find_in_range() allocates from
the top down, but crashkernel needs memory from the bottom of the specified
range.

Limit the target range to crash_base + crash_size to make sure that we get
the range from the bottom.

-v5: use DEFAULT_BZIMAGE_ADDR_MAX to limit the area that can be used by the
     bzImage; also add a second try for vmlinux, or for new kexec tools that
     use the bzImage 64-bit entry.

Reported-and-Bisected-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/kernel/setup.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -501,6 +501,7 @@ static inline unsigned long long get_tot
 	return total << PAGE_SHIFT;
 }
 
+#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
 static void __init reserve_crashkernel(void)
 {
 	unsigned long long total_mem;
@@ -518,17 +519,28 @@ static void __init reserve_crashkernel(v
 	if (crash_base <= 0) {
 		const unsigned long long alignment = 16<<20;	/* 16M */
 
-		crash_base = memblock_find_in_range(alignment, ULONG_MAX, crash_size,
-				 alignment);
+		/*
+		 * Assume half crash_size is for bzImage
+		 *  kexec want bzImage is below DEFAULT_BZIMAGE_ADDR_MAX
+		 */
+		crash_base = memblock_find_in_range(alignment,
+				DEFAULT_BZIMAGE_ADDR_MAX + crash_size/2,
+				crash_size, alignment);
+
 		if (crash_base == MEMBLOCK_ERROR) {
-			pr_info("crashkernel reservation failed - No suitable area found.\n");
-			return;
+			crash_base = memblock_find_in_range(alignment,
+					 ULONG_MAX, crash_size, alignment);
+
+			if (crash_base == MEMBLOCK_ERROR) {
+				pr_info("crashkernel reservation failed - No suitable area found.\n");
+				return;
+			}
 		}
 	} else {
 		unsigned long long start;
 
-		start = memblock_find_in_range(crash_base, ULONG_MAX, crash_size,
-				 1<<20);
+		start = memblock_find_in_range(crash_base,
+				 crash_base + crash_size, crash_size, 1<<20);
 		if (start != crash_base) {
 			pr_info("crashkernel reservation failed - memory is in use.\n");
 			return;


* [PATCH 3/4] x86, memblock: Remove __memblock_x86_find_in_range_size()
       [not found] <4CAA4BD5.4020505@kernel.org>
  2010-10-04 21:57 ` [PATCH 1/4] memblock: Fix big size with find_region() Yinghai Lu
  2010-10-04 21:57 ` [PATCH 2/4] x86, memblock: Fix crashkernel allocation Yinghai Lu
@ 2010-10-04 21:58 ` Yinghai Lu
  2010-10-06  6:29   ` [tip:core/memblock] " tip-bot for Yinghai Lu
  2010-10-04 21:58 ` [PATCH 4/4] x86, mm, memblock, 32bit: Make add_highpages honor early reserved ranges Yinghai Lu
  3 siblings, 1 reply; 22+ messages in thread
From: Yinghai Lu @ 2010-10-04 21:58 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Benjamin Herrenschmidt
  Cc: linux-kernel@vger.kernel.org, Jeremy Fitzhardinge, Vivek Goyal


Fold __memblock_x86_find_in_range_size() into memblock_x86_find_in_range_size(),
and rename bad_addr_size() to check_with_memblock_reserved_size().

This makes the whole memblock_x86_find_in_range_size() path more readable.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/mm/memblock.c |   39 +++++++++++----------------------------
 1 file changed, 11 insertions(+), 28 deletions(-)

Index: linux-2.6/arch/x86/mm/memblock.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/memblock.c
+++ linux-2.6/arch/x86/mm/memblock.c
@@ -8,7 +8,7 @@
 #include <linux/range.h>
 
 /* Check for already reserved areas */
-static inline bool __init bad_addr_size(u64 *addrp, u64 *sizep, u64 align)
+static bool __init check_with_memblock_reserved_size(u64 *addrp, u64 *sizep, u64 align)
 {
 	struct memblock_region *r;
 	u64 addr = *addrp, last;
@@ -30,7 +30,7 @@ again:
 			goto again;
 		}
 		if (last <= (r->base + r->size) && addr >= r->base) {
-			(*sizep)++;
+			*sizep = 0;
 			return false;
 		}
 	}
@@ -41,29 +41,6 @@ again:
 	return changed;
 }
 
-static u64 __init __memblock_x86_find_in_range_size(u64 ei_start, u64 ei_last, u64 start,
-			 u64 *sizep, u64 align)
-{
-	u64 addr, last;
-
-	addr = round_up(ei_start, align);
-	if (addr < start)
-		addr = round_up(start, align);
-	if (addr >= ei_last)
-		goto out;
-	*sizep = ei_last - addr;
-	while (bad_addr_size(&addr, sizep, align) && addr + *sizep <= ei_last)
-		;
-	last = addr + *sizep;
-	if (last > ei_last)
-		goto out;
-
-	return addr;
-
-out:
-	return MEMBLOCK_ERROR;
-}
-
 /*
  * Find next free range after start, and size is returned in *sizep
  */
@@ -76,10 +53,16 @@ u64 __init memblock_x86_find_in_range_si
 		u64 ei_last = ei_start + r->size;
 		u64 addr;
 
-		addr = __memblock_x86_find_in_range_size(ei_start, ei_last, start,
-					 sizep, align);
+		addr = round_up(ei_start, align);
+		if (addr < start)
+			addr = round_up(start, align);
+		if (addr >= ei_last)
+			continue;
+		*sizep = ei_last - addr;
+		while (check_with_memblock_reserved_size(&addr, sizep, align))
+			;
 
-		if (addr != MEMBLOCK_ERROR)
+		if (*sizep)
 			return addr;
 	}
 


* [PATCH 4/4] x86, mm, memblock, 32bit: Make add_highpages honor early reserved ranges
       [not found] <4CAA4BD5.4020505@kernel.org>
                   ` (2 preceding siblings ...)
  2010-10-04 21:58 ` [PATCH 3/4] x86, memblock: Remove __memblock_x86_find_in_range_size() Yinghai Lu
@ 2010-10-04 21:58 ` Yinghai Lu
  2010-10-05 22:50   ` H. Peter Anvin
  3 siblings, 1 reply; 22+ messages in thread
From: Yinghai Lu @ 2010-10-04 21:58 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Benjamin Herrenschmidt
  Cc: linux-kernel@vger.kernel.org, Jeremy Fitzhardinge, Vivek Goyal


Originally the only early reserved range that overlapped the high pages was
"KVA RAM", and we removed it from the active ranges.

It turns out Xen can have that kind of overlap too, to support memory
ballooning.

So make add_highpages_with_active_regions() subtract the memblock reserved
ranges, just like low RAM.

This patch refactors get_free_all_memory_range() so that it can be used by
add_highpages_with_active_regions(). With that, we no longer need to remove
"KVA RAM" from the active ranges.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/include/asm/memblock.h |    2 +
 arch/x86/mm/init_32.c           |   59 ++++++++++++----------------------------
 arch/x86/mm/memblock.c          |   19 ++++++++++--
 arch/x86/mm/numa_32.c           |    2 -
 4 files changed, 36 insertions(+), 46 deletions(-)

Index: linux-2.6/arch/x86/include/asm/memblock.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/memblock.h
+++ linux-2.6/arch/x86/include/asm/memblock.h
@@ -9,6 +9,8 @@ void memblock_x86_to_bootmem(u64 start,
 void memblock_x86_reserve_range(u64 start, u64 end, char *name);
 void memblock_x86_free_range(u64 start, u64 end);
 struct range;
+int __get_free_all_memory_range(struct range **range, int nodeid,
+			 unsigned long start_pfn, unsigned long end_pfn);
 int get_free_all_memory_range(struct range **rangep, int nodeid);
 
 void memblock_x86_register_active_regions(int nid, unsigned long start_pfn,
Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c
+++ linux-2.6/arch/x86/mm/init_32.c
@@ -423,49 +423,28 @@ static void __init add_one_highpage_init
 	totalhigh_pages++;
 }
 
-struct add_highpages_data {
-	unsigned long start_pfn;
-	unsigned long end_pfn;
-};
-
-static int __init add_highpages_work_fn(unsigned long start_pfn,
-					 unsigned long end_pfn, void *datax)
+void __init add_highpages_with_active_regions(int nid,
+			 unsigned long start_pfn, unsigned long end_pfn)
 {
-	int node_pfn;
-	struct page *page;
-	unsigned long final_start_pfn, final_end_pfn;
-	struct add_highpages_data *data;
-
-	data = (struct add_highpages_data *)datax;
-
-	final_start_pfn = max(start_pfn, data->start_pfn);
-	final_end_pfn = min(end_pfn, data->end_pfn);
-	if (final_start_pfn >= final_end_pfn)
-		return 0;
-
-	for (node_pfn = final_start_pfn; node_pfn < final_end_pfn;
-	     node_pfn++) {
-		if (!pfn_valid(node_pfn))
-			continue;
-		page = pfn_to_page(node_pfn);
-		add_one_highpage_init(page);
+	struct range *range;
+	int nr_range;
+	int i;
+
+	nr_range = __get_free_all_memory_range(&range, nid, start_pfn, end_pfn);
+
+	for (i = 0; i < nr_range; i++) {
+		struct page *page;
+		int node_pfn;
+
+		for (node_pfn = range[i].start; node_pfn < range[i].end;
+		     node_pfn++) {
+			if (!pfn_valid(node_pfn))
+				continue;
+			page = pfn_to_page(node_pfn);
+			add_one_highpage_init(page);
+		}
 	}
-
-	return 0;
-
 }
-
-void __init add_highpages_with_active_regions(int nid, unsigned long start_pfn,
-					      unsigned long end_pfn)
-{
-	struct add_highpages_data data;
-
-	data.start_pfn = start_pfn;
-	data.end_pfn = end_pfn;
-
-	work_with_active_regions(nid, add_highpages_work_fn, &data);
-}
-
 #else
 static inline void permanent_kmaps_init(pgd_t *pgd_base)
 {
Index: linux-2.6/arch/x86/mm/memblock.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/memblock.c
+++ linux-2.6/arch/x86/mm/memblock.c
@@ -139,7 +139,8 @@ static int __init count_early_node_map(i
 	return data.nr;
 }
 
-int __init get_free_all_memory_range(struct range **rangep, int nodeid)
+int __init __get_free_all_memory_range(struct range **rangep, int nodeid,
+			 unsigned long start_pfn, unsigned long end_pfn)
 {
 	int count;
 	struct range *range;
@@ -155,9 +156,9 @@ int __init get_free_all_memory_range(str
 	 * at first
 	 */
 	nr_range = add_from_early_node_map(range, count, nr_range, nodeid);
-#ifdef CONFIG_X86_32
-	subtract_range(range, count, max_low_pfn, -1ULL);
-#endif
+	subtract_range(range, count, 0, start_pfn);
+	subtract_range(range, count, end_pfn, -1ULL);
+
 	memblock_x86_subtract_reserved(range, count);
 	nr_range = clean_sort_range(range, count);
 
@@ -165,6 +166,16 @@ int __init get_free_all_memory_range(str
 	return nr_range;
 }
 
+int __init get_free_all_memory_range(struct range **rangep, int nodeid)
+{
+	unsigned long end_pfn = -1ULL;
+
+#ifdef CONFIG_X86_32
+	end_pfn = max_low_pfn;
+#endif
+	return __get_free_all_memory_range(rangep, nodeid, 0, end_pfn);
+}
+
 static u64 __init __memblock_x86_memory_in_range(u64 addr, u64 limit, bool get_free)
 {
 	int i, count;
Index: linux-2.6/arch/x86/mm/numa_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa_32.c
+++ linux-2.6/arch/x86/mm/numa_32.c
@@ -326,8 +326,6 @@ static __init unsigned long calculate_nu
 			      "KVA RAM");
 
 		node_remap_start_pfn[nid] = node_kva_final>>PAGE_SHIFT;
-		remove_active_range(nid, node_remap_start_pfn[nid],
-					 node_remap_start_pfn[nid] + size);
 	}
 	printk(KERN_INFO "Reserving total of %lx pages for numa KVA remap\n",
 			reserve_pages);


* Re: [PATCH 2/4] x86, memblock: Fix crashkernel allocation
  2010-10-04 21:57 ` [PATCH 2/4] x86, memblock: Fix crashkernel allocation Yinghai Lu
@ 2010-10-05 21:15   ` H. Peter Anvin
  2010-10-05 22:29   ` H. Peter Anvin
  1 sibling, 0 replies; 22+ messages in thread
From: H. Peter Anvin @ 2010-10-05 21:15 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Thomas Gleixner, Ingo Molnar, Benjamin Herrenschmidt,
	linux-kernel@vger.kernel.org, Jeremy Fitzhardinge, Vivek Goyal,
	kexec@lists.infradead.org

On 10/04/2010 02:57 PM, Yinghai Lu wrote:
>
> +#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
>  static void __init reserve_crashkernel(void)
>  {
>  	unsigned long long total_mem;
> @@ -518,17 +519,28 @@ static void __init reserve_crashkernel(v
>  	if (crash_base <= 0) {
>  		const unsigned long long alignment = 16<<20;	/* 16M */
>  
> -		crash_base = memblock_find_in_range(alignment, ULONG_MAX, crash_size,
> -				 alignment);
> +		/*
> +		 * Assume half crash_size is for bzImage
> +		 *  kexec want bzImage is below DEFAULT_BZIMAGE_ADDR_MAX
> +		 */
> +		crash_base = memblock_find_in_range(alignment,
> +				DEFAULT_BZIMAGE_ADDR_MAX + crash_size/2,
> +				crash_size, alignment);
> +
>  		if (crash_base == MEMBLOCK_ERROR) {
> -			pr_info("crashkernel reservation failed - No suitable area found.\n");
> -			return;
> +			crash_base = memblock_find_in_range(alignment,
> +					 ULONG_MAX, crash_size, alignment);
> +
> +			if (crash_base == MEMBLOCK_ERROR) {
> +				pr_info("crashkernel reservation failed - No suitable area found.\n");
> +				return;
> +			}
>  		}
>  

Okay, this *really* doesn't make sense.

It's bad enough that kexec doesn't know what memory is safe for it, but
why the heck the heuristic that "half is for bzImage and the rest can go
beyond the heuristic limit"?  Can't we at least simply cap the region to
the default, unless the kexec system has passed in some knowable
alternative?  Furthermore, why bother having the "fallback" at all
(certainly without having a message!?)  If we don't get the memory area
we need we're likely to randomly fail anyway.

Let me be completely clear -- it's obvious from all of this that kexec
is fundamentally broken by design: if kexec can't communicate the safe
memory to use it's busted seven ways to Sunday and it needs to be fixed.
 However, in the meantime I can see capping the memory available to it
as a temporary band-aid, but a fallback to picking random memory is
nuts, especially on the motivation that "a future kexec version might be
able to use it."  If so, the "future kexec tools" should SAY SO.

This is beyond crazy -- it's complete and total bonkers.

	-hpa


* Re: [PATCH 2/4] x86, memblock: Fix crashkernel allocation
  2010-10-04 21:57 ` [PATCH 2/4] x86, memblock: Fix crashkernel allocation Yinghai Lu
  2010-10-05 21:15   ` H. Peter Anvin
@ 2010-10-05 22:29   ` H. Peter Anvin
  2010-10-05 23:05     ` Yinghai Lu
  1 sibling, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2010-10-05 22:29 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Thomas Gleixner, Ingo Molnar, Benjamin Herrenschmidt,
	linux-kernel@vger.kernel.org, Jeremy Fitzhardinge, Vivek Goyal,
	kexec@lists.infradead.org

On 10/04/2010 02:57 PM, Yinghai Lu wrote:
>
> +#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
>  static void __init reserve_crashkernel(void)
>  {
>  	unsigned long long total_mem;
> @@ -518,17 +519,28 @@ static void __init reserve_crashkernel(v
>  	if (crash_base <= 0) {
>  		const unsigned long long alignment = 16<<20;	/* 16M */
>  
> -		crash_base = memblock_find_in_range(alignment, ULONG_MAX, crash_size,
> -				 alignment);
> +		/*
> +		 * Assume half crash_size is for bzImage
> +		 *  kexec want bzImage is below DEFAULT_BZIMAGE_ADDR_MAX
> +		 */
> +		crash_base = memblock_find_in_range(alignment,
> +				DEFAULT_BZIMAGE_ADDR_MAX + crash_size/2,
> +				crash_size, alignment);
> +
>  		if (crash_base == MEMBLOCK_ERROR) {
> -			pr_info("crashkernel reservation failed - No suitable area found.\n");
> -			return;
> +			crash_base = memblock_find_in_range(alignment,
> +					 ULONG_MAX, crash_size, alignment);
> +
> +			if (crash_base == MEMBLOCK_ERROR) {
> +				pr_info("crashkernel reservation failed - No suitable area found.\n");
> +				return;
> +			}
>  		}
>  

Okay, this *really* doesn't make sense.

It's bad enough that kexec doesn't know what memory is safe for it, but
why the heck the heuristic that "half is for bzImage and the rest can go
beyond the heuristic limit"?  Can't we at least simply cap the region to
the default, unless the kexec system has passed in some knowable
alternative?  Furthermore, why bother having the "fallback" at all
(certainly without having a message!?)  If we don't get the memory area
we need we're likely to randomly fail anyway.

Let me be completely clear -- it's obvious from all of this that kexec
is fundamentally broken by design: if kexec can't communicate the safe
memory to use it's busted seven ways to Sunday and it needs to be fixed.
 However, in the meantime I can see capping the memory available to it
as a temporary band-aid, but a fallback to picking random memory is
nuts, especially on the motivation that "a future kexec version might be
able to use it."  If so, the "future kexec tools" should SAY SO.

This is beyond crazy -- it's complete and total bonkers.

	-hpa


* Re: [PATCH 4/4] x86, mm, memblock, 32bit: Make add_highpages honor early reserved ranges
  2010-10-04 21:58 ` [PATCH 4/4] x86, mm, memblock, 32bit: Make add_highpages honor early reserved ranges Yinghai Lu
@ 2010-10-05 22:50   ` H. Peter Anvin
  2010-10-05 23:15     ` Yinghai Lu
  0 siblings, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2010-10-05 22:50 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Thomas Gleixner, Ingo Molnar, Benjamin Herrenschmidt,
	linux-kernel@vger.kernel.org, Jeremy Fitzhardinge, Vivek Goyal

On 10/04/2010 02:58 PM, Yinghai Lu wrote:
>  
> +int __init get_free_all_memory_range(struct range **rangep, int nodeid)
> +{
> +	unsigned long end_pfn = -1ULL;
> +

Uh... I really hope I don't have to explain what's wrong with that...

	-hpa


* Re: [PATCH 2/4] x86, memblock: Fix crashkernel allocation
  2010-10-05 22:29   ` H. Peter Anvin
@ 2010-10-05 23:05     ` Yinghai Lu
  2010-10-06  6:27       ` [tip:core/memblock] " tip-bot for Yinghai Lu
  0 siblings, 1 reply; 22+ messages in thread
From: Yinghai Lu @ 2010-10-05 23:05 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, Ingo Molnar, Benjamin Herrenschmidt,
	linux-kernel@vger.kernel.org, Jeremy Fitzhardinge, Vivek Goyal,
	kexec@lists.infradead.org

On 10/05/2010 03:29 PM, H. Peter Anvin wrote:
> On 10/04/2010 02:57 PM, Yinghai Lu wrote:
>>
>> +#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
>>  static void __init reserve_crashkernel(void)
>>  {
>>  	unsigned long long total_mem;
>> @@ -518,17 +519,28 @@ static void __init reserve_crashkernel(v
>>  	if (crash_base <= 0) {
>>  		const unsigned long long alignment = 16<<20;	/* 16M */
>>  
>> -		crash_base = memblock_find_in_range(alignment, ULONG_MAX, crash_size,
>> -				 alignment);
>> +		/*
>> +		 * Assume half crash_size is for bzImage
>> +		 *  kexec want bzImage is below DEFAULT_BZIMAGE_ADDR_MAX
>> +		 */
>> +		crash_base = memblock_find_in_range(alignment,
>> +				DEFAULT_BZIMAGE_ADDR_MAX + crash_size/2,
>> +				crash_size, alignment);
>> +
>>  		if (crash_base == MEMBLOCK_ERROR) {
>> -			pr_info("crashkernel reservation failed - No suitable area found.\n");
>> -			return;
>> +			crash_base = memblock_find_in_range(alignment,
>> +					 ULONG_MAX, crash_size, alignment);
>> +
>> +			if (crash_base == MEMBLOCK_ERROR) {
>> +				pr_info("crashkernel reservation failed - No suitable area found.\n");
>> +				return;
>> +			}
>>  		}
>>  
> 
> Okay, this *really* doesn't make sense.
> 
> It's bad enough that kexec doesn't know what memory is safe for it, but
> why the heck the heuristic that "half is for bzImage and the rest can go
> beyond the heuristic limit"?

kdump wants that range: half for the bzImage, half for the initrd.

And kexec only checks whether the bzImage can be placed in the small range.

>  Can't we at least simply cap the region to
> the default, unless the kexec system has passed in some knowable
> alternative?

+		crash_base = memblock_find_in_range(alignment,
+				DEFAULT_BZIMAGE_ADDR_MAX,
+				crash_size, alignment);


  Furthermore, why bother having the "fallback" at all
> (certainly without having a message!?)  If we don't get the memory area
> we need we're likely to randomly fail anyway.

if kexec is fixed to work with the bzImage 64-bit entry...

> 
> Let me be completely clear -- it's obvious from all of this that kexec
> is fundamentally broken by design: if kexec can't communicate the safe
> memory to use it's busted seven ways to Sunday and it needs to be fixed.
>  However, in the meantime I can see capping the memory available to it
> as a temporary band-aid, but a fallback to picking random memory is
> nuts, especially on the motivation that "a future kexec version might be
> able to use it."  If so, the "future kexec tools" should SAY SO.

ok, please check

[PATCH -v6] x86, memblock: Fix crashkernel allocation

Cai Qian found that crashkernel is broken by the x86 memblock changes:
1. crashkernel=128M@32M always reports that the range is in use, even
   when the first kernel is small and nothing uses that range.
2. "kexec -p" always fails with:
	Could not find a free area of memory of a000 bytes...
	locate_hole failed

The root cause is that the generic memblock_find_in_range() allocates from
the top down, but crashkernel needs memory from the bottom of the specified
range.

Limit the target range to crash_base + crash_size to make sure that we get
the exact range.

-v6: use DEFAULT_BZIMAGE_ADDR_MAX to limit the area that can be used by the bzImage.

Reported-and-Bisected-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/kernel/setup.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -501,6 +501,7 @@ static inline unsigned long long get_tot
 	return total << PAGE_SHIFT;
 }
 
+#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
 static void __init reserve_crashkernel(void)
 {
 	unsigned long long total_mem;
@@ -518,8 +519,12 @@ static void __init reserve_crashkernel(v
 	if (crash_base <= 0) {
 		const unsigned long long alignment = 16<<20;	/* 16M */
 
-		crash_base = memblock_find_in_range(alignment, ULONG_MAX, crash_size,
-				 alignment);
+		/*
+		 *  kexec want bzImage is below DEFAULT_BZIMAGE_ADDR_MAX
+		 */
+		crash_base = memblock_find_in_range(alignment,
+			       DEFAULT_BZIMAGE_ADDR_MAX, crash_size, alignment);
+
 		if (crash_base == MEMBLOCK_ERROR) {
 			pr_info("crashkernel reservation failed - No suitable area found.\n");
 			return;
@@ -527,8 +532,8 @@ static void __init reserve_crashkernel(v
 	} else {
 		unsigned long long start;
 
-		start = memblock_find_in_range(crash_base, ULONG_MAX, crash_size,
-				 1<<20);
+		start = memblock_find_in_range(crash_base,
+				 crash_base + crash_size, crash_size, 1<<20);
 		if (start != crash_base) {
 			pr_info("crashkernel reservation failed - memory is in use.\n");
 			return;


* Re: [PATCH 4/4] x86, mm, memblock, 32bit: Make add_highpages honor early reserved ranges
  2010-10-05 22:50   ` H. Peter Anvin
@ 2010-10-05 23:15     ` Yinghai Lu
  2010-10-06  6:28       ` [tip:core/memblock] x86-32, memblock: " tip-bot for Yinghai Lu
  0 siblings, 1 reply; 22+ messages in thread
From: Yinghai Lu @ 2010-10-05 23:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, Ingo Molnar, Benjamin Herrenschmidt,
	linux-kernel@vger.kernel.org, Jeremy Fitzhardinge, Vivek Goyal

On 10/05/2010 03:50 PM, H. Peter Anvin wrote:
> On 10/04/2010 02:58 PM, Yinghai Lu wrote:
>>  
>> +int __init get_free_all_memory_range(struct range **rangep, int nodeid)
>> +{
>> +	unsigned long end_pfn = -1ULL;
>> +
> 
> Uh... I really hope I don't have to explain what's wrong with that...

sorry for that.

[PATCH -v2] x86, mm, memblock, 32bit: Make add_highpages honor early reserved ranges

Originally the only early reserved range that overlapped the high pages was
"KVA RAM", and we removed it from the active ranges.

It turns out Xen can have that kind of overlap too, to support memory
ballooning.

So make add_highpages_with_active_regions() subtract the memblock reserved
ranges, just like low RAM.

This patch refactors get_free_all_memory_range() so that it can be used by
add_highpages_with_active_regions(). With that, we no longer need to remove
"KVA RAM" from the active ranges.

-v2: use -1UL instead of -1ULL

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/include/asm/memblock.h |    2 +
 arch/x86/mm/init_32.c           |   59 ++++++++++++----------------------------
 arch/x86/mm/memblock.c          |   19 ++++++++++--
 arch/x86/mm/numa_32.c           |    2 -
 4 files changed, 36 insertions(+), 46 deletions(-)

Index: linux-2.6/arch/x86/include/asm/memblock.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/memblock.h
+++ linux-2.6/arch/x86/include/asm/memblock.h
@@ -9,6 +9,8 @@ void memblock_x86_to_bootmem(u64 start,
 void memblock_x86_reserve_range(u64 start, u64 end, char *name);
 void memblock_x86_free_range(u64 start, u64 end);
 struct range;
+int __get_free_all_memory_range(struct range **range, int nodeid,
+			 unsigned long start_pfn, unsigned long end_pfn);
 int get_free_all_memory_range(struct range **rangep, int nodeid);
 
 void memblock_x86_register_active_regions(int nid, unsigned long start_pfn,
Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c
+++ linux-2.6/arch/x86/mm/init_32.c
@@ -423,49 +423,28 @@ static void __init add_one_highpage_init
 	totalhigh_pages++;
 }
 
-struct add_highpages_data {
-	unsigned long start_pfn;
-	unsigned long end_pfn;
-};
-
-static int __init add_highpages_work_fn(unsigned long start_pfn,
-					 unsigned long end_pfn, void *datax)
+void __init add_highpages_with_active_regions(int nid,
+			 unsigned long start_pfn, unsigned long end_pfn)
 {
-	int node_pfn;
-	struct page *page;
-	unsigned long final_start_pfn, final_end_pfn;
-	struct add_highpages_data *data;
-
-	data = (struct add_highpages_data *)datax;
-
-	final_start_pfn = max(start_pfn, data->start_pfn);
-	final_end_pfn = min(end_pfn, data->end_pfn);
-	if (final_start_pfn >= final_end_pfn)
-		return 0;
-
-	for (node_pfn = final_start_pfn; node_pfn < final_end_pfn;
-	     node_pfn++) {
-		if (!pfn_valid(node_pfn))
-			continue;
-		page = pfn_to_page(node_pfn);
-		add_one_highpage_init(page);
+	struct range *range;
+	int nr_range;
+	int i;
+
+	nr_range = __get_free_all_memory_range(&range, nid, start_pfn, end_pfn);
+
+	for (i = 0; i < nr_range; i++) {
+		struct page *page;
+		int node_pfn;
+
+		for (node_pfn = range[i].start; node_pfn < range[i].end;
+		     node_pfn++) {
+			if (!pfn_valid(node_pfn))
+				continue;
+			page = pfn_to_page(node_pfn);
+			add_one_highpage_init(page);
+		}
 	}
-
-	return 0;
-
 }
-
-void __init add_highpages_with_active_regions(int nid, unsigned long start_pfn,
-					      unsigned long end_pfn)
-{
-	struct add_highpages_data data;
-
-	data.start_pfn = start_pfn;
-	data.end_pfn = end_pfn;
-
-	work_with_active_regions(nid, add_highpages_work_fn, &data);
-}
-
 #else
 static inline void permanent_kmaps_init(pgd_t *pgd_base)
 {
Index: linux-2.6/arch/x86/mm/memblock.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/memblock.c
+++ linux-2.6/arch/x86/mm/memblock.c
@@ -139,7 +139,8 @@ static int __init count_early_node_map(i
 	return data.nr;
 }
 
-int __init get_free_all_memory_range(struct range **rangep, int nodeid)
+int __init __get_free_all_memory_range(struct range **rangep, int nodeid,
+			 unsigned long start_pfn, unsigned long end_pfn)
 {
 	int count;
 	struct range *range;
@@ -155,9 +156,9 @@ int __init get_free_all_memory_range(str
 	 * at first
 	 */
 	nr_range = add_from_early_node_map(range, count, nr_range, nodeid);
-#ifdef CONFIG_X86_32
-	subtract_range(range, count, max_low_pfn, -1ULL);
-#endif
+	subtract_range(range, count, 0, start_pfn);
+	subtract_range(range, count, end_pfn, -1ULL);
+
 	memblock_x86_subtract_reserved(range, count);
 	nr_range = clean_sort_range(range, count);
 
@@ -165,6 +166,16 @@ int __init get_free_all_memory_range(str
 	return nr_range;
 }
 
+int __init get_free_all_memory_range(struct range **rangep, int nodeid)
+{
+	unsigned long end_pfn = -1UL;
+
+#ifdef CONFIG_X86_32
+	end_pfn = max_low_pfn;
+#endif
+	return __get_free_all_memory_range(rangep, nodeid, 0, end_pfn);
+}
+
 static u64 __init __memblock_x86_memory_in_range(u64 addr, u64 limit, bool get_free)
 {
 	int i, count;
Index: linux-2.6/arch/x86/mm/numa_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa_32.c
+++ linux-2.6/arch/x86/mm/numa_32.c
@@ -326,8 +326,6 @@ static __init unsigned long calculate_nu
 			      "KVA RAM");
 
 		node_remap_start_pfn[nid] = node_kva_final>>PAGE_SHIFT;
-		remove_active_range(nid, node_remap_start_pfn[nid],
-					 node_remap_start_pfn[nid] + size);
 	}
 	printk(KERN_INFO "Reserving total of %lx pages for numa KVA remap\n",
 			reserve_pages);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip:core/memblock] x86, memblock: Fix crashkernel allocation
  2010-10-05 23:05     ` Yinghai Lu
@ 2010-10-06  6:27       ` tip-bot for Yinghai Lu
  2010-10-06 15:14         ` Vivek Goyal
  0 siblings, 1 reply; 22+ messages in thread
From: tip-bot for Yinghai Lu @ 2010-10-06  6:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, caiqian, tglx, vgoyal

Commit-ID:  9f4c13964b58608fbce05540743281ea3146c0e8
Gitweb:     http://git.kernel.org/tip/9f4c13964b58608fbce05540743281ea3146c0e8
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Tue, 5 Oct 2010 16:05:14 -0700
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Tue, 5 Oct 2010 21:43:14 -0700

x86, memblock: Fix crashkernel allocation

Cai Qian found crashkernel is broken with the x86 memblock changes.

1. crashkernel=128M@32M always reported that range is used, even if
   the first kernel is small and does not use that range

2. we always got the following report when using "kexec -p"
	Could not find a free area of memory of a000 bytes...
	locate_hole failed

The root cause is that generic memblock_find_in_range() will try to
allocate from the top of the range, whereas the kexec code was written
assuming that allocation was always near the bottom and that it could
blindly extend memory upward.  Unfortunately the kexec code doesn't
have a system for requesting the range that it really needs, so this
is subject to probabilistic failures.

This patch hacks around the problem by limiting the target range
heuristically to below the traditional bzImage max range.  This number
is arbitrary and not always correct, and a much better result would be
obtained by having kexec communicate this number based on the kernel
header information and any appropriate command line options.

Reported-and-Bisected-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4CABAF2A.5090501@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/kernel/setup.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index bf89e0a..b11a238 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -502,6 +502,7 @@ static inline unsigned long long get_total_mem(void)
 	return total << PAGE_SHIFT;
 }
 
+#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
 static void __init reserve_crashkernel(void)
 {
 	unsigned long long total_mem;
@@ -519,8 +520,12 @@ static void __init reserve_crashkernel(void)
 	if (crash_base <= 0) {
 		const unsigned long long alignment = 16<<20;	/* 16M */
 
-		crash_base = memblock_find_in_range(alignment, ULONG_MAX, crash_size,
-				 alignment);
+		/*
+		 *  kexec wants the bzImage below DEFAULT_BZIMAGE_ADDR_MAX
+		 */
+		crash_base = memblock_find_in_range(alignment,
+			       DEFAULT_BZIMAGE_ADDR_MAX, crash_size, alignment);
+
 		if (crash_base == MEMBLOCK_ERROR) {
 			pr_info("crashkernel reservation failed - No suitable area found.\n");
 			return;
@@ -528,8 +533,8 @@ static void __init reserve_crashkernel(void)
 	} else {
 		unsigned long long start;
 
-		start = memblock_find_in_range(crash_base, ULONG_MAX, crash_size,
-				 1<<20);
+		start = memblock_find_in_range(crash_base,
+				 crash_base + crash_size, crash_size, 1<<20);
 		if (start != crash_base) {
 			pr_info("crashkernel reservation failed - memory is in use.\n");
 			return;

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:core/memblock] x86-32, memblock: Make add_highpages honor early reserved ranges
  2010-10-05 23:15     ` Yinghai Lu
@ 2010-10-06  6:28       ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Yinghai Lu @ 2010-10-06  6:28 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, yinghai, tglx

Commit-ID:  1d931264af0f10649b35afa8fbd2e169da51ac08
Gitweb:     http://git.kernel.org/tip/1d931264af0f10649b35afa8fbd2e169da51ac08
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Tue, 5 Oct 2010 16:15:15 -0700
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Tue, 5 Oct 2010 21:44:35 -0700

x86-32, memblock: Make add_highpages honor early reserved ranges

Originally the only early reserved range overlapping with high pages
was "KVA RAM", and we already remove that from the active ranges.

However, it turns out Xen can have that kind of overlap as well, to
support memory ballooning.

So we need to make add_highpages_with_active_regions() subtract the
memblock reserved ranges, just as we do for low RAM; this is the proper
design anyway.

This patch refactors get_free_all_memory_range() so that it can be used
by add_highpages_with_active_regions().  It also removes the need to
take "KVA RAM" out of the active ranges.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4CABB183.1040607@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/include/asm/memblock.h |    2 +
 arch/x86/mm/init_32.c           |   53 +++++++++++---------------------------
 arch/x86/mm/memblock.c          |   19 +++++++++++---
 arch/x86/mm/numa_32.c           |    2 -
 4 files changed, 33 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
index 2c304bb..19ae14b 100644
--- a/arch/x86/include/asm/memblock.h
+++ b/arch/x86/include/asm/memblock.h
@@ -9,6 +9,8 @@ void memblock_x86_to_bootmem(u64 start, u64 end);
 void memblock_x86_reserve_range(u64 start, u64 end, char *name);
 void memblock_x86_free_range(u64 start, u64 end);
 struct range;
+int __get_free_all_memory_range(struct range **range, int nodeid,
+			 unsigned long start_pfn, unsigned long end_pfn);
 int get_free_all_memory_range(struct range **rangep, int nodeid);
 
 void memblock_x86_register_active_regions(int nid, unsigned long start_pfn,
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index c2385d7..8546709 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -423,49 +423,28 @@ static void __init add_one_highpage_init(struct page *page)
 	totalhigh_pages++;
 }
 
-struct add_highpages_data {
-	unsigned long start_pfn;
-	unsigned long end_pfn;
-};
-
-static int __init add_highpages_work_fn(unsigned long start_pfn,
-					 unsigned long end_pfn, void *datax)
+void __init add_highpages_with_active_regions(int nid,
+			 unsigned long start_pfn, unsigned long end_pfn)
 {
-	int node_pfn;
-	struct page *page;
-	unsigned long final_start_pfn, final_end_pfn;
-	struct add_highpages_data *data;
+	struct range *range;
+	int nr_range;
+	int i;
 
-	data = (struct add_highpages_data *)datax;
+	nr_range = __get_free_all_memory_range(&range, nid, start_pfn, end_pfn);
 
-	final_start_pfn = max(start_pfn, data->start_pfn);
-	final_end_pfn = min(end_pfn, data->end_pfn);
-	if (final_start_pfn >= final_end_pfn)
-		return 0;
+	for (i = 0; i < nr_range; i++) {
+		struct page *page;
+		int node_pfn;
 
-	for (node_pfn = final_start_pfn; node_pfn < final_end_pfn;
-	     node_pfn++) {
-		if (!pfn_valid(node_pfn))
-			continue;
-		page = pfn_to_page(node_pfn);
-		add_one_highpage_init(page);
+		for (node_pfn = range[i].start; node_pfn < range[i].end;
+		     node_pfn++) {
+			if (!pfn_valid(node_pfn))
+				continue;
+			page = pfn_to_page(node_pfn);
+			add_one_highpage_init(page);
+		}
 	}
-
-	return 0;
-
 }
-
-void __init add_highpages_with_active_regions(int nid, unsigned long start_pfn,
-					      unsigned long end_pfn)
-{
-	struct add_highpages_data data;
-
-	data.start_pfn = start_pfn;
-	data.end_pfn = end_pfn;
-
-	work_with_active_regions(nid, add_highpages_work_fn, &data);
-}
-
 #else
 static inline void permanent_kmaps_init(pgd_t *pgd_base)
 {
diff --git a/arch/x86/mm/memblock.c b/arch/x86/mm/memblock.c
index 50ecbc5..fd7a040 100644
--- a/arch/x86/mm/memblock.c
+++ b/arch/x86/mm/memblock.c
@@ -156,7 +156,8 @@ static int __init count_early_node_map(int nodeid)
 	return data.nr;
 }
 
-int __init get_free_all_memory_range(struct range **rangep, int nodeid)
+int __init __get_free_all_memory_range(struct range **rangep, int nodeid,
+			 unsigned long start_pfn, unsigned long end_pfn)
 {
 	int count;
 	struct range *range;
@@ -172,9 +173,9 @@ int __init get_free_all_memory_range(struct range **rangep, int nodeid)
 	 * at first
 	 */
 	nr_range = add_from_early_node_map(range, count, nr_range, nodeid);
-#ifdef CONFIG_X86_32
-	subtract_range(range, count, max_low_pfn, -1ULL);
-#endif
+	subtract_range(range, count, 0, start_pfn);
+	subtract_range(range, count, end_pfn, -1ULL);
+
 	memblock_x86_subtract_reserved(range, count);
 	nr_range = clean_sort_range(range, count);
 
@@ -182,6 +183,16 @@ int __init get_free_all_memory_range(struct range **rangep, int nodeid)
 	return nr_range;
 }
 
+int __init get_free_all_memory_range(struct range **rangep, int nodeid)
+{
+	unsigned long end_pfn = -1UL;
+
+#ifdef CONFIG_X86_32
+	end_pfn = max_low_pfn;
+#endif
+	return __get_free_all_memory_range(rangep, nodeid, 0, end_pfn);
+}
+
 static u64 __init __memblock_x86_memory_in_range(u64 addr, u64 limit, bool get_free)
 {
 	int i, count;
diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c
index 70ddeb7..84a3e4c 100644
--- a/arch/x86/mm/numa_32.c
+++ b/arch/x86/mm/numa_32.c
@@ -326,8 +326,6 @@ static __init unsigned long calculate_numa_remap_pages(void)
 			      "KVA RAM");
 
 		node_remap_start_pfn[nid] = node_kva_final>>PAGE_SHIFT;
-		remove_active_range(nid, node_remap_start_pfn[nid],
-					 node_remap_start_pfn[nid] + size);
 	}
 	printk(KERN_INFO "Reserving total of %lx pages for numa KVA remap\n",
 			reserve_pages);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:core/memblock] memblock: Fix wraparound in find_region()
  2010-10-04 21:57 ` [PATCH 1/4] memblock: Fix big size with find_region() Yinghai Lu
@ 2010-10-06  6:28   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Yinghai Lu @ 2010-10-06  6:28 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, yinghai, tglx

Commit-ID:  f1af98c7629a1b76fd7336decbc776acdeed2120
Gitweb:     http://git.kernel.org/tip/f1af98c7629a1b76fd7336decbc776acdeed2120
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Mon, 4 Oct 2010 14:57:39 -0700
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Tue, 5 Oct 2010 21:45:35 -0700

memblock: Fix wraparound in find_region()

When trying to find huge range for crashkernel, get

[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: at arch/x86/mm/memblock.c:248 memblock_x86_reserve_range+0x40/0x7a()
[    0.000000] Hardware name: Sun Fire x4800
[    0.000000] memblock_x86_reserve_range: wrong range [0xffffffff37000000, 0x137000000)
[    0.000000] Modules linked in:
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.36-rc5-tip-yh-01876-g1cac214-dirty #59
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff82816f7e>] ? memblock_x86_reserve_range+0x40/0x7a
[    0.000000]  [<ffffffff81078c2d>] warn_slowpath_common+0x85/0x9e
[    0.000000]  [<ffffffff81078d38>] warn_slowpath_fmt+0x6e/0x70
[    0.000000]  [<ffffffff8281e77c>] ? memblock_find_region+0x40/0x78
[    0.000000]  [<ffffffff8281eb1f>] ? memblock_find_base+0x9a/0xb9
[    0.000000]  [<ffffffff82816f7e>] memblock_x86_reserve_range+0x40/0x7a
[    0.000000]  [<ffffffff8280452c>] setup_arch+0x99d/0xb2a
[    0.000000]  [<ffffffff810a3e02>] ? trace_hardirqs_off+0xd/0xf
[    0.000000]  [<ffffffff81cec7d8>] ? _raw_spin_unlock_irqrestore+0x3d/0x4c
[    0.000000]  [<ffffffff827ffcec>] start_kernel+0xde/0x3f1
[    0.000000]  [<ffffffff827ff2d4>] x86_64_start_reservations+0xa0/0xa4
[    0.000000]  [<ffffffff827ff3de>] x86_64_start_kernel+0x106/0x10d
[    0.000000] ---[ end trace a7919e7f17c0a725 ]---
[    0.000000] Reserving 8192MB of memory at 17592186041200MB for crashkernel (System RAM: 526336MB)

This is caused by a wraparound in the test due to size > end;
explicitly check for this condition and fail.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4CAA4DD3.1080401@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---
 mm/memblock.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index d5d63ac..9ad3969 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -105,13 +105,18 @@ static phys_addr_t __init memblock_find_region(phys_addr_t start, phys_addr_t en
 	phys_addr_t base, res_base;
 	long j;
 
+	/* In case a huge size is requested */
+	if (end < size)
+		return MEMBLOCK_ERROR;
+
+	base = memblock_align_down((end - size), align);
+
 	/* Prevent allocations returning 0 as it's also used to
 	 * indicate an allocation failure
 	 */
 	if (start == 0)
 		start = PAGE_SIZE;
 
-	base = memblock_align_down((end - size), align);
 	while (start <= base) {
 		j = memblock_overlaps_region(&memblock.reserved, base, size);
 		if (j < 0)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:core/memblock] x86, memblock: Remove __memblock_x86_find_in_range_size()
  2010-10-04 21:58 ` [PATCH 3/4] x86, memblock: Remove __memblock_x86_find_in_range_size() Yinghai Lu
@ 2010-10-06  6:29   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Yinghai Lu @ 2010-10-06  6:29 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, yinghai, tglx

Commit-ID:  16c36f743bf8481d0ba40a6de0af11736095d7cf
Gitweb:     http://git.kernel.org/tip/16c36f743bf8481d0ba40a6de0af11736095d7cf
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Mon, 4 Oct 2010 14:58:04 -0700
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Tue, 5 Oct 2010 21:45:43 -0700

x86, memblock: Remove __memblock_x86_find_in_range_size()

Fold it into memblock_x86_find_in_range_size(), and rename
bad_addr_size() to check_with_memblock_reserved_size().

This makes the whole memblock_x86_find_in_range_size() code more readable.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4CAA4DEC.4000401@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/mm/memblock.c |   39 +++++++++++----------------------------
 1 files changed, 11 insertions(+), 28 deletions(-)

diff --git a/arch/x86/mm/memblock.c b/arch/x86/mm/memblock.c
index fd7a040..aa11693 100644
--- a/arch/x86/mm/memblock.c
+++ b/arch/x86/mm/memblock.c
@@ -8,7 +8,7 @@
 #include <linux/range.h>
 
 /* Check for already reserved areas */
-static inline bool __init bad_addr_size(u64 *addrp, u64 *sizep, u64 align)
+static bool __init check_with_memblock_reserved_size(u64 *addrp, u64 *sizep, u64 align)
 {
 	struct memblock_region *r;
 	u64 addr = *addrp, last;
@@ -30,7 +30,7 @@ again:
 			goto again;
 		}
 		if (last <= (r->base + r->size) && addr >= r->base) {
-			(*sizep)++;
+			*sizep = 0;
 			return false;
 		}
 	}
@@ -41,29 +41,6 @@ again:
 	return changed;
 }
 
-static u64 __init __memblock_x86_find_in_range_size(u64 ei_start, u64 ei_last, u64 start,
-			 u64 *sizep, u64 align)
-{
-	u64 addr, last;
-
-	addr = round_up(ei_start, align);
-	if (addr < start)
-		addr = round_up(start, align);
-	if (addr >= ei_last)
-		goto out;
-	*sizep = ei_last - addr;
-	while (bad_addr_size(&addr, sizep, align) && addr + *sizep <= ei_last)
-		;
-	last = addr + *sizep;
-	if (last > ei_last)
-		goto out;
-
-	return addr;
-
-out:
-	return MEMBLOCK_ERROR;
-}
-
 /*
  * Find next free range after start, and size is returned in *sizep
  */
@@ -76,10 +53,16 @@ u64 __init memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align)
 		u64 ei_last = ei_start + r->size;
 		u64 addr;
 
-		addr = __memblock_x86_find_in_range_size(ei_start, ei_last, start,
-					 sizep, align);
+		addr = round_up(ei_start, align);
+		if (addr < start)
+			addr = round_up(start, align);
+		if (addr >= ei_last)
+			continue;
+		*sizep = ei_last - addr;
+		while (check_with_memblock_reserved_size(&addr, sizep, align))
+			;
 
-		if (addr != MEMBLOCK_ERROR)
+		if (*sizep)
 			return addr;
 	}
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [tip:core/memblock] x86, memblock: Fix crashkernel allocation
  2010-10-06  6:27       ` [tip:core/memblock] " tip-bot for Yinghai Lu
@ 2010-10-06 15:14         ` Vivek Goyal
  2010-10-06 22:16           ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2010-10-06 15:14 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, yinghai, caiqian, tglx; +Cc: linux-tip-commits

On Wed, Oct 06, 2010 at 06:27:54AM +0000, tip-bot for Yinghai Lu wrote:
> Commit-ID:  9f4c13964b58608fbce05540743281ea3146c0e8
> Gitweb:     http://git.kernel.org/tip/9f4c13964b58608fbce05540743281ea3146c0e8
> Author:     Yinghai Lu <yinghai@kernel.org>
> AuthorDate: Tue, 5 Oct 2010 16:05:14 -0700
> Committer:  H. Peter Anvin <hpa@zytor.com>
> CommitDate: Tue, 5 Oct 2010 21:43:14 -0700
> 
> x86, memblock: Fix crashkernel allocation
> 
> Cai Qian found crashkernel is broken with the x86 memblock changes.
> 
> 1. crashkernel=128M@32M always reported that range is used, even if
>    the first kernel is small and does not use that range
> 
> 2. we always got the following report when using "kexec -p"
> 	Could not find a free area of memory of a000 bytes...
> 	locate_hole failed
> 
> The root cause is that generic memblock_find_in_range() will try to
> allocate from the top of the range, whereas the kexec code was written
> assuming that allocation was always near the bottom and that it could
> blindly extend memory upward.  Unfortunately the kexec code doesn't
> have a system for requesting the range that it really needs, so this
> is subject to probabilistic failures.
> 
> This patch hacks around the problem by limiting the target range
> heuristically to below the traditional bzImage max range.  This number
> is arbitrary and not always correct, and a much better result would be
> obtained by having kexec communicate this number based on the kernel
> header information and any appropriate command line options.
> 
> Reported-and-Bisected-by: CAI Qian <caiqian@redhat.com>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> LKML-Reference: <4CABAF2A.5090501@kernel.org>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
> ---
>  arch/x86/kernel/setup.c |   13 +++++++++----
>  1 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index bf89e0a..b11a238 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -502,6 +502,7 @@ static inline unsigned long long get_total_mem(void)
>  	return total << PAGE_SHIFT;
>  }
>  
> +#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
>  static void __init reserve_crashkernel(void)
>  {
>  	unsigned long long total_mem;
> @@ -519,8 +520,12 @@ static void __init reserve_crashkernel(void)
>  	if (crash_base <= 0) {
>  		const unsigned long long alignment = 16<<20;	/* 16M */
>  
> -		crash_base = memblock_find_in_range(alignment, ULONG_MAX, crash_size,
> -				 alignment);
> +		/*
> +		 *  kexec wants the bzImage below DEFAULT_BZIMAGE_ADDR_MAX
> +		 */
> +		crash_base = memblock_find_in_range(alignment,
> +			       DEFAULT_BZIMAGE_ADDR_MAX, crash_size, alignment);
> +

I really don't understand why we put an upper limit of DEFAULT_BZIMAGE_ADDR_MAX.
It does not make much sense to internally impose an upper limit on
the reserved memory area if the reserver has not specified one.

Why can't we provide a function which does a search from bottom up for
the required size of memory. If the memory finally reserved does not meet
the constraints needed by kexec, then kexec load will fail. Kernel does
not have to try to figure out the upper limit in this case.

The current state of affairs is not perfect, but coming up with an artificial
upper limit here is further deteriorating the situation, IMHO.

Regarding the question of specifying the upper limit by kexec on command
line, I think it is hard. Kexec needs to load multiple segments and some
needs to go in really low memory area and some can be in higher memory
area. What is the upper limit in this case. If we take the upper limit
of lowest memory segment, then we will just not have sufficient memory
to load all segments.

That would mean splitting the reserved region into multiple parts, and one
should specify a separate upper limit for each region. That would make
the whole thing complex.

So can we at least maintain the status quo where we search for crashkernel
memory bottom up without any upper limits, instead of top down.

Thanks
Vivek




>  		if (crash_base == MEMBLOCK_ERROR) {
>  			pr_info("crashkernel reservation failed - No suitable area found.\n");
>  			return;
> @@ -528,8 +533,8 @@ static void __init reserve_crashkernel(void)
>  	} else {
>  		unsigned long long start;
>  
> -		start = memblock_find_in_range(crash_base, ULONG_MAX, crash_size,
> -				 1<<20);
> +		start = memblock_find_in_range(crash_base,
> +				 crash_base + crash_size, crash_size, 1<<20);
>  		if (start != crash_base) {
>  			pr_info("crashkernel reservation failed - memory is in use.\n");
>  			return;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:core/memblock] x86, memblock: Fix crashkernel allocation
  2010-10-06 15:14         ` Vivek Goyal
@ 2010-10-06 22:16           ` H. Peter Anvin
  2010-10-06 22:47             ` Vivek Goyal
  0 siblings, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2010-10-06 22:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: mingo, linux-kernel, yinghai, caiqian, tglx, linux-tip-commits

On 10/06/2010 08:14 AM, Vivek Goyal wrote:
> 
> I really don't understand why we put an upper limit of DEFAULT_BZIMAGE_ADDR_MAX.
> It does not make much sense to internally impose an upper limit on
> the reserved memory area if the reserver has not specified one.
> 
> Why can't we provide a function which does a search from bottom up for
> the required size of memory. If the memory finally reserved does not meet
> the constraints needed by kexec, then kexec load will fail. Kernel does
> not have to try to figure out the upper limit in this case.
> 
> The current state of affairs is not perfect, but coming up with an artificial
> upper limit here is further deteriorating the situation, IMHO.
> 
> Regarding the question of specifying the upper limit by kexec on command
> line, I think it is hard. Kexec needs to load multiple segments and some
> needs to go in really low memory area and some can be in higher memory
> area. What is the upper limit in this case. If we take the upper limit
> of lowest memory segment, then we will just not have sufficient memory
> to load all segments.
> 
> That would mean splitting the reserved region into multiple parts, and one
> should specify a separate upper limit for each region. That would make
> the whole thing complex.
> 
> So can we at least maintain the status quo where we search for crashkernel
> memory bottom up without any upper limits, instead of top down.
> 

The reason the "whole thing is complex" is because your constraints are
complex, and you're still trying to hide them from the kernel.  And what
is absolutely incomprehensible to me is that you seem to think this is okay.

I really, REALLY, ***REALLY*** don't want to burden the kernel with a
bunch of constraints which are invisible to it, where things will
randomly fail because the implementation changed.  We have too much of
that already, and it causes an enormous amount of problems all over the
kernel.

Of course, we're already painted into a corner with a bad design that
isn't going to change overnight, and of course, this is hardly the first
time this has happened -- we do find our way out of tight spots on a
regular basis.  Perhaps you're right and the best thing is to add an
explicit bottoms-up allocator function for now, *BUT* I would also like
to see a firm commitment to fix the underlying architectural problem for
real, and not just "maintain the status quo" indefinitely, which is what
your emails make me think you're expecting.

It's worth noting that we used to have the same problem in the kernel
overall, until Jeremy did a whole lot of work adding the brk allocator
specifically to make sure that our allocations end up deterministic.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:core/memblock] x86, memblock: Fix crashkernel allocation
  2010-10-06 22:16           ` H. Peter Anvin
@ 2010-10-06 22:47             ` Vivek Goyal
  2010-10-06 23:06               ` Vivek Goyal
  2010-10-06 23:09               ` H. Peter Anvin
  0 siblings, 2 replies; 22+ messages in thread
From: Vivek Goyal @ 2010-10-06 22:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: mingo, linux-kernel, yinghai, caiqian, tglx, linux-tip-commits

On Wed, Oct 06, 2010 at 03:16:17PM -0700, H. Peter Anvin wrote:
> On 10/06/2010 08:14 AM, Vivek Goyal wrote:
> > 
> > I really don't understand why we put an upper limit of DEFAULT_BZIMAGE_ADDR_MAX.
> > It does not make much sense to internally impose an upper limit on
> > the reserved memory area if the reserver has not specified one.
> > 
> > Why can't we provide a function which does a search from bottom up for
> > the required size of memory. If the memory finally reserved does not meet
> > the constraints needed by kexec, then kexec load will fail. Kernel does
> > not have to try to figure out the upper limit in this case.
> > 
> > The current state of affairs is not perfect, but coming up with an artificial
> > upper limit here is further deteriorating the situation, IMHO.
> > 
> > Regarding the question of specifying the upper limit by kexec on command
> > line, I think it is hard. Kexec needs to load multiple segments and some
> > needs to go in really low memory area and some can be in higher memory
> > area. What is the upper limit in this case. If we take the upper limit
> > of lowest memory segment, then we will just not have sufficient memory
> > to load all segments.
> > 
> > That would mean splitting the reserved region into multiple parts, and one
> > should specify a separate upper limit for each region. That would make
> > the whole thing complex.
> > 
> > So can we at least maintain the status quo where we search for crashkernel
> > memory bottom up without any upper limits, instead of top down.
> > 
> 
> The reason the "whole thing is complex" is because your constraints are
> complex, and you're still trying to hide them from the kernel.  And what
> is absolutely incomprehensible to me is that you seem to think this is okay.
> 
> I really, REALLY, ***REALLY*** don't want to burden the kernel with a
> bunch of constraints which are invisible to it, where things will
> randomly fail because the implementation changed.  We have too much of
> that already, and it causes an enormous amount of problems all over the
> kernel.
> 
> Of course, we're already painted into a corner with a bad design that
> isn't going to change overnight, and of course, this is hardly the first
> time this has happened -- we do find our way out of tight spots on a
> regular basis.  Perhaps you're right and the best thing is to add an
> explicit bottoms-up allocator function for now, *BUT* I would also like
> to see a firm commitment to fix the underlying architectural problem for
> real, and not just "maintain the status quo" indefinitely, which is what
> your emails make me think you're expecting.

I really don't mind fixing things properly in the long term; it's just that
I am running out of ideas on how to fix it the proper way.

To me the best thing would be for this whole allocation to be dynamic,
done from the user space where kexec runs: kexec would determine what it
is loading, determine the memory constraints on those segments (minimum,
upper limit, alignment, etc.), and then ask the kernel to reserve
contiguous memory. This kind of dynamic reservation would remove a lot
of the problems associated with crashkernel= reservations.

But I am not aware of any way of doing dynamic allocation, and it certainly
does not seem easy to allocate 128M of memory contiguously.

Because we don't have a way to reserve memory dynamically later, we end up
reserving one big chunk via the kernel command line and figuring out later
what to load where. With this approach kexec has not even run yet, so how
can it tell you what the memory constraints are?

So to me one of the ways of properly fixing this is adding some kind of
capability to reserve the memory dynamically (maybe using sys_kexec())
and getting rid of this notion of reserving memory at boot time.

The other concern you raised is hiding constraints from the kernel. At this
point the only problem with the crashkernel=X@0 syntax is that it
does not tell you whether to look for memory bottom up or top down. How
about specifying it explicitly in the syntax, so that the kernel does not
have to assume things?

In fact the initial crashkernel syntax was crashkernel=X@Y. This meant:
allocate X amount of memory at location Y. That left no ambiguity and
the kernel did not have to assume things. It had the problem, though, that
we might not have physical RAM at location Y. So I think that's when
somebody came up with the idea of crashkernel=X@0, meaning we ideally
want memory at location 0, but if that can't be provided, then provide
the next available range, scanning bottom up.

So the only part missing from the syntax is explicitly specifying "next
available location, scanning bottom up". If we add that to the syntax, the
kernel does not have to make assumptions (except for the alignment part).

So how about modifying the syntax to crashkernel=X@Y#BU?

The "#BU" part can be optional, and in that case the kernel is free to
allocate memory either top down or bottom up.

Or any other string which can communicate the bottom-up part in a more
intuitive manner.

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:core/memblock] x86, memblock: Fix crashkernel allocation
  2010-10-06 22:47             ` Vivek Goyal
@ 2010-10-06 23:06               ` Vivek Goyal
  2010-10-06 23:09               ` H. Peter Anvin
  1 sibling, 0 replies; 22+ messages in thread
From: Vivek Goyal @ 2010-10-06 23:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: mingo, linux-kernel, yinghai, caiqian, tglx, linux-tip-commits,
	Eric W. Biederman

On Wed, Oct 06, 2010 at 06:47:04PM -0400, Vivek Goyal wrote:
> On Wed, Oct 06, 2010 at 03:16:17PM -0700, H. Peter Anvin wrote:
> > On 10/06/2010 08:14 AM, Vivek Goyal wrote:
> > > 
> > > I really don't understand why we put an upper limit of DEFAULT_BZIMAGE_ADDR_MAX.
> > > It does not make much sense to internally impose an upper limit on
> > > the reserved memory area if the reserver has not specified one.
> > > 
> > > Why can't we provide a function which does a search from bottom up for
> > > the required size of memory. If the memory finally reserved does not meet
> > > the constraints needed by kexec, then kexec load will fail. Kernel does
> > > not have to try to figure out the upper limit in this case.
> > > 
> > > The current state of affairs is not perfect, but coming up with an artificial
> > > upper limit here is further deteriorating the situation, IMHO.
> > > 
> > > Regarding the question of specifying the upper limit by kexec on command
> > > line, I think it is hard. Kexec needs to load multiple segments and some
> > > needs to go in really low memory area and some can be in higher memory
> > > area. What is the upper limit in this case. If we take the upper limit
> > > of lowest memory segment, then we will just not have sufficient memory
> > > to load all segments.
> > > 
> > > That would mean splitting the reserved region into multiple parts, and one
> > > should specify a separate upper limit for each region. That would make
> > > the whole thing complex.
> > > 
> > > So can we at least maintain the status quo where we search for crashkernel
> > > memory bottom up without any upper limit, instead of top down.
> > > 
> > 
> > The reason the "whole thing is complex" is because your constraints are
> > complex, and you're still trying to hide them from the kernel.  And what
> > is absolutely incomprehensible to me is that you seem to think this is okay.
> > 
> > I really, REALLY, ***REALLY*** don't want to burden the kernel with a
> > bunch of constraints which are invisible to it, where things will
> > randomly fail because the implementation changed.  We have too much of
> > that already, and it causes an enormous amount of problems all over the
> > kernel.
> > 
> > Of course, we're already painted into a corner with a bad design that
> > isn't going to change overnight, and of course, this is hardly the first
> > time this has happened -- we do find our way out of tight spots on a
> > regular basis.  Perhaps you're right and the best thing is to add an
> > explicit bottoms-up allocator function for now, *BUT* I would also like
> > to see a firm commitment to fix the underlying architectural problem for
> > real, and not just "maintain the status quo" indefinitely, which is what
> > your emails make me think you're expecting.
> 
> I really don't mind fixing things properly in the long term, just that I am
> running out of ideas regarding how to fix it in a proper way.
> 
> To me the best thing would be for this whole allocation to be dynamic
> from user space: kexec will run, determine what it is loading,
> determine what the memory constraints on these segments are (min, upper
> limit, alignment etc), and then ask the kernel to reserve contiguous
> memory. This kind of dynamic reservation would remove a lot of the problems
> associated with crashkernel= reservations.
> 
> But I am not aware of any way of doing dynamic allocation, and it certainly
> does not seem to be easy to allocate 128M of memory contiguously.
> 
> Because we don't have a way to reserve memory dynamically later, we end up
> doing a big chunk of reservation using the kernel command line and later
> figuring out what to load where. But with this approach kexec has not even run
> yet, so how can it tell you what the memory constraints are?
> 
> So to me one of the ways of properly fixing is adding some kind of
> capability to reserve the memory dynamically (may be using sys_kexec())
> and get rid of this notion of reserving memory at boot time.
> 
> The other concern you raised is hiding constraints from kernel. At this
> point of time the only problem with crashkernel=X@0 syntax is that it
> does not tell you whether to look for memory bottom up or top down. How
> about if we specify it explicitly in the syntax so that kernel does not
> have to assume things?
> 
> In fact the initial crashkernel syntax was crashkernel=X@Y. This meant
> allocate X amount of memory at location Y. This left no ambiguity and
> the kernel did not have to assume things. It had the problem, though, that
> we might not have physical RAM at location Y. So I think that's when
> somebody came up with the idea of crashkernel=X@0, meaning that we ideally
> want memory at location 0, but if that can't be provided, then provide
> the next available location scanning bottom up.
> 
> So the only part missing from the syntax is explicitly specifying "next
> available location scanning bottom up". If we add that to the syntax then
> the kernel does not have to make assumptions (except the alignment part).
> 
> So how about modifying the syntax to crashkernel=X@Y#BU.
> 
> The "#BU" part can be optional, and in that case the kernel is free to allocate
> memory either top down or bottom up.
> 

Thinking more on the above point.

crashkernel=X@Y will mean: allocate X amount of memory at location Y only.
If it is not available, just don't reserve. This also means we should not
overload the Y=0 case: crashkernel=X@0 should likewise mean reserve X
amount of memory at location 0, and if it is not available, fail.

crashkernel=<size>@<offset>#<policy> will communicate additional
information regarding how to allocate memory.

The policy could be a comma-separated list of strings communicating
bottom-up or top-down constraints. In the future we could extend it to
specify additional constraints like alignment.

policy="<str>"

BU ---> Look for the next available free memory, scanning bottom up, if
        memory at the originally specified location is not available.
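To make the proposal concrete, a rough user-space sketch of parsing the <size>@<offset>#<policy> form could look like this (the function name is made up and sizes are taken as plain bytes to keep the sketch short; the kernel's real parser is parse_crashkernel() and accepts K/M/G suffixes):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum crash_policy { POLICY_DEFAULT, POLICY_BOTTOM_UP };

/* Illustrative parse of the proposed crashkernel=<size>@<offset>#<policy>
 * value; 0 on success, -1 on malformed input. */
static int parse_crashkernel_policy(const char *arg, uint64_t *size,
				    uint64_t *offset, enum crash_policy *pol)
{
	char *end;

	*pol = POLICY_DEFAULT;
	*size = strtoull(arg, &end, 0);
	if (end == arg || *end != '@')
		return -1;			/* size and '@' are mandatory */
	*offset = strtoull(end + 1, &end, 0);
	if (*end == '#') {
		if (strcmp(end + 1, "BU"))
			return -1;		/* only "BU" is defined so far */
		*pol = POLICY_BOTTOM_UP;
	} else if (*end != '\0') {
		return -1;
	}
	return 0;
}
```

A missing "#<policy>" suffix leaves the policy at the default, where the kernel is free to choose top down or bottom up.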

Thanks
Vivek

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:core/memblock] x86, memblock: Fix crashkernel allocation
  2010-10-06 22:47             ` Vivek Goyal
  2010-10-06 23:06               ` Vivek Goyal
@ 2010-10-06 23:09               ` H. Peter Anvin
  2010-10-07 18:18                 ` Vivek Goyal
  1 sibling, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2010-10-06 23:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: mingo@redhat.com, linux-kernel@vger.kernel.org,
	yinghai@kernel.org, caiqian@redhat.com, tglx@linutronix.de,
	linux-tip-commits@vger.kernel.org

On 10/06/2010 03:47 PM, Vivek Goyal wrote:
> 
> I really don't mind fixing the things properly in long term, just that I am
> running out of ideas regarding how to fix it in proper way.
> 
> To me the best thing would be for this whole allocation to be dynamic
> from user space: kexec will run, determine what it is loading,
> determine what the memory constraints on these segments are (min, upper
> limit, alignment etc), and then ask the kernel to reserve contiguous
> memory. This kind of dynamic reservation would remove a lot of the problems
> associated with crashkernel= reservations.
> 
> But I am not aware of any way of doing dynamic allocation, and it certainly
> does not seem to be easy to allocate 128M of memory contiguously.
> 
> Because we don't have a way to reserve memory dynamically later, we end up
> doing a big chunk of reservation using the kernel command line and later
> figuring out what to load where. But with this approach kexec has not even run
> yet, so how can it tell you what the memory constraints are?
> 
> So to me one of the ways of properly fixing is adding some kind of
> capability to reserve the memory dynamically (may be using sys_kexec())
> and get rid of this notion of reserving memory at boot time.

The problem, of course, with allocating very large chunks of memory at
runtime is that there are going to be some number of non-movable and
non-evictable pages that are going to break up the contiguous ranges.
However, the mm recently added support for moving most pages, which
should make that kind of allocation a lot more feasible.  I haven't
experimented with how well it works in practice, but I rather suspect that as
long as the crashkernel is installed sufficiently early in the boot
process it should have a very good probability of success.  Another
option, although one which has its own hackiness issues, is to do a
conservative allocation at boot time in preparation of the kexec call,
which is then freed.  This doesn't really address the issue of location,
though, which is part of the problem here.

> The other concern you raised is hiding constraints from kernel. At this
> point of time the only problem with crashkernel=X@0 syntax is that it
> does not tell you whether to look for memory bottom up or top down. How
> about if we specify it explicitly in the syntax so that kernel does not
> have to assume things?

See below.

> In fact the initial crashkernel syntax was crashkernel=X@Y. This meant
> allocate X amount of memory at location Y. This left no ambiguity and
> the kernel did not have to assume things. It had the problem, though, that
> we might not have physical RAM at location Y. So I think that's when
> somebody came up with the idea of crashkernel=X@0, meaning that we ideally
> want memory at location 0, but if that can't be provided, then provide
> the next available location scanning bottom up.
> 
> So the only part missing from the syntax is explicitly specifying "next
> available location scanning bottom up". If we add that to the syntax then
> the kernel does not have to make assumptions (except the alignment part).
> 
> So how about modifying syntax to crashkernel=X@Y#BU.
> 
> The "#BU" part can be optional and in that case kernel is free to allocate
> memory either top down or bottom up.
> 
> Or any other string which can communicate the bottom-up part in a more
> intuitive manner.

The whole problem here is that "bottoms up" isn't the true constraint --
it's a proxy for "this chunk needs < address X, this chunk needs <
address Y, ..." which is the real issue.  This is particularly messy
since low memory is a (sometimes very) precious resource that is used by
a lot of things (BIOS stubs, DMA-mask-limited hardware devices, and
perhaps especially 1:1 mappable pages on 32 bits, and so on), and one of
the major reasons we want to switch to a top-down allocation scheme is
to not waste a precious resource when we don't have to.

The one improvement one could make to the crashkernel= syntax is perhaps
"crashkernel=X<Y" meaning "allocate entirely below Y", since that is (at
least in part) the real constraint.  It could even be extended to
multiple segments: "crashkernel=X<Y,Z<W,..." if we really need to...
that way you have your preallocation.
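For illustration, parsing such an "X<Y[,Z<W,...]" value could be sketched like this in user space (the names are made up and sizes are plain bytes; a real option would accept K/M/G suffixes):

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative: each segment requests @size bytes located entirely
 * below address @below. */
struct crash_segment { uint64_t size, below; };

/* Returns the number of segments parsed, or -1 on malformed input. */
static int parse_crash_segments(const char *arg, struct crash_segment *seg,
				int max_segs)
{
	int n = 0;
	char *end;

	while (*arg && n < max_segs) {
		seg[n].size = strtoull(arg, &end, 0);
		if (end == arg || *end != '<')
			return -1;		/* "X<Y" shape is mandatory */
		seg[n].below = strtoull(end + 1, &end, 0);
		n++;
		if (*end == '\0')
			return n;
		if (*end != ',')
			return -1;
		arg = end + 1;		/* next "Z<W" segment */
	}
	return -1;	/* empty string or too many segments */
}
```

Each parsed segment then becomes an independent "allocate entirely below Y" request, which is the real constraint the kernel needs to know about.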

	-hpa

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:core/memblock] x86, memblock: Fix crashkernel allocation
  2010-10-06 23:09               ` H. Peter Anvin
@ 2010-10-07 18:18                 ` Vivek Goyal
  2010-10-07 18:54                   ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2010-10-07 18:18 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: mingo@redhat.com, linux-kernel@vger.kernel.org,
	yinghai@kernel.org, caiqian@redhat.com, tglx@linutronix.de,
	linux-tip-commits@vger.kernel.org, Kexec Mailing List

On Wed, Oct 06, 2010 at 04:09:29PM -0700, H. Peter Anvin wrote:
> On 10/06/2010 03:47 PM, Vivek Goyal wrote:
> > 
> > I really don't mind fixing the things properly in long term, just that I am
> > running out of ideas regarding how to fix it in proper way.
> > 
> > To me the best thing would be for this whole allocation to be dynamic
> > from user space: kexec will run, determine what it is loading,
> > determine what the memory constraints on these segments are (min, upper
> > limit, alignment etc), and then ask the kernel to reserve contiguous
> > memory. This kind of dynamic reservation would remove a lot of the problems
> > associated with crashkernel= reservations.
> > 
> > But I am not aware of any way of doing dynamic allocation, and it certainly
> > does not seem to be easy to allocate 128M of memory contiguously.
> > 
> > Because we don't have a way to reserve memory dynamically later, we end up
> > doing a big chunk of reservation using the kernel command line and later
> > figuring out what to load where. But with this approach kexec has not even run
> > yet, so how can it tell you what the memory constraints are?
> > 
> > So to me one of the ways of properly fixing is adding some kind of
> > capability to reserve the memory dynamically (may be using sys_kexec())
> > and get rid of this notion of reserving memory at boot time.
> 
> The problem, of course, with allocating very large chunks of memory at
> runtime is that there are going to be some number of non-movable and
> non-evictable pages that are going to break up the contiguous ranges.
> However, the mm recently added support for moving most pages, which
> should make that kind of allocation a lot more feasible.  I haven't
> experimented with how well it works in practice, but I rather suspect that as
> long as the crashkernel is installed sufficiently early in the boot
> process it should have a very good probability of success.

Ok.

>  Another
> option, although one which has its own hackiness issues, is to do a
> conservative allocation at boot time in preparation of the kexec call,
> which is then freed.  This doesn't really address the issue of location,
> though, which is part of the problem here.
> 
> > The other concern you raised is hiding constraints from kernel. At this
> > point of time the only problem with crashkernel=X@0 syntax is that it
> > does not tell you whether to look for memory bottom up or top down. How
> > about if we specify it explicitly in the syntax so that kernel does not
> > have to assume things?
> 
> See below.
> 
> > In fact the initial crashkernel syntax was crashkernel=X@Y. This meant
> > allocate X amount of memory at location Y. This left no ambiguity and
> > the kernel did not have to assume things. It had the problem, though, that
> > we might not have physical RAM at location Y. So I think that's when
> > somebody came up with the idea of crashkernel=X@0, meaning that we ideally
> > want memory at location 0, but if that can't be provided, then provide
> > the next available location scanning bottom up.
> > 
> > So the only part missing from the syntax is explicitly specifying "next
> > available location scanning bottom up". If we add that to the syntax then
> > the kernel does not have to make assumptions (except the alignment part).
> > 
> > So how about modifying syntax to crashkernel=X@Y#BU.
> > 
> > The "#BU" part can be optional and in that case kernel is free to allocate
> > memory either top down or bottom up.
> > 
> > Or any other string which can communicate the bottom-up part in a more
> > intuitive manner.
> 
> The whole problem here is that "bottoms up" isn't the true constraint --
> it's a proxy for "this chunk needs < address X, this chunk needs <
> address Y, ..." which is the real issue.  This is particularly messy
> since low memory is a (sometimes very) precious resource that is used by
> a lot of things (BIOS stubs, DMA-mask-limited hardware devices, and
> perhaps especially 1:1 mappable pages on 32 bits, and so on), and one of
> the major reasons we want to switch to a top-down allocation scheme is
> to not waste a precious resource when we don't have to.
> 
> The one improvement one could make to the crashkernel= syntax is perhaps
> "crashkernel=X<Y" meaning "allocate entirely below Y", since that is (at
> least in part) the real constraint.  It could even be extended to
> multiple segments: "crashkernel=X<Y,Z<W,..." if we really need to...
> that way you have your preallocation.

Ok, I was browsing through the kexec-tools x86 bzImage code, trying to
refresh my memory of which segments are being loaded and what the memory
address concerns are.

- relocatable bzImage (max addr 0x37ffffff, 896MB).
	Though I don't know/understand where that 896MB comes from.

- initrd (max addr 0x37ffffff, 896MB)
	Don't know why 896MB is the upper limit.

- Purgatory (max addr 2G)

- A segment to keep the ELF headers (no limit)
	These are accessed once the second kernel has fully booted, so they
	can be addressed at higher addresses.

- A backup segment to copy the first 640K of memory (not aware of any limit)
- Setup/parameter segment (no limit)
	- We don't really execute anything here and just access it for the
	  command line.

So at least for bzImage it looks like, if we specify crashkernel=128M<896M,
it will work.
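As a rough illustration, the bounded top-down search that a crashkernel=128M<896M reservation implies could be sketched like this in user space (the region table and names are made up; this is not the memblock implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-region table; not the real memblock API. */
struct region { uint64_t base, size; };

/* Return the highest @align-aligned base such that @size bytes fit in
 * one free region entirely below @limit; ~0ULL if nothing fits.  For
 * crashkernel=128M<896M this would be called with size = 128M and
 * limit = 896M.  @align must be a power of two. */
static uint64_t find_top_down(const struct region *free_regs, size_t nr,
			      uint64_t limit, uint64_t size, uint64_t align)
{
	uint64_t best = ~0ULL;

	for (size_t i = 0; i < nr; i++) {
		uint64_t end = free_regs[i].base + free_regs[i].size;

		if (end > limit)
			end = limit;		/* clip region at the limit */
		if (end < free_regs[i].base + size)
			continue;		/* too small below the limit */
		uint64_t base = (end - size) & ~(align - 1);
		if (base >= free_regs[i].base &&
		    (best == ~0ULL || base > best))
			best = base;
	}
	return best;
}
```

This keeps low memory free when a high range exists below the limit, which is exactly the reason for preferring top down.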

So I am fine with the above additional syntax for crashkernel=. Maybe we
shall have to deprecate the crashkernel=X@0 syntax.

CCing kexec list, in case others have any comments.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:core/memblock] x86, memblock: Fix crashkernel allocation
  2010-10-07 18:18                 ` Vivek Goyal
@ 2010-10-07 18:54                   ` H. Peter Anvin
  2010-10-07 19:21                     ` Vivek Goyal
  0 siblings, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2010-10-07 18:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: mingo@redhat.com, linux-kernel@vger.kernel.org,
	yinghai@kernel.org, caiqian@redhat.com, tglx@linutronix.de,
	linux-tip-commits@vger.kernel.org, Kexec Mailing List

On 10/07/2010 11:18 AM, Vivek Goyal wrote:
> 
> Ok, I was browsing through kexec-tools, x86 bzImage code and trying to
> refresh my memory what segments were being loaded and what were memory
> address concerns.
> 
> - relocatable bzImage (max addr 0x37ffffff, 896MB). 
> 	Though I don't know/understand where that 896MB comes from.
> 
> - initrd (max addr 0x37ffffff, 896MB)
> 	Don't know why 896MB as upper limit

896 MB is presumably the (default!!) LOWMEM limit on 32 bits.  This is
actually wrong if vmalloc= is also specified on command line, though, or
with nonstandard compile-time options.

> - Purgatory (max addr 2G)
> 
> - A segment to keep elf headers (no limit)
> 	These are accessed once the second kernel has fully booted, so they
> 	can be addressed at higher addresses.
> 
> - A backup segment to copy first 640K of memory (not aware of any limit)
> - Setup/parameter segment (no limit)
> 	- We don't really execute anything here and just access it for
>   	  command line.

Probably has a 4 GB limit, since I believe it only has a 32-bit pointer.

> So at least for bzImage it looks like, if we specify crashkernel=128M<896M,
> it will work.
> 
> So I am fine with the above additional syntax for crashkernel=. Maybe we
> shall have to deprecate the crashkernel=X@0 syntax.
> 
> CCing kexec list, in case others have any comments.

It would be easy enough to either deprecate or make it an alias for
crashkernel=...<896M, which is basically what Yinghai's patch does.

	-hpa

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:core/memblock] x86, memblock: Fix crashkernel allocation
  2010-10-07 18:54                   ` H. Peter Anvin
@ 2010-10-07 19:21                     ` Vivek Goyal
  2010-10-07 20:44                       ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2010-10-07 19:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: mingo@redhat.com, linux-kernel@vger.kernel.org,
	yinghai@kernel.org, caiqian@redhat.com, tglx@linutronix.de,
	linux-tip-commits@vger.kernel.org, Kexec Mailing List

On Thu, Oct 07, 2010 at 11:54:31AM -0700, H. Peter Anvin wrote:
> On 10/07/2010 11:18 AM, Vivek Goyal wrote:
> > 
> > Ok, I was browsing through kexec-tools, x86 bzImage code and trying to
> > refresh my memory what segments were being loaded and what were memory
> > address concerns.
> > 
> > - relocatable bzImage (max addr 0x37ffffff, 896MB). 
> > 	Though I don't know/understand where that 896MB comes from.
> > 
> > - initrd (max addr 0x37ffffff, 896MB)
> > 	Don't know why 896MB as upper limit
> 
> 896 MB is presumably the (default!!) LOWMEM limit on 32 bits.  This is
> actually wrong if vmalloc= is also specified on command line, though, or
> with nonstandard compile-time options.
> 
> > - Purgatory (max addr 2G)
> > 
> > - A segment to keep elf headers (no limit)
> > 	These are accessed once the second kernel has fully booted, so they
> > 	can be addressed at higher addresses.
> > 
> > - A backup segment to copy first 640K of memory (not aware of any limit)
> > - Setup/parameter segment (no limit)
> > 	- We don't really execute anything here and just access it for
> >   	  command line.
> 
> Probably has a 4 GB limit, since I believe it only has a 32-bit pointer.
> 
> > So at least for bzImage it looks like, if we specify crashkernel=128M<896M, it
> > will work.
> > 
> > So I am fine with the above additional syntax for crashkernel=. Maybe we
> > shall have to deprecate the crashkernel=X@0 syntax.
> > 
> > CCing kexec list, in case others have any comments.
> 
> It would be easy enough to either deprecate or make it an alias for
> crashkernel=...<896M, which is basically what Yinghai's patch does.

Agreed.

So Yinghai's patch is fine. I need to write a patch introducing the
crashkernel=X<Y syntax to make the behavior explicit. Will do...

Thanks
Vivek

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:core/memblock] x86, memblock: Fix crashkernel allocation
  2010-10-07 19:21                     ` Vivek Goyal
@ 2010-10-07 20:44                       ` H. Peter Anvin
  0 siblings, 0 replies; 22+ messages in thread
From: H. Peter Anvin @ 2010-10-07 20:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: mingo@redhat.com, linux-kernel@vger.kernel.org,
	yinghai@kernel.org, caiqian@redhat.com, tglx@linutronix.de,
	linux-tip-commits@vger.kernel.org, Kexec Mailing List

On 10/07/2010 12:21 PM, Vivek Goyal wrote:
>>
>> It would be easy enough to either deprecate or make it an alias for
>> crashkernel=...<896M, which is basically what Yinghai's patch does.
> 
> Agreed.
> 
> So Yinghai's patch is fine. I need to write a patch for introducing
> crashkernel=X<Y syntax to make the behavior explicit. Will do...
> 

Sounds like a plan.  Thanks!

	-hpa

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2010-10-07 20:44 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4CAA4BD5.4020505@kernel.org>
2010-10-04 21:57 ` [PATCH 1/4] memblock: Fix big size with find_region() Yinghai Lu
2010-10-06  6:28   ` [tip:core/memblock] memblock: Fix wraparound in find_region() tip-bot for Yinghai Lu
2010-10-04 21:57 ` [PATCH 2/4] x86, memblock: Fix crashkernel allocation Yinghai Lu
2010-10-05 21:15   ` H. Peter Anvin
2010-10-05 22:29   ` H. Peter Anvin
2010-10-05 23:05     ` Yinghai Lu
2010-10-06  6:27       ` [tip:core/memblock] " tip-bot for Yinghai Lu
2010-10-06 15:14         ` Vivek Goyal
2010-10-06 22:16           ` H. Peter Anvin
2010-10-06 22:47             ` Vivek Goyal
2010-10-06 23:06               ` Vivek Goyal
2010-10-06 23:09               ` H. Peter Anvin
2010-10-07 18:18                 ` Vivek Goyal
2010-10-07 18:54                   ` H. Peter Anvin
2010-10-07 19:21                     ` Vivek Goyal
2010-10-07 20:44                       ` H. Peter Anvin
2010-10-04 21:58 ` [PATCH 3/4] x86, memblock: Remove __memblock_x86_find_in_range_size() Yinghai Lu
2010-10-06  6:29   ` [tip:core/memblock] " tip-bot for Yinghai Lu
2010-10-04 21:58 ` [PATCH 4/4] x86, mm, memblock, 32bit: Make add_highpages honor early reserved ranges Yinghai Lu
2010-10-05 22:50   ` H. Peter Anvin
2010-10-05 23:15     ` Yinghai Lu
2010-10-06  6:28       ` [tip:core/memblock] x86-32, memblock: " tip-bot for Yinghai Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox