* [PATCH v5 1/6] memblock: Factor out of top-down allocation
2013-09-24 18:23 [PATCH v5 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed Zhang Yanfei
@ 2013-09-24 18:25 ` Zhang Yanfei
2013-09-27 22:23 ` Toshi Kani
2013-09-24 18:27 ` [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode Zhang Yanfei
` (4 subsequent siblings)
5 siblings, 1 reply; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-24 18:25 UTC (permalink / raw)
To: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
Rik van Riel, jweiner, prarit
Cc: x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org, Linux MM,
linux-acpi, imtangchen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
This patch creates a new function, __memblock_find_range_rev(),
to factor the top-down allocation out of memblock_find_in_range_node().
This is preparation for the new bottom-up allocation mode introduced
in the following patch.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
mm/memblock.c | 47 ++++++++++++++++++++++++++++++++++-------------
1 files changed, 34 insertions(+), 13 deletions(-)
diff --git a/mm/memblock.c b/mm/memblock.c
index 0ac412a..3d80c74 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -83,33 +83,25 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
}
/**
- * memblock_find_in_range_node - find free area in given range and node
+ * __memblock_find_range_rev - find free area utility, in reverse order
* @start: start of candidate range
* @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
* @size: size of free area to find
* @align: alignment of free area to find
* @nid: nid of the free area to find, %MAX_NUMNODES for any node
*
- * Find @size free area aligned to @align in the specified range and node.
+ * Utility called from memblock_find_in_range_node(), find free area top-down.
*
* RETURNS:
* Found address on success, %0 on failure.
*/
-phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
- phys_addr_t end, phys_addr_t size,
- phys_addr_t align, int nid)
+static phys_addr_t __init_memblock
+__memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
+ phys_addr_t size, phys_addr_t align, int nid)
{
phys_addr_t this_start, this_end, cand;
u64 i;
- /* pump up @end */
- if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
- end = memblock.current_limit;
-
- /* avoid allocating the first page */
- start = max_t(phys_addr_t, start, PAGE_SIZE);
- end = max(start, end);
-
for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
this_start = clamp(this_start, start, end);
this_end = clamp(this_end, start, end);
@@ -121,10 +113,39 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
if (cand >= this_start)
return cand;
}
+
return 0;
}
/**
+ * memblock_find_in_range_node - find free area in given range and node
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Find @size free area aligned to @align in the specified range and node.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
+ phys_addr_t end, phys_addr_t size,
+ phys_addr_t align, int nid)
+{
+ /* pump up @end */
+ if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+ end = memblock.current_limit;
+
+ /* avoid allocating the first page */
+ start = max_t(phys_addr_t, start, PAGE_SIZE);
+ end = max(start, end);
+
+ return __memblock_find_range_rev(start, end, size, align, nid);
+}
+
+/**
* memblock_find_in_range - find free area in given range
* @start: start of candidate range
* @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
--
1.7.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH v5 1/6] memblock: Factor out of top-down allocation
2013-09-24 18:25 ` [PATCH v5 1/6] memblock: Factor out of top-down allocation Zhang Yanfei
@ 2013-09-27 22:23 ` Toshi Kani
0 siblings, 0 replies; 34+ messages in thread
From: Toshi Kani @ 2013-09-27 22:23 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Tejun Heo, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On Wed, 2013-09-25 at 02:25 +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
>
> This patch creates a new function, __memblock_find_range_rev(),
> to factor the top-down allocation out of memblock_find_in_range_node().
> This is preparation for the new bottom-up allocation mode introduced
> in the following patch.
>
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
A minor comment below...
> ---
> mm/memblock.c | 47 ++++++++++++++++++++++++++++++++++-------------
> 1 files changed, 34 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 0ac412a..3d80c74 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -83,33 +83,25 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
> }
>
> /**
> - * memblock_find_in_range_node - find free area in given range and node
> + * __memblock_find_range_rev - find free area utility, in reverse order
> * @start: start of candidate range
> * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
> * @size: size of free area to find
> * @align: alignment of free area to find
> * @nid: nid of the free area to find, %MAX_NUMNODES for any node
> *
> - * Find @size free area aligned to @align in the specified range and node.
> + * Utility called from memblock_find_in_range_node(), find free area top-down.
> *
> * RETURNS:
> * Found address on success, %0 on failure.
> */
> -phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> - phys_addr_t end, phys_addr_t size,
> - phys_addr_t align, int nid)
> +static phys_addr_t __init_memblock
> +__memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
Since we are now using the terms "top down" and "bottom up"
consistently, how about naming this function
__memblock_find_range_top_down()?
Thanks,
-Toshi
* [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode
2013-09-24 18:23 [PATCH v5 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed Zhang Yanfei
2013-09-24 18:25 ` [PATCH v5 1/6] memblock: Factor out of top-down allocation Zhang Yanfei
@ 2013-09-24 18:27 ` Zhang Yanfei
2013-09-26 14:45 ` Tejun Heo
2013-09-27 22:29 ` Toshi Kani
2013-09-24 18:29 ` [PATCH v5 3/6] x86/mm: Factor out of top-down direct mapping setup Zhang Yanfei
` (3 subsequent siblings)
5 siblings, 2 replies; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-24 18:27 UTC (permalink / raw)
To: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
Rik van Riel, jweiner, prarit
Cc: x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org, Linux MM,
linux-acpi, imtangchen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
pages cannot be hot-removed, so we cannot allocate hotpluggable memory for
the kernel.
The ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
But memblock starts allocating memory for the kernel before the SRAT is
parsed, so we need to keep those early allocations out of hotpluggable memory.
In a memory hotplug system, any NUMA node the kernel resides in should
be unhotpluggable. On a modern server, each node could have at least
16GB of memory, so memory around the kernel image is highly likely to be
unhotpluggable.
So the basic idea is: allocate memory upward, starting from the end of the
kernel image. Since little memory is allocated before the SRAT is parsed,
it is highly likely to be on the same node as the kernel image.
Currently memblock can only allocate memory top-down, so this patch introduces
a new bottom-up allocation mode. Later, when we use this allocation
direction, we will limit the start address to be above the kernel.
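As a rough illustration of the two search directions described above, here
is a userspace sketch in C. It is not the kernel code: struct free_range,
find_bottom_up(), find_top_down(), find_range() and the kernel_end parameter
are hypothetical names made up for this example, and a plain sorted array of
free ranges stands in for memblock's region lists. The result is kept in an
address-sized type so that high addresses are not truncated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;	/* userspace stand-in for the kernel type */

/* Hypothetical free region [start, end); the array is sorted ascending. */
struct free_range {
	phys_addr_t start;
	phys_addr_t end;
};

static inline phys_addr_t clamp_addr(phys_addr_t v, phys_addr_t lo,
				     phys_addr_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Both rounding helpers assume align is a non-zero power of two. */
static inline phys_addr_t round_up_addr(phys_addr_t v, phys_addr_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (v + align - 1) & ~(align - 1);
}

static inline phys_addr_t round_down_addr(phys_addr_t v, phys_addr_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return v & ~(align - 1);
}

/* Walk ranges from low to high; return the lowest aligned fit, 0 if none. */
phys_addr_t find_bottom_up(const struct free_range *r, size_t n,
			   phys_addr_t start, phys_addr_t end,
			   phys_addr_t size, phys_addr_t align)
{
	for (size_t i = 0; i < n; i++) {
		phys_addr_t s = clamp_addr(r[i].start, start, end);
		phys_addr_t e = clamp_addr(r[i].end, start, end);
		phys_addr_t cand = round_up_addr(s, align);

		if (cand < e && e - cand >= size)
			return cand;
	}
	return 0;
}

/* Walk ranges from high to low; return the highest aligned fit, 0 if none. */
phys_addr_t find_top_down(const struct free_range *r, size_t n,
			  phys_addr_t start, phys_addr_t end,
			  phys_addr_t size, phys_addr_t align)
{
	for (size_t i = n; i-- > 0; ) {
		phys_addr_t s = clamp_addr(r[i].start, start, end);
		phys_addr_t e = clamp_addr(r[i].end, start, end);
		phys_addr_t cand;

		if (e < size)
			continue;
		cand = round_down_addr(e - size, align);
		if (cand >= s)
			return cand;
	}
	return 0;
}

/*
 * Try bottom-up above kernel_end first when that mode is requested,
 * then fall back to an unrestricted top-down search.
 */
phys_addr_t find_range(const struct free_range *r, size_t n,
		       phys_addr_t start, phys_addr_t end,
		       phys_addr_t size, phys_addr_t align,
		       int bottom_up, phys_addr_t kernel_end)
{
	if (bottom_up && end > kernel_end) {
		phys_addr_t lo = start > kernel_end ? start : kernel_end;
		phys_addr_t ret = find_bottom_up(r, n, lo, end, size, align);

		if (ret)
			return ret;
	}
	return find_top_down(r, n, start, end, size, align);
}
```

With free ranges on both sides of a kernel image, the bottom-up call returns
the first fit just above kernel_end, while the top-down call returns the
highest fit in the whole range.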
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
include/linux/memblock.h | 16 +++++++++
mm/memblock.c | 81 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 95 insertions(+), 2 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 31e95ac..c1e2633 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -35,6 +35,7 @@ struct memblock_type {
};
struct memblock {
+ bool bottom_up; /* is bottom up direction? */
phys_addr_t current_limit;
struct memblock_type memory;
struct memblock_type reserved;
@@ -148,6 +149,21 @@ phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid)
phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);
+#ifdef CONFIG_MOVABLE_NODE
+static inline void memblock_set_bottom_up(bool enable)
+{
+ memblock.bottom_up = enable;
+}
+
+static inline bool memblock_bottom_up(void)
+{
+ return memblock.bottom_up;
+}
+#else
+static inline void memblock_set_bottom_up(bool enable) {}
+static inline bool memblock_bottom_up(void) { return false; }
+#endif
+
/* Flags for memblock_alloc_base() amd __memblock_alloc_base() */
#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
#define MEMBLOCK_ALLOC_ACCESSIBLE 0
diff --git a/mm/memblock.c b/mm/memblock.c
index 3d80c74..a8e81c3 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -20,6 +20,8 @@
#include <linux/seq_file.h>
#include <linux/memblock.h>
+#include <asm-generic/sections.h>
+
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
@@ -32,6 +34,7 @@ struct memblock memblock __initdata_memblock = {
.reserved.cnt = 1, /* empty dummy entry */
.reserved.max = INIT_MEMBLOCK_REGIONS,
+ .bottom_up = false,
.current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};
@@ -83,6 +86,38 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
}
/**
+ * __memblock_find_range - find free area utility
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(), find free area bottom-up.
+ *
+ * RETURNS:
+ * Found address on success, 0 on failure.
+ */
+static phys_addr_t __init_memblock
+__memblock_find_range(phys_addr_t start, phys_addr_t end, phys_addr_t size,
+ phys_addr_t align, int nid)
+{
+ phys_addr_t this_start, this_end, cand;
+ u64 i;
+
+ for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
+ this_start = clamp(this_start, start, end);
+ this_end = clamp(this_end, start, end);
+
+ cand = round_up(this_start, align);
+ if (cand < this_end && this_end - cand >= size)
+ return cand;
+ }
+
+ return 0;
+}
+
+/**
* __memblock_find_range_rev - find free area utility, in reverse order
* @start: start of candidate range
* @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
@@ -93,7 +128,7 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
* Utility called from memblock_find_in_range_node(), find free area top-down.
*
* RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
*/
static phys_addr_t __init_memblock
__memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
@@ -127,13 +162,24 @@ __memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
*
* Find @size free area aligned to @align in the specified range and node.
*
+ * When allocation direction is bottom-up, the @start should be greater
+ * than the end of the kernel image. Otherwise, it will be trimmed. The
+ * reason is that we want the bottom-up allocation just near the kernel
+ * image so it is highly likely that the allocated memory and the kernel
+ * will reside in the same node.
+ *
+ * If bottom-up allocation failed, will try to allocate memory top-down.
+ *
* RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
*/
phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
phys_addr_t end, phys_addr_t size,
phys_addr_t align, int nid)
{
+ phys_addr_t ret;
+ phys_addr_t kernel_end;
+
/* pump up @end */
if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
end = memblock.current_limit;
@@ -141,6 +187,37 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
/* avoid allocating the first page */
start = max_t(phys_addr_t, start, PAGE_SIZE);
end = max(start, end);
+ kernel_end = __pa_symbol(_end);
+
+ /*
+ * try bottom-up allocation only when bottom-up mode
+ * is set and @end is above the kernel image.
+ */
+ if (memblock_bottom_up() && end > kernel_end) {
+ phys_addr_t bottom_up_start;
+
+ /* make sure we will allocate above the kernel */
+ bottom_up_start = max(start, kernel_end);
+
+ /* ok, try bottom-up allocation first */
+ ret = __memblock_find_range(bottom_up_start, end,
+ size, align, nid);
+ if (ret)
+ return ret;
+
+ /*
+ * we always limit bottom-up allocation above the kernel,
+ * but top-down allocation doesn't have the limit, so
+ * retrying top-down allocation may succeed when bottom-up
+ * allocation failed.
+ *
+ * bottom-up allocation is expected to be fail very rarely,
+ * so we use WARN_ONCE() here to see the stack trace if
+ * fail happens.
+ */
+ WARN_ONCE(1, "memblock: Failed to allocate memory in bottom up "
+ "direction. Now try top down direction.\n");
+ }
return __memblock_find_range_rev(start, end, size, align, nid);
}
--
1.7.1
* Re: [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode
2013-09-24 18:27 ` [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode Zhang Yanfei
@ 2013-09-26 14:45 ` Tejun Heo
2013-09-26 15:37 ` Zhang Yanfei
2013-09-27 22:29 ` Toshi Kani
1 sibling, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2013-09-26 14:45 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
Hello,
On Wed, Sep 25, 2013 at 02:27:48AM +0800, Zhang Yanfei wrote:
> +#ifdef CONFIG_MOVABLE_NODE
> +static inline void memblock_set_bottom_up(bool enable)
> +{
> + memblock.bottom_up = enable;
> +}
> +
> +static inline bool memblock_bottom_up(void)
> +{
> + return memblock.bottom_up;
> +}
Can you please explain what this is for here?
> + /*
> + * we always limit bottom-up allocation above the kernel,
> + * but top-down allocation doesn't have the limit, so
> + * retrying top-down allocation may succeed when bottom-up
> + * allocation failed.
> + *
> + * bottom-up allocation is expected to be fail very rarely,
> + * so we use WARN_ONCE() here to see the stack trace if
> + * fail happens.
> + */
> + WARN_ONCE(1, "memblock: Failed to allocate memory in bottom up "
> + "direction. Now try top down direction.\n");
> + }
You and I would know what was going on and what the consequence of the
failure may be but the above warning message is kinda useless to a
user / admin, right? It doesn't really say anything meaningful.
Thanks.
--
tejun
* Re: [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode
2013-09-26 14:45 ` Tejun Heo
@ 2013-09-26 15:37 ` Zhang Yanfei
2013-09-26 15:50 ` Tejun Heo
2013-09-26 16:51 ` Zhang Yanfei
0 siblings, 2 replies; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-26 15:37 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
Hello Tejun,
First, thanks for your quick comments :)
On 09/26/2013 10:45 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 25, 2013 at 02:27:48AM +0800, Zhang Yanfei wrote:
>> +#ifdef CONFIG_MOVABLE_NODE
>> +static inline void memblock_set_bottom_up(bool enable)
>> +{
>> + memblock.bottom_up = enable;
>> +}
>> +
>> +static inline bool memblock_bottom_up(void)
>> +{
>> + return memblock.bottom_up;
>> +}
>
> Can you please explain what this is for here?
OK, will do.
>
>> + /*
>> + * we always limit bottom-up allocation above the kernel,
>> + * but top-down allocation doesn't have the limit, so
>> + * retrying top-down allocation may succeed when bottom-up
>> + * allocation failed.
>> + *
>> + * bottom-up allocation is expected to be fail very rarely,
>> + * so we use WARN_ONCE() here to see the stack trace if
>> + * fail happens.
>> + */
>> + WARN_ONCE(1, "memblock: Failed to allocate memory in bottom up "
>> + "direction. Now try top down direction.\n");
>> + }
>
> You and I would know what was going on and what the consequence of the
> failure may be but the above warning message is kinda useless to a
> user / admin, right? It doesn't really say anything meaningful.
>
Hmm... Maybe something like this:
WARN_ONCE(1, "Failed to allocate memory above the kernel in bottom-up mode, "
"trying to allocate memory below the kernel.\n");
Thanks
--
Thanks.
Zhang Yanfei
* Re: [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode
2013-09-26 15:37 ` Zhang Yanfei
@ 2013-09-26 15:50 ` Tejun Heo
2013-09-26 15:54 ` Zhang Yanfei
2013-09-26 16:51 ` Zhang Yanfei
1 sibling, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2013-09-26 15:50 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On Thu, Sep 26, 2013 at 11:37:34PM +0800, Zhang Yanfei wrote:
> >> + WARN_ONCE(1, "memblock: Failed to allocate memory in bottom up "
> >> + "direction. Now try top down direction.\n");
> >> + }
> >
> > You and I would know what was going on and what the consequence of the
> > failure may be but the above warning message is kinda useless to a
> > user / admin, right? It doesn't really say anything meaningful.
> >
>
> Hmm... Maybe something like this:
>
> WARN_ONCE(1, "Failed to allocate memory above the kernel in bottom-up mode, "
> "trying to allocate memory below the kernel.\n");
How about something like "memblock: bottom-up allocation failed,
memory hotunplug may be affected\n".
Thanks.
--
tejun
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode
2013-09-26 15:50 ` Tejun Heo
@ 2013-09-26 15:54 ` Zhang Yanfei
0 siblings, 0 replies; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-26 15:54 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On 09/26/2013 11:50 PM, Tejun Heo wrote:
> On Thu, Sep 26, 2013 at 11:37:34PM +0800, Zhang Yanfei wrote:
>>>> + WARN_ONCE(1, "memblock: Failed to allocate memory in bottom up "
>>>> + "direction. Now try top down direction.\n");
>>>> + }
>>>
>>> You and I would know what was going on and what the consequence of the
>>> failure may be but the above warning message is kinda useless to a
>>> user / admin, right? It doesn't really say anything meaningful.
>>>
>>
>> Hmm... Maybe something like this:
>>
>> WARN_ONCE(1, "Failed to allocate memory above the kernel in bottom-up mode, "
>> "trying to allocate memory below the kernel.\n");
>
> How about something like "memblock: bottom-up allocation failed,
> memory hotunplug may be affected\n".
>
OK, I understand what you want: explicitly telling the user which
functionality may be affected by the failure. Yes, that is much more
meaningful. I will take yours, thanks.
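For readers unfamiliar with the warn-once pattern discussed here, a small
userspace sketch in C follows. DEMO_WARN_ONCE, demo_warn_count and
demo_fallback_path() are hypothetical names for this example only. The
kernel's real WARN_ONCE() does more (it also dumps a stack trace), so this
only shows the "print at most once per call site" behaviour, using the
message wording agreed on above.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts how many times the demo warning was actually emitted. */
int demo_warn_count;

/*
 * Hypothetical userspace imitation of a warn-once macro: each expansion
 * gets its own static flag, so a given call site prints at most once.
 */
#define DEMO_WARN_ONCE(cond, msg)			\
	do {						\
		static bool demo_warned;		\
		if ((cond) && !demo_warned) {		\
			demo_warned = true;		\
			demo_warn_count++;		\
			fputs((msg), stderr);		\
		}					\
	} while (0)

/* Stand-in for the path taken when the bottom-up search finds nothing. */
void demo_fallback_path(void)
{
	DEMO_WARN_ONCE(1, "memblock: bottom-up allocation failed, "
			  "memory hotunplug may be affected\n");
}
```

Calling demo_fallback_path() repeatedly emits the message only on the first
call, which is why a rare failure does not flood the log.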
--
Thanks.
Zhang Yanfei
* Re: [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode
2013-09-26 15:37 ` Zhang Yanfei
2013-09-26 15:50 ` Tejun Heo
@ 2013-09-26 16:51 ` Zhang Yanfei
1 sibling, 0 replies; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-26 16:51 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On 09/26/2013 11:37 PM, Zhang Yanfei wrote:
> Hello Tejun,
>
> First, thanks for your quick comments :)
>
> On 09/26/2013 10:45 PM, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Sep 25, 2013 at 02:27:48AM +0800, Zhang Yanfei wrote:
>>> +#ifdef CONFIG_MOVABLE_NODE
>>> +static inline void memblock_set_bottom_up(bool enable)
>>> +{
>>> + memblock.bottom_up = enable;
>>> +}
>>> +
>>> +static inline bool memblock_bottom_up(void)
>>> +{
>>> + return memblock.bottom_up;
>>> +}
>>
>> Can you please explain what this is for here?
>
> OK, will do.
I have written the function descriptions here so that you can still
comment on them in this version.
/*
* Set the allocation direction to bottom-up or top-down.
*/
static inline void memblock_set_bottom_up(bool enable)
{
memblock.bottom_up = enable;
}
/*
* Check whether the allocation direction is bottom-up.
* If this returns true, the boot option "movablenode"
* has been specified, and memblock will allocate memory
* just near the kernel image.
*/
static inline bool memblock_bottom_up(void)
{
return memblock.bottom_up;
}
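The accessor pair and its compiled-out stubs can be tried outside the kernel
with a sketch like the one below. DEMO_MOVABLE_NODE, struct demo_memblock,
demo_set_bottom_up() and demo_bottom_up() are hypothetical stand-ins for
CONFIG_MOVABLE_NODE, struct memblock and the two helpers; this only
illustrates the pattern, not the final patch.

```c
#include <stdbool.h>

/* Hypothetical build switch standing in for CONFIG_MOVABLE_NODE. */
#define DEMO_MOVABLE_NODE 1

/* Hypothetical stand-in for the allocator state; only the flag matters. */
struct demo_memblock {
	bool bottom_up;		/* true: search upward from the kernel image */
};

struct demo_memblock demo_memblock = { .bottom_up = false };

#if DEMO_MOVABLE_NODE
/* Set the allocation direction to bottom-up (true) or top-down (false). */
static inline void demo_set_bottom_up(bool enable)
{
	demo_memblock.bottom_up = enable;
}

/* Report whether allocations currently go bottom-up. */
static inline bool demo_bottom_up(void)
{
	return demo_memblock.bottom_up;
}
#else
/* Feature compiled out: the setter is a no-op, the getter is constant. */
static inline void demo_set_bottom_up(bool enable) { (void)enable; }
static inline bool demo_bottom_up(void) { return false; }
#endif
```

With the switch set to 0, callers still compile unchanged and the compiler
can drop any branch guarded by demo_bottom_up().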
Thanks.
--
Thanks.
Zhang Yanfei
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode
2013-09-24 18:27 ` [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode Zhang Yanfei
2013-09-26 14:45 ` Tejun Heo
@ 2013-09-27 22:29 ` Toshi Kani
1 sibling, 0 replies; 34+ messages in thread
From: Toshi Kani @ 2013-09-27 22:29 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Tejun Heo, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On Wed, 2013-09-25 at 02:27 +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
>
> The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
> pages cannot be hot-removed, so we cannot allocate hotpluggable memory for
> the kernel.
>
> The ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
> But memblock starts allocating memory for the kernel before the SRAT is
> parsed, so we need to keep those early allocations out of hotpluggable memory.
>
> In a memory hotplug system, any NUMA node the kernel resides in should
> be unhotpluggable. On a modern server, each node could have at least
> 16GB of memory, so memory around the kernel image is highly likely to be
> unhotpluggable.
>
> So the basic idea is: allocate memory upward, starting from the end of the
> kernel image. Since little memory is allocated before the SRAT is parsed,
> it is highly likely to be on the same node as the kernel image.
>
> Currently memblock can only allocate memory top-down, so this patch introduces
> a new bottom-up allocation mode. Later, when we use this allocation
> direction, we will limit the start address to be above the kernel.
>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
:
> /**
> + * __memblock_find_range - find free area utility
> + * @start: start of candidate range
> + * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
> + * @size: size of free area to find
> + * @align: alignment of free area to find
> + * @nid: nid of the free area to find, %MAX_NUMNODES for any node
> + *
> + * Utility called from memblock_find_in_range_node(), find free area bottom-up.
> + *
> + * RETURNS:
> + * Found address on success, 0 on failure.
> + */
> +static phys_addr_t __init_memblock
> +__memblock_find_range(phys_addr_t start, phys_addr_t end, phys_addr_t size,
Similarly, how about naming this function
__memblock_find_range_bottom_up()?
> + phys_addr_t align, int nid)
> +{
> + phys_addr_t this_start, this_end, cand;
> + u64 i;
> +
> + for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
> + this_start = clamp(this_start, start, end);
> + this_end = clamp(this_end, start, end);
> +
> + cand = round_up(this_start, align);
> + if (cand < this_end && this_end - cand >= size)
> + return cand;
> + }
> +
> + return 0;
> +}
> +
> +/**
> * __memblock_find_range_rev - find free area utility, in reverse order
> * @start: start of candidate range
> * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
> @@ -93,7 +128,7 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
> * Utility called from memblock_find_in_range_node(), find free area top-down.
> *
> * RETURNS:
> - * Found address on success, %0 on failure.
> + * Found address on success, 0 on failure.
> */
> static phys_addr_t __init_memblock
> __memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
> @@ -127,13 +162,24 @@ __memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
> *
> * Find @size free area aligned to @align in the specified range and node.
> *
> + * When allocation direction is bottom-up, the @start should be greater
> + * than the end of the kernel image. Otherwise, it will be trimmed. The
> + * reason is that we want the bottom-up allocation just near the kernel
> + * image so it is highly likely that the allocated memory and the kernel
> + * will reside in the same node.
> + *
> + * If bottom-up allocation failed, will try to allocate memory top-down.
> + *
> * RETURNS:
> - * Found address on success, %0 on failure.
> + * Found address on success, 0 on failure.
> */
> phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> phys_addr_t end, phys_addr_t size,
> phys_addr_t align, int nid)
> {
> + phys_addr_t ret;
> + phys_addr_t kernel_end;
> +
> /* pump up @end */
> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> end = memblock.current_limit;
> @@ -141,6 +187,37 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> /* avoid allocating the first page */
> start = max_t(phys_addr_t, start, PAGE_SIZE);
> end = max(start, end);
> + kernel_end = __pa_symbol(_end);
Please address the issue in __pa_symbol() that Andrew pointed out.
Thanks,
-Toshi
* [PATCH v5 3/6] x86/mm: Factor out of top-down direct mapping setup
2013-09-24 18:23 [PATCH v5 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed Zhang Yanfei
2013-09-24 18:25 ` [PATCH v5 1/6] memblock: Factor out of top-down allocation Zhang Yanfei
2013-09-24 18:27 ` [PATCH v5 2/6] memblock: Introduce bottom-up allocation mode Zhang Yanfei
@ 2013-09-24 18:29 ` Zhang Yanfei
2013-09-26 14:46 ` Tejun Heo
2013-09-27 22:43 ` Toshi Kani
2013-09-24 18:30 ` [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up Zhang Yanfei
` (2 subsequent siblings)
5 siblings, 2 replies; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-24 18:29 UTC (permalink / raw)
To: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
Rik van Riel, jweiner, prarit
Cc: x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org, Linux MM,
linux-acpi, imtangchen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
This patch creates a new function, memory_map_top_down, which
factors the top-down direct-mapping page table setup out of
init_mem_mapping. It prepares for the following patch, which
introduces bottom-up memory mapping. The two ways of setting up
the page tables then live in separate functions, and
init_mem_mapping chooses between them, which makes the code
clearer.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
arch/x86/mm/init.c | 58 ++++++++++++++++++++++++++++++++++------------------
1 files changed, 38 insertions(+), 20 deletions(-)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 04664cd..dbe57e5 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -401,27 +401,26 @@ static unsigned long __init init_range_memory_mapping(
/* (PUD_SHIFT-PMD_SHIFT)/2 */
#define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+
+/**
+ * memory_map_top_down - Map [map_start, map_end) top down
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in top-down.
+ */
+static void __init memory_map_top_down(unsigned long map_start,
+ unsigned long map_end)
{
- unsigned long end, real_end, start, last_start;
+ unsigned long real_end, start, last_start;
unsigned long step_size;
unsigned long addr;
unsigned long mapped_ram_size = 0;
unsigned long new_mapped_ram_size;
- probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
- end = max_pfn << PAGE_SHIFT;
-#else
- end = max_low_pfn << PAGE_SHIFT;
-#endif
-
- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
-
/* xen has big range in reserved near end of ram, skip it at first.*/
- addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+ addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;
/* step_size need to be small so pgt_buf from BRK could cover it */
@@ -436,13 +435,13 @@ void __init init_mem_mapping(void)
* end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
* for page table.
*/
- while (last_start > ISA_END_ADDRESS) {
+ while (last_start > map_start) {
if (last_start > step_size) {
start = round_down(last_start - 1, step_size);
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
+ if (start < map_start)
+ start = map_start;
} else
- start = ISA_END_ADDRESS;
+ start = map_start;
new_mapped_ram_size = init_range_memory_mapping(start,
last_start);
last_start = start;
@@ -453,8 +452,27 @@ void __init init_mem_mapping(void)
mapped_ram_size += new_mapped_ram_size;
}
- if (real_end < end)
- init_range_memory_mapping(real_end, end);
+ if (real_end < map_end)
+ init_range_memory_mapping(real_end, map_end);
+}
+
+void __init init_mem_mapping(void)
+{
+ unsigned long end;
+
+ probe_page_size_mask();
+
+#ifdef CONFIG_X86_64
+ end = max_pfn << PAGE_SHIFT;
+#else
+ end = max_low_pfn << PAGE_SHIFT;
+#endif
+
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+
+ /* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
+ memory_map_top_down(ISA_END_ADDRESS, end);
#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
--
1.7.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v5 3/6] x86/mm: Factor out of top-down direct mapping setup
2013-09-24 18:29 ` [PATCH v5 3/6] x86/mm: Factor out of top-down direct mapping setup Zhang Yanfei
@ 2013-09-26 14:46 ` Tejun Heo
2013-09-26 15:39 ` Zhang Yanfei
2013-09-27 22:43 ` Toshi Kani
1 sibling, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2013-09-26 14:46 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On Wed, Sep 25, 2013 at 02:29:06AM +0800, Zhang Yanfei wrote:
> +/**
> + * memory_map_top_down - Map [map_start, map_end) top down
> + * @map_start: start address of the target memory range
> + * @map_end: end address of the target memory range
> + *
> + * This function will setup direct mapping for memory range
> + * [map_start, map_end) in top-down.
Can you please put a bit more effort into the function description?
Other than that,
Acked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 3/6] x86/mm: Factor out of top-down direct mapping setup
2013-09-26 14:46 ` Tejun Heo
@ 2013-09-26 15:39 ` Zhang Yanfei
2013-09-26 16:56 ` Zhang Yanfei
0 siblings, 1 reply; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-26 15:39 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On 09/26/2013 10:46 PM, Tejun Heo wrote:
> On Wed, Sep 25, 2013 at 02:29:06AM +0800, Zhang Yanfei wrote:
>> +/**
>> + * memory_map_top_down - Map [map_start, map_end) top down
>> + * @map_start: start address of the target memory range
>> + * @map_end: end address of the target memory range
>> + *
>> + * This function will setup direct mapping for memory range
>> + * [map_start, map_end) in top-down.
>
> Can you please put a bit more effort into the function description?
Sorry. I will try to write a more detailed description.
>
> Other than that,
>
> Acked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
Thanks.
Zhang Yanfei
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 3/6] x86/mm: Factor out of top-down direct mapping setup
2013-09-26 15:39 ` Zhang Yanfei
@ 2013-09-26 16:56 ` Zhang Yanfei
0 siblings, 0 replies; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-26 16:56 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
Hello tejun,
On 09/26/2013 11:39 PM, Zhang Yanfei wrote:
> On 09/26/2013 10:46 PM, Tejun Heo wrote:
>> On Wed, Sep 25, 2013 at 02:29:06AM +0800, Zhang Yanfei wrote:
>>> +/**
>>> + * memory_map_top_down - Map [map_start, map_end) top down
>>> + * @map_start: start address of the target memory range
>>> + * @map_end: end address of the target memory range
>>> + *
>>> + * This function will setup direct mapping for memory range
>>> + * [map_start, map_end) in top-down.
>>
>> Can you please put a bit more effort into the function description?
>
> Sorry.... I will try to make a more detailed description.
Trying below:
/**
* memory_map_top_down - Map [map_start, map_end) top down
* @map_start: start address of the target memory range
* @map_end: end address of the target memory range
*
* This function sets up the direct mapping for the memory range
* [map_start, map_end) top-down. That is, the page tables are
* allocated at the end of memory, and the memory is mapped
* from the top down.
*/
static void __init memory_map_top_down(unsigned long map_start,
unsigned long map_end)
{
Thanks.
>
>>
>> Other than that,
>>
>> Acked-by: Tejun Heo <tj@kernel.org>
>
> Thanks.
>
--
Thanks.
Zhang Yanfei
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 3/6] x86/mm: Factor out of top-down direct mapping setup
2013-09-24 18:29 ` [PATCH v5 3/6] x86/mm: Factor out of top-down direct mapping setup Zhang Yanfei
2013-09-26 14:46 ` Tejun Heo
@ 2013-09-27 22:43 ` Toshi Kani
1 sibling, 0 replies; 34+ messages in thread
From: Toshi Kani @ 2013-09-27 22:43 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Tejun Heo, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On Wed, 2013-09-25 at 02:29 +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
>
> This patch creates a new function memory_map_top_down to
> factor out of the top-down direct memory mapping pagetable
> setup. This is also a preparation for the following patch,
> which will introduce the bottom-up memory mapping. That said,
> we will put the two ways of pagetable setup into separate
> functions, and choose to use which way in init_mem_mapping,
> which makes the code more clear.
>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Thanks,
-Toshi
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
2013-09-24 18:23 [PATCH v5 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed Zhang Yanfei
` (2 preceding siblings ...)
2013-09-24 18:29 ` [PATCH v5 3/6] x86/mm: Factor out of top-down direct mapping setup Zhang Yanfei
@ 2013-09-24 18:30 ` Zhang Yanfei
2013-09-26 14:48 ` Tejun Heo
2013-09-26 22:52 ` Andrew Morton
2013-09-24 18:34 ` [PATCH v5 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed Zhang Yanfei
2013-09-24 18:35 ` [PATCH v5 6/6] mem-hotplug: Introduce movablenode boot option Zhang Yanfei
5 siblings, 2 replies; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-24 18:30 UTC (permalink / raw)
To: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
Rik van Riel, jweiner, prarit
Cc: x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org, Linux MM,
linux-acpi, imtangchen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
The Linux kernel cannot migrate pages used by the kernel itself. As a
result, kernel pages cannot be hot-removed, so we must not allocate
hotpluggable memory for the kernel.
In a memory hotplug system, any NUMA node the kernel resides in
should not be hotpluggable. On a modern server, each node could
have at least 16GB of memory, so memory around the kernel image is
very likely not hotpluggable.
The ACPI SRAT (System Resource Affinity Table) contains the memory
hotplug info. But before the SRAT is parsed, memblock has already
started to allocate memory for the kernel, so we need to prevent
memblock from handing out hotpluggable memory at that stage.
Setting up the direct-mapping page tables is one such case:
init_mem_mapping() is called before the SRAT is parsed. To prevent page
tables from being allocated within hotpluggable memory, we allocate them
bottom-up, from the end of the kernel image towards higher memory.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
arch/x86/mm/init.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 62 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index dbe57e5..d35363e 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -456,6 +456,48 @@ static void __init memory_map_top_down(unsigned long map_start,
init_range_memory_mapping(real_end, map_end);
}
+/**
+ * memory_map_bottom_up - Map [map_start, map_end) bottom up
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in bottom-up.
+ */
+static void __init memory_map_bottom_up(unsigned long map_start,
+ unsigned long map_end)
+{
+ unsigned long next, new_mapped_ram_size, start;
+ unsigned long mapped_ram_size = 0;
+ /* step_size need to be small so pgt_buf from BRK could cover it */
+ unsigned long step_size = PMD_SIZE;
+
+ start = map_start;
+ min_pfn_mapped = start >> PAGE_SHIFT;
+
+ /*
+ * We start from the bottom (@map_start) and go to the top (@map_end).
+ * The memblock_find_in_range() gets us a block of RAM from the
+ * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
+ * for page table.
+ */
+ while (start < map_end) {
+ if (map_end - start > step_size) {
+ next = round_up(start + 1, step_size);
+ if (next > map_end)
+ next = map_end;
+ } else
+ next = map_end;
+
+ new_mapped_ram_size = init_range_memory_mapping(start, next);
+ start = next;
+
+ if (new_mapped_ram_size > mapped_ram_size)
+ step_size <<= STEP_SIZE_SHIFT;
+ mapped_ram_size += new_mapped_ram_size;
+ }
+}
+
void __init init_mem_mapping(void)
{
unsigned long end;
@@ -471,8 +513,26 @@ void __init init_mem_mapping(void)
/* the ISA range is always mapped regardless of memory holes */
init_memory_mapping(0, ISA_END_ADDRESS);
- /* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
- memory_map_top_down(ISA_END_ADDRESS, end);
+ /*
+ * If the allocation is in bottom-up direction, we setup direct mapping
+ * in bottom-up, otherwise we setup direct mapping in top-down.
+ */
+ if (memblock_bottom_up()) {
+ unsigned long kernel_end;
+
+ kernel_end = __pa_symbol(_end);
+ /*
+ * we need two separate calls here. This is because we want to
+ * allocate page tables above the kernel. So we first map
+ * [kernel_end, end) to make memory above the kernel be mapped
+ * as soon as possible. And then use page tables allocated above
+ * the kernel to map [ISA_END_ADDRESS, kernel_end).
+ */
+ memory_map_bottom_up(kernel_end, end);
+ memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
+ } else {
+ memory_map_top_down(ISA_END_ADDRESS, end);
+ }
#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
--
1.7.1
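For readers following the stepping logic in memory_map_bottom_up(), here is a minimal user-space sketch of the loop. It is not the kernel code: fake_init_range_memory_mapping() is a hypothetical stand-in that assumes the whole range is RAM and is mapped in full, PMD_SIZE is assumed to be 2MiB, and the function returns the number of loop rounds purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE        (1ULL << 21)   /* assumption: 2 MiB PMDs */
#define STEP_SIZE_SHIFT 5              /* same value as in the patch */

/* Hypothetical stand-in: pretend every byte of [start, end) is RAM. */
static uint64_t fake_init_range_memory_mapping(uint64_t start, uint64_t end)
{
	return end - start;
}

/* Power-of-two round-up, equivalent to the kernel's round_up(). */
static uint64_t round_up_pow2(uint64_t x, uint64_t step)
{
	return ((x - 1) | (step - 1)) + 1;
}

/* Walk [map_start, map_end) bottom-up; return how many rounds it took. */
static unsigned int map_bottom_up(uint64_t map_start, uint64_t map_end)
{
	uint64_t next, new_mapped, mapped = 0;
	uint64_t start = map_start;
	uint64_t step_size = PMD_SIZE;   /* start small, grow as we map more */
	unsigned int rounds = 0;

	assert(map_start <= map_end);

	while (start < map_end) {
		if (map_end - start > step_size) {
			next = round_up_pow2(start + 1, step_size);
			if (next > map_end)
				next = map_end;
		} else {
			next = map_end;
		}

		new_mapped = fake_init_range_memory_mapping(start, next);
		start = next;

		/* Once more is mapped than before, a bigger step is affordable. */
		if (new_mapped > mapped)
			step_size <<= STEP_SIZE_SHIFT;
		mapped += new_mapped;
		rounds++;
	}
	return rounds;
}
```

Under these assumptions, mapping [16MiB, 4GiB) takes four rounds, ending at 18MiB, 64MiB, 2GiB and 4GiB, because the step grows by a factor of 32 each time the newly mapped amount exceeds the running total.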
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
2013-09-24 18:30 ` [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up Zhang Yanfei
@ 2013-09-26 14:48 ` Tejun Heo
2013-09-26 15:43 ` Zhang Yanfei
2013-09-26 17:00 ` Zhang Yanfei
2013-09-26 22:52 ` Andrew Morton
1 sibling, 2 replies; 34+ messages in thread
From: Tejun Heo @ 2013-09-26 14:48 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
Hello,
On Wed, Sep 25, 2013 at 02:30:51AM +0800, Zhang Yanfei wrote:
> +/**
> + * memory_map_bottom_up - Map [map_start, map_end) bottom up
> + * @map_start: start address of the target memory range
> + * @map_end: end address of the target memory range
> + *
> + * This function will setup direct mapping for memory range
> + * [map_start, map_end) in bottom-up.
Ditto about the comment.
> + */
> +static void __init memory_map_bottom_up(unsigned long map_start,
> + unsigned long map_end)
> +{
> + unsigned long next, new_mapped_ram_size, start;
> + unsigned long mapped_ram_size = 0;
> + /* step_size need to be small so pgt_buf from BRK could cover it */
> + unsigned long step_size = PMD_SIZE;
> +
> + start = map_start;
> + min_pfn_mapped = start >> PAGE_SHIFT;
> +
> + /*
> + * We start from the bottom (@map_start) and go to the top (@map_end).
> + * The memblock_find_in_range() gets us a block of RAM from the
> + * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
> + * for page table.
> + */
> + while (start < map_end) {
> + if (map_end - start > step_size) {
> + next = round_up(start + 1, step_size);
> + if (next > map_end)
> + next = map_end;
> + } else
> + next = map_end;
> +
> + new_mapped_ram_size = init_range_memory_mapping(start, next);
> + start = next;
> +
> + if (new_mapped_ram_size > mapped_ram_size)
> + step_size <<= STEP_SIZE_SHIFT;
> + mapped_ram_size += new_mapped_ram_size;
> + }
> +}
As Yinghai pointed out in another thread, do we need to worry about
falling back to top-down?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
2013-09-26 14:48 ` Tejun Heo
@ 2013-09-26 15:43 ` Zhang Yanfei
2013-09-26 15:48 ` Tejun Heo
2013-09-26 17:00 ` Zhang Yanfei
1 sibling, 1 reply; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-26 15:43 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
Hello tejun,
On 09/26/2013 10:48 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 25, 2013 at 02:30:51AM +0800, Zhang Yanfei wrote:
>> +/**
>> + * memory_map_bottom_up - Map [map_start, map_end) bottom up
>> + * @map_start: start address of the target memory range
>> + * @map_end: end address of the target memory range
>> + *
>> + * This function will setup direct mapping for memory range
>> + * [map_start, map_end) in bottom-up.
>
> Ditto about the comment.
OK, will do.
>
>> + */
>> +static void __init memory_map_bottom_up(unsigned long map_start,
>> + unsigned long map_end)
>> +{
>> + unsigned long next, new_mapped_ram_size, start;
>> + unsigned long mapped_ram_size = 0;
>> + /* step_size need to be small so pgt_buf from BRK could cover it */
>> + unsigned long step_size = PMD_SIZE;
>> +
>> + start = map_start;
>> + min_pfn_mapped = start >> PAGE_SHIFT;
>> +
>> + /*
>> + * We start from the bottom (@map_start) and go to the top (@map_end).
>> + * The memblock_find_in_range() gets us a block of RAM from the
>> + * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
>> + * for page table.
>> + */
>> + while (start < map_end) {
>> + if (map_end - start > step_size) {
>> + next = round_up(start + 1, step_size);
>> + if (next > map_end)
>> + next = map_end;
>> + } else
>> + next = map_end;
>> +
>> + new_mapped_ram_size = init_range_memory_mapping(start, next);
>> + start = next;
>> +
>> + if (new_mapped_ram_size > mapped_ram_size)
>> + step_size <<= STEP_SIZE_SHIFT;
>> + mapped_ram_size += new_mapped_ram_size;
>> + }
>> +}
>
> As Yinghai pointed out in another thread, do we need to worry about
> falling back to top-down?
I've explained this to him. No, we don't need to worry about that. Even
though min_pfn_mapped drops to the pfn of ISA_END_ADDRESS in the second
memory_map_bottom_up() call, we won't allocate memory below the kernel,
because we have limited bottom-up allocation to memory above the kernel.
Thanks.
--
Thanks.
Zhang Yanfei
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
2013-09-26 15:43 ` Zhang Yanfei
@ 2013-09-26 15:48 ` Tejun Heo
2013-09-26 16:03 ` Zhang Yanfei
0 siblings, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2013-09-26 15:48 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On Thu, Sep 26, 2013 at 11:43:02PM +0800, Zhang Yanfei wrote:
> > As Yinghai pointed out in another thread, do we need to worry about
> > falling back to top-down?
>
> I've explained to him. Nop, we don't need to worry about that. Because even
> the min_pfn_mapped becomes ISA_END_ADDRESS in the second call below, we won't
> allocate memory below the kernel because we have limited the allocation above
> the kernel.
Maybe I misunderstood but wasn't he worrying about there not being
enough space above kernel? In that case, it'd automatically fall back
to top-down allocation anyway, right?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
2013-09-26 15:48 ` Tejun Heo
@ 2013-09-26 16:03 ` Zhang Yanfei
2013-09-26 16:08 ` Tejun Heo
0 siblings, 1 reply; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-26 16:03 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On 09/26/2013 11:48 PM, Tejun Heo wrote:
> On Thu, Sep 26, 2013 at 11:43:02PM +0800, Zhang Yanfei wrote:
>>> As Yinghai pointed out in another thread, do we need to worry about
>>> falling back to top-down?
>>
>> I've explained to him. Nop, we don't need to worry about that. Because even
>> the min_pfn_mapped becomes ISA_END_ADDRESS in the second call below, we won't
>> allocate memory below the kernel because we have limited the allocation above
>> the kernel.
>
> Maybe I misunderstood but wasn't he worrying about there not being
> enough space above kernel? In that case, it'd automatically fall back
> to top-down allocation anyway, right?
Ah, I see. You are talking about a different issue. He is worried that if
we use kexec to load the kernel high, say we have 16GB and put the kernel at
15.99GB (just an example), we only have less than 100MB above the kernel.
But as I've explained to him, in almost all cases where we want memory
hotplug to work, we don't do that. And yes, if we do hit this problem,
allocation falls back to top-down, which brings us back to patch 2:
we will trigger the WARN_ONCE, and the admin will know what has happened.
Thanks.
--
Thanks.
Zhang Yanfei
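The fallback discussed in this exchange can be modelled in user space. The following is only a sketch under stated assumptions: free memory is a sorted array of hypothetical struct range entries, alignment is ignored, and the fell_back flag merely marks the point where the thread says the WARN_ONCE from patch 2 would fire. None of these names are memblock's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one free region [start, end), array sorted by address. */
struct range {
	uint64_t start, end;
};

/* Lowest block of @size inside [lo, hi); 0 means nothing fits. */
static uint64_t find_bottom_up(const struct range *regions, size_t n,
			       uint64_t lo, uint64_t hi, uint64_t size)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t s = regions[i].start > lo ? regions[i].start : lo;
		uint64_t e = regions[i].end < hi ? regions[i].end : hi;

		if (s < e && e - s >= size)
			return s;
	}
	return 0;
}

/* Highest block of @size inside [lo, hi); 0 means nothing fits. */
static uint64_t find_top_down(const struct range *regions, size_t n,
			      uint64_t lo, uint64_t hi, uint64_t size)
{
	for (size_t i = n; i-- > 0; ) {
		uint64_t s = regions[i].start > lo ? regions[i].start : lo;
		uint64_t e = regions[i].end < hi ? regions[i].end : hi;

		if (s < e && e - s >= size)
			return e - size;
	}
	return 0;
}

/*
 * Try bottom-up above the kernel first; if that fails, fall back to
 * top-down over the whole candidate range and report that we did so.
 */
static uint64_t find_range(const struct range *regions, size_t n,
			   uint64_t start, uint64_t end, uint64_t size,
			   uint64_t kernel_end, bool bottom_up, bool *fell_back)
{
	assert(size > 0 && fell_back != NULL);
	*fell_back = false;

	if (bottom_up && end > kernel_end) {
		uint64_t lo = start > kernel_end ? start : kernel_end;
		uint64_t addr = find_bottom_up(regions, n, lo, end, size);

		if (addr)
			return addr;
		*fell_back = true;	/* where the warning would be raised */
	}
	return find_top_down(regions, n, start, end, size);
}
```

In this model, a kernel loaded near the top of memory leaves too little room above kernel_end, so the bottom-up search returns 0 and the top-down search hands out memory below the kernel instead, which is the kexec scenario described above.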
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
2013-09-26 16:03 ` Zhang Yanfei
@ 2013-09-26 16:08 ` Tejun Heo
0 siblings, 0 replies; 34+ messages in thread
From: Tejun Heo @ 2013-09-26 16:08 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On Fri, Sep 27, 2013 at 12:03:01AM +0800, Zhang Yanfei wrote:
> Ah, I see. You are saying another issue. He is worrying that if we use
> kexec to load the kernel high, say we have 16GB, we put the kernel in
> 15.99GB (just an example), so we only have less than 100MB above the kernel.
>
> But as I've explained to him, in almost all the cases, if we want our
> memory hotplug work, we don't do that. And yeah, assume we have this
> problem, it'd fall back to top down and that return backs to patch 2,
> we will trigger the WARN_ONCE, and the admin will know what has happened.
Alright,
Acked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
2013-09-26 14:48 ` Tejun Heo
2013-09-26 15:43 ` Zhang Yanfei
@ 2013-09-26 17:00 ` Zhang Yanfei
1 sibling, 0 replies; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-26 17:00 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On 09/26/2013 10:48 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 25, 2013 at 02:30:51AM +0800, Zhang Yanfei wrote:
>> +/**
>> + * memory_map_bottom_up - Map [map_start, map_end) bottom up
>> + * @map_start: start address of the target memory range
>> + * @map_end: end address of the target memory range
>> + *
>> + * This function will setup direct mapping for memory range
>> + * [map_start, map_end) in bottom-up.
>
> Ditto about the comment.
Trying below:
/**
* memory_map_bottom_up - Map [map_start, map_end) bottom up
* @map_start: start address of the target memory range
* @map_end: end address of the target memory range
*
* This function sets up the direct mapping for the memory range
* [map_start, map_end) bottom-up. Since bottom-up allocation is
* limited to memory above the kernel, the page tables are
* allocated just above the kernel, and the memory in
* [map_start, map_end) is mapped bottom-up.
*/
static void __init memory_map_bottom_up(unsigned long map_start,
unsigned long map_end)
{
Thanks.
--
Thanks.
Zhang Yanfei
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
2013-09-24 18:30 ` [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up Zhang Yanfei
2013-09-26 14:48 ` Tejun Heo
@ 2013-09-26 22:52 ` Andrew Morton
2013-09-26 22:57 ` H. Peter Anvin
1 sibling, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2013-09-26 22:52 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Tejun Heo, Toshi Kani, Wanpeng Li, Thomas Renninger, Yinghai Lu,
Jiang Liu, Wen Congyang, Lai Jiangshan, isimatu.yasuaki,
izumi.taku, Mel Gorman, Minchan Kim, mina86, gong.chen,
vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner, prarit,
x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org, Linux MM,
linux-acpi, imtangchen, Zhang Yanfei
On Wed, 25 Sep 2013 02:30:51 +0800 Zhang Yanfei <zhangyanfei.yes@gmail.com> wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
>
> The Linux kernel cannot migrate pages used by the kernel. As a
> result, kernel pages cannot be hot-removed. So we cannot allocate
> hotpluggable memory for the kernel.
>
> In a memory hotplug system, any numa node the kernel resides in
> should be unhotpluggable. And for a modern server, each node could
> have at least 16GB memory. So memory around the kernel image is
> highly likely unhotpluggable.
>
> ACPI SRAT (System Resource Affinity Table) contains the memory
> hotplug info. But before SRAT is parsed, memblock has already
> started to allocate memory for the kernel. So we need to prevent
> memblock from doing this.
>
> So direct memory mapping page tables setup is the case. init_mem_mapping()
> is called before SRAT is parsed. To prevent page tables being allocated
> within hotpluggable memory, we will use bottom-up direction to allocate
> page tables from the end of kernel image to the higher memory.
>
> ...
>
> + kernel_end = __pa_symbol(_end);
__pa_symbol() is implemented only on mips and x86.
I stole the mips implementation like this:
--- a/mm/memblock.c~a
+++ a/mm/memblock.c
@@ -187,8 +187,11 @@ phys_addr_t __init_memblock memblock_fin
/* avoid allocating the first page */
start = max_t(phys_addr_t, start, PAGE_SIZE);
end = max(start, end);
+#ifdef CONFIG_X86
kernel_end = __pa_symbol(_end);
-
+#else
+ kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
+#endif
/*
* try bottom-up allocation only when bottom-up mode
* is set and @end is above the kernel image.
just so I can get a -mm release out the door.
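One possible shape for a shared definition, instead of the per-call-site #ifdef, is a generic fallback that an architecture can override. The sketch below is a user-space mock and only an illustration, not necessarily what ended up in any tree: MOCK_PAGE_OFFSET and the simplified __pa() and RELOC_HIDE() are assumptions standing in for the real architecture macros, and kernel_end_phys() is a hypothetical helper.

```c
#include <assert.h>

/* --- user-space stand-ins, all assumptions for this sketch --- */
#define MOCK_PAGE_OFFSET 0xC0000000UL            /* pretend linear-map base */
#define __pa(x) ((unsigned long)(x) - MOCK_PAGE_OFFSET)
#define RELOC_HIDE(ptr, off) ((ptr) + (off))     /* real one hides the value
						     from the optimizer */

/*
 * Generic fallback: an architecture with its own __pa_symbol() defines
 * it first, and this definition is then skipped.
 */
#ifndef __pa_symbol
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
#endif

/* Hypothetical helper: physical address of the end of the kernel image. */
static unsigned long kernel_end_phys(unsigned long end_vaddr)
{
	assert(end_vaddr >= MOCK_PAGE_OFFSET);
	return __pa_symbol(end_vaddr);
}
```

With a definition like this in a common header, the call site could keep the single line kernel_end = __pa_symbol(_end); on every architecture.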
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
2013-09-26 22:52 ` Andrew Morton
@ 2013-09-26 22:57 ` H. Peter Anvin
0 siblings, 0 replies; 34+ messages in thread
From: H. Peter Anvin @ 2013-09-26 22:57 UTC (permalink / raw)
To: Andrew Morton, Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, Tejun Heo,
Toshi Kani, Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
Wen Congyang, Lai Jiangshan, isimatu.yasuaki, izumi.taku,
Mel Gorman, Minchan Kim, mina86, gong.chen, vasilis.liaskovitis,
lwoodman, Rik van Riel, jweiner, prarit, x86@kernel.org,
linux-doc, linux-kernel@vger.kernel.org, Linux MM, linux-acpi,
imtangchen, Zhang Yanfei
Can we put this in a common header somewhere?
Andrew Morton <akpm@linux-foundation.org> wrote:
>On Wed, 25 Sep 2013 02:30:51 +0800 Zhang Yanfei
><zhangyanfei.yes@gmail.com> wrote:
>
>> From: Tang Chen <tangchen@cn.fujitsu.com>
>>
>> The Linux kernel cannot migrate pages used by the kernel. As a
>> result, kernel pages cannot be hot-removed. So we cannot allocate
>> hotpluggable memory for the kernel.
>>
>> In a memory hotplug system, any numa node the kernel resides in
>> should be unhotpluggable. And for a modern server, each node could
>> have at least 16GB memory. So memory around the kernel image is
>> highly likely unhotpluggable.
>>
>> ACPI SRAT (System Resource Affinity Table) contains the memory
>> hotplug info. But before SRAT is parsed, memblock has already
>> started to allocate memory for the kernel. So we need to prevent
>> memblock from doing this.
>>
>> So direct memory mapping page tables setup is the case. init_mem_mapping()
>> is called before SRAT is parsed. To prevent page tables being allocated
>> within hotpluggable memory, we will use bottom-up direction to allocate
>> page tables from the end of kernel image to the higher memory.
>>
>> ...
>>
>> + kernel_end = __pa_symbol(_end);
>
>__pa_symbol() is implemented only on mips and x86.
>
>I stole the mips implementation like this:
>
>--- a/mm/memblock.c~a
>+++ a/mm/memblock.c
>@@ -187,8 +187,11 @@ phys_addr_t __init_memblock memblock_fin
> /* avoid allocating the first page */
> start = max_t(phys_addr_t, start, PAGE_SIZE);
> end = max(start, end);
>+#ifdef CONFIG_X86
> kernel_end = __pa_symbol(_end);
>-
>+#else
>+ kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
>+#endif
> /*
> * try bottom-up allocation only when bottom-up mode
> * is set and @end is above the kernel image.
>
>just so I can get a -mm release out the door.
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v5 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed
2013-09-24 18:23 [PATCH v5 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed Zhang Yanfei
` (3 preceding siblings ...)
2013-09-24 18:30 ` [PATCH v5 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up Zhang Yanfei
@ 2013-09-24 18:34 ` Zhang Yanfei
2013-09-26 14:49 ` Tejun Heo
2013-09-27 23:14 ` Toshi Kani
2013-09-24 18:35 ` [PATCH v5 6/6] mem-hotplug: Introduce movablenode boot option Zhang Yanfei
5 siblings, 2 replies; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-24 18:34 UTC (permalink / raw)
To: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
Rik van Riel, jweiner, prarit
Cc: x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org, Linux MM,
linux-acpi, imtangchen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
Memory reserved for crashkernel could be large. So we should not allocate
this memory bottom up from the end of kernel image.
When SRAT is parsed, we will be able to know which memory is hotpluggable,
and we can avoid allocating this memory for the kernel. So reorder
reserve_crashkernel() after SRAT is parsed.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
arch/x86/kernel/setup.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..36cfce3 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,8 +1120,6 @@ void __init setup_arch(char **cmdline_p)
acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
#endif
- reserve_crashkernel();
-
vsmp_init();
io_delay_init();
@@ -1136,6 +1134,12 @@ void __init setup_arch(char **cmdline_p)
initmem_init();
memblock_find_dma_reserve();
+ /*
+ * Reserve memory for crash kernel after SRAT is parsed so that it
+ * won't consume hotpluggable memory.
+ */
+ reserve_crashkernel();
+
#ifdef CONFIG_KVM_GUEST
kvmclock_init();
#endif
--
1.7.1
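The ordering argument in this patch can be illustrated with a toy model (not memblock itself). The struct mem_range layout, toy_reserve_top_down() and the srat_parsed flag are assumptions made up for this sketch: before SRAT is parsed the hotpluggable flag is unknown and therefore ignored, and afterwards hotpluggable ranges are skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one physical memory range. */
struct mem_range {
    uint64_t base;
    uint64_t size;
    bool hotpluggable;  /* only trustworthy once SRAT has been parsed */
};

/*
 * Toy top-down search: return the highest base address at which @size
 * bytes fit inside a single range, or 0 if nothing fits. While
 * @srat_parsed is false the hotpluggable flag is ignored.
 */
static uint64_t toy_reserve_top_down(const struct mem_range *r, size_t n,
                                     uint64_t size, bool srat_parsed)
{
    uint64_t best = 0;
    bool found = false;

    assert(r != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t candidate;

        if (r[i].size < size)
            continue;
        if (srat_parsed && r[i].hotpluggable)
            continue;   /* keep the reservation off removable memory */
        candidate = r[i].base + r[i].size - size;
        if (!found || candidate > best) {
            best = candidate;
            found = true;
        }
    }
    return found ? best : 0;
}
```

In this model, with a non-hotpluggable 1 GiB range at 1 MiB and a hotpluggable 1 GiB range at 4 GiB, a 256 MiB request lands in the upper range while srat_parsed is false and in the lower one once it is true.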
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v5 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed
2013-09-24 18:34 ` [PATCH v5 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed Zhang Yanfei
@ 2013-09-26 14:49 ` Tejun Heo
2013-09-26 15:46 ` Zhang Yanfei
2013-09-27 23:14 ` Toshi Kani
1 sibling, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2013-09-26 14:49 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On Wed, Sep 25, 2013 at 02:34:34AM +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
>
> Memory reserved for crashkernel could be large. So we should not allocate
> this memory bottom up from the end of kernel image.
>
> When SRAT is parsed, we will be able to know which memory is hotpluggable,
> and we can avoid allocating this memory for the kernel. So reorder
> reserve_crashkernel() after SRAT is parsed.
>
> Acked-by: Tejun Heo <tj@kernel.org>
So, I was hoping to hear from you on how you tested it when I wrote
the previous comment - the "provided..." part.
--
tejun
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed
2013-09-26 14:49 ` Tejun Heo
@ 2013-09-26 15:46 ` Zhang Yanfei
0 siblings, 0 replies; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-26 15:46 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On 09/26/2013 10:49 PM, Tejun Heo wrote:
> On Wed, Sep 25, 2013 at 02:34:34AM +0800, Zhang Yanfei wrote:
>> From: Tang Chen <tangchen@cn.fujitsu.com>
>>
>> Memory reserved for crashkernel could be large. So we should not allocate
>> this memory bottom up from the end of kernel image.
>>
>> When SRAT is parsed, we will be able to know which memory is hotpluggable,
>> and we can avoid allocating this memory for the kernel. So reorder
>> reserve_crashkernel() after SRAT is parsed.
>>
>> Acked-by: Tejun Heo <tj@kernel.org>
>
> So, I was hoping to hear from you on how you tested it when I wrote
> the previous comment - the "provided..." part.
>
This function is actually used for kexec/kdump. After applying this
patch and booting the kernel, the reservation succeeds and the kdump
service starts successfully.
Thanks.
--
Thanks.
Zhang Yanfei
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed
2013-09-24 18:34 ` [PATCH v5 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed Zhang Yanfei
2013-09-26 14:49 ` Tejun Heo
@ 2013-09-27 23:14 ` Toshi Kani
1 sibling, 0 replies; 34+ messages in thread
From: Toshi Kani @ 2013-09-27 23:14 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Tejun Heo, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On Wed, 2013-09-25 at 02:34 +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
>
> Memory reserved for crashkernel could be large. So we should not allocate
> this memory bottom up from the end of kernel image.
>
> When SRAT is parsed, we will be able to know which memory is hotpluggable,
> and we can avoid allocating this memory for the kernel. So reorder
> reserve_crashkernel() after SRAT is parsed.
>
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> ---
> arch/x86/kernel/setup.c | 8 ++++++--
> 1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index f0de629..36cfce3 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1120,8 +1120,6 @@ void __init setup_arch(char **cmdline_p)
> acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
> #endif
>
> - reserve_crashkernel();
> -
> vsmp_init();
>
> io_delay_init();
> @@ -1136,6 +1134,12 @@ void __init setup_arch(char **cmdline_p)
> initmem_init();
> memblock_find_dma_reserve();
>
> + /*
> + * Reserve memory for crash kernel after SRAT is parsed so that it
> + * won't consume hotpluggable memory.
> + */
> + reserve_crashkernel();
Out of curiosity, is there any particular reason why it is moved after
memblock_find_dma_reserve(), not initmem_init()?
Thanks,
-Toshi
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v5 6/6] mem-hotplug: Introduce movablenode boot option
2013-09-24 18:23 [PATCH v5 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed Zhang Yanfei
` (4 preceding siblings ...)
2013-09-24 18:34 ` [PATCH v5 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed Zhang Yanfei
@ 2013-09-24 18:35 ` Zhang Yanfei
2013-09-26 14:53 ` Tejun Heo
5 siblings, 1 reply; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-24 18:35 UTC (permalink / raw)
To: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
Rik van Riel, jweiner, prarit
Cc: x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org, Linux MM,
linux-acpi, imtangchen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE, so by doing this the
kernel cannot use memory in movable nodes. This degrades NUMA
performance, which other users may not want.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce the movablenode boot option to allow users to
choose not to consume hotpluggable memory at early boot time, so that it
can later be set as ZONE_MOVABLE.
To achieve this, the movablenode boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained
in the previous patches. So if movablenode boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
Users can specify "movablenode" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
Documentation/kernel-parameters.txt | 15 +++++++++++++++
arch/x86/kernel/setup.c | 7 +++++++
mm/memory_hotplug.c | 31 +++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+), 0 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1a036cd..8c056c4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.
+ movablenode [KNL,X86] This parameter enables/disables the
+ kernel to arrange hotpluggable memory ranges recorded
+ in ACPI SRAT(System Resource Affinity Table) as
+ ZONE_MOVABLE. And these memory can be hot-removed when
+ the system is up.
+ By specifying this option, all the hotpluggable memory
+ will be in ZONE_MOVABLE, which the kernel cannot use.
+ This will cause NUMA performance down. For users who
+ care about NUMA performance, just don't use it.
+ If all the memory ranges in the system are hotpluggable,
+ then the ones used by the kernel at early time, such as
+ kernel code and data segments, initrd file and so on,
+ won't be set as ZONE_MOVABLE, and won't be hotpluggable.
+ Otherwise the kernel won't have enough memory to boot.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 36cfce3..b8fefb7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1132,6 +1132,13 @@ void __init setup_arch(char **cmdline_p)
early_acpi_boot_init();
initmem_init();
+
+ /*
+ * When ACPI SRAT is parsed, which is done in initmem_init(),
+ * set memblock back to the top-down direction.
+ */
+ memblock_set_bottom_up(false);
+
memblock_find_dma_reserve();
/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ed85fe3..dcd819a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -31,6 +31,7 @@
#include <linux/firmware-map.h>
#include <linux/stop_machine.h>
#include <linux/hugetlb.h>
+#include <linux/memblock.h>
#include <asm/tlbflush.h>
@@ -1412,6 +1413,36 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
}
#endif /* CONFIG_MOVABLE_NODE */
+static int __init cmdline_parse_movablenode(char *p)
+{
+#ifdef CONFIG_MOVABLE_NODE
+ /*
+ * Memory used by the kernel cannot be hot-removed because Linux
+ * cannot migrate the kernel pages. When memory hotplug is
+ * enabled, we should prevent memblock from allocating memory
+ * for the kernel.
+ *
+ * ACPI SRAT records all hotpluggable memory ranges. But before
+ * SRAT is parsed, we don't know about it.
+ *
+ * The kernel image is loaded into memory at very early time. We
+ * cannot prevent this anyway. So on NUMA system, we set any
+ * node the kernel resides in as un-hotpluggable.
+ *
+ * Since on modern servers, one node could have double-digit
+ * gigabytes memory, we can assume the memory around the kernel
+ * image is also un-hotpluggable. So before SRAT is parsed, just
+ * allocate memory near the kernel image to try the best to keep
+ * the kernel away from hotpluggable memory.
+ */
+ memblock_set_bottom_up(true);
+#else
+ pr_warn("movablenode option not supported\n");
+#endif
+ return 0;
+}
+early_param("movablenode", cmdline_parse_movablenode);
+
/* check which state of node_states will be changed when offline memory */
static void node_states_check_changes_offline(unsigned long nr_pages,
struct zone *zone, struct memory_notify *arg)
--
1.7.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v5 6/6] mem-hotplug: Introduce movablenode boot option
2013-09-24 18:35 ` [PATCH v5 6/6] mem-hotplug: Introduce movablenode boot option Zhang Yanfei
@ 2013-09-26 14:53 ` Tejun Heo
2013-09-26 16:42 ` Zhang Yanfei
0 siblings, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2013-09-26 14:53 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On Wed, Sep 25, 2013 at 02:35:14AM +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
>
> The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
> As we mentioned before, if hotpluggable memory is used by the kernel,
> it cannot be hot-removed. So memory hotplug users may want to set all
> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
>
> Memory hotplug users may also set a node as movable node, which has
> ZONE_MOVABLE only, so that the whole node can be hot-removed.
>
> But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
> kernel cannot use memory in movable nodes. This will cause NUMA
> performance down. And other users may be unhappy.
>
> So we need a way to allow users to enable and disable this functionality.
> In this patch, we introduce movablenode boot option to allow users to
> choose to not to consume hotpluggable memory at early boot time and
> later we can set it as ZONE_MOVABLE.
>
> To achieve this, the movablenode boot option will control the memblock
> allocation direction. That said, after memblock is ready, before SRAT is
> parsed, we should allocate memory near the kernel image as we explained
> in the previous patches. So if movablenode boot option is set, the kernel
> does the following:
>
> 1. After memblock is ready, make memblock allocate memory bottom up.
> 2. After SRAT is parsed, make memblock behave as default, allocate memory
> top down.
>
> Users can specify "movablenode" in kernel commandline to enable this
> functionality. For those who don't use memory hotplug or who don't want
> to lose their NUMA performance, just don't specify anything. The kernel
> will work as before.
>
> Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
I hope the param description and comment were better. Not necessarily
longer, but clearer, so it'd be great if you can polish them a bit
more. Other than that,
Acked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 6/6] mem-hotplug: Introduce movablenode boot option
2013-09-26 14:53 ` Tejun Heo
@ 2013-09-26 16:42 ` Zhang Yanfei
2013-09-27 6:26 ` Ingo Molnar
0 siblings, 1 reply; 34+ messages in thread
From: Zhang Yanfei @ 2013-09-26 16:42 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
Andrew Morton, Toshi Kani, Wanpeng Li, Thomas Renninger,
Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
prarit, x86@kernel.org, linux-doc, linux-kernel@vger.kernel.org,
Linux MM, linux-acpi, imtangchen, Zhang Yanfei
On 09/26/2013 10:53 PM, Tejun Heo wrote:
> On Wed, Sep 25, 2013 at 02:35:14AM +0800, Zhang Yanfei wrote:
>> From: Tang Chen <tangchen@cn.fujitsu.com>
>>
>> The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
>> As we mentioned before, if hotpluggable memory is used by the kernel,
>> it cannot be hot-removed. So memory hotplug users may want to set all
>> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
>>
>> Memory hotplug users may also set a node as movable node, which has
>> ZONE_MOVABLE only, so that the whole node can be hot-removed.
>>
>> But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
>> kernel cannot use memory in movable nodes. This will cause NUMA
>> performance down. And other users may be unhappy.
>>
>> So we need a way to allow users to enable and disable this functionality.
>> In this patch, we introduce movablenode boot option to allow users to
>> choose to not to consume hotpluggable memory at early boot time and
>> later we can set it as ZONE_MOVABLE.
>>
>> To achieve this, the movablenode boot option will control the memblock
>> allocation direction. That said, after memblock is ready, before SRAT is
>> parsed, we should allocate memory near the kernel image as we explained
>> in the previous patches. So if movablenode boot option is set, the kernel
>> does the following:
>>
>> 1. After memblock is ready, make memblock allocate memory bottom up.
>> 2. After SRAT is parsed, make memblock behave as default, allocate memory
>> top down.
>>
>> Users can specify "movablenode" in kernel commandline to enable this
>> functionality. For those who don't use memory hotplug or who don't want
>> to lose their NUMA performance, just don't specify anything. The kernel
>> will work as before.
>>
>> Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
>
> I hope the param description and comment were better. Not necessarily
> longer, but clearer, so it'd be great if you can polish them a bit
OK. Trying below:
movablenode [KNL,X86] This option enables the kernel to arrange
hotpluggable memory into ZONE_MOVABLE zone. If memory
in a node is all hotpluggable, the option may make
the whole node has only one ZONE_MOVABLE zone, so that
the whole node can be hot-removed after system is up.
Note that this option may cause NUMA performance down.
As for the comment in cmdline_parse_movablenode():
/*
* ACPI SRAT records all hotpluggable memory ranges. But before
* SRAT is parsed, we don't know about it. So by specifying this
* option, we will use the bottom-up mode to try allocating memory
* near the kernel image before SRAT is parsed.
*
* Bottom-up mode prevents memblock allocating hotpluggable memory
* for the kernel so that the kernel will arrange hotpluggable
* memory into ZONE_MOVABLE zone when possible.
*/
Thanks.
> more. Other than that,
>
> Acked-by: Tejun Heo <tj@kernel.org>
>
> Thanks.
>
--
Thanks.
Zhang Yanfei
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 6/6] mem-hotplug: Introduce movablenode boot option
2013-09-26 16:42 ` Zhang Yanfei
@ 2013-09-27 6:26 ` Ingo Molnar
2013-09-27 8:08 ` Yanfei Zhang
0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2013-09-27 6:26 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Tejun Heo, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
H. Peter Anvin, Andrew Morton, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
Rik van Riel, jweiner, prarit, x86@kernel.org, linux-doc,
linux-kernel@vger.kernel.org, Linux MM, linux-acpi, imtangchen,
Zhang Yanfei
* Zhang Yanfei <zhangyanfei.yes@gmail.com> wrote:
> OK. Trying below:
>
> movablenode [KNL,X86] This option enables the kernel to arrange
> hotpluggable memory into ZONE_MOVABLE zone. If memory
> in a node is all hotpluggable, the option may make
> the whole node has only one ZONE_MOVABLE zone, so that
> the whole node can be hot-removed after system is up.
> Note that this option may cause NUMA performance down.
That paragraph doesn't really parse in several places ...
Also, more importantly, please explain why this needs to be a boot option.
In terms of user friendliness boot options are at the bottom of the list,
and boot options also don't really help feature tests.
Presumably the feature is safe and has no costs, and hence could be added
as a regular .config option, with a boot option only as an additional
configurability option?
Thanks,
Ingo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 6/6] mem-hotplug: Introduce movablenode boot option
2013-09-27 6:26 ` Ingo Molnar
@ 2013-09-27 8:08 ` Yanfei Zhang
2013-09-27 11:05 ` Ingo Molnar
0 siblings, 1 reply; 34+ messages in thread
From: Yanfei Zhang @ 2013-09-27 8:08 UTC (permalink / raw)
To: Ingo Molnar
Cc: Tejun Heo, Rafael J . Wysocki, lenb@kernel.org, Thomas Gleixner,
mingo@elte.hu, H. Peter Anvin, Andrew Morton, Toshi Kani,
Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, isimatu.yasuaki@jp.fujitsu.com,
izumi.taku@jp.fujitsu.com, Mel Gorman, Minchan Kim,
mina86@mina86.com, gong.chen@linux.intel.com,
vasilis.liaskovitis@profitbricks.com, lwoodman@redhat.com,
Rik van Riel, jweiner@redhat.com, prarit@redhat.com,
x86@kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, Linux MM,
linux-acpi@vger.kernel.org, imtangchen@gmail.com, Zhang Yanfei
Hello Ingo,
On Friday, September 27, 2013, Ingo Molnar wrote:
>
> * Zhang Yanfei <zhangyanfei.yes@gmail.com> wrote:
>
> > OK. Trying below:
> >
> > movablenode [KNL,X86] This option enables the kernel to arrange
> > hotpluggable memory into ZONE_MOVABLE zone. If memory
> > in a node is all hotpluggable, the option may make
> > the whole node has only one ZONE_MOVABLE zone, so that
> > the whole node can be hot-removed after system is up.
> > Note that this option may cause NUMA performance down.
>
> That paragraph doesn't really parse in several places ...
Sorry, could you point out which places?
>
> Also, more importantly, please explain why this needs to be a boot option.
> In terms of user friendliness boot options are at the bottom of the list,
> and boot options also don't really help feature tests.
>
> Presumably the feature is safe and has no costs, and hence could be added
> as a regular .config option, with a boot option only as an additional
> configurability option?
Yeah, the kernel already has config MOVABLE_NODE, which is the config
enabling this feature, and we introduce this boot option to expand the
configurability.
Thanks.
Zhang
>
> Thanks,
>
> Ingo
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v5 6/6] mem-hotplug: Introduce movablenode boot option
2013-09-27 8:08 ` Yanfei Zhang
@ 2013-09-27 11:05 ` Ingo Molnar
0 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2013-09-27 11:05 UTC (permalink / raw)
To: Yanfei Zhang
Cc: Tejun Heo, Rafael J . Wysocki, lenb@kernel.org, Thomas Gleixner,
mingo@elte.hu, H. Peter Anvin, Andrew Morton, Toshi Kani,
Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, isimatu.yasuaki@jp.fujitsu.com,
izumi.taku@jp.fujitsu.com, Mel Gorman, Minchan Kim,
mina86@mina86.com, gong.chen@linux.intel.com,
vasilis.liaskovitis@profitbricks.com, lwoodman@redhat.com,
Rik van Riel, jweiner@redhat.com, prarit@redhat.com,
x86@kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, Linux MM,
linux-acpi@vger.kernel.org, imtangchen@gmail.com, Zhang Yanfei
* Yanfei Zhang <zhangyanfei.yes@gmail.com> wrote:
> > Also, more importantly, please explain why this needs to be a boot
> > option. In terms of user friendliness boot options are at the bottom
> > of the list, and boot options also don't really help feature tests.
> >
> > Presumably the feature is safe and has no costs, and hence could be
> > added as a regular .config option, with a boot option only as an
> > additional configurability option?
>
>
> Yeah, the kernel already has config MOVABLE_NODE, which is the config
> enabling this feature, and we introduce this boot option to expand the
> configurability.
So if this is purely a boot option to disable CONFIG_MOVABLE_NODE=y then
this:
> + movablenode [KNL,X86] This parameter enables/disables the
> + kernel to arrange hotpluggable memory ranges recorded
> + in ACPI SRAT(System Resource Affinity Table) as
> + ZONE_MOVABLE. And these memory can be hot-removed when
> + the system is up.
> + By specifying this option, all the hotpluggable memory
> + will be in ZONE_MOVABLE, which the kernel cannot use.
> + This will cause NUMA performance down. For users who
> + care about NUMA performance, just don't use it.
> + If all the memory ranges in the system are hotpluggable,
> + then the ones used by the kernel at early time, such as
> + kernel code and data segments, initrd file and so on,
> + won't be set as ZONE_MOVABLE, and won't be hotpluggable.
> + Otherwise the kernel won't have enough memory to boot.
> +
> MTD_Partition= [MTD]
should be something like:
> + movablenode [KNL,X86] Boot-time switch to disable the effects of
> + CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
And make sure that the description in mm/Kconfig is up to date.
Having the same feature described twice in two different places will only
create documentation bitrot and confusion. Also, the boot flag is only
about disabling the feature, right?
Also, because the feature is named MOVABLE_NODE, the boot option should
match that and be called movable_node - not movablenode.
Thanks,
Ingo
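To make the naming point concrete, here is a userspace sketch of whole-word flag matching on a command line. cmdline_has_flag() is a hypothetical helper written for this note and is not the kernel's parameter parser; it only shows that a flag spelled "movable_node" would not be matched by "movablenode" under exact matching.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if @flag appears in @cmdline as a whole space-separated
 * word. @flag is assumed to contain no spaces.
 */
static bool cmdline_has_flag(const char *cmdline, const char *flag)
{
    size_t len;
    const char *p;

    assert(cmdline != NULL && flag != NULL);
    len = strlen(flag);
    if (len == 0)
        return false;

    p = cmdline;
    while ((p = strstr(p, flag)) != NULL) {
        /* A match counts only if bounded by spaces or string ends. */
        bool starts = (p == cmdline) || (p[-1] == ' ');
        bool ends = (p[len] == '\0') || (p[len] == ' ');

        if (starts && ends)
            return true;
        p++;    /* keep scanning after a partial-word match */
    }
    return false;
}
```

Under this exact-match assumption, a rename of the option would require users to update their boot configuration.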
^ permalink raw reply [flat|nested] 34+ messages in thread