* [PATCH v1 part1 0/9] Introduce movablemem_map boot option.
@ 2013-03-16 10:35 Tang Chen
  2013-03-16 10:35 ` [PATCH v1 1/9] x86: get pg_data_t's memory from other node Tang Chen
                   ` (9 more replies)
  0 siblings, 10 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-16 10:35 UTC (permalink / raw)
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

Hi Yinghai, all,

As Yinghai has implemented early parsing of numa info more thoroughly,
I think we can introduce the movablemem_map boot option again.

This patch-set is based on Linux 3.9-rc2, but Yinghai's
"x86, ACPI, numa: Parse numa info early" patch-set needs to be applied first.
Please refer to:
v1: https://lkml.org/lkml/2013/3/7/642
v2: https://lkml.org/lkml/2013/3/10/47


In this part1 patch-set, we reimplement the movablemem_map boot option
based on Yinghai's SRAT work. The boot path is like this:
1) Parse SRAT, and fill only existing memory into numa_meminfo, like:
   numa_cleanup_meminfo() {
           const u64 low = 0;
           const u64 high = PFN_PHYS(max_pfn);
......
           /* first, trim all entries */
           for (i = 0; i < mi->nr_blks; i++) {
                   struct numa_memblk *bi = &mi->blk[i];

                   /* make sure all blocks are inside the limits */
                   bi->start = max(bi->start, low);
                   bi->end = min(bi->end, high);

                   /* and there's no empty block */
                   if (bi->start >= bi->end)
                           numa_remove_memblk_from(i--, mi);
           }
......
   }

   Non-existent memory, such as memory that has not been hot-added yet,
   won't be stored in numa_meminfo.

2) initialize memory mapping for the existing memory, putting pagetables
   and vmemmap on local node.

Since not all memory info is kept in numa_meminfo, we have to sanitize
movablemem_map.map[] when we parse SRAT, and so we may prevent pagetables
or vmemmap from being allocated on the local node if the user specified
the whole node as movable.

To avoid this problem, here is my idea:
1) Store in numa_meminfo not only the existing memory ranges, but all
   the memory info from SRAT;
2) Map only existing memory as before;
3) Apply the memblock limit after memory mapping initialization, using
   numa_meminfo, so that movablemem_map will be able to exclude the
   pagetable and vmemmap ranges on the local node.

This will be done in part2 soon.

What do you think?

Part2 of this patch-set is under development.

========================================================================
[What we are doing]
This patchset introduces a boot option for users to specify the
ZONE_MOVABLE memory map for each node in the system. It can be used in
two ways:

1. movablemem_map=nn[KMG]@ss[KMG]
   In this way, the kernel will make sure the memory range from ss to
   ss+nn is in ZONE_MOVABLE. The hotplug info provided by SRAT will be
   ignored.

2. movablemem_map=acpi
   In this way, the kernel will use the memory hotplug info in SRAT to
   determine ZONE_MOVABLE for each node. All the ranges the user has
   specified will be ignored.
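
For example (values are illustrative only), one could boot with

	movablemem_map=2G@8G

to request that the range [8G, 10G) (and, as described below, the rest
of the node it falls in) become ZONE_MOVABLE, or with

	movablemem_map=acpi

to let the SRAT hotplug bits decide.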


[Why we do this]
If we hot-remove a memory device, it cannot contain kernel memory,
because Linux currently cannot migrate kernel memory. Therefore,
we have to guarantee that the hot-removed memory contains only movable
memory.
(There is an exception: when we implement the node hotplug functionality,
kernel memory whose life cycle is the same as the node's, such as
pagetables, vmemmap and so on, can still be put on the local node even
though the kernel cannot migrate it, because we can free it before we
hot-remove the node. This is not implemented yet.)

Linux has two boot options, kernelcore= and movablecore=, for
creating movable memory. These boot options can specify the amount
of memory to use as kernel or movable memory. Using them, we can
create ZONE_MOVABLE which has only movable memory.
(NOTE: doing this will degrade NUMA performance because the kernel won't
 be able to distribute kernel memory evenly across nodes.)

But they do not fulfill a requirement of memory hot-remove, because
even with these boot options, movable memory is distributed evenly
across nodes. So when we want to hot-remove memory whose range is
0x80000000-0xc0000000, we have no way to specify that memory as
movable memory.
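(With this feature, such a range could be covered by, e.g.,
movablemem_map=1G@2G, since 0x80000000 is 2G and the range is 1G long;
the values here are only illustrative.)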

Furthermore, even if we can use SRAT, users still need an interface
to enable/disable this functionality if they don't want to lose their
NUMA performance. So I think a user interface is always needed.

So we proposed this new feature which specifies memory range to use as
movable memory.


[Ways to do this]
There may be 2 ways to specify movable memory.
1. use firmware information
2. use boot option

1. use firmware information
  According to the ACPI 5.0 spec, the SRAT table has a memory affinity
  structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
  Memory Affinity Structure". Using this information, we might be able
  to have firmware specify movable memory. For example, if the Hot
  Pluggable Field is enabled, Linux sets the memory as movable memory.

2. use boot option
  This is our proposal. A new boot option can specify the memory ranges
  to use as movable memory.


[How we do this]
We chose the second way, because with the first way users cannot easily
change which memory ranges are used as movable memory. We think that
creating movable memory may cause a NUMA performance regression; with
a boot option, users can easily turn the feature off in that case, and
can also easily select which memory to use as movable memory.


[How to use]
1. For movablemem_map=nn[KMG]@ss[KMG]:

         SRAT:                |_____| |_____| |_________| |_________| ......
         node id:                0       1         1           2
         user specified:                |__|                 |___|
         ZONE_MOVABLE:                  |___| |_________|    |______| ......

   NOTE: 1) Users can specify this option more than once, but at most
            MAX_NUMNODES times. The extra options will be ignored.
         2) In this case, SRAT info will be ignored.

2. For movablemem_map=acpi:

         SRAT:                |_____| |_____| |_________| |_________| ......
         node id:                0       1         1           2
         hotpluggable:           n       y         y           n
         ZONE_MOVABLE:                |_____| |_________|

   NOTE: 1) Before parsing SRAT, memblock has already reserved some
            memory ranges for other purposes, such as the kernel image.
            We cannot prevent the kernel from using this memory, so we
            need to exclude it even if it is hotpluggable.
            Furthermore, to ensure the kernel has enough memory to boot,
            we treat all the memory on the node where the kernel resides
            as un-hotpluggable.
         2) In this case, all the user-specified memory ranges will be
            ignored.

We also need to consider the following points:
1) Using this boot option could degrade NUMA performance because kernel
   memory will not be distributed evenly across nodes. Users who don't
   want to lose NUMA performance should simply not use it.
2) If kernelcore or movablecore is also specified, movablemem_map will
   have higher priority to be satisfied.
3) This option does not conflict with the memmap option.


Tang Chen (8):
  acpi: Print hotplug info in SRAT.
  x86, mm, numa, acpi: Add movable_memmap boot option.
  x86, mm, numa, acpi: Introduce zone_movable_limit[] to store start
    pfn of ZONE_MOVABLE.
  x86, mm, numa, acpi: Extend movablemem_map to the end of each node.
  x86, mm, numa, acpi: Support getting hotplug info from SRAT.
  x86, mm, numa, acpi: Sanitize zone_movable_limit[].
  x86, mm, numa, acpi: make movablemem_map have higher priority
  x86, mm, numa, acpi: Memblock limit with movablemem_map

Yasuaki Ishimatsu (1):
  x86: get pg_data_t's memory from other node

 Documentation/kernel-parameters.txt |   36 +++++
 arch/x86/mm/numa.c                  |    5 +-
 arch/x86/mm/srat.c                  |  130 +++++++++++++++++-
 include/linux/memblock.h            |    2 +
 include/linux/mm.h                  |   22 +++
 mm/memblock.c                       |   50 +++++++
 mm/page_alloc.c                     |  265 ++++++++++++++++++++++++++++++++++-
 7 files changed, 500 insertions(+), 10 deletions(-)


* [PATCH v1 1/9] x86: get pg_data_t's memory from other node
  2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
@ 2013-03-16 10:35 ` Tang Chen
  2013-03-16 10:35 ` [PATCH v1 2/9] acpi: Print hotplug info in SRAT Tang Chen
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-16 10:35 UTC (permalink / raw)
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

If system can create movable node which all memory of the
node is allocated as ZONE_MOVABLE, setup_node_data() cannot
allocate memory for the node's pg_data_t.
So, use memblock_alloc_try_nid() instead of memblock_alloc_nid()
to retry when the first allocation fails.
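
For reference, the fallback works roughly like this (a simplified
sketch of the 3.9-era memblock code, not part of this patch; the exact
implementation may differ):

	phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size,
						  phys_addr_t align, int nid)
	{
		/* First try to satisfy the allocation on the given node... */
		phys_addr_t res = memblock_alloc_nid(size, align, nid);

		if (res)
			return res;
		/* ...then fall back to any accessible memory. */
		return memblock_alloc_base(size, align,
					   MEMBLOCK_ALLOC_ACCESSIBLE);
	}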

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 arch/x86/mm/numa.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 11acdf6..4f754e6 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -214,10 +214,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 	 * Allocate node data.  Try node-local memory and then any node.
 	 * Never allocate in DMA zone.
 	 */
-	nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
+	nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
 	if (!nd_pa) {
-		pr_err("Cannot find %zu bytes in node %d\n",
-		       nd_size, nid);
+		pr_err("Cannot find %zu bytes in any node\n", nd_size);
 		return;
 	}
 	nd = __va(nd_pa);
-- 
1.7.1


* [PATCH v1 2/9] acpi: Print hotplug info in SRAT.
  2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
  2013-03-16 10:35 ` [PATCH v1 1/9] x86: get pg_data_t's memory from other node Tang Chen
@ 2013-03-16 10:35 ` Tang Chen
  2013-03-16 10:35 ` [PATCH v1 3/9] x86, mm, numa, acpi: Add movable_memmap boot option Tang Chen
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-16 10:35 UTC (permalink / raw)
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

The Hot Pluggable field in SRAT indicates whether the memory could be
hotplugged while the system is running. It is useful to print this
info out when parsing SRAT.
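
With this change, an SRAT entry for a hotpluggable range would show up
in the boot log like this (values are illustrative, derived from the
printk format below):

	SRAT: Node 1 PXM 1 [mem 0x100000000-0x1ffffffff] Hot Pluggable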

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/srat.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 443f9ef..5055fa7 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -146,6 +146,7 @@ int __init
 acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 {
 	u64 start, end;
+	u32 hotpluggable;
 	int node, pxm;
 
 	if (srat_disabled())
@@ -154,7 +155,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 		goto out_err_bad_srat;
 	if ((ma->flags & ACPI_SRAT_MEM_ENABLED) == 0)
 		goto out_err;
-	if ((ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) && !save_add_info())
+	hotpluggable = ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
+	if (hotpluggable && !save_add_info())
 		goto out_err;
 
 	start = ma->base_address;
@@ -174,9 +176,10 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 
 	node_set(node, numa_nodes_parsed);
 
-	printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
+	printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
 	       node, pxm,
-	       (unsigned long long) start, (unsigned long long) end - 1);
+	       (unsigned long long) start, (unsigned long long) end - 1,
+	       hotpluggable ? "Hot Pluggable" : "");
 
 	return 0;
 out_err_bad_srat:
-- 
1.7.1


* [PATCH v1 3/9] x86, mm, numa, acpi: Add movable_memmap boot option.
  2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
  2013-03-16 10:35 ` [PATCH v1 1/9] x86: get pg_data_t's memory from other node Tang Chen
  2013-03-16 10:35 ` [PATCH v1 2/9] acpi: Print hotplug info in SRAT Tang Chen
@ 2013-03-16 10:35 ` Tang Chen
  2013-03-16 10:35 ` [PATCH v1 4/9] x86, mm, numa, acpi: Introduce zone_movable_limit[] to store start pfn of ZONE_MOVABLE Tang Chen
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-16 10:35 UTC (permalink / raw)
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

Add functions to parse the movablemem_map boot option. Since the option
can be specified more than once, all of the ranges will be stored in the
global array movablemem_map.map[].

The array is kept sorted by start_pfn in monotonically increasing order,
and overlapping ranges are merged.
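
As a worked example (pfn values assume 4K pages and are illustrative):
booting with "movablemem_map=128M@1G movablemem_map=256M@1152M" first
inserts [0x40000, 0x48000), then [0x48000, 0x58000). Since the second
range touches the first one (its start_pfn equals the existing entry's
end_pfn), the two entries are merged into a single [0x40000, 0x58000).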

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |   21 ++++++
 include/linux/mm.h                  |   11 +++
 mm/page_alloc.c                     |  131 +++++++++++++++++++++++++++++++++++
 3 files changed, 163 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4609e81..dd3a36a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1649,6 +1649,27 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movablemem_map=nn[KMG]@ss[KMG]
+			[KNL,X86,IA-64,PPC] This parameter is similar to
+			memmap except it specifies the memory map of
+			ZONE_MOVABLE.
+			If user specifies memory ranges, the info in SRAT will
+			be ignored. It works like the following:
+			- If several ranges are all within one node, then from the
+			  lowest ss to the end of the node will be ZONE_MOVABLE.
+			- If a range is within a node, then from ss to the end
+			  of the node will be ZONE_MOVABLE.
+			- If a range covers two or more nodes, then from ss to
+			  the end of the 1st node will be ZONE_MOVABLE, and all
+			  the remaining nodes will have only ZONE_MOVABLE.
+			If memmap is specified at the same time, the
+			movablemem_map will be limited within the memmap
+			areas. If kernelcore or movablecore is also specified,
+			movablemem_map will have higher priority to be
+			satisfied. So the administrator should be careful that
+			the amount of movablemem_map areas is not too large.
+			Otherwise the kernel won't have enough memory to boot.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1c79b10..9c068d5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1332,6 +1332,17 @@ extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
 
+#define MOVABLEMEM_MAP_MAX MAX_NUMNODES
+struct movablemem_entry {
+	unsigned long start_pfn;    /* start pfn of memory segment */
+	unsigned long end_pfn;      /* end pfn of memory segment (exclusive) */
+};
+
+struct movablemem_map {
+	int nr_map;
+	struct movablemem_entry map[MOVABLEMEM_MAP_MAX];
+};
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f368db4..27fcd29 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -202,6 +202,9 @@ static unsigned long __meminitdata nr_all_pages;
 static unsigned long __meminitdata dma_reserve;
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+/* Movable memory ranges, will also be used by memblock subsystem. */
+struct movablemem_map movablemem_map;
+
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
@@ -5061,6 +5064,134 @@ static int __init cmdline_parse_movablecore(char *p)
 early_param("kernelcore", cmdline_parse_kernelcore);
 early_param("movablecore", cmdline_parse_movablecore);
 
+/**
+ * insert_movablemem_map - Insert a memory range into movablemem_map.map.
+ * @start_pfn:	start pfn of the range
+ * @end_pfn:	end pfn of the range
+ *
+ * This function will also merge the overlapped ranges, and sort the array
+ * by start_pfn in monotonic increasing order.
+ */
+static void __init insert_movablemem_map(unsigned long start_pfn,
+					  unsigned long end_pfn)
+{
+	int pos, overlap;
+
+	/*
+	 * pos will be at the 1st overlapped range, or the position
+	 * where the element should be inserted.
+	 */
+	for (pos = 0; pos < movablemem_map.nr_map; pos++)
+		if (start_pfn <= movablemem_map.map[pos].end_pfn)
+			break;
+
+	/* If there is no overlapped range, just insert the element. */
+	if (pos == movablemem_map.nr_map ||
+	    end_pfn < movablemem_map.map[pos].start_pfn) {
+		/*
+		 * If pos is not the end of array, we need to move all
+		 * the rest elements backward.
+		 */
+		if (pos < movablemem_map.nr_map)
+			memmove(&movablemem_map.map[pos+1],
+				&movablemem_map.map[pos],
+				sizeof(struct movablemem_entry) *
+				(movablemem_map.nr_map - pos));
+		movablemem_map.map[pos].start_pfn = start_pfn;
+		movablemem_map.map[pos].end_pfn = end_pfn;
+		movablemem_map.nr_map++;
+		return;
+	}
+
+	/* overlap will be at the last overlapped range */
+	for (overlap = pos + 1; overlap < movablemem_map.nr_map; overlap++)
+		if (end_pfn < movablemem_map.map[overlap].start_pfn)
+			break;
+
+	/*
+	 * If there are more ranges overlapped, we need to merge them,
+	 * and move the rest elements forward.
+	 */
+	overlap--;
+	movablemem_map.map[pos].start_pfn = min(start_pfn,
+					movablemem_map.map[pos].start_pfn);
+	movablemem_map.map[pos].end_pfn = max(end_pfn,
+					movablemem_map.map[overlap].end_pfn);
+
+	if (pos != overlap && overlap + 1 != movablemem_map.nr_map)
+		memmove(&movablemem_map.map[pos+1],
+			&movablemem_map.map[overlap+1],
+			sizeof(struct movablemem_entry) *
+			(movablemem_map.nr_map - overlap - 1));
+
+	movablemem_map.nr_map -= overlap - pos;
+}
+
+/**
+ * movablemem_map_add_region - Add a memory range into movablemem_map.
+ * @start:	physical start address of range
+ * @size:	size of the range
+ *
+ * This function transforms the physical addresses into pfns, and then adds
+ * the range into movablemem_map by calling insert_movablemem_map().
+ */
+static void __init movablemem_map_add_region(u64 start, u64 size)
+{
+	unsigned long start_pfn, end_pfn;
+
+	/* In case size == 0 or start + size overflows */
+	if (start + size <= start)
+		return;
+
+	if (movablemem_map.nr_map >= ARRAY_SIZE(movablemem_map.map)) {
+		pr_err("movablemem_map: too many entries; "
+		       "ignoring [mem %#010llx-%#010llx]\n",
+		       (unsigned long long) start,
+		       (unsigned long long) (start + size - 1));
+		return;
+	}
+
+	start_pfn = PFN_DOWN(start);
+	end_pfn = PFN_UP(start + size);
+	insert_movablemem_map(start_pfn, end_pfn);
+}
+
+/*
+ * cmdline_parse_movablemem_map - Parse boot option movablemem_map.
+ * @p:	The boot option of the following format:
+ *	movablemem_map=nn[KMG]@ss[KMG]
+ *
+ * This option sets the memory range [ss, ss+nn) to be used as movable memory.
+ *
+ * Return: 0 on success or -EINVAL on failure.
+ */
+static int __init cmdline_parse_movablemem_map(char *p)
+{
+	char *oldp;
+	u64 start_at, mem_size;
+
+	if (!p)
+		goto err;
+
+	oldp = p;
+	mem_size = memparse(p, &p);
+	if (p == oldp)
+		goto err;
+
+	if (*p == '@') {
+		oldp = ++p;
+		start_at = memparse(p, &p);
+		if (p == oldp || *p != '\0')
+			goto err;
+
+		movablemem_map_add_region(start_at, mem_size);
+		return 0;
+	}
+err:
+	return -EINVAL;
+}
+early_param("movablemem_map", cmdline_parse_movablemem_map);
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 /**
-- 
1.7.1


* [PATCH v1 4/9] x86, mm, numa, acpi: Introduce zone_movable_limit[] to store start pfn of ZONE_MOVABLE.
  2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
                   ` (2 preceding siblings ...)
  2013-03-16 10:35 ` [PATCH v1 3/9] x86, mm, numa, acpi: Add movable_memmap boot option Tang Chen
@ 2013-03-16 10:35 ` Tang Chen
  2013-03-16 10:35 ` [PATCH v1 5/9] x86, mm, numa, acpi: Extend movablemem_map to the end of each node Tang Chen
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-16 10:35 UTC (permalink / raw)
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

Since node info in SRAT may not be in increasing order, we may encounter
a lower range after having handled a higher range. So we need to keep
the lowest movable pfn of each node, and update it whenever we parse a
SRAT memory entry with a lower one.

This patch introduces a new array zone_movable_limit[], which is used
to store the start pfn of each node's ZONE_MOVABLE.

We update it, if necessary, each time we parse a SRAT memory entry.
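
For example (illustrative): suppose movablemem_map covers pfns [3G, 6G)
and node 1 has SRAT entries [4G, 6G) and [2G, 4G), parsed in that order.
The first entry overlaps the map and sets zone_movable_limit[1] to
pfn(4G); the second one also overlaps (its tail [3G, 4G) lies in the
map), and since its clamped start pfn(3G) is lower, zone_movable_limit[1]
is updated to pfn(3G).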

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/srat.c |   29 +++++++++++++++++++++++++++++
 include/linux/mm.h |    9 +++++++++
 mm/page_alloc.c    |   35 +++++++++++++++++++++++++++++++++--
 3 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 5055fa7..6cd4d33 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -141,6 +141,33 @@ static inline int save_add_info(void) {return 1;}
 static inline int save_add_info(void) {return 0;}
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+static void __init sanitize_movablemem_map(int nid, u64 start, u64 end)
+{
+	int overlap;
+	unsigned long start_pfn, end_pfn;
+
+	start_pfn = PFN_DOWN(start);
+	end_pfn = PFN_UP(end);
+
+	overlap = movablemem_map_overlap(start_pfn, end_pfn);
+	if (overlap >= 0) {
+		start_pfn = max(start_pfn,
+				movablemem_map.map[overlap].start_pfn);
+
+		if (zone_movable_limit[nid])
+			zone_movable_limit[nid] = min(zone_movable_limit[nid],
+						      start_pfn);
+		else
+			zone_movable_limit[nid] = start_pfn;
+	}
+}
+#else		/* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+static inline void sanitize_movablemem_map(int nid, u64 start, u64 end)
+{
+}
+#endif		/* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+
 /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
 int __init
 acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
@@ -181,6 +208,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 	       (unsigned long long) start, (unsigned long long) end - 1,
 	       hotpluggable ? "Hot Pluggable" : "");
 
+	sanitize_movablemem_map(node, start, end);
+
 	return 0;
 out_err_bad_srat:
 	bad_srat();
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9c068d5..d2c5fec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1343,6 +1343,15 @@ struct movablemem_map {
 	struct movablemem_entry map[MOVABLEMEM_MAP_MAX];
 };
 
+extern struct movablemem_map movablemem_map;
+
+extern void __init insert_movablemem_map(unsigned long start_pfn,
+					 unsigned long end_pfn);
+extern int __init movablemem_map_overlap(unsigned long start_pfn,
+					 unsigned long end_pfn);
+
+extern unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 27fcd29..f451ded 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -210,6 +210,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -5065,6 +5066,36 @@ early_param("kernelcore", cmdline_parse_kernelcore);
 early_param("movablecore", cmdline_parse_movablecore);
 
 /**
+ * movablemem_map_overlap() - Check if a range overlaps movablemem_map.map[].
+ * @start_pfn: start pfn of the range to be checked
+ * @end_pfn:   end pfn of the range to be checked (exclusive)
+ *
+ * This function checks if a given memory range [start_pfn, end_pfn) overlaps
+ * the movablemem_map.map[] array.
+ *
+ * Return: index of the first overlapping element in movablemem_map.map[],
+ *         or -1 if there is no overlap.
+ */
+int __init movablemem_map_overlap(unsigned long start_pfn,
+				  unsigned long end_pfn)
+{
+	int overlap;
+
+	if (!movablemem_map.nr_map)
+		return -1;
+
+	for (overlap = 0; overlap < movablemem_map.nr_map; overlap++)
+		if (start_pfn < movablemem_map.map[overlap].end_pfn)
+			break;
+
+	if (overlap == movablemem_map.nr_map ||
+	    end_pfn <= movablemem_map.map[overlap].start_pfn)
+		return -1;
+
+	return overlap;
+}
+
+/**
 * insert_movablemem_map - Insert a memory range into movablemem_map.map.
  * @start_pfn:	start pfn of the range
  * @end_pfn:	end pfn of the range
@@ -5072,8 +5103,8 @@ early_param("movablecore", cmdline_parse_movablecore);
  * This function will also merge the overlapped ranges, and sort the array
  * by start_pfn in monotonic increasing order.
  */
-static void __init insert_movablemem_map(unsigned long start_pfn,
-					  unsigned long end_pfn)
+void __init insert_movablemem_map(unsigned long start_pfn,
+				  unsigned long end_pfn)
 {
 	int pos, overlap;
 
-- 
1.7.1


* [PATCH v1 5/9] x86, mm, numa, acpi: Extend movablemem_map to the end of each node.
  2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
                   ` (3 preceding siblings ...)
  2013-03-16 10:35 ` [PATCH v1 4/9] x86, mm, numa, acpi: Introduce zone_movable_limit[] to store start pfn of ZONE_MOVABLE Tang Chen
@ 2013-03-16 10:35 ` Tang Chen
  2013-03-16 10:35 ` [PATCH v1 6/9] x86, mm, numa, acpi: Support getting hotplug info from SRAT Tang Chen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-16 10:35 UTC (permalink / raw)
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

When implementing the movablemem_map boot option, we introduced an array
movablemem_map.map[] to store the memory ranges to be set as ZONE_MOVABLE.

Since ZONE_MOVABLE is the last zone of a node, if the user didn't specify
the whole node's memory range, we need to extend the specified range to
the node end, so that we can use it to prevent memblock from allocating
memory in the ranges the user didn't explicitly specify but which will
become ZONE_MOVABLE as well.

We now implement the movablemem_map boot option like this:
        /*
         * For movablemem_map=nn[KMG]@ss[KMG]:
         *
         * SRAT:                |_____| |_____| |_________| |_________| ......
         * node id:                0       1         1           2
         * user specified:                |__|                 |___|
         * movablemem_map:                |___| |_________|    |______| ......
         *
         * Using movablemem_map, we can prevent memblock from allocating memory
         * on ZONE_MOVABLE at boot time.
         *
         * NOTE: In this case, SRAT info will be ignored.
         */
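
As a worked example (illustrative): node 1 consists of SRAT ranges
[2G, 4G) and [4G, 6G), and the user specifies a range equivalent to
pfns [3G, 3.5G). Parsing [2G, 4G) sets zone_movable_limit[1] to pfn(3G)
and, because the user range ends before the SRAT range does, inserts
[3G, 4G). Parsing [4G, 6G) finds no overlap, but the range lies above
zone_movable_limit[1], so it is inserted and merged, leaving
movablemem_map covering [3G, 6G), i.e. from ss to the end of the node.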

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/srat.c |   34 ++++++++++++++++++++++++++++++----
 1 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 6cd4d33..ee888a2 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -150,16 +150,42 @@ static void __init sanitize_movablemem_map(int nid, u64 start, u64 end)
 	start_pfn = PFN_DOWN(start);
 	end_pfn = PFN_UP(end);
 
+	/*
+	 * For movablemem_map=nn[KMG]@ss[KMG]:
+	 *
+	 * SRAT:                |_____| |_____| |_________| |_________| ......
+	 * node id:                0       1         1           2
+	 * user specified:                |__|                 |___|
+	 * movablemem_map:                |___| |_________|    |______| ......
+	 *
+	 * Using movablemem_map, we can prevent memblock from allocating memory
+	 * on ZONE_MOVABLE at boot time.
+	 */
 	overlap = movablemem_map_overlap(start_pfn, end_pfn);
 	if (overlap >= 0) {
+		/*
+		 * If this range overlaps with movablemem_map, then update
+		 * zone_movable_limit[nid] if it has lower start pfn.
+		 */
 		start_pfn = max(start_pfn,
 				movablemem_map.map[overlap].start_pfn);
 
-		if (zone_movable_limit[nid])
-			zone_movable_limit[nid] = min(zone_movable_limit[nid],
-						      start_pfn);
-		else
+		if (!zone_movable_limit[nid] ||
+		    zone_movable_limit[nid] > start_pfn)
 			zone_movable_limit[nid] = start_pfn;
+
+		/* Insert the higher part of the overlapped range. */
+		if (movablemem_map.map[overlap].end_pfn < end_pfn)
+			insert_movablemem_map(start_pfn, end_pfn);
+	} else {
+		/*
+		 * If this is a range higher than zone_movable_limit[nid],
+		 * insert it to movablemem_map because all ranges higher than
+		 * zone_movable_limit[nid] on this node will be ZONE_MOVABLE.
+		 */
+		if (zone_movable_limit[nid] &&
+		    start_pfn > zone_movable_limit[nid])
+			insert_movablemem_map(start_pfn, end_pfn);
 	}
 }
 #else		/* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
-- 
1.7.1


* [PATCH v1 6/9] x86, mm, numa, acpi: Support getting hotplug info from SRAT.
  2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
                   ` (4 preceding siblings ...)
  2013-03-16 10:35 ` [PATCH v1 5/9] x86, mm, numa, acpi: Extend movablemem_map to the end of each node Tang Chen
@ 2013-03-16 10:35 ` Tang Chen
  2013-03-16 10:35 ` [PATCH v1 7/9] x86, mm, numa, acpi: Sanitize zone_movable_limit[] Tang Chen
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-16 10:35 UTC (permalink / raw)
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

We now provide an option for users who don't want to specify physical
memory addresses on the kernel command line.

        /*
         * For movablemem_map=acpi:
         *
         * SRAT:                |_____| |_____| |_________| |_________| ......
         * node id:                0       1         1           2
         * hotpluggable:           n       y         y           n
         * movablemem_map:              |_____| |_________|
         *
         * Using movablemem_map, we can prevent memblock from allocating memory
         * on ZONE_MOVABLE at boot time.
         */

So the user just specifies movablemem_map=acpi, and the kernel will use
the hotpluggable info in SRAT to determine which memory ranges should be
set as ZONE_MOVABLE.

NOTE: Using this option will degrade NUMA performance because whole nodes
      will be set as ZONE_MOVABLE, and the kernel cannot use memory on
      them. Users who don't want to lose NUMA performance should simply
      not use it.
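
For example (illustrative): on a box whose SRAT marks both memory ranges
of node 1 as hotpluggable, as in the diagram above, booting with
movablemem_map=acpi inserts both ranges into movablemem_map and sets
zone_movable_limit[1] to the start of the lower one, unless the kernel
image or another early memblock reservation lies on node 1, in which
case the whole node is treated as un-hotpluggable.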

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |   15 +++++++
 arch/x86/mm/srat.c                  |   74 +++++++++++++++++++++++++++++++++--
 include/linux/mm.h                  |    2 +
 mm/page_alloc.c                     |   22 ++++++++++-
 4 files changed, 108 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index dd3a36a..40387a2 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1649,6 +1649,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movablemem_map=acpi
+			[KNL,X86,IA-64,PPC] This parameter is similar to
+			memmap except it specifies the memory map of
+			ZONE_MOVABLE.
+			This option informs the kernel to use the Hot Pluggable
+			bit in the SRAT flags from the ACPI BIOS to determine
+			which memory devices could be hotplugged. The corresponding
+			memory ranges will be set as ZONE_MOVABLE.
+			NOTE: Whatever node the kernel resides in will always
+			      be un-hotpluggable.
+
 	movablemem_map=nn[KMG]@ss[KMG]
 			[KNL,X86,IA-64,PPC] This parameter is similar to
 			memmap except it specifies the memory map of
@@ -1669,6 +1680,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			satisfied. So the administrator should be careful that
 			the amount of movablemem_map areas are not too large.
 			Otherwise kernel won't have enough memory to start.
+			NOTE: We don't stop users specifying the node the
+			      kernel resides in as hotpluggable so that this
+			      option can be used as a workaround for firmware
+			      bugs.
 
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index ee888a2..fd3d4c8 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -142,15 +142,78 @@ static inline int save_add_info(void) {return 0;}
 #endif
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-static void __init sanitize_movablemem_map(int nid, u64 start, u64 end)
+static void __init sanitize_movablemem_map(int nid, u64 start, u64 end,
+					   bool hotpluggable)
 {
-	int overlap;
+	int overlap, i;
 	unsigned long start_pfn, end_pfn;
 
 	start_pfn = PFN_DOWN(start);
 	end_pfn = PFN_UP(end);
 
 	/*
+	 * For movablemem_map=acpi:
+	 *
+	 * SRAT:                |_____| |_____| |_________| |_________| ......
+	 * node id:                0       1         1           2
+	 * hotpluggable:           n       y         y           n
+	 * movablemem_map:              |_____| |_________|
+	 *
+	 * Using movablemem_map, we can prevent memblock from allocating memory
+	 * on ZONE_MOVABLE at boot time.
+	 *
+	 * Before parsing SRAT, memblock has already reserved some memory
+	 * ranges for other purposes, such as the kernel image. We cannot
+	 * prevent the kernel from using this memory, so we need to exclude
+	 * it even if it is hotpluggable.
+	 * Furthermore, to ensure the kernel has enough memory to boot, we
+	 * treat all the memory on the node where the kernel resides as
+	 * un-hotpluggable.
+	 */
+	if (hotpluggable && movablemem_map.acpi) {
+		/* Exclude ranges reserved by memblock. */
+		struct memblock_type *rgn = &memblock.reserved;
+
+		for (i = 0; i < rgn->cnt; i++) {
+			if (end <= rgn->regions[i].base ||
+			    start >= rgn->regions[i].base +
+			    rgn->regions[i].size)
+				continue;
+
+			/*
+			 * If the memory range overlaps the memory reserved by
+			 * memblock, then the kernel resides in this node.
+			 */
+			node_set(nid, movablemem_map.numa_nodes_kernel);
+			zone_movable_limit[nid] = 0;
+
+			return;
+		}
+
+		/*
+		 * If the kernel resides in this node, then the whole node
+		 * should not be hotpluggable.
+		 */
+		if (node_isset(nid, movablemem_map.numa_nodes_kernel)) {
+			zone_movable_limit[nid] = 0;
+			return;
+		}
+
+		/*
+		 * Otherwise, if the range is hotpluggable, and the kernel is
+		 * not on this node, insert it into movablemem_map.
+		 */
+		insert_movablemem_map(start_pfn, end_pfn);
+		if (zone_movable_limit[nid])
+			zone_movable_limit[nid] = min(zone_movable_limit[nid],
+						      start_pfn);
+		else
+			zone_movable_limit[nid] = start_pfn;
+
+		return;
+	}
+
+	/*
	 * For movablemem_map=nn[KMG]@ss[KMG]:
 	 *
 	 * SRAT:                |_____| |_____| |_________| |_________| ......
@@ -160,6 +223,8 @@ static void __init sanitize_movablemem_map(int nid, u64 start, u64 end)
 	 *
 	 * Using movablemem_map, we can prevent memblock from allocating memory
 	 * on ZONE_MOVABLE at boot time.
+	 *
+	 * NOTE: In this case, SRAT info will be ignored.
 	 */
 	overlap = movablemem_map_overlap(start_pfn, end_pfn);
 	if (overlap >= 0) {
@@ -189,7 +254,8 @@ static void __init sanitize_movablemem_map(int nid, u64 start, u64 end)
 	}
 }
 #else		/* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
-static inline void sanitize_movablemem_map(int nid, u64 start, u64 end)
+static inline void sanitize_movablemem_map(int nid, u64 start, u64 end,
+					   bool hotpluggable)
 {
 }
 #endif		/* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
@@ -234,7 +300,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 	       (unsigned long long) start, (unsigned long long) end - 1,
 	       hotpluggable ? "Hot Pluggable" : "");
 
-	sanitize_movablemem_map(node, start, end);
+	sanitize_movablemem_map(node, start, end, hotpluggable);
 
 	return 0;
 out_err_bad_srat:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d2c5fec..37cf1d7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1339,8 +1339,10 @@ struct movablemem_entry {
 };
 
 struct movablemem_map {
+	bool acpi;	/* True if using SRAT info. */
 	int nr_map;
 	struct movablemem_entry map[MOVABLEMEM_MAP_MAX];
+	nodemask_t numa_nodes_kernel;   /* on which nodes kernel resides in */
 };
 
 extern struct movablemem_map movablemem_map;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f451ded..31d27af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -203,7 +203,10 @@ static unsigned long __meminitdata dma_reserve;
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 /* Movable memory ranges, will also be used by memblock subsystem. */
-struct movablemem_map movablemem_map;
+struct movablemem_map movablemem_map = {
+	.acpi = false,
+	.nr_map = 0,
+};
 
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
@@ -5204,6 +5207,23 @@ static int __init cmdline_parse_movablemem_map(char *p)
 	if (!p)
 		goto err;
 
+	if (!strcmp(p, "acpi"))
+		movablemem_map.acpi = true;
+
+	/*
+	 * If the user decides to use info from the BIOS, all the other
+	 * user-specified ranges will be ignored.
+	 */
+	if (movablemem_map.acpi) {
+		if (movablemem_map.nr_map) {
+			memset(movablemem_map.map, 0,
+			       sizeof(struct movablemem_entry) *
+			       movablemem_map.nr_map);
+			movablemem_map.nr_map = 0;
+		}
+		return 0;
+	}
+
 	oldp = p;
 	mem_size = memparse(p, &p);
 	if (p == oldp)
-- 
1.7.1


* [PATCH v1 7/9] x86, mm, numa, acpi: Sanitize zone_movable_limit[].
  2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
                   ` (5 preceding siblings ...)
  2013-03-16 10:35 ` [PATCH v1 6/9] x86, mm, numa, acpi: Support getting hotplug info from SRAT Tang Chen
@ 2013-03-16 10:35 ` Tang Chen
  2013-03-16 10:35 ` [PATCH v1 8/9] x86, mm, numa, acpi: make movablemem_map have higher priority Tang Chen
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-16 10:35 UTC (permalink / raw)
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

As mentioned by Liu Jiang and Wu Jianguo, users could specify DMA,
DMA32, and HIGHMEM ranges as movable. In order to ensure the kernel
will work correctly, we should exclude these memory ranges from
zone_movable_limit[].

NOTE: find_usable_zone_for_movable() is now called earlier to
      initialize movable_zone so that sanitize_zone_movable_limit()
      can use it. This was pointed out by Wu Jianguo
      <wujianguo@huawei.com>.
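
For example (x86_64, illustrative): if the user marks [1G, 5G) on node 0
as movable, zone_movable_limit[0] is first set to pfn(1G). Since that
lies below the top of ZONE_DMA32 (the 4G boundary),
sanitize_zone_movable_limit() raises it to pfn(4G), so only [4G, 5G)
can end up in ZONE_MOVABLE.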

Reported-by: Wu Jianguo <wujianguo@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Liu Jiang <jiang.liu@huawei.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 mm/page_alloc.c |   55 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 31d27af..70ed381 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4412,6 +4412,58 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
 	return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
 }
 
+/**
+ * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
+ *
+ * zone_movable_limit[] has been initialized when parsing SRAT or
+ * movablemem_map. This function will try to exclude ZONE_DMA, ZONE_DMA32,
+ * and HIGHMEM from zone_movable_limit[].
+ *
+ * zone_movable_limit[nid] == 0 means no limit for the node.
+ *
+ * Note: Need to be called with movable_zone initialized.
+ */
+static void __meminit sanitize_zone_movable_limit(void)
+{
+	int i, nid;
+	unsigned long start_pfn, end_pfn;
+
+	if (!movablemem_map.nr_map)
+		return;
+
+	/* Iterate each node id. */
+	for_each_node(nid) {
+		/* If we have no limit for this node, just skip it. */
+		if (!zone_movable_limit[nid])
+			continue;
+
+#ifdef CONFIG_ZONE_DMA
+		/* Skip DMA memory. */
+		if (zone_movable_limit[nid] <
+		    arch_zone_highest_possible_pfn[ZONE_DMA])
+			zone_movable_limit[nid] =
+				arch_zone_highest_possible_pfn[ZONE_DMA];
+#endif
+
+#ifdef CONFIG_ZONE_DMA32
+		/* Skip DMA32 memory. */
+		if (zone_movable_limit[nid] <
+		    arch_zone_highest_possible_pfn[ZONE_DMA32])
+			zone_movable_limit[nid] =
+				arch_zone_highest_possible_pfn[ZONE_DMA32];
+#endif
+
+#ifdef CONFIG_HIGHMEM
+		/* Skip lowmem if ZONE_MOVABLE is highmem. */
+		if (zone_movable_is_highmem() &&
+		    zone_movable_limit[nid] <
+		    arch_zone_lowest_possible_pfn[ZONE_HIGHMEM])
+			zone_movable_limit[nid] =
+				arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
+#endif
+	}
+}
+
 #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
@@ -4826,7 +4878,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		goto out;
 
 	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-	find_usable_zone_for_movable();
 	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
 restart:
@@ -4985,6 +5036,8 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 
 	/* Find the PFNs that ZONE_MOVABLE begins at in each node */
 	memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
+	find_usable_zone_for_movable();
+	sanitize_zone_movable_limit();
 	find_zone_movable_pfns_for_nodes();
 
 	/* Print out the zone ranges */
-- 
1.7.1


* [PATCH v1 8/9] x86, mm, numa, acpi: make movablemem_map have higher priority
  2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
                   ` (6 preceding siblings ...)
  2013-03-16 10:35 ` [PATCH v1 7/9] x86, mm, numa, acpi: Sanitize zone_movable_limit[] Tang Chen
@ 2013-03-16 10:35 ` Tang Chen
  2013-03-16 10:35 ` [PATCH v1 9/9] x86, mm, numa, acpi: Memblock limit with movablemem_map Tang Chen
  2013-03-17  0:25 ` [PATCH v1 part1 0/9] Introduce movablemem_map boot option Will Huck
  9 siblings, 0 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-16 10:35 UTC (permalink / raw)
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

If kernelcore or movablecore is specified at the same time as
movablemem_map, movablemem_map will have higher priority to be
satisfied. This patch makes find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] under the limits in zone_movable_limit[].
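
For example (illustrative): with kernelcore=2G on a machine where node 0
spans [0, 4G) and zone_movable_limit[0] is pfn(3G), the kernelcore pages
for node 0 are now taken only from [0, 3G), and zone_movable_pfn[0] may
grow up to, but never beyond, zone_movable_limit[0].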

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 mm/page_alloc.c |   28 +++++++++++++++++++++++++---
 1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 70ed381..bdde30d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4873,9 +4873,17 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		required_kernelcore = max(required_kernelcore, corepages);
 	}
 
-	/* If kernelcore was not specified, there is no ZONE_MOVABLE */
-	if (!required_kernelcore)
+	/*
+	 * If neither kernelcore/movablecore nor movablemem_map is specified,
+	 * there is no ZONE_MOVABLE. But if movablemem_map is specified, the
+	 * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
+	 */
+	if (!required_kernelcore) {
+		if (movablemem_map.nr_map)
+			memcpy(zone_movable_pfn, zone_movable_limit,
+				sizeof(zone_movable_pfn));
 		goto out;
+	}
 
 	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
 	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
@@ -4905,10 +4913,24 @@ restart:
 		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 			unsigned long size_pages;
 
+			/*
+			 * Find more memory for kernelcore in
+			 * [zone_movable_pfn[nid], zone_movable_limit[nid]).
+			 */
 			start_pfn = max(start_pfn, zone_movable_pfn[nid]);
 			if (start_pfn >= end_pfn)
 				continue;
 
+			if (zone_movable_limit[nid]) {
+				end_pfn = min(end_pfn, zone_movable_limit[nid]);
+				/* No range left for kernelcore in this node */
+				if (start_pfn >= end_pfn) {
+					zone_movable_pfn[nid] =
+							zone_movable_limit[nid];
+					break;
+				}
+			}
+
 			/* Account for what is only usable for kernelcore */
 			if (start_pfn < usable_startpfn) {
 				unsigned long kernel_pages;
@@ -4968,12 +4990,12 @@ restart:
 	if (usable_nodes && required_kernelcore > usable_nodes)
 		goto restart;
 
+out:
 	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
 		zone_movable_pfn[nid] =
 			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
 
-out:
 	/* restore the node_state */
 	node_states[N_MEMORY] = saved_node_state;
 }
-- 
1.7.1


* [PATCH v1 9/9] x86, mm, numa, acpi: Memblock limit with movablemem_map
  2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
                   ` (7 preceding siblings ...)
  2013-03-16 10:35 ` [PATCH v1 8/9] x86, mm, numa, acpi: make movablemem_map have higher priority Tang Chen
@ 2013-03-16 10:35 ` Tang Chen
  2013-03-17  0:25 ` [PATCH v1 part1 0/9] Introduce movablemem_map boot option Will Huck
  9 siblings, 0 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-16 10:35 UTC (permalink / raw)
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

Ensure memblock will not allocate memory from areas that may be
ZONE_MOVABLE. The map info comes from the movablemem_map boot option.

The following problem was reported by Stephen Rothwell:
the definition of struct movablemem_map is protected by
CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock is not. So add
CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of movablemem_map
in memblock_find_in_range_node().
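
For example (illustrative): if movablemem_map covers pfns [4G, 6G) and
memblock has a free region [3G, 6G), a top-down request for 1M that
would normally be placed at [6G-1M, 6G) is pushed below the movable
range and returns [4G-1M, 4G) instead.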

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 include/linux/memblock.h |    2 +
 mm/memblock.c            |   50 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..3e5ecb2 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,7 @@ struct memblock {
 
 extern struct memblock memblock;
 extern int memblock_debug;
+extern struct movablemem_map movablemem_map;
 
 #define memblock_dbg(fmt, ...) \
 	if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
@@ -60,6 +61,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 			  unsigned long *out_end_pfn, int *out_nid);
 
diff --git a/mm/memblock.c b/mm/memblock.c
index b8d9147..1bcd9b9 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -92,9 +92,58 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
  *
  * Find @size free area aligned to @align in the specified range and node.
  *
+ * If we have CONFIG_HAVE_MEMBLOCK_NODE_MAP defined, we need to check that
+ * the memory we found is not in hotpluggable ranges.
+ *
  * RETURNS:
  * Found address on success, %0 on failure.
  */
+#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
+					phys_addr_t end, phys_addr_t size,
+					phys_addr_t align, int nid)
+{
+	phys_addr_t this_start, this_end, cand;
+	u64 i;
+	int curr = movablemem_map.nr_map - 1;
+
+	/* pump up @end */
+	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+		end = memblock.current_limit;
+
+	/* avoid allocating the first page */
+	start = max_t(phys_addr_t, start, PAGE_SIZE);
+	end = max(start, end);
+
+	for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
+		this_start = clamp(this_start, start, end);
+		this_end = clamp(this_end, start, end);
+
+restart:
+		if (this_end <= this_start || this_end < size)
+			continue;
+
+		for (; curr >= 0; curr--) {
+			if ((movablemem_map.map[curr].start_pfn << PAGE_SHIFT)
+			    < this_end)
+				break;
+		}
+
+		cand = round_down(this_end - size, align);
+		if (curr >= 0 &&
+		    cand < movablemem_map.map[curr].end_pfn << PAGE_SHIFT) {
+			this_end = movablemem_map.map[curr].start_pfn
+				   << PAGE_SHIFT;
+			goto restart;
+		}
+
+		if (cand >= this_start)
+			return cand;
+	}
+
+	return 0;
+}
+#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 					phys_addr_t end, phys_addr_t size,
 					phys_addr_t align, int nid)
@@ -123,6 +172,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 	}
 	return 0;
 }
+#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 /**
  * memblock_find_in_range - find free area in given range
-- 
1.7.1


* Re: [PATCH v1 part1 0/9] Introduce movablemem_map boot option.
  2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
                   ` (8 preceding siblings ...)
  2013-03-16 10:35 ` [PATCH v1 9/9] x86, mm, numa, acpi: Memblock limit with movablemem_map Tang Chen
@ 2013-03-17  0:25 ` Will Huck
  2013-03-18  9:57   ` Tang Chen
  9 siblings, 1 reply; 12+ messages in thread
From: Will Huck @ 2013-03-17  0:25 UTC (permalink / raw)
  To: Tang Chen
  Cc: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst, x86, linux-doc, linux-kernel, linux-mm

Hi Tang,
On 03/16/2013 06:35 PM, Tang Chen wrote:
> Hi Yinghai, all,
>
> As Yinghai have implemented parsing numa info early more considerately,
> I think we can introduce the movablemem_map boot option again.
>
> This patch-set is based on Linux 3.9 rc-2, but need to apply Yinghai's
> "x86, ACPI, numa: Parse numa info early" patch-set first.
> Please refer to:
> v1: https://lkml.org/lkml/2013/3/7/642
> v2: https://lkml.org/lkml/2013/3/10/47
>
>
> In this part1 patch-set, we reimplemented movablemem_map boot option
> based on Yinghai's SRAT work. The path is like this:
> 1) parse SRAT, fill only existing memory into numa_meminfo, like:
>     numa_cleanup_meminfo() {
>   251         const u64 low = 0;
>   252         const u64 high = PFN_PHYS(max_pfn);
> ......
>   255         /* first, trim all entries */
>   256         for (i = 0; i < mi->nr_blks; i++) {
>   257                 struct numa_memblk *bi = &mi->blk[i];
>   258
>   259                 /* make sure all blocks are inside the limits */
>   260                 bi->start = max(bi->start, low);
>   261                 bi->end = min(bi->end, high);
>   262
>   263                 /* and there's no empty block */
>   264                 if (bi->start >= bi->end)
>   265                         numa_remove_memblk_from(i--, mi);
>   266         }
> ......
>     }
>
>     Those non-existing memory, such as memory not added yet, won't be
>     stored in numa_meminfo.
>
> 2) initialize memory mapping for the existing memory, putting pagetables
>     and vmemmap on local node.
>
> Since not all memory info is kept, we have to sanitize movablemem_map.map[]
> when we parse SRAT, so we may prevent allocating pagetables or vmemmap on
> local node if user specified the whole node as movable.
>
> To avoid this problem, here is my idea:
> 1) Store not only existing memory ranges in numa_mem_info, but all the
>     memory info from SRAT;
> 2) Map only existing memory as before;
> 3) Do memblock limitation after memory mapping initialization using
>     numa_meminfo, so that movablemem_map will be able to exclude pagetables
>     and vmemmap ranges on local node.
>
> This will be done in part2 soon.
>
> How do you think?
>
> Part2 of this patch-set is under development.
>
> ========================================================================
> [What we are doing]
> This patchset introduces a boot option for user to specify ZONE_MOVABLE
> memory map for each node in the system. Users can use it in two ways:
>
> 1. movablecore_map=nn[KMG]@ss[KMG]
>     In this way, the kernel will make sure memory range from ss to ss+nn is
>     on ZONE_MOVABLE. The hotplug info provided by SRAT will be ignored.
>
> 2. movablecore_map=acpi
>     In this way, the kernel will use memory hotplug info in SRAT to determine
>     ZONE_MOVABLE for each node. All the ranges user has specified will be
>     ignored.
>
>
> [Why we do this]
> If we hot remove a memroy device, it cannot have kernel memory,
> because Linux cannot migrate kernel memory currently. Therefore,
> we have to guarantee that the hot removed memory has only movable
> memoroy.
> (Here is an exception: When we implement the node hotplug functionality,
> for those kernel memory whose life cycle is the same as the node, such as
> pagetables, vmemmap and so on, although the kernel cannot migrate them,
> we can still put them on local node because we can free them before we
> hot-remove the node. This is not implemented yet.)
>
> Linux has two boot options, kernelcore= and movablecore=, for
> creating movable memory. These boot options can specify the amount
> of memory use as kernel or movable memory. Using them, we can
> create ZONE_MOVABLE which has only movable memory.
> (NOTE: doing this will degrade NUMA performance because the kernel won't
>   be able to distribute kernel memory evenly to each node.)
>
> But it does not fulfill a requirement of memory hot remove, because
> even if we specify the boot options, movable memory is distributed
> in each node evenly. So when we want to hot remove memory whose
> memory range is 0x80000000-0xc0000000, we have no way to specify
> the memory as movable memory.
>
> Furthermore, even if we can use SRAT, users still need an interface
> to enable/disable this functionality if they don't want to lose their
> NUMA performance.  So I think, a user interface is always needed.
>
> So we proposed this new feature which specifies memory range to use as
> movable memory.

http://marc.info/?l=linux-mm&m=136014458829566&w=2

It seems that Mel doesn't like this idea.

>
> [Ways to do this]
> There may be 2 ways to specify movable memory.
> 1. use firmware information
> 2. use boot option
>
> 1. use firmware information
>    According to ACPI spec 5.0, the SRAT table has a memory affinity
>    structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
>    Memory Affinity Structure". If we use that information, we might be
>    able to specify movable memory via firmware. For example, if the Hot
>    Pluggable Field is enabled, Linux sets the memory as movable memory.
>
> 2. use boot option
>    This is our proposal. New boot option can specify memory range to use
>    as movable memory.
>
>
> [How we do this]
> We chose the second way, because with the first way, users cannot easily
> change the memory range to use as movable memory. We think that if we
> create movable memory, a performance regression may occur due to NUMA. In
> that case, the user can turn the feature off easily if we prepare the
> boot option. And if we prepare the boot option, the user can easily
> select which memory to use as movable memory.
>
>
> [How to use]
> 1. For movablecore_map=nn[KMG]@ss[KMG]:
>           *
>           * SRAT:                |_____| |_____| |_________| |_________| ......
>           * node id:                0       1         1           2
>           * user specified:                |__|                 |___|
>           * ZONE_MOVABLE:                  |___| |_________|    |______| ......
>           *
>     NOTE: 1) User can specify this option more than once, but at most MAX_NUMNODES
>              times. The extra options will be ignored.
>           2) In this case, SRAT info will be ignored.
>
> 2. For movablemem_map=acpi:
>           *
>           * SRAT:                |_____| |_____| |_________| |_________| ......
>           * node id:                0       1         1           2
>           * hotpluggable:           n       y         y           n
>           * ZONE_MOVABLE:                |_____| |_________|
>           *
>     NOTE: 1) Before parsing SRAT, memblock has already reserved some memory
>              ranges for other purposes, such as for the kernel image. We
>              cannot prevent the kernel from using this memory, so we need
>              to exclude it even if it is hotpluggable.
>              Furthermore, to ensure the kernel has enough memory to boot,
>              we treat all the memory on the node where the kernel resides
>              as un-hotpluggable.
>           2) In this case, all the user specified memory ranges will be
>              ignored.
>
> We also need to consider the following points:
> 1) Using this boot option could degrade NUMA performance because the kernel
>     memory will not be distributed on each node evenly. So users who don't
>     want to lose their NUMA performance should just not use it.
> 2) If kernelcore or movablecore is also specified, movablecore_map will have
>     higher priority to be satisfied.
> 3) This option has no conflict with memmap option.
>
>
> Tang Chen (8):
>    acpi: Print hotplug info in SRAT.
>    x86, mm, numa, acpi: Add movable_memmap boot option.
>    x86, mm, numa, acpi: Introduce zone_movable_limit[] to store start
>      pfn of ZONE_MOVABLE.
>    x86, mm, numa, acpi: Extend movablemem_map to the end of each node.
>    x86, mm, numa, acpi: Support getting hotplug info from SRAT.
>    x86, mm, numa, acpi: Sanitize zone_movable_limit[].
>    x86, mm, numa, acpi: make movablemem_map have higher priority
>    x86, mm, numa, acpi: Memblock limit with movablemem_map
>
> Yasuaki Ishimatsu (1):
>    x86: get pg_data_t's memory from other node
>
>   Documentation/kernel-parameters.txt |   36 +++++
>   arch/x86/mm/numa.c                  |    5 +-
>   arch/x86/mm/srat.c                  |  130 +++++++++++++++++-
>   include/linux/memblock.h            |    2 +
>   include/linux/mm.h                  |   22 +++
>   mm/memblock.c                       |   50 +++++++
>   mm/page_alloc.c                     |  265 ++++++++++++++++++++++++++++++++++-
>   7 files changed, 500 insertions(+), 10 deletions(-)
>



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 part1 0/9] Introduce movablemem_map boot option.
  2013-03-17  0:25 ` [PATCH v1 part1 0/9] Introduce movablemem_map boot option Will Huck
@ 2013-03-18  9:57   ` Tang Chen
  0 siblings, 0 replies; 12+ messages in thread
From: Tang Chen @ 2013-03-18  9:57 UTC (permalink / raw)
  To: Will Huck
  Cc: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst, x86, linux-doc, linux-kernel, linux-mm

Hi Will,

On 03/17/2013 08:25 AM, Will Huck wrote:
>
> http://marc.info/?l=linux-mm&m=136014458829566&w=2
>
> It seems that Mel doesn't like this idea.
>

Thank you for reminding me this.

And yes, I have read that email. :)

And about this boot option, we have had a long discussion before.
Please refer to: https://lkml.org/lkml/2012/11/29/190

The situation is:

For now, the Linux kernel cannot migrate kernel direct mapping memory,
and there is no way to ensure that ZONE_NORMAL has no kernel memory. So
we can only use ZONE_MOVABLE to ensure the memory device can be removed.

For now, I have the following reasons why the movablemem_map boot option
is necessary. Some may have been mentioned before, but I think I need to
state them again:

1) If we want to hot-remove a memory device, the device should only have
    memory of two types:
    - kernel memory whose life cycle is the same as the memory device,
      such as pagetables and vmemmap;
    - user memory that can be migrated.

    For type 1: we can allocate it on the local node, just like Yinghai's
                work, and free it when hot-removing.
    For type 2: we can migrate it at run time. But it must be in
                ZONE_MOVABLE because we cannot ensure ZONE_NORMAL has no
                kernel memory.

    So we need a way to limit hotpluggable memory to ZONE_MOVABLE.

2) We have the following ways to do it:
    a) use SRAT, which I have already implemented
    b) specify physical address ranges, which I have implemented too, but
       obviously very few guys like it.
    c) specify node id. But the nid could be changed on some platforms by
       firmware.

    Because of c), we chose to use physical address ranges. To satisfy all
    users, I also implemented a).

3) Even if we don't specify physical addresses on the command line and
    use SRAT instead, we still need the logic in this patch-set to achieve
    the same goal.

4) Since setting a whole node as movable will degrade NUMA performance,
    no matter which way we use, we always need an interface to enable or
    disable this functionality.
    The boot option itself is an interface. If users don't specify it on
    the command line, the kernel will work as before.

So I do want to try again to push this boot option.  :)

With this boot option, memory hotplug will work now.
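
A concrete illustration of how the interface would be used (the addresses
and sizes below are made up for the example, not taken from any real
SRAT): if a user knows that the 2G of memory starting at the 4G boundary
backs a hotpluggable DIMM, booting with

	movablemem_map=2G@4G

should keep 0x100000000-0x180000000 in ZONE_MOVABLE, while booting with

	movablemem_map=acpi

should take the hotpluggable ranges from the SRAT instead. And as said
above, users who care more about NUMA performance can simply omit the
option.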


It's true that we could reimplement the whole mm in Linux to make kernel
memory migratable, but we would need to handle a lot of problems. I agree
with Mel on that. But it is a long way to go in the future.

And the work in the near future:
1) Allocate pagetables and vmemmap on the local node, as Yinghai said.
2) Make the proper modifications for hot-add and hot-remove.
    - Reserve memory for pagetables and vmemmap at hot-add time, maybe
      using memblock.
    - Free all pagetables and vmemmap before hot-remove.
3) As for Mel's advice to modify memory management in Linux to migrate
    kernel pages, it is a long way to go in the future. I think we can
    discuss it more.

Thanks. :)



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-03-18 11:11 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-16 10:35 [PATCH v1 part1 0/9] Introduce movablemem_map boot option Tang Chen
2013-03-16 10:35 ` [PATCH v1 1/9] x86: get pg_data_t's memory from other node Tang Chen
2013-03-16 10:35 ` [PATCH v1 2/9] acpi: Print hotplug info in SRAT Tang Chen
2013-03-16 10:35 ` [PATCH v1 3/9] x86, mm, numa, acpi: Add movable_memmap boot option Tang Chen
2013-03-16 10:35 ` [PATCH v1 4/9] x86, mm, numa, acpi: Introduce zone_movable_limit[] to store start pfn of ZONE_MOVABLE Tang Chen
2013-03-16 10:35 ` [PATCH v1 5/9] x86, mm, numa, acpi: Extend movablemem_map to the end of each node Tang Chen
2013-03-16 10:35 ` [PATCH v1 6/9] x86, mm, numa, acpi: Support getting hotplug info from SRAT Tang Chen
2013-03-16 10:35 ` [PATCH v1 7/9] x86, mm, numa, acpi: Sanitize zone_movable_limit[] Tang Chen
2013-03-16 10:35 ` [PATCH v1 8/9] x86, mm, numa, acpi: make movablemem_map have higher priority Tang Chen
2013-03-16 10:35 ` [PATCH v1 9/9] x86, mm, numa, acpi: Memblock limit with movablemem_map Tang Chen
2013-03-17  0:25 ` [PATCH v1 part1 0/9] Introduce movablemem_map boot option Will Huck
2013-03-18  9:57   ` Tang Chen
