* [PATCH part2 v2 1/8] x86: get pg_data_t's memory from other node
2013-10-12 6:00 [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
@ 2013-10-12 6:03 ` Zhang Yanfei
2013-10-12 6:04 ` [PATCH part2 v2 2/8] memblock, numa: Introduce flag into memblock Zhang Yanfei
` (7 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Zhang Yanfei @ 2013-10-12 6:03 UTC (permalink / raw)
To: Andrew Morton, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
Minchan Kim, mina86@mina86.com, gong.chen@linux.intel.com,
Vasilis Liaskovitis, lwoodman@redhat.com, Rik van Riel,
jweiner@redhat.com, Prarit Bhargava
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Linux MM,
ACPI Devel Maling List, Chen Tang, Tang Chen, Zhang Yanfei
From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
If the system creates a movable node, in which all of the node's memory is
allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for the
node's pg_data_t. So when the first allocation fails, invoke
memblock_alloc_nid(...MAX_NUMNODES) again to retry on any node. Otherwise,
the system could fail to boot.
(We don't use memblock_alloc_try_nid() to retry because that function
panics the system if the allocation fails.)
The node_data could reside on a hotpluggable node, and so could the
pagetable and vmemmap. But for now, doing so would break the memory
hot-remove path.
A node could have several memory devices, and the device holding the node
data should be hot-removed last. But at the NUMA level, we don't know which
memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs to which
memory device; we only have the node. So we can only do node-granularity
hotplug.
But in virtualization, developers are now implementing memory hotplug in
qemu, which supports hotplugging a single memory device. So whole-node
hotplug will not satisfy virtualization users.
So in the end, we concluded that we'd better do memory hotplug and the
local-node work (local node data, pagetable, vmemmap, ...) in two steps.
Please refer to https://lkml.org/lkml/2013/6/19/73
For now, we put the node_data of a movable node on another node, and will
improve this in the future.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
---
arch/x86/mm/numa.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 24aec58..e17db5d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -211,9 +211,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
*/
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
- nd_size, nid);
- return;
+ pr_warn("Cannot find %zu bytes in node %d, so try other nodes\n",
+ nd_size, nid);
+ nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES,
+ MAX_NUMNODES);
+ if (!nd_pa) {
+ pr_err("Cannot find %zu bytes in any node\n", nd_size);
+ return;
+ }
}
nd = __va(nd_pa);
--
1.7.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* [PATCH part2 v2 2/8] memblock, numa: Introduce flag into memblock
2013-10-12 6:00 [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
2013-10-12 6:03 ` [PATCH part2 v2 1/8] x86: get pg_data_t's memory from other node Zhang Yanfei
@ 2013-10-12 6:04 ` Zhang Yanfei
2013-10-12 6:05 ` [PATCH part2 v2 3/8] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions Zhang Yanfei
` (6 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Zhang Yanfei @ 2013-10-12 6:04 UTC (permalink / raw)
To: Andrew Morton, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
Minchan Kim, mina86@mina86.com, gong.chen@linux.intel.com,
Vasilis Liaskovitis, lwoodman@redhat.com, Rik van Riel,
jweiner@redhat.com, Prarit Bhargava
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Linux MM,
ACPI Devel Maling List, Chen Tang, Tang Chen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
There is no flag in memblock to describe what type a memory region is.
Sometimes we use memblock to reserve memory for special usage, and we want
to know what kind of memory it is. So we need a way to differentiate memory
by usage.
In a hotplug environment, we want to reserve hotpluggable memory so the
kernel won't be able to use it. And when the system is up, we have to free
this hotpluggable memory to the buddy allocator. So we need to mark this
memory first.
In order to do so, we need to mark out these special memory regions in memblock.
In this patch, we introduce a new "flags" member into memblock_region:
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
unsigned long flags; /* This is new. */
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
};
This patch does the following things:
1) Add "flags" member to memblock_region.
2) Modify the following APIs' prototype:
memblock_add_region()
memblock_insert_region()
3) Add memblock_reserve_region() to support reserving memory with flags, and keep
memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototype unmodified.
The idea is from Wen Congyang <wency@cn.fujitsu.com> and Liu Jiang <jiang.liu@huawei.com>.
v1 -> v2:
As tj suggested, a zero flag MEMBLK_DEFAULT would confuse users: when
specifying any other flag, such as MEMBLK_HOTPLUG, users wouldn't know
whether to use MEMBLK_DEFAULT | MEMBLK_HOTPLUG or just MEMBLK_HOTPLUG. So
remove MEMBLK_DEFAULT (which is 0) and just use 0 by default to avoid
confusing users.
Suggested-by: Wen Congyang <wency@cn.fujitsu.com>
Suggested-by: Liu Jiang <jiang.liu@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 53 +++++++++++++++++++++++++++++++++-------------
2 files changed, 39 insertions(+), 15 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 77c60e5..9a805ec 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -22,6 +22,7 @@
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
+ unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
diff --git a/mm/memblock.c b/mm/memblock.c
index 53e477b..877973e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -255,6 +255,7 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
type->cnt = 1;
type->regions[0].base = 0;
type->regions[0].size = 0;
+ type->regions[0].flags = 0;
memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
}
}
@@ -405,7 +406,8 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
if (this->base + this->size != next->base ||
memblock_get_region_node(this) !=
- memblock_get_region_node(next)) {
+ memblock_get_region_node(next) ||
+ this->flags != next->flags) {
BUG_ON(this->base + this->size > next->base);
i++;
continue;
@@ -425,13 +427,15 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
* @base: base address of the new region
* @size: size of the new region
* @nid: node id of the new region
+ * @flags: flags of the new region
*
* Insert new memblock region [@base,@base+@size) into @type at @idx.
* @type must already have extra room to accomodate the new region.
*/
static void __init_memblock memblock_insert_region(struct memblock_type *type,
int idx, phys_addr_t base,
- phys_addr_t size, int nid)
+ phys_addr_t size,
+ int nid, unsigned long flags)
{
struct memblock_region *rgn = &type->regions[idx];
@@ -439,6 +443,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
rgn->base = base;
rgn->size = size;
+ rgn->flags = flags;
memblock_set_region_node(rgn, nid);
type->cnt++;
type->total_size += size;
@@ -450,6 +455,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* @base: base address of the new region
* @size: size of the new region
* @nid: nid of the new region
+ * @flags: flags of the new region
*
* Add new memblock region [@base,@base+@size) into @type. The new region
* is allowed to overlap with existing ones - overlaps don't affect already
@@ -460,7 +466,8 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* 0 on success, -errno on failure.
*/
static int __init_memblock memblock_add_region(struct memblock_type *type,
- phys_addr_t base, phys_addr_t size, int nid)
+ phys_addr_t base, phys_addr_t size,
+ int nid, unsigned long flags)
{
bool insert = false;
phys_addr_t obase = base;
@@ -475,6 +482,7 @@ static int __init_memblock memblock_add_region(struct memblock_type *type,
WARN_ON(type->cnt != 1 || type->total_size);
type->regions[0].base = base;
type->regions[0].size = size;
+ type->regions[0].flags = flags;
memblock_set_region_node(&type->regions[0], nid);
type->total_size = size;
return 0;
@@ -505,7 +513,8 @@ repeat:
nr_new++;
if (insert)
memblock_insert_region(type, i++, base,
- rbase - base, nid);
+ rbase - base, nid,
+ flags);
}
/* area below @rend is dealt with, forget about it */
base = min(rend, end);
@@ -515,7 +524,8 @@ repeat:
if (base < end) {
nr_new++;
if (insert)
- memblock_insert_region(type, i, base, end - base, nid);
+ memblock_insert_region(type, i, base, end - base,
+ nid, flags);
}
/*
@@ -537,12 +547,13 @@ repeat:
int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
int nid)
{
- return memblock_add_region(&memblock.memory, base, size, nid);
+ return memblock_add_region(&memblock.memory, base, size, nid, 0);
}
int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
{
- return memblock_add_region(&memblock.memory, base, size, MAX_NUMNODES);
+ return memblock_add_region(&memblock.memory, base, size,
+ MAX_NUMNODES, 0);
}
/**
@@ -597,7 +608,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= base - rbase;
type->total_size -= base - rbase;
memblock_insert_region(type, i, rbase, base - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else if (rend > end) {
/*
* @rgn intersects from above. Split and redo the
@@ -607,7 +619,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= end - rbase;
type->total_size -= end - rbase;
memblock_insert_region(type, i--, rbase, end - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else {
/* @rgn is fully contained, record it */
if (!*end_rgn)
@@ -649,16 +662,24 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
return __memblock_remove(&memblock.reserved, base, size);
}
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+static int __init_memblock memblock_reserve_region(phys_addr_t base,
+ phys_addr_t size,
+ int nid,
+ unsigned long flags)
{
struct memblock_type *_rgn = &memblock.reserved;
- memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
+ memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
(unsigned long long)base,
(unsigned long long)base + size,
- (void *)_RET_IP_);
+ flags, (void *)_RET_IP_);
+
+ return memblock_add_region(_rgn, base, size, nid, flags);
+}
- return memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
}
/**
@@ -1101,6 +1122,7 @@ void __init_memblock memblock_set_current_limit(phys_addr_t limit)
static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
{
unsigned long long base, size;
+ unsigned long flags;
int i;
pr_info(" %s.cnt = 0x%lx\n", name, type->cnt);
@@ -1111,13 +1133,14 @@ static void __init_memblock memblock_dump(struct memblock_type *type, char *name
base = rgn->base;
size = rgn->size;
+ flags = rgn->flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
if (memblock_get_region_node(rgn) != MAX_NUMNODES)
snprintf(nid_buf, sizeof(nid_buf), " on node %d",
memblock_get_region_node(rgn));
#endif
- pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s\n",
- name, i, base, base + size - 1, size, nid_buf);
+ pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s flags: %#lx\n",
+ name, i, base, base + size - 1, size, nid_buf, flags);
}
}
--
1.7.1
* [PATCH part2 v2 3/8] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions
2013-10-12 6:00 [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
2013-10-12 6:03 ` [PATCH part2 v2 1/8] x86: get pg_data_t's memory from other node Zhang Yanfei
2013-10-12 6:04 ` [PATCH part2 v2 2/8] memblock, numa: Introduce flag into memblock Zhang Yanfei
@ 2013-10-12 6:05 ` Zhang Yanfei
2013-10-12 6:06 ` [PATCH part2 v2 4/8] memblock: Make memblock_set_node() support different memblock_type Zhang Yanfei
` (5 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Zhang Yanfei @ 2013-10-12 6:05 UTC (permalink / raw)
To: Andrew Morton, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
Minchan Kim, mina86@mina86.com, gong.chen@linux.intel.com,
Vasilis Liaskovitis, lwoodman@redhat.com, Rik van Riel,
jweiner@redhat.com, Prarit Bhargava
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Linux MM,
ACPI Devel Maling List, Chen Tang, Tang Chen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
In find_hotpluggable_memory, once we find a memory region which is
hotpluggable, we want to mark it in memblock.memory, so that we can later
prevent the memblock allocator from allocating hotpluggable memory for the
kernel.
To achieve this goal, we introduce the MEMBLOCK_HOTPLUG flag to indicate
hotpluggable memory regions in memblock, and a function
memblock_mark_hotplug() to mark hotpluggable memory when we find it.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
include/linux/memblock.h | 17 +++++++++++++++
mm/memblock.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+), 0 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 9a805ec..b788faa 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,9 @@
#define INIT_MEMBLOCK_REGIONS 128
+/* Definition of memblock flags. */
+#define MEMBLOCK_HOTPLUG 0x1 /* hotpluggable region */
+
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
@@ -60,6 +63,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);
+int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
+int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
@@ -122,6 +127,18 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
i != (u64)ULLONG_MAX; \
__next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
+static inline void memblock_set_region_flags(struct memblock_region *r,
+ unsigned long flags)
+{
+ r->flags |= flags;
+}
+
+static inline void memblock_clear_region_flags(struct memblock_region *r,
+ unsigned long flags)
+{
+ r->flags &= ~flags;
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index 877973e..5bea331 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -683,6 +683,58 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
}
/**
+ * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates the region [@base, @base + @size), and marks it
+ * with the MEMBLOCK_HOTPLUG flag.
+ *
+ * Return 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
+{
+ struct memblock_type *type = &memblock.memory;
+ int i, ret, start_rgn, end_rgn;
+
+ ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+ if (ret)
+ return ret;
+
+ for (i = start_rgn; i < end_rgn; i++)
+ memblock_set_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG);
+
+ memblock_merge_regions(type);
+ return 0;
+}
+
+/**
+ * memblock_clear_hotplug - Clear flag MEMBLOCK_HOTPLUG for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates the region [@base, @base + @size), and clears the
+ * MEMBLOCK_HOTPLUG flag for the isolated regions.
+ *
+ * Return 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
+{
+ struct memblock_type *type = &memblock.memory;
+ int i, ret, start_rgn, end_rgn;
+
+ ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+ if (ret)
+ return ret;
+
+ for (i = start_rgn; i < end_rgn; i++)
+ memblock_clear_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG);
+
+ memblock_merge_regions(type);
+ return 0;
+}
+
+/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
* @nid: node selector, %MAX_NUMNODES for all nodes
--
1.7.1
* [PATCH part2 v2 4/8] memblock: Make memblock_set_node() support different memblock_type
2013-10-12 6:00 [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
` (2 preceding siblings ...)
2013-10-12 6:05 ` [PATCH part2 v2 3/8] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions Zhang Yanfei
@ 2013-10-12 6:06 ` Zhang Yanfei
2013-10-12 6:07 ` [PATCH part2 v2 5/8] acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock Zhang Yanfei
` (4 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Zhang Yanfei @ 2013-10-12 6:06 UTC (permalink / raw)
To: Andrew Morton, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
Minchan Kim, mina86@mina86.com, gong.chen@linux.intel.com,
Vasilis Liaskovitis, lwoodman@redhat.com, Rik van Riel,
jweiner@redhat.com, Prarit Bhargava
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Linux MM,
ACPI Devel Maling List, Chen Tang, Tang Chen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
arch/metag/mm/init.c | 3 ++-
arch/metag/mm/numa.c | 3 ++-
arch/microblaze/mm/init.c | 3 ++-
arch/powerpc/mm/mem.c | 2 +-
arch/powerpc/mm/numa.c | 8 +++++---
arch/sh/kernel/setup.c | 4 ++--
arch/sparc/mm/init_64.c | 5 +++--
arch/x86/mm/init_32.c | 2 +-
arch/x86/mm/init_64.c | 2 +-
arch/x86/mm/numa.c | 6 ++++--
include/linux/memblock.h | 3 ++-
mm/memblock.c | 6 +++---
12 files changed, 28 insertions(+), 19 deletions(-)
diff --git a/arch/metag/mm/init.c b/arch/metag/mm/init.c
index 1239195..d94a58f 100644
--- a/arch/metag/mm/init.c
+++ b/arch/metag/mm/init.c
@@ -205,7 +205,8 @@ static void __init do_init_bootmem(void)
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);
memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), 0);
+ PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, 0);
}
/* All of system RAM sits in node 0 for the non-NUMA case */
diff --git a/arch/metag/mm/numa.c b/arch/metag/mm/numa.c
index 9ae578c..229407f 100644
--- a/arch/metag/mm/numa.c
+++ b/arch/metag/mm/numa.c
@@ -42,7 +42,8 @@ void __init setup_bootmem_node(int nid, unsigned long start, unsigned long end)
memblock_add(start, end - start);
memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), nid);
+ PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, nid);
/* Node-local pgdat */
pgdat_paddr = memblock_alloc_base(sizeof(struct pglist_data),
diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
index 74c7bcc..89077d3 100644
--- a/arch/microblaze/mm/init.c
+++ b/arch/microblaze/mm/init.c
@@ -192,7 +192,8 @@ void __init setup_memory(void)
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);
memblock_set_node(start_pfn << PAGE_SHIFT,
- (end_pfn - start_pfn) << PAGE_SHIFT, 0);
+ (end_pfn - start_pfn) << PAGE_SHIFT,
+ &memblock.memory, 0);
}
/* free bootmem is whole main memory */
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 3fa93dc..231b785 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -209,7 +209,7 @@ void __init do_init_bootmem(void)
/* Place all memblock_regions in the same node and merge contiguous
* memblock_regions
*/
- memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+ memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock_memory, 0);
/* Add all physical memory to the bootmem map, mark each area
* present.
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c916127..f82f2ea 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -670,7 +670,8 @@ static void __init parse_drconf_memory(struct device_node *memory)
node_set_online(nid);
sz = numa_enforce_memory_limit(base, size);
if (sz)
- memblock_set_node(base, sz, nid);
+ memblock_set_node(base, sz,
+ &memblock.memory, nid);
} while (--ranges);
}
}
@@ -760,7 +761,7 @@ new_range:
continue;
}
- memblock_set_node(start, size, nid);
+ memblock_set_node(start, size, &memblock.memory, nid);
if (--ranges)
goto new_range;
@@ -797,7 +798,8 @@ static void __init setup_nonnuma(void)
fake_numa_create_new_node(end_pfn, &nid);
memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), nid);
+ PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, nid);
node_set_online(nid);
}
}
diff --git a/arch/sh/kernel/setup.c b/arch/sh/kernel/setup.c
index 1cf90e9..de19cfa 100644
--- a/arch/sh/kernel/setup.c
+++ b/arch/sh/kernel/setup.c
@@ -230,8 +230,8 @@ void __init __add_active_range(unsigned int nid, unsigned long start_pfn,
pmb_bolt_mapping((unsigned long)__va(start), start, end - start,
PAGE_KERNEL);
- memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), nid);
+ memblock_set_node(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, nid);
}
void __init __weak plat_early_device_setup(void)
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index ed82eda..31beb53 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1021,7 +1021,8 @@ static void __init add_node_ranges(void)
"start[%lx] end[%lx]\n",
nid, start, this_end);
- memblock_set_node(start, this_end - start, nid);
+ memblock_set_node(start, this_end - start,
+ &memblock.memory, nid);
start = this_end;
}
}
@@ -1325,7 +1326,7 @@ static void __init bootmem_init_nonnuma(void)
(top_of_ram - total_ram) >> 20);
init_node_masks_nonnuma();
- memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+ memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
allocate_node_data(0);
node_set_online(0);
}
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 4287f1f..d9685b6 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -665,7 +665,7 @@ void __init initmem_init(void)
high_memory = (void *) __va(max_low_pfn * PAGE_SIZE - 1) + 1;
#endif
- memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+ memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
sparse_memory_present_with_active_regions(0);
#ifdef CONFIG_FLATMEM
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 104d56a..f35c66c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -643,7 +643,7 @@ kernel_physical_mapping_init(unsigned long start,
#ifndef CONFIG_NUMA
void __init initmem_init(void)
{
- memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+ memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
}
#endif
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e17db5d..ab69e1d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -492,7 +492,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ memblock_set_node(mb->start, mb->end - mb->start,
+ &memblock.memory, mb->nid);
}
/*
@@ -566,7 +567,8 @@ static int __init numa_init(int (*init_func)(void))
nodes_clear(node_possible_map);
nodes_clear(node_online_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
+ WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
+ MAX_NUMNODES));
numa_reset_distance();
ret = init_func();
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b788faa..97480d3 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -140,7 +140,8 @@ static inline void memblock_clear_region_flags(struct memblock_region *r,
}
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
+int memblock_set_node(phys_addr_t base, phys_addr_t size,
+ struct memblock_type *type, int nid);
static inline void memblock_set_region_node(struct memblock_region *r, int nid)
{
diff --git a/mm/memblock.c b/mm/memblock.c
index 5bea331..7de9c76 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -910,18 +910,18 @@ void __init_memblock __next_mem_pfn_range(int *idx, int nid,
* memblock_set_node - set node ID on memblock regions
* @base: base of area to set node ID for
* @size: size of area to set node ID for
+ * @type: memblock type to set node ID for
* @nid: node ID to set
*
- * Set the nid of memblock memory regions in [@base,@base+@size) to @nid.
+ * Set the nid of memblock @type regions in [@base,@base+@size) to @nid.
* Regions which cross the area boundaries are split as necessary.
*
* RETURNS:
* 0 on success, -errno on failure.
*/
int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
- int nid)
+ struct memblock_type *type, int nid)
{
- struct memblock_type *type = &memblock.memory;
int start_rgn, end_rgn;
int i, ret;
--
1.7.1
* [PATCH part2 v2 5/8] acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock
2013-10-12 6:00 [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
` (3 preceding siblings ...)
2013-10-12 6:06 ` [PATCH part2 v2 4/8] memblock: Make memblock_set_node() support different memblock_type Zhang Yanfei
@ 2013-10-12 6:07 ` Zhang Yanfei
2013-10-12 6:08 ` [PATCH part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable Zhang Yanfei
` (3 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Zhang Yanfei @ 2013-10-12 6:07 UTC (permalink / raw)
To: Andrew Morton, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
Minchan Kim, mina86@mina86.com, gong.chen@linux.intel.com,
Vasilis Liaskovitis, lwoodman@redhat.com, Rik van Riel,
jweiner@redhat.com, Prarit Bhargava
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Linux MM,
ACPI Devel Maling List, Chen Tang, Tang Chen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
When parsing the SRAT, we know which memory areas are hotpluggable. So we
invoke memblock_mark_hotplug(), introduced by the previous patch, to mark
hotpluggable memory in memblock.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
arch/x86/mm/numa.c | 2 ++
arch/x86/mm/srat.c | 5 +++++
2 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ab69e1d..408c02d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -569,6 +569,8 @@ static int __init numa_init(int (*init_func)(void))
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
MAX_NUMNODES));
+ /* In case SRAT parsing failed. */
+ WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
numa_reset_distance();
ret = init_func();
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 266ca91..ca7c484 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -181,6 +181,11 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
(unsigned long long) start, (unsigned long long) end - 1,
hotpluggable ? " hotplug" : "");
+ /* Mark hotplug range in memblock. */
+ if (hotpluggable && memblock_mark_hotplug(start, ma->length))
+ pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
+ (unsigned long long) start, (unsigned long long) end - 1);
+
return 0;
out_err_bad_srat:
bad_srat();
--
1.7.1
* [PATCH part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable
2013-10-12 6:00 [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
` (4 preceding siblings ...)
2013-10-12 6:07 ` [PATCH part2 v2 5/8] acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock Zhang Yanfei
@ 2013-10-12 6:08 ` Zhang Yanfei
2013-10-12 6:09 ` [PATCH part2 v2 7/8] memblock, mem_hotplug: Make memblock skip hotpluggable regions if needed Zhang Yanfei
` (2 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Zhang Yanfei @ 2013-10-12 6:08 UTC (permalink / raw)
To: Andrew Morton, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
Minchan Kim, mina86@mina86.com, gong.chen@linux.intel.com,
Vasilis Liaskovitis, lwoodman@redhat.com, Rik van Riel,
jweiner@redhat.com, Prarit Bhargava
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Linux MM,
ACPI Devel Maling List, Chen Tang, Tang Chen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
Early in boot, the kernel has to use some memory, for example to
load the kernel image. We cannot prevent this anyway. So any
node the kernel resides in should be un-hotpluggable.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
arch/x86/mm/numa.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 44 insertions(+), 0 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 408c02d..f26b16f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -494,6 +494,14 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start,
&memblock.memory, mb->nid);
+
+ /*
+ * At this time, all memory regions reserved by memblock are
+ * used by the kernel. Setting the nid in memblock.reserved
+ * marks all the nodes the kernel resides in.
+ */
+ memblock_set_node(mb->start, mb->end - mb->start,
+ &memblock.reserved, mb->nid);
}
/*
@@ -555,6 +563,30 @@ static void __init numa_init_array(void)
}
}
+static void __init numa_clear_kernel_node_hotplug(void)
+{
+ int i, nid;
+ nodemask_t numa_kernel_nodes = NODE_MASK_NONE;
+ unsigned long start, end;
+ struct memblock_type *type = &memblock.reserved;
+
+ /* Mark all kernel nodes. */
+ for (i = 0; i < type->cnt; i++)
+ node_set(type->regions[i].nid, numa_kernel_nodes);
+
+ /* Clear MEMBLOCK_HOTPLUG flag for memory in kernel nodes. */
+ for (i = 0; i < numa_meminfo.nr_blks; i++) {
+ nid = numa_meminfo.blk[i].nid;
+ if (!node_isset(nid, numa_kernel_nodes))
+ continue;
+
+ start = numa_meminfo.blk[i].start;
+ end = numa_meminfo.blk[i].end;
+
+ memblock_clear_hotplug(start, end - start);
+ }
+}
+
static int __init numa_init(int (*init_func)(void))
{
int i;
@@ -569,6 +601,8 @@ static int __init numa_init(int (*init_func)(void))
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
MAX_NUMNODES));
+ WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
+ MAX_NUMNODES));
/* In case that parsing SRAT failed. */
WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
numa_reset_distance();
@@ -606,6 +640,16 @@ static int __init numa_init(int (*init_func)(void))
numa_clear_node(i);
}
numa_init_array();
+
+ /*
+ * Early in boot, the kernel has to use some memory, e.g. to
+ * load the kernel image. We cannot prevent this anyway, so
+ * any node the kernel resides in should be un-hotpluggable.
+ *
+ * And by the time we get here, numa_init() won't fail.
+ */
+ numa_clear_kernel_node_hotplug();
+
return 0;
}
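The new numa_clear_kernel_node_hotplug() above can be modeled in a few lines of user-space C. This is a simplified sketch with illustrative names: a plain bitmask stands in for nodemask_t, and flat arrays stand in for memblock.reserved and numa_meminfo. Note the node mask must start out empty:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLKS 8

/* Toy per-node memory block; names are illustrative only. */
struct blk { int nid; bool hotplug; };

struct blk meminfo[MAX_BLKS];   /* stands in for numa_meminfo */
int reserved_nids[MAX_BLKS];    /* nids of kernel-reserved regions */
int nr_mem, nr_reserved;

static void clear_kernel_node_hotplug(void)
{
	unsigned long kernel_nodes = 0;  /* empty mask, like NODE_MASK_NONE */
	int i;

	/* Mark every node that holds a kernel (reserved) region. */
	for (i = 0; i < nr_reserved; i++)
		kernel_nodes |= 1UL << reserved_nids[i];

	/* Clear the hotplug flag for all memory on those nodes. */
	for (i = 0; i < nr_mem; i++)
		if (kernel_nodes & (1UL << meminfo[i].nid))
			meminfo[i].hotplug = false;
}
```

The two-pass shape matches the patch: first collect kernel nodes from the reserved list, then sweep the memory map clearing the flag.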
--
1.7.1
* [PATCH part2 v2 7/8] memblock, mem_hotplug: Make memblock skip hotpluggable regions if needed
2013-10-12 6:00 [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
` (5 preceding siblings ...)
2013-10-12 6:08 ` [PATCH part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable Zhang Yanfei
@ 2013-10-12 6:09 ` Zhang Yanfei
2013-10-12 6:09 ` [PATCH part2 v2 8/8] x86, numa, acpi, memory-hotplug: Make movable_node have higher priority Zhang Yanfei
2013-11-13 13:50 ` [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
8 siblings, 0 replies; 11+ messages in thread
From: Zhang Yanfei @ 2013-10-12 6:09 UTC (permalink / raw)
To: Andrew Morton, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
Minchan Kim, mina86@mina86.com, gong.chen@linux.intel.com,
Vasilis Liaskovitis, lwoodman@redhat.com, Rik van Riel,
jweiner@redhat.com, Prarit Bhargava
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Linux MM,
ACPI Devel Maling List, Chen Tang, Tang Chen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
The Linux kernel cannot migrate pages that it uses itself. As a result,
hotpluggable memory used by the kernel cannot be hot-removed. To solve this
problem, the basic idea is to prevent memblock from allocating hotpluggable
memory for the kernel early during boot, and to arrange all hotpluggable
memory reported in the ACPI SRAT (System Resource Affinity Table) as
ZONE_MOVABLE when initializing zones.
In the previous patches, we marked hotpluggable memory regions with the
MEMBLOCK_HOTPLUG flag in memblock.memory.
In this patch, we make memblock skip these hotpluggable memory regions in
the default top-down allocation function if the movable_node boot option is
specified.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
include/linux/memblock.h | 18 ++++++++++++++++++
mm/memblock.c | 12 ++++++++++++
mm/memory_hotplug.c | 1 +
3 files changed, 31 insertions(+), 0 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 97480d3..bfc1dba 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -47,6 +47,10 @@ struct memblock {
extern struct memblock memblock;
extern int memblock_debug;
+#ifdef CONFIG_MOVABLE_NODE
+/* If movable_node boot option specified */
+extern bool movable_node_enabled;
+#endif /* CONFIG_MOVABLE_NODE */
#define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
@@ -65,6 +69,20 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);
int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
+#ifdef CONFIG_MOVABLE_NODE
+static inline bool memblock_is_hotpluggable(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_HOTPLUG;
+}
+
+static inline bool movable_node_is_enabled(void)
+{
+ return movable_node_enabled;
+}
+#else
+static inline bool memblock_is_hotpluggable(struct memblock_region *m) { return false; }
+static inline bool movable_node_is_enabled(void) { return false; }
+#endif
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 7de9c76..7f69012 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -39,6 +39,9 @@ struct memblock memblock __initdata_memblock = {
};
int memblock_debug __initdata_memblock;
+#ifdef CONFIG_MOVABLE_NODE
+bool movable_node_enabled __initdata_memblock = false;
+#endif
static int memblock_can_resize __initdata_memblock;
static int memblock_memory_in_slab __initdata_memblock = 0;
static int memblock_reserved_in_slab __initdata_memblock = 0;
@@ -819,6 +822,11 @@ void __init_memblock __next_free_mem_range(u64 *idx, int nid,
* @out_nid: ptr to int for nid of the range, can be %NULL
*
* Reverse of __next_free_mem_range().
+ *
+ * The kernel cannot migrate pages it uses itself, so hotpluggable memory
+ * used by the kernel cannot be hot-removed. This function therefore skips
+ * hotpluggable regions, if needed, when allocating memory for the kernel.
*/
void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
phys_addr_t *out_start,
@@ -843,6 +851,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
continue;
+ /* skip hotpluggable memory regions if needed */
+ if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
+ continue;
+
/* scan areas before each reservation for intersection */
for ( ; ri >= 0; ri--) {
struct memblock_region *r = &rsv->regions[ri];
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8c91d0a..729a2d8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1436,6 +1436,7 @@ static int __init cmdline_parse_movable_node(char *p)
* the kernel away from hotpluggable memory.
*/
memblock_set_bottom_up(true);
+ movable_node_enabled = true;
#else
pr_warn("movable_node option not supported\n");
#endif
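The skip added to __next_free_mem_range_rev() above can be illustrated with a toy top-down allocator in user-space C. Names are illustrative assumptions; real memblock iterates free ranges between reservations rather than whole regions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_R 8

/* Toy memory region; stand-in for a memblock.memory entry. */
struct reg { uint64_t base, size; bool hotplug; };

struct reg mem[MAX_R];   /* sorted by ascending base */
int nr_mem_r;
bool movable_node_on;    /* models movable_node_is_enabled() */

/* Pick the highest region usable for kernel allocations, skipping
 * hotpluggable regions when movable_node is enabled.
 * Returns the base of the chosen region, or 0 if none qualifies. */
static uint64_t pick_top_down(void)
{
	int i;

	for (i = nr_mem_r - 1; i >= 0; i--) {
		/* the same "continue" check the patch adds */
		if (movable_node_on && mem[i].hotplug)
			continue;
		return mem[i].base;
	}
	return 0;
}
```

With movable_node off the highest region wins as before; with it on, flagged regions are passed over so early allocations land in non-hotpluggable memory.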
--
1.7.1
* [PATCH part2 v2 8/8] x86, numa, acpi, memory-hotplug: Make movable_node have higher priority
2013-10-12 6:00 [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
` (6 preceding siblings ...)
2013-10-12 6:09 ` [PATCH part2 v2 7/8] memblock, mem_hotplug: Make memblock skip hotpluggable regions if needed Zhang Yanfei
@ 2013-10-12 6:09 ` Zhang Yanfei
2013-11-13 13:50 ` [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
8 siblings, 0 replies; 11+ messages in thread
From: Zhang Yanfei @ 2013-10-12 6:09 UTC (permalink / raw)
To: Andrew Morton, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
Minchan Kim, mina86@mina86.com, gong.chen@linux.intel.com,
Vasilis Liaskovitis, lwoodman@redhat.com, Rik van Riel,
jweiner@redhat.com, Prarit Bhargava
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Linux MM,
ACPI Devel Maling List, Chen Tang, Tang Chen, Zhang Yanfei
From: Tang Chen <tangchen@cn.fujitsu.com>
If users specify the original movablecore=nn@ss boot option, the kernel will
arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is
similar, except that it specifies ZONE_NORMAL ranges.
Now, if users specify "movable_node" on the kernel command line, the kernel
will arrange hotpluggable memory in SRAT as ZONE_MOVABLE, and all the other
movablecore=nn@ss and kernelcore=nn@ss options are ignored. Those who don't
want this can simply specify nothing, and the kernel will act as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
---
mm/page_alloc.c | 28 ++++++++++++++++++++++++++--
1 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..768ea0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5021,9 +5021,33 @@ static void __init find_zone_movable_pfns_for_nodes(void)
nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+ struct memblock_type *type = &memblock.memory;
+
+ /* Need to find movable_zone earlier when movable_node is specified. */
+ find_usable_zone_for_movable();
+
+ /*
+ * If movable_node is specified, ignore kernelcore and movablecore
+ * options.
+ */
+ if (movable_node_is_enabled()) {
+ for (i = 0; i < type->cnt; i++) {
+ if (!memblock_is_hotpluggable(&type->regions[i]))
+ continue;
+
+ nid = type->regions[i].nid;
+
+ usable_startpfn = PFN_DOWN(type->regions[i].base);
+ zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+ min(usable_startpfn, zone_movable_pfn[nid]) :
+ usable_startpfn;
+ }
+
+ goto out2;
+ }
/*
- * If movablecore was specified, calculate what size of
+ * If movablecore=nn[KMG] was specified, calculate what size of
* kernelcore that corresponds so that memory usable for
* any allocation type is evenly spread. If both kernelcore
* and movablecore are specified, then the value of kernelcore
@@ -5049,7 +5073,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
goto out;
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
- find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
restart:
@@ -5140,6 +5163,7 @@ restart:
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;
+out2:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
for (nid = 0; nid < MAX_NUMNODES; nid++)
zone_movable_pfn[nid] =
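The zone_movable_pfn[] computation above (ZONE_MOVABLE starts at the lowest hotpluggable PFN on each node) can be sketched in user-space C. Names and the PFN_DOWN shift are illustrative assumptions, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 4
#define PFN_DOWN(x) ((x) >> 12)   /* assuming 4K pages */

/* Toy memory region; stand-in for a memblock.memory entry. */
struct mreg { int nid; uint64_t base; bool hotplug; };

uint64_t zone_movable_pfn[MAX_NODES];   /* 0 means "not set yet" */

/* For each node, record the lowest hotpluggable start PFN --
 * the min() logic from the patch. */
static void find_movable_pfns(const struct mreg *regs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		uint64_t pfn;

		if (!regs[i].hotplug)
			continue;

		pfn = PFN_DOWN(regs[i].base);
		if (!zone_movable_pfn[regs[i].nid] ||
		    pfn < zone_movable_pfn[regs[i].nid])
			zone_movable_pfn[regs[i].nid] = pfn;
	}
}
```

Nodes with no hotpluggable memory keep zone_movable_pfn == 0, i.e. they get no ZONE_MOVABLE, just as in the patch's loop.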
--
1.7.1
* Re: [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE
2013-10-12 6:00 [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
` (7 preceding siblings ...)
2013-10-12 6:09 ` [PATCH part2 v2 8/8] x86, numa, acpi, memory-hotplug: Make movable_node have higher priority Zhang Yanfei
@ 2013-11-13 13:50 ` Zhang Yanfei
2013-11-19 9:56 ` Zhang Yanfei
8 siblings, 1 reply; 11+ messages in thread
From: Zhang Yanfei @ 2013-11-13 13:50 UTC (permalink / raw)
To: Zhang Yanfei, Tejun Heo
Cc: Andrew Morton, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Toshi Kani, Wanpeng Li,
Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
Minchan Kim, mina86@mina86.com, gong.chen@linux.intel.com,
Vasilis Liaskovitis, lwoodman@redhat.com, Rik van Riel,
jweiner@redhat.com, Prarit Bhargava, x86@kernel.org,
linux-kernel@vger.kernel.org, Linux MM, ACPI Devel Maling List,
Chen Tang, Tang Chen
Hello guys,
Could anyone help reviewing this part?
The first part has been merged into Linus's tree, and I've tried to apply
this part to today's Linus tree:
commit 42a2d923cc349583ebf6fdd52a7d35e1c2f7e6bd
Merge: 5cbb3d2 75ecab1
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed Nov 13 17:40:34 2013 +0900
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
No conflict, no compiling error and it works well.
Tejun, any comments?
On 10/12/2013 02:00 PM, Zhang Yanfei wrote:
> Hello guys, this is the part2 of our memory hotplug work. This part
> is based on the part1:
> "x86, memblock: Allocate memory near kernel image before SRAT parsed"
> which is base on 3.12-rc4.
>
> You could refer part1 from: https://lkml.org/lkml/2013/10/10/644
>
> Any comments are welcome! Thanks!
>
> [Problem]
>
> The current Linux kernel cannot migrate pages used by the kernel itself
> because of the kernel direct mapping. In Linux kernel space,
> va = pa + PAGE_OFFSET. When the pa is changed, we cannot simply update
> the pagetable and keep the va unmodified. So kernel pages are not
> migratable.
>
> There are also other issues that make kernel pages unmigratable. For
> example, a physical address may be cached somewhere and used later; it
> is not easy to update all such caches.
>
> When doing memory hotplug in Linux, we first migrate all the pages in one
> memory device somewhere else, and then remove the device. But if pages are
> used by the kernel, they are not migratable. As a result, memory used by
> the kernel cannot be hot-removed.
>
> Modifying the kernel direct mapping mechanism is too difficult to do, and
> it may hurt kernel performance and stability. So we use the following
> way to do memory hotplug.
>
>
> [What we are doing]
>
> In Linux, memory in one numa node is divided into several zones. One of the
> zones is ZONE_MOVABLE, which the kernel won't use.
>
> In order to implement memory hotplug in Linux, we are going to arrange all
> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.
>
> To do this, we need ACPI's help.
>
>
> [How we do this]
>
> In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info.
> The memory affinities in SRAT record every memory range in the system,
> along with flags specifying whether the memory range is hotpluggable.
> (Please refer to ACPI spec 5.0 5.2.16)
>
> With the help of SRAT, we have to do the following two things to achieve our
> goal:
>
> 1. When doing memory hot-add, allow users to arrange hotpluggable memory
> as ZONE_MOVABLE.
> (This has been done by the MOVABLE_NODE functionality in Linux.)
>
> 2. When the system is booting, prevent the bootmem allocator from
> allocating hotpluggable memory for the kernel before memory
> initialization finishes.
> (This is what we are going to do. See below.)
>
>
> [About this patch-set]
>
> In the previous part's patches, we made the kernel allocate memory near
> the kernel image before SRAT is parsed, to avoid allocating hotpluggable
> memory for the kernel. So this patch-set does the following things:
>
> 1. Improve memblock to support flags, which are used to indicate different
> memory type.
>
> 2. Mark all hotpluggable memory in memblock.memory[].
>
> 3. Make the default memblock allocator skip hotpluggable memory.
>
> 4. Improve the "movable_node" boot option to have higher priority than
> the movablecore and kernelcore boot options.
>
> Change log v1 -> v2:
> 1. Rebase this part on the v7 version of part1
> 2. Fix bug: If movable_node boot option not specified, memblock still
> checks hotpluggable memory when allocating memory.
>
> Tang Chen (7):
> memblock, numa: Introduce flag into memblock
> memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark
> hotpluggable regions
> memblock: Make memblock_set_node() support different memblock_type
> acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock
> acpi, numa, mem_hotplug: Mark all nodes the kernel resides
> un-hotpluggable
> memblock, mem_hotplug: Make memblock skip hotpluggable regions if
> needed
> x86, numa, acpi, memory-hotplug: Make movable_node have higher
> priority
>
> Yasuaki Ishimatsu (1):
> x86: get pg_data_t's memory from other node
>
> arch/metag/mm/init.c | 3 +-
> arch/metag/mm/numa.c | 3 +-
> arch/microblaze/mm/init.c | 3 +-
> arch/powerpc/mm/mem.c | 2 +-
> arch/powerpc/mm/numa.c | 8 ++-
> arch/sh/kernel/setup.c | 4 +-
> arch/sparc/mm/init_64.c | 5 +-
> arch/x86/mm/init_32.c | 2 +-
> arch/x86/mm/init_64.c | 2 +-
> arch/x86/mm/numa.c | 63 +++++++++++++++++++++--
> arch/x86/mm/srat.c | 5 ++
> include/linux/memblock.h | 39 ++++++++++++++-
> mm/memblock.c | 123 ++++++++++++++++++++++++++++++++++++++-------
> mm/memory_hotplug.c | 1 +
> mm/page_alloc.c | 28 ++++++++++-
> 15 files changed, 252 insertions(+), 39 deletions(-)
>
--
Thanks.
Zhang Yanfei
* Re: [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE
2013-11-13 13:50 ` [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
@ 2013-11-19 9:56 ` Zhang Yanfei
0 siblings, 0 replies; 11+ messages in thread
From: Zhang Yanfei @ 2013-11-19 9:56 UTC (permalink / raw)
To: Zhang Yanfei
Cc: Tejun Heo, Andrew Morton, Rafael J . Wysocki, Len Brown,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Toshi Kani,
Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
Minchan Kim, mina86@mina86.com, gong.chen@linux.intel.com,
Vasilis Liaskovitis, lwoodman@redhat.com, Rik van Riel,
jweiner@redhat.com, Prarit Bhargava, x86@kernel.org,
linux-kernel@vger.kernel.org, Linux MM, ACPI Devel Maling List,
Chen Tang, Tang Chen
ping...
On 11/13/2013 09:50 PM, Zhang Yanfei wrote:
> Hello guys,
>
> Could anyone help reviewing this part?
>
> The first part has been merged into linus's tree. And I've tried to apply this part
> to today's linus tree:
>
> commit 42a2d923cc349583ebf6fdd52a7d35e1c2f7e6bd
> Merge: 5cbb3d2 75ecab1
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Wed Nov 13 17:40:34 2013 +0900
>
> Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
>
> No conflict, no compiling error and it works well.
>
> Tejun, any comments?
>
> On 10/12/2013 02:00 PM, Zhang Yanfei wrote:
>> [...]
--
Thanks.
Zhang Yanfei