linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
@ 2013-01-09  9:32 Tang Chen
  2013-01-09  9:32 ` [PATCH v6 01/15] memory-hotplug: try to offline the memory twice to avoid dependence Tang Chen
                   ` (17 more replies)
  0 siblings, 18 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

Here is the physical memory hot-remove patch-set, based on 3.8-rc2.

This patch-set aims to implement physical memory hot-removal.

The patches can free/remove the following things:

  - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
  - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
  - page table of removed memory              : [RFC PATCH 7,8,10/15]
  - node and related sysfs files              : [RFC PATCH 13-15/15]


Existing problem:
If CONFIG_MEMCG is selected, we allocate memory to store page cgroups
when we online pages.

For example: there is a memory device on node 1, covering the address range
[1G, 1.5G). Four new directories, memory8, memory9, memory10 and memory11,
will appear under /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, when we online memory8, the memory that stores
its page cgroups is not provided by this memory device. But when we online
memory9, its page cgroups may be allocated from memory8. So memory8 cannot
be offlined now; the memory blocks should be offlined in the reverse order.

When the memory device is hot-removed, we automatically offline the memory
it provides. But we don't know in which order the memory blocks were onlined,
so offlining memory may fail.

In patch1, we provide a solution which is not good enough:
Iterate twice to offline the memory.
1st iteration: offline every non-primary memory block.
2nd iteration: offline the primary (i.e. first added) memory block.

And a new idea from Wen Congyang <wency@cn.fujitsu.com> is to allocate the
page cgroups from the memory block they describe.

But we are not sure whether that is OK, because there is no existing API for
it, and we would need to move the page_cgroup memory allocation from
MEM_GOING_ONLINE to MEM_ONLINE. It may also interfere with hugepages.
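The two-iteration scheme above can be sketched in userspace Python. This is
only a model of the dependency (not kernel code); the block names and the
page-cgroup placement follow the example given earlier:

```python
# Model: page cgroups for block N may live in an earlier-onlined block M,
# so M cannot be offlined while N is still online.

def try_offline(block, online, backing):
    # A block fails to offline while some *other* online block still
    # stores its page cgroups in it.
    if any(backing.get(b) == block for b in online if b != block):
        return False
    online.discard(block)
    return True

def remove_memory(blocks, online, backing):
    # 1st iteration: offline whatever we can, remember the failures.
    retry = [b for b in blocks if not try_offline(b, online, backing)]
    # 2nd iteration: the dependents are gone now, so the rest succeeds.
    return all(try_offline(b, online, backing) for b in retry)

# memory9..11's page cgroups were allocated from memory8.
blocks = ["memory8", "memory9", "memory10", "memory11"]
backing = {"memory9": "memory8", "memory10": "memory8",
           "memory11": "memory8"}
online = set(blocks)
assert remove_memory(blocks, online, backing)
assert not online   # every block offlined after two passes
```

A single pass in ascending order would abort at memory8; retrying after the
dependent blocks are gone is what makes the removal succeed.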



How to test this patchset?
1. Apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE and
   ACPI_HOTPLUG_MEMORY must be selected.
2. Load the acpi_memhotplug module.
3. Hotplug the memory device (this depends on your hardware).
   You will see the memory device under /sys/bus/acpi/devices/.
   Its name is PNP0C80:XX.
4. Online/offline the pages provided by this memory device.
   Write "online" or "offline" to /sys/devices/system/memory/memoryX/state to
   online/offline the pages provided by this memory device.
5. Hot-remove the memory device.
   You can hot-remove the memory device via the hardware, or by writing 1 to
   /sys/bus/acpi/devices/PNP0C80:XX/eject.
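Step 4 can be scripted. The helper below is our own illustration, not part of
the patchset; the root parameter is only there so the walk can be tried
against a fake sysfs tree, and on a real machine it must run as root against
/sys/devices/system/memory:

```python
import os

def set_memory_blocks_state(state, root="/sys/devices/system/memory"):
    """Write "online" or "offline" to every memoryX/state file under root."""
    changed = []
    for name in sorted(os.listdir(root)):
        state_file = os.path.join(root, name, "state")
        # skip entries that are not memory blocks (e.g. block_size_bytes)
        if not name.startswith("memory") or not os.path.isfile(state_file):
            continue
        with open(state_file, "w") as f:   # needs root on a real system
            f.write(state)
        changed.append(name)
    return changed
```

Offlining a block that the kernel is using will raise an error from the
write; as noted below, that is expected rather than a bug.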


Note: if the memory provided by the memory device is used by the kernel, it
cannot be offlined. This is expected behavior, not a bug.


Changelogs from v5 to v6:
 Patch3: Add some more comments to explain memory hot-remove.
 Patch4: Remove bootmem member in struct firmware_map_entry.
 Patch6: Repeatedly register bootmem pages when using hugepage.
 Patch8: Repeatedly free bootmem pages when using hugepage.
 Patch14: Don't free pgdat when offlining a node, just reset it to 0.
 Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
          one when onlining a node.

Changelogs from v4 to v5:
 Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
         avoid disabling irqs, because we need to flush the TLB when freeing
         pagetables.
 Patch8: new patch, factor out some common APIs used to free direct mapping
         and vmemmap pagetables.
 Patch9: free direct mapping pagetables on x86_64 arch.
 Patch10: free vmemmap pagetables.
 Patch11: since freeing memmap with vmemmap has been implemented, the
          CONFIG_SPARSEMEM_VMEMMAP config guard around __remove_section() is
          no longer needed.
 Patch13: no need to modify acpi_memory_disable_device() since it was removed,
          and add a nid parameter when calling remove_memory().

Changelogs from v3 to v4:
 Patch7: remove unused code.
 Patch8: fix the nr_pages that is passed to free_map_bootmem().

Changelogs from v2 to v3:
 Patch9: call sync_global_pgds() if the pgd is changed.
 Patch10: fix a problem in the patch.

Changelogs from v1 to v2:
 Patch1: new patch, offline memory twice. 1st iteration: offline every
         non-primary memory block. 2nd iteration: offline the primary
         (i.e. first added) memory block.

 Patch3: new patch, no logical change, just remove redundant code.

 Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
         after the pagetable is changed.

 Patch12: new patch, free node_data when a node is offlined.


Tang Chen (6):
  memory-hotplug: move pgdat_resize_lock into
    sparse_remove_one_section()
  memory-hotplug: remove page table of x86_64 architecture
  memory-hotplug: remove memmap of sparse-vmemmap
  memory-hotplug: Integrated __remove_section() of
    CONFIG_SPARSEMEM_VMEMMAP.
  memory-hotplug: remove sysfs file of node
  memory-hotplug: Do not allocate pgdat if it was not freed when
    offline.

Wen Congyang (5):
  memory-hotplug: try to offline the memory twice to avoid dependence
  memory-hotplug: remove redundant codes
  memory-hotplug: introduce new function arch_remove_memory() for
    removing page table depends on architecture
  memory-hotplug: Common APIs to support page tables hot-remove
  memory-hotplug: free node_data when a node is offlined

Yasuaki Ishimatsu (4):
  memory-hotplug: check whether all memory blocks are offlined or not
    when removing memory
  memory-hotplug: remove /sys/firmware/memmap/X sysfs
  memory-hotplug: implement register_page_bootmem_info_section of
    sparse-vmemmap
  memory-hotplug: memory_hotplug: clear zone when removing the memory

 arch/arm64/mm/mmu.c                  |    3 +
 arch/ia64/mm/discontig.c             |   10 +
 arch/ia64/mm/init.c                  |   18 ++
 arch/powerpc/mm/init_64.c            |   10 +
 arch/powerpc/mm/mem.c                |   12 +
 arch/s390/mm/init.c                  |   12 +
 arch/s390/mm/vmem.c                  |   10 +
 arch/sh/mm/init.c                    |   17 ++
 arch/sparc/mm/init_64.c              |   10 +
 arch/tile/mm/init.c                  |    8 +
 arch/x86/include/asm/pgtable_types.h |    1 +
 arch/x86/mm/init_32.c                |   12 +
 arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
 arch/x86/mm/pageattr.c               |   47 ++--
 drivers/acpi/acpi_memhotplug.c       |    8 +-
 drivers/base/memory.c                |    6 +
 drivers/firmware/memmap.c            |   96 +++++++-
 include/linux/bootmem.h              |    1 +
 include/linux/firmware-map.h         |    6 +
 include/linux/memory_hotplug.h       |   15 +-
 include/linux/mm.h                   |    4 +-
 mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
 mm/sparse.c                          |    8 +-
 23 files changed, 1094 insertions(+), 69 deletions(-)


* [PATCH v6 01/15] memory-hotplug: try to offline the memory twice to avoid dependence
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09  9:32 ` [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Tang Chen
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

From: Wen Congyang <wency@cn.fujitsu.com>

Memory can't always be offlined when CONFIG_MEMCG is selected.
For example: there is a memory device on node 1, covering the address range
[1G, 1.5G). Four new directories, memory8, memory9, memory10 and memory11,
will appear under /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, we allocate memory to store page cgroups when
we online pages. When we online memory8, the memory that stores its page
cgroups is not provided by this memory device. But when we online memory9,
its page cgroups may be allocated from memory8. So memory8 cannot be
offlined now; the memory blocks should be offlined in the reverse order.

When the memory device is hot-removed, we automatically offline the memory
it provides. But we don't know in which order the memory blocks were onlined,
so offlining memory may fail. In that case, iterate twice to offline the memory.
1st iteration: offline every non-primary memory block.
2nd iteration: offline the primary (i.e. first added) memory block.

This idea is suggested by KOSAKI Motohiro.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d04ed87..62e04c9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1388,10 +1388,13 @@ int remove_memory(u64 start, u64 size)
 	unsigned long start_pfn, end_pfn;
 	unsigned long pfn, section_nr;
 	int ret;
+	int return_on_error = 0;
+	int retry = 0;
 
 	start_pfn = PFN_DOWN(start);
 	end_pfn = start_pfn + PFN_DOWN(size);
 
+repeat:
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		section_nr = pfn_to_section_nr(pfn);
 		if (!present_section_nr(section_nr))
@@ -1410,14 +1413,23 @@ int remove_memory(u64 start, u64 size)
 
 		ret = offline_memory_block(mem);
 		if (ret) {
-			kobject_put(&mem->dev.kobj);
-			return ret;
+			if (return_on_error) {
+				kobject_put(&mem->dev.kobj);
+				return ret;
+			} else {
+				retry = 1;
+			}
 		}
 	}
 
 	if (mem)
 		kobject_put(&mem->dev.kobj);
 
+	if (retry) {
+		return_on_error = 1;
+		goto repeat;
+	}
+
 	return 0;
 }
 #else
-- 
1.7.1


* [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
  2013-01-09  9:32 ` [PATCH v6 01/15] memory-hotplug: try to offline the memory twice to avoid dependence Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09 23:11   ` Andrew Morton
  2013-01-09  9:32 ` [PATCH v6 03/15] memory-hotplug: remove redundant codes Tang Chen
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

We remove the memory like this:
1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove memory (TODO)
7. unlock memory hotplug

All memory blocks must be offlined before removing memory. But we don't hold
the lock across the whole operation, so we should check whether all memory
blocks are offlined before step 6. Otherwise, the kernel may panic.
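The re-check is a classic check-then-act pattern: the offline loop drops the
lock between blocks, so another thread may online a block again before the
removal step runs. A minimal userspace sketch of the idea (Python, our own
model, not the kernel code) looks like:

```python
import threading

lock = threading.Lock()            # stands in for lock_memory_hotplug()
state = {"memory8": "offline", "memory9": "offline"}

def remove_memory(blocks):
    with lock:
        # steps 5-6: re-verify that *all* blocks are still offline
        # before removing; fail instead of panicking otherwise.
        if any(state[b] != "offline" for b in blocks):
            return -1              # an -EBUSY-like failure
        for b in blocks:
            del state[b]           # the actual removal (step 6)
        return 0

assert remove_memory(["memory8", "memory9"]) == 0
state["memory10"] = "online"       # raced back online before step 5
assert remove_memory(["memory10"]) == -1
```

Re-validating every block's state under the lock closes the window that the
unlocked offline loop leaves open.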

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 drivers/base/memory.c          |    6 +++++
 include/linux/memory_hotplug.h |    1 +
 mm/memory_hotplug.c            |   48 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 987604d..8300a18 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -693,6 +693,12 @@ int offline_memory_block(struct memory_block *mem)
 	return ret;
 }
 
+/* return true if the memory block is offlined, otherwise, return false */
+bool is_memblock_offlined(struct memory_block *mem)
+{
+	return mem->state == MEM_OFFLINE;
+}
+
 /*
  * Initialize the sysfs support for memory devices...
  */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 4a45c4e..8dd0950 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -247,6 +247,7 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int offline_memory_block(struct memory_block *mem);
+extern bool is_memblock_offlined(struct memory_block *mem);
 extern int remove_memory(u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 								int nr_pages);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 62e04c9..5808045 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1430,6 +1430,54 @@ repeat:
 		goto repeat;
 	}
 
+	lock_memory_hotplug();
+
+	/*
+	 * we have offlined all memory blocks like this:
+	 *   1. lock memory hotplug
+	 *   2. offline a memory block
+	 *   3. unlock memory hotplug
+	 *
+	 * repeat step1-3 to offline the memory block. All memory blocks
+	 * must be offlined before removing memory. But we don't hold the
+	 * lock in the whole operation. So we should check whether all
+	 * memory blocks are offlined.
+	 */
+
+	mem = NULL;
+	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		section_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(section_nr))
+			continue;
+
+		section = __nr_to_section(section_nr);
+		/* same memblock? */
+		if (mem)
+			if ((section_nr >= mem->start_section_nr) &&
+			    (section_nr <= mem->end_section_nr))
+				continue;
+
+		mem = find_memory_block_hinted(section, mem);
+		if (!mem)
+			continue;
+
+		ret = is_memblock_offlined(mem);
+		if (!ret) {
+			pr_warn("removing memory fails, because memory "
+				"[%#010llx-%#010llx] is onlined\n",
+				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
+				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
+
+			kobject_put(&mem->dev.kobj);
+			unlock_memory_hotplug();
+			return ret;
+		}
+	}
+
+	if (mem)
+		kobject_put(&mem->dev.kobj);
+	unlock_memory_hotplug();
+
 	return 0;
 }
 #else
-- 
1.7.1


* [PATCH v6 03/15] memory-hotplug: remove redundant codes
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
  2013-01-09  9:32 ` [PATCH v6 01/15] memory-hotplug: try to offline the memory twice to avoid dependence Tang Chen
  2013-01-09  9:32 ` [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09  9:32 ` [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs Tang Chen
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

From: Wen Congyang <wency@cn.fujitsu.com>

Offlining memory blocks and checking whether memory blocks are offlined are
very similar operations. This patch introduces a new function to remove the
redundant code.
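The refactoring amounts to one generic walker plus two small callbacks
replacing the two near-identical section loops. A hedged Python model of the
shape (names mirror the patch, but this is an illustration, not the kernel
code):

```python
def walk_memory_range(blocks, func, arg):
    # call func on each memory block; stop on the first non-zero return
    for mem in blocks:
        ret = func(mem, arg)
        if ret:
            return ret
    return 0

def offline_cb(mem, errors):
    # "always return 0": record a failure and keep walking,
    # so every block gets an offline attempt in the first pass
    if mem.get("busy"):
        errors.append(mem["name"])
    else:
        mem["state"] = "offline"
    return 0

def check_offlined_cb(mem, _arg):
    # non-zero aborts the walk, like is_memblock_offlined_cb()
    return 0 if mem["state"] == "offline" else 1

blocks = [{"name": "memory8", "state": "online"},
          {"name": "memory9", "state": "online"}]
errors = []
assert walk_memory_range(blocks, offline_cb, errors) == 0 and not errors
assert walk_memory_range(blocks, check_offlined_cb, None) == 0
```

The two callbacks differ only in their error policy, which is exactly what
the void *arg and return-value conventions of the walker encode.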

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memory_hotplug.c |  129 ++++++++++++++++++++++++++++++++------------------
 1 files changed, 82 insertions(+), 47 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5808045..69d62eb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1381,20 +1381,26 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
 }
 
-int remove_memory(u64 start, u64 size)
+/**
+ * walk_memory_range - walks through all mem sections in [start_pfn, end_pfn)
+ * @start_pfn: start pfn of the memory range
+ * @end_pfn: end pft of the memory range
+ * @arg: argument passed to func
+ * @func: callback for each memory section walked
+ *
+ * This function walks through all present mem sections in range
+ * [start_pfn, end_pfn) and call func on each mem section.
+ *
+ * Returns the return value of func.
+ */
+static int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
+		void *arg, int (*func)(struct memory_block *, void *))
 {
 	struct memory_block *mem = NULL;
 	struct mem_section *section;
-	unsigned long start_pfn, end_pfn;
 	unsigned long pfn, section_nr;
 	int ret;
-	int return_on_error = 0;
-	int retry = 0;
-
-	start_pfn = PFN_DOWN(start);
-	end_pfn = start_pfn + PFN_DOWN(size);
 
-repeat:
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		section_nr = pfn_to_section_nr(pfn);
 		if (!present_section_nr(section_nr))
@@ -1411,22 +1417,76 @@ repeat:
 		if (!mem)
 			continue;
 
-		ret = offline_memory_block(mem);
+		ret = func(mem, arg);
 		if (ret) {
-			if (return_on_error) {
-				kobject_put(&mem->dev.kobj);
-				return ret;
-			} else {
-				retry = 1;
-			}
+			kobject_put(&mem->dev.kobj);
+			return ret;
 		}
 	}
 
 	if (mem)
 		kobject_put(&mem->dev.kobj);
 
-	if (retry) {
-		return_on_error = 1;
+	return 0;
+}
+
+/**
+ * offline_memory_block_cb - callback function for offlining memory block
+ * @mem: the memory block to be offlined
+ * @arg: buffer to hold error msg
+ *
+ * Always return 0, and put the error msg in arg if any.
+ */
+static int offline_memory_block_cb(struct memory_block *mem, void *arg)
+{
+	int *ret = arg;
+	int error = offline_memory_block(mem);
+
+	if (error != 0 && *ret == 0)
+		*ret = error;
+
+	return 0;
+}
+
+static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
+{
+	int ret = !is_memblock_offlined(mem);
+
+	if (unlikely(ret))
+		pr_warn("removing memory fails, because memory "
+			"[%#010llx-%#010llx] is onlined\n",
+			PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
+			PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1);
+
+	return ret;
+}
+
+int remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn, end_pfn;
+	int ret = 0;
+	int retry = 1;
+
+	start_pfn = PFN_DOWN(start);
+	end_pfn = start_pfn + PFN_DOWN(size);
+
+	/*
+	 * When CONFIG_MEMCG is on, one memory block may be used by other
+	 * blocks to store page cgroup when onlining pages. But we don't know
+	 * in what order pages are onlined. So we iterate twice to offline
+	 * memory:
+	 * 1st iterate: offline every non primary memory block.
+	 * 2nd iterate: offline primary (i.e. first added) memory block.
+	 */
+repeat:
+	walk_memory_range(start_pfn, end_pfn, &ret,
+			  offline_memory_block_cb);
+	if (ret) {
+		if (!retry)
+			return ret;
+
+		retry = 0;
+		ret = 0;
 		goto repeat;
 	}
 
@@ -1444,38 +1504,13 @@ repeat:
 	 * memory blocks are offlined.
 	 */
 
-	mem = NULL;
-	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-		section_nr = pfn_to_section_nr(pfn);
-		if (!present_section_nr(section_nr))
-			continue;
-
-		section = __nr_to_section(section_nr);
-		/* same memblock? */
-		if (mem)
-			if ((section_nr >= mem->start_section_nr) &&
-			    (section_nr <= mem->end_section_nr))
-				continue;
-
-		mem = find_memory_block_hinted(section, mem);
-		if (!mem)
-			continue;
-
-		ret = is_memblock_offlined(mem);
-		if (!ret) {
-			pr_warn("removing memory fails, because memory "
-				"[%#010llx-%#010llx] is onlined\n",
-				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
-				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
-
-			kobject_put(&mem->dev.kobj);
-			unlock_memory_hotplug();
-			return ret;
-		}
+	ret = walk_memory_range(start_pfn, end_pfn, NULL,
+				is_memblock_offlined_cb);
+	if (ret) {
+		unlock_memory_hotplug();
+		return ret;
 	}
 
-	if (mem)
-		kobject_put(&mem->dev.kobj);
 	unlock_memory_hotplug();
 
 	return 0;
-- 
1.7.1


* [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (2 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 03/15] memory-hotplug: remove redundant codes Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09 22:49   ` Andrew Morton
  2013-01-09 23:19   ` Andrew Morton
  2013-01-09  9:32 ` [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Tang Chen
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

When (hot)adding memory to the system, /sys/firmware/memmap/X/{end, start, type}
sysfs files are created. But there is no code to remove these files. This patch
implements the function to remove them.

Note: the code does not free a firmware_map_entry that was allocated from
      bootmem, so the patch introduces a memory leak. But the leaked size is
      very small, and it does not affect the system.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 drivers/firmware/memmap.c    |   96 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/firmware-map.h |    6 +++
 mm/memory_hotplug.c          |    5 ++-
 3 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
index 90723e6..4211da5 100644
--- a/drivers/firmware/memmap.c
+++ b/drivers/firmware/memmap.c
@@ -21,6 +21,7 @@
 #include <linux/types.h>
 #include <linux/bootmem.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 
 /*
  * Data types ------------------------------------------------------------------
@@ -79,7 +80,26 @@ static const struct sysfs_ops memmap_attr_ops = {
 	.show = memmap_attr_show,
 };
 
+
+static inline struct firmware_map_entry *
+to_memmap_entry(struct kobject *kobj)
+{
+	return container_of(kobj, struct firmware_map_entry, kobj);
+}
+
+static void release_firmware_map_entry(struct kobject *kobj)
+{
+	struct firmware_map_entry *entry = to_memmap_entry(kobj);
+
+	if (PageReserved(virt_to_page(entry)))
+		/* There is no way to free memory allocated from bootmem */
+		return;
+
+	kfree(entry);
+}
+
 static struct kobj_type memmap_ktype = {
+	.release	= release_firmware_map_entry,
 	.sysfs_ops	= &memmap_attr_ops,
 	.default_attrs	= def_attrs,
 };
@@ -94,6 +114,7 @@ static struct kobj_type memmap_ktype = {
  * in firmware initialisation code in one single thread of execution.
  */
 static LIST_HEAD(map_entries);
+static DEFINE_SPINLOCK(map_entries_lock);
 
 /**
  * firmware_map_add_entry() - Does the real work to add a firmware memmap entry.
@@ -118,11 +139,25 @@ static int firmware_map_add_entry(u64 start, u64 end,
 	INIT_LIST_HEAD(&entry->list);
 	kobject_init(&entry->kobj, &memmap_ktype);
 
+	spin_lock(&map_entries_lock);
 	list_add_tail(&entry->list, &map_entries);
+	spin_unlock(&map_entries_lock);
 
 	return 0;
 }
 
+/**
+ * firmware_map_remove_entry() - Does the real work to remove a firmware
+ * memmap entry.
+ * @entry: removed entry.
+ **/
+static inline void firmware_map_remove_entry(struct firmware_map_entry *entry)
+{
+	spin_lock(&map_entries_lock);
+	list_del(&entry->list);
+	spin_unlock(&map_entries_lock);
+}
+
 /*
  * Add memmap entry on sysfs
  */
@@ -144,6 +179,35 @@ static int add_sysfs_fw_map_entry(struct firmware_map_entry *entry)
 	return 0;
 }
 
+/*
+ * Remove memmap entry on sysfs
+ */
+static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
+{
+	kobject_put(&entry->kobj);
+}
+
+/*
+ * Search memmap entry
+ */
+
+static struct firmware_map_entry * __meminit
+firmware_map_find_entry(u64 start, u64 end, const char *type)
+{
+	struct firmware_map_entry *entry;
+
+	spin_lock(&map_entries_lock);
+	list_for_each_entry(entry, &map_entries, list)
+		if ((entry->start == start) && (entry->end == end) &&
+		    (!strcmp(entry->type, type))) {
+			spin_unlock(&map_entries_lock);
+			return entry;
+		}
+
+	spin_unlock(&map_entries_lock);
+	return NULL;
+}
+
 /**
  * firmware_map_add_hotplug() - Adds a firmware mapping entry when we do
  * memory hotplug.
@@ -196,6 +260,32 @@ int __init firmware_map_add_early(u64 start, u64 end, const char *type)
 	return firmware_map_add_entry(start, end, type, entry);
 }
 
+/**
+ * firmware_map_remove() - remove a firmware mapping entry
+ * @start: Start of the memory range.
+ * @end:   End of the memory range.
+ * @type:  Type of the memory range.
+ *
+ * removes a firmware mapping entry.
+ *
+ * Returns 0 on success, or -EINVAL if no entry.
+ **/
+int __meminit firmware_map_remove(u64 start, u64 end, const char *type)
+{
+	struct firmware_map_entry *entry;
+
+	entry = firmware_map_find_entry(start, end - 1, type);
+	if (!entry)
+		return -EINVAL;
+
+	firmware_map_remove_entry(entry);
+
+	/* remove the memmap entry */
+	remove_sysfs_fw_map_entry(entry);
+
+	return 0;
+}
+
 /*
  * Sysfs functions -------------------------------------------------------------
  */
@@ -217,8 +307,10 @@ static ssize_t type_show(struct firmware_map_entry *entry, char *buf)
 	return snprintf(buf, PAGE_SIZE, "%s\n", entry->type);
 }
 
-#define to_memmap_attr(_attr) container_of(_attr, struct memmap_attribute, attr)
-#define to_memmap_entry(obj) container_of(obj, struct firmware_map_entry, kobj)
+static inline struct memmap_attribute *to_memmap_attr(struct attribute *attr)
+{
+	return container_of(attr, struct memmap_attribute, attr);
+}
 
 static ssize_t memmap_attr_show(struct kobject *kobj,
 				struct attribute *attr, char *buf)
diff --git a/include/linux/firmware-map.h b/include/linux/firmware-map.h
index 43fe52f..71d4fa7 100644
--- a/include/linux/firmware-map.h
+++ b/include/linux/firmware-map.h
@@ -25,6 +25,7 @@
 
 int firmware_map_add_early(u64 start, u64 end, const char *type);
 int firmware_map_add_hotplug(u64 start, u64 end, const char *type);
+int firmware_map_remove(u64 start, u64 end, const char *type);
 
 #else /* CONFIG_FIRMWARE_MEMMAP */
 
@@ -38,6 +39,11 @@ static inline int firmware_map_add_hotplug(u64 start, u64 end, const char *type)
 	return 0;
 }
 
+static inline int firmware_map_remove(u64 start, u64 end, const char *type)
+{
+	return 0;
+}
+
 #endif /* CONFIG_FIRMWARE_MEMMAP */
 
 #endif /* _LINUX_FIRMWARE_MAP_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 69d62eb..9fd5904 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1461,7 +1461,7 @@ static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return ret;
 }
 
-int remove_memory(u64 start, u64 size)
+int __ref remove_memory(u64 start, u64 size)
 {
 	unsigned long start_pfn, end_pfn;
 	int ret = 0;
@@ -1511,6 +1511,9 @@ repeat:
 		return ret;
 	}
 
+	/* remove memmap entry */
+	firmware_map_remove(start, start + size, "System RAM");
+
 	unlock_memory_hotplug();
 
 	return 0;
-- 
1.7.1


* [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (3 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09 22:50   ` Andrew Morton
  2013-01-09  9:32 ` [PATCH v6 06/15] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap Tang Chen
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

From: Wen Congyang <wency@cn.fujitsu.com>

To remove memory, we need to remove its page tables. But how to do that is
architecture-dependent, so this patch introduces arch_remove_memory() for
removing page tables. For now it only calls __remove_pages().

Note: __remove_pages() is not implemented for some architectures
      (I don't know how to implement it for s390).
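The shape of the hook is a per-architecture dispatch in which unsupported
architectures simply report failure. A userspace Python sketch of that idea
(our own model; the handler table and error value are illustrative, not
kernel API):

```python
EBUSY = 16

def x86_64_remove(start, size):
    # would tear down page tables and call __remove_pages() here
    return 0

def s390_remove(start, size):
    # no hardware/firmware hot-remove interface on s390
    return -EBUSY

ARCH_REMOVE = {"x86_64": x86_64_remove, "s390": s390_remove}

def remove_memory(arch, start, size):
    # the generic code delegates the arch-specific teardown
    return ARCH_REMOVE[arch](start, size)

assert remove_memory("x86_64", 1 << 30, 512 << 20) == 0
assert remove_memory("s390", 1 << 30, 512 << 20) == -EBUSY
```

In the kernel the dispatch is done at link time (each arch provides its own
arch_remove_memory()), not through a table, but the contract is the same.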

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 arch/ia64/mm/init.c            |   18 ++++++++++++++++++
 arch/powerpc/mm/mem.c          |   12 ++++++++++++
 arch/s390/mm/init.c            |   12 ++++++++++++
 arch/sh/mm/init.c              |   17 +++++++++++++++++
 arch/tile/mm/init.c            |    8 ++++++++
 arch/x86/mm/init_32.c          |   12 ++++++++++++
 arch/x86/mm/init_64.c          |   15 +++++++++++++++
 include/linux/memory_hotplug.h |    1 +
 mm/memory_hotplug.c            |    2 ++
 9 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index b755ea9..20bc967 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -688,6 +688,24 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
 	return ret;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	if (ret)
+		pr_warn("%s: Problem encountered in __remove_pages() as"
+			" ret=%d\n", __func__,  ret);
+
+	return ret;
+}
+#endif
 #endif
 
 /*
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0dba506..09c6451 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -133,6 +133,18 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 /*
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index ae672f4..49ce6bb 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -228,4 +228,16 @@ int arch_add_memory(int nid, u64 start, u64 size)
 		vmem_remove_mapping(start, size);
 	return rc;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	/*
+	 * There is no hardware or firmware interface which could trigger a
+	 * hot memory remove on s390. So there is nothing that needs to be
+	 * implemented.
+	 */
+	return -EBUSY;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 82cc576..1057940 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -558,4 +558,21 @@ int memory_add_physaddr_to_nid(u64 addr)
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	if (unlikely(ret))
+		pr_warn("%s: Failed, __remove_pages() == %d\n", __func__,
+			ret);
+
+	return ret;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index ef29d6c..2749515 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -935,6 +935,14 @@ int remove_memory(u64 start, u64 size)
 {
 	return -EINVAL;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	/* TODO */
+	return -EBUSY;
+}
+#endif
 #endif
 
 struct kmem_cache *pgd_cache;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 745d66b..3166e78 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -836,6 +836,18 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
 #endif
 
 /*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e779e0b..f78509c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -682,6 +682,21 @@ int arch_add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int __ref arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	WARN_ON_ONCE(ret);
+
+	return ret;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static struct kcore_list kcore_vsyscall;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8dd0950..31a563b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -96,6 +96,7 @@ extern void __online_page_free(struct page *page);
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
+extern int arch_remove_memory(u64 start, u64 size);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /* reasonably generic interface to expand the physical pages in a zone  */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9fd5904..f6724c2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1514,6 +1514,8 @@ repeat:
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
+	arch_remove_memory(start, size);
+
 	unlock_memory_hotplug();
 
 	return 0;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v6 06/15] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (4 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09  9:32 ` [PATCH v6 07/15] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section() Tang Chen
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

To remove a memmap region of sparse-vmemmap that was allocated from bootmem,
the region's pages need to be registered with get_page_bootmem(). So this
patch walks the pages of the virtual mapping and registers them with
get_page_bootmem().

Note: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390,
and sparc.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Wu Jianguo <wujianguo@huawei.com>
---
 arch/ia64/mm/discontig.c       |    6 ++++
 arch/powerpc/mm/init_64.c      |    6 ++++
 arch/s390/mm/vmem.c            |    6 ++++
 arch/sparc/mm/init_64.c        |    6 ++++
 arch/x86/mm/init_64.c          |   58 ++++++++++++++++++++++++++++++++++++++++
 include/linux/memory_hotplug.h |   11 +------
 include/linux/mm.h             |    3 +-
 mm/memory_hotplug.c            |   33 ++++++++++++++++++++---
 8 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index c641333..33943db 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -822,4 +822,10 @@ int __meminit vmemmap_populate(struct page *start_page,
 {
 	return vmemmap_populate_basepages(start_page, size, node);
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
 #endif
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..6466440 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,11 @@ int __meminit vmemmap_populate(struct page *start_page,
 
 	return 0;
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 6ed1426..2c14bc2 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -272,6 +272,12 @@ out:
 	return ret;
 }
 
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
+
 /*
  * Add memory segment to the segment list if it doesn't overlap with
  * an already present segment.
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index c3b7242..1f30db3 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2231,6 +2231,12 @@ void __meminit vmemmap_populate_print_last(void)
 		node_start = 0;
 	}
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
 static void prot_init_common(unsigned long page_none,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index f78509c..9ac1723 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1000,6 +1000,64 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
 	return 0;
 }
 
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	unsigned long addr = (unsigned long)start_page;
+	unsigned long end = (unsigned long)(start_page + size);
+	unsigned long next;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	unsigned int nr_pages;
+	struct page *page;
+
+	for (; addr < end; addr = next) {
+		pte_t *pte = NULL;
+
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = (addr + PAGE_SIZE) & PAGE_MASK;
+			continue;
+		}
+		get_page_bootmem(section_nr, pgd_page(*pgd), MIX_SECTION_INFO);
+
+		pud = pud_offset(pgd, addr);
+		if (pud_none(*pud)) {
+			next = (addr + PAGE_SIZE) & PAGE_MASK;
+			continue;
+		}
+		get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
+
+		if (!cpu_has_pse) {
+			next = (addr + PAGE_SIZE) & PAGE_MASK;
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd))
+				continue;
+			get_page_bootmem(section_nr, pmd_page(*pmd),
+					 MIX_SECTION_INFO);
+
+			pte = pte_offset_kernel(pmd, addr);
+			if (pte_none(*pte))
+				continue;
+			get_page_bootmem(section_nr, pte_page(*pte),
+					 SECTION_INFO);
+		} else {
+			next = pmd_addr_end(addr, end);
+
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd))
+				continue;
+
+			nr_pages = 1 << (get_order(PMD_SIZE));
+			page = pmd_page(*pmd);
+			while (nr_pages--)
+				get_page_bootmem(section_nr, page++,
+						 SECTION_INFO);
+		}
+	}
+}
+
 void __meminit vmemmap_populate_print_last(void)
 {
 	if (p_start) {
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 31a563b..2441f36 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -174,17 +174,10 @@ static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
 #endif /* CONFIG_NUMA */
 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-}
-static inline void put_page_bootmem(struct page *page)
-{
-}
-#else
 extern void register_page_bootmem_info_node(struct pglist_data *pgdat);
 extern void put_page_bootmem(struct page *page);
-#endif
+extern void get_page_bootmem(unsigned long info, struct page *page,
+			     unsigned long type);
 
 /*
  * Lock for memory hotplug guarantees 1) all callbacks for memory hotplug
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6320407..1eca498 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1709,7 +1709,8 @@ int vmemmap_populate_basepages(struct page *start_page,
 						unsigned long pages, int node);
 int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
 void vmemmap_populate_print_last(void);
-
+void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
+				  unsigned long size);
 
 enum mf_flags {
 	MF_COUNT_INCREASED = 1 << 0,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f6724c2..0682d2a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -91,9 +91,8 @@ static void release_memory_resource(struct resource *res)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
-#ifndef CONFIG_SPARSEMEM_VMEMMAP
-static void get_page_bootmem(unsigned long info,  struct page *page,
-			     unsigned long type)
+void get_page_bootmem(unsigned long info,  struct page *page,
+		      unsigned long type)
 {
 	page->lru.next = (struct list_head *) type;
 	SetPagePrivate(page);
@@ -128,6 +127,7 @@ void __ref put_page_bootmem(struct page *page)
 
 }
 
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
 	unsigned long *usemap, mapsize, section_nr, i;
@@ -161,6 +161,32 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
 
 }
+#else
+static void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+	unsigned long *usemap, mapsize, section_nr, i;
+	struct mem_section *ms;
+	struct page *page, *memmap;
+
+	if (!pfn_valid(start_pfn))
+		return;
+
+	section_nr = pfn_to_section_nr(start_pfn);
+	ms = __nr_to_section(section_nr);
+
+	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+
+	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
+
+	usemap = __nr_to_section(section_nr)->pageblock_flags;
+	page = virt_to_page(usemap);
+
+	mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
+}
+#endif
 
 void register_page_bootmem_info_node(struct pglist_data *pgdat)
 {
@@ -203,7 +229,6 @@ void register_page_bootmem_info_node(struct pglist_data *pgdat)
 			register_page_bootmem_info_section(pfn);
 	}
 }
-#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
 
 static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
 			   unsigned long end_pfn)
-- 
1.7.1


* [PATCH v6 07/15] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (5 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 06/15] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09  9:32 ` [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove Tang Chen
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

In __remove_section(), we take pgdat_resize_lock around the call to
sparse_remove_one_section(). This lock disables IRQs, but we don't need to
hold it across the whole function: if we do some work to free page tables in
free_section_usemap(), we need to call flush_tlb_all(), which requires IRQs
to be enabled, otherwise the WARN_ON_ONCE() in smp_call_function_many() is
triggered.

If we hold the lock across the whole of sparse_remove_one_section(), we hit
this call trace:

[  454.796248] ------------[ cut here ]------------
[  454.851408] WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
[  454.935620] Hardware name: PRIMEQUEST 1800E
......
[  455.652201] Call Trace:
[  455.681391]  [<ffffffff8106e73f>] warn_slowpath_common+0x7f/0xc0
[  455.753151]  [<ffffffff810560a0>] ? leave_mm+0x50/0x50
[  455.814527]  [<ffffffff8106e79a>] warn_slowpath_null+0x1a/0x20
[  455.884208]  [<ffffffff810e7a9d>] smp_call_function_many+0xbd/0x260
[  455.959082]  [<ffffffff810e7ecb>] smp_call_function+0x3b/0x50
[  456.027722]  [<ffffffff810560a0>] ? leave_mm+0x50/0x50
[  456.089098]  [<ffffffff810e7f4b>] on_each_cpu+0x3b/0xc0
[  456.151512]  [<ffffffff81055f0c>] flush_tlb_all+0x1c/0x20
[  456.216004]  [<ffffffff8104f8de>] remove_pagetable+0x14e/0x1d0
[  456.285683]  [<ffffffff8104f978>] vmemmap_free+0x18/0x20
[  456.349139]  [<ffffffff811b8797>] sparse_remove_one_section+0xf7/0x100
[  456.427126]  [<ffffffff811c5fc2>] __remove_section+0xa2/0xb0
[  456.494726]  [<ffffffff811c6070>] __remove_pages+0xa0/0xd0
[  456.560258]  [<ffffffff81669c7b>] arch_remove_memory+0x6b/0xc0
[  456.629937]  [<ffffffff8166ad28>] remove_memory+0xb8/0xf0
[  456.694431]  [<ffffffff813e686f>] acpi_memory_device_remove+0x53/0x96
[  456.771379]  [<ffffffff813b33c4>] acpi_device_remove+0x90/0xb2
[  456.841059]  [<ffffffff8144b02c>] __device_release_driver+0x7c/0xf0
[  456.915928]  [<ffffffff8144b1af>] device_release_driver+0x2f/0x50
[  456.988719]  [<ffffffff813b4476>] acpi_bus_remove+0x32/0x6d
[  457.055285]  [<ffffffff813b4542>] acpi_bus_trim+0x91/0x102
[  457.120814]  [<ffffffff813b463b>] acpi_bus_hot_remove_device+0x88/0x16b
[  457.199840]  [<ffffffff813afda7>] acpi_os_execute_deferred+0x27/0x34
[  457.275756]  [<ffffffff81091ece>] process_one_work+0x20e/0x5c0
[  457.345434]  [<ffffffff81091e5f>] ? process_one_work+0x19f/0x5c0
[  457.417190]  [<ffffffff813afd80>] ? acpi_os_wait_events_complete+0x23/0x23
[  457.499332]  [<ffffffff81093f6e>] worker_thread+0x12e/0x370
[  457.565896]  [<ffffffff81093e40>] ? manage_workers+0x180/0x180
[  457.635574]  [<ffffffff8109a09e>] kthread+0xee/0x100
[  457.694871]  [<ffffffff810dfaf9>] ? __lock_release+0x129/0x190
[  457.764552]  [<ffffffff81099fb0>] ? __init_kthread_worker+0x70/0x70
[  457.839427]  [<ffffffff81690aac>] ret_from_fork+0x7c/0xb0
[  457.903914]  [<ffffffff81099fb0>] ? __init_kthread_worker+0x70/0x70
[  457.978784] ---[ end trace 25e85300f542aa01 ]---

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memory_hotplug.c |    4 ----
 mm/sparse.c         |    5 ++++-
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0682d2a..674e791 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -442,8 +442,6 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 #else
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
-	unsigned long flags;
-	struct pglist_data *pgdat = zone->zone_pgdat;
 	int ret = -EINVAL;
 
 	if (!valid_section(ms))
@@ -453,9 +451,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	if (ret)
 		return ret;
 
-	pgdat_resize_lock(pgdat, &flags);
 	sparse_remove_one_section(zone, ms);
-	pgdat_resize_unlock(pgdat, &flags);
 	return 0;
 }
 #endif
diff --git a/mm/sparse.c b/mm/sparse.c
index aadbb2a..05ca73a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -796,8 +796,10 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
-	unsigned long *usemap = NULL;
+	unsigned long *usemap = NULL, flags;
+	struct pglist_data *pgdat = zone->zone_pgdat;
 
+	pgdat_resize_lock(pgdat, &flags);
 	if (ms->section_mem_map) {
 		usemap = ms->pageblock_flags;
 		memmap = sparse_decode_mem_map(ms->section_mem_map,
@@ -805,6 +807,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 		ms->section_mem_map = 0;
 		ms->pageblock_flags = NULL;
 	}
+	pgdat_resize_unlock(pgdat, &flags);
 
 	clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
 	free_section_usemap(memmap, usemap);
-- 
1.7.1


* [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (6 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 07/15] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section() Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-29 13:02   ` Simon Jeons
                     ` (2 more replies)
  2013-01-09  9:32 ` [PATCH v6 09/15] memory-hotplug: remove page table of x86_64 architecture Tang Chen
                   ` (9 subsequent siblings)
  17 siblings, 3 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

From: Wen Congyang <wency@cn.fujitsu.com>

When memory is removed, the corresponding page tables should also be removed.
This patch introduces some common APIs to support removing vmemmap page tables
and x86_64 architecture page tables.

Not all pages of the virtual mapping in the removed memory can be freed,
because a page used as a PGD/PUD may map not only the removed memory but also
other memory. So the patch uses the following way to check whether a page can
be freed or not:

 1. When removing memory, the page structs of the removed memory are filled
    with 0xFD.
 2. When every page struct on a PT/PMD page is filled with 0xFD, the PT/PMD
    entry can be cleared. In this case, the page used as the PT/PMD can be
    freed.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/pgtable_types.h |    1 +
 arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
 arch/x86/mm/pageattr.c               |   47 +++---
 include/linux/bootmem.h              |    1 +
 4 files changed, 326 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3c32db8..4b6fd2a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
  * as a pte too.
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
 
 #endif	/* !__ASSEMBLY__ */
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 9ac1723..fe01116 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
+#define PAGE_INUSE 0xFD
+
+static void __meminit free_pagetable(struct page *page, int order)
+{
+	struct zone *zone;
+	bool bootmem = false;
+	unsigned long magic;
+	unsigned int nr_pages = 1 << order;
+
+	/* bootmem page has reserved flag */
+	if (PageReserved(page)) {
+		__ClearPageReserved(page);
+		bootmem = true;
+
+		magic = (unsigned long)page->lru.next;
+		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+			while (nr_pages--)
+				put_page_bootmem(page++);
+		} else
+			__free_pages_bootmem(page, order);
+	} else
+		free_pages((unsigned long)page_address(page), order);
+
+	/*
+	 * SECTION_INFO pages and MIX_SECTION_INFO pages
+	 * are all allocated by bootmem.
+	 */
+	if (bootmem) {
+		zone = page_zone(page);
+		zone_span_writelock(zone);
+		zone->present_pages += nr_pages;
+		zone_span_writeunlock(zone);
+		totalram_pages += nr_pages;
+	}
+}
+
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+	pte_t *pte;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (pte_val(*pte))
+			return;
+	}
+
+	/* free a pte table */
+	free_pagetable(pmd_page(*pmd), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+	pmd_t *pmd;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (pmd_val(*pmd))
+			return;
+	}
+
+	/* free a pmd table */
+	free_pagetable(pud_page(*pud), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+/* Return true if pgd is changed, otherwise return false. */
+static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (pud_val(*pud))
+			return false;
+	}
+
+	/* free a pud table */
+	free_pagetable(pgd_page(*pgd), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pgd_clear(pgd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	return true;
+}
+
+static void __meminit
+remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
+		 bool direct)
+{
+	unsigned long next, pages = 0;
+	pte_t *pte;
+	void *page_addr;
+	phys_addr_t phys_addr;
+
+	pte = pte_start + pte_index(addr);
+	for (; addr < end; addr = next, pte++) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		if (!pte_present(*pte))
+			continue;
+
+		/*
+		 * We mapped [0,1G) memory as identity mapping when
+		 * initializing, in arch/x86/kernel/head_64.S. These
+		 * pagetables cannot be removed.
+		 */
+		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
+		if (phys_addr < (phys_addr_t)0x40000000)
+			return;
+
+		if (IS_ALIGNED(addr, PAGE_SIZE) &&
+		    IS_ALIGNED(next, PAGE_SIZE)) {
+			if (!direct) {
+				free_pagetable(pte_page(*pte), 0);
+				pages++;
+			}
+
+			spin_lock(&init_mm.page_table_lock);
+			pte_clear(&init_mm, addr, pte);
+			spin_unlock(&init_mm.page_table_lock);
+		} else {
+			/*
+			 * If we are not removing the whole page, it means
+		 * other ptes in this page are being used and we cannot
+			 * remove them. So fill the unused ptes with 0xFD, and
+			 * remove the page when it is wholly filled with 0xFD.
+			 */
+			memset((void *)addr, PAGE_INUSE, next - addr);
+			page_addr = page_address(pte_page(*pte));
+
+			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
+				free_pagetable(pte_page(*pte), 0);
+				pages++;
+
+				spin_lock(&init_mm.page_table_lock);
+				pte_clear(&init_mm, addr, pte);
+				spin_unlock(&init_mm.page_table_lock);
+			}
+		}
+	}
+
+	/* Call free_pte_table() in remove_pmd_table(). */
+	flush_tlb_all();
+	if (direct)
+		update_page_count(PG_LEVEL_4K, -pages);
+}
+
+static void __meminit
+remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
+		 bool direct)
+{
+	unsigned long pte_phys, next, pages = 0;
+	pte_t *pte_base;
+	pmd_t *pmd;
+
+	pmd = pmd_start + pmd_index(addr);
+	for (; addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(*pmd))
+			continue;
+
+		if (pmd_large(*pmd)) {
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+			    IS_ALIGNED(next, PMD_SIZE)) {
+				if (!direct) {
+					free_pagetable(pmd_page(*pmd),
+						       get_order(PMD_SIZE));
+					pages++;
+				}
+
+				spin_lock(&init_mm.page_table_lock);
+				pmd_clear(pmd);
+				spin_unlock(&init_mm.page_table_lock);
+				continue;
+			}
+
+			/*
+			 * We use 2M page, but we need to remove part of them,
+			 * so split 2M page to 4K page.
+			 */
+			pte_base = (pte_t *)alloc_low_page(&pte_phys);
+			BUG_ON(!pte_base);
+			__split_large_page((pte_t *)pmd, addr,
+					   (pte_t *)pte_base);
+
+			spin_lock(&init_mm.page_table_lock);
+			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
+			spin_unlock(&init_mm.page_table_lock);
+
+			flush_tlb_all();
+		}
+
+		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
+		remove_pte_table(pte_base, addr, next, direct);
+		free_pte_table(pte_base, pmd);
+		unmap_low_page(pte_base);
+	}
+
+	/* Call free_pmd_table() in remove_pud_table(). */
+	if (direct)
+		update_page_count(PG_LEVEL_2M, -pages);
+}
+
+static void __meminit
+remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
+		 bool direct)
+{
+	unsigned long pmd_phys, next, pages = 0;
+	pmd_t *pmd_base;
+	pud_t *pud;
+
+	pud = pud_start + pud_index(addr);
+	for (; addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+
+		if (!pud_present(*pud))
+			continue;
+
+		if (pud_large(*pud)) {
+			if (IS_ALIGNED(addr, PUD_SIZE) &&
+			    IS_ALIGNED(next, PUD_SIZE)) {
+				if (!direct) {
+					free_pagetable(pud_page(*pud),
+						       get_order(PUD_SIZE));
+					pages++;
+				}
+
+				spin_lock(&init_mm.page_table_lock);
+				pud_clear(pud);
+				spin_unlock(&init_mm.page_table_lock);
+				continue;
+			}
+
+			/*
+			 * We use 1G page, but we need to remove part of them,
+			 * so split 1G page to 2M page.
+			 */
+			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
+			BUG_ON(!pmd_base);
+			__split_large_page((pte_t *)pud, addr,
+					   (pte_t *)pmd_base);
+
+			spin_lock(&init_mm.page_table_lock);
+			pud_populate(&init_mm, pud, __va(pmd_phys));
+			spin_unlock(&init_mm.page_table_lock);
+
+			flush_tlb_all();
+		}
+
+		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
+		remove_pmd_table(pmd_base, addr, next, direct);
+		free_pmd_table(pmd_base, pud);
+		unmap_low_page(pmd_base);
+	}
+
+	if (direct)
+		update_page_count(PG_LEVEL_1G, -pages);
+}
+
+/* start and end are both virtual address. */
+static void __meminit
+remove_pagetable(unsigned long start, unsigned long end, bool direct)
+{
+	unsigned long next;
+	pgd_t *pgd;
+	pud_t *pud;
+	bool pgd_changed = false;
+
+	for (; start < end; start = next) {
+		pgd = pgd_offset_k(start);
+		if (!pgd_present(*pgd))
+			continue;
+
+		next = pgd_addr_end(start, end);
+
+		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
+		remove_pud_table(pud, start, next, direct);
+		if (free_pud_table(pud, pgd))
+			pgd_changed = true;
+		unmap_low_page(pud);
+	}
+
+	if (pgd_changed)
+		sync_global_pgds(start, end - 1);
+
+	flush_tlb_all();
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 int __ref arch_remove_memory(u64 start, u64 size)
 {
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a718e0d..7dcb6f9 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -501,21 +501,13 @@ out_unlock:
 	return do_split;
 }
 
-static int split_large_page(pte_t *kpte, unsigned long address)
+int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
 {
 	unsigned long pfn, pfninc = 1;
 	unsigned int i, level;
-	pte_t *pbase, *tmp;
+	pte_t *tmp;
 	pgprot_t ref_prot;
-	struct page *base;
-
-	if (!debug_pagealloc)
-		spin_unlock(&cpa_lock);
-	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
-	if (!debug_pagealloc)
-		spin_lock(&cpa_lock);
-	if (!base)
-		return -ENOMEM;
+	struct page *base = virt_to_page(pbase);
 
 	spin_lock(&pgd_lock);
 	/*
@@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
 	 * up for us already:
 	 */
 	tmp = lookup_address(address, &level);
-	if (tmp != kpte)
-		goto out_unlock;
+	if (tmp != kpte) {
+		spin_unlock(&pgd_lock);
+		return 1;
+	}
 
-	pbase = (pte_t *)page_address(base);
 	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
 	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
 	/*
@@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
 	 * going on.
 	 */
 	__flush_tlb_all();
+	spin_unlock(&pgd_lock);
 
-	base = NULL;
+	return 0;
+}
 
-out_unlock:
-	/*
-	 * If we dropped out via the lookup_address check under
-	 * pgd_lock then stick the page back into the pool:
-	 */
-	if (base)
+static int split_large_page(pte_t *kpte, unsigned long address)
+{
+	pte_t *pbase;
+	struct page *base;
+
+	if (!debug_pagealloc)
+		spin_unlock(&cpa_lock);
+	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
+	if (!debug_pagealloc)
+		spin_lock(&cpa_lock);
+	if (!base)
+		return -ENOMEM;
+
+	pbase = (pte_t *)page_address(base);
+	if (__split_large_page(kpte, address, pbase))
 		__free_page(base);
-	spin_unlock(&pgd_lock);
 
 	return 0;
 }
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 3f778c2..190ff06 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
 			      unsigned long size);
 extern void free_bootmem(unsigned long physaddr, unsigned long size);
 extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
+extern void __free_pages_bootmem(struct page *page, unsigned int order);
 
 /*
  * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,
-- 
1.7.1


* [PATCH v6 09/15] memory-hotplug: remove page table of x86_64 architecture
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (7 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09  9:32 ` [PATCH v6 10/15] memory-hotplug: remove memmap of sparse-vmemmap Tang Chen
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

This patch finds the page table entries that map the removed memory
and clears them, for the x86_64 architecture.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/init_64.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index fe01116..d950f9b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -981,6 +981,15 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	flush_tlb_all();
 }
 
+void __meminit
+kernel_physical_mapping_remove(unsigned long start, unsigned long end)
+{
+	start = (unsigned long)__va(start);
+	end = (unsigned long)__va(end);
+
+	remove_pagetable(start, end, true);
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 int __ref arch_remove_memory(u64 start, u64 size)
 {
@@ -990,6 +999,7 @@ int __ref arch_remove_memory(u64 start, u64 size)
 	int ret;
 
 	zone = page_zone(pfn_to_page(start_pfn));
+	kernel_physical_mapping_remove(start, start + size);
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v6 10/15] memory-hotplug: remove memmap of sparse-vmemmap
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (8 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 09/15] memory-hotplug: remove page table of x86_64 architecture Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09  9:32 ` [PATCH v6 11/15] memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP Tang Chen
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

This patch introduces a new API, vmemmap_free(), to free and remove
vmemmap page tables. Since page table implementations differ across
architectures, each architecture has to provide its own version of
vmemmap_free(), just like vmemmap_populate().

Note: vmemmap_free() is an empty stub for ia64, ppc, s390, and sparc.
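On x86_64 (in the hunk below), vmemmap_free() converts memmap and nr_pages into the virtual address range covering nr_pages struct page entries and hands it to remove_pagetable(). A small userspace sketch of that address arithmetic (struct page is shrunk to a stub here, and vmemmap_bytes() is a hypothetical helper, not a kernel function):

```c
#include <assert.h>

/* Minimal stand-in for struct page; the real one is larger. */
struct page { unsigned long flags; void *priv; };

/* Mirror of the arithmetic in the x86_64 vmemmap_free(): the range to
 * unmap spans nr_pages struct page entries starting at memmap, i.e.
 * end - start == nr_pages * sizeof(struct page) bytes. */
static unsigned long vmemmap_bytes(struct page *memmap, unsigned long nr_pages)
{
	unsigned long start = (unsigned long)memmap;
	unsigned long end   = (unsigned long)(memmap + nr_pages);
	return end - start;
}
```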

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/arm64/mm/mmu.c       |    3 +++
 arch/ia64/mm/discontig.c  |    4 ++++
 arch/powerpc/mm/init_64.c |    4 ++++
 arch/s390/mm/vmem.c       |    4 ++++
 arch/sparc/mm/init_64.c   |    4 ++++
 arch/x86/mm/init_64.c     |    8 ++++++++
 include/linux/mm.h        |    1 +
 mm/sparse.c               |    3 ++-
 8 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a6885d8..9834886 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -392,4 +392,7 @@ int __meminit vmemmap_populate(struct page *start_page,
 	return 0;
 }
 #endif	/* CONFIG_ARM64_64K_PAGES */
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 33943db..882a0fd 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -823,6 +823,10 @@ int __meminit vmemmap_populate(struct page *start_page,
 	return vmemmap_populate_basepages(start_page, size, node);
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 6466440..2969591 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -298,6 +298,10 @@ int __meminit vmemmap_populate(struct page *start_page,
 	return 0;
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 2c14bc2..81e6ba3 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -272,6 +272,10 @@ out:
 	return ret;
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 1f30db3..5afe21a 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2232,6 +2232,10 @@ void __meminit vmemmap_populate_print_last(void)
 	}
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index d950f9b..e829113 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1309,6 +1309,14 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
 	return 0;
 }
 
+void __ref vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+	unsigned long start = (unsigned long)memmap;
+	unsigned long end = (unsigned long)(memmap + nr_pages);
+
+	remove_pagetable(start, end, false);
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1eca498..31d5e5d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1709,6 +1709,7 @@ int vmemmap_populate_basepages(struct page *start_page,
 						unsigned long pages, int node);
 int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
 void vmemmap_populate_print_last(void);
+void vmemmap_free(struct page *memmap, unsigned long nr_pages);
 void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long size);
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 05ca73a..cff9796 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -615,10 +615,11 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
 }
 static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 {
-	return; /* XXX: Not implemented yet */
+	vmemmap_free(memmap, nr_pages);
 }
 static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
 {
+	vmemmap_free(memmap, nr_pages);
 }
 #else
 static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v6 11/15] memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (9 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 10/15] memory-hotplug: remove memmap of sparse-vmemmap Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09  9:32 ` [PATCH v6 12/15] memory-hotplug: memory_hotplug: clear zone when removing the memory Tang Chen
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

Currently __remove_section() for SPARSEMEM_VMEMMAP does nothing. But even
when we use SPARSEMEM_VMEMMAP, we can still unregister the mem_section.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   11 -----------
 1 files changed, 0 insertions(+), 11 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 674e791..b20c4c7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -430,16 +430,6 @@ static int __meminit __add_section(int nid, struct zone *zone,
 	return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
 }
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static int __remove_section(struct zone *zone, struct mem_section *ms)
-{
-	/*
-	 * XXX: Freeing memmap with vmemmap is not implement yet.
-	 *      This should be removed later.
-	 */
-	return -EBUSY;
-}
-#else
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
 	int ret = -EINVAL;
@@ -454,7 +444,6 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	sparse_remove_one_section(zone, ms);
 	return 0;
 }
-#endif
 
 /*
  * Reasonably generic function for adding memory.  It is
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v6 12/15] memory-hotplug: memory_hotplug: clear zone when removing the memory
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (10 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 11/15] memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09  9:32 ` [PATCH v6 13/15] memory-hotplug: remove sysfs file of node Tang Chen
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

When memory is added, we update the zone's and pgdat's start_pfn and
spanned_pages in __add_zone(). So we should revert them when the
memory is removed.

The patch adds a new function __remove_zone() to do this.
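The helpers find_smallest_section_pfn() and find_biggest_section_pfn() introduced below scan for the nearest remaining valid section so the zone/pgdat span can shrink to it when a boundary section is removed. A toy userspace model of that scan, with sections reduced to an array of valid flags (the names are illustrative, not the kernel API):

```c
#include <assert.h>

#define NO_SECTION (-1)

/* Model of find_smallest_section_pfn(): return the index of the first
 * valid section in [start, end), or NO_SECTION if none remain. */
static int find_smallest(const int *valid, int start, int end)
{
	for (int i = start; i < end; i++)
		if (valid[i])
			return i;
	return NO_SECTION;
}

/* Model of find_biggest_section_pfn(): return the index of the last
 * valid section in [start, end), or NO_SECTION if none remain. */
static int find_biggest(const int *valid, int start, int end)
{
	for (int i = end - 1; i >= start; i--)
		if (valid[i])
			return i;
	return NO_SECTION;
}
```

shrink_zone_span() uses the first helper after removing the lowest section (the new zone start) and the second after removing the highest (the new zone end); if neither helper finds anything, the whole span collapses to zero.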

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c |  207 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 207 insertions(+), 0 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b20c4c7..da20c14 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -430,8 +430,211 @@ static int __meminit __add_section(int nid, struct zone *zone,
 	return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
 }
 
+/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
+static int find_smallest_section_pfn(int nid, struct zone *zone,
+				     unsigned long start_pfn,
+				     unsigned long end_pfn)
+{
+	struct mem_section *ms;
+
+	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(start_pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (unlikely(pfn_to_nid(start_pfn) != nid))
+			continue;
+
+		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
+			continue;
+
+		return start_pfn;
+	}
+
+	return 0;
+}
+
+/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
+static int find_biggest_section_pfn(int nid, struct zone *zone,
+				    unsigned long start_pfn,
+				    unsigned long end_pfn)
+{
+	struct mem_section *ms;
+	unsigned long pfn;
+
+	/* pfn is the end pfn of a memory section. */
+	pfn = end_pfn - 1;
+	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (unlikely(pfn_to_nid(pfn) != nid))
+			continue;
+
+		if (zone && zone != page_zone(pfn_to_page(pfn)))
+			continue;
+
+		return pfn;
+	}
+
+	return 0;
+}
+
+static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
+			     unsigned long end_pfn)
+{
+	unsigned long zone_start_pfn =  zone->zone_start_pfn;
+	unsigned long zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+	struct mem_section *ms;
+	int nid = zone_to_nid(zone);
+
+	zone_span_writelock(zone);
+	if (zone_start_pfn == start_pfn) {
+		/*
+		 * If the section is the smallest section in the zone, we need
+		 * to shrink zone->zone_start_pfn and zone->spanned_pages.
+		 * In this case, find the second smallest valid mem_section
+		 * and shrink the zone to it.
+		 */
+		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
+						zone_end_pfn);
+		if (pfn) {
+			zone->zone_start_pfn = pfn;
+			zone->spanned_pages = zone_end_pfn - pfn;
+		}
+	} else if (zone_end_pfn == end_pfn) {
+		/*
+		 * If the section is the biggest section in the zone, we need
+		 * to shrink zone->spanned_pages.
+		 * In this case, find the second biggest valid mem_section
+		 * and shrink the zone to it.
+		 */
+		pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
+					       start_pfn);
+		if (pfn)
+			zone->spanned_pages = pfn - zone_start_pfn + 1;
+	}
+
+	/*
+	 * The section is neither the biggest nor the smallest mem_section
+	 * in the zone; removing it only creates a hole, so we need not
+	 * change the zone's span. But the zone may now contain nothing
+	 * but holes, so check whether any valid section remains.
+	 */
+	pfn = zone_start_pfn;
+	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (page_zone(pfn_to_page(pfn)) != zone)
+			continue;
+
+		 /* Skip the section that is currently being removed */
+		if (start_pfn == pfn)
+			continue;
+
+		/* If we find a valid section, we have nothing to do */
+		zone_span_writeunlock(zone);
+		return;
+	}
+
+	/* The zone has no valid section */
+	zone->zone_start_pfn = 0;
+	zone->spanned_pages = 0;
+	zone_span_writeunlock(zone);
+}
+
+static void shrink_pgdat_span(struct pglist_data *pgdat,
+			      unsigned long start_pfn, unsigned long end_pfn)
+{
+	unsigned long pgdat_start_pfn =  pgdat->node_start_pfn;
+	unsigned long pgdat_end_pfn =
+		pgdat->node_start_pfn + pgdat->node_spanned_pages;
+	unsigned long pfn;
+	struct mem_section *ms;
+	int nid = pgdat->node_id;
+
+	if (pgdat_start_pfn == start_pfn) {
+		/*
+		 * If the section is the smallest section in the pgdat, we need
+		 * to shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
+		 * In this case, find the second smallest valid mem_section
+		 * and shrink the pgdat to it.
+		 */
+		pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
+						pgdat_end_pfn);
+		if (pfn) {
+			pgdat->node_start_pfn = pfn;
+			pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
+		}
+	} else if (pgdat_end_pfn == end_pfn) {
+		/*
+		 * If the section is the biggest section in the pgdat, we need
+		 * to shrink pgdat->node_spanned_pages.
+		 * In this case, find the second biggest valid mem_section
+		 * and shrink the pgdat to it.
+		 */
+		pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
+					       start_pfn);
+		if (pfn)
+			pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
+	}
+
+	/*
+	 * If the section is neither the biggest nor the smallest mem_section
+	 * in the pgdat, removing it only creates a hole, so we need not
+	 * change the pgdat's span.
+	 * But the pgdat may now contain nothing but holes, so check
+	 * whether any valid section remains.
+	 */
+	pfn = pgdat_start_pfn;
+	for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (pfn_to_nid(pfn) != nid)
+			continue;
+
+		 /* Skip the section that is currently being removed */
+		if (start_pfn == pfn)
+			continue;
+
+		/* If we find a valid section, we have nothing to do */
+		return;
+	}
+
+	/* The pgdat has no valid section */
+	pgdat->node_start_pfn = 0;
+	pgdat->node_spanned_pages = 0;
+}
+
+static void __remove_zone(struct zone *zone, unsigned long start_pfn)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	int nr_pages = PAGES_PER_SECTION;
+	int zone_type;
+	unsigned long flags;
+
+	zone_type = zone - pgdat->node_zones;
+
+	pgdat_resize_lock(zone->zone_pgdat, &flags);
+	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
+	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
+	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
+	unsigned long start_pfn;
+	int scn_nr;
 	int ret = -EINVAL;
 
 	if (!valid_section(ms))
@@ -441,6 +644,10 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	if (ret)
 		return ret;
 
+	scn_nr = __section_nr(ms);
+	start_pfn = section_nr_to_pfn(scn_nr);
+	__remove_zone(zone, start_pfn);
+
 	sparse_remove_one_section(zone, ms);
 	return 0;
 }
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v6 13/15] memory-hotplug: remove sysfs file of node
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (11 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 12/15] memory-hotplug: memory_hotplug: clear zone when removing the memory Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09  9:32 ` [PATCH v6 14/15] memory-hotplug: free node_data when a node is offlined Tang Chen
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

This patch introduces a new function, try_offline_node(), which removes
the sysfs files of a node once all memory sections of that node have
been removed. If any memory section of the node remains, the function
does nothing.
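The check try_offline_node() performs can be modeled in userspace: a node may go offline only when no present memory section and no present CPU still belongs to it. A hypothetical sketch, with plain arrays standing in for the mem_section table and cpu_to_node() (can_offline_node() is illustrative, not a kernel function):

```c
#include <assert.h>

/* Return 1 if node nid can be offlined: no present memory section and
 * no present CPU still maps to it; otherwise return 0, mirroring the
 * early returns and the -EBUSY from check_cpu_on_node(). */
static int can_offline_node(int nid,
			    const int *section_nid, int nr_sections,
			    const int *cpu_nid, int nr_cpus)
{
	for (int i = 0; i < nr_sections; i++)
		if (section_nid[i] == nid)
			return 0;	/* some memory of this node remains */
	for (int c = 0; c < nr_cpus; c++)
		if (cpu_nid[c] == nid)
			return 0;	/* a CPU still lives on this node */
	return 1;			/* safe to offline and unregister */
}
```

In the real patch the CPU check runs under stop_machine() so no CPU can migrate between nodes while the decision is made.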

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/acpi/acpi_memhotplug.c |    8 ++++-
 include/linux/memory_hotplug.h |    2 +-
 mm/memory_hotplug.c            |   58 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index eb30e5a..9c53cc6 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -295,9 +295,11 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 
 static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
 {
-	int result = 0;
+	int result = 0, nid;
 	struct acpi_memory_info *info, *n;
 
+	nid = acpi_get_node(mem_device->device->handle);
+
 	list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
 		if (info->failed)
 			/* The kernel does not use this memory block */
@@ -310,7 +312,9 @@ static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
 			 */
 			return -EBUSY;
 
-		result = remove_memory(info->start_addr, info->length);
+		if (nid < 0)
+			nid = memory_add_physaddr_to_nid(info->start_addr);
+		result = remove_memory(nid, info->start_addr, info->length);
 		if (result)
 			return result;
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 2441f36..f60e728 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -242,7 +242,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int offline_memory_block(struct memory_block *mem);
 extern bool is_memblock_offlined(struct memory_block *mem);
-extern int remove_memory(u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 								int nr_pages);
 extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da20c14..a8703f7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -29,6 +29,7 @@
 #include <linux/suspend.h>
 #include <linux/mm_inline.h>
 #include <linux/firmware-map.h>
+#include <linux/stop_machine.h>
 
 #include <asm/tlbflush.h>
 
@@ -1678,7 +1679,58 @@ static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return ret;
 }
 
-int __ref remove_memory(u64 start, u64 size)
+static int check_cpu_on_node(void *data)
+{
+	struct pglist_data *pgdat = data;
+	int cpu;
+
+	for_each_present_cpu(cpu) {
+		if (cpu_to_node(cpu) == pgdat->node_id)
+			/*
+			 * A CPU on this node has not been removed, so we
+			 * cannot offline this node.
+			 */
+			return -EBUSY;
+	}
+
+	return 0;
+}
+
+/* offline the node if all memory sections of this node are removed */
+static void try_offline_node(int nid)
+{
+	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
+	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+	unsigned long pfn;
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		unsigned long section_nr = pfn_to_section_nr(pfn);
+
+		if (!present_section_nr(section_nr))
+			continue;
+
+		if (pfn_to_nid(pfn) != nid)
+			continue;
+
+		/*
+		 * Some memory sections of this node have not been removed,
+		 * so we cannot offline the node now.
+		 */
+		return;
+	}
+
+	if (stop_machine(check_cpu_on_node, NODE_DATA(nid), NULL))
+		return;
+
+	/*
+	 * All memory and CPUs of this node have been removed; we can
+	 * offline the node now.
+	 */
+	node_set_offline(nid);
+	unregister_one_node(nid);
+}
+
+int __ref remove_memory(int nid, u64 start, u64 size)
 {
 	unsigned long start_pfn, end_pfn;
 	int ret = 0;
@@ -1733,6 +1785,8 @@ repeat:
 
 	arch_remove_memory(start, size);
 
+	try_offline_node(nid);
+
 	unlock_memory_hotplug();
 
 	return 0;
@@ -1742,7 +1796,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
 	return -EINVAL;
 }
-int remove_memory(u64 start, u64 size)
+int remove_memory(int nid, u64 start, u64 size)
 {
 	return -EINVAL;
 }
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v6 14/15] memory-hotplug: free node_data when a node is offlined
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (12 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 13/15] memory-hotplug: remove sysfs file of node Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09  9:32 ` [PATCH v6 15/15] memory-hotplug: Do not allocate pgdat if it was not freed when offline Tang Chen
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

From: Wen Congyang <wency@cn.fujitsu.com>

We call hotadd_new_pgdat() to allocate memory to store node_data. So we
should free it when removing a node.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memory_hotplug.c |   30 +++++++++++++++++++++++++++---
 1 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a8703f7..8b67752 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1699,9 +1699,12 @@ static int check_cpu_on_node(void *data)
 /* offline the node if all memory sections of this node are removed */
 static void try_offline_node(int nid)
 {
-	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
-	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+	pg_data_t *pgdat = NODE_DATA(nid);
+	unsigned long start_pfn = pgdat->node_start_pfn;
+	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
 	unsigned long pfn;
+	struct page *pgdat_page = virt_to_page(pgdat);
+	int i;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		unsigned long section_nr = pfn_to_section_nr(pfn);
@@ -1719,7 +1722,7 @@ static void try_offline_node(int nid)
 		return;
 	}
 
-	if (stop_machine(check_cpu_on_node, NODE_DATA(nid), NULL))
+	if (stop_machine(check_cpu_on_node, pgdat, NULL))
 		return;
 
 	/*
@@ -1728,6 +1731,27 @@ static void try_offline_node(int nid)
 	 */
 	node_set_offline(nid);
 	unregister_one_node(nid);
+
+	if (!PageSlab(pgdat_page) && !PageCompound(pgdat_page))
+		/* node data is allocated from boot memory */
+		return;
+
+	/* free waittable in each zone */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (zone->wait_table)
+			vfree(zone->wait_table);
+	}
+
+	/*
+	 * Since there is no way to guarantee the address of pgdat/zone is not
+	 * on the stack of any kernel thread or used by other kernel objects
+	 * without reference counting or another synchronizing method, do not
+	 * reset node_data and free pgdat here. Just reset it to 0 and reuse
+	 * the memory when the node is onlined again.
+	 */
+	memset(pgdat, 0, sizeof(*pgdat));
 }
 
 int __ref remove_memory(int nid, u64 start, u64 size)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v6 15/15] memory-hotplug: Do not allocate pgdat if it was not freed when offline.
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (13 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 14/15] memory-hotplug: free node_data when a node is offlined Tang Chen
@ 2013-01-09  9:32 ` Tang Chen
  2013-01-09 22:23 ` [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Andrew Morton
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-09  9:32 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen, hpa,
	linfeng, laijs, mgorman, yinghai, glommer
  Cc: linux-s390, linux-ia64, linux-acpi, linux-sh, x86, linux-kernel,
	cmetcalf, linux-mm, sparclinux, linuxppc-dev

Since there is no way to guarantee that the address of a pgdat/zone is
not on the stack of any kernel thread, or used by other kernel objects,
without reference counting or another synchronizing method, we cannot
reset node_data and free the pgdat when offlining a node. Just reset the
pgdat to 0 and reuse the memory when the node is onlined again.

The problem was pointed out by Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>.
The idea is from Wen Congyang <wency@cn.fujitsu.com>.

NOTE: If we don't reset the pgdat to 0, the WARN_ON in free_area_init_node()
      will be triggered.
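The allocate-once-then-reuse behavior this patch gives hotadd_new_pgdat() can be sketched in userspace as follows (struct pgdat, node_data, and the two helpers are toy stand-ins, not the kernel interfaces):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 8

struct pgdat { int node_id; long spanned; };
static struct pgdat *node_data[MAX_NODES];

/* Model of the patched hotadd_new_pgdat(): allocate node data only the
 * first time a node comes online; on later re-onlines the previously
 * allocated (and zeroed) structure is reused. */
static struct pgdat *hotadd_pgdat(int nid)
{
	if (!node_data[nid]) {
		node_data[nid] = calloc(1, sizeof(struct pgdat));
		if (!node_data[nid])
			return NULL;	/* maps to -ENOMEM */
	}
	return node_data[nid];
}

/* Model of try_offline_node() after patch 14: reset to 0, never free,
 * because other code may still hold the pointer. */
static void offline_pgdat(int nid)
{
	memset(node_data[nid], 0, sizeof(struct pgdat));
}
```

The invariant this preserves is that NODE_DATA(nid) never changes address across offline/online cycles, so stale pointers held elsewhere stay valid (they just see zeroed data).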

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   20 ++++++++++++--------
 1 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8b67752..8aa2b56 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1015,11 +1015,14 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 	unsigned long zholes_size[MAX_NR_ZONES] = {0};
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 
-	pgdat = arch_alloc_nodedata(nid);
-	if (!pgdat)
-		return NULL;
+	pgdat = NODE_DATA(nid);
+	if (!pgdat) {
+		pgdat = arch_alloc_nodedata(nid);
+		if (!pgdat)
+			return NULL;
 
-	arch_refresh_nodedata(nid, pgdat);
+		arch_refresh_nodedata(nid, pgdat);
+	}
 
 	/* we can use NODE_DATA(nid) from here */
 
@@ -1072,7 +1075,7 @@ out:
 int __ref add_memory(int nid, u64 start, u64 size)
 {
 	pg_data_t *pgdat = NULL;
-	int new_pgdat = 0;
+	int new_pgdat = 0, new_node = 0;
 	struct resource *res;
 	int ret;
 
@@ -1083,12 +1086,13 @@ int __ref add_memory(int nid, u64 start, u64 size)
 	if (!res)
 		goto out;
 
-	if (!node_online(nid)) {
+	new_pgdat = NODE_DATA(nid) ? 0 : 1;
+	new_node = node_online(nid) ? 0 : 1;
+	if (new_node) {
 		pgdat = hotadd_new_pgdat(nid, start);
 		ret = -ENOMEM;
 		if (!pgdat)
 			goto error;
-		new_pgdat = 1;
 	}
 
 	/* call arch's memory hotadd */
@@ -1100,7 +1104,7 @@ int __ref add_memory(int nid, u64 start, u64 size)
 	/* we online node here. we can't roll back from here. */
 	node_set_online(nid);
 
-	if (new_pgdat) {
+	if (new_node) {
 		ret = register_one_node(nid);
 		/*
 		 * If sysfs file of new node can't create, cpu on the node
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (14 preceding siblings ...)
  2013-01-09  9:32 ` [PATCH v6 15/15] memory-hotplug: Do not allocate pgdat if it was not freed when offline Tang Chen
@ 2013-01-09 22:23 ` Andrew Morton
  2013-01-10  2:17   ` Tang Chen
  2013-01-09 23:33 ` Andrew Morton
  2013-01-29 12:52 ` Simon Jeons
  17 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2013-01-09 22:23 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

On Wed, 9 Jan 2013 17:32:24 +0800
Tang Chen <tangchen@cn.fujitsu.com> wrote:

> Here is the physical memory hot-remove patch-set based on 3.8rc-2.
> 
> This patch-set aims to implement physical memory hot-removing.
> 
> The patches can free/remove the following things:
> 
>   - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>   - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
>   - page table of removed memory              : [RFC PATCH 7,8,10/15]
>   - node and related sysfs files              : [RFC PATCH 13-15/15]
> 
> 
> Existing problem:
> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> when we online pages.
> 
> For example: there is a memory device on node 1. The address range
> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> and memory11 under the directory /sys/devices/system/memory/.
> 
> If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
> cgroup is not provided by this memory device. But when we online memory9, the
> memory stored page cgroup may be provided by memory8. So we can't offline
> memory8 now. We should offline the memory in the reversed order.
> 
> When the memory device is hotremoved, we will auto offline memory provided
> by this memory device. But we don't know which memory is onlined first, so
> offlining memory may fail.

This does sound like a significant problem.  We should assume that
memcg is available and in use.

> In patch1, we provide a solution which is not good enough:
> Iterate twice to offline the memory.
> 1st iterate: offline every non primary memory block.
> 2nd iterate: offline primary (i.e. first added) memory block.

Let's flesh this out a bit.

If we online memory8, memory9, memory10 and memory11 then I'd have
thought that they would need to offlined in reverse order, which will
require four iterations, not two.  Is this wrong and if so, why?

Also, what happens if we wish to offline only memory9?  Do we offline
memory11 then memory10 then memory9 and then re-online memory10 and
memory11?

> And a new idea from Wen Congyang <wency@cn.fujitsu.com> is:
> allocate the memory from the memory block they are describing.

Yes.

> But we are not sure if it is OK to do so because there is no existing API
> for it, and we need to move the page_cgroup memory allocation from
> MEM_GOING_ONLINE to MEM_ONLINE.

This all sounds solvable - can we proceed in this fashion?

> And also, it may interfere with hugepages.

Please provide full details on this problem.

> Note: if the memory provided by the memory device is used by the kernel, it
> can't be offlined. It is not a bug.

Right.  But how often does this happen in testing?  In other words,
please provide an overall description of how well memory hot-remove is
presently operating.  Is it reliable?  What is the success rate in
real-world situations?  Are there precautions which the administrator
can take to improve the success rate?  What are the remaining problems
and are there plans to address them?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs
  2013-01-09  9:32 ` [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs Tang Chen
@ 2013-01-09 22:49   ` Andrew Morton
  2013-01-10  6:07     ` Tang Chen
  2013-01-09 23:19   ` Andrew Morton
  1 sibling, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2013-01-09 22:49 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

On Wed, 9 Jan 2013 17:32:28 +0800
Tang Chen <tangchen@cn.fujitsu.com> wrote:

> When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
> sysfs files are created. But there is no code to remove these files. The patch
> implements the function to remove them.
> 
> Note: The code does not free the firmware_map_entry, which is allocated by
>       bootmem. So the patch introduces a memory leak. But I think the leak is
>       very small, and it does not affect the system.

Well that's bad.  Can we remember the address of that memory and then
reuse the storage if/when the memory is re-added?  That at least puts an upper
bound on the leak.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
  2013-01-09  9:32 ` [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Tang Chen
@ 2013-01-09 22:50   ` Andrew Morton
  2013-01-10  2:25     ` Tang Chen
  0 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2013-01-09 22:50 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

On Wed, 9 Jan 2013 17:32:29 +0800
Tang Chen <tangchen@cn.fujitsu.com> wrote:

> For removing memory, we need to remove the page table. But this depends on
> the architecture, so the patch introduces arch_remove_memory() for removing
> the page table. For now it only calls __remove_pages().
> 
> Note: __remove_pages() is not implemented for some architectures
>       (I don't know how to implement it for s390).

Can this break the build for s390?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
  2013-01-09  9:32 ` [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Tang Chen
@ 2013-01-09 23:11   ` Andrew Morton
  2013-01-10  5:56     ` Tang Chen
  0 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2013-01-09 23:11 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

On Wed, 9 Jan 2013 17:32:26 +0800
Tang Chen <tangchen@cn.fujitsu.com> wrote:

> We remove the memory like this:
> 1. lock memory hotplug
> 2. offline a memory block
> 3. unlock memory hotplug
> 4. repeat 1-3 to offline all memory blocks
> 5. lock memory hotplug
> 6. remove memory(TODO)
> 7. unlock memory hotplug
> 
> All memory blocks must be offlined before removing memory. But we don't hold
> the lock for the whole operation, so we should check whether all memory blocks
> are offlined before step 6. Otherwise, the kernel may panic.

Well, the obvious question is: why don't we hold lock_memory_hotplug()
for all of steps 1-4?  Please send the reasons for this in a form which
I can paste into the changelog.


Actually, I wonder if doing this would fix a race in the current
remove_memory() repeat: loop.  That code does a
find_memory_block_hinted() followed by offline_memory_block(), but
afaict find_memory_block_hinted() only does a get_device().  Is the
get_device() sufficiently strong to prevent problems if another thread
concurrently offlines or otherwise alters this memory_block's state?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs
  2013-01-09  9:32 ` [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs Tang Chen
  2013-01-09 22:49   ` Andrew Morton
@ 2013-01-09 23:19   ` Andrew Morton
  2013-01-10  6:15     ` Tang Chen
  1 sibling, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2013-01-09 23:19 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

On Wed, 9 Jan 2013 17:32:28 +0800
Tang Chen <tangchen@cn.fujitsu.com> wrote:

> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
> sysfs files are created. But there is no code to remove these files. The patch
> implements the function to remove them.
> 
> Note: The code does not free the firmware_map_entry, which is allocated by
>       bootmem. So the patch introduces a memory leak. But I think the leak is
>       very small, and it does not affect the system.
> 
> ...
>
> +static struct firmware_map_entry * __meminit
> +firmware_map_find_entry(u64 start, u64 end, const char *type)
> +{
> +	struct firmware_map_entry *entry;
> +
> +	spin_lock(&map_entries_lock);
> +	list_for_each_entry(entry, &map_entries, list)
> +		if ((entry->start == start) && (entry->end == end) &&
> +		    (!strcmp(entry->type, type))) {
> +			spin_unlock(&map_entries_lock);
> +			return entry;
> +		}
> +
> +	spin_unlock(&map_entries_lock);
> +	return NULL;
> +}
>
> ...
>
> +	entry = firmware_map_find_entry(start, end - 1, type);
> +	if (!entry)
> +		return -EINVAL;
> +
> +	firmware_map_remove_entry(entry);
>
> ...
>

The above code looks racy.  After firmware_map_find_entry() does the
spin_unlock() there is nothing to prevent a concurrent
firmware_map_remove_entry() from removing the entry, so the kernel ends
up calling firmware_map_remove_entry() twice against the same entry.

An easy fix for this is to hold the spinlock across the entire
lookup/remove operation.


This problem is inherent to firmware_map_find_entry() as you have
implemented it, so this function simply should not exist in the current
form - no caller can use it without being buggy!  A simple fix for this
is to remove the spin_lock()/spin_unlock() from
firmware_map_find_entry() and add locking documentation to
firmware_map_find_entry(), explaining that the caller must hold
map_entries_lock and must not release that lock until processing of
firmware_map_find_entry()'s return value has completed.
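The suggested shape can be sketched as follows. This is a hedged userspace analog, not the real drivers/firmware/memmap.c code: a pthread mutex stands in for the kernel spinlock, map_entries is a plain singly linked list, and firmware_map_add() is a helper invented here for illustration. The point it shows is that firmware_map_find_entry() takes no lock itself, and the caller holds map_entries_lock across both the lookup and the removal:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

struct firmware_map_entry {
	unsigned long long start, end;
	const char *type;
	struct firmware_map_entry *next;
};

static struct firmware_map_entry *map_entries;
static pthread_mutex_t map_entries_lock = PTHREAD_MUTEX_INITIALIZER;

static void firmware_map_add(struct firmware_map_entry *entry)
{
	pthread_mutex_lock(&map_entries_lock);
	entry->next = map_entries;
	map_entries = entry;
	pthread_mutex_unlock(&map_entries_lock);
}

/* No locking here: the caller must hold map_entries_lock and keep it
 * held until it is done with the returned entry. */
static struct firmware_map_entry *
firmware_map_find_entry(unsigned long long start, unsigned long long end,
			const char *type)
{
	struct firmware_map_entry *entry;

	for (entry = map_entries; entry; entry = entry->next)
		if (entry->start == start && entry->end == end &&
		    !strcmp(entry->type, type))
			return entry;
	return NULL;
}

/* Caller must hold map_entries_lock. */
static void firmware_map_remove_entry(struct firmware_map_entry *victim)
{
	struct firmware_map_entry **p;

	for (p = &map_entries; *p; p = &(*p)->next)
		if (*p == victim) {
			*p = victim->next;
			return;
		}
}

/* Lookup and removal happen in one critical section, so a concurrent
 * remover cannot free the entry between the find and the remove. */
int firmware_map_remove(unsigned long long start, unsigned long long end,
			const char *type)
{
	struct firmware_map_entry *entry;

	pthread_mutex_lock(&map_entries_lock);
	entry = firmware_map_find_entry(start, end - 1, type);
	if (entry)
		firmware_map_remove_entry(entry);
	pthread_mutex_unlock(&map_entries_lock);

	return entry ? 0 : -1;	/* -EINVAL in the kernel version */
}
```

With this structure a second remover simply finds nothing and returns an error, instead of operating on an entry that was freed under it.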

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (15 preceding siblings ...)
  2013-01-09 22:23 ` [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Andrew Morton
@ 2013-01-09 23:33 ` Andrew Morton
  2013-01-10  2:18   ` Tang Chen
  2013-01-29 12:52 ` Simon Jeons
  17 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2013-01-09 23:33 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

On Wed, 9 Jan 2013 17:32:24 +0800
Tang Chen <tangchen@cn.fujitsu.com> wrote:

> This patch-set aims to implement physical memory hot-removing.

As you were on the patch delivery path, all of these patches should have
your Signed-off-by:.  But some were missing it.  I fixed this in my
copy of the patches.


I suspect this patchset adds a significant amount of code which will
not be used if CONFIG_MEMORY_HOTPLUG=n.  "[PATCH v6 06/15]
memory-hotplug: implement register_page_bootmem_info_section of
sparse-vmemmap", for example.  This is not a good thing, so please go
through the patchset (in fact, go through all the memhotplug code) and
let's see if we can reduce the bloat for CONFIG_MEMORY_HOTPLUG=n
kernels.

This needn't be done immediately - it would be OK by me if you were to
defer this exercise until all the new memhotplug code is largely in
place.  But please, let's do it.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-09 22:23 ` [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Andrew Morton
@ 2013-01-10  2:17   ` Tang Chen
  2013-01-10  7:14     ` Glauber Costa
  0 siblings, 1 reply; 67+ messages in thread
From: Tang Chen @ 2013-01-10  2:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

Hi Andrew,

Thank you very much for pushing this forward. :)

On 01/10/2013 06:23 AM, Andrew Morton wrote:
>
> This does sound like a significant problem.  We should assume that
> memcg is available and in use.
>
>> In patch1, we provide a solution which is not good enough:
>> Iterate twice to offline the memory.
>> 1st iteration: offline every non-primary memory block.
>> 2nd iteration: offline the primary (i.e. first-added) memory block.
>
> Let's flesh this out a bit.
>
> If we online memory8, memory9, memory10 and memory11 then I'd have
> thought that they would need to offlined in reverse order, which will
> require four iterations, not two.  Is this wrong and if so, why?

Well, we may need more than two iterations if memory8, memory9 and
memory10 are all in use by the kernel, and 10 depends on 9, and 9
depends on 8.

So, as you can see, the iteration method is not good enough.

But this only happens when the memory is used by the kernel, in which
case it cannot be migrated. So if we use a boot option such as
movablecore_map, or the movable_online functionality, to limit the
memory to movable, the kernel will not use this memory, and node
hot-remove is safe.
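To make the pass counting concrete, here is a toy model (hypothetical, not kernel code) of the page_cgroup dependency: dep[i] names the block that stores block i's page cgroup, and a block can only be offlined once no online block still depends on it. A full chain 8<-9<-10<-11 costs one forward pass per link, matching the four iterations Andrew suspected; if everything depends only on the primary block, the two-pass scheme from patch1 is enough:

```c
#include <assert.h>

/* dep[i]: index of the block holding block i's page_cgroup, or -1.
 * Returns how many forward passes are needed to offline everything,
 * or -1 if the dependencies can never be resolved (a cycle). */
int passes_to_offline(const int *dep, int n)
{
	int online[16];
	int i, j, passes = 0, remaining = n;

	for (i = 0; i < n; i++)
		online[i] = 1;

	while (remaining > 0) {
		if (++passes > n)
			return -1;	/* dependency cycle */
		for (i = 0; i < n; i++) {
			int blocked = 0;

			if (!online[i])
				continue;
			/* a block stays online while any online block
			 * keeps its page_cgroup on it */
			for (j = 0; j < n; j++)
				if (online[j] && dep[j] == i)
					blocked = 1;
			if (!blocked) {
				online[i] = 0;
				remaining--;
			}
		}
	}
	return passes;
}
```

So the number of iterations is really the depth of the dependency chain, which is why a fixed two-pass loop is only a heuristic.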

>
> Also, what happens if we wish to offline only memory9?  Do we offline
> memory11 then memory10 then memory9 and then re-online memory10 and
> memory11?

In this case, offlining memory9 could fail if the user does it by
himself, for example via sysfs.

Here, however, we are in the memory hot-remove path. When we remove a
memory device, it automatically offlines all its pages, and this
happens in reverse order by itself.

And again, this is not good enough. We will figure out a reasonable way
to solve it soon.

>
>> And a new idea from Wen Congyang<wency@cn.fujitsu.com>  is:
>> allocate the memory from the memory block they are describing.
>
> Yes.
>
>> But we are not sure if it is OK to do so because there is no existing API
>> for it, and we need to move the page_cgroup memory allocation from
>> MEM_GOING_ONLINE to MEM_ONLINE.
>
> This all sounds solvable - can we proceed in this fashion?

Yes, we are in progress now.

>
>> And also, it may interfere with hugepages.
>
> Please provide full details on this problem.

It is not very clear yet; if I find something, I'll share it.

>
>> Note: if the memory provided by the memory device is used by the kernel, it
>> can't be offlined. It is not a bug.
>
> Right.  But how often does this happen in testing?  In other words,
> please provide an overall description of how well memory hot-remove is
> presently operating.  Is it reliable?  What is the success rate in
> real-world situations?

We test the hot-remove functionality mostly with movable_online used,
and the memory used by the kernel is not allowed to be removed.

We will do some tests on the kernel-memory offline cases and tell you
the test results soon.

And since we are trying out some other ways, I think the problem will
be solved soon.

> Are there precautions which the administrator
> can take to improve the success rate?

Administrators could use the movablecore_map boot option or the
movable_online functionality (which is already in the kernel) to limit
memory to movable and avoid this problem.

> What are the remaining problems
> and are there plans to address them?

For now, we will try to allocate page_cgroup from the memory block it
is describing. And all the other parts seem to work well now.

And we are still testing. If we have any problems, we will share them.

Thanks. :)


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-09 23:33 ` Andrew Morton
@ 2013-01-10  2:18   ` Tang Chen
  0 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-10  2:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

Hi Andrew,

On 01/10/2013 07:33 AM, Andrew Morton wrote:
> On Wed, 9 Jan 2013 17:32:24 +0800
> Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>
>> This patch-set aims to implement physical memory hot-removing.
>
> As you were on the patch delivery path, all of these patches should have
> your Signed-off-by:.  But some were missing it.  I fixed this in my
> copy of the patches.

Thank you very much for the help. Next time I'll add it myself.

>
>
> I suspect this patchset adds a significant amount of code which will
> not be used if CONFIG_MEMORY_HOTPLUG=n.  "[PATCH v6 06/15]
> memory-hotplug: implement register_page_bootmem_info_section of
> sparse-vmemmap", for example.  This is not a good thing, so please go
> through the patchset (in fact, go through all the memhotplug code) and
> let's see if we can reduce the bloat for CONFIG_MEMORY_HOTPLUG=n
> kernels.
>
> This needn't be done immediately - it would be OK by me if you were to
> defer this exercise until all the new memhotplug code is largely in
> place.  But please, let's do it.

OK, I'll check it when the page_cgroup problem is solved.

Thanks. :)

>
>
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
  2013-01-09 22:50   ` Andrew Morton
@ 2013-01-10  2:25     ` Tang Chen
  0 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-10  2:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

Hi Andrew,

On 01/10/2013 06:50 AM, Andrew Morton wrote:
> On Wed, 9 Jan 2013 17:32:29 +0800
> Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>
>> For removing memory, we need to remove the page table. But this depends on
>> the architecture, so the patch introduces arch_remove_memory() for removing
>> the page table. For now it only calls __remove_pages().
>>
>> Note: __remove_pages() is not implemented for some architectures
>>        (I don't know how to implement it for s390).
>
> Can this break the build for s390?

No, I don't think so. The arch_remove_memory() in s390 will only
return -EBUSY.

Thanks. :)

>
>
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
  2013-01-09 23:11   ` Andrew Morton
@ 2013-01-10  5:56     ` Tang Chen
  0 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-10  5:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

Hi Andrew,

On 01/10/2013 07:11 AM, Andrew Morton wrote:
> On Wed, 9 Jan 2013 17:32:26 +0800
> Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>
>> We remove the memory like this:
>> 1. lock memory hotplug
>> 2. offline a memory block
>> 3. unlock memory hotplug
>> 4. repeat 1-3 to offline all memory blocks
>> 5. lock memory hotplug
>> 6. remove memory(TODO)
>> 7. unlock memory hotplug
>>
>> All memory blocks must be offlined before removing memory. But we don't hold
>> the lock for the whole operation, so we should check whether all memory blocks
>> are offlined before step 6. Otherwise, the kernel may panic.
>
> Well, the obvious question is: why don't we hold lock_memory_hotplug()
> for all of steps 1-4?  Please send the reasons for this in a form which
> I can paste into the changelog.

In the changelog form:

Offlining a memory block and removing a memory device can be two
different operations. Users can offline some memory blocks without
removing the memory device. For this purpose, the kernel holds
lock_memory_hotplug() in __offline_pages(). To reuse that code for
memory hot-remove, we repeat steps 1-3 to offline all the memory
blocks, repeatedly locking and unlocking memory hotplug, rather than
holding the lock across the whole operation.
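As a sketch of that sequence (a hypothetical userspace model, not the kernel implementation; the `removable` array is invented here to model blocks whose pages cannot be offlined): each offline takes the hotplug lock individually, so the removal step must re-check, under the lock, that every block really went offline:

```c
#include <assert.h>
#include <pthread.h>

enum { MEM_OFFLINE, MEM_ONLINE };

static pthread_mutex_t mem_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;

/* removable[i] models whether block i's pages can actually be
 * offlined (e.g. not pinned by kernel allocations). */
int remove_memory(const int *removable, int *state, int nblocks)
{
	int i;

	/* steps 1-4: offline each block, taking the lock per block */
	for (i = 0; i < nblocks; i++) {
		pthread_mutex_lock(&mem_hotplug_lock);
		if (removable[i])
			state[i] = MEM_OFFLINE;
		pthread_mutex_unlock(&mem_hotplug_lock);
	}

	/* steps 5-7: verify everything is offline before removing,
	 * because the lock was dropped between the offlines above */
	pthread_mutex_lock(&mem_hotplug_lock);
	for (i = 0; i < nblocks; i++)
		if (state[i] == MEM_ONLINE) {
			pthread_mutex_unlock(&mem_hotplug_lock);
			return -1;	/* -EBUSY in the kernel */
		}
	/* ... the actual memory removal would happen here ... */
	pthread_mutex_unlock(&mem_hotplug_lock);
	return 0;
}
```

The final re-check is exactly the safeguard the patch adds: any block that failed to offline, or was re-onlined while the lock was dropped, aborts the removal.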

>
>
> Actually, I wonder if doing this would fix a race in the current
> remove_memory() repeat: loop.  That code does a
> find_memory_block_hinted() followed by offline_memory_block(), but
> afaict find_memory_block_hinted() only does a get_device().  Is the
> get_device() sufficiently strong to prevent problems if another thread
> concurrently offlines or otherwise alters this memory_block's state?

I think we already have memory_block->state_mutex to protect against
concurrent changes of a memory_block's state.

The find_memory_block_hinted() here is to find the memory_block
corresponding to the memory section we are dealing with.

Thanks. :)

>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs
  2013-01-09 22:49   ` Andrew Morton
@ 2013-01-10  6:07     ` Tang Chen
  0 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-10  6:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

Hi Andrew,

On 01/10/2013 06:49 AM, Andrew Morton wrote:
> On Wed, 9 Jan 2013 17:32:28 +0800
> Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>
>> When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
>> sysfs files are created. But there is no code to remove these files. The patch
>> implements the function to remove them.
>>
>> Note: The code does not free the firmware_map_entry, which is allocated by
>>        bootmem. So the patch introduces a memory leak. But I think the leak is
>>        very small, and it does not affect the system.
>
> Well that's bad.  Can we remember the address of that memory and then
> reuse the storage if/when the memory is re-added?  That at least puts an upper
> bound on the leak.

I think we can do this. I'll post a new patch to do so.
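The idea could look roughly like this userspace sketch (the names and structure here are made up for illustration, not the patch that was eventually posted): park removed entries on a retired list and hand the same storage back when the same range is hot-added again, so the leak is bounded by the number of distinct ranges ever removed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct fw_entry {
	unsigned long long start, end;
	struct fw_entry *next;
};

static struct fw_entry *retired;	/* entries we could not free */
static int allocations;			/* real allocations, for illustration */

/* Get an entry for [start, end]: reuse a retired one for the same
 * range if it exists, otherwise allocate a fresh one. */
struct fw_entry *fw_entry_get(unsigned long long start, unsigned long long end)
{
	struct fw_entry **p, *e;

	for (p = &retired; *p; p = &(*p)->next)
		if ((*p)->start == start && (*p)->end == end) {
			e = *p;
			*p = e->next;
			e->next = NULL;
			return e;
		}

	allocations++;
	e = malloc(sizeof(*e));
	e->start = start;
	e->end = end;
	e->next = NULL;
	return e;
}

/* On removal, remember the storage instead of leaking it outright. */
void fw_entry_retire(struct fw_entry *e)
{
	e->next = retired;
	retired = e;
}
```

Repeatedly removing and re-adding the same range then costs a single allocation, which is the upper bound Andrew asked for.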

Thanks. :)

>
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs
  2013-01-09 23:19   ` Andrew Morton
@ 2013-01-10  6:15     ` Tang Chen
  0 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-10  6:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

Hi Andrew,

On 01/10/2013 07:19 AM, Andrew Morton wrote:
>> ...
>>
>> +	entry = firmware_map_find_entry(start, end - 1, type);
>> +	if (!entry)
>> +		return -EINVAL;
>> +
>> +	firmware_map_remove_entry(entry);
>>
>> ...
>>
>
> The above code looks racy.  After firmware_map_find_entry() does the
> spin_unlock() there is nothing to prevent a concurrent
> firmware_map_remove_entry() from removing the entry, so the kernel ends
> up calling firmware_map_remove_entry() twice against the same entry.
>
> An easy fix for this is to hold the spinlock across the entire
> lookup/remove operation.
>
>
> This problem is inherent to firmware_map_find_entry() as you have
> implemented it, so this function simply should not exist in the current
> form - no caller can use it without being buggy!  A simple fix for this
> is to remove the spin_lock()/spin_unlock() from
> firmware_map_find_entry() and add locking documentation to
> firmware_map_find_entry(), explaining that the caller must hold
> map_entries_lock and must not release that lock until processing of
> firmware_map_find_entry()'s return value has completed.

Thank you for your advice, I'll fix it soon.

Since you have merged the patch-set, do I need to resend all of these
patches, or just send a fix based on the current ones?

Thanks. :)

>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-10  2:17   ` Tang Chen
@ 2013-01-10  7:14     ` Glauber Costa
  2013-01-10  7:31       ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 67+ messages in thread
From: Glauber Costa @ 2013-01-10  7:14 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, wujianguo,
	yinghai, laijs, linux-kernel, minchan.kim, Andrew Morton,
	linuxppc-dev

On 01/10/2013 06:17 AM, Tang Chen wrote:
>>> Note: if the memory provided by the memory device is used by the
>>> kernel, it
>>> can't be offlined. It is not a bug.
>>
>> Right.  But how often does this happen in testing?  In other words,
>> please provide an overall description of how well memory hot-remove is
>> presently operating.  Is it reliable?  What is the success rate in
>> real-world situations?
> 
> We test the hot-remove functionality mostly with movable_online used.
> And the memory used by kernel is not allowed to be removed.

Can you try doing this using cpusets configured to hardwall?
It is my understanding that the object allocators will try hard not to
allocate anything outside the walls defined by the cpuset. This means
that if you have one process per node, and they are hardwalled, your
kernel memory will be spread evenly across the machine. With a big
enough load, it should eventually be present in all blocks.

Another question I have for you: have you considered calling
shrink_slab() to try to deplete the caches and thereby free at least
the slab memory in the nodes that can't be offlined? Is it relevant?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-10  7:14     ` Glauber Costa
@ 2013-01-10  7:31       ` Kamezawa Hiroyuki
  2013-01-10  7:55         ` Glauber Costa
  0 siblings, 1 reply; 67+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-10  7:31 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	Andrew Morton, linuxppc-dev

(2013/01/10 16:14), Glauber Costa wrote:
> On 01/10/2013 06:17 AM, Tang Chen wrote:
>>>> Note: if the memory provided by the memory device is used by the
>>>> kernel, it
>>>> can't be offlined. It is not a bug.
>>>
>>> Right.  But how often does this happen in testing?  In other words,
>>> please provide an overall description of how well memory hot-remove is
>>> presently operating.  Is it reliable?  What is the success rate in
>>> real-world situations?
>>
>> We test the hot-remove functionality mostly with movable_online used.
>> And the memory used by kernel is not allowed to be removed.
>
> Can you try doing this using cpusets configured to hardwall ?
> It is my understanding that the object allocators will try hard not to
> allocate anything outside the walls defined by cpuset. Which means that
> if you have one process per node, and they are hardwalled, your kernel
> memory will be spread evenly among the machine. With a big enough load,
> they should eventually be present in all blocks.
>

I'm sorry, I couldn't catch your point.
Do you want to confirm whether cpuset can work well enough instead of
ZONE_MOVABLE? Or whether ZONE_MOVABLE will not work if it is used with
cpuset?


> Another question I have for you: Have you considering calling
> shrink_slab to try to deplete the caches and therefore free at least
> slab memory in the nodes that can't be offlined? Is it relevant?
>

At this stage, we are not considering calling shrink_slab(). We
require nearly 100% success at offlining memory when removing a DIMM.
That is my understanding.

IMHO, I don't think shrink_slab() can kill all objects in a node even
if some of them are caches. We need more study before doing that.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-10  7:31       ` Kamezawa Hiroyuki
@ 2013-01-10  7:55         ` Glauber Costa
  2013-01-10  8:23           ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 67+ messages in thread
From: Glauber Costa @ 2013-01-10  7:55 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	Andrew Morton, linuxppc-dev

On 01/10/2013 11:31 AM, Kamezawa Hiroyuki wrote:
> (2013/01/10 16:14), Glauber Costa wrote:
>> On 01/10/2013 06:17 AM, Tang Chen wrote:
>>>>> Note: if the memory provided by the memory device is used by the
>>>>> kernel, it
>>>>> can't be offlined. It is not a bug.
>>>>
>>>> Right.  But how often does this happen in testing?  In other words,
>>>> please provide an overall description of how well memory hot-remove is
>>>> presently operating.  Is it reliable?  What is the success rate in
>>>> real-world situations?
>>>
>>> We test the hot-remove functionality mostly with movable_online used.
>>> And the memory used by kernel is not allowed to be removed.
>>
>> Can you try doing this using cpusets configured to hardwall ?
>> It is my understanding that the object allocators will try hard not to
>> allocate anything outside the walls defined by cpuset. Which means that
>> if you have one process per node, and they are hardwalled, your kernel
>> memory will be spread evenly among the machine. With a big enough load,
>> they should eventually be present in all blocks.
>>
> 
> I'm sorry I couldn't catch your point.
> Do you want to confirm whether cpuset can work enough instead of
> ZONE_MOVABLE ?
> Or Do you want to confirm whether ZONE_MOVABLE will not work if it's
> used with cpuset ?
> 
> 
No, I am not proposing to use cpuset to tackle the problem. I am just
wondering if you would still have high success rates with cpusets in use
with hardwalls. This is just one example of a workload that would spread
kernel memory around quite heavily.

So this is just me trying to understand the limitations of the mechanism.

>> Another question I have for you: Have you considering calling
>> shrink_slab to try to deplete the caches and therefore free at least
>> slab memory in the nodes that can't be offlined? Is it relevant?
>>
> 
> At this stage, we don't consider to call shrink_slab(). We require
> nearly 100% success at offlining memory for removing DIMM.
> It's my understanding.
> 
Of course, this is indisputable.

> IMHO, I don't think shrink_slab() can kill all objects in a node even
> if they are some caches. We need more study for doing that.
> 

Indeed, shrink_slab() can only kill cached objects. They, however, are
usually a very big part of kernel memory. I wonder, though, whether in
case of failure it is worth trying at least one shrink pass before
giving up.

It is not very different from what is in memory-failure.c, except that
we could do better and do more targeted shrinking (support for that
is being worked on).

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-10  7:55         ` Glauber Costa
@ 2013-01-10  8:23           ` Kamezawa Hiroyuki
  2013-01-10  8:36             ` Glauber Costa
  0 siblings, 1 reply; 67+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-10  8:23 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	Andrew Morton, linuxppc-dev

(2013/01/10 16:55), Glauber Costa wrote:
> On 01/10/2013 11:31 AM, Kamezawa Hiroyuki wrote:
>> (2013/01/10 16:14), Glauber Costa wrote:
>>> On 01/10/2013 06:17 AM, Tang Chen wrote:
>>>>>> Note: if the memory provided by the memory device is used by the
>>>>>> kernel, it
>>>>>> can't be offlined. It is not a bug.
>>>>>
>>>>> Right.  But how often does this happen in testing?  In other words,
>>>>> please provide an overall description of how well memory hot-remove is
>>>>> presently operating.  Is it reliable?  What is the success rate in
>>>>> real-world situations?
>>>>
>>>> We test the hot-remove functionality mostly with movable_online used.
>>>> And the memory used by kernel is not allowed to be removed.
>>>
>>> Can you try doing this using cpusets configured to hardwall ?
>>> It is my understanding that the object allocators will try hard not to
>>> allocate anything outside the walls defined by cpuset. Which means that
>>> if you have one process per node, and they are hardwalled, your kernel
>>> memory will be spread evenly among the machine. With a big enough load,
>>> they should eventually be present in all blocks.
>>>
>>
>> I'm sorry I couldn't catch your point.
>> Do you want to confirm whether cpuset can work enough instead of
>> ZONE_MOVABLE ?
>> Or Do you want to confirm whether ZONE_MOVABLE will not work if it's
>> used with cpuset ?
>>
>>
> No, I am not proposing to use cpuset to tackle the problem. I am just
> wondering if you would still have high success rates with cpusets in use
> with hardwalls. This is just one example of a workload that would spread
> kernel memory around quite heavily.
>
> So this is just me trying to understand the limitations of the mechanism.
>

Hm, okay. In my understanding, if the whole memory of a node is configured as
MOVABLE, no kernel memory will be allocated in the node because the zonelist
will not match. So, if cpuset is used with hardwalls, the user will see -ENOMEM
or OOM, I guess. Even fork() will fail if falling back to another node is not allowed.

If it's configured as ZONE_NORMAL, you need to pray when offlining memory.

AFAIK, IBM's ppc? has a 16MB section size. So, some sections can be offlined
even if they are configured as ZONE_NORMAL. For them, the placement of offlined
memory is not important because it's virtualized by LPAR; they don't try
to remove a DIMM, they just want to increase/decrease the amount of memory.
It's another approach.

But here, we (Fujitsu) try to remove a system board/DIMM.
So, we configure the whole memory of a node as ZONE_MOVABLE and try to guarantee
that the DIMM is removable.

>> IMHO, I don't think shrink_slab() can kill all objects in a node even
>> if some of them are caches. We need more study before doing that.
>>
>
> Indeed, shrink_slab can only kill cached objects. They, however, are
> usually a very big part of kernel memory. I wonder though if in case of
> failure, it is worth it to try at least one shrink pass before you give up.
>

Yeah, for now, his (our) approach never allows kernel memory on a node to be
hot-removed, by using ZONE_MOVABLE. So, shrink_slab()'s effect will not be seen.

If other brave guys try to use ZONE_NORMAL for a hot-pluggable DIMM, I agree,
it's worth trying.

How about checking whether the target memsection is in NORMAL or in MOVABLE at
hot-remove time? If NORMAL, shrink_slab() will be worth calling.

BTW, is shrink_slab() node/zone aware now? If not, fixing that first would be
a better direction, I guess.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-10  8:23           ` Kamezawa Hiroyuki
@ 2013-01-10  8:36             ` Glauber Costa
  2013-01-10  8:39               ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 67+ messages in thread
From: Glauber Costa @ 2013-01-10  8:36 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	Andrew Morton, linuxppc-dev


> If it's configured as ZONE_NORMAL, you need to pray when offlining memory.
> 
> AFAIK, IBM's ppc? has a 16MB section size. So, some sections can be
> offlined
> even if they are configured as ZONE_NORMAL. For them, the placement of offlined
> memory is not important because it's virtualized by LPAR; they don't try
> to remove a DIMM, they just want to increase/decrease the amount of memory.
> It's another approach.
> 
> But here, we (Fujitsu) try to remove a system board/DIMM.
> So, we configure the whole memory of a node as ZONE_MOVABLE and try to
> guarantee
> that the DIMM is removable.
> 
>>> IMHO, I don't think shrink_slab() can kill all objects in a node even
>>> if some of them are caches. We need more study before doing that.
>>>
>>
>> Indeed, shrink_slab can only kill cached objects. They, however, are
>> usually a very big part of kernel memory. I wonder though if in case of
>> failure, it is worth it to try at least one shrink pass before you
>> give up.
>>
> 
> Yeah, for now, his (our) approach never allows kernel memory on a node
> to be
> hot-removed, by using ZONE_MOVABLE. So, shrink_slab()'s effect will not be seen.

OK, that clarifies it for me.
> 
> If other brave guys try to use ZONE_NORMAL for a hot-pluggable DIMM, I agree,
> it's worth trying.
> 
I was under the impression that this was being done here.

> How about checking whether the target memsection is in NORMAL or in MOVABLE at
> hot-remove time? If NORMAL, shrink_slab() will be worth calling.
> 
Yes, this is what I meant. I think there is value in investigating this,
since for a lot of workloads, much of the kernel memory will consist of
shrinkable cached memory. It would provide you with the same level of
guarantees (zero), but could improve the success rate (this is, of
course, a guess).


> BTW, is shrink_slab() node/zone aware now? If not, fixing that first would be
> a better direction, I guess.
> 
It is not upstream, but there are patches for this that I am already
using in my private tree.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-10  8:36             ` Glauber Costa
@ 2013-01-10  8:39               ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 67+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-10  8:39 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	Andrew Morton, linuxppc-dev

(2013/01/10 17:36), Glauber Costa wrote:
  
>> BTW, is shrink_slab() node/zone aware now? If not, fixing that first would be
>> a better direction, I guess.
>>
> It is not upstream, but there are patches for this that I am already
> using in my private tree.
>

Oh, I see. Once it's merged, it will be worth adding "shrink_slab() if ZONE_NORMAL"
code.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (16 preceding siblings ...)
  2013-01-09 23:33 ` Andrew Morton
@ 2013-01-29 12:52 ` Simon Jeons
  2013-01-30  2:32   ` Tang Chen
  2013-01-30 10:15   ` Tang Chen
  17 siblings, 2 replies; 67+ messages in thread
From: Simon Jeons @ 2013-01-29 12:52 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Tang,

On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> Here is the physical memory hot-remove patch-set based on 3.8rc-2.

Some questions for you; they are not directly related to this patchset, but
are about memory hotplug in general.

1. In function node_states_check_changes_online:

comments:
* If we don't have HIGHMEM nor movable node,
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.

How should I understand this? Why, when we have neither HIGHMEM nor a movable
node, does node_states[N_NORMAL_MEMORY] contain nodes with zones
0...ZONE_MOVABLE? IIUC, N_NORMAL_MEMORY only means the node has regular memory.

* If we don't have movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_MOVABLE,
* set zone_last to ZONE_MOVABLE.

How should I understand this one?

2. In function move_pfn_range_left, why is end <= z2->zone_start_pfn not
allowed? The comments say the ranges must include/overlap; why?

3. In function online_pages, in the normal case (w/o online_kernel or
online_movable), why not check whether the new zone overlaps with adjacent
zones?

4. Could you summarize the differences in implementation between hot-add and
logical add, and between hot-remove and logical remove?


> 
> This patch-set aims to implement physical memory hot-removing.
> 
> The patches can free/remove the following things:
> 
>   - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>   - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
>   - page table of removed memory              : [RFC PATCH 7,8,10/15]
>   - node and related sysfs files              : [RFC PATCH 13-15/15]
> 
> 
> Existing problem:
> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> when we online pages.
> 
> For example: there is a memory device on node 1. The address range
> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> and memory11 under the directory /sys/devices/system/memory/.
> 
> If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
> cgroup is not provided by this memory device. But when we online memory9, the
> memory stored page cgroup may be provided by memory8. So we can't offline
> memory8 now. We should offline the memory in the reversed order.
> 
> When the memory device is hotremoved, we will auto offline memory provided
> by this memory device. But we don't know which memory is onlined first, so
> offlining memory may fail.
> 
> In patch1, we provide a solution which is not good enough:
> Iterate twice to offline the memory.
> 1st iterate: offline every non primary memory block.
> 2nd iterate: offline primary (i.e. first added) memory block.
> 
> And a new idea from Wen Congyang <wency@cn.fujitsu.com> is:
> allocate the memory from the memory block they are describing.
> 
> But we are not sure if it is OK to do so because there is not existing API
> to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE
> to MEM_ONLINE. And also, it may interfere the hugepage.
> 
> 
> 
> How to test this patchset?
> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
>    ACPI_HOTPLUG_MEMORY must be selected.
> 2. load the module acpi_memhotplug
> 3. hotplug the memory device(it depends on your hardware)
>    You will see the memory device under the directory /sys/bus/acpi/devices/.
>    Its name is PNP0C80:XX.
> 4. online/offline pages provided by this memory device
>    You can write online/offline to /sys/devices/system/memory/memoryX/state to
>    online/offline pages provided by this memory device
> 5. hotremove the memory device
>    You can hotremove the memory device by the hardware, or writing 1 to
>    /sys/bus/acpi/devices/PNP0C80:XX/eject.

Is there a similar knob to hot-add the memory device?

> 
> 
> Note: if the memory provided by the memory device is used by the kernel, it
> can't be offlined. It is not a bug.
> 
> 
> Changelogs from v5 to v6:
>  Patch3: Add some more comments to explain memory hot-remove.
>  Patch4: Remove bootmem member in struct firmware_map_entry.
>  Patch6: Repeatedly register bootmem pages when using hugepage.
>  Patch8: Repeatedly free bootmem pages when using hugepage.
>  Patch14: Don't free pgdat when offlining a node, just reset it to 0.
>  Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
>           one when online a node.
> 
> Changelogs from v4 to v5:
>  Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
>          avoid disabling irq because we need to flush the tlb when freeing pagetables.
>  Patch8: new patch, pick up some common APIs that are used to free direct mapping
>          and vmemmap pagetables.
>  Patch9: free direct mapping pagetables on x86_64 arch.
>  Patch10: free vmemmap pagetables.
>  Patch11: since freeing memmap with vmemmap has been implemented, the config
>           macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
>           no longer needed.
>  Patch13: no need to modify acpi_memory_disable_device() since it was removed,
>           and add nid parameter when calling remove_memory().
> 
> Changelogs from v3 to v4:
>  Patch7: remove unused codes.
>  Patch8: fix nr_pages that is passed to free_map_bootmem()
> 
> Changelogs from v2 to v3:
>  Patch9: call sync_global_pgds() if pgd is changed
>  Patch10: fix a problem in the patch
> 
> Changelogs from v1 to v2:
>  Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
>          memory block. 2nd iterate: offline primary (i.e. first added) memory
>          block.
> 
>  Patch3: new patch, no logical change, just remove redundant codes.
> 
>  Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
>          after the pagetable is changed.
> 
>  Patch12: new patch, free node_data when a node is offlined.
> 
> 
> Tang Chen (6):
>   memory-hotplug: move pgdat_resize_lock into
>     sparse_remove_one_section()
>   memory-hotplug: remove page table of x86_64 architecture
>   memory-hotplug: remove memmap of sparse-vmemmap
>   memory-hotplug: Integrated __remove_section() of
>     CONFIG_SPARSEMEM_VMEMMAP.
>   memory-hotplug: remove sysfs file of node
>   memory-hotplug: Do not allocate pdgat if it was not freed when
>     offline.
> 
> Wen Congyang (5):
>   memory-hotplug: try to offline the memory twice to avoid dependence
>   memory-hotplug: remove redundant codes
>   memory-hotplug: introduce new function arch_remove_memory() for
>     removing page table depends on architecture
>   memory-hotplug: Common APIs to support page tables hot-remove
>   memory-hotplug: free node_data when a node is offlined
> 
> Yasuaki Ishimatsu (4):
>   memory-hotplug: check whether all memory blocks are offlined or not
>     when removing memory
>   memory-hotplug: remove /sys/firmware/memmap/X sysfs
>   memory-hotplug: implement register_page_bootmem_info_section of
>     sparse-vmemmap
>   memory-hotplug: memory_hotplug: clear zone when removing the memory
> 
>  arch/arm64/mm/mmu.c                  |    3 +
>  arch/ia64/mm/discontig.c             |   10 +
>  arch/ia64/mm/init.c                  |   18 ++
>  arch/powerpc/mm/init_64.c            |   10 +
>  arch/powerpc/mm/mem.c                |   12 +
>  arch/s390/mm/init.c                  |   12 +
>  arch/s390/mm/vmem.c                  |   10 +
>  arch/sh/mm/init.c                    |   17 ++
>  arch/sparc/mm/init_64.c              |   10 +
>  arch/tile/mm/init.c                  |    8 +
>  arch/x86/include/asm/pgtable_types.h |    1 +
>  arch/x86/mm/init_32.c                |   12 +
>  arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
>  arch/x86/mm/pageattr.c               |   47 ++--
>  drivers/acpi/acpi_memhotplug.c       |    8 +-
>  drivers/base/memory.c                |    6 +
>  drivers/firmware/memmap.c            |   96 +++++++-
>  include/linux/bootmem.h              |    1 +
>  include/linux/firmware-map.h         |    6 +
>  include/linux/memory_hotplug.h       |   15 +-
>  include/linux/mm.h                   |    4 +-
>  mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
>  mm/sparse.c                          |    8 +-
>  23 files changed, 1094 insertions(+), 69 deletions(-)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
  2013-01-09  9:32 ` [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove Tang Chen
@ 2013-01-29 13:02   ` Simon Jeons
  2013-01-30  1:53     ` Jianguo Wu
  2013-01-29 13:04   ` Simon Jeons
  2013-02-04 23:04   ` Andrew Morton
  2 siblings, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-01-29 13:02 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Tang,
On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> When memory is removed, the corresponding pagetables should also be removed.
> This patch introduces some common APIs to support vmemmap pagetable and x86_64
> architecture pagetable removing.
> 

When is the page table of hot-added memory created?

> All pages of the virtual mapping in removed memory cannot be freed if some pages
> used as PGD/PUD include not only removed memory but also other memory. So the
> patch uses the following way to check whether a page can be freed or not.
> 
>  1. When removing memory, the page structs of the removed memory are filled
>     with 0xFD.
>  2. When all page structs on a PT/PMD are filled with 0xFD, the PT/PMD can be
>     cleared. In this case, the page used as the PT/PMD can be freed.
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |    1 +
>  arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
>  arch/x86/mm/pageattr.c               |   47 +++---
>  include/linux/bootmem.h              |    1 +
>  4 files changed, 326 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 3c32db8..4b6fd2a 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>   * as a pte too.
>   */
>  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>  
>  #endif	/* !__ASSEMBLY__ */
>  
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 9ac1723..fe01116 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
>  }
>  EXPORT_SYMBOL_GPL(arch_add_memory);
>  
> +#define PAGE_INUSE 0xFD
> +
> +static void __meminit free_pagetable(struct page *page, int order)
> +{
> +	struct zone *zone;
> +	bool bootmem = false;
> +	unsigned long magic;
> +	unsigned int nr_pages = 1 << order;
> +
> +	/* bootmem page has reserved flag */
> +	if (PageReserved(page)) {
> +		__ClearPageReserved(page);
> +		bootmem = true;
> +
> +		magic = (unsigned long)page->lru.next;
> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> +			while (nr_pages--)
> +				put_page_bootmem(page++);
> +		} else
> +			__free_pages_bootmem(page, order);
> +	} else
> +		free_pages((unsigned long)page_address(page), order);
> +
> +	/*
> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
> +	 * are all allocated by bootmem.
> +	 */
> +	if (bootmem) {
> +		zone = page_zone(page);
> +		zone_span_writelock(zone);
> +		zone->present_pages += nr_pages;
> +		zone_span_writeunlock(zone);
> +		totalram_pages += nr_pages;
> +	}
> +}
> +
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +	pte_t *pte;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = pte_start + i;
> +		if (pte_val(*pte))
> +			return;
> +	}
> +
> +	/* free a pte table */
> +	free_pagetable(pmd_page(*pmd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pmd_clear(pmd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +	pmd_t *pmd;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		pmd = pmd_start + i;
> +		if (pmd_val(*pmd))
> +			return;
> +	}
> +
> +	/* free a pmd table */
> +	free_pagetable(pud_page(*pud), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pud_clear(pud);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +/* Return true if pgd is changed, otherwise return false. */
> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> +{
> +	pud_t *pud;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		pud = pud_start + i;
> +		if (pud_val(*pud))
> +			return false;
> +	}
> +
> +	/* free a pud table */
> +	free_pagetable(pgd_page(*pgd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pgd_clear(pgd);
> +	spin_unlock(&init_mm.page_table_lock);
> +
> +	return true;
> +}
> +
> +static void __meminit
> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long next, pages = 0;
> +	pte_t *pte;
> +	void *page_addr;
> +	phys_addr_t phys_addr;
> +
> +	pte = pte_start + pte_index(addr);
> +	for (; addr < end; addr = next, pte++) {
> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> +		if (next > end)
> +			next = end;
> +
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		/*
> +		 * We mapped [0,1G) memory as identity mapping when
> +		 * initializing, in arch/x86/kernel/head_64.S. These
> +		 * pagetables cannot be removed.
> +		 */
> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
> +		if (phys_addr < (phys_addr_t)0x40000000)
> +			return;
> +
> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> +			if (!direct) {
> +				free_pagetable(pte_page(*pte), 0);
> +				pages++;
> +			}
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pte_clear(&init_mm, addr, pte);
> +			spin_unlock(&init_mm.page_table_lock);
> +		} else {
> +			/*
> +			 * If we are not removing the whole page, it means
> +			 * other ptes in this page are being used and we cannot
> +			 * remove them. So fill the unused ptes with 0xFD, and
> +			 * remove the page when it is wholly filled with 0xFD.
> +			 */
> +			memset((void *)addr, PAGE_INUSE, next - addr);
> +			page_addr = page_address(pte_page(*pte));
> +
> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> +				free_pagetable(pte_page(*pte), 0);
> +				pages++;
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pte_clear(&init_mm, addr, pte);
> +				spin_unlock(&init_mm.page_table_lock);
> +			}
> +		}
> +	}
> +
> +	/* Call free_pte_table() in remove_pmd_table(). */
> +	flush_tlb_all();
> +	if (direct)
> +		update_page_count(PG_LEVEL_4K, -pages);
> +}
> +
> +static void __meminit
> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long pte_phys, next, pages = 0;
> +	pte_t *pte_base;
> +	pmd_t *pmd;
> +
> +	pmd = pmd_start + pmd_index(addr);
> +	for (; addr < end; addr = next, pmd++) {
> +		next = pmd_addr_end(addr, end);
> +
> +		if (!pmd_present(*pmd))
> +			continue;
> +
> +		if (pmd_large(*pmd)) {
> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> +			    IS_ALIGNED(next, PMD_SIZE)) {
> +				if (!direct) {
> +					free_pagetable(pmd_page(*pmd),
> +						       get_order(PMD_SIZE));
> +					pages++;
> +				}
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pmd_clear(pmd);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We use 2M page, but we need to remove part of them,
> +			 * so split 2M page to 4K page.
> +			 */
> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
> +			BUG_ON(!pte_base);
> +			__split_large_page((pte_t *)pmd, addr,
> +					   (pte_t *)pte_base);
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> +			spin_unlock(&init_mm.page_table_lock);
> +
> +			flush_tlb_all();
> +		}
> +
> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> +		remove_pte_table(pte_base, addr, next, direct);
> +		free_pte_table(pte_base, pmd);
> +		unmap_low_page(pte_base);
> +	}
> +
> +	/* Call free_pmd_table() in remove_pud_table(). */
> +	if (direct)
> +		update_page_count(PG_LEVEL_2M, -pages);
> +}
> +
> +static void __meminit
> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long pmd_phys, next, pages = 0;
> +	pmd_t *pmd_base;
> +	pud_t *pud;
> +
> +	pud = pud_start + pud_index(addr);
> +	for (; addr < end; addr = next, pud++) {
> +		next = pud_addr_end(addr, end);
> +
> +		if (!pud_present(*pud))
> +			continue;
> +
> +		if (pud_large(*pud)) {
> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
> +			    IS_ALIGNED(next, PUD_SIZE)) {
> +				if (!direct) {
> +					free_pagetable(pud_page(*pud),
> +						       get_order(PUD_SIZE));
> +					pages++;
> +				}
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pud_clear(pud);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We use 1G page, but we need to remove part of them,
> +			 * so split 1G page to 2M page.
> +			 */
> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> +			BUG_ON(!pmd_base);
> +			__split_large_page((pte_t *)pud, addr,
> +					   (pte_t *)pmd_base);
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> +			spin_unlock(&init_mm.page_table_lock);
> +
> +			flush_tlb_all();
> +		}
> +
> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> +		remove_pmd_table(pmd_base, addr, next, direct);
> +		free_pmd_table(pmd_base, pud);
> +		unmap_low_page(pmd_base);
> +	}
> +
> +	if (direct)
> +		update_page_count(PG_LEVEL_1G, -pages);
> +}
> +
> +/* start and end are both virtual address. */
> +static void __meminit
> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	bool pgd_changed = false;
> +
> +	for (; start < end; start = next) {
> +		pgd = pgd_offset_k(start);
> +		if (!pgd_present(*pgd))
> +			continue;
> +
> +		next = pgd_addr_end(start, end);
> +
> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> +		remove_pud_table(pud, start, next, direct);
> +		if (free_pud_table(pud, pgd))
> +			pgd_changed = true;
> +		unmap_low_page(pud);
> +	}
> +
> +	if (pgd_changed)
> +		sync_global_pgds(start, end - 1);
> +
> +	flush_tlb_all();
> +}
> +
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  int __ref arch_remove_memory(u64 start, u64 size)
>  {
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index a718e0d..7dcb6f9 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -501,21 +501,13 @@ out_unlock:
>  	return do_split;
>  }
>  
> -static int split_large_page(pte_t *kpte, unsigned long address)
> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>  {
>  	unsigned long pfn, pfninc = 1;
>  	unsigned int i, level;
> -	pte_t *pbase, *tmp;
> +	pte_t *tmp;
>  	pgprot_t ref_prot;
> -	struct page *base;
> -
> -	if (!debug_pagealloc)
> -		spin_unlock(&cpa_lock);
> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> -	if (!debug_pagealloc)
> -		spin_lock(&cpa_lock);
> -	if (!base)
> -		return -ENOMEM;
> +	struct page *base = virt_to_page(pbase);
>  
>  	spin_lock(&pgd_lock);
>  	/*
> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  	 * up for us already:
>  	 */
>  	tmp = lookup_address(address, &level);
> -	if (tmp != kpte)
> -		goto out_unlock;
> +	if (tmp != kpte) {
> +		spin_unlock(&pgd_lock);
> +		return 1;
> +	}
>  
> -	pbase = (pte_t *)page_address(base);
>  	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>  	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>  	/*
> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  	 * going on.
>  	 */
>  	__flush_tlb_all();
> +	spin_unlock(&pgd_lock);
>  
> -	base = NULL;
> +	return 0;
> +}
>  
> -out_unlock:
> -	/*
> -	 * If we dropped out via the lookup_address check under
> -	 * pgd_lock then stick the page back into the pool:
> -	 */
> -	if (base)
> +static int split_large_page(pte_t *kpte, unsigned long address)
> +{
> +	pte_t *pbase;
> +	struct page *base;
> +
> +	if (!debug_pagealloc)
> +		spin_unlock(&cpa_lock);
> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> +	if (!debug_pagealloc)
> +		spin_lock(&cpa_lock);
> +	if (!base)
> +		return -ENOMEM;
> +
> +	pbase = (pte_t *)page_address(base);
> +	if (__split_large_page(kpte, address, pbase))
>  		__free_page(base);
> -	spin_unlock(&pgd_lock);
>  
>  	return 0;
>  }
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index 3f778c2..190ff06 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
>  			      unsigned long size);
>  extern void free_bootmem(unsigned long physaddr, unsigned long size);
>  extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
>  
>  /*
>   * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
  2013-01-09  9:32 ` [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove Tang Chen
  2013-01-29 13:02   ` Simon Jeons
@ 2013-01-29 13:04   ` Simon Jeons
  2013-01-30  2:16     ` Tang Chen
  2013-02-04 23:04   ` Andrew Morton
  2 siblings, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-01-29 13:04 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Tang,
On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> When memory is removed, the corresponding pagetables should also be removed.
> This patch introduces some common APIs to support vmemmap pagetable and x86_64
> architecture pagetable removing.

Why don't we need to call build_all_zonelists, as online_pages does, during the
hot-add path (add_memory)?

> 
> All pages of the virtual mapping in removed memory cannot be freed if some pages
> used as PGD/PUD include not only removed memory but also other memory. So the
> patch uses the following way to check whether a page can be freed or not.
> 
>  1. When removing memory, the page structs of the removed memory are filled
>     with 0xFD.
>  2. When all page structs on a PT/PMD are filled with 0xFD, the PT/PMD can be
>     cleared. In this case, the page used as the PT/PMD can be freed.
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |    1 +
>  arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
>  arch/x86/mm/pageattr.c               |   47 +++---
>  include/linux/bootmem.h              |    1 +
>  4 files changed, 326 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 3c32db8..4b6fd2a 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>   * as a pte too.
>   */
>  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>  
>  #endif	/* !__ASSEMBLY__ */
>  
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 9ac1723..fe01116 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
>  }
>  EXPORT_SYMBOL_GPL(arch_add_memory);
>  
> +#define PAGE_INUSE 0xFD
> +
> +static void __meminit free_pagetable(struct page *page, int order)
> +{
> +	struct zone *zone;
> +	bool bootmem = false;
> +	unsigned long magic;
> +	unsigned int nr_pages = 1 << order;
> +
> +	/* bootmem page has reserved flag */
> +	if (PageReserved(page)) {
> +		__ClearPageReserved(page);
> +		bootmem = true;
> +
> +		magic = (unsigned long)page->lru.next;
> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> +			while (nr_pages--)
> +				put_page_bootmem(page++);
> +		} else
> +			__free_pages_bootmem(page, order);
> +	} else
> +		free_pages((unsigned long)page_address(page), order);
> +
> +	/*
> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
> +	 * are all allocated by bootmem.
> +	 */
> +	if (bootmem) {
> +		zone = page_zone(page);
> +		zone_span_writelock(zone);
> +		zone->present_pages += nr_pages;
> +		zone_span_writeunlock(zone);
> +		totalram_pages += nr_pages;
> +	}
> +}
> +
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +	pte_t *pte;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = pte_start + i;
> +		if (pte_val(*pte))
> +			return;
> +	}
> +
> +	/* free a pte table */
> +	free_pagetable(pmd_page(*pmd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pmd_clear(pmd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +	pmd_t *pmd;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		pmd = pmd_start + i;
> +		if (pmd_val(*pmd))
> +			return;
> +	}
> +
> +	/* free a pmd table */
> +	free_pagetable(pud_page(*pud), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pud_clear(pud);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +/* Return true if pgd is changed, otherwise return false. */
> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> +{
> +	pud_t *pud;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		pud = pud_start + i;
> +		if (pud_val(*pud))
> +			return false;
> +	}
> +
> +	/* free a pud table */
> +	free_pagetable(pgd_page(*pgd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pgd_clear(pgd);
> +	spin_unlock(&init_mm.page_table_lock);
> +
> +	return true;
> +}
> +
> +static void __meminit
> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long next, pages = 0;
> +	pte_t *pte;
> +	void *page_addr;
> +	phys_addr_t phys_addr;
> +
> +	pte = pte_start + pte_index(addr);
> +	for (; addr < end; addr = next, pte++) {
> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> +		if (next > end)
> +			next = end;
> +
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		/*
> +		 * We mapped [0,1G) memory as identity mapping when
> +		 * initializing, in arch/x86/kernel/head_64.S. These
> +		 * pagetables cannot be removed.
> +		 */
> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
> +		if (phys_addr < (phys_addr_t)0x40000000)
> +			return;
> +
> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> +			if (!direct) {
> +				free_pagetable(pte_page(*pte), 0);
> +				pages++;
> +			}
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pte_clear(&init_mm, addr, pte);
> +			spin_unlock(&init_mm.page_table_lock);
> +		} else {
> +			/*
> +			 * If we are not removing the whole page, it means
> +			 * other ptes in this page are being used and we cannot
> +			 * remove them. So fill the unused ptes with 0xFD, and
> +			 * remove the page when it is wholly filled with 0xFD.
> +			 */
> +			memset((void *)addr, PAGE_INUSE, next - addr);
> +			page_addr = page_address(pte_page(*pte));
> +
> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> +				free_pagetable(pte_page(*pte), 0);
> +				pages++;
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pte_clear(&init_mm, addr, pte);
> +				spin_unlock(&init_mm.page_table_lock);
> +			}
> +		}
> +	}
> +
> +	/* Call free_pte_table() in remove_pmd_table(). */
> +	flush_tlb_all();
> +	if (direct)
> +		update_page_count(PG_LEVEL_4K, -pages);
> +}
> +
> +static void __meminit
> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long pte_phys, next, pages = 0;
> +	pte_t *pte_base;
> +	pmd_t *pmd;
> +
> +	pmd = pmd_start + pmd_index(addr);
> +	for (; addr < end; addr = next, pmd++) {
> +		next = pmd_addr_end(addr, end);
> +
> +		if (!pmd_present(*pmd))
> +			continue;
> +
> +		if (pmd_large(*pmd)) {
> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> +			    IS_ALIGNED(next, PMD_SIZE)) {
> +				if (!direct) {
> +					free_pagetable(pmd_page(*pmd),
> +						       get_order(PMD_SIZE));
> +					pages++;
> +				}
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pmd_clear(pmd);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We are using a 2M page, but need to remove part of
> +			 * it, so split the 2M page into 4K pages.
> +			 */
> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
> +			BUG_ON(!pte_base);
> +			__split_large_page((pte_t *)pmd, addr,
> +					   (pte_t *)pte_base);
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> +			spin_unlock(&init_mm.page_table_lock);
> +
> +			flush_tlb_all();
> +		}
> +
> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> +		remove_pte_table(pte_base, addr, next, direct);
> +		free_pte_table(pte_base, pmd);
> +		unmap_low_page(pte_base);
> +	}
> +
> +	/* Call free_pmd_table() in remove_pud_table(). */
> +	if (direct)
> +		update_page_count(PG_LEVEL_2M, -pages);
> +}
> +
> +static void __meminit
> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long pmd_phys, next, pages = 0;
> +	pmd_t *pmd_base;
> +	pud_t *pud;
> +
> +	pud = pud_start + pud_index(addr);
> +	for (; addr < end; addr = next, pud++) {
> +		next = pud_addr_end(addr, end);
> +
> +		if (!pud_present(*pud))
> +			continue;
> +
> +		if (pud_large(*pud)) {
> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
> +			    IS_ALIGNED(next, PUD_SIZE)) {
> +				if (!direct) {
> +					free_pagetable(pud_page(*pud),
> +						       get_order(PUD_SIZE));
> +					pages++;
> +				}
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pud_clear(pud);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We are using a 1G page, but need to remove part of
> +			 * it, so split the 1G page into 2M pages.
> +			 */
> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> +			BUG_ON(!pmd_base);
> +			__split_large_page((pte_t *)pud, addr,
> +					   (pte_t *)pmd_base);
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> +			spin_unlock(&init_mm.page_table_lock);
> +
> +			flush_tlb_all();
> +		}
> +
> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> +		remove_pmd_table(pmd_base, addr, next, direct);
> +		free_pmd_table(pmd_base, pud);
> +		unmap_low_page(pmd_base);
> +	}
> +
> +	if (direct)
> +		update_page_count(PG_LEVEL_1G, -pages);
> +}
> +
> +/* start and end are both virtual address. */
> +static void __meminit
> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	bool pgd_changed = false;
> +
> +	for (; start < end; start = next) {
> +		pgd = pgd_offset_k(start);
> +		if (!pgd_present(*pgd))
> +			continue;
> +
> +		next = pgd_addr_end(start, end);
> +
> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> +		remove_pud_table(pud, start, next, direct);
> +		if (free_pud_table(pud, pgd))
> +			pgd_changed = true;
> +		unmap_low_page(pud);
> +	}
> +
> +	if (pgd_changed)
> +		sync_global_pgds(start, end - 1);
> +
> +	flush_tlb_all();
> +}
> +
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  int __ref arch_remove_memory(u64 start, u64 size)
>  {
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index a718e0d..7dcb6f9 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -501,21 +501,13 @@ out_unlock:
>  	return do_split;
>  }
>  
> -static int split_large_page(pte_t *kpte, unsigned long address)
> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>  {
>  	unsigned long pfn, pfninc = 1;
>  	unsigned int i, level;
> -	pte_t *pbase, *tmp;
> +	pte_t *tmp;
>  	pgprot_t ref_prot;
> -	struct page *base;
> -
> -	if (!debug_pagealloc)
> -		spin_unlock(&cpa_lock);
> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> -	if (!debug_pagealloc)
> -		spin_lock(&cpa_lock);
> -	if (!base)
> -		return -ENOMEM;
> +	struct page *base = virt_to_page(pbase);
>  
>  	spin_lock(&pgd_lock);
>  	/*
> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  	 * up for us already:
>  	 */
>  	tmp = lookup_address(address, &level);
> -	if (tmp != kpte)
> -		goto out_unlock;
> +	if (tmp != kpte) {
> +		spin_unlock(&pgd_lock);
> +		return 1;
> +	}
>  
> -	pbase = (pte_t *)page_address(base);
>  	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>  	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>  	/*
> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  	 * going on.
>  	 */
>  	__flush_tlb_all();
> +	spin_unlock(&pgd_lock);
>  
> -	base = NULL;
> +	return 0;
> +}
>  
> -out_unlock:
> -	/*
> -	 * If we dropped out via the lookup_address check under
> -	 * pgd_lock then stick the page back into the pool:
> -	 */
> -	if (base)
> +static int split_large_page(pte_t *kpte, unsigned long address)
> +{
> +	pte_t *pbase;
> +	struct page *base;
> +
> +	if (!debug_pagealloc)
> +		spin_unlock(&cpa_lock);
> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> +	if (!debug_pagealloc)
> +		spin_lock(&cpa_lock);
> +	if (!base)
> +		return -ENOMEM;
> +
> +	pbase = (pte_t *)page_address(base);
> +	if (__split_large_page(kpte, address, pbase))
>  		__free_page(base);
> -	spin_unlock(&pgd_lock);
>  
>  	return 0;
>  }
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index 3f778c2..190ff06 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
>  			      unsigned long size);
>  extern void free_bootmem(unsigned long physaddr, unsigned long size);
>  extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
>  
>  /*
>   * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
  2013-01-29 13:02   ` Simon Jeons
@ 2013-01-30  1:53     ` Jianguo Wu
  2013-01-30  2:13       ` Simon Jeons
  0 siblings, 1 reply; 67+ messages in thread
From: Jianguo Wu @ 2013-01-30  1:53 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, glommer, yinghai, laijs, linux-kernel, minchan.kim,
	akpm, linuxppc-dev

On 2013/1/29 21:02, Simon Jeons wrote:

> Hi Tang,
> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> When memory is removed, the corresponding pagetables should also be removed.
>> This patch introduces some common APIs to support vmemmap pagetable and x86_64
>> architecture pagetable removal.
>>
> 
> When is the page table of hot-add memory created?


Hi Simon,

For x86_64, the page table of hot-add memory is created by:
    add_memory->arch_add_memory->init_memory_mapping->kernel_physical_mapping_init

> [remainder of quoted patch trimmed; see the full text earlier in the thread]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
  2013-01-30  1:53     ` Jianguo Wu
@ 2013-01-30  2:13       ` Simon Jeons
  0 siblings, 0 replies; 67+ messages in thread
From: Simon Jeons @ 2013-01-30  2:13 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, glommer, yinghai, laijs, linux-kernel, minchan.kim,
	akpm, linuxppc-dev

On Wed, 2013-01-30 at 09:53 +0800, Jianguo Wu wrote:
> On 2013/1/29 21:02, Simon Jeons wrote:
> 
> > Hi Tang,
> > On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> >> From: Wen Congyang <wency@cn.fujitsu.com>
> >>
> >> When memory is removed, the corresponding pagetables should also be removed.
> >> This patch introduces some common APIs to support vmemmap pagetable and x86_64
> >> architecture pagetable removal.
> >>
> > 
> > When is the page table of hot-add memory created?
> 
> 
> Hi Simon,
> 
> For x86_64, the page table of hot-add memory is created by:
>     add_memory->arch_add_memory->init_memory_mapping->kernel_physical_mapping_init

Yup, thanks. :)

> [remainder of quoted patch trimmed; see the full text earlier in the thread]
> >> +		}
> >> +
> >> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> >> +		remove_pte_table(pte_base, addr, next, direct);
> >> +		free_pte_table(pte_base, pmd);
> >> +		unmap_low_page(pte_base);
> >> +	}
> >> +
> >> +	/* Call free_pmd_table() in remove_pud_table(). */
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_2M, -pages);
> >> +}
> >> +
> >> +static void __meminit
> >> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> >> +		 bool direct)
> >> +{
> >> +	unsigned long pmd_phys, next, pages = 0;
> >> +	pmd_t *pmd_base;
> >> +	pud_t *pud;
> >> +
> >> +	pud = pud_start + pud_index(addr);
> >> +	for (; addr < end; addr = next, pud++) {
> >> +		next = pud_addr_end(addr, end);
> >> +
> >> +		if (!pud_present(*pud))
> >> +			continue;
> >> +
> >> +		if (pud_large(*pud)) {
> >> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
> >> +			    IS_ALIGNED(next, PUD_SIZE)) {
> >> +				if (!direct) {
> >> +					free_pagetable(pud_page(*pud),
> >> +						       get_order(PUD_SIZE));
> >> +					pages++;
> >> +				}
> >> +
> >> +				spin_lock(&init_mm.page_table_lock);
> >> +				pud_clear(pud);
> >> +				spin_unlock(&init_mm.page_table_lock);
> >> +				continue;
> >> +			}
> >> +
> >> +			/*
> >> +			 * We use 1G page, but we need to remove part of them,
> >> +			 * so split 1G page to 2M page.
> >> +			 */
> >> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> >> +			BUG_ON(!pmd_base);
> >> +			__split_large_page((pte_t *)pud, addr,
> >> +					   (pte_t *)pmd_base);
> >> +
> >> +			spin_lock(&init_mm.page_table_lock);
> >> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> >> +			spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +			flush_tlb_all();
> >> +		}
> >> +
> >> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> >> +		remove_pmd_table(pmd_base, addr, next, direct);
> >> +		free_pmd_table(pmd_base, pud);
> >> +		unmap_low_page(pmd_base);
> >> +	}
> >> +
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_1G, -pages);
> >> +}
> >> +
> >> +/* start and end are both virtual address. */
> >> +static void __meminit
> >> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> >> +{
> >> +	unsigned long next;
> >> +	pgd_t *pgd;
> >> +	pud_t *pud;
> >> +	bool pgd_changed = false;
> >> +
> >> +	for (; start < end; start = next) {
> >> +		pgd = pgd_offset_k(start);
> >> +		if (!pgd_present(*pgd))
> >> +			continue;
> >> +
> >> +		next = pgd_addr_end(start, end);
> >> +
> >> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> >> +		remove_pud_table(pud, start, next, direct);
> >> +		if (free_pud_table(pud, pgd))
> >> +			pgd_changed = true;
> >> +		unmap_low_page(pud);
> >> +	}
> >> +
> >> +	if (pgd_changed)
> >> +		sync_global_pgds(start, end - 1);
> >> +
> >> +	flush_tlb_all();
> >> +}
> >> +
> >>  #ifdef CONFIG_MEMORY_HOTREMOVE
> >>  int __ref arch_remove_memory(u64 start, u64 size)
> >>  {
> >> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> >> index a718e0d..7dcb6f9 100644
> >> --- a/arch/x86/mm/pageattr.c
> >> +++ b/arch/x86/mm/pageattr.c
> >> @@ -501,21 +501,13 @@ out_unlock:
> >>  	return do_split;
> >>  }
> >>  
> >> -static int split_large_page(pte_t *kpte, unsigned long address)
> >> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
> >>  {
> >>  	unsigned long pfn, pfninc = 1;
> >>  	unsigned int i, level;
> >> -	pte_t *pbase, *tmp;
> >> +	pte_t *tmp;
> >>  	pgprot_t ref_prot;
> >> -	struct page *base;
> >> -
> >> -	if (!debug_pagealloc)
> >> -		spin_unlock(&cpa_lock);
> >> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >> -	if (!debug_pagealloc)
> >> -		spin_lock(&cpa_lock);
> >> -	if (!base)
> >> -		return -ENOMEM;
> >> +	struct page *base = virt_to_page(pbase);
> >>  
> >>  	spin_lock(&pgd_lock);
> >>  	/*
> >> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>  	 * up for us already:
> >>  	 */
> >>  	tmp = lookup_address(address, &level);
> >> -	if (tmp != kpte)
> >> -		goto out_unlock;
> >> +	if (tmp != kpte) {
> >> +		spin_unlock(&pgd_lock);
> >> +		return 1;
> >> +	}
> >>  
> >> -	pbase = (pte_t *)page_address(base);
> >>  	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
> >>  	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
> >>  	/*
> >> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>  	 * going on.
> >>  	 */
> >>  	__flush_tlb_all();
> >> +	spin_unlock(&pgd_lock);
> >>  
> >> -	base = NULL;
> >> +	return 0;
> >> +}
> >>  
> >> -out_unlock:
> >> -	/*
> >> -	 * If we dropped out via the lookup_address check under
> >> -	 * pgd_lock then stick the page back into the pool:
> >> -	 */
> >> -	if (base)
> >> +static int split_large_page(pte_t *kpte, unsigned long address)
> >> +{
> >> +	pte_t *pbase;
> >> +	struct page *base;
> >> +
> >> +	if (!debug_pagealloc)
> >> +		spin_unlock(&cpa_lock);
> >> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >> +	if (!debug_pagealloc)
> >> +		spin_lock(&cpa_lock);
> >> +	if (!base)
> >> +		return -ENOMEM;
> >> +
> >> +	pbase = (pte_t *)page_address(base);
> >> +	if (__split_large_page(kpte, address, pbase))
> >>  		__free_page(base);
> >> -	spin_unlock(&pgd_lock);
> >>  
> >>  	return 0;
> >>  }
> >> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> >> index 3f778c2..190ff06 100644
> >> --- a/include/linux/bootmem.h
> >> +++ b/include/linux/bootmem.h
> >> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
> >>  			      unsigned long size);
> >>  extern void free_bootmem(unsigned long physaddr, unsigned long size);
> >>  extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> >> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
> >>  
> >>  /*
> >>   * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
  2013-01-29 13:04   ` Simon Jeons
@ 2013-01-30  2:16     ` Tang Chen
  2013-01-30  3:27       ` Simon Jeons
  0 siblings, 1 reply; 67+ messages in thread
From: Tang Chen @ 2013-01-30  2:16 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

On 01/29/2013 09:04 PM, Simon Jeons wrote:
> Hi Tang,
> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>> From: Wen Congyang<wency@cn.fujitsu.com>
>>
>> When memory is removed, the corresponding pagetables should also be removed.
>> This patch introduces some common APIs to support vmemmap pagetable and x86_64
>> architecture pagetable removing.
>
> Why don't need to build_all_zonelists like online_pages does during
> hot-add path(add_memory)?

Hi Simon,

As you said, build_all_zonelists() is done by online_pages(). When the
memory device is hot-added, we cannot use it; we can only use it when we
online the pages on it.

But we can online the pages as different types, kernel or movable (which
belong to different zones), and we can online part of the memory, not all
of it. So each time we online some pages, we should check whether we need
to update the zone list.

So I think that is why we do build_all_zonelists when online_pages.
(just my opinion)

Thanks. :)

>
>>
>> All pages of the virtual mapping in removed memory cannot be freed if some pages
>> used as PGD/PUD map not only the removed memory but also other memory. So the
>> patch uses the following way to check whether a page can be freed or not.
>>
>>   1. When removing memory, the page structs of the removed memory are filled
>>      with 0xFD.
>>   2. When all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD
>>      can be cleared. In this case, the page used as PT/PMD can be freed.
>>
>> Signed-off-by: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
>> Signed-off-by: Jianguo Wu<wujianguo@huawei.com>
>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>> ---
>>   arch/x86/include/asm/pgtable_types.h |    1 +
>>   arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
>>   arch/x86/mm/pageattr.c               |   47 +++---
>>   include/linux/bootmem.h              |    1 +
>>   4 files changed, 326 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
>> index 3c32db8..4b6fd2a 100644
>> --- a/arch/x86/include/asm/pgtable_types.h
>> +++ b/arch/x86/include/asm/pgtable_types.h
>> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>>    * as a pte too.
>>    */
>>   extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>>
>>   #endif	/* !__ASSEMBLY__ */
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 9ac1723..fe01116 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
>>   }
>>   EXPORT_SYMBOL_GPL(arch_add_memory);
>>
>> +#define PAGE_INUSE 0xFD
>> +
>> +static void __meminit free_pagetable(struct page *page, int order)
>> +{
>> +	struct zone *zone;
>> +	bool bootmem = false;
>> +	unsigned long magic;
>> +	unsigned int nr_pages = 1 << order;
>> +
>> +	/* bootmem page has reserved flag */
>> +	if (PageReserved(page)) {
>> +		__ClearPageReserved(page);
>> +		bootmem = true;
>> +
>> +		magic = (unsigned long)page->lru.next;
>> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
>> +			while (nr_pages--)
>> +				put_page_bootmem(page++);
>> +		} else
>> +			__free_pages_bootmem(page, order);
>> +	} else
>> +		free_pages((unsigned long)page_address(page), order);
>> +
>> +	/*
>> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
>> +	 * are all allocated by bootmem.
>> +	 */
>> +	if (bootmem) {
>> +		zone = page_zone(page);
>> +		zone_span_writelock(zone);
>> +		zone->present_pages += nr_pages;
>> +		zone_span_writeunlock(zone);
>> +		totalram_pages += nr_pages;
>> +	}
>> +}
>> +
>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
>> +{
>> +	pte_t *pte;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
>> +		pte = pte_start + i;
>> +		if (pte_val(*pte))
>> +			return;
>> +	}
>> +
>> +	/* free a pte table */
>> +	free_pagetable(pmd_page(*pmd), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pmd_clear(pmd);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>> +{
>> +	pmd_t *pmd;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
>> +		pmd = pmd_start + i;
>> +		if (pmd_val(*pmd))
>> +			return;
>> +	}
>> +
>> +	/* free a pmd table */
>> +	free_pagetable(pud_page(*pud), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pud_clear(pud);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +/* Return true if pgd is changed, otherwise return false. */
>> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
>> +{
>> +	pud_t *pud;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
>> +		pud = pud_start + i;
>> +		if (pud_val(*pud))
>> +			return false;
>> +	}
>> +
>> +	/* free a pud table */
>> +	free_pagetable(pgd_page(*pgd), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pgd_clear(pgd);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +
>> +	return true;
>> +}
>> +
>> +static void __meminit
>> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>> +		 bool direct)
>> +{
>> +	unsigned long next, pages = 0;
>> +	pte_t *pte;
>> +	void *page_addr;
>> +	phys_addr_t phys_addr;
>> +
>> +	pte = pte_start + pte_index(addr);
>> +	for (; addr < end; addr = next, pte++) {
>> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
>> +		if (next > end)
>> +			next = end;
>> +
>> +		if (!pte_present(*pte))
>> +			continue;
>> +
>> +		/*
>> +		 * We mapped [0,1G) memory as identity mapping when
>> +		 * initializing, in arch/x86/kernel/head_64.S. These
>> +		 * pagetables cannot be removed.
>> +		 */
>> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
>> +		if (phys_addr < (phys_addr_t)0x40000000)
>> +			return;
>> +
>> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
>> +		    IS_ALIGNED(next, PAGE_SIZE)) {
>> +			if (!direct) {
>> +				free_pagetable(pte_page(*pte), 0);
>> +				pages++;
>> +			}
>> +
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pte_clear(&init_mm, addr, pte);
>> +			spin_unlock(&init_mm.page_table_lock);
>> +		} else {
>> +			/*
>> +			 * If we are not removing the whole page, it means
>> +			 * other ptes in this page are being used and we cannot
>> +			 * remove them. So fill the unused ptes with 0xFD, and
>> +			 * remove the page when it is wholly filled with 0xFD.
>> +			 */
>> +			memset((void *)addr, PAGE_INUSE, next - addr);
>> +			page_addr = page_address(pte_page(*pte));
>> +
>> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
>> +				free_pagetable(pte_page(*pte), 0);
>> +				pages++;
>> +
>> +				spin_lock(&init_mm.page_table_lock);
>> +				pte_clear(&init_mm, addr, pte);
>> +				spin_unlock(&init_mm.page_table_lock);
>> +			}
>> +		}
>> +	}
>> +
>> +	/* Call free_pte_table() in remove_pmd_table(). */
>> +	flush_tlb_all();
>> +	if (direct)
>> +		update_page_count(PG_LEVEL_4K, -pages);
>> +}
>> +
>> +static void __meminit
>> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>> +		 bool direct)
>> +{
>> +	unsigned long pte_phys, next, pages = 0;
>> +	pte_t *pte_base;
>> +	pmd_t *pmd;
>> +
>> +	pmd = pmd_start + pmd_index(addr);
>> +	for (; addr < end; addr = next, pmd++) {
>> +		next = pmd_addr_end(addr, end);
>> +
>> +		if (!pmd_present(*pmd))
>> +			continue;
>> +
>> +		if (pmd_large(*pmd)) {
>> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
>> +			    IS_ALIGNED(next, PMD_SIZE)) {
>> +				if (!direct) {
>> +					free_pagetable(pmd_page(*pmd),
>> +						       get_order(PMD_SIZE));
>> +					pages++;
>> +				}
>> +
>> +				spin_lock(&init_mm.page_table_lock);
>> +				pmd_clear(pmd);
>> +				spin_unlock(&init_mm.page_table_lock);
>> +				continue;
>> +			}
>> +
>> +			/*
>> +			 * We use 2M page, but we need to remove part of them,
>> +			 * so split 2M page to 4K page.
>> +			 */
>> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
>> +			BUG_ON(!pte_base);
>> +			__split_large_page((pte_t *)pmd, addr,
>> +					   (pte_t *)pte_base);
>> +
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>> +			spin_unlock(&init_mm.page_table_lock);
>> +
>> +			flush_tlb_all();
>> +		}
>> +
>> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
>> +		remove_pte_table(pte_base, addr, next, direct);
>> +		free_pte_table(pte_base, pmd);
>> +		unmap_low_page(pte_base);
>> +	}
>> +
>> +	/* Call free_pmd_table() in remove_pud_table(). */
>> +	if (direct)
>> +		update_page_count(PG_LEVEL_2M, -pages);
>> +}
>> +
>> +static void __meminit
>> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
>> +		 bool direct)
>> +{
>> +	unsigned long pmd_phys, next, pages = 0;
>> +	pmd_t *pmd_base;
>> +	pud_t *pud;
>> +
>> +	pud = pud_start + pud_index(addr);
>> +	for (; addr < end; addr = next, pud++) {
>> +		next = pud_addr_end(addr, end);
>> +
>> +		if (!pud_present(*pud))
>> +			continue;
>> +
>> +		if (pud_large(*pud)) {
>> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
>> +			    IS_ALIGNED(next, PUD_SIZE)) {
>> +				if (!direct) {
>> +					free_pagetable(pud_page(*pud),
>> +						       get_order(PUD_SIZE));
>> +					pages++;
>> +				}
>> +
>> +				spin_lock(&init_mm.page_table_lock);
>> +				pud_clear(pud);
>> +				spin_unlock(&init_mm.page_table_lock);
>> +				continue;
>> +			}
>> +
>> +			/*
>> +			 * We use 1G page, but we need to remove part of them,
>> +			 * so split 1G page to 2M page.
>> +			 */
>> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
>> +			BUG_ON(!pmd_base);
>> +			__split_large_page((pte_t *)pud, addr,
>> +					   (pte_t *)pmd_base);
>> +
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pud_populate(&init_mm, pud, __va(pmd_phys));
>> +			spin_unlock(&init_mm.page_table_lock);
>> +
>> +			flush_tlb_all();
>> +		}
>> +
>> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
>> +		remove_pmd_table(pmd_base, addr, next, direct);
>> +		free_pmd_table(pmd_base, pud);
>> +		unmap_low_page(pmd_base);
>> +	}
>> +
>> +	if (direct)
>> +		update_page_count(PG_LEVEL_1G, -pages);
>> +}
>> +
>> +/* start and end are both virtual address. */
>> +static void __meminit
>> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
>> +{
>> +	unsigned long next;
>> +	pgd_t *pgd;
>> +	pud_t *pud;
>> +	bool pgd_changed = false;
>> +
>> +	for (; start < end; start = next) {
>> +		pgd = pgd_offset_k(start);
>> +		if (!pgd_present(*pgd))
>> +			continue;
>> +
>> +		next = pgd_addr_end(start, end);
>> +
>> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>> +		remove_pud_table(pud, start, next, direct);
>> +		if (free_pud_table(pud, pgd))
>> +			pgd_changed = true;
>> +		unmap_low_page(pud);
>> +	}
>> +
>> +	if (pgd_changed)
>> +		sync_global_pgds(start, end - 1);
>> +
>> +	flush_tlb_all();
>> +}
>> +
>>   #ifdef CONFIG_MEMORY_HOTREMOVE
>>   int __ref arch_remove_memory(u64 start, u64 size)
>>   {
>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
>> index a718e0d..7dcb6f9 100644
>> --- a/arch/x86/mm/pageattr.c
>> +++ b/arch/x86/mm/pageattr.c
>> @@ -501,21 +501,13 @@ out_unlock:
>>   	return do_split;
>>   }
>>
>> -static int split_large_page(pte_t *kpte, unsigned long address)
>> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>>   {
>>   	unsigned long pfn, pfninc = 1;
>>   	unsigned int i, level;
>> -	pte_t *pbase, *tmp;
>> +	pte_t *tmp;
>>   	pgprot_t ref_prot;
>> -	struct page *base;
>> -
>> -	if (!debug_pagealloc)
>> -		spin_unlock(&cpa_lock);
>> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>> -	if (!debug_pagealloc)
>> -		spin_lock(&cpa_lock);
>> -	if (!base)
>> -		return -ENOMEM;
>> +	struct page *base = virt_to_page(pbase);
>>
>>   	spin_lock(&pgd_lock);
>>   	/*
>> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>>   	 * up for us already:
>>   	 */
>>   	tmp = lookup_address(address, &level);
>> -	if (tmp != kpte)
>> -		goto out_unlock;
>> +	if (tmp != kpte) {
>> +		spin_unlock(&pgd_lock);
>> +		return 1;
>> +	}
>>
>> -	pbase = (pte_t *)page_address(base);
>>   	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>>   	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>>   	/*
>> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>>   	 * going on.
>>   	 */
>>   	__flush_tlb_all();
>> +	spin_unlock(&pgd_lock);
>>
>> -	base = NULL;
>> +	return 0;
>> +}
>>
>> -out_unlock:
>> -	/*
>> -	 * If we dropped out via the lookup_address check under
>> -	 * pgd_lock then stick the page back into the pool:
>> -	 */
>> -	if (base)
>> +static int split_large_page(pte_t *kpte, unsigned long address)
>> +{
>> +	pte_t *pbase;
>> +	struct page *base;
>> +
>> +	if (!debug_pagealloc)
>> +		spin_unlock(&cpa_lock);
>> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>> +	if (!debug_pagealloc)
>> +		spin_lock(&cpa_lock);
>> +	if (!base)
>> +		return -ENOMEM;
>> +
>> +	pbase = (pte_t *)page_address(base);
>> +	if (__split_large_page(kpte, address, pbase))
>>   		__free_page(base);
>> -	spin_unlock(&pgd_lock);
>>
>>   	return 0;
>>   }
>> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
>> index 3f778c2..190ff06 100644
>> --- a/include/linux/bootmem.h
>> +++ b/include/linux/bootmem.h
>> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
>>   			      unsigned long size);
>>   extern void free_bootmem(unsigned long physaddr, unsigned long size);
>>   extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
>> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
>>
>>   /*
>>    * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,


* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-29 12:52 ` Simon Jeons
@ 2013-01-30  2:32   ` Tang Chen
  2013-01-30  2:48     ` Simon Jeons
  2013-01-30 10:15   ` Tang Chen
  1 sibling, 1 reply; 67+ messages in thread
From: Tang Chen @ 2013-01-30  2:32 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

On 01/29/2013 08:52 PM, Simon Jeons wrote:
> Hi Tang,
>
> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>> Here is the physical memory hot-remove patch-set based on 3.8rc-2.

Hi Simon,

I'll summarize all the info and answer you later. :)

Thanks for asking. :)

>
> I have some questions for you; they are not related to this patchset, but
> concern memory hotplug in general.
>
> 1. In function node_states_check_changes_online:
>
> comments:
> * If we don't have HIGHMEM nor movable node,
> * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
>
> How should I understand this? Why, when we have neither HIGHMEM nor a
> movable node, does node_states[N_NORMAL_MEMORY] contain nodes with zones
> 0...ZONE_MOVABLE? IIUC, N_NORMAL_MEMORY only means the node has regular memory.
>
> * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> * contains nodes which have zones of 0...ZONE_MOVABLE,
> * set zone_last to ZONE_MOVABLE.
>
> How should I understand this?
>
> 2. In function move_pfn_range_left, why is end <= z2->zone_start_pfn not
> correct? The comments say the ranges must include/overlap. Why?
>
> 3. In function online_pages, in the normal case (without online_kernel or
> online_movable), why not check whether the new zone overlaps with adjacent
> zones?
>
> 4. Could you summarize the difference implementation between hot-add and
> logic-add, hot-remove and logic-remove?
>
>
>>
>> This patch-set aims to implement physical memory hot-removing.
>>
>> The patches can free/remove the following things:
>>
>>    - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>>    - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
>>    - page table of removed memory              : [RFC PATCH 7,8,10/15]
>>    - node and related sysfs files              : [RFC PATCH 13-15/15]
>>
>>
>> Existing problem:
>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>> when we online pages.
>>
>> For example: there is a memory device on node 1. The address range
>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>> and memory11 under the directory /sys/devices/system/memory/.
>>
>> If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
>> cgroup is not provided by this memory device. But when we online memory9, the
>> memory stored page cgroup may be provided by memory8. So we can't offline
>> memory8 now. We should offline the memory in the reversed order.
>>
>> When the memory device is hotremoved, we will auto offline memory provided
>> by this memory device. But we don't know which memory is onlined first, so
>> offlining memory may fail.
>>
>> In patch1, we provide a solution which is not good enough:
>> Iterate twice to offline the memory.
>> 1st iterate: offline every non primary memory block.
>> 2nd iterate: offline primary (i.e. first added) memory block.
>>
>> And a new idea from Wen Congyang<wency@cn.fujitsu.com>  is:
>> allocate the memory from the memory block they are describing.
>>
>> But we are not sure if it is OK to do so because there is not existing API
>> to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE
>> to MEM_ONLINE. And also, it may interfere the hugepage.
>>
>>
>>
>> How to test this patchset?
>> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
>>     ACPI_HOTPLUG_MEMORY must be selected.
>> 2. load the module acpi_memhotplug
>> 3. hotplug the memory device(it depends on your hardware)
>>     You will see the memory device under the directory /sys/bus/acpi/devices/.
>>     Its name is PNP0C80:XX.
>> 4. online/offline pages provided by this memory device
>>     You can write online/offline to /sys/devices/system/memory/memoryX/state to
>>     online/offline pages provided by this memory device
>> 5. hotremove the memory device
>>     You can hotremove the memory device by the hardware, or writing 1 to
>>     /sys/bus/acpi/devices/PNP0C80:XX/eject.
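
Step 4 can be rehearsed against a mock sysfs tree first (a sketch only:
$root stands in for /sys, so nothing real is onlined or offlined):

```shell
# Build a fake memory-block directory mirroring the real sysfs layout.
root=$(mktemp -d)
mkdir -p "$root/devices/system/memory/memory9"
echo offline > "$root/devices/system/memory/memory9/state"  # initial state

# Online the block the same way as the real interface, then read it back.
echo online > "$root/devices/system/memory/memory9/state"
cat "$root/devices/system/memory/memory9/state"
```

On real hardware the same writes go to /sys/devices/system/memory/memoryX/state,
and a write fails with EBUSY-style errors if the pages cannot be offlined.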
>
> Is there a similar sysfs knob to hot-add the memory device?
>
>>
>>
>> Note: if the memory provided by the memory device is used by the kernel, it
>> can't be offlined. It is not a bug.
>>
>>
>> Changelogs from v5 to v6:
>>   Patch3: Add some more comments to explain memory hot-remove.
>>   Patch4: Remove bootmem member in struct firmware_map_entry.
>>   Patch6: Repeatedly register bootmem pages when using hugepage.
>>   Patch8: Repeatedly free bootmem pages when using hugepage.
>>   Patch14: Don't free pgdat when offlining a node, just reset it to 0.
>>   Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
>>            one when online a node.
>>
>> Changelogs from v4 to v5:
>>   Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
>>           avoid disabling irq because we need flush tlb when free pagetables.
>>   Patch8: new patch, pick up some common APIs that are used to free direct mapping
>>           and vmemmap pagetables.
>>   Patch9: free direct mapping pagetables on x86_64 arch.
>>   Patch10: free vmemmap pagetables.
>>   Patch11: since freeing memmap with vmemmap has been implemented, the config
>>            macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
>>            no longer needed.
>>   Patch13: no need to modify acpi_memory_disable_device() since it was removed,
>>            and add nid parameter when calling remove_memory().
>>
>> Changelogs from v3 to v4:
>>   Patch7: remove unused codes.
>>   Patch8: fix nr_pages that is passed to free_map_bootmem()
>>
>> Changelogs from v2 to v3:
>>   Patch9: call sync_global_pgds() if pgd is changed
>>   Patch10: fix a problem int the patch
>>
>> Changelogs from v1 to v2:
>>   Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
>>           memory block. 2nd iterate: offline primary (i.e. first added) memory
>>           block.
>>
>>   Patch3: new patch, no logical change, just remove reduntant codes.
>>
>>   Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
>>           after the pagetable is changed.
>>
>>   Patch12: new patch, free node_data when a node is offlined.
>>
>>
>> Tang Chen (6):
>>    memory-hotplug: move pgdat_resize_lock into
>>      sparse_remove_one_section()
>>    memory-hotplug: remove page table of x86_64 architecture
>>    memory-hotplug: remove memmap of sparse-vmemmap
>>    memory-hotplug: Integrated __remove_section() of
>>      CONFIG_SPARSEMEM_VMEMMAP.
>>    memory-hotplug: remove sysfs file of node
>>    memory-hotplug: Do not allocate pdgat if it was not freed when
>>      offline.
>>
>> Wen Congyang (5):
>>    memory-hotplug: try to offline the memory twice to avoid dependence
>>    memory-hotplug: remove redundant codes
>>    memory-hotplug: introduce new function arch_remove_memory() for
>>      removing page table depends on architecture
>>    memory-hotplug: Common APIs to support page tables hot-remove
>>    memory-hotplug: free node_data when a node is offlined
>>
>> Yasuaki Ishimatsu (4):
>>    memory-hotplug: check whether all memory blocks are offlined or not
>>      when removing memory
>>    memory-hotplug: remove /sys/firmware/memmap/X sysfs
>>    memory-hotplug: implement register_page_bootmem_info_section of
>>      sparse-vmemmap
>>    memory-hotplug: memory_hotplug: clear zone when removing the memory
>>
>>   arch/arm64/mm/mmu.c                  |    3 +
>>   arch/ia64/mm/discontig.c             |   10 +
>>   arch/ia64/mm/init.c                  |   18 ++
>>   arch/powerpc/mm/init_64.c            |   10 +
>>   arch/powerpc/mm/mem.c                |   12 +
>>   arch/s390/mm/init.c                  |   12 +
>>   arch/s390/mm/vmem.c                  |   10 +
>>   arch/sh/mm/init.c                    |   17 ++
>>   arch/sparc/mm/init_64.c              |   10 +
>>   arch/tile/mm/init.c                  |    8 +
>>   arch/x86/include/asm/pgtable_types.h |    1 +
>>   arch/x86/mm/init_32.c                |   12 +
>>   arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
>>   arch/x86/mm/pageattr.c               |   47 ++--
>>   drivers/acpi/acpi_memhotplug.c       |    8 +-
>>   drivers/base/memory.c                |    6 +
>>   drivers/firmware/memmap.c            |   96 +++++++-
>>   include/linux/bootmem.h              |    1 +
>>   include/linux/firmware-map.h         |    6 +
>>   include/linux/memory_hotplug.h       |   15 +-
>>   include/linux/mm.h                   |    4 +-
>>   mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
>>   mm/sparse.c                          |    8 +-
>>   23 files changed, 1094 insertions(+), 69 deletions(-)
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: email@kvack.org
>
>
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-30  2:32   ` Tang Chen
@ 2013-01-30  2:48     ` Simon Jeons
  2013-01-30  3:00       ` Tang Chen
  0 siblings, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-01-30  2:48 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

On Wed, 2013-01-30 at 10:32 +0800, Tang Chen wrote:
> On 01/29/2013 08:52 PM, Simon Jeons wrote:
> > Hi Tang,
> >
> > On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> >> Here is the physical memory hot-remove patch-set based on 3.8rc-2.
> 
> Hi Simon,
> 
> I'll summarize all the info and answer you later. :)
> 
> Thanks for asking. :)

Thanks Tang. IIRC, there's a qemu feature that supports memory
hot-add/remove emulation if we don't have a machine which supports memory
hot-add/remove to test on. Is that qemu feature merged? Otherwise, where
can I get that patchset?

> 
> >
> > Some questions to ask you; they have no relationship with this patchset,
> > but are memory hotplug stuff.
> >
> > 1. In function node_states_check_changes_online:
> >
> > comments:
> > * If we don't have HIGHMEM nor movable node,
> > * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> > * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
> >
> > How to understand it? Why, when we don't have HIGHMEM nor a movable node,
> > does node_states[N_NORMAL_MEMORY] contain 0...ZONE_MOVABLE? IIUC,
> > N_NORMAL_MEMORY only means the node has regular memory.
> >
> > * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> > * contains nodes which have zones of 0...ZONE_MOVABLE,
> > * set zone_last to ZONE_MOVABLE.
> >
> > How to understand?
> >
> > 2. In function move_pfn_range_left, why end<= z2->zone_start_pfn is not
> > correct? The comments said that must include/overlap, why?
> >
> > 3. In function online_pages, the normal case (w/o online_kernel,
> > online_movable), why not check if the new zone overlaps with adjacent
> > zones?
> >
> > 4. Could you summarize the differences in implementation between hot-add
> > and logical-add, hot-remove and logical-remove?
> >
> >
> >>
> >> This patch-set aims to implement physical memory hot-removing.
> >>
> >> The patches can free/remove the following things:
> >>
> >>    - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
> >>    - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
> >>    - page table of removed memory              : [RFC PATCH 7,8,10/15]
> >>    - node and related sysfs files              : [RFC PATCH 13-15/15]
> >>
> >>
> >> Existing problem:
> >> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> >> when we online pages.
> >>
> >> For example: there is a memory device on node 1. The address range
> >> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> >> and memory11 under the directory /sys/devices/system/memory/.
> >>
> >> If CONFIG_MEMCG is selected, when we online memory8, the memory that stores
> >> its page cgroup is not provided by this memory device. But when we online
> >> memory9, the memory that stores its page cgroup may be provided by memory8.
> >> So we can't offline memory8 now. We should offline the memory in the
> >> reverse order.
> >>
> >> When the memory device is hotremoved, we will auto offline memory provided
> >> by this memory device. But we don't know which memory is onlined first, so
> >> offlining memory may fail.
> >>
> >> In patch1, we provide a solution which is not good enough:
> >> Iterate twice to offline the memory.
> >> 1st iterate: offline every non primary memory block.
> >> 2nd iterate: offline primary (i.e. first added) memory block.
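As a rough userspace illustration of the two-pass idea above (the `deps[]` dependency table, `can_offline()`, and the block layout are all hypothetical stand-ins, not the kernel's actual data structures):

```c
#include <assert.h>

#define NBLOCKS 4

/* Hypothetical model: deps[i] is the block whose memory stores block i's
 * page cgroup data (-1 for the primary block). A block can only go
 * offline once no online block still depends on it. */
static int deps[NBLOCKS]   = { -1, 0, 0, 1 };   /* block 0 is primary */
static int online[NBLOCKS] = {  1, 1, 1, 1 };

static int can_offline(int i)
{
    for (int j = 0; j < NBLOCKS; j++)
        if (online[j] && deps[j] == i)
            return 0;   /* block j still stores data on block i */
    return 1;
}

/* Two passes, mirroring patch 1: first every non-primary block, then the
 * primary (first added) one. Returns 0 when everything went offline. */
static int offline_all(void)
{
    for (int i = NBLOCKS - 1; i >= 1; i--)      /* 1st iterate */
        if (can_offline(i))
            online[i] = 0;
    if (can_offline(0))                         /* 2nd iterate */
        online[0] = 0;
    for (int i = 0; i < NBLOCKS; i++)
        if (online[i])
            return -1;
    return 0;
}
```

A single forward pass could stop at block 1 while block 3 still uses it; the two passes avoid that without knowing the online order.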
> >>
> >> And a new idea from Wen Congyang<wency@cn.fujitsu.com>  is:
> >> allocate the memory from the memory block they are describing.
> >>
> >> But we are not sure if it is OK to do so because there is no existing API
> >> to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE
> >> to MEM_ONLINE. Also, it may interfere with hugepages.
> >>
> >>
> >>
> >> How to test this patchset?
> >> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
> >>     ACPI_HOTPLUG_MEMORY must be selected.
> >> 2. load the module acpi_memhotplug
> >> 3. hotplug the memory device (it depends on your hardware)
> >>     You will see the memory device under the directory /sys/bus/acpi/devices/.
> >>     Its name is PNP0C80:XX.
> >> 4. online/offline pages provided by this memory device
> >>     You can write online/offline to /sys/devices/system/memory/memoryX/state to
> >>     online/offline pages provided by this memory device
> >> 5. hotremove the memory device
> >>     You can hotremove the memory device by the hardware, or writing 1 to
> >>     /sys/bus/acpi/devices/PNP0C80:XX/eject.
> >
> > Is there a similar knob to hot-add the memory device?
> >
> >>
> >>
> >> Note: if the memory provided by the memory device is used by the kernel, it
> >> can't be offlined. It is not a bug.
> >>
> >>
> >> Changelogs from v5 to v6:
> >>   Patch3: Add some more comments to explain memory hot-remove.
> >>   Patch4: Remove bootmem member in struct firmware_map_entry.
> >>   Patch6: Repeatedly register bootmem pages when using hugepage.
> >>   Patch8: Repeatedly free bootmem pages when using hugepage.
> >>   Patch14: Don't free pgdat when offlining a node, just reset it to 0.
> >>   Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
> >>            one when online a node.
> >>
> >> Changelogs from v4 to v5:
> >>   Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
> >>           avoid disabling irq because we need flush tlb when free pagetables.
> >>   Patch8: new patch, pick up some common APIs that are used to free direct mapping
> >>           and vmemmap pagetables.
> >>   Patch9: free direct mapping pagetables on x86_64 arch.
> >>   Patch10: free vmemmap pagetables.
> >>   Patch11: since freeing memmap with vmemmap has been implemented, the config
> >>            macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
> >>            no longer needed.
> >>   Patch13: no need to modify acpi_memory_disable_device() since it was removed,
> >>            and add nid parameter when calling remove_memory().
> >>
> >> Changelogs from v3 to v4:
> >>   Patch7: remove unused codes.
> >>   Patch8: fix nr_pages that is passed to free_map_bootmem()
> >>
> >> Changelogs from v2 to v3:
> >>   Patch9: call sync_global_pgds() if pgd is changed
> >>   Patch10: fix a problem in the patch
> >>
> >> Changelogs from v1 to v2:
> >>   Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
> >>           memory block. 2nd iterate: offline primary (i.e. first added) memory
> >>           block.
> >>
> >>   Patch3: new patch, no logical change, just remove redundant code.
> >>
> >>   Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
> >>           after the pagetable is changed.
> >>
> >>   Patch12: new patch, free node_data when a node is offlined.
> >>
> >>
> >> Tang Chen (6):
> >>    memory-hotplug: move pgdat_resize_lock into
> >>      sparse_remove_one_section()
> >>    memory-hotplug: remove page table of x86_64 architecture
> >>    memory-hotplug: remove memmap of sparse-vmemmap
> >>    memory-hotplug: Integrated __remove_section() of
> >>      CONFIG_SPARSEMEM_VMEMMAP.
> >>    memory-hotplug: remove sysfs file of node
> >>    memory-hotplug: Do not allocate pgdat if it was not freed when
> >>      offline.
> >>
> >> Wen Congyang (5):
> >>    memory-hotplug: try to offline the memory twice to avoid dependence
> >>    memory-hotplug: remove redundant codes
> >>    memory-hotplug: introduce new function arch_remove_memory() for
> >>      removing page table depends on architecture
> >>    memory-hotplug: Common APIs to support page tables hot-remove
> >>    memory-hotplug: free node_data when a node is offlined
> >>
> >> Yasuaki Ishimatsu (4):
> >>    memory-hotplug: check whether all memory blocks are offlined or not
> >>      when removing memory
> >>    memory-hotplug: remove /sys/firmware/memmap/X sysfs
> >>    memory-hotplug: implement register_page_bootmem_info_section of
> >>      sparse-vmemmap
> >>    memory-hotplug: memory_hotplug: clear zone when removing the memory
> >>
> >>   arch/arm64/mm/mmu.c                  |    3 +
> >>   arch/ia64/mm/discontig.c             |   10 +
> >>   arch/ia64/mm/init.c                  |   18 ++
> >>   arch/powerpc/mm/init_64.c            |   10 +
> >>   arch/powerpc/mm/mem.c                |   12 +
> >>   arch/s390/mm/init.c                  |   12 +
> >>   arch/s390/mm/vmem.c                  |   10 +
> >>   arch/sh/mm/init.c                    |   17 ++
> >>   arch/sparc/mm/init_64.c              |   10 +
> >>   arch/tile/mm/init.c                  |    8 +
> >>   arch/x86/include/asm/pgtable_types.h |    1 +
> >>   arch/x86/mm/init_32.c                |   12 +
> >>   arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
> >>   arch/x86/mm/pageattr.c               |   47 ++--
> >>   drivers/acpi/acpi_memhotplug.c       |    8 +-
> >>   drivers/base/memory.c                |    6 +
> >>   drivers/firmware/memmap.c            |   96 +++++++-
> >>   include/linux/bootmem.h              |    1 +
> >>   include/linux/firmware-map.h         |    6 +
> >>   include/linux/memory_hotplug.h       |   15 +-
> >>   include/linux/mm.h                   |    4 +-
> >>   mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
> >>   mm/sparse.c                          |    8 +-
> >>   23 files changed, 1094 insertions(+), 69 deletions(-)
> >>
> >
> >
> >

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-30  2:48     ` Simon Jeons
@ 2013-01-30  3:00       ` Tang Chen
  0 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-30  3:00 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

On 01/30/2013 10:48 AM, Simon Jeons wrote:
> On Wed, 2013-01-30 at 10:32 +0800, Tang Chen wrote:
>> On 01/29/2013 08:52 PM, Simon Jeons wrote:
>>> Hi Tang,
>>>
>>> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>>>> Here is the physical memory hot-remove patch-set based on 3.8rc-2.
>>
>> Hi Simon,
>>
>> I'll summarize all the info and answer you later. :)
>>
>> Thanks for asking. :)
>
> Thanks Tang, IIRC, there's qemu feature support memory hot-add/remove
> emulation if we don't have machine which supports memory hot-add/remove
> to test. Is that qemu feature merged? Otherwise where can I get that
> patchset?

Hi Simon,

There are patches to support hot-add/remove in qemu, but they are not 
merged yet.
You can get the latest patches here:
http://lists.nongnu.org/archive/html/qemu-devel/2012-12/msg02693.html

BTW, it is unstable and full of problems, and you need to compile your
own SeaBIOS too.

Thanks. :)

>
>>
>>>
>>> Some questions to ask you; they have no relationship with this patchset,
>>> but are memory hotplug stuff.
>>>
>>> 1. In function node_states_check_changes_online:
>>>
>>> comments:
>>> * If we don't have HIGHMEM nor movable node,
>>> * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
>>> * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
>>>
>>> How to understand it? Why, when we don't have HIGHMEM nor a movable node,
>>> does node_states[N_NORMAL_MEMORY] contain 0...ZONE_MOVABLE? IIUC,
>>> N_NORMAL_MEMORY only means the node has regular memory.
>>>
>>> * If we don't have movable node, node_states[N_NORMAL_MEMORY]
>>> * contains nodes which have zones of 0...ZONE_MOVABLE,
>>> * set zone_last to ZONE_MOVABLE.
>>>
>>> How to understand?
>>>
>>> 2. In function move_pfn_range_left, why end<= z2->zone_start_pfn is not
>>> correct? The comments said that must include/overlap, why?
>>>
>>> 3. In function online_pages, the normal case (w/o online_kernel,
>>> online_movable), why not check if the new zone overlaps with adjacent
>>> zones?
>>>
>>> 4. Could you summarize the differences in implementation between hot-add
>>> and logical-add, hot-remove and logical-remove?
>>>
>>>
>>>>
>>>> This patch-set aims to implement physical memory hot-removing.
>>>>
>>>> The patches can free/remove the following things:
>>>>
>>>>     - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>>>>     - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
>>>>     - page table of removed memory              : [RFC PATCH 7,8,10/15]
>>>>     - node and related sysfs files              : [RFC PATCH 13-15/15]
>>>>
>>>>
>>>> Existing problem:
>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>>>> when we online pages.
>>>>
>>>> For example: there is a memory device on node 1. The address range
>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>
>>>> If CONFIG_MEMCG is selected, when we online memory8, the memory that stores
>>>> its page cgroup is not provided by this memory device. But when we online
>>>> memory9, the memory that stores its page cgroup may be provided by memory8.
>>>> So we can't offline memory8 now. We should offline the memory in the
>>>> reverse order.
>>>>
>>>> When the memory device is hotremoved, we will auto offline memory provided
>>>> by this memory device. But we don't know which memory is onlined first, so
>>>> offlining memory may fail.
>>>>
>>>> In patch1, we provide a solution which is not good enough:
>>>> Iterate twice to offline the memory.
>>>> 1st iterate: offline every non primary memory block.
>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>
>>>> And a new idea from Wen Congyang<wency@cn.fujitsu.com>   is:
>>>> allocate the memory from the memory block they are describing.
>>>>
>>>> But we are not sure if it is OK to do so because there is no existing API
>>>> to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE
>>>> to MEM_ONLINE. Also, it may interfere with hugepages.
>>>>
>>>>
>>>>
>>>> How to test this patchset?
>>>> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
>>>>      ACPI_HOTPLUG_MEMORY must be selected.
>>>> 2. load the module acpi_memhotplug
>>>> 3. hotplug the memory device (it depends on your hardware)
>>>>      You will see the memory device under the directory /sys/bus/acpi/devices/.
>>>>      Its name is PNP0C80:XX.
>>>> 4. online/offline pages provided by this memory device
>>>>      You can write online/offline to /sys/devices/system/memory/memoryX/state to
>>>>      online/offline pages provided by this memory device
>>>> 5. hotremove the memory device
>>>>      You can hotremove the memory device by the hardware, or writing 1 to
>>>>      /sys/bus/acpi/devices/PNP0C80:XX/eject.
>>>
>>> Is there a similar knob to hot-add the memory device?
>>>
>>>>
>>>>
>>>> Note: if the memory provided by the memory device is used by the kernel, it
>>>> can't be offlined. It is not a bug.
>>>>
>>>>
>>>> Changelogs from v5 to v6:
>>>>    Patch3: Add some more comments to explain memory hot-remove.
>>>>    Patch4: Remove bootmem member in struct firmware_map_entry.
>>>>    Patch6: Repeatedly register bootmem pages when using hugepage.
>>>>    Patch8: Repeatedly free bootmem pages when using hugepage.
>>>>    Patch14: Don't free pgdat when offlining a node, just reset it to 0.
>>>>    Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
>>>>             one when online a node.
>>>>
>>>> Changelogs from v4 to v5:
>>>>    Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
>>>>            avoid disabling irq because we need flush tlb when free pagetables.
>>>>    Patch8: new patch, pick up some common APIs that are used to free direct mapping
>>>>            and vmemmap pagetables.
>>>>    Patch9: free direct mapping pagetables on x86_64 arch.
>>>>    Patch10: free vmemmap pagetables.
>>>>    Patch11: since freeing memmap with vmemmap has been implemented, the config
>>>>             macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
>>>>             no longer needed.
>>>>    Patch13: no need to modify acpi_memory_disable_device() since it was removed,
>>>>             and add nid parameter when calling remove_memory().
>>>>
>>>> Changelogs from v3 to v4:
>>>>    Patch7: remove unused codes.
>>>>    Patch8: fix nr_pages that is passed to free_map_bootmem()
>>>>
>>>> Changelogs from v2 to v3:
>>>>    Patch9: call sync_global_pgds() if pgd is changed
>>>>    Patch10: fix a problem in the patch
>>>>
>>>> Changelogs from v1 to v2:
>>>>    Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
>>>>            memory block. 2nd iterate: offline primary (i.e. first added) memory
>>>>            block.
>>>>
>>>>    Patch3: new patch, no logical change, just remove redundant code.
>>>>
>>>>    Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
>>>>            after the pagetable is changed.
>>>>
>>>>    Patch12: new patch, free node_data when a node is offlined.
>>>>
>>>>
>>>> Tang Chen (6):
>>>>     memory-hotplug: move pgdat_resize_lock into
>>>>       sparse_remove_one_section()
>>>>     memory-hotplug: remove page table of x86_64 architecture
>>>>     memory-hotplug: remove memmap of sparse-vmemmap
>>>>     memory-hotplug: Integrated __remove_section() of
>>>>       CONFIG_SPARSEMEM_VMEMMAP.
>>>>     memory-hotplug: remove sysfs file of node
>>>>     memory-hotplug: Do not allocate pgdat if it was not freed when
>>>>       offline.
>>>>
>>>> Wen Congyang (5):
>>>>     memory-hotplug: try to offline the memory twice to avoid dependence
>>>>     memory-hotplug: remove redundant codes
>>>>     memory-hotplug: introduce new function arch_remove_memory() for
>>>>       removing page table depends on architecture
>>>>     memory-hotplug: Common APIs to support page tables hot-remove
>>>>     memory-hotplug: free node_data when a node is offlined
>>>>
>>>> Yasuaki Ishimatsu (4):
>>>>     memory-hotplug: check whether all memory blocks are offlined or not
>>>>       when removing memory
>>>>     memory-hotplug: remove /sys/firmware/memmap/X sysfs
>>>>     memory-hotplug: implement register_page_bootmem_info_section of
>>>>       sparse-vmemmap
>>>>     memory-hotplug: memory_hotplug: clear zone when removing the memory
>>>>
>>>>    arch/arm64/mm/mmu.c                  |    3 +
>>>>    arch/ia64/mm/discontig.c             |   10 +
>>>>    arch/ia64/mm/init.c                  |   18 ++
>>>>    arch/powerpc/mm/init_64.c            |   10 +
>>>>    arch/powerpc/mm/mem.c                |   12 +
>>>>    arch/s390/mm/init.c                  |   12 +
>>>>    arch/s390/mm/vmem.c                  |   10 +
>>>>    arch/sh/mm/init.c                    |   17 ++
>>>>    arch/sparc/mm/init_64.c              |   10 +
>>>>    arch/tile/mm/init.c                  |    8 +
>>>>    arch/x86/include/asm/pgtable_types.h |    1 +
>>>>    arch/x86/mm/init_32.c                |   12 +
>>>>    arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
>>>>    arch/x86/mm/pageattr.c               |   47 ++--
>>>>    drivers/acpi/acpi_memhotplug.c       |    8 +-
>>>>    drivers/base/memory.c                |    6 +
>>>>    drivers/firmware/memmap.c            |   96 +++++++-
>>>>    include/linux/bootmem.h              |    1 +
>>>>    include/linux/firmware-map.h         |    6 +
>>>>    include/linux/memory_hotplug.h       |   15 +-
>>>>    include/linux/mm.h                   |    4 +-
>>>>    mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
>>>>    mm/sparse.c                          |    8 +-
>>>>    23 files changed, 1094 insertions(+), 69 deletions(-)
>>>>
>>>
>>>
>>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
  2013-01-30  2:16     ` Tang Chen
@ 2013-01-30  3:27       ` Simon Jeons
  2013-01-30  5:55         ` Tang Chen
  0 siblings, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-01-30  3:27 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

On Wed, 2013-01-30 at 10:16 +0800, Tang Chen wrote:
> On 01/29/2013 09:04 PM, Simon Jeons wrote:
> > Hi Tang,
> > On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> >> From: Wen Congyang<wency@cn.fujitsu.com>
> >>
> >> When memory is removed, the corresponding pagetables should also be removed.
> >> This patch introduces some common APIs to support vmemmap pagetable and x86_64
> >> architecture pagetable removing.
> >
> > Why don't need to build_all_zonelists like online_pages does during
> > hot-add path(add_memory)?
> 
> Hi Simon,
> 
> As you said, build_all_zonelists is done by online_pages. When the
> memory device is hot-added, we cannot use it. We can only use it when
> we online the pages on it.

Why?

If a node has just one memory device and its memory is small, some zones
will not be present, like ZONE_HIGHMEM. Then if we hot-add another memory
device and ZONE_HIGHMEM appears, should we build_all_zonelists this time?

> 
> But we can online the pages as different types, kernel or movable (which
> belong to different zones), and we can online part of the memory, not
> all of it. So each time we online some pages, we should check if we
> need to update the zone list.
> 
> So I think that is why we do build_all_zonelists when online_pages.
> (just my opinion)
> 
> Thanks. :)
> 
> >
> >>
> >> All pages of the virtual mapping in removed memory cannot be freed if some
> >> pages used as PGD/PUD include not only removed memory but also other memory.
> >> So the patch uses the following way to check whether a page can be freed:
> >>
> >>   1. When removing memory, the page structs of the removed memory are filled
> >>      with 0xFD.
> >>   2. When all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD
> >>      can be cleared. In this case, the page used as the PT/PMD can be freed.
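The check in step 2 rests on a memchr_inv()-style scan. A minimal userspace sketch of the idea (memchr_inv is reimplemented here purely for illustration; the kernel has its own in lib/string.c, and `page_wholly_unused` is a made-up helper name):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_INUSE 0xFD

/* Illustration of the kernel's memchr_inv(): return a pointer to the
 * first byte differing from c, or NULL if the whole range matches. */
static const void *memchr_inv_sketch(const void *start, int c, size_t bytes)
{
    const unsigned char *p = start;
    for (size_t i = 0; i < bytes; i++)
        if (p[i] != (unsigned char)c)
            return p + i;
    return NULL;
}

/* A PT/PMD page can be freed only once every page struct on it has been
 * overwritten with PAGE_INUSE by earlier removals. */
static int page_wholly_unused(const void *page_addr, size_t page_size)
{
    return memchr_inv_sketch(page_addr, PAGE_INUSE, page_size) == NULL;
}
```

Each partial removal memsets its sub-range to 0xFD; the page is reclaimed only when the scan finds no byte that differs.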
> >>
> >> Signed-off-by: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
> >> Signed-off-by: Jianguo Wu<wujianguo@huawei.com>
> >> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
> >> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
> >> ---
> >>   arch/x86/include/asm/pgtable_types.h |    1 +
> >>   arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
> >>   arch/x86/mm/pageattr.c               |   47 +++---
> >>   include/linux/bootmem.h              |    1 +
> >>   4 files changed, 326 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> >> index 3c32db8..4b6fd2a 100644
> >> --- a/arch/x86/include/asm/pgtable_types.h
> >> +++ b/arch/x86/include/asm/pgtable_types.h
> >> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
> >>    * as a pte too.
> >>    */
> >>   extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> >> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
> >>
> >>   #endif	/* !__ASSEMBLY__ */
> >>
> >> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >> index 9ac1723..fe01116 100644
> >> --- a/arch/x86/mm/init_64.c
> >> +++ b/arch/x86/mm/init_64.c
> >> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
> >>   }
> >>   EXPORT_SYMBOL_GPL(arch_add_memory);
> >>
> >> +#define PAGE_INUSE 0xFD
> >> +
> >> +static void __meminit free_pagetable(struct page *page, int order)
> >> +{
> >> +	struct zone *zone;
> >> +	bool bootmem = false;
> >> +	unsigned long magic;
> >> +	unsigned int nr_pages = 1<<  order;
> >> +
> >> +	/* bootmem page has reserved flag */
> >> +	if (PageReserved(page)) {
> >> +		__ClearPageReserved(page);
> >> +		bootmem = true;
> >> +
> >> +		magic = (unsigned long)page->lru.next;
> >> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> >> +			while (nr_pages--)
> >> +				put_page_bootmem(page++);
> >> +		} else
> >> +			__free_pages_bootmem(page, order);
> >> +	} else
> >> +		free_pages((unsigned long)page_address(page), order);
> >> +
> >> +	/*
> >> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
> >> +	 * are all allocated by bootmem.
> >> +	 */
> >> +	if (bootmem) {
> >> +		zone = page_zone(page);
> >> +		zone_span_writelock(zone);
> >> +		zone->present_pages += nr_pages;
> >> +		zone_span_writeunlock(zone);
> >> +		totalram_pages += nr_pages;
> >> +	}
> >> +}
> >> +
> >> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> >> +{
> >> +	pte_t *pte;
> >> +	int i;
> >> +
> >> +	for (i = 0; i<  PTRS_PER_PTE; i++) {
> >> +		pte = pte_start + i;
> >> +		if (pte_val(*pte))
> >> +			return;
> >> +	}
> >> +
> >> +	/* free a pte table */
> >> +	free_pagetable(pmd_page(*pmd), 0);
> >> +	spin_lock(&init_mm.page_table_lock);
> >> +	pmd_clear(pmd);
> >> +	spin_unlock(&init_mm.page_table_lock);
> >> +}
> >> +
> >> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> >> +{
> >> +	pmd_t *pmd;
> >> +	int i;
> >> +
> >> +	for (i = 0; i<  PTRS_PER_PMD; i++) {
> >> +		pmd = pmd_start + i;
> >> +		if (pmd_val(*pmd))
> >> +			return;
> >> +	}
> >> +
> >> +	/* free a pmd table */
> >> +	free_pagetable(pud_page(*pud), 0);
> >> +	spin_lock(&init_mm.page_table_lock);
> >> +	pud_clear(pud);
> >> +	spin_unlock(&init_mm.page_table_lock);
> >> +}
> >> +
> >> +/* Return true if pgd is changed, otherwise return false. */
> >> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> >> +{
> >> +	pud_t *pud;
> >> +	int i;
> >> +
> >> +	for (i = 0; i<  PTRS_PER_PUD; i++) {
> >> +		pud = pud_start + i;
> >> +		if (pud_val(*pud))
> >> +			return false;
> >> +	}
> >> +
> >> +	/* free a pud table */
> >> +	free_pagetable(pgd_page(*pgd), 0);
> >> +	spin_lock(&init_mm.page_table_lock);
> >> +	pgd_clear(pgd);
> >> +	spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +	return true;
> >> +}
> >> +
> >> +static void __meminit
> >> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> >> +		 bool direct)
> >> +{
> >> +	unsigned long next, pages = 0;
> >> +	pte_t *pte;
> >> +	void *page_addr;
> >> +	phys_addr_t phys_addr;
> >> +
> >> +	pte = pte_start + pte_index(addr);
> >> +	for (; addr<  end; addr = next, pte++) {
> >> +		next = (addr + PAGE_SIZE)&  PAGE_MASK;
> >> +		if (next>  end)
> >> +			next = end;
> >> +
> >> +		if (!pte_present(*pte))
> >> +			continue;
> >> +
> >> +		/*
> >> +		 * We mapped [0,1G) memory as identity mapping when
> >> +		 * initializing, in arch/x86/kernel/head_64.S. These
> >> +		 * pagetables cannot be removed.
> >> +		 */
> >> +		phys_addr = pte_val(*pte) + (addr&  PAGE_MASK);
> >> +		if (phys_addr<  (phys_addr_t)0x40000000)
> >> +			return;
> >> +
> >> +		if (IS_ALIGNED(addr, PAGE_SIZE)&&
> >> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> >> +			if (!direct) {
> >> +				free_pagetable(pte_page(*pte), 0);
> >> +				pages++;
> >> +			}
> >> +
> >> +			spin_lock(&init_mm.page_table_lock);
> >> +			pte_clear(&init_mm, addr, pte);
> >> +			spin_unlock(&init_mm.page_table_lock);
> >> +		} else {
> >> +			/*
> >> +			 * If we are not removing the whole page, it means
> >> +			 * other ptes in this page are being used and we cannot
> >> +			 * remove them. So fill the unused ptes with 0xFD, and
> >> +			 * remove the page when it is wholly filled with 0xFD.
> >> +			 */
> >> +			memset((void *)addr, PAGE_INUSE, next - addr);
> >> +			page_addr = page_address(pte_page(*pte));
> >> +
> >> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> >> +				free_pagetable(pte_page(*pte), 0);
> >> +				pages++;
> >> +
> >> +				spin_lock(&init_mm.page_table_lock);
> >> +				pte_clear(&init_mm, addr, pte);
> >> +				spin_unlock(&init_mm.page_table_lock);
> >> +			}
> >> +		}
> >> +	}
> >> +
> >> +	/* Call free_pte_table() in remove_pmd_table(). */
> >> +	flush_tlb_all();
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_4K, -pages);
> >> +}
> >> +
> >> +static void __meminit
> >> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> >> +		 bool direct)
> >> +{
> >> +	unsigned long pte_phys, next, pages = 0;
> >> +	pte_t *pte_base;
> >> +	pmd_t *pmd;
> >> +
> >> +	pmd = pmd_start + pmd_index(addr);
> >> +	for (; addr<  end; addr = next, pmd++) {
> >> +		next = pmd_addr_end(addr, end);
> >> +
> >> +		if (!pmd_present(*pmd))
> >> +			continue;
> >> +
> >> +		if (pmd_large(*pmd)) {
> >> +			if (IS_ALIGNED(addr, PMD_SIZE)&&
> >> +			    IS_ALIGNED(next, PMD_SIZE)) {
> >> +				if (!direct) {
> >> +					free_pagetable(pmd_page(*pmd),
> >> +						       get_order(PMD_SIZE));
> >> +					pages++;
> >> +				}
> >> +
> >> +				spin_lock(&init_mm.page_table_lock);
> >> +				pmd_clear(pmd);
> >> +				spin_unlock(&init_mm.page_table_lock);
> >> +				continue;
> >> +			}
> >> +
> >> +			/*
> >> +			 * We use 2M page, but we need to remove part of them,
> >> +			 * so split 2M page to 4K page.
> >> +			 */
> >> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
> >> +			BUG_ON(!pte_base);
> >> +			__split_large_page((pte_t *)pmd, addr,
> >> +					   (pte_t *)pte_base);
> >> +
> >> +			spin_lock(&init_mm.page_table_lock);
> >> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> >> +			spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +			flush_tlb_all();
> >> +		}
> >> +
> >> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> >> +		remove_pte_table(pte_base, addr, next, direct);
> >> +		free_pte_table(pte_base, pmd);
> >> +		unmap_low_page(pte_base);
> >> +	}
> >> +
> >> +	/* Call free_pmd_table() in remove_pud_table(). */
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_2M, -pages);
> >> +}
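The alignment test that decides between tearing down a whole 2M mapping and splitting it first can be shown in isolation (PMD_SIZE and IS_ALIGNED are redefined locally for a standalone build; only the arithmetic is the point):

```c
#include <assert.h>

#define PMD_SIZE (1UL << 21)                    /* one 2M mapping */
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* Mirror of the decision in remove_pmd_table(): a large page may be
 * cleared whole only when [addr, next) covers it exactly; otherwise it
 * must first be split into 4K entries so the live part survives. */
static int needs_split(unsigned long addr, unsigned long next)
{
    return !(IS_ALIGNED(addr, PMD_SIZE) && IS_ALIGNED(next, PMD_SIZE));
}
```

remove_pud_table() applies the same test one level up with PUD_SIZE, splitting a 1G page into 2M pages.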
> >> +
> >> +static void __meminit
> >> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> >> +		 bool direct)
> >> +{
> >> +	unsigned long pmd_phys, next, pages = 0;
> >> +	pmd_t *pmd_base;
> >> +	pud_t *pud;
> >> +
> >> +	pud = pud_start + pud_index(addr);
> >> +	for (; addr<  end; addr = next, pud++) {
> >> +		next = pud_addr_end(addr, end);
> >> +
> >> +		if (!pud_present(*pud))
> >> +			continue;
> >> +
> >> +		if (pud_large(*pud)) {
> >> +			if (IS_ALIGNED(addr, PUD_SIZE)&&
> >> +			    IS_ALIGNED(next, PUD_SIZE)) {
> >> +				if (!direct) {
> >> +					free_pagetable(pud_page(*pud),
> >> +						       get_order(PUD_SIZE));
> >> +					pages++;
> >> +				}
> >> +
> >> +				spin_lock(&init_mm.page_table_lock);
> >> +				pud_clear(pud);
> >> +				spin_unlock(&init_mm.page_table_lock);
> >> +				continue;
> >> +			}
> >> +
> >> +			/*
> >> +			 * We use 1G page, but we need to remove part of them,
> >> +			 * so split 1G page to 2M page.
> >> +			 */
> >> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> >> +			BUG_ON(!pmd_base);
> >> +			__split_large_page((pte_t *)pud, addr,
> >> +					   (pte_t *)pmd_base);
> >> +
> >> +			spin_lock(&init_mm.page_table_lock);
> >> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> >> +			spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +			flush_tlb_all();
> >> +		}
> >> +
> >> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> >> +		remove_pmd_table(pmd_base, addr, next, direct);
> >> +		free_pmd_table(pmd_base, pud);
> >> +		unmap_low_page(pmd_base);
> >> +	}
> >> +
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_1G, -pages);
> >> +}
> >> +
> >> +/* start and end are both virtual address. */
> >> +static void __meminit
> >> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> >> +{
> >> +	unsigned long next;
> >> +	pgd_t *pgd;
> >> +	pud_t *pud;
> >> +	bool pgd_changed = false;
> >> +
> >> +	for (; start < end; start = next) {
> >> +		pgd = pgd_offset_k(start);
> >> +		if (!pgd_present(*pgd))
> >> +			continue;
> >> +
> >> +		next = pgd_addr_end(start, end);
> >> +
> >> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> >> +		remove_pud_table(pud, start, next, direct);
> >> +		if (free_pud_table(pud, pgd))
> >> +			pgd_changed = true;
> >> +		unmap_low_page(pud);
> >> +	}
> >> +
> >> +	if (pgd_changed)
> >> +		sync_global_pgds(start, end - 1);
> >> +
> >> +	flush_tlb_all();
> >> +}
> >> +
> >>   #ifdef CONFIG_MEMORY_HOTREMOVE
> >>   int __ref arch_remove_memory(u64 start, u64 size)
> >>   {
> >> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> >> index a718e0d..7dcb6f9 100644
> >> --- a/arch/x86/mm/pageattr.c
> >> +++ b/arch/x86/mm/pageattr.c
> >> @@ -501,21 +501,13 @@ out_unlock:
> >>   	return do_split;
> >>   }
> >>
> >> -static int split_large_page(pte_t *kpte, unsigned long address)
> >> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
> >>   {
> >>   	unsigned long pfn, pfninc = 1;
> >>   	unsigned int i, level;
> >> -	pte_t *pbase, *tmp;
> >> +	pte_t *tmp;
> >>   	pgprot_t ref_prot;
> >> -	struct page *base;
> >> -
> >> -	if (!debug_pagealloc)
> >> -		spin_unlock(&cpa_lock);
> >> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >> -	if (!debug_pagealloc)
> >> -		spin_lock(&cpa_lock);
> >> -	if (!base)
> >> -		return -ENOMEM;
> >> +	struct page *base = virt_to_page(pbase);
> >>
> >>   	spin_lock(&pgd_lock);
> >>   	/*
> >> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>   	 * up for us already:
> >>   	 */
> >>   	tmp = lookup_address(address, &level);
> >> -	if (tmp != kpte)
> >> -		goto out_unlock;
> >> +	if (tmp != kpte) {
> >> +		spin_unlock(&pgd_lock);
> >> +		return 1;
> >> +	}
> >>
> >> -	pbase = (pte_t *)page_address(base);
> >>   	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
> >>   	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
> >>   	/*
> >> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>   	 * going on.
> >>   	 */
> >>   	__flush_tlb_all();
> >> +	spin_unlock(&pgd_lock);
> >>
> >> -	base = NULL;
> >> +	return 0;
> >> +}
> >>
> >> -out_unlock:
> >> -	/*
> >> -	 * If we dropped out via the lookup_address check under
> >> -	 * pgd_lock then stick the page back into the pool:
> >> -	 */
> >> -	if (base)
> >> +static int split_large_page(pte_t *kpte, unsigned long address)
> >> +{
> >> +	pte_t *pbase;
> >> +	struct page *base;
> >> +
> >> +	if (!debug_pagealloc)
> >> +		spin_unlock(&cpa_lock);
> >> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >> +	if (!debug_pagealloc)
> >> +		spin_lock(&cpa_lock);
> >> +	if (!base)
> >> +		return -ENOMEM;
> >> +
> >> +	pbase = (pte_t *)page_address(base);
> >> +	if (__split_large_page(kpte, address, pbase))
> >>   		__free_page(base);
> >> -	spin_unlock(&pgd_lock);
> >>
> >>   	return 0;
> >>   }
> >> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> >> index 3f778c2..190ff06 100644
> >> --- a/include/linux/bootmem.h
> >> +++ b/include/linux/bootmem.h
> >> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
> >>   			      unsigned long size);
> >>   extern void free_bootmem(unsigned long physaddr, unsigned long size);
> >>   extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> >> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
> >>
> >>   /*
> >>    * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
  2013-01-30  3:27       ` Simon Jeons
@ 2013-01-30  5:55         ` Tang Chen
  2013-01-30  7:32           ` Simon Jeons
  0 siblings, 1 reply; 67+ messages in thread
From: Tang Chen @ 2013-01-30  5:55 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

On 01/30/2013 11:27 AM, Simon Jeons wrote:
> On Wed, 2013-01-30 at 10:16 +0800, Tang Chen wrote:
>> On 01/29/2013 09:04 PM, Simon Jeons wrote:
>>> Hi Tang,
>>> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>>>> From: Wen Congyang<wency@cn.fujitsu.com>
>>>>
>>>> When memory is removed, the corresponding pagetables should also be removed.
>>>> This patch introduces some common APIs to support vmemmap pagetable and x86_64
>>>> architecture pagetable removing.
>>>
>>> Why don't need to build_all_zonelists like online_pages does during
>>> hot-add path(add_memory)?
>>
>> Hi Simon,
>>
>> As you said, build_all_zonelists is done by online_pages. When the
>> memory device is hot-added, we cannot use it. We can only use it when
>> we online the pages on it.
>
> Why?
>
> If a node has just one memory device and memory is small, some zone will
> not present like zone_highmem, then hot-add another memory device and
> zone_highmem appear, if you should build_all_zonelists this time?

Hi Simon,

We build the zone list when the first memory on the node is hot-added.

add_memory()
  |-->if (!node_online(nid)) hotadd_new_pgdat()
                              |-->free_area_init_node()
                              |-->build_all_zonelists()

All the zones on the new node will be initialized as empty. So here, we
build the zone list.

But actually this does nothing, because no pages are online and the zones
are empty.
In build_zonelists_node(), populated_zone(zone) will always be false.

The real work of building the zone list happens when pages are onlined. :)


As for your question: suppose some small memory is there and zone_normal
is present. When these pages are onlined (not just added), the zone list
has been rebuilt. But the pages in zone_highmem are not added yet, which
means not onlined, so we don't need to build a zone list for them. Later,
when the zone_highmem pages are added, we still don't rebuild the zone
list, because the real rebuilding work happens when the pages are onlined.

I think this is the current logic. :)

Thanks. :)

>
>>
>> But we can online the pages as different types, kernel or movable
>> (which belong to different zones), and we can online part of the
>> memory, not all of it.
>> So each time we online some pages, we should check if we need to update
>> the zone list.
>>
>> So I think that is why we do build_all_zonelists when online_pages.
>> (just my opinion)
>>
>> Thanks. :)
>>
>>>
>>>>
>>>> All pages of virtual mapping in removed memory cannot be freed if some pages
>>>> used as PGD/PUD include not only removed memory but also other memory. So the
>>>> patch uses the following way to check whether a page can be freed or not.
>>>>
>>>>    1. When removing memory, the page structs of the removed memory are filled
>>>>       with 0xFD.
>>>>    2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared.
>>>>       In this case, the page used as PT/PMD can be freed.
>>>>
>>>> Signed-off-by: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
>>>> Signed-off-by: Jianguo Wu<wujianguo@huawei.com>
>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>>>> ---
>>>>    arch/x86/include/asm/pgtable_types.h |    1 +
>>>>    arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
>>>>    arch/x86/mm/pageattr.c               |   47 +++---
>>>>    include/linux/bootmem.h              |    1 +
>>>>    4 files changed, 326 insertions(+), 22 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
>>>> index 3c32db8..4b6fd2a 100644
>>>> --- a/arch/x86/include/asm/pgtable_types.h
>>>> +++ b/arch/x86/include/asm/pgtable_types.h
>>>> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>>>>     * as a pte too.
>>>>     */
>>>>    extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>>>> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>>>>
>>>>    #endif	/* !__ASSEMBLY__ */
>>>>
>>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>>> index 9ac1723..fe01116 100644
>>>> --- a/arch/x86/mm/init_64.c
>>>> +++ b/arch/x86/mm/init_64.c
>>>> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(arch_add_memory);
>>>>
>>>> +#define PAGE_INUSE 0xFD
>>>> +
>>>> +static void __meminit free_pagetable(struct page *page, int order)
>>>> +{
>>>> +	struct zone *zone;
>>>> +	bool bootmem = false;
>>>> +	unsigned long magic;
>>>> +	unsigned int nr_pages = 1 << order;
>>>> +
>>>> +	/* bootmem page has reserved flag */
>>>> +	if (PageReserved(page)) {
>>>> +		__ClearPageReserved(page);
>>>> +		bootmem = true;
>>>> +
>>>> +		magic = (unsigned long)page->lru.next;
>>>> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
>>>> +			while (nr_pages--)
>>>> +				put_page_bootmem(page++);
>>>> +		} else
>>>> +			__free_pages_bootmem(page, order);
>>>> +	} else
>>>> +		free_pages((unsigned long)page_address(page), order);
>>>> +
>>>> +	/*
>>>> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
>>>> +	 * are all allocated by bootmem.
>>>> +	 */
>>>> +	if (bootmem) {
>>>> +		zone = page_zone(page);
>>>> +		zone_span_writelock(zone);
>>>> +		zone->present_pages += nr_pages;
>>>> +		zone_span_writeunlock(zone);
>>>> +		totalram_pages += nr_pages;
>>>> +	}
>>>> +}
>>>> +
>>>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
>>>> +{
>>>> +	pte_t *pte;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
>>>> +		pte = pte_start + i;
>>>> +		if (pte_val(*pte))
>>>> +			return;
>>>> +	}
>>>> +
>>>> +	/* free a pte table */
>>>> +	free_pagetable(pmd_page(*pmd), 0);
>>>> +	spin_lock(&init_mm.page_table_lock);
>>>> +	pmd_clear(pmd);
>>>> +	spin_unlock(&init_mm.page_table_lock);
>>>> +}
>>>> +
>>>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>>>> +{
>>>> +	pmd_t *pmd;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
>>>> +		pmd = pmd_start + i;
>>>> +		if (pmd_val(*pmd))
>>>> +			return;
>>>> +	}
>>>> +
>>>> +	/* free a pmd table */
>>>> +	free_pagetable(pud_page(*pud), 0);
>>>> +	spin_lock(&init_mm.page_table_lock);
>>>> +	pud_clear(pud);
>>>> +	spin_unlock(&init_mm.page_table_lock);
>>>> +}
>>>> +
>>>> +/* Return true if pgd is changed, otherwise return false. */
>>>> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
>>>> +{
>>>> +	pud_t *pud;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
>>>> +		pud = pud_start + i;
>>>> +		if (pud_val(*pud))
>>>> +			return false;
>>>> +	}
>>>> +
>>>> +	/* free a pud table */
>>>> +	free_pagetable(pgd_page(*pgd), 0);
>>>> +	spin_lock(&init_mm.page_table_lock);
>>>> +	pgd_clear(pgd);
>>>> +	spin_unlock(&init_mm.page_table_lock);
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>> +static void __meminit
>>>> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>>>> +		 bool direct)
>>>> +{
>>>> +	unsigned long next, pages = 0;
>>>> +	pte_t *pte;
>>>> +	void *page_addr;
>>>> +	phys_addr_t phys_addr;
>>>> +
>>>> +	pte = pte_start + pte_index(addr);
>>>> +	for (; addr < end; addr = next, pte++) {
>>>> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
>>>> +		if (next > end)
>>>> +			next = end;
>>>> +
>>>> +		if (!pte_present(*pte))
>>>> +			continue;
>>>> +
>>>> +		/*
>>>> +		 * We mapped [0,1G) memory as identity mapping when
>>>> +		 * initializing, in arch/x86/kernel/head_64.S. These
>>>> +		 * pagetables cannot be removed.
>>>> +		 */
>>>> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
>>>> +		if (phys_addr < (phys_addr_t)0x40000000)
>>>> +			return;
>>>> +
>>>> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
>>>> +		    IS_ALIGNED(next, PAGE_SIZE)) {
>>>> +			if (!direct) {
>>>> +				free_pagetable(pte_page(*pte), 0);
>>>> +				pages++;
>>>> +			}
>>>> +
>>>> +			spin_lock(&init_mm.page_table_lock);
>>>> +			pte_clear(&init_mm, addr, pte);
>>>> +			spin_unlock(&init_mm.page_table_lock);
>>>> +		} else {
>>>> +			/*
>>>> +			 * If we are not removing the whole page, it means
>>>> +			 * other ptes in this page are being used and we cannot
>>>> +			 * remove them. So fill the unused ptes with 0xFD, and
>>>> +			 * remove the page when it is wholly filled with 0xFD.
>>>> +			 */
>>>> +			memset((void *)addr, PAGE_INUSE, next - addr);
>>>> +			page_addr = page_address(pte_page(*pte));
>>>> +
>>>> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
>>>> +				free_pagetable(pte_page(*pte), 0);
>>>> +				pages++;
>>>> +
>>>> +				spin_lock(&init_mm.page_table_lock);
>>>> +				pte_clear(&init_mm, addr, pte);
>>>> +				spin_unlock(&init_mm.page_table_lock);
>>>> +			}
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* Call free_pte_table() in remove_pmd_table(). */
>>>> +	flush_tlb_all();
>>>> +	if (direct)
>>>> +		update_page_count(PG_LEVEL_4K, -pages);
>>>> +}
>>>> +
>>>> +static void __meminit
>>>> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>>>> +		 bool direct)
>>>> +{
>>>> +	unsigned long pte_phys, next, pages = 0;
>>>> +	pte_t *pte_base;
>>>> +	pmd_t *pmd;
>>>> +
>>>> +	pmd = pmd_start + pmd_index(addr);
>>>> +	for (; addr<   end; addr = next, pmd++) {
>>>> +		next = pmd_addr_end(addr, end);
>>>> +
>>>> +		if (!pmd_present(*pmd))
>>>> +			continue;
>>>> +
>>>> +		if (pmd_large(*pmd)) {
>>>> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
>>>> +			    IS_ALIGNED(next, PMD_SIZE)) {
>>>> +				if (!direct) {
>>>> +					free_pagetable(pmd_page(*pmd),
>>>> +						       get_order(PMD_SIZE));
>>>> +					pages++;
>>>> +				}
>>>> +
>>>> +				spin_lock(&init_mm.page_table_lock);
>>>> +				pmd_clear(pmd);
>>>> +				spin_unlock(&init_mm.page_table_lock);
>>>> +				continue;
>>>> +			}
>>>> +
>>>> +			/*
>>>> +			 * We use 2M page, but we need to remove part of them,
>>>> +			 * so split 2M page to 4K page.
>>>> +			 */
>>>> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
>>>> +			BUG_ON(!pte_base);
>>>> +			__split_large_page((pte_t *)pmd, addr,
>>>> +					   (pte_t *)pte_base);
>>>> +
>>>> +			spin_lock(&init_mm.page_table_lock);
>>>> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>>>> +			spin_unlock(&init_mm.page_table_lock);
>>>> +
>>>> +			flush_tlb_all();
>>>> +		}
>>>> +
>>>> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
>>>> +		remove_pte_table(pte_base, addr, next, direct);
>>>> +		free_pte_table(pte_base, pmd);
>>>> +		unmap_low_page(pte_base);
>>>> +	}
>>>> +
>>>> +	/* Call free_pmd_table() in remove_pud_table(). */
>>>> +	if (direct)
>>>> +		update_page_count(PG_LEVEL_2M, -pages);
>>>> +}
>>>> +
>>>> +static void __meminit
>>>> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
>>>> +		 bool direct)
>>>> +{
>>>> +	unsigned long pmd_phys, next, pages = 0;
>>>> +	pmd_t *pmd_base;
>>>> +	pud_t *pud;
>>>> +
>>>> +	pud = pud_start + pud_index(addr);
>>>> +	for (; addr<   end; addr = next, pud++) {
>>>> +		next = pud_addr_end(addr, end);
>>>> +
>>>> +		if (!pud_present(*pud))
>>>> +			continue;
>>>> +
>>>> +		if (pud_large(*pud)) {
>>>> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
>>>> +			    IS_ALIGNED(next, PUD_SIZE)) {
>>>> +				if (!direct) {
>>>> +					free_pagetable(pud_page(*pud),
>>>> +						       get_order(PUD_SIZE));
>>>> +					pages++;
>>>> +				}
>>>> +
>>>> +				spin_lock(&init_mm.page_table_lock);
>>>> +				pud_clear(pud);
>>>> +				spin_unlock(&init_mm.page_table_lock);
>>>> +				continue;
>>>> +			}
>>>> +
>>>> +			/*
>>>> +			 * We use 1G page, but we need to remove part of them,
>>>> +			 * so split 1G page to 2M page.
>>>> +			 */
>>>> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
>>>> +			BUG_ON(!pmd_base);
>>>> +			__split_large_page((pte_t *)pud, addr,
>>>> +					   (pte_t *)pmd_base);
>>>> +
>>>> +			spin_lock(&init_mm.page_table_lock);
>>>> +			pud_populate(&init_mm, pud, __va(pmd_phys));
>>>> +			spin_unlock(&init_mm.page_table_lock);
>>>> +
>>>> +			flush_tlb_all();
>>>> +		}
>>>> +
>>>> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
>>>> +		remove_pmd_table(pmd_base, addr, next, direct);
>>>> +		free_pmd_table(pmd_base, pud);
>>>> +		unmap_low_page(pmd_base);
>>>> +	}
>>>> +
>>>> +	if (direct)
>>>> +		update_page_count(PG_LEVEL_1G, -pages);
>>>> +}
>>>> +
>>>> +/* start and end are both virtual address. */
>>>> +static void __meminit
>>>> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
>>>> +{
>>>> +	unsigned long next;
>>>> +	pgd_t *pgd;
>>>> +	pud_t *pud;
>>>> +	bool pgd_changed = false;
>>>> +
>>>> +	for (; start < end; start = next) {
>>>> +		pgd = pgd_offset_k(start);
>>>> +		if (!pgd_present(*pgd))
>>>> +			continue;
>>>> +
>>>> +		next = pgd_addr_end(start, end);
>>>> +
>>>> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>>>> +		remove_pud_table(pud, start, next, direct);
>>>> +		if (free_pud_table(pud, pgd))
>>>> +			pgd_changed = true;
>>>> +		unmap_low_page(pud);
>>>> +	}
>>>> +
>>>> +	if (pgd_changed)
>>>> +		sync_global_pgds(start, end - 1);
>>>> +
>>>> +	flush_tlb_all();
>>>> +}
>>>> +
>>>>    #ifdef CONFIG_MEMORY_HOTREMOVE
>>>>    int __ref arch_remove_memory(u64 start, u64 size)
>>>>    {
>>>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
>>>> index a718e0d..7dcb6f9 100644
>>>> --- a/arch/x86/mm/pageattr.c
>>>> +++ b/arch/x86/mm/pageattr.c
>>>> @@ -501,21 +501,13 @@ out_unlock:
>>>>    	return do_split;
>>>>    }
>>>>
>>>> -static int split_large_page(pte_t *kpte, unsigned long address)
>>>> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>>>>    {
>>>>    	unsigned long pfn, pfninc = 1;
>>>>    	unsigned int i, level;
>>>> -	pte_t *pbase, *tmp;
>>>> +	pte_t *tmp;
>>>>    	pgprot_t ref_prot;
>>>> -	struct page *base;
>>>> -
>>>> -	if (!debug_pagealloc)
>>>> -		spin_unlock(&cpa_lock);
>>>> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>>>> -	if (!debug_pagealloc)
>>>> -		spin_lock(&cpa_lock);
>>>> -	if (!base)
>>>> -		return -ENOMEM;
>>>> +	struct page *base = virt_to_page(pbase);
>>>>
>>>>    	spin_lock(&pgd_lock);
>>>>    	/*
>>>> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>>>>    	 * up for us already:
>>>>    	 */
>>>>   	tmp = lookup_address(address, &level);
>>>> -	if (tmp != kpte)
>>>> -		goto out_unlock;
>>>> +	if (tmp != kpte) {
>>>> +		spin_unlock(&pgd_lock);
>>>> +		return 1;
>>>> +	}
>>>>
>>>> -	pbase = (pte_t *)page_address(base);
>>>>    	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>>>>    	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>>>>    	/*
>>>> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>>>>    	 * going on.
>>>>    	 */
>>>>    	__flush_tlb_all();
>>>> +	spin_unlock(&pgd_lock);
>>>>
>>>> -	base = NULL;
>>>> +	return 0;
>>>> +}
>>>>
>>>> -out_unlock:
>>>> -	/*
>>>> -	 * If we dropped out via the lookup_address check under
>>>> -	 * pgd_lock then stick the page back into the pool:
>>>> -	 */
>>>> -	if (base)
>>>> +static int split_large_page(pte_t *kpte, unsigned long address)
>>>> +{
>>>> +	pte_t *pbase;
>>>> +	struct page *base;
>>>> +
>>>> +	if (!debug_pagealloc)
>>>> +		spin_unlock(&cpa_lock);
>>>> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>>>> +	if (!debug_pagealloc)
>>>> +		spin_lock(&cpa_lock);
>>>> +	if (!base)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	pbase = (pte_t *)page_address(base);
>>>> +	if (__split_large_page(kpte, address, pbase))
>>>>    		__free_page(base);
>>>> -	spin_unlock(&pgd_lock);
>>>>
>>>>    	return 0;
>>>>    }
>>>> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
>>>> index 3f778c2..190ff06 100644
>>>> --- a/include/linux/bootmem.h
>>>> +++ b/include/linux/bootmem.h
>>>> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
>>>>    			      unsigned long size);
>>>>    extern void free_bootmem(unsigned long physaddr, unsigned long size);
>>>>    extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
>>>> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
>>>>
>>>>    /*
>>>>     * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,
>>>
>>>
>>
>
>
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
  2013-01-30  5:55         ` Tang Chen
@ 2013-01-30  7:32           ` Simon Jeons
  0 siblings, 0 replies; 67+ messages in thread
From: Simon Jeons @ 2013-01-30  7:32 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

On Wed, 2013-01-30 at 13:55 +0800, Tang Chen wrote:
> On 01/30/2013 11:27 AM, Simon Jeons wrote:
> > On Wed, 2013-01-30 at 10:16 +0800, Tang Chen wrote:
> >> On 01/29/2013 09:04 PM, Simon Jeons wrote:
> >>> Hi Tang,
> >>> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> >>>> From: Wen Congyang<wency@cn.fujitsu.com>
> >>>>
> >>>> When memory is removed, the corresponding pagetables should also be removed.
> >>>> This patch introduces some common APIs to support vmemmap pagetable and x86_64
> >>>> architecture pagetable removing.
> >>>
> >>> Why don't need to build_all_zonelists like online_pages does during
> >>> hot-add path(add_memory)?
> >>
> >> Hi Simon,
> >>
> >> As you said, build_all_zonelists is done by online_pages. When the
> >> memory device is hot-added, we cannot use it. We can only use it when
> >> we online the pages on it.
> >
> > Why?
> >
> > If a node has just one memory device and memory is small, some zone will
> > not present like zone_highmem, then hot-add another memory device and
> > zone_highmem appear, if you should build_all_zonelists this time?
> 
> Hi Simon,
> 
> We build the zone list when the first memory on the node is hot-added.
> 
> add_memory()
>   |-->if (!node_online(nid)) hotadd_new_pgdat()
>                               |-->free_area_init_node()
>                               |-->build_all_zonelists()
> 
> All the zones on the new node will be initialized as empty. So here, we
> build the zone list.
> 
> But actually this does nothing, because no pages are online and the zones
> are empty.
> In build_zonelists_node(), populated_zone(zone) will always be false.
> 
> The real work of building the zone list happens when pages are onlined. :)
> 
> 
> As for your question: suppose some small memory is there and zone_normal
> is present. When these pages are onlined (not just added), the zone list
> has been rebuilt. But the pages in zone_highmem are not added yet, which
> means not onlined, so we don't need to build a zone list for them. Later,
> when the zone_highmem pages are added, we still don't rebuild the zone
> list, because the real rebuilding work happens when the pages are onlined.
> 
> I think this is the current logic. :)

Thanks for your clarification. Actually, I missed "Even if the memory is
hot-added, it is not at ready-to-use state. For using newly added
memory, you have to 'online' the memory section" in the doc. :)

> 
> Thanks. :)
> 
> >
> >>
> >> But we can online the pages as different types, kernel or movable
> >> (which belong to different zones), and we can online part of the
> >> memory, not all of it.
> >> So each time we online some pages, we should check if we need to update
> >> the zone list.
> >>
> >> So I think that is why we do build_all_zonelists when online_pages.
> >> (just my opinion)
> >>
> >> Thanks. :)
> >>
> >>>
> >>>>
> >>>> All pages of virtual mapping in removed memory cannot be freed if some pages
> >>>> used as PGD/PUD include not only removed memory but also other memory. So the
> >>>> patch uses the following way to check whether a page can be freed or not.
> >>>>
> >>>>    1. When removing memory, the page structs of the removed memory are filled
> >>>>       with 0xFD.
> >>>>    2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared.
> >>>>       In this case, the page used as PT/PMD can be freed.
> >>>>
> >>>> Signed-off-by: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
> >>>> Signed-off-by: Jianguo Wu<wujianguo@huawei.com>
> >>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
> >>>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
> >>>> ---
> >>>>    arch/x86/include/asm/pgtable_types.h |    1 +
> >>>>    arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
> >>>>    arch/x86/mm/pageattr.c               |   47 +++---
> >>>>    include/linux/bootmem.h              |    1 +
> >>>>    4 files changed, 326 insertions(+), 22 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> >>>> index 3c32db8..4b6fd2a 100644
> >>>> --- a/arch/x86/include/asm/pgtable_types.h
> >>>> +++ b/arch/x86/include/asm/pgtable_types.h
> >>>> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
> >>>>     * as a pte too.
> >>>>     */
> >>>>    extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> >>>> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
> >>>>
> >>>>    #endif	/* !__ASSEMBLY__ */
> >>>>
> >>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >>>> index 9ac1723..fe01116 100644
> >>>> --- a/arch/x86/mm/init_64.c
> >>>> +++ b/arch/x86/mm/init_64.c
> >>>> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
> >>>>    }
> >>>>    EXPORT_SYMBOL_GPL(arch_add_memory);
> >>>>
> >>>> +#define PAGE_INUSE 0xFD
> >>>> +
> >>>> +static void __meminit free_pagetable(struct page *page, int order)
> >>>> +{
> >>>> +	struct zone *zone;
> >>>> +	bool bootmem = false;
> >>>> +	unsigned long magic;
> >>>> +	unsigned int nr_pages = 1 << order;
> >>>> +
> >>>> +	/* bootmem page has reserved flag */
> >>>> +	if (PageReserved(page)) {
> >>>> +		__ClearPageReserved(page);
> >>>> +		bootmem = true;
> >>>> +
> >>>> +		magic = (unsigned long)page->lru.next;
> >>>> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> >>>> +			while (nr_pages--)
> >>>> +				put_page_bootmem(page++);
> >>>> +		} else
> >>>> +			__free_pages_bootmem(page, order);
> >>>> +	} else
> >>>> +		free_pages((unsigned long)page_address(page), order);
> >>>> +
> >>>> +	/*
> >>>> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
> >>>> +	 * are all allocated by bootmem.
> >>>> +	 */
> >>>> +	if (bootmem) {
> >>>> +		zone = page_zone(page);
> >>>> +		zone_span_writelock(zone);
> >>>> +		zone->present_pages += nr_pages;
> >>>> +		zone_span_writeunlock(zone);
> >>>> +		totalram_pages += nr_pages;
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> >>>> +{
> >>>> +	pte_t *pte;
> >>>> +	int i;
> >>>> +
> >>>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> >>>> +		pte = pte_start + i;
> >>>> +		if (pte_val(*pte))
> >>>> +			return;
> >>>> +	}
> >>>> +
> >>>> +	/* free a pte table */
> >>>> +	free_pagetable(pmd_page(*pmd), 0);
> >>>> +	spin_lock(&init_mm.page_table_lock);
> >>>> +	pmd_clear(pmd);
> >>>> +	spin_unlock(&init_mm.page_table_lock);
> >>>> +}
> >>>> +
> >>>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> >>>> +{
> >>>> +	pmd_t *pmd;
> >>>> +	int i;
> >>>> +
> >>>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> >>>> +		pmd = pmd_start + i;
> >>>> +		if (pmd_val(*pmd))
> >>>> +			return;
> >>>> +	}
> >>>> +
> >>>> +	/* free a pmd table */
> >>>> +	free_pagetable(pud_page(*pud), 0);
> >>>> +	spin_lock(&init_mm.page_table_lock);
> >>>> +	pud_clear(pud);
> >>>> +	spin_unlock(&init_mm.page_table_lock);
> >>>> +}
> >>>> +
> >>>> +/* Return true if pgd is changed, otherwise return false. */
> >>>> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> >>>> +{
> >>>> +	pud_t *pud;
> >>>> +	int i;
> >>>> +
> >>>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> >>>> +		pud = pud_start + i;
> >>>> +		if (pud_val(*pud))
> >>>> +			return false;
> >>>> +	}
> >>>> +
> >>>> +	/* free a pud table */
> >>>> +	free_pagetable(pgd_page(*pgd), 0);
> >>>> +	spin_lock(&init_mm.page_table_lock);
> >>>> +	pgd_clear(pgd);
> >>>> +	spin_unlock(&init_mm.page_table_lock);
> >>>> +
> >>>> +	return true;
> >>>> +}
> >>>> +
> >>>> +static void __meminit
> >>>> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> >>>> +		 bool direct)
> >>>> +{
> >>>> +	unsigned long next, pages = 0;
> >>>> +	pte_t *pte;
> >>>> +	void *page_addr;
> >>>> +	phys_addr_t phys_addr;
> >>>> +
> >>>> +	pte = pte_start + pte_index(addr);
> >>>> +	for (; addr < end; addr = next, pte++) {
> >>>> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> >>>> +		if (next > end)
> >>>> +			next = end;
> >>>> +
> >>>> +		if (!pte_present(*pte))
> >>>> +			continue;
> >>>> +
> >>>> +		/*
> >>>> +		 * We mapped [0,1G) memory as identity mapping when
> >>>> +		 * initializing, in arch/x86/kernel/head_64.S. These
> >>>> +		 * pagetables cannot be removed.
> >>>> +		 */
> >>>> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
> >>>> +		if (phys_addr < (phys_addr_t)0x40000000)
> >>>> +			return;
> >>>> +
> >>>> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> >>>> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> >>>> +			if (!direct) {
> >>>> +				free_pagetable(pte_page(*pte), 0);
> >>>> +				pages++;
> >>>> +			}
> >>>> +
> >>>> +			spin_lock(&init_mm.page_table_lock);
> >>>> +			pte_clear(&init_mm, addr, pte);
> >>>> +			spin_unlock(&init_mm.page_table_lock);
> >>>> +		} else {
> >>>> +			/*
> >>>> +			 * If we are not removing the whole page, it means
> >>>> +			 * other ptes in this page are being used and we cannot
> >>>> +			 * remove them. So fill the unused ptes with 0xFD, and
> >>>> +			 * remove the page when it is wholly filled with 0xFD.
> >>>> +			 */
> >>>> +			memset((void *)addr, PAGE_INUSE, next - addr);
> >>>> +			page_addr = page_address(pte_page(*pte));
> >>>> +
> >>>> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> >>>> +				free_pagetable(pte_page(*pte), 0);
> >>>> +				pages++;
> >>>> +
> >>>> +				spin_lock(&init_mm.page_table_lock);
> >>>> +				pte_clear(&init_mm, addr, pte);
> >>>> +				spin_unlock(&init_mm.page_table_lock);
> >>>> +			}
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>> +	/* Call free_pte_table() in remove_pmd_table(). */
> >>>> +	flush_tlb_all();
> >>>> +	if (direct)
> >>>> +		update_page_count(PG_LEVEL_4K, -pages);
> >>>> +}
> >>>> +
> >>>> +static void __meminit
> >>>> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> >>>> +		 bool direct)
> >>>> +{
> >>>> +	unsigned long pte_phys, next, pages = 0;
> >>>> +	pte_t *pte_base;
> >>>> +	pmd_t *pmd;
> >>>> +
> >>>> +	pmd = pmd_start + pmd_index(addr);
> >>>> +	for (; addr < end; addr = next, pmd++) {
> >>>> +		next = pmd_addr_end(addr, end);
> >>>> +
> >>>> +		if (!pmd_present(*pmd))
> >>>> +			continue;
> >>>> +
> >>>> +		if (pmd_large(*pmd)) {
> >>>> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> >>>> +			    IS_ALIGNED(next, PMD_SIZE)) {
> >>>> +				if (!direct) {
> >>>> +					free_pagetable(pmd_page(*pmd),
> >>>> +						       get_order(PMD_SIZE));
> >>>> +					pages++;
> >>>> +				}
> >>>> +
> >>>> +				spin_lock(&init_mm.page_table_lock);
> >>>> +				pmd_clear(pmd);
> >>>> +				spin_unlock(&init_mm.page_table_lock);
> >>>> +				continue;
> >>>> +			}
> >>>> +
> >>>> +			/*
> >>>> +			 * We use 2M page, but we need to remove part of them,
> >>>> +			 * so split 2M page to 4K page.
> >>>> +			 */
> >>>> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
> >>>> +			BUG_ON(!pte_base);
> >>>> +			__split_large_page((pte_t *)pmd, addr,
> >>>> +					   (pte_t *)pte_base);
> >>>> +
> >>>> +			spin_lock(&init_mm.page_table_lock);
> >>>> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> >>>> +			spin_unlock(&init_mm.page_table_lock);
> >>>> +
> >>>> +			flush_tlb_all();
> >>>> +		}
> >>>> +
> >>>> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> >>>> +		remove_pte_table(pte_base, addr, next, direct);
> >>>> +		free_pte_table(pte_base, pmd);
> >>>> +		unmap_low_page(pte_base);
> >>>> +	}
> >>>> +
> >>>> +	/* Call free_pmd_table() in remove_pud_table(). */
> >>>> +	if (direct)
> >>>> +		update_page_count(PG_LEVEL_2M, -pages);
> >>>> +}
> >>>> +
> >>>> +static void __meminit
> >>>> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> >>>> +		 bool direct)
> >>>> +{
> >>>> +	unsigned long pmd_phys, next, pages = 0;
> >>>> +	pmd_t *pmd_base;
> >>>> +	pud_t *pud;
> >>>> +
> >>>> +	pud = pud_start + pud_index(addr);
> >>>> +	for (; addr < end; addr = next, pud++) {
> >>>> +		next = pud_addr_end(addr, end);
> >>>> +
> >>>> +		if (!pud_present(*pud))
> >>>> +			continue;
> >>>> +
> >>>> +		if (pud_large(*pud)) {
> >>>> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
> >>>> +			    IS_ALIGNED(next, PUD_SIZE)) {
> >>>> +				if (!direct) {
> >>>> +					free_pagetable(pud_page(*pud),
> >>>> +						       get_order(PUD_SIZE));
> >>>> +					pages++;
> >>>> +				}
> >>>> +
> >>>> +				spin_lock(&init_mm.page_table_lock);
> >>>> +				pud_clear(pud);
> >>>> +				spin_unlock(&init_mm.page_table_lock);
> >>>> +				continue;
> >>>> +			}
> >>>> +
> >>>> +			/*
> >>>> +			 * We use 1G page, but we need to remove part of them,
> >>>> +			 * so split 1G page to 2M page.
> >>>> +			 */
> >>>> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> >>>> +			BUG_ON(!pmd_base);
> >>>> +			__split_large_page((pte_t *)pud, addr,
> >>>> +					   (pte_t *)pmd_base);
> >>>> +
> >>>> +			spin_lock(&init_mm.page_table_lock);
> >>>> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> >>>> +			spin_unlock(&init_mm.page_table_lock);
> >>>> +
> >>>> +			flush_tlb_all();
> >>>> +		}
> >>>> +
> >>>> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> >>>> +		remove_pmd_table(pmd_base, addr, next, direct);
> >>>> +		free_pmd_table(pmd_base, pud);
> >>>> +		unmap_low_page(pmd_base);
> >>>> +	}
> >>>> +
> >>>> +	if (direct)
> >>>> +		update_page_count(PG_LEVEL_1G, -pages);
> >>>> +}
> >>>> +
> >>>> +/* start and end are both virtual address. */
> >>>> +static void __meminit
> >>>> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> >>>> +{
> >>>> +	unsigned long next;
> >>>> +	pgd_t *pgd;
> >>>> +	pud_t *pud;
> >>>> +	bool pgd_changed = false;
> >>>> +
> >>>> +	for (; start < end; start = next) {
> >>>> +		pgd = pgd_offset_k(start);
> >>>> +		if (!pgd_present(*pgd))
> >>>> +			continue;
> >>>> +
> >>>> +		next = pgd_addr_end(start, end);
> >>>> +
> >>>> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> >>>> +		remove_pud_table(pud, start, next, direct);
> >>>> +		if (free_pud_table(pud, pgd))
> >>>> +			pgd_changed = true;
> >>>> +		unmap_low_page(pud);
> >>>> +	}
> >>>> +
> >>>> +	if (pgd_changed)
> >>>> +		sync_global_pgds(start, end - 1);
> >>>> +
> >>>> +	flush_tlb_all();
> >>>> +}
> >>>> +
> >>>>    #ifdef CONFIG_MEMORY_HOTREMOVE
> >>>>    int __ref arch_remove_memory(u64 start, u64 size)
> >>>>    {
> >>>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> >>>> index a718e0d..7dcb6f9 100644
> >>>> --- a/arch/x86/mm/pageattr.c
> >>>> +++ b/arch/x86/mm/pageattr.c
> >>>> @@ -501,21 +501,13 @@ out_unlock:
> >>>>    	return do_split;
> >>>>    }
> >>>>
> >>>> -static int split_large_page(pte_t *kpte, unsigned long address)
> >>>> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
> >>>>    {
> >>>>    	unsigned long pfn, pfninc = 1;
> >>>>    	unsigned int i, level;
> >>>> -	pte_t *pbase, *tmp;
> >>>> +	pte_t *tmp;
> >>>>    	pgprot_t ref_prot;
> >>>> -	struct page *base;
> >>>> -
> >>>> -	if (!debug_pagealloc)
> >>>> -		spin_unlock(&cpa_lock);
> >>>> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >>>> -	if (!debug_pagealloc)
> >>>> -		spin_lock(&cpa_lock);
> >>>> -	if (!base)
> >>>> -		return -ENOMEM;
> >>>> +	struct page *base = virt_to_page(pbase);
> >>>>
> >>>>    	spin_lock(&pgd_lock);
> >>>>    	/*
> >>>> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>>>    	 * up for us already:
> >>>>    	 */
> >>>>    	tmp = lookup_address(address, &level);
> >>>> -	if (tmp != kpte)
> >>>> -		goto out_unlock;
> >>>> +	if (tmp != kpte) {
> >>>> +		spin_unlock(&pgd_lock);
> >>>> +		return 1;
> >>>> +	}
> >>>>
> >>>> -	pbase = (pte_t *)page_address(base);
> >>>>    	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
> >>>>    	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
> >>>>    	/*
> >>>> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>>>    	 * going on.
> >>>>    	 */
> >>>>    	__flush_tlb_all();
> >>>> +	spin_unlock(&pgd_lock);
> >>>>
> >>>> -	base = NULL;
> >>>> +	return 0;
> >>>> +}
> >>>>
> >>>> -out_unlock:
> >>>> -	/*
> >>>> -	 * If we dropped out via the lookup_address check under
> >>>> -	 * pgd_lock then stick the page back into the pool:
> >>>> -	 */
> >>>> -	if (base)
> >>>> +static int split_large_page(pte_t *kpte, unsigned long address)
> >>>> +{
> >>>> +	pte_t *pbase;
> >>>> +	struct page *base;
> >>>> +
> >>>> +	if (!debug_pagealloc)
> >>>> +		spin_unlock(&cpa_lock);
> >>>> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >>>> +	if (!debug_pagealloc)
> >>>> +		spin_lock(&cpa_lock);
> >>>> +	if (!base)
> >>>> +		return -ENOMEM;
> >>>> +
> >>>> +	pbase = (pte_t *)page_address(base);
> >>>> +	if (__split_large_page(kpte, address, pbase))
> >>>>    		__free_page(base);
> >>>> -	spin_unlock(&pgd_lock);
> >>>>
> >>>>    	return 0;
> >>>>    }
> >>>> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> >>>> index 3f778c2..190ff06 100644
> >>>> --- a/include/linux/bootmem.h
> >>>> +++ b/include/linux/bootmem.h
> >>>> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
> >>>>    			      unsigned long size);
> >>>>    extern void free_bootmem(unsigned long physaddr, unsigned long size);
> >>>>    extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> >>>> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
> >>>>
> >>>>    /*
> >>>>     * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,
> >>>
> >>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>
> >> --
> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> the body to majordomo@kvack.org.  For more info on Linux MM,
> >> see: http://www.linux-mm.org/ .
> >> Don't email:<a href=mailto:"dont@kvack.org">  email@kvack.org</a>
> >
> >
> >
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-29 12:52 ` Simon Jeons
  2013-01-30  2:32   ` Tang Chen
@ 2013-01-30 10:15   ` Tang Chen
  2013-01-30 10:18     ` Tang Chen
  2013-01-31  1:22     ` Simon Jeons
  1 sibling, 2 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-30 10:15 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Simon,

Please see below. :)

On 01/29/2013 08:52 PM, Simon Jeons wrote:
> Hi Tang,
>
> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>> Here is the physical memory hot-remove patch-set based on 3.8rc-2.
>
> Some questions ask you, not has relationship with this patchset, but is
> memory hotplug stuff.
>
> 1. In function node_states_check_changes_online:
>
> comments:
> * If we don't have HIGHMEM nor movable node,
> * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
>
> How to understand it? Why we don't have HIGHMEM nor movable node and
> node_states[N_NORMAL_MEMORY] contains 0...ZONE_MOVABLE, IIUC,
> N_NORMAL_MEMORY only means the node has regular memory.
>

First of all, I think we need to understand why we need N_MEMORY.

In order to support movable node, which has only ZONE_MOVABLE (the last
zone), we introduce N_MEMORY to represent a node that has normal, highmem
or movable memory.

Here, "we have movable node" means you configured CONFIG_MOVABLE_NODE.
This config option doesn't mean we don't have movable pages (NO); it means
we don't have a node which has only movable pages (only ZONE_MOVABLE) (YES).

Here, if we don't have CONFIG_MOVABLE_NODE (we don't have movable node),
we don't need a separate node_states[] element to represent such a node,
because we won't have a node which has only ZONE_MOVABLE.

So,
1) if we don't have highmem nor movable node, N_MEMORY == N_HIGH_MEMORY ==
   N_NORMAL_MEMORY, which means N_NORMAL_MEMORY acts as N_MEMORY. If we
   online pages as movable, we need to update node_states[N_NORMAL_MEMORY].

Please refer to the definition of enum zone_type: if we don't have
CONFIG_HIGHMEM, we won't have ZONE_HIGHMEM, but ZONE_NORMAL and
ZONE_MOVABLE will always be there. So we can have movable pages, and
zone_last should be ZONE_MOVABLE.

Again, because we won't have a node having only ZONE_MOVABLE, we just need
to update node_states[N_NORMAL_MEMORY].
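
A tiny userspace C sketch of that zone_last reasoning (an illustration
only, not kernel code: the enum merely mirrors the ordering of
enum zone_type, and the helper name zone_last_for_normal is made up):

```c
#include <assert.h>

/* Zone ordering mirrors enum zone_type: ZONE_MOVABLE is always last. */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

/*
 * Pick the last zone that node_states[N_NORMAL_MEMORY] has to cover.
 * Without CONFIG_HIGHMEM and without movable node, N_NORMAL_MEMORY
 * stands in for N_MEMORY, so it covers 0...ZONE_MOVABLE.
 */
enum zone_type zone_last_for_normal(int config_highmem,
				    int config_movable_node)
{
	if (!config_highmem && !config_movable_node)
		return ZONE_MOVABLE;	/* N_NORMAL_MEMORY acts as N_MEMORY */
	return ZONE_NORMAL;
}
```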

> * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> * contains nodes which have zones of 0...ZONE_MOVABLE,
> * set zone_last to ZONE_MOVABLE.
>
> How to understand?

2) this code is in #ifdef CONFIG_HIGHMEM, which means we have highmem, so
   if we don't have movable node, N_MEMORY == N_HIGH_MEMORY, and
   N_HIGH_MEMORY acts as N_MEMORY. If we online pages as movable, we need
   to update node_states[N_NORMAL_MEMORY].

>
> 2. In function move_pfn_range_left, why end<= z2->zone_start_pfn is not
> correct? The comments said that must include/overlap, why?
>

This one is easy, if I understand you correctly.
move_pfn_range_left() is used to move the left-most part [start_pfn,
end_pfn) of z2 to z1. So if end_pfn <= z2->zone_start_pfn, it means
[start_pfn, end_pfn) is not part of z2 at all, and the move fails.
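
That check can be sketched like this (a hypothetical userspace helper:
zone_stub and can_move_left are made-up names, and the real
move_pfn_range_left() of course does much more than this one test):

```c
#include <assert.h>

struct zone_stub {
	unsigned long zone_start_pfn;
	unsigned long zone_end_pfn;	/* exclusive end */
};

/*
 * [start_pfn, end_pfn) must be the left-most part of z2, so it at
 * least has to reach into z2.  If end_pfn <= z2->zone_start_pfn the
 * range lies entirely before z2, and moving it into z1 must fail.
 */
int can_move_left(const struct zone_stub *z2,
		  unsigned long start_pfn, unsigned long end_pfn)
{
	if (end_pfn <= z2->zone_start_pfn)
		return 0;	/* range is not part of z2 */
	return 1;
}
```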

> 3. In function online_pages, the normal case (w/o online_kernel,
> online_movable), why not check if the new zone is overlap with adjacent
> zones?
>

Can a zone overlap with the others? I don't think so.

One pfn can only be in one zone:
    zone = page_zone(pfn_to_page(pfn));

so it could not overlap with others, I think. :)

But maybe I misunderstand you. :)

> 4. Could you summarize the difference implementation between hot-add and
> logic-add, hot-remove and logic-remove?

Sorry, I don't quite understand what you mean by logic-add/remove.
Would you please explain more?

If you meant the sysfs interfaces, I think they are just another set of
entrances to memory hotplug.

Thanks.  :)

>
>
>>
>> This patch-set aims to implement physical memory hot-removing.
>>
>> The patches can free/remove the following things:
>>
>>    - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>>    - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
>>    - page table of removed memory              : [RFC PATCH 7,8,10/15]
>>    - node and related sysfs files              : [RFC PATCH 13-15/15]
>>
>>
>> Existing problem:
>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>> when we online pages.
>>
>> For example: there is a memory device on node 1. The address range
>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>> and memory11 under the directory /sys/devices/system/memory/.
>>
>> If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
>> cgroup is not provided by this memory device. But when we online memory9, the
>> memory stored page cgroup may be provided by memory8. So we can't offline
>> memory8 now. We should offline the memory in the reversed order.
>>
>> When the memory device is hotremoved, we will auto offline memory provided
>> by this memory device. But we don't know which memory is onlined first, so
>> offlining memory may fail.
>>
>> In patch1, we provide a solution which is not good enough:
>> Iterate twice to offline the memory.
>> 1st iterate: offline every non primary memory block.
>> 2nd iterate: offline primary (i.e. first added) memory block.
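
The two passes above can be sketched in plain C (hypothetical types and
names; the real patch iterates over memory_block devices and can still
fail per block, it does not walk a flat array like this):

```c
#include <assert.h>
#include <stddef.h>

struct mem_block {
	int primary;	/* first-added block of the memory device */
	int offline;
};

/*
 * Pass 1: offline every non-primary block.  Pass 2: the primary one.
 * Offlining in this order sidesteps the page-cgroup dependency: later
 * blocks may hold the page cgroups of earlier ones, so the first-added
 * block has to go last.
 */
void offline_device(struct mem_block *blk, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!blk[i].primary)
			blk[i].offline = 1;
	for (i = 0; i < n; i++)
		if (blk[i].primary)
			blk[i].offline = 1;
}
```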
>>
>> And a new idea from Wen Congyang<wency@cn.fujitsu.com>  is:
>> allocate the memory from the memory block they are describing.
>>
>> But we are not sure if it is OK to do so because there is not existing API
>> to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE
>> to MEM_ONLINE. And also, it may interfere with hugepages.
>>
>>
>>
>> How to test this patchset?
>> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
>>     ACPI_HOTPLUG_MEMORY must be selected.
>> 2. load the module acpi_memhotplug
>> 3. hotplug the memory device(it depends on your hardware)
>>     You will see the memory device under the directory /sys/bus/acpi/devices/.
>>     Its name is PNP0C80:XX.
>> 4. online/offline pages provided by this memory device
>>     You can write online/offline to /sys/devices/system/memory/memoryX/state to
>>     online/offline pages provided by this memory device
>> 5. hotremove the memory device
>>     You can hotremove the memory device by the hardware, or writing 1 to
>>     /sys/bus/acpi/devices/PNP0C80:XX/eject.
>
> Is there a similar knode to hot-add the memory device?
>
>>
>>
>> Note: if the memory provided by the memory device is used by the kernel, it
>> can't be offlined. It is not a bug.
>>
>>
>> Changelogs from v5 to v6:
>>   Patch3: Add some more comments to explain memory hot-remove.
>>   Patch4: Remove bootmem member in struct firmware_map_entry.
>>   Patch6: Repeatedly register bootmem pages when using hugepage.
>>   Patch8: Repeatedly free bootmem pages when using hugepage.
>>   Patch14: Don't free pgdat when offlining a node, just reset it to 0.
>>   Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
>>            one when online a node.
>>
>> Changelogs from v4 to v5:
>>   Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
>>           avoid disabling irq because we need flush tlb when free pagetables.
>>   Patch8: new patch, pick up some common APIs that are used to free direct mapping
>>           and vmemmap pagetables.
>>   Patch9: free direct mapping pagetables on x86_64 arch.
>>   Patch10: free vmemmap pagetables.
>>   Patch11: since freeing memmap with vmemmap has been implemented, the config
>>            macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
>>            no longer needed.
>>   Patch13: no need to modify acpi_memory_disable_device() since it was removed,
>>            and add nid parameter when calling remove_memory().
>>
>> Changelogs from v3 to v4:
>>   Patch7: remove unused codes.
>>   Patch8: fix nr_pages that is passed to free_map_bootmem()
>>
>> Changelogs from v2 to v3:
>>   Patch9: call sync_global_pgds() if pgd is changed
>>   Patch10: fix a problem in the patch
>>
>> Changelogs from v1 to v2:
>>   Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
>>           memory block. 2nd iterate: offline primary (i.e. first added) memory
>>           block.
>>
>>   Patch3: new patch, no logical change, just remove redundant codes.
>>
>>   Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
>>           after the pagetable is changed.
>>
>>   Patch12: new patch, free node_data when a node is offlined.
>>
>>
>> Tang Chen (6):
>>    memory-hotplug: move pgdat_resize_lock into
>>      sparse_remove_one_section()
>>    memory-hotplug: remove page table of x86_64 architecture
>>    memory-hotplug: remove memmap of sparse-vmemmap
>>    memory-hotplug: Integrated __remove_section() of
>>      CONFIG_SPARSEMEM_VMEMMAP.
>>    memory-hotplug: remove sysfs file of node
>>    memory-hotplug: Do not allocate pgdat if it was not freed when
>>      offline.
>>
>> Wen Congyang (5):
>>    memory-hotplug: try to offline the memory twice to avoid dependence
>>    memory-hotplug: remove redundant codes
>>    memory-hotplug: introduce new function arch_remove_memory() for
>>      removing page table depends on architecture
>>    memory-hotplug: Common APIs to support page tables hot-remove
>>    memory-hotplug: free node_data when a node is offlined
>>
>> Yasuaki Ishimatsu (4):
>>    memory-hotplug: check whether all memory blocks are offlined or not
>>      when removing memory
>>    memory-hotplug: remove /sys/firmware/memmap/X sysfs
>>    memory-hotplug: implement register_page_bootmem_info_section of
>>      sparse-vmemmap
>>    memory-hotplug: memory_hotplug: clear zone when removing the memory
>>
>>   arch/arm64/mm/mmu.c                  |    3 +
>>   arch/ia64/mm/discontig.c             |   10 +
>>   arch/ia64/mm/init.c                  |   18 ++
>>   arch/powerpc/mm/init_64.c            |   10 +
>>   arch/powerpc/mm/mem.c                |   12 +
>>   arch/s390/mm/init.c                  |   12 +
>>   arch/s390/mm/vmem.c                  |   10 +
>>   arch/sh/mm/init.c                    |   17 ++
>>   arch/sparc/mm/init_64.c              |   10 +
>>   arch/tile/mm/init.c                  |    8 +
>>   arch/x86/include/asm/pgtable_types.h |    1 +
>>   arch/x86/mm/init_32.c                |   12 +
>>   arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
>>   arch/x86/mm/pageattr.c               |   47 ++--
>>   drivers/acpi/acpi_memhotplug.c       |    8 +-
>>   drivers/base/memory.c                |    6 +
>>   drivers/firmware/memmap.c            |   96 +++++++-
>>   include/linux/bootmem.h              |    1 +
>>   include/linux/firmware-map.h         |    6 +
>>   include/linux/memory_hotplug.h       |   15 +-
>>   include/linux/mm.h                   |    4 +-
>>   mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
>>   mm/sparse.c                          |    8 +-
>>   23 files changed, 1094 insertions(+), 69 deletions(-)
>>
>
>
>


* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-30 10:15   ` Tang Chen
@ 2013-01-30 10:18     ` Tang Chen
  2013-01-31  1:22     ` Simon Jeons
  1 sibling, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-30 10:18 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

On 01/30/2013 06:15 PM, Tang Chen wrote:
> Hi Simon,
>
> Please see below. :)
>
> On 01/29/2013 08:52 PM, Simon Jeons wrote:
>> Hi Tang,
>>
>> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>>> Here is the physical memory hot-remove patch-set based on 3.8rc-2.
>>
>> Some questions ask you, not has relationship with this patchset, but is
>> memory hotplug stuff.
>>
>> 1. In function node_states_check_changes_online:
>>
>> comments:
>> * If we don't have HIGHMEM nor movable node,
>> * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
>> * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
>>
>> How to understand it? Why we don't have HIGHMEM nor movable node and
>> node_states[N_NORMAL_MEMORY] contains 0...ZONE_MOVABLE, IIUC,
>> N_NORMAL_MEMORY only means the node has regular memory.
>>
>
> First of all, I think we need to understand why we need N_MEMORY.
>
> In order to support movable node, which has only ZONE_MOVABLE (the last
> zone),
> we introduce N_MEMORY to represent the node has normal, highmem and
> movable memory.
>
> Here, "we have movable node" means you configured CONFIG_MOVABLE_NODE.

Sorry, that should be: "we don't have movable node" means you didn't
configure CONFIG_MOVABLE_NODE.

> This config option doesn't mean we don't have movable pages, (NO)
> it means we don't have a node which has only movable pages (only have
> ZONE_MOVABLE). (YES)
>
> Here, if we don't have CONFIG_MOVABLE_NODE (we don't have movable node),
> we don't need a
> separate node_states[] element to represent a particular node because we
> won't have a node
> which has only ZONE_MOVABLE.
>
> So,
> 1) if we don't have highmem nor movable node, N_MEMORY == N_HIGH_MEMORY
> == N_NORMAL_MEMORY,
> which means N_NORMAL_MEMORY effects as N_MEMORY. If we online pages as
> movable, we need
> to update node_states[N_NORMAL_MEMORY].
>
> Please refer to the definition of enum zone_type, if we don't have
> CONFIG_HIGHMEM, we won't
> have ZONE_HIGHMEM, but ZONE_NORMAL and ZONE_MOVABLE will always there.
> So we can have movable
> pages, and the zone_last should be ZONE_MOVABLE.
>
> Again, because we won't have a node only having ZONE_MOVABLE, so we just
> need to update
> node_states[N_NORMAL_MEMORY].
>
>> * If we don't have movable node, node_states[N_NORMAL_MEMORY]
>> * contains nodes which have zones of 0...ZONE_MOVABLE,
>> * set zone_last to ZONE_MOVABLE.
>>
>> How to understand?
>
> 2) this code is in #ifdef CONFIG_HIGHMEM, which means we have highmem,
> so if we don't have
> movable node, N_MEMORY == N_HIGH_MEMORY, and N_HIGH_MEMORY effects as
> N_MEMORY. If we
> online pages as movable, we need to update node_states[N_NORMAL_MEMORY].
>
>>
>> 2. In function move_pfn_range_left, why end<= z2->zone_start_pfn is not
>> correct? The comments said that must include/overlap, why?
>>
>
> This one is easy, if I understand you correctly.
> move_pfn_range_left() is used to move the left most part [start_pfn,
> end_pfn) of z2 to z1.
> So if end_pfn<= z2->zone_start_pfn, it means [start_pfn, end_pfn) is not
> part of z2.
> Then it fails.
>
>> 3. In function online_pages, the normal case (w/o online_kernel,
>> online_movable), why not check if the new zone is overlap with adjacent
>> zones?
>>
>
> Can a zone overlap with the others ? I don't think so.
>
> One pfn could only be in one zone,
> zone = page_zone(pfn_to_page(pfn));
>
> so it could not overlap with others, I think. :)
>
> But maybe I misunderstand you. :)
>
>> 4. Could you summarize the difference implementation between hot-add and
>> logic-add, hot-remove and logic-remove?
>
> Sorry, I don't quite understand what you mean by logic-add/remove.
> Would you please explain more ?
>
> If you meant the sys fs interfaces, I think they are just another set of
> entrances
> of memory hotplug.
>
> Thanks. :)
>
>>
>>
>>>
>>> This patch-set aims to implement physical memory hot-removing.
>>>
>>> The patches can free/remove the following things:
>>>
>>> - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>>> - memmap of sparse-vmemmap : [PATCH 6,7,8,10/15]
>>> - page table of removed memory : [RFC PATCH 7,8,10/15]
>>> - node and related sysfs files : [RFC PATCH 13-15/15]
>>>
>>>
>>> Existing problem:
>>> If CONFIG_MEMCG is selected, we will allocate memory to store page
>>> cgroup
>>> when we online pages.
>>>
>>> For example: there is a memory device on node 1. The address range
>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>> memory10,
>>> and memory11 under the directory /sys/devices/system/memory/.
>>>
>>> If CONFIG_MEMCG is selected, when we online memory8, the memory
>>> stored page
>>> cgroup is not provided by this memory device. But when we online
>>> memory9, the
>>> memory stored page cgroup may be provided by memory8. So we can't
>>> offline
>>> memory8 now. We should offline the memory in the reversed order.
>>>
>>> When the memory device is hotremoved, we will auto offline memory
>>> provided
>>> by this memory device. But we don't know which memory is onlined
>>> first, so
>>> offlining memory may fail.
>>>
>>> In patch1, we provide a solution which is not good enough:
>>> Iterate twice to offline the memory.
>>> 1st iterate: offline every non primary memory block.
>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>
>>> And a new idea from Wen Congyang<wency@cn.fujitsu.com> is:
>>> allocate the memory from the memory block they are describing.
>>>
>>> But we are not sure if it is OK to do so because there is not
>>> existing API
>>> to do so, and we need to move page_cgroup memory allocation from
>>> MEM_GOING_ONLINE
>>> to MEM_ONLINE. And also, it may interfere the hugepage.
>>>
>>>
>>>
>>> How to test this patchset?
>>> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG,
>>> MEMORY_HOTREMOVE,
>>> ACPI_HOTPLUG_MEMORY must be selected.
>>> 2. load the module acpi_memhotplug
>>> 3. hotplug the memory device(it depends on your hardware)
>>> You will see the memory device under the directory
>>> /sys/bus/acpi/devices/.
>>> Its name is PNP0C80:XX.
>>> 4. online/offline pages provided by this memory device
>>> You can write online/offline to
>>> /sys/devices/system/memory/memoryX/state to
>>> online/offline pages provided by this memory device
>>> 5. hotremove the memory device
>>> You can hotremove the memory device by the hardware, or writing 1 to
>>> /sys/bus/acpi/devices/PNP0C80:XX/eject.
>>
>> Is there a similar knode to hot-add the memory device?
>>
>>>
>>>
>>> Note: if the memory provided by the memory device is used by the
>>> kernel, it
>>> can't be offlined. It is not a bug.
>>>
>>>
>>> Changelogs from v5 to v6:
>>> Patch3: Add some more comments to explain memory hot-remove.
>>> Patch4: Remove bootmem member in struct firmware_map_entry.
>>> Patch6: Repeatedly register bootmem pages when using hugepage.
>>> Patch8: Repeatedly free bootmem pages when using hugepage.
>>> Patch14: Don't free pgdat when offlining a node, just reset it to 0.
>>> Patch15: New patch, pgdat is not freed in patch14, so don't allocate
>>> a new
>>> one when online a node.
>>>
>>> Changelogs from v4 to v5:
>>> Patch7: new patch, move pgdat_resize_lock into
>>> sparse_remove_one_section() to
>>> avoid disabling irq because we need flush tlb when free pagetables.
>>> Patch8: new patch, pick up some common APIs that are used to free
>>> direct mapping
>>> and vmemmap pagetables.
>>> Patch9: free direct mapping pagetables on x86_64 arch.
>>> Patch10: free vmemmap pagetables.
>>> Patch11: since freeing memmap with vmemmap has been implemented, the
>>> config
>>> macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
>>> no longer needed.
>>> Patch13: no need to modify acpi_memory_disable_device() since it was
>>> removed,
>>> and add nid parameter when calling remove_memory().
>>>
>>> Changelogs from v3 to v4:
>>> Patch7: remove unused codes.
>>> Patch8: fix nr_pages that is passed to free_map_bootmem()
>>>
>>> Changelogs from v2 to v3:
>>> Patch9: call sync_global_pgds() if pgd is changed
>>> Patch10: fix a problem in the patch
>>>
>>> Changelogs from v1 to v2:
>>> Patch1: new patch, offline memory twice. 1st iterate: offline every
>>> non primary
>>> memory block. 2nd iterate: offline primary (i.e. first added) memory
>>> block.
>>>
>>> Patch3: new patch, no logical change, just remove redundant codes.
>>>
>>> Patch9: merge the patch from wujianguo into this patch. flush tlb on
>>> all cpu
>>> after the pagetable is changed.
>>>
>>> Patch12: new patch, free node_data when a node is offlined.
>>>
>>>
>>> Tang Chen (6):
>>> memory-hotplug: move pgdat_resize_lock into
>>> sparse_remove_one_section()
>>> memory-hotplug: remove page table of x86_64 architecture
>>> memory-hotplug: remove memmap of sparse-vmemmap
>>> memory-hotplug: Integrated __remove_section() of
>>> CONFIG_SPARSEMEM_VMEMMAP.
>>> memory-hotplug: remove sysfs file of node
>>> memory-hotplug: Do not allocate pgdat if it was not freed when
>>> offline.
>>>
>>> Wen Congyang (5):
>>> memory-hotplug: try to offline the memory twice to avoid dependence
>>> memory-hotplug: remove redundant codes
>>> memory-hotplug: introduce new function arch_remove_memory() for
>>> removing page table depends on architecture
>>> memory-hotplug: Common APIs to support page tables hot-remove
>>> memory-hotplug: free node_data when a node is offlined
>>>
>>> Yasuaki Ishimatsu (4):
>>> memory-hotplug: check whether all memory blocks are offlined or not
>>> when removing memory
>>> memory-hotplug: remove /sys/firmware/memmap/X sysfs
>>> memory-hotplug: implement register_page_bootmem_info_section of
>>> sparse-vmemmap
>>> memory-hotplug: memory_hotplug: clear zone when removing the memory
>>>
>>> arch/arm64/mm/mmu.c | 3 +
>>> arch/ia64/mm/discontig.c | 10 +
>>> arch/ia64/mm/init.c | 18 ++
>>> arch/powerpc/mm/init_64.c | 10 +
>>> arch/powerpc/mm/mem.c | 12 +
>>> arch/s390/mm/init.c | 12 +
>>> arch/s390/mm/vmem.c | 10 +
>>> arch/sh/mm/init.c | 17 ++
>>> arch/sparc/mm/init_64.c | 10 +
>>> arch/tile/mm/init.c | 8 +
>>> arch/x86/include/asm/pgtable_types.h | 1 +
>>> arch/x86/mm/init_32.c | 12 +
>>> arch/x86/mm/init_64.c | 390 +++++++++++++++++++++++++++++
>>> arch/x86/mm/pageattr.c | 47 ++--
>>> drivers/acpi/acpi_memhotplug.c | 8 +-
>>> drivers/base/memory.c | 6 +
>>> drivers/firmware/memmap.c | 96 +++++++-
>>> include/linux/bootmem.h | 1 +
>>> include/linux/firmware-map.h | 6 +
>>> include/linux/memory_hotplug.h | 15 +-
>>> include/linux/mm.h | 4 +-
>>> mm/memory_hotplug.c | 459 +++++++++++++++++++++++++++++++---
>>> mm/sparse.c | 8 +-
>>> 23 files changed, 1094 insertions(+), 69 deletions(-)
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@kvack.org. For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: email@kvack.org
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-30 10:15   ` Tang Chen
  2013-01-30 10:18     ` Tang Chen
@ 2013-01-31  1:22     ` Simon Jeons
  2013-01-31  3:31       ` Tang Chen
  1 sibling, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-01-31  1:22 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Tang,
On Wed, 2013-01-30 at 18:15 +0800, Tang Chen wrote:
> Hi Simon,
> 
> Please see below. :)
> 
> On 01/29/2013 08:52 PM, Simon Jeons wrote:
> > Hi Tang,
> >
> > On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> >> Here is the physical memory hot-remove patch-set based on 3.8-rc2.
> >
> > Some questions to ask you; they are not related to this patchset, but are
> > general memory hotplug stuff.
> >
> > 1. In function node_states_check_changes_online:
> >
> > comments:
> > * If we don't have HIGHMEM nor movable node,
> > * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> > * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
> >
> > How to understand it? Why, if we don't have HIGHMEM nor a movable node, does
> > node_states[N_NORMAL_MEMORY] contain 0...ZONE_MOVABLE? IIUC,
> > N_NORMAL_MEMORY only means the node has regular memory.
> >
> 
> First of all, I think we need to understand why we need N_MEMORY.
> 
> In order to support movable node, which has only ZONE_MOVABLE (the last 
> zone),
> we introduce N_MEMORY to represent the node has normal, highmem and 
> movable memory.
> 
> Here, "we have movable node" means you configured CONFIG_MOVABLE_NODE.
> This config option doesn't mean we don't have movable pages, (NO)
> it means we don't have a node which has only movable pages (only have 
> ZONE_MOVABLE). (YES)
> 
> Here, if we don't have CONFIG_MOVABLE_NODE (we don't have movable node), 
> we don't need a
> separate node_states[] element to represent a particular node because we 
> won't have a node
> which has only ZONE_MOVABLE.
> 
> So,
> 1) if we don't have highmem nor movable node, N_MEMORY == N_HIGH_MEMORY 
> == N_NORMAL_MEMORY,
>     which means N_NORMAL_MEMORY effects as N_MEMORY. If we online pages 
> as movable, we need
>     to update node_states[N_NORMAL_MEMORY].

Sorry, I'm still confused. :(
Do we update node_states[N_NORMAL_MEMORY] or node_states[N_MEMORY]? And does
node_states[N_NORMAL_MEMORY] represent 0...ZONE_MOVABLE?

> 
> Please refer to the definition of enum zone_type, if we don't have 
> CONFIG_HIGHMEM, we won't
> have ZONE_HIGHMEM, but ZONE_NORMAL and ZONE_MOVABLE will always there. 
> So we can have movable
> pages, and the zone_last should be ZONE_MOVABLE.

Which node_states entry do you mean? node_states[N_NORMAL_MEMORY] or
node_states[N_MEMORY]?

> 
> Again, because we won't have a node only having ZONE_MOVABLE, so we just 
> need to update
> node_states[N_NORMAL_MEMORY].
> 
> > * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> > * contains nodes which have zones of 0...ZONE_MOVABLE,
> > * set zone_last to ZONE_MOVABLE.
> >
> > How to understand?
> 
> 2) this code is in #ifdef CONFIG_HIGHMEM, which means we have highmem, 
> so if we don't have
>     movable node, N_MEMORY == N_HIGH_MEMORY, and N_HIGH_MEMORY effects 
> as N_MEMORY. If we
>     online pages as movable, we need to update node_states[N_NORMAL_MEMORY].
> 
> >
> > 2. In function move_pfn_range_left, why end<= z2->zone_start_pfn is not
> > correct? The comments said that must include/overlap, why?
> >
> 
> This one is easy, if I understand you correctly.
> move_pfn_range_left() is used to move the left most part [start_pfn, 
> end_pfn) of z2 to z1.
> So if end_pfn<= z2->zone_start_pfn, it means [start_pfn, end_pfn) is not 
> part of z2.
> Then it fails.

Yup, very clear now. :)
Why check !z1->wait_table in move_pfn_range_left() and __add_zone()? I
think zone->wait_table is initialized in free_area_init_core(), which is
called during system initialization and on the hotadd_new_pgdat path.

> 
> > 3. In function online_pages, the normal case(w/o online_kenrel,
> > online_movable), why not check if the new zone is overlap with adjacent
> > zones?
> >
> 
> Can a zone overlap with the others ? I don't think so.
> 
> One pfn could only be in one zone,
>     zone = page_zone(pfn_to_page(pfn));

thanks. :)

There is a populated-zone check in online_pages(). But the zone is
populated in free_area_init_core(), which is called during system
initialization and on the hotadd_new_pgdat path. Why is this check still needed?

> 
> it could overlap with others, I think. :)
> 
> But maybe I misunderstand you. :)
> 
> > 4. Could you summarize the difference implementation between hot-add and
> > logic-add, hot-remove and logic-remove?
> 
> Sorry, I don't quite understand what you mean by logic-add/remove.
> Would you please explain more ?
> 
> If you meant the sys fs interfaces, I think they are just another set of 
> entrances
> of memory hotplug.

Please ingore this silly question. :(

> 
> Thanks.  :)
> 
> >
> >
> >>
> >> This patch-set aims to implement physical memory hot-removing.
> >>
> >> The patches can free/remove the following things:
> >>
> >>    - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
> >>    - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
> >>    - page table of removed memory              : [RFC PATCH 7,8,10/15]
> >>    - node and related sysfs files              : [RFC PATCH 13-15/15]
> >>
> >>
> >> Existing problem:
> >> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> >> when we online pages.
> >>
> >> For example: there is a memory device on node 1. The address range
> >> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> >> and memory11 under the directory /sys/devices/system/memory/.
> >>
> >> If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
> >> cgroup is not provided by this memory device. But when we online memory9, the
> >> memory stored page cgroup may be provided by memory8. So we can't offline
> >> memory8 now. We should offline the memory in the reversed order.
> >>
> >> When the memory device is hotremoved, we will auto offline memory provided
> >> by this memory device. But we don't know which memory is onlined first, so
> >> offlining memory may fail.
> >>
> >> In patch1, we provide a solution which is not good enough:
> >> Iterate twice to offline the memory.
> >> 1st iterate: offline every non primary memory block.
> >> 2nd iterate: offline primary (i.e. first added) memory block.
> >>
> >> And a new idea from Wen Congyang<wency@cn.fujitsu.com>  is:
> >> allocate the memory from the memory block they are describing.
> >>
> >> But we are not sure if it is OK to do so because there is not existing API
> >> to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE
> >> to MEM_ONLINE. And also, it may interfere the hugepage.
> >>
> >>
> >>
> >> How to test this patchset?
> >> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
> >>     ACPI_HOTPLUG_MEMORY must be selected.
> >> 2. load the module acpi_memhotplug
> >> 3. hotplug the memory device(it depends on your hardware)
> >>     You will see the memory device under the directory /sys/bus/acpi/devices/.
> >>     Its name is PNP0C80:XX.
> >> 4. online/offline pages provided by this memory device
> >>     You can write online/offline to /sys/devices/system/memory/memoryX/state to
> >>     online/offline pages provided by this memory device
> >> 5. hotremove the memory device
> >>     You can hotremove the memory device by the hardware, or writing 1 to
> >>     /sys/bus/acpi/devices/PNP0C80:XX/eject.
> >
> > Is there a similar knode to hot-add the memory device?
> >
> >>
> >>
> >> Note: if the memory provided by the memory device is used by the kernel, it
> >> can't be offlined. It is not a bug.
> >>
> >>
> >> Changelogs from v5 to v6:
> >>   Patch3: Add some more comments to explain memory hot-remove.
> >>   Patch4: Remove bootmem member in struct firmware_map_entry.
> >>   Patch6: Repeatedly register bootmem pages when using hugepage.
> >>   Patch8: Repeatedly free bootmem pages when using hugepage.
> >>   Patch14: Don't free pgdat when offlining a node, just reset it to 0.
> >>   Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
> >>            one when online a node.
> >>
> >> Changelogs from v4 to v5:
> >>   Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
> >>           avoid disabling irq because we need flush tlb when free pagetables.
> >>   Patch8: new patch, pick up some common APIs that are used to free direct mapping
> >>           and vmemmap pagetables.
> >>   Patch9: free direct mapping pagetables on x86_64 arch.
> >>   Patch10: free vmemmap pagetables.
> >>   Patch11: since freeing memmap with vmemmap has been implemented, the config
> >>            macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
> >>            no longer needed.
> >>   Patch13: no need to modify acpi_memory_disable_device() since it was removed,
> >>            and add nid parameter when calling remove_memory().
> >>
> >> Changelogs from v3 to v4:
> >>   Patch7: remove unused codes.
> >>   Patch8: fix nr_pages that is passed to free_map_bootmem()
> >>
> >> Changelogs from v2 to v3:
> >>   Patch9: call sync_global_pgds() if pgd is changed
> >>   Patch10: fix a problem int the patch
> >>
> >> Changelogs from v1 to v2:
> >>   Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
> >>           memory block. 2nd iterate: offline primary (i.e. first added) memory
> >>           block.
> >>
> >>   Patch3: new patch, no logical change, just remove redundant code.
> >>
> >>   Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
> >>           after the pagetable is changed.
> >>
> >>   Patch12: new patch, free node_data when a node is offlined.
> >>
> >>
> >> Tang Chen (6):
> >>    memory-hotplug: move pgdat_resize_lock into
> >>      sparse_remove_one_section()
> >>    memory-hotplug: remove page table of x86_64 architecture
> >>    memory-hotplug: remove memmap of sparse-vmemmap
> >>    memory-hotplug: Integrated __remove_section() of
> >>      CONFIG_SPARSEMEM_VMEMMAP.
> >>    memory-hotplug: remove sysfs file of node
> >>    memory-hotplug: Do not allocate pgdat if it was not freed when
> >>      offline.
> >>
> >> Wen Congyang (5):
> >>    memory-hotplug: try to offline the memory twice to avoid dependence
> >>    memory-hotplug: remove redundant codes
> >>    memory-hotplug: introduce new function arch_remove_memory() for
> >>      removing page table depends on architecture
> >>    memory-hotplug: Common APIs to support page tables hot-remove
> >>    memory-hotplug: free node_data when a node is offlined
> >>
> >> Yasuaki Ishimatsu (4):
> >>    memory-hotplug: check whether all memory blocks are offlined or not
> >>      when removing memory
> >>    memory-hotplug: remove /sys/firmware/memmap/X sysfs
> >>    memory-hotplug: implement register_page_bootmem_info_section of
> >>      sparse-vmemmap
> >>    memory-hotplug: memory_hotplug: clear zone when removing the memory
> >>
> >>   arch/arm64/mm/mmu.c                  |    3 +
> >>   arch/ia64/mm/discontig.c             |   10 +
> >>   arch/ia64/mm/init.c                  |   18 ++
> >>   arch/powerpc/mm/init_64.c            |   10 +
> >>   arch/powerpc/mm/mem.c                |   12 +
> >>   arch/s390/mm/init.c                  |   12 +
> >>   arch/s390/mm/vmem.c                  |   10 +
> >>   arch/sh/mm/init.c                    |   17 ++
> >>   arch/sparc/mm/init_64.c              |   10 +
> >>   arch/tile/mm/init.c                  |    8 +
> >>   arch/x86/include/asm/pgtable_types.h |    1 +
> >>   arch/x86/mm/init_32.c                |   12 +
> >>   arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
> >>   arch/x86/mm/pageattr.c               |   47 ++--
> >>   drivers/acpi/acpi_memhotplug.c       |    8 +-
> >>   drivers/base/memory.c                |    6 +
> >>   drivers/firmware/memmap.c            |   96 +++++++-
> >>   include/linux/bootmem.h              |    1 +
> >>   include/linux/firmware-map.h         |    6 +
> >>   include/linux/memory_hotplug.h       |   15 +-
> >>   include/linux/mm.h                   |    4 +-
> >>   mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
> >>   mm/sparse.c                          |    8 +-
> >>   23 files changed, 1094 insertions(+), 69 deletions(-)
> >>
> >
> >
> >

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-31  1:22     ` Simon Jeons
@ 2013-01-31  3:31       ` Tang Chen
  2013-01-31  6:19         ` Simon Jeons
  0 siblings, 1 reply; 67+ messages in thread
From: Tang Chen @ 2013-01-31  3:31 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Simon,

Please see below. :)

On 01/31/2013 09:22 AM, Simon Jeons wrote:
>
> Sorry, I still confuse. :(
> update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
> node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?
>
> node_states is what? node_states[N_NORMAL_MEMOR] or
> node_states[N_MEMORY]?

Are you asking what node_states[] is ?

node_states[] is an array of nodemask,

     extern nodemask_t node_states[NR_NODE_STATES];

For example, node_states[N_NORMAL_MEMORY] represents which nodes have
normal memory.
If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
node_states[N_NORMAL_MEMORY]. So it represents which nodes have zones
0 ... ZONE_MOVABLE.
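The array-of-bitmasks idea can be sketched in plain C. This is a simplified user-space model for illustration only: the names mirror the kernel's, but the real nodemask_t in include/linux/nodemask.h is a multi-word bitmap manipulated through helper macros, not a single unsigned long.

```c
#include <assert.h>

/* Simplified model of node_states[]: one bitmask per state, where bit i
 * set means node i has that property. */
enum node_state_idx { N_NORMAL_MEMORY, N_HIGH_MEMORY, N_MEMORY, NR_NODE_STATES };

typedef unsigned long nodemask_t;       /* toy stand-in for the kernel type */
static nodemask_t node_states[NR_NODE_STATES];

static void node_set_state(int node, enum node_state_idx st)
{
    node_states[st] |= 1UL << node;     /* mark: node has this kind of memory */
}

static int node_state(int node, enum node_state_idx st)
{
    return (int)((node_states[st] >> node) & 1UL);
}
```

With this model, node_states[N_NORMAL_MEMORY] is simply the set of node ids whose bit is set, which is what the discussion above means by "which nodes have normal memory".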


> Why check !z1->wait_table in function move_pfn_range_left and function
> __add_zone? I think zone->wait_table is initialized in
> free_area_init_core, which will be called during system initialization
> and hotadd_new_pgdat path.

I think,

free_area_init_core(), in the for loop,
  |--> size = zone_spanned_pages_in_node();
  |--> if (!size)
               continue;  ----------------  If the zone is empty, we skip
                                            the rest of this iteration.
  |--> init_currently_empty_zone()

So, if the zone is empty, wait_table is not initialized.

In move_pfn_range_left(z1, z2), we move pages from z2 to z1. But z1
could be empty, so we need to check it and initialize z1->wait_table
because we are moving pages into it.
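That lazy initialization can be modeled with a toy snippet. The names here are hypothetical (the real init_currently_empty_zone() allocates a wait-queue hash table and does much more); the point is only the !wait_table check before pages are moved in:

```c
#include <assert.h>
#include <stddef.h>

struct zone {
    unsigned long spanned_pages;
    void *wait_table;            /* NULL until the zone is first initialized */
};

static int dummy_table;          /* stands in for the real hash-table allocation */

/* Empty zones were skipped by free_area_init_core() at boot, so any code
 * that is about to put pages into a zone must initialize it lazily. */
static void ensure_zone_initialized(struct zone *z)
{
    if (!z->wait_table)
        z->wait_table = &dummy_table;
}

static void move_pages_left(struct zone *z1, struct zone *z2, unsigned long n)
{
    ensure_zone_initialized(z1); /* z1 may never have held pages before */
    z1->spanned_pages += n;
    z2->spanned_pages -= n;
}
```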


> There is a zone populated check in function online_pages. But zone is
> populated in free_area_init_core which will be called during system
> initialization and hotadd_new_pgdat path. Why still need this check?
>

Because we could also rebuild the zonelists when we offline pages.

__offline_pages()
  |--> zone->present_pages -= offlined_pages;
  |--> if (!populated_zone(zone)) {
               build_all_zonelists(NULL, NULL);
       }

If the zone becomes empty but other zones on the same node are not empty,
the node won't be offlined. The next time we online pages of this zone,
the pgdat won't be initialized again, so we need to check
populated_zone(zone) when onlining pages.
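A toy model of that offline/online interplay (hypothetical helpers, not the kernel functions; the real build_all_zonelists() rebuilds allocator fallback lists rather than flipping a flag):

```c
#include <assert.h>

/* Why online_pages() re-checks populated_zone(): a zone emptied by a
 * previous offline was dropped from the zonelists, and the pgdat is NOT
 * re-initialized on the next online. */
struct zone { unsigned long present_pages; };

static int zone_in_zonelists = 1;

static int populated_zone(const struct zone *z)
{
    return z->present_pages != 0;
}

static void offline_pages(struct zone *z, unsigned long n)
{
    z->present_pages -= n;
    if (!populated_zone(z))
        zone_in_zonelists = 0;   /* zonelist rebuild drops the empty zone */
}

static void online_pages(struct zone *z, unsigned long n)
{
    int need_rebuild = !populated_zone(z);  /* the check discussed above */
    z->present_pages += n;
    if (need_rebuild)
        zone_in_zonelists = 1;   /* rebuild zonelists to include the zone */
}
```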

Thanks. :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-31  3:31       ` Tang Chen
@ 2013-01-31  6:19         ` Simon Jeons
  2013-01-31  7:10           ` Tang Chen
  0 siblings, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-01-31  6:19 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Tang,
On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote:
> Hi Simon,
> 
> Please see below. :)
> 
> On 01/31/2013 09:22 AM, Simon Jeons wrote:
> >
> > Sorry, I still confuse. :(
> > update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
> > node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?
> >
> > node_states is what? node_states[N_NORMAL_MEMOR] or
> > node_states[N_MEMORY]?
> 
> Are you asking what node_states[] is ?
> 
> node_states[] is an array of nodemask,
> 
>      extern nodemask_t node_states[NR_NODE_STATES];
> 
> For example, node_states[N_NORMAL_MEMOR] represents which nodes have 
> normal memory.
> If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
> node_states[N_NORMAL_MEMOR]. So it represents which nodes have 0 ... 
> ZONE_MOVABLE.
> 

Sorry, how can node_states[N_NORMAL_MEMORY] represent nodes that have 0 ...
*ZONE_MOVABLE*? The comment on enum node_states says that
N_NORMAL_MEMORY just means the node has regular memory.

> 
> > Why check !z1->wait_table in function move_pfn_range_left and function
> > __add_zone? I think zone->wait_table is initialized in
> > free_area_init_core, which will be called during system initialization
> > and hotadd_new_pgdat path.
> 
> I think,
> 
> free_area_init_core(), in the for loop,
>   |--> size = zone_spanned_pages_in_node();
>   |--> if (!size)
>                continue;  ----------------  If zone is empty, we jump 
> out the for loop.
>   |--> init_currently_empty_zone()
> 
> So, if the zone is empty, wait_table is not initialized.
> 
> In move_pfn_range_left(z1, z2), we move pages from z2 to z1. But z1 
> could be empty.
> So we need to check it and initialize z1->wait_table because we are 
> moving pages into it.

thanks.

> 
> 
> > There is a zone populated check in function online_pages. But zone is
> > populated in free_area_init_core which will be called during system
> > initialization and hotadd_new_pgdat path. Why still need this check?
> >
> 
> Because we could also rebuild zone list when we offline pages.
> 
> __offline_pages()
>   |--> zone->present_pages -= offlined_pages;
>   |--> if (!populated_zone(zone)) {
>                build_all_zonelists(NULL, NULL);
>        }
> 
> If the zone is empty, and other zones on the same node is not empty, the 
> node
> won't be offlined, and next time we online pages of this zone, the pgdat 
> won't
> be initialized again, and we need to check populated_zone(zone) when 
> onlining
> pages.

thanks.

> 
> Thanks. :)
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-31  6:19         ` Simon Jeons
@ 2013-01-31  7:10           ` Tang Chen
  2013-01-31  8:17             ` Simon Jeons
  2013-01-31  8:48             ` Simon Jeons
  0 siblings, 2 replies; 67+ messages in thread
From: Tang Chen @ 2013-01-31  7:10 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

On 01/31/2013 02:19 PM, Simon Jeons wrote:
> Hi Tang,
> On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote:
>> Hi Simon,
>>
>> Please see below. :)
>>
>> On 01/31/2013 09:22 AM, Simon Jeons wrote:
>>>
>>> Sorry, I still confuse. :(
>>> update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
>>> node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?
>>>
>>> node_states is what? node_states[N_NORMAL_MEMOR] or
>>> node_states[N_MEMORY]?
>>
>> Are you asking what node_states[] is ?
>>
>> node_states[] is an array of nodemask,
>>
>>       extern nodemask_t node_states[NR_NODE_STATES];
>>
>> For example, node_states[N_NORMAL_MEMOR] represents which nodes have
>> normal memory.
>> If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
>> node_states[N_NORMAL_MEMOR]. So it represents which nodes have 0 ...
>> ZONE_MOVABLE.
>>
>
> Sorry, how can nodes_state[N_NORMAL_MEMORY] represents a node have 0 ...
> *ZONE_MOVABLE*, the comment of enum nodes_states said that
> N_NORMAL_MEMORY just means the node has regular memory.
>

Hi Simon,

Let's put it this way.

If we don't have CONFIG_HIGHMEM, N_HIGH_MEMORY == N_NORMAL_MEMORY. We
don't have a separate macro to represent highmem because we don't have
highmem. This is easy to understand, right?

Now, think of it just like the above:
If we don't have CONFIG_MOVABLE_NODE, N_MEMORY == N_HIGH_MEMORY ==
N_NORMAL_MEMORY.
This means we don't allow a node to have only movable memory, not that we
don't have movable memory. A node could have both normal memory and
movable memory. So node_states[N_NORMAL_MEMORY] represents nodes that
have zones 0 ... *ZONE_MOVABLE*.

I think the point is: CONFIG_MOVABLE_NODE means we allow a node to have
only movable memory. So without CONFIG_MOVABLE_NODE, it doesn't mean a
node cannot have movable memory. It means the node cannot have *only*
movable memory. It can have normal memory and movable memory.

1) With CONFIG_MOVABLE_NODE:
    N_NORMAL_MEMORY: nodes which have normal memory.
                     normal memory only
                     normal and highmem
                     normal and highmem and movablemem
                     normal and movablemem
    N_MEMORY: nodes which have memory (any memory)
                     normal memory only
                     normal and highmem
                     normal and highmem and movablemem
                     normal and movablemem ---------------- We can have movablemem.
                     highmem only -------------------------
                     highmem and movablemem ---------------
                     movablemem only ---------------------- We can have movablemem only. ***

2) Without CONFIG_MOVABLE_NODE:
    N_MEMORY == N_NORMAL_MEMORY: (Here, I omit N_HIGH_MEMORY)
                     normal memory only
                     normal and highmem
                     normal and highmem and movablemem
                     normal and movablemem ---------------- We can have movablemem.
                     No movablemem only ------------------- We cannot have movablemem only. ***

The semantics are not that clear here, so we can only try to understand
it from the code where we use N_MEMORY. :)

That is my understanding of this.
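As a concrete illustration, the aliasing is done in the enum itself: roughly as in the 3.8-era include/linux/nodemask.h (sketched from memory, so check the real header), the "larger" states collapse onto N_NORMAL_MEMORY when the corresponding config options are off. Compiled without CONFIG_HIGHMEM and CONFIG_MOVABLE_NODE, the three indexes name the same array slot:

```c
#include <assert.h>

/* Neither CONFIG_HIGHMEM nor CONFIG_MOVABLE_NODE is defined in this
 * user-space build, so both #else branches below are taken and
 * N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY. */
enum node_states {
    N_POSSIBLE,
    N_ONLINE,
    N_NORMAL_MEMORY,
#ifdef CONFIG_HIGHMEM
    N_HIGH_MEMORY,               /* separate slot only with highmem */
#else
    N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
#ifdef CONFIG_MOVABLE_NODE
    N_MEMORY,                    /* separate slot only with movable nodes */
#else
    N_MEMORY = N_HIGH_MEMORY,
#endif
    N_CPU,
    NR_NODE_STATES
};
```

So updating node_states[N_NORMAL_MEMORY] in this configuration is literally updating node_states[N_MEMORY]; they index the same nodemask.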

Thanks. :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-31  7:10           ` Tang Chen
@ 2013-01-31  8:17             ` Simon Jeons
  2013-01-31  8:48             ` Simon Jeons
  1 sibling, 0 replies; 67+ messages in thread
From: Simon Jeons @ 2013-01-31  8:17 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Tang,
On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
> On 01/31/2013 02:19 PM, Simon Jeons wrote:
> > Hi Tang,
> > On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote:
> >> Hi Simon,
> >>
> >> Please see below. :)
> >>
> >> On 01/31/2013 09:22 AM, Simon Jeons wrote:
> >>>
> >>> Sorry, I still confuse. :(
> >>> update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
> >>> node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?
> >>>
> >>> node_states is what? node_states[N_NORMAL_MEMOR] or
> >>> node_states[N_MEMORY]?
> >>
> >> Are you asking what node_states[] is ?
> >>
> >> node_states[] is an array of nodemask,
> >>
> >>       extern nodemask_t node_states[NR_NODE_STATES];
> >>
> >> For example, node_states[N_NORMAL_MEMOR] represents which nodes have
> >> normal memory.
> >> If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
> >> node_states[N_NORMAL_MEMOR]. So it represents which nodes have 0 ...
> >> ZONE_MOVABLE.
> >>
> >
> > Sorry, how can nodes_state[N_NORMAL_MEMORY] represents a node have 0 ...
> > *ZONE_MOVABLE*, the comment of enum nodes_states said that
> > N_NORMAL_MEMORY just means the node has regular memory.
> >
> 
> Hi Simon,
> 
> Let's say it in this way.
> 
> If we don't have CONFIG_HIGHMEM, N_HIGH_MEMORY == N_NORMAL_MEMORY. We 
> don't have a separate
> macro to represent highmem because we don't have highmem.
> This is easy to understand, right ?
> 
> Now, think it just like above:
> If we don't have CONFIG_MOVABLE_NODE, N_MEMORY == N_HIGH_MEMORY == 
> N_NORMAL_MEMORY.
> This means we don't allow a node to have only movable memory, not we 
> don't have movable memory.
> A node could have normal memory and movable memory. So 
> nodes_state[N_NORMAL_MEMORY] represents
> a node have 0 ... *ZONE_MOVABLE*.
> 
> I think the point is: CONFIG_MOVABLE_NODE means we allow a node to have 
> only movable memory.
> So without CONFIG_MOVABLE_NODE, it doesn't mean a node cannot have 
> movable memory. It means
> the node cannot have only movable memory. It can have normal memory and 
> movable memory.
> 
> 1) With CONFIG_MOVABLE_NODE:
>     N_NORMAL_MEMORY: nodes who have normal memory.
>                      normal memory only
>                      normal and highmem
>                      normal and highmem and movablemem
>                      normal and movablemem
>     N_MEMORY: nodes who has memory (any memory)
>                      normal memory only
>                      normal and highmem
>                      normal and highmem and movablemem
>                      normal and movablemem ---------------- We can have 
> movablemem.
>                      highmem only -------------------------
>                      highmem and movablemem ---------------
>                      movablemem only ---------------------- We can have 
> movablemem only.    ***
> 
> 2) With out CONFIG_MOVABLE_NODE:
>     N_MEMORY == N_NORMAL_MEMORY: (Here, I omit N_HIGH_MEMORY)
>                      normal memory only
>                      normal and highmem
>                      normal and highmem and movablemem
>                      normal and movablemem ---------------- We can have 
> movablemem.
>                      No movablemem only ------------------- We cannot 
> have movablemem only. ***
> 
> The semantics is not that clear here. So we can only try to understand 
> it from the code where
> we use N_MEMORY. :)
> 
> That is my understanding of this.

Thanks for your clarify, very clear now. :)

> 
> Thanks. :)
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-31  7:10           ` Tang Chen
  2013-01-31  8:17             ` Simon Jeons
@ 2013-01-31  8:48             ` Simon Jeons
  2013-01-31  9:44               ` Tang Chen
  1 sibling, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-01-31  8:48 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Tang,
On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:

1. IIUC, there is a button on machines which support memory hot-remove;
what's the difference between pressing the button and echoing to /sys?
2. Since kernel memory is linearly mapped (I mean the direct mapping part),
why can't we put the kernel's direct-mapped memory into one memory device,
and other memory into the other devices? As you know, x86_64 doesn't need
highmem; IIUC, all kernel memory will be linearly mapped in this case. Is my
idea feasible? If it is, x86_32 can't be handled the same way, since
highmem (kmap/kmap_atomic/vmalloc) can map any address, so it's hard to
confine kernel memory to a single memory device.
3. In the current implementation, does memory hotplug only need support from
the memory subsystem and the ACPI code? Or does the firmware also take part?
Hope you can explain in detail, thanks in advance. :)
4. What's the status of memory hotplug? Apart from not being able to remove
kernel memory, is everything else fully implemented?


> On 01/31/2013 02:19 PM, Simon Jeons wrote:
> > Hi Tang,
> > On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote:
> >> Hi Simon,
> >>
> >> Please see below. :)
> >>
> >> On 01/31/2013 09:22 AM, Simon Jeons wrote:
> >>>
> >>> Sorry, I still confuse. :(
> >>> update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
> >>> node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?
> >>>
> >>> node_states is what? node_states[N_NORMAL_MEMOR] or
> >>> node_states[N_MEMORY]?
> >>
> >> Are you asking what node_states[] is ?
> >>
> >> node_states[] is an array of nodemask,
> >>
> >>       extern nodemask_t node_states[NR_NODE_STATES];
> >>
> >> For example, node_states[N_NORMAL_MEMOR] represents which nodes have
> >> normal memory.
> >> If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
> >> node_states[N_NORMAL_MEMOR]. So it represents which nodes have 0 ...
> >> ZONE_MOVABLE.
> >>
> >
> > Sorry, how can nodes_state[N_NORMAL_MEMORY] represents a node have 0 ...
> > *ZONE_MOVABLE*, the comment of enum nodes_states said that
> > N_NORMAL_MEMORY just means the node has regular memory.
> >
> 
> Hi Simon,
> 
> Let's say it this way.
> 
> If we don't have CONFIG_HIGHMEM, N_HIGH_MEMORY == N_NORMAL_MEMORY. We don't
> have a separate macro to represent highmem because we don't have highmem.
> This is easy to understand, right ?
> 
> Now, think of it just like the above:
> If we don't have CONFIG_MOVABLE_NODE, N_MEMORY == N_HIGH_MEMORY ==
> N_NORMAL_MEMORY.
> This means we don't allow a node to have only movable memory, not that we
> don't have movable memory at all.
> A node could have normal memory and movable memory. So
> node_states[N_NORMAL_MEMORY] represents a node that has 0 ... *ZONE_MOVABLE*.
> 
> I think the point is: CONFIG_MOVABLE_NODE means we allow a node to have
> only movable memory.
> So without CONFIG_MOVABLE_NODE, it doesn't mean a node cannot have movable
> memory. It means the node cannot have *only* movable memory. It can have
> normal memory and movable memory.
> 
> 1) With CONFIG_MOVABLE_NODE:
>     N_NORMAL_MEMORY: nodes which have normal memory:
>                      normal memory only
>                      normal and highmem
>                      normal and highmem and movablemem
>                      normal and movablemem
>     N_MEMORY: nodes which have any memory:
>                      normal memory only
>                      normal and highmem
>                      normal and highmem and movablemem
>                      normal and movablemem ------- We can have movablemem.
>                      highmem only -----------------
>                      highmem and movablemem -------
>                      movablemem only -------------- We can have movablemem only.  ***
> 
> 2) Without CONFIG_MOVABLE_NODE:
>     N_MEMORY == N_NORMAL_MEMORY: (Here, I omit N_HIGH_MEMORY)
>                      normal memory only
>                      normal and highmem
>                      normal and highmem and movablemem
>                      normal and movablemem ------- We can have movablemem.
>                      No movablemem only ---------- We cannot have movablemem only.  ***
> 
> The semantics are not that clear here. So we can only try to understand it
> from the code where we use N_MEMORY. :)
> 
> That is my understanding of this.
> 
> Thanks. :)
> 
> 
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-31  8:48             ` Simon Jeons
@ 2013-01-31  9:44               ` Tang Chen
  2013-01-31 10:38                 ` Simon Jeons
  0 siblings, 1 reply; 67+ messages in thread
From: Tang Chen @ 2013-01-31  9:44 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Simon,

On 01/31/2013 04:48 PM, Simon Jeons wrote:
> Hi Tang,
> On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
>
> 1. IIUC, there is a button on machine which supports hot-remove memory,
> then what's the difference between press button and echo to /sys?

No important difference, I think. Since I don't have the machine you are
talking about, I cannot answer for sure. :)
AFAIK, pressing the button triggers the hotplug from hardware; sysfs is
just another entry point. In the end, they run into the same code.

> 2. Since kernel memory is linear mapping(I mean direct mapping part),
> why can't put kernel direct mapping memory into one memory device, and
> other memory into the other devices?

We cannot do that, because that way we would lose NUMA performance.

If you know NUMA, you will understand the following example:

node0:                    node1:
    cpu0~cpu15                cpu16~cpu31
    memory0~memory511         memory512~memory1023

cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
If we set the direct mapping area in node0 and the movable area in node1,
then kernel code running on cpu16~cpu31 will have to access memory0~memory511.
This is a terrible performance hit.

>As you know x86_64 don't need
> highmem, IIUC, all kernel memory will linear mapping in this case. Is my
> idea available? If is correct, x86_32 can't implement in the same way
> since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
> hard to focus kernel memory on single memory device.

Sorry, I'm not quite familiar with x86_32 box.

> 3. In current implementation, if memory hotplug just need memory
> subsystem and ACPI codes support? Or also needs firmware take part in?
> Hope you can explain in details, thanks in advance. :)

We need the firmware to take part, such as the SRAT in the ACPI BIOS, or
the firmware-based memory migration mentioned by Liu Jiang.

So far, I only know this. :)

> 4. What's the status of memory hotplug? Apart from can't remove kernel
> memory, other things are fully implementation?

I think the main job is done for now, but there are still bugs to fix,
and this functionality is not yet stable.

Thanks. :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-31  9:44               ` Tang Chen
@ 2013-01-31 10:38                 ` Simon Jeons
  2013-02-01  1:32                   ` Jianguo Wu
  0 siblings, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-01-31 10:38 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Tang,
On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
> Hi Simon,
> 
> On 01/31/2013 04:48 PM, Simon Jeons wrote:
> > Hi Tang,
> > On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
> >
> > 1. IIUC, there is a button on machine which supports hot-remove memory,
> > then what's the difference between press button and echo to /sys?
> 
> No important difference, I think. Since I don't have the machine you are
> saying, I cannot surely answer you. :)
> AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
> is just another entrance. At last, they will run into the same code.
> 
> > 2. Since kernel memory is linear mapping(I mean direct mapping part),
> > why can't put kernel direct mapping memory into one memory device, and
> > other memory into the other devices?
> 
> We cannot do that because in that way, we will lose NUMA performance.
> 
> If you know NUMA, you will understand the following example:
> 
> node0:                    node1:
>     cpu0~cpu15                cpu16~cpu31
>     memory0~memory511         memory512~memory1023
> 
> cpu16~cpu31 access memory16~memory1023 much faster than memory0~memory511.
> If we set direct mapping area in node0, and movable area in node1, then
> the kernel code running on cpu16~cpu31 will have to access 
> memory0~memory511.
> This is a terrible performance down.

So if NUMA is configured, kernel memory will not be linearly mapped anymore?
For example,

Node 0   Node 1

0 ~ 10G  11G ~ 14G

Is kernel memory only on Node 0? Can part of kernel memory also be on Node 1?

How big is the kernel direct-mapping memory on x86_64? Is there a max limit?
It seems to be only around 896MB on x86_32.

> 
> >As you know x86_64 don't need
> > highmem, IIUC, all kernel memory will linear mapping in this case. Is my
> > idea available? If is correct, x86_32 can't implement in the same way
> > since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
> > hard to focus kernel memory on single memory device.
> 
> Sorry, I'm not quite familiar with x86_32 box.
> 
> > 3. In current implementation, if memory hotplug just need memory
> > subsystem and ACPI codes support? Or also needs firmware take part in?
> > Hope you can explain in details, thanks in advance. :)
> 
> We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
> based memory migration mentioned by Liu Jiang.

Is there any material about firmware-based memory migration?

> 
> So far, I only know this. :)
> 
> > 4. What's the status of memory hotplug? Apart from can't remove kernel
> > memory, other things are fully implementation?
> 
> I think the main job is done for now. And there are still bugs to fix.
> And this functionality is not stable.
> 
> Thanks. :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-01-31 10:38                 ` Simon Jeons
@ 2013-02-01  1:32                   ` Jianguo Wu
  2013-02-01  1:36                     ` Simon Jeons
  0 siblings, 1 reply; 67+ messages in thread
From: Jianguo Wu @ 2013-02-01  1:32 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, glommer, yinghai, laijs, linux-kernel, minchan.kim,
	akpm, linuxppc-dev

On 2013/1/31 18:38, Simon Jeons wrote:

> Hi Tang,
> On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
>> Hi Simon,
>>
>> On 01/31/2013 04:48 PM, Simon Jeons wrote:
>>> Hi Tang,
>>> On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
>>>
>>> 1. IIUC, there is a button on machine which supports hot-remove memory,
>>> then what's the difference between press button and echo to /sys?
>>
>> No important difference, I think. Since I don't have the machine you are
>> saying, I cannot surely answer you. :)
>> AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
>> is just another entrance. At last, they will run into the same code.
>>
>>> 2. Since kernel memory is linear mapping(I mean direct mapping part),
>>> why can't put kernel direct mapping memory into one memory device, and
>>> other memory into the other devices?
>>
>> We cannot do that because in that way, we will lose NUMA performance.
>>
>> If you know NUMA, you will understand the following example:
>>
>> node0:                    node1:
>>     cpu0~cpu15                cpu16~cpu31
>>     memory0~memory511         memory512~memory1023
>>
>> cpu16~cpu31 access memory16~memory1023 much faster than memory0~memory511.
>> If we set direct mapping area in node0, and movable area in node1, then
>> the kernel code running on cpu16~cpu31 will have to access 
>> memory0~memory511.
>> This is a terrible performance down.
> 
> So if config NUMA, kernel memory will not be linear mapping anymore? For
> example, 
> 
> Node 0  Node 1 
> 
> 0 ~ 10G 11G~14G
> 
> kernel memory only at Node 0? Can part of kernel memory also at Node 1?
> 
> How big is kernel direct mapping memory in x86_64? Is there max limit?


Max kernel direct mapping memory in x86_64 is 64TB.

> It seems that only around 896MB on x86_32. 
> 
>>
>>> As you know x86_64 don't need
>>> highmem, IIUC, all kernel memory will linear mapping in this case. Is my
>>> idea available? If is correct, x86_32 can't implement in the same way
>>> since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
>>> hard to focus kernel memory on single memory device.
>>
>> Sorry, I'm not quite familiar with x86_32 box.
>>
>>> 3. In current implementation, if memory hotplug just need memory
>>> subsystem and ACPI codes support? Or also needs firmware take part in?
>>> Hope you can explain in details, thanks in advance. :)
>>
>> We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
>> based memory migration mentioned by Liu Jiang.
> 
> Is there any material about firmware based memory migration?
> 
>>
>> So far, I only know this. :)
>>
>>> 4. What's the status of memory hotplug? Apart from can't remove kernel
>>> memory, other things are fully implementation?
>>
>> I think the main job is done for now. And there are still bugs to fix.
>> And this functionality is not stable.
>>
>> Thanks. :)
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> .
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-02-01  1:32                   ` Jianguo Wu
@ 2013-02-01  1:36                     ` Simon Jeons
  2013-02-01  1:57                       ` Jianguo Wu
  2013-02-01  1:57                       ` Tang Chen
  0 siblings, 2 replies; 67+ messages in thread
From: Simon Jeons @ 2013-02-01  1:36 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, glommer, yinghai, laijs, linux-kernel, minchan.kim,
	akpm, linuxppc-dev

On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
> On 2013/1/31 18:38, Simon Jeons wrote:
> 
> > Hi Tang,
> > On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
> >> Hi Simon,
> >>
> >> On 01/31/2013 04:48 PM, Simon Jeons wrote:
> >>> Hi Tang,
> >>> On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
> >>>
> >>> 1. IIUC, there is a button on machine which supports hot-remove memory,
> >>> then what's the difference between press button and echo to /sys?
> >>
> >> No important difference, I think. Since I don't have the machine you are
> >> saying, I cannot surely answer you. :)
> >> AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
> >> is just another entrance. At last, they will run into the same code.
> >>
> >>> 2. Since kernel memory is linear mapping(I mean direct mapping part),
> >>> why can't put kernel direct mapping memory into one memory device, and
> >>> other memory into the other devices?
> >>
> >> We cannot do that because in that way, we will lose NUMA performance.
> >>
> >> If you know NUMA, you will understand the following example:
> >>
> >> node0:                    node1:
> >>     cpu0~cpu15                cpu16~cpu31
> >>     memory0~memory511         memory512~memory1023
> >>
> >> cpu16~cpu31 access memory16~memory1023 much faster than memory0~memory511.
> >> If we set direct mapping area in node0, and movable area in node1, then
> >> the kernel code running on cpu16~cpu31 will have to access 
> >> memory0~memory511.
> >> This is a terrible performance down.
> > 
> > So if config NUMA, kernel memory will not be linear mapping anymore? For
> > example, 
> > 
> > Node 0  Node 1 
> > 
> > 0 ~ 10G 11G~14G
> > 
> > kernel memory only at Node 0? Can part of kernel memory also at Node 1?
> > 
> > How big is kernel direct mapping memory in x86_64? Is there max limit?
> 
> 
> Max kernel direct mapping memory in x86_64 is 64TB.

For example, if I have 8GB of memory, will all of it be direct-mapped for
the kernel? Then where is userspace memory allocated from?

> 
> > It seems that only around 896MB on x86_32. 
> > 
> >>
> >>> As you know x86_64 don't need
> >>> highmem, IIUC, all kernel memory will linear mapping in this case. Is my
> >>> idea available? If is correct, x86_32 can't implement in the same way
> >>> since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
> >>> hard to focus kernel memory on single memory device.
> >>
> >> Sorry, I'm not quite familiar with x86_32 box.
> >>
> >>> 3. In current implementation, if memory hotplug just need memory
> >>> subsystem and ACPI codes support? Or also needs firmware take part in?
> >>> Hope you can explain in details, thanks in advance. :)
> >>
> >> We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
> >> based memory migration mentioned by Liu Jiang.
> > 
> > Is there any material about firmware based memory migration?
> > 
> >>
> >> So far, I only know this. :)
> >>
> >>> 4. What's the status of memory hotplug? Apart from can't remove kernel
> >>> memory, other things are fully implementation?
> >>
> >> I think the main job is done for now. And there are still bugs to fix.
> >> And this functionality is not stable.
> >>
> >> Thanks. :)
> > 
> > 
> > 
> > .
> > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-02-01  1:36                     ` Simon Jeons
@ 2013-02-01  1:57                       ` Jianguo Wu
  2013-02-01  2:06                         ` Simon Jeons
  2013-02-01  1:57                       ` Tang Chen
  1 sibling, 1 reply; 67+ messages in thread
From: Jianguo Wu @ 2013-02-01  1:57 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, glommer, yinghai, laijs, linux-kernel, minchan.kim,
	akpm, linuxppc-dev

On 2013/2/1 9:36, Simon Jeons wrote:

> On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
>> On 2013/1/31 18:38, Simon Jeons wrote:
>>
>>> Hi Tang,
>>> On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
>>>> Hi Simon,
>>>>
>>>> On 01/31/2013 04:48 PM, Simon Jeons wrote:
>>>>> Hi Tang,
>>>>> On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
>>>>>
>>>>> 1. IIUC, there is a button on machine which supports hot-remove memory,
>>>>> then what's the difference between press button and echo to /sys?
>>>>
>>>> No important difference, I think. Since I don't have the machine you are
>>>> saying, I cannot surely answer you. :)
>>>> AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
>>>> is just another entrance. At last, they will run into the same code.
>>>>
>>>>> 2. Since kernel memory is linear mapping(I mean direct mapping part),
>>>>> why can't put kernel direct mapping memory into one memory device, and
>>>>> other memory into the other devices?
>>>>
>>>> We cannot do that because in that way, we will lose NUMA performance.
>>>>
>>>> If you know NUMA, you will understand the following example:
>>>>
>>>> node0:                    node1:
>>>>     cpu0~cpu15                cpu16~cpu31
>>>>     memory0~memory511         memory512~memory1023
>>>>
>>>> cpu16~cpu31 access memory16~memory1023 much faster than memory0~memory511.
>>>> If we set direct mapping area in node0, and movable area in node1, then
>>>> the kernel code running on cpu16~cpu31 will have to access 
>>>> memory0~memory511.
>>>> This is a terrible performance down.
>>>
>>> So if config NUMA, kernel memory will not be linear mapping anymore? For
>>> example, 
>>>
>>> Node 0  Node 1 
>>>
>>> 0 ~ 10G 11G~14G
>>>
>>> kernel memory only at Node 0? Can part of kernel memory also at Node 1?
>>>
>>> How big is kernel direct mapping memory in x86_64? Is there max limit?
>>
>>
>> Max kernel direct mapping memory in x86_64 is 64TB.
> 
> For example, I have 8G memory, all of them will be direct mapping for
> kernel? then userspace memory allocated from where?

Direct-mapped memory means you can use __va() and __pa() on it, but it does
not mean it can only be used by the kernel; it can be used by user space
too, as long as the pages are free.

> 
>>
>>> It seems that only around 896MB on x86_32. 
>>>
>>>>
>>>>> As you know x86_64 don't need
>>>>> highmem, IIUC, all kernel memory will linear mapping in this case. Is my
>>>>> idea available? If is correct, x86_32 can't implement in the same way
>>>>> since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
>>>>> hard to focus kernel memory on single memory device.
>>>>
>>>> Sorry, I'm not quite familiar with x86_32 box.
>>>>
>>>>> 3. In current implementation, if memory hotplug just need memory
>>>>> subsystem and ACPI codes support? Or also needs firmware take part in?
>>>>> Hope you can explain in details, thanks in advance. :)
>>>>
>>>> We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
>>>> based memory migration mentioned by Liu Jiang.
>>>
>>> Is there any material about firmware based memory migration?
>>>
>>>>
>>>> So far, I only know this. :)
>>>>
>>>>> 4. What's the status of memory hotplug? Apart from can't remove kernel
>>>>> memory, other things are fully implementation?
>>>>
>>>> I think the main job is done for now. And there are still bugs to fix.
>>>> And this functionality is not stable.
>>>>
>>>> Thanks. :)
>>>
>>>
>>>
>>> .
>>>
>>
>>
>>
> 
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-02-01  1:36                     ` Simon Jeons
  2013-02-01  1:57                       ` Jianguo Wu
@ 2013-02-01  1:57                       ` Tang Chen
  2013-02-01  2:17                         ` Simon Jeons
  1 sibling, 1 reply; 67+ messages in thread
From: Tang Chen @ 2013-02-01  1:57 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	Jianguo Wu, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

On 02/01/2013 09:36 AM, Simon Jeons wrote:
> On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
>>>
>>> So if config NUMA, kernel memory will not be linear mapping anymore? For
>>> example,
>>>
>>> Node 0  Node 1
>>>
>>> 0 ~ 10G 11G~14G

It has nothing to do with linear mapping, I think.

>>>
>>> kernel memory only at Node 0? Can part of kernel memory also at Node 1?

Please refer to find_zone_movable_pfns_for_nodes().
The kernel is not only on node0. It uses all the online nodes evenly. :)

>>>
>>> How big is kernel direct mapping memory in x86_64? Is there max limit?
>>
>>
>> Max kernel direct mapping memory in x86_64 is 64TB.
>
> For example, I have 8G memory, all of them will be direct mapping for
> kernel? then userspace memory allocated from where?

I think you misunderstood what Wu tried to say. :)

The kernel maps that large space, but that doesn't mean it is using all of
it. The mapping makes the kernel able to access all the memory; it does not
reserve the memory for the kernel's use only. User space can also use the
memory, but each process has its own mapping.

For example:

                           64TB, whatever     xxxTB, whatever
logical address space:   |_____kernel_______|_________user_________________|
                                        \  \  /  /
                                         \  /\  /
physical address space:              |___\/__\/_____________|  4GB or 8GB, whatever
                                           *****

The ***** part of physical memory is mapped to user space in the process'
own pagetable. It is also direct-mapped in the kernel's pagetable, so the
kernel can also access it. :)

>
>>
>>> It seems that only around 896MB on x86_32.
>>>
>>>>
>>>> We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
>>>> based memory migration mentioned by Liu Jiang.
>>>
>>> Is there any material about firmware based memory migration?

No, I don't have any, because this is a feature of a HUAWEI machine.
I think you can ask Liu Jiang or Wu Jianguo to share some with you. :)

Thanks. :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-02-01  1:57                       ` Jianguo Wu
@ 2013-02-01  2:06                         ` Simon Jeons
  2013-02-01  2:18                           ` Jianguo Wu
  0 siblings, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-02-01  2:06 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, glommer, yinghai, laijs, linux-kernel, minchan.kim,
	akpm, linuxppc-dev

Hi Jianguo,
On Fri, 2013-02-01 at 09:57 +0800, Jianguo Wu wrote:
> On 2013/2/1 9:36, Simon Jeons wrote:
> 
> > On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
> >> On 2013/1/31 18:38, Simon Jeons wrote:
> >>
> >>> Hi Tang,
> >>> On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
> >>>> Hi Simon,
> >>>>
> >>>> On 01/31/2013 04:48 PM, Simon Jeons wrote:
> >>>>> Hi Tang,
> >>>>> On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
> >>>>>
> >>>>> 1. IIUC, there is a button on machine which supports hot-remove memory,
> >>>>> then what's the difference between press button and echo to /sys?
> >>>>
> >>>> No important difference, I think. Since I don't have the machine you are
> >>>> saying, I cannot surely answer you. :)
> >>>> AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
> >>>> is just another entrance. At last, they will run into the same code.
> >>>>
> >>>>> 2. Since kernel memory is linear mapping(I mean direct mapping part),
> >>>>> why can't put kernel direct mapping memory into one memory device, and
> >>>>> other memory into the other devices?
> >>>>
> >>>> We cannot do that because in that way, we will lose NUMA performance.
> >>>>
> >>>> If you know NUMA, you will understand the following example:
> >>>>
> >>>> node0:                    node1:
> >>>>     cpu0~cpu15                cpu16~cpu31
> >>>>     memory0~memory511         memory512~memory1023
> >>>>
> >>>> cpu16~cpu31 access memory16~memory1023 much faster than memory0~memory511.
> >>>> If we set direct mapping area in node0, and movable area in node1, then
> >>>> the kernel code running on cpu16~cpu31 will have to access 
> >>>> memory0~memory511.
> >>>> This is a terrible performance down.
> >>>
> >>> So if config NUMA, kernel memory will not be linear mapping anymore? For
> >>> example, 
> >>>
> >>> Node 0  Node 1 
> >>>
> >>> 0 ~ 10G 11G~14G
> >>>
> >>> kernel memory only at Node 0? Can part of kernel memory also at Node 1?
> >>>
> >>> How big is kernel direct mapping memory in x86_64? Is there max limit?
> >>
> >>
> >> Max kernel direct mapping memory in x86_64 is 64TB.
> > 
> > For example, I have 8G memory, all of them will be direct mapping for
> > kernel? then userspace memory allocated from where?
> 
> Direct mapping memory means you can use __va() and pa(), but not means that them
> can be only used by kernel, them can be used by user-space too, as long as them are free.

IIUC, the benefit of __va() and __pa() is just to get the virtual/physical
address quickly, taking advantage of the linear mapping. But the MMU still
needs to go through pgd/pud/pmd/pte, correct?

> 
> > 
> >>
> >>> It seems that only around 896MB on x86_32. 
> >>>
> >>>>
> >>>>> As you know x86_64 don't need
> >>>>> highmem, IIUC, all kernel memory will linear mapping in this case. Is my
> >>>>> idea available? If is correct, x86_32 can't implement in the same way
> >>>>> since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
> >>>>> hard to focus kernel memory on single memory device.
> >>>>
> >>>> Sorry, I'm not quite familiar with x86_32 box.
> >>>>
> >>>>> 3. In current implementation, if memory hotplug just need memory
> >>>>> subsystem and ACPI codes support? Or also needs firmware take part in?
> >>>>> Hope you can explain in details, thanks in advance. :)
> >>>>
> >>>> We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
> >>>> based memory migration mentioned by Liu Jiang.
> >>>
> >>> Is there any material about firmware based memory migration?
> >>>
> >>>>
> >>>> So far, I only know this. :)
> >>>>
> >>>>> 4. What's the status of memory hotplug? Apart from can't remove kernel
> >>>>> memory, other things are fully implementation?
> >>>>
> >>>> I think the main job is done for now. And there are still bugs to fix.
> >>>> And this functionality is not stable.
> >>>>
> >>>> Thanks. :)
> >>>
> >>>
> >>>
> >>> .
> >>>
> >>
> >>
> >>
> > 
> > 
> > 
> > .
> > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-02-01  1:57                       ` Tang Chen
@ 2013-02-01  2:17                         ` Simon Jeons
  2013-02-01  2:42                           ` Tang Chen
  0 siblings, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-02-01  2:17 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	Jianguo Wu, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Tang,
On Fri, 2013-02-01 at 09:57 +0800, Tang Chen wrote:
> On 02/01/2013 09:36 AM, Simon Jeons wrote:
> > On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
> >>>
> >>> So if config NUMA, kernel memory will not be linear mapping anymore? For
> >>> example,
> >>>
> >>> Node 0  Node 1
> >>>
> >>> 0 ~ 10G 11G~14G
> 
> It has nothing to do with linear mapping, I think.
> 
> >>>
> >>> kernel memory only at Node 0? Can part of kernel memory also at Node 1?
> 
> Please refer to find_zone_movable_pfns_for_nodes().

I see, thanks. :)

> The kernel is not only on node0. It uses all the online nodes evenly. :)
> 
> >>>
> >>> How big is kernel direct mapping memory in x86_64? Is there max limit?
> >>
> >>
> >> Max kernel direct mapping memory in x86_64 is 64TB.
> >
> > For example, I have 8G memory, all of them will be direct mapping for
> > kernel? then userspace memory allocated from where?
> 
> I think you misunderstood what Wu tried to say. :)
> 
> The kernel mapped that large space, it doesn't mean it is using that 
> large space.
> The mapping is to make kernel be able to access all the memory, not for 
> the kernel
> to use only. User space can also use the memory, but each process has 
> its own mapping.
> 
> For example:
> 
>                                         64TB, what ever 
>     xxxTB, what ever
> logic address space:     |_____kernel_______|_________user_________________|
>                                         \  \  /  /
>                                          \  /\  /
> physical address space:              |___\/__\/_____________|  4GB or 
> 8GB, what ever
>                                            *****

How much address space can a user process have on x86_64? Also 8GB?

> 
> The ***** part physical is mapped to user space in the process' own 
> pagetable.
> It is also direct mapped in kernel's pagetable. So the kernel can also 
> access it. :)

But how is kernel memory protected from being modified by a user process?

> 
> >
> >>
> >>> It seems that only around 896MB on x86_32.
> >>>
> >>>>
> >>>> We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
> >>>> based memory migration mentioned by Liu Jiang.
> >>>
> >>> Is there any material about firmware based memory migration?
> 
> No, I don't have any because this is a functionality of machine from HUAWEI.
> I think you can ask Liu Jiang or Wu Jianguo to share some with you. :)
> 
> Thanks. :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-02-01  2:06                         ` Simon Jeons
@ 2013-02-01  2:18                           ` Jianguo Wu
  0 siblings, 0 replies; 67+ messages in thread
From: Jianguo Wu @ 2013-02-01  2:18 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, glommer, yinghai, laijs, linux-kernel, minchan.kim,
	akpm, linuxppc-dev

On 2013/2/1 10:06, Simon Jeons wrote:

> Hi Jianguo,
> On Fri, 2013-02-01 at 09:57 +0800, Jianguo Wu wrote:
>> On 2013/2/1 9:36, Simon Jeons wrote:
>>
>>> On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
>>>> On 2013/1/31 18:38, Simon Jeons wrote:
>>>>
>>>>> Hi Tang,
>>>>> On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
>>>>>> Hi Simon,
>>>>>>
>>>>>> On 01/31/2013 04:48 PM, Simon Jeons wrote:
>>>>>>> Hi Tang,
>>>>>>> On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
>>>>>>>
>>>>>>> 1. IIUC, there is a button on machine which supports hot-remove memory,
>>>>>>> then what's the difference between press button and echo to /sys?
>>>>>>
>>>>>> No important difference, I think. Since I don't have the machine you are
>>>>>> saying, I cannot surely answer you. :)
>>>>>> AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
>>>>>> is just another entrance. At last, they will run into the same code.
>>>>>>
>>>>>>> 2. Since kernel memory is linear mapping(I mean direct mapping part),
>>>>>>> why can't put kernel direct mapping memory into one memory device, and
>>>>>>> other memory into the other devices?
>>>>>>
>>>>>> We cannot do that because in that way, we will lose NUMA performance.
>>>>>>
>>>>>> If you know NUMA, you will understand the following example:
>>>>>>
>>>>>> node0:                    node1:
>>>>>>     cpu0~cpu15                cpu16~cpu31
>>>>>>     memory0~memory511         memory512~memory1023
>>>>>>
>>>>>> cpu16~cpu31 access memory16~memory1023 much faster than memory0~memory511.
>>>>>> If we set direct mapping area in node0, and movable area in node1, then
>>>>>> the kernel code running on cpu16~cpu31 will have to access 
>>>>>> memory0~memory511.
>>>>>> This is a terrible performance down.
>>>>>
>>>>> So if config NUMA, kernel memory will not be linear mapping anymore? For
>>>>> example, 
>>>>>
>>>>> Node 0  Node 1 
>>>>>
>>>>> 0 ~ 10G 11G~14G
>>>>>
>>>>> kernel memory only at Node 0? Can part of kernel memory also at Node 1?
>>>>>
>>>>> How big is kernel direct mapping memory in x86_64? Is there max limit?
>>>>
>>>>
>>>> Max kernel direct mapping memory in x86_64 is 64TB.
>>>
>>> For example, I have 8G memory, all of them will be direct mapping for
>>> kernel? then userspace memory allocated from where?
>>
>> Direct-mapped memory means you can use __va() and __pa() on it, but that does
>> not mean it can only be used by the kernel; user space can use it too, as long
>> as the pages are free.
> 
> IIUC, the benefit of __va() and __pa() is just quick virtual/physical address
> conversion, taking advantage of the linear mapping. But the MMU still needs to
> walk pgd/pud/pmd/pte on every access, correct?

Yes.

> 

>>
>>>
>>>>
>>>>> It seems that only around 896MB on x86_32. 
>>>>>
>>>>>>
>>>>>>> As you know x86_64 don't need
>>>>>>> highmem, IIUC, all kernel memory will linear mapping in this case. Is my
>>>>>>> idea available? If is correct, x86_32 can't implement in the same way
>>>>>>> since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
>>>>>>> hard to focus kernel memory on single memory device.
>>>>>>
>>>>>> Sorry, I'm not quite familiar with x86_32 box.
>>>>>>
>>>>>>> 3. In current implementation, if memory hotplug just need memory
>>>>>>> subsystem and ACPI codes support? Or also needs firmware take part in?
>>>>>>> Hope you can explain in details, thanks in advance. :)
>>>>>>
>>>>>> We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
>>>>>> based memory migration mentioned by Liu Jiang.
>>>>>
>>>>> Is there any material about firmware based memory migration?
>>>>>
>>>>>>
>>>>>> So far, I only know this. :)
>>>>>>
>>>>>>> 4. What's the status of memory hotplug? Apart from can't remove kernel
>>>>>>> memory, other things are fully implementation?
>>>>>>
>>>>>> I think the main job is done for now. And there are still bugs to fix.
>>>>>> And this functionality is not stable.
>>>>>>
>>>>>> Thanks. :)
>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>>>> see: http://www.linux-mm.org/ .
>>>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>>>
>>>>> .
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> .
>>>
>>
>>
>>
> 
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-02-01  2:17                         ` Simon Jeons
@ 2013-02-01  2:42                           ` Tang Chen
  2013-02-01  3:06                             ` Simon Jeons
  0 siblings, 1 reply; 67+ messages in thread
From: Tang Chen @ 2013-02-01  2:42 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	Jianguo Wu, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Simon,

On 02/01/2013 10:17 AM, Simon Jeons wrote:
>> For example:
>>
>>                                          64TB, what ever     xxxTB, what ever
>> logic address space:     |_____kernel_______|_________user_________________|
>>                                          \  \  /  /
>>                                           \  /\  /
>> physical address space:              |___\/__\/_____________|  4GB or 8GB, what ever
>>                                             *****
>
> How much address space can a user process have on x86_64? Also 8GB?

We don't usually put it that way.

8GB is your physical memory, right? But kernel space and user space are logical
concepts in the OS; they are ranges of the virtual (logical) address space.

So both kernel space and user space can use all of the physical memory. But if a
page is already in use by one of them, the other cannot use it. For example, some
pages are direct-mapped and in use by the kernel; user space cannot map those.

>
>>
>> The ***** part of physical memory is mapped to user space in the process'
>> own pagetable.
>> It is also direct-mapped in the kernel's pagetable, so the kernel can also
>> access it. :)
>
> But how is a user process prevented from modifying kernel memory?

This is the CPU's job. On Intel CPUs, user-space code runs at privilege level 3
and kernel code runs at level 0, so code at level 3 cannot access data that
requires level 0. On x86, each page-table entry also carries a user/supervisor
bit, so kernel-only pages fault when touched from user mode.

Thanks. :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-02-01  2:42                           ` Tang Chen
@ 2013-02-01  3:06                             ` Simon Jeons
  2013-02-01  3:39                               ` Tang Chen
  0 siblings, 1 reply; 67+ messages in thread
From: Simon Jeons @ 2013-02-01  3:06 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	Jianguo Wu, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Tang,
On Fri, 2013-02-01 at 10:42 +0800, Tang Chen wrote:

I'm confused!

> Hi Simon,
> 
> On 02/01/2013 10:17 AM, Simon Jeons wrote:
> >> For example:
> >>
> >>                                          64TB, what ever     xxxTB, what ever
> >> logic address space:     |_____kernel_______|_________user_________________|
> >>                                          \  \  /  /
> >>                                           \  /\  /
> >> physical address space:              |___\/__\/_____________|  4GB or 8GB, what ever
> >>                                             *****
> >
> > How much address space user process can have on x86_64? Also 8GB?
> 
> We don't usually put it that way.
> 
> 8GB is your physical memory, right? But kernel space and user space are logical
> concepts in the OS; they are ranges of the virtual (logical) address space.
> 
> So both kernel space and user space can use all of the physical memory. But if a
> page is already in use by one of them, the other cannot use it. For example, some
> pages are direct-mapped and in use by the kernel; user space cannot map those.

How can one distinguish "mapped" from "in use"? I mean, how can we confirm that
memory is actually used by the kernel rather than merely mapped?

> 
> >
> >>
> >> The ***** part of physical memory is mapped to user space in the process'
> >> own pagetable.
> >> It is also direct-mapped in the kernel's pagetable, so the kernel can also
> >> access it. :)
> >
> > But how is a user process prevented from modifying kernel memory?
> 
> This is the CPU's job. On Intel CPUs, user-space code runs at privilege level 3
> and kernel code runs at level 0, so code at level 3 cannot access data that
> requires level 0.

1) If a user process accesses kernel memory that is not mapped in its page table,
it gets SIGSEGV via a page fault (#PF). But what if a user process maps the very
same physical memory that the kernel has mapped? Why can't it access it then?
2) If two user processes map the same physical memory, what happens when one
process accesses that memory?

> 
> Thanks. :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
  2013-02-01  3:06                             ` Simon Jeons
@ 2013-02-01  3:39                               ` Tang Chen
  0 siblings, 0 replies; 67+ messages in thread
From: Tang Chen @ 2013-02-01  3:39 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	Jianguo Wu, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev

Hi Simon,

On 02/01/2013 11:06 AM, Simon Jeons wrote:
>
> How can one distinguish "mapped" from "in use"? I mean, how can we confirm that
> memory is actually used by the kernel rather than merely mapped?

If the page is free (for example, it is sitting in the buddy allocator), it is
not in use. Even if it is direct-mapped by the kernel, kernel code should not
access it, because nothing has allocated it. That is the kernel's own
discipline; the hardware and the user will not enforce it.

To access some memory, you first need a logical address, right? And how do you
get a logical address? You call an allocation API.

For example, when you are coding, you write:

p = alloc_xxx(); ---- allocate memory; now it is in use, and alloc_xxx()
                      lets the kernel know that
*p = ......      ---- use the memory

You won't write:

p = 0xFFFF8745;  ---- if you did, the kernel wouldn't know the page is in use
*p = ......      ---- wrong...

right?

That the kernel has mapped a page does not mean it is using the page; it must
allocate it first. That is simply the kernel's allocation logic.

Well, I think this is the best answer I can give for now. If you want to go
deeper, you need to read how the kernel manages physical pages. :)

>
> 1) If a user process accesses kernel memory that is not mapped in its page
> table, it gets SIGSEGV via a page fault (#PF). But what if a user process maps
> the very same physical memory that the kernel has mapped? Why can't it access
> it then?

When you call malloc() to allocate memory in user space, the OS will make sure
you never map a page that is already in use by the kernel.

A page that is mapped by the kernel but not used by it (not allocated, as above)
could be handed out by malloc() and mapped into user space. That is the
situation you are describing, right?

Now the page is mapped by both kernel and user, but allocated only by the user,
so the kernel will not touch it. When the kernel wants memory, it will allocate
some other pages. This is just the kernel's logic; it is what the memory
management subsystem does.

I can't say much more because I am also a student of memory management. This is
my understanding, and I hope it helps. :)

> 2) If two user processes map the same physical memory, what happens when one
> process accesses that memory?

Actually you don't need to worry about that situation. We can swap out the page
used by process 1, and then process 2 can use the same physical page; when
process 1 wants it again, we swap it back in. This only happens when physical
memory runs short. :)

Also, if you are using shared memory in user space, via

shmget(), shmat()......

then it is genuinely shared: both processes can use it at the same time.

Thanks. :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
  2013-01-09  9:32 ` [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove Tang Chen
  2013-01-29 13:02   ` Simon Jeons
  2013-01-29 13:04   ` Simon Jeons
@ 2013-02-04 23:04   ` Andrew Morton
  2 siblings, 0 replies; 67+ messages in thread
From: Andrew Morton @ 2013-02-04 23:04 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim,
	linuxppc-dev

On Wed, 9 Jan 2013 17:32:32 +0800
Tang Chen <tangchen@cn.fujitsu.com> wrote:

> +static void __meminit
> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	bool pgd_changed = false;
> +
> +	for (; start < end; start = next) {
> +		pgd = pgd_offset_k(start);
> +		if (!pgd_present(*pgd))
> +			continue;
> +
> +		next = pgd_addr_end(start, end);
> +
> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> +		remove_pud_table(pud, start, next, direct);
> +		if (free_pud_table(pud, pgd))
> +			pgd_changed = true;
> +		unmap_low_page(pud);
> +	}
> +
> +	if (pgd_changed)
> +		sync_global_pgds(start, end - 1);
> +
> +	flush_tlb_all();
> +}

This generates a compiler warning saying that `next' may be used
uninitialised.

The warning is correct.  If we take that `continue' on the first pass
through the loop, the "start = next" will copy uninitialised data into
`start'.

Is this the correct fix?

--- a/arch/x86/mm/init_64.c~memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix-fix
+++ a/arch/x86/mm/init_64.c
@@ -993,12 +993,12 @@ remove_pagetable(unsigned long start, un
 	bool pgd_changed = false;
 
 	for (; start < end; start = next) {
+		next = pgd_addr_end(start, end);
+
 		pgd = pgd_offset_k(start);
 		if (!pgd_present(*pgd))
 			continue;
 
-		next = pgd_addr_end(start, end);
-
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
 		remove_pud_table(pud, start, next, direct);
 		if (free_pud_table(pud, pgd))
_

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2013-02-04 23:04 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-01-09  9:32 [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Tang Chen
2013-01-09  9:32 ` [PATCH v6 01/15] memory-hotplug: try to offline the memory twice to avoid dependence Tang Chen
2013-01-09  9:32 ` [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Tang Chen
2013-01-09 23:11   ` Andrew Morton
2013-01-10  5:56     ` Tang Chen
2013-01-09  9:32 ` [PATCH v6 03/15] memory-hotplug: remove redundant codes Tang Chen
2013-01-09  9:32 ` [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs Tang Chen
2013-01-09 22:49   ` Andrew Morton
2013-01-10  6:07     ` Tang Chen
2013-01-09 23:19   ` Andrew Morton
2013-01-10  6:15     ` Tang Chen
2013-01-09  9:32 ` [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Tang Chen
2013-01-09 22:50   ` Andrew Morton
2013-01-10  2:25     ` Tang Chen
2013-01-09  9:32 ` [PATCH v6 06/15] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap Tang Chen
2013-01-09  9:32 ` [PATCH v6 07/15] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section() Tang Chen
2013-01-09  9:32 ` [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove Tang Chen
2013-01-29 13:02   ` Simon Jeons
2013-01-30  1:53     ` Jianguo Wu
2013-01-30  2:13       ` Simon Jeons
2013-01-29 13:04   ` Simon Jeons
2013-01-30  2:16     ` Tang Chen
2013-01-30  3:27       ` Simon Jeons
2013-01-30  5:55         ` Tang Chen
2013-01-30  7:32           ` Simon Jeons
2013-02-04 23:04   ` Andrew Morton
2013-01-09  9:32 ` [PATCH v6 09/15] memory-hotplug: remove page table of x86_64 architecture Tang Chen
2013-01-09  9:32 ` [PATCH v6 10/15] memory-hotplug: remove memmap of sparse-vmemmap Tang Chen
2013-01-09  9:32 ` [PATCH v6 11/15] memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP Tang Chen
2013-01-09  9:32 ` [PATCH v6 12/15] memory-hotplug: memory_hotplug: clear zone when removing the memory Tang Chen
2013-01-09  9:32 ` [PATCH v6 13/15] memory-hotplug: remove sysfs file of node Tang Chen
2013-01-09  9:32 ` [PATCH v6 14/15] memory-hotplug: free node_data when a node is offlined Tang Chen
2013-01-09  9:32 ` [PATCH v6 15/15] memory-hotplug: Do not allocate pdgat if it was not freed when offline Tang Chen
2013-01-09 22:23 ` [PATCH v6 00/15] memory-hotplug: hot-remove physical memory Andrew Morton
2013-01-10  2:17   ` Tang Chen
2013-01-10  7:14     ` Glauber Costa
2013-01-10  7:31       ` Kamezawa Hiroyuki
2013-01-10  7:55         ` Glauber Costa
2013-01-10  8:23           ` Kamezawa Hiroyuki
2013-01-10  8:36             ` Glauber Costa
2013-01-10  8:39               ` Kamezawa Hiroyuki
2013-01-09 23:33 ` Andrew Morton
2013-01-10  2:18   ` Tang Chen
2013-01-29 12:52 ` Simon Jeons
2013-01-30  2:32   ` Tang Chen
2013-01-30  2:48     ` Simon Jeons
2013-01-30  3:00       ` Tang Chen
2013-01-30 10:15   ` Tang Chen
2013-01-30 10:18     ` Tang Chen
2013-01-31  1:22     ` Simon Jeons
2013-01-31  3:31       ` Tang Chen
2013-01-31  6:19         ` Simon Jeons
2013-01-31  7:10           ` Tang Chen
2013-01-31  8:17             ` Simon Jeons
2013-01-31  8:48             ` Simon Jeons
2013-01-31  9:44               ` Tang Chen
2013-01-31 10:38                 ` Simon Jeons
2013-02-01  1:32                   ` Jianguo Wu
2013-02-01  1:36                     ` Simon Jeons
2013-02-01  1:57                       ` Jianguo Wu
2013-02-01  2:06                         ` Simon Jeons
2013-02-01  2:18                           ` Jianguo Wu
2013-02-01  1:57                       ` Tang Chen
2013-02-01  2:17                         ` Simon Jeons
2013-02-01  2:42                           ` Tang Chen
2013-02-01  3:06                             ` Simon Jeons
2013-02-01  3:39                               ` Tang Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).