* [Patch V3 1/9] x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition
[not found] <1439781546-7217-1-git-send-email-jiang.liu@linux.intel.com>
@ 2015-08-17 3:18 ` Jiang Liu
2015-08-17 3:19 ` [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug Jiang Liu
1 sibling, 0 replies; 6+ messages in thread
From: Jiang Liu @ 2015-08-17 3:18 UTC (permalink / raw)
To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
Rafael J. Wysocki, Len Brown, Pavel Machek, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, x86
Cc: Jiang Liu, Tony Luck, linux-mm, linux-hotplug, linux-kernel,
linux-pm
With the typical CPU hot-addition flow on x86, PCI host bridges embedded
in the physical processor are always associated with NUMA_NO_NODE, which
may cause sub-optimal performance.
1) Handle CPU hot-addition notification
acpi_processor_add()
acpi_processor_get_info()
acpi_processor_hotadd_init()
acpi_map_lsapic()
1.a) acpi_map_cpu2node()
2) Handle PCI host bridge hot-addition notification
acpi_pci_root_add()
pci_acpi_scan_root()
2.a) if (node != NUMA_NO_NODE && !node_online(node)) node = NUMA_NO_NODE;
3) Handle memory hot-addition notification
acpi_memory_device_add()
acpi_memory_enable_device()
add_memory()
3.a) node_set_online();
4) Online CPUs through sysfs interfaces
cpu_subsys_online()
cpu_up()
try_online_node()
4.a) node_set_online();
So the associated node always stays offline because it is not onlined
until step 3.a or 4.a.
We can improve performance by onlining the node at step 1.a. This change
also makes the code symmetric: nodes are always created when handling
CPU/memory hot-addition events rather than when handling user requests
from sysfs interfaces, and are destroyed when handling CPU/memory
hot-removal events.
It also closes a race window caused by kmalloc_node(cpu_to_node(cpu)),
which may panic the system as shown below.
[ 3663.324476] BUG: unable to handle kernel paging request at 0000000000001f08
[ 3663.332348] IP: [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.339719] PGD 82fe10067 PUD 82ebef067 PMD 0
[ 3663.344773] Oops: 0000 [#1] SMP
[ 3663.348455] Modules linked in: shpchp gpio_ich x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode joydev sb_edac edac_core lpc_ich ipmi_si tpm_tis ipmi_msghandler ioatdma wmi acpi_pad mac_hid lp parport ixgbe isci mpt2sas dca ahci ptp libsas libahci raid_class pps_core scsi_transport_sas mdio hid_generic usbhid hid
[ 3663.394393] CPU: 61 PID: 2416 Comm: cron Tainted: G W 3.14.0-rc5+ #21
[ 3663.402643] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRIVTIN1.86B.0047.F03.1403031049 03/03/2014
[ 3663.414299] task: ffff88082fe54b00 ti: ffff880845fba000 task.ti: ffff880845fba000
[ 3663.422741] RIP: 0010:[<ffffffff81172219>] [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.432857] RSP: 0018:ffff880845fbbcd0 EFLAGS: 00010246
[ 3663.439265] RAX: 0000000000001f00 RBX: 0000000000000000 RCX: 0000000000000000
[ 3663.447291] RDX: 0000000000000000 RSI: 0000000000000a8d RDI: ffffffff81a8d950
[ 3663.455318] RBP: ffff880845fbbd58 R08: ffff880823293400 R09: 0000000000000001
[ 3663.463345] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000002052d0
[ 3663.471363] R13: ffff880854c07600 R14: 0000000000000002 R15: 0000000000000000
[ 3663.479389] FS: 00007f2e8b99e800(0000) GS:ffff88105a400000(0000) knlGS:0000000000000000
[ 3663.488514] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3663.495018] CR2: 0000000000001f08 CR3: 00000008237b1000 CR4: 00000000001407e0
[ 3663.503476] Stack:
[ 3663.505757] ffffffff811bd74d ffff880854c01d98 ffff880854c01df0 ffff880854c01dd0
[ 3663.514167] 00000003208ca420 000000075a5d84d0 ffff88082fe54b00 ffffffff811bb35f
[ 3663.522567] ffff880854c07600 0000000000000003 0000000000001f00 ffff880845fbbd48
[ 3663.530976] Call Trace:
[ 3663.533753] [<ffffffff811bd74d>] ? deactivate_slab+0x41d/0x4f0
[ 3663.540421] [<ffffffff811bb35f>] ? new_slab+0x3f/0x2d0
[ 3663.546307] [<ffffffff811bb3c5>] new_slab+0xa5/0x2d0
[ 3663.552001] [<ffffffff81768c97>] __slab_alloc+0x35d/0x54a
[ 3663.558185] [<ffffffff810a4845>] ? local_clock+0x25/0x30
[ 3663.564686] [<ffffffff8177a34c>] ? __do_page_fault+0x4ec/0x5e0
[ 3663.571356] [<ffffffff810b0054>] ? alloc_fair_sched_group+0xc4/0x190
[ 3663.578609] [<ffffffff810c77f1>] ? __raw_spin_lock_init+0x21/0x60
[ 3663.585570] [<ffffffff811be476>] kmem_cache_alloc_node_trace+0xa6/0x1d0
[ 3663.593112] [<ffffffff810b0054>] ? alloc_fair_sched_group+0xc4/0x190
[ 3663.600363] [<ffffffff810b0054>] alloc_fair_sched_group+0xc4/0x190
[ 3663.607423] [<ffffffff810a359f>] sched_create_group+0x3f/0x80
[ 3663.613994] [<ffffffff810b611f>] sched_autogroup_create_attach+0x3f/0x1b0
[ 3663.621732] [<ffffffff8108258a>] sys_setsid+0xea/0x110
[ 3663.628020] [<ffffffff8177f42d>] system_call_fastpath+0x1a/0x1f
[ 3663.634780] Code: 00 44 89 e7 e8 b9 f8 f4 ff 41 f6 c4 10 74 18 31 d2 be 8d 0a 00 00 48 c7 c7 50 d9 a8 81 e8 70 6a f2 ff e8 db dd 5f 00 48 8b 45 c8 <48> 83 78 08 00 0f 84 b5 01 00 00 48 83 c0 08 44 89 75 c0 4d 89
[ 3663.657032] RIP [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.664491] RSP <ffff880845fbbcd0>
[ 3663.668429] CR2: 0000000000001f08
[ 3663.672659] ---[ end trace df13f08ed9de18ad ]---
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
arch/x86/kernel/acpi/boot.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e49ee24da85e..07930e1d2fe9 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -704,6 +704,11 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
nid = acpi_get_node(handle);
if (nid != -1) {
+ if (try_online_node(nid)) {
+ pr_warn("failed to online node%d for CPU%d, use node%d instead.\n",
+ nid, cpu, first_node(node_online_map));
+ nid = first_node(node_online_map);
+ }
set_apicid_to_node(physid, nid);
numa_set_node(cpu, nid);
}
--
1.7.10.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
[not found] <1439781546-7217-1-git-send-email-jiang.liu@linux.intel.com>
2015-08-17 3:18 ` [Patch V3 1/9] x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition Jiang Liu
@ 2015-08-17 3:19 ` Jiang Liu
2015-08-18 6:11 ` Tang Chen
2015-08-18 7:31 ` Ingo Molnar
1 sibling, 2 replies; 6+ messages in thread
From: Jiang Liu @ 2015-08-17 3:19 UTC (permalink / raw)
To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Rafael J. Wysocki, Len Brown, Pavel Machek, Borislav Petkov,
Andy Lutomirski, Boris Ostrovsky, Dave Hansen,
Jan H. Schönherr, Igor Mammedov, Paul E. McKenney, Xishi Qiu
Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel, Ingo Molnar,
linux-pm
With the current implementation, all CPUs within a NUMA node will be
associated with another NUMA node if their own node has no memory
installed. For example, on a four-node system, CPUs on nodes 2 and 3 are
associated with node 0 when no memory is installed on nodes 2 and 3,
which may confuse users.
root@bkd01sdp:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 0 size: 15602 MB
node 0 free: 15014 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15985 MB
node 1 free: 15686 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Worse, the CPU-to-node association won't get fixed even after
memory has been added to those nodes. After memory hot-addition to
node 2, CPUs on node 2 are still associated with node 0, which may cause
sub-optimal performance.
root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 0 size: 15602 MB
node 0 free: 14743 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15985 MB
node 1 free: 15715 MB
node 2 cpus:
node 2 size: 128 MB
node 2 free: 128 MB
node distances:
node 0 1 2
0: 10 21 21
1: 21 10 21
2: 21 21 10
With memoryless node support enabled, the system correctly reports the
hardware topology for nodes without memory installed.
root@bkd01sdp:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 15129 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15627 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 21 21 21
1: 21 10 21 21
2: 21 21 10 21
3: 21 21 21 10
With memoryless node enabled, CPUs are correctly associated with node 2
after memory hot-addition to node 2.
root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 14872 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15641 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 128 MB
node 2 free: 127 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 21 21 21
1: 21 10 21 21
2: 21 21 10 21
3: 21 21 21 10
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
arch/x86/Kconfig | 3 +++
arch/x86/kernel/acpi/boot.c | 4 +++-
arch/x86/kernel/smpboot.c | 2 ++
arch/x86/mm/numa.c | 49 +++++++++++++++++++++++++++++++------------
4 files changed, 44 insertions(+), 14 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b3a1a5d77d92..5d7ad70ace0d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2069,6 +2069,9 @@ config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
+config HAVE_MEMORYLESS_NODES
+ def_bool NUMA
+
config ARCH_ENABLE_SPLIT_PMD_PTLOCK
def_bool y
depends on X86_64 || X86_PAE
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 07930e1d2fe9..3403f1f0f28d 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -711,6 +711,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
}
set_apicid_to_node(physid, nid);
numa_set_node(cpu, nid);
+ set_cpu_numa_mem(cpu, local_memory_node(nid));
}
#endif
}
@@ -743,9 +744,10 @@ int acpi_unmap_cpu(int cpu)
{
#ifdef CONFIG_ACPI_NUMA
set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
+ set_cpu_numa_mem(cpu, NUMA_NO_NODE);
#endif
- per_cpu(x86_cpu_to_apicid, cpu) = -1;
+ per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
set_cpu_present(cpu, false);
num_processors--;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index b1f3ed9c7a9e..aeec91ac6fd4 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -162,6 +162,8 @@ static void smp_callin(void)
*/
phys_id = read_apic_id();
+ set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
+
/*
* the boot CPU has finished the init stage and is spinning
* on callin_map until we finish. We are free to set up this
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 08860bdf5744..f2a4e23bd14d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -22,6 +22,7 @@
int __initdata numa_off;
nodemask_t numa_nodes_parsed __initdata;
+static nodemask_t numa_nodes_empty __initdata;
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
EXPORT_SYMBOL(node_data);
@@ -560,17 +561,16 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
end = max(mi->blk[i].end, end);
}
- if (start >= end)
- continue;
-
/*
* Don't confuse VM with a node that doesn't have the
* minimum amount of memory:
*/
- if (end && (end - start) < NODE_MIN_SIZE)
- continue;
-
- alloc_node_data(nid);
+ if (start < end && (end - start) >= NODE_MIN_SIZE) {
+ alloc_node_data(nid);
+ } else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
+ alloc_node_data(nid);
+ node_set(nid, numa_nodes_empty);
+ }
}
/* Dump memblock with node info and return. */
@@ -587,14 +587,18 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
*/
static void __init numa_init_array(void)
{
- int rr, i;
+ int i, rr = MAX_NUMNODES;
- rr = first_node(node_online_map);
for (i = 0; i < nr_cpu_ids; i++) {
+ /* Search for an onlined node with memory */
+ do {
+ if (rr != MAX_NUMNODES)
+ rr = next_node(rr, node_online_map);
+ if (rr == MAX_NUMNODES)
+ rr = first_node(node_online_map);
+ } while (node_isset(rr, numa_nodes_empty));
+
numa_set_node(i, rr);
- rr = next_node(rr, node_online_map);
- if (rr == MAX_NUMNODES)
- rr = first_node(node_online_map);
}
}
@@ -696,9 +700,12 @@ static __init int find_near_online_node(int node)
{
int n, val;
int min_val = INT_MAX;
- int best_node = -1;
+ int best_node = NUMA_NO_NODE;
for_each_online_node(n) {
+ if (node_isset(n, numa_nodes_empty))
+ continue;
+
val = node_distance(node, n);
if (val < min_val) {
@@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
if (!node_online(node))
node = find_near_online_node(node);
numa_set_node(cpu, node);
+ if (node_spanned_pages(node))
+ set_cpu_numa_mem(cpu, node);
+ if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
+ node_clear(node, numa_nodes_empty);
+ }
+
+ /* Destroy empty nodes */
+ if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
+ int nid;
+ const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+
+ for_each_node_mask(nid, numa_nodes_empty) {
+ node_set_offline(nid);
+ memblock_free(__pa(node_data[nid]), nd_size);
+ node_data[nid] = NULL;
+ }
}
}
--
1.7.10.4
* Re: [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
2015-08-17 3:19 ` [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug Jiang Liu
@ 2015-08-18 6:11 ` Tang Chen
2015-08-18 6:59 ` Jiang Liu
2015-08-18 7:31 ` Ingo Molnar
1 sibling, 1 reply; 6+ messages in thread
From: Tang Chen @ 2015-08-18 6:11 UTC (permalink / raw)
To: Jiang Liu, Andrew Morton, Mel Gorman, David Rientjes,
Mike Galbraith, Peter Zijlstra, Rafael J . Wysocki, Tejun Heo,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Rafael J. Wysocki, Len Brown, Pavel Machek, Borislav Petkov,
Andy Lutomirski, Boris Ostrovsky, Dave Hansen,
"Jan H. Schönherr", Igor Mammedov, Paul E. McKenney
Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel, Ingo Molnar,
linux-pm, tangchen
Hi Liu,
On 08/17/2015 11:19 AM, Jiang Liu wrote:
> ......
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b3a1a5d77d92..5d7ad70ace0d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2069,6 +2069,9 @@ config USE_PERCPU_NUMA_NODE_ID
> def_bool y
> depends on NUMA
>
> +config HAVE_MEMORYLESS_NODES
> + def_bool NUMA
> +
> config ARCH_ENABLE_SPLIT_PMD_PTLOCK
> def_bool y
> depends on X86_64 || X86_PAE
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 07930e1d2fe9..3403f1f0f28d 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -711,6 +711,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
> }
> set_apicid_to_node(physid, nid);
> numa_set_node(cpu, nid);
> + set_cpu_numa_mem(cpu, local_memory_node(nid));
> }
> #endif
> }
> @@ -743,9 +744,10 @@ int acpi_unmap_cpu(int cpu)
> {
> #ifdef CONFIG_ACPI_NUMA
> set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
> + set_cpu_numa_mem(cpu, NUMA_NO_NODE);
> #endif
>
> - per_cpu(x86_cpu_to_apicid, cpu) = -1;
> + per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
> set_cpu_present(cpu, false);
> num_processors--;
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index b1f3ed9c7a9e..aeec91ac6fd4 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -162,6 +162,8 @@ static void smp_callin(void)
> */
> phys_id = read_apic_id();
>
> + set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
> +
> /*
> * the boot CPU has finished the init stage and is spinning
> * on callin_map until we finish. We are free to set up this
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 08860bdf5744..f2a4e23bd14d 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -22,6 +22,7 @@
>
> int __initdata numa_off;
> nodemask_t numa_nodes_parsed __initdata;
> +static nodemask_t numa_nodes_empty __initdata;
>
> struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
> EXPORT_SYMBOL(node_data);
> @@ -560,17 +561,16 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> end = max(mi->blk[i].end, end);
> }
>
> - if (start >= end)
> - continue;
> -
> /*
> * Don't confuse VM with a node that doesn't have the
> * minimum amount of memory:
> */
> - if (end && (end - start) < NODE_MIN_SIZE)
> - continue;
> -
> - alloc_node_data(nid);
> + if (start < end && (end - start) >= NODE_MIN_SIZE) {
> + alloc_node_data(nid);
> + } else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> + alloc_node_data(nid);
> + node_set(nid, numa_nodes_empty);
Judging from here, numa_nodes_empty seems to represent all memoryless nodes.
So, since we still have CPU-less nodes out there, shall we rename it to
numa_nodes_memoryless or something similar ?
And BTW, does x86 support CPU-less nodes after these patches ?
Since I don't have any memoryless or CPU-less node on my box, I cannot
tell for sure. In the original kernel a node is brought online when it
has memory, so I think it is supported.
> + }
> }
>
> /* Dump memblock with node info and return. */
> @@ -587,14 +587,18 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> */
> static void __init numa_init_array(void)
> {
> - int rr, i;
> + int i, rr = MAX_NUMNODES;
>
> - rr = first_node(node_online_map);
> for (i = 0; i < nr_cpu_ids; i++) {
> + /* Search for an onlined node with memory */
> + do {
> + if (rr != MAX_NUMNODES)
> + rr = next_node(rr, node_online_map);
> + if (rr == MAX_NUMNODES)
> + rr = first_node(node_online_map);
> + } while (node_isset(rr, numa_nodes_empty));
> +
> numa_set_node(i, rr);
> - rr = next_node(rr, node_online_map);
> - if (rr == MAX_NUMNODES)
> - rr = first_node(node_online_map);
> }
> }
>
> @@ -696,9 +700,12 @@ static __init int find_near_online_node(int node)
> {
> int n, val;
> int min_val = INT_MAX;
> - int best_node = -1;
> + int best_node = NUMA_NO_NODE;
>
> for_each_online_node(n) {
> + if (node_isset(n, numa_nodes_empty))
> + continue;
> +
> val = node_distance(node, n);
>
> if (val < min_val) {
> @@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
> if (!node_online(node))
> node = find_near_online_node(node);
> numa_set_node(cpu, node);
So, CPUs are still mapped to a nearby online node, right ?
I was expecting CPUs on a memoryless node to be mapped to the node they
belong to. If so, the current memory allocators may fail because they assume
each online node has memory. I was trying to do this in my patch:
https://lkml.org/lkml/2015/7/7/205
Of course, my patch is not meant to support memoryless nodes; it just ran
into this problem.
> + if (node_spanned_pages(node))
> + set_cpu_numa_mem(cpu, node);
> + if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
> + node_clear(node, numa_nodes_empty);
And since we are supporting memoryless nodes, it would be better to
provide a for_each_memoryless_node() wrapper.
> + }
> +
> + /* Destroy empty nodes */
> + if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> + int nid;
> + const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> +
> + for_each_node_mask(nid, numa_nodes_empty) {
> + node_set_offline(nid);
> + memblock_free(__pa(node_data[nid]), nd_size);
> + node_data[nid] = NULL;
So, memoryless nodes are finally set offline. That's a little different
from what I thought.
I was expecting that both memoryless and CPU-less nodes could stay
online after this patch, which would be very helpful to me.
But actually, they only exist temporarily, used to set _numa_mem_ so
that cpu_to_mem() is able to work, right ?
Thanks.
> + }
> }
> }
>
* Re: [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
2015-08-18 6:11 ` Tang Chen
@ 2015-08-18 6:59 ` Jiang Liu
2015-08-18 11:28 ` Tang Chen
0 siblings, 1 reply; 6+ messages in thread
From: Jiang Liu @ 2015-08-18 6:59 UTC (permalink / raw)
To: Tang Chen, Andrew Morton, Mel Gorman, David Rientjes,
Mike Galbraith, Peter Zijlstra, Rafael J . Wysocki, Tejun Heo,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Rafael J. Wysocki, Len Brown, Pavel Machek, Borislav Petkov,
Andy Lutomirski, Boris Ostrovsky, Dave Hansen,
Jan H. Schönherr, Igor Mammedov, Paul E. McKenney, Xishi Qiu
Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel, Ingo Molnar,
linux-pm
On 2015/8/18 14:11, Tang Chen wrote:
>
> Hi Liu,
>
> On 08/17/2015 11:19 AM, Jiang Liu wrote:
......
>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>> index 08860bdf5744..f2a4e23bd14d 100644
>> --- a/arch/x86/mm/numa.c
>> +++ b/arch/x86/mm/numa.c
>> @@ -22,6 +22,7 @@
>> int __initdata numa_off;
>> nodemask_t numa_nodes_parsed __initdata;
>> +static nodemask_t numa_nodes_empty __initdata;
>> struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
>> EXPORT_SYMBOL(node_data);
>> @@ -560,17 +561,16 @@ static int __init numa_register_memblks(struct
>> numa_meminfo *mi)
>> end = max(mi->blk[i].end, end);
>> }
>> - if (start >= end)
>> - continue;
>> -
>> /*
>> * Don't confuse VM with a node that doesn't have the
>> * minimum amount of memory:
>> */
>> - if (end && (end - start) < NODE_MIN_SIZE)
>> - continue;
>> -
>> - alloc_node_data(nid);
>> + if (start < end && (end - start) >= NODE_MIN_SIZE) {
>> + alloc_node_data(nid);
>> + } else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
>> + alloc_node_data(nid);
>> + node_set(nid, numa_nodes_empty);
>
> Seeing from here, I think numa_nodes_empty represents all memory-less
> nodes.
> So, since we still have cpu-less nodes out there, shall we rename it to
> numa_nodes_memoryless or something similar ?
>
> And BTW, does x86 support cpu-less node after these patches ?
>
> Since I don't have any memory-less or cpu-less node on my box, I cannot
> tell it clearly.
> A node is brought online when is has memory in original kernel. So I
> think it is supported.
Hi Chen,
	Thanks for the review. With current Intel processors, there is
no hardware configuration that yields a CPU-less NUMA node, but from the
code itself I think CPU-less nodes are supported. We can fake a CPU-less
node with the "maxcpus" kernel parameter. For example, when "maxcpus=2"
is specified on my system, we get the following NUMA topology, in which
node 2 is a CPU-less node with memory.
root@bkd04sdp:~# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 15954 MB
node 0 free: 15686 MB
node 1 cpus:
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 16113 MB
node 2 free: 16058 MB
node distances:
node 0 1 2
0: 10 21 21
1: 21 10 21
2: 21 21 10
>> + }
...
>> }
>> @@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
>> if (!node_online(node))
>> node = find_near_online_node(node);
>> numa_set_node(cpu, node);
>
> So, CPUs are still mapped to online near node, right ?
>
> I was expecting CPUs on a memory-less node are mapped to the node they
> belong to. If so, the current memory allocator may fail because they assume
> each online node has memory. I was trying to do this in my patch.
>
> https://lkml.org/lkml/2015/7/7/205
>
> Of course, my patch is not to support memory-less node, just run into
> this problem.
We have two sets of interfaces to figure out the NUMA node associated
with a CPU:
1) numa_node_id()/cpu_to_node() return the NUMA node the CPU belongs
to, regardless of whether that node has memory.
2) numa_mem_id()/cpu_to_mem() return the NUMA node the CPU should
allocate memory from.
>
>> + if (node_spanned_pages(node))
>> + set_cpu_numa_mem(cpu, node);
>> + if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
>> + node_clear(node, numa_nodes_empty);
>
> And since we are supporting memory-less node, it's better to provide a
> for_each_memoryless_node() wrapper.
>
>> + }
>> +
>> + /* Destroy empty nodes */
>> + if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
>> + int nid;
>> + const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
>> +
>> + for_each_node_mask(nid, numa_nodes_empty) {
>> + node_set_offline(nid);
>> + memblock_free(__pa(node_data[nid]), nd_size);
>> + node_data[nid] = NULL;
>
> So, memory-less nodes are set offline finally. It's a little different
> from what I thought.
> I was expecting that both memory-less and cpu-less nodes could also be
> online after
> this patch, which would be very helpful to me.
>
> But actually, they are just exist temporarily, used to set _numa_mem_ so
> that cpu_to_mem()
> is able to work, right ?
No. We have already removed nodes with CPUs but without memory from the
numa_nodes_empty set, so here we only destroy nodes that have neither
CPUs nor memory.
> + if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
> + node_clear(node, numa_nodes_empty);
Please refer to the example below, which has a memoryless node (node 1).
root@bkd04sdp:~# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 0 size: 15954 MB
node 0 free: 15584 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 2 size: 16113 MB
node 2 free: 15802 MB
node distances:
node 0 1 2
0: 10 21 21
1: 21 10 21
2: 21 21 10
Thanks!
Gerry
* Re: [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
2015-08-17 3:19 ` [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug Jiang Liu
2015-08-18 6:11 ` Tang Chen
@ 2015-08-18 7:31 ` Ingo Molnar
1 sibling, 0 replies; 6+ messages in thread
From: Ingo Molnar @ 2015-08-18 7:31 UTC (permalink / raw)
To: Jiang Liu
Cc: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Rafael J. Wysocki, Len Brown, Pavel Machek, Borislav Petkov,
Andy Lutomirski, Boris Ostrovsky, Dave Hansen,
Jan H. Schönherr, Igor Mammedov, Paul E. McKenney, Xishi Qiu
* Jiang Liu <jiang.liu@linux.intel.com> wrote:
> With current implementation, all CPUs within a NUMA node will be
> assocaited with another NUMA node if the node has no memory installed.
typo.
>
> For example, on a four-node system, CPUs on node 2 and 3 are associated
> with node 0 when are no memory install on node 2 and 3, which may
> confuse users.
>
> root@bkd01sdp:~# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 0 size: 15602 MB
> node 0 free: 15014 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15985 MB
> node 1 free: 15686 MB
> node distances:
> node 0 1
> 0: 10 21
> 1: 21 10
>
> To be worse, the CPU affinity relationship won't get fixed even after
> memory has been added to those nodes. After memory hot-addition to
> node 2, CPUs on node 2 are still associated with node 0. This may cause
> sub-optimal performance.
> root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
> available: 3 nodes (0-2)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 0 size: 15602 MB
> node 0 free: 14743 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15985 MB
> node 1 free: 15715 MB
> node 2 cpus:
> node 2 size: 128 MB
> node 2 free: 128 MB
> node distances:
> node 0 1 2
> 0: 10 21 21
> 1: 21 10 21
> 2: 21 21 10
>
> With support of memoryless node enabled, it will correctly report system
> hardware topology for nodes without memory installed.
> root@bkd01sdp:~# numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
> node 0 size: 15725 MB
> node 0 free: 15129 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15862 MB
> node 1 free: 15627 MB
> node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
> node 2 size: 0 MB
> node 2 free: 0 MB
> node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 3 size: 0 MB
> node 3 free: 0 MB
> node distances:
> node 0 1 2 3
> 0: 10 21 21 21
> 1: 21 10 21 21
> 2: 21 21 10 21
> 3: 21 21 21 10
>
> With memoryless node enabled, CPUs are correctly associated with node 2
> after memory hot-addition to node 2.
> root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
> node 0 size: 15725 MB
> node 0 free: 14872 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15862 MB
> node 1 free: 15641 MB
> node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
> node 2 size: 128 MB
> node 2 free: 127 MB
> node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 3 size: 0 MB
> node 3 free: 0 MB
> node distances:
> node 0 1 2 3
> 0: 10 21 21 21
> 1: 21 10 21 21
> 2: 21 21 10 21
> 3: 21 21 21 10
>
> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
> ---
> arch/x86/Kconfig | 3 +++
> arch/x86/kernel/acpi/boot.c | 4 +++-
> arch/x86/kernel/smpboot.c | 2 ++
> arch/x86/mm/numa.c | 49 +++++++++++++++++++++++++++++++------------
> 4 files changed, 44 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b3a1a5d77d92..5d7ad70ace0d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2069,6 +2069,9 @@ config USE_PERCPU_NUMA_NODE_ID
> def_bool y
> depends on NUMA
>
> +config HAVE_MEMORYLESS_NODES
> + def_bool NUMA
> +
> config ARCH_ENABLE_SPLIT_PMD_PTLOCK
> def_bool y
> depends on X86_64 || X86_PAE
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 07930e1d2fe9..3403f1f0f28d 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -711,6 +711,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
> }
> set_apicid_to_node(physid, nid);
> numa_set_node(cpu, nid);
> + set_cpu_numa_mem(cpu, local_memory_node(nid));
> }
> #endif
> }
> @@ -743,9 +744,10 @@ int acpi_unmap_cpu(int cpu)
> {
> #ifdef CONFIG_ACPI_NUMA
> set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
> + set_cpu_numa_mem(cpu, NUMA_NO_NODE);
> #endif
>
> - per_cpu(x86_cpu_to_apicid, cpu) = -1;
> + per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
> set_cpu_present(cpu, false);
> num_processors--;
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index b1f3ed9c7a9e..aeec91ac6fd4 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -162,6 +162,8 @@ static void smp_callin(void)
> */
> phys_id = read_apic_id();
>
> + set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
> +
> /*
> * the boot CPU has finished the init stage and is spinning
> * on callin_map until we finish. We are free to set up this
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 08860bdf5744..f2a4e23bd14d 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -22,6 +22,7 @@
>
> int __initdata numa_off;
> nodemask_t numa_nodes_parsed __initdata;
> +static nodemask_t numa_nodes_empty __initdata;
>
> struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
> EXPORT_SYMBOL(node_data);
> @@ -560,17 +561,16 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> end = max(mi->blk[i].end, end);
> }
>
> - if (start >= end)
> - continue;
> -
> /*
> * Don't confuse VM with a node that doesn't have the
> * minimum amount of memory:
> */
> - if (end && (end - start) < NODE_MIN_SIZE)
> - continue;
> -
> - alloc_node_data(nid);
> + if (start < end && (end - start) >= NODE_MIN_SIZE) {
> + alloc_node_data(nid);
> + } else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> + alloc_node_data(nid);
> + node_set(nid, numa_nodes_empty);
> + }
> }
>
> /* Dump memblock with node info and return. */
> @@ -587,14 +587,18 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> */
> static void __init numa_init_array(void)
> {
> - int rr, i;
> + int i, rr = MAX_NUMNODES;
>
> - rr = first_node(node_online_map);
> for (i = 0; i < nr_cpu_ids; i++) {
> + /* Search for an onlined node with memory */
> + do {
> + if (rr != MAX_NUMNODES)
> + rr = next_node(rr, node_online_map);
> + if (rr == MAX_NUMNODES)
> + rr = first_node(node_online_map);
> + } while (node_isset(rr, numa_nodes_empty));
> +
> numa_set_node(i, rr);
> - rr = next_node(rr, node_online_map);
> - if (rr == MAX_NUMNODES)
> - rr = first_node(node_online_map);
> }
> }
>
> @@ -696,9 +700,12 @@ static __init int find_near_online_node(int node)
> {
> int n, val;
> int min_val = INT_MAX;
> - int best_node = -1;
> + int best_node = NUMA_NO_NODE;
>
> for_each_online_node(n) {
> + if (node_isset(n, numa_nodes_empty))
> + continue;
> +
> val = node_distance(node, n);
>
> if (val < min_val) {
> @@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
> if (!node_online(node))
> node = find_near_online_node(node);
> numa_set_node(cpu, node);
> + if (node_spanned_pages(node))
> + set_cpu_numa_mem(cpu, node);
> + if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
> + node_clear(node, numa_nodes_empty);
> + }
> +
> + /* Destroy empty nodes */
> + if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> + int nid;
> + const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> +
> + for_each_node_mask(nid, numa_nodes_empty) {
> + node_set_offline(nid);
> + memblock_free(__pa(node_data[nid]), nd_size);
> + node_data[nid] = NULL;
> + }
> }
> }
So this patch makes messy code even messier.
I'd like to see the fixes, but this really needs to be done cleaner.
There are several problems:
1) the naming is not clear enough between VM and scheduling nodes and their masks.
For example we are mixing uses of 'numa_nodes_empty' (a memory-space concept) and
'node_online_map' (a scheduling concept), which makes the code hard to read.
To add insult to injury, 'numa_nodes_empty' is added with zero comments:
> +static nodemask_t numa_nodes_empty __initdata;
To resolve this, the names should be clearer, I think. Something like
numa_nomem_mask or so.
2) the existing code is (unfortunately) confusing to begin with. For example, what
does find_near_online_node() do? It's not commented.
init_cpu_to_node() has comments, but they are mostly implementational gibberish that
does not answer the question of what the function's main, high-level purpose
is. I'm uneasy about modifying code that is hard to read - it should be
improved first.
3)
So I'm wondering about logic like this:
> + if (node_spanned_pages(node))
> + set_cpu_numa_mem(cpu, node);
So first we link the node in the _numa_mem_ array if the node has memory (?).
> + if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
> + node_clear(node, numa_nodes_empty);
But we unconditionally clear it in numa_nodes_empty - i.e. we treat it as
having memory? Shouldn't the node_clear() be inside the 'has memory' condition?
4)
Bits like this are confusing:
> + /* Destroy empty nodes */
> + if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> + int nid;
> + const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
Why do we 'destroy' them? What does 'destroy' mean here?
So I think this series should first make the whole code readable and
understandable - then fix the bugs as gradually as possible: one bug, one patch.
Thanks,
Ingo
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
2015-08-18 6:59 ` Jiang Liu
@ 2015-08-18 11:28 ` Tang Chen
0 siblings, 0 replies; 6+ messages in thread
From: Tang Chen @ 2015-08-18 11:28 UTC (permalink / raw)
To: Jiang Liu, Andrew Morton, Mel Gorman, David Rientjes,
Mike Galbraith, Peter Zijlstra, Rafael J . Wysocki, Tejun Heo,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Rafael J. Wysocki, Len Brown, Pavel Machek, Borislav Petkov,
Andy Lutomirski, Boris Ostrovsky, Dave Hansen,
"Jan H. Schönherr"
Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel, Ingo Molnar,
linux-pm, tangchen
On 08/18/2015 02:59 PM, Jiang Liu wrote:
>
> ...
>>> }
>>> @@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
>>> if (!node_online(node))
>>> node = find_near_online_node(node);
Hi Liu,
If CPU-less, memory-less, and normal nodes will all be online anyway,
I think we don't need find_near_online_node() any more for
CPUs on offline nodes.
Or is there any other case?
Thanks.
Thread overview: 6+ messages
[not found] <1439781546-7217-1-git-send-email-jiang.liu@linux.intel.com>
2015-08-17 3:18 ` [Patch V3 1/9] x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition Jiang Liu
2015-08-17 3:19 ` [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug Jiang Liu
2015-08-18 6:11 ` Tang Chen
2015-08-18 6:59 ` Jiang Liu
2015-08-18 11:28 ` Tang Chen
2015-08-18 7:31 ` Ingo Molnar