[PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent.

* [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent.
@ 2015-09-10  4:27 Tang Chen
  2015-09-10  4:27 ` [PATCH v2 1/7] x86, numa: Move definition of find_near_online_node() forward Tang Chen
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Tang Chen @ 2015-09-10  4:27 UTC (permalink / raw)
  To: tj, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan
  Cc: tangchen, x86, linux-acpi, linux-kernel, linux-mm

The whole patch-set aims at solving this problem:

[Problem]

The cpuid <-> nodeid mapping is first established at boot time, and the workqueue
code caches it in wq_numa_possible_cpumask in wq_numa_init(), also at boot time.

When a node goes online or offline, the cpuid <-> nodeid mapping is established or
destroyed, so the mapping can change across node hotplug. The workqueue code,
however, never updates wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU
------------------------
node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU
------------------------
node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU
------------------------
node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.
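
The staleness can be reproduced in a small user-space sketch. This is not kernel
code: the two arrays and the helpers map_range(), boot() and replug() are
hypothetical names made up for illustration, with cpu_to_node_cached standing in
for the snapshot that wq_numa_possible_cpumask holds.

```c
#include <assert.h>
#include <string.h>

#define NR_CPUS 120
#define NO_NODE (-1)

/* Live cpuid -> nodeid mapping, changed by hotplug (illustrative only). */
static int cpu_to_node_live[NR_CPUS];
/* Boot-time snapshot, standing in for wq_numa_possible_cpumask. */
static int cpu_to_node_cached[NR_CPUS];

static void map_range(int first, int last, int node)
{
	assert(first >= 0 && last < NR_CPUS && first <= last);
	for (int cpu = first; cpu <= last; cpu++)
		cpu_to_node_live[cpu] = node;
}

/* Build the boot-time layout from the first table and snapshot it once. */
static void boot(void)
{
	map_range(0, 14, 0);
	map_range(60, 74, 0);
	map_range(15, 29, 1);
	map_range(75, 89, 1);
	map_range(30, 44, 2);
	map_range(90, 104, 2);
	map_range(45, 59, 3);
	map_range(105, 119, 3);
	memcpy(cpu_to_node_cached, cpu_to_node_live, sizeof(cpu_to_node_cached));
}

/* Hot-remove node2 and node3, then hot-add node4 and node5. */
static void replug(void)
{
	map_range(30, 59, NO_NODE);
	map_range(90, 119, NO_NODE);
	map_range(30, 59, 4);
	map_range(90, 119, 5);
}
```

After boot() followed by replug(), cpu_to_node_live[30] is 4 while
cpu_to_node_cached[30] is still 2, which is the mismatch described above.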

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs){
...
        /* if cpumask is contained inside a NUMA node, we belong to that node */
        if (wq_numa_enabled) {
                for_each_node(node) {
                        if (cpumask_subset(pool->attrs->cpumask,
                                           wq_numa_possible_cpumask[node])) {
                                pool->node = node;
                                break;
                        }
                }
        }

Since wq_numa_possible_cpumask is not updated, pool->node could be set to an offline
node, which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
        struct worker *worker;

        worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, using the wrong node.

        ......

        return worker;
}

[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id)     <->   apicid
4. cpuid (logical cpu id)     <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and the nodeid <-> pxm
   mapping is set up at boot time. This mapping is persistent and won't change.

2. The apicid <-> nodeid mapping is set up using the info in 1. It is set up at boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also
   persistent.

3. The cpuid <-> apicid mapping is set up at boot time and CPU hotadd time. A cpuid is
   allocated lowest id first, released at CPU hotremove time, and then reused for other
   hotadded CPUs. So this mapping is not persistent.

4. The cpuid <-> nodeid mapping is also set up at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid
mapping. So the key point is obtaining all cpus' apicid.
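
To see why mapping 4 inherits the instability of mapping 3, here is a minimal
user-space sketch. The table sizes, the "four apic ids per node" layout, and the
helper names init_maps(), cpu_add(), cpu_remove() and cpu_to_node() are all
assumptions chosen for illustration, not the kernel's definitions.

```c
#include <assert.h>

#define NR_CPUS    8
#define MAX_APICID 16
#define BAD_APICID (-1)
#define NO_NODE    (-1)

/* Mapping 2: apicid -> nodeid, fixed by firmware in this sketch. */
static int apicid_to_node[MAX_APICID];
/* Mapping 3: cpuid -> apicid, BAD_APICID while the cpuid is free. */
static int cpuid_to_apicid[NR_CPUS];

static void init_maps(void)
{
	for (int i = 0; i < MAX_APICID; i++)
		apicid_to_node[i] = i / 4;	/* assumed: four apic ids per node */
	for (int i = 0; i < NR_CPUS; i++)
		cpuid_to_apicid[i] = BAD_APICID;
}

/* Hot-add: hand out the lowest free cpuid, as described in item 3. */
static int cpu_add(int apicid)
{
	assert(apicid >= 0 && apicid < MAX_APICID);
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpuid_to_apicid[cpu] == BAD_APICID) {
			cpuid_to_apicid[cpu] = apicid;
			return cpu;
		}
	}
	return -1;
}

/* Hot-remove: release the cpuid so that it can be reused. */
static void cpu_remove(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpuid_to_apicid[cpu] = BAD_APICID;
}

/* Mapping 4 is the composition of mappings 3 and 2. */
static int cpu_to_node(int cpu)
{
	int apicid = cpuid_to_apicid[cpu];

	return apicid == BAD_APICID ? NO_NODE : apicid_to_node[apicid];
}
```

Adding apicid 0 and 4 yields cpuid 0 and 1 on nodes 0 and 1. Removing cpuid 1 and
then adding apicid 8 reuses cpuid 1, which now resolves to node 2.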

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following steps:

1. Enable the apic registration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to let
   the caller control whether disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mappings, and modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mappings when
   registering the local apic, and store them in this array.

3. Enable the _MAT and MADT related APIs to return the apicid of non-present or
   disabled cpus. This is also done by introducing an extra parameter to these APIs
   to let the caller control whether disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.

Patches 1 ~ 3 are preparatory work.
Patches 4 ~ 7 implement the 4 steps above.

For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
https://lkml.org/lkml/2015/5/14/244
https://lkml.org/lkml/2015/7/7/200

Change log v1 -> v2:
1. Split code movement and actual changes. Add patch 1.
2. Synchronize best near online node record when node hotplug happens. In patch 2.
3. Fix some comments.

Gu Zheng (5):
  x86, gfp: Cache best near node for memory allocation.
  x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at
    boot time.
  x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store
    persistent cpuid <-> apicid mapping.
  x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.
  x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when
    booting.

Tang Chen (2):
  x86, numa: Move definition of find_near_online_node() forward.
  x86, numa: Introduce a node to node array to map a node to its best
    online node.

 arch/ia64/kernel/acpi.c         |   2 +-
 arch/x86/include/asm/mpspec.h   |   1 +
 arch/x86/include/asm/topology.h |  10 ++++
 arch/x86/kernel/acpi/boot.c     |   8 +--
 arch/x86/kernel/apic/apic.c     |  77 ++++++++++++++++++++++---
 arch/x86/mm/numa.c              |  80 +++++++++++++++++++-------
 drivers/acpi/acpi_processor.c   |   5 +-
 drivers/acpi/bus.c              |   3 +
 drivers/acpi/processor_core.c   | 122 +++++++++++++++++++++++++++++++++-------
 include/linux/acpi.h            |   2 +
 include/linux/gfp.h             |   8 ++-
 mm/memory_hotplug.c             |   4 ++
 12 files changed, 264 insertions(+), 58 deletions(-)

-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH v2 1/7] x86, numa: Move definition of find_near_online_node() forward.
  2015-09-10  4:27 [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Tang Chen
@ 2015-09-10  4:27 ` Tang Chen
  2015-09-10  4:27 ` [PATCH v2 2/7] x86, numa: Introduce a node to node array to map a node to its best online node Tang Chen
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Tang Chen @ 2015-09-10  4:27 UTC (permalink / raw)
  To: tj, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan
  Cc: tangchen, x86, linux-acpi, linux-kernel, linux-mm

This function will be called earlier in the upcoming patches, so simply move its
definition forward. Also add a kernel-doc comment for it.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c | 47 +++++++++++++++++++++++++++++------------------
 1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..fea387a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -78,6 +78,35 @@ EXPORT_SYMBOL(node_to_cpumask_map);
 DEFINE_EARLY_PER_CPU(int, x86_cpu_to_node_map, NUMA_NO_NODE);
 EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map);
 
+/**
+ * find_near_online_node - Find the best near online node of a node.
+ * @node: NUMA node ID of the current node.
+ *
+ * Find the best near online node of @node, based on node_distance[] array.
+ * The best near online node is the backup node for memory allocation on
+ * one node.
+ *
+ * RETURNS:
+ * The best near online node ID on success, -1 on failure.
+ */
+static __init int find_near_online_node(int node)
+{
+	int n, val;
+	int min_val = INT_MAX;
+	int near_node = -1;
+
+	for_each_online_node(n) {
+		val = node_distance(node, n);
+
+		if (val < min_val) {
+			min_val = val;
+			near_node = n;
+		}
+	}
+
+	return near_node;
+}
+
 void numa_set_node(int cpu, int node)
 {
 	int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
@@ -702,24 +731,6 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
-static __init int find_near_online_node(int node)
-{
-	int n, val;
-	int min_val = INT_MAX;
-	int best_node = -1;
-
-	for_each_online_node(n) {
-		val = node_distance(node, n);
-
-		if (val < min_val) {
-			min_val = val;
-			best_node = n;
-		}
-	}
-
-	return best_node;
-}
-
 /*
  * Setup early cpu_to_node.
  *
-- 
1.9.3

* [PATCH v2 2/7] x86, numa: Introduce a node to node array to map a node to its best online node.
  2015-09-10  4:27 [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Tang Chen
  2015-09-10  4:27 ` [PATCH v2 1/7] x86, numa: Move definition of find_near_online_node() forward Tang Chen
@ 2015-09-10  4:27 ` Tang Chen
  2015-09-10  4:27 ` [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation Tang Chen
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Tang Chen @ 2015-09-10  4:27 UTC (permalink / raw)
  To: tj, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan
  Cc: tangchen, x86, linux-acpi, linux-kernel, linux-mm

The whole patch-set aims at solving this problem:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU
------------------------
node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU
------------------------
node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU
------------------------
node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs){
...
        /* if cpumask is contained inside a NUMA node, we belong to that node */
        if (wq_numa_enabled) {
                for_each_node(node) {
                        if (cpumask_subset(pool->attrs->cpumask,
                                           wq_numa_possible_cpumask[node])) {
                                pool->node = node;
                                break;
                        }
                }
        }

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
        struct worker *worker;

        worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, using the wrong node.

        ......

        return worker;
}

[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id)     <->   apicid
4. cpuid (logical cpu id)     <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following steps:

1. Enable apic registration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT related APIs to return non-present or disabled cpus' apicid.
   This is also done by introducing an extra parameter to these apis to let the caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.

But before that, memory allocators must be able to get the best near online node
at any time, because the best near online node changes when node hotplug happens.

In the current kernel, CPUs on a memory-less node are all mapped to its best near
online node so that memory allocation on these CPUs succeeds. This is done at boot
time, outside alloc_pages_node() and alloc_pages_exact_node().

In this patch, we calculate the best near online node for all nodes at node hotplug
time and store the results in an array, so that they can be obtained inside the
memory allocator at any time.
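
The caching scheme can be sketched in user space as follows. The distance table
values and the names node_dist, online, near_node_map, find_near(),
update_near_node_map() and set_online() are assumptions for illustration only;
the search mirrors find_near_online_node() from patch 1.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_NODES 4
#define NO_NODE   (-1)

/* Hypothetical distance table in the usual SLIT style (10 = local). */
static const int node_dist[MAX_NODES][MAX_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 40, 30, 20, 10 },
};
static bool online[MAX_NODES];
static int near_node_map[MAX_NODES];

/* Closest online node wins; NO_NODE if nothing is online. */
static int find_near(int node)
{
	int best = NO_NODE, min_val = INT_MAX;

	for (int n = 0; n < MAX_NODES; n++) {
		if (online[n] && node_dist[node][n] < min_val) {
			min_val = node_dist[node][n];
			best = n;
		}
	}
	return best;
}

/* Recompute every entry; called after any node goes online or offline. */
static void update_near_node_map(void)
{
	for (int node = 0; node < MAX_NODES; node++)
		near_node_map[node] = find_near(node);
}

static void set_online(int node, bool state)
{
	assert(node >= 0 && node < MAX_NODES);
	online[node] = state;
	update_near_node_map();
}
```

With only nodes 0 and 1 online, near_node_map[3] is 1. Once node 2 comes online it
becomes 2, and after node 2 goes offline again near_node_map[2] falls back to 1.
Lookups stay a single array read because the whole map is rebuilt on each hotplug
event rather than on each allocation.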

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/topology.h | 10 ++++++++++
 arch/x86/mm/numa.c              | 32 +++++++++++++++++++++++++++++++-
 mm/memory_hotplug.c             |  4 ++++
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 0fb4648..53422fd 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -82,6 +82,9 @@ static inline const struct cpumask *cpumask_of_node(int node)
 }
 #endif

+extern int get_near_online_node(int node);
+extern void update_node_to_near_node_map(void);
+
 extern void setup_node_to_cpumask_map(void);

 /*
@@ -113,6 +116,13 @@ static inline int early_cpu_to_node(int cpu)

 static inline void setup_node_to_cpumask_map(void) { }

+static inline int get_near_online_node(int node)
+{
+	return 0;
+}
+
+static inline void update_node_to_near_node_map(void) { }
+
 #endif

 #include <asm-generic/topology.h>
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index fea387a..8bd7661 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -78,6 +78,14 @@ EXPORT_SYMBOL(node_to_cpumask_map);
 DEFINE_EARLY_PER_CPU(int, x86_cpu_to_node_map, NUMA_NO_NODE);
 EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map);

+/*
+ * Map nid index to the best near online node. The best near online node
+ * is the backup node for memory allocation on offline node.
+ */
+static int node_to_near_node_map[] = {
+	[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE,
+};
+
 /**
  * find_near_online_node - Find the best near online node of a node.
  * @node: NUMA node ID of the current node.
@@ -89,7 +97,7 @@ EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map);
  * RETURNS:
  * The best near online node ID on success, -1 on failure.
  */
-static __init int find_near_online_node(int node)
+static int find_near_online_node(int node)
 {
 	int n, val;
 	int min_val = INT_MAX;
@@ -107,6 +115,25 @@ static __init int find_near_online_node(int node)
 	return near_node;
 }

+int get_near_online_node(int node)
+{
+	return node_to_near_node_map[node];
+}
+EXPORT_SYMBOL(get_near_online_node);
+
+static void set_near_online_node(int node)
+{
+	node_to_near_node_map[node] = find_near_online_node(node);
+}
+
+void update_node_to_near_node_map(void)
+{
+	int node;
+
+	for_each_node(node)
+		set_near_online_node(node);
+}
+
 void numa_set_node(int cpu, int node)
 {
 	int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
@@ -126,6 +153,8 @@ void numa_set_node(int cpu, int node)
 #endif
 	per_cpu(x86_cpu_to_node_map, cpu) = node;

+	set_near_online_node(node);
+
 	set_cpu_numa_node(cpu, node);
 }

@@ -249,6 +278,7 @@ static void __init alloc_node_data(int nid)
 	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));

 	node_set_online(nid);
+	update_node_to_near_node_map();
 }

 /**
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6da82bc..9d78d5f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1164,6 +1164,8 @@ int try_online_node(int nid)
 		goto out;
 	}
 	node_set_online(nid);
+	update_node_to_near_node_map();
+
 	ret = register_one_node(nid);
 	BUG_ON(ret);

@@ -1264,6 +1266,7 @@ int __ref add_memory(int nid, u64 start, u64 size)

 	/* we online node here. we can't roll back from here. */
 	node_set_online(nid);
+	update_node_to_near_node_map();

 	if (new_node) {
 		ret = register_one_node(nid);
@@ -1970,6 +1973,7 @@ void try_offline_node(int nid)
 	 */
 	node_set_offline(nid);
 	unregister_one_node(nid);
+	update_node_to_near_node_map();

 	/* free waittable in each zone */
 	for (i = 0; i < MAX_NR_ZONES; i++) {
-- 
1.9.3

* [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation.
  2015-09-10  4:27 [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Tang Chen
  2015-09-10  4:27 ` [PATCH v2 1/7] x86, numa: Move definition of find_near_online_node() forward Tang Chen
  2015-09-10  4:27 ` [PATCH v2 2/7] x86, numa: Introduce a node to node array to map a node to its best online node Tang Chen
@ 2015-09-10  4:27 ` Tang Chen
  2015-09-10 19:29   ` Tejun Heo
  2015-09-10  4:27 ` [PATCH v2 4/7] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time Tang Chen
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Tang Chen @ 2015-09-10  4:27 UTC (permalink / raw)
  To: tj, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan
  Cc: tangchen, x86, linux-acpi, linux-kernel, linux-mm, Gu Zheng

From: Gu Zheng <guz.fnst@cn.fujitsu.com>

In the current kernel, init_cpu_to_node() maps every possible cpu that resides in
a memory-less node to the best near online node.

init_cpu_to_node()
{
	......
	for_each_possible_cpu(cpu) {
		......
		if (!node_online(node))
			node = find_near_online_node(node);
		numa_set_node(cpu, node);
	}
}

The reason for doing this is to prevent memory allocation failure if the
cpu is online but there is no memory on that node.

But since the cpuid <-> nodeid mapping is planned to be made static, doing
so in the initialization phase makes no sense any more.

The best near online node for each node has been cached in an array in the
previous patch, so CPUs on memory-less nodes no longer need to be mapped to
other nodes.

So in this patch, we get the best near online node for CPUs on memory-less nodes
inside alloc_pages_node() and alloc_pages_exact_node() to avoid memory allocation
failure.
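
The node selection that the allocator performs can be sketched as below. This is
a user-space illustration under assumed state: node_is_online[] and
near_online_node[] are hypothetical stand-ins for node_online() and
get_near_online_node(), and resolve_alloc_node() is a made-up helper, not a
kernel function.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 4

/* Assumed state: nodes 0 and 1 online, nodes 2 and 3 offline. */
static bool node_is_online[MAX_NODES] = { true, true, false, false };
/* Assumed cache contents: the best near online node of each node. */
static int near_online_node[MAX_NODES] = { 0, 1, 1, 1 };

/* Pick the node an allocation request is actually served from. */
static int resolve_alloc_node(int nid, int local_node)
{
	if (nid < 0)
		nid = local_node;		/* caller has no preference */
	assert(nid < MAX_NODES);

	if (!node_is_online[nid])
		nid = near_online_node[nid];	/* fall back to cached neighbour */
	return nid;
}
```

A request for node 1 is served from node 1, while requests for the offline nodes 2
and 3 are redirected to node 1. The cpuid <-> nodeid mapping itself is left alone;
only the node used for the allocation changes.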

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c  | 3 +--
 include/linux/gfp.h | 8 +++++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bd7661..e89b9fb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -151,6 +151,7 @@ void numa_set_node(int cpu, int node)
 		return;
 	}
 #endif
+
 	per_cpu(x86_cpu_to_node_map, cpu) = node;
 
 	set_near_online_node(node);
@@ -787,8 +788,6 @@ void __init init_cpu_to_node(void)
 
 		if (node == NUMA_NO_NODE)
 			continue;
-		if (!node_online(node))
-			node = find_near_online_node(node);
 		numa_set_node(cpu, node);
 	}
 }
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ad35f30..1a1324f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 	if (nid < 0)
 		nid = numa_node_id();
 
+	if (!node_online(nid))
+		nid = get_near_online_node(nid);
+
 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
 static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
-	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+	if (!node_online(nid))
+		nid = get_near_online_node(nid);
 
 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
-- 
1.9.3

* [PATCH v2 4/7] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time.
  2015-09-10  4:27 [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Tang Chen
                   ` (2 preceding siblings ...)
  2015-09-10  4:27 ` [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation Tang Chen
@ 2015-09-10  4:27 ` Tang Chen
  2015-09-10 23:10   ` Rafael J. Wysocki
  2015-09-10  4:27 ` [PATCH v2 5/7] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping Tang Chen
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Tang Chen @ 2015-09-10  4:27 UTC (permalink / raw)
  To: tj, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan
  Cc: tangchen, x86, linux-acpi, linux-kernel, linux-mm, Gu Zheng

From: Gu Zheng <guz.fnst@cn.fujitsu.com>

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU
------------------------
node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU
------------------------
node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU
------------------------
node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs){
...
        /* if cpumask is contained inside a NUMA node, we belong to that node */
        if (wq_numa_enabled) {
                for_each_node(node) {
                        if (cpumask_subset(pool->attrs->cpumask,
                                           wq_numa_possible_cpumask[node])) {
                                pool->node = node;
                                break;
                        }
                }
        }

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
        struct worker *worker;

        worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, using the wrong node.

        ......

        return worker;
}

[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id)     <->   apicid
4. cpuid (logical cpu id)     <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following steps:

1. Enable apic registration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT related APIs to return non-present or disabled cpus' apicid.
   This is also done by introducing an extra parameter to these apis to let the caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.

This patch finishes step 1.
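
The shape of the change is a common one: the existing entry point becomes a thin
wrapper around a worker that takes an extra flag. A minimal user-space sketch,
assuming simplified bookkeeping (the arrays, counters and the names
register_cpu_common() and register_cpu() are hypothetical, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

static bool cpu_possible[NR_CPUS];
static bool cpu_present[NR_CPUS];
static int num_processors;	/* counts enabled cpus only */
static int next_cpu;		/* next unused logical id */

/* Worker: every cpu becomes possible, only enabled ones become present. */
static int register_cpu_common(bool enabled)
{
	int cpu;

	if (next_cpu >= NR_CPUS)
		return -1;

	cpu = next_cpu++;
	cpu_possible[cpu] = true;
	if (enabled) {
		cpu_present[cpu] = true;
		num_processors++;
	}
	/* Present cpus can never outnumber the ids handed out. */
	assert(num_processors <= next_cpu);
	return cpu;
}

/* Existing entry point keeps its behaviour by passing enabled = true. */
static int register_cpu(void)
{
	return register_cpu_common(true);
}
```

Calling register_cpu() and then register_cpu_common(false) gives logical ids 0 and
1. Both are possible, only id 0 is present, and num_processors stays at 1. Existing
callers do not need to change because the wrapper keeps the old signature.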

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/kernel/apic/apic.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dcb5285..a9c9830 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1977,7 +1977,7 @@ void disconnect_bsp_APIC(int virt_wire_setup)
 	apic_write(APIC_LVT1, value);
 }
 
-int generic_processor_info(int apicid, int version)
+static int __generic_processor_info(int apicid, int version, bool enabled)
 {
 	int cpu, max = nr_cpu_ids;
 	bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
@@ -2011,7 +2011,8 @@ int generic_processor_info(int apicid, int version)
 			   " Processor %d/0x%x ignored.\n",
 			   thiscpu, apicid);
 
-		disabled_cpus++;
+		if (enabled)
+			disabled_cpus++;
 		return -ENODEV;
 	}
 
@@ -2028,7 +2029,8 @@ int generic_processor_info(int apicid, int version)
 			" reached. Keeping one slot for boot cpu."
 			"  Processor %d/0x%x ignored.\n", max, thiscpu, apicid);
 
-		disabled_cpus++;
+		if (enabled)
+			disabled_cpus++;
 		return -ENODEV;
 	}
 
@@ -2039,11 +2041,14 @@ int generic_processor_info(int apicid, int version)
 			"ACPI: NR_CPUS/possible_cpus limit of %i reached."
 			"  Processor %d/0x%x ignored.\n", max, thiscpu, apicid);
 
-		disabled_cpus++;
+		if (enabled)
+			disabled_cpus++;
 		return -EINVAL;
 	}
 
-	num_processors++;
+	if (enabled)
+		num_processors++;
+
 	if (apicid == boot_cpu_physical_apicid) {
 		/*
 		 * x86_bios_cpu_apicid is required to have processors listed
@@ -2071,7 +2076,8 @@ int generic_processor_info(int apicid, int version)
 			apic_version[boot_cpu_physical_apicid], cpu, version);
 	}
 
-	physid_set(apicid, phys_cpu_present_map);
+	if (enabled)
+		physid_set(apicid, phys_cpu_present_map);
 	if (apicid > max_physical_apicid)
 		max_physical_apicid = apicid;
 
@@ -2084,11 +2090,17 @@ int generic_processor_info(int apicid, int version)
 		apic->x86_32_early_logical_apicid(cpu);
 #endif
 	set_cpu_possible(cpu, true);
-	set_cpu_present(cpu, true);
+	if (enabled)
+		set_cpu_present(cpu, true);
 
 	return cpu;
 }
 
+int generic_processor_info(int apicid, int version)
+{
+	return __generic_processor_info(apicid, version, true);
+}
+
 int hard_smp_processor_id(void)
 {
 	return read_apic_id();
-- 
1.9.3

* [PATCH v2 5/7] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.
  2015-09-10  4:27 [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Tang Chen
                   ` (3 preceding siblings ...)
  2015-09-10  4:27 ` [PATCH v2 4/7] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time Tang Chen
@ 2015-09-10  4:27 ` Tang Chen
  2015-09-10 19:55   ` Tejun Heo
  2015-09-10  4:27 ` [PATCH v2 6/7] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid Tang Chen
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Tang Chen @ 2015-09-10  4:27 UTC (permalink / raw)
  To: tj, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan
  Cc: tangchen, x86, linux-acpi, linux-kernel, linux-mm, Gu Zheng

From: Gu Zheng <guz.fnst@cn.fujitsu.com>

This patch finishes step2 mentioned in previous patch 4.

In this patch, we introduce a new static array named apicid_to_cpuid[],
which is large enough to store info for all possible cpus.

And then, we modify the cpuid calculation. Currently, generic_processor_info()
simply finds the next unused cpuid, which is why the cpuid <-> nodeid
mapping changes with node hotplug.

After this patch, we find the next unused cpuid, map it to an apicid,
and store the mapping in apicid_to_cpuid[], so that cpuid <-> apicid
mapping will be persistent.

And finally we will use this array to make cpuid <-> nodeid persistent.

cpuid <-> apicid mapping is established at local apic registration time.
But non-present or disabled cpus are ignored.

In this patch, we establish all possible cpuid <-> apicid mappings when
registering the local apic.
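
The lookup-then-allocate behaviour described above can be sketched in
userspace C roughly as follows. This is a simplified illustration, not the
kernel code: MAX_CPUS is a hypothetical stand-in for nr_cpu_ids,
nr_logical_cpuids is an assumed name for the allocation counter, and no
locking is modelled.

```c
#define MAX_CPUS 8	/* hypothetical stand-in for nr_cpu_ids */

/* cpuid -> apicid table; -1 marks a slot that was never allocated. */
static int cpuid_to_apicid[MAX_CPUS] = { -1, -1, -1, -1, -1, -1, -1, -1 };

/* Number of cpuids handed out so far; slot 0 is reserved for the boot CPU. */
static int nr_logical_cpuids = 1;

/*
 * Return the cpuid already bound to this apicid if there is one,
 * otherwise bind the next unused cpuid. Entries are never cleared,
 * so a CPU that goes away and comes back gets its old cpuid again.
 */
static int allocate_logical_cpuid(int apicid)
{
	int i;

	for (i = 0; i < nr_logical_cpuids; i++) {
		if (cpuid_to_apicid[i] == apicid)
			return i;
	}

	if (nr_logical_cpuids >= MAX_CPUS)
		return -1;	/* table is full */

	cpuid_to_apicid[nr_logical_cpuids] = apicid;
	return nr_logical_cpuids++;
}
```

Calling allocate_logical_cpuid() twice with the same apicid returns the same
cpuid, which is the property that is meant to survive a hot-remove followed
by a hot-add.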

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/mpspec.h |  1 +
 arch/x86/kernel/acpi/boot.c   |  6 ++---
 arch/x86/kernel/apic/apic.c   | 53 ++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE	BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e49ee24..bcc85b2 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
 		return -EINVAL;
 	}
 
-	if (!enabled) {
+	if (!enabled)
 		++disabled_cpus;
-		return -EINVAL;
-	}
 
 	if (boot_cpu_physical_apicid != -1U)
 		ver = apic_version[boot_cpu_physical_apicid];
 
-	return generic_processor_info(id, ver);
+	return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index a9c9830..42b2a9c 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1977,7 +1977,45 @@ void disconnect_bsp_APIC(int virt_wire_setup)
 	apic_write(APIC_LVT1, value);
 }
 
-static int __generic_processor_info(int apicid, int version, bool enabled)
+/*
+ * Current allocated max logical CPU ID plus 1.
+ * All allocated CPU ID should be in [0, max_logical_cpuid),
+ * so the maximum of max_logical_cpuid is nr_cpu_ids.
+ *
+ * NOTE: Reserve 0 for BSP.
+ */
+static int max_logical_cpuid = 1;
+
+static int cpuid_to_apicid[] = {
+	[0 ... NR_CPUS - 1] = -1,
+};
+
+static int allocate_logical_cpuid(int apicid)
+{
+	int i;
+
+	/*
+	 * cpuid <-> apicid mapping is persistent, so when a cpu is up,
+	 * check if the kernel has allocated a cpuid for it.
+	 */
+	for (i = 0; i < max_logical_cpuid; i++) {
+		if (cpuid_to_apicid[i] == apicid)
+			return i;
+	}
+
+	/* Allocate a new cpuid. */
+	if (max_logical_cpuid >= nr_cpu_ids) {
+		WARN_ONCE(1, "Only %d processors supported."
+			     "Processor %d/0x%x and the rest are ignored.\n",
+			     nr_cpu_ids - 1, max_logical_cpuid, apicid);
+		return -1;
+	}
+
+	cpuid_to_apicid[max_logical_cpuid] = apicid;
+	return max_logical_cpuid++;
+}
+
+int __generic_processor_info(int apicid, int version, bool enabled)
 {
 	int cpu, max = nr_cpu_ids;
 	bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
@@ -2058,8 +2096,17 @@ static int __generic_processor_info(int apicid, int version, bool enabled)
 		 * for BSP.
 		 */
 		cpu = 0;
-	} else
-		cpu = cpumask_next_zero(-1, cpu_present_mask);
+
+		/* Logical cpuid 0 is reserved for BSP. */
+		cpuid_to_apicid[0] = apicid;
+	} else {
+		cpu = allocate_logical_cpuid(apicid);
+		if (cpu < 0) {
+			if (enabled)
+				disabled_cpus++;
+			return -EINVAL;
+		}
+	}
 
 	/*
 	 * Validate version
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 6/7] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.
  2015-09-10  4:27 [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Tang Chen
                   ` (4 preceding siblings ...)
  2015-09-10  4:27 ` [PATCH v2 5/7] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping Tang Chen
@ 2015-09-10  4:27 ` Tang Chen
  2015-09-10  4:27 ` [PATCH v2 7/7] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting Tang Chen
  2015-10-23 19:49 ` [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Yasuaki Ishimatsu
  7 siblings, 0 replies; 24+ messages in thread
From: Tang Chen @ 2015-09-10  4:27 UTC (permalink / raw)
  To: tj, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan
  Cc: tangchen, x86, linux-acpi, linux-kernel, linux-mm, Gu Zheng

From: Gu Zheng <guz.fnst@cn.fujitsu.com>

This patch finishes step3 mentioned in previous patch 4.

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id)     <->   apicid
4. cpuid (logical cpu id)     <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> pxm
   mapping is set up at boot time. This mapping is persistent and won't change.

2. apicid <-> nodeid mapping is set up using info in 1. The mapping is set up at boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also
   persistent.

3. cpuid <-> apicid mapping is set up at boot time and CPU hotadd time. cpuids are
   allocated lowest first, released at CPU hotremove time, and reused for other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also set up at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not persistent.

So, in order to set up persistent cpuid <-> nodeid mapping for all possible CPUs,
we should:
1. Set up cpuid <-> apicid mapping for all possible CPUs, which has been done in patch 4.
2. Set up cpuid <-> nodeid mapping for all possible CPUs. But before that, we should
   obtain all apicids from MADT.

All processors' apicids can be obtained by _MAT method or from MADT in ACPI.
The current code ignores disabled processors and returns -ENODEV.

After this patch, a new parameter will be added to MADT APIs so that the caller
can control whether disabled processors are ignored.
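
The effect of such a parameter can be sketched in standalone C roughly as
below. This is only an illustration under stated assumptions: struct
lapic_entry, ENTRY_ENABLED, sample_madt and lookup_apic_id() are hypothetical
simplifications, not the real MADT structures or the kernel's map_*_id()
helpers.

```c
#include <stdbool.h>
#include <stddef.h>

#define ENTRY_ENABLED 0x1	/* hypothetical flag bit, modelled on ACPI_MADT_ENABLED */

struct lapic_entry {		/* hypothetical, simplified local APIC entry */
	unsigned int acpi_id;
	int apic_id;
	unsigned int flags;
};

/* Sample table: processor 0 is enabled, processor 1 is present but disabled. */
static const struct lapic_entry sample_madt[] = {
	{ 0, 0x00, ENTRY_ENABLED },
	{ 1, 0x02, 0 },
};

/*
 * Return the apic id registered for acpi_id, or -1 if none is found.
 * With ignore_disabled set, disabled entries are skipped as before;
 * with it cleared, their apic ids are returned as well.
 */
static int lookup_apic_id(const struct lapic_entry *tbl, size_t n,
			  unsigned int acpi_id, bool ignore_disabled)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (ignore_disabled && !(tbl[i].flags & ENTRY_ENABLED))
			continue;
		if (tbl[i].acpi_id != acpi_id)
			continue;
		return tbl[i].apic_id;
	}
	return -1;
}
```

With the sample table, looking up processor 1 fails when ignore_disabled is
true and yields its apic id when it is false, so existing callers can keep the
old behaviour by passing true.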

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/acpi/acpi_processor.c |  5 +++-
 drivers/acpi/processor_core.c | 57 +++++++++++++++++++++++++++----------------
 2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 92a5f73..2deade1 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -282,8 +282,11 @@ static int acpi_processor_get_info(struct acpi_device *device)
 	 *  Extra Processor objects may be enumerated on MP systems with
 	 *  less than the max # of CPUs. They should be ignored _iff
 	 *  they are physically not present.
+	 *
+	 *  NOTE: Even if the processor has a cpuid, it may be not present
+	 *  because cpuid <-> apicid mapping is persistent.
 	 */
-	if (invalid_logical_cpuid(pr->id)) {
+	if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) {
 		int ret = acpi_processor_hotadd_init(pr);
 		if (ret)
 			return ret;
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 33a38d6..824b98b 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -32,12 +32,12 @@ static struct acpi_table_madt *get_madt_table(void)
 }
 
 static int map_lapic_id(struct acpi_subtable_header *entry,
-		 u32 acpi_id, phys_cpuid_t *apic_id)
+		 u32 acpi_id, phys_cpuid_t *apic_id, bool ignore_disabled)
 {
 	struct acpi_madt_local_apic *lapic =
 		container_of(entry, struct acpi_madt_local_apic, header);
 
-	if (!(lapic->lapic_flags & ACPI_MADT_ENABLED))
+	if (ignore_disabled && !(lapic->lapic_flags & ACPI_MADT_ENABLED))
 		return -ENODEV;
 
 	if (lapic->processor_id != acpi_id)
@@ -48,12 +48,13 @@ static int map_lapic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_x2apic_id(struct acpi_subtable_header *entry,
-		int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+		int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+		bool ignore_disabled)
 {
 	struct acpi_madt_local_x2apic *apic =
 		container_of(entry, struct acpi_madt_local_x2apic, header);
 
-	if (!(apic->lapic_flags & ACPI_MADT_ENABLED))
+	if (ignore_disabled && !(apic->lapic_flags & ACPI_MADT_ENABLED))
 		return -ENODEV;
 
 	if (device_declaration && (apic->uid == acpi_id)) {
@@ -65,12 +66,13 @@ static int map_x2apic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_lsapic_id(struct acpi_subtable_header *entry,
-		int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+		int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+		bool ignore_disabled)
 {
 	struct acpi_madt_local_sapic *lsapic =
 		container_of(entry, struct acpi_madt_local_sapic, header);
 
-	if (!(lsapic->lapic_flags & ACPI_MADT_ENABLED))
+	if (ignore_disabled && !(lsapic->lapic_flags & ACPI_MADT_ENABLED))
 		return -ENODEV;
 
 	if (device_declaration) {
@@ -87,12 +89,13 @@ static int map_lsapic_id(struct acpi_subtable_header *entry,
  * Retrieve the ARM CPU physical identifier (MPIDR)
  */
 static int map_gicc_mpidr(struct acpi_subtable_header *entry,
-		int device_declaration, u32 acpi_id, phys_cpuid_t *mpidr)
+		int device_declaration, u32 acpi_id, phys_cpuid_t *mpidr,
+		bool ignore_disabled)
 {
 	struct acpi_madt_generic_interrupt *gicc =
 	    container_of(entry, struct acpi_madt_generic_interrupt, header);
 
-	if (!(gicc->flags & ACPI_MADT_ENABLED))
+	if (ignore_disabled && !(gicc->flags & ACPI_MADT_ENABLED))
 		return -ENODEV;
 
 	/* device_declaration means Device object in DSDT, in the
@@ -108,7 +111,7 @@ static int map_gicc_mpidr(struct acpi_subtable_header *entry,
 	return -EINVAL;
 }
 
-static phys_cpuid_t map_madt_entry(int type, u32 acpi_id)
+static phys_cpuid_t map_madt_entry(int type, u32 acpi_id, bool ignore_disabled)
 {
 	unsigned long madt_end, entry;
 	phys_cpuid_t phys_id = PHYS_CPUID_INVALID;	/* CPU hardware ID */
@@ -128,16 +131,20 @@ static phys_cpuid_t map_madt_entry(int type, u32 acpi_id)
 		struct acpi_subtable_header *header =
 			(struct acpi_subtable_header *)entry;
 		if (header->type == ACPI_MADT_TYPE_LOCAL_APIC) {
-			if (!map_lapic_id(header, acpi_id, &phys_id))
+			if (!map_lapic_id(header, acpi_id, &phys_id,
+					  ignore_disabled))
 				break;
 		} else if (header->type == ACPI_MADT_TYPE_LOCAL_X2APIC) {
-			if (!map_x2apic_id(header, type, acpi_id, &phys_id))
+			if (!map_x2apic_id(header, type, acpi_id, &phys_id,
+					   ignore_disabled))
 				break;
 		} else if (header->type == ACPI_MADT_TYPE_LOCAL_SAPIC) {
-			if (!map_lsapic_id(header, type, acpi_id, &phys_id))
+			if (!map_lsapic_id(header, type, acpi_id, &phys_id,
+					   ignore_disabled))
 				break;
 		} else if (header->type == ACPI_MADT_TYPE_GENERIC_INTERRUPT) {
-			if (!map_gicc_mpidr(header, type, acpi_id, &phys_id))
+			if (!map_gicc_mpidr(header, type, acpi_id, &phys_id,
+					    ignore_disabled))
 				break;
 		}
 		entry += header->length;
@@ -145,7 +152,8 @@ static phys_cpuid_t map_madt_entry(int type, u32 acpi_id)
 	return phys_id;
 }
 
-static phys_cpuid_t map_mat_entry(acpi_handle handle, int type, u32 acpi_id)
+static phys_cpuid_t map_mat_entry(acpi_handle handle, int type, u32 acpi_id,
+				  bool ignore_disabled)
 {
 	struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
 	union acpi_object *obj;
@@ -166,30 +174,37 @@ static phys_cpuid_t map_mat_entry(acpi_handle handle, int type, u32 acpi_id)
 
 	header = (struct acpi_subtable_header *)obj->buffer.pointer;
 	if (header->type == ACPI_MADT_TYPE_LOCAL_APIC)
-		map_lapic_id(header, acpi_id, &phys_id);
+		map_lapic_id(header, acpi_id, &phys_id, ignore_disabled);
 	else if (header->type == ACPI_MADT_TYPE_LOCAL_SAPIC)
-		map_lsapic_id(header, type, acpi_id, &phys_id);
+		map_lsapic_id(header, type, acpi_id, &phys_id, ignore_disabled);
 	else if (header->type == ACPI_MADT_TYPE_LOCAL_X2APIC)
-		map_x2apic_id(header, type, acpi_id, &phys_id);
+		map_x2apic_id(header, type, acpi_id, &phys_id, ignore_disabled);
 	else if (header->type == ACPI_MADT_TYPE_GENERIC_INTERRUPT)
-		map_gicc_mpidr(header, type, acpi_id, &phys_id);
+		map_gicc_mpidr(header, type, acpi_id, &phys_id,
+			       ignore_disabled);
 
 exit:
 	kfree(buffer.pointer);
 	return phys_id;
 }
 
-phys_cpuid_t acpi_get_phys_id(acpi_handle handle, int type, u32 acpi_id)
+static phys_cpuid_t __acpi_get_phys_id(acpi_handle handle, int type,
+				       u32 acpi_id, bool ignore_disabled)
 {
 	phys_cpuid_t phys_id;
 
-	phys_id = map_mat_entry(handle, type, acpi_id);
+	phys_id = map_mat_entry(handle, type, acpi_id, ignore_disabled);
 	if (invalid_phys_cpuid(phys_id))
-		phys_id = map_madt_entry(type, acpi_id);
+		phys_id = map_madt_entry(type, acpi_id, ignore_disabled);
 
 	return phys_id;
 }
 
+phys_cpuid_t acpi_get_phys_id(acpi_handle handle, int type, u32 acpi_id)
+{
+	return __acpi_get_phys_id(handle, type, acpi_id, true);
+}
+
 int acpi_map_cpuid(phys_cpuid_t phys_id, u32 acpi_id)
 {
 #ifdef CONFIG_SMP
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 7/7] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.
  2015-09-10  4:27 [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Tang Chen
                   ` (5 preceding siblings ...)
  2015-09-10  4:27 ` [PATCH v2 6/7] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid Tang Chen
@ 2015-09-10  4:27 ` Tang Chen
  2015-10-23 19:49 ` [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Yasuaki Ishimatsu
  7 siblings, 0 replies; 24+ messages in thread
From: Tang Chen @ 2015-09-10  4:27 UTC (permalink / raw)
  To: tj, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan
  Cc: tangchen, x86, linux-acpi, linux-kernel, linux-mm, Gu Zheng

From: Gu Zheng <guz.fnst@cn.fujitsu.com>

This patch finishes step4 mentioned in previous patch 4.

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id)     <->   apicid
4. cpuid (logical cpu id)     <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> pxm
   mapping is set up at boot time. This mapping is persistent and won't change.

2. apicid <-> nodeid mapping is set up using info in 1. The mapping is set up at boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also
   persistent.

3. cpuid <-> apicid mapping is set up at boot time and CPU hotadd time. cpuids are
   allocated lowest first, released at CPU hotremove time, and reused for other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also set up at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not persistent.

So, in order to set up persistent cpuid <-> nodeid mapping for all possible CPUs,
we should:
1. Set up cpuid <-> apicid mapping for all possible CPUs, which has been done in patch 4.
2. Set up cpuid <-> nodeid mapping for all possible CPUs.

This patch sets the persistent cpuid <-> nodeid mapping for all enabled/disabled
processors at boot time via an additional acpi namespace walk for processors.
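
The idea of recording the mapping for every described processor, enabled or
not, can be sketched in standalone C roughly as below. This is an illustration
only: struct fw_cpu, sample_cpus, the two tables and set_processor_mapping()
are hypothetical names, and the real patch obtains this information by walking
the ACPI namespace rather than from a flat array.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPUS 4	/* hypothetical stand-in for the number of possible CPUs */

struct fw_cpu {		/* hypothetical firmware description of one processor */
	int apic_id;
	int node;	/* node the firmware places this processor on */
	bool enabled;	/* false: the slot exists but the CPU is not plugged in */
};

/* Sample description: two enabled CPUs on node 0, two disabled ones on node 1. */
static const struct fw_cpu sample_cpus[] = {
	{ 0x00, 0, true },
	{ 0x02, 0, true },
	{ 0x10, 1, false },
	{ 0x12, 1, false },
};

static int cpu_to_apicid[MAX_CPUS];
static int cpu_to_node[MAX_CPUS];

/*
 * Record cpuid -> apicid and cpuid -> node for every described processor.
 * The enabled flag is deliberately not checked, so CPUs that are hot-added
 * later already have a fixed cpuid and node. Returns the number recorded.
 */
static size_t set_processor_mapping(const struct fw_cpu *cpus, size_t n)
{
	size_t cpu;

	for (cpu = 0; cpu < n && cpu < MAX_CPUS; cpu++) {
		cpu_to_apicid[cpu] = cpus[cpu].apic_id;
		cpu_to_node[cpu] = cpus[cpu].node;
	}
	return cpu;
}
```

After one pass over the sample description, the disabled processors on node 1
already own cpuids 2 and 3, so a later hot-add does not reshuffle the
cpuid <-> nodeid relation.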

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/ia64/kernel/acpi.c       |  2 +-
 arch/x86/kernel/acpi/boot.c   |  2 +-
 drivers/acpi/bus.c            |  3 ++
 drivers/acpi/processor_core.c | 65 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/acpi.h          |  2 ++
 5 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..7db5563 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
 	/*
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index bcc85b2..b9a1aa1 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -695,7 +695,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include <acpi/processor.h>
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
 	int nid;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 513e7230e..fd03885 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -700,6 +700,9 @@ static int __init acpi_init(void)
 	acpi_debugfs_init();
 	acpi_sleep_proc_init();
 	acpi_wakeup_device_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+	acpi_set_processor_mapping();
+#endif
 	return 0;
 }
 
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 824b98b..45580ff 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -261,6 +261,71 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 acpi_id)
 }
 EXPORT_SYMBOL_GPL(acpi_get_cpuid);
 
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+static bool map_processor(acpi_handle handle, int *phys_id, int *cpuid)
+{
+	int type;
+	u32 acpi_id;
+	acpi_status status;
+	acpi_object_type acpi_type;
+	unsigned long long tmp;
+	union acpi_object object = { 0 };
+	struct acpi_buffer buffer = { sizeof(union acpi_object), &object };
+
+	status = acpi_get_type(handle, &acpi_type);
+	if (ACPI_FAILURE(status))
+		return false;
+
+	switch (acpi_type) {
+	case ACPI_TYPE_PROCESSOR:
+		status = acpi_evaluate_object(handle, NULL, NULL, &buffer);
+		if (ACPI_FAILURE(status))
+			return false;
+		acpi_id = object.processor.proc_id;
+		break;
+	case ACPI_TYPE_DEVICE:
+		status = acpi_evaluate_integer(handle, "_UID", NULL, &tmp);
+		if (ACPI_FAILURE(status))
+			return false;
+		acpi_id = tmp;
+		break;
+	default:
+		return false;
+	}
+
+	type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
+
+	*phys_id = __acpi_get_phys_id(handle, type, acpi_id, false);
+	*cpuid = acpi_map_cpuid(*phys_id, acpi_id);
+	if (*cpuid == -1)
+		return false;
+
+	return true;
+}
+
+static acpi_status __init
+set_processor_node_mapping(acpi_handle handle, u32 lvl, void *context,
+			   void **rv)
+{
+	u32 apic_id;
+	int cpu_id;
+
+	if (!map_processor(handle, &apic_id, &cpu_id))
+		return AE_ERROR;
+
+	acpi_map_cpu2node(handle, cpu_id, apic_id);
+	return AE_OK;
+}
+
+void __init acpi_set_processor_mapping(void)
+{
+	/* Set persistent cpu <-> node mapping for all processors. */
+	acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
+			    ACPI_UINT32_MAX, set_processor_node_mapping,
+			    NULL, NULL, NULL);
+}
+#endif
+
 #ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
 static int get_ioapic_id(struct acpi_subtable_header *entry, u32 gsi_base,
 			 u64 *phys_addr, int *ioapic_id)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index d2445fa..7a78830 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -185,6 +185,8 @@ static inline bool invalid_phys_cpuid(phys_cpuid_t phys_id)
 /* Arch dependent functions for cpu hotplug support */
 int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, int *pcpu);
 int acpi_unmap_cpu(int cpu);
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid);
+void __init acpi_set_processor_mapping(void);
 #endif /* CONFIG_ACPI_HOTPLUG_CPU */
 
 #ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation.
  2015-09-10  4:27 ` [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation Tang Chen
@ 2015-09-10 19:29   ` Tejun Heo
  2015-09-10 19:38     ` Tejun Heo
  2015-09-26  9:31     ` Tang Chen
  0 siblings, 2 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-10 19:29 UTC (permalink / raw)
  To: Tang Chen
  Cc: jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm, Gu Zheng

Hello,

On Thu, Sep 10, 2015 at 12:27:45PM +0800, Tang Chen wrote:
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index ad35f30..1a1324f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>  	if (nid < 0)
>  		nid = numa_node_id();
>  
> +	if (!node_online(nid))
> +		nid = get_near_online_node(nid);
> +
>  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>  }

Why not just update node_data[]->node_zonelist in the first place?
Also, what's the synchronization rule here?  How are allocators
synchronized against node hot [un]plugs?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation.
  2015-09-10 19:29   ` Tejun Heo
@ 2015-09-10 19:38     ` Tejun Heo
  2015-09-10 22:02       ` Christoph Lameter
  2015-09-11  0:14       ` Christoph Lameter
  2015-09-26  9:31     ` Tang Chen
  1 sibling, 2 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-10 19:38 UTC (permalink / raw)
  To: Tang Chen
  Cc: jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm, Gu Zheng,
	Christoph Lameter

(cc'ing Christoph Lameter)

On Thu, Sep 10, 2015 at 03:29:35PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Thu, Sep 10, 2015 at 12:27:45PM +0800, Tang Chen wrote:
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index ad35f30..1a1324f 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> >  	if (nid < 0)
> >  		nid = numa_node_id();
> >  
> > +	if (!node_online(nid))
> > +		nid = get_near_online_node(nid);
> > +
> >  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> >  }
> 
> Why not just update node_data[]->node_zonelist in the first place?
> Also, what's the synchronization rule here?  How are allocators
> synchronized against node hot [un]plugs?

Also, shouldn't kmalloc_node() or any public allocator fall back
automatically to a near node w/o GFP_THISNODE?  Why is this failing at
all?  I get that cpu id -> node id mapping changing messes up the
locality but allocations shouldn't fail, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.
  2015-09-10  4:27 ` [PATCH v2 5/7] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping Tang Chen
@ 2015-09-10 19:55   ` Tejun Heo
  2015-09-26  9:52     ` Tang Chen
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2015-09-10 19:55 UTC (permalink / raw)
  To: Tang Chen
  Cc: jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm, Gu Zheng

Hello,

So, overall, I think this is the right way to go although I have no
idea whether the acpi part is okay.

> +/*
> + * Current allocated max logical CPU ID plus 1.
> + * All allocated CPU ID should be in [0, max_logical_cpuid),
> + * so the maximum of max_logical_cpuid is nr_cpu_ids.
> + *
> + * NOTE: Reserve 0 for BSP.
> + */
> +static int max_logical_cpuid = 1;

Rename it to nr_logical_cpuids and just mention that it's allocated
contiguously?

> +static int cpuid_to_apicid[] = {
> +	[0 ... NR_CPUS - 1] = -1,
> +};

And maybe mention how the two variables are synchronized?

> +static int allocate_logical_cpuid(int apicid)
> +{
> +	int i;
> +
> +	/*
> +	 * cpuid <-> apicid mapping is persistent, so when a cpu is up,
> +	 * check if the kernel has allocated a cpuid for it.
> +	 */
> +	for (i = 0; i < max_logical_cpuid; i++) {
> +		if (cpuid_to_apicid[i] == apicid)
> +			return i;
> +	}
> +
> +	/* Allocate a new cpuid. */
> +	if (max_logical_cpuid >= nr_cpu_ids) {
> +		WARN_ONCE(1, "Only %d processors supported."
> +			     "Processor %d/0x%x and the rest are ignored.\n",
> +			     nr_cpu_ids - 1, max_logical_cpuid, apicid);
> +		return -1;
> +	}

So, the original code didn't have this failure mode, why is this
different for the new code?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation.
  2015-09-10 19:38     ` Tejun Heo
@ 2015-09-10 22:02       ` Christoph Lameter
  2015-09-10 22:08         ` Tejun Heo
  2015-09-11  0:14       ` Christoph Lameter
  1 sibling, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2015-09-10 22:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm,
	Gu Zheng

On Thu, 10 Sep 2015, Tejun Heo wrote:

> > Why not just update node_data[]->node_zonelist in the first place?
> > Also, what's the synchronization rule here?  How are allocators
> > synchronized against node hot [un]plugs?
>
> Also, shouldn't kmalloc_node() or any public allocator fall back
> automatically to a near node w/o GFP_THISNODE?  Why is this failing at
> all?  I get that cpu id -> node id mapping changing messes up the
> locality but allocations shouldn't fail, right?

Without a node specification allocations are subject to various
constraints and memory policies. It is not simply going to the next node.
The memory load may require spreading out the allocations over multiple
nodes, the app may have specified which nodes are to be used etc etc.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation.
  2015-09-10 22:02       ` Christoph Lameter
@ 2015-09-10 22:08         ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-10 22:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tang Chen, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm,
	Gu Zheng

Hello,

On Thu, Sep 10, 2015 at 05:02:31PM -0500, Christoph Lameter wrote:
> > Also, shouldn't kmalloc_node() or any public allocator fall back
> > automatically to a near node w/o GFP_THISNODE?  Why is this failing at
> > all?  I get that cpu id -> node id mapping changing messes up the
> > locality but allocations shouldn't fail, right?
> 
> Without a node specification allocations are subject to various
> constraints and memory policies. It is not simply going to the next node.
> The memory load may require spreading out the allocations over multiple
> nodes, the app may have specified which nodes are to be used etc etc.

Yeah, sure, but even w/ node specified, it shouldn't fail unless
THISNODE, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 4/7] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time.
  2015-09-10  4:27 ` [PATCH v2 4/7] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time Tang Chen
@ 2015-09-10 23:10   ` Rafael J. Wysocki
  2015-09-26  9:44     ` Tang Chen
  0 siblings, 1 reply; 24+ messages in thread
From: Rafael J. Wysocki @ 2015-09-10 23:10 UTC (permalink / raw)
  To: Tang Chen
  Cc: tj, jiang.liu, mika.j.penttila, mingo, akpm, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm, Gu Zheng

On Thursday, September 10, 2015 12:27:46 PM Tang Chen wrote:
> From: Gu Zheng <guz.fnst@cn.fujitsu.com>
> 
> [Problem]
> 
> cpuid <-> nodeid mapping is firstly established at boot time. And workqueue caches
> the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.
> 
> When doing node online/offline, cpuid <-> nodeid mapping is established/destroyed,
> which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
> workqueue does not update wq_numa_possible_cpumask.
> 
> So here is the problem:
> 
> Assume we have the following cpuid <-> nodeid in the beginning:
> 
>   Node | CPU
> ------------------------
> node 0 |  0-14, 60-74
> node 1 | 15-29, 75-89
> node 2 | 30-44, 90-104
> node 3 | 45-59, 105-119
> 
> and we hot-remove node2 and node3, it becomes:
> 
>   Node | CPU
> ------------------------
> node 0 |  0-14, 60-74
> node 1 | 15-29, 75-89
> 
> and we hot-add node4 and node5, it becomes:
> 
>   Node | CPU
> ------------------------
> node 0 |  0-14, 60-74
> node 1 | 15-29, 75-89
> node 4 | 30-59
> node 5 | 90-119
> 
> But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.
> 
> When a pool workqueue is initialized, if its cpumask belongs to a node, its
> pool->node will be mapped to that node. And memory used by this workqueue will
> also be allocated on that node.
> 
> static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs){
> ...
>         /* if cpumask is contained inside a NUMA node, we belong to that node */
>         if (wq_numa_enabled) {
>                 for_each_node(node) {
>                         if (cpumask_subset(pool->attrs->cpumask,
>                                            wq_numa_possible_cpumask[node])) {
>                                 pool->node = node;
>                                 break;
>                         }
>                 }
>         }
> 
> Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline node,
> which will lead to memory allocation failure:
> 
>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>   cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
>   node 0: slabs: 6172, objs: 259224, free: 245741
>   node 1: slabs: 3261, objs: 136962, free: 127656
> 
> It happens here:
> 
> create_worker(struct worker_pool *pool)
>  |--> worker = alloc_worker(pool->node);
> 
> static struct worker *alloc_worker(int node)
> {
>         struct worker *worker;
> 
>         worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, using the wrong node.
> 
>         ......
> 
>         return worker;
> }
> 
> [Solution]
> 
> There are four mappings in the kernel:
> 1. nodeid (logical node id)   <->   pxm
> 2. apicid (physical cpu id)   <->   nodeid
> 3. cpuid (logical cpu id)     <->   apicid
> 4. cpuid (logical cpu id)     <->   nodeid
> 
> 1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> pxm
>    mapping is setup at boot time. This mapping is persistent, won't change.
> 
> 2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at boot
>    time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also
>    persistent.
> 
> 3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
>    allocated, lower ids first, and released at CPU hotremove time, reused for other
>    hotadded CPUs. So this mapping is not persistent.
> 
> 4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
>    cleared at CPU hotremove time. As a result of 3, this mapping is not persistent.
> 
> To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
> cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
> cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid
> mapping. So the key point is obtaining all cpus' apicid.
> 
> apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
> MADT (Multiple APIC Description Table). So we finish the job in the following steps:
> 
> 1. Enable apic registration flow to handle both enabled and disabled cpus.
>    This is done by introducing an extra parameter to generic_processor_info to let the
>    caller control if disabled cpus are ignored.
> 
> 2. Introduce a new array storing all possible cpuid <-> apicid mapping. And also modify
>    the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping when
>    registering local apic. Store the mapping in this array.
> 
> 3. Enable _MAT and MADT related apis to return non-present or disabled cpus' apicid.
>    This is also done by introducing an extra parameter to these apis to let the caller
>    control if disabled cpus are ignored.
> 
> 4. Establish all possible cpuid <-> nodeid mapping.
>    This is done via an additional acpi namespace walk for processors.
> 
> This patch finished step 1.

Can you please avoid using the same (or at least a very similar) changelog
for multiple patches in the series?  That doesn't help a lot.

> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  arch/x86/kernel/apic/apic.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index dcb5285..a9c9830 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1977,7 +1977,7 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>  	apic_write(APIC_LVT1, value);
>  }
>  
> -int generic_processor_info(int apicid, int version)
> +static int __generic_processor_info(int apicid, int version, bool enabled)
>  {
>  	int cpu, max = nr_cpu_ids;
>  	bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
> @@ -2011,7 +2011,8 @@ int generic_processor_info(int apicid, int version)
>  			   " Processor %d/0x%x ignored.\n",
>  			   thiscpu, apicid);
>  
> -		disabled_cpus++;
> +		if (enabled)
> +			disabled_cpus++;

This doesn't look particularly clean to me to be honest.

>  		return -ENODEV;
>  	}
>  
> @@ -2028,7 +2029,8 @@ int generic_processor_info(int apicid, int version)
>  			" reached. Keeping one slot for boot cpu."
>  			"  Processor %d/0x%x ignored.\n", max, thiscpu, apicid);
>  
> -		disabled_cpus++;
> +		if (enabled)
> +			disabled_cpus++;

Likewise and so on.

Maybe call it "enabled_only"?

>  		return -ENODEV;
>  	}
>  
> @@ -2039,11 +2041,14 @@ int generic_processor_info(int apicid, int version)
>  			"ACPI: NR_CPUS/possible_cpus limit of %i reached."
>  			"  Processor %d/0x%x ignored.\n", max, thiscpu, apicid);
>  
> -		disabled_cpus++;
> +		if (enabled)
> +			disabled_cpus++;
>  		return -EINVAL;
>  	}
>  
> -	num_processors++;
> +	if (enabled)
> +		num_processors++;
> +
>  	if (apicid == boot_cpu_physical_apicid) {
>  		/*
>  		 * x86_bios_cpu_apicid is required to have processors listed
> @@ -2071,7 +2076,8 @@ int generic_processor_info(int apicid, int version)
>  			apic_version[boot_cpu_physical_apicid], cpu, version);
>  	}
>  
> -	physid_set(apicid, phys_cpu_present_map);
> +	if (enabled)
> +		physid_set(apicid, phys_cpu_present_map);
>  	if (apicid > max_physical_apicid)
>  		max_physical_apicid = apicid;
>  
> @@ -2084,11 +2090,17 @@ int generic_processor_info(int apicid, int version)
>  		apic->x86_32_early_logical_apicid(cpu);
>  #endif
>  	set_cpu_possible(cpu, true);
> -	set_cpu_present(cpu, true);
> +	if (enabled)
> +		set_cpu_present(cpu, true);
>  
>  	return cpu;
>  }
>  
> +int generic_processor_info(int apicid, int version)
> +{
> +	return __generic_processor_info(apicid, version, true);
> +}
> +
>  int hard_smp_processor_id(void)
>  {
>  	return read_apic_id();
> 

Thanks,
Rafael


* Re: [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation.
  2015-09-10 19:38     ` Tejun Heo
  2015-09-10 22:02       ` Christoph Lameter
@ 2015-09-11  0:14       ` Christoph Lameter
  2015-09-26  9:35         ` Tang Chen
  1 sibling, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2015-09-11  0:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	yasu.isimatu, isimatu.yasuaki, kamezawa.hiroyu, izumi.taku,
	gongzhaogang, qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm,
	Gu Zheng

On Thu, 10 Sep 2015, Tejun Heo wrote:

> > Why not just update node_data[]->node_zonelist in the first place?
> > Also, what's the synchronization rule here?  How are allocators
> > synchronized against node hot [un]plugs?
>
> Also, shouldn't kmalloc_node() or any public allocator fall back
> automatically to a near node w/o GFP_THISNODE?  Why is this failing at
> all?  I get that cpu id -> node id mapping changing messes up the
> locality but allocations shouldn't fail, right?

Yes that should occur in the absence of other constraints (mempolicies,
cpusets, cgroups, allocation type). If the constraints do not allow an
allocation then the allocation will fail.

Also: Are the zonelists setup the right way?



* Re: [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation.
  2015-09-10 19:29   ` Tejun Heo
  2015-09-10 19:38     ` Tejun Heo
@ 2015-09-26  9:31     ` Tang Chen
  2015-09-26 17:53       ` Tejun Heo
  1 sibling, 1 reply; 24+ messages in thread
From: Tang Chen @ 2015-09-26  9:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm, tangchen

Hi, tj

On 09/11/2015 03:29 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 10, 2015 at 12:27:45PM +0800, Tang Chen wrote:
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index ad35f30..1a1324f 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>>   	if (nid < 0)
>>   		nid = numa_node_id();
>>   
>> +	if (!node_online(nid))
>> +		nid = get_near_online_node(nid);
>> +
>>   	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>>   }
> Why not just update node_data[]->node_zonelist in the first place?

The zonelist will be rebuilt in __offline_pages() when the zone is no
longer populated.

Here, getting the best near online node is for cpus on memory-less
nodes.

In the original code, if nid is NUMA_NO_NODE, the node the current cpu
resides in is chosen. And if that node is a memory-less node, the cpu
is mapped to its best near online node.

But this patch-set maps the cpu to its original node, so numa_node_id()
may return a memory-less node to the allocator, and then memory
allocation may fail.

> Also, what's the synchronization rule here?  How are allocators
> synchronized against node hot [un]plugs?

The rule is: the node_to_near_node_map[] array is updated each time
node hot[un]plug happens.

For now it is not protected by a lock, because I think acquiring a lock
may cause a performance regression in the memory allocator.

stop_machine is used when rebuilding the zonelist, so updating the
node_to_near_node_map[] array at the same time as the zonelist is
rebuilt could be a good idea.

Thanks.
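
The update rule described above can be modelled in user space roughly
as follows. This is only a sketch under stated assumptions: the
sketch_* names, the 4-node distance table and the "recompute the whole
map on every hotplug event" policy are invented for illustration and
are not taken from the patch. Only node_to_near_node_map[] is a name
from this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_NODES 4	/* hypothetical node count */

/* Hypothetical SLIT-style distances; 10 means "local". */
static const int sketch_node_distance[SKETCH_NR_NODES][SKETCH_NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 40, 30, 20, 10 },
};

static bool sketch_node_online[SKETCH_NR_NODES];
static int node_to_near_node_map[SKETCH_NR_NODES];

/* Nearest online node to nid (nid itself if it is online), or -1. */
static int sketch_find_near_online_node(int nid)
{
	int n, best = -1;

	for (n = 0; n < SKETCH_NR_NODES; n++) {
		if (!sketch_node_online[n])
			continue;
		if (best < 0 ||
		    sketch_node_distance[nid][n] < sketch_node_distance[nid][best])
			best = n;
	}
	return best;
}

/* Run on every node online/offline event: refresh the whole cache. */
void sketch_set_node_online(int nid, bool online)
{
	int n;

	assert(nid >= 0 && nid < SKETCH_NR_NODES);
	sketch_node_online[nid] = online;
	for (n = 0; n < SKETCH_NR_NODES; n++)
		node_to_near_node_map[n] = sketch_find_near_online_node(n);
}

/* Allocator fast path: a plain array read, no lock taken. */
int sketch_get_near_online_node(int nid)
{
	assert(nid >= 0 && nid < SKETCH_NR_NODES);
	return node_to_near_node_map[nid];
}
```

The model has no concurrency at all, so it says nothing about the
locking question raised above; it only shows that the lookup side stays
a single array read as long as every hotplug event refreshes the map.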


* Re: [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation.
  2015-09-11  0:14       ` Christoph Lameter
@ 2015-09-26  9:35         ` Tang Chen
  0 siblings, 0 replies; 24+ messages in thread
From: Tang Chen @ 2015-09-26  9:35 UTC (permalink / raw)
  To: Christoph Lameter, Tejun Heo
  Cc: jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm, Gu Zheng

Hi, Christoph, tj,

On 09/11/2015 08:14 AM, Christoph Lameter wrote:
> On Thu, 10 Sep 2015, Tejun Heo wrote:
>
>>> Why not just update node_data[]->node_zonelist in the first place?
>>> Also, what's the synchronization rule here?  How are allocators
>>> synchronized against node hot [un]plugs?
>> Also, shouldn't kmalloc_node() or any public allocator fall back
>> automatically to a near node w/o GFP_THISNODE?  Why is this failing at
>> all?  I get that cpu id -> node id mapping changing messes up the
>> locality but allocations shouldn't fail, right?

Yes. That is the reason we are getting near online node here.

> Yes that should occur in the absence of other constraints (mempolicies,
> cpusets, cgroups, allocation type). If the constraints do not allow an
> allocation then the allocation will fail.
>
> Also: Are the zonelists setup the right way?

zonelist will be rebuilt in __offline_pages() when the zone is not 
populated any more.

Thanks.


* Re: [PATCH v2 4/7] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time.
  2015-09-10 23:10   ` Rafael J. Wysocki
@ 2015-09-26  9:44     ` Tang Chen
  0 siblings, 0 replies; 24+ messages in thread
From: Tang Chen @ 2015-09-26  9:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: tj, jiang.liu, mika.j.penttila, mingo, akpm, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm, tangchen

Hi Rafael,

On 09/11/2015 07:10 AM, Rafael J. Wysocki wrote:
> On Thursday, September 10, 2015 12:27:46 PM Tang Chen wrote:
>> ......
> Can you please avoid using the same (or at least a very similar) changelog
> for multiple patches in the series?  That doesn't help a lot.

OK, will update the comment and include more useful info.

>
>> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> ---
>>   arch/x86/kernel/apic/apic.c | 26 +++++++++++++++++++-------
>>   1 file changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>> index dcb5285..a9c9830 100644
>> --- a/arch/x86/kernel/apic/apic.c
>> +++ b/arch/x86/kernel/apic/apic.c
>> @@ -1977,7 +1977,7 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>>   	apic_write(APIC_LVT1, value);
>>   }
>>   
>> -int generic_processor_info(int apicid, int version)
>> +static int __generic_processor_info(int apicid, int version, bool enabled)
>>   {
>>   	int cpu, max = nr_cpu_ids;
>>   	bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
>> @@ -2011,7 +2011,8 @@ int generic_processor_info(int apicid, int version)
>>   			   " Processor %d/0x%x ignored.\n",
>>   			   thiscpu, apicid);
>>   
>> -		disabled_cpus++;
>> +		if (enabled)
>> +			disabled_cpus++;
> This doesn't look particularly clean to me to be honest.
>
>>   		return -ENODEV;
>>   	}
>>   
>> @@ -2028,7 +2029,8 @@ int generic_processor_info(int apicid, int version)
>>   			" reached. Keeping one slot for boot cpu."
>>   			"  Processor %d/0x%x ignored.\n", max, thiscpu, apicid);
>>   
>> -		disabled_cpus++;
>> +		if (enabled)
>> +			disabled_cpus++;
> Likewise and so on.
>
> Maybe call it "enabled_only"?

OK, the name makes no sense here. Will rename it.

Thanks.

>
>>   		return -ENODEV;
>>   	}
>>   
>> @@ -2039,11 +2041,14 @@ int generic_processor_info(int apicid, int version)
>>   			"ACPI: NR_CPUS/possible_cpus limit of %i reached."
>>   			"  Processor %d/0x%x ignored.\n", max, thiscpu, apicid);
>>   
>> -		disabled_cpus++;
>> +		if (enabled)
>> +			disabled_cpus++;
>>   		return -EINVAL;
>>   	}
>>   
>> -	num_processors++;
>> +	if (enabled)
>> +		num_processors++;
>> +
>>   	if (apicid == boot_cpu_physical_apicid) {
>>   		/*
>>   		 * x86_bios_cpu_apicid is required to have processors listed
>> @@ -2071,7 +2076,8 @@ int generic_processor_info(int apicid, int version)
>>   			apic_version[boot_cpu_physical_apicid], cpu, version);
>>   	}
>>   
>> -	physid_set(apicid, phys_cpu_present_map);
>> +	if (enabled)
>> +		physid_set(apicid, phys_cpu_present_map);
>>   	if (apicid > max_physical_apicid)
>>   		max_physical_apicid = apicid;
>>   
>> @@ -2084,11 +2090,17 @@ int generic_processor_info(int apicid, int version)
>>   		apic->x86_32_early_logical_apicid(cpu);
>>   #endif
>>   	set_cpu_possible(cpu, true);
>> -	set_cpu_present(cpu, true);
>> +	if (enabled)
>> +		set_cpu_present(cpu, true);
>>   
>>   	return cpu;
>>   }
>>   
>> +int generic_processor_info(int apicid, int version)
>> +{
>> +	return __generic_processor_info(apicid, version, true);
>> +}
>> +
>>   int hard_smp_processor_id(void)
>>   {
>>   	return read_apic_id();
>>
> Thanks,
> Rafael
>
> .
>


* Re: [PATCH v2 5/7] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.
  2015-09-10 19:55   ` Tejun Heo
@ 2015-09-26  9:52     ` Tang Chen
  2015-09-26 17:56       ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Tang Chen @ 2015-09-26  9:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm, tangchen

Hi tj,

On 09/11/2015 03:55 AM, Tejun Heo wrote:
> Hello,
>
> So, overall, I think this is the right way to go although I have no
> idea whether the acpi part is okay.

Thank you very much for reviewing. :)

>
>> +/*
>> + * Current allocated max logical CPU ID plus 1.
>> + * All allocated CPU ID should be in [0, max_logical_cpuid),
>> + * so the maximum of max_logical_cpuid is nr_cpu_ids.
>> + *
>> + * NOTE: Reserve 0 for BSP.
>> + */
>> +static int max_logical_cpuid = 1;
> Rename it to nr_logical_cpuids and just mention that it's allocated
> contiguously?

OK.

>
>> +static int cpuid_to_apicid[] = {
>> +	[0 ... NR_CPUS - 1] = -1,
>> +};
> And maybe mention how the two variables are synchronized?

Users should call allocate_logical_cpuid() to get a new logical cpuid.
This allocator will ensure the synchronization.

Will mention it in the comment.

>
>> +static int allocate_logical_cpuid(int apicid)
>> +{
>> +	int i;
>> +
>> +	/*
>> +	 * cpuid <-> apicid mapping is persistent, so when a cpu is up,
>> +	 * check if the kernel has allocated a cpuid for it.
>> +	 */
>> +	for (i = 0; i < max_logical_cpuid; i++) {
>> +		if (cpuid_to_apicid[i] == apicid)
>> +			return i;
>> +	}
>> +
>> +	/* Allocate a new cpuid. */
>> +	if (max_logical_cpuid >= nr_cpu_ids) {
>> +		WARN_ONCE(1, "Only %d processors supported."
>> +			     "Processor %d/0x%x and the rest are ignored.\n",
>> +			     nr_cpu_ids - 1, max_logical_cpuid, apicid);
>> +		return -1;
>> +	}
> So, the original code didn't have this failure mode, why is this
> different for the new code?

It is not different. Since max_logical_cpuid is new, this is to ensure
it won't go beyond NR_CPUS.

Thanks.

>
> Thanks.
>
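
For reference, the lookup-then-allocate behaviour discussed in this
message can be sketched as a small user-space model. Assumptions:
SKETCH_NR_CPUS stands in for nr_cpu_ids, sketch_init_cpuid_map() is a
hypothetical helper that replaces the static initializer, and the
counter already uses the nr_logical_cpuids name suggested in the
review. The bound check is kept only because this model has no
generic_processor_info() in front of it.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 8	/* hypothetical stand-in for nr_cpu_ids */

/* Logical CPU IDs are handed out contiguously: valid IDs are [0, nr). */
static int nr_logical_cpuids;
static int cpuid_to_apicid[SKETCH_NR_CPUS];

/* Hypothetical init helper: mark all slots unused, reserve ID 0 for the BSP. */
void sketch_init_cpuid_map(int bsp_apicid)
{
	int i;

	for (i = 0; i < SKETCH_NR_CPUS; i++)
		cpuid_to_apicid[i] = -1;
	cpuid_to_apicid[0] = bsp_apicid;
	nr_logical_cpuids = 1;
}

/*
 * Return the logical CPU ID for apicid. An APIC ID that was seen before
 * gets its old ID back, so the mapping survives hot-remove and hot-add.
 */
int sketch_allocate_logical_cpuid(int apicid)
{
	int i;

	assert(nr_logical_cpuids >= 1);	/* sketch_init_cpuid_map() ran */

	for (i = 0; i < nr_logical_cpuids; i++) {
		if (cpuid_to_apicid[i] == apicid)
			return i;
	}

	if (nr_logical_cpuids >= SKETCH_NR_CPUS)
		return -1;	/* no free logical ID left in this model */

	/* The array slot and the counter are only ever updated together. */
	cpuid_to_apicid[nr_logical_cpuids] = apicid;
	return nr_logical_cpuids++;
}
```

Because IDs are never released, the array and the counter can only
change inside this one function, which is what keeps the two in sync
in the model.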


* Re: [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation.
  2015-09-26  9:31     ` Tang Chen
@ 2015-09-26 17:53       ` Tejun Heo
  2015-09-28  1:50         ` Tang Chen
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2015-09-26 17:53 UTC (permalink / raw)
  To: Tang Chen
  Cc: jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm

Hello, Tang.

On Sat, Sep 26, 2015 at 05:31:07PM +0800, Tang Chen wrote:
> >>@@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> >>  	if (nid < 0)
> >>  		nid = numa_node_id();
> >>+	if (!node_online(nid))
> >>+		nid = get_near_online_node(nid);
> >>+
> >>  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> >>  }
> >Why not just update node_data[]->node_zonelist in the first place?
> 
> zonelist will be rebuilt in __offline_pages() when the zone is not populated
> any more.
> 
> Here, getting the best near online node is for those cpus on memory-less
> nodes.
> 
> In the original code, if nid is NUMA_NO_NODE, the node the current cpu
> resides in will be chosen. And if the node is memory-less node, the cpu
> will be mapped to its best near online node.
> 
> But this patch-set will map the cpu to its original node, so numa_node_id()
> may return a memory-less node to allocator. And then memory allocation may
> fail.

Correct me if I'm wrong but the zonelist dictates which memory areas
the page allocator is gonna try to allocate from, right?  What I'm wondering is
why we aren't handling memory-less nodes by simply updating their
zonelists.  I mean, if, say, node 2 is memory-less, its zonelist can
simply point to zones from other nodes, right?  What am I missing
here?

Thanks.

-- 
tejun
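
The idea in this message, a memory-less node whose zonelist simply
lists other nodes, can be illustrated with a toy model. Everything
here is an assumption made for the example: the sketch_* names, the
4-node distance table, the flat per-node page counter, and using
"free pages > 0 at build time" as a stand-in for "the node has a
populated zone". It is not the kernel's build_zonelists().

```c
#include <assert.h>

#define SKETCH_NR_NODES 4	/* hypothetical node count */

/* Hypothetical SLIT-style distances; 10 means "local". */
static const int sketch_distance[SKETCH_NR_NODES][SKETCH_NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 40, 30, 20, 10 },
};

/* Free pages per node; 0 at build time models a memory-less node. */
int sketch_free_pages[SKETCH_NR_NODES];

/* Per-node fallback order, nearest node with memory first. */
static int sketch_zonelist[SKETCH_NR_NODES][SKETCH_NR_NODES];
static int sketch_zonelist_len[SKETCH_NR_NODES];

void sketch_build_zonelists(void)
{
	int nid, n, pos;

	for (nid = 0; nid < SKETCH_NR_NODES; nid++) {
		int *zl = sketch_zonelist[nid];
		int len = 0;

		for (n = 0; n < SKETCH_NR_NODES; n++) {
			if (sketch_free_pages[n] == 0)
				continue;	/* no memory here, skip it */

			/* Insertion sort by distance from nid. */
			pos = len++;
			while (pos > 0 && sketch_distance[nid][zl[pos - 1]] >
					  sketch_distance[nid][n]) {
				zl[pos] = zl[pos - 1];
				pos--;
			}
			zl[pos] = n;
		}
		sketch_zonelist_len[nid] = len;
	}
}

/* Take one page for nid; returns the node it came from, or -1. */
int sketch_alloc_page(int nid)
{
	int i;

	assert(nid >= 0 && nid < SKETCH_NR_NODES);
	for (i = 0; i < sketch_zonelist_len[nid]; i++) {
		int n = sketch_zonelist[nid][i];

		if (sketch_free_pages[n] > 0) {
			sketch_free_pages[n]--;
			return n;
		}
	}
	return -1;
}
```

In this model a request for a memory-less node never needs a special
case in the caller: its zonelist already starts at the nearest node
that has memory.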


* Re: [PATCH v2 5/7] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.
  2015-09-26  9:52     ` Tang Chen
@ 2015-09-26 17:56       ` Tejun Heo
  2015-09-28  1:57         ` Tang Chen
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2015-09-26 17:56 UTC (permalink / raw)
  To: Tang Chen
  Cc: jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm

On Sat, Sep 26, 2015 at 05:52:09PM +0800, Tang Chen wrote:
> >>+static int allocate_logical_cpuid(int apicid)
> >>+{
> >>+	int i;
> >>+
> >>+	/*
> >>+	 * cpuid <-> apicid mapping is persistent, so when a cpu is up,
> >>+	 * check if the kernel has allocated a cpuid for it.
> >>+	 */
> >>+	for (i = 0; i < max_logical_cpuid; i++) {
> >>+		if (cpuid_to_apicid[i] == apicid)
> >>+			return i;
> >>+	}
> >>+
> >>+	/* Allocate a new cpuid. */
> >>+	if (max_logical_cpuid >= nr_cpu_ids) {
> >>+		WARN_ONCE(1, "Only %d processors supported."
> >>+			     "Processor %d/0x%x and the rest are ignored.\n",
> >>+			     nr_cpu_ids - 1, max_logical_cpuid, apicid);
> >>+		return -1;
> >>+	}
> >So, the original code didn't have this failure mode, why is this
> >different for the new code?
> 
> It is not different. Since max_logical_cpuid is new, this is to ensure it
> won't go beyond NR_CPUS.

If the above condition can happen, the original code should have had a
similar check as above, right?  Sure, max_logical_cpuid is a new thing
but that doesn't seem to change whether the above condition can happen
or not, no?

Thanks.

-- 
tejun


* Re: [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation.
  2015-09-26 17:53       ` Tejun Heo
@ 2015-09-28  1:50         ` Tang Chen
  0 siblings, 0 replies; 24+ messages in thread
From: Tang Chen @ 2015-09-28  1:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm

Hi, tj,

On 09/27/2015 01:53 AM, Tejun Heo wrote:
> Hello, Tang.
>
> On Sat, Sep 26, 2015 at 05:31:07PM +0800, Tang Chen wrote:
>>>> @@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>>>>   	if (nid < 0)
>>>>   		nid = numa_node_id();
>>>> +	if (!node_online(nid))
>>>> +		nid = get_near_online_node(nid);
>>>> +
>>>>   	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>>>>   }
>>> Why not just update node_data[]->node_zonelist in the first place?
>> zonelist will be rebuilt in __offline_pages() when the zone is not populated
>> any more.
>>
>> Here, getting the best near online node is for those cpus on memory-less
>> nodes.
>>
>> In the original code, if nid is NUMA_NO_NODE, the node the current cpu
>> resides in will be chosen. And if the node is memory-less node, the cpu
>> will be mapped to its best near online node.
>>
>> But this patch-set will map the cpu to its original node, so numa_node_id()
>> may return a memory-less node to allocator. And then memory allocation may
>> fail.
> Correct me if I'm wrong but the zonelist dictates which memory areas
> the page allocator is gonna try to allocate from, right?  What I'm wondering is
> why we aren't handling memory-less nodes by simply updating their
> zonelists.  I mean, if, say, node 2 is memory-less, its zonelist can
> simply point to zones from other nodes, right?  What am I missing
> here?

Oh, yes, you are right. But I remember that some time ago Liu Jiang had
handled, or was going to handle, memory-less nodes like this in his patch:

https://lkml.org/lkml/2015/8/16/130

BTW, Liu Jiang, how are your patches going?

Thanks.

>
> Thanks.
>


* Re: [PATCH v2 5/7] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.
  2015-09-26 17:56       ` Tejun Heo
@ 2015-09-28  1:57         ` Tang Chen
  0 siblings, 0 replies; 24+ messages in thread
From: Tang Chen @ 2015-09-28  1:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa, yasu.isimatu,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm, tangchen


On 09/27/2015 01:56 AM, Tejun Heo wrote:
> On Sat, Sep 26, 2015 at 05:52:09PM +0800, Tang Chen wrote:
>>>> +static int allocate_logical_cpuid(int apicid)
>>>> +{
>>>> +	int i;
>>>> +
>>>> +	/*
>>>> +	 * cpuid <-> apicid mapping is persistent, so when a cpu is up,
>>>> +	 * check if the kernel has allocated a cpuid for it.
>>>> +	 */
>>>> +	for (i = 0; i < max_logical_cpuid; i++) {
>>>> +		if (cpuid_to_apicid[i] == apicid)
>>>> +			return i;
>>>> +	}
>>>> +
>>>> +	/* Allocate a new cpuid. */
>>>> +	if (max_logical_cpuid >= nr_cpu_ids) {
>>>> +		WARN_ONCE(1, "Only %d processors supported."
>>>> +			     "Processor %d/0x%x and the rest are ignored.\n",
>>>> +			     nr_cpu_ids - 1, max_logical_cpuid, apicid);
>>>> +		return -1;
>>>> +	}
>>> So, the original code didn't have this failure mode, why is this
>>> different for the new code?
>> It is not different. Since max_logical_cpuid is new, this is to ensure it
>> won't go beyond NR_CPUS.
> If the above condition can happen, the original code should have had a
> similar check as above, right?  Sure, max_logical_cpuid is a new thing
> but that doesn't seem to change whether the above condition can happen
> or not, no?

Right, indeed. It is in

generic_processor_info()
|--> if (num_processors >= nr_cpu_ids)

Will remove my newly added check.

Thanks.

>
> Thanks.
>


* Re: [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent.
  2015-09-10  4:27 [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Tang Chen
                   ` (6 preceding siblings ...)
  2015-09-10  4:27 ` [PATCH v2 7/7] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting Tang Chen
@ 2015-10-23 19:49 ` Yasuaki Ishimatsu
  7 siblings, 0 replies; 24+ messages in thread
From: Yasuaki Ishimatsu @ 2015-10-23 19:49 UTC (permalink / raw)
  To: Tang Chen
  Cc: tj, jiang.liu, mika.j.penttila, mingo, akpm, rjw, hpa,
	isimatu.yasuaki, kamezawa.hiroyu, izumi.taku, gongzhaogang,
	qiaonuohan, x86, linux-acpi, linux-kernel, linux-mm

Hi Tang,

Your patch assumes that the system supports memory-less nodes and
fixes the issue on the x86 architecture.

But if the system does not support memory-less nodes, your patch cannot
fix the issue. It means that the system must support memory-less nodes
in order to support node (CPU and memory) hotplug.

Why don't you fix workqueue directly?

Thanks,
Yasuaki Ishimatsu
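
The stale-cache problem in the cover letter quoted below can be
reproduced in a few lines of user-space C. This is a toy model under
stated assumptions: CPU sets are 32-bit masks, six node slots exist,
and all sketch_* names are invented for the example. The cached array
only plays the role of wq_numa_possible_cpumask; none of this is
workqueue code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_NODES 6	/* hypothetical number of node slots */

/* Current cpu <-> node mapping: one CPU bitmask per node. */
uint32_t sketch_live_cpumask[SKETCH_NR_NODES];
/* Snapshot taken once at init; plays the role of wq_numa_possible_cpumask. */
uint32_t sketch_cached_cpumask[SKETCH_NR_NODES];

/* Copy the live mapping once, as a boot-time init would. */
void sketch_cache_snapshot(void)
{
	int node;

	for (node = 0; node < SKETCH_NR_NODES; node++)
		sketch_cached_cpumask[node] = sketch_live_cpumask[node];
}

/*
 * Return the first node whose mask contains every CPU in pool_cpumask,
 * or -1 if the pool spans nodes. Mirrors the cpumask_subset() loop.
 */
int sketch_pool_node(const uint32_t *node_masks, uint32_t pool_cpumask)
{
	int node;

	assert(pool_cpumask != 0);
	for (node = 0; node < SKETCH_NR_NODES; node++) {
		if ((pool_cpumask & ~node_masks[node]) == 0)
			return node;
	}
	return -1;
}
```

If node 2 is removed after the snapshot and its CPUs come back under
node 4, a lookup through the cached masks still answers node 2 while a
lookup through the live masks answers node 4. Either the cache has to
be refreshed on hotplug or the mapping it caches has to stay fixed.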

On Thu, 10 Sep 2015 12:27:42 +0800
Tang Chen <tangchen@cn.fujitsu.com> wrote:

> The whole patch-set aims at solving this problem:
> 
> [Problem]
> 
> cpuid <-> nodeid mapping is firstly established at boot time. And workqueue caches
> the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.
> 
> When doing node online/offline, cpuid <-> nodeid mapping is established/destroyed,
> which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
> workqueue does not update wq_numa_possible_cpumask.
> 
> So here is the problem:
> 
> Assume we have the following cpuid <-> nodeid in the beginning:
> 
>   Node | CPU
> ------------------------
> node 0 |  0-14, 60-74
> node 1 | 15-29, 75-89
> node 2 | 30-44, 90-104
> node 3 | 45-59, 105-119
> 
> and we hot-remove node2 and node3, it becomes:
> 
>   Node | CPU
> ------------------------
> node 0 |  0-14, 60-74
> node 1 | 15-29, 75-89
> 
> and we hot-add node4 and node5, it becomes:
> 
>   Node | CPU
> ------------------------
> node 0 |  0-14, 60-74
> node 1 | 15-29, 75-89
> node 4 | 30-59
> node 5 | 90-119
> 
> But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.
> 
> When a pool workqueue is initialized, if its cpumask belongs to a node, its
> pool->node will be mapped to that node. And memory used by this workqueue will
> also be allocated on that node.
> 
> static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs){
> ...
>         /* if cpumask is contained inside a NUMA node, we belong to that node */
>         if (wq_numa_enabled) {
>                 for_each_node(node) {
>                         if (cpumask_subset(pool->attrs->cpumask,
>                                            wq_numa_possible_cpumask[node])) {
>                                 pool->node = node;
>                                 break;
>                         }
>                 }
>         }
> 
> Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline node,
> which will lead to memory allocation failure:
> 
>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>   cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
>   node 0: slabs: 6172, objs: 259224, free: 245741
>   node 1: slabs: 3261, objs: 136962, free: 127656
> 
> It happens here:
> 
> create_worker(struct worker_pool *pool)
>  |--> worker = alloc_worker(pool->node);
> 
> static struct worker *alloc_worker(int node)
> {
>         struct worker *worker;
> 
>         worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, using the wrong node.
> 
>         ......
> 
>         return worker;
> }
> 
> 
> [Solution]
> 
> There are four mappings in the kernel:
> 1. nodeid (logical node id)   <->   pxm
> 2. apicid (physical cpu id)   <->   nodeid
> 3. cpuid (logical cpu id)     <->   apicid
> 4. cpuid (logical cpu id)     <->   nodeid
> 
> 1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> pxm
>    mapping is set up at boot time. This mapping is persistent and does not change.
> 
> 2. apicid <-> nodeid mapping is set up using info in 1. The mapping is set up at boot
>    time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also
>    persistent.
> 
> 3. cpuid <-> apicid mapping is set up at boot time and CPU hotadd time. cpuid is
>    allocated, lower ids first, and released at CPU hotremove time, then reused for other
>    hotadded CPUs. So this mapping is not persistent.
> 
> 4. cpuid <-> nodeid mapping is also set up at boot time and CPU hotadd time, and
>    cleared at CPU hotremove time. As a result of 3, this mapping is not persistent.
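
To make point 3 concrete, here is a minimal userspace sketch of "lowest free id first" allocation. It is only an illustration under assumed names (cpu_table, cpu_hotadd(), cpu_hotremove() are hypothetical) and an assumed limit of 8 CPUs; the kernel's actual bookkeeping differs.

```c
#include <assert.h>

#define NR_CPUS    8
#define BAD_APICID (-1)

/* cpuid -> apicid; BAD_APICID marks a free cpuid. */
struct cpu_table {
	int apicid_of[NR_CPUS];
};

void cpu_table_init(struct cpu_table *t)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		t->apicid_of[cpu] = BAD_APICID;
}

/* Hand out the lowest free cpuid; returns -1 if none is left. */
int cpu_hotadd(struct cpu_table *t, int apicid)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (t->apicid_of[cpu] == BAD_APICID) {
			t->apicid_of[cpu] = apicid;
			return cpu;
		}
	}
	return -1;
}

/* Release the cpuid so that a later hotadd may reuse it. */
void cpu_hotremove(struct cpu_table *t, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	t->apicid_of[cpu] = BAD_APICID;
}
```

After removing cpu 0 and hot-adding a CPU with a different apicid, the new CPU is given cpuid 0 again, so anything that cached "cpu 0 is on node X" may now be wrong.
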
> 
> To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
> cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
> cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid
> mapping. So the key point is obtaining all cpus' apicid.
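
The composition mentioned above can be sketched as follows. This is a simplified userspace illustration, not init_cpu_to_node() itself; the array sizes and the name build_cpu_to_node() are assumptions made for the example.

```c
#include <assert.h>

#define NR_CPUS      4
#define MAX_APICID   16
#define NUMA_NO_NODE (-1)
#define BAD_APICID   (-1)

/*
 * Derive cpuid -> nodeid by chaining cpuid -> apicid and apicid -> nodeid.
 * A cpuid with no known apicid gets NUMA_NO_NODE.
 */
void build_cpu_to_node(const int cpu_to_apicid[NR_CPUS],
		       const int apicid_to_node[MAX_APICID],
		       int cpu_to_node[NR_CPUS])
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int apicid = cpu_to_apicid[cpu];

		if (apicid == BAD_APICID) {
			cpu_to_node[cpu] = NUMA_NO_NODE;
			continue;
		}
		assert(apicid >= 0 && apicid < MAX_APICID);
		cpu_to_node[cpu] = apicid_to_node[apicid];
	}
}
```

The sketch shows why every possible CPU's apicid is needed up front: a cpuid whose apicid is unknown at boot cannot be given a node.
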
> 
> apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
> MADT (Multiple APIC Description Table). So we finish the job in the following steps:
> 
> 1. Enable apic registration flow to handle both enabled and disabled cpus.
>    This is done by introducing an extra parameter to generic_processor_info to let the
>    caller control if disabled cpus are ignored.
> 
> 2. Introduce a new array storing all possible cpuid <-> apicid mapping. And also modify
>    the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping when
>    registering local apic. Store the mapping in this array.
> 
> 3. Enable _MAT and MADT related APIs to return non-present or disabled cpus' apicid.
>    This is also done by introducing an extra parameter to these apis to let the caller
>    control if disabled cpus are ignored.
> 
> 4. Establish all possible cpuid <-> nodeid mapping.
>    This is done via an additional acpi namespace walk for processors.
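
The idea behind step 2 can be sketched as below: once an apicid has been given a cpuid, the pair is remembered and never handed to another apicid. This is a userspace illustration only; persistent_map and persistent_cpuid() are hypothetical names, and the patch's real apicid_to_cpuid[] handling may differ in detail.

```c
#include <assert.h>

#define NR_CPUS    8
#define MAX_APICID 64

struct persistent_map {
	int cpuid_of[MAX_APICID]; /* -1 until the apicid is first seen */
	int nr_allocated;         /* next unused cpuid */
};

void persistent_map_init(struct persistent_map *m)
{
	for (int i = 0; i < MAX_APICID; i++)
		m->cpuid_of[i] = -1;
	m->nr_allocated = 0;
}

/*
 * Return the cpuid for an apicid, allocating one on first use.
 * The entry is never cleared, so hotremove followed by hotadd of the
 * same apicid yields the same cpuid. Returns -1 when cpuids run out.
 */
int persistent_cpuid(struct persistent_map *m, int apicid)
{
	assert(apicid >= 0 && apicid < MAX_APICID);
	if (m->cpuid_of[apicid] >= 0)
		return m->cpuid_of[apicid];
	if (m->nr_allocated >= NR_CPUS)
		return -1;
	m->cpuid_of[apicid] = m->nr_allocated++;
	return m->cpuid_of[apicid];
}
```

Compared with lowest-free-first allocation, a cpuid is never recycled for a different apicid here, at the cost of reserving cpuids for CPUs that may never be plugged in.
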
> 
> 
> Patches 1 ~ 3 are preparatory work.
> Patches 4 ~ 7 implement the 4 steps above.
> 
> 
> For previous discussion, please refer to:
> https://lkml.org/lkml/2015/2/27/145
> https://lkml.org/lkml/2015/3/25/989
> https://lkml.org/lkml/2015/5/14/244
> https://lkml.org/lkml/2015/7/7/200
> 
> 
> Change log v1 -> v2:
> 1. Split code movement and actual changes. Add patch 1.
> 2. Synchronize best near online node record when node hotplug happens. In patch 2.
> 3. Fix some comments.
> 
> 
> Gu Zheng (5):
>   x86, gfp: Cache best near node for memory allocation.
>   x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at
>     boot time.
>   x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store
>     persistent cpuid <-> apicid mapping.
>   x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.
>   x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when
>     booting.
> 
> Tang Chen (2):
>   x86, numa: Move definition of find_near_online_node() forward.
>   x86, numa: Introduce a node to node array to map a node to its best
>     online node.
> 
>  arch/ia64/kernel/acpi.c         |   2 +-
>  arch/x86/include/asm/mpspec.h   |   1 +
>  arch/x86/include/asm/topology.h |  10 ++++
>  arch/x86/kernel/acpi/boot.c     |   8 +--
>  arch/x86/kernel/apic/apic.c     |  77 ++++++++++++++++++++++---
>  arch/x86/mm/numa.c              |  80 +++++++++++++++++++-------
>  drivers/acpi/acpi_processor.c   |   5 +-
>  drivers/acpi/bus.c              |   3 +
>  drivers/acpi/processor_core.c   | 122 +++++++++++++++++++++++++++++++++-------
>  include/linux/acpi.h            |   2 +
>  include/linux/gfp.h             |   8 ++-
>  mm/memory_hotplug.c             |   4 ++
>  12 files changed, 264 insertions(+), 58 deletions(-)
> 
> -- 
> 1.9.3
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-10-23 19:49 UTC | newest]

Thread overview: 24+ messages
2015-09-10  4:27 [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Tang Chen
2015-09-10  4:27 ` [PATCH v2 1/7] x86, numa: Move definition of find_near_online_node() forward Tang Chen
2015-09-10  4:27 ` [PATCH v2 2/7] x86, numa: Introduce a node to node array to map a node to its best online node Tang Chen
2015-09-10  4:27 ` [PATCH v2 3/7] x86, gfp: Cache best near node for memory allocation Tang Chen
2015-09-10 19:29   ` Tejun Heo
2015-09-10 19:38     ` Tejun Heo
2015-09-10 22:02       ` Christoph Lameter
2015-09-10 22:08         ` Tejun Heo
2015-09-11  0:14       ` Christoph Lameter
2015-09-26  9:35         ` Tang Chen
2015-09-26  9:31     ` Tang Chen
2015-09-26 17:53       ` Tejun Heo
2015-09-28  1:50         ` Tang Chen
2015-09-10  4:27 ` [PATCH v2 4/7] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time Tang Chen
2015-09-10 23:10   ` Rafael J. Wysocki
2015-09-26  9:44     ` Tang Chen
2015-09-10  4:27 ` [PATCH v2 5/7] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping Tang Chen
2015-09-10 19:55   ` Tejun Heo
2015-09-26  9:52     ` Tang Chen
2015-09-26 17:56       ` Tejun Heo
2015-09-28  1:57         ` Tang Chen
2015-09-10  4:27 ` [PATCH v2 6/7] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid Tang Chen
2015-09-10  4:27 ` [PATCH v2 7/7] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting Tang Chen
2015-10-23 19:49 ` [PATCH v2 0/7] Make cpuid <-> nodeid mapping persistent Yasuaki Ishimatsu
