From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>,
Matt Mackall <mpm@selenic.com>,
David Rientjes <rientjes@google.com>,
Pekka Enberg <penberg@kernel.org>,
Linux Memory Management List <linux-mm@kvack.org>,
Paul Mackerras <paulus@samba.org>, Tejun Heo <tj@kernel.org>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
linuxppc-dev@lists.ozlabs.org, Christoph Lameter <cl@linux.com>,
Wanpeng Li <liwanp@linux.vnet.ibm.com>,
Anton Blanchard <anton@samba.org>
Subject: [RFC PATCH 4/4] powerpc: reorder per-cpu NUMA information's initialization
Date: Wed, 13 Aug 2014 17:17:23 -0700
Message-ID: <20140814001723.GM11121@linux.vnet.ibm.com>
In-Reply-To: <20140814001301.GI11121@linux.vnet.ibm.com>
There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

On NUMA powerpc, a non-boot CPU's cpu_to_node()/cpu_to_mem() is only
accurate after start_secondary() has run, similar to ia64.
start_secondary() is invoked via smp_init().

Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.

Additionally, the following commits changed init_workqueues() to use
cpu_to_node() to determine the node to use for kthread_create_on_node():

bce903809ab3f ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).
While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 2
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10
the slab consumption decreases from:
Slab: 932416 kB
SUnreclaim: 902336 kB
to
Slab: 395264 kB
SUnreclaim: 359424 kB
And we see a corresponding increase in the slab efficiency from:

slab                                 mem      objs     slabs
                                    used    active    active
------------------------------------------------------------
kmalloc-16384                     337 MB    11.28%   100.00%
task_struct                       288 MB     9.93%   100.00%

to:

slab                                 mem      objs     slabs
                                    used    active    active
------------------------------------------------------------
kmalloc-16384                      37 MB   100.00%   100.00%
task_struct                        31 MB   100.00%   100.00%
Powerpc didn't support memoryless nodes until recently (64bb80d87f01
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption in these kinds of environments.
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
---
Ben & others, one area I'm still unsure of is whether calling the NUMA
callback for all CPUs is desired. I don't know how else to get the NUMA
topology into the array easily, but I didn't test in an environment with
hotpluggable CPUs, so I'm not sure if it will lead to errors there. Are
there device-tree entries for the topology of CPUs that will be plugged
in? I assume not, actually, so maybe we should keep the logic in
start_secondary() so that those CPUs that are hotplugged later get the
right topology data?
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 1007fb802e6b..1fc8984f272e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -376,6 +376,12 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
GFP_KERNEL, cpu_to_node(cpu));
zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
GFP_KERNEL, cpu_to_node(cpu));
+ /*
+ * numa_node_id() works after this.
+ */
+ set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+ set_cpu_numa_mem(cpu,
+ local_memory_node(numa_cpu_lookup_table[cpu]));
}
cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
@@ -723,12 +729,6 @@ void start_secondary(void *unused)
}
traverse_core_siblings(cpu, true);
- /*
- * numa_node_id() works after this.
- */
- set_numa_node(numa_cpu_lookup_table[cpu]);
- set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
-
smp_wmb();
notify_cpu_starting(cpu);
set_cpu_online(cpu, true);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d3e9a78eaed3..32341e16b8ce 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1049,7 +1049,7 @@ static void __init mark_reserved_regions_for_nid(int nid)
void __init do_init_bootmem(void)
{
- int nid;
+ int nid, cpu;
min_low_pfn = 0;
max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
@@ -1122,8 +1122,15 @@ void __init do_init_bootmem(void)
reset_numa_cpu_lookup_table();
register_cpu_notifier(&ppc64_numa_nb);
- cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
- (void *)(unsigned long)boot_cpuid);
+ /*
+ * We need the numa_cpu_lookup_table to be accurate for all
+ * CPUs, even before we online them, so that we can use
+ * cpu_to_{node,mem} early in boot, cf. smp_prepare_cpus().
+ */
+ for_each_possible_cpu(cpu) {
+ cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
+ (void *)(unsigned long)cpu);
+ }
}
void __init paging_init(void)