[RFC PATCH 4/4] powerpc: reorder per-cpu NUMA information's initialization

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>,
	Matt Mackall <mpm@selenic.com>,
	David Rientjes <rientjes@google.com>,
	Pekka Enberg <penberg@kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Paul Mackerras <paulus@samba.org>, Tejun Heo <tj@kernel.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	linuxppc-dev@lists.ozlabs.org, Christoph Lameter <cl@linux.com>,
	Wanpeng Li <liwanp@linux.vnet.ibm.com>,
	Anton Blanchard <anton@samba.org>
Subject: [RFC PATCH 4/4] powerpc: reorder per-cpu NUMA information's initialization
Date: Wed, 13 Aug 2014 17:17:23 -0700	[thread overview]
Message-ID: <20140814001723.GM11121@linux.vnet.ibm.com> (raw)
In-Reply-To: <20140814001301.GI11121@linux.vnet.ibm.com>

There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
after start_secondary(), similar to ia64, which is invoked via
smp_init().

Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.

Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:

bce903809ab3f ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).

While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology:

    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 2
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
    node 1 size: 16336 MB
    node 1 free: 15329 MB
    node distances:
    node   0   1
      0:  10  40
      1:  40  10

the slab consumption decreases from:

    Slab:             932416 kB
    SUnreclaim:       902336 kB

to

    Slab:             395264 kB
    SUnreclaim:       359424 kB

And we see a corresponding increase in the slab efficiency from:

    slab                                   mem     objs    slabs
                                          used   active   active
    ------------------------------------------------------------
    kmalloc-16384                       337 MB   11.28%  100.00%
    task_struct                         288 MB    9.93%  100.00%

to:

    slab                                   mem     objs    slabs
                                          used   active   active
    ------------------------------------------------------------
    kmalloc-16384                        37 MB  100.00%  100.00%
    task_struct                          31 MB  100.00%  100.00%

Powerpc didn't support memoryless nodes until recently (64bb80d87f01
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption with these kind of environments.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
Ben & others, one area I'm still unsure of is if calling the NUMA
callback for all CPUs is desired. I don't know how else to get the NUMA
topology into the array easily, but I didn't test in an environment with
hotpluggable CPUs, so I'm not sure if it will lead to errors there (are
there device-tree entries for the topology of CPUs that will be plugged
in? I assume not, actually, so maybe we should keep the logic in
start_secondary so that those CPUs that are hotplugged later get the
right topology data?

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 1007fb802e6b..1fc8984f272e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -376,6 +376,12 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, cpu_to_node(cpu));
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, cpu_to_node(cpu));
+		/*
+		 * numa_node_id() works after this.
+		 */
+		set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+		set_cpu_numa_mem(cpu,
+				 local_memory_node(numa_cpu_lookup_table[cpu]));
 	}
 
 	cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
@@ -723,12 +729,6 @@ void start_secondary(void *unused)
 	}
 	traverse_core_siblings(cpu, true);
 
-	/*
-	 * numa_node_id() works after this.
-	 */
-	set_numa_node(numa_cpu_lookup_table[cpu]);
-	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
-
 	smp_wmb();
 	notify_cpu_starting(cpu);
 	set_cpu_online(cpu, true);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d3e9a78eaed3..32341e16b8ce 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1049,7 +1049,7 @@ static void __init mark_reserved_regions_for_nid(int nid)
 
 void __init do_init_bootmem(void)
 {
-	int nid;
+	int nid, cpu;
 
 	min_low_pfn = 0;
 	max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
@@ -1122,8 +1122,15 @@ void __init do_init_bootmem(void)
 
 	reset_numa_cpu_lookup_table();
 	register_cpu_notifier(&ppc64_numa_nb);
-	cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
-			  (void *)(unsigned long)boot_cpuid);
+	/*
+	 * We need the numa_cpu_lookup_table to be accurate for all
+	 * CPUs, even before we online them, so that we can use
+	 * cpu_to_{node,mem} early in boot, cf. smp_prepare_cpus().
+	 */
+	for_each_possible_cpu(cpu) {
+		cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
+				  (void *)(unsigned long)boot_cpuid);
+	}
 }
 
 void __init paging_init(void)

WARNING: multiple messages have this Message-ID (diff)

From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	David Rientjes <rientjes@google.com>,
	Han Pingtian <hanpt@linux.vnet.ibm.com>,
	Pekka Enberg <penberg@kernel.org>,
	Paul Mackerras <paulus@samba.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Anton Blanchard <anton@samba.org>, Matt Mackall <mpm@selenic.com>,
	Christoph Lameter <cl@linux.com>,
	Wanpeng Li <liwanp@linux.vnet.ibm.com>, Tejun Heo <tj@kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	linuxppc-dev@lists.ozlabs.org
Subject: [RFC PATCH 4/4] powerpc: reorder per-cpu NUMA information's initialization
Date: Wed, 13 Aug 2014 17:17:23 -0700	[thread overview]
Message-ID: <20140814001723.GM11121@linux.vnet.ibm.com> (raw)
In-Reply-To: <20140814001301.GI11121@linux.vnet.ibm.com>

There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
after start_secondary(), similar to ia64, which is invoked via
smp_init().

Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.

Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:

bce903809ab3f ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).

While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology:

    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 2
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
    node 1 size: 16336 MB
    node 1 free: 15329 MB
    node distances:
    node   0   1
      0:  10  40
      1:  40  10

the slab consumption decreases from:

    Slab:             932416 kB
    SUnreclaim:       902336 kB

to

    Slab:             395264 kB
    SUnreclaim:       359424 kB

And we see a corresponding increase in the slab efficiency from:

    slab                                   mem     objs    slabs
                                          used   active   active
    ------------------------------------------------------------
    kmalloc-16384                       337 MB   11.28%  100.00%
    task_struct                         288 MB    9.93%  100.00%

to:

    slab                                   mem     objs    slabs
                                          used   active   active
    ------------------------------------------------------------
    kmalloc-16384                        37 MB  100.00%  100.00%
    task_struct                          31 MB  100.00%  100.00%

Powerpc didn't support memoryless nodes until recently (64bb80d87f01
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption with these kind of environments.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
Ben & others, one area I'm still unsure of is if calling the NUMA
callback for all CPUs is desired. I don't know how else to get the NUMA
topology into the array easily, but I didn't test in an environment with
hotpluggable CPUs, so I'm not sure if it will lead to errors there (are
there device-tree entries for the topology of CPUs that will be plugged
in? I assume not, actually, so maybe we should keep the logic in
start_secondary so that those CPUs that are hotplugged later get the
right topology data?

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 1007fb802e6b..1fc8984f272e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -376,6 +376,12 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, cpu_to_node(cpu));
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, cpu_to_node(cpu));
+		/*
+		 * numa_node_id() works after this.
+		 */
+		set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+		set_cpu_numa_mem(cpu,
+				 local_memory_node(numa_cpu_lookup_table[cpu]));
 	}
 
 	cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
@@ -723,12 +729,6 @@ void start_secondary(void *unused)
 	}
 	traverse_core_siblings(cpu, true);
 
-	/*
-	 * numa_node_id() works after this.
-	 */
-	set_numa_node(numa_cpu_lookup_table[cpu]);
-	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
-
 	smp_wmb();
 	notify_cpu_starting(cpu);
 	set_cpu_online(cpu, true);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d3e9a78eaed3..32341e16b8ce 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1049,7 +1049,7 @@ static void __init mark_reserved_regions_for_nid(int nid)
 
 void __init do_init_bootmem(void)
 {
-	int nid;
+	int nid, cpu;
 
 	min_low_pfn = 0;
 	max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
@@ -1122,8 +1122,15 @@ void __init do_init_bootmem(void)
 
 	reset_numa_cpu_lookup_table();
 	register_cpu_notifier(&ppc64_numa_nb);
-	cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
-			  (void *)(unsigned long)boot_cpuid);
+	/*
+	 * We need the numa_cpu_lookup_table to be accurate for all
+	 * CPUs, even before we online them, so that we can use
+	 * cpu_to_{node,mem} early in boot, cf. smp_prepare_cpus().
+	 */
+	for_each_possible_cpu(cpu) {
+		cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
+				  (void *)(unsigned long)boot_cpuid);
+	}
 }
 
 void __init paging_init(void)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2014-08-14  0:17 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-14  0:13 [RFC PATCH 0/4] Improve slab consumption with memoryless nodes Nishanth Aravamudan
2014-08-14  0:13 ` Nishanth Aravamudan
2014-08-14  0:14 ` [RFC PATCH v3 1/4] topology: add support for node_to_mem_node() to determine the fallback node Nishanth Aravamudan
2014-08-14  0:14   ` Nishanth Aravamudan
2014-08-14 14:35   ` Christoph Lameter
2014-08-14 14:35     ` Christoph Lameter
2014-08-14 20:06     ` Nishanth Aravamudan
2014-08-14 20:06       ` Nishanth Aravamudan
2014-08-22 21:52       ` Nishanth Aravamudan
2014-08-22 21:52         ` Nishanth Aravamudan
2014-08-14  0:15 ` [RFC PATCH 2/4] slub: fallback to node_to_mem_node() node if allocating on memoryless node Nishanth Aravamudan
2014-08-14  0:15   ` Nishanth Aravamudan
2014-08-14  0:16 ` [RFC PATCH 3/4] Partial revert of 81c98869faa5 ("kthread: ensure locality of task_struct allocations") Nishanth Aravamudan
2014-08-14  0:16   ` Nishanth Aravamudan
2014-08-14  0:17 ` Nishanth Aravamudan [this message]
2014-08-14  0:17   ` [RFC PATCH 4/4] powerpc: reorder per-cpu NUMA information's initialization Nishanth Aravamudan
2014-08-22  1:10 ` [RFC PATCH 0/4] Improve slab consumption with memoryless nodes Nishanth Aravamudan
2014-08-22  1:10   ` Nishanth Aravamudan
2014-08-22 20:32   ` Andrew Morton
2014-08-22 20:32     ` Andrew Morton

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:1007fb802e6 dfblob:1fc8984f272 dfblob:d3e9a78eaed
dfblob:32341e16b8c dfblob:1007fb802e6 dfblob:1fc8984f272
dfblob:d3e9a78eaed dfblob:32341e16b8c )
 OR (
bs:"[RFC PATCH 4/4] powerpc: reorder per-cpu NUMA information's initialization" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140814001723.GM11121@linux.vnet.ibm.com \
    --to=nacc@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=anton@samba.org \
    --cc=cl@linux.com \
    --cc=hanpt@linux.vnet.ibm.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=liwanp@linux.vnet.ibm.com \
    --cc=mpm@selenic.com \
    --cc=paulus@samba.org \
    --cc=penberg@kernel.org \
    --cc=rientjes@google.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.