[RFC PATCH 4/4] powerpc: reorder per-cpu NUMA information's initialization

linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed

From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>,
	Matt Mackall <mpm@selenic.com>,
	David Rientjes <rientjes@google.com>,
	Pekka Enberg <penberg@kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Paul Mackerras <paulus@samba.org>, Tejun Heo <tj@kernel.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	linuxppc-dev@lists.ozlabs.org, Christoph Lameter <cl@linux.com>,
	Wanpeng Li <liwanp@linux.vnet.ibm.com>,
	Anton Blanchard <anton@samba.org>
Subject: [RFC PATCH 4/4] powerpc: reorder per-cpu NUMA information's initialization
Date: Wed, 13 Aug 2014 17:17:23 -0700	[thread overview]
Message-ID: <20140814001723.GM11121@linux.vnet.ibm.com> (raw)
In-Reply-To: <20140814001301.GI11121@linux.vnet.ibm.com>

There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
after start_secondary(), similar to ia64, which is invoked via
smp_init().

Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.

Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:

bce903809ab3f ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).

While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology:

    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 2
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
    node 1 size: 16336 MB
    node 1 free: 15329 MB
    node distances:
    node   0   1
      0:  10  40
      1:  40  10

the slab consumption decreases from:

    Slab:             932416 kB
    SUnreclaim:       902336 kB

to

    Slab:             395264 kB
    SUnreclaim:       359424 kB

And we see a corresponding increase in the slab efficiency from:

    slab                                   mem     objs    slabs
                                          used   active   active
    ------------------------------------------------------------
    kmalloc-16384                       337 MB   11.28%  100.00%
    task_struct                         288 MB    9.93%  100.00%

to:

    slab                                   mem     objs    slabs
                                          used   active   active
    ------------------------------------------------------------
    kmalloc-16384                        37 MB  100.00%  100.00%
    task_struct                          31 MB  100.00%  100.00%

Powerpc didn't support memoryless nodes until recently (64bb80d87f01
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption with these kind of environments.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
Ben & others, one area I'm still unsure of is if calling the NUMA
callback for all CPUs is desired. I don't know how else to get the NUMA
topology into the array easily, but I didn't test in an environment with
hotpluggable CPUs, so I'm not sure if it will lead to errors there (are
there device-tree entries for the topology of CPUs that will be plugged
in? I assume not, actually, so maybe we should keep the logic in
start_secondary so that those CPUs that are hotplugged later get the
right topology data?

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 1007fb802e6b..1fc8984f272e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -376,6 +376,12 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, cpu_to_node(cpu));
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, cpu_to_node(cpu));
+		/*
+		 * numa_node_id() works after this.
+		 */
+		set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+		set_cpu_numa_mem(cpu,
+				 local_memory_node(numa_cpu_lookup_table[cpu]));
 	}
 
 	cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
@@ -723,12 +729,6 @@ void start_secondary(void *unused)
 	}
 	traverse_core_siblings(cpu, true);
 
-	/*
-	 * numa_node_id() works after this.
-	 */
-	set_numa_node(numa_cpu_lookup_table[cpu]);
-	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
-
 	smp_wmb();
 	notify_cpu_starting(cpu);
 	set_cpu_online(cpu, true);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d3e9a78eaed3..32341e16b8ce 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1049,7 +1049,7 @@ static void __init mark_reserved_regions_for_nid(int nid)
 
 void __init do_init_bootmem(void)
 {
-	int nid;
+	int nid, cpu;
 
 	min_low_pfn = 0;
 	max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
@@ -1122,8 +1122,15 @@ void __init do_init_bootmem(void)
 
 	reset_numa_cpu_lookup_table();
 	register_cpu_notifier(&ppc64_numa_nb);
-	cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
-			  (void *)(unsigned long)boot_cpuid);
+	/*
+	 * We need the numa_cpu_lookup_table to be accurate for all
+	 * CPUs, even before we online them, so that we can use
+	 * cpu_to_{node,mem} early in boot, cf. smp_prepare_cpus().
+	 */
+	for_each_possible_cpu(cpu) {
+		cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
+				  (void *)(unsigned long)boot_cpuid);
+	}
 }
 
 void __init paging_init(void)

next prev parent reply	other threads:[~2014-08-14  0:17 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-14  0:13 [RFC PATCH 0/4] Improve slab consumption with memoryless nodes Nishanth Aravamudan
2014-08-14  0:14 ` [RFC PATCH v3 1/4] topology: add support for node_to_mem_node() to determine the fallback node Nishanth Aravamudan
2014-08-14 14:35   ` Christoph Lameter
2014-08-14 20:06     ` Nishanth Aravamudan
2014-08-22 21:52       ` Nishanth Aravamudan
2014-08-14  0:15 ` [RFC PATCH 2/4] slub: fallback to node_to_mem_node() node if allocating on memoryless node Nishanth Aravamudan
2014-08-14  0:16 ` [RFC PATCH 3/4] Partial revert of 81c98869faa5 ("kthread: ensure locality of task_struct allocations") Nishanth Aravamudan
2014-08-14  0:17 ` Nishanth Aravamudan [this message]
2014-08-22  1:10 ` [RFC PATCH 0/4] Improve slab consumption with memoryless nodes Nishanth Aravamudan
2014-08-22 20:32   ` Andrew Morton

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:1007fb802e6 dfblob:1fc8984f272 dfblob:d3e9a78eaed
dfblob:32341e16b8c )
 OR (
bs:"[RFC PATCH 4/4] powerpc: reorder per-cpu NUMA information's initialization" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140814001723.GM11121@linux.vnet.ibm.com \
    --to=nacc@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=anton@samba.org \
    --cc=cl@linux.com \
    --cc=hanpt@linux.vnet.ibm.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=liwanp@linux.vnet.ibm.com \
    --cc=mpm@selenic.com \
    --cc=paulus@samba.org \
    --cc=penberg@kernel.org \
    --cc=rientjes@google.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).