* [RFC 0/2] Memoryless nodes and kworker @ 2014-07-17 23:09 Nishanth Aravamudan 2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan 2014-07-18 11:20 ` [RFC 0/2] Memoryless nodes and kworker Tejun Heo 0 siblings, 2 replies; 13+ messages in thread From: Nishanth Aravamudan @ 2014-07-17 23:09 UTC (permalink / raw) To: benh Cc: Fenghua Yu, Tony Luck, linux-ia64, linux-kernel, linux-mm, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

[Apologies for the large Cc list, but I believe we have the following interested parties:

  x86 (recently posted memoryless node support)
  ia64 (existing memoryless node support)
  ppc (existing memoryless node support)
  previous discussion of how to solve Anton's issue with slab usage
  workqueue contributors/maintainers]

There is an issue currently where NUMA information is used on powerpc (and possibly ia64) before it has been read from the device-tree, which leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
While testing memoryless nodes on PowerKVM guests with the patches in this series, with a guest topology of

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

the slab consumption decreases from

Slab:             932416 kB
SUnreclaim:       902336 kB

to

Slab:             395264 kB
SUnreclaim:       359424 kB

And we see a corresponding increase in the slab efficiency from

slab                   mem     objs    slabs
                       used   active   active
------------------------------------------------------------
kmalloc-16384       337 MB   11.28%  100.00%
task_struct         288 MB    9.93%  100.00%

to

slab                   mem     objs    slabs
                       used   active   active
------------------------------------------------------------
kmalloc-16384        37 MB  100.00%  100.00%
task_struct          31 MB  100.00%  100.00%

It turns out we see this large slab usage due to using the wrong NUMA information when creating kthreads. Two changes are required, one of which is in the workqueue code and one of which is in the powerpc initialization. Note that ia64 may want to consider something similar.

^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC 1/2] workqueue: use the nearest NUMA node, not the local one 2014-07-17 23:09 [RFC 0/2] Memoryless nodes and kworker Nishanth Aravamudan @ 2014-07-17 23:09 ` Nishanth Aravamudan 2014-07-17 23:15 ` [RFC 2/2] powerpc: reorder per-cpu NUMA information's initialization Nishanth Aravamudan 2014-07-18 8:11 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Lai Jiangshan 2014-07-18 11:20 ` [RFC 0/2] Memoryless nodes and kworker Tejun Heo 1 sibling, 2 replies; 13+ messages in thread From: Nishanth Aravamudan @ 2014-07-17 23:09 UTC (permalink / raw) To: benh Cc: Fenghua Yu, Tony Luck, linux-ia64, linux-kernel, linux-mm, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

In the presence of memoryless nodes, the workqueue code incorrectly uses cpu_to_node() to determine what node to prefer memory allocations come from. cpu_to_mem() should be used instead, which will use the nearest NUMA node with memory.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 35974ac..0bba022 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 	for_each_node(node) {
 		if (cpumask_subset(pool->attrs->cpumask,
 				   wq_numa_possible_cpumask[node])) {
-			pool->node = node;
+			/*
+			 * We could use local_memory_node(node) here,
+			 * but it is expensive and the following caches
+			 * the same value.
+			 */
+			pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask));
 			break;
 		}
 	}
@@ -4921,7 +4926,7 @@ static int __init init_workqueues(void)
 			pool->cpu = cpu;
 			cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
 			pool->attrs->nice = std_nice[i++];
-			pool->node = cpu_to_node(cpu);
+			pool->node = cpu_to_mem(cpu);
 
 			/* alloc pool ID */
 			mutex_lock(&wq_pool_mutex);

^ permalink raw reply related [flat|nested] 13+ messages in thread
* [RFC 2/2] powerpc: reorder per-cpu NUMA information's initialization 2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan @ 2014-07-17 23:15 ` Nishanth Aravamudan 2014-07-18 8:11 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Lai Jiangshan 1 sibling, 0 replies; 13+ messages in thread From: Nishanth Aravamudan @ 2014-07-17 23:15 UTC (permalink / raw) To: benh Cc: Fenghua Yu, Tony Luck, linux-ia64, linux-kernel, linux-mm, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

There is an issue currently where NUMA information is used on powerpc (and possibly ia64) before it has been read from the device-tree, which leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

On powerpc, as on ia64, a non-boot CPU's cpu_to_node()/cpu_to_mem() is only accurate after start_secondary() has run, and start_secondary() is invoked via smp_init(). Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as early_initcall()") made init_workqueues() be invoked via do_pre_smp_initcalls(), which is obviously before the secondary processors are online. Additionally, the following commits changed init_workqueues() to use cpu_to_node() to determine the node to use for kthread_create_on_node():

bce903809ab3f ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to a high number of slab deactivations (http://www.spinics.net/lists/linux-mm/msg67489.html).

Fix this by initializing the powerpc-specific CPU<->node/local memory node mapping as early as possible, which on powerpc is in do_init_bootmem(). Currently that function initializes the mapping only for the boot CPU; we extend it to set up the mapping for all possible CPUs.
Then, in smp_prepare_cpus(), we can correspondingly set the per-cpu values for all possible CPUs. That ensures that before the early_initcalls run (and really as early as possible), the per-cpu NUMA mapping is accurate.

While testing memoryless nodes on PowerKVM guests with a fix to the workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a guest topology of:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

the slab consumption decreases from

Slab:             932416 kB
SUnreclaim:       902336 kB

to

Slab:             395264 kB
SUnreclaim:       359424 kB

And we see a corresponding increase in the slab efficiency from

slab                   mem     objs    slabs
                       used   active   active
------------------------------------------------------------
kmalloc-16384       337 MB   11.28%  100.00%
task_struct         288 MB    9.93%  100.00%

to

slab                   mem     objs    slabs
                       used   active   active
------------------------------------------------------------
kmalloc-16384        37 MB  100.00%  100.00%
task_struct          31 MB  100.00%  100.00%

Powerpc didn't support memoryless nodes until recently (64bb80d87f01 "powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also helped improve memory consumption in these kinds of environments.
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 51a3ff7..91ff531 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -376,6 +376,11 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, cpu_to_node(cpu));
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, cpu_to_node(cpu));
+		/*
+		 * numa_node_id() works after this.
+		 */
+		set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+		set_cpu_numa_mem(cpu, local_memory_node(numa_cpu_lookup_table[cpu]));
 	}
 
 	cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
@@ -723,12 +728,6 @@ void start_secondary(void *unused)
 	}
 	traverse_core_siblings(cpu, true);
 
-	/*
-	 * numa_node_id() works after this.
-	 */
-	set_numa_node(numa_cpu_lookup_table[cpu]);
-	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
-
 	smp_wmb();
 	notify_cpu_starting(cpu);
 	set_cpu_online(cpu, true);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3b181b2..b1f0b86 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1049,7 +1049,7 @@ static void __init mark_reserved_regions_for_nid(int nid)
 void __init do_init_bootmem(void)
 {
-	int nid;
+	int nid, cpu;
 
 	min_low_pfn = 0;
 	max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
@@ -1122,8 +1122,15 @@ void __init do_init_bootmem(void)
 	reset_numa_cpu_lookup_table();
 	register_cpu_notifier(&ppc64_numa_nb);
-	cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
-			  (void *)(unsigned long)boot_cpuid);
+	/*
+	 * We need the numa_cpu_lookup_table to be accurate for all CPUs,
+	 * even before we online them, so that we can use cpu_to_{node,mem}
+	 * early in boot, cf. smp_prepare_cpus().
+	 */
+	for_each_possible_cpu(cpu) {
+		cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
+				  (void *)(unsigned long)cpu);
+	}
 }
 
 void __init paging_init(void)

^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC 1/2] workqueue: use the nearest NUMA node, not the local one 2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan 2014-07-17 23:15 ` [RFC 2/2] powerpc: reorder per-cpu NUMA information's initialization Nishanth Aravamudan @ 2014-07-18 8:11 ` Lai Jiangshan 2014-07-18 17:33 ` Nish Aravamudan 1 sibling, 1 reply; 13+ messages in thread From: Lai Jiangshan @ 2014-07-18 8:11 UTC (permalink / raw) To: Nishanth Aravamudan Cc: Fenghua Yu, Tony Luck, linux-ia64, linux-kernel, linux-mm, David Rientjes, Tejun Heo, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

Hi,

I'm curious about what will happen when alloc_pages_node(memoryless_node) is called. If the memory is allocated from the most preferable node for @memoryless_node, why do we need to bother using cpu_to_mem() at the caller site? If not, why does the memory allocation subsystem refuse to find a preferable node for @memoryless_node in this case? Is that intentional, or can it simply not find one in some cases?

Thanks, Lai

Added CC to Tejun (workqueue maintainer).

On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote: > In the presence of memoryless nodes, the workqueue code incorrectly uses > cpu_to_node() to determine what node to prefer memory allocations come > from. cpu_to_mem() should be used instead, which will use the nearest > NUMA node with memory. > > Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index 35974ac..0bba022 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) > for_each_node(node) { > if (cpumask_subset(pool->attrs->cpumask, > wq_numa_possible_cpumask[node])) { > - pool->node = node; > + /* > + * We could use local_memory_node(node) here, > + * but it is expensive and the following caches > + * the same value.
> + */ > + pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask)); > break; > } > } > @@ -4921,7 +4926,7 @@ static int __init init_workqueues(void) > pool->cpu = cpu; > cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu)); > pool->attrs->nice = std_nice[i++]; > - pool->node = cpu_to_node(cpu); > + pool->node = cpu_to_mem(cpu); > > /* alloc pool ID */ > mutex_lock(&wq_pool_mutex); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 1/2] workqueue: use the nearest NUMA node, not the local one 2014-07-18 8:11 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Lai Jiangshan @ 2014-07-18 17:33 ` Nish Aravamudan 0 siblings, 0 replies; 13+ messages in thread From: Nish Aravamudan @ 2014-07-18 17:33 UTC (permalink / raw) To: Lai Jiangshan Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Tejun Heo, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li [-- Attachment #1: Type: text/plain, Size: 5911 bytes --]

[ Apologies for replying from a different address, we have a service outage at work. ]

On Fri, Jul 18, 2014 at 1:11 AM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>
> Hi,

Thank you for your response!

> I'm curious about what will happen when alloc_pages_node(memoryless_node) is called.

alloc_pages_node() is only involved in one of the possible paths (maybe this occurs on x86 with THREAD_INFO > PAGE_SIZE?). On powerpc, though, that's not the case. Details:

1. pool->node is used in the invocation of kthread_create_on_node() in create_worker().
2. kthread_create_on_node() sets up a struct kthread_create_info with create->node = node and wakes up kthreadd.
3. kthreadd calls create_kthread(), which sets current->pref_node_fork = create->node.
4. dup_task_struct() calls node = tsk_fork_get_node() before invoking alloc_task_struct_node(node) and alloc_thread_info_node(node).
5. tsk_fork_get_node() returns current->pref_node_fork for kthreadd.
6. alloc_task_struct_node() calls kmem_cache_alloc_node(, GFP_KERNEL, node).
7. alloc_thread_info_node() either calls kmem_cache_alloc_node(, GFP_KERNEL, node) or alloc_kmem_pages_node(node, GFP_KERNEL, ), depending on the size of THREAD_INFO relative to PAGE_SIZE.
8a. alloc_kmem_pages_node() -> alloc_pages_node() -> __alloc_pages() with a zonelist built from node_zonelist(). This should lead to proper fallback.
8b. kmem_cache_alloc_node() calls slab_alloc_node().
9. For a memoryless node, we will trigger the following:

	if (unlikely(!object || !node_match(page, node))) {
		object = __slab_alloc(s, gfpflags, node, addr, c);
		stat(s, ALLOC_SLOWPATH);
	}

10. __slab_alloc() in turn will:

	if (unlikely(!node_match(page, node))) {
		stat(s, ALLOC_NODE_MISMATCH);
		deactivate_slab(s, page, c->freelist);
		c->page = NULL;
		c->freelist = NULL;
		goto new_slab;
	}

deactivating the slab. Thus, every kthread created with a node specification leads to a single object on a slab. We see an explosion in the slab consumption, all of which is unreclaimable.

Anton originally proposed not deactivating slabs when we *know* the allocation will be remote (i.e., from a memoryless node). Joonsoo and Christoph disagreed with this and proposed alternative solutions, which weren't agreed upon at the time.

> If the memory is allocated from the most preferable node for @memoryless_node,
> why do we need to bother using cpu_to_mem() at the caller site?

The reason is that the node passed in is a hint to the MM subsystem about which node we want memory to come from. Well, I take that back; I think semantically there are two ways to interpret the node parameter:

1) The NUMA node we want memory from
2) The NUMA node we expect memory from

The path through the MM above sort of conflates the two: the caller specified an impossible request (which the MM subsystem technically knows, but which knowledge it isn't using at this point) for memory from a node that has none. We could change the core MM to do better in the presence of memoryless nodes, and should, but this seems far less invasive and does the right thing. Semantically, I think the workqueue's pool->node is meant to be the node from which we want memory allocated, which is the node with memory closest to the CPU.

Thanks, Nish

> If not, why does the memory allocation subsystem refuse to find a preferable node
> for @memoryless_node in this case? Is that intentional, or can it simply not find
> one in some cases?
> > Thanks, > Lai > > Added CC to Tejun (workqueue maintainer). > > On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote: > > In the presence of memoryless nodes, the workqueue code incorrectly uses > > cpu_to_node() to determine what node to prefer memory allocations come > > from. cpu_to_mem() should be used instead, which will use the nearest > > NUMA node with memory. > > > > Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> > > > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > > index 35974ac..0bba022 100644 > > --- a/kernel/workqueue.c > > +++ b/kernel/workqueue.c > > @@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) > > for_each_node(node) { > > if (cpumask_subset(pool->attrs->cpumask, > > wq_numa_possible_cpumask[node])) { > > - pool->node = node; > > + /* > > + * We could use local_memory_node(node) here, > > + * but it is expensive and the following caches > > + * the same value. > > + */ > > + pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask)); > > break; > > } > > } > > @@ -4921,7 +4926,7 @@ static int __init init_workqueues(void) > > pool->cpu = cpu; > > cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu)); > > pool->attrs->nice = std_nice[i++]; > > - pool->node = cpu_to_node(cpu); > > + pool->node = cpu_to_mem(cpu); > > > > /* alloc pool ID */ > > mutex_lock(&wq_pool_mutex); > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ [-- Attachment #2: Type: text/html, Size: 7835 bytes --] ^ permalink raw reply 
[flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-17 23:09 [RFC 0/2] Memoryless nodes and kworker Nishanth Aravamudan 2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan @ 2014-07-18 11:20 ` Tejun Heo 2014-07-18 17:42 ` Nish Aravamudan 1 sibling, 1 reply; 13+ messages in thread From: Tejun Heo @ 2014-07-18 11:20 UTC (permalink / raw) To: Nishanth Aravamudan Cc: Fenghua Yu, Tony Luck, linux-ia64, linux-kernel, linux-mm, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li On Thu, Jul 17, 2014 at 04:09:23PM -0700, Nishanth Aravamudan wrote: > [Apologies for the large Cc list, but I believe we have the following > interested parties: > > x86 (recently posted memoryless node support) > ia64 (existing memoryless node support) > ppc (existing memoryless node support) > previous discussion of how to solve Anton's issue with slab usage > workqueue contributors/maintainers] Well, you forgot to cc me. ... > It turns out we see this large slab usage due to using the wrong NUMA > information when creating kthreads. > > Two changes are required, one of which is in the workqueue code and one > of which is in the powerpc initialization. Note that ia64 may want to > consider something similar. Wasn't there a thread on this exact subject a few weeks ago? Was that someone else? Memory-less node detail leaking out of allocator proper isn't a good idea. Please allow allocator users to specify the nodes they're on and let the allocator layer deal with mapping that to whatever is appropriate. Please don't push that to everybody. Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-18 11:20 ` [RFC 0/2] Memoryless nodes and kworker Tejun Heo @ 2014-07-18 17:42 ` Nish Aravamudan 2014-07-18 18:00 ` Tejun Heo 0 siblings, 1 reply; 13+ messages in thread From: Nish Aravamudan @ 2014-07-18 17:42 UTC (permalink / raw) To: Tejun Heo Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li [-- Attachment #1: Type: text/plain, Size: 2552 bytes --] Hi Tejun, On Fri, Jul 18, 2014 at 4:20 AM, Tejun Heo <tj@kernel.org> wrote: > > On Thu, Jul 17, 2014 at 04:09:23PM -0700, Nishanth Aravamudan wrote: > > [Apologies for the large Cc list, but I believe we have the following > > interested parties: > > > > x86 (recently posted memoryless node support) > > ia64 (existing memoryless node support) > > ppc (existing memoryless node support) > > previous discussion of how to solve Anton's issue with slab usage > > workqueue contributors/maintainers] > > Well, you forgot to cc me. Ah I'm very sorry! That's what I get for editing e-mails... Thank you for your reply! > ... > > It turns out we see this large slab usage due to using the wrong NUMA > > information when creating kthreads. > > > > Two changes are required, one of which is in the workqueue code and one > > of which is in the powerpc initialization. Note that ia64 may want to > > consider something similar. > > Wasn't there a thread on this exact subject a few weeks ago? Was that > someone else? Memory-less node detail leaking out of allocator proper > isn't a good idea. Please allow allocator users to specify the nodes > they're on and let the allocator layer deal with mapping that to > whatever is appropriate. Please don't push that to everybody. I didn't send anything for the workqueue logic anytime recently. Jiang sent out a patchset for x86 memoryless node support, which may have touched kernel/workqueue.c. 
So, to be clear, this is not *necessarily* about memoryless nodes. It's about the semantics intended. The workqueue code currently calls cpu_to_node() in a few places, and passes that node into the core MM as a hint about where the memory should come from. However, when memoryless nodes are present, that hint is guaranteed to be wrong, as it's the nearest NUMA node to the CPU (which happens to be the one it's on), not the nearest NUMA node with memory. The hint is correctly specified as cpu_to_mem(), which does the right thing in the presence or absence of memoryless nodes. And I think it encapsulates the hint's semantics correctly -- please give me memory from where I expect it, which is the closest NUMA node with memory.

I guess we could also change tsk_fork_get_node() to return local_memory_node(tsk->pref_node_fork), but that can be a bit expensive, as it generates a new zonelist each time to determine the first fallback node. We get the exact same semantics (because cpu_to_mem() caches the result of local_memory_node()) by using cpu_to_mem() directly.

Again, apologies for not Cc'ing you originally. -Nish

[-- Attachment #2: Type: text/html, Size: 2947 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-18 17:42 ` Nish Aravamudan @ 2014-07-18 18:00 ` Tejun Heo 2014-07-18 18:01 ` Tejun Heo 2014-07-18 18:12 ` Nish Aravamudan 0 siblings, 2 replies; 13+ messages in thread From: Tejun Heo @ 2014-07-18 18:00 UTC (permalink / raw) To: Nish Aravamudan Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li Hello, On Fri, Jul 18, 2014 at 10:42:29AM -0700, Nish Aravamudan wrote: > So, to be clear, this is not *necessarily* about memoryless nodes. It's > about the semantics intended. The workqueue code currently calls > cpu_to_node() in a few places, and passes that node into the core MM as a > hint about where the memory should come from. However, when memoryless > nodes are present, that hint is guaranteed to be wrong, as it's the nearest > NUMA node to the CPU (which happens to be the one its on), not the nearest > NUMA node with memory. The hint is correctly specified as cpu_to_mem(), It's telling the allocator the node the CPU is on. Choosing and falling back the actual allocation is the allocator's job. > which does the right thing in the presence or absence of memoryless nodes. > And I think encapsulates the hint's semantics correctly -- please give me > memory from where I expect it, which is the closest NUMA node. I don't think it does. It loses information at too high a layer. Workqueue here doesn't care how memory subsystem is structured, it's just telling the allocator where it's at and expecting it to do the right thing. Please consider the following scenario. A - B - C - D - E Let's say C is a memory-less node. If we map from C to either B or D from individual users and that node can't serve that memory request, the allocator would fall back to A or E respectively when the right thing to do would be falling back to D or B respectively, right? 
This isn't a huge issue but it shows that this is the wrong layer to deal with this issue. Let the allocators express where they are. Choosing and falling back belong to the memory allocator. That's the only place which has all the information that's necessary and those details must be contained there. Please don't leak it to memory allocator users. Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-18 18:00 ` Tejun Heo @ 2014-07-18 18:01 ` Tejun Heo 2014-07-18 18:12 ` Nish Aravamudan 1 sibling, 0 replies; 13+ messages in thread From: Tejun Heo @ 2014-07-18 18:01 UTC (permalink / raw) To: Nish Aravamudan Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li On Fri, Jul 18, 2014 at 02:00:08PM -0400, Tejun Heo wrote: > This isn't a huge issue but it shows that this is the wrong layer to > deal with this issue. Let the allocators express where they are. ^ allocator users > Choosing and falling back belong to the memory allocator. That's the > only place which has all the information that's necessary and those > details must be contained there. Please don't leak it to memory > allocator users. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-18 18:00 ` Tejun Heo @ 2014-07-18 18:12 ` Nish Aravamudan 2014-07-18 18:19 ` Tejun Heo 1 sibling, 1 reply; 13+ messages in thread From: Nish Aravamudan @ 2014-07-18 18:12 UTC (permalink / raw) To: Tejun Heo Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li [-- Attachment #1: Type: text/plain, Size: 3191 bytes --]

Hi Tejun,

[I found the other thread where you made these points, thank you for expressing them so clearly again!]

On Fri, Jul 18, 2014 at 11:00 AM, Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Jul 18, 2014 at 10:42:29AM -0700, Nish Aravamudan wrote: > > So, to be clear, this is not *necessarily* about memoryless nodes. It's > > about the semantics intended. The workqueue code currently calls > > cpu_to_node() in a few places, and passes that node into the core MM as a > > hint about where the memory should come from. However, when memoryless > > nodes are present, that hint is guaranteed to be wrong, as it's the nearest > > NUMA node to the CPU (which happens to be the one its on), not the nearest > > NUMA node with memory. The hint is correctly specified as cpu_to_mem(), > > It's telling the allocator the node the CPU is on. Choosing and > falling back the actual allocation is the allocator's job.

Ok, I agree with you then, if that's all the semantic is supposed to be. But looking at the comment for kthread_create_on_node():

 * If thread is going to be bound on a particular cpu, give its node
 * in @node, to get NUMA affinity for kthread stack, or else give -1.

so the API interprets it as a suggestion for the affinity itself, *not* the node the kthread should be on.

Piddly, yes, but actually I have another thought altogether, and in reviewing Jiang's patches this seems like the right approach: why aren't these callers using kthread_create_on_cpu()? That API was already changed to use cpu_to_mem() [so one change, rather than changes all over the kernel source]. We could change it back to cpu_to_node() and push down the knowledge about the fallback.

> > which does the right thing in the presence or absence of memoryless nodes. > > And I think encapsulates the hint's semantics correctly -- please give me > > memory from where I expect it, which is the closest NUMA node. > > I don't think it does. It loses information at too high a layer. > Workqueue here doesn't care how memory subsystem is structured, it's > just telling the allocator where it's at and expecting it to do the > right thing. Please consider the following scenario. > > A - B - C - D - E > > Let's say C is a memory-less node. If we map from C to either B or D > from individual users and that node can't serve that memory request, > the allocator would fall back to A or E respectively when the right > thing to do would be falling back to D or B respectively, right?

Yes, this is a good point. But honestly, we're not really even to the point of talking about fallback here; at least in my testing, going off-node at all causes SLUB-configured slabs to deactivate, which then leads to an explosion in the unreclaimable slab.

> This isn't a huge issue but it shows that this is the wrong layer to > deal with this issue. Let the allocators express where they are. > Choosing and falling back belong to the memory allocator. That's the > only place which has all the information that's necessary and those > details must be contained there. Please don't leak it to memory > allocator users.

Ok, I will continue to work at that level of abstraction.

Thanks, Nish

[-- Attachment #2: Type: text/html, Size: 3817 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-18 18:12 ` Nish Aravamudan @ 2014-07-18 18:19 ` Tejun Heo 2014-07-18 18:47 ` Nish Aravamudan 0 siblings, 1 reply; 13+ messages in thread From: Tejun Heo @ 2014-07-18 18:19 UTC (permalink / raw) To: Nish Aravamudan Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li Hello, On Fri, Jul 18, 2014 at 11:12:01AM -0700, Nish Aravamudan wrote: > why aren't these callers using kthread_create_on_cpu()? That API was It is using that. There just are other data structures too. > already change to use cpu_to_mem() [so one change, rather than of all over > the kernel source]. We could change it back to cpu_to_node and push down > the knowledge about the fallback. And once it's properly solved, please convert back kthread to use cpu_to_node() too. We really shouldn't be sprinkling the new subtly different variant across the kernel. It's wrong and confusing. > Yes, this is a good point. But honestly, we're not really even to the point > of talking about fallback here, at least in my testing, going off-node at > all causes SLUB-configured slabs to deactivate, which then leads to an > explosion in the unreclaimable slab. I don't think moving the logic inside allocator proper is a huge amount of work and this isn't the first spillage of this subtlety out of allocator proper. Fortunately, it hasn't spread too much yet. Let's please stop it here. I'm not saying you shouldn't or can't fix the off-node allocation. Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker
  2014-07-18 18:19         ` Tejun Heo
@ 2014-07-18 18:47           ` Nish Aravamudan
  2014-07-18 18:58             ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Nish Aravamudan @ 2014-07-18 18:47 UTC (permalink / raw)
To: Tejun Heo
Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan,
	linux-kernel@vger.kernel.org, Linux Memory Management List,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

On Fri, Jul 18, 2014 at 11:19 AM, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 18, 2014 at 11:12:01AM -0700, Nish Aravamudan wrote:
> > why aren't these callers using kthread_create_on_cpu()? That API was
>
> It is using that. There just are other data structures too.

Sorry, I might not have been clear. Why are any callers of the form kthread_create_on_node(..., cpu_to_node(cpu), ...) not using kthread_create_on_cpu(..., cpu, ...)?

In total, in Linus' tree, there are only two callers of kthread_create_on_cpu() -- smpboot_create_threads() and smpboot_register_percpu_thread(). Neither of those seems to be used by the workqueue code that I can see as of yet.

> > already changed to use cpu_to_mem() [so one change, rather than changes
> > all over the kernel source]. We could change it back to cpu_to_node and
> > push down the knowledge about the fallback.
>
> And once it's properly solved, please convert kthread back to using
> cpu_to_node() too. We really shouldn't be sprinkling the new, subtly
> different variant across the kernel. It's wrong and confusing.

I understand what you mean, but it's equally wrong for the kernel to be wasting GBs of slab. Different kinds of wrongness :)

> > Yes, this is a good point. But honestly, we're not really even to the
> > point of talking about fallback here; at least in my testing, going
> > off-node at all causes SLUB-configured slabs to deactivate, which then
> > leads to an explosion in the unreclaimable slab.
>
> I don't think moving the logic inside the allocator proper is a huge
> amount of work, and this isn't the first spillage of this subtlety out
> of the allocator proper. Fortunately, it hasn't spread too much yet.
> Let's please stop it here. I'm not saying you shouldn't or can't fix
> the off-node allocation.

It seems like an additional reasonable approach would be to provide a suitable _cpu() API for the allocators. I'm not sure why saying that callers should know about NUMA (in order to call cpu_to_node() in every caller) is any better than saying that callers should know about memoryless nodes (in order to call cpu_to_mem() in every caller instead) -- when, in at least several cases that I've seen, the relevant datum is what CPU we're expecting to run on or are running on. It seems like the _cpu API would specify: please allocate memory local to this CPU, wherever it is?

Thanks,
Nish
* Re: [RFC 0/2] Memoryless nodes and kworker
  2014-07-18 18:47           ` Nish Aravamudan
@ 2014-07-18 18:58             ` Tejun Heo
  0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2014-07-18 18:58 UTC (permalink / raw)
To: Nish Aravamudan
Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan,
	linux-kernel@vger.kernel.org, Linux Memory Management List,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

Hello,

On Fri, Jul 18, 2014 at 11:47:08AM -0700, Nish Aravamudan wrote:
> Why are any callers of the form kthread_create_on_node(...,
> cpu_to_node(cpu), ...) not using kthread_create_on_cpu(..., cpu, ...)?

Ah, okay, that's because unbound workers are NUMA node affine, not CPU affine.

> It seems like an additional reasonable approach would be to provide a
> suitable _cpu() API for the allocators. I'm not sure why saying that
> callers should know about NUMA (in order to call cpu_to_node() in every
> caller) is any better than saying that callers should know about
> memoryless nodes (in order to call cpu_to_mem() in every caller instead)

It is better because that's what they want to express -- "I'm on this memory node, please allocate memory on or close to this one." That's what the caller cares about. Calling with a cpu could be an option, but you'll eventually run into cases where you end up having to map a NUMA node id back to a CPU on it, which will probably feel at least a bit silly. There are things which really are per-NUMA node.

So, let's please express what needs to be expressed. Massaging around it can be useful at times, but that doesn't seem to be the case here.

Thanks.

-- 
tejun
end of thread, other threads:[~2014-07-18 18:58 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2014-07-17 23:09 [RFC 0/2] Memoryless nodes and kworker Nishanth Aravamudan
2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan
2014-07-17 23:15   ` [RFC 2/2] powerpc: reorder per-cpu NUMA information's initialization Nishanth Aravamudan
2014-07-18  8:11   ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Lai Jiangshan
2014-07-18 17:33     ` Nish Aravamudan
2014-07-18 11:20 ` [RFC 0/2] Memoryless nodes and kworker Tejun Heo
2014-07-18 17:42   ` Nish Aravamudan
2014-07-18 18:00     ` Tejun Heo
2014-07-18 18:01     ` Tejun Heo
2014-07-18 18:12       ` Nish Aravamudan
2014-07-18 18:19         ` Tejun Heo
2014-07-18 18:47           ` Nish Aravamudan
2014-07-18 18:58             ` Tejun Heo