* [RFC 0/2] Memoryless nodes and kworker @ 2014-07-17 23:09 Nishanth Aravamudan 2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan 2014-07-18 11:20 ` [RFC 0/2] Memoryless nodes and kworker Tejun Heo 0 siblings, 2 replies; 13+ messages in thread From: Nishanth Aravamudan @ 2014-07-17 23:09 UTC (permalink / raw) To: benh Cc: Fenghua Yu, Tony Luck, linux-ia64, linux-kernel, linux-mm, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

[Apologies for the large Cc list, but I believe we have the following interested parties:

  x86 (recently posted memoryless node support)
  ia64 (existing memoryless node support)
  ppc (existing memoryless node support)
  previous discussion of how to solve Anton's issue with slab usage
  workqueue contributors/maintainers]

There is an issue currently where NUMA information is used on powerpc (and possibly ia64) before it has been read from the device-tree, which leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
While testing memoryless nodes on PowerKVM guests with the patches in this series, with a guest topology of

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

the slab consumption decreases from

Slab:             932416 kB
SUnreclaim:       902336 kB

to

Slab:             395264 kB
SUnreclaim:       359424 kB

And we see a corresponding increase in the slab efficiency from

slab                   mem     objs    slabs
                       used   active   active
------------------------------------------------------------
kmalloc-16384       337 MB   11.28%  100.00%
task_struct         288 MB    9.93%  100.00%

to

slab                   mem     objs    slabs
                       used   active   active
------------------------------------------------------------
kmalloc-16384        37 MB  100.00%  100.00%
task_struct          31 MB  100.00%  100.00%

It turns out we see this large slab usage due to using the wrong NUMA information when creating kthreads. Two changes are required, one of which is in the workqueue code and one of which is in the powerpc initialization. Note that ia64 may want to consider something similar.

^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC 1/2] workqueue: use the nearest NUMA node, not the local one 2014-07-17 23:09 [RFC 0/2] Memoryless nodes and kworker Nishanth Aravamudan @ 2014-07-17 23:09 ` Nishanth Aravamudan 2014-07-17 23:15 ` [RFC 2/2] powerpc: reorder per-cpu NUMA information's initialization Nishanth Aravamudan 2014-07-18 8:11 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Lai Jiangshan 2014-07-18 11:20 ` [RFC 0/2] Memoryless nodes and kworker Tejun Heo 1 sibling, 2 replies; 13+ messages in thread From: Nishanth Aravamudan @ 2014-07-17 23:09 UTC (permalink / raw) To: benh Cc: Fenghua Yu, Tony Luck, linux-ia64, linux-kernel, linux-mm, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

In the presence of memoryless nodes, the workqueue code incorrectly uses cpu_to_node() to determine what node to prefer memory allocations come from. cpu_to_mem() should be used instead, which will use the nearest NUMA node with memory.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 35974ac..0bba022 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 	for_each_node(node) {
 		if (cpumask_subset(pool->attrs->cpumask,
 				   wq_numa_possible_cpumask[node])) {
-			pool->node = node;
+			/*
+			 * We could use local_memory_node(node) here,
+			 * but it is expensive and the following caches
+			 * the same value.
+			 */
+			pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask));
 			break;
 		}
 	}
@@ -4921,7 +4926,7 @@ static int __init init_workqueues(void)
 			pool->cpu = cpu;
 			cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
 			pool->attrs->nice = std_nice[i++];
-			pool->node = cpu_to_node(cpu);
+			pool->node = cpu_to_mem(cpu);
 
 			/* alloc pool ID */
 			mutex_lock(&wq_pool_mutex);

^ permalink raw reply related [flat|nested] 13+ messages in thread
* [RFC 2/2] powerpc: reorder per-cpu NUMA information's initialization 2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan @ 2014-07-17 23:15 ` Nishanth Aravamudan 2014-07-18 8:11 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Lai Jiangshan 1 sibling, 0 replies; 13+ messages in thread From: Nishanth Aravamudan @ 2014-07-17 23:15 UTC (permalink / raw) To: benh Cc: Fenghua Yu, Tony Luck, linux-ia64, linux-kernel, linux-mm, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

There is an issue currently where NUMA information is used on powerpc (and possibly ia64) before it has been read from the device-tree, which leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

On powerpc, as on ia64, a non-boot CPU's cpu_to_node()/cpu_to_mem() is only accurate after start_secondary() has run, and start_secondary() is invoked via smp_init(). Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as early_initcall()") made init_workqueues() be invoked via do_pre_smp_initcalls(), which is obviously before the secondary processors are online. Additionally, the following commits changed init_workqueues() to use cpu_to_node() to determine the node to use for kthread_create_on_node():

bce903809ab3f ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to a high number of slab deactivations (http://www.spinics.net/lists/linux-mm/msg67489.html).

Fix this by initializing the powerpc-specific CPU<->node/local memory node mapping as early as possible, which on powerpc is in do_init_bootmem(). Currently that function initializes the mapping only for the boot CPU; we extend it to set up the mapping for all possible CPUs.
Then, in smp_prepare_cpus(), we can correspondingly set the per-cpu values for all possible CPUs. That ensures that before the early_initcalls run (and really as early as possible), the per-cpu NUMA mapping is accurate.

While testing memoryless nodes on PowerKVM guests with a fix to the workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a guest topology of:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

the slab consumption decreases from

Slab:             932416 kB
SUnreclaim:       902336 kB

to

Slab:             395264 kB
SUnreclaim:       359424 kB

And we see a corresponding increase in the slab efficiency from

slab                   mem     objs    slabs
                       used   active   active
------------------------------------------------------------
kmalloc-16384       337 MB   11.28%  100.00%
task_struct         288 MB    9.93%  100.00%

to

slab                   mem     objs    slabs
                       used   active   active
------------------------------------------------------------
kmalloc-16384        37 MB  100.00%  100.00%
task_struct          31 MB  100.00%  100.00%

Powerpc didn't support memoryless nodes until recently (64bb80d87f01 "powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also helped improve memory consumption in these kinds of environments.
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 51a3ff7..91ff531 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -376,6 +376,11 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, cpu_to_node(cpu));
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, cpu_to_node(cpu));
+		/*
+		 * numa_node_id() works after this.
+		 */
+		set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+		set_cpu_numa_mem(cpu, local_memory_node(numa_cpu_lookup_table[cpu]));
 	}
 
 	cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
@@ -723,12 +728,6 @@ void start_secondary(void *unused)
 	}
 	traverse_core_siblings(cpu, true);
 
-	/*
-	 * numa_node_id() works after this.
-	 */
-	set_numa_node(numa_cpu_lookup_table[cpu]);
-	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
-
 	smp_wmb();
 	notify_cpu_starting(cpu);
 	set_cpu_online(cpu, true);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3b181b2..b1f0b86 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1049,7 +1049,7 @@ static void __init mark_reserved_regions_for_nid(int nid)
 void __init do_init_bootmem(void)
 {
-	int nid;
+	int nid, cpu;
 
 	min_low_pfn = 0;
 	max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
@@ -1122,8 +1122,15 @@ void __init do_init_bootmem(void)
 	reset_numa_cpu_lookup_table();
 	register_cpu_notifier(&ppc64_numa_nb);
-	cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
-			  (void *)(unsigned long)boot_cpuid);
+	/*
+	 * We need the numa_cpu_lookup_table to be accurate for all CPUs,
+	 * even before we online them, so that we can use cpu_to_{node,mem}
+	 * early in boot, cf. smp_prepare_cpus().
+	 */
+	for_each_possible_cpu(cpu) {
+		cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
+				  (void *)(unsigned long)cpu);
+	}
 }
 
 void __init paging_init(void)

^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC 1/2] workqueue: use the nearest NUMA node, not the local one 2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan 2014-07-17 23:15 ` [RFC 2/2] powerpc: reorder per-cpu NUMA information's initialization Nishanth Aravamudan @ 2014-07-18 8:11 ` Lai Jiangshan 2014-07-18 17:33 ` Nish Aravamudan 1 sibling, 1 reply; 13+ messages in thread From: Lai Jiangshan @ 2014-07-18 8:11 UTC (permalink / raw) To: Nishanth Aravamudan Cc: Fenghua Yu, Tony Luck, linux-ia64, linux-kernel, linux-mm, David Rientjes, Tejun Heo, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

Hi,

I'm curious about what will happen when alloc_pages_node(memoryless_node) is called. If the memory is allocated from the most preferable node for @memoryless_node, why do we need to bother using cpu_to_mem() at the caller site? If not, why does the memory allocation subsystem refuse to find a preferable node for @memoryless_node in this case? Is that intentional, or can it simply not find one in some cases?

Thanks, Lai

Added CC to Tejun (workqueue maintainer).

On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote: > In the presence of memoryless nodes, the workqueue code incorrectly uses > cpu_to_node() to determine what node to prefer memory allocations come > from. cpu_to_mem() should be used instead, which will use the nearest > NUMA node with memory. > > Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index 35974ac..0bba022 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) > for_each_node(node) { > if (cpumask_subset(pool->attrs->cpumask, > wq_numa_possible_cpumask[node])) { > - pool->node = node; > + /* > + * We could use local_memory_node(node) here, > + * but it is expensive and the following caches > + * the same value.
> + */ > + pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask)); > break; > } > } > @@ -4921,7 +4926,7 @@ static int __init init_workqueues(void) > pool->cpu = cpu; > cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu)); > pool->attrs->nice = std_nice[i++]; > - pool->node = cpu_to_node(cpu); > + pool->node = cpu_to_mem(cpu); > > /* alloc pool ID */ > mutex_lock(&wq_pool_mutex); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 1/2] workqueue: use the nearest NUMA node, not the local one 2014-07-18 8:11 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Lai Jiangshan @ 2014-07-18 17:33 ` Nish Aravamudan 0 siblings, 0 replies; 13+ messages in thread From: Nish Aravamudan @ 2014-07-18 17:33 UTC (permalink / raw) To: Lai Jiangshan Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Tejun Heo, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li [-- Attachment #1: Type: text/plain, Size: 5911 bytes --]

[ Apologies for replying from a different address, we have a service outage at work. ]

On Fri, Jul 18, 2014 at 1:11 AM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>
> Hi,

Thank you for your response!

> I'm curious about what will happen when alloc_pages_node(memoryless_node) is called.

alloc_pages_node() is only involved in one of the possible paths (maybe this occurs on x86 with THREAD_INFO > PAGE_SIZE?). On powerpc, though, that's not the case. Details:

1. pool->node is used in the invocation of kthread_create_on_node() in create_worker().
2. kthread_create_on_node() sets up a struct kthread_create_info with create->node = node and wakes up kthreadd.
3. kthreadd calls create_kthread(), which sets current->pref_node_fork = create->node.
4. dup_task_struct() calls node = tsk_fork_get_node() before invoking alloc_task_struct_node(node) and alloc_thread_info_node(node).
5. tsk_fork_get_node() returns current->pref_node_fork for kthreadd.
6. alloc_task_struct_node() calls kmem_cache_alloc_node(, GFP_KERNEL, node).
7. alloc_thread_info_node() either calls kmem_cache_alloc_node(, GFP_KERNEL, node) or alloc_kmem_pages_node(node, GFP_KERNEL, ), depending on the size of THREAD_INFO relative to PAGE_SIZE.
8a. alloc_kmem_pages_node() -> alloc_pages_node() -> __alloc_pages() with a zonelist built from node_zonelist(). This should lead to proper fallback.
8b. kmem_cache_alloc_node() calls slab_alloc_node().
9. For a memoryless node, we will trigger the following:

	if (unlikely(!object || !node_match(page, node))) {
		object = __slab_alloc(s, gfpflags, node, addr, c);
		stat(s, ALLOC_SLOWPATH);
	}

10. __slab_alloc() in turn will:

	if (unlikely(!node_match(page, node))) {
		stat(s, ALLOC_NODE_MISMATCH);
		deactivate_slab(s, page, c->freelist);
		c->page = NULL;
		c->freelist = NULL;
		goto new_slab;
	}

deactivating the slab. Thus, every kthread created with a node specification leads to a single object on a slab. We see an explosion in the slab consumption, all of which is unreclaimable.

Anton originally proposed not deactivating slabs when we *know* the allocation will be remote (i.e., from a memoryless node). Joonsoo and Christoph disagreed with this and proposed alternative solutions, which weren't agreed upon at the time.

> If the memory is allocated from the most preferable node for @memoryless_node,
> why do we need to bother using cpu_to_mem() at the caller site?

The reason is that the node passed in is a hint to the MM subsystem about which node we want memory to come from. Well, I take that back; I think semantically there are two ways to interpret the node parameter:

1) The NUMA node we want memory from
2) The NUMA node we expect memory from

The path through the MM above sort of conflates the two: the caller specified an impossible request (which the MM subsystem technically knows, but which knowledge it isn't using at this point) for memory from a node that has none. We could change the core MM to do better in the presence of memoryless nodes, and should, but this seems far less invasive and does the right thing. Semantically, I think the workqueue's pool->node is meant to be the node from which we want memory allocated, which is the node with memory closest to the CPU.

Thanks, Nish

> If not, why does the memory allocation subsystem refuse to find a preferable node
> for @memoryless_node in this case? Is that intentional, or can it simply not find
> one in some cases?
> > Thanks, > Lai > > Added CC to Tejun (workqueue maintainer). > > On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote: > > In the presence of memoryless nodes, the workqueue code incorrectly uses > > cpu_to_node() to determine what node to prefer memory allocations come > > from. cpu_to_mem() should be used instead, which will use the nearest > > NUMA node with memory. > > > > Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> > > > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > > index 35974ac..0bba022 100644 > > --- a/kernel/workqueue.c > > +++ b/kernel/workqueue.c > > @@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) > > for_each_node(node) { > > if (cpumask_subset(pool->attrs->cpumask, > > wq_numa_possible_cpumask[node])) { > > - pool->node = node; > > + /* > > + * We could use local_memory_node(node) here, > > + * but it is expensive and the following caches > > + * the same value. > > + */ > > + pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask)); > > break; > > } > > } > > @@ -4921,7 +4926,7 @@ static int __init init_workqueues(void) > > pool->cpu = cpu; > > cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu)); > > pool->attrs->nice = std_nice[i++]; > > - pool->node = cpu_to_node(cpu); > > + pool->node = cpu_to_mem(cpu); > > > > /* alloc pool ID */ > > mutex_lock(&wq_pool_mutex); > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ [-- Attachment #2: Type: text/html, Size: 7835 bytes --] ^ permalink raw reply 
[flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-17 23:09 [RFC 0/2] Memoryless nodes and kworker Nishanth Aravamudan 2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan @ 2014-07-18 11:20 ` Tejun Heo 2014-07-18 17:42 ` Nish Aravamudan 1 sibling, 1 reply; 13+ messages in thread From: Tejun Heo @ 2014-07-18 11:20 UTC (permalink / raw) To: Nishanth Aravamudan Cc: Fenghua Yu, Tony Luck, linux-ia64, linux-kernel, linux-mm, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li On Thu, Jul 17, 2014 at 04:09:23PM -0700, Nishanth Aravamudan wrote: > [Apologies for the large Cc list, but I believe we have the following > interested parties: > > x86 (recently posted memoryless node support) > ia64 (existing memoryless node support) > ppc (existing memoryless node support) > previous discussion of how to solve Anton's issue with slab usage > workqueue contributors/maintainers] Well, you forgot to cc me. ... > It turns out we see this large slab usage due to using the wrong NUMA > information when creating kthreads. > > Two changes are required, one of which is in the workqueue code and one > of which is in the powerpc initialization. Note that ia64 may want to > consider something similar. Wasn't there a thread on this exact subject a few weeks ago? Was that someone else? Memory-less node detail leaking out of allocator proper isn't a good idea. Please allow allocator users to specify the nodes they're on and let the allocator layer deal with mapping that to whatever is appropriate. Please don't push that to everybody. Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-18 11:20 ` [RFC 0/2] Memoryless nodes and kworker Tejun Heo @ 2014-07-18 17:42 ` Nish Aravamudan 2014-07-18 18:00 ` Tejun Heo 0 siblings, 1 reply; 13+ messages in thread From: Nish Aravamudan @ 2014-07-18 17:42 UTC (permalink / raw) To: Tejun Heo Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li [-- Attachment #1: Type: text/plain, Size: 2552 bytes --] Hi Tejun, On Fri, Jul 18, 2014 at 4:20 AM, Tejun Heo <tj@kernel.org> wrote: > > On Thu, Jul 17, 2014 at 04:09:23PM -0700, Nishanth Aravamudan wrote: > > [Apologies for the large Cc list, but I believe we have the following > > interested parties: > > > > x86 (recently posted memoryless node support) > > ia64 (existing memoryless node support) > > ppc (existing memoryless node support) > > previous discussion of how to solve Anton's issue with slab usage > > workqueue contributors/maintainers] > > Well, you forgot to cc me. Ah I'm very sorry! That's what I get for editing e-mails... Thank you for your reply! > ... > > It turns out we see this large slab usage due to using the wrong NUMA > > information when creating kthreads. > > > > Two changes are required, one of which is in the workqueue code and one > > of which is in the powerpc initialization. Note that ia64 may want to > > consider something similar. > > Wasn't there a thread on this exact subject a few weeks ago? Was that > someone else? Memory-less node detail leaking out of allocator proper > isn't a good idea. Please allow allocator users to specify the nodes > they're on and let the allocator layer deal with mapping that to > whatever is appropriate. Please don't push that to everybody. I didn't send anything for the workqueue logic anytime recently. Jiang sent out a patchset for x86 memoryless node support, which may have touched kernel/workqueue.c. 
So, to be clear, this is not *necessarily* about memoryless nodes. It's about the semantics intended. The workqueue code currently calls cpu_to_node() in a few places, and passes that node into the core MM as a hint about where the memory should come from. However, when memoryless nodes are present, that hint is guaranteed to be wrong, as it's the nearest NUMA node to the CPU (which happens to be the one it's on), not the nearest NUMA node with memory. The hint is correctly specified as cpu_to_mem(), which does the right thing in the presence or absence of memoryless nodes. And I think it encapsulates the hint's semantics correctly -- please give me memory from where I expect it, which is the closest NUMA node with memory.

I guess we could also change tsk_fork_get_node() to return local_memory_node(tsk->pref_node_fork), but that can be a bit expensive, as it generates a new zonelist each time to determine the first fallback node. We get the exact same semantics (because cpu_to_mem() caches the result of local_memory_node()) by using cpu_to_mem() directly.

Again, apologies for not Cc'ing you originally. -Nish

[-- Attachment #2: Type: text/html, Size: 2947 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-18 17:42 ` Nish Aravamudan @ 2014-07-18 18:00 ` Tejun Heo 2014-07-18 18:01 ` Tejun Heo 2014-07-18 18:12 ` Nish Aravamudan 0 siblings, 2 replies; 13+ messages in thread From: Tejun Heo @ 2014-07-18 18:00 UTC (permalink / raw) To: Nish Aravamudan Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li Hello, On Fri, Jul 18, 2014 at 10:42:29AM -0700, Nish Aravamudan wrote: > So, to be clear, this is not *necessarily* about memoryless nodes. It's > about the semantics intended. The workqueue code currently calls > cpu_to_node() in a few places, and passes that node into the core MM as a > hint about where the memory should come from. However, when memoryless > nodes are present, that hint is guaranteed to be wrong, as it's the nearest > NUMA node to the CPU (which happens to be the one its on), not the nearest > NUMA node with memory. The hint is correctly specified as cpu_to_mem(), It's telling the allocator the node the CPU is on. Choosing and falling back the actual allocation is the allocator's job. > which does the right thing in the presence or absence of memoryless nodes. > And I think encapsulates the hint's semantics correctly -- please give me > memory from where I expect it, which is the closest NUMA node. I don't think it does. It loses information at too high a layer. Workqueue here doesn't care how memory subsystem is structured, it's just telling the allocator where it's at and expecting it to do the right thing. Please consider the following scenario. A - B - C - D - E Let's say C is a memory-less node. If we map from C to either B or D from individual users and that node can't serve that memory request, the allocator would fall back to A or E respectively when the right thing to do would be falling back to D or B respectively, right? 
This isn't a huge issue but it shows that this is the wrong layer to deal with this issue. Let the allocators express where they are. Choosing and falling back belong to the memory allocator. That's the only place which has all the information that's necessary and those details must be contained there. Please don't leak it to memory allocator users. Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-18 18:00 ` Tejun Heo @ 2014-07-18 18:01 ` Tejun Heo 2014-07-18 18:12 ` Nish Aravamudan 1 sibling, 0 replies; 13+ messages in thread From: Tejun Heo @ 2014-07-18 18:01 UTC (permalink / raw) To: Nish Aravamudan Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li On Fri, Jul 18, 2014 at 02:00:08PM -0400, Tejun Heo wrote: > This isn't a huge issue but it shows that this is the wrong layer to > deal with this issue. Let the allocators express where they are. ^ allocator users > Choosing and falling back belong to the memory allocator. That's the > only place which has all the information that's necessary and those > details must be contained there. Please don't leak it to memory > allocator users. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-18 18:00 ` Tejun Heo @ 2014-07-18 18:12 ` Nish Aravamudan 2014-07-18 18:19 ` Tejun Heo 1 sibling, 1 reply; 13+ messages in thread From: Nish Aravamudan @ 2014-07-18 18:12 UTC (permalink / raw) To: Tejun Heo Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li [-- Attachment #1: Type: text/plain, Size: 3191 bytes --]

Hi Tejun,

[I found the other thread where you made these points, thank you for expressing them so clearly again!]

On Fri, Jul 18, 2014 at 11:00 AM, Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Jul 18, 2014 at 10:42:29AM -0700, Nish Aravamudan wrote: > > So, to be clear, this is not *necessarily* about memoryless nodes. It's > > about the semantics intended. The workqueue code currently calls > > cpu_to_node() in a few places, and passes that node into the core MM as a > > hint about where the memory should come from. However, when memoryless > > nodes are present, that hint is guaranteed to be wrong, as it's the nearest > > NUMA node to the CPU (which happens to be the one its on), not the nearest > > NUMA node with memory. The hint is correctly specified as cpu_to_mem(), > > It's telling the allocator the node the CPU is on. Choosing and > falling back the actual allocation is the allocator's job.

Ok, I agree with you then, if that's all the semantic is supposed to be. But looking at the comment for kthread_create_on_node():

 * If thread is going to be bound on a particular cpu, give its node
 * in @node, to get NUMA affinity for kthread stack, or else give -1.

so the API interprets it as a suggestion for the affinity itself, *not* the node the kthread should be on.

Piddly, yes, but actually I have another thought altogether, and in reviewing Jiang's patches this seems like the right approach: why aren't these callers using kthread_create_on_cpu()? That API was already changed to use cpu_to_mem() [so one change, rather than changes all over the kernel source]. We could change it back to cpu_to_node() and push down the knowledge about the fallback.

> > which does the right thing in the presence or absence of memoryless nodes. > > And I think encapsulates the hint's semantics correctly -- please give me > > memory from where I expect it, which is the closest NUMA node. > > I don't think it does. It loses information at too high a layer. > Workqueue here doesn't care how memory subsystem is structured, it's > just telling the allocator where it's at and expecting it to do the > right thing. Please consider the following scenario. > > A - B - C - D - E > > Let's say C is a memory-less node. If we map from C to either B or D > from individual users and that node can't serve that memory request, > the allocator would fall back to A or E respectively when the right > thing to do would be falling back to D or B respectively, right?

Yes, this is a good point. But honestly, we're not really even to the point of talking about fallback here; at least in my testing, going off-node at all causes SLUB-configured slabs to deactivate, which then leads to an explosion in the unreclaimable slab.

> This isn't a huge issue but it shows that this is the wrong layer to > deal with this issue. Let the allocators express where they are. > Choosing and falling back belong to the memory allocator. That's the > only place which has all the information that's necessary and those > details must be contained there. Please don't leak it to memory > allocator users.

Ok, I will continue to work at that level of abstraction.

Thanks, Nish

[-- Attachment #2: Type: text/html, Size: 3817 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker 2014-07-18 18:12 ` Nish Aravamudan @ 2014-07-18 18:19 ` Tejun Heo 2014-07-18 18:47 ` Nish Aravamudan 0 siblings, 1 reply; 13+ messages in thread From: Tejun Heo @ 2014-07-18 18:19 UTC (permalink / raw) To: Nish Aravamudan Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan, linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li Hello, On Fri, Jul 18, 2014 at 11:12:01AM -0700, Nish Aravamudan wrote: > why aren't these callers using kthread_create_on_cpu()? That API was It is using that. There just are other data structures too. > already change to use cpu_to_mem() [so one change, rather than of all over > the kernel source]. We could change it back to cpu_to_node and push down > the knowledge about the fallback. And once it's properly solved, please convert back kthread to use cpu_to_node() too. We really shouldn't be sprinkling the new subtly different variant across the kernel. It's wrong and confusing. > Yes, this is a good point. But honestly, we're not really even to the point > of talking about fallback here, at least in my testing, going off-node at > all causes SLUB-configured slabs to deactivate, which then leads to an > explosion in the unreclaimable slab. I don't think moving the logic inside allocator proper is a huge amount of work and this isn't the first spillage of this subtlety out of allocator proper. Fortunately, it hasn't spread too much yet. Let's please stop it here. I'm not saying you shouldn't or can't fix the off-node allocation. Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 0/2] Memoryless nodes and kworker
  2014-07-18 18:19         ` Tejun Heo
@ 2014-07-18 18:47           ` Nish Aravamudan
  2014-07-18 18:58             ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Nish Aravamudan @ 2014-07-18 18:47 UTC (permalink / raw)
To: Tejun Heo
Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan,
	linux-kernel@vger.kernel.org, Linux Memory Management List,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

On Fri, Jul 18, 2014 at 11:19 AM, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 18, 2014 at 11:12:01AM -0700, Nish Aravamudan wrote:
> > why aren't these callers using kthread_create_on_cpu()? That API was
>
> It is using that. There just are other data structures too.

Sorry, I might not have been clear. Why are any callers of the form kthread_create_on_node(..., cpu_to_node(cpu), ...) not using kthread_create_on_cpu(..., cpu, ...)?

In total, in Linus' tree, there are only two callers of kthread_create_on_cpu() -- smpboot_create_threads() and smpboot_register_percpu_thread(). Neither of those seems to be used by the workqueue code that I can see as of yet.

> > already changed to use cpu_to_mem() [so one change, rather than changes
> > all over the kernel source]. We could change it back to cpu_to_node and
> > push down the knowledge about the fallback.
>
> And once it's properly solved, please convert kthread back to using
> cpu_to_node() too. We really shouldn't be sprinkling the new, subtly
> different variant across the kernel. It's wrong and confusing.

I understand what you mean, but it's equally wrong for the kernel to be wasting GBs of slab. Different kinds of wrongness :)

> > Yes, this is a good point. But honestly, we're not really even to the
> > point of talking about fallback here; at least in my testing, going
> > off-node at all causes SLUB-configured slabs to deactivate, which then
> > leads to an explosion in the unreclaimable slab.
>
> I don't think moving the logic inside the allocator proper is a huge
> amount of work, and this isn't the first spillage of this subtlety out
> of the allocator proper. Fortunately, it hasn't spread too much yet.
> Let's please stop it here. I'm not saying you shouldn't or can't fix
> the off-node allocation.

It seems like an additional reasonable approach would be to provide a suitable _cpu() API for the allocators. I'm not sure why saying that callers should know about NUMA (in order to call cpu_to_node() in every caller) is any better than saying that callers should know about memoryless nodes (in order to call cpu_to_mem() in every caller instead) -- when, in at least several cases that I've seen, the relevant datum is what CPU we're expecting to run on or are running on. It seems like the _cpu API would specify: please allocate memory local to this CPU, wherever it is?

Thanks,
Nish
* Re: [RFC 0/2] Memoryless nodes and kworker
  2014-07-18 18:47           ` Nish Aravamudan
@ 2014-07-18 18:58             ` Tejun Heo
  0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2014-07-18 18:58 UTC (permalink / raw)
To: Nish Aravamudan
Cc: Fenghua Yu, Tony Luck, linux-ia64, Nishanth Aravamudan,
	linux-kernel@vger.kernel.org, Linux Memory Management List,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Jiang Liu, Wanpeng Li

Hello,

On Fri, Jul 18, 2014 at 11:47:08AM -0700, Nish Aravamudan wrote:
> Why are any callers of the form kthread_create_on_node(...,
> cpu_to_node(cpu), ...) not using kthread_create_on_cpu(..., cpu, ...)?

Ah, okay, that's because unbound workers are NUMA node affine, not CPU affine.

> It seems like an additional reasonable approach would be to provide a
> suitable _cpu() API for the allocators. I'm not sure why saying that
> callers should know about NUMA (in order to call cpu_to_node() in every
> caller) is any better than saying that callers should know about
> memoryless nodes (in order to call cpu_to_mem() in every caller instead)

It is better because that's what they want to express -- "I'm on this memory node, please allocate memory on or close to this one." That's what the caller cares about. Calling with a cpu could be an option, but you'll eventually run into cases where you end up having to map a NUMA node id back to a CPU on it, which will probably feel at least a bit silly. There are things which really are per-NUMA node.

So, let's please express what needs to be expressed. Massaging around it can be useful at times, but that doesn't seem to be the case here.

Thanks.

-- 
tejun
end of thread, other threads:[~2014-07-18 18:58 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2014-07-17 23:09 [RFC 0/2] Memoryless nodes and kworker Nishanth Aravamudan
2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan
2014-07-17 23:15   ` [RFC 2/2] powerpc: reorder per-cpu NUMA information's initialization Nishanth Aravamudan
2014-07-18  8:11   ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Lai Jiangshan
2014-07-18 17:33     ` Nish Aravamudan
2014-07-18 11:20 ` [RFC 0/2] Memoryless nodes and kworker Tejun Heo
2014-07-18 17:42   ` Nish Aravamudan
2014-07-18 18:00     ` Tejun Heo
2014-07-18 18:01     ` Tejun Heo
2014-07-18 18:12       ` Nish Aravamudan
2014-07-18 18:19         ` Tejun Heo
2014-07-18 18:47           ` Nish Aravamudan
2014-07-18 18:58             ` Tejun Heo