* [PATCH] online CPU before memory failed in pcpu_alloc_pages()
@ 2010-05-18 6:17 minskey guo
2010-05-20 20:43 ` Andrew Morton
0 siblings, 1 reply; 14+ messages in thread
From: minskey guo @ 2010-05-18 6:17 UTC (permalink / raw)
To: akpm, linux-mm; +Cc: prarit, andi.kleen, linux-kernel, minskey guo
From: minskey guo <chaohong.guo@intel.com>
The operation of "enable CPU to online before memory within a node"
fails in some cases, according to Prarit. The warnings are as follows:
Pid: 7440, comm: bash Not tainted 2.6.32 #2
Call Trace:
[<ffffffff81155985>] pcpu_alloc+0xa05/0xa70
[<ffffffff81155a20>] __alloc_percpu+0x10/0x20
[<ffffffff81089605>] __create_workqueue_key+0x75/0x280
[<ffffffff8110e050>] ? __build_all_zonelists+0x0/0x5d0
[<ffffffff810c1eba>] stop_machine_create+0x3a/0xb0
[<ffffffff810c1f57>] stop_machine+0x27/0x60
[<ffffffff8110f1a0>] build_all_zonelists+0xd0/0x2b0
[<ffffffff814c1d12>] cpu_up+0xb3/0xe3
[<ffffffff814b3c40>] store_online+0x70/0xa0
[<ffffffff81326100>] sysdev_store+0x20/0x30
[<ffffffff811d29a5>] sysfs_write_file+0xe5/0x170
[<ffffffff81163d28>] vfs_write+0xb8/0x1a0
[<ffffffff810cfd22>] ? audit_syscall_entry+0x252/0x280
[<ffffffff81164761>] sys_write+0x51/0x90
[<ffffffff81013132>] system_call_fastpath+0x16/0x1b
Built 4 zonelists in Zone order, mobility grouping on. Total pages: 12331603
PERCPU: allocation failed, size=128 align=64, failed to populate
With the "enable CPU to online before memory" patch, when the first CPU
of an offlined node is being onlined, we build the zonelists for that
node. If the per-cpu area needs to be extended while those zonelists are
being built, alloc_pages_node() is called. It fails on the node being
onlined because that node does not have its zonelists created yet.
To fix this issue, we try to allocate memory from the current node
instead.
Signed-off-by: minskey guo <chaohong.guo@intel.com>
---
mm/percpu.c | 18 +++++++++++++++++-
1 files changed, 17 insertions(+), 1 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 6e09741..fabdb10 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
{
const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
unsigned int cpu;
+ int nid;
int i;
for_each_possible_cpu(cpu) {
for (i = page_start; i < page_end; i++) {
struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
- *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+ nid = cpu_to_node(cpu);
+
+ /*
+ * It is allowable to online a CPU within a NUMA
+ * node which doesn't have onlined local memory.
+ * In this case, we need to create zonelists for
+ * that node when the cpu is being onlined. If the
+ * per-cpu area needs to be extended at the exact
+ * time when the zonelists of that node are being
+ * created, we allocate memory from the current node.
+ */
+ if ((nid == -1) ||
+ !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
+ nid = numa_node_id();
+
+ *pagep = alloc_pages_node(nid, gfp, 0);
if (!*pagep) {
pcpu_free_pages(chunk, pages, populated,
page_start, page_end);
--
1.7.0.4
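The node-selection fallback in the patch above can be modeled as a small standalone function (a sketch only: has_zonelists[] stands in for the node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone check, current_node for numa_node_id(), and neither name is a kernel API):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/*
 * Standalone model of the fallback in pcpu_alloc_pages() above:
 * if the CPU's node is unknown, or its zonelists have not been built
 * yet (the node is still being onlined), allocate from the node the
 * calling CPU runs on instead.
 */
static int pick_alloc_node(int nid, const int *has_zonelists, int current_node)
{
	if (nid == NUMA_NO_NODE || !has_zonelists[nid])
		return current_node;	/* target node not ready yet */
	return nid;			/* target node can satisfy the allocation */
}
```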
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-18 6:17 [PATCH] online CPU before memory failed in pcpu_alloc_pages() minskey guo
@ 2010-05-20 20:43 ` Andrew Morton
2010-05-21 0:55 ` Stephen Rothwell
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Andrew Morton @ 2010-05-20 20:43 UTC (permalink / raw)
To: minskey guo
Cc: linux-mm, prarit, andi.kleen, linux-kernel, minskey guo,
Tejun Heo, stable
On Tue, 18 May 2010 14:17:22 +0800
minskey guo <chaohong_guo@linux.intel.com> wrote:
> From: minskey guo <chaohong.guo@intel.com>
>
> The operation of "enable CPU to online before memory within a node"
> fails in some case according to Prarit. The warnings as follows:
>
> Pid: 7440, comm: bash Not tainted 2.6.32 #2
> Call Trace:
> [<ffffffff81155985>] pcpu_alloc+0xa05/0xa70
> [<ffffffff81155a20>] __alloc_percpu+0x10/0x20
> [<ffffffff81089605>] __create_workqueue_key+0x75/0x280
> [<ffffffff8110e050>] ? __build_all_zonelists+0x0/0x5d0
> [<ffffffff810c1eba>] stop_machine_create+0x3a/0xb0
> [<ffffffff810c1f57>] stop_machine+0x27/0x60
> [<ffffffff8110f1a0>] build_all_zonelists+0xd0/0x2b0
> [<ffffffff814c1d12>] cpu_up+0xb3/0xe3
> [<ffffffff814b3c40>] store_online+0x70/0xa0
> [<ffffffff81326100>] sysdev_store+0x20/0x30
> [<ffffffff811d29a5>] sysfs_write_file+0xe5/0x170
> [<ffffffff81163d28>] vfs_write+0xb8/0x1a0
> [<ffffffff810cfd22>] ? audit_syscall_entry+0x252/0x280
> [<ffffffff81164761>] sys_write+0x51/0x90
> [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
> Built 4 zonelists in Zone order, mobility grouping on. Total pages: 12331603
> PERCPU: allocation failed, size=128 align=64, failed to populate
>
> With "enable CPU to online before memory" patch, when the 1st CPU of
> an offlined node is being onlined, we build zonelists for that node.
> If per-cpu area needs to be extended during zonelists building period,
> alloc_pages_node() will be called. The routine alloc_pages_node() fails
> on the node in-onlining because the node doesn't have zonelists created
> yet.
>
> To fix this issue, we try to alloc memory from current node.
How serious is this issue? Just a warning? Dead box?
Because if we want to port this fix into 2.6.34.x, we have a little
problem.
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
mm/percpu-vm.c. So either
a) the -stable guys will need to patch a different file or
b) we apply this fix first and muck up Tejun's tree or
c) the bug isn't very serious so none of this applies.
> {
> const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
> unsigned int cpu;
> + int nid;
> int i;
>
> for_each_possible_cpu(cpu) {
> for (i = page_start; i < page_end; i++) {
> struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
>
> - *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
> + nid = cpu_to_node(cpu);
> +
> + /*
> + * It is allowable to online a CPU within a NUMA
> + * node which doesn't have onlined local memory.
> + * In this case, we need to create zonelists for
> + * that node when cpu is being onlined. If per-cpu
> + * area needs to be extended at the exact time when
> + * zonelists of that node is being created, we alloc
> + * memory from current node.
> + */
> + if ((nid == -1) ||
> + !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
> + nid = numa_node_id();
> +
> + *pagep = alloc_pages_node(nid, gfp, 0);
> if (!*pagep) {
> pcpu_free_pages(chunk, pages, populated,
> page_start, page_end);
* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-20 20:43 ` Andrew Morton
@ 2010-05-21 0:55 ` Stephen Rothwell
2010-05-21 4:44 ` KAMEZAWA Hiroyuki
2010-05-21 4:05 ` Guo, Chaohong
2010-05-21 7:29 ` Kleen, Andi
2 siblings, 1 reply; 14+ messages in thread
From: Stephen Rothwell @ 2010-05-21 0:55 UTC (permalink / raw)
To: Andrew Morton
Cc: minskey guo, linux-mm, prarit, andi.kleen, linux-kernel,
minskey guo, Tejun Heo, stable
Hi Andrew,
On Thu, 20 May 2010 13:43:59 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>
> In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
> mm/percpu-vm.c. So either
This has gone into Linus' tree today ...
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-21 0:55 ` Stephen Rothwell
@ 2010-05-21 4:44 ` KAMEZAWA Hiroyuki
2010-05-21 8:22 ` minskey guo
2010-05-21 12:32 ` Lee Schermerhorn
0 siblings, 2 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-21 4:44 UTC (permalink / raw)
To: Stephen Rothwell
Cc: Andrew Morton, minskey guo, linux-mm, prarit, andi.kleen,
linux-kernel, minskey guo, Tejun Heo, stable
On Fri, 21 May 2010 10:55:12 +1000
Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> Hi Andrew,
>
> On Thu, 20 May 2010 13:43:59 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > > --- a/mm/percpu.c
> > > +++ b/mm/percpu.c
> > > @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
> >
> > In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
> > mm/percpu-vm.c. So either
>
> This has gone into Linus' tree today ...
>
Hmm, a comment here.
Recently, Lee Schermerhorn developed
numa-introduce-numa_mem_id-effective-local-memory-node-id-fix2.patch
Then, you can use cpu_to_mem() instead of cpu_to_node() to find the
nearest available node.
I haven't checked whether cpu_to_mem() is synchronized with NUMA
hotplug, but using cpu_to_mem() rather than adding
=
+ if ((nid == -1) ||
+ !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
+ nid = numa_node_id();
+
==
would be better.
Thanks,
-Kame
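If cpu_to_mem() were kept in sync with hotplug as Kame suggests, the allocator would need no zonelist check of its own; the idea can be sketched in userspace as a precomputed per-CPU "nearest node with memory" mapping (illustrative names and fallback policy, not the kernel implementation):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/*
 * Model of the cpu_to_mem() idea: whenever zonelists are (re)built,
 * recompute for each CPU the node it should allocate from, so the
 * allocation path can use the mapping directly. fallback_node stands
 * in for whatever "nearest node with memory" policy the kernel uses.
 */
static void recompute_cpu_to_mem(int ncpus, const int *cpu_to_node,
				 const int *node_has_mem, int fallback_node,
				 int *cpu_to_mem)
{
	for (int cpu = 0; cpu < ncpus; cpu++) {
		int nid = cpu_to_node[cpu];

		if (nid != NUMA_NO_NODE && node_has_mem[nid])
			cpu_to_mem[cpu] = nid;		/* local memory exists */
		else
			cpu_to_mem[cpu] = fallback_node; /* nearest with memory */
	}
}
```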
* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-21 4:44 ` KAMEZAWA Hiroyuki
@ 2010-05-21 8:22 ` minskey guo
2010-05-21 8:39 ` KAMEZAWA Hiroyuki
2010-05-21 12:32 ` Lee Schermerhorn
1 sibling, 1 reply; 14+ messages in thread
From: minskey guo @ 2010-05-21 8:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Stephen Rothwell, Andrew Morton, linux-mm, prarit, andi.kleen,
linux-kernel, minskey guo, Tejun Heo, stable
>>>> --- a/mm/percpu.c
>>>> +++ b/mm/percpu.c
>>>> @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>>>
>>> In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
>>> mm/percpu-vm.c. So either
>>
>> This has gone into Linus' tree today ...
>>
>
> Hmm, a comment here.
>
> Recently, Lee Schermerhorn developed
>
> numa-introduce-numa_mem_id-effective-local-memory-node-id-fix2.patch
>
> Then, you can use cpu_to_mem() instead of cpu_to_node() to find the
> nearest available node.
> I don't check cpu_to_mem() is synchronized with NUMA hotplug but
> using cpu_to_mem() rather than adding
> =
>
> + if ((nid == -1) ||
> + !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
> + nid = numa_node_id();
> +
> ==
>
> is better.
Yes, I can use cpu_to_mem(). There is only a small difference during
CPU online: the first CPU within a memoryless node gets memory from the
current node or from the node to which cpu0 belongs.
But I have a question about the patch:
numa-slab-use-numa_mem_id-for-slab-local-memory-node.patch,
@@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
...
- for_each_possible_cpu(cpu)
+ for_each_possible_cpu(cpu) {
setup_pageset(&per_cpu(boot_pageset, cpu), 0);
...
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+ if (cpu_online(cpu))
+ cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
+#endif
Look at the last two lines. Suppose that memory is onlined before the
CPUs: where will cpu_to_mem(cpu) be set to the right node id for the
last onlined CPU? Does that CPU always get memory from the node
containing cpu0 for the slab allocator, where cpu_to_mem() is used?
thanks,
-minskey
* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-21 8:22 ` minskey guo
@ 2010-05-21 8:39 ` KAMEZAWA Hiroyuki
2010-05-21 9:12 ` minskey guo
0 siblings, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-21 8:39 UTC (permalink / raw)
To: minskey guo
Cc: Stephen Rothwell, Andrew Morton, linux-mm, prarit, andi.kleen,
linux-kernel, minskey guo, Tejun Heo, stable
On Fri, 21 May 2010 16:22:19 +0800
minskey guo <chaohong_guo@linux.intel.com> wrote:
> Yes. I can use cpu_to_mem(). only some little difference during
> CPU online: 1st cpu within memoryless node gets memory from current
> node or the node to which the cpu0 belongs,
>
>
> But I have a question about the patch:
>
> numa-slab-use-numa_mem_id-for-slab-local-memory-node.patch,
>
>
>
>
> @@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
> ...
>
> - for_each_possible_cpu(cpu)
> + for_each_possible_cpu(cpu) {
> setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> ...
>
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> + if (cpu_online(cpu))
> + cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
> +#endif
>
>
> Look at the last two lines, suppose that memory is onlined before CPUs,
> where will cpu_to_mem(cpu) be set to the right nodeid for the last
> onlined cpu ? Does that CPU always get memory from the node including
> cpu0 for slab allocator where cpu_to_mem() is used ?
>
build_all_zonelists() is called at boot, during initialization, and it
calls local_memory_node(cpu_to_node(cpu)) for all possible CPUs. So how
cpu_to_node() is configured for possible CPUs is important.
At a quick look, arch/x86/mm/numa_64.c (around line 786) has the
following code.
/*
 * Setup early cpu_to_node.
 *
 * Populate cpu_to_node[] only if x86_cpu_to_apicid[],
 * and apicid_to_node[] tables have valid entries for a CPU.
 * This means we skip cpu_to_node[] initialisation for NUMA
 * emulation and faking node case (when running a kernel compiled
 * for NUMA on a non NUMA box), which is OK as cpu_to_node[]
 * is already initialized in a round robin manner at numa_init_array,
 * prior to this call, and this initialization is good enough
 * for the fake NUMA cases.
 *
 * Called before the per_cpu areas are setup.
 */
void __init init_cpu_to_node(void)
{
	int cpu;
	u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);

	BUG_ON(cpu_to_apicid == NULL);

	for_each_possible_cpu(cpu) {
		int node;
		u16 apicid = cpu_to_apicid[cpu];

		if (apicid == BAD_APICID)
			continue;
		node = apicid_to_node[apicid];
		if (node == NUMA_NO_NODE)
			continue;
		if (!node_online(node))
			node = find_near_online_node(node);
		numa_set_node(cpu, node);
	}
}
So, cpu_to_node(cpu) for possible CPUs will hold either NUMA_NO_NODE
(-1) or the number of the nearest node.
IIUC, if the SRAT is not broken, every pxm has its own node_id. So
cpu_to_node(cpu) will return the nearest node, and cpu_to_mem() will
find the nearest node with memory.
Thanks,
-Kame
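The behavior of the quoted loop can be modeled in userspace as follows (a sketch: near_node stands in for find_near_online_node(), and all array shapes and names are illustrative):

```c
#include <assert.h>

#define BAD_APICID   0xFFFFu
#define NUMA_NO_NODE (-1)

/*
 * Userspace model of init_cpu_to_node() above: map each possible CPU
 * to a node, skipping CPUs without a valid APIC id or node entry, and
 * redirecting offline nodes to a nearby online one.
 */
static void model_init_cpu_to_node(int ncpus, const unsigned *apicid,
				   const int *apicid_to_node,
				   const int *node_online, int near_node,
				   int *cpu_to_node)
{
	for (int cpu = 0; cpu < ncpus; cpu++) {
		unsigned a = apicid[cpu];
		int node;

		if (a == BAD_APICID)
			continue;		/* mapping stays NUMA_NO_NODE */
		node = apicid_to_node[a];
		if (node == NUMA_NO_NODE)
			continue;
		if (!node_online[node])
			node = near_node;	/* nearest online node */
		cpu_to_node[cpu] = node;
	}
}
```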
* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-21 8:39 ` KAMEZAWA Hiroyuki
@ 2010-05-21 9:12 ` minskey guo
2010-05-21 13:21 ` Lee Schermerhorn
0 siblings, 1 reply; 14+ messages in thread
From: minskey guo @ 2010-05-21 9:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Stephen Rothwell, Andrew Morton, linux-mm, prarit, andi.kleen,
linux-kernel, minskey guo, Tejun Heo, stable
On 05/21/2010 04:39 PM, KAMEZAWA Hiroyuki wrote:
> On Fri, 21 May 2010 16:22:19 +0800
> minskey guo<chaohong_guo@linux.intel.com> wrote:
>
>> Yes. I can use cpu_to_mem(). only some little difference during
>> CPU online: 1st cpu within memoryless node gets memory from current
>> node or the node to which the cpu0 belongs,
>>
>>
>> But I have a question about the patch:
>>
>> numa-slab-use-numa_mem_id-for-slab-local-memory-node.patch,
>>
>>
>>
>>
>> @@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
>> ...
>>
>> - for_each_possible_cpu(cpu)
>> + for_each_possible_cpu(cpu) {
>> setup_pageset(&per_cpu(boot_pageset, cpu), 0);
>> ...
>>
>> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
>> + if (cpu_online(cpu))
>> + cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
>> +#endif
Look at the above code in __build_all_zonelists(): cpu_to_mem(cpu)
is set only when the CPU is online. Suppose a node has local memory,
all of its memory segments are onlined first, and then the CPUs within
that node are onlined one by one. In this case, where does
cpu_to_mem(cpu) for the last CPU get its value?
>
> So, cpu_to_node(cpu) for possible cpus will have NUMA_NO_NODE(-1)
> or the number of the nearest node.
>
> IIUC, if SRAT is not broken, all pxm has its own node_id.
Thank you very much for the info; I had been wondering why node_id
is -1 in some cases.
-minskey
* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-21 9:12 ` minskey guo
@ 2010-05-21 13:21 ` Lee Schermerhorn
2010-05-24 1:03 ` Guo, Chaohong
0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2010-05-21 13:21 UTC (permalink / raw)
To: minskey guo
Cc: KAMEZAWA Hiroyuki, Stephen Rothwell, Andrew Morton, linux-mm,
prarit, andi.kleen, linux-kernel, minskey guo, Tejun Heo, stable
On Fri, 2010-05-21 at 17:12 +0800, minskey guo wrote:
> On 05/21/2010 04:39 PM, KAMEZAWA Hiroyuki wrote:
> > On Fri, 21 May 2010 16:22:19 +0800
> > minskey guo<chaohong_guo@linux.intel.com> wrote:
> >
> >> Yes. I can use cpu_to_mem(). only some little difference during
> >> CPU online: 1st cpu within memoryless node gets memory from current
> >> node or the node to which the cpu0 belongs,
> >>
> >>
> >> But I have a question about the patch:
> >>
> >> numa-slab-use-numa_mem_id-for-slab-local-memory-node.patch,
> >>
> >>
> >>
> >>
> >> @@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
> >> ...
> >>
> >> - for_each_possible_cpu(cpu)
> >> + for_each_possible_cpu(cpu) {
> >> setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> >> ...
> >>
> >> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> >> + if (cpu_online(cpu))
> >> + cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
> >> +#endif
>
> Look at the above code, int __build_all_zonelists(), cpu_to_mem(cpu)
> is set only when cpu is onlined. Suppose that a node with local memory,
> all memory segments are onlined first, and then, cpus within that node
> are onlined one by one, in this case, where does the cpu_to_mem(cpu)
> for the last cpu get its value ?
Minskey:
As I mentioned to Kame-san, x86 does not define
CONFIG_HAVE_MEMORYLESS_NODES, so this code is not compiled for that
arch. If x86 did support memoryless nodes--i.e., did not hide them and
reassign the cpus to other nodes, as is the case for ia64--then we could
have on-line cpus associated with memoryless nodes. The code above is
in __build_all_zonelists() so that in the case where we add memory to a
previously memoryless node, we re-evaluate the "local memory node" for
all online cpus.
For cpu hotplug--again, if x86 supports memoryless nodes--we'll need to
add a similar chunk to the path where we set up the cpu_to_node map for
a hotplugged cpu. See, for example, the call to set_numa_mem() in
smp_callin() in arch/ia64/kernel/smpboot.c. But currently, I don't
think you can use the numa_mem_id()/cpu_to_mem() interfaces for your
purpose. I suppose you could change page_alloc.c to compile
local_memory_node() #if defined(CONFIG_HAVE_MEMORYLESS_NODES) || defined
(CPU_HOTPLUG) and use that function to find the nearest memory. It
should return a valid node after zonelists have been rebuilt.
Does that make sense?
Lee
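What local_memory_node() would provide in the suggested CPU_HOTPLUG configuration can be sketched as scanning a node's fallback list for the first entry that has memory (illustrative names only, not the kernel implementation):

```c
#include <assert.h>

/*
 * Model of local_memory_node(): walk the node's zonelist (represented
 * here as an ordered array of candidate node ids) and return the first
 * node that actually has memory online.
 */
static int nearest_memory_node(const int *zonelist_nodes, int n,
			       const int *node_has_mem)
{
	for (int i = 0; i < n; i++)
		if (node_has_mem[zonelist_nodes[i]])
			return zonelist_nodes[i];
	return -1;	/* no memory anywhere; cannot happen on a live system */
}
```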
>
>
> >
> > So, cpu_to_node(cpu) for possible cpus will have NUMA_NO_NODE(-1)
> > or the number of the nearest node.
> >
> > IIUC, if SRAT is not broken, all pxm has its own node_id.
>
> Thank you very much for the info, I have been thinking why node_id
> is (-1) in some cases.
>
>
> -minskey
>
* RE: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-21 13:21 ` Lee Schermerhorn
@ 2010-05-24 1:03 ` Guo, Chaohong
2010-05-24 14:59 ` Lee Schermerhorn
0 siblings, 1 reply; 14+ messages in thread
From: Guo, Chaohong @ 2010-05-24 1:03 UTC (permalink / raw)
To: Lee Schermerhorn, minskey guo
Cc: KAMEZAWA Hiroyuki, Stephen Rothwell, Andrew Morton,
linux-mm@kvack.org, prarit@redhat.com, Kleen, Andi,
linux-kernel@vger.kernel.org, Tejun Heo, stable@kernel.org
>> >>
>> >>
>> >> @@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
>> >> ...
>> >>
>> >> - for_each_possible_cpu(cpu)
>> >> + for_each_possible_cpu(cpu) {
>> >> setup_pageset(&per_cpu(boot_pageset, cpu), 0);
>> >> ...
>> >>
>> >> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
>> >> + if (cpu_online(cpu))
>> >> + cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
>> >> +#endif
>>
>> Look at the above code, int __build_all_zonelists(), cpu_to_mem(cpu)
>> is set only when cpu is onlined. Suppose that a node with local memory,
>> all memory segments are onlined first, and then, cpus within that node
>> are onlined one by one, in this case, where does the cpu_to_mem(cpu)
>> for the last cpu get its value ?
>
>Minskey:
>
>As I mentioned to Kame-san, x86 does not define
>CONFIG_HAVE_MEMORYLESS_NODES, so this code is not compiled for that
>arch. If x86 did support memoryless nodes--i.e., did not hide them and
>reassign the cpus to other nodes, as is the case for ia64--then we could
>have on-line cpus associated with memoryless nodes. The code above is
>in __build_all_zonelists() so that in the case where we add memory to a
>previously memoryless node, we re-evaluate the "local memory node" for
>all online cpus.
>
>For cpu hotplug--again, if x86 supports memoryless nodes--we'll need to
>add a similar chunk to the path where we set up the cpu_to_node map for
>a hotplugged cpu. See, for example, the call to set_numa_mem() in
>smp_callin() in arch/ia64/kernel/smpboot.c.
Yeah, that's what I am looking for.
>But currently, I don't
>think you can use the numa_mem_id()/cpu_to_mem() interfaces for your
>purpose. I suppose you could change page_alloc.c to compile
>local_memory_node() #if defined(CONFIG_HAVE_MEMORYLESS_NODES) ||
>defined
>(CPU_HOTPLUG) and use that function to find the nearest memory. It
>should return a valid node after zonelists have been rebuilt.
>
>Does that make sense?
Yes. Besides that, I need to find a place in the hotplug path to call
set_numa_mem(), just as you mentioned for the ia64 platform. Is my
understanding right?
Thanks,
-minskey
>
>Lee
>>
>>
>> >
>> > So, cpu_to_node(cpu) for possible cpus will have NUMA_NO_NODE(-1)
>> > or the number of the nearest node.
>> >
>> > IIUC, if SRAT is not broken, all pxm has its own node_id.
>>
>> Thank you very much for the info, I have been thinking why node_id
>> is (-1) in some cases.
>>
>>
>> -minskey
>>
* RE: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-24 1:03 ` Guo, Chaohong
@ 2010-05-24 14:59 ` Lee Schermerhorn
2010-05-25 1:35 ` Guo, Chaohong
0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2010-05-24 14:59 UTC (permalink / raw)
To: Guo, Chaohong
Cc: minskey guo, KAMEZAWA Hiroyuki, Stephen Rothwell, Andrew Morton,
linux-mm@kvack.org, prarit@redhat.com, Kleen, Andi,
linux-kernel@vger.kernel.org, Tejun Heo, stable@kernel.org
On Mon, 2010-05-24 at 09:03 +0800, Guo, Chaohong wrote:
>
> >> >>
> >> >>
> >> >> @@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
> >> >> ...
> >> >>
> >> >> - for_each_possible_cpu(cpu)
> >> >> + for_each_possible_cpu(cpu) {
> >> >> setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> >> >> ...
> >> >>
> >> >> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> >> >> + if (cpu_online(cpu))
> >> >> + cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
> >> >> +#endif
> >>
> >> Look at the above code, int __build_all_zonelists(), cpu_to_mem(cpu)
> >> is set only when cpu is onlined. Suppose that a node with local memory,
> >> all memory segments are onlined first, and then, cpus within that node
> >> are onlined one by one, in this case, where does the cpu_to_mem(cpu)
> >> for the last cpu get its value ?
> >
> >Minskey:
> >
> >As I mentioned to Kame-san, x86 does not define
> >CONFIG_HAVE_MEMORYLESS_NODES, so this code is not compiled for that
> >arch. If x86 did support memoryless nodes--i.e., did not hide them and
> >reassign the cpus to other nodes, as is the case for ia64--then we could
> >have on-line cpus associated with memoryless nodes. The code above is
> >in __build_all_zonelists() so that in the case where we add memory to a
> >previously memoryless node, we re-evaluate the "local memory node" for
> >all online cpus.
> >
> >For cpu hotplug--again, if x86 supports memoryless nodes--we'll need to
> >add a similar chunk to the path where we set up the cpu_to_node map for
> >a hotplugged cpu. See, for example, the call to set_numa_mem() in
> >smp_callin() in arch/ia64/kernel/smpboot.c.
>
>
> Yeah, that's what I am looking for.
>
>
>
> But currently, I don't
> >think you can use the numa_mem_id()/cpu_to_mem() interfaces for your
> >purpose. I suppose you could change page_alloc.c to compile
> >local_memory_node() #if defined(CONFIG_HAVE_MEMORYLESS_NODES) ||
> >defined
> >(CPU_HOTPLUG) and use that function to find the nearest memory. It
> >should return a valid node after zonelists have been rebuilt.
> >
> >Does that make sense?
>
> Yes, besides, I need to find a place in hotplug path to call set_numa_mem()
> just as you mentioned for ia64 platform. Is my understanding right ?
I don't think you can use any of the "numa_mem" functions on x86[_64]
without doing a lot more work to expose memoryless nodes. On x86_64,
numa_mem_id() and cpu_to_mem() always return the same as numa_node_id()
and cpu_to_node(). This is because x86_64 code hides memoryless nodes
and reassigns all cpus to nodes with memory. Are you planning on
changing this such that memoryless nodes remain on-line with their cpus
associated with them? If so, go for it! If not, then you don't need
to [can't really, I think] use set_numa_mem()/cpu_to_mem() for your
purposes. That's why I suggested you arrange for local_memory_node() to
be compiled for CPU_HOTPLUG and call that function directly to obtain a
nearby node from which you can allocate memory during cpu hot plug. Or,
I could just completely misunderstand what you propose to do with these
percpu variables.
Lee
* RE: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-24 14:59 ` Lee Schermerhorn
@ 2010-05-25 1:35 ` Guo, Chaohong
0 siblings, 0 replies; 14+ messages in thread
From: Guo, Chaohong @ 2010-05-25 1:35 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: minskey guo, KAMEZAWA Hiroyuki, Stephen Rothwell, Andrew Morton,
linux-mm@kvack.org, prarit@redhat.com, Kleen, Andi,
linux-kernel@vger.kernel.org, Tejun Heo, stable@kernel.org
>>
>> But currently, I don't
>> >think you can use the numa_mem_id()/cpu_to_mem() interfaces for your
>> >purpose. I suppose you could change page_alloc.c to compile
>> >local_memory_node() #if defined(CONFIG_HAVE_MEMORYLESS_NODES) ||
>> >defined
>> >(CPU_HOTPLUG) and use that function to find the nearest memory. It
>> >should return a valid node after zonelists have been rebuilt.
>> >
>> >Does that make sense?
>>
>> Yes, besides, I need to find a place in hotplug path to call set_numa_mem()
>> just as you mentioned for ia64 platform. Is my understanding right ?
>
>I don't think you can use any of the "numa_mem" functions on x86[_64]
>without doing a lot more work to expose memoryless nodes. On x86_64,
>numa_mem_id() and cpu_to_mem() always return the same as numa_node_id()
>and cpu_to_node(). This is because x86_64 code hides memoryless nodes
>and reassigns all cpus to nodes with memory. Are you planning on
>changing this such that memoryless nodes remain on-line with their cpus
>associated with them? If so, go for it! If not, then you don't need
>to [can't really, I think] use set_numa_mem()/cpu_to_mem() for your
>purposes. That's why I suggested you arrange for local_memory_node() to
>be compiled for CPU_HOTPLUG and call that function directly to obtain a
>nearby node from which you can allocate memory during cpu hot plug. Or,
>I could just completely misunderstand what you propose to do with these
>percpu variables.
Got it, thank you very much for the detailed explanation.
-minskey
* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-21 4:44 ` KAMEZAWA Hiroyuki
2010-05-21 8:22 ` minskey guo
@ 2010-05-21 12:32 ` Lee Schermerhorn
1 sibling, 0 replies; 14+ messages in thread
From: Lee Schermerhorn @ 2010-05-21 12:32 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Stephen Rothwell, Andrew Morton, minskey guo, linux-mm, prarit,
andi.kleen, linux-kernel, Tejun Heo, stable
On Fri, 2010-05-21 at 13:44 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 21 May 2010 10:55:12 +1000
> Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> > Hi Andrew,
> >
> > On Thu, 20 May 2010 13:43:59 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > > --- a/mm/percpu.c
> > > > +++ b/mm/percpu.c
> > > > @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
> > >
> > > In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
> > > mm/percpu-vm.c. So either
> >
> > This has gone into Linus' tree today ...
> >
>
> Hmm, a comment here.
>
> Recently, Lee Schermerhorn developed
>
> numa-introduce-numa_mem_id-effective-local-memory-node-id-fix2.patch
>
> Then, you can use cpu_to_mem() instead of cpu_to_node() to find the
> nearest available node.
> I don't check cpu_to_mem() is synchronized with NUMA hotplug but
> using cpu_to_mem() rather than adding
> =
>
> + if ((nid == -1) ||
> + !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
> + nid = numa_node_id();
> +
> ==
>
> is better.
Kame-san, all:
numa_mem_id() and cpu_to_mem() are not supported [yet] on x86 because
x86 hides all memoryless nodes and moves cpus to "nearby" [for some
definition thereof] nodes with memory. So, these interfaces just return
numa_node_id() and cpu_to_node() for x86. Perhaps that will change
someday...
Lee
>
> Thanks,
> -Kame
>
* RE: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-20 20:43 ` Andrew Morton
2010-05-21 0:55 ` Stephen Rothwell
@ 2010-05-21 4:05 ` Guo, Chaohong
2010-05-21 7:29 ` Kleen, Andi
2 siblings, 0 replies; 14+ messages in thread
From: Guo, Chaohong @ 2010-05-21 4:05 UTC (permalink / raw)
To: Andrew Morton, minskey guo
Cc: linux-mm@kvack.org, prarit@redhat.com, Kleen, Andi,
linux-kernel@vger.kernel.org, Tejun Heo, stable@kernel.org
>> The operation of "enable CPU to online before memory within a node"
>> fails in some cases, according to Prarit. The warnings are as follows:
>>
>> Pid: 7440, comm: bash Not tainted 2.6.32 #2
>> Call Trace:
>> [<ffffffff81155985>] pcpu_alloc+0xa05/0xa70
>> [<ffffffff81155a20>] __alloc_percpu+0x10/0x20
>> [<ffffffff81089605>] __create_workqueue_key+0x75/0x280
>> [<ffffffff8110e050>] ? __build_all_zonelists+0x0/0x5d0
>> [<ffffffff810c1eba>] stop_machine_create+0x3a/0xb0
>> [<ffffffff810c1f57>] stop_machine+0x27/0x60
>> [<ffffffff8110f1a0>] build_all_zonelists+0xd0/0x2b0
>> [<ffffffff814c1d12>] cpu_up+0xb3/0xe3
>> [<ffffffff814b3c40>] store_online+0x70/0xa0
>> [<ffffffff81326100>] sysdev_store+0x20/0x30
>> [<ffffffff811d29a5>] sysfs_write_file+0xe5/0x170
>> [<ffffffff81163d28>] vfs_write+0xb8/0x1a0
>> [<ffffffff810cfd22>] ? audit_syscall_entry+0x252/0x280
>> [<ffffffff81164761>] sys_write+0x51/0x90
>> [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
>> Built 4 zonelists in Zone order, mobility grouping on. Total pages: 12331603
>> PERCPU: allocation failed, size=128 align=64, failed to populate
>>
>> With "enable CPU to online before memory" patch, when the 1st CPU of
>> an offlined node is being onlined, we build zonelists for that node.
>> If per-cpu area needs to be extended during zonelists building period,
>> alloc_pages_node() will be called. The routine alloc_pages_node() fails
>> on the node in-onlining because the node doesn't have zonelists created
>> yet.
>>
>> To fix this issue, we try to alloc memory from current node.
>
>How serious is this issue? Just a warning? Dead box?
>
>Because if we want to port this fix into 2.6.34.x, we have a little
>problem.
When onlining a CPU within a node that has no local memory, if the
per-cpu area is used up and fails to be extended, there will be many
warnings about pcpu_alloc() failures, and eventually an out-of-memory
condition is triggered and some processes get killed by the OOM killer.
-minskey
>
>
>> --- a/mm/percpu.c
>> +++ b/mm/percpu.c
>> @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>
>In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
>mm/percpu-vm.c. So either
>
>a) the -stable guys will need to patch a different file or
>
>b) we apply this fix first and muck up Tejun's tree or
>
>c) the bug isn't very serious so none of this applies.
>
>> {
>> const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
>> unsigned int cpu;
>> + int nid;
>> int i;
>>
>> for_each_possible_cpu(cpu) {
>> for (i = page_start; i < page_end; i++) {
>> struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
>>
>> - *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
>> + nid = cpu_to_node(cpu);
>> +
>> + /*
>> + * It is allowable to online a CPU within a NUMA
>> + * node which doesn't have onlined local memory.
>> + * In this case, we need to create zonelists for
>> + * that node when cpu is being onlined. If per-cpu
>> + * area needs to be extended at the exact time when
>> + * zonelists of that node is being created, we alloc
>> + * memory from current node.
>> + */
>> + if ((nid == -1) ||
>> + !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
>> + nid = numa_node_id();
>> +
>> + *pagep = alloc_pages_node(nid, gfp, 0);
>> if (!*pagep) {
>> pcpu_free_pages(chunk, pages, populated,
>> page_start, page_end);
* RE: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
2010-05-20 20:43 ` Andrew Morton
2010-05-21 0:55 ` Stephen Rothwell
2010-05-21 4:05 ` Guo, Chaohong
@ 2010-05-21 7:29 ` Kleen, Andi
2 siblings, 0 replies; 14+ messages in thread
From: Kleen, Andi @ 2010-05-21 7:29 UTC (permalink / raw)
To: Andrew Morton, minskey guo
Cc: linux-mm@kvack.org, prarit@redhat.com,
linux-kernel@vger.kernel.org, Guo, Chaohong, Tejun Heo,
stable@kernel.org
>
>How serious is this issue? Just a warning? Dead box?
It's pretty much a showstopper for memory hotadd with a new node.
-Andi
end of thread, other threads:[~2010-05-25 1:36 UTC | newest]
Thread overview: 14+ messages
2010-05-18 6:17 [PATCH] online CPU before memory failed in pcpu_alloc_pages() minskey guo
2010-05-20 20:43 ` Andrew Morton
2010-05-21 0:55 ` Stephen Rothwell
2010-05-21 4:44 ` KAMEZAWA Hiroyuki
2010-05-21 8:22 ` minskey guo
2010-05-21 8:39 ` KAMEZAWA Hiroyuki
2010-05-21 9:12 ` minskey guo
2010-05-21 13:21 ` Lee Schermerhorn
2010-05-24 1:03 ` Guo, Chaohong
2010-05-24 14:59 ` Lee Schermerhorn
2010-05-25 1:35 ` Guo, Chaohong
2010-05-21 12:32 ` Lee Schermerhorn
2010-05-21 4:05 ` Guo, Chaohong
2010-05-21 7:29 ` Kleen, Andi
This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for NNTP
newsgroup(s).