linux-mm.kvack.org archive mirror
* [PATCH] online CPU before memory failed in pcpu_alloc_pages()
@ 2010-05-18  6:17 minskey guo
  2010-05-20 20:43 ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: minskey guo @ 2010-05-18  6:17 UTC (permalink / raw)
  To: akpm, linux-mm; +Cc: prarit, andi.kleen, linux-kernel, minskey guo

From: minskey guo <chaohong.guo@intel.com>

The operation of "enable CPU to online before memory within a node"
fails in some cases according to Prarit. The warnings are as follows:

Pid: 7440, comm: bash Not tainted 2.6.32 #2
Call Trace:
 [<ffffffff81155985>] pcpu_alloc+0xa05/0xa70
 [<ffffffff81155a20>] __alloc_percpu+0x10/0x20
 [<ffffffff81089605>] __create_workqueue_key+0x75/0x280
 [<ffffffff8110e050>] ? __build_all_zonelists+0x0/0x5d0
 [<ffffffff810c1eba>] stop_machine_create+0x3a/0xb0
 [<ffffffff810c1f57>] stop_machine+0x27/0x60
 [<ffffffff8110f1a0>] build_all_zonelists+0xd0/0x2b0
 [<ffffffff814c1d12>] cpu_up+0xb3/0xe3
 [<ffffffff814b3c40>] store_online+0x70/0xa0
 [<ffffffff81326100>] sysdev_store+0x20/0x30
 [<ffffffff811d29a5>] sysfs_write_file+0xe5/0x170
 [<ffffffff81163d28>] vfs_write+0xb8/0x1a0
 [<ffffffff810cfd22>] ? audit_syscall_entry+0x252/0x280
 [<ffffffff81164761>] sys_write+0x51/0x90
 [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
Built 4 zonelists in Zone order, mobility grouping on.  Total pages: 12331603
PERCPU: allocation failed, size=128 align=64, failed to populate

With "enable CPU to online before memory" patch, when the 1st CPU of
an offlined node is being onlined, we build zonelists for that node.
If per-cpu area needs to be extended during zonelists building period,
alloc_pages_node() will be called. The routine alloc_pages_node() fails
on the node in-onlining because the node doesn't have zonelists created
yet.

To fix this issue,  we try to alloc memory from current node.

Signed-off-by: minskey guo <chaohong.guo@intel.com>
---
 mm/percpu.c |   18 +++++++++++++++++-
 1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 6e09741..fabdb10 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
 {
 	const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
 	unsigned int cpu;
+	int nid;
 	int i;
 
 	for_each_possible_cpu(cpu) {
 		for (i = page_start; i < page_end; i++) {
 			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
 
-			*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+			nid = cpu_to_node(cpu);
+
+			/*
+			 * It is allowable to online a CPU within a NUMA
+			 * node which doesn't have onlined local memory.
+			 * In this case, we need to create zonelists for
+			 * that node when cpu is being onlined. If per-cpu
+			 * area needs to be extended at the exact time when
+			 * zonelists of that node is being created, we alloc
+			 * memory from current node.
+			 */
+			if ((nid == -1) ||
+			    !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
+				nid = numa_node_id();
+
+			*pagep = alloc_pages_node(nid, gfp, 0);
 			if (!*pagep) {
 				pcpu_free_pages(chunk, pages, populated,
 						page_start, page_end);
-- 
1.7.0.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-18  6:17 [PATCH] online CPU before memory failed in pcpu_alloc_pages() minskey guo
@ 2010-05-20 20:43 ` Andrew Morton
  2010-05-21  0:55   ` Stephen Rothwell
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Andrew Morton @ 2010-05-20 20:43 UTC (permalink / raw)
  To: minskey guo
  Cc: linux-mm, prarit, andi.kleen, linux-kernel, minskey guo,
	Tejun Heo, stable

On Tue, 18 May 2010 14:17:22 +0800
minskey guo <chaohong_guo@linux.intel.com> wrote:

> From: minskey guo <chaohong.guo@intel.com>
> 
> The operation of "enable CPU to online before memory within a node"
> fails in some case according to Prarit. The warnings as follows:
> 
> Pid: 7440, comm: bash Not tainted 2.6.32 #2
> Call Trace:
>  [<ffffffff81155985>] pcpu_alloc+0xa05/0xa70
>  [<ffffffff81155a20>] __alloc_percpu+0x10/0x20
>  [<ffffffff81089605>] __create_workqueue_key+0x75/0x280
>  [<ffffffff8110e050>] ? __build_all_zonelists+0x0/0x5d0
>  [<ffffffff810c1eba>] stop_machine_create+0x3a/0xb0
>  [<ffffffff810c1f57>] stop_machine+0x27/0x60
>  [<ffffffff8110f1a0>] build_all_zonelists+0xd0/0x2b0
>  [<ffffffff814c1d12>] cpu_up+0xb3/0xe3
>  [<ffffffff814b3c40>] store_online+0x70/0xa0
>  [<ffffffff81326100>] sysdev_store+0x20/0x30
>  [<ffffffff811d29a5>] sysfs_write_file+0xe5/0x170
>  [<ffffffff81163d28>] vfs_write+0xb8/0x1a0
>  [<ffffffff810cfd22>] ? audit_syscall_entry+0x252/0x280
>  [<ffffffff81164761>] sys_write+0x51/0x90
>  [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
> Built 4 zonelists in Zone order, mobility grouping on.  Total pages: 12331603
> PERCPU: allocation failed, size=128 align=64, failed to populate
> 
> With "enable CPU to online before memory" patch, when the 1st CPU of
> an offlined node is being onlined, we build zonelists for that node.
> If per-cpu area needs to be extended during zonelists building period,
> alloc_pages_node() will be called. The routine alloc_pages_node() fails
> on the node in-onlining because the node doesn't have zonelists created
> yet.
> 
> To fix this issue,  we try to alloc memory from current node.

How serious is this issue?  Just a warning?  Dead box?

Because if we want to port this fix into 2.6.34.x, we have a little
problem.


> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,

In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
mm/percpu-vm.c.  So either

a) the -stable guys will need to patch a different file or

b) we apply this fix first and muck up Tejun's tree or

c) the bug isn't very serious so none of this applies.

>  {
>  	const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
>  	unsigned int cpu;
> +	int nid;
>  	int i;
>  
>  	for_each_possible_cpu(cpu) {
>  		for (i = page_start; i < page_end; i++) {
>  			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
>  
> -			*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
> +			nid = cpu_to_node(cpu);
> +
> +			/*
> +			 * It is allowable to online a CPU within a NUMA
> +			 * node which doesn't have onlined local memory.
> +			 * In this case, we need to create zonelists for
> +			 * that node when cpu is being onlined. If per-cpu
> +			 * area needs to be extended at the exact time when
> +			 * zonelists of that node is being created, we alloc
> +			 * memory from current node.
> +			 */
> +			if ((nid == -1) ||
> +			    !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
> +				nid = numa_node_id();
> +
> +			*pagep = alloc_pages_node(nid, gfp, 0);
>  			if (!*pagep) {
>  				pcpu_free_pages(chunk, pages, populated,
>  						page_start, page_end);


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-20 20:43 ` Andrew Morton
@ 2010-05-21  0:55   ` Stephen Rothwell
  2010-05-21  4:44     ` KAMEZAWA Hiroyuki
  2010-05-21  4:05   ` Guo, Chaohong
  2010-05-21  7:29   ` Kleen, Andi
  2 siblings, 1 reply; 14+ messages in thread
From: Stephen Rothwell @ 2010-05-21  0:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: minskey guo, linux-mm, prarit, andi.kleen, linux-kernel,
	minskey guo, Tejun Heo, stable

[-- Attachment #1: Type: text/plain, Size: 491 bytes --]

Hi Andrew,

On Thu, 20 May 2010 13:43:59 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
> 
> In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
> mm/percpu-vm.c.  So either

This has gone into Linus' tree today ...

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-20 20:43 ` Andrew Morton
  2010-05-21  0:55   ` Stephen Rothwell
@ 2010-05-21  4:05   ` Guo, Chaohong
  2010-05-21  7:29   ` Kleen, Andi
  2 siblings, 0 replies; 14+ messages in thread
From: Guo, Chaohong @ 2010-05-21  4:05 UTC (permalink / raw)
  To: Andrew Morton, minskey guo
  Cc: linux-mm@kvack.org, prarit@redhat.com, Kleen, Andi,
	linux-kernel@vger.kernel.org, Tejun Heo, stable@kernel.org



>> The operation of "enable CPU to online before memory within a node"
>> fails in some case according to Prarit. The warnings as follows:
>>
>> Pid: 7440, comm: bash Not tainted 2.6.32 #2
>> Call Trace:
>>  [<ffffffff81155985>] pcpu_alloc+0xa05/0xa70
>>  [<ffffffff81155a20>] __alloc_percpu+0x10/0x20
>>  [<ffffffff81089605>] __create_workqueue_key+0x75/0x280
>>  [<ffffffff8110e050>] ? __build_all_zonelists+0x0/0x5d0
>>  [<ffffffff810c1eba>] stop_machine_create+0x3a/0xb0
>>  [<ffffffff810c1f57>] stop_machine+0x27/0x60
>>  [<ffffffff8110f1a0>] build_all_zonelists+0xd0/0x2b0
>>  [<ffffffff814c1d12>] cpu_up+0xb3/0xe3
>>  [<ffffffff814b3c40>] store_online+0x70/0xa0
>>  [<ffffffff81326100>] sysdev_store+0x20/0x30
>>  [<ffffffff811d29a5>] sysfs_write_file+0xe5/0x170
>>  [<ffffffff81163d28>] vfs_write+0xb8/0x1a0
>>  [<ffffffff810cfd22>] ? audit_syscall_entry+0x252/0x280
>>  [<ffffffff81164761>] sys_write+0x51/0x90
>>  [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
>> Built 4 zonelists in Zone order, mobility grouping on.  Total pages: 12331603
>> PERCPU: allocation failed, size=128 align=64, failed to populate
>>
>> With "enable CPU to online before memory" patch, when the 1st CPU of
>> an offlined node is being onlined, we build zonelists for that node.
>> If per-cpu area needs to be extended during zonelists building period,
>> alloc_pages_node() will be called. The routine alloc_pages_node() fails
>> on the node in-onlining because the node doesn't have zonelists created
>> yet.
>>
>> To fix this issue,  we try to alloc memory from current node.
>
>How serious is this issue?  Just a warning?  Dead box?
>
>Because if we want to port this fix into 2.6.34.x, we have a little
>problem.


When onlining a CPU within a node that has no local memory, if the
per-cpu area is used up and fails to be extended, there will be many
warnings about pcpu_alloc() failures, and eventually an out-of-memory
condition is triggered and some processes get killed by the OOM killer.


-minskey
>
>
>> --- a/mm/percpu.c
>> +++ b/mm/percpu.c
>> @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk
>*chunk,
>
>In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
>mm/percpu-vm.c.  So either
>
>a) the -stable guys will need to patch a different file or
>
>b) we apply this fix first and muck up Tejun's tree or
>
>c) the bug isn't very serious so none of this applies.
>
>>  {
>>  	const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
>>  	unsigned int cpu;
>> +	int nid;
>>  	int i;
>>
>>  	for_each_possible_cpu(cpu) {
>>  		for (i = page_start; i < page_end; i++) {
>>  			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
>>
>> -			*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
>> +			nid = cpu_to_node(cpu);
>> +
>> +			/*
>> +			 * It is allowable to online a CPU within a NUMA
>> +			 * node which doesn't have onlined local memory.
>> +			 * In this case, we need to create zonelists for
>> +			 * that node when cpu is being onlined. If per-cpu
>> +			 * area needs to be extended at the exact time when
>> +			 * zonelists of that node is being created, we alloc
>> +			 * memory from current node.
>> +			 */
>> +			if ((nid == -1) ||
>> +			    !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
>> +				nid = numa_node_id();
>> +
>> +			*pagep = alloc_pages_node(nid, gfp, 0);
>>  			if (!*pagep) {
>>  				pcpu_free_pages(chunk, pages, populated,
>>  						page_start, page_end);


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-21  0:55   ` Stephen Rothwell
@ 2010-05-21  4:44     ` KAMEZAWA Hiroyuki
  2010-05-21  8:22       ` minskey guo
  2010-05-21 12:32       ` Lee Schermerhorn
  0 siblings, 2 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-21  4:44 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Andrew Morton, minskey guo, linux-mm, prarit, andi.kleen,
	linux-kernel, minskey guo, Tejun Heo, stable

On Fri, 21 May 2010 10:55:12 +1000
Stephen Rothwell <sfr@canb.auug.org.au> wrote:

> Hi Andrew,
> 
> On Thu, 20 May 2010 13:43:59 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > > --- a/mm/percpu.c
> > > +++ b/mm/percpu.c
> > > @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
> > 
> > In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
> > mm/percpu-vm.c.  So either
> 
> This has gone into Linus' tree today ...
> 

Hmm, a comment here.

Recently, Lee Schermerhorn developed

 numa-introduce-numa_mem_id-effective-local-memory-node-id-fix2.patch

Then, you can use cpu_to_mem() instead of cpu_to_node() to find the
nearest available node.
I haven't checked whether cpu_to_mem() is synchronized with NUMA hotplug,
but using cpu_to_mem() rather than adding
=

+			if ((nid == -1) ||
+			    !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
+				nid = numa_node_id();
+
==

is better. 

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-20 20:43 ` Andrew Morton
  2010-05-21  0:55   ` Stephen Rothwell
  2010-05-21  4:05   ` Guo, Chaohong
@ 2010-05-21  7:29   ` Kleen, Andi
  2 siblings, 0 replies; 14+ messages in thread
From: Kleen, Andi @ 2010-05-21  7:29 UTC (permalink / raw)
  To: Andrew Morton, minskey guo
  Cc: linux-mm@kvack.org, prarit@redhat.com,
	linux-kernel@vger.kernel.org, Guo, Chaohong, Tejun Heo,
	stable@kernel.org

>
>How serious is this issue?  Just a warning?  Dead box?

It's pretty much a showstopper for memory hotadd with a new node.

-Andi


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-21  4:44     ` KAMEZAWA Hiroyuki
@ 2010-05-21  8:22       ` minskey guo
  2010-05-21  8:39         ` KAMEZAWA Hiroyuki
  2010-05-21 12:32       ` Lee Schermerhorn
  1 sibling, 1 reply; 14+ messages in thread
From: minskey guo @ 2010-05-21  8:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Stephen Rothwell, Andrew Morton, linux-mm, prarit, andi.kleen,
	linux-kernel, minskey guo, Tejun Heo, stable


>>>> --- a/mm/percpu.c
>>>> +++ b/mm/percpu.c
>>>> @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>>>
>>> In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
>>> mm/percpu-vm.c.  So either
>>
>> This has gone into Linus' tree today ...
>>
>
> Hmm, a comment here.
>
> Recently, Lee Schermerhorn developed
>
>   numa-introduce-numa_mem_id-effective-local-memory-node-id-fix2.patch
>
> Then, you can use cpu_to_mem() instead of cpu_to_node() to find the
> nearest available node.
> I don't check cpu_to_mem() is synchronized with NUMA hotplug but
> using cpu_to_mem() rather than adding
> =
>
> +			if ((nid == -1) ||
> +			    !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
> +				nid = numa_node_id();
> +
> ==
>
> is better.


Yes, I can use cpu_to_mem(). There is only a small difference during
CPU online: the first CPU within a memoryless node gets memory from the
current node or from the node to which cpu0 belongs.


But I have a question about the patch:

    numa-slab-use-numa_mem_id-for-slab-local-memory-node.patch,




@@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
...

-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
...

+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+ 	if (cpu_online(cpu))
+		cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
+#endif


Look at the last two lines. Suppose that memory is onlined before CPUs:
where will cpu_to_mem(cpu) be set to the right node id for the last
onlined CPU? Does that CPU always get memory from the node containing
cpu0 for the slab allocator, where cpu_to_mem() is used?



thanks,
-minskey




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-21  8:22       ` minskey guo
@ 2010-05-21  8:39         ` KAMEZAWA Hiroyuki
  2010-05-21  9:12           ` minskey guo
  0 siblings, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-21  8:39 UTC (permalink / raw)
  To: minskey guo
  Cc: Stephen Rothwell, Andrew Morton, linux-mm, prarit, andi.kleen,
	linux-kernel, minskey guo, Tejun Heo, stable

On Fri, 21 May 2010 16:22:19 +0800
minskey guo <chaohong_guo@linux.intel.com> wrote:

> Yes.  I can use cpu_to_mem().  only some little difference during
> CPU online:  1st cpu within memoryless node gets memory from current
> node or the node to which the cpu0 belongs,
> 
> 
> But I have a question about the patch:
> 
>     numa-slab-use-numa_mem_id-for-slab-local-memory-node.patch,
> 
> 
> 
> 
> @@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
> ...
> 
> -	for_each_possible_cpu(cpu)
> +	for_each_possible_cpu(cpu) {
> 		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> ...
> 
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> + 	if (cpu_online(cpu))
> +		cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
> +#endif
> 
> 
> Look at the last two lines, suppose that memory is onlined before CPUs,
> where will cpu_to_mem(cpu) be set to the right nodeid for the last
> onlined cpu ?  Does that CPU always get memory from the node including 
> cpu0 for slab allocator where cpu_to_mem() is used ?
> 
build_all_zonelists() is called at boot, during initialization.
And it calls local_memory_node(cpu_to_node(cpu)) for all possible CPUs.

So, "how cpu_to_node() is configured for possible CPUs" is the important
question. At a quick look, arch/x86/mm/numa_64.c has the following code.


 786 /*
 787  * Setup early cpu_to_node.
 788  *
 789  * Populate cpu_to_node[] only if x86_cpu_to_apicid[],
 790  * and apicid_to_node[] tables have valid entries for a CPU.
 791  * This means we skip cpu_to_node[] initialisation for NUMA
 792  * emulation and faking node case (when running a kernel compiled
 793  * for NUMA on a non NUMA box), which is OK as cpu_to_node[]
 794  * is already initialized in a round robin manner at numa_init_array,
 795  * prior to this call, and this initialization is good enough
 796  * for the fake NUMA cases.
 797  *
 798  * Called before the per_cpu areas are setup.
 799  */
 800 void __init init_cpu_to_node(void)
 801 {
 802         int cpu;
 803         u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
 804 
 805         BUG_ON(cpu_to_apicid == NULL);
 806 
 807         for_each_possible_cpu(cpu) {
 808                 int node;
 809                 u16 apicid = cpu_to_apicid[cpu];
 810 
 811                 if (apicid == BAD_APICID)
 812                         continue;
 813                 node = apicid_to_node[apicid];
 814                 if (node == NUMA_NO_NODE)
 815                         continue;
 816                 if (!node_online(node))
 817                         node = find_near_online_node(node);
 818                 numa_set_node(cpu, node);
 819         }
 820 }


So, cpu_to_node(cpu) for possible cpus will have NUMA_NO_NODE(-1)
or the number of the nearest node.

IIUC, if the SRAT is not broken, every pxm has its own node_id. So
cpu_to_node(cpu) will return the nearest node, and cpu_to_mem() will
find the nearest node with memory.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-21  8:39         ` KAMEZAWA Hiroyuki
@ 2010-05-21  9:12           ` minskey guo
  2010-05-21 13:21             ` Lee Schermerhorn
  0 siblings, 1 reply; 14+ messages in thread
From: minskey guo @ 2010-05-21  9:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Stephen Rothwell, Andrew Morton, linux-mm, prarit, andi.kleen,
	linux-kernel, minskey guo, Tejun Heo, stable

On 05/21/2010 04:39 PM, KAMEZAWA Hiroyuki wrote:
> On Fri, 21 May 2010 16:22:19 +0800
> minskey guo<chaohong_guo@linux.intel.com>  wrote:
>
>> Yes.  I can use cpu_to_mem().  only some little difference during
>> CPU online:  1st cpu within memoryless node gets memory from current
>> node or the node to which the cpu0 belongs,
>>
>>
>> But I have a question about the patch:
>>
>>      numa-slab-use-numa_mem_id-for-slab-local-memory-node.patch,
>>
>>
>>
>>
>> @@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
>> ...
>>
>> -	for_each_possible_cpu(cpu)
>> +	for_each_possible_cpu(cpu) {
>> 		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
>> ...
>>
>> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
>> + 	if (cpu_online(cpu))
>> +		cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
>> +#endif

Look at the above code in __build_all_zonelists(): cpu_to_mem(cpu)
is set only when the CPU is online. Suppose a node with local memory
has all its memory segments onlined first, and then the CPUs within
that node are onlined one by one. In this case, where does
cpu_to_mem(cpu) for the last CPU get its value?


>
> So, cpu_to_node(cpu) for possible cpus will have NUMA_NO_NODE(-1)
> or the number of the nearest node.
>
> IIUC, if SRAT is not broken, all pxm has its own node_id.

Thank you very much for the info. I have been wondering why node_id
is (-1) in some cases.


-minskey


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-21  4:44     ` KAMEZAWA Hiroyuki
  2010-05-21  8:22       ` minskey guo
@ 2010-05-21 12:32       ` Lee Schermerhorn
  1 sibling, 0 replies; 14+ messages in thread
From: Lee Schermerhorn @ 2010-05-21 12:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Stephen Rothwell, Andrew Morton, minskey guo, linux-mm, prarit,
	andi.kleen, linux-kernel, Tejun Heo, stable

On Fri, 2010-05-21 at 13:44 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 21 May 2010 10:55:12 +1000
> Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> 
> > Hi Andrew,
> > 
> > On Thu, 20 May 2010 13:43:59 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > > --- a/mm/percpu.c
> > > > +++ b/mm/percpu.c
> > > > @@ -714,13 +714,29 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
> > > 
> > > In linux-next, Tejun has gone and moved pcpu_alloc_pages() into the new
> > > mm/percpu-vm.c.  So either
> > 
> > This has gone into Linus' tree today ...
> > 
> 
> Hmm, a comment here.
> 
> Recently, Lee Schermerhorn developed
> 
>  numa-introduce-numa_mem_id-effective-local-memory-node-id-fix2.patch
> 
> Then, you can use cpu_to_mem() instead of cpu_to_node() to find the
> nearest available node.
> I don't check cpu_to_mem() is synchronized with NUMA hotplug but
> using cpu_to_mem() rather than adding 
> =
> 
> +			if ((nid == -1) ||
> +			    !(node_zonelist(nid, GFP_KERNEL)->_zonerefs->zone))
> +				nid = numa_node_id();
> +
> ==
> 
> is better. 


Kame-san, all:

numa_mem_id() and cpu_to_mem() are not supported [yet] on x86 because
x86 hides all memoryless nodes and moves cpus to "nearby" [for some
definition thereof] nodes with memory.  So, these interfaces just return
numa_node_id() and cpu_to_node() for x86.  Perhaps that will change
someday...

Lee


> 
> Thanks,
> -Kame
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-21  9:12           ` minskey guo
@ 2010-05-21 13:21             ` Lee Schermerhorn
  2010-05-24  1:03               ` Guo, Chaohong
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2010-05-21 13:21 UTC (permalink / raw)
  To: minskey guo
  Cc: KAMEZAWA Hiroyuki, Stephen Rothwell, Andrew Morton, linux-mm,
	prarit, andi.kleen, linux-kernel, minskey guo, Tejun Heo, stable

On Fri, 2010-05-21 at 17:12 +0800, minskey guo wrote:
> On 05/21/2010 04:39 PM, KAMEZAWA Hiroyuki wrote:
> > On Fri, 21 May 2010 16:22:19 +0800
> > minskey guo<chaohong_guo@linux.intel.com>  wrote:
> >
> >> Yes.  I can use cpu_to_mem().  only some little difference during
> >> CPU online:  1st cpu within memoryless node gets memory from current
> >> node or the node to which the cpu0 belongs,
> >>
> >>
> >> But I have a question about the patch:
> >>
> >>      numa-slab-use-numa_mem_id-for-slab-local-memory-node.patch,
> >>
> >>
> >>
> >>
> >> @@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
> >> ...
> >>
> >> -	for_each_possible_cpu(cpu)
> >> +	for_each_possible_cpu(cpu) {
> >> 		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> >> ...
> >>
> >> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> >> + 	if (cpu_online(cpu))
> >> +		cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
> >> +#endif
> 
> Look at the above code,  int __build_all_zonelists(),  cpu_to_mem(cpu)
> is set only when cpu is onlined.  Suppose that a node with local memory,
> all memory segments are onlined first, and then,  cpus within that node
> are onlined one by one,  in this case,  where does the cpu_to_mem(cpu)
> for the last cpu get its value ?

Minskey:

As I mentioned to Kame-san, x86 does not define
CONFIG_HAVE_MEMORYLESS_NODES, so this code is not compiled for that
arch.  If x86 did support memoryless nodes--i.e., did not hide them and
reassign the cpus to other nodes, as is the case for ia64--then we could
have on-line cpus associated with memoryless nodes.  The code above is
in __build_all_zonelists() so that in the case where we add memory to a
previously memoryless node, we re-evaluate the "local memory node" for
all online cpus.

For cpu hotplug--again, if x86 supports memoryless nodes--we'll need to
add a similar chunk to the path where we set up the cpu_to_node map for
a hotplugged cpu.  See, for example, the call to set_numa_mem() in
smp_callin() in arch/ia64/kernel/smpboot.c.  But currently, I don't
think you can use the numa_mem_id()/cpu_to_mem() interfaces for your
purpose.  I suppose you could change page_alloc.c to compile
local_memory_node() #if defined(CONFIG_HAVE_MEMORYLESS_NODES) || defined
(CPU_HOTPLUG) and use that function to find the nearest memory.  It
should return a valid node after zonelists have been rebuilt.

Does that make sense?

Lee
> 
> 
> >
> > So, cpu_to_node(cpu) for possible cpus will have NUMA_NO_NODE(-1)
> > or the number of the nearest node.
> >
> > IIUC, if SRAT is not broken, all pxm has its own node_id.
> 
> Thank you very much for the info,  I have been thinking why node_id
> is (-1) in some cases.
> 
> 
> -minskey
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-21 13:21             ` Lee Schermerhorn
@ 2010-05-24  1:03               ` Guo, Chaohong
  2010-05-24 14:59                 ` Lee Schermerhorn
  0 siblings, 1 reply; 14+ messages in thread
From: Guo, Chaohong @ 2010-05-24  1:03 UTC (permalink / raw)
  To: Lee Schermerhorn, minskey guo
  Cc: KAMEZAWA Hiroyuki, Stephen Rothwell, Andrew Morton,
	linux-mm@kvack.org, prarit@redhat.com, Kleen, Andi,
	linux-kernel@vger.kernel.org, Tejun Heo, stable@kernel.org



>> >>
>> >>
>> >> @@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
>> >> ...
>> >>
>> >> -	for_each_possible_cpu(cpu)
>> >> +	for_each_possible_cpu(cpu) {
>> >> 		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
>> >> ...
>> >>
>> >> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
>> >> + 	if (cpu_online(cpu))
>> >> +		cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
>> >> +#endif
>>
>> Look at the above code,  int __build_all_zonelists(),  cpu_to_mem(cpu)
>> is set only when cpu is onlined.  Suppose that a node with local memory,
>> all memory segments are onlined first, and then,  cpus within that node
>> are onlined one by one,  in this case,  where does the cpu_to_mem(cpu)
>> for the last cpu get its value ?
>
>Minskey:
>
>As I mentioned to Kame-san, x86 does not define
>CONFIG_HAVE_MEMORYLESS_NODES, so this code is not compiled for that
>arch.  If x86 did support memoryless nodes--i.e., did not hide them and
>reassign the cpus to other nodes, as is the case for ia64--then we could
>have on-line cpus associated with memoryless nodes.  The code above is
>in __build_all_zonelists() so that in the case where we add memory to a
>previously memoryless node, we re-evaluate the "local memory node" for
>all online cpus.
>
>For cpu hotplug--again, if x86 supports memoryless nodes--we'll need to
>add a similar chunk to the path where we set up the cpu_to_node map for
>a hotplugged cpu.  See, for example, the call to set_numa_mem() in
>smp_callin() in arch/ia64/kernel/smpboot.c. 


Yeah, that's what I am looking for. 



>But currently, I don't
>think you can use the numa_mem_id()/cpu_to_mem() interfaces for your
>purpose.  I suppose you could change page_alloc.c to compile
>local_memory_node() #if defined(CONFIG_HAVE_MEMORYLESS_NODES) ||
>defined
>(CPU_HOTPLUG) and use that function to find the nearest memory.  It
>should return a valid node after zonelists have been rebuilt.
>
>Does that make sense?

Yes. Besides that, I need to find a place in the hotplug path to call
set_numa_mem(), just as you mentioned for the ia64 platform.  Is my
understanding correct?
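
(For reference, the ia64 call being discussed looks roughly like the
sketch below; this is reconstructed from memory of
arch/ia64/kernel/smpboot.c, so the exact context and argument may
differ:)

```c
/* in smp_callin(), once the booting cpu's node is known: record the
 * nearest node that has memory in this cpu's per-cpu numa_mem variable */
set_numa_mem(local_memory_node(cpu_to_node(smp_processor_id())));
```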




Thanks,
-minskey








>
>Lee
>>
>>
>> >
>> > So, cpu_to_node(cpu) for possible cpus will have NUMA_NO_NODE(-1)
>> > or the number of the nearest node.
>> >
>> > IIUC, if SRAT is not broken, all pxm has its own node_id.
>>
>> Thank you very much for the info,  I have been thinking why node_id
>> is (-1) in some cases.
>>
>>
>> -minskey
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-24  1:03               ` Guo, Chaohong
@ 2010-05-24 14:59                 ` Lee Schermerhorn
  2010-05-25  1:35                   ` Guo, Chaohong
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2010-05-24 14:59 UTC (permalink / raw)
  To: Guo, Chaohong
  Cc: minskey guo, KAMEZAWA Hiroyuki, Stephen Rothwell, Andrew Morton,
	linux-mm@kvack.org, prarit@redhat.com, Kleen, Andi,
	linux-kernel@vger.kernel.org, Tejun Heo, stable@kernel.org

On Mon, 2010-05-24 at 09:03 +0800, Guo, Chaohong wrote:
> 
> >> >>
> >> >>
> >> >> @@ -2968,9 +2991,23 @@ static int __build_all_zonelists(void *d
> >> >> ...
> >> >>
> >> >> -	for_each_possible_cpu(cpu)
> >> >> +	for_each_possible_cpu(cpu) {
> >> >> 		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> >> >> ...
> >> >>
> >> >> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> >> >> + 	if (cpu_online(cpu))
> >> >> +		cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu));
> >> >> +#endif
> >>
> >> Looking at the above code in __build_all_zonelists(), cpu_to_mem(cpu)
> >> is set only when the cpu is online.  Suppose a node has local memory,
> >> all of its memory segments are onlined first, and then the cpus within
> >> that node are onlined one by one.  In this case, where does
> >> cpu_to_mem(cpu) for the last cpu get its value?
> >
> >Minskey:
> >
> >As I mentioned to Kame-san, x86 does not define
> >CONFIG_HAVE_MEMORYLESS_NODES, so this code is not compiled for that
> >arch.  If x86 did support memoryless nodes--i.e., did not hide them and
> >reassign the cpus to other nodes, as is the case for ia64--then we could
> >have on-line cpus associated with memoryless nodes.  The code above is
> >in __build_all_zonelists() so that in the case where we add memory to a
> >previously memoryless node, we re-evaluate the "local memory node" for
> >all online cpus.
> >
> >For cpu hotplug--again, if x86 supports memoryless nodes--we'll need to
> >add a similar chunk to the path where we set up the cpu_to_node map for
> >a hotplugged cpu.  See, for example, the call to set_numa_mem() in
> >smp_callin() in arch/ia64/kernel/smpboot.c. 
> 
> 
> Yeah, that's what I am looking for.
> 
> > But currently, I don't
> >think you can use the numa_mem_id()/cpu_to_mem() interfaces for your
> >purpose.  I suppose you could change page_alloc.c to compile
> >local_memory_node() #if defined(CONFIG_HAVE_MEMORYLESS_NODES) ||
> >defined
> >(CPU_HOTPLUG) and use that function to find the nearest memory.  It
> >should return a valid node after zonelists have been rebuilt.
> >
> >Does that make sense?
> 
> Yes. Besides that, I need to find a place in the hotplug path to call
> set_numa_mem(), just as you mentioned for the ia64 platform.  Is my
> understanding correct?

I don't think you can use any of the "numa_mem" functions on x86[_64]
without doing a lot more work to expose memoryless nodes.  On x86_64,
numa_mem_id() and cpu_to_mem() always return the same as numa_node_id()
and cpu_to_node().  This is because x86_64 code hides memoryless nodes
and reassigns all cpus to nodes with memory.  Are you planning on
changing this such that memoryless nodes remain on-line with their cpus
associated with them?  If so, go for it!   If not, then you don't need
to [can't really, I think] use set_numa_mem()/cpu_to_mem() for your
purposes.  That's why I suggested you arrange for local_memory_node() to
be compiled for CPU_HOTPLUG and call that function directly to obtain a
nearby node from which you can allocate memory during cpu hot plug.  Or,
I could just completely misunderstand what you propose to do with these
percpu variables.

Lee




^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] online CPU before memory failed in pcpu_alloc_pages()
  2010-05-24 14:59                 ` Lee Schermerhorn
@ 2010-05-25  1:35                   ` Guo, Chaohong
  0 siblings, 0 replies; 14+ messages in thread
From: Guo, Chaohong @ 2010-05-25  1:35 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: minskey guo, KAMEZAWA Hiroyuki, Stephen Rothwell, Andrew Morton,
	linux-mm@kvack.org, prarit@redhat.com, Kleen, Andi,
	linux-kernel@vger.kernel.org, Tejun Heo, stable@kernel.org



>>
>> > But currently, I don't
>> >think you can use the numa_mem_id()/cpu_to_mem() interfaces for your
>> >purpose.  I suppose you could change page_alloc.c to compile
>> >local_memory_node() #if defined(CONFIG_HAVE_MEMORYLESS_NODES) ||
>> >defined
>> >(CPU_HOTPLUG) and use that function to find the nearest memory.  It
>> >should return a valid node after zonelists have been rebuilt.
>> >
>> >Does that make sense?
>>
>> Yes. Besides that, I need to find a place in the hotplug path to call
>> set_numa_mem(), just as you mentioned for the ia64 platform.  Is my
>> understanding correct?
>
>I don't think you can use any of the "numa_mem" functions on x86[_64]
>without doing a lot more work to expose memoryless nodes.  On x86_64,
>numa_mem_id() and cpu_to_mem() always return the same as numa_node_id()
>and cpu_to_node().  This is because x86_64 code hides memoryless nodes
>and reassigns all cpus to nodes with memory.  Are you planning on
>changing this such that memoryless nodes remain on-line with their cpus
>associated with them?  If so, go for it!   If not, then you don't need
>to [can't really, I think] use set_numa_mem()/cpu_to_mem() for your
>purposes.  That's why I suggested you arrange for local_memory_node() to
>be compiled for CPU_HOTPLUG and call that function directly to obtain a
>nearby node from which you can allocate memory during cpu hot plug.  Or,
>I could just completely misunderstand what you propose to do with these
>percpu variables.

Got it, thank you very much for the detailed explanation.


-minskey


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2010-05-25  1:36 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-18  6:17 [PATCH] online CPU before memory failed in pcpu_alloc_pages() minskey guo
2010-05-20 20:43 ` Andrew Morton
2010-05-21  0:55   ` Stephen Rothwell
2010-05-21  4:44     ` KAMEZAWA Hiroyuki
2010-05-21  8:22       ` minskey guo
2010-05-21  8:39         ` KAMEZAWA Hiroyuki
2010-05-21  9:12           ` minskey guo
2010-05-21 13:21             ` Lee Schermerhorn
2010-05-24  1:03               ` Guo, Chaohong
2010-05-24 14:59                 ` Lee Schermerhorn
2010-05-25  1:35                   ` Guo, Chaohong
2010-05-21 12:32       ` Lee Schermerhorn
2010-05-21  4:05   ` Guo, Chaohong
2010-05-21  7:29   ` Kleen, Andi
