From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4D66EA0A.1050405@kernel.org>
Date: Thu, 24 Feb 2011 15:30:18 -0800
From: Yinghai Lu
To: David Rientjes
CC: Tejun Heo, Ingo Molnar, tglx@linutronix.de, "H. Peter Anvin",
 linux-kernel@vger.kernel.org
Subject: Re: [patch] x86, mm: Fix size of numa_distance array
References: <20110224145128.GM7840@htj.dyndns.org> <4D66AC9C.6080500@kernel.org>
 <20110224192305.GB15498@elte.hu> <4D66B176.9030300@kernel.org>
 <20110224193211.GC15498@elte.hu>
In-Reply-To:
X-Mailing-List: linux-kernel@vger.kernel.org

On 02/24/2011 02:46 PM, David Rientjes wrote:
> On Thu, 24 Feb 2011, Tejun Heo wrote:
> 
>>>> DavidR reported that x86/mm broke his numa emulation with 128M etc.
>>>
>>> That regression needs to be fixed. Tejun, do you know about that bug?
>>
>> Nope, David said he was gonna look into what happened but never got
>> back. David?
>>
> 
> I merged x86/mm with Linus' tree, it booted fine without numa=fake but
> then panics with numa=fake=128M (and could only be captured by
> earlyprintk):
> 
> [    0.000000] BUG: unable to handle kernel paging request at ffff88007ff00000
> [    0.000000] IP: [] numa_alloc_distance+0x146/0x17a
> [    0.000000] PGD 1804063 PUD 7fefd067 PMD 7fefe067 PTE 0
> [    0.000000] Oops: 0002 [#1] SMP
> [    0.000000] last sysfs file:
> [    0.000000] CPU 0
> [    0.000000] Modules linked in:
> [    0.000000]
> [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.38-x86-mm #1
> [    0.000000] RIP: 0010:[] [] numa_alloc_distance+0x146/0x17a
> [    0.000000] RSP: 0000:ffffffff81801d28 EFLAGS: 00010006
> [    0.000000] RAX: 0000000000000009 RBX: 00000000000001ff RCX: 0000000000000ff8
> [    0.000000] RDX: 0000000000000008 RSI: 000000007feff014 RDI: ffffffff8199ed0a
> [    0.000000] RBP: ffffffff81801dc8 R08: 0000000000001000 R09: 000000008199ed0a
> [    0.000000] R10: 000000007feff004 R11: 000000007fefd000 R12: 00000000000001ff
> [    0.000000] R13: ffff88007feff000 R14: ffffffff81801d28 R15: ffffffff819b7ca0
> [    0.000000] FS: 0000000000000000(0000) GS:ffffffff818da000(0000) knlGS:0000000000000000
> [    0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.000000] CR2: ffff88007ff00000 CR3: 0000000001803000 CR4: 00000000000000b0
> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [    0.000000] Process swapper (pid: 0, threadinfo ffffffff81800000, task ffffffff8180b020)
> [    0.000000] Stack:
> [    0.000000]  ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
> [    0.000000]  ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> [    0.000000]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [    0.000000] Call Trace:
> [    0.000000]  [] numa_set_distance+0x24/0xac
> [    0.000000]  [] numa_emulation+0x236/0x284
> [    0.000000]  [] ? x86_acpi_numa_init+0x0/0x1b
> [    0.000000]  [] initmem_init+0xe8/0x56c
> [    0.000000]  [] ? native_apic_mem_read+0x9/0x13
> [    0.000000]  [] ? x86_acpi_numa_init+0x0/0x1b
> [    0.000000]  [] ? amd_numa_init+0x0/0x376
> [    0.000000]  [] ? dummy_numa_init+0x0/0x66
> [    0.000000]  [] ? register_lapic_address+0x75/0x85
> [    0.000000]  [] setup_arch+0xa29/0xae9
> [    0.000000]  [] ? printk+0x41/0x47
> [    0.000000]  [] start_kernel+0x8a/0x386
> [    0.000000]  [] x86_64_start_reservations+0xb4/0xb8
> [    0.000000]  [] x86_64_start_kernel+0xf2/0xf9
> 
> That's this:
> 
> 430		numa_distance_cnt = cnt;
> 431
> 432		/* fill with the default distances */
> 433		for (i = 0; i < cnt; i++)
> 434			for (j = 0; j < cnt; j++)
> 435 ===>			numa_distance[i * cnt + j] = i == j ?
> 436					LOCAL_DISTANCE : REMOTE_DISTANCE;
> 437		printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
> 438
> 439		return 0;
> 
> We're overflowing the array and it's easy to see why:
> 
> 	for_each_node_mask(i, nodes_parsed)
> 		cnt = i;
> 	size = ++cnt * sizeof(numa_distance[0]);
> 
> cnt is the highest node id parsed, so numa_distance[] must be cnt * cnt.
> The following patch fixes the issue on top of x86/mm.
> 
> I'm running on a 64GB machine with CONFIG_NODES_SHIFT == 10, so
> numa=fake=128M would result in 512 nodes.  That's going to require 2MB for
> numa_distance (and that's not __initdata).  Before these changes, we
> calculated numa_distance() using pxms without this additional mapping, is
> there any way to reduce this?  (Admittedly real NUMA machines with 512
> nodes wouldn't mind sacrificing 2MB, but we didn't need this before.)
> 
> 
> 
> x86, mm: Fix size of numa_distance array
> 
> numa_distance should be sized like the SLIT, an NxN matrix where N is the
> highest node id.  This patch fixes the calculation to avoid overflowing
> the array on the subsequent iteration.
> 
> Signed-off-by: David Rientjes
> ---
>  arch/x86/mm/numa_64.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
> index cccc01d..abf0131 100644
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -414,7 +414,7 @@ static int __init numa_alloc_distance(void)
> 
>  	for_each_node_mask(i, nodes_parsed)
>  		cnt = i;
> -	size = ++cnt * sizeof(numa_distance[0]);
> +	size = cnt * cnt * sizeof(numa_distance[0]);

should be

+	cnt++;
+	size = cnt * cnt * sizeof(numa_distance[0]);

> 
>  	phys = memblock_find_in_range(0, (u64)max_pfn_mapped << PAGE_SHIFT,
> 				      size, PAGE_SIZE);
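
For reference, a minimal user-space sketch of the sizing being suggested
above. This is not the kernel code: nodes_parsed is replaced by a
hard-coded highest node id, the table element is assumed to be one byte
like the kernel's u8 numa_distance[], and LOCAL_DISTANCE/REMOTE_DISTANCE
use the usual 10/20 defaults. It only illustrates why the table needs
(highest node id + 1) squared entries before the default-distance fill
loop runs.

/* sketch only -- mimics the numa_alloc_distance() sizing in user space */
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_DISTANCE	10
#define REMOTE_DISTANCE	20

int main(void)
{
	int highest = 7;		/* stand-in for the last id in nodes_parsed */
	unsigned char *numa_distance;	/* one byte per entry, as assumed above */
	size_t size;
	int i, j, cnt;

	/*
	 * The fill loop below indexes numa_distance[i * cnt + j] with
	 * i, j in [0, cnt), so the table needs cnt * cnt entries and
	 * cnt must be the highest node id plus one.  Both the old
	 * "++cnt * sizeof(...)" and "cnt * cnt" without the increment
	 * leave it too small.
	 */
	cnt = highest;
	cnt++;
	size = cnt * cnt * sizeof(numa_distance[0]);

	numa_distance = malloc(size);
	if (!numa_distance)
		return 1;

	/* same default fill as numa_alloc_distance() */
	for (i = 0; i < cnt; i++)
		for (j = 0; j < cnt; j++)
			numa_distance[i * cnt + j] =
				i == j ? LOCAL_DISTANCE : REMOTE_DISTANCE;

	printf("cnt=%d, table size=%zu bytes\n", cnt, size);
	free(numa_distance);
	return 0;
}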