* Nodes with no memory
@ 2008-11-21 23:50 Dave Hansen
From: Dave Hansen @ 2008-11-21 23:50 UTC (permalink / raw)
To: linuxppc-dev; +Cc: mjw, Nathan Lynch, Paul Mackerras
I was handed off a bug report about a blade not booting with a, um
"newer" kernel. After turning on some debugging messages, I got this
ominous message:
node 1
NODE_DATA() = c000000000000000
Which obviously comes from here:
arch/powerpc/mm/numa.c:

	for_each_online_node(nid) {
		unsigned long start_pfn, end_pfn;
		unsigned long bootmem_paddr;
		unsigned long bootmap_pages;

		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);

		/* Allocate the node structure node local if possible */
		NODE_DATA(nid) = careful_allocation(nid,
					sizeof(struct pglist_data),
					SMP_CACHE_BYTES, end_pfn);
		NODE_DATA(nid) = __va(NODE_DATA(nid));
		memset(NODE_DATA(nid), 0, sizeof(struct pglist_data));
		...
careful_allocation() returns a NULL physical address, but we go ahead
and run __va() on it, stick it in NODE_DATA(), and memset it. Yay!
I seem to recall that we fixed some issues with memoryless nodes a few
years ago, like around the memory hotplug days, but I don't see the
patches anywhere.
I'm thinking that we need to at least fix careful_allocation() to oops
and not return NULL, or make sure all its callers check its return
code. Plus, we probably also need to ensure that all ppc code doing
for_each_online_node() does not assume a valid NODE_DATA() for all
those nodes.
Any other thoughts?
I'll have a patch for the above issue sometime soon.
-- Dave
* Re: Nodes with no memory
From: Nathan Lynch @ 2008-11-22 0:49 UTC (permalink / raw)
To: Dave Hansen; +Cc: linuxppc-dev, mjw, Paul Mackerras

Dave Hansen wrote:
> I was handed off a bug report about a blade not booting with a, um
> "newer" kernel.

If you're unable to provide basic information such as the kernel
version then perhaps this isn't the best forum for discussing this. :)

> I'm thinking that we need to at least fix careful_allocation() to oops
> and not return NULL, or make sure all its callers check its return
> code.

Well, careful_allocation() in current mainline tries pretty hard to
panic if it can't satisfy the request. Why isn't that happening?
* Re: Nodes with no memory
From: Dave Hansen @ 2008-11-22 1:17 UTC (permalink / raw)
To: Nathan Lynch; +Cc: linuxppc-dev, mjw, anton, Paul Mackerras

On Fri, 2008-11-21 at 18:49 -0600, Nathan Lynch wrote:
> Dave Hansen wrote:
> > I was handed off a bug report about a blade not booting with a, um
> > "newer" kernel.
>
> If you're unable to provide basic information such as the kernel
> version then perhaps this isn't the best forum for discussing this. :)

Let's just say a derivative of 2.6.27.5. I will, of course, be trying
to reproduce on mainline. I'm just going with the closest kernel to
the bug report that I can get for now.

> > I'm thinking that we need to at least fix careful_allocation() to
> > oops and not return NULL, or make sure all its callers check its
> > return code.
>
> Well, careful_allocation() in current mainline tries pretty hard to
> panic if it can't satisfy the request. Why isn't that happening?

I added some random debugging to careful_alloc() to find out.

careful_allocation(1, 7680, 80, 0)
careful_allocation() ret1: 00000001dffe4100
careful_allocation() ret2: 00000001dffe4100
careful_allocation() ret3: 00000001dffe4100
careful_allocation() ret4: c000000000000000
careful_allocation() ret5: 0000000000000000

It looks to me like it is hitting the 'memory came from a previously
allocated node' check. So, the __lmb_alloc_base() appears to get
something worthwhile, but that gets overwritten later.

I'm still not quite sure what this comment means. Are we just trying
to get node locality from the allocation?

I also need to go look at how __alloc_bootmem_node() ends up returning
c000000000000000. It should be returning NULL, and panic'ing, in
careful_alloc(). This probably has to do with the fact that
NODE_DATA() isn't set up yet, but I'll double check.
	/*
	 * If the memory came from a previously allocated node, we must
	 * retry with the bootmem allocator.
	 */
	new_nid = early_pfn_to_nid(ret >> PAGE_SHIFT);
	if (new_nid < nid) {
		ret = (unsigned long)__alloc_bootmem_node(NODE_DATA(new_nid),
				size, align, 0);
		dbg("careful_allocation() ret4: %016lx\n", ret);

		if (!ret)
			panic("numa.c: cannot allocate %lu bytes on node %d",
					size, new_nid);

		ret = __pa(ret);
		dbg("careful_allocation() ret5: %016lx\n", ret);

		dbg("alloc_bootmem %lx %lx\n", ret, size);
	}

--
Dave
* Re: Nodes with no memory
From: Milton Miller @ 2008-11-22 8:58 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-ppc, Nathan Lynch

On Sat Nov 22 at 12:17:22 EST in 2008 Dave Hansen wrote:
> On Fri, 2008-11-21 at 18:49 -0600, Nathan Lynch wrote:
>> Dave Hansen wrote:
>>> I was handed off a bug report about a blade not booting with a, um
>>> "newer" kernel.
>>
>> If you're unable to provide basic information such as the kernel
>> version then perhaps this isn't the best forum for discussing this. :)
>
> Let's just say a derivative of 2.6.27.5. I will, of course, be trying
> to reproduce on mainline. I'm just going with the closest kernel to
> the bug report that I can get for now.

This reminds me. I was asked to look at a system that had all cpus and
memory on node 1. I recently switched to 2.6.27.0, and had a similar
failure when I tried my latest development kernel. However, I realized
that the user wanted to run my previously supported 2.6.24 kernel, and
that did not have this issue, so I never got back to debugging this
problem. (Both kernels had similar patches applied, but very little to
mm or numa selection.) I was able to fix the problem they were having
and returned the machine to them without debugging the issue, but I
suspect the problem was introduced to mainline between 2.6.24 and
2.6.27.

>>> I'm thinking that we need to at least fix careful_allocation() to
>>> oops and not return NULL, or make sure all its callers check its
>>> return code.
>>
>> Well, careful_allocation() in current mainline tries pretty hard to
>> panic if it can't satisfy the request. Why isn't that happening?
>
> I added some random debugging to careful_alloc() to find out.
> careful_allocation(1, 7680, 80, 0)
> careful_allocation() ret1: 00000001dffe4100
> careful_allocation() ret2: 00000001dffe4100
> careful_allocation() ret3: 00000001dffe4100
> careful_allocation() ret4: c000000000000000
> careful_allocation() ret5: 0000000000000000
>
> It looks to me like it is hitting the 'memory came from a previously
> allocated node' check. So, the __lmb_alloc_base() appears to get
> something worthwhile, but that gets overwritten later.
>
> I'm still not quite sure what this comment means. Are we just trying
> to get node locality from the allocation?

My memory (and a quick look) is that careful_allocation() is used
while we are in the process of creating the memory maps for the node.
We want them to be allocated from memory on the node, but will accept
memory from any node to handle the case that memory is not available
in the desired node. Linux requires the maps to exist for every online
node.

Because we are in the process of transferring the memory between
allocators, the check for new_nid < nid is meant to say "if the memory
did not come from the preferred node, but instead came from one we
already transferred, then we need to obtain that memory from the new
allocator". If it came from the preferred node or a later node, the
allocation we did is valid, and will be marked in-use when we transfer
that node's memory.

> I also need to go look at how __alloc_bootmem_node() ends up returning
> c000000000000000. It should be returning NULL, and panic'ing, in
> careful_alloc(). This probably has to do with the fact that
> NODE_DATA() isn't set up yet, but I'll double check.

We set up NODE_DATA with the result of this alloc in nid order. If
early_pfn_to_nid() returns the wrong value then we would obviously be
in trouble here.

>	/*
>	 * If the memory came from a previously allocated node, we must
>	 * retry with the bootmem allocator.
>	 */
>	new_nid = early_pfn_to_nid(ret >> PAGE_SHIFT);
>	if (new_nid < nid) {
>		ret = (unsigned long)__alloc_bootmem_node(NODE_DATA(new_nid),
>				size, align, 0);
>		dbg("careful_allocation() ret4: %016lx\n", ret);
>
>		if (!ret)
>			panic("numa.c: cannot allocate %lu bytes on node %d",
>					size, new_nid);
>
>		ret = __pa(ret);
>		dbg("careful_allocation() ret5: %016lx\n", ret);
>
>		dbg("alloc_bootmem %lx %lx\n", ret, size);
>	}

Perhaps someone can recreate this with the fake numa stuff that was
added since 2.6.24? Or edit a device tree to fake the numa assignments
for memory and kexec using the modified tree.

milton