public inbox for linuxppc-dev@ozlabs.org
* Nodes with no memory
@ 2008-11-21 23:50 Dave Hansen
  2008-11-22  0:49 ` Nathan Lynch
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Hansen @ 2008-11-21 23:50 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mjw, Nathan Lynch, Paul Mackerras

I was handed off a bug report about a blade not booting with a, um
"newer" kernel.  After turning on some debugging messages, I got this
ominous message:

        node 1
        NODE_DATA() = c000000000000000

Which obviously comes from here:

arch/powerpc/mm/numa.c

        for_each_online_node(nid) {
                unsigned long start_pfn, end_pfn;
                unsigned long bootmem_paddr;
                unsigned long bootmap_pages;

                get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);

                /* Allocate the node structure node local if possible */
                NODE_DATA(nid) = careful_allocation(nid,
                                        sizeof(struct pglist_data),
                                        SMP_CACHE_BYTES, end_pfn);
                NODE_DATA(nid) = __va(NODE_DATA(nid));
                memset(NODE_DATA(nid), 0, sizeof(struct pglist_data));
		...

careful_allocation() returns a NULL physical address, but we go ahead
and run __va() on it, stick it in NODE_DATA(), and memset it.  Yay!

I seem to recall that we fixed some issues with memoryless nodes a few
years ago, like around the memory hotplug days, but I don't see the
patches anywhere.

I'm thinking that we need to at least fix careful_allocation() to oops
and not return NULL, or check to make sure all its callers check its
return code.  Plus, we probably also need to ensure that all ppc code
doing for_each_online_node() does not assume a valid NODE_DATA() for all
those nodes.

Any other thoughts?

I'll have a patch for the above issue sometime soon. 

-- Dave

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Nodes with no memory
  2008-11-21 23:50 Nodes with no memory Dave Hansen
@ 2008-11-22  0:49 ` Nathan Lynch
  2008-11-22  1:17   ` Dave Hansen
  0 siblings, 1 reply; 4+ messages in thread
From: Nathan Lynch @ 2008-11-22  0:49 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linuxppc-dev, mjw, Paul Mackerras

Dave Hansen wrote:
> I was handed off a bug report about a blade not booting with a, um
> "newer" kernel.

If you're unable to provide basic information such as the kernel
version then perhaps this isn't the best forum for discussing this.  :)


> I'm thinking that we need to at least fix careful_allocation() to oops
> and not return NULL, or check to make sure all its callers check its
> return code.

Well, careful_allocation() in current mainline tries pretty hard to
panic if it can't satisfy the request.  Why isn't that happening?


* Re: Nodes with no memory
  2008-11-22  0:49 ` Nathan Lynch
@ 2008-11-22  1:17   ` Dave Hansen
  2008-11-22  8:58     ` Milton Miller
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Hansen @ 2008-11-22  1:17 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: linuxppc-dev, mjw, anton, Paul Mackerras

On Fri, 2008-11-21 at 18:49 -0600, Nathan Lynch wrote:
> Dave Hansen wrote:
> > I was handed off a bug report about a blade not booting with a, um
> > "newer" kernel.
> 
> If you're unable to provide basic information such as the kernel
> version then perhaps this isn't the best forum for discussing this.  :)

Let's just say a derivative of 2.6.27.5.  I will, of course, be trying
to reproduce on mainline.  I'm just going with the closest kernel to
the bug report that I can get for now.

> > I'm thinking that we need to at least fix careful_allocation() to oops
> > and not return NULL, or check to make sure all its callers check its
> > return code.
> 
> Well, careful_allocation() in current mainline tries pretty hard to
> panic if it can't satisfy the request.  Why isn't that happening?

I added some random debugging to careful_alloc() to find out.

careful_allocation(1, 7680, 80, 0)
careful_allocation() ret1: 00000001dffe4100
careful_allocation() ret2: 00000001dffe4100
careful_allocation() ret3: 00000001dffe4100
careful_allocation() ret4: c000000000000000
careful_allocation() ret5: 0000000000000000

It looks to me like it is hitting 'the memory came from a previously
allocated node' check.  So, the __lmb_alloc_base() appears to get
something worthwhile, but that gets overwritten later.

I'm still not quite sure what this comment means.  Are we just trying to
get node locality from the allocation?

I also need to go look at how __alloc_bootmem_node() ends up returning
c000000000000000.  It should be returning NULL and panicking in
careful_allocation().  This probably has to do with the fact that
NODE_DATA() isn't set up yet, but I'll double check.

        /*
         * If the memory came from a previously allocated node, we must
         * retry with the bootmem allocator.
         */
        new_nid = early_pfn_to_nid(ret >> PAGE_SHIFT);
        if (new_nid < nid) {
                ret = (unsigned long)__alloc_bootmem_node(NODE_DATA(new_nid),
                                size, align, 0);
                dbg("careful_allocation() ret4: %016lx\n", ret);

                if (!ret)
                        panic("numa.c: cannot allocate %lu bytes on node %d",
                              size, new_nid);

                ret = __pa(ret);
                dbg("careful_allocation() ret5: %016lx\n", ret);

                dbg("alloc_bootmem %lx %lx\n", ret, size);
        }



-- Dave


* Re: Nodes with no memory
  2008-11-22  1:17   ` Dave Hansen
@ 2008-11-22  8:58     ` Milton Miller
  0 siblings, 0 replies; 4+ messages in thread
From: Milton Miller @ 2008-11-22  8:58 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-ppc, Nathan Lynch

On Sat Nov 22 at 12:17:22 EST in 2008 Dave Hansen wrote:
> On Fri, 2008-11-21 at 18:49 -0600, Nathan Lynch wrote:
>> Dave Hansen wrote:
>>> I was handed off a bug report about a blade not booting with a, um
>>> "newer" kernel.
>>
>> If you're unable to provide basic information such as the kernel
>> version then perhaps this isn't the best forum for discussing this.  :)
>
> Let's just say a derivative of 2.6.27.5.  I will, of course, be trying
> to reproduce on mainline.  I'm just going with the closest kernel to
> the bug report that I can get for now.

This reminds me.  I was asked to look at a system that had all cpus
and memory on node 1.  I recently switched to 2.6.27.0, and had a
similar failure when I tried my latest development kernel.  However, I
realized that the user wanted to run my previously supported 2.6.24
kernel, and that did not have this issue, so I never got back to
debugging this problem.  (Both kernels had similar patches applied,
but very little touching mm or numa selection.)  I was able to fix the
problem they were having and returned the machine to them without
debugging the issue, but I suspect the problem was introduced to
mainline between 2.6.24 and 2.6.27.

>>> I'm thinking that we need to at least fix careful_allocation() to
>>> oops and not return NULL, or check to make sure all its callers
>>> check its return code.
>>
>> Well, careful_allocation() in current mainline tries pretty hard to
>> panic if it can't satisfy the request.  Why isn't that happening?
>
> I added some random debugging to careful_alloc() to find out.
>
> careful_allocation(1, 7680, 80, 0)
> careful_allocation() ret1: 00000001dffe4100
> careful_allocation() ret2: 00000001dffe4100
> careful_allocation() ret3: 00000001dffe4100
> careful_allocation() ret4: c000000000000000
> careful_allocation() ret5: 0000000000000000
>
> It looks to me like it is hitting 'the memory came from a previously
> allocated node' check.  So, the __lmb_alloc_base() appears to get
> something worthwhile, but that gets overwritten later.
>
> I'm still not quite sure what this comment means.  Are we just trying
> to get node locality from the allocation?

My memory (and a quick look) is that careful_allocation() is used
while we are in the process of creating the memory maps for the node.
We want them to be allocated from memory on the node, but will accept
memory from any node to handle the case that memory is not available
in the desired node.  Linux requires the maps to exist for every
online node.

Because we are in the process of transferring the memory between
allocators, the check for new_nid < nid is meant to say "if the memory
did not come from the preferred node, but instead came from one we
already transferred, then we need to obtain that memory from the new
allocator".  If it came from the preferred node or a later node, the
allocation we did is valid, and will be marked in-use when we transfer
that node's memory.

> I also need to go look at how __alloc_bootmem_node() ends up returning
> c000000000000000.  It should be returning NULL and panicking in
> careful_allocation().  This probably has to do with the fact that
> NODE_DATA() isn't set up yet, but I'll double check.

We set up NODE_DATA with the result of this alloc in nid order.  If
early_pfn_to_nid() returns the wrong value then we would obviously be
in trouble here.

>         /*
>          * If the memory came from a previously allocated node, we must
>          * retry with the bootmem allocator.
>          */
>         new_nid = early_pfn_to_nid(ret >> PAGE_SHIFT);
>         if (new_nid < nid) {
>                 ret = (unsigned long)__alloc_bootmem_node(NODE_DATA(new_nid),
>                                 size, align, 0);
>                 dbg("careful_allocation() ret4: %016lx\n", ret);
>
>                 if (!ret)
>                         panic("numa.c: cannot allocate %lu bytes on node %d",
>                               size, new_nid);
>
>                 ret = __pa(ret);
>                 dbg("careful_allocation() ret5: %016lx\n", ret);
>
>                 dbg("alloc_bootmem %lx %lx\n", ret, size);
>         }

Perhaps someone can recreate this with the fake numa stuff that was
added since 2.6.24?  Or edit a device tree to fake the numa
assignments for memory and kexec using the modified tree.

milton

