* numa aware lmb and sparc stuff
From: Benjamin Herrenschmidt @ 2010-05-10 4:35 UTC
To: David Miller
Cc: Yinghai Lu, Ingo Molnar, Thomas Gleixner, linux-mm@kvack.org
Hi Dave !
So I'm looking at properly sorting out the interactions between LMB and
NUMA, among other things in order to use that stuff on powerpc (and
others) as well, but also to try to sort out some of that NO_BOOTMEM
stuff from Yinghai.
Currently, my understanding of how things work on sparc is that you
construct an array of "struct node_mem_mask" at boot, one for each
node, which are used to define the base and size of nodes as powers of
two.
You then pass to lmb_alloc_nid() a pointer to a nid_range() function
which walks that array to provide node information back to lmb (which in
my current patch series, I replaced with an arch callback
lmb_nid_range()).
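To make that mechanism concrete, here is a minimal user-space sketch of a
mask/value table and a nid_range()-style walk. It is not the sparc64 code:
the field names, the helper find_node(), the page size and the two-node
table are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 8192ULL	/* assumed page size, illustration only */

/* Hypothetical layout: addr belongs to a node when (addr & mask) == val. */
struct node_mem_mask {
	uint64_t mask;
	uint64_t val;
};

/* Two made-up nodes, selected by address bit 30 (1 GiB granularity). */
static const struct node_mem_mask node_masks[] = {
	{ .mask = 1ULL << 30, .val = 0 },
	{ .mask = 1ULL << 30, .val = 1ULL << 30 },
};
#define NUM_NODE_MASKS (sizeof(node_masks) / sizeof(node_masks[0]))

/* Return the index of the first matching entry, or -1 if none matches. */
static int find_node(uint64_t addr)
{
	for (size_t i = 0; i < NUM_NODE_MASKS; i++)
		if ((addr & node_masks[i].mask) == node_masks[i].val)
			return (int)i;
	return -1;
}

/*
 * Store in *nid the node owning start (assumed page-aligned) and return
 * the end of the longest run [start, ret) within [start, end) that stays
 * on that node. The walk is page by page, as the table has no notion of
 * contiguous ranges.
 */
static uint64_t nid_range(uint64_t start, uint64_t end, int *nid)
{
	assert(start < end);
	*nid = find_node(start);
	start += PAGE_SIZE;
	while (start < end && find_node(start) == *nid)
		start += PAGE_SIZE;
	return start < end ? start : end;
}
```

An allocator would call nid_range() repeatedly, advancing start to the
returned value, to split a region into per-node pieces.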
Now, I'm trying to figure out whether I can replace that latter part with
generic code in lmb.c which would use the early_node_map[] instead.
From what I can see, your only callsite of lmb_alloc_nid() is in
allocate_node_data() which is called in your three bootmem init
variants.
In all three cases, you proceed to call add_node_ranges() which calls
add_active_range() for the intersection of all lmb and nodes before you
call allocate_node_data(). This early_node_map[] should be properly
initialized by the time you get there.
So unless I'm missing something, I should be able to completely remove
lmb's reliance on that nid_range() callback and instead have lmb itself
use the various early_node_map[] accessors such as
for_each_active_range_index_in_nid() or similar.
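A rough sketch of what such a generic lookup could look like, written as
stand-alone C rather than kernel code. The struct active_range type, the
node_map[] table and map_nid_range() are hypothetical stand-ins for
early_node_map[] and its accessors, not the real kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one early_node_map[] entry, in pfn units. */
struct active_range {
	uint64_t start_pfn;
	uint64_t end_pfn;	/* exclusive */
	int nid;
};

/* Made-up map: node 0 owns pfns [0, 100), node 1 owns [100, 250). */
static const struct active_range node_map[] = {
	{ 0,   100, 0 },
	{ 100, 250, 1 },
};
#define NR_RANGES (sizeof(node_map) / sizeof(node_map[0]))

/*
 * Find the range containing start_pfn, store its node in *nid and
 * return where that range ends, clamped to end_pfn. If no range covers
 * start_pfn, set *nid to -1 and return start_pfn unchanged.
 */
static uint64_t map_nid_range(uint64_t start_pfn, uint64_t end_pfn, int *nid)
{
	assert(start_pfn < end_pfn);
	for (size_t i = 0; i < NR_RANGES; i++) {
		const struct active_range *r = &node_map[i];

		if (start_pfn >= r->start_pfn && start_pfn < r->end_pfn) {
			*nid = r->nid;
			return end_pfn < r->end_pfn ? end_pfn : r->end_pfn;
		}
	}
	*nid = -1;
	return start_pfn;
}
```

Compared with a page-by-page walk, this costs one pass over the range
table per query, but it only works if the table is fully populated before
the first node-aware allocation.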
What do you think ? Am I missing an important part of the picture on
sparc64 ?
If not, then I should be able to easily make that whole LMB numa thing
completely arch neutral.
Cheers,
Ben.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: numa aware lmb and sparc stuff
From: Paul Mundt @ 2010-05-10 5:01 UTC
To: Benjamin Herrenschmidt
Cc: David Miller, Yinghai Lu, Ingo Molnar, Thomas Gleixner,
linux-mm@kvack.org
On Mon, May 10, 2010 at 02:35:26PM +1000, Benjamin Herrenschmidt wrote:
> So unless I'm missing something, I should be able to completely remove
> lmb's reliance on that nid_range() callback and instead have lmb itself
> use the various early_node_map[] accessors such as
> for_each_active_range_index_in_nid() or similar.
>
If you do this then you will also be coupling LMB with
ARCH_POPULATES_NODE_MAP, which the nid_range() callback offers an
alternative for (although since there aren't any architectures presently
using LMB that don't also set ARCH_POPULATES_NODE_MAP perhaps this is
ok). The nobootmem stuff also has a reliance on the early node map
already.
> If not, then I should be able to easily make that whole LMB numa thing
> completely arch neutral.
>
I've just started sorting out some of the LMB/NUMA bits on SH now as
well, so I'd certainly be interested in any changes on top of Yinghai's
work you're planning on doing.
* Re: numa aware lmb and sparc stuff
From: Benjamin Herrenschmidt @ 2010-05-10 5:29 UTC
To: Paul Mundt
Cc: David Miller, Yinghai Lu, Ingo Molnar, Thomas Gleixner,
linux-mm@kvack.org
On Mon, 2010-05-10 at 14:01 +0900, Paul Mundt wrote:
> On Mon, May 10, 2010 at 02:35:26PM +1000, Benjamin Herrenschmidt wrote:
> > So unless I'm missing something, I should be able to completely remove
> > lmb's reliance on that nid_range() callback and instead have lmb itself
> > use the various early_node_map[] accessors such as
> > for_each_active_range_index_in_nid() or similar.
> >
> If you do this then you will also be coupling LMB with
> ARCH_POPULATES_NODE_MAP, which the nid_range() callback offers an
> alternative for (although since there aren't any architectures presently
> using LMB that don't also set ARCH_POPULATES_NODE_MAP perhaps this is
> ok). The nobootmem stuff also has a reliance on the early node map
> already.
Right, my tentative implementation indeed requires
ARCH_POPULATES_NODE_MAP for lmb_alloc_nid() to be available (I even
documented it). Do you see that as a limitation in the long run ?
> > If not, then I should be able to easily make that whole LMB numa thing
> > completely arch neutral.
> >
> I've just started sorting out some of the LMB/NUMA bits on SH now as
> well, so I'd certainly be interested in any changes on top of Yinghai's
> work you're planning on doing.
I'm not sure I plan to change things on -top- of Yinghai's work. I'm still
maintaining a patch series that is rooted before Yinghai's current one, as
I very very much dislike pretty much everything in there. Though I plan
to provide all the functionality he needs for his x86 port and
NO_BOOTMEM implementation.
I'll post my WIP series later today after I get a chance to do some
tests.
Cheers,
Ben.
* Re: numa aware lmb and sparc stuff
From: Paul Mundt @ 2010-05-10 6:03 UTC
To: Benjamin Herrenschmidt
Cc: David Miller, Yinghai Lu, Ingo Molnar, Thomas Gleixner,
linux-mm@kvack.org
On Mon, May 10, 2010 at 03:29:23PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2010-05-10 at 14:01 +0900, Paul Mundt wrote:
> > On Mon, May 10, 2010 at 02:35:26PM +1000, Benjamin Herrenschmidt wrote:
> > > So unless I'm missing something, I should be able to completely remove
> > > lmb's reliance on that nid_range() callback and instead have lmb itself
> > > use the various early_node_map[] accessors such as
> > > for_each_active_range_index_in_nid() or similar.
> > >
> > If you do this then you will also be coupling LMB with
> > ARCH_POPULATES_NODE_MAP, which the nid_range() callback offers an
> > alternative for (although since there aren't any architectures presently
> > using LMB that don't also set ARCH_POPULATES_NODE_MAP perhaps this is
> > ok). The nobootmem stuff also has a reliance on the early node map
> > already.
>
> Right, my tentative implementation indeed requires
> ARCH_POPULATES_NODE_MAP for lmb_alloc_nid() to be available (I even
> documented it). Do you see that as a limitation in the long run ?
>
I wouldn't call it a limitation so much as a subtle dependency. All of
the current platforms that are supporting NUMA are doing so along with
ARCH_POPULATES_NODE_MAP, so in those cases making the early_node_map
dependence explicit and generic will permit the killing off of
architecture-private data structures and accounting for region sizes and
node mappings.
The NUMA platforms that do not currently follow the
ARCH_POPULATES_NODE_MAP semantics seem to already be in various states of
disarray (generically broken, bitrotted, etc.). To that extent, perhaps
it's also useful to have NUMA imply ARCH_POPULATES_NODE_MAP? New
architectures that are going to opt for sparsemem or NUMA are likely
going to end up down the ARCH_POPULATES_NODE_MAP path anyways I would
imagine.
> > I've just started sorting out some of the LMB/NUMA bits on SH now as
> > well, so I'd certainly be interested in any changes on top of Yinghai's
> > work you're planning on doing.
>
> I'm not sure I plan to change things on -top- of Yinghai's work. I'm still
> maintaining a patch series that is rooted before Yinghai's current one, as
> I very very much dislike pretty much everything in there. Though I plan
> to provide all the functionality he needs for his x86 port and
> NO_BOOTMEM implementation.
>
That sounds fine, too. I'll certainly give it a go once the patches show
up.
On a somewhat related note, is your intention with powerpc that sparsemem
sections are always encapsulated within a single LMB region (assuming
that the sparsemem and LMB section sizes are different)? Do you simply
never permit node sizes smaller than the sparsemem section size (ie, in
the fake NUMA case)? I've been playing with this with both sparsemem and
ARCH_HAS_HOLES_MEMORYMODEL where those sorts of combinations will be
quite common. It would be good to have some LMB guidelines hammered out
before people get too carried away with building infrastructure on top of
it at least.
* Re: numa aware lmb and sparc stuff
From: Benjamin Herrenschmidt @ 2010-05-10 7:00 UTC
To: Paul Mundt
Cc: David Miller, Yinghai Lu, Ingo Molnar, Thomas Gleixner,
linux-mm@kvack.org
> I wouldn't call it a limitation so much as a subtle dependency. All of
> the current platforms that are supporting NUMA are doing so along with
> ARCH_POPULATES_NODE_MAP, so in those cases making the early_node_map
> dependence explicit and generic will permit the killing off of
> architecture-private data structures and accounting for region sizes and
> node mappings.
Right.
> The NUMA platforms that do not currently follow the
> ARCH_POPULATES_NODE_MAP semantics seem to already be in various states of
> disarray (generically broken, bitrotted, etc.). To that extent, perhaps
> it's also useful to have NUMA imply ARCH_POPULATES_NODE_MAP? New
> architectures that are going to opt for sparsemem or NUMA are likely
> going to end up down the ARCH_POPULATES_NODE_MAP path anyways I would
> imagine.
I tend to agree.
> That sounds fine, too. I'll certainly give it a go once the patches show
> up.
Thanks. I hope to have a first round out tonight, by no means final, and
one that doesn't yet handle all of Yinghai's x86 and NO_BOOTMEM needs,
but going through his patches, I'm finding that a lot of stuff in
there is either redundant or gratuitously obfuscated, so I have some
hope to get things done a bit more cleanly sooner rather than later :-)
I'm still not sure whether I may just implement _another_ NO_BOOTMEM
entirely: CONFIG_ARCH_BOOTMEM_USES_LMB to start with, and when x86 is
ported over to LMB, just plain kill the existing NO_BOOTMEM gunk, and
associated x86 crap that Yinghai made generic such as kernel/range.c
etc... we'll see.
Time is my main issue, and Ingo seems to have some kind of countdown
running: if we don't come up with something cleaner in the next few
days, he's going to merge all the junk for the sake of it :-)
> On a somewhat related note, is your intention with powerpc that sparsemem
> sections are always encapsulated within a single LMB region (assuming
> that the sparsemem and LMB section sizes are different)?
To some extent yes. 16M is our huge page size and the minimum
granularity of LMB's as provided by firmware on pSeries. This is thus
also our granularity for memory hotswap, thus it made sense to use that
for our sparsemem section size.
But that's not necessarily directly related to the kernel LMB code which
will happily coalesce consecutive regions, among others. LMB doesn't
keep track of node information for now at least. I've been hesitating
about adding that or not (and preventing coalescing across node
boundaries) but I see no obvious need right now.
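For illustration, the coalescing behaviour and the node-boundary option
being weighed here could be sketched as below. This is stand-alone C, not
lmb.c: struct region, its nid field and the keep_nodes_apart flag are
assumptions, since LMB does not track node information at this point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region record; the nid field is not part of LMB today. */
struct region {
	uint64_t base;
	uint64_t size;
	int nid;
};

/*
 * Merge regions that touch end-to-start, in place. The input must be
 * sorted by base and non-overlapping. When keep_nodes_apart is non-zero,
 * adjacent regions are merged only if they share a nid; otherwise the
 * merged region keeps the nid of its first piece. Returns the new count.
 */
static size_t coalesce_regions(struct region *r, size_t n, int keep_nodes_apart)
{
	size_t out = 0;

	if (n == 0)
		return 0;
	for (size_t i = 1; i < n; i++) {
		int adjacent = r[out].base + r[out].size == r[i].base;
		int same_node = r[out].nid == r[i].nid;

		if (adjacent && (!keep_nodes_apart || same_node))
			r[out].size += r[i].size;
		else
			r[++out] = r[i];
	}
	return out + 1;
}
```

With keep_nodes_apart set to 0 the node boundary is lost after the merge,
which is why node lookups then have to come from a separate source.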
> Do you simply never permit node sizes smaller than the sparsemem section
> size (ie, in the fake NUMA case)? I've been playing with this with both sparsemem and
> ARCH_HAS_HOLES_MEMORYMODEL where those sorts of combinations will be
> quite common. It would be good to have some LMB guidelines hammered out
> before people get too carried away with building infrastructure on top of
> it at least.
I'm not too familiar with the fake numa case (apart from knowing it's
broken and having patches queued up on patchwork that I haven't had a
chance to review yet) but I think it's fair to assume a node will be at
least a section I suppose.
Cheers,
Ben.
* Re: numa aware lmb and sparc stuff
From: Benjamin Herrenschmidt @ 2010-05-10 7:49 UTC
To: Paul Mundt
Cc: David Miller, Yinghai Lu, Ingo Molnar, Thomas Gleixner,
linux-mm@kvack.org
> I wouldn't call it a limitation so much as a subtle dependency. All of
> the current platforms that are supporting NUMA are doing so along with
> ARCH_POPULATES_NODE_MAP, so in those cases making the early_node_map
> dependence explicit and generic will permit the killing off of
> architecture-private data structures and accounting for region sizes and
> node mappings.
>
> The NUMA platforms that do not currently follow the
> ARCH_POPULATES_NODE_MAP semantics seem to already be in various states of
> disarray (generically broken, bitrotted, etc.). To that extent, perhaps
> it's also useful to have NUMA imply ARCH_POPULATES_NODE_MAP? New
> architectures that are going to opt for sparsemem or NUMA are likely
> going to end up down the ARCH_POPULATES_NODE_MAP path anyways I would
> imagine.
Ok so I had a chat with Dave and it looks like that won't do for sparc.
They don't really have ranges. Or rather, they do in HW, but with their
hypervisor, you can get the pages all scattered in what they call "real
memory", so early_node_map[] doesn't work well.
So I'll rollback my changes in that area for now, put back the arch
callback, but I'll keep at hand a default variant that uses
early_node_map[] for the likes of us.
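The shape of that arrangement, an arch callback with a range-based
default, might look roughly like this stand-alone sketch. It uses a
function pointer instead of whatever override mechanism the real patches
use, and every name here (arch_nid_range_hook, default_nid_range(),
scattered_nid_range(), query_nid_range()) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*nid_range_fn)(uint64_t start, uint64_t end, int *nid);

/* Default variant: contiguous per-node ranges (pfn units, made up). */
struct pfn_range {
	uint64_t start;
	uint64_t end;	/* exclusive */
	int nid;
};
static const struct pfn_range default_map[] = { {0, 64, 0}, {64, 128, 1} };
#define NR_DEFAULT (sizeof(default_map) / sizeof(default_map[0]))

static uint64_t default_nid_range(uint64_t start, uint64_t end, int *nid)
{
	for (size_t i = 0; i < NR_DEFAULT; i++) {
		if (start >= default_map[i].start && start < default_map[i].end) {
			*nid = default_map[i].nid;
			return end < default_map[i].end ? end : default_map[i].end;
		}
	}
	*nid = -1;
	return start;
}

/* Made-up arch override: pages alternate between nodes 0 and 1. */
static uint64_t scattered_nid_range(uint64_t start, uint64_t end, int *nid)
{
	(void)end;	/* every run is exactly one page long */
	*nid = (int)(start & 1);
	return start + 1;
}

/* NULL means "no arch callback installed, use the default". */
static nid_range_fn arch_nid_range_hook;

static uint64_t query_nid_range(uint64_t start, uint64_t end, int *nid)
{
	assert(start < end);
	if (arch_nid_range_hook)
		return arch_nid_range_hook(start, end, nid);
	return default_nid_range(start, end, nid);
}
```

Architectures with contiguous node ranges leave the hook unset, while one
with scattered pages installs its own lookup.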
Cheers,
Ben.