questions about init_memory_mapping

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* questions about init_memory_mapping_high()
@ 2011-02-23 17:19 Tejun Heo
  2011-02-23 20:24 ` Yinghai Lu
  2011-02-28 18:14 ` questions about init_memory_mapping_high() H. Peter Anvin
  0 siblings, 2 replies; 24+ messages in thread
From: Tejun Heo @ 2011-02-23 17:19 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: x86, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

Hello, guys.

I've been looking at init_memory_mapping_high() added by commit
1411e0ec31 (x86-64, numa: Put pgtable to local node memory) and I got
curious about several things.

1. The only rationale given in the commit description is that a
   RED-PEN is killed, which was the following.

	/*
	 * RED-PEN putting page tables only on node 0 could
	 * cause a hotspot and fill up ZONE_DMA. The page tables
	 * need roughly 0.5KB per GB.
	 */

   This already wasn't true with top-down memblock allocation.

   The 0.5KB per GiB comment is for 32bit w/ 3 level mapping.  On
   64bit, it's ~4KiB per GiB when using 2MiB mappings and, well, very
   small per GiB if 1GiB mapping is used.  Even with 2MiB mapping,
   1TiB mapping would only be 4MiB.  Under ZONE_DMA, this could be
   problematic but with top-down this can't be a problem in any
   realistic way in foreseeable future.

2. In most cases, the kernel mapping ends up using 1GiB mappings and
   when using 1GiB mappings, a single second level table would cover
   512GiB of memory.  IOW, little, if any, is gained by trying to
   allocate the page table on node local memory when 1GiB mappings are
   used, they end up sharing the same page somewhere anyway.

   I guess this was the reason why the commit message showed usage of
   2MiB mappings so that each node would end up with their own third
   level page tables.  Is this something we need to optimize for?  I
   don't recall seeing recent machines which don't use 1GiB pages for
   the linear mapping.  Are there NUMA machines which can't use 1GiB
   mappings?

   Or was this for the future where we would be using a lot more than
   512GiB of memory?  If so, wouldn't that be a bit over-reaching?
   Wouldn't we be likely to have 512GiB mappings if we get to a point
   where NUMA locality of such mappings actually become a problem?

3. The new code creates linear mapping only for memory regions where
   e820 actually says there is memory as opposed to mapping from base
   to top.  Again, I'm not sure what the intention of this change was.
   Having larger mappings over holes is much cheaper than having to
   break down the mappings into smaller sized mappings around the
   holes both in terms of memory and run time overhead.  Why would we
   want to match the linear address mapping to the e820 map exactly?

Also, Yinghai, can you please try to write commit descriptions with
more details?  It really sucks for other people when they have to
guess what the actual changes and underlying intentions are.  The
commit adding init_memory_mapping_high() is very anemic on details
about how the behavior changes and the only intention given there is
RED-PEN removal even which is largely a miss.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: questions about init_memory_mapping_high()
  2011-02-23 17:19 questions about init_memory_mapping_high() Tejun Heo
@ 2011-02-23 20:24 ` Yinghai Lu
  2011-02-23 20:46   ` Tejun Heo
  2011-02-28 18:14 ` questions about init_memory_mapping_high() H. Peter Anvin
  1 sibling, 1 reply; 24+ messages in thread
From: Yinghai Lu @ 2011-02-23 20:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: x86, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

On 02/23/2011 09:19 AM, Tejun Heo wrote:
> Hello, guys.
> 
> I've been looking at init_memory_mapping_high() added by commit
> 1411e0ec31 (x86-64, numa: Put pgtable to local node memory) and I got
> curious about several things.
> 
> 1. The only rationale given in the commit description is that a
>    RED-PEN is killed, which was the following.
> 
> 	/*
> 	 * RED-PEN putting page tables only on node 0 could
> 	 * cause a hotspot and fill up ZONE_DMA. The page tables
> 	 * need roughly 0.5KB per GB.
> 	 */
> 
>    This already wasn't true with top-down memblock allocation.
> 
>    The 0.5KB per GiB comment is for 32bit w/ 3 level mapping.  On
>    64bit, it's ~4KiB per GiB when using 2MiB mappings and, well, very
>    small per GiB if 1GiB mapping is used.  Even with 2MiB mapping,
>    1TiB mapping would only be 4MiB.  Under ZONE_DMA, this could be
>    problematic but with top-down this can't be a problem in any
>    realistic way in foreseeable future.

before that patch set:
page table for [0, 4g) is just under and near 512M.
page table for [4g, 128) is just under and near 2g ( assume 0-2g is ram under 4g)

first patch in the patch set will
page table for [0, 4g) is just under and near 2g.( assume 0-2g is ram under 4g)
page table for [4g, 128) is just under and near 128g 

so top down could put most page table on last node.

for debug purpose case, 2M and 1G page could be disabled.

code excerpt from init_memory_mapping()

        printk(KERN_INFO "init_memory_mapping: %016lx-%016lx\n", start, end);

#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
        /*
         * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
         * This will simplify cpa(), which otherwise needs to support splitting
         * large pages into small in interrupt context, etc.
         */
        use_pse = use_gbpages = 0;
#else
        use_pse = cpu_has_pse;
        use_gbpages = direct_gbpages;
#endif


> 
> 2. In most cases, the kernel mapping ends up using 1GiB mappings and
>    when using 1GiB mappings, a single second level table would cover
>    512GiB of memory.  IOW, little, if any, is gained by trying to
>    allocate the page table on node local memory when 1GiB mappings are
>    used, they end up sharing the same page somewhere anyway.
> 
>    I guess this was the reason why the commit message showed usage of
>    2MiB mappings so that each node would end up with their own third
>    level page tables.  Is this something we need to optimize for?  I
>    don't recall seeing recent machines which don't use 1GiB pages for
>    the linear mapping.  Are there NUMA machines which can't use 1GiB
>    mappings?
> 
>    Or was this for the future where we would be using a lot more than
>    512GiB of memory?  If so, wouldn't that be a bit over-reaching?
>    Wouldn't we be likely to have 512GiB mappings if we get to a point
>    where NUMA locality of such mappings actually become a problem?


till now:
amd 64 cpu does support 1gb page.

Intel CPU Nehalem-EX does not. and several vendors do provide 8 sockets
NUMA system with 1024g and 2048g RAM

cpu after Nehalem-EX looks support 1gb page.



> 
> 3. The new code creates linear mapping only for memory regions where
>    e820 actually says there is memory as opposed to mapping from base
>    to top.  Again, I'm not sure what the intention of this change was.
>    Having larger mappings over holes is much cheaper than having to
>    break down the mappings into smaller sized mappings around the
>    holes both in terms of memory and run time overhead.  Why would we
>    want to match the linear address mapping to the e820 map exactly?

we don't need to map those holes if there is any.

for hotplug case, they should map new added memory later.

> 
> Also, Yinghai, can you please try to write commit descriptions with
> more details?  It really sucks for other people when they have to
> guess what the actual changes and underlying intentions are.  The
> commit adding init_memory_mapping_high() is very anemic on details
> about how the behavior changes and the only intention given there is
> RED-PEN removal even which is largely a miss.

i don't know what you are talking about. that changelog is clear enough.

Yinghai

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: questions about init_memory_mapping_high()
  2011-02-23 20:24 ` Yinghai Lu
@ 2011-02-23 20:46   ` Tejun Heo
  2011-02-23 20:51     ` Yinghai Lu
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2011-02-23 20:46 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: x86, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

On Wed, Feb 23, 2011 at 12:24:58PM -0800, Yinghai Lu wrote:
> >    I guess this was the reason why the commit message showed usage of
> >    2MiB mappings so that each node would end up with their own third
> >    level page tables.  Is this something we need to optimize for?  I
> >    don't recall seeing recent machines which don't use 1GiB pages for
> >    the linear mapping.  Are there NUMA machines which can't use 1GiB
> >    mappings?
> 
> till now:
> amd 64 cpu does support 1gb page.
> 
> Intel CPU Nehalem-EX does not. and several vendors do provide 8 sockets
> NUMA system with 1024g and 2048g RAM

That's interesting.  Didn't expect that.  So, this one is an actually
valid reason for implementing per node mapping.  Is this Nehalem-EX
only thing?  Or is it applicable to all xeons upto now?

> > 3. The new code creates linear mapping only for memory regions where
> >    e820 actually says there is memory as opposed to mapping from base
> >    to top.  Again, I'm not sure what the intention of this change was.
> >    Having larger mappings over holes is much cheaper than having to
> >    break down the mappings into smaller sized mappings around the
> >    holes both in terms of memory and run time overhead.  Why would we
> >    want to match the linear address mapping to the e820 map exactly?
> 
> we don't need to map those holes if there is any.

Yeah, sure, my point was that not mapping those holes is likely to be
worse.  Wouldn't it be better to get low and high ends of the occupied
area and expand those to larger mapping size?  It's worse to match the
memory map exactly.  You unnecessarily end up with smaller mappings.

> for hotplug case, they should map new added memory later.

Sure.

> > Also, Yinghai, can you please try to write commit descriptions with
> > more details?  It really sucks for other people when they have to
> > guess what the actual changes and underlying intentions are.  The
> > commit adding init_memory_mapping_high() is very anemic on details
> > about how the behavior changes and the only intention given there is
> > RED-PEN removal even which is largely a miss.
> 
> i don't know what you are talking about. that changelog is clear enough.

Ah well, if you still think the changelog is clear enough, I give up.
I guess I'll just keep rewriting your changelogs.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: questions about init_memory_mapping_high()
  2011-02-23 20:46   ` Tejun Heo
@ 2011-02-23 20:51     ` Yinghai Lu
  2011-02-23 21:03       ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Yinghai Lu @ 2011-02-23 20:51 UTC (permalink / raw)
  To: Tejun Heo; +Cc: x86, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

On 02/23/2011 12:46 PM, Tejun Heo wrote:
> On Wed, Feb 23, 2011 at 12:24:58PM -0800, Yinghai Lu wrote:
>>>    I guess this was the reason why the commit message showed usage of
>>>    2MiB mappings so that each node would end up with their own third
>>>    level page tables.  Is this something we need to optimize for?  I
>>>    don't recall seeing recent machines which don't use 1GiB pages for
>>>    the linear mapping.  Are there NUMA machines which can't use 1GiB
>>>    mappings?
>>
>> till now:
>> amd 64 cpu does support 1gb page.
>>
>> Intel CPU Nehalem-EX does not. and several vendors do provide 8 sockets
>> NUMA system with 1024g and 2048g RAM
> 
> That's interesting.  Didn't expect that.  So, this one is an actually
> valid reason for implementing per node mapping.  Is this Nehalem-EX
> only thing?  Or is it applicable to all xeons upto now?

only have access for Nehalem-EX and Westmere-EX till now.

> 
>>> 3. The new code creates linear mapping only for memory regions where
>>>    e820 actually says there is memory as opposed to mapping from base
>>>    to top.  Again, I'm not sure what the intention of this change was.
>>>    Having larger mappings over holes is much cheaper than having to
>>>    break down the mappings into smaller sized mappings around the
>>>    holes both in terms of memory and run time overhead.  Why would we
>>>    want to match the linear address mapping to the e820 map exactly?
>>
>> we don't need to map those holes if there is any.
> 
> Yeah, sure, my point was that not mapping those holes is likely to be
> worse.  Wouldn't it be better to get low and high ends of the occupied
> area and expand those to larger mapping size?  It's worse to match the
> memory map exactly.  You unnecessarily end up with smaller mappings.

it will reuse previous not used entries in the init_memory_mapping().

> 
>> for hotplug case, they should map new added memory later.
> 
> Sure.
> 
>>> Also, Yinghai, can you please try to write commit descriptions with
>>> more details?  It really sucks for other people when they have to
>>> guess what the actual changes and underlying intentions are.  The
>>> commit adding init_memory_mapping_high() is very anemic on details
>>> about how the behavior changes and the only intention given there is
>>> RED-PEN removal even which is largely a miss.
>>
>> i don't know what you are talking about. that changelog is clear enough.
> 
> Ah well, if you still think the changelog is clear enough, I give up.
> I guess I'll just keep rewriting your changelogs.

Thank you very much.

Yinghai

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: questions about init_memory_mapping_high()
  2011-02-23 20:51     ` Yinghai Lu
@ 2011-02-23 21:03       ` Tejun Heo
  2011-02-23 22:17         ` Yinghai Lu
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2011-02-23 21:03 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: x86, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

Hey,

On Wed, Feb 23, 2011 at 12:51:37PM -0800, Yinghai Lu wrote:
> On 02/23/2011 12:46 PM, Tejun Heo wrote:
> >> Intel CPU Nehalem-EX does not. and several vendors do provide 8 sockets
> >> NUMA system with 1024g and 2048g RAM
> > 
> > That's interesting.  Didn't expect that.  So, this one is an actually
> > valid reason for implementing per node mapping.  Is this Nehalem-EX
> > only thing?  Or is it applicable to all xeons upto now?
> 
> only have access for Nehalem-EX and Westmere-EX till now.

I see.  I was wondering whether it was a worthwhile optimization if it
was an one-off thing for Nehalem-EX.

> >>> 3. The new code creates linear mapping only for memory regions where
> >>>    e820 actually says there is memory as opposed to mapping from base
> >>>    to top.  Again, I'm not sure what the intention of this change was.
> >>>    Having larger mappings over holes is much cheaper than having to
> >>>    break down the mappings into smaller sized mappings around the
> >>>    holes both in terms of memory and run time overhead.  Why would we
> >>>    want to match the linear address mapping to the e820 map exactly?
> >>
> >> we don't need to map those holes if there is any.
> > 
> > Yeah, sure, my point was that not mapping those holes is likely to be
> > worse.  Wouldn't it be better to get low and high ends of the occupied
> > area and expand those to larger mapping size?  It's worse to match the
> > memory map exactly.  You unnecessarily end up with smaller mappings.
> 
> it will reuse previous not used entries in the init_memory_mapping().

Hmmm... I'm not really following.  Can you elaborate?  The reason why
smaller mapping is bad is because of increased TLB pressure.  What
does using the existing entries have to do with it?

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: questions about init_memory_mapping_high()
  2011-02-23 21:03       ` Tejun Heo
@ 2011-02-23 22:17         ` Yinghai Lu
  2011-02-24  9:15           ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Yinghai Lu @ 2011-02-23 22:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: x86, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

On Wed, Feb 23, 2011 at 1:03 PM, Tejun Heo <tj@kernel.org> wrote:
>> > Yeah, sure, my point was that not mapping those holes is likely to be
>> > worse.  Wouldn't it be better to get low and high ends of the occupied
>> > area and expand those to larger mapping size?  It's worse to match the
>> > memory map exactly.  You unnecessarily end up with smaller mappings.
>>
>> it will reuse previous not used entries in the init_memory_mapping().
>
> Hmmm... I'm not really following.  Can you elaborate?  The reason why
> smaller mapping is bad is because of increased TLB pressure.  What
> does using the existing entries have to do with it?

assume 1g page is used. first node will actually mapped 512G already.
so if the system only have 1024g. then first 512g page table will on node0 ram.
second 512g page table will be on node4.

when only 2M are used, it is 1G boundary. for 1024g system.
page table (about 512k) for mem 0-128g is on node0.
page table (about 512k) for mem 128g-256g is on node1.
...
Do you mean we need to put those all 512k together to reduce TLB presure?

Yinghai

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: questions about init_memory_mapping_high()
  2011-02-23 22:17         ` Yinghai Lu
@ 2011-02-24  9:15           ` Tejun Heo
  2011-02-25  1:37             ` Yinghai Lu
                               ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Tejun Heo @ 2011-02-24  9:15 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: x86, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

Hey, again.

On Wed, Feb 23, 2011 at 02:17:34PM -0800, Yinghai Lu wrote:
> > Hmmm... I'm not really following.  Can you elaborate?  The reason why
> > smaller mapping is bad is because of increased TLB pressure.  What
> > does using the existing entries have to do with it?
> 
> assume 1g page is used. first node will actually mapped 512G already.
> so if the system only have 1024g. then first 512g page table will on node0 ram.
> second 512g page table will be on node4.
> 
> when only 2M are used, it is 1G boundary. for 1024g system.
> page table (about 512k) for mem 0-128g is on node0.
> page table (about 512k) for mem 128g-256g is on node1.
> ...
> Do you mean we need to put those all 512k together to reduce TLB presure?

Nope, let's say the machine supports 1GiB mapping, has 8GiB of memory
where [0,4)GiB is node 0 and [4,8)GiB node1, and there's a hole of
128MiB right on top of 4GiB.  Before the change, the page mapping code
wouldn't care about the whole and just map the whole [0,8)GiB area
with eight 1GiB mapping.  Now with your change, [4, 5)GiB will be
mapped using 2MiB mappings to avoid mapping the 128MiB hole.

We end up unnecessarily using smaller size mappings (512 2MiB mappings
instead of 1 1GiB mapping) thus increasing TLB pressure.  There is no
reason to match the linear address mapping exactly to the physical
memory map.  It is no accident that the original code didn't consider
memory holes.  Using larger mappings over them is more beneficial to
trying to punch holes with smaller mappings.

This rather important change was made without any description or
explanation, which I find somewhat disturbing.  Anyways, what we can
do is just taking bottom and top addresses of occupied NUMA regions
and round them down and up, respectively, to the largest page mapping
size supported as long as the top address doesn't go over max_pfn
instead of mapping exactly according to the memblocks.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: questions about init_memory_mapping_high()
  2011-02-24  9:15           ` Tejun Heo
@ 2011-02-25  1:37             ` Yinghai Lu
  2011-02-25  1:38             ` [PATCH 1/2] x86,mm: Introduce init_memory_mapping_ext() Yinghai Lu
  2011-02-25  6:20             ` [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
  2 siblings, 0 replies; 24+ messages in thread
From: Yinghai Lu @ 2011-02-25  1:37 UTC (permalink / raw)
  To: Tejun Heo; +Cc: x86, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

On 02/24/2011 01:15 AM, Tejun Heo wrote:
> Hey, again.
> 
> On Wed, Feb 23, 2011 at 02:17:34PM -0800, Yinghai Lu wrote:
>>> Hmmm... I'm not really following.  Can you elaborate?  The reason why
>>> smaller mapping is bad is because of increased TLB pressure.  What
>>> does using the existing entries have to do with it?
>>
>> assume 1g page is used. first node will actually mapped 512G already.
>> so if the system only have 1024g. then first 512g page table will on node0 ram.
>> second 512g page table will be on node4.
>>
>> when only 2M are used, it is 1G boundary. for 1024g system.
>> page table (about 512k) for mem 0-128g is on node0.
>> page table (about 512k) for mem 128g-256g is on node1.
>> ...
>> Do you mean we need to put those all 512k together to reduce TLB presure?
> 
> Nope, let's say the machine supports 1GiB mapping, has 8GiB of memory
> where [0,4)GiB is node 0 and [4,8)GiB node1, and there's a hole of
> 128MiB right on top of 4GiB.  Before the change, the page mapping code
> wouldn't care about the whole and just map the whole [0,8)GiB area
> with eight 1GiB mapping.  Now with your change, [4, 5)GiB will be
> mapped using 2MiB mappings to avoid mapping the 128MiB hole.
> 
> We end up unnecessarily using smaller size mappings (512 2MiB mappings
> instead of 1 1GiB mapping) thus increasing TLB pressure.  There is no
> reason to match the linear address mapping exactly to the physical
> memory map.  It is no accident that the original code didn't consider
> memory holes.  Using larger mappings over them is more beneficial to
> trying to punch holes with smaller mappings.
> 
> This rather important change was made without any description or
> explanation, which I find somewhat disturbing.  Anyways, what we can
> do is just taking bottom and top addresses of occupied NUMA regions
> and round them down and up, respectively, to the largest page mapping
> size supported as long as the top address doesn't go over max_pfn
> instead of mapping exactly according to the memblocks.
> 

ok, please check two patches that fix the problem.

thanks

Yinghai

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/2] x86,mm: Introduce init_memory_mapping_ext()
  2011-02-24  9:15           ` Tejun Heo
  2011-02-25  1:37             ` Yinghai Lu
@ 2011-02-25  1:38             ` Yinghai Lu
  2011-02-25  6:20             ` [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
  2 siblings, 0 replies; 24+ messages in thread
From: Yinghai Lu @ 2011-02-25  1:38 UTC (permalink / raw)
  To: Tejun Heo, Ingo Molnar, Thomas Gleixner, H. Peter Anvin; +Cc: x86, linux-kernel


Add extra input tbl_end. it could be smaller than end.

Prepare for init_memory_mapping_high() to align boundary to 1G.
aka end could round up to 1g bound, and will be bigger then original
node end.

init_memory_mapping will call init_memory_mapping_ext with tbl=end.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/include/asm/page_types.h |    7 +++++--
 arch/x86/mm/init.c                |   22 +++++++++++++++-------
 2 files changed, 20 insertions(+), 9 deletions(-)

Index: linux-2.6/arch/x86/include/asm/page_types.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page_types.h
+++ linux-2.6/arch/x86/include/asm/page_types.h
@@ -51,8 +51,11 @@ static inline phys_addr_t get_max_mapped
 	return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
 }
 
-extern unsigned long init_memory_mapping(unsigned long start,
-					 unsigned long end);
+unsigned long init_memory_mapping_ext(unsigned long start,
+				      unsigned long end,
+				      unsigned long tbl_end);
+
+unsigned long init_memory_mapping(unsigned long start, unsigned long end);
 
 void init_memory_mapping_high(void);
 
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -30,10 +30,12 @@ int direct_gbpages
 #endif
 ;
 
-static void __init find_early_table_space(unsigned long end, int use_pse,
+static void __init find_early_table_space(unsigned long end,
+					  unsigned long tbl_end,
+					  int use_pse,
 					  int use_gbpages)
 {
-	unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
+	unsigned long puds, pmds, ptes, tables, start = 0;
 	phys_addr_t base;
 
 	puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
@@ -66,10 +68,10 @@ static void __init find_early_table_spac
 	/* for fixmap */
 	tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
 
-	good_end = max_pfn_mapped << PAGE_SHIFT;
+	tbl_end = max_pfn_mapped << PAGE_SHIFT;
 #endif
 
-	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
+	base = memblock_find_in_range(start, tbl_end, tables, PAGE_SIZE);
 	if (base == MEMBLOCK_ERROR)
 		panic("Cannot find space for the kernel page tables");
 
@@ -114,8 +116,9 @@ static int __meminit save_mr(struct map_
  * This runs before bootmem is initialized and gets pages directly from
  * the physical memory. To access them they are temporarily mapped.
  */
-unsigned long __init_refok init_memory_mapping(unsigned long start,
-					       unsigned long end)
+unsigned long __init_refok init_memory_mapping_ext(unsigned long start,
+					       unsigned long end,
+					       unsigned long tbl_end)
 {
 	unsigned long page_size_mask = 0;
 	unsigned long start_pfn, end_pfn;
@@ -258,7 +261,7 @@ unsigned long __init_refok init_memory_m
 	 * nodes are discovered.
 	 */
 	if (!after_bootmem)
-		find_early_table_space(end, use_pse, use_gbpages);
+		find_early_table_space(end, tbl_end, use_pse, use_gbpages);
 
 	for (i = 0; i < nr_range; i++)
 		ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
@@ -282,6 +285,11 @@ unsigned long __init_refok init_memory_m
 	return ret >> PAGE_SHIFT;
 }
 
+unsigned long __init_refok init_memory_mapping(unsigned long start,
+					       unsigned long end)
+{
+	return init_memory_mapping_ext(start, end, end);
+}
 
 /*
  * devmem_is_allowed() checks to see if /dev/mem access to a certain address

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high()
  2011-02-24  9:15           ` Tejun Heo
  2011-02-25  1:37             ` Yinghai Lu
  2011-02-25  1:38             ` [PATCH 1/2] x86,mm: Introduce init_memory_mapping_ext() Yinghai Lu
@ 2011-02-25  6:20             ` Yinghai Lu
  2011-02-25 10:03               ` Ingo Molnar
  2011-02-25 11:16               ` [PATCH 2/2] " Tejun Heo
  2 siblings, 2 replies; 24+ messages in thread
From: Yinghai Lu @ 2011-02-25  6:20 UTC (permalink / raw)
  To: Tejun Heo, Ingo Molnar, Thomas Gleixner, H. Peter Anvin; +Cc: x86, linux-kernel


tj pointed out:
	when node does not have 1G aligned boundary, like 128M.
init_memory_mapping_high() could render smaller mapping by 128M on one node,
and 896M on next node with 2M pages instead of 1g page. that could increase
TLB presure.

So if gb page is used, try to align the boundary to 1G before calling
init_memory_mapping_ext(), to make sure only use one 1g entry for that cross
node 1G.
Need to init_meory_mapping_ext() to table tbl_end, to make sure pgtable is on
previous node instead of next node.

on one AMD 512g system with not aligned boundary (extra 768M)
before the patch
[    0.000000] init_memory_mapping: [0x00000000000000-0x000000d7f9ffff]
[    0.000000]  0000000000 - 00c0000000 page 1G
[    0.000000]  00c0000000 - 00d7e00000 page 2M
[    0.000000]  00d7e00000 - 00d7fa0000 page 4k
[    0.000000] kernel direct mapping tables up to d7fa0000 @ [0xd7f9d000-0xd7f9ffff] pre-allocated
[    0.000000] kernel direct mapping tables up to d7fa0000 @ [0xd7f9d000-0xd7f9efff] final
[    0.000000]     memblock_x86_reserve_range: [0xd7f9d000-0xd7f9efff]          PGTABLE
...
[    0.000000] Adding active range (0, 0x10, 0x98) 0 entries of 3200 used
[    0.000000] Adding active range (0, 0x100, 0xd7fa0) 1 entries of 3200 used
[    0.000000] Adding active range (0, 0x100000, 0x1028000) 2 entries of 3200 used
[    0.000000] Adding active range (1, 0x1028000, 0x2028000) 3 entries of 3200 used
[    0.000000] Adding active range (2, 0x2028000, 0x3028000) 4 entries of 3200 used
[    0.000000] Adding active range (3, 0x3028000, 0x4028000) 5 entries of 3200 used
[    0.000000] Adding active range (4, 0x4028000, 0x5028000) 6 entries of 3200 used
[    0.000000] Adding active range (5, 0x5028000, 0x6028000) 7 entries of 3200 used
[    0.000000] Adding active range (6, 0x6028000, 0x7028000) 8 entries of 3200 used
[    0.000000] Adding active range (7, 0x7028000, 0x8028000) 9 entries of 3200 used
[    0.000000] init_memory_mapping: [0x00000100000000-0x00001027ffffff]
[    0.000000]  0100000000 - 1000000000 page 1G
[    0.000000]  1000000000 - 1028000000 page 2M
[    0.000000] kernel direct mapping tables up to 1028000000 @ [0x1027ffe000-0x1027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 1028000000 @ [0x1027ffe000-0x1027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x1027ffe000-0x1027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00001028000000-0x00002027ffffff]
[    0.000000]  1028000000 - 1040000000 page 2M
[    0.000000]  1040000000 - 2000000000 page 1G
[    0.000000]  2000000000 - 2028000000 page 2M
[    0.000000] kernel direct mapping tables up to 2028000000 @ [0x2027ffe000-0x2027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 2028000000 @ [0x2027ffe000-0x2027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x2027ffe000-0x2027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00002028000000-0x00003027ffffff]
[    0.000000]  2028000000 - 2040000000 page 2M
[    0.000000]  2040000000 - 3000000000 page 1G
[    0.000000]  3000000000 - 3028000000 page 2M
[    0.000000] kernel direct mapping tables up to 3028000000 @ [0x3027ffe000-0x3027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 3028000000 @ [0x3027ffe000-0x3027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x3027ffe000-0x3027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00003028000000-0x00004027ffffff]
[    0.000000]  3028000000 - 3040000000 page 2M
[    0.000000]  3040000000 - 4000000000 page 1G
[    0.000000]  4000000000 - 4028000000 page 2M
[    0.000000] kernel direct mapping tables up to 4028000000 @ [0x4027ffe000-0x4027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 4028000000 @ [0x4027ffe000-0x4027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x4027ffe000-0x4027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00004028000000-0x00005027ffffff]
[    0.000000]  4028000000 - 4040000000 page 2M
[    0.000000]  4040000000 - 5000000000 page 1G
[    0.000000]  5000000000 - 5028000000 page 2M
[    0.000000] kernel direct mapping tables up to 5028000000 @ [0x5027ffe000-0x5027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 5028000000 @ [0x5027ffe000-0x5027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x5027ffe000-0x5027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00005028000000-0x00006027ffffff]
[    0.000000]  5028000000 - 5040000000 page 2M
[    0.000000]  5040000000 - 6000000000 page 1G
[    0.000000]  6000000000 - 6028000000 page 2M
[    0.000000] kernel direct mapping tables up to 6028000000 @ [0x6027ffe000-0x6027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 6028000000 @ [0x6027ffe000-0x6027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x6027ffe000-0x6027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00006028000000-0x00007027ffffff]
[    0.000000]  6028000000 - 6040000000 page 2M
[    0.000000]  6040000000 - 7000000000 page 1G
[    0.000000]  7000000000 - 7028000000 page 2M
[    0.000000] kernel direct mapping tables up to 7028000000 @ [0x7027ffe000-0x7027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 7028000000 @ [0x7027ffe000-0x7027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x7027ffe000-0x7027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00007028000000-0x00008027ffffff]
[    0.000000]  7028000000 - 7040000000 page 2M
[    0.000000]  7040000000 - 8000000000 page 1G
[    0.000000]  8000000000 - 8028000000 page 2M
[    0.000000] kernel direct mapping tables up to 8028000000 @ [0x8027ffd000-0x8027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 8028000000 @ [0x8027ffd000-0x8027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x8027ffd000-0x8027ffefff]          PGTABLE

after patch
[    0.000000] init_memory_mapping: [0x00000000000000-0x000000d7f9ffff]
[    0.000000]  0000000000 - 00c0000000 page 1G
[    0.000000]  00c0000000 - 00d7e00000 page 2M
[    0.000000]  00d7e00000 - 00d7fa0000 page 4k
[    0.000000] kernel direct mapping tables up to d7fa0000 @ [0xd7f9d000-0xd7f9ffff] pre-allocated
[    0.000000] kernel direct mapping tables up to d7fa0000 @ [0xd7f9d000-0xd7f9efff] final
[    0.000000]     memblock_x86_reserve_range: [0xd7f9d000-0xd7f9efff]          PGTABLE
...
[    0.000000] Adding active range (0, 0x10, 0x98) 0 entries of 3200 used
[    0.000000] Adding active range (0, 0x100, 0xd7fa0) 1 entries of 3200 used
[    0.000000] Adding active range (0, 0x100000, 0x1028000) 2 entries of 3200 used
[    0.000000] Adding active range (1, 0x1028000, 0x2028000) 3 entries of 3200 used
[    0.000000] Adding active range (2, 0x2028000, 0x3028000) 4 entries of 3200 used
[    0.000000] Adding active range (3, 0x3028000, 0x4028000) 5 entries of 3200 used
[    0.000000] Adding active range (4, 0x4028000, 0x5028000) 6 entries of 3200 used
[    0.000000] Adding active range (5, 0x5028000, 0x6028000) 7 entries of 3200 used
[    0.000000] Adding active range (6, 0x6028000, 0x7028000) 8 entries of 3200 used
[    0.000000] Adding active range (7, 0x7028000, 0x8028000) 9 entries of 3200 used
[    0.000000] init_memory_mapping: [0x00000100000000-0x0000103fffffff]
[    0.000000]  0100000000 - 1040000000 page 1G
[    0.000000] kernel direct mapping tables up to 1040000000 @ [0x1027fff000-0x1027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00001040000000-0x0000203fffffff]
[    0.000000]  1040000000 - 2040000000 page 1G
[    0.000000] kernel direct mapping tables up to 2040000000 @ [0x2027fff000-0x2027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00002040000000-0x0000303fffffff]
[    0.000000]  2040000000 - 3040000000 page 1G
[    0.000000] kernel direct mapping tables up to 3040000000 @ [0x3027fff000-0x3027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00003040000000-0x0000403fffffff]
[    0.000000]  3040000000 - 4040000000 page 1G
[    0.000000] kernel direct mapping tables up to 4040000000 @ [0x4027fff000-0x4027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00004040000000-0x0000503fffffff]
[    0.000000]  4040000000 - 5040000000 page 1G
[    0.000000] kernel direct mapping tables up to 5040000000 @ [0x5027fff000-0x5027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00005040000000-0x0000603fffffff]
[    0.000000]  5040000000 - 6040000000 page 1G
[    0.000000] kernel direct mapping tables up to 6040000000 @ [0x6027fff000-0x6027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00006040000000-0x0000703fffffff]
[    0.000000]  6040000000 - 7040000000 page 1G
[    0.000000] kernel direct mapping tables up to 7040000000 @ [0x7027fff000-0x7027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00007040000000-0x00008027ffffff]
[    0.000000]  7040000000 - 8000000000 page 1G
[    0.000000]  8000000000 - 8028000000 page 2M
[    0.000000] kernel direct mapping tables up to 8028000000 @ [0x8027ffd000-0x8027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 8028000000 @ [0x8027ffd000-0x8027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x8027ffd000-0x8027ffefff]          PGTABLE

So it fixs the extra mapping problem.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/mm/init_64.c |   21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -614,6 +614,7 @@ struct mapping_work_data {
 	unsigned long start;
 	unsigned long end;
 	unsigned long pfn_mapped;
+	unsigned long align;
 };
 
 static int __init_refok
@@ -621,7 +622,14 @@ mapping_work_fn(unsigned long start_pfn,
 {
 	struct mapping_work_data *data = datax;
 	unsigned long pfn_mapped;
-	unsigned long final_start, final_end;
+	unsigned long final_start, final_end, tbl_end;
+
+	tbl_end = end_pfn << PAGE_SHIFT;
+	/* need to align them to 1G or 2M boundary to avoid smaller mapping */
+	start_pfn = round_down(start_pfn, data->align>>PAGE_SHIFT);
+	if (start_pfn < data->pfn_mapped)
+		start_pfn = data->pfn_mapped;
+	end_pfn = round_up(end_pfn, data->align>>PAGE_SHIFT);
 
 	final_start = max_t(unsigned long, start_pfn<<PAGE_SHIFT, data->start);
 	final_end = min_t(unsigned long, end_pfn<<PAGE_SHIFT, data->end);
@@ -629,7 +637,7 @@ mapping_work_fn(unsigned long start_pfn,
 	if (final_end <= final_start)
 		return 0;
 
-	pfn_mapped = init_memory_mapping(final_start, final_end);
+	pfn_mapped = init_memory_mapping_ext(final_start, final_end, tbl_end);
 
 	if (pfn_mapped > data->pfn_mapped)
 		data->pfn_mapped = pfn_mapped;
@@ -641,6 +649,15 @@ static unsigned long __init_refok
 init_memory_mapping_active_regions(unsigned long start, unsigned long end)
 {
 	struct mapping_work_data data;
+	int use_gbpages;
+
+	/* see init_memory_mapping() for the setting */
+#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
+	use_gbpages = 0;
+#else
+	use_gbpages = direct_gbpages;
+#endif
+	data.align = use_gbpages ? 1UL<<30 : 1UL<<21;
 
 	data.start = start;
 	data.end = end;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high()
  2011-02-25  6:20             ` [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
@ 2011-02-25 10:03               ` Ingo Molnar
  2011-02-25 20:22                 ` Yinghai Lu
                                   ` (3 more replies)
  2011-02-25 11:16               ` [PATCH 2/2] " Tejun Heo
  1 sibling, 4 replies; 24+ messages in thread
From: Ingo Molnar @ 2011-02-25 10:03 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tejun Heo, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel


* Yinghai Lu <yinghai@kernel.org> wrote:

>  init_memory_mapping_active_regions(unsigned long start, unsigned long end)
>  {
>  	struct mapping_work_data data;
> +	int use_gbpages;
> +
> +	/* see init_memory_mapping() for the setting */
> +#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
> +	use_gbpages = 0;
> +#else
> +	use_gbpages = direct_gbpages;
> +#endif

Sigh. You should *never* ever even think about writing such code. It only results in 
crap, and in crap duplicated elsewhere as well:

if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
        /*
         * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
         * This will simplify cpa(), which otherwise needs to support splitting
         * large pages into small in interrupt context, etc.
         */
        use_pse = use_gbpages = 0;
#else
        use_pse = cpu_has_pse;
        use_gbpages = direct_gbpages;
#endif

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high()
  2011-02-25 10:03               ` Ingo Molnar
@ 2011-02-25 20:22                 ` Yinghai Lu
  2011-02-26  3:06                 ` [PATCH 1/3] x86, mm: Introduce global page_size_mask Yinghai Lu
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Yinghai Lu @ 2011-02-25 20:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel

On 02/25/2011 02:03 AM, Ingo Molnar wrote:
> 
> * Yinghai Lu <yinghai@kernel.org> wrote:
> 
>>  init_memory_mapping_active_regions(unsigned long start, unsigned long end)
>>  {
>>  	struct mapping_work_data data;
>> +	int use_gbpages;
>> +
>> +	/* see init_memory_mapping() for the setting */
>> +#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
>> +	use_gbpages = 0;
>> +#else
>> +	use_gbpages = direct_gbpages;
>> +#endif
> 
> Sigh. You should *never* ever even think about writing such code. It only results in 
> crap, and in crap duplicated elsewhere as well:
> 
> if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
>         /*
>          * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
>          * This will simplify cpa(), which otherwise needs to support splitting
>          * large pages into small in interrupt context, etc.
>          */
>         use_pse = use_gbpages = 0;
> #else
>         use_pse = cpu_has_pse;
>         use_gbpages = direct_gbpages;
> #endif

sorry, actually i copied from there.

or i could add max_map_unit_size variable?

Thanks

Yinghai



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/3] x86, mm:  Introduce global page_size_mask
  2011-02-25 10:03               ` Ingo Molnar
  2011-02-25 20:22                 ` Yinghai Lu
@ 2011-02-26  3:06                 ` Yinghai Lu
  2011-02-26  3:07                 ` [PATCH 2/3] x86,mm: Introduce init_memory_mapping_ext() Yinghai Lu
  2011-02-26  3:08                 ` [PATCH 3/3] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
  3 siblings, 0 replies; 24+ messages in thread
From: Yinghai Lu @ 2011-02-26  3:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel


Add probe_page_size_mask() to detect if need to use 1G or 2M.
and store them in page_size_mask.

Only probe them at first init_memory_mapping calling.
second and later init_memory_mapping() calling does not need probe again.
also we don't need to pass use_gbpages around.

Suggested-by: Ingo Molnar <mingo@elte.hu>
Signe-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/include/asm/pgtable.h |    1 
 arch/x86/mm/init.c             |   70 +++++++++++++++++++++--------------------
 2 files changed, 37 insertions(+), 34 deletions(-)

Index: linux-2.6/arch/x86/include/asm/pgtable.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable.h
+++ linux-2.6/arch/x86/include/asm/pgtable.h
@@ -597,6 +597,7 @@ static inline int pgd_none(pgd_t pgd)
 #ifndef __ASSEMBLY__
 
 extern int direct_gbpages;
+extern int page_size_mask;
 
 /* local pte updates need not use xchg for locking */
 static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -30,8 +30,9 @@ int direct_gbpages
 #endif
 ;
 
-static void __init find_early_table_space(unsigned long end, int use_pse,
-					  int use_gbpages)
+int page_size_mask = -1;
+
+static void __init find_early_table_space(unsigned long end)
 {
 	unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
 	phys_addr_t base;
@@ -39,7 +40,7 @@ static void __init find_early_table_spac
 	puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
 	tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
 
-	if (use_gbpages) {
+	if (page_size_mask & (1 << PG_LEVEL_1G)) {
 		unsigned long extra;
 
 		extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
@@ -49,7 +50,7 @@ static void __init find_early_table_spac
 
 	tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
 
-	if (use_pse) {
+	if (page_size_mask & (1 << PG_LEVEL_2M)) {
 		unsigned long extra;
 
 		extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
@@ -81,6 +82,35 @@ static void __init find_early_table_spac
 		end, pgt_buf_start << PAGE_SHIFT, pgt_buf_top << PAGE_SHIFT);
 }
 
+static void probe_page_size_mask(void)
+{
+	if (page_size_mask != -1)
+		return;
+
+	page_size_mask = 0;
+#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
+	/*
+	 * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
+	 * This will simplify cpa(), which otherwise needs to support splitting
+	 * large pages into small in interrupt context, etc.
+	 */
+	if (direct_gbpages)
+		page_size_mask |= 1 << PG_LEVEL_1G;
+	if (cpu_has_pse)
+		page_size_mask |= 1 << PG_LEVEL_2M;
+#endif
+
+	/* Enable PSE if available */
+	if (cpu_has_pse)
+		set_in_cr4(X86_CR4_PSE);
+
+	/* Enable PGE if available */
+	if (cpu_has_pge) {
+		set_in_cr4(X86_CR4_PGE);
+		__supported_pte_mask |= _PAGE_GLOBAL;
+	}
+}
+
 struct map_range {
 	unsigned long start;
 	unsigned long end;
@@ -117,43 +147,15 @@ static int __meminit save_mr(struct map_
 unsigned long __init_refok init_memory_mapping(unsigned long start,
 					       unsigned long end)
 {
-	unsigned long page_size_mask = 0;
 	unsigned long start_pfn, end_pfn;
 	unsigned long ret = 0;
 	unsigned long pos;
-
 	struct map_range mr[NR_RANGE_MR];
 	int nr_range, i;
-	int use_pse, use_gbpages;
 
 	printk(KERN_INFO "init_memory_mapping: %016lx-%016lx\n", start, end);
 
-#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
-	/*
-	 * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
-	 * This will simplify cpa(), which otherwise needs to support splitting
-	 * large pages into small in interrupt context, etc.
-	 */
-	use_pse = use_gbpages = 0;
-#else
-	use_pse = cpu_has_pse;
-	use_gbpages = direct_gbpages;
-#endif
-
-	/* Enable PSE if available */
-	if (cpu_has_pse)
-		set_in_cr4(X86_CR4_PSE);
-
-	/* Enable PGE if available */
-	if (cpu_has_pge) {
-		set_in_cr4(X86_CR4_PGE);
-		__supported_pte_mask |= _PAGE_GLOBAL;
-	}
-
-	if (use_gbpages)
-		page_size_mask |= 1 << PG_LEVEL_1G;
-	if (use_pse)
-		page_size_mask |= 1 << PG_LEVEL_2M;
+	probe_page_size_mask();
 
 	memset(mr, 0, sizeof(mr));
 	nr_range = 0;
@@ -258,7 +260,7 @@ unsigned long __init_refok init_memory_m
 	 * nodes are discovered.
 	 */
 	if (!after_bootmem)
-		find_early_table_space(end, use_pse, use_gbpages);
+		find_early_table_space(end);
 
 	for (i = 0; i < nr_range; i++)
 		ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/3] x86,mm: Introduce init_memory_mapping_ext()
  2011-02-25 10:03               ` Ingo Molnar
  2011-02-25 20:22                 ` Yinghai Lu
  2011-02-26  3:06                 ` [PATCH 1/3] x86, mm: Introduce global page_size_mask Yinghai Lu
@ 2011-02-26  3:07                 ` Yinghai Lu
  2011-02-26  3:08                 ` [PATCH 3/3] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
  3 siblings, 0 replies; 24+ messages in thread
From: Yinghai Lu @ 2011-02-26  3:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel


Add extra input tbl_end. it could be smaller than end.

Prepare for init_memory_mapping_high() to align boundary to 1G.
aka end could round up to 1g bound, and will be bigger then original
node end.

init_memory_mapping will call init_memory_mapping_ext with tbl=end.

-v2: updated after page_size_mask change

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/include/asm/page_types.h |    7 +++++--
 arch/x86/mm/init.c                |   21 ++++++++++++++-------
 2 files changed, 19 insertions(+), 9 deletions(-)

Index: linux-2.6/arch/x86/include/asm/page_types.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page_types.h
+++ linux-2.6/arch/x86/include/asm/page_types.h
@@ -51,8 +51,11 @@ static inline phys_addr_t get_max_mapped
 	return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
 }
 
-extern unsigned long init_memory_mapping(unsigned long start,
-					 unsigned long end);
+unsigned long init_memory_mapping_ext(unsigned long start,
+				      unsigned long end,
+				      unsigned long tbl_end);
+
+unsigned long init_memory_mapping(unsigned long start, unsigned long end);
 
 void init_memory_mapping_high(void);
 
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -32,9 +32,10 @@ int direct_gbpages
 
 int page_size_mask = -1;
 
-static void __init find_early_table_space(unsigned long end)
+static void __init find_early_table_space(unsigned long end,
+					  unsigned long tbl_end)
 {
-	unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
+	unsigned long puds, pmds, ptes, tables, start = 0;
 	phys_addr_t base;
 
 	puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
@@ -67,10 +68,10 @@ static void __init find_early_table_spac
 	/* for fixmap */
 	tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
 
-	good_end = max_pfn_mapped << PAGE_SHIFT;
+	tbl_end = max_pfn_mapped << PAGE_SHIFT;
 #endif
 
-	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
+	base = memblock_find_in_range(start, tbl_end, tables, PAGE_SIZE);
 	if (base == MEMBLOCK_ERROR)
 		panic("Cannot find space for the kernel page tables");
 
@@ -144,8 +145,9 @@ static int __meminit save_mr(struct map_
  * This runs before bootmem is initialized and gets pages directly from
  * the physical memory. To access them they are temporarily mapped.
  */
-unsigned long __init_refok init_memory_mapping(unsigned long start,
-					       unsigned long end)
+unsigned long __init_refok init_memory_mapping_ext(unsigned long start,
+					       unsigned long end,
+					       unsigned long tbl_end)
 {
 	unsigned long start_pfn, end_pfn;
 	unsigned long ret = 0;
@@ -260,7 +262,7 @@ unsigned long __init_refok init_memory_m
 	 * nodes are discovered.
 	 */
 	if (!after_bootmem)
-		find_early_table_space(end);
+		find_early_table_space(end, tbl_end);
 
 	for (i = 0; i < nr_range; i++)
 		ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
@@ -284,6 +286,11 @@ unsigned long __init_refok init_memory_m
 	return ret >> PAGE_SHIFT;
 }
 
+unsigned long __init_refok init_memory_mapping(unsigned long start,
+					       unsigned long end)
+{
+	return init_memory_mapping_ext(start, end, end);
+}
 
 /*
  * devmem_is_allowed() checks to see if /dev/mem access to a certain address

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 3/3] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high()
  2011-02-25 10:03               ` Ingo Molnar
                                   ` (2 preceding siblings ...)
  2011-02-26  3:07                 ` [PATCH 2/3] x86,mm: Introduce init_memory_mapping_ext() Yinghai Lu
@ 2011-02-26  3:08                 ` Yinghai Lu
  2011-02-26 10:36                   ` Tejun Heo
  3 siblings, 1 reply; 24+ messages in thread
From: Yinghai Lu @ 2011-02-26  3:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel


tj pointed out:
	when node does not have 1G aligned boundary, like 128M.
init_memory_mapping_high() could render smaller mapping by 128M on one node,
and 896M on next node with 2M pages instead of 1g page. that could increase
TLB presure.

So if gb page is used, try to align the boundary to 1G before calling
init_memory_mapping_ext(), to make sure only use one 1g entry for that cross
node 1G.
Need to init_meory_mapping_ext() to table tbl_end, to make sure pgtable is on
previous node instead of next node.

on one AMD 512g system with not aligned boundary (extra 768M)
before the patch
[    0.000000] init_memory_mapping: [0x00000000000000-0x000000d7f9ffff]
[    0.000000]  0000000000 - 00c0000000 page 1G
[    0.000000]  00c0000000 - 00d7e00000 page 2M
[    0.000000]  00d7e00000 - 00d7fa0000 page 4k
[    0.000000] kernel direct mapping tables up to d7fa0000 @ [0xd7f9d000-0xd7f9ffff] pre-allocated
[    0.000000] kernel direct mapping tables up to d7fa0000 @ [0xd7f9d000-0xd7f9efff] final
[    0.000000]     memblock_x86_reserve_range: [0xd7f9d000-0xd7f9efff]          PGTABLE
...
[    0.000000] Adding active range (0, 0x10, 0x98) 0 entries of 3200 used
[    0.000000] Adding active range (0, 0x100, 0xd7fa0) 1 entries of 3200 used
[    0.000000] Adding active range (0, 0x100000, 0x1028000) 2 entries of 3200 used
[    0.000000] Adding active range (1, 0x1028000, 0x2028000) 3 entries of 3200 used
[    0.000000] Adding active range (2, 0x2028000, 0x3028000) 4 entries of 3200 used
[    0.000000] Adding active range (3, 0x3028000, 0x4028000) 5 entries of 3200 used
[    0.000000] Adding active range (4, 0x4028000, 0x5028000) 6 entries of 3200 used
[    0.000000] Adding active range (5, 0x5028000, 0x6028000) 7 entries of 3200 used
[    0.000000] Adding active range (6, 0x6028000, 0x7028000) 8 entries of 3200 used
[    0.000000] Adding active range (7, 0x7028000, 0x8028000) 9 entries of 3200 used
[    0.000000] init_memory_mapping: [0x00000100000000-0x00001027ffffff]
[    0.000000]  0100000000 - 1000000000 page 1G
[    0.000000]  1000000000 - 1028000000 page 2M
[    0.000000] kernel direct mapping tables up to 1028000000 @ [0x1027ffe000-0x1027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 1028000000 @ [0x1027ffe000-0x1027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x1027ffe000-0x1027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00001028000000-0x00002027ffffff]
[    0.000000]  1028000000 - 1040000000 page 2M
[    0.000000]  1040000000 - 2000000000 page 1G
[    0.000000]  2000000000 - 2028000000 page 2M
[    0.000000] kernel direct mapping tables up to 2028000000 @ [0x2027ffe000-0x2027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 2028000000 @ [0x2027ffe000-0x2027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x2027ffe000-0x2027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00002028000000-0x00003027ffffff]
[    0.000000]  2028000000 - 2040000000 page 2M
[    0.000000]  2040000000 - 3000000000 page 1G
[    0.000000]  3000000000 - 3028000000 page 2M
[    0.000000] kernel direct mapping tables up to 3028000000 @ [0x3027ffe000-0x3027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 3028000000 @ [0x3027ffe000-0x3027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x3027ffe000-0x3027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00003028000000-0x00004027ffffff]
[    0.000000]  3028000000 - 3040000000 page 2M
[    0.000000]  3040000000 - 4000000000 page 1G
[    0.000000]  4000000000 - 4028000000 page 2M
[    0.000000] kernel direct mapping tables up to 4028000000 @ [0x4027ffe000-0x4027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 4028000000 @ [0x4027ffe000-0x4027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x4027ffe000-0x4027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00004028000000-0x00005027ffffff]
[    0.000000]  4028000000 - 4040000000 page 2M
[    0.000000]  4040000000 - 5000000000 page 1G
[    0.000000]  5000000000 - 5028000000 page 2M
[    0.000000] kernel direct mapping tables up to 5028000000 @ [0x5027ffe000-0x5027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 5028000000 @ [0x5027ffe000-0x5027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x5027ffe000-0x5027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00005028000000-0x00006027ffffff]
[    0.000000]  5028000000 - 5040000000 page 2M
[    0.000000]  5040000000 - 6000000000 page 1G
[    0.000000]  6000000000 - 6028000000 page 2M
[    0.000000] kernel direct mapping tables up to 6028000000 @ [0x6027ffe000-0x6027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 6028000000 @ [0x6027ffe000-0x6027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x6027ffe000-0x6027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00006028000000-0x00007027ffffff]
[    0.000000]  6028000000 - 6040000000 page 2M
[    0.000000]  6040000000 - 7000000000 page 1G
[    0.000000]  7000000000 - 7028000000 page 2M
[    0.000000] kernel direct mapping tables up to 7028000000 @ [0x7027ffe000-0x7027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 7028000000 @ [0x7027ffe000-0x7027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x7027ffe000-0x7027ffefff]          PGTABLE
[    0.000000] init_memory_mapping: [0x00007028000000-0x00008027ffffff]
[    0.000000]  7028000000 - 7040000000 page 2M
[    0.000000]  7040000000 - 8000000000 page 1G
[    0.000000]  8000000000 - 8028000000 page 2M
[    0.000000] kernel direct mapping tables up to 8028000000 @ [0x8027ffd000-0x8027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 8028000000 @ [0x8027ffd000-0x8027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x8027ffd000-0x8027ffefff]          PGTABLE

after patch
...
[    0.000000] init_memory_mapping: [0x00000100000000-0x0000103fffffff]
[    0.000000]  0100000000 - 1040000000 page 1G
[    0.000000] kernel direct mapping tables up to 1040000000 @ [0x1027fff000-0x1027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00001040000000-0x0000203fffffff]
[    0.000000]  1040000000 - 2040000000 page 1G
[    0.000000] kernel direct mapping tables up to 2040000000 @ [0x2027fff000-0x2027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00002040000000-0x0000303fffffff]
[    0.000000]  2040000000 - 3040000000 page 1G
[    0.000000] kernel direct mapping tables up to 3040000000 @ [0x3027fff000-0x3027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00003040000000-0x0000403fffffff]
[    0.000000]  3040000000 - 4040000000 page 1G
[    0.000000] kernel direct mapping tables up to 4040000000 @ [0x4027fff000-0x4027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00004040000000-0x0000503fffffff]
[    0.000000]  4040000000 - 5040000000 page 1G
[    0.000000] kernel direct mapping tables up to 5040000000 @ [0x5027fff000-0x5027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00005040000000-0x0000603fffffff]
[    0.000000]  5040000000 - 6040000000 page 1G
[    0.000000] kernel direct mapping tables up to 6040000000 @ [0x6027fff000-0x6027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00006040000000-0x0000703fffffff]
[    0.000000]  6040000000 - 7040000000 page 1G
[    0.000000] kernel direct mapping tables up to 7040000000 @ [0x7027fff000-0x7027ffffff] pre-allocated
[    0.000000] init_memory_mapping: [0x00007040000000-0x00008027ffffff]
[    0.000000]  7040000000 - 8000000000 page 1G
[    0.000000]  8000000000 - 8028000000 page 2M
[    0.000000] kernel direct mapping tables up to 8028000000 @ [0x8027ffd000-0x8027ffffff] pre-allocated
[    0.000000] kernel direct mapping tables up to 8028000000 @ [0x8027ffd000-0x8027ffefff] final
[    0.000000]     memblock_x86_reserve_range: [0x8027ffd000-0x8027ffefff]          PGTABLE

So it fix the extra mapping problem.

-v2:  Ingo is not happy with the #ifdef detection etc.
	use page_size_mask instead of checking if gbpage is used or not.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/mm/init_64.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -614,6 +614,7 @@ struct mapping_work_data {
 	unsigned long start;
 	unsigned long end;
 	unsigned long pfn_mapped;
+	unsigned long align;
 };
 
 static int __init_refok
@@ -621,7 +622,14 @@ mapping_work_fn(unsigned long start_pfn,
 {
 	struct mapping_work_data *data = datax;
 	unsigned long pfn_mapped;
-	unsigned long final_start, final_end;
+	unsigned long final_start, final_end, tbl_end;
+
+	tbl_end = end_pfn << PAGE_SHIFT;
+	/* need to align them to 1G or 2M boundary to avoid smaller mapping */
+	start_pfn = round_down(start_pfn, data->align>>PAGE_SHIFT);
+	if (start_pfn < data->pfn_mapped)
+		start_pfn = data->pfn_mapped;
+	end_pfn = round_up(end_pfn, data->align>>PAGE_SHIFT);
 
 	final_start = max_t(unsigned long, start_pfn<<PAGE_SHIFT, data->start);
 	final_end = min_t(unsigned long, end_pfn<<PAGE_SHIFT, data->end);
@@ -629,7 +637,7 @@ mapping_work_fn(unsigned long start_pfn,
 	if (final_end <= final_start)
 		return 0;
 
-	pfn_mapped = init_memory_mapping(final_start, final_end);
+	pfn_mapped = init_memory_mapping_ext(final_start, final_end, tbl_end);
 
 	if (pfn_mapped > data->pfn_mapped)
 		data->pfn_mapped = pfn_mapped;
@@ -645,6 +653,7 @@ init_memory_mapping_active_regions(unsig
 	data.start = start;
 	data.end = end;
 	data.pfn_mapped = 0;
+	data.align = (page_size_mask & (1<<PG_LEVEL_1G)) ? 1UL<<30 : 1UL<<21;
 
 	work_with_active_regions(MAX_NUMNODES, mapping_work_fn, &data);
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high()
  2011-02-26  3:08                 ` [PATCH 3/3] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
@ 2011-02-26 10:36                   ` Tejun Heo
  2011-02-26 10:55                     ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2011-02-26 10:36 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Ingo Molnar, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel

On Fri, Feb 25, 2011 at 07:08:40PM -0800, Yinghai Lu wrote:
> +	end_pfn = round_up(end_pfn, data->align>>PAGE_SHIFT);

And now you're mapping beyond max_pfn without even noting the behavior
change _anywhere_.  What the hell?  It's not like this point hasn't
been brought up before.  It has been mentioned _twice_ in this very
thread.  Come on.

<rant>
>From the previous responses, I suppose you wouldn't care for this
advice but well I'll give it anyway.  Spend time and effort
documenting the changes and their rationales you make in the comments
and changelog.  Putting those things in words will force _yourself_ to
think about whether the changes are accurate and justified in addition
to helping other people understand and review the changes.  And down
the road, after several years, when someone, even yourself, needs to
change the related code again, [s]he will be able to find out and
understand how and why things are implemented much more easily.

In the second patch, you added @tbl_end and your explanation was
@tbl_end could be shorter than @end.  What is that?  How is anyone
supposed to understand what that means?  You needed that change
because the code currently depends on memory range to do NUMA affine
allocation and when nodes are interleaved the rounding up may end up
allocating page table from a different node.  If you have put that in
words, you would probably have recognized how lame and cryptic that
piece of code is and other reviewers would also have much easier time
understanding what that is doing and say no.  And maybe this is too
much to ask but why not add a nice docbook comment while you're adding
an extra parameter?

At this point, I really find it difficult to take your patches
seriously.  They're cryptic, badly documented, and making behavior
changes left and right and even when advised you don't even try to
describe the changes and rationales.
</rant>

I know you know the code and hardware and have keen eyes for details.
_Please_ give it a bit more effort.

For this one, I think I'll just redo the patches and rip out the
memblock iteration code.  The complexity doesn't really seem
justified.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high()
  2011-02-26 10:36                   ` Tejun Heo
@ 2011-02-26 10:55                     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2011-02-26 10:55 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Ingo Molnar, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel

On Sat, Feb 26, 2011 at 11:36:16AM +0100, Tejun Heo wrote:
> On Fri, Feb 25, 2011 at 07:08:40PM -0800, Yinghai Lu wrote:
> > +	end_pfn = round_up(end_pfn, data->align>>PAGE_SHIFT);
> 
> And now you're mapping beyond max_pfn without even noting the behavior
> change _anywhere_.  What the hell?  It's not like this point hasn't
> been brought up before.  It has been mentioned _twice_ in this very
> thread.  Come on.

Reading the code again, the range is capped by
mapping_work_data->start/end, so it won't go over max_pfn.  I
apologize for the unwarranted rant about this part, but unfortunately
the rest of the complaints still stand.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high()
  2011-02-25  6:20             ` [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
  2011-02-25 10:03               ` Ingo Molnar
@ 2011-02-25 11:16               ` Tejun Heo
  2011-02-25 20:18                 ` Yinghai Lu
  1 sibling, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2011-02-25 11:16 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel

On Thu, Feb 24, 2011 at 10:20:35PM -0800, Yinghai Lu wrote:
> tj pointed out:
> 	when node does not have 1G aligned boundary, like 128M.
> init_memory_mapping_high() could render smaller mapping by 128M on one node,
> and 896M on next node with 2M pages instead of 1g page. that could increase
> TLB presure.
> 
> So if gb page is used, try to align the boundary to 1G before calling
> init_memory_mapping_ext(), to make sure only use one 1g entry for that cross
> node 1G.
> Need to init_meory_mapping_ext() to table tbl_end, to make sure pgtable is on
> previous node instead of next node.

I don't know, Yinghai.  The whole code seems overly complicated to me.
Just ignore e820 map when building linear mapping.  It doesn't matter.
Why not just do something like the following?  Also, can you please
add some comments explaining how the NUMA affine allocation actually
works for page tables?  Or better, can you please make that explicit?
It currently depends on memories being registered in ascending address
order, right?  The memblock code already is NUMA aware, I think it
would be far better to make the node affine part explicit.

Thanks.

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 46e684f..4fd0b59 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -966,6 +966,11 @@ void __init setup_arch(char **cmdline_p)
 	memblock.current_limit = get_max_mapped();
 
 	/*
+	 * Add whole lot of comment explaining what's going on and WHY
+	 * because as it currently stands, it's frigging cryptic.
+	 */
+
+	/*
 	 * NOTE: On x86-32, only from this point on, fixmaps are ready for use.
 	 */
 
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index 7757d22..50ec03c 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -536,8 +536,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 	if (!numa_meminfo_cover_memory(mi))
 		return -EINVAL;
 
-	init_memory_mapping_high();
-
 	/* Finally register nodes. */
 	for_each_node_mask(nid, node_possible_map) {
 		u64 start = (u64)max_pfn << PAGE_SHIFT;
@@ -550,8 +548,12 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			end = max(mi->blk[i].end, end);
 		}
 
-		if (start < end)
+		if (start < end) {
+			init_memory_mapping(
+			  ALIGN_DOWN_TO_MAX_MAP_SIZE_AND_CONVERT_TO_PFN(start),
+			  ALIGN_UP_SIMILARY_BUT_DONT_GO_OVER_MAX_PFN(end));
 			setup_node_bootmem(nid, start, end);
+		}
 	}
 
 	return 0;


-- 
tejun

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high()
  2011-02-25 11:16               ` [PATCH 2/2] " Tejun Heo
@ 2011-02-25 20:18                 ` Yinghai Lu
  2011-02-26  8:57                   ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Yinghai Lu @ 2011-02-25 20:18 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel

On 02/25/2011 03:16 AM, Tejun Heo wrote:
> On Thu, Feb 24, 2011 at 10:20:35PM -0800, Yinghai Lu wrote:
>> tj pointed out:
>> 	when node does not have 1G aligned boundary, like 128M.
>> init_memory_mapping_high() could render smaller mapping by 128M on one node,
>> and 896M on next node with 2M pages instead of 1g page. that could increase
>> TLB presure.
>>
>> So if gb page is used, try to align the boundary to 1G before calling
>> init_memory_mapping_ext(), to make sure only use one 1g entry for that cross
>> node 1G.
>> Need to init_meory_mapping_ext() to table tbl_end, to make sure pgtable is on
>> previous node instead of next node.
> 
> I don't know, Yinghai.  The whole code seems overly complicated to me.
> Just ignore e820 map when building linear mapping.  It doesn't matter.
> Why not just do something like the following?  Also, can you please
> add some comments explaining how the NUMA affine allocation actually
> works for page tables?
yes, that could be done in separated patch.

>  Or better, can you please make that explicit?
> It currently depends on memories being registered in ascending address
> order, right?  The memblock code already is NUMA aware, I think it
> would be far better to make the node affine part explicit.

yes, memblock is numa aware after memblock_x86_register_active_regions().
and it rely on early_node_map[].

do you mean let init_memory_mapping to take node id like setup_node_bootmem?
so find_early_table_space could take nodeid instead of tbl_end? 

> 
> Thanks.
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 46e684f..4fd0b59 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -966,6 +966,11 @@ void __init setup_arch(char **cmdline_p)
>  	memblock.current_limit = get_max_mapped();
>  
>  	/*
> +	 * Add whole lot of comment explaining what's going on and WHY
> +	 * because as it currently stands, it's frigging cryptic.
> +	 */
> +
> +	/*
>  	 * NOTE: On x86-32, only from this point on, fixmaps are ready for use.
>  	 */
>  
> diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
> index 7757d22..50ec03c 100644
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -536,8 +536,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>  	if (!numa_meminfo_cover_memory(mi))
>  		return -EINVAL;
>  
> -	init_memory_mapping_high();
> -
>  	/* Finally register nodes. */
>  	for_each_node_mask(nid, node_possible_map) {
>  		u64 start = (u64)max_pfn << PAGE_SHIFT;
> @@ -550,8 +548,12 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>  			end = max(mi->blk[i].end, end);
>  		}
>  
> -		if (start < end)
> +		if (start < end) {
> +			init_memory_mapping(
> +			  ALIGN_DOWN_TO_MAX_MAP_SIZE_AND_CONVERT_TO_PFN(start),
> +			  ALIGN_UP_SIMILARY_BUT_DONT_GO_OVER_MAX_PFN(end));
>  			setup_node_bootmem(nid, start, end);
> +		}
will have problem with cross node conf. like 0-4g, 8-12g on node0, 4g-8g, 12g-16g on node1.

>  	}
>  
>  	return 0;
> 
> 

Thanks

Yinghai Lu

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high()
  2011-02-25 20:18                 ` Yinghai Lu
@ 2011-02-26  8:57                   ` Tejun Heo
  2011-02-27 11:53                     ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2011-02-26  8:57 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel

On Fri, Feb 25, 2011 at 12:18:44PM -0800, Yinghai Lu wrote:
> >  Or better, can you please make that explicit?
> > It currently depends on memories being registered in ascending address
> > order, right?  The memblock code already is NUMA aware, I think it
> > would be far better to make the node affine part explicit.
> 
> yes, memblock is numa aware after memblock_x86_register_active_regions().
> and it rely on early_node_map[].
> 
> do you mean let init_memory_mapping to take node id like setup_node_bootmem?
> so find_early_table_space could take nodeid instead of tbl_end? 

Yeap.

> > @@ -550,8 +548,12 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >  			end = max(mi->blk[i].end, end);
> >  		}
> >  
> > -		if (start < end)
> > +		if (start < end) {
> > +			init_memory_mapping(
> > +			  ALIGN_DOWN_TO_MAX_MAP_SIZE_AND_CONVERT_TO_PFN(start),
> > +			  ALIGN_UP_SIMILARY_BUT_DONT_GO_OVER_MAX_PFN(end));
> >  			setup_node_bootmem(nid, start, end);
> > +		}
> will have problem with cross node conf. like 0-4g, 8-12g on node0, 4g-8g, 12g-16g on node1.

And how common are they?  This whole cruft is basically meaningless if
1GiB mapping is supported, IOW, basically on all AMD 64s and all
post-nehalem intels.  Why not just cite the limitation in the comment
and stick to something simple?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high()
  2011-02-26  8:57                   ` Tejun Heo
@ 2011-02-27 11:53                     ` Ingo Molnar
  0 siblings, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2011-02-27 11:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yinghai Lu, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> > will have problem with cross node conf. like 0-4g, 8-12g on node0, 4g-8g, 
> > 12g-16g on node1.
> 
> And how common are they?  This whole cruft is basically meaningless if 1GiB 
> mapping is supported, IOW, basically on all AMD 64s and all post-nehalem intels.  
> Why not just cite the limitation in the comment and stick to something simple?

Such complexity should be justified via very careful "perf stat --repeat" 
measurements.

I.e. showing 'before patch' and 'after patch' instruction, TLB miss and cycle 
counts, showing that a positive effect that goes beyond the noise of the measurement 
exists.

1GB mappings should be assumed as the common case - anything else probably does not 
matter from a future performance/scalability POV. For vmalloc() it might make sense 
- but even for those precise measurements should be done about the positive effect.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: questions about init_memory_mapping_high()
  2011-02-23 17:19 questions about init_memory_mapping_high() Tejun Heo
  2011-02-23 20:24 ` Yinghai Lu
@ 2011-02-28 18:14 ` H. Peter Anvin
  2011-03-01  8:29   ` Tejun Heo
  1 sibling, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2011-02-28 18:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Yinghai Lu, x86, Ingo Molnar, Thomas Gleixner, linux-kernel

On 02/23/2011 09:19 AM, Tejun Heo wrote:
> Hello, guys.
> 
> I've been looking at init_memory_mapping_high() added by commit
> 1411e0ec31 (x86-64, numa: Put pgtable to local node memory) and I got
> curious about several things.
> 
> 1. The only rationale given in the commit description is that a
>    RED-PEN is killed, which was the following.
> 
> 	/*
> 	 * RED-PEN putting page tables only on node 0 could
> 	 * cause a hotspot and fill up ZONE_DMA. The page tables
> 	 * need roughly 0.5KB per GB.
> 	 */
> 
>    This already wasn't true with top-down memblock allocation.
> 
>    The 0.5KB per GiB comment is for 32bit w/ 3 level mapping.  On
>    64bit, it's ~4KiB per GiB when using 2MiB mappings and, well, very
>    small per GiB if 1GiB mapping is used.  Even with 2MiB mapping,
>    1TiB mapping would only be 4MiB.  Under ZONE_DMA, this could be
>    problematic but with top-down this can't be a problem in any
>    realistic way in foreseeable future.
> 

It's true on 64 bits too when PAE is not available (e.g. with Xen.)

	-hpa

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: questions about init_memory_mapping_high()
  2011-02-28 18:14 ` questions about init_memory_mapping_high() H. Peter Anvin
@ 2011-03-01  8:29   ` Tejun Heo
  2011-03-01 19:44     ` H. Peter Anvin
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2011-03-01  8:29 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Yinghai Lu, x86, Ingo Molnar, Thomas Gleixner, linux-kernel

Hey,

(sorry about the earlier empty reply, fat finger on my phone)

On Mon, Feb 28, 2011 at 10:14:44AM -0800, H. Peter Anvin wrote:
> > 1. The only rationale given in the commit description is that a
> >    RED-PEN is killed, which was the following.
> > 
> > 	/*
> > 	 * RED-PEN putting page tables only on node 0 could
> > 	 * cause a hotspot and fill up ZONE_DMA. The page tables
> > 	 * need roughly 0.5KB per GB.
> > 	 */
> > 
> >    This already wasn't true with top-down memblock allocation.
> > 
> >    The 0.5KB per GiB comment is for 32bit w/ 3 level mapping.  On
> >    64bit, it's ~4KiB per GiB when using 2MiB mappings and, well, very
> >    small per GiB if 1GiB mapping is used.  Even with 2MiB mapping,
> >    1TiB mapping would only be 4MiB.  Under ZONE_DMA, this could be
> >    problematic but with top-down this can't be a problem in any
> >    realistic way in foreseeable future.
> > 
> 
> It's true on 64 bits too when PAE is not available (e.g. with Xen.)

Hmm... I don't follow.  Can you elaborate?  If PAE is not available
for whatever reason, the physical memory is limited to 4GiB but I
don't follow what that has to do with the above.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: questions about init_memory_mapping_high()
  2011-03-01  8:29   ` Tejun Heo
@ 2011-03-01 19:44     ` H. Peter Anvin
  0 siblings, 0 replies; 24+ messages in thread
From: H. Peter Anvin @ 2011-03-01 19:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Yinghai Lu, x86, Ingo Molnar, Thomas Gleixner, linux-kernel

On 03/01/2011 12:29 AM, Tejun Heo wrote:
> Hey,
> 
> (sorry about the earlier empty reply, fat finger on my phone)
> 
> On Mon, Feb 28, 2011 at 10:14:44AM -0800, H. Peter Anvin wrote:
>>> 1. The only rationale given in the commit description is that a
>>>    RED-PEN is killed, which was the following.
>>>
>>> 	/*
>>> 	 * RED-PEN putting page tables only on node 0 could
>>> 	 * cause a hotspot and fill up ZONE_DMA. The page tables
>>> 	 * need roughly 0.5KB per GB.
>>> 	 */
>>>
>>>    This already wasn't true with top-down memblock allocation.
>>>
>>>    The 0.5KB per GiB comment is for 32bit w/ 3 level mapping.  On
>>>    64bit, it's ~4KiB per GiB when using 2MiB mappings and, well, very
>>>    small per GiB if 1GiB mapping is used.  Even with 2MiB mapping,
>>>    1TiB mapping would only be 4MiB.  Under ZONE_DMA, this could be
>>>    problematic but with top-down this can't be a problem in any
>>>    realistic way in foreseeable future.
>>>
>>
>> It's true on 64 bits too when PAE is not available (e.g. with Xen.)
> 
> Hmm... I don't follow.  Can you elaborate?  If PAE is not available
> for whatever reason, the physical memory is limited to 4GiB but I
> don't follow what that has to do with the above.
> 

Sorry, PSE, not PAE.

	-hpa

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2011-03-01 19:44 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-02-23 17:19 questions about init_memory_mapping_high() Tejun Heo
2011-02-23 20:24 ` Yinghai Lu
2011-02-23 20:46   ` Tejun Heo
2011-02-23 20:51     ` Yinghai Lu
2011-02-23 21:03       ` Tejun Heo
2011-02-23 22:17         ` Yinghai Lu
2011-02-24  9:15           ` Tejun Heo
2011-02-25  1:37             ` Yinghai Lu
2011-02-25  1:38             ` [PATCH 1/2] x86,mm: Introduce init_memory_mapping_ext() Yinghai Lu
2011-02-25  6:20             ` [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
2011-02-25 10:03               ` Ingo Molnar
2011-02-25 20:22                 ` Yinghai Lu
2011-02-26  3:06                 ` [PATCH 1/3] x86, mm: Introduce global page_size_mask Yinghai Lu
2011-02-26  3:07                 ` [PATCH 2/3] x86,mm: Introduce init_memory_mapping_ext() Yinghai Lu
2011-02-26  3:08                 ` [PATCH 3/3] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
2011-02-26 10:36                   ` Tejun Heo
2011-02-26 10:55                     ` Tejun Heo
2011-02-25 11:16               ` [PATCH 2/2] " Tejun Heo
2011-02-25 20:18                 ` Yinghai Lu
2011-02-26  8:57                   ` Tejun Heo
2011-02-27 11:53                     ` Ingo Molnar
2011-02-28 18:14 ` questions about init_memory_mapping_high() H. Peter Anvin
2011-03-01  8:29   ` Tejun Heo
2011-03-01 19:44     ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox