public inbox for linux-kernel@vger.kernel.org
From: Yinghai Lu <yinghai@kernel.org>
To: Tejun Heo <tj@kernel.org>
Cc: x86@kernel.org, Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	linux-kernel@vger.kernel.org
Subject: Re: questions about init_memory_mapping_high()
Date: Wed, 23 Feb 2011 12:24:58 -0800	[thread overview]
Message-ID: <4D656D1A.7030006@kernel.org> (raw)
In-Reply-To: <20110223171945.GI26065@htj.dyndns.org>

On 02/23/2011 09:19 AM, Tejun Heo wrote:
> Hello, guys.
> 
> I've been looking at init_memory_mapping_high() added by commit
> 1411e0ec31 (x86-64, numa: Put pgtable to local node memory) and I got
> curious about several things.
> 
> 1. The only rationale given in the commit description is that a
>    RED-PEN is killed, which was the following.
> 
> 	/*
> 	 * RED-PEN putting page tables only on node 0 could
> 	 * cause a hotspot and fill up ZONE_DMA. The page tables
> 	 * need roughly 0.5KB per GB.
> 	 */
> 
>    This already wasn't true with top-down memblock allocation.
> 
>    The 0.5KB per GiB comment is for 32bit w/ 3 level mapping.  On
>    64bit, it's ~4KiB per GiB when using 2MiB mappings and, well, very
>    small per GiB if 1GiB mapping is used.  Even with 2MiB mapping,
>    1TiB mapping would only be 4MiB.  Under ZONE_DMA, this could be
>    problematic but with top-down this can't be a problem in any
>    realistic way in foreseeable future.

Before that patch set:
the page table for [0, 4g) is placed just under and near 512M.
the page table for [4g, 128g) is placed just under and near 2g (assuming 0-2g of the RAM is under 4g).

After the first patch in the patch set:
the page table for [0, 4g) is placed just under and near 2g (assuming 0-2g of the RAM is under 4g).
the page table for [4g, 128g) is placed just under and near 128g.

So top-down allocation could put most of the page tables on the last node.

For debugging purposes, 2M and 1G pages can be disabled.

code excerpt from init_memory_mapping()

        printk(KERN_INFO "init_memory_mapping: %016lx-%016lx\n", start, end);

#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
        /*
         * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
         * This will simplify cpa(), which otherwise needs to support splitting
         * large pages into small in interrupt context, etc.
         */
        use_pse = use_gbpages = 0;
#else
        use_pse = cpu_has_pse;
        use_gbpages = direct_gbpages;
#endif


> 
> 2. In most cases, the kernel mapping ends up using 1GiB mappings and
>    when using 1GiB mappings, a single second level table would cover
>    512GiB of memory.  IOW, little, if any, is gained by trying to
>    allocate the page table on node local memory when 1GiB mappings are
>    used, they end up sharing the same page somewhere anyway.
> 
>    I guess this was the reason why the commit message showed usage of
>    2MiB mappings so that each node would end up with their own third
>    level page tables.  Is this something we need to optimize for?  I
>    don't recall seeing recent machines which don't use 1GiB pages for
>    the linear mapping.  Are there NUMA machines which can't use 1GiB
>    mappings?
> 
>    Or was this for the future where we would be using a lot more than
>    512GiB of memory?  If so, wouldn't that be a bit over-reaching?
>    Wouldn't we be likely to have 512GiB mappings if we get to a point
>    where NUMA locality of such mappings actually become a problem?


So far:
AMD 64-bit CPUs do support 1GB pages.

Intel's Nehalem-EX does not, and several vendors ship 8-socket NUMA systems based on it with 1024g or 2048g of RAM.

CPUs after Nehalem-EX appear to support 1GB pages.



> 
> 3. The new code creates linear mapping only for memory regions where
>    e820 actually says there is memory as opposed to mapping from base
>    to top.  Again, I'm not sure what the intention of this change was.
>    Having larger mappings over holes is much cheaper than having to
>    break down the mappings into smaller sized mappings around the
>    holes both in terms of memory and run time overhead.  Why would we
>    want to match the linear address mapping to the e820 map exactly?

We don't need to map those holes, if there are any.

For the memory-hotplug case, newly added memory should get mapped later, when it is actually added.

> 
> Also, Yinghai, can you please try to write commit descriptions with
> more details?  It really sucks for other people when they have to
> guess what the actual changes and underlying intentions are.  The
> commit adding init_memory_mapping_high() is very anemic on details
> about how the behavior changes and the only intention given there is
> RED-PEN removal even which is largely a miss.

I don't know what you are talking about. That changelog is clear enough.

Yinghai


Thread overview: 24+ messages
2011-02-23 17:19 questions about init_memory_mapping_high() Tejun Heo
2011-02-23 20:24 ` Yinghai Lu [this message]
2011-02-23 20:46   ` Tejun Heo
2011-02-23 20:51     ` Yinghai Lu
2011-02-23 21:03       ` Tejun Heo
2011-02-23 22:17         ` Yinghai Lu
2011-02-24  9:15           ` Tejun Heo
2011-02-25  1:37             ` Yinghai Lu
2011-02-25  1:38             ` [PATCH 1/2] x86,mm: Introduce init_memory_mapping_ext() Yinghai Lu
2011-02-25  6:20             ` [PATCH 2/2] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
2011-02-25 10:03               ` Ingo Molnar
2011-02-25 20:22                 ` Yinghai Lu
2011-02-26  3:06                 ` [PATCH 1/3] x86, mm: Introduce global page_size_mask Yinghai Lu
2011-02-26  3:07                 ` [PATCH 2/3] x86,mm: Introduce init_memory_mapping_ext() Yinghai Lu
2011-02-26  3:08                 ` [PATCH 3/3] x86,mm,64bit: Round up memory boundary for init_memory_mapping_high() Yinghai Lu
2011-02-26 10:36                   ` Tejun Heo
2011-02-26 10:55                     ` Tejun Heo
2011-02-25 11:16               ` [PATCH 2/2] " Tejun Heo
2011-02-25 20:18                 ` Yinghai Lu
2011-02-26  8:57                   ` Tejun Heo
2011-02-27 11:53                     ` Ingo Molnar
2011-02-28 18:14 ` questions about init_memory_mapping_high() H. Peter Anvin
2011-03-01  8:29   ` Tejun Heo
2011-03-01 19:44     ` H. Peter Anvin
