From: "H. Peter Anvin" <hpa@zytor.com>
To: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	Yinghai Lu <yinghai@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	Tejun Heo <tj@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
Date: Thu, 04 Oct 2012 14:52:45 -0700
Message-ID: <506E052D.7060101@zytor.com>
In-Reply-To: <20121004135646.GE9158@phenom.dumpdata.com>

On 10/04/2012 06:56 AM, Konrad Rzeszutek Wilk wrote:
>
> What Peter had in mind is a nice system where we get rid of
> this linear allocation of page-tables (so pgt_buf_start -> pgt_buf_end
> are linearly allocated). His thinking (and Peter if I mess
> up please correct me), is that we can stick the various pagetables
> in different spots in memory. Mainly that as we look at mapping
> a region (say 0GB->1GB), we look at it in chunks (2MB?) and allocate
> a page-table at the _end_ of the newly mapped chunk if we have
> filled all entries in said pagetable.
>
> For simplicity, let's say we are just dealing with PTE tables and
> we are mapping the region 0GB->1GB with 4KB pages.
>
> First we stick a page-table (or if there is a found one reuse it)
> at the start of the region (so 0-2MB).
>
> 0MB.......................2MB
> /-----\
> |PTE_A|
> \-----/
>
> The PTE entries in it will cover 0->2MB (PTE table #A) and once it is
> finished, it will stick a new pagetable at the end of the 2MB region:
>
> 0MB.......................2MB...........................4MB
> /-----\                /-----\
> |PTE_A|                |PTE_B|
> \-----/                \-----/
>
>
> The PTE_B page table will be used to map 2MB->4MB.
>
> Once that is finished .. we repeat the cycle.
>
> That should remove the utter duct-tape madness and make this a lot
> easier.
>

You got the basic idea right but the details slightly wrong.  Let me try 
to explain.

When we start up, we know we have a set of page tables which maps the 
kernel text, data, bss and brk.  This is set up by the startup code on 
native and by the domain builder on Xen.

We can reserve an arbitrary chunk of brk that is (a) big enough to map 
the kernel text+data+bss+brk itself plus (b) some arbitrary additional 
chunk of memory (perhaps we reserve another 256K of brk or so, enough to 
map 128 MB in the worst case of 4K PAE pages.)
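
As a rough check of the 256K figure, the arithmetic can be sketched as
below.  This is only an illustration: it assumes 8-byte entries (PAE or
64-bit paging), counts only the leaf PTE pages and ignores the few
upper-level table pages, and mappable_bytes() is a made-up helper name,
not a kernel function.

```python
# Sizing sketch for the extra brk reserve.
# Assumption: PAE or 64-bit paging, so each entry is 8 bytes.
PAGE_SIZE = 4096                            # one 4K page
PTE_SIZE = 8                                # bytes per page-table entry
ENTRIES_PER_TABLE = PAGE_SIZE // PTE_SIZE   # entries in one PTE page


def mappable_bytes(reserve_bytes: int) -> int:
    """Memory that `reserve_bytes` worth of PTE pages can map with 4K pages.

    Hypothetical helper: only leaf PTE pages are counted, upper-level
    tables are ignored.
    """
    pte_pages = reserve_bytes // PAGE_SIZE
    return pte_pages * ENTRIES_PER_TABLE * PAGE_SIZE


if __name__ == "__main__":
    mb = mappable_bytes(256 * 1024) // (1024 * 1024)
    print(f"256K of PTE pages maps {mb} MB")
```

Each 4K PTE page covers 2 MB, so the 64 pages in a 256K reserve cover
the 128 MB mentioned above.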

Step 1:

- Create page table mappings for kernel text+data+bss+brk out of the
   brk region.

Step 2:

- Start creating mappings for the topmost memory region downward, until
   the brk reserved area is exhausted.

Step 3:

- Call a paravirt hook on the page tables created so far.  On native
   this does nothing, on Xen it can map it readonly and tell the
   hypervisor it is a page table.

Step 4:

- Switch to the newly created page table.  The bootup page table is now
   obsolete.

Step 5:

- Moving downward from the last address mapped, create new page tables
   for any additional unmapped memory region until either we run out of
   unmapped memory regions, or we run out of mapped memory in which to
   place the page tables for them.

Step 6:

- Call the paravirt hook for the new page tables, then add them to the
   page table tree.

Step 7:

- Repeat from step 5 until there are no more unmapped memory regions.
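
The batching in steps 2 through 7 can be sketched as a small
simulation.  This is a toy model under stated assumptions, not kernel
code: RAM is one contiguous range of page frames, only leaf PTE pages
are tracked (512 entries each), large pages and the kernel's own
mapping from step 1 are left out, and build_direct_map() and hook are
hypothetical names standing in for the real mapping code and the
paravirt hook.

```python
ENTRIES = 512  # 4K pages mapped by one PTE page (PAE/64-bit assumed)


def build_direct_map(ram_pages, brk_pages, hook=lambda batch: None):
    """Simulate top-down mapping of page frames [0, ram_pages).

    brk_pages: page-table pages available in the brk reserve (step 2).
    hook:      stand-in for the paravirt hook, called once per batch.
    Returns a list of batches; each batch is a list of
    (table_location, first_pfn_mapped, end_pfn_mapped_exclusive).
    """
    batches = []
    top = ram_pages  # every frame at or above `top` is already mapped
    # Mapped pages that may be used to hold new page tables.
    free = [("brk", i) for i in range(brk_pages)]
    while top > 0:
        if not free:
            raise RuntimeError("no mapped memory left for page tables")
        batch = []
        batch_top = top
        # Steps 2 and 5: map downward until the mapped pool runs dry.
        while free and top > 0:
            where = free.pop()
            lo = max(0, top - ENTRIES)
            batch.append((where, lo, top))
            top = lo
        # Steps 3 and 6: e.g. Xen marks the new tables read-only here.
        hook(batch)
        # Steps 4 and 6: the batch becomes part of the live page tables.
        batches.append(batch)
        # Memory mapped by this batch can hold the next batch of tables;
        # pop() takes the highest frame first, keeping tables high.
        free = [("ram", pfn) for pfn in range(top, batch_top)]
    return batches


if __name__ == "__main__":
    seen = []
    for n, batch in enumerate(build_direct_map(5120, 2, hook=seen.append)):
        print(f"batch {n}: {len(batch)} tables, "
              f"maps pfn {batch[-1][1]}..{batch[0][2]}")
```

In this model the hook runs once per batch rather than once per page
table, and every batch after the first places its tables in the
highest memory that the previous batch mapped.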


This:

a) removes any need to guesstimate how much memory the page tables are
    going to consume.  We simply construct them; they may not be
    contiguous but that's okay.

b) very cleanly solves the Xen problem of not wanting to status-flip
    pages any more than necessary.


The only reason for moving downward rather than upward is that we want 
the page tables as high as possible in memory, since memory at low 
addresses is precious (for stupid DMA devices, for things like 
kexec/kdump, and so on.)

	-hpa





-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

