All of lore.kernel.org
 help / color / mirror / Atom feed
From: Konrad Rzeszutek Wilk <konrad@kernel.org>
To: Jacob Shin <jacob.shin@amd.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	Yinghai Lu <yinghai@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	"H. Peter Anvin" <hpa@zytor.com>, Tejun Heo <tj@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
Date: Thu, 4 Oct 2012 09:56:48 -0400	[thread overview]
Message-ID: <20121004135646.GE9158@phenom.dumpdata.com> (raw)
In-Reply-To: <20121003165105.GA30214@jshin-Toonie>

On Wed, Oct 03, 2012 at 11:51:06AM -0500, Jacob Shin wrote:
> On Mon, Oct 01, 2012 at 12:00:26PM +0100, Stefano Stabellini wrote:
> > On Sun, 30 Sep 2012, Yinghai Lu wrote:
> > > After
> > > 
> > > | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> > > | Author: Takashi Iwai <tiwai@suse.de>
> > > | Date:   Sun Oct 23 23:19:12 2011 +0200
> > > |
> > > |    x86: Fix S4 regression
> > > |
> > > |    Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> > > |    regression since 2.6.39, namely the machine reboots occasionally at S4
> > > |    resume.  It doesn't happen always, overall rate is about 1/20.  But,
> > > |    like other bugs, once when this happens, it continues to happen.
> > > |
> > > |    This patch fixes the problem by essentially reverting the memory
> > > |    assignment in the older way.
> > > 
> > > Have some page table around 512M again, that will prevent kdump to find 512M
> > > under 768M.
> > > 
> > > We need revert that reverting, so we could put page table high again for 64bit.
> > > 
> > > Takashi agreed that S4 regression could be something else.
> > > 
> > > 	https://lkml.org/lkml/2012/6/15/182
> > > 
> > > Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> > > ---
> > >  arch/x86/mm/init.c |    2 +-
> > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > > index 9f69180..aadb154 100644
> > > --- a/arch/x86/mm/init.c
> > > +++ b/arch/x86/mm/init.c
> > > @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
> > >  #ifdef CONFIG_X86_32
> > >  	/* for fixmap */
> > >  	tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> > > -#endif
> > >  	good_end = max_pfn_mapped << PAGE_SHIFT;
> > > +#endif
> > >  
> > >  	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> > >  	if (!base)
> > 
> > Isn't this going to cause init_memory_mapping to allocate pagetable
> > pages from memory not yet mapped?
> > Last time I spoke with HPA and Thomas about this, they seem to agree
> > that it isn't a very good idea.
> > Also, it is proven to cause a certain amount of headaches on Xen,
> > see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
> > 
> 
> Any comments, thoughts? hpa? Yinghai?
> 
> So it seems that during init_memory_mapping Xen needs to modify page table 
> bits and the memory where the page tables live needs to be direct mapped at
> that time.

That is not exactly true. I am not sure if we are just using the wrong
words for it - so let me try to write up what the impediment is.

There is also this talk between Stefano and tglrx that can help in
getting ones' head around it: https://lkml.org/lkml/2012/8/24/335

The restriction that Xen places on Linux page-tables is that they MUST
be read-only when in usage. Meaning if you creating a PTE table (or PMD,
PUD, etc), you can write to it as long as you want - but the moment you
hook it up to a live page-table - it must be marked RO (so the PMD entry
cannot have _PAGE_RW on it). Easy enough.

This means that if we are re-using a pagetable during the
init_memory_mapping (so we iomap it), we need to iomap it with
!_PAGE_RW) - and that is where xen_set_pte_init has a check for
is_early_ioremap_ptep. To add to the fun, the pagetables are expanding -
so as one is ioremapping/iounmaping, you have to check the pgt_buf_end
to check whether the page table we are mapping is within the:
 pgt_buf_start -> pgt_buf_end <- pgt_buf_top

(and pgt_buf_end can increment up to pgt_buf_top).

Now the next part of this that is hard to wrap around is when you
want to create a PTE entries for the pgt_buf_start -> pgt_buf_end.
Its double fun, b/c your pgt_buf_end can increment as you are
trying to create those PTE entries - and you _MUST_ mark those
PTE entries as RO. This is b/c those pagetables (pgt_buf_start ->
pgt_buf_end) are live and only Xen can touch them.

This feels like operating on a live patient, while said patient
is running the marathon. Only duct-tape expert can apply for
this position.


What Peter had in mind is a nice system where we get rid of
this linear allocation of page-tables (so pgt_buf_start -> pgt_buf
_end are linearly allocated). His thinking (and Peter if I mess
up please correct me), is that we can stick the various pagetables
in different spots in memory. Mainly that as we look at mapping
a region (say 0GB->1GB), we look at in chunks (2MB?) and allocate
a page-table at the _end_ of the newly mapped chunk if we have
filled all entries in said pagetable.

For simplicity, lets say we are just dealing with PTE tables and
we are mapping the region 0GB->1GB with 4KB pages.

First we stick a page-table (or if there is a found one reuse it)
at the start of the region (so 0-2MB).

0MB.......................2MB
/-----\
|PTE_A|
\-----/

The PTE entries in it will cover 0->2MB (PTE table #A) and once it is
finished, it will stick a new pagetable at the end of the 2MB region:

0MB.......................2MB...........................4MB
/-----\                /-----\
|PTE_A|                |PTE_B|
\-----/                \-----/


The PTE_B page table will be used to map 2MB->4MB.

Once that is finished .. we repeat the cycle.

That should remove the utter duct-tape madness and make this a lot
easier.

  parent reply	other threads:[~2012-10-04 14:08 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
2012-09-30  7:57 ` [PATCH 01/13] x86, mm: Add global page_size_mask and probe one time only Yinghai Lu
2012-09-30  7:57 ` [PATCH 02/13] x86, mm: Split out split_mem_range from init_memory_mapping Yinghai Lu
2012-09-30  7:57 ` [PATCH 03/13] x86, mm: Move init_memory_mapping calling out of setup.c Yinghai Lu
2012-09-30  7:57 ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit Yinghai Lu
2012-10-01 11:00   ` Stefano Stabellini
2012-10-03 16:51     ` Jacob Shin
2012-10-03 18:34       ` H. Peter Anvin
2012-10-04 13:56       ` Konrad Rzeszutek Wilk [this message]
2012-10-04 21:52         ` H. Peter Anvin
2012-10-04 16:19       ` Yinghai Lu
2012-10-04 16:46         ` Konrad Rzeszutek Wilk
2012-10-04 21:29           ` Yinghai Lu
2012-10-05 21:04             ` Eric W. Biederman
2012-10-05 21:19               ` Yinghai Lu
2012-10-05 21:32                 ` Eric W. Biederman
2012-10-05 21:37                   ` Yinghai Lu
2012-10-05 21:41                     ` Eric W. Biederman
2012-10-05 21:43                       ` Yinghai Lu
2012-10-05 22:01                         ` 896MB address limit (was: Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit) Eric W. Biederman
2012-10-05 22:01                           ` Eric W. Biederman
2012-10-06  0:18                       ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit H. Peter Anvin
2012-10-06  0:45                         ` Eric W. Biederman
2012-10-06  1:02                           ` H. Peter Anvin
2012-10-06  0:17                   ` H. Peter Anvin
2012-10-06  0:28                     ` Eric W. Biederman
2012-10-06  0:36                       ` H. Peter Anvin
2012-10-04 15:57     ` Yinghai Lu
2012-10-04 16:45       ` Konrad Rzeszutek Wilk
2012-10-04 21:21         ` Yinghai Lu
2012-10-04 21:40           ` Yinghai Lu
2012-10-04 21:41             ` H. Peter Anvin
2012-10-04 21:46               ` Yinghai Lu
2012-10-04 21:54                 ` H. Peter Anvin
2012-10-05  7:46                   ` Yinghai Lu
2012-10-05 11:27                     ` Stefano Stabellini
2012-10-05 14:58                       ` Yinghai Lu
2012-10-06  7:44                         ` [PATCH 0/3] x86: pre mapping page table to make xen happy Yinghai Lu
2012-10-06  7:44                           ` [PATCH 1/3] x86: get early page table from BRK Yinghai Lu
2012-10-08 12:09                             ` Stefano Stabellini
2012-10-06  7:44                           ` [PATCH 2/3] x86, mm: Don't clear page table if next range is ram Yinghai Lu
2012-10-09 15:46                             ` Konrad Rzeszutek Wilk
2012-10-10  1:00                               ` Yinghai Lu
2012-10-10 13:41                                 ` Konrad Rzeszutek Wilk
2012-10-10 14:43                                   ` Yinghai Lu
2012-10-06  7:44                           ` [PATCH 3/3] x86, mm: Remove early_memremap workaround for page table accessing Yinghai Lu
2012-10-09 15:48                             ` Konrad Rzeszutek Wilk
2012-10-08  6:36                         ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit Yinghai Lu
2012-10-05 10:47       ` Stefano Stabellini
2012-09-30  7:57 ` [PATCH 05/13] x86, mm: Find early page table buffer altogether Yinghai Lu
2012-09-30  7:57 ` [PATCH 06/13] x86, mm: Separate out calculate_table_space_size() Yinghai Lu
2012-09-30  7:57 ` [PATCH 07/13] x86, mm: Move down two calculate_table_space_size down Yinghai Lu
2012-09-30  7:57 ` [PATCH 08/13] x86, mm: Set memblock initial limit to 1M Yinghai Lu
2012-09-30  7:57 ` [PATCH 09/13] x86: if kernel .text .data .bss are not marked as E820_RAM, complain and fix Yinghai Lu
2012-09-30  7:57 ` [PATCH 10/13] x86: Fixup code testing if a pfn is direct mapped Yinghai Lu
2012-09-30  7:57 ` [PATCH 11/13] x86: Only direct map addresses that are marked as E820_RAM Yinghai Lu
2012-09-30  7:57 ` [PATCH 12/13] x86/mm: calculate_table_space_size based on memory ranges that are being mapped Yinghai Lu
2012-09-30  7:57 ` [PATCH 13/13] x86, mm: Use func pointer to table size calculation and mapping Yinghai Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121004135646.GE9158@phenom.dumpdata.com \
    --to=konrad@kernel.org \
    --cc=hpa@zytor.com \
    --cc=jacob.shin@amd.com \
    --cc=konrad.wilk@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=stefano.stabellini@eu.citrix.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=yinghai@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.