early fixmap causes kmap breakage

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* early fixmap causes kmap breakage
@ 2008-12-18 21:15 Nick Piggin
  2008-12-29 23:17 ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2008-12-18 21:15 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Ian Campbell

Hi,

I've debugged a problem where i386+pae systems with more than a few CPUs
blow up at boot in the kmap_atomic code.

The problem is that the kmap_atomic pte pages all need to be contiguous
memory because the pte is calculated via the first kmap pte page + an
offset (so as not to have to walk the page tables every time).

The fixmap setup code crudely allocates contiguous pte pages, which is fine,
but if it finds an already populated pmd entry, then it will not switch it
to a new, contiguous pte page. So the early fixmap introduces a discontig
page table right in the middle of the kmap atomic fixmaps.

Commenting out the eaarly fixmap setup in head_32.S gets everything working
properly. What would be the best way to fix this? Could we put the early
fixmap page table in initdata, and then have the fixmap setup proper first
clear its corresponding pmd entry?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-18 21:15 early fixmap causes kmap breakage Nick Piggin
@ 2008-12-29 23:17 ` Andrew Morton
  2008-12-30  4:01   ` Nick Piggin
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2008-12-29 23:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: ebiederm, linux-kernel, tglx, mingo, ijc

On Thu, 18 Dec 2008 22:15:43 +0100
Nick Piggin <npiggin@suse.de> wrote:

> Hi,
> 
> I've debugged a problem where i386+pae systems with more than a few CPUs
> blow up at boot in the kmap_atomic code.

ping?

> The problem is that the kmap_atomic pte pages all need to be contiguous
> memory because the pte is calculated via the first kmap pte page + an
> offset (so as not to have to walk the page tables every time).
> 
> The fixmap setup code crudely allocates contiguous pte pages, which is fine,
> but if it finds an already populated pmd entry, then it will not switch it
> to a new, contiguous pte page. So the early fixmap introduces a discontig
> page table right in the middle of the kmap atomic fixmaps.
> 
> Commenting out the eaarly fixmap setup in head_32.S gets everything working
> properly. What would be the best way to fix this? Could we put the early
> fixmap page table in initdata, and then have the fixmap setup proper first
> clear its corresponding pmd entry?

How come users/testers aren't reporting this?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-29 23:17 ` Andrew Morton
@ 2008-12-30  4:01   ` Nick Piggin
  2008-12-30  6:13     ` Ingo Molnar
  2008-12-30  6:22     ` Eric W. Biederman
  0 siblings, 2 replies; 14+ messages in thread
From: Nick Piggin @ 2008-12-30  4:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ebiederm, linux-kernel, tglx, mingo, ijc

On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote:
> On Thu, 18 Dec 2008 22:15:43 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > Hi,
> > 
> > I've debugged a problem where i386+pae systems with more than a few CPUs
> > blow up at boot in the kmap_atomic code.
> 
> ping?

No further progress here, I'm waiting on input for how to fix this
"nicely". Meantime, clearing the early fixmap pte I guess works, but
you lose a page... is it possible to put it into .initdata or is
there some issue with that? (I guess on a PAE kernel, 4K isn't a
big deal).

 
> > The problem is that the kmap_atomic pte pages all need to be contiguous
> > memory because the pte is calculated via the first kmap pte page + an
> > offset (so as not to have to walk the page tables every time).
> > 
> > The fixmap setup code crudely allocates contiguous pte pages, which is fine,
> > but if it finds an already populated pmd entry, then it will not switch it
> > to a new, contiguous pte page. So the early fixmap introduces a discontig
> > page table right in the middle of the kmap atomic fixmaps.
> > 
> > Commenting out the eaarly fixmap setup in head_32.S gets everything working
> > properly. What would be the best way to fix this? Could we put the early
> > fixmap page table in initdata, and then have the fixmap setup proper first
> > clear its corresponding pmd entry?
> 
> How come users/testers aren't reporting this?

Because apparently nobody tests 32-bit PAE systems with more than a couple
of CPUs anymore. This bug comes from HW vendor doing testing of SLES11.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-30  4:01   ` Nick Piggin
@ 2008-12-30  6:13     ` Ingo Molnar
  2008-12-30  7:54       ` Nick Piggin
  2008-12-30 10:28       ` Nick Piggin
  2008-12-30  6:22     ` Eric W. Biederman
  1 sibling, 2 replies; 14+ messages in thread
From: Ingo Molnar @ 2008-12-30  6:13 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, ebiederm, linux-kernel, tglx, ijc


* Nick Piggin <npiggin@suse.de> wrote:

> On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote:
> > On Thu, 18 Dec 2008 22:15:43 +0100
> > Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > Hi,
> > > 
> > > I've debugged a problem where i386+pae systems with more than a few CPUs
> > > blow up at boot in the kmap_atomic code.
> > 
> > ping?
> 
> No further progress here, I'm waiting on input for how to fix this 
> "nicely". Meantime, clearing the early fixmap pte I guess works, but you 
> lose a page... is it possible to put it into .initdata or is there some 
> issue with that? (I guess on a PAE kernel, 4K isn't a big deal).

yeah, 4K shouldnt be a big deal. Mind sending a patch for this?

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-30  6:13     ` Ingo Molnar
@ 2008-12-30  7:54       ` Nick Piggin
  2008-12-30  8:14         ` Nick Piggin
  2008-12-30 10:28       ` Nick Piggin
  1 sibling, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2008-12-30  7:54 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, ebiederm, linux-kernel, tglx, ijc

On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote:
> > > On Thu, 18 Dec 2008 22:15:43 +0100
> > > Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > Hi,
> > > > 
> > > > I've debugged a problem where i386+pae systems with more than a few CPUs
> > > > blow up at boot in the kmap_atomic code.
> > > 
> > > ping?
> > 
> > No further progress here, I'm waiting on input for how to fix this 
> > "nicely". Meantime, clearing the early fixmap pte I guess works, but you 
> > lose a page... is it possible to put it into .initdata or is there some 
> > issue with that? (I guess on a PAE kernel, 4K isn't a big deal).
> 
> yeah, 4K shouldnt be a big deal. Mind sending a patch for this?

I don't know if we should put a warning when we encounter this condition,
to help catch other unintended uses...

But this at least clears out the early fixmap pte and provides a contiguous
allocation...

---
 arch/x86/mm/init_32.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c
+++ linux-2.6/arch/x86/mm/init_32.c
@@ -155,6 +155,7 @@ page_table_range_init(unsigned long star
 	unsigned long vaddr;
 	pgd_t *pgd;
 	pmd_t *pmd;
+	pte_t *lastpte = NULL;
 
 	vaddr = start;
 	pgd_idx = pgd_index(vaddr);
@@ -166,7 +167,19 @@ page_table_range_init(unsigned long star
 		pmd = pmd + pmd_index(vaddr);
 		for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end);
 							pmd++, pmd_idx++) {
-			one_page_table_init(pmd);
+			pte_t *pte;
+
+			pte = one_page_table_init(pmd);
+			if (lastpte && lastpte + PTRS_PER_PTE != pte) {
+				pte_t *newpte;
+
+				pmd_clear(pmd);
+				newpte = one_page_table_init(pmd);
+				BUG_ON(lastpte + PTRS_PER_PTE != newpte);
+				memset(newpte, pte, PAGE_SIZE);
+				pte = lastpte;
+			}
+			lastpte = pte;
 
 			vaddr += PMD_SIZE;
 		}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-30  7:54       ` Nick Piggin
@ 2008-12-30  8:14         ` Nick Piggin
  0 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2008-12-30  8:14 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, ebiederm, linux-kernel, tglx, ijc

On Tue, Dec 30, 2008 at 08:54:28AM +0100, Nick Piggin wrote:
> On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote:
> > 
> > * Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote:
> > > > On Thu, 18 Dec 2008 22:15:43 +0100
> > > > Nick Piggin <npiggin@suse.de> wrote:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > I've debugged a problem where i386+pae systems with more than a few CPUs
> > > > > blow up at boot in the kmap_atomic code.
> > > > 
> > > > ping?
> > > 
> > > No further progress here, I'm waiting on input for how to fix this 
> > > "nicely". Meantime, clearing the early fixmap pte I guess works, but you 
> > > lose a page... is it possible to put it into .initdata or is there some 
> > > issue with that? (I guess on a PAE kernel, 4K isn't a big deal).
> > 
> > yeah, 4K shouldnt be a big deal. Mind sending a patch for this?
> 
> I don't know if we should put a warning when we encounter this condition,
> to help catch other unintended uses...
> 
> But this at least clears out the early fixmap pte and provides a contiguous
> allocation...
> 
> ---
>  arch/x86/mm/init_32.c |   15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/arch/x86/mm/init_32.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_32.c
> +++ linux-2.6/arch/x86/mm/init_32.c
> @@ -155,6 +155,7 @@ page_table_range_init(unsigned long star
>  	unsigned long vaddr;
>  	pgd_t *pgd;
>  	pmd_t *pmd;
> +	pte_t *lastpte = NULL;
>  
>  	vaddr = start;
>  	pgd_idx = pgd_index(vaddr);
> @@ -166,7 +167,19 @@ page_table_range_init(unsigned long star
>  		pmd = pmd + pmd_index(vaddr);
>  		for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end);
>  							pmd++, pmd_idx++) {
> -			one_page_table_init(pmd);
> +			pte_t *pte;
> +
> +			pte = one_page_table_init(pmd);
> +			if (lastpte && lastpte + PTRS_PER_PTE != pte) {
> +				pte_t *newpte;
> +
> +				pmd_clear(pmd);

Hmm, no actually we do need a TLB flush here.

> +				newpte = one_page_table_init(pmd);
> +				BUG_ON(lastpte + PTRS_PER_PTE != newpte);
> +				memset(newpte, pte, PAGE_SIZE);

And here we probably need a loop doing set_pte just in case the
CPU speculatively loads an incomplete TLB...

> +				pte = lastpte;
> +			}
> +			lastpte = pte;
>  
>  			vaddr += PMD_SIZE;
>  		}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-30  6:13     ` Ingo Molnar
  2008-12-30  7:54       ` Nick Piggin
@ 2008-12-30 10:28       ` Nick Piggin
  2008-12-30 22:41         ` Eric W. Biederman
  1 sibling, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2008-12-30 10:28 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, ebiederm, linux-kernel, tglx, ijc

On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote:
> > > On Thu, 18 Dec 2008 22:15:43 +0100
> > > Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > Hi,
> > > > 
> > > > I've debugged a problem where i386+pae systems with more than a few CPUs
> > > > blow up at boot in the kmap_atomic code.
> > > 
> > > ping?
> > 
> > No further progress here, I'm waiting on input for how to fix this 
> > "nicely". Meantime, clearing the early fixmap pte I guess works, but you 
> > lose a page... is it possible to put it into .initdata or is there some 
> > issue with that? (I guess on a PAE kernel, 4K isn't a big deal).
> 
> yeah, 4K shouldnt be a big deal. Mind sending a patch for this?

How's this?
--

The early fixmap pmd entry inserted at the very top of the KVA is casing the
subsequent fixmap mapping code to not provide physically linear pte pages over
the kmap atomic portion of the fixmap (which relies on said property to calculate
pte address).

This has caused weird boot failures in kmap_atomic much later in the boot
process (initial userspace faults) on a 32-bit PAE system with a larger number
of CPUs (smaller CPU counts tend not to run over into the next page so don't
show up the problem).

Solve this by attempting to clear out the page table, and copy any of its
entries to the new one. Also, add a bug if a nonlinear condition is encounted
and can't be resolved, which might save some hours of debugging if this fragile
scheme ever breaks again...

Putting swapper_pg_fixmap into initdata is an exercise left for the reviewer...

Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 arch/x86/mm/init_32.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c
+++ linux-2.6/arch/x86/mm/init_32.c
@@ -155,6 +155,7 @@ page_table_range_init(unsigned long star
 	unsigned long vaddr;
 	pgd_t *pgd;
 	pmd_t *pmd;
+	pte_t *lastpte = NULL;
 
 	vaddr = start;
 	pgd_idx = pgd_index(vaddr);
@@ -166,7 +167,30 @@ page_table_range_init(unsigned long star
 		pmd = pmd + pmd_index(vaddr);
 		for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end);
 							pmd++, pmd_idx++) {
-			one_page_table_init(pmd);
+			pte_t *pte;
+
+			pte = one_page_table_init(pmd);
+			/*
+			 * Something (early fixmap) has already put a pte page
+			 * here, which causes the page table allocation to
+			 * become nonlinear. Attempt to fix it, and if it is
+			 * still nonlinear then we have to bug.
+			 */
+			if (lastpte && lastpte + PTRS_PER_PTE != pte) {
+				pte_t *newpte;
+				int i;
+
+				pmd_clear(pmd);
+				__flush_tlb_all();
+
+				newpte = one_page_table_init(pmd);
+				BUG_ON(lastpte + PTRS_PER_PTE != newpte);
+				for (i = 0; i < PTRS_PER_PTE; i++) {
+					set_pte(newpte + i, pte_val(*(pte + i)));
+				}
+				pte = lastpte;
+			}
+			lastpte = pte;
 
 			vaddr += PMD_SIZE;
 		}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-30 10:28       ` Nick Piggin
@ 2008-12-30 22:41         ` Eric W. Biederman
  2008-12-31  1:54           ` Nick Piggin
  0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2008-12-30 22:41 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, Andrew Morton, linux-kernel, tglx, ijc

Nick Piggin <npiggin@suse.de> writes:

> On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote:
>> 
>> * Nick Piggin <npiggin@suse.de> wrote:
>> 
>> > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote:
>> > > On Thu, 18 Dec 2008 22:15:43 +0100
>> > > Nick Piggin <npiggin@suse.de> wrote:
>> > > 
>> > > > Hi,
>> > > > 
>> > > > I've debugged a problem where i386+pae systems with more than a few CPUs
>> > > > blow up at boot in the kmap_atomic code.
>> > > 
>> > > ping?
>> > 
>> > No further progress here, I'm waiting on input for how to fix this 
>> > "nicely". Meantime, clearing the early fixmap pte I guess works, but you 
>> > lose a page... is it possible to put it into .initdata or is there some 
>> > issue with that? (I guess on a PAE kernel, 4K isn't a big deal).
>> 
>> yeah, 4K shouldnt be a big deal. Mind sending a patch for this?
>
> How's this?
> --
>
> The early fixmap pmd entry inserted at the very top of the KVA is casing the
> subsequent fixmap mapping code to not provide physically linear pte pages over
> the kmap atomic portion of the fixmap (which relies on said property to
> calculate
> pte address).
>
> This has caused weird boot failures in kmap_atomic much later in the boot
> process (initial userspace faults) on a 32-bit PAE system with a larger number
> of CPUs (smaller CPU counts tend not to run over into the next page so don't
> show up the problem).
atomic>
> Solve this by attempting to clear out the page table, and copy any of its
> entries to the new one. Also, add a bug if a nonlinear condition is encounted
> and can't be resolved, which might save some hours of debugging if this fragile
> scheme ever breaks again...
>
> Putting swapper_pg_fixmap into initdata is an exercise left for the reviewer...

Ok.  I see what is going on now.  We have exceeded 512 fixmap entries, causing
the fixmap entries to consume more than 2MB of the address space.   Which broke
the assumption that the fixmap entries are all contiguous.

Ditching the swapper_pg_fixmap has some problems.

This appears to break early_printk to a usb debug port, which calls
set_fixmap_nocache and expects the mapping to last.

This looks like it will have problems with Xen and other environments
where we come in with a pre-populated page table, possibly unmapping
something important.

one_page_table_init relies on alloc_bootmem_low_pages for it's memory allocation
so we do not have a guarantee that we will have contiguous memory even without
this.

I see three ways we can address this.
- Grow swapper_pg_fixmap to cover the entire fixmap range.
  This trivially and without problems gives an atomic guarantee,
  and should allow removal of code that sets up the fixmaps later
  in C, except in weird cases like Xen.

- Decide it is worth optimizing kmap_atomic_prot some more.
  Have a kmap_pte per cpu.
  Cache line align the kmap pte entries so we don't get conflicts
  per cpu, at which point we should be guaranteed the all 13 of
  them will be physically contiguous.

- Not support more than 32 cpus on x86_32.

I suspect it might even be worth writing a version of one_page_table_init
that would guarantee discontiguous pages.  So we can flush out these
kinds of fragile assumptions.  

Eric


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-30 22:41         ` Eric W. Biederman
@ 2008-12-31  1:54           ` Nick Piggin
  2008-12-31  9:01             ` Eric W. Biederman
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2008-12-31  1:54 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ingo Molnar, Andrew Morton, linux-kernel, tglx, ijc

On Tue, Dec 30, 2008 at 02:41:53PM -0800, Eric W. Biederman wrote:
> Nick Piggin <npiggin@suse.de> writes:
> 
> > On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote:
> >> 
> >> * Nick Piggin <npiggin@suse.de> wrote:
> >> 
> >> > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote:
> >> > > On Thu, 18 Dec 2008 22:15:43 +0100
> >> > > Nick Piggin <npiggin@suse.de> wrote:
> >> > > 
> >> > > > Hi,
> >> > > > 
> >> > > > I've debugged a problem where i386+pae systems with more than a few CPUs
> >> > > > blow up at boot in the kmap_atomic code.
> >> > > 
> >> > > ping?
> >> > 
> >> > No further progress here, I'm waiting on input for how to fix this 
> >> > "nicely". Meantime, clearing the early fixmap pte I guess works, but you 
> >> > lose a page... is it possible to put it into .initdata or is there some 
> >> > issue with that? (I guess on a PAE kernel, 4K isn't a big deal).
> >> 
> >> yeah, 4K shouldnt be a big deal. Mind sending a patch for this?
> >
> > How's this?
> > --
> >
> > The early fixmap pmd entry inserted at the very top of the KVA is casing the
> > subsequent fixmap mapping code to not provide physically linear pte pages over
> > the kmap atomic portion of the fixmap (which relies on said property to
> > calculate
> > pte address).
> >
> > This has caused weird boot failures in kmap_atomic much later in the boot
> > process (initial userspace faults) on a 32-bit PAE system with a larger number
> > of CPUs (smaller CPU counts tend not to run over into the next page so don't
> > show up the problem).
> atomic>
> > Solve this by attempting to clear out the page table, and copy any of its
> > entries to the new one. Also, add a bug if a nonlinear condition is encounted
> > and can't be resolved, which might save some hours of debugging if this fragile
> > scheme ever breaks again...
> >
> > Putting swapper_pg_fixmap into initdata is an exercise left for the reviewer...
> 
> Ok.  I see what is going on now.  We have exceeded 512 fixmap entries, causing
> the fixmap entries to consume more than 2MB of the address space.   Which broke
> the assumption that the fixmap entries are all contiguous.

Yes. That wasn't obvious from my problem description?

 
> Ditching the swapper_pg_fixmap has some problems.
> 
> This appears to break early_printk to a usb debug port, which calls
> set_fixmap_nocache and expects the mapping to last.
> 
> This looks like it will have problems with Xen and other environments
> where we come in with a pre-populated page table, possibly unmapping
> something important.

My patch copies the early fixmap mappings to the new page table. Isn't
this enough?
 
> one_page_table_init relies on alloc_bootmem_low_pages for it's memory allocation
> so we do not have a guarantee that we will have contiguous memory even without
> this.

It's OK, if fragile, in early boot where it uses alloc_low_page.

 
> I see three ways we can address this.
> - Grow swapper_pg_fixmap to cover the entire fixmap range.
>   This trivially and without problems gives an atomic guarantee,
>   and should allow removal of code that sets up the fixmaps later
>   in C, except in weird cases like Xen.

Would be fine by me, although I want to get a minimal patch working
in the meantime if that is going to be complex.

 
> - Decide it is worth optimizing kmap_atomic_prot some more.
>   Have a kmap_pte per cpu.
>   Cache line align the kmap pte entries so we don't get conflicts
>   per cpu, at which point we should be guaranteed the all 13 of
>   them will be physically contiguous.

Go wild if you'd like to spend time optimising x86 PAE ;)

 
> - Not support more than 32 cpus on x86_32.
 
This is not an option really. Anyway, it's not more than 32 CPUs, but
can be a problem with as few as about 8 depending on config.


> I suspect it might even be worth writing a version of one_page_table_init
> that would guarantee discontiguous pages.  So we can flush out these
> kinds of fragile assumptions.  

So did you actually see anything wrong with my last patch which catches
discontiguous page tables and copies over the early fixmap translations?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-31  1:54           ` Nick Piggin
@ 2008-12-31  9:01             ` Eric W. Biederman
  2008-12-31  9:33               ` Nick Piggin
  0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2008-12-31  9:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, Andrew Morton, linux-kernel, tglx, ijc

Nick Piggin <npiggin@suse.de> writes:

> On Tue, Dec 30, 2008 at 02:41:53PM -0800, Eric W. Biederman wrote:
>> Nick Piggin <npiggin@suse.de> writes:
>> 
>> > On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote:
>> >> 
>> >> * Nick Piggin <npiggin@suse.de> wrote:
>> >> 
>> >> > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote:
>> >> > > On Thu, 18 Dec 2008 22:15:43 +0100
>> >> > > Nick Piggin <npiggin@suse.de> wrote:
>> >> > > 
>> >> > > > Hi,
>> >> > > > 
>> >> > > > I've debugged a problem where i386+pae systems with more than a few
> CPUs
>> >> > > > blow up at boot in the kmap_atomic code.
>> >> > > 
>> >> > > ping?
>> >> > 
>> >> > No further progress here, I'm waiting on input for how to fix this 
>> >> > "nicely". Meantime, clearing the early fixmap pte I guess works, but you
>> >> > lose a page... is it possible to put it into .initdata or is there some 
>> >> > issue with that? (I guess on a PAE kernel, 4K isn't a big deal).
>> >> 
>> >> yeah, 4K shouldnt be a big deal. Mind sending a patch for this?
>> >
>> > How's this?
>> > --
>> >
>> > The early fixmap pmd entry inserted at the very top of the KVA is casing the
>> > subsequent fixmap mapping code to not provide physically linear pte pages
> over
>> > the kmap atomic portion of the fixmap (which relies on said property to
>> > calculate
>> > pte address).
>> >
>> > This has caused weird boot failures in kmap_atomic much later in the boot
>> > process (initial userspace faults) on a 32-bit PAE system with a larger
> number
>> > of CPUs (smaller CPU counts tend not to run over into the next page so don't
>> > show up the problem).
>> atomic>
>> > Solve this by attempting to clear out the page table, and copy any of its
>> > entries to the new one. Also, add a bug if a nonlinear condition is
> encounted
>> > and can't be resolved, which might save some hours of debugging if this
> fragile
>> > scheme ever breaks again...
>> >
>> > Putting swapper_pg_fixmap into initdata is an exercise left for the
> reviewer...
>> 
>> Ok.  I see what is going on now.  We have exceeded 512 fixmap entries, causing
>> the fixmap entries to consume more than 2MB of the address space.  Which broke
>> the assumption that the fixmap entries are all contiguous.
>
> Yes. That wasn't obvious from my problem description?

Not quite.  I was the change to one_page_table_init in the patch that made
it clear to me.  If you have been working the code for a while or come at
it from the proper angle it is, obvious.  I haven't looked at it in a bit
and came at it from the wrong angle initially. 

>> Ditching the swapper_pg_fixmap has some problems.
>> 
>> This appears to break early_printk to a usb debug port, which calls
>> set_fixmap_nocache and expects the mapping to last.
>> 
>> This looks like it will have problems with Xen and other environments
>> where we come in with a pre-populated page table, possibly unmapping
>> something important.
>
> My patch copies the early fixmap mappings to the new page table. Isn't
> this enough?

I'm not certain.  Xen can get really finicky if you mess with it's
pre-established mappings.  Which is one of the reasons why we preserve
the current mappings.  It was a real struggle to find a solution that
worked for everything last time we had a PAE problem, and apparently
we failed to test with large numbers of cpus in the config.

So please excuse me if I'm leery of a quick fix, to code that is obviously
insufficiently clean to do what needs to be done intentionally.

I also don't see the urgency as this problem has existed for long enough
that I have had time to page out the original conversation, so please
let's take the time to do this cleanly.

>> one_page_table_init relies on alloc_bootmem_low_pages for it's memory
> allocation
>> so we do not have a guarantee that we will have contiguous memory even without
>> this.
>
> It's OK, if fragile, in early boot where it uses alloc_low_page.

Which accounts for early_ioremap_init but not permanent_kmaps_init
which has the same constraint, and also calls page_table_range_init,
but does so after the bootmem allocator has been initialized.  So
no we can't rely on the simplicity of alloc_low_page to guarantee
us a contiguous chunk of ptes. 

>> I see three ways we can address this.
>> - Grow swapper_pg_fixmap to cover the entire fixmap range.
>>   This trivially and without problems gives an atomic guarantee,
>>   and should allow removal of code that sets up the fixmaps later
>>   in C, except in weird cases like Xen.
>
> Would be fine by me, although I want to get a minimal patch working
> in the meantime if that is going to be complex.

No.  It should be just a matter of passing the number of pages
we need to initialize in a form that assembly can use and upgrading
the current code to a loop.

Unless Xen and co, rely on this code to add kmaps to their initial
page table which not that I think about it I expect they do.

> Go wild if you'd like to spend time optimising x86 PAE ;)

Nit: It is x86 highmem not x86 PAE.

>> - Not support more than 32 cpus on x86_32.
>  
> This is not an option really. Anyway, it's not more than 32 CPUs, but
> can be a problem with as few as about 8 depending on config.

That sounds like a lot of IOAPICs.  Can you really see this with only
8 cpus??

>> I suspect it might even be worth writing a version of one_page_table_init
>> that would guarantee discontiguous pages.  So we can flush out these
>> kinds of fragile assumptions.  
>
> So did you actually see anything wrong with my last patch which catches
> discontiguous page tables and copies over the early fixmap translations?

Sorry I missed the copy of the ptes in my quest to put what you were saying
in perspective.

Real concerns:

We are not always using alloc_low_page() so we have an allocator that
does not guarantee the property we are seeking.

- I don't see what prevents us from hitting a kernel page table entry
  and unmapping ourselves while we are in the middle of this.

  Minor but very important that we verify.

- I don't see (if the end mappings are the same) why we need to flush
  the tlb.

- We have a lot of stale comments that you are not updating.

- page_table_range_init does not explicitly do what you are changing it to
  guarantee in your patch.  So I think we are taking a weird accidental dependency
  and applying a bit more duck tape instead of fixing this cleanly, which
  means some other change is likely to trigger this in another way.

On the other hand it does look like fixing page_table_range_init so that
it cleanly guarantees what we actually expect from it, is the way to go.
It is in C so it is readable and it is it can be used to fix up any
preexisting page tables form Xen and elsewhere.

Stale comments that I found at a quick look:
mm/init_32.c:
>  * In general, pagetable_init() assumes that the pagetable may already
>  * be partially populated, and so it avoids stomping on any existing
>  * mappings.

asm/highmem.h:
 > /*
 >  * Right now we initialize only a single pte table. It can be extended
 >  * easily, subsequent pte tables have to be allocated in one physical
 >  * chunk of RAM.
 >  */

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-31  9:01             ` Eric W. Biederman
@ 2008-12-31  9:33               ` Nick Piggin
  2009-01-09 10:24                 ` Ian Campbell
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2008-12-31  9:33 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ingo Molnar, Andrew Morton, linux-kernel, tglx, ijc

On Wed, Dec 31, 2008 at 01:01:27AM -0800, Eric W. Biederman wrote:
> Nick Piggin <npiggin@suse.de> writes:
> >> the assumption that the fixmap entries are all contiguous.
> >
> > Yes. That wasn't obvious from my problem description?
> 
> Not quite.  I was the change to one_page_table_init in the patch that made
> it clear to me.  If you have been working the code for a while or come at
> it from the proper angle it is, obvious.  I haven't looked at it in a bit
> and came at it from the wrong angle initially. 

OK. I'll try to improve it.

 
> >> Ditching the swapper_pg_fixmap has some problems.
> >> 
> >> This appears to break early_printk to a usb debug port, which calls
> >> set_fixmap_nocache and expects the mapping to last.
> >> 
> >> This looks like it will have problems with Xen and other environments
> >> where we come in with a pre-populated page table, possibly unmapping
> >> something important.
> >
> > My patch copies the early fixmap mappings to the new page table. Isn't
> > this enough?
> 
> I'm not certain.  Xen can get really finicky if you mess with it's
> pre-established mappings.  Which is one of the reasons why we preserve
> the current mappings.  It was a real struggle to find a solution that
> worked for everything last time we had a PAE problem, and apparently
> we failed to test with large numbers of cpus in the config.
> 
> So please excuse me if I'm leery of a quick fix, to code that is obviously
> insufficiently clean to do what needs to be done intentionally.

Well we should just ask the Xen people whether it will work or not. Why
wouldn't it? Xen has to be able to cope with kernel mappings coming and
going.


> I also don't see the urgency as this problem has existed for long enough
> that I have had time to page out the original conversation, so please
> let's take the time to do this cleanly.

Just to have a straightforward fix for stable kernels I guess.

 
> >> one_page_table_init relies on alloc_bootmem_low_pages for it's memory
> > allocation
> >> so we do not have a guarantee that we will have contiguous memory even without
> >> this.
> >
> > It's OK, if fragile, in early boot where it uses alloc_low_page.
> 
> Which accounts for early_ioremap_init but not permanent_kmaps_init
> which has the same constraint, and also calls page_table_range_init,
> but does so after the bootmem allocator has been initialized.  So
> no we can't rely on the simplicity of alloc_low_page to guarantee
> us a contiguous chunk of ptes. 

Sure, but I have added a check for that so it will panic if
there is a problem. My patch hasn't made things in that department
more fragile, put it that way.

> >> I see three ways we can address this.
> >> - Grow swapper_pg_fixmap to cover the entire fixmap range.
> >>   This trivially and without problems gives an atomic guarantee,
> >>   and should allow removal of code that sets up the fixmaps later
> >>   in C, except in weird cases like Xen.
> >
> > Would be fine by me, although I want to get a minimal patch working
> > in the meantime if that is going to be complex.
> 
> No.  It should be just a matter of passing the number of pages
> we need to initialize in a form that assembly can use and upgrading
> the current code to a loop.
> 
> Unless Xen and co, rely on this code to add kmaps to their initial
> page table which not that I think about it I expect they do.
> 
> > Go wild if you'd like to spend time optimising x86 PAE ;)
> 
> Nit: It is x86 highmem not x86 PAE.
> 
> >> - Not support more than 32 cpus on x86_32.
> >  
> > This is not an option really. Anyway, it's not more than 32 CPUs, but
> > can be a problem with as few as about 8 depending on config.
> 
> That sounds like a lot of IOAPICs.  Can you really see this with only
> 8 cpus??

With highmem debugging, yes I think I saw a kmap bug come from CPU7.

 
> >> I suspect it might even be worth writing a version of one_page_table_init
> >> that would guarantee discontiguous pages.  So we can flush out these
> >> kinds of fragile assumptions.  
> >
> > So did you actually see anything wrong with my last patch which catches
> > discontiguous page tables and copies over the early fixmap translations?
> 
> Sorry I missed the copy of the ptes in my quest to put what you were saying
> in perspective.
> 
> Real concerns:
> 
> We are not always using alloc_low_page() so we have an allocator that
> does not guarantee the property we are seeking.
> 
> - I don't see what prevents us from hitting a kernel page table entry
>   and unmapping ourselves while we are in the middle of this.
> 
>   Minor but very important that we verify.

Because it's in the pkmap area?

 
> - I don't see (if the end mappings are the same) why we need to flush
>   the tlb.

Hmm... x86 CPUs can cache higher level page table translations in
non-cache-coherent buffers, so they could continue to look at the
old page table page after we insert new mappings int othe new one,
thus missing them and faulting. I don't know whether that fault is
going to be guaranteed to flush those higher level translations,
but I don't want to risk it. It's not performance critical.


> - We have a lot of stale comments that you are not updating.

Which? (ah, I see below)

> - page_table_range_init does not explicitly do what you are changing it to
>   guarantee in your patch.  So I think we are taking a weird accidental dependency
>   and applying a bit more duck tape instead of fixing this cleanly, which
>   means some other change is likely to trigger this in another way.

What do you mean? Contiguity of pte pages is guaranteed after my patch.
If there is anything else it is supposed to do but doesn't, then that's
for another patch?

 
> On the other hand it does look like fixing page_table_range_init so that
> it cleanly guarantees what we actually expect from it, is the way to go.
> It is in C so it is readable and it is it can be used to fix up any
> preexisting page tables form Xen and elsewhere.
> 
> Stale comments that I found at a quick look:
> mm/init_32.c:
> >  * In general, pagetable_init() assumes that the pagetable may already
> >  * be partially populated, and so it avoids stomping on any existing
> >  * mappings.
> 
> asm/highmem.h:
>  > /*
>  >  * Right now we initialize only a single pte table. It can be extended
>  >  * easily, subsequent pte tables have to be allocated in one physical
>  >  * chunk of RAM.
>  >  */
> 
> Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-31  9:33               ` Nick Piggin
@ 2009-01-09 10:24                 ` Ian Campbell
  0 siblings, 0 replies; 14+ messages in thread
From: Ian Campbell @ 2009-01-09 10:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric W. Biederman, Ingo Molnar, Andrew Morton, linux-kernel, tglx,
	Jeremy Fitzhardinge

Adding Jeremy to CC

On Wed, 2008-12-31 at 10:33 +0100, Nick Piggin wrote:
> 
> > >> Ditching the swapper_pg_fixmap has some problems.
> > >> 
> > >> This appears to break early_printk to a usb debug port, which 
> > >> calls set_fixmap_nocache and expects the mapping to last.
> > >> 
> > >> This looks like it will have problems with Xen and other 
> > >> environments where we come in with a pre-populated page table,
> > >> possibly unmapping something important.
> > >
> > > My patch copies the early fixmap mappings to the new page table.

> > > Isn't this enough?
> > 
> > I'm not certain.  Xen can get really finicky if you mess with it's
> > pre-established mappings.  Which is one of the reasons why we 
> > preserve the current mappings.  It was a real struggle to find a 
> > solution that worked for everything last time we had a PAE problem,
> > and apparently we failed to test with large numbers of cpus in the
> > config.
> > 
> > So please excuse me if I'm leery of a quick fix, to code that is 
> > obviously insufficiently clean to do what needs to be done 
> > intentionally.
> 
> Well we should just ask the Xen people whether it will work or not.
> Why wouldn't it? Xen has to be able to cope with kernel mappings
> coming and going.

I think it should be fine from Xen's PoV since one_page_table_init()
correctly uses paravirt_alloc_pte() and set_pte() does the right thing
under Xen.

BTW, I needed this fixlet since set_pte takes a pte_t not a pteval_t,
otherwise:
  CC      arch/x86/mm/init_32.o
[...]arch/x86/mm/init_32.c: In function 'page_table_range_init':
[...]arch/x86/mm/init_32.c:188: error: incompatible type for argument 2 of 'set_pte'

diff -r cf59aba1694f arch/x86/mm/init_32.c
--- a/arch/x86/mm/init_32.c	Fri Jan 09 10:13:31 2009 +0000
+++ b/arch/x86/mm/init_32.c	Fri Jan 09 10:16:38 2009 +0000
@@ -185,7 +185,7 @@
 				newpte = one_page_table_init(pmd);
 				BUG_ON(lastpte + PTRS_PER_PTE != newpte);
 				for (i = 0; i < PTRS_PER_PTE; i++) {
-					set_pte(newpte + i, pte_val(*(pte + i)));
+					set_pte(newpte + i, *(pte + i));
 				}
 				pte = lastpte;
 			}

-- 
Ian Campbell
Current Noise: Firebird - Needle In The Groove

"Now we'll have to kill you."
		-- Linus Torvalds


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-30  4:01   ` Nick Piggin
  2008-12-30  6:13     ` Ingo Molnar
@ 2008-12-30  6:22     ` Eric W. Biederman
  2008-12-30  6:35       ` Nick Piggin
  1 sibling, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2008-12-30  6:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, tglx, mingo, ijc

Nick Piggin <npiggin@suse.de> writes:

> On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote:
>> On Thu, 18 Dec 2008 22:15:43 +0100
>> Nick Piggin <npiggin@suse.de> wrote:
>> 
>> > Hi,
>> > 
>> > I've debugged a problem where i386+pae systems with more than a few CPUs
>> > blow up at boot in the kmap_atomic code.
>> 
>> ping?
>
> No further progress here, I'm waiting on input for how to fix this
> "nicely". Meantime, clearing the early fixmap pte I guess works, but
> you lose a page... is it possible to put it into .initdata or is
> there some issue with that? (I guess on a PAE kernel, 4K isn't a
> big deal).
>
>  
>> > The problem is that the kmap_atomic pte pages all need to be contiguous
>> > memory because the pte is calculated via the first kmap pte page + an
>> > offset (so as not to have to walk the page tables every time).
>> > 
>> > The fixmap setup code crudely allocates contiguous pte pages, which is fine,
>> > but if it finds an already populated pmd entry, then it will not switch it
>> > to a new, contiguous pte page. So the early fixmap introduces a discontig
>> > page table right in the middle of the kmap atomic fixmaps.

Where is this?

>> > Commenting out the eaarly fixmap setup in head_32.S gets everything working
>> > properly. What would be the best way to fix this? Could we put the early
>> > fixmap page table in initdata, and then have the fixmap setup proper first
>> > clear its corresponding pmd entry?

Why would we want or need to?

>> How come users/testers aren't reporting this?
>
> Because apparently nobody tests 32-bit PAE systems with more than a couple
> of CPUs anymore. This bug comes from HW vendor doing testing of SLES11.

Hmm.

I have taken a quick skim and I am not seeming the part of the code you are
talking about.  Is the problem code in mainline?

I'm guessing it has something to do with reserve_top_address() being called
with a bad value in the normal case.  But I don't see it being called
at all in the normal case.

Eric


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: early fixmap causes kmap breakage
  2008-12-30  6:22     ` Eric W. Biederman
@ 2008-12-30  6:35       ` Nick Piggin
  0 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2008-12-30  6:35 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel, tglx, mingo, ijc

On Mon, Dec 29, 2008 at 10:22:12PM -0800, Eric W. Biederman wrote:
> Nick Piggin <npiggin@suse.de> writes:
> 
> > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote:
> >> On Thu, 18 Dec 2008 22:15:43 +0100
> >> Nick Piggin <npiggin@suse.de> wrote:
> >> 
> >> > Hi,
> >> > 
> >> > I've debugged a problem where i386+pae systems with more than a few CPUs
> >> > blow up at boot in the kmap_atomic code.
> >> 
> >> ping?
> >
> > No further progress here, I'm waiting on input for how to fix this
> > "nicely". Meantime, clearing the early fixmap pte I guess works, but
> > you lose a page... is it possible to put it into .initdata or is
> > there some issue with that? (I guess on a PAE kernel, 4K isn't a
> > big deal).
> >
> >  
> >> > The problem is that the kmap_atomic pte pages all need to be contiguous
> >> > memory because the pte is calculated via the first kmap pte page + an
> >> > offset (so as not to have to walk the page tables every time).
> >> > 
> >> > The fixmap setup code crudely allocates contiguous pte pages, which is fine,
> >> > but if it finds an already populated pmd entry, then it will not switch it
> >> > to a new, contiguous pte page. So the early fixmap introduces a discontig
> >> > page table right in the middle of the kmap atomic fixmaps.
> 
> Where is this?

Where is what?

 
> >> > Commenting out the eaarly fixmap setup in head_32.S gets everything working
> >> > properly. What would be the best way to fix this? Could we put the early
> >> > fixmap page table in initdata, and then have the fixmap setup proper first
> >> > clear its corresponding pmd entry?
> 
> Why would we want or need to?

So the fixmap allocator allocates contiguous pagetables. Wasn't my
description clear enough?

 
> >> How come users/testers aren't reporting this?
> >
> > Because apparently nobody tests 32-bit PAE systems with more than a couple
> > of CPUs anymore. This bug comes from HW vendor doing testing of SLES11.
> 
> Hmm.
> 
> I have taken a quick skim and I am not seeming the part of the code you are
> talking about.  Is the problem code in mainline?
> 
> I'm guessing it has something to do with reserve_top_address() being called
> with a bad value in the normal case.  But I don't see it being called
> at all in the normal case.

Yes I think so. one_page_table_init will not allocate a new pte page if
one already exists, so it will not be contiguous with previous/subsequent
pages with the early fixmap pte there. page_table_range_init explains why
that is a problem (this code seems terribly fragile and prone to breakage
but I don't want to expend theeffort on it -- still, I have a patch to
panic the kernel if it breaks its guarantee of contiguous page tables,
which makes breakage easily visible).



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2009-01-09 10:27 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-18 21:15 early fixmap causes kmap breakage Nick Piggin
2008-12-29 23:17 ` Andrew Morton
2008-12-30  4:01   ` Nick Piggin
2008-12-30  6:13     ` Ingo Molnar
2008-12-30  7:54       ` Nick Piggin
2008-12-30  8:14         ` Nick Piggin
2008-12-30 10:28       ` Nick Piggin
2008-12-30 22:41         ` Eric W. Biederman
2008-12-31  1:54           ` Nick Piggin
2008-12-31  9:01             ` Eric W. Biederman
2008-12-31  9:33               ` Nick Piggin
2009-01-09 10:24                 ` Ian Campbell
2008-12-30  6:22     ` Eric W. Biederman
2008-12-30  6:35       ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox