* early fixmap causes kmap breakage @ 2008-12-18 21:15 Nick Piggin 2008-12-29 23:17 ` Andrew Morton 0 siblings, 1 reply; 14+ messages in thread From: Nick Piggin @ 2008-12-18 21:15 UTC (permalink / raw) To: Eric W. Biederman, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Ian Campbell Hi, I've debugged a problem where i386+pae systems with more than a few CPUs blow up at boot in the kmap_atomic code. The problem is that the kmap_atomic pte pages all need to be contiguous memory because the pte is calculated via the first kmap pte page + an offset (so as not to have to walk the page tables every time). The fixmap setup code crudely allocates contiguous pte pages, which is fine, but if it finds an already populated pmd entry, then it will not switch it to a new, contiguous pte page. So the early fixmap introduces a discontig page table right in the middle of the kmap atomic fixmaps. Commenting out the eaarly fixmap setup in head_32.S gets everything working properly. What would be the best way to fix this? Could we put the early fixmap page table in initdata, and then have the fixmap setup proper first clear its corresponding pmd entry? Thanks, Nick ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-18 21:15 early fixmap causes kmap breakage Nick Piggin @ 2008-12-29 23:17 ` Andrew Morton 2008-12-30 4:01 ` Nick Piggin 0 siblings, 1 reply; 14+ messages in thread From: Andrew Morton @ 2008-12-29 23:17 UTC (permalink / raw) To: Nick Piggin; +Cc: ebiederm, linux-kernel, tglx, mingo, ijc On Thu, 18 Dec 2008 22:15:43 +0100 Nick Piggin <npiggin@suse.de> wrote: > Hi, > > I've debugged a problem where i386+pae systems with more than a few CPUs > blow up at boot in the kmap_atomic code. ping? > The problem is that the kmap_atomic pte pages all need to be contiguous > memory because the pte is calculated via the first kmap pte page + an > offset (so as not to have to walk the page tables every time). > > The fixmap setup code crudely allocates contiguous pte pages, which is fine, > but if it finds an already populated pmd entry, then it will not switch it > to a new, contiguous pte page. So the early fixmap introduces a discontig > page table right in the middle of the kmap atomic fixmaps. > > Commenting out the eaarly fixmap setup in head_32.S gets everything working > properly. What would be the best way to fix this? Could we put the early > fixmap page table in initdata, and then have the fixmap setup proper first > clear its corresponding pmd entry? How come users/testers aren't reporting this? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-29 23:17 ` Andrew Morton @ 2008-12-30 4:01 ` Nick Piggin 2008-12-30 6:13 ` Ingo Molnar 2008-12-30 6:22 ` Eric W. Biederman 0 siblings, 2 replies; 14+ messages in thread From: Nick Piggin @ 2008-12-30 4:01 UTC (permalink / raw) To: Andrew Morton; +Cc: ebiederm, linux-kernel, tglx, mingo, ijc On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote: > On Thu, 18 Dec 2008 22:15:43 +0100 > Nick Piggin <npiggin@suse.de> wrote: > > > Hi, > > > > I've debugged a problem where i386+pae systems with more than a few CPUs > > blow up at boot in the kmap_atomic code. > > ping? No further progress here, I'm waiting on input for how to fix this "nicely". Meantime, clearing the early fixmap pte I guess works, but you lose a page... is it possible to put it into .initdata or is there some issue with that? (I guess on a PAE kernel, 4K isn't a big deal). > > The problem is that the kmap_atomic pte pages all need to be contiguous > > memory because the pte is calculated via the first kmap pte page + an > > offset (so as not to have to walk the page tables every time). > > > > The fixmap setup code crudely allocates contiguous pte pages, which is fine, > > but if it finds an already populated pmd entry, then it will not switch it > > to a new, contiguous pte page. So the early fixmap introduces a discontig > > page table right in the middle of the kmap atomic fixmaps. > > > > Commenting out the eaarly fixmap setup in head_32.S gets everything working > > properly. What would be the best way to fix this? Could we put the early > > fixmap page table in initdata, and then have the fixmap setup proper first > > clear its corresponding pmd entry? > > How come users/testers aren't reporting this? Because apparently nobody tests 32-bit PAE systems with more than a couple of CPUs anymore. This bug comes from HW vendor doing testing of SLES11. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-30 4:01 ` Nick Piggin @ 2008-12-30 6:13 ` Ingo Molnar 2008-12-30 7:54 ` Nick Piggin 2008-12-30 10:28 ` Nick Piggin 2008-12-30 6:22 ` Eric W. Biederman 1 sibling, 2 replies; 14+ messages in thread From: Ingo Molnar @ 2008-12-30 6:13 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, ebiederm, linux-kernel, tglx, ijc * Nick Piggin <npiggin@suse.de> wrote: > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote: > > On Thu, 18 Dec 2008 22:15:43 +0100 > > Nick Piggin <npiggin@suse.de> wrote: > > > > > Hi, > > > > > > I've debugged a problem where i386+pae systems with more than a few CPUs > > > blow up at boot in the kmap_atomic code. > > > > ping? > > No further progress here, I'm waiting on input for how to fix this > "nicely". Meantime, clearing the early fixmap pte I guess works, but you > lose a page... is it possible to put it into .initdata or is there some > issue with that? (I guess on a PAE kernel, 4K isn't a big deal). yeah, 4K shouldnt be a big deal. Mind sending a patch for this? Ingo ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-30 6:13 ` Ingo Molnar @ 2008-12-30 7:54 ` Nick Piggin 2008-12-30 8:14 ` Nick Piggin 2008-12-30 10:28 ` Nick Piggin 1 sibling, 1 reply; 14+ messages in thread From: Nick Piggin @ 2008-12-30 7:54 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, ebiederm, linux-kernel, tglx, ijc On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote: > > > On Thu, 18 Dec 2008 22:15:43 +0100 > > > Nick Piggin <npiggin@suse.de> wrote: > > > > > > > Hi, > > > > > > > > I've debugged a problem where i386+pae systems with more than a few CPUs > > > > blow up at boot in the kmap_atomic code. > > > > > > ping? > > > > No further progress here, I'm waiting on input for how to fix this > > "nicely". Meantime, clearing the early fixmap pte I guess works, but you > > lose a page... is it possible to put it into .initdata or is there some > > issue with that? (I guess on a PAE kernel, 4K isn't a big deal). > > yeah, 4K shouldnt be a big deal. Mind sending a patch for this? I don't know if we should put a warning when we encounter this condition, to help catch other unintended uses... But this at least clears out the early fixmap pte and provides a contiguous allocation... --- arch/x86/mm/init_32.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) Index: linux-2.6/arch/x86/mm/init_32.c =================================================================== --- linux-2.6.orig/arch/x86/mm/init_32.c +++ linux-2.6/arch/x86/mm/init_32.c @@ -155,6 +155,7 @@ page_table_range_init(unsigned long star unsigned long vaddr; pgd_t *pgd; pmd_t *pmd; + pte_t *lastpte = NULL; vaddr = start; pgd_idx = pgd_index(vaddr); @@ -166,7 +167,19 @@ page_table_range_init(unsigned long star pmd = pmd + pmd_index(vaddr); for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end); pmd++, pmd_idx++) { - one_page_table_init(pmd); + pte_t *pte; + + pte = one_page_table_init(pmd); + if (lastpte && lastpte + PTRS_PER_PTE != pte) { + pte_t *newpte; + + pmd_clear(pmd); + newpte = one_page_table_init(pmd); + BUG_ON(lastpte + PTRS_PER_PTE != newpte); + memset(newpte, pte, PAGE_SIZE); + pte = lastpte; + } + lastpte = pte; vaddr += PMD_SIZE; } ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-30 7:54 ` Nick Piggin @ 2008-12-30 8:14 ` Nick Piggin 0 siblings, 0 replies; 14+ messages in thread From: Nick Piggin @ 2008-12-30 8:14 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, ebiederm, linux-kernel, tglx, ijc On Tue, Dec 30, 2008 at 08:54:28AM +0100, Nick Piggin wrote: > On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote: > > > > * Nick Piggin <npiggin@suse.de> wrote: > > > > > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote: > > > > On Thu, 18 Dec 2008 22:15:43 +0100 > > > > Nick Piggin <npiggin@suse.de> wrote: > > > > > > > > > Hi, > > > > > > > > > > I've debugged a problem where i386+pae systems with more than a few CPUs > > > > > blow up at boot in the kmap_atomic code. > > > > > > > > ping? > > > > > > No further progress here, I'm waiting on input for how to fix this > > > "nicely". Meantime, clearing the early fixmap pte I guess works, but you > > > lose a page... is it possible to put it into .initdata or is there some > > > issue with that? (I guess on a PAE kernel, 4K isn't a big deal). > > > > yeah, 4K shouldnt be a big deal. Mind sending a patch for this? > > I don't know if we should put a warning when we encounter this condition, > to help catch other unintended uses... > > But this at least clears out the early fixmap pte and provides a contiguous > allocation... > > --- > arch/x86/mm/init_32.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > Index: linux-2.6/arch/x86/mm/init_32.c > =================================================================== > --- linux-2.6.orig/arch/x86/mm/init_32.c > +++ linux-2.6/arch/x86/mm/init_32.c > @@ -155,6 +155,7 @@ page_table_range_init(unsigned long star > unsigned long vaddr; > pgd_t *pgd; > pmd_t *pmd; > + pte_t *lastpte = NULL; > > vaddr = start; > pgd_idx = pgd_index(vaddr); > @@ -166,7 +167,19 @@ page_table_range_init(unsigned long star > pmd = pmd + pmd_index(vaddr); > for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end); > pmd++, pmd_idx++) { > - one_page_table_init(pmd); > + pte_t *pte; > + > + pte = one_page_table_init(pmd); > + if (lastpte && lastpte + PTRS_PER_PTE != pte) { > + pte_t *newpte; > + > + pmd_clear(pmd); Hmm, no actually we do need a TLB flush here. > + newpte = one_page_table_init(pmd); > + BUG_ON(lastpte + PTRS_PER_PTE != newpte); > + memset(newpte, pte, PAGE_SIZE); And here we probably need a loop doing set_pte just in case the CPU speculatively loads an incomplete TLB... > + pte = lastpte; > + } > + lastpte = pte; > > vaddr += PMD_SIZE; > } ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-30 6:13 ` Ingo Molnar 2008-12-30 7:54 ` Nick Piggin @ 2008-12-30 10:28 ` Nick Piggin 2008-12-30 22:41 ` Eric W. Biederman 1 sibling, 1 reply; 14+ messages in thread From: Nick Piggin @ 2008-12-30 10:28 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, ebiederm, linux-kernel, tglx, ijc On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote: > > > On Thu, 18 Dec 2008 22:15:43 +0100 > > > Nick Piggin <npiggin@suse.de> wrote: > > > > > > > Hi, > > > > > > > > I've debugged a problem where i386+pae systems with more than a few CPUs > > > > blow up at boot in the kmap_atomic code. > > > > > > ping? > > > > No further progress here, I'm waiting on input for how to fix this > > "nicely". Meantime, clearing the early fixmap pte I guess works, but you > > lose a page... is it possible to put it into .initdata or is there some > > issue with that? (I guess on a PAE kernel, 4K isn't a big deal). > > yeah, 4K shouldnt be a big deal. Mind sending a patch for this? How's this? -- The early fixmap pmd entry inserted at the very top of the KVA is casing the subsequent fixmap mapping code to not provide physically linear pte pages over the kmap atomic portion of the fixmap (which relies on said property to calculate pte address). This has caused weird boot failures in kmap_atomic much later in the boot process (initial userspace faults) on a 32-bit PAE system with a larger number of CPUs (smaller CPU counts tend not to run over into the next page so don't show up the problem). Solve this by attempting to clear out the page table, and copy any of its entries to the new one. Also, add a bug if a nonlinear condition is encounted and can't be resolved, which might save some hours of debugging if this fragile scheme ever breaks again... Putting swapper_pg_fixmap into initdata is an exercise left for the reviewer... Signed-off-by: Nick Piggin <npiggin@suse.de> --- arch/x86/mm/init_32.c | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) Index: linux-2.6/arch/x86/mm/init_32.c =================================================================== --- linux-2.6.orig/arch/x86/mm/init_32.c +++ linux-2.6/arch/x86/mm/init_32.c @@ -155,6 +155,7 @@ page_table_range_init(unsigned long star unsigned long vaddr; pgd_t *pgd; pmd_t *pmd; + pte_t *lastpte = NULL; vaddr = start; pgd_idx = pgd_index(vaddr); @@ -166,7 +167,30 @@ page_table_range_init(unsigned long star pmd = pmd + pmd_index(vaddr); for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end); pmd++, pmd_idx++) { - one_page_table_init(pmd); + pte_t *pte; + + pte = one_page_table_init(pmd); + /* + * Something (early fixmap) has already put a pte page + * here, which causes the page table allocation to + * become nonlinear. Attempt to fix it, and if it is + * still nonlinear then we have to bug. + */ + if (lastpte && lastpte + PTRS_PER_PTE != pte) { + pte_t *newpte; + int i; + + pmd_clear(pmd); + __flush_tlb_all(); + + newpte = one_page_table_init(pmd); + BUG_ON(lastpte + PTRS_PER_PTE != newpte); + for (i = 0; i < PTRS_PER_PTE; i++) { + set_pte(newpte + i, pte_val(*(pte + i))); + } + pte = lastpte; + } + lastpte = pte; vaddr += PMD_SIZE; } ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-30 10:28 ` Nick Piggin @ 2008-12-30 22:41 ` Eric W. Biederman 2008-12-31 1:54 ` Nick Piggin 0 siblings, 1 reply; 14+ messages in thread From: Eric W. Biederman @ 2008-12-30 22:41 UTC (permalink / raw) To: Nick Piggin; +Cc: Ingo Molnar, Andrew Morton, linux-kernel, tglx, ijc Nick Piggin <npiggin@suse.de> writes: > On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote: >> >> * Nick Piggin <npiggin@suse.de> wrote: >> >> > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote: >> > > On Thu, 18 Dec 2008 22:15:43 +0100 >> > > Nick Piggin <npiggin@suse.de> wrote: >> > > >> > > > Hi, >> > > > >> > > > I've debugged a problem where i386+pae systems with more than a few CPUs >> > > > blow up at boot in the kmap_atomic code. >> > > >> > > ping? >> > >> > No further progress here, I'm waiting on input for how to fix this >> > "nicely". Meantime, clearing the early fixmap pte I guess works, but you >> > lose a page... is it possible to put it into .initdata or is there some >> > issue with that? (I guess on a PAE kernel, 4K isn't a big deal). >> >> yeah, 4K shouldnt be a big deal. Mind sending a patch for this? > > How's this? > -- > > The early fixmap pmd entry inserted at the very top of the KVA is casing the > subsequent fixmap mapping code to not provide physically linear pte pages over > the kmap atomic portion of the fixmap (which relies on said property to > calculate > pte address). > > This has caused weird boot failures in kmap_atomic much later in the boot > process (initial userspace faults) on a 32-bit PAE system with a larger number > of CPUs (smaller CPU counts tend not to run over into the next page so don't > show up the problem). atomic> > Solve this by attempting to clear out the page table, and copy any of its > entries to the new one. Also, add a bug if a nonlinear condition is encounted > and can't be resolved, which might save some hours of debugging if this fragile > scheme ever breaks again... > > Putting swapper_pg_fixmap into initdata is an exercise left for the reviewer... Ok. I see what is going on now. We have exceeded 512 fixmap entries, causing the fixmap entries to consume more than 2MB of the address space. Which broke the assumption that the fixmap entries are all contiguous. Ditching the swapper_pg_fixmap has some problems. This appears to break early_printk to a usb debug port, which calls set_fixmap_nocache and expects the mapping to last. This looks like it will have problems with Xen and other environments where we come in with a pre-populated page table, possibly unmapping something important. one_page_table_init relies on alloc_bootmem_low_pages for it's memory allocation so we do not have a guarantee that we will have contiguous memory even without this. I see three ways we can address this. - Grow swapper_pg_fixmap to cover the entire fixmap range. This trivially and without problems gives an atomic guarantee, and should allow removal of code that sets up the fixmaps later in C, except in weird cases like Xen. - Decide it is worth optimizing kmap_atomic_prot some more. Have a kmap_pte per cpu. Cache line align the kmap pte entries so we don't get conflicts per cpu, at which point we should be guaranteed the all 13 of them will be physically contiguous. - Not support more than 32 cpus on x86_32. I suspect it might even be worth writing a version of one_page_table_init that would guarantee discontiguous pages. So we can flush out these kinds of fragile assumptions. Eric ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-30 22:41 ` Eric W. Biederman @ 2008-12-31 1:54 ` Nick Piggin 2008-12-31 9:01 ` Eric W. Biederman 0 siblings, 1 reply; 14+ messages in thread From: Nick Piggin @ 2008-12-31 1:54 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Ingo Molnar, Andrew Morton, linux-kernel, tglx, ijc On Tue, Dec 30, 2008 at 02:41:53PM -0800, Eric W. Biederman wrote: > Nick Piggin <npiggin@suse.de> writes: > > > On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote: > >> > >> * Nick Piggin <npiggin@suse.de> wrote: > >> > >> > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote: > >> > > On Thu, 18 Dec 2008 22:15:43 +0100 > >> > > Nick Piggin <npiggin@suse.de> wrote: > >> > > > >> > > > Hi, > >> > > > > >> > > > I've debugged a problem where i386+pae systems with more than a few CPUs > >> > > > blow up at boot in the kmap_atomic code. > >> > > > >> > > ping? > >> > > >> > No further progress here, I'm waiting on input for how to fix this > >> > "nicely". Meantime, clearing the early fixmap pte I guess works, but you > >> > lose a page... is it possible to put it into .initdata or is there some > >> > issue with that? (I guess on a PAE kernel, 4K isn't a big deal). > >> > >> yeah, 4K shouldnt be a big deal. Mind sending a patch for this? > > > > How's this? > > -- > > > > The early fixmap pmd entry inserted at the very top of the KVA is casing the > > subsequent fixmap mapping code to not provide physically linear pte pages over > > the kmap atomic portion of the fixmap (which relies on said property to > > calculate > > pte address). > > > > This has caused weird boot failures in kmap_atomic much later in the boot > > process (initial userspace faults) on a 32-bit PAE system with a larger number > > of CPUs (smaller CPU counts tend not to run over into the next page so don't > > show up the problem). > atomic> > > Solve this by attempting to clear out the page table, and copy any of its > > entries to the new one. Also, add a bug if a nonlinear condition is encounted > > and can't be resolved, which might save some hours of debugging if this fragile > > scheme ever breaks again... > > > > Putting swapper_pg_fixmap into initdata is an exercise left for the reviewer... > > Ok. I see what is going on now. We have exceeded 512 fixmap entries, causing > the fixmap entries to consume more than 2MB of the address space. Which broke > the assumption that the fixmap entries are all contiguous. Yes. That wasn't obvious from my problem description? > Ditching the swapper_pg_fixmap has some problems. > > This appears to break early_printk to a usb debug port, which calls > set_fixmap_nocache and expects the mapping to last. > > This looks like it will have problems with Xen and other environments > where we come in with a pre-populated page table, possibly unmapping > something important. My patch copies the early fixmap mappings to the new page table. Isn't this enough? > one_page_table_init relies on alloc_bootmem_low_pages for it's memory allocation > so we do not have a guarantee that we will have contiguous memory even without > this. It's OK, if fragile, in early boot where it uses alloc_low_page. > I see three ways we can address this. > - Grow swapper_pg_fixmap to cover the entire fixmap range. > This trivially and without problems gives an atomic guarantee, > and should allow removal of code that sets up the fixmaps later > in C, except in weird cases like Xen. Would be fine by me, although I want to get a minimal patch working in the meantime if that is going to be complex. > - Decide it is worth optimizing kmap_atomic_prot some more. > Have a kmap_pte per cpu. > Cache line align the kmap pte entries so we don't get conflicts > per cpu, at which point we should be guaranteed the all 13 of > them will be physically contiguous. Go wild if you'd like to spend time optimising x86 PAE ;) > - Not support more than 32 cpus on x86_32. This is not an option really. Anyway, it's not more than 32 CPUs, but can be a problem with as few as about 8 depending on config. > I suspect it might even be worth writing a version of one_page_table_init > that would guarantee discontiguous pages. So we can flush out these > kinds of fragile assumptions. So did you actually see anything wrong with my last patch which catches discontiguous page tables and copies over the early fixmap translations? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-31 1:54 ` Nick Piggin @ 2008-12-31 9:01 ` Eric W. Biederman 2008-12-31 9:33 ` Nick Piggin 0 siblings, 1 reply; 14+ messages in thread From: Eric W. Biederman @ 2008-12-31 9:01 UTC (permalink / raw) To: Nick Piggin; +Cc: Ingo Molnar, Andrew Morton, linux-kernel, tglx, ijc Nick Piggin <npiggin@suse.de> writes: > On Tue, Dec 30, 2008 at 02:41:53PM -0800, Eric W. Biederman wrote: >> Nick Piggin <npiggin@suse.de> writes: >> >> > On Tue, Dec 30, 2008 at 07:13:44AM +0100, Ingo Molnar wrote: >> >> >> >> * Nick Piggin <npiggin@suse.de> wrote: >> >> >> >> > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote: >> >> > > On Thu, 18 Dec 2008 22:15:43 +0100 >> >> > > Nick Piggin <npiggin@suse.de> wrote: >> >> > > >> >> > > > Hi, >> >> > > > >> >> > > > I've debugged a problem where i386+pae systems with more than a few > CPUs >> >> > > > blow up at boot in the kmap_atomic code. >> >> > > >> >> > > ping? >> >> > >> >> > No further progress here, I'm waiting on input for how to fix this >> >> > "nicely". Meantime, clearing the early fixmap pte I guess works, but you >> >> > lose a page... is it possible to put it into .initdata or is there some >> >> > issue with that? (I guess on a PAE kernel, 4K isn't a big deal). >> >> >> >> yeah, 4K shouldnt be a big deal. Mind sending a patch for this? >> > >> > How's this? >> > -- >> > >> > The early fixmap pmd entry inserted at the very top of the KVA is casing the >> > subsequent fixmap mapping code to not provide physically linear pte pages > over >> > the kmap atomic portion of the fixmap (which relies on said property to >> > calculate >> > pte address). >> > >> > This has caused weird boot failures in kmap_atomic much later in the boot >> > process (initial userspace faults) on a 32-bit PAE system with a larger > number >> > of CPUs (smaller CPU counts tend not to run over into the next page so don't >> > show up the problem). >> atomic> >> > Solve this by attempting to clear out the page table, and copy any of its >> > entries to the new one. Also, add a bug if a nonlinear condition is > encounted >> > and can't be resolved, which might save some hours of debugging if this > fragile >> > scheme ever breaks again... >> > >> > Putting swapper_pg_fixmap into initdata is an exercise left for the > reviewer... >> >> Ok. I see what is going on now. We have exceeded 512 fixmap entries, causing >> the fixmap entries to consume more than 2MB of the address space. Which broke >> the assumption that the fixmap entries are all contiguous. > > Yes. That wasn't obvious from my problem description? Not quite. I was the change to one_page_table_init in the patch that made it clear to me. If you have been working the code for a while or come at it from the proper angle it is, obvious. I haven't looked at it in a bit and came at it from the wrong angle initially. >> Ditching the swapper_pg_fixmap has some problems. >> >> This appears to break early_printk to a usb debug port, which calls >> set_fixmap_nocache and expects the mapping to last. >> >> This looks like it will have problems with Xen and other environments >> where we come in with a pre-populated page table, possibly unmapping >> something important. > > My patch copies the early fixmap mappings to the new page table. Isn't > this enough? I'm not certain. Xen can get really finicky if you mess with it's pre-established mappings. Which is one of the reasons why we preserve the current mappings. It was a real struggle to find a solution that worked for everything last time we had a PAE problem, and apparently we failed to test with large numbers of cpus in the config. So please excuse me if I'm leery of a quick fix, to code that is obviously insufficiently clean to do what needs to be done intentionally. I also don't see the urgency as this problem has existed for long enough that I have had time to page out the original conversation, so please let's take the time to do this cleanly. >> one_page_table_init relies on alloc_bootmem_low_pages for it's memory > allocation >> so we do not have a guarantee that we will have contiguous memory even without >> this. > > It's OK, if fragile, in early boot where it uses alloc_low_page. Which accounts for early_ioremap_init but not permanent_kmaps_init which has the same constraint, and also calls page_table_range_init, but does so after the bootmem allocator has been initialized. So no we can't rely on the simplicity of alloc_low_page to guarantee us a contiguous chunk of ptes. >> I see three ways we can address this. >> - Grow swapper_pg_fixmap to cover the entire fixmap range. >> This trivially and without problems gives an atomic guarantee, >> and should allow removal of code that sets up the fixmaps later >> in C, except in weird cases like Xen. > > Would be fine by me, although I want to get a minimal patch working > in the meantime if that is going to be complex. No. It should be just a matter of passing the number of pages we need to initialize in a form that assembly can use and upgrading the current code to a loop. Unless Xen and co, rely on this code to add kmaps to their initial page table which not that I think about it I expect they do. > Go wild if you'd like to spend time optimising x86 PAE ;) Nit: It is x86 highmem not x86 PAE. >> - Not support more than 32 cpus on x86_32. > > This is not an option really. Anyway, it's not more than 32 CPUs, but > can be a problem with as few as about 8 depending on config. That sounds like a lot of IOAPICs. Can you really see this with only 8 cpus?? >> I suspect it might even be worth writing a version of one_page_table_init >> that would guarantee discontiguous pages. So we can flush out these >> kinds of fragile assumptions. > > So did you actually see anything wrong with my last patch which catches > discontiguous page tables and copies over the early fixmap translations? Sorry I missed the copy of the ptes in my quest to put what you were saying in perspective. Real concerns: We are not always using alloc_low_page() so we have an allocator that does not guarantee the property we are seeking. - I don't see what prevents us from hitting a kernel page table entry and unmapping ourselves while we are in the middle of this. Minor but very important that we verify. - I don't see (if the end mappings are the same) why we need to flush the tlb. - We have a lot of stale comments that you are not updating. - page_table_range_init does not explicitly do what you are changing it to guarantee in your patch. So I think we are taking a weird accidental dependency and applying a bit more duck tape instead of fixing this cleanly, which means some other change is likely to trigger this in another way. On the other hand it does look like fixing page_table_range_init so that it cleanly guarantees what we actually expect from it, is the way to go. It is in C so it is readable and it is it can be used to fix up any preexisting page tables form Xen and elsewhere. Stale comments that I found at a quick look: mm/init_32.c: > * In general, pagetable_init() assumes that the pagetable may already > * be partially populated, and so it avoids stomping on any existing > * mappings. asm/highmem.h: > /* > * Right now we initialize only a single pte table. It can be extended > * easily, subsequent pte tables have to be allocated in one physical > * chunk of RAM. > */ Eric ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-31 9:01 ` Eric W. Biederman @ 2008-12-31 9:33 ` Nick Piggin 2009-01-09 10:24 ` Ian Campbell 0 siblings, 1 reply; 14+ messages in thread From: Nick Piggin @ 2008-12-31 9:33 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Ingo Molnar, Andrew Morton, linux-kernel, tglx, ijc On Wed, Dec 31, 2008 at 01:01:27AM -0800, Eric W. Biederman wrote: > Nick Piggin <npiggin@suse.de> writes: > >> the assumption that the fixmap entries are all contiguous. > > > > Yes. That wasn't obvious from my problem description? > > Not quite. I was the change to one_page_table_init in the patch that made > it clear to me. If you have been working the code for a while or come at > it from the proper angle it is, obvious. I haven't looked at it in a bit > and came at it from the wrong angle initially. OK. I'll try to improve it. > >> Ditching the swapper_pg_fixmap has some problems. > >> > >> This appears to break early_printk to a usb debug port, which calls > >> set_fixmap_nocache and expects the mapping to last. > >> > >> This looks like it will have problems with Xen and other environments > >> where we come in with a pre-populated page table, possibly unmapping > >> something important. > > > > My patch copies the early fixmap mappings to the new page table. Isn't > > this enough? > > I'm not certain. Xen can get really finicky if you mess with it's > pre-established mappings. Which is one of the reasons why we preserve > the current mappings. It was a real struggle to find a solution that > worked for everything last time we had a PAE problem, and apparently > we failed to test with large numbers of cpus in the config. > > So please excuse me if I'm leery of a quick fix, to code that is obviously > insufficiently clean to do what needs to be done intentionally. Well we should just ask the Xen people whether it will work or not. Why wouldn't it? Xen has to be able to cope with kernel mappings coming and going. > I also don't see the urgency as this problem has existed for long enough > that I have had time to page out the original conversation, so please > let's take the time to do this cleanly. Just to have a straightforward fix for stable kernels I guess. > >> one_page_table_init relies on alloc_bootmem_low_pages for it's memory > > allocation > >> so we do not have a guarantee that we will have contiguous memory even without > >> this. > > > > It's OK, if fragile, in early boot where it uses alloc_low_page. > > Which accounts for early_ioremap_init but not permanent_kmaps_init > which has the same constraint, and also calls page_table_range_init, > but does so after the bootmem allocator has been initialized. So > no we can't rely on the simplicity of alloc_low_page to guarantee > us a contiguous chunk of ptes. Sure, but I have added a check for that so it will panic if there is a problem. My patch hasn't made things in that department more fragile, put it that way. > >> I see three ways we can address this. > >> - Grow swapper_pg_fixmap to cover the entire fixmap range. > >> This trivially and without problems gives an atomic guarantee, > >> and should allow removal of code that sets up the fixmaps later > >> in C, except in weird cases like Xen. > > > > Would be fine by me, although I want to get a minimal patch working > > in the meantime if that is going to be complex. > > No. It should be just a matter of passing the number of pages > we need to initialize in a form that assembly can use and upgrading > the current code to a loop. > > Unless Xen and co, rely on this code to add kmaps to their initial > page table which not that I think about it I expect they do. > > > Go wild if you'd like to spend time optimising x86 PAE ;) > > Nit: It is x86 highmem not x86 PAE. > > >> - Not support more than 32 cpus on x86_32. > > > > This is not an option really. Anyway, it's not more than 32 CPUs, but > > can be a problem with as few as about 8 depending on config. > > That sounds like a lot of IOAPICs. Can you really see this with only > 8 cpus?? With highmem debugging, yes I think I saw a kmap bug come from CPU7. > >> I suspect it might even be worth writing a version of one_page_table_init > >> that would guarantee discontiguous pages. So we can flush out these > >> kinds of fragile assumptions. > > > > So did you actually see anything wrong with my last patch which catches > > discontiguous page tables and copies over the early fixmap translations? > > Sorry I missed the copy of the ptes in my quest to put what you were saying > in perspective. > > Real concerns: > > We are not always using alloc_low_page() so we have an allocator that > does not guarantee the property we are seeking. > > - I don't see what prevents us from hitting a kernel page table entry > and unmapping ourselves while we are in the middle of this. > > Minor but very important that we verify. Because it's in the pkmap area? > - I don't see (if the end mappings are the same) why we need to flush > the tlb. Hmm... x86 CPUs can cache higher level page table translations in non-cache-coherent buffers, so they could continue to look at the old page table page after we insert new mappings int othe new one, thus missing them and faulting. I don't know whether that fault is going to be guaranteed to flush those higher level translations, but I don't want to risk it. It's not performance critical. > - We have a lot of stale comments that you are not updating. Which? (ah, I see below) > - page_table_range_init does not explicitly do what you are changing it to > guarantee in your patch. So I think we are taking a weird accidental dependency > and applying a bit more duck tape instead of fixing this cleanly, which > means some other change is likely to trigger this in another way. What do you mean? Contiguity of pte pages is guaranteed after my patch. If there is anything else it is supposed to do but doesn't, then that's for another patch? > On the other hand it does look like fixing page_table_range_init so that > it cleanly guarantees what we actually expect from it, is the way to go. > It is in C so it is readable and it is it can be used to fix up any > preexisting page tables form Xen and elsewhere. > > Stale comments that I found at a quick look: > mm/init_32.c: > > * In general, pagetable_init() assumes that the pagetable may already > > * be partially populated, and so it avoids stomping on any existing > > * mappings. > > asm/highmem.h: > > /* > > * Right now we initialize only a single pte table. It can be extended > > * easily, subsequent pte tables have to be allocated in one physical > > * chunk of RAM. > > */ > > Eric ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-31 9:33 ` Nick Piggin @ 2009-01-09 10:24 ` Ian Campbell 0 siblings, 0 replies; 14+ messages in thread From: Ian Campbell @ 2009-01-09 10:24 UTC (permalink / raw) To: Nick Piggin Cc: Eric W. Biederman, Ingo Molnar, Andrew Morton, linux-kernel, tglx, Jeremy Fitzhardinge Adding Jeremy to CC On Wed, 2008-12-31 at 10:33 +0100, Nick Piggin wrote: > > > >> Ditching the swapper_pg_fixmap has some problems. > > >> > > >> This appears to break early_printk to a usb debug port, which > > >> calls set_fixmap_nocache and expects the mapping to last. > > >> > > >> This looks like it will have problems with Xen and other > > >> environments where we come in with a pre-populated page table, > > >> possibly unmapping something important. > > > > > > My patch copies the early fixmap mappings to the new page table. > > > Isn't this enough? > > > > I'm not certain. Xen can get really finicky if you mess with it's > > pre-established mappings. Which is one of the reasons why we > > preserve the current mappings. It was a real struggle to find a > > solution that worked for everything last time we had a PAE problem, > > and apparently we failed to test with large numbers of cpus in the > > config. > > > > So please excuse me if I'm leery of a quick fix, to code that is > > obviously insufficiently clean to do what needs to be done > > intentionally. > > Well we should just ask the Xen people whether it will work or not. > Why wouldn't it? Xen has to be able to cope with kernel mappings > coming and going. I think it should be fine from Xen's PoV since one_page_table_init() correctly uses paravirt_alloc_pte() and set_pte() does the right thing under Xen. BTW, I needed this fixlet since set_pte takes a pte_t not a pteval_t, otherwise: CC arch/x86/mm/init_32.o [...]arch/x86/mm/init_32.c: In function 'page_table_range_init': [...]arch/x86/mm/init_32.c:188: error: incompatible type for argument 2 of 'set_pte' diff -r cf59aba1694f arch/x86/mm/init_32.c --- a/arch/x86/mm/init_32.c Fri Jan 09 10:13:31 2009 +0000 +++ b/arch/x86/mm/init_32.c Fri Jan 09 10:16:38 2009 +0000 @@ -185,7 +185,7 @@ newpte = one_page_table_init(pmd); BUG_ON(lastpte + PTRS_PER_PTE != newpte); for (i = 0; i < PTRS_PER_PTE; i++) { - set_pte(newpte + i, pte_val(*(pte + i))); + set_pte(newpte + i, *(pte + i)); } pte = lastpte; } -- Ian Campbell Current Noise: Firebird - Needle In The Groove "Now we'll have to kill you." -- Linus Torvalds ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-30 4:01 ` Nick Piggin 2008-12-30 6:13 ` Ingo Molnar @ 2008-12-30 6:22 ` Eric W. Biederman 2008-12-30 6:35 ` Nick Piggin 1 sibling, 1 reply; 14+ messages in thread From: Eric W. Biederman @ 2008-12-30 6:22 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, tglx, mingo, ijc Nick Piggin <npiggin@suse.de> writes: > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote: >> On Thu, 18 Dec 2008 22:15:43 +0100 >> Nick Piggin <npiggin@suse.de> wrote: >> >> > Hi, >> > >> > I've debugged a problem where i386+pae systems with more than a few CPUs >> > blow up at boot in the kmap_atomic code. >> >> ping? > > No further progress here, I'm waiting on input for how to fix this > "nicely". Meantime, clearing the early fixmap pte I guess works, but > you lose a page... is it possible to put it into .initdata or is > there some issue with that? (I guess on a PAE kernel, 4K isn't a > big deal). > > >> > The problem is that the kmap_atomic pte pages all need to be contiguous >> > memory because the pte is calculated via the first kmap pte page + an >> > offset (so as not to have to walk the page tables every time). >> > >> > The fixmap setup code crudely allocates contiguous pte pages, which is fine, >> > but if it finds an already populated pmd entry, then it will not switch it >> > to a new, contiguous pte page. So the early fixmap introduces a discontig >> > page table right in the middle of the kmap atomic fixmaps. Where is this? >> > Commenting out the eaarly fixmap setup in head_32.S gets everything working >> > properly. What would be the best way to fix this? Could we put the early >> > fixmap page table in initdata, and then have the fixmap setup proper first >> > clear its corresponding pmd entry? Why would we want or need to? >> How come users/testers aren't reporting this? > > Because apparently nobody tests 32-bit PAE systems with more than a couple > of CPUs anymore. This bug comes from HW vendor doing testing of SLES11. Hmm. I have taken a quick skim and I am not seeming the part of the code you are talking about. Is the problem code in mainline? I'm guessing it has something to do with reserve_top_address() being called with a bad value in the normal case. But I don't see it being called at all in the normal case. Eric ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: early fixmap causes kmap breakage 2008-12-30 6:22 ` Eric W. Biederman @ 2008-12-30 6:35 ` Nick Piggin 0 siblings, 0 replies; 14+ messages in thread From: Nick Piggin @ 2008-12-30 6:35 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel, tglx, mingo, ijc On Mon, Dec 29, 2008 at 10:22:12PM -0800, Eric W. Biederman wrote: > Nick Piggin <npiggin@suse.de> writes: > > > On Mon, Dec 29, 2008 at 03:17:31PM -0800, Andrew Morton wrote: > >> On Thu, 18 Dec 2008 22:15:43 +0100 > >> Nick Piggin <npiggin@suse.de> wrote: > >> > >> > Hi, > >> > > >> > I've debugged a problem where i386+pae systems with more than a few CPUs > >> > blow up at boot in the kmap_atomic code. > >> > >> ping? > > > > No further progress here, I'm waiting on input for how to fix this > > "nicely". Meantime, clearing the early fixmap pte I guess works, but > > you lose a page... is it possible to put it into .initdata or is > > there some issue with that? (I guess on a PAE kernel, 4K isn't a > > big deal). > > > > > >> > The problem is that the kmap_atomic pte pages all need to be contiguous > >> > memory because the pte is calculated via the first kmap pte page + an > >> > offset (so as not to have to walk the page tables every time). > >> > > >> > The fixmap setup code crudely allocates contiguous pte pages, which is fine, > >> > but if it finds an already populated pmd entry, then it will not switch it > >> > to a new, contiguous pte page. So the early fixmap introduces a discontig > >> > page table right in the middle of the kmap atomic fixmaps. > > Where is this? Where is what? > >> > Commenting out the eaarly fixmap setup in head_32.S gets everything working > >> > properly. What would be the best way to fix this? Could we put the early > >> > fixmap page table in initdata, and then have the fixmap setup proper first > >> > clear its corresponding pmd entry? > > Why would we want or need to? So the fixmap allocator allocates contiguous pagetables. Wasn't my description clear enough? > >> How come users/testers aren't reporting this? > > > > Because apparently nobody tests 32-bit PAE systems with more than a couple > > of CPUs anymore. This bug comes from HW vendor doing testing of SLES11. > > Hmm. > > I have taken a quick skim and I am not seeming the part of the code you are > talking about. Is the problem code in mainline? > > I'm guessing it has something to do with reserve_top_address() being called > with a bad value in the normal case. But I don't see it being called > at all in the normal case. Yes I think so. one_page_table_init will not allocate a new pte page if one already exists, so it will not be contiguous with previous/subsequent pages with the early fixmap pte there. page_table_range_init explains why that is a problem (this code seems terribly fragile and prone to breakage but I don't want to expend theeffort on it -- still, I have a patch to panic the kernel if it breaks its guarantee of contiguous page tables, which makes breakage easily visible). ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2009-01-09 10:27 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-12-18 21:15 early fixmap causes kmap breakage Nick Piggin 2008-12-29 23:17 ` Andrew Morton 2008-12-30 4:01 ` Nick Piggin 2008-12-30 6:13 ` Ingo Molnar 2008-12-30 7:54 ` Nick Piggin 2008-12-30 8:14 ` Nick Piggin 2008-12-30 10:28 ` Nick Piggin 2008-12-30 22:41 ` Eric W. Biederman 2008-12-31 1:54 ` Nick Piggin 2008-12-31 9:01 ` Eric W. Biederman 2008-12-31 9:33 ` Nick Piggin 2009-01-09 10:24 ` Ian Campbell 2008-12-30 6:22 ` Eric W. Biederman 2008-12-30 6:35 ` Nick Piggin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox