* Unsafe pte_update() in do_page_fault() (4xx and Book-E) @ 2006-03-02 20:26 Eugene Surovegin 2006-03-03 3:43 ` Benjamin Herrenschmidt 2006-03-28 7:55 ` [PATCH] lock PTE before updating it in 440/BookE page fault handler Eugene Surovegin 0 siblings, 2 replies; 6+ messages in thread From: Eugene Surovegin @ 2006-03-02 20:26 UTC (permalink / raw) To: Paul Mackerras; +Cc: linuxppc-dev, Kumar Gala Hi! For the last couple of days I was debugging rare swap_dup: Bad swap file entry 0x00000080 errors in my custom 2.4 kernel running on 405GPr system. My current theory is that this error is caused by the special lazy dcache/icache flush handling on 4xx and BookE. Because this code in my 2.4 was actually a backport from 2.6, I think we have a problem in current 2.6 as well. Here is what I think happens. On 4xx/BookE we use execute bit to deffer dcache to icache flush, in do_page_fault() we flush page when execute trap triggers and enable _PAGE_HWEXEC bit in PTE. Unfortunately, we don't lock this PTE and it's possible that after pte_present() check but _before_ pte_update() call this particular page was purged from the memory, e.g. because of extreme memory pressure (of course, I'm assuming enabled preempt). If this happens, pte_update() sets _PAGE_HWEXEC bit in just cleared PTE. Sometime later, another page fault happens for this page, but because of that set bit, pte_none() test in handle_pte_fault() fails, and we continue along the wrong path, thinking that this PTE was swapped out to the swap file, and this triggers swap_dup error I mentioned at the beginning. _PAGE_HWXEC is 0x200 on 405GPr, and because swap entry is PTE shifted 2 bits to the right, we get that "0x00000080" value. Paul, does my theory make any sense? I cannot test 2.6 on our hw. So far, after I added additional page_table_lock locking to my 2.4 in do_page_fault(), I haven't seen these errors, but it's too early to be 100% sure :). I'll make a patch for 2.6 if you think my analysis is correct. -- Eugene ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Unsafe pte_update() in do_page_fault() (4xx and Book-E) 2006-03-02 20:26 Unsafe pte_update() in do_page_fault() (4xx and Book-E) Eugene Surovegin @ 2006-03-03 3:43 ` Benjamin Herrenschmidt 2006-03-03 3:55 ` Eugene Surovegin 2006-03-28 7:55 ` [PATCH] lock PTE before updating it in 440/BookE page fault handler Eugene Surovegin 1 sibling, 1 reply; 6+ messages in thread From: Benjamin Herrenschmidt @ 2006-03-03 3:43 UTC (permalink / raw) To: Eugene Surovegin; +Cc: linuxppc-dev, Kumar Gala > If this happens, pte_update() sets _PAGE_HWEXEC bit in just cleared > PTE. Sometime later, another page fault happens for this page, but > because of that set bit, pte_none() test in handle_pte_fault() fails, > and we continue along the wrong path, thinking that this PTE was > swapped out to the swap file, and this triggers swap_dup error I > mentioned at the beginning. Can we preempt at that point ? As tehre is no SMP 4xx that I know of preempt would be the only cause for such a race ... Ben. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Unsafe pte_update() in do_page_fault() (4xx and Book-E) 2006-03-03 3:43 ` Benjamin Herrenschmidt @ 2006-03-03 3:55 ` Eugene Surovegin 0 siblings, 0 replies; 6+ messages in thread From: Eugene Surovegin @ 2006-03-03 3:55 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Kumar Gala On Fri, Mar 03, 2006 at 02:43:02PM +1100, Benjamin Herrenschmidt wrote: > > > If this happens, pte_update() sets _PAGE_HWEXEC bit in just cleared > > PTE. Sometime later, another page fault happens for this page, but > > because of that set bit, pte_none() test in handle_pte_fault() fails, > > and we continue along the wrong path, thinking that this PTE was > > swapped out to the swap file, and this triggers swap_dup error I > > mentioned at the beginning. > > Can we preempt at that point ? As tehre is no SMP 4xx that I know of > preempt would be the only cause for such a race ... Yes, as I mentioned in the original post, I'm running preempt enabled kernel. -- Eugene ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH] lock PTE before updating it in 440/BookE page fault handler 2006-03-02 20:26 Unsafe pte_update() in do_page_fault() (4xx and Book-E) Eugene Surovegin 2006-03-03 3:43 ` Benjamin Herrenschmidt @ 2006-03-28 7:55 ` Eugene Surovegin 2006-03-28 11:10 ` Paul Mackerras 1 sibling, 1 reply; 6+ messages in thread From: Eugene Surovegin @ 2006-03-28 7:55 UTC (permalink / raw) To: Paul Mackerras; +Cc: linuxppc-dev Fix 44x and BookE page fault handler to correctly lock PTE before trying to pte_update() it, otherwise this PTE might be swapped out after pte_present() check but before pte_uptdate() call, resulting in corrupted PTE. This can happen with enabled preemption and low memory condition. Signed-off-by: Eugene Surovegin <ebs@ebshome.net> --- arch/ppc/mm/fault.c | 30 +++++++++++++++++------------- arch/ppc/mm/pgtable.c | 6 ++++-- include/asm-ppc/pgtable.h | 3 ++- 3 files changed, 23 insertions(+), 16 deletions(-) diff --git a/arch/ppc/mm/fault.c b/arch/ppc/mm/fault.c index 0217188..8e08ca3 100644 --- a/arch/ppc/mm/fault.c +++ b/arch/ppc/mm/fault.c @@ -202,6 +202,7 @@ good_area: /* an exec - 4xx/Book-E allows for per-page execute permission */ } else if (TRAP(regs) == 0x400) { pte_t *ptep; + pmd_t *pmdp; #if 0 /* It would be nice to actually enforce the VM execute @@ -215,21 +216,24 @@ good_area: /* Since 4xx/Book-E supports per-page execute permission, * we lazily flush dcache to icache. */ ptep = NULL; - if (get_pteptr(mm, address, &ptep) && pte_present(*ptep)) { - struct page *page = pte_page(*ptep); - - if (! test_bit(PG_arch_1, &page->flags)) { - flush_dcache_icache_page(page); - set_bit(PG_arch_1, &page->flags); + if (get_pteptr(mm, address, &ptep, &pmdp)) { + spinlock_t *ptl = pte_lockptr(mm, pmdp); + spin_lock(ptl); + if (pte_present(*ptep)) { + struct page *page = pte_page(*ptep); + + if (!test_bit(PG_arch_1, &page->flags)) { + flush_dcache_icache_page(page); + set_bit(PG_arch_1, &page->flags); + } + pte_update(ptep, 0, _PAGE_HWEXEC); + _tlbie(address); + pte_unmap_unlock(ptep, ptl); + up_read(&mm->mmap_sem); + return 0; } - pte_update(ptep, 0, _PAGE_HWEXEC); - _tlbie(address); - pte_unmap(ptep); - up_read(&mm->mmap_sem); - return 0; + pte_unmap_unlock(ptep, ptl); } - if (ptep != NULL) - pte_unmap(ptep); #endif /* a read */ } else { diff --git a/arch/ppc/mm/pgtable.c b/arch/ppc/mm/pgtable.c index 6ea9185..98a83be 100644 --- a/arch/ppc/mm/pgtable.c +++ b/arch/ppc/mm/pgtable.c @@ -368,7 +368,7 @@ void __init io_block_mapping(unsigned lo * the PTE pointer is unmodified if PTE is not found. */ int -get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep) +get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, pmd_t **pmdp) { pgd_t *pgd; pmd_t *pmd; @@ -383,6 +383,8 @@ get_pteptr(struct mm_struct *mm, unsigne if (pte) { retval = 1; *ptep = pte; + if (pmdp) + *pmdp = pmd; /* XXX caller needs to do pte_unmap, yuck */ } } @@ -420,7 +422,7 @@ unsigned long iopa(unsigned long addr) mm = &init_mm; pa = 0; - if (get_pteptr(mm, addr, &pte)) { + if (get_pteptr(mm, addr, &pte, NULL)) { pa = (pte_val(*pte) & PAGE_MASK) | (addr & ~PAGE_MASK); pte_unmap(pte); } diff --git a/include/asm-ppc/pgtable.h b/include/asm-ppc/pgtable.h index e1c62da..570b355 100644 --- a/include/asm-ppc/pgtable.h +++ b/include/asm-ppc/pgtable.h @@ -837,7 +837,8 @@ static inline int io_remap_pfn_range(str */ #define pgtable_cache_init() do { } while (0) -extern int get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep); +extern int get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, + pmd_t **pmdp); #include <asm-generic/pgtable.h> ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] lock PTE before updating it in 440/BookE page fault handler 2006-03-28 7:55 ` [PATCH] lock PTE before updating it in 440/BookE page fault handler Eugene Surovegin @ 2006-03-28 11:10 ` Paul Mackerras 2006-03-28 18:13 ` Eugene Surovegin 0 siblings, 1 reply; 6+ messages in thread From: Paul Mackerras @ 2006-03-28 11:10 UTC (permalink / raw) To: Eugene Surovegin; +Cc: linuxppc-dev Eugene Surovegin writes: > Fix 44x and BookE page fault handler to correctly lock PTE before > trying to pte_update() it, otherwise this PTE might be swapped out > after pte_present() check but before pte_uptdate() call, resulting in > corrupted PTE. This can happen with enabled preemption and low memory > condition. That gives me this for an ARCH=powerpc 32-bit pmac build: /home/paulus/kernel/powerpc/arch/powerpc/mm/pgtable_32.c:376: error: conflicting types for 'get_pteptr' /home/paulus/kernel/powerpc/include/asm-ppc/pgtable.h:841: error: previous declaration of 'get_pteptr' was here Paul. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] lock PTE before updating it in 440/BookE page fault handler 2006-03-28 11:10 ` Paul Mackerras @ 2006-03-28 18:13 ` Eugene Surovegin 0 siblings, 0 replies; 6+ messages in thread From: Eugene Surovegin @ 2006-03-28 18:13 UTC (permalink / raw) To: Paul Mackerras; +Cc: linuxppc-dev On Tue, Mar 28, 2006 at 10:10:39PM +1100, Paul Mackerras wrote: > Eugene Surovegin writes: > > > Fix 44x and BookE page fault handler to correctly lock PTE before > > trying to pte_update() it, otherwise this PTE might be swapped out > > after pte_present() check but before pte_uptdate() call, resulting in > > corrupted PTE. This can happen with enabled preemption and low memory > > condition. > > That gives me this for an ARCH=powerpc 32-bit pmac build: > > /home/paulus/kernel/powerpc/arch/powerpc/mm/pgtable_32.c:376: error: conflicting types for 'get_pteptr' > /home/paulus/kernel/powerpc/include/asm-ppc/pgtable.h:841: error: previous declaration of 'get_pteptr' was here Oops, sorry. Haven't noticed we have duplicated files in ppc and powerpc. Full patch follows. Fix 44x and BookE page fault handler to correctly lock PTE before trying to pte_update() it, otherwise this PTE might be swapped out after pte_present() check but before pte_uptdate() call, resulting in corrupted PTE. This can happen with enabled preemption and low memory condition. Signed-off-by: Eugene Surovegin <ebs@ebshome.net> --- arch/powerpc/mm/fault.c | 30 +++++++++++++++++------------- arch/powerpc/mm/pgtable_32.c | 6 ++++-- arch/ppc/mm/fault.c | 30 +++++++++++++++++------------- arch/ppc/mm/pgtable.c | 6 ++++-- include/asm-ppc/pgtable.h | 3 ++- 5 files changed, 44 insertions(+), 31 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index ec4adcb..5aea090 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -267,25 +267,29 @@ good_area: #endif #if defined(CONFIG_4xx) || defined(CONFIG_BOOKE) pte_t *ptep; + pmd_t *pmdp; /* Since 4xx/Book-E supports per-page execute permission, * we lazily flush dcache to icache. */ ptep = NULL; - if (get_pteptr(mm, address, &ptep) && pte_present(*ptep)) { - struct page *page = pte_page(*ptep); - - if (! test_bit(PG_arch_1, &page->flags)) { - flush_dcache_icache_page(page); - set_bit(PG_arch_1, &page->flags); + if (get_pteptr(mm, address, &ptep, &pmdp)) { + spinlock_t *ptl = pte_lockptr(mm, pmdp); + spin_lock(ptl); + if (pte_present(*ptep)) { + struct page *page = pte_page(*ptep); + + if (!test_bit(PG_arch_1, &page->flags)) { + flush_dcache_icache_page(page); + set_bit(PG_arch_1, &page->flags); + } + pte_update(ptep, 0, _PAGE_HWEXEC); + _tlbie(address); + pte_unmap_unlock(ptep, ptl); + up_read(&mm->mmap_sem); + return 0; } - pte_update(ptep, 0, _PAGE_HWEXEC); - _tlbie(address); - pte_unmap(ptep); - up_read(&mm->mmap_sem); - return 0; + pte_unmap_unlock(ptep, ptl); } - if (ptep != NULL) - pte_unmap(ptep); #endif /* a write */ } else if (is_write) { diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index d296eb6..9062860 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -372,7 +372,7 @@ void __init io_block_mapping(unsigned lo * the PTE pointer is unmodified if PTE is not found. */ int -get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep) +get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, pmd_t **pmdp) { pgd_t *pgd; pmd_t *pmd; @@ -387,6 +387,8 @@ get_pteptr(struct mm_struct *mm, unsigne if (pte) { retval = 1; *ptep = pte; + if (pmdp) + *pmdp = pmd; /* XXX caller needs to do pte_unmap, yuck */ } } @@ -424,7 +426,7 @@ unsigned long iopa(unsigned long addr) mm = &init_mm; pa = 0; - if (get_pteptr(mm, addr, &pte)) { + if (get_pteptr(mm, addr, &pte, NULL)) { pa = (pte_val(*pte) & PAGE_MASK) | (addr & ~PAGE_MASK); pte_unmap(pte); } diff --git a/arch/ppc/mm/fault.c b/arch/ppc/mm/fault.c index 0217188..8e08ca3 100644 --- a/arch/ppc/mm/fault.c +++ b/arch/ppc/mm/fault.c @@ -202,6 +202,7 @@ good_area: /* an exec - 4xx/Book-E allows for per-page execute permission */ } else if (TRAP(regs) == 0x400) { pte_t *ptep; + pmd_t *pmdp; #if 0 /* It would be nice to actually enforce the VM execute @@ -215,21 +216,24 @@ good_area: /* Since 4xx/Book-E supports per-page execute permission, * we lazily flush dcache to icache. */ ptep = NULL; - if (get_pteptr(mm, address, &ptep) && pte_present(*ptep)) { - struct page *page = pte_page(*ptep); - - if (! test_bit(PG_arch_1, &page->flags)) { - flush_dcache_icache_page(page); - set_bit(PG_arch_1, &page->flags); + if (get_pteptr(mm, address, &ptep, &pmdp)) { + spinlock_t *ptl = pte_lockptr(mm, pmdp); + spin_lock(ptl); + if (pte_present(*ptep)) { + struct page *page = pte_page(*ptep); + + if (!test_bit(PG_arch_1, &page->flags)) { + flush_dcache_icache_page(page); + set_bit(PG_arch_1, &page->flags); + } + pte_update(ptep, 0, _PAGE_HWEXEC); + _tlbie(address); + pte_unmap_unlock(ptep, ptl); + up_read(&mm->mmap_sem); + return 0; } - pte_update(ptep, 0, _PAGE_HWEXEC); - _tlbie(address); - pte_unmap(ptep); - up_read(&mm->mmap_sem); - return 0; + pte_unmap_unlock(ptep, ptl); } - if (ptep != NULL) - pte_unmap(ptep); #endif /* a read */ } else { diff --git a/arch/ppc/mm/pgtable.c b/arch/ppc/mm/pgtable.c index 6ea9185..98a83be 100644 --- a/arch/ppc/mm/pgtable.c +++ b/arch/ppc/mm/pgtable.c @@ -368,7 +368,7 @@ void __init io_block_mapping(unsigned lo * the PTE pointer is unmodified if PTE is not found. */ int -get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep) +get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, pmd_t **pmdp) { pgd_t *pgd; pmd_t *pmd; @@ -383,6 +383,8 @@ get_pteptr(struct mm_struct *mm, unsigne if (pte) { retval = 1; *ptep = pte; + if (pmdp) + *pmdp = pmd; /* XXX caller needs to do pte_unmap, yuck */ } } @@ -420,7 +422,7 @@ unsigned long iopa(unsigned long addr) mm = &init_mm; pa = 0; - if (get_pteptr(mm, addr, &pte)) { + if (get_pteptr(mm, addr, &pte, NULL)) { pa = (pte_val(*pte) & PAGE_MASK) | (addr & ~PAGE_MASK); pte_unmap(pte); } diff --git a/include/asm-ppc/pgtable.h b/include/asm-ppc/pgtable.h index e1c62da..570b355 100644 --- a/include/asm-ppc/pgtable.h +++ b/include/asm-ppc/pgtable.h @@ -837,7 +837,8 @@ static inline int io_remap_pfn_range(str */ #define pgtable_cache_init() do { } while (0) -extern int get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep); +extern int get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, + pmd_t **pmdp); #include <asm-generic/pgtable.h> ^ permalink raw reply related [flat|nested] 6+ messages in thread
end of thread, other threads:[~2006-03-28 18:13 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-03-02 20:26 Unsafe pte_update() in do_page_fault() (4xx and Book-E) Eugene Surovegin 2006-03-03 3:43 ` Benjamin Herrenschmidt 2006-03-03 3:55 ` Eugene Surovegin 2006-03-28 7:55 ` [PATCH] lock PTE before updating it in 440/BookE page fault handler Eugene Surovegin 2006-03-28 11:10 ` Paul Mackerras 2006-03-28 18:13 ` Eugene Surovegin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).