* [PATCH 1/3] Fix COW D-cache aliasing on fork
@ 2006-10-19 16:35 Ralf Baechle
2006-10-19 16:35 ` [PATCH 2/3] Pass vma argument to copy_user_highpage() Ralf Baechle
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Ralf Baechle @ 2006-10-19 16:35 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton; +Cc: linux-kernel, Atsushi Nemoto, Ralf Baechle
From: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Problem:
1. There is a process containing two thread (T1 and T2). The
thread T1 calls fork(). Then dup_mmap() function called on T1 context.
static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
...
flush_cache_mm(current->mm);
... /* A */
(write-protect all Copy-On-Write pages)
... /* B */
flush_tlb_mm(current->mm);
...
2. When preemption happens between A and B (or on SMP kernel), the
thread T2 can run and modify data on COW pages without page fault
(modified data will stay in cache).
3. Some time after fork() completed, the thread T2 may cause a page
fault by write-protect on a COW page.
4. Then data of the COW page will be copied to newly allocated
physical page (copy_cow_page()). It reads data via kernel mapping.
The kernel mapping can have different 'color' with user space
mapping of the thread T2 (dcache aliasing). Therefore
copy_cow_page() will copy stale data. Then the modified data in
cache will be lost.
In order to allow architecture code to deal with this problem allow
architecture code to override copy_user_highpage() by defining
__HAVE_ARCH_COPY_USER_HIGHPAGE in <asm/page.h>.
The main part of this patch was originally written by Ralf Baechle;
Atushi Nemoto did the the debugging.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
include/linux/highmem.h | 4 ++++
1 file changed, 4 insertions(+)
Index: upstream-linus/include/linux/highmem.h
===================================================================
--- upstream-linus.orig/include/linux/highmem.h 2006-10-17 00:14:43.000000000 +0100
+++ upstream-linus/include/linux/highmem.h 2006-10-17 00:15:21.000000000 +0100
@@ -94,6 +94,8 @@ static inline void memclear_highpage_flu
kunmap_atomic(kaddr, KM_USER0);
}
+#ifndef __HAVE_ARCH_COPY_USER_HIGHPAGE
+
static inline void copy_user_highpage(struct page *to, struct page *from, unsigned long vaddr)
{
char *vfrom, *vto;
@@ -107,6 +109,8 @@ static inline void copy_user_highpage(st
smp_wmb();
}
+#endif
+
static inline void copy_highpage(struct page *to, struct page *from)
{
char *vfrom, *vto;
^ permalink raw reply [flat|nested] 46+ messages in thread* [PATCH 2/3] Pass vma argument to copy_user_highpage(). 2006-10-19 16:35 [PATCH 1/3] Fix COW D-cache aliasing on fork Ralf Baechle @ 2006-10-19 16:35 ` Ralf Baechle 2006-10-19 16:35 ` [PATCH 3/3] MIPS: Fix COW D-cache aliasing on fork Ralf Baechle 2006-10-19 17:46 ` [PATCH 1/3] " Nick Piggin 2 siblings, 0 replies; 46+ messages in thread From: Ralf Baechle @ 2006-10-19 16:35 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton; +Cc: linux-kernel, Atsushi Nemoto, Ralf Baechle From: Atsushi Nemoto <anemo@mba.ocn.ne.jp> To allow a more effective copy_user_highpage() on certain architectures, a vma argument is added to the function and cow_user_page() allowing the implementation of these functions to check for the VM_EXEC bit. The main part of this patch was originally written by Ralf Baechle; Atushi Nemoto did the the debugging. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> include/linux/highmem.h | 3 ++- mm/memory.c | 10 +++++----- 2 files changed, 7 insertions(+), 6 deletions(-) Index: upstream-linus/mm/memory.c =================================================================== --- upstream-linus.orig/mm/memory.c 2006-10-17 00:14:41.000000000 +0100 +++ upstream-linus/mm/memory.c 2006-10-17 00:15:40.000000000 +0100 @@ -1431,7 +1431,7 @@ static inline pte_t maybe_mkwrite(pte_t return pte; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va) +static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) { /* * If the source page was a PFN mapping, we don't have @@ -1453,9 +1453,9 @@ static inline void cow_user_page(struct memset(kaddr, 0, PAGE_SIZE); kunmap_atomic(kaddr, KM_USER0); return; - + } - copy_user_highpage(dst, src, va); + copy_user_highpage(dst, src, va, vma); } /* @@ -1566,7 +1566,7 @@ gotten: new_page = alloc_page_vma(GFP_HIGHUSER, vma, address); if (!new_page) goto oom; - cow_user_page(new_page, old_page, address); + cow_user_page(new_page, old_page, address, vma); } /* @@ -2190,7 +2190,7 @@ retry: page = alloc_page_vma(GFP_HIGHUSER, vma, address); if (!page) goto oom; - copy_user_highpage(page, new_page, address); + copy_user_highpage(page, new_page, address, vma); page_cache_release(new_page); new_page = page; anon = 1; Index: upstream-linus/include/linux/highmem.h =================================================================== --- upstream-linus.orig/include/linux/highmem.h 2006-10-17 00:15:21.000000000 +0100 +++ upstream-linus/include/linux/highmem.h 2006-10-17 00:17:23.000000000 +0100 @@ -96,7 +96,8 @@ static inline void memclear_highpage_flu #ifndef __HAVE_ARCH_COPY_USER_HIGHPAGE -static inline void copy_user_highpage(struct page *to, struct page *from, unsigned long vaddr) +static inline void copy_user_highpage(struct page *to, struct page *from, + unsigned long vaddr, struct vm_area_struct *vma) { char *vfrom, *vto; ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 3/3] MIPS: Fix COW D-cache aliasing on fork 2006-10-19 16:35 [PATCH 1/3] Fix COW D-cache aliasing on fork Ralf Baechle 2006-10-19 16:35 ` [PATCH 2/3] Pass vma argument to copy_user_highpage() Ralf Baechle @ 2006-10-19 16:35 ` Ralf Baechle 2006-10-22 2:32 ` Ralf Baechle 2006-10-19 17:46 ` [PATCH 1/3] " Nick Piggin 2 siblings, 1 reply; 46+ messages in thread From: Ralf Baechle @ 2006-10-19 16:35 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton; +Cc: linux-kernel, Atsushi Nemoto, Ralf Baechle From: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Provide a custom copy_user_highpage() to deal with aliasing issues on MIPS. It uses kmap_coherent() to map an user page for kernel with same color. Rewrite copy_to_user_page() and copy_from_user_page() with the new interfaces to avoid extra cache flushing. The main part of this patch was originally written by Ralf Baechle; Atushi Nemoto did the the debugging. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> arch/mips/mm/init.c | 213 ++++++++++++++++++++++++++++++++++++++++-- arch/mips/mm/pgtable-32.c | 7 - arch/mips/mm/pgtable-64.c | 11 ++ include/asm-mips/cacheflush.h | 19 --- include/asm-mips/fixmap.h | 14 ++ include/asm-mips/page.h | 17 +-- 6 files changed, 242 insertions(+), 39 deletions(-) Index: upstream-linus/arch/mips/mm/init.c =================================================================== --- upstream-linus.orig/arch/mips/mm/init.c 2006-10-12 18:51:03.000000000 +0100 +++ upstream-linus/arch/mips/mm/init.c 2006-10-12 18:51:18.000000000 +0100 @@ -30,11 +30,39 @@ #include <asm/cachectl.h> #include <asm/cpu.h> #include <asm/dma.h> +#include <asm/kmap_types.h> #include <asm/mmu_context.h> #include <asm/sections.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> #include <asm/tlb.h> +#include <asm/fixmap.h> + +/* CP0 hazard avoidance. */ +#define BARRIER __asm__ __volatile__(".set noreorder\n\t" \ + "nop; nop; nop; nop; nop; nop;\n\t" \ + ".set reorder\n\t") + +/* Atomicity and interruptability */ +#ifdef CONFIG_MIPS_MT_SMTC + +#include <asm/mipsmtregs.h> + +#define ENTER_CRITICAL(flags) \ + { \ + unsigned int mvpflags; \ + local_irq_save(flags);\ + mvpflags = dvpe() +#define EXIT_CRITICAL(flags) \ + evpe(mvpflags); \ + local_irq_restore(flags); \ + } +#else + +#define ENTER_CRITICAL(flags) local_irq_save(flags) +#define EXIT_CRITICAL(flags) local_irq_restore(flags) + +#endif /* CONFIG_MIPS_MT_SMTC */ DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); @@ -80,13 +108,183 @@ unsigned long setup_zero_pages(void) return 1UL << order; } -#ifdef CONFIG_HIGHMEM -pte_t *kmap_pte; -pgprot_t kmap_prot; +/* + * These are almost like kmap_atomic / kunmap_atmic except they take an + * additional address argument as the hint. + */ #define kmap_get_fixmap_pte(vaddr) \ pte_offset_kernel(pmd_offset(pud_offset(pgd_offset_k(vaddr), (vaddr)), (vaddr)), (vaddr)) +#ifdef CONFIG_MIPS_MT_SMTC +static pte_t *kmap_coherent_pte; +static void __init kmap_coherent_init(void) +{ + unsigned long vaddr; + + /* cache the first coherent kmap pte */ + vaddr = __fix_to_virt(FIX_CMAP_BEGIN); + kmap_coherent_pte = kmap_get_fixmap_pte(vaddr); +} +#else +static inline void kmap_coherent_init(void) {} +#endif + +static inline void *kmap_coherent(struct page *page, unsigned long addr) +{ + enum fixed_addresses idx; + unsigned long vaddr, flags, entrylo; + unsigned long old_ctx; + pte_t pte; + unsigned int tlbidx; + + inc_preempt_count(); + idx = (addr >> PAGE_SHIFT) & (FIX_N_COLOURS - 1); +#ifdef CONFIG_MIPS_MT_SMTC + idx += FIX_N_COLOURS * smp_processor_id(); +#endif + vaddr = __fix_to_virt(FIX_CMAP_END - idx); + pte = mk_pte(page, PAGE_KERNEL); +#if defined(CONFIG_64BIT_PHYS_ADDR) && defined(CONFIG_CPU_MIPS32_R1) + entrylo = pte.pte_high; +#else + entrylo = pte_val(pte) >> 6; +#endif + + ENTER_CRITICAL(flags); + old_ctx = read_c0_entryhi(); + write_c0_entryhi(vaddr & (PAGE_MASK << 1)); + write_c0_entrylo0(entrylo); + write_c0_entrylo1(entrylo); +#ifdef CONFIG_MIPS_MT_SMTC + set_pte(kmap_coherent_pte - (FIX_CMAP_END - idx), pte); + /* preload TLB instead of local_flush_tlb_one() */ + mtc0_tlbw_hazard(); + tlb_probe(); + BARRIER; + tlbidx = read_c0_index(); + mtc0_tlbw_hazard(); + if (tlbidx < 0) + tlb_write_random(); + else + tlb_write_indexed(); +#else + tlbidx = read_c0_wired(); + write_c0_wired(tlbidx + 1); + write_c0_index(tlbidx); + mtc0_tlbw_hazard(); + tlb_write_indexed(); +#endif + tlbw_use_hazard(); + write_c0_entryhi(old_ctx); + EXIT_CRITICAL(flags); + + return (void*) vaddr; +} + +#define UNIQUE_ENTRYHI(idx) (CKSEG0 + ((idx) << (PAGE_SHIFT + 1))) + +static inline void kunmap_coherent(struct page *page) +{ +#ifndef CONFIG_MIPS_MT_SMTC + unsigned int wired; + unsigned long flags, old_ctx; + + ENTER_CRITICAL(flags); + old_ctx = read_c0_entryhi(); + wired = read_c0_wired() - 1; + write_c0_wired(wired); + write_c0_index(wired); + write_c0_entryhi(UNIQUE_ENTRYHI(wired)); + write_c0_entrylo0(0); + write_c0_entrylo1(0); + mtc0_tlbw_hazard(); + tlb_write_indexed(); + write_c0_entryhi(old_ctx); + EXIT_CRITICAL(flags); +#endif + dec_preempt_count(); + preempt_check_resched(); +} + +void copy_user_highpage(struct page *to, struct page *from, + unsigned long vaddr, struct vm_area_struct *vma) +{ + void *vfrom, *vto; + + vto = kmap_atomic(to, KM_USER1); + if (cpu_has_dc_aliases) { + vfrom = kmap_coherent(from, vaddr); + copy_page(vto, vfrom); + kunmap_coherent(from); + } else { + vfrom = kmap_atomic(from, KM_USER0); + copy_page(vto, vfrom); + kunmap_atomic(vfrom, KM_USER0); + } + if (((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc) || + pages_do_alias((unsigned long)vto, vaddr & PAGE_MASK)) + flush_data_cache_page((unsigned long)vto); + kunmap_atomic(vto, KM_USER1); + /* Make sure this page is cleared on other CPU's too before using it */ + smp_wmb(); +} + +EXPORT_SYMBOL(copy_user_highpage); + +void copy_user_page(void *vto, void *vfrom, unsigned long vaddr, + struct page *to) +{ + if (cpu_has_dc_aliases) { + struct page *from = virt_to_page(vfrom); + vfrom = kmap_coherent(from, vaddr); + copy_page(vto, vfrom); + kunmap_coherent(from); + } else + copy_page(vto, vfrom); + if (!cpu_has_ic_fills_f_dc || + pages_do_alias((unsigned long)vto, vaddr & PAGE_MASK)) + flush_data_cache_page((unsigned long)vto); +} + +EXPORT_SYMBOL(copy_user_page); + +void copy_to_user_page(struct vm_area_struct *vma, + struct page *page, unsigned long vaddr, void *dst, const void *src, + unsigned long len) +{ + if (cpu_has_dc_aliases) { + void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK); + memcpy(vto, src, len); + kunmap_coherent(page); + } else + memcpy(dst, src, len); + if ((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc) + flush_cache_page(vma, vaddr, page_to_pfn(page)); +} + +EXPORT_SYMBOL(copy_to_user_page); + +void copy_from_user_page(struct vm_area_struct *vma, + struct page *page, unsigned long vaddr, void *dst, const void *src, + unsigned long len) +{ + if (cpu_has_dc_aliases) { + void *vfrom = + kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK); + memcpy(dst, vfrom, len); + kunmap_coherent(page); + } else + memcpy(dst, src, len); +} + +EXPORT_SYMBOL(copy_from_user_page); + + +#ifdef CONFIG_HIGHMEM +pte_t *kmap_pte; +pgprot_t kmap_prot; + static void __init kmap_init(void) { unsigned long kmap_vstart; @@ -97,11 +295,12 @@ static void __init kmap_init(void) kmap_prot = PAGE_KERNEL; } +#endif /* CONFIG_HIGHMEM */ -#ifdef CONFIG_32BIT void __init fixrange_init(unsigned long start, unsigned long end, pgd_t *pgd_base) { +#if defined(CONFIG_HIGHMEM) || defined(CONFIG_MIPS_MT_SMTC) pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -122,7 +321,7 @@ void __init fixrange_init(unsigned long for (; (k < PTRS_PER_PMD) && (vaddr != end); pmd++, k++) { if (pmd_none(*pmd)) { pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE); - set_pmd(pmd, __pmd(pte)); + set_pmd(pmd, __pmd((unsigned long)pte)); if (pte != pte_offset_kernel(pmd, 0)) BUG(); } @@ -132,9 +331,8 @@ void __init fixrange_init(unsigned long } j = 0; } +#endif } -#endif /* CONFIG_32BIT */ -#endif /* CONFIG_HIGHMEM */ #ifndef CONFIG_NEED_MULTIPLE_NODES extern void pagetable_init(void); @@ -175,6 +373,7 @@ void __init paging_init(void) #ifdef CONFIG_HIGHMEM kmap_init(); #endif + kmap_coherent_init(); max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT; low = max_low_pfn; Index: upstream-linus/arch/mips/mm/pgtable-32.c =================================================================== --- upstream-linus.orig/arch/mips/mm/pgtable-32.c 2006-10-12 18:51:03.000000000 +0100 +++ upstream-linus/arch/mips/mm/pgtable-32.c 2006-10-12 18:51:18.000000000 +0100 @@ -31,9 +31,10 @@ void pgd_init(unsigned long page) void __init pagetable_init(void) { -#ifdef CONFIG_HIGHMEM unsigned long vaddr; - pgd_t *pgd, *pgd_base; + pgd_t *pgd_base; +#ifdef CONFIG_HIGHMEM + pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; @@ -44,7 +45,6 @@ void __init pagetable_init(void) pgd_init((unsigned long)swapper_pg_dir + sizeof(pgd_t) * USER_PTRS_PER_PGD); -#ifdef CONFIG_HIGHMEM pgd_base = swapper_pg_dir; /* @@ -53,6 +53,7 @@ void __init pagetable_init(void) vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK; fixrange_init(vaddr, 0, pgd_base); +#ifdef CONFIG_HIGHMEM /* * Permanent kmaps: */ Index: upstream-linus/arch/mips/mm/pgtable-64.c =================================================================== --- upstream-linus.orig/arch/mips/mm/pgtable-64.c 2006-10-12 18:51:03.000000000 +0100 +++ upstream-linus/arch/mips/mm/pgtable-64.c 2006-10-12 18:51:18.000000000 +0100 @@ -8,6 +8,7 @@ */ #include <linux/init.h> #include <linux/mm.h> +#include <asm/fixmap.h> #include <asm/pgtable.h> void pgd_init(unsigned long page) @@ -52,7 +53,17 @@ void pmd_init(unsigned long addr, unsign void __init pagetable_init(void) { + unsigned long vaddr; + pgd_t *pgd_base; + /* Initialize the entire pgd. */ pgd_init((unsigned long)swapper_pg_dir); pmd_init((unsigned long)invalid_pmd_table, (unsigned long)invalid_pte_table); + + pgd_base = swapper_pg_dir; + /* + * Fixed mappings: + */ + vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK; + fixrange_init(vaddr, 0, pgd_base); } Index: upstream-linus/include/asm-mips/cacheflush.h =================================================================== --- upstream-linus.orig/include/asm-mips/cacheflush.h 2006-10-12 18:51:03.000000000 +0100 +++ upstream-linus/include/asm-mips/cacheflush.h 2006-10-12 18:51:18.000000000 +0100 @@ -55,24 +55,13 @@ extern void (*flush_icache_range)(unsign #define flush_cache_vmap(start, end) flush_cache_all() #define flush_cache_vunmap(start, end) flush_cache_all() -static inline void copy_to_user_page(struct vm_area_struct *vma, +extern void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long vaddr, void *dst, const void *src, - unsigned long len) -{ - if (cpu_has_dc_aliases) - flush_cache_page(vma, vaddr, page_to_pfn(page)); - memcpy(dst, src, len); - __flush_icache_page(vma, page); -} + unsigned long len); -static inline void copy_from_user_page(struct vm_area_struct *vma, +extern void copy_from_user_page(struct vm_area_struct *vma, struct page *page, unsigned long vaddr, void *dst, const void *src, - unsigned long len) -{ - if (cpu_has_dc_aliases) - flush_cache_page(vma, vaddr, page_to_pfn(page)); - memcpy(dst, src, len); -} + unsigned long len); extern void (*flush_cache_sigtramp)(unsigned long addr); extern void (*flush_icache_all)(void); Index: upstream-linus/include/asm-mips/fixmap.h =================================================================== --- upstream-linus.orig/include/asm-mips/fixmap.h 2006-10-12 18:51:03.000000000 +0100 +++ upstream-linus/include/asm-mips/fixmap.h 2006-10-12 18:51:18.000000000 +0100 @@ -45,8 +45,16 @@ * fix-mapped? */ enum fixed_addresses { +#define FIX_N_COLOURS 8 + FIX_CMAP_BEGIN, +#ifdef CONFIG_MIPS_MT_SMTC + FIX_CMAP_END = FIX_CMAP_BEGIN + (FIX_N_COLOURS * NR_CPUS), +#else + FIX_CMAP_END = FIX_CMAP_BEGIN + FIX_N_COLOURS, +#endif #ifdef CONFIG_HIGHMEM - FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */ + /* reserved pte's for temporary kernel mappings */ + FIX_KMAP_BEGIN = FIX_CMAP_END + 1, FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1, #endif __end_of_fixed_addresses @@ -70,9 +78,9 @@ extern void __set_fixmap (enum fixed_add * at the top of mem.. */ #if defined(CONFIG_CPU_TX39XX) || defined(CONFIG_CPU_TX49XX) -#define FIXADDR_TOP (0xff000000UL - 0x2000) +#define FIXADDR_TOP ((unsigned long)(long)(int)(0xff000000 - 0x20000)) #else -#define FIXADDR_TOP (0xffffe000UL) +#define FIXADDR_TOP ((unsigned long)(long)(int)0xfffe0000) #endif #define FIXADDR_SIZE (__end_of_fixed_addresses << PAGE_SHIFT) #define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE) Index: upstream-linus/include/asm-mips/page.h =================================================================== --- upstream-linus.orig/include/asm-mips/page.h 2006-10-12 18:51:03.000000000 +0100 +++ upstream-linus/include/asm-mips/page.h 2006-10-12 18:51:18.000000000 +0100 @@ -34,8 +34,6 @@ #ifndef __ASSEMBLY__ -#include <asm/cpu-features.h> - extern void clear_page(void * page); extern void copy_page(void * to, void * from); @@ -59,16 +57,13 @@ static inline void clear_user_page(void flush_data_cache_page((unsigned long)addr); } -static inline void copy_user_page(void *vto, void *vfrom, unsigned long vaddr, - struct page *to) -{ - extern void (*flush_data_cache_page)(unsigned long addr); +extern void copy_user_page(void *vto, void *vfrom, unsigned long vaddr, + struct page *to); +struct vm_area_struct; +extern void copy_user_highpage(struct page *to, struct page *from, + unsigned long vaddr, struct vm_area_struct *vma); - copy_page(vto, vfrom); - if (!cpu_has_ic_fills_f_dc || - pages_do_alias((unsigned long)vto, vaddr & PAGE_MASK)) - flush_data_cache_page((unsigned long)vto); -} +#define __HAVE_ARCH_COPY_USER_HIGHPAGE /* * These are used to make use of C type-checking.. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] MIPS: Fix COW D-cache aliasing on fork 2006-10-19 16:35 ` [PATCH 3/3] MIPS: Fix COW D-cache aliasing on fork Ralf Baechle @ 2006-10-22 2:32 ` Ralf Baechle 0 siblings, 0 replies; 46+ messages in thread From: Ralf Baechle @ 2006-10-22 2:32 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton; +Cc: linux-kernel, Atsushi Nemoto With parts of the previous patch 3/3 just having been applied this new patch replaces it. < ---- snip ---- > From: Atsushi Nemoto <anemo@mba.ocn.ne.jp> MIPS: Fix COW D-cache aliasing on fork Provide a custom copy_user_highpage() to deal with aliasing issues on MIPS. It uses kmap_coherent() to map an user page for kernel with same color. The main part of this patch was originally written by Ralf Baechle; Atushi Nemoto did the the debugging. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> arch/mips/mm/init.c | 25 +++++++++++++++++++++++++ include/asm-mips/page.h | 17 ++++++----------- 2 files changed, 31 insertions(+), 11 deletions(-) Index: upstream-alias/arch/mips/mm/init.c =================================================================== --- upstream-alias.orig/arch/mips/mm/init.c 2006-10-22 03:08:41.000000000 +0100 +++ upstream-alias/arch/mips/mm/init.c 2006-10-22 03:08:45.000000000 +0100 @@ -203,6 +203,31 @@ static inline void kunmap_coherent(struc preempt_check_resched(); } +void copy_user_highpage(struct page *to, struct page *from, + unsigned long vaddr, struct vm_area_struct *vma) +{ + void *vfrom, *vto; + + vto = kmap_atomic(to, KM_USER1); + if (cpu_has_dc_aliases) { + vfrom = kmap_coherent(from, vaddr); + copy_page(vto, vfrom); + kunmap_coherent(from); + } else { + vfrom = kmap_atomic(from, KM_USER0); + copy_page(vto, vfrom); + kunmap_atomic(vfrom, KM_USER0); + } + if (((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc) || + pages_do_alias((unsigned long)vto, vaddr & PAGE_MASK)) + flush_data_cache_page((unsigned long)vto); + kunmap_atomic(vto, KM_USER1); + /* Make sure this page is cleared on other CPU's too before using it */ + smp_wmb(); +} + +EXPORT_SYMBOL(copy_user_highpage); + void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long vaddr, void *dst, const void *src, unsigned long len) Index: upstream-alias/include/asm-mips/page.h =================================================================== --- upstream-alias.orig/include/asm-mips/page.h 2006-10-22 03:08:41.000000000 +0100 +++ upstream-alias/include/asm-mips/page.h 2006-10-22 03:08:45.000000000 +0100 @@ -34,8 +34,6 @@ #ifndef __ASSEMBLY__ -#include <asm/cpu-features.h> - extern void clear_page(void * page); extern void copy_page(void * to, void * from); @@ -59,16 +57,13 @@ static inline void clear_user_page(void flush_data_cache_page((unsigned long)addr); } -static inline void copy_user_page(void *vto, void *vfrom, unsigned long vaddr, - struct page *to) -{ - extern void (*flush_data_cache_page)(unsigned long addr); +extern void copy_user_page(void *vto, void *vfrom, unsigned long vaddr, + struct page *to); +struct vm_area_struct; +extern void copy_user_highpage(struct page *to, struct page *from, + unsigned long vaddr, struct vm_area_struct *vma); - copy_page(vto, vfrom); - if (!cpu_has_ic_fills_f_dc || - pages_do_alias((unsigned long)vto, vaddr & PAGE_MASK)) - flush_data_cache_page((unsigned long)vto); -} +#define __HAVE_ARCH_COPY_USER_HIGHPAGE /* * These are used to make use of C type-checking.. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-19 16:35 [PATCH 1/3] Fix COW D-cache aliasing on fork Ralf Baechle 2006-10-19 16:35 ` [PATCH 2/3] Pass vma argument to copy_user_highpage() Ralf Baechle 2006-10-19 16:35 ` [PATCH 3/3] MIPS: Fix COW D-cache aliasing on fork Ralf Baechle @ 2006-10-19 17:46 ` Nick Piggin 2006-10-19 18:13 ` Ralf Baechle 2 siblings, 1 reply; 46+ messages in thread From: Nick Piggin @ 2006-10-19 17:46 UTC (permalink / raw) To: Ralf Baechle; +Cc: Linus Torvalds, Andrew Morton, linux-kernel, Atsushi Nemoto Ralf Baechle wrote: > From: Atsushi Nemoto <anemo@mba.ocn.ne.jp> > > Problem: > > 1. There is a process containing two thread (T1 and T2). The > thread T1 calls fork(). Then dup_mmap() function called on T1 context. > > static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) > ... > flush_cache_mm(current->mm); > ... /* A */ > (write-protect all Copy-On-Write pages) > ... /* B */ > flush_tlb_mm(current->mm); > ... > > 2. When preemption happens between A and B (or on SMP kernel), the > thread T2 can run and modify data on COW pages without page fault > (modified data will stay in cache). > > 3. Some time after fork() completed, the thread T2 may cause a page > fault by write-protect on a COW page. > > 4. Then data of the COW page will be copied to newly allocated > physical page (copy_cow_page()). It reads data via kernel mapping. > The kernel mapping can have different 'color' with user space > mapping of the thread T2 (dcache aliasing). Therefore > copy_cow_page() will copy stale data. Then the modified data in > cache will be lost. What about if you just flush the caches after write protecting all COW pages? Would that work? Simpler? Better performance? (I don't know) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-19 17:46 ` [PATCH 1/3] " Nick Piggin @ 2006-10-19 18:13 ` Ralf Baechle 2006-10-19 18:48 ` Nick Piggin 2006-10-19 22:59 ` David Miller 0 siblings, 2 replies; 46+ messages in thread From: Ralf Baechle @ 2006-10-19 18:13 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Andrew Morton, linux-kernel, Atsushi Nemoto On Fri, Oct 20, 2006 at 03:46:35AM +1000, Nick Piggin wrote: > What about if you just flush the caches after write protecting all > COW pages? Would that work? Simpler? Better performance? (I don't know) That would require changing the order of cache flush and tlb flush. To keep certain architectures that require a valid translation in the TLB the cacheflush has to be done first. Not sure if those architectures need a writeable mapping for dirty cachelines - I think hypersparc was one of them. Ralf ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-19 18:13 ` Ralf Baechle @ 2006-10-19 18:48 ` Nick Piggin 2006-10-19 22:59 ` David Miller 1 sibling, 0 replies; 46+ messages in thread From: Nick Piggin @ 2006-10-19 18:48 UTC (permalink / raw) To: Ralf Baechle; +Cc: Linus Torvalds, Andrew Morton, linux-kernel, Atsushi Nemoto Ralf Baechle wrote: > On Fri, Oct 20, 2006 at 03:46:35AM +1000, Nick Piggin wrote: > > >>What about if you just flush the caches after write protecting all >>COW pages? Would that work? Simpler? Better performance? (I don't know) > > > That would require changing the order of cache flush and tlb flush. To Can't we just move flush_cache_mm to below the copy_page_range, before the flush_tlb_mm? > keep certain architectures that require a valid translation in the TLB > the cacheflush has to be done first. Not sure if those architectures need > a writeable mapping for dirty cachelines - I think hypersparc was one > of them. If the cache is dirty, then the TLB must have a writeable mapping in it, mustn't it? If there is an architecture where this isn't the case, then the current code is broken anyway, because in your example the T2 thread is dirtying data right before the mapping gets write protected anyway. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-19 18:13 ` Ralf Baechle 2006-10-19 18:48 ` Nick Piggin @ 2006-10-19 22:59 ` David Miller 2006-10-20 14:39 ` Nick Piggin 1 sibling, 1 reply; 46+ messages in thread From: David Miller @ 2006-10-19 22:59 UTC (permalink / raw) To: ralf; +Cc: nickpiggin, torvalds, akpm, linux-kernel, anemo From: Ralf Baechle <ralf@linux-mips.org> Date: Thu, 19 Oct 2006 19:13:46 +0100 > That would require changing the order of cache flush and tlb flush. > To keep certain architectures that require a valid translation in > the TLB the cacheflush has to be done first. Not sure if those > architectures need a writeable mapping for dirty cachelines - I > think hypersparc was one of them. There just has to be "a mapping" in the TLB so that the L2 cache can translate the virtual address to a physical one for the writeback to main memory. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-19 22:59 ` David Miller @ 2006-10-20 14:39 ` Nick Piggin 2006-10-20 15:49 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: Nick Piggin @ 2006-10-20 14:39 UTC (permalink / raw) To: David Miller; +Cc: ralf, torvalds, akpm, linux-kernel, anemo David Miller wrote: > From: Ralf Baechle <ralf@linux-mips.org> > Date: Thu, 19 Oct 2006 19:13:46 +0100 > > >>That would require changing the order of cache flush and tlb flush. >>To keep certain architectures that require a valid translation in >>the TLB the cacheflush has to be done first. Not sure if those >>architectures need a writeable mapping for dirty cachelines - I >>think hypersparc was one of them. > > > There just has to be "a mapping" in the TLB so that the L2 cache can > translate the virtual address to a physical one for the writeback to > main memory. So moving the flush_cache_mm below the copy_page_range, to just before the flush_tlb_mm, would work then? This would make the race much smaller than with this patchset. But doesn't that still leave a race? What if another thread writes to cache after we have flushed it but before flushing the TLBs? Although we've marked the the ptes readonly, the CPU won't trap if the TLB is valid? There must be some special way for the arch to handle this, but I can't see it. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 14:39 ` Nick Piggin @ 2006-10-20 15:49 ` Linus Torvalds 2006-10-20 15:57 ` Nick Piggin 2006-10-20 19:36 ` David Miller 2006-10-20 16:05 ` Ralf Baechle 2006-10-20 19:23 ` David Miller 2 siblings, 2 replies; 46+ messages in thread From: Linus Torvalds @ 2006-10-20 15:49 UTC (permalink / raw) To: Nick Piggin; +Cc: David Miller, ralf, akpm, linux-kernel, anemo On Sat, 21 Oct 2006, Nick Piggin wrote: > > So moving the flush_cache_mm below the copy_page_range, to just > before the flush_tlb_mm, would work then? This would make the > race much smaller than with this patchset. > > But doesn't that still leave a race? > > What if another thread writes to cache after we have flushed it > but before flushing the TLBs? Although we've marked the the ptes > readonly, the CPU won't trap if the TLB is valid? There must be > some special way for the arch to handle this, but I can't see it. Why not do the cache flush _after_ the TLB flush? There's still a mapping, and never mind that it's read-only: the _mapping_ still exists, and I doubt any CPU will not do the writeback (the readonly bit had better affect the _frontend_ of the memory pipeline, but affectign the back end would be insane and very hard, since you can't raise a fault any more). Hmm? Linus ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 15:49 ` Linus Torvalds @ 2006-10-20 15:57 ` Nick Piggin 2006-10-20 16:36 ` Linus Torvalds 2006-10-20 19:36 ` David Miller 1 sibling, 1 reply; 46+ messages in thread From: Nick Piggin @ 2006-10-20 15:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: David Miller, ralf, akpm, linux-kernel, anemo Linus Torvalds wrote: > > On Sat, 21 Oct 2006, Nick Piggin wrote: > >>So moving the flush_cache_mm below the copy_page_range, to just >>before the flush_tlb_mm, would work then? This would make the >>race much smaller than with this patchset. >> >>But doesn't that still leave a race? >> >>What if another thread writes to cache after we have flushed it >>but before flushing the TLBs? Although we've marked the the ptes >>readonly, the CPU won't trap if the TLB is valid? There must be >>some special way for the arch to handle this, but I can't see it. > > > Why not do the cache flush _after_ the TLB flush? There's still a mapping, > and never mind that it's read-only: the _mapping_ still exists, and I > doubt any CPU will not do the writeback (the readonly bit had better > affect the _frontend_ of the memory pipeline, but affectign the back end > would be insane and very hard, since you can't raise a fault any more). I didn't think that would work if there is no TLB. But if the writeback can cause a TLB reload, and then bypass the readonly protection, then yes would close all races. Of course, you may also want to do the racy cache flush before the TLB flush as well, so you don't immediately take a load of TLB misses to write it out. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 15:57 ` Nick Piggin @ 2006-10-20 16:36 ` Linus Torvalds 2006-10-20 16:47 ` Nick Piggin 0 siblings, 1 reply; 46+ messages in thread From: Linus Torvalds @ 2006-10-20 16:36 UTC (permalink / raw) To: Nick Piggin; +Cc: David Miller, ralf, akpm, linux-kernel, anemo On Sat, 21 Oct 2006, Nick Piggin wrote: > > I didn't think that would work if there is no TLB. But if the writeback > can cause a TLB reload, and then bypass the readonly protection, then > yes would close all races. On the other hand, doing the cache flush at COW time is "kind of equivalent" to just doing it after the TLB flush. It's now just _much_ after the flush ;) So maybe the COW D$ aliasing patch-series is just the right thing to do. Not worry about D$ at _all_ when doing the actual fork, and only worry about it on an actual COW event. Hmm? Linus ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 16:36 ` Linus Torvalds @ 2006-10-20 16:47 ` Nick Piggin 2006-10-20 17:16 ` Linus Torvalds 2006-10-21 0:46 ` Ralf Baechle 0 siblings, 2 replies; 46+ messages in thread From: Nick Piggin @ 2006-10-20 16:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: David Miller, ralf, akpm, linux-kernel, anemo Linus Torvalds wrote: > > On Sat, 21 Oct 2006, Nick Piggin wrote: > >>I didn't think that would work if there is no TLB. But if the writeback >>can cause a TLB reload, and then bypass the readonly protection, then >>yes would close all races. > > > On the other hand, doing the cache flush at COW time is "kind of > equivalent" to just doing it after the TLB flush. It's now just _much_ > after the flush ;) > > So maybe the COW D$ aliasing patch-series is just the right thing to do. > Not worry about D$ at _all_ when doing the actual fork, and only worry > about it on an actual COW event. Hmm? Well if we have the calls in there, we should at least make them work right for the architectures there now. At the moment the flush_cache_mm before the copy_page_range wouldn't seem to do anything if you can still have threads dirty the cache again through existing TLB entries. I don't think that flushing on COW is exactly right though, because dirty data can remain invisible if you're only doing reads (no write, no flush). And if that cache gets written back at some point, you're going to see supposedly RO data change underneath you. I think? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 16:47 ` Nick Piggin @ 2006-10-20 17:16 ` Linus Torvalds 2006-10-20 17:37 ` Nick Piggin 2006-10-21 0:46 ` Ralf Baechle 1 sibling, 1 reply; 46+ messages in thread From: Linus Torvalds @ 2006-10-20 17:16 UTC (permalink / raw) To: Nick Piggin Cc: David Miller, ralf, Andrew Morton, Linux Kernel Mailing List, anemo, linux-arch, Martin Schwidefsky On Sat, 21 Oct 2006, Nick Piggin wrote: > > So maybe the COW D$ aliasing patch-series is just the right thing to do. Not > > worry about D$ at _all_ when doing the actual fork, and only worry about it > > on an actual COW event. Hmm? > > Well if we have the calls in there, we should at least make them work > right for the architectures there now. At the moment the flush_cache_mm > before the copy_page_range wouldn't seem to do anything if you can still > have threads dirty the cache again through existing TLB entries. > > I don't think that flushing on COW is exactly right though, because dirty > data can remain invisible if you're only doing reads (no write, no flush). You're right. A virtually indexed cache needs the flush _before_ we return from the fork into a new process (since otherwise the dirty data won't be visible in the new virtual address space). So you've convinced me. Flushing at COW time _cannot_ be right, because it by definition means that there has been a time when the new process didn't see the dirty data in the case of a virtual index. And in the case of a physical index it cannot matter. So I think the right thing to do is to forget about the COW D$ series (which probably _hides_ most of the problems in practice, so it "works" that way) and instead go with Ralf's last patch that just moves the flush_cache_mm() to after the TLB flush. We do need to have all the architecture people (especially S390, which has been very strange in this regard in the past) check that it's ok. The _mappings_ are still valid, so S390 should be able to do the write-back, but there may be architectures that would want to do the flush _both_ before and after (for performance reasons - if writing out dirty data requires a TLB lookup, doing most fo the writeback before is probably a better thing, and then we can do a _second_ writeback after the flush to close the race with some other thread dirtying the pages before the TLB was marked read-only). I added linux-arch and Martin Schwidefsky (s390) to the Cc:. Guys, in case you missed the earlier discussion: there's a suggested patch by Ralf Baechle on linux-kernel (but it does just the "flush after" version, not the "perhaps we need it both before and after" thing I theorise about above). Message-ID: 20061020160538.GB18649@linux-mips.org. Linus ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 17:16 ` Linus Torvalds @ 2006-10-20 17:37 ` Nick Piggin 0 siblings, 0 replies; 46+ messages in thread From: Nick Piggin @ 2006-10-20 17:37 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, ralf, Andrew Morton, Linux Kernel Mailing List, anemo, linux-arch, Martin Schwidefsky Linus Torvalds wrote: > > On Sat, 21 Oct 2006, Nick Piggin wrote: > >>>So maybe the COW D$ aliasing patch-series is just the right thing to do. Not >>>worry about D$ at _all_ when doing the actual fork, and only worry about it >>>on an actual COW event. Hmm? >> >>Well if we have the calls in there, we should at least make them work >>right for the architectures there now. At the moment the flush_cache_mm >>before the copy_page_range wouldn't seem to do anything if you can still >>have threads dirty the cache again through existing TLB entries. >> >>I don't think that flushing on COW is exactly right though, because dirty >>data can remain invisible if you're only doing reads (no write, no flush). > > > You're right. A virtually indexed cache needs the flush _before_ we return > from the fork into a new process (since otherwise the dirty data won't be > visible in the new virtual address space). > > So you've convinced me. Flushing at COW time _cannot_ be right, because it > by definition means that there has been a time when the new process didn't > see the dirty data in the case of a virtual index. And in the case of a > physical index it cannot matter. > > So I think the right thing to do is to forget about the COW D$ series > (which probably _hides_ most of the problems in practice, so it "works" > that way) and instead go with Ralf's last patch that just moves the > flush_cache_mm() to after the TLB flush. So long as we don't move around the mmap semaphores, I'm OK with that patch... > We do need to have all the architecture people (especially S390, which has > been very strange in this regard in the past) check that it's ok. The > _mappings_ are still valid, so S390 should be able to do the write-back, > but there may be architectures that would want to do the flush _both_ > before and after (for performance reasons - if writing out dirty data > requires a TLB lookup, doing most fo the writeback before is probably a > better thing, and then we can do a _second_ writeback after the flush to > close the race with some other thread dirtying the pages before the TLB > was marked read-only). Yes, that's my theory too. Probably the thing to aim for is replacing that API with a new single call to flush caches and tlbs, and the arch can do what best suits. But for now, to get it actually *working*, moving the flush_cache_mm seems like the first step. > I added linux-arch and Martin Schwidefsky (s390) to the Cc:. > > Guys, in case you missed the earlier discussion: there's a suggested patch > by Ralf Baechle on linux-kernel (but it does just the "flush after" > version, not the "perhaps we need it both before and after" thing I > theorise about above). Message-ID: 20061020160538.GB18649@linux-mips.org. As I mentioned there, we probably want to to check that other places which flush caches before invalidating TLBs (eg. most of the kernel) is OK in the presence of concurrent writes to valid TLBs from other threads. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 16:47 ` Nick Piggin 2006-10-20 17:16 ` Linus Torvalds @ 2006-10-21 0:46 ` Ralf Baechle 1 sibling, 0 replies; 46+ messages in thread From: Ralf Baechle @ 2006-10-21 0:46 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, David Miller, akpm, linux-kernel, anemo On Sat, Oct 21, 2006 at 02:47:56AM +1000, Nick Piggin wrote: > >So maybe the COW D$ aliasing patch-series is just the right thing to do. > >Not worry about D$ at _all_ when doing the actual fork, and only worry > >about it on an actual COW event. Hmm? > > Well if we have the calls in there, we should at least make them work > right for the architectures there now. At the moment the flush_cache_mm > before the copy_page_range wouldn't seem to do anything if you can still > have threads dirty the cache again through existing TLB entries. If you're talking about getting 2.6.19 out of the door without risking too much wreckage, I'm all for it. > I don't think that flushing on COW is exactly right though, because dirty > data can remain invisible if you're only doing reads (no write, no flush). > And if that cache gets written back at some point, you're going to see > supposedly RO data change underneath you. I think? You will not be able to avoid aliases at COW breaking time by any kind of cache flush. In a VIPT cache the userspace page is living at it's userspace address but the current implemention of copy_user_page will try to copy it at it's kernel space address and those two may not live at the same cache index. So, when breaking a COW mapping you have two options: 1) copying involves touch a userspace mapped page at it's kernel address. This may result in an alias so apropriate cache flushing will be needed. On a MIPS system this flush itself is like 1,000 cycles but could easily several times as much. Add the cost of bringing all the data that was blown away from the cache back later on. 2) try to be smart and avoid the creation of aliases by creating proper temporary aliases. Creating and tearing down a TLB mapping is dirt cheap. The current strategy is #1 which happens to work for MIPS because there is at most an alias between clean lines which is harmless on MIPS but will blow up on PA8800 - without overkill flushing. It is possible to implement the common sequence of fork + exec such that the child never ever breaks a COW page, not even a stack page. So why should a PIPT or VIPT cache be flushed in such a case? We can do better than that. Ralf ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 15:49 ` Linus Torvalds 2006-10-20 15:57 ` Nick Piggin @ 2006-10-20 19:36 ` David Miller 2006-10-20 19:54 ` Linus Torvalds 1 sibling, 1 reply; 46+ messages in thread From: David Miller @ 2006-10-20 19:36 UTC (permalink / raw) To: torvalds; +Cc: nickpiggin, ralf, akpm, linux-kernel, anemo From: Linus Torvalds <torvalds@osdl.org> Date: Fri, 20 Oct 2006 08:49:35 -0700 (PDT) > Why not do the cache flush _after_ the TLB flush? There's still a mapping, > and never mind that it's read-only: the _mapping_ still exists, and I > doubt any CPU will not do the writeback (the readonly bit had better > affect the _frontend_ of the memory pipeline, but affectign the back end > would be insane and very hard, since you can't raise a fault any more). > > Hmm? You get an asynchronous fault from the L2 cache, and that's also what happens when the TLB entry is missing during L2 writeback too. You get a level 15 non-maskable IRQ when these asynchronous errors happen. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 19:36 ` David Miller @ 2006-10-20 19:54 ` Linus Torvalds 2006-10-20 19:58 ` David Miller 0 siblings, 1 reply; 46+ messages in thread From: Linus Torvalds @ 2006-10-20 19:54 UTC (permalink / raw) To: David Miller; +Cc: nickpiggin, ralf, akpm, linux-kernel, anemo On Fri, 20 Oct 2006, David Miller wrote: > > You get an asynchronous fault from the L2 cache, and that's also what > happens when the TLB entry is missing during L2 writeback too. You > get a level 15 non-maskable IRQ when these asynchronous errors happen. Well, sparc always was crud. I can see the missing tlb entry, but if it's been turned read-only, the write-back should still work (it clearly _was_ writable when the write that dirtied the cacheline happened). Anyway, if you cannot flush a read-only mapping, then the "flush at COW fault time" won't work _either_, since the original mapping is still read-only. So regardless, the COW-time flush cannot work. But the "flush before the TLB flush, and then flush after in case we had a race" approach should work as well as it can in practice, no? Linus ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 19:54 ` Linus Torvalds @ 2006-10-20 19:58 ` David Miller 2006-10-20 20:10 ` Linus Torvalds 0 siblings, 1 reply; 46+ messages in thread From: David Miller @ 2006-10-20 19:58 UTC (permalink / raw) To: torvalds; +Cc: nickpiggin, ralf, akpm, linux-kernel, anemo From: Linus Torvalds <torvalds@osdl.org> Date: Fri, 20 Oct 2006 12:54:17 -0700 (PDT) > Well, sparc always was crud. I can see the missing tlb entry, but if it's > been turned read-only, the write-back should still work (it clearly _was_ > writable when the write that dirtied the cacheline happened). I did some more digging, here's what I think the hardware actually does: 1) On L2 cacheline load, the "user" and "writable" protection bits are propagated from the TLB entry into the L2 cache line. Access checks are done on L2 cache hit using this cached copy of the two protection bits. 2) On L2 dirty cacheline writeback, the physical address is obtained from the TLB So what you guys are suggesting should probably work fine. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 19:58 ` David Miller @ 2006-10-20 20:10 ` Linus Torvalds 2006-10-20 20:59 ` Russell King ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: Linus Torvalds @ 2006-10-20 20:10 UTC (permalink / raw) To: David Miller Cc: nickpiggin, ralf, Andrew Morton, Linux Kernel Mailing List, anemo, linux-arch, Martin Schwidefsky On Fri, 20 Oct 2006, David Miller wrote: > > I did some more digging, here's what I think the hardware actually > does: Ok, this sounds sane. What should we do about this? How does this patch look to people? (Totally untested, and I'm not sure we should even do that whole "oldmm->mm_users" test, but I'm throwing it out here for discussion, in case it matters for performance. The second D$ flush should obviously be unnecessary for the common unthreaded case, which is why none of this has mattered historically, I think). Comments? We need ARM, MIPS, sparc and S390 at the very least to sign off on this, and somebody to write a nice explanation for the changelog (and preferably do this through -mm too). Linus --- diff --git a/kernel/fork.c b/kernel/fork.c index 29ebb30..14c6a1d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -287,8 +287,18 @@ static inline int dup_mmap(struct mm_str } retval = 0; out: - up_write(&mm->mmap_sem); flush_tlb_mm(oldmm); + /* + * If we have other threads using the old mm, we need to + * flush the D$ again - the other threads might have dirtied + * it more before the TLB got flushed. + * + * After the flush, they can no longer dirty more pages, + * since they are now marked read-only, of course. + */ + if (atomic_read(&oldmm->mm_users) != 1) + flush_cache_mm(oldmm); + up_write(&mm->mmap_sem); up_write(&oldmm->mmap_sem); return retval; fail_nomem_policy: ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 20:10 ` Linus Torvalds @ 2006-10-20 20:59 ` Russell King 2006-10-20 21:06 ` David Miller 2006-10-20 21:12 ` Linus Torvalds 2006-10-20 21:49 ` Ralf Baechle 2006-10-23 8:50 ` Martin Schwidefsky 2 siblings, 2 replies; 46+ messages in thread From: Russell King @ 2006-10-20 20:59 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, nickpiggin, ralf, Andrew Morton, Linux Kernel Mailing List, anemo, linux-arch, Martin Schwidefsky On Fri, Oct 20, 2006 at 01:10:59PM -0700, Linus Torvalds wrote: > On Fri, 20 Oct 2006, David Miller wrote: > > I did some more digging, here's what I think the hardware actually > > does: > > Ok, this sounds sane. > > What should we do about this? How does this patch look to people? > > (Totally untested, and I'm not sure we should even do that whole > "oldmm->mm_users" test, but I'm throwing it out here for discussion, in > case it matters for performance. The second D$ flush should obviously be > unnecessary for the common unthreaded case, which is why none of this has > mattered historically, I think). > > Comments? We need ARM, MIPS, sparc and S390 at the very least to sign off > on this, and somebody to write a nice explanation for the changelog (and > preferably do this through -mm too). Well, looking at do_wp_page() I'm now quite concerned about ARM and COW. I can't see how this code could _possibly_ work with a virtually indexed cache as it stands. Yet, the kernel does appear to work. I'm afraid I'm utterly confused with the Linux MM in this day and age, so I don't think I can even consider commenting on this change. The majority of ARM caches are VIVT, so data read via the kernel mappings definitely does not hit the same cache lines as data accessed via the user mappings. Our copy_user_page() function merely copies between the two kernel mappings of the pages so with VIVT caches the kernel mappings - as it always has done since it's original invention. However, when I look at this code now, I see _no where_ where we synchronise the cache between the userspace mapping and the kernel space mapping before copying a COW page. So I'm afraid I'm going to have to hold up my hand and say "I don't understand the Linux MM anymore". -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 Serial core ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 20:59 ` Russell King @ 2006-10-20 21:06 ` David Miller 2006-10-20 21:17 ` Russell King 2006-10-20 21:12 ` Linus Torvalds 1 sibling, 1 reply; 46+ messages in thread From: David Miller @ 2006-10-20 21:06 UTC (permalink / raw) To: rmk+lkml Cc: torvalds, nickpiggin, ralf, akpm, linux-kernel, anemo, linux-arch, schwidefsky From: Russell King <rmk+lkml@arm.linux.org.uk> Date: Fri, 20 Oct 2006 21:59:29 +0100 > However, when I look at this code now, I see _no where_ where we synchronise > the cache between the userspace mapping and the kernel space mapping before > copying a COW page. When the user obtains write access to the page, we'll flush. Since there are many locations at which write access can be obtained, there are many locations where the synchronization is obtained. One popular way to obtain the synchronization is to implement flush_dcache_page() to flush, and implement clear_page() and copy_user_page() to clear and copy pages in kernel space at special temporrary mappings whose virtual address will alias up properly with userspace's mapping. That's why we pass a virtual address to these two arch functions. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 21:06 ` David Miller @ 2006-10-20 21:17 ` Russell King 2006-10-20 21:30 ` David Miller 0 siblings, 1 reply; 46+ messages in thread From: Russell King @ 2006-10-20 21:17 UTC (permalink / raw) To: David Miller Cc: torvalds, nickpiggin, ralf, akpm, linux-kernel, anemo, linux-arch, schwidefsky On Fri, Oct 20, 2006 at 02:06:19PM -0700, David Miller wrote: > From: Russell King <rmk+lkml@arm.linux.org.uk> > Date: Fri, 20 Oct 2006 21:59:29 +0100 > > > However, when I look at this code now, I see _no where_ where we synchronise > > the cache between the userspace mapping and the kernel space mapping before > > copying a COW page. > > When the user obtains write access to the page, we'll flush. > > Since there are many locations at which write access can be > obtained, there are many locations where the synchronization > is obtained. > > One popular way to obtain the synchronization is to implement > flush_dcache_page() to flush, and implement clear_page() and > copy_user_page() to clear and copy pages in kernel space at > special temporrary mappings whose virtual address will alias > up properly with userspace's mapping. That's why we pass a > virtual address to these two arch functions. I did say I had a VIVT cache. With such a cache, the *only* place where you can read data written via one mapping is via that very same mapping. There is no other virtual address which will give you coherent access to the data in another mapping. The majority of ARMs to date have been VIVT, and the majority of Linux kernels have worked fine (albiet the "recent" breakage of PIO block IO.) I'm now in the situation where I come back to look at the MM code and, to put it quite frankly, I can't see any possible way for ARM to work with this code. In practice, however, it does appear to work. I just can't see _why_ it's working. Hence why I'm declaring the "I don't understand" flag and refraining to endorse the patch - I _can't_ endorse what I don't understand. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 Serial core ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 21:17 ` Russell King @ 2006-10-20 21:30 ` David Miller 0 siblings, 0 replies; 46+ messages in thread From: David Miller @ 2006-10-20 21:30 UTC (permalink / raw) To: rmk+lkml Cc: torvalds, nickpiggin, ralf, akpm, linux-kernel, anemo, linux-arch, schwidefsky From: Russell King <rmk+lkml@arm.linux.org.uk> Date: Fri, 20 Oct 2006 22:17:23 +0100 > I did say I had a VIVT cache. And everything I said was premised with this understanding :-) ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 20:59 ` Russell King 2006-10-20 21:06 ` David Miller @ 2006-10-20 21:12 ` Linus Torvalds 2006-10-20 21:28 ` Russell King 2006-10-20 21:41 ` Ralf Baechle 1 sibling, 2 replies; 46+ messages in thread From: Linus Torvalds @ 2006-10-20 21:12 UTC (permalink / raw) To: Russell King Cc: David Miller, nickpiggin, ralf, Andrew Morton, Linux Kernel Mailing List, anemo, linux-arch, Martin Schwidefsky On Fri, 20 Oct 2006, Russell King wrote: > > Well, looking at do_wp_page() I'm now quite concerned about ARM and COW. > I can't see how this code could _possibly_ work with a virtually indexed > cache as it stands. Yet, the kernel does appear to work. It really shouldn't need any extra code, exactly because by the time it hits any page-fault, the caches had better be in sync with the physical page contents _anyway_ (yes, being virtual, the caches will _duplicate_ the contents, but since the pages are read-only, that aliasing should be perfectly fine). > I'm afraid I'm utterly confused with the Linux MM in this day and age, so > I don't think I can even consider commenting on this change. Well, we'd need somebody to verify that it still works, but quite frankly, the likelihood of it breaking anything seems basically nil. > However, when I look at this code now, I see _no where_ where we synchronise > the cache between the userspace mapping and the kernel space mapping before > copying a COW page. At the COW, it should be synchronized already, exactly because we did the cache_flush_mm() when we _created_ the COW mapping in the first place. It's just that we weren't quite careful enough at that time (and even then, that would only matter for some really really unlikely and strange situations that only happen when you fork() from a _threaded_ environment, so it shouldn't be anything you'd notice under normal load). I think. > So I'm afraid I'm going to have to hold up my hand and say "I don't > understand the Linux MM anymore". There are few enough people who understand it even though they're supposed to. I certainly have to always go back and look and read the code when there is anything subtle going on, and even then I want to be backed up by one of the _competent_ people ;) Linus ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 21:12 ` Linus Torvalds @ 2006-10-20 21:28 ` Russell King 2006-10-20 21:41 ` Ralf Baechle 1 sibling, 0 replies; 46+ messages in thread From: Russell King @ 2006-10-20 21:28 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, nickpiggin, ralf, Andrew Morton, Linux Kernel Mailing List, anemo, linux-arch, Martin Schwidefsky On Fri, Oct 20, 2006 at 02:12:11PM -0700, Linus Torvalds wrote: > On Fri, 20 Oct 2006, Russell King wrote: > > Well, looking at do_wp_page() I'm now quite concerned about ARM and COW. > > I can't see how this code could _possibly_ work with a virtually indexed > > cache as it stands. Yet, the kernel does appear to work. > > It really shouldn't need any extra code, exactly because by the time it > hits any page-fault, the caches had better be in sync with the physical > page contents _anyway_ (yes, being virtual, the caches will _duplicate_ > the contents, but since the pages are read-only, that aliasing should be > perfectly fine). Oh, of course! That explains why it actually works as expected! Thanks for filling back in that bit of swapped-out-years-ago-and-lost information. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 Serial core ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 21:12 ` Linus Torvalds 2006-10-20 21:28 ` Russell King @ 2006-10-20 21:41 ` Ralf Baechle 2006-10-21 16:28 ` Atsushi Nemoto 1 sibling, 1 reply; 46+ messages in thread From: Ralf Baechle @ 2006-10-20 21:41 UTC (permalink / raw) To: Linus Torvalds Cc: Russell King, David Miller, nickpiggin, Andrew Morton, Linux Kernel Mailing List, anemo, linux-arch, Martin Schwidefsky, James Bottomley On Fri, Oct 20, 2006 at 02:12:11PM -0700, Linus Torvalds wrote: > > Well, looking at do_wp_page() I'm now quite concerned about ARM and COW. > > I can't see how this code could _possibly_ work with a virtually indexed > > cache as it stands. Yet, the kernel does appear to work. > > It really shouldn't need any extra code, exactly because by the time it > hits any page-fault, the caches had better be in sync with the physical > page contents _anyway_ (yes, being virtual, the caches will _duplicate_ > the contents, but since the pages are read-only, that aliasing should be > perfectly fine). Until yesterday I also thought multiple read-only copies wouldn't do any harm. Well, until I learned about the wonderful behaviour of the PA8800 caches. PA8800 has VIPT primary caches, PIPT secondary caches. And the sinister part - caches are exclusive, that is a cacheline is either in L1 or L2 but never in both and can migrate between L1 and L2. Now onsider the following scenario: o physical address P is mapped to two aliasing addresses V1 and V2 o a load from V1 results in a clean line in L1 caching P at index V1. o a store to V2 results in a clean line in L1 caching P at index V2. o the line at V2 is getting written back to memory. o a victim replacement of the line at V1 results in the _clean_ line migrating back from L1 to L2. -> another read from V2 will return stale data. As consequence flush_cache_mm() on PA (or at least PA8800) currently blows away the entire cache, as Kyle McMartin just told me. The whole 1.5MB L1 and 32MB of L2 making fork an ultraheavy operation. > It's just that we weren't quite careful enough at that time (and even > then, that would only matter for some really really unlikely and strange > situations that only happen when you fork() from a _threaded_ environment, > so it shouldn't be anything you'd notice under normal load). > > I think. The flush is there since a very long time. I have it in my tree since ~ 2.1.36 and I get the feeling anybody every has been seriously revisited the issue since. Ralf ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 21:41 ` Ralf Baechle @ 2006-10-21 16:28 ` Atsushi Nemoto 0 siblings, 0 replies; 46+ messages in thread From: Atsushi Nemoto @ 2006-10-21 16:28 UTC (permalink / raw) To: ralf Cc: torvalds, rmk+lkml, davem, nickpiggin, akpm, linux-kernel, linux-arch, schwidefsky, James.Bottomley On Fri, 20 Oct 2006 22:41:22 +0100, Ralf Baechle <ralf@linux-mips.org> wrote: > > It's just that we weren't quite careful enough at that time (and even > > then, that would only matter for some really really unlikely and strange > > situations that only happen when you fork() from a _threaded_ environment, > > so it shouldn't be anything you'd notice under normal load). > > > > I think. > > The flush is there since a very long time. I have it in my tree since > ~ 2.1.36 and I get the feeling anybody every has been seriously revisited > the issue since. I think calling fork() (or system() or popen() or so) in threaded program is neither very unlikely or strange. But this breakage happens very rarely indeed, especially non-preemptive kernel. During debugging this issue, I had used this test program and slightly modified kernel --- inserting yield() at middle of dup_mmap(). With the modified kernel on 32KB VIPT D$, running this test program some times could reproduce the breakage ("BAD!" messages). I heard PARISC people had successed to reproduce it too. #include <stdio.h> #include <stdlib.h> #include <string.h> #include <pthread.h> #include <unistd.h> #include <sys/types.h> #include <sys/wait.h> static void *thread_func(void *arg) { unsigned char buf[2048], j; int i; for (j = 0; ; j++) { /* fill buf[] with j */ memset(buf, j, sizeof(buf)/2); sched_yield(); memset(buf + sizeof(buf)/2, j, sizeof(buf)/2); sched_yield(); /* check buf[] contents */ for (i = 0; i < sizeof(buf); i++) { if (buf[i] != j) { printf("BAD! %p (%d != %d)\n", buf + i, buf[i], j); exit(1); } } } } int main(int argc, char *argv[]) { int i; pid_t pid; pthread_t tid; for (i = 0; i < 4; i++) pthread_create(&tid, NULL, thread_func, NULL); for (i = 0; i < 100; i++) { pid = fork(); if (pid == -1) { perror("fork"); exit(1); } if (pid) waitpid(pid, NULL, 0); else exit(0); } return 0; } --- Atsushi Nemoto ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 20:10 ` Linus Torvalds 2006-10-20 20:59 ` Russell King @ 2006-10-20 21:49 ` Ralf Baechle 2006-10-20 22:02 ` Linus Torvalds 2006-10-23 8:50 ` Martin Schwidefsky 2 siblings, 1 reply; 46+ messages in thread From: Ralf Baechle @ 2006-10-20 21:49 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, nickpiggin, Andrew Morton, Linux Kernel Mailing List, anemo, linux-arch, Martin Schwidefsky, James Bottomley On Fri, Oct 20, 2006 at 01:10:59PM -0700, Linus Torvalds wrote: > Ok, this sounds sane. > > What should we do about this? How does this patch look to people? > > (Totally untested, and I'm not sure we should even do that whole > "oldmm->mm_users" test, but I'm throwing it out here for discussion, in > case it matters for performance. The second D$ flush should obviously be > unnecessary for the common unthreaded case, which is why none of this has > mattered historically, I think). > > Comments? We need ARM, MIPS, sparc and S390 at the very least to sign off > on this, and somebody to write a nice explanation for the changelog (and > preferably do this through -mm too). As a minimal solution your patch would work for MIPS but performance would be suboptimal. With my D-cache alias series applied the flush_cache_mm() in dup_mmap() becomes entirely redundant. When I delete the call (not part of my patchset) it means 12% faster fork. But I'm not proposing this for 2.6.19. Note this does not make the flush_cache_mm() on process termination redundant ... Ralf ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 21:49 ` Ralf Baechle @ 2006-10-20 22:02 ` Linus Torvalds 2006-10-20 22:22 ` David Miller 0 siblings, 1 reply; 46+ messages in thread From: Linus Torvalds @ 2006-10-20 22:02 UTC (permalink / raw) To: Ralf Baechle Cc: David Miller, nickpiggin, Andrew Morton, Linux Kernel Mailing List, anemo, linux-arch, Martin Schwidefsky, James Bottomley On Fri, 20 Oct 2006, Ralf Baechle wrote: > > As a minimal solution your patch would work for MIPS but performance would be > suboptimal. Not so. > With my D-cache alias series applied the flush_cache_mm() in dup_mmap() > becomes entirely redundant. No it does not, as pointed out by Nick. If there are dirty lines in the virtual cache, they _must_ be flushd long before the COW happens. Because if they are not, they don't show up in the child of the fork (which only sees it's _own_ virtual cache). See? > When I delete the call (not part of my patchset) it means 12% faster > fork. But I'm not proposing this for 2.6.19. I just suspect it means a _buggy_ fork. It so happens (I think), that fork is big enough that it probably flushes the L1 cache _anyway_. Does MIPS have some kind of "flush_cache_mm()" that could only flush user-level caches? Maybe the overhead is from flushing all dirty cachelines, regardless of whether they are kernel or not (and dirty kernel cachelines are going to be the most common by far in that path). Linus ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 22:02 ` Linus Torvalds @ 2006-10-20 22:22 ` David Miller 2006-10-20 22:51 ` Ralf Baechle 0 siblings, 1 reply; 46+ messages in thread From: David Miller @ 2006-10-20 22:22 UTC (permalink / raw) To: torvalds Cc: ralf, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley From: Linus Torvalds <torvalds@osdl.org> Date: Fri, 20 Oct 2006 15:02:39 -0700 (PDT) > On Fri, 20 Oct 2006, Ralf Baechle wrote: > > When I delete the call (not part of my patchset) it means 12% faster > > fork. But I'm not proposing this for 2.6.19. > > I just suspect it means a _buggy_ fork. > > It so happens (I think), that fork is big enough that it probably flushes > the L1 cache _anyway_. My understanding is that this works because in Ralf's original patch (which is the context in which he is removing the flush_cache_mm() call), he uses kmap()/kunmap() to map the page(s) being accessed at a kernel virtual address which will fall into the same cache color as the user virtual address --> no alias problems. Since he does this for every page touched on the kernel side during dup_mmap(), the existing flush_cache_mm() call in dup_mmap() does in fact become redundant. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 22:22 ` David Miller @ 2006-10-20 22:51 ` Ralf Baechle 2006-10-20 23:28 ` Linus Torvalds 0 siblings, 1 reply; 46+ messages in thread From: Ralf Baechle @ 2006-10-20 22:51 UTC (permalink / raw) To: David Miller Cc: torvalds, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley On Fri, Oct 20, 2006 at 03:22:47PM -0700, David Miller wrote: > > On Fri, 20 Oct 2006, Ralf Baechle wrote: > > > When I delete the call (not part of my patchset) it means 12% faster > > > fork. But I'm not proposing this for 2.6.19. > > > > I just suspect it means a _buggy_ fork. > > > > It so happens (I think), that fork is big enough that it probably flushes > > the L1 cache _anyway_. I doubt it; I've tested this on 64K I-cache VIPT, 64K D-cache VIPT. > My understanding is that this works because in Ralf's original patch > (which is the context in which he is removing the flush_cache_mm() > call), he uses kmap()/kunmap() to map the page(s) being accessed at a > kernel virtual address which will fall into the same cache color as > the user virtual address --> no alias problems. > > Since he does this for every page touched on the kernel side during > dup_mmap(), the existing flush_cache_mm() call in dup_mmap() does in > fact become redundant. Correct. It means no cache flush operation to deal with aliases at all left in fork and COW code. Another advantage of this strategy is that we will never have to handle less virtual coherency exceptions. A virtual coherency exception is raised on some MIPS processors when they detect the creation of a cache alias. This allows the software to cleanup caches. Neat as an alarm system for alias debugging but rather expensive to service if large numbers are raised, not available on all processors and also detects the creation of harmless aliases of clean lines, thus a slight annoyance. Ralf ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 22:51 ` Ralf Baechle @ 2006-10-20 23:28 ` Linus Torvalds 2006-10-21 0:06 ` Ralf Baechle 0 siblings, 1 reply; 46+ messages in thread From: Linus Torvalds @ 2006-10-20 23:28 UTC (permalink / raw) To: Ralf Baechle Cc: David Miller, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley On Fri, 20 Oct 2006, Ralf Baechle wrote: > > > My understanding is that this works because in Ralf's original patch > > (which is the context in which he is removing the flush_cache_mm() > > call), he uses kmap()/kunmap() to map the page(s) being accessed at a > > kernel virtual address which will fall into the same cache color as > > the user virtual address --> no alias problems. > > > > Since he does this for every page touched on the kernel side during > > dup_mmap(), the existing flush_cache_mm() call in dup_mmap() does in > > fact become redundant. > > Correct. > > It means no cache flush operation to deal with aliases at all left in > fork and COW code. Umm. That would seem to only happen to work for a direct-mapped virtually indexed cache where the index is taken purely from the virtual address, and there are no "process context" bits in the virtually indexed D$. The moment there are process context bits involved, afaik you absolutely _need_ to flush, because otherwise the other process will never pick up the dirty state (which it would need to reload from memory). That said, maybe nobody does that. Virtual caches are a total braindamage in the first place, so hopefully they have limited use. Linus ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 23:28 ` Linus Torvalds @ 2006-10-21 0:06 ` Ralf Baechle 2006-10-21 0:38 ` Linus Torvalds 0 siblings, 1 reply; 46+ messages in thread From: Ralf Baechle @ 2006-10-21 0:06 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley On Fri, Oct 20, 2006 at 04:28:37PM -0700, Linus Torvalds wrote: > > > My understanding is that this works because in Ralf's original patch > > > (which is the context in which he is removing the flush_cache_mm() > > > call), he uses kmap()/kunmap() to map the page(s) being accessed at a > > > kernel virtual address which will fall into the same cache color as > > > the user virtual address --> no alias problems. > > > > > > Since he does this for every page touched on the kernel side during > > > dup_mmap(), the existing flush_cache_mm() call in dup_mmap() does in > > > fact become redundant. > > > > Correct. > > > > It means no cache flush operation to deal with aliases at all left in > > fork and COW code. > > Umm. That would seem to only happen to work for a direct-mapped virtually > indexed cache where the index is taken purely from the virtual address, > and there are no "process context" bits in the virtually indexed D$. No MIPS processor has something like that. See below. > The moment there are process context bits involved, afaik you absolutely > _need_ to flush, because otherwise the other process will never pick up > the dirty state (which it would need to reload from memory). Correct. > That said, maybe nobody does that. Virtual caches are a total braindamage > in the first place, so hopefully they have limited use. On MIPS we never had pure virtual caches. The major variants in existence are: o D-cache PIPT, I-cache PIPT o PIVT (no typo!) Only the R6000 has this and it's not supported by Linux. o D-cache VIPT, I-cache VIPT This is by far the most common on any MIPS designed since '91. A variant of these caches has hardware logic to detect cache aliases and fix them automatically and therefore is equivalent to PIPT even though they are not implemented as PIPT. And obviously the alias replay of the pipe will cost a few cycles. The R10000 family of SGI belongs into this class and the 24K/34K family of synthesizable cores by MIPS Technologies have this as a synthesis option. Another variant throws virtual coherency exceptions as I've explained in another thread. o D-cache PIPT, I-cache VIVT with additional address space tags. o Cacheless. Not usually running Linux but heck, it's working anyway. Be sure I'm sending a CPU designers a strong message about aliases. And I think they're slowly getting the message that kernel hackers like to poke needles into voodoo dolls for aliases ;-) Ralf ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-21 0:06 ` Ralf Baechle @ 2006-10-21 0:38 ` Linus Torvalds 2006-10-21 1:29 ` Paul Mackerras ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: Linus Torvalds @ 2006-10-21 0:38 UTC (permalink / raw) To: Ralf Baechle Cc: David Miller, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley On Sat, 21 Oct 2006, Ralf Baechle wrote: > > > That said, maybe nobody does that. Virtual caches are a total braindamage > > in the first place, so hopefully they have limited use. > > On MIPS we never had pure virtual caches. Ok, so on MIPS my schenario doesn't matter. I think (but may be mistaken) that ARM _does_ have pure virtual caches with a process ID, but people have always ended up flushing them at context switch simply because it just causes too much trouble. Sparc? VIPT too? Davem? I have absolutely zero clue about s390. Anyway, it sounds to me like this is too big to decide for 2.6.19 anyway, and as far as I can tell this i snot a regression, right? Ie we've always had the aliasing issue. Ralf? But it would be good to have something for the early -rc1 sequence for 2.6.20, and maybe the MIPS COW D$ patches are it, if it has performance advantages on MIPS that can also be translated to other virtual cache users.. > Be sure I'm sending a CPU designers a strong message about aliases. Castration. That's the best solution. We don't want those people procreating. Linus ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-21 0:38 ` Linus Torvalds @ 2006-10-21 1:29 ` Paul Mackerras 2006-10-21 2:11 ` David Miller 2006-12-02 9:49 ` Russell King 2 siblings, 0 replies; 46+ messages in thread From: Paul Mackerras @ 2006-10-21 1:29 UTC (permalink / raw) To: Linus Torvalds Cc: Ralf Baechle, David Miller, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley Linus Torvalds writes: > I think (but may be mistaken) that ARM _does_ have pure virtual caches > with a process ID, but people have always ended up flushing them at > context switch simply because it just causes too much trouble. > > Sparc? VIPT too? Davem? There is one PowerPC embedded chip family, the PPC440, which has a virtual i-cache with a process ID tag. The d-cache is sane though. Of course, the i-cache being readonly means we avoid the nastier issues. Paul. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-21 0:38 ` Linus Torvalds 2006-10-21 1:29 ` Paul Mackerras @ 2006-10-21 2:11 ` David Miller 2006-10-21 2:37 ` Linus Torvalds 2006-12-02 9:49 ` Russell King 2 siblings, 1 reply; 46+ messages in thread From: David Miller @ 2006-10-21 2:11 UTC (permalink / raw) To: torvalds Cc: ralf, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley From: Linus Torvalds <torvalds@osdl.org> Date: Fri, 20 Oct 2006 17:38:32 -0700 (PDT) > I think (but may be mistaken) that ARM _does_ have pure virtual caches > with a process ID, but people have always ended up flushing them at > context switch simply because it just causes too much trouble. > > Sparc? VIPT too? Davem? sun4c is VIVT, but has no SMP variants. sun4m has both VIPT and PIPT. > But it would be good to have something for the early -rc1 sequence for > 2.6.20, and maybe the MIPS COW D$ patches are it, if it has performance > advantages on MIPS that can also be translated to other virtual cache > users.. I think it could help for sun4m highmem configs. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-21 2:11 ` David Miller @ 2006-10-21 2:37 ` Linus Torvalds 2006-10-21 2:46 ` David Miller ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: Linus Torvalds @ 2006-10-21 2:37 UTC (permalink / raw) To: David Miller Cc: ralf, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley On Fri, 20 Oct 2006, David Miller wrote: > > From: Linus Torvalds <torvalds@osdl.org> > Date: Fri, 20 Oct 2006 17:38:32 -0700 (PDT) > > > I think (but may be mistaken) that ARM _does_ have pure virtual caches > > with a process ID, but people have always ended up flushing them at > > context switch simply because it just causes too much trouble. > > > > Sparc? VIPT too? Davem? > > sun4c is VIVT, but has no SMP variants. You don't need SMP - we have sleeping sections here, so even threads on UP can trigger it. Now, to trigger it you need to have - virtual indexing not just by address, but by some "address space identifier" thing too - (in practice) a big enough cache that switching tasks wouldn't flush it anyway. > sun4m has both VIPT and PIPT. > > > But it would be good to have something for the early -rc1 sequence for > > 2.6.20, and maybe the MIPS COW D$ patches are it, if it has performance > > advantages on MIPS that can also be translated to other virtual cache > > users.. > > I think it could help for sun4m highmem configs. Well, if you can re-create the performance numbers (Ralf - can you send the full series with the final "remove the now unnecessary flush" to Davem?), that will make deciding things easier, I think. I suspect sparc, mips and arm are the main architectures where virtually indexed caching really matters enough for this to be an issue at all. Linus ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-21 2:37 ` Linus Torvalds @ 2006-10-21 2:46 ` David Miller 2006-10-21 18:27 ` Ralf Baechle 2006-10-22 1:34 ` Ralf Baechle 2 siblings, 0 replies; 46+ messages in thread From: David Miller @ 2006-10-21 2:46 UTC (permalink / raw) To: torvalds Cc: ralf, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley From: Linus Torvalds <torvalds@osdl.org> Date: Fri, 20 Oct 2006 19:37:24 -0700 (PDT) > On Fri, 20 Oct 2006, David Miller wrote: > > I think it could help for sun4m highmem configs. > > Well, if you can re-create the performance numbers (Ralf - can you send > the full series with the final "remove the now unnecessary flush" to > Davem?), that will make deciding things easier, I think. > > I suspect sparc, mips and arm are the main architectures where virtually > indexed caching really matters enough for this to be an issue at all. Unfortunately, I don't have any sparc 32-bit systems any more, so I can't really help out here. I just make sure the build keeps working :-) ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-21 2:37 ` Linus Torvalds 2006-10-21 2:46 ` David Miller @ 2006-10-21 18:27 ` Ralf Baechle 2006-10-22 1:34 ` Ralf Baechle 2 siblings, 0 replies; 46+ messages in thread From: Ralf Baechle @ 2006-10-21 18:27 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley On Fri, Oct 20, 2006 at 07:37:24PM -0700, Linus Torvalds wrote: > Well, if you can re-create the performance numbers (Ralf - can you send > the full series with the final "remove the now unnecessary flush" to > Davem?), that will make deciding things easier, I think. > > I suspect sparc, mips and arm are the main architectures where virtually > indexed caching really matters enough for this to be an issue at all. What I was using for my fork benchmark was basically the series as posted in this thread + the quick hack patch below. I'll dig up some numbers for the posted patchset and will send them later. Ralf diff --git a/kernel/fork.c b/kernel/fork.c index 29ebb30..c83d226 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -202,7 +202,6 @@ static inline int dup_mmap(struct mm_str struct mempolicy *pol; down_write(&oldmm->mmap_sem); - flush_cache_mm(oldmm); /* * Not linked in yet - no deadlock potential: */ ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-21 2:37 ` Linus Torvalds 2006-10-21 2:46 ` David Miller 2006-10-21 18:27 ` Ralf Baechle @ 2006-10-22 1:34 ` Ralf Baechle 2 siblings, 0 replies; 46+ messages in thread From: Ralf Baechle @ 2006-10-22 1:34 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley On Fri, Oct 20, 2006 at 07:37:24PM -0700, Linus Torvalds wrote: > Well, if you can re-create the performance numbers (Ralf - can you send > the full series with the final "remove the now unnecessary flush" to > Davem?), that will make deciding things easier, I think. Blwo are numbers and comments from Atsushi Nemoto on two Toshiba TY49 cores with 16K rsp. 32K per primary cache. Each lmbench run was repeated twice. The numbers taken without the flush_cache_mm hack to dup_mmap(), so there are those 12% on fork which can easily be obtained in addition on PIVT caches such as the TX49. Processor, Processes - times in microseconds - smaller is better ------------------------------------------------------------------------------ Host OS Mhz null null open slct sig sig fork exec sh call I/O stat clos TCP inst hndl proc proc proc --------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- TX49-16K without-patch 197 0.72 2.40 17.7 34.3 82.9 2.84 26.2 2500 9364 39.K TX49-16K without-patch 197 0.73 2.40 17.7 34.4 82.9 2.85 26.1 2495 9337 39.K TX49-16K without-patch 197 0.72 2.40 17.8 34.3 82.9 2.85 26.1 2501 9341 39.K TX49-16K with-patch 197 0.72 2.39 20.1 31.9 82.9 2.85 20.2 2491 9101 38.K TX49-16K with-patch 197 0.72 2.39 20.1 32.8 82.9 2.86 20.2 2496 9058 38.K TX49-16K with-patch 197 0.72 2.39 20.1 32.8 82.9 2.85 20.3 2501 9074 38.K TX49-32K without-patch 396 0.36 1.19 6.78 11.3 41.0 1.41 8.15 1246 4674 19.K TX49-32K without-patch 396 0.36 1.19 6.78 11.3 41.0 1.41 8.17 1251 4680 19.K TX49-32K without-patch 396 0.36 1.19 6.79 11.3 41.0 1.41 8.15 1250 4682 19.K TX49-32K with-patch 396 0.36 1.19 6.79 10.2 41.0 1.41 8.14 1230 4638 19.K TX49-32K with-patch 396 0.36 1.19 6.78 10.2 40.9 1.41 8.14 1241 4628 19.K TX49-32K with-patch 396 0.36 1.19 6.79 10.2 40.9 1.41 8.14 1238 4627 19.K A little bit faster on exec/proc and open/close (why?). Strange results on sig/hndl and stat on TX49-16K again. Context switching - times in microseconds - smaller is better ------------------------------------------------------------------------- Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw --------- ------------- ------ ------ ------ ------ ------ ------- ------- TX49-16K without-patch 4.7800 87.1 36.5 97.2 47.8 101.1 40.8 TX49-16K without-patch 5.4000 88.4 28.9 96.2 39.6 101.2 40.8 TX49-16K without-patch 4.6800 84.5 32.7 96.8 46.5 100.2 42.9 TX49-16K with-patch 2.4600 82.7 34.0 93.5 50.5 97.2 43.3 TX49-16K with-patch 1.5200 87.7 33.6 95.1 42.4 99.7 43.6 TX49-16K with-patch 1.7700 86.2 34.1 95.8 49.0 99.2 41.7 TX49-32K without-patch 31.4 11.3 72.1 15.2 72.3 16.9 TX49-32K without-patch 30.4 11.6 72.2 16.1 73.2 15.1 TX49-32K without-patch 33.5 12.1 71.2 15.5 73.0 17.0 TX49-32K with-patch 30.9 11.5 72.3 17.4 73.1 17.5 TX49-32K with-patch 31.5 11.9 71.8 15.8 73.0 16.7 TX49-32K with-patch 32.5 10.4 71.7 16.0 72.5 16.6 No noticeable differences. File & VM system latencies in microseconds - smaller is better ------------------------------------------------------------------------------- Host OS 0K File 10K File Mmap Prot Page 100fd Create Delete Create Delete Latency Fault Fault selct --------- ------------- ------ ------ ------ ------ ------- ----- ------- ----- TX49-16K without-patch 251.8 192.5 1212.1 397.6 500.0 4.388 7.32710 50.5 TX49-16K without-patch 254.7 193.9 1197.6 394.9 505.0 4.412 7.34230 50.5 TX49-16K without-patch 252.6 193.6 1212.1 399.4 499.0 4.758 7.33790 50.5 TX49-16K with-patch 251.8 192.2 1207.7 391.7 502.0 0.143 7.32320 50.5 TX49-16K with-patch 252.7 194.0 1200.5 393.7 505.0 0.108 7.32030 50.5 TX49-16K with-patch 252.0 191.8 1199.0 392.3 508.0 0.011 7.33150 50.5 TX49-32K without-patch 86.0 54.8 461.3 146.3 378.0 1.818 5.45460 25.0 TX49-32K without-patch 86.5 54.1 454.3 148.1 378.0 1.816 5.47120 25.0 TX49-32K without-patch 86.7 53.8 458.1 148.0 378.0 2.130 5.48540 25.0 TX49-32K with-patch 90.4 52.5 460.8 148.7 377.0 0.471 5.46340 25.0 TX49-32K with-patch 88.8 52.6 462.5 148.6 380.0 0.476 5.44630 25.0 TX49-32K with-patch 88.7 52.9 466.4 147.8 378.0 0.477 5.49560 25.0 Major improvements on Prot/Fault. Ralf ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-21 0:38 ` Linus Torvalds 2006-10-21 1:29 ` Paul Mackerras 2006-10-21 2:11 ` David Miller @ 2006-12-02 9:49 ` Russell King 2 siblings, 0 replies; 46+ messages in thread From: Russell King @ 2006-12-02 9:49 UTC (permalink / raw) To: Linus Torvalds Cc: Ralf Baechle, David Miller, nickpiggin, akpm, linux-kernel, anemo, linux-arch, schwidefsky, James.Bottomley On Fri, Oct 20, 2006 at 05:38:32PM -0700, Linus Torvalds wrote: > On Sat, 21 Oct 2006, Ralf Baechle wrote: > > > That said, maybe nobody does that. Virtual caches are a total braindamage > > > in the first place, so hopefully they have limited use. > > > > On MIPS we never had pure virtual caches. > > Ok, so on MIPS my schenario doesn't matter. > > I think (but may be mistaken) that ARM _does_ have pure virtual caches > with a process ID, but people have always ended up flushing them at > context switch simply because it just causes too much trouble. Just read this, sorry. The majority of ARM CPU implementations have pure virtual caches _without_ process IDs. (Some have a nasty hack which involves remapping the lower 32MB of virtual memory space to other areas of the cache's virtual space, but obviously that limits you to 32MB of VM.) Thankfully, with ARM version 6, they had an inkling of clue, and decided to move to VIPT caches but with _optional_ aliasing, and if the CPU design was Harvard there's a possibility for D/I cache aliasing. > > Be sure I'm sending a CPU designers a strong message about aliases. > > Castration. That's the best solution. We don't want those people > procreating. Absolutely. Can we start such a program in Cambridge, England ASAP please? -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 20:10 ` Linus Torvalds 2006-10-20 20:59 ` Russell King 2006-10-20 21:49 ` Ralf Baechle @ 2006-10-23 8:50 ` Martin Schwidefsky 2 siblings, 0 replies; 46+ messages in thread From: Martin Schwidefsky @ 2006-10-23 8:50 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, nickpiggin, ralf, Andrew Morton, Linux Kernel Mailing List, anemo, linux-arch On Fri, 2006-10-20 at 13:10 -0700, Linus Torvalds wrote: > > On Fri, 20 Oct 2006, David Miller wrote: > > > > I did some more digging, here's what I think the hardware actually > > does: > > Ok, this sounds sane. > > What should we do about this? How does this patch look to people? > > (Totally untested, and I'm not sure we should even do that whole > "oldmm->mm_users" test, but I'm throwing it out here for discussion, in > case it matters for performance. The second D$ flush should obviously be > unnecessary for the common unthreaded case, which is why none of this has > mattered historically, I think). > > Comments? We need ARM, MIPS, sparc and S390 at the very least to sign off > on this, and somebody to write a nice explanation for the changelog (and > preferably do this through -mm too). On s390 you never have to worry about cache flushing. It is not stated anywhere in the principles of operation but the architecture has to be PIPT, otherwise it couldn't possibly work. The best indication for it is that there is no cache flush instruction. The view on data in memory is always consistent. -- blue skies, Martin. Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 14:39 ` Nick Piggin 2006-10-20 15:49 ` Linus Torvalds @ 2006-10-20 16:05 ` Ralf Baechle 2006-10-20 16:30 ` Nick Piggin 2006-10-20 19:23 ` David Miller 2 siblings, 1 reply; 46+ messages in thread From: Ralf Baechle @ 2006-10-20 16:05 UTC (permalink / raw) To: Nick Piggin; +Cc: David Miller, torvalds, akpm, linux-kernel, anemo On Sat, Oct 21, 2006 at 12:39:40AM +1000, Nick Piggin wrote: > >>That would require changing the order of cache flush and tlb flush. > >>To keep certain architectures that require a valid translation in > >>the TLB the cacheflush has to be done first. Not sure if those > >>architectures need a writeable mapping for dirty cachelines - I > >>think hypersparc was one of them. > > > > > >There just has to be "a mapping" in the TLB so that the L2 cache can > >translate the virtual address to a physical one for the writeback to > >main memory. > > So moving the flush_cache_mm below the copy_page_range, to just > before the flush_tlb_mm, would work then? This would make the > race much smaller than with this patchset. 90% of this changeset are MIPS-specific code. Of that in turn much is just infrastructure which is already being used anyway. > But doesn't that still leave a race? Both calls would have to be done under the mmap_sem to close any races. > What if another thread writes to cache after we have flushed it > but before flushing the TLBs? Although we've marked the the ptes > readonly, the CPU won't trap if the TLB is valid? There must be > some special way for the arch to handle this, but I can't see it. There isn't really. Reordering with a patch like: Signed-off-by: Ralf Baechle <ralf@linux-mips.org> diff --git a/kernel/fork.c b/kernel/fork.c index 29ebb30..28e51e0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -202,7 +202,6 @@ static inline int dup_mmap(struct mm_str struct mempolicy *pol; down_write(&oldmm->mmap_sem); - flush_cache_mm(oldmm); /* * Not linked in yet - no deadlock potential: */ @@ -287,8 +286,9 @@ static inline int dup_mmap(struct mm_str } retval = 0; out: - up_write(&mm->mmap_sem); flush_tlb_mm(oldmm); + flush_cache_mm(oldmm); + up_write(&mm->mmap_sem); up_write(&oldmm->mmap_sem); return retval; fail_nomem_policy: should close the hole for all effected architectures. I say should because this patch would need another round of linux-arch reviewing and I haven't tested it this patch yet myself. But even so that doesn't change that I would really like to make copy_user_highpage() an arch interface replacing copy_to_user_page. The current way of doing things enforces a cacheflush on MIPS which itself is pricy - 1,000 cycles when it's cheap but could be several times as expensive. And as a side effect of the cacheflush the process breaking a COW page will start with a cold page. Or if an architecture wants to be clever about aliasing and uses the vto argument of copy_user_page to create a non-conflicting mapping it means the mapping setup by copy_user_highpage will be unused ... Ralf ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 16:05 ` Ralf Baechle @ 2006-10-20 16:30 ` Nick Piggin 0 siblings, 0 replies; 46+ messages in thread From: Nick Piggin @ 2006-10-20 16:30 UTC (permalink / raw) To: Ralf Baechle; +Cc: David Miller, torvalds, akpm, linux-kernel, anemo Ralf Baechle wrote: > On Sat, Oct 21, 2006 at 12:39:40AM +1000, Nick Piggin wrote: >>So moving the flush_cache_mm below the copy_page_range, to just >>before the flush_tlb_mm, would work then? This would make the >>race much smaller than with this patchset. > > > 90% of this changeset are MIPS-specific code. Of that in turn much is > just infrastructure which is already being used anyway. > > >>But doesn't that still leave a race? > > > Both calls would have to be done under the mmap_sem to close any races. Of course. >>What if another thread writes to cache after we have flushed it >>but before flushing the TLBs? Although we've marked the the ptes >>readonly, the CPU won't trap if the TLB is valid? There must be >>some special way for the arch to handle this, but I can't see it. > > > There isn't really. Reordering with a patch like: > > Signed-off-by: Ralf Baechle <ralf@linux-mips.org> > > diff --git a/kernel/fork.c b/kernel/fork.c > index 29ebb30..28e51e0 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -202,7 +202,6 @@ static inline int dup_mmap(struct mm_str > struct mempolicy *pol; > > down_write(&oldmm->mmap_sem); > - flush_cache_mm(oldmm); > /* > * Not linked in yet - no deadlock potential: > */ > @@ -287,8 +286,9 @@ static inline int dup_mmap(struct mm_str > } > retval = 0; > out: > - up_write(&mm->mmap_sem); You don't need to do that. Nothing will use the new mm. You might also want a flush before the flush_tlb_mm. Actually, we should turn this into a single call, so architectures can optimise it however they like. > flush_tlb_mm(oldmm); > + flush_cache_mm(oldmm); That does close the race. We just need to ensure that all architectures can handle this case. And we need to figure out whether any of the other code that follows the same flush_cache_* .. flush_tlb_* is buggy in the presence of other threads writing into the cache in between. I suspect they may well be, and you probably only noticed the issue in fork, because the window is so large. Places where we want to remove the mapping completely are going to be more tricky to fix. Either we need to transition to readonly, then flush, then transition to invalid, or arch code needs to be reworked (the single operation caches and tlb flush will do the trick, but that might to do an IPI, which is bad). > + up_write(&mm->mmap_sem); > up_write(&oldmm->mmap_sem); > return retval; > fail_nomem_policy: > > should close the hole for all effected architectures. I say should > because this patch would need another round of linux-arch reviewing and I > haven't tested it this patch yet myself. > > But even so that doesn't change that I would really like to make > copy_user_highpage() an arch interface replacing copy_to_user_page. > > The current way of doing things enforces a cacheflush on MIPS which itself > is pricy - 1,000 cycles when it's cheap but could be several times as > expensive. And as a side effect of the cacheflush the process breaking > a COW page will start with a cold page. > > Or if an architecture wants to be clever about aliasing and uses the > vto argument of copy_user_page to create a non-conflicting mapping it > means the mapping setup by copy_user_highpage will be unused ... OK, I'm not arguing against any other changes. Hmm... I don't see how you can kmap_coherent the "from" page when it can be mapped into more than one virtual address? Does the MIPS port restrict remapping to the same colour? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] Fix COW D-cache aliasing on fork 2006-10-20 14:39 ` Nick Piggin 2006-10-20 15:49 ` Linus Torvalds 2006-10-20 16:05 ` Ralf Baechle @ 2006-10-20 19:23 ` David Miller 2 siblings, 0 replies; 46+ messages in thread From: David Miller @ 2006-10-20 19:23 UTC (permalink / raw) To: nickpiggin; +Cc: ralf, torvalds, akpm, linux-kernel, anemo From: Nick Piggin <nickpiggin@yahoo.com.au> Date: Sat, 21 Oct 2006 00:39:40 +1000 > So moving the flush_cache_mm below the copy_page_range, to just > before the flush_tlb_mm, would work then? This would make the > race much smaller than with this patchset. > > But doesn't that still leave a race? > > What if another thread writes to cache after we have flushed it > but before flushing the TLBs? Although we've marked the the ptes > readonly, the CPU won't trap if the TLB is valid? There must be > some special way for the arch to handle this, but I can't see it. Also, it is actually the case that doing page-by-page cache flushes can be cheaper than flush_mm_cache() on certain cpus. Very few cpus that need this cache flushing provide a "context" based cache flush. On cpus like the mentioned hypersparc, there is no way to do a "context" flush of the cache, so we flush the entire multi-megabyte L2 cache. Actually, it allows to flush only "user" cache lines which keeps the kernel cache lines in there, but still it's very expensive. ^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2006-12-02 9:49 UTC | newest] Thread overview: 46+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-10-19 16:35 [PATCH 1/3] Fix COW D-cache aliasing on fork Ralf Baechle 2006-10-19 16:35 ` [PATCH 2/3] Pass vma argument to copy_user_highpage() Ralf Baechle 2006-10-19 16:35 ` [PATCH 3/3] MIPS: Fix COW D-cache aliasing on fork Ralf Baechle 2006-10-22 2:32 ` Ralf Baechle 2006-10-19 17:46 ` [PATCH 1/3] " Nick Piggin 2006-10-19 18:13 ` Ralf Baechle 2006-10-19 18:48 ` Nick Piggin 2006-10-19 22:59 ` David Miller 2006-10-20 14:39 ` Nick Piggin 2006-10-20 15:49 ` Linus Torvalds 2006-10-20 15:57 ` Nick Piggin 2006-10-20 16:36 ` Linus Torvalds 2006-10-20 16:47 ` Nick Piggin 2006-10-20 17:16 ` Linus Torvalds 2006-10-20 17:37 ` Nick Piggin 2006-10-21 0:46 ` Ralf Baechle 2006-10-20 19:36 ` David Miller 2006-10-20 19:54 ` Linus Torvalds 2006-10-20 19:58 ` David Miller 2006-10-20 20:10 ` Linus Torvalds 2006-10-20 20:59 ` Russell King 2006-10-20 21:06 ` David Miller 2006-10-20 21:17 ` Russell King 2006-10-20 21:30 ` David Miller 2006-10-20 21:12 ` Linus Torvalds 2006-10-20 21:28 ` Russell King 2006-10-20 21:41 ` Ralf Baechle 2006-10-21 16:28 ` Atsushi Nemoto 2006-10-20 21:49 ` Ralf Baechle 2006-10-20 22:02 ` Linus Torvalds 2006-10-20 22:22 ` David Miller 2006-10-20 22:51 ` Ralf Baechle 2006-10-20 23:28 ` Linus Torvalds 2006-10-21 0:06 ` Ralf Baechle 2006-10-21 0:38 ` Linus Torvalds 2006-10-21 1:29 ` Paul Mackerras 2006-10-21 2:11 ` David Miller 2006-10-21 2:37 ` Linus Torvalds 2006-10-21 2:46 ` David Miller 2006-10-21 18:27 ` Ralf Baechle 2006-10-22 1:34 ` Ralf Baechle 2006-12-02 9:49 ` Russell King 2006-10-23 8:50 ` Martin Schwidefsky 2006-10-20 16:05 ` Ralf Baechle 2006-10-20 16:30 ` Nick Piggin 2006-10-20 19:23 ` David Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox