* [rfc] changes to user memory mapping scheme
@ 2008-01-13  3:08 Nick Piggin
  2008-01-13  3:09 ` [rfc][patch 1/2] mm: introduce VM_MIXEDMAP Nick Piggin
  2008-01-13  3:10 ` [rfc][patch 2/2] mm: introduce optional pte_special pte bit Nick Piggin
  0 siblings, 2 replies; 24+ messages in thread

From: Nick Piggin @ 2008-01-13  3:08 UTC
To: Linus Torvalds, Hugh Dickins, Jared Hulbert, Carsten Otte,
    Martin Schwidefsky, Heiko Carstens, linux-arch

Hi guys,

I'd like to get some comments about changes to the mm/memory.c code which
determines what to do with user ptes. Basically, we have a vm_normal_page()
to determine whether the page under the pte is "normal" (has an underlying
struct page, which is refcounted); or "special" (may or may not have a
struct page, but either way we don't want to refcount anything).

Now the first patch should be uncontroversial. It adds a slightly different
special vma mapping that can support COW for arbitrary page and/or pfn
mappings. VM_PFNMAP cannot do this due to its special support for COW on
/dev/mem. We are going to need this to support non-struct-page-backed XIP
mappings, which is interesting for virtualisation, and for Jared's (IMO
very important) work on hybrid RAM/NVRAM filesystems.

The second patch simplifies a lot of this tricky logic by using an extra
bit in the valid, user pte. Now depending on how s390 goes, this patch may
or may not be required to get all this working on that architecture. I'm
posting it here for comments because firstly, it might be required; and
secondly, it is already written and has some other nice properties for
*any* architecture that implements it.

While we're working out the exact details of the xip stuff, I'd like to
pipeline things by asking for comments on the changes we need to the
memory mapping side of things.

Thanks,
Nick
* [rfc][patch 1/2] mm: introduce VM_MIXEDMAP
  2008-01-13  3:08 [rfc] changes to user memory mapping scheme Nick Piggin
@ 2008-01-13  3:09 ` Nick Piggin
  2008-01-13  3:10 ` [rfc][patch 2/2] mm: introduce optional pte_special pte bit Nick Piggin
  1 sibling, 0 replies; 24+ messages in thread

From: Nick Piggin @ 2008-01-13  3:09 UTC
To: Linus Torvalds, Hugh Dickins, Jared Hulbert, Carsten Otte,
    Martin Schwidefsky, Heiko Carstens, linux-arch

mm: introduce VM_MIXEDMAP

From: Jared Hulbert <jaredeh@gmail.com>

Introduce a new type of mapping, VM_MIXEDMAP. This is unlike VM_PFNMAP in
that it can support COW mappings of arbitrary ranges, including ranges
without struct page (VM_PFNMAP can only support COW in those cases where
the un-COW-ed translations are mapped linearly into the virtual address
space). VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and
not refcounting !pfn_valid pages (which is not an option for VM_PFNMAP,
because it needs to avoid refcounting pfn_valid pages, eg. for /dev/mem
mappings).
(Needs Jared's SOB) Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-arch@vger.kernel.org --- include/linux/mm.h | 1 mm/memory.c | 79 ++++++++++++++++++++++++++++++++++++++--------------- 2 files changed, 59 insertions(+), 21 deletions(-) Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -106,6 +106,7 @@ extern unsigned int kobjsize(const void #define VM_ALWAYSDUMP 0x04000000 /* Always include in core dumps */ #define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */ +#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */ #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -361,35 +361,65 @@ static inline int is_cow_mapping(unsigne } /* - * This function gets the "struct page" associated with a pte. + * This function gets the "struct page" associated with a pte or returns + * NULL if no "struct page" is associated with the pte. * - * NOTE! Some mappings do not have "struct pages". A raw PFN mapping - * will have each page table entry just pointing to a raw page frame - * number, and as far as the VM layer is concerned, those do not have - * pages associated with them - even if the PFN might point to memory + * A raw VM_PFNMAP mapping (ie. one that is not COWed) may not have any "struct + * page" backing, and even if they do, they are not refcounted. 
COWed pages of + * a VM_PFNMAP do always have a struct page, and they are normally refcounted + * (they are _normal_ pages). + * + * So a raw PFNMAP mapping will have each page table entry just pointing + * to a page frame number, and as far as the VM layer is concerned, those do + * not have pages associated with them - even if the PFN might point to memory * that otherwise is perfectly fine and has a "struct page". * - * The way we recognize those mappings is through the rules set up - * by "remap_pfn_range()": the vma will have the VM_PFNMAP bit set, - * and the vm_pgoff will point to the first PFN mapped: thus every + * The way we recognize COWed pages within VM_PFNMAP mappings is through the + * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit + * set, and the vm_pgoff will point to the first PFN mapped: thus every * page that is a raw mapping will always honor the rule * * pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT) * - * and if that isn't true, the page has been COW'ed (in which case it - * _does_ have a "struct page" associated with it even if it is in a - * VM_PFNMAP range). + * A call to vm_normal_page() will return NULL for such a page. + * + * If the page doesn't follow the "remap_pfn_range()" rule in a VM_PFNMAP + * then the page has been COW'ed. A COW'ed page _does_ have a "struct page" + * associated with it even if it is in a VM_PFNMAP range. Calling + * vm_normal_page() on such a page will therefore return the "struct page". + * + * + * VM_MIXEDMAP mappings can likewise contain memory with or without "struct + * page" backing, however the difference is that _all_ pages with a struct + * page (that is, those where pfn_valid is true) are refcounted and considered + * normal pages by the VM. The disadvantage is that pages are refcounted + * (which can be slower and simply not an option for some PFNMAP users). 
The + * advantage is that we don't have to follow the strict linearity rule of + * PFNMAP mappings in order to support COWable mappings. + * + * A call to vm_normal_page() with a VM_MIXEDMAP mapping will return the + * associated "struct page" or NULL for memory not backed by a "struct page". + * + * + * All other mappings should have a valid struct page, which will be + * returned by a call to vm_normal_page(). */ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte) { unsigned long pfn = pte_pfn(pte); - if (unlikely(vma->vm_flags & VM_PFNMAP)) { - unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT; - if (pfn == vma->vm_pgoff + off) - return NULL; - if (!is_cow_mapping(vma->vm_flags)) - return NULL; + if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { + if (vma->vm_flags & VM_MIXEDMAP) { + if (!pfn_valid(pfn)) + return NULL; + goto out; + } else { + unsigned long off = (addr-vma->vm_start) >> PAGE_SHIFT; + if (pfn == vma->vm_pgoff + off) + return NULL; + if (!is_cow_mapping(vma->vm_flags)) + return NULL; + } } /* @@ -410,6 +440,7 @@ struct page *vm_normal_page(struct vm_ar * The PAGE_ZERO() pages and various VDSO mappings can * cause them to exist. 
  */
+out:
 	return pfn_to_page(pfn);
 }
 
@@ -1211,8 +1242,11 @@ int vm_insert_pfn(struct vm_area_struct
 	pte_t *pte, entry;
 	spinlock_t *ptl;
 
-	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
-	BUG_ON(is_cow_mapping(vma->vm_flags));
+	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
+						(VM_PFNMAP|VM_MIXEDMAP));
+	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
+	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
 
 	retval = -ENOMEM;
 	pte = get_locked_pte(mm, addr, &ptl);
@@ -2386,10 +2420,13 @@ static noinline int do_no_pfn(struct mm_
 	unsigned long pfn;
 
 	pte_unmap(page_table);
-	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
-	BUG_ON(is_cow_mapping(vma->vm_flags));
+	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
 
 	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK);
+
+	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+
 	if (unlikely(pfn == NOPFN_OOM))
 		return VM_FAULT_OOM;
 	else if (unlikely(pfn == NOPFN_SIGBUS))
* [rfc][patch 2/2] mm: introduce optional pte_special pte bit
  2008-01-13  3:08 [rfc] changes to user memory mapping scheme Nick Piggin
  2008-01-13  3:09 ` [rfc][patch 1/2] mm: introduce VM_MIXEDMAP Nick Piggin
@ 2008-01-13  3:10 ` Nick Piggin
  2008-01-13  3:41   ` Linus Torvalds
  2008-01-16 17:14   ` David Howells
  1 sibling, 2 replies; 24+ messages in thread

From: Nick Piggin @ 2008-01-13  3:10 UTC
To: Linus Torvalds, Hugh Dickins, Jared Hulbert, Carsten Otte,
    Martin Schwidefsky, Heiko Carstens, linux-arch

mm: introduce pte_special pte bit

Rather than play interesting games with vmas to work out whether the
mapped page should be refcounted or not, use a new bit in the "present"
pte to distinguish such pages. This allows a much simpler vm_normal_page()
implementation, and more flexible rules for COW pages in pfn mappings
(eg. our proposed VM_MIXEDMAP mode would become a noop).

vm_normal_page() can be in the top several kernel functions in a profile
(I've seen it take ~5% of kernel time in fork/exit intensive workloads).

The special bit also provides one of the required pieces for the lockless
get_user_pages, and allows s390 to implement pfn-only XIP mappings
(although granted, they could also do so if they modified their memory
model somewhat).

Unfortunately, not all architectures can spare a bit in the pte for this,
so add ifdefs depending on whether the arch can support it or not.
Although pte_special could remove all the restrictions that pfn mappings
currently have, we continue to enforce those restrictions even on
architectures that have the new pte bit, for consistency.

Provide an implementation for x86 and uml.
Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-arch@vger.kernel.org --- include/asm-alpha/pgtable.h | 4 - include/asm-avr32/pgtable.h | 8 +++ include/asm-cris/pgtable.h | 2 include/asm-frv/pgtable.h | 2 include/asm-ia64/pgtable.h | 3 + include/asm-m32r/pgtable.h | 10 ++++ include/asm-m68k/motorola_pgtable.h | 2 include/asm-m68k/sun3_pgtable.h | 2 include/asm-mips/pgtable.h | 2 include/asm-parisc/pgtable.h | 2 include/asm-powerpc/pgtable-ppc32.h | 3 + include/asm-powerpc/pgtable-ppc64.h | 3 + include/asm-ppc/pgtable.h | 3 + include/asm-s390/pgtable.h | 10 ++++ include/asm-sh64/pgtable.h | 2 include/asm-sparc/pgtable.h | 7 +++ include/asm-sparc64/pgtable.h | 10 ++++ include/asm-um/pgtable.h | 13 +++++ include/asm-x86/pgtable_32.h | 5 ++ include/asm-x86/pgtable_64.h | 5 ++ include/asm-xtensa/pgtable.h | 4 + include/linux/mm.h | 3 - mm/memory.c | 79 ++++++++++++++++++++---------------- 23 files changed, 147 insertions(+), 37 deletions(-) Index: linux-2.6/include/asm-um/pgtable.h =================================================================== --- linux-2.6.orig/include/asm-um/pgtable.h +++ linux-2.6/include/asm-um/pgtable.h @@ -21,6 +21,8 @@ #define _PAGE_USER 0x040 #define _PAGE_ACCESSED 0x080 #define _PAGE_DIRTY 0x100 +#define _PAGE_SPECIAL 0x200 +#define __HAVE_ARCH_PTE_SPECIAL /* If _PAGE_PRESENT is clear, we use these: */ #define _PAGE_FILE 0x008 /* nonlinear file mapping, saved PTE; unset:swap */ #define _PAGE_PROTNONE 0x010 /* if the user mapped it with PROT_NONE; @@ -220,6 +222,11 @@ static inline int pte_newprot(pte_t pte) return(pte_present(pte) && (pte_get_bits(pte, _PAGE_NEWPROT))); } +static inline int pte_special(pte_t pte) +{ + return pte_get_bits(pte, _PAGE_SPECIAL); +} + /* * 
 * =================================
 * Flags setting section.
@@ -288,6 +295,12 @@ static inline pte_t pte_mknewpage(pte_t
 	return(pte);
 }
 
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+	pte_set_bits(pte, _PAGE_SPECIAL);
+	return(pte);
+}
+
 static inline void set_pte(pte_t *pteptr, pte_t pteval)
 {
 	pte_copy(*pteptr, pteval);
Index: linux-2.6/include/asm-x86/pgtable_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_32.h
+++ linux-2.6/include/asm-x86/pgtable_32.h
@@ -102,6 +102,7 @@ void paging_init(void);
 #define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_UNUSED3	11
 #define _PAGE_BIT_NX		63
+#define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 
 #define _PAGE_PRESENT	0x001
 #define _PAGE_RW	0x002
@@ -115,6 +116,8 @@ void paging_init(void);
 #define _PAGE_UNUSED1	0x200	/* available for programmer */
 #define _PAGE_UNUSED2	0x400
 #define _PAGE_UNUSED3	0x800
+#define _PAGE_SPECIAL	_PAGE_UNUSED1
+#define __HAVE_ARCH_PTE_SPECIAL
 
 /* If _PAGE_PRESENT is clear, we use these: */
 #define _PAGE_FILE	0x040	/* nonlinear file mapping, saved PTE; unset:swap */
@@ -219,6 +222,7 @@ static inline int pte_dirty(pte_t pte)
 static inline int pte_young(pte_t pte)		{ return (pte).pte_low & _PAGE_ACCESSED; }
 static inline int pte_write(pte_t pte)		{ return (pte).pte_low & _PAGE_RW; }
 static inline int pte_huge(pte_t pte)		{ return (pte).pte_low & _PAGE_PSE; }
+static inline int pte_special(pte_t pte)	{ return (pte).pte_low & _PAGE_SPECIAL; }
 
 /*
  * The following only works if pte_present() is not true.
@@ -232,6 +236,7 @@ static inline pte_t pte_mkdirty(pte_t pt static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; } static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; } static inline pte_t pte_mkhuge(pte_t pte) { (pte).pte_low |= _PAGE_PSE; return pte; } +static inline pte_t pte_mkspecial(pte_t pte) { (pte).pte_low |= _PAGE_SPECIAL; return pte; } #ifdef CONFIG_X86_PAE # include <asm/pgtable-3level.h> Index: linux-2.6/include/asm-x86/pgtable_64.h =================================================================== --- linux-2.6.orig/include/asm-x86/pgtable_64.h +++ linux-2.6/include/asm-x86/pgtable_64.h @@ -151,6 +151,7 @@ static inline pte_t ptep_get_and_clear_f #define _PAGE_BIT_DIRTY 6 #define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */ #define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */ +#define _PAGE_BIT_SPECIAL 9 #define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */ #define _PAGE_PRESENT 0x001 @@ -163,6 +164,8 @@ static inline pte_t ptep_get_and_clear_f #define _PAGE_PSE 0x080 /* 2MB page */ #define _PAGE_FILE 0x040 /* nonlinear file mapping, saved PTE; unset:swap */ #define _PAGE_GLOBAL 0x100 /* Global TLB entry */ +#define _PAGE_SPECIAL 0x200 +#define __HAVE_ARCH_PTE_SPECIAL #define _PAGE_PROTNONE 0x080 /* If not present */ #define _PAGE_NX (_AC(1,UL)<<_PAGE_BIT_NX) @@ -272,6 +275,7 @@ static inline int pte_young(pte_t pte) static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; } static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_PSE; } +static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; } static inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; } static inline pte_t pte_mkold(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_ACCESSED)); return pte; } @@ -282,6 +286,7 @@ static 
inline pte_t pte_mkyoung(pte_t pt static inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; } static inline pte_t pte_mkhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_PSE)); return pte; } static inline pte_t pte_clrhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_PSE)); return pte; } +static inline pte_t pte_mkspecial(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_SPECIAL)); return pte; } struct vm_area_struct; Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -699,7 +699,8 @@ struct zap_details { unsigned long truncate_count; /* Compare vm_truncate_count */ }; -struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t); +struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte); + unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); unsigned long unmap_vmas(struct mmu_gather **tlb, Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -361,34 +361,38 @@ static inline int is_cow_mapping(unsigne } /* - * This function gets the "struct page" associated with a pte or returns - * NULL if no "struct page" is associated with the pte. + * vm_normal_page -- This function gets the "struct page" associated with a pte. * - * A raw VM_PFNMAP mapping (ie. one that is not COWed) may not have any "struct - * page" backing, and even if they do, they are not refcounted. COWed pages of - * a VM_PFNMAP do always have a struct page, and they are normally refcounted - * (they are _normal_ pages). 
- * - * So a raw PFNMAP mapping will have each page table entry just pointing - * to a page frame number, and as far as the VM layer is concerned, those do - * not have pages associated with them - even if the PFN might point to memory - * that otherwise is perfectly fine and has a "struct page". + * "Special" mappings do not wish to be associated with a "struct page" (either + * it doesn't exist, or it exists but they don't want to touch it). In this + * case, NULL is returned here. "Normal" mappings do have a struct page. + * + * There are 2 broad cases. Firstly, an architecture may define a pte_special() + * pte bit, in which case this function is trivial. Secondly, an architecture may + * not have a spare pte bit, which requires a more complicated scheme, described + * below. + * + * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a + * special mapping (even if there are underlying and valid "struct pages"). + * COWed pages of a VM_PFNMAP are always normal. * * The way we recognize COWed pages within VM_PFNMAP mappings is through the * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit - * set, and the vm_pgoff will point to the first PFN mapped: thus every - * page that is a raw mapping will always honor the rule + * set, and the vm_pgoff will point to the first PFN mapped: thus every special + * mapping will always honor the rule * * pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT) * - * A call to vm_normal_page() will return NULL for such a page. + * And for normal mappings this is false. * - * If the page doesn't follow the "remap_pfn_range()" rule in a VM_PFNMAP - * then the page has been COW'ed. A COW'ed page _does_ have a "struct page" - * associated with it even if it is in a VM_PFNMAP range. Calling - * vm_normal_page() on such a page will therefore return the "struct page". + * This restricts such mappings to be a linear translation from virtual address to + * pfn. 
To get around this restriction, we allow arbitrary mappings so long as the + * vma is not a COW mapping; in that case, we know that all ptes are special (because + * none can have been COWed). * * + * In order to support COW of arbitrary special mappings, we have VM_MIXEDMAP. + * * VM_MIXEDMAP mappings can likewise contain memory with or without "struct * page" backing, however the difference is that _all_ pages with a struct * page (that is, those where pfn_valid is true) are refcounted and considered @@ -397,15 +401,18 @@ static inline int is_cow_mapping(unsigne * advantage is that we don't have to follow the strict linearity rule of * PFNMAP mappings in order to support COWable mappings. * - * A call to vm_normal_page() with a VM_MIXEDMAP mapping will return the - * associated "struct page" or NULL for memory not backed by a "struct page". - * - * - * All other mappings should have a valid struct page, which will be - * returned by a call to vm_normal_page(). */ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte) { +#ifdef __HAVE_ARCH_PTE_SPECIAL + if (likely(!pte_special(pte))) { + VM_BUG_ON(!pfn_valid(pte_pfn(pte))); + return pte_page(pte); + } + VM_BUG_ON(!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))); + return NULL; +#else + unsigned long pfn = pte_pfn(pte); if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { @@ -423,10 +430,9 @@ struct page *vm_normal_page(struct vm_ar } /* - * Add some anal sanity checks for now. Eventually, - * we should just do "return pfn_to_page(pfn)", but - * in the meantime we check that we get a valid pfn, - * and that the resulting page looks ok. + * Add some anal sanity checks for now. Eventually, we should just do + * "return pfn_to_page(pfn)", but in the meantime we check that we get + * a valid pfn, and that the resulting page looks ok. */ if (unlikely(!pfn_valid(pfn))) { print_bad_pte(vma, pte, addr); @@ -434,14 +440,13 @@ struct page *vm_normal_page(struct vm_ar } /* - * NOTE! 
We still have PageReserved() pages in the page - * tables. + * NOTE! We still have PageReserved() pages in the page tables. * - * The PAGE_ZERO() pages and various VDSO mappings can - * cause them to exist. + * Eg. VDSO mappings can cause them to exist. */ out: return pfn_to_page(pfn); +#endif } /* @@ -1242,6 +1247,12 @@ int vm_insert_pfn(struct vm_area_struct pte_t *pte, entry; spinlock_t *ptl; + /* + * Technically, architectures with pte_special can avoid all these + * restrictions (same for remap_pfn_range). However we would like + * consistency in testing and feature parity among all, so we should + * try to keep these invariants in place for everybody. + */ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))); BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == (VM_PFNMAP|VM_MIXEDMAP)); @@ -1257,7 +1268,7 @@ int vm_insert_pfn(struct vm_area_struct goto out_unlock; /* Ok, finally just insert the thing.. */ - entry = pfn_pte(pfn, vma->vm_page_prot); + entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot)); set_pte_at(mm, addr, pte, entry); update_mmu_cache(vma, addr, entry); @@ -1288,7 +1299,7 @@ static int remap_pte_range(struct mm_str arch_enter_lazy_mmu_mode(); do { BUG_ON(!pte_none(*pte)); - set_pte_at(mm, addr, pte, pfn_pte(pfn, prot)); + set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot))); pfn++; } while (pte++, addr += PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); Index: linux-2.6/include/asm-alpha/pgtable.h =================================================================== --- linux-2.6.orig/include/asm-alpha/pgtable.h +++ linux-2.6/include/asm-alpha/pgtable.h @@ -269,7 +269,7 @@ extern inline int pte_write(pte_t pte) extern inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } extern inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } extern inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } -extern inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; } +extern 
inline int pte_special(pte_t pte) { return 0; } extern inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_FOW; return pte; } extern inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~(__DIRTY_BITS); return pte; } @@ -277,7 +277,7 @@ extern inline pte_t pte_mkold(pte_t pte) extern inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) &= ~_PAGE_FOW; return pte; } extern inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= __DIRTY_BITS; return pte; } extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; } -extern inline pte_t pte_mkspecial(pte_t pte) { pte_val(pte) |= _PAGE_SPECIAL; return pte; } +extern inline pte_t pte_mkspecial(pte_t pte) { return pte; } #define PAGE_DIR_OFFSET(tsk,address) pgd_offset((tsk),(address)) Index: linux-2.6/include/asm-avr32/pgtable.h =================================================================== --- linux-2.6.orig/include/asm-avr32/pgtable.h +++ linux-2.6/include/asm-avr32/pgtable.h @@ -211,6 +211,10 @@ static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } +static inline int pte_special(pte_t pte) +{ + return 0; +} /* * The following only work if pte_present() is not true. 
@@ -251,6 +255,10 @@ static inline pte_t pte_mkyoung(pte_t pt set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; } +static inline pte_t pte_mkspecial(pte_t pte) +{ + return pte; +} #define pmd_none(x) (!pmd_val(x)) #define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT) Index: linux-2.6/include/asm-cris/pgtable.h =================================================================== --- linux-2.6.orig/include/asm-cris/pgtable.h +++ linux-2.6/include/asm-cris/pgtable.h @@ -115,6 +115,7 @@ static inline int pte_write(pte_t pte) static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_MODIFIED; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } +static inline int pte_special(pte_t pte) { return 0; } static inline pte_t pte_wrprotect(pte_t pte) { @@ -162,6 +163,7 @@ static inline pte_t pte_mkyoung(pte_t pt } return pte; } +static inline pte_t pte_mkspecial(pte_t pte) { return pte; } /* * Conversion functions: convert a page and protection to a page entry, Index: linux-2.6/include/asm-frv/pgtable.h =================================================================== --- linux-2.6.orig/include/asm-frv/pgtable.h +++ linux-2.6/include/asm-frv/pgtable.h @@ -380,6 +380,7 @@ static inline pmd_t *pmd_offset(pud_t *d static inline int pte_dirty(pte_t pte) { return (pte).pte & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return (pte).pte & _PAGE_ACCESSED; } static inline int pte_write(pte_t pte) { return !((pte).pte & _PAGE_WP); } +static inline int pte_special(pte_t pte) { return 0; } static inline pte_t pte_mkclean(pte_t pte) { (pte).pte &= ~_PAGE_DIRTY; return pte; } static inline pte_t pte_mkold(pte_t pte) { (pte).pte &= ~_PAGE_ACCESSED; return pte; } @@ -387,6 +388,7 @@ static inline pte_t pte_wrprotect(pte_t static inline pte_t pte_mkdirty(pte_t pte) { (pte).pte |= _PAGE_DIRTY; return pte; } static inline pte_t pte_mkyoung(pte_t pte) { 
(pte).pte |= _PAGE_ACCESSED; return pte; } static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte &= ~_PAGE_WP; return pte; } +static inline pte_t pte_mkspecial(pte_t pte) { return pte; } static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) { Index: linux-2.6/include/asm-ia64/pgtable.h =================================================================== --- linux-2.6.orig/include/asm-ia64/pgtable.h +++ linux-2.6/include/asm-ia64/pgtable.h @@ -302,6 +302,8 @@ ia64_phys_addr_valid (unsigned long addr #define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0) #define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0) #define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0) +#define pte_special(pte) 0 + /* * Note: we convert AR_RWX to AR_RX and AR_RW to AR_R by clearing the 2nd bit in the * access rights: @@ -313,6 +315,7 @@ ia64_phys_addr_valid (unsigned long addr #define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D)) #define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D)) #define pte_mkhuge(pte) (__pte(pte_val(pte))) +#define pte_mkspecial(pte) (pte) /* * Because ia64's Icache and Dcache is not coherent (on a cpu), we need to Index: linux-2.6/include/asm-m32r/pgtable.h =================================================================== --- linux-2.6.orig/include/asm-m32r/pgtable.h +++ linux-2.6/include/asm-m32r/pgtable.h @@ -214,6 +214,11 @@ static inline int pte_file(pte_t pte) return pte_val(pte) & _PAGE_FILE; } +static inline int pte_special(pte_t pte) +{ + return 0; +} + static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~_PAGE_DIRTY; @@ -250,6 +255,11 @@ static inline pte_t pte_mkwrite(pte_t pt return pte; } +static inline pte_t pte_mkspecial(pte_t pte) +{ + return pte; +} + static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep); Index: linux-2.6/include/asm-m68k/motorola_pgtable.h 
===================================================================
--- linux-2.6.orig/include/asm-m68k/motorola_pgtable.h
+++ linux-2.6/include/asm-m68k/motorola_pgtable.h
@@ -168,6 +168,7 @@ static inline int pte_write(pte_t pte)
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; }
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
 
 static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_RONLY; return pte; }
 static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~_PAGE_DIRTY; return pte; }
@@ -185,6 +186,7 @@ static inline pte_t pte_mkcache(pte_t pt
 	pte_val(pte) = (pte_val(pte) & _CACHEMASK040) | m68k_supervisor_cachemode;
 	return pte;
 }
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
 
 #define PAGE_DIR_OFFSET(tsk,address) pgd_offset((tsk),(address))
 
Index: linux-2.6/include/asm-m68k/sun3_pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-m68k/sun3_pgtable.h
+++ linux-2.6/include/asm-m68k/sun3_pgtable.h
@@ -169,6 +169,7 @@ static inline int pte_write(pte_t pte)
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & SUN3_PAGE_MODIFIED; }
 static inline int pte_young(pte_t pte) { return pte_val(pte) & SUN3_PAGE_ACCESSED; }
 static inline int pte_file(pte_t pte) { return pte_val(pte) & SUN3_PAGE_ACCESSED; }
+static inline int pte_special(pte_t pte) { return 0; }
 
 static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_WRITEABLE; return pte; }
 static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_MODIFIED; return pte; }
@@ -181,6 +182,7 @@ static inline pte_t pte_mknocache(pte_t
 //static inline pte_t pte_mkcache(pte_t pte) { pte_val(pte) &= SUN3_PAGE_NOCACHE; return pte; }
 // until then, use:
 static inline pte_t pte_mkcache(pte_t pte) { return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 extern pgd_t kernel_pg_dir[PTRS_PER_PGD];
Index: linux-2.6/include/asm-mips/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-mips/pgtable.h
+++ linux-2.6/include/asm-mips/pgtable.h
@@ -285,6 +285,8 @@ static inline pte_t pte_mkyoung(pte_t pt
 	return pte;
 }
 #endif
+static inline int pte_special(pte_t pte) { return 0; }
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
 
 /*
  * Macro to make mark a page protection value as "uncacheable". Note
Index: linux-2.6/include/asm-parisc/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/pgtable.h
+++ linux-2.6/include/asm-parisc/pgtable.h
@@ -331,6 +331,7 @@ static inline int pte_dirty(pte_t pte)
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
 static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; }
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
 
 static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~_PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~_PAGE_ACCESSED; return pte; }
@@ -338,6 +339,7 @@ static inline pte_t pte_wrprotect(pte_t
 static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) |= _PAGE_WRITE; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
 
 /*
  * Conversion functions: convert a page and protection to a page entry,
Index: linux-2.6/include/asm-powerpc/pgtable-ppc32.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-ppc32.h
+++ linux-2.6/include/asm-powerpc/pgtable-ppc32.h
@@ -514,6 +514,7 @@ static inline int pte_write(pte_t pte)
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; }
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
 
 static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
 static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -531,6 +532,8 @@ static inline pte_t pte_mkdirty(pte_t pt
 	pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte) {
 	pte_val(pte) |= _PAGE_ACCESSED; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) {
+	return pte; }
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
Index: linux-2.6/include/asm-powerpc/pgtable-ppc64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-ppc64.h
+++ linux-2.6/include/asm-powerpc/pgtable-ppc64.h
@@ -239,6 +239,7 @@ static inline int pte_write(pte_t pte) {
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
+static inline int pte_special(pte_t pte) { return 0; }
 
 static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
 static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -257,6 +258,8 @@ static inline pte_t pte_mkyoung(pte_t pt
 	pte_val(pte) |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkhuge(pte_t pte) {
 	return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) {
+	return pte; }
 
 /* Atomic PTE updates */
 static inline unsigned long pte_update(struct mm_struct *mm,
Index: linux-2.6/include/asm-ppc/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-ppc/pgtable.h
+++ linux-2.6/include/asm-ppc/pgtable.h
@@ -537,6 +537,7 @@ static inline int pte_write(pte_t pte)
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; }
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
 
 static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
 static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -554,6 +555,8 @@ static inline pte_t pte_mkdirty(pte_t pt
 	pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte) {
 	pte_val(pte) |= _PAGE_ACCESSED; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) {
+	return pte; }
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
Index: linux-2.6/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-s390/pgtable.h
+++ linux-2.6/include/asm-s390/pgtable.h
@@ -504,6 +504,11 @@ static inline int pte_file(pte_t pte)
 	return (pte_val(pte) & mask) == _PAGE_TYPE_FILE;
 }
 
+static inline int pte_special(pte_t pte)
+{
+	return 0;
+}
+
 #define __HAVE_ARCH_PTE_SAME
 #define pte_same(a,b) (pte_val(a) == pte_val(b))
 
@@ -654,6 +659,11 @@ static inline pte_t pte_mkyoung(pte_t pt
 	return pte;
 }
 
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+	return pte;
+}
+
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long addr, pte_t *ptep)
Index: linux-2.6/include/asm-sh64/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-sh64/pgtable.h
+++ linux-2.6/include/asm-sh64/pgtable.h
@@ -419,6 +419,7 @@ static inline int pte_dirty(pte_t pte){
 static inline int pte_young(pte_t pte){ return pte_val(pte) & _PAGE_ACCESSED; }
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
 static inline int pte_write(pte_t pte){ return pte_val(pte) & _PAGE_WRITE; }
+static inline int pte_special(pte_t pte) { return 0; }
 
 static inline pte_t pte_wrprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_WRITE)); return pte; }
 static inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }
@@ -427,6 +428,7 @@ static inline pte_t pte_mkwrite(pte_t pt
 static inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; }
 static inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; }
 static inline pte_t pte_mkhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_SZHUGE)); return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
 
 /*
Index: linux-2.6/include/asm-sparc/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-sparc/pgtable.h
+++ linux-2.6/include/asm-sparc/pgtable.h
@@ -219,6 +219,11 @@ static inline int pte_file(pte_t pte)
 	return pte_val(pte) & BTFIXUP_HALF(pte_filei);
 }
 
+static inline int pte_special(pte_t pte)
+{
+	return 0;
+}
+
 /*
  */
 BTFIXUPDEF_HALF(pte_wrprotecti)
@@ -251,6 +256,8 @@ BTFIXUPDEF_CALL_CONST(pte_t, pte_mkyoung
 #define pte_mkdirty(pte) BTFIXUP_CALL(pte_mkdirty)(pte)
 #define pte_mkyoung(pte) BTFIXUP_CALL(pte_mkyoung)(pte)
 
+#define pte_mkspecial(pte) (pte)
+
 #define pfn_pte(pfn, prot) mk_pte(pfn_to_page(pfn), prot)
 
 BTFIXUPDEF_CALL(unsigned long, pte_pfn, pte_t)
Index: linux-2.6/include/asm-sparc64/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/pgtable.h
+++ linux-2.6/include/asm-sparc64/pgtable.h
@@ -506,6 +506,11 @@ static inline pte_t pte_mkyoung(pte_t pt
 	return __pte(pte_val(pte) | mask);
}
 
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+	return pte;
+}
+
 static inline unsigned long pte_young(pte_t pte)
 {
 	unsigned long mask;
@@ -608,6 +613,11 @@ static inline unsigned long pte_present(
 	return val;
 }
 
+static inline int pte_special(pte_t pte)
+{
+	return 0;
+}
+
 #define pmd_set(pmdp, ptep) \
 	(pmd_val(*(pmdp)) = (__pa((unsigned long) (ptep)) >> 11UL))
 #define pud_set(pudp, pmdp) \
Index: linux-2.6/include/asm-xtensa/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-xtensa/pgtable.h
+++ linux-2.6/include/asm-xtensa/pgtable.h
@@ -212,6 +212,8 @@ static inline int pte_write(pte_t pte) {
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; }
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
+
 static inline pte_t pte_wrprotect(pte_t pte)
 	{ pte_val(pte) &= ~(_PAGE_WRITABLE | _PAGE_HW_WRITE); return pte; }
 static inline pte_t pte_mkclean(pte_t pte)
@@ -224,6 +226,8 @@ static inline pte_t pte_mkyoung(pte_t pt
 	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)
 	{ pte_val(pte) |= _PAGE_WRITABLE; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte)
+	{ return pte; }
 
 /*
  * Conversion functions: convert a page and protection to a page entry,

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-13 3:10 ` [rfc][patch 2/2] mm: introduce optional pte_special pte bit Nick Piggin @ 2008-01-13 3:41 ` Linus Torvalds 2008-01-13 4:39 ` Nick Piggin 2008-01-16 17:14 ` David Howells 1 sibling, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2008-01-13 3:41 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch On Sun, 13 Jan 2008, Nick Piggin wrote: > > mm: introduce pte_special pte bit What's the point of this? > 23 files changed, 147 insertions(+), 37 deletions(-) That's lots of new (ugly) code, and two totally different paths, that aren't even cleanly abstracted, so now there's two separate things that are just arbitrarily selected by an #ifdef. You seem to claim that this is a performance issue, with vm_normal_page() eating up to 5% of time on some loads, but it would appear that the main thing you did that will speed up vm_normal_page() is the fact that you replaced the current

	if (unlikely(!pfn_valid(pfn))) {
		print_bad_pte(vma, pte, addr);
		return NULL;
	}

with a

	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));

which is basically compiled away in the normal case. From what I can tell, your vm_normal_page() is no less expensive than the current one is: the cost is always going to be that multiply by the size of the "struct page" implicit in pte_page(pte) (which in the current one is just written out as

	pfn = pte_pfn(pte)
	pfn_page(pfn)

rather than hidden in a macro). Does the code generation really change that radically that this makes any real difference? Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-13 3:41 ` Linus Torvalds @ 2008-01-13 4:39 ` Nick Piggin 2008-01-13 4:45 ` Linus Torvalds 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2008-01-13 4:39 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch On Sat, Jan 12, 2008 at 07:41:16PM -0800, Linus Torvalds wrote: > > > On Sun, 13 Jan 2008, Nick Piggin wrote: > > > > mm: introduce pte_special pte bit > > What's the point of this? Well, it's written in the changelog. I'm not asking for it to be merged or anything just now because obviously it isn't in a patchset with anything that needs it, just want some comments on the idea. Myself, I would like such a bit to implement lockless get_user_pages on x86. The s390 guys want to implement pfn mapped xip with the same bit in their pagetables. > > 23 files changed, 147 insertions(+), 37 deletions(-) > > That's lots of new (ugly) code, and two totally different paths, that > aren't even cleanly abstracted, so now there's two separate things that > are just arbitrarily selected by an #ifdef. How should it be cleanly abstracted? > You seem to claim that this is a performance issue, with vm_normal_page() Hmm, the performance observation was an aside. I'd love to be able to fix up the existing vm_normal_page as well. Sorry if that wasn't clear. > eating up to 5% of time on some loads, but it would appear that the main > thing you did that will speed up vm_normal_page() is the fact that you > replaced the current > > if (unlikely(!pfn_valid(pfn))) { > print_bad_pte(vma, pte, addr); > return NULL; > } FWIW I have wanted to put this under DEBUG_VM too, but there were objections. > Does the code generation really change that radically that this makes any > real difference? It's mainly for semantics.
Basically: if we have architectures with such a bit in their ptes anyway for other reasons, I'm hoping we can use it in vm_normal_page too, because it's just nicer. (and it is fundamentally faster for MIXEDMAP mappings, but that's probably not a big issue at this point) ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-13 4:39 ` Nick Piggin @ 2008-01-13 4:45 ` Linus Torvalds 2008-01-13 5:06 ` Nick Piggin 0 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2008-01-13 4:45 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch On Sun, 13 Jan 2008, Nick Piggin wrote: > On Sat, Jan 12, 2008 at 07:41:16PM -0800, Linus Torvalds wrote: > > > > What's the point of this? > > Well, it's written in the changelog. No it's not. It says "Rather than play interesting games with vmas..", but: - it doesn't say why the new model is any better - and the source code *still* has to play all the same interesting games anyway due to s390 (and possibly others) that cannot play the *new* interesting games. So I repeat: what's the _point_. > > That's lots of new (ugly) code, and two totally different paths, that > > aren't even cleanly abstracted, so now there's two separate things that > > are just arbitrarily selected by an #ifdef. > > How should it be cleanly abstracted? I don't think it can, since you have to leave the old code anyway. Which just returns me to the original question: what's the actual improvement here? Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-13 4:45 ` Linus Torvalds @ 2008-01-13 5:06 ` Nick Piggin 2008-01-13 16:50 ` Linus Torvalds 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2008-01-13 5:06 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch On Sat, Jan 12, 2008 at 08:45:06PM -0800, Linus Torvalds wrote: > > > On Sun, 13 Jan 2008, Nick Piggin wrote: > > On Sat, Jan 12, 2008 at 07:41:16PM -0800, Linus Torvalds wrote: > > > > > > What's the point of this? > > > > Well, it's written in the changelog. > > No it's not. > > It says "Rather than play interesting games with vmas..", but: OK, you have to read to the 3rd paragraph. > > > That's lots of new (ugly) code, and two totally different paths, that > > > aren't even cleanly abstracted, so now there's two separate things that > > > are just arbitrarily selected by an #ifdef. > > > > How should it be cleanly abstracted? > > I don't think it can, since you have to leave the old code anyway. I don't think having 2 code paths means it is a bad abstraction at all. We do things like that everywhere, what makes you say this is bad? My implementation is a bit crude maybe. You could have another arch call so you write it with an if() instead of an ifdef I guess. > Which just returns me to the original question: what's the actual > improvement here? Well the immediate improvement from this actual patch is just that it gives better and smaller code for vm_normal_page (even if you discount the debug checks in the existing code). But for example, s390 perhaps cannot implement VM_MIXEDMAP using pfn_valid() like we have in vm_normal_page. The alternative is basically to have a different path for those guys anyway, so why not make it a "core" code thing rather than s390 specific. So. Is there a big problem with adding this extra path? 
It's really quite simple, literally just

	if (likely(!pte_special(pte)))
		return pte_page(pte);
	return NULL;

So I don't think maintainability is a problem. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-13 5:06 ` Nick Piggin @ 2008-01-13 16:50 ` Linus Torvalds 2008-01-13 20:46 ` Martin Schwidefsky ` (2 more replies) 0 siblings, 3 replies; 24+ messages in thread From: Linus Torvalds @ 2008-01-13 16:50 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch On Sun, 13 Jan 2008, Nick Piggin wrote: > > OK, you have to read to the 3rd paragraph. That one doesn't actually say what the point is either. It just claims that there *is* a point that isn't actually explained or mentioned further. > Well the immediate improvement from this actual patch is just that > it gives better and smaller code for vm_normal_page (even if you > discount the debug checks in the existing code). It does no such thing, except for the slow-path that doesn't matter. So it may optimize the slow-path (PFNMAP/MIXEDMAP), but the real path stays exactly the same from a code generation standpoint (just checking a different bit, afaik). If those pte_special bits are required for unexplained lockless get_user_pages, is this going to get EVEN WORSE when s390 (again) cannot do it? Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-13 16:50 ` Linus Torvalds @ 2008-01-13 20:46 ` Martin Schwidefsky 2008-01-14 21:04 ` Jared Hulbert 2008-01-16 3:38 ` Nick Piggin 2 siblings, 0 replies; 24+ messages in thread From: Martin Schwidefsky @ 2008-01-13 20:46 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Hugh Dickins, Jared Hulbert, Carsten Otte, Heiko Carstens, linux-arch On Sun, 2008-01-13 at 08:50 -0800, Linus Torvalds wrote: > > Well the immediate improvement from this actual patch is just that > > it gives better and smaller code for vm_normal_page (even if you > > discount the debug checks in the existing code). > > It does no such thing, except for the slow-path that doesn't matter. > > So it may optimize the slow-path (PFNMAP/MIXEDMAP), but the real path stays > exactly the same from a code generation standpoint (just checking a > different bit, afaik). > > If those pte_special bits are required for unexplained lockless > get_user_pages, is this going to get EVEN WORSE when s390 (again) cannot > do it? The pte_special bit is not strictly required for s390 - we could implement pfn_valid in a way that would work with the new VM_MIXEDMAP vmas and copy on write. It would be slow though, because DCSS segments on s390 can have different types. For one type the pages are reference counted (hotplug memory via DCSS), for the other the pages are not reference counted (read only xip DCSS). I doubt that we will stay alone with the problem, with KVM you can easily imagine introducing hot memory add by mapping an anonymous piece of memory. For s390 the straight forward solution for pages with a pfn > max_pfn is to walk the list of all DCSS segments. For a system where /usr lives on a xip DCSS this happens frequently. It seems reasonable to me to introduce a pte bit to decide between the two cases, in particular since Nick has some other use for the bit as well (I don't know too much about that feature).
When a non-refcounting pte is established we know it is special, we just have forgotten about it in vm_normal_page. What makes this ugly is the fact that there currently are some architectures like arm that do not have room for the pte_special bit in the pte. Seems like we need a clean abstraction to allow each architecture to choose the best way to make the decision between reference counted or not. It is only two arch calls, one when a pte is created for a non-refcounting page and another for the check in vm_normal_page to get the information back. The default implementation would be a nop for the first call and a pfn_valid check for the second call. For s390 I would prefer a pte bit if I can get it. If not then we have to play games with pfn_valid. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-13 16:50 ` Linus Torvalds 2008-01-13 20:46 ` Martin Schwidefsky @ 2008-01-14 21:04 ` Jared Hulbert 2008-01-15 9:18 ` Carsten Otte 2008-01-16 3:38 ` Nick Piggin 2 siblings, 1 reply; 24+ messages in thread From: Jared Hulbert @ 2008-01-14 21:04 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Hugh Dickins, Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch > That one doesn't actually say what the point is either. It just claims > that there *is* a point that isn't actually explained or mentioned > further. I can't comment on Nick's lockless get_user_pages, but I can talk about the XIP part. As far as I understand it XIP is used with Linux in two kinds of systems. The very small (embedded ARM/MIPS) and the very big (s390). XIP support for the s390 use case was merged long ago but did not address the needs of the embedded use case. Not surprising given how different the systems are. 5 year old funky patches for the embedded use case have been the norm for hundreds of embedded designs, including millions of Japanese phones. When I asked Andrew for advice on how to proceed to get the funky patches merged he said, "We have an XIP subsystem, why don't you make that work?" When I looked into the issue I found that the vm needed a little tweaking before we could leverage the filemap_xip.c work. Fortunately Nick was eager to help. The result of that effort was the VM_MIXEDMAP patch and then Nick's get_xip_page->get_xip_address patch. In a uniquely Linux situation we now have advocates of the smallest embedded systems and the big mainframe excited about the same patchset. The only problem is we are having trouble sorting out how to deal with vm_normal_page() for the VM_MIXEDMAP case such that it works well for both ARM and s390. Hence the pte_special bit. Seems like the result is a reasonable compromise to me, I'm surprised Nick's patch is as clean as it is.
Many of the other proposed solutions are very awkward for one or the other. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-14 21:04 ` Jared Hulbert @ 2008-01-15 9:18 ` Carsten Otte 0 siblings, 0 replies; 24+ messages in thread From: Carsten Otte @ 2008-01-15 9:18 UTC (permalink / raw) To: Jared Hulbert Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, mschwid2, heicars2, linux-arch Jared Hulbert wrote: > Seems > like the result is a reasonable compromise to me, I'm surprised Nick's > patch is as clean as it is. Many of the other proposed solutions are > very awkward for one or the other. I second that. It looks way cleaner than all other attempts to solve the problem. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-13 16:50 ` Linus Torvalds 2008-01-13 20:46 ` Martin Schwidefsky 2008-01-14 21:04 ` Jared Hulbert @ 2008-01-16 3:38 ` Nick Piggin 2008-01-16 4:04 ` Linus Torvalds 2 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2008-01-16 3:38 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch On Sun, Jan 13, 2008 at 08:50:23AM -0800, Linus Torvalds wrote: > > > On Sun, 13 Jan 2008, Nick Piggin wrote: > > > > OK, you have to read to the 3rd paragraph. > > That one doesn't actually say what the point is either. It just claims > that there *is* a point that isn't actually explained or mentioned > further. OK, right after reading it again I concede it is unclear and doesn't actually make the point. My biggest concern is that the s390 people are basically thinking about doing something like this:

	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
		if (vma->vm_flags & VM_MIXEDMAP) {
#ifdef s390
			if (pte_special(pte))
				return NULL;
#else
			if (!pfn_valid(pfn))
				return NULL;
#endif
			goto out;
		} else {
			unsigned long off = (addr-vma->vm_start) >> PAGE_SHIFT;
			if (pfn == vma->vm_pgoff + off)
				return NULL;
			if (!is_cow_mapping(vma->vm_flags))
				return NULL;
		}
	}

Of course it is nicely wrapped in arch functions, but that is the basic logic. Now what happens is that already gives us 2 paths through that function which needs to be understood by developers, but it also IMO is more confusing because it mixes the 2 schemes (vma based vs pte based). Basically what I want to do is make the pte_special part usable by all architectures, and bring it out of the vma tests. I think if anything it actually makes the code clearer to a novice reader because they can quite easily see the intention of the more complex vma scheme, by understanding what the pte scheme does (although the comments are pretty good at doing that anyway).
My point is just that I don't think it is too horrible. It isn't like the poor developer who has to understand the vma scheme will have much problem with the pte scheme... > > Well the immediate improvement from this actual patch is just that > > it gives better and smaller code for vm_normal_page (even if you > > discount the debug checks in the existing code). > > It does no such thing, except for the slow-path that doesn't matter. Well it does do such a thing. Here is the old one (with the sanity check ifdefed away). 88 bytes or so, somewhat sparse in icache for the fastpath, which also has a short forward jump:

vm_normal_page:
	movabsq	$70368744177663, %rax	#, tmp66
	andq	%rax, %rdx	# tmp66, pfn
	movq	40(%rdi), %rax	# <variable>.vm_flags, D.23528
	shrq	$12, %rdx	#, pfn
	testl	$268436480, %eax	#, D.23528
	je	.L14	#,
	testl	$268435456, %eax	#, D.23528
	je	.L16	#,
	cmpq	end_pfn(%rip), %rdx	# end_pfn, pfn
	jb	.L14	#,
	jmp	.L18	#
.L16:
	subq	8(%rdi), %rsi	# <variable>.vm_start, addr
	shrq	$12, %rsi	#, addr
	addq	136(%rdi), %rsi	# <variable>.vm_pgoff, addr
	cmpq	%rsi, %rdx	# addr, pfn
	je	.L18	#,
	andl	$40, %eax	#, tmp72
	cmpl	$32, %eax	#, tmp72
	jne	.L18	#,
.L14:
	imulq	$56, %rdx, %rax	#, pfn, D.23534
	addq	mem_map(%rip), %rax	# mem_map, D.23534
	ret
.L18:
	xorl	%eax, %eax	# D.23534
	ret

The new one looks like this. It has no taken branches in the fastpath so is dense in icache, is less than half the size, and works out of registers.

vm_normal_page:
	xorl	%eax, %eax	# D.23526
	testb	$2, %dh	#, pte$pte
	jne	.L16	#,
	movabsq	$70368744177663, %rax	#, tmp66
	andq	%rax, %rdx	# tmp66, pte$pte
	shrq	$12, %rdx	#, pte$pte
	imulq	$56, %rdx, %rax	#, pte$pte, D.23526
	addq	mem_map(%rip), %rax	# mem_map, D.23526
.L16:
	ret

I think it is slightly better code even for the fastpath. > So it may optimize the slow-path (PFNMAP/MIXEDMAP), but the real path stays > exactly the same from a code generation standpoint (just checking a > different bit, afaik).
Well, possibly MIXEDMAP will become an important path too if you have a lot of use of these XIP filesystems. > If those pte_special bits are required for unexplained lockless > get_user_pages, is this going to get EVEN WORSE when s390 (again) cannot > do it? Ah, my raising of the lockless get_user_pages was just to demonstrate that some other architectures (like x86) would also like to have a pte_special bit in their pte. So it isn't like we'd be using a pte bit _simply_ to make this little improvement to vm_normal_page(), just that the path isn't going to be an s390 only thing because we may get other architectures with a pte_special bit for other good reasons. lockless get_user_pages won't touch vm_normal_page, or even the existing get_user_pages. Architectures that don't support it would just fall through to regular get_user_pages... ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-16 3:38 ` Nick Piggin @ 2008-01-16 4:04 ` Linus Torvalds 2008-01-16 4:37 ` Nick Piggin 0 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2008-01-16 4:04 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch On Wed, 16 Jan 2008, Nick Piggin wrote: > > My biggest concern is that the s390 people are basically thinking about doing > something like this: >
> 	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> 		if (vma->vm_flags & VM_MIXEDMAP) {
> #ifdef s390
> 			if (pte_special(pte))
> 				return NULL;
> #else
> 			if (!pfn_valid(pfn))
> 				return NULL;
> #endif
Ok. So now that I actually know what the reason for the insanity was. > Basically what I want to do is make the pte_special part usable by > all architectures, and bring it out of the vma tests. But if it's always there, then there is no reason for *any* of this. The right thing is to just always do

	return pte_special(pte) ? NULL : pte_page(pte);

and I'd be happy with that too. I'm just not happy with the mixing, and with the #ifdef inside code. All the other code only makes sense if there isn't a special bit in the pte, but I was literally told that the only architecture that does *not* have any free bits for that is S390. So now you use S390 as an excuse to do that "pte_special()", which is what I think EVERYBODY ELSE wanted to do in the first place! Can you see why I'm not so enamoured with the patch? Either it makes sense, and *everybody* should just use that simple one-liner, or it doesn't make sense, and *nobody* should do it. It's the "both cases" that just drives me wild. It doesn't make sense. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-16 4:04 ` Linus Torvalds @ 2008-01-16 4:37 ` Nick Piggin 2008-01-16 4:48 ` Linus Torvalds 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2008-01-16 4:37 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch On Tue, Jan 15, 2008 at 08:04:19PM -0800, Linus Torvalds wrote: > > > On Wed, 16 Jan 2008, Nick Piggin wrote: > > > > My biggest concern is that the s390 people are basically thinking about doing > > something like this: > > > > if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { > > if (vma->vm_flags & VM_MIXEDMAP) { > > #ifdef s390 > > if (pte_special(pte)) > > return NULL; > > #else > > if (!pfn_valid(pfn)) > > return NULL; > > #endif > > Ok. So now that I actually know what the reason for the insanity was. > > > Basically what I want to do is make the pte_special part usable by > > all architectures, and bring it out of the vma tests. > > But if it's always there, then there is no reason for *any* of this. The > right thing is to just always do > > return pte_special(pte) ? NULL : pte_page(pte); > > and I'd be happy with that too. I'm just not happy with the mixing, and > with the #ifdef inside code. Right, that's what I had hoped as well. But when I say pte_special *usable* by all architectures, I mean it is usable by all that can spare a bit in the pte. Apparently ARM can't because of some bug in an Xscale CPU or something (the thread is on linux-arch). > All the other code only makes sense if there isn't a special bit in the > pte, but I was literally told that the only architecture that does *not* > have any free bits for that is S390. So now you use S390 as an excuse to > do that "pte_special()", which is what I think EVERYBODY ELSE wanted to do > in the first place! I remember that too. I guess some wires got crossed somewhere.
s390 evidently does have free bits in their pte_present-type ptes. > Can you see why I'm not so enamoured with the patch? Either it makes > sense, and *everybody* should just use that simple one-liner, or it > doesn't make sense, and *nobody* should do it. > > It's the "both cases" that just drives me wild. It doesn't make sense. I can see what you mean, but if we have one architecture (arm) that cannot support pte_special, then the conclusion is that we should only support the vma-based scheme and be done with it. But in that case we will still end up with the s390 specific code in the VM_MIXEDMAP case, because they cannot support VM_MIXEDMAP with pfn_valid (because of their memory model). So once they have that special case there, I'd much rather make it an explicit core VM concept rather than hiding it away in s390 code. The argument to allow other architectures to use it is a very minor one, just that there is no reason *not* to allow them to use it if it generates even slightly better code (especially not if they want to use a pte_special bit for other reasons anyway eg. lockless GUP). Actually it is useful even just to involve the core VM people in the logic rather than hiding it in s390 stuff (which scares people away). And also makes the pte-based scheme testable on non-s390... ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit 2008-01-16 4:37 ` Nick Piggin @ 2008-01-16 4:48 ` Linus Torvalds 2008-01-16 4:51 ` David Miller 2008-01-16 5:17 ` Nick Piggin 0 siblings, 2 replies; 24+ messages in thread From: Linus Torvalds @ 2008-01-16 4:48 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch On Wed, 16 Jan 2008, Nick Piggin wrote: > > Right, that's what I had hoped as well. But when I say pte_special > *usable* by all architectures, I mean it is usable by all that can > spare a bit in the pte. Apparently ARM can't because of some bug > in an Xscale CPU or something (the thread is on linux-arch). Hmm. Can you give a pointer to some browsable archive? I guess I should subscribe, but there's too much email, too little time. linux-arch is one of the lists that I probably should look at. That said: especially for PFNMAP and friends, it may be possible to simply re-use an existing hardware bit, like (for example) a "cacheable" bit. That doesn't mean that such an architecture would have a free bit for any *arbitrary* software use, but the "no <struct page> backing" is really a pretty special feature, and may well map fairly well 1:1 with something like a "cache disable" bit (which I do think ARM has). It's not like we necessarily would want /dev/mem to be mapped cacheable *anyway*, much less on some architecture with stupid virtual caches. > I remember that too. I guess some wires got crossed somewhere. s390 > evidently does have free bits in their pte_present-type ptes. I think they had two types of PTE's - 32-bit and 64-bit. Maybe it's just the 32-bit one that was all used up (but see above - maybe cacheable bits are doable?) I do have to say, one of the reasons I enjoyed PFNMAP was that so far we've basically been able to live without any SW-specified bits at all.
Yeah, we use "software bits" on architectures to emulate dirty/accessed, but we have never really needed any "kernel internal bits". And I do think that's generally a good idea. So in that sense, I'd actually prefer the current setup if it's not a huge pain. I saw your 5% number, but I really wonder about that one. Was that perhaps with the much more expensive non-linear NUMA "pfn_to_page()"? THAT expense would drown out any vma->vm_flags costs. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit
  2008-01-16  4:48 ` Linus Torvalds
@ 2008-01-16  4:51 ` David Miller
  2008-01-16  5:23 ` Linus Torvalds
  1 sibling, 1 reply; 24+ messages in thread
From: David Miller @ 2008-01-16  4:51 UTC (permalink / raw)
To: torvalds
Cc: npiggin, hugh, jaredeh, cotte, schwidefsky, heiko.carstens,
	linux-arch

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 15 Jan 2008 20:48:42 -0800 (PST)

> Can you give a pointer to some browsable archive?

http://marc.info/?l=linux-arch
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit
  2008-01-16  4:51 ` David Miller
@ 2008-01-16  5:23 ` Linus Torvalds
  2008-01-16  5:48 ` Nick Piggin
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2008-01-16  5:23 UTC (permalink / raw)
To: David Miller
Cc: npiggin, hugh, jaredeh, cotte, schwidefsky, heiko.carstens,
	linux-arch

On Tue, 15 Jan 2008, David Miller wrote:
>
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Tue, 15 Jan 2008 20:48:42 -0800 (PST)
>
> > Can you give a pointer to some browsable archive?
>
> http://marc.info/?l=linux-arch

.. and the discussion itself that actually explains why ARM has problems
and why S390 suddenly _does_ have a bit for this after all?

(Not that I consider marc to be really "browsable" in the first place.. )

		Linus
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit
  2008-01-16  5:23 ` Linus Torvalds
@ 2008-01-16  5:48 ` Nick Piggin
  2008-01-16  9:52 ` Martin Schwidefsky
  0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-01-16  5:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Miller, hugh, jaredeh, cotte, schwidefsky, heiko.carstens,
	linux-arch

On Tue, Jan 15, 2008 at 09:23:57PM -0800, Linus Torvalds wrote:
>
> On Tue, 15 Jan 2008, David Miller wrote:
> >
> > From: Linus Torvalds <torvalds@linux-foundation.org>
> > Date: Tue, 15 Jan 2008 20:48:42 -0800 (PST)
> >
> > > Can you give a pointer to some browsable archive?
> >
> > http://marc.info/?l=linux-arch
>
> .. and the discussion itself that actually explains why ARM has problems
> and why S390 suddenly _does_ have a bit for this after all?
>
> (Not that I consider marc to be really "browsable" in the first place.. )
>
> Linus

I sent you an exact link to the thread on marc in an earlier message...
not many others archive linux-arch or linux-mm unfortunately.

But no I didn't see a discussion of why s390 does have a bit, beyond the
s390 devs just asserting there is one, and providing a patch ;) I never
saw any of the discussions concluding that s390 does *not* have a bit
spare. Do you recall if they were public?
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit
  2008-01-16  5:48 ` Nick Piggin
@ 2008-01-16  9:52 ` Martin Schwidefsky
  0 siblings, 0 replies; 24+ messages in thread
From: Martin Schwidefsky @ 2008-01-16  9:52 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, David Miller, hugh, jaredeh, cotte,
	heiko.carstens, linux-arch

On Wed, 2008-01-16 at 06:48 +0100, Nick Piggin wrote:
> On Tue, Jan 15, 2008 at 09:23:57PM -0800, Linus Torvalds wrote:
> > On Tue, 15 Jan 2008, David Miller wrote:
> > >
> > > From: Linus Torvalds <torvalds@linux-foundation.org>
> > > Date: Tue, 15 Jan 2008 20:48:42 -0800 (PST)
> > >
> > > > Can you give a pointer to some browsable archive?
> > >
> > > http://marc.info/?l=linux-arch
> >
> > .. and the discussion itself that actually explains why ARM has problems
> > and why S390 suddenly _does_ have a bit for this after all?
> >
> > (Not that I consider marc to be really "browsable" in the first place.. )
> >
> > Linus
>
> I sent you an exact link to the thread on marc in an earlier message...
> not many others archive linux-arch or linux-mm unfortunately.
>
> But no I didn't see a discussion of why s390 does have a bit, beyond the
> s390 devs just asserting there is one, and providing a patch ;) I never
> saw any of the discussions concluding that s390 does *not* have a bit
> spare. Do you recall if they were public?

Hmm, it all depends on the type of the pte. The hardware specs tell us
that the last 8 bits of a pte are software defined. There is the
hardware invalid bit 2**9 and the hardware read-only bit 2**10. Two of
the software defined bits and the two hardware bits are used for the
page type. The current code distinguishes 8 different types (the comment
says six but that is wrong :-/):

#define _PAGE_TYPE_EMPTY	0x400
#define _PAGE_TYPE_NONE		0x401
#define _PAGE_TYPE_SWAP		0x403
#define _PAGE_TYPE_FILE		0x601	/* bit 0x002 is used for offset !! */
#define _PAGE_TYPE_RO		0x200
#define _PAGE_TYPE_RW		0x000
#define _PAGE_TYPE_EX_RO	0x202
#define _PAGE_TYPE_EX_RW	0x002

The types where we have no free bits are _PAGE_TYPE_SWAP and
_PAGE_TYPE_FILE. That should be true for all architectures, since any
free bit could be used to increase the allowable size of the swap device
and the file-page offset. For 64 bit we could even make room for another
bit in the swap and file ptes since it won't hurt much to lower the swap
size / file-page offset. For 31 bit the bits are in short supply.

In ESA mode another 3 bits of the pte are reserved, they have to be zero
or bad things will happen (specification exceptions). So with the two
hardware and the two software bits a total of 7 bits of 32 bits are
lost. As the comment for _PAGE_TYPE_FILE indicates I was able to squeeze
one more bit out of a pte for file ptes. Which makes 25 free bits for a
swap pte and 26 free bits for a file pte, or 32 4GB swap devices and
64GB max file size for remap_file_pages. Again this is ONLY for 31 bit
and ONLY for the swap and file ptes. For pte_present() ptes we have 6
free bits.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.
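[Editor's note: Martin's encoding can be sanity-checked mechanically. In the sketch below the masks are the ones quoted above; the interpretation that 0x400 is the hardware invalid bit and the low byte holds the software bits is inferred from the mail's masks (invalid is set in all the not-present types), not taken from the s390 manuals, and the helper names are invented.]

```c
#include <assert.h>

/* The eight s390 page types quoted above.  Assumed interpretation:
 * 0x400 is the hardware invalid bit, 0x200 the hardware read-only bit,
 * and the low byte holds the eight software-defined bits, of which the
 * type encoding uses only 0x001 and 0x002. */
#define _PAGE_INVALID		0x400
#define _PAGE_SW_TYPE_BITS	0x003	/* software bits consumed by the type */

#define _PAGE_TYPE_EMPTY	0x400
#define _PAGE_TYPE_NONE		0x401
#define _PAGE_TYPE_SWAP		0x403
#define _PAGE_TYPE_FILE		0x601	/* bit 0x002 reused for the offset */
#define _PAGE_TYPE_RO		0x200
#define _PAGE_TYPE_RW		0x000
#define _PAGE_TYPE_EX_RO	0x202
#define _PAGE_TYPE_EX_RW	0x002

/* A pte the hardware will translate has the invalid bit clear. */
static int hw_translatable(unsigned int type)
{
	return !(type & _PAGE_INVALID);
}

/* Software bits consumed by a given type (low byte only). */
static unsigned int sw_bits_used(unsigned int type)
{
	return type & 0xff;
}
```

Checking that every type stays inside the two type bits confirms Martin's count: of the eight software-defined bits, two are spent on the type, leaving six free for pte_present() ptes.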
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit
  2008-01-16  4:48 ` Linus Torvalds
  2008-01-16  4:51 ` David Miller
@ 2008-01-16  5:17 ` Nick Piggin
  2008-01-16 10:52 ` Catalin Marinas
  2008-01-16 17:21 ` Linus Torvalds
  1 sibling, 2 replies; 24+ messages in thread
From: Nick Piggin @ 2008-01-16  5:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky,
	Heiko Carstens, linux-arch, rmk, catalin.marinas

On Tue, Jan 15, 2008 at 08:48:42PM -0800, Linus Torvalds wrote:
>
> On Wed, 16 Jan 2008, Nick Piggin wrote:
> >
> > Right, that's what I had hoped as well. But when I say pte_special
> > *usable* by all architectures, I mean it is usable by all that can
> > spare a bit in the pte. Apparently ARM can't because of some bug
> > in an Xscale CPU or something (the thread is on linux-arch).
>
> Hmm. Can you give a pointer to some browsable archive? I guess I should
> subscribe, but there's too much email, too little time. linux-arch is one
> of the lists that I probably should look at.

http://marc.info/?t=119968107900003&r=2&w=2

I've also cc'ed Russell and Catalin, who were involved in that one.

> That said: especially for PFNMAP and friends, it may be possible to simply
> re-use an existing hardware bit, like (for example) a "cacheable" bit.
>
> That doesn't mean that such an architecture would have a free bit for any
> *arbitrary* software use, but the "no <struct page> backing" is really a
> pretty special feature, and may well map fairly well 1:1 with something
> like a "cache disable" bit (which I do think ARM has).
>
> It's not like we necessarily would want /dev/mem to be mapped cacheable
> *anyway*, much less on some architecture with stupid virtual caches.
>
> > I remember that too. I guess some wires got crossed somewhere. s390
> > evidently does have free bits in their pte_present-type ptes.
>
> I think they had two types of PTEs - 32-bit and 64-bit. Maybe it's just
> the 32-bit one that was all used up (but see above - maybe cacheable bits
> are doable?)

Not sure, I'll let the s390 people chime in here. We'd still need a
pte_special bit for 32 and 64 bit ptes (if they are different)...

> I do have to say, one of the reasons I enjoyed PFNMAP was that so far
> we've basically been able to live without any SW-specified bits at all.
> Yeah, we use "software bits" on architectures to emulate dirty/accessed,
> but we have never really needed any "kernel internal bits". And I do think
> that's generally a good idea.

I agree in a way (on one hand it would be nice to simplify the whole
PFNMAP logic, on the other hand it isn't actually a problem to keep it,
and it is a nice way of avoiding the use of a pte bit, which might be
important in future even if not today...)

> So in that sense, I'd actually prefer the current setup if it's not a huge
> pain.

Well that is why I wanted to go the ifdef route (or rather, tidy it up
to be an if (CONSTANT) { } else { } or something). We could go and hide
the pte-check in s390 specific code just in the VM_MIXEDMAP case, but
again I think that is actually just making things less clear (than
having the 2 distinct cases in core code).

> I saw your 5% number, but I really wonder about that one. Was that
> perhaps with the much more expensive non-linear NUMA "pfn_to_page()"? THAT
> expense would drown out any vma->vm_flags costs.

The 5% case yes that was with SPARSEMEM. And yes we should get rid of
that pfn_valid test IMO (or put it under CONFIG_DEBUG_VM). I'm not aware
of it ever triggering, so I think it should go away (I think Hugh nacked
my attempt at this last time -- Hugh, what do you think?).

However, minor performance issues aside, I'd still hope to find a good
way to get the pte_special path in.

---

vm_normal_page has been seen to take nearly 5% kernel time in profiles,
and regularly appears in the top 10 or so functions in a profile. It is
called at every page fault, and for every pte in bulk copy or unmaps.
pfn_valid in particular can be quite expensive in some memory models.

Place that code under CONFIG_DEBUG_VM. I'm not aware of it catching any
problems.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -392,16 +392,13 @@ struct page *vm_normal_page(struct vm_ar
 		return NULL;
 	}
 
-	/*
-	 * Add some anal sanity checks for now. Eventually,
-	 * we should just do "return pfn_to_page(pfn)", but
-	 * in the meantime we check that we get a valid pfn,
-	 * and that the resulting page looks ok.
-	 */
+#ifdef CONFIG_DEBUG_VM
+	/* Check that we get a valid pfn. */
 	if (unlikely(!pfn_valid(pfn))) {
 		print_bad_pte(vma, pte, addr);
 		return NULL;
 	}
+#endif
 
 	/*
 	 * NOTE! We still have PageReserved() pages in the page
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit
  2008-01-16  5:17 ` Nick Piggin
@ 2008-01-16 10:52 ` Catalin Marinas
  2008-01-16 18:18 ` Russell King
  1 sibling, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2008-01-16 10:52 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Hugh Dickins, Jared Hulbert, Carsten Otte,
	Martin Schwidefsky, Heiko Carstens, linux-arch, rmk

On Wed, 2008-01-16 at 06:17 +0100, Nick Piggin wrote:
> On Tue, Jan 15, 2008 at 08:48:42PM -0800, Linus Torvalds wrote:
> > On Wed, 16 Jan 2008, Nick Piggin wrote:
> > > Right, that's what I had hoped as well. But when I say pte_special
> > > *usable* by all architectures, I mean it is usable by all that can
> > > spare a bit in the pte. Apparently ARM can't because of some bug
> > > in an Xscale CPU or something (the thread is on linux-arch).
> >
> > Hmm. Can you give a pointer to some browsable archive? I guess I should
> > subscribe, but there's too much email, too little time. linux-arch is one
> > of the lists that I probably should look at.
>
> http://marc.info/?t=119968107900003&r=2&w=2
>
> I've also cc'ed Russell and Catalin, who were involved in that one.

In summary, on the ARM side, on architectures ARMv5 and earlier we had
more bits available. Starting with ARMv6, because the memory ordering
model was changed, 3 bits were used for the TEX encoding (Type
EXtension), in addition to the original C and B bits (cacheable and
bufferable, which on ARMv6 and later may have a different meaning, based
on the TEX bits).

However, ARMv6 comes with a new configuration mode, "TEX remapping",
which, if enabled, allows 8 combinations of the TEX, C and B bits to be
encoded in separate registers and only use 3 bits in the PTE (TEX[0], C,
B) rather than 5 (i.e. one more bit compared to ARMv5). Since this mode
isn't backwards compatible, the decision was to not use it and add the
TEX bits to the PTE.

Because of a bug on Xscale3, Russell might use another (the last one)
bit in the PTE but I don't know the details.

There are some solutions (with drawbacks as well):

1. Use the new TEX remapping on ARMv6 (does Xscale3 support it?) and
emulate it on pre-ARMv6 CPUs. On older CPUs, this might add a few cycles
to the set_pte function. While on setting up the page tables these might
be lost in the noise, Russell mentioned the clearing up of the page
tables. Maybe the pte_clear function could use a fast-track
implementation which simply zeros the pte rather than checking the bits.

2. As above but reserve the TEX[0] bit on ARMv5 for compatibility with
ARMv6 and no emulation is needed. In this case, C and B bits have a
totally different meaning on ARMv5 and ARMv6 (as currently on the former
and, on the latter, they are just used as an index in a pre-defined
table stored in configuration registers). I don't think this would be a
problem since we build the PTE bits for various memory types at boot
time based on the CPU architecture.

3. Use TEX remapping on ARMv6 and keep the current implementation for
ARMv5. This would mean 2 separate pte_* implementations, probably
optimal for the corresponding architectures but you cannot build a
kernel supporting both at the same time (though I think even the current
implementation needs some tweaking). If this is needed, we could use
option 1 above but I think the code would be really complicated.

If Xscale3 supports TEX remapping (it's an ARMv6, so it should but I'm
not sure), I think that option 2 above could be feasible to free 2 more
bits in the PTE (leaving us with 3 spare bits). I'm happy to give this
approach a try but it's Russell that has the final word since the patch
would be pretty intrusive (and would need to be discussed on
linux-arm-kernel list first as I'm not sure what wider implications it
might have).

-- 
Catalin
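[Editor's note: the TEX-remapping idea above is easiest to see as an indirection table. The sketch below is purely illustrative -- the table contents, the memory-type names, and the function name are invented; the real hardware keeps the eight entries in system configuration registers, not in a C array. The only point it demonstrates is that with remapping enabled, three PTE bits (TEX[0], C, B) select one of eight OS-defined attribute combinations instead of encoding the attributes directly.]

```c
#include <assert.h>

/* Illustrative memory types; real ARM attribute encodings differ. */
enum memtype { STRONGLY_ORDERED, DEVICE, NORMAL_UNCACHED, NORMAL_CACHED };

/* Eight remap entries, indexed by (TEX[0] << 2) | (C << 1) | B.
 * With remapping these live in configuration registers and are
 * programmed once at boot; the assignments here are made up. */
static const enum memtype remap_table[8] = {
	STRONGLY_ORDERED,	/* 0b000 */
	DEVICE,			/* 0b001 */
	NORMAL_UNCACHED,	/* 0b010 */
	NORMAL_CACHED,		/* 0b011 */
	NORMAL_CACHED,		/* 0b100..0b111: free for the OS to define */
	NORMAL_CACHED,
	NORMAL_CACHED,
	NORMAL_CACHED,
};

/* Resolve the memory type from the three remaining PTE bits. */
static enum memtype pte_memtype(int tex0, int c, int b)
{
	return remap_table[(tex0 << 2) | (c << 1) | b];
}
```

This is why remapping frees PTE bits: the two dropped TEX bits no longer need to live in every PTE, at the cost of fixing the eight attribute combinations system-wide.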
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit
  2008-01-16 10:52 ` Catalin Marinas
@ 2008-01-16 18:18 ` Russell King
  0 siblings, 0 replies; 24+ messages in thread
From: Russell King @ 2008-01-16 18:18 UTC (permalink / raw)
To: Catalin Marinas
Cc: Nick Piggin, Linus Torvalds, Hugh Dickins, Jared Hulbert,
	Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch

On Wed, Jan 16, 2008 at 10:52:29AM +0000, Catalin Marinas wrote:
> If Xscale3 supports TEX remapping (it's an ARMv6, so it should but I'm
> not sure)

For the elimination of doubt, Xscale 3 is ARMv5 with cherry picked ARMv6
features - eg, it's VIVT, no ASID. I don't believe it has any remapping
ability... but as I've said - when I get around to resolving this shared
mmap issue I'll know more about the way forward.

At the moment, the hold up is this EABI crap - the filesystem I have for
the Xscale 3 platform is EABI only so I can't run any of my existing
test programs, and I don't have a toolchain capable of generating
executables for EABI userspace - and I don't think I can sensibly put
the OABI libraries on the platform either. So my first problem to
resolve is how to run my test programs which show up the issue on this
hardware.

The second problem which is proving difficult is to find out why one of
the patches which enabled proper use of the set_pte_ext() function on
Xscale 3 CPUs resulted in Lennert's IOP board falling over - I've not
heard anything further from Lennert on that issue yet, despite prodding.

So, I might just throw it into the next merge window _anyway_ and then
we'll get more people testing it and hopefully get to the bottom of the
problem. The other alternative is to just let the patch sit around for
ever and nothing will happen - which clearly isn't productive.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit
  2008-01-16  5:17 ` Nick Piggin
  2008-01-16 10:52 ` Catalin Marinas
@ 2008-01-16 17:21 ` Linus Torvalds
  1 sibling, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2008-01-16 17:21 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Jared Hulbert, Carsten Otte, Martin Schwidefsky,
	Heiko Carstens, linux-arch, rmk, catalin.marinas

On Wed, 16 Jan 2008, Nick Piggin wrote:
>
> http://marc.info/?t=119968107900003&r=2&w=2

Gaah. That first original patch I would have been perfectly happy with.

I guess having the fallback is inevitable. I don't like it, but oh well.

		Linus
* Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit
  2008-01-13  3:10 ` [rfc][patch 2/2] mm: introduce optional pte_special pte bit Nick Piggin
  2008-01-13  3:41 ` Linus Torvalds
@ 2008-01-16 17:14 ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: David Howells @ 2008-01-16 17:14 UTC (permalink / raw)
To: Nick Piggin
Cc: dhowells, Linus Torvalds, Hugh Dickins, Jared Hulbert,
	Carsten Otte, Martin Schwidefsky, Heiko Carstens, linux-arch

Nick Piggin <npiggin@suse.de> wrote:

> Unfortunately, not all architectures can spare a bit in the pte for
> this. So add an ifdef, depending on whether the arch can support it or
> not. Although pte_special can remove all restrictions that pfn mappings
> currently have, we continue to enforce those restrictions even on
> architectures that have the new pte bit, for consistency.

FRV has a single bit to spare in its PTEs (xAMPRx_RESERVED13) that you
could use.

MN10300 (which is currently lurking in the -mm tree) has no spare bits
unless I can drop either the PAGE_ACCESSED or PAGE_NX bits.

David
end of thread, other threads:[~2008-01-16 18:19 UTC | newest]

Thread overview: 24+ messages -- links below jump to the message on this page --
2008-01-13  3:08 [rfc] changes to user memory mapping scheme Nick Piggin
2008-01-13  3:09 ` [rfc][patch 1/2] mm: introduce VM_MIXEDMAP Nick Piggin
2008-01-13  3:10 ` [rfc][patch 2/2] mm: introduce optional pte_special pte bit Nick Piggin
2008-01-13  3:41 ` Linus Torvalds
2008-01-13  4:39 ` Nick Piggin
2008-01-13  4:45 ` Linus Torvalds
2008-01-13  5:06 ` Nick Piggin
2008-01-13 16:50 ` Linus Torvalds
2008-01-13 20:46 ` Martin Schwidefsky
2008-01-14 21:04 ` Jared Hulbert
2008-01-15  9:18 ` Carsten Otte
2008-01-16  3:38 ` Nick Piggin
2008-01-16  4:04 ` Linus Torvalds
2008-01-16  4:37 ` Nick Piggin
2008-01-16  4:48 ` Linus Torvalds
2008-01-16  4:51 ` David Miller
2008-01-16  5:23 ` Linus Torvalds
2008-01-16  5:48 ` Nick Piggin
2008-01-16  9:52 ` Martin Schwidefsky
2008-01-16  5:17 ` Nick Piggin
2008-01-16 10:52 ` Catalin Marinas
2008-01-16 18:18 ` Russell King
2008-01-16 17:21 ` Linus Torvalds
2008-01-16 17:14 ` David Howells