* [00/17] Virtual Compound Page Support V1
@ 2007-09-25 23:42 Christoph Lameter
2007-09-25 23:42 ` [01/17] Vmalloc: Move vmalloc_to_page to mm/vmalloc Christoph Lameter
` (16 more replies)
0 siblings, 17 replies; 57+ messages in thread
From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw)
Cc: linux-fsdevel, linux-kernel
RFC->V1
- Support for all compound functions for virtual compound pages
(including the compound_nth_page() necessary for LBS mmap support)
- Fix various bugs
- Fix i386 build
Currently there is a strong tendency to avoid larger page allocations in
the kernel because of past fragmentation issues, and the current
defragmentation methods are still evolving. It is not clear to what extent
they can provide reliable allocations for higher order pages (plus the
definition of "reliable" seems to be in the eye of the beholder).
We use vmalloc allocations in many locations to provide a safe
way to allocate larger arrays. That is due to the danger of higher order
allocations failing. Virtual Compound pages allow the use of regular
page allocator allocations that will fall back only if there is an actual
problem with acquiring a higher order page.
This patch set provides a way for higher order page allocations to fall
back. Instead of a physically contiguous page, a virtually contiguous page
is provided. The functionality of the vmalloc layer is used to provide
the necessary page tables and control structures to establish a virtually
contiguous area.
Advantages:
- If higher order allocations are failing then virtual compound pages
consisting of a series of order-0 pages can stand in for those
allocations.
- Reliability as long as the vmalloc layer can provide virtual mappings.
- Ability to significantly reduce the use of the vmalloc layer by using
physically contiguous memory instead of virtually contiguous memory.
Most uses of vmalloc() can be converted to page allocator calls.
- The use of physically contiguous memory instead of vmalloc may allow
the use of larger TLB entries, thus reducing TLB pressure. It also
reduces the need for page table walks.
Disadvantages:
- In order to use the fall back, the logic accessing the memory must be
aware that the memory could be backed by a virtual mapping and take
precautions: virt_to_page() and page_address() may not work, and
vmalloc_to_page() and vmalloc_address() (introduced through this
patch set) may have to be called instead (see the sketch after this list).
- Virtual mappings are less efficient than physical mappings.
Performance will drop once virtual fall back occurs.
- Virtual mappings have more memory overhead: vm_area control structures,
page tables, page arrays etc. need to be allocated and managed to provide
virtual mappings.
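
Code that may be handed either kind of memory therefore has to pick the
conversion at run time. A minimal sketch, using is_vmalloc_addr() from
patch 04 and vmalloc_to_page() as moved and constified in patches 01 and
02 (patch 08 later wraps exactly this check in addr_to_page() and
page_to_addr()):

#include <linux/mm.h>

/*
 * Translate a buffer address to its page struct, whether or not the
 * virtual fall back was taken for the allocation.
 */
static struct page *buffer_to_page(const void *buf)
{
	if (is_vmalloc_addr(buf))	/* virtual fall back occurred */
		return vmalloc_to_page(buf);
	return virt_to_page(buf);	/* physically contiguous case */
}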
The patch set provides this functionality in stages. Stage 1 introduces
the basic fall back mechanism necessary to replace vmalloc allocations
with

	alloc_pages(GFP_VFALLBACK, order)

which signifies to the page allocator that a higher order allocation is
wanted but that a virtual mapping may stand in for it if there is an
issue with fragmentation.
Stage 1 functionality does not allow allocation and freeing of virtual
mappings from interrupt contexts.
The stage 1 series ends with the conversion of two key uses of vmalloc
in the VM to alloc_pages(): the allocation of sparsemem's memmap table
and of the wait table in each zone. Other uses of vmalloc could be
converted in the same way.
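
Such a conversion follows a simple pattern; a minimal sketch, assuming an
illustrative table allocation (free_pages() copes with a virtually mapped
address once patch 08 is applied):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Allocate a large table, tolerating a virtually mapped fall back. */
static void *alloc_table(size_t size)
{
	return (void *)__get_free_pages(GFP_VFALLBACK, get_order(size));
}

/* Works for both the physically and the virtually mapped case. */
static void free_table(void *table, size_t size)
{
	free_pages((unsigned long)table, get_order(size));
}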
Stage 2 enhances the fall back further, allowing allocations and frees
in interrupt context.
SLUB is then modified to use virtual mappings for slab caches that are
marked with SLAB_VFALLBACK. If a slab cache is marked this way then we
drop all restraints regarding page order and allocate large memory areas
that fit many objects, so that we rarely have to use the slow paths.
Two slab caches--the dentry cache and the buffer_heads--are then flagged
that way. Others could be converted in the same way.
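
A cache opts in by passing the new flag at creation time. A minimal
sketch, assuming SLAB_VFALLBACK as introduced later in this series
(patch 15) and an illustrative object type:

#include <linux/init.h>
#include <linux/slab.h>

struct my_object {			/* illustrative payload */
	unsigned long data[64];
};

static struct kmem_cache *my_cache;

static int __init my_cache_init(void)
{
	/* Page order restraints are lifted; slabs may be virtually mapped. */
	my_cache = KMEM_CACHE(my_object, SLAB_VFALLBACK);
	return my_cache ? 0 : -ENOMEM;
}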
The patch set also provides a debugging aid through setting
CONFIG_VFALLBACK_ALWAYS
If set then all GFP_VFALLBACK allocations fall back to the virtual
mappings. This is useful for verification tests. This patch set was
tested by enabling that option and compiling a kernel.
The patch set is also available from the largeblock git tree:
git pull
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git
vcompound
--
^ permalink raw reply [flat|nested] 57+ messages in thread* [01/17] Vmalloc: Move vmalloc_to_page to mm/vmalloc. 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [02/17] vmalloc: add const Christoph Lameter ` (15 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_move_vmalloc_to_page --] [-- Type: text/plain, Size: 4217 bytes --] We already have page table manipulation for vmalloc in vmalloc.c. Move the vmalloc_to_page() function there as well. Move the definitions for vmalloc related functions in mm.h to before the functions dealing with compound pages because they will soon need to use them. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- include/linux/mm.h | 5 +++-- mm/memory.c | 40 ---------------------------------------- mm/vmalloc.c | 38 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 41 insertions(+), 42 deletions(-) Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2007-09-24 16:55:28.000000000 -0700 +++ linux-2.6/mm/memory.c 2007-09-24 16:55:32.000000000 -0700 @@ -2727,46 +2727,6 @@ int make_pages_present(unsigned long add return ret == len ? 0 : -1; } -/* - * Map a vmalloc()-space virtual address to the physical page. - */ -struct page * vmalloc_to_page(void * vmalloc_addr) -{ - unsigned long addr = (unsigned long) vmalloc_addr; - struct page *page = NULL; - pgd_t *pgd = pgd_offset_k(addr); - pud_t *pud; - pmd_t *pmd; - pte_t *ptep, pte; - - if (!pgd_none(*pgd)) { - pud = pud_offset(pgd, addr); - if (!pud_none(*pud)) { - pmd = pmd_offset(pud, addr); - if (!pmd_none(*pmd)) { - ptep = pte_offset_map(pmd, addr); - pte = *ptep; - if (pte_present(pte)) - page = pte_page(pte); - pte_unmap(ptep); - } - } - } - return page; -} - -EXPORT_SYMBOL(vmalloc_to_page); - -/* - * Map a vmalloc()-space virtual address to the physical page frame number. - */ -unsigned long vmalloc_to_pfn(void * vmalloc_addr) -{ - return page_to_pfn(vmalloc_to_page(vmalloc_addr)); -} - -EXPORT_SYMBOL(vmalloc_to_pfn); - #if !defined(__HAVE_ARCH_GATE_AREA) #if defined(AT_SYSINFO_EHDR) Index: linux-2.6/mm/vmalloc.c =================================================================== --- linux-2.6.orig/mm/vmalloc.c 2007-09-24 16:55:28.000000000 -0700 +++ linux-2.6/mm/vmalloc.c 2007-09-24 16:55:32.000000000 -0700 @@ -166,6 +166,44 @@ int map_vm_area(struct vm_struct *area, } EXPORT_SYMBOL_GPL(map_vm_area); +/* + * Map a vmalloc()-space virtual address to the physical page. + */ +struct page *vmalloc_to_page(void *vmalloc_addr) +{ + unsigned long addr = (unsigned long) vmalloc_addr; + struct page *page = NULL; + pgd_t *pgd = pgd_offset_k(addr); + pud_t *pud; + pmd_t *pmd; + pte_t *ptep, pte; + + if (!pgd_none(*pgd)) { + pud = pud_offset(pgd, addr); + if (!pud_none(*pud)) { + pmd = pmd_offset(pud, addr); + if (!pmd_none(*pmd)) { + ptep = pte_offset_map(pmd, addr); + pte = *ptep; + if (pte_present(pte)) + page = pte_page(pte); + pte_unmap(ptep); + } + } + } + return page; +} +EXPORT_SYMBOL(vmalloc_to_page); + +/* + * Map a vmalloc()-space virtual address to the physical page frame number. 
+ */ +unsigned long vmalloc_to_pfn(void *vmalloc_addr) +{ + return page_to_pfn(vmalloc_to_page(vmalloc_addr)); +} +EXPORT_SYMBOL(vmalloc_to_pfn); + static struct vm_struct *__get_vm_area_node(unsigned long size, unsigned long flags, unsigned long start, unsigned long end, int node, gfp_t gfp_mask) Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2007-09-24 16:55:28.000000000 -0700 +++ linux-2.6/include/linux/mm.h 2007-09-24 16:57:23.000000000 -0700 @@ -294,6 +294,9 @@ static inline int get_page_unless_zero(s return atomic_inc_not_zero(&page->_count); } +struct page *vmalloc_to_page(void *addr); +unsigned long vmalloc_to_pfn(void *addr); + static inline struct page *compound_head(struct page *page) { if (unlikely(PageTail(page))) @@ -1160,8 +1163,6 @@ static inline unsigned long vma_pages(st pgprot_t vm_get_page_prot(unsigned long vm_flags); struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr); -struct page *vmalloc_to_page(void *addr); -unsigned long vmalloc_to_pfn(void *addr); int remap_pfn_range(struct vm_area_struct *, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t); int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* [02/17] vmalloc: add const 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter 2007-09-25 23:42 ` [01/17] Vmalloc: Move vmalloc_to_page to mm/vmalloc Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [03/17] i386: Resolve dependency of asm-i386/pgtable.h on highmem.h Christoph Lameter ` (14 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_vmalloc_const --] [-- Type: text/plain, Size: 4475 bytes --] Make vmalloc functions work the same way as kfree() and friends that take a const void * argument. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- include/linux/mm.h | 4 ++-- include/linux/vmalloc.h | 6 +++--- mm/vmalloc.c | 16 ++++++++-------- 3 files changed, 13 insertions(+), 13 deletions(-) Index: linux-2.6/mm/vmalloc.c =================================================================== --- linux-2.6.orig/mm/vmalloc.c 2007-09-24 16:58:37.000000000 -0700 +++ linux-2.6/mm/vmalloc.c 2007-09-24 16:58:45.000000000 -0700 @@ -169,7 +169,7 @@ EXPORT_SYMBOL_GPL(map_vm_area); /* * Map a vmalloc()-space virtual address to the physical page. */ -struct page *vmalloc_to_page(void *vmalloc_addr) +struct page *vmalloc_to_page(const void *vmalloc_addr) { unsigned long addr = (unsigned long) vmalloc_addr; struct page *page = NULL; @@ -198,7 +198,7 @@ EXPORT_SYMBOL(vmalloc_to_page); /* * Map a vmalloc()-space virtual address to the physical page frame number. */ -unsigned long vmalloc_to_pfn(void *vmalloc_addr) +unsigned long vmalloc_to_pfn(const void *vmalloc_addr) { return page_to_pfn(vmalloc_to_page(vmalloc_addr)); } @@ -305,7 +305,7 @@ struct vm_struct *get_vm_area_node(unsig } /* Caller must hold vmlist_lock */ -static struct vm_struct *__find_vm_area(void *addr) +static struct vm_struct *__find_vm_area(const void *addr) { struct vm_struct *tmp; @@ -318,7 +318,7 @@ static struct vm_struct *__find_vm_area( } /* Caller must hold vmlist_lock */ -static struct vm_struct *__remove_vm_area(void *addr) +static struct vm_struct *__remove_vm_area(const void *addr) { struct vm_struct **p, *tmp; @@ -347,7 +347,7 @@ found: * This function returns the found VM area, but using it is NOT safe * on SMP machines, except for its size or flags. */ -struct vm_struct *remove_vm_area(void *addr) +struct vm_struct *remove_vm_area(const void *addr) { struct vm_struct *v; write_lock(&vmlist_lock); @@ -356,7 +356,7 @@ struct vm_struct *remove_vm_area(void *a return v; } -static void __vunmap(void *addr, int deallocate_pages) +static void __vunmap(const void *addr, int deallocate_pages) { struct vm_struct *area; @@ -407,7 +407,7 @@ static void __vunmap(void *addr, int dea * * Must not be called in interrupt context. */ -void vfree(void *addr) +void vfree(const void *addr) { BUG_ON(in_interrupt()); __vunmap(addr, 1); @@ -423,7 +423,7 @@ EXPORT_SYMBOL(vfree); * * Must not be called in interrupt context. 
*/ -void vunmap(void *addr) +void vunmap(const void *addr) { BUG_ON(in_interrupt()); __vunmap(addr, 0); Index: linux-2.6/include/linux/vmalloc.h =================================================================== --- linux-2.6.orig/include/linux/vmalloc.h 2007-09-24 16:58:37.000000000 -0700 +++ linux-2.6/include/linux/vmalloc.h 2007-09-24 16:58:45.000000000 -0700 @@ -45,11 +45,11 @@ extern void *vmalloc_32_user(unsigned lo extern void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot); extern void *__vmalloc_area(struct vm_struct *area, gfp_t gfp_mask, pgprot_t prot); -extern void vfree(void *addr); +extern void vfree(const void *addr); extern void *vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot); -extern void vunmap(void *addr); +extern void vunmap(const void *addr); extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr, unsigned long pgoff); @@ -71,7 +71,7 @@ extern struct vm_struct *__get_vm_area(u extern struct vm_struct *get_vm_area_node(unsigned long size, unsigned long flags, int node, gfp_t gfp_mask); -extern struct vm_struct *remove_vm_area(void *addr); +extern struct vm_struct *remove_vm_area(const void *addr); extern int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages); Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2007-09-24 16:58:44.000000000 -0700 +++ linux-2.6/include/linux/mm.h 2007-09-24 16:58:48.000000000 -0700 @@ -294,8 +294,8 @@ static inline int get_page_unless_zero(s return atomic_inc_not_zero(&page->_count); } -struct page *vmalloc_to_page(void *addr); -unsigned long vmalloc_to_pfn(void *addr); +struct page *vmalloc_to_page(const void *addr); +unsigned long vmalloc_to_pfn(const void *addr); static inline struct page *compound_head(struct page *page) { -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* [03/17] i386: Resolve dependency of asm-i386/pgtable.h on highmem.h 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter 2007-09-25 23:42 ` [01/17] Vmalloc: Move vmalloc_to_page to mm/vmalloc Christoph Lameter 2007-09-25 23:42 ` [02/17] vmalloc: add const Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [04/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries Christoph Lameter ` (13 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_fix_i386_pgtable_mess --] [-- Type: text/plain, Size: 2029 bytes --] pgtable.h does not include highmem.h but uses various constants from highmem.h. We cannot include highmem.h because highmem.h will in turn include many other include files that also depend on pgtable.h So move the definitions from highmem.h into pgtable.h. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- include/asm-i386/highmem.h | 6 ------ include/asm-i386/pgtable.h | 8 ++++++++ 2 files changed, 8 insertions(+), 6 deletions(-) Index: linux-2.6/include/asm-i386/highmem.h =================================================================== --- linux-2.6.orig/include/asm-i386/highmem.h 2007-09-20 23:54:57.000000000 -0700 +++ linux-2.6/include/asm-i386/highmem.h 2007-09-20 23:55:40.000000000 -0700 @@ -38,11 +38,6 @@ extern pte_t *pkmap_page_table; * easily, subsequent pte tables have to be allocated in one physical * chunk of RAM. */ -#ifdef CONFIG_X86_PAE -#define LAST_PKMAP 512 -#else -#define LAST_PKMAP 1024 -#endif /* * Ordering is: * @@ -58,7 +53,6 @@ extern pte_t *pkmap_page_table; * VMALLOC_START * high_memory */ -#define PKMAP_BASE ( (FIXADDR_BOOT_START - PAGE_SIZE*(LAST_PKMAP + 1)) & PMD_MASK ) #define LAST_PKMAP_MASK (LAST_PKMAP-1) #define PKMAP_NR(virt) ((virt-PKMAP_BASE) >> PAGE_SHIFT) #define PKMAP_ADDR(nr) (PKMAP_BASE + ((nr) << PAGE_SHIFT)) Index: linux-2.6/include/asm-i386/pgtable.h =================================================================== --- linux-2.6.orig/include/asm-i386/pgtable.h 2007-09-20 23:55:16.000000000 -0700 +++ linux-2.6/include/asm-i386/pgtable.h 2007-09-20 23:56:21.000000000 -0700 @@ -81,6 +81,14 @@ void paging_init(void); #define VMALLOC_OFFSET (8*1024*1024) #define VMALLOC_START (((unsigned long) high_memory + \ 2*VMALLOC_OFFSET-1) & ~(VMALLOC_OFFSET-1)) +#ifdef CONFIG_X86_PAE +#define LAST_PKMAP 512 +#else +#define LAST_PKMAP 1024 +#endif + +#define PKMAP_BASE ( (FIXADDR_BOOT_START - PAGE_SIZE*(LAST_PKMAP + 1)) & PMD_MASK ) + #ifdef CONFIG_HIGHMEM # define VMALLOC_END (PKMAP_BASE-2*PAGE_SIZE) #else -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* [04/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (2 preceding siblings ...) 2007-09-25 23:42 ` [03/17] i386: Resolve dependency of asm-i386/pgtable.h on highmem.h Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [05/17] vmalloc: clean up page array indexing Christoph Lameter ` (12 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_is_vmalloc_addr --] [-- Type: text/plain, Size: 4668 bytes --] is_vmalloc_addr() is used in a couple of places. Add a version to vmalloc.h and replace the other checks. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- drivers/net/cxgb3/cxgb3_offload.c | 4 +--- fs/ntfs/malloc.h | 3 +-- fs/proc/kcore.c | 2 +- fs/xfs/linux-2.6/kmem.c | 3 +-- fs/xfs/linux-2.6/xfs_buf.c | 3 +-- include/linux/mm.h | 8 ++++++++ mm/sparse.c | 10 +--------- 7 files changed, 14 insertions(+), 19 deletions(-) Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2007-09-24 18:32:35.000000000 -0700 +++ linux-2.6/include/linux/mm.h 2007-09-24 18:33:03.000000000 -0700 @@ -297,6 +297,14 @@ static inline int get_page_unless_zero(s struct page *vmalloc_to_page(const void *addr); unsigned long vmalloc_to_pfn(const void *addr); +/* Determine if an address is within the vmalloc range */ +static inline int is_vmalloc_addr(const void *x) +{ + unsigned long addr = (unsigned long)x; + + return addr >= VMALLOC_START && addr < VMALLOC_END; +} + static inline struct page *compound_head(struct page *page) { if (unlikely(PageTail(page))) Index: linux-2.6/mm/sparse.c =================================================================== --- linux-2.6.orig/mm/sparse.c 2007-09-24 18:30:46.000000000 -0700 +++ linux-2.6/mm/sparse.c 2007-09-24 18:33:03.000000000 -0700 @@ -289,17 +289,9 @@ got_map_ptr: return ret; } -static int vaddr_in_vmalloc_area(void *addr) -{ - if (addr >= (void *)VMALLOC_START && - addr < (void *)VMALLOC_END) - return 1; - return 0; -} - static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages) { - if (vaddr_in_vmalloc_area(memmap)) + if (is_vmalloc_addr(memmap)) vfree(memmap); else free_pages((unsigned long)memmap, Index: linux-2.6/drivers/net/cxgb3/cxgb3_offload.c =================================================================== --- linux-2.6.orig/drivers/net/cxgb3/cxgb3_offload.c 2007-09-24 18:30:46.000000000 -0700 +++ linux-2.6/drivers/net/cxgb3/cxgb3_offload.c 2007-09-24 18:33:03.000000000 -0700 @@ -1035,9 +1035,7 @@ void *cxgb_alloc_mem(unsigned long size) */ void cxgb_free_mem(void *addr) { - unsigned long p = (unsigned long)addr; - - if (p >= VMALLOC_START && p < VMALLOC_END) + if (is_vmalloc_addr(addr)) vfree(addr); else kfree(addr); Index: linux-2.6/fs/ntfs/malloc.h =================================================================== --- linux-2.6.orig/fs/ntfs/malloc.h 2007-09-24 18:30:46.000000000 -0700 +++ linux-2.6/fs/ntfs/malloc.h 2007-09-24 18:33:03.000000000 -0700 @@ -85,8 +85,7 @@ static inline void *ntfs_malloc_nofs_nof static inline void ntfs_free(void *addr) { - if (likely(((unsigned long)addr < VMALLOC_START) || - ((unsigned long)addr >= VMALLOC_END ))) { + if (!is_vmalloc_addr(addr)) { kfree(addr); /* free_page((unsigned long)addr); */ return; Index: linux-2.6/fs/proc/kcore.c 
=================================================================== --- linux-2.6.orig/fs/proc/kcore.c 2007-09-24 18:30:46.000000000 -0700 +++ linux-2.6/fs/proc/kcore.c 2007-09-24 18:33:03.000000000 -0700 @@ -325,7 +325,7 @@ read_kcore(struct file *file, char __use if (m == NULL) { if (clear_user(buffer, tsz)) return -EFAULT; - } else if ((start >= VMALLOC_START) && (start < VMALLOC_END)) { + } else if (is_vmalloc_addr((void *)start)) { char * elf_buf; struct vm_struct *m; unsigned long curstart = start; Index: linux-2.6/fs/xfs/linux-2.6/kmem.c =================================================================== --- linux-2.6.orig/fs/xfs/linux-2.6/kmem.c 2007-09-24 18:30:46.000000000 -0700 +++ linux-2.6/fs/xfs/linux-2.6/kmem.c 2007-09-24 18:33:03.000000000 -0700 @@ -92,8 +92,7 @@ kmem_zalloc_greedy(size_t *size, size_t void kmem_free(void *ptr, size_t size) { - if (((unsigned long)ptr < VMALLOC_START) || - ((unsigned long)ptr >= VMALLOC_END)) { + if (!is_vmalloc_addr(ptr)) { kfree(ptr); } else { vfree(ptr); Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c =================================================================== --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-09-24 18:30:46.000000000 -0700 +++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c 2007-09-24 18:33:03.000000000 -0700 @@ -696,8 +696,7 @@ static inline struct page * mem_to_page( void *addr) { - if (((unsigned long)addr < VMALLOC_START) || - ((unsigned long)addr >= VMALLOC_END)) { + if ((!is_vmalloc_addr(addr))) { return virt_to_page(addr); } else { return vmalloc_to_page(addr); -- ^ permalink raw reply [flat|nested] 57+ messages in thread
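The idiom that the new helper consolidates, shown as a standalone sketch
(the function name is illustrative):

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Free a buffer that may have come from either kmalloc() or vmalloc(). */
static void free_mixed_buffer(void *addr)
{
	if (is_vmalloc_addr(addr))
		vfree(addr);
	else
		kfree(addr);
}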
* [05/17] vmalloc: clean up page array indexing 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (3 preceding siblings ...) 2007-09-25 23:42 ` [04/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [06/17] vunmap: return page array passed on vmap() Christoph Lameter ` (11 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_array_indexes --] [-- Type: text/plain, Size: 1425 bytes --] The page array is repeatedly indexed both in vunmap and vmalloc_area_node(). Add a temporary variable to make it easier to read (and easier to patch later). Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/vmalloc.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) Index: linux-2.6/mm/vmalloc.c =================================================================== --- linux-2.6.orig/mm/vmalloc.c 2007-09-18 13:22:16.000000000 -0700 +++ linux-2.6/mm/vmalloc.c 2007-09-18 13:22:17.000000000 -0700 @@ -383,8 +383,10 @@ static void __vunmap(const void *addr, i int i; for (i = 0; i < area->nr_pages; i++) { - BUG_ON(!area->pages[i]); - __free_page(area->pages[i]); + struct page *page = area->pages[i]; + + BUG_ON(!page); + __free_page(page); } if (area->flags & VM_VPAGES) @@ -488,15 +490,19 @@ void *__vmalloc_area_node(struct vm_stru } for (i = 0; i < area->nr_pages; i++) { + struct page *page; + if (node < 0) - area->pages[i] = alloc_page(gfp_mask); + page = alloc_page(gfp_mask); else - area->pages[i] = alloc_pages_node(node, gfp_mask, 0); - if (unlikely(!area->pages[i])) { + page = alloc_pages_node(node, gfp_mask, 0); + + if (unlikely(!page)) { /* Successfully allocated i pages, free them in __vunmap() */ area->nr_pages = i; goto fail; } + area->pages[i] = page; } if (map_vm_area(area, prot, &pages)) -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* [06/17] vunmap: return page array passed on vmap() 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (4 preceding siblings ...) 2007-09-25 23:42 ` [05/17] vmalloc: clean up page array indexing Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [07/17] vmalloc_address(): Determine vmalloc address from page struct Christoph Lameter ` (10 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_vunmap_returns_pages --] [-- Type: text/plain, Size: 4263 bytes --] Make vunmap return the page array that was used at vmap. This is useful if one has no structures to track the page array but simply stores the virtual address somewhere. The disposition of the page array can be decided upon after vunmap. vfree() may now also be used instead of vunmap which will release the page array after vunmap'ping it. As noted by Kamezawa: The same subsystem that provides the page array to vmap must must use its own method to dispose of the page array. If vfree() is called to free the page array then the page array must either be 1. Allocated via the slab allocator 2. Allocated via vmalloc but then VM_VPAGES must have been passed at vunmap to specify that a vfree is needed. RFC->v1: - Add comment explaining how to use vfree() to dispose of the page array passed on vmap(). Signed-off-by: Christoph Lameter <clameter@sgi.com> --- include/linux/vmalloc.h | 2 +- mm/vmalloc.c | 33 +++++++++++++++++++++++---------- 2 files changed, 24 insertions(+), 11 deletions(-) Index: linux-2.6/include/linux/vmalloc.h =================================================================== --- linux-2.6.orig/include/linux/vmalloc.h 2007-09-24 15:52:53.000000000 -0700 +++ linux-2.6/include/linux/vmalloc.h 2007-09-24 15:59:15.000000000 -0700 @@ -49,7 +49,7 @@ extern void vfree(const void *addr); extern void *vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot); -extern void vunmap(const void *addr); +extern struct page **vunmap(const void *addr); extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr, unsigned long pgoff); Index: linux-2.6/mm/vmalloc.c =================================================================== --- linux-2.6.orig/mm/vmalloc.c 2007-09-24 15:56:49.000000000 -0700 +++ linux-2.6/mm/vmalloc.c 2007-09-24 16:02:10.000000000 -0700 @@ -356,17 +356,18 @@ struct vm_struct *remove_vm_area(const v return v; } -static void __vunmap(const void *addr, int deallocate_pages) +static struct page **__vunmap(const void *addr, int deallocate_pages) { struct vm_struct *area; + struct page **pages; if (!addr) - return; + return NULL; if ((PAGE_SIZE-1) & (unsigned long)addr) { printk(KERN_ERR "Trying to vfree() bad address (%p)\n", addr); WARN_ON(1); - return; + return NULL; } area = remove_vm_area(addr); @@ -374,29 +375,30 @@ static void __vunmap(const void *addr, i printk(KERN_ERR "Trying to vfree() nonexistent vm area (%p)\n", addr); WARN_ON(1); - return; + return NULL; } + pages = area->pages; debug_check_no_locks_freed(addr, area->size); if (deallocate_pages) { int i; for (i = 0; i < area->nr_pages; i++) { - struct page *page = area->pages[i]; + struct page *page = pages[i]; BUG_ON(!page); __free_page(page); } if (area->flags & VM_VPAGES) - vfree(area->pages); + vfree(pages); else - kfree(area->pages); + kfree(pages); } kfree(area); - return; + return pages; } /** @@ -424,11 +426,13 @@ 
EXPORT_SYMBOL(vfree); * which was created from the page array passed to vmap(). * * Must not be called in interrupt context. + * + * Returns a pointer to the array of pointers to page structs */ -void vunmap(const void *addr) +struct page **vunmap(const void *addr) { BUG_ON(in_interrupt()); - __vunmap(addr, 0); + return __vunmap(addr, 0); } EXPORT_SYMBOL(vunmap); @@ -441,6 +445,13 @@ EXPORT_SYMBOL(vunmap); * * Maps @count pages from @pages into contiguous kernel virtual * space. + * + * The page array may be freed via vfree() on the virtual address + * returned. In that case the page array must be allocated via + * the slab allocator. If the page array was allocated via + * vmalloc then VM_VPAGES must be specified in the flags. There is + * no support for vfree() to free a page array allocated via the + * page allocator. */ void *vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot) @@ -453,6 +464,8 @@ void *vmap(struct page **pages, unsigned area = get_vm_area((count << PAGE_SHIFT), flags); if (!area) return NULL; + area->pages = pages; + area->nr_pages = count; if (map_vm_area(area, prot, &pages)) { vunmap(area->addr); return NULL; -- ^ permalink raw reply [flat|nested] 57+ messages in thread
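The usage pattern this enables, as a sketch (assuming the page array was
allocated with kmalloc() and the caller owns the pages; names are
illustrative):

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Option 1: take the page array back and dispose of it explicitly. */
static void teardown_keep_array(void *addr, unsigned int count)
{
	struct page **pages = vunmap(addr);
	unsigned int i;

	for (i = 0; i < count; i++)
		__free_page(pages[i]);
	kfree(pages);			/* the array came from kmalloc() */
}

/* Option 2: vfree() unmaps, frees the pages and kfree()s the array. */
static void teardown_drop_all(void *addr)
{
	vfree(addr);
}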
* [07/17] vmalloc_address(): Determine vmalloc address from page struct 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (5 preceding siblings ...) 2007-09-25 23:42 ` [06/17] vunmap: return page array passed on vmap() Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [08/17] GFP_VFALLBACK: Allow fallback of compound pages to virtual mappings Christoph Lameter ` (9 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_vmalloc_address --] [-- Type: text/plain, Size: 3361 bytes --] Sometimes we need to figure out which vmalloc address is in use for a certain page struct. There is no easy way to figure out the vmalloc address from the page struct. So simply search through the kernel page tables to find the address. This is a fairly expensive process. Use sparingly (or provide a better implementation). Signed-off-by: Christoph Lameter <clameter@sgi.com> --- include/linux/mm.h | 1 mm/vmalloc.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 78 insertions(+) Index: linux-2.6/mm/vmalloc.c =================================================================== --- linux-2.6.orig/mm/vmalloc.c 2007-09-24 16:59:54.000000000 -0700 +++ linux-2.6/mm/vmalloc.c 2007-09-24 17:00:07.000000000 -0700 @@ -196,6 +196,83 @@ struct page *vmalloc_to_page(const void EXPORT_SYMBOL(vmalloc_to_page); /* + * Determine vmalloc address from a page struct. + * + * Linear search through all ptes of the vmalloc area. + */ +static unsigned long vaddr_pte_range(pmd_t *pmd, unsigned long addr, + unsigned long end, unsigned long pfn) +{ + pte_t *pte; + + pte = pte_offset_kernel(pmd, addr); + do { + pte_t ptent = *pte; + if (pte_present(ptent) && pte_pfn(ptent) == pfn) + return addr; + } while (pte++, addr += PAGE_SIZE, addr != end); + return 0; +} + +static inline unsigned long vaddr_pmd_range(pud_t *pud, unsigned long addr, + unsigned long end, unsigned long pfn) +{ + pmd_t *pmd; + unsigned long next; + unsigned long n; + + pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + if (pmd_none_or_clear_bad(pmd)) + continue; + n = vaddr_pte_range(pmd, addr, next, pfn); + if (n) + return n; + } while (pmd++, addr = next, addr != end); + return 0; +} + +static inline unsigned long vaddr_pud_range(pgd_t *pgd, unsigned long addr, + unsigned long end, unsigned long pfn) +{ + pud_t *pud; + unsigned long next; + unsigned long n; + + pud = pud_offset(pgd, addr); + do { + next = pud_addr_end(addr, end); + if (pud_none_or_clear_bad(pud)) + continue; + n = vaddr_pmd_range(pud, addr, next, pfn); + if (n) + return n; + } while (pud++, addr = next, addr != end); + return 0; +} + +void *vmalloc_address(struct page *page) +{ + pgd_t *pgd; + unsigned long next, n; + unsigned long addr = VMALLOC_START; + unsigned long pfn = page_to_pfn(page); + + pgd = pgd_offset_k(VMALLOC_START); + do { + next = pgd_addr_end(addr, VMALLOC_END); + if (pgd_none_or_clear_bad(pgd)) + continue; + n = vaddr_pud_range(pgd, addr, next, pfn); + if (n) + return (void *)n; + } while (pgd++, addr = next, addr < VMALLOC_END); + return NULL; +} +EXPORT_SYMBOL(vmalloc_address); + +/* * Map a vmalloc()-space virtual address to the physical page frame number. 
*/ unsigned long vmalloc_to_pfn(const void *vmalloc_addr) Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2007-09-24 17:00:33.000000000 -0700 +++ linux-2.6/include/linux/mm.h 2007-09-24 17:00:42.000000000 -0700 @@ -296,6 +296,7 @@ static inline int get_page_unless_zero(s struct page *vmalloc_to_page(const void *addr); unsigned long vmalloc_to_pfn(const void *addr); +void *vmalloc_address(struct page *); /* Determine if an address is within the vmalloc range */ static inline int is_vmalloc_addr(const void *x) -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* [08/17] GFP_VFALLBACK: Allow fallback of compound pages to virtual mappings 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (6 preceding siblings ...) 2007-09-25 23:42 ` [07/17] vmalloc_address(): Determine vmalloc address from page struct Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [09/17] VFALLBACK: Debugging aid Christoph Lameter ` (8 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_core --] [-- Type: text/plain, Size: 13194 bytes --] Add a new gfp flag __GFP_VFALLBACK If specified during a higher order allocation then the system will fall back to vmap and attempt to create a virtually contiguous area instead of a physically contiguous area. In many cases the virtually contiguous area can stand in for the physically contiguous area (with some loss of performance). The pages used for VFALLBACK are marked with a new flag PageVcompound(page). The mark is necessary since we have to know upon free if we have to destroy a virtual mapping. No additional flag is consumed through the use of PG_swapcache together with PG_compound (similar to PageHead() and PageTail()). Also add a new function compound_nth_page(page, n) to find the nth page of a compound page. For real compound pages this simply reduces to page + n. For virtual compound pages we need to consult the page tables to figure out the nth page. Add new page to address and vice versa functions. struct page *addr_to_page(const void *address); void *page_to_addr(struct page *); The new conversion functions allow the conversion of vmalloc areas to the corresponding page structs that back it and vice versa. If the addresses or the page struct is not part of a vmalloc function then fall back to virt_to_page and page_address(). Signed-off-by: Christoph Lameter <clameter@sgi.com> --- include/linux/gfp.h | 5 + include/linux/mm.h | 33 +++++++++-- include/linux/page-flags.h | 18 ++++++ mm/page_alloc.c | 131 ++++++++++++++++++++++++++++++++++++++++----- mm/vmalloc.c | 10 +++ 5 files changed, 179 insertions(+), 18 deletions(-) Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c 2007-09-25 10:22:16.000000000 -0700 +++ linux-2.6/mm/page_alloc.c 2007-09-25 10:22:36.000000000 -0700 @@ -60,6 +60,8 @@ long nr_swap_pages; int percpu_pagelist_fraction; static void __free_pages_ok(struct page *page, unsigned int order); +static struct page *vcompound_alloc(gfp_t, int, + struct zonelist *, unsigned long); /* * results with 256, 32 in the lowmem_reserve sysctl: @@ -251,7 +253,7 @@ static void prep_compound_page(struct pa set_compound_order(page, order); __SetPageHead(page); for (i = 1; i < nr_pages; i++) { - struct page *p = page + i; + struct page *p = compound_nth_page(page, i); __SetPageTail(p); p->first_page = page; @@ -266,17 +268,23 @@ static void destroy_compound_page(struct if (unlikely(compound_order(page) != order)) bad_page(page); - if (unlikely(!PageHead(page))) - bad_page(page); - __ClearPageHead(page); for (i = 1; i < nr_pages; i++) { - struct page *p = page + i; + struct page *p = compound_nth_page(page, i); if (unlikely(!PageTail(p) | (p->first_page != page))) bad_page(page); __ClearPageTail(p); } + + /* + * The PageHead is important since it determines how operations on + * a compound page have to be performed. 
We can only tear the head + * down after all the tail pages are done. + */ + if (unlikely(!PageHead(page))) + bad_page(page); + __ClearPageHead(page); } static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags) @@ -1230,6 +1238,82 @@ try_next_zone: } /* + * Virtual Compound Page support. + * + * Virtual Compound Pages are used to fall back to order 0 allocations if large + * linear mappings are not available and __GFP_VFALLBACK is set. They are + * formatted according to compound page conventions. I.e. following + * page->first_page if PageTail(page) is set can be used to determine the + * head page. + */ +static noinline struct page *vcompound_alloc(gfp_t gfp_mask, int order, + struct zonelist *zonelist, unsigned long alloc_flags) +{ + void *addr; + struct page *page; + int i; + int nr_pages = 1 << order; + struct page **pages = kmalloc(nr_pages * sizeof(struct page *), + gfp_mask & GFP_LEVEL_MASK); + + if (!pages) + return NULL; + + for (i = 0; i < nr_pages; i++) { + page = get_page_from_freelist(gfp_mask & ~__GFP_VFALLBACK, + 0, zonelist, alloc_flags); + if (!page) + goto abort; + + /* Sets PageCompound which makes PageHead(page) true */ + __SetPageVcompound(page); + pages[i] = page; + } + addr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); + if (!addr) + goto abort; + + prep_compound_page(pages[0], order); + return pages[0]; + +abort: + while (i-- > 0) { + page = pages[i]; + if (!page) + continue; + __ClearPageVcompound(page); + __free_page(page); + } + kfree(pages); + return NULL; +} + +static void vcompound_free(void *addr) +{ + struct page **pages; + int i; + struct page *page = vmalloc_to_page(addr); + int order = compound_order(page); + int nr_pages = 1 << order; + + destroy_compound_page(page, order); + pages = vunmap(addr); + /* + * First page will have zero refcount since it maintains state + * for the compound and was decremented before we got here. + */ + __ClearPageVcompound(page); + free_hot_page(page); + + for (i = 1; i < nr_pages; i++) { + page = pages[i]; + __ClearPageVcompound(page); + __free_page(page); + } + kfree(pages); +} + +/* * This is the 'heart' of the zoned buddy allocator. */ struct page * fastcall @@ -1324,12 +1408,12 @@ nofail_alloc: goto nofail_alloc; } } - goto nopage; + goto try_vcompound; } /* Atomic allocations - we can't balance anything */ if (!wait) - goto nopage; + goto try_vcompound; cond_resched(); @@ -1360,6 +1444,11 @@ nofail_alloc: */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order, zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET); + + if (!page && order && (gfp_mask & __GFP_VFALLBACK)) + page = vcompound_alloc(gfp_mask, order, + zonelist, alloc_flags); + if (page) goto got_pg; @@ -1391,6 +1480,14 @@ nofail_alloc: goto rebalance; } +try_vcompound: + /* Last chance before failing the allocation */ + if (order && (gfp_mask & __GFP_VFALLBACK)) { + page = vcompound_alloc(gfp_mask, order, + zonelist, alloc_flags); + if (page) + goto got_pg; + } nopage: if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) { printk(KERN_WARNING "%s: page allocation failure." 
@@ -1414,7 +1511,7 @@ fastcall unsigned long __get_free_pages( page = alloc_pages(gfp_mask, order); if (!page) return 0; - return (unsigned long) page_address(page); + return (unsigned long) page_to_addr(page); } EXPORT_SYMBOL(__get_free_pages); @@ -1431,7 +1528,7 @@ fastcall unsigned long get_zeroed_page(g page = alloc_pages(gfp_mask | __GFP_ZERO, 0); if (page) - return (unsigned long) page_address(page); + return (unsigned long) page_to_addr(page); return 0; } @@ -1450,8 +1547,12 @@ fastcall void __free_pages(struct page * if (put_page_testzero(page)) { if (order == 0) free_hot_page(page); - else - __free_pages_ok(page, order); + else { + if (unlikely(PageVcompound(page))) + vcompound_free(vmalloc_address(page)); + else + __free_pages_ok(page, order); + } } } @@ -1460,8 +1561,12 @@ EXPORT_SYMBOL(__free_pages); fastcall void free_pages(unsigned long addr, unsigned int order) { if (addr != 0) { - VM_BUG_ON(!virt_addr_valid((void *)addr)); - __free_pages(virt_to_page((void *)addr), order); + if (unlikely(is_vmalloc_addr((void *)addr))) + vcompound_free((void *)addr); + else { + VM_BUG_ON(!virt_addr_valid((void *)addr)); + __free_pages(virt_to_page((void *)addr), order); + } } } Index: linux-2.6/include/linux/gfp.h =================================================================== --- linux-2.6.orig/include/linux/gfp.h 2007-09-25 10:22:16.000000000 -0700 +++ linux-2.6/include/linux/gfp.h 2007-09-25 10:22:36.000000000 -0700 @@ -43,6 +43,7 @@ struct vm_area_struct; #define __GFP_REPEAT ((__force gfp_t)0x400u) /* Retry the allocation. Might fail */ #define __GFP_NOFAIL ((__force gfp_t)0x800u) /* Retry for ever. Cannot fail */ #define __GFP_NORETRY ((__force gfp_t)0x1000u)/* Do not retry. Might fail */ +#define __GFP_VFALLBACK ((__force gfp_t)0x2000u)/* Permit fallback to vmalloc */ #define __GFP_COMP ((__force gfp_t)0x4000u)/* Add compound page metadata */ #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */ #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */ @@ -86,6 +87,10 @@ struct vm_area_struct; #define GFP_THISNODE ((__force gfp_t)0) #endif +/* + * Allocate large page but allow fallback to a virtually mapped page + */ +#define GFP_VFALLBACK (GFP_KERNEL | __GFP_VFALLBACK) /* Flag - indicates that the buffer will be suitable for DMA. Ignored on some platforms, used as appropriate on others */ Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2007-09-25 10:22:16.000000000 -0700 +++ linux-2.6/include/linux/page-flags.h 2007-09-25 10:22:36.000000000 -0700 @@ -248,6 +248,24 @@ static inline void __ClearPageTail(struc #define __SetPageHead(page) __SetPageCompound(page) #define __ClearPageHead(page) __ClearPageCompound(page) +/* + * PG_swapcache is used in combination with PG_compound to indicate + * that a compound page was allocated via vmalloc. 
+ */ +#define PG_vcompound_mask ((1L << PG_compound) | (1L << PG_swapcache)) +#define PageVcompound(page) ((page->flags & PG_vcompound_mask) \ + == PG_vcompound_mask) + +static inline void __SetPageVcompound(struct page *page) +{ + page->flags |= PG_vcompound_mask; +} + +static inline void __ClearPageVcompound(struct page *page) +{ + page->flags &= ~PG_vcompound_mask; +} + #ifdef CONFIG_SWAP #define PageSwapCache(page) test_bit(PG_swapcache, &(page)->flags) #define SetPageSwapCache(page) set_bit(PG_swapcache, &(page)->flags) Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2007-09-25 10:22:35.000000000 -0700 +++ linux-2.6/include/linux/mm.h 2007-09-25 10:22:36.000000000 -0700 @@ -297,6 +297,7 @@ static inline int get_page_unless_zero(s struct page *vmalloc_to_page(const void *addr); unsigned long vmalloc_to_pfn(const void *addr); void *vmalloc_address(struct page *); +struct page *vmalloc_nth_page(struct page *page, int n); /* Determine if an address is within the vmalloc range */ static inline int is_vmalloc_addr(const void *x) @@ -306,6 +307,14 @@ static inline int is_vmalloc_addr(const return addr >= VMALLOC_START && addr < VMALLOC_END; } +static inline struct page *addr_to_page(const void *x) +{ + + if (unlikely(is_vmalloc_addr(x))) + return vmalloc_to_page(x); + return virt_to_page(x); +} + static inline struct page *compound_head(struct page *page) { if (unlikely(PageTail(page))) @@ -327,7 +336,7 @@ static inline void get_page(struct page static inline struct page *virt_to_head_page(const void *x) { - struct page *page = virt_to_page(x); + struct page *page = addr_to_page(x); return compound_head(page); } @@ -352,27 +361,34 @@ void split_page(struct page *page, unsig */ typedef void compound_page_dtor(struct page *); +static inline struct page *compound_nth_page(struct page *page, int n) +{ + if (likely(!PageVcompound(page))) + return page + n; + return vmalloc_nth_page(page, n); +} + static inline void set_compound_page_dtor(struct page *page, compound_page_dtor *dtor) { - page[1].lru.next = (void *)dtor; + compound_nth_page(page, 1)->lru.next = (void *)dtor; } static inline compound_page_dtor *get_compound_page_dtor(struct page *page) { - return (compound_page_dtor *)page[1].lru.next; + return (compound_page_dtor *)compound_nth_page(page, 1)->lru.next; } static inline int compound_order(struct page *page) { if (!PageHead(page)) return 0; - return (unsigned long)page[1].lru.prev; + return (unsigned long)compound_nth_page(page, 1)->lru.prev; } static inline void set_compound_order(struct page *page, unsigned long order) { - page[1].lru.prev = (void *)order; + compound_nth_page(page, 1)->lru.prev = (void *)order; } /* @@ -624,6 +640,13 @@ void page_address_init(void); #define page_address_init() do { } while(0) #endif +static inline void *page_to_addr(struct page *page) +{ + if (unlikely(PageVcompound(page))) + return vmalloc_address(page); + return page_address(page); +} + /* * On an anonymous page mapped into a user virtual memory area, * page->mapping points to its anon_vma, not to a struct address_space; Index: linux-2.6/mm/vmalloc.c =================================================================== --- linux-2.6.orig/mm/vmalloc.c 2007-09-25 10:22:35.000000000 -0700 +++ linux-2.6/mm/vmalloc.c 2007-09-25 10:22:38.000000000 -0700 @@ -657,6 +657,16 @@ void *vmalloc(unsigned long size) } EXPORT_SYMBOL(vmalloc); +/* + * Given a pointer to the first page struct: + * Determine a pointer to the 
nth page. + */ +struct page *vmalloc_nth_page(struct page *page, int n) +{ + return vmalloc_to_page(vmalloc_address(page) + n * PAGE_SIZE); +} +EXPORT_SYMBOL(vmalloc_nth_page); + /** * vmalloc_user - allocate zeroed virtually contiguous memory for userspace * @size: allocation size -- ^ permalink raw reply [flat|nested] 57+ messages in thread
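A sketch of how a caller would use the fall back (helper names are
illustrative; GFP_VFALLBACK, page_to_addr() and the vmalloc-aware
free_pages() are all introduced by this patch):

#include <linux/gfp.h>
#include <linux/mm.h>

static void *grab_buffer(unsigned int order)
{
	/* May come back backed by a virtual mapping under fragmentation. */
	struct page *page = alloc_pages(GFP_VFALLBACK, order);

	if (!page)
		return NULL;
	/* page_address() alone would be wrong in the virtual case. */
	return page_to_addr(page);
}

static void drop_buffer(void *buf, unsigned int order)
{
	/* Detects a vmalloc address and tears down the mapping if needed. */
	free_pages((unsigned long)buf, order);
}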
* [09/17] VFALLBACK: Debugging aid 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (7 preceding siblings ...) 2007-09-25 23:42 ` [08/17] GFP_VFALLBACK: Allow fallback of compound pages to virtual mappings Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [10/17] Use GFP_VFALLBACK for sparsemem Christoph Lameter ` (7 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_debugging_aid --] [-- Type: text/plain, Size: 1881 bytes --] Virtual fallbacks are rare and thus subtle bugs may creep in if we do not test the fallbacks. CONFIG_VFALLBACK_ALWAYS makes all GFP_VFALLBACK allocations fall back to virtual mapping. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- lib/Kconfig.debug | 11 +++++++++++ mm/page_alloc.c | 6 ++++++ 2 files changed, 17 insertions(+) Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c 2007-09-24 18:48:03.000000000 -0700 +++ linux-2.6/mm/page_alloc.c 2007-09-24 18:58:52.000000000 -0700 @@ -1208,6 +1208,12 @@ zonelist_scan: } } +#ifdef CONFIG_VFALLBACK_ALWAYS + if ((gfp_mask & __GFP_VFALLBACK) && + system_state == SYSTEM_RUNNING) + return vcompound_alloc(gfp_mask, order, + zonelist, alloc_flags); +#endif page = buffered_rmqueue(zonelist, zone, order, gfp_mask); if (page) break; Index: linux-2.6/lib/Kconfig.debug =================================================================== --- linux-2.6.orig/lib/Kconfig.debug 2007-09-24 18:30:45.000000000 -0700 +++ linux-2.6/lib/Kconfig.debug 2007-09-24 18:48:06.000000000 -0700 @@ -105,6 +105,17 @@ config DETECT_SOFTLOCKUP can be detected via the NMI-watchdog, on platforms that support it.) +config VFALLBACK_ALWAYS + bool "Always fall back to Virtual Compound pages" + default y + help + Virtual compound pages are only allocated if there is no linear + memory available. They are a fallback and errors created by the + use of virtual mappings instead of linear ones may not surface + because of their infrequent use. This option makes every + allocation that allows a fallback to a virtual mapping use + the virtual mapping. May have a significant performance impact. + config SCHED_DEBUG bool "Collect scheduler debugging info" depends on DEBUG_KERNEL && PROC_FS -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* [10/17] Use GFP_VFALLBACK for sparsemem. 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (8 preceding siblings ...) 2007-09-25 23:42 ` [09/17] VFALLBACK: Debugging aid Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [11/17] GFP_VFALLBACK for zone wait table Christoph Lameter ` (6 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_sparse_gfp_vfallback --] [-- Type: text/plain, Size: 1471 bytes --] Sparsemem currently attempts first to do a physically contiguous mapping and then falls back to vmalloc. The same thing can now be accomplished using GFP_VFALLBACK. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/sparse.c | 23 +++-------------------- 1 file changed, 3 insertions(+), 20 deletions(-) Index: linux-2.6/mm/sparse.c =================================================================== --- linux-2.6.orig/mm/sparse.c 2007-09-19 18:05:34.000000000 -0700 +++ linux-2.6/mm/sparse.c 2007-09-19 18:27:25.000000000 -0700 @@ -269,32 +269,15 @@ void __init sparse_init(void) #ifdef CONFIG_MEMORY_HOTPLUG static struct page *__kmalloc_section_memmap(unsigned long nr_pages) { - struct page *page, *ret; unsigned long memmap_size = sizeof(struct page) * nr_pages; - page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size)); - if (page) - goto got_map_page; - - ret = vmalloc(memmap_size); - if (ret) - goto got_map_ptr; - - return NULL; -got_map_page: - ret = (struct page *)pfn_to_kaddr(page_to_pfn(page)); -got_map_ptr: - memset(ret, 0, memmap_size); - - return ret; + return (struct page *)__get_free_pages(GFP_VFALLBACK, + get_order(memmap_size)); } static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages) { - if (is_vmalloc_addr(memmap)) - vfree(memmap); - else - free_pages((unsigned long)memmap, + free_pages((unsigned long)memmap, get_order(sizeof(struct page) * nr_pages)); } -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* [11/17] GFP_VFALLBACK for zone wait table. 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (9 preceding siblings ...) 2007-09-25 23:42 ` [10/17] Use GFP_VFALLBACK for sparsemem Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [12/17] Virtual Compound page allocation from interrupt context Christoph Lameter ` (5 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_wait_table_no_vmalloc --] [-- Type: text/plain, Size: 1002 bytes --] Currently vmalloc is used for the zone wait table possibly generating the need to generate lots of TLBs to access the tables. We can now use GFP_VFALLBACK to attempt the use of a physically contiguous page that can then use the large kernel TLBs. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/page_alloc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c 2007-09-24 18:48:06.000000000 -0700 +++ linux-2.6/mm/page_alloc.c 2007-09-24 18:48:16.000000000 -0700 @@ -2550,7 +2550,9 @@ int zone_wait_table_init(struct zone *zo * To use this new node's memory, further consideration will be * necessary. */ - zone->wait_table = (wait_queue_head_t *)vmalloc(alloc_size); + zone->wait_table = (wait_queue_head_t *) + __get_free_pages(GFP_VFALLBACK, + get_order(alloc_size)); } if (!zone->wait_table) return -ENOMEM; -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* [12/17] Virtual Compound page allocation from interrupt context. 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (10 preceding siblings ...) 2007-09-25 23:42 ` [11/17] GFP_VFALLBACK for zone wait table Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [13/17] Virtual compound page freeing in " Christoph Lameter ` (4 subsequent siblings) 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_interrupt_alloc --] [-- Type: text/plain, Size: 1527 bytes --] In an interrupt context we cannot wait for the vmlist_lock in __get_vm_area_node(). So use a trylock instead. If the trylock fails then the atomic allocation will fail and subsequently be retried. This only works because the flush_cache_vunmap in use for allocation is never performing any IPIs in contrast to flush_tlb_... in use for freeing. flush_cache_vunmap is only used on architectures with a virtually mapped cache (xtensa, pa-risc). [Note: Nick Piggin is working on a scheme to make this simpler by no longer requiring flushes] Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/vmalloc.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) Index: linux-2.6/mm/vmalloc.c =================================================================== --- linux-2.6.orig/mm/vmalloc.c 2007-09-24 16:03:49.000000000 -0700 +++ linux-2.6/mm/vmalloc.c 2007-09-24 16:04:32.000000000 -0700 @@ -289,7 +289,6 @@ static struct vm_struct *__get_vm_area_n unsigned long align = 1; unsigned long addr; - BUG_ON(in_interrupt()); if (flags & VM_IOREMAP) { int bit = fls(size); @@ -314,7 +313,14 @@ static struct vm_struct *__get_vm_area_n */ size += PAGE_SIZE; - write_lock(&vmlist_lock); + if (gfp_mask & __GFP_WAIT) + write_lock(&vmlist_lock); + else { + if (!write_trylock(&vmlist_lock)) { + kfree(area); + return NULL; + } + } for (p = &vmlist; (tmp = *p) != NULL ;p = &tmp->next) { if ((unsigned long)tmp->addr < addr) { if((unsigned long)tmp->addr + tmp->size >= addr) -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* [13/17] Virtual compound page freeing in interrupt context 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (11 preceding siblings ...) 2007-09-25 23:42 ` [12/17] Virtual Compound page allocation from interrupt context Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-28 4:52 ` KAMEZAWA Hiroyuki 2007-09-25 23:42 ` [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area Christoph Lameter ` (3 subsequent siblings) 16 siblings, 1 reply; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_interrupt_free --] [-- Type: text/plain, Size: 1488 bytes --] If we are in an interrupt context then simply defer the free via a workqueue. Removing a virtual mappping *must* be done with interrupts enabled since tlb_xx functions are called that rely on interrupts for processor to processor communications. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/page_alloc.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c 2007-09-25 00:20:56.000000000 -0700 +++ linux-2.6/mm/page_alloc.c 2007-09-25 00:20:57.000000000 -0700 @@ -1294,7 +1294,12 @@ abort: return NULL; } -static void vcompound_free(void *addr) +/* + * Virtual Compound freeing functions. This is complicated by the vmalloc + * layer not being able to free virtual allocations when interrupts are + * disabled. So we defer the frees via a workqueue if necessary. + */ +static void __vcompound_free(void *addr) { struct page **pages; int i; @@ -1319,6 +1324,22 @@ static void vcompound_free(void *addr) kfree(pages); } +static void vcompound_free_work(struct work_struct *w) +{ + __vcompound_free((void *)w); +} + +static noinline void vcompound_free(void *addr) +{ + if (in_interrupt()) { + struct work_struct *w = addr; + + INIT_WORK(w, vcompound_free_work); + schedule_work(w); + } else + __vcompound_free(addr); +} + /* * This is the 'heart' of the zoned buddy allocator. */ -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [13/17] Virtual compound page freeing in interrupt context
2007-09-25 23:42 ` [13/17] Virtual compound page freeing in " Christoph Lameter
@ 2007-09-28 4:52 ` KAMEZAWA Hiroyuki
2007-09-28 17:35 ` Christoph Lameter
0 siblings, 1 reply; 57+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-28 4:52 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-fsdevel, linux-kernel

On Tue, 25 Sep 2007 16:42:17 -0700
Christoph Lameter <clameter@sgi.com> wrote:

> +static noinline void vcompound_free(void *addr)
> +{
> +	if (in_interrupt()) {

Should be (in_interrupt() || irqs_disabled()) ?

Regards,
-Kame

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [13/17] Virtual compound page freeing in interrupt context
2007-09-28 4:52 ` KAMEZAWA Hiroyuki
@ 2007-09-28 17:35 ` Christoph Lameter
2007-09-28 23:58 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2007-09-28 17:35 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-fsdevel, linux-kernel

On Fri, 28 Sep 2007, KAMEZAWA Hiroyuki wrote:

> On Tue, 25 Sep 2007 16:42:17 -0700
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > +static noinline void vcompound_free(void *addr)
> > +{
> > +	if (in_interrupt()) {
>
> Should be (in_interrupt() || irqs_disabled()) ?

Maybe only irqs_disabled()?

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [13/17] Virtual compound page freeing in interrupt context
2007-09-28 17:35 ` Christoph Lameter
@ 2007-09-28 23:58 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 57+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-28 23:58 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-fsdevel, linux-kernel

On Fri, 28 Sep 2007 10:35:44 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Fri, 28 Sep 2007, KAMEZAWA Hiroyuki wrote:
>
> > On Tue, 25 Sep 2007 16:42:17 -0700
> > Christoph Lameter <clameter@sgi.com> wrote:
> >
> > > +static noinline void vcompound_free(void *addr)
> > > +{
> > > +	if (in_interrupt()) {
> >
> > Should be (in_interrupt() || irqs_disabled()) ?
>
> Maybe only irqs_disabled()?
>
Ah,.. maybe.

-Kame

^ permalink raw reply [flat|nested] 57+ messages in thread
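A minimal sketch of the combined check discussed in the exchange above. This
is not the posted patch; whether in_interrupt(), irqs_disabled(), or both
should gate the deferral is exactly what the thread leaves open.
__vcompound_free() and vcompound_free_work() are the helpers from patch 13/17
above.

static noinline void vcompound_free(void *addr)
{
	/*
	 * Defer whenever the TLB flush IPIs cannot be issued safely:
	 * either we are in interrupt context or interrupts are off.
	 * (Sketch only; the series as posted checks in_interrupt().)
	 */
	if (in_interrupt() || irqs_disabled()) {
		struct work_struct *w = addr;

		INIT_WORK(w, vcompound_free_work);
		schedule_work(w);
	} else
		__vcompound_free(addr);
}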
* [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area
2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter
` (12 preceding siblings ...)
2007-09-25 23:42 ` [13/17] Virtual compound page freeing in " Christoph Lameter
@ 2007-09-25 23:42 ` Christoph Lameter
2007-09-25 23:42 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Christoph Lameter
` (2 subsequent siblings)
16 siblings, 0 replies; 57+ messages in thread
From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw)
Cc: linux-fsdevel, linux-kernel
[-- Attachment #1: vcompound_wait_on_virtually_mapped_object --]
[-- Type: text/plain, Size: 906 bytes --]
If bit_waitqueue() is passed a virtual address then it must use
vmalloc_to_page() instead of virt_to_page() to get to the page struct.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 kernel/wait.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/kernel/wait.c
===================================================================
--- linux-2.6.orig/kernel/wait.c	2007-09-20 19:03:42.000000000 -0700
+++ linux-2.6/kernel/wait.c	2007-09-20 19:07:42.000000000 -0700
@@ -245,7 +245,7 @@ EXPORT_SYMBOL(wake_up_bit);
 fastcall wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
 	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
-	const struct zone *zone = page_zone(virt_to_page(word));
+	const struct zone *zone = page_zone(addr_to_page(word));
 	unsigned long val = (unsigned long)word << shift | bit;
 
 	return &zone->wait_table[hash_long(val, zone->wait_table_bits)];

--

^ permalink raw reply [flat|nested] 57+ messages in thread
* [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter ` (13 preceding siblings ...) 2007-09-25 23:42 ` [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area Christoph Lameter @ 2007-09-25 23:42 ` Christoph Lameter 2007-09-25 23:42 ` [16/17] Allow virtual fallback for buffer_heads Christoph Lameter 2007-09-25 23:42 ` [17/17] Allow virtual fallback for dentries Christoph Lameter 16 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw) Cc: linux-fsdevel, linux-kernel [-- Attachment #1: vcompound_slub_support --] [-- Type: text/plain, Size: 6799 bytes --] SLAB_VFALLBACK can be specified for selected slab caches. If fallback is available then the conservative settings for higher order allocations are overridden. We then request an order that can accomodate at mininum 100 objects. The size of an individual slab allocation is allowed to reach up to 256k (order 6 on i386, order 4 on IA64). Implementing fallback requires special handling of virtual mappings in the free path. However, the impact is minimal since we already check the address if its NULL or ZERO_SIZE_PTR. No additional cachelines are touched if we do not fall back. However, if we need to handle a virtual compound page then walk the kernel page table in the free paths to determine the page struct. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- include/linux/slab.h | 1 include/linux/slub_def.h | 1 mm/slub.c | 52 +++++++++++++++++++++++++++-------------------- 3 files changed, 32 insertions(+), 22 deletions(-) Index: linux-2.6/include/linux/slab.h =================================================================== --- linux-2.6.orig/include/linux/slab.h 2007-09-24 20:34:14.000000000 -0700 +++ linux-2.6/include/linux/slab.h 2007-09-24 20:35:09.000000000 -0700 @@ -19,6 +19,7 @@ * The ones marked DEBUG are only valid if CONFIG_SLAB_DEBUG is set. 
*/ #define SLAB_DEBUG_FREE 0x00000100UL /* DEBUG: Perform (expensive) checks on free */ +#define SLAB_VFALLBACK 0x00000200UL /* May fall back to vmalloc */ #define SLAB_RED_ZONE 0x00000400UL /* DEBUG: Red zone objs in a cache */ #define SLAB_POISON 0x00000800UL /* DEBUG: Poison objects */ #define SLAB_HWCACHE_ALIGN 0x00002000UL /* Align objs on cache lines */ Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2007-09-24 20:34:14.000000000 -0700 +++ linux-2.6/mm/slub.c 2007-09-24 20:35:09.000000000 -0700 @@ -285,7 +285,7 @@ static inline int check_valid_pointer(st if (!object) return 1; - base = page_address(page); + base = page_to_addr(page); if (object < base || object >= base + s->objects * s->size || (object - base) % s->size) { return 0; @@ -470,7 +470,7 @@ static void slab_fix(struct kmem_cache * static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p) { unsigned int off; /* Offset of last byte */ - u8 *addr = page_address(page); + u8 *addr = page_to_addr(page); print_tracking(s, p); @@ -648,7 +648,7 @@ static int slab_pad_check(struct kmem_ca if (!(s->flags & SLAB_POISON)) return 1; - start = page_address(page); + start = page_to_addr(page); end = start + (PAGE_SIZE << s->order); length = s->objects * s->size; remainder = end - (start + length); @@ -1049,11 +1049,7 @@ static struct page *allocate_slab(struct struct page * page; int pages = 1 << s->order; - if (s->order) - flags |= __GFP_COMP; - - if (s->flags & SLAB_CACHE_DMA) - flags |= SLUB_DMA; + flags |= s->gfpflags; if (node == -1) page = alloc_pages(flags, s->order); @@ -1107,7 +1103,7 @@ static struct page *new_slab(struct kmem SLAB_STORE_USER | SLAB_TRACE)) SetSlabDebug(page); - start = page_address(page); + start = page_to_addr(page); end = start + s->objects * s->size; if (unlikely(s->flags & SLAB_POISON)) @@ -1139,7 +1135,7 @@ static void __free_slab(struct kmem_cach void *p; slab_pad_check(s, page); - for_each_object(p, s, page_address(page)) + for_each_object(p, s, page_to_addr(page)) check_object(s, page, p, 0); ClearSlabDebug(page); } @@ -1789,10 +1785,9 @@ static inline int slab_order(int size, i return order; } -static inline int calculate_order(int size) +static inline int calculate_order(int size, int min_objects, int max_order) { int order; - int min_objects; int fraction; /* @@ -1803,13 +1798,12 @@ static inline int calculate_order(int si * First we reduce the acceptable waste in a slab. Then * we reduce the minimum objects required in a slab. */ - min_objects = slub_min_objects; while (min_objects > 1) { fraction = 8; while (fraction >= 4) { order = slab_order(size, min_objects, - slub_max_order, fraction); - if (order <= slub_max_order) + max_order, fraction); + if (order <= max_order) return order; fraction /= 2; } @@ -1820,8 +1814,8 @@ static inline int calculate_order(int si * We were unable to place multiple objects in a slab. Now * lets see if we can place a single object there. 
*/ - order = slab_order(size, 1, slub_max_order, 1); - if (order <= slub_max_order) + order = slab_order(size, 1, max_order, 1); + if (order <= max_order) return order; /* @@ -2068,10 +2062,24 @@ static int calculate_sizes(struct kmem_c size = ALIGN(size, align); s->size = size; - s->order = calculate_order(size); + if (s->flags & SLAB_VFALLBACK) + s->order = calculate_order(size, 100, 18 - PAGE_SHIFT); + else + s->order = calculate_order(size, slub_min_objects, + slub_max_order); + if (s->order < 0) return 0; + if (s->order) + s->gfpflags |= __GFP_COMP; + + if (s->flags & SLAB_VFALLBACK) + s->gfpflags |= __GFP_VFALLBACK; + + if (s->flags & SLAB_CACHE_DMA) + s->flags |= SLUB_DMA; + /* * Determine the number of objects per slab */ @@ -2814,7 +2822,7 @@ static int validate_slab(struct kmem_cac unsigned long *map) { void *p; - void *addr = page_address(page); + void *addr = page_to_addr(page); if (!check_slab(s, page) || !on_freelist(s, page, NULL)) @@ -3056,7 +3064,7 @@ static int add_location(struct loc_track cpu_set(track->cpu, l->cpus); } - node_set(page_to_nid(virt_to_page(track)), l->nodes); + node_set(page_to_nid(addr_to_page(track)), l->nodes); return 1; } @@ -3087,14 +3095,14 @@ static int add_location(struct loc_track cpus_clear(l->cpus); cpu_set(track->cpu, l->cpus); nodes_clear(l->nodes); - node_set(page_to_nid(virt_to_page(track)), l->nodes); + node_set(page_to_nid(addr_to_page(track)), l->nodes); return 1; } static void process_slab(struct loc_track *t, struct kmem_cache *s, struct page *page, enum track_item alloc) { - void *addr = page_address(page); + void *addr = page_to_addr(page); DECLARE_BITMAP(map, s->objects); void *p; Index: linux-2.6/include/linux/slub_def.h =================================================================== --- linux-2.6.orig/include/linux/slub_def.h 2007-09-24 20:34:14.000000000 -0700 +++ linux-2.6/include/linux/slub_def.h 2007-09-24 20:35:09.000000000 -0700 @@ -31,6 +31,7 @@ struct kmem_cache { int objsize; /* The size of an object without meta data */ int offset; /* Free pointer offset. */ int order; + int gfpflags; /* Allocation flags */ /* * Avoid an extra cache line for UP, SMP and for the node local to -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* [16/17] Allow virtual fallback for buffer_heads
2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter
` (14 preceding siblings ...)
2007-09-25 23:42 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Christoph Lameter
@ 2007-09-25 23:42 ` Christoph Lameter
2007-09-25 23:42 ` [17/17] Allow virtual fallback for dentries Christoph Lameter
16 siblings, 0 replies; 57+ messages in thread
From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw)
Cc: linux-fsdevel, linux-kernel
[-- Attachment #1: vcompound_buffer_head --]
[-- Type: text/plain, Size: 801 bytes --]
This is particularly useful for large I/Os because it will allow more
than 100 allocations from the SLUB fast path without having to go to
the page allocator.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/buffer.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-09-18 15:44:37.000000000 -0700
+++ linux-2.6/fs/buffer.c	2007-09-18 15:44:51.000000000 -0700
@@ -3008,7 +3008,8 @@ void __init buffer_init(void)
 	int nrpages;
 
 	bh_cachep = KMEM_CACHE(buffer_head,
-			SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+			SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|
+			SLAB_VFALLBACK);
 
 	/*
 	 * Limit the bh occupancy to 10% of ZONE_NORMAL

--

^ permalink raw reply [flat|nested] 57+ messages in thread
* [17/17] Allow virtual fallback for dentries
2007-09-25 23:42 [00/17] Virtual Compound Page Support V1 Christoph Lameter
` (15 preceding siblings ...)
2007-09-25 23:42 ` [16/17] Allow virtual fallback for buffer_heads Christoph Lameter
@ 2007-09-25 23:42 ` Christoph Lameter
16 siblings, 0 replies; 57+ messages in thread
From: Christoph Lameter @ 2007-09-25 23:42 UTC (permalink / raw)
Cc: linux-fsdevel, linux-kernel
[-- Attachment #1: vcompound_dentry --]
[-- Type: text/plain, Size: 656 bytes --]
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/dcache.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c	2007-09-24 16:47:43.000000000 -0700
+++ linux-2.6/fs/dcache.c	2007-09-24 17:03:15.000000000 -0700
@@ -2118,7 +2118,8 @@ static void __init dcache_init(unsigned
 	 * of the dcache.
 	 */
 	dentry_cache = KMEM_CACHE(dentry,
-		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|
+		SLAB_VFALLBACK);
 
 	register_shrinker(&dcache_shrinker);

--

^ permalink raw reply [flat|nested] 57+ messages in thread
* [00/17] [RFC] Virtual Compound Page Support
@ 2007-09-19 3:36 Christoph Lameter
2007-09-19 3:36 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Christoph Lameter
0 siblings, 1 reply; 57+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
Currently there is a strong tendency to avoid larger page allocations in
the kernel because of past fragmentation issues and because the current
defragmentation methods are still evolving. It is not clear to what extent
they can provide reliable allocations for higher order pages (plus the
definition of "reliable" seems to be in the eye of the beholder).
Currently we use vmalloc allocations in many locations to provide a safe
way to allocate larger arrays. That is due to the danger of higher order
allocations failing. Virtual Compound pages allow the use of regular
page allocator allocations that will fall back only if there is an actual
problem with acquiring a higher order page.
This patch set provides a way for a higher page allocation to fall back.
Instead of a physically contiguous page a virtually contiguous page
is provided. The functionality of the vmalloc layer is used to provide
the necessary page tables and control structures to establish a virtually
contiguous area.
Advantages:
- If higher order allocations are failing then virtual compound pages
consisting of a series of order-0 pages can stand in for those
allocations.
- "Reliability" as long as the vmalloc layer can provide virtual mappings.
- Ability to reduce the use of vmalloc layer significantly by using
physically contiguous memory instead of virtual contiguous memory.
Most uses of vmalloc() can be converted to page allocator calls.
- The use of physically contiguous memory instead of vmalloc may allow the
use of larger TLB entries, thus reducing TLB pressure. It also reduces the
need for page table walks.
Disadvantages:
- In order to use the fallback, the logic accessing the memory must be
aware that the memory could be backed by a virtual mapping and take
precautions. virt_to_page() and page_address() may not work, and
vmalloc_to_page() and vmalloc_address() (introduced through this
patch set) may have to be called instead (see the sketch after this list).
- Virtual mappings are less efficient than physical mappings.
Performance will drop once the virtual fallback occurs.
- Virtual mappings have more memory overhead. vm_area control structures,
page tables, page arrays, etc. need to be allocated and managed to provide
virtual mappings.
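A minimal sketch of what such fallback-aware conversions can look like,
mirroring the virt_to_slab()/slab_address() pattern from the SLUB patch in
this series. The helper names below are illustrative rather than the ones
the patches introduce; is_vmalloc_addr(), vmalloc_to_page(),
vmalloc_address() and PageVcompound() are the primitives provided or used by
this patch set.

static inline struct page *fallback_addr_to_page(const void *addr)
{
	if (unlikely(is_vmalloc_addr(addr)))
		/* Fallback case: the object lives in a virtual mapping. */
		return vmalloc_to_page(addr);
	return virt_to_page(addr);	/* regular linear mapping */
}

static inline void *fallback_page_to_addr(struct page *page)
{
	if (unlikely(PageVcompound(page)))
		/* Fallback case: look the address up via the vmalloc layer. */
		return vmalloc_address(page);
	return page_address(page);	/* regular linear mapping */
}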
The patchset provides this functionality in stages. Stage 1 introduces
the basic fall back mechanism necessary to replace vmalloc allocations
with
alloc_page(GFP_VFALLBACK, order, ....)
which signifies to the page allocator that a higher order is to be found
but a virtual mapping may stand in if there is an issue with fragmentation.
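As a rough usage sketch (assumed, not taken from the patches): a caller that
opts into the fallback obtains the mapped address in a fallback-aware way
instead of assuming that page_address() is valid. GFP_VFALLBACK,
PageVcompound() and vmalloc_address() are the names this patch set uses; the
function below is purely illustrative.

static void *example_alloc_vfallback(int order)
{
	/*
	 * Ask for a physically contiguous higher-order page, accepting a
	 * virtually contiguous stand-in if fragmentation gets in the way.
	 */
	struct page *page = alloc_pages(GFP_VFALLBACK, order);

	if (!page)
		return NULL;

	if (PageVcompound(page))
		return vmalloc_address(page);	/* the fallback happened */

	return page_address(page);		/* physically contiguous */
}

Freeing is assumed to go back through the regular page free path, which the
series extends to tear down the virtual mapping when one was used; doing
that from interrupt context is what patches 12/17 and 13/17 deal with.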
Stage 1 functionality does not allow allocation and freeing of virtual
mappings from interrupt contexts.
The stage 1 series ends with the conversion of a few key uses of vmalloc
in the VM to alloc_pages() for the allocation of the sparsemem memmap table
and the wait table in each zone. Other uses of vmalloc could be converted
in the same way.
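A hypothetical before/after of such a conversion (illustrative only; the
real conversions live in the sparsemem and zone wait table patches of this
series, which are not quoted here). example_alloc_table() and table_size are
made-up names, and page_to_addr() stands for a fallback-aware helper like
the one the SLUB patch in this series uses.

static void *example_alloc_table(size_t table_size)
{
	struct page *page;

	/*
	 * Before this kind of conversion the call would simply have been:
	 *	return vmalloc(table_size);
	 */
	page = alloc_pages(GFP_VFALLBACK, get_order(table_size));
	return page ? page_to_addr(page) : NULL;
}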
Stage 2 functionality enhances the fallback further, allowing allocations
and frees in interrupt context.
SLUB is then modified to use the virtual mappings for slab caches
that are marked with SLAB_VFALLBACK. If a slab cache is marked this way
then we drop all the restrictions regarding page order and allocate
large memory areas that fit many objects, so that we rarely
have to use the slow paths.
Two slab caches--the dentry cache and the buffer_heads--are then flagged
that way. Others could be converted in the same way.
The patch set also provides a debugging aid through setting
CONFIG_VFALLBACK_ALWAYS
If set then all GFP_VFALLBACK allocations fall back to the virtual
mappings. This is useful for verification tests. The test of this
patch set was done by enabling that option and compiling a kernel.
Stage 3 functionality could be adding support for the large
buffer size patchset. That is not done yet, and it is not clear whether
it would be useful to do.
Much of this patchset may only be needed for special cases in which the
existing defragmentation methods fail for some reason. It may be better to
have the system operate without such a safety net and make sure that the
page allocator can return large orders in a reliable way.
The initial idea for this patchset came from Nick Piggin's fsblock
and from his arguments about reliability and guarantees. Since his
fsblock uses the virtual mappings I think it is legitimate to
generalize the use of virtual mappings to support higher order
allocations in this way. The application of these ideas to the large
block size patchset etc. is straightforward. If wanted, I can base
the next rev of the largebuffer patchset on this one and implement
fallback.
Contrary to Nick, I still doubt that any of this provides a "guarantee".
Having said that, I have to deal with various failure scenarios in the VM
daily, and I'd certainly like to see it work in a more reliable manner.
IMHO getting rid of the various workarounds to deal with the small 4k
pages and avoiding additional layers that group these pages in
subsystem-specific ways is something that can simplify the kernel and make the
kernel more reliable overall.
If people feel that a virtual fall back is needed then so be it. Maybe
we can shed our security blanket later when the approaches to deal
with fragmentation have matured.
The patch set is also available via git from the largeblock git tree via
git pull
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git
vcompound
--
^ permalink raw reply [flat|nested] 57+ messages in thread* [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter @ 2007-09-19 3:36 ` Christoph Lameter 2007-09-27 21:42 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw) To: Christoph Hellwig, Mel Gorman Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe [-- Attachment #1: vcompound_slub_support --] [-- Type: text/plain, Size: 8705 bytes --] SLAB_VFALLBACK can be specified for selected slab caches. If fallback is available then the conservative settings for higher order allocations are overridden. We then request an order that can accomodate at mininum 100 objects. The size of an individual slab allocation is allowed to reach up to 256k (order 6 on i386, order 4 on IA64). Implementing fallback requires special handling of virtual mappings in the free path. However, the impact is minimal since we already check the address if its NULL or ZERO_SIZE_PTR. No additional cachelines are touched if we do not fall back. However, if we need to handle a virtual compound page then walk the kernel page table in the free paths to determine the page struct. We also need special handling in the allocation paths since the virtual addresses cannot be obtained via page_address(). SLUB exploits that page->private is set to the vmalloc address to avoid a costly vmalloc_address(). However, for diagnostics there is still the need to determine the vmalloc address from the page struct. There we must use the costly vmalloc_address(). Signed-off-by: Christoph Lameter <clameter@sgi.com> --- include/linux/slab.h | 1 include/linux/slub_def.h | 1 mm/slub.c | 83 ++++++++++++++++++++++++++++++++--------------- 3 files changed, 60 insertions(+), 25 deletions(-) Index: linux-2.6/include/linux/slab.h =================================================================== --- linux-2.6.orig/include/linux/slab.h 2007-09-18 17:03:30.000000000 -0700 +++ linux-2.6/include/linux/slab.h 2007-09-18 17:07:39.000000000 -0700 @@ -19,6 +19,7 @@ * The ones marked DEBUG are only valid if CONFIG_SLAB_DEBUG is set. 
*/ #define SLAB_DEBUG_FREE 0x00000100UL /* DEBUG: Perform (expensive) checks on free */ +#define SLAB_VFALLBACK 0x00000200UL /* May fall back to vmalloc */ #define SLAB_RED_ZONE 0x00000400UL /* DEBUG: Red zone objs in a cache */ #define SLAB_POISON 0x00000800UL /* DEBUG: Poison objects */ #define SLAB_HWCACHE_ALIGN 0x00002000UL /* Align objs on cache lines */ Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2007-09-18 17:03:30.000000000 -0700 +++ linux-2.6/mm/slub.c 2007-09-18 18:13:38.000000000 -0700 @@ -20,6 +20,7 @@ #include <linux/mempolicy.h> #include <linux/ctype.h> #include <linux/kallsyms.h> +#include <linux/vmalloc.h> /* * Lock order: @@ -277,6 +278,26 @@ static inline struct kmem_cache_node *ge #endif } +static inline void *slab_address(struct page *page) +{ + if (unlikely(PageVcompound(page))) + return vmalloc_address(page); + else + return page_address(page); +} + +static inline struct page *virt_to_slab(const void *addr) +{ + struct page *page; + + if (unlikely(is_vmalloc_addr(addr))) + page = vmalloc_to_page(addr); + else + page = virt_to_page(addr); + + return compound_head(page); +} + static inline int check_valid_pointer(struct kmem_cache *s, struct page *page, const void *object) { @@ -285,7 +306,7 @@ static inline int check_valid_pointer(st if (!object) return 1; - base = page_address(page); + base = slab_address(page); if (object < base || object >= base + s->objects * s->size || (object - base) % s->size) { return 0; @@ -470,7 +491,7 @@ static void slab_fix(struct kmem_cache * static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p) { unsigned int off; /* Offset of last byte */ - u8 *addr = page_address(page); + u8 *addr = slab_address(page); print_tracking(s, p); @@ -648,7 +669,7 @@ static int slab_pad_check(struct kmem_ca if (!(s->flags & SLAB_POISON)) return 1; - start = page_address(page); + start = slab_address(page); end = start + (PAGE_SIZE << s->order); length = s->objects * s->size; remainder = end - (start + length); @@ -1040,11 +1061,7 @@ static struct page *allocate_slab(struct struct page * page; int pages = 1 << s->order; - if (s->order) - flags |= __GFP_COMP; - - if (s->flags & SLAB_CACHE_DMA) - flags |= SLUB_DMA; + flags |= s->gfpflags; if (node == -1) page = alloc_pages(flags, s->order); @@ -1098,7 +1115,11 @@ static struct page *new_slab(struct kmem SLAB_STORE_USER | SLAB_TRACE)) SetSlabDebug(page); - start = page_address(page); + if (!PageVcompound(page)) + start = slab_address(page); + else + start = (void *)page->private; + end = start + s->objects * s->size; if (unlikely(s->flags & SLAB_POISON)) @@ -1130,7 +1151,7 @@ static void __free_slab(struct kmem_cach void *p; slab_pad_check(s, page); - for_each_object(p, s, page_address(page)) + for_each_object(p, s, slab_address(page)) check_object(s, page, p, 0); ClearSlabDebug(page); } @@ -1672,7 +1693,7 @@ void kmem_cache_free(struct kmem_cache * { struct page *page; - page = virt_to_head_page(x); + page = virt_to_slab(x); slab_free(s, page, x, __builtin_return_address(0)); } @@ -1681,7 +1702,7 @@ EXPORT_SYMBOL(kmem_cache_free); /* Figure out on which slab object the object resides */ static struct page *get_object_page(const void *x) { - struct page *page = virt_to_head_page(x); + struct page *page = virt_to_slab(x); if (!PageSlab(page)) return NULL; @@ -1780,10 +1801,9 @@ static inline int slab_order(int size, i return order; } -static inline int calculate_order(int size) +static inline int calculate_order(int 
size, int min_objects, int max_order) { int order; - int min_objects; int fraction; /* @@ -1794,13 +1814,12 @@ static inline int calculate_order(int si * First we reduce the acceptable waste in a slab. Then * we reduce the minimum objects required in a slab. */ - min_objects = slub_min_objects; while (min_objects > 1) { fraction = 8; while (fraction >= 4) { order = slab_order(size, min_objects, - slub_max_order, fraction); - if (order <= slub_max_order) + max_order, fraction); + if (order <= max_order) return order; fraction /= 2; } @@ -1811,8 +1830,8 @@ static inline int calculate_order(int si * We were unable to place multiple objects in a slab. Now * lets see if we can place a single object there. */ - order = slab_order(size, 1, slub_max_order, 1); - if (order <= slub_max_order) + order = slab_order(size, 1, max_order, 1); + if (order <= max_order) return order; /* @@ -2059,10 +2078,24 @@ static int calculate_sizes(struct kmem_c size = ALIGN(size, align); s->size = size; - s->order = calculate_order(size); + if (s->flags & SLAB_VFALLBACK) + s->order = calculate_order(size, 100, 18 - PAGE_SHIFT); + else + s->order = calculate_order(size, slub_min_objects, + slub_max_order); + if (s->order < 0) return 0; + if (s->order) + s->gfpflags |= __GFP_COMP; + + if (s->flags & SLAB_VFALLBACK) + s->gfpflags |= __GFP_VFALLBACK; + + if (s->flags & SLAB_CACHE_DMA) + s->flags |= SLUB_DMA; + /* * Determine the number of objects per slab */ @@ -2477,7 +2510,7 @@ void kfree(const void *x) if (ZERO_OR_NULL_PTR(x)) return; - page = virt_to_head_page(x); + page = virt_to_slab(x); s = page->slab; slab_free(s, page, (void *)x, __builtin_return_address(0)); @@ -2806,7 +2839,7 @@ static int validate_slab(struct kmem_cac unsigned long *map) { void *p; - void *addr = page_address(page); + void *addr = slab_address(page); if (!check_slab(s, page) || !on_freelist(s, page, NULL)) @@ -3048,7 +3081,7 @@ static int add_location(struct loc_track cpu_set(track->cpu, l->cpus); } - node_set(page_to_nid(virt_to_page(track)), l->nodes); + node_set(page_to_nid(virt_to_slab(track)), l->nodes); return 1; } @@ -3079,14 +3112,14 @@ static int add_location(struct loc_track cpus_clear(l->cpus); cpu_set(track->cpu, l->cpus); nodes_clear(l->nodes); - node_set(page_to_nid(virt_to_page(track)), l->nodes); + node_set(page_to_nid(virt_to_slab(track)), l->nodes); return 1; } static void process_slab(struct loc_track *t, struct kmem_cache *s, struct page *page, enum track_item alloc) { - void *addr = page_address(page); + void *addr = slab_address(page); DECLARE_BITMAP(map, s->objects); void *p; Index: linux-2.6/include/linux/slub_def.h =================================================================== --- linux-2.6.orig/include/linux/slub_def.h 2007-09-18 17:03:30.000000000 -0700 +++ linux-2.6/include/linux/slub_def.h 2007-09-18 17:07:39.000000000 -0700 @@ -31,6 +31,7 @@ struct kmem_cache { int objsize; /* The size of an object without meta data */ int offset; /* Free pointer offset. */ int order; + int gfpflags; /* Allocation flags */ /* * Avoid an extra cache line for UP, SMP and for the node local to -- ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-19 3:36 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Christoph Lameter @ 2007-09-27 21:42 ` Nick Piggin 2007-09-28 17:33 ` Christoph Lameter 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2007-09-27 21:42 UTC (permalink / raw) To: Christoph Lameter Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Wednesday 19 September 2007 13:36, Christoph Lameter wrote: > SLAB_VFALLBACK can be specified for selected slab caches. If fallback is > available then the conservative settings for higher order allocations are > overridden. We then request an order that can accomodate at mininum > 100 objects. The size of an individual slab allocation is allowed to reach > up to 256k (order 6 on i386, order 4 on IA64). How come SLUB wants such a big amount of objects? I thought the unqueued nature of it made it better than slab because it minimised the amount of cache hot memory lying around in slabs... vmalloc is incredibly slow and unscalable at the moment. I'm still working on making it more scalable and faster -- hopefully to a point where it would actually be usable for this... but you still get moved off large TLBs, and also have to inevitably do tlb flushing. Or do you have SLUB at a point where performance is comparable to SLAB, and this is just a possible idea for more performance? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-27 21:42 ` Nick Piggin @ 2007-09-28 17:33 ` Christoph Lameter 2007-09-28 5:14 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-28 17:33 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Fri, 28 Sep 2007, Nick Piggin wrote: > On Wednesday 19 September 2007 13:36, Christoph Lameter wrote: > > SLAB_VFALLBACK can be specified for selected slab caches. If fallback is > > available then the conservative settings for higher order allocations are > > overridden. We then request an order that can accomodate at mininum > > 100 objects. The size of an individual slab allocation is allowed to reach > > up to 256k (order 6 on i386, order 4 on IA64). > > How come SLUB wants such a big amount of objects? I thought the > unqueued nature of it made it better than slab because it minimised > the amount of cache hot memory lying around in slabs... The more objects in a page the more the fast path runs. The more the fast path runs the lower the cache footprint and the faster the overall allocations etc. SLAB can be configured for large queues holdings lots of objects. SLUB can only reach the same through large pages because it does not have queues. One could add the ability to manage pools of cpu slabs but that would be adding yet another layer to compensate for the problem of the small pages. Reliable large page allocations means that we can get rid of these layers and the many workarounds that we have in place right now. The unqueued nature of SLUB reduces memory requirements and in general the more efficient code paths of SLUB offset the advantage that SLAB can reach by being able to put more objects onto its queues. SLAB necessarily introduces complexity and cache line use through the need to manage those queues. > vmalloc is incredibly slow and unscalable at the moment. I'm still working > on making it more scalable and faster -- hopefully to a point where it would > actually be usable for this... but you still get moved off large TLBs, and > also have to inevitably do tlb flushing. Again I have not seen any fallbacks to vmalloc in my testing. What we are doing here is mainly to address your theoretical cases that we so far have never seen to be a problem and increase the reliability of allocations of page orders larger than 3 to a usable level. So far I have so far not dared to enable orders larger than 3 by default. AFAICT The performance of vmalloc is not really relevant. If this would become an issue then it would be possible to reduce the orders used to avoid fallbacks. > Or do you have SLUB at a point where performance is comparable to SLAB, > and this is just a possible idea for more performance? AFAICT SLUBs performance is superior to SLAB in most cases and it was like that from the beginning. I am still concerned about several corner cases though (I think most of them are going to be addressed by the per cpu patches in mm). Having a comparable or larger amount of per cpu objects as SLAB is something that also could address some of these concerns and could increase performance much further. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 17:33 ` Christoph Lameter @ 2007-09-28 5:14 ` Nick Piggin 2007-10-01 20:50 ` Christoph Lameter 2007-09-28 17:55 ` Peter Zijlstra 2007-09-28 21:05 ` Mel Gorman 2 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2007-09-28 5:14 UTC (permalink / raw) To: Christoph Lameter Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Saturday 29 September 2007 03:33, Christoph Lameter wrote: > On Fri, 28 Sep 2007, Nick Piggin wrote: > > On Wednesday 19 September 2007 13:36, Christoph Lameter wrote: > > > SLAB_VFALLBACK can be specified for selected slab caches. If fallback > > > is available then the conservative settings for higher order > > > allocations are overridden. We then request an order that can > > > accomodate at mininum 100 objects. The size of an individual slab > > > allocation is allowed to reach up to 256k (order 6 on i386, order 4 on > > > IA64). > > > > How come SLUB wants such a big amount of objects? I thought the > > unqueued nature of it made it better than slab because it minimised > > the amount of cache hot memory lying around in slabs... > > The more objects in a page the more the fast path runs. The more the fast > path runs the lower the cache footprint and the faster the overall > allocations etc. > > SLAB can be configured for large queues holdings lots of objects. > SLUB can only reach the same through large pages because it does not > have queues. One could add the ability to manage pools of cpu slabs but > that would be adding yet another layer to compensate for the problem of > the small pages. Reliable large page allocations means that we can get rid > of these layers and the many workarounds that we have in place right now. That doesn't sound very nice because you don't actually want to use up higher order allocations if you can avoid it, and you definitely don't want to be increasing your slab page size unit if you can help it, because it compounds the problem of slab fragmentation. > The unqueued nature of SLUB reduces memory requirements and in general the > more efficient code paths of SLUB offset the advantage that SLAB can reach > by being able to put more objects onto its queues. > introduces complexity and cache line use through the need to manage those > queues. I thought it was slower. Have you fixed the performance regression? (OK, I read further down that you are still working on it but not confirmed yet...) > > vmalloc is incredibly slow and unscalable at the moment. I'm still > > working on making it more scalable and faster -- hopefully to a point > > where it would actually be usable for this... but you still get moved off > > large TLBs, and also have to inevitably do tlb flushing. > > Again I have not seen any fallbacks to vmalloc in my testing. What we are > doing here is mainly to address your theoretical cases that we so far have > never seen to be a problem and increase the reliability of allocations of > page orders larger than 3 to a usable level. So far I have so far not > dared to enable orders larger than 3 by default. Basically, all that shows is that your testing isn't very thorough. 128MB is an order of magnitude *more* memory than some users have. They probably wouldn't be happy with a regression in slab allocator performance either. > > Or do you have SLUB at a point where performance is comparable to SLAB, > > and this is just a possible idea for more performance? 
> > AFAICT SLUBs performance is superior to SLAB in most cases and it was like > that from the beginning. I am still concerned about several corner cases > though (I think most of them are going to be addressed by the per cpu > patches in mm). Having a comparable or larger amount of per cpu objects as > SLAB is something that also could address some of these concerns and could > increase performance much further. OK, so long as it isn't going to depend on using higher order pages, that's fine. (if they help even further as an optional thing, that's fine too. You can turn them on your huge systems and not even bother about adding this vmap fallback -- you won't have me to nag you about these purely theoretical issues). ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 5:14 ` Nick Piggin @ 2007-10-01 20:50 ` Christoph Lameter 2007-10-02 8:43 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Christoph Lameter @ 2007-10-01 20:50 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Fri, 28 Sep 2007, Nick Piggin wrote: > I thought it was slower. Have you fixed the performance regression? > (OK, I read further down that you are still working on it but not confirmed > yet...) The problem is with the weird way of Intel testing and communication. Every 3-6 month or so they will tell you the system is X% up or down on arch Y (and they wont give you details because its somehow secret). And then there are conflicting statements by the two or so performance test departments. One of them repeatedly assured me that they do not see any regressions. > OK, so long as it isn't going to depend on using higher order pages, that's > fine. (if they help even further as an optional thing, that's fine too. You > can turn them on your huge systems and not even bother about adding > this vmap fallback -- you won't have me to nag you about these > purely theoretical issues). Well the vmap fallback is generally useful AFAICT. Higher order allocations are common on some of our platforms. Order 1 failures even affect essential things like stacks that have nothing to do with SLUB and the LBS patchset. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-10-01 20:50 ` Christoph Lameter @ 2007-10-02 8:43 ` Nick Piggin 0 siblings, 0 replies; 57+ messages in thread From: Nick Piggin @ 2007-10-02 8:43 UTC (permalink / raw) To: Christoph Lameter Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Tuesday 02 October 2007 06:50, Christoph Lameter wrote: > On Fri, 28 Sep 2007, Nick Piggin wrote: > > I thought it was slower. Have you fixed the performance regression? > > (OK, I read further down that you are still working on it but not > > confirmed yet...) > > The problem is with the weird way of Intel testing and communication. > Every 3-6 month or so they will tell you the system is X% up or down on > arch Y (and they wont give you details because its somehow secret). And > then there are conflicting statements by the two or so performance test > departments. One of them repeatedly assured me that they do not see any > regressions. Just so long as there aren't known regressions that would require higher order allocations to fix them. > > OK, so long as it isn't going to depend on using higher order pages, > > that's fine. (if they help even further as an optional thing, that's fine > > too. You can turn them on your huge systems and not even bother about > > adding this vmap fallback -- you won't have me to nag you about these > > purely theoretical issues). > > Well the vmap fallback is generally useful AFAICT. Higher order > allocations are common on some of our platforms. Order 1 failures even > affect essential things like stacks that have nothing to do with SLUB and > the LBS patchset. I don't know if it is worth the trouble, though. The best thing to do is to ensure that contiguous memory is not wasted on frivolous things... a few order-1 or 2 allocations aren't too much of a problem. The only high order allocation failure I've seen from fragmentation for a long time IIRC are the order-3 failures coming from e1000. And obviously they cannot use vmap. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 17:33 ` Christoph Lameter 2007-09-28 5:14 ` Nick Piggin @ 2007-09-28 17:55 ` Peter Zijlstra 2007-09-28 18:20 ` Christoph Lameter 2007-09-28 21:05 ` Mel Gorman 2 siblings, 1 reply; 57+ messages in thread From: Peter Zijlstra @ 2007-09-28 17:55 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Fri, 2007-09-28 at 10:33 -0700, Christoph Lameter wrote: > Again I have not seen any fallbacks to vmalloc in my testing. What we are > doing here is mainly to address your theoretical cases that we so far have > never seen to be a problem and increase the reliability of allocations of > page orders larger than 3 to a usable level. So far I have so far not > dared to enable orders larger than 3 by default. take a recent -mm kernel, boot with mem=128M. start 2 processes that each mmap a separate 64M file, and which does sequential writes on them. start a 3th process that does the same with 64M anonymous. wait for a while, and you'll see order=1 failures. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 17:55 ` Peter Zijlstra @ 2007-09-28 18:20 ` Christoph Lameter 2007-09-28 18:25 ` Peter Zijlstra 2007-09-29 8:45 ` Peter Zijlstra 0 siblings, 2 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-28 18:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Fri, 28 Sep 2007, Peter Zijlstra wrote: > > On Fri, 2007-09-28 at 10:33 -0700, Christoph Lameter wrote: > > > Again I have not seen any fallbacks to vmalloc in my testing. What we are > > doing here is mainly to address your theoretical cases that we so far have > > never seen to be a problem and increase the reliability of allocations of > > page orders larger than 3 to a usable level. So far I have so far not > > dared to enable orders larger than 3 by default. > > take a recent -mm kernel, boot with mem=128M. Ok so only 32k pages to play with? I have tried parallel kernel compiles with mem=256m and they seemed to be fine. > start 2 processes that each mmap a separate 64M file, and which does > sequential writes on them. start a 3th process that does the same with > 64M anonymous. > > wait for a while, and you'll see order=1 failures. Really? That means we can no longer even allocate stacks for forking. Its surprising that neither lumpy reclaim nor the mobility patches can deal with it? Lumpy reclaim should be able to free neighboring pages to avoid the order 1 failure unless there are lots of pinned pages. I guess then that lots of pages are pinned through I/O? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 18:20 ` Christoph Lameter @ 2007-09-28 18:25 ` Peter Zijlstra 2007-09-28 18:41 ` Christoph Lameter ` (2 more replies) 2007-09-29 8:45 ` Peter Zijlstra 1 sibling, 3 replies; 57+ messages in thread From: Peter Zijlstra @ 2007-09-28 18:25 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: > > start 2 processes that each mmap a separate 64M file, and which does > > sequential writes on them. start a 3th process that does the same with > > 64M anonymous. > > > > wait for a while, and you'll see order=1 failures. > > Really? That means we can no longer even allocate stacks for forking. > > Its surprising that neither lumpy reclaim nor the mobility patches can > deal with it? Lumpy reclaim should be able to free neighboring pages to > avoid the order 1 failure unless there are lots of pinned pages. > > I guess then that lots of pages are pinned through I/O? memory got massively fragemented, as anti-frag gets easily defeated. setting min_free_kbytes to 12M does seem to solve it - it forces 2 max order blocks to stay available, so we don't mix types. however 12M on 128M is rather a lot. its still on my todo list to look at it further.. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 18:25 ` Peter Zijlstra @ 2007-09-28 18:41 ` Christoph Lameter 2007-09-28 20:22 ` Nick Piggin 2007-09-28 21:14 ` Mel Gorman 2007-09-28 20:59 ` Mel Gorman 2007-09-29 8:13 ` Andrew Morton 2 siblings, 2 replies; 57+ messages in thread From: Christoph Lameter @ 2007-09-28 18:41 UTC (permalink / raw) To: Peter Zijlstra Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Fri, 28 Sep 2007, Peter Zijlstra wrote: > memory got massively fragemented, as anti-frag gets easily defeated. > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max > order blocks to stay available, so we don't mix types. however 12M on > 128M is rather a lot. Yes, strict ordering would be much better. On NUMA it may be possible to completely forbid merging. We can fall back to other nodes if necessary. 12M is not much on a NUMA system. But this shows that (unsurprisingly) we may have issues on systems with a small amounts of memory and we may not want to use higher orders on such systems. The case you got may be good to use as a testcase for the virtual fallback. Hmmmm... Maybe it is possible to allocate the stack as a virtual compound page. Got some script/code to produce that problem? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 18:41 ` Christoph Lameter @ 2007-09-28 20:22 ` Nick Piggin 2007-09-28 21:14 ` Mel Gorman 1 sibling, 0 replies; 57+ messages in thread From: Nick Piggin @ 2007-09-28 20:22 UTC (permalink / raw) To: Christoph Lameter Cc: Peter Zijlstra, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Saturday 29 September 2007 04:41, Christoph Lameter wrote: > On Fri, 28 Sep 2007, Peter Zijlstra wrote: > > memory got massively fragemented, as anti-frag gets easily defeated. > > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max > > order blocks to stay available, so we don't mix types. however 12M on > > 128M is rather a lot. > > Yes, strict ordering would be much better. On NUMA it may be possible to > completely forbid merging. We can fall back to other nodes if necessary. > 12M is not much on a NUMA system. > > But this shows that (unsurprisingly) we may have issues on systems with a > small amounts of memory and we may not want to use higher orders on such > systems. > > The case you got may be good to use as a testcase for the virtual > fallback. Hmmmm... Maybe it is possible to allocate the stack as a virtual > compound page. Got some script/code to produce that problem? Yeah, you could do that, but we generally don't have big problems allocating stacks in mainline, because we have very few users of higher order pages, the few that are there don't seem to be a problem. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 18:41 ` Christoph Lameter 2007-09-28 20:22 ` Nick Piggin @ 2007-09-28 21:14 ` Mel Gorman 1 sibling, 0 replies; 57+ messages in thread From: Mel Gorman @ 2007-09-28 21:14 UTC (permalink / raw) To: Christoph Lameter Cc: Peter Zijlstra, Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On (28/09/07 11:41), Christoph Lameter didst pronounce: > On Fri, 28 Sep 2007, Peter Zijlstra wrote: > > > memory got massively fragemented, as anti-frag gets easily defeated. > > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max > > order blocks to stay available, so we don't mix types. however 12M on > > 128M is rather a lot. > > Yes, strict ordering would be much better. On NUMA it may be possible to > completely forbid merging. The forbidding of merging is trivial and the code is isolated to one function __rmqueue_fallback(). We don't do it because the decision at development time was that it was better to allow fragmentation than take a reclaim step for example[1] and slow things up. This is based on my initial assumption of anti-frag being mainly of interest to hugepages which are happy to wait long periods during startup or fail. > We can fall back to other nodes if necessary. > 12M is not much on a NUMA system. > > But this shows that (unsurprisingly) we may have issues on systems with a > small amounts of memory and we may not want to use higher orders on such > systems. > This is another option if you want to use a higher order for SLUB by default. Use order-0 unless you are sure there is enough memory. At boot if there is loads of memory, set the higher order and up min_free_kbytes on each node to reduce mixing[2]. We can test with Peters uber-hostile case to see if it works[3]. > The case you got may be good to use as a testcase for the virtual > fallback. Hmmmm... For sure. > Maybe it is possible to allocate the stack as a virtual > compound page. Got some script/code to produce that problem? > [1] It might be tunnel vision but I still keep hugepages in mind as the principal user of anti-frag. Andy used to have patches that force evicted pages of the "foreign" type when mixing occured so the end result was no mixing. We never fully completed them because it was too costly for hugepages. [2] This would require the identification of mixed blocks to be a statistic available in mainline. Right now, it's only available in -mm when PAGE_OWNER is set [3] The definition of working in this case being that order-0 allocations fail which he has produced -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 18:25 ` Peter Zijlstra 2007-09-28 18:41 ` Christoph Lameter @ 2007-09-28 20:59 ` Mel Gorman 2007-09-29 8:13 ` Andrew Morton 2 siblings, 0 replies; 57+ messages in thread From: Mel Gorman @ 2007-09-28 20:59 UTC (permalink / raw) To: Peter Zijlstra Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On (28/09/07 20:25), Peter Zijlstra didst pronounce: > > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: > > > > start 2 processes that each mmap a separate 64M file, and which does > > > sequential writes on them. start a 3th process that does the same with > > > 64M anonymous. > > > > > > wait for a while, and you'll see order=1 failures. > > > > Really? That means we can no longer even allocate stacks for forking. > > > > Its surprising that neither lumpy reclaim nor the mobility patches can > > deal with it? Lumpy reclaim should be able to free neighboring pages to > > avoid the order 1 failure unless there are lots of pinned pages. > > > > I guess then that lots of pages are pinned through I/O? > > memory got massively fragemented, as anti-frag gets easily defeated. > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max The 12MB is related to the size of pageblock_order. I strongly suspect that if you forced pageblock_order to be something like 4 or 5, the min_free_kbytes would not need to be raised. The current values are selected based on the hugepage size. > order blocks to stay available, so we don't mix types. however 12M on > 128M is rather a lot. > > its still on my todo list to look at it further.. > -- -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 18:25 ` Peter Zijlstra 2007-09-28 18:41 ` Christoph Lameter 2007-09-28 20:59 ` Mel Gorman @ 2007-09-29 8:13 ` Andrew Morton 2007-09-29 8:47 ` Peter Zijlstra 2 siblings, 1 reply; 57+ messages in thread From: Andrew Morton @ 2007-09-29 8:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: > > > > start 2 processes that each mmap a separate 64M file, and which does > > > sequential writes on them. start a 3th process that does the same with > > > 64M anonymous. > > > > > > wait for a while, and you'll see order=1 failures. > > > > Really? That means we can no longer even allocate stacks for forking. > > > > Its surprising that neither lumpy reclaim nor the mobility patches can > > deal with it? Lumpy reclaim should be able to free neighboring pages to > > avoid the order 1 failure unless there are lots of pinned pages. > > > > I guess then that lots of pages are pinned through I/O? > > memory got massively fragemented, as anti-frag gets easily defeated. > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max > order blocks to stay available, so we don't mix types. however 12M on > 128M is rather a lot. > > its still on my todo list to look at it further.. > That would be really really bad (as in: patch-dropping time) if those order-1 allocations are not atomic. What's the callsite? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 8:13 ` Andrew Morton @ 2007-09-29 8:47 ` Peter Zijlstra 2007-09-29 8:53 ` Peter Zijlstra 2007-09-29 9:00 ` Andrew Morton 0 siblings, 2 replies; 57+ messages in thread From: Peter Zijlstra @ 2007-09-29 8:47 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Sat, 2007-09-29 at 01:13 -0700, Andrew Morton wrote: > On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > > > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: > > > > > > start 2 processes that each mmap a separate 64M file, and which does > > > > sequential writes on them. start a 3th process that does the same with > > > > 64M anonymous. > > > > > > > > wait for a while, and you'll see order=1 failures. > > > > > > Really? That means we can no longer even allocate stacks for forking. > > > > > > Its surprising that neither lumpy reclaim nor the mobility patches can > > > deal with it? Lumpy reclaim should be able to free neighboring pages to > > > avoid the order 1 failure unless there are lots of pinned pages. > > > > > > I guess then that lots of pages are pinned through I/O? > > > > memory got massively fragemented, as anti-frag gets easily defeated. > > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max > > order blocks to stay available, so we don't mix types. however 12M on > > 128M is rather a lot. > > > > its still on my todo list to look at it further.. > > > > That would be really really bad (as in: patch-dropping time) if those > order-1 allocations are not atomic. > > What's the callsite? Ah, right, that was the detail... all this lumpy reclaim is useless for atomic allocations. And with SLUB using higher order pages, atomic !0 order allocations will be very very common. One I can remember was: add_to_page_cache() radix_tree_insert() radix_tree_node_alloc() kmem_cache_alloc() which is an atomic callsite. Which leaves us in a situation where we can load pages, because there is free memory, but can't manage to allocate memory to track them.. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 8:47 ` Peter Zijlstra @ 2007-09-29 8:53 ` Peter Zijlstra 2007-09-29 9:01 ` Andrew Morton 2007-09-29 9:00 ` Andrew Morton 1 sibling, 1 reply; 57+ messages in thread From: Peter Zijlstra @ 2007-09-29 8:53 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote: > Ah, right, that was the detail... all this lumpy reclaim is useless for > atomic allocations. And with SLUB using higher order pages, atomic !0 > order allocations will be very very common. > > One I can remember was: > > add_to_page_cache() > radix_tree_insert() > radix_tree_node_alloc() > kmem_cache_alloc() > > which is an atomic callsite. > > Which leaves us in a situation where we can load pages, because there is > free memory, but can't manage to allocate memory to track them.. Ah, I found a boot log of one of these sessions, its also full of order-2 OOMs.. :-/ ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 8:53 ` Peter Zijlstra @ 2007-09-29 9:01 ` Andrew Morton 2007-09-29 9:14 ` Peter Zijlstra 0 siblings, 1 reply; 57+ messages in thread From: Andrew Morton @ 2007-09-29 9:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Sat, 29 Sep 2007 10:53:41 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote: > > > Ah, right, that was the detail... all this lumpy reclaim is useless for > > atomic allocations. And with SLUB using higher order pages, atomic !0 > > order allocations will be very very common. > > > > One I can remember was: > > > > add_to_page_cache() > > radix_tree_insert() > > radix_tree_node_alloc() > > kmem_cache_alloc() > > > > which is an atomic callsite. > > > > Which leaves us in a situation where we can load pages, because there is > > free memory, but can't manage to allocate memory to track them.. > > Ah, I found a boot log of one of these sessions, its also full of > order-2 OOMs.. :-/ oom-killings, or page allocation failures? The latter, one hopes. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 9:01 ` Andrew Morton @ 2007-09-29 9:14 ` Peter Zijlstra 2007-09-29 9:27 ` Andrew Morton 0 siblings, 1 reply; 57+ messages in thread From: Peter Zijlstra @ 2007-09-29 9:14 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Sat, 2007-09-29 at 02:01 -0700, Andrew Morton wrote: > On Sat, 29 Sep 2007 10:53:41 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > > > On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote: > > > > > Ah, right, that was the detail... all this lumpy reclaim is useless for > > > atomic allocations. And with SLUB using higher order pages, atomic !0 > > > order allocations will be very very common. > > > > > > One I can remember was: > > > > > > add_to_page_cache() > > > radix_tree_insert() > > > radix_tree_node_alloc() > > > kmem_cache_alloc() > > > > > > which is an atomic callsite. > > > > > > Which leaves us in a situation where we can load pages, because there is > > > free memory, but can't manage to allocate memory to track them.. > > > > Ah, I found a boot log of one of these sessions, its also full of > > order-2 OOMs.. :-/ > > oom-killings, or page allocation failures? The latter, one hopes. Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007 ... mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0 Call Trace: 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17 611b3888: [<60052ed4>] out_of_memory+0x80/0x100 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0 611b3968: [<6006c705>] new_slab+0x7e/0x183 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a 611b3b98: [<60056f00>] read_pages+0x37/0x9b 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246 611b3da8: [<60120a7b>] __up_read+0x73/0x7b 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b 611b3e48: [<60013419>] segv+0xac/0x297 611b3f28: [<60013367>] segv_handler+0x68/0x6e 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1 611b3f68: [<60023853>] userspace+0x13a/0x19d 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d Mem-info: Normal per-cpu: CPU 0: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0 Active:11 inactive:9 dirty:0 writeback:1 unstable:0 free:19533 slab:10587 mapped:0 pagetables:260 bounce:0 Normal free:78132kB min:4096kB low:5120kB high:6144kB active:44kB inactive:36kB present:129280kB pages_scanned:0 all_unreclaimable? 
no lowmem_reserve[]: 0 0 Normal: 7503*4kB 5977*8kB 19*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 78132kB Swap cache: add 1192822, delete 1192790, find 491441/626861, race 0+1 Free swap = 455300kB Total swap = 524280kB Free swap: 455300kB 32768 pages of RAM 0 pages of HIGHMEM 1948 reserved pages 11 pages shared 32 pages swap cached Out of memory: kill process 2647 (portmap) score 2233 or a child Killed process 2647 (portmap) ^ permalink raw reply [flat|nested] 57+ messages in thread
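The per-order picture in that Mem-info dump can be watched while the test runs via /proc/buddyinfo, which prints one column of free-block counts per order. The output line below is reconstructed from the dump above rather than copied from the machine:

    $ cat /proc/buddyinfo
    Node 0, zone   Normal   7503   5977     19      0      0      0      0      0      0      0      0

Roughly 78MB is free, but almost all of it sits in order-0 and order-1 blocks; from order 3 upwards there is nothing left, which is what "massively fragmented" looks like in numbers.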
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 9:14 ` Peter Zijlstra @ 2007-09-29 9:27 ` Andrew Morton 2007-09-28 20:19 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Andrew Morton @ 2007-09-29 9:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > oom-killings, or page allocation failures? The latter, one hopes. > > > Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007 > > ... > > > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0 > Call Trace: > 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17 > 611b3888: [<60052ed4>] out_of_memory+0x80/0x100 > 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280 > 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0 > 611b3968: [<6006c705>] new_slab+0x7e/0x183 > 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b > 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf > 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472 > 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf > 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98 > 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf > 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7 > 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24 > 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109 > 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2 > 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1 > 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a > 611b3b98: [<60056f00>] read_pages+0x37/0x9b > 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157 > 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f > 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278 > 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd > 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca > 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33 > 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246 > 611b3da8: [<60120a7b>] __up_read+0x73/0x7b > 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b > 611b3e48: [<60013419>] segv+0xac/0x297 > 611b3f28: [<60013367>] segv_handler+0x68/0x6e > 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1 > 611b3f68: [<60023853>] userspace+0x13a/0x19d > 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d OK, that's different. Someone broke the vm - order-2 GFP_KERNEL allocations aren't supposed to fail. I'm suspecting that did_some_progress thing. ^ permalink raw reply [flat|nested] 57+ messages in thread
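For reference, the mask in that report decodes to a plain sleeping allocation. The breakdown assumes the 2.6.23-era flag values from include/linux/gfp.h:

    0x40d0 = __GFP_WAIT (0x10) | __GFP_IO (0x40) | __GFP_FS (0x80) | __GFP_COMP (0x4000)
           = GFP_KERNEL | __GFP_COMP

So the caller was allowed to sleep, do I/O and run reclaim, which is why ending up in the OOM killer at order 2 is surprising.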
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 9:27 ` Andrew Morton @ 2007-09-28 20:19 ` Nick Piggin 2007-09-29 19:20 ` Andrew Morton 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2007-09-28 20:19 UTC (permalink / raw) To: Andrew Morton Cc: Peter Zijlstra, Christoph Lameter, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Saturday 29 September 2007 19:27, Andrew Morton wrote: > On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > oom-killings, or page allocation failures? The latter, one hopes. > > > > Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 (Ubuntu > > 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007 > > > > ... > > > > > > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0 > > Call Trace: > > 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17 > > 611b3888: [<60052ed4>] out_of_memory+0x80/0x100 > > 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280 > > 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0 > > 611b3968: [<6006c705>] new_slab+0x7e/0x183 > > 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b > > 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf > > 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472 > > 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf > > 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98 > > 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf > > 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7 > > 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24 > > 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109 > > 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2 > > 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1 > > 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a > > 611b3b98: [<60056f00>] read_pages+0x37/0x9b > > 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157 > > 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f > > 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278 > > 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd > > 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca > > 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33 > > 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246 > > 611b3da8: [<60120a7b>] __up_read+0x73/0x7b > > 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b > > 611b3e48: [<60013419>] segv+0xac/0x297 > > 611b3f28: [<60013367>] segv_handler+0x68/0x6e > > 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1 > > 611b3f68: [<60023853>] userspace+0x13a/0x19d > > 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d > > OK, that's different. Someone broke the vm - order-2 GFP_KERNEL > allocations aren't supposed to fail. > > I'm suspecting that did_some_progress thing. The allocation didn't fail -- it invoked the OOM killer because the kernel ran out of unfragmented memory. Probably because higher order allocations are the new vogue in -mm at the moment ;) ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 20:19 ` Nick Piggin @ 2007-09-29 19:20 ` Andrew Morton 2007-09-29 19:09 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Andrew Morton @ 2007-09-29 19:20 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Christoph Lameter, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Saturday 29 September 2007 19:27, Andrew Morton wrote: > > On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> > wrote: > > > > oom-killings, or page allocation failures? The latter, one hopes. > > > > > > Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 (Ubuntu > > > 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007 > > > > > > ... > > > > > > > > > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0 > > > Call Trace: > > > 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17 > > > 611b3888: [<60052ed4>] out_of_memory+0x80/0x100 > > > 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280 > > > 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0 > > > 611b3968: [<6006c705>] new_slab+0x7e/0x183 > > > 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b > > > 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf > > > 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472 > > > 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf > > > 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98 > > > 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf > > > 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7 > > > 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24 > > > 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109 > > > 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2 > > > 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1 > > > 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a > > > 611b3b98: [<60056f00>] read_pages+0x37/0x9b > > > 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157 > > > 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f > > > 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278 > > > 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd > > > 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca > > > 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33 > > > 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246 > > > 611b3da8: [<60120a7b>] __up_read+0x73/0x7b > > > 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b > > > 611b3e48: [<60013419>] segv+0xac/0x297 > > > 611b3f28: [<60013367>] segv_handler+0x68/0x6e > > > 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1 > > > 611b3f68: [<60023853>] userspace+0x13a/0x19d > > > 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d > > > > OK, that's different. Someone broke the vm - order-2 GFP_KERNEL > > allocations aren't supposed to fail. > > > > I'm suspecting that did_some_progress thing. > > The allocation didn't fail -- it invoked the OOM killer because the kernel > ran out of unfragmented memory. We can't "run out of unfragmented memory" for an order-2 GFP_KERNEL allocation in this workload. We go and synchronously free stuff up to make it work. How did this get broken? > Probably because higher order > allocations are the new vogue in -mm at the moment ;) That's a different bug. bug 1: We shouldn't be doing higher-order allocations in slub because of the considerable damage this does to atomic allocations. bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this. 
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 19:20 ` Andrew Morton @ 2007-09-29 19:09 ` Nick Piggin 2007-09-30 20:12 ` Andrew Morton 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2007-09-29 19:09 UTC (permalink / raw) To: Andrew Morton Cc: Peter Zijlstra, Christoph Lameter, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Sunday 30 September 2007 05:20, Andrew Morton wrote: > On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > On Saturday 29 September 2007 19:27, Andrew Morton wrote: > > > On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra > > > <a.p.zijlstra@chello.nl> > > > > wrote: > > > > > oom-killings, or page allocation failures? The latter, one hopes. > > > > > > > > Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 > > > > (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007 > > > > > > > > ... > > > > > > > > > > > > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0 > > > > Call Trace: > > > > 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17 > > > > 611b3888: [<60052ed4>] out_of_memory+0x80/0x100 > > > > 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280 > > > > 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0 > > > > 611b3968: [<6006c705>] new_slab+0x7e/0x183 > > > > 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b > > > > 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf > > > > 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472 > > > > 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf > > > > 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98 > > > > 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf > > > > 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7 > > > > 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24 > > > > 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109 > > > > 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2 > > > > 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1 > > > > 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a > > > > 611b3b98: [<60056f00>] read_pages+0x37/0x9b > > > > 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157 > > > > 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f > > > > 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278 > > > > 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd > > > > 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca > > > > 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33 > > > > 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246 > > > > 611b3da8: [<60120a7b>] __up_read+0x73/0x7b > > > > 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b > > > > 611b3e48: [<60013419>] segv+0xac/0x297 > > > > 611b3f28: [<60013367>] segv_handler+0x68/0x6e > > > > 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1 > > > > 611b3f68: [<60023853>] userspace+0x13a/0x19d > > > > 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d > > > > > > OK, that's different. Someone broke the vm - order-2 GFP_KERNEL > > > allocations aren't supposed to fail. > > > > > > I'm suspecting that did_some_progress thing. > > > > The allocation didn't fail -- it invoked the OOM killer because the > > kernel ran out of unfragmented memory. > > We can't "run out of unfragmented memory" for an order-2 GFP_KERNEL > allocation in this workload. We go and synchronously free stuff up to make > it work. > > How did this get broken? Either no more order-2 pages could be freed, or the ones that were being freed were being used by something else (eg. other order-2 slab allocations). 
> > Probably because higher order > > allocations are the new vogue in -mm at the moment ;) > > That's a different bug. > > bug 1: We shouldn't be doing higher-order allocations in slub because of > the considerable damage this does to atomic allocations. > > bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this. I think one causes 2 as well -- it isn't just considerable damage to atomic allocations but to GFP_KERNEL allocations too. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 19:09 ` Nick Piggin @ 2007-09-30 20:12 ` Andrew Morton 2007-09-30 4:16 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Andrew Morton @ 2007-09-30 20:12 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Christoph Lameter, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Sun, 30 Sep 2007 05:09:28 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Sunday 30 September 2007 05:20, Andrew Morton wrote: > > On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin <nickpiggin@yahoo.com.au> > wrote: > > > On Saturday 29 September 2007 19:27, Andrew Morton wrote: > > > > On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra > > > > <a.p.zijlstra@chello.nl> > > > > > > wrote: > > > > > > oom-killings, or page allocation failures? The latter, one hopes. > > > > > > > > > > Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 > > > > > (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007 > > > > > > > > > > ... > > > > > > > > > > > > > > > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0 > > > > > Call Trace: > > > > > 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17 > > > > > 611b3888: [<60052ed4>] out_of_memory+0x80/0x100 > > > > > 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280 > > > > > 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0 > > > > > 611b3968: [<6006c705>] new_slab+0x7e/0x183 > > > > > 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b > > > > > 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf > > > > > 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472 > > > > > 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf > > > > > 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98 > > > > > 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf > > > > > 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7 > > > > > 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24 > > > > > 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109 > > > > > 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2 > > > > > 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1 > > > > > 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a > > > > > 611b3b98: [<60056f00>] read_pages+0x37/0x9b > > > > > 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157 > > > > > 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f > > > > > 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278 > > > > > 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd > > > > > 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca > > > > > 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33 > > > > > 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246 > > > > > 611b3da8: [<60120a7b>] __up_read+0x73/0x7b > > > > > 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b > > > > > 611b3e48: [<60013419>] segv+0xac/0x297 > > > > > 611b3f28: [<60013367>] segv_handler+0x68/0x6e > > > > > 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1 > > > > > 611b3f68: [<60023853>] userspace+0x13a/0x19d > > > > > 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d > > > > > > > > OK, that's different. Someone broke the vm - order-2 GFP_KERNEL > > > > allocations aren't supposed to fail. > > > > > > > > I'm suspecting that did_some_progress thing. > > > > > > The allocation didn't fail -- it invoked the OOM killer because the > > > kernel ran out of unfragmented memory. > > > > We can't "run out of unfragmented memory" for an order-2 GFP_KERNEL > > allocation in this workload. 
We go and synchronously free stuff up to make > > it work. > > > > How did this get broken? > > Either no more order-2 pages could be freed, or the ones that were being > freed were being used by something else (eg. other order-2 slab allocations). No. The current design of reclaim (for better or for worse) is that for order 0,1,2 and 3 allocations we just keep on trying until it works. That got broken and I think it got broken at a design level when that did_some_progress logic went in. Perhaps something else we did later worsened things. > > > > Probably because higher order > > > allocations are the new vogue in -mm at the moment ;) > > > > That's a different bug. > > > > bug 1: We shouldn't be doing higher-order allocations in slub because of > > the considerable damage this does to atomic allocations. > > > > bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this. > > I think one causes 2 as well -- it isn't just considerable damage to atomic > allocations but to GFP_KERNEL allocations too. Well sure, because we already broke GFP_KERNEL allocations. ^ permalink raw reply [flat|nested] 57+ messages in thread
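The "keep on trying" behaviour being described is the retry test at the end of the allocator's slow path. Paraphrased and simplified from the 2.6.23-era __alloc_pages(), so treat the details as approximate:

    /* Reclaim made some progress but the allocation still failed:
     * small orders (and __GFP_REPEAT / __GFP_NOFAIL callers) loop
     * back to try again instead of returning NULL. */
    if (!(gfp_mask & __GFP_NORETRY)) {
            if (order <= PAGE_ALLOC_COSTLY_ORDER ||
                (gfp_mask & (__GFP_REPEAT | __GFP_NOFAIL)))
                    goto rebalance;
    }

The OOM kill in Peter's trace comes from the branch just before this one, taken when reclaim reports no progress at all -- the did_some_progress logic mentioned earlier.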
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-30 20:12 ` Andrew Morton @ 2007-09-30 4:16 ` Nick Piggin 0 siblings, 0 replies; 57+ messages in thread From: Nick Piggin @ 2007-09-30 4:16 UTC (permalink / raw) To: Andrew Morton Cc: Peter Zijlstra, Christoph Lameter, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Monday 01 October 2007 06:12, Andrew Morton wrote: > On Sun, 30 Sep 2007 05:09:28 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > On Sunday 30 September 2007 05:20, Andrew Morton wrote: > > > We can't "run out of unfragmented memory" for an order-2 GFP_KERNEL > > > allocation in this workload. We go and synchronously free stuff up to > > > make it work. > > > > > > How did this get broken? > > > > Either no more order-2 pages could be freed, or the ones that were being > > freed were being used by something else (eg. other order-2 slab > > allocations). > > No. The current design of reclaim (for better or for worse) is that for > order 0,1,2 and 3 allocations we just keep on trying until it works. That > got broken and I think it got broken at a design level when that > did_some_progress logic went in. Perhaps something else we did later > worsened things. It will keep trying until it works. It won't have stopped trying (unless I'm very mistaken?), it's just oom killing things merrily along the way. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 8:47 ` Peter Zijlstra 2007-09-29 8:53 ` Peter Zijlstra @ 2007-09-29 9:00 ` Andrew Morton 2007-10-01 20:55 ` Christoph Lameter 1 sibling, 1 reply; 57+ messages in thread From: Andrew Morton @ 2007-09-29 9:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Sat, 29 Sep 2007 10:47:12 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > On Sat, 2007-09-29 at 01:13 -0700, Andrew Morton wrote: > > On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > > > > > > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: > > > > > > > > start 2 processes that each mmap a separate 64M file, and which does > > > > > sequential writes on them. start a 3th process that does the same with > > > > > 64M anonymous. > > > > > > > > > > wait for a while, and you'll see order=1 failures. > > > > > > > > Really? That means we can no longer even allocate stacks for forking. > > > > > > > > Its surprising that neither lumpy reclaim nor the mobility patches can > > > > deal with it? Lumpy reclaim should be able to free neighboring pages to > > > > avoid the order 1 failure unless there are lots of pinned pages. > > > > > > > > I guess then that lots of pages are pinned through I/O? > > > > > > memory got massively fragemented, as anti-frag gets easily defeated. > > > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max > > > order blocks to stay available, so we don't mix types. however 12M on > > > 128M is rather a lot. > > > > > > its still on my todo list to look at it further.. > > > > > > > That would be really really bad (as in: patch-dropping time) if those > > order-1 allocations are not atomic. > > > > What's the callsite? > > Ah, right, that was the detail... all this lumpy reclaim is useless for > atomic allocations. And with SLUB using higher order pages, atomic !0 > order allocations will be very very common. Oh OK. I thought we'd already fixed slub so that it didn't do that. Maybe that fix is in -mm but I don't think so. Trying to do atomic order-1 allocations on behalf of arbitray slab caches just won't fly - this is a significant degradation in kernel reliability, as you've very easily demonstrated. > One I can remember was: > > add_to_page_cache() > radix_tree_insert() > radix_tree_node_alloc() > kmem_cache_alloc() > > which is an atomic callsite. > > Which leaves us in a situation where we can load pages, because there is > free memory, but can't manage to allocate memory to track them.. Right. Leading to application failure which for many is equivalent to a complete system outage. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 9:00 ` Andrew Morton @ 2007-10-01 20:55 ` Christoph Lameter 2007-10-01 21:30 ` Andrew Morton 0 siblings, 1 reply; 57+ messages in thread From: Christoph Lameter @ 2007-10-01 20:55 UTC (permalink / raw) To: Andrew Morton Cc: Peter Zijlstra, Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Sat, 29 Sep 2007, Andrew Morton wrote: > > atomic allocations. And with SLUB using higher order pages, atomic !0 > > order allocations will be very very common. > > Oh OK. > > I thought we'd already fixed slub so that it didn't do that. Maybe that > fix is in -mm but I don't think so. > > Trying to do atomic order-1 allocations on behalf of arbitray slab caches > just won't fly - this is a significant degradation in kernel reliability, > as you've very easily demonstrated. Ummm... SLAB also does order 1 allocations. We have always done them. See mm/slab.c /* * Do not go above this order unless 0 objects fit into the slab. */ #define BREAK_GFP_ORDER_HI 1 #define BREAK_GFP_ORDER_LO 0 static int slab_break_gfp_order = BREAK_GFP_ORDER_LO; ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-10-01 20:55 ` Christoph Lameter @ 2007-10-01 21:30 ` Andrew Morton 2007-10-01 21:38 ` Christoph Lameter 2007-10-02 9:19 ` Peter Zijlstra 0 siblings, 2 replies; 57+ messages in thread From: Andrew Morton @ 2007-10-01 21:30 UTC (permalink / raw) To: Christoph Lameter Cc: a.p.zijlstra, nickpiggin, hch, mel, linux-fsdevel, linux-kernel, dgc, jens.axboe On Mon, 1 Oct 2007 13:55:29 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Sat, 29 Sep 2007, Andrew Morton wrote: > > > > atomic allocations. And with SLUB using higher order pages, atomic !0 > > > order allocations will be very very common. > > > > Oh OK. > > > > I thought we'd already fixed slub so that it didn't do that. Maybe that > > fix is in -mm but I don't think so. > > > > Trying to do atomic order-1 allocations on behalf of arbitray slab caches > > just won't fly - this is a significant degradation in kernel reliability, > > as you've very easily demonstrated. > > Ummm... SLAB also does order 1 allocations. We have always done them. > > See mm/slab.c > > /* > * Do not go above this order unless 0 objects fit into the slab. > */ > #define BREAK_GFP_ORDER_HI 1 > #define BREAK_GFP_ORDER_LO 0 > static int slab_break_gfp_order = BREAK_GFP_ORDER_LO; Do slab and slub use the same underlying page size for each slab? Single data point: the CONFIG_SLAB boxes which I have access to here are using order-0 for radix_tree_node, so they won't be failing in the way in which Peter's machine is. I've never ever before seen reports of page allocation failures in the radix-tree node allocation code, and that's the bottom line. This is just a drop-dead must-fix show-stopping bug. We cannot rely upon atomic order-1 allocations succeeding so we cannot use them for radix-tree nodes. Nor for lots of other things which we have no chance of identifying. Peter, is this bug -mm only, or is 2.6.23 similarly failing? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-10-01 21:30 ` Andrew Morton @ 2007-10-01 21:38 ` Christoph Lameter 2007-10-01 21:45 ` Andrew Morton 2007-10-02 9:19 ` Peter Zijlstra 1 sibling, 1 reply; 57+ messages in thread From: Christoph Lameter @ 2007-10-01 21:38 UTC (permalink / raw) To: Andrew Morton Cc: a.p.zijlstra, nickpiggin, hch, mel, linux-fsdevel, linux-kernel, dgc, jens.axboe On Mon, 1 Oct 2007, Andrew Morton wrote: > Do slab and slub use the same underlying page size for each slab? SLAB cannot pack objects as dense as SLUB and they have different algorithm to make the choice of order. Thus the number of objects per slab may vary between SLAB and SLUB and therefore also the choice of order to store these objects. > Single data point: the CONFIG_SLAB boxes which I have access to here are > using order-0 for radix_tree_node, so they won't be failing in the way in > which Peter's machine is. Upstream SLUB uses order 0 allocations for the radix tree. MM varies because the use of higher order allocs is more loose if the mobility algorithms are found to be active: 2.6.23-rc8: Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg\ radix_tree_node 14281 552 9.9M 2432/948/1 7 0 38 79 ^ permalink raw reply [flat|nested] 57+ messages in thread
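The figures in that line are worth a quick check, since they show what order 0 costs for this cache (arithmetic only, using the numbers above):

    space             : 2432 slabs * 4KB                     ~ 9.9M  (matches "Space")
    packing           : 7 objects * 552 bytes = 3864 of 4096 bytes per page (~94%)
    usable efficiency : 14281 * 552 bytes / 9.9M             ~ 79%   (matches "%Ef")

The gap between the ~94% per-page packing and the 79% overall efficiency is the roughly 38% of slabs that are only partially populated (948 of 2432, matching "%Fr").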
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-10-01 21:38 ` Christoph Lameter @ 2007-10-01 21:45 ` Andrew Morton 2007-10-01 21:52 ` Christoph Lameter 0 siblings, 1 reply; 57+ messages in thread From: Andrew Morton @ 2007-10-01 21:45 UTC (permalink / raw) To: Christoph Lameter Cc: a.p.zijlstra, nickpiggin, hch, mel, linux-fsdevel, linux-kernel, dgc, jens.axboe On Mon, 1 Oct 2007 14:38:55 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Mon, 1 Oct 2007, Andrew Morton wrote: > > > Do slab and slub use the same underlying page size for each slab? > > SLAB cannot pack objects as dense as SLUB and they have different > algorithm to make the choice of order. Thus the number of objects per slab > may vary between SLAB and SLUB and therefore also the choice of order to > store these objects. > > > Single data point: the CONFIG_SLAB boxes which I have access to here are > > using order-0 for radix_tree_node, so they won't be failing in the way in > > which Peter's machine is. > > Upstream SLUB uses order 0 allocations for the radix tree. OK, that's a relief. > MM varies > because the use of higher order allocs is more loose if the mobility > algorithms are found to be active: > > 2.6.23-rc8: > > Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg\ > radix_tree_node 14281 552 9.9M 2432/948/1 7 0 38 79 Ah. So the already-dropped slub-exploit-page-mobility-to-increase-allocation-order.patch was the culprit? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-10-01 21:45 ` Andrew Morton @ 2007-10-01 21:52 ` Christoph Lameter 0 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-10-01 21:52 UTC (permalink / raw) To: Andrew Morton Cc: a.p.zijlstra, nickpiggin, hch, mel, linux-fsdevel, linux-kernel, dgc, jens.axboe On Mon, 1 Oct 2007, Andrew Morton wrote: > Ah. So the already-dropped > slub-exploit-page-mobility-to-increase-allocation-order.patch was the > culprit? Yes without that patch SLUB will no longer take special action if antifrag is around. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-10-01 21:30 ` Andrew Morton 2007-10-01 21:38 ` Christoph Lameter @ 2007-10-02 9:19 ` Peter Zijlstra 1 sibling, 0 replies; 57+ messages in thread From: Peter Zijlstra @ 2007-10-02 9:19 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, nickpiggin, hch, mel, linux-fsdevel, linux-kernel, dgc, jens.axboe [-- Attachment #1: Type: text/plain, Size: 1924 bytes --] On Mon, 2007-10-01 at 14:30 -0700, Andrew Morton wrote: > On Mon, 1 Oct 2007 13:55:29 -0700 (PDT) > Christoph Lameter <clameter@sgi.com> wrote: > > > On Sat, 29 Sep 2007, Andrew Morton wrote: > > > > > > atomic allocations. And with SLUB using higher order pages, atomic !0 > > > > order allocations will be very very common. > > > > > > Oh OK. > > > > > > I thought we'd already fixed slub so that it didn't do that. Maybe that > > > fix is in -mm but I don't think so. > > > > > > Trying to do atomic order-1 allocations on behalf of arbitray slab caches > > > just won't fly - this is a significant degradation in kernel reliability, > > > as you've very easily demonstrated. > > > > Ummm... SLAB also does order 1 allocations. We have always done them. > > > > See mm/slab.c > > > > /* > > * Do not go above this order unless 0 objects fit into the slab. > > */ > > #define BREAK_GFP_ORDER_HI 1 > > #define BREAK_GFP_ORDER_LO 0 > > static int slab_break_gfp_order = BREAK_GFP_ORDER_LO; > > Do slab and slub use the same underlying page size for each slab? > > Single data point: the CONFIG_SLAB boxes which I have access to here are > using order-0 for radix_tree_node, so they won't be failing in the way in > which Peter's machine is. > > I've never ever before seen reports of page allocation failures in the > radix-tree node allocation code, and that's the bottom line. This is just > a drop-dead must-fix show-stopping bug. We cannot rely upon atomic order-1 > allocations succeeding so we cannot use them for radix-tree nodes. Nor for > lots of other things which we have no chance of identifying. > > Peter, is this bug -mm only, or is 2.6.23 similarly failing? I'm mainly using -mm (so you have at least one tester :-), I think the -mm specific SLUB patch that ups slub_min_order makes the problem -mm specific, would have to test .23. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
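For anyone checking which side of this they are on, the order SLUB uses is both visible and tunable. The parameter names below are the documented SLUB boot options of that era; the cache name is only an example:

    # order actually in use for one cache
    cat /sys/kernel/slab/radix_tree_node/order

    # boot parameters: pin every cache to order 0 ...
    slub_max_order=0
    # ... or reproduce the -mm behaviour by raising the minimum
    slub_min_order=2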
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 18:20 ` Christoph Lameter 2007-09-28 18:25 ` Peter Zijlstra @ 2007-09-29 8:45 ` Peter Zijlstra 2007-10-01 21:01 ` Christoph Lameter 1 sibling, 1 reply; 57+ messages in thread From: Peter Zijlstra @ 2007-09-29 8:45 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: > Really? That means we can no longer even allocate stacks for forking. I think I'm running with 4k stacks... ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-29 8:45 ` Peter Zijlstra @ 2007-10-01 21:01 ` Christoph Lameter 2007-10-02 8:37 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Christoph Lameter @ 2007-10-01 21:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe, akpm On Sat, 29 Sep 2007, Peter Zijlstra wrote: > > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: > > > Really? That means we can no longer even allocate stacks for forking. > > I think I'm running with 4k stacks... 4k stacks will never fly on an SGI x86_64 NUMA configuration given the additional data that may be kept on the stack. We are currently considering to go from 8k to 16k (or even 32k) to make things work. So having the ability to put the stacks in vmalloc space may be something to look at. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-10-01 21:01 ` Christoph Lameter @ 2007-10-02 8:37 ` Nick Piggin 0 siblings, 0 replies; 57+ messages in thread From: Nick Piggin @ 2007-10-02 8:37 UTC (permalink / raw) To: Christoph Lameter Cc: Peter Zijlstra, Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe, akpm On Tuesday 02 October 2007 07:01, Christoph Lameter wrote: > On Sat, 29 Sep 2007, Peter Zijlstra wrote: > > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: > > > Really? That means we can no longer even allocate stacks for forking. > > > > I think I'm running with 4k stacks... > > 4k stacks will never fly on an SGI x86_64 NUMA configuration given the > additional data that may be kept on the stack. We are currently > considering to go from 8k to 16k (or even 32k) to make things work. So > having the ability to put the stacks in vmalloc space may be something to > look at. i386 and x86-64 already used 8K stacks for years and they have never really been much problem before. They only started failing when contiguous memory is getting used up by other things, _even with_ those anti-frag patches in there. Bottom line is that you do not use higher order allocations when you do not need them. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 17:33 ` Christoph Lameter 2007-09-28 5:14 ` Nick Piggin 2007-09-28 17:55 ` Peter Zijlstra @ 2007-09-28 21:05 ` Mel Gorman 2007-10-01 21:10 ` Christoph Lameter 2 siblings, 1 reply; 57+ messages in thread From: Mel Gorman @ 2007-09-28 21:05 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe On (28/09/07 10:33), Christoph Lameter didst pronounce: > On Fri, 28 Sep 2007, Nick Piggin wrote: > > > On Wednesday 19 September 2007 13:36, Christoph Lameter wrote: > > > SLAB_VFALLBACK can be specified for selected slab caches. If fallback is > > > available then the conservative settings for higher order allocations are > > > overridden. We then request an order that can accomodate at mininum > > > 100 objects. The size of an individual slab allocation is allowed to reach > > > up to 256k (order 6 on i386, order 4 on IA64). > > > > How come SLUB wants such a big amount of objects? I thought the > > unqueued nature of it made it better than slab because it minimised > > the amount of cache hot memory lying around in slabs... > > The more objects in a page the more the fast path runs. The more the fast > path runs the lower the cache footprint and the faster the overall > allocations etc. > > SLAB can be configured for large queues holdings lots of objects. > SLUB can only reach the same through large pages because it does not > have queues. Large pages, flood gates etc. Be wary. SLUB has to run 100% reliable or things go whoops. SLUB regularly depends on atomic allocations and cannot take the necessary steps to get the contiguous pages if it gets into trouble. This means that something like lumpy reclaim cannot help you in it's current state. We currently do not take the per-emptive steps with kswapd to ensure the high-order pages are free. We also don't do something like have users that can sleep keep the watermarks high. I had considered the possibility but didn't have the justification for the complexity. Minimally, SLUB by default should continue to use order-0 pages. Peter has managed to bust order-1 pages with mem=128MB. Admittedly, it was a really hostile workload but the point remains. It was artifically worked around with min_free_kbytes (value set based on pageblock_order, could also have been artifically worked around by dropping pageblock_order) and he eventually caused order-0 failures so the workload is pretty damn hostile to everything. > One could add the ability to manage pools of cpu slabs but > that would be adding yet another layer to compensate for the problem of > the small pages. A compromise may be to have per-cpu lists for higher-order pages in the page allocator itself as they can be easily drained unlike the SLAB queues. The thing to watch for would be excessive IPI calls which would offset any performance gained by SLUB using larger pages. > Reliable large page allocations means that we can get rid > of these layers and the many workarounds that we have in place right now. > They are not reliable yet, particularly for atomic allocs. > The unqueued nature of SLUB reduces memory requirements and in general the > more efficient code paths of SLUB offset the advantage that SLAB can reach > by being able to put more objects onto its queues. SLAB necessarily > introduces complexity and cache line use through the need to manage those > queues. > > > vmalloc is incredibly slow and unscalable at the moment. 
I'm still working > > on making it more scalable and faster -- hopefully to a point where it would > > actually be usable for this... but you still get moved off large TLBs, and > > also have to inevitably do tlb flushing. > > Again I have not seen any fallbacks to vmalloc in my testing. What we are > doing here is mainly to address your theoretical cases that we so far have > never seen to be a problem and increase the reliability of allocations of > page orders larger than 3 to a usable level. So far I have so far not > dared to enable orders larger than 3 by default. > > AFAICT The performance of vmalloc is not really relevant. If this would > become an issue then it would be possible to reduce the orders used to > avoid fallbacks. > If we're falling back to vmalloc ever, there is a danger that the problem is postponed until vmalloc space is consumed. More an issue for 32 bit. > > Or do you have SLUB at a point where performance is comparable to SLAB, > > and this is just a possible idea for more performance? > > AFAICT SLUBs performance is superior to SLAB in most cases and it was like > that from the beginning. I am still concerned about several corner cases > though (I think most of them are going to be addressed by the per cpu > patches in mm). Having a comparable or larger amount of per cpu objects as > SLAB is something that also could address some of these concerns and could > increase performance much further. > -- -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 57+ messages in thread
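The min_free_kbytes workaround mentioned above is a one-line tunable; 12M was the value that kept two MAX_ORDER blocks free on the 128M test box, and the right number scales with RAM size and pageblock order:

    echo 12288 > /proc/sys/vm/min_free_kbytes
    # or equivalently
    sysctl -w vm.min_free_kbytes=12288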
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK 2007-09-28 21:05 ` Mel Gorman @ 2007-10-01 21:10 ` Christoph Lameter 0 siblings, 0 replies; 57+ messages in thread From: Christoph Lameter @ 2007-10-01 21:10 UTC (permalink / raw) To: Mel Gorman Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel, David Chinner, Jens Axboe, akpm On Fri, 28 Sep 2007, Mel Gorman wrote: > Minimally, SLUB by default should continue to use order-0 pages. Peter has > managed to bust order-1 pages with mem=128MB. Admittedly, it was a really > hostile workload but the point remains. It was artifically worked around > with min_free_kbytes (value set based on pageblock_order, could also have > been artifically worked around by dropping pageblock_order) and he eventually > caused order-0 failures so the workload is pretty damn hostile to everything. SLAB default is order 1 so is SLUB default upstream. SLAB does runtime detection of the amount of memory and configures the max order correspondingly: from mm/slab.c: /* * Fragmentation resistance on low memory - only use bigger * page orders on machines with more than 32MB of memory. */ if (num_physpages > (32 << 20) >> PAGE_SHIFT) slab_break_gfp_order = BREAK_GFP_ORDER_HI; We could duplicate something like that for SLUB. ^ permalink raw reply [flat|nested] 57+ messages in thread
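What "duplicate something like that for SLUB" might look like is sketched below. This is a hypothetical illustration only, not part of the patch set: clamp the maximum slab order on small machines, using the same 32MB threshold as the mm/slab.c code quoted above, before the boot caches compute their orders:

    /* Hypothetical sketch, not from the patch set.  Would need to run
     * early, e.g. from kmem_cache_init(), before any cache picks its
     * order; slub_max_order is the existing mm/slub.c tunable. */
    static void __init slub_clamp_order(void)
    {
            if (num_physpages <= (32 << 20) >> PAGE_SHIFT)
                    slub_max_order = 0;
    }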