* Re: [13/17] Virtual compound page freeing in interrupt context
2007-09-19 3:36 ` [13/17] Virtual compound page freeing in " Christoph Lameter
@ 2007-09-18 20:36 ` Nick Piggin
2007-09-20 17:50 ` Christoph Lameter
0 siblings, 1 reply; 110+ messages in thread
From: Nick Piggin @ 2007-09-18 20:36 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Wednesday 19 September 2007 13:36, Christoph Lameter wrote:
> If we are in an interrupt context then simply defer the free via a
> workqueue.
>
> In an interrupt context it is not possible to use vmalloc_address() to
> determine the vmalloc address. So add a variant that does that too.
>
> Removing a virtual mapping *must* be done with interrupts enabled
> since tlb_xx functions are called that rely on interrupts for
> processor to processor communications.
These things will clash drastically with my lazy TLB flushing and scalability
work. Actually the lazy TLB flushing will provide a natural way to defer
unmapping at interrupt time, and the scalability work should make it
easier to vmap from interrupt context too, if you really need that.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> mm/page_alloc.c | 23 ++++++++++++++++++++++-
> 1 file changed, 22 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c 2007-09-18 20:10:55.000000000 -0700
> +++ linux-2.6/mm/page_alloc.c 2007-09-18 20:11:40.000000000 -0700
> @@ -1297,7 +1297,12 @@ abort:
> return NULL;
> }
>
> -static void vcompound_free(void *addr)
> +/*
> + * Virtual Compound freeing functions. This is complicated by the vmalloc
> + * layer not being able to free virtual allocations when interrupts are
> + * disabled. So we defer the frees via a workqueue if necessary.
> + */
> +static void __vcompound_free(void *addr)
> {
> struct page **pages = vunmap(addr);
> int i;
> @@ -1320,6 +1325,22 @@ static void vcompound_free(void *addr)
> kfree(pages);
> }
>
> +static void vcompound_free_work(struct work_struct *w)
> +{
> + __vcompound_free((void *)w);
> +}
> +
> +static void vcompound_free(void *addr)
> +{
> + if (in_interrupt()) {
> + struct work_struct *w = addr;
> +
> + INIT_WORK(w, vcompound_free_work);
> + schedule_work(w);
> + } else
> + __vcompound_free(addr);
> +}
> +
> /*
> * This is the 'heart' of the zoned buddy allocator.
> */
* [00/17] [RFC] Virtual Compound Page Support
@ 2007-09-19 3:36 Christoph Lameter
2007-09-19 3:36 ` [01/17] Vmalloc: Move vmalloc_to_page to mm/vmalloc Christoph Lameter
` (18 more replies)
0 siblings, 19 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
Currently there is a strong tendency to avoid larger page allocations in
the kernel because of past fragmentation issues, and because the current
defragmentation methods are still evolving. It is not clear to what extent
they can provide reliable allocations for higher order pages (and the
definition of "reliable" seems to be in the eye of the beholder).
Currently we use vmalloc allocations in many locations to provide a safe
way to allocate larger arrays. That is due to the danger of higher order
allocations failing. Virtual Compound pages allow the use of regular
page allocator allocations that will fall back only if there is an actual
problem with acquiring a higher order page.
This patch set provides a way for a higher order page allocation to fall back.
Instead of a physically contiguous page a virtually contiguous page
is provided. The functionality of the vmalloc layer is used to provide
the necessary page tables and control structures to establish a virtually
contiguous area.
Advantages:
- If higher order allocations are failing then virtual compound pages
consisting of a series of order-0 pages can stand in for those
allocations.
- "Reliability" as long as the vmalloc layer can provide virtual mappings.
- Ability to reduce the use of the vmalloc layer significantly by using
physically contiguous memory instead of virtually contiguous memory.
Most uses of vmalloc() can be converted to page allocator calls.
- The use of physically contiguous memory instead of vmalloc may allow the
use of larger TLB entries, thus reducing TLB pressure. It also reduces the
need for page table walks.
Disadvantages:
- In order to use the fallback, the logic accessing the memory must be
aware that the memory could be backed by a virtual mapping and take
precautions: virt_to_page() and page_address() may not work, and
vmalloc_to_page() and vmalloc_address() (introduced through this
patch set) may have to be called instead (see the sketch after this list).
- Virtual mappings are less efficient than physical mappings.
Performance will drop once the virtual fallback occurs.
- Virtual mappings have more memory overhead. vm_area control structures,
page tables, page arrays etc. need to be allocated and managed to provide
virtual mappings.
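As an illustration of that first point (a sketch, not part of the patches;
my_addr_to_page is a made-up name), a caller that may receive either kind of
memory can resolve the backing page with the is_vmalloc_addr() and
vmalloc_to_page() helpers introduced in this series:

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Hypothetical helper: find the page backing an address that may
 * come from the virtual fallback path.
 */
static struct page *my_addr_to_page(const void *addr)
{
	if (is_vmalloc_addr(addr))
		return vmalloc_to_page(addr);	/* virtually mapped fallback */
	return virt_to_page(addr);		/* physically contiguous case */
}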
The patchset provides this functionality in stages. Stage 1 introduces
the basic fallback mechanism necessary to replace vmalloc allocations
with
alloc_page(GFP_VFALLBACK, order, ....)
which signifies to the page allocator that a higher order allocation is
wanted, but that a virtual mapping may stand in if there is an issue with
fragmentation.
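As a caller-side sketch (not taken from the patches; my_big_buffer_alloc and
my_big_buffer_free are made-up names), the address-based interfaces hide the
fallback entirely once the later patches in the series teach
__get_free_pages() and free_pages() about virtually mapped compound pages:

#include <linux/gfp.h>

/* Sketch: allocate a 2^order block that may silently fall back to a
 * virtual mapping; the caller only sees a kernel virtual address.
 */
static void *my_big_buffer_alloc(unsigned int order)
{
	return (void *)__get_free_pages(GFP_VFALLBACK | __GFP_ZERO, order);
}

static void my_big_buffer_free(void *p, unsigned int order)
{
	/* free_pages() detects vmalloc addresses and tears the mapping down. */
	free_pages((unsigned long)p, order);
}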
Stage 1 functionality does not allow allocation and freeing of virtual
mappings from interrupt contexts.
The stage 1 series ends with the conversion of a few key uses of vmalloc
in the VM to alloc_pages() for the allocation of sparsemem's memmap table
and the wait table in each zone. Other uses of vmalloc could be converted
in the same way.
Stage 2 functionality enhances the fallback further, allowing allocations
and frees in interrupt context.
SLUB is then modified to use the virtual mappings for slab caches
that are marked with SLAB_VFALLBACK. If a slab cache is marked this way
then we drop all the restrictions regarding page order and allocate
large memory areas that fit lots of objects, so that we rarely
have to use the slow paths.
Two slab caches, the dentry cache and the buffer_heads, are then flagged
that way. Others could be converted in the same way.
The patch set also provides a debugging aid through setting
CONFIG_VFALLBACK_ALWAYS
If set, all GFP_VFALLBACK allocations fall back to virtual
mappings. This is useful for verification tests. This patch set was
tested by enabling that option and compiling a kernel.
Stage 3 functionality could be adding support for the large
buffer size patchset. This is not done yet, and it is not clear whether it
would be useful to do.
Much of this patchset may only be needed for special cases in which the
existing defragmentation methods fail for some reason. It may be better to
have the system operate without such a safety net and make sure that the
page allocator can return large orders in a reliable way.
The initial idea for this patchset came from Nick Piggin's fsblock
and from his arguments about reliability and guarantees. Since his
fsblock uses the virtual mappings I think it is legitimate to
generalize the use of virtual mappings to support higher order
allocations in this way. The application of these ideas to the large
block size patchset etc. is straightforward. If wanted, I can base
the next rev of the largebuffer patchset on this one and implement
the fallback.
Contrary to Nick, I still doubt that any of this provides a "guarantee".
Having said that, I have to deal with various failure scenarios in the VM
daily, and I'd certainly like to see it work in a more reliable manner.
IMHO getting rid of the various workarounds to deal with the small 4k
pages and avoiding additional layers that group these pages in subsystem
specific ways is something that can simplify the kernel and make it
more reliable overall.
If people feel that a virtual fallback is needed then so be it. Maybe
we can shed our security blanket later when the approaches to deal
with fragmentation have matured.
The patch set is also available from the largeblock git tree via
git pull
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git
vcompound
--
* [01/17] Vmalloc: Move vmalloc_to_page to mm/vmalloc.
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [02/17] Vmalloc: add const Christoph Lameter
` (17 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_move_vmalloc_to_page --]
[-- Type: text/plain, Size: 4474 bytes --]
We already have page table manipulation for vmalloc in vmalloc.c. Move the
vmalloc_to_page() function there as well. Also move the related definitions
from include/linux/mm.h.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm.h | 2 --
include/linux/vmalloc.h | 4 ++++
mm/memory.c | 40 ----------------------------------------
mm/vmalloc.c | 38 ++++++++++++++++++++++++++++++++++++++
4 files changed, 42 insertions(+), 42 deletions(-)
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2007-09-18 18:33:56.000000000 -0700
+++ linux-2.6/mm/memory.c 2007-09-18 18:34:06.000000000 -0700
@@ -2727,46 +2727,6 @@ int make_pages_present(unsigned long add
return ret == len ? 0 : -1;
}
-/*
- * Map a vmalloc()-space virtual address to the physical page.
- */
-struct page * vmalloc_to_page(void * vmalloc_addr)
-{
- unsigned long addr = (unsigned long) vmalloc_addr;
- struct page *page = NULL;
- pgd_t *pgd = pgd_offset_k(addr);
- pud_t *pud;
- pmd_t *pmd;
- pte_t *ptep, pte;
-
- if (!pgd_none(*pgd)) {
- pud = pud_offset(pgd, addr);
- if (!pud_none(*pud)) {
- pmd = pmd_offset(pud, addr);
- if (!pmd_none(*pmd)) {
- ptep = pte_offset_map(pmd, addr);
- pte = *ptep;
- if (pte_present(pte))
- page = pte_page(pte);
- pte_unmap(ptep);
- }
- }
- }
- return page;
-}
-
-EXPORT_SYMBOL(vmalloc_to_page);
-
-/*
- * Map a vmalloc()-space virtual address to the physical page frame number.
- */
-unsigned long vmalloc_to_pfn(void * vmalloc_addr)
-{
- return page_to_pfn(vmalloc_to_page(vmalloc_addr));
-}
-
-EXPORT_SYMBOL(vmalloc_to_pfn);
-
#if !defined(__HAVE_ARCH_GATE_AREA)
#if defined(AT_SYSINFO_EHDR)
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 18:33:56.000000000 -0700
+++ linux-2.6/mm/vmalloc.c 2007-09-18 18:34:06.000000000 -0700
@@ -166,6 +166,44 @@ int map_vm_area(struct vm_struct *area,
}
EXPORT_SYMBOL_GPL(map_vm_area);
+/*
+ * Map a vmalloc()-space virtual address to the physical page.
+ */
+struct page *vmalloc_to_page(void *vmalloc_addr)
+{
+ unsigned long addr = (unsigned long) vmalloc_addr;
+ struct page *page = NULL;
+ pgd_t *pgd = pgd_offset_k(addr);
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep, pte;
+
+ if (!pgd_none(*pgd)) {
+ pud = pud_offset(pgd, addr);
+ if (!pud_none(*pud)) {
+ pmd = pmd_offset(pud, addr);
+ if (!pmd_none(*pmd)) {
+ ptep = pte_offset_map(pmd, addr);
+ pte = *ptep;
+ if (pte_present(pte))
+ page = pte_page(pte);
+ pte_unmap(ptep);
+ }
+ }
+ }
+ return page;
+}
+EXPORT_SYMBOL(vmalloc_to_page);
+
+/*
+ * Map a vmalloc()-space virtual address to the physical page frame number.
+ */
+unsigned long vmalloc_to_pfn(void *vmalloc_addr)
+{
+ return page_to_pfn(vmalloc_to_page(vmalloc_addr));
+}
+EXPORT_SYMBOL(vmalloc_to_pfn);
+
static struct vm_struct *__get_vm_area_node(unsigned long size, unsigned long flags,
unsigned long start, unsigned long end,
int node, gfp_t gfp_mask)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2007-09-18 18:33:56.000000000 -0700
+++ linux-2.6/include/linux/mm.h 2007-09-18 18:34:06.000000000 -0700
@@ -1160,8 +1160,6 @@ static inline unsigned long vma_pages(st
pgprot_t vm_get_page_prot(unsigned long vm_flags);
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
-struct page *vmalloc_to_page(void *addr);
-unsigned long vmalloc_to_pfn(void *addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
Index: linux-2.6/include/linux/vmalloc.h
===================================================================
--- linux-2.6.orig/include/linux/vmalloc.h 2007-09-18 18:33:57.000000000 -0700
+++ linux-2.6/include/linux/vmalloc.h 2007-09-18 18:34:24.000000000 -0700
@@ -81,6 +81,10 @@ extern void unmap_kernel_range(unsigned
extern struct vm_struct *alloc_vm_area(size_t size);
extern void free_vm_area(struct vm_struct *area);
+/* Determine page struct from address */
+struct page *vmalloc_to_page(void *addr);
+unsigned long vmalloc_to_pfn(void *addr);
+
/*
* Internals. Dont't use..
*/
--
* [02/17] Vmalloc: add const
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
2007-09-19 3:36 ` [01/17] Vmalloc: Move vmalloc_to_page to mm/vmalloc Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries Christoph Lameter
` (16 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_vmalloc_const --]
[-- Type: text/plain, Size: 4207 bytes --]
Make vmalloc functions work the same way as kfree() and friends that
take a const void * argument.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/vmalloc.h | 10 +++++-----
mm/vmalloc.c | 16 ++++++++--------
2 files changed, 13 insertions(+), 13 deletions(-)
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 18:34:06.000000000 -0700
+++ linux-2.6/mm/vmalloc.c 2007-09-18 18:34:33.000000000 -0700
@@ -169,7 +169,7 @@ EXPORT_SYMBOL_GPL(map_vm_area);
/*
* Map a vmalloc()-space virtual address to the physical page.
*/
-struct page *vmalloc_to_page(void *vmalloc_addr)
+struct page *vmalloc_to_page(const void *vmalloc_addr)
{
unsigned long addr = (unsigned long) vmalloc_addr;
struct page *page = NULL;
@@ -198,7 +198,7 @@ EXPORT_SYMBOL(vmalloc_to_page);
/*
* Map a vmalloc()-space virtual address to the physical page frame number.
*/
-unsigned long vmalloc_to_pfn(void *vmalloc_addr)
+unsigned long vmalloc_to_pfn(const void *vmalloc_addr)
{
return page_to_pfn(vmalloc_to_page(vmalloc_addr));
}
@@ -305,7 +305,7 @@ struct vm_struct *get_vm_area_node(unsig
}
/* Caller must hold vmlist_lock */
-static struct vm_struct *__find_vm_area(void *addr)
+static struct vm_struct *__find_vm_area(const void *addr)
{
struct vm_struct *tmp;
@@ -318,7 +318,7 @@ static struct vm_struct *__find_vm_area(
}
/* Caller must hold vmlist_lock */
-static struct vm_struct *__remove_vm_area(void *addr)
+static struct vm_struct *__remove_vm_area(const void *addr)
{
struct vm_struct **p, *tmp;
@@ -347,7 +347,7 @@ found:
* This function returns the found VM area, but using it is NOT safe
* on SMP machines, except for its size or flags.
*/
-struct vm_struct *remove_vm_area(void *addr)
+struct vm_struct *remove_vm_area(const void *addr)
{
struct vm_struct *v;
write_lock(&vmlist_lock);
@@ -356,7 +356,7 @@ struct vm_struct *remove_vm_area(void *a
return v;
}
-static void __vunmap(void *addr, int deallocate_pages)
+static void __vunmap(const void *addr, int deallocate_pages)
{
struct vm_struct *area;
@@ -407,7 +407,7 @@ static void __vunmap(void *addr, int dea
*
* Must not be called in interrupt context.
*/
-void vfree(void *addr)
+void vfree(const void *addr)
{
BUG_ON(in_interrupt());
__vunmap(addr, 1);
@@ -423,7 +423,7 @@ EXPORT_SYMBOL(vfree);
*
* Must not be called in interrupt context.
*/
-void vunmap(void *addr)
+void vunmap(const void *addr)
{
BUG_ON(in_interrupt());
__vunmap(addr, 0);
Index: linux-2.6/include/linux/vmalloc.h
===================================================================
--- linux-2.6.orig/include/linux/vmalloc.h 2007-09-18 18:34:24.000000000 -0700
+++ linux-2.6/include/linux/vmalloc.h 2007-09-18 18:35:03.000000000 -0700
@@ -45,11 +45,11 @@ extern void *vmalloc_32_user(unsigned lo
extern void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot);
extern void *__vmalloc_area(struct vm_struct *area, gfp_t gfp_mask,
pgprot_t prot);
-extern void vfree(void *addr);
+extern void vfree(const void *addr);
extern void *vmap(struct page **pages, unsigned int count,
unsigned long flags, pgprot_t prot);
-extern void vunmap(void *addr);
+extern void vunmap(const void *addr);
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
@@ -71,7 +71,7 @@ extern struct vm_struct *__get_vm_area(u
extern struct vm_struct *get_vm_area_node(unsigned long size,
unsigned long flags, int node,
gfp_t gfp_mask);
-extern struct vm_struct *remove_vm_area(void *addr);
+extern struct vm_struct *remove_vm_area(const void *addr);
extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
struct page ***pages);
@@ -82,8 +82,8 @@ extern struct vm_struct *alloc_vm_area(s
extern void free_vm_area(struct vm_struct *area);
/* Determine page struct from address */
-struct page *vmalloc_to_page(void *addr);
-unsigned long vmalloc_to_pfn(void *addr);
+struct page *vmalloc_to_page(const void *addr);
+unsigned long vmalloc_to_pfn(const void *addr);
/*
* Internals. Dont't use..
--
* [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
2007-09-19 3:36 ` [01/17] Vmalloc: Move vmalloc_to_page to mm/vmalloc Christoph Lameter
2007-09-19 3:36 ` [02/17] Vmalloc: add const Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 6:32 ` David Rientjes
2007-09-19 3:36 ` [04/17] vmalloc: clean up page array indexing Christoph Lameter
` (15 subsequent siblings)
18 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_is_vmalloc_addr --]
[-- Type: text/plain, Size: 4724 bytes --]
This test is used in a couple of places. Add a version to include/linux/mm.h
and replace the open-coded checks.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
drivers/net/cxgb3/cxgb3_offload.c | 4 +---
fs/ntfs/malloc.h | 3 +--
fs/proc/kcore.c | 2 +-
fs/xfs/linux-2.6/kmem.c | 3 +--
fs/xfs/linux-2.6/xfs_buf.c | 3 +--
include/linux/mm.h | 8 ++++++++
mm/sparse.c | 10 +---------
7 files changed, 14 insertions(+), 19 deletions(-)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2007-09-17 21:46:06.000000000 -0700
+++ linux-2.6/include/linux/mm.h 2007-09-17 23:56:54.000000000 -0700
@@ -1158,6 +1158,14 @@ static inline unsigned long vma_pages(st
return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
}
+/* Determine if an address is within the vmalloc range */
+static inline int is_vmalloc_addr(const void *x)
+{
+ unsigned long addr = (unsigned long)x;
+
+ return addr >= VMALLOC_START && addr < VMALLOC_END;
+}
+
pgprot_t vm_get_page_prot(unsigned long vm_flags);
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
Index: linux-2.6/mm/sparse.c
===================================================================
--- linux-2.6.orig/mm/sparse.c 2007-09-17 21:45:24.000000000 -0700
+++ linux-2.6/mm/sparse.c 2007-09-17 23:56:26.000000000 -0700
@@ -289,17 +289,9 @@ got_map_ptr:
return ret;
}
-static int vaddr_in_vmalloc_area(void *addr)
-{
- if (addr >= (void *)VMALLOC_START &&
- addr < (void *)VMALLOC_END)
- return 1;
- return 0;
-}
-
static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
{
- if (vaddr_in_vmalloc_area(memmap))
+ if (is_vmalloc_addr(memmap))
vfree(memmap);
else
free_pages((unsigned long)memmap,
Index: linux-2.6/drivers/net/cxgb3/cxgb3_offload.c
===================================================================
--- linux-2.6.orig/drivers/net/cxgb3/cxgb3_offload.c 2007-09-17 21:45:24.000000000 -0700
+++ linux-2.6/drivers/net/cxgb3/cxgb3_offload.c 2007-09-17 21:46:06.000000000 -0700
@@ -1035,9 +1035,7 @@ void *cxgb_alloc_mem(unsigned long size)
*/
void cxgb_free_mem(void *addr)
{
- unsigned long p = (unsigned long)addr;
-
- if (p >= VMALLOC_START && p < VMALLOC_END)
+ if (is_vmalloc_addr(addr))
vfree(addr);
else
kfree(addr);
Index: linux-2.6/fs/ntfs/malloc.h
===================================================================
--- linux-2.6.orig/fs/ntfs/malloc.h 2007-09-17 21:45:24.000000000 -0700
+++ linux-2.6/fs/ntfs/malloc.h 2007-09-17 21:46:06.000000000 -0700
@@ -85,8 +85,7 @@ static inline void *ntfs_malloc_nofs_nof
static inline void ntfs_free(void *addr)
{
- if (likely(((unsigned long)addr < VMALLOC_START) ||
- ((unsigned long)addr >= VMALLOC_END ))) {
+ if (!is_vmalloc_addr(addr)) {
kfree(addr);
/* free_page((unsigned long)addr); */
return;
Index: linux-2.6/fs/proc/kcore.c
===================================================================
--- linux-2.6.orig/fs/proc/kcore.c 2007-09-17 21:45:24.000000000 -0700
+++ linux-2.6/fs/proc/kcore.c 2007-09-17 21:46:06.000000000 -0700
@@ -325,7 +325,7 @@ read_kcore(struct file *file, char __use
if (m == NULL) {
if (clear_user(buffer, tsz))
return -EFAULT;
- } else if ((start >= VMALLOC_START) && (start < VMALLOC_END)) {
+ } else if (is_vmalloc_addr((void *)start)) {
char * elf_buf;
struct vm_struct *m;
unsigned long curstart = start;
Index: linux-2.6/fs/xfs/linux-2.6/kmem.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/kmem.c 2007-09-17 21:45:24.000000000 -0700
+++ linux-2.6/fs/xfs/linux-2.6/kmem.c 2007-09-17 21:46:06.000000000 -0700
@@ -92,8 +92,7 @@ kmem_zalloc_greedy(size_t *size, size_t
void
kmem_free(void *ptr, size_t size)
{
- if (((unsigned long)ptr < VMALLOC_START) ||
- ((unsigned long)ptr >= VMALLOC_END)) {
+ if (!is_vmalloc_addr(ptr)) {
kfree(ptr);
} else {
vfree(ptr);
Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-09-17 21:45:24.000000000 -0700
+++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c 2007-09-17 21:46:06.000000000 -0700
@@ -696,8 +696,7 @@ static inline struct page *
mem_to_page(
void *addr)
{
- if (((unsigned long)addr < VMALLOC_START) ||
- ((unsigned long)addr >= VMALLOC_END)) {
+ if ((!is_vmalloc_addr(addr))) {
return virt_to_page(addr);
} else {
return vmalloc_to_page(addr);
--
* [04/17] vmalloc: clean up page array indexing
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (2 preceding siblings ...)
2007-09-19 3:36 ` [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [05/17] vunmap: return page array Christoph Lameter
` (14 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_array_indexes --]
[-- Type: text/plain, Size: 1425 bytes --]
The page array is repeatedly indexed both in __vunmap() and
__vmalloc_area_node(). Add a temporary variable to make it easier to read
(and easier to patch later).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/vmalloc.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 13:22:16.000000000 -0700
+++ linux-2.6/mm/vmalloc.c 2007-09-18 13:22:17.000000000 -0700
@@ -383,8 +383,10 @@ static void __vunmap(const void *addr, i
int i;
for (i = 0; i < area->nr_pages; i++) {
- BUG_ON(!area->pages[i]);
- __free_page(area->pages[i]);
+ struct page *page = area->pages[i];
+
+ BUG_ON(!page);
+ __free_page(page);
}
if (area->flags & VM_VPAGES)
@@ -488,15 +490,19 @@ void *__vmalloc_area_node(struct vm_stru
}
for (i = 0; i < area->nr_pages; i++) {
+ struct page *page;
+
if (node < 0)
- area->pages[i] = alloc_page(gfp_mask);
+ page = alloc_page(gfp_mask);
else
- area->pages[i] = alloc_pages_node(node, gfp_mask, 0);
- if (unlikely(!area->pages[i])) {
+ page = alloc_pages_node(node, gfp_mask, 0);
+
+ if (unlikely(!page)) {
/* Successfully allocated i pages, free them in __vunmap() */
area->nr_pages = i;
goto fail;
}
+ area->pages[i] = page;
}
if (map_vm_area(area, prot, &pages))
--
* [05/17] vunmap: return page array
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (3 preceding siblings ...)
2007-09-19 3:36 ` [04/17] vmalloc: clean up page array indexing Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 8:05 ` KAMEZAWA Hiroyuki
2007-09-19 3:36 ` [06/17] vmalloc_address(): Determine vmalloc address from page struct Christoph Lameter
` (13 subsequent siblings)
18 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_vunmap_returns_pages --]
[-- Type: text/plain, Size: 3213 bytes --]
Make vunmap() return the page array that was used at vmap() time. This is
useful if one has no structure to track the page array but simply stores the
virtual address somewhere. The disposition of the page array can be
decided upon after vunmap(). vfree() may now also be used instead of
vunmap(); it will release the pages and the page array after unmapping.
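A usage sketch (not part of the patch; my_release_mapping is a made-up name,
and it assumes the page array was kmalloc'ed and NULL-terminated, as the
virtual compound code later in this series does): teardown when only the
mapped address was kept around.

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Sketch: undo a vmap() when only the virtual address was stored.
 * Must not be called from interrupt context.
 */
static void my_release_mapping(void *addr)
{
	struct page **pages = vunmap(addr);	/* array passed to vmap() */
	int i;

	if (!pages)
		return;
	for (i = 0; pages[i]; i++)		/* NULL-terminated array */
		__free_page(pages[i]);
	kfree(pages);				/* array came from kmalloc() */
}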
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/vmalloc.h | 2 +-
mm/vmalloc.c | 26 ++++++++++++++++----------
2 files changed, 17 insertions(+), 11 deletions(-)
Index: linux-2.6/include/linux/vmalloc.h
===================================================================
--- linux-2.6.orig/include/linux/vmalloc.h 2007-09-18 13:22:56.000000000 -0700
+++ linux-2.6/include/linux/vmalloc.h 2007-09-18 13:22:57.000000000 -0700
@@ -49,7 +49,7 @@ extern void vfree(const void *addr);
extern void *vmap(struct page **pages, unsigned int count,
unsigned long flags, pgprot_t prot);
-extern void vunmap(const void *addr);
+extern struct page **vunmap(const void *addr);
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 13:22:56.000000000 -0700
+++ linux-2.6/mm/vmalloc.c 2007-09-18 13:22:57.000000000 -0700
@@ -356,17 +356,18 @@ struct vm_struct *remove_vm_area(const v
return v;
}
-static void __vunmap(const void *addr, int deallocate_pages)
+static struct page **__vunmap(const void *addr, int deallocate_pages)
{
struct vm_struct *area;
+ struct page **pages;
if (!addr)
- return;
+ return NULL;
if ((PAGE_SIZE-1) & (unsigned long)addr) {
printk(KERN_ERR "Trying to vfree() bad address (%p)\n", addr);
WARN_ON(1);
- return;
+ return NULL;
}
area = remove_vm_area(addr);
@@ -374,29 +375,30 @@ static void __vunmap(const void *addr, i
printk(KERN_ERR "Trying to vfree() nonexistent vm area (%p)\n",
addr);
WARN_ON(1);
- return;
+ return NULL;
}
+ pages = area->pages;
debug_check_no_locks_freed(addr, area->size);
if (deallocate_pages) {
int i;
for (i = 0; i < area->nr_pages; i++) {
- struct page *page = area->pages[i];
+ struct page *page = pages[i];
BUG_ON(!page);
__free_page(page);
}
if (area->flags & VM_VPAGES)
- vfree(area->pages);
+ vfree(pages);
else
- kfree(area->pages);
+ kfree(pages);
}
kfree(area);
- return;
+ return pages;
}
/**
@@ -424,11 +426,13 @@ EXPORT_SYMBOL(vfree);
* which was created from the page array passed to vmap().
*
* Must not be called in interrupt context.
+ *
+ * Returns a pointer to the array of pointers to page structs
*/
-void vunmap(const void *addr)
+struct page **vunmap(const void *addr)
{
BUG_ON(in_interrupt());
- __vunmap(addr, 0);
+ return __vunmap(addr, 0);
}
EXPORT_SYMBOL(vunmap);
@@ -453,6 +457,8 @@ void *vmap(struct page **pages, unsigned
area = get_vm_area((count << PAGE_SHIFT), flags);
if (!area)
return NULL;
+ area->pages = pages;
+ area->nr_pages = count;
if (map_vm_area(area, prot, &pages)) {
vunmap(area->addr);
return NULL;
--
* [06/17] vmalloc_address(): Determine vmalloc address from page struct
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (4 preceding siblings ...)
2007-09-19 3:36 ` [05/17] vunmap: return page array Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [07/17] GFP_VFALLBACK: Allow fallback of compound pages to virtual mappings Christoph Lameter
` (12 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_vmalloc_address --]
[-- Type: text/plain, Size: 3358 bytes --]
Sometimes we need to figure out which vmalloc address is in use
for a certain page struct, and there is no easy way to derive
that address from the page struct. So simply search through
the kernel page tables to find the address. This is a fairly expensive
process. Use sparingly (or provide a better implementation).
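For illustration only (not from the patch; page_start_of is a made-up name):
vmalloc_address() is the inverse of vmalloc_to_page(), so a round trip
through both yields the page-aligned start of the mapping covering the
given address.

#include <linux/vmalloc.h>

/* Sketch: recover the mapped address of the page backing a vmalloc address. */
static void *page_start_of(const void *vmalloc_addr)
{
	struct page *page = vmalloc_to_page(vmalloc_addr);

	/* vmalloc_address() walks the kernel page tables (expensive). */
	return page ? vmalloc_address(page) : NULL;
}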
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/vmalloc.h | 3 +
mm/vmalloc.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 80 insertions(+)
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 18:35:13.000000000 -0700
+++ linux-2.6/mm/vmalloc.c 2007-09-18 18:35:18.000000000 -0700
@@ -196,6 +196,83 @@ struct page *vmalloc_to_page(const void
EXPORT_SYMBOL(vmalloc_to_page);
/*
+ * Determine vmalloc address from a page struct.
+ *
+ * Linear search through all ptes of the vmalloc area.
+ */
+static unsigned long vaddr_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, unsigned long pfn)
+{
+ pte_t *pte;
+
+ pte = pte_offset_kernel(pmd, addr);
+ do {
+ pte_t ptent = *pte;
+ if (pte_present(ptent) && pte_pfn(ptent) == pfn)
+ return addr;
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+ return 0;
+}
+
+static inline unsigned long vaddr_pmd_range(pud_t *pud, unsigned long addr,
+ unsigned long end, unsigned long pfn)
+{
+ pmd_t *pmd;
+ unsigned long next;
+ unsigned long n;
+
+ pmd = pmd_offset(pud, addr);
+ do {
+ next = pmd_addr_end(addr, end);
+ if (pmd_none_or_clear_bad(pmd))
+ continue;
+ n = vaddr_pte_range(pmd, addr, next, pfn);
+ if (n)
+ return n;
+ } while (pmd++, addr = next, addr != end);
+ return 0;
+}
+
+static inline unsigned long vaddr_pud_range(pgd_t *pgd, unsigned long addr,
+ unsigned long end, unsigned long pfn)
+{
+ pud_t *pud;
+ unsigned long next;
+ unsigned long n;
+
+ pud = pud_offset(pgd, addr);
+ do {
+ next = pud_addr_end(addr, end);
+ if (pud_none_or_clear_bad(pud))
+ continue;
+ n = vaddr_pmd_range(pud, addr, next, pfn);
+ if (n)
+ return n;
+ } while (pud++, addr = next, addr != end);
+ return 0;
+}
+
+void *vmalloc_address(struct page *page)
+{
+ pgd_t *pgd;
+ unsigned long next, n;
+ unsigned long addr = VMALLOC_START;
+ unsigned long pfn = page_to_pfn(page);
+
+ pgd = pgd_offset_k(VMALLOC_START);
+ do {
+ next = pgd_addr_end(addr, VMALLOC_END);
+ if (pgd_none_or_clear_bad(pgd))
+ continue;
+ n = vaddr_pud_range(pgd, addr, next, pfn);
+ if (n)
+ return (void *)n;
+ } while (pgd++, addr = next, addr < VMALLOC_END);
+ return NULL;
+}
+EXPORT_SYMBOL(vmalloc_address);
+
+/*
* Map a vmalloc()-space virtual address to the physical page frame number.
*/
unsigned long vmalloc_to_pfn(const void *vmalloc_addr)
Index: linux-2.6/include/linux/vmalloc.h
===================================================================
--- linux-2.6.orig/include/linux/vmalloc.h 2007-09-18 18:35:13.000000000 -0700
+++ linux-2.6/include/linux/vmalloc.h 2007-09-18 18:35:48.000000000 -0700
@@ -85,6 +85,9 @@ extern void free_vm_area(struct vm_struc
struct page *vmalloc_to_page(const void *addr);
unsigned long vmalloc_to_pfn(const void *addr);
+/* Determine address from page struct pointer */
+void *vmalloc_address(struct page *);
+
/*
* Internals. Dont't use..
*/
--
* [07/17] GFP_VFALLBACK: Allow fallback of compound pages to virtual mappings
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (5 preceding siblings ...)
2007-09-19 3:36 ` [06/17] vmalloc_address(): Determine vmalloc address from page struct Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [08/17] Pass vmalloc address in page->private Christoph Lameter
` (11 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_core --]
[-- Type: text/plain, Size: 7244 bytes --]
This adds a new gfp flag
__GFP_VFALLBACK
If specified during a higher order allocation then the system will fall
back to vmap and attempt to create a virtually contiguous area instead of
a physically contiguous area. In many cases the virtually contiguous area
can stand in for the physically contiguous area (with some loss of
performance).
The pages used for VFALLBACK are marked with a new flag
PageVcompound(page). The mark is necessary since we have to know upon
free whether we have to destroy a virtual mapping. No additional page flag
bit is consumed: PG_swapcache is used together with PG_compound
(similar to PageHead() and PageTail()).
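To illustrate how a consumer could use these conventions (a sketch, not part
of the patch; my_page_to_addr is an invented name, and vmalloc_address()
comes from the earlier patch in this series):

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Sketch: get the kernel address behind a possibly virtual compound page. */
static void *my_page_to_addr(struct page *page)
{
	page = compound_head(page);		/* page->first_page for tail pages */

	if (unlikely(PageVcompound(page)))
		return vmalloc_address(page);	/* expensive page table walk */
	return page_address(page);
}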
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/gfp.h | 5 +
include/linux/page-flags.h | 18 +++++++
mm/page_alloc.c | 113 ++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 130 insertions(+), 6 deletions(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2007-09-18 17:03:54.000000000 -0700
+++ linux-2.6/mm/page_alloc.c 2007-09-18 18:25:46.000000000 -0700
@@ -1230,6 +1230,86 @@ try_next_zone:
}
/*
+ * Virtual Compound Page support.
+ *
+ * Virtual Compound Pages are used to fall back to order 0 allocations if large
+ * linear mappings are not available and __GFP_VFALLBACK is set. They are
+ * formatted according to compound page conventions. I.e. following
+ * page->first_page if PageTail(page) is set can be used to determine the
+ * head page.
+ */
+struct page *vcompound_alloc(gfp_t gfp_mask, int order,
+ struct zonelist *zonelist, unsigned long alloc_flags)
+{
+ void *addr;
+ struct page *page;
+ int i;
+ int nr_pages = 1 << order;
+ struct page **pages = kzalloc((nr_pages + 1) * sizeof(struct page *),
+ gfp_mask & GFP_LEVEL_MASK);
+
+ if (!pages)
+ return NULL;
+
+ for (i = 0; i < nr_pages; i++) {
+ page = get_page_from_freelist(gfp_mask & ~__GFP_VFALLBACK,
+ 0, zonelist, alloc_flags);
+ if (!page)
+ goto abort;
+
+ /* Sets PageCompound which makes PageHead(page) true */
+ __SetPageVcompound(page);
+ if (i) {
+ page->first_page = pages[0];
+ __SetPageTail(page);
+ }
+ pages[i] = page;
+ }
+
+ addr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
+ if (!addr)
+ goto abort;
+
+ return pages[0];
+
+abort:
+ for (i = 0; i < nr_pages; i++) {
+ page = pages[i];
+ if (!page)
+ continue;
+ __ClearPageTail(page);
+ __ClearPageHead(page);
+ __ClearPageVcompound(page);
+ __free_page(page);
+ }
+ kfree(pages);
+ return NULL;
+}
+
+static void vcompound_free(void *addr)
+{
+ struct page **pages = vunmap(addr);
+ int i;
+
+ /*
+ * First page will have zero refcount since it maintains state
+ * for the compound and was decremented before we got here.
+ */
+ __ClearPageHead(pages[0]);
+ __ClearPageVcompound(pages[0]);
+ free_hot_page(pages[0]);
+
+ for (i = 1; pages[i]; i++) {
+ struct page *page = pages[i];
+
+ __ClearPageTail(page);
+ __ClearPageVcompound(page);
+ __free_page(page);
+ }
+ kfree(pages);
+}
+
+/*
* This is the 'heart' of the zoned buddy allocator.
*/
struct page * fastcall
@@ -1324,12 +1404,12 @@ nofail_alloc:
goto nofail_alloc;
}
}
- goto nopage;
+ goto try_vcompound;
}
/* Atomic allocations - we can't balance anything */
if (!wait)
- goto nopage;
+ goto try_vcompound;
cond_resched();
@@ -1360,6 +1440,11 @@ nofail_alloc:
*/
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+
+ if (!page && order && (gfp_mask & __GFP_VFALLBACK))
+ page = vcompound_alloc(gfp_mask, order,
+ zonelist, alloc_flags);
+
if (page)
goto got_pg;
@@ -1391,6 +1476,14 @@ nofail_alloc:
goto rebalance;
}
+try_vcompound:
+ /* Last chance before failing the allocation */
+ if (order && (gfp_mask & __GFP_VFALLBACK)) {
+ page = vcompound_alloc(gfp_mask, order,
+ zonelist, alloc_flags);
+ if (page)
+ goto got_pg;
+ }
nopage:
if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
printk(KERN_WARNING "%s: page allocation failure."
@@ -1450,8 +1543,12 @@ fastcall void __free_pages(struct page *
if (put_page_testzero(page)) {
if (order == 0)
free_hot_page(page);
- else
- __free_pages_ok(page, order);
+ else {
+ if (unlikely(PageVcompound(page)))
+ vcompound_free(vmalloc_address(page));
+ else
+ __free_pages_ok(page, order);
+ }
}
}
@@ -1460,8 +1557,12 @@ EXPORT_SYMBOL(__free_pages);
fastcall void free_pages(unsigned long addr, unsigned int order)
{
if (addr != 0) {
- VM_BUG_ON(!virt_addr_valid((void *)addr));
- __free_pages(virt_to_page((void *)addr), order);
+ if (unlikely(is_vmalloc_addr((void *)addr)))
+ vcompound_free((void *)addr);
+ else {
+ VM_BUG_ON(!virt_addr_valid((void *)addr));
+ __free_pages(virt_to_page((void *)addr), order);
+ }
}
}
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h 2007-09-18 17:03:54.000000000 -0700
+++ linux-2.6/include/linux/gfp.h 2007-09-18 17:03:58.000000000 -0700
@@ -43,6 +43,7 @@ struct vm_area_struct;
#define __GFP_REPEAT ((__force gfp_t)0x400u) /* Retry the allocation. Might fail */
#define __GFP_NOFAIL ((__force gfp_t)0x800u) /* Retry for ever. Cannot fail */
#define __GFP_NORETRY ((__force gfp_t)0x1000u)/* Do not retry. Might fail */
+#define __GFP_VFALLBACK ((__force gfp_t)0x2000u)/* Permit fallback to vmalloc */
#define __GFP_COMP ((__force gfp_t)0x4000u)/* Add compound page metadata */
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -86,6 +87,10 @@ struct vm_area_struct;
#define GFP_THISNODE ((__force gfp_t)0)
#endif
+/*
+ * Allocate large page but allow fallback to a virtually mapped page
+ */
+#define GFP_VFALLBACK (GFP_KERNEL | __GFP_VFALLBACK)
/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2007-09-18 17:03:54.000000000 -0700
+++ linux-2.6/include/linux/page-flags.h 2007-09-18 17:03:58.000000000 -0700
@@ -248,6 +248,24 @@ static inline void __ClearPageTail(struc
#define __SetPageHead(page) __SetPageCompound(page)
#define __ClearPageHead(page) __ClearPageCompound(page)
+/*
+ * PG_swapcache is used in combination with PG_compound to indicate
+ * that a compound page was allocated via vmalloc.
+ */
+#define PG_vcompound_mask ((1L << PG_compound) | (1L << PG_swapcache))
+#define PageVcompound(page) ((page->flags & PG_vcompound_mask) \
+ == PG_vcompound_mask)
+
+static inline void __SetPageVcompound(struct page *page)
+{
+ page->flags |= PG_vcompound_mask;
+}
+
+static inline void __ClearPageVcompound(struct page *page)
+{
+ page->flags &= ~PG_vcompound_mask;
+}
+
#ifdef CONFIG_SWAP
#define PageSwapCache(page) test_bit(PG_swapcache, &(page)->flags)
#define SetPageSwapCache(page) set_bit(PG_swapcache, &(page)->flags)
--
* [08/17] Pass vmalloc address in page->private
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (6 preceding siblings ...)
2007-09-19 3:36 ` [07/17] GFP_VFALLBACK: Allow fallback of compound pages to virtual mappings Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [09/17] VFALLBACK: Debugging aid Christoph Lameter
` (10 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_pass_addr_in_private --]
[-- Type: text/plain, Size: 1510 bytes --]
Avoid expensive lookups of virtual addresses from page structs by
storing the vmalloc address in page->private. We can then avoid
the vmalloc_address() call in __get_free_pages() and get_zeroed_page() and
simply return page->private.
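A sketch of the shortcut this enables (my_fast_page_addr is a hypothetical
helper, not part of the patch). Note that only the head page of the virtual
compound carries the address, so this complements the page table walk rather
than replacing it:

#include <linux/mm.h>

/* Sketch: fast address lookup; 'page' must be the head page, since
 * vcompound_alloc() stores the vmalloc address only in pages[0]->private.
 */
static void *my_fast_page_addr(struct page *page)
{
	if (unlikely(PageVcompound(page)))
		return (void *)page->private;	/* stored at allocation time */
	return page_address(page);
}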
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/page_alloc.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2007-09-18 18:35:55.000000000 -0700
+++ linux-2.6/mm/page_alloc.c 2007-09-18 18:36:01.000000000 -0700
@@ -1276,6 +1276,11 @@ struct page *vcompound_alloc(gfp_t gfp_m
if (!addr)
goto abort;
+ /*
+ * Give the caller a chance to avoid an expensive vmalloc_addr()
+ * call.
+ */
+ pages[0]->private = (unsigned long)addr;
return pages[0];
abort:
@@ -1534,6 +1539,8 @@ fastcall unsigned long __get_free_pages(
page = alloc_pages(gfp_mask, order);
if (!page)
return 0;
+ if (unlikely(PageVcompound(page)))
+ return page->private;
return (unsigned long) page_address(page);
}
@@ -1550,9 +1557,11 @@ fastcall unsigned long get_zeroed_page(g
VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0);
page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
- if (page)
- return (unsigned long) page_address(page);
- return 0;
+ if (!page)
+ return 0;
+ if (unlikely(PageVcompound(page)))
+ return page->private;
+ return (unsigned long) page_address(page);
}
EXPORT_SYMBOL(get_zeroed_page);
--
* [09/17] VFALLBACK: Debugging aid
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (7 preceding siblings ...)
2007-09-19 3:36 ` [08/17] Pass vmalloc address in page->private Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [10/17] Use GFP_VFALLBACK for sparsemem Christoph Lameter
` (9 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_debuging_aid --]
[-- Type: text/plain, Size: 2011 bytes --]
Virtual fallbacks are rare and thus subtle bugs may creep in if we do not
test the fallbacks. CONFIG_VFALLBACK_ALWAYS makes all GFP_VFALLBACK
allocations fall back to virtual mapping.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
lib/Kconfig.debug | 11 +++++++++++
mm/page_alloc.c | 9 +++++++++
2 files changed, 20 insertions(+)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2007-09-18 19:19:34.000000000 -0700
+++ linux-2.6/mm/page_alloc.c 2007-09-18 20:16:26.000000000 -0700
@@ -1205,7 +1205,16 @@ zonelist_scan:
goto this_zone_full;
}
}
+#ifdef CONFIG_VFALLBACK_ALWAYS
+ if ((gfp_mask & __GFP_VFALLBACK) &&
+ system_state == SYSTEM_RUNNING) {
+ struct page *vcompound_alloc(gfp_t, int,
+ struct zonelist *, unsigned long);
+ page = vcompound_alloc(gfp_mask, order,
+ zonelist, alloc_flags);
+ } else
+#endif
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
if (page)
break;
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug 2007-09-18 19:19:28.000000000 -0700
+++ linux-2.6/lib/Kconfig.debug 2007-09-18 19:19:34.000000000 -0700
@@ -105,6 +105,17 @@ config DETECT_SOFTLOCKUP
can be detected via the NMI-watchdog, on platforms that
support it.)
+config VFALLBACK_ALWAYS
+ bool "Always fall back to Virtual Compound pages"
+ default y
+ help
+ Virtual compound pages are only allocated if there is no linear
+ memory available. They are a fallback and errors created by the
+ use of virtual mappings instead of linear ones may not surface
+ because of their infrequent use. This option makes every
+ allocation that allows a fallback to a virtual mapping use
+ the virtual mapping. May have a significant performance impact.
+
config SCHED_DEBUG
bool "Collect scheduler debugging info"
depends on DEBUG_KERNEL && PROC_FS
--
* [10/17] Use GFP_VFALLBACK for sparsemem.
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (8 preceding siblings ...)
2007-09-19 3:36 ` [09/17] VFALLBACK: Debugging aid Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [11/17] GFP_VFALLBACK for zone wait table Christoph Lameter
` (8 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_sparse_gfp_vfallback --]
[-- Type: text/plain, Size: 1476 bytes --]
Sparsemem currently attempts first to do a physically contiguous mapping
and then falls back to vmalloc. The same thing can now be accomplished
using GFP_VFALLBACK.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/sparse.c | 23 +++--------------------
1 file changed, 3 insertions(+), 20 deletions(-)
Index: linux-2.6/mm/sparse.c
===================================================================
--- linux-2.6.orig/mm/sparse.c 2007-09-18 13:21:44.000000000 -0700
+++ linux-2.6/mm/sparse.c 2007-09-18 13:28:43.000000000 -0700
@@ -269,32 +269,15 @@ void __init sparse_init(void)
#ifdef CONFIG_MEMORY_HOTPLUG
static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
{
- struct page *page, *ret;
unsigned long memmap_size = sizeof(struct page) * nr_pages;
- page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
- if (page)
- goto got_map_page;
-
- ret = vmalloc(memmap_size);
- if (ret)
- goto got_map_ptr;
-
- return NULL;
-got_map_page:
- ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
-got_map_ptr:
- memset(ret, 0, memmap_size);
-
- return ret;
+ return (struct page *)__get_free_pages(GFP_VFALLBACK|__GFP_ZERO,
+ get_order(memmap_size));
}
static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
{
- if (is_vmalloc_addr(memmap))
- vfree(memmap);
- else
- free_pages((unsigned long)memmap,
+ free_pages((unsigned long)memmap,
get_order(sizeof(struct page) * nr_pages));
}
--
* [11/17] GFP_VFALLBACK for zone wait table.
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (9 preceding siblings ...)
2007-09-19 3:36 ` [10/17] Use GFP_VFALLBACK for sparsemem Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [12/17] Virtual Compound page allocation from interrupt context Christoph Lameter
` (7 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_wait_table_no_vmalloc --]
[-- Type: text/plain, Size: 1007 bytes --]
Currently we have to use vmalloc for the zone wait table, possibly generating
the need for many TLB entries to access the tables. We can now use
GFP_VFALLBACK to attempt the use of a physically contiguous page that can then
be covered by the large kernel TLB entries.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/page_alloc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2007-09-18 14:29:05.000000000 -0700
+++ linux-2.6/mm/page_alloc.c 2007-09-18 14:29:10.000000000 -0700
@@ -2572,7 +2572,9 @@ int zone_wait_table_init(struct zone *zo
* To use this new node's memory, further consideration will be
* necessary.
*/
- zone->wait_table = (wait_queue_head_t *)vmalloc(alloc_size);
+ zone->wait_table = (wait_queue_head_t *)
+ __get_free_pages(GFP_VFALLBACK,
+ get_order(alloc_size));
}
if (!zone->wait_table)
return -ENOMEM;
--
* [12/17] Virtual Compound page allocation from interrupt context.
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (10 preceding siblings ...)
2007-09-19 3:36 ` [11/17] GFP_VFALLBACK for zone wait table Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [13/17] Virtual compound page freeing in " Christoph Lameter
` (6 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_interrupt_alloc --]
[-- Type: text/plain, Size: 1332 bytes --]
In an interrupt context we cannot wait for the vmlist_lock in
__get_vm_area_node(). So use a trylock instead. If the trylock fails
then the atomic allocation will fail and subsequently be retried.
This only works because flush_cache_vmap(), used on the allocation
path, never performs any IPIs, in contrast to the flush_tlb_...
functions used for freeing.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/vmalloc.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 10:52:11.000000000 -0700
+++ linux-2.6/mm/vmalloc.c 2007-09-18 10:54:21.000000000 -0700
@@ -289,7 +289,6 @@ static struct vm_struct *__get_vm_area_n
unsigned long align = 1;
unsigned long addr;
- BUG_ON(in_interrupt());
if (flags & VM_IOREMAP) {
int bit = fls(size);
@@ -314,7 +313,14 @@ static struct vm_struct *__get_vm_area_n
*/
size += PAGE_SIZE;
- write_lock(&vmlist_lock);
+ if (gfp_mask & __GFP_WAIT)
+ write_lock(&vmlist_lock);
+ else {
+ if (!write_trylock(&vmlist_lock)) {
+ kfree(area);
+ return NULL;
+ }
+ }
for (p = &vmlist; (tmp = *p) != NULL ;p = &tmp->next) {
if ((unsigned long)tmp->addr < addr) {
if((unsigned long)tmp->addr + tmp->size >= addr)
--
* [13/17] Virtual compound page freeing in interrupt context
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (11 preceding siblings ...)
2007-09-19 3:36 ` [12/17] Virtual Compound page allocation from interrupt context Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-18 20:36 ` Nick Piggin
2007-09-19 3:36 ` [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area Christoph Lameter
` (5 subsequent siblings)
18 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_interrupt_free --]
[-- Type: text/plain, Size: 1631 bytes --]
If we are in an interrupt context then simply defer the free via a workqueue.
In an interrupt context it is not possible to use vmalloc_address() to determine
the vmalloc address. So add a variant that does that too.
Removing a virtual mapping *must* be done with interrupts enabled
since tlb_xx functions are called that rely on interrupts for
processor to processor communications.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/page_alloc.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2007-09-18 20:10:55.000000000 -0700
+++ linux-2.6/mm/page_alloc.c 2007-09-18 20:11:40.000000000 -0700
@@ -1297,7 +1297,12 @@ abort:
return NULL;
}
-static void vcompound_free(void *addr)
+/*
+ * Virtual Compound freeing functions. This is complicated by the vmalloc
+ * layer not being able to free virtual allocations when interrupts are
+ * disabled. So we defer the frees via a workqueue if necessary.
+ */
+static void __vcompound_free(void *addr)
{
struct page **pages = vunmap(addr);
int i;
@@ -1320,6 +1325,22 @@ static void vcompound_free(void *addr)
kfree(pages);
}
+static void vcompound_free_work(struct work_struct *w)
+{
+ __vcompound_free((void *)w);
+}
+
+static void vcompound_free(void *addr)
+{
+ if (in_interrupt()) {
+ struct work_struct *w = addr;
+
+ INIT_WORK(w, vcompound_free_work);
+ schedule_work(w);
+ } else
+ __vcompound_free(addr);
+}
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
--
* [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (12 preceding siblings ...)
2007-09-19 3:36 ` [13/17] Virtual compound page freeing in " Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 4:12 ` Gabriel C
2007-09-19 3:36 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Christoph Lameter
` (4 subsequent siblings)
18 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_wait_on_virtually_mapped_object --]
[-- Type: text/plain, Size: 1229 bytes --]
If bit_waitqueue() is passed a vmalloc address then it must use
vmalloc_to_page() instead of virt_to_page() to get to the page struct.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
kernel/wait.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
Index: linux-2.6/kernel/wait.c
===================================================================
--- linux-2.6.orig/kernel/wait.c 2007-09-18 19:19:27.000000000 -0700
+++ linux-2.6/kernel/wait.c 2007-09-18 20:10:39.000000000 -0700
@@ -9,6 +9,7 @@
#include <linux/mm.h>
#include <linux/wait.h>
#include <linux/hash.h>
+#include <linux/vmalloc.h>
void init_waitqueue_head(wait_queue_head_t *q)
{
@@ -245,9 +246,16 @@ EXPORT_SYMBOL(wake_up_bit);
fastcall wait_queue_head_t *bit_waitqueue(void *word, int bit)
{
const int shift = BITS_PER_LONG == 32 ? 5 : 6;
- const struct zone *zone = page_zone(virt_to_page(word));
unsigned long val = (unsigned long)word << shift | bit;
+ struct page *page;
+ struct zone *zone;
+ if (is_vmalloc_addr(word))
+ page = vmalloc_to_page(word);
+ else
+ page = virt_to_page(word);
+
+ zone = page_zone(page);
return &zone->wait_table[hash_long(val, zone->wait_table_bits)];
}
EXPORT_SYMBOL(bit_waitqueue);
--
* [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (13 preceding siblings ...)
2007-09-19 3:36 ` [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-27 21:42 ` Nick Piggin
2007-09-19 3:36 ` [16/17] Allow virtual fallback for buffer_heads Christoph Lameter
` (3 subsequent siblings)
18 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_slub_support --]
[-- Type: text/plain, Size: 8705 bytes --]
SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
available then the conservative settings for higher order allocations are
overridden. We then request an order that can accommodate at minimum
100 objects. The size of an individual slab allocation is allowed to reach
up to 256k (order 6 on i386, order 4 on IA64).
Implementing fallback requires special handling of virtual mappings in
the free path. However, the impact is minimal since we already check
whether the address is NULL or ZERO_SIZE_PTR. No additional cachelines are
touched if we do not fall back. If we do need to handle a virtual
compound page then we walk the kernel page tables in the free path to
determine the page struct.
We also need special handling in the allocation paths since the virtual
addresses cannot be obtained via page_address(). SLUB exploits that
page->private is set to the vmalloc address to avoid a costly
vmalloc_address().
However, for diagnostics there is still the need to determine the
vmalloc address from the page struct. There we must use the costly
vmalloc_address().
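For illustration, recording that address would look roughly like the sketch
below. The actual assignment happens when the virtual compound page is set up
in an earlier patch of the series; the helper here is only an assumption made
for the example, not code from that patch.

	/*
	 * Sketch only: stash the vmap address in the head page so that the
	 * SLUB fast paths can read page->private instead of walking the
	 * kernel page tables via vmalloc_address().
	 */
	static void vcompound_set_addr(struct page *head, void *vaddr)
	{
		head->private = (unsigned long)vaddr;
	}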
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slab.h | 1
include/linux/slub_def.h | 1
mm/slub.c | 83 ++++++++++++++++++++++++++++++++---------------
3 files changed, 60 insertions(+), 25 deletions(-)
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2007-09-18 17:03:30.000000000 -0700
+++ linux-2.6/include/linux/slab.h 2007-09-18 17:07:39.000000000 -0700
@@ -19,6 +19,7 @@
* The ones marked DEBUG are only valid if CONFIG_SLAB_DEBUG is set.
*/
#define SLAB_DEBUG_FREE 0x00000100UL /* DEBUG: Perform (expensive) checks on free */
+#define SLAB_VFALLBACK 0x00000200UL /* May fall back to vmalloc */
#define SLAB_RED_ZONE 0x00000400UL /* DEBUG: Red zone objs in a cache */
#define SLAB_POISON 0x00000800UL /* DEBUG: Poison objects */
#define SLAB_HWCACHE_ALIGN 0x00002000UL /* Align objs on cache lines */
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-09-18 17:03:30.000000000 -0700
+++ linux-2.6/mm/slub.c 2007-09-18 18:13:38.000000000 -0700
@@ -20,6 +20,7 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
/*
* Lock order:
@@ -277,6 +278,26 @@ static inline struct kmem_cache_node *ge
#endif
}
+static inline void *slab_address(struct page *page)
+{
+ if (unlikely(PageVcompound(page)))
+ return vmalloc_address(page);
+ else
+ return page_address(page);
+}
+
+static inline struct page *virt_to_slab(const void *addr)
+{
+ struct page *page;
+
+ if (unlikely(is_vmalloc_addr(addr)))
+ page = vmalloc_to_page(addr);
+ else
+ page = virt_to_page(addr);
+
+ return compound_head(page);
+}
+
static inline int check_valid_pointer(struct kmem_cache *s,
struct page *page, const void *object)
{
@@ -285,7 +306,7 @@ static inline int check_valid_pointer(st
if (!object)
return 1;
- base = page_address(page);
+ base = slab_address(page);
if (object < base || object >= base + s->objects * s->size ||
(object - base) % s->size) {
return 0;
@@ -470,7 +491,7 @@ static void slab_fix(struct kmem_cache *
static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
{
unsigned int off; /* Offset of last byte */
- u8 *addr = page_address(page);
+ u8 *addr = slab_address(page);
print_tracking(s, p);
@@ -648,7 +669,7 @@ static int slab_pad_check(struct kmem_ca
if (!(s->flags & SLAB_POISON))
return 1;
- start = page_address(page);
+ start = slab_address(page);
end = start + (PAGE_SIZE << s->order);
length = s->objects * s->size;
remainder = end - (start + length);
@@ -1040,11 +1061,7 @@ static struct page *allocate_slab(struct
struct page * page;
int pages = 1 << s->order;
- if (s->order)
- flags |= __GFP_COMP;
-
- if (s->flags & SLAB_CACHE_DMA)
- flags |= SLUB_DMA;
+ flags |= s->gfpflags;
if (node == -1)
page = alloc_pages(flags, s->order);
@@ -1098,7 +1115,11 @@ static struct page *new_slab(struct kmem
SLAB_STORE_USER | SLAB_TRACE))
SetSlabDebug(page);
- start = page_address(page);
+ if (!PageVcompound(page))
+ start = slab_address(page);
+ else
+ start = (void *)page->private;
+
end = start + s->objects * s->size;
if (unlikely(s->flags & SLAB_POISON))
@@ -1130,7 +1151,7 @@ static void __free_slab(struct kmem_cach
void *p;
slab_pad_check(s, page);
- for_each_object(p, s, page_address(page))
+ for_each_object(p, s, slab_address(page))
check_object(s, page, p, 0);
ClearSlabDebug(page);
}
@@ -1672,7 +1693,7 @@ void kmem_cache_free(struct kmem_cache *
{
struct page *page;
- page = virt_to_head_page(x);
+ page = virt_to_slab(x);
slab_free(s, page, x, __builtin_return_address(0));
}
@@ -1681,7 +1702,7 @@ EXPORT_SYMBOL(kmem_cache_free);
/* Figure out on which slab object the object resides */
static struct page *get_object_page(const void *x)
{
- struct page *page = virt_to_head_page(x);
+ struct page *page = virt_to_slab(x);
if (!PageSlab(page))
return NULL;
@@ -1780,10 +1801,9 @@ static inline int slab_order(int size, i
return order;
}
-static inline int calculate_order(int size)
+static inline int calculate_order(int size, int min_objects, int max_order)
{
int order;
- int min_objects;
int fraction;
/*
@@ -1794,13 +1814,12 @@ static inline int calculate_order(int si
* First we reduce the acceptable waste in a slab. Then
* we reduce the minimum objects required in a slab.
*/
- min_objects = slub_min_objects;
while (min_objects > 1) {
fraction = 8;
while (fraction >= 4) {
order = slab_order(size, min_objects,
- slub_max_order, fraction);
- if (order <= slub_max_order)
+ max_order, fraction);
+ if (order <= max_order)
return order;
fraction /= 2;
}
@@ -1811,8 +1830,8 @@ static inline int calculate_order(int si
* We were unable to place multiple objects in a slab. Now
* lets see if we can place a single object there.
*/
- order = slab_order(size, 1, slub_max_order, 1);
- if (order <= slub_max_order)
+ order = slab_order(size, 1, max_order, 1);
+ if (order <= max_order)
return order;
/*
@@ -2059,10 +2078,24 @@ static int calculate_sizes(struct kmem_c
size = ALIGN(size, align);
s->size = size;
- s->order = calculate_order(size);
+ if (s->flags & SLAB_VFALLBACK)
+ s->order = calculate_order(size, 100, 18 - PAGE_SHIFT);
+ else
+ s->order = calculate_order(size, slub_min_objects,
+ slub_max_order);
+
if (s->order < 0)
return 0;
+ if (s->order)
+ s->gfpflags |= __GFP_COMP;
+
+ if (s->flags & SLAB_VFALLBACK)
+ s->gfpflags |= __GFP_VFALLBACK;
+
+ if (s->flags & SLAB_CACHE_DMA)
+ s->flags |= SLUB_DMA;
+
/*
* Determine the number of objects per slab
*/
@@ -2477,7 +2510,7 @@ void kfree(const void *x)
if (ZERO_OR_NULL_PTR(x))
return;
- page = virt_to_head_page(x);
+ page = virt_to_slab(x);
s = page->slab;
slab_free(s, page, (void *)x, __builtin_return_address(0));
@@ -2806,7 +2839,7 @@ static int validate_slab(struct kmem_cac
unsigned long *map)
{
void *p;
- void *addr = page_address(page);
+ void *addr = slab_address(page);
if (!check_slab(s, page) ||
!on_freelist(s, page, NULL))
@@ -3048,7 +3081,7 @@ static int add_location(struct loc_track
cpu_set(track->cpu, l->cpus);
}
- node_set(page_to_nid(virt_to_page(track)), l->nodes);
+ node_set(page_to_nid(virt_to_slab(track)), l->nodes);
return 1;
}
@@ -3079,14 +3112,14 @@ static int add_location(struct loc_track
cpus_clear(l->cpus);
cpu_set(track->cpu, l->cpus);
nodes_clear(l->nodes);
- node_set(page_to_nid(virt_to_page(track)), l->nodes);
+ node_set(page_to_nid(virt_to_slab(track)), l->nodes);
return 1;
}
static void process_slab(struct loc_track *t, struct kmem_cache *s,
struct page *page, enum track_item alloc)
{
- void *addr = page_address(page);
+ void *addr = slab_address(page);
DECLARE_BITMAP(map, s->objects);
void *p;
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2007-09-18 17:03:30.000000000 -0700
+++ linux-2.6/include/linux/slub_def.h 2007-09-18 17:07:39.000000000 -0700
@@ -31,6 +31,7 @@ struct kmem_cache {
int objsize; /* The size of an object without meta data */
int offset; /* Free pointer offset. */
int order;
+ int gfpflags; /* Allocation flags */
/*
* Avoid an extra cache line for UP, SMP and for the node local to
--
^ permalink raw reply [flat|nested] 110+ messages in thread
* [16/17] Allow virtual fallback for buffer_heads
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (14 preceding siblings ...)
2007-09-19 3:36 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 3:36 ` [17/17] Allow virtual fallback for dentries Christoph Lameter
` (2 subsequent siblings)
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_buffer_head --]
[-- Type: text/plain, Size: 801 bytes --]
This is particularly useful for large I/Os because it allows more than 100
allocations from the SLUB fast path without having to go to the page allocator.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/buffer.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2007-09-18 15:44:37.000000000 -0700
+++ linux-2.6/fs/buffer.c 2007-09-18 15:44:51.000000000 -0700
@@ -3008,7 +3008,8 @@ void __init buffer_init(void)
int nrpages;
bh_cachep = KMEM_CACHE(buffer_head,
- SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+ SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|
+ SLAB_VFALLBACK);
/*
* Limit the bh occupancy to 10% of ZONE_NORMAL
--
^ permalink raw reply [flat|nested] 110+ messages in thread
* [17/17] Allow virtual fallback for dentries
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (15 preceding siblings ...)
2007-09-19 3:36 ` [16/17] Allow virtual fallback for buffer_heads Christoph Lameter
@ 2007-09-19 3:36 ` Christoph Lameter
2007-09-19 7:34 ` [00/17] [RFC] Virtual Compound Page Support Anton Altaparmakov
2007-09-19 8:24 ` Andi Kleen
18 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 3:36 UTC (permalink / raw)
To: Christoph Hellwig, Mel Gorman
Cc: linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
[-- Attachment #1: vcompound_dentry --]
[-- Type: text/plain, Size: 656 bytes --]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/dcache.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2007-09-18 18:42:19.000000000 -0700
+++ linux-2.6/fs/dcache.c 2007-09-18 18:42:55.000000000 -0700
@@ -2118,7 +2118,8 @@ static void __init dcache_init(unsigned
* of the dcache.
*/
dentry_cache = KMEM_CACHE(dentry,
- SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+ SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|
+ SLAB_VFALLBACK);
register_shrinker(&dcache_shrinker);
--
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area
2007-09-19 3:36 ` [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area Christoph Lameter
@ 2007-09-19 4:12 ` Gabriel C
2007-09-19 17:40 ` Christoph Lameter
0 siblings, 1 reply; 110+ messages in thread
From: Gabriel C @ 2007-09-19 4:12 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
Christoph Lameter wrote:
>
> + if (is_vmalloc_addr(word))
> + page = vmalloc_to_page(word)
^^^^^^
Missing ' ; '
> + else
> + page = virt_to_page(word);
> +
> + zone = page_zone(page);
> return &zone->wait_table[hash_long(val, zone->wait_table_bits)];
> }
> EXPORT_SYMBOL(bit_waitqueue);
>
Regards,
Gabriel
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 3:36 ` [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries Christoph Lameter
@ 2007-09-19 6:32 ` David Rientjes
2007-09-19 7:24 ` Anton Altaparmakov
0 siblings, 1 reply; 110+ messages in thread
From: David Rientjes @ 2007-09-19 6:32 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Tue, 18 Sep 2007, Christoph Lameter wrote:
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h 2007-09-17 21:46:06.000000000 -0700
> +++ linux-2.6/include/linux/mm.h 2007-09-17 23:56:54.000000000 -0700
> @@ -1158,6 +1158,14 @@ static inline unsigned long vma_pages(st
> return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> }
>
> +/* Determine if an address is within the vmalloc range */
> +static inline int is_vmalloc_addr(const void *x)
> +{
> + unsigned long addr = (unsigned long)x;
> +
> + return addr >= VMALLOC_START && addr < VMALLOC_END;
> +}
This breaks on i386 because VMALLOC_END is defined in terms of PKMAP_BASE
in the CONFIG_HIGHMEM case.
This function should probably be in include/linux/vmalloc.h instead since
all callers already include it anyway.
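For reference, the i386 definition in question is roughly the following (quoted
from memory of include/asm-i386/pgtable.h, so treat it as approximate); it
shows why linux/mm.h cannot see VMALLOC_END without pulling in the
highmem/fixmap headers:

	#ifdef CONFIG_HIGHMEM
	# define VMALLOC_END	(PKMAP_BASE - 2 * PAGE_SIZE)
	#else
	# define VMALLOC_END	(FIXADDR_START - 2 * PAGE_SIZE)
	#endif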
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 6:32 ` David Rientjes
@ 2007-09-19 7:24 ` Anton Altaparmakov
2007-09-19 8:09 ` David Rientjes
2007-09-19 17:29 ` Christoph Lameter
0 siblings, 2 replies; 110+ messages in thread
From: Anton Altaparmakov @ 2007-09-19 7:24 UTC (permalink / raw)
To: David Rientjes
Cc: Christoph Lameter, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On 19 Sep 2007, at 07:32, David Rientjes wrote:
> On Tue, 18 Sep 2007, Christoph Lameter wrote:
>> Index: linux-2.6/include/linux/mm.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/mm.h 2007-09-17
>> 21:46:06.000000000 -0700
>> +++ linux-2.6/include/linux/mm.h 2007-09-17 23:56:54.000000000 -0700
>> @@ -1158,6 +1158,14 @@ static inline unsigned long vma_pages(st
>> return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>> }
>>
>> +/* Determine if an address is within the vmalloc range */
>> +static inline int is_vmalloc_addr(const void *x)
>> +{
>> + unsigned long addr = (unsigned long)x;
>> +
>> + return addr >= VMALLOC_START && addr < VMALLOC_END;
>> +}
>
> This breaks on i386 because VMALLOC_END is defined in terms of
> PKMAP_BASE
> in the CONFIG_HIGHMEM case.
That is incorrect. This works perfectly on i386 and on ALL
architectures supported by Linux. A lot of places in the kernel
already do this today (mostly hand coded though, e.g. XFS and NTFS)...
There even is such a function already in mm/sparse.c::vaddr_in_vmalloc_area()
with pretty identical content. I
would suggest either this new inline should go away completely and
use the existing one and export it or the existing one should go away
and the inline should be used. It seems silly to have two functions
with different names doing exactly the same thing!
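For reference, that helper looks roughly like this (reproduced from memory, so
approximate):

	static int vaddr_in_vmalloc_area(void *addr)
	{
		if (addr >= (void *)VMALLOC_START &&
		    addr < (void *)VMALLOC_END)
			return 1;
		return 0;
	}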
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [00/17] [RFC] Virtual Compound Page Support
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (16 preceding siblings ...)
2007-09-19 3:36 ` [17/17] Allow virtual fallback for dentries Christoph Lameter
@ 2007-09-19 7:34 ` Anton Altaparmakov
2007-09-19 8:34 ` Eric Dumazet
2007-09-19 8:24 ` Andi Kleen
18 siblings, 1 reply; 110+ messages in thread
From: Anton Altaparmakov @ 2007-09-19 7:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mel Gorman, David Chinner, Jens Axboe,
linux-fsdevel, linux-kernel
Hi Christoph,
On 19 Sep 2007, at 04:36, Christoph Lameter wrote:
> Currently there is a strong tendency to avoid larger page
> allocations in
> the kernel because of past fragmentation issues and the current
> defragmentation methods are still evolving. It is not clear to what
> extend
> they can provide reliable allocations for higher order pages (plus the
> definition of "reliable" seems to be in the eye of the beholder).
>
> Currently we use vmalloc allocations in many locations to provide a
> safe
> way to allocate larger arrays. That is due to the danger of higher
> order
> allocations failing. Virtual Compound pages allow the use of regular
> page allocator allocations that will fall back only if there is an
> actual
> problem with acquiring a higher order page.
>
> This patch set provides a way for a higher page allocation to fall
> back.
> Instead of a physically contiguous page a virtually contiguous page
> is provided. The functionality of the vmalloc layer is used to provide
> the necessary page tables and control structures to establish a
> virtually
> contiguous area.
I like this a lot. It will get rid of all the silly games we have to
play when needing both large allocations and efficient allocations
where possible. In NTFS I can then just allocated higher order pages
instead of having to mess about with the allocation size and
allocating a single page if the requested size is <= PAGE_SIZE or
using vmalloc() if the size is bigger. And it will make it faster
because a lot of the time a higher order page allocation will succeed
with your patchset without resorting to vmalloc() so that will be a
lot faster.
So where I currently have fs/ntfs/malloc.h the below mess I could get
rid of it completely and just use the normal page allocator/
deallocator instead...
static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask)
{
	if (likely(size <= PAGE_SIZE)) {
		BUG_ON(!size);
		/* kmalloc() has per-CPU caches so is faster for now. */
		return kmalloc(PAGE_SIZE, gfp_mask & ~__GFP_HIGHMEM);
		/* return (void *)__get_free_page(gfp_mask); */
	}
	if (likely(size >> PAGE_SHIFT < num_physpages))
		return __vmalloc(size, gfp_mask, PAGE_KERNEL);
	return NULL;
}
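With the fallback from this patchset the whole helper could presumably shrink
to something like the sketch below. This is a hypothetical conversion only:
__GFP_VFALLBACK, PageVcompound() and vmalloc_address() are the interfaces
introduced by the patchset, and the size sanity checks of the original are
omitted for brevity. (A matching free side would of course need the same
PageVcompound() awareness.)

	static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask)
	{
		struct page *page = alloc_pages(gfp_mask | __GFP_VFALLBACK,
						get_order(size));

		if (!page)
			return NULL;
		/*
		 * page_address() is not valid if the allocation fell back
		 * to a virtual mapping, so handle that case explicitly.
		 */
		if (PageVcompound(page))
			return vmalloc_address(page);
		return page_address(page);
	}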
And other places in the kernel can make use of the same. I think XFS
does very similar things to NTFS in terms of larger allocations at
least and there are probably more places I don't know about off the
top of my head...
I am looking forward to your patchset going into mainline. (-:
Best regards,
Anton
> Advantages:
>
> - If higher order allocations are failing then virtual compound pages
> consisting of a series of order-0 pages can stand in for those
> allocations.
>
> - "Reliability" as long as the vmalloc layer can provide virtual
> mappings.
>
> - Ability to reduce the use of vmalloc layer significantly by using
> physically contiguous memory instead of virtual contiguous memory.
> Most uses of vmalloc() can be converted to page allocator calls.
>
> - The use of physically contiguous memory instead of vmalloc may
> allow the
> use larger TLB entries thus reducing TLB pressure. Also reduces
> the need
> for page table walks.
>
> Disadvantages:
>
> - In order to use fall back the logic accessing the memory must be
> aware that the memory could be backed by a virtual mapping and take
> precautions. virt_to_page() and page_address() may not work and
> vmalloc_to_page() and vmalloc_address() (introduced through this
> patch set) may have to be called.
>
> - Virtual mappings are less efficient than physical mappings.
> Performance will drop once virtual fall back occurs.
>
> - Virtual mappings have more memory overhead. vm_area control
> structures
> page tables, page arrays etc need to be allocated and managed to
> provide
> virtual mappings.
>
> The patchset provides this functionality in stages. Stage 1 introduces
> the basic fall back mechanism necessary to replace vmalloc allocations
> with
>
> alloc_page(GFP_VFALLBACK, order, ....)
>
> which signifies to the page allocator that a higher order is to be
> found
> but a virtual mapping may stand in if there is an issue with
> fragmentation.
>
> Stage 1 functionality does not allow allocation and freeing of virtual
> mappings from interrupt contexts.
>
> The stage 1 series ends with the conversion of a few key uses of
> vmalloc
> in the VM to alloc_pages() for the allocation of sparsemems memmap
> table
> and the wait table in each zone. Other uses of vmalloc could be
> converted
> in the same way.
>
>
> Stage 2 functionality enhances the fallback even more allowing
> allocation
> and frees in interrupt context.
>
> SLUB is then modified to use the virtual mappings for slab caches
> that are marked with SLAB_VFALLBACK. If a slab cache is marked this
> way
> then we drop all the restraints regarding page order and allocate
> good large memory areas that fit lots of objects so that we rarely
> have to use the slow paths.
>
> Two slab caches--the dentry cache and the buffer_heads--are then
> flagged
> that way. Others could be converted in the same way.
>
> The patch set also provides a debugging aid through setting
>
> CONFIG_VFALLBACK_ALWAYS
>
> If set then all GFP_VFALLBACK allocations fall back to the virtual
> mappings. This is useful for verification tests. The test of this
> patch set was done by enabling that options and compiling a kernel.
>
>
> Stage 3 functionality could be the adding of support for the large
> buffer size patchset. Not done yet and not sure if it would be useful
> to do.
>
> Much of this patchset may only be needed for special cases in which
> the
> existing defragmentation methods fail for some reason. It may be
> better to
> have the system operate without such a safety net and make sure
> that the
> page allocator can return large orders in a reliable way.
>
> The initial idea for this patchset came from Nick Piggin's fsblock
> and from his arguments about reliability and guarantees. Since his
> fsblock uses the virtual mappings I think it is legitimate to
> generalize the use of virtual mappings to support higher order
> allocations in this way. The application of these ideas to the large
> block size patchset etc are straightforward. If wanted I can base
> the next rev of the largebuffer patchset on this one and implement
> fallback.
>
> Contrary to Nick, I still doubt that any of this provides a
> "guarantee".
> Have said that I have to deal with various failure scenarios in the VM
> daily and I'd certainly like to see it work in a more reliable manner.
>
> IMHO getting rid of the various workarounds to deal with the small 4k
> pages and avoiding additional layers that group these pages in
> subsystem
> specific ways is something that can simplify the kernel and make the
> kernel more reliable overall.
>
> If people feel that a virtual fall back is needed then so be it. Maybe
> we can shed our security blanket later when the approaches to deal
> with fragmentation have matured.
>
> The patch set is also available via git from the largeblock git
> tree via
>
> git pull
> git://git.kernel.org/pub/scm/linux/kernel/git/christoph/
> largeblocksize.git
> vcompound
>
> --
> -
> To unsubscribe from this list: send the line "unsubscribe linux-
> kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [05/17] vunmap: return page array
2007-09-19 3:36 ` [05/17] vunmap: return page array Christoph Lameter
@ 2007-09-19 8:05 ` KAMEZAWA Hiroyuki
2007-09-19 22:15 ` Christoph Lameter
0 siblings, 1 reply; 110+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-19 8:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Tue, 18 Sep 2007 20:36:10 -0700
Christoph Lameter <clameter@sgi.com> wrote:
> Make vunmap return the page array that was used at vmap. This is useful
> if one has no structures to track the page array but simply stores the
> virtual address somewhere. The disposition of the page array can be
> decided upon after vunmap. vfree() may now also be used instead of
> vunmap which will release the page array after vunmap'ping it.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
Hmm, I don't like returning array which someone allocated in past and forgot.
And, area->page[] array under vmalloc() is allocated as following (in -rc6-mm1)
==
if (array_size > PAGE_SIZE) {
pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
PAGE_KERNEL, node);
area->flags |= VM_VPAGES;
} else {
pages = kmalloc_node(array_size,
(gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
node);
}
==
A bit complicated.
At least, please add comments "How to free page-array returned by vunmap"
Thanks,
-Kame
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 7:24 ` Anton Altaparmakov
@ 2007-09-19 8:09 ` David Rientjes
2007-09-19 8:44 ` Anton Altaparmakov
2007-09-19 17:29 ` Christoph Lameter
1 sibling, 1 reply; 110+ messages in thread
From: David Rientjes @ 2007-09-19 8:09 UTC (permalink / raw)
To: Anton Altaparmakov
Cc: Christoph Lameter, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
> > > Index: linux-2.6/include/linux/mm.h
> > > ===================================================================
> > > --- linux-2.6.orig/include/linux/mm.h 2007-09-17 21:46:06.000000000
> > > -0700
> > > +++ linux-2.6/include/linux/mm.h 2007-09-17 23:56:54.000000000 -0700
> > > @@ -1158,6 +1158,14 @@ static inline unsigned long vma_pages(st
> > > return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > > }
> > >
> > > +/* Determine if an address is within the vmalloc range */
> > > +static inline int is_vmalloc_addr(const void *x)
> > > +{
> > > + unsigned long addr = (unsigned long)x;
> > > +
> > > + return addr >= VMALLOC_START && addr < VMALLOC_END;
> > > +}
> >
> > This breaks on i386 because VMALLOC_END is defined in terms of PKMAP_BASE
> > in the CONFIG_HIGHMEM case.
>
> That is incorrect. This works perfectly on i386 and on ALL architectures
> supported by Linux. A lot of places in the kernel already do this today
> (mostly hand coded though, eg XFS and NTFS)...
>
Hmm, really?
After applying patches 1-3 in this series and compiling on my i386 with
defconfig, I get this:
In file included from include/linux/suspend.h:11,
from arch/i386/kernel/asm-offsets.c:11:
include/linux/mm.h: In function 'is_vmalloc_addr':
include/linux/mm.h:1166: error: 'PKMAP_BASE' undeclared (first use in this function)
include/linux/mm.h:1166: error: (Each undeclared identifier is reported only once
include/linux/mm.h:1166: error: for each function it appears in.)
so I don't know what you're talking about.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [00/17] [RFC] Virtual Compound Page Support
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
` (17 preceding siblings ...)
2007-09-19 7:34 ` [00/17] [RFC] Virtual Compound Page Support Anton Altaparmakov
@ 2007-09-19 8:24 ` Andi Kleen
2007-09-19 17:36 ` Christoph Lameter
18 siblings, 1 reply; 110+ messages in thread
From: Andi Kleen @ 2007-09-19 8:24 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel
Christoph Lameter <clameter@sgi.com> writes:
It seems like a good idea simply because the same functionality
is already open coded in a couple of places and unifying
that would be a good thing. But ...
> The patchset provides this functionality in stages. Stage 1 introduces
> the basic fall back mechanism necessary to replace vmalloc allocations
> with
>
> alloc_page(GFP_VFALLBACK, order, ....)
Is there a reason this needs to be a GFP flag versus a wrapper
around alloc_page/free_page ? page_alloc.c is already too complicated
and it's better to keep new features separated. The only drawback
would be that free_pages would need a different call, but that
doesn't seem like a big problem.
Especially integrating it into slab would seem wrong to me.
slab is already too complicated and for users who need that
large areas page granuality rounding to pages is probably fine.
Also such a wrapper could do the old alloc_page_exact() trick:
instead of always rounding up to the next order, return the leftover
pages to the VM. In some cases this can save significant memory.
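For illustration, such a wrapper might look roughly like the sketch below. The
helper name is made up, split_page() is the existing page allocator primitive
that makes the tail pages individually freeable, and error handling is kept
minimal:

	static void *alloc_pages_exact_sketch(size_t size, gfp_t gfp_mask)
	{
		unsigned int order = get_order(size);
		unsigned long addr = __get_free_pages(gfp_mask, order);

		if (addr) {
			unsigned long alloc_end = addr + (PAGE_SIZE << order);
			unsigned long used = addr + PAGE_ALIGN(size);

			/* Hand the unused tail pages back to the VM. */
			split_page(virt_to_page((void *)addr), order);
			while (used < alloc_end) {
				free_page(used);
				used += PAGE_SIZE;
			}
		}
		return (void *)addr;
	}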
I'm also a little dubious about your attempts to do vmalloc in
interrupt context. Is that really needed? GFP_ATOMIC allocations of
large areas seem to be extremly unreliable to me and not design. Even
if it works sometimes free probably wouldn't work there due to the
flushes, which is very nasty. It would be better to drop that.
-Andi
> which signifies to the page allocator that a higher order is to be found
> but a virtual mapping may stand in if there is an issue with fragmentation.
>
> Stage 1 functionality does not allow allocation and freeing of virtual
> mappings from interrupt contexts.
>
> The stage 1 series ends with the conversion of a few key uses of vmalloc
> in the VM to alloc_pages() for the allocation of sparsemems memmap table
> and the wait table in each zone. Other uses of vmalloc could be converted
> in the same way.
>
>
> Stage 2 functionality enhances the fallback even more allowing allocation
> and frees in interrupt context.
>
> SLUB is then modified to use the virtual mappings for slab caches
> that are marked with SLAB_VFALLBACK. If a slab cache is marked this way
> then we drop all the restraints regarding page order and allocate
> good large memory areas that fit lots of objects so that we rarely
> have to use the slow paths.
>
> Two slab caches--the dentry cache and the buffer_heads--are then flagged
> that way. Others could be converted in the same way.
>
> The patch set also provides a debugging aid through setting
>
> CONFIG_VFALLBACK_ALWAYS
>
> If set then all GFP_VFALLBACK allocations fall back to the virtual
> mappings. This is useful for verification tests. The test of this
> patch set was done by enabling that options and compiling a kernel.
>
>
> Stage 3 functionality could be the adding of support for the large
> buffer size patchset. Not done yet and not sure if it would be useful
> to do.
>
> Much of this patchset may only be needed for special cases in which the
> existing defragmentation methods fail for some reason. It may be better to
> have the system operate without such a safety net and make sure that the
> page allocator can return large orders in a reliable way.
>
> The initial idea for this patchset came from Nick Piggin's fsblock
> and from his arguments about reliability and guarantees. Since his
> fsblock uses the virtual mappings I think it is legitimate to
> generalize the use of virtual mappings to support higher order
> allocations in this way. The application of these ideas to the large
> block size patchset etc are straightforward. If wanted I can base
> the next rev of the largebuffer patchset on this one and implement
> fallback.
>
> Contrary to Nick, I still doubt that any of this provides a "guarantee".
> Have said that I have to deal with various failure scenarios in the VM
> daily and I'd certainly like to see it work in a more reliable manner.
>
> IMHO getting rid of the various workarounds to deal with the small 4k
> pages and avoiding additional layers that group these pages in subsystem
> specific ways is something that can simplify the kernel and make the
> kernel more reliable overall.
>
> If people feel that a virtual fall back is needed then so be it. Maybe
> we can shed our security blanket later when the approaches to deal
> with fragmentation have matured.
>
> The patch set is also available via git from the largeblock git tree via
>
> git pull
> git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git
> vcompound
>
> --
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [00/17] [RFC] Virtual Compound Page Support
2007-09-19 7:34 ` [00/17] [RFC] Virtual Compound Page Support Anton Altaparmakov
@ 2007-09-19 8:34 ` Eric Dumazet
2007-09-19 17:33 ` Christoph Lameter
0 siblings, 1 reply; 110+ messages in thread
From: Eric Dumazet @ 2007-09-19 8:34 UTC (permalink / raw)
To: Anton Altaparmakov
Cc: Christoph Lameter, Christoph Hellwig, Mel Gorman, David Chinner,
Jens Axboe, linux-fsdevel, linux-kernel
On Wed, 19 Sep 2007 08:34:47 +0100
Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> Hi Christoph,
>
> On 19 Sep 2007, at 04:36, Christoph Lameter wrote:
> > Currently there is a strong tendency to avoid larger page
> > allocations in
> > the kernel because of past fragmentation issues and the current
> > defragmentation methods are still evolving. It is not clear to what
> > extend
> > they can provide reliable allocations for higher order pages (plus the
> > definition of "reliable" seems to be in the eye of the beholder).
> >
> > Currently we use vmalloc allocations in many locations to provide a
> > safe
> > way to allocate larger arrays. That is due to the danger of higher
> > order
> > allocations failing. Virtual Compound pages allow the use of regular
> > page allocator allocations that will fall back only if there is an
> > actual
> > problem with acquiring a higher order page.
> >
> > This patch set provides a way for a higher page allocation to fall
> > back.
> > Instead of a physically contiguous page a virtually contiguous page
> > is provided. The functionality of the vmalloc layer is used to provide
> > the necessary page tables and control structures to establish a
> > virtually
> > contiguous area.
>
> I like this a lot. It will get rid of all the silly games we have to
> play when needing both large allocations and efficient allocations
> where possible. In NTFS I can then just allocated higher order pages
> instead of having to mess about with the allocation size and
> allocating a single page if the requested size is <= PAGE_SIZE or
> using vmalloc() if the size is bigger. And it will make it faster
> because a lot of the time a higher order page allocation will succeed
> with your patchset without resorting to vmalloc() so that will be a
> lot faster.
>
> So where I currently have fs/ntfs/malloc.h the below mess I could get
> rid of it completely and just use the normal page allocator/
> deallocator instead...
>
> static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask)
> {
> 	if (likely(size <= PAGE_SIZE)) {
> 		BUG_ON(!size);
> 		/* kmalloc() has per-CPU caches so is faster for now. */
> 		return kmalloc(PAGE_SIZE, gfp_mask & ~__GFP_HIGHMEM);
> 		/* return (void *)__get_free_page(gfp_mask); */
> 	}
> 	if (likely(size >> PAGE_SHIFT < num_physpages))
> 		return __vmalloc(size, gfp_mask, PAGE_KERNEL);
> 	return NULL;
> }
>
> And other places in the kernel can make use of the same. I think XFS
> does very similar things to NTFS in terms of larger allocations at
> least and there are probably more places I don't know about off the
> top of my head...
>
> I am looking forward to your patchset going into mainline. (-:
Sure, it sounds *really* good. But...
1) Only power of two allocations are good candidates, or we waste RAM
2) On i386 machines, we have a small vmalloc window. (128 MB default value)
Many servers with >4GB memory (PAE) like to boot with vmalloc=32M option to get 992MB of LOWMEM.
If we allow some slub caches to fallback to vmalloc land, we'll have problems to tune this.
3) A fallback to vmalloc means an allocation of one vm_struct per compound page.
4) vmalloc() currently uses a linked list of vm_struct. Might need something more scalable.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 8:09 ` David Rientjes
@ 2007-09-19 8:44 ` Anton Altaparmakov
2007-09-19 9:19 ` David Rientjes
2007-09-19 17:29 ` Christoph Lameter
0 siblings, 2 replies; 110+ messages in thread
From: Anton Altaparmakov @ 2007-09-19 8:44 UTC (permalink / raw)
To: David Rientjes
Cc: Christoph Lameter, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On 19 Sep 2007, at 09:09, David Rientjes wrote:
> On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
>>>> Index: linux-2.6/include/linux/mm.h
>>>> ===================================================================
>>>> --- linux-2.6.orig/include/linux/mm.h 2007-09-17 21:46:06.000000000
>>>> -0700
>>>> +++ linux-2.6/include/linux/mm.h 2007-09-17 23:56:54.000000000
>>>> -0700
>>>> @@ -1158,6 +1158,14 @@ static inline unsigned long vma_pages(st
>>>> return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>> }
>>>>
>>>> +/* Determine if an address is within the vmalloc range */
>>>> +static inline int is_vmalloc_addr(const void *x)
>>>> +{
>>>> + unsigned long addr = (unsigned long)x;
>>>> +
>>>> + return addr >= VMALLOC_START && addr < VMALLOC_END;
>>>> +}
>>>
>>> This breaks on i386 because VMALLOC_END is defined in terms of
>>> PKMAP_BASE
>>> in the CONFIG_HIGHMEM case.
>>
>> That is incorrect. This works perfectly on i386 and on ALL
>> architectures
>> supported by Linux. A lot of places in the kernel already do this
>> today
>> (mostly hand coded though, eg XFS and NTFS)...
>
> Hmm, really?
>
> After applying patches 1-3 in this series and compiling on my i386
> with
> defconfig, I get this:
>
> In file included from include/linux/suspend.h:11,
> from arch/i386/kernel/asm-offsets.c:11:
> include/linux/mm.h: In function 'is_vmalloc_addr':
> include/linux/mm.h:1166: error: 'PKMAP_BASE' undeclared (first use
> in this function)
> include/linux/mm.h:1166: error: (Each undeclared identifier is
> reported only once
> include/linux/mm.h:1166: error: for each function it appears in.)
>
> so I don't know what you're talking about.
Just a compile failure, not inherently broken!
Add "#include <linux/highmem.h>" to the top of linux/mm.h and it should
compile just fine.
Although it may cause a problem as highmem.h also includes mm.h so a
bit of trickery may be needed to get it to compile...
I suspect that is_vmalloc_addr() should not be in linux/mm.h at all
and should be in linux/vmalloc.h instead and vmalloc.h should include
linux/highmem.h. That would be more sensible than sticking a vmalloc
related function into linux/mm.h where it does not belong...
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 8:44 ` Anton Altaparmakov
@ 2007-09-19 9:19 ` David Rientjes
2007-09-19 13:23 ` Anton Altaparmakov
2007-09-19 17:29 ` Christoph Lameter
1 sibling, 1 reply; 110+ messages in thread
From: David Rientjes @ 2007-09-19 9:19 UTC (permalink / raw)
To: Anton Altaparmakov
Cc: Christoph Lameter, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
> Although it may cause a problem as highmem.h also includes mm.h so a bit of
> trickery may be needed to get it to compile...
>
> I suspect that is_vmalloc_addr() should not be in linux/mm.h at all and should
> be in linux/vmalloc.h instead and vmalloc.h should include linux/highmem.h.
> That would be more sensible than sticking a vmalloc related function into
> linux/mm.h where it does not belong...
>
That is why I suggested include/linux/vmalloc.h as its home in the first
place. And no, adding an include for linux/highmem.h (and asm/pgtable.h)
to linux/vmalloc.h does not work.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 9:19 ` David Rientjes
@ 2007-09-19 13:23 ` Anton Altaparmakov
0 siblings, 0 replies; 110+ messages in thread
From: Anton Altaparmakov @ 2007-09-19 13:23 UTC (permalink / raw)
To: David Rientjes
Cc: Christoph Lameter, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On 19 Sep 2007, at 10:19, David Rientjes wrote:
> On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
>> Although it may cause a problem as highmem.h also includes mm.h so
>> a bit of
>> trickery may be needed to get it to compile...
>>
>> I suspect that is_vmalloc_addr() should not be in linux/mm.h at
>> all and should
>> be in linux/vmalloc.h instead and vmalloc.h should include linux/
>> highmem.h.
>> That would be more sensible than sticking a vmalloc related
>> function into
>> linux/mm.h where it does not belong...
>
> That is why I suggested include/linux/vmalloc.h as its home in the
> first
> place. And no, adding an include for linux/highmem.h (and asm/
> pgtable.h)
> to linux/vmalloc.h does not work.
I am sure Christoph can figure out somewhere that it will work.
After all, the code in that function already exists both as another
function and open coded in several places and compiles fine there...
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 7:24 ` Anton Altaparmakov
2007-09-19 8:09 ` David Rientjes
@ 2007-09-19 17:29 ` Christoph Lameter
2007-09-19 17:52 ` Anton Altaparmakov
1 sibling, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 17:29 UTC (permalink / raw)
To: Anton Altaparmakov
Cc: David Rientjes, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
> There even is such a function already in mm/sparse.c::vaddr_in_vmalloc_area()
> with pretty identical content. I would suggest either this new inline should
> go away completely and use the existing one and export it or the existing one
> should go away and the inline should be used. It seems silly to have two
> functions with different names doing exactly the same thing!
Just in case you have not noticed: This patchset removes the
vaddr_in_vmalloc_area() in sparse.c
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 8:44 ` Anton Altaparmakov
2007-09-19 9:19 ` David Rientjes
@ 2007-09-19 17:29 ` Christoph Lameter
2007-09-19 17:52 ` Anton Altaparmakov
1 sibling, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 17:29 UTC (permalink / raw)
To: Anton Altaparmakov
Cc: David Rientjes, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
> I suspect that is_vmalloc_addr() should not be in linux/mm.h at all and should
> be in linux/vmalloc.h instead and vmalloc.h should include linux/highmem.h.
> That would be more sensible than sticking a vmalloc related function into
> linux/mm.h where it does not belong...
Tried it and that leads to all sort of other failures.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [00/17] [RFC] Virtual Compound Page Support
2007-09-19 8:34 ` Eric Dumazet
@ 2007-09-19 17:33 ` Christoph Lameter
0 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 17:33 UTC (permalink / raw)
To: Eric Dumazet
Cc: Anton Altaparmakov, Christoph Hellwig, Mel Gorman, David Chinner,
Jens Axboe, linux-fsdevel, linux-kernel
On Wed, 19 Sep 2007, Eric Dumazet wrote:
> 1) Only power of two allocations are good candidates, or we waste RAM
Correct.
> 2) On i386 machines, we have a small vmalloc window. (128 MB default value)
> Many servers with >4GB memory (PAE) like to boot with vmalloc=32M option to get 992MB of LOWMEM.
> If we allow some slub caches to fallback to vmalloc land, we'll have problems to tune this.
We would first do the vmalloc conversion to GFP_VFALLBACK which would
reduce the vmalloc requirements of drivers and core significantly. The
patchset should actually reduce the vmalloc space requirements
significantly. They are only needed in situations where the page allocator
cannot provide a contiguous mapping and that gets rarer the better Mel's
antifrag code works.
> 4) vmalloc() currently uses a linked list of vm_struct. Might need something more scalable.
If it's rarely used then it's not that big of a deal. The better the
anti-fragmentation measures, the less vmalloc use.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [00/17] [RFC] Virtual Compound Page Support
2007-09-19 8:24 ` Andi Kleen
@ 2007-09-19 17:36 ` Christoph Lameter
0 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 17:36 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel
On Wed, 19 Sep 2007, Andi Kleen wrote:
> > alloc_page(GFP_VFALLBACK, order, ....)
>
> Is there a reason this needs to be a GFP flag versus a wrapper
> around alloc_page/free_page ? page_alloc.c is already too complicated
> and it's better to keep new features separated. The only drawback
> would be that free_pages would need a different call, but that
> doesn't seem like a big problem.
I tried to make this a wrapper but there is a lot of logic in
__alloc_pages() that would have to be replicated. Also there are specific
places in __alloc_pages() where we can establish that we have enough memory
but it is the memory fragmentation that prevents us from satisfying the
request for a larger page.
> I'm also a little dubious about your attempts to do vmalloc in
> interrupt context. Is that really needed? GFP_ATOMIC allocations of
> large areas seem to be extremely unreliable to me and not a good design. Even
> if it works sometimes, freeing probably wouldn't work there due to the
> flushes, which is very nasty. It would be better to drop that.
The flushes are only done on virtually mapped architectures (xtensa) and
are simple assembly code that can run in an interrupt context.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area
2007-09-19 4:12 ` Gabriel C
@ 2007-09-19 17:40 ` Christoph Lameter
0 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 17:40 UTC (permalink / raw)
To: Gabriel C
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Wed, 19 Sep 2007, Gabriel C wrote:
> Christoph Lameter wrote:
>
> >
> > + if (is_vmalloc_addr(word))
> > + page = vmalloc_to_page(word)
> ^^^^^^
> Missing ' ; '
Argh. Late beautification attempts are backfiring....
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 17:29 ` Christoph Lameter
@ 2007-09-19 17:52 ` Anton Altaparmakov
0 siblings, 0 replies; 110+ messages in thread
From: Anton Altaparmakov @ 2007-09-19 17:52 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Rientjes, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On 19 Sep 2007, at 18:29, Christoph Lameter wrote:
> On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
>> There even is such a function already in mm/
>> sparse.c::vaddr_in_vmalloc_area()
>> with pretty identical content. I would suggest either this new
>> inline should
>> go away completely and use the existing one and export it or the
>> existing one
>> should go away and the inline should be used. It seems silly to
>> have two
>> functions with different names doing exactly the same thing!
>
> Just in case you have not noticed: This patchset removes the
> vaddr_in_vmalloc_area() in sparse.c
That's great! And sorry I did not notice that bit...
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
2007-09-19 17:29 ` Christoph Lameter
@ 2007-09-19 17:52 ` Anton Altaparmakov
0 siblings, 0 replies; 110+ messages in thread
From: Anton Altaparmakov @ 2007-09-19 17:52 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Rientjes, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On 19 Sep 2007, at 18:29, Christoph Lameter wrote:
> On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
>> I suspect that is_vmalloc_addr() should not be in linux/mm.h at
>> all and should
>> be in linux/vmalloc.h instead and vmalloc.h should include linux/
>> highmem.h.
>> That would be more sensible than sticking a vmalloc related
>> function into
>> linux/mm.h where it does not belong...
>
> Tried it and that leads to all sort of other failures.
Ah, a new header file "vaddr.h" or something then?
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [05/17] vunmap: return page array
2007-09-19 8:05 ` KAMEZAWA Hiroyuki
@ 2007-09-19 22:15 ` Christoph Lameter
2007-09-20 0:47 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-09-19 22:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Wed, 19 Sep 2007, KAMEZAWA Hiroyuki wrote:
> Hmm, I don't like returning array which someone allocated in past and forgot.
But that is exactly the point. There is no need to keep track of the
information that is of no interest until the disposition of the mapping.
> And, area->page[] array under vmalloc() is allocated as following (in -rc6-mm1)
> ==
> if (array_size > PAGE_SIZE) {
> pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
> PAGE_KERNEL, node);
> area->flags |= VM_VPAGES;
> } else {
> pages = kmalloc_node(array_size,
> (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
> node);
> }
> ==
> A bit complicating.
Not at all. You can pass a __vmalloc'ed entity to vmap if you add VM_VPAGES
to the flags passed to it.
> At least, please add comments "How to free page-array returned by vummap"
But that depends on how the vmap was called. The caller knows what he has
done to acquire the memory and therefore also knows how to get rid of it.
The knowledge of how to dispose of things is only important when we use
vfree() to free up a vmap() (we never do that today) and expect the page
array to go with it.
In that case the user needs to specify VM_VPAGES if the page array was
vmalloced.
I can add a comment explaining that?
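For illustration, a caller-side disposal could look roughly like the sketch
below. The function name and nr_pages are made up for the example; which branch
frees the array depends entirely on how the caller allocated it in the first
place:

	static void dispose_of_vmap(void *addr, int nr_pages)
	{
		/* vunmap() as modified by patch 05 hands back the page array */
		struct page **pages = vunmap(addr);
		int i;

		for (i = 0; i < nr_pages; i++)
			__free_page(pages[i]);

		if (is_vmalloc_addr(pages))	/* array was __vmalloc'ed */
			vfree(pages);
		else				/* array came from kmalloc() */
			kfree(pages);
	}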
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [05/17] vunmap: return page array
2007-09-19 22:15 ` Christoph Lameter
@ 2007-09-20 0:47 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 110+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-20 0:47 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Wed, 19 Sep 2007 15:15:58 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Wed, 19 Sep 2007, KAMEZAWA Hiroyuki wrote:
>
> > Hmm, I don't like returning array which someone allocated in past and forgot.
>
> But that is exactly the point. There is no need to keep track of the
> information that is of no interest until the disposition of the mapping.
>
Yes. But I think a neater style for this kind of function is
==
/* If array != NULL, pointer of unmapped pages are stored in array[] */
extern int vunmap(const void *addr, struct page **array);
==
But yes, this costs.
> > And, area->page[] array under vmalloc() is allocated as following (in -rc6-mm1)
> > ==
> > if (array_size > PAGE_SIZE) {
> > pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
> > PAGE_KERNEL, node);
> > area->flags |= VM_VPAGES;
> > } else {
> > pages = kmalloc_node(array_size,
> > (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
> > node);
> > }
> > ==
> > A bit complicated.
>
> Not at all. You can pass a __vmalloc'ed entity to vmap if you add VM_VPAGES
> to the flags passed to it.
>
> > At least, please add comments "How to free page-array returned by vunmap"
>
> But that depends on how the vmap was called. The caller knows what he has
> done to acquire the memory and therefore also knows how to get rid of it.
>
Hm, it seems all your patches are for hiding the usage of vmalloc()/vmap().
If freeing the pages[] array is also hidden in your patches' context, no problem.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [13/17] Virtual compound page freeing in interrupt context
2007-09-18 20:36 ` Nick Piggin
@ 2007-09-20 17:50 ` Christoph Lameter
0 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-20 17:50 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Wed, 19 Sep 2007, Nick Piggin wrote:
> > Removing a virtual mappping *must* be done with interrupts enabled
> > since tlb_xx functions are called that rely on interrupts for
> > processor to processor communications.
>
> These things will clash drastically with my lazy TLB flushing and scalability
> work. Actually the lazy TLB flushing will provide a natural way to defer
> unmapping at interrupt time, and the scalability work should make it
> easier to vmap from interrupt context too, if you really need that.
Both would be good to have.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-19 3:36 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Christoph Lameter
@ 2007-09-27 21:42 ` Nick Piggin
2007-09-28 17:33 ` Christoph Lameter
0 siblings, 1 reply; 110+ messages in thread
From: Nick Piggin @ 2007-09-27 21:42 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Wednesday 19 September 2007 13:36, Christoph Lameter wrote:
> SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
> available then the conservative settings for higher order allocations are
> overridden. We then request an order that can accommodate at minimum
> 100 objects. The size of an individual slab allocation is allowed to reach
> up to 256k (order 6 on i386, order 4 on IA64).
How come SLUB wants such a big amount of objects? I thought the
unqueued nature of it made it better than slab because it minimised
the amount of cache hot memory lying around in slabs...
vmalloc is incredibly slow and unscalable at the moment. I'm still working
on making it more scalable and faster -- hopefully to a point where it would
actually be usable for this... but you still get moved off large TLBs, and
also have to inevitably do tlb flushing.
Or do you have SLUB at a point where performance is comparable to SLAB,
and this is just a possible idea for more performance?
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 17:33 ` Christoph Lameter
@ 2007-09-28 5:14 ` Nick Piggin
2007-10-01 20:50 ` Christoph Lameter
2007-09-28 17:55 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Peter Zijlstra
2007-09-28 21:05 ` Mel Gorman
2 siblings, 1 reply; 110+ messages in thread
From: Nick Piggin @ 2007-09-28 5:14 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Saturday 29 September 2007 03:33, Christoph Lameter wrote:
> On Fri, 28 Sep 2007, Nick Piggin wrote:
> > On Wednesday 19 September 2007 13:36, Christoph Lameter wrote:
> > > SLAB_VFALLBACK can be specified for selected slab caches. If fallback
> > > is available then the conservative settings for higher order
> > > allocations are overridden. We then request an order that can
> > > accomodate at mininum 100 objects. The size of an individual slab
> > > allocation is allowed to reach up to 256k (order 6 on i386, order 4 on
> > > IA64).
> >
> > How come SLUB wants such a big amount of objects? I thought the
> > unqueued nature of it made it better than slab because it minimised
> > the amount of cache hot memory lying around in slabs...
>
> The more objects in a page the more the fast path runs. The more the fast
> path runs the lower the cache footprint and the faster the overall
> allocations etc.
>
> SLAB can be configured for large queues holdings lots of objects.
> SLUB can only reach the same through large pages because it does not
> have queues. One could add the ability to manage pools of cpu slabs but
> that would be adding yet another layer to compensate for the problem of
> the small pages. Reliable large page allocations means that we can get rid
> of these layers and the many workarounds that we have in place right now.
That doesn't sound very nice because you don't actually want to use up
higher order allocations if you can avoid it, and you definitely don't want
to be increasing your slab page size unit if you can help it, because it
compounds the problem of slab fragmentation.
> The unqueued nature of SLUB reduces memory requirements and in general the
> more efficient code paths of SLUB offset the advantage that SLAB can reach
> by being able to put more objects onto its queues. SLAB necessarily
> introduces complexity and cache line use through the need to manage those
> queues.
I thought it was slower. Have you fixed the performance regression?
(OK, I read further down that you are still working on it but not confirmed
yet...)
> > vmalloc is incredibly slow and unscalable at the moment. I'm still
> > working on making it more scalable and faster -- hopefully to a point
> > where it would actually be usable for this... but you still get moved off
> > large TLBs, and also have to inevitably do tlb flushing.
>
> Again I have not seen any fallbacks to vmalloc in my testing. What we are
> doing here is mainly to address your theoretical cases that we so far have
> never seen to be a problem and increase the reliability of allocations of
> page orders larger than 3 to a usable level. So far I have so far not
> dared to enable orders larger than 3 by default.
Basically, all that shows is that your testing isn't very thorough. 128MB
is an order of magnitude *more* memory than some users have. They
probably wouldn't be happy with a regression in slab allocator performance
either.
> > Or do you have SLUB at a point where performance is comparable to SLAB,
> > and this is just a possible idea for more performance?
>
> AFAICT SLUBs performance is superior to SLAB in most cases and it was like
> that from the beginning. I am still concerned about several corner cases
> though (I think most of them are going to be addressed by the per cpu
> patches in mm). Having a comparable or larger amount of per cpu objects as
> SLAB is something that also could address some of these concerns and could
> increase performance much further.
OK, so long as it isn't going to depend on using higher order pages, that's
fine. (if they help even further as an optional thing, that's fine too. You
can turn them on your huge systems and not even bother about adding
this vmap fallback -- you won't have me to nag you about these
purely theoretical issues).
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-27 21:42 ` Nick Piggin
@ 2007-09-28 17:33 ` Christoph Lameter
2007-09-28 5:14 ` Nick Piggin
` (2 more replies)
0 siblings, 3 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-28 17:33 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Fri, 28 Sep 2007, Nick Piggin wrote:
> On Wednesday 19 September 2007 13:36, Christoph Lameter wrote:
> > SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
> > available then the conservative settings for higher order allocations are
> > overridden. We then request an order that can accomodate at mininum
> > 100 objects. The size of an individual slab allocation is allowed to reach
> > up to 256k (order 6 on i386, order 4 on IA64).
>
> How come SLUB wants such a big amount of objects? I thought the
> unqueued nature of it made it better than slab because it minimised
> the amount of cache hot memory lying around in slabs...
The more objects in a page the more the fast path runs. The more the fast
path runs the lower the cache footprint and the faster the overall
allocations etc.
SLAB can be configured for large queues holding lots of objects.
SLUB can only reach the same through large pages because it does not
have queues. One could add the ability to manage pools of cpu slabs but
that would be adding yet another layer to compensate for the problem of
the small pages. Reliable large page allocations means that we can get rid
of these layers and the many workarounds that we have in place right now.
The unqueued nature of SLUB reduces memory requirements and in general the
more efficient code paths of SLUB offset the advantage that SLAB can reach
by being able to put more objects onto its queues. SLAB necessarily
introduces complexity and cache line use through the need to manage those
queues.
> vmalloc is incredibly slow and unscalable at the moment. I'm still working
> on making it more scalable and faster -- hopefully to a point where it would
> actually be usable for this... but you still get moved off large TLBs, and
> also have to inevitably do tlb flushing.
Again I have not seen any fallbacks to vmalloc in my testing. What we are
doing here is mainly to address your theoretical cases that we so far have
never seen to be a problem and increase the reliability of allocations of
page orders larger than 3 to a usable level. So far I have not dared
to enable orders larger than 3 by default.
AFAICT the performance of vmalloc is not really relevant. If this were to
become an issue then it would be possible to reduce the orders used to
avoid fallbacks.
> Or do you have SLUB at a point where performance is comparable to SLAB,
> and this is just a possible idea for more performance?
AFAICT SLUBs performance is superior to SLAB in most cases and it was like
that from the beginning. I am still concerned about several corner cases
though (I think most of them are going to be addressed by the per cpu
patches in mm). Having a comparable or larger amount of per cpu objects as
SLAB is something that also could address some of these concerns and could
increase performance much further.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 17:33 ` Christoph Lameter
2007-09-28 5:14 ` Nick Piggin
@ 2007-09-28 17:55 ` Peter Zijlstra
2007-09-28 18:20 ` Christoph Lameter
2007-09-28 21:05 ` Mel Gorman
2 siblings, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2007-09-28 17:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Fri, 2007-09-28 at 10:33 -0700, Christoph Lameter wrote:
> Again I have not seen any fallbacks to vmalloc in my testing. What we are
> doing here is mainly to address your theoretical cases that we so far have
> never seen to be a problem and increase the reliability of allocations of
> page orders larger than 3 to a usable level. So far I have so far not
> dared to enable orders larger than 3 by default.
take a recent -mm kernel, boot with mem=128M.
start 2 processes that each mmap a separate 64M file and do
sequential writes on them. start a third process that does the same with
64M of anonymous memory.
wait for a while, and you'll see order=1 failures.
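A minimal sketch of such a tester (an assumed reconstruction; the actual
mm_tester program is not posted in this thread) could look like this, run
as two file-backed instances plus one anonymous instance on a box booted
with mem=128M:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ (64UL << 20)			/* 64M mapping, as in the recipe above */

int main(int argc, char **argv)
{
	char *p;

	if (argc > 1) {			/* file-backed: pass a filename */
		int fd = open(argv[1], O_RDWR | O_CREAT, 0600);

		if (fd < 0 || ftruncate(fd, SZ) < 0)
			return 1;
		p = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	} else {			/* anonymous: no argument */
		p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	}
	if (p == MAP_FAILED)
		return 1;

	for (;;) {			/* sequential writes, forever */
		unsigned long off;

		for (off = 0; off < SZ; off += 4096)
			p[off] = (char)off;
	}
}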
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 17:55 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Peter Zijlstra
@ 2007-09-28 18:20 ` Christoph Lameter
2007-09-28 18:25 ` Peter Zijlstra
2007-09-29 8:45 ` Peter Zijlstra
0 siblings, 2 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-28 18:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Fri, 28 Sep 2007, Peter Zijlstra wrote:
>
> On Fri, 2007-09-28 at 10:33 -0700, Christoph Lameter wrote:
>
> > Again I have not seen any fallbacks to vmalloc in my testing. What we are
> > doing here is mainly to address your theoretical cases that we so far have
> > never seen to be a problem and increase the reliability of allocations of
> > page orders larger than 3 to a usable level. So far I have so far not
> > dared to enable orders larger than 3 by default.
>
> take a recent -mm kernel, boot with mem=128M.
Ok so only 32k pages to play with? I have tried parallel kernel compiles
with mem=256m and they seemed to be fine.
> start 2 processes that each mmap a separate 64M file, and which does
> sequential writes on them. start a 3th process that does the same with
> 64M anonymous.
>
> wait for a while, and you'll see order=1 failures.
Really? That means we can no longer even allocate stacks for forking.
It's surprising that neither lumpy reclaim nor the mobility patches can
deal with it? Lumpy reclaim should be able to free neighboring pages to
avoid the order 1 failure unless there are lots of pinned pages.
I guess then that lots of pages are pinned through I/O?
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 18:20 ` Christoph Lameter
@ 2007-09-28 18:25 ` Peter Zijlstra
2007-09-28 18:41 ` Christoph Lameter
` (2 more replies)
2007-09-29 8:45 ` Peter Zijlstra
1 sibling, 3 replies; 110+ messages in thread
From: Peter Zijlstra @ 2007-09-28 18:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
> > start 2 processes that each mmap a separate 64M file, and which does
> > sequential writes on them. start a 3th process that does the same with
> > 64M anonymous.
> >
> > wait for a while, and you'll see order=1 failures.
>
> Really? That means we can no longer even allocate stacks for forking.
>
> Its surprising that neither lumpy reclaim nor the mobility patches can
> deal with it? Lumpy reclaim should be able to free neighboring pages to
> avoid the order 1 failure unless there are lots of pinned pages.
>
> I guess then that lots of pages are pinned through I/O?
memory got massively fragmented, as anti-frag gets easily defeated.
setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
order blocks to stay available, so we don't mix types. however 12M on
128M is rather a lot.
it's still on my todo list to look at it further...
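For reference, the workaround amounts to writing the value (in kilobytes)
into /proc/sys/vm/min_free_kbytes, e.g. via sysctl or a trivial helper like
the sketch below (which just does the equivalent of echoing 12288 into the
proc file by hand):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "w");

	if (!f)
		return 1;
	fprintf(f, "%d\n", 12 * 1024);	/* 12M, in kilobytes */
	return fclose(f) ? 1 : 0;
}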
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 18:25 ` Peter Zijlstra
@ 2007-09-28 18:41 ` Christoph Lameter
2007-09-28 20:22 ` Nick Piggin
2007-09-28 21:14 ` Mel Gorman
2007-09-28 20:59 ` Mel Gorman
2007-09-29 8:13 ` Andrew Morton
2 siblings, 2 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-09-28 18:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Fri, 28 Sep 2007, Peter Zijlstra wrote:
> memory got massively fragemented, as anti-frag gets easily defeated.
> setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> order blocks to stay available, so we don't mix types. however 12M on
> 128M is rather a lot.
Yes, strict ordering would be much better. On NUMA it may be possible to
completely forbid merging. We can fall back to other nodes if necessary.
12M is not much on a NUMA system.
But this shows that (unsurprisingly) we may have issues on systems with
small amounts of memory, and we may not want to use higher orders on such
systems.
The case you got may be good to use as a testcase for the virtual
fallback. Hmmmm... Maybe it is possible to allocate the stack as a virtual
compound page. Got some script/code to produce that problem?
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 9:27 ` Andrew Morton
@ 2007-09-28 20:19 ` Nick Piggin
2007-09-29 19:20 ` Andrew Morton
0 siblings, 1 reply; 110+ messages in thread
From: Nick Piggin @ 2007-09-28 20:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Zijlstra, Christoph Lameter, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Saturday 29 September 2007 19:27, Andrew Morton wrote:
> On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl>
wrote:
> > > oom-killings, or page allocation failures? The latter, one hopes.
> >
> > Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 (Ubuntu
> > 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
> >
> > ...
> >
> >
> > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
> > Call Trace:
> > 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17
> > 611b3888: [<60052ed4>] out_of_memory+0x80/0x100
> > 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280
> > 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0
> > 611b3968: [<6006c705>] new_slab+0x7e/0x183
> > 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b
> > 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472
> > 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98
> > 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7
> > 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
> > 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109
> > 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2
> > 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1
> > 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a
> > 611b3b98: [<60056f00>] read_pages+0x37/0x9b
> > 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157
> > 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f
> > 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278
> > 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd
> > 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca
> > 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33
> > 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246
> > 611b3da8: [<60120a7b>] __up_read+0x73/0x7b
> > 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b
> > 611b3e48: [<60013419>] segv+0xac/0x297
> > 611b3f28: [<60013367>] segv_handler+0x68/0x6e
> > 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1
> > 611b3f68: [<60023853>] userspace+0x13a/0x19d
> > 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d
>
> OK, that's different. Someone broke the vm - order-2 GFP_KERNEL
> allocations aren't supposed to fail.
>
> I'm suspecting that did_some_progress thing.
The allocation didn't fail -- it invoked the OOM killer because the kernel
ran out of unfragmented memory. Probably because higher order
allocations are the new vogue in -mm at the moment ;)
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 18:41 ` Christoph Lameter
@ 2007-09-28 20:22 ` Nick Piggin
2007-09-28 21:14 ` Mel Gorman
1 sibling, 0 replies; 110+ messages in thread
From: Nick Piggin @ 2007-09-28 20:22 UTC (permalink / raw)
To: Christoph Lameter
Cc: Peter Zijlstra, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Saturday 29 September 2007 04:41, Christoph Lameter wrote:
> On Fri, 28 Sep 2007, Peter Zijlstra wrote:
> > memory got massively fragemented, as anti-frag gets easily defeated.
> > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> > order blocks to stay available, so we don't mix types. however 12M on
> > 128M is rather a lot.
>
> Yes, strict ordering would be much better. On NUMA it may be possible to
> completely forbid merging. We can fall back to other nodes if necessary.
> 12M is not much on a NUMA system.
>
> But this shows that (unsurprisingly) we may have issues on systems with a
> small amounts of memory and we may not want to use higher orders on such
> systems.
>
> The case you got may be good to use as a testcase for the virtual
> fallback. Hmmmm... Maybe it is possible to allocate the stack as a virtual
> compound page. Got some script/code to produce that problem?
Yeah, you could do that, but we generally don't have big problems allocating
stacks in mainline, because we have very few users of higher order pages,
and the few that are there don't seem to be a problem.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 18:25 ` Peter Zijlstra
2007-09-28 18:41 ` Christoph Lameter
@ 2007-09-28 20:59 ` Mel Gorman
2007-09-29 8:13 ` Andrew Morton
2 siblings, 0 replies; 110+ messages in thread
From: Mel Gorman @ 2007-09-28 20:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On (28/09/07 20:25), Peter Zijlstra didst pronounce:
>
> On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
>
> > > start 2 processes that each mmap a separate 64M file, and which does
> > > sequential writes on them. start a 3th process that does the same with
> > > 64M anonymous.
> > >
> > > wait for a while, and you'll see order=1 failures.
> >
> > Really? That means we can no longer even allocate stacks for forking.
> >
> > Its surprising that neither lumpy reclaim nor the mobility patches can
> > deal with it? Lumpy reclaim should be able to free neighboring pages to
> > avoid the order 1 failure unless there are lots of pinned pages.
> >
> > I guess then that lots of pages are pinned through I/O?
>
> memory got massively fragemented, as anti-frag gets easily defeated.
> setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
The 12MB is related to the size of pageblock_order. I strongly suspect
that if you forced pageblock_order to be something like 4 or 5, the
min_free_kbytes would not need to be raised. The current values are
selected based on the hugepage size.
> order blocks to stay available, so we don't mix types. however 12M on
> 128M is rather a lot.
>
> its still on my todo list to look at it further..
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 17:33 ` Christoph Lameter
2007-09-28 5:14 ` Nick Piggin
2007-09-28 17:55 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Peter Zijlstra
@ 2007-09-28 21:05 ` Mel Gorman
2007-10-01 21:10 ` Christoph Lameter
2 siblings, 1 reply; 110+ messages in thread
From: Mel Gorman @ 2007-09-28 21:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On (28/09/07 10:33), Christoph Lameter didst pronounce:
> On Fri, 28 Sep 2007, Nick Piggin wrote:
>
> > On Wednesday 19 September 2007 13:36, Christoph Lameter wrote:
> > > SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
> > > available then the conservative settings for higher order allocations are
> > > overridden. We then request an order that can accomodate at mininum
> > > 100 objects. The size of an individual slab allocation is allowed to reach
> > > up to 256k (order 6 on i386, order 4 on IA64).
> >
> > How come SLUB wants such a big amount of objects? I thought the
> > unqueued nature of it made it better than slab because it minimised
> > the amount of cache hot memory lying around in slabs...
>
> The more objects in a page the more the fast path runs. The more the fast
> path runs the lower the cache footprint and the faster the overall
> allocations etc.
>
> SLAB can be configured for large queues holdings lots of objects.
> SLUB can only reach the same through large pages because it does not
> have queues.
Large pages, flood gates etc. Be wary.
SLUB has to run 100% reliably or things go whoops. SLUB regularly depends on
atomic allocations and cannot take the necessary steps to get the contiguous
pages if it gets into trouble. This means that something like lumpy reclaim
cannot help you in its current state.
We currently do not take the pre-emptive steps with kswapd to ensure that the
high-order pages are free. We also don't do something like have users that
can sleep keep the watermarks high. I had considered the possibility but
didn't have the justification for the complexity.
Minimally, SLUB by default should continue to use order-0 pages. Peter has
managed to bust order-1 pages with mem=128MB. Admittedly, it was a really
hostile workload, but the point remains. It was artificially worked around
with min_free_kbytes (the value set based on pageblock_order; it could also
have been artificially worked around by dropping pageblock_order) and he
eventually caused order-0 failures, so the workload is pretty damn hostile
to everything.
> One could add the ability to manage pools of cpu slabs but
> that would be adding yet another layer to compensate for the problem of
> the small pages.
A compromise may be to have per-cpu lists for higher-order pages in the page
allocator itself as they can be easily drained unlike the SLAB queues. The
thing to watch for would be excessive IPI calls which would offset any
performance gained by SLUB using larger pages.
> Reliable large page allocations means that we can get rid
> of these layers and the many workarounds that we have in place right now.
>
They are not reliable yet, particularly for atomic allocs.
> The unqueued nature of SLUB reduces memory requirements and in general the
> more efficient code paths of SLUB offset the advantage that SLAB can reach
> by being able to put more objects onto its queues. SLAB necessarily
> introduces complexity and cache line use through the need to manage those
> queues.
>
> > vmalloc is incredibly slow and unscalable at the moment. I'm still working
> > on making it more scalable and faster -- hopefully to a point where it would
> > actually be usable for this... but you still get moved off large TLBs, and
> > also have to inevitably do tlb flushing.
>
> Again I have not seen any fallbacks to vmalloc in my testing. What we are
> doing here is mainly to address your theoretical cases that we so far have
> never seen to be a problem and increase the reliability of allocations of
> page orders larger than 3 to a usable level. So far I have so far not
> dared to enable orders larger than 3 by default.
>
> AFAICT The performance of vmalloc is not really relevant. If this would
> become an issue then it would be possible to reduce the orders used to
> avoid fallbacks.
>
If we ever fall back to vmalloc, there is a danger that the
problem is merely postponed until vmalloc space is consumed. That is more
of an issue on 32 bit.
> > Or do you have SLUB at a point where performance is comparable to SLAB,
> > and this is just a possible idea for more performance?
>
> AFAICT SLUBs performance is superior to SLAB in most cases and it was like
> that from the beginning. I am still concerned about several corner cases
> though (I think most of them are going to be addressed by the per cpu
> patches in mm). Having a comparable or larger amount of per cpu objects as
> SLAB is something that also could address some of these concerns and could
> increase performance much further.
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 18:41 ` Christoph Lameter
2007-09-28 20:22 ` Nick Piggin
@ 2007-09-28 21:14 ` Mel Gorman
1 sibling, 0 replies; 110+ messages in thread
From: Mel Gorman @ 2007-09-28 21:14 UTC (permalink / raw)
To: Christoph Lameter
Cc: Peter Zijlstra, Nick Piggin, Christoph Hellwig, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On (28/09/07 11:41), Christoph Lameter didst pronounce:
> On Fri, 28 Sep 2007, Peter Zijlstra wrote:
>
> > memory got massively fragemented, as anti-frag gets easily defeated.
> > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> > order blocks to stay available, so we don't mix types. however 12M on
> > 128M is rather a lot.
>
> Yes, strict ordering would be much better. On NUMA it may be possible to
> completely forbid merging.
Forbidding merging is trivial and the code is isolated to one function,
__rmqueue_fallback(). We don't do it because the decision at development
time was that it was better to allow fragmentation than to, for example,
take a reclaim step[1] and slow things up. This is based on my initial
assumption that anti-frag is mainly of interest to hugepages, which are
happy to wait long periods during startup or to fail.
> We can fall back to other nodes if necessary.
> 12M is not much on a NUMA system.
>
> But this shows that (unsurprisingly) we may have issues on systems with a
> small amounts of memory and we may not want to use higher orders on such
> systems.
>
This is another option if you want to use a higher order for SLUB by
default. Use order-0 unless you are sure there is enough memory. At boot,
if there is plenty of memory, set the higher order and raise min_free_kbytes on
each node to reduce mixing[2]. We can test with Peter's uber-hostile
case to see if it works[3].
> The case you got may be good to use as a testcase for the virtual
> fallback. Hmmmm...
For sure.
> Maybe it is possible to allocate the stack as a virtual
> compound page. Got some script/code to produce that problem?
>
[1] It might be tunnel vision but I still keep hugepages in mind as the
principal user of anti-frag. Andy used to have patches that force evicted
pages of the "foreign" type when mixing occurred, so the end result was
no mixing. We never fully completed them because it was too costly
for hugepages.
[2] This would require the identification of mixed blocks to be a
statistic available in mainline. Right now, it's only available in -mm
when PAGE_OWNER is set
[3] The definition of working in this case is that order-0
allocations fail, which he has produced
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 18:25 ` Peter Zijlstra
2007-09-28 18:41 ` Christoph Lameter
2007-09-28 20:59 ` Mel Gorman
@ 2007-09-29 8:13 ` Andrew Morton
2007-09-29 8:47 ` Peter Zijlstra
2 siblings, 1 reply; 110+ messages in thread
From: Andrew Morton @ 2007-09-29 8:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
>
> > > start 2 processes that each mmap a separate 64M file, and which does
> > > sequential writes on them. start a 3th process that does the same with
> > > 64M anonymous.
> > >
> > > wait for a while, and you'll see order=1 failures.
> >
> > Really? That means we can no longer even allocate stacks for forking.
> >
> > Its surprising that neither lumpy reclaim nor the mobility patches can
> > deal with it? Lumpy reclaim should be able to free neighboring pages to
> > avoid the order 1 failure unless there are lots of pinned pages.
> >
> > I guess then that lots of pages are pinned through I/O?
>
> memory got massively fragemented, as anti-frag gets easily defeated.
> setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> order blocks to stay available, so we don't mix types. however 12M on
> 128M is rather a lot.
>
> its still on my todo list to look at it further..
>
That would be really really bad (as in: patch-dropping time) if those
order-1 allocations are not atomic.
What's the callsite?
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 18:20 ` Christoph Lameter
2007-09-28 18:25 ` Peter Zijlstra
@ 2007-09-29 8:45 ` Peter Zijlstra
2007-10-01 21:01 ` Christoph Lameter
1 sibling, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2007-09-29 8:45 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
> Really? That means we can no longer even allocate stacks for forking.
I think I'm running with 4k stacks...
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 8:13 ` Andrew Morton
@ 2007-09-29 8:47 ` Peter Zijlstra
2007-09-29 8:53 ` Peter Zijlstra
2007-09-29 9:00 ` Andrew Morton
0 siblings, 2 replies; 110+ messages in thread
From: Peter Zijlstra @ 2007-09-29 8:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Sat, 2007-09-29 at 01:13 -0700, Andrew Morton wrote:
> On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> >
> > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
> >
> > > > start 2 processes that each mmap a separate 64M file, and which does
> > > > sequential writes on them. start a 3th process that does the same with
> > > > 64M anonymous.
> > > >
> > > > wait for a while, and you'll see order=1 failures.
> > >
> > > Really? That means we can no longer even allocate stacks for forking.
> > >
> > > Its surprising that neither lumpy reclaim nor the mobility patches can
> > > deal with it? Lumpy reclaim should be able to free neighboring pages to
> > > avoid the order 1 failure unless there are lots of pinned pages.
> > >
> > > I guess then that lots of pages are pinned through I/O?
> >
> > memory got massively fragemented, as anti-frag gets easily defeated.
> > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> > order blocks to stay available, so we don't mix types. however 12M on
> > 128M is rather a lot.
> >
> > its still on my todo list to look at it further..
> >
>
> That would be really really bad (as in: patch-dropping time) if those
> order-1 allocations are not atomic.
>
> What's the callsite?
Ah, right, that was the detail... all this lumpy reclaim is useless for
atomic allocations. And with SLUB using higher order pages, atomic !0
order allocations will be very very common.
One I can remember was:
add_to_page_cache()
radix_tree_insert()
radix_tree_node_alloc()
kmem_cache_alloc()
which is an atomic callsite.
Which leaves us in a situation where we can load pages, because there is
free memory, but can't manage to allocate the memory to track them...
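For context, a simplified sketch of the 2.6.23-era add_to_page_cache()
path (details elided; the exact code may differ slightly): the radix tree
nodes are preloaded with a sleeping allocation, but the page-cache tree is
declared with GFP_ATOMIC, so a node allocated while mapping->tree_lock is
held is an atomic allocation, with the per-cpu preload pool as the only
fallback.

	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
	if (error == 0) {
		write_lock_irq(&mapping->tree_lock);
		/* any node allocated here cannot sleep */
		error = radix_tree_insert(&mapping->page_tree, offset, page);
		/* ... page accounting elided ... */
		write_unlock_irq(&mapping->tree_lock);
		radix_tree_preload_end();
	}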
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 8:47 ` Peter Zijlstra
@ 2007-09-29 8:53 ` Peter Zijlstra
2007-09-29 9:01 ` Andrew Morton
2007-09-29 9:00 ` Andrew Morton
1 sibling, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2007-09-29 8:53 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote:
> Ah, right, that was the detail... all this lumpy reclaim is useless for
> atomic allocations. And with SLUB using higher order pages, atomic !0
> order allocations will be very very common.
>
> One I can remember was:
>
> add_to_page_cache()
> radix_tree_insert()
> radix_tree_node_alloc()
> kmem_cache_alloc()
>
> which is an atomic callsite.
>
> Which leaves us in a situation where we can load pages, because there is
> free memory, but can't manage to allocate memory to track them..
Ah, I found a boot log of one of these sessions; it's also full of
order-2 OOMs... :-/
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 8:47 ` Peter Zijlstra
2007-09-29 8:53 ` Peter Zijlstra
@ 2007-09-29 9:00 ` Andrew Morton
2007-10-01 20:55 ` Christoph Lameter
1 sibling, 1 reply; 110+ messages in thread
From: Andrew Morton @ 2007-09-29 9:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Sat, 29 Sep 2007 10:47:12 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> On Sat, 2007-09-29 at 01:13 -0700, Andrew Morton wrote:
> > On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > >
> > > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
> > >
> > > > > start 2 processes that each mmap a separate 64M file, and which does
> > > > > sequential writes on them. start a 3th process that does the same with
> > > > > 64M anonymous.
> > > > >
> > > > > wait for a while, and you'll see order=1 failures.
> > > >
> > > > Really? That means we can no longer even allocate stacks for forking.
> > > >
> > > > Its surprising that neither lumpy reclaim nor the mobility patches can
> > > > deal with it? Lumpy reclaim should be able to free neighboring pages to
> > > > avoid the order 1 failure unless there are lots of pinned pages.
> > > >
> > > > I guess then that lots of pages are pinned through I/O?
> > >
> > > memory got massively fragemented, as anti-frag gets easily defeated.
> > > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> > > order blocks to stay available, so we don't mix types. however 12M on
> > > 128M is rather a lot.
> > >
> > > its still on my todo list to look at it further..
> > >
> >
> > That would be really really bad (as in: patch-dropping time) if those
> > order-1 allocations are not atomic.
> >
> > What's the callsite?
>
> Ah, right, that was the detail... all this lumpy reclaim is useless for
> atomic allocations. And with SLUB using higher order pages, atomic !0
> order allocations will be very very common.
Oh OK.
I thought we'd already fixed slub so that it didn't do that. Maybe that
fix is in -mm but I don't think so.
Trying to do atomic order-1 allocations on behalf of arbitrary slab caches
just won't fly - this is a significant degradation in kernel reliability,
as you've very easily demonstrated.
> One I can remember was:
>
> add_to_page_cache()
> radix_tree_insert()
> radix_tree_node_alloc()
> kmem_cache_alloc()
>
> which is an atomic callsite.
>
> Which leaves us in a situation where we can load pages, because there is
> free memory, but can't manage to allocate memory to track them..
Right. Leading to application failure, which for many is equivalent to a
complete system outage.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 8:53 ` Peter Zijlstra
@ 2007-09-29 9:01 ` Andrew Morton
2007-09-29 9:14 ` Peter Zijlstra
0 siblings, 1 reply; 110+ messages in thread
From: Andrew Morton @ 2007-09-29 9:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Sat, 29 Sep 2007 10:53:41 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote:
>
> > Ah, right, that was the detail... all this lumpy reclaim is useless for
> > atomic allocations. And with SLUB using higher order pages, atomic !0
> > order allocations will be very very common.
> >
> > One I can remember was:
> >
> > add_to_page_cache()
> > radix_tree_insert()
> > radix_tree_node_alloc()
> > kmem_cache_alloc()
> >
> > which is an atomic callsite.
> >
> > Which leaves us in a situation where we can load pages, because there is
> > free memory, but can't manage to allocate memory to track them..
>
> Ah, I found a boot log of one of these sessions, its also full of
> order-2 OOMs.. :-/
oom-killings, or page allocation failures? The latter, one hopes.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 9:01 ` Andrew Morton
@ 2007-09-29 9:14 ` Peter Zijlstra
2007-09-29 9:27 ` Andrew Morton
0 siblings, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2007-09-29 9:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Sat, 2007-09-29 at 02:01 -0700, Andrew Morton wrote:
> On Sat, 29 Sep 2007 10:53:41 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> >
> > On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote:
> >
> > > Ah, right, that was the detail... all this lumpy reclaim is useless for
> > > atomic allocations. And with SLUB using higher order pages, atomic !0
> > > order allocations will be very very common.
> > >
> > > One I can remember was:
> > >
> > > add_to_page_cache()
> > > radix_tree_insert()
> > > radix_tree_node_alloc()
> > > kmem_cache_alloc()
> > >
> > > which is an atomic callsite.
> > >
> > > Which leaves us in a situation where we can load pages, because there is
> > > free memory, but can't manage to allocate memory to track them..
> >
> > Ah, I found a boot log of one of these sessions, its also full of
> > order-2 OOMs.. :-/
>
> oom-killings, or page allocation failures? The latter, one hopes.
Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
...
mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
Call Trace:
611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17
611b3888: [<60052ed4>] out_of_memory+0x80/0x100
611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280
611b3948: [<6006c608>] allocate_slab+0x5b/0xb0
611b3968: [<6006c705>] new_slab+0x7e/0x183
611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b
611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf
611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472
611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf
611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98
611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf
611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7
611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109
611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2
611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1
611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a
611b3b98: [<60056f00>] read_pages+0x37/0x9b
611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157
611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f
611b3c78: [<60050ab4>] filemap_fault+0x145/0x278
611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd
611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca
611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33
611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246
611b3da8: [<60120a7b>] __up_read+0x73/0x7b
611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b
611b3e48: [<60013419>] segv+0xac/0x297
611b3f28: [<60013367>] segv_handler+0x68/0x6e
611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1
611b3f68: [<60023853>] userspace+0x13a/0x19d
611b3fc8: [<60010d58>] fork_handler+0x86/0x8d
Mem-info:
Normal per-cpu:
CPU 0: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
Active:11 inactive:9 dirty:0 writeback:1 unstable:0
free:19533 slab:10587 mapped:0 pagetables:260 bounce:0
Normal free:78132kB min:4096kB low:5120kB high:6144kB active:44kB inactive:36kB present:129280kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0
Normal: 7503*4kB 5977*8kB 19*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 78132kB
Swap cache: add 1192822, delete 1192790, find 491441/626861, race 0+1
Free swap = 455300kB
Total swap = 524280kB
Free swap: 455300kB
32768 pages of RAM
0 pages of HIGHMEM
1948 reserved pages
11 pages shared
32 pages swap cached
Out of memory: kill process 2647 (portmap) score 2233 or a child
Killed process 2647 (portmap)
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 9:14 ` Peter Zijlstra
@ 2007-09-29 9:27 ` Andrew Morton
2007-09-28 20:19 ` Nick Piggin
0 siblings, 1 reply; 110+ messages in thread
From: Andrew Morton @ 2007-09-29 9:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Lameter, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > oom-killings, or page allocation failures? The latter, one hopes.
>
>
> Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
>
> ...
>
>
> mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
> Call Trace:
> 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17
> 611b3888: [<60052ed4>] out_of_memory+0x80/0x100
> 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280
> 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0
> 611b3968: [<6006c705>] new_slab+0x7e/0x183
> 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b
> 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf
> 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472
> 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf
> 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98
> 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf
> 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7
> 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
> 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109
> 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2
> 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1
> 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a
> 611b3b98: [<60056f00>] read_pages+0x37/0x9b
> 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157
> 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f
> 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278
> 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd
> 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca
> 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33
> 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246
> 611b3da8: [<60120a7b>] __up_read+0x73/0x7b
> 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b
> 611b3e48: [<60013419>] segv+0xac/0x297
> 611b3f28: [<60013367>] segv_handler+0x68/0x6e
> 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1
> 611b3f68: [<60023853>] userspace+0x13a/0x19d
> 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d
OK, that's different. Someone broke the vm - order-2 GFP_KERNEL
allocations aren't supposed to fail.
I'm suspecting that did_some_progress thing.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 19:20 ` Andrew Morton
@ 2007-09-29 19:09 ` Nick Piggin
2007-09-30 20:12 ` Andrew Morton
0 siblings, 1 reply; 110+ messages in thread
From: Nick Piggin @ 2007-09-29 19:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Zijlstra, Christoph Lameter, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Sunday 30 September 2007 05:20, Andrew Morton wrote:
> On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin <nickpiggin@yahoo.com.au>
wrote:
> > On Saturday 29 September 2007 19:27, Andrew Morton wrote:
> > > On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra
> > > <a.p.zijlstra@chello.nl>
> >
> > wrote:
> > > > > oom-killings, or page allocation failures? The latter, one hopes.
> > > >
> > > > Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2
> > > > (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
> > > >
> > > > ...
> > > >
> > > >
> > > > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
> > > > Call Trace:
> > > > 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17
> > > > 611b3888: [<60052ed4>] out_of_memory+0x80/0x100
> > > > 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280
> > > > 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0
> > > > 611b3968: [<6006c705>] new_slab+0x7e/0x183
> > > > 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b
> > > > 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > > 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472
> > > > 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > > 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98
> > > > 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > > 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7
> > > > 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
> > > > 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109
> > > > 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2
> > > > 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1
> > > > 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a
> > > > 611b3b98: [<60056f00>] read_pages+0x37/0x9b
> > > > 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157
> > > > 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f
> > > > 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278
> > > > 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd
> > > > 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca
> > > > 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33
> > > > 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246
> > > > 611b3da8: [<60120a7b>] __up_read+0x73/0x7b
> > > > 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b
> > > > 611b3e48: [<60013419>] segv+0xac/0x297
> > > > 611b3f28: [<60013367>] segv_handler+0x68/0x6e
> > > > 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1
> > > > 611b3f68: [<60023853>] userspace+0x13a/0x19d
> > > > 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d
> > >
> > > OK, that's different. Someone broke the vm - order-2 GFP_KERNEL
> > > allocations aren't supposed to fail.
> > >
> > > I'm suspecting that did_some_progress thing.
> >
> > The allocation didn't fail -- it invoked the OOM killer because the
> > kernel ran out of unfragmented memory.
>
> We can't "run out of unfragmented memory" for an order-2 GFP_KERNEL
> allocation in this workload. We go and synchronously free stuff up to make
> it work.
>
> How did this get broken?
Either no more order-2 pages could be freed, or the ones that were being
freed were being used by something else (eg. other order-2 slab allocations).
> > Probably because higher order
> > allocations are the new vogue in -mm at the moment ;)
>
> That's a different bug.
>
> bug 1: We shouldn't be doing higher-order allocations in slub because of
> the considerable damage this does to atomic allocations.
>
> bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this.
I think bug 1 causes bug 2 as well -- it isn't just considerable damage to atomic
allocations but to GFP_KERNEL allocations too.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 20:19 ` Nick Piggin
@ 2007-09-29 19:20 ` Andrew Morton
2007-09-29 19:09 ` Nick Piggin
0 siblings, 1 reply; 110+ messages in thread
From: Andrew Morton @ 2007-09-29 19:20 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Christoph Lameter, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> On Saturday 29 September 2007 19:27, Andrew Morton wrote:
> > On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl>
> wrote:
> > > > oom-killings, or page allocation failures? The latter, one hopes.
> > >
> > > Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 (Ubuntu
> > > 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
> > >
> > > ...
> > >
> > >
> > > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
> > > Call Trace:
> > > 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17
> > > 611b3888: [<60052ed4>] out_of_memory+0x80/0x100
> > > 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280
> > > 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0
> > > 611b3968: [<6006c705>] new_slab+0x7e/0x183
> > > 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b
> > > 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472
> > > 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98
> > > 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7
> > > 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
> > > 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109
> > > 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2
> > > 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1
> > > 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a
> > > 611b3b98: [<60056f00>] read_pages+0x37/0x9b
> > > 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157
> > > 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f
> > > 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278
> > > 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd
> > > 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca
> > > 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33
> > > 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246
> > > 611b3da8: [<60120a7b>] __up_read+0x73/0x7b
> > > 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b
> > > 611b3e48: [<60013419>] segv+0xac/0x297
> > > 611b3f28: [<60013367>] segv_handler+0x68/0x6e
> > > 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1
> > > 611b3f68: [<60023853>] userspace+0x13a/0x19d
> > > 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d
> >
> > OK, that's different. Someone broke the vm - order-2 GFP_KERNEL
> > allocations aren't supposed to fail.
> >
> > I'm suspecting that did_some_progress thing.
>
> The allocation didn't fail -- it invoked the OOM killer because the kernel
> ran out of unfragmented memory.
We can't "run out of unfragmented memory" for an order-2 GFP_KERNEL
allocation in this workload. We go and synchronously free stuff up to make
it work.
How did this get broken?
> Probably because higher order
> allocations are the new vogue in -mm at the moment ;)
That's a different bug.
bug 1: We shouldn't be doing higher-order allocations in slub because of
the considerable damage this does to atomic allocations.
bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-30 20:12 ` Andrew Morton
@ 2007-09-30 4:16 ` Nick Piggin
0 siblings, 0 replies; 110+ messages in thread
From: Nick Piggin @ 2007-09-30 4:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Zijlstra, Christoph Lameter, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Monday 01 October 2007 06:12, Andrew Morton wrote:
> On Sun, 30 Sep 2007 05:09:28 +1000 Nick Piggin <nickpiggin@yahoo.com.au>
wrote:
> > On Sunday 30 September 2007 05:20, Andrew Morton wrote:
> > > We can't "run out of unfragmented memory" for an order-2 GFP_KERNEL
> > > allocation in this workload. We go and synchronously free stuff up to
> > > make it work.
> > >
> > > How did this get broken?
> >
> > Either no more order-2 pages could be freed, or the ones that were being
> > freed were being used by something else (eg. other order-2 slab
> > allocations).
>
> No. The current design of reclaim (for better or for worse) is that for
> order 0,1,2 and 3 allocations we just keep on trying until it works. That
> got broken and I think it got broken at a design level when that
> did_some_progress logic went in. Perhaps something else we did later
> worsened things.
It will keep trying until it works. It won't have stopped trying (unless
I'm very mistaken?); it's just oom-killing things merrily along the way.
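Both points fit the shape of the 2.6.23-era __alloc_pages() slow path;
a simplified sketch (the general shape of that code; exact details may
differ, and surrounding labels and setup are elided):

rebalance:
	did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);

	if (likely(did_some_progress)) {
		page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
		if (page)
			goto got_pg;
	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
		/* no reclaim progress: invoke the OOM killer and start over */
		out_of_memory(zonelist, gfp_mask, order);
		goto restart;
	}

	/* orders up to PAGE_ALLOC_COSTLY_ORDER (3) keep retrying */
	if (!(gfp_mask & __GFP_NORETRY) &&
	    (order <= PAGE_ALLOC_COSTLY_ORDER || (gfp_mask & __GFP_REPEAT))) {
		congestion_wait(WRITE, HZ/50);
		goto rebalance;
	}

So an order-2 GFP_KERNEL allocation never gives up, but each pass that makes
no reclaim progress may OOM-kill something along the way.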
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 19:09 ` Nick Piggin
@ 2007-09-30 20:12 ` Andrew Morton
2007-09-30 4:16 ` Nick Piggin
0 siblings, 1 reply; 110+ messages in thread
From: Andrew Morton @ 2007-09-30 20:12 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Christoph Lameter, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Sun, 30 Sep 2007 05:09:28 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> On Sunday 30 September 2007 05:20, Andrew Morton wrote:
> > On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin <nickpiggin@yahoo.com.au>
> wrote:
> > > On Saturday 29 September 2007 19:27, Andrew Morton wrote:
> > > > On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra
> > > > <a.p.zijlstra@chello.nl>
> > >
> > > wrote:
> > > > > > oom-killings, or page allocation failures? The latter, one hopes.
> > > > >
> > > > > Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2
> > > > > (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
> > > > >
> > > > > ...
> > > > >
> > > > >
> > > > > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
> > > > > Call Trace:
> > > > > 611b3878: [<6002dd28>] printk_ratelimit+0x15/0x17
> > > > > 611b3888: [<60052ed4>] out_of_memory+0x80/0x100
> > > > > 611b38c8: [<60054b0c>] __alloc_pages+0x1ed/0x280
> > > > > 611b3948: [<6006c608>] allocate_slab+0x5b/0xb0
> > > > > 611b3968: [<6006c705>] new_slab+0x7e/0x183
> > > > > 611b39a8: [<6006cbae>] __slab_alloc+0xc9/0x14b
> > > > > 611b39b0: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > > > 611b39b8: [<600980f2>] do_mpage_readpage+0x3b3/0x472
> > > > > 611b39e0: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > > > 611b39f8: [<6006cc81>] kmem_cache_alloc+0x51/0x98
> > > > > 611b3a38: [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > > > 611b3a58: [<6004f8e2>] add_to_page_cache+0x22/0xf7
> > > > > 611b3a98: [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
> > > > > 611b3ab8: [<6009821e>] mpage_readpages+0x6d/0x109
> > > > > 611b3ac0: [<600d59f0>] ext3_get_block+0x0/0xf2
> > > > > 611b3b08: [<6005483d>] get_page_from_freelist+0x8d/0xc1
> > > > > 611b3b88: [<600d6937>] ext3_readpages+0x18/0x1a
> > > > > 611b3b98: [<60056f00>] read_pages+0x37/0x9b
> > > > > 611b3bd8: [<60057064>] __do_page_cache_readahead+0x100/0x157
> > > > > 611b3c48: [<60057196>] do_page_cache_readahead+0x52/0x5f
> > > > > 611b3c78: [<60050ab4>] filemap_fault+0x145/0x278
> > > > > 611b3ca8: [<60022b61>] run_syscall_stub+0xd1/0xdd
> > > > > 611b3ce8: [<6005eae3>] __do_fault+0x7e/0x3ca
> > > > > 611b3d68: [<6005ee60>] do_linear_fault+0x31/0x33
> > > > > 611b3d88: [<6005f149>] handle_mm_fault+0x14e/0x246
> > > > > 611b3da8: [<60120a7b>] __up_read+0x73/0x7b
> > > > > 611b3de8: [<60013177>] handle_page_fault+0x11f/0x23b
> > > > > 611b3e48: [<60013419>] segv+0xac/0x297
> > > > > 611b3f28: [<60013367>] segv_handler+0x68/0x6e
> > > > > 611b3f48: [<600232ad>] get_skas_faultinfo+0x9c/0xa1
> > > > > 611b3f68: [<60023853>] userspace+0x13a/0x19d
> > > > > 611b3fc8: [<60010d58>] fork_handler+0x86/0x8d
> > > >
> > > > OK, that's different. Someone broke the vm - order-2 GFP_KERNEL
> > > > allocations aren't supposed to fail.
> > > >
> > > > I'm suspecting that did_some_progress thing.
> > >
> > > The allocation didn't fail -- it invoked the OOM killer because the
> > > kernel ran out of unfragmented memory.
> >
> > We can't "run out of unfragmented memory" for an order-2 GFP_KERNEL
> > allocation in this workload. We go and synchronously free stuff up to make
> > it work.
> >
> > How did this get broken?
>
> Either no more order-2 pages could be freed, or the ones that were being
> freed were being used by something else (eg. other order-2 slab allocations).
No. The current design of reclaim (for better or for worse) is that for
order 0,1,2 and 3 allocations we just keep on trying until it works. That
got broken and I think it got broken at a design level when that
did_some_progress logic went in. Perhaps something else we did later
worsened things.
>
> > > Probably because higher order
> > > allocations are the new vogue in -mm at the moment ;)
> >
> > That's a different bug.
> >
> > bug 1: We shouldn't be doing higher-order allocations in slub because of
> > the considerable damage this does to atomic allocations.
> >
> > bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this.
>
> I think one causes 2 as well -- it isn't just considerable damage to atomic
> allocations but to GFP_KERNEL allocations too.
Well sure, because we already broke GFP_KERNEL allocations.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 5:14 ` Nick Piggin
@ 2007-10-01 20:50 ` Christoph Lameter
2007-10-02 8:43 ` Nick Piggin
2007-10-04 16:16 ` SLUB performance regression vs SLAB Matthew Wilcox
0 siblings, 2 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-10-01 20:50 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Fri, 28 Sep 2007, Nick Piggin wrote:
> I thought it was slower. Have you fixed the performance regression?
> (OK, I read further down that you are still working on it but not confirmed
> yet...)
The problem is the weird way Intel handles testing and communication.
Every 3-6 months or so they will tell you the system is X% up or down on
arch Y (and they won't give you details because it's somehow secret). And
then there are conflicting statements from the two or so performance test
departments. One of them repeatedly assured me that they do not see any
regressions.
> OK, so long as it isn't going to depend on using higher order pages, that's
> fine. (if they help even further as an optional thing, that's fine too. You
> can turn them on your huge systems and not even bother about adding
> this vmap fallback -- you won't have me to nag you about these
> purely theoretical issues).
Well the vmap fallback is generally useful AFAICT. Higher order
allocations are common on some of our platforms. Order 1 failures even
affect essential things like stacks that have nothing to do with SLUB and
the LBS patchset.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 9:00 ` Andrew Morton
@ 2007-10-01 20:55 ` Christoph Lameter
2007-10-01 21:30 ` Andrew Morton
0 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-10-01 20:55 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Zijlstra, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Sat, 29 Sep 2007, Andrew Morton wrote:
> > atomic allocations. And with SLUB using higher order pages, atomic !0
> > order allocations will be very very common.
>
> Oh OK.
>
> I thought we'd already fixed slub so that it didn't do that. Maybe that
> fix is in -mm but I don't think so.
>
> Trying to do atomic order-1 allocations on behalf of arbitray slab caches
> just won't fly - this is a significant degradation in kernel reliability,
> as you've very easily demonstrated.
Ummm... SLAB also does order 1 allocations. We have always done them.
See mm/slab.c
/*
* Do not go above this order unless 0 objects fit into the slab.
*/
#define BREAK_GFP_ORDER_HI 1
#define BREAK_GFP_ORDER_LO 0
static int slab_break_gfp_order = BREAK_GFP_ORDER_LO;
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-29 8:45 ` Peter Zijlstra
@ 2007-10-01 21:01 ` Christoph Lameter
2007-10-02 8:37 ` Nick Piggin
0 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-10-01 21:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe, akpm
On Sat, 29 Sep 2007, Peter Zijlstra wrote:
>
> On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
>
> > Really? That means we can no longer even allocate stacks for forking.
>
> I think I'm running with 4k stacks...
4k stacks will never fly on an SGI x86_64 NUMA configuration given the
additional data that may be kept on the stack. We are currently
considering going from 8k to 16k (or even 32k) to make things work. So
having the ability to put the stacks in vmalloc space may be something to
look at.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-09-28 21:05 ` Mel Gorman
@ 2007-10-01 21:10 ` Christoph Lameter
0 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-10-01 21:10 UTC (permalink / raw)
To: Mel Gorman
Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe, akpm
On Fri, 28 Sep 2007, Mel Gorman wrote:
> Minimally, SLUB by default should continue to use order-0 pages. Peter has
> managed to bust order-1 pages with mem=128MB. Admittedly, it was a really
> hostile workload but the point remains. It was artifically worked around
> with min_free_kbytes (value set based on pageblock_order, could also have
> been artifically worked around by dropping pageblock_order) and he eventually
> caused order-0 failures so the workload is pretty damn hostile to everything.
SLAB's default maximum order is 1, and so is SLUB's default upstream.
SLAB does runtime detection of the amount of memory and configures the max
order correspondingly:
from mm/slab.c:
/*
* Fragmentation resistance on low memory - only use bigger
* page orders on machines with more than 32MB of memory.
*/
if (num_physpages > (32 << 20) >> PAGE_SHIFT)
slab_break_gfp_order = BREAK_GFP_ORDER_HI;
We could duplicate something like that for SLUB.
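A minimal sketch of what the SLUB side of that could look like (hypothetical
function name, not existing SLUB code; slub_max_order is the existing boot
tunable, and num_physpages and the 32MB threshold are taken from the mm/slab.c
fragment above):

static void __init slub_adjust_max_order(void)
{
	/*
	 * Fragmentation resistance on low memory: stay with order-0
	 * slabs on machines with 32MB of memory or less, as SLAB does.
	 */
	if (num_physpages <= (32 << 20) >> PAGE_SHIFT)
		slub_max_order = 0;
}

Per-cache order selection would still have to respect slub_min_objects, so
this would only be the coarse machine-size cap.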
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-10-01 20:55 ` Christoph Lameter
@ 2007-10-01 21:30 ` Andrew Morton
2007-10-01 21:38 ` Christoph Lameter
2007-10-02 9:19 ` Peter Zijlstra
0 siblings, 2 replies; 110+ messages in thread
From: Andrew Morton @ 2007-10-01 21:30 UTC (permalink / raw)
To: Christoph Lameter
Cc: a.p.zijlstra, nickpiggin, hch, mel, linux-fsdevel, linux-kernel,
dgc, jens.axboe
On Mon, 1 Oct 2007 13:55:29 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Sat, 29 Sep 2007, Andrew Morton wrote:
>
> > > atomic allocations. And with SLUB using higher order pages, atomic !0
> > > order allocations will be very very common.
> >
> > Oh OK.
> >
> > I thought we'd already fixed slub so that it didn't do that. Maybe that
> > fix is in -mm but I don't think so.
> >
> > Trying to do atomic order-1 allocations on behalf of arbitray slab caches
> > just won't fly - this is a significant degradation in kernel reliability,
> > as you've very easily demonstrated.
>
> Ummm... SLAB also does order 1 allocations. We have always done them.
>
> See mm/slab.c
>
> /*
> * Do not go above this order unless 0 objects fit into the slab.
> */
> #define BREAK_GFP_ORDER_HI 1
> #define BREAK_GFP_ORDER_LO 0
> static int slab_break_gfp_order = BREAK_GFP_ORDER_LO;
Do slab and slub use the same underlying page size for each slab?
Single data point: the CONFIG_SLAB boxes which I have access to here are
using order-0 for radix_tree_node, so they won't be failing in the way in
which Peter's machine is.
I've never ever before seen reports of page allocation failures in the
radix-tree node allocation code, and that's the bottom line. This is just
a drop-dead must-fix show-stopping bug. We cannot rely upon atomic order-1
allocations succeeding so we cannot use them for radix-tree nodes. Nor for
lots of other things which we have no chance of identifying.
Peter, is this bug -mm only, or is 2.6.23 similarly failing?
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-10-01 21:30 ` Andrew Morton
@ 2007-10-01 21:38 ` Christoph Lameter
2007-10-01 21:45 ` Andrew Morton
2007-10-02 9:19 ` Peter Zijlstra
1 sibling, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-10-01 21:38 UTC (permalink / raw)
To: Andrew Morton
Cc: a.p.zijlstra, nickpiggin, hch, mel, linux-fsdevel, linux-kernel,
dgc, jens.axboe
On Mon, 1 Oct 2007, Andrew Morton wrote:
> Do slab and slub use the same underlying page size for each slab?
SLAB cannot pack objects as densely as SLUB, and the two use different
algorithms to choose the order. Thus the number of objects per slab
may vary between SLAB and SLUB, and therefore so may the order chosen to
store these objects.
> Single data point: the CONFIG_SLAB boxes which I have access to here are
> using order-0 for radix_tree_node, so they won't be failing in the way in
> which Peter's machine is.
Upstream SLUB uses order-0 allocations for the radix tree. -mm varies
because the use of higher-order allocs is looser if the mobility
algorithms are found to be active:
2.6.23-rc8:
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg\
radix_tree_node 14281 552 9.9M 2432/948/1 7 0 38 79
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-10-01 21:38 ` Christoph Lameter
@ 2007-10-01 21:45 ` Andrew Morton
2007-10-01 21:52 ` Christoph Lameter
0 siblings, 1 reply; 110+ messages in thread
From: Andrew Morton @ 2007-10-01 21:45 UTC (permalink / raw)
To: Christoph Lameter
Cc: a.p.zijlstra, nickpiggin, hch, mel, linux-fsdevel, linux-kernel,
dgc, jens.axboe
On Mon, 1 Oct 2007 14:38:55 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Mon, 1 Oct 2007, Andrew Morton wrote:
>
> > Do slab and slub use the same underlying page size for each slab?
>
> SLAB cannot pack objects as dense as SLUB and they have different
> algorithm to make the choice of order. Thus the number of objects per slab
> may vary between SLAB and SLUB and therefore also the choice of order to
> store these objects.
>
> > Single data point: the CONFIG_SLAB boxes which I have access to here are
> > using order-0 for radix_tree_node, so they won't be failing in the way in
> > which Peter's machine is.
>
> Upstream SLUB uses order 0 allocations for the radix tree.
OK, that's a relief.
> MM varies
> because the use of higher order allocs is more loose if the mobility
> algorithms are found to be active:
>
> 2.6.23-rc8:
>
> Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg\
> radix_tree_node 14281 552 9.9M 2432/948/1 7 0 38 79
Ah. So the already-dropped
slub-exploit-page-mobility-to-increase-allocation-order.patch was the
culprit?
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-10-01 21:45 ` Andrew Morton
@ 2007-10-01 21:52 ` Christoph Lameter
0 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-10-01 21:52 UTC (permalink / raw)
To: Andrew Morton
Cc: a.p.zijlstra, nickpiggin, hch, mel, linux-fsdevel, linux-kernel,
dgc, jens.axboe
On Mon, 1 Oct 2007, Andrew Morton wrote:
> Ah. So the already-dropped
> slub-exploit-page-mobility-to-increase-allocation-order.patch was the
> culprit?
Yes, without that patch SLUB no longer takes special action if antifrag
is around.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-10-01 21:01 ` Christoph Lameter
@ 2007-10-02 8:37 ` Nick Piggin
0 siblings, 0 replies; 110+ messages in thread
From: Nick Piggin @ 2007-10-02 8:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Peter Zijlstra, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe, akpm
On Tuesday 02 October 2007 07:01, Christoph Lameter wrote:
> On Sat, 29 Sep 2007, Peter Zijlstra wrote:
> > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
> > > Really? That means we can no longer even allocate stacks for forking.
> >
> > I think I'm running with 4k stacks...
>
> 4k stacks will never fly on an SGI x86_64 NUMA configuration given the
> additional data that may be kept on the stack. We are currently
> considering to go from 8k to 16k (or even 32k) to make things work. So
> having the ability to put the stacks in vmalloc space may be something to
> look at.
i386 and x86-64 have used 8K stacks for years, and they have never
really been much of a problem before.
They only started failing when contiguous memory was getting used up
by other things, _even with_ those anti-frag patches in there.
Bottom line is that you do not use higher order allocations when you do
not need them.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-10-01 20:50 ` Christoph Lameter
@ 2007-10-02 8:43 ` Nick Piggin
2007-10-04 16:16 ` SLUB performance regression vs SLAB Matthew Wilcox
1 sibling, 0 replies; 110+ messages in thread
From: Nick Piggin @ 2007-10-02 8:43 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Mel Gorman, linux-fsdevel, linux-kernel,
David Chinner, Jens Axboe
On Tuesday 02 October 2007 06:50, Christoph Lameter wrote:
> On Fri, 28 Sep 2007, Nick Piggin wrote:
> > I thought it was slower. Have you fixed the performance regression?
> > (OK, I read further down that you are still working on it but not
> > confirmed yet...)
>
> The problem is with the weird way of Intel testing and communication.
> Every 3-6 month or so they will tell you the system is X% up or down on
> arch Y (and they wont give you details because its somehow secret). And
> then there are conflicting statements by the two or so performance test
> departments. One of them repeatedly assured me that they do not see any
> regressions.
Just so long as there aren't known regressions that would require higher
order allocations to fix them.
> > OK, so long as it isn't going to depend on using higher order pages,
> > that's fine. (if they help even further as an optional thing, that's fine
> > too. You can turn them on your huge systems and not even bother about
> > adding this vmap fallback -- you won't have me to nag you about these
> > purely theoretical issues).
>
> Well the vmap fallback is generally useful AFAICT. Higher order
> allocations are common on some of our platforms. Order 1 failures even
> affect essential things like stacks that have nothing to do with SLUB and
> the LBS patchset.
I don't know if it is worth the trouble, though. The best thing to do is to
ensure that contiguous memory is not wasted on frivolous things... a few
order-1 or 2 allocations aren't too much of a problem.
The only high-order allocation failures I've seen from fragmentation for a
long time, IIRC, are the order-3 failures coming from e1000. And obviously
they cannot use vmap.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
2007-10-01 21:30 ` Andrew Morton
2007-10-01 21:38 ` Christoph Lameter
@ 2007-10-02 9:19 ` Peter Zijlstra
1 sibling, 0 replies; 110+ messages in thread
From: Peter Zijlstra @ 2007-10-02 9:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe
On Mon, 2007-10-01 at 14:30 -0700, Andrew Morton wrote:
> On Mon, 1 Oct 2007 13:55:29 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Sat, 29 Sep 2007, Andrew Morton wrote:
> >
> > > > atomic allocations. And with SLUB using higher order pages, atomic !0
> > > > order allocations will be very very common.
> > >
> > > Oh OK.
> > >
> > > I thought we'd already fixed slub so that it didn't do that. Maybe that
> > > fix is in -mm but I don't think so.
> > >
> > > Trying to do atomic order-1 allocations on behalf of arbitray slab caches
> > > just won't fly - this is a significant degradation in kernel reliability,
> > > as you've very easily demonstrated.
> >
> > Ummm... SLAB also does order 1 allocations. We have always done them.
> >
> > See mm/slab.c
> >
> > /*
> > * Do not go above this order unless 0 objects fit into the slab.
> > */
> > #define BREAK_GFP_ORDER_HI 1
> > #define BREAK_GFP_ORDER_LO 0
> > static int slab_break_gfp_order = BREAK_GFP_ORDER_LO;
>
> Do slab and slub use the same underlying page size for each slab?
>
> Single data point: the CONFIG_SLAB boxes which I have access to here are
> using order-0 for radix_tree_node, so they won't be failing in the way in
> which Peter's machine is.
>
> I've never ever before seen reports of page allocation failures in the
> radix-tree node allocation code, and that's the bottom line. This is just
> a drop-dead must-fix show-stopping bug. We cannot rely upon atomic order-1
> allocations succeeding so we cannot use them for radix-tree nodes. Nor for
> lots of other things which we have no chance of identifying.
>
> Peter, is this bug -mm only, or is 2.6.23 similarly failing?
I'm mainly using -mm (so you have at least one tester :-). I think the
-mm specific SLUB patch that ups slub_min_order makes the problem -mm
specific; I would have to test .23 to be sure.
^ permalink raw reply [flat|nested] 110+ messages in thread
* SLUB performance regression vs SLAB
2007-10-01 20:50 ` Christoph Lameter
2007-10-02 8:43 ` Nick Piggin
@ 2007-10-04 16:16 ` Matthew Wilcox
2007-10-04 17:38 ` Christoph Lameter
1 sibling, 1 reply; 110+ messages in thread
From: Matthew Wilcox @ 2007-10-04 16:16 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Mon, Oct 01, 2007 at 01:50:44PM -0700, Christoph Lameter wrote:
> The problem is with the weird way of Intel testing and communication.
> Every 3-6 month or so they will tell you the system is X% up or down on
> arch Y (and they wont give you details because its somehow secret). And
> then there are conflicting statements by the two or so performance test
> departments. One of them repeatedly assured me that they do not see any
> regressions.
Could you cut out the snarky remarks? It takes a long time to run a
test, and testing every one of the patches you send really isn't high
on anyone's priority list. The performance team have also been having
problems getting stable results with recent kernels, adding to the delay.
The good news is that we do now have commitment to testing upstream
kernels, so you should see results more frequently than you have been.
I'm taking over from Suresh as liaison for the performance team, so
if you hear *anything* from *anyone* else at Intel about performance,
I want you to cc me about it. OK? And I don't want to hear any more
whining about hearing different things from different people.
So, on "a well-known OLTP benchmark which prohibits publishing absolute
numbers" and on an x86-64 system (I don't think exactly which model
is important), we're seeing *6.51%* performance loss on slub vs slab.
This is with a 2.6.23-rc3 kernel. Tuning the boot parameters, as you've
asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8)
gets back 0.38% of that. It's still down 6.13% over slab.
For what it's worth, 2.6.23-rc3 already has a 1.19% regression versus
RHEL 4.5, so the performance guys are really unhappy about going up to
almost 8% regression.
In the detailed profiles, __slab_free is the third most expensive
function, behind only spin locks. get_partial_node is right behind it
in fourth place, and kmem_cache_alloc is sixth. __slab_alloc is eighth
and kmem_cache_free is tenth. These positions don't change with the
slub boot parameters.
Now, where do we go next? I suspect that 2.6.23-rc9 has significant
changes since -rc3, but I'd like to confirm that before kicking off
another (expensive) run. Please tell me which kernels would be useful to test.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 16:16 ` SLUB performance regression vs SLAB Matthew Wilcox
@ 2007-10-04 17:38 ` Christoph Lameter
2007-10-04 17:50 ` Arjan van de Ven
2007-10-04 18:32 ` Matthew Wilcox
0 siblings, 2 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-10-04 17:38 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Thu, 4 Oct 2007, Matthew Wilcox wrote:
> So, on "a well-known OLTP benchmark which prohibits publishing absolute
> numbers" and on an x86-64 system (I don't think exactly which model
> is important), we're seeing *6.51%* performance loss on slub vs slab.
> This is with a 2.6.23-rc3 kernel. Tuning the boot parameters, as you've
> asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8)
> gets back 0.38% of that. It's still down 6.13% over slab.
Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded
earlier. Seems that we are mainly seeing cacheline bouncing due to two
cpus accessing meta data in the same page struct. The patches in
MM that are scheduled to be merged for .24 address that issue. I
have repeatedly asked that these patches be tested. The patches were
posted months ago.
> Now, where do we go next? I suspect that 2.6.23-rc9 has significant
> changes since -rc3, but I'd like to confirm that before kicking off
> another (expensive) run. Please, tell me what useful kernels are to test.
I thought Siddha has a test in the works with the per cpu structure
patchset from MM? Could you sync up with Siddha?
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 18:32 ` Matthew Wilcox
@ 2007-10-04 17:49 ` Christoph Lameter
2007-10-04 19:28 ` Matthew Wilcox
0 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-10-04 17:49 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Thu, 4 Oct 2007, Matthew Wilcox wrote:
> > Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded
> > earlier. Seems that we are mainly seeing cacheline bouncing due to two
> > cpus accessing meta data in the same page struct. The patches in
> > MM that are scheduled to be merged for .24 address that issue. I
> > have repeatedly asked that these patches be tested. The patches were
> > posted months ago.
>
> I just checked with the guys who did the test. When I said -rc3, I
> mis-spoke; this is 2.6.23-rc3 *plus* the patches which Suresh agreed to
> test for you.
I was not aware of that. Would it be possible for you to summarize all the
test data that you have right now about SLUB vs. SLAB with the patches
listed? Exactly what kernel version and what version of the per cpu
patches were tested? Was the page allocator pass through patchset
separately applied as I requested?
Finally: Is there some way that I can reproduce the tests on my machines?
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 17:38 ` Christoph Lameter
@ 2007-10-04 17:50 ` Arjan van de Ven
2007-10-04 17:58 ` Christoph Lameter
` (2 more replies)
2007-10-04 18:32 ` Matthew Wilcox
1 sibling, 3 replies; 110+ messages in thread
From: Arjan van de Ven @ 2007-10-04 17:50 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Thu, 4 Oct 2007 10:38:15 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> Yeah the fastpath vs. slow path is not the issue as Siddha and I
> concluded earlier. Seems that we are mainly seeing cacheline bouncing
> due to two cpus accessing meta data in the same page struct. The
> patches in MM that are scheduled to be merged for .24 address
Ok, every time someone says anything not 100% positive about SLUB you
come back with "but it's fixed in the next patch set"... *every time*.
To be honest, to me that sounds like SLUB isn't ready for prime time
yet, or at least not ready to be the only one in town...
The day the answer is "the kernel.org slub fixes all the
issues" is the day it's ready.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 17:50 ` Arjan van de Ven
@ 2007-10-04 17:58 ` Christoph Lameter
2007-10-04 18:26 ` Peter Zijlstra
2007-10-04 20:48 ` David Miller
2 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-10-04 17:58 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Matthew Wilcox, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe
On Thu, 4 Oct 2007, Arjan van de Ven wrote:
> Ok every time something says anything not 100% positive about SLUB you
> come back with "but it's fixed in the next patch set"... *every time*.
All I ask that people test the fixes that have been out there for the
known issues. If there are remaining performance issues then lets figure
them out and address them.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 17:50 ` Arjan van de Ven
2007-10-04 17:58 ` Christoph Lameter
@ 2007-10-04 18:26 ` Peter Zijlstra
2007-10-04 20:48 ` David Miller
2 siblings, 0 replies; 110+ messages in thread
From: Peter Zijlstra @ 2007-10-04 18:26 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Christoph Lameter, Matthew Wilcox, Nick Piggin, Christoph Hellwig,
Mel Gorman, linux-fsdevel, linux-kernel, David Chinner,
Jens Axboe
On Thu, 2007-10-04 at 10:50 -0700, Arjan van de Ven wrote:
> On Thu, 4 Oct 2007 10:38:15 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
>
> > Yeah the fastpath vs. slow path is not the issue as Siddha and I
> > concluded earlier. Seems that we are mainly seeing cacheline bouncing
> > due to two cpus accessing meta data in the same page struct. The
> > patches in MM that are scheduled to be merged for .24 address
>
>
> Ok every time something says anything not 100% positive about SLUB you
> come back with "but it's fixed in the next patch set"... *every time*.
>
> To be honest, to me that sounds that SLUB isn't ready for prime time
> yet, or at least not ready to be the only one in town...
>
> The day that the answer is "the kernel.org slub is fixing all the
> issues" is when it's ready..
Arjan, to be honest, there has been some confusion on _what_ code has
been tested with what results. And with Christoph not able to reproduce
these results locally, it is very hard for him to fix it properly.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 17:38 ` Christoph Lameter
2007-10-04 17:50 ` Arjan van de Ven
@ 2007-10-04 18:32 ` Matthew Wilcox
2007-10-04 17:49 ` Christoph Lameter
1 sibling, 1 reply; 110+ messages in thread
From: Matthew Wilcox @ 2007-10-04 18:32 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe
On Thu, Oct 04, 2007 at 10:38:15AM -0700, Christoph Lameter wrote:
> On Thu, 4 Oct 2007, Matthew Wilcox wrote:
>
> > So, on "a well-known OLTP benchmark which prohibits publishing absolute
> > numbers" and on an x86-64 system (I don't think exactly which model
> > is important), we're seeing *6.51%* performance loss on slub vs slab.
> > This is with a 2.6.23-rc3 kernel. Tuning the boot parameters, as you've
> > asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8)
> > gets back 0.38% of that. It's still down 6.13% over slab.
>
> Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded
> earlier. Seems that we are mainly seeing cacheline bouncing due to two
> cpus accessing meta data in the same page struct. The patches in
> MM that are scheduled to be merged for .24 address that issue. I
> have repeatedly asked that these patches be tested. The patches were
> posted months ago.
I just checked with the guys who did the test. When I said -rc3, I
mis-spoke; this is 2.6.23-rc3 *plus* the patches which Suresh agreed to
test for you.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 19:28 ` Matthew Wilcox
@ 2007-10-04 19:05 ` Christoph Lameter
2007-10-04 19:46 ` Siddha, Suresh B
2007-10-04 20:55 ` David Miller
1 sibling, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-10-04 19:05 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe, suresh.b.siddha
On Thu, 4 Oct 2007, Matthew Wilcox wrote:
> We have three runs, all with 2.6.23-rc3 plus the patches that Suresh
> applied from 20070922. The first run is with slab. The second run is
> with SLUB and the third run is SLUB plus the tuning parameters you
> recommended.
There was quite a bit of communication on tuning parameters. Guess we got
more confusion there, and multiple configuration settings that I wanted
tested separately were merged. Setting slub_min_order to more than zero
can certainly be detrimental to performance since higher order page
allocations can cause cacheline bouncing on zone locks.
Which patches? 20070922 refers to a pull on the slab git tree on the
performance branch?
> I have a spreadsheet with Vtune data in it that was collected during
> each of these test runs, so we can see which functions are the hottest.
> I can grab that data and send it to you, if that's interesting.
Please do. Add the kernel .configs please. Is there any slab queue tuning
going on on boot with the SLAB configuration?
Include any tuning that was done to the kernel please.
> > Was the page allocator pass through patchset
> > separately applied as I requested?
>
> I don't believe so. Suresh?
If it was a git pull then the pass through was included and never taken
out.
> I think for future tests, it would be easiest if you send me a git
> reference. That way we will all know precisely what is being tested.
Sure we can do that.
> > Finally: Is there some way that I can reproduce the tests on my machines?
>
> As usual for these kinds of setups ... take a two-CPU machine, 64GB
> of memory, half a dozen fibre channel adapters, about 3000 discs,
> a commercial database, a team of experts for three months worth of
> tuning ...
>
> I don't know if anyone's tried to replicate a benchmark like this using
> Postgres. Would be nice if they have ...
Well, we have our own performance test department here at SGI. If we get
them involved then we can add another 3 months until we get the test
results confirmed ;-). Seems that this is a small configuration. Why
does it take that long? And the experts knew SLAB and not SLUB, right?
Lets look at all the data that you got and then see if this is enough to
figure out what is wrong.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 17:49 ` Christoph Lameter
@ 2007-10-04 19:28 ` Matthew Wilcox
2007-10-04 19:05 ` Christoph Lameter
2007-10-04 20:55 ` David Miller
0 siblings, 2 replies; 110+ messages in thread
From: Matthew Wilcox @ 2007-10-04 19:28 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Christoph Hellwig, Mel Gorman, linux-fsdevel,
linux-kernel, David Chinner, Jens Axboe, suresh.b.siddha
On Thu, Oct 04, 2007 at 10:49:52AM -0700, Christoph Lameter wrote:
> I was not aware of that. Would it be possible for you to summarize all the
> test data that you have right now about SLUB vs. SLAB with the patches
> listed? Exactly what kernel version and what version of the per cpu
> patches were tested?
We have three runs, all with 2.6.23-rc3 plus the patches that Suresh
applied from 20070922. The first run is with slab. The second run is
with SLUB and the third run is SLUB plus the tuning parameters you
recommended.
I have a spreadsheet with Vtune data in it that was collected during
each of these test runs, so we can see which functions are the hottest.
I can grab that data and send it to you, if that's interesting.
> Was the page allocator pass through patchset
> separately applied as I requested?
I don't believe so. Suresh?
I think for future tests, it would be easiest if you send me a git
reference. That way we will all know precisely what is being tested.
> Finally: Is there some way that I can reproduce the tests on my machines?
As usual for these kinds of setups ... take a two-CPU machine, 64GB
of memory, half a dozen fibre channel adapters, about 3000 discs,
a commercial database, a team of experts for three months worth of
tuning ...
I don't know if anyone's tried to replicate a benchmark like this using
Postgres. Would be nice if they have ...
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 19:05 ` Christoph Lameter
@ 2007-10-04 19:46 ` Siddha, Suresh B
0 siblings, 0 replies; 110+ messages in thread
From: Siddha, Suresh B @ 2007-10-04 19:46 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, Nick Piggin, Christoph Hellwig, Mel Gorman,
linux-fsdevel, linux-kernel, David Chinner, Jens Axboe,
suresh.b.siddha
On Thu, Oct 04, 2007 at 12:05:35PM -0700, Christoph Lameter wrote:
> > > Was the page allocator pass through patchset
> > > separately applied as I requested?
> >
> > I don't believe so. Suresh?
>
> If it was a git pull then the pass through was included and never taken
> out.
It was a git pull from the performance branch that you pointed out earlier
http://git.kernel.org/?p=linux/kernel/git/christoph/slab.git;a=log;h=performance
and the config is based on EL5 config with just the SLUB turned on.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 17:50 ` Arjan van de Ven
2007-10-04 17:58 ` Christoph Lameter
2007-10-04 18:26 ` Peter Zijlstra
@ 2007-10-04 20:48 ` David Miller
2007-10-04 20:58 ` Matthew Wilcox
2 siblings, 1 reply; 110+ messages in thread
From: David Miller @ 2007-10-04 20:48 UTC (permalink / raw)
To: arjan
Cc: clameter, willy, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe
From: Arjan van de Ven <arjan@infradead.org>
Date: Thu, 4 Oct 2007 10:50:46 -0700
> Ok every time something says anything not 100% positive about SLUB you
> come back with "but it's fixed in the next patch set"... *every time*.
I think this is partly Christoph subconsciously venting his
frustration that he's never been given a reproducible test case he can use
to fix the problem.
There comes a point where it is the reporter's responsibility to help
the developer come up with a publishable test case the developer can
use to work on fixing the problem and help ensure it stays fixed.
Using an unpublishable benchmark, whose results cannot even be
published, really stretches the limits of "reasonable", don't you
think?
This "SLUB isn't ready yet" bullshit is just a shaman's dance which
distracts attention away from the real problem, which is that a
reproducible, publishable test case is not being provided to the
developer so he can work on fixing the problem.
I can tell you this thing would be fixed overnight if a proper test
case had been provided by now.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 19:28 ` Matthew Wilcox
2007-10-04 19:05 ` Christoph Lameter
@ 2007-10-04 20:55 ` David Miller
2007-10-04 21:02 ` Chuck Ebbert
2007-10-04 21:05 ` Matthew Wilcox
1 sibling, 2 replies; 110+ messages in thread
From: David Miller @ 2007-10-04 20:55 UTC (permalink / raw)
To: willy
Cc: clameter, nickpiggin, hch, mel, linux-fsdevel, linux-kernel, dgc,
jens.axboe, suresh.b.siddha
From: willy@linux.intel.com (Matthew Wilcox)
Date: Thu, 4 Oct 2007 12:28:25 -0700
> On Thu, Oct 04, 2007 at 10:49:52AM -0700, Christoph Lameter wrote:
> > Finally: Is there some way that I can reproduce the tests on my machines?
>
> As usual for these kinds of setups ... take a two-CPU machine, 64GB
> of memory, half a dozen fibre channel adapters, about 3000 discs,
> a commercial database, a team of experts for three months worth of
> tuning ...
Anything, I do mean anything, can be simulated using small test
programs. Pointing at a big fancy machine with lots of storage
and disk is a passive-aggressive way to avoid the real issues,
in that nobody is putting forth the effort to try and come up
with at least a publishable test case that Christoph can use to
help you guys.
If coming up with a reproducible and publishable test case is
the difference between this getting fixed and it not getting
fixed, are you going to invest the time to do that?
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 20:48 ` David Miller
@ 2007-10-04 20:58 ` Matthew Wilcox
2007-10-04 21:05 ` David Miller
2007-10-04 21:11 ` Christoph Lameter
0 siblings, 2 replies; 110+ messages in thread
From: Matthew Wilcox @ 2007-10-04 20:58 UTC (permalink / raw)
To: David Miller
Cc: arjan, clameter, willy, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe
On Thu, Oct 04, 2007 at 01:48:34PM -0700, David Miller wrote:
> There comes a point where it is the reporter's responsibility to help
> the developer come up with a publishable test case the developer can
> use to work on fixing the problem and help ensure it stays fixed.
That's a lot of effort. Is it more effort than doing some remote
debugging with Christoph? I don't know.
> Using an unpublishable benchmark, whose results even cannot be
> published, really stretches the limits of "reasonable" don't you
> think?
Yet here we stand. Christoph is aggressively trying to get slab removed
from the tree. There is a testcase which shows slub performing worse
than slab. It's not my fault I can't publish it. And just because I
can't publish it doesn't mean it doesn't exist.
Slab must not be removed until slub performs as well on this
benchmark.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 20:55 ` David Miller
@ 2007-10-04 21:02 ` Chuck Ebbert
2007-10-04 21:11 ` David Miller
2007-10-05 20:32 ` Peter Zijlstra
2007-10-04 21:05 ` Matthew Wilcox
1 sibling, 2 replies; 110+ messages in thread
From: Chuck Ebbert @ 2007-10-04 21:02 UTC (permalink / raw)
To: David Miller
Cc: willy, clameter, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe, suresh.b.siddha
On 10/04/2007 04:55 PM, David Miller wrote:
>
> Anything, I do mean anything, can be simulated using small test
> programs.
How do you simulate reading 100TB of data spread across 3000 disks,
selecting 10% of it using some criterion, then sorting and summarizing
the result?
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 20:55 ` David Miller
2007-10-04 21:02 ` Chuck Ebbert
@ 2007-10-04 21:05 ` Matthew Wilcox
2007-10-05 2:43 ` Christoph Lameter
1 sibling, 1 reply; 110+ messages in thread
From: Matthew Wilcox @ 2007-10-04 21:05 UTC (permalink / raw)
To: David Miller
Cc: willy, clameter, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe, suresh.b.siddha
On Thu, Oct 04, 2007 at 01:55:37PM -0700, David Miller wrote:
> Anything, I do mean anything, can be simulated using small test
> programs. Pointing at a big fancy machine with lots of storage
> and disk is a passive aggressive way to avoid the real issues,
> in that nobody is putting forth the effort to try and come up
> with an at least publishable test case that Christoph can use to
> help you guys.
>
> If coming up with a reproducable and publishable test case is
> the difference between this getting fixed and it not getting
> fixed, are you going to invest the time to do that?
If that's what it takes, then yes. But I'm far from convinced that
it's as easy to come up with a TPC benchmark simulator as you think.
There have been efforts in the past (orasim, for example), but
presumably Christoph has already tried these benchmarks.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 20:58 ` Matthew Wilcox
@ 2007-10-04 21:05 ` David Miller
2007-10-04 21:11 ` Christoph Lameter
1 sibling, 0 replies; 110+ messages in thread
From: David Miller @ 2007-10-04 21:05 UTC (permalink / raw)
To: matthew
Cc: arjan, clameter, willy, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe
From: Matthew Wilcox <matthew@wil.cx>
Date: Thu, 4 Oct 2007 14:58:12 -0600
> On Thu, Oct 04, 2007 at 01:48:34PM -0700, David Miller wrote:
> > There comes a point where it is the reporter's responsibility to help
> > the developer come up with a publishable test case the developer can
> > use to work on fixing the problem and help ensure it stays fixed.
>
> That's a lot of effort. Is it more effort than doing some remote
> debugging with Christoph? I don't know.
That's a good question and an excellent point. I'm sure that,
either way, Christoph will be more than willing to engage and
assist.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 21:02 ` Chuck Ebbert
@ 2007-10-04 21:11 ` David Miller
2007-10-04 21:47 ` Chuck Ebbert
2007-10-05 20:32 ` Peter Zijlstra
1 sibling, 1 reply; 110+ messages in thread
From: David Miller @ 2007-10-04 21:11 UTC (permalink / raw)
To: cebbert
Cc: willy, clameter, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe, suresh.b.siddha
From: Chuck Ebbert <cebbert@redhat.com>
Date: Thu, 04 Oct 2007 17:02:17 -0400
> How do you simulate reading 100TB of data spread across 3000 disks,
> selecting 10% of it using some criterion, then sorting and
> summarizing the result?
You repeatedly read zeros from a smaller disk into the same amount of
memory, and sort that as if it were real data instead.
You're not thinking outside of the box, and you need to do that to
write good test cases and fix kernel bugs effectively.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 20:58 ` Matthew Wilcox
2007-10-04 21:05 ` David Miller
@ 2007-10-04 21:11 ` Christoph Lameter
1 sibling, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-10-04 21:11 UTC (permalink / raw)
To: Matthew Wilcox
Cc: David Miller, arjan, willy, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe
On Thu, 4 Oct 2007, Matthew Wilcox wrote:
> Yet here we stand. Christoph is aggressively trying to get slab removed
> from the tree. There is a testcase which shows slub performing worse
> than slab. It's not my fault I can't publish it. And just because I
> can't publish it doesn't mean it doesn't exist.
>
> Slab needs to not get removed until slub is as good a performer on this
> benchmark.
I agree with this .... SLAB will stay until we have worked through all the
performance issues.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 21:11 ` David Miller
@ 2007-10-04 21:47 ` Chuck Ebbert
2007-10-04 22:07 ` David Miller
0 siblings, 1 reply; 110+ messages in thread
From: Chuck Ebbert @ 2007-10-04 21:47 UTC (permalink / raw)
To: David Miller
Cc: willy, clameter, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe, suresh.b.siddha
On 10/04/2007 05:11 PM, David Miller wrote:
> From: Chuck Ebbert <cebbert@redhat.com>
> Date: Thu, 04 Oct 2007 17:02:17 -0400
>
>> How do you simulate reading 100TB of data spread across 3000 disks,
>> selecting 10% of it using some criterion, then sorting and
>> summarizing the result?
>
> You repeatedly read zeros from a smaller disk into the same amount of
> memory, and sort that as if it were real data instead.
You've just replaced 3000 concurrent streams of data with a single
stream. That won't test the memory allocator's ability to allocate
memory to many concurrent users very well.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 21:47 ` Chuck Ebbert
@ 2007-10-04 22:07 ` David Miller
2007-10-04 22:23 ` David Chinner
0 siblings, 1 reply; 110+ messages in thread
From: David Miller @ 2007-10-04 22:07 UTC (permalink / raw)
To: cebbert
Cc: willy, clameter, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe, suresh.b.siddha
From: Chuck Ebbert <cebbert@redhat.com>
Date: Thu, 04 Oct 2007 17:47:48 -0400
> On 10/04/2007 05:11 PM, David Miller wrote:
> > From: Chuck Ebbert <cebbert@redhat.com>
> > Date: Thu, 04 Oct 2007 17:02:17 -0400
> >
> >> How do you simulate reading 100TB of data spread across 3000 disks,
> >> selecting 10% of it using some criterion, then sorting and
> >> summarizing the result?
> >
> > You repeatedly read zeros from a smaller disk into the same amount of
> > memory, and sort that as if it were real data instead.
>
> You've just replaced 3000 concurrent streams of data with a single
> stream. That won't test the memory allocator's ability to allocate
> memory to many concurrent users very well.
You've kindly removed my "thinking outside of the box" comment.
The point was not that my specific suggestion would be
perfect, but that if you used your creativity and thought
in similar directions you might find a way to do it.
People are too narrow-minded when it comes to these things, and
that's the problem I want to address.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 22:07 ` David Miller
@ 2007-10-04 22:23 ` David Chinner
2007-10-05 6:48 ` Jens Axboe
0 siblings, 1 reply; 110+ messages in thread
From: David Chinner @ 2007-10-04 22:23 UTC (permalink / raw)
To: David Miller
Cc: cebbert, willy, clameter, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe, suresh.b.siddha
On Thu, Oct 04, 2007 at 03:07:18PM -0700, David Miller wrote:
> From: Chuck Ebbert <cebbert@redhat.com> Date: Thu, 04 Oct 2007 17:47:48
> -0400
>
> > On 10/04/2007 05:11 PM, David Miller wrote:
> > > From: Chuck Ebbert <cebbert@redhat.com> Date: Thu, 04 Oct 2007 17:02:17
> > > -0400
> > >
> > >> How do you simulate reading 100TB of data spread across 3000 disks,
> > >> selecting 10% of it using some criterion, then sorting and summarizing
> > >> the result?
> > >
> > > You repeatedly read zeros from a smaller disk into the same amount of
> > > memory, and sort that as if it were real data instead.
> >
> > You've just replaced 3000 concurrent streams of data with a single stream.
> > That won't test the memory allocator's ability to allocate memory to many
> > concurrent users very well.
>
> You've kindly removed my "thinking outside of the box" comment.
>
> The point is was not that my specific suggestion would be perfect, but that
> if you used your creativity and thought in similar directions you might find
> a way to do it.
>
> People are too narrow minded when it comes to these things, and that's the
> problem I want to address.
And it's a good point, too, because often what is a problem to one person is
a no-brainer to someone else.
Creating lots of "fake" disks is trivial to do, IMO. Use loopback devices on
sparse files, use ramdisks containing sparse files, or write a sparse dm
target for sparse block device mapping, etc. I'm sure there's more than the
few I just threw out...
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 21:05 ` Matthew Wilcox
@ 2007-10-05 2:43 ` Christoph Lameter
2007-10-05 2:53 ` Arjan van de Ven
0 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-10-05 2:43 UTC (permalink / raw)
To: Matthew Wilcox
Cc: David Miller, willy, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe, suresh.b.siddha
I just spent some time looking at the functions that you see high in the
list. The trouble is that I have to speculate and have nothing to
verify my thoughts against. If you could give me the hitlist for each of the
3 runs then this would help to check my thinking. I could be totally off
here.
It seems that we frequently miss the per-cpu slab on slab_free(), which
leads to calling __slab_free(), which in turn needs to take a
lock on the page (in the page struct). Typically the page lock is
uncontended, which does not seem to be the case here; otherwise __slab_free()
would not be that high up.
The per cpu patch in mm should reduce the contention on the page struct by
not touching the page struct on alloc and on free. Does not seem to work
all the way though. slab_free() still has to touch the page struct if the
free is not to the currently active cpu slab.
So there could still be page struct contention left if multiple processors
frequently and simultaneously free to the same slab and that slab is not
the per cpu slab of a cpu. That could be addressed by optimizing the
object free handling further to not touch the page struct even if we miss
the per cpu slab.
That get_partial* is so far up indicates contention on the list lock, which
should be addressable either by increasing the slab size or by changing
the object free handling to batch in some form.
This is an SMP system, right? 2 cores with 4 cpus each? The main loop is
always hitting on the same slabs? Which slabs would this be? Am I right in
thinking that one process allocates objects and then lets multiple other
processors do work and then the allocated object is freed from a cpu that
did not allocate the object? If neighboring objects in one slab are
allocated on one cpu and then are almost simultaneously freed from a set
of different cpus, then this may explain the situation.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 2:43 ` Christoph Lameter
@ 2007-10-05 2:53 ` Arjan van de Ven
0 siblings, 0 replies; 110+ messages in thread
From: Arjan van de Ven @ 2007-10-05 2:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matthew Wilcox, David Miller, willy, nickpiggin, hch, mel,
linux-fsdevel, linux-kernel, dgc, jens.axboe, suresh.b.siddha
On Thu, 4 Oct 2007 19:43:58 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> So there could still be page struct contention left if multiple
> processors frequently and simultaneously free to the same slab and
> that slab is not the per cpu slab of a cpu. That could be addressed
> by optimizing the object free handling further to not touch the page
> struct even if we miss the per cpu slab.
>
> That get_partial* is far up indicates contention on the list lock
> that should be addressable by either increasing the slab size or by
> changing the object free handling to batch in some form.
>
> This is an SMP system right? 2 cores with 4 cpus each? The main loop
> is always hitting on the same slabs? Which slabs would this be? Am I
> right in thinking that one process allocates objects and then lets
> multiple other processors do work and then the allocated object is
> freed from a cpu that did not allocate the object? If neighboring
> objects in one slab are allocated on one cpu and then are almost
> simultaneously freed from a set of different cpus then this may be
> explain the situation. -
One of the characteristics of the application in use is the following:
all cores submit IO (which means they allocate various scsi and block
structures on all cpus)... but only one will free them (the one the IRQ is
bound to). So it's allocate-on-one, free-on-another at a high rate.
That is assuming this is the IO slab; that's a bit of an assumption
obviously (it's one of the slab things that are hot, but it's a complex
workload, there could be others).
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 22:23 ` David Chinner
@ 2007-10-05 6:48 ` Jens Axboe
2007-10-05 9:19 ` Pekka Enberg
2007-10-05 11:56 ` Matthew Wilcox
0 siblings, 2 replies; 110+ messages in thread
From: Jens Axboe @ 2007-10-05 6:48 UTC (permalink / raw)
To: David Chinner
Cc: David Miller, cebbert, willy, clameter, nickpiggin, hch, mel,
linux-fsdevel, linux-kernel, suresh.b.siddha
On Fri, Oct 05 2007, David Chinner wrote:
> On Thu, Oct 04, 2007 at 03:07:18PM -0700, David Miller wrote:
> > From: Chuck Ebbert <cebbert@redhat.com> Date: Thu, 04 Oct 2007 17:47:48
> > -0400
> >
> > > On 10/04/2007 05:11 PM, David Miller wrote:
> > > > From: Chuck Ebbert <cebbert@redhat.com> Date: Thu, 04 Oct 2007 17:02:17
> > > > -0400
> > > >
> > > >> How do you simulate reading 100TB of data spread across 3000 disks,
> > > >> selecting 10% of it using some criterion, then sorting and summarizing
> > > >> the result?
> > > >
> > > > You repeatedly read zeros from a smaller disk into the same amount of
> > > > memory, and sort that as if it were real data instead.
> > >
> > > You've just replaced 3000 concurrent streams of data with a single stream.
> > > That won't test the memory allocator's ability to allocate memory to many
> > > concurrent users very well.
> >
> > You've kindly removed my "thinking outside of the box" comment.
> >
> > The point is was not that my specific suggestion would be perfect, but that
> > if you used your creativity and thought in similar directions you might find
> > a way to do it.
> >
> > People are too narrow minded when it comes to these things, and that's the
> > problem I want to address.
>
> And it's a good point, too, because often problems to one person are a
> no-brainer to someone else.
>
> Creating lots of "fake" disks is trivial to do, IMO. Use loopback on
> sparse files containing sparse filesxi, use ramdisks containing sparse
> files or write a sparse dm target for sparse block device mapping,
> etc. I'm sure there's more than the few I just threw out...
Or use scsi_debug to fake drives/controllers, works wonderful as well
for some things and involve the full IO stack.
I'd like to second David's emails here; this is a serious problem. Having
a reproducible test case lowers the barrier for getting the problem
fixed by orders of magnitude. It's the difference between the problem
getting fixed in a day or two and it potentially lingering for months,
because email ping-pong takes forever and "the test team has moved on to
other tests, we'll let you know the results of test foo in 3 weeks time
when we have a new slot on the box" just removes any developer
motivation to work on the issue.
--
Jens Axboe
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 6:48 ` Jens Axboe
@ 2007-10-05 9:19 ` Pekka Enberg
2007-10-05 9:28 ` Jens Axboe
2007-10-05 11:56 ` Matthew Wilcox
1 sibling, 1 reply; 110+ messages in thread
From: Pekka Enberg @ 2007-10-05 9:19 UTC (permalink / raw)
To: Jens Axboe
Cc: David Chinner, David Miller, cebbert, willy, clameter, nickpiggin,
hch, mel, linux-fsdevel, linux-kernel, suresh.b.siddha
Hi,
On 10/5/07, Jens Axboe <jens.axboe@oracle.com> wrote:
> I'd like to second Davids emails here, this is a serious problem. Having
> a reproducible test case lowers the barrier for getting the problem
> fixed by orders of magnitude. It's the difference between the problem
> getting fixed in a day or two and it potentially lingering for months,
> because email ping-pong takes forever and "the test team has moved on to
> other tests, we'll let you know the results of test foo in 3 weeks time
> when we have a new slot on the box" just removing any developer
> motivation to work on the issue.
What I don't understand is why the people who _have_ access
to the test case don't fix the problem themselves. Unlike slab, slub is not
a pile of crap that only Christoph can hack on...
Pekka
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 9:19 ` Pekka Enberg
@ 2007-10-05 9:28 ` Jens Axboe
2007-10-05 11:12 ` Andi Kleen
0 siblings, 1 reply; 110+ messages in thread
From: Jens Axboe @ 2007-10-05 9:28 UTC (permalink / raw)
To: Pekka Enberg
Cc: David Chinner, David Miller, cebbert, willy, clameter, nickpiggin,
hch, mel, linux-fsdevel, linux-kernel, suresh.b.siddha
On Fri, Oct 05 2007, Pekka Enberg wrote:
> Hi,
>
> On 10/5/07, Jens Axboe <jens.axboe@oracle.com> wrote:
> > I'd like to second David's emails here: this is a serious problem. Having
> > a reproducible test case lowers the barrier for getting the problem
> > fixed by orders of magnitude. It's the difference between the problem
> > getting fixed in a day or two and it potentially lingering for months,
> > because email ping-pong takes forever and "the test team has moved on to
> > other tests, we'll let you know the results of test foo in 3 weeks' time
> > when we have a new slot on the box" just removes any developer
> > motivation to work on the issue.
>
> What I don't understand is why the people who _have_ access
> to the test case don't fix the problem. Unlike slab, slub is not a pile of
> crap that only Christoph can hack on...
Often the people testing are doing just that: testing. They kindly offer
to test any patches and so on, which usually takes forever because of the
above limitations in response time, machine availability, etc.
Writing a small test module to exercise slub/slab in various ways
(allocating on all CPUs, freeing from one, as described) should not be
too hard. Perhaps that would be enough to find this performance
discrepancy between slab and slub?
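A hypothetical sketch of such a module (the xalloc_* names, the object size,
the queue cap and the CPU split are all made-up assumptions, not anything in
the tree) could pin one kthread per CPU, let every CPU except CPU 0 allocate,
and let CPU 0 do all the frees:

/* Sketch only: cross-CPU alloc/free exerciser for slab/slub. */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/sched.h>
#include <linux/err.h>

struct xalloc_obj {
	struct list_head list;
	char payload[256];		/* arbitrary object size */
};

static LIST_HEAD(xalloc_queue);
static DEFINE_SPINLOCK(xalloc_lock);
static int xalloc_queued;
static struct task_struct *xalloc_task[NR_CPUS];

/* Runs on every CPU except 0: allocate and hand the object to CPU 0. */
static int xalloc_producer(void *unused)
{
	while (!kthread_should_stop()) {
		struct xalloc_obj *obj = kmalloc(sizeof(*obj), GFP_KERNEL);

		if (!obj) {
			msleep(1);
			continue;
		}
		spin_lock(&xalloc_lock);
		if (xalloc_queued < 10000) {	/* crude cap on memory use */
			list_add_tail(&obj->list, &xalloc_queue);
			xalloc_queued++;
			obj = NULL;
		}
		spin_unlock(&xalloc_lock);
		if (obj) {
			kfree(obj);		/* queue full: free locally, back off */
			msleep(1);
		}
		cond_resched();
	}
	return 0;
}

/* Runs pinned to CPU 0: free objects that were allocated elsewhere. */
static int xalloc_consumer(void *unused)
{
	while (!kthread_should_stop()) {
		struct xalloc_obj *obj = NULL;

		spin_lock(&xalloc_lock);
		if (!list_empty(&xalloc_queue)) {
			obj = list_entry(xalloc_queue.next, struct xalloc_obj, list);
			list_del(&obj->list);
			xalloc_queued--;
		}
		spin_unlock(&xalloc_lock);
		if (obj)
			kfree(obj);
		else
			msleep(1);
	}
	return 0;
}

static int __init xalloc_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct task_struct *t;

		t = kthread_create(cpu ? xalloc_producer : xalloc_consumer,
				   NULL, "xalloc/%d", cpu);
		if (IS_ERR(t))
			continue;
		kthread_bind(t, cpu);		/* pin before the first wakeup */
		xalloc_task[cpu] = t;
		wake_up_process(t);
	}
	return 0;
}

static void __exit xalloc_exit(void)
{
	struct xalloc_obj *obj, *tmp;
	int cpu;

	for_each_online_cpu(cpu)
		if (xalloc_task[cpu])
			kthread_stop(xalloc_task[cpu]);
	list_for_each_entry_safe(obj, tmp, &xalloc_queue, list)
		kfree(obj);			/* drain whatever is left */
}

module_init(xalloc_init);
module_exit(xalloc_exit);
MODULE_LICENSE("GPL");

Varying the payload size and the producer/consumer CPU split should make the
remote-free behaviour of slab vs slub directly comparable under the same load.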
--
Jens Axboe
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 9:28 ` Jens Axboe
@ 2007-10-05 11:12 ` Andi Kleen
2007-10-05 12:39 ` Jens Axboe
0 siblings, 1 reply; 110+ messages in thread
From: Andi Kleen @ 2007-10-05 11:12 UTC (permalink / raw)
To: Jens Axboe
Cc: Pekka Enberg, David Chinner, David Miller, cebbert, willy,
clameter, nickpiggin, hch, mel, linux-fsdevel, linux-kernel,
suresh.b.siddha
Jens Axboe <jens.axboe@oracle.com> writes:
>
> Writing a small test module to exercise slub/slab in various ways
> (allocating from all cpus freeing from one, as described) should not be
> too hard. Perhaps that would be enough to find this performance
> discrepancy between slab and slub?
You could simulate that by just sending packets using unix sockets
between threads bound to different CPUs. Sending a packet allocates; receiving
deallocates.
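A minimal userspace sketch along those lines (the buffer size, iteration
count and CPU numbers are arbitrary assumptions) could pin the two ends of
an AF_UNIX socketpair to different CPUs:

/* Sketch only: one CPU allocates packets, another frees them. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define ITERS 1000000L

static int sv[2];			/* sv[0] sends, sv[1] receives */

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *sender(void *arg)
{
	char buf[256];
	long i;

	memset(buf, 0, sizeof(buf));
	pin_to_cpu(0);			/* skbs get allocated on CPU 0 */
	for (i = 0; i < ITERS; i++)
		send(sv[0], buf, sizeof(buf), 0);
	return NULL;
}

static void *receiver(void *arg)
{
	char buf[256];
	long i;

	pin_to_cpu(1);			/* ... and freed on CPU 1 */
	for (i = 0; i < ITERS; i++)
		recv(sv[1], buf, sizeof(buf), 0);
	return NULL;
}

int main(void)
{
	pthread_t s, r;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0) {
		perror("socketpair");
		return 1;
	}
	pthread_create(&r, NULL, receiver, NULL);
	pthread_create(&s, NULL, sender, NULL);
	pthread_join(s, NULL);
	pthread_join(r, NULL);
	return 0;
}

Compiled with something like gcc -O2 -pthread, timing the run under slab and
slub gives a rough comparison of the remote alloc/free path.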
But it's not clear that will really simulate the cache bounce environment
of the database test. I don't think all passing of data between CPUs
using slub objects is slow.
-Andi
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 6:48 ` Jens Axboe
2007-10-05 9:19 ` Pekka Enberg
@ 2007-10-05 11:56 ` Matthew Wilcox
2007-10-05 12:37 ` Jens Axboe
2007-10-05 19:27 ` Christoph Lameter
1 sibling, 2 replies; 110+ messages in thread
From: Matthew Wilcox @ 2007-10-05 11:56 UTC (permalink / raw)
To: Jens Axboe
Cc: David Chinner, David Miller, cebbert, willy, clameter, nickpiggin,
hch, mel, linux-fsdevel, linux-kernel, suresh.b.siddha
On Fri, Oct 05, 2007 at 08:48:53AM +0200, Jens Axboe wrote:
> I'd like to second David's emails here: this is a serious problem. Having
> a reproducible test case lowers the barrier for getting the problem
> fixed by orders of magnitude. It's the difference between the problem
> getting fixed in a day or two and it potentially lingering for months,
> because email ping-pong takes forever and "the test team has moved on to
> other tests, we'll let you know the results of test foo in 3 weeks' time
> when we have a new slot on the box" just removes any developer
> motivation to work on the issue.
I vaguely remembered something called orasim, so I went looking for it.
I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
seems to be a stillborn project. Is there anything else I should know
about orasim? ;-)
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 11:56 ` Matthew Wilcox
@ 2007-10-05 12:37 ` Jens Axboe
2007-10-05 19:27 ` Christoph Lameter
1 sibling, 0 replies; 110+ messages in thread
From: Jens Axboe @ 2007-10-05 12:37 UTC (permalink / raw)
To: Matthew Wilcox
Cc: David Chinner, David Miller, cebbert, willy, clameter, nickpiggin,
hch, mel, linux-fsdevel, linux-kernel, suresh.b.siddha
On Fri, Oct 05 2007, Matthew Wilcox wrote:
> On Fri, Oct 05, 2007 at 08:48:53AM +0200, Jens Axboe wrote:
> > I'd like to second David's emails here: this is a serious problem. Having
> > a reproducible test case lowers the barrier for getting the problem
> > fixed by orders of magnitude. It's the difference between the problem
> > getting fixed in a day or two and it potentially lingering for months,
> > because email ping-pong takes forever and "the test team has moved on to
> > other tests, we'll let you know the results of test foo in 3 weeks' time
> > when we have a new slot on the box" just removes any developer
> > motivation to work on the issue.
>
> I vaguely remembered something called orasim, so I went looking for it.
> I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
> 2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
> seems to be a stillborn project. Is there anything else I should know
> about orasim? ;-)
I don't know much about orasim, except that internally we're trying to
use fio for that instead. As far as I know, it was a project that was
never feature complete (or completed at all, for that matter).
--
Jens Axboe
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 11:12 ` Andi Kleen
@ 2007-10-05 12:39 ` Jens Axboe
2007-10-05 19:31 ` Christoph Lameter
0 siblings, 1 reply; 110+ messages in thread
From: Jens Axboe @ 2007-10-05 12:39 UTC (permalink / raw)
To: Andi Kleen
Cc: Pekka Enberg, David Chinner, David Miller, cebbert, willy,
clameter, nickpiggin, hch, mel, linux-fsdevel, linux-kernel,
suresh.b.siddha
On Fri, Oct 05 2007, Andi Kleen wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> >
> > Writing a small test module to exercise slub/slab in various ways
> > (allocating from all cpus freeing from one, as described) should not be
> > too hard. Perhaps that would be enough to find this performance
> > discrepancy between slab and slub?
>
> You could simulate that by just sending packets using unix sockets
> between threads bound to different CPUs. Sending a packet allocates;
> receiving deallocates.
Sure, there are a host of ways to accomplish the same thing.
> But it's not clear that will really simulate the cache bounce
> environment of the database test. I don't think all passing of data
> between CPUs using slub objects is slow.
It might not, or it might. The point is to isolate the problem and
make a simple test case that can be used to reproduce it, so that
Christoph (or someone else) can easily fix it.
--
Jens Axboe
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 11:56 ` Matthew Wilcox
2007-10-05 12:37 ` Jens Axboe
@ 2007-10-05 19:27 ` Christoph Lameter
1 sibling, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-10-05 19:27 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jens Axboe, David Chinner, David Miller, cebbert, willy,
nickpiggin, hch, mel, linux-fsdevel, linux-kernel,
suresh.b.siddha
On Fri, 5 Oct 2007, Matthew Wilcox wrote:
> I vaguely remembered something called orasim, so I went looking for it.
> I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
> 2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
> seems to be a stillborn project. Is there anything else I should know
> about orasim? ;-)
Too bad. If this worked then I would have a load to work against. I
have a patch here that may address the issue for SMP (no NUMA for now) by
batching all frees on the per cpu freelist and then dumping them in
groups. But it is likely not too wise to have you run your weeklong
tests on this one. It needs some more care first.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 12:39 ` Jens Axboe
@ 2007-10-05 19:31 ` Christoph Lameter
2007-10-05 19:32 ` Christoph Lameter
0 siblings, 1 reply; 110+ messages in thread
From: Christoph Lameter @ 2007-10-05 19:31 UTC (permalink / raw)
To: Jens Axboe
Cc: Andi Kleen, Pekka Enberg, David Chinner, David Miller, cebbert,
willy, nickpiggin, hch, mel, linux-fsdevel, linux-kernel,
suresh.b.siddha
On Fri, 5 Oct 2007, Jens Axboe wrote:
> It might not, it might. The point is trying to isolate the problem and
> making a simple test case that could be used to reproduce it, so that
> Christoph (or someone else) can easily fix it.
In case someone wants to hack on it: here is what I have so
far for batching the frees. I will try to come up with a test next week if
nothing else happens before then:
Patch 1/2 on top of mm:
SLUB: Keep counter of remaining objects on the per cpu list
Add a counter to keep track of how many objects are on the per cpu list.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 1 +
mm/slub.c | 8 ++++++--
2 files changed, 7 insertions(+), 2 deletions(-)
Index: linux-2.6.23-rc8-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc8-mm2.orig/include/linux/slub_def.h 2007-10-04 22:41:58.000000000 -0700
+++ linux-2.6.23-rc8-mm2/include/linux/slub_def.h 2007-10-04 22:42:08.000000000 -0700
@@ -15,6 +15,7 @@ struct kmem_cache_cpu {
void **freelist;
struct page *page;
int node;
+ int remaining;
unsigned int offset;
unsigned int objsize;
};
Index: linux-2.6.23-rc8-mm2/mm/slub.c
===================================================================
--- linux-2.6.23-rc8-mm2.orig/mm/slub.c 2007-10-04 22:41:58.000000000 -0700
+++ linux-2.6.23-rc8-mm2/mm/slub.c 2007-10-04 22:42:08.000000000 -0700
@@ -1386,12 +1386,13 @@ static void deactivate_slab(struct kmem_
* because both freelists are empty. So this is unlikely
* to occur.
*/
- while (unlikely(c->freelist)) {
+ while (unlikely(c->remaining)) {
void **object;
/* Retrieve object from cpu_freelist */
object = c->freelist;
c->freelist = c->freelist[c->offset];
+ c->remaining--;
/* And put onto the regular freelist */
object[c->offset] = page->freelist;
@@ -1491,6 +1492,7 @@ load_freelist:
object = c->page->freelist;
c->freelist = object[c->offset];
+ c->remaining = s->objects - c->page->inuse - 1;
c->page->inuse = s->objects;
c->page->freelist = NULL;
c->node = page_to_nid(c->page);
@@ -1574,13 +1576,14 @@ static void __always_inline *slab_alloc(
local_irq_save(flags);
c = get_cpu_slab(s, smp_processor_id());
- if (unlikely(!c->freelist || !node_match(c, node)))
+ if (unlikely(!c->remaining || !node_match(c, node)))
object = __slab_alloc(s, gfpflags, node, addr, c);
else {
object = c->freelist;
c->freelist = object[c->offset];
+ c->remaining--;
}
local_irq_restore(flags);
@@ -1686,6 +1689,7 @@ static void __always_inline slab_free(st
if (likely(page == c->page && c->node >= 0)) {
object[c->offset] = c->freelist;
c->freelist = object;
+ c->remaining++;
} else
__slab_free(s, page, x, addr, c->offset);
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 19:31 ` Christoph Lameter
@ 2007-10-05 19:32 ` Christoph Lameter
0 siblings, 0 replies; 110+ messages in thread
From: Christoph Lameter @ 2007-10-05 19:32 UTC (permalink / raw)
To: Jens Axboe
Cc: Andi Kleen, Pekka Enberg, David Chinner, David Miller, cebbert,
willy, nickpiggin, hch, mel, linux-fsdevel, linux-kernel,
suresh.b.siddha
Patch 2/2
SLUB: Allow foreign objects on the per cpu object lists.
In order to free objects we need to touch the page struct of the page that the
object belongs to. If this occurs too frequently then we could generate a bouncing
cacheline.
We do not want that to occur too frequently. We already avoid touching the page
struct for per cpu objects. Now we extend that to allow a limited number of objects
that are not part of the cpu slab: up to 4 times the number of objects that fit into
a page may be kept on the per cpu list.
If the objects are allocated again before we need to drain them then we have saved
touching a page struct twice. The objects are presumably cache hot, so performance-wise
it is good to recycle them locally.
Foreign objects are drained before deactivating cpu slabs and when too many objects
accumulate.
For kmem_cache_free() this also has the beneficial effect of eliminating virt_to_page()
operations or grouping them together, which may help reduce the cache footprint
and increase the speed of virt_to_page() lookups (they hopefully all come from the
same pages).
For kfree() we may in the worst case have to do virt_to_page() twice, with the second
lookup grouped together with others.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 1
mm/slub.c | 82 ++++++++++++++++++++++++++++++++++++++---------
2 files changed, 68 insertions(+), 15 deletions(-)
Index: linux-2.6.23-rc8-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc8-mm2.orig/include/linux/slub_def.h 2007-10-04 22:42:08.000000000 -0700
+++ linux-2.6.23-rc8-mm2/include/linux/slub_def.h 2007-10-04 22:43:19.000000000 -0700
@@ -16,6 +16,7 @@ struct kmem_cache_cpu {
struct page *page;
int node;
int remaining;
+ int drain_limit;
unsigned int offset;
unsigned int objsize;
};
Index: linux-2.6.23-rc8-mm2/mm/slub.c
===================================================================
--- linux-2.6.23-rc8-mm2.orig/mm/slub.c 2007-10-04 22:42:08.000000000 -0700
+++ linux-2.6.23-rc8-mm2/mm/slub.c 2007-10-04 22:56:49.000000000 -0700
@@ -187,6 +187,12 @@ static inline void ClearSlabDebug(struct
*/
#define MAX_PARTIAL 10
+/*
+ * How many times the number of objects per slab can accumulate on the
+ * per cpu objects list before we drain it.
+ */
+#define DRAIN_FACTOR 4
+
#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)
@@ -1375,6 +1381,54 @@ static void unfreeze_slab(struct kmem_ca
}
}
+static void __slab_free(struct kmem_cache *s, struct page *page,
+ void *x, void *addr, unsigned int offset);
+
+/*
+ * Drain freelist of objects foreign to the slab. Interrupts must be off.
+ *
+ * This is called
+ *
+ * 1. Before taking the slub lock when a cpu slab is to be deactivated.
+ * Deactivation can only deal with native objects on the freelist.
+ *
+ * 2. If the number of objects in the per cpu structures grows beyond
+ * 3 times the objects that fit in a slab. In that case we need to throw
+ * some objects away. Stripping the foreign objects does the job and
+ * localizes any new allocations.
+ */
+static void drain_foreign(struct kmem_cache *s, struct kmem_cache_cpu *c, void *addr)
+{
+ void **freelist = c->freelist;
+
+ if (unlikely(c->node < 0)) {
+ /* Slow path user */
+ __slab_free(s, virt_to_head_page(freelist), freelist, addr, c->offset);
+ freelist = NULL;
+ c->remaining--;
+ }
+
+ if (!freelist)
+ return;
+
+ c->freelist = NULL;
+ c->remaining = 0;
+
+ while (freelist) {
+ void **object = freelist;
+ struct page *page = virt_to_head_page(freelist);
+
+ freelist = freelist[c->offset];
+ if (page == c->page) {
+ /* Local object. Keep for future allocations */
+ object[c->offset] = c->freelist;
+ c->freelist = object;
+ c->remaining++;
+ } else
+ __slab_free(s, page, object, NULL, c->offset);
+ }
+}
+
/*
* Remove the cpu slab
*/
@@ -1405,6 +1459,7 @@ static void deactivate_slab(struct kmem_
static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
+ drain_foreign(s, c, NULL);
slab_lock(c->page);
deactivate_slab(s, c);
}
@@ -1480,6 +1535,7 @@ static void *__slab_alloc(struct kmem_ca
if (!c->page)
goto new_slab;
+ drain_foreign(s, c, NULL);
slab_lock(c->page);
if (unlikely(!node_match(c, node)))
goto another_slab;
@@ -1553,6 +1609,7 @@ debug:
c->page->inuse++;
c->page->freelist = object[c->offset];
c->node = -1;
+ c->remaining = s->objects * 64;
slab_unlock(c->page);
return object;
}
@@ -1676,8 +1733,8 @@ debug:
* If fastpath is not possible then fall back to __slab_free where we deal
* with all sorts of special processing.
*/
-static void __always_inline slab_free(struct kmem_cache *s,
- struct page *page, void *x, void *addr)
+static void __always_inline slab_free(struct kmem_cache *s, void *x,
+ void *addr)
{
void **object = (void *)x;
unsigned long flags;
@@ -1686,23 +1743,17 @@ static void __always_inline slab_free(st
local_irq_save(flags);
debug_check_no_locks_freed(object, s->objsize);
c = get_cpu_slab(s, smp_processor_id());
- if (likely(page == c->page && c->node >= 0)) {
- object[c->offset] = c->freelist;
- c->freelist = object;
- c->remaining++;
- } else
- __slab_free(s, page, x, addr, c->offset);
-
+ object[c->offset] = c->freelist;
+ c->freelist = object;
+ c->remaining++;
+ if (unlikely(c->remaining >= c->drain_limit))
+ drain_foreign(s, c, addr);
local_irq_restore(flags);
}
void kmem_cache_free(struct kmem_cache *s, void *x)
{
- struct page *page;
-
- page = virt_to_head_page(x);
-
- slab_free(s, page, x, __builtin_return_address(0));
+ slab_free(s, x, __builtin_return_address(0));
}
EXPORT_SYMBOL(kmem_cache_free);
@@ -1879,6 +1930,7 @@ static void init_kmem_cache_cpu(struct k
c->node = 0;
c->offset = s->offset / sizeof(void *);
c->objsize = s->objsize;
+ c->drain_limit = DRAIN_FACTOR * s->objects;
}
static void init_kmem_cache_node(struct kmem_cache_node *n)
@@ -2626,7 +2678,7 @@ void kfree(const void *x)
put_page(page);
return;
}
- slab_free(page->slab, page, (void *)x, __builtin_return_address(0));
+ slab_free(page->slab, (void *)x, __builtin_return_address(0));
}
EXPORT_SYMBOL(kfree);
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-04 21:02 ` Chuck Ebbert
2007-10-04 21:11 ` David Miller
@ 2007-10-05 20:32 ` Peter Zijlstra
2007-10-05 21:31 ` David Miller
1 sibling, 1 reply; 110+ messages in thread
From: Peter Zijlstra @ 2007-10-05 20:32 UTC (permalink / raw)
To: Chuck Ebbert
Cc: David Miller, willy, clameter, nickpiggin, hch, mel,
linux-fsdevel, linux-kernel, dgc, jens.axboe, suresh.b.siddha
On Thu, 2007-10-04 at 17:02 -0400, Chuck Ebbert wrote:
> On 10/04/2007 04:55 PM, David Miller wrote:
> >
> > Anything, I do mean anything, can be simulated using small test
> > programs.
>
> How do you simulate reading 100TB of data spread across 3000 disks,
> selecting 10% of it using some criterion, then sorting and summarizing
> the result?
Focus on the slab allocator usage, instrument it, record a trace,
generate a statistical model that matches, and write a small
program/kernel module that has the same allocation pattern. Then verify
that this statistical workload still shows the same performance difference.
Easy: no
Doable: yes
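For the last step, the replay side is mostly mechanical. A sketch of it
(the trace array below is invented purely for illustration; a real one
would be generated from the recorded allocation events) could look like:

/* Sketch only: replay a recorded kmalloc/kfree pattern. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>

struct trace_event {
	int alloc;			/* 1 = kmalloc, 0 = kfree most recent */
	size_t size;			/* only used for allocations */
};

/* Stand-in trace; a real one would come from instrumentation. */
static struct trace_event trace[] = {
	{ 1, 192 }, { 1, 704 }, { 0, 0 }, { 1, 192 }, { 0, 0 }, { 0, 0 },
};

static void *live[ARRAY_SIZE(trace)];

static int __init replay_init(void)
{
	int i, nr_live = 0;

	for (i = 0; i < ARRAY_SIZE(trace); i++) {
		if (trace[i].alloc)
			live[nr_live++] = kmalloc(trace[i].size, GFP_KERNEL);
		else if (nr_live)
			kfree(live[--nr_live]);	/* LIFO frees for simplicity */
	}
	while (nr_live)
		kfree(live[--nr_live]);		/* clean up whatever remains */
	return 0;
}

static void __exit replay_exit(void)
{
}

module_init(replay_init);
module_exit(replay_exit);
MODULE_LICENSE("GPL");

The hard part is of course producing a trace whose size distribution, CPU
placement and object lifetimes actually match the database run.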
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: SLUB performance regression vs SLAB
2007-10-05 20:32 ` Peter Zijlstra
@ 2007-10-05 21:31 ` David Miller
0 siblings, 0 replies; 110+ messages in thread
From: David Miller @ 2007-10-05 21:31 UTC (permalink / raw)
To: a.p.zijlstra
Cc: cebbert, willy, clameter, nickpiggin, hch, mel, linux-fsdevel,
linux-kernel, dgc, jens.axboe, suresh.b.siddha
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri, 05 Oct 2007 22:32:00 +0200
> Focus on the slab allocator usage, instrument it, record a trace,
> generate a statistical model that matches, and write a small
> programm/kernel module that has the same allocation pattern. Then verify
> this statistical workload still shows the same performance difference.
>
> Easy: no
> Doable: yes
The other important bit is likely to generate a lot of DMA traffic,
such that the L2 cache bandwidth gets used on the bus
side by the PCI controller doing invalidations of both dirty
and clean L2 cache lines as devices DMA to/from them.
This will also exercise the memory controller, further contending
with the cpu when SLAB touches cold data structures.
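If a reproducer also needs to keep that kind of DMA traffic flowing, one
crude background load (a sketch; the device path, block size and alignment
are arbitrary assumptions, and the loop runs until interrupted) is to hammer
a scratch disk with O_DIRECT reads while the allocator test runs:

/* Sketch only: keep device DMA to/from memory going in the background. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/dev/sdb";	/* example device */
	size_t len = 1 << 20;					/* 1MB per read */
	off_t off = 0;
	void *buf;
	int fd;

	fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, len)) {
		perror("setup");
		return 1;
	}
	for (;;) {
		ssize_t n = pread(fd, buf, len, off);	/* DMA lands in buf */

		if (n <= 0)
			off = 0;			/* wrap at EOF or error */
		else
			off += n;
	}
	return 0;
}

Built with something like gcc -O2 -D_FILE_OFFSET_BITS=64 and pointed at a
scratch device, this keeps the controller invalidating cache lines during
the measurement.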
^ permalink raw reply [flat|nested] 110+ messages in thread
end of thread, other threads: [~2007-10-05 21:31 UTC | newest]
Thread overview: 110+ messages
-- links below jump to the message on this page --
2007-09-19 3:36 [00/17] [RFC] Virtual Compound Page Support Christoph Lameter
2007-09-19 3:36 ` [01/17] Vmalloc: Move vmalloc_to_page to mm/vmalloc Christoph Lameter
2007-09-19 3:36 ` [02/17] Vmalloc: add const Christoph Lameter
2007-09-19 3:36 ` [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries Christoph Lameter
2007-09-19 6:32 ` David Rientjes
2007-09-19 7:24 ` Anton Altaparmakov
2007-09-19 8:09 ` David Rientjes
2007-09-19 8:44 ` Anton Altaparmakov
2007-09-19 9:19 ` David Rientjes
2007-09-19 13:23 ` Anton Altaparmakov
2007-09-19 17:29 ` Christoph Lameter
2007-09-19 17:52 ` Anton Altaparmakov
2007-09-19 17:29 ` Christoph Lameter
2007-09-19 17:52 ` Anton Altaparmakov
2007-09-19 3:36 ` [04/17] vmalloc: clean up page array indexing Christoph Lameter
2007-09-19 3:36 ` [05/17] vunmap: return page array Christoph Lameter
2007-09-19 8:05 ` KAMEZAWA Hiroyuki
2007-09-19 22:15 ` Christoph Lameter
2007-09-20 0:47 ` KAMEZAWA Hiroyuki
2007-09-19 3:36 ` [06/17] vmalloc_address(): Determine vmalloc address from page struct Christoph Lameter
2007-09-19 3:36 ` [07/17] GFP_VFALLBACK: Allow fallback of compound pages to virtual mappings Christoph Lameter
2007-09-19 3:36 ` [08/17] Pass vmalloc address in page->private Christoph Lameter
2007-09-19 3:36 ` [09/17] VFALLBACK: Debugging aid Christoph Lameter
2007-09-19 3:36 ` [10/17] Use GFP_VFALLBACK for sparsemem Christoph Lameter
2007-09-19 3:36 ` [11/17] GFP_VFALLBACK for zone wait table Christoph Lameter
2007-09-19 3:36 ` [12/17] Virtual Compound page allocation from interrupt context Christoph Lameter
2007-09-19 3:36 ` [13/17] Virtual compound page freeing in " Christoph Lameter
2007-09-18 20:36 ` Nick Piggin
2007-09-20 17:50 ` Christoph Lameter
2007-09-19 3:36 ` [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area Christoph Lameter
2007-09-19 4:12 ` Gabriel C
2007-09-19 17:40 ` Christoph Lameter
2007-09-19 3:36 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Christoph Lameter
2007-09-27 21:42 ` Nick Piggin
2007-09-28 17:33 ` Christoph Lameter
2007-09-28 5:14 ` Nick Piggin
2007-10-01 20:50 ` Christoph Lameter
2007-10-02 8:43 ` Nick Piggin
2007-10-04 16:16 ` SLUB performance regression vs SLAB Matthew Wilcox
2007-10-04 17:38 ` Christoph Lameter
2007-10-04 17:50 ` Arjan van de Ven
2007-10-04 17:58 ` Christoph Lameter
2007-10-04 18:26 ` Peter Zijlstra
2007-10-04 20:48 ` David Miller
2007-10-04 20:58 ` Matthew Wilcox
2007-10-04 21:05 ` David Miller
2007-10-04 21:11 ` Christoph Lameter
2007-10-04 18:32 ` Matthew Wilcox
2007-10-04 17:49 ` Christoph Lameter
2007-10-04 19:28 ` Matthew Wilcox
2007-10-04 19:05 ` Christoph Lameter
2007-10-04 19:46 ` Siddha, Suresh B
2007-10-04 20:55 ` David Miller
2007-10-04 21:02 ` Chuck Ebbert
2007-10-04 21:11 ` David Miller
2007-10-04 21:47 ` Chuck Ebbert
2007-10-04 22:07 ` David Miller
2007-10-04 22:23 ` David Chinner
2007-10-05 6:48 ` Jens Axboe
2007-10-05 9:19 ` Pekka Enberg
2007-10-05 9:28 ` Jens Axboe
2007-10-05 11:12 ` Andi Kleen
2007-10-05 12:39 ` Jens Axboe
2007-10-05 19:31 ` Christoph Lameter
2007-10-05 19:32 ` Christoph Lameter
2007-10-05 11:56 ` Matthew Wilcox
2007-10-05 12:37 ` Jens Axboe
2007-10-05 19:27 ` Christoph Lameter
2007-10-05 20:32 ` Peter Zijlstra
2007-10-05 21:31 ` David Miller
2007-10-04 21:05 ` Matthew Wilcox
2007-10-05 2:43 ` Christoph Lameter
2007-10-05 2:53 ` Arjan van de Ven
2007-09-28 17:55 ` [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK Peter Zijlstra
2007-09-28 18:20 ` Christoph Lameter
2007-09-28 18:25 ` Peter Zijlstra
2007-09-28 18:41 ` Christoph Lameter
2007-09-28 20:22 ` Nick Piggin
2007-09-28 21:14 ` Mel Gorman
2007-09-28 20:59 ` Mel Gorman
2007-09-29 8:13 ` Andrew Morton
2007-09-29 8:47 ` Peter Zijlstra
2007-09-29 8:53 ` Peter Zijlstra
2007-09-29 9:01 ` Andrew Morton
2007-09-29 9:14 ` Peter Zijlstra
2007-09-29 9:27 ` Andrew Morton
2007-09-28 20:19 ` Nick Piggin
2007-09-29 19:20 ` Andrew Morton
2007-09-29 19:09 ` Nick Piggin
2007-09-30 20:12 ` Andrew Morton
2007-09-30 4:16 ` Nick Piggin
2007-09-29 9:00 ` Andrew Morton
2007-10-01 20:55 ` Christoph Lameter
2007-10-01 21:30 ` Andrew Morton
2007-10-01 21:38 ` Christoph Lameter
2007-10-01 21:45 ` Andrew Morton
2007-10-01 21:52 ` Christoph Lameter
2007-10-02 9:19 ` Peter Zijlstra
2007-09-29 8:45 ` Peter Zijlstra
2007-10-01 21:01 ` Christoph Lameter
2007-10-02 8:37 ` Nick Piggin
2007-09-28 21:05 ` Mel Gorman
2007-10-01 21:10 ` Christoph Lameter
2007-09-19 3:36 ` [16/17] Allow virtual fallback for buffer_heads Christoph Lameter
2007-09-19 3:36 ` [17/17] Allow virtual fallback for dentries Christoph Lameter
2007-09-19 7:34 ` [00/17] [RFC] Virtual Compound Page Support Anton Altaparmakov
2007-09-19 8:34 ` Eric Dumazet
2007-09-19 17:33 ` Christoph Lameter
2007-09-19 8:24 ` Andi Kleen
2007-09-19 17:36 ` Christoph Lameter