public inbox for linux-kernel@vger.kernel.org
* [PATCH] Containment measures for slab objects on scatter gather lists
@ 2007-06-29  4:01 Christoph Lameter
  2007-06-29  4:10 ` David Miller
  2007-06-29 13:04 ` Hugh Dickins
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Lameter @ 2007-06-29  4:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: James Bottomley, David Miller, Russell King, akpm, linux-kernel

I had a talk with James Bottomley last night and it seems that there is an 
established way of using the page structs of slab objects in the block 
layer. Drivers may use the DMA interfaces to issue control commands. In 
that case they may allocate a short structure via the slab allocator and 
put the control commands into that slab object.

The driver will then perform a sg_init_one() on the slab object. 
sg_init_one() calls sg_set_buf() which determines the page struct of a 
page. In this case sg_set_buf() will determine the page struct of a slab 
object. The dma layer may then perform operations on the "slab page". The 
block layer folks seem to have spent some time making this work right.
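[That path, rendered as a kernel C sketch (not a standalone program; the driver function and the 32-byte command size are hypothetical, the API calls are the ones named above):]

```c
#include <linux/slab.h>
#include <linux/scatterlist.h>

/* Hypothetical driver path: a short control structure is kmalloc'd and
 * placed on a one-entry scatterlist.  sg_init_one() -> sg_set_buf()
 * resolves the buffer to a page struct via virt_to_page(), so the
 * scatterlist entry ends up pointing at the *slab* page. */
static int issue_control_command(void)
{
	struct scatterlist sg;
	void *cmd = kmalloc(32, GFP_KERNEL);	/* slab object */

	if (!cmd)
		return -ENOMEM;

	/* ... fill in the control command bytes ... */
	sg_init_one(&sg, cmd, 32);

	/* ... hand &sg to the DMA/block layer, wait for completion ... */
	kfree(cmd);
	return 0;
}
```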

That usage is made a bit dangerous by the issue that the fields of the 
page struct are used to varying degrees by slab allocators for their own 
purposes. A particular problem arises with SLUB since SLUB needs to 
maximize the usage of the page struct fields to avoid having to put 
control structures in the page itself and be able to completely fill up a 
page with slab objects.

What is new in SLUB is the overloading of the mapping, index and mapcount 
fields. The other slab allocators only overload the lru and private fields 
(The version of SLOB in mm also overloads the index field like SLUB).

So it is in general not possible to call arbitrary page cache functions on 
slab pages because fields are used in ways not conforming to page cache 
use. The use of slab page structs on scatter gather lists currently only 
works because only a select few page cache functions are called on them, 
and those make restricted use of the fields in the page structs.

The called functions are:

kmap / kunmap

	This is not a problem since slab pages are never highmem
	pages. The effect of kmap, when used on the page struct of a slab
	page, is simply to determine the address of the page.

flush_dcache_page

	This is a no op for most architectures. On architectures that
	have a virtualized cache it is necessary to flush all the
	page ranges in which this page is mapped. For that purpose
	the mapping of a page must be determined. Since SLUB uses
	the mapping for a pointer to the kmem_cache structure this
	will result in using a kmem_cache address. If this address is
	used like an address_space structure then the kernel will oops.

	flush_dcache_page() must flush even for slab objects if
	slab objects are put onto the scatter gather lists since the
	dma engine must be able to read the information of the control
	structure from memory. The block layer will take care of any objects
	overlapping page boundaries and flush multiple pages if necessary.

We have only recently encountered this issue which is due to the following:

1. Architectures that need to flush virtual caches are rare and so
   flush_dcache_page() typically does nothing or nothing harmful since
   page->mapping, page->mapcount and page->index are not used.

2. It is not too frequent for a driver to perform a kmalloc in order to
   generate a control structure. As a result tests on sparc64 and arm
   did not show any problems.

Architectures affected seem to be only pa-risc and arm. Sparc64 may be 
affected too but sparc64 does not use page_mapping to scan over uses of 
the page (and xtensa has the virtual cache flushing commented out for some 
reason). The issue is currently fixed in Linus' tree by Hugh's check for a 
slab page in page_mapping.

We have here an established use that is somewhat dicey. It may be best to 
ensure that this sort of use is restricted to the functions listed above. 
The issue only occurs in the flush_dcache_page() implementations of the 
two or three listed platforms.

Make sure that no other page cache functions are called for such pages by 
putting a VM_BUG_ON(PageSlab(page)) into the frequently used 
page_mapping() inline function. It will trigger even on platforms that do 
not implement a flush_dcache_page() function if page_mapping() is used on 
any slab page so that we can detect potential issues early. This is useful 
regardless of the slab allocator in use since in general a slab page 
cannot be handled like a regular page cache page.

Modify the functions in the affected arches to check for PageSlab() and 
use a NULL mapping if such a page is encountered. This may only be 
necessary for parisc and arm since sparc64 and xtensa do not scan over 
processes mapping a page but I have modified those two arches also for 
correctness' sake since they use page_mapping() in flush_dcache_page().

Still a better solution would be to not use the slab allocator at all for 
the objects that are used to send commands to the devices. These are not 
permanent and grabbing a page from the pcp lists and putting it back is 
likely as fast as performing a kmalloc.
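[A minimal kernel C sketch of that alternative (fragment only; N stands for a hypothetical command length, and the surrounding driver context is assumed):]

```c
/* Instead of kmalloc, take a whole page: the resulting page struct is a
 * regular page with real refcounting, so it is safe on a scatterlist. */
struct page *page = alloc_page(GFP_KERNEL);
struct scatterlist sg;
void *cmd;

if (!page)
	return -ENOMEM;
cmd = page_address(page);
/* ... build the control command in the first N bytes ... */
sg_init_one(&sg, cmd, N);	/* only the first N bytes are used for DMA */
/* ... after the I/O completes: */
__free_page(page);
```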

I have no access to the platforms discussed here. Testing was restricted 
to running it on my laptop.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6/arch/arm/mm/flush.c
===================================================================
--- linux-2.6.orig/arch/arm/mm/flush.c	2007-06-28 18:57:18.000000000 -0700
+++ linux-2.6/arch/arm/mm/flush.c	2007-06-28 19:11:46.000000000 -0700
@@ -188,7 +188,17 @@
  */
 void flush_dcache_page(struct page *page)
 {
-	struct address_space *mapping = page_mapping(page);
+	struct address_space *mapping;
+
+	/*
+	 * This function is special in that a page struct obtained via
+	 * virt_to_page from a slab object may be passed to it. However, slab
+	 * allocators may use the mapping field for their own purposes. As a
+	 * result mapping may be != NULL here although the page is not mapped.
+	 * Slab objects are never mapped into user space so use NULL for that
+	 * special case.
+	 */
+	mapping = PageSlab(page) ? NULL : page_mapping(page);
 
 #ifndef CONFIG_SMP
 	if (mapping && !mapping_mapped(mapping))
Index: linux-2.6/arch/parisc/kernel/cache.c
===================================================================
--- linux-2.6.orig/arch/parisc/kernel/cache.c	2007-06-28 18:57:18.000000000 -0700
+++ linux-2.6/arch/parisc/kernel/cache.c	2007-06-28 19:11:46.000000000 -0700
@@ -339,7 +339,7 @@
 
 void flush_dcache_page(struct page *page)
 {
-	struct address_space *mapping = page_mapping(page);
+	struct address_space *mapping;
 	struct vm_area_struct *mpnt;
 	struct prio_tree_iter iter;
 	unsigned long offset;
@@ -347,6 +347,15 @@
 	pgoff_t pgoff;
 	unsigned long pfn = page_to_pfn(page);
 
+	/*
+	 * This function is special in that a page struct obtained via
+	 * virt_to_page from a slab object may be passed to it. However, slab
+	 * allocators may use the mapping field for their own purposes. As a
+	 * result mapping may be != NULL here although the page is not mapped.
+	 * Slab objects are never mapped into user space so use NULL for that
+	 * special case.
+	 */
+	mapping = PageSlab(page) ? NULL : page_mapping(page);
 
 	if (mapping && !mapping_mapped(mapping)) {
 		set_bit(PG_dcache_dirty, &page->flags);
Index: linux-2.6/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/init.c	2007-06-28 18:57:18.000000000 -0700
+++ linux-2.6/arch/sparc64/mm/init.c	2007-06-28 19:11:46.000000000 -0700
@@ -339,7 +339,15 @@
 
 	this_cpu = get_cpu();
 
-	mapping = page_mapping(page);
+	/*
+	 * This function is special in that a page struct obtained via
+	 * virt_to_page from a slab object may be passed to it. However, slab
+	 * allocators may use the mapping field for their own purposes. As a
+	 * result mapping may be != NULL here although the page is not mapped.
+	 * Slab objects are never mapped into user space so use NULL for that
+	 * special case.
+	 */
+	mapping = PageSlab(page) ? NULL : page_mapping(page);
 	if (mapping && !mapping_mapped(mapping)) {
 		int dirty = test_bit(PG_dcache_dirty, &page->flags);
 		if (dirty) {
Index: linux-2.6/arch/xtensa/mm/init.c
===================================================================
--- linux-2.6.orig/arch/xtensa/mm/init.c	2007-06-28 18:57:18.000000000 -0700
+++ linux-2.6/arch/xtensa/mm/init.c	2007-06-28 19:11:46.000000000 -0700
@@ -433,7 +433,7 @@
 void flush_dcache_page(struct page *page)
 {
 	unsigned long addr = __pa(page_address(page));
-	struct address_space *mapping = page_mapping(page);
+	struct address_space *mapping;
 
 	__flush_invalidate_dcache_page_phys(addr);
 
@@ -442,6 +442,15 @@
 
 	/* If this page hasn't been mapped, yet, handle I$/D$ coherency later.*/
 #if 0
+	/*
+	 * This function is special in that a page struct obtained via
+	 * virt_to_page from a slab object may be passed to it. However, slab
+	 * allocators may use the mapping field for their own purposes. As a
+	 * result mapping may be != NULL although the page is not mapped.
+	 * Slab objects are never mapped into user space so use NULL for that
+	 * special case.
+	 */
+	mapping = PageSlab(page) ? NULL : page_mapping(page);
 	if (mapping && !mapping_mapped(mapping))
 		clear_bit(PG_cache_clean, &page->flags);
 	else
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2007-06-28 18:57:18.000000000 -0700
+++ linux-2.6/include/linux/mm.h	2007-06-28 19:11:46.000000000 -0700
@@ -601,12 +601,16 @@
 {
 	struct address_space *mapping = page->mapping;
 
+	/*
+	 * The scatter gather layer currently may place the page struct
+	 * of slab objects onto the scatter gather lists. The VM_BUG_ON
+	 * here ensures that we get notified if page cache functions are
+	 * called for such a page.
+	 */
+	VM_BUG_ON(PageSlab(page));
+
 	if (unlikely(PageSwapCache(page)))
 		mapping = &swapper_space;
-#ifdef CONFIG_SLUB
-	else if (unlikely(PageSlab(page)))
-		mapping = NULL;
-#endif
 	else if (unlikely((unsigned long)mapping & PAGE_MAPPING_ANON))
 		mapping = NULL;
 	return mapping;
Index: linux-2.6/include/linux/highmem.h
===================================================================
--- linux-2.6.orig/include/linux/highmem.h	2007-06-28 19:11:48.000000000 -0700
+++ linux-2.6/include/linux/highmem.h	2007-06-28 19:25:36.000000000 -0700
@@ -38,6 +38,11 @@
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	/*
+	 * WARNING: page may be a slab page and kmap may be called for such
+	 * a page as a result of putting a slab object onto scatter gather
+	 * lists by a device driver.
+	 */
 	might_sleep();
 	return page_address(page);
 }
@@ -48,6 +53,11 @@
 
 static inline void *kmap_atomic(struct page *page, enum km_type idx)
 {
+	/*
+	 * WARNING: page may be a slab page and kmap_atomic may be called for
+	 * such a page as a result of putting a slab object onto scatter gather
+	 * lists by a device driver.
+	 */
 	pagefault_disable();
 	return page_address(page);
 }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  4:01 [PATCH] Containment measures for slab objects on scatter gather lists Christoph Lameter
@ 2007-06-29  4:10 ` David Miller
  2007-06-29  4:22   ` Christoph Lameter
  2007-06-29 13:04 ` Hugh Dickins
  1 sibling, 1 reply; 22+ messages in thread
From: David Miller @ 2007-06-29  4:10 UTC (permalink / raw)
  To: clameter; +Cc: hugh, James.Bottomley, rmk+lkml, akpm, linux-kernel

From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 28 Jun 2007 21:01:36 -0700 (PDT)

> Modify the functions in the affected arches to check for PageSlab() and 
> use a NULL mapping if such a page is encountered. This may only be 
> necessary for parisc and arm since sparc64 and xtensa do not scan over 
> processes mapping a page but I have modified those two arches also for 
> correctness' sake since they use page_mapping() in flush_dcache_page().
> 
> Still a better solution would be to not use the slab allocator at all for 
> the objects that are used to send commands to the devices. These are not 
> permanent and grabbing a page from the pcp lists and putting it back is 
> likely as fast as performing a kmalloc.

Jens Axboe wants to get references to the page structs behind
kmalloc() allocated pages in his networking splice work.

We pass scatterlists around, but networking buffers are
composed of a kmalloc()'d data header area for packet headers
and some of the initial packet data, then a true scatterlist
of page/offset/len triplets.

Splice wants to work with pages everywhere.

This is a recurring theme; we should provide some kind of
solution to these issues instead of wishing they would go
away.

We could make a special allocator in the networking that carves
chunks out of pages but I'm sure you'll find that about as stupid
and wasteful as I do :-)



* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  4:10 ` David Miller
@ 2007-06-29  4:22   ` Christoph Lameter
  2007-06-29  4:28     ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2007-06-29  4:22 UTC (permalink / raw)
  To: David Miller; +Cc: hugh, James.Bottomley, rmk+lkml, akpm, linux-kernel

On Thu, 28 Jun 2007, David Miller wrote:

> > Still a better solution would be to not use the slab allocator at all for 
> > the objects that are used to send commands to the devices. These are not 
> > permanent and grabbing a page from the pcp lists and putting it back is 
> > likely as fast as performing a kmalloc.
> 
> Jens Axboe wants to get references to the page structs behind
> kmalloc() allocated pages in his networking splice work.

You can get such a reference, and then if all objects are freed the slab 
page will be in limbo until that reference is given up. The reference 
method is also used by kmem_cache_vacate() (but that is slab internal).

> We could make a special allocator in the networking that carves
> chunks out of pages but I'm sure you'll find that about as stupid
> and wasteful as I do :-)

Well then I guess this containment patch is as good as we can get.
It makes sure that things do not get out of hand.



* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  4:22   ` Christoph Lameter
@ 2007-06-29  4:28     ` David Miller
  2007-06-29  4:39       ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2007-06-29  4:28 UTC (permalink / raw)
  To: clameter; +Cc: hugh, James.Bottomley, rmk+lkml, akpm, linux-kernel

From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 28 Jun 2007 21:22:22 -0700 (PDT)

> On Thu, 28 Jun 2007, David Miller wrote:
> 
> > > Still a better solution would be to not use the slab allocator at all for 
> > > the objects that are used to send commands to the devices. These are not 
> > > permanent and grabbing a page from the pcp lists and putting it back is 
> > > likely as fast as performing a kmalloc.
> > 
> > Jens Axboe wants to get references to the page structs behind
> > kmalloc() allocated pages in his networking splice work.
> 
> You can get such a reference, and then if all objects are freed the slab 
> page will be in limbo until that reference is given up. The reference 
> method is also used by kmem_cache_vacate() (but that is slab internal).

What about if someone kfree()'s that object meanwhile?

Can we bump the SLAB object count just like we can a page?
That's basically what's happening in the stuff Jens is
working on, he needs to grab a reference to a SLAB
object just like one can a page.  Even if there is an
intervening kfree (like a put_page()) the SLAB object is
still live until all the references are put, and thus it
can't get reallocated and given to another client by SLAB.


* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  4:28     ` David Miller
@ 2007-06-29  4:39       ` Christoph Lameter
  2007-06-29  5:06         ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2007-06-29  4:39 UTC (permalink / raw)
  To: David Miller; +Cc: hugh, James.Bottomley, rmk+lkml, akpm, linux-kernel

On Thu, 28 Jun 2007, David Miller wrote:

> > You can get such a reference, and then if all objects are freed the slab 
> > page will be in limbo until that reference is given up. The reference 
> > method is also used by kmem_cache_vacate() (but that is slab internal).
> 
> What about if someone kfree()'s that object meanwhile?

If that is the last object in the slab then the page is deslabified and 
it will stick around as a regular page until its refcount reaches zero.

There is slab functionality in SLUB that uses this trick to ensure that 
the slab does not go away.

> Can we bump the SLAB object count just like we can a page?

You can bump the slab page count. But you have to use virt_to_head_page() 
instead of virt_to_page() since slab pages may be compound pages (in both 
SLAB and SLUB). Incrementing the refcount of a tail page will cause an oops on
free.
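[As a sketch of that rule (kernel C fragment; the object size and I/O context are assumed):]

```c
/* Pinning the slab page behind a kmalloc'd object.  virt_to_head_page()
 * is mandatory: slab pages may be compound, and taking a reference on a
 * tail page would oops when the reference is dropped. */
void *obj = kmalloc(size, GFP_KERNEL);
struct page *page = virt_to_head_page(obj);	/* not virt_to_page() */

get_page(page);		/* the page survives even after the slab lets go of it */
/* ... object handed to splice / DMA ... */
put_page(page);		/* release the pin when the I/O is done */
```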

> That's basically what's happening in the stuff Jens is
> working on, he needs to grab a reference to a SLAB
> object just like one can a page.  Even if there is an
> intervening kfree (like a put_page()) the SLAB object is
> still live until all the references are put, and thus it
> can't get reallocated and given to another client by SLAB.

If you kfree the object then all slab allocators will feel free to 
immediately alloc the object for other purposes.

If you want to prohibit further allocations from a slab then I can likely 
give you a function call that removes a slab from the partial lists so 
that the slab will not be allocated from. But this will only work with 
SLUB.

At some point the slab will then need to be returned to the partial slab 
lists.

Hmmmm... Maybe we are creating more of a mess with this. Isn't there some 
other way to handle these objects?





* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  4:39       ` Christoph Lameter
@ 2007-06-29  5:06         ` David Miller
  2007-06-29  5:24           ` Andrew Morton
  2007-06-29  7:00           ` Christoph Lameter
  0 siblings, 2 replies; 22+ messages in thread
From: David Miller @ 2007-06-29  5:06 UTC (permalink / raw)
  To: clameter; +Cc: hugh, James.Bottomley, rmk+lkml, akpm, linux-kernel

From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 28 Jun 2007 21:39:01 -0700 (PDT)

> Hmmmm... Maybe we are creating more of a mess with this. Isn't there some 
> other way to handle these objects?

That's where I was going with the silly idea to use another
allocator :)

Really, it would be great if we could treat kmalloc() objects
just like real pages.  Everything wants to do I/O on pages
but sometimes (like the networking) you have a kmalloc
chunk which is technically just a part of a page.

The fact that there is no easy way to make this work is
frustrating :-)


* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  5:06         ` David Miller
@ 2007-06-29  5:24           ` Andrew Morton
  2007-06-29  5:37             ` David Miller
  2007-06-30  8:49             ` Christoph Hellwig
  2007-06-29  7:00           ` Christoph Lameter
  1 sibling, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2007-06-29  5:24 UTC (permalink / raw)
  To: David Miller; +Cc: clameter, hugh, James.Bottomley, rmk+lkml, linux-kernel

On Thu, 28 Jun 2007 22:06:06 -0700 (PDT) David Miller <davem@davemloft.net> wrote:

> From: Christoph Lameter <clameter@sgi.com>
> Date: Thu, 28 Jun 2007 21:39:01 -0700 (PDT)
> 
> > Hmmmm... Maybe we are creating more of a mess with this. Isn't there some 
> > other way to handle these objects?
> 
> That's where I was going with the silly idea to use another
> allocator :)
> 
> Really, it would be great if we could treat kmalloc() objects
> just like real pages.

From a high level, that seems like a bad idea.  kmalloc() gives you a
virtual address and you really shouldn't be poking around at that memory's
underlying page's pageframe metadata.

However we can of course do tasteless and weird things if the benefit is
sufficient....

>  Everything wants to do I/O on pages
> but sometimes (like the networking) you have a kmalloc
> chunk which is technically just a part of a page.

hm.  So what happens when two quite different threads of control are doing
IO against two hunks of kmalloced memory which happen to come from the same
page?  Either some (kernel-wide) locking is needed, or that pageframe needs
to be treated as readonly?



* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  5:24           ` Andrew Morton
@ 2007-06-29  5:37             ` David Miller
  2007-06-29  5:45               ` Andrew Morton
  2007-06-29  6:50               ` Christoph Lameter
  2007-06-30  8:49             ` Christoph Hellwig
  1 sibling, 2 replies; 22+ messages in thread
From: David Miller @ 2007-06-29  5:37 UTC (permalink / raw)
  To: akpm; +Cc: clameter, hugh, James.Bottomley, rmk+lkml, linux-kernel

From: Andrew Morton <akpm@linux-foundation.org>
Date: Thu, 28 Jun 2007 22:24:24 -0700

> So what happens when two quite different threads of control are doing
> IO against two hunks of kmalloced memory which happen to come from the same
> page?  Either some (kernel-wide) locking is needed, or that pageframe needs
> to be treated as readonly?

Or you put an atomic_t at the beginning or tail of every SLAB
object.  It's a space cost not a runtime cost for the common
case which is:

	smp_rmb();
	if (atomic_read(&slab_obj->count) == 1)
		really_free_it();
	else if (atomic_dec_and_test(...))

Note I don't like this variant either. :)
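[For illustration, the same pattern as self-contained user-space C (stdatomic; all names here are hypothetical, and a kernel version would use atomic_t with atomic_dec_and_test() instead):]

```c
#include <stdatomic.h>
#include <stdlib.h>

/* An object carrying its own refcount, as in the sketch above. */
struct obj {
	atomic_int count;
	char data[32];
};

static struct obj *obj_alloc(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (o)
		atomic_init(&o->count, 1);	/* the allocation holds one ref */
	return o;
}

static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->count, 1);		/* e.g. splice pins the object */
}

/* kfree() analogue: really free only when the last reference drops.
 * Returns 1 if the object was freed, 0 if it is still pinned. */
static int obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->count, 1) == 1) {
		free(o);
		return 1;
	}
	return 0;
}
```

[So an intervening put by the original owner leaves the memory live until the last pin is dropped, which is the property wanted for kmalloc'd packet headers.]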



* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  5:37             ` David Miller
@ 2007-06-29  5:45               ` Andrew Morton
  2007-06-29  6:51                 ` Christoph Lameter
  2007-06-29 12:16                 ` Alan Cox
  2007-06-29  6:50               ` Christoph Lameter
  1 sibling, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2007-06-29  5:45 UTC (permalink / raw)
  To: David Miller; +Cc: clameter, hugh, James.Bottomley, rmk+lkml, linux-kernel

On Thu, 28 Jun 2007 22:37:34 -0700 (PDT) David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Thu, 28 Jun 2007 22:24:24 -0700
> 
> > So what happens when two quite different threads of control are doing
> > IO against two hunks of kmalloced memory which happen to come from the same
> > page?  Either some (kernel-wide) locking is needed, or that pageframe needs
> > to be treated as readonly?
> 
> Or you put an atomic_t at the beginning or tail of every SLAB
> object.  It's a space cost not a runtime cost for the common
> case which is:
> 
> 	smp_rmb();
> 	if (atomic_read(&slab_obj->count) == 1)
> 		really_free_it();
> 	else if (atomic_dec_and_test(...))
> 
> Note I don't like this variant either. :)

but, but...  Christoph said 'The dma layer may then perform operations on
the "slab page"'.

If those operations involve modifying that slab page's pageframe then what
stops concurrent dma'ers from stomping on each other's changes?  As in:
why aren't we already buggy?

If those operations _don't_ involve modifying the pageframe (hopes this is
true) then we're read-only and things become much easier?


* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  5:37             ` David Miller
  2007-06-29  5:45               ` Andrew Morton
@ 2007-06-29  6:50               ` Christoph Lameter
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2007-06-29  6:50 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, hugh, James.Bottomley, rmk+lkml, linux-kernel

On Thu, 28 Jun 2007, David Miller wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Thu, 28 Jun 2007 22:24:24 -0700
> 
> > So what happens when two quite different threads of control are doing
> > IO against two hunks of kmalloced memory which happen to come from the same
> > page?  Either some (kernel-wide) locking is needed, or that pageframe needs
> > to be treated as readonly?
> 
> Or you put an atomic_t at the beginning or tail of every SLAB
> object.  It's a space cost not a runtime cost for the common
> case which is:

Hmmm... We could do something like

kmem_cache_get(slab, object)

and

kmem_cache_put(slab, object)

kmem_cache_get would disable allocations from the slab and increment a 
refcount.

kmem_cache_put would enable allocations again if the refcount reaches 
one.

The problem is that freeing an object may cause writes to the object,
e.g. poisoning will overwrite the object on free. SLUB will put its free 
pointer in the first words, etc.
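[What a call site might look like if such an interface existed (purely hypothetical: kmem_cache_get()/kmem_cache_put() are the proposed names from above, not real kernel functions):]

```c
/* Hypothetical pin/unpin API rendering the proposal above. */
void *cmd = kmem_cache_alloc(cachep, GFP_KERNEL);

kmem_cache_get(cachep, cmd);	/* pin: refcount++, stop allocations from this slab */
/* ... object sits on a scatterlist, I/O in flight ... */
kmem_cache_put(cachep, cmd);	/* unpin: slab returns to the partial lists */
```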





* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  5:45               ` Andrew Morton
@ 2007-06-29  6:51                 ` Christoph Lameter
  2007-06-29 12:16                 ` Alan Cox
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2007-06-29  6:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Miller, hugh, James.Bottomley, rmk+lkml, linux-kernel

On Thu, 28 Jun 2007, Andrew Morton wrote:

> If those operations _don't_ involve modifying the pageframe (hopes this is
> true) then we're read-only and things become much easier?

This is true right now. We are way off topic ...



* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  5:06         ` David Miller
  2007-06-29  5:24           ` Andrew Morton
@ 2007-06-29  7:00           ` Christoph Lameter
  2007-06-29  9:06             ` David Miller
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2007-06-29  7:00 UTC (permalink / raw)
  To: David Miller; +Cc: hugh, James.Bottomley, rmk+lkml, akpm, linux-kernel

On Thu, 28 Jun 2007, David Miller wrote:

> Really, it would be great if we could treat kmalloc() objects
> just like real pages.  Everything wants to do I/O on pages
> but sometimes (like the networking) you have a kmalloc
> chunk which is technically just a part of a page.
> 
> The fact that there is no easy way to make this work is
> frustrating :-)

There is an easy way: Allocate a page and just use the first N bytes. You can 
specify the bytes to be used when putting the memory onto the scatter 
gather list. This wastes memory but it works. You have real refcounting 
since you got a real page.

How frequent are these objects?



* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  7:00           ` Christoph Lameter
@ 2007-06-29  9:06             ` David Miller
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2007-06-29  9:06 UTC (permalink / raw)
  To: clameter; +Cc: hugh, James.Bottomley, rmk+lkml, akpm, linux-kernel

From: Christoph Lameter <clameter@sgi.com>
Date: Fri, 29 Jun 2007 00:00:39 -0700 (PDT)

> On Thu, 28 Jun 2007, David Miller wrote:
> 
> > Really, it would be great if we could treat kmalloc() objects
> > just like real pages.  Everything wants to do I/O on pages
> > but sometimes (like the networking) you have a kmalloc
> > chunk which is technically just a part of a page.
> > 
> > The fact that there is no easy way to make this work is
> > frustrating :-)
> 
> There is an easy way: Allocate a page and just use the first N bytes. You can 
> specify the bytes to be used when putting the memory onto the scatter 
> gather list. This wastes memory but it works. You have real refcounting 
> since you got a real page.
> 
> How frequent are these objects?

Every single network packet.


* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  5:45               ` Andrew Morton
  2007-06-29  6:51                 ` Christoph Lameter
@ 2007-06-29 12:16                 ` Alan Cox
  2007-06-29 20:45                   ` Andrew Morton
  1 sibling, 1 reply; 22+ messages in thread
From: Alan Cox @ 2007-06-29 12:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Miller, clameter, hugh, James.Bottomley, rmk+lkml,
	linux-kernel

> If those operations involve modifying that slab page's pageframe then what
> stops concurrent dma'ers from stomping on each other's changes?  As in:
> why aren't we already buggy?

Or DMA operations falling out with CPU operations in the same memory
area. Not all platforms have hardware consistency and some will blat the
entire page out of cache.



* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  4:01 [PATCH] Containment measures for slab objects on scatter gather lists Christoph Lameter
  2007-06-29  4:10 ` David Miller
@ 2007-06-29 13:04 ` Hugh Dickins
  2007-06-29 14:15   ` Christoph Lameter
  1 sibling, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2007-06-29 13:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: James Bottomley, David Miller, Russell King, akpm, linux-kernel

On Thu, 28 Jun 2007, Christoph Lameter wrote:

> I had a talk with James Bottomley last night and it seems that there is an 
> established way of using the page structs of slab objects in the block 
> layer. Drivers may use the DMA interfaces to issue control commands. In 
> that case they may allocate a short structure via the slab allocator and 
> put the control commands into that slab object.
> 
> The driver will then perform a sg_init_one() on the slab object. 
> sg_init_one() calls sg_set_buf() which determines the page struct of a 
> page. In this case sg_set_buf() will determine the page struct of a slab 
> object. The dma layer may then perform operations on the "slab page". The 
> block layer folks seem to have spent some time making this work right.

Yes, I don't see why this comes as such a surprise and horror to you,
so much in need of dire WARNINGs.  kmalloc memory is not a different
kind of memory from what you get from the page allocators.

I stand by my page_mapping patch, and the remark I made before,
that page_mapping(page) is the correct place to check this.  What is
page_mapping(page) for?  Precisely to return the struct address_space*
from page->mapping when that's what's in there, and not when that field
has been reused for something else.

So lines like
> +	mapping = PageSlab(page) ? NULL : page_mapping(page);
seem to miss the point.

I agree that the only clash found yet has been in flush_dcache_page,
so some bytes and branches can indeed be saved by just doing the
test in there.  Oh, but your VM_BUG_ON cancels out that saving.
And if we were to try to save bytes and branches there, it's the
synthetic swapper_space business (only required in a couple of
places) I'd be wanting to cut out.

To me this all seems like a big fuss to excuse your surprise:
so please don't expect an Ack from me; but if others prefer
this, I won't be Nacking.  (Though I'll probably whine about
it into eternity ;)

Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29 13:04 ` Hugh Dickins
@ 2007-06-29 14:15   ` Christoph Lameter
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2007-06-29 14:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: James Bottomley, David Miller, Russell King, akpm, linux-kernel

On Fri, 29 Jun 2007, Hugh Dickins wrote:

> I stand by my page_mapping patch, and the remark I made before,
> that page_mapping(page) is the correct place to check this.  What is
> page_mapping(page) for?  Precisely to return the struct address_space*
> from page->mapping when that's what's in there, and not when that field
> has been reused for something else.
> 
> So lines like
> > +	mapping = PageSlab(page) ? NULL : page_mapping(page);
> seem to miss the point.

They certainly point out to the reader that a slab page can turn up
here, where one might not expect it, since flush_dcache_page is a page
cache function.

> I agree that the only clash found yet has been in flush_dcache_page,
> so some bytes and branches can indeed be saved by just doing the
> test in there.  Oh, but your VM_BUG_ON cancels out that saving.
> And if we were to try to save bytes and branches there, it's the
> synthetic swapper_space business (only required in a couple of
> places) I'd be wanting to cut out.

VM_BUG_ON is different from BUG_ON. BUG_ON is always checked. VM_BUG_ON 
depends on a debug config option.

> To me this all seems like a big fuss to excuse your surprise:

I am weirded out by the use of terms like "accusations" and "excuses" in 
these discussions. But if it makes you feel better....

Others seem to have encountered the same surprises before me. So it is 
better to point out these issues in the sources. There is the danger of 
other page cache functions getting called for slab pages in the I/O 
layer, and the check in page_mapping provides some protection against 
such changes. The checks/comments in the functions where we allow slab 
page use help those enhancing the code to keep these issues in mind, and 
help them not be surprised in turn.


* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29 12:16                 ` Alan Cox
@ 2007-06-29 20:45                   ` Andrew Morton
  2007-06-29 21:14                     ` Russell King
  2007-06-29 22:39                     ` Alan Cox
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2007-06-29 20:45 UTC (permalink / raw)
  To: Alan Cox
  Cc: David Miller, clameter, hugh, James.Bottomley, rmk+lkml,
	linux-kernel

On Fri, 29 Jun 2007 13:16:57 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > If those operations involve modifying that slab page's pageframe then what
> > stops concurrent dma'ers from stomping on each other's changes?  As in:
> > why aren't we already buggy?
> 
> Or DMA operations falling out with CPU operations in the same memory
> area. Not all platforms have hardware consistency and some will blat the
> entire page out of cache.

Is that just a performance problem, or can data be lost here?  It depends
on the meaning of "blat": writeback?  invalidate?  More details, please.


I'm dyin here and nobody will talk to me.  If the kernel is already doing
these things, why aren't we already buggy?  Is it because we don't actually
modify the pageframes of these dma-to-from-kmalloced pages?  But we were
thinking of doing so in the future?


* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29 20:45                   ` Andrew Morton
@ 2007-06-29 21:14                     ` Russell King
  2007-06-29 23:11                       ` Alan Cox
  2007-06-29 22:39                     ` Alan Cox
  1 sibling, 1 reply; 22+ messages in thread
From: Russell King @ 2007-06-29 21:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alan Cox, David Miller, clameter, hugh, James.Bottomley,
	linux-kernel

On Fri, Jun 29, 2007 at 01:45:29PM -0700, Andrew Morton wrote:
> On Fri, 29 Jun 2007 13:16:57 +0100
> Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > > If those operations involve modifying that slab page's pageframe then what
> > > stops concurrent dma'ers from stomping on each other's changes?  As in:
> > > why aren't we already buggy?
> > 
> > Or DMA operations falling out with CPU operations in the same memory
> > area. Not all platforms have hardware consistency and some will blat the
> > entire page out of cache.
> 
> Is that just a performance problem, or can data be lost here?  It depends
> on the meaning of "blat": writeback?  invalidate?  More details, please.
> 
> 
> I'm dyin here and nobody will talk to me.  If the kernel is already doing
> these things, why aren't we already buggy?  Is it because we don't actually
> modify the pageframes of these dma-to-from-kmalloced pages?  But we were
> thinking of doing so in the future?

I think people are getting too het up about this.

DMA to or from memory should be done via the DMA mapping API.  If we're
DMAing to/from a limited range within a page, either we should be using
dma_map_single(), or dma_map_page() with an appropriate offset and size.

Other cache flushing functions should not be called for DMA operations;
any cache handling required by non-coherent architectures should be done
by the DMA API only.

However, with non-coherent aliasing architectures (such as those with
aliasing VIPT or VIVT caches) there is an additional requirement on PIO
to page cache.  If the page we're writing data to has some cache lines
allocated to it, we potentially hit those cache lines and the data
doesn't hit the underlying page.  Later on, when we come to map the
page into userspace, the data may still be sitting in the cache lines
corresponding with the kernel's mapping.  Therefore, there is a
requirement to ensure that the cache state WRT the kernel's mapping is
the same irrespective of the method by which data ends up in the page.

That means that for these caches, the data PIO'd into the page must be
written back to the underlying page before the page is handed to
userspace.

The two are completely separate; it seems to me from the above discussion
that people are confusing the two scenarios, and mixing DMA with the PIO
cache handling.  Please don't, you'll only get more and more confused.

(Note: with the dma_map_* API, architectures have to be sensible when
they're passed offsets and sizes which aren't cacheline aligned.
Technically, it's buggy to ask for non-L1 line aligned offsets and
sizes, but they do happen.  We handle this on ARM by writing back
the overlapped lines and invalidating the rest before the DMA operation
commences, and hope that the overlapped lines aren't touched for the
duration of the DMA.)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29 20:45                   ` Andrew Morton
  2007-06-29 21:14                     ` Russell King
@ 2007-06-29 22:39                     ` Alan Cox
  1 sibling, 0 replies; 22+ messages in thread
From: Alan Cox @ 2007-06-29 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Miller, clameter, hugh, James.Bottomley, rmk+lkml,
	linux-kernel

On Fri, 29 Jun 2007 13:45:29 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 29 Jun 2007 13:16:57 +0100
> Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > > If those operations involve modifying that slab page's pageframe then what
> > > stops concurrent dma'ers from stomping on each other's changes?  As in:
> > > why aren't we already buggy?
> > 
> > Or DMA operations falling out with CPU operations in the same memory
> > area. Not all platforms have hardware consistency and some will blat the
> > entire page out of cache.
> 
> Is that just a performance problem, or can data be lost here?  It depends
> on the meaning of "blat": writeback?  invalidate?  More details, please.

Invalidate. Sorry, didn't realise they hadn't discovered that word down
under.

If you've got something packing objects in tight, we are going to have
fun with cache handling, simply because the CPU cache granularity may
mean that the invalidate also invalidates a few bytes beyond the object
(ie a 12 byte object will invalidate 16 bytes of memory) and you've just
removed any CPU-held changes in the start of the next object.

Alan


* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29 21:14                     ` Russell King
@ 2007-06-29 23:11                       ` Alan Cox
  2007-06-30  7:54                         ` Russell King
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Cox @ 2007-06-29 23:11 UTC (permalink / raw)
  To: Russell King
  Cc: Andrew Morton, David Miller, clameter, hugh, James.Bottomley,
	linux-kernel

> DMA to or from memory should be done via the DMA mapping API.  If we're
> DMAing to/from a limited range within a page, either we should be using
> dma_map_single(), or dma_map_page() with an appropriate offset and size.

If those ranges overlap a cache line then the dma mapping API will not
save your backside.

On a system with a 32 byte cache granularity, what happens if you get
two dma mapping calls for x and x+16?  Right now the thing that avoids
this occurring is that the allocators don't pack stuff in that hard, so
x+16 always belongs to the same driver, and we can hope driver authors
are sensible.

> sizes, but they do happen.  We handle this on ARM by writing back
> the overlapped lines and invalidating the rest before the DMA operation
> commences, and hope that the overlapped lines aren't touched for the
> duration of the DMA.)

The combination of "hope" and "DMA" isn't a good one for stable system
design. In this situation we should be waving large red flags.


* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29 23:11                       ` Alan Cox
@ 2007-06-30  7:54                         ` Russell King
  0 siblings, 0 replies; 22+ messages in thread
From: Russell King @ 2007-06-30  7:54 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, David Miller, clameter, hugh, James.Bottomley,
	linux-kernel

On Sat, Jun 30, 2007 at 12:11:38AM +0100, Alan Cox wrote:
> > DMA to or from memory should be done via the DMA mapping API.  If we're
> > DMAing to/from a limited range within a page, either we should be using
> > dma_map_single(), or dma_map_page() with an appropriate offset and size.
> 
> If those ranges overlap a cache line then the dma mapping API will not
> save your backside.

There's nothing much that the DMA API can do though.  Consider DMA
to a result buffer which is, eg, only 16 bytes in size.  So you get
passed a size of '16' to the DMA API.  What should you do at this
point?  BUG()?

What if you have 64 or 128 byte cache lines?

> > sizes, but they do happen.  We handle this on ARM by writing back
> > the overlapped lines and invalidating the rest before the DMA operation
> > commences, and hope that the overlapped lines aren't touched for the
> > duration of the DMA.)
> 
> The combination of "hope" and "DMA" isn't a good one for stable system
> design. In this situation we should be waving large red flags

I agree.

However, I don't think this is an issue for the DMA API to handle; it's
something that driver authors need to be aware of.  If they wish to do
a DMA to a kmalloc'd buffer or even a page, we could require that
offsets and sizes are cache line aligned.

However, remember that turning on slab debugging turns off cache line
alignment, so imposing such a requirement implies that the slab
debugging will break DMA, or driver authors also have to be aware of
that and do their own alignment internally, *or* we provide an allocator
which does unconditionally align.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


* Re: [PATCH] Containment measures for slab objects on scatter gather lists
  2007-06-29  5:24           ` Andrew Morton
  2007-06-29  5:37             ` David Miller
@ 2007-06-30  8:49             ` Christoph Hellwig
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2007-06-30  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Miller, clameter, hugh, James.Bottomley, rmk+lkml,
	linux-kernel

On Thu, Jun 28, 2007 at 10:24:24PM -0700, Andrew Morton wrote:
> > Really, it would be great if we could treat kmalloc() objects
> > just like real pages.
> 
> From a high level, that seems like a bad idea.  kmalloc() gives you a
> virtual address and you really shouldn't be poking around at that memory's
> underlying page's pageframe metadata.
> 
> However we can of course do tasteless and weird things if the benefit is
> sufficient....

Hey, when we had exactly that issue coming up with xfs/ext3 recovery over
iscsi/aoe you said it was fine :)  End result is that XFS got fixed and
ext3 is still broken...



end of thread, other threads:[~2007-06-30  8:50 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-06-29  4:01 [PATCH] Containment measures for slab objects on scatter gather lists Christoph Lameter
2007-06-29  4:10 ` David Miller
2007-06-29  4:22   ` Christoph Lameter
2007-06-29  4:28     ` David Miller
2007-06-29  4:39       ` Christoph Lameter
2007-06-29  5:06         ` David Miller
2007-06-29  5:24           ` Andrew Morton
2007-06-29  5:37             ` David Miller
2007-06-29  5:45               ` Andrew Morton
2007-06-29  6:51                 ` Christoph Lameter
2007-06-29 12:16                 ` Alan Cox
2007-06-29 20:45                   ` Andrew Morton
2007-06-29 21:14                     ` Russell King
2007-06-29 23:11                       ` Alan Cox
2007-06-30  7:54                         ` Russell King
2007-06-29 22:39                     ` Alan Cox
2007-06-29  6:50               ` Christoph Lameter
2007-06-30  8:49             ` Christoph Hellwig
2007-06-29  7:00           ` Christoph Lameter
2007-06-29  9:06             ` David Miller
2007-06-29 13:04 ` Hugh Dickins
2007-06-29 14:15   ` Christoph Lameter
