[PATCH] fs: fcntl_setlease defies lease

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] fs: fcntl_setlease defies lease_init assumptions
@ 2006-05-07 23:21 Daniel Hokka Zakrisson
  2006-05-08  3:33 ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Hokka Zakrisson @ 2006-05-07 23:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: Björn Steinbrink, greg, matthew, torvalds

fcntl_setlease uses a struct file_lock on the stack to find the
file_lock it's actually looking for. The problem with this approach is
that lease_init will attempt to free the file_lock if the arg argument
is invalid, causing kmem_cache_free to be called with the address of the
on-stack file_lock.
After running the following test-case, it doesn't take long for an i686
machine running 2.6.16.13 to stop responding completely. 
2.6.17-rc3-git12 shows similar results, although it takes a bit more 
activity before crashing.

Björn Steinbrink also identified a slab leak in the same piece of code, 
caused by the fasync_helper call which will allocate a new slab every 
time because it uses the wrong structure (the one on the stack) when the 
a lease on that file already exists.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

void die(const char *msg)
{
	perror(msg);
	exit(1);
}
int main(int argc, char *argv[])
{
	int fd;
	if (argc == 1)
	{
		fprintf(stderr, "Usage: %s file\n", argv[0]);
		exit(1);
	}
	fd = open(argv[1], O_RDONLY);
	if (fd == -1)
		die("open()");
	if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
		die("setlease(RDLCK)");
	if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
		die("setlease(RDLCK2)");
	if (fcntl(fd, F_SETLEASE, 5) == -1)
		die("setlease(invalid)");
	close(fd);
	return 0;
}

Signed-off-by: Daniel Hokka Zakrisson <daniel@hozac.com>

---
--- a/fs/locks.c	2006-05-07 00:29:26.000000000 +0200
+++ b/fs/locks.c	2006-05-07 23:29:12.000000000 +0200
@@ -1363,6 +1363,7 @@ static int __setlease(struct file *filp,
  		goto out;

  	if (my_before != NULL) {
+		*flp = *my_before;
  		error = lease->fl_lmops->fl_change(my_before, arg);
  		goto out;
  	}
@@ -1433,7 +1434,7 @@ EXPORT_SYMBOL(setlease);
   */
  int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
  {
-	struct file_lock fl, *flp = &fl;
+	struct file_lock *fl, *flp;
  	struct dentry *dentry = filp->f_dentry;
  	struct inode *inode = dentry->d_inode;
  	int error;
@@ -1446,14 +1447,15 @@ int fcntl_setlease(unsigned int fd, stru
  	if (error)
  		return error;

-	locks_init_lock(&fl);
-	error = lease_init(filp, arg, &fl);
+	error = lease_alloc(filp, arg, &fl);
  	if (error)
  		return error;
+	flp = fl;

  	lock_kernel();

  	error = __setlease(filp, arg, &flp);
+	locks_free_lock(fl);
  	if (error || arg == F_UNLCK)
  		goto out_unlock;



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-07 23:21 [PATCH] fs: fcntl_setlease defies lease_init assumptions Daniel Hokka Zakrisson
@ 2006-05-08  3:33 ` Linus Torvalds
  2006-05-08  3:34   ` Linus Torvalds
                     ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Linus Torvalds @ 2006-05-08  3:33 UTC (permalink / raw)
  To: Daniel Hokka Zakrisson; +Cc: linux-kernel, Björn Steinbrink, greg, matthew



On Mon, 8 May 2006, Daniel Hokka Zakrisson wrote:
>
> fcntl_setlease uses a struct file_lock on the stack to find the
> file_lock it's actually looking for. The problem with this approach is
> that lease_init will attempt to free the file_lock if the arg argument
> is invalid, causing kmem_cache_free to be called with the address of the
> on-stack file_lock.

Ok, I was actually really surprised that we'd ever allow a non-slab page 
to be free'd as a slab or kmalloc allocation, without screaming very 
loudly indeed. That implies a lack of some pretty fundamental sanity 
checking by default in the slab layer (I suspect slab debugging turns it 
on, but even without it, that's just nasty).

Can you see if this trivial patch at least causes a honking huge 
"kernel BUG" message to be triggered quickly?

Trond wrote an alternate patch for the actual problem which I sent 
separately, but it would probably be good to have more safety in the slab 
layer by default regardless.

		Linus

---
diff --git a/mm/slab.c b/mm/slab.c
index c32af7e..279ca59 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -597,6 +597,7 @@ static inline struct kmem_cache *page_ge
 {
 	if (unlikely(PageCompound(page)))
 		page = (struct page *)page_private(page);
+	BUG_ON(!PageSlab(page));
 	return (struct kmem_cache *)page->lru.next;
 }
 
@@ -609,6 +610,7 @@ static inline struct slab *page_get_slab
 {
 	if (unlikely(PageCompound(page)))
 		page = (struct page *)page_private(page);
+	BUG_ON(!PageSlab(page));
 	return (struct slab *)page->lru.prev;
 }
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08  3:33 ` Linus Torvalds
@ 2006-05-08  3:34   ` Linus Torvalds
  2006-05-08  8:02     ` Daniel Hokka Zakrisson
  2006-05-08  7:57   ` Daniel Hokka Zakrisson
  2006-05-08  8:31   ` Pekka Enberg
  2 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2006-05-08  3:34 UTC (permalink / raw)
  To: Daniel Hokka Zakrisson; +Cc: linux-kernel, Björn Steinbrink, greg, matthew

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2217 bytes --]



On Sun, 7 May 2006, Linus Torvalds wrote:
> 
> Trond wrote an alternate patch for the actual problem which I sent 
> separately, but it would probably be good to have more safety in the slab 
> layer by default regardless.

And here's Trond's suggested locks.c fix.

		Linus

---
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Subject: fs/locks.c: Fix lease_init
Date: Sun, 07 May 2006 23:02:42 -0400

It is insane to be giving lease_init() the task of freeing the lock it is
supposed to initialise, given that the lock is not guaranteed to be
allocated on the stack. This causes lockups in fcntl_setlease().
Problem diagnosed by Daniel Hokka Zakrisson <daniel@hozac.com>

Also fix a slab leak in __setlease() due to an uninitialised return value.
Problem diagnosed by BjÃ¶rn Steinbrink.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 fs/locks.c |   21 ++++++++++++---------
 1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index abd9448..64b96b1 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -446,15 +446,14 @@ static struct lock_manager_operations le
  */
 static int lease_init(struct file *filp, int type, struct file_lock *fl)
  {
+	if (assign_type(fl, type) != 0)
+		return -EINVAL;
+
 	fl->fl_owner = current->files;
 	fl->fl_pid = current->tgid;
 
 	fl->fl_file = filp;
 	fl->fl_flags = FL_LEASE;
-	if (assign_type(fl, type) != 0) {
-		locks_free_lock(fl);
-		return -EINVAL;
-	}
 	fl->fl_start = 0;
 	fl->fl_end = OFFSET_MAX;
 	fl->fl_ops = NULL;
@@ -466,16 +465,19 @@ static int lease_init(struct file *filp,
 static int lease_alloc(struct file *filp, int type, struct file_lock **flp)
 {
 	struct file_lock *fl = locks_alloc_lock();
-	int error;
+	int error = -ENOMEM;
 
 	if (fl == NULL)
-		return -ENOMEM;
+		goto out;
 
 	error = lease_init(filp, type, fl);
-	if (error)
-		return error;
+	if (error) {
+		locks_free_lock(fl);
+		fl = NULL;
+	}
+out:
 	*flp = fl;
-	return 0;
+	return error;
 }
 
 /* Check if two locks overlap each other.
@@ -1391,6 +1393,7 @@ static int __setlease(struct file *filp,
 		goto out;
 
 	if (my_before != NULL) {
+		*flp = *my_before;
 		error = lease->fl_lmops->fl_change(my_before, arg);
 		goto out;
 	}

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08  3:33 ` Linus Torvalds
  2006-05-08  3:34   ` Linus Torvalds
@ 2006-05-08  7:57   ` Daniel Hokka Zakrisson
  2006-05-08  8:31   ` Pekka Enberg
  2 siblings, 0 replies; 26+ messages in thread
From: Daniel Hokka Zakrisson @ 2006-05-08  7:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Björn Steinbrink, greg, matthew

Linus Torvalds wrote:
> Ok, I was actually really surprised that we'd ever allow a non-slab page 
> to be free'd as a slab or kmalloc allocation, without screaming very 
> loudly indeed. That implies a lack of some pretty fundamental sanity 
> checking by default in the slab layer (I suspect slab debugging turns it 
> on, but even without it, that's just nasty).
> 
> Can you see if this trivial patch at least causes a honking huge 
> "kernel BUG" message to be triggered quickly?

It doesn't, at least not before the machine becomes unresponsive.

-- 
Daniel Hokka Zakrisson

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08  3:34   ` Linus Torvalds
@ 2006-05-08  8:02     ` Daniel Hokka Zakrisson
  0 siblings, 0 replies; 26+ messages in thread
From: Daniel Hokka Zakrisson @ 2006-05-08  8:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Björn Steinbrink, greg, matthew

Linus Torvalds wrote:
> 
> On Sun, 7 May 2006, Linus Torvalds wrote:
> 
>>Trond wrote an alternate patch for the actual problem which I sent 
>>separately, but it would probably be good to have more safety in the slab 
>>layer by default regardless.
> 
> 
> And here's Trond's suggested locks.c fix.

I can confirm that it does fix the two issues.

-- 
Daniel Hokka Zakrisson

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08  3:33 ` Linus Torvalds
  2006-05-08  3:34   ` Linus Torvalds
  2006-05-08  7:57   ` Daniel Hokka Zakrisson
@ 2006-05-08  8:31   ` Pekka Enberg
  2006-05-08  8:34     ` Pekka Enberg
  2 siblings, 1 reply; 26+ messages in thread
From: Pekka Enberg @ 2006-05-08  8:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Hokka Zakrisson, linux-kernel, Björn Steinbrink, greg,
	matthew

Hi Linus,

On 5/8/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Ok, I was actually really surprised that we'd ever allow a non-slab page
> to be free'd as a slab or kmalloc allocation, without screaming very
> loudly indeed. That implies a lack of some pretty fundamental sanity
> checking by default in the slab layer (I suspect slab debugging turns it
> on, but even without it, that's just nasty).
>
> Can you see if this trivial patch at least causes a honking huge
> "kernel BUG" message to be triggered quickly?

page_get_cache and page_get_slab are too late. You would need to do
the check in __cache_free; otherwise the stack pointer goes to per-CPU
caches and can be given back by kmalloc(). Adding PageSlab debugging
to __cache_free is probably too much of a performance hit, though.

                                         Pekka

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08  8:31   ` Pekka Enberg
@ 2006-05-08  8:34     ` Pekka Enberg
  2006-05-08 15:12       ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Pekka Enberg @ 2006-05-08  8:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Hokka Zakrisson, linux-kernel, Björn Steinbrink, greg,
	matthew

On 5/8/06, Linus Torvalds <torvalds@osdl.org> wrote:
> > Ok, I was actually really surprised that we'd ever allow a non-slab page
> > to be free'd as a slab or kmalloc allocation, without screaming very
> > loudly indeed. That implies a lack of some pretty fundamental sanity
> > checking by default in the slab layer (I suspect slab debugging turns it
> > on, but even without it, that's just nasty).
> >
> > Can you see if this trivial patch at least causes a honking huge
> > "kernel BUG" message to be triggered quickly?

On 5/8/06, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> page_get_cache and page_get_slab are too late. You would need to do
> the check in __cache_free; otherwise the stack pointer goes to per-CPU
> caches and can be given back by kmalloc(). Adding PageSlab debugging
> to __cache_free is probably too much of a performance hit, though.

Btw, CONFIG_DEBUG_SLAB should catch this case, see kfree_debugcheck()
for details.

                                                  Pekka

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08  8:34     ` Pekka Enberg
@ 2006-05-08 15:12       ` Linus Torvalds
  2006-05-08 16:06         ` Pekka Enberg
  2006-05-08 16:36         ` Dave Jones
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2006-05-08 15:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Daniel Hokka Zakrisson, linux-kernel, Björn Steinbrink, greg,
	matthew

On Mon, 8 May 2006, Pekka Enberg wrote:
> 
> On 5/8/06, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > page_get_cache and page_get_slab are too late. You would need to do
> > the check in __cache_free; otherwise the stack pointer goes to per-CPU
> > caches and can be given back by kmalloc(). Adding PageSlab debugging
> > to __cache_free is probably too much of a performance hit, though.
> 
> Btw, CONFIG_DEBUG_SLAB should catch this case, see kfree_debugcheck()
> for details.

Yeah, but CONFIG_DEBUG_SLAB is _really_ expensive. 

We do have a lot of very basic debug checks (unconditionally) in the 
kernel to verify various "must be true" kinds of things. It might slow 
things down a bit, but in general, I think anything that helps catch 
problems early tends to pay itself back very quickly. So I'm more than 
happy with a simple BUG_ON() in even a hot path, if it just ends up being 
compiled into a "test and branch to unlikely" and doesn't need any costly 
locking etc around it.

Fedora had DEBUG_SLAB enabled in their development kernel, and that 
actually helped a lot. But I suspect they may _not_ have it in their 
non-development ones, and those have a much bigger test-base, so it might 
well be worth it to have a good base-line that catches serious problems, 
and have DEBUG_SLAB enable the expensive tests.

It's not like trying to free a non-kmalloc'ed pointer is a really strange 
event. malloc/free bugs are some of the most common serious problems in 
user space, and I suspect they are _less_ common in the kernel, but 
still..

		Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08 15:12       ` Linus Torvalds
@ 2006-05-08 16:06         ` Pekka Enberg
  2006-05-08 16:28           ` Linus Torvalds
  2006-05-08 16:36         ` Dave Jones
  1 sibling, 1 reply; 26+ messages in thread
From: Pekka Enberg @ 2006-05-08 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Hokka Zakrisson, linux-kernel, Björn Steinbrink, greg,
	matthew, Christoph Lameter, manfred, akpm

On Mon, 2006-05-08 at 08:12 -0700, Linus Torvalds wrote:
> Yeah, but CONFIG_DEBUG_SLAB is _really_ expensive. 
> 
> We do have a lot of very basic debug checks (unconditionally) in the 
> kernel to verify various "must be true" kinds of things. It might slow 
> things down a bit, but in general, I think anything that helps catch 
> problems early tends to pay itself back very quickly. So I'm more than 
> happy with a simple BUG_ON() in even a hot path, if it just ends up being 
> compiled into a "test and branch to unlikely" and doesn't need any costly 
> locking etc around it.

I was under the impression that virt_to_page() is expensive, even more
so on NUMA.  Do we really want this check included unconditionally in
slab free hot path?

				Pekka

   text    data     bss     dec     hex filename
   9279     664      80   10023    2727 mm/slab.o (vanilla uma)
   9327     664      80   10071    2757 mm/slab.o (debug uma)
  13464    2596      24   16084    3ed4 mm/slab.o (vanilla numa)
  13492    2596      24   16112    3ef0 mm/slab.o (debug numa)

diff --git a/mm/slab.c b/mm/slab.c
index c32af7e..8ace45b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3077,6 +3077,8 @@ static inline void __cache_free(struct k
 	check_irq_off();
 	objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
+	BUG_ON(!PageSlab(virt_to_page(objp)));
+
 	/* Make sure we are not freeing a object from another
 	 * node to the array cache on this cpu.
 	 */



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08 16:06         ` Pekka Enberg
@ 2006-05-08 16:28           ` Linus Torvalds
  2006-05-08 19:36             ` Pekka Enberg
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2006-05-08 16:28 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Daniel Hokka Zakrisson, linux-kernel, Björn Steinbrink, greg,
	matthew, Christoph Lameter, manfred, akpm

On Mon, 8 May 2006, Pekka Enberg wrote:
> 
> I was under the impression that virt_to_page() is expensive, even more
> so on NUMA.

virt_to_page() should be pretty cheap, but you're right, that's probably 
the much higher expense than the test. And reading the "struct page" can 
get a cache-miss.  Although - especially for NUMA - we're actually going 
to do that virt_to_page() _anyway_ just a few lines below (as part of 
"virt_to_slab()".

Similarly, for "kfree()", we will actually have done that same thing 
already (kfree() does "virt_to_cache(objp);").

So we're actually only left with a single case that doesn't currently do 
it (and didn't trigger my trivial two additions): kmem_cache_free() just 
trusts the cachep pointer implicitly. And that's obviously the case that 
the whole fcntl_setlease thing used.

Everybody else would have triggered from my patch which added it at a 
point where it was basically free (because we had looked up the page 
anyway, and we were going to read from the "struct page" info regardless).

So from a performance standpoint, maybe my previous trivial patch is the 
right thing to do, along with an even _stronger_ test for 
kmem_cache_free(), where we could do

	BUG_ON(virt_to_cache(objp) != cachep);

which you can then remove from the slab debug case.

So for a lot of the normal paths, you'd basically have no extra cost (two 
instructions, no data cache pressure), but for kmem_cache_free() we'd have 
a slight overhead (but a lot lower than SLAB_DEBUG, and at least for NUMA 
it's reading a cacheline that we'd be using regardless.

I think it sounds like it's worth it, but I'm not going to really push it.

		Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08 15:12       ` Linus Torvalds
  2006-05-08 16:06         ` Pekka Enberg
@ 2006-05-08 16:36         ` Dave Jones
  1 sibling, 0 replies; 26+ messages in thread
From: Dave Jones @ 2006-05-08 16:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, Daniel Hokka Zakrisson, linux-kernel,
	Björn Steinbrink, greg, matthew

On Mon, May 08, 2006 at 08:12:18AM -0700, Linus Torvalds wrote:

 > Fedora had DEBUG_SLAB enabled in their development kernel, and that 
 > actually helped a lot. But I suspect they may _not_ have it in their 
 > non-development ones, and those have a much bigger test-base, so it might 
 > well be worth it to have a good base-line that catches serious problems, 
 > and have DEBUG_SLAB enable the expensive tests.

That's correct. Though at times I'll build a one-off test kernel if I'm
suspicious about something, and push that out as a testing update before
the real update goes live.

Those typically don't get anywhere near the level of exposure as our
development kernels though, so they tend not to show up problems as often.

We also carry the 'check the redzones of non-free slabs every few minutes'
patch, which turned up some things once or twice, but nothing else in
quite a while.

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08 16:28           ` Linus Torvalds
@ 2006-05-08 19:36             ` Pekka Enberg
  2006-05-09  3:38               ` Christoph Lameter
  0 siblings, 1 reply; 26+ messages in thread
From: Pekka Enberg @ 2006-05-08 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Hokka Zakrisson, linux-kernel, Björn Steinbrink, greg,
	matthew, Christoph Lameter, manfred, akpm

On Mon, 2006-05-08 at 09:28 -0700, Linus Torvalds wrote:
> So from a performance standpoint, maybe my previous trivial patch is the 
> right thing to do, along with an even _stronger_ test for 
> kmem_cache_free(), where we could do
> 
> 	BUG_ON(virt_to_cache(objp) != cachep);
> 
> which you can then remove from the slab debug case.
> 
> So for a lot of the normal paths, you'd basically have no extra cost (two 
> instructions, no data cache pressure), but for kmem_cache_free() we'd have 
> a slight overhead (but a lot lower than SLAB_DEBUG, and at least for NUMA 
> it's reading a cacheline that we'd be using regardless.
> 
> I think it sounds like it's worth it, but I'm not going to really push it.

Sounds good to me. Andrew?

				Pekka

[PATCH] slab: verify pointers before free

From: Pekka Enberg <penberg@cs.helsinki.fi>

Passing an invalid pointer to kfree() and kmem_cache_free() is likely
to cause bad memory corruption or even take down the whole system
because the bad pointer is likely reused immediately due to the
per-CPU caches.  Until now, we don't do any verification for this if
CONFIG_DEBUG_SLAB is disabled.

As suggested by Linus, add PageSlab check to page_to_cache() and
page_to_slab() to verify pointers passed to kfree().  Also, move the
stronger check from cache_free_debugcheck() to kmem_cache_free() to
ensure the passed pointer actually belongs to the cache we're about to
free the object.

For page_to_cache() and page_to_slab(), the assertions should have
virtually no extra cost (two instructions, no data cache pressure) and
for kmem_cache_free() the overhead should be minimal.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

---

 mm/slab.c |   13 ++++---------
 1 files changed, 4 insertions(+), 9 deletions(-)

8e4b800f3fb45bbffcc7b365115a63b2a4c911cb
diff --git a/mm/slab.c b/mm/slab.c
index c32af7e..bc9805a 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -597,6 +597,7 @@ static inline struct kmem_cache *page_ge
 {
 	if (unlikely(PageCompound(page)))
 		page = (struct page *)page_private(page);
+	BUG_ON(!PageSlab(page));
 	return (struct kmem_cache *)page->lru.next;
 }
 
@@ -609,6 +610,7 @@ static inline struct slab *page_get_slab
 {
 	if (unlikely(PageCompound(page)))
 		page = (struct page *)page_private(page);
+	BUG_ON(!PageSlab(page));
 	return (struct slab *)page->lru.prev;
 }
 
@@ -2597,15 +2599,6 @@ static void *cache_free_debugcheck(struc
 	kfree_debugcheck(objp);
 	page = virt_to_page(objp);
 
-	if (page_get_cache(page) != cachep) {
-		printk(KERN_ERR "mismatch in kmem_cache_free: expected "
-				"cache %p, got %p\n",
-		       page_get_cache(page), cachep);
-		printk(KERN_ERR "%p is %s.\n", cachep, cachep->name);
-		printk(KERN_ERR "%p is %s.\n", page_get_cache(page),
-		       page_get_cache(page)->name);
-		WARN_ON(1);
-	}
 	slabp = page_get_slab(page);
 
 	if (cachep->flags & SLAB_RED_ZONE) {
@@ -3361,6 +3354,8 @@ void kmem_cache_free(struct kmem_cache *
 {
 	unsigned long flags;
 
+	BUG_ON(virt_to_cache(objp) != cachep);
+
 	local_irq_save(flags);
 	__cache_free(cachep, objp);
 	local_irq_restore(flags);
-- 
1.3.0





^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-08 19:36             ` Pekka Enberg
@ 2006-05-09  3:38               ` Christoph Lameter
  2006-05-09  3:49                 ` Martin J. Bligh
                                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Christoph Lameter @ 2006-05-09  3:38 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Daniel Hokka Zakrisson, linux-kernel,
	Björn Steinbrink, greg, matthew, manfred, akpm

On Mon, 8 May 2006, Pekka Enberg wrote:

> > I think it sounds like it's worth it, but I'm not going to really push it.
> 
> Sounds good to me. Andrew?

virt_to_page is not cheap on NUMA.

On IA64 virt_to_page is:

#define virt_to_page(kaddr)     pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)

# define pfn_to_page(pfn)       (vmem_map + (pfn))

vmem_map is not a linear map but a virtual mapping that may require 
several faults to get the information.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09  3:38               ` Christoph Lameter
@ 2006-05-09  3:49                 ` Martin J. Bligh
  2006-05-09  5:31                   ` Christoph Lameter
  2006-05-09  6:22                 ` Manfred Spraul
  2006-05-09 14:40                 ` Linus Torvalds
  2 siblings, 1 reply; 26+ messages in thread
From: Martin J. Bligh @ 2006-05-09  3:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Linus Torvalds, Daniel Hokka Zakrisson,
	linux-kernel, Björn Steinbrink, greg, matthew, manfred, akpm

Christoph Lameter wrote:
> On Mon, 8 May 2006, Pekka Enberg wrote:
> 
> 
>>>I think it sounds like it's worth it, but I'm not going to really push it.
>>
>>Sounds good to me. Andrew?
> 
> 
> virt_to_page is not cheap on NUMA.
> 
> On IA64 virt_to_page is:
> 
> #define virt_to_page(kaddr)     pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
> 
> # define pfn_to_page(pfn)       (vmem_map + (pfn))
> 
> vmem_map is not a linear map but a virtual mapping that may require 
> several faults to get the information.

Can't you use sparsemem instead? It solves the same problem without the
magic faulting, doesn't it?

M.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09  3:49                 ` Martin J. Bligh
@ 2006-05-09  5:31                   ` Christoph Lameter
  2006-05-09  6:16                     ` Martin J. Bligh
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2006-05-09  5:31 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Pekka Enberg, Linus Torvalds, Daniel Hokka Zakrisson,
	linux-kernel, Björn Steinbrink, greg, matthew, manfred, akpm

On Mon, 8 May 2006, Martin J. Bligh wrote:

> Can't you use sparsemem instead? It solves the same problem without the
> magic faulting, doesn't it?

But sparsemem has more complex table lookups. Ultimately IA64 will move 
to sparsemem (I think) but we are not there yet and we would like to be 
sure that there are no performance regressions with that move.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09  5:31                   ` Christoph Lameter
@ 2006-05-09  6:16                     ` Martin J. Bligh
  0 siblings, 0 replies; 26+ messages in thread
From: Martin J. Bligh @ 2006-05-09  6:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Linus Torvalds, Daniel Hokka Zakrisson,
	linux-kernel, Björn Steinbrink, greg, matthew, manfred, akpm,
	Andy Whitcroft

Christoph Lameter wrote:
> On Mon, 8 May 2006, Martin J. Bligh wrote:
> 
> 
>>Can't you use sparsemem instead? It solves the same problem without the
>>magic faulting, doesn't it?
> 
> 
> But sparsemem has more complex table lookups. Ultimately IA64 will move 
> to sparsemem (I think) but we are not there yet and we would like to be 
> sure that there are no performance regressions with that move.

Please explain your concerns in more detail re complexity. I was under
the impression the design avoided that nicely by folding the
calculations together down into a single layer.

It's been around for a long time now ... has nobody tested the
performance on ia64 yet?

M.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09  3:38               ` Christoph Lameter
  2006-05-09  3:49                 ` Martin J. Bligh
@ 2006-05-09  6:22                 ` Manfred Spraul
  2006-05-09  6:35                   ` Keith Owens
                                     ` (2 more replies)
  2006-05-09 14:40                 ` Linus Torvalds
  2 siblings, 3 replies; 26+ messages in thread
From: Manfred Spraul @ 2006-05-09  6:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Linus Torvalds, Daniel Hokka Zakrisson,
	linux-kernel, Björn Steinbrink, greg, matthew, akpm

Christoph Lameter wrote:

>On Mon, 8 May 2006, Pekka Enberg wrote:
>
>  
>
>>>I think it sounds like it's worth it, but I'm not going to really push it.
>>>      
>>>
>>Sounds good to me. Andrew?
>>    
>>
>
>virt_to_page is not cheap on NUMA.
>
>  
>
I know. But it was a design assumption when I wrote the slab allocator.
Acutally, it's not cheap on non-NUMA either. And the PageCompound() 
check adds an additional branch.

Probably the allocator should be rewritten, without relying on 
virt_to_page().
Any proposals how kfree and kmem_cache_free could locate the cachep 
pointer? That's the performance critical part.

Right's now it's
<<<
   page = vir_to_page(objp)
   if (unlikely(PageCompound(page)))
           page = (struct page *)page_private(page);
   cachep = (struct kmem_cache *)page->lru.next;
 >>>

What about:
- switch from power of two kmalloc caches to slighly smaller caches.
- change the kmalloc(PAGE_SIZE) users to get_free_page(). 
get_free_page() is now fast, too.
- use cachep= *((struct kmem_cache **)(objp & 0xfff))

The result would be a few small restrictions: all objects must start in 
the first page of a slab (there are no exceptions on my 2.6.16 system), 
and PAGE_SIZE'd caches are very expensive. Replacing the names_cache 
with get_free_page is trivial. That leaves the pgd cache.

--
    Manfred



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09  6:22                 ` Manfred Spraul
@ 2006-05-09  6:35                   ` Keith Owens
  2006-05-09  6:37                   ` Nick Piggin
  2006-05-09 10:26                   ` Pekka J Enberg
  2 siblings, 0 replies; 26+ messages in thread
From: Keith Owens @ 2006-05-09  6:35 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Christoph Lameter, Pekka Enberg, Linus Torvalds,
	Daniel Hokka Zakrisson, linux-kernel, Björn Steinbrink, greg,
	matthew, akpm

Manfred Spraul (on Tue, 09 May 2006 08:22:59 +0200) wrote:
>What about:
>- switch from power of two kmalloc caches to slighly smaller caches.
>- change the kmalloc(PAGE_SIZE) users to get_free_page(). 
>get_free_page() is now fast, too.
>- use cachep= *((struct kmem_cache **)(objp & 0xfff))

0xfff?  (PAGE_SIZE - 1) surely.  Not all the world is a 386.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09  6:22                 ` Manfred Spraul
  2006-05-09  6:35                   ` Keith Owens
@ 2006-05-09  6:37                   ` Nick Piggin
  2006-05-09 10:26                   ` Pekka J Enberg
  2 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2006-05-09  6:37 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Christoph Lameter, Pekka Enberg, Linus Torvalds,
	Daniel Hokka Zakrisson, linux-kernel, Björn Steinbrink, greg,
	matthew, akpm

Manfred Spraul wrote:

> Christoph Lameter wrote:
>
>> virt_to_page is not cheap on NUMA.
>>
>>  
>>
> I know. But it was a design assumption when I wrote the slab allocator.
> Acutally, it's not cheap on non-NUMA either. And the PageCompound() 
> check adds an additional branch.


Just FYI, the PageCompound check can go as soon as nommu stops trying to 
do its
broken page refcounting on slab pages (whenever that happens :P).
--

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09  6:22                 ` Manfred Spraul
  2006-05-09  6:35                   ` Keith Owens
  2006-05-09  6:37                   ` Nick Piggin
@ 2006-05-09 10:26                   ` Pekka J Enberg
  2006-05-09 18:25                     ` Manfred Spraul
  2 siblings, 1 reply; 26+ messages in thread
From: Pekka J Enberg @ 2006-05-09 10:26 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Christoph Lameter, Linus Torvalds, Daniel Hokka Zakrisson,
	linux-kernel, Björn Steinbrink, greg, matthew, akpm

On Tue, 9 May 2006, Manfred Spraul wrote:
> Probably the allocator should be rewritten, without relying on virt_to_page().
> Any proposals how kfree and kmem_cache_free could locate the cachep pointer?
> That's the performance critical part.
> 
> Right's now it's
> <<<
>   page = vir_to_page(objp)
>   if (unlikely(PageCompound(page)))
>           page = (struct page *)page_private(page);
>   cachep = (struct kmem_cache *)page->lru.next;
> >>>
> 
> What about:
> - switch from power of two kmalloc caches to slighly smaller caches.
> - change the kmalloc(PAGE_SIZE) users to get_free_page(). get_free_page() is
> now fast, too.
> - use cachep= *((struct kmem_cache **)(objp & 0xfff))

I think you mean

static inline struct kmem_cache *slab_get_cache(const void *obj)
{
        struct kmem_cache **p = (void *)((unsigned long) obj & ~(PAGE_SIZE-1));
        return *p;
}

On Tue, 9 May 2006, Manfred Spraul wrote:
> The result would be a few small restrictions: all objects must start in the
> first page of a slab (there are no exceptions on my 2.6.16 system), and
> PAGE_SIZE'd caches are very expensive. Replacing the names_cache with
> get_free_page is trivial. That leaves the pgd cache.

Your plan makes sense for slabs that have slab management structures 
embedded within. We already have enough free space there for one pointer 
due to 

    colour_off += cachep->slab_size;

in the alloc_slabmgmt() function, I think. Are you planning to kill 
external slab management allocation completely by switching to 
get_free_pages() for those cases? I'd much rather make the switch to page 
allocator under the hood so kmalloc(PAGE_SIZE*n) would still work because 
it's much nicer API.

				Pekka

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09  3:38               ` Christoph Lameter
  2006-05-09  3:49                 ` Martin J. Bligh
  2006-05-09  6:22                 ` Manfred Spraul
@ 2006-05-09 14:40                 ` Linus Torvalds
  2006-05-09 23:59                   ` Christoph Lameter
  2 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2006-05-09 14:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Daniel Hokka Zakrisson, linux-kernel,
	Björn Steinbrink, greg, matthew, manfred, akpm



On Mon, 8 May 2006, Christoph Lameter wrote:
>
> On Mon, 8 May 2006, Pekka Enberg wrote:
> 
> > > I think it sounds like it's worth it, but I'm not going to really push it.
> > 
> > Sounds good to me. Andrew?
> 
> virt_to_page is not cheap on NUMA.

Right now the __cache_free() chain does "virt_to_page()" on NUMA 
regardless, through the

	#ifdef CONFIG_NUMA
	        {
	                struct slab *slabp;
	                slabp = virt_to_slab(objp);
	,,,

thing. The suggested patch obviously makes it do it _twice_: once to get 
the cachep, once to get the slabp. But some simple re-organization would 
make it do it just once, if we passed in the "struct page *" instead of 
the "struct cachep" - since in the end, every single path into the real 
core of the allocator does end up needing it.

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09 10:26                   ` Pekka J Enberg
@ 2006-05-09 18:25                     ` Manfred Spraul
  2006-05-09 18:56                       ` Linus Torvalds
  2006-05-09 19:05                       ` Pekka Enberg
  0 siblings, 2 replies; 26+ messages in thread
From: Manfred Spraul @ 2006-05-09 18:25 UTC (permalink / raw)
  To: Pekka J Enberg
  Cc: Christoph Lameter, Linus Torvalds, Daniel Hokka Zakrisson,
	linux-kernel, Björn Steinbrink, greg, matthew, akpm

Pekka J Enberg wrote:

>I think you mean
>
>static inline struct kmem_cache *slab_get_cache(const void *obj)
>{
>        struct kmem_cache **p = (void *)((unsigned long) obj & ~(PAGE_SIZE-1));
>        return *p;
>}
>
>  
>
Of course.

>On Tue, 9 May 2006, Manfred Spraul wrote:
>  
>
>>The result would be a few small restrictions: all objects must start in the
>>first page of a slab (there are no exceptions on my 2.6.16 system), and
>>PAGE_SIZE'd caches are very expensive. Replacing the names_cache with
>>get_free_page is trivial. That leaves the pgd cache.
>>    
>>
>
>Your plan makes sense for slabs that have slab management structures 
>embedded within.
>
No - it would only make sense if it could be used for all slabs. 
Otherwise: How should kfree figure out if it's called for a slab with 
embedded pointers or not?

> We already have enough free space there for one pointer 
>due to 
>
>    colour_off += cachep->slab_size;
>
>in the alloc_slabmgmt() function, I think. Are you planning to kill 
>external slab management allocation completely by switching to 
>get_free_pages() for those cases? I'd much rather make the switch to page 
>allocator under the hood so kmalloc(PAGE_SIZE*n) would still work because 
>it's much nicer API.
>  
>
How many kmalloc(PAGE_SIZE*n) users are there?

--
    Manfred

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09 18:25                     ` Manfred Spraul
@ 2006-05-09 18:56                       ` Linus Torvalds
  2006-05-09 19:05                       ` Pekka Enberg
  1 sibling, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2006-05-09 18:56 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Pekka J Enberg, Christoph Lameter, Daniel Hokka Zakrisson,
	linux-kernel, Björn Steinbrink, greg, matthew, akpm

On Tue, 9 May 2006, Manfred Spraul wrote:
>
> How many kmalloc(PAGE_SIZE*n) users are there?

A single PAGE_SIZE allocation is quite common. Lots of kernel structures 
end up (often for historical reasons) being that size. PATH_MAX, for one. 
Sometimes it's also simply because it's the one "known" size that doesn't 
cause fragmentation and is easily available, so..

In other words, it's often the "canonical size" for some random buffer: 
if only because it's known to be the largest possible buffer that is 
always available.

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09 18:25                     ` Manfred Spraul
  2006-05-09 18:56                       ` Linus Torvalds
@ 2006-05-09 19:05                       ` Pekka Enberg
  2006-05-09 19:15                         ` Pekka Enberg
  1 sibling, 1 reply; 26+ messages in thread
From: Pekka Enberg @ 2006-05-09 19:05 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Christoph Lameter, Linus Torvalds, Daniel Hokka Zakrisson,
	linux-kernel, Björn Steinbrink, greg, matthew, akpm

On Tue, 2006-05-09 at 20:25 +0200, Manfred Spraul wrote:
> No - it would only make sense if it could be used for all slabs. 
> Otherwise: How should kfree figure out if it's called for a slab with 
> embedded pointers or not?

Yeah, you're absolutely right. Which is exactly why I asked what you
want to do with the OFF_SLAB case. We certainly can support both by
virtualizing kfree() and kmem_cache_free() with struct slab_operations
type of thing, but that's probably not very performant. Well, it might
be for NUMA, as virt_to_page() is so expensive.

Anyway, embedding struct kmem_cache pointer in the beginning of slab
page is doable (I have a proof-of-concept patch for that btw) but we
need to solve OFF_SLAB first.

				Pekka

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09 19:05                       ` Pekka Enberg
@ 2006-05-09 19:15                         ` Pekka Enberg
  0 siblings, 0 replies; 26+ messages in thread
From: Pekka Enberg @ 2006-05-09 19:15 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Christoph Lameter, Linus Torvalds, Daniel Hokka Zakrisson,
	linux-kernel, Björn Steinbrink, greg, matthew, akpm

On Tue, 2006-05-09 at 20:25 +0200, Manfred Spraul wrote:
> > No - it would only make sense if it could be used for all slabs. 
> > Otherwise: How should kfree figure out if it's called for a slab with 
> > embedded pointers or not?

On Tue, 2006-05-09 at 22:05 +0300, Pekka Enberg wrote:
> We certainly can support both by virtualizing kfree() and
> kmem_cache_free() with struct slab_operations type of thing

Uhm, not for kfree(), obviously. I was thinking of kmem_cache_free().

			Pekka


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions
  2006-05-09 14:40                 ` Linus Torvalds
@ 2006-05-09 23:59                   ` Christoph Lameter
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2006-05-09 23:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, Daniel Hokka Zakrisson, linux-kernel,
	Björn Steinbrink, greg, matthew, manfred, akpm

On Tue, 9 May 2006, Linus Torvalds wrote:

> Right now the __cache_free() chain does "virt_to_page()" on NUMA 
> regardless, through the
> 
> 	#ifdef CONFIG_NUMA
> 	        {
> 	                struct slab *slabp;
> 	                slabp = virt_to_slab(objp);
> 	,,,
> 
> thing. The suggested patch obviously makes it do it _twice_: once to get 
> the cachep, once to get the slabp. But some simple re-organization would 
> make it do it just once, if we passed in the "struct page *" instead of 
> the "struct cachep" - since in the end, every single path into the real 
> core of the allocator does end up needing it.

Sounds like the best approach to address this rather than another slab 
redesign.


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2006-05-10  0:00 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-05-07 23:21 [PATCH] fs: fcntl_setlease defies lease_init assumptions Daniel Hokka Zakrisson
2006-05-08  3:33 ` Linus Torvalds
2006-05-08  3:34   ` Linus Torvalds
2006-05-08  8:02     ` Daniel Hokka Zakrisson
2006-05-08  7:57   ` Daniel Hokka Zakrisson
2006-05-08  8:31   ` Pekka Enberg
2006-05-08  8:34     ` Pekka Enberg
2006-05-08 15:12       ` Linus Torvalds
2006-05-08 16:06         ` Pekka Enberg
2006-05-08 16:28           ` Linus Torvalds
2006-05-08 19:36             ` Pekka Enberg
2006-05-09  3:38               ` Christoph Lameter
2006-05-09  3:49                 ` Martin J. Bligh
2006-05-09  5:31                   ` Christoph Lameter
2006-05-09  6:16                     ` Martin J. Bligh
2006-05-09  6:22                 ` Manfred Spraul
2006-05-09  6:35                   ` Keith Owens
2006-05-09  6:37                   ` Nick Piggin
2006-05-09 10:26                   ` Pekka J Enberg
2006-05-09 18:25                     ` Manfred Spraul
2006-05-09 18:56                       ` Linus Torvalds
2006-05-09 19:05                       ` Pekka Enberg
2006-05-09 19:15                         ` Pekka Enberg
2006-05-09 14:40                 ` Linus Torvalds
2006-05-09 23:59                   ` Christoph Lameter
2006-05-08 16:36         ` Dave Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).