* Latest vfs scalability patch
From: Nick Piggin @ 2009-10-06  6:49 UTC
To: Linux Kernel Mailing List, linux-fsdevel
Cc: Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe

Hi,

Several people have been interested in testing my vfs patches, so rather
than resend them I have uploaded a rollup against Linus's current head:

  ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/

I have tested ext2, ext3, autofs4, nfs and the in-memory filesystems OK
(although this doesn't mean there are no bugs!). Otherwise, if your
filesystem compiles, there is a reasonable chance of it working, or ask
me and I can try updating it for the new locking.

I would be interested in seeing any numbers people might come up with,
including single-threaded performance.

Thanks,
Nick

* Re: Latest vfs scalability patch
From: Jens Axboe @ 2009-10-06 10:14 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Tue, Oct 06 2009, Nick Piggin wrote:
> Several people have been interested in testing my vfs patches, so rather
> than resend them I have uploaded a rollup against Linus's current head.
> [...]
> I would be interested in seeing any numbers people might come up with,
> including single-threaded performance.

I gave this a quick spin on the 64-thread Nehalem, just a simple dbench
with 64 clients on tmpfs. The results are below. While running perf top
-a on mainline, the top 5 entries are:

  2086691.00 - 96.6% : _spin_lock
    14866.00 -  0.7% : copy_user_generic_string
     5710.00 -  0.3% : mutex_spin_on_owner
     2837.00 -  0.1% : _atomic_dec_and_lock
     2274.00 -  0.1% : __d_lookup

Uhm, ouch... It doesn't look much prettier for the patched kernel,
though:

  9396422.00 - 95.7% : _spin_lock
    66978.00 -  0.7% : copy_user_generic_string
    43775.00 -  0.4% : dput
    23946.00 -  0.2% : __link_path_walk
    17699.00 -  0.2% : path_init
    15046.00 -  0.2% : do_lookup

Anyway, below are the results. They seem very stable.

                             throughput
  ------------------------------------------------
  2.6.32-rc3-git           |  561.218 MB/sec
  2.6.32-rc3-git+patch     |  627.022 MB/sec

-- 
Jens Axboe

* Re: Latest vfs scalability patch
From: Jens Axboe @ 2009-10-06 10:26 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Tue, Oct 06 2009, Jens Axboe wrote:
> I gave this a quick spin on the 64-thread Nehalem, just a simple dbench
> with 64 clients on tmpfs.
> [...]
> Uhm, ouch... It doesn't look much prettier for the patched kernel,
> though:
>
>   9396422.00 - 95.7% : _spin_lock
>     66978.00 -  0.7% : copy_user_generic_string
>     43775.00 -  0.4% : dput
>     23946.00 -  0.2% : __link_path_walk
>     17699.00 -  0.2% : path_init
>     15046.00 -  0.2% : do_lookup

I did a quick perf analysis on that, but only on 8 clients (64 clients
basically causes perf to shit itself, it's just not functional). So for
an 8-client, 60s dbench run, we're already into ~75% spinlock time. The
components are:

  dput (44%)
      path_put
      path_walk
      do_path_lookup
  path_get (44%)
      path_init
      do_path_lookup
  __d_path (7%)
  vfsmount_read_lock (3%)

This is for the patched kernel.

-- 
Jens Axboe

* Re: Latest vfs scalability patch
From: Peter Zijlstra @ 2009-10-06 11:10 UTC
To: Jens Axboe
Cc: Nick Piggin, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Linus Torvalds

On Tue, 2009-10-06 at 12:26 +0200, Jens Axboe wrote:
> I did a quick perf analysis on that, but only on 8 clients (64 clients
> basically causes perf to shit itself, it's just not functional).

Even when used with -a, so that you profile each cpu? I can imagine the
cacheline contention from the inherited counters rendering a 64-thread
machine less than usable?

* Re: Latest vfs scalability patch
From: Jens Axboe @ 2009-10-06 12:51 UTC
To: Peter Zijlstra
Cc: Nick Piggin, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Linus Torvalds

On Tue, Oct 06 2009, Peter Zijlstra wrote:
> Even when used with -a, so that you profile each cpu? I can imagine the
> cacheline contention from the inherited counters rendering a 64-thread
> machine less than usable?

No, that's without -a. I used -a for perf top, and that has some impact
on performance while it's running, but it is millions of miles apart
from what perf does without it :-)

-- 
Jens Axboe

* Re: Latest vfs scalability patch
From: Nick Piggin @ 2009-10-06 12:26 UTC
To: Jens Axboe
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Tue, Oct 06, 2009 at 12:14:14PM +0200, Jens Axboe wrote:
> I gave this a quick spin on the 64-thread Nehalem, just a simple dbench
> with 64 clients on tmpfs.
> [...]
> Uhm, ouch... It doesn't look much prettier for the patched kernel,
> though:
>
>   9396422.00 - 95.7% : _spin_lock
>     66978.00 -  0.7% : copy_user_generic_string
>     43775.00 -  0.4% : dput
>     23946.00 -  0.2% : __link_path_walk
>     17699.00 -  0.2% : path_init
>     15046.00 -  0.2% : do_lookup

Yep, this is the problem of the common-path lookup. Every dentry element
in the path has its d_lock taken for every path lookup, so the cwd
dentry lock bounces a lot for dbench.

I'm working on doing path traversal without any locks or stores to the
dentries in the common cases, so that should basically be the last bit
of the puzzle for vfs locking. (It can be considered a different type of
problem than the global lock removal, but RCU-freed struct inode is
important for the approach I'm taking, so I'm basing it on top of these
patches.)

It's a copout, but you could try running multiple dbenches under
different working directories (or actually, IIRC dbench does root-based
path lookups, so maybe that won't help you much).

> Anyway, below are the results. They seem very stable.
>
>                              throughput
>   ------------------------------------------------
>   2.6.32-rc3-git           |  561.218 MB/sec
>   2.6.32-rc3-git+patch     |  627.022 MB/sec

Well, it's good to see you got some improvement.

* Re: Latest vfs scalability patch
From: Jens Axboe @ 2009-10-06 12:49 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Tue, Oct 06 2009, Nick Piggin wrote:
> Yep, this is the problem of the common-path lookup. Every dentry element
> in the path has its d_lock taken for every path lookup, so the cwd
> dentry lock bounces a lot for dbench.
> [...]
> It's a copout, but you could try running multiple dbenches under
> different working directories (or actually, IIRC dbench does root-based
> path lookups, so maybe that won't help you much).

Yeah, it's hitting dentry->d_lock pretty hard, so it is basically
spin-serialized at that point.

> > Anyway, below are the results. They seem very stable.
> >
> >                              throughput
> >   ------------------------------------------------
> >   2.6.32-rc3-git           |  561.218 MB/sec
> >   2.6.32-rc3-git+patch     |  627.022 MB/sec
>
> Well, it's good to see you got some improvement.

Yes, it's an improvement, though the results are still pretty abysmal :-)

-- 
Jens Axboe

* [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-07  8:58 UTC
To: Jens Axboe
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Tue, Oct 06, 2009 at 02:49:41PM +0200, Jens Axboe wrote:
> On Tue, Oct 06 2009, Nick Piggin wrote:
> > It's a copout, but you could try running multiple dbenches under
> > different working directories [...]
>
> Yeah, it's hitting dentry->d_lock pretty hard, so it is basically
> spin-serialized at that point.
> [...]
> Yes, it's an improvement, though the results are still pretty abysmal :-)

OK, I have a really basic patch that does store-free path walking
(except on the final element). dbench is still pretty nasty because it
seems to do a lot of stupid things, like reading from /proc/mounts all
the time.

Actually, it isn't quite store-free because it still takes a mnt ref
(which is hard to avoid, but at least it is per-cpu data). So this patch
applies on top of my recent patchset. At least it does not store to the
dentries.

Basically it holds rcu_read_lock for the entire walk, uses inode RCU
from my earlier patches to check inode permissions, and bails out at the
slightest sign of trouble :) (actually I think I have got it to walk
mountpoints, which is nice).

The seqlock in the dentry is for getting a consistent name/len pair, and
also for not giving a false positive if a rename has partially
overwritten the name string (false negatives are always fine in the
lock-free lookup path, but false positives are not). Possibly we could
make do with a per-sb seqlock for this (or just rename_lock).

I have duplicated most of the lookup code, altering it into the
lock-free version, which returns -EAGAIN if it gets in trouble, at which
point the regular path walk starts up. I don't know if this is the best
way to go... it's fairly easy but a bit ugly. But complexifying the
existing path walk any more would not be nice either. It might be nicer
to try locking the dentry that we're having trouble with and continuing
from there, rather than redoing the whole walk with locks...

Anyway, this is the basics working for now; a microbenchmark shows
same-cwd lookups scale linearly now too. We can probably slowly tackle
more cases as they come up as being important, simply by auditing
filesystems etc.

--
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1187,6 +1187,7 @@ struct dentry *d_alloc(struct dentry * p
 	dentry->d_count = 1;
 	dentry->d_flags = DCACHE_UNHASHED;
 	spin_lock_init(&dentry->d_lock);
+	seqcount_init(&dentry->d_seq);
 	dentry->d_inode = NULL;
 	dentry->d_parent = NULL;
 	dentry->d_sb = NULL;
@@ -1579,21 +1580,6 @@ err_out:
  * d_lookup() is protected against the concurrent renames in some unrelated
  * directory using the seqlockt_t rename_lock.
  */
-
-struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
-{
-	struct dentry * dentry = NULL;
-	unsigned seq;
-
-	do {
-		seq = read_seqbegin(&rename_lock);
-		dentry = __d_lookup(parent, name);
-		if (dentry)
-			break;
-	} while (read_seqretry(&rename_lock, seq));
-	return dentry;
-}
-
 struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 {
 	unsigned int len = name->len;
@@ -1656,6 +1642,78 @@ next:
 	return found;
 }
 
+struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
+{
+	struct dentry *dentry = NULL;
+	unsigned seq;
+
+	do {
+		seq = read_seqbegin(&rename_lock);
+		dentry = __d_lookup(parent, name);
+		if (dentry)
+			break;
+	} while (read_seqretry(&rename_lock, seq));
+	return dentry;
+}
+
+struct dentry * __d_lookup_rcu(struct dentry * parent, struct qstr * name)
+{
+	unsigned int len = name->len;
+	unsigned int hash = name->hash;
+	const unsigned char *str = name->name;
+	struct dcache_hash_bucket *b = d_hash(parent, hash);
+	struct hlist_head *head = &b->head;
+	struct hlist_node *node;
+	struct dentry *dentry;
+
+	hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
+		unsigned seq;
+		struct dentry *tparent;
+		const char *tname;
+		int tlen;
+
+		if (dentry->d_name.hash != hash)
+			continue;
+
+seqretry:
+		seq = read_seqcount_begin(&dentry->d_seq);
+		tparent = dentry->d_parent;
+		if (tparent != parent)
+			continue;
+		tlen = dentry->d_name.len;
+		if (tlen != len)
+			continue;
+		tname = dentry->d_name.name;
+		if (read_seqcount_retry(&dentry->d_seq, seq))
+			goto seqretry;
+		if (memcmp(tname, str, tlen))
+			continue;
+		if (read_seqcount_retry(&dentry->d_seq, seq))
+			goto seqretry;
+
+		return dentry;
+	}
+	return NULL;
+}
+
+struct dentry *d_lookup_rcu(struct dentry *parent, struct qstr * name)
+{
+	struct dentry *dentry = NULL;
+	unsigned seq;
+
+	if (parent->d_op && parent->d_op->d_compare)
+		goto out;
+
+	do {
+		seq = read_seqbegin(&rename_lock);
+		dentry = __d_lookup_rcu(parent, name);
+		if (dentry)
+			break;
+	} while (read_seqretry(&rename_lock, seq));
out:
+	return dentry;
+}
+
 /**
  * d_hash_and_lookup - hash the qstr then search for a dentry
  * @dir: Directory to search in
@@ -1925,6 +1983,8 @@ static void d_move_locked(struct dentry
 	list_del(&target->d_u.d_child);
 
 	/* Switch the names.. */
+	write_seqcount_begin(&dentry->d_seq);
+	write_seqcount_begin(&target->d_seq);
 	switch_names(dentry, target);
 	swap(dentry->d_name.hash, target->d_name.hash);
 
@@ -1939,6 +1999,8 @@ static void d_move_locked(struct dentry
 		/* And add them back to the (new) parent lists */
 		list_add(&target->d_u.d_child, &target->d_parent->d_subdirs);
 	}
+	write_seqcount_end(&target->d_seq);
+	write_seqcount_end(&dentry->d_seq);
 	list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
 	if (target->d_parent != dentry->d_parent)
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -200,6 +200,29 @@ static int acl_permission_check(struct i
 	return -EACCES;
 }
 
+static int acl_permission_check_rcu(struct inode *inode, int mask,
+		int (*check_acl)(struct inode *inode, int mask))
+{
+	umode_t mode = inode->i_mode;
+
+	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
+
+	if (current_fsuid() == inode->i_uid)
+		mode >>= 6;
+	else {
+		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl)
+			return -EAGAIN;
+		if (in_group_p(inode->i_gid))
+			mode >>= 3;
+	}
+
+	/*
+	 * If the DACs are ok we don't need any capability check.
+	 */
+	if ((mask & ~mode) == 0)
+		return 0;
+	return -EACCES;
+}
+
 /**
  * generic_permission - check for access rights on a Posix-like filesystem
  * @inode:	inode to check access rights for
@@ -465,6 +488,25 @@ ok:
 	return security_inode_permission(inode, MAY_EXEC);
 }
 
+static int exec_permission_lite_rcu(struct inode *inode)
+{
+	int ret;
+
+	if (inode->i_op->permission)
+		return -EAGAIN;
+	ret = acl_permission_check_rcu(inode, MAY_EXEC, inode->i_op->check_acl);
+	if (ret == -EAGAIN)
+		return ret;
+	if (!ret)
+		goto ok;
+
+	if (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH))
+		goto ok;
+
+	return ret;
+ok:
+	return security_inode_permission(inode, MAY_EXEC);
+}
+
 /*
  * This is called when everything else fails, and we actually have
  * to go to the low-level filesystem to find out what we should do..
@@ -569,6 +611,17 @@ static __always_inline void set_root(str
 	}
 }
 
+static __always_inline void set_root_rcu(struct nameidata *nd)
+{
+	if (!nd->root.mnt) {
+		struct fs_struct *fs = current->fs;
+		read_lock(&fs->lock);
+		nd->root = fs->root;
+		mntget(nd->root.mnt);
+		read_unlock(&fs->lock);
+	}
+}
+
 static __always_inline int __vfs_follow_link(struct nameidata *nd, const char *link)
 {
 	int res = 0;
@@ -611,6 +664,14 @@ static void path_put_conditional(struct
 	mntput(path->mnt);
 }
 
+static inline void path_to_nameidata_rcu(struct path *path, struct nameidata *nd)
+{
+	if (nd->path.mnt != path->mnt)
+		mntput(nd->path.mnt);
+	nd->path.mnt = path->mnt;
+	nd->path.dentry = path->dentry;
+}
+
 static inline void path_to_nameidata(struct path *path, struct nameidata *nd)
 {
 	dput(nd->path.dentry);
@@ -705,6 +766,22 @@ int follow_up(struct path *path)
 /* no need for dcache_lock, as serialization is taken care in
  * namespace.c
  */
+static int __follow_mount_rcu(struct path *path)
+{
+	int res = 0;
+	while (d_mountpoint(path->dentry)) {
+		struct vfsmount *mounted = lookup_mnt(path);
+		if (!mounted)
+			break;
+		if (res)
+			mntput(path->mnt);
+		path->mnt = mounted;
+		path->dentry = mounted->mnt_root;
+		res = 1;
+	}
+	return res;
+}
+
 static int __follow_mount(struct path *path)
 {
 	int res = 0;
@@ -791,6 +868,24 @@ static __always_inline void follow_dotdo
 * small and for now I'd prefer to have fast path as straight as possible.
 * It _is_ time-critical.
 */
+static int do_lookup_rcu(struct nameidata *nd, struct qstr *name,
+			struct path *path)
+{
+	struct vfsmount *mnt = nd->path.mnt;
+	struct dentry *dentry;
+
+	dentry = __d_lookup_rcu(nd->path.dentry, name);
+
+	if (!dentry)
+		return -EAGAIN;
+	if (dentry->d_op && dentry->d_op->d_revalidate)
+		return -EAGAIN;
+	path->mnt = mnt;
+	path->dentry = dentry;
+	__follow_mount_rcu(path);
+	return 0;
+}
+
 static int do_lookup(struct nameidata *nd, struct qstr *name,
 		     struct path *path)
 {
@@ -825,6 +920,134 @@ fail:
 	return PTR_ERR(dentry);
 }
 
+static noinline int link_path_walk_rcu(const char *name, struct nameidata *nd, struct path *next)
+{
+	struct inode *inode;
+	unsigned int lookup_flags = nd->flags;
+
+	while (*name=='/')
+		name++;
+	if (!*name)
+		goto return_reval;
+
+	inode = nd->path.dentry->d_inode;
+	if (nd->depth)
+		lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
+
+	/* At this point we know we have a real path component. */
+	for(;;) {
+		unsigned long hash;
+		struct qstr this;
+		unsigned int c;
+
+		nd->flags |= LOOKUP_CONTINUE;
+		if (exec_permission_lite_rcu(inode))
+			return -EAGAIN;
+
+		this.name = name;
+		c = *(const unsigned char *)name;
+
+		hash = init_name_hash();
+		do {
+			name++;
+			hash = partial_name_hash(c, hash);
+			c = *(const unsigned char *)name;
+		} while (c && (c != '/'));
+		this.len = name - (const char *) this.name;
+		this.hash = end_name_hash(hash);
+
+		/* remove trailing slashes? */
+		if (!c)
+			goto last_component;
+		while (*++name == '/');
+		if (!*name)
+			goto last_with_slashes;
+
+		if (this.name[0] == '.') switch (this.len) {
+			default:
+				break;
+			case 2:
+				if (this.name[1] != '.')
+					break;
+				return -EAGAIN;
+			case 1:
+				continue;
+		}
+		if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash)
+			return -EAGAIN;
+		/* This does the actual lookups.. */
+		if (do_lookup_rcu(nd, &this, next))
+			return -EAGAIN;
+
+		inode = next->dentry->d_inode;
+		if (!inode)
+			return -ENOENT;
+		if (inode->i_op->follow_link)
+			return -EAGAIN;
+		path_to_nameidata_rcu(next, nd);
+		if (!inode->i_op->lookup)
+			return -ENOTDIR;
+		continue;
+		/* here ends the main loop */
+
last_with_slashes:
+		lookup_flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
last_component:
+		/* Clear LOOKUP_CONTINUE iff it was previously unset */
+		nd->flags &= lookup_flags | ~LOOKUP_CONTINUE;
+		if (lookup_flags & LOOKUP_PARENT)
+			return -EAGAIN;
+		if (this.name[0] == '.') switch (this.len) {
+			default:
+				break;
+			case 2:
+				if (this.name[1] != '.')
+					break;
+				return -EAGAIN;
+			case 1:
+				goto return_reval;
+		}
+		if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash)
+			return -EAGAIN;
+		if (do_lookup_rcu(nd, &this, next))
+			return -EAGAIN;
+		inode = next->dentry->d_inode;
+		if ((lookup_flags & LOOKUP_FOLLOW)
+		    && inode && inode->i_op->follow_link)
+			return -EAGAIN;
+
+		path_to_nameidata_rcu(next, nd);
+		if (!inode)
+			return -ENOENT;
+		if (lookup_flags & LOOKUP_DIRECTORY) {
+			if (!inode->i_op->lookup)
+				return -ENOTDIR;
+		}
+		goto return_base;
+	}
return_reval:
+	/*
+	 * We bypassed the ordinary revalidation routines.
+	 * We may need to check the cached dentry for staleness.
+	 */
+	if (nd->path.dentry && nd->path.dentry->d_sb &&
+	    (nd->path.dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT))
+		return -EAGAIN;
return_base:
+	spin_lock(&nd->path.dentry->d_lock);
+	if (d_unhashed(nd->path.dentry)) {
+		spin_unlock(&nd->path.dentry->d_lock);
+		return -EAGAIN;
+	}
+	if (!nd->dentry->d_inode) {
+		spin_unlock(&nd->path.dentry->d_lock);
+		return -EAGAIN;
+	}
+	nd->path.dentry->d_count++;
+	spin_unlock(&nd->path.dentry->d_lock);
+	return 0;
+}
+
 /*
  * Name resolution.
  * This is the basic name resolution function, turning a pathname into
@@ -887,7 +1110,7 @@ static int __link_path_walk(const char *
 		if (this.name[0] == '.') switch (this.len) {
 			default:
 				break;
-			case 2:	
+			case 2:
 				if (this.name[1] != '.')
 					break;
 				follow_dotdot(nd);
@@ -942,7 +1165,7 @@ last_component:
 		if (this.name[0] == '.') switch (this.len) {
 			default:
 				break;
-			case 2:	
+			case 2:
 				if (this.name[1] != '.')
 					break;
 				follow_dotdot(nd);
@@ -1013,12 +1236,82 @@ return_err:
 	return err;
 }
 
+static int path_walk_rcu(const char *name, struct nameidata *nd)
+{
+	struct path save = nd->path;
+	struct path path = {.mnt = NULL};
+	int err;
+
+	current->total_link_count = 0;
+	err = link_path_walk_rcu(name, nd, &path);
+	if (err) {
+		if (path.mnt != nd->path.mnt)
+			mntput(path.mnt);
+		mntput(nd->path.mnt);
+		if (err == -EAGAIN)
+			nd->path = save;
+	}
+	return err;
+}
+
 static int path_walk(const char *name, struct nameidata *nd)
 {
 	current->total_link_count = 0;
 	return link_path_walk(name, nd);
 }
 
+static noinline int path_init_rcu(int dfd, const char *name, unsigned int flags, struct nameidata *nd)
+{
+	int retval = 0;
+	int fput_needed;
+	struct file *file;
+
+	nd->last_type = LAST_ROOT; /* if there are only slashes... */
+	nd->flags = flags;
+	nd->depth = 0;
+	nd->root.mnt = NULL;
+
+	if (*name=='/') {
+		set_root_rcu(nd);
+		nd->path = nd->root;
+		mntget(nd->root.mnt);
+	} else if (dfd == AT_FDCWD) {
+		struct fs_struct *fs = current->fs;
+		read_lock(&fs->lock);
+		nd->path = fs->pwd;
+		mntget(fs->pwd.mnt);
+		read_unlock(&fs->lock);
+	} else {
+		struct dentry *dentry;
+
+		file = fget_light(dfd, &fput_needed);
+		retval = -EBADF;
+		if (!file)
+			goto out_fail;
+
+		dentry = file->f_path.dentry;
+
+		retval = -ENOTDIR;
+		if (!S_ISDIR(dentry->d_inode->i_mode))
+			goto fput_fail;
+
+		retval = file_permission(file, MAY_EXEC);
+		if (retval)
+			goto fput_fail;
+
+		nd->path = file->f_path;
+		mntget(file->f_path.mnt);
+
+		fput_light(file, fput_needed);
+	}
+	return 0;
+
fput_fail:
+	fput_light(file, fput_needed);
out_fail:
+	return retval;
+}
+
 static int path_init(int dfd, const char *name, unsigned int flags, struct nameidata *nd)
 {
 	int retval = 0;
@@ -1075,16 +1368,46 @@ out_fail:
 static int do_path_lookup(int dfd, const char *name,
 				unsigned int flags, struct nameidata *nd)
 {
-	int retval = path_init(dfd, name, flags, nd);
-	if (!retval)
-		retval = path_walk(name, nd);
-	if (unlikely(!retval && !audit_dummy_context() && nd->path.dentry &&
-				nd->path.dentry->d_inode))
-		audit_inode(name, nd->path.dentry);
+	int retval;
+
+	rcu_read_lock();
+	retval = path_init_rcu(dfd, name, flags, nd);
+	if (unlikely(retval)) {
+		rcu_read_unlock();
+		return retval;
+	}
+	retval = path_walk_rcu(name, nd);
+	rcu_read_unlock();
+	if (likely(!retval)) {
+		if (unlikely(!audit_dummy_context())) {
+			if (nd->path.dentry && nd->path.dentry->d_inode)
+				audit_inode(name, nd->path.dentry);
+		}
+	}
 	if (nd->root.mnt) {
-		path_put(&nd->root);
+		mntput(nd->root.mnt);
 		nd->root.mnt = NULL;
 	}
+
+	if (unlikely(retval == -EAGAIN)) {
+		/* slower, locked walk */
+		retval = path_init(dfd, name, flags, nd);
+		if (unlikely(retval))
+			return retval;
+		retval = path_walk(name, nd);
+		if (likely(!retval)) {
+			if (unlikely(!audit_dummy_context())) {
+				if (nd->path.dentry && nd->path.dentry->d_inode)
+					audit_inode(name, nd->path.dentry);
+			}
+		}
+
+		if (nd->root.mnt) {
+			path_put(&nd->root);
+			nd->root.mnt = NULL;
+		}
+	}
+
 	return retval;
 }
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -5,6 +5,7 @@
 #include <linux/list.h>
 #include <linux/rculist.h>
 #include <linux/spinlock.h>
+#include <linux/seqlock.h>
 #include <linux/cache.h>
 #include <linux/rcupdate.h>
 
@@ -81,15 +82,16 @@ full_name_hash(const unsigned char *name
  * large memory footprint increase).
  */
 #ifdef CONFIG_64BIT
-#define DNAME_INLINE_LEN_MIN 32 /* 192 bytes */
+#define DNAME_INLINE_LEN_MIN 24 /* 192 bytes */
 #else
-#define DNAME_INLINE_LEN_MIN 40 /* 128 bytes */
+#define DNAME_INLINE_LEN_MIN 36 /* 128 bytes */
 #endif
 
 struct dentry {
 	unsigned int d_count;		/* protected by d_lock */
 	unsigned int d_flags;		/* protected by d_lock */
 	spinlock_t d_lock;		/* per dentry lock */
+	seqcount_t d_seq;		/* per dentry seqlock */
 	int d_mounted;
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
 					 * negative */
@@ -283,9 +285,11 @@ extern void d_move(struct dentry *, stru
 extern struct dentry *d_ancestor(struct dentry *, struct dentry *);
 
 /* appendix may either be NULL or be used for transname suffixes */
-extern struct dentry * d_lookup(struct dentry *, struct qstr *);
-extern struct dentry * __d_lookup(struct dentry *, struct qstr *);
-extern struct dentry * d_hash_and_lookup(struct dentry *, struct qstr *);
+extern struct dentry *d_lookup(struct dentry *, struct qstr *);
+extern struct dentry *__d_lookup(struct dentry *, struct qstr *);
+extern struct dentry *d_lookup_rcu(struct dentry *, struct qstr *);
+extern struct dentry *__d_lookup_rcu(struct dentry *, struct qstr *);
+extern struct dentry *d_hash_and_lookup(struct dentry *, struct qstr *);
 
 /* validate "insecure" dentry pointer */
 extern int d_validate(struct dentry *, struct dentry *);

* Re: [rfc][patch] store-free path walking
From: Jens Axboe @ 2009-10-07  9:56 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Wed, Oct 07 2009, Nick Piggin wrote:
> OK, I have a really basic patch that does store-free path walking
> (except on the final element).
> [...]
> Anyway, this is the basics working for now; a microbenchmark shows
> same-cwd lookups scale linearly now too. We can probably slowly tackle
> more cases as they come up as being important, simply by auditing
> filesystems etc.

                             throughput
  ------------------------------------------------
  2.6.32-rc3-git           |  561.218 MB/sec
  2.6.32-rc3-git+patch     |  627.022 MB/sec
  2.6.32-rc3-git+patch+inc |  969.761 MB/sec

So better, quite a bit too. Latencies are not listed here, but they are
also a lot better. Perf top still shows ~95% spinlock time. I did a
shorter run (the above are full 600-second runs) of 60s with profiling
and the full 64 clients, this time using -a as well (which generated
9.4GB of trace data!). The top is now:

  _spin_lock (92%)
      path_get (39%)
          d_path (59%)
          path_init (26%)
          path_walk (13%)
      dput (37%)
          path_put (86%)
          link_path_walk (13%)
      __d_path (23%)

And finally, this:

> +	if (!nd->dentry->d_inode) {
> +		spin_unlock(&nd->path.dentry->d_lock);
> +		return -EAGAIN;
> +	}

doesn't compile, it wants to be:

	if (!nd->path.dentry->d_inode) {

-- 
Jens Axboe

* Re: [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-07 10:10 UTC
To: Jens Axboe
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Wed, Oct 07, 2009 at 11:56:57AM +0200, Jens Axboe wrote:
>                              throughput
>   ------------------------------------------------
>   2.6.32-rc3-git           |  561.218 MB/sec
>   2.6.32-rc3-git+patch     |  627.022 MB/sec
>   2.6.32-rc3-git+patch+inc |  969.761 MB/sec
>
> So better, quite a bit too. Latencies are not listed here, but they are
> also a lot better. Perf top still shows ~95% spinlock time.
> [...]
>   _spin_lock (92%)
>       path_get (39%)
>       dput (37%)
>       __d_path (23%)

path_init, path_walk, and link_path_walk are all non-lockless variants,
so the RCU walk is dropping out in some cases. path_put will be coming
significantly from locked lookups too. This could be improved by
expanding the cases we do the lockless walk for (or by allowing a
lockless walk to turn into a locked walk part-way through, rather than
restarting the whole thing, which is probably a very good idea anyway).

d_path and __d_path are... I think dbench doing something stupid.
Although even those could possibly be optimised to avoid d_lock as
well...

After looking at strace output from dbench, though, I'd rather take
profiles from real workloads before adding complexity (even a real samba
serving a netbench workload would be preferable to dbench, I think).

But it's always nice to see numbers and results. A nearly 2x increase
isn't too bad, even if it is still horribly choked.

> And finally, this:
>
> > +	if (!nd->dentry->d_inode) {
> > +		spin_unlock(&nd->path.dentry->d_lock);
> > +		return -EAGAIN;
> > +	}
>
> doesn't compile, it wants to be:
>
> 	if (!nd->path.dentry->d_inode) {

Ah, thanks, I forgot to refresh the patch.

* Re: [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-12  3:58 UTC
To: Jens Axboe
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

On Wed, Oct 07 2009, Jens Axboe wrote:
> So better, quite a bit too. Latencies are not listed here, but they are
> also a lot better. Perf top still shows ~95% spinlock time. I did a
> shorter run (the above are full 600-second runs) of 60s with profiling
> and the full 64 clients, this time using -a as well (which generated
> 9.4GB of trace data!).

Hey Jens,

Try changing the 'statvfs' syscall in dbench to 'statfs'. glibc has to
do some nasty stuff parsing /proc/mounts to make statvfs work. On my
2s8c opteron it goes like this:

  clients    vanilla kernel    vfs scale (MB/s)
  1          476               447
  2          1092              1128
  4          2027              2260
  8          2398              4200

Single-threaded performance isn't as good, so I need to look at the
reasons for that :(. But it's practically linearly scalable now. The
dropoff at 8 I'd say is probably due to the memory controllers running
out of steam rather than cacheline or lock contention.

Unfortunately we didn't just implement this posix API in-kernel, and
statfs is Linux-specific. But I think we have some spare room in the
statfs structure to pass back mount flags for statvfs.

Tridge, Samba people: measuring vfs performance with dbench in my effort
to improve Linux vfs scalability has shown the statvfs call you make to
be the final problematic issue for this workload, in particular the
reading of /proc/mounts that glibc does to implement it. We could add
complexity to the kernel to try to improve it, or we could extend the
statfs syscall so glibc can avoid the issue (requiring a glibc upgrade).
But I would like to know whether samba really uses statvfs()
significantly? A minimal sketch of the two calls from the userspace side
follows this mail.

Thanks,
Nick

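For reference, the userspace difference between the two calls, as a
minimal untested sketch (error handling omitted, path arbitrary):
statfs() is a single system call, while glibc's statvfs() makes the same
call underneath and then opens and parses /proc/mounts to fill in the
mount-flag field.

#include <sys/statfs.h>		/* statfs(2): the BSD/Linux extension */
#include <sys/statvfs.h>	/* statvfs(3): the POSIX interface */
#include <stdio.h>

int main(void)
{
	struct statfs sfs;
	struct statvfs svfs;

	/* One syscall, no /proc/mounts traffic. */
	if (statfs("/tmp", &sfs) == 0)
		printf("statfs:  f_bsize=%ld\n", (long)sfs.f_bsize);

	/* Same syscall underneath, plus glibc opening and parsing
	 * /proc/mounts to compute the f_flag mount flags. */
	if (statvfs("/tmp", &svfs) == 0)
		printf("statvfs: f_flag=%lu\n", (unsigned long)svfs.f_flag);

	return 0;
}
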
* Re: [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-12  5:59 UTC
To: Jens Axboe
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

On Mon, Oct 12, 2009 at 05:58:43AM +0200, Nick Piggin wrote:
> Try changing the 'statvfs' syscall in dbench to 'statfs'. glibc has to
> do some nasty stuff parsing /proc/mounts to make statvfs work. On my
> 2s8c opteron it goes like this:
>
>   clients    vanilla kernel    vfs scale (MB/s)
>   1          476               447
>   2          1092              1128
>   4          2027              2260
>   8          2398              4200
>
> Single-threaded performance isn't as good, so I need to look at the
> reasons for that :(. But it's practically linearly scalable now. The
> dropoff at 8 I'd say is probably due to the memory controllers running
> out of steam rather than cacheline or lock contention.

Ah, no: on a bigger machine it starts slowing down again due to shared
cwd contention, possibly from creat/unlink type events. This could be
improved by not restarting the entire path walk when we run into
trouble, but instead trying to proceed from the last successful element.

Anyway, if you do get a chance to run dbench with this modification, I
would appreciate seeing a profile with call traces (my bigger system is
ia64, which doesn't do perf yet).

Thanks,
Nick

* Re: [rfc][patch] store-free path walking
From: Jens Axboe @ 2009-10-12  8:20 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

On Mon, Oct 12 2009, Nick Piggin wrote:
> Ah, no: on a bigger machine it starts slowing down again due to shared
> cwd contention, possibly from creat/unlink type events. This could be
> improved by not restarting the entire path walk when we run into
> trouble, but instead trying to proceed from the last successful
> element.

I was starting to do a few runs, but there's something funky going on
here. The throughput rates are consistent throughout a single run, but
not at all between runs; I suspect this may be due to CPU placement. The
numbers also look pretty odd. Here's an example from a patched kernel
with dbench using statfs:

  Clients    Patched
  ------------------------
  1          1.00
  2          1.23
  4          2.96
  8          1.22
  16         0.89
  32         0.83
  64         0.83

The numbers fluctuate by as much as 20% from run to run. OK, so it seems
the FAIR_SLEEPERS sched feature is responsible for this; if I turn that
off I get more consistent numbers. The below table is -git vs the vfs
patches on -git. The baseline is -git with 1 client; > 1.00 is faster
and vice versa.

  Clients    Vanilla    VFS scale
  -----------------------------------------
  1          1.00       0.96
  2          1.69       1.71
  4          2.16       2.98
  8          0.99       1.00
  16         0.90       0.85

As you can see, it's still quickly spiralling into spending most of the
time (> 95%) spinning on a lock, which kills scaling.

> Anyway, if you do get a chance to run dbench with this modification, I
> would appreciate seeing a profile with call traces (my bigger system is
> ia64, which doesn't do perf yet).

For what number of clients?

-- 
Jens Axboe

* Re: [rfc][patch] store-free path walking
From: Jens Axboe @ 2009-10-12 11:00 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

[-- Attachment #1: Type: text/plain, Size: 676 bytes --]

On Mon, Oct 12 2009, Jens Axboe wrote:
> > Anyway, if you do get a chance to run dbench with this modification, I
> > would appreciate seeing a profile with call traces (my bigger system is
> > ia64, which doesn't do perf yet).
>
> For what number of clients?

Attached are call graphs for 8 and 32 clients, with your latest patch
applied. At 8 clients, we're already at 80% spinlock time. While running
the 32-client record, I wanted to open a file in the tmpfs mount where
dbench was running; bash was spinning until dbench completed. At 32
clients, throughput is 75% of the 8-client case, and spinlock time there
is at 90%. I have attached both outputs.

-- 
Jens Axboe

[-- Attachment #2: 8client.txt.bz2 --]
[-- Type: application/octet-stream, Size: 53343 bytes --]

[-- Attachment #3: 32client.txt.bz2 --]
[-- Type: application/octet-stream, Size: 65409 bytes --]

* Re: [rfc][patch] store-free path walking
From: Christoph Hellwig @ 2009-10-13  1:26 UTC
To: Nick Piggin
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

On Mon, Oct 12, 2009 at 05:58:43AM +0200, Nick Piggin wrote:
> Tridge, Samba people: measuring vfs performance with dbench in my effort
> to improve Linux vfs scalability has shown the statvfs call you make to
> be the final problematic issue for this workload, in particular the
> reading of /proc/mounts that glibc does to implement it.
> [...]
> But I would like to know whether samba really uses statvfs()
> significantly?

Not sure if it's the reason why Samba uses it, but many portable
applications use statvfs because that is the standardized one in XPG /
recent POSIX, while statfs is just a BSD extension Linux picked up. So
making sure statvfs goes fast is a pretty essential thing.

* Re: [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-13  1:52 UTC
To: Christoph Hellwig
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

On Mon, Oct 12, 2009 at 09:26:27PM -0400, Christoph Hellwig wrote:
> Not sure if it's the reason why Samba uses it, but many portable
> applications use statvfs because that is the standardized one in XPG /
> recent POSIX, while statfs is just a BSD extension Linux picked up. So
> making sure statvfs goes fast is a pretty essential thing.

OK, I'll keep that in mind. I think the best idea then will be to extend
statfs so that it returns the mount flags, allowing glibc to implement
statvfs nicely and get rid of the insane stuff it is doing now. I don't
know how we can version that nicely, but I guess I'll talk to the glibc
guys about it.

* Re: [rfc][patch] store-free path walking
From: Linus Torvalds @ 2009-10-07 14:56 UTC
To: Nick Piggin
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra

On Wed, 7 Oct 2009, Nick Piggin wrote:
> OK, I have a really basic patch that does store-free path walking
> (except on the final element).

Yay!

> dbench is still pretty nasty because it seems to do a lot of stupid
> things, like reading from /proc/mounts all the time.

You should largely forget about dbench. It can certainly be a useful
benchmark, but at the same time it's certainly not a _meaningful_ one.
There are better things to try.

> The seqlock in the dentry is for getting a consistent name/len pair, and
> also for not giving a false positive if a rename has partially
> overwritten the name string (false negatives are always fine in the
> lock-free lookup path, but false positives are not). Possibly we could
> make do with a per-sb seqlock for this (or just rename_lock).

My plan was always to just use rename_lock, and only do it at the outer
level (and do it for both lookup failures _and_ for the success case).
Your approach is _way_ more conservative than I would have done, and
also potentially much slower due to the seqlock-per-path-component
thing. Remember: seqlocks are almost free on x86, but they can be
reasonably expensive in other places.

Hmm. Regardless, this very much does look like what I envisioned, apart
from details like that. And maybe your per-dentry seqlock is the right
choice. On x86, it certainly doesn't have the performance issues it
could have in other places.

I'd like Al to take a look, if he's around.

		Linus

* Re: [rfc][patch] store-free path walking
From: Linus Torvalds @ 2009-10-07 16:27 UTC
To: Nick Piggin
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra

On Wed, 7 Oct 2009, Linus Torvalds wrote:
> Hmm. Regardless, this very much does look like what I envisioned, apart
> from details like that. And maybe your per-dentry seqlock is the right
> choice. On x86, it certainly doesn't have the performance issues it
> could have in other places.

Actually, if we really want to do the per-dentry thing, then we should
change it a bit. Rather than using a seqlock data structure (which is
really just an unsigned counter and a spinlock), we could do just the
unsigned counter, and use d_lock as the spinlock for the sequence lock.

The hackiest way to do that would be to get rid of d_lock entirely,
replace it with d_seqlock, and then just do

	#define d_lock d_seqlock.lock

instead (but the dentry structure may well have layout issues that make
that not work very well - we're mixing pointers and 'int'-sized things
and need to pack them well etc).

That would cut down the seqlock memory costs from 8 bytes (or more:
just the spinlock itself is currently 8 bytes on ia64, so on ia64 the
seqlock is actually 16 bytes, not to mention all the spinlock debugging
cases) to just four bytes.

However, I still suspect we could do things entirely without the
seqlock. The outer seqlock will handle the "couldn't find it" case, and
I've got the strongest feeling that we should be able to just use some
basic memory ordering on the dentry hash to make the inner seqlock
unnecessary (i.e. make sure that either we don't see the old entry at
all, or that we can guarantee it won't trigger a successful compare
while the rename is in process, because we set the dentry name length
to zero).

		Linus

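To make that idea concrete, here is a small userspace model of it
(illustrative only: names are invented and the memory ordering is
simplified compared to the kernel's real seqlock implementation). The
writer reuses an existing lock for serialization, so only a bare 4-byte
counter has to sit next to the protected data:

#include <pthread.h>
#include <string.h>

struct name {
	pthread_mutex_t lock;	/* plays the role of d_lock */
	unsigned int seq;	/* bare seqcount: 4 bytes, no embedded lock */
	char str[32];
};

/* Writer: serialized by the existing lock. Bump the counter to odd
 * before modifying the data, and back to even afterwards. */
static void name_write(struct name *n, const char *s)
{
	pthread_mutex_lock(&n->lock);
	__atomic_store_n(&n->seq, n->seq + 1, __ATOMIC_RELEASE);
	strncpy(n->str, s, sizeof(n->str) - 1);
	__atomic_store_n(&n->seq, n->seq + 1, __ATOMIC_RELEASE);
	pthread_mutex_unlock(&n->lock);
}

/* Reader: lock-free. Retry if a writer was active (odd count), or if
 * the count changed while we were copying. */
static void name_read(struct name *n, char *out, size_t len)
{
	unsigned int seq;

	do {
		seq = __atomic_load_n(&n->seq, __ATOMIC_ACQUIRE);
		if (seq & 1)
			continue;
		memcpy(out, n->str, len);
	} while (__atomic_load_n(&n->seq, __ATOMIC_ACQUIRE) != seq);
}
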
* Re: [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-07 16:46 UTC
To: Linus Torvalds
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra

On Wed, Oct 07, 2009 at 09:27:59AM -0700, Linus Torvalds wrote:
> The hackiest way to do that would be to get rid of d_lock entirely,
> replace it with d_seqlock, and then just do
>
> 	#define d_lock d_seqlock.lock
>
> [...]
> That would cut down the seqlock memory costs from 8 bytes (or more:
> just the spinlock itself is currently 8 bytes on ia64, so on ia64 the
> seqlock is actually 16 bytes, not to mention all the spinlock debugging
> cases) to just four bytes.

Oh, I did that: I used a "seqcount", which is the bare sequence counter
(and I update it while holding d_lock). Yes, it still has packing
issues, although I think I can get rid of d_mounted, so the struct will
then pack nicely and its size won't change. (Just keep a flag for
whether the dentry is mounted on at all, and store the count elsewhere
for mountpoints; or even just search the mount hash on each umount to
see if anything is left mounted on it.)

> However, I still suspect we could do things entirely without the
> seqlock. The outer seqlock will handle the "couldn't find it" case, and
> I've got the strongest feeling that we should be able to just use some
> basic memory ordering on the dentry hash to make the inner seqlock
> unnecessary

Well, I would be all for improving things, of course. But keep in mind
we already take the rename_lock seqcount for each d_lookup, so the
lock-free lookup path is only doing extra seqcounts in the dcache hash
collision cases. But I do agree it needs more thought. I'll try to get
the powerpc guys interested in running tests for us tomorrow :)

* Re: [rfc][patch] store-free path walking
From: Linus Torvalds @ 2009-10-07 19:25 UTC
To: Nick Piggin
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra

On Wed, 7 Oct 2009, Nick Piggin wrote:
> Oh, I did that: I used a "seqcount", which is the bare sequence counter
> (and I update it while holding d_lock).

Oh, I didn't notice, for a really silly reason: I looked at how you had
changed DNAME_INLINE_LEN_MIN on 64-bit, and noticed that you were using
8 bytes for your sequence lock. But that's only true due to the padding
things, and the seqcount itself is indeed just 4 bytes.

Sad. But we do have the added requirement that we want to keep the
commonly used fields together in the same cacheline, so maybe it's
unavoidable.

		Linus

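The padding effect is easy to demonstrate in isolation (userspace
sketch with an illustrative subset of fields, not the real struct
dentry): on an LP64 build, a lone 4-byte counter sandwiched between
pointer-aligned members costs a full 8 bytes.

#include <stdio.h>

struct dentry_without_seq {
	unsigned int d_count;
	unsigned int d_flags;
	void *d_lock;		/* stand-in for spinlock_t */
	void *d_inode;
};

struct dentry_with_seq {
	unsigned int d_count;
	unsigned int d_flags;
	void *d_lock;
	unsigned int d_seq;	/* 4 bytes of counter... */
	void *d_inode;		/* ...but 8 bytes of growth, because of
				 * the alignment padding before this
				 * pointer */
};

int main(void)
{
	printf("without seq: %zu bytes\n", sizeof(struct dentry_without_seq));
	printf("with seq:    %zu bytes\n", sizeof(struct dentry_with_seq));
	return 0;
}
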
* Re: [rfc][patch] store-free path walking
From: Andi Kleen @ 2009-10-07 20:34 UTC
To: Linus Torvalds
Cc: Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra

Linus Torvalds <torvalds@linux-foundation.org> writes:
> Sad. But we do have the added requirement that we want to keep the
> commonly used fields together in the same cacheline, so maybe it's
> unavoidable.

With some strategic prefetches it might be cheap enough to use two cache
lines. Just fetch the second one along with the first.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: [rfc][patch] store-free path walking 2009-10-07 20:34 ` Andi Kleen @ 2009-10-07 20:51 ` Linus Torvalds 2009-10-07 21:06 ` Andi Kleen 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2009-10-07 20:51 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, 7 Oct 2009, Andi Kleen wrote: > > With some strategic prefetches it might be cheap enough to use > two cache lines. Just fetch it with the other. I _seriously_ doubt that. This is one of the biggest (and hottest) hash tables in the whole kernel. No amount of prefetching will help the fact that you effectively double your cache footprint / working set if you have to fetch two cachelines in the hot case lookup rather than one. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 20:51 ` Linus Torvalds @ 2009-10-07 21:06 ` Andi Kleen 2009-10-07 21:20 ` Linus Torvalds 0 siblings, 1 reply; 57+ messages in thread From: Andi Kleen @ 2009-10-07 21:06 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, Oct 07, 2009 at 01:51:07PM -0700, Linus Torvalds wrote: > > > On Wed, 7 Oct 2009, Andi Kleen wrote: > > > > With some strategic prefetches it might be cheap enough to use > > two cache lines. Just fetch it with the other. > > I _seriously_ doubt that. > > This is one of the biggest (and hottest) hash tables in the whole kernel. > > No amount of prefetching will help the fact that you effectively double > your cache footprint / working set if you have to fetch two cachelines in > the hot case lookup rather than one. It won't double it, because there are many more cache lines from all kinds of other things in these paths. The point was just to hide the latency of fetching two at different times. The actual cache footprint is not that important, as long as it's not excessive; it's the latencies that hurt. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 21:06 ` Andi Kleen @ 2009-10-07 21:20 ` Linus Torvalds 2009-10-07 21:57 ` Linus Torvalds 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2009-10-07 21:20 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, 7 Oct 2009, Andi Kleen wrote: > > It won't double it, because there are many more cache lines from > all kinds of other things in these paths. Not all that many. The big part of the D$ is the hash table lookup in this path. > The point was just to hide the latency of fetching two at different > times. The actual cache footprint is not that important, as long as > it's not excessive You're full of sh*t. Every extra line you fetch pushes out another line. No amount of prefetching will hide that. So stop spreading idiotic fairytales. The best optimization is to not do the work at all. It really is that simple. Prefetching as an optimization technique absolutely sucks. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 21:20 ` Linus Torvalds @ 2009-10-07 21:57 ` Linus Torvalds 2009-10-07 22:22 ` Andi Kleen 2009-10-08 13:12 ` Denys Vlasenko 0 siblings, 2 replies; 57+ messages in thread From: Linus Torvalds @ 2009-10-07 21:57 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, 7 Oct 2009, Linus Torvalds wrote: > > Every extra line you fetch pushes out another line. No amount of > prefetching will hide that. Side note: what it _will_ do is hide the cost, and spread it out to other places where you don't see it any more, or you see it but can't do anything about it. And that's not a good thing. You end up with "muddy" profiles that don't have nice clear problem spots any more, so you may think that your profile looks better ("Look ma, no nasty hotspots!"), but actual performance hasn't improved one whit, and may well have decreased. This, btw, is exactly the kind of thing we saw with some of the non-temporal work, when we used nontemporal stores to copy pages on COW faults, or when doing pre-zeroing of pages. You get rid of some of the hot-spots in the kernel, and you then replace them with user space taking the cache misses in random spots instead. The kernel profile looks better, and system time may go down, but actual performance never went up - you just moved your cache miss cost from one place to another. There's no question that prefetching can help, but it helps only if it's about fetching data that you would need anyway, early. In contrast, if the option is "don't touch the other cacheline at all", prefetching is _always_ a loss. No ifs, buts and maybes about it. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 21:57 ` Linus Torvalds @ 2009-10-07 22:22 ` Andi Kleen 2009-10-08 7:39 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Andi Kleen @ 2009-10-07 22:22 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, Oct 07, 2009 at 02:57:20PM -0700, Linus Torvalds wrote: > There's no question that prefetching can help, but it helps only if > it's about fetching data that you would need anyway, early. In contrast, if > the option is "don't touch the other cacheline at all", prefetching is > _always_ a loss. No ifs, buts and maybes about it. My point (probably not very well expressed) was that in a typical VFS operation there are hundreds of cache lines touched for various things (code, global, various objects, random stuff) and one more or less in the final dentry is not that big a difference in the global picture. (ok I realize final in this case means the elements in the path) Also typical operations don't do the same VFS operation in a loop, but other things that cool the caches first and then have to fetch everything in again. I agree that touching more cache lines on the hash chain walk for intermediates would be dumb because there can be potentially a lot of them, but the final referenced ones are much fewer. Or rather if minimizing total footprint is the goal there are lower hanging fruit than in the dentry itself by just cutting fat from the whole path. (e.g. I liked Mathieu's immediate value work recently reposted because it had the nice potential to remove a lot of "global" cache lines in such paths by pushing them into the icache). And if it's possible to do fewer dcache locks or less looping in a seqlock by paying with one cache line that could be a reasonable trade-off, assuming you can hide the latency. Or maybe not and I'm totally wrong on that. I'll shut up on this now. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 22:22 ` Andi Kleen @ 2009-10-08 7:39 ` Nick Piggin 2009-10-09 17:53 ` Andi Kleen 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-08 7:39 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08, 2009 at 12:22:35AM +0200, Andi Kleen wrote: > On Wed, Oct 07, 2009 at 02:57:20PM -0700, Linus Torvalds wrote: > > There's no question that prefetching can help, but it helps only if > > it's about fetching data that you would need anyway, early. In contrast, if > > the option is "don't touch the other cacheline at all", prefetching is > > _always_ a loss. No ifs, buts and maybes about it. > > My point (probably not very well expressed) > was that in a typical VFS operation there are hundreds > of cache lines touched for various things (code, global, various > objects, random stuff) and one more or less in the final dentry > is not that big a difference in the global picture. > (ok I realize final in this case means the elements in the path) Yes I do know what you were getting at... Although I would say dcache is still fairly significant. It's 3 cachelines for every single path element we look up. Now I suspect it has been a while since someone closely looked at the cache layout with an eye to the lookup code. Actually no, it does look relatively sane (I guess it doesn't change that much), except it might actually be a nice idea to move d_iname up to just above d_lru, so we could actually do path lookup in 2 cachelines per entry (or even 1 if they have very short names). I will do a bit of profiling on the lookup code and see... ^ permalink raw reply [flat|nested] 57+ messages in thread
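The layout idea reads roughly as follows. This is a hypothetical field ordering with assumed 64-byte cachelines, not the real struct dentry, and the field set is trimmed to make the point: everything the hash walk and name compare touch sits in the first cacheline(s), and fields only dput/reclaim need come later.

#include <linux/list.h>
#include <linux/seqlock.h>
#include <linux/dcache.h>	/* struct qstr */

struct dentry_layout_sketch {
	/* cacheline 0: hot for every lookup */
	unsigned int d_flags;
	seqcount_t d_seq;			/* rename detection for lock-free walk */
	struct hlist_node d_hash;		/* chain walked on every lookup */
	struct dentry_layout_sketch *d_parent;
	struct qstr d_name;			/* hash, len, name pointer compared */
	unsigned char d_iname[32];		/* short names resolve without chasing d_name.name */
	/* later cachelines: only refcounting/reclaim touch these */
	struct list_head d_lru;
	struct list_head d_subdirs;
	struct list_head d_alias;
};

Moving d_iname adjacent to the compared fields is exactly what would let a short-name lookup stay within fewer lines per path element.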
* Re: [rfc][patch] store-free path walking 2009-10-08 7:39 ` Nick Piggin @ 2009-10-09 17:53 ` Andi Kleen 0 siblings, 0 replies; 57+ messages in thread From: Andi Kleen @ 2009-10-09 17:53 UTC (permalink / raw) To: Nick Piggin Cc: Andi Kleen, Linus Torvalds, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08, 2009 at 09:39:30AM +0200, Nick Piggin wrote: > > My point (probably not very well expressed) > > was that in a typical VFS operation there are hundreds > > of cache lines touched for various things (code, global, various > > objects, random stuff) and one more or less in the final dentry > > is not that big a difference in the global picture. > > (ok I realize final in this case means the elements in the path) > > Yes I do know what you were getting at... Although I would > say dcache is still fairly significant. It's 3 cachelines > for every single path element we look up. Now I suspect it That's true, for deeply nested paths where the elements add up it's a concern. I was assuming that paths are not that deep, but that might be too optimistic. > Actually no, it does look relatively sane (I guess it doesn't > change that much), except it might actually be a nice idea > to move d_iname up to just above d_lru, so we could actually > do path lookup in 2 cachelines per entry (or even 1 if > they have very short names). > > I will do a bit of profiling on the lookup code and see... Yes, numbers would be great. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 21:57 ` Linus Torvalds 2009-10-07 22:22 ` Andi Kleen @ 2009-10-08 13:12 ` Denys Vlasenko 2009-10-09 7:47 ` Nick Piggin 2009-10-09 17:49 ` Andi Kleen 1 sibling, 2 replies; 57+ messages in thread From: Denys Vlasenko @ 2009-10-08 13:12 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, Oct 7, 2009 at 11:57 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > This, btw, is exactly the kind of thing we saw with some of the > non-temporal work, when we used nontemporal stores to copy pages on COW > faults, or when doing pre-zeroing of pages. You get rid of some of the > hot-spots in the kernel, and you then replace them with user space taking > the cache misses in random spots instead. The kernel profile looks better, > and system time may go down, but actual performance never went up - you > just moved your cache miss cost from one place to another. A few years ago when K7s were not ancient yet, after hearing arguments for and against non-temporal stores, I decided to finally figure it out for myself. I tested a kernel build workload on two kernels with only one difference - clear_page with and without non-temporal stores. The "non-temporal stores" kernel was faster, not slower. Just a little bit, but reproducibly. -- vda ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 13:12 ` Denys Vlasenko @ 2009-10-09 7:47 ` Nick Piggin 2009-10-09 17:49 ` Andi Kleen 1 sibling, 0 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-09 7:47 UTC (permalink / raw) To: Denys Vlasenko Cc: Linus Torvalds, Andi Kleen, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08, 2009 at 03:12:08PM +0200, Denys Vlasenko wrote: > On Wed, Oct 7, 2009 at 11:57 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > This, btw, is exactly the kind of thing we saw with some of the > > non-temporal work, when we used nontemporal stores to copy pages on COW > > faults, or when doing pre-zeroing of pages. You get rid of some of the > > hot-spots in the kernel, and you then replace them with user space taking > > the cache misses in random spots instead. The kernel profile looks better, > > and system time may go down, but actual performance never went up - you > > just moved your cache miss cost from one place to another. > > A few years ago when K7s were not ancient yet, after hearing > arguments for and against non-temporal stores, > I decided to finally figure it out for myself. > > I tested a kernel build workload on two kernels with only > one difference - clear_page with and without non-temporal stores. > > The "non-temporal stores" kernel was faster, not slower. Just a little bit, > but reproducibly. It is going to be highly dependent on architecture and workload and exactly where you use the nontemporal stores, of course. I would say that with non-temporal stores in clear_page (a case where we can often expect the memory to be used again quickly because it is anonymous process memory), we are quite likely to cause _more_ activity on the memory controller and DIMMs, which costs far more power than cache accesses. ^ permalink raw reply [flat|nested] 57+ messages in thread
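For reference, the variant being benchmarked amounts to something like this userspace sketch (x86-64 with SSE2; the kernel's real clear_page implementations are arch-specific assembly and differ from this):

#include <emmintrin.h>

#define PAGE_SZ 4096

/* Zero one page using non-temporal stores that bypass the cache. */
static void clear_page_nt(void *page)
{
	__m128i zero = _mm_setzero_si128();
	char *p = page;
	int i;

	for (i = 0; i < PAGE_SZ; i += 64) {
		/* movntdq: write-combined, allocates no cachelines */
		_mm_stream_si128((__m128i *)(p + i +  0), zero);
		_mm_stream_si128((__m128i *)(p + i + 16), zero);
		_mm_stream_si128((__m128i *)(p + i + 32), zero);
		_mm_stream_si128((__m128i *)(p + i + 48), zero);
	}
	_mm_sfence();	/* order the NT stores before the page is handed out */
}

Both positions in the argument are visible here: the stores displace nothing from the cache, but the zeroed page arrives cold for whoever touches it next.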
* Re: [rfc][patch] store-free path walking 2009-10-08 13:12 ` Denys Vlasenko 2009-10-09 7:47 ` Nick Piggin @ 2009-10-09 17:49 ` Andi Kleen 1 sibling, 0 replies; 57+ messages in thread From: Andi Kleen @ 2009-10-09 17:49 UTC (permalink / raw) To: Denys Vlasenko Cc: Linus Torvalds, Andi Kleen, Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra > A few years ago when K7s were not ancient yet, after hearing > arguments for and against non-temporal stores, > I decided to finally figure it out for myself. > > I tested a kernel build workload on two kernels with only > one difference - clear_page with and without non-temporal stores. > > The "non-temporal stores" kernel was faster, not slower. Just a little bit, > but reproducibly. I did the same experiments, and in my case (K8) they were slower in real-world benchmarks for clear_page (but faster in microbenchmarks). But I didn't propose non-temporal stores anyway. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 14:56 ` Linus Torvalds 2009-10-07 16:27 ` Linus Torvalds @ 2009-10-07 16:29 ` Nick Piggin 2009-10-08 12:36 ` Nick Piggin 2 siblings, 0 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-07 16:29 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote: > > > On Wed, 7 Oct 2009, Nick Piggin wrote: > > > > OK, I have a really basic patch that does store-free path walking > > (except on the final element). > > Yay! > > > dbench is pretty nasty still because it seems to do a lot of stupid > > things like reading from /proc/mounts all the time. > > You should largely forget about dbench, it can certainly be a useful > benchmark, but at the same time it's certainly not a _meaningful_ one. > There are better things to try. Yes, sure. I'm just pointing out that it seems to do insane things (like reading /proc/mounts at regular intervals, although I don't see that in the dbench source so I really hope it isn't libc being "smart"). I agree it is not a very good benchmark. > > The seqlock in the dentry is for getting consistent name,len pointer, > > and also not giving a false positive if a rename has partially > > overwritten the name string (false negatives are always fine in the > > lock free lookup path but false positives are not). Possibly we > > could make do with a per-sb seqlock for this (or just rename_lock). > > My plan was always to just use rename_lock, and only do it at the outer > level (and do it for both lookup failures _and_ for the success case). > Your approach is _way_ more conservative than I would have done, and also > potentially much slower due to the seqlock-per-path-component thing. Hmm, the only issue is that we need a consistent load of the name pointer and the length, otherwise our memcmp might go crazy. We could solve this by another level of indirection so a rename only requires a pointer swap... But anyway, in this approach I only use a single seqlock, because the negative case always falls out to the locked walk anyway (this again might be a bit conservative and something we could tighten up). > Remember: seqlocks are almost free on x86, but they can be reasonably > expensive in other places. > > Hmm. Regardless, this very much does look like what I envisioned, apart > from details like that. And maybe your per-dentry seqlock is the right > choice. On x86, it certainly doesn't have the performance issues it could > have in other places. Yeah, well at least the basics seem to be there. I agree it is not totally clean and will have some cases that need optimising, but it is something people can start looking at... ^ permalink raw reply [flat|nested] 57+ messages in thread
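The memcmp hazard is concrete: if a rename updates the name pointer and the length non-atomically, a lock-free walker can pair an old pointer with a new length. A sketch of the per-dentry retry read, assuming the seqcount field the patch adds (called d_seq here; the names are illustrative, not the actual code):

#include <linux/dcache.h>
#include <linux/seqlock.h>
#include <linux/string.h>

static int dentry_name_eq(struct dentry *dentry, const char *name,
			  unsigned int len)
{
	unsigned int seq;
	int eq;

	do {
		seq = read_seqcount_begin(&dentry->d_seq);
		eq = dentry->d_name.len == len &&
		     memcmp(dentry->d_name.name, name, len) == 0;
	} while (read_seqcount_retry(&dentry->d_seq, seq));
	/*
	 * The loop guarantees len and name came from the same update,
	 * so a half-overwritten name can never yield a false positive.
	 */
	return eq;
}

The per-sb or rename_lock alternative mentioned above would replace d_seq with one global counter, trading retry granularity for dentry size.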
* Re: [rfc][patch] store-free path walking 2009-10-07 14:56 ` Linus Torvalds 2009-10-07 16:27 ` Linus Torvalds 2009-10-07 16:29 ` Nick Piggin @ 2009-10-08 12:36 ` Nick Piggin 2009-10-08 12:57 ` Jens Axboe 2009-10-09 3:50 ` Nick Piggin 2 siblings, 2 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-08 12:36 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote: > On Wed, 7 Oct 2009, Nick Piggin wrote: > > > > OK, I have a really basic patch that does store-free path walking > > (except on the final element). > > Yay! > > > dbench is pretty nasty still because it seems to do a lot of stupid > > things like reading from /proc/mounts all the time. > > You should largely forget about dbench, it can certainly be a useful > benchmark, but at the same time it's certainly not a _meaningful_ one. > There are better things to try. OK, here's one you might find interesting. It is a cached git diff workload in a linux kernel tree. I actually ran it in a loop 100 times in order to get some reasonable sample sizes, then I ran parallel and serial configs (PreloadIndex = true/false). Compared plain kernel with all vfs patches to now. 2.6.32-rc3 serial 5.35user 7.12system 0:12.47elapsed 100%CPU 2.6.32-rc3 parallel 5.79user 17.69system 0:09.41elapsed 249%CPU vfs serial 5.30user 5.62system 0:10.92elapsed 100%CPU vfs parallel 4.86user 0.68system 0:06.82elapsed 81%CPU (I don't know what happened with CPU accounting on the last one, but elapsed time was accurate). The profiles are interesting. It's pretty verbose but I've included just the backtraces for the locking functions. serial plain # Samples: 288849 # # Overhead Command Shared Object # ........ .............. ................................ 
# 55.46% git [kernel] | |--36.52%-- __d_lookup |--9.57%-- __link_path_walk |--6.26%-- _atomic_dec_and_lock | | | |--39.42%-- dput | | | | | |--53.66%-- path_put | | | | | | | |--90.91%-- vfs_fstatat | | | | vfs_lstat | | | | sys_newlstat | | | | system_call_fastpath | | | | | | | --9.09%-- path_walk | | | do_path_lookup | | | user_path_at | | | vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --46.34%-- __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--31.73%-- path_put | | | | | |--57.58%-- vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --42.42%-- path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--21.15%-- __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --7.69%-- mntput_no_expire | path_put | | | |--50.00%-- vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --50.00%-- path_walk | do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--5.78%-- strncpy_from_user |--5.60%-- _spin_unlock | | | |--88.17%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--4.30%-- path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--3.23%-- do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--2.15%-- handle_mm_fault | | do_page_fault | | page_fault | | | --2.15%-- __d_lookup | do_lookup | __link_path_walk | path_walk | do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--5.17%-- generic_fillattr |--2.95%-- acl_permission_check |--1.87%-- groups_search |--1.81%-- kmem_cache_free |--1.68%-- system_call |--1.62%-- clear_page_c |--1.56%-- do_lookup |--1.44%-- _spin_lock | | | |--58.33%-- __d_lookup | | do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--20.83%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--16.67%-- do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | --4.17%-- copy_process | do_fork | sys_clone | stub_clone | __libc_fork | 0x494a5d | |--1.38%-- dput |--1.38%-- mntput_no_expire |--1.32%-- cp_new_stat |--1.26%-- path_walk |--1.20%-- sysret_check |--1.08%-- kmem_cache_alloc |--0.96%-- __follow_mount |--0.96%-- copy_user_generic_string |--0.66%-- in_group_p |--0.54%-- page_fault --7.40%-- [...] So serial case still has significant time in locking. 13% of all kernel cycles. vfs # Samples: 254207 # # Overhead Command Shared Object # ........ .............. ................................
# 53.15% git [kernel] | |--37.47%-- __d_lookup_rcu |--15.63%-- link_path_walk_rcu |--6.70%-- strncpy_from_user |--5.65%-- generic_fillattr |--3.49%-- _spin_lock | | | |--66.00%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--14.00%-- mntput_no_expire | | mntput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--6.00%-- link_path_walk_rcu | | do_path_lookup | | | | | |--66.67%-- user_path_at | | | vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --33.33%-- do_filp_open | | do_sys_open | | sys_open | | system_call_fastpath | | | |--4.00%-- path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--4.00%-- do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--2.00%-- anon_vma_link | | dup_mm | | copy_process | | do_fork | | sys_clone | | stub_clone | | __libc_fork | | | |--2.00%-- do_page_fault | | page_fault | | | --2.00%-- vfsmount_read_lock | mntput_no_expire | mntput | path_put | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--2.44%-- kmem_cache_free |--1.95%-- system_call |--1.88%-- groups_search |--1.81%-- do_path_lookup |--1.54%-- cp_new_stat |--1.33%-- clear_page_c |--1.33%-- kmem_cache_alloc |--1.12%-- mntput_no_expire |--1.05%-- do_lookup_rcu |--0.98%-- dput |--0.91%-- page_fault |--0.91%-- copy_user_generic_string |--0.77%-- sysret_check |--0.77%-- in_group_p |--0.77%-- getname |--0.70%-- _spin_unlock | | | |--30.00%-- mntput_no_expire | | mntput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--20.00%-- link_path_walk_rcu | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--10.00%-- handle_mm_fault | | do_page_fault | | page_fault | | 0x45f62a | | | |--10.00%-- vfsmount_read_unlock | | mntput_no_expire | | mntput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--10.00%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--10.00%-- path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | --10.00%-- do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | __lxstat | |--0.63%-- path_put |--0.56%-- copy_page_c |--0.56%-- user_path_at --9.07%-- [...] Locking goes to about 4%, significantly from dput of the final dentry element, which is basically impossible to avoid, so we're much closer to optimal. The parallel case is interesting too. plain # Samples: 635836 # # Overhead Command Shared Object # ........ .............. ................................
# 76.39% git [kernel] | |--32.26%-- _atomic_dec_and_lock | | | |--60.44%-- dput | | | | | |--51.15%-- path_put | | | | | | | |--94.91%-- path_walk | | | | do_path_lookup | | | | user_path_at | | | | vfs_fstatat | | | | vfs_lstat | | | | sys_newlstat | | | | system_call_fastpath | | | | | | | --5.09%-- vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --48.85%-- __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--14.04%-- mntput_no_expire | | path_put | | | | | |--51.29%-- path_walk | | | do_path_lookup | | | user_path_at | | | vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --48.71%-- vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--13.01%-- path_put | | | | | |--95.81%-- path_walk | | | do_path_lookup | | | user_path_at | | | vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --4.19%-- vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --12.52%-- __link_path_walk | path_walk | do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--13.23%-- path_walk |--12.94%-- __d_lookup |--7.81%-- do_path_lookup |--7.53%-- path_init |--3.84%-- __link_path_walk |--2.36%-- acl_permission_check |--2.15%-- _spin_lock | | | |--42.73%-- _atomic_dec_and_lock | | dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--39.09%-- __d_lookup | | do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--9.09%-- do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--8.18%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --0.91%-- system_call_fastpath | 0x7fb0fcf23257 | 0x7fb0fcf158bd | |--2.01%-- generic_fillattr |--1.76%-- _spin_unlock | | | |--85.56%-- dput | | path_put | | | | | |--98.70%-- vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --1.30%-- __link_path_walk | | path_walk | | do_path_lookup | | do_filp_open | | do_sys_open | | sys_open | | system_call_fastpath | | | |--5.56%-- __d_lookup | | do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--4.44%-- path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--2.22%-- do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--1.11%-- handle_mm_fault | | do_page_fault | | page_fault | | | --1.11%-- update_process_times | tick_sched_timer | __run_hrtimer | hrtimer_interrupt | smp_apic_timer_interrupt | apic_timer_interrupt | |--1.62%-- _read_unlock | | | |--75.90%-- path_init | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --24.10%-- do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--1.29%-- strncpy_from_user |--1.17%-- path_put |--1.01%-- dput |--0.62%-- kmem_cache_free |--0.60%-- do_lookup |--0.59%-- clear_page_c We can see it is really starting to choke on atomic_dec_and_lock. 
I don't know how many tasks you spawn off in git here, but it looks like this is nearing the absolute limit of scalability. vfs # Samples: 273522 # # Overhead Command Shared Object # ........ .............. ................................ # 48.24% git [kernel] | |--32.37%-- __d_lookup_rcu |--14.14%-- link_path_walk_rcu |--7.57%-- _read_unlock | | | |--96.46%-- path_init_rcu | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --3.54%-- do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--7.04%-- generic_fillattr |--5.50%-- strncpy_from_user |--2.68%-- kmem_cache_free |--2.55%-- _spin_lock | | | |--81.58%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--5.26%-- do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--5.26%-- try_to_wake_up | | | | | |--50.00%-- wake_up_state | | | wake_futex | | | futex_wake | | | do_futex | | | sys_futex | | | mm_release | | | exit_mm | | | do_exit | | | sys_exit | | | system_call_fastpath | | | start_thread | | | | | --50.00%-- wake_up_process | | __up_write | | up_write | | sys_mmap | | system_call_fastpath | | mmap64 | | | |--5.26%-- vfsmount_read_lock | | mntput_no_expire | | mntput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | | | |--50.00%-- 0x7f7640b9e2c0 | | | 0x4ab3b1fc | | | | | --50.00%-- 0x7f7640bb4e78 | | 0x4a803476 | | | --2.63%-- path_put | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | __lxstat | 0x7f7640d7f488 | 0x4a8034a4 | |--2.48%-- clear_page_c |--1.61%-- system_call |--1.47%-- copy_user_generic_string |--1.41%-- cp_new_stat |--1.41%-- groups_search |--1.21%-- do_lookup_rcu |--0.94%-- kmem_cache_alloc |--0.94%-- do_path_lookup |--0.87%-- in_group_p |--0.80%-- page_fault |--0.80%-- sysret_check |--0.74%-- dput |--0.67%-- getname |--0.67%-- user_path_at |--0.67%-- mntput_no_expire |--0.60%-- unmap_vmas |--0.54%-- _spin_unlock |--0.54%-- vfs_fstatat |--0.54%-- path_init_rcu --9.25%-- [...] This one is interesting. spin_lock/spin_unlock remains very low; however, read_unlock pops up. This would be... fs->lock. You're using threads then (rather than processes)? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 12:36 ` Nick Piggin @ 2009-10-08 12:57 ` Jens Axboe 2009-10-08 13:22 ` Nick Piggin 2009-10-09 3:50 ` Nick Piggin 1 sibling, 1 reply; 57+ messages in thread From: Jens Axboe @ 2009-10-08 12:57 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08 2009, Nick Piggin wrote: > On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote: > > On Wed, 7 Oct 2009, Nick Piggin wrote: > > > > > > OK, I have a really basic patch that does store-free path walking > > > (except on the final element). > > > > Yay! > > > > > dbench is pretty nasty still because it seems to do a lot of stupid > > > things like reading from /proc/mounts all the time. > > > > You should largely forget about dbench, it can certainly be a useful > > benchmark, but at the same time it's certainly not a _meaningful_ one. > > There are better things to try. > > OK, here's one you might find interesting. It is a cached git diff > workload in a linux kernel tree. I actually ran it in a loop 100 > times in order to get some reasonable sample sizes, then I ran > parallel and serial configs (PreloadIndex = true/false). Compared > plain kernel with all vfs patches to now. > > 2.6.32-rc3 serial > 5.35user 7.12system 0:12.47elapsed 100%CPU > > 2.6.32-rc3 parallel > 5.79user 17.69system 0:09.41elapsed 249%CPU > > vfs serial > 5.30user 5.62system 0:10.92elapsed 100%CPU > > vfs parallel > 4.86user 0.68system 0:06.82elapsed 81%CPU Since the box was booted anyway, I tried the git test too. Results are with 2.6.32-rc3 serial being the baseline 1.00 scores, smaller than 1.00 are faster and vice versa. 2.6.32-rc3 serial real 1.00 user 1.00 sys 1.00 2.6.32-rc3 parallel real 0.80 user 0.83 sys 8.38 sys time, auch... vfs serial real 0.86 user 0.93 sys 0.84 vfs parallel real 0.43 user 0.72 sys 0.13 Let me know if you want profiles or anything like that. I'd say that looks veeeery tasty. -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 12:57 ` Jens Axboe @ 2009-10-08 13:22 ` Nick Piggin 2009-10-08 13:30 ` Jens Axboe 2009-10-09 8:54 ` Jens Axboe 0 siblings, 2 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-08 13:22 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08, 2009 at 02:57:46PM +0200, Jens Axboe wrote: > On Thu, Oct 08 2009, Nick Piggin wrote: > > On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote: > > > On Wed, 7 Oct 2009, Nick Piggin wrote: > > > > > > > > OK, I have a really basic patch that does store-free path walking > > > > (except on the final element). > > > > > > Yay! > > > > > > > dbench is pretty nasty still because it seems to do a lot of stupid > > > > things like reading from /proc/mounts all the time. > > > > > > You should largely forget about dbench, it can certainly be a useful > > > benchmark, but at the same time it's certainly not a _meaningful_ one. > > > There are better things to try. > > > > OK, here's one you might find interesting. It is a cached git diff > > workload in a linux kernel tree. I actually ran it in a loop 100 > > times in order to get some reasonable sample sizes, then I ran > > parallel and serial configs (PreloadIndex = true/false). Compared > > plain kernel with all vfs patches to now. > > > > 2.6.32-rc3 serial > > 5.35user 7.12system 0:12.47elapsed 100%CPU > > > > 2.6.32-rc3 parallel > > 5.79user 17.69system 0:09.41elapsed 249%CPU > > > > vfs serial > > 5.30user 5.62system 0:10.92elapsed 100%CPU > > > > vfs parallel > > 4.86user 0.68system 0:06.82elapsed 81%CPU > > Since the box was booted anyway, I tried the git test too. Results are > with 2.6.32-rc3 serial being the baseline 1.00 scores, smaller than 1.00 > are faster and vice versa. > > 2.6.32-rc3 serial > real 1.00 > user 1.00 > sys 1.00 > > 2.6.32-rc3 parallel > real 0.80 > user 0.83 > sys 8.38 > > sys time, auch... > > vfs serial > real 0.86 > user 0.93 > sys 0.84 This is actually nice too. My tests were on a 2s8c Barcelona system, but this is showing we have a nice serial win on Nehalem as well. Actually K8 CPUs have a bit faster lock primitives than earlier Intel CPUs I think (closer to Nehalem), so we might see an even bigger win with a Core2. > vfs parallel > real 0.43 > user 0.72 > sys 0.13 > > Let me know if you want profiles or anything like that. I'd say that > looks veeeery tasty. It doesn't look all that different to mine, so profiles probably not required at this point. Is the CPU accounting going wrong? It looks like thread times are not being accumulated back properly, which IIRC they should be. But 'real' time should be accurate, so it is going a lot faster. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 13:22 ` Nick Piggin @ 2009-10-08 13:30 ` Jens Axboe 2009-10-08 18:00 ` Peter Zijlstra 1 sibling, 1 reply; 57+ messages in thread From: Jens Axboe @ 2009-10-08 13:30 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08 2009, Nick Piggin wrote: > On Thu, Oct 08, 2009 at 02:57:46PM +0200, Jens Axboe wrote: > > On Thu, Oct 08 2009, Nick Piggin wrote: > > > On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote: > > > > On Wed, 7 Oct 2009, Nick Piggin wrote: > > > > > > > > > > OK, I have a really basic patch that does store-free path walking > > > > > (except on the final element). > > > > > > > > Yay! > > > > > > > > > dbench is pretty nasty still because it seems to do a lot of stupid > > > > > things like reading from /proc/mounts all the time. > > > > > > > > You should largely forget about dbench, it can certainly be a useful > > > > benchmark, but at the same time it's certainly not a _meaningful_ one. > > > > There are better things to try. > > > > > > OK, here's one you might find interesting. It is a cached git diff > > > workload in a linux kernel tree. I actually ran it in a loop 100 > > > times in order to get some reasonable sample sizes, then I ran > > > parallel and serial configs (PreloadIndex = true/false). Compared > > > plain kernel with all vfs patches to now. > > > > > > 2.6.32-rc3 serial > > > 5.35user 7.12system 0:12.47elapsed 100%CPU > > > > > > 2.6.32-rc3 parallel > > > 5.79user 17.69system 0:09.41elapsed 249%CPU > > > > > > vfs serial > > > 5.30user 5.62system 0:10.92elapsed 100%CPU > > > > > > vfs parallel > > > 4.86user 0.68system 0:06.82elapsed 81%CPU > > > > Since the box was booted anyway, I tried the git test too. Results are > > with 2.6.32-rc3 serial being the baseline 1.00 scores, smaller than 1.00 > > are faster and vice versa. > > > > 2.6.32-rc3 serial > > real 1.00 > > user 1.00 > > sys 1.00 > > > > 2.6.32-rc3 parallel > > real 0.80 > > user 0.83 > > sys 8.38 > > > > sys time, auch... > > > > vfs serial > > real 0.86 > > user 0.93 > > sys 0.84 > > This is actually nice too. My tests were on a 2s8c Barcelona system, > but this is showing we have a nice serial win on Nehalem as well. > Actually K8 CPUs have a bit faster lock primitives than earlier > Intel CPUs I think (closer to Nehalem), so we might see an even > bigger win with a Core2. Yes, this is just as interesting as the parallel results imho. I don't have a core 2 test box, so I cannot test that. > > vfs parallel > > real 0.43 > > user 0.72 > > sys 0.13 > > > > Let me know if you want profiles or anything like that. I'd say that > > looks veeeery tasty. > > It doesn't look all that different to mine, so profiles probably > not required at this point. Is the CPU accounting going wrong? It > looks like thread times are not being accumulated back properly, > which IIRC they should be. But 'real' time should be accurate, so > it is going a lot faster. Yes, it looks very similar, the higher CPU count just makes the parallel git preload on -rc3 stock look even more crappy (you had roughly 2x the sys time, I have roughly 8x) when compared to the serialized approach. IIRC, there was a bug with thread accounting very recently. Why would it not hit -rc3 alone, though? Does look fishy, though. -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 13:30 ` Jens Axboe @ 2009-10-08 18:00 ` Peter Zijlstra 2009-10-09 4:04 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Peter Zijlstra @ 2009-10-08 18:00 UTC (permalink / raw) To: Jens Axboe Cc: Nick Piggin, Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai On Thu, 2009-10-08 at 15:30 +0200, Jens Axboe wrote: > > IIRC, there was a bug with thread accounting very recently. Why would it > not hit -rc3 alone, though? Does look fishy, though. That was caused by: def0a9b2573e00ab0b486cb5382625203ab4c4a6 sched_clock: Make it NMI safe And was fixed by: 152f9d0710a62708710161bce1b29fa8292c8c11 sched_clock: Fix atomicity/continuity bug by using cmpxchg64() Which is in -rc3 ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 18:00 ` Peter Zijlstra @ 2009-10-09 4:04 ` Nick Piggin 0 siblings, 0 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-09 4:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Jens Axboe, Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai On Thu, Oct 08, 2009 at 08:00:45PM +0200, Peter Zijlstra wrote: > On Thu, 2009-10-08 at 15:30 +0200, Jens Axboe wrote: > > > > IIRC, there was a bug with thread accounting very recently. Why would it > > not hit -rc3 alone, though? Does look fishy, though. > > That was caused by: > def0a9b2573e00ab0b486cb5382625203ab4c4a6 sched_clock: Make it NMI safe > > And was fixed by: > 152f9d0710a62708710161bce1b29fa8292c8c11 sched_clock: Fix > atomicity/continuity bug by using cmpxchg64() > > Which is in -rc3 Well, whatever is happening here still happens with post-rc3 kernels. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 13:22 ` Nick Piggin 2009-10-08 13:30 ` Jens Axboe @ 2009-10-09 8:54 ` Jens Axboe 2009-10-09 9:51 ` Jens Axboe 2009-10-09 10:07 ` Nick Piggin 1 sibling, 2 replies; 57+ messages in thread From: Jens Axboe @ 2009-10-09 8:54 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08 2009, Nick Piggin wrote: > This is actually nice too. My tests were on a 2s8c Barcelona system, > but this is showing we have a nice serial win on Nehalem as well. > Actually K8 CPUs have a bit faster lock primitives than earlier > Intel CPUs I think (closer to Nehalem), so we might see an even > bigger win with a Core2. Ran it on another box. This isn't quite a core 2 though, it's a sparc64 (Niagara 2). It has 64 threads, too, but just 8 cores. 2.6.32-rc3 serial real 0m5.390s user 0m1.340s sys 0m2.970s 2.6.32-rc3 parallel real 0m2.009s user 0m0.900s sys 0m2.490s vfs serial real 0m4.816s user 0m1.250s sys 0m2.270s vfs parallel real 0m1.967s user 0m0.920s sys 0m1.960s So it's a win-win on that platform too. -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 8:54 ` Jens Axboe @ 2009-10-09 9:51 ` Jens Axboe 2009-10-09 10:02 ` Nick Piggin 2009-10-09 10:07 ` Nick Piggin 1 sibling, 1 reply; 57+ messages in thread From: Jens Axboe @ 2009-10-09 9:51 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, chris.mason Nick, One more thing... I see you converted part of btrfs, but there's still this one sitting in btrfs_invalidate_inodes() if (atomic_read(&inode->i_count) > 1) d_prune_aliases(inode); Not sure how best to solve that, with a __d_prune_aliases() that assumed the lock was held it would be easy. But perhaps you have better ideas, this email is more of a heads-up since perhaps you missed this spot (CC'ing Chris). -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 9:51 ` Jens Axboe @ 2009-10-09 10:02 ` Nick Piggin 2009-10-09 10:08 ` Jens Axboe 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-09 10:02 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, chris.mason On Fri, Oct 09, 2009 at 11:51:19AM +0200, Jens Axboe wrote: > Nick, > > One more thing... I see you converted part of btrfs, but there's still > this one sitting in btrfs_invalidate_inodes() > > if (atomic_read(&inode->i_count) > 1) > d_prune_aliases(inode); > > Not sure how best to solve that, with a __d_prune_aliases() that assumed > the lock was held it would be easy. But perhaps you have better ideas, > this email is more of a heads-up since perhaps you missed this spot > (CC'ing Chris). It's OK, you can load inode->i_count integer atomically -- in this sequence d_prune_aliases can't have assumed anything about i_count anyway because regardless of its type it might have changed in between. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 10:02 ` Nick Piggin @ 2009-10-09 10:08 ` Jens Axboe 0 siblings, 0 replies; 57+ messages in thread From: Jens Axboe @ 2009-10-09 10:08 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, chris.mason On Fri, Oct 09 2009, Nick Piggin wrote: > On Fri, Oct 09, 2009 at 11:51:19AM +0200, Jens Axboe wrote: > > Nick, > > > > One more thing... I see you converted part of btrfs, but there's still > > this one sitting in btrfs_invalidate_inodes() > > > > if (atomic_read(&inode->i_count) > 1) > > d_prune_aliases(inode); > > > > Not sure how best to solve that, with a __d_prune_aliases() that assumed > > the lock was held it would be easy. But perhaps you have better ideas, > > this email is more of a heads-up since perhaps you missed this spot > > (CC'ing Chris). > > It's OK, you can load inode->i_count integer atomically -- in this > sequence d_prune_aliases can't have assumed anything about i_count > anyway because regardless of its type it might have changed in > between. Right, it was already racy wrt i_count. So the below should be OK. diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d3dadb6..7e02b5d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3419,8 +3419,10 @@ again: inode = igrab(&entry->vfs_inode); if (inode) { spin_unlock(&root->inode_lock); - if (atomic_read(&inode->i_count) > 1) + + if (inode->i_count > 1) d_prune_aliases(inode); + /* * btrfs_drop_inode will remove it from * the inode cache when its usage count -- Jens Axboe ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 8:54 ` Jens Axboe 2009-10-09 9:51 ` Jens Axboe @ 2009-10-09 10:07 ` Nick Piggin 1 sibling, 0 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-09 10:07 UTC (permalink / raw) To: Jens Axboe, David Miller Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Fri, Oct 09, 2009 at 10:54:52AM +0200, Jens Axboe wrote: > On Thu, Oct 08 2009, Nick Piggin wrote: > > This is actually nice too. My tests were on a 2s8c Barcelona system, > > but this is showing we have a nice serial win on Nehalem as well. > > Actually K8 CPUs have a bit faster lock primitives than earlier > > Intel CPUs I think (closer to Nehalem), so we might see an even > > bigger win with a Core2. > > Ran it on another box. This isn't quite a core 2 though, it's a sparc64 > (Niagara 2). It has 64 threads, too, but just 8 cores. > > 2.6.32-rc3 serial > real 0m5.390s > user 0m1.340s > sys 0m2.970s > > 2.6.32-rc3 parallel > real 0m2.009s > user 0m0.900s > sys 0m2.490s > > vfs serial > real 0m4.816s > user 0m1.250s > sys 0m2.270s > > vfs parallel > real 0m1.967s > user 0m0.920s > sys 0m1.960s > > So it's a win-win on that platform too. That's nice to see; it's really quite a good win in the serial case on this CPU too. The parallel case, interestingly, is not improved. Whether it is because the locks are able to be bounced around much more quickly than on our multi-socket systems, or some quirk of the lots-of-chickens CPU that doesn't take so well to the workload, I don't know. Maybe it's even the fs->lock that will still be there in your patches. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 12:36 ` Nick Piggin 2009-10-08 12:57 ` Jens Axboe @ 2009-10-09 3:50 ` Nick Piggin 2009-10-09 6:15 ` David Miller 1 sibling, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-09 3:50 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08, 2009 at 02:36:22PM +0200, Nick Piggin wrote: > vfs > > # Samples: 273522 > # > # Overhead Command Shared Object > # ........ .............. ................................ > # > 48.24% git [kernel] > | > |--32.37%-- __d_lookup_rcu > |--14.14%-- link_path_walk_rcu > |--7.57%-- _read_unlock > | | > | |--96.46%-- path_init_rcu > | | do_path_lookup > | | user_path_at > | | vfs_fstatat > | | vfs_lstat > | | sys_newlstat > | | system_call_fastpath > | | > | --3.54%-- do_path_lookup > | user_path_at > | vfs_fstatat > | vfs_lstat > | sys_newlstat > | system_call_fastpath > This one is interesting. spin_lock/spin_unlock remains very low; however, > read_unlock pops up. This would be... fs->lock. You're using threads > then (rather than processes)? OK, I got rid of this guy from the RCU walk. Basically now hold vfsmount_lock over the entire RCU path walk (which also pins the mnt) and use a seqlock in the fs struct to get a consistent mnt,dentry pair. This also simplifies the walk because we don't need the complexity to avoid mntget/mntput (just do one final mntget on the resulting mnt before dropping vfsmount_lock). vfsmount_lock adds one per-cpu atomic for the spinlock, and we remove two thread-shared atomics for fs->lock so a net win for both single threaded performance and thread-shared scalability. Latency is no problem because we hold rcu_read_lock for the same length of time anyway. The parallel git diff workload is improved by several percent. Phew. I think I'll stop about here and try to start getting some of this crap cleaned up and start trying to get the rest of the filesystems done. ^ permalink raw reply [flat|nested] 57+ messages in thread
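The fs_struct side of this can be sketched as follows; the field and function names here are assumptions, not the actual patch. Writers bump a seqcount when root/pwd change, and the RCU walker snapshots the (mnt, dentry) pair without touching fs->lock:

#include <linux/seqlock.h>
#include <linux/path.h>

struct fs_struct_sketch {
	seqcount_t seq;		/* bumped by the set_fs_root()/set_fs_pwd() writers */
	struct path root, pwd;
};

/* Caller holds rcu_read_lock()/vfsmount_lock, so no refcounts are taken. */
static void fs_pwd_snapshot(struct fs_struct_sketch *fs, struct path *path)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&fs->seq);
		*path = fs->pwd;	/* mnt and dentry read as one consistent pair */
	} while (read_seqcount_retry(&fs->seq, seq));
}

That replaces two atomics on the thread-shared fs->lock with an uncontended counter read, which matches the single-threaded win reported here.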
* Re: [rfc][patch] store-free path walking 2009-10-09 3:50 ` Nick Piggin @ 2009-10-09 6:15 ` David Miller 2009-10-09 10:40 ` Nick Piggin 2009-10-09 10:44 ` Nick Piggin 0 siblings, 2 replies; 57+ messages in thread From: David Miller @ 2009-10-09 6:15 UTC (permalink / raw) To: npiggin; +Cc: torvalds, jens.axboe, linux-kernel, linux-fsdevel, kiran, peterz From: Nick Piggin <npiggin@suse.de> Date: Fri, 9 Oct 2009 05:50:50 +0200 > OK, I got rid of this guy from the RCU walk. Basically now hold > vfsmount_lock over the entire RCU path walk (which also pins the mnt) > and use a seqlock in the fs struct to get a consistent mnt,dentry > pair. This also simplifies the walk because we don't need the > complexity to avoid mntget/mntput (just do one final mntget on the > resulting mnt before dropping vfsmount_lock). > > vfsmount_lock adds one per-cpu atomic for the spinlock, and we > remove two thread-shared atomics for fs->lock so a net win for > both single threaded performance and thread-shared scalability. > Latency is no problem because we hold rcu_read_lock for the same > length of time anyway. > > The parallel git diff workload is improved by several percent. Sounds sweet Nick, can't wait to play with your next set of patches here. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 6:15 ` David Miller @ 2009-10-09 10:40 ` Nick Piggin 2009-10-09 11:09 ` Jens Axboe 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-09 10:40 UTC (permalink / raw) To: David Miller Cc: torvalds, jens.axboe, linux-kernel, linux-fsdevel, kiran, peterz On Thu, Oct 08, 2009 at 11:15:27PM -0700, David Miller wrote: > From: Nick Piggin <npiggin@suse.de> > Date: Fri, 9 Oct 2009 05:50:50 +0200 > > > OK, I got rid of this guy from the RCU walk. Basically now hold > > vfsmount_lock over the entire RCU path walk (which also pins the mnt) > > and use a seqlock in the fs struct to get a consistent mnt,dentry > > pair. This also simplifies the walk because we don't need the > > complexity to avoid mntget/mntput (just do one final mntget on the > > resulting mnt before dropping vfsmount_lock). > > > > vfsmount_lock adds one per-cpu atomic for the spinlock, and we > > remove two thread-shared atomics for fs->lock so a net win for > > both single threaded performance and thread-shared scalability. > > Latency is no problem because we hold rcu_read_lock for the same > > length of time anyway. > > > > The parallel git diff workload is improved by several percent. > > Sounds sweet Nick, can't wait to play with your next set of > patches here. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/ OK well there will be another rev of the patches there with everything, also with the d_mounted patch so dentry remains a nice size. Also tried rearranging some structures in the dentry to help lookups but haven't quite finished there. Jens, I would be interested to see whether the parallel case on the N2 is improved at all. Profiles from that case might be interesting. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 10:40 ` Nick Piggin @ 2009-10-09 11:09 ` Jens Axboe 0 siblings, 0 replies; 57+ messages in thread From: Jens Axboe @ 2009-10-09 11:09 UTC (permalink / raw) To: Nick Piggin Cc: David Miller, torvalds, linux-kernel, linux-fsdevel, kiran, peterz On Fri, Oct 09 2009, Nick Piggin wrote: > On Thu, Oct 08, 2009 at 11:15:27PM -0700, David Miller wrote: > > From: Nick Piggin <npiggin@suse.de> > > Date: Fri, 9 Oct 2009 05:50:50 +0200 > > > > > OK, I got rid of this guy from the RCU walk. Basically now hold > > > vfsmount_lock over the entire RCU path walk (which also pins the mnt) > > > and use a seqlock in the fs struct to get a consistent mnt,dentry > > > pair. This also simplifies the walk because we don't need the > > > complexity to avoid mntget/mntput (just do one final mntget on the > > > resulting mnt before dropping vfsmount_lock). > > > > > > vfsmount_lock adds one per-cpu atomic for the spinlock, and we > > > remove two thread-shared atomics for fs->lock so a net win for > > > both single threaded performance and thread-shared scalability. > > > Latency is no problem because we hold rcu_read_lock for the same > > > length of time anyway. > > > > > > The parallel git diff workload is improved by several percent. > > > > Sounds sweet Nick, can't wait to play with your next set of > > patches here. > > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/ > > OK well there will be another rev of the patches there with everything, > also with the d_mounted patch so dentry remains a nice size. Also tried > rearranging some structures in the dentry to help lookups but haven't > quite finished there. > > Jens, I would be interested to see whether the parallel case on the N2 is > improved at all. Profiles from that case might be interesting. Results with 09102009.patch below. And thanks for doing a rolled-up patch again, as I don't use quilt :-) 2.6.32-rc3 serial real 0m6.587s user 0m1.970s sys 0m2.890s 2.6.32-rc3 parallel real 0m3.180s user 0m1.460s sys 0m2.700s vfs serial real 0m6.543s user 0m1.970s sys 0m2.590s vfs parallel real 0m3.121s user 0m1.200s sys 0m2.920s Doesn't look so good. Don't pay too much attention to user/sys time btw, they don't seem very stable. The real elapsed time is stable, though. -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 6:15 ` David Miller 2009-10-09 10:40 ` Nick Piggin @ 2009-10-09 10:44 ` Nick Piggin 2009-10-09 10:48 ` Jens Axboe 1 sibling, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-09 10:44 UTC (permalink / raw) To: David Miller Cc: torvalds, jens.axboe, linux-kernel, linux-fsdevel, kiran, peterz On Thu, Oct 08, 2009 at 11:15:27PM -0700, David Miller wrote: > From: Nick Piggin <npiggin@suse.de> > Date: Fri, 9 Oct 2009 05:50:50 +0200 > > > OK, I got rid of this guy from the RCU walk. Basically now hold > > vfsmount_lock over the entire RCU path walk (which also pins the mnt) > > and use a seqlock in the fs struct to get a consistent mnt,dentry > > pair. This also simplifies the walk because we don't need the > > complexity to avoid mntget/mntput (just do one final mntget on the > > resulting mnt before dropping vfsmount_lock). > > > > vfsmount_lock adds one per-cpu atomic for the spinlock, and we > > remove two thread-shared atomics for fs->lock so a net win for > > both single threaded performance and thread-shared scalability. > > Latency is no problem because we hold rcu_read_lock for the same > > length of time anyway. > > > > The parallel git diff workload is improved by several percent. > > Sounds sweet Nick, can't wait to play with your next set of > patches here. Oh, and the RCU patch is there just in case you want to try some file open/close benchmarks. Turns out I broke Paul's new scalable RCU ;) ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 10:44 ` Nick Piggin @ 2009-10-09 10:48 ` Jens Axboe 0 siblings, 0 replies; 57+ messages in thread From: Jens Axboe @ 2009-10-09 10:48 UTC (permalink / raw) To: Nick Piggin Cc: David Miller, torvalds, linux-kernel, linux-fsdevel, kiran, peterz On Fri, Oct 09 2009, Nick Piggin wrote: > On Thu, Oct 08, 2009 at 11:15:27PM -0700, David Miller wrote: > > From: Nick Piggin <npiggin@suse.de> > > Date: Fri, 9 Oct 2009 05:50:50 +0200 > > > > > OK, I got rid of this guy from the RCU walk. Basically now hold > > > vfsmount_lock over the entire RCU path walk (which also pins the mnt) > > > and use a seqlock in the fs struct to get a consistent mnt,dentry > > > pair. This also simplifies the walk because we don't need the > > > complexity to avoid mntget/mntput (just do one final mntget on the > > > resulting mnt before dropping vfsmount_lock). > > > > > > vfsmount_lock adds one per-cpu atomic for the spinlock, and we > > > remove two thread-shared atomics for fs->lock, so it's a net win for > > > both single-threaded performance and thread-shared scalability. > > > Latency is no problem because we hold rcu_read_lock for the same > > > length of time anyway. > > > > > > The parallel git diff workload is improved by several percent. > > > > Sounds sweet Nick, can't wait to play with your next set of > > patches here. > > Oh, and the RCU patch is up there too, in case you want to try some file > open/close benchmarks. Turns out I broke Paul's new scalable RCU ;) In that case, you may want to fix the permissions on that file :-) I'll try the new rolled-up patch. -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 8:58 ` [rfc][patch] store-free path walking Nick Piggin 2009-10-07 9:56 ` Jens Axboe 2009-10-07 14:56 ` Linus Torvalds @ 2009-10-09 23:16 ` Paul E. McKenney 2 siblings, 0 replies; 57+ messages in thread From: Paul E. McKenney @ 2009-10-09 23:16 UTC (permalink / raw) To: Nick Piggin Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds On Wed, Oct 07, 2009 at 10:58:49AM +0200, Nick Piggin wrote: > On Tue, Oct 06, 2009 at 02:49:41PM +0200, Jens Axboe wrote: > > On Tue, Oct 06 2009, Nick Piggin wrote: > > > It's a copout, but you could try running multiple dbenches under > > > different working directories (or actually, IIRC dbench does root > > > based path lookups so maybe that won't help you much). > > > > Yeah, it's hitting dentry->d_lock pretty hard so basically > > spin-serialized at that point. > > > > > > Anyway, below are the results. Seem very stable. > > > > > > > > throughput > > > > ------------------------------------------------ > > > > 2.6.32-rc3-git | 561.218 MB/sec > > > > 2.6.32-rc3-git+patch | 627.022 MB/sec > > > > > > Well it's good to see you got some improvement. > > > > Yes, it's an improvement though the results are still pretty abysmal :-) > > OK, I have a really basic patch that does store-free path walking > (except on the final element). dbench is pretty nasty still because > it seems to do a lot of stupid things like reading from /proc/mounts > all the time. > > Actually it isn't quite store-free because it still takes mnt ref > (which is hard to avoid, but at least it is a per-cpu data). So > this patch applies on top of my recent patchset. At least it does > not store to the dentries. > > Basically it is holding rcu_read_lock for the entire walk, it uses > inode RCU from my earlier patches to check inode permissions, and > it bails out at the slightest sign of trouble :) (actually I think > I have got it to walk mountpoints which is nice). > > The seqlock in the dentry is for getting consistent name,len pointer, > and also not giving a false positive if a rename has partially > overwritten the name string (false negatives are always fine in the > lock free lookup path but false positives are not). Possibly we > could make do with a per-sb seqlock for this (or just rename_lock). I have no idea whether it is worth it, but the usual trick to avoid name/len inconsistencies is to have a pointer to a structure containing both the name and the length (already provided by qstr in the form of d_name) and d_hash (which is not already part of d_name). This does mean that when renaming from one short name to another, one must still use external storage for one of the names. For example, when renaming from "foo.c" to "bar.c", if "foo.c" is stored in the dentry itself, you must allocate memory for the "bar.c", even though "bar.c" would otherwise fit into the dentry. After a grace period, you could put the "bar.c" into the dentry, and after another grace period, you could deallocate the memory. These grace periods would presumably be handled by call_rcu() rather than synchronize_rcu(). This approach would make it safe to access dentries that were in the process of being renamed. Thanx, Paul > I have duplicated most of the lookup code, altering it to the lock > free version, which returns -EAGAIN if it gets in trouble and the > regular path walk starts up. I don't know if this is the best way > to go... it's fairly easy but a bit ugly. 
But complexifying the > existing path walk any more would not be nice either. It might be > nice to try locking the dentry that we're having trouble with and > continuing from there, rather than redoing the whole walk with > locks... > > Anyway, this is the basics working for now, microbenchmark shows > same-cwd lookups scale linearly now too. We can probably slowly > tackle more cases if they come up as being important, simply by > auditing filesystems etc. > > -- > > Index: linux-2.6/fs/dcache.c > =================================================================== > --- linux-2.6.orig/fs/dcache.c > +++ linux-2.6/fs/dcache.c > @@ -1187,6 +1187,7 @@ struct dentry *d_alloc(struct dentry * p > dentry->d_count = 1; > dentry->d_flags = DCACHE_UNHASHED; > spin_lock_init(&dentry->d_lock); > + seqcount_init(&dentry->d_seq); > dentry->d_inode = NULL; > dentry->d_parent = NULL; > dentry->d_sb = NULL; > @@ -1579,21 +1580,6 @@ err_out: > * d_lookup() is protected against the concurrent renames in some unrelated > * directory using the seqlockt_t rename_lock. > */ > - > -struct dentry * d_lookup(struct dentry * parent, struct qstr * name) > -{ > - struct dentry * dentry = NULL; > - unsigned seq; > - > - do { > - seq = read_seqbegin(&rename_lock); > - dentry = __d_lookup(parent, name); > - if (dentry) > - break; > - } while (read_seqretry(&rename_lock, seq)); > - return dentry; > -} > - > struct dentry * __d_lookup(struct dentry * parent, struct qstr * name) > { > unsigned int len = name->len; > @@ -1656,6 +1642,78 @@ next: > return found; > } > > +struct dentry * d_lookup(struct dentry * parent, struct qstr * name) > +{ > + struct dentry *dentry = NULL; > + unsigned seq; > + > + do { > + seq = read_seqbegin(&rename_lock); > + dentry = __d_lookup(parent, name); > + if (dentry) > + break; > + } while (read_seqretry(&rename_lock, seq)); > + return dentry; > +} > + > +struct dentry * __d_lookup_rcu(struct dentry * parent, struct qstr * name) > +{ > + unsigned int len = name->len; > + unsigned int hash = name->hash; > + const unsigned char *str = name->name; > + struct dcache_hash_bucket *b = d_hash(parent, hash); > + struct hlist_head *head = &b->head; > + struct hlist_node *node; > + struct dentry *dentry; > + > + hlist_for_each_entry_rcu(dentry, node, head, d_hash) { > + unsigned seq; > + struct dentry *tparent; > + const char *tname; > + int tlen; > + > + if (dentry->d_name.hash != hash) > + continue; > + > +seqretry: > + seq = read_seqcount_begin(&dentry->d_seq); > + tparent = dentry->d_parent; > + if (tparent != parent) > + continue; > + tlen = dentry->d_name.len; > + if (tlen != len) > + continue; > + tname = dentry->d_name.name; > + if (read_seqcount_retry(&dentry->d_seq, seq)) > + goto seqretry; > + if (memcmp(tname, str, tlen)) > + continue; > + if (read_seqcount_retry(&dentry->d_seq, seq)) > + goto seqretry; > + > + return dentry; > + } > + return NULL; > +} > + > +struct dentry *d_lookup_rcu(struct dentry *parent, struct qstr * name) > +{ > + struct dentry *dentry = NULL; > + unsigned seq; > + > + if (parent->d_op && parent->d_op->d_compare) > + goto out; > + > + do { > + seq = read_seqbegin(&rename_lock); > + dentry = __d_lookup_rcu(parent, name); > + if (dentry) > + break; > + } while (read_seqretry(&rename_lock, seq)); > +out: > + return dentry; > +} > + > /** > * d_hash_and_lookup - hash the qstr then search for a dentry > * @dir: Directory to search in > @@ -1925,6 +1983,8 @@ static void d_move_locked(struct dentry > list_del(&target->d_u.d_child); > > /* Switch the names.. 
*/ > + write_seqcount_begin(&dentry->d_seq); > + write_seqcount_begin(&target->d_seq); > switch_names(dentry, target); > swap(dentry->d_name.hash, target->d_name.hash); > > @@ -1939,6 +1999,8 @@ static void d_move_locked(struct dentry > /* And add them back to the (new) parent lists */ > list_add(&target->d_u.d_child, &target->d_parent->d_subdirs); > } > + write_seqcount_end(&target->d_seq); > + write_seqcount_end(&dentry->d_seq); > > list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs); > if (target->d_parent != dentry->d_parent) > Index: linux-2.6/fs/namei.c > =================================================================== > --- linux-2.6.orig/fs/namei.c > +++ linux-2.6/fs/namei.c > @@ -200,6 +200,29 @@ static int acl_permission_check(struct i > return -EACCES; > } > > +static int acl_permission_check_rcu(struct inode *inode, int mask, > + int (*check_acl)(struct inode *inode, int mask)) > +{ > + umode_t mode = inode->i_mode; > + > + mask &= MAY_READ | MAY_WRITE | MAY_EXEC; > + > + if (current_fsuid() == inode->i_uid) > + mode >>= 6; > + else { > + if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) > + return -EAGAIN; > + if (in_group_p(inode->i_gid)) > + mode >>= 3; > + } > + > + /* > + * If the DACs are ok we don't need any capability check. > + */ > + if ((mask & ~mode) == 0) > + return 0; > + return -EACCES; > +} > /** > * generic_permission - check for access rights on a Posix-like filesystem > * @inode: inode to check access rights for > @@ -465,6 +488,25 @@ ok: > return security_inode_permission(inode, MAY_EXEC); > } > > +static int exec_permission_lite_rcu(struct inode *inode) > +{ > + int ret; > + > + if (inode->i_op->permission) > + return -EAGAIN; > + ret = acl_permission_check_rcu(inode, MAY_EXEC, inode->i_op->check_acl); > + if (ret == -EAGAIN) > + return ret; > + if (!ret) > + goto ok; > + > + if (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH)) > + goto ok; > + > + return ret; > +ok: > + return security_inode_permission(inode, MAY_EXEC); > +} > /* > * This is called when everything else fails, and we actually have > * to go to the low-level filesystem to find out what we should do.. 
> @@ -569,6 +611,17 @@ static __always_inline void set_root(str > } > } > > +static __always_inline void set_root_rcu(struct nameidata *nd) > +{ > + if (!nd->root.mnt) { > + struct fs_struct *fs = current->fs; > + read_lock(&fs->lock); > + nd->root = fs->root; > + mntget(nd->root.mnt); > + read_unlock(&fs->lock); > + } > +} > + > static __always_inline int __vfs_follow_link(struct nameidata *nd, const char *link) > { > int res = 0; > @@ -611,6 +664,14 @@ static void path_put_conditional(struct > mntput(path->mnt); > } > > +static inline void path_to_nameidata_rcu(struct path *path, struct nameidata *nd) > +{ > + if (nd->path.mnt != path->mnt) > + mntput(nd->path.mnt); > + nd->path.mnt = path->mnt; > + nd->path.dentry = path->dentry; > +} > + > static inline void path_to_nameidata(struct path *path, struct nameidata *nd) > { > dput(nd->path.dentry); > @@ -705,6 +766,22 @@ int follow_up(struct path *path) > /* no need for dcache_lock, as serialization is taken care in > * namespace.c > */ > +static int __follow_mount_rcu(struct path *path) > +{ > + int res = 0; > + while (d_mountpoint(path->dentry)) { > + struct vfsmount *mounted = lookup_mnt(path); > + if (!mounted) > + break; > + if (res) > + mntput(path->mnt); > + path->mnt = mounted; > + path->dentry = mounted->mnt_root; > + res = 1; > + } > + return res; > +} > + > static int __follow_mount(struct path *path) > { > int res = 0; > @@ -791,6 +868,24 @@ static __always_inline void follow_dotdo > * small and for now I'd prefer to have fast path as straight as possible. > * It _is_ time-critical. > */ > +static int do_lookup_rcu(struct nameidata *nd, struct qstr *name, > + struct path *path) > +{ > + struct vfsmount *mnt = nd->path.mnt; > + struct dentry *dentry; > + > + dentry = __d_lookup_rcu(nd->path.dentry, name); > + > + if (!dentry) > + return -EAGAIN; > + if (dentry->d_op && dentry->d_op->d_revalidate) > + return -EAGAIN; > + path->mnt = mnt; > + path->dentry = dentry; > + __follow_mount_rcu(path); > + return 0; > +} > + > static int do_lookup(struct nameidata *nd, struct qstr *name, > struct path *path) > { > @@ -825,6 +920,134 @@ fail: > return PTR_ERR(dentry); > } > > +static noinline int link_path_walk_rcu(const char *name, struct nameidata *nd, struct path *next) > +{ > + struct inode *inode; > + unsigned int lookup_flags = nd->flags; > + > + while (*name=='/') > + name++; > + if (!*name) > + goto return_reval; > + > + inode = nd->path.dentry->d_inode; > + if (nd->depth) > + lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE); > + > + /* At this point we know we have a real path component. */ > + for(;;) { > + unsigned long hash; > + struct qstr this; > + unsigned int c; > + > + nd->flags |= LOOKUP_CONTINUE; > + if (exec_permission_lite_rcu(inode)) > + return -EAGAIN; > + > + this.name = name; > + c = *(const unsigned char *)name; > + > + hash = init_name_hash(); > + do { > + name++; > + hash = partial_name_hash(c, hash); > + c = *(const unsigned char *)name; > + } while (c && (c != '/')); > + this.len = name - (const char *) this.name; > + this.hash = end_name_hash(hash); > + > + /* remove trailing slashes? */ > + if (!c) > + goto last_component; > + while (*++name == '/'); > + if (!*name) > + goto last_with_slashes; > + > + if (this.name[0] == '.') switch (this.len) { > + default: > + break; > + case 2: > + if (this.name[1] != '.') > + break; > + return -EAGAIN; > + case 1: > + continue; > + } > + if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) > + return -EAGAIN; > + /* This does the actual lookups.. 
*/ > + if (do_lookup_rcu(nd, &this, next)) > + return -EAGAIN; > + > + inode = next->dentry->d_inode; > + if (!inode) > + return -ENOENT; > + if (inode->i_op->follow_link) > + return -EAGAIN; > + path_to_nameidata_rcu(next, nd); > + if (!inode->i_op->lookup) > + return -ENOTDIR; > + continue; > + /* here ends the main loop */ > + > +last_with_slashes: > + lookup_flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > +last_component: > + /* Clear LOOKUP_CONTINUE iff it was previously unset */ > + nd->flags &= lookup_flags | ~LOOKUP_CONTINUE; > + if (lookup_flags & LOOKUP_PARENT) > + return -EAGAIN; > + if (this.name[0] == '.') switch (this.len) { > + default: > + break; > + case 2: > + if (this.name[1] != '.') > + break; > + return -EAGAIN; > + case 1: > + goto return_reval; > + } > + if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) > + return -EAGAIN; > + if (do_lookup_rcu(nd, &this, next)) > + return -EAGAIN; > + inode = next->dentry->d_inode; > + if ((lookup_flags & LOOKUP_FOLLOW) > + && inode && inode->i_op->follow_link) > + return -EAGAIN; > + > + path_to_nameidata_rcu(next, nd); > + if (!inode) > + return -ENOENT; > + if (lookup_flags & LOOKUP_DIRECTORY) { > + if (!inode->i_op->lookup) > + return -ENOTDIR; > + } > + goto return_base; > + } > +return_reval: > + /* > + * We bypassed the ordinary revalidation routines. > + * We may need to check the cached dentry for staleness. > + */ > + if (nd->path.dentry && nd->path.dentry->d_sb && > + (nd->path.dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)) > + return -EAGAIN; > +return_base: > + spin_lock(&nd->path.dentry->d_lock); > + if (d_unhashed(nd->path.dentry)) { > + spin_unlock(&nd->path.dentry->d_lock); > + return -EAGAIN; > + } > + if (!nd->path.dentry->d_inode) { > + spin_unlock(&nd->path.dentry->d_lock); > + return -EAGAIN; > + } > + nd->path.dentry->d_count++; > + spin_unlock(&nd->path.dentry->d_lock); > + return 0; > +} > + > /* > * Name resolution. > * This is the basic name resolution function, turning a pathname into > @@ -887,7 +1110,7 @@ static int __link_path_walk(const char * > if (this.name[0] == '.') switch (this.len) { > default: > break; > - case 2: > + case 2: > if (this.name[1] != '.') > break; > follow_dotdot(nd); > @@ -942,7 +1165,7 @@ last_component: > if (this.name[0] == '.') switch (this.len) { > default: > break; > - case 2: > + case 2: > if (this.name[1] != '.') > break; > follow_dotdot(nd); > @@ -1013,12 +1236,82 @@ return_err: > return err; > } > > +static int path_walk_rcu(const char *name, struct nameidata *nd) > +{ > + struct path save = nd->path; > + struct path path = {.mnt = NULL}; > + int err; > + > + current->total_link_count = 0; > + err = link_path_walk_rcu(name, nd, &path); > + if (err) { > + if (path.mnt != nd->path.mnt) > + mntput(path.mnt); > + mntput(nd->path.mnt); > + if (err == -EAGAIN) > + nd->path = save; > + } > + return err; > +} > + > static int path_walk(const char *name, struct nameidata *nd) > { > current->total_link_count = 0; > return link_path_walk(name, nd); > } > > +static noinline int path_init_rcu(int dfd, const char *name, unsigned int flags, struct nameidata *nd) > +{ > + int retval = 0; > + int fput_needed; > + struct file *file; > + > + nd->last_type = LAST_ROOT; /* if there are only slashes... 
*/ > + nd->flags = flags; > + nd->depth = 0; > + nd->root.mnt = NULL; > + > + if (*name=='/') { > + set_root_rcu(nd); > + nd->path = nd->root; > + mntget(nd->root.mnt); > + } else if (dfd == AT_FDCWD) { > + struct fs_struct *fs = current->fs; > + read_lock(&fs->lock); > + nd->path = fs->pwd; > + mntget(fs->pwd.mnt); > + read_unlock(&fs->lock); > + } else { > + struct dentry *dentry; > + > + file = fget_light(dfd, &fput_needed); > + retval = -EBADF; > + if (!file) > + goto out_fail; > + > + dentry = file->f_path.dentry; > + > + retval = -ENOTDIR; > + if (!S_ISDIR(dentry->d_inode->i_mode)) > + goto fput_fail; > + > + retval = file_permission(file, MAY_EXEC); > + if (retval) > + goto fput_fail; > + > + nd->path = file->f_path; > + mntget(file->f_path.mnt); > + > + fput_light(file, fput_needed); > + } > + return 0; > + > +fput_fail: > + fput_light(file, fput_needed); > +out_fail: > + return retval; > +} > + > static int path_init(int dfd, const char *name, unsigned int flags, struct nameidata *nd) > { > int retval = 0; > @@ -1075,16 +1368,46 @@ out_fail: > static int do_path_lookup(int dfd, const char *name, > unsigned int flags, struct nameidata *nd) > { > - int retval = path_init(dfd, name, flags, nd); > - if (!retval) > - retval = path_walk(name, nd); > - if (unlikely(!retval && !audit_dummy_context() && nd->path.dentry && > - nd->path.dentry->d_inode)) > - audit_inode(name, nd->path.dentry); > + int retval; > + > + rcu_read_lock(); > + retval = path_init_rcu(dfd, name, flags, nd); > + if (unlikely(retval)) { > + rcu_read_unlock(); > + return retval; > + } > + retval = path_walk_rcu(name, nd); > + rcu_read_unlock(); > + if (likely(!retval)) { > + if (unlikely(!audit_dummy_context())) { > + if (nd->path.dentry && nd->path.dentry->d_inode) > + audit_inode(name, nd->path.dentry); > + } > + } > if (nd->root.mnt) { > - path_put(&nd->root); > + mntput(nd->root.mnt); > nd->root.mnt = NULL; > } > + > + if (unlikely(retval == -EAGAIN)) { > + /* slower, locked walk */ > + retval = path_init(dfd, name, flags, nd); > + if (unlikely(retval)) > + return retval; > + retval = path_walk(name, nd); > + if (likely(!retval)) { > + if (unlikely(!audit_dummy_context())) { > + if (nd->path.dentry && nd->path.dentry->d_inode) > + audit_inode(name, nd->path.dentry); > + } > + } > + > + if (nd->root.mnt) { > + path_put(&nd->root); > + nd->root.mnt = NULL; > + } > + } > + > return retval; > } > > Index: linux-2.6/include/linux/dcache.h > =================================================================== > --- linux-2.6.orig/include/linux/dcache.h > +++ linux-2.6/include/linux/dcache.h > @@ -5,6 +5,7 @@ > #include <linux/list.h> > #include <linux/rculist.h> > #include <linux/spinlock.h> > +#include <linux/seqlock.h> > #include <linux/cache.h> > #include <linux/rcupdate.h> > > @@ -81,15 +82,16 @@ full_name_hash(const unsigned char *name > * large memory footprint increase). 
> */ > #ifdef CONFIG_64BIT > -#define DNAME_INLINE_LEN_MIN 32 /* 192 bytes */ > +#define DNAME_INLINE_LEN_MIN 24 /* 192 bytes */ > #else > -#define DNAME_INLINE_LEN_MIN 40 /* 128 bytes */ > +#define DNAME_INLINE_LEN_MIN 36 /* 128 bytes */ > #endif > > struct dentry { > unsigned int d_count; /* protected by d_lock */ > unsigned int d_flags; /* protected by d_lock */ > spinlock_t d_lock; /* per dentry lock */ > + seqcount_t d_seq; /* per dentry seqlock */ > int d_mounted; > struct inode *d_inode; /* Where the name belongs to - NULL is > * negative */ > @@ -283,9 +285,11 @@ extern void d_move(struct dentry *, stru > extern struct dentry *d_ancestor(struct dentry *, struct dentry *); > > /* appendix may either be NULL or be used for transname suffixes */ > -extern struct dentry * d_lookup(struct dentry *, struct qstr *); > -extern struct dentry * __d_lookup(struct dentry *, struct qstr *); > -extern struct dentry * d_hash_and_lookup(struct dentry *, struct qstr *); > +extern struct dentry *d_lookup(struct dentry *, struct qstr *); > +extern struct dentry *__d_lookup(struct dentry *, struct qstr *); > +extern struct dentry *d_lookup_rcu(struct dentry *, struct qstr *); > +extern struct dentry *__d_lookup_rcu(struct dentry *, struct qstr *); > +extern struct dentry *d_hash_and_lookup(struct dentry *, struct qstr *); > > /* validate "insecure" dentry pointer */ > extern int d_validate(struct dentry *, struct dentry *); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 57+ messages in thread
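Paul's suggested indirection is worth spelling out. In the sketch below, the name, length and hash live in one RCU-managed allocation, so a lock-free reader can never see a mixed pair across a rename: the writer publishes a new triple with a single pointer store and frees the old one after a grace period via call_rcu(), as Paul suggests. All names here (dname, dentry_sketch, the helpers) are hypothetical, not taken from the patches:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/rcupdate.h>

struct dname {
	struct rcu_head rcu;
	unsigned int hash;
	unsigned int len;
	unsigned char name[];	/* flexible array member */
};

struct dentry_sketch {
	struct dname __rcu *name;
	/* ... */
};

static void dname_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct dname, rcu));
}

/* Writer side, under the usual rename locking. */
static void dentry_swap_name(struct dentry_sketch *d, struct dname *new_name)
{
	struct dname *old = rcu_dereference_protected(d->name, 1);

	rcu_assign_pointer(d->name, new_name);	/* publish the new triple */
	call_rcu(&old->rcu, dname_free_rcu);	/* reclaim after a grace period */
}

/* Reader side, under rcu_read_lock(); false negatives are tolerable. */
static int dentry_name_eq(struct dentry_sketch *d, const unsigned char *str,
			  unsigned int len, unsigned int hash)
{
	struct dname *n = rcu_dereference(d->name);

	return n->hash == hash && n->len == len &&
	       memcmp(n->name, str, len) == 0;
}

Paul's caveat shows up directly in this sketch: renaming between two short names still allocates a fresh dname even when the new name would fit in any inline storage, because readers may still be traversing the old allocation.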
* Re: Latest vfs scalability patch 2009-10-06 6:49 Latest vfs scalability patch Nick Piggin 2009-10-06 10:14 ` Jens Axboe @ 2009-10-15 10:08 ` Anton Blanchard 2009-10-15 10:39 ` Nick Piggin 2009-10-15 10:53 ` Nick Piggin 1 sibling, 2 replies; 57+ messages in thread From: Anton Blanchard @ 2009-10-15 10:08 UTC (permalink / raw) To: Nick Piggin Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 1756 bytes --] Hi Nick, > Several people have been interested to test my vfs patches, so rather > than resend patches I have uploaded a rollup against Linus's current > head. > > ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/ > > I have used ext2,ext3,autofs4,nfs as well as in-memory filesystems > OK (although this doesn't mean there are no bugs!). Otherwise, if your > filesystem compiles, then there is a reasonable chance of it working, > or ask me and I can try updating it for the new locking. > > I would be interested in seeing any numbers people might come up with, > including single-threaded performance. Thanks for doing a rollup patch, it made it easy to test. I gave it a spin on a 64 core (128 thread) POWER5+ box. I started simple by looking at open/close performance, eg: void testcase(void) { char tmpfile[] = "/tmp/testfile.XXXXXX"; mkstemp(tmpfile); while (1) { int fd = open(tmpfile, O_RDWR); close(fd); } } At first the results were 10x slower. I took a look and it appears the MNT_MOUNTED flag is getting cleared by a remount (I'm testing on the root filesystem). This fixed it: --- fs/namespace.c~ 2009-10-15 04:34:02.000000000 -0500 +++ fs/namespace.c 2009-10-15 04:35:00.000000000 -0500 @@ -1711,7 +1711,8 @@ static int do_remount(struct path *path, else err = do_remount_sb(sb, flags, data, 0); if (!err) - path->mnt->mnt_flags = mnt_flags; + path->mnt->mnt_flags = mnt_flags | + (path->mnt->mnt_flags & MNT_MOUNTED); up_write(&sb->s_umount); if (!err) { security_sb_post_remount(path->mnt, flags, data); Attached is a before and after graph. Single thread performance is 20% faster, and we go from hitting a wall at 2 cores to scaling all the way to 64 cores. Nice work!!! Anton [-- Attachment #2: open_close_performance.png --] [-- Type: image/png, Size: 26403 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
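For anyone wanting to reproduce Anton's numbers, the testcase above needs only headers and a main() around it to compile. A filled-out version; the includes and error handling are additions, the loop is Anton's:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
	char tmpfile[] = "/tmp/testfile.XXXXXX";
	int fd = mkstemp(tmpfile);	/* creates the file, returns an open fd */

	if (fd < 0) {
		perror("mkstemp");
		return 1;
	}
	close(fd);	/* keep only the name; the loop reopens it */

	while (1) {
		fd = open(tmpfile, O_RDWR);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);
	}
}

Running one pinned copy per core (e.g. under taskset) and counting loop iterations per second is presumably how the attached scaling graph was produced.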
* Re: Latest vfs scalability patch 2009-10-15 10:08 ` Latest vfs scalability patch Anton Blanchard @ 2009-10-15 10:39 ` Nick Piggin 2009-10-15 10:46 ` Anton Blanchard 2009-10-15 10:53 ` Nick Piggin 1 sibling, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-15 10:39 UTC (permalink / raw) To: Anton Blanchard Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe Hi Anton, On Thu, Oct 15, 2009 at 09:08:54PM +1100, Anton Blanchard wrote: > > Hi Nick, > > > Several people have been interested to test my vfs patches, so rather > > than resend patches I have uploaded a rollup against Linus's current > > head. > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/ > > > > I have used ext2,ext3,autofs4,nfs as well as in-memory filesystems > > OK (although this doesn't mean there are no bugs!). Otherwise, if your > > filesystem compiles, then there is a reasonable chance of it working, > > or ask me and I can try updating it for the new locking. > > > > I would be interested in seeing any numbers people might come up with, > > including single-threaded performance. > > Thanks for doing a rollup patch, it made it easy to test. I gave it a spin on > a 64 core (128 thread) POWER5+ box. I started simple by looking at open/close > performance, eg: > > void testcase(void) > { > char tmpfile[] = "/tmp/testfile.XXXXXX"; > > mkstemp(tmpfile); > > while (1) { > int fd = open(tmpfile, O_RDWR); > close(fd); > } > } > > At first the results were 10x slower. I took a look and it appears the > MNT_MOUNTED flag is getting cleared by a remount (I'm testing on the root > filesystem). This fixed it: Oh dear, thanks for that. That bugfix is needed for the patchset I just sent to be merged. > --- fs/namespace.c~ 2009-10-15 04:34:02.000000000 -0500 > +++ fs/namespace.c 2009-10-15 04:35:00.000000000 -0500 > @@ -1711,7 +1711,8 @@ static int do_remount(struct path *path, > else > err = do_remount_sb(sb, flags, data, 0); > if (!err) > - path->mnt->mnt_flags = mnt_flags; > + path->mnt->mnt_flags = mnt_flags | > + (path->mnt->mnt_flags & MNT_MOUNTED); > up_write(&sb->s_umount); > if (!err) { > security_sb_post_remount(path->mnt, flags, data); > > Attached is a before and after graph. Single thread performance is 20% > faster, and we go from hitting a wall at 2 cores to scaling all the way > to 64 cores. Nice work!!! Nice looking graph, thanks! Did you use the rcu patch there as well? (my open-close testing chokes RCU but your system might be managing to keep outstanding callbacks below the threshold) ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Latest vfs scalability patch 2009-10-15 10:39 ` Nick Piggin @ 2009-10-15 10:46 ` Anton Blanchard 0 siblings, 0 replies; 57+ messages in thread From: Anton Blanchard @ 2009-10-15 10:46 UTC (permalink / raw) To: Nick Piggin Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe Hi Nick, > Nice looking graph, thanks! Did you use the rcu patch there as > well? (my open-close testing chokes RCU but your system might > be managing to keep outstanding callbacks below the threshold) Yep, I am running with the RCU patch but forgot to mention it :) Anton ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Latest vfs scalability patch 2009-10-15 10:08 ` Latest vfs scalability patch Anton Blanchard 2009-10-15 10:39 ` Nick Piggin @ 2009-10-15 10:53 ` Nick Piggin 2009-10-15 11:23 ` Anton Blanchard 1 sibling, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-15 10:53 UTC (permalink / raw) To: Anton Blanchard Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe On Thu, Oct 15, 2009 at 09:08:54PM +1100, Anton Blanchard wrote: > > Hi Nick, > > > Several people have been interested to test my vfs patches, so rather > > than resend patches I have uploaded a rollup against Linus's current > > head. > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/ > > > > I have used ext2,ext3,autofs4,nfs as well as in-memory filesystems > > OK (although this doesn't mean there are no bugs!). Otherwise, if your > > filesystem compiles, then there is a reasonable chance of it working, > > or ask me and I can try updating it for the new locking. > > > > I would be interested in seeing any numbers people might come up with, > > including single-threaded performance. > > Thanks for doing a rollup patch, it made it easy to test. I gave it a spin on > a 64 core (128 thread) POWER5+ box. I started simple by looking at open/close > performance, eg: I wonder what other good performance tests you can add to your test framework? creat/unlink is another easy one. And for each case, putting threads in their own cwd versus a common cwd are the variants. BTW. for these cases in your tests it will be nice if you can run on ramfs because that will isolate purely the vfs. Perhaps also include other filesystems as you get time, but I think ramfs is the most useful for us to start with. ^ permalink raw reply [flat|nested] 57+ messages in thread
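For reference, the creat/unlink testcase Nick asks for might look like the sketch below, with the separate-cwd versus common-cwd variant selected by a flag. The harness details (forking one worker per core, pinning, timing) are left out, and the naming is illustrative rather than taken from Anton's framework:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

static void testcase(int separate_cwd, int id)
{
	char dir[64];

	if (separate_cwd) {
		/* each worker creates and unlinks inside its own directory */
		snprintf(dir, sizeof(dir), "cwd-%d", id);
		mkdir(dir, 0700);
		if (chdir(dir) < 0) {
			perror("chdir");
			exit(1);
		}
	}

	while (1) {
		int fd = creat("testfile", 0600);

		if (fd >= 0)
			close(fd);
		unlink("testfile");
	}
}

int main(int argc, char **argv)
{
	testcase(argc > 1, getpid());	/* any argument: separate-cwd variant */
	return 0;
}

Per Nick's point, running this from a ramfs mount makes the numbers isolate the vfs rather than the filesystem underneath.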
* Re: Latest vfs scalability patch 2009-10-15 10:53 ` Nick Piggin @ 2009-10-15 11:23 ` Anton Blanchard 2009-10-15 11:41 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Anton Blanchard @ 2009-10-15 11:23 UTC (permalink / raw) To: Nick Piggin Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 744 bytes --] Hi Nick, > I wonder what other good performance tests you can add to your test > framework? creat/unlink is another easy one. And for each case, putting > threads in their own cwd versus a common cwd are the variants. I did try the two combinations of creat/unlink but haven't had a chance to digest the profiles yet. I've attached them (taken at 64 cores, ie worst case :) In both cases performance was significantly better than mainline. > BTW. for these cases in your tests it will be nice if you can run on > ramfs because that will isolate purely the vfs. Perhaps also include > other filesystems as you get time, but I think ramfs is the most > useful for us to start with. Good point. I'll add that into the setup scripts. Anton [-- Attachment #2: open_unlink.profile --] [-- Type: text/plain, Size: 10567 bytes --] # Samples: 82617 # # Overhead Command Shared Object Symbol # ........ ............... ................................. ...... # 99.16% unlink1_process [kernel] [k] ._spin_lock | |--99.98%-- ._spin_lock | | | |--49.80%-- .path_get | | (nil) | | | | | |--71.62%-- .path_init | | | | | | | |--51.37%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff8e90bd4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff8e84f33c | | | | 0xfff8e84f55c | | | | (nil) | | | | | | | --48.63%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff8e909c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff8e84f33c | | | 0xfff8e84f55c | | | (nil) | | | | | --28.38%-- .path_walk | | | | | |--50.49%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff8e909c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff8e84f33c | | | 0xfff8e84f55c | | | (nil) | | | | | --49.51%-- .do_path_lookup | | .user_path_parent | | .do_unlinkat | | syscall_exit | | 0xfff8e90bd4c | | .testcase | | .affinitize | | .new_task | | .main | | 0xfff8e84f33c | | 0xfff8e84f55c | | (nil) | | | |--49.58%-- .dput | | (nil) | | | | | |--64.21%-- .path_put | | | | | | | |--48.48%-- .path_walk | | | | | | | | | |--51.00%-- .do_path_lookup | | | | | .user_path_parent | | | | | .do_unlinkat | | | | | syscall_exit | | | | | 0xfff8e90bd4c | | | | | .testcase | | | | | .affinitize | | | | | .new_task | | | | | .main | | | | | 0xfff8e84f33c | | | | | 0xfff8e84f55c | | | | | (nil) | | | | | | | | | --49.00%-- .do_filp_open | | | | .do_sys_open | | | | syscall_exit | | | | 0xfff8e909c7c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff8e84f33c | | | | 0xfff8e84f55c | | | | (nil) | | | | | | | |--25.79%-- .do_filp_open | | | | .do_sys_open | | | | syscall_exit | | | | 0xfff8e909c7c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff8e84f33c | | | | 0xfff8e84f55c | | | | (nil) | | | | | | | |--25.72%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff8e90bd4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff8e84f33c | | | | 0xfff8e84f55c | | | 
| (nil) | | | --0.00%-- [...] | | | | | |--35.78%-- .__link_path_walk | | | .path_walk | | | | | | | |--50.81%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff8e90bd4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff8e84f33c | | | | 0xfff8e84f55c | | | | (nil) | | | | | | | --49.19%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff8e909c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff8e84f33c | | | 0xfff8e84f55c | | | (nil) | | --0.01%-- [...] | --0.62%-- [...] --0.02%-- [...] [-- Attachment #3: open_unlink_directories.profile --] [-- Type: text/plain, Size: 12730 bytes --] # Samples: 101253 # # Overhead Command Shared Object Symbol # ........ ............... ................................. ...... # 99.15% unlink2_process [kernel] [k] ._spin_lock | |--99.97%-- ._spin_lock | | | |--49.96%-- .dput | | (nil) | | | | | |--71.64%-- .__link_path_walk | | | .path_walk | | | | | | | |--50.78%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff86a85d4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff869c933c | | | | 0xfff869c955c | | | | (nil) | | | | | | | --49.22%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff86a83c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff869c933c | | | 0xfff869c955c | | | (nil) | | | | | |--28.35%-- .path_put | | | | | | | |--50.91%-- .path_walk | | | | | | | | | |--53.90%-- .do_path_lookup | | | | | .user_path_parent | | | | | .do_unlinkat | | | | | syscall_exit | | | | | 0xfff86a85d4c | | | | | .testcase | | | | | .affinitize | | | | | .new_task | | | | | .main | | | | | 0xfff869c933c | | | | | 0xfff869c955c | | | | | (nil) | | | | | | | | | --46.10%-- .do_filp_open | | | | .do_sys_open | | | | syscall_exit | | | | 0xfff86a83c7c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff869c933c | | | | 0xfff869c955c | | | | (nil) | | | | | | | |--25.98%-- .do_filp_open | | | | .do_sys_open | | | | syscall_exit | | | | 0xfff86a83c7c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff869c933c | | | | 0xfff869c955c | | | | (nil) | | | | | | | |--23.10%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff86a85d4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff869c933c | | | | 0xfff869c955c | | | | (nil) | | | --0.01%-- [...] | | --0.01%-- [...] 
| | | |--28.88%-- .__follow_mount | | (nil) | | .do_lookup | | .__link_path_walk | | .path_walk | | | | | |--50.26%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff86a83c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff869c933c | | | 0xfff869c955c | | | (nil) | | | | | --49.74%-- .do_path_lookup | | .user_path_parent | | .do_unlinkat | | syscall_exit | | 0xfff86a85d4c | | .testcase | | .affinitize | | .new_task | | .main | | 0xfff869c933c | | 0xfff869c955c | | (nil) | | | |--20.33%-- .path_get | | (nil) | | | | | |--69.39%-- .path_init | | | | | | | |--50.78%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff86a85d4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff869c933c | | | | 0xfff869c955c | | | | (nil) | | | | | | | --49.22%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff86a83c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff869c933c | | | 0xfff869c955c | | | (nil) | | | | | --30.61%-- .path_walk | | | | | |--53.29%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff86a83c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff869c933c | | | 0xfff869c955c | | | (nil) | | | | | --46.71%-- .do_path_lookup | | .user_path_parent | | .do_unlinkat | | syscall_exit | | 0xfff86a85d4c | | .testcase | | .affinitize | | .new_task | | .main | | 0xfff869c933c | | 0xfff869c955c | | (nil) | --0.83%-- [...] --0.03%-- [...] ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Latest vfs scalability patch 2009-10-15 11:23 ` Anton Blanchard @ 2009-10-15 11:41 ` Nick Piggin 2009-10-15 11:48 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-15 11:41 UTC (permalink / raw) To: Anton Blanchard Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe On Thu, Oct 15, 2009 at 10:23:29PM +1100, Anton Blanchard wrote: > > Hi Nick, > > > I wonder what other good performance tests you can add to your test > > framework? creat/unlink is another easy one. And for each case, putting > > threads in their own cwd versus a common cwd are the variants. > > I did try the two combinations of creat/unlink but haven't had a chance to > digest the profiles yet. I've attached them (taken at 64 cores, ie worst > case :) > > In both cases performance was significantly better than mainline. > > > BTW. for these cases in your tests it will be nice if you can run on > > ramfs because that will isolate purely the vfs. Perhaps also include > > other filesystems as you get time, but I think ramfs is the most > > useful for us to start with. > > Good point. I'll add that into the setup scripts. > > Anton > # Samples: 82617 > # > # Overhead Command Shared Object Symbol > # ........ ............... ................................. ...... > # > 99.16% unlink1_process [kernel] [k] ._spin_lock > | > |--99.98%-- ._spin_lock > | | > | |--49.80%-- .path_get > | |--49.58%-- .dput Hmm, both your profiles look like they are hammering on a common cwd here. The lock-free path walk can probably be extended to help a bit, but you would still end up hitting locks on the parent dentry/inode when doing the create/destroy. My 64-way numbers look like this:

create-unlink  1 processes seperate-cwd 105306.58 ops/s
create-unlink  2 processes seperate-cwd 103004.20 ops/s
create-unlink  4 processes seperate-cwd  92438.69 ops/s
create-unlink  8 processes seperate-cwd  91138.93 ops/s
create-unlink 16 processes seperate-cwd  91025.36 ops/s
create-unlink 32 processes seperate-cwd  83757.75 ops/s
create-unlink 64 processes seperate-cwd  81718.29 ops/s
create-unlink  1 processes same-cwd     110139.61 ops/s
create-unlink  2 processes same-cwd      26611.69 ops/s
create-unlink  4 processes same-cwd      13819.46 ops/s
create-unlink  8 processes same-cwd       4724.83 ops/s
create-unlink 16 processes same-cwd       1368.99 ops/s
create-unlink 32 processes same-cwd        335.08 ops/s
create-unlink 64 processes same-cwd        114.88 ops/s

If your seperate-cwd numbers aren't scaling reasonably well, I will have to get to the bottom of it. BTW, these numbers are ops/s/cpu, which in some cases I like better. At a quick glance it is very easy to see if scalability is linear, and when you graph it there is less tendency to drown out the low-end numbers (although it could go the other way and drown the high-end numbers for non-scalable cases). Maybe you could add an option to output either. Thanks, Nick ^ permalink raw reply [flat|nested] 57+ messages in thread
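To make the ops/s/cpu presentation concrete: assuming one process per CPU, aggregate throughput is the per-cpu figure times the process count. Nick's 64-process separate-cwd case is therefore roughly 81718 * 64, about 5.2 million create/unlink pairs per second in total, while the 64-process same-cwd case collapses to about 114.88 * 64, around 7400 in aggregate. A flat column of per-cpu numbers is exactly what linear scalability looks like in this form.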
* Re: Latest vfs scalability patch 2009-10-15 11:41 ` Nick Piggin @ 2009-10-15 11:48 ` Nick Piggin 0 siblings, 0 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-15 11:48 UTC (permalink / raw) To: Anton Blanchard Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe On Thu, Oct 15, 2009 at 01:41:19PM +0200, Nick Piggin wrote: > On Thu, Oct 15, 2009 at 10:23:29PM +1100, Anton Blanchard wrote: > > > > Hi Nick, > > > > > I wonder what other good performance tests you can add to your test > > > framework? creat/unlink is another easy one. And for each case, putting > > > threads in their own cwd versus a common cwd are the variants. > > > > I did try the two combinations of creat/unlink but haven't had a chance to > > digest the profiles yet. I've attached them (taken at 64 cores, ie worst > > case :) > > > > In both cases performance was significantly better than mainline. > > > > > BTW. for these cases in your tests it will be nice if you can run on > > > ramfs because that will isolate purely the vfs. Perhaps also include > > > other filesystems as you get time, but I think ramfs is the most > > > useful for us to start with. > > > > Good point. I'll add that into the setup scripts. > > > > Anton > > > # Samples: 82617 > > # > > # Overhead Command Shared Object Symbol > > # ........ ............... ................................. ...... > > # > > 99.16% unlink1_process [kernel] [k] ._spin_lock > > | > > |--99.98%-- ._spin_lock > > | | > > | |--49.80%-- .path_get > > | |--49.58%-- .dput > > Hmm, both your profiles look like they are hammering on a common cwd > here. The lock-free path walk can probably be extended to help a bit, > but you would still end up hitting locks on the parent dentry/inode > when doing the create/destroy. My 64-way numbers look like this: > >
> create-unlink  1 processes seperate-cwd 105306.58 ops/s
> create-unlink  2 processes seperate-cwd 103004.20 ops/s
> create-unlink  4 processes seperate-cwd  92438.69 ops/s
> create-unlink  8 processes seperate-cwd  91138.93 ops/s
> create-unlink 16 processes seperate-cwd  91025.36 ops/s
> create-unlink 32 processes seperate-cwd  83757.75 ops/s
> create-unlink 64 processes seperate-cwd  81718.29 ops/s

dumb profile for this guy looks like this:

206681 total                      0.0270
 25851 _spin_lock              161.5687
 13628 kmem_cache_free           7.3427
  9890 _spin_unlock             61.8125
  7087 kmem_cache_alloc          6.5138
  6770 _read_lock               35.2604
  5587 __call_rcu                4.8498
  5580 __link_path_walk          0.5571
  5246 do_filp_open              0.9476
  4946 __rcu_process_callbacks   2.0608
  4904 __percpu_counter_add     11.7885
  3933 d_alloc                   5.1211
  3906 memset                    3.6989
  3807 path_init_rcu             3.2154
  3370 __mutex_init             35.1042
  3254 mnt_want_write            4.6222

oprofile isn't working on this guy either, and I no longer have the patience to try working out where such locking is coming from without lockdep or perf ;) But it sure is a lot better than your profiles... ^ permalink raw reply [flat|nested] 57+ messages in thread