* Latest vfs scalability patch
From: Nick Piggin @ 2009-10-06  6:49 UTC
To: Linux Kernel Mailing List, linux-fsdevel
Cc: Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe

Hi,

Several people have been interested in testing my vfs patches, so rather
than resend them I have uploaded a rollup against Linus's current head:

  ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/

I have tested ext2, ext3, autofs4, nfs and the in-memory filesystems OK
(although this doesn't mean there are no bugs!). Otherwise, if your
filesystem compiles, there is a reasonable chance of it working, or ask
me and I can try updating it for the new locking.

I would be interested in seeing any numbers people might come up with,
including single-threaded performance.

Thanks,
Nick

* Re: Latest vfs scalability patch
From: Jens Axboe @ 2009-10-06 10:14 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Tue, Oct 06 2009, Nick Piggin wrote:
> Several people have been interested in testing my vfs patches, so rather
> than resend them I have uploaded a rollup against Linus's current head.
> [...]
> I would be interested in seeing any numbers people might come up with,
> including single-threaded performance.

I gave this a quick spin on the 64-thread Nehalem, just a simple dbench
with 64 clients on tmpfs. The results are below. While running perf top
-a on mainline, the top 5 entries are:

  2086691.00 - 96.6% : _spin_lock
    14866.00 -  0.7% : copy_user_generic_string
     5710.00 -  0.3% : mutex_spin_on_owner
     2837.00 -  0.1% : _atomic_dec_and_lock
     2274.00 -  0.1% : __d_lookup

Uhm, ouch... It doesn't look much prettier for the patched kernel,
though:

  9396422.00 - 95.7% : _spin_lock
    66978.00 -  0.7% : copy_user_generic_string
    43775.00 -  0.4% : dput
    23946.00 -  0.2% : __link_path_walk
    17699.00 -  0.2% : path_init
    15046.00 -  0.2% : do_lookup

Anyway, below are the results. They seem very stable.

                             throughput
  ------------------------------------------------
  2.6.32-rc3-git           |  561.218 MB/sec
  2.6.32-rc3-git+patch     |  627.022 MB/sec

-- 
Jens Axboe

* Re: Latest vfs scalability patch
From: Jens Axboe @ 2009-10-06 10:26 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Tue, Oct 06 2009, Jens Axboe wrote:
> I gave this a quick spin on the 64-thread Nehalem, just a simple dbench
> with 64 clients on tmpfs.
> [...]
> Uhm, ouch... It doesn't look much prettier for the patched kernel,
> though:
>
>   9396422.00 - 95.7% : _spin_lock
>     66978.00 -  0.7% : copy_user_generic_string
>     43775.00 -  0.4% : dput
>     23946.00 -  0.2% : __link_path_walk
>     17699.00 -  0.2% : path_init
>     15046.00 -  0.2% : do_lookup

I did a quick perf analysis on that, but only on 8 clients (64 clients
basically causes perf to shit itself, it's just not functional). So for
an 8-client, 60s dbench run, we're already into ~75% spinlock time. The
components are:

  dput (44%)
      path_put
      path_walk
      do_path_lookup
  path_get (44%)
      path_init
      do_path_lookup
  __d_path (7%)
  vfsmount_read_lock (3%)

This is for the patched kernel.

-- 
Jens Axboe

* Re: Latest vfs scalability patch
From: Peter Zijlstra @ 2009-10-06 11:10 UTC
To: Jens Axboe
Cc: Nick Piggin, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Linus Torvalds

On Tue, 2009-10-06 at 12:26 +0200, Jens Axboe wrote:
> I did a quick perf analysis on that, but only on 8 clients (64 clients
> basically causes perf to shit itself, it's just not functional).

Even when used with -a, so that you profile each cpu? I can imagine the
cacheline contention from the inherited counters rendering a 64-thread
machine less than usable?

* Re: Latest vfs scalability patch
From: Jens Axboe @ 2009-10-06 12:51 UTC
To: Peter Zijlstra
Cc: Nick Piggin, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Linus Torvalds

On Tue, Oct 06 2009, Peter Zijlstra wrote:
> Even when used with -a, so that you profile each cpu? I can imagine the
> cacheline contention from the inherited counters rendering a 64-thread
> machine less than usable?

No, that's without -a. I used -a for perf top, and that has some impact
on performance while it's running, but it is millions of miles apart
from what perf does without it :-)

-- 
Jens Axboe

* Re: Latest vfs scalability patch
From: Nick Piggin @ 2009-10-06 12:26 UTC
To: Jens Axboe
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Tue, Oct 06, 2009 at 12:14:14PM +0200, Jens Axboe wrote:
> I gave this a quick spin on the 64-thread Nehalem, just a simple dbench
> with 64 clients on tmpfs.
> [...]
> Uhm, ouch... It doesn't look much prettier for the patched kernel,
> though:
>
>   9396422.00 - 95.7% : _spin_lock
>     66978.00 -  0.7% : copy_user_generic_string
>     43775.00 -  0.4% : dput
>     23946.00 -  0.2% : __link_path_walk
>     17699.00 -  0.2% : path_init
>     15046.00 -  0.2% : do_lookup

Yep, this is the problem of the common-path lookup. Every dentry element
in the path has its d_lock taken for every path lookup, so the cwd
dentry lock bounces a lot for dbench.

I'm working on doing path traversal without any locks or stores to the
dentries in the common cases, so that should basically be the last bit
of the puzzle for vfs locking. (It can be considered a different type of
problem than the global lock removal, but RCU-freed struct inode is
important for the approach I'm taking, so I'm basing it on top of these
patches.)

It's a copout, but you could try running multiple dbenches under
different working directories (or actually, IIRC dbench does root-based
path lookups, so maybe that won't help you much).

> Anyway, below are the results. They seem very stable.
>
>                              throughput
>   ------------------------------------------------
>   2.6.32-rc3-git           |  561.218 MB/sec
>   2.6.32-rc3-git+patch     |  627.022 MB/sec

Well, it's good to see you got some improvement.

* Re: Latest vfs scalability patch
From: Jens Axboe @ 2009-10-06 12:49 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Tue, Oct 06 2009, Nick Piggin wrote:
> Yep, this is the problem of the common-path lookup. Every dentry element
> in the path has its d_lock taken for every path lookup, so the cwd
> dentry lock bounces a lot for dbench.
> [...]
> It's a copout, but you could try running multiple dbenches under
> different working directories (or actually, IIRC dbench does root-based
> path lookups, so maybe that won't help you much).

Yeah, it's hitting dentry->d_lock pretty hard, so it is basically
spin-serialized at that point.

> > Anyway, below are the results. They seem very stable.
> >
> >                              throughput
> >   ------------------------------------------------
> >   2.6.32-rc3-git           |  561.218 MB/sec
> >   2.6.32-rc3-git+patch     |  627.022 MB/sec
>
> Well, it's good to see you got some improvement.

Yes, it's an improvement, though the results are still pretty abysmal :-)

-- 
Jens Axboe

* [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-07  8:58 UTC
To: Jens Axboe
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Tue, Oct 06, 2009 at 02:49:41PM +0200, Jens Axboe wrote:
> On Tue, Oct 06 2009, Nick Piggin wrote:
> > It's a copout, but you could try running multiple dbenches under
> > different working directories [...]
>
> Yeah, it's hitting dentry->d_lock pretty hard, so it is basically
> spin-serialized at that point.
> [...]
> Yes, it's an improvement, though the results are still pretty abysmal :-)

OK, I have a really basic patch that does store-free path walking
(except on the final element). dbench is still pretty nasty because it
seems to do a lot of stupid things, like reading from /proc/mounts all
the time.

Actually, it isn't quite store-free because it still takes a mnt ref
(which is hard to avoid, but at least it is per-cpu data). So this patch
applies on top of my recent patchset. At least it does not store to the
dentries.

Basically it holds rcu_read_lock for the entire walk, uses inode RCU
from my earlier patches to check inode permissions, and bails out at the
slightest sign of trouble :) (actually I think I have got it to walk
mountpoints, which is nice).

The seqlock in the dentry is for getting a consistent name/len pair, and
also for not giving a false positive if a rename has partially
overwritten the name string (false negatives are always fine in the
lock-free lookup path, but false positives are not). Possibly we could
make do with a per-sb seqlock for this (or just rename_lock).

I have duplicated most of the lookup code, altering it into the
lock-free version, which returns -EAGAIN if it gets in trouble, at which
point the regular path walk starts up. I don't know if this is the best
way to go... it's fairly easy but a bit ugly. But complexifying the
existing path walk any more would not be nice either. It might be nicer
to try locking the dentry that we're having trouble with and continuing
from there, rather than redoing the whole walk with locks...

Anyway, this is the basics working for now; a microbenchmark shows
same-cwd lookups scale linearly now too. We can probably slowly tackle
more cases as they come up as being important, simply by auditing
filesystems etc.

--
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1187,6 +1187,7 @@ struct dentry *d_alloc(struct dentry * p
 	dentry->d_count = 1;
 	dentry->d_flags = DCACHE_UNHASHED;
 	spin_lock_init(&dentry->d_lock);
+	seqcount_init(&dentry->d_seq);
 	dentry->d_inode = NULL;
 	dentry->d_parent = NULL;
 	dentry->d_sb = NULL;
@@ -1579,21 +1580,6 @@ err_out:
  * d_lookup() is protected against the concurrent renames in some unrelated
  * directory using the seqlockt_t rename_lock.
  */
-
-struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
-{
-	struct dentry * dentry = NULL;
-	unsigned seq;
-
-	do {
-		seq = read_seqbegin(&rename_lock);
-		dentry = __d_lookup(parent, name);
-		if (dentry)
-			break;
-	} while (read_seqretry(&rename_lock, seq));
-	return dentry;
-}
-
 struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 {
 	unsigned int len = name->len;
@@ -1656,6 +1642,78 @@ next:
 	return found;
 }
 
+struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
+{
+	struct dentry *dentry = NULL;
+	unsigned seq;
+
+	do {
+		seq = read_seqbegin(&rename_lock);
+		dentry = __d_lookup(parent, name);
+		if (dentry)
+			break;
+	} while (read_seqretry(&rename_lock, seq));
+	return dentry;
+}
+
+struct dentry * __d_lookup_rcu(struct dentry * parent, struct qstr * name)
+{
+	unsigned int len = name->len;
+	unsigned int hash = name->hash;
+	const unsigned char *str = name->name;
+	struct dcache_hash_bucket *b = d_hash(parent, hash);
+	struct hlist_head *head = &b->head;
+	struct hlist_node *node;
+	struct dentry *dentry;
+
+	hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
+		unsigned seq;
+		struct dentry *tparent;
+		const char *tname;
+		int tlen;
+
+		if (dentry->d_name.hash != hash)
+			continue;
+
+seqretry:
+		seq = read_seqcount_begin(&dentry->d_seq);
+		tparent = dentry->d_parent;
+		if (tparent != parent)
+			continue;
+		tlen = dentry->d_name.len;
+		if (tlen != len)
+			continue;
+		tname = dentry->d_name.name;
+		if (read_seqcount_retry(&dentry->d_seq, seq))
+			goto seqretry;
+		if (memcmp(tname, str, tlen))
+			continue;
+		if (read_seqcount_retry(&dentry->d_seq, seq))
+			goto seqretry;
+
+		return dentry;
+	}
+	return NULL;
+}
+
+struct dentry *d_lookup_rcu(struct dentry *parent, struct qstr * name)
+{
+	struct dentry *dentry = NULL;
+	unsigned seq;
+
+	if (parent->d_op && parent->d_op->d_compare)
+		goto out;
+
+	do {
+		seq = read_seqbegin(&rename_lock);
+		dentry = __d_lookup_rcu(parent, name);
+		if (dentry)
+			break;
+	} while (read_seqretry(&rename_lock, seq));
out:
+	return dentry;
+}
+
 /**
  * d_hash_and_lookup - hash the qstr then search for a dentry
  * @dir: Directory to search in
@@ -1925,6 +1983,8 @@ static void d_move_locked(struct dentry
 	list_del(&target->d_u.d_child);
 
 	/* Switch the names.. */
+	write_seqcount_begin(&dentry->d_seq);
+	write_seqcount_begin(&target->d_seq);
 	switch_names(dentry, target);
 	swap(dentry->d_name.hash, target->d_name.hash);
 
@@ -1939,6 +1999,8 @@ static void d_move_locked(struct dentry
 		/* And add them back to the (new) parent lists */
 		list_add(&target->d_u.d_child, &target->d_parent->d_subdirs);
 	}
+	write_seqcount_end(&target->d_seq);
+	write_seqcount_end(&dentry->d_seq);
 	list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
 	if (target->d_parent != dentry->d_parent)
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -200,6 +200,29 @@ static int acl_permission_check(struct i
 	return -EACCES;
 }
 
+static int acl_permission_check_rcu(struct inode *inode, int mask,
+		int (*check_acl)(struct inode *inode, int mask))
+{
+	umode_t mode = inode->i_mode;
+
+	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
+
+	if (current_fsuid() == inode->i_uid)
+		mode >>= 6;
+	else {
+		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl)
+			return -EAGAIN;
+		if (in_group_p(inode->i_gid))
+			mode >>= 3;
+	}
+
+	/*
+	 * If the DACs are ok we don't need any capability check.
+	 */
+	if ((mask & ~mode) == 0)
+		return 0;
+	return -EACCES;
+}
+
 /**
  * generic_permission - check for access rights on a Posix-like filesystem
  * @inode:	inode to check access rights for
@@ -465,6 +488,25 @@ ok:
 	return security_inode_permission(inode, MAY_EXEC);
 }
 
+static int exec_permission_lite_rcu(struct inode *inode)
+{
+	int ret;
+
+	if (inode->i_op->permission)
+		return -EAGAIN;
+	ret = acl_permission_check_rcu(inode, MAY_EXEC, inode->i_op->check_acl);
+	if (ret == -EAGAIN)
+		return ret;
+	if (!ret)
+		goto ok;
+
+	if (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH))
+		goto ok;
+
+	return ret;
+ok:
+	return security_inode_permission(inode, MAY_EXEC);
+}
+
 /*
  * This is called when everything else fails, and we actually have
  * to go to the low-level filesystem to find out what we should do..
@@ -569,6 +611,17 @@ static __always_inline void set_root(str
 	}
 }
 
+static __always_inline void set_root_rcu(struct nameidata *nd)
+{
+	if (!nd->root.mnt) {
+		struct fs_struct *fs = current->fs;
+		read_lock(&fs->lock);
+		nd->root = fs->root;
+		mntget(nd->root.mnt);
+		read_unlock(&fs->lock);
+	}
+}
+
 static __always_inline int __vfs_follow_link(struct nameidata *nd, const char *link)
 {
 	int res = 0;
@@ -611,6 +664,14 @@ static void path_put_conditional(struct
 	mntput(path->mnt);
 }
 
+static inline void path_to_nameidata_rcu(struct path *path, struct nameidata *nd)
+{
+	if (nd->path.mnt != path->mnt)
+		mntput(nd->path.mnt);
+	nd->path.mnt = path->mnt;
+	nd->path.dentry = path->dentry;
+}
+
 static inline void path_to_nameidata(struct path *path, struct nameidata *nd)
 {
 	dput(nd->path.dentry);
@@ -705,6 +766,22 @@ int follow_up(struct path *path)
 /* no need for dcache_lock, as serialization is taken care in
  * namespace.c
  */
+static int __follow_mount_rcu(struct path *path)
+{
+	int res = 0;
+	while (d_mountpoint(path->dentry)) {
+		struct vfsmount *mounted = lookup_mnt(path);
+		if (!mounted)
+			break;
+		if (res)
+			mntput(path->mnt);
+		path->mnt = mounted;
+		path->dentry = mounted->mnt_root;
+		res = 1;
+	}
+	return res;
+}
+
 static int __follow_mount(struct path *path)
 {
 	int res = 0;
@@ -791,6 +868,24 @@ static __always_inline void follow_dotdo
 * small and for now I'd prefer to have fast path as straight as possible.
 * It _is_ time-critical.
 */
+static int do_lookup_rcu(struct nameidata *nd, struct qstr *name,
+			struct path *path)
+{
+	struct vfsmount *mnt = nd->path.mnt;
+	struct dentry *dentry;
+
+	dentry = __d_lookup_rcu(nd->path.dentry, name);
+
+	if (!dentry)
+		return -EAGAIN;
+	if (dentry->d_op && dentry->d_op->d_revalidate)
+		return -EAGAIN;
+	path->mnt = mnt;
+	path->dentry = dentry;
+	__follow_mount_rcu(path);
+	return 0;
+}
+
 static int do_lookup(struct nameidata *nd, struct qstr *name,
 		     struct path *path)
 {
@@ -825,6 +920,134 @@ fail:
 	return PTR_ERR(dentry);
 }
 
+static noinline int link_path_walk_rcu(const char *name, struct nameidata *nd, struct path *next)
+{
+	struct inode *inode;
+	unsigned int lookup_flags = nd->flags;
+
+	while (*name=='/')
+		name++;
+	if (!*name)
+		goto return_reval;
+
+	inode = nd->path.dentry->d_inode;
+	if (nd->depth)
+		lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
+
+	/* At this point we know we have a real path component. */
+	for(;;) {
+		unsigned long hash;
+		struct qstr this;
+		unsigned int c;
+
+		nd->flags |= LOOKUP_CONTINUE;
+		if (exec_permission_lite_rcu(inode))
+			return -EAGAIN;
+
+		this.name = name;
+		c = *(const unsigned char *)name;
+
+		hash = init_name_hash();
+		do {
+			name++;
+			hash = partial_name_hash(c, hash);
+			c = *(const unsigned char *)name;
+		} while (c && (c != '/'));
+		this.len = name - (const char *) this.name;
+		this.hash = end_name_hash(hash);
+
+		/* remove trailing slashes? */
+		if (!c)
+			goto last_component;
+		while (*++name == '/');
+		if (!*name)
+			goto last_with_slashes;
+
+		if (this.name[0] == '.') switch (this.len) {
+			default:
+				break;
+			case 2:
+				if (this.name[1] != '.')
+					break;
+				return -EAGAIN;
+			case 1:
+				continue;
+		}
+		if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash)
+			return -EAGAIN;
+		/* This does the actual lookups.. */
+		if (do_lookup_rcu(nd, &this, next))
+			return -EAGAIN;
+
+		inode = next->dentry->d_inode;
+		if (!inode)
+			return -ENOENT;
+		if (inode->i_op->follow_link)
+			return -EAGAIN;
+		path_to_nameidata_rcu(next, nd);
+		if (!inode->i_op->lookup)
+			return -ENOTDIR;
+		continue;
+		/* here ends the main loop */
+
last_with_slashes:
+		lookup_flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
last_component:
+		/* Clear LOOKUP_CONTINUE iff it was previously unset */
+		nd->flags &= lookup_flags | ~LOOKUP_CONTINUE;
+		if (lookup_flags & LOOKUP_PARENT)
+			return -EAGAIN;
+		if (this.name[0] == '.') switch (this.len) {
+			default:
+				break;
+			case 2:
+				if (this.name[1] != '.')
+					break;
+				return -EAGAIN;
+			case 1:
+				goto return_reval;
+		}
+		if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash)
+			return -EAGAIN;
+		if (do_lookup_rcu(nd, &this, next))
+			return -EAGAIN;
+		inode = next->dentry->d_inode;
+		if ((lookup_flags & LOOKUP_FOLLOW)
+		    && inode && inode->i_op->follow_link)
+			return -EAGAIN;
+
+		path_to_nameidata_rcu(next, nd);
+		if (!inode)
+			return -ENOENT;
+		if (lookup_flags & LOOKUP_DIRECTORY) {
+			if (!inode->i_op->lookup)
+				return -ENOTDIR;
+		}
+		goto return_base;
+	}
return_reval:
+	/*
+	 * We bypassed the ordinary revalidation routines.
+	 * We may need to check the cached dentry for staleness.
+	 */
+	if (nd->path.dentry && nd->path.dentry->d_sb &&
+	    (nd->path.dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT))
+		return -EAGAIN;
return_base:
+	spin_lock(&nd->path.dentry->d_lock);
+	if (d_unhashed(nd->path.dentry)) {
+		spin_unlock(&nd->path.dentry->d_lock);
+		return -EAGAIN;
+	}
+	if (!nd->dentry->d_inode) {
+		spin_unlock(&nd->path.dentry->d_lock);
+		return -EAGAIN;
+	}
+	nd->path.dentry->d_count++;
+	spin_unlock(&nd->path.dentry->d_lock);
+	return 0;
+}
+
 /*
  * Name resolution.
  * This is the basic name resolution function, turning a pathname into
@@ -887,7 +1110,7 @@ static int __link_path_walk(const char *
 		if (this.name[0] == '.') switch (this.len) {
 			default:
 				break;
-			case 2:	
+			case 2:
 				if (this.name[1] != '.')
 					break;
 				follow_dotdot(nd);
@@ -942,7 +1165,7 @@ last_component:
 		if (this.name[0] == '.') switch (this.len) {
 			default:
 				break;
-			case 2:	
+			case 2:
 				if (this.name[1] != '.')
 					break;
 				follow_dotdot(nd);
@@ -1013,12 +1236,82 @@ return_err:
 	return err;
 }
 
+static int path_walk_rcu(const char *name, struct nameidata *nd)
+{
+	struct path save = nd->path;
+	struct path path = {.mnt = NULL};
+	int err;
+
+	current->total_link_count = 0;
+	err = link_path_walk_rcu(name, nd, &path);
+	if (err) {
+		if (path.mnt != nd->path.mnt)
+			mntput(path.mnt);
+		mntput(nd->path.mnt);
+		if (err == -EAGAIN)
+			nd->path = save;
+	}
+	return err;
+}
+
 static int path_walk(const char *name, struct nameidata *nd)
 {
 	current->total_link_count = 0;
 	return link_path_walk(name, nd);
 }
 
+static noinline int path_init_rcu(int dfd, const char *name, unsigned int flags, struct nameidata *nd)
+{
+	int retval = 0;
+	int fput_needed;
+	struct file *file;
+
+	nd->last_type = LAST_ROOT; /* if there are only slashes... */
+	nd->flags = flags;
+	nd->depth = 0;
+	nd->root.mnt = NULL;
+
+	if (*name=='/') {
+		set_root_rcu(nd);
+		nd->path = nd->root;
+		mntget(nd->root.mnt);
+	} else if (dfd == AT_FDCWD) {
+		struct fs_struct *fs = current->fs;
+		read_lock(&fs->lock);
+		nd->path = fs->pwd;
+		mntget(fs->pwd.mnt);
+		read_unlock(&fs->lock);
+	} else {
+		struct dentry *dentry;
+
+		file = fget_light(dfd, &fput_needed);
+		retval = -EBADF;
+		if (!file)
+			goto out_fail;
+
+		dentry = file->f_path.dentry;
+
+		retval = -ENOTDIR;
+		if (!S_ISDIR(dentry->d_inode->i_mode))
+			goto fput_fail;
+
+		retval = file_permission(file, MAY_EXEC);
+		if (retval)
+			goto fput_fail;
+
+		nd->path = file->f_path;
+		mntget(file->f_path.mnt);
+
+		fput_light(file, fput_needed);
+	}
+	return 0;
+
fput_fail:
+	fput_light(file, fput_needed);
out_fail:
+	return retval;
+}
+
 static int path_init(int dfd, const char *name, unsigned int flags, struct nameidata *nd)
 {
 	int retval = 0;
@@ -1075,16 +1368,46 @@ out_fail:
 static int do_path_lookup(int dfd, const char *name,
 				unsigned int flags, struct nameidata *nd)
 {
-	int retval = path_init(dfd, name, flags, nd);
-	if (!retval)
-		retval = path_walk(name, nd);
-	if (unlikely(!retval && !audit_dummy_context() && nd->path.dentry &&
-				nd->path.dentry->d_inode))
-		audit_inode(name, nd->path.dentry);
+	int retval;
+
+	rcu_read_lock();
+	retval = path_init_rcu(dfd, name, flags, nd);
+	if (unlikely(retval)) {
+		rcu_read_unlock();
+		return retval;
+	}
+	retval = path_walk_rcu(name, nd);
+	rcu_read_unlock();
+	if (likely(!retval)) {
+		if (unlikely(!audit_dummy_context())) {
+			if (nd->path.dentry && nd->path.dentry->d_inode)
+				audit_inode(name, nd->path.dentry);
+		}
+	}
 	if (nd->root.mnt) {
-		path_put(&nd->root);
+		mntput(nd->root.mnt);
 		nd->root.mnt = NULL;
 	}
+
+	if (unlikely(retval == -EAGAIN)) {
+		/* slower, locked walk */
+		retval = path_init(dfd, name, flags, nd);
+		if (unlikely(retval))
+			return retval;
+		retval = path_walk(name, nd);
+		if (likely(!retval)) {
+			if (unlikely(!audit_dummy_context())) {
+				if (nd->path.dentry && nd->path.dentry->d_inode)
+					audit_inode(name, nd->path.dentry);
+			}
+		}
+
+		if (nd->root.mnt) {
+			path_put(&nd->root);
+			nd->root.mnt = NULL;
+		}
+	}
+
 	return retval;
 }
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -5,6 +5,7 @@
 #include <linux/list.h>
 #include <linux/rculist.h>
 #include <linux/spinlock.h>
+#include <linux/seqlock.h>
 #include <linux/cache.h>
 #include <linux/rcupdate.h>
 
@@ -81,15 +82,16 @@ full_name_hash(const unsigned char *name
  * large memory footprint increase).
  */
 #ifdef CONFIG_64BIT
-#define DNAME_INLINE_LEN_MIN 32 /* 192 bytes */
+#define DNAME_INLINE_LEN_MIN 24 /* 192 bytes */
 #else
-#define DNAME_INLINE_LEN_MIN 40 /* 128 bytes */
+#define DNAME_INLINE_LEN_MIN 36 /* 128 bytes */
 #endif
 
 struct dentry {
 	unsigned int d_count;		/* protected by d_lock */
 	unsigned int d_flags;		/* protected by d_lock */
 	spinlock_t d_lock;		/* per dentry lock */
+	seqcount_t d_seq;		/* per dentry seqlock */
 	int d_mounted;
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
 					 * negative */
@@ -283,9 +285,11 @@ extern void d_move(struct dentry *, stru
 extern struct dentry *d_ancestor(struct dentry *, struct dentry *);
 
 /* appendix may either be NULL or be used for transname suffixes */
-extern struct dentry * d_lookup(struct dentry *, struct qstr *);
-extern struct dentry * __d_lookup(struct dentry *, struct qstr *);
-extern struct dentry * d_hash_and_lookup(struct dentry *, struct qstr *);
+extern struct dentry *d_lookup(struct dentry *, struct qstr *);
+extern struct dentry *__d_lookup(struct dentry *, struct qstr *);
+extern struct dentry *d_lookup_rcu(struct dentry *, struct qstr *);
+extern struct dentry *__d_lookup_rcu(struct dentry *, struct qstr *);
+extern struct dentry *d_hash_and_lookup(struct dentry *, struct qstr *);
 
 /* validate "insecure" dentry pointer */
 extern int d_validate(struct dentry *, struct dentry *);

* Re: [rfc][patch] store-free path walking
From: Jens Axboe @ 2009-10-07  9:56 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Wed, Oct 07 2009, Nick Piggin wrote:
> OK, I have a really basic patch that does store-free path walking
> (except on the final element).
> [...]
> Anyway, this is the basics working for now; a microbenchmark shows
> same-cwd lookups scale linearly now too. We can probably slowly tackle
> more cases as they come up as being important, simply by auditing
> filesystems etc.

                             throughput
  ------------------------------------------------
  2.6.32-rc3-git           |  561.218 MB/sec
  2.6.32-rc3-git+patch     |  627.022 MB/sec
  2.6.32-rc3-git+patch+inc |  969.761 MB/sec

So better, quite a bit too. Latencies are not listed here, but they are
also a lot better. Perf top still shows ~95% spinlock time. I did a
shorter run (the above are full 600-second runs) of 60s with profiling
and the full 64 clients, this time using -a as well (which generated
9.4GB of trace data!). The top is now:

  _spin_lock (92%)
      path_get (39%)
          d_path (59%)
          path_init (26%)
          path_walk (13%)
      dput (37%)
          path_put (86%)
          link_path_walk (13%)
      __d_path (23%)

And finally, this:

> +	if (!nd->dentry->d_inode) {
> +		spin_unlock(&nd->path.dentry->d_lock);
> +		return -EAGAIN;
> +	}

doesn't compile, it wants to be:

	if (!nd->path.dentry->d_inode) {

-- 
Jens Axboe

* Re: [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-07 10:10 UTC
To: Jens Axboe
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds

On Wed, Oct 07, 2009 at 11:56:57AM +0200, Jens Axboe wrote:
>                              throughput
>   ------------------------------------------------
>   2.6.32-rc3-git           |  561.218 MB/sec
>   2.6.32-rc3-git+patch     |  627.022 MB/sec
>   2.6.32-rc3-git+patch+inc |  969.761 MB/sec
>
> So better, quite a bit too. Latencies are not listed here, but they are
> also a lot better. Perf top still shows ~95% spinlock time.
> [...]
>   _spin_lock (92%)
>       path_get (39%)
>       dput (37%)
>       __d_path (23%)

path_init, path_walk, and link_path_walk are all non-lockless variants,
so the RCU walk is dropping out in some cases. path_put will be coming
significantly from locked lookups too. This could be improved by
expanding the cases we do the lockless walk for (or by allowing a
lockless walk to turn into a locked walk part-way through, rather than
restarting the whole thing, which is probably a very good idea anyway).

d_path and __d_path are... I think dbench doing something stupid.
Although even those could possibly be optimised to avoid d_lock as
well...

After looking at strace output from dbench, though, I'd rather take
profiles from real workloads before adding complexity (even a real samba
serving a netbench workload would be preferable to dbench, I think).

But it's always nice to see numbers and results. A nearly 2x increase
isn't too bad, even if it is still horribly choked.

> And finally, this:
>
> > +	if (!nd->dentry->d_inode) {
> > +		spin_unlock(&nd->path.dentry->d_lock);
> > +		return -EAGAIN;
> > +	}
>
> doesn't compile, it wants to be:
>
> 	if (!nd->path.dentry->d_inode) {

Ah, thanks, I forgot to refresh the patch.

* Re: [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-12  3:58 UTC
To: Jens Axboe
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

On Wed, Oct 07 2009, Jens Axboe wrote:
> So better, quite a bit too. Latencies are not listed here, but they are
> also a lot better. Perf top still shows ~95% spinlock time. I did a
> shorter run (the above are full 600-second runs) of 60s with profiling
> and the full 64 clients, this time using -a as well (which generated
> 9.4GB of trace data!).

Hey Jens,

Try changing the 'statvfs' syscall in dbench to 'statfs'. glibc has to
do some nasty stuff parsing /proc/mounts to make statvfs work. On my
2s8c opteron it goes like this:

  clients    vanilla kernel    vfs scale (MB/s)
  1          476               447
  2          1092              1128
  4          2027              2260
  8          2398              4200

Single-threaded performance isn't as good, so I need to look at the
reasons for that :(. But it's practically linearly scalable now. The
dropoff at 8 I'd say is probably due to the memory controllers running
out of steam rather than cacheline or lock contention.

Unfortunately we didn't just implement this posix API in-kernel, and
statfs is Linux-specific. But I think we have some spare room in the
statfs structure to pass back mount flags for statvfs.

Tridge, Samba people: measuring vfs performance with dbench in my effort
to improve Linux vfs scalability has shown the statvfs call you make to
be the final problematic issue for this workload, in particular the
reading of /proc/mounts that glibc does to implement it. We could add
complexity to the kernel to try to improve it, or we could extend the
statfs syscall so glibc can avoid the issue (requiring a glibc upgrade).
But I would like to know whether samba really uses statvfs()
significantly? A minimal sketch of the two calls from the userspace side
follows this mail.

Thanks,
Nick

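For reference, the userspace difference between the two calls, as a
minimal untested sketch (error handling omitted, path arbitrary):
statfs() is a single system call, while glibc's statvfs() makes the same
call underneath and then opens and parses /proc/mounts to fill in the
mount-flag field.

#include <sys/statfs.h>		/* statfs(2): the BSD/Linux extension */
#include <sys/statvfs.h>	/* statvfs(3): the POSIX interface */
#include <stdio.h>

int main(void)
{
	struct statfs sfs;
	struct statvfs svfs;

	/* One syscall, no /proc/mounts traffic. */
	if (statfs("/tmp", &sfs) == 0)
		printf("statfs:  f_bsize=%ld\n", (long)sfs.f_bsize);

	/* Same syscall underneath, plus glibc opening and parsing
	 * /proc/mounts to compute the f_flag mount flags. */
	if (statvfs("/tmp", &svfs) == 0)
		printf("statvfs: f_flag=%lu\n", (unsigned long)svfs.f_flag);

	return 0;
}
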
* Re: [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-12  5:59 UTC
To: Jens Axboe
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

On Mon, Oct 12, 2009 at 05:58:43AM +0200, Nick Piggin wrote:
> Try changing the 'statvfs' syscall in dbench to 'statfs'. glibc has to
> do some nasty stuff parsing /proc/mounts to make statvfs work. On my
> 2s8c opteron it goes like this:
>
>   clients    vanilla kernel    vfs scale (MB/s)
>   1          476               447
>   2          1092              1128
>   4          2027              2260
>   8          2398              4200
>
> Single-threaded performance isn't as good, so I need to look at the
> reasons for that :(. But it's practically linearly scalable now. The
> dropoff at 8 I'd say is probably due to the memory controllers running
> out of steam rather than cacheline or lock contention.

Ah, no: on a bigger machine it starts slowing down again due to shared
cwd contention, possibly from creat/unlink type events. This could be
improved by not restarting the entire path walk when we run into
trouble, but instead trying to proceed from the last successful element.

Anyway, if you do get a chance to run dbench with this modification, I
would appreciate seeing a profile with call traces (my bigger system is
ia64, which doesn't do perf yet).

Thanks,
Nick

* Re: [rfc][patch] store-free path walking
From: Jens Axboe @ 2009-10-12  8:20 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

On Mon, Oct 12 2009, Nick Piggin wrote:
> Ah, no: on a bigger machine it starts slowing down again due to shared
> cwd contention, possibly from creat/unlink type events. This could be
> improved by not restarting the entire path walk when we run into
> trouble, but instead trying to proceed from the last successful
> element.

I was starting to do a few runs, but there's something funky going on
here. The throughput rates are consistent throughout a single run, but
not at all between runs; I suspect this may be due to CPU placement. The
numbers also look pretty odd. Here's an example from a patched kernel
with dbench using statfs:

  Clients    Patched
  ------------------------
  1          1.00
  2          1.23
  4          2.96
  8          1.22
  16         0.89
  32         0.83
  64         0.83

The numbers fluctuate by as much as 20% from run to run. OK, so it seems
the FAIR_SLEEPERS sched feature is responsible for this; if I turn that
off I get more consistent numbers. The below table is -git vs the vfs
patches on -git. The baseline is -git with 1 client; > 1.00 is faster
and vice versa.

  Clients    Vanilla    VFS scale
  -----------------------------------------
  1          1.00       0.96
  2          1.69       1.71
  4          2.16       2.98
  8          0.99       1.00
  16         0.90       0.85

As you can see, it's still quickly spiralling into spending most of the
time (> 95%) spinning on a lock, which kills scaling.

> Anyway, if you do get a chance to run dbench with this modification, I
> would appreciate seeing a profile with call traces (my bigger system is
> ia64, which doesn't do perf yet).

For what number of clients?

-- 
Jens Axboe

* Re: [rfc][patch] store-free path walking
From: Jens Axboe @ 2009-10-12 11:00 UTC
To: Nick Piggin
Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

[-- Attachment #1: Type: text/plain, Size: 676 bytes --]

On Mon, Oct 12 2009, Jens Axboe wrote:
> > Anyway, if you do get a chance to run dbench with this modification, I
> > would appreciate seeing a profile with call traces (my bigger system is
> > ia64, which doesn't do perf yet).
>
> For what number of clients?

Attached are call graphs for 8 and 32 clients, with your latest patch
applied. At 8 clients, we're already at 80% spinlock time. While running
the 32-client record, I wanted to open a file in the tmpfs mount where
dbench was running; bash was spinning until dbench completed. At 32
clients, throughput is 75% of the 8-client case, and spinlock time there
is at 90%. I have attached both outputs.

-- 
Jens Axboe

[-- Attachment #2: 8client.txt.bz2 --]
[-- Type: application/octet-stream, Size: 53343 bytes --]

[-- Attachment #3: 32client.txt.bz2 --]
[-- Type: application/octet-stream, Size: 65409 bytes --]

* Re: [rfc][patch] store-free path walking
From: Christoph Hellwig @ 2009-10-13  1:26 UTC
To: Nick Piggin
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

On Mon, Oct 12, 2009 at 05:58:43AM +0200, Nick Piggin wrote:
> Tridge, Samba people: measuring vfs performance with dbench in my effort
> to improve Linux vfs scalability has shown the statvfs call you make to
> be the final problematic issue for this workload, in particular the
> reading of /proc/mounts that glibc does to implement it.
> [...]
> But I would like to know whether samba really uses statvfs()
> significantly?

Not sure if it's the reason why Samba uses it, but many portable
applications use statvfs because that is the standardized one in XPG /
recent POSIX, while statfs is just a BSD extension Linux picked up. So
making sure statvfs goes fast is a pretty essential thing.

* Re: [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-13  1:52 UTC
To: Christoph Hellwig
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, samba-technical

On Mon, Oct 12, 2009 at 09:26:27PM -0400, Christoph Hellwig wrote:
> Not sure if it's the reason why Samba uses it, but many portable
> applications use statvfs because that is the standardized one in XPG /
> recent POSIX, while statfs is just a BSD extension Linux picked up. So
> making sure statvfs goes fast is a pretty essential thing.

OK, I'll keep that in mind. I think the best idea then will be to extend
statfs so that it returns the mount flags, allowing glibc to implement
statvfs nicely and get rid of the insane stuff it is doing now. I don't
know how we can version that nicely, but I guess I'll talk to the glibc
guys about it.

* Re: [rfc][patch] store-free path walking
From: Linus Torvalds @ 2009-10-07 14:56 UTC
To: Nick Piggin
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra

On Wed, 7 Oct 2009, Nick Piggin wrote:
> OK, I have a really basic patch that does store-free path walking
> (except on the final element).

Yay!

> dbench is still pretty nasty because it seems to do a lot of stupid
> things, like reading from /proc/mounts all the time.

You should largely forget about dbench. It can certainly be a useful
benchmark, but at the same time it's certainly not a _meaningful_ one.
There are better things to try.

> The seqlock in the dentry is for getting a consistent name/len pair, and
> also for not giving a false positive if a rename has partially
> overwritten the name string (false negatives are always fine in the
> lock-free lookup path, but false positives are not). Possibly we could
> make do with a per-sb seqlock for this (or just rename_lock).

My plan was always to just use rename_lock, and only do it at the outer
level (and do it for both lookup failures _and_ for the success case).
Your approach is _way_ more conservative than I would have done, and
also potentially much slower due to the seqlock-per-path-component
thing. Remember: seqlocks are almost free on x86, but they can be
reasonably expensive in other places.

Hmm. Regardless, this very much does look like what I envisioned, apart
from details like that. And maybe your per-dentry seqlock is the right
choice. On x86, it certainly doesn't have the performance issues it
could have in other places.

I'd like Al to take a look, if he's around.

		Linus

* Re: [rfc][patch] store-free path walking
From: Linus Torvalds @ 2009-10-07 16:27 UTC
To: Nick Piggin
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra

On Wed, 7 Oct 2009, Linus Torvalds wrote:
> Hmm. Regardless, this very much does look like what I envisioned, apart
> from details like that. And maybe your per-dentry seqlock is the right
> choice. On x86, it certainly doesn't have the performance issues it
> could have in other places.

Actually, if we really want to do the per-dentry thing, then we should
change it a bit. Rather than using a seqlock data structure (which is
really just an unsigned counter and a spinlock), we could do just the
unsigned counter, and use d_lock as the spinlock for the sequence lock.

The hackiest way to do that would be to get rid of d_lock entirely,
replace it with d_seqlock, and then just do

	#define d_lock d_seqlock.lock

instead (but the dentry structure may well have layout issues that make
that not work very well - we're mixing pointers and 'int'-sized things
and need to pack them well etc).

That would cut down the seqlock memory costs from 8 bytes (or more:
just the spinlock itself is currently 8 bytes on ia64, so on ia64 the
seqlock is actually 16 bytes, not to mention all the spinlock debugging
cases) to just four bytes.

However, I still suspect we could do things entirely without the
seqlock. The outer seqlock will handle the "couldn't find it" case, and
I've got the strongest feeling that we should be able to just use some
basic memory ordering on the dentry hash to make the inner seqlock
unnecessary (i.e. make sure that either we don't see the old entry at
all, or that we can guarantee it won't trigger a successful compare
while the rename is in process, because we set the dentry name length
to zero).

		Linus

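To make that idea concrete, here is a small userspace model of it
(illustrative only: names are invented and the memory ordering is
simplified compared to the kernel's real seqlock implementation). The
writer reuses an existing lock for serialization, so only a bare 4-byte
counter has to sit next to the protected data:

#include <pthread.h>
#include <string.h>

struct name {
	pthread_mutex_t lock;	/* plays the role of d_lock */
	unsigned int seq;	/* bare seqcount: 4 bytes, no embedded lock */
	char str[32];
};

/* Writer: serialized by the existing lock. Bump the counter to odd
 * before modifying the data, and back to even afterwards. */
static void name_write(struct name *n, const char *s)
{
	pthread_mutex_lock(&n->lock);
	__atomic_store_n(&n->seq, n->seq + 1, __ATOMIC_RELEASE);
	strncpy(n->str, s, sizeof(n->str) - 1);
	__atomic_store_n(&n->seq, n->seq + 1, __ATOMIC_RELEASE);
	pthread_mutex_unlock(&n->lock);
}

/* Reader: lock-free. Retry if a writer was active (odd count), or if
 * the count changed while we were copying. */
static void name_read(struct name *n, char *out, size_t len)
{
	unsigned int seq;

	do {
		seq = __atomic_load_n(&n->seq, __ATOMIC_ACQUIRE);
		if (seq & 1)
			continue;
		memcpy(out, n->str, len);
	} while (__atomic_load_n(&n->seq, __ATOMIC_ACQUIRE) != seq);
}
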
* Re: [rfc][patch] store-free path walking
From: Nick Piggin @ 2009-10-07 16:46 UTC
To: Linus Torvalds
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra

On Wed, Oct 07, 2009 at 09:27:59AM -0700, Linus Torvalds wrote:
> The hackiest way to do that would be to get rid of d_lock entirely,
> replace it with d_seqlock, and then just do
>
> 	#define d_lock d_seqlock.lock
>
> [...]
> That would cut down the seqlock memory costs from 8 bytes (or more:
> just the spinlock itself is currently 8 bytes on ia64, so on ia64 the
> seqlock is actually 16 bytes, not to mention all the spinlock debugging
> cases) to just four bytes.

Oh, I did that: I used a "seqcount", which is the bare sequence counter
(and I update it while holding d_lock). Yes, it still has packing
issues, although I think I can get rid of d_mounted, so the struct will
then pack nicely and its size won't change. (Just keep a flag for
whether the dentry is mounted on at all, and store the count elsewhere
for mountpoints; or even just search the mount hash on each umount to
see if anything is left mounted on it.)

> However, I still suspect we could do things entirely without the
> seqlock. The outer seqlock will handle the "couldn't find it" case, and
> I've got the strongest feeling that we should be able to just use some
> basic memory ordering on the dentry hash to make the inner seqlock
> unnecessary

Well, I would be all for improving things, of course. But keep in mind
we already take the rename_lock seqcount for each d_lookup, so the
lock-free lookup path is only doing extra seqcounts in the dcache hash
collision cases. But I do agree it needs more thought. I'll try to get
the powerpc guys interested in running tests for us tomorrow :)

* Re: [rfc][patch] store-free path walking
From: Linus Torvalds @ 2009-10-07 19:25 UTC
To: Nick Piggin
Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra

On Wed, 7 Oct 2009, Nick Piggin wrote:
> Oh, I did that: I used a "seqcount", which is the bare sequence counter
> (and I update it while holding d_lock).

Oh, I didn't notice, for a really silly reason: I looked at how you had
changed DNAME_INLINE_LEN_MIN on 64-bit, and noticed that you were using
8 bytes for your sequence lock. But that's only true due to the padding
things, and the seqcount itself is indeed just 4 bytes.

Sad. But we do have the added requirement that we want to keep the
commonly used fields together in the same cacheline, so maybe it's
unavoidable.

		Linus

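The padding effect is easy to demonstrate in isolation (userspace
sketch with an illustrative subset of fields, not the real struct
dentry): on an LP64 build, a lone 4-byte counter sandwiched between
pointer-aligned members costs a full 8 bytes.

#include <stdio.h>

struct dentry_without_seq {
	unsigned int d_count;
	unsigned int d_flags;
	void *d_lock;		/* stand-in for spinlock_t */
	void *d_inode;
};

struct dentry_with_seq {
	unsigned int d_count;
	unsigned int d_flags;
	void *d_lock;
	unsigned int d_seq;	/* 4 bytes of counter... */
	void *d_inode;		/* ...but 8 bytes of growth, because of
				 * the alignment padding before this
				 * pointer */
};

int main(void)
{
	printf("without seq: %zu bytes\n", sizeof(struct dentry_without_seq));
	printf("with seq:    %zu bytes\n", sizeof(struct dentry_with_seq));
	return 0;
}
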
* Re: [rfc][patch] store-free path walking
From: Andi Kleen @ 2009-10-07 20:34 UTC
To: Linus Torvalds
Cc: Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra

Linus Torvalds <torvalds@linux-foundation.org> writes:
> Sad. But we do have the added requirement that we want to keep the
> commonly used fields together in the same cacheline, so maybe it's
> unavoidable.

With some strategic prefetches it might be cheap enough to use two cache
lines. Just fetch the second one along with the first.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: [rfc][patch] store-free path walking 2009-10-07 20:34 ` Andi Kleen @ 2009-10-07 20:51 ` Linus Torvalds 2009-10-07 21:06 ` Andi Kleen 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2009-10-07 20:51 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, 7 Oct 2009, Andi Kleen wrote: > > With some strategic prefetches it might be cheap enough to use > two cache lines. Just fetch it with the other. I _seriously_ doubt that. This is one of the biggest (and hottest) hash tables in the whole kernel. No amount of prefetching will help the fact that you effectively double your cache footprint / working set if you have to fetch two cachelines in the hot case lookup rather than one. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 20:51 ` Linus Torvalds @ 2009-10-07 21:06 ` Andi Kleen 2009-10-07 21:20 ` Linus Torvalds 0 siblings, 1 reply; 57+ messages in thread From: Andi Kleen @ 2009-10-07 21:06 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, Oct 07, 2009 at 01:51:07PM -0700, Linus Torvalds wrote: > > > On Wed, 7 Oct 2009, Andi Kleen wrote: > > > > With some strategic prefetches it might be cheap enough to use > > two cache lines. Just fetch it with the other. > > I _seriously_ doubt that. > > This is one of the biggest (and hottest) hash tables in the whole kernel. > > No amount of prefetching will help the fact that you effectively double > your cache footprint / working set if you have to fetch two cachelines in > the hot case lookup rather than one. It won't double it, because there are many more cache lines from all kinds of other things in these paths. The point was just to hide the latency of fetching two at different times. The actual cache footprint is not that important, as long as it's not excessive; it's the latencies that hurt. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 21:06 ` Andi Kleen @ 2009-10-07 21:20 ` Linus Torvalds 2009-10-07 21:57 ` Linus Torvalds 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2009-10-07 21:20 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, 7 Oct 2009, Andi Kleen wrote: > > It won't double it, because there are many more cache lines from > all kinds of other things in these paths. Not all that many. The big part of the D$ is the hash table lookup in this path. > The point was just to hide the latency of fetching two at different > times. The actual cache footprint is not that important, as long as > it's not excessive You're full of sh*t. Every extra line you fetch pushes out another line. No amount of prefetching will hide that. So stop spreading idiotic fairytales. The best optimization is to not do the work at all. It really is that simple. Prefetching as an optimization technique absolutely sucks. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 21:20 ` Linus Torvalds @ 2009-10-07 21:57 ` Linus Torvalds 2009-10-07 22:22 ` Andi Kleen 2009-10-08 13:12 ` Denys Vlasenko 0 siblings, 2 replies; 57+ messages in thread From: Linus Torvalds @ 2009-10-07 21:57 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, 7 Oct 2009, Linus Torvalds wrote: > > Every extra line you fetch pushes out another line. No amount of > prefetching will hide that. Side note: what it _will_ do is hide the cost, and spread it out to other places where you don't see it any more, or you see it but can't do anything about it. And that's not a good thing. You end up with "muddy" profiles that don't have nice clear problem spots any more, so you may think that your profile looks better ("Look ma, no nasty hotspots!"), but actual performance hasn't improved one whit, and may well have decreased. This, btw, is exactly the kind of thing we saw with some of the non-temporal work, when we used nontemporal stores to copy pages on COW faults, or when doing pre-zeroing of pages. You get rid of some of the hot-spots in the kernel, and you then replace them with user space taking the cache misses in random spots instead. The kernel profile looks better, and system time may go down, but actual performance never went up - you just moved your cache miss cost from one place to another. There's no question that prefetching can help, but it helps only if it's about fetching data that you would need anyway, early. In contrast, if the option is "don't touch the other cacheline at all", prefetching is _always_ a loss. No ifs, buts and maybes about it. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 21:57 ` Linus Torvalds @ 2009-10-07 22:22 ` Andi Kleen 2009-10-08 7:39 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Andi Kleen @ 2009-10-07 22:22 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, Oct 07, 2009 at 02:57:20PM -0700, Linus Torvalds wrote: > There's no question that prefetching can help, but it helps only if > it's about fetching data that you would need anyway, early. In contrast, if > the option is "don't touch the other cacheline at all", prefetching is > _always_ a loss. No ifs, buts and maybes about it. My point (probably not very well expressed) was that in a typical VFS operation there are hundreds of cache lines touched for various things (code, global, various objects, random stuff) and one more or less in the final dentry is not that big a difference in the global picture. (ok I realize final in this case means the elements in the path) Also typical operations don't do the same VFS operation in a loop, but other things that cool the caches first and then have to fetch everything in again. I agree that touching more cache lines on the hash chain walk for intermediates would be dumb because there can be potentially a lot of them, but the final referenced ones are much fewer. Or rather if minimizing total footprint is the goal there are lower hanging fruit than in the dentry itself by just cutting fat from the whole path. (e.g. I liked Mathieu's immediate value work recently reposted because it had the nice potential to remove a lot of "global" cache lines in such paths by pushing them into the icache). And if it's possible to do fewer dcache locks or less looping in a seqlock by paying with one cache line that could be a reasonable trade-off, assuming you can hide the latency. Or maybe not and I'm totally wrong on that. I'll shut up on this now. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 22:22 ` Andi Kleen @ 2009-10-08 7:39 ` Nick Piggin 2009-10-09 17:53 ` Andi Kleen 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-08 7:39 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08, 2009 at 12:22:35AM +0200, Andi Kleen wrote: > On Wed, Oct 07, 2009 at 02:57:20PM -0700, Linus Torvalds wrote: > > There's no question that prefetching can help, but it helps only if > > it's about fetching data that you would need anyway, early. In contrast, if > > the option is "don't touch the other cacheline at all", prefetching is > > _always_ a loss. No ifs, buts and maybes about it. > > My point (probably not very well expressed) > was that in a typical VFS operation there are hundreds > of cache lines touched for various things (code, global, various > objects, random stuff) and one more or less in the final dentry > is not that big a difference in the global picture. > (ok I realize final in this case means the elements in the path) Yes I do know what you were getting at... Although I would say dcache is still fairly significant. It's 3 cachelines for every single path element we look up. Now I suspect it has been a while since someone closely looked at the cache layout with an eye to the lookup code. Actually no, it does look relatively sane (I guess it doesn't change that much), except it might actually be a nice idea to move d_iname up to just above d_lru, so we could actually do path lookup in 2 cachelines per entry (or even 1 if they have very short names). I will do a bit of profiling on the lookup code and see... ^ permalink raw reply [flat|nested] 57+ messages in thread
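The layout idea reads roughly as follows. This is a hypothetical field ordering with assumed 64-byte cachelines, not the real struct dentry, and the field set is trimmed to make the point: everything the hash walk and name compare touch sits in the first cacheline(s), and fields only dput/reclaim need come later.

#include <linux/list.h>
#include <linux/seqlock.h>
#include <linux/dcache.h>	/* struct qstr */

struct dentry_layout_sketch {
	/* cacheline 0: hot for every lookup */
	unsigned int d_flags;
	seqcount_t d_seq;			/* rename detection for lock-free walk */
	struct hlist_node d_hash;		/* chain walked on every lookup */
	struct dentry_layout_sketch *d_parent;
	struct qstr d_name;			/* hash, len, name pointer compared */
	unsigned char d_iname[32];		/* short names resolve without chasing d_name.name */
	/* later cachelines: only refcounting/reclaim touch these */
	struct list_head d_lru;
	struct list_head d_subdirs;
	struct list_head d_alias;
};

Moving d_iname adjacent to the compared fields is exactly what would let a short-name lookup stay within fewer lines per path element.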
* Re: [rfc][patch] store-free path walking 2009-10-08 7:39 ` Nick Piggin @ 2009-10-09 17:53 ` Andi Kleen 0 siblings, 0 replies; 57+ messages in thread From: Andi Kleen @ 2009-10-09 17:53 UTC (permalink / raw) To: Nick Piggin Cc: Andi Kleen, Linus Torvalds, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08, 2009 at 09:39:30AM +0200, Nick Piggin wrote: > > My point (probably not very well expressed) > > was that in a typical VFS operation there are hundreds > > of cache lines touched for various things (code, global, various > > objects, random stuff) and one more or less in the final dentry > > is not that big a difference in the global picture. > > (ok I realize final in this case means the elements in the path) > > Yes I do know what you were getting at... Although I would > say dcache is still fairly significant. It's 3 cachelines > for every single path element we look up. Now I suspect it That's true, for deeply nested paths where the elements add up it's a concern. I was assuming that paths are not that deep, but that might be too optimistic. > Actually no, it does look relatively sane (I guess it doesn't > change that much), except it might actually be a nice idea > to move d_iname up to just above d_lru, so we could actually > do path lookup in 2 cachelines per entry (or even 1 if > they have very short names). > > I will do a bit of profiling on the lookup code and see... Yes, numbers would be great. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 21:57 ` Linus Torvalds 2009-10-07 22:22 ` Andi Kleen @ 2009-10-08 13:12 ` Denys Vlasenko 2009-10-09 7:47 ` Nick Piggin 2009-10-09 17:49 ` Andi Kleen 1 sibling, 2 replies; 57+ messages in thread From: Denys Vlasenko @ 2009-10-08 13:12 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, Oct 7, 2009 at 11:57 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > This, btw, is exactly the kind of thing we saw with some of the > non-temporal work, when we used nontemporal stores to copy pages on COW > faults, or when doing pre-zeroing of pages. You get rid of some of the > hot-spots in the kernel, and you then replace them with user space taking > the cache misses in random spots instead. The kernel profile looks better, > and system time may go down, but actual performance never went up - you > just moved your cache miss cost from one place to another. A few years ago when K7s were not ancient yet, after hearing arguments for and against non-temporal stores, I decided to finally figure it out for myself. I tested a kernel build workload on two kernels with only one difference - clear_page with and without non-temporal stores. The "non-temporal stores" kernel was faster, not slower. Just a little bit, but reproducibly. -- vda ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 13:12 ` Denys Vlasenko @ 2009-10-09 7:47 ` Nick Piggin 2009-10-09 17:49 ` Andi Kleen 1 sibling, 0 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-09 7:47 UTC (permalink / raw) To: Denys Vlasenko Cc: Linus Torvalds, Andi Kleen, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08, 2009 at 03:12:08PM +0200, Denys Vlasenko wrote: > On Wed, Oct 7, 2009 at 11:57 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > This, btw, is exactly the kind of thing we saw with some of the > > non-temporal work, when we used nontemporal stores to copy pages on COW > > faults, or when doing pre-zeroing of pages. You get rid of some of the > > hot-spots in the kernel, and you then replace them with user space taking > > the cache misses in random spots instead. The kernel profile looks better, > > and system time may go down, but actual performance never went up - you > > just moved your cache miss cost from one place to another. > > A few years ago when K7s were not ancient yet, after hearing > arguments for and against non-temporal stores, > I decided to finally figure it out for myself. > > I tested a kernel build workload on two kernels with only > one difference - clear_page with and without non-temporal stores. > > The "non-temporal stores" kernel was faster, not slower. Just a little bit, > but reproducibly. It is going to be highly dependent on architecture and workload and exactly where you use the nontemporal stores, of course. I would say that with non-temporal stores in clear_page (a case where we can often expect the memory to be used again quickly because it is anonymous process memory), we are quite likely to cause _more_ activity on the memory controller and DIMMs, which costs far more power than cache accesses. ^ permalink raw reply [flat|nested] 57+ messages in thread
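For reference, the variant being benchmarked amounts to something like this userspace sketch (x86-64 with SSE2; the kernel's real clear_page implementations are arch-specific assembly and differ from this):

#include <emmintrin.h>

#define PAGE_SZ 4096

/* Zero one page using non-temporal stores that bypass the cache. */
static void clear_page_nt(void *page)
{
	__m128i zero = _mm_setzero_si128();
	char *p = page;
	int i;

	for (i = 0; i < PAGE_SZ; i += 64) {
		/* movntdq: write-combined, allocates no cachelines */
		_mm_stream_si128((__m128i *)(p + i +  0), zero);
		_mm_stream_si128((__m128i *)(p + i + 16), zero);
		_mm_stream_si128((__m128i *)(p + i + 32), zero);
		_mm_stream_si128((__m128i *)(p + i + 48), zero);
	}
	_mm_sfence();	/* order the NT stores before the page is handed out */
}

Both positions in the argument are visible here: the stores displace nothing from the cache, but the zeroed page arrives cold for whoever touches it next.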
* Re: [rfc][patch] store-free path walking 2009-10-08 13:12 ` Denys Vlasenko 2009-10-09 7:47 ` Nick Piggin @ 2009-10-09 17:49 ` Andi Kleen 1 sibling, 0 replies; 57+ messages in thread From: Andi Kleen @ 2009-10-09 17:49 UTC (permalink / raw) To: Denys Vlasenko Cc: Linus Torvalds, Andi Kleen, Nick Piggin, Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra > A few years ago when K7s were not ancient yet, after hearing > arguments for and against non-temporal stores, > I decided to finally figure it out for myself. > > I tested a kernel build workload on two kernels with only > one difference - clear_page with and without non-temporal stores. > > The "non-temporal stores" kernel was faster, not slower. Just a little bit, > but reproducibly. I did the same experiments, and in my case (K8) they were slower in real-world benchmarks for clear_page (but faster in microbenchmarks). But I didn't propose non-temporal stores anyway. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 14:56 ` Linus Torvalds 2009-10-07 16:27 ` Linus Torvalds @ 2009-10-07 16:29 ` Nick Piggin 2009-10-08 12:36 ` Nick Piggin 2 siblings, 0 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-07 16:29 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote: > > > On Wed, 7 Oct 2009, Nick Piggin wrote: > > > > OK, I have a really basic patch that does store-free path walking > > (except on the final element). > > Yay! > > > dbench is pretty nasty still because it seems to do a lot of stupid > > things like reading from /proc/mounts all the time. > > You should largely forget about dbench, it can certainly be a useful > benchmark, but at the same time it's certainly not a _meaningful_ one. > There are better things to try. Yes, sure. I'm just pointing out that it seems to do insane things (like reading /proc/mounts at regular intervals, although I don't see that in the dbench source so I really hope it isn't libc being "smart"). I agree it is not a very good benchmark. > > The seqlock in the dentry is for getting consistent name,len pointer, > > and also not giving a false positive if a rename has partially > > overwritten the name string (false negatives are always fine in the > > lock free lookup path but false positives are not). Possibly we > > could make do with a per-sb seqlock for this (or just rename_lock). > > My plan was always to just use rename_lock, and only do it at the outer > level (and do it for both lookup failures _and_ for the success case). > Your approach is _way_ more conservative than I would have done, and also > potentially much slower due to the seqlock-per-path-component thing. Hmm, the only issue is that we need a consistent load of the name pointer and the length, otherwise our memcmp might go crazy. We could solve this by another level of indirection so a rename only requires a pointer swap... But anyway, in this approach I only use a single seqlock, because the negative case always falls out to the locked walk anyway (this again might be a bit conservative and something we could tighten up). > Remember: seqlocks are almost free on x86, but they can be reasonably > expensive in other places. > > Hmm. Regardless, this very much does look like what I envisioned, apart > from details like that. And maybe your per-dentry seqlock is the right > choice. On x86, it certainly doesn't have the performance issues it could > have in other places. Yeah, well at least the basics seem to be there. I agree it is not totally clean and will have some cases that need optimising, but it is something people can start looking at... ^ permalink raw reply [flat|nested] 57+ messages in thread
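The memcmp hazard is concrete: if a rename updates the name pointer and the length non-atomically, a lock-free walker can pair an old pointer with a new length. A sketch of the per-dentry retry read, assuming the seqcount field the patch adds (called d_seq here; the names are illustrative, not the actual code):

#include <linux/dcache.h>
#include <linux/seqlock.h>
#include <linux/string.h>

static int dentry_name_eq(struct dentry *dentry, const char *name,
			  unsigned int len)
{
	unsigned int seq;
	int eq;

	do {
		seq = read_seqcount_begin(&dentry->d_seq);
		eq = dentry->d_name.len == len &&
		     memcmp(dentry->d_name.name, name, len) == 0;
	} while (read_seqcount_retry(&dentry->d_seq, seq));
	/*
	 * The loop guarantees len and name came from the same update,
	 * so a half-overwritten name can never yield a false positive.
	 */
	return eq;
}

The per-sb or rename_lock alternative mentioned above would replace d_seq with one global counter, trading retry granularity for dentry size.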
* Re: [rfc][patch] store-free path walking 2009-10-07 14:56 ` Linus Torvalds 2009-10-07 16:27 ` Linus Torvalds 2009-10-07 16:29 ` Nick Piggin @ 2009-10-08 12:36 ` Nick Piggin 2009-10-08 12:57 ` Jens Axboe 2009-10-09 3:50 ` Nick Piggin 2 siblings, 2 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-08 12:36 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote: > On Wed, 7 Oct 2009, Nick Piggin wrote: > > > > OK, I have a really basic patch that does store-free path walking > > (except on the final element). > > Yay! > > > dbench is pretty nasty still because it seems to do a lot of stupid > > things like reading from /proc/mounts all the time. > > You should largely forget about dbench, it can certainly be a useful > benchmark, but at the same time it's certainly not a _meaningful_ one. > There are better things to try. OK, here's one you might find interesting. It is a cached git diff workload in a linux kernel tree. I actually ran it in a loop 100 times in order to get some reasonable sample sizes, then I ran parallel and serial configs (PreloadIndex = true/false). Compared plain kernel with all vfs patches to now. 2.6.32-rc3 serial 5.35user 7.12system 0:12.47elapsed 100%CPU 2.6.32-rc3 parallel 5.79user 17.69system 0:09.41elapsed 249%CPU vfs serial 5.30user 5.62system 0:10.92elapsed 100%CPU vfs parallel 4.86user 0.68system 0:06.82elapsed 81%CPU (I don't know what happened with CPU accounting on the last one, but elapsed time was accurate). The profiles are interesting. It's pretty verbose but I've included just the backtraces for the locking functions. serial plain # Samples: 288849 # # Overhead Command Shared Object # ........ .............. ................................ 
# 55.46% git [kernel] | |--36.52%-- __d_lookup |--9.57%-- __link_path_walk |--6.26%-- _atomic_dec_and_lock | | | |--39.42%-- dput | | | | | |--53.66%-- path_put | | | | | | | |--90.91%-- vfs_fstatat | | | | vfs_lstat | | | | sys_newlstat | | | | system_call_fastpath | | | | | | | --9.09%-- path_walk | | | do_path_lookup | | | user_path_at | | | vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --46.34%-- __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--31.73%-- path_put | | | | | |--57.58%-- vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --42.42%-- path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--21.15%-- __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --7.69%-- mntput_no_expire | path_put | | | |--50.00%-- vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --50.00%-- path_walk | do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--5.78%-- strncpy_from_user |--5.60%-- _spin_unlock | | | |--88.17%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--4.30%-- path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--3.23%-- do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--2.15%-- handle_mm_fault | | do_page_fault | | page_fault | | | --2.15%-- __d_lookup | do_lookup | __link_path_walk | path_walk | do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--5.17%-- generic_fillattr |--2.95%-- acl_permission_check |--1.87%-- groups_search |--1.81%-- kmem_cache_free |--1.68%-- system_call |--1.62%-- clear_page_c |--1.56%-- do_lookup |--1.44%-- _spin_lock | | | |--58.33%-- __d_lookup | | do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--20.83%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--16.67%-- do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | --4.17%-- copy_process | do_fork | sys_clone | stub_clone | __libc_fork | 0x494a5d | |--1.38%-- dput |--1.38%-- mntput_no_expire |--1.32%-- cp_new_stat |--1.26%-- path_walk |--1.20%-- sysret_check |--1.08%-- kmem_cache_alloc |--0.96%-- __follow_mount |--0.96%-- copy_user_generic_string |--0.66%-- in_group_p |--0.54%-- page_fault --7.40%-- [...] So serial case still has significant time in locking. 13% of all kernel cycles. vfs # Samples: 254207 # # Overhead Command Shared Object # ........ .............. ................................
# 53.15% git [kernel] | |--37.47%-- __d_lookup_rcu |--15.63%-- link_path_walk_rcu |--6.70%-- strncpy_from_user |--5.65%-- generic_fillattr |--3.49%-- _spin_lock | | | |--66.00%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--14.00%-- mntput_no_expire | | mntput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--6.00%-- link_path_walk_rcu | | do_path_lookup | | | | | |--66.67%-- user_path_at | | | vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --33.33%-- do_filp_open | | do_sys_open | | sys_open | | system_call_fastpath | | | |--4.00%-- path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--4.00%-- do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--2.00%-- anon_vma_link | | dup_mm | | copy_process | | do_fork | | sys_clone | | stub_clone | | __libc_fork | | | |--2.00%-- do_page_fault | | page_fault | | | --2.00%-- vfsmount_read_lock | mntput_no_expire | mntput | path_put | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--2.44%-- kmem_cache_free |--1.95%-- system_call |--1.88%-- groups_search |--1.81%-- do_path_lookup |--1.54%-- cp_new_stat |--1.33%-- clear_page_c |--1.33%-- kmem_cache_alloc |--1.12%-- mntput_no_expire |--1.05%-- do_lookup_rcu |--0.98%-- dput |--0.91%-- page_fault |--0.91%-- copy_user_generic_string |--0.77%-- sysret_check |--0.77%-- in_group_p |--0.77%-- getname |--0.70%-- _spin_unlock | | | |--30.00%-- mntput_no_expire | | mntput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--20.00%-- link_path_walk_rcu | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--10.00%-- handle_mm_fault | | do_page_fault | | page_fault | | 0x45f62a | | | |--10.00%-- vfsmount_read_unlock | | mntput_no_expire | | mntput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--10.00%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | |--10.00%-- path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | --10.00%-- do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | __lxstat | |--0.63%-- path_put |--0.56%-- copy_page_c |--0.56%-- user_path_at --9.07%-- [...] Locking goes to about 4%, significantly from dput of the final dentry element, which is basically impossible to avoid, so we're much closer to optimal. The parallel case is interesting too. plain # Samples: 635836 # # Overhead Command Shared Object # ........ .............. ................................
# 76.39% git [kernel] | |--32.26%-- _atomic_dec_and_lock | | | |--60.44%-- dput | | | | | |--51.15%-- path_put | | | | | | | |--94.91%-- path_walk | | | | do_path_lookup | | | | user_path_at | | | | vfs_fstatat | | | | vfs_lstat | | | | sys_newlstat | | | | system_call_fastpath | | | | | | | --5.09%-- vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --48.85%-- __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--14.04%-- mntput_no_expire | | path_put | | | | | |--51.29%-- path_walk | | | do_path_lookup | | | user_path_at | | | vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --48.71%-- vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--13.01%-- path_put | | | | | |--95.81%-- path_walk | | | do_path_lookup | | | user_path_at | | | vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --4.19%-- vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --12.52%-- __link_path_walk | path_walk | do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--13.23%-- path_walk |--12.94%-- __d_lookup |--7.81%-- do_path_lookup |--7.53%-- path_init |--3.84%-- __link_path_walk |--2.36%-- acl_permission_check |--2.15%-- _spin_lock | | | |--42.73%-- _atomic_dec_and_lock | | dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--39.09%-- __d_lookup | | do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--9.09%-- do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--8.18%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --0.91%-- system_call_fastpath | 0x7fb0fcf23257 | 0x7fb0fcf158bd | |--2.01%-- generic_fillattr |--1.76%-- _spin_unlock | | | |--85.56%-- dput | | path_put | | | | | |--98.70%-- vfs_fstatat | | | vfs_lstat | | | sys_newlstat | | | system_call_fastpath | | | | | --1.30%-- __link_path_walk | | path_walk | | do_path_lookup | | do_filp_open | | do_sys_open | | sys_open | | system_call_fastpath | | | |--5.56%-- __d_lookup | | do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--4.44%-- path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--2.22%-- do_lookup | | __link_path_walk | | path_walk | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--1.11%-- handle_mm_fault | | do_page_fault | | page_fault | | | --1.11%-- update_process_times | tick_sched_timer | __run_hrtimer | hrtimer_interrupt | smp_apic_timer_interrupt | apic_timer_interrupt | |--1.62%-- _read_unlock | | | |--75.90%-- path_init | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --24.10%-- do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--1.29%-- strncpy_from_user |--1.17%-- path_put |--1.01%-- dput |--0.62%-- kmem_cache_free |--0.60%-- do_lookup |--0.59%-- clear_page_c We can see it is really starting to choke on atomic_dec_and_lock. 
I don't know how many tasks you spawn off in git here, but it looks like this is nearing the absolute limit of scalability. vfs # Samples: 273522 # # Overhead Command Shared Object # ........ .............. ................................ # 48.24% git [kernel] | |--32.37%-- __d_lookup_rcu |--14.14%-- link_path_walk_rcu |--7.57%-- _read_unlock | | | |--96.46%-- path_init_rcu | | do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | --3.54%-- do_path_lookup | user_path_at | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | |--7.04%-- generic_fillattr |--5.50%-- strncpy_from_user |--2.68%-- kmem_cache_free |--2.55%-- _spin_lock | | | |--81.58%-- dput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--5.26%-- do_path_lookup | | user_path_at | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | | |--5.26%-- try_to_wake_up | | | | | |--50.00%-- wake_up_state | | | wake_futex | | | futex_wake | | | do_futex | | | sys_futex | | | mm_release | | | exit_mm | | | do_exit | | | sys_exit | | | system_call_fastpath | | | start_thread | | | | | --50.00%-- wake_up_process | | __up_write | | up_write | | sys_mmap | | system_call_fastpath | | mmap64 | | | |--5.26%-- vfsmount_read_lock | | mntput_no_expire | | mntput | | path_put | | vfs_fstatat | | vfs_lstat | | sys_newlstat | | system_call_fastpath | | __lxstat | | | | | |--50.00%-- 0x7f7640b9e2c0 | | | 0x4ab3b1fc | | | | | --50.00%-- 0x7f7640bb4e78 | | 0x4a803476 | | | --2.63%-- path_put | vfs_fstatat | vfs_lstat | sys_newlstat | system_call_fastpath | __lxstat | 0x7f7640d7f488 | 0x4a8034a4 | |--2.48%-- clear_page_c |--1.61%-- system_call |--1.47%-- copy_user_generic_string |--1.41%-- cp_new_stat |--1.41%-- groups_search |--1.21%-- do_lookup_rcu |--0.94%-- kmem_cache_alloc |--0.94%-- do_path_lookup |--0.87%-- in_group_p |--0.80%-- page_fault |--0.80%-- sysret_check |--0.74%-- dput |--0.67%-- getname |--0.67%-- user_path_at |--0.67%-- mntput_no_expire |--0.60%-- unmap_vmas |--0.54%-- _spin_unlock |--0.54%-- vfs_fstatat |--0.54%-- path_init_rcu --9.25%-- [...] This one is interesting. spin_lock/spin_unlock remains very low; however, read_unlock pops up. This would be... fs->lock. You're using threads then (rather than processes)? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 12:36 ` Nick Piggin @ 2009-10-08 12:57 ` Jens Axboe 2009-10-08 13:22 ` Nick Piggin 2009-10-09 3:50 ` Nick Piggin 1 sibling, 1 reply; 57+ messages in thread From: Jens Axboe @ 2009-10-08 12:57 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08 2009, Nick Piggin wrote: > On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote: > > On Wed, 7 Oct 2009, Nick Piggin wrote: > > > > > > OK, I have a really basic patch that does store-free path walking > > > (except on the final element). > > > > Yay! > > > > > dbench is pretty nasty still because it seems to do a lot of stupid > > > things like reading from /proc/mounts all the time. > > > > You should largely forget about dbench, it can certainly be a useful > > benchmark, but at the same time it's certainly not a _meaningful_ one. > > There are better things to try. > > OK, here's one you might find interesting. It is a cached git diff > workload in a linux kernel tree. I actually ran it in a loop 100 > times in order to get some reasonable sample sizes, then I ran > parallel and serial configs (PreloadIndex = true/false). Compared > plain kernel with all vfs patches to now. > > 2.6.32-rc3 serial > 5.35user 7.12system 0:12.47elapsed 100%CPU > > 2.6.32-rc3 parallel > 5.79user 17.69system 0:09.41elapsed 249%CPU > > vfs serial > 5.30user 5.62system 0:10.92elapsed 100%CPU > > vfs parallel > 4.86user 0.68system 0:06.82elapsed 81%CPU Since the box was booted anyway, I tried the git test too. Results are with 2.6.32-rc3 serial being the baseline 1.00 scores, smaller than 1.00 are faster and vice versa. 2.6.32-rc3 serial real 1.00 user 1.00 sys 1.00 2.6.32-rc3 parallel real 0.80 user 0.83 sys 8.38 sys time, auch... vfs serial real 0.86 user 0.93 sys 0.84 vfs parallel real 0.43 user 0.72 sys 0.13 Let me know if you want profiles or anything like that. I'd say that looks veeeery tasty. -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 12:57 ` Jens Axboe @ 2009-10-08 13:22 ` Nick Piggin 2009-10-08 13:30 ` Jens Axboe 2009-10-09 8:54 ` Jens Axboe 0 siblings, 2 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-08 13:22 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08, 2009 at 02:57:46PM +0200, Jens Axboe wrote: > On Thu, Oct 08 2009, Nick Piggin wrote: > > On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote: > > > On Wed, 7 Oct 2009, Nick Piggin wrote: > > > > > > > > OK, I have a really basic patch that does store-free path walking > > > > (except on the final element). > > > > > > Yay! > > > > > > > dbench is pretty nasty still because it seems to do a lot of stupid > > > > things like reading from /proc/mounts all the time. > > > > > > You should largely forget about dbench, it can certainly be a useful > > > benchmark, but at the same time it's certainly not a _meaningful_ one. > > > There are better things to try. > > > > OK, here's one you might find interesting. It is a cached git diff > > workload in a linux kernel tree. I actually ran it in a loop 100 > > times in order to get some reasonable sample sizes, then I ran > > parallel and serial configs (PreloadIndex = true/false). Compared > > plain kernel with all vfs patches to now. > > > > 2.6.32-rc3 serial > > 5.35user 7.12system 0:12.47elapsed 100%CPU > > > > 2.6.32-rc3 parallel > > 5.79user 17.69system 0:09.41elapsed 249%CPU > > > > vfs serial > > 5.30user 5.62system 0:10.92elapsed 100%CPU > > > > vfs parallel > > 4.86user 0.68system 0:06.82elapsed 81%CPU > > Since the box was booted anyway, I tried the git test too. Results are > with 2.6.32-rc3 serial being the baseline 1.00 scores, smaller than 1.00 > are faster and vice versa. > > 2.6.32-rc3 serial > real 1.00 > user 1.00 > sys 1.00 > > 2.6.32-rc3 parallel > real 0.80 > user 0.83 > sys 8.38 > > sys time, auch... > > vfs serial > real 0.86 > user 0.93 > sys 0.84 This is actually nice too. My tests were on a 2s8c Barcelona system, but this is showing we have a nice serial win on Nehalem as well. Actually K8 CPUs have a bit faster lock primitives than earlier Intel CPUs I think (closer to Nehalem), so we might see an even bigger win with a Core2. > vfs parallel > real 0.43 > user 0.72 > sys 0.13 > > Let me know if you want profiles or anything like that. I'd say that > looks veeeery tasty. It doesn't look all that different to mine, so profiles probably not required at this point. Is the CPU accounting going wrong? It looks like thread times are not being accumulated back properly, which IIRC they should be. But 'real' time should be accurate, so it is going a lot faster. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 13:22 ` Nick Piggin @ 2009-10-08 13:30 ` Jens Axboe 2009-10-08 18:00 ` Peter Zijlstra 1 sibling, 1 reply; 57+ messages in thread From: Jens Axboe @ 2009-10-08 13:30 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08 2009, Nick Piggin wrote: > On Thu, Oct 08, 2009 at 02:57:46PM +0200, Jens Axboe wrote: > > On Thu, Oct 08 2009, Nick Piggin wrote: > > > On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote: > > > > On Wed, 7 Oct 2009, Nick Piggin wrote: > > > > > > > > > > OK, I have a really basic patch that does store-free path walking > > > > > (except on the final element). > > > > > > > > Yay! > > > > > > > > > dbench is pretty nasty still because it seems to do a lot of stupid > > > > > things like reading from /proc/mounts all the time. > > > > > > > > You should largely forget about dbench, it can certainly be a useful > > > > benchmark, but at the same time it's certainly not a _meaningful_ one. > > > > There are better things to try. > > > > > > OK, here's one you might find interesting. It is a cached git diff > > > workload in a linux kernel tree. I actually ran it in a loop 100 > > > times in order to get some reasonable sample sizes, then I ran > > > parallel and serial configs (PreloadIndex = true/false). Compared > > > plain kernel with all vfs patches to now. > > > > > > 2.6.32-rc3 serial > > > 5.35user 7.12system 0:12.47elapsed 100%CPU > > > > > > 2.6.32-rc3 parallel > > > 5.79user 17.69system 0:09.41elapsed 249%CPU > > > > > > vfs serial > > > 5.30user 5.62system 0:10.92elapsed 100%CPU > > > > > > vfs parallel > > > 4.86user 0.68system 0:06.82elapsed 81%CPU > > > > Since the box was booted anyway, I tried the git test too. Results are > > with 2.6.32-rc3 serial being the baseline 1.00 scores, smaller than 1.00 > > are faster and vice versa. > > > > 2.6.32-rc3 serial > > real 1.00 > > user 1.00 > > sys 1.00 > > > > 2.6.32-rc3 parallel > > real 0.80 > > user 0.83 > > sys 8.38 > > > > sys time, auch... > > > > vfs serial > > real 0.86 > > user 0.93 > > sys 0.84 > > This is actually nice too. My tests were on a 2s8c Barcelona system, > but this is showing we have a nice serial win on Nehalem as well. > Actually K8 CPUs have a bit faster lock primitives than earlier > Intel CPUs I think (closer to Nehalem), so we might see an even > bigger win with a Core2. Yes, this is just as interesting as the parallel results imho. I don't have a core 2 test box, so I cannot test that. > > vfs parallel > > real 0.43 > > user 0.72 > > sys 0.13 > > > > Let me know if you want profiles or anything like that. I'd say that > > looks veeeery tasty. > > It doesn't look all that different to mine, so profiles probably > not required at this point. Is the CPU accounting going wrong? It > looks like thread times are not being accumulated back properly, > which IIRC they should be. But 'real' time should be accurate, so > it is going a lot faster. Yes, it looks very similar, the higher CPU count just makes the parallel git preload on -rc3 stock look even more crappy (you had roughly 2x the sys time, I have roughly 8x) when compared to the serialized approach. IIRC, there was a bug with thread accounting very recently. Why would it not hit -rc3 alone, though? Does look fishy, though. -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 13:30 ` Jens Axboe @ 2009-10-08 18:00 ` Peter Zijlstra 2009-10-09 4:04 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Peter Zijlstra @ 2009-10-08 18:00 UTC (permalink / raw) To: Jens Axboe Cc: Nick Piggin, Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai On Thu, 2009-10-08 at 15:30 +0200, Jens Axboe wrote: > > IIRC, there was a bug with thread accounting very recently. Why would it > not hit -rc3 alone, though? Does look fishy, though. That was caused by: def0a9b2573e00ab0b486cb5382625203ab4c4a6 sched_clock: Make it NMI safe And was fixed by: 152f9d0710a62708710161bce1b29fa8292c8c11 sched_clock: Fix atomicity/continuity bug by using cmpxchg64() Which is in -rc3 ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 18:00 ` Peter Zijlstra @ 2009-10-09 4:04 ` Nick Piggin 0 siblings, 0 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-09 4:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Jens Axboe, Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai On Thu, Oct 08, 2009 at 08:00:45PM +0200, Peter Zijlstra wrote: > On Thu, 2009-10-08 at 15:30 +0200, Jens Axboe wrote: > > > > IIRC, there was a bug with thread accounting very recently. Why would it > > not hit -rc3 alone, though? Does look fishy, though. > > That was caused by: > def0a9b2573e00ab0b486cb5382625203ab4c4a6 sched_clock: Make it NMI safe > > And was fixed by: > 152f9d0710a62708710161bce1b29fa8292c8c11 sched_clock: Fix > atomicity/continuity bug by using cmpxchg64() > > Which is in -rc3 Well, whatever is happening here still happens with post-rc3 kernels. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 13:22 ` Nick Piggin 2009-10-08 13:30 ` Jens Axboe @ 2009-10-09 8:54 ` Jens Axboe 2009-10-09 9:51 ` Jens Axboe 2009-10-09 10:07 ` Nick Piggin 1 sibling, 2 replies; 57+ messages in thread From: Jens Axboe @ 2009-10-09 8:54 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08 2009, Nick Piggin wrote: > This is actually nice too. My tests were on a 2s8c Barcelona system, > but this is showing we have a nice serial win on Nehalem as well. > Actually K8 CPUs have a bit faster lock primitives than earlier > Intel CPUs I think (closer to Nehalem), so we might see an even > bigger win with a Core2. Ran it on another box. This isn't quite a core 2 though, it's a sparc64 (Niagara 2). It has 64 threads, too, but just 8 cores. 2.6.32-rc3 serial real 0m5.390s user 0m1.340s sys 0m2.970s 2.6.32-rc3 parallel real 0m2.009s user 0m0.900s sys 0m2.490s vfs serial real 0m4.816s user 0m1.250s sys 0m2.270s vfs parallel real 0m1.967s user 0m0.920s sys 0m1.960s So it's a win-win on that platform too. -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 8:54 ` Jens Axboe @ 2009-10-09 9:51 ` Jens Axboe 2009-10-09 10:02 ` Nick Piggin 2009-10-09 10:07 ` Nick Piggin 1 sibling, 1 reply; 57+ messages in thread From: Jens Axboe @ 2009-10-09 9:51 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, chris.mason Nick, One more thing... I see you converted part of btrfs, but there's still this one sitting in btrfs_invalidate_inodes() if (atomic_read(&inode->i_count) > 1) d_prune_aliases(inode); Not sure how best to solve that, with a __d_prune_aliases() that assumed the lock was held it would be easy. But perhaps you have better ideas, this email is more of a heads-up since perhaps you missed this spot (CC'ing Chris). -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 9:51 ` Jens Axboe @ 2009-10-09 10:02 ` Nick Piggin 2009-10-09 10:08 ` Jens Axboe 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-09 10:02 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, chris.mason On Fri, Oct 09, 2009 at 11:51:19AM +0200, Jens Axboe wrote: > Nick, > > One more thing... I see you converted part of btrfs, but there's still > this one sitting in btrfs_invalidate_inodes() > > if (atomic_read(&inode->i_count) > 1) > d_prune_aliases(inode); > > Not sure how best to solve that, with a __d_prune_aliases() that assumed > the lock was held it would be easy. But perhaps you have better ideas, > this email is more of a heads-up since perhaps you missed this spot > (CC'ing Chris). It's OK, you can load inode->i_count integer atomically -- in this sequence d_prune_aliases can't have assumed anything about i_count anyway because regardless of its type it might have changed in between. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 10:02 ` Nick Piggin @ 2009-10-09 10:08 ` Jens Axboe 0 siblings, 0 replies; 57+ messages in thread From: Jens Axboe @ 2009-10-09 10:08 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, chris.mason On Fri, Oct 09 2009, Nick Piggin wrote: > On Fri, Oct 09, 2009 at 11:51:19AM +0200, Jens Axboe wrote: > > Nick, > > > > One more thing... I see you converted part of btrfs, but there's still > > this one sitting in btrfs_invalidate_inodes() > > > > if (atomic_read(&inode->i_count) > 1) > > d_prune_aliases(inode); > > > > Not sure how best to solve that, with a __d_prune_aliases() that assumed > > the lock was held it would be easy. But perhaps you have better ideas, > > this email is more of a heads-up since perhaps you missed this spot > > (CC'ing Chris). > > It's OK, you can load inode->i_count integer atomically -- in this > sequence d_prune_aliases can't have assumed anything about i_count > anyway because regardless of its type it might have changed in > between. Right, it was already racy wrt i_count. So the below should be OK. diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d3dadb6..7e02b5d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3419,8 +3419,10 @@ again: inode = igrab(&entry->vfs_inode); if (inode) { spin_unlock(&root->inode_lock); - if (atomic_read(&inode->i_count) > 1) + + if (inode->i_count > 1) d_prune_aliases(inode); + /* * btrfs_drop_inode will remove it from * the inode cache when its usage count -- Jens Axboe ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 8:54 ` Jens Axboe 2009-10-09 9:51 ` Jens Axboe @ 2009-10-09 10:07 ` Nick Piggin 1 sibling, 0 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-09 10:07 UTC (permalink / raw) To: Jens Axboe, David Miller Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Fri, Oct 09, 2009 at 10:54:52AM +0200, Jens Axboe wrote: > On Thu, Oct 08 2009, Nick Piggin wrote: > > This is actually nice too. My tests were on a 2s8c Barcelona system, > > but this is showing we have a nice serial win on Nehalem as well. > > Actually K8 CPUs have a bit faster lock primitives than earlier > > Intel CPUs I think (closer to Nehalem), so we might see an even > > bigger win with a Core2. > > Ran it on another box. This isn't quite a core 2 though, it's a sparc64 > (Niagara 2). It has 64 threads, too, but just 8 cores. > > 2.6.32-rc3 serial > real 0m5.390s > user 0m1.340s > sys 0m2.970s > > 2.6.32-rc3 parallel > real 0m2.009s > user 0m0.900s > sys 0m2.490s > > vfs serial > real 0m4.816s > user 0m1.250s > sys 0m2.270s > > vfs parallel > real 0m1.967s > user 0m0.920s > sys 0m1.960s > > So it's a win-win on that platform too. That's nice to see; it's really quite a good win in the serial case on this CPU too. The parallel case, interestingly, is not improved. Whether it is because the locks are able to be bounced around much more quickly than on our multi-socket systems, or some quirk of the lots-of-chickens CPU that doesn't take so well to the workload, I don't know. Maybe it's even the fs->lock that will still be there in your patches. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-08 12:36 ` Nick Piggin 2009-10-08 12:57 ` Jens Axboe @ 2009-10-09 3:50 ` Nick Piggin 2009-10-09 6:15 ` David Miller 1 sibling, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-09 3:50 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra On Thu, Oct 08, 2009 at 02:36:22PM +0200, Nick Piggin wrote: > vfs > > # Samples: 273522 > # > # Overhead Command Shared Object > # ........ .............. ................................ > # > 48.24% git [kernel] > | > |--32.37%-- __d_lookup_rcu > |--14.14%-- link_path_walk_rcu > |--7.57%-- _read_unlock > | | > | |--96.46%-- path_init_rcu > | | do_path_lookup > | | user_path_at > | | vfs_fstatat > | | vfs_lstat > | | sys_newlstat > | | system_call_fastpath > | | > | --3.54%-- do_path_lookup > | user_path_at > | vfs_fstatat > | vfs_lstat > | sys_newlstat > | system_call_fastpath > This one is interesting. spin_lock/spin_unlock remains very low; however, > read_unlock pops up. This would be... fs->lock. You're using threads > then (rather than processes)? OK, I got rid of this guy from the RCU walk. Basically now hold vfsmount_lock over the entire RCU path walk (which also pins the mnt) and use a seqlock in the fs struct to get a consistent mnt,dentry pair. This also simplifies the walk because we don't need the complexity to avoid mntget/mntput (just do one final mntget on the resulting mnt before dropping vfsmount_lock). vfsmount_lock adds one per-cpu atomic for the spinlock, and we remove two thread-shared atomics for fs->lock so a net win for both single threaded performance and thread-shared scalability. Latency is no problem because we hold rcu_read_lock for the same length of time anyway. The parallel git diff workload is improved by several percent. Phew. I think I'll stop about here and try to start getting some of this crap cleaned up and start trying to get the rest of the filesystems done. ^ permalink raw reply [flat|nested] 57+ messages in thread
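The fs_struct side of this can be sketched as follows; the field and function names here are assumptions, not the actual patch. Writers bump a seqcount when root/pwd change, and the RCU walker snapshots the (mnt, dentry) pair without touching fs->lock:

#include <linux/seqlock.h>
#include <linux/path.h>

struct fs_struct_sketch {
	seqcount_t seq;		/* bumped by the set_fs_root()/set_fs_pwd() writers */
	struct path root, pwd;
};

/* Caller holds rcu_read_lock()/vfsmount_lock, so no refcounts are taken. */
static void fs_pwd_snapshot(struct fs_struct_sketch *fs, struct path *path)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&fs->seq);
		*path = fs->pwd;	/* mnt and dentry read as one consistent pair */
	} while (read_seqcount_retry(&fs->seq, seq));
}

That replaces two atomics on the thread-shared fs->lock with an uncontended counter read, which matches the single-threaded win reported here.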
* Re: [rfc][patch] store-free path walking 2009-10-09 3:50 ` Nick Piggin @ 2009-10-09 6:15 ` David Miller 2009-10-09 10:40 ` Nick Piggin 2009-10-09 10:44 ` Nick Piggin 0 siblings, 2 replies; 57+ messages in thread From: David Miller @ 2009-10-09 6:15 UTC (permalink / raw) To: npiggin; +Cc: torvalds, jens.axboe, linux-kernel, linux-fsdevel, kiran, peterz From: Nick Piggin <npiggin@suse.de> Date: Fri, 9 Oct 2009 05:50:50 +0200 > OK, I got rid of this guy from the RCU walk. Basically now hold > vfsmount_lock over the entire RCU path walk (which also pins the mnt) > and use a seqlock in the fs struct to get a consistent mnt,dentry > pair. This also simplifies the walk because we don't need the > complexity to avoid mntget/mntput (just do one final mntget on the > resulting mnt before dropping vfsmount_lock). > > vfsmount_lock adds one per-cpu atomic for the spinlock, and we > remove two thread-shared atomics for fs->lock so a net win for > both single threaded performance and thread-shared scalability. > Latency is no problem because we hold rcu_read_lock for the same > length of time anyway. > > The parallel git diff workload is improved by several percent. Sounds sweet Nick, can't wait to play with your next set of patches here. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 6:15 ` David Miller @ 2009-10-09 10:40 ` Nick Piggin 2009-10-09 11:09 ` Jens Axboe 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-09 10:40 UTC (permalink / raw) To: David Miller Cc: torvalds, jens.axboe, linux-kernel, linux-fsdevel, kiran, peterz On Thu, Oct 08, 2009 at 11:15:27PM -0700, David Miller wrote: > From: Nick Piggin <npiggin@suse.de> > Date: Fri, 9 Oct 2009 05:50:50 +0200 > > > OK, I got rid of this guy from the RCU walk. Basically now hold > > vfsmount_lock over the entire RCU path walk (which also pins the mnt) > > and use a seqlock in the fs struct to get a consistent mnt,dentry > > pair. This also simplifies the walk because we don't need the > > complexity to avoid mntget/mntput (just do one final mntget on the > > resulting mnt before dropping vfsmount_lock). > > > > vfsmount_lock adds one per-cpu atomic for the spinlock, and we > > remove two thread-shared atomics for fs->lock so a net win for > > both single threaded performance and thread-shared scalability. > > Latency is no problem because we hold rcu_read_lock for the same > > length of time anyway. > > > > The parallel git diff workload is improved by several percent. > > Sounds sweet Nick, can't wait to play with your next set of > patches here. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/ OK well there will be another rev of the patches there with everything, also with the d_mounted patch so dentry remains a nice size. Also tried rearranging some structures in the dentry to help lookups but haven't quite finished there. Jens, I would be interested to see whether the parallel case on the N2 is improved at all. Profiles from that case might be interesting. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 10:40 ` Nick Piggin @ 2009-10-09 11:09 ` Jens Axboe 0 siblings, 0 replies; 57+ messages in thread From: Jens Axboe @ 2009-10-09 11:09 UTC (permalink / raw) To: Nick Piggin Cc: David Miller, torvalds, linux-kernel, linux-fsdevel, kiran, peterz On Fri, Oct 09 2009, Nick Piggin wrote: > On Thu, Oct 08, 2009 at 11:15:27PM -0700, David Miller wrote: > > From: Nick Piggin <npiggin@suse.de> > > Date: Fri, 9 Oct 2009 05:50:50 +0200 > > > > > OK, I got rid of this guy from the RCU walk. Basically now hold > > > vfsmount_lock over the entire RCU path walk (which also pins the mnt) > > > and use a seqlock in the fs struct to get a consistent mnt,dentry > > > pair. This also simplifies the walk because we don't need the > > > complexity to avoid mntget/mntput (just do one final mntget on the > > > resulting mnt before dropping vfsmount_lock). > > > > > > vfsmount_lock adds one per-cpu atomic for the spinlock, and we > > > remove two thread-shared atomics for fs->lock so a net win for > > > both single threaded performance and thread-shared scalability. > > > Latency is no problem because we hold rcu_read_lock for the same > > > length of time anyway. > > > > > > The parallel git diff workload is improved by several percent. > > > > Sounds sweet Nick, can't wait to play with your next set of > > patches here. > > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/ > > OK well there will be another rev of the patches there with everything, > also with the d_mounted patch so dentry remains a nice size. Also tried > rearranging some structures in the dentry to help lookups but haven't > quite finished there. > > Jens, I would be interested to see whether the parallel case on the N2 is > improved at all. Profiles from that case might be interesting. Results with 09102009.patch below. And thanks for doing a rolled-up patch again, as I don't use quilt :-) 2.6.32-rc3 serial real 0m6.587s user 0m1.970s sys 0m2.890s 2.6.32-rc3 parallel real 0m3.180s user 0m1.460s sys 0m2.700s vfs serial real 0m6.543s user 0m1.970s sys 0m2.590s vfs parallel real 0m3.121s user 0m1.200s sys 0m2.920s Doesn't look so good. Don't pay too much attention to user/sys time btw, they don't seem very stable. The real elapsed time is stable, though. -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 6:15 ` David Miller 2009-10-09 10:40 ` Nick Piggin @ 2009-10-09 10:44 ` Nick Piggin 2009-10-09 10:48 ` Jens Axboe 1 sibling, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-09 10:44 UTC (permalink / raw) To: David Miller Cc: torvalds, jens.axboe, linux-kernel, linux-fsdevel, kiran, peterz On Thu, Oct 08, 2009 at 11:15:27PM -0700, David Miller wrote: > From: Nick Piggin <npiggin@suse.de> > Date: Fri, 9 Oct 2009 05:50:50 +0200 > > > OK, I got rid of this guy from the RCU walk. Basically now hold > > vfsmount_lock over the entire RCU path walk (which also pins the mnt) > > and use a seqlock in the fs struct to get a consistent mnt,dentry > > pair. This also simplifies the walk because we don't need the > > complexity to avoid mntget/mntput (just do one final mntget on the > > resulting mnt before dropping vfsmount_lock). > > > > vfsmount_lock adds one per-cpu atomic for the spinlock, and we > > remove two thread-shared atomics for fs->lock so a net win for > > both single threaded performance and thread-shared scalability. > > Latency is no problem because we hold rcu_read_lock for the same > > length of time anyway. > > > > The parallel git diff workload is improved by several percent. > > Sounds sweet Nick, can't wait to play with your next set of > patches here. Oh, and the RCU patch is there just in case you want to try some file open/close benchmarks. Turns out I broke Paul's new scalable RCU ;) ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-09 10:44 ` Nick Piggin @ 2009-10-09 10:48 ` Jens Axboe 0 siblings, 0 replies; 57+ messages in thread From: Jens Axboe @ 2009-10-09 10:48 UTC (permalink / raw) To: Nick Piggin Cc: David Miller, torvalds, linux-kernel, linux-fsdevel, kiran, peterz On Fri, Oct 09 2009, Nick Piggin wrote: > On Thu, Oct 08, 2009 at 11:15:27PM -0700, David Miller wrote: > > From: Nick Piggin <npiggin@suse.de> > > Date: Fri, 9 Oct 2009 05:50:50 +0200 > > > > > OK, I got rid of this guy from the RCU walk. Basically now hold > > > vfsmount_lock over the entire RCU path walk (which also pins the mnt) > > > and use a seqlock in the fs struct to get a consistent mnt,dentry > > > pair. This also simplifies the walk because we don't need the > > > complexity to avoid mntget/mntput (just do one final mntget on the > > > resulting mnt before dropping vfsmount_lock). > > > > > > vfsmount_lock adds one per-cpu atomic for the spinlock, and we > > > remove two thread-shared atomics for fs->lock, so it's a net win for > > > both single-threaded performance and thread-shared scalability. > > > Latency is no problem because we hold rcu_read_lock for the same > > > length of time anyway. > > > > > > The parallel git diff workload is improved by several percent. > > > > Sounds sweet Nick, can't wait to play with your next set of > > patches here. > > Oh, and the RCU patch is up there too, in case you want to try some file > open/close benchmarks. Turns out I broke Paul's new scalable RCU ;) In that case, you may want to fix the permissions on that file :-) I'll try the new rolled-up patch. -- Jens Axboe ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [rfc][patch] store-free path walking 2009-10-07 8:58 ` [rfc][patch] store-free path walking Nick Piggin 2009-10-07 9:56 ` Jens Axboe 2009-10-07 14:56 ` Linus Torvalds @ 2009-10-09 23:16 ` Paul E. McKenney 2 siblings, 0 replies; 57+ messages in thread From: Paul E. McKenney @ 2009-10-09 23:16 UTC (permalink / raw) To: Nick Piggin Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds On Wed, Oct 07, 2009 at 10:58:49AM +0200, Nick Piggin wrote: > On Tue, Oct 06, 2009 at 02:49:41PM +0200, Jens Axboe wrote: > > On Tue, Oct 06 2009, Nick Piggin wrote: > > > It's a copout, but you could try running multiple dbenches under > > > different working directories (or actually, IIRC dbench does root > > > based path lookups so maybe that won't help you much). > > > > Yeah, it's hitting dentry->d_lock pretty hard so basically > > spin-serialized at that point. > > > > > > Anyway, below are the results. Seem very stable. > > > > > > > > throughput > > > > ------------------------------------------------ > > > > 2.6.32-rc3-git | 561.218 MB/sec > > > > 2.6.32-rc3-git+patch | 627.022 MB/sec > > > > > > Well it's good to see you got some improvement. > > > > Yes, it's an improvement though the results are still pretty abysmal :-) > > OK, I have a really basic patch that does store-free path walking > (except on the final element). dbench is pretty nasty still because > it seems to do a lot of stupid things like reading from /proc/mounts > all the time. > > Actually it isn't quite store-free because it still takes mnt ref > (which is hard to avoid, but at least it is a per-cpu data). So > this patch applies on top of my recent patchset. At least it does > not store to the dentries. > > Basically it is holding rcu_read_lock for the entire walk, it uses > inode RCU from my earlier patches to check inode permissions, and > it bails out at the slightest sign of trouble :) (actually I think > I have got it to walk mountpoints which is nice). > > The seqlock in the dentry is for getting consistent name,len pointer, > and also not giving a false positive if a rename has partially > overwritten the name string (false negatives are always fine in the > lock free lookup path but false positives are not). Possibly we > could make do with a per-sb seqlock for this (or just rename_lock). I have no idea whether it is worth it, but the usual trick to avoid name/len inconsistencies is to have a pointer to a structure containing both the name and the length (already provided by qstr in the form of d_name) and d_hash (which is not already part of d_name). This does mean that when renaming from one short name to another, one must still use external storage for one of the names. For example, when renaming from "foo.c" to "bar.c", if "foo.c" is stored in the dentry itself, you must allocate memory for the "bar.c", even though "bar.c" would otherwise fit into the dentry. After a grace period, you could put the "bar.c" into the dentry, and after another grace period, you could deallocate the memory. These grace periods would presumably be handled by call_rcu() rather than synchronize_rcu(). This approach would make it safe to access dentries that were in the process of being renamed. Thanx, Paul > I have duplicated most of the lookup code, altering it to the lock > free version, which returns -EAGAIN if it gets in trouble and the > regular path walk starts up. I don't know if this is the best way > to go... it's fairly easy but a bit ugly. 
But complexifying the > existing path walk any more would not be nice either. It might be > nice to try locking the dentry that we're having trouble with and > continuing from there, rather than redoing the whole walk with > locks... > > Anyway, this is the basics working for now, microbenchmark shows > same-cwd lookups scale linearly now too. We can probably slowly > tackle more cases if they come up as being important, simply by > auditing filesystems etc. > > -- > > Index: linux-2.6/fs/dcache.c > =================================================================== > --- linux-2.6.orig/fs/dcache.c > +++ linux-2.6/fs/dcache.c > @@ -1187,6 +1187,7 @@ struct dentry *d_alloc(struct dentry * p > dentry->d_count = 1; > dentry->d_flags = DCACHE_UNHASHED; > spin_lock_init(&dentry->d_lock); > + seqcount_init(&dentry->d_seq); > dentry->d_inode = NULL; > dentry->d_parent = NULL; > dentry->d_sb = NULL; > @@ -1579,21 +1580,6 @@ err_out: > * d_lookup() is protected against the concurrent renames in some unrelated > * directory using the seqlockt_t rename_lock. > */ > - > -struct dentry * d_lookup(struct dentry * parent, struct qstr * name) > -{ > - struct dentry * dentry = NULL; > - unsigned seq; > - > - do { > - seq = read_seqbegin(&rename_lock); > - dentry = __d_lookup(parent, name); > - if (dentry) > - break; > - } while (read_seqretry(&rename_lock, seq)); > - return dentry; > -} > - > struct dentry * __d_lookup(struct dentry * parent, struct qstr * name) > { > unsigned int len = name->len; > @@ -1656,6 +1642,78 @@ next: > return found; > } > > +struct dentry * d_lookup(struct dentry * parent, struct qstr * name) > +{ > + struct dentry *dentry = NULL; > + unsigned seq; > + > + do { > + seq = read_seqbegin(&rename_lock); > + dentry = __d_lookup(parent, name); > + if (dentry) > + break; > + } while (read_seqretry(&rename_lock, seq)); > + return dentry; > +} > + > +struct dentry * __d_lookup_rcu(struct dentry * parent, struct qstr * name) > +{ > + unsigned int len = name->len; > + unsigned int hash = name->hash; > + const unsigned char *str = name->name; > + struct dcache_hash_bucket *b = d_hash(parent, hash); > + struct hlist_head *head = &b->head; > + struct hlist_node *node; > + struct dentry *dentry; > + > + hlist_for_each_entry_rcu(dentry, node, head, d_hash) { > + unsigned seq; > + struct dentry *tparent; > + const char *tname; > + int tlen; > + > + if (dentry->d_name.hash != hash) > + continue; > + > +seqretry: > + seq = read_seqcount_begin(&dentry->d_seq); > + tparent = dentry->d_parent; > + if (tparent != parent) > + continue; > + tlen = dentry->d_name.len; > + if (tlen != len) > + continue; > + tname = dentry->d_name.name; > + if (read_seqcount_retry(&dentry->d_seq, seq)) > + goto seqretry; > + if (memcmp(tname, str, tlen)) > + continue; > + if (read_seqcount_retry(&dentry->d_seq, seq)) > + goto seqretry; > + > + return dentry; > + } > + return NULL; > +} > + > +struct dentry *d_lookup_rcu(struct dentry *parent, struct qstr * name) > +{ > + struct dentry *dentry = NULL; > + unsigned seq; > + > + if (parent->d_op && parent->d_op->d_compare) > + goto out; > + > + do { > + seq = read_seqbegin(&rename_lock); > + dentry = __d_lookup_rcu(parent, name); > + if (dentry) > + break; > + } while (read_seqretry(&rename_lock, seq)); > +out: > + return dentry; > +} > + > /** > * d_hash_and_lookup - hash the qstr then search for a dentry > * @dir: Directory to search in > @@ -1925,6 +1983,8 @@ static void d_move_locked(struct dentry > list_del(&target->d_u.d_child); > > /* Switch the names.. 
*/ > + write_seqcount_begin(&dentry->d_seq); > + write_seqcount_begin(&target->d_seq); > switch_names(dentry, target); > swap(dentry->d_name.hash, target->d_name.hash); > > @@ -1939,6 +1999,8 @@ static void d_move_locked(struct dentry > /* And add them back to the (new) parent lists */ > list_add(&target->d_u.d_child, &target->d_parent->d_subdirs); > } > + write_seqcount_end(&target->d_seq); > + write_seqcount_end(&dentry->d_seq); > > list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs); > if (target->d_parent != dentry->d_parent) > Index: linux-2.6/fs/namei.c > =================================================================== > --- linux-2.6.orig/fs/namei.c > +++ linux-2.6/fs/namei.c > @@ -200,6 +200,29 @@ static int acl_permission_check(struct i > return -EACCES; > } > > +static int acl_permission_check_rcu(struct inode *inode, int mask, > + int (*check_acl)(struct inode *inode, int mask)) > +{ > + umode_t mode = inode->i_mode; > + > + mask &= MAY_READ | MAY_WRITE | MAY_EXEC; > + > + if (current_fsuid() == inode->i_uid) > + mode >>= 6; > + else { > + if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) > + return -EAGAIN; > + if (in_group_p(inode->i_gid)) > + mode >>= 3; > + } > + > + /* > + * If the DACs are ok we don't need any capability check. > + */ > + if ((mask & ~mode) == 0) > + return 0; > + return -EACCES; > +} > /** > * generic_permission - check for access rights on a Posix-like filesystem > * @inode: inode to check access rights for > @@ -465,6 +488,25 @@ ok: > return security_inode_permission(inode, MAY_EXEC); > } > > +static int exec_permission_lite_rcu(struct inode *inode) > +{ > + int ret; > + > + if (inode->i_op->permission) > + return -EAGAIN; > + ret = acl_permission_check_rcu(inode, MAY_EXEC, inode->i_op->check_acl); > + if (ret == -EAGAIN) > + return ret; > + if (!ret) > + goto ok; > + > + if (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH)) > + goto ok; > + > + return ret; > +ok: > + return security_inode_permission(inode, MAY_EXEC); > +} > /* > * This is called when everything else fails, and we actually have > * to go to the low-level filesystem to find out what we should do.. 
> @@ -569,6 +611,17 @@ static __always_inline void set_root(str > } > } > > +static __always_inline void set_root_rcu(struct nameidata *nd) > +{ > + if (!nd->root.mnt) { > + struct fs_struct *fs = current->fs; > + read_lock(&fs->lock); > + nd->root = fs->root; > + mntget(nd->root.mnt); > + read_unlock(&fs->lock); > + } > +} > + > static __always_inline int __vfs_follow_link(struct nameidata *nd, const char *link) > { > int res = 0; > @@ -611,6 +664,14 @@ static void path_put_conditional(struct > mntput(path->mnt); > } > > +static inline void path_to_nameidata_rcu(struct path *path, struct nameidata *nd) > +{ > + if (nd->path.mnt != path->mnt) > + mntput(nd->path.mnt); > + nd->path.mnt = path->mnt; > + nd->path.dentry = path->dentry; > +} > + > static inline void path_to_nameidata(struct path *path, struct nameidata *nd) > { > dput(nd->path.dentry); > @@ -705,6 +766,22 @@ int follow_up(struct path *path) > /* no need for dcache_lock, as serialization is taken care in > * namespace.c > */ > +static int __follow_mount_rcu(struct path *path) > +{ > + int res = 0; > + while (d_mountpoint(path->dentry)) { > + struct vfsmount *mounted = lookup_mnt(path); > + if (!mounted) > + break; > + if (res) > + mntput(path->mnt); > + path->mnt = mounted; > + path->dentry = mounted->mnt_root; > + res = 1; > + } > + return res; > +} > + > static int __follow_mount(struct path *path) > { > int res = 0; > @@ -791,6 +868,24 @@ static __always_inline void follow_dotdo > * small and for now I'd prefer to have fast path as straight as possible. > * It _is_ time-critical. > */ > +static int do_lookup_rcu(struct nameidata *nd, struct qstr *name, > + struct path *path) > +{ > + struct vfsmount *mnt = nd->path.mnt; > + struct dentry *dentry; > + > + dentry = __d_lookup_rcu(nd->path.dentry, name); > + > + if (!dentry) > + return -EAGAIN; > + if (dentry->d_op && dentry->d_op->d_revalidate) > + return -EAGAIN; > + path->mnt = mnt; > + path->dentry = dentry; > + __follow_mount_rcu(path); > + return 0; > +} > + > static int do_lookup(struct nameidata *nd, struct qstr *name, > struct path *path) > { > @@ -825,6 +920,134 @@ fail: > return PTR_ERR(dentry); > } > > +static noinline int link_path_walk_rcu(const char *name, struct nameidata *nd, struct path *next) > +{ > + struct inode *inode; > + unsigned int lookup_flags = nd->flags; > + > + while (*name=='/') > + name++; > + if (!*name) > + goto return_reval; > + > + inode = nd->path.dentry->d_inode; > + if (nd->depth) > + lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE); > + > + /* At this point we know we have a real path component. */ > + for(;;) { > + unsigned long hash; > + struct qstr this; > + unsigned int c; > + > + nd->flags |= LOOKUP_CONTINUE; > + if (exec_permission_lite_rcu(inode)) > + return -EAGAIN; > + > + this.name = name; > + c = *(const unsigned char *)name; > + > + hash = init_name_hash(); > + do { > + name++; > + hash = partial_name_hash(c, hash); > + c = *(const unsigned char *)name; > + } while (c && (c != '/')); > + this.len = name - (const char *) this.name; > + this.hash = end_name_hash(hash); > + > + /* remove trailing slashes? */ > + if (!c) > + goto last_component; > + while (*++name == '/'); > + if (!*name) > + goto last_with_slashes; > + > + if (this.name[0] == '.') switch (this.len) { > + default: > + break; > + case 2: > + if (this.name[1] != '.') > + break; > + return -EAGAIN; > + case 1: > + continue; > + } > + if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) > + return -EAGAIN; > + /* This does the actual lookups.. 
*/ > + if (do_lookup_rcu(nd, &this, next)) > + return -EAGAIN; > + > + inode = next->dentry->d_inode; > + if (!inode) > + return -ENOENT; > + if (inode->i_op->follow_link) > + return -EAGAIN; > + path_to_nameidata_rcu(next, nd); > + if (!inode->i_op->lookup) > + return -ENOTDIR; > + continue; > + /* here ends the main loop */ > + > +last_with_slashes: > + lookup_flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > +last_component: > + /* Clear LOOKUP_CONTINUE iff it was previously unset */ > + nd->flags &= lookup_flags | ~LOOKUP_CONTINUE; > + if (lookup_flags & LOOKUP_PARENT) > + return -EAGAIN; > + if (this.name[0] == '.') switch (this.len) { > + default: > + break; > + case 2: > + if (this.name[1] != '.') > + break; > + return -EAGAIN; > + case 1: > + goto return_reval; > + } > + if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) > + return -EAGAIN; > + if (do_lookup_rcu(nd, &this, next)) > + return -EAGAIN; > + inode = next->dentry->d_inode; > + if ((lookup_flags & LOOKUP_FOLLOW) > + && inode && inode->i_op->follow_link) > + return -EAGAIN; > + > + path_to_nameidata_rcu(next, nd); > + if (!inode) > + return -ENOENT; > + if (lookup_flags & LOOKUP_DIRECTORY) { > + if (!inode->i_op->lookup) > + return -ENOTDIR; > + } > + goto return_base; > + } > +return_reval: > + /* > + * We bypassed the ordinary revalidation routines. > + * We may need to check the cached dentry for staleness. > + */ > + if (nd->path.dentry && nd->path.dentry->d_sb && > + (nd->path.dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)) > + return -EAGAIN; > +return_base: > + spin_lock(&nd->path.dentry->d_lock); > + if (d_unhashed(nd->path.dentry)) { > + spin_unlock(&nd->path.dentry->d_lock); > + return -EAGAIN; > + } > + if (!nd->path.dentry->d_inode) { > + spin_unlock(&nd->path.dentry->d_lock); > + return -EAGAIN; > + } > + nd->path.dentry->d_count++; > + spin_unlock(&nd->path.dentry->d_lock); > + return 0; > +} > + > /* > * Name resolution. > * This is the basic name resolution function, turning a pathname into > @@ -887,7 +1110,7 @@ static int __link_path_walk(const char * > if (this.name[0] == '.') switch (this.len) { > default: > break; > - case 2: > + case 2: > if (this.name[1] != '.') > break; > follow_dotdot(nd); > @@ -942,7 +1165,7 @@ last_component: > if (this.name[0] == '.') switch (this.len) { > default: > break; > - case 2: > + case 2: > if (this.name[1] != '.') > break; > follow_dotdot(nd); > @@ -1013,12 +1236,82 @@ return_err: > return err; > } > > +static int path_walk_rcu(const char *name, struct nameidata *nd) > +{ > + struct path save = nd->path; > + struct path path = {.mnt = NULL}; > + int err; > + > + current->total_link_count = 0; > + err = link_path_walk_rcu(name, nd, &path); > + if (err) { > + if (path.mnt != nd->path.mnt) > + mntput(path.mnt); > + mntput(nd->path.mnt); > + if (err == -EAGAIN) > + nd->path = save; > + } > + return err; > +} > + > static int path_walk(const char *name, struct nameidata *nd) > { > current->total_link_count = 0; > return link_path_walk(name, nd); > } > > +static noinline int path_init_rcu(int dfd, const char *name, unsigned int flags, struct nameidata *nd) > +{ > + int retval = 0; > + int fput_needed; > + struct file *file; > + > + nd->last_type = LAST_ROOT; /* if there are only slashes... 
*/ > + nd->flags = flags; > + nd->depth = 0; > + nd->root.mnt = NULL; > + > + if (*name=='/') { > + set_root_rcu(nd); > + nd->path = nd->root; > + mntget(nd->root.mnt); > + } else if (dfd == AT_FDCWD) { > + struct fs_struct *fs = current->fs; > + read_lock(&fs->lock); > + nd->path = fs->pwd; > + mntget(fs->pwd.mnt); > + read_unlock(&fs->lock); > + } else { > + struct dentry *dentry; > + > + file = fget_light(dfd, &fput_needed); > + retval = -EBADF; > + if (!file) > + goto out_fail; > + > + dentry = file->f_path.dentry; > + > + retval = -ENOTDIR; > + if (!S_ISDIR(dentry->d_inode->i_mode)) > + goto fput_fail; > + > + retval = file_permission(file, MAY_EXEC); > + if (retval) > + goto fput_fail; > + > + nd->path = file->f_path; > + mntget(file->f_path.mnt); > + > + fput_light(file, fput_needed); > + } > + return 0; > + > +fput_fail: > + fput_light(file, fput_needed); > +out_fail: > + return retval; > +} > + > static int path_init(int dfd, const char *name, unsigned int flags, struct nameidata *nd) > { > int retval = 0; > @@ -1075,16 +1368,46 @@ out_fail: > static int do_path_lookup(int dfd, const char *name, > unsigned int flags, struct nameidata *nd) > { > - int retval = path_init(dfd, name, flags, nd); > - if (!retval) > - retval = path_walk(name, nd); > - if (unlikely(!retval && !audit_dummy_context() && nd->path.dentry && > - nd->path.dentry->d_inode)) > - audit_inode(name, nd->path.dentry); > + int retval; > + > + rcu_read_lock(); > + retval = path_init_rcu(dfd, name, flags, nd); > + if (unlikely(retval)) { > + rcu_read_unlock(); > + return retval; > + } > + retval = path_walk_rcu(name, nd); > + rcu_read_unlock(); > + if (likely(!retval)) { > + if (unlikely(!audit_dummy_context())) { > + if (nd->path.dentry && nd->path.dentry->d_inode) > + audit_inode(name, nd->path.dentry); > + } > + } > if (nd->root.mnt) { > - path_put(&nd->root); > + mntput(nd->root.mnt); > nd->root.mnt = NULL; > } > + > + if (unlikely(retval == -EAGAIN)) { > + /* slower, locked walk */ > + retval = path_init(dfd, name, flags, nd); > + if (unlikely(retval)) > + return retval; > + retval = path_walk(name, nd); > + if (likely(!retval)) { > + if (unlikely(!audit_dummy_context())) { > + if (nd->path.dentry && nd->path.dentry->d_inode) > + audit_inode(name, nd->path.dentry); > + } > + } > + > + if (nd->root.mnt) { > + path_put(&nd->root); > + nd->root.mnt = NULL; > + } > + } > + > return retval; > } > > Index: linux-2.6/include/linux/dcache.h > =================================================================== > --- linux-2.6.orig/include/linux/dcache.h > +++ linux-2.6/include/linux/dcache.h > @@ -5,6 +5,7 @@ > #include <linux/list.h> > #include <linux/rculist.h> > #include <linux/spinlock.h> > +#include <linux/seqlock.h> > #include <linux/cache.h> > #include <linux/rcupdate.h> > > @@ -81,15 +82,16 @@ full_name_hash(const unsigned char *name > * large memory footprint increase). 
> */ > #ifdef CONFIG_64BIT > -#define DNAME_INLINE_LEN_MIN 32 /* 192 bytes */ > +#define DNAME_INLINE_LEN_MIN 24 /* 192 bytes */ > #else > -#define DNAME_INLINE_LEN_MIN 40 /* 128 bytes */ > +#define DNAME_INLINE_LEN_MIN 36 /* 128 bytes */ > #endif > > struct dentry { > unsigned int d_count; /* protected by d_lock */ > unsigned int d_flags; /* protected by d_lock */ > spinlock_t d_lock; /* per dentry lock */ > + seqcount_t d_seq; /* per dentry seqlock */ > int d_mounted; > struct inode *d_inode; /* Where the name belongs to - NULL is > * negative */ > @@ -283,9 +285,11 @@ extern void d_move(struct dentry *, stru > extern struct dentry *d_ancestor(struct dentry *, struct dentry *); > > /* appendix may either be NULL or be used for transname suffixes */ > -extern struct dentry * d_lookup(struct dentry *, struct qstr *); > -extern struct dentry * __d_lookup(struct dentry *, struct qstr *); > -extern struct dentry * d_hash_and_lookup(struct dentry *, struct qstr *); > +extern struct dentry *d_lookup(struct dentry *, struct qstr *); > +extern struct dentry *__d_lookup(struct dentry *, struct qstr *); > +extern struct dentry *d_lookup_rcu(struct dentry *, struct qstr *); > +extern struct dentry *__d_lookup_rcu(struct dentry *, struct qstr *); > +extern struct dentry *d_hash_and_lookup(struct dentry *, struct qstr *); > > /* validate "insecure" dentry pointer */ > extern int d_validate(struct dentry *, struct dentry *); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 57+ messages in thread
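Paul's suggested indirection is worth spelling out. In the sketch below, the name, length and hash live in one RCU-managed allocation, so a lock-free reader can never see a mixed pair across a rename: the writer publishes a new triple with a single pointer store and frees the old one after a grace period via call_rcu(), as Paul suggests. All names here (dname, dentry_sketch, the helpers) are hypothetical, not taken from the patches:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/rcupdate.h>

struct dname {
	struct rcu_head rcu;
	unsigned int hash;
	unsigned int len;
	unsigned char name[];	/* flexible array member */
};

struct dentry_sketch {
	struct dname __rcu *name;
	/* ... */
};

static void dname_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct dname, rcu));
}

/* Writer side, under the usual rename locking. */
static void dentry_swap_name(struct dentry_sketch *d, struct dname *new_name)
{
	struct dname *old = rcu_dereference_protected(d->name, 1);

	rcu_assign_pointer(d->name, new_name);	/* publish the new triple */
	call_rcu(&old->rcu, dname_free_rcu);	/* reclaim after a grace period */
}

/* Reader side, under rcu_read_lock(); false negatives are tolerable. */
static int dentry_name_eq(struct dentry_sketch *d, const unsigned char *str,
			  unsigned int len, unsigned int hash)
{
	struct dname *n = rcu_dereference(d->name);

	return n->hash == hash && n->len == len &&
	       memcmp(n->name, str, len) == 0;
}

Paul's caveat shows up directly in this sketch: renaming between two short names still allocates a fresh dname even when the new name would fit in any inline storage, because readers may still be traversing the old allocation.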
* Re: Latest vfs scalability patch 2009-10-06 6:49 Latest vfs scalability patch Nick Piggin 2009-10-06 10:14 ` Jens Axboe @ 2009-10-15 10:08 ` Anton Blanchard 2009-10-15 10:39 ` Nick Piggin 2009-10-15 10:53 ` Nick Piggin 1 sibling, 2 replies; 57+ messages in thread From: Anton Blanchard @ 2009-10-15 10:08 UTC (permalink / raw) To: Nick Piggin Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 1756 bytes --] Hi Nick, > Several people have been interested to test my vfs patches, so rather > than resend patches I have uploaded a rollup against Linus's current > head. > > ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/ > > I have used ext2,ext3,autofs4,nfs as well as in-memory filesystems > OK (although this doesn't mean there are no bugs!). Otherwise, if your > filesystem compiles, then there is a reasonable chance of it working, > or ask me and I can try updating it for the new locking. > > I would be interested in seeing any numbers people might come up with, > including single-threaded performance. Thanks for doing a rollup patch, it made it easy to test. I gave it a spin on a 64 core (128 thread) POWER5+ box. I started simple by looking at open/close performance, eg: void testcase(void) { char tmpfile[] = "/tmp/testfile.XXXXXX"; mkstemp(tmpfile); while (1) { int fd = open(tmpfile, O_RDWR); close(fd); } } At first the results were 10x slower. I took a look and it appears the MNT_MOUNTED flag is getting cleared by a remount (I'm testing on the root filesystem). This fixed it: --- fs/namespace.c~ 2009-10-15 04:34:02.000000000 -0500 +++ fs/namespace.c 2009-10-15 04:35:00.000000000 -0500 @@ -1711,7 +1711,8 @@ static int do_remount(struct path *path, else err = do_remount_sb(sb, flags, data, 0); if (!err) - path->mnt->mnt_flags = mnt_flags; + path->mnt->mnt_flags = mnt_flags | + (path->mnt->mnt_flags & MNT_MOUNTED); up_write(&sb->s_umount); if (!err) { security_sb_post_remount(path->mnt, flags, data); Attached is a before and after graph. Single thread performance is 20% faster, and we go from hitting a wall at 2 cores to scaling all the way to 64 cores. Nice work!!! Anton [-- Attachment #2: open_close_performance.png --] [-- Type: image/png, Size: 26403 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
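For anyone wanting to reproduce Anton's numbers, the testcase above needs only headers and a main() around it to compile. A filled-out version; the includes and error handling are additions, the loop is Anton's:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
	char tmpfile[] = "/tmp/testfile.XXXXXX";
	int fd = mkstemp(tmpfile);	/* creates the file, returns an open fd */

	if (fd < 0) {
		perror("mkstemp");
		return 1;
	}
	close(fd);	/* keep only the name; the loop reopens it */

	while (1) {
		fd = open(tmpfile, O_RDWR);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);
	}
}

Running one pinned copy per core (e.g. under taskset) and counting loop iterations per second is presumably how the attached scaling graph was produced.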
* Re: Latest vfs scalability patch 2009-10-15 10:08 ` Latest vfs scalability patch Anton Blanchard @ 2009-10-15 10:39 ` Nick Piggin 2009-10-15 10:46 ` Anton Blanchard 2009-10-15 10:53 ` Nick Piggin 1 sibling, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-15 10:39 UTC (permalink / raw) To: Anton Blanchard Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe Hi Anton, On Thu, Oct 15, 2009 at 09:08:54PM +1100, Anton Blanchard wrote: > > Hi Nick, > > > Several people have been interested to test my vfs patches, so rather > > than resend patches I have uploaded a rollup against Linus's current > > head. > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/ > > > > I have used ext2,ext3,autofs4,nfs as well as in-memory filesystems > > OK (although this doesn't mean there are no bugs!). Otherwise, if your > > filesystem compiles, then there is a reasonable chance of it working, > > or ask me and I can try updating it for the new locking. > > > > I would be interested in seeing any numbers people might come up with, > > including single-threaded performance. > > Thanks for doing a rollup patch, it made it easy to test. I gave it a spin on > a 64 core (128 thread) POWER5+ box. I started simple by looking at open/close > performance, eg: > > void testcase(void) > { > char tmpfile[] = "/tmp/testfile.XXXXXX"; > > mkstemp(tmpfile); > > while (1) { > int fd = open(tmpfile, O_RDWR); > close(fd); > } > } > > At first the results were 10x slower. I took a look and it appears the > MNT_MOUNTED flag is getting cleared by a remount (I'm testing on the root > filesystem). This fixed it: Oh dear, thanks for that. That bugfix is needed for the patchset I just sent to be merged. > --- fs/namespace.c~ 2009-10-15 04:34:02.000000000 -0500 > +++ fs/namespace.c 2009-10-15 04:35:00.000000000 -0500 > @@ -1711,7 +1711,8 @@ static int do_remount(struct path *path, > else > err = do_remount_sb(sb, flags, data, 0); > if (!err) > - path->mnt->mnt_flags = mnt_flags; > + path->mnt->mnt_flags = mnt_flags | > + (path->mnt->mnt_flags & MNT_MOUNTED); > up_write(&sb->s_umount); > if (!err) { > security_sb_post_remount(path->mnt, flags, data); > > Attached is a before and after graph. Single thread performance is 20% > faster, and we go from hitting a wall at 2 cores to scaling all the way > to 64 cores. Nice work!!! Nice looking graph, thanks! Did you use the rcu patch there as well? (my open-close testing chokes RCU but your system might be managing to keep outstanding callbacks below the threshold) ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Latest vfs scalability patch 2009-10-15 10:39 ` Nick Piggin @ 2009-10-15 10:46 ` Anton Blanchard 0 siblings, 0 replies; 57+ messages in thread From: Anton Blanchard @ 2009-10-15 10:46 UTC (permalink / raw) To: Nick Piggin Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe Hi Nick, > Nice looking graph, thanks! Did you use the rcu patch there as > well? (my open-close testing chokes RCU but your system might > be managing to keep outstanding callbacks below the threshold) Yep, I am running with the RCU patch but forgot to mention it :) Anton ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Latest vfs scalability patch 2009-10-15 10:08 ` Latest vfs scalability patch Anton Blanchard 2009-10-15 10:39 ` Nick Piggin @ 2009-10-15 10:53 ` Nick Piggin 2009-10-15 11:23 ` Anton Blanchard 1 sibling, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-15 10:53 UTC (permalink / raw) To: Anton Blanchard Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe On Thu, Oct 15, 2009 at 09:08:54PM +1100, Anton Blanchard wrote: > > Hi Nick, > > > Several people have been interested to test my vfs patches, so rather > > than resend patches I have uploaded a rollup against Linus's current > > head. > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/ > > > > I have used ext2,ext3,autofs4,nfs as well as in-memory filesystems > > OK (although this doesn't mean there are no bugs!). Otherwise, if your > > filesystem compiles, then there is a reasonable chance of it working, > > or ask me and I can try updating it for the new locking. > > > > I would be interested in seeing any numbers people might come up with, > > including single-threaded performance. > > Thanks for doing a rollup patch, it made it easy to test. I gave it a spin on > a 64 core (128 thread) POWER5+ box. I started simple by looking at open/close > performance, eg: I wonder what other good performance tests you can add to your test framework? creat/unlink is another easy one. And for each case, putting threads in their own cwd versus a common cwd are the variants. BTW. for these cases in your tests it will be nice if you can run on ramfs because that will isolate purely the vfs. Perhaps also include other filesystems as you get time, but I think ramfs is the most useful for us to start with. ^ permalink raw reply [flat|nested] 57+ messages in thread
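For reference, the creat/unlink testcase Nick asks for might look like the sketch below, with the separate-cwd versus common-cwd variant selected by a flag. The harness details (forking one worker per core, pinning, timing) are left out, and the naming is illustrative rather than taken from Anton's framework:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

static void testcase(int separate_cwd, int id)
{
	char dir[64];

	if (separate_cwd) {
		/* each worker creates and unlinks inside its own directory */
		snprintf(dir, sizeof(dir), "cwd-%d", id);
		mkdir(dir, 0700);
		if (chdir(dir) < 0) {
			perror("chdir");
			exit(1);
		}
	}

	while (1) {
		int fd = creat("testfile", 0600);

		if (fd >= 0)
			close(fd);
		unlink("testfile");
	}
}

int main(int argc, char **argv)
{
	testcase(argc > 1, getpid());	/* any argument: separate-cwd variant */
	return 0;
}

Per Nick's point, running this from a ramfs mount makes the numbers isolate the vfs rather than the filesystem underneath.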
* Re: Latest vfs scalability patch 2009-10-15 10:53 ` Nick Piggin @ 2009-10-15 11:23 ` Anton Blanchard 2009-10-15 11:41 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Anton Blanchard @ 2009-10-15 11:23 UTC (permalink / raw) To: Nick Piggin Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 744 bytes --] Hi Nick, > I wonder what other good performance tests you can add to your test > framework? creat/unlink is another easy one. And for each case, putting > threads in their own cwd versus a common cwd are the variants. I did try the two combinations of creat/unlink but haven't had a chance to digest the profiles yet. I've attached them (taken at 64 cores, ie worst case :) In both cases performance was significantly better than mainline. > BTW. for these cases in your tests it will be nice if you can run on > ramfs because that will isolate purely the vfs. Perhaps also include > other filesystems as you get time, but I think ramfs is the most > useful for us to start with. Good point. I'll add that into the setup scripts. Anton [-- Attachment #2: open_unlink.profile --] [-- Type: text/plain, Size: 10567 bytes --] # Samples: 82617 # # Overhead Command Shared Object Symbol # ........ ............... ................................. ...... # 99.16% unlink1_process [kernel] [k] ._spin_lock | |--99.98%-- ._spin_lock | | | |--49.80%-- .path_get | | (nil) | | | | | |--71.62%-- .path_init | | | | | | | |--51.37%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff8e90bd4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff8e84f33c | | | | 0xfff8e84f55c | | | | (nil) | | | | | | | --48.63%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff8e909c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff8e84f33c | | | 0xfff8e84f55c | | | (nil) | | | | | --28.38%-- .path_walk | | | | | |--50.49%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff8e909c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff8e84f33c | | | 0xfff8e84f55c | | | (nil) | | | | | --49.51%-- .do_path_lookup | | .user_path_parent | | .do_unlinkat | | syscall_exit | | 0xfff8e90bd4c | | .testcase | | .affinitize | | .new_task | | .main | | 0xfff8e84f33c | | 0xfff8e84f55c | | (nil) | | | |--49.58%-- .dput | | (nil) | | | | | |--64.21%-- .path_put | | | | | | | |--48.48%-- .path_walk | | | | | | | | | |--51.00%-- .do_path_lookup | | | | | .user_path_parent | | | | | .do_unlinkat | | | | | syscall_exit | | | | | 0xfff8e90bd4c | | | | | .testcase | | | | | .affinitize | | | | | .new_task | | | | | .main | | | | | 0xfff8e84f33c | | | | | 0xfff8e84f55c | | | | | (nil) | | | | | | | | | --49.00%-- .do_filp_open | | | | .do_sys_open | | | | syscall_exit | | | | 0xfff8e909c7c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff8e84f33c | | | | 0xfff8e84f55c | | | | (nil) | | | | | | | |--25.79%-- .do_filp_open | | | | .do_sys_open | | | | syscall_exit | | | | 0xfff8e909c7c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff8e84f33c | | | | 0xfff8e84f55c | | | | (nil) | | | | | | | |--25.72%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff8e90bd4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff8e84f33c | | | | 0xfff8e84f55c | | | 
| (nil) | | | --0.00%-- [...] | | | | | |--35.78%-- .__link_path_walk | | | .path_walk | | | | | | | |--50.81%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff8e90bd4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff8e84f33c | | | | 0xfff8e84f55c | | | | (nil) | | | | | | | --49.19%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff8e909c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff8e84f33c | | | 0xfff8e84f55c | | | (nil) | | --0.01%-- [...] | --0.62%-- [...] --0.02%-- [...] [-- Attachment #3: open_unlink_directories.profile --] [-- Type: text/plain, Size: 12730 bytes --] # Samples: 101253 # # Overhead Command Shared Object Symbol # ........ ............... ................................. ...... # 99.15% unlink2_process [kernel] [k] ._spin_lock | |--99.97%-- ._spin_lock | | | |--49.96%-- .dput | | (nil) | | | | | |--71.64%-- .__link_path_walk | | | .path_walk | | | | | | | |--50.78%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff86a85d4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff869c933c | | | | 0xfff869c955c | | | | (nil) | | | | | | | --49.22%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff86a83c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff869c933c | | | 0xfff869c955c | | | (nil) | | | | | |--28.35%-- .path_put | | | | | | | |--50.91%-- .path_walk | | | | | | | | | |--53.90%-- .do_path_lookup | | | | | .user_path_parent | | | | | .do_unlinkat | | | | | syscall_exit | | | | | 0xfff86a85d4c | | | | | .testcase | | | | | .affinitize | | | | | .new_task | | | | | .main | | | | | 0xfff869c933c | | | | | 0xfff869c955c | | | | | (nil) | | | | | | | | | --46.10%-- .do_filp_open | | | | .do_sys_open | | | | syscall_exit | | | | 0xfff86a83c7c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff869c933c | | | | 0xfff869c955c | | | | (nil) | | | | | | | |--25.98%-- .do_filp_open | | | | .do_sys_open | | | | syscall_exit | | | | 0xfff86a83c7c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff869c933c | | | | 0xfff869c955c | | | | (nil) | | | | | | | |--23.10%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff86a85d4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff869c933c | | | | 0xfff869c955c | | | | (nil) | | | --0.01%-- [...] | | --0.01%-- [...] 
| | | |--28.88%-- .__follow_mount | | (nil) | | .do_lookup | | .__link_path_walk | | .path_walk | | | | | |--50.26%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff86a83c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff869c933c | | | 0xfff869c955c | | | (nil) | | | | | --49.74%-- .do_path_lookup | | .user_path_parent | | .do_unlinkat | | syscall_exit | | 0xfff86a85d4c | | .testcase | | .affinitize | | .new_task | | .main | | 0xfff869c933c | | 0xfff869c955c | | (nil) | | | |--20.33%-- .path_get | | (nil) | | | | | |--69.39%-- .path_init | | | | | | | |--50.78%-- .do_path_lookup | | | | .user_path_parent | | | | .do_unlinkat | | | | syscall_exit | | | | 0xfff86a85d4c | | | | .testcase | | | | .affinitize | | | | .new_task | | | | .main | | | | 0xfff869c933c | | | | 0xfff869c955c | | | | (nil) | | | | | | | --49.22%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff86a83c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff869c933c | | | 0xfff869c955c | | | (nil) | | | | | --30.61%-- .path_walk | | | | | |--53.29%-- .do_filp_open | | | .do_sys_open | | | syscall_exit | | | 0xfff86a83c7c | | | .testcase | | | .affinitize | | | .new_task | | | .main | | | 0xfff869c933c | | | 0xfff869c955c | | | (nil) | | | | | --46.71%-- .do_path_lookup | | .user_path_parent | | .do_unlinkat | | syscall_exit | | 0xfff86a85d4c | | .testcase | | .affinitize | | .new_task | | .main | | 0xfff869c933c | | 0xfff869c955c | | (nil) | --0.83%-- [...] --0.03%-- [...] ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Latest vfs scalability patch 2009-10-15 11:23 ` Anton Blanchard @ 2009-10-15 11:41 ` Nick Piggin 2009-10-15 11:48 ` Nick Piggin 0 siblings, 1 reply; 57+ messages in thread From: Nick Piggin @ 2009-10-15 11:41 UTC (permalink / raw) To: Anton Blanchard Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe On Thu, Oct 15, 2009 at 10:23:29PM +1100, Anton Blanchard wrote: > > Hi Nick, > > > I wonder what other good performance tests you can add to your test > > framework? creat/unlink is another easy one. And for each case, putting > > threads in their own cwd versus a common cwd are the variants. > > I did try the two combinations of creat/unlink but haven't had a chance to > digest the profiles yet. I've attached them (taken at 64 cores, ie worst > case :) > > In both cases performance was significantly better than mainline. > > > BTW. for these cases in your tests it will be nice if you can run on > > ramfs because that will isolate purely the vfs. Perhaps also include > > other filesystems as you get time, but I think ramfs is the most > > useful for us to start with. > > Good point. I'll add that into the setup scripts. > > Anton > # Samples: 82617 > # > # Overhead Command Shared Object Symbol > # ........ ............... ................................. ...... > # > 99.16% unlink1_process [kernel] [k] ._spin_lock > | > |--99.98%-- ._spin_lock > | | > | |--49.80%-- .path_get > | |--49.58%-- .dput Hmm, both your profiles look like they are hammering on a common cwd here. The lock-free path walk can probably be extended to help a bit, but you would still end up hitting locks on the parent dentry/inode when doing the create/destroy. My 64-way numbers look like this:

create-unlink  1 processes seperate-cwd 105306.58 ops/s
create-unlink  2 processes seperate-cwd 103004.20 ops/s
create-unlink  4 processes seperate-cwd  92438.69 ops/s
create-unlink  8 processes seperate-cwd  91138.93 ops/s
create-unlink 16 processes seperate-cwd  91025.36 ops/s
create-unlink 32 processes seperate-cwd  83757.75 ops/s
create-unlink 64 processes seperate-cwd  81718.29 ops/s
create-unlink  1 processes same-cwd     110139.61 ops/s
create-unlink  2 processes same-cwd      26611.69 ops/s
create-unlink  4 processes same-cwd      13819.46 ops/s
create-unlink  8 processes same-cwd       4724.83 ops/s
create-unlink 16 processes same-cwd       1368.99 ops/s
create-unlink 32 processes same-cwd        335.08 ops/s
create-unlink 64 processes same-cwd        114.88 ops/s

If your seperate-cwd numbers aren't scaling reasonably well, I will have to get to the bottom of it. BTW, these numbers are ops/s/cpu, which in some cases I like better. At a quick glance it is very easy to see if scalability is linear, and when you graph it there is less tendency to drown out the low-end numbers (although it could go the other way and drown the high-end numbers for non-scalable cases). Maybe you could add an option to output either. Thanks, Nick ^ permalink raw reply [flat|nested] 57+ messages in thread
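To make the ops/s/cpu presentation concrete: assuming one process per CPU, aggregate throughput is the per-cpu figure times the process count. Nick's 64-process separate-cwd case is therefore roughly 81718 * 64, about 5.2 million create/unlink pairs per second in total, while the 64-process same-cwd case collapses to about 114.88 * 64, around 7400 in aggregate. A flat column of per-cpu numbers is exactly what linear scalability looks like in this form.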
* Re: Latest vfs scalability patch 2009-10-15 11:41 ` Nick Piggin @ 2009-10-15 11:48 ` Nick Piggin 0 siblings, 0 replies; 57+ messages in thread From: Nick Piggin @ 2009-10-15 11:48 UTC (permalink / raw) To: Anton Blanchard Cc: Linux Kernel Mailing List, linux-fsdevel, Ravikiran G Thirumalai, Peter Zijlstra, Linus Torvalds, Jens Axboe On Thu, Oct 15, 2009 at 01:41:19PM +0200, Nick Piggin wrote: > On Thu, Oct 15, 2009 at 10:23:29PM +1100, Anton Blanchard wrote: > > > > Hi Nick, > > > > > I wonder what other good performance tests you can add to your test > > > framework? creat/unlink is another easy one. And for each case, putting > > > threads in their own cwd versus a common cwd are the variants. > > > > I did try the two combinations of creat/unlink but haven't had a chance to > > digest the profiles yet. I've attached them (taken at 64 cores, ie worst > > case :) > > > > In both cases performance was significantly better than mainline. > > > > > BTW. for these cases in your tests it will be nice if you can run on > > > ramfs because that will isolate purely the vfs. Perhaps also include > > > other filesystems as you get time, but I think ramfs is the most > > > useful for us to start with. > > > > Good point. I'll add that into the setup scripts. > > > > Anton > > > # Samples: 82617 > > # > > # Overhead Command Shared Object Symbol > > # ........ ............... ................................. ...... > > # > > 99.16% unlink1_process [kernel] [k] ._spin_lock > > | > > |--99.98%-- ._spin_lock > > | | > > | |--49.80%-- .path_get > > | |--49.58%-- .dput > > Hmm, both your profiles look like they are hammering on a common cwd > here. The lock-free path walk can probably be extended to help a bit, > but you would still end up hitting locks on the parent dentry/inode > when doing the create/destroy. My 64-way numbers look like this: > >
> create-unlink  1 processes seperate-cwd 105306.58 ops/s
> create-unlink  2 processes seperate-cwd 103004.20 ops/s
> create-unlink  4 processes seperate-cwd  92438.69 ops/s
> create-unlink  8 processes seperate-cwd  91138.93 ops/s
> create-unlink 16 processes seperate-cwd  91025.36 ops/s
> create-unlink 32 processes seperate-cwd  83757.75 ops/s
> create-unlink 64 processes seperate-cwd  81718.29 ops/s

dumb profile for this guy looks like this:

206681 total                      0.0270
 25851 _spin_lock              161.5687
 13628 kmem_cache_free           7.3427
  9890 _spin_unlock             61.8125
  7087 kmem_cache_alloc          6.5138
  6770 _read_lock               35.2604
  5587 __call_rcu                4.8498
  5580 __link_path_walk          0.5571
  5246 do_filp_open              0.9476
  4946 __rcu_process_callbacks   2.0608
  4904 __percpu_counter_add     11.7885
  3933 d_alloc                   5.1211
  3906 memset                    3.6989
  3807 path_init_rcu             3.2154
  3370 __mutex_init             35.1042
  3254 mnt_want_write            4.6222

oprofile isn't working on this guy either, and I no longer have the patience to try working out where such locking is coming from without lockdep or perf ;) But it sure is a lot better than your profiles... ^ permalink raw reply [flat|nested] 57+ messages in thread