* [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) [not found] <20101213023733.GB6522@amd> @ 2010-12-13 2:42 ` Nick Piggin 2010-12-13 3:31 ` Nick Piggin 2010-12-13 7:25 ` Eric Dumazet 2010-12-13 2:53 ` rcu-walk and dcache scaling tree update and status Ed Tomlinson 2010-12-13 3:40 ` Stephen Rothwell 2 siblings, 2 replies; 16+ messages in thread From: Nick Piggin @ 2010-12-13 2:42 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel On Mon, Dec 13, 2010 at 01:37:33PM +1100, Nick Piggin wrote: > Final note: > You won't be able to reproduce the parallel path walk scalability > numbers that I've posted, because the vfsmount refcounting scalability > patch is not included. I have a new idea for that now, so I'll be asking > for comments with that soon. Here is the patch I've been using, which works but has the problem described in the changelog. But it works nicely for testing. As I said, I have a promising approach to solving the problem. fs: scale mntget/mntput Improve scalability of mntget/mntput by using per-cpu counters protected by the reader side of the brlock vfsmount_lock. If the mnt_hash field of the vfsmount structure is attached to a list, then it is mounted which contributes to its refcount, so the per-cpu counters need not be summed. MNT_PSEUDO keeps track of whether the vfsmount is actually a pseudo filesystem that will never be attached (such as sockfs). No extra atomics in the common case because atomic mnt refcount is now replaced with per-CPU spinlock. Code will be bigger and more complex however. With the previous per-cpu locking patch, mount lookups and common case refcounting are now per-cpu and should be ideally scalable. path lookups (and hence path_get/path_put) within the same vfsmount should now be more scalable, however this will often be hidden by dcache_lock on final dput, and d_lock on common path elements (eg. cwd or root dentry). Signed-off-by: Nick Piggin <npiggin@kernel.dk> [Note: this is not for merging. Un-attached operation (lazy umount) may not be uncommon and will be slowed down and actually have worse scalablilty after this patch. I need to think about how to do fast refcounting with unattached mounts.] --- drivers/mtd/mtdchar.c | 1 fs/internal.h | 1 fs/libfs.c | 1 fs/namespace.c | 167 +++++++++++++++++++++++++++++++++++++++++++------- fs/pnode.c | 4 - include/linux/mount.h | 26 +------ 6 files changed, 154 insertions(+), 46 deletions(-) Index: linux-2.6/fs/namespace.c =================================================================== --- linux-2.6.orig/fs/namespace.c 2010-12-12 03:48:57.000000000 +1100 +++ linux-2.6/fs/namespace.c 2010-12-12 03:51:52.000000000 +1100 @@ -138,6 +138,64 @@ void mnt_release_group_id(struct vfsmoun mnt->mnt_group_id = 0; } +/* + * vfsmount lock must be held for read + */ +static inline void add_mnt_count(struct vfsmount *mnt, int n) +{ +#ifdef CONFIG_SMP + (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n; +#else + mnt->mnt_count += n; +#endif +} + +static inline void set_mnt_count(struct vfsmount *mnt, int n) +{ +#ifdef CONFIG_SMP + preempt_disable(); + (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) = n; + preempt_enable(); +#else + mnt->mnt_count = n; +#endif +} + +/* + * vfsmount lock must be held for read + */ +static inline void inc_mnt_count(struct vfsmount *mnt) +{ + add_mnt_count(mnt, 1); +} + +/* + * vfsmount lock must be held for read + */ +static inline void dec_mnt_count(struct vfsmount *mnt) +{ + add_mnt_count(mnt, -1); +} + +/* + * vfsmount lock must be held for write + */ +unsigned int count_mnt_count(struct vfsmount *mnt) +{ +#ifdef CONFIG_SMP + unsigned int count = 0; + int cpu; + + for_each_possible_cpu(cpu) { + count += *per_cpu_ptr(mnt->mnt_count, cpu); + } + + return count; +#else + return mnt->mnt_count; +#endif +} + struct vfsmount *alloc_vfsmnt(const char *name) { struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL); @@ -154,7 +212,15 @@ struct vfsmount *alloc_vfsmnt(const char goto out_free_id; } - atomic_set(&mnt->mnt_count, 1); +#ifdef CONFIG_SMP + mnt->mnt_count = alloc_percpu(int); + if (!mnt->mnt_count) + goto out_free_devname; +#else + mnt->mnt_count = 0; +#endif + set_mnt_count(mnt, 1); + INIT_LIST_HEAD(&mnt->mnt_hash); INIT_LIST_HEAD(&mnt->mnt_child); INIT_LIST_HEAD(&mnt->mnt_mounts); @@ -169,7 +235,7 @@ struct vfsmount *alloc_vfsmnt(const char #ifdef CONFIG_SMP mnt->mnt_writers = alloc_percpu(int); if (!mnt->mnt_writers) - goto out_free_devname; + goto out_free_mntcount; #else mnt->mnt_writers = 0; #endif @@ -177,6 +243,8 @@ struct vfsmount *alloc_vfsmnt(const char return mnt; #ifdef CONFIG_SMP +out_free_mntcount: + free_percpu(mnt->mnt_count); out_free_devname: kfree(mnt->mnt_devname); #endif @@ -662,8 +730,8 @@ static inline void __mntput(struct vfsmo * to make r/w->r/o transitions. */ /* - * atomic_dec_and_lock() used to deal with ->mnt_count decrements - * provides barriers, so count_mnt_writers() below is safe. AV + * The locking used to deal with mnt_count decrement provides barriers, + * so count_mnt_writers() below is safe. */ WARN_ON(count_mnt_writers(mnt)); fsnotify_vfsmount_delete(mnt); @@ -675,45 +743,76 @@ static inline void __mntput(struct vfsmo void mntput_no_expire(struct vfsmount *mnt) { repeat: - if (atomic_add_unless(&mnt->mnt_count, -1, 1)) + if (likely(!list_empty(&mnt->mnt_hash) || + mnt->mnt_flags & MNT_PSEUDO)) { + br_read_lock(vfsmount_lock); + if (unlikely(list_empty(&mnt->mnt_hash) && + (!(mnt->mnt_flags & MNT_PSEUDO)))) { + br_read_unlock(vfsmount_lock); + goto repeat; + } + dec_mnt_count(mnt); + br_read_unlock(vfsmount_lock); return; + } + br_write_lock(vfsmount_lock); - if (!atomic_dec_and_test(&mnt->mnt_count)) { + dec_mnt_count(mnt); + if (count_mnt_count(mnt)) { br_write_unlock(vfsmount_lock); return; } - if (likely(!mnt->mnt_pinned)) { + if (unlikely(mnt->mnt_pinned)) { + add_mnt_count(mnt, mnt->mnt_pinned + 1); + mnt->mnt_pinned = 0; br_write_unlock(vfsmount_lock); - __mntput(mnt); - return; + acct_auto_close_mnt(mnt); + goto repeat; } - atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count); - mnt->mnt_pinned = 0; br_write_unlock(vfsmount_lock); - acct_auto_close_mnt(mnt); - goto repeat; + __mntput(mnt); } EXPORT_SYMBOL(mntput_no_expire); +void mntput(struct vfsmount *mnt) +{ + if (mnt) { + /* avoid cacheline pingpong, hope gcc doesn't get "smart" */ + if (unlikely(mnt->mnt_expiry_mark)) + mnt->mnt_expiry_mark = 0; + mntput_no_expire(mnt); + } +} +EXPORT_SYMBOL(mntput); + +struct vfsmount *mntget(struct vfsmount *mnt) +{ + if (mnt) { + preempt_disable(); + inc_mnt_count(mnt); + preempt_enable(); + } + return mnt; +} +EXPORT_SYMBOL(mntget); + void mnt_pin(struct vfsmount *mnt) { br_write_lock(vfsmount_lock); mnt->mnt_pinned++; br_write_unlock(vfsmount_lock); } - EXPORT_SYMBOL(mnt_pin); void mnt_unpin(struct vfsmount *mnt) { br_write_lock(vfsmount_lock); if (mnt->mnt_pinned) { - atomic_inc(&mnt->mnt_count); + inc_mnt_count(mnt); mnt->mnt_pinned--; } br_write_unlock(vfsmount_lock); } - EXPORT_SYMBOL(mnt_unpin); static inline void mangle(struct seq_file *m, const char *s) @@ -1008,12 +1107,13 @@ int may_umount_tree(struct vfsmount *mnt int minimum_refs = 0; struct vfsmount *p; - br_read_lock(vfsmount_lock); + /* write lock needed for count_mnt_count */ + br_write_lock(vfsmount_lock); for (p = mnt; p; p = next_mnt(p, mnt)) { - actual_refs += atomic_read(&p->mnt_count); + actual_refs += count_mnt_count(p); minimum_refs += 2; } - br_read_unlock(vfsmount_lock); + br_write_unlock(vfsmount_lock); if (actual_refs > minimum_refs) return 0; @@ -1040,10 +1140,10 @@ int may_umount(struct vfsmount *mnt) { int ret = 1; down_read(&namespace_sem); - br_read_lock(vfsmount_lock); + br_write_lock(vfsmount_lock); if (propagate_mount_busy(mnt, 2)) ret = 0; - br_read_unlock(vfsmount_lock); + br_write_unlock(vfsmount_lock); up_read(&namespace_sem); return ret; } @@ -1125,8 +1225,16 @@ static int do_umount(struct vfsmount *mn flags & (MNT_FORCE | MNT_DETACH)) return -EINVAL; - if (atomic_read(&mnt->mnt_count) != 2) + /* + * probably don't strictly need the lock here if we examined + * all race cases, but it's a slowpath. + */ + br_write_lock(vfsmount_lock); + if (count_mnt_count(mnt) != 2) { + br_write_lock(vfsmount_lock); return -EBUSY; + } + br_write_unlock(vfsmount_lock); if (!xchg(&mnt->mnt_expiry_mark, 1)) return -EAGAIN; @@ -2350,6 +2458,12 @@ SYSCALL_DEFINE2(pivot_root, const char _ touch_mnt_namespace(current->nsproxy->mnt_ns); br_write_unlock(vfsmount_lock); chroot_fs_refs(&root, &new); + + /* Drop MNT_PSEUDO from old, add it to new. See init_mount_tree */ + BUG_ON(!(root.mnt->mnt_flags & MNT_PSEUDO)); + root.mnt->mnt_flags &= ~MNT_PSEUDO; + new.mnt->mnt_flags |= MNT_PSEUDO; + error = 0; path_put(&root_parent); path_put(&parent_path); @@ -2376,6 +2490,13 @@ static void __init init_mount_tree(void) mnt = do_kern_mount("rootfs", 0, "rootfs", NULL); if (IS_ERR(mnt)) panic("Can't create rootfs"); + /* + * MNT_PSEUDO tells mnt refcounting that we're pinned, so don't + * bother checking for zero references. Give one of these to root + * because it isn't "attached" to the tree. See mntput(). + */ + mnt->mnt_flags |= MNT_PSEUDO; + ns = create_mnt_ns(mnt); if (IS_ERR(ns)) panic("Can't allocate initial namespace"); Index: linux-2.6/include/linux/mount.h =================================================================== --- linux-2.6.orig/include/linux/mount.h 2010-12-12 03:27:08.000000000 +1100 +++ linux-2.6/include/linux/mount.h 2010-12-12 03:51:52.000000000 +1100 @@ -30,6 +30,7 @@ struct mnt_namespace; #define MNT_SHRINKABLE 0x100 #define MNT_WRITE_HOLD 0x200 +#define MNT_PSEUDO 0x400 #define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */ #define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */ @@ -70,19 +71,15 @@ struct vfsmount { struct mnt_namespace *mnt_ns; /* containing namespace */ int mnt_id; /* mount identifier */ int mnt_group_id; /* peer group identifier */ - /* - * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount - * to let these frequently modified fields in a separate cache line - * (so that reads of mnt_flags wont ping-pong on SMP machines) - */ - atomic_t mnt_count; int mnt_expiry_mark; /* true if marked for expiry */ int mnt_pinned; int mnt_ghosts; #ifdef CONFIG_SMP int __percpu *mnt_writers; + int __percpu *mnt_count; #else int mnt_writers; + int mnt_count; #endif }; @@ -95,13 +92,6 @@ static inline int *get_mnt_writers_ptr(s #endif } -static inline struct vfsmount *mntget(struct vfsmount *mnt) -{ - if (mnt) - atomic_inc(&mnt->mnt_count); - return mnt; -} - struct file; /* forward dec */ extern int mnt_want_write(struct vfsmount *mnt); @@ -109,18 +99,12 @@ extern int mnt_want_write_file(struct fi extern int mnt_clone_write(struct vfsmount *mnt); extern void mnt_drop_write(struct vfsmount *mnt); extern void mntput_no_expire(struct vfsmount *mnt); +extern void mntput(struct vfsmount *mnt); +extern struct vfsmount *mntget(struct vfsmount *mnt); extern void mnt_pin(struct vfsmount *mnt); extern void mnt_unpin(struct vfsmount *mnt); extern int __mnt_is_readonly(struct vfsmount *mnt); -static inline void mntput(struct vfsmount *mnt) -{ - if (mnt) { - mnt->mnt_expiry_mark = 0; - mntput_no_expire(mnt); - } -} - extern struct vfsmount *do_kern_mount(const char *fstype, int flags, const char *name, void *data); Index: linux-2.6/fs/pnode.c =================================================================== --- linux-2.6.orig/fs/pnode.c 2010-12-12 03:27:08.000000000 +1100 +++ linux-2.6/fs/pnode.c 2010-12-12 03:51:52.000000000 +1100 @@ -288,7 +288,7 @@ int propagate_mnt(struct vfsmount *dest_ */ static inline int do_refcount_check(struct vfsmount *mnt, int count) { - int mycount = atomic_read(&mnt->mnt_count) - mnt->mnt_ghosts; + int mycount = count_mnt_count(mnt) - mnt->mnt_ghosts; return (mycount > count); } @@ -300,7 +300,7 @@ static inline int do_refcount_check(stru * Check if any of these mounts that **do not have submounts** * have more references than 'refcnt'. If so return busy. * - * vfsmount lock must be held for read or write + * vfsmount lock must be held for write */ int propagate_mount_busy(struct vfsmount *mnt, int refcnt) { Index: linux-2.6/fs/internal.h =================================================================== --- linux-2.6.orig/fs/internal.h 2010-12-12 03:27:08.000000000 +1100 +++ linux-2.6/fs/internal.h 2010-12-12 03:51:52.000000000 +1100 @@ -63,6 +63,7 @@ extern int copy_mount_string(const void extern void free_vfsmnt(struct vfsmount *); extern struct vfsmount *alloc_vfsmnt(const char *); +extern unsigned int count_mnt_count(struct vfsmount *mnt); extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int); extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *, struct vfsmount *); Index: linux-2.6/drivers/mtd/mtdchar.c =================================================================== --- linux-2.6.orig/drivers/mtd/mtdchar.c 2010-12-12 03:27:08.000000000 +1100 +++ linux-2.6/drivers/mtd/mtdchar.c 2010-12-12 03:51:52.000000000 +1100 @@ -1201,6 +1201,7 @@ static int __init init_mtdchar(void) static void __exit cleanup_mtdchar(void) { unregister_mtd_user(&mtdchar_notifier); + mtd_inode_mnt->mnt_flags &= ~MNT_PSEUDO; mntput(mtd_inode_mnt); unregister_filesystem(&mtd_inodefs_type); __unregister_chrdev(MTD_CHAR_MAJOR, 0, 1 << MINORBITS, "mtd"); Index: linux-2.6/arch/ia64/kernel/perfmon.c =================================================================== --- linux-2.6.orig/arch/ia64/kernel/perfmon.c 2010-12-12 03:48:57.000000000 +1100 +++ linux-2.6/arch/ia64/kernel/perfmon.c 2010-12-12 03:51:52.000000000 +1100 @@ -1553,8 +1553,10 @@ init_pfm_fs(void) err = PTR_ERR(pfmfs_mnt); if (IS_ERR(pfmfs_mnt)) unregister_filesystem(&pfm_fs_type); - else + else { err = 0; + pfmfs_mnt->mnt_flags |= MNT_PSEUDO; + } } return err; } Index: linux-2.6/fs/anon_inodes.c =================================================================== --- linux-2.6.orig/fs/anon_inodes.c 2010-12-12 03:51:50.000000000 +1100 +++ linux-2.6/fs/anon_inodes.c 2010-12-12 03:51:52.000000000 +1100 @@ -223,6 +223,7 @@ static int __init anon_inode_init(void) error = PTR_ERR(anon_inode_mnt); goto err_unregister_filesystem; } + anon_inode_mnt->mnt_flags |= MNT_PSEUDO; anon_inode_inode = anon_inode_mkinode(); if (IS_ERR(anon_inode_inode)) { error = PTR_ERR(anon_inode_inode); @@ -232,6 +233,7 @@ static int __init anon_inode_init(void) return 0; err_mntput: + anon_inode_mnt->mnt_flags &= ~MNT_PSEUDO; mntput(anon_inode_mnt); err_unregister_filesystem: unregister_filesystem(&anon_inode_fs_type); Index: linux-2.6/fs/block_dev.c =================================================================== --- linux-2.6.orig/fs/block_dev.c 2010-12-12 03:27:08.000000000 +1100 +++ linux-2.6/fs/block_dev.c 2010-12-12 03:51:52.000000000 +1100 @@ -499,6 +499,7 @@ void __init bdev_cache_init(void) bd_mnt = kern_mount(&bd_type); if (IS_ERR(bd_mnt)) panic("Cannot create bdev pseudo-fs"); + bd_mnt->mnt_flags |= MNT_PSEUDO; /* * This vfsmount structure is only used to obtain the * blockdev_superblock, so tell kmemleak not to report it. Index: linux-2.6/fs/pipe.c =================================================================== --- linux-2.6.orig/fs/pipe.c 2010-12-12 03:51:50.000000000 +1100 +++ linux-2.6/fs/pipe.c 2010-12-12 03:51:52.000000000 +1100 @@ -1285,6 +1285,7 @@ static int __init init_pipe_fs(void) err = PTR_ERR(pipe_mnt); unregister_filesystem(&pipe_fs_type); } + pipe_mnt->mnt_flags |= MNT_PSEUDO; } return err; } @@ -1292,6 +1293,7 @@ static int __init init_pipe_fs(void) static void __exit exit_pipe_fs(void) { unregister_filesystem(&pipe_fs_type); + pipe_mnt->mnt_flags &= ~MNT_PSEUDO; mntput(pipe_mnt); } Index: linux-2.6/net/socket.c =================================================================== --- linux-2.6.orig/net/socket.c 2010-12-12 03:51:50.000000000 +1100 +++ linux-2.6/net/socket.c 2010-12-12 03:51:52.000000000 +1100 @@ -2375,6 +2375,8 @@ EXPORT_SYMBOL(sock_unregister); static int __init sock_init(void) { + int err; + /* * Initialize sock SLAB cache. */ @@ -2391,8 +2393,16 @@ static int __init sock_init(void) */ init_inodecache(); - register_filesystem(&sock_fs_type); + + err = register_filesystem(&sock_fs_type); + if (err) + goto out_fs; sock_mnt = kern_mount(&sock_fs_type); + if (IS_ERR(sock_mnt)) { + err = PTR_ERR(sock_mnt); + goto out_mount; + } + sock_mnt->mnt_flags |= MNT_PSEUDO; /* The real protocol initialization is performed in later initcalls. */ @@ -2405,7 +2415,13 @@ static int __init sock_init(void) skb_timestamping_init(); #endif - return 0; +out: + return err; + +out_mount: + unregister_filesystem(&sock_fs_type); +out_fs: + goto out; } core_initcall(sock_init); /* early initcall */ ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) 2010-12-13 2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin @ 2010-12-13 3:31 ` Nick Piggin 2010-12-13 3:43 ` Nick Piggin 2010-12-13 7:25 ` Eric Dumazet 1 sibling, 1 reply; 16+ messages in thread From: Nick Piggin @ 2010-12-13 3:31 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel On Mon, Dec 13, 2010 at 01:42:17PM +1100, Nick Piggin wrote: > On Mon, Dec 13, 2010 at 01:37:33PM +1100, Nick Piggin wrote: > > Final note: > > You won't be able to reproduce the parallel path walk scalability > > numbers that I've posted, because the vfsmount refcounting scalability > > patch is not included. I have a new idea for that now, so I'll be asking > > for comments with that soon. > > Here is the patch I've been using, which works but has the problem > described in the changelog. But it works nicely for testing. > > As I said, I have a promising approach to solving the problem. > > fs: scale mntget/mntput [...] > [Note: this is not for merging. Un-attached operation (lazy umount) may not be > uncommon and will be slowed down and actually have worse scalablilty after > this patch. I need to think about how to do fast refcounting with unattached > mounts.] So the problem this patch tries to fix is vfsmount refcount scalability. We need to take a ref for every successful path lookup, and often lookups are going to the same mountpoint. (Yes this little bouncing atomic hurts, badly, even on my small 2s12c tightly connected system on the parallel git diff workload -- because there are other bouncing kernel cachelines in this workload). The fundamental difficulty is that a simple refcount can never be SMP scalable, because dropping the ref requires we check whether we are the last reference (which implies communicating with other CPUs that might have taken references). We can make them scalable by keeping a local count, and checking the global sum less frequently. Some possibilities: - avoid checking global sum while vfsmount is mounted, because the mount contributes to the refcount (that is what this patch does, but it kills performance inside a lazy umounted subtree). - check global sum once every time interval (this would delay mount and sb garbage collection, so it's probably a showstopper). - check global sum only if local sum goes to 0 (this is difficult with vfsmounts because the 'get' and the 'put' can happen on different CPUs, so we'd need to have a per-thread refcount, or carry around the CPU number with the refcount, both get horribly ugly, it turns out). My proposal is a variant / generalisation of the 1st idea, which is to have "long" refcounts. Normal refcounts will be per-cpu difference of incs and decs, but dropping a reference will not have to check the global sum while "long" refcounts are elevated. If the mount is a long refcount, then that is what this current patch essentially is. But then I would also have cwd take the long refcount, which allows detached operation to remain fast while there are processes working inside the detached namespace. Details of locking aren't completely worked out -- it's a bit more tricky because umount can be much heavier than fork() or chdir(), so there are some difficulties in making long refcount operations faster (the problem is remaining race-free versus the fast mntput check, but I think a seqcount to go with the long refcount should do the trick). ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) 2010-12-13 3:31 ` Nick Piggin @ 2010-12-13 3:43 ` Nick Piggin 0 siblings, 0 replies; 16+ messages in thread From: Nick Piggin @ 2010-12-13 3:43 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel On Mon, Dec 13, 2010 at 02:31:10PM +1100, Nick Piggin wrote: > (Yes this little bouncing atomic hurts, badly, even on my small 2s12c > tightly connected system on the parallel git diff workload -- because > there are other bouncing kernel cachelines in this workload). ^^^ are no other ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) 2010-12-13 2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin 2010-12-13 3:31 ` Nick Piggin @ 2010-12-13 7:25 ` Eric Dumazet 2010-12-13 8:33 ` Nick Piggin 2010-12-14 12:40 ` Nick Piggin 1 sibling, 2 replies; 16+ messages in thread From: Eric Dumazet @ 2010-12-13 7:25 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel Le lundi 13 décembre 2010 à 13:42 +1100, Nick Piggin a écrit : > + */ > +static inline void add_mnt_count(struct vfsmount *mnt, int n) maybe name it __add_mnt_count() (should be called with preempt off) > +{ > +#ifdef CONFIG_SMP > + (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n; __this_cpu_add(mnt->mnt_count, n); > +#else > + mnt->mnt_count += n; > +#endif > +} and define a preempt safe version : static inline void __add_mnt_count(struct vfsmount *mnt, int n) { #ifdef CONFIG_SMP __this_cpu_add(mnt->mnt_count, n); #else mnt->mnt_count += n; #endif } static inline void add_mnt_count(struct vfsmount *mnt, int n) { #ifdef CONFIG_SMP this_cpu_add(mnt->mnt_count, n); #else preempt_disable(); mnt->mnt_count += n; preempt_enable(); #endif } > + > +static inline void set_mnt_count(struct vfsmount *mnt, int n) > +{ > +#ifdef CONFIG_SMP > + preempt_disable(); > + (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) = n; > + preempt_enable(); last 3 lines can be replaced by : this_cpu_write(mnt->mnt_count, n); > +#else > + mnt->mnt_count = n; > +#endif > +} > + > #ifdef CONFIG_SMP > int __percpu *mnt_writers; > + int __percpu *mnt_count; > #else > int mnt_writers; > + int mnt_count; > #endif You could use a struct and one per cpu allocation to use one cache line for both objects : struct mnt_counters { int writers; int count; }; ... #ifdef CONFIG_SMP struct mnt_counters __percpu *mnt_counters; #else struct mnt_counters mnt_counters; #endif This would use one pointer instead of two in SMP ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) 2010-12-13 7:25 ` Eric Dumazet @ 2010-12-13 8:33 ` Nick Piggin 2010-12-14 12:40 ` Nick Piggin 1 sibling, 0 replies; 16+ messages in thread From: Nick Piggin @ 2010-12-13 8:33 UTC (permalink / raw) To: Eric Dumazet Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel On Mon, Dec 13, 2010 at 6:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le lundi 13 décembre 2010 à 13:42 +1100, Nick Piggin a écrit : >> + */ >> +static inline void add_mnt_count(struct vfsmount *mnt, int n) > > maybe name it __add_mnt_count() (should be called with preempt off) > >> +{ >> +#ifdef CONFIG_SMP >> + (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n; > > __this_cpu_add(mnt->mnt_count, n); > >> +#else >> + mnt->mnt_count += n; >> +#endif >> +} > > and define a preempt safe version : > > static inline void __add_mnt_count(struct vfsmount *mnt, int n) > { > #ifdef CONFIG_SMP > __this_cpu_add(mnt->mnt_count, n); > #else > mnt->mnt_count += n; > #endif > } > > static inline void add_mnt_count(struct vfsmount *mnt, int n) > { > #ifdef CONFIG_SMP > this_cpu_add(mnt->mnt_count, n); > #else > preempt_disable(); > mnt->mnt_count += n; > preempt_enable(); > #endif > } That looks good, thanks. >> + >> +static inline void set_mnt_count(struct vfsmount *mnt, int n) >> +{ >> +#ifdef CONFIG_SMP > > >> + preempt_disable(); >> + (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) = n; >> + preempt_enable(); > > last 3 lines can be replaced by : > > this_cpu_write(mnt->mnt_count, n); Yep, thanks. >> +#else >> + mnt->mnt_count = n; >> +#endif >> +} >> + > >> #ifdef CONFIG_SMP >> int __percpu *mnt_writers; >> + int __percpu *mnt_count; >> #else >> int mnt_writers; >> + int mnt_count; >> #endif > > You could use a struct and one per cpu allocation to use one cache line > for both objects : > > struct mnt_counters { > int writers; > int count; > }; > > ... > > #ifdef CONFIG_SMP > struct mnt_counters __percpu *mnt_counters; > #else > struct mnt_counters mnt_counters; > #endif > > This would use one pointer instead of two in SMP Yes that's a good point too. Thanks, Nick ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) 2010-12-13 7:25 ` Eric Dumazet 2010-12-13 8:33 ` Nick Piggin @ 2010-12-14 12:40 ` Nick Piggin 2010-12-15 8:16 ` Andreas Dilger 1 sibling, 1 reply; 16+ messages in thread From: Nick Piggin @ 2010-12-14 12:40 UTC (permalink / raw) To: Eric Dumazet Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel OK, this ones includes your suggestions, and implements the approach I outlined earlier. I verified it works and is scalable on detached mounts, so I'll add it to the git branch when it survives a bit more testing. It's the last piece of the puzzle... fs: scale mntget/mntput The problem that this patch aims to fix is vfsmount refcounting scalability. We need to take a reference on the vfsmount for every successful path lookup, which often go to the same mount point. The fundamental difficulty is that a "simple" reference count can never be made scalable, because any time a reference is dropped, we must check whether that was the last reference. To do that requires communication with all other CPUs that may have taken a reference count. We can make refcounts more scalable in a couple of ways, involving keeping distributed counters, and checking for the global-zero condition less frequently. - check the global sum once every interval (this will delay zero detection for some interval, so it's probably a showstopper for vfsmounts). - keep a local count and only taking the global sum when local reaches 0 (this is difficult for vfsmounts, because we can't hold preempt off for the life of a reference, so a counter would need to be per-thread or tied strongly to a particular CPU which requires more locking). - keep a local difference of increments and decrements, which allows us to sum the total difference and hence find the refcount when summing all CPUs. Then, keep a single integer "long" refcount for slow and long lasting references, and only take the global sum of local counters when the long refcount is 0. This last scheme is what I implemented here. Attached mounts and process root and working directory references are "long" references, and everything else is a short reference. This allows scalable vfsmount references during path walking over mounted subtrees and unattached (lazy umounted) mounts with processes still running in them. This results in one fewer atomic op in the fastpath: mntget is now just a per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock and non-atomic decrement in the common case. However code is otherwise bigger and heavier, so single threaded performance is basically a wash. Signed-off-by: Nick Piggin <npiggin@kernel.dk> --- arch/ia64/kernel/perfmon.c | 2 drivers/mtd/mtdchar.c | 2 fs/anon_inodes.c | 2 fs/fs_struct.c | 26 +++- fs/internal.h | 1 fs/namei.c | 24 ++++ fs/namespace.c | 240 +++++++++++++++++++++++++++++++++++++-------- fs/pipe.c | 2 fs/pnode.c | 4 fs/super.c | 2 include/linux/mount.h | 53 +++------ include/linux/path.h | 2 net/socket.c | 19 +++ 13 files changed, 282 insertions(+), 97 deletions(-) Index: linux-2.6/fs/namespace.c =================================================================== --- linux-2.6.orig/fs/namespace.c 2010-12-14 21:07:19.000000000 +1100 +++ linux-2.6/fs/namespace.c 2010-12-14 23:33:00.000000000 +1100 @@ -138,6 +138,64 @@ void mnt_release_group_id(struct vfsmoun mnt->mnt_group_id = 0; } +/* + * vfsmount lock must be held for read + */ +static inline void add_mnt_count(struct vfsmount *mnt, int n) +{ +#ifdef CONFIG_SMP + this_cpu_add(mnt->mnt_pcp->mnt_count, n); +#else + preempt_disable(); + mnt->mnt_count += n; + preempt_enable(); +#endif +} + +static inline void set_mnt_count(struct vfsmount *mnt, int n) +{ +#ifdef CONFIG_SMP + this_cpu_add(mnt->mnt_pcp->mnt_count, n); +#else + mnt->mnt_count = n; +#endif +} + +/* + * vfsmount lock must be held for read + */ +static inline void inc_mnt_count(struct vfsmount *mnt) +{ + add_mnt_count(mnt, 1); +} + +/* + * vfsmount lock must be held for read + */ +static inline void dec_mnt_count(struct vfsmount *mnt) +{ + add_mnt_count(mnt, -1); +} + +/* + * vfsmount lock must be held for write + */ +unsigned int count_mnt_count(struct vfsmount *mnt) +{ +#ifdef CONFIG_SMP + unsigned int count = atomic_read(&mnt->mnt_longrefs); + int cpu; + + for_each_possible_cpu(cpu) { + count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count; + } + + return count; +#else + return mnt->mnt_count; +#endif +} + struct vfsmount *alloc_vfsmnt(const char *name) { struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL); @@ -154,7 +212,17 @@ struct vfsmount *alloc_vfsmnt(const char goto out_free_id; } - atomic_set(&mnt->mnt_count, 1); +#ifdef CONFIG_SMP + mnt->mnt_pcp = alloc_percpu(struct mnt_pcp); + if (!mnt->mnt_pcp) + goto out_free_devname; + + atomic_set(&mnt->mnt_longrefs, 1); +#else + mnt->mnt_count = 1; + mnt->mnt_writers = 0; +#endif + INIT_LIST_HEAD(&mnt->mnt_hash); INIT_LIST_HEAD(&mnt->mnt_child); INIT_LIST_HEAD(&mnt->mnt_mounts); @@ -166,13 +234,6 @@ struct vfsmount *alloc_vfsmnt(const char #ifdef CONFIG_FSNOTIFY INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks); #endif -#ifdef CONFIG_SMP - mnt->mnt_writers = alloc_percpu(int); - if (!mnt->mnt_writers) - goto out_free_devname; -#else - mnt->mnt_writers = 0; -#endif } return mnt; @@ -219,7 +280,7 @@ EXPORT_SYMBOL_GPL(__mnt_is_readonly); static inline void inc_mnt_writers(struct vfsmount *mnt) { #ifdef CONFIG_SMP - (*per_cpu_ptr(mnt->mnt_writers, smp_processor_id()))++; + this_cpu_inc(mnt->mnt_pcp->mnt_writers); #else mnt->mnt_writers++; #endif @@ -228,7 +289,7 @@ static inline void inc_mnt_writers(struc static inline void dec_mnt_writers(struct vfsmount *mnt) { #ifdef CONFIG_SMP - (*per_cpu_ptr(mnt->mnt_writers, smp_processor_id()))--; + this_cpu_dec(mnt->mnt_pcp->mnt_writers); #else mnt->mnt_writers--; #endif @@ -241,7 +302,7 @@ static unsigned int count_mnt_writers(st int cpu; for_each_possible_cpu(cpu) { - count += *per_cpu_ptr(mnt->mnt_writers, cpu); + count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_writers; } return count; @@ -418,7 +479,7 @@ void free_vfsmnt(struct vfsmount *mnt) kfree(mnt->mnt_devname); mnt_free_id(mnt); #ifdef CONFIG_SMP - free_percpu(mnt->mnt_writers); + free_percpu(mnt->mnt_pcp); #endif kmem_cache_free(mnt_cache, mnt); } @@ -652,9 +713,10 @@ static struct vfsmount *clone_mnt(struct return NULL; } -static inline void __mntput(struct vfsmount *mnt) +static inline void mntfree(struct vfsmount *mnt) { struct super_block *sb = mnt->mnt_sb; + /* * This probably indicates that somebody messed * up a mnt_want/drop_write() pair. If this @@ -662,8 +724,8 @@ static inline void __mntput(struct vfsmo * to make r/w->r/o transitions. */ /* - * atomic_dec_and_lock() used to deal with ->mnt_count decrements - * provides barriers, so count_mnt_writers() below is safe. AV + * The locking used to deal with mnt_count decrement provides barriers, + * so count_mnt_writers() below is safe. */ WARN_ON(count_mnt_writers(mnt)); fsnotify_vfsmount_delete(mnt); @@ -672,28 +734,113 @@ static inline void __mntput(struct vfsmo deactivate_super(sb); } -void mntput_no_expire(struct vfsmount *mnt) +#ifdef CONFIG_SMP +static inline void __mntput(struct vfsmount *mnt, int longrefs) { -repeat: - if (atomic_add_unless(&mnt->mnt_count, -1, 1)) - return; + if (!longrefs) { +put_again: + br_read_lock(vfsmount_lock); + if (likely(atomic_read(&mnt->mnt_longrefs))) { + dec_mnt_count(mnt); + br_read_unlock(vfsmount_lock); + return; + } + br_read_unlock(vfsmount_lock); + } else { + BUG_ON(!atomic_read(&mnt->mnt_longrefs)); + if (atomic_add_unless(&mnt->mnt_longrefs, -1, 1)) + return; + } + br_write_lock(vfsmount_lock); - if (!atomic_dec_and_test(&mnt->mnt_count)) { + if (!longrefs) + dec_mnt_count(mnt); + else + atomic_dec(&mnt->mnt_longrefs); + if (count_mnt_count(mnt)) { br_write_unlock(vfsmount_lock); return; } - if (likely(!mnt->mnt_pinned)) { + if (unlikely(mnt->mnt_pinned)) { + add_mnt_count(mnt, mnt->mnt_pinned + 1); + mnt->mnt_pinned = 0; br_write_unlock(vfsmount_lock); - __mntput(mnt); + acct_auto_close_mnt(mnt); + goto put_again; + } + br_write_unlock(vfsmount_lock); + mntfree(mnt); +} +#else +static inline void __mntput(struct vfsmount *mnt, int longrefs) +{ +put_again: + dec_mnt_count(mnt); + if (likely(count_mnt_count(mnt))) return; + br_write_lock(vfsmount_lock); + if (unlikely(mnt->mnt_pinned)) { + add_mnt_count(mnt, mnt->mnt_pinned + 1); + mnt->mnt_pinned = 0; + br_write_unlock(vfsmount_lock); + acct_auto_close_mnt(mnt); + goto put_again; } - atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count); - mnt->mnt_pinned = 0; br_write_unlock(vfsmount_lock); - acct_auto_close_mnt(mnt); - goto repeat; + mntfree(mnt); +} +#endif + +static void mntput_no_expire(struct vfsmount *mnt) +{ + __mntput(mnt, 0); } -EXPORT_SYMBOL(mntput_no_expire); + +void mntput(struct vfsmount *mnt) +{ + if (mnt) { + /* avoid cacheline pingpong, hope gcc doesn't get "smart" */ + if (unlikely(mnt->mnt_expiry_mark)) + mnt->mnt_expiry_mark = 0; + __mntput(mnt, 0); + } +} +EXPORT_SYMBOL(mntput); + +struct vfsmount *mntget(struct vfsmount *mnt) +{ + if (mnt) + inc_mnt_count(mnt); + return mnt; +} +EXPORT_SYMBOL(mntget); + +void mntput_long(struct vfsmount *mnt) +{ +#ifdef CONFIG_SMP + if (mnt) { + /* avoid cacheline pingpong, hope gcc doesn't get "smart" */ + if (unlikely(mnt->mnt_expiry_mark)) + mnt->mnt_expiry_mark = 0; + __mntput(mnt, 1); + } +#else + mntput(mnt); +#endif +} +EXPORT_SYMBOL(mntput_long); + +struct vfsmount *mntget_long(struct vfsmount *mnt) +{ +#ifdef CONFIG_SMP + if (mnt) + atomic_inc(&mnt->mnt_longrefs); + return mnt; +#else + return mntget(mnt); +#endif +} +EXPORT_SYMBOL(mntget_long); void mnt_pin(struct vfsmount *mnt) { @@ -701,19 +848,17 @@ void mnt_pin(struct vfsmount *mnt) mnt->mnt_pinned++; br_write_unlock(vfsmount_lock); } - EXPORT_SYMBOL(mnt_pin); void mnt_unpin(struct vfsmount *mnt) { br_write_lock(vfsmount_lock); if (mnt->mnt_pinned) { - atomic_inc(&mnt->mnt_count); + inc_mnt_count(mnt); mnt->mnt_pinned--; } br_write_unlock(vfsmount_lock); } - EXPORT_SYMBOL(mnt_unpin); static inline void mangle(struct seq_file *m, const char *s) @@ -1008,12 +1153,13 @@ int may_umount_tree(struct vfsmount *mnt int minimum_refs = 0; struct vfsmount *p; - br_read_lock(vfsmount_lock); + /* write lock needed for count_mnt_count */ + br_write_lock(vfsmount_lock); for (p = mnt; p; p = next_mnt(p, mnt)) { - actual_refs += atomic_read(&p->mnt_count); + actual_refs += count_mnt_count(p); minimum_refs += 2; } - br_read_unlock(vfsmount_lock); + br_write_unlock(vfsmount_lock); if (actual_refs > minimum_refs) return 0; @@ -1040,10 +1186,10 @@ int may_umount(struct vfsmount *mnt) { int ret = 1; down_read(&namespace_sem); - br_read_lock(vfsmount_lock); + br_write_lock(vfsmount_lock); if (propagate_mount_busy(mnt, 2)) ret = 0; - br_read_unlock(vfsmount_lock); + br_write_unlock(vfsmount_lock); up_read(&namespace_sem); return ret; } @@ -1070,7 +1216,7 @@ void release_mounts(struct list_head *he dput(dentry); mntput(m); } - mntput(mnt); + mntput_long(mnt); } } @@ -1125,8 +1271,16 @@ static int do_umount(struct vfsmount *mn flags & (MNT_FORCE | MNT_DETACH)) return -EINVAL; - if (atomic_read(&mnt->mnt_count) != 2) + /* + * probably don't strictly need the lock here if we examined + * all race cases, but it's a slowpath. + */ + br_write_lock(vfsmount_lock); + if (count_mnt_count(mnt) != 2) { + br_write_lock(vfsmount_lock); return -EBUSY; + } + br_write_unlock(vfsmount_lock); if (!xchg(&mnt->mnt_expiry_mark, 1)) return -EAGAIN; @@ -1815,7 +1969,7 @@ int do_add_mount(struct vfsmount *newmnt unlock: up_write(&namespace_sem); - mntput(newmnt); + mntput_long(newmnt); return err; } @@ -2148,11 +2302,11 @@ static struct mnt_namespace *dup_mnt_ns( if (fs) { if (p == fs->root.mnt) { rootmnt = p; - fs->root.mnt = mntget(q); + fs->root.mnt = mntget_long(q); } if (p == fs->pwd.mnt) { pwdmnt = p; - fs->pwd.mnt = mntget(q); + fs->pwd.mnt = mntget_long(q); } } p = next_mnt(p, mnt_ns->root); @@ -2161,9 +2315,9 @@ static struct mnt_namespace *dup_mnt_ns( up_write(&namespace_sem); if (rootmnt) - mntput(rootmnt); + mntput_long(rootmnt); if (pwdmnt) - mntput(pwdmnt); + mntput_long(pwdmnt); return new_ns; } @@ -2350,6 +2504,7 @@ SYSCALL_DEFINE2(pivot_root, const char _ touch_mnt_namespace(current->nsproxy->mnt_ns); br_write_unlock(vfsmount_lock); chroot_fs_refs(&root, &new); + error = 0; path_put(&root_parent); path_put(&parent_path); @@ -2376,6 +2531,7 @@ static void __init init_mount_tree(void) mnt = do_kern_mount("rootfs", 0, "rootfs", NULL); if (IS_ERR(mnt)) panic("Can't create rootfs"); + ns = create_mnt_ns(mnt); if (IS_ERR(ns)) panic("Can't allocate initial namespace"); Index: linux-2.6/include/linux/mount.h =================================================================== --- linux-2.6.orig/include/linux/mount.h 2010-12-14 21:07:19.000000000 +1100 +++ linux-2.6/include/linux/mount.h 2010-12-14 21:07:29.000000000 +1100 @@ -13,6 +13,7 @@ #include <linux/list.h> #include <linux/nodemask.h> #include <linux/spinlock.h> +#include <linux/seqlock.h> #include <asm/atomic.h> struct super_block; @@ -46,12 +47,24 @@ struct mnt_namespace; #define MNT_INTERNAL 0x4000 +struct mnt_pcp { + int mnt_count; + int mnt_writers; +}; + struct vfsmount { struct list_head mnt_hash; struct vfsmount *mnt_parent; /* fs we are mounted on */ struct dentry *mnt_mountpoint; /* dentry of mountpoint */ struct dentry *mnt_root; /* root of the mounted tree */ struct super_block *mnt_sb; /* pointer to superblock */ +#ifdef CONFIG_SMP + struct mnt_pcp __percpu *mnt_pcp; + atomic_t mnt_longrefs; +#else + int mnt_count; + int mnt_writers; +#endif struct list_head mnt_mounts; /* list of children, anchored here */ struct list_head mnt_child; /* and going through their mnt_child */ int mnt_flags; @@ -70,57 +83,25 @@ struct vfsmount { struct mnt_namespace *mnt_ns; /* containing namespace */ int mnt_id; /* mount identifier */ int mnt_group_id; /* peer group identifier */ - /* - * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount - * to let these frequently modified fields in a separate cache line - * (so that reads of mnt_flags wont ping-pong on SMP machines) - */ - atomic_t mnt_count; int mnt_expiry_mark; /* true if marked for expiry */ int mnt_pinned; int mnt_ghosts; -#ifdef CONFIG_SMP - int __percpu *mnt_writers; -#else - int mnt_writers; -#endif }; -static inline int *get_mnt_writers_ptr(struct vfsmount *mnt) -{ -#ifdef CONFIG_SMP - return mnt->mnt_writers; -#else - return &mnt->mnt_writers; -#endif -} - -static inline struct vfsmount *mntget(struct vfsmount *mnt) -{ - if (mnt) - atomic_inc(&mnt->mnt_count); - return mnt; -} - struct file; /* forward dec */ extern int mnt_want_write(struct vfsmount *mnt); extern int mnt_want_write_file(struct file *file); extern int mnt_clone_write(struct vfsmount *mnt); extern void mnt_drop_write(struct vfsmount *mnt); -extern void mntput_no_expire(struct vfsmount *mnt); +extern void mntput(struct vfsmount *mnt); +extern struct vfsmount *mntget(struct vfsmount *mnt); +extern void mntput_long(struct vfsmount *mnt); +extern struct vfsmount *mntget_long(struct vfsmount *mnt); extern void mnt_pin(struct vfsmount *mnt); extern void mnt_unpin(struct vfsmount *mnt); extern int __mnt_is_readonly(struct vfsmount *mnt); -static inline void mntput(struct vfsmount *mnt) -{ - if (mnt) { - mnt->mnt_expiry_mark = 0; - mntput_no_expire(mnt); - } -} - extern struct vfsmount *do_kern_mount(const char *fstype, int flags, const char *name, void *data); Index: linux-2.6/fs/pnode.c =================================================================== --- linux-2.6.orig/fs/pnode.c 2010-12-14 21:07:19.000000000 +1100 +++ linux-2.6/fs/pnode.c 2010-12-14 21:07:29.000000000 +1100 @@ -288,7 +288,7 @@ int propagate_mnt(struct vfsmount *dest_ */ static inline int do_refcount_check(struct vfsmount *mnt, int count) { - int mycount = atomic_read(&mnt->mnt_count) - mnt->mnt_ghosts; + int mycount = count_mnt_count(mnt) - mnt->mnt_ghosts; return (mycount > count); } @@ -300,7 +300,7 @@ static inline int do_refcount_check(stru * Check if any of these mounts that **do not have submounts** * have more references than 'refcnt'. If so return busy. * - * vfsmount lock must be held for read or write + * vfsmount lock must be held for write */ int propagate_mount_busy(struct vfsmount *mnt, int refcnt) { Index: linux-2.6/fs/internal.h =================================================================== --- linux-2.6.orig/fs/internal.h 2010-12-14 21:07:19.000000000 +1100 +++ linux-2.6/fs/internal.h 2010-12-14 21:07:29.000000000 +1100 @@ -63,6 +63,7 @@ extern int copy_mount_string(const void extern void free_vfsmnt(struct vfsmount *); extern struct vfsmount *alloc_vfsmnt(const char *); +extern unsigned int count_mnt_count(struct vfsmount *mnt); extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int); extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *, struct vfsmount *); Index: linux-2.6/drivers/mtd/mtdchar.c =================================================================== --- linux-2.6.orig/drivers/mtd/mtdchar.c 2010-12-14 21:07:19.000000000 +1100 +++ linux-2.6/drivers/mtd/mtdchar.c 2010-12-14 21:07:29.000000000 +1100 @@ -1201,7 +1201,7 @@ static int __init init_mtdchar(void) static void __exit cleanup_mtdchar(void) { unregister_mtd_user(&mtdchar_notifier); - mntput(mtd_inode_mnt); + mntput_long(mtd_inode_mnt); unregister_filesystem(&mtd_inodefs_type); __unregister_chrdev(MTD_CHAR_MAJOR, 0, 1 << MINORBITS, "mtd"); } Index: linux-2.6/arch/ia64/kernel/perfmon.c =================================================================== --- linux-2.6.orig/arch/ia64/kernel/perfmon.c 2010-12-14 21:07:19.000000000 +1100 +++ linux-2.6/arch/ia64/kernel/perfmon.c 2010-12-14 21:07:29.000000000 +1100 @@ -1542,7 +1542,7 @@ pfm_exit_smpl_buffer(pfm_buffer_fmt_t *f * any operations on the root directory. However, we need a non-trivial * d_name - pfm: will go nicely and kill the special-casing in procfs. */ -static struct vfsmount *pfmfs_mnt; +static struct vfsmount *pfmfs_mnt __read_mostly; static int __init init_pfm_fs(void) Index: linux-2.6/fs/anon_inodes.c =================================================================== --- linux-2.6.orig/fs/anon_inodes.c 2010-12-14 21:07:19.000000000 +1100 +++ linux-2.6/fs/anon_inodes.c 2010-12-14 21:07:29.000000000 +1100 @@ -232,7 +232,7 @@ static int __init anon_inode_init(void) return 0; err_mntput: - mntput(anon_inode_mnt); + mntput_long(anon_inode_mnt); err_unregister_filesystem: unregister_filesystem(&anon_inode_fs_type); err_exit: Index: linux-2.6/fs/pipe.c =================================================================== --- linux-2.6.orig/fs/pipe.c 2010-12-14 21:07:19.000000000 +1100 +++ linux-2.6/fs/pipe.c 2010-12-14 21:07:29.000000000 +1100 @@ -1292,7 +1292,7 @@ static int __init init_pipe_fs(void) static void __exit exit_pipe_fs(void) { unregister_filesystem(&pipe_fs_type); - mntput(pipe_mnt); + mntput_long(pipe_mnt); } fs_initcall(init_pipe_fs); Index: linux-2.6/net/socket.c =================================================================== --- linux-2.6.orig/net/socket.c 2010-12-14 21:07:19.000000000 +1100 +++ linux-2.6/net/socket.c 2010-12-14 21:07:29.000000000 +1100 @@ -2375,6 +2375,8 @@ EXPORT_SYMBOL(sock_unregister); static int __init sock_init(void) { + int err; + /* * Initialize sock SLAB cache. */ @@ -2391,8 +2393,15 @@ static int __init sock_init(void) */ init_inodecache(); - register_filesystem(&sock_fs_type); + + err = register_filesystem(&sock_fs_type); + if (err) + goto out_fs; sock_mnt = kern_mount(&sock_fs_type); + if (IS_ERR(sock_mnt)) { + err = PTR_ERR(sock_mnt); + goto out_mount; + } /* The real protocol initialization is performed in later initcalls. */ @@ -2405,7 +2414,13 @@ static int __init sock_init(void) skb_timestamping_init(); #endif - return 0; +out: + return err; + +out_mount: + unregister_filesystem(&sock_fs_type); +out_fs: + goto out; } core_initcall(sock_init); /* early initcall */ Index: linux-2.6/fs/super.c =================================================================== --- linux-2.6.orig/fs/super.c 2010-12-14 21:07:19.000000000 +1100 +++ linux-2.6/fs/super.c 2010-12-14 21:07:29.000000000 +1100 @@ -1140,7 +1140,7 @@ static struct vfsmount *fs_set_subtype(s return mnt; err: - mntput(mnt); + mntput_long(mnt); return ERR_PTR(err); } Index: linux-2.6/fs/fs_struct.c =================================================================== --- linux-2.6.orig/fs/fs_struct.c 2010-12-14 22:01:15.000000000 +1100 +++ linux-2.6/fs/fs_struct.c 2010-12-14 22:56:27.000000000 +1100 @@ -17,11 +17,11 @@ void set_fs_root(struct fs_struct *fs, s write_seqcount_begin(&fs->seq); old_root = fs->root; fs->root = *path; - path_get(path); + path_get_long(path); write_seqcount_end(&fs->seq); spin_unlock(&fs->lock); if (old_root.dentry) - path_put(&old_root); + path_put_long(&old_root); } /* @@ -36,12 +36,12 @@ void set_fs_pwd(struct fs_struct *fs, st write_seqcount_begin(&fs->seq); old_pwd = fs->pwd; fs->pwd = *path; - path_get(path); + path_get_long(path); write_seqcount_end(&fs->seq); spin_unlock(&fs->lock); if (old_pwd.dentry) - path_put(&old_pwd); + path_put_long(&old_pwd); } void chroot_fs_refs(struct path *old_root, struct path *new_root) @@ -59,13 +59,13 @@ void chroot_fs_refs(struct path *old_roo write_seqcount_begin(&fs->seq); if (fs->root.dentry == old_root->dentry && fs->root.mnt == old_root->mnt) { - path_get(new_root); + path_get_long(new_root); fs->root = *new_root; count++; } if (fs->pwd.dentry == old_root->dentry && fs->pwd.mnt == old_root->mnt) { - path_get(new_root); + path_get_long(new_root); fs->pwd = *new_root; count++; } @@ -76,13 +76,13 @@ void chroot_fs_refs(struct path *old_roo } while_each_thread(g, p); read_unlock(&tasklist_lock); while (count--) - path_put(old_root); + path_put_long(old_root); } void free_fs_struct(struct fs_struct *fs) { - path_put(&fs->root); - path_put(&fs->pwd); + path_put_long(&fs->root); + path_put_long(&fs->pwd); kmem_cache_free(fs_cachep, fs); } @@ -115,7 +115,13 @@ struct fs_struct *copy_fs_struct(struct spin_lock_init(&fs->lock); seqcount_init(&fs->seq); fs->umask = old->umask; - get_fs_root_and_pwd(old, &fs->root, &fs->pwd); + + spin_lock(&old->lock); + fs->root = old->root; + path_get_long(&fs->root); + fs->pwd = old->pwd; + path_get_long(&fs->pwd); + spin_unlock(&old->lock); } return fs; } Index: linux-2.6/fs/namei.c =================================================================== --- linux-2.6.orig/fs/namei.c 2010-12-14 22:06:54.000000000 +1100 +++ linux-2.6/fs/namei.c 2010-12-14 22:07:51.000000000 +1100 @@ -448,6 +448,18 @@ void path_get(struct path *path) EXPORT_SYMBOL(path_get); /** + * path_get_long - get a long reference to a path + * @path: path to get the reference to + * + * Given a path increment the reference count to the dentry and the vfsmount. + */ +void path_get_long(struct path *path) +{ + mntget_long(path->mnt); + dget(path->dentry); +} + +/** * path_put - put a reference to a path * @path: path to put the reference to * @@ -461,6 +473,18 @@ void path_put(struct path *path) EXPORT_SYMBOL(path_put); /** + * path_put_long - put a long reference to a path + * @path: path to put the reference to + * + * Given a path decrement the reference count to the dentry and the vfsmount. + */ +void path_put_long(struct path *path) +{ + dput(path->dentry); + mntput_long(path->mnt); +} + +/** * nameidata_drop_rcu - drop this nameidata out of rcu-walk * @nd: nameidata pathwalk data to drop * @Returns: 0 on success, -ECHLID on failure Index: linux-2.6/include/linux/path.h =================================================================== --- linux-2.6.orig/include/linux/path.h 2010-12-14 22:06:38.000000000 +1100 +++ linux-2.6/include/linux/path.h 2010-12-14 22:08:13.000000000 +1100 @@ -10,7 +10,9 @@ struct path { }; extern void path_get(struct path *); +extern void path_get_long(struct path *); extern void path_put(struct path *); +extern void path_put_long(struct path *); static inline int path_equal(const struct path *path1, const struct path *path2) { ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) 2010-12-14 12:40 ` Nick Piggin @ 2010-12-15 8:16 ` Andreas Dilger 2010-12-15 10:24 ` Nick Piggin 0 siblings, 1 reply; 16+ messages in thread From: Andreas Dilger @ 2010-12-15 8:16 UTC (permalink / raw) To: Nick Piggin Cc: Eric Dumazet, Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel On 2010-12-14, at 05:40, Nick Piggin wrote: > +static inline void add_mnt_count(struct vfsmount *mnt, int n) > +static inline void set_mnt_count(struct vfsmount *mnt, int n) > +static inline void inc_mnt_count(struct vfsmount *mnt) > +static inline void dec_mnt_count(struct vfsmount *mnt) > +unsigned int count_mnt_count(struct vfsmount *mnt) Minor nit - it is easier to find these related functions via tag completion if they are of the form "mnt_count_{add,set,inc,dec}" and it would also be more consistent and easier to understand if you rename count_mnt_count() to mnt_count_sum() or mnt_count_read(). This also follows the kernel naming convention much more closely (e.g. atomic_{inc,dec,add,sub,set,read}(), mutex_{init,lock,unlock}(), etc), which is normally of the form {type}_{action}. Yes, I see the other {inc,dec,count}_mnt_writers() functions exist, but I'd prefer not to add more bad function names to the kernel. Maybe a separate patch to clean up those names is in order... Cheers, Andreas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) 2010-12-15 8:16 ` Andreas Dilger @ 2010-12-15 10:24 ` Nick Piggin 0 siblings, 0 replies; 16+ messages in thread From: Nick Piggin @ 2010-12-15 10:24 UTC (permalink / raw) To: Andreas Dilger Cc: Nick Piggin, Eric Dumazet, Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel On Wed, Dec 15, 2010 at 01:16:52AM -0700, Andreas Dilger wrote: > On 2010-12-14, at 05:40, Nick Piggin wrote: > > +static inline void add_mnt_count(struct vfsmount *mnt, int n) > > +static inline void set_mnt_count(struct vfsmount *mnt, int n) > > +static inline void inc_mnt_count(struct vfsmount *mnt) > > +static inline void dec_mnt_count(struct vfsmount *mnt) > > +unsigned int count_mnt_count(struct vfsmount *mnt) > > Minor nit - it is easier to find these related functions via tag > completion if they are of the form "mnt_count_{add,set,inc,dec}" > and it would also be more consistent and easier to understand if > you rename count_mnt_count() to mnt_count_sum() or mnt_count_read(). > > This also follows the kernel naming convention much more closely > (e.g. atomic_{inc,dec,add,sub,set,read}(), mutex_{init,lock,unlock}(), > etc), which is normally of the form {type}_{action}. OK, I agree. > Yes, I see the other {inc,dec,count}_mnt_writers() functions > exist, but I'd prefer not to add more bad function names to > the kernel. Maybe a separate patch to clean up those names is > in order... I might have added those too :P Yes they should be consistent, so I'll rename them first. Thanks, Nick ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: rcu-walk and dcache scaling tree update and status [not found] <20101213023733.GB6522@amd> 2010-12-13 2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin @ 2010-12-13 2:53 ` Ed Tomlinson 2010-12-13 2:59 ` Nick Piggin 2010-12-13 3:40 ` Stephen Rothwell 2 siblings, 1 reply; 16+ messages in thread From: Ed Tomlinson @ 2010-12-13 2:53 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel On Sunday 12 December 2010 21:37:33 Nick Piggin wrote: > The vfs-scale branch has had some progress, but it is now requiring > wider testing and detailed review, particularly of the fine details of > dcache_lock lifting, and rcu-walk synchronisation details and > documentation. > > Linus has suggested pretty strongly that he wants to pull this in the > next merge window (recently, that "inodes will be RCU freed in 2.6.38" > in an urelated discussion). As far as I know, that's what he's going to > do. I'd like to get this some time in linux-next to improve test > coverage (many filesystems I can't even test, so there are bound to be a > few silly crashes). Stephen, how do I arrange that? > > From my point of view, it has had nowhere near enough review, > particularly I want Al to be happy with it, filesystem changes looked at > and tested by respective fs maintainers, and anybody who is good at > concurrency. However, if Linus still wants to merge it to kick things > along, I am not going to stop him this time, because I have no known > bugs or pending changes required. > > I, like everybody else, would prefer bugs or design flaws to be found > *before* it goes upstream, of course. I would be happy to spend time on > irc with reviewers (ask me offline). And if anybody has reasonable > concerns or suggestions, I will be happy to take that into account. I > will not flame anybody who reads my replies, even if it takes a while > for one or both of us to understand. > > Documentation/filesystems/path-lookup.txt is a good place to start > reviewing the fun stuff. I would much appreciate review of documentation > and comments too, if anything is not clear, omitted, or not matching the > code. > > Also, please keep an eye on the end result when reviewing patches. > Particularly the locking patches before dcache_lock is lifted, these are > supposed to provide a lock coverage to lift dcache_lock with minimal > complexity. They are not supposed to be nice looking code that you'd > want to run on your production box, they are supposed to be nice > changesets (from a review and verification point of view). > > Git tree is here: > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > Branch is: > > vfs-scale-working > > Changes since last posting: > * Add a lot more comments for rcu-walk code and functions > * Fix reported d_compare vfat crash > * Incorporate review suggestions > * Make rcu-walk bail out if we have to call a security subsystem > * Fix for filesystems rewriting dentry name in-place > * Audit d_seq barrier write-side, add a few places where it was missing > * Optimised dentry memcmp > > Testing: > Testing filesystems is difficult, particularly remote filesystems, and > ones without mkfs packaged in debian. I'm running ltp and xfstests among > others, but those cover a tiny portion of what you can do with the > dcache. The more testing the merrier. > > I have been unable to break anything for a long time, but the race > windows can be tiny. I've been trying to insert random delays into > different parts of critical sections, and write tests specifically > targetting particular races, but that's slow going -- review of locking, > or testing on different configurations should be much more productive. > > Final note: > You won't be able to reproduce the parallel path walk scalability > numbers that I've posted, because the vfsmount refcounting scalability > patch is not included. I have a new idea for that now, so I'll be asking > for comments with that soon. I get this when building: security/security.c: In function 'security_inode_exec_permission': security/security.c:520: error: 'rcu' undeclared (first use in this function) security/security.c:520: error: (Each undeclared identifier is reported only once security/security.c:520: error: for each function it appears in.) make[1]: *** [security/security.o] Error 1 make: *** [security] Error 2 Missing include maybe? Ed ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: rcu-walk and dcache scaling tree update and status 2010-12-13 2:53 ` rcu-walk and dcache scaling tree update and status Ed Tomlinson @ 2010-12-13 2:59 ` Nick Piggin 2010-12-13 3:45 ` Stephen Rothwell 0 siblings, 1 reply; 16+ messages in thread From: Nick Piggin @ 2010-12-13 2:59 UTC (permalink / raw) To: Ed Tomlinson Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel On Sun, Dec 12, 2010 at 09:53:32PM -0500, Ed Tomlinson wrote: > On Sunday 12 December 2010 21:37:33 Nick Piggin wrote: > > Final note: > > You won't be able to reproduce the parallel path walk scalability > > numbers that I've posted, because the vfsmount refcounting scalability > > patch is not included. I have a new idea for that now, so I'll be asking > > for comments with that soon. > > I get this when building: > > security/security.c: In function 'security_inode_exec_permission': > security/security.c:520: error: 'rcu' undeclared (first use in this function) > security/security.c:520: error: (Each undeclared identifier is reported only once > security/security.c:520: error: for each function it appears in.) > make[1]: *** [security/security.o] Error 1 > make: *** [security] Error 2 > > Missing include maybe? Ah, missed permutation while doing a final build check. `rcu` should just be renamed to `flags`. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: rcu-walk and dcache scaling tree update and status 2010-12-13 2:59 ` Nick Piggin @ 2010-12-13 3:45 ` Stephen Rothwell 2010-12-13 3:50 ` Nick Piggin 0 siblings, 1 reply; 16+ messages in thread From: Stephen Rothwell @ 2010-12-13 3:45 UTC (permalink / raw) To: Nick Piggin Cc: Ed Tomlinson, Linus Torvalds, Andrew Morton, Al Viro, linux-fsdevel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 951 bytes --] Hi Nick, On Mon, 13 Dec 2010 13:59:10 +1100 Nick Piggin <npiggin@kernel.dk> wrote: > > On Sun, Dec 12, 2010 at 09:53:32PM -0500, Ed Tomlinson wrote: > > > > I get this when building: > > > > security/security.c: In function 'security_inode_exec_permission': > > security/security.c:520: error: 'rcu' undeclared (first use in this function) > > security/security.c:520: error: (Each undeclared identifier is reported only once > > security/security.c:520: error: for each function it appears in.) > > make[1]: *** [security/security.o] Error 1 > > make: *** [security] Error 2 > > > > Missing include maybe? > > Ah, missed permutation while doing a final build check. `rcu` should > just be renamed to `flags`. You will, of course never put anything not build tested into a tree for linux-next inclusion, right? ;-) -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/ [-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: rcu-walk and dcache scaling tree update and status 2010-12-13 3:45 ` Stephen Rothwell @ 2010-12-13 3:50 ` Nick Piggin 0 siblings, 0 replies; 16+ messages in thread From: Nick Piggin @ 2010-12-13 3:50 UTC (permalink / raw) To: Stephen Rothwell Cc: Nick Piggin, Ed Tomlinson, Linus Torvalds, Andrew Morton, Al Viro, linux-fsdevel, linux-kernel On Mon, Dec 13, 2010 at 02:45:55PM +1100, Stephen Rothwell wrote: > Hi Nick, > > On Mon, 13 Dec 2010 13:59:10 +1100 Nick Piggin <npiggin@kernel.dk> wrote: > > > > On Sun, Dec 12, 2010 at 09:53:32PM -0500, Ed Tomlinson wrote: > > > > > > I get this when building: > > > > > > security/security.c: In function 'security_inode_exec_permission': > > > security/security.c:520: error: 'rcu' undeclared (first use in this function) > > > security/security.c:520: error: (Each undeclared identifier is reported only once > > > security/security.c:520: error: for each function it appears in.) > > > make[1]: *** [security/security.o] Error 1 > > > make: *** [security] Error 2 > > > > > > Missing include maybe? > > > > Ah, missed permutation while doing a final build check. `rcu` should > > just be renamed to `flags`. > > You will, of course never put anything not build tested into a tree for > linux-next inclusion, right? ;-) Right. It will be guaranteed to have been build tested with at least one .config :) Which reminds me, I haven't done an allmodconfig recently to see if I've missed exports... ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: rcu-walk and dcache scaling tree update and status [not found] <20101213023733.GB6522@amd> 2010-12-13 2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin 2010-12-13 2:53 ` rcu-walk and dcache scaling tree update and status Ed Tomlinson @ 2010-12-13 3:40 ` Stephen Rothwell 2010-12-13 3:48 ` Nick Piggin 2 siblings, 1 reply; 16+ messages in thread From: Stephen Rothwell @ 2010-12-13 3:40 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Andrew Morton, Al Viro, linux-fsdevel, linux-kernel, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 1068 bytes --] Hi Nick, On Mon, 13 Dec 2010 13:37:33 +1100 Nick Piggin <npiggin@kernel.dk> wrote: > > Linus has suggested pretty strongly that he wants to pull this in the > next merge window (recently, that "inodes will be RCU freed in 2.6.38" > in an urelated discussion). As far as I know, that's what he's going to > do. I'd like to get this some time in linux-next to improve test > coverage (many filesystems I can't even test, so there are bound to be a > few silly crashes). Stephen, how do I arrange that? Well, you ask me :-) (letting me know where the git tree is). I this case I would like an email from Al and/or Linus (and/or maybe Christoph?) before I add this to linux-next as you say below that (despite being as well tested as you can do) it does require more review and buyin from others. Of course, if Linus is going to merge this during the next merge window, then it really should be in linux-next as soon as it is ready enough. -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/ [-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: rcu-walk and dcache scaling tree update and status 2010-12-13 3:40 ` Stephen Rothwell @ 2010-12-13 3:48 ` Nick Piggin 2010-12-14 0:03 ` Stephen Rothwell 0 siblings, 1 reply; 16+ messages in thread From: Nick Piggin @ 2010-12-13 3:48 UTC (permalink / raw) To: Stephen Rothwell Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Al Viro, linux-fsdevel, linux-kernel, Christoph Hellwig On Mon, Dec 13, 2010 at 02:40:49PM +1100, Stephen Rothwell wrote: > Hi Nick, > > On Mon, 13 Dec 2010 13:37:33 +1100 Nick Piggin <npiggin@kernel.dk> wrote: > > > > Linus has suggested pretty strongly that he wants to pull this in the > > next merge window (recently, that "inodes will be RCU freed in 2.6.38" > > in an urelated discussion). As far as I know, that's what he's going to > > do. I'd like to get this some time in linux-next to improve test > > coverage (many filesystems I can't even test, so there are bound to be a > > few silly crashes). Stephen, how do I arrange that? > > Well, you ask me :-) (letting me know where the git tree is). > > I this case I would like an email from Al and/or Linus (and/or maybe > Christoph?) before I add this to linux-next as you say below that > (despite being as well tested as you can do) it does require more review > and buyin from others. Of course, if Linus is going to merge this during > the next merge window, then it really should be in linux-next as soon as > it is ready enough. Yes I would like an ack from Al to have it in linux-next. Note that wouldn't have to be an ack on the patches or agreement to merge it next window, just an "ok to get wider review and testing" type of thing. If Linus's is intending *not* to pull it, then it's not so urgent, and I might instead ask Andrew to host it in -mm for a while. I just don't want to get caught with my pants down here, because there are a good number of untested filesystems and things. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: rcu-walk and dcache scaling tree update and status 2010-12-13 3:48 ` Nick Piggin @ 2010-12-14 0:03 ` Stephen Rothwell 2010-12-14 0:16 ` Stephen Rothwell 0 siblings, 1 reply; 16+ messages in thread From: Stephen Rothwell @ 2010-12-14 0:03 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Andrew Morton, Al Viro, linux-fsdevel, linux-kernel, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 2381 bytes --] Hi Nick, On Mon, 13 Dec 2010 14:48:54 +1100 Nick Piggin <npiggin@kernel.dk> wrote: > > Yes I would like an ack from Al to have it in linux-next. Note that > wouldn't have to be an ack on the patches or agreement to merge it next > window, just an "ok to get wider review and testing" type of thing. > > If Linus's is intending *not* to pull it, then it's not so urgent, and I > might instead ask Andrew to host it in -mm for a while. I just don't > want to get caught with my pants down here, because there are a good > number of untested filesystems and things. OK, since Linus seems so keen (and you fixed the reported compile error) I will include your tree today. If Al has objections, I can always remove it again. I have called the tree (for my purposes) "vfs_scale" and have just you listed as the contact. Thanks for adding your subsystem tree as a participant of linux-next. As you may know, this is not a judgment of your code. The purpose of linux-next is for integration testing and to lower the impact of conflicts between subsystems in the next merge window. You will need to ensure that the patches/commits in your tree/series have been: * submitted under GPL v2 (or later) and include the Contributor's Signed-off-by, * posted to the relevant mailing list, * reviewed by you (or another maintainer of your subsystem tree), * successfully unit tested, and * destined for the current or next Linux merge window. Basically, this should be just what you would send to Linus (or ask him to fetch). It is allowed to be rebased if you deem it necessary. -- Cheers, Stephen Rothwell sfr@canb.auug.org.au Legal Stuff: By participating in linux-next, your subsystem tree contributions are public and will be included in the linux-next trees. You may be sent e-mail messages indicating errors or other issues when the patches/commits from your subsystem tree are merged and tested in linux-next. These messages may also be cross-posted to the linux-next mailing list, the linux-kernel mailing list, etc. The linux-next tree project and IBM (my employer) make no warranties regarding the linux-next project, the testing procedures, the results, the e-mails, etc. If you don't agree to these ground rules, let me know and I'll remove your tree from participation in linux-next. [-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: rcu-walk and dcache scaling tree update and status 2010-12-14 0:03 ` Stephen Rothwell @ 2010-12-14 0:16 ` Stephen Rothwell 0 siblings, 0 replies; 16+ messages in thread From: Stephen Rothwell @ 2010-12-14 0:16 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Andrew Morton, Al Viro, linux-fsdevel, linux-kernel, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 359 bytes --] Hi Nick, On Tue, 14 Dec 2010 11:03:26 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote: > > I have called the tree (for my purposes) "vfs_scale" and have just you ^^^^^^^^^ Actually "vfs-scale". -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/ [-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2010-12-16 2:27 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <20101213023733.GB6522@amd> 2010-12-13 2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin 2010-12-13 3:31 ` Nick Piggin 2010-12-13 3:43 ` Nick Piggin 2010-12-13 7:25 ` Eric Dumazet 2010-12-13 8:33 ` Nick Piggin 2010-12-14 12:40 ` Nick Piggin 2010-12-15 8:16 ` Andreas Dilger 2010-12-15 10:24 ` Nick Piggin 2010-12-13 2:53 ` rcu-walk and dcache scaling tree update and status Ed Tomlinson 2010-12-13 2:59 ` Nick Piggin 2010-12-13 3:45 ` Stephen Rothwell 2010-12-13 3:50 ` Nick Piggin 2010-12-13 3:40 ` Stephen Rothwell 2010-12-13 3:48 ` Nick Piggin 2010-12-14 0:03 ` Stephen Rothwell 2010-12-14 0:16 ` Stephen Rothwell
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).